Blog: Steam Funometer
2024/10/01
Introduction
Two of the most fundamental approaches to machine learning are regression, used for predicting numerical values, and classification, which assigns data points to categories.
The database obtained in the previous blog post contains many features that can serve as prediction targets for classification or regression. This blog post aims to train and evaluate different models using this data.
First, we will focus on a binary classification problem: predicting whether a reviewing user recommended a product (game) or not. Second, we will use regression to predict how funny and helpful users perceived a review to be.
I will be using the Polars library for data manipulation and the Transformers library for training and evaluating models.
Data
We obtained a large SQLite database by scraping. It contains information about Steam reviews, products, and news. While using multiple features for our model would be beneficial, I'll focus on using only the review text to keep things simple. The complete preprocessing code is available in the Jupyter notebook.
Reading Database
We'll load relevant data into a Polars data frame.
import polars as pl

db_path = 'dbs/db.sqlite3'
connection_string = 'sqlite://' + db_path
df = pl.read_database_uri(
    '''SELECT product_id, text AS review_text, recommended, found_helpful, found_funny
    FROM review LEFT JOIN product ON product_id = product.id''',
    connection_string
)
Data Cleanup
Since the data wasn't cleaned during scraping, we need to do it now. The found_funny and found_helpful columns are stored as VARCHAR strings, which might not always be directly convertible to integers. We'll handle this by using the strict=False flag when casting to integers. This returns null for problematic values, which we can then fill with zeroes.
# convert to integers
df = df.with_columns(
    # cast features to minimal viable types
    pl.col("found_funny").cast(pl.UInt16, strict=False).fill_null(strategy="zero"),
    pl.col("found_helpful").cast(pl.UInt16, strict=False).fill_null(strategy="zero"),
    pl.col("recommended").cast(pl.Int8)
)
Data inspection revealed that many reviews don't contain any letters. We'll filter out these reviews because we want our models to focus on textual content.
import string

# filter out reviews that don't contain any letters
df = df.filter(pl.col('review_text').str.contains_any(list(string.ascii_lowercase) + list(string.ascii_uppercase)))
There are other reviews that could be ignored during training (e.g., those written in other languages or containing only repeated words). Addressing these issues would require additional effort, so I'll focus on other aspects of the model for now.
Regression Metric
We'll use regression to predict how funny and helpful users perceive a review to be. Luckily, Steam provides this information through user votes on each category. The simplest approach would be to directly predict the normalized amount of "funny" votes a review receives:
# calculate normalized value
pl.col("found_funny") / pl.col("found_funny").max()
This approach has a big limitation. Some reviews might have significantly more views than others. Consequently, reviews with fewer views would naturally have fewer votes, regardless of their actual humor. Unfortunately, we don't have information about the number of views for each review.
To address this, we could normalize the "funny" votes by product. This would give us values between 0 and 1, where 0 indicates no votes and 1 signifies the most upvoted review (in terms of humor) for that specific product. This approach assumes that all reviews within a product had an equal chance of receiving upvotes. While this might not perfectly hold true due to factors like Steam-highlighted comments or activity levels at the time of writing, it still provides some insight into how well a review resonates with users for a particular product.
(pl.col("found_funny") / pl.col("found_funny").max()).over("product_id")
However, this normalization method treats reviews in less popular products equally to reviews in highly popular ones. This is problematic because it's harder to write the "best" (or close to the best) review when there's more competition. As a result, values for less popular products might be artificially inflated.
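To make the inflation concrete, here is a toy example with made-up vote counts: the top review of a small product reaches a per-product score of 1.0 with just 2 votes, the same score as a 5000-vote review for a popular product.

# toy example with made-up vote counts (illustration only)
demo = pl.DataFrame({
    "product_id": [1, 1, 2, 2],
    "found_funny": [5000, 50, 2, 0],
})
demo = demo.with_columns(
    # normalized by the global maximum
    (pl.col("found_funny") / pl.col("found_funny").max()).alias("global_norm"),
    # normalized by each product's own maximum
    (pl.col("found_funny") / pl.col("found_funny").max()).over("product_id").alias("product_norm"),
)
print(demo)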
So far we've explored two metrics: one that evaluates reviews independently, ignoring the potential difference in views, and one that acknowledges the view disparity but skews scores for less-viewed products upwards. Ideally, we'd like a metric that balances these two approaches. Here, I propose a combined metric that leverages information from both overall votes and product-specific votes. We calculate it by averaging both normalized values.
df = df.with_columns(
    (
        (
            (pl.col("found_funny") / pl.col("found_funny").max()) +
            (pl.col("found_funny") / pl.col("found_funny").max()).over("product_id")
        ) / 2
    ).fill_nan(0.0).alias("found_funny")
)
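The helpful models later in this post use the analogous label; a minimal sketch of the same transformation applied to found_helpful, mirroring the expression above:

# same combined normalization, applied to the "helpful" votes
df = df.with_columns(
    (
        (
            (pl.col("found_helpful") / pl.col("found_helpful").max()) +
            (pl.col("found_helpful") / pl.col("found_helpful").max()).over("product_id")
        ) / 2
    ).fill_nan(0.0).alias("found_helpful")
)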
Data Split
Before training the model, we need to split the data into training, testing, and development sets. To be as impartial as possible, we'll first group all reviews by product, then assign each group to a specific set. This guarantees that reviews from the same product won't be split across different datasets.
Typically, around 80% of data is allocated to the train set, with the remaining 20% divided between development and test sets.
import numpy as np

# split into train, test, and dev sets by product
df_split = df.select("product_id").unique("product_id").sort("product_id")
df_split = df_split.with_columns(
    pl.lit(np.random.rand(df_split.height)).alias("split")
)
df_split = df_split.with_columns(
    pl.when(pl.col("split") < 0.8).then(pl.lit("train"))
    .otherwise(pl.when(pl.col("split") < 0.9).then(pl.lit("test"))
    .otherwise(pl.lit("dev"))).alias("split")
)
df_dict = df.join(df_split, on="product_id", how="left").partition_by("split", as_dict=True, include_key=False)
While scraping returned roughly 45 million reviews, using all this data would require significant computational resources for training the model. To address this, we'll use a smaller subset of 600,000 reviews for this experiment. This will allow for faster training while still providing a good representation of the data.
df_train = df_dict[("train",)].sample(500000, seed=manual_seed, shuffle=True)
df_dev = df_dict[("dev",)].sample(50000, seed=manual_seed, shuffle=True)
df_test = df_dict[("test",)].sample(50000, seed=manual_seed, shuffle=True)
Models
For this initial exploration, I decided to focus on fine-tuning models without leveraging Parameter-Efficient Fine-Tuning (PEFT). I'll explore PEFT in a separate blog post. Due to limited hardware resources, I chose to fine-tune two pre-trained models:
- DistilBERT (67M parameters): This is a smaller, more efficient version of the popular BERT model.
- RoBERTa-large (355M parameters): This is a larger and generally more capable model, but it also requires more computational resources.
Since these Large Language Models (LLMs) have different underlying architectures, it is easier to use a dedicated library than to fine-tune them directly in PyTorch code. Here, the Transformers library comes in handy. It provides a unified interface for various LLMs, simplifying fine-tuning and model manipulation. The complete training code for this experiment is available in the provided Jupyter notebook.
Classification
Data Preparation
Before diving into training, let's prepare our data for the classification task. First, we'll select only the relevant columns from the DataFrames: review_text (the input text) and recommended (the target label indicating whether a review recommends a product). We'll also rename these columns for clarity:
# recommended
df_train = df_train.select(['review_text', 'recommended']).rename({'review_text': 'text', 'recommended': 'label'})
df_dev = df_dev.select(['review_text', 'recommended']).rename({'review_text': 'text', 'recommended': 'label'})
df_test = df_test.select(['review_text', 'recommended']).rename({'review_text': 'text', 'recommended': 'label'})
Currently, our data is stored as a polars.DataFrame. However, the Transformers library doesn't support this format directly; it expects the datasets.Dataset format instead. Luckily, both formats use PyArrow under the hood, making the conversion straightforward:
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    'train': Dataset(df_train.to_arrow()),
    'dev': Dataset(df_dev.to_arrow()),
    'test': Dataset(df_test.to_arrow())
})
Tokenization
The next step involves tokenizing the text data. Tokenization breaks down sentences into smaller units (words or sub-words) that the model can understand. For this purpose, we'll use AutoTokenizer from the Transformers library.
from transformers import AutoTokenizer

model_name = 'distilbert/distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    text = examples['text']
    return tokenizer(text, truncation=True, return_tensors="np", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
AutoTokenizer automatically selects the appropriate tokenizer for the chosen model. The tokenize_function processes each batch of text data with the following settings:
- truncation=True: enables truncation of reviews that exceed the maximum token length.
- max_length=128: defines the maximum allowed token length for each review. Reviews exceeding this limit will be truncated.
- return_tensors="np": specifies that NumPy arrays are used for storing token ids. We could use PyTorch tensors as an alternative, but they would be required to have the same length, which would force us to use padding during tokenization. We can use a data collator for padding instead and save some memory.
Data collators play an important role in preparing input data for training. They split data into batches and can dynamically pad sequences to ensure consistent input sizes.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
To monitor model progress during training, we define a metrics function. In this case, we'll track classification accuracy.
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)['accuracy']}
Training
We load the pre-trained model and set the num_labels parameter to 2 for binary classification.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
The TrainingArguments class allows us to customize the training process. I used only one epoch because our dataset is quite big and training is time-consuming. Here's the complete configuration that worked well:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='models/steam-classification-distilbert500k-recommend',
    learning_rate=5e-5,
    weight_decay=0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.1,  # evaluate every 10% of training steps
    save_strategy="steps",
    save_steps=0.1,  # save a checkpoint every 10% of training steps
    load_best_model_at_end=True,
)
Finally, we initialize a Trainer instance and start the training process.
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
Regression
For regression tasks, we can adapt the previous classification code:
Data Preparation
# found funny
df_train = df_train.select(['review_text', 'found_funny']).rename({'review_text': 'text', 'found_funny': 'label'})
df_dev = df_dev.select(['review_text', 'found_funny']).rename({'review_text': 'text', 'found_funny': 'label'})
df_test = df_test.select(['review_text', 'found_funny']).rename({'review_text': 'text', 'found_funny': 'label'})
Training
Since we're predicting a single value (the humor rating), we set num_labels to 1 when loading the pre-trained model. With num_labels=1, the sequence classification head is treated as a regression head and the model is trained with a mean squared error loss:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
We use different metrics for regression tasks. Here, we calculate mean squared error (MSE), mean absolute error (MAE), and R-squared:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    labels = labels.reshape(-1, 1)
    mse = mean_squared_error(labels, predictions)
    mae = mean_absolute_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    return {"mse": mse, "mae": mae, "r2": r2}
DistilBERT and RoBERTa
While the previous sections demonstrated DistilBERT training, adapting the code to RoBERTa requires only minor adjustments (a configuration sketch follows the list):
- Model Selection: Replace model_name with FacebookAI/roberta-large to load the RoBERTa model.
- Hardware Considerations: Due to RoBERTa's larger size and my hardware limitations, batch_size is reduced to 16.
- Hyperparameter Tuning: The models didn't converge with the default values, so the learning_rate is decreased to 5e-6.
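Putting these adjustments together, the RoBERTa setup differs from the DistilBERT one only in a few values. A sketch under those assumptions (the output_dir name is illustrative, not taken from the notebook):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments

model_name = 'FacebookAI/roberta-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # num_labels=1 for regression

training_args = TrainingArguments(
    output_dir='models/steam-classification-roberta500k-recommend',  # illustrative name
    learning_rate=5e-6,              # lowered, otherwise the models didn't converge
    weight_decay=0,
    per_device_train_batch_size=16,  # reduced due to hardware limitations
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.1,
    save_strategy="steps",
    save_steps=0.1,
    load_best_model_at_end=True,
)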
Evaluations
To evaluate our models, we use the Trainer.predict method, which returns detailed predictions and allows for metric calculation. We evaluate on the test dataset. The code is available in the Jupyter notebook.
trainer.predict(test_dataset=tokenized_dataset['test'])
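Trainer.predict returns a PredictionOutput object whose metrics field already contains the values produced by compute_metrics (prefixed with test_), so the test numbers can be read off directly:

results = trainer.predict(test_dataset=tokenized_dataset['test'])
print(results.metrics)  # e.g. {'test_accuracy': ...} for classification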
Baseline models were added to the evaluations for comparison. In classification, the baseline assigns every review to the majority class. For regression, it predicts zero for every review.
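The baseline numbers can be computed directly from the labels. A minimal sketch, assuming the tokenized test split from above and run separately against the classification and regression datasets:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

labels = np.array(tokenized_dataset['test']['label'])

# classification baseline: predict the majority class for every review
majority_class = np.bincount(labels.astype(int)).argmax()
print("baseline accuracy:", (labels == majority_class).mean())

# regression baseline: predict zero for every review
zeros = np.zeros_like(labels, dtype=float)
print("baseline MSE:", mean_squared_error(labels, zeros))
print("baseline MAE:", mean_absolute_error(labels, zeros))
print("baseline R2:", r2_score(labels, zeros))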
Recommendation Models
| Model | Accuracy |
|---|---|
| Baseline | 0.8759 |
| DistilBERT | 0.9513 |
| RoBERTa | 0.9598 |
Analysis
As expected, the larger RoBERTa model achieved the highest accuracy.
Funny Models
| Model | MSE | MAE | R2 |
|---|---|---|---|
| Baseline | 0.002098 | 0.006068 | -0.01786 |
| DistilBERT | 0.002015 | 0.010775 | 0.02252 |
| RoBERTa | 0.002008 | 0.008409 | 0.02582 |
Analysis
The regression models underperformed, with only slight improvements over the baseline in MSE and R2. MAE was worse than the baseline.
Possible reasons for the underperformance:
- Data skewness - About 90% of reviews have no votes, and half of those that do received just one vote. Only 0.2% received a score over 0.5, so the data is heavily right-skewed. MAE might be worse for our models because it is less sensitive to outliers than MSE, which was used as the loss function: the models output small non-zero values for the many zero-label reviews, while the all-zero baseline matches those labels exactly. (A quick check of these figures is sketched after this list.)
- Data quality - There are reviews with identical text but different numbers of "funny" votes. Even though exact duplicates are quite rare, many reviews are semantically very similar yet evaluated inconsistently.
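A quick sanity check of the skewness figures, assuming the regression DataFrame from the Data Preparation step, where label holds the normalized found_funny score:

# share of reviews with no votes and with a score above 0.5 (sketch)
df_train.select(
    (pl.col("label") == 0).mean().alias("share_no_votes"),
    (pl.col("label") > 0.5).mean().alias("share_above_0.5"),
)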
Helpful Models
| Model | MSE | MAE | R2 |
|---|---|---|---|
| Baseline | 0.002532 | 0.009021 | -0.03320 |
| DistilBERT | 0.002339 | 0.013148 | 0.04566 |
| RoBERTa | 0.002320 | 0.013807 | 0.05324 |
Analysis
The helpful models performed slightly better than the funny models but still faced the same data quality issues. Data skewness is also still present, though to a lesser extent, as "only" about 75% of reviews have no votes.
Conclusion
This blog post has outlined the process of building and evaluating classification and regression models using transformer-based architectures. While the classification models demonstrated promising results, the regression models encountered challenges. Several avenues could be explored to enhance performance, but I decided to draw the line here and perhaps revisit them in another blog post.
Improvements
Here are some ideas that could improve the models:
- Feature Engineering - Incorporate additional features like product description text, tags, price, title, etc., to provide more context for the models.
- Data Cleanup - Data could be cleaned further to address issues like repetitive text, ASCII-art "images", and reviews in other languages.
- Metric Exploration - Experimentation with alternative regression metrics.
- Data Selection - Instead of random data selection, we could consider techniques that can mitigate data skewness and ensure a more balanced dataset.
- Model Selection - Experimentation with other models.