Blog: Steam Funometer
2024/10/01
Introduction
Two of the most fundamental approaches to machine learning are regression, used for predicting numerical values, and classification, which assigns data points to categories.
The database obtained in the previous blog post contains many features that can serve as prediction targets for classification or regression. This blog post aims to train and evaluate different models using this data.
First, we will focus on a binary classification problem: predicting whether a reviewing user recommended a product (game) or not. Second, we will use regression to predict how funny and helpful users perceived a review to be.
I will be using the Polars library for data manipulation and the Transformers library for training and evaluating models.
Data
We obtained a large SQLite database by scraping. It contains information about Steam reviews, products, and news. While using multiple features for our model would be beneficial, I'll focus on using only the review text to keep things simple. The complete preprocessing code is available in the Jupyter notebook.
Reading Database
We'll load relevant data into a Polars data frame.
import polars as pl

db_path = 'dbs/db.sqlite3'
connection_string = 'sqlite://' + db_path
df = pl.read_database_uri(
    '''SELECT product_id, text AS review_text, recommended, found_helpful, found_funny
    FROM review LEFT JOIN product ON product_id = product.id''',
    connection_string
)
Data Cleanup
Since the data wasn't cleaned during scraping, we need to do it now. The found_funny and found_helpful columns are stored as VARCHAR strings, which might not always be directly convertible to integers. We'll handle this by using the strict=False flag when casting to integers. This returns null for problematic values, which we can then fill with zeroes.
# convert to integers
df = df.with_columns(
    # cast features to minimal viable types
    pl.col("found_funny").cast(pl.UInt16, strict=False).fill_null(strategy="zero"),
    pl.col("found_helpful").cast(pl.UInt16, strict=False).fill_null(strategy="zero"),
    pl.col("recommended").cast(pl.Int8)
)
Data inspection revealed that many reviews don't contain any letters. We'll filter out these reviews because we want our models to focus on textual content.
import string

# filter out reviews that don't contain any letters
df = df.filter(pl.col('review_text').str.contains_any(list(string.ascii_lowercase) + list(string.ascii_uppercase)))
There are other reviews that could be ignored during training (e.g., those written in other languages or containing only repeated words). Addressing these issues would require additional effort, so I'll focus on other aspects of the model for now.
Regression Metric
We'll use regression to predict how funny and helpful users perceive a review to be. Luckily, Steam provides this information through user votes on each category. The simplest approach would be to directly predict the normalized amount of "funny" votes a review receives:
# calculate normalized value
pl.col("found_funny") / pl.col("found_funny").max()
This approach has a big limitation. Some reviews might have significantly more views than others. Consequently, reviews with fewer views would naturally have fewer votes, regardless of their actual humor. Unfortunately, we don't have information about the number of views for each review.
To address this, we could normalize the "funny" votes by product. This would give us values between 0 and 1, where 0 indicates no votes and 1 signifies the most upvoted review (in terms of humor) for that specific product. This approach assumes that all reviews within a product had an equal chance of receiving upvotes. While this might not perfectly hold true due to factors like Steam-highlighted comments or activity levels at the time of writing, it still provides some insight into how well a review resonates with users for a particular product.
(pl.col("found_funny") / pl.col("found_funny").max()).over("product_id")
However, this normalization method treats reviews in less popular products equally to reviews in highly popular ones. This is problematic because it's harder to write the "best" (or close to the best) review when there's more competition. As a result, values for less popular products might be artificially inflated.
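To make the inflation concrete, here is a toy example with made-up vote counts: the top review of a small product reaches a per-product score of 1.0 with just 2 votes, the same score as a 5000-vote review for a popular product.

# toy example with made-up vote counts (illustration only)
demo = pl.DataFrame({
    "product_id": [1, 1, 2, 2],
    "found_funny": [5000, 50, 2, 0],
})
demo = demo.with_columns(
    # normalized by the global maximum
    (pl.col("found_funny") / pl.col("found_funny").max()).alias("global_norm"),
    # normalized by each product's own maximum
    (pl.col("found_funny") / pl.col("found_funny").max()).over("product_id").alias("product_norm"),
)
print(demo)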
So far we've explored two metrics: one that evaluates reviews independently, ignoring the potential difference in views, and one that acknowledges the view disparity but skews scores for less-viewed products upwards. Ideally, we'd like a metric that balances these two approaches. Here, I propose a combined metric that leverages information from both overall votes and product-specific votes. We calculate it by averaging both normalized values.
df = df.with_columns(
    (
        (
            (pl.col("found_funny") / pl.col("found_funny").max()) +
            (pl.col("found_funny") / pl.col("found_funny").max()).over("product_id")
        ) / 2
    ).fill_nan(0.0).alias("found_funny")
)
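The helpful models later in this post use the analogous label; a minimal sketch of the same transformation applied to found_helpful, mirroring the expression above:

# same combined normalization, applied to the "helpful" votes
df = df.with_columns(
    (
        (
            (pl.col("found_helpful") / pl.col("found_helpful").max()) +
            (pl.col("found_helpful") / pl.col("found_helpful").max()).over("product_id")
        ) / 2
    ).fill_nan(0.0).alias("found_helpful")
)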
Data Split
Before training the model, we need to split the data into training, testing, and development sets. To be as impartial as possible, we'll first group all reviews by product, then assign each group to a specific set. This guarantees that reviews from the same product won't be split across different datasets.
Typically, around 80% of data is allocated to the train set, with the remaining 20% divided between development and test sets.
import numpy as np

# split into train, test, and dev sets by product
df_split = df.select("product_id").unique("product_id").sort("product_id")
df_split = df_split.with_columns(
    pl.lit(np.random.rand(df_split.height)).alias("split")
)
df_split = df_split.with_columns(
    pl.when(pl.col("split") < 0.8).then(pl.lit("train"))
    .otherwise(pl.when(pl.col("split") < 0.9).then(pl.lit("test"))
    .otherwise(pl.lit("dev"))).alias("split")
)
df_dict = df.join(df_split, on="product_id", how="left").partition_by("split", as_dict=True, include_key=False)
While scraping returned roughly 45 million reviews, using all this data would require significant computational resources for training the model. To address this, we'll use a smaller subset of 600,000 reviews for this experiment. This will allow for faster training while still providing a good representation of the data.
df_train = df_dict[("train",)].sample(500000, seed=manual_seed, shuffle=True)
df_dev = df_dict[("dev",)].sample(50000, seed=manual_seed, shuffle=True)
df_test = df_dict[("test",)].sample(50000, seed=manual_seed, shuffle=True)
Models
For this initial exploration, I decided to focus on fine-tuning models without leveraging Parameter-Efficient Fine-Tuning (PEFT). I'll explore PEFT in a separate blog post. Due to limited hardware resources, I chose to fine-tune two pre-trained models:
- DistilBERT (67M parameters): This is a smaller, more efficient version of the popular BERT model.
- RoBERTa-large (355M parameters): This is a larger and generally more capable model, but it also requires more computational resources.
Since these Large Language Models (LLMs) have different underlying architectures, it is easier to use a dedicated library than to fine-tune them directly in PyTorch code. Here, the Transformers library comes in handy. It provides a unified interface for various LLMs, simplifying fine-tuning and model manipulation. The complete training code for this experiment is available in the provided Jupyter notebook.
Classification
Data Preparation
Before diving into training, let's prepare our data for the classification task. First, we'll select only the relevant columns from the DataFrames: review_text (the input text) and recommended (the target label indicating whether a review recommends a product). We'll also rename these columns for clarity:
# recommended
df_train = df_train.select(['review_text', 'recommended']).rename({'review_text': 'text', 'recommended': 'label'})
df_dev = df_dev.select(['review_text', 'recommended']).rename({'review_text': 'text', 'recommended': 'label'})
df_test = df_test.select(['review_text', 'recommended']).rename({'review_text': 'text', 'recommended': 'label'})
Currently, our data is stored as a polars.DataFrame. However, the Transformers library doesn't support this format directly; it expects the datasets.Dataset format instead. Luckily, both formats use PyArrow under the hood, making the conversion straightforward:
from datasets import Dataset, DatasetDict

dataset = DatasetDict({
    'train': Dataset(df_train.to_arrow()),
    'dev': Dataset(df_dev.to_arrow()),
    'test': Dataset(df_test.to_arrow())
})
Tokenization
The next step involves tokenizing the text data. Tokenization breaks down sentences into smaller units (words or sub-words) that the model can understand. For this purpose, we'll use AutoTokenizer from the Transformers library.
from transformers import AutoTokenizer

model_name = 'distilbert/distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    text = examples['text']
    return tokenizer(text, truncation=True, return_tensors="np", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
AutoTokenizer automatically selects the appropriate tokenizer for the chosen model. The tokenize_function processes each batch of text data with the following settings:
- truncation=True: enables truncation of reviews that exceed the maximum token length.
- max_length=128: defines the maximum allowed token length for each review. Reviews exceeding this limit will be truncated.
- return_tensors="np": specifies that NumPy arrays are used for storing token ids. We could use PyTorch tensors as an alternative, but they would be required to have the same length, which would force us to use padding during tokenization. We can use a data collator for padding instead and save some memory.
Data collators play an important role in preparing input data for training. They split data into batches and can dynamically pad sequences to ensure consistent input sizes.
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
To monitor model progress during training, we define a metrics function. In this case, we'll track classification accuracy.
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy.compute(predictions=predictions, references=labels)['accuracy']}
Training
We load the pre-trained model and set the num_labels parameter to 2 for binary classification.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
The TrainingArguments class allows us to customize the training process. I used only one epoch because our dataset is quite big and training is time-consuming. Here's the complete configuration that worked well:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='models/steam-classification-distilbert500k-recommend',
    learning_rate=5e-5,
    weight_decay=0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.1,  # evaluate every 10% of training steps
    save_strategy="steps",
    save_steps=0.1,  # save a checkpoint every 10% of training steps
    load_best_model_at_end=True,
)
Finally, we initialize a Trainer instance and start the training process.
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['dev'],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
Regression
For regression tasks, we can adapt the previous classification code:
Data Preparation
# found funny
df_train = df_train.select(['review_text', 'found_funny']).rename({'review_text': 'text', 'found_funny': 'label'})
df_dev = df_dev.select(['review_text', 'found_funny']).rename({'review_text': 'text', 'found_funny': 'label'})
df_test = df_test.select(['review_text', 'found_funny']).rename({'review_text': 'text', 'found_funny': 'label'})
Training
Since we're predicting a single value (the humor rating), we set num_labels to 1 when loading the pre-trained model. With num_labels=1, the sequence classification head is treated as a regression head and the model is trained with a mean squared error loss:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
We use different metrics for regression tasks. Here, we calculate mean squared error (MSE), mean absolute error (MAE), and R-squared:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    labels = labels.reshape(-1, 1)
    mse = mean_squared_error(labels, predictions)
    mae = mean_absolute_error(labels, predictions)
    r2 = r2_score(labels, predictions)
    return {"mse": mse, "mae": mae, "r2": r2}
DistilBERT and RoBERTa
While the previous sections demonstrated DistilBERT training, adapting the code to RoBERTa requires only minor adjustments (a configuration sketch follows the list):
- Model Selection: Replace model_name with FacebookAI/roberta-large to load the RoBERTa model.
- Hardware Considerations: Due to RoBERTa's larger size and my hardware limitations, batch_size is reduced to 16.
- Hyperparameter Tuning: The models didn't converge with the default values, so the learning_rate is decreased to 5e-6.
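Putting these adjustments together, the RoBERTa setup differs from the DistilBERT one only in a few values. A sketch under those assumptions (the output_dir name is illustrative, not taken from the notebook):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments

model_name = 'FacebookAI/roberta-large'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)  # num_labels=1 for regression

training_args = TrainingArguments(
    output_dir='models/steam-classification-roberta500k-recommend',  # illustrative name
    learning_rate=5e-6,              # lowered, otherwise the models didn't converge
    weight_decay=0,
    per_device_train_batch_size=16,  # reduced due to hardware limitations
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    eval_strategy="steps",
    eval_steps=0.1,
    save_strategy="steps",
    save_steps=0.1,
    load_best_model_at_end=True,
)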
Evaluations
To evaluate our models, we use the Trainer.predict method, which returns detailed predictions and allows for metric calculation. We evaluate on the test dataset. The code is available in the Jupyter notebook.
trainer.predict(test_dataset=tokenized_dataset['test'])
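Trainer.predict returns a PredictionOutput object whose metrics field already contains the values produced by compute_metrics (prefixed with test_), so the test numbers can be read off directly:

results = trainer.predict(test_dataset=tokenized_dataset['test'])
print(results.metrics)  # e.g. {'test_accuracy': ...} for classification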
Baseline models were added to the evaluations for comparison. In classification, the baseline assigns every review to the majority class. For regression, it predicts zero for every review.
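The baseline numbers can be computed directly from the labels. A minimal sketch, assuming the tokenized test split from above and run separately against the classification and regression datasets:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

labels = np.array(tokenized_dataset['test']['label'])

# classification baseline: predict the majority class for every review
majority_class = np.bincount(labels.astype(int)).argmax()
print("baseline accuracy:", (labels == majority_class).mean())

# regression baseline: predict zero for every review
zeros = np.zeros_like(labels, dtype=float)
print("baseline MSE:", mean_squared_error(labels, zeros))
print("baseline MAE:", mean_absolute_error(labels, zeros))
print("baseline R2:", r2_score(labels, zeros))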
Recommendation Models
| Model | Accuracy |
|---|---|
| Baseline | 0.8759 |
| DistilBERT | 0.9513 |
| RoBERTa | 0.9598 |
Analysis
As expected, the larger RoBERTa model achieved the highest accuracy.
Funny Models
| Model | MSE | MAE | R2 |
|---|---|---|---|
| Baseline | 0.002098 | 0.006068 | -0.01786 |
| DistilBERT | 0.002015 | 0.010775 | 0.02252 |
| RoBERTa | 0.002008 | 0.008409 | 0.02582 |
Analysis
The regression models underperformed, with only slight improvements over the baseline in MSE and R2. MAE was worse than the baseline.
Possible reasons for the underperformance:
- Data skewness - About 90% of reviews have no votes, and half of those that do received just one vote. Only 0.2% received a score over 0.5, so the data is heavily right-skewed. MAE might be worse for our models because it is less sensitive to outliers than MSE, which was used as the loss function: the models output small non-zero values for the many zero-label reviews, while the all-zero baseline matches those labels exactly. (A quick check of these figures is sketched after this list.)
- Data quality - There are reviews with identical text but different numbers of "funny" votes. Even though exact duplicates are quite rare, many reviews are semantically very similar yet evaluated inconsistently.
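A quick sanity check of the skewness figures, assuming the regression DataFrame from the Data Preparation step, where label holds the normalized found_funny score:

# share of reviews with no votes and with a score above 0.5 (sketch)
df_train.select(
    (pl.col("label") == 0).mean().alias("share_no_votes"),
    (pl.col("label") > 0.5).mean().alias("share_above_0.5"),
)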
Helpful Models
| Model | MSE | MAE | R2 |
|---|---|---|---|
| Baseline | 0.002532 | 0.009021 | -0.03320 |
| DistilBERT | 0.002339 | 0.013148 | 0.04566 |
| RoBERTa | 0.002320 | 0.013807 | 0.05324 |
Analysis
The helpful models performed slightly better than the funny models but still faced the same data quality issues. Data skewness is also still present, though to a lesser extent, as "only" about 75% of reviews have no votes.
Conclusion
This blog post has outlined the process of building and evaluating classification and regression models using transformer-based architectures. While the classification models demonstrated promising results, the regression models encountered challenges. Several avenues could be explored to enhance performance, but I decided to draw the line here and perhaps revisit them in another blog post.
Improvements
Here are some ideas that could improve the models:
- Feature Engineering - Incorporate additional features like product description text, tags, price, title, etc., to provide more context for the models.
- Data Cleanup - Data could be cleaned further to address issues like repetitive text, ASCII-art "images", and reviews in other languages.
- Metric Exploration - Experimentation with alternative regression metrics.
- Data Selection - Instead of random data selection, we could consider techniques that can mitigate data skewness and ensure a more balanced dataset.
- Model Selection - Experimentation with other models.