Blog: Steam overview

2024/06/25

Overview




Why blogging?

I am very curious about certain technologies I haven't been able to explore in my professional path, so I decided to be proactive and learn them on my own. A blog seemed like the perfect platform to document my discoveries. It might help someone facing similar problems, and it forces me to draw a line for my projects. This is important to me, as I often find myself drawn down a rabbit hole. Having clear project goals is one of the few things that keeps me from exploring these side routes too deeply.

Project goals

Through this blog series, I want to learn as much as I can. With this project, I specifically want to focus on improving my machine learning and natural language processing knowledge. I'm particularly interested in building machine learning models that work together to achieve a more comprehensive outcome. Most importantly, I'm here to have fun! This might also influence some of my project choices.

Selecting a project

Since I'm particularly interested in honing my machine learning and natural language processing skills, finding available, comprehensive, and engaging datasets is crucial for this project.

How to get data?

I found that getting data has changed compared to a couple of years ago. On one hand, it is easier to get prepared datasets thanks to repositories like Hugging Face and Kaggle. On the other hand, there is a trend of large platforms (e.g. Reddit) restricting access to their data due to privacy concerns and other factors.

It all depends on what data you are trying to get. For my experiments, I wanted a dataset that:

  • would be raw and realistic
  • would allow for various NLP tasks
  • is not used too widely, so that experiments could be unique
  • has content I am interested in

Why Steam?

I wanted a large and rich dataset that reflected real-world language use. I found that the Steam website is structured in a way that allows me to obtain data and shape it into a format similar to their database. Steam also allows scraping of most of its data and even provides APIs for some limited data.
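To give a feel for the API route, here is a minimal sketch in Python. It assumes the data of interest is user reviews and that they are read from the public storefront reviews endpoint (`/appreviews/<app_id>?json=1`); the post does not say which endpoints are actually used, so treat the URL, the query parameters, and the helper names (`build_reviews_url`, `extract_reviews`, `fetch_review_page`) as illustrative assumptions rather than the method used in this project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public storefront reviews endpoint; not confirmed by the post.
REVIEWS_URL = "https://store.steampowered.com/appreviews/{app_id}"


def build_reviews_url(app_id, cursor="*", language="english", num_per_page=100):
    """Build the request URL for one page of reviews (hypothetical helper)."""
    params = {
        "json": 1,
        "filter": "recent",
        "language": language,
        "num_per_page": num_per_page,
        "cursor": cursor,  # "*" asks for the first page
    }
    return REVIEWS_URL.format(app_id=app_id) + "?" + urlencode(params)


def extract_reviews(payload):
    """Keep only a few fields useful for NLP from a decoded JSON response."""
    return [
        {"id": r["recommendationid"], "text": r["review"], "voted_up": r["voted_up"]}
        for r in payload.get("reviews", [])
    ]


def fetch_review_page(app_id, cursor="*"):
    """Download one page of reviews; returns (reviews, next_cursor)."""
    with urlopen(build_reviews_url(app_id, cursor), timeout=30) as response:
        payload = json.load(response)
    return extract_reviews(payload), payload.get("cursor")
```

Calling `fetch_review_page(app_id)` with a numeric store app id would return one page of review texts plus a cursor that can be passed back in to request the next page. Keeping the URL building and the parsing separate from the network call makes both easy to check without hitting the site.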

When I was younger, there were also periods of my life when I was fond of gaming, so the data from Steam checked all the boxes.