Luka Krsnik

Portfolio: Twitter Nonstandard Language Detection

BERT

Data Science

Large Language Models

Machine Learning

Natural Language Processing

Regression

SVM

Transformers

Project Description

In this project, our objective was to identify Slovenian tweets containing nonstandard language so that they could be manually annotated. We built a metric from a parallel corpus: by counting how often each word appears in its standard and nonstandard forms, we calculated a standardness measure for each tweet. Using Python libraries such as Scikit-learn, PyTorch, and transformers, we trained several models, including SVMs and BERT-based neural networks, to predict standardness. My responsibilities included metric development, model training and evaluation, and applying the models to unannotated tweets.
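As a rough illustration of the counting idea, the sketch below scores a tweet by averaging, over its known tokens, the share of times each token form appeared unchanged in the parallel corpus. The exact formula used in the project is not given here, so the averaging rule, the function names `build_form_counts` and `standardness`, and the toy word pairs are all assumptions made for this example.

```python
from collections import Counter


def build_form_counts(aligned_pairs):
    """Count how often each surface form is standard or nonstandard.

    aligned_pairs: iterable of (original_token, normalized_token) taken
    from a parallel corpus. A form counts as standard when it equals its
    normalized counterpart (an assumption of this sketch).
    """
    standard = Counter()
    nonstandard = Counter()
    for original, normalized in aligned_pairs:
        form = original.lower()
        if form == normalized.lower():
            standard[form] += 1
        else:
            nonstandard[form] += 1
    return standard, nonstandard


def standardness(tokens, standard, nonstandard):
    """Mean share of standard occurrences over tokens seen in the corpus.

    Returns None when no token of the tweet is covered by the corpus.
    """
    ratios = []
    for token in tokens:
        form = token.lower()
        total = standard[form] + nonstandard[form]
        if total:
            ratios.append(standard[form] / total)
    return sum(ratios) / len(ratios) if ratios else None


# Toy, hand-made pairs for illustration only (not project data).
pairs = [("jaz", "jaz"), ("jst", "jaz"), ("kaj", "kaj"), ("kva", "kaj"), ("sem", "sem")]
std_counts, nonstd_counts = build_form_counts(pairs)

score_standard = standardness(["jaz", "sem"], std_counts, nonstd_counts)
score_mixed = standardness(["jst", "sem"], std_counts, nonstd_counts)
score_unknown = standardness(["xyz"], std_counts, nonstd_counts)
```

With these toy pairs, a tweet made only of standard forms scores higher than one that mixes in a nonstandard form, and a tweet with no covered tokens gets no score at all.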

Key details

Standardness Metric Development: Created a metric based on counts of standard and nonstandard word forms in a parallel corpus to quantify the standardness of Slovenian tweets.
Model Training and Evaluation: Trained and evaluated SVMs and BERT-based neural networks to predict tweet standardness.
Manual Annotation Process: Identified tweets with varying standardness levels, which were manually annotated for detailed linguistic analysis.
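One way to pick tweets with varying standardness levels for annotation is to group predicted scores into equal-width bins and draw a few tweets from each. The sketch below assumes scores lie in the range 0 to 1; the function name `sample_by_standardness`, the bin count, and the per-bin quota are hypothetical choices for illustration, not the selection procedure actually used in the project.

```python
import random


def sample_by_standardness(scored_tweets, n_bins=4, per_bin=2, seed=0):
    """Pick up to `per_bin` tweets from each equal-width score bin.

    scored_tweets: iterable of (tweet_text, score) with score in [0, 1].
    """
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for tweet, score in scored_tweets:
        # A score of exactly 1.0 is placed in the last bin.
        index = min(int(score * n_bins), n_bins - 1)
        bins[index].append(tweet)

    selected = []
    for bucket in bins:
        rng.shuffle(bucket)
        selected.extend(bucket[:per_bin])
    return selected


# Placeholder tweets with evenly spread scores, for illustration only.
scored = [(f"tweet{i}", i / 10) for i in range(10)]
picked = sample_by_standardness(scored)
```

Sampling per bin keeps the annotation set from being dominated by whichever standardness level is most common among the unannotated tweets.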

Project Outcome

The project successfully identified tweets with varying standardness levels, which were then manually annotated.