Portfolio: Twitter Nonstandard Language Detection
BERT
Data Science
Large Language Models
Machine Learning
Natural Language Processing
Regression
SVM
Transformers
Project Description
In this project, our objective was to identify Slovenian tweets containing nonstandard language for manual annotations. We created a metric using a parallel corpus, counting standard and nonstandard forms of words to calculate a standardness measure for each tweet. Utilizing Python libraries like Scikit-learn, PyTorch, and transformers, we trained various models, including SVMs and BERT-based neural networks, to predict standardness. My responsibilities included metric development, model training and evaluation, and applying the models to unannotated tweets.
Key details
- Standardness Metric Development: Created a unique metric using a parallel corpus to quantify standardness in Slovenian tweets.
- Model Training and Evaluation: Utilized SVMs and BERT-based neural networks for machine learning.
- Manual Annotation Process: Identified tweets with varying standardness levels, which were manually annotated for detailed linguistic analysis.
Project Outcome
The project successfully identified tweets with varying standardness levels. These tweets were subsequently manually annotated.