Portfolio: South Slavic Language Processing with Stanza Fork

Data Science
Dependency Parsing
LSTMs
Lemmatization
Machine Learning
Named Entity Recognition
Natural Language Processing
Part-of-Speech Tagging
Semantic Role Labeling
Text Analysis
Tool Development


Project Description

In this project, we customized a fork of Stanza, an NLP library comparable to spaCy, to improve natural language processing for South Slavic languages, including Slovenian, Croatian, Serbian, Macedonian, and Bulgarian. The work covered the full pipeline: tokenization, sentence splitting, part-of-speech tagging, lemmatization, dependency parsing, named entity recognition, and semantic role labeling.

Role

As the main developer, I carried out major refactoring, trained models, added rule-based approaches alongside the trained models, integrated semantic role labeling models, and introduced tests to validate the pipeline's performance.
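
One common way to combine rules with trained models is to consult a hand-written lookup first and fall back to the model only when no rule applies. The sketch below illustrates that idea for lemmatization. It is not the project's code: `LEMMA_OVERRIDES`, `fallback_lemmatizer`, and `lemmatize` are hypothetical names, the two table entries are illustrative, and the fallback is a stand-in that only lowercases the form in place of a trained lemmatizer.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical override table: (lowercased form, UPOS tag) -> lemma.
# The entries are illustrative only (forms of the auxiliary "biti", "to be").
LEMMA_OVERRIDES: Dict[Tuple[str, str], str] = {
    ("je", "AUX"): "biti",
    ("so", "AUX"): "biti",
}


def fallback_lemmatizer(form: str, upos: str) -> str:
    """Stand-in for a trained model; it only lowercases the form."""
    return form.lower()


def lemmatize(
    tokens: List[Tuple[str, str]],
    model: Callable[[str, str], str] = fallback_lemmatizer,
) -> List[str]:
    """Return one lemma per (form, upos) pair, preferring rule overrides."""
    lemmas = []
    for form, upos in tokens:
        key = (form.lower(), upos)
        if key in LEMMA_OVERRIDES:
            # A rule matched: use the hand-written lemma.
            lemmas.append(LEMMA_OVERRIDES[key])
        else:
            # No rule matched: defer to the (stand-in) model.
            lemmas.append(model(form, upos))
    return lemmas


print(lemmatize([("Je", "AUX"), ("Ljubljana", "PROPN")]))
```

Keying the table on both the form and the part-of-speech tag keeps a rule from firing on a homograph with a different tag, and a function of this shape is easy to cover with small unit tests.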

Project Outcome

The project improved both the accuracy and the feature set of this popular open-source library for South Slavic languages.