Portfolio: South Slavic Language Processing with Stanza Fork
Data Science
Dependency Parsing
LSTMs
Lemmatization
Machine Learning
Named Entity Recognition
Natural Language Processing
Part-of-Speech Tagging
Semantic Role Labeling
Text Analysis
Tool Development
Project Description
In this project, we forked and customized Stanza, an NLP library similar to spaCy, to improve natural language processing for South Slavic languages such as Slovenian, Croatian, Serbian, Macedonian, and Bulgarian. The work covered the full pipeline: tokenization, sentence splitting, part-of-speech tagging, lemmatization, dependency parsing, named entity recognition, and semantic role labeling.
Role
As the main developer, I carried out major refactoring, trained models, added rule-based approaches, incorporated semantic role labeling models, and introduced tests to validate the pipeline's performance.
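To illustrate how a rule-based approach can sit on top of a trained model, here is a minimal sketch in plain Python. It is not the fork's actual code: the `Token` class, the `closed_class_lemma_rule` and `apply_rules` helpers, the lexicon, and the "model output" are all hypothetical names and data invented for this example. The idea shown is that statistical predictions are kept unless a lexicon-backed rule overrides them.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Token:
    """Hypothetical container for one token's model predictions."""
    text: str
    upos: str   # universal POS tag predicted by the model
    lemma: str  # lemma predicted by the model


# A rule inspects a token and returns a corrected copy, or None to abstain.
Rule = Callable[[Token], Optional[Token]]


def closed_class_lemma_rule(lexicon: Dict[Tuple[str, str], str]) -> Rule:
    """Build a rule that overrides lemmas for forms listed in a lexicon."""
    def rule(token: Token) -> Optional[Token]:
        lemma = lexicon.get((token.text.lower(), token.upos))
        if lemma is None or lemma == token.lemma:
            return None  # nothing to correct
        return replace(token, lemma=lemma)
    return rule


def apply_rules(tokens: Iterable[Token], rules: List[Rule]) -> List[Token]:
    """Apply rules in order; the first rule that fires wins for each token."""
    corrected = []
    for token in tokens:
        for rule in rules:
            fixed = rule(token)
            if fixed is not None:
                token = fixed
                break
        corrected.append(token)
    return corrected


# Made-up model output for the Slovenian sentence "Videl sem hišo."
predicted = [
    Token("Videl", "VERB", "videti"),
    Token("sem", "AUX", "sem"),    # illustrative model error
    Token("hišo", "NOUN", "hiša"),
    Token(".", "PUNCT", "."),
]

# Illustrative lexicon keyed by (lowercased form, UPOS).
lexicon = {("sem", "AUX"): "biti"}

result = apply_rules(predicted, [closed_class_lemma_rule(lexicon)])
print([t.lemma for t in result])
```

Keeping each rule as a small function that may abstain makes it straightforward to unit-test rules in isolation, which fits the kind of pipeline validation tests mentioned above.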
Project Outcome
The project improved both the accuracy and the feature set that this popular open-source library offers for South Slavic languages.