Luka Krsnik

Portfolio: South Slavic Language Processing with Stanza Fork

Data Science

Dependency Parsing

LSTMs

Lemmatization

Machine Learning

Named Entity Recognition

Natural Language Processing

Part-of-Speech Tagging

Semantic Role Labeling

Text Analysis

Tool Development

Project Description

In this project, we customized Stanza, an NLP library comparable to spaCy, to improve natural language processing for South Slavic languages such as Slovenian, Croatian, Serbian, Macedonian, and Bulgarian. The customized library optimized tokenization, sentence splitting, part-of-speech tagging, lemmatization, dependency parsing, named entity recognition, and semantic role labeling.
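To illustrate the kind of pipeline described above, here is a minimal sketch using the upstream Stanza API rather than the fork itself. It assumes that Slovenian (`sl`) models are available for the listed processors; the helper names `build_processors` and `annotate` are hypothetical and chosen only for this example.

```python
# Minimal sketch with upstream Stanza (not the fork described here).
# Assumption: Slovenian ("sl") models exist for the processors listed below.

# Processors in the order Stanza runs them; tokenize also does sentence splitting.
PIPELINE_ORDER = ["tokenize", "pos", "lemma", "depparse"]


def build_processors(tasks):
    """Return the comma-separated processor string in pipeline order."""
    unknown = set(tasks) - set(PIPELINE_ORDER)
    if unknown:
        raise ValueError(f"unsupported processors: {sorted(unknown)}")
    return ",".join(p for p in PIPELINE_ORDER if p in tasks)


def annotate(text, lang="sl", tasks=tuple(PIPELINE_ORDER)):
    """Run the pipeline and return one tuple per word."""
    import stanza  # third-party dependency, imported lazily

    processors = build_processors(tasks)
    stanza.download(lang, processors=processors, verbose=False)
    nlp = stanza.Pipeline(lang=lang, processors=processors, verbose=False)
    doc = nlp(text)
    return [
        (word.text, word.lemma, word.upos, word.head, word.deprel)
        for sentence in doc.sentences
        for word in sentence.words
    ]


if __name__ == "__main__":
    # Downloads models on first run, so network access is required.
    for row in annotate("Ljubljana je glavno mesto Slovenije."):
        print(row)
```

Named entity recognition and semantic role labeling are left out of the sketch because their availability for a given language depends on the installed models.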

Role

As the main developer, I was responsible for major refactoring, model training, the addition of rule-based approaches, the incorporation of semantic role labeling models, and the introduction of tests to validate the pipeline's performance.

Project Outcome

The project improved the accuracy and extended the feature set of this popular open-source library for South Slavic languages.