Portfolio: CORDEX
Data Extraction
Multi-Core Processing
Natural Language Processing
Text Analysis
Text Mining
Tool Development
Project Description
CORDEX is a Python library for extracting collocations from text in diverse formats. It uses multicore processing to identify and count collocations based on user-defined rules.
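To illustrate the general idea of rule-based collocation counting spread across several cores, here is a minimal sketch. It is not CORDEX's actual API: the function names, the rule format (a pair of part-of-speech tags that must appear next to each other), and the pre-tagged input are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from itertools import repeat


def count_in_chunk(chunk, rule):
    """Count adjacent word pairs whose tags match the rule (hypothetical format)."""
    counts = Counter()
    for sentence in chunk:
        # Walk over neighbouring (word, tag) tokens in the sentence.
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if (t1, t2) == rule:
                counts[(w1.lower(), w2.lower())] += 1
    return counts


def count_collocations(sentences, rule, workers=1, chunk_size=1000):
    """Split the sentences into chunks, count each chunk, and merge the results."""
    chunks = [sentences[i:i + chunk_size]
              for i in range(0, len(sentences), chunk_size)]
    if workers <= 1:
        # Serial fallback: handy for small inputs and for debugging.
        partials = [count_in_chunk(chunk, rule) for chunk in chunks]
    else:
        # Each chunk is counted in a separate worker process.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            partials = list(pool.map(count_in_chunk, chunks, repeat(rule)))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


if __name__ == "__main__":
    # Toy, already tagged input; a real pipeline would read and tag a corpus first.
    sample = [
        [("strong", "ADJ"), ("tea", "NOUN")],
        [("Strong", "ADJ"), ("tea", "NOUN"), ("helps", "VERB")],
        [("green", "ADJ"), ("tea", "NOUN")],
    ]
    print(count_collocations(sample, ("ADJ", "NOUN"), workers=2, chunk_size=2))
```

Because each chunk is counted independently and the partial counts are only merged at the end, the work parallelises without shared state between processes.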
Role
As the main developer, I enhanced and completed a partially developed script and turned it into a Python library.
Key Contributions
- Major refactoring
- Enhancing user-defined rule options
- Implementing inflectional data lookup
- Integrating API queries
- Adding the option to handle multiple formats in different languages
Project Outcome
We created and published a Python library that is actively used in linguistic research and in a data feed API. It has also been used to process a 1-billion-word corpus.