Portfolio: CORDEX

Data Extraction
Multi-Core Processing
Natural Language Processing
Text Analysis
Text Mining
Tool Development


Project Description

CORDEX is a Python library for extracting collocations from text in diverse formats. It uses multicore processing to identify and count collocations according to user-defined rules.
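
To illustrate the general idea of rule-based, multicore collocation counting, here is a minimal sketch built only on the Python standard library. It is not CORDEX's actual API: the names adjacent_pairs, count_chunk and count_collocations are hypothetical, and the "rule" is simplified to a function that yields candidate token pairs from a whitespace-tokenised sentence.

```python
from collections import Counter
from multiprocessing import Pool


def adjacent_pairs(tokens):
    """Hypothetical rule: every pair of neighbouring tokens is a candidate."""
    return zip(tokens, tokens[1:])


def count_chunk(args):
    """Count the collocations that `rule` yields for one chunk of sentences."""
    rule, sentences = args
    counts = Counter()
    for sentence in sentences:
        # Naive tokenisation (assumption): lowercase and split on whitespace.
        counts.update(rule(sentence.lower().split()))
    return counts


def count_collocations(sentences, rule=adjacent_pairs, workers=2):
    """Split the input into chunks, count each chunk, and merge the results."""
    workers = max(1, workers)
    # Round-robin split so every worker gets a similar share of sentences.
    chunks = [(rule, sentences[i::workers]) for i in range(workers)]
    if workers == 1:
        # Serial fallback: no worker processes are started.
        partials = map(count_chunk, chunks)
    else:
        with Pool(workers) as pool:
            partials = pool.map(count_chunk, chunks)
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total
```

In a script, a call such as count_collocations(corpus, workers=4) would normally sit under an if __name__ == "__main__": guard, and the rule has to be a module-level function so that it can be sent to the worker processes.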

Role

As the main developer, I enhanced and completed a partially developed script and turned it into a Python library.

Key Contributions

  • Major refactoring
  • Enhancing user-defined rule options
  • Implementing inflectional data lookup
  • Integrating API queries
  • Adding the option to handle multiple formats in different languages

Project Outcome

We created and published a Python library that is actively used in linguistic research and in a data feed API. It has also been used to process a 1-billion-word corpus.