Portfolio: LIST Corpus Extraction Tool

Data Extraction
Text Analysis
Text Mining
Tool Development


Project Highlights

  • Role: Took over an incomplete project and completed it in collaboration with linguistic experts.
  • Tool development: We developed a tool that extracts data from text corpora at several linguistic levels.
  • Utilization: Linguists actively use the tool to compare corpora.

Project Description

The LIST corpus extraction tool is a Java program that extracts statistics from text corpora. It calculates counts and association metrics such as Dice, LogDice, and t-score at multiple linguistic levels, including characters, word parts, words, and word sets. Corpora may be provided in the VERT format or in one of several variants of the TEI format.
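
To illustrate the metrics named above, here is a minimal Java sketch using their commonly cited definitions for a pair of items x and y. It is not taken from the LIST source code, and the tool's own formulas may differ in detail. The class name AssociationMeasures, the method names, and the parameters (fxy for the co-occurrence count, fx and fy for the individual counts, n for the corpus size) are hypothetical and chosen only for this example.

```java
/**
 * Illustrative association measures (hypothetical class, not LIST code).
 * fxy = co-occurrence count of x and y, fx / fy = individual counts,
 * n = total number of tokens in the corpus.
 */
final class AssociationMeasures {

    private AssociationMeasures() {
        // Utility class, not meant to be instantiated.
    }

    /** Dice coefficient: 2 * fxy / (fx + fy). */
    static double dice(long fxy, long fx, long fy) {
        if (fx + fy <= 0) {
            throw new IllegalArgumentException("fx + fy must be positive");
        }
        return 2.0 * fxy / (fx + fy);
    }

    /** logDice: 14 + log2(Dice). Requires fxy > 0. */
    static double logDice(long fxy, long fx, long fy) {
        if (fxy <= 0) {
            throw new IllegalArgumentException("fxy must be positive");
        }
        return 14.0 + Math.log(dice(fxy, fx, fy)) / Math.log(2.0);
    }

    /** t-score: (fxy - fx * fy / n) / sqrt(fxy). Requires fxy > 0 and n > 0. */
    static double tScore(long fxy, long fx, long fy, long n) {
        if (fxy <= 0 || n <= 0) {
            throw new IllegalArgumentException("fxy and n must be positive");
        }
        // Co-occurrence count expected if x and y were independent.
        double expected = (double) fx * fy / n;
        return (fxy - expected) / Math.sqrt(fxy);
    }

    public static void main(String[] args) {
        // Made-up counts, for demonstration only.
        long fxy = 10, fx = 20, fy = 30, n = 1000;
        System.out.println("Dice    = " + dice(fxy, fx, fy));
        System.out.println("logDice = " + logDice(fxy, fx, fy));
        System.out.println("t-score = " + tScore(fxy, fx, fy, n));
    }
}
```

The counts in main are invented for demonstration. In practice they would come from the extraction step at the chosen linguistic level (characters, word parts, words, or word sets).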

Project Outcome

We improved the LIST corpus extraction tool by adding new extraction parameters and supporting more corpus formats, among other changes. The software is now actively used in linguistic analysis to calculate and compare statistics across corpora.