Portfolio: LIST Corpus Extraction Tool
Data Extraction
Text Analysis
Text Mining
Tool Development
Project Highlights
- Role: Took over an incomplete project and completed it in collaboration with linguistics experts.
- Tool development: We developed a tool for data extraction from text corpora at various linguistic levels.
- Utilization: Linguists actively use the tool to compare corpora.
Project Description
The LIST corpus extraction tool is a Java program that extracts statistics from text corpora. It calculates frequency counts and association metrics such as Dice, LogDice, and t-score at multiple linguistic levels, including characters, word parts, words, and word sets. Corpora can be provided in the VERT format or in one of several variants of the TEI format.
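To make the metrics above concrete, here is a minimal Java sketch of how Dice, LogDice, and t-score are commonly defined for a pair of items x and y. It is not taken from the LIST source code: the class name `AssociationMeasures`, the method names, and the sample counts are hypothetical, and the formulas are the widely used textbook definitions, which may differ in detail from what the tool implements.

```java
/**
 * Illustrative sketch only, not the LIST implementation.
 * fxy = co-occurrence count of x and y, fx and fy = individual counts,
 * n = corpus size in tokens. All names here are assumptions.
 */
public final class AssociationMeasures {

    private AssociationMeasures() {
    }

    /** Dice coefficient: 2 * f(x,y) / (f(x) + f(y)). */
    public static double dice(long fxy, long fx, long fy) {
        return 2.0 * fxy / (fx + fy);
    }

    /** logDice: 14 + log2(Dice), the commonly cited rescaling of Dice. */
    public static double logDice(long fxy, long fx, long fy) {
        return 14.0 + Math.log(dice(fxy, fx, fy)) / Math.log(2.0);
    }

    /** t-score: (observed - expected) / sqrt(observed), with expected = f(x) * f(y) / n. */
    public static double tScore(long fxy, long fx, long fy, long n) {
        double expected = (double) fx * fy / n;
        return (fxy - expected) / Math.sqrt(fxy);
    }

    public static void main(String[] args) {
        // Made-up counts, chosen only to exercise the formulas.
        long fxy = 10, fx = 20, fy = 30, n = 1000;
        System.out.println("Dice    = " + dice(fxy, fx, fy));
        System.out.println("logDice = " + logDice(fxy, fx, fy));
        System.out.println("t-score = " + tScore(fxy, fx, fy, n));
    }
}
```

The sketch assumes non-zero counts; a production tool would need to guard against a zero co-occurrence count, for which the logarithm and the t-score denominator are undefined.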
Project Outcome
We improved the LIST corpus extraction tool by adding new extraction parameters and support for additional corpus formats, among other changes. Linguists now actively use the software to calculate and compare statistics across corpora.