Portfolio: Cross-Lingual Embedding Alignment
Cross-Lingual Embeddings
Named Entities
Natural Language Processing
Research
Project Description
Overview
This research project investigated cross-lingual embedding alignment using anchor points when no proper bilingual dictionary is available. The core idea is to learn a transformation that maps a word's embedding vector in one language onto the embedding vector of its translation in another language. This makes it possible to train a model for a specific task in one language and apply it effectively in another.
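The project description above does not spell out the alignment method itself, so the following is only a minimal sketch of the widely used orthogonal Procrustes approach: stack the anchor embeddings of both languages into matrices and solve for an orthogonal map with an SVD. The random matrices below merely stand in for real anchor embeddings.

```python
import numpy as np

def learn_orthogonal_map(src_anchors: np.ndarray, tgt_anchors: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: find the orthogonal W minimizing
    ||src_anchors @ W - tgt_anchors||_F, where matching rows of the two
    matrices are the embeddings of an anchor-point translation pair."""
    # SVD of the cross-covariance between the two anchor sets.
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt  # orthogonal mapping from the source space to the target space

def translate(vec: np.ndarray, mapping: np.ndarray) -> np.ndarray:
    """Map a source-language word vector into the target embedding space."""
    return vec @ mapping

# Toy usage: random vectors stand in for real anchor embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 50))            # 1000 anchor words, 50-dim embeddings
true_w, _ = np.linalg.qr(rng.normal(size=(50, 50)))
tgt = src @ true_w                           # pretend the target space is a rotation of the source
w = learn_orthogonal_map(src, tgt)
print(np.allclose(src @ w, tgt, atol=1e-6))  # True: the rotation is recovered
```

Constraining the map to be orthogonal keeps distances and angles in the source space intact, which is why this formulation is a common default when only a limited set of anchor pairs is available.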
Key details
- Role: I collaborated with a colleague; I generated the anchor points and found datasets for downstream evaluation, while he focused on aligning the embeddings.
- Generating anchor points using aligned corpora: Utilized sentence-aligned corpora from MultiParaCrawl to extract named entities in both languages. Whenever an equal number of named entities was identified on both sides, they were treated as translations of each other and employed as anchor points (see the first sketch after this list).
- Generating anchor points using a semantic network: Used BabelNet, a semantic network that links concepts and named entities across many languages, to generate anchor points (second sketch after this list).
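A minimal sketch of the corpus-based strategy, assuming spaCy NER pipelines and an English-German pair for illustration (the project used whichever language pairs were aligned in MultiParaCrawl; the small `en_core_web_sm` and `de_core_news_sm` models must be installed):

```python
import spacy

# Assumed pipelines for an illustrative English-German pair; substitute the
# models for the languages actually aligned in the corpus.
nlp_src = spacy.load("en_core_web_sm")
nlp_tgt = spacy.load("de_core_news_sm")

def anchors_from_aligned_corpus(sentence_pairs):
    """Extract anchor points from sentence-aligned text.

    For each aligned sentence pair, run NER on both sides; when both sides
    contain the same number of named entities, treat them (in order) as
    translations of each other and emit them as anchor pairs.
    """
    anchors = []
    for src_sent, tgt_sent in sentence_pairs:
        src_ents = [ent.text for ent in nlp_src(src_sent).ents]
        tgt_ents = [ent.text for ent in nlp_tgt(tgt_sent).ents]
        if src_ents and len(src_ents) == len(tgt_ents):
            anchors.extend(zip(src_ents, tgt_ents))
    return anchors

# Toy aligned pair; in practice the pairs come from MultiParaCrawl.
pairs = [("Angela Merkel visited Paris in 2018.",
          "Angela Merkel besuchte Paris im Jahr 2018.")]
print(anchors_from_aligned_corpus(pairs))
```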
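For the semantic-network strategy, the sketch below queries BabelNet's HTTP API for target-language senses of a source lemma. The version prefix in the URL and the response field names may differ between API releases and should be checked against the current documentation; `BABELNET_KEY` and the language pair are placeholders.

```python
import requests

BABELNET_KEY = "YOUR_API_KEY"                       # placeholder, obtain a key from babelnet.io
BABELNET_URL = "https://babelnet.io/v9/getSenses"   # version prefix may differ

def babelnet_anchors(lemma, src_lang="EN", tgt_lang="DE"):
    """Return (source_lemma, target_lemma) anchor pairs for one lemma by
    collecting its target-language senses from BabelNet."""
    params = {"lemma": lemma, "searchLang": src_lang,
              "targetLang": tgt_lang, "key": BABELNET_KEY}
    senses = requests.get(BABELNET_URL, params=params, timeout=30).json()
    pairs = []
    for sense in senses:
        props = sense.get("properties", {})
        # Keep only senses expressed in the target language.
        if props.get("language") == tgt_lang and props.get("fullLemma"):
            pairs.append((lemma, props["fullLemma"]))
    return pairs

# e.g. babelnet_anchors("Berlin") might yield [("Berlin", "Berlin"), ...]
```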
Project Outcome
We evaluated the different types of anchor points on a downstream task (a sketch of such a transfer evaluation follows this list), which led to the following findings and insights:
- Hand-checked dictionaries yielded the best results.
- In the absence of dictionaries, alternative anchor points delivered satisfactory results.
- BabelNet-derived anchor points outperformed those from named entities.
- Alignment was most effective between similar languages.
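The concrete downstream task is not specified above, so the following is only an assumption-laden sketch of a typical transfer evaluation: train a classifier on sentence vectors from the source language, mapped into the target space with the learned alignment, and test it on unmapped target-language data. The embedding dictionaries, datasets, and classifier choice are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vec(tokens, emb, mapping=None):
    """Average the word vectors of a sentence; if a mapping is given, move the
    source-language vector into the target embedding space."""
    vecs = [emb[t] for t in tokens if t in emb]
    v = np.mean(vecs, axis=0)
    return v @ mapping if mapping is not None else v

def transfer_accuracy(src_data, tgt_data, src_emb, tgt_emb, mapping):
    """Train on the source language, test on the target language.

    src_data / tgt_data: lists of (token_list, label) pairs.
    src_emb / tgt_emb: dicts mapping tokens to embedding vectors.
    mapping: the alignment matrix learned from the anchor points.
    """
    X_train = np.stack([sentence_vec(toks, src_emb, mapping) for toks, _ in src_data])
    y_train = [label for _, label in src_data]
    X_test = np.stack([sentence_vec(toks, tgt_emb) for toks, _ in tgt_data])
    y_test = [label for _, label in tgt_data]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)
```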