Portfolio: Cross-Lingual Embedding Alignment
Cross-Lingual Embeddings
Named Entities
Natural Language Processing
Research
Project Description
Overview
This research project investigated cross-lingual embedding alignment using anchor points when no proper bilingual dictionary is available. The core idea is to learn a transformation that maps a word's embedding vector in one language onto the embedding vector of its translation in another language. This makes it possible to train a model for a specific task in one language and apply it effectively in another.
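The project description above does not spell out the alignment method itself, so the following is only a minimal sketch of the widely used orthogonal Procrustes approach: stack the anchor embeddings of both languages into matrices and solve for an orthogonal map with an SVD. The random matrices below merely stand in for real anchor embeddings.

```python
import numpy as np

def learn_orthogonal_map(src_anchors: np.ndarray, tgt_anchors: np.ndarray) -> np.ndarray:
    """Orthogonal Procrustes: find the orthogonal W minimizing
    ||src_anchors @ W - tgt_anchors||_F, where matching rows of the two
    matrices are the embeddings of an anchor-point translation pair."""
    # SVD of the cross-covariance between the two anchor sets.
    u, _, vt = np.linalg.svd(src_anchors.T @ tgt_anchors)
    return u @ vt  # orthogonal mapping from the source space to the target space

def translate(vec: np.ndarray, mapping: np.ndarray) -> np.ndarray:
    """Map a source-language word vector into the target embedding space."""
    return vec @ mapping

# Toy usage: random vectors stand in for real anchor embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 50))            # 1000 anchor words, 50-dim embeddings
true_w, _ = np.linalg.qr(rng.normal(size=(50, 50)))
tgt = src @ true_w                           # pretend the target space is a rotation of the source
w = learn_orthogonal_map(src, tgt)
print(np.allclose(src @ w, tgt, atol=1e-6))  # True: the rotation is recovered
```

Constraining the map to be orthogonal keeps distances and angles in the source space intact, which is why this formulation is a common default when only a limited set of anchor pairs is available.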
Key details
- Role: I collaborated with a colleague; I generated the anchor points and found datasets for downstream evaluation, while he focused on aligning the embeddings.
- Generating anchor points using aligned corpora: Utilized sentence-aligned corpora from MultiParaCrawl to extract named entities in both languages. Whenever an equal number of named entities was identified on both sides, they were treated as translations of each other and employed as anchor points (see the first sketch after this list).
- Generating anchor points using a semantic network: Used BabelNet, a semantic network that links concepts and named entities across many languages, to generate anchor points (second sketch after this list).
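A minimal sketch of the corpus-based strategy, assuming spaCy NER pipelines and an English-German pair for illustration (the project used whichever language pairs were aligned in MultiParaCrawl; the small `en_core_web_sm` and `de_core_news_sm` models must be installed):

```python
import spacy

# Assumed pipelines for an illustrative English-German pair; substitute the
# models for the languages actually aligned in the corpus.
nlp_src = spacy.load("en_core_web_sm")
nlp_tgt = spacy.load("de_core_news_sm")

def anchors_from_aligned_corpus(sentence_pairs):
    """Extract anchor points from sentence-aligned text.

    For each aligned sentence pair, run NER on both sides; when both sides
    contain the same number of named entities, treat them (in order) as
    translations of each other and emit them as anchor pairs.
    """
    anchors = []
    for src_sent, tgt_sent in sentence_pairs:
        src_ents = [ent.text for ent in nlp_src(src_sent).ents]
        tgt_ents = [ent.text for ent in nlp_tgt(tgt_sent).ents]
        if src_ents and len(src_ents) == len(tgt_ents):
            anchors.extend(zip(src_ents, tgt_ents))
    return anchors

# Toy aligned pair; in practice the pairs come from MultiParaCrawl.
pairs = [("Angela Merkel visited Paris in 2018.",
          "Angela Merkel besuchte Paris im Jahr 2018.")]
print(anchors_from_aligned_corpus(pairs))
```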
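For the semantic-network strategy, the sketch below queries BabelNet's HTTP API for target-language senses of a source lemma. The version prefix in the URL and the response field names may differ between API releases and should be checked against the current documentation; `BABELNET_KEY` and the language pair are placeholders.

```python
import requests

BABELNET_KEY = "YOUR_API_KEY"                       # placeholder, obtain a key from babelnet.io
BABELNET_URL = "https://babelnet.io/v9/getSenses"   # version prefix may differ

def babelnet_anchors(lemma, src_lang="EN", tgt_lang="DE"):
    """Return (source_lemma, target_lemma) anchor pairs for one lemma by
    collecting its target-language senses from BabelNet."""
    params = {"lemma": lemma, "searchLang": src_lang,
              "targetLang": tgt_lang, "key": BABELNET_KEY}
    senses = requests.get(BABELNET_URL, params=params, timeout=30).json()
    pairs = []
    for sense in senses:
        props = sense.get("properties", {})
        # Keep only senses expressed in the target language.
        if props.get("language") == tgt_lang and props.get("fullLemma"):
            pairs.append((lemma, props["fullLemma"]))
    return pairs

# e.g. babelnet_anchors("Berlin") might yield [("Berlin", "Berlin"), ...]
```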
Project Outcome
We evaluated the different types of anchor points on a downstream task (a sketch of such a transfer evaluation follows this list), which led to the following findings and insights:
- Hand-checked dictionaries yielded the best results.
- In the absence of dictionaries, alternative anchor points delivered satisfactory results.
- BabelNet-derived anchor points outperformed those from named entities.
- Alignment was most effective between similar languages.
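The concrete downstream task is not specified above, so the following is only an assumption-laden sketch of a typical transfer evaluation: train a classifier on sentence vectors from the source language, mapped into the target space with the learned alignment, and test it on unmapped target-language data. The embedding dictionaries, datasets, and classifier choice are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sentence_vec(tokens, emb, mapping=None):
    """Average the word vectors of a sentence; if a mapping is given, move the
    source-language vector into the target embedding space."""
    vecs = [emb[t] for t in tokens if t in emb]
    v = np.mean(vecs, axis=0)
    return v @ mapping if mapping is not None else v

def transfer_accuracy(src_data, tgt_data, src_emb, tgt_emb, mapping):
    """Train on the source language, test on the target language.

    src_data / tgt_data: lists of (token_list, label) pairs.
    src_emb / tgt_emb: dicts mapping tokens to embedding vectors.
    mapping: the alignment matrix learned from the anchor points.
    """
    X_train = np.stack([sentence_vec(toks, src_emb, mapping) for toks, _ in src_data])
    y_train = [label for _, label in src_data]
    X_test = np.stack([sentence_vec(toks, tgt_emb) for toks, _ in tgt_data])
    y_test = [label for _, label in tgt_data]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return clf.score(X_test, y_test)
```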