Introduction to cross-lingual word embeddings
Diego Sáez-Trumper, Wikimedia Research
September 2019
What you will learn
● What is a word embedding
● How can you use cross-lingual word embeddings
● What you can’t do with word embeddings
Embeddings
Taking a set in one domain and representing it in another domain while preserving some notion of distance.
Examples:
● Words → Vectors ; distance: word meaning
● Documents → Vectors ; distance: document topic
● Images → Vectors ; distance: image content
Word Embeddings
● Transformation: Words → Vectors
● Distance to preserve: Semantic
− Words that have similar meanings should be close in the vector space
− Toy Example:
■ Cat → [0.8, 0]
■ Tiger → [0.9, 0]
■ Car → [0, 0.6]
■ Truck → [0, 0.8]
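As a minimal sketch of what “close in the vector space” means, here is the toy example above in Python, using cosine similarity as the distance (the vector values are the made-up numbers from the slide):

```python
import numpy as np

# Toy 2-d vectors from the slide above.
vectors = {
    "cat":   np.array([0.8, 0.0]),
    "tiger": np.array([0.9, 0.0]),
    "car":   np.array([0.0, 0.6]),
    "truck": np.array([0.0, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["cat"], vectors["tiger"]))  # 1.0 -> similar meaning
print(cosine(vectors["cat"], vectors["car"]))    # 0.0 -> unrelated
```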
Sentence Embeddings
● Similarly, we can transform a sentence into a vector
− ‘This is a sentence’ → [x0, x1]
− Toy example:
● ‘This is a great day’ → [0.8, 0]
● ‘This is a beautiful day’ → [0.7, 0]
● ‘Open Source is great’ → [0, 0.8]
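One simple way to build such a sentence vector (an assumption for illustration only; the slide does not specify the method) is to average the vectors of the words in the sentence:

```python
import numpy as np

# Made-up word vectors, chosen so the two 'day' sentences end up close.
word_vectors = {
    "this": np.array([0.1, 0.0]), "is": np.array([0.1, 0.0]),
    "a": np.array([0.0, 0.0]),    "great": np.array([0.9, 0.2]),
    "beautiful": np.array([0.8, 0.1]), "day": np.array([0.7, 0.0]),
    "open": np.array([0.0, 0.8]), "source": np.array([0.0, 0.9]),
}

def sentence_vector(sentence):
    # Average of the word vectors; unknown words are skipped.
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

print(sentence_vector("This is a great day"))
print(sentence_vector("This is a beautiful day"))
```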
Difference with other approaches
● Word embeddings are not directly based on string similarity
− For embeddings, ‘cat’ and ‘car’ are very different
● For string similarity, look at metrics like edit distance (e.g., Levenshtein distance)
● Embeddings won’t be script dependent!
○ You can use any script (Latin, Cyrillic, Arabic, etc.)
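To make the contrast concrete, here is a small sketch: ‘cat’ and ‘car’ are almost identical as strings (edit distance 1), even though their embeddings are very different:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("cat", "car"))    # 1 -> very similar strings
print(levenshtein("cat", "tiger"))  # 5 -> very different strings,
                                    #      yet similar as embeddings
```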
Doing math with words (Example for an ideal embedding)
Check this app: https://rare-technologies.com/word2vec-tutorial/
King - Man + Woman = Queen
Paris - France + Portugal = Lisbon
Eat + Past = Ate
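With gensim (the library behind the tutorial linked above), these analogy queries look like the following; a hedged sketch, assuming you have pre-trained word2vec-format vectors on disk (the file name is a placeholder):

```python
from gensim.models import KeyedVectors

# Load any pre-trained vectors in word2vec format (placeholder path).
wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + portugal ≈ lisbon
print(wv.most_similar(positive=["paris", "portugal"], negative=["france"], topn=1))
```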
Relationships between entities
Get similar concepts
Embeddings are far from perfect!
[Figures: examples from https://rare-technologies.com/word2vec-tutorial/]
Warning!
● Word embeddings are corpus dependent!
If you train your embeddings on a news dataset, don’t expect them to work well on Wikipedia!
How are word embeddings computed?
“You shall know a word by the company it keeps” [Firth, 1957]
“Words that occur in similar contexts tend to have similar meanings” [Harris, 1954]
Predicting a word from its context (CBOW):
yesterday was a really [...] day → Strong candidates: ‘nice’, ‘beautiful’
Less probable: ‘delightful’
Predicting the context from a word (Skip-gram):
Is it probable that “yesterday was a really [...] day” is a suitable context for ‘delightful’?
Word Embeddings Approaches
Skip-gram: works well with a small amount of training data, and represents even rare words and phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Example from StackOverflow
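In gensim, both architectures are available through the same class; the `sg` flag switches between them. A minimal sketch with a placeholder toy corpus (gensim 4.x parameter names):

```python
from gensim.models import Word2Vec

sentences = [
    ["yesterday", "was", "a", "really", "nice", "day"],
    ["yesterday", "was", "a", "really", "beautiful", "day"],
    ["open", "source", "is", "great"],
]

# sg=1 -> skip-gram; sg=0 -> CBOW (the default).
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)
cbow     = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)

print(skipgram.wv.most_similar("nice", topn=2))
```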
Word Embeddings Implementations
● Examples:
− Word2Vec
− GloVe
− FastText
FastText
● Some characteristics:
− Works stand-alone (in bash) or as a Python package
− Supports both supervised and unsupervised tasks
− Uses subword information (character level)
● FastText(‘Dog’) ≈ FastText(‘Dogs’)
− Pre-trained embeddings for Wikipedia available in multiple languages
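A hedged sketch with the fasttext Python package (‘data.txt’ is a placeholder plain-text training file): because FastText composes word vectors from character n-grams, it can produce a vector even for a word it never saw during training.

```python
import fasttext

# Train unsupervised embeddings on a plain-text file (placeholder path).
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# 'dog' and 'dogs' share most of their character n-grams, so their
# vectors come out close -- even if 'dogs' is out-of-vocabulary.
vec_dog  = model.get_word_vector("dog")
vec_dogs = model.get_word_vector("dogs")
```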
Word Embeddings Size
● In practice we usually work with vectors of 150 to 300 dimensions:
− Example:
● ‘Word’ → [x0, x1, …, x299]
● Where -1 ≤ xn ≤ 1, and ‘Word’ can be any sequence of characters
Recap
● Word Embeddings:
− transform strings into vectors
− words with similar meanings will have similar vectors
− embeddings are computed using words’ context
Cross-lingual word embeddings
We use multilingual word embeddings to compare text across different languages
Working with multilingual embeddings
● Problem:
− Vector values don’t have a meaning per se.
− Values will change depending on the corpus.
− Therefore, training in different languages (on different corpora) will result in different embedding values.
Cross-lingual Embeddings
● Solutions:
− Force the embeddings to have specific values, using some anchors (like Facebook LASER).
− Learn a transformation using a bilingual dictionary:
■ Knowing a (few) points that match in the two vector spaces, you can rotate (more precisely, apply a linear transformation to) one of the vector spaces to align it with the other, as in the sketch below.
Knowing that:
Publicaciones → Publications
Discografía → Discography
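A hedged sketch of that alignment step: given a few anchor pairs like the headings above, one classic formulation is the orthogonal Procrustes problem, solved in closed form with an SVD. The matrices below are random placeholders; in practice row i of X and Y would hold the embeddings of the i-th dictionary pair.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 300))   # Spanish vectors: Publicaciones, Discografía, ...
Y = rng.normal(size=(5, 300))   # English vectors: Publications, Discography, ...

# Orthogonal Procrustes: W = U V^T, where U S V^T = SVD(Y^T X).
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

aligned = X @ W.T   # Spanish vectors mapped into the English space
```

Constraining W to be orthogonal means the map is a pure rotation/reflection, so distances within the rotated space are preserved.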
Warning!
● Cross-lingual embeddings will learn analogies, not identities!
You shouldn’t use cross-lingual embeddings for direct translation!
Given a word (or sentence) W in language X and a (small) set of candidates in language Y, you want to measure which of the candidates is the most similar to W.
When will cross-lingual embeddings work?
Distance(‘Buenos días’, ‘Good morning’) < Distance(‘Buenos días’, ‘Thank you’)
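A minimal sketch of that use case, with made-up vectors standing in for real cross-lingual sentence embeddings (e.g., from LASER):

```python
import numpy as np

# Toy aligned vectors (made up for illustration); in practice these come
# from cross-lingual sentence embeddings in a shared space.
ES = {"Buenos días": np.array([0.9, 0.1])}
EN = {"Good morning": np.array([0.85, 0.15]),
      "Thank you":    np.array([0.10, 0.90])}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = ES["Buenos días"]
best = max(EN, key=lambda c: cosine(query, EN[c]))
print(best)   # "Good morning" -- the closer candidate in the shared space
```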
Examples
Section heading alignment across languages
● Problem:
− Given the most popular section headings in two languages, create an alignment between them.
● Solution:
− Cross-lingual embeddings
− plus other features
Section Alignment API
● Input language: es
● Output language: en
● Section: Historia
● API call:
○ https://secrec.wmflabs.org/API/alignment/es/en/Historia
API's Documentation
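A hedged sketch of calling the API from Python; the JSON response format is an assumption, so check the API documentation for the actual payload:

```python
import requests

url = "https://secrec.wmflabs.org/API/alignment/es/en/Historia"
resp = requests.get(url)
resp.raise_for_status()
print(resp.json())  # assumed JSON payload with the aligned heading(s)
```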
Section Alignment API
● Input language: en
● Output language: ru
● Section: History
● API call:
○ https://secrec.wmflabs.org/API/alignment/en/ru/History
API's Documentation
Section Alignment API
● Input language: es
● Output language: ja
● Section: Historia
● API call:
○ https://secrec.wmflabs.org/API/alignment/es/ja/Historia
API's Documentation
Current languages supported: ar, fr, en, es, ja, ru
Named Parameter Template Alignment
● Problem:
− The CX translation tool needs to translate templates, but automatic translation engines are not designed for translating templates.
− There are different names and numbers of parameters in each language.
● (Not perfect) Solution:
− Use cross-lingual word embeddings
− Languages covered:
■ es, en, fr, ar, ru, uk, pt, vi, zh, he, it, ta, id, fa, ca
Check T221211
Template:Infobox publisher->Plantilla:Ficha de editorial
Warning!
Cross-lingual embeddings won’t be as good as bilingual humans at creating alignments!
But:
● They can do the work really fast.
● You can use them for all Wikipedia languages.
● Even for unusual language pairs.
Takeaways
● Word embeddings allow machines to understand similarities among words.
● Cross-lingual embeddings allow machines to compare words across different languages.
● For Wikipedia, use word embeddings trained on Wikipedia!
Do you want more?
● Do you want to know more about other possible usages of embeddings on Wikipedia?
− Check out our white paper about topic embeddings.
● Clone this repository and start working with multilingual embeddings in Python:
− https://github.com/digitalTranshumant/TutorialCrossLingualEmbeddings
Thanks!