
Page 1:

Introduction to cross-lingual word-embeddings

Diego Sáez-Trumper, Wikimedia Research

September 2019

Page 2:
Page 3:

What you will learn

● What a word embedding is
● How you can use cross-lingual word embeddings
● What you can't do with word embeddings

Page 4:

Embeddings
Taking a set in one domain and representing it in another domain, preserving some notion of distance.

Examples:
● Words → Vectors ; distance: word meaning
● Documents → Vectors ; distance: document topic
● Images → Vectors ; distance: image content

Page 5:

Word Embeddings
● Transformation: Words → Vectors
● Distance to preserve: Semantic

− Words that have similar meaning should be close in the vector space

− Toy Example:

■ Cat → [0.8, 0]

■ Tiger → [0.9, 0]

■ Car → [0, 0.6]

■ Truck → [0, 0.8]
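
To make the notion of distance concrete, here is a minimal sketch (using plain numpy and the toy vectors above, not a real trained model) of how cosine similarity captures which words are close:

    import numpy as np

    # Toy 2-D "embeddings" from the slide above (made-up values, not model output)
    vectors = {
        "cat":   np.array([0.8, 0.0]),
        "tiger": np.array([0.9, 0.0]),
        "car":   np.array([0.0, 0.6]),
        "truck": np.array([0.0, 0.8]),
    }

    def cosine_similarity(a, b):
        """Cosine of the angle between two vectors: 1.0 means same direction."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(vectors["cat"], vectors["tiger"]))  # 1.0 -> similar meaning
    print(cosine_similarity(vectors["cat"], vectors["car"]))    # 0.0 -> unrelated meaning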

Page 6:

Sentence Embedding:
● Similarly, we can transform a sentence into a vector

− ‘This is a sentence’→ [x0, x1]

− Toy example:

● ‘This is a great day’ -> [0.8, 0]

● ‘This is a beautiful day’ -> [0.7, 0]

● ‘Open Source is great’ -> [0, 0.8]
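
One simple way to obtain such a sentence vector, shown here only as an illustrative baseline (the toy numbers above were not necessarily produced this way), is to average the vectors of the words in the sentence:

    import numpy as np

    # Made-up word vectors, purely for illustration
    word_vectors = {
        "this": np.array([0.1, 0.1]), "is": np.array([0.1, 0.1]),
        "a": np.array([0.0, 0.0]), "great": np.array([0.9, 0.1]),
        "beautiful": np.array([0.8, 0.2]), "day": np.array([0.5, 0.1]),
    }

    def sentence_vector(sentence):
        """Average of the word vectors: a crude but common sentence-embedding baseline."""
        return np.mean([word_vectors[w] for w in sentence.lower().split()], axis=0)

    print(sentence_vector("This is a great day"))
    print(sentence_vector("This is a beautiful day"))  # close to the previous vector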

Page 7:
Page 8:

Difference with other approaches
● Word embeddings are not directly based on string similarity

− For embeddings, 'cat' and 'car' are very different

● For string similarity, take a look at metrics like edit distance or Levenshtein distance

● Embeddings are not script dependent!
○ You can use any script (Latin, Cyrillic, Arabic, etc.)
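
To see the contrast, compare Levenshtein (edit) distance with the toy embedding above: the strings 'cat' and 'car' are one edit apart, while their embeddings point in completely different directions. The Levenshtein function below is a standard textbook implementation, not code from the slides:

    def levenshtein(a, b):
        """Minimum number of single-character edits needed to turn a into b."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    print(levenshtein("cat", "car"))    # 1 -> nearly identical strings
    print(levenshtein("cat", "tiger"))  # 5 -> very different strings
    # An embedding would flip this: cat/tiger close in meaning, cat/car far apart.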

Page 9:

Doing math with words (Example for an ideal embedding)

Check this app: https://rare-technologies.com/word2vec-tutorial/

King - Man + Woman = Queen

France - Paris + Portugal = Lisbon

Eat + Past = Ate
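
With a trained model these analogy queries are one function call. A minimal sketch using gensim; the file name is just a placeholder for whatever pretrained word2vec-format vectors you load:

    from gensim.models import KeyedVectors

    # Placeholder path: any pretrained vectors in word2vec format
    model = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

    # "king - man + woman ≈ ?": add the positive vectors, subtract the negative one
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # With good vectors, 'queen' should appear near the top of the list.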

Page 10:

Relationships between entities

https://rare-technologies.com/word2vec-tutorial/

Page 11:

Relationships between entities (ii)

https://rare-technologies.com/word2vec-tutorial/

Page 12:

Get similar concepts

https://rare-technologies.com/word2vec-tutorial/
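
Retrieving similar concepts is the same kind of query: ask for the nearest neighbours of a single word. Again a hedged gensim sketch with a placeholder vectors file:

    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

    # Nearest neighbours of one word = "similar concepts"
    print(model.most_similar("wikipedia", topn=5))
    # The neighbours depend entirely on the corpus the vectors were trained on.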

Page 13:

Embeddings are far from perfect!

https://rare-technologies.com/word2vec-tutorial/


Page 14:

Warning!
● Word embeddings are corpus dependent!

If you train your embeddings on a news dataset, don't expect them to work well on Wikipedia!

Page 15:

How are word embeddings computed?
"You shall know a word by the company it keeps" [Firth, 1957]

Words that occur in similar contexts tend to have similar meanings [Harris, 1954]

Predicting a word from its context (CBOW):

yesterday was a really [...] day ----> Strong Candidates: ‘Nice’ , ‘Beautiful’

Less probable: delightful

Given a word, predict its context (Skip-gram):

Is it probable that 'yesterday was a really [...] day' is a suitable context for 'delightful'?

Page 16:

Word Embeddings Approaches

Skip-gram: works well with a small amount of training data, and represents even rare words or phrases well.

CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.

Example from StackOverflow
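
In gensim's Word2Vec implementation the choice between the two architectures is a single flag. A minimal sketch, assuming you already have tokenized sentences (the tiny corpus below is only for illustration):

    from gensim.models import Word2Vec

    # Tiny illustrative corpus; in practice use a full text dump (e.g., Wikipedia)
    sentences = [
        ["yesterday", "was", "a", "really", "nice", "day"],
        ["yesterday", "was", "a", "really", "beautiful", "day"],
        ["open", "source", "is", "great"],
    ]

    # sg=0 -> CBOW: predict the word from its context
    # (vector_size is the gensim 4 parameter name; older versions call it size)
    cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)

    # sg=1 -> Skip-gram: predict the context from the word
    skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

    print(skipgram.wv["nice"][:5])  # first 5 dimensions of the vector for 'nice'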

Page 17:

Word Embeddings Implementations
● Examples:

− Word2Vec

− GloVe

− FastText

Page 18:

FastText
● Some characteristics

− Works stand-alone (command line) or as a Python package

− Allows supervised and unsupervised tasks

− Uses subword information (character level)

● FastText('Dog') ≈ FastText('Dogs')

− Pre-trained embeddings for Wikipedia in multiple languages
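
A hedged sketch of what the Python package looks like in practice; 'corpus.txt' is a placeholder file and the hyperparameters are illustrative defaults, not values from this talk:

    import fasttext

    # Train unsupervised subword embeddings on a plain-text corpus (placeholder path)
    model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)

    # Thanks to subword information, even unseen inflections get a sensible vector
    print(model.get_word_vector("dog")[:5])
    print(model.get_nearest_neighbors("dogs", k=5))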

Page 19:

Word Embeddings Size:
● In practice we usually work with vectors of 150 to 300 dimensions:

− Example:

● ‘Word’→ [x0, x1, …, x299]

● Where -1 ≤ xn ≤ 1, and 'Word' can be any string

Page 20:

Recap
● Word Embeddings

− transform strings into vectors
− words with similar meanings will have similar vectors
− embeddings are computed using words' context

Page 21:

Cross-lingual word embeddings

Page 22:

We use multilingual word embeddings to compare text across different languages

Page 23:

Working with Multilingual Embeddings
● Problem

− Vector values don't have a meaning per se.

− Values will change depending on the corpus.

− Therefore, training in different languages (different corpora) will result in different embedding values.

Page 24:
Page 25:
Page 26:

Cross-lingual Embeddings
● Solutions:

− Force the embeddings to have specific values, using some anchors (like Facebook's LASER).

− Learn a transformation using a bilingual dictionary

■ Knowing a (small) set of points that match across the two vector spaces, you can rotate (more precisely, apply a linear transformation to) one vector space to align it with the other.
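
A minimal sketch of that dictionary-based idea using the standard orthogonal Procrustes solution: stack the source-language and target-language vectors of known dictionary pairs, then solve for the rotation that maps one space onto the other. The random matrices below are placeholders; in practice each row would be the embedding of one dictionary pair, such as the headings on the next slide:

    import numpy as np

    # X: source-language vectors, Y: target-language vectors, one row per dictionary pair
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 300))  # placeholder data
    Y = rng.normal(size=(1000, 300))  # placeholder data

    # Orthogonal Procrustes: W = argmin ||XW - Y||_F with W constrained to a rotation
    U, _, Vt = np.linalg.svd(X.T @ Y)
    W = U @ Vt

    # Map source-language vectors into the target space, then compare with cosine similarity
    aligned_X = X @ W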

Page 27:

Knowing that:
Publicaciones -> Publications
Discografía -> Discography

Page 28:
Page 29:

Warning!
● Cross-lingual embeddings will learn analogies, not identities!

You shouldn’t use cross-lingual embeddings for direct translation!

Page 30:

When will cross-lingual embeddings work?

When, given a word (or sentence) W in language X and a (small) set of candidates in language Y, you want to measure which of the candidates is the most similar to W.

Distance('Buenos días', 'Good morning') < Distance('Buenos días', 'Thank you')
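
A hedged sketch of that comparison: represent each phrase as the average of its aligned word vectors and pick the candidate with the smallest cosine distance. The tiny dictionaries below are made-up stand-ins for real aligned embeddings (e.g., fastText vectors mapped with a rotation like the one above):

    import numpy as np

    # Toy *aligned* vectors standing in for real cross-lingual embeddings
    es_vectors = {"buenos": np.array([0.9, 0.1]), "días": np.array([0.8, 0.2])}
    en_vectors = {"good": np.array([0.9, 0.1]), "morning": np.array([0.8, 0.2]),
                  "thank": np.array([0.1, 0.9]), "you": np.array([0.2, 0.8])}

    def phrase_vector(phrase, vectors):
        """Average the word vectors of a phrase (a crude but common baseline)."""
        return np.mean([vectors[w] for w in phrase.lower().split() if w in vectors], axis=0)

    def cosine_distance(a, b):
        return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    query = phrase_vector("buenos días", es_vectors)
    for candidate in ["good morning", "thank you"]:
        print(candidate, cosine_distance(query, phrase_vector(candidate, en_vectors)))
    # With real aligned vectors: distance to 'good morning' < distance to 'thank you'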

Page 31:

Examples

Page 32:

Section heading alignment across languages
● Problem:

− Given the most popular section headings in two languages, create an alignment between them.

● Solution:
− Cross-lingual embeddings
− plus other features

Page 33:

Section Alignment API

● Input language: es
● Output language: en
● Section: Historia
● API CALL:

○ https://secrec.wmflabs.org/API/alignment/es/en/Historia

API's Documentation
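
A hedged sketch of calling this endpoint from Python with the requests library; the URL is the one on the slide, but the response format in the comment is an assumption, so check the linked API documentation:

    import requests

    # Ask for the English heading aligned with the Spanish section heading "Historia"
    url = "https://secrec.wmflabs.org/API/alignment/es/en/Historia"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    print(response.json())  # assumed JSON; see the API documentation for the exact schema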

Page 34:

Section Alignment API

● Input language: en
● Output language: ru
● Section: History
● API CALL:

○ https://secrec.wmflabs.org/API/alignment/en/ru/History

API's Documentation

Page 35:

Section Alignment API

● Input language: es
● Output language: ja
● Section: Historia
● API CALL:

○ https://secrec.wmflabs.org/API/alignment/es/ja/Historia

API's Documentation

Page 36:

Section Alignment API

● Input language: en
● Output language: ru
● Section: History
● API CALL:

○ https://secrec.wmflabs.org/API/alignment/en/ru/History

API's Documentation

Current languages supported: ar, fr, en, es, ja, ru

Page 37:

Named Template Parameter Alignments
● Problem:

− The CX translation tool needs to translate templates. Automatic translation engines are not designed for translating templates

− There are different parameter names and different numbers of parameters in each language

● (Not perfect) Solution:
− Use cross-lingual word embeddings
− Languages covered:

■ es, en, fr, ar, ru, uk, pt, vi, zh, ru, he, it, ta, id, fa, ca

Check T221211

Page 38:
Page 39:
Page 40:
Page 41:

Template:Infobox publisher->Plantilla:Ficha de editorial

Page 42:

Template:Infobox publisher->Plantilla:Ficha de editorial

Page 43:

Warning!

Cross-lingual embeddings won't be as good as bilingual humans at creating alignments!

But:

● They can do the work really fast.
● You can use them for all Wikipedia languages.
● Even for unusual language pairs.

Page 44:

Takeaways
● Word embeddings allow machines to understand similarities among words.
● Cross-lingual embeddings allow machines to compare words across different languages.
● For Wikipedia, use word embeddings trained on Wikipedia!

Page 45:

Do you want more?
● Do you want to know more about other possible usages of embeddings on Wikipedia?
− Check out our white paper about topic embeddings.
● Clone this repository and start working with multilingual embeddings in Python:
− https://github.com/digitalTranshumant/TutorialCrossLingualEmbeddings

Page 46:

Thanks!