Introduction to cross-lingual word embeddings
Diego Sáez-Trumper, Wikimedia Research
September 2019
What you will learn
● What is a word embedding
● How can you use cross-lingual word embeddings
● What you can’t do with word embeddings
Embeddings
Taking a set in one domain and representing it in another domain while preserving some notion of distance.
Examples:
● Words → Vectors ; distance: word meaning
● Documents → Vectors ; distance: document topic
● Images → Vectors ; distance: image content
Word Embeddings
● Transformation: Words → Vectors
● Distance to preserve: Semantic
− Words that have similar meanings should be close in the vector space
− Toy Example:
■ Cat → [0.8, 0]
■ Tiger → [0.9, 0]
■ Car → [0, 0.6]
■ Truck → [0, 0.8]
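As a minimal sketch of what “close in the vector space” means, here is the toy example above in Python, using cosine similarity as the distance (the vector values are the made-up numbers from the slide):

```python
import numpy as np

# Toy 2-d vectors from the slide above.
vectors = {
    "cat":   np.array([0.8, 0.0]),
    "tiger": np.array([0.9, 0.0]),
    "car":   np.array([0.0, 0.6]),
    "truck": np.array([0.0, 0.8]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vectors["cat"], vectors["tiger"]))  # 1.0 -> similar meaning
print(cosine(vectors["cat"], vectors["car"]))    # 0.0 -> unrelated
```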
Sentence Embeddings
● Similarly, we can transform a sentence into a vector
− ‘This is a sentence’ → [x0, x1]
− Toy example:
● ‘This is a great day’ → [0.8, 0]
● ‘This is a beautiful day’ → [0.7, 0]
● ‘Open Source is great’ → [0, 0.8]
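One simple way to build such a sentence vector (an assumption for illustration only; the slide does not specify the method) is to average the vectors of the words in the sentence:

```python
import numpy as np

# Made-up word vectors, chosen so the two 'day' sentences end up close.
word_vectors = {
    "this": np.array([0.1, 0.0]), "is": np.array([0.1, 0.0]),
    "a": np.array([0.0, 0.0]),    "great": np.array([0.9, 0.2]),
    "beautiful": np.array([0.8, 0.1]), "day": np.array([0.7, 0.0]),
    "open": np.array([0.0, 0.8]), "source": np.array([0.0, 0.9]),
}

def sentence_vector(sentence):
    # Average of the word vectors; unknown words are skipped.
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

print(sentence_vector("This is a great day"))
print(sentence_vector("This is a beautiful day"))
```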
Difference with other approaches
● Word embeddings are not directly based on string similarity
− For embeddings, ‘cat’ and ‘car’ are very different
● For string similarity, look at metrics like edit distance (e.g., Levenshtein distance)
● Embeddings won’t be script dependent!
○ You can use any script (Latin, Cyrillic, Arabic, etc.)
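To make the contrast concrete, here is a small sketch: ‘cat’ and ‘car’ are almost identical as strings (edit distance 1), even though their embeddings are very different:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("cat", "car"))    # 1 -> very similar strings
print(levenshtein("cat", "tiger"))  # 5 -> very different strings,
                                    #      yet similar as embeddings
```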
Doing math with words (Example for an ideal embedding)
Check this app: https://rare-technologies.com/word2vec-tutorial/
King - Man + Woman = Queen
Paris - France + Portugal = Lisbon
Eat + Past = Ate
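With gensim (the library behind the tutorial linked above), these analogy queries look like the following; a hedged sketch, assuming you have pre-trained word2vec-format vectors on disk (the file name is a placeholder):

```python
from gensim.models import KeyedVectors

# Load any pre-trained vectors in word2vec format (placeholder path).
wv = KeyedVectors.load_word2vec_format("pretrained-vectors.bin", binary=True)

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# paris - france + portugal ≈ lisbon
print(wv.most_similar(positive=["paris", "portugal"], negative=["france"], topn=1))
```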
Relationships between entities
Get similar concepts
Embeddings are far from perfect!
[Figures: examples from https://rare-technologies.com/word2vec-tutorial/]
Warning!
● Word embeddings are corpus dependent!
If you train your embeddings on a news dataset, don’t expect them to work well on Wikipedia!
How are word embeddings computed?
“You shall know a word by the company it keeps” [Firth, 1957]
“Words that occur in similar contexts tend to have similar meanings” [Harris, 1954]
Predicting a word from its context (CBOW):
yesterday was a really [...] day → Strong candidates: ‘nice’, ‘beautiful’
Less probable: ‘delightful’
Predicting the context from a word (Skip-gram):
Is it probable that “yesterday was a really [...] day” is a suitable context for ‘delightful’?
Word Embeddings Approaches
Skip-gram: works well with a small amount of training data, and represents even rare words and phrases well.
CBOW: several times faster to train than skip-gram, with slightly better accuracy for frequent words.
Example from StackOverflow
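In gensim, both architectures are available through the same class; the `sg` flag switches between them. A minimal sketch with a placeholder toy corpus (gensim 4.x parameter names):

```python
from gensim.models import Word2Vec

sentences = [
    ["yesterday", "was", "a", "really", "nice", "day"],
    ["yesterday", "was", "a", "really", "beautiful", "day"],
    ["open", "source", "is", "great"],
]

# sg=1 -> skip-gram; sg=0 -> CBOW (the default).
skipgram = Word2Vec(sentences, sg=1, vector_size=100, window=5, min_count=1)
cbow     = Word2Vec(sentences, sg=0, vector_size=100, window=5, min_count=1)

print(skipgram.wv.most_similar("nice", topn=2))
```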
Word Embeddings Implementations
● Examples:
− Word2Vec
− GloVe
− FastText
FastText
● Some characteristics:
− Works stand-alone (in bash) or as a Python package
− Supports both supervised and unsupervised tasks
− Uses subword information (character level)
● FastText(‘Dog’) ≈ FastText(‘Dogs’)
− Pre-trained embeddings for Wikipedia available in multiple languages
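A hedged sketch with the fasttext Python package (‘data.txt’ is a placeholder plain-text training file): because FastText composes word vectors from character n-grams, it can produce a vector even for a word it never saw during training.

```python
import fasttext

# Train unsupervised embeddings on a plain-text file (placeholder path).
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# 'dog' and 'dogs' share most of their character n-grams, so their
# vectors come out close -- even if 'dogs' is out-of-vocabulary.
vec_dog  = model.get_word_vector("dog")
vec_dogs = model.get_word_vector("dogs")
```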
Word Embeddings Size
● In practice we usually work with vectors of 150 to 300 dimensions:
− Example:
● ‘Word’ → [x0, x1, …, x299]
● Where -1 ≤ xn ≤ 1, and ‘Word’ can be any sequence of characters
Recap
● Word Embeddings:
− transform strings into vectors
− words with similar meanings will have similar vectors
− embeddings are computed using words’ context
Cross-lingual word embeddings
We use multilingual word embeddings to compare text across different languages
Working with multilingual embeddings
● Problem:
− Vector values don’t have a meaning per se.
− Values will change depending on the corpus.
− Therefore, training in different languages (on different corpora) will result in different embedding values.
Cross-lingual Embeddings
● Solutions:
− Force the embeddings to have specific values, using some anchors (like Facebook LASER).
− Learn a transformation using a bilingual dictionary:
■ Knowing a (few) points that match in the two vector spaces, you can rotate (more precisely, apply a linear transformation to) one of the vector spaces to align it with the other, as in the sketch below.
Knowing that:
Publicaciones → Publications
Discografía → Discography
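A hedged sketch of that alignment step: given a few anchor pairs like the headings above, one classic formulation is the orthogonal Procrustes problem, solved in closed form with an SVD. The matrices below are random placeholders; in practice row i of X and Y would hold the embeddings of the i-th dictionary pair.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 300))   # Spanish vectors: Publicaciones, Discografía, ...
Y = rng.normal(size=(5, 300))   # English vectors: Publications, Discography, ...

# Orthogonal Procrustes: W = U V^T, where U S V^T = SVD(Y^T X).
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt

aligned = X @ W.T   # Spanish vectors mapped into the English space
```

Constraining W to be orthogonal means the map is a pure rotation/reflection, so distances within the rotated space are preserved.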
Warning!
● Cross-lingual embeddings will learn analogies, not identities!
You shouldn’t use cross-lingual embeddings for direct translation!
Given a word (or sentence) W in language X and a (small) set of candidates in language Y, you want to measure which of the candidates is the most similar to W.
When will cross-lingual embeddings work?
Distance(‘Buenos días’, ‘Good morning’) < Distance(‘Buenos días’, ‘Thank you’)
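A minimal sketch of that use case, with made-up vectors standing in for real cross-lingual sentence embeddings (e.g., from LASER):

```python
import numpy as np

# Toy aligned vectors (made up for illustration); in practice these come
# from cross-lingual sentence embeddings in a shared space.
ES = {"Buenos días": np.array([0.9, 0.1])}
EN = {"Good morning": np.array([0.85, 0.15]),
      "Thank you":    np.array([0.10, 0.90])}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = ES["Buenos días"]
best = max(EN, key=lambda c: cosine(query, EN[c]))
print(best)   # "Good morning" -- the closer candidate in the shared space
```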
Examples
Section heading alignment across languages
● Problem:
− Given the most popular section headings in two languages, create an alignment between them.
● Solution:
− Cross-lingual embeddings
− plus other features
Section Alignment API
● Input language: es
● Output language: en
● Section: Historia
● API call:
○ https://secrec.wmflabs.org/API/alignment/es/en/Historia
API's Documentation
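A hedged sketch of calling the API from Python; the JSON response format is an assumption, so check the API documentation for the actual payload:

```python
import requests

url = "https://secrec.wmflabs.org/API/alignment/es/en/Historia"
resp = requests.get(url)
resp.raise_for_status()
print(resp.json())  # assumed JSON payload with the aligned heading(s)
```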
Section Alignment API
● Input language: en
● Output language: ru
● Section: History
● API call:
○ https://secrec.wmflabs.org/API/alignment/en/ru/History
API's Documentation
Section Alignment API
● Input language: es
● Output language: ja
● Section: Historia
● API call:
○ https://secrec.wmflabs.org/API/alignment/es/ja/Historia
API's Documentation
Current languages supported: ar, fr, en, es, ja, ru
Named Parameter Template Alignment
● Problem:
− The CX translation tool needs to translate templates, but automatic translation engines are not designed for translating templates.
− There are different names and numbers of parameters in each language.
● (Not perfect) Solution:
− Use cross-lingual word embeddings
− Languages covered:
■ es, en, fr, ar, ru, uk, pt, vi, zh, he, it, ta, id, fa, ca
Check T221211
Template:Infobox publisher->Plantilla:Ficha de editorial
Warning!
Cross-lingual embeddings won’t be as good as bilingual humans at creating alignments!
But:
● They can do the work really fast.
● You can use them for all Wikipedia languages.
● Even for unusual language pairs.
Takeaways
● Word embeddings allow machines to understand similarities among words.
● Cross-lingual embeddings allow machines to compare words across different languages.
● For Wikipedia, use word embeddings trained on Wikipedia!
Do you want more?
● Do you want to know more about other possible usages of embeddings on Wikipedia?
− Check out our white paper about topic embeddings.
● Clone this repository and start working with multilingual embeddings in Python:
− https://github.com/digitalTranshumant/TutorialCrossLingualEmbeddings
Thanks!