
  • Multilingual NLP via Cross-Lingual Word Embeddings

    Ivan Vulić

    LTL, University of Cambridge & PolyAI

    FoTran Workshop

    Helsinki; September 28, 2018

  • Joint Work with...

    ...plus efforts of many other researchers...

  • A cold open...

  • Embeddings are Everywhere (Monolingual)

    Intrinsic (semantic) tasks:

    Semantic similarity and synonym detection, word analogies
    Concept categorization
    Selectional preferences, language grounding

    Extrinsic (downstream) tasks:

    Named entity recognition, part-of-speech tagging, chunking
    Semantic role labeling, dependency parsing
    Sentiment analysis, discourse parsing

    Beyond NLP:

    Information retrieval, Web search
    Question answering, dialogue systems

  • Embeddings are Everywhere (Cross-lingual)

    Intrinsic (semantic) tasks:

    Cross-lingual and multilingual semantic similarity

    Bilingual lexicon learning, multi-modal representations

    Extrinsic (downstream) tasks:

    Cross-lingual SRL, POS tagging, NER
    Cross-lingual dependency parsing and sentiment analysis
    Cross-lingual annotation and model transfer

    Statistical/Neural machine translation

    Beyond NLP:

    (Cross-Lingual) Information retrieval
    (Cross-Lingual) Question answering

  • A slow intro...

  • Motivation

    We want to understand and model the meaning of...

    Source: dreamstime.com

    ...without manual/human input and without perfect MT


  • Motivation

    The NLP community has developed useful features for several tasks, but finding
    features that are...

    1. task-invariant (POS tagging, SRL, NER, parsing, ...)

    (monolingual word embeddings)

    2. language-invariant (English, Dutch, Chinese, Spanish, ...)

    (cross-lingual word embeddings → this talk)

    ...is non-trivial and time-consuming (20+ years of feature engineering...)

    Learn word-level features which generalise across tasks and languages


  • The World Existed B.E. (Before Embeddings)

    B.E. Example 1: Cross-lingual (Brown) clusters

    [Täckström et al., NAACL 2012; Faruqui and Dyer, ACL 2013]

  • The World Existed B.E.

    B.E. Example 2a: Traditional "count-based" cross-lingual spaces...

    [Gaussier et al., ACL 2004; Laroche and Langlais, COLING 2010]

  • The World Existed B.E.

    B.E. Example 2b: ...and bootstrapping from a limited bilingual signal

    [Peirsman and Padó, NAACL 2010; Vulić and Moens, EMNLP 2013]

  • The World Existed B.E.

    B.E. Example 3: Cross-lingual topical spaces

    [Mimno et al., EMNLP 2009; Vulić et al., ACL 2011]

  • Motivation Revisited

    Cross-lingual word embeddings:

    - simple: quick and efficient to train
    - state-of-the-art and omnipresent
    - lightweight and cheap

    "Cross-lingual word embedding models are to MT what
    (monolingual) word embedding models are to language modeling."

    by Sebastian Ruder

    Are they?

    They support cross-lingual NLP: enabling cross-lingual modeling
    and cross-lingual transfer

  • Learning Word Representations

    Key idea

    Distributional hypothesis → words with similar meanings are likely
    to appear in similar contexts

    [Harris, Word 1954]

    Wittgenstein shouts:

    "Meaning as use!"

    Firth calmly states:

    "You shall know a word by the company it keeps."

  • Word Embeddings

    Representation of each word w ∈ V :

    vec(w) = [f1, f2, . . . , fdim]

    Word representations in the same semantic (or embedding) space!

    Image courtesy of [Gouws et al., ICML 2015]


  • Cross-Lingual Word Embeddings (CLWEs)

    Representation of a word w_1^S ∈ V^S:

    vec(w_1^S) = [f_1^1, f_2^1, ..., f_dim^1]

    Exactly the same representation for w_2^T ∈ V^T:

    vec(w_2^T) = [f_1^2, f_2^2, ..., f_dim^2]

    Language-independent word representations in the same shared
    semantic (or embedding) space!

  • Cross-Lingual Word Embeddings

    [Luong et al., NAACL 2015]

  • Cross-Lingual Word Embeddings

    Monolingual vs. Cross-lingual

    Q1 → Algorithm Design: How to align semantic spaces in two
    different languages?

    Q2 → Data Requirements: Which bilingual signals are used for
    the alignment?

  • Data Requirements: Bilingual Signals

    Level      Parallel       Comparable
    Word       Dictionaries   Images
    Sentence   Translations   Captions
    Document   -              Wikipedia

    Nature and alignment level of data sources required by CLWE models

    Data requirements are more important for the final CLWE model
    performance than the actual underlying architecture
    [Levy et al., EACL 2017]

  • Data Requirements: Bilingual Signals

    [Upadhyay et al., ACL 2016]

    Illustration of di�erent data requirements and bilingual signals

    1. Word-level: word alignments and translation dictionaries

    2. Sentence-level: sentence alignments

    3. Document-level: Wikipedia/news articles

    4. Other: visual alignment (image captions), eye-tracking data

    5. Without any bilingual signal?

  • Welcome to the CLWE Jungle I

    Word-level alignment:

    Parallel (dictionaries): Mikolov et al. (2013), Faruqui & Dyer (2014),
    Xing et al. (2015), Dinu et al. (2015), Lazaridou et al. (2015),
    Zhang et al. (2016), Zou et al. (2013), Ammar et al. (2016),
    Artetxe et al. (2016), Xiao & Guo (2014), Gouws and Søgaard (2015),
    Duong et al. (2016), Gardner et al. (2015), Smith et al. (2017),
    Mrkšić et al. (2017), Artetxe et al. (2017)

    Comparable: Kiela et al. (2015), Vulić et al. (2016), Vulić et al. (2017),
    Zhang et al. (2017), Hauer et al. (2017)

  • Welcome to the CLWE Jungle II

    Sentence alignments:

    Parallel: Guo et al. (2015), Vyas and Carpuat (2016),
    Hermann and Blunsom (2013), Lauly et al. (2013), Kočiský et al. (2014),
    Chandar et al. (2014), Pham et al. (2015), Klementiev et al. (2012),
    Soyer et al. (2015), Luong et al. (2015), Gouws et al. (2015),
    Coulmance et al. (2015), Shi et al. (2015), Rajendran et al. (2016),
    Levy et al. (2017)

    Comparable: Calixto et al. (2017), Gella et al. (2017)

    Document alignments:

    Parallel: Hermann and Blunsom (2014)

    Comparable: Vulić & Korhonen (2016), Vulić and Moens (2016),
    Søgaard et al. (2015), Mogadala and Rettinger (2016)

  • General (Simpli�ed) Methodology

    (Bilingual) data sources are more important than the chosen algorithm

    Most algorithms are formulated as: J = L1(W) + L2(V) + Ω(W, V)

  • Learning from Word-Level Information

    1. Mapping-based or offline approaches: train monolingual word
    representations independently and then learn a transformation matrix
    from one language to the other

    2. Pseudo-bilingual approaches: train a monolingual WE model on
    automatically constructed corpora containing words from both languages

    3. Joint online approaches: take parallel text as input and minimize the
    source and target language losses jointly with the regularization term

    These approaches are equivalent, modulo optimization strategies

  • Learning from Word-Level Information: Example 1

    The geometric relations that hold between words are similar across languages:
    e.g., numbers and animals in English show a similar geometric constellation as
    their Spanish counterparts

    [Mikolov et al., arXiv 2013]

  • Learning from Word-Level Information: Example 1

    Basic mapping approach

    [Mikolov et al., arXiv 2013]

    Learn to transform the pre-trained source language embeddings into a space
    where the distance between a word and its translation pair is minimised

  • Learning from Word-Level Information: Example 1

    Basic mapping approach

    Based on given word translation pairs (x_i^s, x_i^t), i = 1, ..., N

    Sum of squared Euclidean distances between the transformed source
    vector and the target vector:

    Ω_MSE = Σ_{i=1}^{N} ||W x_i^s − x_i^t||^2

    Ω_MSE = ||W X^s − X^t||_F^2

    Standard (re)formulation:

    J = L1_SGNS + L2_SGNS + Ω_MSE
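    The mapping step above has a closed-form solution. A minimal NumPy sketch (function names are mine, not from the talk or the original implementation): stack the N translation pairs row-wise and solve the least-squares problem for W.

```python
import numpy as np

def learn_mapping(X_src, X_tgt):
    """Learn W minimising ||X_src @ W.T - X_tgt||_F^2 (the MSE objective).

    X_src, X_tgt: (N, dim) arrays; row i holds vec(x_i^s) and vec(x_i^t)."""
    # Closed-form least-squares solution: solve X_src @ A = X_tgt, set W = A.T
    A, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return A.T  # so that W @ x_s approximates x_t

def translate(src_vec, W, X_tgt_vocab):
    """Map a source vector and return the index of its nearest target word
    by cosine similarity (nearest-neighbour retrieval in the shared space)."""
    mapped = W @ src_vec
    sims = X_tgt_vocab @ mapped / (
        np.linalg.norm(X_tgt_vocab, axis=1) * np.linalg.norm(mapped) + 1e-9)
    return int(np.argmax(sims))
```

    With a seed dictionary of a few thousand pairs, this is essentially the entire training procedure of the basic mapping approach.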

  • Bootstrapping from Small Seed Dictionaries

    The latest instance of the bootstrapping idea
    [Artetxe et al., ACL 2017]

    Self-learning iterative framework: starting from cognates or numerals
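    The self-learning loop alternates between fitting a mapping on the current dictionary and re-inducing a larger dictionary with that mapping. A schematic sketch in the spirit of Artetxe et al. (names and the orthogonal-Procrustes fit are my assumptions, not the released code); embeddings are assumed row-normalised.

```python
import numpy as np

def self_learning(X_src, X_tgt, seed_pairs, n_iters=5):
    """Iterative self-learning: map, re-induce dictionary, repeat.

    X_src, X_tgt: (V, dim) row-normalised embedding matrices.
    seed_pairs: list of (src_index, tgt_index) seed translations."""
    pairs = list(seed_pairs)
    for _ in range(n_iters):
        # 1) Fit an orthogonal map on the current dictionary (Procrustes).
        S = X_src[[i for i, _ in pairs]]
        T = X_tgt[[j for _, j in pairs]]
        U, _, Vt = np.linalg.svd(T.T @ S)
        W = U @ Vt  # mapped source space: X_src @ W.T
        # 2) Re-induce the dictionary: the nearest target neighbour of every
        #    mapped source word becomes its new translation.
        sims = (X_src @ W.T) @ X_tgt.T
        pairs = [(i, int(j)) for i, j in enumerate(sims.argmax(axis=1))]
    return W, pairs
```

    The point of the slide is that this loop converges to a good mapping even when `seed_pairs` contains only a handful of numerals or cognates.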

  • Bootstrapping from Small Seed Dictionaries

    Lexicon induction results from [Artetxe et al., ACL 2017]

    Bootstrapping helps with really small seed dictionaries

    A performance plateau with simple mapping approaches has been reached?

  • Learning without any Bilingual Signal?

    The first instance of the "fully unsupervised learning" idea
    [Conneau et al., ICLR 2018; Artetxe et al., ACL 2018]

    Adversarial training: playing the two-way game between a generator (mapping)and a discriminator

    Alignment at the level of distributions

    Supporting unsupervised NMT [Lample et al., ICLR 2018] and unsupervised
    CLIR [Litschko et al., SIGIR 2018]

  • Learning without any Bilingual Signal?

    But is it really that great?

    [Søgaard, Ruder, and Vulić; ACL 2018]

  • Mapping Approach: Bilingual to Multilingual

    Fixing one space (English) and learning transformations for all monolingual
    word embeddings in other languages

    This constructs a multilingual space from input monolingual spaces

    [Smith et al., ICLR 2017]:

    A unified multilingual space for 89 languages using fastText vectors and 5k
    Google Translate pairs for each language pair

  • Learning from Word-Level Information: Example 2

    Pseudo-bilingual training

    [Gouws and Søgaard, NAACL 2015]

    Barista: Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives

    1. Build a pseudo-bilingual corpus (or corrupt the original training data) using
    a set of predefined equivalences (e.g., POS tags, word translation pairs)

    2. Train a monolingual WE model on the corrupted/shuffled corpus

    Original:     we build the house

    POS:          we build la voiture / they run la house

    Translations: we construire the maison / nous build la house

  • Learning from Word-Level Information: Example 2

    [Gouws and Søgaard, NAACL 2015]

    Barista: Bilingual Adaptive Reshuffling with Individual Stochastic Alternatives

    1 Input: Source corpus Cs, target corpus Ct, dictionary D

    2 Concatenate Cs and Ct and then shuffle the concatenated corpus

    3 Pass over the corpus; for each word w: if {w' | (w, w') ∈ D ∨ (w', w) ∈ D} is
    non-empty and of cardinality k, replace w with one of the w' with probability
    1/(2k) each; keep w otherwise

    4 Train a standard WE model on the pseudo-bilingual corpus C'
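    The corpus-corruption pass above fits in a few lines. A hypothetical sketch of step 3 (not the authors' code; for simplicity the dictionary is pre-expanded into a word → equivalents map covering both directions):

```python
import random

def barista_corpus(src_corpus, tgt_corpus, translations, seed=0):
    """Build a pseudo-bilingual corpus via Barista-style reshuffling.

    src_corpus, tgt_corpus: lists of token lists (sentences).
    translations: dict mapping a word to the list of its equivalents."""
    rng = random.Random(seed)
    corpus = src_corpus + tgt_corpus
    rng.shuffle(corpus)  # shuffle the sentences of the concatenated corpus
    out = []
    for sentence in corpus:
        new_sent = []
        for w in sentence:
            equivs = translations.get(w, [])
            # Keep w with probability 1/2; otherwise replace it with one of
            # its k equivalents uniformly, i.e. each with probability 1/(2k).
            if equivs and rng.random() < 0.5:
                new_sent.append(rng.choice(equivs))
            else:
                new_sent.append(w)
        out.append(new_sent)
    return out
```

    A standard monolingual word2vec/SGNS model trained on the returned corpus then sees words of both languages in shared contexts, which is what aligns the space.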

  • Learning from Word-Level Information: Example 2

    The chosen equivalence class dictates the shape of the output space

    Cross-lingual embedding space with POS equivalences

  • Learning from Word-Level Information: Example 3

    Joint (Online) Training

    Bilingual Skip-Gram (BiSkip) proposed by Luong et al., NAACL 2015

    Very similar to the original mapping approach, but formulated as online
    monolingual and cross-lingual training on pseudo-bilingual corpora

    J = L1_SGNS + L2_SGNS + Ω_{L1,L2}

    Ω_{L1,L2} = SGNS_{L1→L2} + SGNS_{L2→L1}

  • Learning from Word-Level Information: Example 3

    Multilingual Skip-Gram (MultiSkip) by Ammar et al., NAACL 2016

    Simple extension (again) → summing up bilingual objectives for all
    available parallel corpora

    J = L1_SGNS + L2_SGNS + L3_SGNS + Ω_{L1,L2} + Ω_{L2,L3} + Ω_{L1,L3}

  • Learning from Word-Level Information: Example 4

    Post-processing/specialisation

    State-of-the-art: Paragram and Attract-Repel
    [Wieting et al., TACL 2015; Mrkšić et al., TACL 2017]

    If S is the set of synonymous word pairs, the procedure iterates over
    mini-batches of such constraints B_S, optimising the following cost function:

    S(B_S) = Σ_{(x_l, x_r) ∈ B_S} [ ReLU(δ_sim + x_l·t_l − x_l·x_r)
                                  + ReLU(δ_sim + x_r·t_r − x_l·x_r) ]

    where δ_sim is the similarity margin and t_l and t_r are negative examples for
    the given word pair (x_l, x_r).

  • CLWEs via Semantic Specialisation

    Cross-lingual supervision: again word-level "linguistic constraints"
    (e.g., synonyms)

    Synonym (SYN) constraints, all from BabelNet:
    (en_sweet, it_dolce), (en_work, fr_travail), (en_school, de_schule),
    (fr_montagne, de_gebirge), (sh_gradonačelnik, en_mayor), (nl_vrouw, it_donna)

    Antonym (ANT) constraints, all from BabelNet:
    (en_sour, it_dolce), (en_asleep, fr_éveillé), (en_cheap, de_teuer),
    (de_langsam, es_rápido), (sh_obeshrabriti, en_encourage), (fr_jour, nl_nacht)

  • CLWEs via Semantic Specialisation

    Negative Examples for each Synonymy Pair

    For each synonymy pair (xl,xr), the negative example pair (tl, tr)is chosen from the remaining in-batch vectors so that tl is the oneclosest (cosine similarity) to xl and tr is closest to xr.

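    The mini-batch cost and the in-batch negative selection can be sketched together. A hypothetical NumPy rendering of the equations above (names are mine, and negatives are drawn from the same-side batch vectors), not the released Attract-Repel code:

```python
import numpy as np

def attract_cost(X_l, X_r, delta_sim=0.6):
    """Attract-Repel style cost over one mini-batch of synonym pairs.

    X_l, X_r: (B, dim) row-normalised vectors; row i is the pair (x_l, x_r).
    Negative examples t_l, t_r are the closest *other* in-batch vectors."""
    def closest_other(X):
        sims = X @ X.T
        np.fill_diagonal(sims, -np.inf)  # a vector cannot be its own negative
        return X[sims.argmax(axis=1)]

    t_l, t_r = closest_other(X_l), closest_other(X_r)
    pos = np.sum(X_l * X_r, axis=1)    # x_l · x_r
    neg_l = np.sum(X_l * t_l, axis=1)  # x_l · t_l
    neg_r = np.sum(X_r * t_r, axis=1)  # x_r · t_r
    relu = lambda z: np.maximum(z, 0.0)
    return float(np.sum(relu(delta_sim + neg_l - pos)
                        + relu(delta_sim + neg_r - pos)))
```

    The cost is zero exactly when every synonym pair is already δ_sim closer to each other than to the hardest in-batch negative, which is the geometric goal of the specialisation.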

  • Learning from Aligned Sentences: Example 1

    Trans-gram model: very similar to BiSkip, but it relies on full sentential
    contexts

    [Coulmance et al., EMNLP 2015]

    J = L1_SGNS + L2_SGNS + Ω_{L1,L2}

    Ω_{L1,L2} = SGNS_{L1→L2} + SGNS_{L2→L1}

    Déjà vu, right?

  • Learning from Aligned Sentences: Example 2

    BilBOWA model: very similar to online models, but using sentence-level
    cross-lingual information

    [Gouws et al., ICML 2015]

    Déjà vu, right?

    J = L1_SGNS + L2_SGNS + Ω_BilBOWA

    Ω_BilBOWA = || (1/m) Σ_{w_i^s ∈ sent^s} x_i^s − (1/n) Σ_{w_j^t ∈ sent^t} x_j^t ||^2
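    The BilBOWA regulariser is just the squared distance between the mean word vectors of an aligned sentence pair. A minimal sketch of that term alone (my naming), not the full training objective:

```python
import numpy as np

def bilbowa_reg(sent_src, sent_tgt):
    """Ω_BilBOWA for one aligned sentence pair.

    sent_src: (m, dim) word vectors of the source sentence.
    sent_tgt: (n, dim) word vectors of the target sentence.
    Returns the squared L2 distance between the two sentence means."""
    diff = sent_src.mean(axis=0) - sent_tgt.mean(axis=0)
    return float(diff @ diff)
```

    Minimising this term during joint SGNS training pulls translationally equivalent sentences (and hence their words) toward each other without needing word alignments.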

  • Learning from Aligned Documents: Example 1

    Again, pseudo-bilingual training


  • Learning from Aligned Documents: Example 2

    Inverted indexing

    [Søgaard et al., ACL 2015]; slide courtesy of Anders Søgaard

  • Summary

    Slide courtesy of Anders Søgaard

  • Evaluation

    Quality of Embeddings

    Intrinsic: How good are they at representing word meaning?

    Extrinsic: How good are they as features in downstream tasks?

  • Evaluation

    Intrinsic >> Extrinsic?

  • Similarity-Based vs Feature-Based?

    Parallel to the division into intrinsic and extrinsic tasks, there is another
    perspective:

    Similarity-based use of embeddings: applications that rely on (constrained)nearest neighbour search in the cross-lingual word embedding graph

    Feature-based use of embeddings: use cross-lingual embeddings directly asfeatures: information retrieval, cross-lingual transfer?

  • Cross-Lingual Lexicon Induction and Cross-Lingual Word Similarity

    Semantic similarity = measuring distance in the induced space

    Bilingual lexicons: (en_morning, it_mattina), (en_carpet, pt_tapete), ...

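    Lexicon induction from a shared space reduces to nearest-neighbour search. A small cosine-similarity sketch over hypothetical toy vectors (real systems add refinements such as CSLS to counter hubness):

```python
import numpy as np

def induce_lexicon(src_words, X_src, tgt_words, X_tgt):
    """Pair each source word with its nearest target word by cosine
    similarity in the shared cross-lingual embedding space."""
    # Row-normalise so that the dot product equals cosine similarity.
    Xs = X_src / np.linalg.norm(X_src, axis=1, keepdims=True)
    Xt = X_tgt / np.linalg.norm(X_tgt, axis=1, keepdims=True)
    nn = (Xs @ Xt.T).argmax(axis=1)
    return {src_words[i]: tgt_words[j] for i, j in enumerate(nn)}
```

    Cross-lingual word similarity works the same way: instead of taking the argmax, the cosine value itself is compared against human similarity judgements.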

  • Cross-Lingual Lexicon Induction

    Similarity-based vs. classification-based lexicon induction

    Word embeddings used as features for classification

    [Irvine and Callison-Burch, CompLing 2017; Heyman et al., EACL 2017]

  • Extrinsic Evaluation: Cross-lingual Document Classification

    Quite popular in the CLWE literature

    Cross-lingual knowledge transfer?

    [Klementiev et al., COLING 2012; Heyman et al., DAMI 2015]

  • Cross-Lingual Document Classification

  • Extrinsic Evaluation: Cross-lingual Document Classification

    Using word-level information composed into a document-level representation

    [Hermann and Blunsom, ACL 2014]

    1. Is this really an optimal way to approach this task?
    2. Is this really an optimal way to evaluate word embeddings?

  • Cross-Lingual Dependency Parsing


  • Other Applications: Cross-lingual Task X, Cross-lingual Task Y, ...

    Cross-lingual supersense tagging

    [Gouws and Søgaard, NAACL 2015]

    Word alignment

    [Levy et al., EACL 2017]

    POS tagging

    [Søgaard et al., ACL 2015; Zhang et al., NAACL 2016]

    Entity linking: wikification

    [Tsai and Roth, NAACL 2016]

    Relation prediction

    [Glavaš and Vulić, NAACL 2018]

  • Other Applications: Cross-lingual Task X, Cross-lingual Task Y, ...

    Cross-lingual frame-semantic parsing and discourse parsing

    [Johannsen et al., EMNLP 2015; Braud et al., EACL 2017]

    Cross-lingual lexical entailment

    [Vyas and Carpuat, NAACL 2016; Upadhyay et al., NAACL 2018]

    Semantic role labeling

    [Kozhevnikov and Titov, ACL 2013; Akbik et al., ACL 2016]

    Sentiment analysis

    [Zhou et al., ACL 2016; Abdalla and Hirst, arXiv 2017]


  • Extrinsic Evaluation and Application: Cross-lingual Information Retrieval

    Semantic representations for cross-lingual IR

    [Vuli¢ and Moens, SIGIR 2015; Mitra and Craswell, arXiv 2017]

  • Extrinsic Evaluation and Application: Cross-lingual Information Retrieval

    Document and Query Embeddings

    Composition of word embeddings:
    [Mitchell and Lapata, ACL 2008]

    d = w_1 + w_2 + ... + w_{|N_d|}

    The dim-dimensional document embedding in the same
    cross-lingual word embedding space:

    d = [f_{d,1}, ..., f_{d,k}, ..., f_{d,dim}]

  • Extrinsic Evaluation and Application: Cross-lingual Information Retrieval

    Document and Query Embeddings

    → The same principles apply to queries:

    Q = q_1 + q_2 + ... + q_m

    The dim-dimensional query embedding in the same bilingual
    word embedding space:

    Q = [f_{Q,1}, ..., f_{Q,k}, ..., f_{Q,dim}]
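    Putting the two compositions together gives a complete (toy) CLIR ranker: embed the query and every document by additive composition in the shared space, then rank by cosine similarity. A minimal sketch with hypothetical toy vectors, not a production system:

```python
import numpy as np

def embed_text(tokens, vectors):
    """Additive composition: sum the (cross-lingual) word vectors of a text.
    vectors: dict word -> np.ndarray; out-of-vocabulary words are skipped."""
    return np.sum([vectors[t] for t in tokens if t in vectors], axis=0)

def rank_documents(query_tokens, docs_tokens, vectors):
    """Rank documents by cosine similarity between the composed query
    embedding and each composed document embedding."""
    q = embed_text(query_tokens, vectors)
    scores = []
    for d_tokens in docs_tokens:
        d = embed_text(d_tokens, vectors)
        scores.append(float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9)))
    return sorted(range(len(docs_tokens)), key=lambda i: -scores[i])
```

    Because the word vectors live in one shared space, the query and the documents may be in different languages with no translation step.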

  • Extrinsic Evaluation and Application: Language Understanding (Multilingual Dialog State Tracking)

  • Extrinsic Evaluation and Application: Language Understanding (Multilingual Dialog State Tracking)

    The Neural Belief Tracker is a novel DST model/framework which
    aims to satisfy the following design goals:

    1 End-to-end learnable (no SLU modules or semantic dictionaries).

    2 Generalisation to unseen slot values.

    3 Capability of leveraging the semantic content of pre-trained word
    vector spaces without human supervision.

    [Mrkšić et al., ACL 2017; Mrkšić and Vulić, ACL 2018]

  • Extrinsic Evaluation and Application: Language Understanding (Multilingual Dialog State Tracking)

    Representation Learning + Label Embedding + Separate Binary Decisions

    To overcome data sparsity, NBT models use label embedding to
    decompose multi-class classification into many binary ones.

  • Extrinsic Evaluation and Application: Language Understanding (Multilingual Dialog State Tracking)

    DST in Italian and German

    Multilingual WOZ 2.0 (Mrkšić et al., 2017): the 1,200 dialogues in WOZ 2.0
    were translated by native Italian and German speakers instructed to consider
    the preceding dialogue context.

    Word Vector Space                     EN    IT    DE
    Best Baseline Vector Space            81.6  71.8  50.5
    Monolingual Distributional Vectors    77.6  71.2  46.6
    + Monolingual Specialisation          80.9  72.7  52.4
    ++ Cross-Lingual Specialisation       80.3  75.3  55.7
    +++ Multilingual DST Model            82.8  77.1  57.7

  • Extrinsic Evaluation and Application: Language Understanding (Multilingual Dialog State Tracking)

    DST Models for Resource-Poor Languages

    Ontology Grounding: Multilingual DST Models

    The domain ontology (i.e., the concepts it expresses) is language-agnostic,
    which means that `labels' persist across languages. Using training data for two
    (or more) languages, and cross-lingual vectors of high quality, we train the
    first-ever multilingual DST model.

  • Take-Home Messages

    1. Cross-Lingual Word Embedding Models are More Similar to
    Old(er) NLP Ideas than It Seems

    But they are simple, efficient, and state-of-the-art...

    2. CLWE Models are More Similar to Each Other than It Seems

    We have addressed similarities between a plethora of cross-lingual word
    embedding models

    3. Data is Crucial (More than the Chosen Algorithm)

    Optimization and various modeling tricks matter, but the chosen multilingual
    signal is the most important factor

    4. CL Word Embeddings Support Cross-Lingual NLP

    They are useful across a wide variety of cross-lingual NLP tasks, for
    cross-lingual (re-lexicalized) transfer, language understanding, IR, ...



  • Cross-Lingual Space of Thankyous

    This talk was based on a recent tutorial:

    http://people.ds.cam.ac.uk/iv250/tutorial/xlingrep-tutorial.pdf

    Book under review: Morgan & Claypool Synthesis Lectures,
    with Anders Søgaard, Sebastian Ruder, and Manaal Faruqui