Neural Machine Translation and
Universal Multilingual Representations

Holger Schwenk

August 28th, NMT Marathon, Lisbon
Plan of the Talk
• Introduction to deep neural networks for NLP (in 5 min)
• From Sequence-to-Sequence processing to multilingual sentence representations
• Detailed results
• Large scale demo
• Facebook AI Research
• Conclusion and perspectives
How to Build Intelligent Machines ?
• The brain is an existence proof of intelligence
  • The way birds fly was an existence proof of heavier-than-air flight
• Should we copy the brain ?
  • Like mankind tried to copy birds to build flying machines
  • The answer is no, but we should draw inspiration from it
• Design principles of artificial neural networks
  • Many simple units are combined to perform a complex task
  • The system learns from examples
  • It is able to generalize to unseen events
How to Build Intelligent Machines ?
[Figure: human neuron vs. artificial perceptron]
• The perceptron: a simple computational unit
  • Computes a weighted sum of its inputs
  • Output: +1 if the cumulated input exceeds a threshold, -1 otherwise
  • The weights are obtained by a learning algorithm:
    - iterate through examples with input and desired output
    - adjust the weights to minimize the observed error
⇒ Vaguely inspired by a biological neuron
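A minimal sketch of the perceptron learning rule (illustrative toy code, not from the slides; the AND example and learning rate are arbitrary choices):

```python
def perceptron_train(examples, epochs=10, lr=0.1):
    """examples: list of (inputs, target) with target in {+1, -1}."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in examples:
            # weighted sum of inputs plus bias, thresholded at 0
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if y != t:  # adjust weights to reduce the observed error
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

# Toy usage: learn the logical AND function (+1 only when both inputs are 1)
w, b = perceptron_train([([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)])
```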
How to Build Intelligent Machines ?
• Characteristics of the human brain
  • 10^11 neurons
  • 10^4 synapses per neuron
  • 10^16 “operations” per second
• Learning and processing:
  • No global supervision
  • Modification of the synapses
• Representation of knowledge:
  • Distributed
  • Hierarchical
• Distributed processing:
  • Highly parallel
NeuralMachine
Translationand UniversalMultilingualRepresenta-
tions
HolgerSchwenk
Plan
Introduction
Vision
NLP
Embeddings
RNN
Seq2Seq
Architecture
Evaluation
Applications
Demo
others
FAIR
Conclusion
How to Build Intelligent Machines ?
• The multi-layer perceptron
  • combines many perceptrons in several layers
  • the internal representations can be learned by back-propagation of the error observed at the output
• The weights are obtained by an iterative learning algorithm:
  - compare the current output with the desired output
  - adjust the weights to minimize the observed error
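A minimal back-propagation sketch for a one-hidden-layer MLP (illustrative only; the toy task, layer sizes and learning rate are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                  # 64 training examples, 4 inputs
T = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy desired outputs

W1 = rng.normal(scale=0.5, size=(4, 8))       # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))       # hidden -> output weights
lr = 0.1
for _ in range(500):
    H = np.tanh(X @ W1)                       # hidden layer: learned internal representation
    Y = 1 / (1 + np.exp(-(H @ W2)))           # current network output
    # compare with desired output and back-propagate the error to both layers
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * (1 - H**2)
    W2 -= lr * H.T @ dY                       # adjust weights to reduce the error
    W1 -= lr * X.T @ dH
```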
Learning Hierarchical Representations
(figure from Y. Le Cun)
Learning Hierarchical Representations
Image recognition
• pixel → edge → texton → motif → part → object
Text processing
• char → word → word group → clause → sentence → story
Speech recognition
• wave → spectral band → sound → phone → word → sentence

The intermediate features do not necessarily correspond to well-defined entities for humans !
Network Architectures : ConvNet
Background
• In principle, any problem can be solved with a fully connected (deep) neural network
• However, it is very hard to learn the best solution due to the huge search space
⇒ Constrain the network architecture to be problem-specific
Convolutional networks
• Several layers of small feature detectors and pooling
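As a sketch of the idea (written with PyTorch; the layer sizes are illustrative assumptions, not a specific published model), stacking small feature detectors and pooling looks like:

```python
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),   # small learned feature detectors
    nn.MaxPool2d(2),                              # pooling: local spatial invariance
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(),  # deeper layer: more abstract features
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(10),                            # classifier on top of the features
)
```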
Network Architectures : ConvNet
Improved version: GoogLeNet
Revolution of Depth
• ResNet with 150 layers (or even up to 1000 !)
Deep Neural Networks in Computer Vision
Image net challenge
• Train: 1.2M images with 1000 classes, test: 200k images
• Evolution of error rates:
• The classification error decreased from 28% to less than 4% and today reaches human performance
• Deep neural networks have been used since 2012
What about Deep Neural Networks in NLP ?
• Operate on a low-level representation of the data
  • Vision: pixels
  • NLP: what is the fundamental unit - words or characters ? discrete units !
• Use very deep architectures to learn hierarchical representations of the data
  • Vision: feature detectors of increasing abstraction
  • NLP: how to structure the input ? n-grams, syntactic or semantic graphs, . . . ?
• Structure the network to adapt it to the problem
  • Vision: ConvNets implement learnable feature detectors
  • NLP: recurrent NNs (LSTM, GRU) are very popular; ConvNets can also be used
• Trained end-to-end
  • Vision: classification problems are well-defined
  • NLP: sentence generation is often ambiguous, without a unique solution
Natural Language Processing
Handling words
• Detect relationships between words
• Associate categories to (sequences of) words
• Estimate probability distributions over (sequences of) words
• Generate sentences...
Old technique
• Define a vocabulary of V known words
• Represent each word by an integer index
• 1-out-of-N encoding, binary vector
⇒ There is no relation between the words
⇒ All the words are equally close or far
Word Embeddings
Idea
• Associate an arbitrary vector $x_i \in \mathbb{R}^E$ to each word
• Learn these embeddings in such a way that similar words are nearby in that space
[Figure: 2-D embedding space before and after learning: “house” and “building” end up nearby, as do “apple”, “banana”, “grape”]
• The notion of similarity may depend on the application(LM, MT, dialog, . . .)
⇒ There are many techniques to learn word embeddings . . .
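A toy sketch of what the learned space buys us (the vectors below are made up for illustration, not real learned embeddings): similarity between words becomes a simple vector comparison.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: high for nearby embedding vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical embeddings with dimension E = 4 (real ones are learned)
emb = {"house":    np.array([0.9, 0.1, 0.0, 0.2]),
       "building": np.array([0.8, 0.2, 0.1, 0.3]),
       "banana":   np.array([0.0, 0.9, 0.8, 0.1])}

print(cosine(emb["house"], emb["building"]))  # high: similar words are nearby
print(cosine(emb["house"], emb["banana"]))    # lower: dissimilar words are apart
```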
Recurrent Network for LM
Theoretical aspects
• Ideally, one should estimate:
  $P(w_1^p) = P(w_1) \prod_{i=2}^{p} P(w_i \mid w_1^{i-1})$
  i.e. each word is conditioned on all preceding words
• A recurrent neural network seems to be the perfect choice
• This was proposed by Mikolov et al in 2010, and many follow-up works
Practical issues
• Gradients tend to vanish for long sequences
→ Long Short-Term Memory (LSTM) networks
• Recurrent NNs are less straightforward to optimize
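A minimal sketch of such a recurrent LM in PyTorch (sizes are arbitrary; this is an illustration of the idea, not the exact model of Mikolov et al):

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)  # models P(w_i | w_1 .. w_{i-1})

    def forward(self, words):              # words: (batch, seq_len) word indices
        h, _ = self.lstm(self.embed(words))  # recurrent state carries the full history
        return self.out(h)                 # logits over the next word at each step
```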
Sequence-to-Sequence Processing
• There are many tasks which map a sequence of words to another sequence of words
  • POS tagging (words → tags)
  • machine translation (source → target)
  • text summarization (long → short text)
  ...
• We need a generative model which can handle variable lengths
• Often, there is no unique solution
How to handle such tasks with neural networks ?
• Encoder/decoder approach [Kalchbrenner et al, EMNLP’13; Cho et al, EMNLP’14]
• Sequence-to-sequence processing [Sutskever et al, NIPS’14]
Sequence-to-Sequence Processing
General idea
• An encoder processes the source sentence and creates a compact representation
• This representation is the input to the decoder, which generates a sequence in the target language
• Both encoder and decoder are RNNs
[Figure: source sentence → encoder → sentence representation → decoder → target sentence]
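A minimal encoder/decoder sketch in PyTorch, without attention (sizes are illustrative assumptions; the training loop and beam search are omitted):

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=10000, dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # the encoder's final state is the compact sentence representation
        _, state = self.encoder(self.src_emb(src))
        # the decoder starts from that representation and generates the target
        h, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(h)
```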
Sequence-to-Sequence Processing
Many variants
• Huge and deep LSTM with short-cut connections
• Convolutional networks
• Attention mechanism
• Beam search, RL
• Many details to come this week . . .
This talk
• We do not train an NMT system
• But we use the NMT framework to learn multilingual sentence representations
• No BLEU scores !
Multiple Encoder/Decoder Framework
[Figure: several encoders (En, Fr, De, Es text, image, speech) and several decoders (En, Fr, De, Es text; sentence classifier: negative/neutral/positive) connected through one universal sentence representation; example: “A horse on a beach” / “Un cheval sur une plage” / “Un caballo en una playa” / “Ein Pferd auf einem Strand”]
• Use several encoders and decoders
  • different language pairs
  • other Seq2Seq tasks (speech)
  • sentence classification tasks (sequence-to-category)
  • image captioning (image-to-sequence)
• “Force” the representation to be identical for all encoders
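A sketch of the alternating training loop (encoders, decoders, loss_fn and opt are hypothetical modules/objects in the spirit of the Seq2Seq sketch above, not the exact implementation):

```python
from itertools import cycle

def train_step(batch, src_lang, tgt_lang, encoders, decoders, loss_fn, opt):
    rep = encoders[src_lang](batch["src"])          # universal sentence representation
    logits = decoders[tgt_lang](rep, batch["tgt"])  # decode in the target language
    loss = loss_fn(logits, batch["tgt"])
    opt.zero_grad(); loss.backward(); opt.step()    # standard gradient update
    return loss

# One-to-one schedule: alternating pairs with a common target (here En)
# "forces" all encoders toward the same representation space:
# for batch, (src, tgt) in zip(batches, cycle([("fr", "en"), ("es", "en"), ("ru", "en")])):
#     train_step(batch, src, tgt, encoders, decoders, loss_fn, opt)
```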
Different Training Paths
[Diagram: one-to-one training paths: FR, ES, RU each paired with the common target EN]
One-to-one strategy
• Alternate between different language pairs with one common target
⇒ Encourages joint representation
+ Train with pairwise parallel data
– No embedding for output language
Different Training Paths
[Diagram: one-to-one, two-to-one and three-to-one training paths (FR, ES, RU → EN), with the joint representation forced by averaging]
Many-to-one strategy
• Add a regularizer to explicitly force a joint representation, e.g. average, correlation, . . .
– Still no embedding for output language
Different Training Paths
[Diagram: one-to-one, two-to-one, three-to-one and one-to-many training paths over EN, FR, ES, RU]
One-to-many strategy
• Translate from one language to all other languages, source excluded
⇒ Always at least one common target language
• Sentence embeddings for all languages
– Needs N-way parallel training corpora
• Extension to “many-to-many strategy” straightforward
Evaluation of Sentence Representations
Desired properties:
• semantic closeness: similar sentences should be close in the embedding space
• multilingual closeness: identical sentences in different languages should be close
• preservation of content: task-specific: NMT, classification, entailment, etc.
• scalability to many languages: limit the need for human labeling of data
Evaluation of Sentence Representations
Multilingual evaluation:
1 Sentence classification + transfer (Reuters corpus)
  • train a sentence classifier on labeled data for one language
  • apply the same classifier to a different language, without language-specific training data
2 Entailment (STS tasks)
  • input: two sentences; output: relatedness score
  • compare sentences in different languages
3 Similarity search
  • compare sentence vectors with some metric (cosine, as in the sketch below)
  • parallel corpus: search the closest sentence + count errors
  ⇒ translation by search, paraphrasing, . . .
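A sketch of the similarity-search error metric (src_vecs/tgt_vecs are hypothetical N×E arrays of sentence embeddings of a parallel corpus, row i of each side being a translation pair):

```python
import numpy as np

def similarity_error(src_vecs, tgt_vecs):
    # normalize so that the dot product equals cosine similarity
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)   # index of the closest target sentence
    # an error whenever the closest target is not the aligned translation
    return float(np.mean(nearest != np.arange(len(src))))
```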
Experimental Evaluation: Architectures
NMT
• BLSTM, the deeper the better
• Quite complicated architectures (short-cut connections)
• Convolutional networks
Sentence representations
• Deep networks doesn’t seem to be useful
• Sentence representation:• last LSTM layer (original seq2seq)• BLSTM + element-wise max-pooling
• the proposed framework is generic:any type of encoder and decoder can be used
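A sketch of the BLSTM + element-wise max-pooling encoder in PyTorch (sizes are illustrative assumptions):

```python
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    def __init__(self, vocab=10000, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, words):                 # words: (batch, seq_len)
        h, _ = self.blstm(self.embed(words))  # (batch, seq_len, 2*hidden)
        # element-wise max over time: one fixed-size vector per sentence
        return h.max(dim=1).values
```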
Experimental Evaluation: Data
Corpora
• Most existing MT corpora are N-way parallel
• UN corpus:
  • 6 very different languages (En, Fr, Es, Ru, Ar, Zh)
  • 11M sentences; 6-way parallel: 8.3M
• Europarl corpus:
  • 21 languages, 400k – 2M sentences
  • 9-way parallel subset: 1.1M sentences (En, Fr, De, Fi, Es, Da, It, Nl, Pt)
• TED corpus:
  • 23 languages, 100k sentences
Training Strategies: One-to-One
System                          Average similarity error
                                efs    efsr   efsra  efsraz
#pairs:                         6      10     15     21
LSTM nhid=512 + last state:
  efs-a                         2.14   –      –      –
  efs-r                         1.97   –      –      –
  efsr-a                        1.90   2.40   –      –
  efsra-z                       1.91   2.26   2.51   –
  efsraz-all                    1.70   1.97   2.38   2.59
LSTM nhid=1024 + last state:
  efsraz-all                    1.36   1.64   1.89   1.95
BLSTM nhid=512 + max pooling:
  efsra-z                       1.03   1.20   1.26   –
  efsraz-all                    0.92   1.07   1.15   1.20
• Error decreases with the number of languages covered !
• Training strategy one-to-many is slightly better
• BLSTM + max pooling is considerably better
Training Strategies: Many-to-One (efs-a)
        # input languages           Similarity
ID      1       2       3           error
One M:1 strategy:
  1     1       –       –           1.03%
  2     –       0.5     –           1.85%
  3     –       –       1           67.9%
Combining 1:1 and 2:1 strategies:
  12a   0.9     0.05    –           1.09%
  12b   0.8     0.10    –           1.16%
  12c   0.7     0.15    –           1.15%
  12d   0.6     0.20    –           1.12%
  12e   0.5     0.25    –           1.22%
Combining 1:1 and 3:1 strategies:
  13    0.5     –       0.5         1.31%
Combining 1:1, 2:1 and 3:1 strategies:
  123a  0.33    0.16    0.33        1.32%
  123b  0.25    0.25    0.25        1.35%
• Combining representations by average doesn’t work
⇒ Ongoing work: explore other techniques
Multilingual Similarity Search (UN)
Source  Target language
lang.   En      Fr      Es      Ru      Ar      Zh      All
En      –       1.10    0.70    1.07    1.05    1.15    1.02
Fr      0.97    –       0.95    1.55    1.65    1.68    1.36
Es      0.68    1.10    –       1.20    1.35    1.27    1.12
Ru      0.78    1.52    1.23    –       1.32    1.32    1.23
Ar      0.78    1.52    1.07    1.48    –       1.23    1.22
Zh      0.97    1.55    1.12    1.35    1.30    –       1.26
All     0.83    1.36    1.02    1.33    1.33    1.33    1.20
• Error rates are very low and homogeneous
• The six languages differ significantly with respect to morphology, inflection, word order, . . .
• Performance on Chinese is surprisingly good
Multilingual Similarity Search (EuroParl)
Source  Target language
lang.   en     fr     es     it     pt     de     da     nl     fi     All
en      –      3.26   3.28   4.32   3.44   3.26   3.28   4.74   3.10   3.58
fr      3.30   –      3.20   3.96   3.42   3.40   3.42   5.00   3.44   3.64
es      3.28   3.28   –      4.00   3.46   3.40   3.46   4.40   3.04   3.54
it      4.12   3.92   3.88   –      4.32   4.24   4.24   5.40   4.30   4.30
pt      3.44   3.64   3.68   4.34   –      3.82   3.60   5.12   3.44   3.88
de      3.40   3.58   3.42   4.62   3.80   –      3.60   4.94   3.30   3.83
da      3.16   3.44   3.26   4.40   3.64   3.38   –      4.88   3.10   3.66
nl      4.80   4.94   4.52   5.76   5.10   5.08   5.02   –      4.94   5.02
fi      3.14   3.32   2.98   4.20   3.22   3.16   3.18   4.70   –      3.49
All     3.58   3.67   3.53   4.45   3.80   3.72   3.73   4.90   3.58   3.88
• Many confusable sentences: “The session starts at 12:00 pm”
• Difficult languages (De, Fi) perform quite well
• Dutch (Nl) seems to be the most difficult language to map into a common space
Large-Scale Similarity Search
• Europarl En/Fr, scaling up to 1.5M sentences:
[Plot: similarity error (%) vs. number of examples (×1000, log scale)]
• Error increases log-linearly
⇒ Translation by search ?
Monolingual Similarity Search: Examples
Query: All kinds of obstacles must be eliminated.
D2=0.905  All kinds of barriers have to be removed.
D3=0.682  All forms of violence must be prohibited.
D4=0.673  All forms of provocation must be avoided.
D5=0.636  All forms of social dumping must be stopped.
Query: I did not find out why.
D2=0.836  I do not understand why.
D3=0.821  I fail to understand why.
D4=0.786  I cannot understand why.
D5=0.780  I have no idea why.
• Five closest sentences found by monolingual similarity search in English (D1 = query, not shown)
• All are some form of paraphrasing → linguistic similarity
Monolingual Similarity Search: Examples
Query     All citizens who commit sexual crimes against children must be punished, regardless of whether the crime is committed within or outside the EU.
D2=0.662  The second proposal is to protect children against child sex tourism by all member states criminalising sexual crimes both within and outside the EU.
D3=0.655  We need standard national legislation throughout Europe which punishes union citizens who engage in child sex tourism, irrespective of where the offence was committed.
D4=0.655  The impunity of those who commit terrible crimes against their own citizens and against other people regardless of their citizenship must be ended.
D5=0.609  Any person who commits a criminal act should be punished, including those who employ third-country nationals, illegally and under poor conditions.
• A more complicated English sentence (25 words)
• All closest sentences cover the punishment of (sexual) crimes
• The similarity is at the overall sentence level, not simple paraphrasing or synonyms
Multilingual Similarity Search: Examples
EN59177   Query     Allow me, however, to comment on certain issues raised by the honourable Members.
FR59177   D1=0.739  Permettez-moi toutefois de commenter certaines questions soulevées par les députés.
FR394434  D2=0.643  Je voudrais commenter quelques-unes des questions soulevées par les députés.
FR791798  D3=0.618  Je voudrais faire les commentaires suivants sur plusieurs aspects spécifiques soulevés par certains orateurs.
FR666349  D4=0.615  Permettez-moi de dire quelques mots sur certaines questions qui ont été soulevées.
FR444790  D5=0.609  Je voudrais juste faire quelques commentaires sur certaines des questions qui ont été soulevées.

ES59177   D1=0.719  No obstante, permítanme comentar ciertas cuestiones planteadas por sus señorías.
ES394434  D2=0.628  Me gustaría comentar algunas de las cuestiones planteadas por algunos diputados.
ES271614  D3=0.615  No obstante, quisiera hacer algunos comentarios sobre el debate que nos ocupa.
ES661451  D4=0.605  Por último, permítanme que añada algunos comentarios sobre las enmiendas presentadas.
ES666285  D5=0.605  No obstante, permítanme que conteste a algunos comentarios que se han realizado.
• All the cosine distances are close and the sentences are indeed semantically related.
Multilingual Similarity Search: Examples
EN77622    Query     And yet the report on the fight against racism does not demonstrate that the necessary conclusions have been drawn.
FR77622    D1=0.767  Pourtant, le rapport sur la lutte contre le racisme n’indique pas que l’on en ait tiré les conclusions qui s’imposent.
FR1094939  D2=0.746  Ainsi, le rapport sur la lutte contre le racisme n’indique pas que l’on en a tiré les conclusions qui s’imposent.
FR73928    D3=0.491  Et, comme le démontrent les faits, ce n’est pas en interdisant que l’on va obtenir des résultats.
FR1249269  D4=0.476  Ce rapport, qui se propose de lutter contre la corruption, ne fait qu’illustrer votre incapacité à le faire.

ES77622    D1=0.820  Sin embargo, el informe sobre la lucha contra el racismo no muestra que se hayan extraído las conclusiones necesarias.
ES1094939  D2=0.797  Así, el informe sobre la lucha contra el racismo no muestra que se hayan extraído las conclusiones necesarias.
ES287052   D3=0.517  No obstante, el informe deja mucho que desear en lo que se refiere a las medidas necesarias para combatir el cambio climático y, por tanto, pone de relieve que el Parlamento Europeo no se encuentra a la vanguardia de esta batalla.
ES74892    D4=0.515  Y el informe de los expertos demuestra que no había el control y el seguimiento necesarios.
• The correct French and Spanish translations were retrieved
• The second-closest sentences are also semantically well related to the query
• The others have smaller similarity and only cover some aspects of the query
Large Scale Similarity Search
What happens if we scale up to billions of sentences?
• Huge amounts of monolingual data are available on the Web
• Will we always find a “close” sentence ?
• Will we get (multilingual) confusions with billions ofsentences ?
• Is “translation by search” feasible ?
Data: subset of common crawl
• Data provided by Ken Heafield and colleagues
  • En, Fr and Es: 1.3 – 1.9 billion sentences
  • all limited to 5-50 words
• 1024-dim sentence representation: 5 – 7 TB of data !
• Can we find the closest sentence in real-time ?
Efficient Large Scale Similarity Search
Vector compression
• For efficiency, the whole index must be stored in RAM
• Compression: 5 TB → 128GB (40×)
• Speed: brute force search in 1.5 billion vectors is too slow
• Optimization on the Europarl corpus (1.5M sentences):
  • baseline: 5.7 GB, 7300s, 7.3% error
  • PCA to 256 dim: 1.5 GB, 1188s, 7.6% error
  • PCA to 64 dim: 0.4 GB, 531s, 14.2% error
  ⇒ the error increases significantly, and the index is still too big
• sophisticated quantization, etc.: 120 MB memory; accuracy/speed trade-off: 763s, 9.7% → 1380s, 8.8%
• Demo: index in En and Fr, 1.5 billion sentences; query in En, Fr, Es, Ru, Ar or Zh; close to real-time
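Such compressed search can be sketched, e.g., with the faiss library (the slides do not name the exact toolkit, so this is an assumption; the nlist/m/nbits settings and random data below are illustrative):

```python
import faiss
import numpy as np

d = 1024                                          # sentence embedding dimension
xb = np.random.rand(100000, d).astype("float32")  # stand-in for real embeddings

quantizer = faiss.IndexFlatL2(d)
# inverted lists (nlist=1024) + product quantization (64 sub-vectors, 8 bits each)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)
index.train(xb)                # learn coarse centroids and PQ codebooks
index.add(xb)                  # store only the compressed codes in RAM
D, I = index.search(xb[:5], 5) # approximate k-nearest-neighbor queries
```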
Applications of Multilingual Embeddings
What can we do with multilingual joint embeddings ?

• Cross-lingual sentence classification
  • better than the state of the art in 3 out of 6 language pairs
• Extract parallel data from large monolingual collections
  • NMT Marathon project
• MT quality estimation
  • collaboration
Mine for Parallel Data
Approach
• Embed billions of sentences in the same space
• For each sentence in one language,search the k-closest ones in another language
• Decide which sentences are possible translations based ondistance: simple threshold, classifier
⇒ easy to implement, but a computational challenge (see the sketch below)
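A brute-force sketch of the simple-threshold variant (at real scale the compressed index above would replace the dense similarity matrix; the threshold value is an arbitrary choice):

```python
import numpy as np

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    # normalize so that dot products are cosine similarities
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                    # cosine similarity matrix
    nearest = sims.argmax(axis=1)         # closest target for each source sentence
    best = sims[np.arange(len(src)), nearest]
    # keep only (source, target) index pairs that look like translations
    return [(i, j) for i, (j, s) in enumerate(zip(nearest, best)) if s >= threshold]
```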
NMT Marathon project with Kenneth Heafield
• Multilingual embeddings for sentence-level parallel corpora
• Open and highly-scalable implementation customized toretrieve nearby sentences
• Target: 17 PB of Web crawl data (Internet Archive)
• Start with 55 TB of CommonCrawl
MT Quality Estimation
Approach
• The distance in the joint space could be used to estimate the quality of a translation
• May be also used to rescore n-best lists
⇒ Interested in collaborations
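A toy sketch of the idea (embed() stands for a hypothetical multilingual sentence encoder; this is an illustration, not an evaluated QE system):

```python
import numpy as np

def qe_score(src_sentence, mt_output, embed):
    # cosine similarity between source and candidate in the joint space:
    # higher should mean the translation preserves the meaning better
    a, b = embed(src_sentence), embed(mt_output)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rescoring an n-best list then amounts to sorting hypotheses by qe_score.
```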
FAIR: Facebook AI Research
Every day on Facebook
• 10 billion text messages are sent
• 300 million pictures are uploaded
• several millions of new videos are published
• 1.5 billion searches are conducted
FAIR vision: AI will mediate communication
• between people
  • feed ranking, suggestions, real-time translation, etc.
• between people and the digital world
  • content search, Q&A, real-time dialog, bots
FAIR: Facebook AI Research
FAIR: Overview
• ≈ 100 research scientists and engineers
• New York, Menlo Park, Paris, Seattle
• still growing . . .
• Several open positions: research scientist and engineer, internships
Some projects:
• Image captioning for the visual impaired
• Face recognition, Video analysis
• Neural machine translation
• Dialog system, chat bots, Q&A
Conclusion
• The multiple encoder/decoder NMT framework can be used to learn joint multilingual embeddings
• A nice indirect method to learn semantic sentence representations
• Continuous-space interlingua
• Trained with usual NMT corpora
• Enables many applications
  • cross-lingual zero-shot transfer
  • search and comparison of multilingual sentences in one common space
  ...
Ongoing research
• Use one encoder for all languages
  • joint BPE vocabulary
  • this works as well as using separate encoders per language
• Training with less supervision
• Analysis of the joint sentence space
  • how is sentence length encoded ?
  • can we identify linguistic concepts ?
  • do we observe delta vectors ?
  • collaboration with the NMT Marathon project “Visualization Tool for OpenNMT”