Neural Machine Translation and
Universal Multilingual Representations

Holger Schwenk

August 28th, NMT Marathon, Lisbon
Plan of the Talk
• Introduction to deep neural networks for NLP (in 5 min)
• From Sequence-to-Sequence processing to multilingual sentence representations
• Detailed results
• Large scale demo
• Facebook AI Research
• Conclusion and perspectives
How to Build Intelligent Machines ?
• The brain is an existence proof of intelligence
  • The way birds fly was an existence proof of heavier-than-air flight
• Should we copy the brain ?
  • Like mankind tried to copy birds to build flying machines
  • The answer is no, but we should draw inspiration from it
• Design principles of artificial neural networks
  • Many simple units are combined to perform a complex task
  • The system learns from examples
  • It is able to generalize to unseen events
How to Build Intelligent Machines ?
[Figure: human neuron vs. artificial perceptron]
• The perceptron: a simple computational unit
  • Computes a weighted sum of its inputs
  • Output: +1 if the cumulated input exceeds a threshold, -1 otherwise
  • The weights are obtained by a learning algorithm:
    - iterate through examples with input and desired output
    - adjust the weights to minimize the observed error
⇒ Vaguely inspired by a biological neuron
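A minimal sketch of the perceptron learning rule (illustrative toy code, not from the slides; the AND example and learning rate are arbitrary choices):

```python
def perceptron_train(examples, epochs=10, lr=0.1):
    """examples: list of (inputs, target) with target in {+1, -1}."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, t in examples:
            # weighted sum of inputs plus bias, thresholded at 0
            y = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if y != t:  # adjust weights to reduce the observed error
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

# Toy usage: learn the logical AND function (+1 only when both inputs are 1)
w, b = perceptron_train([([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)])
```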
How to Build Intelligent Machines ?
• Characteristics of the human brain
  • 10^11 neurons
  • 10^4 synapses per neuron
  • 10^16 “operations” per second
• Learning and processing:
  • No global supervision
  • Modification of the synapses
• Representation of knowledge:
  • Distributed
  • Hierarchical
• Distributed processing:
  • Highly parallel
NeuralMachine
Translationand UniversalMultilingualRepresenta-
tions
HolgerSchwenk
Plan
Introduction
Vision
NLP
Embeddings
RNN
Seq2Seq
Architecture
Evaluation
Applications
Demo
others
FAIR
Conclusion
How to Build Intelligent Machines ?
• The multi-layer perceptron
  • combines many perceptrons in several layers
  • the internal representations can be learned by back-propagation of the error observed at the output
• The weights are obtained by an iterative learning algorithm:
  - compare the current output with the desired output
  - adjust the weights to minimize the observed error
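A minimal back-propagation sketch for a one-hidden-layer MLP (illustrative only; the toy task, layer sizes and learning rate are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                  # 64 training examples, 4 inputs
T = (X.sum(axis=1, keepdims=True) > 0) * 1.0  # toy desired outputs

W1 = rng.normal(scale=0.5, size=(4, 8))       # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(8, 1))       # hidden -> output weights
lr = 0.1
for _ in range(500):
    H = np.tanh(X @ W1)                       # hidden layer: learned internal representation
    Y = 1 / (1 + np.exp(-(H @ W2)))           # current network output
    # compare with desired output and back-propagate the error to both layers
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * (1 - H**2)
    W2 -= lr * H.T @ dY                       # adjust weights to reduce the error
    W1 -= lr * X.T @ dH
```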
Learning Hierarchical Representations
(figure from Y. Le Cun)
Learning Hierarchical Representations
Image recognition
• pixel → edge → texton → motif → part → object
Text processing
• char → word → word group → clause → sentence → story
Speech recognition
• wave → spectral band → sound → phone → word → sentence

The intermediate features do not necessarily correspond to well-defined entities for humans !
Network Architectures : ConvNet
Background
• In principle, any problem can be solved with a fully connected (deep) neural network
• However, it is very hard to learn the best solution due to the huge search space
⇒ Constrain the network architecture to be problem-specific
Convolutional networks
• Several layers of small feature detectors and pooling
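As a sketch of the idea (written with PyTorch; the layer sizes are illustrative assumptions, not a specific published model), stacking small feature detectors and pooling looks like:

```python
import torch.nn as nn

convnet = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5), nn.ReLU(),   # small learned feature detectors
    nn.MaxPool2d(2),                              # pooling: local spatial invariance
    nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(),  # deeper layer: more abstract features
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.LazyLinear(10),                            # classifier on top of the features
)
```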
Network Architectures : ConvNet
Improved version: GoogLeNet
Revolution of Depth
• ResNet with 150 layers (or even up to 1000 !)
Deep Neural Networks in Computer Vision
Image net challenge
• Train: 1.2M images with 1000 classes, test: 200k images
• Evolution of error rates:
• The classification error decreased from 28% to less than 4% and today reaches human performance
• Deep neural networks have been used since 2012
What about Deep Neural Networks in NLP ?
• Operate on a low-level representation of the data
  • Vision: pixels
  • NLP: what is the fundamental unit - words or characters ? discrete units !
• Use very deep architectures to learn hierarchical representations of the data
  • Vision: feature detectors of increasing abstraction
  • NLP: how to structure the input ? n-grams, syntactic or semantic graphs, . . . ?
• Structure the network to adapt it to the problem
  • Vision: ConvNets implement learnable feature detectors
  • NLP: recurrent NNs (LSTM, GRU) are very popular; ConvNets can also be used
• Trained end-to-end
  • Vision: classification problems are well-defined
  • NLP: sentence generation is often ambiguous, without a unique solution
Natural Language Processing
Handling words
• Detect relationships between words
• Associate categories to (sequences of) words
• Estimate probability distributions over (sequences of) words
• Generate sentences...
Old technique
• Define a vocabulary of V known words
• Represent each word by an integer index
• 1-out-of-N encoding, binary vector
⇒ There is no relation between the words
⇒ All the words are equally close or far
Word Embeddings
Idea
• Associate an arbitrary vector $x_i \in \mathbb{R}^E$ to each word
• Learn these embeddings in such a way that similar words are nearby in that space
[Figure: 2-D embedding space before and after learning: “house” and “building” end up nearby, as do “apple”, “banana”, “grape”]
• The notion of similarity may depend on the application(LM, MT, dialog, . . .)
⇒ There are many techniques to learn word embeddings . . .
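A toy sketch of what the learned space buys us (the vectors below are made up for illustration, not real learned embeddings): similarity between words becomes a simple vector comparison.

```python
import numpy as np

def cosine(a, b):
    # cosine similarity: high for nearby embedding vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical embeddings with dimension E = 4 (real ones are learned)
emb = {"house":    np.array([0.9, 0.1, 0.0, 0.2]),
       "building": np.array([0.8, 0.2, 0.1, 0.3]),
       "banana":   np.array([0.0, 0.9, 0.8, 0.1])}

print(cosine(emb["house"], emb["building"]))  # high: similar words are nearby
print(cosine(emb["house"], emb["banana"]))    # lower: dissimilar words are apart
```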
Recurrent Network for LM
Theoretical aspects
• Ideally, one should estimate:
  $P(w_1^p) = P(w_1) \prod_{i=2}^{p} P(w_i \mid w_1^{i-1})$
  i.e. each word is conditioned on all preceding words
• A recurrent neural network seems to be the perfect choice
• This was proposed by Mikolov et al in 2010, and many follow-up works
Practical issues
• Gradients tend to vanish for long sequences
→ Long Short-Term Memory (LSTM) networks
• Recurrent NNs are less straightforward to optimize
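A minimal sketch of such a recurrent LM in PyTorch (sizes are arbitrary; this is an illustration of the idea, not the exact model of Mikolov et al):

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)  # models P(w_i | w_1 .. w_{i-1})

    def forward(self, words):              # words: (batch, seq_len) word indices
        h, _ = self.lstm(self.embed(words))  # recurrent state carries the full history
        return self.out(h)                 # logits over the next word at each step
```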
Sequence-to-Sequence Processing
• There are many tasks which map a sequence of words to another sequence of words
  • POS tagging (words → tags)
  • machine translation (source → target)
  • text summarization (long → short text)
  ...
• We need a generative model which can handle variable lengths
• Often, there is no unique solution
How to handle such tasks with neural networks ?
• Encoder/decoder approach [Kalchbrenner et al, EMNLP’13; Cho et al, EMNLP’14]
• Sequence-to-sequence processing [Sutskever et al, NIPS’14]
Sequence-to-Sequence Processing
General idea
• An encoder processes the source sentence and creates a compact representation
• This representation is the input to the decoder, which generates a sequence in the target language
• Both encoder and decoder are RNNs
[Figure: source sentence → encoder → sentence representation → decoder → target sentence]
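A minimal encoder/decoder sketch in PyTorch, without attention (sizes are illustrative assumptions; the training loop and beam search are omitted):

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=10000, tgt_vocab=10000, dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt):
        # the encoder's final state is the compact sentence representation
        _, state = self.encoder(self.src_emb(src))
        # the decoder starts from that representation and generates the target
        h, _ = self.decoder(self.tgt_emb(tgt), state)
        return self.out(h)
```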
Sequence-to-Sequence Processing
Many variants
• Huge and deep LSTM with short-cut connections
• Convolutional networks
• Attention mechanism
• Beam search, RL
• Many details to come this week . . .
This talk
• We do not train an NMT system
• But we use the NMT framework to learn multilingual sentence representations
• No BLEU scores !
Multiple Encoder/Decoder Framework
[Figure: several encoders (En, Fr, De, Es text, image, speech) and several decoders (En, Fr, De, Es text; sentence classifier: negative/neutral/positive) connected through one universal sentence representation; example: “A horse on a beach” / “Un cheval sur une plage” / “Un caballo en una playa” / “Ein Pferd auf einem Strand”]
• Use several encoders and decoders
  • different language pairs
  • other Seq2Seq tasks (speech)
  • sentence classification tasks (sequence-to-category)
  • image captioning (image-to-sequence)
• “Force” the representation to be identical for all encoders
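A sketch of the alternating training loop (encoders, decoders, loss_fn and opt are hypothetical modules/objects in the spirit of the Seq2Seq sketch above, not the exact implementation):

```python
from itertools import cycle

def train_step(batch, src_lang, tgt_lang, encoders, decoders, loss_fn, opt):
    rep = encoders[src_lang](batch["src"])          # universal sentence representation
    logits = decoders[tgt_lang](rep, batch["tgt"])  # decode in the target language
    loss = loss_fn(logits, batch["tgt"])
    opt.zero_grad(); loss.backward(); opt.step()    # standard gradient update
    return loss

# One-to-one schedule: alternating pairs with a common target (here En)
# "forces" all encoders toward the same representation space:
# for batch, (src, tgt) in zip(batches, cycle([("fr", "en"), ("es", "en"), ("ru", "en")])):
#     train_step(batch, src, tgt, encoders, decoders, loss_fn, opt)
```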
Different Training Paths
[Diagram: one-to-one training paths: FR, ES, RU each paired with the common target EN]
One-to-one strategy
• Alternate between different language pairs with one common target
⇒ Encourages joint representation
+ Train with pairwise parallel data
– No embedding for output language
Different Training Paths
[Diagram: one-to-one, two-to-one and three-to-one training paths (FR, ES, RU → EN), with the joint representation forced by averaging]
Many-to-one strategy
• Add a regularizer to explicitly force a joint representation, e.g. average, correlation, . . .
– Still no embedding for output language
Different Training Paths
[Diagram: one-to-one, two-to-one, three-to-one and one-to-many training paths over EN, FR, ES, RU]
One-to-many strategy
• Translate from one language to all other languages, source excluded
⇒ Always at least one common target language
• Sentence embeddings for all languages
– Needs N-way parallel training corpora
• Extension to “many-to-many strategy” straightforward
Evaluation of Sentence Representations
Desired properties:
• semantic closeness: similar sentences should be close in the embedding space
• multilingual closeness: identical sentences in different languages should be close
• preservation of content: task-specific: NMT, classification, entailment, etc.
• scalability to many languages: limit the need for human labeling of data
Evaluation of Sentence Representations
Multilingual evaluation:
1 Sentence classification + transfer (Reuters corpus)
  • train a sentence classifier on labeled data for one language
  • apply the same classifier to a different language, without language-specific training data
2 Entailment (STS tasks)
  • input: two sentences; output: relatedness score
  • compare sentences in different languages
3 Similarity search
  • compare sentence vectors with some metric (cosine, as in the sketch below)
  • parallel corpus: search the closest sentence + count errors
  ⇒ translation by search, paraphrasing, . . .
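A sketch of the similarity-search error metric (src_vecs/tgt_vecs are hypothetical N×E arrays of sentence embeddings of a parallel corpus, row i of each side being a translation pair):

```python
import numpy as np

def similarity_error(src_vecs, tgt_vecs):
    # normalize so that the dot product equals cosine similarity
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    nearest = (src @ tgt.T).argmax(axis=1)   # index of the closest target sentence
    # an error whenever the closest target is not the aligned translation
    return float(np.mean(nearest != np.arange(len(src))))
```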
Experimental Evaluation: Architectures
NMT
• BLSTM, the deeper the better
• Quite complicated architectures (short-cut connections)
• Convolutional networks
Sentence representations
• Deep networks doesn’t seem to be useful
• Sentence representation:• last LSTM layer (original seq2seq)• BLSTM + element-wise max-pooling
• the proposed framework is generic:any type of encoder and decoder can be used
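A sketch of the BLSTM + element-wise max-pooling encoder in PyTorch (sizes are illustrative assumptions):

```python
import torch.nn as nn

class BLSTMEncoder(nn.Module):
    def __init__(self, vocab=10000, emb_dim=256, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, words):                 # words: (batch, seq_len)
        h, _ = self.blstm(self.embed(words))  # (batch, seq_len, 2*hidden)
        # element-wise max over time: one fixed-size vector per sentence
        return h.max(dim=1).values
```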
Experimental Evaluation: Data
Corpora
• Most existing MT corpora are N-way parallel
• UN corpus:
  • 6 very different languages (En, Fr, Es, Ru, Ar, Zh)
  • 11M sentences; 6-way parallel: 8.3M
• Europarl corpus:
  • 21 languages, 400k – 2M sentences
  • 9-way parallel subset: 1.1M sentences (En, Fr, De, Fi, Es, Da, It, Nl, Pt)
• TED corpus:
  • 23 languages, 100k sentences
Training Strategies: One-to-One
System                          Average similarity error
                                efs    efsr   efsra  efsraz
#pairs:                         6      10     15     21
LSTM nhid=512 + last state:
  efs-a                         2.14   –      –      –
  efs-r                         1.97   –      –      –
  efsr-a                        1.90   2.40   –      –
  efsra-z                       1.91   2.26   2.51   –
  efsraz-all                    1.70   1.97   2.38   2.59
LSTM nhid=1024 + last state:
  efsraz-all                    1.36   1.64   1.89   1.95
BLSTM nhid=512 + max pooling:
  efsra-z                       1.03   1.20   1.26   –
  efsraz-all                    0.92   1.07   1.15   1.20
• Error decreases with the number of languages covered !
• Training strategy one-to-many is slightly better
• BLSTM + max pooling is considerably better
Training Strategies: Many-to-One (efs-a)
        # input languages           Similarity
ID      1       2       3           error
One M:1 strategy:
  1     1       –       –           1.03%
  2     –       0.5     –           1.85%
  3     –       –       1           67.9%
Combining 1:1 and 2:1 strategies:
  12a   0.9     0.05    –           1.09%
  12b   0.8     0.10    –           1.16%
  12c   0.7     0.15    –           1.15%
  12d   0.6     0.20    –           1.12%
  12e   0.5     0.25    –           1.22%
Combining 1:1 and 3:1 strategies:
  13    0.5     –       0.5         1.31%
Combining 1:1, 2:1 and 3:1 strategies:
  123a  0.33    0.16    0.33        1.32%
  123b  0.25    0.25    0.25        1.35%
• Combining representations by average doesn’t work
⇒ Ongoing work: explore other techniques
Multilingual Similarity Search (UN)
Source  Target language
lang.   En      Fr      Es      Ru      Ar      Zh      All
En      –       1.10    0.70    1.07    1.05    1.15    1.02
Fr      0.97    –       0.95    1.55    1.65    1.68    1.36
Es      0.68    1.10    –       1.20    1.35    1.27    1.12
Ru      0.78    1.52    1.23    –       1.32    1.32    1.23
Ar      0.78    1.52    1.07    1.48    –       1.23    1.22
Zh      0.97    1.55    1.12    1.35    1.30    –       1.26
All     0.83    1.36    1.02    1.33    1.33    1.33    1.20
• Error rates are very low and homogeneous
• The six languages differ significantly with respect to morphology, inflection, word order, . . .
• Performance on Chinese is surprisingly good
Multilingual Similarity Search (EuroParl)
Source  Target language
lang.   en     fr     es     it     pt     de     da     nl     fi     All
en      –      3.26   3.28   4.32   3.44   3.26   3.28   4.74   3.10   3.58
fr      3.30   –      3.20   3.96   3.42   3.40   3.42   5.00   3.44   3.64
es      3.28   3.28   –      4.00   3.46   3.40   3.46   4.40   3.04   3.54
it      4.12   3.92   3.88   –      4.32   4.24   4.24   5.40   4.30   4.30
pt      3.44   3.64   3.68   4.34   –      3.82   3.60   5.12   3.44   3.88
de      3.40   3.58   3.42   4.62   3.80   –      3.60   4.94   3.30   3.83
da      3.16   3.44   3.26   4.40   3.64   3.38   –      4.88   3.10   3.66
nl      4.80   4.94   4.52   5.76   5.10   5.08   5.02   –      4.94   5.02
fi      3.14   3.32   2.98   4.20   3.22   3.16   3.18   4.70   –      3.49
All     3.58   3.67   3.53   4.45   3.80   3.72   3.73   4.90   3.58   3.88
• Many confusable sentences: “The session starts at 12:00 pm”
• Difficult languages (De, Fi) perform quite well
• Dutch (Nl) seems to be the most difficult language to map into a common space
Large-Scale Similarity Search
• Europarl En/Fr, scaling up to 1.5M sentences:
[Plot: similarity error (%) vs. number of examples (×1000, log scale)]
• Error increases log-linearly
⇒ Translation by search ?
Monolingual Similarity Search: Examples
Query: All kinds of obstacles must be eliminated.
D2=0.905  All kinds of barriers have to be removed.
D3=0.682  All forms of violence must be prohibited.
D4=0.673  All forms of provocation must be avoided.
D5=0.636  All forms of social dumping must be stopped.
Query: I did not find out why.
D2=0.836  I do not understand why.
D3=0.821  I fail to understand why.
D4=0.786  I cannot understand why.
D5=0.780  I have no idea why.
• Five closest sentences found by monolingual similarity search in English (D1 = query, not shown)
• All are some form of paraphrasing → linguistic similarity
Monolingual Similarity Search: Examples
Query     All citizens who commit sexual crimes against children must be punished, regardless of whether the crime is committed within or outside the EU.
D2=0.662  The second proposal is to protect children against child sex tourism by all member states criminalising sexual crimes both within and outside the EU.
D3=0.655  We need standard national legislation throughout Europe which punishes union citizens who engage in child sex tourism, irrespective of where the offence was committed.
D4=0.655  The impunity of those who commit terrible crimes against their own citizens and against other people regardless of their citizenship must be ended.
D5=0.609  Any person who commits a criminal act should be punished, including those who employ third-country nationals, illegally and under poor conditions.
• A more complicated English sentence (25 words)
• All closest sentences cover the punishment of (sexual) crimes
• The similarity is at the overall sentence level, not simple paraphrasing or synonyms
Multilingual Similarity Search: Examples
EN59177   Query     Allow me, however, to comment on certain issues raised by the honourable Members.
FR59177   D1=0.739  Permettez-moi toutefois de commenter certaines questions soulevées par les députés.
FR394434  D2=0.643  Je voudrais commenter quelques-unes des questions soulevées par les députés.
FR791798  D3=0.618  Je voudrais faire les commentaires suivants sur plusieurs aspects spécifiques soulevés par certains orateurs.
FR666349  D4=0.615  Permettez-moi de dire quelques mots sur certaines questions qui ont été soulevées.
FR444790  D5=0.609  Je voudrais juste faire quelques commentaires sur certaines des questions qui ont été soulevées.

ES59177   D1=0.719  No obstante, permítanme comentar ciertas cuestiones planteadas por sus señorías.
ES394434  D2=0.628  Me gustaría comentar algunas de las cuestiones planteadas por algunos diputados.
ES271614  D3=0.615  No obstante, quisiera hacer algunos comentarios sobre el debate que nos ocupa.
ES661451  D4=0.605  Por último, permítanme que añada algunos comentarios sobre las enmiendas presentadas.
ES666285  D5=0.605  No obstante, permítanme que conteste a algunos comentarios que se han realizado.
• All the cosine distances are close and the sentences are indeed semantically related.
Multilingual Similarity Search: Examples
EN77622    Query     And yet the report on the fight against racism does not demonstrate that the necessary conclusions have been drawn.
FR77622    D1=0.767  Pourtant, le rapport sur la lutte contre le racisme n’indique pas que l’on en ait tiré les conclusions qui s’imposent.
FR1094939  D2=0.746  Ainsi, le rapport sur la lutte contre le racisme n’indique pas que l’on en a tiré les conclusions qui s’imposent.
FR73928    D3=0.491  Et, comme le démontrent les faits, ce n’est pas en interdisant que l’on va obtenir des résultats.
FR1249269  D4=0.476  Ce rapport, qui se propose de lutter contre la corruption, ne fait qu’illustrer votre incapacité à le faire.

ES77622    D1=0.820  Sin embargo, el informe sobre la lucha contra el racismo no muestra que se hayan extraído las conclusiones necesarias.
ES1094939  D2=0.797  Así, el informe sobre la lucha contra el racismo no muestra que se hayan extraído las conclusiones necesarias.
ES287052   D3=0.517  No obstante, el informe deja mucho que desear en lo que se refiere a las medidas necesarias para combatir el cambio climático y, por tanto, pone de relieve que el Parlamento Europeo no se encuentra a la vanguardia de esta batalla.
ES74892    D4=0.515  Y el informe de los expertos demuestra que no había el control y el seguimiento necesarios.
• The correct French and Spanish translations were retrieved
• The second-closest sentences are also semantically well related to the query
• The others have smaller similarity and only cover some aspects of the query
Large Scale Similarity Search
What happens if we scale up to billions of sentences?
• Huge amounts of monolingual data are available on the Web
• Will we always find a “close” sentence ?
• Will we get (multilingual) confusions with billions ofsentences ?
• Is “translation by search” feasible ?
Data: subset of common crawl
• Data provided by Ken Heafield and colleagues
  • En, Fr and Es: 1.3 – 1.9 billion sentences
  • all limited to 5-50 words
• 1024-dim sentence representation: 5 – 7 TB of data !
• Can we find the closest sentence in real-time ?
Efficient Large Scale Similarity Search
Vector compression
• For efficiency, the whole index must be stored in RAM
• Compression: 5 TB → 128GB (40×)
• Speed: brute force search in 1.5 billion vectors is too slow
• Optimization on the Europarl corpus (1.5M sentences):
  • baseline: 5.7 GB, 7300s, 7.3% error
  • PCA to 256 dim: 1.5 GB, 1188s, 7.6% error
  • PCA to 64 dim: 0.4 GB, 531s, 14.2% error
  ⇒ the error increases significantly, and the index is still too big
• sophisticated quantization, etc.: 120 MB memory; accuracy/speed trade-off: 763s, 9.7% → 1380s, 8.8%
• Demo: index in En and Fr, 1.5 billion sentences; query in En, Fr, Es, Ru, Ar or Zh; close to real-time
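Such compressed search can be sketched, e.g., with the faiss library (the slides do not name the exact toolkit, so this is an assumption; the nlist/m/nbits settings and random data below are illustrative):

```python
import faiss
import numpy as np

d = 1024                                          # sentence embedding dimension
xb = np.random.rand(100000, d).astype("float32")  # stand-in for real embeddings

quantizer = faiss.IndexFlatL2(d)
# inverted lists (nlist=1024) + product quantization (64 sub-vectors, 8 bits each)
index = faiss.IndexIVFPQ(quantizer, d, 1024, 64, 8)
index.train(xb)                # learn coarse centroids and PQ codebooks
index.add(xb)                  # store only the compressed codes in RAM
D, I = index.search(xb[:5], 5) # approximate k-nearest-neighbor queries
```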
Applications of Multilingual Embeddings
What can we do with multilingual joint embeddings ?

• Cross-lingual sentence classification
  • better than the state of the art in 3 out of 6 language pairs
• Extract parallel data from large monolingual collections
  • NMT Marathon project
• MT quality estimation
  • collaboration
Mine for Parallel Data
Approach
• Embed billions of sentences in the same space
• For each sentence in one language,search the k-closest ones in another language
• Decide which sentences are possible translations based ondistance: simple threshold, classifier
⇒ easy to implement, but a computational challenge (see the sketch below)
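A brute-force sketch of the simple-threshold variant (at real scale the compressed index above would replace the dense similarity matrix; the threshold value is an arbitrary choice):

```python
import numpy as np

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    # normalize so that dot products are cosine similarities
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = src @ tgt.T                    # cosine similarity matrix
    nearest = sims.argmax(axis=1)         # closest target for each source sentence
    best = sims[np.arange(len(src)), nearest]
    # keep only (source, target) index pairs that look like translations
    return [(i, j) for i, (j, s) in enumerate(zip(nearest, best)) if s >= threshold]
```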
NMT Marathon project with Kenneth Heafield
• Multilingual embeddings for sentence-level parallel corpora
• Open and highly-scalable implementation customized toretrieve nearby sentences
• Target: 17 PB of Web crawl data (Internet Archive)
• Start with 55 TB of CommonCrawl
MT Quality Estimation
Approach
• The distance in the joint space could be used to estimate the quality of a translation
• May be also used to rescore n-best lists
⇒ Interested in collaborations
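A toy sketch of the idea (embed() stands for a hypothetical multilingual sentence encoder; this is an illustration, not an evaluated QE system):

```python
import numpy as np

def qe_score(src_sentence, mt_output, embed):
    # cosine similarity between source and candidate in the joint space:
    # higher should mean the translation preserves the meaning better
    a, b = embed(src_sentence), embed(mt_output)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rescoring an n-best list then amounts to sorting hypotheses by qe_score.
```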
FAIR: Facebook AI Research
Every day on Facebook
• 10 billion text messages are sent
• 300 million pictures are uploaded
• several millions of new videos are published
• 1.5 billion searches are conducted
FAIR vision: AI will mediate communication
• between people
  • feed ranking, suggestions, real-time translation, etc.
• between people and the digital world
  • content search, Q&A, real-time dialog, bots
FAIR: Facebook AI Research
FAIR: Overview
• ≈ 100 research scientists and engineers
• New York, Menlo Park, Paris, Seattle
• still growing . . .
• Several open positions: research scientist and engineer, internships
Some projects:
• Image captioning for the visual impaired
• Face recognition, Video analysis
• Neural machine translation
• Dialog system, chat bots, Q&A
Conclusion
• The multiple encoder/decoder NMT framework can be used to learn joint multilingual embeddings
• A nice indirect method to learn semantic sentence representations
• Continuous-space interlingua
• Trained with usual NMT corpora
• Enables many applications
  • cross-lingual zero-shot transfer
  • search and comparison of multilingual sentences in one common space
  ...
Ongoing research
• Use one encoder for all languages
  • joint BPE vocabulary
  • this works as well as using separate encoders per language
• Training with less supervision
• Analysis of the joint sentence space
  • how is sentence length encoded ?
  • can we identify linguistic concepts ?
  • do we observe delta vectors ?
  • collaboration with the NMT Marathon project “Visualization Tool for OpenNMT”