Transcript of poster: Contextual Parameter Generation for Universal Neural Machine Translation
LEGEND
· text — Example input
· M — Language embedding size
· W — Word embedding size
· P — Number of parameters
· Trainable variables
· Computed values
UNIVERSAL: picks the same parameters for all languages.
PER-LANGUAGE: picks different encoder/decoder parameters based on the languages.
PAIRWISE: picks a different parameter set based on the language pair.
The proposed abstraction is a generalization over previous methods.
IWSLT-15: ~90,000-220,000 train / ~500-900 val / ~1,000 test sentence pairs
IWSLT-17: ~220,000 train / ~900 val / ~1,100 test sentence pairs
[Results table residue: per-language-pair BLEU scores for the PNMT and GML baselines and the CPG variants; the column alignment is not recoverable from this transcript. Best previously reported results: >25.87 [Ha16] and >28.07 [Huang18].]
Problem
Translate from one language to another. Example: English "How are you?" → MT System → Greek "Πώς είσαι?"
A multilingual MT system can translate between any pair of languages.
Assuming L languages and P parameters in a pairwise MT model, we can use:
PER-LANGUAGE ENCODER/DECODER [Luong16, Firat16]
One encoder per source language and one decoder per target language (figure shows English, German, Japanese, Greek, Hindi, Chinese).
· O(LP) parameters
· Limited parameter sharing, and use of attention is difficult
UNIVERSAL [Ha16, Johnson17]
One shared model for all languages (figure routes the same six languages through a single universal model).
· O(P) parameters
· Lacks language-specific parameterization
PAIRWISE
A separate model per language pair (figure connects every source language to every target language).
· O(L²P) parameters
· No parameter sharing
· Bad for limited/no training data
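To make the asymptotic trade-offs above concrete, here is a small back-of-the-envelope calculation; the sizes L and P are hypothetical, not taken from the poster.

```python
# Parameter-count comparison for the three multilingual MT baselines.
# Hypothetical sizes: L languages, P parameters per translation model.
L = 6           # languages (e.g., En, De, Ja, El, Hi, Zh)
P = 50_000_000  # parameters of one encoder/decoder pair

pairwise = L * (L - 1) * P   # one model per ordered language pair: O(L^2 P)
per_language = 2 * L * P     # one encoder + one decoder per language: O(LP)
universal = P                # a single shared model: O(P)

print(f"pairwise:     {pairwise:,}")
print(f"per-language: {per_language:,}")
print(f"universal:    {universal:,}")
```

Even at six languages, the pairwise approach needs 30 full models, which is why it scales poorly and leaves low-resource pairs without shared parameters.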
Contextual Parameter Generation for Universal Neural Machine Translation
Emmanouil Antonios Platanios [email protected] · Mrinmaya Sachan [email protected] · Graham Neubig [email protected] · Tom M. Mitchell [email protected]

Proposed Approach
Experiments
[Figure: the English source "Thank you very much." passes through a vocabulary lookup and word embeddings (size W) into the ENCODER, which produces an intermediate representation; the DECODER then emits the Vietnamese translation "Cám ơn rất nhiều."]

PARAMETER GENERATOR
Generates model parameters at inference time, given some context. The source and target languages represent the context in which translation happens.
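The linear parameter generator described above can be sketched in a few lines. Everything here (sizes, variable names) is illustrative only; the actual implementation described on the poster uses TensorFlow Scala.

```python
import random

# Sketch of contextual parameter generation with a linear generator g:
# theta_enc = W_enc @ l_s, where l_s is the source language embedding.
# All sizes below are toy assumptions, not the paper's settings.
random.seed(0)
M = 4        # language embedding size
P_ENC = 10   # number of encoder parameters (toy)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_enc = rand_matrix(P_ENC, M)  # trainable generator weights
lang_embeddings = {             # trainable, learned jointly with W_enc
    "en": [random.gauss(0.0, 1.0) for _ in range(M)],
    "vi": [random.gauss(0.0, 1.0) for _ in range(M)],
}

def generate_params(W, l):
    """theta = W @ l: each generated parameter is a linear combination
    of the M columns of W, weighted by the language embedding."""
    return [sum(w_ij * l_j for w_ij, l_j in zip(row, l)) for row in W]

theta_enc = generate_params(W_enc, lang_embeddings["en"])  # encoder params for an English source
```

The same generator, with separate weights and the target-language embedding, would produce the decoder parameters.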
OBSERVATIONS
· The parameters often have some "natural grouping" (e.g., first-layer weights).
· Language embeddings represent all language-specific information and may need to be large.
· Only a small part of this information is relevant for each "group".
CONTROLLED SHARING
Let $\theta^{(enc)} = [\theta^{(enc)}_1, \dots, \theta^{(enc)}_G]$, where $\theta^{(enc)}_g$ denotes the $g$-th parameter group and $G$ is the number of groups. Then:
$$\theta^{(enc)}_g = W^{(enc)}_g P^{(enc)}_g \ell_s,$$
where $W^{(enc)}_g \in \mathbb{R}^{|\theta^{(enc)}_g| \times M'}$, $P^{(enc)}_g \in \mathbb{R}^{M' \times M}$, and $M' < M$, and similarly for the decoder.
↑M ⇔ ↑ per-language information; ↑M' ⇔ ↑ shared information.
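The controlled-sharing generator can be sketched as follows: each parameter group gets its own projection that extracts the M'-dimensional slice of the language embedding relevant to it. Sizes and names are toy assumptions for illustration.

```python
import random

# Controlled sharing sketch: theta_g = W_g @ (P_g @ l_s) per parameter
# group g, with P_g projecting the M-dim language embedding down to
# M' < M dims. All sizes below are toy assumptions.
random.seed(0)
M, M_PRIME = 8, 3        # M' < M: per-language vs. shared information
GROUP_SIZES = [6, 4]     # |theta_g| for G = 2 parameter groups (toy)

def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

l_s = [random.gauss(0.0, 1.0) for _ in range(M)]  # source language embedding

theta = []
for size in GROUP_SIZES:
    P_g = rand_matrix(M_PRIME, M)    # projects embedding to group-relevant M' dims
    W_g = rand_matrix(size, M_PRIME) # low-rank generator for this group
    theta.extend(matvec(W_g, matvec(P_g, l_s)))  # theta_g = W_g P_g l_s
```

Compared with the plain linear generator, each group's weights now have M' columns instead of M, so increasing M enriches the per-language embeddings without growing every group's generator.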
We also decouple the encoder and the decoder, thus getting closer to a potential interlingua: $\theta^{(enc)} = W^{(enc)} \ell_s$ and $\theta^{(dec)} = W^{(dec)} \ell_t$.
For each language, the parameters are defined as a linear combination of the $M$ columns of a weight matrix $W$, which makes for better interpretability.
[Figure: a baseline neural MT system whose ENCODER is parameterized by parameters generated from the SOURCE LANGUAGE embedding and whose DECODER is parameterized by parameters generated from the TARGET LANGUAGE embedding; the language embeddings have size M.]
Baseline Model
· 2-layer bidirectional LSTM encoder
· 2-layer LSTM decoder
· 512 units per layer / word embedding size
· Per-language vocabulary: 20,000 most frequent words (no BPE)
Settings
· Supervised: train using full parallel data
· Low-Resource: limit the size of the parallel data
· Zero-Shot: no parallel data for some language pairs
[Results tables: BLEU per language pair for PNMT, GML, and the CPG variants (CPG8, CPG*8, CPG8C4, CPG64C8). IWSLT-17 covers De/En/It/Nl/Ro in all directions under Supervised (100% parallel data), Low-Resource (10% parallel data), and Zero-Shot settings; IWSLT-15 covers En↔Cs, En↔De, En↔Fr, En↔Th, En↔Vi, with per-setting means. Individual scores (e.g., PNMT De→En: 21.78) are not reliably alignable in this transcript.]
Our Scala MT library, together with code to reproduce the experiments, is available; TensorFlow Scala was used for our experiments.
All experiments were run on a machine with a single Nvidia V100 GPU and 24 GB of system memory. The longest experiment required ~10 hours.
Notes on the results:
· Pairwise models were trained without auto-encoding.
· CPG configurations: M=8; M=8 with M'=4; M=64 with M'=8.
· GML denotes the Google multilingual model.
· We learn the language embeddings.
· We choose to make g linear for simplicity and interpretability. Our contribution does not depend on the choice of g; it would be interesting to design models that can use side-information about the languages, where available.
FEATURES
· Scalable: constant number of parameters, O(MP).
· Simple & Multilingual: can be applied to most existing NMT systems with minor changes.
· Semi-Supervised: can use monolingual data by learning to translate back-and-forth, learning language embeddings that encode meaningful priors / language models.
· Zero-Shot: can translate between unsupervised pairs of languages, as long as the languages have been seen in some supervised pairs.
· Adaptable: given a trained model, can adapt to support a new language by learning just the language embedding and fixing the rest of the model.
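The "Adaptable" property can be illustrated with a toy least-squares fit: the generator weights stay frozen and only the new language's embedding is trained. The sizes, the objective, and the training loop below are all hypothetical stand-ins for the real NMT training signal.

```python
import random

# Toy illustration (hypothetical sizes/objective): adapting a trained CPG
# model to a new language by learning ONLY its language embedding, with
# the generator weights W kept frozen.
random.seed(0)
M, P = 3, 5  # embedding size, parameter count (toy)

def matvec(W, l):
    return [sum(w * x for w, x in zip(row, l)) for row in W]

W = [[random.gauss(0.0, 1.0) for _ in range(M)] for _ in range(P)]  # frozen
theta_target = matvec(W, [0.5, -1.0, 2.0])  # stand-in "ideal" parameters

def loss(l):
    return sum((a - b) ** 2 for a, b in zip(matvec(W, l), theta_target))

l_new = [0.0] * M  # the only trainable variable
initial_loss = loss(l_new)
lr = 0.05
for _ in range(500):  # gradient descent on 1/2 * squared error
    err = [a - b for a, b in zip(matvec(W, l_new), theta_target)]
    for j in range(M):
        grad_j = sum(err[i] * W[i][j] for i in range(P))
        l_new[j] -= lr * grad_j
final_loss = loss(l_new)
```

Because only M numbers are trained while W is fixed, adapting to a new language is far cheaper than retraining the model, which is the point of the feature.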