Contextual Parameter Generation for Universal Neural Machine Translation
Emmanouil Antonios Platanios · Mrinmaya Sachan · Graham Neubig · Tom M. Mitchell

LEGEND
- text: example input
- M: language embedding size
- W: word embedding size
- P: number of parameters
- Trainable variables
- Computed values

UNIVERSAL: picks the same parameters for all languages.
PER-LANGUAGE: picks different encoder/decoder parameters based on the source and target languages.
PAIRWISE: picks a different parameter set based on the language pair.
The proposed abstraction is a generalization over previous methods.
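To make that abstraction concrete, here is a minimal Scala sketch (names and types are ours, not from the paper's code): every scheme is just a different choice of the function that maps the source and target languages to model parameters, and the proposed contextual generator is one more such choice.

```scala
// Sketch of the shared abstraction: each scheme is a different parameter-selection function.
object ParameterSelection {
  type Params = Array[Double]   // stand-in for a flat vector of all P model parameters
  type Lang   = String

  trait ParameterGenerator {
    def apply(source: Lang, target: Lang): (Params, Params)  // (encoder params, decoder params)
  }

  // UNIVERSAL: the same parameters regardless of the languages involved.
  class Universal(enc: Params, dec: Params) extends ParameterGenerator {
    def apply(source: Lang, target: Lang): (Params, Params) = (enc, dec)
  }

  // PER-LANGUAGE: one encoder per source language, one decoder per target language.
  class PerLanguage(encoders: Map[Lang, Params], decoders: Map[Lang, Params]) extends ParameterGenerator {
    def apply(source: Lang, target: Lang): (Params, Params) = (encoders(source), decoders(target))
  }

  // PAIRWISE: a completely separate model for every (source, target) pair.
  class Pairwise(models: Map[(Lang, Lang), (Params, Params)]) extends ParameterGenerator {
    def apply(source: Lang, target: Lang): (Params, Params) = models((source, target))
  }

  // CONTEXTUAL (proposed): parameters are generated from learned language embeddings.
  class Contextual(embeddings: Map[Lang, Array[Double]],
                   g: (Array[Double], Array[Double]) => (Params, Params)) extends ParameterGenerator {
    def apply(source: Lang, target: Lang): (Params, Params) = g(embeddings(source), embeddings(target))
  }
}
```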

IWSLT-15 (En↔Cs, En↔De, En↔Fr, En↔Th, En↔Vi): ~90,000-220,000 train / ~500-900 val / ~1,000 test (per language pair).

IWSLT-17 (all directed pairs among De, En, It, Nl, Ro): ~220,000 train / ~900 val / ~1,100 test (per language pair).

[Results tables: BLEU scores on both datasets for the pairwise baseline (PNMT), the Google Multilingual baseline (GML), and the proposed CPG models; reference numbers from prior work shown for comparison: 25.87 [Ha16] and 28.07 [Huang18].]

Problem

English: "How are you?"  →  MT System  →  Greek: "Πώς είσαι?"

Translate from one language to another.

A multilingual MT system can translate between any pair of languages.

Assuming L languages and P parameters in a pairwise MT model, we can use:

PER-LANGUAGE ENCODER/DECODER [Luong16, Firat16]
One encoder per source language and one decoder per target language (the figure shows English, German, Japanese, Greek, Hindi, Chinese).
· O(LP) parameters
· Limited parameter sharing, and use of attention is difficult

UNIVERSAL [Ha16, Johnson17]
One shared model for all languages.
· O(P) parameters
· Lacks language-specific parameterization

PAIRWISE
A separate model per language pair.
· O(L²P) parameters
· No parameter sharing
· Bad for limited/no training data
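To put these asymptotic costs side by side, here is a small back-of-the-envelope Scala sketch; the per-model parameter count p is a made-up illustrative value, and the CPG line uses the O(MP) count claimed in the Features section with the M = 8 setting used in the experiments.

```scala
// Rough parameter-count comparison of the schemes above (illustrative numbers only).
object ParameterCounts {
  def main(args: Array[String]): Unit = {
    val L = 6               // number of languages, as in the figures
    val M = 8               // language embedding size used in the experiments
    val p = 100000000L      // hypothetical parameters in one pairwise model (~100M)

    val pairwise    = L.toLong * (L - 1) * p  // O(L^2 P): one model per ordered language pair
    val perLanguage = L.toLong * p            // O(L P): roughly one encoder/decoder per language
    val universal   = p                       // O(P): a single shared model
    val cpg         = M.toLong * p            // O(M P): contextual parameter generation (constant in L)

    println(f"pairwise:     $pairwise%,d")
    println(f"per-language: $perLanguage%,d")
    println(f"universal:    $universal%,d")
    println(f"CPG:          $cpg%,d")
  }
}
```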

Proposed Approach

[Architecture figure: a standard encoder-decoder NMT system translating English "Thank you very much." into Vietnamese "Cám ơn rất nhiều.", with vocabulary lookup, word embeddings (size W), and intermediate representations; the encoder and decoder parameters (P in total) are produced from the source and target language embeddings (size M).]

PARAMETER GENERATOR
Generates model parameters at inference time, given some context. The source and target language represent the context in which translation happens.
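As a deliberately tiny illustration of this idea, the sketch below implements the linear generator described on this poster in plain Scala: each language has a learned embedding of size M, and a parameter vector is produced as a linear combination of the M columns of a trainable matrix. Names, sizes, and the random initialization are ours, not the authors' code.

```scala
// Minimal sketch of the linear contextual parameter generator (illustrative, not the authors' code).
object LinearCPG {
  type Vec = Array[Double]

  // theta = W * l, where W has P rows and M columns.
  def generate(W: Array[Vec], l: Vec): Vec =
    W.map(row => row.zip(l).map { case (w, x) => w * x }.sum)

  def main(args: Array[String]): Unit = {
    val M = 8                                                   // language embedding size
    val P = 4                                                   // tiny stand-in for the real parameter count
    val rng = new scala.util.Random(0)
    val Wenc = Array.fill(P)(Array.fill(M)(rng.nextGaussian())) // trainable generator weights (encoder)
    val Wdec = Array.fill(P)(Array.fill(M)(rng.nextGaussian())) // trainable generator weights (decoder)
    val lEn  = Array.fill(M)(rng.nextGaussian())                // learned English embedding
    val lVi  = Array.fill(M)(rng.nextGaussian())                // learned Vietnamese embedding

    // Encoder parameters depend only on the source language, decoder parameters
    // only on the target language (the encoder and decoder are decoupled).
    val encoderParams = generate(Wenc, lEn)
    val decoderParams = generate(Wdec, lVi)
    println(encoderParams.mkString(", "))
    println(decoderParams.mkString(", "))
  }
}
```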

OBSERVATIONS
- The parameters often have some "natural grouping" (e.g., first-layer weights).
- Language embeddings represent all language-specific information and may need to be large.
- Only a small part of this information is relevant for each "group".

CONTROLLED SHARING
Let θ_enc = (θ_enc^(1), ..., θ_enc^(G)), where θ_enc^(g) denotes the parameters in group g and G is the number of groups. Then:

θ_enc^(g) = W_enc^(g) P_enc^(g) l_s,

where l_s ∈ ℝ^M is the source language embedding, W_enc^(g) ∈ ℝ^(|θ_enc^(g)| × M'), P_enc^(g) ∈ ℝ^(M' × M), and M' < M; similarly for the decoder.
↑ M ⇔ ↑ per-language information; ↑ M' ⇔ ↑ shared information.
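The following sketch shows one plausible reading of that factorization (a reconstruction under our own assumptions, not the authors' code): parameters are split into G groups, and each group's parameters are generated from a group-specific M'-dimensional projection of the M-dimensional language embedding.

```scala
// Sketch of controlled parameter sharing: per-group low-rank projection of the language embedding.
object ControlledSharing {
  type Vec = Array[Double]
  type Mat = Array[Vec]

  def matVec(a: Mat, x: Vec): Vec =
    a.map(row => row.zip(x).map { case (w, v) => w * v }.sum)

  def main(args: Array[String]): Unit = {
    val M  = 8   // language embedding size
    val Mp = 4   // M' < M: group-specific projection size
    val G  = 3   // number of parameter groups (e.g., first-layer weights, attention, ...)
    val groupSizes = Array(10, 6, 4)             // hypothetical number of parameters per group

    val rng = new scala.util.Random(0)
    def randMat(rows: Int, cols: Int): Mat = Array.fill(rows)(Array.fill(cols)(rng.nextGaussian()))

    val l = Array.fill(M)(rng.nextGaussian())    // a language embedding
    val P = Array.fill(G)(randMat(Mp, M))        // per-group projections, R^{M' x M}
    val W = groupSizes.map(s => randMat(s, Mp))  // per-group generators, R^{|group| x M'}

    // Generate the parameters of each group from the projected language embedding.
    val params: Array[Vec] = (0 until G).map(g => matVec(W(g), matVec(P(g), l))).toArray
    params.zipWithIndex.foreach { case (p, g) => println(s"group $g: ${p.length} parameters") }
  }
}
```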

We also decouple the encoder and the decoder, thus getting closer to a potential interlingua: the encoder parameters depend only on the source language and the decoder parameters only on the target language.

For each language, the parameters are defined as a linear combination of the M columns of a weight matrix, which makes for better interpretability.

[Figure: the parameter generator wraps a baseline neural MT system; the encoder is parameterized by the source language embedding and the decoder by the target language embedding (both of size M).]

Experiments

Baseline Model
- 2-layer bidirectional LSTM encoder
- 2-layer LSTM decoder
- 512 units per layer / word embedding size
- Per-language vocabulary of the 20,000 most frequent words (no BPE)

Settings
- Supervised: train using the full parallel data
- Low-Resource: limit the size of the parallel data
- Zero-Shot: no parallel data for some language pairs
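For reference, the baseline above can be summarized as a small configuration object; the field names below are ours (a sketch, not the API of the released Scala library).

```scala
// Hypothetical configuration mirroring the baseline model described above.
case class BaselineConfig(
  encoderLayers: Int            = 2,     // 2-layer bidirectional LSTM encoder
  decoderLayers: Int            = 2,     // 2-layer LSTM decoder
  hiddenSize: Int               = 512,   // units per layer, also used as the word embedding size (W)
  vocabularySize: Int           = 20000, // per-language vocabulary of most frequent words, no BPE
  languageEmbeddingSize: Int    = 8,     // M = 8 in the main CPG runs (M = 64 with M' = 8 also tested)
  bidirectionalEncoder: Boolean = true
)

object ShowBaselineConfig {
  def main(args: Array[String]): Unit = println(BaselineConfig())
}
```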

[Results tables (BLEU): IWSLT-17 (all directed pairs among De, En, It, Nl, Ro, plus their mean) and IWSLT-15 (En↔Cs, En↔De, En↔Fr, En↔Th, En↔Vi, plus their mean), under supervised (100% parallel data), low-resource (10% parallel data), and zero-shot settings. Models compared: PNMT (pairwise models), GML (Google Multilingual), and CPG variants (M = 8, including one trained without auto-encoding; M = 8 with M' = 4; M = 64 with M' = 8). Example entry: De→En, PNMT, 21.78 BLEU.]

Scala MT library and code to reproduce the experiments.
TensorFlow Scala was used for our experiments.

All experiments were run on a machine with a single Nvidia V100 GPU and 24 GB of system memory. The longest experiment required ~10 hours.


We learn the language embeddings.

We choose to make g linear for simplicity and interpretability. Our contribution does not depend on the choice of g: it would be interesting to design models that can use side-information about the languages, when such information is available.

FEATURES
- Scalable: constant number of parameters, O(MP).
- Simple & Multilingual: can be applied to most existing NMT systems with minor changes.
- Semi-Supervised: can use monolingual data by learning to translate back-and-forth → learns language embeddings that encode meaningful priors / language models.
- Zero-Shot: can translate between unsupervised pairs of languages, as long as the languages have been seen in some supervised pairs.
- Adaptable: given a trained model, can support a new language by learning just its language embedding and keeping the rest of the model fixed (see the sketch below).
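To make the adaptability point concrete, here is a toy Scala sketch of that procedure under our own simplifying assumptions: the trained generator matrix W is frozen, a fresh embedding for the new language is the only trainable variable, and a squared error against a target parameter vector stands in for the real translation loss.

```scala
// Toy illustration of adapting a trained CPG model to a new language:
// keep the generator matrix W fixed and learn only the new language embedding
// by gradient descent. The squared loss is a stand-in for the actual translation loss.
object AdaptNewLanguage {
  type Vec = Array[Double]
  type Mat = Array[Vec]

  def matVec(a: Mat, x: Vec): Vec = a.map(r => r.zip(x).map { case (w, v) => w * v }.sum)
  def matTVec(a: Mat, y: Vec): Vec =                            // transpose(A) * y
    (0 until a(0).length).map(j => a.indices.map(i => a(i)(j) * y(i)).sum).toArray

  def main(args: Array[String]): Unit = {
    val M = 8; val P = 20
    val rng = new scala.util.Random(0)
    val W = Array.fill(P)(Array.fill(M)(rng.nextGaussian()))    // trained generator: FIXED during adaptation
    val thetaTarget = Array.fill(P)(rng.nextGaussian())         // stand-in for "good" parameters for the new language

    var l = Array.fill(M)(0.0)                                  // new language embedding: the ONLY trainable variable
    val lr = 0.005
    for (step <- 1 to 500) {
      val residual = matVec(W, l).zip(thetaTarget).map { case (a, b) => a - b }
      val grad = matTVec(W, residual).map(_ * 2.0)              // gradient of ||W l - theta*||^2 w.r.t. l
      l = l.zip(grad).map { case (li, gi) => li - lr * gi }
      if (step % 100 == 0) println(f"step $step: loss = ${residual.map(r => r * r).sum}%.4f")
    }
  }
}
```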