WNGT 2020 Efficiency Shared Task
Kenneth Heafield,1 Yusuke Oda, Graham Neubig
https://www.aclweb.org/anthology/2020.ngt-1.1
https://sites.google.com/view/wngt20/efficiency-task
1 Corruptly, both organizer and participant.
Outline: Task Definition, Efficiency Results, Latency, Non-autoregressive, Recommendation
Goal: Efficient Machine Translation
Present task: inference → production
Future task: efficient training?
Data Condition
WMT 2019 English–German constrained news task.
State-of-the-art systems submit to the latest WMT
⇒ There is no such thing as state-of-the-art on WMT14!
Also, recycle WMT 2019 systems as teachers.
Awkward timing with WMT
2020 training data not final at start, test set unavailable at end.
Root cause: WNGT at ACL, WMT at EMNLP.
Coordinate with WMT more?
Test Set
Last year: ≈1 s to translate ⇒ too small. Banned a team for memorizing the known test set.
Before deadline: 1 million sentences, ≤100 space-separated words/sentence, unspecified test set hidden in the input.
After deadline: WMT plus filler (EMEA, Tatoeba, German Federal), shuffled; also score parallel filler data.
http://data.statmt.org/heafield/wngt20/test/
Approximately Measuring Quality
Need a surprise evaluation set. WMT20 not ready yet.
→ Uh, average old WMT test sets?
→ WMT12 has sentences longer than 100 words.
→ WMT1*: average sacrebleu of WMT11, WMT13–19.
See the paper supplement for individual WMT scores.
Problem: participants likely tuned on WMT sets.
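As a sketch of how the WMT1* average works: take per-test-set sacrebleu scores (WMT12 excluded) and average them. The scores below are placeholder numbers for illustration, not results from any submission:

```python
from statistics import mean

# WMT1* = unweighted mean of sacrebleu scores on WMT11 and WMT13-WMT19.
# WMT12 is excluded because it contains sentences longer than 100 words.
# These per-set scores are illustrative placeholders, not real results.
scores = {
    "wmt11": 30.0, "wmt13": 31.0, "wmt14": 29.0, "wmt15": 32.0,
    "wmt16": 34.0, "wmt17": 30.0, "wmt18": 39.0, "wmt19": 37.0,
}
assert "wmt12" not in scores  # WMT12 is deliberately left out
wmt1_star = mean(scores.values())
print(round(wmt1_star, 2))  # 32.75 with these placeholder numbers
```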
BLEU?
“use human evaluation to verify claims in experiments that use metrics such as BLEU” –Reviewer of my EU project
“BLEU has been surpassed by various other metrics” –Mathur et al., ACL 2020
→ Submitted fast Czech systems to WMT20 with Charles University.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825303.
Hardware
Recent hardware with 8-bit optimization:
GPU: NVIDIA T4 (g4dn.xlarge on Amazon Web Services, $0.526/hr)
CPU: Intel Xeon Platinum 8275CL (Cascade Lake), dual socket (c5.metal on Amazon Web Services, $4.08/hr)
Single-core and all-core tracks (48 physical cores)
Provided credits for participants to develop with.
Amazon, Intel, and NVidia have contributed to my research.
Three teams
Multiple submissions encouraged!

            GPU   CPU 1 core   CPU all core
NiuTrans      4            0              1
OpenNMT       4            4              4
UEdin         4            2              5
UEdin’s CPU submissions had a memory leak → shown with/without fix.
Pareto Comparison
Submissions have varying quality and efficiency.
Unclear how much quality loss to tolerate.
Pareto comparison: quality ≥ baseline and efficiency ≥ baseline.
More efficient with same quality. . . or better quality with same efficiency.
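A minimal sketch of that comparison, assuming each system is summarized as a (quality, efficiency) pair where higher is better on both axes; the points below are hypothetical, not actual submissions:

```python
def dominates(a, b):
    """System a Pareto-dominates b if it is at least as good on every
    axis and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(systems):
    """Keep the systems that no other system dominates."""
    return [s for s in systems
            if not any(dominates(t, s) for t in systems if t is not s)]

# Hypothetical (BLEU, thousand words/sec) points for illustration.
points = [(36.0, 5.0), (34.0, 20.0), (33.0, 10.0), (30.0, 25.0)]
print(pareto_frontier(points))  # (33.0, 10.0) is dominated by (34.0, 20.0)
```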
Speed
Primary metric: wall-clock time.
Words per second based on 15,048,961 untokenized words.
Supplementary data: CPU time.
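Both the throughput metric here and the cost metric later in the deck follow directly from wall-clock time; in this sketch the word count and hourly price come from the slides, while the 600-second wall time is a made-up example, not a measured result:

```python
# Words per second and million words per USD from one (hypothetical) run.
words = 15_048_961          # untokenized words in the test set (from the slides)
wall_seconds = 600.0        # illustrative wall-clock time, not a measurement
price_per_hour = 0.526      # g4dn.xlarge (NVIDIA T4) on AWS, USD/hr

words_per_second = words / wall_seconds
cost_usd = price_per_hour * wall_seconds / 3600.0
million_words_per_usd = words / cost_usd / 1e6

print(round(words_per_second), round(million_words_per_usd, 1))
```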
GPU speed
[Plot: WMT1* BLEU vs. thousand words per real second on GPU; systems: NiuTrans, OpenNMT, UEdin]
CPU single core speed
[Plot: WMT1* BLEU vs. thousand words per real second on one CPU core; systems: OpenNMT, UEdin, UEdin Fix]
CPU all core speed
[Plot: WMT1* BLEU vs. thousand words per real second on all CPU cores; systems: NiuTrans, OpenNMT, UEdin]
Cost
[Plot: WMT1* BLEU vs. million words per USD; systems: NiuTrans (GPU), OpenNMT (GPU), UEdin (GPU and CPU submissions)]
Disk
Model size: parameters, BPE, shortlists, etc.
Total Docker size: model, part of Ubuntu, code.
OpenNMT won Docker size with 122–308 MB; others 432–933 MB.
Model size, all platforms
[Plot: WMT1* BLEU vs. model size (MB); systems: NiuTrans, OpenNMT, UEdin]
Peak RAM usage
GPU: polling nvidia-smi
CPU: memory.max usage in bytes
GPU RAM
[Plot: WMT1* BLEU vs. GPU RAM (GB); systems: NiuTrans, OpenNMT, UEdin]
CPU single core RAM
[Plot: WMT1* BLEU vs. CPU RAM (GB), single core; systems: OpenNMT, UEdin, UEdin Fix]
CPU all core RAM
[Plot: WMT1* BLEU vs. CPU RAM (GB), all cores; systems: NiuTrans, OpenNMT, UEdin, UEdin Fix]
Efficiency Task
All participants had something Pareto optimal.
System descriptions:https://sites.google.com/view/wngt20/programme
I am opening the task for rolling submission.
What’s missing
Allowed batching in all conditions.
→ What about latency?
Where are the non-autoregressive people?
→ Non-autoregressive: a case study in poor evaluation.
Latency
Latency is the average time to translate one sentence.
Experiments with Edinburgh’s systems; sorry I asked too late.
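Under that definition, latency is just the mean per-sentence wall-clock time at batch size 1; the timings in this sketch are illustrative, not measurements of any submitted system:

```python
from statistics import mean

# Per-sentence wall-clock times (seconds) at batch size 1; illustrative only.
per_sentence_seconds = [0.012, 0.010, 0.015, 0.011]

latency_ms = mean(per_sentence_seconds) * 1000.0
print(round(latency_ms, 1))  # mean of the four timings, in milliseconds
```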
Batching is Important for Speed
[Plot: WMT1* BLEU vs. thousand words per second; conditions: batch on GPU, batch on CPU core, single sentence on GPU, single sentence on CPU core]
Latency: 10.3–71.7 ms!
[Plot: WMT1* BLEU vs. latency (ms); conditions: GPU, CPU core]
Autoregressive MT latency is 10.3–71.7 ms, often <30 ms.
So what’s up with this table from Jiatao Gu et al (2018)?
Replicating Gu et al (2018)’s setup
Do not try this at home or work.
WMT14: state-of-the-art is the latest WMT. Just don’t claim state-of-the-art like Wang et al (2018) did.
Tokenized BLEU: tokenization differences ⇒ BLEU scores are not comparable. But many non-autoregressive papers compare anyway. Use sacrebleu instead.
P100, latency on IWSLT 2016 en-de dev.
Real baselines for Gu et al (2018)
[Plot: weirdly tokenized WMT14 BLEU vs. latency (ms) on P100, IWSLT 2016 dev; series: Gu et al (2018) autoregressive, Gu et al (2018) non-autoregressive, Marian 2018, Marian 2019]
Research doesn’t have to be state-of-the-art.
Just mention stronger baselines, not 60× weaker straws.
Arguably this is what the shared task explores.
Here are some easy things you could have done:
1. Model distillation for autoregressive, since it’s used for non-autoregressive
2. Use 1–2 decoder layers in autoregressive models
3. Averaged attention network
Recommendations
Use sacrebleu.
You can’t compare against a paper that didn’t.
You don’t have to be state-of-the-art.
Just cite it or put it in your table.
Strawman baselines are misleading.
Lots of baselines
1. Fewer parameters or layers
2. Quantize
3. Prune
4. Beam size
5. Shortlisting
6. Simplify architecture
7. Model distillation
8. Early exit
9. Non-autoregressive
Show your method is a better trade-off via Pareto optimality.
Don’t trust papers that get X speedup for “small” Y BLEU loss!
Conclusion
Currently no evidence that non-autoregressive is competitive.
We’re implementing it in Marian.
WNGT 2020 efficiency task is rolling, send me dockers!