
Language Modeling with Deep Transformers

Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney

Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany

INTERSPEECH 2019, Graz, Austria
Neural Networks for Language Modeling [Thu-O-10-1], September 19, 2019


Introduction

• 2017: Advent of Transformer [Vaswani & Shazeer+ 17] in NLP/beyond.

• Originally an encoder-decoder model for machine translation.

• Decoder component: language model

– Early work in text generation (5 layers) [Liu & Saleh+ 18] ICLR 2018

• Gained popularity more recently:

– Google 64-layer Transformer character LM [Al-Rfou & Choe+ 19] AAAI 2019

– OpenAI GPT-2 LM (48 layers) [Radford & Wu+ 19] Blog February 2019

• Large scale language model pre-training at the center of interest in NLP.

– Nvidia, Megatron LM (72 layers) Blog August 2019

– Salesforce, Controllable Transformer LM (48 layers) Last week!


Contributions of this work

• Application of Transformer language models to ASR

– Successful training of deep and powerful Transformer language models.
– Evaluation in both hybrid and attention-based end-to-end ASR.

– Large improvements over the state-of-the-art LSTM LM.

• Comprehensive hyper-parameter tuning

– Crucial for studying a new model.
– In particular for Transformers, which have many hyper-parameters.

• Demonstration of an LM-specific property of Transformers

– The LM task automatically provides positional information: no extra positional signal is needed.

• Analysis and visualization

• Release of model configurations and checkpoints (link in the paper):
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers

Open-source toolkit RETURNN [Zeyer & Alkhouli+ 18]


Transformer Language Model

• Stack L layers, each consisting of a self-attention module and a feed-forward module.
• Apply residual connections and layer normalization across modules.

• Self-attention typically has multiple attention heads.

[Diagram: one Transformer LM layer, built from a self-attention module and a feed-forward module with layer normalization and residual connections; positional encoding is added at the input.]
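A minimal sketch of one such layer in PyTorch (illustrative only; the paper's models are built in RETURNN, and details such as the exact placement of layer normalization and dropout are assumptions here):

```python
import torch
import torch.nn as nn

class TransformerLMLayer(nn.Module):
    """One Transformer LM layer: self-attention followed by a feed-forward
    module, each wrapped in a residual connection with layer normalization."""

    def __init__(self, d_res: int = 512, d_ff: int = 2048, n_heads: int = 8,
                 dropout: float = 0.1):
        super().__init__()
        self.ln_att = nn.LayerNorm(d_res)
        self.att = nn.MultiheadAttention(d_res, n_heads, dropout=dropout)
        self.ln_ff = nn.LayerNorm(d_res)
        self.ff = nn.Sequential(nn.Linear(d_res, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_res))

    def forward(self, x: torch.Tensor, causal_mask: torch.Tensor) -> torch.Tensor:
        # Self-attention over the (causally masked) history, residual connection.
        h = self.ln_att(x)
        a, _ = self.att(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + a
        # Position-wise feed-forward module, residual connection.
        x = x + self.ff(self.ln_ff(x))
        return x
```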


Experimental Setups

LibriSpeech dataset [Panayotov & Chen+ 15]:
• 960h of audio, read-speech transcriptions.
• Large LM task: 200K vocabulary, 800M words of extra textual training data.

Language modeling for speech recognition in 2 settings:

• Word-level models for the conventional hybrid HMM/NN system, via lattice rescoring [Sundermeyer & Tüske+ 14]: push-forwarding Transformer states instead of LSTM states.

• BPE subword-level models for the end-to-end attention-based system, via shallow fusion [Gulcehre & Firat+ 17, Toshniwal & Kannan+ 18] (see the sketch below).
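Shallow fusion combines the decoder's score with the external LM score at every beam-search step via a log-linear interpolation. A minimal sketch of the score combination (the function name and weight value are illustrative, not the exact setup from the paper):

```python
import torch

def shallow_fusion_step(decoder_log_probs: torch.Tensor,
                        lm_log_probs: torch.Tensor,
                        lm_scale: float = 0.3) -> torch.Tensor:
    """Per-step hypothesis scores: decoder log-probability plus a scaled
    external-LM log-probability, both of shape (beam_size, vocab_size)."""
    return decoder_log_probs + lm_scale * lm_log_probs

# During beam search, the next candidates are then picked from the fused scores:
# scores = shallow_fusion_step(dec_lp, lm_lp)
# best_scores, best_indices = scores.view(-1).topk(beam_size)
```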

Intensive tuning of the baseline LSTM LM [Sundermeyer & Schlüter+ 12]:

• All tuning details are provided in the paper.
• A wide model gave the best results: 2 layers with 4096 LSTM nodes.
• Relative improvement in perplexity over the 4-gram model of about 58%.


Optimization of Transformer models

The exhaustive list of hyper-parameters is long:
• Number of layers and dimension of the residual connection.
• (Dimension of the input word embeddings.)
• For each layer: number of attention heads, dimension of the key and query, dimension of the value, and dimension of the feed-forward layer.

To reduce this complexity:
• Use the same dimension for key, query, value, and the residual connection.
• Use the same dimensions across all layers.

Four hyper-parameters then describe all our models (a rough size estimate is sketched below):

• Number of layers L.
• Feed-forward dimension dff.
• Residual dimension dres.
• Number of attention heads H.
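Together with the vocabulary size and the embedding dimension, these four values roughly determine the model size. A back-of-the-envelope estimate (my own simplification: biases, layer-norm parameters, and a separate input embedding dimension are ignored):

```python
def approx_param_count(L: int, d_ff: int, d_res: int, vocab: int = 200_000) -> int:
    """Rough parameter count of a Transformer LM.

    Note: H does not appear, since the attention heads partition d_res
    (d_k = d_v = d_res / H), so the projection sizes are unchanged.
    """
    per_layer = 4 * d_res * d_res + 2 * d_res * d_ff  # Q, K, V, O projections + feed-forward
    embeddings = 2 * vocab * d_res                    # input embedding + output softmax matrix
    return L * per_layer + embeddings

# approx_param_count(24, 2048, 512) gives about 280M, close to the 281M
# reported for the 24-layer model on the next slide.
```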


Effect of Depth and Width (Highlight)

Perplexity after 2.5 epochs (H = 8, dres = 512).

 L    dff      Params (M)   Train PPL   Dev PPL
12    2,048       243          67.6       67.1
24    2,048       281          62.2       62.3
42    2,048       338          59.0       59.6
 6    8,192       262          66.7       66.7
12    4,096       268          63.5       63.8
 4   16,384       277          67.6       67.4
 4   32,768       344          65.4       68.4

• For a given parameter budget, deep models tend to perform better.

Full tuning details in the paper!

• Effect of the number of heads: more heads help up to 16, but 8 is already good.
• Effect of the activation (ReLU, GeLU, GLU): the standard ReLU is fine!
• Parameter tying (Universal Transformers): improvements without extra parameters (see the sketch below)!
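Parameter tying in the Universal Transformer sense means re-applying the same layer weights at every depth step. A minimal sketch of the simplest variant, full tying across all layers (re-using the layer class from the earlier sketch; other tying granularities are possible):

```python
import torch.nn as nn

class TiedTransformerStack(nn.Module):
    """Depth without extra parameters: the same layer is applied repeatedly."""

    def __init__(self, layer: nn.Module, num_steps: int):
        super().__init__()
        self.layer = layer          # one shared TransformerLMLayer instance
        self.num_steps = num_steps  # effective depth

    def forward(self, x, causal_mask):
        for _ in range(self.num_steps):
            x = self.layer(x, causal_mask)  # identical weights at every step
        return x
```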


Optimization of Transformer models: Final results

Further scaling up:

Best model: 96-layer model (L = 96, dff = 2048, dres = 512, H = 8)

(A 112-layer model performed even slightly better after the camera-ready deadline.)

Final perplexity on LibriSpeech 200K vocab word level.

LM            Params (M)   Dev PPL   Test PPL
4-gram            230         146        152
LSTM             1048          60         63
Transformer       431          54         56

Large improvements over the highly optimized LSTM LM:

• About 11% relative improvement in perplexity (quick check below).
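A quick check of this figure from the dev and test perplexities in the table above:

```latex
\frac{\text{PPL}_{\text{LSTM}} - \text{PPL}_{\text{Trafo}}}{\text{PPL}_{\text{LSTM}}}:
\quad \frac{60 - 54}{60} = 10\% \ \text{(dev)}, \qquad
\frac{63 - 56}{63} \approx 11\% \ \text{(test)}.
```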


Do we need extra positional encoding in the Transformer LM?

• The amount of context information increases at each time step in an LM: an implicit position signal?
• Our finding: external positional encoding is unnecessary (see the illustration below).

– Even slight improvements in perplexity without positional encoding.

• Attention in the first layer (all 8 heads per target word position shown)

[Figure: first-layer attention weights, all 8 heads at each target word position, over the example sentence "<bos> so they went on to the verandah and looked down upon the lights of the prison and listened to the sea lapping the shore <eos>"; one panel with positional encoding, one without. Axes: target word vs. input word.]
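A small illustration of why the position signal is already there (plain PyTorch, not from the paper): with the causal mask of an autoregressive LM, the query at position t attends over exactly t + 1 tokens, so the amount of available context itself encodes the position.

```python
import torch

T = 6  # sequence length of a toy example
# Causal mask: position t may only attend to positions <= t.
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
print(causal_mask.sum(dim=-1))
# tensor([1, 2, 3, 4, 5, 6]): the context size grows with the position,
# which is enough for the model to infer positional information.
```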


Other Layers: 3 categories

Analysis for the 24-layer model; the observations also hold for deeper models.

• There are 3 functional groups of layers.

[Figure: three attention weight maps, one per group of layers, over the same example sentence as on the previous slide.]

• Bottom layers (2-3): "Blur". Roughly an average over all positions; bag-of-words, global information. Some heads focus on difficult words, here "verandah".

• Mid layers (4-9): "Window". Focus on the local n-gram.

• Top layers (10-24): "Structured". Attend to some specific patterns; feature detectors.


Speech Recognition Experiments: Conventional Hybrid System

WERs (%) for hybrid systems on LibriSpeech 960h.

• First-pass decoding generates lattices.
• The lattices are then rescored (denoted by →) with the LSTM or Transformer (Trafo) LM.

Language Model   Prm. (M)   dev clean    dev other    test clean   test other
                            PPL   WER    PPL   WER    PPL   WER    PPL   WER
4-gram              230     152   3.4    141   8.3    158   3.8    146   8.8
→ LSTM             1048      60   2.3     60   5.4     65   2.6     62   5.9
→ Transformer       431      53   2.1     54   5.2     58   2.5     55   5.6
LSTM → Trafo          -       -   1.9      -   4.5      -   2.3      -   5.0

Large improvements over the highly optimized LSTM LM:
• The 10% relative improvement in perplexity translates into 4% to 10% relative improvement in WER.

This defines new state-of-the-art results [Lüscher & Beck+ 19] on LibriSpeech 960h.


Speech Recognition Experiments: Attention-based System

WERs (%) for attention-based models on the LibriSpeech 960h dataset. Perplexities are on the 10K BPE level.

Language Model   Beam   dev clean    dev other    test clean   test other
                        PPL   WER    PPL   WER    PPL   WER    PPL   WER
None               12     -   4.3      -  12.9      -   4.4      -  13.5
LSTM               64    44   2.9     46   8.9     47   3.2     47   9.9
Transformer        64    36   2.6     39   8.4     39   2.8     39   9.3

• Following [Hannun & Lee+ 19] (Interspeech 2019): larger beam size and end-of-sentence penalty.

• Again, large improvements over the LSTM baseline.

• Best reported WERs for end-to-end systems without data augmentation such as SpecAugment [Park & Chan+ 19] (Interspeech 2019).

• Available on: https://github.com/rwth-i6/returnn-experiments


Conclusion

Summary

• Successfully trained deep Transformer LMs with excellent performance for ASR.
• Demonstrated that positional encoding is not needed for Transformer LMs.

• Visualized and identified hierarchical feature engineering inside Transformer language models, with links to some fundamental concepts in language modeling:

– N-gram, bag-of-words, and, in the top layers, max-entropy-model-style features (but data-driven)?

Future work

• Further scaling up (layer-wise training).
• Reduce the memory requirements of Transformers.

• More study on scalability of Transformer vs. LSTM vs. amount of data.

• For LSTMs: deeper (and wider) models with residual connections and layer normalization, e.g. RNMT+ [Chen & Firat+ 18]?


Thank you for your attention.

Thanks to: Eugen Beck, Liuhui Deng, Christoph Lüscher, Arne Nix, Julian Schamper, and Wei Zhou.


References

[Al-Rfou & Choe+ 19] R. Al-Rfou, D. Choe, N. Constant, M. Guo, L. Jones.

Character-level language modeling with deeper self-attention.

In Proc. AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, Jan. 2019.

[Chen & Firat+ 18] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, Z. Chen, Y. Wu, M. Hughes.

The best of both worlds: Combining recent advances in neural machine translation.

In Proc. ACL, pp. 76–86, Melbourne, Australia, July 2018.

[Gulcehre & Firat+ 17] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, Y. Bengio.

On using monolingual corpora in neural machine translation.

Computer Speech & Language, Vol. 45, pp. 137–148, Sept. 2017.

[Hannun & Lee+ 19] A. Hannun, A. Lee, Q. Xu, R. Collobert.

Sequence-to-sequence speech recognition with time-depth separable convolutions.

arXiv preprint arXiv:1904.02619, 2019.

[Liu & Saleh+ 18] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Ł. Kaiser, N. Shazeer.

Generating Wikipedia by summarizing long sequences.

In Int. Conf. on Learning Representations (ICLR), Vancouver, Canada, April 2018.

[Lüscher & Beck+ 19] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, H. Ney.

RWTH ASR systems for LibriSpeech: Hybrid vs Attention.

In Proc. Interspeech, Graz, Austria, Sept. 2019.

[Panayotov & Chen+ 15] V. Panayotov, G. Chen, D. Povey, S. Khudanpur.

LibriSpeech: an ASR corpus based on public domain audio books.

In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, South Brisbane, Queensland, Australia, April 2015.

[Park & Chan+ 19] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le.

SpecAugment: A simple data augmentation method for automatic speech recognition.

In Proc. Interspeech, Graz, Austria, 2019.

[Radford & Wu+ 19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever.

Language models are unsupervised multitask learners.

[Online]: https://blog.openai.com/better-language-models/, 2019.


[Sundermeyer & Schlüter+ 12] M. Sundermeyer, R. Schlüter, H. Ney.

LSTM neural networks for language modeling.

In Proc. Interspeech, pp. 194–197, Portland, OR, USA, Sept. 2012.

[Sundermeyer & Tüske+ 14] M. Sundermeyer, Z. Tüske, R. Schlüter, H. Ney.

Lattice decoding and rescoring with long-span neural network language models.

In Proc. Interspeech, pp. 661–665, Singapore, Sept. 2014.

[Toshniwal & Kannan+ 18] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, K. Livescu.

A comparison of techniques for language model integration in encoder-decoder speech recognition.

In Proc. SLT, Athens, Greece, Dec. 2018.

[Vaswani & Shazeer+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin.

Attention is all you need.

In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008. Long Beach, CA, USA, Dec. 2017.

[Zeyer & Alkhouli+ 18] A. Zeyer, T. Alkhouli, H. Ney.

RETURNN as a generic flexible neural toolkit with application to translation and speech recognition.

In Proc. ACL (System Demonstrations), Melbourne, Australia, July 2018.
