Advanced Pre-training for Language Models
Pre-Training Objectives
● Pre-training with abundantly available data
● Use Self-Supervised Learning
● GPT: Language Modeling
● Predict the next word given the left context
● Word2Vec, BERT, RoBERTa: Masked Language Modeling
● Predict a missing word given its surrounding context (see the sketch below)
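A minimal sketch (not from the slides) of how the two objectives frame their training targets; the token list and the fixed masked position are illustrative assumptions:

```python
# Causal LM (GPT-style) vs. masked LM (BERT-style) training targets.
tokens = ["the", "man", "went", "to", "the", "store"]

# Causal LM: predict the next token from the left context only.
lm_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# e.g. (["the", "man"], "went")

# Masked LM: hide a token and predict it from both sides.
masked_position = 2
mlm_input = tokens[:masked_position] + ["[MASK]"] + tokens[masked_position + 1:]
mlm_target = tokens[masked_position]
# (["the", "man", "[MASK]", "to", "the", "store"], "went")

print(lm_pairs[1])
print(mlm_input, "->", mlm_target)
```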
Next Sentence Prediction (NSP)
https://amitness.com/2020/02/albert-visual-summary/
Train the [CLS] and [SEP] tokens
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
Strategy used in BERT but shown to be ineffective in RoBERTa
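A hedged sketch of how NSP training pairs can be constructed, roughly following the BERT recipe; the toy corpus, sentence splitting, and 50/50 sampling are assumptions:

```python
import random

def make_nsp_example(doc, corpus, rng=random):
    """Pick a sentence and either its true successor (IsNext) or a random
    sentence from a different document (NotNext)."""
    i = rng.randrange(len(doc) - 1)
    first = doc[i]
    if rng.random() < 0.5:
        return first, doc[i + 1], "IsNext"
    other_doc = rng.choice([d for d in corpus if d is not doc])
    return first, rng.choice(other_doc), "NotNext"

corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they cannot fly but swim very well"],
]
print(make_nsp_example(corpus[0], corpus))
```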
Sentence Order Prediction (SOP)
● NSP primarily requires topic prediction
● SOP can be used to focus on inter-sentence coherence (ALBERT)
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = CorrectOrder
Input = [CLS] he bought a gallon [MASK] milk [SEP] the man went to [MASK] store [SEP]
Label = IncorrectOrder
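For contrast, a tiny hedged sketch of SOP pair construction; both sentences always come from the same document, only their order changes:

```python
import random

def make_sop_example(first, second, rng=random):
    """Two consecutive sentences, kept in order or swapped."""
    if rng.random() < 0.5:
        return first, second, "CorrectOrder"
    return second, first, "IncorrectOrder"

print(make_sop_example("the man went to the store", "he bought a gallon of milk"))
```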
Masking Strategy in BERT
● Mask 15% of the input tokens
● Problem 1: Assumes that the [MASK] tokens are independent of each other
● Problem 2: [MASK] tokens never appear during fine-tuning
Masking Strategy in BERT
● Mask 15% of the input tokens
● 80% of the time: Replace the word with the [MASK] token, e.g., my dog is hairy → my dog is [MASK]
● 10% of the time: Replace the word with a random word, e.g., my dog is hairy → my dog is apple
● 10% of the time: Keep the word unchanged, e.g., my dog is hairy → my dog is hairy.
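A minimal Python sketch of this 80/10/10 recipe; the masking rate of 15% matches the slides, while the toy vocabulary is an assumption:

```python
import random

VOCAB = ["my", "dog", "is", "hairy", "apple", "store", "milk"]  # toy vocabulary

def mask_tokens(tokens, mask_rate=0.15, rng=random):
    tokens = list(tokens)
    targets = {}
    for i in range(len(tokens)):
        if rng.random() >= mask_rate:
            continue
        targets[i] = tokens[i]             # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            tokens[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            tokens[i] = rng.choice(VOCAB)  # 10%: replace with a random word
        # remaining 10%: keep the word unchanged
    return tokens, targets

print(mask_tokens(["my", "dog", "is", "hairy"]))
```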
Permutation Modeling using XLNet
The boat was beached on the riverside
Permutation Language Modeling
● Predict words given any possible permutation (factorization order) of the input words
● The surface order of the words is maintained, as at test time (see the sketch below)
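A small sketch of the idea: the input order stays fixed, and a sampled factorization order decides which tokens each prediction may condition on (the example sentence follows the slides):

```python
import random

tokens = ["the", "boat", "was", "beached", "on", "the", "riverside"]
order = list(range(len(tokens)))
random.shuffle(order)  # a random factorization order, e.g. [3, 0, 5, 1, 6, 2, 4]

# For each position, the visible context is every token that comes earlier in
# the sampled factorization order, not earlier in the sentence.
rank = {pos: k for k, pos in enumerate(order)}
for pos, tok in enumerate(tokens):
    visible = [tokens[p] for p in range(len(tokens)) if rank[p] < rank[pos]]
    print(f"predict {tok!r} from {visible}")
```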
Approximate Computation Time
Granularity of Masking
● BERT masks word pieces, but this is sub-optimal
● Philammon → Phil ##am ##mon
● Not much information is gained by predicting at the word-piece level
Granularity of Masking
● BERT-wwm (whole word masking): Always mask entire word
● BERT: Phil ##am [MASK] was a great singer
● BERT-wwm: [MASK] [MASK] [MASK] was a great singer
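A hedged sketch of whole-word masking: word pieces starting with “##” are grouped back onto their preceding piece, and the whole group is masked together:

```python
def whole_word_mask(pieces, word_index):
    """Mask every word piece belonging to the word at position word_index."""
    groups = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and groups:
            groups[-1].append(i)   # continuation piece joins the previous word
        else:
            groups.append([i])     # a new word starts here
    masked = list(pieces)
    for i in groups[word_index]:
        masked[i] = "[MASK]"
    return masked

pieces = ["Phil", "##am", "##mon", "was", "a", "great", "singer"]
print(whole_word_mask(pieces, word_index=0))
# ['[MASK]', '[MASK]', '[MASK]', 'was', 'a', 'great', 'singer']
```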
SpanBERT
● [MASK] Delhi, the Indian capital, is known for its rich heritage.
● Easier to predict “New”, given that we already know “Delhi”
● Instead of masking individual words, mask contiguous spans
● [MASK] [MASK], the Indian capital, is known for its rich heritage.
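A hedged sketch of span masking in the SpanBERT style; the geometric length distribution and its parameters are assumptions for illustration:

```python
import random

def mask_span(tokens, p=0.2, max_len=10, rng=random):
    """Mask one contiguous span whose length is drawn from a clipped geometric
    distribution; return the masked sequence and the span boundaries."""
    length = 1
    while length < max_len and rng.random() > p:
        length += 1
    start = rng.randrange(0, max(1, len(tokens) - length))
    masked = list(tokens)
    for i in range(start, start + length):
        masked[i] = "[MASK]"
    return masked, (start, length)

tokens = "New Delhi , the Indian capital , is known for its rich heritage".split()
print(mask_span(tokens))
```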
Knowledge Masking Strategies in ERNIE
ERNIE 2.0
● Word-Level Pre-training:
  ○ Capitalization Prediction Task
  ○ Token-Document Relation Prediction Task (Frequency)
● Sentence-Level Pre-training:
  ○ Sentence Reordering Task
  ○ Sentence Distance Task
● Semantic-Level Pre-training:
  ○ Discourse Relation Task
  ○ IR Relevance Task
Pre-training Encoder-Decoder models
BERT (encoder-only) vs. GPT (decoder-only)
BART from Facebook
T5 from Google
Byte Pair Encoding (cs224n slides)
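The algorithm itself appeared as figures on the slides; as a reminder, a minimal sketch of BPE merge learning on a toy corpus (word list and number of merges are illustrative):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Start from characters and repeatedly merge the most frequent adjacent pair."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        for symbols in corpus:          # apply the merge everywhere
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == best:
                    symbols[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(learn_bpe(["low", "lower", "lowest", "newer", "wider"], num_merges=5))
```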
Wordpiece model
● Rather than character n-gram counts, uses a greedy approximation to maximizing the language-model log-likelihood to choose the pieces
● Adds the n-gram that maximally reduces perplexity (a small sketch of the scoring idea follows below)
● [link1] and [link2] contain further details
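A hedged sketch of how the selection criterion differs from plain pair counts; the count-ratio score below is a common approximation for WordPiece-style training and is an assumption here, not taken from the slides:

```python
# BPE picks the pair with the highest raw count; WordPiece-style training
# prefers the pair whose merge most improves a unigram-likelihood score,
# often approximated as count(ab) / (count(a) * count(b)).
def wordpiece_score(pair_count, left_count, right_count):
    return pair_count / (left_count * right_count)

# A pair of rare symbols that almost always co-occur beats a pair of very
# frequent symbols that co-occur only incidentally.
print(wordpiece_score(50, 60, 55))       # rare but tightly coupled
print(wordpiece_score(200, 5000, 4000))  # frequent but loosely coupled
```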
Issues with Wordpiece tokenizers
● Handles “sub-words” but not “typos”
● Needs to be re-trained to add new languages
● Takes engineering effort to maintain
Character-level models
CANINE: Character-level Pre-trained Encoder
● Issues:
  ○ Results in much longer sequence lengths
  ○ 143K Unicode characters
● Down-sample input embeddings (convolution)
● Hash functions to reduce the embedding space (see the sketch below)
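A hedged sketch of the hashing idea; summing bucket embeddings from several hash functions is a simplification of what CANINE does, and all sizes are illustrative:

```python
import numpy as np

# Instead of one embedding row per Unicode codepoint (~143K), hash each
# codepoint into a small number of buckets with several hash functions and
# combine the bucket embeddings.
NUM_HASHES, NUM_BUCKETS, DIM = 4, 1000, 16
rng = np.random.default_rng(0)
tables = rng.normal(size=(NUM_HASHES, NUM_BUCKETS, DIM))

def char_embedding(char):
    cp = ord(char)
    vec = np.zeros(DIM)
    for h in range(NUM_HASHES):
        bucket = hash((h, cp)) % NUM_BUCKETS
        vec += tables[h, bucket]
    return vec

print(char_embedding("ñ")[:4])
```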
ByT5: Byte-level Seq2Seq
● Operate directly on byte representations
  ○ Only 256 input embeddings!
● Embeddings occupy 66% of T5-base
● “Unbalanced” Seq2Seq
  ○ 6-layer encoder, 2-layer decoder (3:1)
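A minimal sketch of byte-level tokenization; real ByT5 additionally reserves a few special token IDs, which this ignores:

```python
def text_to_byte_ids(text):
    """UTF-8 bytes are the tokens, so only 256 input embeddings are needed."""
    return list(text.encode("utf-8"))

print(text_to_byte_ids("boat"))   # [98, 111, 97, 116]
print(text_to_byte_ids("naïve"))  # multi-byte characters expand to several IDs
```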
Absolute Positional Embeddings in BERT
● The Transformer architecture originally suggested sinusoidal embeddings
● BERT instead learnt absolute positional embeddings
● A randomly initialized embedding matrix, learnt jointly with the rest of the model
● Size: 512 × 768
● 90% of the time, trained with a sequence length of 128
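A minimal numpy sketch of learned absolute positions as described above; the initialisation scale is an assumption:

```python
import numpy as np

MAX_LEN, HIDDEN = 512, 768
rng = np.random.default_rng(0)
# One trainable row per position, added to the token embeddings.
position_embeddings = rng.normal(scale=0.02, size=(MAX_LEN, HIDDEN))

def add_positions(token_embeddings):
    n = token_embeddings.shape[0]          # sequence length, at most 512
    return token_embeddings + position_embeddings[:n]

tokens = rng.normal(size=(128, HIDDEN))    # e.g. a 128-token sequence
print(add_positions(tokens).shape)         # (128, 768)
```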
Relative Positional Embeddings
clip(i − j, k) = max(−k, min(k, i − j))
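A small sketch of the clipped relative offsets; with clipping at k, only 2k + 1 relative-position embeddings are needed:

```python
import numpy as np

def clipped_relative_positions(n, k):
    """Matrix of offsets i - j, clipped to the range [-k, k]."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.clip(i - j, -k, k)

print(clipped_relative_positions(6, k=2))
```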
Relative Positional Embeddings in T5
Only 32 relative-position embeddings, with logarithmically scaled distance buckets
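A hedged, simplified sketch of log-scaled distance bucketing in the spirit of T5 (the exact T5 bucketing additionally distinguishes direction; the bucket and distance limits here are illustrative):

```python
import math

def relative_bucket(distance, num_buckets=32, max_distance=128):
    """Small offsets get their own bucket; larger offsets share
    logarithmically wider buckets."""
    d = abs(distance)
    exact = num_buckets // 2
    if d < exact:
        return d
    log_ratio = math.log(d / exact) / math.log(max_distance / exact)
    return min(num_buckets - 1, exact + int(log_ratio * (exact - 1)))

print([relative_bucket(d) for d in (1, 5, 20, 80, 500)])
```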
Attention in Transformers
● Every word attends to every other word
● O(n²) complexity, compared to the O(n) complexity of an LSTM
● Inherently unscalable to long sequences
Transformer-XL: Recurrence in Transformers
● Chunk the text into segments
● Attend over the current and the previous segment
● Do not propagate gradients into the previous segment
● Faster training, and up to ~1800× faster evaluation (see the sketch below)
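A hedged numpy sketch of segment-level recurrence: the previous segment's hidden states act as extra keys and values and receive no gradient (e.g. `detach()` in a PyTorch implementation); the causal mask is omitted for brevity:

```python
import numpy as np

def attend_with_memory(current, memory):
    """Attention where the cached previous segment extends the keys/values."""
    context = np.concatenate([memory, current], axis=0)          # keys/values
    scores = current @ context.T / np.sqrt(current.shape[1])
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
segment_len, dim = 4, 8
memory = rng.normal(size=(segment_len, dim))       # hidden states of previous segment
current = rng.normal(size=(segment_len, dim))      # current segment
print(attend_with_memory(current, memory).shape)   # (4, 8)
```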
BigBird - Sparse Attention (8x faster)
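A hedged sketch of a BigBird-style sparse attention pattern, combining global tokens, a sliding window, and random links; all sizes are illustrative assumptions:

```python
import numpy as np

def sparse_attention_mask(n, num_global=2, window=1, num_random=2, seed=0):
    """Boolean mask: True where attention is allowed."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True   # global tokens attend everywhere ...
    mask[:, :num_global] = True   # ... and everyone attends to them
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                           # sliding window
        mask[i, rng.choice(n, size=num_random, replace=False)] = True   # random links
    return mask

mask = sparse_attention_mask(12)
print(mask.sum(), "of", mask.size, "entries attended")  # far fewer than n*n
```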
Token Mixing
● How important is attention in transformers?
  ○ Combining information from multiple tokens
● Linear Mixing
  ○ For mixing information across tokens
  ○ Only uses two weight matrices
Linear Mixing
● Achieves 90% of the performance while being 40% faster
● FNet: use the Fourier transform for token mixing (sketched below)
  ○ 97% of the performance while being 80% faster
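A minimal sketch of FNet-style mixing: a 2D Fourier transform over the sequence and hidden dimensions, keeping the real part, replaces self-attention; no parameters are learned in the mixing step:

```python
import numpy as np

def fourier_mixing(x):
    """x: (sequence_length, hidden_dim); mix tokens with a parameter-free FFT."""
    return np.real(np.fft.fft2(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))
print(fourier_mixing(x).shape)  # (16, 32)
```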