BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Transcript of a slide presentation
[Slide 1]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Source: NAACL-HLT 2019
Speaker: Ya-Fang Hsiao
Advisor: Jia-Ling Koh
Date: 2019/09/02
[Slide 2]
CONTENTS
1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion
[Slide 3]
1. Introduction
[Slide 4]
Introduction
Bidirectional Encoder Representations from Transformers
Language Model:
$$P(w_1, w_2, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \dots, w_{t-1})$$
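As a concrete reading of this factorization, here is a minimal Python sketch that scores a sentence by summing per-token conditional log-probabilities; `next_token_probs` is a hypothetical stand-in for any trained language model.

```python
import math

def sentence_log_prob(words, next_token_probs):
    # Chain-rule factorization above: log P(w_1..w_T) is the sum of
    # log P(w_t | w_1..w_{t-1}) over all positions t.
    total = 0.0
    for t in range(len(words)):
        probs = next_token_probs(words[:t])  # P(. | w_1 .. w_{t-1})
        total += math.log(probs[words[t]])
    return total

# Toy usage with a uniform "model" over a four-word vocabulary.
uniform = lambda prefix: {w: 0.25 for w in ["the", "man", "went", "home"]}
print(sentence_log_prob(["the", "man", "went", "home"], uniform))  # 4 * log(0.25)
```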
Pre-trained Language Model
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[Slide 5]
2. Related Work
[Slide 6]
Related Work
Pre-trained Language Models
- Feature-based: ELMo
- Fine-tuning: OpenAI GPT
[Slide 7]
Related Work
Pre-trained Language Models
- Feature-based: ELMo
- Fine-tuning: OpenAI GPT
Limitations of these approaches: 1. they use unidirectional language models; 2. they share the same objective function during pre-training.
BERT: Bidirectional Encoder Representations from Transformers
- Masked Language Model (MLM)
- Next Sentence Prediction (NSP)
[Slide 8]
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
- Encoder-Decoder
- Sequence-to-sequence
- RNNs are hard to parallelize
[Slide 9]
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
Encoder-Decoder architecture
[Slide 10]
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
Encoder and decoder stacks of 6 layers each
Self-attention layers can be computed in parallel
[Slide 11]
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
Self-Attention
- query: to match against others
- key: to be matched
- value: the information to be extracted
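A minimal sketch of single-head scaled dot-product self-attention using the three roles above; the projection matrices and sizes are illustrative, not the paper's exact configuration. Because every position is handled in one matrix product, the layer parallelizes where an RNN cannot.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections.
    q = x @ w_q                                  # queries: to match others
    k = x @ w_k                                  # keys: to be matched
    v = x @ w_v                                  # values: info to be extracted
    scores = q @ k.T / k.size(-1) ** 0.5         # scaled dot products
    weights = torch.softmax(scores, dim=-1)      # attention distribution
    return weights @ v                           # all positions at once

x = torch.randn(5, 64)                           # 5 tokens, toy model size
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([5, 64])
```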
[Slide 12]
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
Multi-Head Attention
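For reference, the multi-head formulation from the paper: h attention heads run in parallel over separately projected queries, keys, and values, and their outputs are concatenated and projected.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$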
[Slide 13]
Transformers ("Attention Is All You Need", Vaswani et al., NIPS 2017)
[Slide 14]
BERT: Bidirectional Encoder Representations from Transformers
BERT_BASE: L=12, H=768, A=12, 110M parameters
BERT_LARGE: L=24, H=1024, A=16, 340M parameters
(L = number of Transformer layers, H = hidden size, A = number of self-attention heads; the feed-forward size is 4H)
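As a sanity check on the reported sizes, a rough back-of-the-envelope parameter count for BERT_BASE; an approximation that ignores bias and LayerNorm terms, not the exact total.

```python
V, P, L, H = 30522, 512, 12, 768   # vocab, max positions, layers, hidden size

embeddings = (V + P + 2) * H       # token + position + segment embeddings
attention = 4 * H * H              # Q, K, V projections plus output projection
ffn = 2 * H * (4 * H)              # two feed-forward matrices of size H x 4H
total = embeddings + L * (attention + ffn)

print(f"~{total / 1e6:.0f}M parameters")  # ~109M, matching the reported 110M
```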
[Slide 15]
3. Method
[Slide 16]
Framework
Pre-training: the model is trained on unlabeled data over different pre-training tasks.
Fine-tuning: all parameters are fine-tuned using labeled data from the downstream tasks.
[Slide 17]
Input
Token Embeddings: WordPiece embeddings with a 30,000-token vocabulary.
- [CLS]: classification token, the first token of every sequence
- [SEP]: separator token between sentences
Segment Embeddings: learned embeddings marking whether a token belongs to sentence A or sentence B.
Position Embeddings: learned positional embeddings.
Pre-training corpus: BooksCorpus and English Wikipedia
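A minimal PyTorch sketch of this input representation: each input vector is the sum of its token, segment, and position embeddings. The ids below are hypothetical, not real WordPiece ids.

```python
import torch
import torch.nn as nn

VOCAB, MAX_LEN, SEGMENTS, H = 30000, 512, 2, 768

tok_emb = nn.Embedding(VOCAB, H)     # WordPiece token embeddings
seg_emb = nn.Embedding(SEGMENTS, H)  # sentence A vs. sentence B
pos_emb = nn.Embedding(MAX_LEN, H)   # learned positional embeddings

token_ids = torch.tensor([[101, 1996, 2158, 2253, 102]])  # hypothetical ids
segment_ids = torch.zeros_like(token_ids)                 # all sentence A
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

x = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(x.shape)  # torch.Size([1, 5, 768])
```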
[Slide 18]
Pre-training
Two unsupervised tasks:
1. Masked Language Model (MLM)
2. Next Sentence Prediction (NSP)
[Slide 19]
Task 1: Masked Language Model (MLM) (Hung-Yi Lee, BERT ppt)
Mask 15% of all WordPiece tokens in each sequence at random for prediction.
Replace each chosen token with:
(1) the [MASK] token 80% of the time;
(2) a random token 10% of the time;
(3) the unchanged i-th token 10% of the time.
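A minimal sketch of this corruption procedure; a simplification that works on a plain token list and ignores special tokens and whole-word masking.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    labels = {}
    for i in range(len(tokens)):
        if random.random() < mask_rate:
            labels[i] = tokens[i]                 # original token is the label
            r = random.random()
            if r < 0.8:
                tokens[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                tokens[i] = random.choice(vocab)  # 10%: random token
            # else 10%: keep the i-th token unchanged
    return tokens, labels

vocab = ["the", "man", "went", "to", "store", "milk"]
print(mask_tokens(["the", "man", "went", "to", "the", "store"], vocab))
```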
[Slide 20]
Task 2: Next Sentence Prediction (NSP) (Hung-Yi Lee, BERT ppt)
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext
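A sketch of how such pairs could be sampled (assuming pre-tokenized sentence lists; `doc_sentences` and `all_sentences` are hypothetical names): 50% of the time sentence B is the actual next sentence, 50% a random one.

```python
import random

def make_nsp_example(doc_sentences, all_sentences, i):
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"           # true next sentence
    else:
        sent_b, label = random.choice(all_sentences), "NotNext"  # random sentence
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    return tokens, label

doc = [["the", "man", "went"], ["he", "bought", "milk"]]
print(make_nsp_example(doc, doc, 0))
```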
[Slide 21]
Fine-Tuning
Fine-tuning: all parameters are fine-tuned using labeled data from the downstream tasks.
[Slide 22]
Task 1 (b): Single Sentence Classification Tasks (Hung-Yi Lee, BERT ppt)
Input: single sentence; output: class
[CLS] w1 w2 w3 → BERT → Linear Classifier → class
The linear classifier is trained from scratch; BERT is fine-tuned.
Examples: sentiment analysis, document classification
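A sketch of this setup in PyTorch; `bert` is assumed to be any pre-trained encoder returning one vector per token, so only the interface is illustrated.

```python
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, bert, hidden_size=768, num_classes=2):
        super().__init__()
        self.bert = bert                                        # fine-tuned
        self.classifier = nn.Linear(hidden_size, num_classes)   # from scratch

    def forward(self, token_ids):
        hidden = self.bert(token_ids)    # (batch, seq_len, hidden)
        cls_vec = hidden[:, 0]           # representation of [CLS]
        return self.classifier(cls_vec)  # class logits
```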
[Slide 23]
Task 2 (d): Single Sentence Tagging Tasks (Hung-Yi Lee, BERT ppt)
Input: single sentence; output: class of each word
[CLS] w1 w2 w3 → BERT → a Linear Classifier on each token → class per token
Example: slot filling
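The tagging variant only changes where the linear layer is applied: to every token instead of [CLS] alone. A sketch, with a random tensor standing in for BERT's token outputs:

```python
import torch
import torch.nn as nn

hidden = torch.randn(1, 5, 768)  # placeholder for BERT outputs (batch, seq, H)
tagger = nn.Linear(768, 10)      # 10 = hypothetical number of slot tags
tag_logits = tagger(hidden)      # (1, 5, 10): one class per token
print(tag_logits.shape)
```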
[Slide 24]
Task 3 (a): Sentence Pair Classification Tasks (Hung-Yi Lee, BERT ppt)
Input: two sentences; output: class
[CLS] Sentence 1 (w1 w2) [SEP] Sentence 2 (w3 w4 w5) → BERT → Linear Classifier on [CLS] → class
Example: natural language inference
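Pair inputs only change how the sequence is assembled; classification still reads the [CLS] vector as above. A sketch of the pair encoding:

```python
# Two sentences share one input sequence, separated by [SEP];
# segment ids are 0 for sentence A and 1 for sentence B.
sent_a, sent_b = ["w1", "w2"], ["w3", "w4", "w5"]
tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
print(tokens, segment_ids)
```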
[Slide 25]
Task 4 (c): Question Answering Tasks (Hung-Yi Lee, BERT ppt)
Document: $D = \langle d_1, d_2, \dots, d_N \rangle$
Query: $Q = \langle q_1, q_2, \dots, q_M \rangle$
QA model output: two integers $(s, e)$
Answer: $A = \langle d_s, \dots, d_e \rangle$
Example: an answer at token 17 gives $s = 17, e = 17$; an answer spanning tokens 77 to 79 gives $s = 77, e = 79$.
[Slide 26]
Task 4 (c): Question Answering Tasks (Hung-Yi Lee, BERT ppt)
Input to BERT: [CLS] q1 q2 [SEP] d1 d2 d3 (question, then document)
A start vector, learned from scratch, is dot-multiplied with each document token's output; a softmax over the scores picks the start position. Here d2 scores highest (0.5), so s = 2.
The answer is "d2 d3": s = 2, e = 3.
[Slide 27]
Task 4 (c): Question Answering Tasks (Hung-Yi Lee, BERT ppt)
An end vector, also learned from scratch, is dot-multiplied with each document token's output; the softmax picks the end position. Here d3 scores highest (0.7), so e = 3.
The answer is "d2 d3": s = 2, e = 3.
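A sketch of the span prediction above: a start vector and an end vector, learned from scratch, are dotted with each document token's output, and a softmax over document positions selects s and e. Random tensors stand in for BERT's outputs.

```python
import torch

H, doc_len = 768, 3
doc_hidden = torch.randn(doc_len, H)  # BERT outputs for d1, d2, d3 (stand-in)
start_vec = torch.randn(H)            # learned from scratch during fine-tuning
end_vec = torch.randn(H)              # learned from scratch during fine-tuning

start_probs = torch.softmax(doc_hidden @ start_vec, dim=0)
end_probs = torch.softmax(doc_hidden @ end_vec, dim=0)
s = int(start_probs.argmax()) + 1     # 1-based, as on the slides
e = int(end_probs.argmax()) + 1
print(s, e)                           # answer span d_s ... d_e
```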
[Slide 28]
4. Experiment
[Slide 29]
Experiments: fine-tuning results on 11 NLP tasks
[Slides 30-34]
Implementation: LeeMeng, "進擊的BERT" (PyTorch); screenshots of the implementation walkthrough.
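For orientation, a minimal sketch of loading a pre-trained BERT for sentence classification with the Hugging Face `transformers` library; this is illustrative, not the walkthrough's exact code, and the model name and label count are assumptions.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # linear head initialized from scratch

inputs = tokenizer("the man went to the store", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]): one score per class
```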
[Slide 35]
5. Conclusion
[Slide 36]
References
Language model development (n-gram to NNLM): http://bit.ly/nGram2NNLM
Language model pre-training methods (ELMo, OpenAI GPT, BERT): http://bit.ly/ELMo_OpenAIGPT_BERT
Attention Is All You Need: http://bit.ly/AttIsAllUNeed
BERT paper: http://bit.ly/BERTpaper
Hung-Yi Lee, Transformer (YouTube): http://bit.ly/HungYiLee_Transformer
The Illustrated Transformer: http://bit.ly/illustratedTransformer
Transformer explained in detail: http://bit.ly/explainTransformer
github/codertimo, BERT (PyTorch): http://bit.ly/BERT_pytorch
Fake news classification implementation: http://bit.ly/implementpaircls
Pytorch.org BERT: http://bit.ly/pytorchorgBERT