
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Source: NAACL-HLT 2019
Speaker: Ya-Fang Hsiao
Advisor: Jia-Ling Koh
Date: 2019/09/02

CONTENTS

1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion

1. Introduction

BERT: Bidirectional Encoder Representations from Transformers

Language Model: assigns a probability to a word sequence via the chain-rule factorization

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})
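As a quick illustration of this chain-rule factorization, here is a minimal sketch with a toy, hand-written conditional table (the `cond_prob` table is purely hypothetical; a real language model would produce these conditionals from a trained network's softmax):

```python
import math

def cond_prob(token, history):
    # Hypothetical per-step conditionals P(w_t | w_1..w_{t-1}).
    toy_lm = {
        (): {"the": 0.5, "a": 0.5},
        ("the",): {"cat": 0.4, "dog": 0.6},
        ("the", "cat"): {"sat": 1.0},
    }
    return toy_lm[tuple(history)][token]

def sentence_log_prob(tokens):
    # log P(w_1..w_T) = sum_t log P(w_t | w_1..w_{t-1})
    return sum(math.log(cond_prob(w, tokens[:t])) for t, w in enumerate(tokens))

print(math.exp(sentence_log_prob(["the", "cat", "sat"])))  # 0.5 * 0.4 * 1.0 = 0.2
```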

Pre-trained Language Model

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

2. Related Work

Pre-trained Language Model

Feature-based: ELMo
Fine-tuning: OpenAI GPT

Limitations shared by ELMo and OpenAI GPT:

1. They rely on unidirectional language models.
2. They share the same objective function during pre-training (a standard language-modeling objective).

BERT (Bidirectional Encoder Representations from Transformers) instead pre-trains with two tasks:

Masked Language Model (MLM)
Next Sentence Prediction (NSP)

Transformer (Attention Is All You Need, Vaswani et al., NIPS 2017)

A sequence-to-sequence model built from an encoder and a decoder.

RNNs are hard to parallelize; the Transformer replaces recurrence with attention so all positions can be processed at once.

The encoder and the decoder each stack 6 identical layers.

Self-attention layers can be computed in parallel.

Self-Attention: each position produces a query (to match against others), a key (to be matched), and a value (the information to be extracted).

Multi-Head Attention: several self-attention heads run in parallel and their outputs are concatenated.
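A minimal PyTorch sketch of the scaled dot-product self-attention described above (shapes and names are illustrative; multi-head attention would split the hidden dimension across several such heads and concatenate their outputs):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (batch, seq_len, d_k). Returns the attended values."""
    d_k = Q.size(-1)
    # Each query is matched against every key; scaling by sqrt(d_k) keeps the
    # dot products in a range where the softmax stays well-behaved.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # weighted sum of values

# In self-attention, Q, K, V are linear projections of the same input sequence,
# so every position can attend to every other position in parallel.
x = torch.randn(2, 5, 64)                           # (batch, seq_len, d_model)
Wq, Wk, Wv = (torch.nn.Linear(64, 64) for _ in range(3))
out = scaled_dot_product_attention(Wq(x), Wk(x), Wv(x))
print(out.shape)                                    # torch.Size([2, 5, 64])
```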

BERT (Bidirectional Encoder Representations from Transformers) uses the Transformer encoder.

BERT_BASE:  L = 12, H = 768,  A = 12, Parameters = 110M
BERT_LARGE: L = 24, H = 1024, A = 16, Parameters = 340M

(L = number of Transformer layers, H = hidden size, A = number of self-attention heads; the feed-forward size is 4H.)
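As a rough sanity check on these sizes, the parameter counts can be approximated from L and H alone. This is a back-of-the-envelope sketch: it ignores biases, LayerNorm parameters, and the exact vocabulary size, and the head count A only splits H across heads without adding parameters.

```python
def approx_bert_params(L, H, A, vocab=30000, max_pos=512):
    """Rough parameter count for a BERT-style encoder (A does not affect the total)."""
    embeddings = (vocab + max_pos + 2) * H        # token + position + segment tables
    attention = 4 * H * H                         # Q, K, V and output projections
    feed_forward = 2 * H * (4 * H)                # H -> 4H -> H
    return embeddings + L * (attention + feed_forward)

print(f"BERT_BASE  ~{approx_bert_params(12, 768, 12) / 1e6:.0f}M")   # ~108M (paper: 110M)
print(f"BERT_LARGE ~{approx_bert_params(24, 1024, 16) / 1e6:.0f}M")  # ~333M (paper: 340M)
```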

3. Method

Framework

Pre-training: the model is trained on unlabeled data over different pre-training tasks.
Fine-tuning: the parameters are fine-tuned using labeled data from the downstream tasks.

Input

Token embeddings: WordPiece embeddings with a 30,000-token vocabulary.
  [CLS]: classification token placed at the start of every sequence
  [SEP]: separator token between sentences
Segment embeddings: learned embeddings indicating whether a token belongs to sentence A or sentence B.
Position embeddings: learned positional embeddings.

The three embeddings are summed to form the input representation.

Pre-training corpus: BooksCorpus and English Wikipedia.
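A minimal PyTorch sketch of this input representation, assuming the three embedding tables are simply summed (class and variable names are illustrative; dimensions follow BERT_BASE, and the real model also applies layer normalization and dropout on top of the sum):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, summed."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # WordPiece ids
        self.segment = nn.Embedding(2, hidden)            # sentence A / B
        self.position = nn.Embedding(max_len, hidden)     # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

# e.g. "[CLS] w1 w2 [SEP] w3 w4" -> segment ids 0 0 0 0 1 1
emb = BertInputEmbeddings()
tokens = torch.randint(0, 30000, (1, 6))
segments = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(tokens, segments).shape)   # torch.Size([1, 6, 768])
```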

Pre-training

Two unsupervised tasks:
1. Masked Language Model (MLM)
2. Next Sentence Prediction (NSP)

Task 1: Masked Language Model (MLM)

Mask 15% of all WordPiece tokens in each sequence at random for prediction.

Each selected token is replaced with:
(1) the [MASK] token 80% of the time,
(2) a random token 10% of the time,
(3) the unchanged i-th token 10% of the time.

(Figure: Hung-Yi Lee - BERT ppt)
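A simplified sketch of this 80/10/10 masking scheme (the `VOCAB` list and token handling are toy stand-ins for the real WordPiece vocabulary):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "store", "man", "went", "milk"]   # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None for unselected positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            masked[i] = MASK                  # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(VOCAB)  # 10%: replace with a random token
        # else: 10% keep the token unchanged
    return masked, labels

print(mask_tokens("[CLS] the man went to the store [SEP]".split()))
```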

Task 2: Next Sentence Prediction (NSP)

The model predicts whether sentence B actually follows sentence A: during pre-training, B is the true next sentence 50% of the time (IsNext) and a random sentence 50% of the time (NotNext).

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

(Figure: Hung-Yi Lee - BERT ppt)
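A simplified sketch of how such IsNext/NotNext pairs could be built from a sentence-segmented corpus (the function name and toy `docs` are illustrative; a real pipeline would also control sequence lengths and apply MLM masking to the packed pair):

```python
import random

def make_nsp_example(doc, all_docs):
    """Pick sentence A from `doc`; 50% of the time B is the true next sentence
    (IsNext), otherwise B comes from a randomly chosen document (NotNext)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        other = random.choice(all_docs)       # a real implementation would avoid
        sent_b, label = random.choice(other), "NotNext"  # re-sampling the same document
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(docs[0], docs))
```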

Fine-Tuning

Fine-tuning: the parameters are fine-tuned using labeled data from the downstream tasks.

Task 1 (b): Single Sentence Classification Tasks

Input: a single sentence; output: a class.
The output vector at [CLS] is fed into a linear classifier; the classifier is trained from scratch while BERT is fine-tuned.
Examples: sentiment analysis, document classification.

(Figure: Hung-Yi Lee - BERT ppt)
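A sketch of this setup in PyTorch, assuming a hypothetical `encoder` module that returns one hidden vector per token (only the linear classifier is new; during fine-tuning both it and the encoder are updated end to end):

```python
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Linear classifier (trained from scratch) on top of the [CLS] output;
    `encoder` stands in for a pre-trained BERT returning (batch, seq, hidden)."""
    def __init__(self, encoder, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)  # (batch, seq, hidden)
        cls_vector = hidden_states[:, 0]                      # the [CLS] position
        return self.classifier(cls_vector)                    # class logits

# Fine-tuning would optimize model.parameters() jointly, e.g. with a small
# learning rate such as 2e-5.
```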

Task 2 (d): Single Sentence Tagging Tasks

Input: a single sentence; output: a class for each word.
A linear classifier is applied to the output vector at every token position.
Example: slot filling.

(Figure: Hung-Yi Lee - BERT ppt)
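The tagging variant differs only in that the classifier is applied at every position rather than just at [CLS]; a sketch under the same hypothetical `encoder` assumption:

```python
import torch.nn as nn

class TokenTagger(nn.Module):
    """Per-token linear classifier for tagging tasks such as slot filling;
    `encoder` stands in for a pre-trained BERT returning (batch, seq, hidden)."""
    def __init__(self, encoder, hidden=768, num_tags=10):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_tags)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)  # (batch, seq, hidden)
        return self.classifier(hidden_states)                 # one tag logit vector per token
```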

Task 3 (a): Sentence Pair Classification Tasks

Input: two sentences, packed as [CLS] sentence 1 [SEP] sentence 2 [SEP]; output: a class.
As in task (b), the [CLS] output is fed into a linear classifier.
Example: Natural Language Inference.

(Figure: Hung-Yi Lee - BERT ppt)

Task 4 (c): Question Answering Tasks (extractive QA)

Document: D = {d_1, d_2, ..., d_N}
Query: Q = {q_1, q_2, ..., q_M}
The QA model outputs two integers (s, e), and the answer is the document span A = {d_s, ..., d_e}.

Example (SQuAD-style): s = 17, e = 17 selects a single-token answer; s = 77, e = 79 selects a three-token span.

(Figure: Hung-Yi Lee - BERT ppt)

The question and the document are packed into one sequence: [CLS] q_1 q_2 [SEP] d_1 d_2 d_3 [SEP].

A start vector and an end vector are learned from scratch. Each is dot-producted with the output representation of every document token, and a softmax over the document positions gives the start and end distributions.

In the example, the start distribution peaks at d_2 and the end distribution peaks at d_3 (scores 0.2, 0.1, 0.7), so s = 2, e = 3 and the answer is "d_2 d_3".

(Figure: Hung-Yi Lee - BERT ppt)
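A sketch of this span-prediction head, again assuming a hypothetical `encoder` that returns per-token vectors; `doc_mask` marks which positions belong to the document so the softmax is taken over document tokens only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPredictor(nn.Module):
    """Start/end span prediction for extractive QA; the start and end vectors
    are learned from scratch, `encoder` stands in for a pre-trained BERT."""
    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder
        self.start_vec = nn.Parameter(torch.randn(hidden))
        self.end_vec = nn.Parameter(torch.randn(hidden))

    def forward(self, token_ids, segment_ids, doc_mask):
        hidden_states = self.encoder(token_ids, segment_ids)     # (batch, seq, hidden)
        start_scores = hidden_states @ self.start_vec            # dot product per token
        end_scores = hidden_states @ self.end_vec
        # Mask out non-document positions (doc_mask is 1 on d_1..d_N, 0 elsewhere).
        start_probs = F.softmax(start_scores.masked_fill(doc_mask == 0, -1e9), dim=-1)
        end_probs = F.softmax(end_scores.masked_fill(doc_mask == 0, -1e9), dim=-1)
        return start_probs.argmax(-1), end_probs.argmax(-1)      # (s, e) per example
```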

4. Experiment

Fine-tuning BERT obtains new state-of-the-art results on 11 NLP tasks (including GLUE, SQuAD v1.1, and SWAG).

Implementation walkthrough: LeeMeng - 進擊的 BERT (PyTorch)

5. Conclusion

BERT pre-trains deep bidirectional representations with the MLM and NSP tasks, and a single additional output layer is enough to fine-tune it for a wide range of downstream tasks.

References

Development of language models (n-gram to NNLM): http://bit.ly/nGram2NNLM
Language model pre-training methods (ELMo, OpenAI GPT, BERT): http://bit.ly/ELMo_OpenAIGPT_BERT
Attention Is All You Need: http://bit.ly/AttIsAllUNeed
BERT paper: http://bit.ly/BERTpaper
Hung-Yi Lee - Transformer (YouTube): http://bit.ly/HungYiLee_Transformer
The Illustrated Transformer: http://bit.ly/illustratedTransformer
Transformer explained in detail: http://bit.ly/explainTransformer
github/codertimo - BERT (PyTorch): http://bit.ly/BERT_pytorch
Fake-news pair classification implementation: http://bit.ly/implementpaircls
Pytorch.org_BERT: http://bit.ly/pytorchorgBERT