
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Source: NAACL-HLT 2019
Speaker: Ya-Fang Hsiao
Advisor: Jia-Ling Koh
Date: 2019/09/02

CONTENTS

1. Introduction
2. Related Work
3. Method
4. Experiment
5. Conclusion

1. Introduction

BERT: Bidirectional Encoder Representations from Transformers

Language Model: assigns a probability to a word sequence via the chain-rule factorization

P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, w_2, \ldots, w_{t-1})
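As a quick illustration of this chain-rule factorization, here is a minimal sketch with a toy, hand-written conditional table (the `cond_prob` table is purely hypothetical; a real language model would produce these conditionals from a trained network's softmax):

```python
import math

def cond_prob(token, history):
    # Hypothetical per-step conditionals P(w_t | w_1..w_{t-1}).
    toy_lm = {
        (): {"the": 0.5, "a": 0.5},
        ("the",): {"cat": 0.4, "dog": 0.6},
        ("the", "cat"): {"sat": 1.0},
    }
    return toy_lm[tuple(history)][token]

def sentence_log_prob(tokens):
    # log P(w_1..w_T) = sum_t log P(w_t | w_1..w_{t-1})
    return sum(math.log(cond_prob(w, tokens[:t])) for t, w in enumerate(tokens))

print(math.exp(sentence_log_prob(["the", "cat", "sat"])))  # 0.5 * 0.4 * 1.0 = 0.2
```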

Pre-trained Language Model

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

2. Related Work

Pre-trained Language Model

Feature-based: ELMo
Fine-tuning: OpenAI GPT

Limitations shared by ELMo and OpenAI GPT:

1. They rely on unidirectional language models.
2. They share the same objective function during pre-training (a standard language-modeling objective).

BERT (Bidirectional Encoder Representations from Transformers) instead pre-trains with two tasks:

Masked Language Model (MLM)
Next Sentence Prediction (NSP)

Transformer (Attention Is All You Need, Vaswani et al., NIPS 2017)

A sequence-to-sequence model built from an encoder and a decoder.

RNNs are hard to parallelize; the Transformer replaces recurrence with attention so all positions can be processed at once.

The encoder and the decoder each stack 6 identical layers.

Self-attention layers can be computed in parallel.

Self-Attention: each position produces a query (to match against others), a key (to be matched), and a value (the information to be extracted).

Multi-Head Attention: several self-attention heads run in parallel and their outputs are concatenated.
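A minimal PyTorch sketch of the scaled dot-product self-attention described above (shapes and names are illustrative; multi-head attention would split the hidden dimension across several such heads and concatenate their outputs):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (batch, seq_len, d_k). Returns the attended values."""
    d_k = Q.size(-1)
    # Each query is matched against every key; scaling by sqrt(d_k) keeps the
    # dot products in a range where the softmax stays well-behaved.
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)             # attention weights
    return weights @ V                              # weighted sum of values

# In self-attention, Q, K, V are linear projections of the same input sequence,
# so every position can attend to every other position in parallel.
x = torch.randn(2, 5, 64)                           # (batch, seq_len, d_model)
Wq, Wk, Wv = (torch.nn.Linear(64, 64) for _ in range(3))
out = scaled_dot_product_attention(Wq(x), Wk(x), Wv(x))
print(out.shape)                                    # torch.Size([2, 5, 64])
```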

BERT (Bidirectional Encoder Representations from Transformers) uses the Transformer encoder.

BERT_BASE:  L = 12, H = 768,  A = 12, Parameters = 110M
BERT_LARGE: L = 24, H = 1024, A = 16, Parameters = 340M

(L = number of Transformer layers, H = hidden size, A = number of self-attention heads; the feed-forward size is 4H.)
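As a rough sanity check on these sizes, the parameter counts can be approximated from L and H alone. This is a back-of-the-envelope sketch: it ignores biases, LayerNorm parameters, and the exact vocabulary size, and the head count A only splits H across heads without adding parameters.

```python
def approx_bert_params(L, H, A, vocab=30000, max_pos=512):
    """Rough parameter count for a BERT-style encoder (A does not affect the total)."""
    embeddings = (vocab + max_pos + 2) * H        # token + position + segment tables
    attention = 4 * H * H                         # Q, K, V and output projections
    feed_forward = 2 * H * (4 * H)                # H -> 4H -> H
    return embeddings + L * (attention + feed_forward)

print(f"BERT_BASE  ~{approx_bert_params(12, 768, 12) / 1e6:.0f}M")   # ~108M (paper: 110M)
print(f"BERT_LARGE ~{approx_bert_params(24, 1024, 16) / 1e6:.0f}M")  # ~333M (paper: 340M)
```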

3. Method

Framework

Pre-training: the model is trained on unlabeled data over different pre-training tasks.
Fine-tuning: the parameters are fine-tuned using labeled data from the downstream tasks.

Input

Token embeddings: WordPiece embeddings with a 30,000-token vocabulary.
  [CLS]: classification token placed at the start of every sequence
  [SEP]: separator token between sentences
Segment embeddings: learned embeddings indicating whether a token belongs to sentence A or sentence B.
Position embeddings: learned positional embeddings.

The three embeddings are summed to form the input representation.

Pre-training corpus: BooksCorpus and English Wikipedia.
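A minimal PyTorch sketch of this input representation, assuming the three embedding tables are simply summed (class and variable names are illustrative; dimensions follow BERT_BASE, and the real model also applies layer normalization and dropout on top of the sum):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Token + segment + position embeddings, summed."""
    def __init__(self, vocab_size=30000, hidden=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)     # WordPiece ids
        self.segment = nn.Embedding(2, hidden)            # sentence A / B
        self.position = nn.Embedding(max_len, hidden)     # learned positions

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token(token_ids)
                + self.segment(segment_ids)
                + self.position(positions))

# e.g. "[CLS] w1 w2 [SEP] w3 w4" -> segment ids 0 0 0 0 1 1
emb = BertInputEmbeddings()
tokens = torch.randint(0, 30000, (1, 6))
segments = torch.tensor([[0, 0, 0, 0, 1, 1]])
print(emb(tokens, segments).shape)   # torch.Size([1, 6, 768])
```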

Pre-training

Two unsupervised tasks:
1. Masked Language Model (MLM)
2. Next Sentence Prediction (NSP)

Task 1: Masked Language Model (MLM)

Mask 15% of all WordPiece tokens in each sequence at random for prediction.

Each selected token is replaced with:
(1) the [MASK] token 80% of the time,
(2) a random token 10% of the time,
(3) the unchanged i-th token 10% of the time.

(Figure: Hung-Yi Lee - BERT ppt)
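A simplified sketch of this 80/10/10 masking scheme (the `VOCAB` list and token handling are toy stand-ins for the real WordPiece vocabulary):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "store", "man", "went", "milk"]   # toy vocabulary for random replacement

def mask_tokens(tokens, mask_prob=0.15):
    """Return (masked_tokens, labels); labels are None for unselected positions."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            masked[i] = MASK                  # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(VOCAB)  # 10%: replace with a random token
        # else: 10% keep the token unchanged
    return masked, labels

print(mask_tokens("[CLS] the man went to the store [SEP]".split()))
```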

Task 2: Next Sentence Prediction (NSP)

The model predicts whether sentence B actually follows sentence A: during pre-training, B is the true next sentence 50% of the time (IsNext) and a random sentence 50% of the time (NotNext).

Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
Label = IsNext

Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]
Label = NotNext

(Figure: Hung-Yi Lee - BERT ppt)
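A simplified sketch of how such IsNext/NotNext pairs could be built from a sentence-segmented corpus (the function name and toy `docs` are illustrative; a real pipeline would also control sequence lengths and apply MLM masking to the packed pair):

```python
import random

def make_nsp_example(doc, all_docs):
    """Pick sentence A from `doc`; 50% of the time B is the true next sentence
    (IsNext), otherwise B comes from a randomly chosen document (NotNext)."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        other = random.choice(all_docs)       # a real implementation would avoid
        sent_b, label = random.choice(other), "NotNext"  # re-sampling the same document
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

docs = [["the man went to the store", "he bought a gallon of milk"],
        ["penguins are flightless birds", "they live in the southern hemisphere"]]
print(make_nsp_example(docs[0], docs))
```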

Fine-Tuning

Fine-tuning: the parameters are fine-tuned using labeled data from the downstream tasks.

Task 1 (b): Single Sentence Classification Tasks

Input: a single sentence; output: a class.
The output vector at [CLS] is fed into a linear classifier; the classifier is trained from scratch while BERT is fine-tuned.
Examples: sentiment analysis, document classification.

(Figure: Hung-Yi Lee - BERT ppt)
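A sketch of this setup in PyTorch, assuming a hypothetical `encoder` module that returns one hidden vector per token (only the linear classifier is new; during fine-tuning both it and the encoder are updated end to end):

```python
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Linear classifier (trained from scratch) on top of the [CLS] output;
    `encoder` stands in for a pre-trained BERT returning (batch, seq, hidden)."""
    def __init__(self, encoder, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)  # (batch, seq, hidden)
        cls_vector = hidden_states[:, 0]                      # the [CLS] position
        return self.classifier(cls_vector)                    # class logits

# Fine-tuning would optimize model.parameters() jointly, e.g. with a small
# learning rate such as 2e-5.
```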

Task 2 (d): Single Sentence Tagging Tasks

Input: a single sentence; output: a class for each word.
A linear classifier is applied to the output vector at every token position.
Example: slot filling.

(Figure: Hung-Yi Lee - BERT ppt)
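The tagging variant differs only in that the classifier is applied at every position rather than just at [CLS]; a sketch under the same hypothetical `encoder` assumption:

```python
import torch.nn as nn

class TokenTagger(nn.Module):
    """Per-token linear classifier for tagging tasks such as slot filling;
    `encoder` stands in for a pre-trained BERT returning (batch, seq, hidden)."""
    def __init__(self, encoder, hidden=768, num_tags=10):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_tags)

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)  # (batch, seq, hidden)
        return self.classifier(hidden_states)                 # one tag logit vector per token
```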

Task 3 (a): Sentence Pair Classification Tasks

Input: two sentences, packed as [CLS] sentence 1 [SEP] sentence 2 [SEP]; output: a class.
As in task (b), the [CLS] output is fed into a linear classifier.
Example: Natural Language Inference.

(Figure: Hung-Yi Lee - BERT ppt)

Task 4 (c): Question Answering Tasks (extractive QA)

Document: D = {d_1, d_2, ..., d_N}
Query: Q = {q_1, q_2, ..., q_M}
The QA model outputs two integers (s, e), and the answer is the document span A = {d_s, ..., d_e}.

Example (SQuAD-style): s = 17, e = 17 selects a single-token answer; s = 77, e = 79 selects a three-token span.

(Figure: Hung-Yi Lee - BERT ppt)

The question and the document are packed into one sequence: [CLS] q_1 q_2 [SEP] d_1 d_2 d_3 [SEP].

A start vector and an end vector are learned from scratch. Each is dot-producted with the output representation of every document token, and a softmax over the document positions gives the start and end distributions.

In the example, the start distribution peaks at d_2 and the end distribution peaks at d_3 (scores 0.2, 0.1, 0.7), so s = 2, e = 3 and the answer is "d_2 d_3".

(Figure: Hung-Yi Lee - BERT ppt)
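A sketch of this span-prediction head, again assuming a hypothetical `encoder` that returns per-token vectors; `doc_mask` marks which positions belong to the document so the softmax is taken over document tokens only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPredictor(nn.Module):
    """Start/end span prediction for extractive QA; the start and end vectors
    are learned from scratch, `encoder` stands in for a pre-trained BERT."""
    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder
        self.start_vec = nn.Parameter(torch.randn(hidden))
        self.end_vec = nn.Parameter(torch.randn(hidden))

    def forward(self, token_ids, segment_ids, doc_mask):
        hidden_states = self.encoder(token_ids, segment_ids)     # (batch, seq, hidden)
        start_scores = hidden_states @ self.start_vec            # dot product per token
        end_scores = hidden_states @ self.end_vec
        # Mask out non-document positions (doc_mask is 1 on d_1..d_N, 0 elsewhere).
        start_probs = F.softmax(start_scores.masked_fill(doc_mask == 0, -1e9), dim=-1)
        end_probs = F.softmax(end_scores.masked_fill(doc_mask == 0, -1e9), dim=-1)
        return start_probs.argmax(-1), end_probs.argmax(-1)      # (s, e) per example
```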

4. Experiment

Fine-tuning BERT obtains new state-of-the-art results on 11 NLP tasks (including GLUE, SQuAD v1.1, and SWAG).

Implementation walkthrough: LeeMeng - 進擊的 BERT (PyTorch)

5. Conclusion

BERT pre-trains deep bidirectional representations with the MLM and NSP tasks, and a single additional output layer is enough to fine-tune it for a wide range of downstream tasks.

References

Development of language models (n-gram to NNLM): http://bit.ly/nGram2NNLM
Language model pre-training methods (ELMo, OpenAI GPT, BERT): http://bit.ly/ELMo_OpenAIGPT_BERT
Attention Is All You Need: http://bit.ly/AttIsAllUNeed
BERT paper: http://bit.ly/BERTpaper
Hung-Yi Lee - Transformer (YouTube): http://bit.ly/HungYiLee_Transformer
The Illustrated Transformer: http://bit.ly/illustratedTransformer
Transformer explained in detail: http://bit.ly/explainTransformer
github/codertimo - BERT (PyTorch): http://bit.ly/BERT_pytorch
Fake-news pair classification implementation: http://bit.ly/implementpaircls
Pytorch.org_BERT: http://bit.ly/pytorchorgBERT