Deep Learning for Natural Language Processing Yoshimasa Tsuruoka The University of Tokyo


Page 1:

Deep Learning for Natural Language Processing

Yoshimasa Tsuruoka

The University of Tokyo

Page 2:

Outline

• Introduction

• Word embeddings
  – Word2vec, Skip-gram, fastText

• Recurrent Neural Networks
  – LSTM, GRU

• Neural Machine Translation
  – Encoder-decoder model
  – Transformer
  – Unsupervised NMT

• Pretraining methods
  – ELMo, GPT, BERT, UNI-LM
  – Sentiment Analysis, Textual Entailment, Question Answering, Summarization

Page 3:

Story generation

• GPT-2 [Radford et al., 2019]

https://blog.openai.com/better-language-models/

Page 4:

Page 5:

Question Answering

https://rajpurkar.github.io/SQuAD-explorer/

Page 6:

Now machines are better than humans!?

Page 7:

Word representation

• Vector representation

– A word is represented as a real-valued vector

• Dimensionality: 50 ~ 1000

– Similar words are treated similarly

• Alleviates the problem of data sparsity

[Figure: word vectors for "dog", "cat", "tea", and "coffee" plotted in two dimensions, with similar words placed close together]

Page 8:

Word2Vec [Mikolov et al., 2013]

• How do you learn good word vectors?

• Optimize the vectors so that they can predict well the occurrence of the words in a document

• Two approaches:

– Skip-gram

– Continuous Bag of Words (CBOW)

Page 9:

Continuous Bag of Words (CBOW)

… to prevent another financial crisis such as …

… to prevent another financial crisis such as …

… to prevent another financial crisis such as …

• Predict the center word

Page 10:

Skip-gram

• Predict each context word

… to prevent another financial crisis such as …

… to prevent another financial crisis such as …

… to prevent another financial crisis such as …

Page 11:

Skip-gram

• Probability of word o given the center word c

• Each word in the vocabulary has two vectors
  – u: vector used when the word appears as a context (outside) word
  – v: vector used when the word appears as the center word

$$p(o \mid c) = \frac{\exp(\mathbf{u}_o^{\top} \mathbf{v}_c)}{\sum_{w=1}^{V} \exp(\mathbf{u}_w^{\top} \mathbf{v}_c)}$$
Page 12:

Skip-gram

• Objective for training

– Maximize the likelihood

$$J(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} p(w_{t+j} \mid w_t; \theta)$$

– Minimize the negative log likelihood

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t; \theta)$$

Page 13:

Skip gram with negative sampling

• Skip gram

– Needs to compute the summation for all words in the vocabulary

– Computational cost is large

• Skip gram with negative sampling

– Logistic regression problems

– Generate negative examples by random sampling
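A hedged sketch of the resulting loss for one (center, outside) pair, in the spirit of Mikolov et al. (2013); for simplicity the negative words are drawn uniformly here, whereas word2vec samples them from a smoothed unigram distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(U, V, center, outside, k=5, rng=None):
    """-log sigma(u_o . v_c) - sum over k sampled words of log sigma(-u_neg . v_c)."""
    rng = rng or np.random.default_rng(0)
    negatives = rng.integers(0, U.shape[0], size=k)    # k randomly sampled "negative" words
    pos = -np.log(sigmoid(U[outside] @ V[center]))     # a logistic-regression term per word pair
    neg = -np.log(sigmoid(-U[negatives] @ V[center])).sum()
    return pos + neg

rng = np.random.default_rng(0)
U, V = rng.normal(size=(5, 3)), rng.normal(size=(5, 3))
print(neg_sampling_loss(U, V, center=2, outside=4))
```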

Page 14:

1000-dimensional vectors learned with Skip-gram are converted to two-dimensional vectors by PCA

Mikolov, et al., Distributed Representations of Words and Phrases and their Compositionality, NIPS 2013

Page 15:

Analogical reasoning

• What is the female equivalent of a king?

• Analogical reasoning by arithmetic operation of word vectors

vec(king) – vec(man) + vec(woman) ≈ vec(queen)

(Mikolov et al., 2013)

Page 16:

fastText [Bojanowski et al., 2017]

• Address the rare word problem

– A word is represented as a bag of character n-grams

– Each character n-gram has a vector representation

– Each word is represented as the sum of character n-gram vectors

– Example: the word “interlink”

• 3-gram: <in, int, nte, ter, erl, rli, lin, ink, nk>

• 4-gram: <int, inte, nter, terl, erli, rlin, link, ink>

• 5-gram: <inte, inter, nterl, terli, erlin, rlink, link>

• 6-gram: <inter, interl, nterli, terlin, erlink, rlink>

• Whole word: <interlink>

Bojanowski et al., Enriching Word Vectors with Subword Information, TACL 2017
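The n-gram decomposition above is easy to reproduce; this is only an illustrative sketch (function name and defaults are mine), not the fastText implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Bag of character n-grams of a word, with '<' and '>' as boundary markers,
    plus the whole word itself."""
    marked = f"<{word}>"
    ngrams = [marked[i:i + n] for n in range(n_min, n_max + 1)
              for i in range(len(marked) - n + 1)]
    ngrams.append(marked)                        # the whole word, e.g. "<interlink>"
    return ngrams

print(char_ngrams("interlink")[:9])              # the 3-grams: ['<in', 'int', 'nte', ...]
```

The vector of a word is then the sum of the vectors of these character n-grams.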

Page 17:

Neural Network

• Feed-forward Neural Network

$$\mathbf{z} = \sigma(W^{(zx)} \mathbf{x}), \qquad \mathbf{y} = \sigma(W^{(yz)} \mathbf{z})$$

Input: $\mathbf{x} = (x_1, \ldots, x_D)$, hidden layer: $\mathbf{z} = (z_1, \ldots, z_M)$, output: $\mathbf{y} = (y_1, \ldots, y_K)$

Sigmoid function: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$

The sizes of the input and output vectors are fixed

Page 18:

Recurrent Neural Network (RNN)

• Can process a sequence of any length

$$\mathbf{h}_t = \sigma(W^{hh} \mathbf{h}_{t-1} + W^{hx} \mathbf{x}_t), \qquad \mathbf{y}_t = W^{yh} \mathbf{h}_t$$

$\mathbf{x}_t$: input vector, $\mathbf{h}_t$: state vector, $\mathbf{y}_t$: output vector

• Share the weight parameters W across time steps

[Figure: the recurrent network and its equivalent unrolled form, with inputs x_1…x_4, states h_1…h_4, and outputs y_1…y_4]
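A minimal numpy sketch of this recurrence with the shared weights applied at every step (sizes and names below are illustrative).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(xs, W_hx, W_hh, W_yh, h0):
    """h_t = sigma(W_hh h_{t-1} + W_hx x_t), y_t = W_yh h_t, over a sequence of any length."""
    h, ys = h0, []
    for x in xs:                                 # the same weights are reused at each step
        h = sigmoid(W_hh @ h + W_hx @ x)
        ys.append(W_yh @ h)
    return ys, h

rng = np.random.default_rng(0)
d, D_h, K = 4, 8, 3                              # input, state, and output sizes
xs = [rng.normal(size=d) for _ in range(5)]      # a length-5 input sequence
ys, h = rnn_forward(xs,
                    0.1 * rng.normal(size=(D_h, d)),
                    0.1 * rng.normal(size=(D_h, D_h)),
                    0.1 * rng.normal(size=(K, D_h)),
                    np.zeros(D_h))
print(len(ys), ys[0].shape)                      # 5 output vectors, each of size K
```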

Page 19:

[Figure: the RNN unrolled at time steps t−1, t, t+1, with inputs x_{t−1}, x_t, x_{t+1}, states h_{t−1}, h_t, h_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, and the shared weight matrices W^{hx}, W^{hh}, W^{yh}]

Page 20:

RNN and NLP

• In NLP, we process sequences of words and characters
  – Language modeling, part-of-speech tagging, machine translation, etc.

• E.g., language modeling
  – Predict the next word

Baseball is popular in ?

[Figure: an RNN reads "Baseball is popular in", carries the context information in its hidden states h_1…h_4, and predicts the next word]

Page 21:

RNN language model

• Words: $w_1, w_2, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_T$

• Word vectors: $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}, \ldots, \mathbf{x}_T$

• Hidden states

$$\mathbf{h}_t = \sigma(W^{hh} \mathbf{h}_{t-1} + W^{hx} \mathbf{x}_t)$$

• Word prediction

$$\hat{\mathbf{y}}_t = \mathrm{softmax}(W^{S} \mathbf{h}_t), \qquad P(w_{t+1} = v_j \mid w_1, \ldots, w_t) = \hat{y}_{t,j}$$

• Dimensions: $\mathbf{x}_t \in \mathbb{R}^{d}$, $\mathbf{h}_t \in \mathbb{R}^{D_h}$, $W^{hx} \in \mathbb{R}^{D_h \times d}$, $W^{hh} \in \mathbb{R}^{D_h \times D_h}$, $W^{S} \in \mathbb{R}^{|V| \times D_h}$

Page 22:

Softmax function

• Softmax function

– Convert real-valued scores to a probability distribution

$$\mathrm{softmax}(\mathbf{x}) = \left( \frac{\exp(x_1)}{Z}, \frac{\exp(x_2)}{Z}, \ldots, \frac{\exp(x_n)}{Z} \right), \qquad Z = \sum_{i=1}^{n} \exp(x_i)$$

• Example

$$\mathrm{softmax}((3.5,\ -1.2,\ 2.0)) = \left( \frac{e^{3.5}}{e^{3.5} + e^{-1.2} + e^{2.0}},\ \frac{e^{-1.2}}{e^{3.5} + e^{-1.2} + e^{2.0}},\ \frac{e^{2.0}}{e^{3.5} + e^{-1.2} + e^{2.0}} \right) \approx (0.812,\ 0.007,\ 0.181)$$

The outputs add up to 1
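A quick numpy check of this example (the reconstruction above assumes the input scores are (3.5, −1.2, 2.0), which reproduces the probabilities on the slide).

```python
import numpy as np

def softmax(x):
    """Convert real-valued scores into a probability distribution that adds up to 1."""
    e = np.exp(x - np.max(x))                    # subtract the max for numerical stability
    return e / e.sum()

p = softmax(np.array([3.5, -1.2, 2.0]))
print(p.round(3))                                # [0.812 0.007 0.181]; the entries add up to 1
```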

Page 23:

Training

• Objective
  – Cross Entropy Loss
  – Equivalent to negative log-likelihood
    • The correct word should be assigned a high probability

$$J = -\frac{1}{T} \sum_{t=1}^{T} \sum_{j=1}^{|V|} y_{t,j} \log \hat{y}_{t,j}$$

$y_{t,j}$ takes on 1 for the correct word, 0 otherwise
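A small sketch of this loss for one sequence, assuming the predicted distributions ŷ_t are given and the targets are word indices (i.e., one-hot y_t).

```python
import numpy as np

def cross_entropy(y_hat, targets):
    """-1/T sum_t log y_hat[t, correct word]; with one-hot targets this equals the loss above."""
    T = len(targets)
    return -np.log(y_hat[np.arange(T), targets]).mean()

# Toy example: T = 3 time steps, vocabulary of 4 words.
y_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
targets = np.array([0, 1, 3])                    # indices of the correct words
print(cross_entropy(y_hat, targets))             # -(log 0.7 + log 0.5 + log 0.7) / 3
```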

Page 24:

[Figure: computation graph of the RNN language model at time steps t−1 and t: x_t and h_{t−1} are multiplied by W^{hx} and W^{hh}, summed, and passed through a sigmoid to give h_t; h_t is multiplied by W^{S} and passed through a softmax to give ŷ_t, which is compared with the correct word y_t by the cross-entropy loss]

Page 25:

Problems with vanilla RNNs

• Vanishing/exploding gradients

• Fail to capture long-distance dependencies

• Solutions

– Directly connect $h_t$ and $h_{t-1}$ depending on context

– Gated Recurrent Units (GRUs)

– Long Short-Term Memory (LSTM)

Page 26:

Gated Recurrent Unit (GRU) [Cho et al., 2014]

• Update gate: $\mathbf{z}_t = \sigma(W^{z} \mathbf{x}_t + U^{z} \mathbf{h}_{t-1})$

• Reset gate: $\mathbf{r}_t = \sigma(W^{r} \mathbf{x}_t + U^{r} \mathbf{h}_{t-1})$

• State update

$$\tilde{\mathbf{h}}_t = \tanh(W \mathbf{x}_t + \mathbf{r}_t \circ U \mathbf{h}_{t-1})$$

$$\mathbf{h}_t = \mathbf{z}_t \circ \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \circ \tilde{\mathbf{h}}_t$$

– Ignore the previous state if the reset gate is 0

– Copy the previous state if the update gate is 1
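A numpy sketch of one GRU step with the gate convention used on this slide (the update gate copies the previous state when it is 1); shapes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU update: h_t = z * h_{t-1} + (1 - z) * h_tilde."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)              # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)              # reset gate
    h_tilde = np.tanh(p["W"] @ x + r * (p["U"] @ h_prev))    # candidate state
    return z * h_prev + (1.0 - z) * h_tilde

rng = np.random.default_rng(0)
d, D = 4, 6
p = {k: 0.1 * rng.normal(size=(D, d if k.startswith("W") else D))
     for k in ["Wz", "Wr", "W", "Uz", "Ur", "U"]}
print(gru_step(rng.normal(size=d), np.zeros(D), p).shape)    # (6,)
```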

Page 27:

Long Short-Term Memory (LSTM) [Hochreiter and Schmidhuber, 1997]

• Input gate: $\mathbf{i}_t = \sigma(W^{i} \mathbf{x}_t + U^{i} \mathbf{h}_{t-1} + \mathbf{b}^{i})$

• Forget gate: $\mathbf{f}_t = \sigma(W^{f} \mathbf{x}_t + U^{f} \mathbf{h}_{t-1} + \mathbf{b}^{f})$

• Output gate: $\mathbf{o}_t = \sigma(W^{o} \mathbf{x}_t + U^{o} \mathbf{h}_{t-1} + \mathbf{b}^{o})$

• Memory cell

$$\tilde{\mathbf{c}}_t = \tanh(W^{c} \mathbf{x}_t + U^{c} \mathbf{h}_{t-1} + \mathbf{b}^{c}), \qquad \mathbf{c}_t = \mathbf{f}_t \circ \mathbf{c}_{t-1} + \mathbf{i}_t \circ \tilde{\mathbf{c}}_t$$

• State update

$$\mathbf{h}_t = \mathbf{o}_t \circ \tanh(\mathbf{c}_t)$$
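The same kind of sketch for one LSTM step (biases included, as on the slide; shapes and names are illustrative).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM update: gates, memory cell, then the new hidden state."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h_prev + p["bi"])    # input gate
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h_prev + p["bf"])    # forget gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h_prev + p["bo"])    # output gate
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h_prev + p["bc"])
    c = f * c_prev + i * c_tilde                             # memory cell
    h = o * np.tanh(c)                                       # state update
    return h, c

rng = np.random.default_rng(0)
d, D = 4, 6
p = {}
for g in "ifoc":
    p["W" + g] = 0.1 * rng.normal(size=(D, d))
    p["U" + g] = 0.1 * rng.normal(size=(D, D))
    p["b" + g] = np.zeros(D)
h, c = lstm_step(rng.normal(size=d), np.zeros(D), np.zeros(D), p)
print(h.shape, c.shape)                                      # (6,) (6,)
```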

Page 28:

LSTM (Long Short-Term Memory)

(Same gate equations as on the previous slide.)

Page 29:

Performance of RNN language models

Kim et al., Character-Aware Neural Language Models, 2015

Page 30:

Perplexity

• Perplexity (PP or PPL)

$$\mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})}}$$

– Average branching factor of word prediction
– The smaller, the better

$$\log \mathrm{PP}(W) = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})$$

→ Cross-entropy loss per word
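A tiny sketch that computes perplexity from the per-word conditional probabilities assigned by a language model (the numbers are made up).

```python
import numpy as np

def perplexity(word_probs):
    """exp of the average negative log-probability per word
    (equivalently, the N-th root of the inverse probability of the whole sequence)."""
    return float(np.exp(-np.mean(np.log(word_probs))))

# P(w_i | w_1, ..., w_{i-1}) for each word of a short sentence.
probs = [0.2, 0.1, 0.25, 0.05]
print(perplexity(probs))     # ~7.95, i.e. roughly an 8-way choice per word on average
```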

Page 31:

Examples generated by LSTM

• Trained with LaTeX source (word-level)

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Page 32:

Machine Translation

• Translate one language into another

• Train a translation model with a parallel corpus
  – E.g., WMT’14 English-to-French dataset
    • 12 million sentences from Europarl, News Commentary, etc.
    • 300 million words (English)
    • 350 million words (French)

I'm here on vacation

Je suis là pour les vacances

Page 33:

Japanese-English Subtitle Corpus [Pryzant et al., 2017]

Page 34:

ASPEC corpus [Nakazawa et al., 2016]

Page 35:

Neural Machine Translation

• Encoder-decoder model (Sutskever et al., 2014)

– Encoder RNN
  • Convert the source sentence into a real-valued vector
– Decoder RNN
  • Generate a sentence in the target language from the vector

[Figure: an encoder LSTM reads "He likes cheese <EOS>" and a decoder LSTM generates "彼 は チーズ が 好き <EOS>", feeding each generated word back in as the next input]

Page 36:

Example

Output of the system

Reference translation


Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014

Page 37:

Results

The BLEU score of the state-of-the-art system was 37.0

• WMT’14 English to French

Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014

Page 38:

BLEU score

• BLEU score [Papineni et al., 2002]

– Most widely used evaluation measure for MT

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \frac{1}{4} \sum_{n=1}^{4} \log p_n \right)$$

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{otherwise} \end{cases}$$

$p_n$: modified n-gram precision
BP: brevity penalty
c: length of the generated sentence, r: length of the reference sentence
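A sketch of this aggregation, assuming the modified n-gram precisions p_1…p_4 have already been computed (obtaining them requires clipped n-gram counts against the reference, which this snippet does not implement).

```python
import math

def bleu(precisions, c, r):
    """BLEU = BP * exp(mean of log p_n), with the brevity penalty BP defined above."""
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Illustrative modified 1- to 4-gram precisions and sentence lengths.
print(bleu([0.8, 0.6, 0.45, 0.35], c=18, r=20))   # ~0.47
```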

Page 39:

Vector representation of source sentences

Sutskever et al., Sequence to Sequence Learning with Neural Networks, NIPS 2014

Page 40:

Problems

• Represents the content of the source sentence with a single vector

– Hard to represent a long sentence

• Sentence length vs translation accuracy

[Figure: translation accuracy vs. sentence length for the encoder-decoder model and for statistical machine translation]

Cho et al., On the Properties of Neural Machine Translation: Encoder–Decoder Approaches, 2014

Page 41:

Attention (Bahdanau et al., 2015)

• Look at the hidden state of each word in the source sentence when updating the states of the decoder

$$\mathbf{c}_i = \sum_{j=1}^{T_x} \alpha_{ij} \mathbf{h}_j \qquad \text{(weighted average of the encoder hidden states)}$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad \text{(weights, which add up to 1)}$$

$$e_{ij} = \mathrm{FeedForwardNN}(\mathbf{s}_{i-1}, \mathbf{h}_j)$$
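A numpy sketch of these weights, using a small one-hidden-layer scorer in place of FeedForwardNN (Wa, Ua, va and all sizes are illustrative assumptions).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, Wa, Ua, va):
    """c_i = sum_j alpha_ij h_j, with alpha_ij = softmax_j(score(s_{i-1}, h_j))
    and score(s, h) = va . tanh(Wa s + Ua h) as a small feed-forward scorer."""
    scores = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h) for h in H])
    alpha = softmax(scores)              # weights over source positions, add up to 1
    return alpha @ H, alpha              # context vector and the weights

rng = np.random.default_rng(0)
Tx, Dh, Ds, Da = 5, 6, 6, 4              # source length; encoder, decoder, scorer sizes
H = rng.normal(size=(Tx, Dh))            # encoder hidden states h_1 ... h_Tx
c, alpha = attention_context(rng.normal(size=Ds), H,
                             rng.normal(size=(Da, Ds)), rng.normal(size=(Da, Dh)),
                             rng.normal(size=Da))
print(c.shape, alpha.round(2))
```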

Page 42:

Bidirectional LSTM (BiLSTM)

• Stack two RNNs (forward and backward)

– Capture left and right context information

[Figure: a forward RNN and a backward RNN run over the inputs x_1…x_4, producing forward states h→_1…h→_4 and backward states h←_1…h←_4 that are combined into the outputs y_1…y_4]

Page 43:

(Bahdanau et al., 2015)

Attention Example (English to French)

Page 44:

Sentence length and Translation Accuracy

(Bahdanau et al., 2015)

Page 45:

Attention

• Improvements by Luong et al. (2015)

Page 46:

Translation accuracy

• WMT’14 English-German results

– 4.5M sentence pairs

Luong et al., (2015)

Page 47:

Examples

Luong et al., (2015)

Page 48:

Google Neural Machine Translation system (GNMT) (Wu et al., 2016)

• Model
  – Encoder: 8-layer LSTM (the lowest layer is bidirectional)
  – Decoder: 8-layer LSTM
  – Attention: from the bottom layer of the decoder to the top layer of the encoder
  – Unit: wordpiece model

• Training data
  – For research
    • WMT corpus: En-Fr (36M), En-De (5M), etc.
  – For production
    • Two or three orders of magnitude larger than the WMT corpus

Page 49:

GNMT system

(Wu et al., 2016)

Page 50:

Transformer (Vaswani et al., 2017)

• Published in June of 2017

• No use of RNN

``Attention is all you need’’

• Outperformed GNMT with much less computational cost
  – Training
    • 36M English-French sentence pairs
    • 8 GPUs
    • Completed in 3.5 days

Vaswani et al., Attention Is All You Need, NIPS 2017

Page 51:

Transformer

[Figure: the encoder is a stack of self-attention layers over the source sentence "He likes cheese"; the decoder is a stack of self-attention + attention layers that generates the target words "彼 は チーズ …"]

Page 52:

• Encoder

– Compute a contextual embedding for each word

[Figure: each input word of "He likes cheese" is mapped to a word embedding x_i ∈ ℝ^512 and then passed through six layers of "self-attention & feed-forward NN", giving contextual representations r_i^(1), …, r_i^(6) ∈ ℝ^512]

Page 53:

• Encoder self-attention (1st layer)

– Each input vector is mapped to a query, a key, and a value: $\mathbf{q}_i = W^Q \mathbf{x}_i$, $\mathbf{k}_i = W^K \mathbf{x}_i$, $\mathbf{v}_i = W^V \mathbf{x}_i$

– The dot products of the query with the keys, $\mathbf{q}_1 \cdot \mathbf{k}_1$, $\mathbf{q}_1 \cdot \mathbf{k}_2$, $\mathbf{q}_1 \cdot \mathbf{k}_3$, are passed through a softmax, giving weights such as 0.3, 0.5, 0.2

– The output for the first word ("He") is the weighted sum of the values, followed by a feed-forward network; it becomes the input to the next layer:

$$\mathbf{z}_1 = 0.3\,\mathbf{v}_1 + 0.5\,\mathbf{v}_2 + 0.2\,\mathbf{v}_3, \qquad \mathbf{r}_1 = \mathrm{FFN}(\mathbf{z}_1)$$

Page 54:

Position-wise Feed-Forward Networks

• Compute the output of each attention layer

– Two-layer feed-forward neural network

– Linear transformation -> ReLU -> Linear Transformation

$$\mathrm{FFN}(\mathbf{x}) = \max(0, \mathbf{x}W_1 + \mathbf{b}_1)\, W_2 + \mathbf{b}_2$$

$W_1 \in \mathbb{R}^{512 \times 2048}$, $\mathbf{b}_1 \in \mathbb{R}^{2048}$, $W_2 \in \mathbb{R}^{2048 \times 512}$, $\mathbf{b}_2 \in \mathbb{R}^{512}$

[Figure: x → linear ($W_1$) → ReLU → linear ($W_2$)]
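A direct numpy transcription of this network with the base-model sizes from the slide (the 0.02 initialization scale is an arbitrary choice for the example).

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: linear -> ReLU -> linear."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(3, d_model))        # applied independently to each of 3 positions
print(ffn(x, W1, b1, W2, b2).shape)      # (3, 512)
```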

Page 55:

Scaled dot product attention

• Compute the attention for all words at once

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V, \qquad Q \in \mathbb{R}^{n \times d_k},\ K \in \mathbb{R}^{n \times d_k},\ V \in \mathbb{R}^{n \times d_v}$$

$$QK^{\top} = \begin{pmatrix} \mathbf{q}_1 \cdot \mathbf{k}_1 & \mathbf{q}_1 \cdot \mathbf{k}_2 & \mathbf{q}_1 \cdot \mathbf{k}_3 \\ \mathbf{q}_2 \cdot \mathbf{k}_1 & \mathbf{q}_2 \cdot \mathbf{k}_2 & \mathbf{q}_2 \cdot \mathbf{k}_3 \\ \mathbf{q}_3 \cdot \mathbf{k}_1 & \mathbf{q}_3 \cdot \mathbf{k}_2 & \mathbf{q}_3 \cdot \mathbf{k}_3 \end{pmatrix}$$

After scaling and a row-wise softmax, e.g.

$$\mathrm{softmax}(\cdot)\, V = \begin{pmatrix} 0.1 & 0.2 & 0.7 \\ 0.4 & 0.3 & 0.3 \\ 0.8 & 0.1 & 0.1 \end{pmatrix} \begin{pmatrix} \mathbf{v}_1 \\ \mathbf{v}_2 \\ \mathbf{v}_3 \end{pmatrix} = \begin{pmatrix} 0.1\mathbf{v}_1 + 0.2\mathbf{v}_2 + 0.7\mathbf{v}_3 \\ 0.4\mathbf{v}_1 + 0.3\mathbf{v}_2 + 0.3\mathbf{v}_3 \\ 0.8\mathbf{v}_1 + 0.1\mathbf{v}_2 + 0.1\mathbf{v}_3 \end{pmatrix}$$
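A numpy sketch of scaled dot-product attention for a whole sentence at once; the row-wise softmax produces the weight matrix in the example above (all sizes are illustrative).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with the softmax applied to each row."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, n) matrix of q_i . k_j
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # each row adds up to 1
    return weights @ V                                 # weighted sums of the value vectors

rng = np.random.default_rng(0)
n, d_k, d_v = 3, 64, 64
Q, K, V = (rng.normal(size=(n, d)) for d in (d_k, d_k, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)     # (3, 64)
```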

Page 56:

Multi-Head Attention

• Use 8 heads with different parameters

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_8)\, W^{O}$$

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V})$$

$W_i^{Q}, W_i^{K}, W_i^{V} \in \mathbb{R}^{512 \times 64}$ (dimensionality reduction), $W^{O} \in \mathbb{R}^{512 \times 512}$

$\mathrm{head}_i \in \mathbb{R}^{n \times 64}$, $\mathrm{MultiHead}(\cdot) \in \mathbb{R}^{n \times 512}$

• Each head collects a different kind of information

Page 57:

Decoder

• Six layers of attention
  – Each layer has two types of attention mechanism

• Self-attention
  – Collects information on the generated words
    • Attend to the output of the layer below
  – Mask the information about the future

• Encoder-decoder attention
  – Collects information on the source sentence
    • Attend to the output of the encoder

Page 58:

Add & Norm

• Add (Residual Connection)

– Learn the difference between input and output

– Implementation

• Simply add input to output

– Reduces train/test error

• Norm (Layer Normalization)

– Normalize the output of each layer

• Mean = 0, Variance = 1

– Faster convergence

Page 59:

Positional Encoding

• Encode position information

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/512}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/512}}\right)$$

– $0 \le i < 256$
– Wavelength: $2\pi \sim 10000 \cdot 2\pi$

• Represents each position with soft binary notation
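A sketch of this encoding with d_model = 512, matching the formulas above.

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = positional_encoding(max_len=50)
print(pe.shape)                                         # (50, 512)
```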

Page 60:

http://jalammar.github.io/illustrated-transformer/

[Figure: heatmap of the positional encoding values, with the dimension index on one axis and pos on the other]

Page 61:

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

Transformer

Page 62:

Visualizing attention

The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads).

https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

The animal didn’t cross the street because it was too tired.

The animal didn’t cross the street because it was too wide.

Page 63:

Performance

• Translation accuracy and training cost

Page 64:

Results with different hyper-parameter settings

Page 65:

Unsupervised NMT

• Unsupervised NMT [Artetxe et al., 2018; Lample et al., 2018]

– Build translation models by using only monolingual corpora

• Method
  1. Learn initial translation models
  2. Repeat
    • Generate source and target sentences using the current translation models
    • Train new translation models using the generated sentences

Page 66:

Example

Lample et al., Unsupervised Machine Translation Using Monolingual Corpora Only, ICLR 2018

Page 67:

Initialization

• Approaches

– Use a bilingual dictionary (Klementiev et al., 2012)

– Use a dictionary inferred in an unsupervised way (Conneau et al., 2018; Artetxe et al., 2017)

– Use a shared sub-word vocabulary between two languages (Lample et al., 2018)

1. Join the monolingual corpora

2. Apply BPE tokenization (Sennrich et al., 2016) on the resulting corpus

3. Learn token embeddings by fastText

Page 68:

Language modeling (denoising auto-encoding)

[Figure: a source sentence x_src and a target sentence x_tgt are corrupted with noise (some words dropped and swapped) and then reconstructed by an encoder and a language-specific decoder (→ Source / → Target)]

The first token of the decoder specifies the output language

$$\mathcal{L}_{lm} = \mathbb{E}_{x \sim S}\!\left[-\log P_{s \to s}(x \mid C(x))\right] + \mathbb{E}_{x \sim T}\!\left[-\log P_{t \to t}(x \mid C(x))\right]$$

Page 69:

Back-translation

[Figure: a source sentence x_src is translated by the previous iteration's model (→ Target), and the resulting pair is used to train the target-to-source model (→ Source); symmetrically, a target sentence x_tgt is back-translated to train the source-to-target model]

$$\mathcal{L}_{back} = \mathbb{E}_{x \sim S}\!\left[-\log P_{t \to s}(x \mid v^{*}(x))\right] + \mathbb{E}_{x \sim T}\!\left[-\log P_{s \to t}(x \mid u^{*}(x))\right]$$

Page 70:

Comparison to supervised MT

• WMT’14 En-Fr

Lample et al., Phrase-Based & Neural Unsupervised Machine Translation, EMNLP 2018

Page 71:

Pretraining methods

• Deep learning models require a large amount of labeled data for training

• How can we build accurate models without using a large amount of labeled data?

• Pretraining methods
  – Train a language model with a large amount of raw (unlabeled) text and then adapt it to various NLP tasks
    • Feature-based
      – ELMo (Peters et al., 2018)
    • Fine-tuning
      – GPT (Radford et al., 2018)
      – BERT (Devlin et al., 2019)

Page 72:

ELMo (Peters et al., 2018)

• Train two language models (left-to-right and right-to-left) on a large raw corpus

• Use the hidden vectors of LSTMs as features for NLP models

Peters et al., Deep contextualized word representations, NAACL 2018

[Figure: a two-layer forward LSTM reads t_1, t_2, t_3 and predicts t_2, t_3, END; a two-layer backward LSTM reads the same tokens and predicts START, t_1, t_2]

Page 73:

ELMo (Peters et al., 2018)

Peters et al., Deep contextualized word representations, NAACL 2018

• Accuracy of existing NLP models can be improved by simply adding features produced by ELMo

Page 74:

GPT (Radford et al., 2018)

• GPT (Generative Pre-trained Transformer)

• Method
  – Train a Transformer language model with a large amount of raw text
    • 12-layer decoder-only Transformer
      – sequences of up to 512 tokens
    • Took one month with 8 GPUs
    • BooksCorpus: 7,000 unpublished books (~5GB of text)
  – Add a task-specific layer to the Transformer model and fine-tune it with labeled data
    • 3 epochs of training were sufficient in most cases

Radford et al., Improving language understanding by generative pre-training, 2018

Page 75:

GPT (Radford et al., 2018)

Radford et al., Improving language understanding by generative pre-training, 2018

Page 76:

MNLI (Multi-Genre Natural Language Inference, MultiNLI) [Williams et al., 2018]

Williams et al., A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference, NAACL-HLT 2018

• Entailment, Neutral, or Contradiction

Page 77:

RACE test (Lai et al., 2017)

https://openai.com/blog/language-unsupervised/

Page 78:

• Results on natural language inference tasks

• Results on QA and common sense reasoning tasks

Page 79:

GPT (Radford et al., 2018)

[Figures: number of layers vs. accuracy; zero-shot performance]

Page 80:

BERT (Devlin et al., 2019)

• BERT (Bidirectional Encoder Representations from Transformers)

• Method
  1. Pretraining with unlabeled data
    • Learn deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context
  2. Fine-tuning with labeled data
    • Add a task-specific layer to the model
    • Fine-tune all the parameters

Page 81:

Pretraining

• Masked Language Model (MLM)
  – Some of the tokens are replaced with a special token [MASK]
  – The model is trained to predict them

    The man went to the [MASK] to buy a [MASK] of milk   (→ store, gallon)

• Next Sentence Prediction (NSP)
  – Predict whether the given two sentences are consecutive sentences in a document

• Data
  – BooksCorpus (800M words)
  – English Wikipedia (2,500M words)
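A toy sketch of the MLM corruption, assuming the 15% masking rate used by BERT (the 80/10/10 replacement details are omitted for brevity; the function name and example sentence are illustrative).

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK]; return the corrupted
    sequence and the positions the model must predict."""
    rng = random.Random(seed)
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok              # the model is trained to recover these tokens
            masked[i] = "[MASK]"
    return masked, targets

sent = "the man went to the store to buy a gallon of milk".split()
print(mask_tokens(sent))
```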

Page 82:

Input representation

• First token is always [CLS] (special token for classification)

• Segments are separated by [SEP]

• Add a segment-specific embedding to each token

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 83:

Pretraining

[Figure adapted from Devlin et al. (2019): BERT takes "Masked Sentence A" and "Masked Sentence B" as input and is pretrained with the NSP and Masked LM objectives]

Page 84:

Fine-tuning for Single Sentence Tagging Tasks

• CoNLL-2003 NER

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 85:

CoNLL-2003 NER [Tjong Kim Sang, 2003]

• Named Entity Recognition

B-ORG O B-PER O O B-LOC

U.N. official Ekeus heads for Baghdad

Page 86:

Fine-tuning for Single Sentence Classification tasks

• SST-2, CoLA

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 87:

Stanford Sentiment Treebank (SST) [Socher et al., 2013]

https://nlp.stanford.edu/sentiment/treebank.html

[Figure: sentiment-annotated parse tree for the sentence "A terrific B movie -- in fact, the best in recent memory."]

Page 88:

CoLA (The Corpus of Linguistic Acceptability) [Warstadt et al., 2018]

• Grammatical or ungrammatical

Page 89:

Fine-tuning for Sentence Pair Classification tasks

• MNLI, QQP, QNLI, STS-B, MRPC, RTE, SWAG

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 90:

Textual Entailment datasets

https://openai.com/blog/language-unsupervised/

Page 91:

Evaluation

• GLUE test results
  – BERT outperformed GPT on all tasks
  – BERT_BASE
    • L=12, H=768, A=12, Total Parameters=110M
    • Roughly the same size as OpenAI GPT
  – BERT_LARGE
    • L=24, H=1024, A=16, Total Parameters=340M

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 92:

GLUE (General Language Understanding Evaluation) benchmark [Wang et al., 2019]

• Nine tasks on natural language understanding
  – Single-Sentence Tasks
    • CoLA (The Corpus of Linguistic Acceptability) [Warstadt et al., 2018]
    • SST-2 (The Stanford Sentiment Treebank) [Socher et al., 2013]
  – Similarity and Paraphrase Tasks
    • MRPC (Microsoft Research Paraphrase Corpus) [Dolan and Brockett, 2005]
    • STS-B (The Semantic Textual Similarity Benchmark) [Cer et al., 2017]
    • QQP (Quora Question Pairs) [Chen et al., 2018]
  – Inference Tasks
    • MNLI (Multi-Genre Natural Language Inference) [Williams et al., 2018]
    • QNLI (Question Natural Language Inference) [Wang et al., 2018]
    • RTE (Recognizing Textual Entailment) [Bentivogli et al., 2009]
    • WNLI (Winograd NLI) [Levesque et al., 2011]

Wang et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. ICLR 2019

Page 93:

Fine-tuning for extractive QA

• SQuAD

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 94:

SQuAD [Rajpurkar et al., 2016]

Page 95:

Evaluation

• Results on SQuAD 1.1

Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

Page 96:

UNI-LM (Dong et al., 2019)

• UNI-LM (Unified Pre-trained Language Model)
  – A model that can be fine-tuned for both NLU and NLG

• Pre-training
  – Three kinds of language models
    • Unidirectional LM (left-to-right / right-to-left)
    • Bidirectional LM
    • Sequence-to-sequence LM
  – Data
    • English Wikipedia & BooksCorpus

• Fine-tuning

Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS 2019

Page 97:

Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS 2019

Page 98:

Examples from Gigaword corpus

Nallapati et al., Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond, 2016

Page 99:

Example from CNN/DailyMail corpus

See et al., Get To The Point: Summarization with Pointer-Generator Networks, 2017

Page 100:

Evaluation

• CNN/DailyMail • Gigaword

Dong et al., Unified Language Model Pre-training for Natural Language Understanding and Generation, NeurIPS 2019

Page 101:

Summary

• Word embeddings
  – Word2vec, Skip-gram, fastText

• Recurrent Neural Networks
  – LSTM, GRU

• Neural Machine Translation
  – Encoder-decoder model
  – Transformer
  – Unsupervised NMT

• Pretraining methods
  – ELMo, GPT, BERT, UNI-LM
  – Sentiment Analysis, Textual Entailment, Question Answering, Summarization