COS 484: Natural Language Processing - Princeton NLP

Page 1:

Contextualized Word Embeddings

Fall 2019

COS 484: Natural Language Processing

Page 2:

Overview

Contextualized Word Representations

• ELMo = Embeddings from Language Models

• BERT = Bidirectional Encoder Representations from Transformers

Page 3:

Overview

• Transformers

Page 4:

Recap: word2vec

word = “sweden”
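The classic word2vec demo looks up the words whose vectors are closest to a query word such as "sweden". Below is a minimal, hedged sketch of that lookup with gensim; the vectors file name is a placeholder and the actual neighbors depend on the pretrained model.

```python
# Sketch: nearest-neighbor query against pretrained word2vec vectors (file path is a placeholder).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("word2vec-vectors.bin", binary=True)

# Words most similar to "sweden" by cosine similarity (output depends on the training corpus)
for word, score in vectors.most_similar("sweden", topn=5):
    print(f"{word}\t{score:.3f}")
```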

Page 5:

What’s wrong with word2vec?

• One vector for each word type, e.g. v_cat = (−0.224, 0.130, −0.290, 0.276); the same single vector v(bank) is used in every context

• Complex characteristics of word use: semantics, syntactic behavior, and connotations

• Polysemous words, e.g., bank, mouse

Page 6:

Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

the movie was terribly exciting !

f : (w1, w2, …, wn) ⟶ x1, …, xn ∈ ℝd

Page 7:

Contextualized word embeddings

(Peters et al, 2018): Deep contextualized word representations

Page 8:

ELMo

• NAACL’18: Deep contextualized word representations

• Key idea:

• Train an LSTM-based language model on some large corpus

• Use the hidden states of the LSTM for each token to compute a vector representation of each word

Page 9:

ELMo

[Figure: forward and backward LSTM language models, from the input tokens up to a softmax over the vocabulary, unrolled over the # of words in the sentence]

Page 10:

How to use ELMo?

• Plug ELMo into any (neural) NLP model: freeze all the biLM weights and change the input representation to a weighted combination of the biLM layers (L = # of layers):

ELMo_k^task = γ^task · Σ_{j=0}^{L} s_j^task · h_{k,j}^{LM}

where h_{k,0}^{LM} = x_k^{LM} (the context-independent token representation) and h_{k,j}^{LM} = [→h_{k,j}^{LM} ; ←h_{k,j}^{LM}] (the concatenated forward and backward LSTM states at layer j)

• γ^task: allows the task model to scale the entire ELMo vector

• s_j^task: softmax-normalized weights across layers

(could also insert ELMo into higher layers of the task model)
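A minimal PyTorch sketch of this weighted combination (this is illustrative, not the AllenNLP implementation; tensor names and shapes are assumptions):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Sketch of the ELMo combination: gamma * sum_j softmax(s)_j * h_j."""
    def __init__(self, num_layers):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # s_j^task (logits, softmax-normalized below)
        self.gamma = nn.Parameter(torch.ones(1))        # gamma^task

    def forward(self, layer_reps):
        # layer_reps: (num_layers, batch, seq_len, dim) stacked biLM representations,
        # where layer 0 is the context-independent token representation x_k^LM
        weights = torch.softmax(self.s, dim=0)
        mixed = (weights.view(-1, 1, 1, 1) * layer_reps).sum(dim=0)
        return self.gamma * mixed

# Usage: ScalarMix(3)(torch.randn(3, 2, 6, 1024)).shape -> torch.Size([2, 6, 1024])
```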

Page 11:

More details

• Forward and backward LMs: 2 layers each

• Use a character CNN to build the initial word representation

• 2048 char n-gram filters and 2 highway layers, 512 dim projection

• Use 4096 dim hidden/cell LSTM states with 512 dim projections to the next input

• A residual connection from the first to the second layer

• Trained 10 epochs on the 1B Word Benchmark

Page 12:

Experimental results

• SQuAD: question answering

• SNLI: natural language inference

• SRL: semantic role labeling

• Coref: coreference resolution

• NER: named entity recognition

• SST-5: sentiment analysis

Page 13:

Intrinsic Evaluation

First Layer > Second Layer | Second Layer > First Layer

Syntactic information is better represented at lower layers, while semantic information is captured at higher layers.

Page 14:

Use ELMo in practice

https://allennlp.org/elmo

Also available in TensorFlow
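The snippet below is a sketch of the usual AllenNLP (0.x) usage pattern; the options and weights file names are placeholders for the pretrained biLM files linked on that page.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholders: download the pretrained options/weights from the ELMo page above
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1: one scalar-mixed ELMo vector per token
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

sentences = [["the", "movie", "was", "terribly", "exciting", "!"]]
character_ids = batch_to_ids(sentences)        # ELMo is character-based, so inputs are character ids
out = elmo(character_ids)
elmo_vectors = out["elmo_representations"][0]  # (batch, seq_len, 1024)
```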

Page 15:

BERT

• NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

• First released in Oct 2018.

How is BERT different from ELMo?

#1. Unidirectional context vs bidirectional context

#2. LSTMs vs Transformers (more on this later)

#3. The pre-trained weights are not frozen; all of them are updated on the downstream task (this is called fine-tuning)

Page 16:

Bidirectional encoders

• Language models only use left context or right context (although ELMo trains two independent LMs, one in each direction).

• Language understanding is bidirectional

Lecture 9: Why are LMs unidirectional?


Page 18:

Masked language models (MLMs)

• Solution: Mask out 15% of the input words, and then predict the masked words

• Too little masking: too expensive to train

• Too much masking: not enough context

Page 19:

Masked language models (MLMs)

A little more complication: [MASK] is never seen when BERT is used at fine-tuning or test time, so the selected tokens are not always replaced with [MASK]: 80% of the time use [MASK], 10% of the time use a random word, and 10% of the time keep the original word.
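A minimal sketch of this masking recipe in plain Python (simplified: it operates on word strings rather than wordpiece ids, and the vocabulary is a stand-in):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Return (masked_tokens, labels); labels[i] holds the original token at masked positions, else None."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:      # select ~15% of positions to predict
            labels[i] = tok
            r = random.random()
            if r < 0.8:                      # 80% of the time: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                    # 10% of the time: replace with a random word
                masked[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return masked, labels

# Example: mask_tokens(["the", "man", "went", "to", "the", "store"], vocab=["dog", "ran", "apple"])
```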

Page 20:

Next sentence prediction (NSP)

Always sample two sentences and predict whether the second sentence actually follows the first one in the original text.

Recent papers show that NSP is not necessary:
(Joshi*, Chen* et al., 2019): SpanBERT: Improving Pre-training by Representing and Predicting Spans
(Liu et al., 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach

Page 21:

Pre-training and fine-tuning

Pre-training → Fine-tuning

Key idea: all the weights are fine-tuned on downstream tasks
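As a hedged sketch of what "all the weights are fine-tuned" looks like in code, assuming the Hugging Face transformers library (recent API; the model name, example sentence, label, and learning rate are illustrative): the optimizer receives every parameter of the model, not just the new classification head.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# All weights (pre-trained encoder + new classifier head) receive gradients
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["the movie was terribly exciting !"], return_tensors="pt", padding=True)
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)  # returns the classification loss when labels are given
outputs.loss.backward()
optimizer.step()
```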

Page 22:

Applications

Page 23:

More details

• Input representations

• Use word pieces instead of words: playing => play ##ing (see Assignment 4; a tokenizer sketch follows this list)

• Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)

• Released two model sizes: BERT_base, BERT_large
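A small, hedged illustration of wordpiece tokenization with the transformers library; the exact splits depend on the model's vocabulary, so the outputs shown in the comments are only indicative.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Frequent words stay whole; rarer words are split into pieces marked with "##"
print(tokenizer.tokenize("unaffable"))        # e.g. ['un', '##aff', '##able']
print(tokenizer.tokenize("contextualized"))   # split into several wordpieces (vocabulary-dependent)
```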

Page 24:

Experimental results

(Wang et al, 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

[Figure: GLUE scores; BiLSTM baseline: 63.9]

Page 25:

Use BERT in practice

TensorFlow: https://github.com/google-research/bert

PyTorch: https://github.com/huggingface/transformers
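A minimal sketch of extracting contextualized vectors with the PyTorch package above (recent transformers API; the model name is the standard released checkpoint):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

inputs = tokenizer("the movie was terribly exciting !", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dim contextualized vector per wordpiece (including [CLS] and [SEP])
hidden = outputs.last_hidden_state  # shape: (1, seq_len, 768)
```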

Page 26:

Contextualized word embeddings in context

• TagLM (Peters et al., 2017)
• CoVe (McCann et al., 2017)
• ULMfit (Howard and Ruder, 2018)
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• BERT (Devlin et al., 2018)
• OpenAI GPT-2 (Radford et al., 2019)
• XLNet (Yang et al., 2019)
• SpanBERT (Joshi et al., 2019)
• RoBERTa (Liu et al., 2019)
• ALBERT (Anonymous)
• …

Page 27:

Transformers

• NIPS’17: Attention is All You Need

• Key idea: multi-head self-attention

• No recurrent structure any more, so it trains much faster

• Originally proposed for NMT (encoder-decoder framework)

• Used as the base model of BERT (encoder only)

Page 28:

Useful Resources

• nn.Transformer

• nn.TransformerEncoder

• The Annotated Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html

A Jupyter notebook which explains how the Transformer works line by line in PyTorch!
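A minimal usage sketch of the two PyTorch modules named above (default layout: inputs are (seq_len, batch, d_model)):

```python
import torch
import torch.nn as nn

# One encoder layer = multi-head self-attention + feed-forward, each with residual + layer norm
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.rand(10, 32, 512)  # (seq_len, batch, d_model)
out = encoder(src)             # same shape: a contextualized representation for every position
print(out.shape)               # torch.Size([10, 32, 512])
```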

Page 29:

RNNs vs Transformers

[Figure: the sentence "the movie was terribly exciting !" processed by a stack of Transformer layers (layer 1 → layer 2 → layer 3)]

Page 30:

Multi-head Self Attention

• Attention: maps a query q and a set of key-value pairs (k_i, v_i) to an output

• Dot-product attention: A(q, K, V) = Σ_i ( e^{q·k_i} / Σ_j e^{q·k_j} ) v_i,  where K, V ∈ ℝ^{n×d}, q ∈ ℝ^d

• If we have multiple queries: A(Q, K, V) = softmax(QK^⊺)V,  where Q ∈ ℝ^{n_Q×d}, K, V ∈ ℝ^{n×d}

• Self-attention: let’s use each word as query and compute the attention with all the other words

= the word vectors themselves select each other

Page 31:

Multi-head Self Attention

• Scaled Dot-Product Attention:

A(Q, K, V) = softmax(QK^⊺ / √d) V

• Input: X ∈ ℝ^{n×d_in}

A(XW^Q, XW^K, XW^V) ∈ ℝ^{n×d},  where W^Q, W^K, W^V ∈ ℝ^{d_in×d}

• Multi-head attention: using more than one head is always useful.

A(Q, K, V) = Concat(head_1, …, head_h) W^O

head_i = A(XW_i^Q, XW_i^K, XW_i^V)

In practice, h = 8, d = d_out/h, W^O ∈ ℝ^{d_out×d_out}
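The sketch below is one way to implement these equations in PyTorch; it is illustrative rather than a reference implementation, and the dimension names (d_in, d_out, h heads of size d_out/h) follow the slide.

```python
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    # A(Q, K, V) = softmax(Q K^T / sqrt(d)) V
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (..., n_q, n)
    return torch.softmax(scores, dim=-1) @ V         # (..., n_q, d)

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_in, d_out, h=8):
        super().__init__()
        assert d_out % h == 0
        self.h, self.d_head = h, d_out // h             # per-head size d = d_out / h
        self.W_q = nn.Linear(d_in, d_out, bias=False)   # W^Q
        self.W_k = nn.Linear(d_in, d_out, bias=False)   # W^K
        self.W_v = nn.Linear(d_in, d_out, bias=False)   # W^V
        self.W_o = nn.Linear(d_out, d_out, bias=False)  # W^O

    def forward(self, X):                               # X: (n, d_in), one sentence
        n = X.size(0)
        def split(proj):                                 # project, then split into h heads
            return proj(X).view(n, self.h, self.d_head).transpose(0, 1)  # (h, n, d_head)
        Q, K, V = split(self.W_q), split(self.W_k), split(self.W_v)
        heads = attention(Q, K, V)                       # computed in parallel over the heads
        concat = heads.transpose(0, 1).reshape(n, -1)    # Concat(head_1, ..., head_h): (n, d_out)
        return self.W_o(concat)

# Usage: MultiHeadSelfAttention(d_in=512, d_out=512)(torch.randn(6, 512)).shape -> (6, 512)
```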

Page 32:

Putting it all together

• Each Transformer block has two sub-layers:

• Multi-head attention

• 2-layer feedforward NN (with ReLU)

• Each sublayer has a residual connection and a layer normalization

LayerNorm(x + SubLayer(x))

(Ba et al, 2016): Layer Normalization

• Input layer has a positional encoding

• BERT_base: 12 layers, 12 heads, hidden size = 768, 110M parameters

• BERT_large: 24 layers, 16 heads, hidden size = 1024, 340M parameters
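A compact sketch of one such block, assuming a recent PyTorch with nn.MultiheadAttention (sizes default to the BERT_base configuration; this is illustrative, not BERT's actual code):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: LayerNorm(x + SubLayer(x)) around self-attention and the feed-forward NN."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                         # x: (seq_len, batch, d_model)
        attn_out, _ = self.attn(x, x, x)          # multi-head self-attention sub-layer
        x = self.ln1(x + self.drop(attn_out))     # residual connection + layer normalization
        x = self.ln2(x + self.drop(self.ffn(x)))  # 2-layer feed-forward sub-layer (with ReLU)
        return x

# Stack 12 such blocks (plus positional encodings at the input) to get a BERT_base-sized encoder.
```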

Page 33:

Have fun using ELMo or BERT in your final project :)