COS 484: Natural Language Processing - Princeton NLP · COS 484: Natural Language Processing....

Contextualized Word Embeddings

Fall 2019

COS 484: Natural Language Processing

Overview

• ELMo

Contextualized Word Representations

= Bidirectional Encoder Representations from Transformers

= Embeddings from Language Models

• BERT

Overview

• Transformers

Recap: word2vec

word = “sweden”

What’s wrong with word2vec?

• One vector for each word type vcat =

0

BB@

�0.2240.130�0.2900.276

1

CCA

<latexit sha1_base64="ZS11t+SATcIQYaaJ4VZuEjXjz0Y=">AAACOXicbZDPShxBEMZrjIk65s8aj3poIoFcsvRsQlSIIHjxuIKrws6y9PTWro09PUN3jbgM8wx5m1zyFt4ELx4U8ZoXSM+uiFE/aPj4VRVd9SW5Vo44vwhmXs2+fjM3vxAuvn33/kNj6eOByworsSMzndmjRDjUymCHFGk8yi2KNNF4mJzs1PXDU7ROZWafxjn2UjEyaqikII/6jfZpv4wJz6j0pKq2wjjBkTJlngqy6qwKv/Jmq/WdxXHIm9E3XpsabfIpaq3/CGM0g4eBfmONN/lE7LmJ7s3a9uqvmAFAu984jweZLFI0JLVwrhvxnHqlsKSkxiqMC4e5kCdihF1vjUjR9crJ5RX77MmADTPrnyE2oY8nSpE6N04T3+n3O3ZPazV8qdYtaLjRK5XJC0Ijpx8NC80oY3WMbKAsStJjb4S0yu/K5LGwQpIPO/QhRE9Pfm4OWs3Ih7rn0/gJU83DCnyCLxDBOmzDLrShAxJ+wyVcw03wJ7gKboO7aetMcD+zDP8p+PsPf3eqbQ==</latexit><latexit sha1_base64="X7JObiHYNXwbsISLOmkjXbSsJws=">AAACOXicbZBNSyNBEIZ7/Fh1dHejHvXQrCx42dCTFT9AQfDiMYJRIRNCT6cSG3t6hu4aMQzzG/w3Xrz5E7wJXjwo4lW825OIrLovNLw8VUVXvVGqpEXGbryR0bHxbxOTU/70zPcfPyuzcwc2yYyAhkhUYo4ibkFJDQ2UqOAoNcDjSMFhdLJT1g9PwViZ6H3sp9CKeU/LrhQcHWpX6qftPEQ4w9yRotjywwh6UudpzNHIs8L/w6q12goNQ59Vg7+sNCXaYENUW1v1Q9Cd94F2ZYlV2UD0qwnezNL24nm4/HJ1Xm9XrsNOIrIYNArFrW0GLMVWzg1KoaDww8xCysUJ70HTWc1jsK18cHlBfzvSod3EuKeRDui/EzmPre3Hket0+x3bz7US/q/WzLC73sqlTjMELYYfdTNFMaFljLQjDQhUfWe4MNLtSsUxN1ygC9t3IQSfT/5qDmrVwIW659LYJENNkgXyiyyTgKyRbbJL6qRBBLkgt+SePHiX3p336D0NW0e8t5l58kHe8yuUTqy7</latexit><latexit sha1_base64="X7JObiHYNXwbsISLOmkjXbSsJws=">AAACOXicbZBNSyNBEIZ7/Fh1dHejHvXQrCx42dCTFT9AQfDiMYJRIRNCT6cSG3t6hu4aMQzzG/w3Xrz5E7wJXjwo4lW825OIrLovNLw8VUVXvVGqpEXGbryR0bHxbxOTU/70zPcfPyuzcwc2yYyAhkhUYo4ibkFJDQ2UqOAoNcDjSMFhdLJT1g9PwViZ6H3sp9CKeU/LrhQcHWpX6qftPEQ4w9yRotjywwh6UudpzNHIs8L/w6q12goNQ59Vg7+sNCXaYENUW1v1Q9Cd94F2ZYlV2UD0qwnezNL24nm4/HJ1Xm9XrsNOIrIYNArFrW0GLMVWzg1KoaDww8xCysUJ70HTWc1jsK18cHlBfzvSod3EuKeRDui/EzmPre3Hket0+x3bz7US/q/WzLC73sqlTjMELYYfdTNFMaFljLQjDQhUfWe4MNLtSsUxN1ygC9t3IQSfT/5qDmrVwIW659LYJENNkgXyiyyTgKyRbbJL6qRBBLkgt+SePHiX3p336D0NW0e8t5l58kHe8yuUTqy7</latexit><latexit sha1_base64="yUhkDlYwUUEoQ+3MeiaCkTTY5/M=">AAACOXicbZBNSyNBEIZ71PVj3NWsHr00BsHLhp4oxgUFwYvHCEaFTAg9nUps7OkZumvEMMzf8uK/8CZ48bCLePUP2JME8euFhpenquiqN0qVtMjYvTc1PfNjdm5+wV/8+WtpufJ75dQmmRHQEolKzHnELSipoYUSFZynBngcKTiLLg/L+tkVGCsTfYLDFDoxH2jZl4KjQ91K86qbhwjXmDtSFPt+GMFA6jyNORp5Xfh/WK1e36Zh6LNasMVKU6K/bIzqjR0/BN17G+hWqqzGRqJfTTAxVTJRs1u5C3uJyGLQKBS3th2wFDs5NyiFgsIPMwspF5d8AG1nNY/BdvLR5QXdcKRH+4lxTyMd0fcTOY+tHcaR63T7XdjPtRJ+V2tn2N/t5FKnGYIW44/6maKY0DJG2pMGBKqhM1wY6Xal4oIbLtCF7bsQgs8nfzWn9VrgQj1m1YO9SRzzZI2sk00SkAY5IEekSVpEkBvyQP6R/96t9+g9ec/j1ilvMrNKPsh7eQWaI6kG</latexit>

v(bank)

• Complex characteristics of word use: semantics, syntactic behavior, and connotations

• Polysemous words, e.g., bank, mouse

Contextualized word embeddings

Let’s build a vector for each word conditioned on its context!

movie was terribly exciting !the


f : (w1, w2, …, wn) ⟶ x1, …, xn ∈ ℝd


(Peters et al, 2018): Deep contextualized word representations

ELMo

• NAACL’18: Deep contextualized word representations

• Key idea:

• Train an LSTM-based language model on some large corpus

• Use the hidden states of the LSTM for each token to compute a vector representation of each word

ELMo

input softmax

# words in the sentence

How to use ELMo?

• : allows the task model to scale the entire ELMo vectorγtask

• : softmax-normalized weights across layersstaskj

hlMk,0 = xLM

k , hLMk,j = [h LM

k,j ; h LMk,j ]

• Plug ELMo into any (neural) NLP model: freeze all the LMs weights and change the input representation to:

(could also insert into higher layers)

# of layers

More details

• Forward and backward LMs: 2 layers each • Use character CNN to build initial word representation

• 2048 char n-gram filters and 2 highway layers, 512 dim projection

• User 4096 dim hidden/cell LSTM states with 512 dim projections to next input

• A residual connection from the first to second layer • Trained 10 epochs on 1B Word Benchmark

Experimental results

• SQuAD: question answering

• SNLI: natural language inference

• SRL: semantic role labeling

• Coref: coreference resolution

• NER: named entity recognition

• SST-5: sentiment analysis

Intrinsic Evaluation

First Layer > Second Layer Second Layer > First Layer

syntactic information is better represented at lower layers while semantic information is captured a higher layers

Use ELMo in practice

https://allennlp.org/elmo

Also available in TensorFlow

https://allennlp.org/elmo

BERT

• NAACL’19: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

• First released in Oct 2018.

How is BERT different from ELMo?

#1. Unidirectional context vs bidirectional context

#2. LSTMs vs Transformers (will talk later)

#3. The weights are not freezed, called fine-tuning

Bidirectional encoders

• Language models only use left context or right context (although ELMo used two independent LMs from each direction).

• Language understanding is bidirectional

Lecture 9:

Why are LMs unidirectional?

Bidirectional encoders

• Language models only use left context or right context (although ELMo used two independent LMs from each direction).

• Language understanding is bidirectional

Masked language models (MLMs)

• Solution: Mask out 15% of the input words, and then predict the masked words

• Too little masking: too expensive to train • Too much masking: not enough context

Masked language models (MLMs)

A little more complication:

Because [MASK] is never seen when BERT is used…

Next sentence prediction (NSP)

Always sample two sentences, predict whether the second sentence is followed after the first one.

Recent papers show that NSP is not necessary…(Joshi*, Chen* et al, 2019) :SpanBERT: Improving Pre-training by Representing and Predicting Spans (Liu et al, 2019): RoBERTa: A Robustly Optimized BERT Pretraining Approach

Pre-training and fine-tuning

Pre-training Fine-tuning

Key idea: all the weights are fine-tuned on downstream tasks

Applications

More details

• Input representations

• Use word pieces instead of words: playing => play ##ing Assignment 4

• Trained 40 epochs on Wikipedia (2.5B tokens) + BookCorpus (0.8B tokens)

• Released two model sizes: BERT_base, BERT_large

Experimental results

(Wang et al, 2018): GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

BiLSTM: 63.9

Use BERT in practice

TensorFlow: https://github.com/google-research/bert

PyTorch: https://github.com/huggingface/transformers

https://github.com/google-research/bert

https://github.com/huggingface/transformers

Contextualized word embeddings in context

• TagLM (Peters et, 2017) • CoVe (McCann et al. 2017) • ULMfit (Howard and Ruder, 2018) • ELMo (Peters et al, 2018) • OpenAI GPT (Radford et al, 2018) • BERT (Devlin et al, 2018) • OpenAI GPT-2 (Radford et al, 2019) • XLNet (Yang et al, 2019) • SpanBERT (Joshi et al, 2019) • RoBERTa (Liu et al, 2019) • AlBERT (Anonymous) • …

Transformers

• NIPS’17: Attention is All You Need • Key idea: Multi-head self-attention • No recurrence structure any more so it

trains much faster • Originally proposed for NMT (encoder-

decoder framework) • Used as the base model of BERT

(encoder only)

Useful Resourcesnn.Transformer:

nn.TransformerEncoder:

The Annotated Transformer:http://nlp.seas.harvard.edu/2018/04/03/attention.html

A Jupyter notebook which explains how Transformer works line by line in PyTorch!

http://nlp.seas.harvard.edu/2018/04/03/attention.html

RNNs vs Transformers

the movie was terribly exciting !

Transformer layer 3

Transformer layer 2

Transformer layer 1

Multi-head Self Attention

• Attention: a query and a set of key-value pairs to an outputq (ki, vi)

• If we have multiple queries:

A(Q, K, V ) = softmax(QK⊺)V

Q ∈ ℝnQ×d, K, V ∈ ℝn×d

• Dot-product attention:A(q, K, V ) = ∑

i

eq⋅ki

∑j eq⋅kjvi

K, V ∈ ℝn×d, q ∈ ℝd

• Self-attention: let’s use each word as query and compute the attention with all the other words

= the word vectors themselves select each other

Multi-head Self Attention

• Scaled Dot-Product Attention:

A(Q, K, V ) = softmax(QK⊺

d)V

• Input: X ∈ ℝn×din

A(XWQ, XWK, XWV) ∈ ℝn×d

WQ, WK, WV ∈ ℝdin×d

• Multi-head attention: using more than one head is always useful..

A(Q, K, V ) = Concat(head1, …, headh)WO

headi = A(XWQi , XWK

i , XWVi )

In practice, , h = 8 d = dout /h, WO = dout × dout

Putting it all together

• Each Transformer block has two sub-layers • Multi-head attention • 2-layer feedforward NN (with ReLU)

• Each sublayer has a residual connection and a layer normalization

LayerNorm(x + SubLayer(x))

(Ba et al, 2016): Layer Normalization

• Input layer has a positional encoding

• BERT_base: 12 layers, 12 heads, hidden size = 768, 110M parameters

• BERT_large: 24 layers, 16 heads, hidden size = 1024, 340M parameters

Have fun with using ELMo or BERT in your final project :)

COS 484: Natural Language Processing - Princeton NLP · COS 484: Natural Language Processing....

Documents

Transcript of COS 484: Natural Language Processing - Princeton NLP · COS 484: Natural Language Processing....