Neural LMs - Carnegie Mellon University
demo.clab.cs.cmu.edu/algo4nlp20/slides/SP20 IITP...

Transcript of "Neural LMs" (114 pages)

Page 1:

Page 2:

Neural LMs

Image: (Bengio et al, 03)

Page 3:

“One Hot” Vectors

Page 4:

Neural LMs

(Bengio et al, 03)

Page 5:

Low-dimensional Representations

▪ Learning representations by back-propagating errors
  ▪ Rumelhart, Hinton & Williams, 1986

▪ A neural probabilistic language model
  ▪ Bengio et al., 2003

▪ Natural Language Processing (almost) from scratch
  ▪ Collobert & Weston, 2008

▪ Word representations: A simple and general method for semi-supervised learning
  ▪ Turian et al., 2010

▪ Distributed Representations of Words and Phrases and their Compositionality
  ▪ Word2Vec; Mikolov et al., 2013

Page 6:

Word Vectors

Distributed representations

Page 7:

What are various ways to represent the meaning of a word?

Page 8:

Lexical Semantics

▪ How should we represent the meaning of the word?
  ▪ Dictionary definition
  ▪ Lemma and wordforms
  ▪ Senses
  ▪ Relationships between words or senses
    ▪ Taxonomic relationships
    ▪ Word similarity, word relatedness
  ▪ Semantic frames and roles
  ▪ Connotation and sentiment

Page 9:

WordNet

https://wordnet.princeton.edu/

Electronic Dictionaries

Page 10:

Electronic Dictionaries

WordNet

NLTK www.nltk.org

Page 11:

Problems with Discrete Representations

▪ Too coarse
  ▪ expert ↔ skillful

▪ Sparse
  ▪ wicked, badass, ninja

▪ Subjective
▪ Expensive
▪ Hard to compute word relationships

expert [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]

skillful [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

dimensionality: PTB: 50K, Google1T 13M
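A minimal sketch of the problem (Python; the 15-word vocabulary and index positions are toy assumptions matching the vectors above): distinct one-hot vectors are orthogonal, so the representation can never say that two words are related.

    import numpy as np

    V = 15  # toy vocabulary size; PTB would be ~50K, Google1T ~13M
    expert = np.zeros(V); expert[3] = 1.0
    skillful = np.zeros(V); skillful[10] = 1.0

    # Cosine similarity of two distinct one-hot vectors is always 0:
    cos = expert @ skillful / (np.linalg.norm(expert) * np.linalg.norm(skillful))
    print(cos)  # 0.0 -- "expert" and "skillful" look maximally dissimilar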

Page 12:

Distributional Hypothesis

“The meaning of a word is its use in the language” [Wittgenstein PI 43]

“You shall know a word by the company it keeps” [Firth 1957]

If A and B have almost identical environments we say that they are synonyms.

[Harris 1954]

Page 13:

Example

What does ongchoi mean?

Page 14:

▪ Suppose you see these sentences:
  ▪ Ongchoi is delicious sautéed with garlic.
  ▪ Ongchoi is superb over rice
  ▪ Ongchoi leaves with salty sauces

▪ And you've also seen these:
  ▪ …spinach sautéed with garlic over rice
  ▪ Chard stems and leaves are delicious
  ▪ Collard greens and other salty leafy greens

Example

What does ongchoi mean?

Page 15:

Ongchoi: Ipomoea aquatica "Water Spinach"

Ongchoi is a leafy green like spinach, chard, or collard greens

Yamaguchi, Wikimedia Commons, public domain

Page 16:

Model of Meaning Focusing on Similarity

▪ Each word = a vector
  ▪ not just "word" or word45.
▪ Similar words are "nearby in space"
▪ The standard way to represent meaning in NLP

Page 17:

We'll Introduce 4 Kinds of Embeddings

▪ Count-based
  ▪ Words are represented by a simple function of the counts of nearby words

▪ Class-based
  ▪ Representation is created through hierarchical clustering: Brown clusters

▪ Distributed prediction-based (type) embeddings
  ▪ Representation is created by training a classifier to distinguish nearby and far-away words: word2vec, fasttext

▪ Distributed contextual (token) embeddings from language models
  ▪ ELMo, BERT

Page 18:

Term-Document Matrix

Context = appearing in the same document.

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                0                7            13
soldier           2               80               62            89
fool             36               58                1             4
clown            20               15                2             3

Page 19:

Term-Document Matrix

Each document is represented by a vector of words

battle 1 0 7 17

soldier 2 80 62 89

fool 36 58 1 4

clown 20 15 2 3

Page 20:

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                0                7            13
soldier           2               80               62            89
fool             36               58                1             4
clown            20               15                2             3

Vectors are the Basis of Information Retrieval

▪ Vectors are similar for the two comedies
▪ Different than the history
▪ Comedies have more fools and wit and fewer battles.
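A small sketch of that comparison (Python; the column order — As You Like It, Twelfth Night, Julius Caesar, Henry V — is our assumption about the table above):

    import numpy as np

    # Rows: battle, soldier, fool, clown; one column per play.
    counts = np.array([
        [1,  0,  7, 13],
        [2, 80, 62, 89],
        [36, 58, 1,  4],
        [20, 15, 2,  3],
    ], dtype=float)
    docs = counts.T  # each row is now one document's vector

    def cosine(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine(docs[0], docs[1]))  # the two comedies: ~0.62
    print(cosine(docs[0], docs[3]))  # comedy vs. history: ~0.11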

Page 21:

Visualizing Document Vectors

Page 22:

Words Can Be Vectors Too

▪ battle is "the kind of word that occurs in Julius Caesar and Henry V"
▪ fool is "the kind of word that occurs in comedies, especially Twelfth Night"

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                0                7            13
good            114               80               62            89
fool             36               58                1             4
clown            20               15                2             3

Page 23:

Term-Context Matrix

▪ Two words are "similar" in meaning if their context vectors are similar
▪ Similarity == relatedness

         knife   dog   sword   love   like
knife      0      1      6       5      5
dog        1      0      5       5      5
sword      6      5      0       5      5
love       5      5      5       0      5
like       5      5      5       5      2

Page 24:

Count-Based Representations

▪ Counts: term-frequency
  ▪ remove stop words
  ▪ use log10(tf)
  ▪ normalize by document length

            As You Like It   Twelfth Night   Julius Caesar   Henry V
battle            1                0                7            13
good            114               80               62            89
fool             36               58                1             4
wit              20               15                2             3

Page 25:

▪ What to do with words that are evenly distributed across many documents?

TF-IDF

idf_i = log10(N / df_i)

  N = total # of docs in collection
  df_i = # of docs that have word i

Words like "the" or "good" have very low idf
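A sketch of the weighting on the toy matrix above (Python; one common variant, assuming tf = log10(1 + count) and the idf defined above):

    import numpy as np

    counts = np.array([
        [1,   0,  7, 13],   # battle
        [114, 80, 62, 89],  # good
        [36,  58,  1,  4],  # fool
        [20,  15,  2,  3],  # wit
    ], dtype=float)

    N = counts.shape[1]            # total # of docs in collection
    df = (counts > 0).sum(axis=1)  # # of docs that have word i
    idf = np.log10(N / df)
    tfidf = np.log10(1 + counts) * idf[:, None]
    # Words occurring in every document (good, fool, wit) get idf = 0,
    # so only the "battle" row survives in this tiny collection.
    print(np.round(tfidf, 3))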

Page 26:

Positive Pointwise Mutual Information (PPMI)

▪ In a word–context matrix
▪ Do words w and c co-occur more than if they were independent?

▪ PMI is biased toward infrequent events
  ▪ Very rare words have very high PMI values
  ▪ Give rare words slightly higher probabilities: α = 0.75

(Church and Hanks, 1990)

(Turney and Pantel, 2010)
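The definitions behind those bullets (standard, following the cited papers; the formulas themselves are on the slide image):

    PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ]
    PPMI(w, c) = max(PMI(w, c), 0)

and the smoothed context distribution that tempers the rare-word bias:

    P_α(c) = count(c)^α / Σ_c′ count(c′)^α,   with α = 0.75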

Page 27:

(Pecina’09)

Page 28:

Dimensionality Reduction

▪ Wikipedia: ~29 million English documents. Vocab: ~1M words.
▪ High dimensionality of word–document matrix
▪ Sparsity
▪ The order of rows and columns doesn't matter

▪ Goal:
  ▪ good similarity measure for words or documents
  ▪ dense representation

▪ Sparse vs Dense vectors
  ▪ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
  ▪ Dense vectors may generalize better than storing explicit counts
  ▪ They may do better at capturing synonymy
  ▪ In practice, they work better

Page 29:

▪ Solution idea:
  ▪ Find a projection into a low-dimensional space (~300 dim)
  ▪ That gives us the best separation between features

Singular Value Decomposition (SVD)

X = U Σ Vᵀ   (U, V orthonormal; Σ diagonal, singular values sorted)

Page 30:

dense word vectors

Truncated SVD

We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix (the k largest singular values):

X ≈ U_k Σ_k V_kᵀ

[Figure: the truncated factorization; rows of U_k give dense word vectors, columns of V_kᵀ give dense document vectors]
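A numpy sketch of the truncation (the matrix is the toy term-document matrix from earlier; k = 2 is an arbitrary choice):

    import numpy as np

    X = np.array([[1, 0, 7, 13],
                  [114, 80, 62, 89],
                  [36, 58, 1, 4],
                  [20, 15, 2, 3]], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s is sorted, descending
    k = 2
    word_vecs = U[:, :k] * s[:k]                  # dense word vectors (one row per word)
    doc_vecs = Vt[:k, :].T                        # dense document vectors (one row per doc)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation of X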

Page 31:

Latent Semantic Analysis

[Deerwester et al., 1990]

#0      #1          #2          #3         #4           #5
we      music       company     how        program      10
said    film        mr          what       project      30
have    theater     its         about      russian      11
they    mr          inc         their      space        12
not     this        stock       or         russia       15
but     who         companies   this       center       13
be      movie       sales       are        programs     14
do      which       shares      history    clark        20
he      show        said        be         aircraft     sept
this    about       business    social     ballet       16
there   dance       share       these      its          25
you     its         chief       other      projects     17
are     disney      executive   research   orchestra    18
what    play        president   writes     development  19
if      production  group       language   work         21

Page 32:

LSA++

▪ Probabilistic Latent Semantic Indexing (pLSI)
  ▪ Hofmann, 1999

▪ Latent Dirichlet Allocation (LDA)
  ▪ Blei et al., 2003

▪ Nonnegative Matrix Factorization (NMF)
  ▪ Lee & Seung, 1999

Page 33:

Word Similarity

Page 34:

Evaluation

▪ Intrinsic
▪ Extrinsic
▪ Qualitative

Page 35:

Extrinsic Evaluation

▪ Chunking
▪ POS tagging
▪ Parsing
▪ MT
▪ SRL
▪ Topic categorization
▪ Sentiment analysis
▪ Metaphor detection
▪ etc.

Page 36:

Intrinsic Evaluation

▪ WS-353 (Finkelstein et al. '02)
▪ MEN-3k (Bruni et al. '12)
▪ SimLex-999 dataset (Hill et al., 2015)

word1     word2         similarity (humans)   similarity (embeddings)
vanish    disappear            9.8                     1.1
behave    obey                 7.3                     0.5
belief    impression           5.95                    0.3
muscle    bone                 3.65                    1.7
modest    flexible             0.98                    0.98
hole      agreement            0.3                     0.3

Spearman's rho (human ranks, model ranks)
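A sketch of the metric with scipy (scores copied from the table above):

    from scipy.stats import spearmanr

    human = [9.8, 7.3, 5.95, 3.65, 0.98, 0.3]  # similarity (humans)
    model = [1.1, 0.5, 0.3, 1.7, 0.98, 0.3]    # similarity (embeddings)

    rho, pvalue = spearmanr(human, model)  # compares ranks, not raw scores
    print(rho)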

Page 37:

Visualisation

▪ Visualizing Data using t-SNE (van der Maaten & Hinton’08)

[Faruqui et al., 2015]

Page 38:

What we’ve seen by now

▪ Meaning representation
▪ Distributional hypothesis
▪ Count-based vectors
  ▪ term-document matrix
  ▪ word-in-context matrix
  ▪ normalizing counts: tf-idf, PPMI
  ▪ dimensionality reduction
  ▪ measuring similarity
  ▪ evaluation

Next:

▪ Brown clusters
  ▪ Representation is created through hierarchical clustering

Page 39:

Word embedding representations

▪ Count-based
  ▪ tf-idf, PPMI
▪ Class-based
  ▪ Brown clusters
▪ Distributed prediction-based (type) embeddings
  ▪ Word2Vec, Fasttext
▪ Distributed contextual (token) embeddings from language models
  ▪ ELMo, BERT
▪ + many more variants
  ▪ Multilingual embeddings
  ▪ Multisense embeddings
  ▪ Syntactic embeddings
  ▪ etc. etc.

Page 40:

The intuition of Brown clustering

▪ Similar words appear in similar contexts
▪ More precisely: similar words have similar distributions of words to their immediate left and right

[Figure: Monday, Tuesday, and Wednesday all occur in the same immediate contexts, e.g. "on ___", "last ___"]

Page 41:

Brown Clustering

dog [0000]

cat [0001]

ant [001]

river [010]

lake [011]

blue [10]

red [11]

[Figure: binary merge tree over dog, cat, ant, river, lake, blue, red; the 0/1 path from the root to each leaf gives the bit strings above]

Page 42:

Brown Clustering

[Brown et al, 1992]

Page 43:

Brown Clustering

[Miller et al., 2004]

Page 44:

▪ V is a vocabulary

▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters

▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1

Brown Clustering

The model:

p(w1, …, wn) = Π_{i=1..n} e(wi | C(wi)) · q(C(wi) | C(wi-1))

(where e(wi | C(wi)) is the emission probability of wi within its cluster)

Page 45:

▪ V is a vocabulary

▪ C : V → {1, …, k} is a partition of the vocabulary into k clusters

▪ q(C(wi) | C(wi-1)) is the probability of the cluster of wi following the cluster of wi-1

Brown Clustering

Quality(C)

The model:

p(w1, …, wn) = Π_{i=1..n} e(wi | C(wi)) · q(C(wi) | C(wi-1))

Page 46:

Quality(C)

Slide by Michael Collins
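The derivation on the slide image isn't recoverable from the transcript; in Collins' notes the objective works out to the mutual information between adjacent clusters, plus a constant G that does not depend on C:

    Quality(C) = Σ_{c=1..k} Σ_{c′=1..k} p(c, c′) log [ p(c, c′) / (p(c) p(c′)) ] + G

where p(c, c′) is the relative frequency with which a word in cluster c is followed by a word in cluster c′.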

Page 47:

A Naive Algorithm

▪ We start with |V| clusters: each word gets its own cluster

▪ Our aim is to find k final clusters

▪ We run |V| − k merge steps:

▪ At each merge step we pick two clusters ci and cj , and merge them into a single cluster

▪ We greedily pick merges such that Quality(C) for the clustering C after the merge step is maximized at each stage

▪ Cost? Naive = O(|V|5 ). Improved algorithm gives O(|V|3 ): still too slow for realistic values of |V|

Slide by Michael Collins
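A sketch of that greedy loop (Python; the quality function is left abstract here — it would score a clustering as defined on the previous slides):

    from itertools import combinations

    def naive_clustering(vocab, k, quality):
        """|V| - k greedy merges, each chosen to maximize quality(C)."""
        clusters = [frozenset([w]) for w in vocab]

        def after_merge(cs, i, j):
            return [c for n, c in enumerate(cs) if n not in (i, j)] + [cs[i] | cs[j]]

        while len(clusters) > k:
            i, j = max(combinations(range(len(clusters)), 2),
                       key=lambda ij: quality(after_merge(clusters, *ij)))
            clusters = after_merge(clusters, i, j)
        return clusters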

Page 48:

Brown Clustering Algorithm

▪ Parameter of the approach is m (e.g., m = 1000)
▪ Take the top m most frequent words, put each into its own cluster, c1, c2, …, cm
▪ For i = (m + 1) … |V|:
  ▪ Create a new cluster, cm+1, for the i'th most frequent word. We now have m + 1 clusters
  ▪ Choose two clusters from c1 … cm+1 to be merged: pick the merge that gives a maximum value for Quality(C). We're now back to m clusters
▪ Carry out (m − 1) final merges, to create a full hierarchy

▪ Running time: O(|V|m^2 + n), where n is corpus length

Slide by Michael Collins

Page 49:

Word embedding representations

▪ Count-based
  ▪ tf-idf, PPMI
▪ Class-based
  ▪ Brown clusters
▪ Distributed prediction-based (type) embeddings
  ▪ Word2Vec, Fasttext
▪ Distributed contextual (token) embeddings from language models
  ▪ ELMo, BERT
▪ + many more variants
  ▪ Multilingual embeddings
  ▪ Multisense embeddings
  ▪ Syntactic embeddings
  ▪ etc. etc.

Page 50:

Word2Vec

▪ Popular embedding method
▪ Very fast to train
▪ Code available on the web
▪ Idea: predict rather than count

Page 51:

Word2Vec

[Mikolov et al.’ 13]

Page 52:

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

Page 53:

▪ Predict vs Count

Skip-gram Prediction

the cat sat on the mat

context size = 2

wt = the → CLASSIFIER:
  wt-2 = <start-2>
  wt-1 = <start-1>
  wt+1 = cat
  wt+2 = sat

Page 54:

Skip-gram Prediction

▪ Predict vs Count

the cat sat on the mat

context size = 2

wt = cat → CLASSIFIER:
  wt-2 = <start-1>
  wt-1 = the
  wt+1 = sat
  wt+2 = on

Page 55:

the cat sat on the mat

▪ Predict vs Count

Skip-gram Prediction

context size = 2

wt = sat → CLASSIFIER:
  wt-2 = the
  wt-1 = cat
  wt+1 = on
  wt+2 = the

Page 56:

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2

wt = on → CLASSIFIER:
  wt-2 = cat
  wt-1 = sat
  wt+1 = the
  wt+2 = mat

Page 57:

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2

wt = the → CLASSIFIER:
  wt-2 = sat
  wt-1 = on
  wt+1 = mat
  wt+2 = <end+1>

Page 58:

▪ Predict vs Count

the cat sat on the mat

Skip-gram Prediction

context size = 2

wt = mat → CLASSIFIER:
  wt-2 = on
  wt-1 = the
  wt+1 = <end+1>
  wt+2 = <end+2>

Page 59:

▪ Predict vs Count

Skip-gram Prediction

wt = the → CLASSIFIER:
  wt-2 = <start-2>
  wt-1 = <start-1>
  wt+1 = cat
  wt+2 = sat

wt = the → CLASSIFIER:
  wt-2 = sat
  wt-1 = on
  wt+1 = mat
  wt+2 = <end+1>

Page 60:

Skip-gram Prediction

Page 61:

Skip-gram Prediction

▪ Training data

(wt, wt-2)
(wt, wt-1)
(wt, wt+1)
(wt, wt+2)
…

Page 62:

Skip-gram Prediction

Page 63:

Objective

▪ For each word in the corpus t = 1 … T

Maximize the probability of any context window given the current center word
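Spelled out (the standard skip-gram likelihood; m is the context size, θ the embedding parameters):

    J(θ) = (1/T) Σ_{t=1..T} Σ_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t; θ)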

Page 64:

Skip-gram Prediction

▪ Softmax
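The softmax referred to here is the standard word2vec one: with v_c the vector of the center word c and u_o the ("output") vector of a candidate context word o,

    p(o | c) = exp(u_o · v_c) / Σ_{w ∈ V} exp(u_w · v_c)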

Page 65:

SGNS

▪ Negative Sampling
  ▪ Treat the target word and a neighboring context word as positive examples
  ▪ Subsample very frequent words
  ▪ Randomly sample other words in the lexicon to get negative samples
    ▪ 2× negative samples

Given a tuple (t, c) = target, context

▪ (cat, sat) — a positive example
▪ (cat, aardvark) — a negative example

Page 66:

Choosing noise words

Could pick w according to their unigram frequency P(w)

More common to choose them according to P_α(w)

α = ¾ works well because it gives rare noise words slightly higher probability

To show this, imagine two events p(a) = .99 and p(b) = .01:
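Working that example through: with α = ¾,

    P_α(a) = .99^.75 / (.99^.75 + .01^.75) ≈ .9925 / 1.0241 ≈ .97
    P_α(b) = .01^.75 / (.99^.75 + .01^.75) ≈ .0316 / 1.0241 ≈ .03

so the rare event b is sampled about three times as often as its raw probability .01 would suggest.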

Page 67:

How to compute p(+|t,c)?

Page 68:

SGNS

Given a tuple (t,c) = target, context

▪ (cat, sat)
▪ (cat, aardvark)

Return probability that c is a real context word:
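The standard SGNS answer to the previous slide's question: the probability is a sigmoid of the dot product of the two embeddings,

    p(+ | t, c) = σ(t · c) = 1 / (1 + e^(−t·c)),   p(− | t, c) = 1 − p(+ | t, c)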

Page 69:

Learning the classifier

▪ Iterative process
▪ We'll start with 0 or random weights
▪ Then adjust the word weights to
  ▪ make the positive pairs more likely
  ▪ and the negative pairs less likely
  ▪ over the entire training set
▪ Train using gradient descent
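For one positive pair (t, c_pos) with k sampled negatives c_neg1 … c_negk, the quantity being maximized by that gradient descent is (standard SGNS):

    L = log σ(c_pos · t) + Σ_{i=1..k} log σ(−c_negi · t)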

Page 70:

Skip-gram Prediction

Page 71:

FastText

https://fasttext.cc/

Page 72:

FastText: Motivation

Page 73:

Subword Representation

skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}
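A sketch of the extraction (Python; FastText itself also hashes each n-gram into one of K buckets — it uses an FNV-1a hash, for which Python's built-in hash stands in here):

    def char_ngrams(word, n_min=3, n_max=6):
        """All character n-grams of ^word$, plus the whole word itself."""
        w = f"^{word}$"
        grams = {w}
        for n in range(n_min, n_max + 1):
            grams.update(w[i:i + n] for i in range(len(w) - n + 1))
        return grams

    K = 2_000_000  # number of hash buckets
    buckets = {g: hash(g) % K for g in char_ngrams("skiing", 4, 4)}
    # char_ngrams("skiing", 4, 4) == {'^skiing$', '^ski', 'skii', 'kiin', 'iing', 'ing$'}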

Page 74:

FastText

Page 75:

Details

▪ How many possible n-grams?
  ▪ |character set|^n
▪ Hashing to map n-grams to integers in 1 to K = 2M
▪ Get word vectors for out-of-vocabulary words using subwords
▪ Less than 2× slower than word2vec skipgram
▪ n-grams between 3 and 6 characters
  ▪ short n-grams (n = 4) are good to capture syntactic information
  ▪ longer n-grams (n = 6) are good to capture semantic information

Page 76:

FastText Evaluation

▪ Intrinsic evaluation
  ▪ Arabic, German, Spanish, French, Romanian, Russian

word1     word2         similarity (humans)   similarity (embeddings)
vanish    disappear            9.8                     1.1
behave    obey                 7.3                     0.5
belief    impression           5.95                    0.3
muscle    bone                 3.65                    1.7
modest    flexible             0.98                    0.98
hole      agreement            0.3                     0.3

Spearman's rho (human ranks, model ranks)

Page 77:

FastText Evaluation

[Grave et al, 2017]

Page 78:

FastText Evaluation

Page 79:

FastText Evaluation

Page 80:

Dense Embeddings You Can Download

Word2vec (Mikolov et al.' 13)

https://code.google.com/archive/p/word2vec/

Fasttext (Bojanowski et al.’ 17)

http://www.fasttext.cc/

GloVe (Pennington et al., 14)

http://nlp.stanford.edu/projects/glove/

Page 81:

Word embedding representations

▪ Count-based
  ▪ tf-idf, PPMI
▪ Class-based
  ▪ Brown clusters
▪ Distributed prediction-based (type) embeddings
  ▪ Word2Vec, Fasttext
▪ Distributed contextual (token) embeddings from language models
  ▪ ELMo, BERT
▪ + many more variants
  ▪ Multilingual embeddings
  ▪ Multisense embeddings
  ▪ Syntactic embeddings
  ▪ etc. etc.

Page 82:

Motivation

p(play | Elmo and Cookie Monster play a game .)

≠ p(play | The Broadway play premiered yesterday .)

Page 83:

ELMo

https://allennlp.org/elmo

Page 84:

Background

Page 85:

The Broadway play premiered yesterday .

[Figure: an LSTM language model run over the sentence; ?? marks the word representation in question]

Page 86:

The Broadway play premiered yesterday .

[Figure: a stacked (multi-layer) LSTM language model over the sentence; ?? marks the representation in question]

Page 87:

The Broadway play premiered yesterday .

[Figure: forward and backward LSTM language models over the sentence; ?? ?? mark the two directions' representations]

Page 88:

The Broadway play premiered yesterday .

[Figure: two-layer bidirectional LSTM language model over the sentence]

Embeddings from Language Models

ELMo = ??

Page 89:

The Broadway play premiered yesterday .

[Figure: two-layer bidirectional LSTM language model; the representations that feed ELMo are highlighted]

Embeddings from Language Models

ELMo = (the highlighted layer representations)

Page 90:

The Broadway play premiered yesterday .

[Figure: two-layer bidirectional LSTM language model over the sentence]

ELMo = (layer 0) + (layer 1) + (layer 2)

Embeddings from Language Models

Page 91:

The Broadway play premiered yesterday .

[Figure: two-layer bidirectional LSTM language model over the sentence]

ELMo = λ2 (layer 2) + λ1 (layer 1) + λ0 (layer 0)

Embeddings from Language Models
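In the paper's notation (Peters et al., 2018), the per-layer weights are softmax-normalized and scaled by a task-specific γ:

    ELMo_t = γ · Σ_{j=0..L} s_j · h_{t,j},   with s = softmax(w)

where h_{t,0} is the token embedding and h_{t,j} (j ≥ 1) is the j-th biLSTM layer's hidden state for token t.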

Page 92:

Evaluation: Extrinsic Tasks

Page 93:

Stanford Question Answering Dataset (SQuAD)

[Rajpurkar et al, ‘16, ‘18]

Page 94:

SNLI

[Bowman et al, ‘15]

Page 95:

BERT

https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

Page 96:

Cloze task objective
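For illustration with the deck's running sentence (this example is ours, not from the slide): mask a token and train the model to recover it from both left and right context, e.g. predict

    p(play | The Broadway [MASK] premiered yesterday .)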

Page 97:

Page 98:

https://rajpurkar.github.io/SQuAD-explorer/

Page 99:

Multilingual Embeddings

https://github.com/mfaruqui/crosslingual-cca
http://128.2.220.95/multilingual/

Page 100:

Motivation

model 1 model 2

?

▪ comparison of words trained with different models

Page 101:

Motivation

▪ translation induction
▪ improving monolingual embeddings through cross-lingual context

English French

?

Page 102:

Canonical Correlation Analysis (CCA)

Page 103:

Canonical Correlation Analysis (CCA)


[Faruqui & Dyer, ‘14]
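The objective (standard CCA, which Faruqui & Dyer apply to embedding matrices X and Y whose rows are aligned translation pairs):

    (v*, w*) = argmax_{v, w} corr(X v, Y w)

The top d correlated projection directions map both languages into a shared d-dimensional space.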

Page 104:

Extension: Multilingual Embeddings

[Ammar et al., ‘16]

[Figure: English as the pivot; French, Spanish, Arabic, and Swedish embeddings are mapped into the English space via CCA projections, e.g. O_french→english and its inverse O_french←english = (O_french→english)^-1]

Page 105:

Polyglot Models

[Ammar et al., ‘16, Tsvetkov et al., ‘16]

Page 106:

Embeddings can help study word history!

Page 107:

Diachronic Embeddings

▪ count-based embeddings w/ PPMI
▪ projected to a common space

Page 108:

Project 300 dimensions down into 2

~30 million books, 1850-1990, Google Books data

Page 109:

Negative words change faster than positive words

Page 110:

Embeddings reflect ethnic stereotypes over time

Page 111:

Change in linguistic framing 1910-1990

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

Page 112:

Analogy: Embeddings capture relational meaning!

[Mikolov et al.’ 13]

Page 113:

and also human biases

[Bolukbasi et al., ‘16]

Page 114:

Conclusion

▪ Concepts or word senses
  ▪ Have a complex many-to-many association with words (homonymy, multiple senses)
  ▪ Have relations with each other
    ▪ Synonymy, Antonymy, Superordinate
  ▪ But are hard to define formally (necessary & sufficient conditions)

▪ Embeddings = vector models of meaning
  ▪ More fine-grained than just a string or index
  ▪ Especially good at modeling similarity/analogy
    ▪ Just download them and use cosines!!
  ▪ Useful in many NLP tasks
  ▪ But know they encode cultural stereotypes