Neural LMs - Carnegie Mellon Universitydemo.clab.cs.cmu.edu/algo4nlp20/slides/SP20 IITP...A neural...
Transcript of Neural LMs - Carnegie Mellon Universitydemo.clab.cs.cmu.edu/algo4nlp20/slides/SP20 IITP...A neural...
Neural LMs
Image: (Bengio et al, 03)
“One Hot” Vectors
Neural LMs
(Bengio et al, 03)
Low-dimensional Representations
▪ Learning representations by back-propagating errors ▪ Rumelhart, Hinton & Williams, 1986
▪ A neural probabilistic language model▪ Bengio et al., 2003
▪ Natural Language Processing (almost) from scratch▪ Collobert & Weston, 2008
▪ Word representations: A simple and general method for semi-supervised learning▪ Turian et al., 2010
▪ Distributed Representations of Words and Phrases and their Compositionality▪ Word2Vec; Mikolov et al., 2013
Word Vectors
Distributed representations
What are various ways to represent the meaning of a word?
Lexical Semantics
▪ How should we represent the meaning of the word?▪ Dictionary definition▪ Lemma and wordforms▪ Senses▪ Relationships between words or senses▪ Taxonomic relationships▪ Word similarity, word relatedness ▪ Semantic frames and roles ▪ Connotation and sentiment
Problems with Discrete Representations
▪ Too coarse▪ expert ↔ skillful
▪ Sparse▪ wicked, badass, ninja
▪ Subjective▪ Expensive▪ Hard to compute word relationships
expert [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
skillful [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
dimensionality: PTB: 50K, Google1T 13M
Distributional Hypothesis
“The meaning of a word is its use in the language” [Wittgenstein PI 43]
“You shall know a word by the company it keeps” [Firth 1957]
If A and B have almost identical environments we say that they are synonyms.
[Harris 1954]
Example
What does ongchoi mean?
▪ Suppose you see these sentences:▪ Ongchoi is delicious sautéed with garlic. ▪ Ongchoi is superb over rice▪ Ongchoi leaves with salty sauces
▪ And you've also seen these:▪ …spinach sautéed with garlic over rice▪ Chard stems and leaves are delicious▪ Collard greens and other salty leafy greens
Example
What does ongchoi mean?
Ongchoi: Ipomoea aquatica "Water Spinach"
Ongchoi is a leafy green like spinach, chard, or collard greens
Yamaguchi, Wikimedia Commons, public domain
Model of Meaning Focusing on Similarity
▪ Each word = a vector ▪ not just “word” or word45.▪ similar words are “nearby in space”▪ the standard way to represent meaning in NLP
We'll Introduce 4 Kinds of Embeddings
▪ Count-based▪ Words are represented by a simple function of the counts of nearby words
▪ Class-based▪ Representation is created through hierarchical clustering, Brown clusters
▪ Distributed prediction-based (type) embeddings▪ Representation is created by training a classifier to distinguish nearby and
far-away words: word2vec, fasttext
▪ Distributed contextual (token) embeddings from language models▪ ELMo, BERT
Term-Document Matrix
Context = appearing in the same document.
battle 1 0 7 17
soldier 2 80 62 89
fool 36 58 1 4
clown 20 15 2 3
Term-Document Matrix
Each document is represented by a vector of words
battle 1 0 7 17
soldier 2 80 62 89
fool 36 58 1 4
clown 20 15 2 3
battle 1 0 7 13
soldier 2 80 62 89
fool 36 58 1 4
clown 20 15 2 3
Vectors are the Basis of Information Retrieval
▪ Vectors are similar for the two comedies▪ Different than the history▪ Comedies have more fools and wit and fewer battles.
Visualizing Document Vectors
Words Can Be Vectors Too
▪ battle is "the kind of word that occurs in Julius Caesar and Henry V"▪ fool is "the kind of word that occurs in comedies, especially Twelfth
Night"
battle 1 0 7 13
good 114 80 62 89
fool 36 58 1 4
clown 20 15 2 3
Term-Context Matrix
▪ Two words are “similar” in meaning if their context vectors are similar▪ Similarity == relatedness
knife dog sword love like
knife 0 1 6 5 5
dog 1 0 5 5 5
sword 6 5 0 5 5
love 5 5 5 0 5
like 5 5 5 5 2
Count-Based Representations
▪ Counts: term-frequency▪ remove stop words▪ use log10(tf)▪ normalize by document length
battle 1 0 7 13
good 114 80 62 89
fool 36 58 1 4
wit 20 15 2 3
▪ What to do with words that are evenly distributed across many documents?
TF-IDF
Total # of docs in collection
# of docs that have word i
Words like "the" or "good" have very low idf
Positive Pointwise Mutual Information (PPMI)
▪ In word--context matrix▪ Do words w and c co-occur more than if they were independent?
▪ PMI is biased toward infrequent events▪ Very rare words have very high PMI values▪ Give rare words slightly higher probabilities α=0.75
(Church and Hanks, 1990)
(Turney and Pantel, 2010)
(Pecina’09)
Dimensionality Reduction
▪ Wikipedia: ~29 million English documents. Vocab: ~1M words. ▪ High dimensionality of word--document matrix▪ Sparsity▪ The order of rows and columns doesn’t matter
▪ Goal: ▪ good similarity measure for words or documents▪ dense representation
▪ Sparse vs Dense vectors▪ Short vectors may be easier to use as features in
machine learning (less weights to tune)▪ Dense vectors may generalize better than storing explicit counts▪ They may do better at capturing synonymy▪ In practice, they work better
aardvark 1
▪ Solution idea:▪ Find a projection into a low-dimensional space (~300 dim)▪ That gives us a best separation between features
Singular Value Decomposition (SVD)
orthonormal diagonal, sorted
densewordvectors
Truncated SVD
We can approximate the full matrix by only considering the leftmost k terms in the diagonal matrix (the k largest singular values)
⨉ ⨉
9
4
.1
.0
.0
.0
.0
.0
.0
dense document vectors
Latent Semantic Analysis
[Deerwester et al., 1990]
#0 #1 #2 #3 #4 #5we music company how program 10said film mr what project 30have theater its about russian 11they mr inc their space 12not this stock or russia 15but who companies this center 13be movie sales are programs 14do which shares history clark 20he show said be aircraft septthis about business social ballet 16there dance share these its 25you its chief other projects 17are disney executive research orchestra 18what play president writes development 19if production group language work 21
LSA++
▪ Probabilistic Latent Semantic Indexing (pLSI)▪ Hofmann, 1999
▪ Latent Dirichlet Allocation (LDA)▪ Blei et al., 2003
▪ Nonnegative Matrix Factorization (NMF)▪ Lee & Seung, 1999
Word Similarity
Evaluation
▪ Intrinsic▪ Extrinsic▪ Qualitative
Extrinsic Evaluation
▪ Chunking▪ POS tagging▪ Parsing▪ MT▪ SRL▪ Topic categorization▪ Sentiment analysis▪ Metaphor detection▪ etc.▪
Intrinsic Evaluation
▪ WS-353 (Finkelstein et al. ‘02)▪ MEN-3k (Bruni et al. ‘12)▪ SimLex-999 dataset (Hill et al., 2015)
word1 word2similarity (humans)
vanish disappear 9.8
behave obey 7.3
belief impression 5.95
muscle bone 3.65
modest flexible 0.98
hole agreement 0.3
similarity (embeddings)
1.1
0.5
0.3
1.7
0.98
0.3
Spearman's rho (human ranks, model ranks)
Visualisation
▪ Visualizing Data using t-SNE (van der Maaten & Hinton’08)
[Faruqui et al., 2015]
What we’ve seen by now
▪ Meaning representation▪ Distributional hypothesis▪ Count-based vectors
▪ term-document matrix▪ word-in-context matrix▪ normalizing counts: tf-idf, PPMI▪ dimensionality reduction▪ measuring similarity▪ evaluation
Next:
▪ Brown clusters▪ Representation is created through hierarchical clustering
Word embedding representations
▪ Count-based ▪ tf-idf, , PPMI
▪ Class-based▪ Brown clusters
▪ Distributed prediction-based (type) embeddings▪ Word2Vec, Fasttext
▪ Distributed contextual (token) embeddings from language models▪ ELMo, BERT
▪ + many more variants ▪ Multilingual embeddings▪ Multisense embeddings▪ Syntactic embeddings▪ etc. etc.
The intuition of Brown clustering
▪ Similar words appear in similar contexts ▪ More precisely: similar words have similar distributions of words to their
immediate left and right
Mondayonlast----
------
Tuesdayonlast----
------
Wednesdayonlast----
------
Brown Clustering
dog [0000]
cat [0001]
ant [001]
river [010]
lake [011]
blue [10]
red [11]
dog cat ant river lake blue red
0
0
00
0 0 11 1 1
11
Brown Clustering
[Brown et al, 1992]
Brown Clustering
[ Miller et al., 2004]
▪ is a vocabulary
▪ is a partition of the vocabulary into k clusters
▪ is a probability of cluster of wi to follow the cluster of wi-1
▪
Brown Clustering
The model:
▪ is a vocabulary
▪ is a partition of the vocabulary into k clusters
▪ is a probability of cluster of wi to follow the cluster of wi-1
▪
Brown Clustering
Quality(C)
The model:
Quality(C)
Slide by Michael Collins
A Naive Algorithm
▪ We start with |V| clusters: each word gets its own cluster
▪ Our aim is to find k final clusters
▪ We run |V| − k merge steps:
▪ At each merge step we pick two clusters ci and cj , and merge them into a single cluster
▪ We greedily pick merges such that Quality(C) for the clustering C after the merge step is maximized at each stage
▪ Cost? Naive = O(|V|5 ). Improved algorithm gives O(|V|3 ): still too slow for realistic values of |V|
Slide by Michael Collins
Brown Clustering Algorithm▪ Parameter of the approach is m (e.g., m = 1000)▪ Take the top m most frequent words,
put each into its own cluster, c1, c
2, … c
m
▪ For i = (m + 1) … |V| ▪ Create a new cluster, c
m+1, for the i’th most frequent word.
We now have m + 1 clusters ▪ Choose two clusters from c
1 . . . c
m+1 to be merged: pick the merge that gives
a maximum value for Quality(C). We’re now back to m clusters
▪ Carry out (m − 1) final merges, to create a full hierarchy
▪ Running time: O(|V|m2 + n) where n is corpus length
Slide by Michael Collins
Word embedding representations
▪ Count-based ▪ tf-idf, PPMI
▪ Class-based▪ Brown clusters
▪ Distributed prediction-based (type) embeddings▪ Word2Vec, Fasttext
▪ Distributed contextual (token) embeddings from language models▪ ELMo, BERT
▪ + many more variants ▪ Multilingual embeddings▪ Multisense embeddings▪ Syntactic embeddings▪ etc. etc.
Word2Vec
▪ Popular embedding method▪ Very fast to train▪ Code available on the web▪ Idea: predict rather than count
Word2Vec
[Mikolov et al.’ 13]
Skip-gram Prediction
▪ Predict vs Count
the cat sat on the mat
▪ Predict vs Count
Skip-gram Prediction
the cat sat on the mat
context size = 2
wt = the CLASSIFIER
wt-2
= <start-2
>w
t-1 = <start
-1>
wt+1
= catw
t+2 = sat
Skip-gram Prediction
▪ Predict vs Count
the cat sat on the mat
context size = 2
wt = cat CLASSIFIER
wt-2
= <start-1
>w
t-1 = the
wt+1
= satw
t+2 = on
the cat sat on the mat
▪ Predict vs Count
Skip-gram Prediction
context size = 2
wt = sat CLASSIFIER
wt-2
= thew
t-1 = cat
wt+1
= onw
t+2 = the
▪ Predict vs Count
the cat sat on the mat
Skip-gram Prediction
context size = 2
wt = on CLASSIFIER
wt-2
= catw
t-1 = sat
wt+1
= thew
t+2 = mat
▪ Predict vs Count
the cat sat on the mat
Skip-gram Prediction
context size = 2
wt = the CLASSIFIER
wt-2
= satw
t-1 = on
wt+1
= matw
t+2 = <end
+1>
▪ Predict vs Count
the cat sat on the mat
Skip-gram Prediction
context size = 2
wt = mat CLASSIFIER
wt-2
= onw
t-1 = the
wt+1
= <end+1
>w
t+2 = <end
+2>
▪ Predict vs Count
Skip-gram Prediction
wt = the CLASSIFIER
wt-2
= <start-2
>w
t-1 = <start
-1>
wt+1
= catw
t+2 = sat
wt = the CLASSIFIER
wt-2
= satw
t-1 = on
wt+1
= matw
t+2 = <end
+1>
Skip-gram Prediction
Skip-gram Prediction
▪ Training data
wt , w
t-2w
t , w
t-1w
t , w
t+1w
t , w
t+2...
Skip-gram Prediction
Objective
▪ For each word in the corpus t= 1 … T
Maximize the probability of any context window given the current center word
Skip-gram Prediction
▪ Softmax
SGNS
▪ Negative Sampling▪ Treat the target word and a neighboring context word as positive examples.
▪ subsample very frequent words
▪ Randomly sample other words in the lexicon to get negative samples▪ x2 negative samples
Given a tuple (t,c) = target, context
▪ (cat, sat)▪ (cat, aardvark)
Choosing noise words
Could pick w according to their unigram frequency P(w)
More common to chosen then according to pα(w)
α= ¾ works well because it gives rare noise words slightly higher probability
To show this, imagine two events p(a)=.99 and p(b) = .01:
How to compute p(+|t,c)?
SGNS
Given a tuple (t,c) = target, context
▪ (cat, sat)▪ (cat, aardvark)
Return probability that c is a real context word:
Learning the classifier
▪ Iterative process▪ We’ll start with 0 or random weights▪ Then adjust the word weights to
▪ make the positive pairs more likely ▪ and the negative pairs less likely
▪ over the entire training set:
▪ Train using gradient descent
Skip-gram Prediction
FastText: Motivation
Subword Representation
skiing = {^skiing$, ^ski, skii, kiin, iing, ing$}
FastText
Details
▪ how many possible ngrams? ▪ |character set|n
▪ Hashing to map n-grams to integers in 1 to K=2M
▪ get word vectors for out-of-vocabulary words using subwords.▪ less than 2× slower than word2vec skipgram
▪ n-grams between 3 and 6 characters▪ short n-grams (n = 4) are good to capture syntactic information▪ longer n-grams (n = 6) are good to capture semantic information
FastText Evaluation▪ Intrinsic evaluation
▪ Arabic, German, Spanish, French, Romanian, Russian
word1 word2similarity (humans)
vanish disappear 9.8
behave obey 7.3
belief impression 5.95
muscle bone 3.65
modest flexible 0.98
hole agreement 0.3
similarity (embeddings)
1.1
0.5
0.3
1.7
0.98
0.3
Spearman's rho (human ranks, model ranks)
FastText Evaluation
[Grave et al, 2017]
FastText Evaluation
FastText Evaluation
Dense Embeddings You Can Download Word2vec (Mikolov et al.’ 13)
https://code.google.com/archive/p/word2vec/
Fasttext (Bojanowski et al.’ 17)
http://www.fasttext.cc/
Glove (Pennington et al., 14)
http://nlp.stanford.edu/projects/glove/
Word embedding representations
▪ Count-based ▪ tf-idf, PPMI
▪ Class-based▪ Brown clusters
▪ Distributed prediction-based (type) embeddings▪ Word2Vec, Fasttext
▪ Distributed contextual (token) embeddings from language models▪ ELMo, BERT
▪ + many more variants ▪ Multilingual embeddings▪ Multisense embeddings▪ Syntactic embeddings▪ etc. etc.
Motivation
p(play | Elmo and Cookie Monster play a game .)
≠p(play | The Broadway play premiered yesterday .)
Background
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
??
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM??
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
?? ??
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
Embeddings from Language Models
ELMo= ??
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
Embeddings from Language Models
ELMo=
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
ELMo= + +
Embeddings from Language Models
The Broadway play premiered yesterday .
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
LSTM
λ1
ELMoλ2 λ0= + +( ( () ) )
Embeddings from Language Models
Evaluation: Extrinsic Tasks
Stanford Question Answering Dataset (SQuAD)
[Rajpurkar et al, ‘16, ‘18]
SNLI
[Bowman et al, ‘15]
BERT
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Cloze task objective
https://rajpurkar.github.io/SQuAD-explorer/
Multilingual Embeddings
https://github.com/mfaruqui/crosslingual-ccahttp://128.2.220.95/multilingual/
Motivation
model 1 model 2
?
▪ comparison of words trained with different models
Motivation
▪ translation induction▪ improving monolingual
embeddings through cross-lingual context
English French
?
Canonical Correlation Analysis (CCA)
Canonical Correlation Analysis (CCA)
⊆ ⊆
[Faruqui & Dyer, ‘14]
Extension: Multilingual Embeddings
[Ammar et al., ‘16]
104
English
French Spanish Arabic Swedish
French-English
Ofrench→english
Ofrench←english
Ofrench→english x
Ofrench←english
-1
Polyglot Models
[Ammar et al., ‘16, Tsvetkov et al., ‘16]
Embeddings can help study word history!
Diachronic Embeddings
107

▪ count-based embeddings w/ PPMI▪ projected to a common space
Project 300 dimensions down into 2
~30 million books, 1850-1990, Google Books data
Negative words change faster than positive words
Embeddings reflect ethnic stereotypes over time
Change in linguistic framing 1910-1990
Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)
Analogy: Embeddings capture relational meaning!
[Mikolov et al.’ 13]
and also human biases
[Bolukbasi et al., ‘16]
Conclusion
▪ Concepts or word senses▪ Have a complex many-to-many association with words (homonymy, multiple
senses)▪ Have relations with each other▪ Synonymy, Antonymy, Superordinate▪ But are hard to define formally (necessary & sufficient conditions)
▪ Embeddings = vector models of meaning▪ More fine-grained than just a string or index▪ Especially good at modeling similarity/analogy▪ Just download them and use cosines!!▪ Useful in many NLP tasks▪ But know they encode cultural stereotypes