A Neural Probabilistic Language Model 2014-12-16 Keren Ye.
CONTENTS
• N-gram Models
• Fighting the Curse of Dimensionality
• A Neural Probabilistic Language Model
• Continuous Bag of Words (Word2vec)
n-gram models
• Construct tables of conditional probabilities for the next word
• Combinations of the last n-1 words
$$\hat{P}(w_t \mid w_1^{t-1}) \approx \hat{P}(w_t \mid w_{t-n+1}^{t-1})$$
n-gram models
• i.e. "I like playing basketball"
– Unigram (1-gram): $\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) \approx \hat{P}(\text{basketball})$
– Bigram (2-gram): $\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) \approx \hat{P}(\text{basketball} \mid \text{playing})$
– Trigram (3-gram): $\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) \approx \hat{P}(\text{basketball} \mid \text{like}, \text{playing})$
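Counting-based estimates like these can be sketched in a few lines; the toy corpus and the MLE helper below are illustrative, not from the slides:

```python
from collections import Counter

# Toy corpus (illustrative). Maximum-likelihood bigram estimate:
# P(w | prev) = count(prev, w) / count(prev)
corpus = "I like playing basketball . I like playing chess .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """MLE estimate of P(w | prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("basketball", "playing"))  # 0.5: "playing" precedes basketball once, chess once
```

Unigram and trigram estimates follow the same pattern, with shorter or longer count keys.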
n-gram models
• Disadvantages
– It does not take into account context farther back than 1 or 2 words
– It does not take into account the similarity between words
• i.e. "The cat is walking in the bedroom" (seen in the training corpus)
• "A dog was running in a room" (unseen, so it gets a poor estimate despite its similar structure)
n-gram models
• Disadvantages
– Curse of Dimensionality
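A quick back-of-the-envelope calculation shows the blow-up; the vocabulary size here is illustrative:

```python
# An n-gram table must, in principle, cover every possible (n-1)-word history.
V = 100_000                      # illustrative vocabulary size
for n in (1, 2, 3):
    contexts = V ** (n - 1)      # number of distinct (n-1)-word histories
    print(f"n={n}: {contexts:,} possible contexts")
```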
Fighting the Curse of Dimensionality
• Associate with each word in the vocabulary a distributed word feature vector (a real-valued vector in $\mathbb{R}^m$)
• Express the joint probability function of word sequences in terms of the feature vectors of these words in the sequence
• Learn simultaneously the word feature vectors and the parameters of that probability function
Fighting the Curse of Dimensionality
• Word feature vectors
– Each word is associated with a point in a vector space
– The number of features (e.g. m = 30, 60 or 100 in the experiments) is much smaller than the size of the vocabulary (e.g. 200,000 words)
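The idea can be sketched with made-up 4-dimensional vectors standing in for learned features (the experiments use m = 30–100); the point is only that similar words should end up close in this space:

```python
import numpy as np

# Made-up feature vectors, purely illustrative.
C = {
    "cat":     np.array([0.9, 0.1, 0.3, 0.0]),
    "dog":     np.array([0.8, 0.2, 0.3, 0.1]),
    "bedroom": np.array([0.0, 0.9, 0.1, 0.7]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(C["cat"], C["dog"]))      # high: similar roles
print(cosine(C["cat"], C["bedroom"]))  # low: dissimilar roles
```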
Fighting the Curse of Dimensionality
• Probability function
– In the experiments, a multi-layer neural network is used to predict the next word given the previous ones
– This function has parameters that can be tuned iteratively to maximize the log-likelihood of the training data
Fighting the Curse of Dimensionality
• Why does it work?
– If we knew that "dog" and "cat" play similar roles (semantically and syntactically), and similarly for (the, a), (bedroom, room), (is, was), (running, walking), we could naturally generalize from
• The cat is walking in the bedroom
– to
• A dog was running in a room
– and likewise to
• The cat is running in a room
• A dog is walking in a bedroom
• …
Fighting the Curse of Dimensionality
• NNLM
– Neural Network Language Model
A Neural Probabilistic Language Model
• Notation
– The training set is a sequence $w_1, \dots, w_T$ of words with $w_t \in V$, where the vocabulary $V$ is a large but finite set
– The objective is to learn a good model $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$, in the sense that it gives high out-of-sample likelihood
– The only constraint on the model is that for any choice of $w_1^{t-1}$, the outputs sum to one: $\sum_{i=1}^{|V|} f(i, w_{t-1}, \dots, w_{t-n+1}) = 1$ (with $f \ge 0$)
A Neural Probabilistic Language Model
• Objective function
– Training is achieved by looking for $\theta$ that maximizes the penalized log-likelihood of the training corpus, where $R(\theta)$ is a regularization term:
$$L = \frac{1}{T} \sum_t \log f(w_t, w_{t-1}, \dots, w_{t-n+1}; \theta) + R(\theta)$$
A Neural Probabilistic Language Model
• Model
– We decompose the function $f(w_t, \dots, w_{t-n+1}) = \hat{P}(w_t \mid w_1^{t-1})$ in two parts:
• A mapping $C$ from any element $i$ of $V$ to a real vector $C(i) \in \mathbb{R}^m$. It represents the distributed feature vectors associated with each word in the vocabulary
• The probability function over words, expressed with $C$: a function $g$ maps an input sequence of feature vectors for words in context, $(C(w_{t-n+1}), \dots, C(w_{t-1}))$, to a conditional probability distribution over words in $V$ for the next word. The output of $g$ is a vector whose $i$-th element estimates the probability $\hat{P}(w_t = i \mid w_1^{t-1})$:
$$f(i, w_{t-1}, \dots, w_{t-n+1}) = g\big(i, C(w_{t-1}), \dots, C(w_{t-n+1})\big)$$
A Neural Probabilistic Language Model
• Model details (two hidden layers)
– The shared word features layer C, which has no non-linearity (it would not add anything useful)
– The ordinary hyperbolic tangent hidden layer
A Neural Probabilistic Language Model
• Model details (formal description)
– The neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1:
$$\hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$$
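The softmax can be implemented directly; this is a generic numerically stable version, not code from the paper:

```python
import numpy as np

def softmax(y):
    """exp(y_i) / sum_j exp(y_j), shifted by max(y) for numerical stability."""
    z = np.exp(y - y.max())
    return z / z.sum()

y = np.array([2.0, 1.0, 0.1])  # unnormalized log-probabilities (scores)
p = softmax(y)
print(p, p.sum())              # positive entries summing to 1
```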
A Neural Probabilistic Language Model
• Model details (formal description)
– The $y_i$ are the unnormalized log-probabilities for each output word $i$, computed as follows, with parameters $b$, $W$, $U$, $d$ and $H$:
$$y = b + Wx + U \tanh(d + Hx)$$
• where the hyperbolic tangent is applied element by element, and $W$ is optionally zero (no direct connections)
• and $x$ is the word features layer activation vector: the concatenation of the input word feature vectors from the matrix $C$:
$$x = \big(C(w_{t-1}), C(w_{t-2}), \dots, C(w_{t-n+1})\big)$$
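Putting the pieces together, a minimal NumPy sketch of the forward pass with W = 0 (toy sizes; all weights are random, purely illustrative):

```python
import numpy as np

V, m, h, n = 10, 4, 5, 3                 # vocab size, feature dim, hidden units, order
rng = np.random.default_rng(0)

C = rng.normal(size=(V, m))              # word feature matrix
H = rng.normal(size=(h, (n - 1) * m))    # word features -> hidden
d = np.zeros(h)                          # hidden biases
U = rng.normal(size=(V, h))              # hidden -> output
b = np.zeros(V)                          # output biases
# W = 0: no direct word-features-to-output connections

def forward(context_ids):
    x = np.concatenate([C[i] for i in context_ids])  # x = (C(w_{t-1}), ..., C(w_{t-n+1}))
    y = b + U @ np.tanh(d + H @ x)                   # y = b + U tanh(d + Hx)
    z = np.exp(y - y.max())
    return z / z.sum()                               # softmax over the vocabulary

p = forward([3, 7])                                  # n-1 = 2 context word ids
print(p.shape, abs(p.sum() - 1.0) < 1e-9)            # (10,) True
```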
A Neural Probabilistic Language Model

| Parameter | Description | Dimensions |
| --- | --- | --- |
| b | output biases | \|V\| |
| d | hidden layer biases | h |
| W | direct word-features-to-output weights (no direct connections here) | 0 |
| U | hidden-to-output weights | \|V\| × h |
| H | word-features-to-hidden weights | h × (n−1)m |
| C | word features | \|V\| × m |

$$y = b + Wx + U \tanh(d + Hx)$$
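For a sense of scale, the parameter counts implied by the table; the sizes below (|V| = 17,964, m = 60, h = 50, n = 5) are loosely based on one of the paper's configurations and should be treated as illustrative:

```python
V, m, h, n = 17_964, 60, 50, 5   # illustrative sizes

params = {
    "b": V,                # output biases
    "d": h,                # hidden biases
    "W": 0,                # no direct connections
    "U": V * h,            # hidden-to-output weights
    "H": h * (n - 1) * m,  # word-features-to-hidden weights
    "C": V * m,            # word feature matrix
}
total = sum(params.values())
print(f"{total:,}")        # ~2 million parameters, dominated by U and C
```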
A Neural Probabilistic Language Model
• Stochastic gradient ascent
– After each example, update $\theta \leftarrow \theta + \varepsilon \, \partial \log \hat{P}(w_t \mid w_{t-1}, \dots, w_{t-n+1}) / \partial \theta$, where $\theta = (b, d, W, U, H, C)$ and $\varepsilon$ is the learning rate
– Note that a large fraction of the parameters need not be updated or visited after each example: the word features $C(j)$ of all words $j$ that do not occur in the input window
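A toy sketch of that sparsity: only the feature rows of words in the current input window are touched (the gradient values below are placeholders, not real derivatives):

```python
import numpy as np

V, m, eps = 10, 4, 0.1
C = np.zeros((V, m))                       # word feature matrix
window = [3, 7]                            # word ids occurring in this example's window

for j in window:
    grad_Cj = np.ones(m)                   # placeholder for d log P / d C(j)
    C[j] += eps * grad_Cj                  # gradient ascent step on the log-likelihood

touched = {j for j in range(V) if C[j].any()}
print(touched)                             # {3, 7}: every other row was never visited
```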
A Neural Probabilistic Language Model
• Parallel implementation
– Data-parallel processing
• Relying on synchronization commands proved slow
• With no locks, the resulting noise seems to be very small and did not apparently slow down training
– Parameter-parallel processing
• Parallelize across the parameters
Continuous Bag of Words (Word2vec)
• Bag of words
– A traditional way to fight the curse of dimensionality: treat the context words as conditionally independent given the target word
$$\hat{P}(\text{basketball} \mid \text{I}, \text{like}, \text{playing}) = \frac{\hat{P}(\text{basketball}) \, \hat{P}(\text{I} \mid \text{basketball}) \, \hat{P}(\text{like} \mid \text{basketball}) \, \hat{P}(\text{playing} \mid \text{basketball})}{\hat{P}(\text{I}, \text{like}, \text{playing})}$$
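Plugging numbers into the factorization above (every probability below is invented for illustration):

```python
# All probabilities are made up for illustration.
p_target = 0.01                                            # P(basketball)
p_word_given = {"I": 0.20, "like": 0.15, "playing": 0.30}  # P(w | basketball)
p_context = 0.0005                                         # P(I, like, playing)

p = p_target
for w in ("I", "like", "playing"):
    p *= p_word_given[w]                                   # independence assumption
p /= p_context
print(p)  # ≈ 0.18
```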
Continuous Bag of Words (Word2vec)
• Continuous Bag of Words
Continuous Bag of Words (Word2vec)
• Differences from the NNLM
– Projection layer
• Sum vs. concatenation of the context word vectors
• Word order is discarded
– Hidden layer
• tanh vs. none (CBOW has no non-linear hidden layer)
– Hierarchical softmax output
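A small sketch of the projection-layer difference (random vectors, toy sizes): CBOW sums the context vectors, discarding order, while the NNLM concatenates them, keeping order:

```python
import numpy as np

V, m = 10, 4
rng = np.random.default_rng(1)
C = rng.normal(size=(V, m))                         # shared word feature matrix

context = [2, 5, 5, 8]                              # surrounding word ids
h_cbow = C[context].sum(axis=0)                     # CBOW projection: an m-vector
x_nnlm = np.concatenate([C[i] for i in context])    # NNLM input: an (n-1)*m vector

print(h_cbow.shape, x_nnlm.shape)                   # (4,) (16,)
# Any permutation of the context gives the same CBOW projection:
print(np.allclose(C[[8, 5, 2, 5]].sum(axis=0), h_cbow))  # True
```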
Thanks
Q&A