Neural Networks in NLP: The Curse of Indifferentiability
Lili Mou
[email protected]
http://sei.pku.edu.cn/~moull12
Outline
● Preliminary
– Word embeddings
– Sequence-to-sequence generation
● Indifferentiability, solutions, and applications
● A case study in semantic parsing
Language Modeling
● One of the most fundamental problems in NLP.
● Given a corpus w = w_1 w_2 … w_t, the goal is to maximize p(w)
● Philosophical discussion: does the "probability of a corpus/sentence" make sense?
– "Recognize speech"
– "Wreck a nice beach" (near-homophones in speech, yet very different as sentences)
● All in all, NLP (especially publishing in NLP) is pragmatic.
Decomposition of the Joint Probability
● p(w) = p(w_1) p(w_2 | w_1) p(w_3 | w_1 w_2) … p(w_t | w_1 w_2 … w_{t-1})
Minor questions:
● Can we decompose any probabilistic distribution into this form? Yes.
● Is it necessary to decompose a probabilistic distribution into this form? No.
Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, Zhi Jin. "Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation." In COLING, 2016.
Markov Assumption
● p(w) = p(w_1) p(w_2 | w_1) p(w_3 | w_1 w_2) … p(w_t | w_1 w_2 … w_{t-1})
  ≈ p(w_1) p(w_2 | w_1) p(w_3 | w_2) … p(w_t | w_{t-1})   (shown here for n = 2)
● A word depends only on its previous n-1 words and is independent of its position; i.e., given the previous n-1 words, the current word is independent of all other random variables.
Multinomial Estimate
● Maximum likelihood estimation for a multinomial distribution is merely counting (see the sketch below).
● Problems
– The number of parameters grows exponentially w.r.t. n
– Even for very small n (e.g., 2 or 3), we run into severe data sparsity because of the Zipf distribution
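A minimal sketch of the counting view, on a made-up toy corpus: for a bigram model (n = 2), the MLE of p(w_t | w_{t-1}) is just the relative frequency of the bigram. The zero it returns for unseen bigrams is exactly the sparsity problem above.

```python
# MLE for a bigram LM is relative-frequency counting (the toy corpus is an assumption).
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def p(curr, prev):
    """MLE: p(curr | prev) = count(prev, curr) / count(prev, *)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(p("cat", "the"))     # 0.25: "the" is followed by cat/mat/dog/rug once each
print(p("dragon", "the"))  # 0.0: any unseen bigram gets zero probability
```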
Parameterizing LMs with Neural Networks
● Each word is mapped to a real-valued vector, called an embedding.
● Neural layers capture context information (typically the previous words).
● The probability p(w | ·) is predicted by a softmax layer.
Feed-Forward Language Model
N.B. The Markov assumption also holds.
Bengio, Yoshua, et al. "A Neural Probabilistic Language Model." JMLR. 2003.
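A minimal sketch (PyTorch; all sizes are illustrative assumptions) of a feed-forward LM in the spirit of Bengio et al.: the n-1 context words are looked up as embeddings, concatenated, passed through a hidden layer, and a softmax layer yields p(w | context).

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=64, context=3, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)        # lookup table of word vectors
        self.hidden = nn.Linear(context * emb_dim, hidden)  # combines the n-1 context words
        self.out = nn.Linear(hidden, vocab_size)            # softmax layer over the vocabulary

    def forward(self, context_ids):            # (batch, context)
        e = self.emb(context_ids).flatten(1)   # concatenate the context embeddings
        h = torch.tanh(self.hidden(e))
        return self.out(h)                     # logits; softmax gives p(w | context)

logits = FeedForwardLM()(torch.randint(0, 10000, (2, 3)))
print(logits.shape)  # torch.Size([2, 10000])
```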
Recurrent Neural Language Model
● An RNN keeps one or a few hidden states
● The hidden states change at each time step according to the input
● The RNN directly parametrizes p(w_t | w_1 … w_{t-1}), rather than the Markov approximation p(w_t | w_{t-n+1} … w_{t-1}) (see the sketch below)
Mikolov T, Karafiát M, Burget L, Cernocký J, Khudanpur S. Recurrent neural network based language model. In INTERSPEECH, 2010.
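The difference from the feed-forward model can be seen in a minimal sketch (PyTorch, illustrative sizes): the hidden state is updated at every step, so the final prediction conditions on the whole prefix rather than on a fixed window.

```python
import torch
import torch.nn as nn

vocab, emb_dim, hid = 10000, 64, 128
emb = nn.Embedding(vocab, emb_dim)
cell = nn.RNNCell(emb_dim, hid)   # h_t = tanh(W x_t + U h_{t-1} + b)
out = nn.Linear(hid, vocab)       # softmax layer: p(w_{t+1} | h_t)

h = torch.zeros(1, hid)
for w in torch.randint(0, vocab, (5,)):   # a toy 5-word prefix
    h = cell(emb(w.view(1)), h)           # the state is updated at every time step
logits = out(h)                           # conditions on the full history w_1 ... w_5
```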
Complexity Concerns
● Time complexity
– Hierarchical softmax [1]
– Negative sampling: hinge loss [2], noise-contrastive estimation [3]
● Memory complexity
– Compressing LMs [4]
● Model complexity
– Shallow neural networks are still too "deep."
– CBOW, Skip-Gram [3]
[1] Mnih A, Hinton GE. A scalable hierarchical distributed language model. NIPS, 2009.
[2] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural language processing (almost) from scratch. JMLR, 2011.
[3] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[4] Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, Zhi Jin. "Compressing neural language models by sparse word representations." In ACL, 2016.
The Role of Word Embeddings?
● Word embeddings are essentially a connection weight matrix whose input is a one-hot vector.
● Implementing it as a lookup table is much faster than a matrix multiplication (see the check below).
● Each column of the matrix corresponds to a word.
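A quick check of the lookup-table claim (numpy, toy sizes): multiplying the embedding matrix by a one-hot vector selects exactly one column, so indexing gives the same result without any multiplication.

```python
import numpy as np

vocab, dim = 5, 3
E = np.random.randn(dim, vocab)   # each column is one word's embedding

word_id = 2
one_hot = np.zeros(vocab)
one_hot[word_id] = 1.0

assert np.allclose(E @ one_hot, E[:, word_id])  # lookup == matrix multiplication
```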
How can we use word embeddings?
● Embeddings reveal the internal structure of words
– Relations represented by vector offsets: "man" - "woman" ≈ "king" - "queen" (sketched below)
– Word similarity
● Embeddings can serve as the initialization of almost every supervised task
– A form of pretraining
– N.B.: may not be useful when the training set is large enough
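A toy sketch of the vector-offset relation, with made-up 2-D vectors (real embeddings are learned and high-dimensional): the nearest word to king - man + woman under cosine similarity should be queen.

```python
import numpy as np

emb = {                       # hypothetical embeddings, for illustration only
    "man":   np.array([1.0, 0.1]),
    "woman": np.array([1.0, 0.9]),
    "king":  np.array([3.0, 0.1]),
    "queen": np.array([3.0, 0.9]),
}

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]   # expect something near "queen"
best = max((w for w in emb if w != "king"), key=lambda w: cos(emb[w], query))
print(best)  # queen
```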
Word Embeddings in our Brain
Huth, Alexander G., et al. "Natural speech reveals the semantic maps that tile human cerebral cortex." Nature 532.7600 (2016): 453–458.
“Somatotopic Embeddings” in our Brain
[8] Bear MF, Connors BW, Paradiso MA. Neuroscience: Exploring the Brain. 2007.
Deep neural networks: To be, or not to be? That is the question.
CBOW, Skip-Gram (word2vec)
[6] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
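A minimal sketch of how skip-gram forms its training signal (window size is an assumption): each center word predicts every word within its context window; CBOW reverses the direction, predicting the center word from its context.

```python
# Generate (center, context) training pairs for skip-gram on a toy sentence.
corpus = "the quick brown fox jumps".split()
window = 2

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]
```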
Hierarchical Softmax and Noise-Contrastive Estimation
● HS: factorize the softmax over a binary tree of the vocabulary, so each word's probability is a product of O(log V) binary decisions
● NCE: replace the full softmax with a binary task that discriminates observed words from sampled noise words (a variant is sketched below)
[6] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
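A minimal sketch of the negative-sampling objective popularized by word2vec (a simplified relative of NCE; sizes and the uniform noise distribution are assumptions): raise the score of the observed pair and lower the scores of k sampled noise words.

```python
import torch
import torch.nn.functional as F

vocab, dim, k = 10000, 64, 5
in_emb = torch.nn.Embedding(vocab, dim)    # center-word vectors
out_emb = torch.nn.Embedding(vocab, dim)   # context-word vectors

def neg_sampling_loss(center, context):
    neg = torch.randint(0, vocab, (k,))             # uniform noise (word2vec uses unigram^0.75)
    v = in_emb(center)                              # (dim,)
    pos = F.logsigmoid(out_emb(context) @ v)        # log sigma(u_o . v_c) for the true pair
    negs = F.logsigmoid(-(out_emb(neg) @ v)).sum()  # sum_i log sigma(-u_i . v_c) for noise
    return -(pos + negs)

loss = neg_sampling_loss(torch.tensor(3), torch.tensor(7))
```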
Tricks in Training Word Embeddings
● The number of negative samples?
– The more, the better.
● The distribution from which negative samples are drawn: should negative samples be close to the positive samples?
– The closer, the better.
● Full softmax vs. NCE vs. HS vs. hinge loss?
Outline
● Preliminary
– Word embeddings
– Sequence-to-sequence generation
● Indifferentiability, solutions, and applications
● A case study in semantic parsing
Seq2Seq Networks
Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." NIPS, 2014.
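A minimal sketch (PyTorch, illustrative sizes) of the encoder-decoder architecture: the encoder GRU compresses the source sentence into its final hidden state, which initializes the decoder GRU that generates the target.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab=10000, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, src, tgt):
        _, h = self.encoder(self.emb(src))       # h: the source "thought vector"
        dec, _ = self.decoder(self.emb(tgt), h)  # decoder initialized with h
        return self.out(dec)                     # per-step logits over the vocabulary

model = Seq2Seq()
logits = model(torch.randint(0, 10000, (2, 7)), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10000])
```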
Feeding back the output
● Mode selection
● Potential chaotic behaviors
Attention Mechanisms
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR, 2015.
Context Vector
● The context vector c_t is a weighted combination of the input sequence's states: c_t = Σ_j α_tj h_j
● The coefficient α_tj is related to
– the local context (the encoder state h_j), and
– the last output state s_{t-1}
● α_t is normalized over the input positions (a softmax); see the sketch below
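A minimal sketch of the additive scoring of Bahdanau et al. (illustrative sizes; the names W_s, W_h, v are my labels): each encoder state is scored against the previous decoder state, the scores are softmax-normalized, and the context vector is the resulting weighted sum.

```python
import torch
import torch.nn as nn

dim, src_len = 128, 7
W_s, W_h = nn.Linear(dim, dim), nn.Linear(dim, dim)
v = nn.Linear(dim, 1)

s = torch.randn(1, dim)        # previous decoder (output) state s_{t-1}
H = torch.randn(src_len, dim)  # encoder states h_1 ... h_T (the local context)

e = v(torch.tanh(W_s(s) + W_h(H))).squeeze(-1)  # e_tj = v^T tanh(W_s s + W_h h_j)
alpha = torch.softmax(e, dim=0)                 # normalized coefficients alpha_tj
c = alpha @ H                                   # context: weighted sum of encoder states
```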
Attention as Alignment
Training
● Step-by-step training: a cross-entropy (CE) loss is applied at every decoding step (sketched below)
[Figure: a CE loss attached to each output position of the decoder]
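A minimal sketch of this step-by-step objective, reusing the hypothetical Seq2Seq sketch above: ground-truth tokens are fed to the decoder (teacher forcing), and one cross-entropy term is computed per output position.

```python
import torch
import torch.nn.functional as F

model = Seq2Seq()                      # from the sketch above
src = torch.randint(0, 10000, (2, 7))
tgt = torch.randint(0, 10000, (2, 6))  # toy target sequence, an assumption

logits = model(src, tgt[:, :-1])       # decoder inputs: ground-truth y_0 ... y_{T-1}
loss = F.cross_entropy(                # one CE term per decoding step (averaged)
    logits.reshape(-1, logits.size(-1)),  # (batch * steps, vocab)
    tgt[:, 1:].reshape(-1),               # targets: y_1 ... y_T
)
loss.backward()
```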
Indifferentiability
[Figure: the same step-by-step decoder diagram, with a CE loss at each step]