Deep Learning for Natural Language Processing: Word Embeddings


Roelof Pieters (@graphific)


3 December 2015, KTH

www.csc.kth.se/~roelof/ [email protected]

Language Understanding


Can we understand Language?

1. Language is ambiguous: every sentence has many possible interpretations.

2. Language is productive: we will always encounter new words or new constructions.

3. Language is culturally specific.

Some of the challenges in Language Understanding:


For example, the same sentence can have several part-of-speech readings:

• plays well with others: VB ADV P NN / NN NN P DT

• fruit flies like a banana: NN NN VB DT NN / NN VB P DT NN / NN NN P DT NN / NN VB VB DT NN

• the students went to class: DT NN VB P NN
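To see a statistical tagger commit to one of these readings, here is a minimal sketch assuming NLTK and its tokenizer/tagger data are installed (note that NLTK uses full Penn Treebank tags such as IN and NNS rather than the simplified tags above):

import nltk  # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

for sent in ["plays well with others",
             "fruit flies like a banana",
             "the students went to class"]:
    # the tagger resolves the ambiguity statistically, returning one tag sequence
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
# e.g. [('fruit', 'NN'), ('flies', 'NNS'), ('like', 'IN'), ('a', 'DT'), ('banana', 'NN')]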


[Karlgren 2014, NLP Sthlm Meetup]


ML: Traditional Approach

1. Gather as much LABELED data as you can get

2. Throw some algorithms at it (often just an SVM, and leave it at that)

3. If you have actually tried more algorithms: pick the best

4. Spend hours hand-engineering features / feature selection / dimensionality reduction (PCA, SVD, etc.)

5. Repeat…

For each new problem/question:


Machine Learning for NLP

Classic approach: data is fed into a learning algorithm:

Data → Learning Algorithm


Machine Learning for NLP

some of the (many) treebank datasets

source: http://www-nlp.stanford.edu/links/statnlp.html#Treebanks


Penn Treebank

That’s a lot of “manual” work:


• the students went to class: DT NN VB P NN

• plays well with others: VB ADV P NN / NN NN P DT

• fruit flies like a banana: NN NN VB DT NN / NN VB P DT NN / NN NN P DT NN / NN VB VB DT NN

With a lot of issues:


Machine Learning for NLP

Data → “Features” → Learning Algorithm → Prediction/Classifier (trained on the train set, evaluated on the test set)


One Model rules them all? DL approaches have been successfully applied to:

Deep Learning: Why for NLP?

Automatic summarization, coreference resolution, discourse analysis, machine translation, morphological segmentation, named entity recognition (NER), natural language generation, natural language understanding, optical character recognition (OCR), part-of-speech tagging, parsing, question answering, relationship extraction, sentence boundary disambiguation, sentiment analysis, speech recognition, speech segmentation, topic segmentation and recognition, word segmentation, word sense disambiguation, information retrieval (IR), information extraction (IE), speech processing.

• What is the meaning of a word? (Lexical semantics)

• What is the meaning of a sentence? ([Compositional] semantics)

• What is the meaning of a longer piece of text? (Discourse semantics)

Semantics: Meaning


• NLP traditionally (in rule-based and statistical approaches, at least) treats words as atomic symbols:

• or in vector space:

• also known as “one hot” representation.

• Its problem?

Word Representation

Love Candy Store

Candy = [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …]

Candy [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 …] AND Store [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 …] = 0 !
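To make the problem concrete, a minimal numpy sketch (the vocabulary size and indices are hypothetical): every pair of distinct one-hot vectors has dot product 0, so the representation carries no similarity information at all.

import numpy as np

vocab_size = 17
candy = np.zeros(vocab_size); candy[5] = 1.0    # one-hot vector for "Candy"
store = np.zeros(vocab_size); store[15] = 1.0   # one-hot vector for "Store"
print(np.dot(candy, store))  # 0.0: "Candy" and "Store" look totally unrelated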


• Structure corresponds to meaning:

Structure and Meaning


• Semantics

• Syntax


NLP: what can we work with?

• Language models define probability distributions over (natural language) strings or sentences

• Joint and Conditional Probability

Language Model
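Concretely, a language model assigns a probability to any word sequence w_1 … w_n, which the chain rule factors into conditional probabilities:

P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

An n-gram model approximates each conditional by truncating the history, e.g. a bigram model uses P(w_i \mid w_{i-1}).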


Word senses

What is the meaning of words?

• Most words have many different senses: dog = animal or sausage?

How are the meanings of different words related?

• Specific relations between senses: animal is more general than dog.

• Semantic fields: money is related to bank.


Word senses

Polysemy:

• A lexeme is polysemous if it has different related senses

• bank = financial institution or building

Homonyms:

• Two lexemes are homonyms if their senses are unrelated, but they happen to have the same spelling and pronunciation

• bank = (financial) bank or (river) bank


Word senses: relations

Symmetric relations:

• Synonyms: couch/sofa (two lemmas with the same sense)

• Antonyms: cold/hot, rise/fall, in/out (two lemmas with opposite senses)

Hierarchical relations:

• Hypernyms and hyponyms: pet/dog (the hyponym dog is more specific than the hypernym pet)

• Holonyms and meronyms: car/wheel (the meronym wheel is a part of the holonym car)
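These relations are exactly what WordNet encodes; a quick way to explore them, assuming NLTK and its WordNet data are installed:

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

dog = wn.synset('dog.n.01')
print(dog.hypernyms())       # more general senses, e.g. canine.n.02
print(dog.hyponyms())        # more specific senses, e.g. puppy.n.01
car = wn.synset('car.n.01')
print(car.part_meronyms())   # parts of a car, e.g. car_door.n.01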


Distributional representations

“You shall know a word by the company it keeps” (J. R. Firth 1957)

One of the most successful ideas of modern statistical NLP!

[figure: a context window around “banking”; these context words come to represent banking]

• Hard (class based) clustering models

• Soft clustering models

Distributional hypothesis

He filled the wampimuk, passed it around and we all drank some

We found a little, hairy wampimuk sleeping behind the tree

(McDonald & Ramscar 2001)

Distributional semantics

Landauer and Dumais (1997), Turney and Pantel (2010), …

Distributional semantics

Distributional meaning as co-occurrence vector:
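A co-occurrence vector is easy to build directly; a minimal sketch with a toy corpus and a symmetric context window of 1 (all names here are illustrative):

from collections import Counter, defaultdict

corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
window = 1
cooc = defaultdict(Counter)
for sentence in corpus:
    words = sentence.lower().split()
    for i, w in enumerate(words):
        # count neighbours within the window, excluding the word itself
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                cooc[w][words[j]] += 1

print(cooc["like"])  # Counter({'i': 2, 'deep': 1, 'nlp': 1})
# each row of this (sparse) matrix is a distributional vector for one word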


Distributional representations

• Taking it further:

• Continuous word embeddings

• Combine vector space semantics with the prediction of probabilistic models

• Words are represented as a dense vector:

Candy = [a dense vector of real numbers]
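Unlike one-hot vectors, dense vectors make similarity measurable; a sketch with hypothetical 4-dimensional embeddings (real models use 50-300+ dimensions):

import numpy as np

candy = np.array([0.7, 0.1, 0.9, 0.3])  # hypothetical embedding for "Candy"
store = np.array([0.6, 0.2, 0.8, 0.4])  # hypothetical embedding for "Store"

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(candy, store))  # ~0.99: related words can now get similar vectors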


Word Embeddings: Socher

Vector Space Model

adapted from Bengio, “Representation Learning and Deep Learning”, July 2012, UCLA

In a perfect world:


“the country of my birth” ≈ “the place where I was born”


• Can theoretically (given enough units) approximate “any” function

• and fit to “any” kind of data

• Efficient for NLP: hidden layers can be used as word lookup tables

• Dense distributed word vectors + efficient NN training algorithms:

• Can scale to billions of words!

Why Neural Networks for NLP?


• Representation of words as continuous vectors has a long history (Hinton et al. 1986; Rumelhart et al. 1986; Elman 1990)

• First neural network language model: NNLM (Bengio et al. 2001; Bengio et al. 2003) based on earlier ideas of distributed representations for symbols (Hinton 1986)

How?


Compositionality

Principle of compositionality:

“The meaning (vector) of a complex expression (sentence) is determined by:

- the meanings of its constituent expressions (words), and

- the rules (grammar) used to combine them.”

— Gottlob Frege (1848 - 1925)


• How do we handle the compositionality of language in our models?

• Recursion: the same operator (same parameters) is applied repeatedly on different components
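A common formulation (e.g. Socher et al. 2011) makes this concrete: the vectors of two children c_1, c_2 are merged into a parent vector by one shared affine map plus nonlinearity,

p = f\left(W \begin{bmatrix} c_1 \\ c_2 \end{bmatrix} + b\right), \qquad f = \tanh

where the same W and b are reused at every node of the tree.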


• Option 1: Recurrent Neural Networks (RNN)

• Option 2: Recursive Neural Networks (also sometimes called RNN)

Recurrent Neural Networks

• achieved SOTA in 2011 on Language Modeling (WSJ ASR task) (Mikolov et al., INTERSPEECH 2011):

• and again at ASRU 2011:

“Comparison to other LMs shows that RNN LMs are state of the art by a large margin. Improvements increase with more training data.”

“[ RNN LM trained on a] single core on 400M words in a few days, with 1% absolute improvement in WER on state of the art setup”

Mikolov, T., Karafiát, M., Burget, L., Černocký, J.H., Khudanpur, S. (2011). Recurrent neural network based language model


(simple recurrent neural network for LM)

input → hidden layer(s) → output layer, with a sigmoid activation function and a softmax output function:
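Written out (following Mikolov et al.’s simple recurrent LM), at time t the hidden state s(t) mixes the current word w(t) with the previous state, and the output is a distribution over the next word:

s(t) = \sigma\big(U\,w(t) + W\,s(t-1)\big), \qquad y(t) = \mathrm{softmax}\big(V\,s(t)\big)

with \sigma the element-wise sigmoid and U, W, V the learned weight matrices.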

trained with backpropagation through time (BPTT); class-based recurrent NN (class-based outputs speed up training)

[code (Mikolov’s RNNLM Toolkit) and more info: http://rnnlm.org/ ]

• Recursive Neural Network for LM (Socher et al. 2011; Socher 2014)

• achieved SOTA on the new Stanford Sentiment Treebank dataset (compared against many other models):

Recursive Neural Network


Socher, R., Perelygin, A., Wu, J.Y., Chuang, J., Manning, C.D., Ng, A.Y., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

info & code: http://nlp.stanford.edu/sentiment/

Recursive Neural Tensor Network


code & info: http://www.socher.org/index.php/Main/ParsingNaturalScenesAndNaturalLanguageWithRecursiveNeuralNetworks

Socher, R., Liu, C.C., Ng, A.Y., Manning, C.D. (2011). Parsing Natural Scenes and Natural Language with Recursive Neural Networks


Recursive Neural Network

• RNN (Socher et al. 2011a)

• Matrix-Vector RNN (MV-RNN) (Socher et al. 2012)

• Recursive Neural Tensor Network (RNTN) (Socher et al. 2013)

• negation detection

Recurrent NN for Vector Space: Compositionality

[figure: a parse tree over “DT NN PRP$ NN” (with NP, PP/IN, NP nodes) is composed bottom-up into a sentence representation (S / ROOT), pairing the “rules” (grammar) with the “meanings” (vectors)]

Vector Space + Word Embeddings: Socher

Word Embeddings: Turian (2010)

Turian, J., Ratinov, L., Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning

code & info: http://metaoptimize.com/projects/wordreprs/


Word Embeddings: Collobert & Weston (2011)

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P. (2011). Natural Language Processing (almost) from Scratch

Multi-embeddings: Stanford (2012)

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng (2012). Improving Word Representations via Global Context and Multiple Word Prototypes

Linguistic Regularities: Mikolov (2013)

Mikolov, T., Yih, W., & Zweig, G. (2013). Linguistic Regularities in Continuous Space Word Representations

code & info: https://code.google.com/p/word2vec/
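These regularities can be queried directly with vector arithmetic; a minimal gensim sketch, assuming the pretrained GoogleNews vectors from the word2vec page have been downloaded:

from gensim.models import KeyedVectors

# load the pretrained 300-dimensional vectors (a ~3.6 GB file)
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# "queen" is expected near the top: king - man + woman ≈ queen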


Word Embeddings for MT: Mikolov (2013)

Mikolov, T., Le, Q.V., Sutskever, I. (2013). Exploiting Similarities among Languages for Machine Translation

Word Embeddings for MT: Kiros (2014)


Recursive Deep Models & Sentiment: Socher (2013)

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C., Ng, A., Potts, C. (2013). Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.

code & demo: http://nlp.stanford.edu/sentiment/index.html

Paragraph Vectors: Le & Mikolov (2014)

Le, Q., Mikolov, T. (2014). Distributed Representations of Sentences and Documents

• add context (sentence, paragraph, document) to word vectors during training (a minimal sketch follows below)
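A minimal paragraph-vector sketch with gensim’s Doc2Vec (recent gensim API; the toy corpus and tags are illustrative):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words="this movie was great".split(), tags=["doc_0"]),
        TaggedDocument(words="a dull and boring film".split(), tags=["doc_1"])]
# each document gets its own trainable vector alongside the word vectors
model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=20)
print(model.dv["doc_0"])  # the learned 50-dimensional paragraph vector for doc_0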


Results on Stanford Sentiment Treebank dataset:

Paragraph Vectors: Dai et al. (2014)

Global Vectors, GloVe: Stanford (2014)

Pennington, J., Socher, R., Manning, C.D. (2014). GloVe: Global Vectors for Word Representation

code & demo: http://nlp.stanford.edu/projects/glove/

[figure: results on the word analogy task, GloVe vs. word2vec: “similar accuracy”]
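For reference, GloVe fits word vectors directly to the corpus co-occurrence counts X_{ij}, minimizing a weighted least-squares objective:

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where f(x) = (x / x_{max})^{\alpha} for x < x_{max} and 1 otherwise, down-weighting rare co-occurrences and capping frequent ones.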


Dependency-based Embeddings: Levy & Goldberg (2014)

Levy, O., Goldberg, Y. (2014). Dependency-Based Word Embeddings

code & demo: https://levyomer.wordpress.com/2014/04/25/dependency-based-word-embeddings/

- Syntactic Dependency Context

Australian scientist discovers star with telescope

- Bag of Words (BoW) Context

[figure: precision-recall curves comparing the two context types]

“Dependency-based embeddings have more functional similarities”


Wanna Play? Recent breakthroughs

• LSTMs

• Attention

LSTM

Attention

Gregor et al (2015) DRAW: A Recurrent Neural Network For Image Generation (arxiv) (code)

Wanna Play? Recent breakthroughs:

• Question-Answering Systems (& Memory)

• Summarization

• Text Generation

• Dialogue Systems

• Image Captioning & other multimodal tasks

Wanna Play? QA & Memory

• Memory Networks (Weston et al. 2015) (Facebook)

• Dynamic Memory Network (Kumar et al. 2015) (MetaMind)

• Neural Turing Machine (Graves et al. 2014) (DeepMind)

Weston et al (2015) Memory Networks (arxiv)


Iyyer et al. (2014) A Neural Network for Factoid Question Answering over Paragraphs (paper)


Zaremba & Sutskever (2015) Learning to Execute (arxiv)


bAbI Dataset


Wanna Play? Text generation

Karpathy (2015), The Unreasonable Effectiveness of Recurrent Neural Networks (blog)


Image-Text Embeddings


Socher et al (2013) Zero Shot Learning Through Cross-Modal Transfer (info)

Image-Captioning

• Andrej Karpathy, Li Fei-Fei, 2015. Deep Visual-Semantic Alignments for Generating Image Descriptions (pdf) (info) (code)

• Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, 2015. Show and Tell: A Neural Image Caption Generator (arxiv)

• Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (arxiv) (info) (code)

“A person riding a motorcycle on a dirt road.” ???


“Two hockey players are fighting over the puck.” ???

Image-Captioning

“A stop sign is flying in blue skies.”

“A herd of elephants flying in the blue skies.”

Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, Ruslan Salakhutdinov, 2015. Generating Images from Captions with Attention (arxiv) (examples)


• TensorFlow: Recently released library by Google. http://tensorflow.org

• Theano - CPU/GPU symbolic expression compiler in python (from LISA lab at University of Montreal); a minimal sketch follows this list. http://deeplearning.net/software/theano/

• Caffe - Computer Vision oriented Deep Learning framework: caffe.berkeleyvision.org

• Torch - Matlab-like environment for state-of-the-art machine learning algorithms in lua (from Ronan Collobert, Clement Farabet and Koray Kavukcuoglu) http://torch.ch/

• more info: http://deeplearning.net/software_links/
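As a taste of the symbolic-expression style these frameworks share, a tiny Theano sketch (assumes Theano is installed): declare a symbolic graph, compile it, then call it like a function.

import theano
import theano.tensor as T

x = T.dvector('x')             # symbolic input vector
y = T.dot(x, x)                # symbolic expression: squared L2 norm
f = theano.function([x], y)    # compile the graph into a callable
print(f([1.0, 2.0, 3.0]))      # 14.0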

Wanna Play? General Deep Learning


• RNNLM (Mikolov) http://rnnlm.org

• NB-SVM https://github.com/mesnilgr/nbsvm

• Word2Vec (skip-gram/CBOW): https://code.google.com/p/word2vec/ (original), http://radimrehurek.com/gensim/models/word2vec.html (python); a minimal training sketch follows this list

• GloVe: http://nlp.stanford.edu/projects/glove/ (original), https://github.com/maciejkula/glove-python (python)

• Socher et al / Stanford RNN Sentiment code: http://nlp.stanford.edu/sentiment/code.html

• Deep Learning without Magic Tutorial: http://nlp.stanford.edu/courses/NAACL2013/
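A minimal word2vec training sketch with gensim (recent gensim API; the toy corpus is illustrative; sg=1 selects skip-gram, sg=0 CBOW):

from gensim.models import Word2Vec

sentences = [["deep", "learning", "for", "nlp"],
             ["word", "embeddings", "for", "nlp"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv["nlp"])                      # the 100-dimensional vector for "nlp"
print(model.wv.similarity("deep", "word"))  # cosine similarity between two words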

Wanna Play? NLP


Questions?

[email protected]/~roelof/


Code & Papers:

Collaborative Open Computer Science.com

@graphific