
Natural language processing: neural network language model
IFT 725 - Réseaux neuronaux

LANGUAGE MODELING
Topics: language modeling
• A language model is a probabilistic model that assigns probabilities to any sequence of words
p(w1, ... , wT)
‣ language modeling is the task of learning a language model that assigns high probabilities to well-formed sentences
‣ it plays a crucial role in speech recognition and machine translation systems
• Example (machine translation): which English rendering of ‘‘ une personne intelligente ’’ should receive the higher probability?
‘‘ a person smart ’’
‘‘ a smart person ’’

LANGUAGE MODELING
Topics: language modeling
• An assumption frequently made is the nth order Markov assumption
p(w1, ... , wT) = ∏_{t=1}^{T} p(wt | wt−(n−1), ... , wt−1)
‣ the tth word is generated based only on the n−1 previous words
‣ we will refer to wt−(n−1), ... , wt−1 as the context
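As a concrete reading of the factorization above, here is a minimal Python sketch that scores a sequence under an nth order Markov model. The function cond_prob is a stand-in for whatever conditional model is used; the uniform model in the usage line is purely illustrative.

```python
import math

def sequence_log_prob(words, cond_prob, n):
    """Log-probability of a word sequence under an nth order Markov assumption.

    cond_prob(word, context) is assumed to return p(word | context), where
    context is the tuple of (at most) n-1 previous words.
    """
    log_p = 0.0
    for t, word in enumerate(words):
        context = tuple(words[max(0, t - (n - 1)):t])
        log_p += math.log(cond_prob(word, context))
    return log_p

# toy usage with a made-up uniform model over a 10-word vocabulary
uniform = lambda word, context: 0.1
print(sequence_log_prob(["a", "smart", "person"], uniform, n=3))  # 3 * log(0.1) ≈ -6.91
```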

LANGUAGE MODELING
Topics: n-gram model
• An n-gram is a sequence of n words
‣ unigrams (n=1): ‘‘is’’, ‘‘a’’, ‘‘sequence’’, etc.
‣ bigrams (n=2): [‘‘is’’, ‘‘a’’], [‘‘a’’, ‘‘sequence’’], etc.
‣ trigrams (n=3): [‘‘is’’, ‘‘a’’, ‘‘sequence’’], [‘‘a’’, ‘‘sequence’’, ‘‘of’’], etc.
• n-gram models estimate the conditional from n-gram counts
p(wt | wt−(n−1), ... , wt−1) = count(wt−(n−1), ... , wt−1, wt) / count(wt−(n−1), ... , wt−1, ·)
‣ the counts are obtained from a training corpus (a data set of text)
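Here is a minimal sketch of this count-based (maximum-likelihood) estimate; the tiny corpus in the usage lines is illustrative.

```python
from collections import Counter

def ngram_counts(corpus_tokens, n):
    """Count n-grams and their (n-1)-gram prefixes in a token list."""
    grams, prefixes = Counter(), Counter()
    for i in range(len(corpus_tokens) - n + 1):
        gram = tuple(corpus_tokens[i:i + n])
        grams[gram] += 1
        prefixes[gram[:-1]] += 1
    return grams, prefixes

def conditional_prob(word, context, grams, prefixes):
    """Maximum-likelihood estimate of p(word | context) from raw counts."""
    if prefixes[context] == 0:
        return 0.0  # unseen context: the sparsity issue discussed next
    return grams[context + (word,)] / prefixes[context]

# toy usage
tokens = "the cat is eating and the dog is eating".split()
grams, prefixes = ngram_counts(tokens, n=3)
print(conditional_prob("eating", ("dog", "is"), grams, prefixes))  # 1.0
```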

LANGUAGE MODELING
Topics: n-gram model
• Issue: data sparsity
‣ we want n to be large, for the model to be realistic
‣ however, for large values of n, it is likely that a given n-gram will not have been observed in the training corpus
‣ smoothing the counts can help
- combine count(w1, w2, w3, w4), count(w2, w3, w4), count(w3, w4) and count(w4) to estimate p(w4 | w1, w2, w3), as sketched below
‣ this only partly solves the problem
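One classic way to realize this combination of counts of different orders is linear interpolation of the maximum-likelihood estimates of each order. A minimal sketch, where the mixture weights lambdas are illustrative placeholders rather than tuned values:

```python
from collections import Counter

def interpolated_prob(word, context, counts, lambdas):
    """Interpolated estimate of p(word | context).

    counts maps word tuples (of length 1 up to n) to their corpus counts;
    lambdas[k] weights the estimate that conditions on the last k context
    words (lambdas[0] weights the unigram estimate); the weights sum to 1.
    """
    p = 0.0
    for k, lam in enumerate(lambdas):
        ctx = tuple(context[len(context) - k:]) if k > 0 else ()
        denom = sum(c for gram, c in counts.items()
                    if len(gram) == k + 1 and gram[:-1] == ctx)
        if denom > 0:
            p += lam * counts[ctx + (word,)] / denom
    return p

# toy usage: counts over all 1- to 4-grams of a tiny corpus
tokens = "the cat is eating the food".split()
counts = Counter(tuple(tokens[i:i + n])
                 for n in range(1, 5)
                 for i in range(len(tokens) - n + 1))
print(interpolated_prob("eating", ("the", "cat", "is"), counts, [0.1, 0.2, 0.3, 0.4]))
```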

NEURAL NETWORK LANGUAGE MODEL
Topics: neural network language model
• Solution: model the conditional p(wt | wt−(n−1), ... , wt−1) with a neural network
‣ learn word representations to allow transfer to n-grams not observed in the training corpus

[Figure 1 from Bengio et al., 2003: the indices of the context words wt−n+1, ... , wt−2, wt−1 are mapped through a shared look-up table (matrix C) to the word feature vectors C(wt−n+1), ... , C(wt−1); these are concatenated and fed to a tanh hidden layer and then a softmax output layer, where most of the computation takes place; the i-th output is P(wt = i | context).]

Figure 1: Neural architecture: f(i, wt−1, ..., wt−n+1) = g(i, C(wt−1), ..., C(wt−n+1)), where g is the neural network and C(i) is the i-th word feature vector.

Excerpt from Bengio et al. (2003):

The parameters of the mapping C are simply the feature vectors themselves, represented by a |V| × m matrix C whose row i is the feature vector C(i) for word i. The function g may be implemented by a feed-forward or recurrent neural network or another parametrized function, with parameters ω. The overall parameter set is θ = (C, ω).

Training is achieved by looking for θ that maximizes the training corpus penalized log-likelihood:

L = (1/T) ∑_t log f(wt, wt−1, ..., wt−n+1; θ) + R(θ),

where R(θ) is a regularization term. For example, in our experiments, R is a weight decay penalty applied only to the weights of the neural network and to the C matrix, not to the biases.³

In the above model, the number of free parameters only scales linearly with V, the number of words in the vocabulary. It also only scales linearly with the order n: the scaling factor could be reduced to sub-linear if more sharing structure were introduced, e.g. using a time-delay neural network or a recurrent neural network (or a combination of both).

In most experiments below, the neural network has one hidden layer beyond the word features mapping, and optionally, direct connections from the word features to the output. Therefore there are really two hidden layers: the shared word features layer C, which has no non-linearity (it would not add anything useful), and the ordinary hyperbolic tangent hidden layer. More precisely, the neural network computes the following function, with a softmax output layer, which guarantees positive probabilities summing to 1:

P̂(wt | wt−1, ..., wt−n+1) = exp(ywt) / ∑_i exp(yi).

3. The biases are the additive parameters of the neural network, such as b and d in equation 1 below.


Bengio, Ducharme, Vincent and Jauvin, 2003
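To make the architecture concrete, here is a minimal NumPy sketch of the forward pass of such a model, with a single tanh hidden layer and a softmax output (the optional direct connections from the word features to the output are omitted). The sizes and parameter names V, m, H, C, W, b, U, d are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, H, n = 1000, 50, 100, 4              # vocab size, embedding size, hidden size, order

C = rng.normal(0, 0.01, (V, m))            # word feature look-up table, shared across positions
W = rng.normal(0, 0.01, (H, (n - 1) * m))  # weights on the concatenated context embeddings
b = np.zeros(H)                            # hidden biases
U = rng.normal(0, 0.01, (V, H))            # hidden-to-output weights
d = np.zeros(V)                            # output biases

def predict(context_ids):
    """Return P(wt = i | context) for every word i, given the n-1 context word indices."""
    x = np.concatenate([C[w] for w in context_ids])  # look up and concatenate C(w_{t-i})
    h = np.tanh(b + W @ x)                           # tanh hidden layer
    y = d + U @ h                                    # output pre-activations
    y -= y.max()                                     # for numerical stability
    p = np.exp(y)
    return p / p.sum()                               # softmax over the vocabulary

probs = predict([12, 7, 104])    # hypothetical indices of w_{t-3}, w_{t-2}, w_{t-1}
print(probs.shape, probs.sum())  # (1000,) 1.0
```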


NEURAL NETWORK LANGUAGE MODEL
Topics: neural network language model
• Can potentially generalize to contexts not seen in the training set
‣ example: p(‘‘ eating ’’ | ‘‘ the ’’, ‘‘ cat ’’, ‘‘ is ’’)
- imagine the 4-gram [‘‘ the ’’, ‘‘ cat ’’, ‘‘ is ’’, ‘‘ eating ’’] is not in the training corpus, but [‘‘ the ’’, ‘‘ dog ’’, ‘‘ is ’’, ‘‘ eating ’’] is
- if the word representations of ‘‘ cat ’’ and ‘‘ dog ’’ are similar, then the neural network will be able to generalize to the case of ‘‘ cat ’’
- the neural network could learn similar representations for those two words from other 4-grams: [‘‘ the ’’, ‘‘ cat ’’, ‘‘ was ’’, ‘‘ sleeping ’’], [‘‘ the ’’, ‘‘ dog ’’, ‘‘ was ’’, ‘‘ sleeping ’’]

NEURAL NETWORK LANGUAGE MODEL
Topics: word representation gradients
• We know how to propagate gradients in such a network
‣ in particular, we know how to compute the gradient of the loss l with respect to the linear activation of the hidden layer, ∇a(x) l
‣ let's note the submatrix connecting wt−i and the hidden layer as Wi
• The gradient with respect to C(w), for any word w, is


∇C(w) l = ∑_{i=1}^{n−1} 1(wt−i = w) Wi⊤ ∇a(x) l

‣ the corresponding update, with learning rate α, is C(w) ⇐ C(w) − α ∇C(w) l
‣ W1, W2, ..., Wn−1 are the submatrices associated with each of the n−1 context positions

NEURAL NETWORK LANGUAGE MODEL
Topics: word representation gradients
• Example: [‘‘ the ’’, ‘‘ dog ’’, ‘‘ and ’’, ‘‘ the ’’, ‘‘ cat ’’]
‣ suppose the word indices are w3 = 21 (‘‘ the ’’), w4 = 3 (‘‘ dog ’’), w5 = 14 (‘‘ and ’’), w6 = 21 (‘‘ the ’’), with w7 = ‘‘ cat ’’ the word to predict
‣ the loss is l = − log p(‘‘ cat ’’ | ‘‘ the ’’, ‘‘ dog ’’, ‘‘ and ’’, ‘‘ the ’’)
• Only need to update the representations C(3), C(14) and C(21); the gradients are

∇C(3) l = W3⊤ ∇a(x) l
∇C(14) l = W2⊤ ∇a(x) l
∇C(21) l = W1⊤ ∇a(x) l + W4⊤ ∇a(x) l
∇C(w) l = 0 for all other words w
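A minimal NumPy sketch of this sparse update, reusing the slide's example indices. Here grad_a stands for the already-computed gradient ∇a(x) l, and all sizes are toy values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, H, n = 30, 5, 8, 5                   # toy vocab size, embedding size, hidden size, order
C = rng.normal(0, 0.1, (V, m))             # word representations
W = [rng.normal(0, 0.1, (H, m)) for _ in range(n - 1)]  # W[i-1] connects w_{t-i} to the hidden layer

context = [21, 14, 3, 21]                  # w_{t-1}="the"(21), w_{t-2}="and"(14), w_{t-3}="dog"(3), w_{t-4}="the"(21)
grad_a = rng.normal(0, 0.1, H)             # assumed given by backpropagation through the layers above
alpha = 0.01                               # learning rate

grad_C = {}                                # non-zero gradients: only for words appearing in the context
for i, w in enumerate(context, start=1):   # i is the position offset in w_{t-i}
    grad_C[w] = grad_C.get(w, np.zeros(m)) + W[i - 1].T @ grad_a

for w, g in grad_C.items():                # sparse update: C(w) <= C(w) - alpha * gradient
    C[w] -= alpha * g

print(sorted(grad_C))                      # [3, 14, 21]: only these rows of C are touched
```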


NEURAL NETWORK LANGUAGE MODEL
Topics: performance evaluation
• In language modeling, a common evaluation metric is the perplexity
‣ it is simply the exponential of the average negative log-likelihood
• Evaluation on the Brown corpus
‣ n-gram model (Kneser-Ney smoothing): 321
‣ neural network language model: 276
‣ neural network + n-gram: 252

Bengio, Ducharme, Vincent and Jauvin, 2003
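Perplexity can be computed directly from the per-word log-probabilities a model assigns to held-out text. A minimal sketch; the probabilities in the usage line are made up.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-likelihood (natural log here)."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# toy usage: per-word log-probabilities assigned by some language model
log_probs = [math.log(p) for p in [0.2, 0.05, 0.1, 0.01]]
print(perplexity(log_probs))  # ≈ 17.8, lower is better
```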

NEURAL NETWORK LANGUAGE MODEL
Topics: performance evaluation
• A more interesting (and less straightforward) way of evaluating a language model is within a particular application
‣ does the language model improve the performance of a machine translation or speech recognition system?
• Later work has shown improvements in both cases
‣ Connectionist language modeling for large vocabulary continuous speech recognition, Schwenk and Gauvain, 2002
‣ Continuous-Space Language Models for Statistical Machine Translation, Schwenk, 2010

NEURAL NETWORK LANGUAGE MODEL
Topics: hierarchical output layer
• Issue: the output layer is huge
‣ we are dealing with vocabularies of a size D in the hundreds of thousands
‣ computing all output layer units is very computationally expensive
• Solution: use a hierarchical (tree) output layer
‣ define a tree where each leaf is a word
‣ the neural network assigns probabilities of branching from a parent to any of its children
‣ the probability of a word is thus the product of the branching probabilities on the path from the root to the word's leaf
• If the tree is binary and balanced, computing a word's probability is in O(log2 D)

NEURAL NETWORK LANGUAGE MODEL
Topics: hierarchical output layer
• Example: [‘‘ the ’’, ‘‘ dog ’’, ‘‘ and ’’, ‘‘ the ’’, ‘‘ cat ’’]
‣ [Figure: a balanced binary tree over the vocabulary, with internal nodes numbered 1 (the root), 2 and 3 (its children) and 4, 5, 6, 7 (their children), and leaves ‘‘ dog ’’, ‘‘ the ’’, ‘‘ and ’’, ‘‘ cat ’’, ‘‘ he ’’, ‘‘ have ’’, ‘‘ be ’’, ‘‘ OOV ’’; the branching probabilities are computed from the hidden layer using parameters V]
‣ the probability of ‘‘ cat ’’ is the product of the branching probabilities on the path from the root to its leaf:
p(‘‘ cat ’’ | context) = p(branch left at 1 | context) x p(branch right at 2 | context) x p(branch right at 5 | context)
‣ branching left is the complement of branching right:
p(‘‘ cat ’’ | context) = (1 − p(branch right at 1 | context)) x p(branch right at 2 | context) x p(branch right at 5 | context)
‣ each branch-right probability is a sigmoid of a linear function of the hidden layer h(x):
p(‘‘ cat ’’ | context) = (1 − sigm(b1 + V1,· h(x))) x sigm(b2 + V2,· h(x)) x sigm(b5 + V5,· h(x))
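A minimal sketch of this computation over the slide's toy vocabulary. It assumes heap-style node numbering (the children of node k are 2k and 2k+1), which matches the tree above, and the parameters b and V are random placeholders rather than trained values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

words = ["dog", "the", "and", "cat", "he", "have", "be", "OOV"]
D = len(words)                                # vocabulary size (8 here)
H = 10                                        # hidden layer size
rng = np.random.default_rng(0)
b = rng.normal(0, 0.1, 2 * D)                 # one bias per internal node (indices 1..D-1 used)
V = rng.normal(0, 0.1, (2 * D, H))            # one weight row per internal node

def word_prob(word, h):
    """p(word | context): product of branching probabilities from the root to the word's leaf."""
    node = D + words.index(word)              # leaves occupy heap indices D..2D-1
    p = 1.0
    while node > 1:
        parent = node // 2
        p_right = sigmoid(b[parent] + V[parent] @ h)
        p *= p_right if node % 2 == 1 else (1.0 - p_right)  # odd index = right child
        node = parent
    return p

h = rng.normal(0, 0.1, H)                     # hidden layer computed from some context
print(word_prob("cat", h))
print(sum(word_prob(w, h) for w in words))    # ≈ 1.0: probabilities sum to one over the vocabulary
```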

NEURAL NETWORK LANGUAGE MODEL
Topics: hierarchical output layer
• How to define the word hierarchy?
‣ can use a randomly generated tree
- this is likely to be suboptimal
‣ can use existing linguistic resources, such as WordNet
- Hierarchical Probabilistic Neural Network Language Model, Morin and Bengio, 2005
- they report a speedup of 258x, with a slight decrease in performance
‣ can learn the hierarchy using a recursive partitioning strategy
- A Scalable Hierarchical Distributed Language Model, Mnih and Hinton, 2008
- similar speedup factors are reported, without a decrease in performance


CONCLUSION
• We discussed the task of language modeling
• We saw how to tackle this problem with a neural network that learns word representations
‣ word representations can help the neural network generalize to new contexts
• We discussed ways of speeding up computation using a hierarchical output layer