
Speech Recognition Lecture 7: Maximum Entropy Models

Mehryar Mohri, Courant Institute and Google Research

[email protected]


This Lecture

• Information theory basics
• Maximum entropy models
• Duality theorem


Convexity

Definition: $X \subseteq \mathbb{R}^N$ is said to be convex if for any two points $x, y \in X$ the segment $[x, y]$ lies in $X$:
$$\{\alpha x + (1 - \alpha) y,\ 0 \le \alpha \le 1\} \subseteq X.$$

Definition: let $X$ be a convex set. A function $f : X \to \mathbb{R}$ is said to be convex if for all $x, y \in X$ and $\alpha \in [0, 1]$,
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$

$f$ is said to be concave when $-f$ is convex. With a strict inequality for $\alpha \in (0, 1)$, $f$ is said to be strictly convex.
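As a quick illustration of the definition (not part of the original slides), the following minimal sketch checks the convexity inequality numerically on random pairs of points; the function $f(t) = t^2$ and the sampling scheme are arbitrary choices made here for illustration.

```python
import numpy as np

# Numerical check of f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y)
# for the (arbitrary) convex function f(t) = t**2.
rng = np.random.default_rng(0)
f = lambda t: t ** 2

for _ in range(1000):
    x, y = rng.normal(size=2)
    a = rng.uniform()
    assert f(a * x + (1 - a) * y) <= a * f(x) + (1 - a) * f(y) + 1e-12
print("convexity inequality held on all sampled pairs")
```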


Properties of Convex Functions

Theorem: let $f$ be a differentiable function. Then, $f$ is convex iff $\mathrm{dom}(f)$ is convex and
$$\forall x, y \in \mathrm{dom}(f), \quad f(y) \ge f(x) + \nabla f(x) \cdot (y - x),$$
i.e., the graph of $f$ lies above its tangent at any point $(x, f(x))$.

Theorem: let $f$ be a twice differentiable function. Then, $f$ is convex iff $\mathrm{dom}(f)$ is convex and its Hessian is positive semi-definite:
$$\forall x \in \mathrm{dom}(f), \quad \nabla^2 f(x) \succeq 0.$$


Jensen's Inequality

Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)].$$

Proof:

• For a distribution over a finite set, the property follows directly from the definition of convexity.

• The general case is a consequence of the continuity of convex functions and the density of finite distributions.
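A small numerical illustration (not from the slides): for a convex function such as $f(t) = e^t$ and a finite distribution, $f(\mathbb{E}[X])$ stays below $\mathbb{E}[f(X)]$. The support and probabilities below are arbitrary.

```python
import numpy as np

# Jensen's inequality f(E[X]) <= E[f(X)] for a finite distribution and f = exp.
x = np.array([-1.0, 0.0, 2.0, 3.0])   # support of X (arbitrary)
p = np.array([0.1, 0.4, 0.3, 0.2])    # probabilities summing to 1

lhs = np.exp(np.dot(p, x))            # f(E[X])
rhs = np.dot(p, np.exp(x))            # E[f(X)]
print(lhs, rhs, lhs <= rhs)           # the last value is True
```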


Entropy

Definition: the entropy of a random variable $X$ with probability distribution $p(x) = \Pr[X = x]$ is
$$H(X) = -\mathbb{E}[\log p(X)] = -\sum_{x \in X} p(x) \log p(x).$$

Properties:

• $H(X) \ge 0$.

• measure of the uncertainty of $X$.

• maximal for the uniform distribution. For a finite support of size $N$, by Jensen's inequality:
$$H(X) = \mathbb{E}\left[\log \frac{1}{p(X)}\right] \le \log \mathbb{E}\left[\frac{1}{p(X)}\right] = \log N.$$
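For concreteness, here is a short sketch (illustrative only, not from the slides) that computes $H(X)$ for an arbitrary finite distribution and checks the bound $H(X) \le \log N$; natural logarithms are used, so entropy is measured in nats.

```python
import numpy as np

def entropy(p):
    """Entropy of a finite distribution p, in nats (0 log 0 taken to be 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

p = np.array([0.5, 0.25, 0.125, 0.125])
N = len(p)
print(entropy(p))                      # about 1.2130 nats
print(entropy(np.full(N, 1.0 / N)))    # log N ~ 1.3863, the maximum
print(entropy(p) <= np.log(N))         # True
```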


Relative Entropy

Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions $p$ and $q$ is
$$D(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{p(X)}{q(X)}\right] = \sum_{x} p(x) \log \frac{p(x)}{q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$.

Properties:

• asymmetric measure of the divergence between two distributions. It is convex in $p$ and $q$.

• $D(p \,\|\, q) \ge 0$ for all $p$ and $q$.

• $D(p \,\|\, q) = 0$ iff $p = q$.
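The following sketch (not from the slides) computes $D(p \,\|\, q)$ for two made-up finite distributions and illustrates the properties listed above: non-negativity, asymmetry, and $D(p \,\|\, p) = 0$. It assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(kl(p, q), kl(q, p))   # both non-negative, and generally different (asymmetry)
print(kl(p, p))             # 0.0, attained iff the two distributions coincide
```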


Non-Negativity of Relative Entropy

Treat separately the case $q(x) = 0$ for some $x \in \mathrm{supp}(p)$, or use the same proof by extending the domain of $\log$. By Jensen's inequality applied to the concave function $\log$,
$$-D(p \,\|\, q) = \mathbb{E}_p\left[\log \frac{q(X)}{p(X)}\right] \le \log \mathbb{E}_p\left[\frac{q(X)}{p(X)}\right] = \log \sum_{x \in \mathrm{supp}(p)} q(x) \le \log \sum_{x} q(x) = 0.$$


This Lecture

• Information theory basics
• Maximum entropy models
• Duality theorem


Problem

Data: sample $S$ drawn i.i.d. from a set $X$ according to some distribution $D$:
$$x_1, \ldots, x_m \in X.$$

Problem: find a distribution $p$ out of a set $P$ that best estimates $D$.


Maximum Likelihood

Likelihood: probability of observing the sample $S$ under a distribution $p \in P$, which, given the independence assumption, is
$$\Pr[S] = \prod_{i=1}^m p(x_i).$$

Principle: select the distribution maximizing the likelihood,
$$p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^m p(x_i),$$
or, equivalently,
$$p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i).$$


Relative Entropy Formulation

Empirical distribution: distribution $\hat{p}$ corresponding to the sample $S$.

Lemma: $p^\star$ has maximum likelihood iff
$$p^\star = \operatorname*{argmin}_{p \in P} D(\hat{p} \,\|\, p).$$

Proof: with $|x|_S$ denoting the number of occurrences of $x$ in $S$,
$$\begin{aligned}
D(\hat{p} \,\|\, p) &= \sum_x \hat{p}(x) \log \hat{p}(x) - \sum_x \hat{p}(x) \log p(x) \\
&= -H(\hat{p}) - \sum_x \frac{|x|_S}{m} \log p(x) \\
&= -H(\hat{p}) - \frac{1}{m} \log \prod_{x \in S} p(x)^{|x|_S} \\
&= -H(\hat{p}) - \frac{1}{m} \log\Big(\Pr_{S \sim p^m}[S]\Big).
\end{aligned}$$
Since $H(\hat{p})$ does not depend on $p$, minimizing $D(\hat{p} \,\|\, p)$ over $P$ is equivalent to maximizing the likelihood $\Pr_{S \sim p^m}[S]$.
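To make the lemma concrete, here is an illustrative sketch (not from the slides): over a small one-parameter family of distributions, the parameter maximizing the log-likelihood of a sample coincides with the one minimizing $D(\hat{p} \,\|\, p)$. The family, the sample, and the parameter grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
support = np.array([0, 1, 2])
sample = rng.choice(support, size=200, p=[0.5, 0.3, 0.2])
p_hat = np.bincount(sample, minlength=3) / len(sample)   # empirical distribution

def p_theta(t):
    """A one-parameter family of distributions on {0, 1, 2} (arbitrary choice)."""
    w = np.array([1.0, t, t ** 2])
    return w / w.sum()

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

thetas = np.linspace(0.05, 2.0, 400)
loglik = [np.sum(np.log(p_theta(t)[sample])) for t in thetas]
divs = [kl(p_hat, p_theta(t)) for t in thetas]

# The maximizer of the log-likelihood and the minimizer of D(p_hat || p) coincide.
print(thetas[np.argmax(loglik)], thetas[np.argmin(divs)])
```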


Problem

Data: sample $S$ drawn i.i.d. from a set $X$ according to some distribution $D$:
$$x_1, \ldots, x_m \in X.$$

Features: associated to the elements of $X$,
$$\Phi : X \to \mathbb{R}^N.$$

Problem: how do we estimate the distribution $D$? With the uniform distribution $u$ over $X$?


Features

Examples:

• n-grams, distance-d n-grams.

• class-based n-grams, word triggers.

• sentence length.

• number and type of verbs.

• various grammatical information (e.g., agreement, POS tags).

• dialog-level information.



Hoeffding's Bounds

Theorem: let $X_1, X_2, \ldots, X_m$ be a sequence of independent random variables taking values in $[0, 1]$. Then, for all $\epsilon > 0$, the following inequalities hold for $\overline{X}_m = \frac{1}{m}\sum_{i=1}^m X_i$:
$$\Pr\big[\overline{X}_m - \mathbb{E}[\overline{X}_m] \ge \epsilon\big] \le e^{-2m\epsilon^2}, \qquad
\Pr\big[\overline{X}_m - \mathbb{E}[\overline{X}_m] \le -\epsilon\big] \le e^{-2m\epsilon^2}.$$
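As an illustration (not in the original slides), the following sketch estimates the deviation probability by simulation for Bernoulli variables and compares it with the bound $e^{-2m\epsilon^2}$; the parameters $m$, $\epsilon$, the mean, and the number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, mu, trials = 100, 0.1, 0.3, 200_000   # arbitrary parameters

# Empirical frequency of the deviation event for Bernoulli(mu) averages,
# compared with Hoeffding's bound exp(-2 m eps^2).
means = rng.binomial(m, mu, size=trials) / m
empirical = np.mean(means - mu >= eps)
bound = np.exp(-2 * m * eps ** 2)

print(empirical, bound, empirical <= bound)   # the last value is True
```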


Maximum Entropy Principle (E. T. Jaynes, 1957, 1983)

For large $m$, the empirical average is a good estimate of the expected values of the features:
$$\mathbb{E}_{x \sim D}[\Phi_j(x)] \approx \mathbb{E}_{x \sim \hat{p}}[\Phi_j(x)], \quad j \in [1, N].$$

Principle: find the distribution that is closest to the uniform distribution $u$ and that preserves the expected values of the features.

Closeness is measured using the relative entropy.


Maximum Entropy Formulation

Distributions: let $P$ denote the set of distributions
$$P = \Big\{ p \in \Delta : \mathbb{E}_{x \sim p}[\Phi(x)] = \mathbb{E}_{x \sim \hat{p}}[\Phi(x)] \Big\}.$$

Optimization problem: find the distribution $p^\star$ verifying
$$p^\star = \operatorname*{argmin}_{p \in P} D(p \,\|\, u).$$


Relation with Entropy Maximization

Relationship with entropy:
$$D(p \,\|\, u) = \sum_{x \in X} p(x) \log \frac{p(x)}{1/|X|} = \log |X| + \sum_{x \in X} p(x) \log p(x) = \log |X| - H(p).$$

Optimization problem:
$$\begin{aligned}
\text{minimize} \quad & \sum_{x \in X} p(x) \log p(x) \\
\text{subject to} \quad & p(x) \ge 0, \ \forall x \in X, \\
& \sum_{x \in X} p(x) = 1, \\
& \sum_{x \in X} p(x) \Phi(x) = \frac{1}{m} \sum_{i=1}^m \Phi(x_i).
\end{aligned}$$
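Below is a sketch (not from the slides) that solves this constrained problem directly on a small finite set with scipy's SLSQP solver; the alphabet, the feature map, and the sample are made up for illustration, and a general-purpose solver stands in for the dedicated algorithms (e.g., generalized iterative scaling) cited in the references.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance: X = {0, 1, 2, 3}, a two-dimensional feature map, and a small sample.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [0.0, 0.0]])                  # Phi(x) for x = 0, 1, 2, 3
sample = np.array([0, 0, 1, 2, 1, 0, 3, 1])
target = Phi[sample].mean(axis=0)             # (1/m) sum_i Phi(x_i)

def neg_entropy(p):
    # sum_x p(x) log p(x), with 0 log 0 treated as 0
    return np.sum(np.where(p > 0, p * np.log(np.clip(p, 1e-300, None)), 0.0))

constraints = [
    {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},       # normalization
    {"type": "eq", "fun": lambda p: Phi.T @ p - target},     # feature expectations
]
res = minimize(neg_entropy, x0=np.full(4, 0.25), method="SLSQP",
               bounds=[(0.0, 1.0)] * 4, constraints=constraints)
p_star = res.x
print(p_star)
print(Phi.T @ p_star, target)                 # constraints are satisfied
```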


Maximum Likelihood Gibbs Distribution

Gibbs distributions: set $Q$ of distributions $p$ defined by
$$p(x) = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big) = \frac{1}{Z} \exp\Big(\sum_{j=1}^N w_j \Phi_j(x)\Big), \quad \text{with} \quad Z = \sum_{x} \exp\big(w \cdot \Phi(x)\big).$$

Maximum likelihood Gibbs distribution:
$$p^\star = \operatorname*{argmax}_{q \in Q} \sum_{i=1}^m \log q(x_i) = \operatorname*{argmin}_{q \in \bar{Q}} D(\hat{p} \,\|\, q),$$
where $\bar{Q}$ is the closure of $Q$.
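As an illustration (not part of the slides), the gradient of the average log-likelihood of a Gibbs distribution with respect to $w$ is $\mathbb{E}_{x \sim \hat{p}}[\Phi(x)] - \mathbb{E}_{x \sim p_w}[\Phi(x)]$, so a few steps of plain gradient ascent drive the model's feature expectations toward the empirical ones. The toy alphabet, features, sample, step size, and iteration count below are all arbitrary choices.

```python
import numpy as np

# Same toy instance as the maximum entropy sketch above.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [0.0, 0.0]])
sample = np.array([0, 0, 1, 2, 1, 0, 3, 1])
emp = Phi[sample].mean(axis=0)                # empirical feature expectations

def gibbs(w):
    """Gibbs distribution p_w(x) proportional to exp(w . Phi(x)) on the finite set."""
    scores = Phi @ w
    scores -= scores.max()                    # numerical stability
    e = np.exp(scores)
    return e / e.sum()

w = np.zeros(2)
for _ in range(2000):                         # plain gradient ascent, fixed step size
    grad = emp - Phi.T @ gibbs(w)             # gradient of the average log-likelihood
    w += 0.5 * grad

print(w)
print(Phi.T @ gibbs(w), emp)                  # model expectations match the empirical ones
```

By the duality theorem stated on the next slide, the resulting distribution coincides with the maximum entropy distribution computed in the earlier sketch.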


Duality Theorem (Della Pietra et al., 1997)

Theorem: assume that $D(\hat{p} \,\|\, u) < \infty$. Then, there exists a unique probability distribution $p^\star$ satisfying:

1. $p^\star \in P \cap \bar{Q}$;

2. $D(p \,\|\, q) = D(p \,\|\, p^\star) + D(p^\star \,\|\, q)$ for any $p \in P$ and $q \in \bar{Q}$ (Pythagorean equality);

3. $p^\star = \operatorname*{argmin}_{q \in \bar{Q}} D(\hat{p} \,\|\, q)$ (maximum likelihood);

4. $p^\star = \operatorname*{argmin}_{p \in P} D(p \,\|\, u)$ (maximum entropy).

Each of these properties determines $p^\star$ uniquely.


Regularization

Relaxation: the sample size is too small to impose equality constraints. Instead,

• L1 constraints:
$$\Big| \mathbb{E}_{x \sim D}[\Phi_j(x)] - \mathbb{E}_{x \sim \hat{p}}[\Phi_j(x)] \Big| \le \beta_j, \quad j \in [1, N].$$

• L2 constraints:
$$\Big\| \mathbb{E}_{x \sim D}[\Phi(x)] - \mathbb{E}_{x \sim \hat{p}}[\Phi(x)] \Big\|_2 \le \Lambda^2.$$


L1 Maxent

Optimization problem:
$$\operatorname*{arginf}_{w \in \mathbb{R}^N} \ \sum_{j=1}^N \beta_j |w_j| - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i], \quad \text{with} \quad p_w[x] = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big).$$

Bayesian interpretation: Laplacian prior,
$$p_w[x] \to p(w)\, p_w[x], \quad \text{with} \quad p(w) = \prod_{j=1}^N \frac{\beta_j}{2} \exp(-\beta_j |w_j|).$$
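One simple way to handle the non-differentiable L1 term, sketched below on the same toy data (illustrative only; the thresholds $\beta_j$, the step size, and the iteration count are arbitrary), is a proximal gradient iteration: a gradient step on the negative average log-likelihood followed by componentwise soft-thresholding at $\mathrm{step} \cdot \beta_j$.

```python
import numpy as np

# Proximal gradient (soft-thresholding) steps for the L1 maxent objective on toy data.
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [0.0, 0.0]])
sample = np.array([0, 0, 1, 2, 1, 0, 3, 1])
emp = Phi[sample].mean(axis=0)
beta = np.array([0.05, 0.05])                 # arbitrary per-feature thresholds
step = 0.5

def gibbs(w):
    s = Phi @ w
    s -= s.max()
    e = np.exp(s)
    return e / e.sum()

w = np.zeros(2)
for _ in range(2000):
    w = w + step * (emp - Phi.T @ gibbs(w))                       # gradient step on the smooth part
    w = np.sign(w) * np.maximum(np.abs(w) - step * beta, 0.0)     # prox of step * beta_j |w_j|
print(w)                                      # weights shrunk toward zero vs. the unregularized fit
```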


L2 Maxent

Optimization problem:
$$\operatorname*{arginf}_{w \in \mathbb{R}^N} \ \lambda \|w\|^2 - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i], \quad \text{with} \quad p_w[x] = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big).$$

Bayesian interpretation: Gaussian prior,
$$p_w[x] \to p(w)\, p_w[x], \quad \text{with} \quad p(w) = \prod_{j=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{w_j^2}{2\sigma^2}\Big).$$
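Continuing the same toy example (a sketch, not from the slides): the L2 penalty only adds $2\lambda w$ to the gradient of the objective, so plain gradient descent applies directly; $\lambda$ and the step size are arbitrary values chosen here.

```python
import numpy as np

# L2 maxent on the toy data: minimize  lam * ||w||^2 - (1/m) sum_i log p_w[x_i].
Phi = np.array([[1.0, 0.0],
                [1.0, 1.0],
                [0.0, 1.0],
                [0.0, 0.0]])
sample = np.array([0, 0, 1, 2, 1, 0, 3, 1])
emp = Phi[sample].mean(axis=0)
lam = 0.1                                     # arbitrary regularization strength

def gibbs(w):
    s = Phi @ w
    s -= s.max()
    e = np.exp(s)
    return e / e.sum()

w = np.zeros(2)
for _ in range(2000):
    grad = 2.0 * lam * w - (emp - Phi.T @ gibbs(w))   # gradient of the regularized objective
    w -= 0.5 * grad                                   # gradient descent step
print(w, np.linalg.norm(w))                   # weights shrunk relative to the unregularized fit
```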


Extensions - Bregman Divergences

Definition: let $F$ be a convex and differentiable function, then the Bregman divergence based on $F$ is defined as
$$B_F(y, x) = F(y) - F(x) - (y - x) \cdot \nabla_x F(x).$$

Examples:

• Unnormalized relative entropy.

• Euclidean distance.
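For instance (a minimal sketch, not from the slides), taking $F(u) = \|u\|^2$ recovers the squared Euclidean distance, and $F(u) = \sum_j (u_j \log u_j - u_j)$ gives the unnormalized relative entropy; the test vectors below are arbitrary positive vectors.

```python
import numpy as np

def bregman(F, gradF, y, x):
    """Bregman divergence B_F(y, x) = F(y) - F(x) - (y - x) . grad F(x)."""
    return F(y) - F(x) - np.dot(y - x, gradF(x))

# F(u) = ||u||^2 gives the squared Euclidean distance.
F_sq = lambda u: np.dot(u, u)
g_sq = lambda u: 2.0 * u

# F(u) = sum_j (u_j log u_j - u_j) gives the unnormalized relative entropy.
F_ent = lambda u: np.sum(u * np.log(u) - u)
g_ent = lambda u: np.log(u)

y = np.array([0.2, 0.5, 0.3])
x = np.array([0.3, 0.4, 0.3])
print(bregman(F_sq, g_sq, y, x), np.sum((y - x) ** 2))                   # equal
print(bregman(F_ent, g_ent, y, x), np.sum(y * np.log(y / x) - y + x))    # equal
```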


This Lecture

• Information theory basics
• Maximum entropy models
• Duality theorem


References

• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.

• J. Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357–365, 1944.

• Imre Csiszar and Gabor Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1:205–237, 1984.

• Imre Csiszar. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409–1413, 1989.

• J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.

• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, April 1997.


• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, Carnegie Mellon University, 2001.

• E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.

• E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.

• J. A. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.

• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer, Speech and Language, 10:187–228, 1996.