Speech Recognition
Lecture 7: Maximum Entropy Models

Mehryar Mohri
Courant Institute and Google Research
This Lecture

• Information theory basics
• Maximum entropy models
• Duality theorem
Convexity

Definition: $X \subseteq \mathbb{R}^N$ is said to be convex if for any two points $x, y \in X$ the segment $[x, y]$ lies in $X$:
$$\{\alpha x + (1 - \alpha) y,\ 0 \le \alpha \le 1\} \subseteq X.$$

Definition: let $X$ be a convex set. A function $f \colon X \to \mathbb{R}$ is said to be convex if for all $x, y \in X$ and $\alpha \in [0, 1]$,
$$f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y).$$

$f$ is said to be concave when $-f$ is convex. With a strict inequality for $\alpha \in (0, 1)$, $f$ is said to be strictly convex.
Properties of Convex Functions

Theorem: let $f$ be a differentiable function. Then, $f$ is convex iff $\mathrm{dom}(f)$ is convex and
$$\forall x, y \in \mathrm{dom}(f), \quad f(y) \ge f(x) + \nabla f(x) \cdot (y - x).$$

[Figure: the graph of a convex $f$ lies above its tangent at $(x, f(x))$, the line $y \mapsto f(x) + \nabla f(x) \cdot (y - x)$.]

Theorem: let $f$ be a twice differentiable function. Then, $f$ is convex iff $\mathrm{dom}(f)$ is convex and its Hessian is positive semi-definite:
$$\forall x \in \mathrm{dom}(f), \quad \nabla^2 f(x) \succeq 0.$$
Jensen's Inequality

Theorem: let $X$ be a random variable and $f$ a measurable convex function. Then,
$$f(E[X]) \le E[f(X)].$$

Proof:
• For a distribution over a finite set, the property follows directly from the definition of convexity.
• The general case is a consequence of the continuity of convex functions and the density of finite distributions.
Entropy

Definition: the entropy of a random variable $X$ with probability distribution $p(x) = \Pr[X = x]$ is
$$H(X) = -E[\log p(X)] = -\sum_{x \in X} p(x) \log p(x).$$

Properties:
• measure of the uncertainty of $X$.
• $H(X) \ge 0$.
• maximal for the uniform distribution. For a finite support of size $N$, by Jensen's inequality:
$$H(X) = E\Big[\log \frac{1}{p(X)}\Big] \le \log E\Big[\frac{1}{p(X)}\Big] = \log N.$$
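A quick numerical sketch (mine, not from the slides): the definition and the $\log N$ bound are easy to check on a toy distribution.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), in nats, with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.25, 0.125, 0.125])
print(entropy(p))                 # ~1.2130
print(entropy(np.full(4, 0.25)))  # log 4 ~ 1.3863: maximal at the uniform distribution
```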
Relative Entropy

Definition: the relative entropy (or Kullback-Leibler divergence) of two distributions $p$ and $q$ is
$$D(p \,\|\, q) = E_p\Big[\log \frac{p(X)}{q(X)}\Big] = \sum_x p(x) \log \frac{p(x)}{q(x)},$$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$.

Properties:
• asymmetric measure of the divergence between two distributions. It is convex in $p$ and $q$.
• $D(p \,\|\, q) \ge 0$ for all $p$ and $q$.
• $D(p \,\|\, q) = 0$ iff $p = q$.
Non-Negativity of Relative Entropy

Treat separately the case $q(x) = 0$ for $x \in \mathrm{supp}(p)$, or use the same proof by extending the domain of $\log$:
$$-D(p \,\|\, q) = E_p\Big[\log \frac{q(X)}{p(X)}\Big] \le \log E_p\Big[\frac{q(X)}{p(X)}\Big] = \log \sum_x q(x) = 0.$$
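The definition and its conventions translate directly into a short helper (a sketch of mine; `kl_divergence` is not a name from the lecture):

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) with the conventions 0 log(0/q) = 0 and p log(p/0) = inf."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf                                # p log(p/0) = infinity
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p, q = np.array([0.4, 0.6]), np.array([0.5, 0.5])
print(kl_divergence(p, q), kl_divergence(q, p))      # non-negative, and asymmetric
print(kl_divergence(p, p))                           # 0.0: D(p || q) = 0 iff p = q
```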
This Lecture

• Information theory basics
• Maximum entropy models
• Duality theorem
Problem

Data: sample $S$ drawn i.i.d. from a set $X$ according to some distribution $D$:
$$x_1, \ldots, x_m \in X.$$

Problem: find the distribution $p$, out of a set $P$, that best estimates $D$.
Maximum Likelihood

Likelihood: probability of observing the sample $S$ under distribution $p \in P$, which, given the independence assumption, is
$$\Pr[S] = \prod_{i=1}^m p(x_i).$$

Principle: select the distribution maximizing the likelihood,
$$p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^m p(x_i),$$
or, equivalently,
$$p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^m \log p(x_i).$$
Relative Entropy Formulation

Empirical distribution: distribution $\hat p$ corresponding to the sample $S$.

Lemma: $p^\star$ has maximum likelihood iff
$$p^\star = \operatorname*{argmin}_{p \in P} D(\hat p \,\|\, p).$$

Proof: with $|x|_S$ the number of occurrences of $x$ in $S$,
$$\begin{aligned}
D(\hat p \,\|\, p) &= \sum_x \hat p(x) \log \hat p(x) - \sum_x \hat p(x) \log p(x) \\
&= -H(\hat p) - \sum_x \frac{|x|_S}{m} \log p(x) \\
&= -H(\hat p) - \frac{1}{m} \log \prod_{x \in S} p(x)^{|x|_S} \\
&= -H(\hat p) - \frac{1}{m} \log\big(\Pr_{S \sim p^m}[S]\big).
\end{aligned}$$
Since $H(\hat p)$ does not depend on $p$, minimizing $D(\hat p \,\|\, p)$ coincides with maximizing the likelihood.
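The identity used in the proof can be checked numerically; the following is a toy verification of mine, with a hypothetical sample and candidate model:

```python
import numpy as np
from collections import Counter

S = ["a", "b", "a", "c", "a", "b"]            # hypothetical sample, m = 6
m = len(S)
support = sorted(set(S))
counts = Counter(S)
p_hat = np.array([counts[x] / m for x in support])
p = np.array([0.5, 0.3, 0.2])                 # candidate model on {a, b, c}
model = dict(zip(support, p))

log_pr_S = sum(np.log(model[x]) for x in S)   # log Pr[S] = sum_i log p(x_i)
lhs = np.sum(p_hat * np.log(p_hat / p))       # D(p_hat || p)
rhs = np.sum(p_hat * np.log(p_hat)) - log_pr_S / m   # -H(p_hat) - (1/m) log Pr[S]
print(np.isclose(lhs, rhs))                   # True
```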
Problem

Data: sample $S$ drawn i.i.d. from a set $X$ according to some distribution $D$:
$$x_1, \ldots, x_m \in X.$$

Features: $\Phi \colon X \to \mathbb{R}^N$, associated to the elements of $X$.

Problem: how do we estimate the distribution $D$? Uniform distribution $u$ over $X$?
Features
Examples:
• n-grams, distance-d n-grams.
• class-based n-grams, word triggers.
• sentence length.
• number and type of verbs.
• various grammatical information (e.g., agreement, POS tags).
• dialog-level information.
Hoeffding's Bounds

Theorem: let $X_1, X_2, \ldots, X_m$ be a sequence of independent random variables taking values in $[0, 1]$. Then, for all $\epsilon > 0$, the following inequalities hold for $\overline{X}_m = \frac{1}{m} \sum_{i=1}^m X_i$:
$$\Pr\big[\overline{X}_m - E[\overline{X}_m] \ge \epsilon\big] \le e^{-2m\epsilon^2},$$
$$\Pr\big[\overline{X}_m - E[\overline{X}_m] \le -\epsilon\big] \le e^{-2m\epsilon^2}.$$
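A small simulation (my own sketch) makes the bound concrete for uniform $[0, 1]$ variables; the empirical deviation probability sits well below $e^{-2m\epsilon^2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 100, 0.1, 20000
X = rng.uniform(0.0, 1.0, size=(trials, m))   # i.i.d. in [0, 1], E[X] = 0.5
deviations = X.mean(axis=1) - 0.5
print((deviations >= eps).mean())             # empirical Pr[Xbar_m - E >= eps]
print(np.exp(-2 * m * eps**2))                # Hoeffding bound: e^{-2} ~ 0.135
```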
Maximum Entropy Principle
(E. T. Jaynes, 1957, 1983)

For large $m$, the empirical average is a good estimate of the expected value of each feature:
$$E_{x \sim D}[\Phi_j(x)] \approx E_{x \sim \hat p}[\Phi_j(x)], \quad j \in [1, N].$$

Principle: find the distribution that is closest to the uniform distribution $u$ and that preserves the expected values of the features. Closeness is measured using relative entropy.
Maximum Entropy Formulation

Distributions: let $P$ denote the set of distributions
$$P = \big\{\, p \in \Delta : E_{x \sim p}[\Phi(x)] = E_{x \sim \hat p}[\Phi(x)] \,\big\}.$$

Optimization problem: find the distribution $p^\star$ verifying
$$p^\star = \operatorname*{argmin}_{p \in P} D(p \,\|\, u).$$
Relation with Entropy Maximization

Relationship with entropy:
$$D(p \,\|\, u) = \sum_{x \in X} p(x) \log \frac{p(x)}{1/|X|} = \log |X| + \sum_{x \in X} p(x) \log p(x) = \log |X| - H(p).$$

Optimization problem:
$$\begin{aligned}
\text{minimize} \quad & \sum_{x \in X} p(x) \log p(x) \\
\text{subject to} \quad & p(x) \ge 0, \ \forall x \in X, \\
& \sum_{x \in X} p(x) = 1, \\
& \sum_{x \in X} p(x) \Phi(x) = \frac{1}{m} \sum_{i=1}^m \Phi(x_i).
\end{aligned}$$
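For a small finite $X$, this constrained program can be solved directly. Below is a minimal sketch of mine using scipy's SLSQP solver, with a hypothetical one-dimensional feature and sample; none of these names come from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

X = np.arange(5)                               # X = {0, ..., 4}
Phi = X.astype(float).reshape(-1, 1)           # single feature Phi(x) = x
sample = np.array([0, 1, 1, 2])                # hypothetical sample
target = Phi[sample].mean(axis=0)              # empirical feature expectation

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)                # avoid log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},        # normalization
    {"type": "eq", "fun": lambda p: Phi.T @ p - target},   # feature constraint
]
p0 = np.full(len(X), 1.0 / len(X))
res = minimize(neg_entropy, p0, method="SLSQP",
               bounds=[(0.0, 1.0)] * len(X), constraints=constraints)
print(res.x)   # maxent solution; by the next slide, it has the Gibbs form e^{wx}/Z
```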
Maximum Likelihood Gibbs Distrib.

Gibbs distributions: set $Q$ of distributions $p$ defined by
$$p(x) = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big) = \frac{1}{Z} \exp\Big(\sum_{j=1}^N w_j \Phi_j(x)\Big), \quad \text{with} \quad Z = \sum_x \exp\big(w \cdot \Phi(x)\big).$$

Maximum likelihood Gibbs distribution:
$$p^\star = \operatorname*{argmax}_{q \in Q} \sum_{i=1}^m \log q(x_i) = \operatorname*{argmin}_{q \in \bar Q} D(\hat p \,\|\, q),$$
where $\bar Q$ is the closure of $Q$.
Duality Theorem
(Della Pietra et al., 1997)

Theorem: assume that $D(\hat p \,\|\, u) < \infty$. Then, there exists a unique probability distribution $p^\star$ satisfying:
1. $p^\star \in P \cap \bar Q$;
2. $D(p \,\|\, q) = D(p \,\|\, p^\star) + D(p^\star \,\|\, q)$ for any $p \in P$ and $q \in \bar Q$ (Pythagorean equality);
3. $p^\star = \operatorname*{argmin}_{q \in \bar Q} D(\hat p \,\|\, q)$ (maximum likelihood);
4. $p^\star = \operatorname*{argmin}_{p \in P} D(p \,\|\, u)$ (maximum entropy).

Each of these properties determines $p^\star$ uniquely.
Regularization

Relaxation: the sample size is too small to impose equality constraints. Instead,
• L1 constraints:
$$\Big| E_{x \sim p}[\Phi_j(x)] - E_{x \sim \hat p}[\Phi_j(x)] \Big| \le \beta_j, \quad j \in [1, N].$$
• L2 constraints:
$$\Big\| E_{x \sim p}[\Phi(x)] - E_{x \sim \hat p}[\Phi(x)] \Big\|_2 \le \Lambda^2.$$
L1 Maxent

Optimization problem:
$$\operatorname*{arginf}_{w \in \mathbb{R}^N} \ \sum_{j=1}^N \beta_j |w_j| - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i], \quad \text{with} \quad p_w[x] = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big).$$

Bayesian interpretation: Laplacian prior, $p_w[x] \to p(w)\, p_w[x]$, with
$$p(w) = \prod_{j=1}^N \frac{\beta_j}{2} \exp(-\beta_j |w_j|).$$
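Since the L1 term is non-smooth, plain gradient descent does not apply directly; a standard way to optimize this kind of objective is proximal gradient descent with soft-thresholding. The toy setup below is my own (hypothetical features and sample), not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))              # |X| = 10 points, N = 3 features
sample = rng.integers(0, 10, size=50)       # hypothetical sample (indices into X)
emp_mean = Phi[sample].mean(axis=0)         # empirical feature expectations
beta, lr = np.full(3, 0.05), 0.5

w = np.zeros(3)
for _ in range(500):
    logits = Phi @ w
    p_w = np.exp(logits - logits.max())
    p_w /= p_w.sum()                        # Gibbs distribution p_w
    grad = Phi.T @ p_w - emp_mean           # gradient of the smooth part
    w -= lr * grad
    w = np.sign(w) * np.maximum(np.abs(w) - lr * beta, 0.0)  # prox of beta_j |w_j|
print(w)                                    # some coordinates driven exactly to 0
```

The soft-thresholding step is what produces exact zeros in $w$: the sparsity effect of the Laplacian prior.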
L2 Maxent

Optimization problem:
$$\operatorname*{arginf}_{w \in \mathbb{R}^N} \ \lambda \|w\|^2 - \frac{1}{m} \sum_{i=1}^m \log p_w[x_i], \quad \text{with} \quad p_w[x] = \frac{1}{Z} \exp\big(w \cdot \Phi(x)\big).$$

Bayesian interpretation: Gaussian prior, $p_w[x] \to p(w)\, p_w[x]$, with
$$p(w) = \prod_{j=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{w_j^2}{2\sigma^2}\Big).$$
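The L2 objective is smooth, so plain gradient descent works: the gradient is $2\lambda w + E_{p_w}[\Phi] - E_{\hat p}[\Phi]$. A minimal sketch of mine, on the same kind of hypothetical toy setup as above:

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))              # |X| = 10 points, N = 3 features
sample = rng.integers(0, 10, size=50)       # hypothetical sample
emp_mean = Phi[sample].mean(axis=0)
lam, lr = 0.1, 0.5

w = np.zeros(3)
for _ in range(500):
    logits = Phi @ w
    p_w = np.exp(logits - logits.max())
    p_w /= p_w.sum()                        # Gibbs distribution p_w
    w -= lr * (2 * lam * w + Phi.T @ p_w - emp_mean)
print(w)                                    # shrunk but generally dense weights
print(Phi.T @ p_w - emp_mean)               # small residual, of size O(lambda |w|)
```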
Extensions - Bregman Divergences

Definition: let $F$ be a convex and differentiable function; then the Bregman divergence based on $F$ is defined as
$$B_F(y, x) = F(y) - F(x) - (y - x) \cdot \nabla_x F(x).$$

[Figure: $B_F(y, x)$ is the gap at $y$ between $F(y)$ and the tangent to $F$ at $x$, the line $F(x) + (y - x) \cdot \nabla_x F(x)$.]

Examples:
• Unnormalized relative entropy.
• Euclidean distance.
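Both examples are two-line instances of the definition; here is a sketch of mine checking them (the function names are my own):

```python
import numpy as np

def bregman_sq_norm(y, x):
    """B_F for F(x) = ||x||^2: recovers the squared Euclidean distance."""
    return np.sum(y**2) - np.sum(x**2) - (y - x) @ (2 * x)   # nabla F(x) = 2x

def bregman_neg_entropy(y, x):
    """B_F for F(x) = sum_i (x_i log x_i - x_i): unnormalized relative entropy."""
    return np.sum(y * np.log(y / x) - y + x)                 # nabla F(x) = log x

y, x = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman_sq_norm(y, x), np.sum((y - x)**2))  # equal
print(bregman_neg_entropy(y, x))                  # = D(y || x) when both sum to 1
```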
This Lecture

• Information theory basics
• Maximum entropy models
• Duality theorem
References

• Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), March 1996.
• Joseph Berkson. Application of the logistic function to bio-assay. Journal of the American Statistical Association, 39:357–365, 1944.
• Imre Csiszár and Gábor Tusnády. Information geometry and alternating minimization procedures. Statistics and Decisions, Supplement Issue 1:205–237, 1984.
• Imre Csiszár. A geometric interpretation of Darroch and Ratcliff's generalized iterative scaling. The Annals of Statistics, 17(3):1409–1413, 1989.
• J. N. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43(5):1470–1480, 1972.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380–393, April 1997.
• Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. Duality and auxiliary functions for Bregman distances. Technical Report CMU-CS-01-109, School of Computer Science, Carnegie Mellon University, 2001.
• E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
• E. T. Jaynes. Papers on Probability, Statistics, and Statistical Physics. R. Rosenkrantz (editor), D. Reidel Publishing Company, 1983.
• J. A. O'Sullivan. Alternating minimization algorithms: from Blahut-Arimoto to expectation-maximization. In Codes, Curves, and Signals: Common Threads in Communications, A. Vardy (editor), Kluwer, 1998.
• Roni Rosenfeld. A maximum entropy approach to adaptive statistical language modelling. Computer Speech and Language, 10:187–228, 1996.