Information Theory And Machine Learning


Transcript of Information Theory And Machine Learning

Page 1: Information Theory And Machine Learning


Information Theory And Machine Learning

Daniel Sadoc Menasche

2011


Page 2: Information Theory And Machine Learning


Notation

                 probability mass function   expectation               expectation of a function
X                p(x) = P(X = x)             E[X] = ∑_{∀x} p(x) x      E[f(X)] = ∑_{∀x} p(x) f(x)
Y                q(x) = P(Y = x)             E[Y] = ∑_{∀x} q(x) x      E[f(Y)] = ∑_{∀x} q(x) f(x)

Examples: to emphasize that the expectation is taken over p, denote E by E_p,

E[q(X)] = E_p[q(X)] = ∑_{∀x} p(x) q(x)

Entropy: E[−log p(X)] = −∑_{∀x} p(x) log p(x)
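To make the notation concrete, here is a minimal Python sketch (the pmfs p and q are assumed examples, not from the slides) that evaluates E_p[X], E_p[q(X)], and the entropy E_p[−log p(X)]:

```python
import math

p = {1: 0.5, 2: 0.25, 3: 0.125, 4: 0.125}   # pmf of X (assumed values)
q = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}    # pmf of Y on the same support

def expectation(pmf, f):
    """E[f(X)] under pmf: sum over x of pmf(x) * f(x)."""
    return sum(px * f(x) for x, px in pmf.items())

print(expectation(p, lambda x: x))                    # E_p[X]       = 1.875
print(expectation(p, lambda x: q[x]))                 # E_p[q(X)]    = 0.25
print(expectation(p, lambda x: -math.log2(p[x])))     # H(X) in bits = 1.75
```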


Page 3: Information Theory And Machine Learning


Entropy

Properties: H(X) ≥ 0.

Let H_b(X) be the entropy computed with base-b logarithms; then H_b(X) = (log_b a) H_a(X):

H_b(X) = (log_b a) H_a(X)                                 (1)
       = −(log_b a) ∑_x p(x) log_a p(x)                   (2)
       = −(log_b a) ∑_x p(x) ( log_b p(x) / log_b a )     (3)
       = −∑_x p(x) log_b p(x)
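A quick numerical check of the base-change identity, using an assumed pmf and converting between bits (base 2) and nats (base e):

```python
import math

p = [0.3, 0.3, 0.1, 0.2, 0.1]   # assumed pmf

def entropy(pmf, base):
    # H_base(X) = -sum_x p(x) log_base p(x)
    return -sum(px * math.log(px, base) for px in pmf if px > 0)

H_bits = entropy(p, 2)          # H_2(X)
H_nats = entropy(p, math.e)     # H_e(X)
# H_2(X) = (log_2 e) * H_e(X)
print(H_bits, math.log(math.e, 2) * H_nats)   # both ≈ 2.171
```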


Page 4: Information Theory And Machine Learning


Entropy of Indicator Variable

Let X be an indicator variable

X = 1 with probability p, and X = 0 otherwise.    (4)

H(X) = −p log p − (1 − p) log(1 − p)

H(X) is maximized when p = 1/2, and is concave in p.

[Figure: H(X) as a function of p; the curve peaks at 1 bit when p = 1/2.]
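A small sketch of the binary entropy function shown in the figure; it is symmetric around p = 1/2 and peaks there at 1 bit:

```python
import math

def binary_entropy(p):
    # H(X) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(p, round(binary_entropy(p), 4))   # peaks at H(1/2) = 1 bit
```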


Page 5: Information Theory And Machine Learning


Another Example

Joint distribution P(X = x, Y = y), with x indexing columns and y indexing rows:

 y \ x       1      2      3      4   |  P(Y = y)
   1       1/8   1/16   1/32   1/32   |   1/4
   2      1/16    1/8   1/32   1/32   |   1/4
   3      1/16   1/16   1/16   1/16   |   1/4
   4       1/4      0      0      0   |   1/4
P(X = x)   1/2    1/4    1/8    1/8

H(X|Y) = ∑_y H(X|Y = y) P(Y = y) = 11/8 bits
H(Y|X) = ∑_x H(Y|X = x) P(X = x)
H(X) = 7/4 bits
H(Y) = 2 bits
H(X,Y) (can be computed in 2 ways)
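A sketch that recomputes the entropies for this table (the joint probabilities are exactly the ones shown above):

```python
import math
from fractions import Fraction as F

# joint[(y, x)] = P(X = x, Y = y), the table above
joint = {
    (1, 1): F(1, 8),  (1, 2): F(1, 16), (1, 3): F(1, 32), (1, 4): F(1, 32),
    (2, 1): F(1, 16), (2, 2): F(1, 8),  (2, 3): F(1, 32), (2, 4): F(1, 32),
    (3, 1): F(1, 16), (3, 2): F(1, 16), (3, 3): F(1, 16), (3, 4): F(1, 16),
    (4, 1): F(1, 4),  (4, 2): F(0),     (4, 3): F(0),     (4, 4): F(0),
}

def H(probs):
    # entropy in bits of a collection of probabilities
    return -sum(float(p) * math.log2(float(p)) for p in probs if p > 0)

px = [sum(p for (y, x), p in joint.items() if x == v) for v in (1, 2, 3, 4)]
py = [sum(p for (y, x), p in joint.items() if y == v) for v in (1, 2, 3, 4)]

H_X, H_Y, H_XY = H(px), H(py), H(joint.values())
print(H_X, H_Y)                  # 1.75 (= 7/4) and 2.0 bits
print(H_XY - H_Y, H_XY - H_X)    # H(X|Y) = 1.375 (= 11/8) and H(Y|X) = 1.625 bits
```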


Page 6: Information Theory And Machine Learning


Entropy and Conditional Entropy

[Venn diagram: H(X,Y) covered by H(X) and H(Y); the outer parts are H(X|Y) and H(Y|X), and the overlap region is left unlabeled ("?").]


Page 7: Information Theory And Machine Learning


KL Divergence

Definition: the KL divergence between distributions p and q is

KL(p,q) = −∑_{∀x} P(X = x) log( q(x) / p(x) ) = E_p[ −log( q(X) / p(X) ) ]

The KL divergence serves to
  compare two distributions
  quantify the number of extra bits needed to encode X if we guess the wrong distribution


Page 8: Information Theory And Machine Learning


Huffman Code and KL Divergence Example

                        a     e     i     o     u
p  X  Portuguese       0.3   0.3   0.1   0.2   0.1
      Portuguese code   11    10   000    01   001
q  Y  English          0.2   0.3   0.2   0.2   0.1
      English code      01    10    00   110   111

http://en.wikipedia.org/wiki/Huffman_coding

[Figure: Huffman code trees built from the English and Portuguese vowel probabilities.]


Page 9: Information Theory And Machine Learning


KL Divergence Example

                        a     e     i     o     u
p  X  Portuguese       0.3   0.3   0.1   0.2   0.1
      Portuguese code   11    10   000    01   001
q  Y  English          0.2   0.3   0.2   0.2   0.1
      English code      01    10    00   110   111

H(X) = 2.1710 bits (minimum expected code length)
KL(p,q) ≈ 0.0755 bits (penalty for using the wrong distribution)
expected code length with the Portuguese code: L = 2.2 bits ≈ H(X)
expected code length for Portuguese text with the English code: M = 2.3 bits ≈ H(X) + KL(p,q)
in general,
H(X) ≤ L ≤ H(X) + 1 and H(X) + KL(p,q) ≤ M ≤ H(X) + KL(p,q) + 1
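A short sketch that reproduces these quantities from the two distributions and the code lengths above (KL follows the definition used in these slides, ∑_x p(x) log(p(x)/q(x))):

```python
import math

vowels = ["a", "e", "i", "o", "u"]
p = {"a": 0.3, "e": 0.3, "i": 0.1, "o": 0.2, "u": 0.1}   # Portuguese
q = {"a": 0.2, "e": 0.3, "i": 0.2, "o": 0.2, "u": 0.1}   # English
len_pt = {"a": 2, "e": 2, "i": 3, "o": 2, "u": 3}        # lengths of 11, 10, 000, 01, 001
len_en = {"a": 2, "e": 2, "i": 2, "o": 3, "u": 3}        # lengths of 01, 10, 00, 110, 111

H = -sum(p[v] * math.log2(p[v]) for v in vowels)          # ≈ 2.1710 bits
KL = sum(p[v] * math.log2(p[v] / q[v]) for v in vowels)   # ≈ 0.0755 bits
L = sum(p[v] * len_pt[v] for v in vowels)                 # 2.2 bits, within 1 bit of H
M = sum(p[v] * len_en[v] for v in vowels)                 # 2.3 bits, within 1 bit of H + KL
print(H, KL, L, M)
```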


Page 10: Information Theory And Machine Learning


KL Divergence Properties

The KL divergence KL(p,q) satisfies:
KL(p,q) ≥ 0
KL(p,q) = 0 iff p = q
KL(p,q) ≠ KL(q,p)

Our goal: prove first property


Page 11: Information Theory And Machine Learning


Convex and Concave Functions

Convex function: "the chord lies above the function"
A function f is concave if −f is convex
A line is both concave and convex
Example of a convex function:

[Figure: a convex function f(x); the chord value p f(x1) + (1−p) f(x2) lies above f(p x1 + (1−p) x2).]


Page 12: Information Theory And Machine Learning


Jensen Inequality

Theorem: for any convex function f(x), E[f(X)] ≥ f(E[X])

[Figure: the same convex-function picture; the chord point p f(x1) + (1−p) f(x2) lies above f(p x1 + (1−p) x2), illustrating the inequality for a two-point distribution.]
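A numerical illustration of Jensen's inequality for the convex function f(x) = −log2(x), with an assumed pmf:

```python
import math

X = {0.5: 0.3, 0.25: 0.5, 0.125: 0.2}    # assumed pmf: value -> probability
f = lambda x: -math.log2(x)              # a convex function

E_X = sum(x * px for x, px in X.items())
E_fX = sum(f(x) * px for x, px in X.items())
print(E_fX, f(E_X))   # E[f(X)] = 1.9 >= f(E[X]) ≈ 1.737
```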


Page 13: Information Theory And Machine Learning


KL Divergence is Always Positive

Theorem: KL(p,q) ≥ 0 (Gibbs' inequality)

Proof:

KL(p,q) = E_p[ −log( q(X) / p(X) ) ]
        = −∑_{∀x} P(X = x) log( q(x) / p(x) )
        ≥ −log E_p[ q(X) / p(X) ]                (by Jensen's inequality, since −log is convex)
        = −log ∑_{∀x} P(X = x) ( q(x) / p(x) )
        = −log ∑_{∀x} p(x) ( q(x) / p(x) )
        = −log ∑_{∀x} q(x)
        = −log 1 = 0

Page 14: Information Theory And Machine Learning


From KL Divergence to Mutual Information

The mutual information between X and Y is the amount of information that X gives about Y.

Recall p(x) = P(X = x) and q(x) = P(Y = x).
Let j(x, y) be the joint distribution of X and Y, j(x, y) = P(X = x, Y = y).

Definition: the mutual information between X and Y is

I(X,Y) = KL(j, pq) = ∑_{∀x} ∑_{∀y} P(X = x, Y = y) log( P(X = x, Y = y) / ( P(X = x) P(Y = y) ) )

If X and Y are independent, the mutual information is zero.
If X = Y, then I(X,Y) = H(X).
In general, I(X,Y) = H(X) − H(X|Y).
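A sketch that computes I(X,Y) for the joint table of the earlier example and checks it against H(X) − H(X|Y):

```python
import math

# joint[(y, x)] = P(X = x, Y = y), the same table used in the earlier example
joint = {
    (1, 1): 1/8,  (1, 2): 1/16, (1, 3): 1/32, (1, 4): 1/32,
    (2, 1): 1/16, (2, 2): 1/8,  (2, 3): 1/32, (2, 4): 1/32,
    (3, 1): 1/16, (3, 2): 1/16, (3, 3): 1/16, (3, 4): 1/16,
    (4, 1): 1/4,  (4, 2): 0.0,  (4, 3): 0.0,  (4, 4): 0.0,
}
px = {v: sum(p for (y, x), p in joint.items() if x == v) for v in range(1, 5)}
py = {v: sum(p for (y, x), p in joint.items() if y == v) for v in range(1, 5)}

# I(X,Y) = sum_{x,y} j(x,y) log2( j(x,y) / (P(X=x) P(Y=y)) )
I = sum(p * math.log2(p / (px[x] * py[y]))
        for (y, x), p in joint.items() if p > 0)

H_X = -sum(p * math.log2(p) for p in px.values())
H_X_given_Y = -sum(p * math.log2(p / py[y]) for (y, x), p in joint.items() if p > 0)
print(I, H_X - H_X_given_Y)   # both equal 3/8 = 0.375 bit
```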


Page 15: Information Theory And Machine Learning


Entropy, Conditional Entropy, and Mutual Information

[Venn diagram: H(X,Y) covered by H(X) and H(Y); the outer parts are H(X|Y) and H(Y|X), and the overlap of H(X) and H(Y) is I(X,Y).]


Page 16: Information Theory And Machine Learning


Conditioning Reduces Entropy

Theorem: H(X|Y) ≤ H(X)

Proof: H(X) = H(X|Y) + I(X,Y), so 0 ≤ I(X,Y) = H(X) − H(X|Y).

I(X,Y): information that Y brings about X
H(X|Y): uncertainty that remains in X after seeing Y
H(X): uncertainty in X

Theorem: H(X,Y) ≤ H(X) + H(Y)

Proof: H(X,Y) = H(X|Y) + H(Y); since H(X|Y) ≤ H(X), the result follows from the theorem above. If X and Y are independent, H(X,Y) = H(X) + H(Y).


Page 17: Information Theory And Machine Learning


Four Applications of Information Theory (IT) to Machine Learning

An IT interpretation of maximum likelihood
An IT interpretation of Bayes' theorem
Using mutual information for feature selection
Using entropy for medical image alignment


Page 18: Information Theory And Machine Learning


Maximum Likelihood Is Minimum KL Divergence

Given
  an empirical distribution q(x)
  a model distribution p(x|θ)

finding the θ that maximizes the likelihood of the data is equivalent to minimizing the KL divergence between q(x) and p(x|θ).


Page 19: Information Theory And Machine Learning


Maximum Likelihood Is Minimum KL Divergence

Example
  Given the set of points (1, 3, 4, 3, 5),
  the empirical distribution is P(Y = 1) = 1/5, P(Y = 3) = 2/5, P(Y = 4) = 1/5, P(Y = 5) = 1/5.

Empirical distribution
  Given a set of points x_1, x_2, x_3, ..., x_N,

  q(x) = P(Y = x) = (1/N) ∑_{t=1}^{N} 1_{x = x_t}


Page 20: Information Theory And Machine Learning


Maximum Likelihood ≡ Minimum KL Divergence

KL(q(x), p(x|θ)) = ∑_x q(x) log( q(x) / p(x|θ) )
                 = ∑_x q(x) log q(x) − ∑_x q(x) log p(x|θ)


Page 21: Information Theory And Machine Learning


Maximum Likelihood ≡ Minimum KL Divergence

min_θ KL(q(x), p(x|θ)) = min_θ −∑_x q(x) log p(x|θ)    (the ∑_x q(x) log q(x) term does not depend on θ)

                       = max_θ ∑_x q(x) log p(x|θ)

                       = max_θ ∑_x ( (1/N) ∑_{t=1}^{N} 1_{x = x_t} ) log p(x|θ)

                       = max_θ ∑_{t=1}^{N} ∑_x 1_{x = x_t} log p(x|θ)    (the constant factor 1/N does not change the maximizer)

                       = max_θ ∑_{t=1}^{N} log p(x_t|θ)    (5)
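A sketch of the equivalence on the example points (1, 3, 4, 3, 5), using an assumed one-parameter model (a geometric pmf with parameter θ); a grid search shows that the θ minimizing KL(q, p(·|θ)) is the θ maximizing the log-likelihood:

```python
import math

data = [1, 3, 4, 3, 5]
N = len(data)
q = {x: data.count(x) / N for x in set(data)}      # empirical distribution

def p(x, theta):
    # assumed model: geometric pmf on x = 1, 2, 3, ...
    return (1 - theta) ** (x - 1) * theta

def kl(theta):
    # KL(q, p(.|theta)); the sum_x q(x) log q(x) part is constant in theta
    return sum(qx * math.log(qx / p(x, theta)) for x, qx in q.items())

def log_likelihood(theta):
    return sum(math.log(p(x, theta)) for x in data)

grid = [i / 100 for i in range(1, 100)]
print(min(grid, key=kl), max(grid, key=log_likelihood))   # same theta (≈ 0.31)
```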


Page 22: Information Theory And Machine Learning


Four Applications of Information Theory (IT) to Machine Learning

An IT interpretation of maximum likelihood
An IT interpretation of Bayes' theorem
Using mutual information for feature selection
Using entropy for medical image alignment


Page 23: Information Theory And Machine Learning


Interpretation of Bayes' Theorem

p(θ|x) = p(x|θ) p(θ) / p(x)    (6)

Ingredients:
  x, the data
  p(θ|x), the posterior (e.g., the probability of a class given the data)
  p(x|θ), the likelihood
  p(θ), the prior

Conditioning reduces entropy, therefore

H(θ|X) ≤ H(θ)

I(θ,X) = H(θ) − H(θ|X)

By using Bayes' theorem we reduce the uncertainty about the class; I(θ,X) is the information that the data brings about the class.
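A minimal sketch with assumed numbers: a binary class θ and a binary observation X; Bayes' theorem gives the posterior, and on average conditioning on X reduces the entropy of θ by I(θ,X):

```python
import math

prior = {0: 0.5, 1: 0.5}                     # p(theta), assumed
likelihood = {0: {0: 0.9, 1: 0.1},           # p(x | theta = 0), assumed
              1: {0: 0.3, 1: 0.7}}           # p(x | theta = 1), assumed

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# evidence p(x) and posterior p(theta | x) via Bayes' theorem
px = {x: sum(prior[t] * likelihood[t][x] for t in prior) for x in (0, 1)}
posterior = {x: {t: prior[t] * likelihood[t][x] / px[x] for t in prior}
             for x in (0, 1)}

H_prior = H(prior)                                         # H(theta) = 1 bit
H_post = sum(px[x] * H(posterior[x]) for x in (0, 1))      # H(theta | X) ≈ 0.70 bit
print(H_prior, H_post, H_prior - H_post)                   # I(theta, X) ≈ 0.30 bit
```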


Page 24: Information Theory And Machine Learning


Four Applications of Information Theory (IT) to Machine Learning

An IT interpretation of maximum likelihood
An IT interpretation of Bayes' theorem
Using mutual information for feature selection
Using entropy for medical image alignment


Page 25: Information Theory And Machine Learning


Feature Selection For Information Retrieval

Feature selection, e.g., for text classification into categories:
  (in the training set) select terms occurring in the texts
  (in the test set / real text classification) use these terms to classify the text
  e.g., the word "britain" is a good feature for classifying texts into the class UK
  e.g., the word "viagra" is a good feature for classifying texts as spam


Page 26: Information Theory And Machine Learning


Feature Selection For Information Retrieval

A good feature
  tells a lot about the class (informative)
  does not overlap with other features (non-redundant)

A noisy feature
  increases the classification error
  e.g., through overfitting (e.g., all training-set texts about China happened to contain the word "hand")


Page 27: Information Theory And Machine Learning


Feature Selection For Information Retrieval

http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html
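A minimal sketch of mutual-information feature selection in the spirit of the linked chapter: U indicates presence of a term in a document and C the class, and the term is scored by I(U,C); the document counts below are hypothetical:

```python
import math

# n[(e_t, e_c)] = number of documents with term indicator e_t and class
# indicator e_c; hypothetical counts for one (term, class) pair
n = {(1, 1): 49, (1, 0): 141, (0, 1): 28, (0, 0): 782}

def mutual_information(n):
    total = sum(n.values())
    I = 0.0
    for (et, ec), count in n.items():
        if count == 0:
            continue
        p_joint = count / total
        p_t = sum(v for (a, _), v in n.items() if a == et) / total
        p_c = sum(v for (_, b), v in n.items() if b == ec) / total
        I += p_joint * math.log2(p_joint / (p_t * p_c))
    return I

print(mutual_information(n))   # rank terms by this value; larger = more informative
```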


Page 28: Information Theory And Machine Learning


Four Applications of Information Theory (IT) to Machine Learning

An IT interpretation of maximum likelihood
An IT interpretation of Bayes' theorem
Using mutual information for feature selection
Using entropy for medical image alignment


Page 29: Information Theory And Machine Learning


Medical Image Alignment

Data Driven Image Models through Continuous Joint Alignment, Erik G. Learned-Miller
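A minimal sketch of the entropy criterion behind joint alignment ("congealing"): sum, over pixel locations, the entropy of the values seen at that location across the image stack; identical (aligned) images give zero total entropy, while shifted copies do not. This is only an illustration of the criterion, not the paper's algorithm:

```python
import math

def binary_entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def pixel_stack_entropy(stack):
    """Sum over pixel locations of the entropy of that pixel across the stack.

    stack: list of equally sized binary images (lists of lists of 0/1)."""
    rows, cols = len(stack[0]), len(stack[0][0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            frac_ones = sum(img[r][c] for img in stack) / len(stack)
            total += binary_entropy(frac_ones)
    return total

aligned = [[[0, 1, 0], [0, 1, 0]]] * 3          # three identical images
shifted = [[[0, 1, 0], [0, 1, 0]],
           [[1, 0, 0], [1, 0, 0]],
           [[0, 0, 1], [0, 0, 1]]]              # the same stroke, shifted
print(pixel_stack_entropy(aligned), pixel_stack_entropy(shifted))   # 0.0 vs ≈ 5.51
```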
