Information Theory
CSSS, Carnegie Mellon University

- Entropy and Information: measuring randomness and dependence in bits
- Entropy and Ergodicity: dynamical systems as information sources; long-run randomness
- Information and Inference: the connection to statistics

Cover and Thomas (1991) is the best single book on information theory.

Entropy and Information
Entropy · Description Length · Multiple Variables and Mutual Information · Continuous Variables · Relative Entropy

Entropy

The most fundamental notion in information theory.
X = a discrete random variable, taking values from X.
The entropy of X is

    H[X] ≡ −∑_{x∈X} Pr(X = x) log₂ Pr(X = x)
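
As a concrete illustration (a minimal sketch assuming NumPy; the function name entropy_bits and the example distributions are invented here, not from the lecture), this computes H[X] in bits from a vector of probabilities:

    import numpy as np

    def entropy_bits(p):
        """Shannon entropy in bits of a discrete distribution p (probabilities summing to 1)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                        # 0 log 0 is taken to be 0
        return float(-np.sum(p * np.log2(p)))

    print(entropy_bits([0.5, 0.5]))         # 1.0 bit: a fair coin
    print(entropy_bits([0.9, 0.1]))         # about 0.469 bits: a biased coin
    print(entropy_bits([0.25] * 4))         # 2.0 bits: uniform on 4 values = log2(4)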

Proposition
H[X] ≥ 0, and = 0 only when Pr(X = x) = 1 for some x. (EXERCISE)

Proposition
H[X] is maximal when all values of X are equally probable, and then H[X] = log₂ #X. (EXERCISE)

Proposition
H[f(X)] ≤ H[X], with equality if and only if f is 1-1.

Interpretations

H[X] measures:
- how random X is
- how variable X is
- how uncertain we should be about X (the "paleface" problem; a consistent resolution leads to a completely subjective probability theory)

but the more fundamental interpretation is description length.

Description Length

H[X] = how concisely can we describe X?
Imagine X as a text message:
- in Reno
- in Reno send money
- in Reno divorce final
- marry me?
- in Reno send lawyers guns and money kthxbai

Known and finite number of possible messages (#X).
I know what X is but won't show it to you.
You can guess it by asking yes/no (binary) questions.

First goal: ask as few questions as possible
- Starting with "is it y?" is optimal iff X = y
- Can always achieve no worse than ≈ log₂ #X questions

New goal: minimize the mean number of questions
- Ask about more probable messages first
- Still takes ≈ −log₂ Pr(X = x) questions to reach x
- Mean is then H[X]
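
A small numeric sketch of the questioning strategy (NumPy assumed; the message distribution is made up): give each message a codeword of about −log₂ Pr(X = x) yes/no questions, and compare the mean number of questions with H[X].

    import numpy as np

    p = np.array([0.5, 0.25, 0.125, 0.125])     # hypothetical message probabilities
    lengths = np.ceil(-np.log2(p))               # ≈ number of yes/no questions per message
    mean_questions = np.sum(p * lengths)
    entropy = -np.sum(p * np.log2(p))
    print(mean_questions, entropy)               # here both come out to 1.75 bits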

Theorem
H[X] is the minimum mean number of binary distinctions needed to describe X.

Units of H[X] are bits.

[Diagram: source → X → binary-coded message of length H[X] → receiver]

Multiple Variables — Joint Entropy

Joint entropy of two variables X and Y:

    H[X,Y] ≡ −∑_{x∈X} ∑_{y∈Y} Pr(X = x, Y = y) log₂ Pr(X = x, Y = y)

Entropy of the joint distribution.
This is the minimum mean length to describe both X and Y.

    H[X,Y] ≥ H[X]
    H[X,Y] ≥ H[Y]
    H[X,Y] ≤ H[X] + H[Y]
    H[f(X),X] = H[X]
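
A quick numeric check of these relations (NumPy assumed; the joint distribution is invented for illustration):

    import numpy as np

    def H(p):
        """Entropy in bits of the probabilities in array p (any shape)."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    pxy = np.array([[0.3, 0.1],      # rows: values of X, columns: values of Y
                    [0.2, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    print(H(pxy), H(px), H(py))                 # H[X,Y], H[X], H[Y]
    print(H(pxy) <= H(px) + H(py))              # True: joint never longer than separate descriptions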

Conditional Entropy

Entropy of the conditional distribution:

    H[X|Y = y] ≡ −∑_{x∈X} Pr(X = x | Y = y) log₂ Pr(X = x | Y = y)

Average over y:

    H[X|Y] ≡ ∑_{y∈Y} Pr(Y = y) H[X|Y = y]

On average, how many bits are needed to describe X, after Y is given?

    H[X|Y] = H[X,Y] − H[Y]

the "text completion" principle
Note: H[X|Y] ≠ H[Y|X], in general
Chain rule:

    H[X_1^n] = H[X_1] + ∑_{t=1}^{n−1} H[X_{t+1} | X_1^t]

Describe one variable, then describe the 2nd using the 1st, the 3rd using the first two, etc.
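
Continuing the same invented joint distribution (NumPy assumed), H[X|Y] can be computed either from the definition or as H[X,Y] − H[Y]; the two agree:

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    pxy = np.array([[0.3, 0.1],
                    [0.2, 0.4]])
    py = pxy.sum(axis=0)
    # Definition: average the entropy of the conditional distribution of X over values of Y
    H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
    print(H_X_given_Y, H(pxy) - H(py))           # the two computations agree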

Mutual Information

Mutual information between X and Y:

    I[X;Y] ≡ H[X] + H[Y] − H[X,Y]

How much shorter is the actual joint description than the sum of the individual descriptions?
Equivalently:

    I[X;Y] = H[X] − H[X|Y] = H[Y] − H[Y|X]

How much can I shorten my description of either variable by using the other?

    0 ≤ I[X;Y] ≤ min{H[X], H[Y]}

I[X;Y] = 0 if and only if X and Y are statistically independent.
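
Mutual information for the same toy joint distribution, plus a check that it vanishes for an independent one (NumPy assumed; numbers are illustrative):

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def mutual_information(pxy):
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        return H(px) + H(py) - H(pxy)

    pxy = np.array([[0.3, 0.1], [0.2, 0.4]])
    print(mutual_information(pxy))                   # about 0.125 bits of dependence
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    print(mutual_information(np.outer(px, py)))      # essentially 0 for the independent product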

[Diagram: source → X → encoder → channel (with noise) → decoder → Y → receiver]

How much can we learn about what was sent from what we receive? I[X;Y]

Historically, this is the origin of information theory: sending coded messages efficiently (Shannon, 1948).
Stephenson (1999) is a historical dramatization, with a silly late-1990s story tacked on.

- channel capacity C = max I[X;Y] as we change the distribution of X
- Any rate of information transfer < C can be achieved with arbitrarily small error rate, no matter what the noise
- No rate > C can be achieved without error
- C is also related to the value of information in gambling (Poundstone, 2005)
- This is not the only model of communication! (Sperber and Wilson, 1995, 1990)

Conditional Mutual Information

    I[X;Y|Z] = H[X|Z] + H[Y|Z] − H[X,Y|Z]

How much extra information do X and Y give about each other, over and above what's in Z?
X ⊥ Y | Z (conditional independence) if and only if I[X;Y|Z] = 0.
The Markov property is completely equivalent to

    I[X_{t+1}^∞ ; X_{−∞}^{t−1} | X_t] = 0

The Markov property is really about information flow.

What About Continuous Variables?

Differential entropy:

    H(X) ≡ −∫ dx p(x) log₂ p(x)

where p has to be the probability density function.
- H(X) < 0 is entirely possible
- Differential entropy varies under 1-1 maps (e.g. coordinate changes)
- Joint and conditional entropy definitions carry over
- Mutual information definition carries over
- MI is non-negative and invariant under 1-1 maps
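
For example, the differential entropy of a Gaussian is ½ log₂(2πeσ²) bits, which is negative once σ is small; a rough numerical sketch (NumPy assumed; the grid-based integrator is a crude Riemann sum, just for illustration):

    import numpy as np

    def differential_entropy_bits(pdf, grid):
        """Crude Riemann-sum approximation of -∫ p(x) log2 p(x) dx on an evenly spaced grid."""
        p = pdf(grid)
        dx = grid[1] - grid[0]
        safe = np.where(p > 0, p, 1.0)      # p log p -> 0 where p = 0
        return float(-np.sum(p * np.log2(safe)) * dx)

    for sigma in [1.0, 0.1]:
        gauss = lambda x, s=sigma: np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
        grid = np.linspace(-10 * sigma, 10 * sigma, 200001)
        exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
        print(sigma, differential_entropy_bits(gauss, grid), exact)   # negative for sigma = 0.1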

Relative Entropy

P, Q = two distributions on the same space X

    D(P‖Q) ≡ ∑_{x∈X} P(x) log₂ [P(x)/Q(x)]

Or, if X is continuous,

    D(P‖Q) ≡ ∫_X dx p(x) log₂ [p(x)/q(x)]

Or, if you like measure theory,

    D(P‖Q) ≡ ∫ dP(ω) log₂ (dP/dQ)(ω)

a.k.a. the Kullback-Leibler divergence.
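
A minimal sketch of the discrete formula (NumPy assumed; P and Q are invented), which also illustrates the asymmetry noted below:

    import numpy as np

    def kl_divergence(p, q):
        """D(P||Q) in bits for discrete distributions on the same finite space (assumes q > 0 wherever p > 0)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    P = [0.5, 0.4, 0.1]
    Q = [1/3, 1/3, 1/3]
    print(kl_divergence(P, Q), kl_divergence(Q, P))   # both positive, and not equal to each other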

Relative Entropy Properties

- D(P‖Q) ≥ 0, with equality if and only if P = Q
- D(P‖Q) ≠ D(Q‖P), in general
- D(P‖Q) = ∞ if Q gives probability zero to something with positive P probability (P not dominated by Q)
- Invariant under 1-1 maps

Joint and Conditional Relative Entropies

P, Q now distributions on X, Y

    D(P‖Q) = D(P(X)‖Q(X)) + D(P(Y|X)‖Q(Y|X))

where

    D(P(Y|X)‖Q(Y|X)) = ∑_x P(x) D(P(Y|X = x)‖Q(Y|X = x))
                     = ∑_x P(x) ∑_y P(y|x) log₂ [P(y|x)/Q(y|x)]

and so on for more than two variables.

Relative entropy can be the basic concept:

    H[X] = log₂ m − D(P‖U)

where m = #X, U = uniform distribution on X, P = distribution of X

    I[X;Y] = D(J‖P ⊗ Q)

where P = distribution of X, Q = distribution of Y, J = joint distribution.

Relative Entropy and Miscoding

Suppose the real distribution is P, but we think it's Q and use that for coding.
Our average code length (cross-entropy) is

    −∑_x P(x) log₂ Q(x)

But the optimum code length is

    −∑_x P(x) log₂ P(x)

The difference is the relative entropy.
Relative entropy is the extra description length from getting the distribution wrong.
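
Numerically (NumPy assumed; reusing the invented P and Q from the earlier sketch), the cross-entropy exceeds the entropy by exactly D(P‖Q):

    import numpy as np

    P = np.array([0.5, 0.4, 0.1])
    Q = np.array([1/3, 1/3, 1/3])
    cross_entropy = -np.sum(P * np.log2(Q))     # mean code length using the wrong code
    entropy = -np.sum(P * np.log2(P))           # mean code length using the right code
    kl = np.sum(P * np.log2(P / Q))
    print(cross_entropy - entropy, kl)          # both are D(P||Q), about 0.224 bits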

Basics: Summary

- Entropy = minimum mean description length; variability of the random quantity
- Mutual information = reduction in description length from using dependencies
- Relative entropy = excess description length from guessing the wrong distribution

Entropy and Ergodicity
Information Sources · Entropy Rates · Entropy Rates and Dynamics · Asymptotic Equipartition

Information Sources

X_1, X_2, ..., X_n, ... a sequence of random variables
X_s^t = (X_s, X_{s+1}, ..., X_{t−1}, X_t)
Any sort of random process will do:
- a sequence of messages
- successive outputs of a stochastic system
- need not be from a communication channel
- e.g., successive states of a dynamical system
- or coarse-grained observations of the dynamics

Definition (Strict or Strong Stationarity)

For any k > 0, T > 0, and all w ∈ X^k,

    Pr(X_1^k = w) = Pr(X_{1+T}^{k+T} = w)

i.e., the distribution is invariant over time.

Law of large numbers for stationary sequences

Theorem (Ergodic Theorem)
If X is stationary, then the empirical distribution converges,

    P̂_n → ρ

for some limit ρ, and for all nice functions f,

    (1/n) ∑_{t=1}^n f(X_t) → E_ρ[f(X)]

but ρ may be random and depend on the initial conditions: one ρ per attractor.

Entropy Rate

Entropy rate, a.k.a. Shannon entropy rate, a.k.a. metric entropy rate:

    h1 ≡ lim_{n→∞} H[X_n | X_1^{n−1}]

How many extra bits do we need to describe the next observation (in the limit)?

Theorem
h1 exists for any stationary process (and some others).

Examples of entropy rates
- IID: H[X_n | X_1^{n−1}] = H[X_1] = h1
- Markov: H[X_n | X_1^{n−1}] = H[X_n | X_{n−1}] = H[X_2 | X_1] = h1
- kth-order Markov: h1 = H[X_{k+1} | X_1^k]
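
A sketch of the Markov case (NumPy assumed; the transition matrix is made up): h1 = H[X_2|X_1], computed by averaging the entropy of the transition rows over the stationary distribution.

    import numpy as np

    T = np.array([[0.9, 0.1],        # hypothetical transition matrix, rows = current state
                  [0.4, 0.6]])
    # Stationary distribution: left eigenvector of T with eigenvalue 1
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    # h1 = H[X_{t+1} | X_t] = sum_i pi_i * H(row i)
    row_entropies = [-np.sum(row[row > 0] * np.log2(row[row > 0])) for row in T]
    h1 = float(np.dot(pi, row_entropies))
    print(pi, h1)    # stationary distribution [0.8, 0.2]; h1 about 0.57 bits per step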

Using the chain rule, we can re-write h1 as

    h1 = lim_{n→∞} (1/n) H[X_1^n]

description length per unit time.

Topological Entropy Rate

W_n ≡ number of allowed words of length n ≡ #{w ∈ X^n : Pr(X_1^n = w) > 0}
log₂ W_n ≡ topological entropy
topological entropy rate:

    h0 = lim_{n→∞} (1/n) log₂ W_n

H[X_1^n] = log₂ W_n if and only if each word is equally probable.
Otherwise H[X_1^n] < log₂ W_n.

Metric vs. Topological Entropy Rates

h0 = growth rate in # of allowed words, counting all equally
h1 = growth rate, counting more probable words more heavily (an effective number of words)
So:

    h0 ≥ h1

2^{h1} = effective # of choices of how to go on

KS Entropy Rate

h1 = growth rate of the mean description length of trajectories
Chaos needs h1 > 0.
Coarse-graining deterministic dynamics, each partition B has its own h1(B).
Kolmogorov-Sinai (KS) entropy rate:

    hKS = sup_B h1(B)

Theorem
If G is a generating partition, then hKS = h1(G).

hKS is the asymptotic randomness of the dynamical system,
or, the rate at which the symbol sequence provides new information about the initial condition.

Entropy Rate and Lyapunov Exponents

In general (Ruelle's inequality),

    hKS ≤ ∑_{i=1}^d λ_i 1{λ_i > 0}

i.e., the sum of the positive Lyapunov exponents.
If the invariant measure is smooth, this is an equality (Pesin's identity).

Asymptotic Equipartition Property

When n is large, for any word x_1^n, either

    Pr(X_1^n = x_1^n) ≈ 2^{−n h1}

or

    Pr(X_1^n = x_1^n) ≈ 0

More exactly, it's almost certain that

    −(1/n) log Pr(X_1^n) → h1

This is the entropy ergodic theorem, or Shannon-McMillan-Breiman theorem.
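
A simulation sketch of the AEP for an IID biased coin (NumPy assumed; parameters invented): the per-symbol log-probability of the observed sequence concentrates around the entropy rate.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.7                                        # Pr(X = 1) for a biased IID source
    H = -(p*np.log2(p) + (1-p)*np.log2(1-p))       # entropy rate h1 = H[X] for an IID source
    for n in [100, 10000]:
        x = rng.random(n) < p
        log_prob = np.sum(np.where(x, np.log2(p), np.log2(1-p)))
        print(n, -log_prob / n, H)                 # -(1/n) log2 Pr(X_1^n) approaches H ≈ 0.881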

Relative entropy version:

    −(1/n) log Q_θ(X_1^n) → h1 + d(P‖Q_θ)

where

    d(P‖Q_θ) = lim_{n→∞} (1/n) D(P(X_1^n)‖Q_θ(X_1^n))

The relative entropy AEP implies the entropy AEP.

Entropy and Ergodicity: Summary

- h1 is the growth rate of the entropy, or the number of choices made in continuing the trajectory
- Measures instability in dynamical systems
- Typical sequences have probabilities shrinking at the entropy rate

Relative Entropy and Statistics
Sampling and Large Deviations · Hypothesis Testing · Maximum Likelihood Estimation · Fisher Information and Estimation Uncertainty · Maximum Entropy: A Dead End · Minimum Description Length

Relative Entropy and Sampling; Large Deviations

X_1, X_2, ..., X_n all IID with distribution P
Empirical distribution ≡ P̂_n
Law of large numbers (LLN): P̂_n → P

Theorem (Sanov)

    −(1/n) log₂ Pr(P̂_n ∈ A) → min_{Q∈A} D(Q‖P)

or, for non-mathematicians,

    Pr(P̂_n ≈ Q) ≈ 2^{−n D(Q‖P)}
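
A rough check for coin-flipping (NumPy assumed; the helper functions are mine, not from the lecture): the exact probability that a fair coin shows at least 70% heads decays with an exponent approaching the binary relative entropy D(0.7‖0.5).

    import numpy as np
    from math import lgamma, log

    def log2_binom_tail(n, k, p):
        """log2 Pr(Binomial(n, p) >= k), from exact log pmf terms."""
        js = np.arange(k, n + 1)
        log_pmf = (np.array([lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1) for j in js])
                   + js * log(p) + (n - js) * log(1 - p)) / log(2)
        m = log_pmf.max()
        return float(m + np.log2(np.sum(2.0 ** (log_pmf - m))))

    def binary_kl(q, p):
        """D(Bernoulli(q) || Bernoulli(p)) in bits."""
        return q * np.log2(q / p) + (1 - q) * np.log2((1 - q) / (1 - p))

    p, q = 0.5, 0.7
    for n in [50, 500, 5000]:
        exponent = -log2_binom_tail(n, int(np.ceil(q * n)), p) / n
        print(n, exponent, binary_kl(q, p))   # the decay exponent approaches D(q||p) ≈ 0.119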

Sanov's theorem is part of the general theory of large deviations:
- Pr(fluctuations away from the law of large numbers) → 0 exponentially in n
- the rate function is generally a relative entropy

More on large deviations: Bucklew (1990); den Hollander (2000).
The LDP explains statistical mechanics; see Touchette (2008), or talk to Eric Smith.

Relative Entropy and Hypothesis Testing

Testing P vs. Q
The optimal error rate (the chance of guessing Q when really P) goes like

    Pr(error) ≈ 2^{−n D(Q‖P)}

(for dependent data, substitute the sum of conditional relative entropies for nD)
More exactly:

    (1/n) log₂ Pr(error) → −D(Q‖P)

(for dependent data, substitute the conditional relative entropy rate for D)
The bigger D(Q‖P), the easier it is to test which is right.

Method of Maximum Likelihood

Fisher (1922)
Data = X, with true distribution = P
Model distributions = Q_θ, θ = parameter
Look for the Q_θ which best describes the data.
The likelihood at θ is the probability of generating the data, Q_θ(x).
Estimate θ by maximizing the likelihood, equivalently the log-likelihood L(θ) ≡ log Q_θ(x):

    θ̂ ≡ argmax_θ L(θ) = argmax_θ ∑_{t=1}^n log Q_θ(x_t | x_1^{t−1})
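
A minimal sketch of maximum likelihood in practice (NumPy assumed; the exponential model and simulated data are invented for illustration): a grid maximization of the log-likelihood recovers the closed-form MLE, the reciprocal of the sample mean.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=2.0, size=1000)      # data with true rate 1/2

    def log_likelihood(lam, x):
        """Log-likelihood of IID Exponential(rate lam): sum over t of log(lam) - lam*x_t."""
        return np.sum(np.log(lam) - lam * x)

    grid = np.linspace(0.01, 2.0, 2000)
    values = np.array([log_likelihood(lam, x) for lam in grid])
    print(grid[np.argmax(values)], 1 / np.mean(x))  # grid maximizer matches the closed form 1/mean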

Maximum likelihood and relative entropy

Suppose we want the Q_θ which will best describe new data.
The optimal parameter value is

    θ* = argmin_θ D(P‖Q_θ)

If P = Q_{θ0} for some θ0, then θ* = θ0 (the true parameter value).
Otherwise θ* is the pseudo-true parameter value.

    θ* = argmin_θ ∑_x P(x) log₂ [P(x)/Q_θ(x)]
       = argmin_θ ∑_x [P(x) log₂ P(x) − P(x) log₂ Q_θ(x)]
       = argmin_θ [−H_P[X] − ∑_x P(x) log₂ Q_θ(x)]
       = argmin_θ −∑_x P(x) log₂ Q_θ(x)
       = argmax_θ ∑_x P(x) log₂ Q_θ(x)

This is the expected log-likelihood.

We don't know P, but we do have P̂_n.
For the IID case,

    θ̂ = argmax_θ ∑_{t=1}^n log Q_θ(x_t)
       = argmax_θ (1/n) ∑_{t=1}^n log₂ Q_θ(x_t)
       = argmax_θ ∑_x P̂_n(x) log₂ Q_θ(x)

So θ̂ comes from approximating P by P̂_n.
θ̂ → θ* because P̂_n → P.
The non-IID case (e.g. Markov) is similar, with more notation.
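
A sketch of the empirical-distribution view for a Bernoulli model (NumPy assumed; data simulated): maximizing ∑_x P̂_n(x) log₂ Q_θ(x) over θ lands (to grid precision) on the sample frequency, which is also the ordinary MLE.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.random(2000) < 0.3                     # IID Bernoulli data, true Pr(X=1) = 0.3
    p_hat = np.array([1 - x.mean(), x.mean()])     # empirical distribution on {0, 1}

    thetas = np.linspace(0.001, 0.999, 999)        # candidate Q_theta: Pr(X=1) = theta
    # Expected log-likelihood under the empirical distribution, one value per theta
    ell = p_hat[0] * np.log2(1 - thetas) + p_hat[1] * np.log2(thetas)
    print(thetas[np.argmax(ell)], x.mean())        # maximizer is the sample frequency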

Relative Entropy and Log Likelihood

In general:
−H[X] − D(P‖Q) = expected log-likelihood of Q
−H[X] = optimal expected log-likelihood (ideal model)

Why Maximum Likelihood?

1. The inherent compelling rightness of the optimization principle (a bad answer)
2. Generally consistent: θ̂ converges on the optimal value (as we just saw)
3. Generally efficient: converges faster than other consistent estimators

(2) and (3) are really theorems of probability theory.
Let's look a bit more at efficiency.

Fisher Information

Fisher: Taylor-expand L to second order around the maximum.
Fisher information matrix:

    F_{uv}(θ0) ≡ −E_{θ0}[ ∂² log Q_θ(X) / ∂θ_u ∂θ_v |_{θ=θ0} ]

F ∝ n (for IID, Markov, etc.)
Variance of θ̂ = F^{−1} (under some regularity conditions)
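
A sketch for the Bernoulli model (NumPy assumed; parameters invented; natural-log convention for the likelihood): n coin flips carry Fisher information n/(θ(1−θ)), and the sampling variance of the MLE θ̂ is close to its inverse.

    import numpy as np

    rng = np.random.default_rng(4)
    theta, n, reps = 0.3, 500, 20000
    fisher = n / (theta * (1 - theta))             # Fisher information of n Bernoulli(theta) draws
    theta_hats = rng.binomial(n, theta, size=reps) / n
    print(np.var(theta_hats), 1 / fisher)          # both about 4.2e-4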

The Information Bound

Theorem (Cramér-Rao)
F^{−1} is the minimum variance for any unbiased estimator

because the uncertainty in θ̂ depends on the curvature at the maximum;
this leads to a whole information geometry, with F as the metric tensor (Amari et al., 1987; Kass and Vos, 1997; Kulhavý, 1996; Amari and Nagaoka, 1993/2000).
Relative Entropy and Fisher Information
\[
F_{uv}(\theta_0) \equiv -\mathbf{E}_{\theta_0}\!\left[\left.\frac{\partial^2 \log Q_\theta(X)}{\partial\theta_u\,\partial\theta_v}\right|_{\theta=\theta_0}\right]
= \left.\frac{\partial^2}{\partial\theta_u\,\partial\theta_v} D(Q_{\theta_0}\|Q_\theta)\right|_{\theta=\theta_0}
\]
Fisher information is how quickly the relative entropy grows with small changes in the parameters:
\[
D(\theta_0\|\theta_0+\epsilon) \approx \tfrac{1}{2}\,\epsilon^T F \epsilon + O(\|\epsilon\|^3)
\]
Intuition: "easy to estimate" = "easy to reject sub-optimal values"
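A rough numerical confirmation of the quadratic growth, using Bernoulli distributions (my sketch; the example is not from the slides):

```python
# Sketch: D(theta_0 || theta_0 + eps) for Bernoulli distributions (in nats)
# versus the quadratic approximation (1/2)*F*eps^2, F = 1/(theta*(1-theta)).
import numpy as np

def kl_bernoulli(p, q):
    """Relative entropy D(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0 = 0.3
F = 1.0 / (theta0 * (1 - theta0))
for eps in (0.1, 0.01, 0.001):
    print(f"eps={eps:<6}  D={kl_bernoulli(theta0, theta0 + eps):.3e}"
          f"  (1/2) F eps^2={0.5 * F * eps**2:.3e}")
```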
Maximum Entropy: A Dead End
Given constraints on expectation values of functions, E[g₁(X)] = c₁, E[g₂(X)] = c₂, ..., E[g_q(X)] = c_q,
\[
\tilde{P}_{\mathrm{ME}} \equiv \operatorname*{argmax}_{P} \left\{ H[P] : \forall i,\ \mathbf{E}_P[g_i(X)] = c_i \right\}
= \operatorname*{argmax}_{P}\ H[P] - \sum_{i=1}^{q} \lambda_i\left(\mathbf{E}_P[g_i(X)] - c_i\right)
\]
with Lagrange multipliers λᵢ chosen to enforce the constraints
Solution: Exponential Families
Generic solution:
\[
P(x) = \frac{e^{-\sum_{i=1}^{q}\beta_i g_i(x)}}{\int dx\, e^{-\sum_{i=1}^{q}\beta_i g_i(x)}}
     = \frac{e^{-\sum_{i=1}^{q}\beta_i g_i(x)}}{Z(\beta_1,\beta_2,\ldots,\beta_q)}
\]
again β enforces the constraints
Physics: canonical ensemble, with extensive variables gᵢ and intensive variables βᵢ
Statistics: exponential family, with sufficient statistics gᵢ and natural parameters βᵢ
If we take this family of distributions as basic, the MLE is the β such that E[gᵢ(X)] = gᵢ(x), i.e., model mean = observed value
Best discussion of the connection is still Mandelbrot (1962)
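As a concrete sketch (mine, not the slides'), here is the maximum-entropy distribution on a small finite alphabet under a single mean constraint; the multiplier β is found numerically so that the exponential-family form matches the constraint, using scipy's brentq root-finder as one convenient choice.

```python
# Sketch: maximum entropy on the alphabet {0,...,5} subject to E[g(X)] = c with
# g(x) = x.  The solution is P(x) = exp(-beta*x)/Z(beta); the multiplier beta
# is chosen numerically so that the model mean matches the constraint.
import numpy as np
from scipy.optimize import brentq

xs = np.arange(6.0)        # the alphabet
c = 1.5                    # required value of E[g(X)], with g(x) = x

def model_mean(beta):
    w = np.exp(-beta * xs)
    return (w / w.sum()) @ xs

beta = brentq(lambda b: model_mean(b) - c, -10.0, 10.0)  # solve E_beta[X] = c
p = np.exp(-beta * xs)
p /= p.sum()

print("beta =", beta)
print("P(x) =", np.round(p, 4), "  E[X] =", p @ xs)
print("entropy =", -(p * np.log2(p)).sum(), "bits")
```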
The Method of Maximum Entropy
Calculate sample statistics gᵢ(x)
Assume that the distribution of X is the one which maximizes the entropy under those constraints
i.e., the MLE in the exponential family with those sufficient statistics
Refinement: minimum relative entropy, i.e. minimize the divergence from a reference distribution; this also leads to an exponential family, but with a prefactor of the base density
Update distributions under new data by minimizing relative entropy
Often said to be the "least biased" estimate of P, or the one which makes the "fewest assumptions"
About MaxEnt
MaxEnt has lots of devotees who basically think it's the answer to everything
And it sometimes works, because
1. Exponential families often decent approximations, MLE is cool; but not everything is an exponential family
2. Conditional large deviations principle (Csiszár, 1995): if P̂ is constrained to lie in a convex set A, then
\[
-\frac{1}{n}\log \Pr\!\left(\hat{P}\in B \,\middle|\, \hat{P}\in A\right) \to \inf_{Q\in B\cap A} D(Q\|P) - D(Q\|A)
\]
so P̂ is exponentially close to argmin_{Q∈A} D(Q‖P), but the conditional LDP doesn't always hold
Updating by minimizing relative entropy can disagree with Bayes's rule (Seidenfeld, 1979, 1987; Grünwald and Halpern, 2003), contra claims by physicists
The "constraint rule" is certainly not required by logic or probability (Seidenfeld, 1979, 1987; Uffink, 1995, 1996)
MaxEnt (or MinRelEnt) is not the best rule for coming up with a prior distribution to use with Bayesian updating; all such rules suck (Kass and Wasserman, 1996)
Minimum Description Length Inference
Rissanen (1978, 1989)
Choose a model to concisely describe the data
maximum likelihood minimizes the description length of the data
. . . but you need to describe the model as well!
Two-part MDL:
\[
D_2(x,\theta,\Theta) = -\log_2 Q_\theta(x) + C(\theta,\Theta)
\]
\[
\hat{\theta}_{\mathrm{MDL}} = \operatorname*{argmin}_{\theta\in\Theta} D_2(x,\theta,\Theta), \qquad
D_2(x,\Theta) = D_2(x,\hat{\theta}_{\mathrm{MDL}},\Theta)
\]
where C is a coding scheme for the parameters
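A toy two-part-MDL calculation (my construction; the coding scheme C here is a deliberately crude, hypothetical choice): describe a binary string either with the fair-coin model, which has no parameters to encode, or with a Bernoulli(θ) model whose parameter is quantized to k bits, so C(θ, Θ) = k; the shorter total description wins.

```python
# Sketch with a deliberately crude (hypothetical) coding scheme: describe a
# binary string with either the fair-coin model (nothing to encode beyond the
# data) or a Bernoulli(theta) model whose parameter is quantized to k bits, so
# C(theta, Theta) = k.  Two-part MDL picks the shorter total description.
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=200)            # data from a biased coin
n, ones = x.size, int(x.sum())

def data_bits(theta):
    """-log2 likelihood of the sample under Bernoulli(theta)."""
    return -(ones * np.log2(theta) + (n - ones) * np.log2(1 - theta))

dl_fair = data_bits(0.5)                      # model 0: theta fixed at 1/2

k = 8                                         # model 1: theta on a 2^k-point grid
grid = (np.arange(2**k) + 0.5) / 2**k
dl_biased = k + min(data_bits(t) for t in grid)

print("fair coin  :", round(dl_fair, 1), "bits")
print("biased coin:", round(dl_biased, 1), "bits  (k =", k, "parameter bits)")
```

With these settings the biased-coin description typically comes out noticeably shorter: the k extra parameter bits are repaid by the shorter encoding of the data.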
Must fix the coding scheme before seeing the data (EXERCISE: why?)
By the AEP,
\[
n^{-1} D_2 \to h_1 + \min_{\theta\in\Theta} d(P\|Q_\theta)
\]
still, for finite n the coding scheme matters
(One-part MDL exists but would take too long)
Why Use MDL?
1. The inherent compelling rightness of the optimization principle
2. Good properties: for reasonable sources, if the parametric complexity
\[
\mathrm{COMP}(\Theta) = \log \sum_{w\in\mathcal{X}^n} \max_{\theta\in\Theta} Q_\theta(w)
\]
is small (if there aren't all that many words which get high likelihoods), then if MDL did well in-sample, it will generalize well to new data from the same source (a simple case is computed below)
See Grünwald (2005, 2007) for much more
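For the Bernoulli model class this sum can be computed exactly by grouping strings by their number of ones (again my own sketch; base-2 logs are a choice, not the slides'); COMP grows only like (1/2) log₂ n, which is the sense in which there aren't many words with high likelihoods.

```python
# Sketch (base-2 logs are my choice): parametric complexity of the Bernoulli
# class for length-n binary strings, COMP = log2 sum_w max_theta Q_theta(w).
# Grouping strings by their number of ones k, the maximizer is theta = k/n, so
# COMP = log2 sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k); it grows like (1/2) log2 n.
import numpy as np
from scipy.special import gammaln

def comp_bernoulli(n):
    k = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        # the 0*log(0) terms (k = 0 and k = n) are set to 0 via nan_to_num
        log_maxlik = (np.nan_to_num(k * np.log(k / n))
                      + np.nan_to_num((n - k) * np.log((n - k) / n)))
    terms = log_binom + log_maxlik
    m = terms.max()
    return (m + np.log(np.exp(terms - m).sum())) / np.log(2)

for n in (10, 100, 1000):
    print(f"n = {n:5d}:  COMP = {comp_bernoulli(n):.2f} bits")
```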
Information and Statistics: Summary
Relative entropy controls large deviations
Relative entropy = ease of discriminating distributions
Easy discrimination ⇒ good estimation
Large deviations explains why MaxEnt works when it does
References
Amari, Shun-ichi, O. E. Barndorff-Nielsen, Robert E. Kass, Steffen L. Lauritzen and C. R. Rao (1987). Differential Geometry in Statistical Inference, vol. 10 of Institute of Mathematical Statistics Lecture Notes-Monograph Series. Hayward, California: Institute of Mathematical Statistics. URL http://projecteuclid.org/euclid.lnms/1215467056.

Amari, Shun-ichi and Hiroshi Nagaoka (1993/2000). Methods of Information Geometry. Providence, Rhode Island: American Mathematical Society. Translated by Daishi Harada from Joho Kika no Hoho, Tokyo: Iwanami Shoten Publishers.

Bucklew, James A. (1990). Large Deviation Techniques in Decision, Simulation, and Estimation. New York: Wiley-Interscience.
Cover, Thomas M. and Joy A. Thomas (1991). Elements of Information Theory. New York: Wiley.

Csiszár, Imre (1995). “Maxent, Mathematics, and Information Theory.” In Maximum Entropy and Bayesian Methods: Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods (Kenneth M. Hanson and Richard N. Silver, eds.), pp. 35–50. Dordrecht: Kluwer Academic.

den Hollander, Frank (2000). Large Deviations. Providence, Rhode Island: American Mathematical Society.

Fisher, R. A. (1922). “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A, 222: 309–368. URL http://digital.library.adelaide.edu.au/coll/special/fisher/stat_math.html.
Grünwald, Peter (2005). “A Tutorial Introduction to the Minimum Description Length Principle.” In Advances in Minimum Description Length: Theory and Applications (P. Grünwald, I. J. Myung and M. Pitt, eds.). Cambridge, Massachusetts: MIT Press. URL http://arxiv.org/abs/math.ST/0406077.

Grünwald, Peter D. (2007). The Minimum Description Length Principle. Cambridge, Massachusetts: MIT Press.

Grünwald, Peter D. and Joseph Y. Halpern (2003). “Updating Probabilities.” Journal of Artificial Intelligence Research, 19: 243–278. doi:10.1613/jair.1164.

Kass, Robert E. and Paul W. Vos (1997). Geometrical Foundations of Asymptotic Inference. New York: Wiley.
Kass, Robert E. and Larry Wasserman (1996). “The Selection of Prior Distributions by Formal Rules.” Journal of the American Statistical Association, 91: 1343–1370. URL http://www.stat.cmu.edu/~kass/papers/rules.pdf.
Kulhavý, Rudolf (1996). Recursive Nonlinear Estimation: A Geometric Approach, vol. 216 of Lecture Notes in Control and Information Sciences. Berlin: Springer-Verlag.

Mandelbrot, Benoit (1962). “The Role of Sufficiency and of Estimation in Thermodynamics.” Annals of Mathematical Statistics, 33: 1021–1038. URL http://projecteuclid.org/euclid.aoms/1177704470.

Poundstone, William (2005). Fortune’s Formula: The Untold Story of the Scientific Betting Systems That Beat the Casinos and Wall Street. New York: Hill and Wang.

Rissanen, Jorma (1978). “Modeling by Shortest Data Description.” Automatica, 14: 465–471.
— (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Seidenfeld, Teddy (1979). “Why I Am Not an Objective Bayesian: Some Reflections Prompted by Rosenkrantz.” Theory and Decision, 11: 413–440. URL http://www.hss.cmu.edu/philosophy/seidenfeld/relating%20to%20other%20probability%20and%20statistical%20issues/Why%20I%20Am%20Not%20an%20Objective%20B.pdf.
— (1987). “Entropy and Uncertainty.” In Foundations of Statistical Inference (I. B. MacNeill and G. J. Umphrey, eds.), pp. 259–287. Dordrecht: D. Reidel. URL http://www.hss.cmu.edu/philosophy/seidenfeld/relating%20to%20other%20probability%20and%20statistical%20issues/Entropy%20and%20Uncertainty%20(revised).pdf.
Shannon, Claude E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27: 379–423. URL http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html. Reprinted in Shannon and Weaver (1963).

Shannon, Claude E. and Warren Weaver (1963). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Sperber, Dan and Deirdre Wilson (1990). “Rhetoric and Relevance.” In The Ends of Rhetoric: History, Theory, Practice (David Wellbery and John Bender, eds.), pp. 140–155. Stanford: Stanford University Press. URL http://dan.sperber.com/rhetoric.htm.
— (1995). Relevance: Cognition and Communication. Oxford: Basil Blackwell, 2nd edn.

Stephenson, Neal (1999). Cryptonomicon. New York: Avon Books.

Touchette, Hugo (2008). “The Large Deviations Approach to Statistical Mechanics.” E-print, arxiv.org. URL http://arxiv.org/abs/0804.0327.

Uffink, Jos (1995). “Can the Maximum Entropy Principle be Explained as a Consistency Requirement?” Studies in History and Philosophy of Modern Physics, 26B: 223–261. URL http://www.phys.uu.nl/~wwwgrnsl/jos/mepabst/mepabst.html.
— (1996). “The Constraint Rule of the Maximum Entropy Principle.” Studies in History and Philosophy of Modern Physics, 27: 47–79. URL http://www.phys.uu.nl/~wwwgrnsl/jos/mep2def/mep2def.html.