Information Theory
CSSS, Carnegie Mellon University

- Entropy and Information: measuring randomness and dependence in bits
- Entropy and Ergodicity: dynamical systems as information sources; long-run randomness
- Information and Inference: the connection to statistics

Cover and Thomas (1991) is the best single book on information theory.

Entropy and Information
Entropy · Description Length · Multiple Variables and Mutual Information · Continuous Variables · Relative Entropy

Entropy

The most fundamental notion in information theory.
X = a discrete random variable, taking values from X.
The entropy of X is

    H[X] ≡ −∑_{x∈X} Pr(X = x) log₂ Pr(X = x)
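
As a concrete illustration (a minimal sketch assuming NumPy; the function name entropy_bits and the example distributions are invented here, not from the lecture), this computes H[X] in bits from a vector of probabilities:

    import numpy as np

    def entropy_bits(p):
        """Shannon entropy in bits of a discrete distribution p (probabilities summing to 1)."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                        # 0 log 0 is taken to be 0
        return float(-np.sum(p * np.log2(p)))

    print(entropy_bits([0.5, 0.5]))         # 1.0 bit: a fair coin
    print(entropy_bits([0.9, 0.1]))         # about 0.469 bits: a biased coin
    print(entropy_bits([0.25] * 4))         # 2.0 bits: uniform on 4 values = log2(4)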

Proposition
H[X] ≥ 0, and = 0 only when Pr(X = x) = 1 for some x. (EXERCISE)

Proposition
H[X] is maximal when all values of X are equally probable, and then H[X] = log₂ #X. (EXERCISE)

Proposition
H[f(X)] ≤ H[X], with equality if and only if f is 1-1.

Interpretations

H[X] measures:
- how random X is
- how variable X is
- how uncertain we should be about X (the "paleface" problem; a consistent resolution leads to a completely subjective probability theory)

but the more fundamental interpretation is description length.

Description Length

H[X] = how concisely can we describe X?
Imagine X as a text message:
- in Reno
- in Reno send money
- in Reno divorce final
- marry me?
- in Reno send lawyers guns and money kthxbai

Known and finite number of possible messages (#X).
I know what X is but won't show it to you.
You can guess it by asking yes/no (binary) questions.

First goal: ask as few questions as possible
- Starting with "is it y?" is optimal iff X = y
- Can always achieve no worse than ≈ log₂ #X questions

New goal: minimize the mean number of questions
- Ask about more probable messages first
- Still takes ≈ −log₂ Pr(X = x) questions to reach x
- Mean is then H[X]
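
A small numeric sketch of the questioning strategy (NumPy assumed; the message distribution is made up): give each message a codeword of about −log₂ Pr(X = x) yes/no questions, and compare the mean number of questions with H[X].

    import numpy as np

    p = np.array([0.5, 0.25, 0.125, 0.125])     # hypothetical message probabilities
    lengths = np.ceil(-np.log2(p))               # ≈ number of yes/no questions per message
    mean_questions = np.sum(p * lengths)
    entropy = -np.sum(p * np.log2(p))
    print(mean_questions, entropy)               # here both come out to 1.75 bits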

Theorem
H[X] is the minimum mean number of binary distinctions needed to describe X.

Units of H[X] are bits.

[Diagram: source → X → binary-coded message of length H[X] → receiver]

Multiple Variables — Joint Entropy

Joint entropy of two variables X and Y:

    H[X,Y] ≡ −∑_{x∈X} ∑_{y∈Y} Pr(X = x, Y = y) log₂ Pr(X = x, Y = y)

Entropy of the joint distribution.
This is the minimum mean length to describe both X and Y.

    H[X,Y] ≥ H[X]
    H[X,Y] ≥ H[Y]
    H[X,Y] ≤ H[X] + H[Y]
    H[f(X),X] = H[X]
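
A quick numeric check of these relations (NumPy assumed; the joint distribution is invented for illustration):

    import numpy as np

    def H(p):
        """Entropy in bits of the probabilities in array p (any shape)."""
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    pxy = np.array([[0.3, 0.1],      # rows: values of X, columns: values of Y
                    [0.2, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    print(H(pxy), H(px), H(py))                 # H[X,Y], H[X], H[Y]
    print(H(pxy) <= H(px) + H(py))              # True: joint never longer than separate descriptions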

Conditional Entropy

Entropy of the conditional distribution:

    H[X|Y = y] ≡ −∑_{x∈X} Pr(X = x | Y = y) log₂ Pr(X = x | Y = y)

Average over y:

    H[X|Y] ≡ ∑_{y∈Y} Pr(Y = y) H[X|Y = y]

On average, how many bits are needed to describe X, after Y is given?

    H[X|Y] = H[X,Y] − H[Y]

the "text completion" principle
Note: H[X|Y] ≠ H[Y|X], in general
Chain rule:

    H[X_1^n] = H[X_1] + ∑_{t=1}^{n−1} H[X_{t+1} | X_1^t]

Describe one variable, then describe the 2nd using the 1st, the 3rd using the first two, etc.
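
Continuing the same invented joint distribution (NumPy assumed), H[X|Y] can be computed either from the definition or as H[X,Y] − H[Y]; the two agree:

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    pxy = np.array([[0.3, 0.1],
                    [0.2, 0.4]])
    py = pxy.sum(axis=0)
    # Definition: average the entropy of the conditional distribution of X over values of Y
    H_X_given_Y = sum(py[j] * H(pxy[:, j] / py[j]) for j in range(len(py)))
    print(H_X_given_Y, H(pxy) - H(py))           # the two computations agree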

Mutual Information

Mutual information between X and Y:

    I[X;Y] ≡ H[X] + H[Y] − H[X,Y]

How much shorter is the actual joint description than the sum of the individual descriptions?
Equivalently:

    I[X;Y] = H[X] − H[X|Y] = H[Y] − H[Y|X]

How much can I shorten my description of either variable by using the other?

    0 ≤ I[X;Y] ≤ min{H[X], H[Y]}

I[X;Y] = 0 if and only if X and Y are statistically independent.
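
Mutual information for the same toy joint distribution, plus a check that it vanishes for an independent one (NumPy assumed; numbers are illustrative):

    import numpy as np

    def H(p):
        p = np.asarray(p, dtype=float).ravel()
        p = p[p > 0]
        return float(-np.sum(p * np.log2(p)))

    def mutual_information(pxy):
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        return H(px) + H(py) - H(pxy)

    pxy = np.array([[0.3, 0.1], [0.2, 0.4]])
    print(mutual_information(pxy))                   # about 0.125 bits of dependence
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    print(mutual_information(np.outer(px, py)))      # essentially 0 for the independent product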

[Diagram: source → X → encoder → channel (with noise) → decoder → Y → receiver]

How much can we learn about what was sent from what we receive? I[X;Y]

Historically, this is the origin of information theory: sending coded messages efficiently (Shannon, 1948).
Stephenson (1999) is a historical dramatization, with a silly late-1990s story tacked on.

- channel capacity C = max I[X;Y] as we change the distribution of X
- Any rate of information transfer < C can be achieved with arbitrarily small error rate, no matter what the noise
- No rate > C can be achieved without error
- C is also related to the value of information in gambling (Poundstone, 2005)
- This is not the only model of communication! (Sperber and Wilson, 1995, 1990)

Conditional Mutual Information

    I[X;Y|Z] = H[X|Z] + H[Y|Z] − H[X,Y|Z]

How much extra information do X and Y give about each other, over and above what's in Z?
X ⊥ Y | Z (conditional independence) if and only if I[X;Y|Z] = 0.
The Markov property is completely equivalent to

    I[X_{t+1}^∞ ; X_{−∞}^{t−1} | X_t] = 0

The Markov property is really about information flow.

What About Continuous Variables?

Differential entropy:

    H(X) ≡ −∫ dx p(x) log₂ p(x)

where p has to be the probability density function.
- H(X) < 0 is entirely possible
- Differential entropy varies under 1-1 maps (e.g. coordinate changes)
- Joint and conditional entropy definitions carry over
- Mutual information definition carries over
- MI is non-negative and invariant under 1-1 maps
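
For example, the differential entropy of a Gaussian is ½ log₂(2πeσ²) bits, which is negative once σ is small; a rough numerical sketch (NumPy assumed; the grid-based integrator is a crude Riemann sum, just for illustration):

    import numpy as np

    def differential_entropy_bits(pdf, grid):
        """Crude Riemann-sum approximation of -∫ p(x) log2 p(x) dx on an evenly spaced grid."""
        p = pdf(grid)
        dx = grid[1] - grid[0]
        safe = np.where(p > 0, p, 1.0)      # p log p -> 0 where p = 0
        return float(-np.sum(p * np.log2(safe)) * dx)

    for sigma in [1.0, 0.1]:
        gauss = lambda x, s=sigma: np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
        grid = np.linspace(-10 * sigma, 10 * sigma, 200001)
        exact = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)
        print(sigma, differential_entropy_bits(gauss, grid), exact)   # negative for sigma = 0.1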

Relative Entropy

P, Q = two distributions on the same space X

    D(P‖Q) ≡ ∑_{x∈X} P(x) log₂ [P(x)/Q(x)]

Or, if X is continuous,

    D(P‖Q) ≡ ∫_X dx p(x) log₂ [p(x)/q(x)]

Or, if you like measure theory,

    D(P‖Q) ≡ ∫ dP(ω) log₂ (dP/dQ)(ω)

a.k.a. the Kullback-Leibler divergence.
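
A minimal sketch of the discrete formula (NumPy assumed; P and Q are invented), which also illustrates the asymmetry noted below:

    import numpy as np

    def kl_divergence(p, q):
        """D(P||Q) in bits for discrete distributions on the same finite space (assumes q > 0 wherever p > 0)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

    P = [0.5, 0.4, 0.1]
    Q = [1/3, 1/3, 1/3]
    print(kl_divergence(P, Q), kl_divergence(Q, P))   # both positive, and not equal to each other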

Relative Entropy Properties

- D(P‖Q) ≥ 0, with equality if and only if P = Q
- D(P‖Q) ≠ D(Q‖P), in general
- D(P‖Q) = ∞ if Q gives probability zero to something with positive P probability (P not dominated by Q)
- Invariant under 1-1 maps

Joint and Conditional Relative Entropies

P, Q now distributions on X, Y

    D(P‖Q) = D(P(X)‖Q(X)) + D(P(Y|X)‖Q(Y|X))

where

    D(P(Y|X)‖Q(Y|X)) = ∑_x P(x) D(P(Y|X = x)‖Q(Y|X = x))
                     = ∑_x P(x) ∑_y P(y|x) log₂ [P(y|x)/Q(y|x)]

and so on for more than two variables.

Relative entropy can be the basic concept:

    H[X] = log₂ m − D(P‖U)

where m = #X, U = uniform distribution on X, P = distribution of X

    I[X;Y] = D(J‖P ⊗ Q)

where P = distribution of X, Q = distribution of Y, J = joint distribution.

Relative Entropy and Miscoding

Suppose the real distribution is P, but we think it's Q and use that for coding.
Our average code length (cross-entropy) is

    −∑_x P(x) log₂ Q(x)

But the optimum code length is

    −∑_x P(x) log₂ P(x)

The difference is the relative entropy.
Relative entropy is the extra description length from getting the distribution wrong.
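
Numerically (NumPy assumed; reusing the invented P and Q from the earlier sketch), the cross-entropy exceeds the entropy by exactly D(P‖Q):

    import numpy as np

    P = np.array([0.5, 0.4, 0.1])
    Q = np.array([1/3, 1/3, 1/3])
    cross_entropy = -np.sum(P * np.log2(Q))     # mean code length using the wrong code
    entropy = -np.sum(P * np.log2(P))           # mean code length using the right code
    kl = np.sum(P * np.log2(P / Q))
    print(cross_entropy - entropy, kl)          # both are D(P||Q), about 0.224 bits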

Basics: Summary

- Entropy = minimum mean description length; variability of the random quantity
- Mutual information = reduction in description length from using dependencies
- Relative entropy = excess description length from guessing the wrong distribution

Entropy and Ergodicity
Information Sources · Entropy Rates · Entropy Rates and Dynamics · Asymptotic Equipartition

Information Sources

X_1, X_2, ..., X_n, ... a sequence of random variables
X_s^t = (X_s, X_{s+1}, ..., X_{t−1}, X_t)
Any sort of random process will do:
- a sequence of messages
- successive outputs of a stochastic system
- need not be from a communication channel
- e.g., successive states of a dynamical system
- or coarse-grained observations of the dynamics

Definition (Strict or Strong Stationarity)

For any k > 0, T > 0, and all w ∈ X^k,

    Pr(X_1^k = w) = Pr(X_{1+T}^{k+T} = w)

i.e., the distribution is invariant over time.

Law of large numbers for stationary sequences

Theorem (Ergodic Theorem)
If X is stationary, then the empirical distribution converges,

    P̂_n → ρ

for some limit ρ, and for all nice functions f,

    (1/n) ∑_{t=1}^n f(X_t) → E_ρ[f(X)]

but ρ may be random and depend on the initial conditions: one ρ per attractor.

Entropy Rate

Entropy rate, a.k.a. Shannon entropy rate, a.k.a. metric entropy rate:

    h1 ≡ lim_{n→∞} H[X_n | X_1^{n−1}]

How many extra bits do we need to describe the next observation (in the limit)?

Theorem
h1 exists for any stationary process (and some others).

Examples of entropy rates
- IID: H[X_n | X_1^{n−1}] = H[X_1] = h1
- Markov: H[X_n | X_1^{n−1}] = H[X_n | X_{n−1}] = H[X_2 | X_1] = h1
- kth-order Markov: h1 = H[X_{k+1} | X_1^k]
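
A sketch of the Markov case (NumPy assumed; the transition matrix is made up): h1 = H[X_2|X_1], computed by averaging the entropy of the transition rows over the stationary distribution.

    import numpy as np

    T = np.array([[0.9, 0.1],        # hypothetical transition matrix, rows = current state
                  [0.4, 0.6]])
    # Stationary distribution: left eigenvector of T with eigenvalue 1
    evals, evecs = np.linalg.eig(T.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    # h1 = H[X_{t+1} | X_t] = sum_i pi_i * H(row i)
    row_entropies = [-np.sum(row[row > 0] * np.log2(row[row > 0])) for row in T]
    h1 = float(np.dot(pi, row_entropies))
    print(pi, h1)    # stationary distribution [0.8, 0.2]; h1 about 0.57 bits per step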

Using the chain rule, we can re-write h1 as

    h1 = lim_{n→∞} (1/n) H[X_1^n]

description length per unit time.

Topological Entropy Rate

W_n ≡ number of allowed words of length n ≡ #{w ∈ X^n : Pr(X_1^n = w) > 0}
log₂ W_n ≡ topological entropy
topological entropy rate:

    h0 = lim_{n→∞} (1/n) log₂ W_n

H[X_1^n] = log₂ W_n if and only if each word is equally probable.
Otherwise H[X_1^n] < log₂ W_n.

Metric vs. Topological Entropy Rates

h0 = growth rate in # of allowed words, counting all equally
h1 = growth rate, counting more probable words more heavily (an effective number of words)
So:

    h0 ≥ h1

2^{h1} = effective # of choices of how to go on

KS Entropy Rate

h1 = growth rate of the mean description length of trajectories
Chaos needs h1 > 0.
Coarse-graining deterministic dynamics, each partition B has its own h1(B).
Kolmogorov-Sinai (KS) entropy rate:

    hKS = sup_B h1(B)

Theorem
If G is a generating partition, then hKS = h1(G).

hKS is the asymptotic randomness of the dynamical system,
or, the rate at which the symbol sequence provides new information about the initial condition.

Entropy Rate and Lyapunov Exponents

In general (Ruelle's inequality),

    hKS ≤ ∑_{i=1}^d λ_i 1{λ_i > 0}

i.e., the sum of the positive Lyapunov exponents.
If the invariant measure is smooth, this is an equality (Pesin's identity).

Asymptotic Equipartition Property

When n is large, for any word x_1^n, either

    Pr(X_1^n = x_1^n) ≈ 2^{−n h1}

or

    Pr(X_1^n = x_1^n) ≈ 0

More exactly, it's almost certain that

    −(1/n) log Pr(X_1^n) → h1

This is the entropy ergodic theorem, or Shannon-McMillan-Breiman theorem.
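
A simulation sketch of the AEP for an IID biased coin (NumPy assumed; parameters invented): the per-symbol log-probability of the observed sequence concentrates around the entropy rate.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.7                                        # Pr(X = 1) for a biased IID source
    H = -(p*np.log2(p) + (1-p)*np.log2(1-p))       # entropy rate h1 = H[X] for an IID source
    for n in [100, 10000]:
        x = rng.random(n) < p
        log_prob = np.sum(np.where(x, np.log2(p), np.log2(1-p)))
        print(n, -log_prob / n, H)                 # -(1/n) log2 Pr(X_1^n) approaches H ≈ 0.881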

Relative entropy version:

    −(1/n) log Q_θ(X_1^n) → h1 + d(P‖Q_θ)

where

    d(P‖Q_θ) = lim_{n→∞} (1/n) D(P(X_1^n)‖Q_θ(X_1^n))

The relative entropy AEP implies the entropy AEP.

Entropy and Ergodicity: Summary

- h1 is the growth rate of the entropy, or the number of choices made in continuing the trajectory
- Measures instability in dynamical systems
- Typical sequences have probabilities shrinking at the entropy rate

Relative Entropy and Statistics
Sampling and Large Deviations · Hypothesis Testing · Maximum Likelihood Estimation · Fisher Information and Estimation Uncertainty · Maximum Entropy: A Dead End · Minimum Description Length

Relative Entropy and Sampling; Large Deviations

X_1, X_2, ..., X_n all IID with distribution P
Empirical distribution ≡ P̂_n
Law of large numbers (LLN): P̂_n → P

Theorem (Sanov)

    −(1/n) log₂ Pr(P̂_n ∈ A) → min_{Q∈A} D(Q‖P)

or, for non-mathematicians,

    Pr(P̂_n ≈ Q) ≈ 2^{−n D(Q‖P)}
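
A rough check for coin-flipping (NumPy assumed; the helper functions are mine, not from the lecture): the exact probability that a fair coin shows at least 70% heads decays with an exponent approaching the binary relative entropy D(0.7‖0.5).

    import numpy as np
    from math import lgamma, log

    def log2_binom_tail(n, k, p):
        """log2 Pr(Binomial(n, p) >= k), from exact log pmf terms."""
        js = np.arange(k, n + 1)
        log_pmf = (np.array([lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1) for j in js])
                   + js * log(p) + (n - js) * log(1 - p)) / log(2)
        m = log_pmf.max()
        return float(m + np.log2(np.sum(2.0 ** (log_pmf - m))))

    def binary_kl(q, p):
        """D(Bernoulli(q) || Bernoulli(p)) in bits."""
        return q * np.log2(q / p) + (1 - q) * np.log2((1 - q) / (1 - p))

    p, q = 0.5, 0.7
    for n in [50, 500, 5000]:
        exponent = -log2_binom_tail(n, int(np.ceil(q * n)), p) / n
        print(n, exponent, binary_kl(q, p))   # the decay exponent approaches D(q||p) ≈ 0.119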

Sanov's theorem is part of the general theory of large deviations:
- Pr(fluctuations away from the law of large numbers) → 0 exponentially in n
- the rate function is generally a relative entropy

More on large deviations: Bucklew (1990); den Hollander (2000).
The LDP explains statistical mechanics; see Touchette (2008), or talk to Eric Smith.

Relative Entropy and Hypothesis Testing

Testing P vs. Q
The optimal error rate (the chance of guessing Q when really P) goes like

    Pr(error) ≈ 2^{−n D(Q‖P)}

(for dependent data, substitute the sum of conditional relative entropies for nD)
More exactly:

    (1/n) log₂ Pr(error) → −D(Q‖P)

(for dependent data, substitute the conditional relative entropy rate for D)
The bigger D(Q‖P), the easier it is to test which is right.

Method of Maximum Likelihood

Fisher (1922)
Data = X, with true distribution = P
Model distributions = Q_θ, θ = parameter
Look for the Q_θ which best describes the data.
The likelihood at θ is the probability of generating the data, Q_θ(x).
Estimate θ by maximizing the likelihood, equivalently the log-likelihood L(θ) ≡ log Q_θ(x):

    θ̂ ≡ argmax_θ L(θ) = argmax_θ ∑_{t=1}^n log Q_θ(x_t | x_1^{t−1})
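
A minimal sketch of maximum likelihood in practice (NumPy assumed; the exponential model and simulated data are invented for illustration): a grid maximization of the log-likelihood recovers the closed-form MLE, the reciprocal of the sample mean.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=2.0, size=1000)      # data with true rate 1/2

    def log_likelihood(lam, x):
        """Log-likelihood of IID Exponential(rate lam): sum over t of log(lam) - lam*x_t."""
        return np.sum(np.log(lam) - lam * x)

    grid = np.linspace(0.01, 2.0, 2000)
    values = np.array([log_likelihood(lam, x) for lam in grid])
    print(grid[np.argmax(values)], 1 / np.mean(x))  # grid maximizer matches the closed form 1/mean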

Maximum likelihood and relative entropy

Suppose we want the Q_θ which will best describe new data.
The optimal parameter value is

    θ* = argmin_θ D(P‖Q_θ)

If P = Q_{θ0} for some θ0, then θ* = θ0 (the true parameter value).
Otherwise θ* is the pseudo-true parameter value.

    θ* = argmin_θ ∑_x P(x) log₂ [P(x)/Q_θ(x)]
       = argmin_θ ∑_x [P(x) log₂ P(x) − P(x) log₂ Q_θ(x)]
       = argmin_θ [−H_P[X] − ∑_x P(x) log₂ Q_θ(x)]
       = argmin_θ −∑_x P(x) log₂ Q_θ(x)
       = argmax_θ ∑_x P(x) log₂ Q_θ(x)

This is the expected log-likelihood.

We don't know P, but we do have P̂_n.
For the IID case,

    θ̂ = argmax_θ ∑_{t=1}^n log Q_θ(x_t)
       = argmax_θ (1/n) ∑_{t=1}^n log₂ Q_θ(x_t)
       = argmax_θ ∑_x P̂_n(x) log₂ Q_θ(x)

So θ̂ comes from approximating P by P̂_n.
θ̂ → θ* because P̂_n → P.
The non-IID case (e.g. Markov) is similar, with more notation.
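
A sketch of the empirical-distribution view for a Bernoulli model (NumPy assumed; data simulated): maximizing ∑_x P̂_n(x) log₂ Q_θ(x) over θ lands (to grid precision) on the sample frequency, which is also the ordinary MLE.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.random(2000) < 0.3                     # IID Bernoulli data, true Pr(X=1) = 0.3
    p_hat = np.array([1 - x.mean(), x.mean()])     # empirical distribution on {0, 1}

    thetas = np.linspace(0.001, 0.999, 999)        # candidate Q_theta: Pr(X=1) = theta
    # Expected log-likelihood under the empirical distribution, one value per theta
    ell = p_hat[0] * np.log2(1 - thetas) + p_hat[1] * np.log2(thetas)
    print(thetas[np.argmax(ell)], x.mean())        # maximizer is the sample frequency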

Relative Entropy and Log Likelihood

In general:
−H[X] − D(P‖Q) = expected log-likelihood of Q
−H[X] = optimal expected log-likelihood (ideal model)

Why Maximum Likelihood?

1. The inherent compelling rightness of the optimization principle (a bad answer)
2. Generally consistent: θ̂ converges on the optimal value (as we just saw)
3. Generally efficient: converges faster than other consistent estimators

(2) and (3) are really theorems of probability theory.
Let's look a bit more at efficiency.

Fisher Information

Fisher: Taylor-expand L to second order around the maximum.
Fisher information matrix:

    F_{uv}(θ0) ≡ −E_{θ0}[ ∂² log Q_θ(X) / ∂θ_u ∂θ_v |_{θ=θ0} ]

F ∝ n (for IID, Markov, etc.)
Variance of θ̂ = F^{−1} (under some regularity conditions)
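
A sketch for the Bernoulli model (NumPy assumed; parameters invented; natural-log convention for the likelihood): n coin flips carry Fisher information n/(θ(1−θ)), and the sampling variance of the MLE θ̂ is close to its inverse.

    import numpy as np

    rng = np.random.default_rng(4)
    theta, n, reps = 0.3, 500, 20000
    fisher = n / (theta * (1 - theta))             # Fisher information of n Bernoulli(theta) draws
    theta_hats = rng.binomial(n, theta, size=reps) / n
    print(np.var(theta_hats), 1 / fisher)          # both about 4.2e-4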

The Information Bound

Theorem (Cramér-Rao)
F^{−1} is the minimum variance for any unbiased estimator

because the uncertainty in θ̂ depends on the curvature at the maximum;
this leads to a whole information geometry, with F as the metric tensor (Amari et al., 1987; Kass and Vos, 1997; Kulhavý, 1996; Amari and Nagaoka, 1993/2000).
Relative Entropy and Fisher Information
\[
F_{uv}(\theta_0) \equiv -\mathbf{E}_{\theta_0}\!\left[\left.\frac{\partial^2 \log Q_\theta(X)}{\partial\theta_u\,\partial\theta_v}\right|_{\theta=\theta_0}\right]
= \left.\frac{\partial^2}{\partial\theta_u\,\partial\theta_v} D(Q_{\theta_0}\|Q_\theta)\right|_{\theta=\theta_0}
\]
Fisher information is how quickly the relative entropy grows with small changes in the parameters:
\[
D(\theta_0\|\theta_0+\epsilon) \approx \tfrac{1}{2}\,\epsilon^T F \epsilon + O(\|\epsilon\|^3)
\]
Intuition: "easy to estimate" = "easy to reject sub-optimal values"
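A rough numerical confirmation of the quadratic growth, using Bernoulli distributions (my sketch; the example is not from the slides):

```python
# Sketch: D(theta_0 || theta_0 + eps) for Bernoulli distributions (in nats)
# versus the quadratic approximation (1/2)*F*eps^2, F = 1/(theta*(1-theta)).
import numpy as np

def kl_bernoulli(p, q):
    """Relative entropy D(Bern(p) || Bern(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta0 = 0.3
F = 1.0 / (theta0 * (1 - theta0))
for eps in (0.1, 0.01, 0.001):
    print(f"eps={eps:<6}  D={kl_bernoulli(theta0, theta0 + eps):.3e}"
          f"  (1/2) F eps^2={0.5 * F * eps**2:.3e}")
```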
Maximum Entropy: A Dead End
Given constraints on expectation values of functions, E[g₁(X)] = c₁, E[g₂(X)] = c₂, ..., E[g_q(X)] = c_q,
\[
\tilde{P}_{\mathrm{ME}} \equiv \operatorname*{argmax}_{P} \left\{ H[P] : \forall i,\ \mathbf{E}_P[g_i(X)] = c_i \right\}
= \operatorname*{argmax}_{P}\ H[P] - \sum_{i=1}^{q} \lambda_i\left(\mathbf{E}_P[g_i(X)] - c_i\right)
\]
with Lagrange multipliers λᵢ chosen to enforce the constraints
Solution: Exponential Families
Generic solution:
\[
P(x) = \frac{e^{-\sum_{i=1}^{q}\beta_i g_i(x)}}{\int dx\, e^{-\sum_{i=1}^{q}\beta_i g_i(x)}}
     = \frac{e^{-\sum_{i=1}^{q}\beta_i g_i(x)}}{Z(\beta_1,\beta_2,\ldots,\beta_q)}
\]
again β enforces the constraints
Physics: canonical ensemble, with extensive variables gᵢ and intensive variables βᵢ
Statistics: exponential family, with sufficient statistics gᵢ and natural parameters βᵢ
If we take this family of distributions as basic, the MLE is the β such that E[gᵢ(X)] = gᵢ(x), i.e., model mean = observed value
Best discussion of the connection is still Mandelbrot (1962)
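As a concrete sketch (mine, not the slides'), here is the maximum-entropy distribution on a small finite alphabet under a single mean constraint; the multiplier β is found numerically so that the exponential-family form matches the constraint, using scipy's brentq root-finder as one convenient choice.

```python
# Sketch: maximum entropy on the alphabet {0,...,5} subject to E[g(X)] = c with
# g(x) = x.  The solution is P(x) = exp(-beta*x)/Z(beta); the multiplier beta
# is chosen numerically so that the model mean matches the constraint.
import numpy as np
from scipy.optimize import brentq

xs = np.arange(6.0)        # the alphabet
c = 1.5                    # required value of E[g(X)], with g(x) = x

def model_mean(beta):
    w = np.exp(-beta * xs)
    return (w / w.sum()) @ xs

beta = brentq(lambda b: model_mean(b) - c, -10.0, 10.0)  # solve E_beta[X] = c
p = np.exp(-beta * xs)
p /= p.sum()

print("beta =", beta)
print("P(x) =", np.round(p, 4), "  E[X] =", p @ xs)
print("entropy =", -(p * np.log2(p)).sum(), "bits")
```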
The Method of Maximum Entropy
Calculate sample statistics gᵢ(x)
Assume that the distribution of X is the one which maximizes the entropy under those constraints
i.e., the MLE in the exponential family with those sufficient statistics
Refinement: minimum relative entropy, i.e. minimize the divergence from a reference distribution; this also leads to an exponential family, but with a prefactor of the base density
Update distributions under new data by minimizing relative entropy
Often said to be the "least biased" estimate of P, or the one which makes the "fewest assumptions"
About MaxEnt
MaxEnt has lots of devotees who basically think it's the answer to everything
And it sometimes works, because
1. Exponential families often decent approximations, MLE is cool; but not everything is an exponential family
2. Conditional large deviations principle (Csiszár, 1995): if P̂ is constrained to lie in a convex set A, then
\[
-\frac{1}{n}\log \Pr\!\left(\hat{P}\in B \,\middle|\, \hat{P}\in A\right) \to \inf_{Q\in B\cap A} D(Q\|P) - D(Q\|A)
\]
so P̂ is exponentially close to argmin_{Q∈A} D(Q‖P), but the conditional LDP doesn't always hold
Updating by minimizing relative entropy can disagree with Bayes's rule (Seidenfeld, 1979, 1987; Grünwald and Halpern, 2003), contra claims by physicists
The "constraint rule" is certainly not required by logic or probability (Seidenfeld, 1979, 1987; Uffink, 1995, 1996)
MaxEnt (or MinRelEnt) is not the best rule for coming up with a prior distribution to use with Bayesian updating; all such rules suck (Kass and Wasserman, 1996)
Minimum Description Length Inference
Rissanen (1978, 1989)
Choose a model to concisely describe the data
maximum likelihood minimizes the description length of the data
. . . but you need to describe the model as well!
Two-part MDL:
\[
D_2(x,\theta,\Theta) = -\log_2 Q_\theta(x) + C(\theta,\Theta)
\]
\[
\hat{\theta}_{\mathrm{MDL}} = \operatorname*{argmin}_{\theta\in\Theta} D_2(x,\theta,\Theta), \qquad
D_2(x,\Theta) = D_2(x,\hat{\theta}_{\mathrm{MDL}},\Theta)
\]
where C is a coding scheme for the parameters
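A toy two-part-MDL calculation (my construction; the coding scheme C here is a deliberately crude, hypothetical choice): describe a binary string either with the fair-coin model, which has no parameters to encode, or with a Bernoulli(θ) model whose parameter is quantized to k bits, so C(θ, Θ) = k; the shorter total description wins.

```python
# Sketch with a deliberately crude (hypothetical) coding scheme: describe a
# binary string with either the fair-coin model (nothing to encode beyond the
# data) or a Bernoulli(theta) model whose parameter is quantized to k bits, so
# C(theta, Theta) = k.  Two-part MDL picks the shorter total description.
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.7, size=200)            # data from a biased coin
n, ones = x.size, int(x.sum())

def data_bits(theta):
    """-log2 likelihood of the sample under Bernoulli(theta)."""
    return -(ones * np.log2(theta) + (n - ones) * np.log2(1 - theta))

dl_fair = data_bits(0.5)                      # model 0: theta fixed at 1/2

k = 8                                         # model 1: theta on a 2^k-point grid
grid = (np.arange(2**k) + 0.5) / 2**k
dl_biased = k + min(data_bits(t) for t in grid)

print("fair coin  :", round(dl_fair, 1), "bits")
print("biased coin:", round(dl_biased, 1), "bits  (k =", k, "parameter bits)")
```

With these settings the biased-coin description typically comes out noticeably shorter: the k extra parameter bits are repaid by the shorter encoding of the data.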
Must fix the coding scheme before seeing the data (EXERCISE: why?)
By the AEP,
\[
n^{-1} D_2 \to h_1 + \min_{\theta\in\Theta} d(P\|Q_\theta)
\]
still, for finite n the coding scheme matters
(One-part MDL exists but would take too long)
Why Use MDL?
1. The inherent compelling rightness of the optimization principle
2. Good properties: for reasonable sources, if the parametric complexity
\[
\mathrm{COMP}(\Theta) = \log \sum_{w\in\mathcal{X}^n} \max_{\theta\in\Theta} Q_\theta(w)
\]
is small (if there aren't all that many words which get high likelihoods), then if MDL did well in-sample, it will generalize well to new data from the same source (a simple case is computed below)
See Grünwald (2005, 2007) for much more
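For the Bernoulli model class this sum can be computed exactly by grouping strings by their number of ones (again my own sketch; base-2 logs are a choice, not the slides'); COMP grows only like (1/2) log₂ n, which is the sense in which there aren't many words with high likelihoods.

```python
# Sketch (base-2 logs are my choice): parametric complexity of the Bernoulli
# class for length-n binary strings, COMP = log2 sum_w max_theta Q_theta(w).
# Grouping strings by their number of ones k, the maximizer is theta = k/n, so
# COMP = log2 sum_k C(n,k) (k/n)^k ((n-k)/n)^(n-k); it grows like (1/2) log2 n.
import numpy as np
from scipy.special import gammaln

def comp_bernoulli(n):
    k = np.arange(n + 1)
    log_binom = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
    with np.errstate(divide="ignore", invalid="ignore"):
        # the 0*log(0) terms (k = 0 and k = n) are set to 0 via nan_to_num
        log_maxlik = (np.nan_to_num(k * np.log(k / n))
                      + np.nan_to_num((n - k) * np.log((n - k) / n)))
    terms = log_binom + log_maxlik
    m = terms.max()
    return (m + np.log(np.exp(terms - m).sum())) / np.log(2)

for n in (10, 100, 1000):
    print(f"n = {n:5d}:  COMP = {comp_bernoulli(n):.2f} bits")
```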
Information and Statistics: Summary
Relative entropy controls large deviations
Relative entropy = ease of discriminating distributions
Easy discrimination ⇒ good estimation
Large deviations explains why MaxEnt works when it does
References
Amari, Shun-ichi, O. E. Barndorff-Nielsen, Robert E. Kass, Steffen L. Lauritzen and C. R. Rao (1987). Differential Geometry in Statistical Inference, vol. 10 of Institute of Mathematical Statistics Lecture Notes-Monograph Series. Hayward, California: Institute of Mathematical Statistics. URL http://projecteuclid.org/euclid.lnms/1215467056.

Amari, Shun-ichi and Hiroshi Nagaoka (1993/2000). Methods of Information Geometry. Providence, Rhode Island: American Mathematical Society. Translated by Daishi Harada from Joho Kika no Hoho, Tokyo: Iwanami Shoten Publishers.

Bucklew, James A. (1990). Large Deviation Techniques in Decision, Simulation, and Estimation. New York: Wiley-Interscience.
Cover, Thomas M. and Joy A. Thomas (1991). Elements of Information Theory. New York: Wiley.

Csiszár, Imre (1995). “Maxent, Mathematics, and Information Theory.” In Maximum Entropy and Bayesian Methods: Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods (Kenneth M. Hanson and Richard N. Silver, eds.), pp. 35–50. Dordrecht: Kluwer Academic.

den Hollander, Frank (2000). Large Deviations. Providence, Rhode Island: American Mathematical Society.

Fisher, R. A. (1922). “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A, 222: 309–368. URL http://digital.library.adelaide.edu.au/coll/special/fisher/stat_math.html.
Grünwald, Peter (2005). “A Tutorial Introduction to the Minimum Description Length Principle.” In Advances in Minimum Description Length: Theory and Applications (P. Grünwald, I. J. Myung and M. Pitt, eds.). Cambridge, Massachusetts: MIT Press. URL http://arxiv.org/abs/math.ST/0406077.

Grünwald, Peter D. (2007). The Minimum Description Length Principle. Cambridge, Massachusetts: MIT Press.

Grünwald, Peter D. and Joseph Y. Halpern (2003). “Updating Probabilities.” Journal of Artificial Intelligence Research, 19: 243–278. doi:10.1613/jair.1164.

Kass, Robert E. and Paul W. Vos (1997). Geometrical Foundations of Asymptotic Inference. New York: Wiley.
Kass, Robert E. and Larry Wasserman (1996). “The Selection of Prior Distributions by Formal Rules.” Journal of the American Statistical Association, 91: 1343–1370. URL http://www.stat.cmu.edu/~kass/papers/rules.pdf.
Kulhavý, Rudolf (1996). Recursive Nonlinear Estimation: A Geometric Approach, vol. 216 of Lecture Notes in Control and Information Sciences. Berlin: Springer-Verlag.

Mandelbrot, Benoit (1962). “The Role of Sufficiency and of Estimation in Thermodynamics.” Annals of Mathematical Statistics, 33: 1021–1038. URL http://projecteuclid.org/euclid.aoms/1177704470.

Poundstone, William (2005). Fortune’s Formula: The Untold Story of the Scientific Betting Systems That Beat the Casinos and Wall Street. New York: Hill and Wang.

Rissanen, Jorma (1978). “Modeling by Shortest Data Description.” Automatica, 14: 465–471.
— (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Seidenfeld, Teddy (1979). “Why I Am Not an Objective Bayesian: Some Reflections Prompted by Rosenkrantz.” Theory and Decision, 11: 413–440. URL http://www.hss.cmu.edu/philosophy/seidenfeld/relating%20to%20other%20probability%20and%20statistical%20issues/Why%20I%20Am%20Not%20an%20Objective%20B.pdf.
— (1987). “Entropy and Uncertainty.” In Foundations of Statistical Inference (I. B. MacNeill and G. J. Umphrey, eds.), pp. 259–287. Dordrecht: D. Reidel. URL http://www.hss.cmu.edu/philosophy/seidenfeld/relating%20to%20other%20probability%20and%20statistical%20issues/Entropy%20and%20Uncertainty%20(revised).pdf.
Shannon, Claude E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27: 379–423. URL http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html. Reprinted in Shannon and Weaver (1963).

Shannon, Claude E. and Warren Weaver (1963). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Sperber, Dan and Deirdre Wilson (1990). “Rhetoric and Relevance.” In The Ends of Rhetoric: History, Theory, Practice (David Wellbery and John Bender, eds.), pp. 140–155. Stanford: Stanford University Press. URL http://dan.sperber.com/rhetoric.htm.
— (1995). Relevance: Cognition and Communication. Oxford: Basil Blackwell, 2nd edn.

Stephenson, Neal (1999). Cryptonomicon. New York: Avon Books.

Touchette, Hugo (2008). “The Large Deviations Approach to Statistical Mechanics.” E-print, arxiv.org. URL http://arxiv.org/abs/0804.0327.

Uffink, Jos (1995). “Can the Maximum Entropy Principle be Explained as a Consistency Requirement?” Studies in History and Philosophy of Modern Physics, 26B: 223–261. URL http://www.phys.uu.nl/~wwwgrnsl/jos/mepabst/mepabst.html.
— (1996). “The Constraint Rule of the Maximum Entropy Principle.” Studies in History and Philosophy of Modern Physics, 27: 47–79. URL http://www.phys.uu.nl/~wwwgrnsl/jos/mep2def/mep2def.html.