Entropy Estimation and Applications to Decision Trees
Estimation
Distribution over K = 8 classes
Repeat 50,000 times:
1. Generate N samples
2. Estimate entropy from the samples
[Figure: bar chart of the class distribution (true entropy H = 1.289) and histograms of the plug-in entropy estimate for N = 10, N = 100, and N = 1000, each over 50,000 replicates]
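A minimal sketch of this simulation in Python; the class distribution `p_true` below is a hypothetical stand-in for the K = 8 distribution used on the slides:

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy estimate from class counts."""
    p = counts / counts.sum()
    p = p[p > 0]                       # 0 log 0 = 0 by convention
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
K = 8
p_true = rng.dirichlet(np.ones(K))     # hypothetical stand-in for the slide's distribution
H_true = -(p_true * np.log(p_true)).sum()

for N in (10, 100, 1000):
    est = np.array([plugin_entropy(rng.multinomial(N, p_true))
                    for _ in range(50_000)])
    print(f"N={N:5d}  mean={est.mean():.3f}  std={est.std():.3f}  true={H_true:.3f}")
```

For small N the mean estimate falls below the true entropy, which is the bias the histograms above make visible.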
Estimation
[Figure: histogram of the plug-in entropy estimate, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]
Estimating the true entropy
Goals:
1. Consistency: large N guarantees the correct result
2. Low variance: variation of the estimates is small
3. Low bias: the expected estimate should be correct
Discrete Entropy Estimators
• UCI classification data sets
• Accuracy on test set
• Plugin vs. Grassberger
• Better trees
Experimental Results
Source: [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
• In regression, differential entropy
– measures the remaining uncertainty about y
– is a function of a distribution
Differential Entropy Estimation
$H(q) = -\int_y q(y \mid x) \, \log q(y \mid x) \, \mathrm{d}y$
• Problem
– q is not from a parametric family
• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation
• Multivariate Normal distribution
– Estimate the covariance matrix of all y vectors
– Plugin estimate of the entropy
Solution 1: parametric family
$H(\hat{\Sigma}) = \frac{d}{2} + \frac{d}{2} \log 2\pi + \frac{1}{2} \log |\hat{\Sigma}|$
– Uniform minimum variance unbiased estimator (UMVUE)
$\hat{H}(Y) = \frac{d}{2} \log e\pi + \frac{1}{2} \log \Bigl| \sum_{y \in Y} y y^T \Bigr| - \frac{1}{2} \sum_{j=1}^{d} \psi\!\left(\frac{n+1-j}{2}\right)$
[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
Solution 1: parametric family
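A direct transcription of the two estimators above in Python; the UMVUE is written for zero-mean samples, matching the $\sum_{y \in Y} y y^T$ form of the formula:

```python
import numpy as np
from scipy.special import digamma

def gaussian_entropy_plugin(Y):
    """Plug-in estimate: H(Sigma_hat) = d/2 + (d/2) log(2 pi) + (1/2) log |Sigma_hat|."""
    n, d = Y.shape
    sigma_hat = np.cov(Y, rowvar=False)
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(sigma_hat)[1]

def gaussian_entropy_umvue(Y):
    """Ahmed & Gokhale UMVUE, transcribed from the formula above (zero-mean samples)."""
    n, d = Y.shape
    scatter = Y.T @ Y                  # sum over samples of y y^T
    j = np.arange(1, d + 1)
    return (0.5 * d * np.log(np.e * np.pi)
            + 0.5 * np.linalg.slogdet(scatter)[1]
            - 0.5 * digamma((n + 1 - j) / 2).sum())
```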
• Minimal assumptions on the distribution
• Nearest neighbour estimate
– $\rho_i$: NN distance
– $\gamma$: Euler–Mascheroni constant
– $V_d$: volume of the d-dim. unit hypersphere
• Other estimators: KDE, spanning tree, k-NN, etc.
Solution 2: Non-parametric entropy estimation
$\hat{H}_{1NN} = \frac{d}{n} \sum_{i=1}^{n} \log \rho_i + \log(n-1) + \gamma + \log V_d$
[Kozachenko, Leonenko, “Sample estimate of the entropy of a random vector”, Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, “Nonparametric entropy estimation: An overview”, 2001]
[Wang, Kulkarni, Verdú, “Universal estimation of information measures for analog sources”, FnT Comm. Inf. Th., 2009]
Solution 2: Non-parametric estimation
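A minimal sketch of the Kozachenko–Leonenko 1-NN estimator, following the formula above directly:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

def entropy_1nn(Y):
    """Kozachenko-Leonenko 1-NN estimate for samples Y of shape (n, d); assumes no duplicates."""
    n, d = Y.shape
    # rho_i: distance from each point to its nearest neighbour (k=2 skips the point itself)
    rho = cKDTree(Y).query(Y, k=2)[0][:, 1]
    log_vd = 0.5 * d * np.log(np.pi) - np.log(gamma_fn(0.5 * d + 1))  # log volume of unit d-ball
    return d / n * np.log(rho).sum() + np.log(n - 1) + np.euler_gamma + log_vd

# Sanity check: a standard normal in d = 2 has true entropy log(2 pi e) ~ 2.838
Y = np.random.default_rng(0).standard_normal((5000, 2))
print(entropy_1nn(Y))
```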
Experimental Results
[Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
Streaming Decision Trees
Streaming Data
[Figure: histogram of the plug-in entropy estimate, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]
• “Infinite data” setting
• 10 possible splits and their scores
• When to stop and make a decision?
Streaming Decision Trees
[Domingos, Hulten, “Mining High-Speed Data Streams”, KDD 2000]
[Jin, Agrawal, “Efficient Decision Tree Construction on Streaming Data”, KDD 2003]
[Loh, Nowozin, “Faster Hoeffding racing: Bernstein races via jackknife estimates”, ALT 2013]
• Score splits on a subset of samples only
• Domingos/Hulten (Hoeffding Trees), 2000:
– Compute the sample count n needed for a given precision (see the sketch after this list)
– Streaming decision tree induction
– Incorrect confidence intervals, but works well in practice
• Jin/Agrawal, 2003:
– Tighter confidence interval, asymptotic derivation using the delta method
• Loh/Nowozin, 2013:
– Racing algorithm (bad splits are removed early)
– Finite-sample confidence intervals for entropy and Gini
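A minimal sketch of the Hoeffding-style stopping rule behind these methods; the function and parameter names here are hypothetical, and the score range $\log K$ is an assumption for infogain over K classes:

```python
import numpy as np

def hoeffding_eps(score_range, n, delta):
    """Hoeffding bound: with prob. >= 1 - delta, the mean of n observations of a
    quantity with the given range lies within eps of its true mean."""
    return np.sqrt(score_range ** 2 * np.log(1.0 / delta) / (2.0 * n))

def choose_split(mean_scores, n, num_classes=8, delta=1e-6):
    """Commit to a split once the best observed score beats the runner-up by more than eps."""
    eps = hoeffding_eps(np.log(num_classes), n, delta)   # infogain lies in [0, log K]
    best, second = np.sort(mean_scores)[-2:][::-1]
    if best - second > eps:
        return int(np.argmax(mean_scores))   # confident: split on this attribute now
    return None                              # not separable yet: keep streaming
```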
Multivariate Delta Method
Theorem. Let $(T_n)$ be a sequence of $k$-dimensional random vectors such that $\sqrt{n}(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$. Let $g: \mathbb{R}^k \to \mathbb{R}^m$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then
$\sqrt{n}\left(g(T_n) - g(\theta)\right) \xrightarrow{d} \mathcal{N}\left(0, \nabla g(\theta)^T \, \Sigma \, \nabla g(\theta)\right).$
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
Delta Method for the Information Gain
[Figure: two class histograms over 8 classes, one for each split choice (left/right)]
• 8 classes, 2 choices (left/right)
• $p_{s,i}$: probability of choice $s$, class $i$
• $I(p)$: mutual information (infogain)
• Derivation lengthy but not difficult; a slight generalization of Jin & Agrawal
Multivariate delta method: for the plug-in estimate $\hat{I} = I(\hat{p})$ we have that
$\sqrt{n}\left(\hat{I} - I(p)\right) \xrightarrow{d} \mathcal{N}\left(0, \nabla I(p)^T \, \Sigma \, \nabla I(p)\right).$
[Small, “Expansions and Asymptotics for Statistics”, CRC, 2010]
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
Delta Method Example
As $n \to \infty$, the true distribution $p$ is fixed
[Figure: plug-in infogain estimate vs. sample size, with estimate and truth (“Plugin estimate and standard deviation, 10000 replicates”), and asymptotic variance of the information gain (empirical stddev vs. delta method stddev)]
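A sketch of the same comparison for the simpler case of a single plug-in entropy, where the delta-method variance has the closed form $\sigma^2 = \operatorname{Var}[-\log p(X)]$; the distribution `p` below is hypothetical:

```python
import numpy as np

def plugin_entropy(counts):
    q = counts / counts.sum()
    q = q[q > 0]
    return -(q * np.log(q)).sum()

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.125])        # hypothetical true distribution
H = -(p * np.log(p)).sum()
# Delta method: sqrt(n) (H_hat - H) -> N(0, sigma^2) with sigma^2 = Var[-log p(X)]
sigma2 = (p * np.log(p) ** 2).sum() - H ** 2

for n in (50, 200, 1000):
    est = np.array([plugin_entropy(rng.multinomial(n, p)) for _ in range(10_000)])
    print(f"n={n:4d}  empirical std={est.std():.4f}  delta method std={np.sqrt(sigma2 / n):.4f}")
```

As in the figure above, the empirical standard deviation approaches the delta-method prediction as the sample size grows.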
• Statistical problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• Distribution of the estimate is relevant in the streaming setting
Conclusion on Entropy Estimation