Entropy Estimation and Applications to Decision Trees
Estimation
Distribution over K = 8 classes
Repeat 50,000 times:
1. Generate N samples
2. Estimate entropy from the samples
[Figure: bar chart of the class distribution (true entropy H = 1.289) and histograms of the plug-in entropy estimate for N = 10, N = 100, and N = 1000, each over 50,000 replicates]
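A minimal sketch of this simulation in Python; the class distribution `p_true` below is a hypothetical stand-in for the K = 8 distribution used on the slides:

```python
import numpy as np

def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy estimate from class counts."""
    p = counts / counts.sum()
    p = p[p > 0]                       # 0 log 0 = 0 by convention
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
K = 8
p_true = rng.dirichlet(np.ones(K))     # hypothetical stand-in for the slide's distribution
H_true = -(p_true * np.log(p_true)).sum()

for N in (10, 100, 1000):
    est = np.array([plugin_entropy(rng.multinomial(N, p_true))
                    for _ in range(50_000)])
    print(f"N={N:5d}  mean={est.mean():.3f}  std={est.std():.3f}  true={H_true:.3f}")
```

For small N the mean estimate falls below the true entropy, which is the bias the histograms above make visible.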
Estimation
[Figure: histogram of the plug-in entropy estimate, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]
Estimating the true entropy
Goals:
1. Consistency: large N guarantees the correct result
2. Low variance: variation of the estimates is small
3. Low bias: the expected estimate should be correct
Discrete Entropy Estimators
• UCI classification data sets
• Accuracy on test set
• Plugin vs. Grassberger
• Better trees
Experimental Results
Source: [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
• In regression, differential entropy
– measures the remaining uncertainty about y
– is a function of a distribution
Differential Entropy Estimation
$H(q) = -\int_y q(y \mid x) \, \log q(y \mid x) \, \mathrm{d}y$
• Problem
– q is not from a parametric family
• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation
• Multivariate Normal distribution
– Estimate the covariance matrix of all y vectors
– Plugin estimate of the entropy
Solution 1: parametric family
$H(\hat{\Sigma}) = \frac{d}{2} + \frac{d}{2} \log 2\pi + \frac{1}{2} \log |\hat{\Sigma}|$
– Uniform minimum variance unbiased estimator (UMVUE)
$\hat{H}(Y) = \frac{d}{2} \log e\pi + \frac{1}{2} \log \Bigl| \sum_{y \in Y} y y^T \Bigr| - \frac{1}{2} \sum_{j=1}^{d} \psi\!\left(\frac{n+1-j}{2}\right)$
[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
Solution 1: parametric family
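A direct transcription of the two estimators above in Python; the UMVUE is written for zero-mean samples, matching the $\sum_{y \in Y} y y^T$ form of the formula:

```python
import numpy as np
from scipy.special import digamma

def gaussian_entropy_plugin(Y):
    """Plug-in estimate: H(Sigma_hat) = d/2 + (d/2) log(2 pi) + (1/2) log |Sigma_hat|."""
    n, d = Y.shape
    sigma_hat = np.cov(Y, rowvar=False)
    return 0.5 * d + 0.5 * d * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(sigma_hat)[1]

def gaussian_entropy_umvue(Y):
    """Ahmed & Gokhale UMVUE, transcribed from the formula above (zero-mean samples)."""
    n, d = Y.shape
    scatter = Y.T @ Y                  # sum over samples of y y^T
    j = np.arange(1, d + 1)
    return (0.5 * d * np.log(np.e * np.pi)
            + 0.5 * np.linalg.slogdet(scatter)[1]
            - 0.5 * digamma((n + 1 - j) / 2).sum())
```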
• Minimal assumptions on the distribution
• Nearest neighbour estimate
– $\rho_i$: NN distance
– $\gamma$: Euler–Mascheroni constant
– $V_d$: volume of the d-dim. unit hypersphere
• Other estimators: KDE, spanning tree, k-NN, etc.
Solution 2: Non-parametric entropy estimation
$\hat{H}_{1NN} = \frac{d}{n} \sum_{i=1}^{n} \log \rho_i + \log(n-1) + \gamma + \log V_d$
[Kozachenko, Leonenko, “Sample estimate of the entropy of a random vector”, Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, “Nonparametric entropy estimation: An overview”, 2001]
[Wang, Kulkarni, Verdú, “Universal estimation of information measures for analog sources”, FnT Comm. Inf. Th., 2009]
Solution 2: Non-parametric estimation
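A minimal sketch of the Kozachenko–Leonenko 1-NN estimator, following the formula above directly:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma as gamma_fn

def entropy_1nn(Y):
    """Kozachenko-Leonenko 1-NN estimate for samples Y of shape (n, d); assumes no duplicates."""
    n, d = Y.shape
    # rho_i: distance from each point to its nearest neighbour (k=2 skips the point itself)
    rho = cKDTree(Y).query(Y, k=2)[0][:, 1]
    log_vd = 0.5 * d * np.log(np.pi) - np.log(gamma_fn(0.5 * d + 1))  # log volume of unit d-ball
    return d / n * np.log(rho).sum() + np.log(n - 1) + np.euler_gamma + log_vd

# Sanity check: a standard normal in d = 2 has true entropy log(2 pi e) ~ 2.838
Y = np.random.default_rng(0).standard_normal((5000, 2))
print(entropy_1nn(Y))
```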
Experimental Results
[Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
Streaming Decision Trees
Streaming Data
[Figure: histogram of the plug-in entropy estimate, N = 100, 50,000 replicates, with the true entropy and 2 standard deviation estimates shown]
• “Infinite data” setting
• 10 possible splits and their scores
• When to stop and make a decision?
Streaming Decision Trees
[Domingos, Hulten, “Mining High-Speed Data Streams”, KDD 2000]
[Jin, Agrawal, “Efficient Decision Tree Construction on Streaming Data”, KDD 2003]
[Loh, Nowozin, “Faster Hoeffding racing: Bernstein races via jackknife estimates”, ALT 2013]
• Score splits on a subset of samples only
• Domingos/Hulten (Hoeffding Trees), 2000:
– Compute the sample count n needed for a given precision (see the sketch after this list)
– Streaming decision tree induction
– Incorrect confidence intervals, but works well in practice
• Jin/Agrawal, 2003:
– Tighter confidence interval, asymptotic derivation using the delta method
• Loh/Nowozin, 2013:
– Racing algorithm (bad splits are removed early)
– Finite-sample confidence intervals for entropy and Gini
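A minimal sketch of the Hoeffding-style stopping rule behind these methods; the function and parameter names here are hypothetical, and the score range $\log K$ is an assumption for infogain over K classes:

```python
import numpy as np

def hoeffding_eps(score_range, n, delta):
    """Hoeffding bound: with prob. >= 1 - delta, the mean of n observations of a
    quantity with the given range lies within eps of its true mean."""
    return np.sqrt(score_range ** 2 * np.log(1.0 / delta) / (2.0 * n))

def choose_split(mean_scores, n, num_classes=8, delta=1e-6):
    """Commit to a split once the best observed score beats the runner-up by more than eps."""
    eps = hoeffding_eps(np.log(num_classes), n, delta)   # infogain lies in [0, log K]
    best, second = np.sort(mean_scores)[-2:][::-1]
    if best - second > eps:
        return int(np.argmax(mean_scores))   # confident: split on this attribute now
    return None                              # not separable yet: keep streaming
```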
Multivariate Delta Method
Theorem. Let $(T_n)$ be a sequence of $k$-dimensional random vectors such that $\sqrt{n}(T_n - \theta) \xrightarrow{d} \mathcal{N}(0, \Sigma)$. Let $g: \mathbb{R}^k \to \mathbb{R}^m$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then
$\sqrt{n}\left(g(T_n) - g(\theta)\right) \xrightarrow{d} \mathcal{N}\left(0, \nabla g(\theta)^T \, \Sigma \, \nabla g(\theta)\right).$
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
Delta Method for the Information Gain
[Figure: two class histograms over 8 classes, one for each split choice (left/right)]
• 8 classes, 2 choices (left/right)
• $p_{s,i}$: probability of choice $s$, class $i$
• $I(p)$: mutual information (infogain)
• Derivation lengthy but not difficult; a slight generalization of Jin & Agrawal
Multivariate delta method: for the plug-in estimate $\hat{I} = I(\hat{p})$ we have that
$\sqrt{n}\left(\hat{I} - I(p)\right) \xrightarrow{d} \mathcal{N}\left(0, \nabla I(p)^T \, \Sigma \, \nabla I(p)\right).$
[Small, “Expansions and Asymptotics for Statistics”, CRC, 2010]
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
Delta Method Example
As $n \to \infty$, the true distribution $p$ is fixed
[Figure: plug-in infogain estimate vs. sample size, with estimate and truth (“Plugin estimate and standard deviation, 10000 replicates”), and asymptotic variance of the information gain (empirical stddev vs. delta method stddev)]
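A sketch of the same comparison for the simpler case of a single plug-in entropy, where the delta-method variance has the closed form $\sigma^2 = \operatorname{Var}[-\log p(X)]$; the distribution `p` below is hypothetical:

```python
import numpy as np

def plugin_entropy(counts):
    q = counts / counts.sum()
    q = q[q > 0]
    return -(q * np.log(q)).sum()

rng = np.random.default_rng(1)
p = np.array([0.5, 0.25, 0.125, 0.125])        # hypothetical true distribution
H = -(p * np.log(p)).sum()
# Delta method: sqrt(n) (H_hat - H) -> N(0, sigma^2) with sigma^2 = Var[-log p(X)]
sigma2 = (p * np.log(p) ** 2).sum() - H ** 2

for n in (50, 200, 1000):
    est = np.array([plugin_entropy(rng.multinomial(n, p)) for _ in range(10_000)])
    print(f"n={n:4d}  empirical std={est.std():.4f}  delta method std={np.sqrt(sigma2 / n):.4f}")
```

As in the figure above, the empirical standard deviation approaches the delta-method prediction as the sample size grows.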
• Statistical problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• Distribution of the estimate is relevant in the streaming setting
Conclusion on Entropy Estimation