
Page 1:

Mutual Information and Optimal Data Coding

Jules de Tibeiro

Université de Moncton à Shippagan

Bernard Colin

François Dubeau

Hussein Khreibani

Université de Sherbrooke

May 9th 2012

Page 2:

• Abstract

• Introduction and Motivation – Example

• Theoretical Framework

– 𝜑-Divergence

– Mutual Information

• Optimal Partition

– Mutual Information explained by a partition

– Existence of an Optimal Partition

• Computational Aspects and Examples

• Conclusions and Perspectives

• References

Page 3:

Abstract

• Based on the notion of mutual information between the components of a random vector, we propose an optimal quantization of the support of its probability measure

• This yields a simultaneous discretization of the whole set of components of the random vector, which preserves the stochastic dependence between them

• Key words: Divergence, mutual information, copula, optimal quantization

Page 4:

Introduction and Motivation

• An optimal discretization of the support of a continuous multivariate distribution

• To retain the stochastic dependence between the variables

– $X = (X_1, X_2, \ldots, X_k)$, a random vector with values in $(\mathbb{R}^k, \mathcal{B}_{\mathbb{R}^k}, \mathbb{P}_X)$

– where $\mathbb{P}_X$ is the probability measure of $X$ and $S_{\mathbb{P}_X} \subseteq \mathbb{R}^k$ is the support of $\mathbb{P}_X$

• $n = n_1 n_2 \cdots n_k$, a product of $k$ given integers

– A partition $P$ of $S_{\mathbb{P}_X}$ in $n$ elements or classes

– A partition $P$ is a "product-partition" deduced from partitions $P_1, P_2, \ldots, P_k$ of the supports of the marginal probability measures in $n_1, n_2, \ldots, n_k$ intervals respectively

• Using a mutual information criterion, we choose the set of intervals such that the quantization of the support $S_{\mathbb{P}_X}$ retains, as much as possible, the stochastic dependence between the components of the random vector $X$

Page 5:

Introduction and Motivation

• Here is an example for which such an optimal discretization might be desirable

– Let us suppose that we have a sample of individuals on which we observe the following variables:

– 𝑋 = 𝑎𝑔𝑒, 𝑌 = 𝑠𝑎𝑙𝑎𝑟𝑦, 𝑍 = 𝑠𝑜𝑐𝑖𝑜𝑝𝑟𝑜𝑓𝑒𝑠𝑠𝑖𝑜𝑛𝑎𝑙 𝑔𝑟𝑜𝑢𝑝

– If we want to take the variables into account simultaneously as, for example, in multiple correspondence analysis, we have to put them in the same form by discretizing the first two

– Instead of the usual independent categorization of the variables 𝑋 and 𝑌 in a given number of classes (𝑝 for 𝑋 and 𝑞 for 𝑌), it would be more relevant, using their stochastic dependence, to categorize 𝑋 and 𝑌 simultaneously in 𝑝𝑞 classes (sometimes referred to as a (𝑝, 𝑞)-partition), in order to preserve as much as possible the dependence between them

– Moreover, depending on the values taken by the categorical variable 𝑍, the (conditional) discretization of the random vector (𝑋, 𝑌) must differ from one class to another, to take into account the stochastic dependence between the continuous random variables and the categorical one

– Usually, we do not take care of this dependence in creating classes for continuous random variables

– However, the dependence between 𝑋 = 𝑎𝑔𝑒 and 𝑌 = 𝑠𝑎𝑙𝑎𝑟𝑦 is certainly quite different across the socioprofessional groups.

Page 6:

𝜑-Divergence

• Let $(\Omega, \mathcal{F}, \mu)$ be a measure space

• Let $\mu_1$ and $\mu_2$ be two probability measures defined on $\mathcal{F}$, such that $\mu_i \ll \mu$ for $i = 1, 2$

• The 𝜑-divergence, or generalized divergence (Csiszár [2]), between $\mu_1$ and $\mu_2$ is

– $I_\varphi(\mu_1, \mu_2) = \displaystyle\int \varphi\!\left(\frac{d\mu_1}{d\mu_2}\right) d\mu_2 = \int \varphi\!\left(\frac{f_1}{f_2}\right) f_2\, d\mu$

– where $\varphi(t)$ is a convex function from $\mathbb{R}^+\setminus\{0\}$ to $\mathbb{R}$ and where $f_i = \dfrac{d\mu_i}{d\mu}$ for $i = 1, 2$

– $I_\varphi(\mu_1, \mu_2)$ does not depend on the choice of $\mu$

• Homogeneous models

– $I_\varphi(\mu_1, \mu_2) = \displaystyle\int \frac{d\mu_2}{d\mu_1}\,\varphi\!\left(\frac{d\mu_1}{d\mu_2}\right) d\mu_1 = \int \frac{f_2}{f_1}\,\varphi\!\left(\frac{f_1}{f_2}\right) f_1\, d\mu$
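• For instance, taking $\varphi(t) = t\ln t$ (the Kullback-Leibler entry of the table on the next slide) recovers the classical Kullback-Leibler divergence:

– $I_\varphi(\mu_1, \mu_2) = \displaystyle\int \frac{f_1}{f_2}\,\ln\frac{f_1}{f_2}\; f_2\, d\mu = \int f_1 \ln\frac{f_1}{f_2}\, d\mu = \mathrm{KL}(\mu_1 \,\|\, \mu_2)$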

Page 7:

𝜑-Divergence

• Usual measures of 𝜑-divergence:

$\varphi(x)$                              Name
$x \ln x$;  $(x - 1)\ln x$                Kullback and Leibler
$|x - 1|$                                 Distance in variation
$(\sqrt{x} - 1)^2$                        Hellinger
$1 - x^{\alpha}$,  $0 < \alpha < 1$       Chernoff
$(x - 1)^2$                               $\chi^2$
$[1 - x^{1/m}]^m$,  $m > 0$               Jeffreys
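As an illustration (not part of the talk), these generators can be written directly as functions of the likelihood ratio $x = f_1/f_2$, together with the discrete form of $I_\varphi$. A minimal numpy sketch; the dictionary PHI and the helper phi_divergence are illustrative names:

```python
import numpy as np

# phi generators from the table above, as functions of the likelihood ratio x = f1/f2
PHI = {
    "kullback_leibler": lambda x: x * np.log(x),
    "variation":        lambda x: np.abs(x - 1.0),
    "hellinger":        lambda x: (np.sqrt(x) - 1.0) ** 2,
    "chernoff":         lambda x, alpha=0.5: 1.0 - x ** alpha,
    "chi2":             lambda x: (x - 1.0) ** 2,
    "jeffreys":         lambda x, m=2.0: (1.0 - x ** (1.0 / m)) ** m,
}

def phi_divergence(p1, p2, phi):
    """I_phi(mu1, mu2) = sum_x phi(p1(x) / p2(x)) * p2(x) for two discrete measures
    p1, p2 on a common finite set (points with p2 = 0 are simply skipped here)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    m = p2 > 0
    return float(np.sum(phi(p1[m] / p2[m]) * p2[m]))

if __name__ == "__main__":
    mu1 = [0.2, 0.3, 0.5]
    mu2 = [0.4, 0.4, 0.2]
    for name, phi in PHI.items():
        print(f"{name:18s} {phi_divergence(mu1, mu2, phi):.4f}")
```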

Page 8:

Mutual Information

• Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space

• Let $X_1, X_2, \ldots, X_k$ be $k$ random variables defined on $(\Omega, \mathcal{F}, \mathbb{P})$

– with values in the measure spaces $(\mathcal{X}_i, \mathcal{F}_i, \lambda_i)$, $i = 1, 2, \ldots, k$

• Denote respectively by $\mathbb{P}_X = \mathbb{P}_{X_1, X_2, \ldots, X_k}$ and by $\otimes_{i=1}^{k}\mathbb{P}_{X_i}$

– the probability measures defined on the product space $\left(\times_{i=1}^{k}\mathcal{X}_i,\ \otimes_{i=1}^{k}\mathcal{F}_i,\ \otimes_{i=1}^{k}\lambda_i\right)$,

– equal to the joint probability measure and to the product of the marginal ones,

– supposed to be absolutely continuous with respect to the product measure $\lambda = \otimes_{i=1}^{k}\lambda_i$

Page 9:

Mutual Information

• Definition 1

• The 𝜑-mutual information, or simply the mutual information, between the random variables $X_1, X_2, \ldots, X_k$ is given by:

• $I_\varphi(X_1, X_2, \ldots, X_k) = I_\varphi\!\left(\mathbb{P}_X,\ \otimes_{i=1}^{k}\mathbb{P}_{X_i}\right) = \displaystyle\int \varphi\!\left(\frac{d\mathbb{P}_X}{d\otimes_{i=1}^{k}\mathbb{P}_{X_i}}\right) d\otimes_{i=1}^{k}\mathbb{P}_{X_i} = \int \varphi\!\left(\frac{f_1}{f_2}\right) f_2\, d\lambda$

– where $f_1$ and $f_2$ are the probability density functions of the measures $\mathbb{P}_X$ and $\otimes_{i=1}^{k}\mathbb{P}_{X_i}$ with respect to $\lambda = \otimes_{i=1}^{k}\lambda_i$
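• In particular, for $k = 2$ and $\varphi(t) = t\ln t$, $I_\varphi(X_1, X_2) = \displaystyle\int f_1 \ln\frac{f_1}{f_2}\, d\lambda$ is the classical Shannon mutual information $I(X_1; X_2)$, i.e. the Kullback-Leibler divergence between the joint distribution and the product of its marginals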

Page 10:

Mutual Information explained by a partition

• The random vector $X$ defined on $(\Omega, \mathcal{F}, \mathbb{P})$ has values in $(\mathbb{R}^k, \mathcal{B}_{\mathbb{R}^k})$

• $\mathbb{P}_X$ is its probability measure, with $\mathbb{P}_X \ll \lambda$, where $\lambda$ is the Lebesgue measure on $\mathbb{R}^k$

– The support $S_{\mathbb{P}_X}$ may be assumed to be of the form $\times_{i=1}^{k}[a_i, b_i]$, where $-\infty < a_i < b_i < \infty$ for every $i = 1, 2, \ldots, k$

– Given $k$ integers $n_1, n_2, \ldots, n_k$, let $\mathcal{P}_i$, for $i = 1, 2, \ldots, k$, be a partition of $[a_i, b_i]$ in $n_i$ intervals $\{\gamma_{ij_i}\}$ such that

• $a_i = x_{i0} < x_{i1} < \cdots < x_{i,n_i-1} < x_{in_i} = b_i$

• $\gamma_{ij_i} = [x_{i,j_i-1}, x_{ij_i})$ for $j_i = 1, 2, \ldots, n_i - 1$, and $\gamma_{in_i} = [x_{i,n_i-1}, b_i]$

– The "product-partition" $\mathcal{P} = \otimes_{i=1}^{k}\mathcal{P}_i$ of $S_{\mathbb{P}_X}$ in $n = n_1 n_2 \cdots n_k$ rectangles of $\mathbb{R}^k$:

• $P = \gamma_{1j_1}\times\gamma_{2j_2}\times\cdots\times\gamma_{kj_k} = \times_{i=1}^{k}\gamma_{ij_i}$, with $j_i = 1, 2, \ldots, n_i$ for every $i$

Page 11:

Mutual Information explained by a partition

• The "product-partition" $\mathcal{P} = \otimes_{i=1}^{k}\mathcal{P}_i$ of $S_{\mathbb{P}_X}$ in $n = n_1 n_2 \cdots n_k$ rectangles of $\mathbb{R}^k$

– $P = \gamma_{1j_1}\times\gamma_{2j_2}\times\cdots\times\gamma_{kj_k} = \times_{i=1}^{k}\gamma_{ij_i}$, with $j_i = 1, 2, \ldots, n_i$ for every $i$

• If $\sigma(\mathcal{P})$ denotes the $\sigma$-algebra generated by $\mathcal{P}$, the restriction of $\mathbb{P}_X$ to $\sigma(\mathcal{P})$ is given by

– $\mathbb{P}_X\!\left(\times_{i=1}^{k}\gamma_{ij_i}\right)$ for every $(j_1, j_2, \ldots, j_k)$

• whose marginals are, for every $i = 1, 2, \ldots, k$:

– $\mathbb{P}_X\!\left(\times_{r=1}^{i-1}[a_r, b_r] \times \gamma_{ij_i} \times \times_{r=i+1}^{k}[a_r, b_r]\right) = \mathbb{P}_{X_i}(\gamma_{ij_i})$

• The mutual information explained by the partition $\mathcal{P}$ of the support $S_{\mathbb{P}_X}$, denoted by $I_\varphi(\mathcal{P})$, is

– $I_\varphi(\mathcal{P}) = \displaystyle\sum_{j_1, j_2, \ldots, j_k} \varphi\!\left(\frac{\mathbb{P}_X\!\left(\times_{i=1}^{k}\gamma_{ij_i}\right)}{\prod_{i=1}^{k}\mathbb{P}_{X_i}(\gamma_{ij_i})}\right)\prod_{i=1}^{k}\mathbb{P}_{X_i}(\gamma_{ij_i})$

Page 12:

Existence of an Optimal Partition

• For given integers $n_1, n_2, \ldots, n_k$ and for every $i = 1, 2, \ldots, k$:

– $\mathcal{P}_{i,n_i}$: the class of partitions of $[a_i, b_i]$ in $n_i$ disjoint intervals

– $\mathcal{P}_{\mathbf{n}}$: the class of partitions of $S_{\mathbb{P}_X}$ given by $\mathcal{P}_{\mathbf{n}} = \otimes_{i=1}^{k}\mathcal{P}_{i,n_i}$

• where $\mathbf{n}$ is the multi-index $(n_1, n_2, \ldots, n_k)$

– Each element $\mathcal{P}$ of $\mathcal{P}_{\mathbf{n}}$ may be considered as a vector of $\mathbb{R}^{\sum_{i=1}^{k}(n_i+1)}$ having components

• $(a_1, x_{11}, \ldots, x_{1,n_1-1}, b_1,\ a_2, x_{21}, \ldots, x_{2,n_2-1}, b_2,\ \ldots,\ a_k, x_{k1}, \ldots, x_{k,n_k-1}, b_k)$,

• under the constraints $a_i < x_{i1} < \cdots < x_{i,n_i-1} < b_i$ for every $i = 1, 2, \ldots, k$

– A partition $\mathcal{P}$ of $S_{\mathbb{P}_X}$ for which the mutual information loss is minimum

• solves the optimization problem $\min_{\mathcal{P}\in\mathcal{P}_{\mathbf{n}}}\left(I_\varphi(X_1, X_2, \ldots, X_k) - I_\varphi(\mathcal{P})\right)$, which is equivalent to:

– $\max_{\mathcal{P}\in\mathcal{P}_{\mathbf{n}}} I_\varphi(\mathcal{P}) = \max_{\mathcal{P}\in\mathcal{P}_{\mathbf{n}}} \displaystyle\sum_{j_1, j_2, \ldots, j_k}\varphi\!\left(\frac{\mathbb{P}_X\!\left(\times_{i=1}^{k}\gamma_{ij_i}\right)}{\prod_{i=1}^{k}\mathbb{P}_{X_i}(\gamma_{ij_i})}\right)\prod_{i=1}^{k}\mathbb{P}_{X_i}(\gamma_{ij_i})$

Page 13:

Computational Aspects and Examples

• Consider the case of a bivariate random vector $X = (X_1, X_2)$ with probability density function $f(x_1, x_2)$ whose support is $[0,1]^2$

• For each component, let respectively:

– $0 = x_{10} < x_{11} < x_{12} < \cdots < x_{1i} < \cdots < x_{1,p-1} < x_{1p} = 1$, and

– $0 = x_{20} < x_{21} < x_{22} < \cdots < x_{2j} < \cdots < x_{2,q-1} < x_{2q} = 1$

– be the end points of the intervals of two partitions of $[0,1]$ in respectively $p$ and $q$ elements

• For 𝑖 = 1,2, … , 𝑝 and 𝑗 = 1,2,… , 𝑞,

– the probability measure of a rectangle $[x_{1,i-1}, x_{1i}] \times [x_{2,j-1}, x_{2j}]$ is given by:

– $p_{ij} = \displaystyle\int_{x_{1,i-1}}^{x_{1i}}\!\int_{x_{2,j-1}}^{x_{2j}} f(x_1, x_2)\, dx_2\, dx_1$

– while its product probability measure is expressed as:

– $\displaystyle\int_{x_{1,i-1}}^{x_{1i}} f_1(x_1)\, dx_1 \times \int_{x_{2,j-1}}^{x_{2j}} f_2(x_2)\, dx_2 = p_{i+}\, p_{+j}$, with $p_{i+} = \displaystyle\sum_{j=1}^{q} p_{ij}$ and $p_{+j} = \displaystyle\sum_{i=1}^{p} p_{ij}$
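For a concrete density on $[0,1]^2$, the cell probabilities $p_{ij}$ and the marginal products $p_{i+}p_{+j}$ can be computed by numerical integration. A minimal sketch using scipy.integrate.dblquad; the helper name cell_probabilities and the toy density are illustrative, not from the talk:

```python
import numpy as np
from scipy import integrate

def cell_probabilities(f, x_cuts, y_cuts):
    """p_ij = integral of f(x1, x2) over [x_{1,i-1}, x_{1,i}] x [x_{2,j-1}, x_{2,j}],
    where x_cuts = (0 = x_10 < ... < x_1p = 1) and y_cuts = (0 = x_20 < ... < x_2q = 1)."""
    p, q = len(x_cuts) - 1, len(y_cuts) - 1
    P = np.empty((p, q))
    for i in range(p):
        for j in range(q):
            # dblquad integrates its first argument (here x2) over the inner range
            P[i, j], _ = integrate.dblquad(lambda x2, x1: f(x1, x2),
                                           x_cuts[i], x_cuts[i + 1],
                                           y_cuts[j], y_cuts[j + 1])
    return P

if __name__ == "__main__":
    f = lambda x1, x2: x1 + x2                      # a toy density on [0,1]^2 (it integrates to 1)
    P = cell_probabilities(f, [0.0, 0.5, 1.0], [0.0, 0.25, 1.0])
    p_row, p_col = P.sum(axis=1), P.sum(axis=0)     # p_{i+} and p_{+j}
    print(P)                                        # cell probabilities p_ij
    print(np.outer(p_row, p_col))                   # product probabilities p_{i+} * p_{+j}
```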

Page 14:

Computational Aspects and Examples

• The approximation of the mutual information between the random variables $X_1$ and $X_2$, conveyed by the discrete probability measure $\{p_{ij}\}$, is given by

– $\displaystyle\sum_{i=1}^{p}\sum_{j=1}^{q}\varphi\!\left(\frac{p_{ij}}{p_{i+}\, p_{+j}}\right) p_{i+}\, p_{+j}$

– For given $p$, $q$ and $f(x_1, x_2)$, one has to maximize the following expression:

– $\displaystyle\max_{\{x_{1i}\},\{x_{2j}\}} \sum_{i=1}^{p}\sum_{j=1}^{q}\varphi\!\left(\frac{\int_{x_{1,i-1}}^{x_{1i}}\int_{x_{2,j-1}}^{x_{2j}} f(x_1, x_2)\, dx_2\, dx_1}{\int_{x_{1,i-1}}^{x_{1i}} f_1(x_1)\, dx_1 \times \int_{x_{2,j-1}}^{x_{2j}} f_2(x_2)\, dx_2}\right) \int_{x_{1,i-1}}^{x_{1i}} f_1(x_1)\, dx_1 \times \int_{x_{2,j-1}}^{x_{2j}} f_2(x_2)\, dx_2$

– This maximization can be carried out by the well-known method of feasible directions of Zoutendijk [3] (see also Bertsekas [1]); a rough numerical sketch follows below
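As a rough, self-contained numerical sketch (an illustrative stand-in, not the authors' implementation): the cut points are reparameterized through positive increments so that the ordering constraints hold automatically, the cell probabilities $p_{ij}$ are approximated by a Riemann sum on a fixed grid, and a derivative-free Nelder-Mead search from scipy is used in place of the feasible-directions method. The names cuts, partition_information and optimal_partition are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

def cuts(z):
    """Map an unconstrained vector z (one entry per interval) to ordered
    interior cut points 0 < x_1 < ... < x_{n-1} < 1 of [0, 1]."""
    w = np.exp(z - z.max())                              # positive interval lengths (softmax trick)
    return np.cumsum(w / w.sum())[:-1]

def partition_information(f, xc, yc, phi, ngrid=200):
    """Approximate sum_ij phi(p_ij / (p_i+ p_+j)) * p_i+ * p_+j for interior cut
    points xc, yc, with p_ij from a Riemann sum on an ngrid x ngrid midpoint grid."""
    g = (np.arange(ngrid) + 0.5) / ngrid
    F = f(g[:, None], g[None, :]) / ngrid ** 2           # approximate mass of each grid cell
    ix, iy = np.digitize(g, xc), np.digitize(g, yc)      # partition cell of each grid point
    P = np.zeros((len(xc) + 1, len(yc) + 1))
    np.add.at(P, (ix[:, None], iy[None, :]), F)          # aggregate grid cells -> p_ij
    prod = np.outer(P.sum(axis=1), P.sum(axis=0))        # p_{i+} * p_{+j}
    m = prod > 0
    return float(np.sum(phi(P[m] / prod[m]) * prod[m]))

def optimal_partition(f, p, q, phi, restarts=10, seed=0):
    """Maximize the explained information over (p, q)-partitions of [0, 1]^2."""
    rng, best = np.random.default_rng(seed), None
    for _ in range(restarts):
        z0 = rng.normal(size=p + q)
        res = minimize(lambda z: -partition_information(f, cuts(z[:p]), cuts(z[p:]), phi),
                       z0, method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return cuts(best.x[:p]), cuts(best.x[p:]), -best.fun

if __name__ == "__main__":
    kl = lambda x: x * np.log(x)                         # phi(t) = t ln t (Kullback-Leibler)
    f = lambda x1, x2: x1 + x2                           # toy density on [0, 1]^2
    xc, yc, info = optimal_partition(f, 2, 2, kl)
    print("x cuts:", xc, "y cuts:", yc, "explained information:", info)
```

Because the grid-based objective is piecewise constant in the cut points, random restarts are used here; a finer grid or a dedicated method such as feasible directions would sharpen the result.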

Page 15:

Computational Aspects and Examples

• Example

– Let $X = (X_1, X_2) \sim \mathcal{E}_2(\theta)$, $-1 \le \theta \le 1$, be a bivariate exponential random vector whose probability density function is given by

– $f(x_1, x_2) = e^{-x_1 - x_2}\left[1 + \theta - 2\theta\left(e^{-x_1} + e^{-x_2} - 2e^{-x_1 - x_2}\right)\right]\mathbb{I}_{\mathbb{R}_+^2}(x_1, x_2)$

– Let $C(u_1, u_2)$ be its copula, whose probability density function $c(u_1, u_2)$ is

– $c(u_1, u_2) = \left[1 + \theta\,(1 - 2u_1)(1 - 2u_2)\right]\mathbb{I}_{[0,1]^2}(u_1, u_2)$

– This family of distributions is also known as the Farlie-Gumbel-Morgenstern (FGM) class
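Since the mutual information between $X_1$ and $X_2$ depends only on the copula, and the copula density above has uniform marginals on $[0, 1]$, the Kullback-Leibler mutual information of the pair is simply $\int_{[0,1]^2} c \ln c \; du_1 du_2$; the explained information $I_\varphi(\mathcal{P})$ of any partition stays below this value. A short numerical check, where the value $\theta = 0.8$ is only an illustration:

```python
import numpy as np
from scipy import integrate

theta = 0.8                                                     # illustrative value, -1 <= theta <= 1
c = lambda u1, u2: 1.0 + theta * (1 - 2 * u1) * (1 - 2 * u2)    # FGM copula density

# Both marginals of a copula are uniform on [0, 1], so with phi(t) = t ln t the
# mutual information is the integral of c * ln(c) over the unit square.
mi, _ = integrate.dblquad(lambda u2, u1: c(u1, u2) * np.log(c(u1, u2)), 0, 1, 0, 1)
print(f"KL mutual information of the FGM copula, theta = {theta}: {mi:.5f}")
```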

Page 16:

Conclusions and Perspectives

• In data mining, the choice of a parametric statistical model is often unrealistic because of the huge number of variables and observations; in such cases a nonparametric framework is more appropriate

• To estimate the probability density function of a random vector, we will use a kernel density estimator in order to evaluate the mutual information between its components and study the effects of the choice of the kernel on the robustness of the optimal partition

• In Multiple Correspondence Analysis (MCA) and in classification, we often have to deal simultaneously with continuous and categorical variables, and it may be of interest to use an optimal partition in order to retain, as much as possible, the stochastic dependence between the random variables. We will explore the consequences of the choices of 𝜑 and of an optimal partition 𝒫∗ on these models

• Finally, we will develop user-friendly software to perform optimal coding in the nonparametric and semiparametric cases

Page 17:

References

[1] D.P. Bertsekas, Nonlinear Programming, 2nd ed., Athena Scientific, Belmont, MA, 1999

[2] I. Csiszár, Information-type measures of difference of probability distributions and indirect observations, Studia Scientiarum Mathematicarum Hungarica, 2 (1967), 299-318

[3] G. Zoutendijk, Methods of Feasible Directions, Elsevier, Amsterdam, and D. Van Nostrand, Princeton, NJ, 1960
