PhD Completion Seminar


Transcript of PhD Completion Seminar

Page 1: PhD Completion Seminar


Simone Romano’s PhD Completion Seminar

Design and Adjustment of Dependency Measures Between Variables

November 30th 2015

Supervisor: Prof. James Bailey

Co-Supervisor: A/Prof. Karin Verspoor

Computing and Information Systems (CIS)

Simone Romano, University of Melbourne

Page 2: PhD Completion Seminar


Background
- Examples of Applications
- Categories of Dependency Measures
- Thesis Motivation

Ranking Dependencies in Noisy Data
- Motivation
- Design of the Randomized Information Coefficient (RIC)
- Comparison Against Other Measures

A Framework for Adjusting Dependency Measures
- Motivation
- Adjustment for Quantification
- Adjustment for Ranking

Adjustments for Clustering Comparison Measures
- Motivation
- Detailed Analysis of Contingency Tables
- Application Scenarios

Conclusions


Page 3: PhD Completion Seminar


Examples of Applications

Dependency Measures

A dependency measure D is used to assess the amount of dependency between variables:

Example 1: After collecting weight and height for many people, we can compute D(weight, height)

Example 2: Assess the amount of dependency between search queries in Google

https://www.google.com/trends/correlate/

They are fundamental for a number of applications in machine learning / data mining.


Page 4: PhD Completion Seminar


Examples of Applications

Applications of Dependency Measures

Supervised learning

- Feature selection [Guyon and Elisseeff, 2003];
- Decision tree induction [Criminisi et al., 2012];
- Evaluation of classification accuracy [Witten et al., 2011].

Unsupervised learning

- External clustering validation [Strehl and Ghosh, 2003];
- Generation of alternative or multi-view clusterings [Muller et al., 2013, Dang and Bailey, 2015];
- The exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014].

Exploratory analysis

- Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013];
- Analysis of neural time-series data [Cohen, 2014].


Page 5: PhD Completion Seminar


Examples of Applications

Application example (1): feature selection / decision tree induction

Application: Identify if the class C is dependent on a feature F

Toy Example: Is the class C = cancer dependent on the feature F = smoker, according to this data set of 20 patients?

Use of dependency measure: Compute D(F,C)

Smoker  Cancer
No      -
Yes     +
Yes     +
Yes     -
No      +
No      -
Yes     +
...     ...
Yes     +

The contingency table is a useful tool: it counts the co-occurrences of feature values and class values.

            +    -   total
Smoker      6    2     8
Non smoker  4    8    12
total      10   10    20

⇒ if C and F are dependent, then induce a split in the decision tree (smoker: yes / no). A sketch of computing D(F, C) on this table follows below.
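To make the computation concrete, here is a minimal sketch of one possible choice of D on a contingency table: the plug-in mutual information between feature and class. The function name is illustrative, not from the thesis code.

```python
import numpy as np

def mutual_information(table):
    """Plug-in mutual information (in nats) from a contingency table."""
    p = table / table.sum()              # joint probabilities p(f, c)
    pf = p.sum(axis=1, keepdims=True)    # marginal p(f)
    pc = p.sum(axis=0, keepdims=True)    # marginal p(c)
    nz = p > 0                           # skip empty cells (0 log 0 = 0)
    return float((p[nz] * np.log(p[nz] / (pf * pc)[nz])).sum())

# Smoker/cancer table from the slide: rows = {smoker, non smoker}, cols = {+, -}
table = np.array([[6, 2],
                  [4, 8]])
print(mutual_information(table))  # a value > 0 suggests dependency between F and C
```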



Page 7: PhD Completion Seminar


Examples of Applications

Application example (2): external clustering validation

Application: Compare a clustering solution B to a reference clustering A.

Toy Example: N = 15 data points

reference clustering A with 2 clusters, stars and circles

clustering solution B with 2 clusters, red and blue


Page 8: PhD Completion Seminar


Examples of Applications

Use of dependency measure: Compute D(A,B)

Once again, the contingency table is a useful tool that assesses the amount of overlap between A and B:

              B: red  B: blue  total
A: cluster 1     4       4       8
A: cluster 2     2       5       7
total            6       9      15


Page 9: PhD Completion Seminar


Examples of Applications

Application example (3): genetic network inference

Application: Identify if the gene G1 is interacting with the gene G2

Toy Example: We have a time series of values for each G1 and G2:

Use of dependency measure: Compute D(G1,G2)

time  G1       G2
t1    20.4400  19.7450
t2    19.0750  20.3300
t3    20.0650  20.1700
...   ...      ...

[Figure: time series of G1 and G2 over time 0 to 140]

Here there is no contingency table because the variables are numerical.

[Figure: scatter plot of G1 vs G2]


Page 10: PhD Completion Seminar


Categories of Dependency Measures

Dependency measures can be divided into two categories: measures between categorical variables and measures between numerical variables.

Between Categorical Variables

These measures can be computed naturally on a contingency table. For example on:

Decision trees (the smoker vs. cancer contingency table from Example 1) and clustering comparisons (the A vs. B contingency table from Example 2); both tables are shown in the earlier slides.

- Information theoretic [Cover and Thomas, 2012]: e.g. mutual information (a.k.a. information gain)
- Based on pair-counting [Albatineh et al., 2006]: e.g. Rand Index, Jaccard similarity
- Based on set-matching [Meila, 2007]: e.g. classification accuracy, agreement between annotators
- Others, mostly employed as splitting criteria [Kononenko, 1995]: e.g. Gini gain, Chi-square.


Page 11: PhD Completion Seminar


Categories of Dependency measures

Between Numerical Variables

No contingency table. For example, biological interaction:

[Figure: scatter plot of G1 vs G2]

- Estimators of mutual information [Khan et al., 2007]: e.g. kNN estimator, kernel estimator, estimators based on grids
- Correlation based: e.g. Pearson's correlation, distance correlation [Szekely et al., 2009], randomized dependence coefficient [Lopez-Paz et al., 2013]
- Kernel based: e.g. Hilbert-Schmidt Independence Criterion [Gretton et al., 2005]
- Based on information theory: e.g. the Maximal Information Coefficient (MIC) [Reshef et al., 2011], the mutual information dimension [Sugiyama and Borgwardt, 2013], the total information coefficient [Reshef et al., 2015].


Page 12: PhD Completion Seminar


Thesis Motivation

Thesis Motivation

Even if a dependency measure D has nice theoretical properties, dependencies are estimated on finite data with an estimate $\hat{D}$.

The following goals of dependency measures are challenging:

Detection: Test for the presence of dependency. E.g. test dependence between two genes, Example (3)

Quantification: Summarization of the amount of dependency in an interpretable fashion. E.g. assessing the amount of overlap between two clusterings, Example (2)

Ranking: Sort the relationships of different variables. E.g. ranking many features in decision trees, Example (1)

To improve performance on the three goals above, we need information on the distribution of $\hat{D}$.


Page 13: PhD Completion Seminar


Thesis Motivation

For Example, when Ranking Noisy Relationships

The distribution of $\hat{D}(X,Y)$ when the relationship between X and Y is noisy should not overlap with the distribution of $\hat{D}(X,Y)$ on a noiseless relationship.


Page 14: PhD Completion Seminar


Background
- Examples of Applications
- Categories of Dependency Measures
- Thesis Motivation

Ranking Dependencies in Noisy Data
- Motivation
- Design of the Randomized Information Coefficient (RIC)
- Comparison Against Other Measures

A Framework for Adjusting Dependency Measures
- Motivation
- Adjustment for Quantification
- Adjustment for Ranking

Adjustments for Clustering Comparison Measures
- Motivation
- Detailed Analysis of Contingency Tables
- Application Scenarios

Conclusions


Page 15: PhD Completion Seminar


Motivation

Motivation

Mutual information I(X, Y) is good for ranking relationships with different levels of noise between variables: high I ⇒ little noise; small I ⇒ much noise.

It can also be computed between sets of variables, e.g. I(X, Y) = I({X1, X2}, Y) = I({weight, height}, BMI).

Mutual Information quantifies the information shared between two variables

$$\mathrm{MI}(X,Y) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f_{X,Y}(x,y)\,\log\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)}\,dx\,dy$$

Importance of MI
It is based on a well-established theory and quantifies non-linear interactions which might be missed if, e.g., Pearson's correlation coefficient r(X, Y) is used.


Page 16: PhD Completion Seminar


Motivation

Estimation of Mutual Information

Many estimators of mutual information:

Acronym | Type                              | Sets of vars. | Best compl. | Worst compl.
Iew     | Discretization, equal width       | no            | O(n^1.5)    | O(n^1.5)
Ief     | Discretization, equal frequency   | no            | O(n^1.5)    | O(n^1.5)
IA      | Adaptive partitioning             | no            | O(n^1.5)    | O(n^1.5)
Imean   | Mean nearest neighbours           | yes           | O(n^2)      | O(n^2)
IKDE    | Kernel density estimation         | yes           | O(n^2)      | O(n^2)
IkNN    | Nearest neighbours                | yes           | O(n^1.5)    | O(n^2)

Discretization-based estimators of mutual information exhibit good complexity but are not applicable to sets of variables.


Page 17: PhD Completion Seminar


Motivation

Discretization-based estimators use fixed grids and compute mutual information on a contingency table.

[Figure: scatter plot of X vs Y overlaid with a fixed grid]

For example, Iew discretizes using equal-width binning.

Discretized X gives the columns b1, ..., bj, ..., bc; discretized Y gives the rows a1, ..., ai, ..., ar; the cell counts are n11, ..., nij, ..., nrc.

nij counts the number of points in a particular bin. Mutual information can be computed with:

$$I_{ew}(X,Y) = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{n_{ij}}{N}\,\log\frac{n_{ij}\,N}{a_i\,b_j}$$

A sketch of this estimator follows below.
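A minimal sketch of the estimator above, assuming equal-width bins and natural logarithms; the number of bins per axis is an illustrative choice, not prescribed by the slide:

```python
import numpy as np

def i_ew(x, y, bins=8):
    """Equal-width discretization estimate of mutual information."""
    # contingency table of the two discretized variables
    table, _, _ = np.histogram2d(x, y, bins=bins)
    N = table.sum()
    a = table.sum(axis=1, keepdims=True)   # row marginals a_i
    b = table.sum(axis=0, keepdims=True)   # column marginals b_j
    nz = table > 0                         # skip empty cells (0 log 0 = 0)
    return float((table[nz] / N * np.log(table[nz] * N / (a * b)[nz])).sum())
```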


Page 18: PhD Completion Seminar


Motivation

Criticism

The discretization approach is less popular for numerical variables because:

- There is a systematic estimation bias which depends on the grid size

However, when comparing dependencies, systematic estimation biases cancel each other out [Kraskov et al., 2004, Margolin et al., 2006, Schaffernicht et al., 2010]

Thus, not too bad for comparing/ranking relationships!


Page 19: PhD Completion Seminar


Motivation

Comparing relationships / comparing estimations of I

Task: Given a strong relationship s and a weak relationship w, compare the estimates $\hat{I}_s$ and $\hat{I}_w$ of the true values $I_s$ and $I_w$

- Systematic biases cancel out when comparing relationships (systematic biases translate the distributions by a fixed amount)
- It is beneficial to reduce the variance

Challenge: Decreasing the variance of the estimation



Page 21: PhD Completion Seminar


Design of the Randomized Information Coefficient (RIC)

Randomized Information Coefficient (RIC)

Idea:
- Generate many random grids with different cardinalities by random cut-offs
- Estimate the normalized mutual information on each of them (normalized because of the different cardinalities)
- Average

[Figure: three random grids over the same scatter plot of X vs Y; their normalized mutual information values are averaged]

Parameters:
- Kr: tunes the number of random grids
- Dmax: tunes the maximum grid cardinality generated

Features:
- Proved to decrease the variance, like in random forests [Geurts, 2002]
- Still good complexity, O(n^1.5)
- Easy to extend to sets of variables

A minimal sketch of the idea is given below.
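The following sketch assumes random cut-offs drawn from the data, a geometric-mean normalization of MI, and uniformly random grid cardinalities up to Dmax; none of these specific choices are prescribed by the slide.

```python
import numpy as np

def normalized_mi(xb, yb):
    """Normalized mutual information of two discretized variables."""
    table = np.zeros((xb.max() + 1, yb.max() + 1))
    np.add.at(table, (xb, yb), 1)              # contingency table of bin labels
    p = table / table.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    mi = (p[nz] * np.log(p[nz] / np.outer(px, py)[nz])).sum()
    hx = -(px[px > 0] * np.log(px[px > 0])).sum()
    hy = -(py[py > 0] * np.log(py[py > 0])).sum()
    return mi / max(np.sqrt(hx * hy), 1e-12)   # normalize by grid entropies

def ric(x, y, kr=200, d_max=20, seed=None):
    """Average normalized MI over kr random grids built from random cut-offs."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(kr):
        dx, dy = rng.integers(2, d_max + 1, size=2)              # random cardinalities
        cx = np.sort(rng.choice(x, size=dx - 1, replace=False))  # random cut-offs
        cy = np.sort(rng.choice(y, size=dy - 1, replace=False))
        total += normalized_mi(np.searchsorted(cx, x), np.searchsorted(cy, y))
    return total / kr
```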


Page 22: PhD Completion Seminar


Design of the Randomized Information Coefficient (RIC)

Random discretization of sets of variables

Relationship between Y and X = {X1, X2}

[Figure: 3D plot of Y against (X1, X2), and 2D plot of Y against X' = (X1 + X2)/2]

Need to randomly discretize X ⇒ just choose some random seeds (a sketch of the assignment follows below):

[Figure: random seed points partitioning the (X1, X2) plane into cells]
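One way to read this, as a minimal sketch: draw random seed points and assign each data point to its nearest seed, which induces a Voronoi-like random partition of the multivariate space. Drawing the seeds from the data itself is an assumption of this sketch.

```python
import numpy as np

def seed_discretize(X, n_seeds, seed=None):
    """Assign each row of X (n x d) to its nearest random seed point."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=n_seeds, replace=False)]
    # squared Euclidean distance of every point to every seed
    d2 = ((X[:, None, :] - seeds[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)  # cell label per point
```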


Page 23: PhD Completion Seminar


Design of the Randomized Information Coefficient (RIC)

Detection of Relationship

Task: Using a permutation test, identify if a relationship exists:

- Generate 500 values of RIC under complete noise
- Sort the values and identify the value x of RIC at position 500 × 95% = 475
- Generate 500 values of RIC under a particular relationship
- Count how many values are greater than x

⇒ the bigger the count, the bigger the power of RIC

[Figure: test relationships (Linear, Quadratic, Cubic, Sinusoidal low freq., Sinusoidal high freq., 4th Root, Circle, Step Function, Two Lines, X, Sinusoidal varying freq., Circle-bar) at noise levels 1, 6, 11, 16, 21, 26]

Tested on many relationships and levels of noise. A sketch of the power computation follows below.
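A minimal sketch of this procedure; dep_measure, make_noise, and make_rel are placeholder callables standing in for RIC and the data generators:

```python
import numpy as np

def estimate_power(dep_measure, make_noise, make_rel, trials=500, level=0.95):
    """Monte Carlo power of a dependency measure against a complete-noise null."""
    # null distribution: the measure under complete noise
    null = np.sort([dep_measure(*make_noise()) for _ in range(trials)])
    x = null[int(trials * level) - 1]           # e.g. position 475 of 500
    # count how many values under the relationship exceed the null threshold
    hits = sum(dep_measure(*make_rel()) > x for _ in range(trials))
    return hits / trials                        # the bigger, the more power
```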


Page 24: PhD Completion Seminar


Design of the Randomized Information Coefficient (RIC)

Power as the number of random grids increases

Kr tunes the number of random grids

[Figure: area under the power curve (0 to 1) vs the parameter Kr (50 to 200); RIC optimum at Kr = 200]

Figure: Average power for each relationship; every line is a relationship

More random grids ⇒ less estimation variance ⇒ more power


Page 25: PhD Completion Seminar


Comparison Against Other Measures

Comparison with Other Measures

Extensively compared with other measures on the task of relationship detection

[Figure: bar chart of average rank-power (0 to 14) for RIC, TICe, IKDE, dCorr, HSIC, RDC, MIC, IkNN, Ief, GMIC, Iew, r2, IA, ACE, Imean, MID]

Figure: Average rank across relationships (e.g. a measure ranks 1st when its power is max on a relationship)


Page 26: PhD Completion Seminar


Comparison Against Other Measures

Comparison: Biological Network Inference

Reverse engineering of networks of genes when the ground truth is known

[Figure: bar chart of average rank of mean average precision (3 to 13) for RIC, dCorr, IKDE, IkNN, HSIC, ACE, r2, GMIC, Ief, IA, RDC, Iew, Imean, MIC, MID]

Figure: Average rank across networks (e.g. a measure ranks 1st when its average precision is max on a network)

Also compared on:

- Feature filtering for regression
- Feature selection for regression

RIC shows competitive performance


Page 27: PhD Completion Seminar


Comparison Against Other Measures

Conclusion - Message

We proposed the Randomized Information Coefficient (RIC)

- Reduces the variance of grid-based normalized mutual information when comparing relationships
- Randomly discretizes multiple variables

Take-away message:

- There are different ways to generate random grids (random cut-offs / random seeds)
- The more grids, the smaller the variance

The Randomized Information Coefficient: Ranking Dependencies in Noisy Data, Simone Romano, James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Under review at the Machine Learning Journal.


Page 28: PhD Completion Seminar


Comparison Against Other Measures

Hypothesis so far...

So far we compared numerical variables on samples of fixed size n

Dependency measures might have biases if they:

- Compare samples with different n
- Compare categorical variables

Need for adjustment in these cases



Page 30: PhD Completion Seminar


Background
- Examples of Applications
- Categories of Dependency Measures
- Thesis Motivation

Ranking Dependencies in Noisy Data
- Motivation
- Design of the Randomized Information Coefficient (RIC)
- Comparison Against Other Measures

A Framework for Adjusting Dependency Measures
- Motivation
- Adjustment for Quantification
- Adjustment for Ranking

Adjustments for Clustering Comparison Measures
- Motivation
- Detailed Analysis of Contingency Tables
- Application Scenarios

Conclusions


Page 31: PhD Completion Seminar


Motivation

Motivation for Adjustment for Quantification

Pearson's correlation between two variables X and Y estimated on a data sample Sn = {(xk, yk)} of n data points:

$$r(S_n \mid X,Y) \triangleq \frac{\sum_{k=1}^{n}(x_k-\bar{x})(y_k-\bar{y})}{\sqrt{\sum_{k=1}^{n}(x_k-\bar{x})^2\,\sum_{k=1}^{n}(y_k-\bar{y})^2}} \qquad (1)$$

[Figure: scatter plots with Pearson correlations 1, 0.8, 0.4, 0, -0.4, -0.8, -1, plus perfectly correlated lines (r = 1 or -1) and patterns with r = 0]

Figure: From https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

r²(Sn | X, Y) can be used as a proxy for the amount of noise in linear relationships:

- 1 if noiseless
- 0 if complete noise


Page 32: PhD Completion Seminar


Motivation

The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has 499 citations to date according to Google Scholar.

MIC(X, Y) can be used as a proxy for the amount of noise in functional relationships:

Figure: From the supplementary material online in [Reshef et al., 2011]

MIC should be equal to:

- 1 if the relationship between X and Y is functional and noiseless
- 0 if there is complete noise


Page 33: PhD Completion Seminar


Motivation

Challenge

Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.

We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:

[Figure: null distributions of MIC(S20 | X, Y) and MIC(S80 | X, Y), spread over roughly 0.2 to 1]

Values can be high because of chance! The user expects values close to 0 in both cases

Challenge: Adjust the estimated MIC to better exploit the range [0, 1]


Page 34: PhD Completion Seminar


Adjustment for Quantification

Adjustment for Chance

We define a framework for adjustment:

Adjustment for Quantification:

$$A_D \triangleq \frac{D - E[D_0]}{\max D - E[D_0]}$$

It uses the distribution $D_0$ of the measure under independent variables:

- $r^2_0$: Beta distribution
- $\mathrm{MIC}_0$: can be computed using Monte Carlo permutations.

Used in κ-statistics. Its application is beneficial to other dependency measures:

- Adjusted r² ⇒ Ar²
- Adjusted MIC ⇒ AMIC

A minimal sketch of the Monte Carlo version is given below.
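The following sketch assumes E[D0] is estimated by Monte Carlo permutations of one variable and that the measure's maximum is known (1 for r² and MIC); the function name is illustrative:

```python
import numpy as np

def adjust_for_quantification(dep_measure, x, y, n_perm=1000, d_max=1.0,
                              seed=None):
    """A_D = (D - E[D0]) / (max D - E[D0]), with E[D0] estimated by
    permuting y to simulate independent variables."""
    rng = np.random.default_rng(seed)
    d = dep_measure(x, y)
    null = [dep_measure(x, rng.permutation(y)) for _ in range(n_perm)]
    e_d0 = np.mean(null)
    return (d - e_d0) / (d_max - e_d0)
```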


Page 35: PhD Completion Seminar


Adjustment for Quantification

Adjusted measures enable better interpretability

Task: Obtain 1 for a noiseless relationship, and 0 for complete noise (on average).

Noise:  0%   20%    40%    60%    80%     100%
r²:     1    0.66   0.39   0.2    0.073   0.035
Ar²:    1    0.65   0.37   0.17   0.044   0.00046

Figure: Ar² becomes zero on average at 100% noise

Noise:  0%   20%    40%    60%    80%     100%
MIC:    1    0.7    0.47   0.34   0.27    0.26
AMIC:   1    0.6    0.29   0.11   0.021   0.0014

Figure: AMIC becomes zero on average at 100% noise


Page 36: PhD Completion Seminar


Adjustment for Quantification

Not biased towards small sample size n

Average value of D for different % of noise ⇒ estimates can be high because of chance at small n (e.g. because of missing values)

[Figure: average value (0 to 1) vs noise level (0 to 100) for raw r² (n = 10, 20, 30, 40, 100, 200), raw MIC (n = 20, 40, 60, 80), Ar² (adjusted), and AMIC (adjusted)]



Page 38: PhD Completion Seminar


Adjustment for Ranking

Motivation for Adjustment for Ranking

Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2, defined as follows:

- X1 ≡ patient had breakfast today, X1 = {yes, no};
- X2 ≡ patient eye color, X2 = {green, blue, brown};

[Figure: category proportions of X1 (yes/no) and X2 (green/blue/brown)]

Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories

This still happens because of finite samples!



Page 40: PhD Completion Seminar


Adjustment for Ranking

Selection bias experiment

Experiment: n = 100 data points, class C with 2 categories:

- Generate a variable X1 with 2 categories (independently from C)
- Generate a variable X2 with 3 categories (independently from C)

Compute Gini(X1, C) and Gini(X2, C). Give a win to the variable that gets the highest value.

REPEAT 10,000 times

[Figure: probability of selection of X1 vs X2]

Result: X2 gets selected 70% of the time (Bad). Given that they are equally unpredictive, we expected 50%.

Challenge: adjust the estimated Gini gain to obtain an unbiased ranking. A minimal simulation of this experiment is sketched below.
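A minimal sketch of the experiment as described; the Gini gain implementation and the uniform category generation are assumptions of this sketch:

```python
import numpy as np

def gini_gain(x, c):
    """Gini gain of class c given a categorical feature x."""
    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - (p ** 2).sum()
    g = gini(c)
    for v in np.unique(x):
        mask = x == v
        g -= mask.mean() * gini(c[mask])   # weighted child impurity
    return g

rng = np.random.default_rng(0)
wins = 0
for _ in range(10_000):
    c = rng.integers(0, 2, size=100)       # class, 2 categories
    x1 = rng.integers(0, 2, size=100)      # 2 categories, independent of c
    x2 = rng.integers(0, 3, size=100)      # 3 categories, independent of c
    wins += gini_gain(x2, c) > gini_gain(x1, c)
print(f"X2 selected {wins / 10_000:.0%} of the time")  # biased well above 50%
```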



Page 45: PhD Completion Seminar


Adjustment for Ranking

Adjustment for Ranking

We propose two adjustments for ranking:

Standardization

$$S_D \triangleq \frac{D - E[D_0]}{\sqrt{\mathrm{Var}(D_0)}}$$

Quantifies statistical significance, like a p-value.

Adjustment for Ranking

$$A_D(\alpha) \triangleq D - q_0(1-\alpha)$$

Penalizes on statistical significance according to α, where q_0 is the quantile function of the distribution D_0 (small α, more penalization). A sketch of both adjustments is given below.
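A minimal sketch of both adjustments, assuming the null distribution D0 is approximated by Monte Carlo permutations:

```python
import numpy as np

def ranking_adjustments(dep_measure, x, y, alpha=0.05, n_perm=1000, seed=None):
    """Standardized (SD) and ranking-adjusted (AD(alpha)) values of a
    dependency measure, using a permutation null for D0."""
    rng = np.random.default_rng(seed)
    d = dep_measure(x, y)
    null = np.array([dep_measure(x, rng.permutation(y))
                     for _ in range(n_perm)])
    sd = (d - null.mean()) / max(null.std(), 1e-12)  # z-score against the null
    ad = d - np.quantile(null, 1 - alpha)            # penalize by null quantile
    return sd, ad
```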


Page 46: PhD Completion Seminar


Adjustment for Ranking

Standardized Gini (SGini) corrects for Selection bias

Select unpredictive features X1 with 2 categories and X2 with 3 categories.

[Figure: probability of selection of X1 vs X2]

Experiment: X1 and X2 each get selected on average almost 50% of the time (Good)

Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].

Nonetheless: we found that this is a simplistic scenario


Page 47: PhD Completion Seminar


Adjustment for Ranking

Standardized Gini (SGini) might be biased

Fix the predictiveness of features X1 and X2 to a constant ≠ 0

[Figure: probability of selection of X1 vs X2]

Experiment: SGini becomes biased towards X1 because it is more statistically significant (Bad)

This behavior has been overlooked in the decision tree community

Use AD(α) to penalize less, or even to tune the bias! ⇒ AGini(α)



Page 49: PhD Completion Seminar


Adjustment for Ranking

Application to random forest

Why random forest? It is a good classifier to try first when there are “meaningful” features [Fernandez-Delgado et al., 2014].

Plug in different splitting criteria.

Experiment: 19 data sets with categorical variables

[Figure: mean AUC (90 to 91.5) vs α (0 to 0.8) for AGini(α), compared with SGini and Gini]

Figure: Using the same α for all data sets

And α can be tuned for each data set with cross-validation.


Page 50: PhD Completion Seminar


Adjustment for Ranking

Conclusion - Message

Dependency estimates can be high because of chance under finite samples.

Adjustments can help for:

Quantification, to have an interpretable value in [0, 1]

Ranking, to avoid biases towards:

- missing values
- categorical variables with more categories

A Framework to Adjust Dependency Measure Estimates for Chance, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. Under submission at the SIAM International Conference on Data Mining 2016 (SDM-16). Arxiv: http://arxiv.org/abs/1510.07786


Page 51: PhD Completion Seminar


Background
- Examples of Applications
- Categories of Dependency Measures
- Thesis Motivation

Ranking Dependencies in Noisy Data
- Motivation
- Design of the Randomized Information Coefficient (RIC)
- Comparison Against Other Measures

A Framework for Adjusting Dependency Measures
- Motivation
- Adjustment for Quantification
- Adjustment for Ranking

Adjustments for Clustering Comparison Measures
- Motivation
- Detailed Analysis of Contingency Tables
- Application Scenarios

Conclusions


Page 52: PhD Completion Seminar


Motivation

Clustering Validation

Given a reference clustering V (stars/circles), we want to validate the clustering solution U (blue/red)

⇒ we need dependency measures

There are two very popular measures based on adjustments:

- The Adjusted Rand Index (ARI) [Hubert and Arabie, 1985], ~3,000 citations
- The Adjusted Mutual Information (AMI) [Vinh et al., 2009], ~200 citations

No clear connection between them; users use them both



Page 54: PhD Completion Seminar


Motivation

Both computed on a contingency table

Notation: contingency table M, where $a_i = \sum_j n_{ij}$ are the row marginals and $b_j = \sum_i n_{ij}$ are the column marginals. Rows a1, ..., ar index the clusters of U; columns b1, ..., bc index the clusters of V; cell n_ij counts the objects shared by cluster i of U and cluster j of V.

ARI, the adjustment of the Rand Index (RI), based on counting pairs of objects:

$$\mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max \mathrm{RI} - E[\mathrm{RI}]}$$

AMI, the adjustment of Mutual Information (MI), based on information theory:

$$\mathrm{AMI} = \frac{\mathrm{MI} - E[\mathrm{MI}]}{\max \mathrm{MI} - E[\mathrm{MI}]}$$

A pair-counting sketch of ARI follows below.
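For concreteness, a minimal sketch of ARI computed directly from a contingency table via its standard pair-counting closed form [Hubert and Arabie, 1985]; this is the textbook formula, not the thesis code (AMI is omitted because E[MI] under the permutation model is more involved):

```python
import numpy as np
from math import comb

def adjusted_rand_index(M):
    """ARI from a contingency table M (r x c array of counts)."""
    n = int(M.sum())
    sum_ij = sum(comb(int(x), 2) for x in M.ravel())      # pairs within cells
    sum_a = sum(comb(int(x), 2) for x in M.sum(axis=1))   # pairs within rows
    sum_b = sum(comb(int(x), 2) for x in M.sum(axis=0))   # pairs within columns
    expected = sum_a * sum_b / comb(n, 2)                 # chance-expected term
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)
```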


Page 55: PhD Completion Seminar


Motivation

Link: generalized information theory

Generalized information theory is based on the Tsallis q-entropy:

$$H_q(V) \triangleq \frac{1}{q-1}\left(1 - \sum_j \left(\frac{b_j}{N}\right)^q\right)$$

which generalizes Shannon's entropy:

$$\lim_{q\to 1} H_q(V) = H(V) \triangleq -\sum_j \frac{b_j}{N}\,\log\frac{b_j}{N}$$

Link between measures: the mutual information MI_q based on the Tsallis q-entropy links RI and MI:

$$\mathrm{MI}_{q=2} \propto \mathrm{RI}, \qquad \lim_{q\to 1} \mathrm{MI}_q = \mathrm{MI}$$

Challenge: Compute E[MI_q] to connect ARI and AMI

Challenge 2.0: Compute Var(MI_q) for standardization
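A minimal numeric sketch of the q-entropy and its Shannon limit; the cluster sizes are illustrative:

```python
import numpy as np

def tsallis_entropy(counts, q):
    """Tsallis q-entropy of a clustering given its cluster sizes."""
    p = np.asarray(counts, dtype=float)
    p /= p.sum()
    if abs(q - 1) < 1e-9:                 # q -> 1 recovers Shannon entropy
        return -(p[p > 0] * np.log(p[p > 0])).sum()
    return (1.0 - (p ** q).sum()) / (q - 1)

sizes = [10, 10, 10, 70]
print(tsallis_entropy(sizes, 2))       # q = 2: the RI-related entropy
print(tsallis_entropy(sizes, 1.0001))  # numerically close to Shannon
print(tsallis_entropy(sizes, 1))       # exact Shannon limit
```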



Page 58: PhD Completion Seminar


Motivation

We propose a technique applicable to a broader class of measures. We can do:

- Exact computation of measures in $L_\phi$, where $S \in L_\phi$ is a linear function of the entries of the contingency table:

$$S = \alpha + \beta \sum_{ij} \phi_{ij}(n_{ij})$$

(α and β are constants)

- Asymptotic approximation of measures in $N_\phi$ (non-linear)

[Figure: Venn diagram of the families of measures we can adjust: Rand Index (RI), Jaccard (J), MI, NMI, VI, and the generalized information theoretic family]

Figure: Families of measures we can adjust


Page 59: PhD Completion Seminar


Detailed Analysis of Contingency Tables

Exact Expected Value by Permutation Model

E[S] is obtained by summation over all possible contingency tables M obtained by permutations:

$$E[S] = \sum_M S(M)\,P(M) = \alpha + \beta \sum_M \sum_{ij} \phi_{ij}(n_{ij})\,P(M)$$

- There is no method to exhaustively generate the tables M fixing the marginals
- Extremely time expensive (permutations, O(N!))

However, it is possible to swap the inner summation with the outer summation:

$$\underbrace{\sum_M \sum_{i,j}}_{\text{to swap}} \phi_{ij}(n_{ij})\,P(M) = \sum_{i,j} \underbrace{\sum_{n_{ij}}}_{\text{swapped}} \phi_{ij}(n_{ij})\,P(n_{ij})$$

- n_ij has a known hypergeometric distribution
- Computation time dramatically reduced! ⇒ O(max{rN, cN})

A minimal sketch of this computation is given below.
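A minimal sketch of the swapped computation, assuming for simplicity a single vectorized φ shared across all cells; under the permutation model, n_ij is hypergeometric with population N, a_i success states, and b_j draws (this naive double loop is O(rcN), not the optimized form):

```python
import numpy as np
from scipy.stats import hypergeom

def expected_linear_measure(a, b, phi, alpha=0.0, beta=1.0):
    """E[S] for S = alpha + beta * sum_ij phi(n_ij) under the permutation model."""
    N = sum(a)
    total = 0.0
    for ai in a:
        for bj in b:
            lo, hi = max(0, ai + bj - N), min(ai, bj)   # support of n_ij
            ks = np.arange(lo, hi + 1)
            pmf = hypergeom.pmf(ks, N, ai, bj)          # P(n_ij = k)
            total += (pmf * phi(ks)).sum()
    return alpha + beta * total

# usage: e.g. phi(k) = number of co-clustered pairs in a cell
print(expected_linear_measure([8, 7, 7, 78], [10, 10, 10, 70],
                              lambda k: k * (k - 1) / 2))
```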


Page 60: PhD Completion Seminar


Detailed Analysis of Contingency Tables

Exact Variance Computation

We have to compute the second moment E[S²], which requires:

$$\sum_M \Big(\sum_{i=1}^{r}\sum_{j=1}^{c} \phi_{ij}(n_{ij})\Big)^2 P(M) = \underbrace{\sum_M \sum_{i,j,i',j'}}_{\text{to swap}} \phi_{ij}(n_{ij})\,\phi_{i'j'}(n_{i'j'})\,P(M) = \sum_{i,j,i',j'} \underbrace{\sum_{n_{ij}}\sum_{n_{i'j'}}}_{\text{swapped}} \phi_{ij}(n_{ij})\,\phi_{i'j'}(n_{i'j'})\,P(n_{ij}, n_{i'j'})$$

Contribution: the computation of P(n_ij, n_i'j') is technically challenging. We use the hypergeometric model: draws from an urn with N marbles of 3 colors (red, blue, and white).



Page 62: PhD Completion Seminar


Detailed Analysis of Contingency Tables

Finally we can define the adjustments...

Definition: Adjusted Mutual Information q (AMI_q)

$$\mathrm{AMI}_2 = \mathrm{ARI}, \qquad \lim_{q\to 1} \mathrm{AMI}_q = \mathrm{AMI}$$

We can finally relate ARI and AMI to generalized information theory!

Also define: a generalized Standardized Mutual Information q (SMI_q) for selection bias.

Their complexities:

Name | Computational complexity
AMI  | O(max{rN, cN})
SMI  | O(max{rcN³, c²N³})

Table: Complexity when comparing two clusterings: N objects, r and c numbers of clusters



Page 64: PhD Completion Seminar


Application Scenarios

Application Scenarios

Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2

Example: Do you prefer U1 or U2?

U1 vs V (cluster sizes of V: 10, 10, 10, 70):

U1 size |  V1  V2  V3  V4
      8 |   8   0   0   0
      7 |   0   7   0   0
      7 |   0   0   7   0
     78 |   2   3   3  70

⇒ AMI chooses this one because of many 0's

U2 vs V (cluster sizes of V: 10, 10, 10, 70):

U2 size |  V1  V2  V3  V4
     10 |   7   1   1   1
     10 |   1   7   1   1
     10 |   1   1   7   1
     70 |   1   1   1  67

⇒ ARI chooses this one

When there are small clusters in V, use AMI because it likes 0's. A quick check with the ARI sketch from above follows.
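Reusing the adjusted_rand_index sketch from the earlier slide, the slide's claim can be checked numerically; the tables below are exactly those above:

```python
import numpy as np

U1 = np.array([[8, 0, 0, 0],
               [0, 7, 0, 0],
               [0, 0, 7, 0],
               [2, 3, 3, 70]])
U2 = np.array([[7, 1, 1, 1],
               [1, 7, 1, 1],
               [1, 1, 7, 1],
               [1, 1, 1, 67]])
# per the slide, ARI ranks U2 above U1
print(adjusted_rand_index(U1), adjusted_rand_index(U2))
```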



Page 68: PhD Completion Seminar


Application Scenarios

Equal sized clusters...

Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2

Example: Do you prefer U1 or U2?

U1 vs V (cluster sizes of V: 25, 25, 25, 25):

U1 size |  V1  V2  V3  V4
     17 |  17   0   0   0
     17 |   0  17   0   0
     17 |   0   0  17   0
     49 |   8   8   8  25

⇒ AMI chooses this one because of many 0's

U2 vs V (cluster sizes of V: 25, 25, 25, 25):

U2 size |  V1  V2  V3  V4
     24 |  20   2   1   1
     25 |   2  20   2   1
     23 |   1   1  20   1
     28 |   2   2   2  22

⇒ ARI chooses this one

When there are big equal-sized clusters in V, use ARI because 0's are misleading



Page 72: PhD Completion Seminar


Application Scenarios

SMIq can be used to correct selection bias

Reference clustering with 4 clusters, and solutions U with different numbers of clusters

[Figure: probability of selection (q = 1.001) vs the number of sets r in U (2 to 10) for SMI_q, AMI_q, and NMI_q]


Page 73: PhD Completion Seminar


Application Scenarios

Correct for selection bias with SMIq for any q

Reference clustering with 4 clusters, and solutions U with different numbers of clusters

[Figure: probability of selection (q = 2) vs the number of sets r in U (2 to 10) for SMI_q, AMI_q, and NMI_q]


Page 74: PhD Completion Seminar


Application Scenarios

Conclusion - Message

We computed generalized information theoretic measures to propose AMI_q and SMI_q, in order to:

- identify the application scenarios of ARI and AMI
- correct for selection bias

Take-away message:

- Use AMI when the reference is unbalanced and has small clusters
- Use ARI when the reference has big equal-sized clusters
- Use SMI_q to correct for selection bias

Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance, Simone Romano, James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Published in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1143–1151.

Adjusting for Chance Clustering Comparison Measures, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. To be submitted to the Journal of Machine Learning Research.


Page 75: PhD Completion Seminar


Background: Examples of Applications; Categories of Dependency Measures; Thesis Motivation

Ranking Dependencies in Noisy Data: Motivation; Design of the Randomized Information Coefficient (RIC); Comparison Against Other Measures

A Framework for Adjusting Dependency Measures: Motivation; Adjustment for Quantification; Adjustment for Ranking

Adjustments for Clustering Comparison Measures: Motivation; Detailed Analysis of Contingency Tables; Application Scenarios

Conclusions


Page 76: PhD Completion Seminar


Summary

Studying the distribution of the estimates of a dependency measure D, we:

- Designed RIC
- Adjusted for quantification
- Adjusted for ranking

These results can aid the detection, quantification, and ranking of relationships as follows:

Detection: RIC can be used to detect relationships between continuous variables because it has high power.

Quantification: adjustment for quantification can be used to obtain a more interpretable range of values, e.g. AMIC and AMIq (see the sketch after this list).

Ranking: adjustment for ranking can be used to correct for biases towards variables with missing values or variables with many categories, e.g. AGini(α) for random forests.
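For concreteness, a generic sketch of the adjustment for quantification follows: subtract the expected value of the measure under a permutation null and rescale by its maximum. The function name is illustrative and the measure D is assumed to be normalized so that its maximum is 1; this is not the thesis code.

```python
import numpy as np

def adjust_for_quantification(D, x, y, n_perm=200, seed=0):
    """A(D) = (D(x, y) - E[D0]) / (max D - E[D0]), where E[D0] is the
    expected value of D under independence, estimated by permuting y.
    Assumes D is normalized, i.e. max D = 1."""
    rng = np.random.default_rng(seed)
    e0 = np.mean([D(x, rng.permutation(y)) for _ in range(n_perm)])
    return (D(x, y) - e0) / (1.0 - e0)
```

Plugging in a normalized mutual information for D gives an AMI-style coefficient; plugging in MIC gives the flavour of AMIC.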


Page 77: PhD Completion Seminar


Future Work

- Dependency measure estimates can attain high values by chance also when they are computed on different numbers of dimensions ⇒ study adjustments that are unbiased towards different dimensionality

- Adjustment via permutations is slow ⇒ compute more analytical adjustments, e.g. for MIC

- The random-seed discretization technique for RIC might have problems with high dimensionality ⇒ generate random seeds in random subspaces ⇒ study multivariable discretization using random trees

- Inject randomness into other estimators of mutual information ⇒ e.g. choose different random kernel widths for the IKDE estimator (see the sketch below)
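To make the randomization idea concrete, here is a minimal univariate sketch in the spirit of RIC, assuming random quantile cut points as a stand-in for the random-seed discretization (a simplification for illustration, not the published estimator):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def ric_sketch(x, y, n_runs=200, max_cuts=14, seed=0):
    """Average mutual information over random discretizations of x and y."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_runs):
        # draw a random number of random cut points for each variable
        cx = np.quantile(x, np.sort(rng.random(rng.integers(1, max_cuts + 1))))
        cy = np.quantile(y, np.sort(rng.random(rng.integers(1, max_cuts + 1))))
        total += mutual_info_score(np.digitize(x, cx), np.digitize(y, cy))
    return total / n_runs
```

Averaging over many random grids is what the summary slide credits for RIC's power; the open question listed above is how to keep this effective when x and y are high dimensional.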


Page 78: PhD Completion Seminar


Papers

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Adjusting for Chance Clustering Comparison Measures”. To be submitted to the Journal of Machine Learning Research.

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “A Framework to Adjust Dependency Measure Estimates for Chance”. Under submission at the SIAM International Conference on Data Mining 2016 (SDM-16).

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “The Randomized Information Coefficient: Ranking Dependencies in Noisy Data”. Under review at the Machine Learning Journal.

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, “Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance”. Published in Proceedings of the 31st International Conference on Machine Learning (ICML-14), 2014, pp. 1143–1151.

Collaborations:

Y. Lei, J. C. Bezdek, N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Extending information theoretic validity indices for fuzzy clusterings”. Submitted to the Transactions on Fuzzy Systems journal.

N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei, “Discovering outlying aspects in large datasets”. Submitted to the Data Mining and Knowledge Discovery journal.

N. X. Vinh, J. Chan, S. Romano, and J. Bailey, “Effective global approaches for mutual information based feature selection”. Published in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2014, pp. 512–521.

Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, “Generalized information theoretic cluster validity indices for soft clusterings”. Published in Proceedings of Computational Intelligence and Data Mining (CIDM), 2014, pp. 24–31.


Page 79: PhD Completion Seminar


Thank You All

In particular

My supervisors: James Bailey, Karin Verspoor, and Vinh Nguyen

Committee Chair: Tim Baldwin

My fellow PhD students

Questions?

Code available online:

https://github.com/ialuronico



Page 81: PhD Completion Seminar


References I

Albatineh, A. N., Niewiadomska-Bugaj, M., and Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23(2):301–313.

Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Proceedings of the Sixth International Conference on Data Mining (ICDM’06), pages 107–118. IEEE.

Cohen, M. X. (2014). Analyzing Neural Time Series Data: Theory and Practice. MIT Press.

Cover, T. M. and Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.

Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.

Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7–30.


Page 82: PhD Completion Seminar


References II

Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90–97.

Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181.

Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152–160.

Geurts, P. (2002). Bias/Variance Tradeoff and Time Series Classification. PhD thesis, Département d’Électricité, Électronique et Informatique, Institut Montefiore, Université de Liège.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.


Page 83: PhD Completion Seminar


References III

Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193–218.

Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson III, D. J., Protopopescu, V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2):026209.

Kononenko, I. (1995). On biases in estimating multi-valued attributes. In International Joint Conference on Artificial Intelligence, pages 1034–1040.

Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6):066138.

Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). Filta: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.


Page 84: PhD Completion Seminar


References IV

Lopez-Paz, D., Hennig, P., and Schölkopf, B. (2013). The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9.

Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano, A. (2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1):S7.

Meilă, M. (2007). Comparing clusterings: an information based distance. Journal of Multivariate Analysis, 98(5):873–895.

Müller, E., Günnemann, S., Färber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518–1524.

Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. M. (2015). Measuring dependence powerfully and equitably. arXiv preprint arXiv:1505.02213.


Page 85: PhD Completion Seminar


References V

Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., and Gross, H.-M. (2010). On estimating mutual information for feature selection. In Artificial Neural Networks: ICANN 2010, pages 362–367. Springer.

Strehl, A. and Ghosh, J. (2003). Cluster ensembles: a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.

Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501.

Sugiyama, M. and Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1692–1698. AAAI Press.

Székely, G. J., Rizzo, M. L., et al. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265.

Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306–329.


Page 86: PhD Completion Seminar


References VI

Vinh, N. X., Epps, J., and Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML, pages 1073–1080. ACM.

Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.
