Nonparametric Information Estimation Using Rank-Based Euclidean Graph Optimization Methods
Barnabás Póczos
Department of Computing Science,
University of Alberta, Canada
Machine Learning Lunch
Carnegie Mellon University
Feb 8, 2010
Outline
GOAL: Information estimation
Background on information and entropy
Applications in machine learning
Graph optimization methods
TSP, MST, kNN
Copula transformation
Information estimation
Background on information and entropy
Who cares about dependence?
• Observing physical, biological, chemical systems
– Which observations are dependent / independent?
– Is there dependence between inputs and outputs?
• Neuroscience: dependence between stimuli and neural response
• Stock markets
– which stocks are dependent on each other, which are independent?
– is there dependence on other events?
What is dependence?
We want to measure dependence
Correlation?
… nope, that's not good enough…
Correlation can be zero even when there is dependence between the variables: for example, if X is symmetric around zero and Y = X², then corr(X, Y) = 0 although Y is a deterministic function of X.
Isn't this just an artificial problem?
No, it is a serious problem…
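To check this numerically, here is a minimal NumPy sketch (our own toy example, not from the slides): Y is fully determined by X, yet their correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)  # symmetric around 0
y = x ** 2                                # fully determined by x

# Pearson correlation is ~0 even though y is a function of x.
print(np.corrcoef(x, y)[0, 1])  # ~0.00, up to sampling noise
```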
Correlation? … nope, that's not good enough
Correlation measures only pairwise dependencies.
What can we do when we have more variables?
[Scatter plot of samples (X1, X2): covariance matrix = I, cross-correlation = 0, but there is dependence.]

That is why PCA cannot separate mixed sources, and we need ICA methods that also take higher-order dependencies into account.
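A small sketch of this point, assuming scikit-learn is available (the rotation-mixing toy setup is ours): the mixture is white, so PCA has no second-order structure to exploit, while FastICA can use the higher-order structure to separate the sources.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)
S = rng.uniform(-1, 1, size=(10_000, 2))        # two independent uniform sources
theta = np.pi / 4                               # mix them by a rotation
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = S @ A.T

# Cov(X) is proportional to I, so PCA sees no preferred directions:
print(np.round(np.cov(X.T), 3))
S_pca = PCA().fit_transform(X)                   # still a rotated (mixed) version
S_ica = FastICA(random_state=0).fit_transform(X) # recovers the sources up to
                                                 # permutation, sign, and scale
```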
What is dependence?
Alfréd Rényi (1961, Hungary)
α-divergence (Imre Csiszár, 1967, Hungary):

D_α(p ∥ q) = 1/(α−1) · log ∫ p(x)^α q(x)^{1−α} dx

Special cases:
• Rényi's α-information: I_α(X₁, …, X_d) = D_α( joint density ∥ product of the marginal densities )
• Shannon's mutual information: the limit α → 1

Theorem: D_α(p ∥ q) ≥ 0, with equality iff p = q; hence I_α = 0 exactly when the variables are independent.
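To make the α → 1 limit concrete, here is a small numerical check of our own (not from the slides): for two unit-variance Gaussians, the α-divergence computed by numerical integration approaches the KL divergence 0.5 as α → 1.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

p = norm(loc=0.0, scale=1.0)      # N(0, 1)
q = norm(loc=1.0, scale=1.0)      # N(1, 1)
x = np.linspace(-12.0, 12.0, 100_001)

def renyi_divergence(alpha):
    """D_alpha(p || q) = log( integral of p^alpha * q^(1-alpha) ) / (alpha - 1)."""
    integrand = p.pdf(x) ** alpha * q.pdf(x) ** (1.0 - alpha)
    return np.log(trapezoid(integrand, x)) / (alpha - 1.0)

# KL(p || q) = (mu_p - mu_q)^2 / 2 = 0.5 for these two Gaussians.
for alpha in (0.5, 0.9, 0.99, 0.999):
    print(alpha, renyi_divergence(alpha))   # tends to 0.5 as alpha -> 1
```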
Measuring Dependence and Uncertainty (Claude Shannon, 1948, Princeton)

Shannon mutual information:
I(X; Y) = ∫∫ p(x, y) log [ p(x, y) / ( p(x) p(y) ) ] dx dy

Measuring uncertainty (Shannon entropy):
H(X) = − ∫ p(x) log p(x) dx
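On a small discrete joint distribution these quantities reduce to finite sums; a toy example of ours:

```python
import numpy as np

# Joint distribution of two binary variables, p(x, y).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)   # marginal of X
py = pxy.sum(axis=0)   # marginal of Y

H = -np.sum(px * np.log(px))                      # Shannon entropy H(X)
I = np.sum(pxy * np.log(pxy / np.outer(px, py)))  # mutual information I(X; Y)
print(H, I)  # H(X) = log 2 ~ 0.693; I > 0 because X and Y are dependent
```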
Rényi’s information and entropy
Rényi’s entropy
Rényi’s information
Applications of mutual information in machine learning
Feature selection (Peng et al., 2005)

Goal:
• find a set of features that has the largest dependence with the target class
• find dependent and independent features

Microarray data processing: data matrix Xᵀ ∈ ℝ^{n×d} with class labels,
d = number of genes, n = number of patients
Feature selection (Peng et al., 2005)

Data matrix Xᵀ ∈ ℝ^{n×d} with labels; d = number of genes, n = number of patients.

• Max-dependency: choose the feature set S that maximizes I(x_S; y).
• Max-relevance: maximize the average relevance (1/|S|) Σ_{i∈S} I(x_i; y).
• Min-redundancy, Max-relevance (mRMR): additionally minimize the average redundancy (1/|S|²) Σ_{i,j∈S} I(x_i; x_j), since we don't want dependent features (a greedy sketch follows below).
Independent Subspace Analysis

Observation: x = A s
Hidden, independent sources (subspaces): s = (s¹; …; s^m), where the subspace components s¹, …, s^m are mutually independent.

Goal: estimate A and S observing samples from X only.
Independent Subspace Analysis

In case of perfect separation, WA is a block-permutation matrix.

Objective: find a demixing matrix W that makes the estimated subspace components as independent as possible, e.g. min_W I(y¹, …, y^m), where y = W x.
Mutual Information Applications
• Information theory, Channel capacity
• Information geometry
• Clustering
• Optimal experiment design, active learning
• Prediction of protein structure
• Drug design
• Boosting
• fMRI data processing
• …
The estimation problem
The estimation problem

GOAL: given i.i.d. samples X₁, …, X_n from an unknown density, estimate Rényi's information I_α and entropy H_α without estimating the density itself.
How can we estimate them?
Graph optimization methods
TSP, MST, kNN
History

• 1959, Beardwood, Halton, Hammersley
– given uniform i.i.d. samples on [0,1]², what is the length of the shortest TSP tour?
– more generally: uniform i.i.d. samples on [0,1]^d, TSP length = ?

• 1976, Karp
– randomly distributed cities
– this asymptotic result can be used to develop a deterministic polynomial-time near-optimal (1+ε) TSP algorithm (a.s.)
– first to show that there exists an algorithm for a stochastic version of an NP-complete problem (TSP) that gives a near-optimal solution in polynomial time (a.s.)

John Hammersley (Oxford, UK), Richard Karp (Berkeley, CA)
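TSP tours are expensive to compute, but the analogous asymptotics also hold for the MST (next slides), which is cheap. A small simulation of ours illustrating the d = 2 scaling, L(X_n) ≈ β √n:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    pts = rng.uniform(size=(n, 2))             # uniform i.i.d. samples on [0,1]^2
    mst = minimum_spanning_tree(squareform(pdist(pts)))
    print(n, mst.sum() / np.sqrt(n))           # ratio stabilizes near a constant beta
```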
• 1981: TSP ⇒ MST and Minimal Matching graphs; a conjecture for kNN graphs (Michael Steele, Joseph Yukich, Wansoo Rhee)

• Beardwood, Halton, Hammersley: for observations X₁, …, X_n with density f on [0,1]^d,
L(X₁, …, X_n) / n^{(d−1)/d} → β ∫ f(x)^{(d−1)/d} dx (a.s.)

• Rényi's α-entropy estimation? The limit contains exactly ∫ f^α with α = (d−1)/d, the integral appearing in Rényi's entropy.

• 1999, Hero, Costa & Michel: turned this observation into graph-based Rényi entropy estimators.
Steele, Yukich theorem for MST, TSP, Minimal Matching: with edge weights |e|^p,
L_p(X₁, …, X_n) / n^{(d−p)/d} → β_p ∫ f(x)^{(d−p)/d} dx (a.s.)

(Fig. taken from J. Costa and A. Hero.)
Sensitive to outliers…!
A new result: this procedure works on Set-Nearest-Neighborhood graphs, too.

Calculate the total p-weighted edge length of the graph in which every node is connected to its s-th nearest neighbors for all s ∈ S:

S = {1, 3, 7} ⇒ consider the 1st, 3rd, and 7th nearest neighbors of each node
S = {k} ⇒ consider the k-th nearest neighbors of each node
S = {1, 2, …, k} ⇒ kNN graphs
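A sketch of the S-nearest-neighbor graph-length computation (our implementation, using SciPy's cKDTree; the set S and the edge-weight power p are the method's parameters):

```python
import numpy as np
from scipy.spatial import cKDTree

def set_nn_graph_length(points, S, p=1.0):
    """Sum of |edge|^p over the S-nearest-neighbor edges of every node."""
    tree = cKDTree(points)
    k_max = max(S)
    # Query k_max + 1 neighbors: column 0 is the point itself (distance 0),
    # and the s-th nearest neighbor sits in column s.
    dists, _ = tree.query(points, k=k_max + 1)
    cols = sorted(S)
    return np.sum(dists[:, cols] ** p)

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
print(set_nn_graph_length(X, S={1, 3, 7}, p=1.0))
```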
Rényi’s entropy estimators using Set-Nearest Neighborhood graphs
(Fig. taken from J. Costa and A. Hero)
How can we get information estimators from entropy estimators?
How can we make the estimators more robust?
The invariance trick
Information is preserved under monotonic transformations: I(g₁(X₁), …, g_d(X_d)) = I(X₁, …, X_d) for strictly increasing g₁, …, g_d.
Transformation to get uniform marginals

Monotone transformation leading to uniform marginals?

Prob theory 101: if X_j has continuous distribution function F_j, then F_j(X_j) is uniform on [0, 1].

The transformation (copula transformation): Z = ( F₁(X₁), …, F_d(X_d) ) is a monotone transform with uniform marginals.

• The information estimation problem is reduced to Rényi entropy estimation: I_α(X) = −H_α(Z).
• A little problem: we don't know the F_i distribution functions.
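The "Prob theory 101" fact is easy to verify numerically; a snippet of ours applying the true CDF to Gaussian samples and testing the result for uniformity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=100_000)

u = stats.norm(loc=3.0, scale=2.0).cdf(x)   # F(X): the copula transform in 1D

# Kolmogorov-Smirnov test against Uniform[0,1]: a large p-value means
# the transformed sample is indistinguishable from uniform.
print(stats.kstest(u, "uniform"))
```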
Copula methods
• Very popular in financial analysis
– Capture dependence between marginal variables
– An alternative form of normalization
• … but not so widespread in machine learning…
Well, in financial analysis, they led to the global financial crisis…
Maybe we should use them in ML more often!
The copula transformation

Monotone transform ⇒ uniform marginals.
A little problem: we don't know the F_i distribution functions.

Solution:
Use the empirical distribution functions F_j^n and the empirical copula transform.
We need to estimate distribution functions in 1D only! ⇒ no curse of dimensionality.
Empirical copula transformation

We don't know the F₁, …, F_d distribution functions
⇒ estimate them with the empirical distribution functions:
F_j^n(x) = (1/n) Σ_{i=1}^n 1{ X_j(i) ≤ x }
Empirical copula transformation

"true" copula transformation: Z(i) = ( F₁(X₁(i)), …, F_d(X_d(i)) )
empirical copula transformation: Ẑ(i) = ( F₁^n(X₁(i)), …, F_d^n(X_d(i)) )
empirical copula transformation of the observations: the j-th coordinate of Ẑ(i) is rank(X_j(i)) / n, i.e. it uses the ranks only.
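In code, the empirical copula transform is just a column-wise rank transform; a minimal sketch using scipy.stats.rankdata:

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula(X):
    """Map each coordinate X[:, j] to rank / n: values on a grid in (0, 1]."""
    n = X.shape[0]
    return rankdata(X, axis=0) / n

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
print(empirical_copula(X))   # each column is a permutation of {1/n, ..., 1}
```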
REGO Algorithm: Rank-based Euclidean Graph Optimization
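The algorithm details were on the original slide's figures; the following is our own minimal reconstruction of the pipeline described so far (rank-transform, measure a p-weighted kNN graph length, plug it into the graph-based entropy formula, and calibrate the constant β by Monte Carlo on uniform samples). All names and the calibration shortcut are ours, a sketch rather than the authors' reference implementation.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import rankdata

def knn_graph_length(Z, k=3, p=1.0):
    """Sum of |edge|^p over edges to the 1..k nearest neighbors of each point."""
    dists, _ = cKDTree(Z).query(Z, k=k + 1)    # column 0 is the point itself
    return np.sum(dists[:, 1:] ** p)

def rego_information(X, k=3, p=1.0, n_mc=20, seed=0):
    """Sketch of a REGO-style estimate of Renyi's alpha-information, alpha = (d-p)/d."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = (d - p) / d
    Z = rankdata(X, axis=0) / n                # empirical copula transform (ranks only)

    # Calibrate the unknown constant beta by Monte Carlo on uniform samples,
    # for which the copula is uniform and I_alpha = 0.
    beta = np.mean([knn_graph_length(rng.uniform(size=(n, d)), k, p)
                    for _ in range(n_mc)]) / n ** alpha

    H = np.log(knn_graph_length(Z, k, p) / (beta * n ** alpha)) / (1.0 - alpha)
    return -H                                  # I_alpha(X) = -H_alpha(copula of X)

rng = np.random.default_rng(1)
x = rng.normal(size=2000)
X_dep = np.column_stack([x, x + 0.1 * rng.normal(size=2000)])  # strongly dependent
X_ind = rng.normal(size=(2000, 2))                             # independent
print(rego_information(X_dep))   # clearly positive
print(rego_information(X_ind))   # close to 0
```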
Theoretical Results
consistency
Is consistency trivial?

Difficulties:
• The true copula transformation is not known… only the empirical one.
• The transformed samples lie on a grid; they do not come from the true copula distribution.

[Figure: samples after the true copula transform vs. after the empirical copula transform.]
Is consistency trivial?

Difficulties:
• The empirical-copula-transformed points are not i.i.d. in n
⇒ Steele & Yukich's theorem cannot be applied directly.

We need different proofs for the two cases:
• 0 < p ≤ 1: the weight |·|^p satisfies the triangle inequality, but it is not Lipschitz
• 1 < p < d/2: |·|^p is Lipschitz, but there is no triangle inequality
Consistency Results

Main theorem: the rank-based graph estimator is a strongly consistent estimator of Rényi's α-information, Î_α → I_α (a.s.), for α = (d−p)/d.

Note:
• Consistent MI estimator using ranks only
• Ranks only ⇒ robust
• We don't have theoretical results for other d and α values…
Empirical Results
Empirical results, consistency in 2D
[Plot: estimate vs. sample size n, using the k-MST graph with k_MST = 0.8 n.]
Empirical results, consistency in 10D
Image registration
n = 4900 = 70 × 70 samples, with 200 outliers
Independent subspace analysis
Independent subspace analysis in the presence of outliers
Infinitesimal robustness (finite-sample influence function)

The finite-sample influence function measures the amount of change caused by adding one outlier, x (with some modification).

For the rank-based estimator, this change cannot be arbitrarily big!
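One way to see why ranks bound the influence of a single outlier (our illustration, not from the slides): adding one extreme point changes a raw-distance graph length by an amount that grows with the outlier's magnitude, but barely moves the rank-transformed graph, because ranks are bounded.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import rankdata

def knn_length(Z, k=3):
    dists, _ = cKDTree(Z).query(Z, k=k + 1)
    return np.sum(dists[:, 1:])

def ranks(A):
    return rankdata(A, axis=0) / A.shape[0]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
X_out = np.vstack([X, [[1e6, 1e6]]])           # add one huge outlier

print(knn_length(X_out) - knn_length(X))       # grows with the outlier's magnitude
print(knn_length(ranks(X_out)) - knn_length(ranks(X)))  # stays small
```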
Conclusion
The marriage of seemingly unrelated mathematical areas produced a consistent and robust estimator of Rényi's information:

Graph optimization methods (TSP, MST, kNN) + Copula transformation ⇒ Information estimation
Thanks for your attention!
“Hey Rege, Róka Rege, Hey REGŐ Rejtem…
I am calling my grandfather, I feel his soul here.
I am listening to his words, they are breaking the silence.
Impart your knowledge to your grandson please,
1000 years have already passed…,
Knowledge equals life, hey!”