Transcript of "Hilbert Schmidt Independence Criterion" by Alex Smola
Source: alex.smola.org/teaching/iconip2006/iconip_4.pdf

Page 1:

Hilbert Schmidt Independence Criterion

Thanks to Arthur Gretton, Le Song, Bernhard Schölkopf, Olivier Bousquet

Alexander J. Smola
Statistical Machine Learning Program
Canberra, ACT 0200, Australia
[email protected]

ICONIP 2006, Hong Kong, October 3

Alexander J. Smola: Hilbert Schmidt Independence Criterion 1 / 30

Page 2:

Outline

1. Measuring Independence
   Covariance Operator
   Hilbert Space Methods
   A Test Statistic and its Analysis
2. Independent Component Analysis
   ICA Primer
   Examples
3. Feature Selection
   Problem Setting
   Algorithm
   Results

Page 3:

Measuring Independence

Problem
  Given {(x1, y1), ..., (xm, ym)} ~ Pr(x, y), determine whether Pr(x, y) = Pr(x) Pr(y). Measure the degree of dependence.

Applications
  Independent component analysis
  Dimensionality reduction and feature extraction
  Statistical modeling

Indirect Approach
  Perform a density estimate of Pr(x, y).
  Check whether the estimate approximately factorizes.

Direct Approach
  Check properties of factorizing distributions,
  e.g. kurtosis, covariance operators, etc.
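As a concrete warm-up (not from the slides), the factorization question can be checked directly for discrete variables by comparing the empirical joint distribution with the product of the empirical marginals; the data and names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10000)          # fair coin
y_indep = rng.integers(0, 2, size=10000)    # independent coin
y_dep = x ^ (rng.random(10000) < 0.1)       # mostly copies x

def factorization_gap(x, y):
    """Largest deviation between the empirical joint and the product of marginals."""
    joint = np.zeros((2, 2))
    np.add.at(joint, (x, y), 1.0)           # count co-occurrences
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    return np.abs(joint - np.outer(px, py)).max()

print(factorization_gap(x, y_indep))  # small: consistent with independence
print(factorization_gap(x, y_dep))    # large: factorization fails
```

The slides' point is that this density-estimation route becomes impractical in higher dimensions, motivating the direct, operator-based approach.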

Page 7:

Covariance Operator

Linear Case
  For linear functions f(x) = w^T x and g(y) = v^T y the covariance is given by

    Cov{f(x), g(y)} = w^T C v

  This is a bilinear form on the space of linear functions.

Nonlinear Case
  Define C to be the operator with (f, g) → Cov{f, g}.

Theorem
  C is a bilinear operator in f and g.

Proof.
  We only show linearity in f: Cov{αf, g} = α Cov{f, g}. Moreover, the covariance is additive: Cov{f + f′, g} = Cov{f, g} + Cov{f′, g}.
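A numerical sketch of the linear case (not from the slides): for f(x) = w^T x and g(y) = v^T y, the covariance equals w^T C v, with C the empirical cross-covariance matrix of x and y.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5000
x = rng.normal(size=(m, 3))
y = 0.5 * x[:, :2] + rng.normal(size=(m, 2))   # y correlated with x

# Empirical cross-covariance matrix C
C = (x - x.mean(0)).T @ (y - y.mean(0)) / m

w, v = rng.normal(size=3), rng.normal(size=2)
f, g = x @ w, y @ v
cov_fg = np.mean(f * g) - f.mean() * g.mean()  # Cov{f(x), g(y)}

print(cov_fg, w @ C @ v)   # identical up to floating point
```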

Page 10:

Independent random variables [figure]

Page 11:

Dependent random variables [figure]

Page 12:

Or are we just unlucky? [figure]
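"Are we just unlucky?" can be made precise with a permutation test (a standard construction, not from the slides): shuffle y to destroy any dependence and see how often the shuffled statistic exceeds the observed one. Plain correlation is used here for simplicity; HSIC, introduced later, would slot in the same way.

```python
import numpy as np

rng = np.random.default_rng(6)
m = 500
x = rng.normal(size=m)
y = 0.3 * x + rng.normal(size=m)          # mildly dependent data

def stat(x, y):
    return abs(np.corrcoef(x, y)[0, 1])

observed = stat(x, y)
# Null distribution: dependence destroyed by shuffling y
null = [stat(x, rng.permutation(y)) for _ in range(500)]
p_value = np.mean([s >= observed for s in null])
print(observed, p_value)   # small p-value: unlikely to be just unlucky
```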

Page 13:

Covariance operators

Criterion (Renyi, 1957)
  Test for independence by checking whether C = 0.

Reproducing Kernel Hilbert Space
  Kernels k, l on X, Y with associated RKHSs F, G.
  Assume k, l are bounded on their domains.

Mean operator

  〈µx, f〉 = Ex[f(x)]  and  〈µy, g〉 = Ey[g(y)]

Covariance operator
  Define the covariance operator Cxy via the bilinear form

    〈f, Cxy g〉 = Cov{f, g} = Ex,y[f(x) g(y)] − Ex[f(x)] Ey[g(y)]

Page 17:

Hilbert Space Representation

Theorem
  Provided that k, l are universal kernels, ‖Cxy‖ = 0 if and only if x and y are independent.

Proof.
  Step 1: If x and y are dependent, then there exist [0,1]-bounded f*, g* with Cov{f*, g*} = ε > 0.
  Step 2: Since k, l are universal, there exist ε′-approximations of f*, g* in F, G such that the covariance of the approximations does not vanish.
  Step 3: Hence the covariance operator Cxy is nonzero.

Page 18:

A Test Statistic

Covariance operator

  Cxy = Ex,y[k(x, ·) l(y, ·)] − Ex[k(x, ·)] Ey[l(y, ·)]

Operator Norm
  Use the norm of Cxy to test whether x and y are independent. It also gives us a measure of dependence:

    HSIC(Prxy, F, G) := ‖Cxy‖²

  where ‖·‖ denotes the Hilbert-Schmidt norm.

Frobenius Norm
  For matrices we can define

    ‖M‖² = Σij M²ij = tr M^T M.

  The Hilbert-Schmidt norm is a generalization of the Frobenius norm.
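The Frobenius-norm identity ‖M‖² = Σij M²ij = tr M^T M is easy to verify numerically; a quick sketch, not part of the talk:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 6))

a = (M ** 2).sum()                   # sum of squared entries
b = np.trace(M.T @ M)                # tr M^T M
c = np.linalg.norm(M, "fro") ** 2    # squared Frobenius norm
print(a, b, c)  # all three agree
```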

Page 21:

Computing ‖Cxy‖²

Rank-one operators
  For rank-one terms we have ‖f ⊗ g‖² = 〈f ⊗ g, f ⊗ g〉_HS = ‖f‖² ‖g‖².

Joint expectation
  By construction of Cxy we exploit linearity and obtain

    ‖Cxy‖² = 〈Cxy, Cxy〉_HS
            = {Ex,y Ex′,y′ − 2 Ex,y Ex′ Ey′ + Ex Ey Ex′ Ey′} [〈k(x, ·) l(y, ·), k(x′, ·) l(y′, ·)〉_HS]
            = {Ex,y Ex′,y′ − 2 Ex,y Ex′ Ey′ + Ex Ey Ex′ Ey′} [k(x, x′) l(y, y′)]

  This is well-defined if k, l are bounded.

Page 22:

Estimating ‖Cxy‖²

Empirical criterion

  HSIC(Z, F, G) := 1/(m − 1)² · tr KHLH

  where Kij = k(xi, xj), Lij = l(yi, yj), and Hij = δij − 1/m.

Theorem

  EZ[HSIC(Z, F, G)] = HSIC(Prxy, F, G) + O(1/m)

Proof (sketch only).
  Expand tr KHLH into terms over pairs, triples, and quadruples of non-repeated indices, which lead to the proper expectations, and bound the remainder by O(1/m).
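A minimal NumPy sketch of the empirical criterion tr KHLH / (m − 1)², using Gaussian kernels as an assumed (universal) kernel choice; the function names and fixed bandwidth are illustrative:

```python
import numpy as np

def rbf_gram(z, sigma=1.0):
    """Gaussian-kernel Gram matrix (an assumed kernel choice)."""
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * z @ z.T
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: tr(KHLH) / (m - 1)^2, as on the slide."""
    m = x.shape[0]
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = np.eye(m) - np.ones((m, m)) / m   # centering: H_ij = delta_ij - 1/m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

rng = np.random.default_rng(3)
m = 300
x = rng.normal(size=(m, 1))
y_indep = rng.normal(size=(m, 1))
y_dep = np.sin(3 * x) + 0.1 * rng.normal(size=(m, 1))  # nonlinear dependence

print(hsic(x, y_indep))  # small (O(1/m) bias under independence)
print(hsic(x, y_dep))    # noticeably larger
```

Note that a linear correlation coefficient would miss the sin(3x) dependence almost entirely, while HSIC with a universal kernel picks it up.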

Page 25:

Uniform convergence bounds for ‖Cxy‖²

Theorem (recall Hoeffding's theorem for U-statistics)
  For averages of a function g of r variables,

    u := 1/(m)r · Σ_{i ∈ I_m^r} g(x_{i1}, ..., x_{ir})

  with (m)r = m(m − 1)···(m − r + 1), the sum running over all r-tuples of distinct indices, and a ≤ u ≤ b, we have

    Pr{u − E[u] ≥ t} ≤ exp(−2 t² ⌊m/r⌋ / (b − a)²)

In our statistic we have terms of 2, 3, and 4 random variables.

Page 26:

Uniform convergence bounds for ‖Cxy‖²

Corollary
  Assume that k, l ≤ 1. Then with probability at least 1 − δ,

    |HSIC(Z, F, G) − HSIC(Prxy, F, G)| ≤ √(log(6/δ) / (0.24 m)) + C/m

Proof.
  Bound each of the three terms separately via Hoeffding's theorem.
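The dominant √ term of this deviation bound can be evaluated directly; a small sketch (the C/m term is lower order and omitted here, since C depends on the kernels):

```python
import numpy as np

def hsic_deviation(m, delta=0.05):
    """Dominant term of the corollary's bound: sqrt(log(6/delta) / (0.24 m))."""
    return np.sqrt(np.log(6.0 / delta) / (0.24 * m))

# The bound shrinks at the usual O(1/sqrt(m)) rate
for m in (100, 1000, 10000):
    print(m, hsic_deviation(m))
```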

Page 28:

Blind Source Separation

Data
  w = Ms, where all si are mutually independent.

The Cocktail Party Problem [figure]

Task
  Recover the sources S and mixing matrix M given W.
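A toy instance of w = Ms (names illustrative, not from the talk): two independent sources mixed by a fixed matrix. The mixtures are correlated even though the sources are not, which is what a demixing algorithm must undo.

```python
import numpy as np

rng = np.random.default_rng(7)
m = 5000
# Two mutually independent, non-Gaussian sources
s = np.stack([rng.laplace(size=m), rng.uniform(-1, 1, size=m)])
M = np.array([[1.0, 0.5],
              [0.3, 1.0]])     # mixing matrix
w = M @ s                      # observed mixtures

print(np.corrcoef(s)[0, 1])   # near 0: sources independent
print(np.corrcoef(w)[0, 1])   # clearly nonzero: mixtures dependent
```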

Page 29:

Independent Component Analysis

Whitening
  Rotate, center, and whiten the data before separation. This is always possible.

Optimization
  We cannot recover the scale of the data anyway.
  Need to find an orthogonal matrix U such that Uw = s leads to independent random variables.
  Optimization on the Stiefel manifold.
  Could do this by a Newton method.

Important Trick
  The kernel matrix could be huge.
  Use a reduced-rank representation. We get

    tr H(AA^T) H(BB^T) = ‖A^T H B‖²  instead of  tr HKHL.
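The low-rank identity in the "Important Trick" follows from the cyclic property of the trace and H = H^T; a numerical check (a sketch, assuming low-rank factors A, B of the Gram matrices):

```python
import numpy as np

rng = np.random.default_rng(4)
m, r = 200, 10
A = rng.normal(size=(m, r))            # K ≈ A A^T
B = rng.normal(size=(m, r))            # L ≈ B B^T
H = np.eye(m) - np.ones((m, m)) / m    # centering matrix

lhs = np.trace(H @ (A @ A.T) @ H @ (B @ B.T))          # O(m^2) memory
rhs = np.linalg.norm(A.T @ H @ B, "fro") ** 2          # only r x r products
print(lhs, rhs)  # agree
```

The right-hand side never forms an m × m matrix, which is the point when m is large.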

Page 32:

ICA Experiments [figure]

Page 33:

Outlier Robustness [figure]

Page 34:

Automatic Regularization [figure]

Page 35:

Mini Summary

Linear mixture of independent sources
  Remove the mean and whiten for preprocessing
  Use HSIC as a measure of dependence
  Find the best rotation to demix the data

Performance
  HSIC is very robust to outliers
  The best performing algorithm (Radical) is designed specifically for linear ICA; HSIC is a general-purpose criterion
  Low-rank decomposition makes the optimization feasible

Page 38:

Feature Selection

The Problem
  Large number of features; select a small subset of them.

Basic Idea
  Find features such that the distributions p(x|y = 1) and p(x|y = −1) are as different as possible. Use a two-sample test for that.

Important Tweak
  We can find a similar criterion to measure dependence between data and labels (by computing the Hilbert-Schmidt norm of the covariance operator).

Page 39:

Recursive Feature Elimination

Algorithm
  Start with the full set of features
  Adjust the kernel width to pick up the maximum discrepancy
  Find the feature which decreases the dissimilarity the least
  Remove this feature
  Repeat

Applications
  Binary classification (standard MMD criterion)
  Multiclass
  Regression
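A minimal backward-elimination sketch in the spirit of this algorithm, scoring feature subsets by empirical HSIC between the remaining features and the labels. The kernels, the fixed bandwidth, and the toy data are assumptions, and the slide's kernel-width adjustment step is omitted:

```python
import numpy as np

def rbf_gram(z, sigma=1.0):
    """Gaussian-kernel Gram matrix (an assumed kernel choice)."""
    sq = np.sum(z ** 2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * z @ z.T) / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC, tr(KHLH) / (m - 1)^2."""
    m = x.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(x, sigma) @ H @ rbf_gram(y, sigma) @ H) / (m - 1) ** 2

def eliminate(X, y, n_keep):
    """Repeatedly drop the feature whose removal leaves the
    feature-label dependence highest (i.e. the least useful one)."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        scores = [hsic(X[:, [j for j in keep if j != i]], y) for i in keep]
        keep.pop(int(np.argmax(scores)))
    return keep

rng = np.random.default_rng(5)
m = 200
X = rng.normal(size=(m, 5))
y = (X[:, 0] + X[:, 2] > 0).astype(float).reshape(-1, 1)  # features 0, 2 informative

sel = eliminate(X, y, 2)
print(sorted(sel))
```

On this toy data the surviving pair is typically the informative one; on real data the bandwidth would be re-tuned at each round, as the slide indicates.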

Page 40: [figure]

Page 41:

Comparison to other feature selectors

Synthetic Data [figure]

Brain Computer Interface Data [figure]

Page 42:

Frequency Band Selection [figure]

Page 43:

Microarray Feature Selection

Goal
  Obtain a small subset of features for estimation
  Reproducible feature selection

Results [figure]

Page 44:

Summary

1. Measuring Independence
   Covariance Operator
   Hilbert Space Methods
   A Test Statistic and its Analysis
2. Independent Component Analysis
   ICA Primer
   Examples
3. Feature Selection
   Problem Setting
   Algorithm
   Results

Page 45:

Shameless Plugs

Looking for a job?
  Talk to [email protected] (http://www.nicta.com.au)

Positions
  PhD scholarships
  Postdoctoral positions, senior researchers
  Long-term visitors (sabbaticals etc.)

More details on kernels
  http://sml.nicta.com.au
  http://www.kernel-machines.org
  http://www.learning-with-kernels.org
  Schölkopf and Smola: Learning with Kernels