Kernel Methods for Topological Data Analysis - UCL · Kernel Methods for Topological Data Analysis...

23
Kernel Methods for Topological Data Analysis Toward Topological Statistics Kenji Fukumizu The Institute of Statistical Mathematics (Tokyo, Japan) Joint work with Genki Kusano and Yasuaki Hiraoka (Tohoku Univ.) supported by JST CREST 2nd UCL Workshop on the Theory of Big Data. 6 - 8 January 2016 1

Transcript of Kernel Methods for Topological Data Analysis - UCL · Kernel Methods for Topological Data Analysis...

Kernel Methods for Topological Data Analysis Toward Topological Statistics

Kenji Fukumizu

The Institute of Statistical Mathematics (Tokyo, Japan)

Joint work with Genki Kusano and Yasuaki Hiraoka (Tohoku Univ.) supported by JST CREST

2nd UCL Workshop on the Theory of Big Data. 6 - 8 January 2016

1

Topological Data Analysis• TDA: a new method for extracting topological or geometrical

information of data.

• Persistence homology: a recent mathematical tool (Edelsbrunner et al 2002; Carlsson 2005)

• Progress of computational topology: computing topological invariants in feasible computational cost.

• Big data? Not only the size, buy complex structure matters TDA.

2

TDA: Various applications

33

Brain artery treese.g. age effect(Bendich et al 2014)

Brain Science

Structure change of proteinseg. open / closed(Kovacev-Nikolic et al 2015)

Material Science

Computer Vision

Shape signature, natural image statistics(Freedman & Chen 2009)

Data of highly complex geometric structure

Often difficult to define good feature vectors / descriptors

etc…Non-crystal materials(Nakamura, Hiraoka, Hirata, Escolar, Nishiura. Nanotechnology 26 (2015))

Liquid Glass

TDA provides a compact representation for geometry of such data

Biology / protein

Algebraic Topology• Homology group

4

Algebraic operation

Homology group: topological invariance

0 dim(connected comp.) : �� ≅ �1 dim(ring) : �� ≅ � ⊕ �

・・・

Various topological invariances

e.g. Euler number = ∑ −1 �dim ���

The generators of 1 dim. homology group

≅ Independent ℓ-dim. holes

Topology of statistical data?

5

Noisy sample

True structure

� neighborhood(e.g. manifold learning)

Small � disconnected object

Large � Small ring is not visible

Stable extraction of topology is NOT easy!

Persistence Homology• All � considered

� = �� ���� ⊂ �� , �� ≔∪���

� ��(��)

6

� small � large

Two rings (generators of 1 dim homology) persist in a long interval.

• Persistence homology (formal definition)Filtration of topological spaces X ∶ �� ⊂ �� ⊂ ⋯ ⊂ ��

���(X): �� �� → �� �� → ⋯ → ��(��) ≅ ⊕���

���[��, ��]

� �, � ≅ 0 → ⋯ → 0 → � → ⋯ → � → 0 → ⋯ → 0

• Persistence diagram (PD)

7

at �� at ��

�: field

Irreducible decomposition

Birth and death of a generator �

PD plots the birth and death of each generator of PH in a 2D graph (� ≥ �).

Birth Death

The lifetime of each generator is rigorously defined,and can be computed numerically.

Handy descriptor of complex geometry

Topological statistics: systematic data analysis in TDA• Conventional analysis in TDA: analysis WITH PH/PD

Data Computation of PH Display(PD) Analysis by human

• Our statistical approach to TDA: analysis OF random PH/PD

8

Many data setsComputation 

of PD’s

PD1

PD2

PD3

PDn

Many persistent diagrams

Analysis of PD’s

Software available

CGAL / PHAT

Features / Descriptors

How?

Topological Statistics

“I predict a new subject of statistical topology. Rather than count the number of holes, Betti numbers, etc., one will be more interested in the distribution of such objects on noncompact manifolds as one goes out to infinity.”

— Isadore Singer, 2004.

9

Kernel representation of PD• Vectorization of PD by positive definite kernel

• PD = Discrete measure �� ≔ ∑ ���∈��

• Embedding of PD’s into RKHSℇ�: �� ↦ ∫ � ⋅, � ��� � = ∑ �(⋅, ��)� ∈ �� , Vectorization

• For some kernels (e.g., Gaussian, Laplace), ℇ� is injective.

• By vectorization,• a number of methods for data analysis can be applied,

SVM, regression, PCA, CCA, etc. • tractable computation is possible with kernel methods.

10

�: positive definite kernel��: corresponding RKHS

Persistence Weighted Gaussian (PWG) Kernel

Generators close to the diagonal may be noise, and should be discounted.

���� �, � = � � � � exp −��� �

���

� � = ��,� � ≔ arctan �Pers � � (�, � > 0)

Pers � ≔ � − � for � ∈ { �, � ∈ ��|� ≥ �}

11

Pers(x1)

• Stability with PWG kernel embedding• PWGK defines a distance measure on the persistence diagrams,

�� ��, �� ≔ ℇ� �� − ℇ� �� ��

Statiblity Theorem (Kusano, Hiraoka, Fukumizu 2015+) �: compact subset in �� . � ⊂ �, � ⊂ �� : finite sets.If � > � + 1, then with PWG kernel (�, �, �),

�� ��(�), ��(�) ≤ � �� �, � .

�: constant depending only on �, �, �, �, �

��(�): � dimensional persistence diagram of �

�� : Haussdorff distance

This stability is not known for Gaussian kernel.12

A small change of a set causes only a small change in PD

2nd‑level kernel

2nd-level kernel (SVM for measures, Muandet, Fukumizu, Dinuzzo, Schölkopf 2012)

• RKHS-Gaussian kernel � ��, �� = exp −����� ��

���

derives

� ��, �� = exp −ℇ�(��)�ℇ�(��)

��

���

13

PD1

PD2

PD3

PDm

ℇ� ���

ℇ� ���

…ℇ� ���

Vectors in RKHSPDData sets

Application of pos. def. Kernel on RKHS

��, ��: Persistence diagrams

Data analysis method

Embedding

Computational issueThe number of generators in a PD may be large (≥ 103, 104 )

For ��� = ∑ ���

(�)����� ∪ Δ, � ���, ��� = exp −

ℇ�(���)�ℇ�(���)��

��� requires computation

ℇ�(���) − ℇ�(���)��

= ∑ ∑ � ���

, �����

� ����� �� + ∑ ∑ � ��

�, ��

���

� ��

��

� �� − 2 ∑ ∑ � ���

, �����

� ����� �� .

The number of exp −�����

��� = �(����) computationally expensive for � ≈ 10�

14

� = max{��|� = 1, … , �}

• Approximation by random features (Rahimi & Recht 2008)

By Bochner’s theorem

exp −�����

��� = � ∫ � ���� �������

����

�� � �

� ��

Approximation by sampling: ��, … , ��: �. �. �. ~ ��

exp −�����

��� ≈ ��

�∑ � ���ℓ

��� � ���ℓ����

ℓ��

∑ ∑ � ���

, �����

� �� ≈��� ��

�∑ ∑ ∑ � ��

�� ��

�� ���ℓ

���(�)

� ���ℓ���

(�)�ℓ��

��

� ����� ��

=�

�∑ ∑ � ��

�� ���ℓ

���(�)

∑ � ���

� ���ℓ���

(�)��

������ ��

�ℓ ��

Computational cost �(��) 2nd level Gram matrix �(��� + ���). c.f. �(����)

Big gain if �, � ≪ �

15

Gaussian distribution =: ��

� dim.

(Fourier transform)

Comparison: Persistence Scale Space Kernel(Reininghaus et al 2015)

• PSS Kernel

�� �, � =1

8��exp

� − � �

8�− exp

� − �� �

8�

�� = (�, �) for � = (�, �).

ℇ�(�) is considered.

• Comparison between PWGK and PSSK • PWGK can control the discount around the diagonal independently of the

bandwidth parameter.• PSSK is not shift-invariant Random feature approximation is not applicable.• In Reininghaus et al 2015, 2nd level kernel is not considered.

16

Pos. def. on �, � � ≥ �0 on Δ.

Synthetic example: SVM classification• Classification of PD’s by SVM

• Synthetic data• One big circle (random location and sample size) �1

with or without small circle �0

• � = XOR((birth(�1)<1 && death(�1)>4)), �0 exists)• In fact, data include noise.• 100 for training

and 100 for testing

• Result (correct classification)• PWGK (proposed): 83.8%• PSSK (comparison): 46.5%

17

S0

S1

S1S1

S0

noise

PD1

Data points 1

Data points 2

No S0

� = 1

Application: Transition of Silica (SiO2 )If cooled down rapidly from the liquid state, SiO2 changes into the glass state (not to crystal).Goal: identify the temperature of phase transition.Data: Molecular Dynamics simulation for SiO2. 3D arrangements of the atoms are used for computing PD at 80 temperatures. (Nakamura et al 2015; Hiraoka et al 2015)

18

Examples of PD’s

Liquid Glass (Amorphous)

Amorphous: “soft” structure

Change point detection• Data along a parameter �

�� , � = 1, … , �.

Kernel Change Point Analysis with Fisher Discriminant score (Harchoui et al 2009): For each �, two classes are defined by the data before and after �. Fisher score on RKHS is used.

• For each �, compute ���:� =�

�∑ Φ(��)�

��� and �����:� =�

���∑ Φ(��)�

����� .

• Compute Δ� ≔ ��:� + ����:� + �� ��

�(���:� − �����:�)��

.

• Find max�

Δ�.

• For the packing problem, �� = ℇ� ��� (� = 1, … , 80).

19

• Detection of liquid-glass state transition• Approach in physics:Estimation using derivatives of enthalpy, but not so

accurate. Estimate = [2000K, 3500K]• Our approach: Change point detection by Kernel FDR.• Number of generators in a PD is 30000 at most difficult to use PSSK directly• Only PWGK (proposed) is applied with random features.

20

KFDR Gram matrix

Detected change point = 3100K

• Kernel PCA with PWGK

21

Sharp change between the two phases.

(Colored by the result of change point detection. Colors are not used for KPCA).

PD + PWGK successfully capture “soft” and complex structure of the glass state.

Conclusion• Goal: establishing topological statistics

• Persistence homology can be used for defining useful descriptors to complex objects of geometrical structures.

• Systematic statistical methods for TDA: Kernel methods introduce systematic data analysis to TDA.• Vectorization of persistence diagrams by kernel embedding.• Persistence weighted Gaussian kernel flexible kernel for noise.• A toolbox of kernel methods become available:

SVM, kernel ridge regression, kernel PCA, kernel CCA, etc ….

22

Thank you

23