Topological Data Analysis

Peter Bubenik

University of Florida, Department of Mathematics
[email protected]
http://people.clas.ufl.edu/peterbubenik/

British Applied Mathematics Colloquium, University of Oxford
April 6, 2016

Outline: Topology, Proteins, Software, Summary

Topological Data Analysis

Idea

Use topology to summarize and learn from the “shape” of data.

Simplicial complexes from point data

The Čech construction
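The slide only names the construction, so the following is a minimal sketch under the standard definition: a set of points spans a simplex of the Čech complex at radius r when the closed balls of radius r around them share a common point, which is the same as saying that their smallest enclosing ball has radius at most r. The sketch is restricted to points in the plane and to simplices of dimension at most 2; the function names `enclosing_radius` and `cech_complex` are illustrative, not taken from the talk.

```python
from itertools import combinations
from math import dist, sqrt


def enclosing_radius(pts):
    """Radius of the smallest ball containing 1, 2 or 3 points in the plane."""
    if len(pts) == 1:
        return 0.0
    if len(pts) == 2:
        return dist(pts[0], pts[1]) / 2
    a, b, c = sorted(dist(p, q) for p, q in combinations(pts, 2))
    # Right, obtuse or degenerate triangle: the longest edge is a diameter.
    if a * a + b * b <= c * c:
        return c / 2
    # Acute triangle: the smallest enclosing ball is the circumscribed circle.
    s = (a + b + c) / 2
    area = sqrt(s * (s - a) * (s - b) * (s - c))
    return a * b * c / (4 * area)


def cech_complex(points, r):
    """Vertices, edges and triangles (as index tuples) of the Cech complex."""
    simplices = []
    for size in (1, 2, 3):
        for idx in combinations(range(len(points)), size):
            if enclosing_radius([points[i] for i in idx]) <= r:
                simplices.append(idx)
    return simplices
```

For three points forming an equilateral triangle of side 2, the edges appear at radius 1 but the triangle only appears at the circumradius 2/√3 ≈ 1.155, so for radii in between the complex is a hollow triangle with one loop.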

Homology of simplicial complexes

Definition

Homology in degree k is given by the k-cycles modulo the k-boundaries.
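To make the definition concrete, here is a small sketch that computes Betti numbers, that is, the dimensions of the homology groups, assuming coefficients in Z/2 (the slide does not fix a coefficient field). It uses the identity dim H_k = dim C_k − rank ∂_k − rank ∂_{k+1}, where ∂_k is the boundary map on k-chains. The helper names are illustrative.

```python
from itertools import combinations


def rank_mod2(rows):
    """Rank over Z/2 of a matrix whose rows are given as integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices):
    """Betti numbers over Z/2 of a simplicial complex.

    `simplices` is a list of vertex tuples that is closed under taking faces.
    """
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    top = max(by_dim)
    index = {d: {s: i for i, s in enumerate(by_dim[d])} for d in by_dim}
    # rank[d] is the rank of the boundary map from d-chains to (d-1)-chains.
    rank = {0: 0, top + 1: 0}
    for d in range(1, top + 1):
        rows = []
        for s in by_dim[d]:
            mask = 0
            for face in combinations(s, d):  # faces of dimension d - 1
                mask |= 1 << index[d - 1][face]
            rows.append(mask)
        rank[d] = rank_mod2(rows)
    return [len(by_dim[d]) - rank[d] - rank[d + 1] for d in range(top + 1)]
```

A hollow triangle has one connected component and one 1-cycle that is not a boundary; filling in the triangle turns that cycle into a boundary, so the degree-1 Betti number drops to 0.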

Persistence

Main idea

Vary a parameter and keep track of when features appear and disappear.

(Animation frames: radius = 0, 1, 2, ..., 11.)

Mathematical encoding

We have an increasing sequence of simplicial complexes

$X_0 \subseteq X_1 \subseteq X_2 \subseteq \cdots \subseteq X_m$

called a filtered simplicial complex.

Apply homology (with coefficients in a field).

We get a sequence of vector spaces and linear maps

$V_0 \to V_1 \to V_2 \to \cdots \to V_m$

called a persistence module.

Persistence module to Barcode

$V_0 \to V_1 \to V_2 \to V_3 \to V_4 \to V_5 \to V_6 \to V_7 \to \cdots \to V_m$

Fundamental Theorem of Persistent Homology

There exists a choice of bases for the vector spaces $V_i$ such that each map is determined by a bipartite matching of basis vectors.

Get a barcode:

(Figure: barcode drawn over a parameter axis running from 2 to 12.)
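The slide states the theorem without an algorithm. One common way to obtain the matching, and hence the barcode, is to reduce the boundary matrix of the filtered complex column by column; the sketch below assumes Z/2 coefficients and is not claimed to be the method used in the talk. The function name `persistence_pairs` and the input format are illustrative.

```python
def persistence_pairs(filtration):
    """Barcode of a filtered simplicial complex by column reduction over Z/2.

    `filtration` is a list of (time, simplex) pairs, where each simplex is a
    tuple of vertices and every face appears before any simplex containing it.
    Returns (degree, birth, death) triples; death is None for bars that never
    end. Bars of length zero are omitted.
    """
    order = {tuple(sorted(s)): i for i, (_, s) in enumerate(filtration)}
    columns = []      # reduced boundary columns, as sets of row indices
    pivot_owner = {}  # largest row index of a column -> index of that column
    paired = set()
    bars = []
    for j, (t, s) in enumerate(filtration):
        s = tuple(sorted(s))
        col = set()
        if len(s) > 1:
            col = {order[s[:i] + s[i + 1:]] for i in range(len(s))}
        # Add earlier columns (mod 2) until the pivot is new or the column is 0.
        while col and max(col) in pivot_owner:
            col = col ^ columns[pivot_owner[max(col)]]
        columns.append(col)
        if col:
            i = max(col)
            pivot_owner[i] = j
            paired.update((i, j))
            birth = filtration[i][0]
            if t > birth:
                bars.append((len(s) - 2, birth, t))
    for j, (t, s) in enumerate(filtration):
        if j not in paired:
            bars.append((len(s) - 1, t, None))
    return bars
```

For three vertices entering at time 0, edges at times 1, 2 and 3, and the triangle at time 5, this yields two degree-0 bars that end when components merge, one degree-0 bar that never ends, and a degree-1 bar from 3 to 5 for the loop that is later filled in.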

Barcode to Persistence Landscape

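Barcode:

(Figure: barcode over a horizontal axis from 0 to 14.)

Persistence Landscape:

(Figure: landscape functions $\lambda_1$, $\lambda_2$, $\lambda_3$, with horizontal axis ticks from 2 to 14 and vertical axis ticks from 0 to 6; $\lambda_k = 0$ for $k \ge 4$.)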
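The figure is not recoverable here, so the following sketch uses the standard definition instead: each bar (b, d) contributes a "tent" function max(0, min(t − b, d − t)), and λ_k(t) is the k-th largest tent value at t. The barcode in the usage note is made up for illustration and is not the one on the slide.

```python
def landscape(barcode, k, t):
    """Value at t of the k-th persistence landscape (k = 1, 2, ...).

    `barcode` is a list of (birth, death) pairs with finite deaths.
    """
    tents = sorted(
        (max(0.0, min(t - b, d - t)) for b, d in barcode),
        reverse=True,
    )
    return tents[k - 1] if k <= len(tents) else 0.0
```

With the hypothetical barcode `[(0, 8), (2, 6), (4, 10)]`, the values at t = 5 are λ_1 = 3, λ_2 = 1 and λ_3 = 1. Any barcode with three bars has λ_k = 0 for k ≥ 4, which is consistent with the annotation in the figure.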

Maltose Binding Protein, two conformations

V. Kovacev-Nikolic, P. Bubenik, D. Nikolic, and G. Heo. Using persistent homology and dynamical distances to analyze protein binding. Statistical Applications in Genetics and Molecular Biology, 15 (2016) no. 1, 19–38.

Maltose Binding Protein Data

The Data

Fourteen MBP structures from the Protein Data Bank.

7 closed conformations

7 open conformations

X-ray crystallography: coordinates of atoms.

Represent each amino acid residue by its Cα atom.

Have 14 sets of 370 points in $\mathbb{R}^3$.
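As a rough illustration of the step "represent each amino acid residue by its Cα atom", the sketch below pulls Cα coordinates out of PDB-format text using the fixed-column layout of ATOM records (atom name in columns 13 to 16, coordinates in columns 31 to 54). It is an assumption about how one might do this, not the preprocessing used in the study; it ignores alternate locations, multiple chains and multiple models, and the function name is illustrative.

```python
def ca_coordinates(pdb_text):
    """Return (x, y, z) tuples for the C-alpha atoms in PDB-format text."""
    points = []
    for line in pdb_text.splitlines():
        # Only ATOM records; the atom name field occupies columns 13-16.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            points.append((x, y, z))
    return points
```

Applied to one structure file, this would give one point per residue, that is, a point cloud of the kind described above.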

The Goal

Can we use topological data analysis to distinguish the open and closed conformations?

Filtered simplicial complex

Average persistence landscapes

(Figures: average persistence landscapes in degree 1, one panel labelled "H1 closed" and one labelled "H1 open".)
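Because persistence landscapes are functions, they can be averaged pointwise and compared with an L2 norm. The sketch below samples landscapes on a grid, averages them, and approximates the L2 distance by a Riemann sum. The grid, the number of levels `k_max` and all function names are assumptions made for illustration; the talk does not specify a discretization.

```python
def landscape_on_grid(barcode, k_max, grid):
    """Sample the landscapes lambda_1 .. lambda_k_max of a barcode on a grid."""
    rows = [[0.0] * len(grid) for _ in range(k_max)]
    for j, t in enumerate(grid):
        tents = sorted(
            (max(0.0, min(t - b, d - t)) for b, d in barcode),
            reverse=True,
        )
        for k, value in enumerate(tents[:k_max]):
            rows[k][j] = value
    return rows


def average_landscape(landscapes):
    """Pointwise mean of several sampled landscapes of identical shape."""
    n = len(landscapes)
    return [
        [sum(values) / n for values in zip(*level)]
        for level in zip(*landscapes)
    ]


def l2_distance(f, g, step):
    """Riemann-sum approximation of the L2 distance between two landscapes."""
    total = sum(
        (a - b) ** 2
        for row_f, row_g in zip(f, g)
        for a, b in zip(row_f, row_g)
    )
    return (total * step) ** 0.5
```

Computing `l2_distance` for every pair of sampled landscapes gives a distance matrix, one row and column per structure.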

Clustering of protein conformations

Figure 1: Distance analysis based on the 2-Landscape distance shows a separation between the closed (blue) and the open (red) MBP conformation for degree 0 (left) and degree 1 (right) persistent homology. Similar results hold for degree 2. The projection of the 14 × 14 distance matrix onto the 3D space is attained via Isomap.

4.3 Statistical Inference

To measure the statistical significance of visually observed differences between the closed and the open conformation we use a permutation test. For each degree, we calculate fourteen sample values of the random variable X from Equation (3). The permutation test carried out at the significance level of 0.05 yields a p-value of $5.83 \times 10^{-4}$ for both homology in degree 0 and in degree 1. We obtain the same p-value since in both cases the observed statistic was the most extreme among all 1716 possible permutations. Hence, at …

Projection of the $L^2$ distance matrix to $\mathbb{R}^3$ using Isomap.
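The p-value quoted in the excerpt can be checked with a small exact permutation test. The sketch below is only illustrative: Equation (3) of the paper is not reproduced in the slides, so the statistic here is a stand-in (the absolute difference between the two group sums, which for equal group sizes orders the splits the same way as the difference of means), and the data in the usage note are made up.

```python
from itertools import combinations


def permutation_p_value(values, group_a):
    """Exact two-sample permutation test.

    `values` holds one number per sample and `group_a` the indices of the
    first group. The statistic is |sum over A - sum over the rest|, an
    illustrative choice rather than the statistic used in the paper.
    """
    n, m = len(values), len(group_a)
    total = sum(values)

    def stat(indices):
        s = sum(values[i] for i in indices)
        return abs(2 * s - total)

    observed = stat(group_a)
    splits = list(combinations(range(n), m))
    extreme = sum(1 for indices in splits if stat(indices) >= observed)
    return extreme / len(splits)
```

With 14 samples split 7 and 7 there are 3432 labelled splits, or 1716 partitions once a split and its complement are identified. If the observed split is the single most extreme one, as with the hypothetical values 1 to 7 against 11 to 17, the p-value is 2/3432 = 1/1716 ≈ 5.83 × 10^{-4}, the value reported in the excerpt.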

Classification of protein conformations

(Figure: three panels, one each for H0, H1 and H2, with axes x, y, z; legend: closed, open, support vector.)

SVM on Isomap embedded 3D coordinates in the metric space induced by the 2-Landscape distance

Software

Persistent Homology software:

JavaPlex

PHAT, DIPHA

Perseus

Dionysus

CHOMP

GUDHI

Persistence Landscape software:

The Persistence Landscape Toolbox

the R package TDA

Topological Data Analysis Summary

Raw data → (Preprocess) → Clean data → (Transform) → Filtered simplicial complex → (Homology) → Persistence module → Topological summary → Statistics and Machine Learning