Data Mining in Bioinformatics Day 9: Graph Mining in ... · Data Mining in Bioinformatics Day 9:...

Post on 27-Mar-2020

8 views 0 download

Transcript of Data Mining in Bioinformatics Day 9: Graph Mining in ... · Data Mining in Bioinformatics Day 9:...

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in BioinformaticsDay 9: Graph Mining in Chemoinformatics

Chloé-Agathe Azencott & Karsten Borgwardt

February 10 to February 21, 2014

Machine Learning & Computational Biology Research GroupMax Planck Institutes Tübingen andEberhard Karls Universität Tübingen

Drug discovery

Karsten Borgwardt: Data Mining in Bioinformatics, Page 2

Modern therapeutic researchFrom serendipity to rationalized drug design

Ancient Greeks treatinfections with mould

CH 3

N

S

O

NH

O

HO

NH 2

O

HO

CH 3

Biapenem in PBP-1A

Drug discovery process

Karsten Borgwardt: Data Mining in Bioinformatics, Page 3

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

hits

4. Lead optimization

and synthesis

5. Assay

Protein that we want to inhibit so as to interfer with a biological process

Compounds likely to bind to the target

Can they be drugs? (ADME-Tox)

- in vitro- in vivo- clinical

- bioactivity- pharmacokinetics- synthetic pathway

Drug discovery process

Karsten Borgwardt: Data Mining in Bioinformatics, Page 4

52 months 90 months

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

hits

4. Lead optimization

and synthesis

5. Assay

Drug discovery process

Karsten Borgwardt: Data Mining in Bioinformatics, Page 5

$500,000,000to

$2,000,000,000

52 months 90 months

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

hits

4. Lead optimization

and synthesis

5. Assay

Chemoinformatics

Karsten Borgwardt: Data Mining in Bioinformatics, Page 6

How can computer science help?→ Chemoinformatics!

“...the mixing of information resources to transform data into informa-tion, and information into knowledge, for the intended purpose of mak-ing better decisions faster in the arena of drug lead identification andoptimisation.” – F. K. Brown

“... the application of informatics methods to solve chemical problems.”– J. Gasteiger and T. Engel

Chemoinformatics

Karsten Borgwardt: Data Mining in Bioinformatics, Page 7

Chemoinformatics

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

hits

4. Lead optimization

and synthesis

5. Assay

Chemoinformatics

Karsten Borgwardt: Data Mining in Bioinformatics, Page 8

The chemical space

1060 possible small or-ganic molecules

1022 stars in the observ-able universe

(Slide courtesy of Matthew A. Kayala)

Drug discovery process

Karsten Borgwardt: Data Mining in Bioinformatics, Page 9

QSARQSPR

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

hits

4. Lead optimization

and synthesis

5. Assay

QSAR: Qualitative Structure-Activity Relationshipi.e. classification

QSPR: Quantititive Structure-Property Relationshipi.e. regression

Representing chemicals in silico

Karsten Borgwardt: Data Mining in Bioinformatics, Page 10

Expert knowledge molecular descriptors→ hard, potentially incomplete

Molecules are...

CH 3

N

S

O

NH

O

HO

NH 2

O

HO

CH 3

Representing chemicals in silico

Karsten Borgwardt: Data Mining in Bioinformatics, Page 11

Similar Property PrincipleMolecules having similar structures should exhibit similaractivities.

→ Structure-based representationsCompare molecules by comparing substructures

Molecular graph

Karsten Borgwardt: Data Mining in Bioinformatics, Page 12

C

O

N C

C

C

N

O

S

C

C

O O

C

C

d

d

d

C

C

NC

C

C

C

C

CO

Undirected labeled graph

Fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 13

Define feature vectors that record the presence/absence(or number of occurrences) of particular patterns in a givenmolecular graph

φ(A) = (φs(A))s substructure

whereφs(A) =

{1 if s occurs in A0 otherwise

Extension of traditional chemical fingerprints

Fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 14

Learning from fingerprintsClassical machine learning and data mining techniquescan be applied to these vectorial feature representations.

Any distance / kernel can be usedClassificationFeature selectionClustering

Fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 15

Fingerprints compressionSystematic enumeration→ long, sparse vectorse.g. 50, 000 random compounds from ChemDB→ 300, 000 paths of length up to 8→ 300 non-zeros on average“Naive” Compression

List the positions of the 1s219 = 524, 288average encoding: 300× 19 = 5, 700 bits

Fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 16

Fingerprints compressionModulo Compression (lossy)

Frequent patterns fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 17

MOLFEA [Helma et al., 2004]

P = positive (mutagenic) compoundsN = negative compounds

features: fragments (= patterns) f such thatboth freq(f,P) ≥ t and freq(f,N) ≥ t

Limited to frequent linear patterns

ML algorithm: SVM with linear or quadratic kernel

Frequent patterns fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 18

MOLFEA [Helma et al., 2004]

CPDB – Carcinogenic Potency DataBase684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella

1% 3% 5% 10%Frequency threshold

50

60

70

80

90

100

Cross-validated sensitivity

Mutagenicity prediction [Hema04]

Linear kernelQuadratic kernel

Spectrum kernels

Karsten Borgwardt: Data Mining in Bioinformatics, Page 19

φ(A) = (φs(A))s∈S

Kspectrum(A,A′) = k(φ(A), φ(A′))

k ∈ RR|(S)|×R|(S)| can beDot product (linear kernel)

RBF kernel

Tanimoto kernel: k(A,B) = A∩BA∪B

MinMax kernel:∑N

i=1min(Ai,Bi)∑Ni=1max(Ai,Bi)

Spectrum kernels

Karsten Borgwardt: Data Mining in Bioinformatics, Page 20

Tanimoto and MinMaxBoth Tanimoto and Minmax are kernels.

Proof for Tanimoto: J.C. Gower A general coefficientof similarity and some of its properties. Biometrics1971.Proof for MinMax:

MinMax(x, y) =〈φ(x), φ(y)〉

〈φ(x), φ(x)〉 + 〈φ(y), φ(y)〉 − 〈φ(x), φ(y)〉with φ(x) of length: # patterns × max countφ(x)i = 1 iff. the pattern indexed by bi/qc appears morethan i mod q times in x

All patterns fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 21

Paths fingerprintsLabeled sub-paths (walks)

O

N C C

N

O

S

C

C

O O

C

C

d

d

d

C

C

NsCsCsS

CsCsCdO

C

C

NC

C

C

C

C

CO

Some sub-paths of length 3

All patterns fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 22

Circular fingerprintsLabeled sub-trees - Extended-Connectivity (or Circular)features

O

N C C

N

O

S

C

C

O O

C

Cd

d

d

C

C

C{sC{sN|sC}|sN{sC}|sS{sC}}

C

C

NC

C

C

C

C

CO

Example of a circular substructure of depth 2

All patterns fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 23

2D spectrum kernels [Azencott et al., 2007]

Systematically extract paths / circular fingerprints,for various maximal depthsSVM with Tanimoto / Minmax

All patterns fingerprints

Karsten Borgwardt: Data Mining in Bioinformatics, Page 24

2D spectrum kernels [Azencott et al., 2007]

Mutagenicity (Mutag): 188 compounds

Benzodiazepine receptor affinity (BZR): 181+125 compounds

Cyclooxygenase-2 ihibitors (COX2): 178 + 125 compounds

Estrogen receptor affinity (ER): 166 + 180 compounds

Data SVM Previous bestMutag 90.4% 85.2% (gBoost)BZR 79.8% 76.4%

COX2 70.1% 73.6%

ER 82.1% 79.8%

Weisfeiler-Lehman kernel

Karsten Borgwardt: Data Mining in Bioinformatics, Page 25

[Shervashidze et al., 2011]

Goal: scalability

Compute a sequence that captures topological and labelinformation of graphs in a runtime linear in the number ofedges

→ sub-tree kernel

Weisfeiler-Lehman kernel

Karsten Borgwardt: Data Mining in Bioinformatics, Page 26

[Shervashidze et al., 2011]

Convolution kernels

Karsten Borgwardt: Data Mining in Bioinformatics, Page 27

a.k.a. decomposition kernels(x1, . . . , xD) is a tuple of parts of x, with xd ∈ X for eachpart d = 1, . . . , D

kd ∈ RXd×Xd: a Mercer kernel

Kdecomposition(x, x′) =

∑x1x2...xD=x

∑x′1x′2x′D=x

k1(x1, x′1)k2(x2, x

′2) . . . kD(xD, x

′D)

Spectrum kernels are a particular case of convolutionkernels

Convolution kernels

Karsten Borgwardt: Data Mining in Bioinformatics, Page 28

Weighted Decomposition Kernel [Menchetti et al., 2005]

Match atoms and weigh them according to a kernel between sub-graphs that include these atoms

KWDK(x, x′) =

∑(a,σ∈Dr(x))

∑(a′,σ′∈Dr(x′)) δ(a, a

′)Kc(σ, σ′)

r > 0 ∈ N

Dr(x): decompositions of the molecular graph of x in an atom a

and a subpath σ of x including a and of depth at most r

Convolution kernels

Karsten Borgwardt: Data Mining in Bioinformatics, Page 29

Weighted Decomposition Kernel [Menchetti et al., 2005]

Kc: contextual kernel, here: histogram intersection kernel

Kc(σ, σ′) =

∑l∈L min(fσ(l), fσ′(l))

L: possible labels for edges and vertices

fσ(l): frequency of label l subgraph σ.

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 30

3D Histograms [Azencott et al., 2007]

Groups of k atoms

Associated size:

Pairwise distances(k = 2)diameter of the smallestsphere that contains allk atoms

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 31

3D Histograms [Azencott et al., 2007]

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)

C

O

N C

C

C

N

O

S

C

C

O O

C

C

C2.2

4.6

3.2

5.6

6.7

2.4

2.6

3.7

0 1 2 3 4 5 6 7

Frequency of N-O

N-O distance (A)

C

NC

C

C

C

C

CO 6.3

6.6

9.2 2.7

5.7

7.9

9.5

8 9 10

1

2

3

4

0

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 32

3D Histograms: performance [Azencott et al., 2007]

Data 2D kernel Hist3D kernelMutag 90.4% 88.8%

BZR (loo) 82.0% 79.4%ER (loo) 87.0% 86.1%COX2 76.9% 78.6%

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 33

3D Decomposition Kernels [Ceroni et al., 2007]

Remember: KWDK(x, x′) =

∑(a,σ∈Dr(x))

∑(a′,σ′∈Dr(x′))

δ(a, a′)Kc(σ, σ′)

K3DDK(x, x′) =

∑σ∈Sr(x)

∑σ′∈Sr(x′)Ks(σ, σ

′)

Sr(x): subgraphs of x composed of r distinct vertices

Ks(σ, σ′) =

∏r(r−1)/2i=1 δ(ei, e

′i)e−γ(li−l′i)

li = length of edge ei in x(e1, e2, . . . , er(r−1)/2 lexicographically ordered; γ ∈ R

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 34

3DDK: Performance [Ceroni et al., 2007]

Data 2D kernel Hist3D kernel 3DDK Circ3DDKMutag 90.4% 88.8% 86.7% 83.5%

BZR (loo) 82.0% 79.4% 78.4% 81.4%ER (loo) 87.0% 86.1% 82.3% 82.1%COX2 76.9% 78.6% 75.6% 75.2%

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 35

The pharmacophore kernel [Mahé et al., 2006]

pharmacophore p ∈ P(x): p = [(x1, l1), (x2, l2), (x3, l3)]

xi 3D coordinates of atom i of x; li = label of atom i

K(x, x′) =∑

p∈P(x)∑

p′∈P(x′)KP (p, p′)

KP (p, p′) = Kdist(d1, d

′1)Kdist(d2, d

′2)Kdist(d3, d

′3)Kfeat(l1, l

′1)Kfeat(l2, l

′2)Kfeat(l3, l

′3)

Kdist: RBF Gaussian Kdist(d, d′) = exp

(‖d−d′‖22σ2

)Kfeat: Dirac

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 36

3D LAP kernel [Hinselmann et al., 2010]

M : pairwise intramolecular matrix of inter-atomicgeometric distances

Introducing spatial information

Karsten Borgwardt: Data Mining in Bioinformatics, Page 37

ConclusionHow relevant is 3D information?How good is 3D information?

Drug discovery process

Karsten Borgwardt: Data Mining in Bioinformatics, Page 38

Docking

VirtualHigh-Throughput

Screening

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

hits

4. Lead optimization

and synthesis

5. Assay

High-throughput screening

Karsten Borgwardt: Data Mining in Bioinformatics, Page 39

Assay a large library of potentialdrugs against their target

Very costly

→ docking

→ virtual high-throughputscreening (vHTS)

Measuring performance

Karsten Borgwardt: Data Mining in Bioinformatics, Page 40

Imbalanced data

Typically, most compounds are inactive ⇒ many more negativethan positive examples

E.g. DHFR data set:99, 995 chemicals screened for activity against dihydrofolatereductase; < 0.2% active compounds

Accuracy is not appropriate:predicting all compounds negative⇒ accuracy = 99.8%

sensitivity= # True Positives# Positives

specificity= # True Negatives# Negatives

For many methods, the output is continuous⇒ accuracy, sensitivity and specificity depend on a threshold θ

Measuring performance

Karsten Borgwardt: Data Mining in Bioinformatics, Page 41

Receiver-Operator Characteristic Curves

For all possible values of θ, report sensitivity and 1− specificityAUROC (Area under the ROC Curve) is a numerical measure ofperformance

AUROC(random) = 0.5 and AUROC(optimal) = 1

0 1/6 1/3 1/2 2/3 5/6 1

01

/42

/43

/41

False Positive Rate

Tru

e P

ositiv

e R

ate

x

x x

x

x x x x

x x x

Inf

0.95 0.94

0.9

0.81 0.73 0.52 0.2

0.17 0.12 0.09

random

perfect

real

label prediction+ 0.95- 0.94+ 0.90+ 0.81- 0.73- 0.52- 0.20+ 0.17- 0.12- 0.09

Measuring performance

Karsten Borgwardt: Data Mining in Bioinformatics, Page 42

Inhibition of DHFR: ROC Curves [Azencott et al., 2007]

method AUCIRV 0.71SVM 0.59kNN 0.59

MAX-SIM 0.54

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

RANDOM

IRV

SVM

MAXSIM

Measuring performance

Karsten Borgwardt: Data Mining in Bioinformatics, Page 43

Precision-recall curves

Precision = # True Positives# Predicted Positives

Recall = sensitivity

0 1/4 2/4 3/4 1

01/5

2/5

3/5

4/5

1

Recall

Pre

cis

ion

x

x

x

x

x

x

x

xxx

0.95

0.94

0.9

0.81

0.73

0.52

0.2

0.170.120.09

perfect

real

Other applications

Karsten Borgwardt: Data Mining in Bioinformatics, Page 44

Other applications of graph mining in chemoinformatics

Database indexing and searchPrediction of 3D structures of small compoundsand proteinsReaction Prediction

References and further reading

Karsten Borgwardt: Data Mining in Bioinformatics, Page 45

[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One-to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical

information and modeling 47, 965–974. 23, 24, 35, 36, 37, 47

[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprintsusing integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, 2098–2109.

[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two-and three-dimensionaldecomposition kernels. Bioinformatics 23, 2038–2045. 38, 39

[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques forthe identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal ofchemical information and computer sciences 44, 1402–1411. 17, 18

[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compoundsusing topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229. 41

[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening withsupport vector machines. Journal of chemical information and modeling 46, 2003–2014. 40

[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd

International Conference on Machine Learning pp. 585–592, ACM, Bonn, Germany. 33, 34

[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programmingapproach to graph classification and regression. Machine Learning 75, 69–89. 26, 27, 28, 29

[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561. 30, 31