Data Mining in Bioinformatics Day 9: Graph Mining in ... · Data Mining in Bioinformatics Day 9:...

Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in BioinformaticsDay 9: Graph Mining in Chemoinformatics

Chloé-Agathe Azencott & Karsten Borgwardt

February 10 to February 21, 2014

Machine Learning & Computational Biology Research GroupMax Planck Institutes Tübingen andEberhard Karls Universität Tübingen

Drug discovery

Modern therapeutic researchFrom serendipity to rationalized drug design

Ancient Greeks treatinfections with mould

Biapenem in PBP-1A

Drug discovery process

1. Find a target

2. Identifyhits

3.Hit-to-lead: characterize

4. Lead optimization

and synthesis

5. Assay

Protein that we want to inhibit so as to interfer with a biological process

Compounds likely to bind to the target

Can they be drugs? (ADME-Tox)

- in vitro- in vivo- clinical

- bioactivity- pharmacokinetics- synthetic pathway

52 months 90 months

1. Find a target

2. Identifyhits

and synthesis

5. Assay

$500,000,000to

$2,000,000,000

52 months 90 months

1. Find a target

2. Identifyhits

and synthesis

5. Assay

Chemoinformatics

How can computer science help?→ Chemoinformatics!

“...the mixing of information resources to transform data into informa-tion, and information into knowledge, for the intended purpose of mak-ing better decisions faster in the arena of drug lead identification andoptimisation.” – F. K. Brown

“... the application of informatics methods to solve chemical problems.”– J. Gasteiger and T. Engel

Chemoinformatics

1. Find a target

2. Identifyhits

and synthesis

5. Assay

Chemoinformatics

The chemical space

1060 possible small or-ganic molecules

1022 stars in the observ-able universe

(Slide courtesy of Matthew A. Kayala)

QSARQSPR

1. Find a target

2. Identifyhits

and synthesis

5. Assay

QSAR: Qualitative Structure-Activity Relationshipi.e. classification

QSPR: Quantititive Structure-Property Relationshipi.e. regression

Representing chemicals in silico

Expert knowledge molecular descriptors→ hard, potentially incomplete

Molecules are...

Representing chemicals in silico

Similar Property PrincipleMolecules having similar structures should exhibit similaractivities.

→ Structure-based representationsCompare molecules by comparing substructures

Molecular graph

Undirected labeled graph

Fingerprints

Define feature vectors that record the presence/absence(or number of occurrences) of particular patterns in a givenmolecular graph

φ(A) = (φs(A))s substructure

whereφs(A) =

{1 if s occurs in A0 otherwise

Extension of traditional chemical fingerprints

Fingerprints

Learning from fingerprintsClassical machine learning and data mining techniquescan be applied to these vectorial feature representations.

Any distance / kernel can be usedClassificationFeature selectionClustering

Fingerprints

Fingerprints compressionSystematic enumeration→ long, sparse vectorse.g. 50, 000 random compounds from ChemDB→ 300, 000 paths of length up to 8→ 300 non-zeros on average“Naive” Compression

List the positions of the 1s219 = 524, 288average encoding: 300× 19 = 5, 700 bits

Fingerprints

Fingerprints compressionModulo Compression (lossy)

Frequent patterns fingerprints

MOLFEA [Helma et al., 2004]

P = positive (mutagenic) compoundsN = negative compounds

features: fragments (= patterns) f such thatboth freq(f,P) ≥ t and freq(f,N) ≥ t

Limited to frequent linear patterns

ML algorithm: SVM with linear or quadratic kernel

Frequent patterns fingerprints

MOLFEA [Helma et al., 2004]

CPDB – Carcinogenic Potency DataBase684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella

1% 3% 5% 10%Frequency threshold

Cross-validated sensitivity

Mutagenicity prediction [Hema04]

Linear kernelQuadratic kernel

Spectrum kernels

φ(A) = (φs(A))s∈S

Kspectrum(A,A′) = k(φ(A), φ(A′))

k ∈ RR|(S)|×R|(S)| can beDot product (linear kernel)

RBF kernel

Tanimoto kernel: k(A,B) = A∩BA∪B

MinMax kernel:∑N

i=1min(Ai,Bi)∑Ni=1max(Ai,Bi)

Spectrum kernels

Tanimoto and MinMaxBoth Tanimoto and Minmax are kernels.

Proof for Tanimoto: J.C. Gower A general coefficientof similarity and some of its properties. Biometrics1971.Proof for MinMax:

MinMax(x, y) =〈φ(x), φ(y)〉

〈φ(x), φ(x)〉 + 〈φ(y), φ(y)〉 − 〈φ(x), φ(y)〉with φ(x) of length: # patterns × max countφ(x)i = 1 iff. the pattern indexed by bi/qc appears morethan i mod q times in x

All patterns fingerprints

Paths fingerprintsLabeled sub-paths (walks)

NsCsCsS

CsCsCdO

Some sub-paths of length 3

Circular fingerprintsLabeled sub-trees - Extended-Connectivity (or Circular)features

C{sC{sN|sC}|sN{sC}|sS{sC}}

Example of a circular substructure of depth 2

2D spectrum kernels [Azencott et al., 2007]

Systematically extract paths / circular fingerprints,for various maximal depthsSVM with Tanimoto / Minmax

2D spectrum kernels [Azencott et al., 2007]

Mutagenicity (Mutag): 188 compounds

Benzodiazepine receptor affinity (BZR): 181+125 compounds

Cyclooxygenase-2 ihibitors (COX2): 178 + 125 compounds

Estrogen receptor affinity (ER): 166 + 180 compounds

Data SVM Previous bestMutag 90.4% 85.2% (gBoost)BZR 79.8% 76.4%

COX2 70.1% 73.6%

ER 82.1% 79.8%

Weisfeiler-Lehman kernel

[Shervashidze et al., 2011]

Goal: scalability

Compute a sequence that captures topological and labelinformation of graphs in a runtime linear in the number ofedges

→ sub-tree kernel

Weisfeiler-Lehman kernel

[Shervashidze et al., 2011]

Convolution kernels

a.k.a. decomposition kernels(x1, . . . , xD) is a tuple of parts of x, with xd ∈ X for eachpart d = 1, . . . , D

kd ∈ RXd×Xd: a Mercer kernel

Kdecomposition(x, x′) =

∑x1x2...xD=x

∑x′1x′2x′D=x

k1(x1, x′1)k2(x2, x

′2) . . . kD(xD, x

Spectrum kernels are a particular case of convolutionkernels

Convolution kernels

Weighted Decomposition Kernel [Menchetti et al., 2005]

Match atoms and weigh them according to a kernel between sub-graphs that include these atoms

KWDK(x, x′) =

∑(a,σ∈Dr(x))

∑(a′,σ′∈Dr(x′)) δ(a, a

′)Kc(σ, σ′)

r > 0 ∈ N

Dr(x): decompositions of the molecular graph of x in an atom a

and a subpath σ of x including a and of depth at most r

Convolution kernels

Weighted Decomposition Kernel [Menchetti et al., 2005]

Kc: contextual kernel, here: histogram intersection kernel

Kc(σ, σ′) =

∑l∈L min(fσ(l), fσ′(l))

L: possible labels for edges and vertices

fσ(l): frequency of label l subgraph σ.

Introducing spatial information

3D Histograms [Azencott et al., 2007]

Groups of k atoms

Associated size:

Pairwise distances(k = 2)diameter of the smallestsphere that contains allk atoms

3D Histograms [Azencott et al., 2007]

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)

0 1 2 3 4 5 6 7

Frequency of N-O

N-O distance (A)

CO 6.3

9.2 2.7

8 9 10

3D Histograms: performance [Azencott et al., 2007]

Data 2D kernel Hist3D kernelMutag 90.4% 88.8%

BZR (loo) 82.0% 79.4%ER (loo) 87.0% 86.1%COX2 76.9% 78.6%

3D Decomposition Kernels [Ceroni et al., 2007]

Remember: KWDK(x, x′) =

∑(a,σ∈Dr(x))

∑(a′,σ′∈Dr(x′))

δ(a, a′)Kc(σ, σ′)

K3DDK(x, x′) =

∑σ∈Sr(x)

∑σ′∈Sr(x′)Ks(σ, σ

Sr(x): subgraphs of x composed of r distinct vertices

Ks(σ, σ′) =

∏r(r−1)/2i=1 δ(ei, e

′i)e−γ(li−l′i)

li = length of edge ei in x(e1, e2, . . . , er(r−1)/2 lexicographically ordered; γ ∈ R

3DDK: Performance [Ceroni et al., 2007]

Data 2D kernel Hist3D kernel 3DDK Circ3DDKMutag 90.4% 88.8% 86.7% 83.5%

BZR (loo) 82.0% 79.4% 78.4% 81.4%ER (loo) 87.0% 86.1% 82.3% 82.1%COX2 76.9% 78.6% 75.6% 75.2%

The pharmacophore kernel [Mahé et al., 2006]

pharmacophore p ∈ P(x): p = [(x1, l1), (x2, l2), (x3, l3)]

xi 3D coordinates of atom i of x; li = label of atom i

K(x, x′) =∑

p∈P(x)∑

p′∈P(x′)KP (p, p′)

KP (p, p′) = Kdist(d1, d

′1)Kdist(d2, d

′2)Kdist(d3, d

′3)Kfeat(l1, l

′1)Kfeat(l2, l

′2)Kfeat(l3, l

Kdist: RBF Gaussian Kdist(d, d′) = exp

(‖d−d′‖22σ2

)Kfeat: Dirac

3D LAP kernel [Hinselmann et al., 2010]

M : pairwise intramolecular matrix of inter-atomicgeometric distances

ConclusionHow relevant is 3D information?How good is 3D information?

Docking

VirtualHigh-Throughput

Screening

1. Find a target

2. Identifyhits

and synthesis

5. Assay

High-throughput screening

Assay a large library of potentialdrugs against their target

Very costly

→ docking

→ virtual high-throughputscreening (vHTS)

Measuring performance

Imbalanced data

Typically, most compounds are inactive ⇒ many more negativethan positive examples

E.g. DHFR data set:99, 995 chemicals screened for activity against dihydrofolatereductase; < 0.2% active compounds

Accuracy is not appropriate:predicting all compounds negative⇒ accuracy = 99.8%

sensitivity= # True Positives# Positives

specificity= # True Negatives# Negatives

For many methods, the output is continuous⇒ accuracy, sensitivity and specificity depend on a threshold θ

Receiver-Operator Characteristic Curves

For all possible values of θ, report sensitivity and 1− specificityAUROC (Area under the ROC Curve) is a numerical measure ofperformance

AUROC(random) = 0.5 and AUROC(optimal) = 1

0 1/6 1/3 1/2 2/3 5/6 1

False Positive Rate

ositiv

x x x x

0.95 0.94

0.81 0.73 0.52 0.2

0.17 0.12 0.09

random

perfect

label prediction+ 0.95- 0.94+ 0.90+ 0.81- 0.73- 0.52- 0.20+ 0.17- 0.12- 0.09

Inhibition of DHFR: ROC Curves [Azencott et al., 2007]

method AUCIRV 0.71SVM 0.59kNN 0.59

MAX-SIM 0.54

0.0 0.2 0.4 0.6 0.8 1.0

RANDOM

MAXSIM

Precision-recall curves

Precision = # True Positives# Predicted Positives

Recall = sensitivity

0 1/4 2/4 3/4 1

Recall

0.170.120.09

perfect

Other applications

Other applications of graph mining in chemoinformatics

Database indexing and searchPrediction of 3D structures of small compoundsand proteinsReaction Prediction

References and further reading

[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One-to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical

information and modeling 47, 965–974. 23, 24, 35, 36, 37, 47

[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprintsusing integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, 2098–2109.

[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two-and three-dimensionaldecomposition kernels. Bioinformatics 23, 2038–2045. 38, 39

[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques forthe identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal ofchemical information and computer sciences 44, 1402–1411. 17, 18

[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compoundsusing topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229. 41

[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening withsupport vector machines. Journal of chemical information and modeling 46, 2003–2014. 40

[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd

International Conference on Machine Learning pp. 585–592, ACM, Bonn, Germany. 33, 34

[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programmingapproach to graph classification and regression. Machine Learning 75, 69–89. 26, 27, 28, 29

[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561. 30, 31

Data Mining in Bioinformatics Day 9: Graph Mining in ... · Data Mining in Bioinformatics Day 9:...

Documents

Transcript of Data Mining in Bioinformatics Day 9: Graph Mining in ... · Data Mining in Bioinformatics Day 9:...

Data Mining for Bioinformatics

Text mining on the command line - Introduction to linux for bioinformatics

Bioinformatics Lab. Centrality and Graph Mining. Bioinformatics Lab. Introduction Many real world systems can be described as networks. Social relationships:

MSc in Bioinformatics: Statistical Data Mining

Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Vector Space Information Retrieval Techniques for Bioinformatics Data Mining · 2018-09-25 · 0 Vector Space Information Retrieval Techniques for Bioinformatics Data Mining Eric

Dynamic Hybrid Clustering of Bioinformatics by Incorporating Text Mining and Citation Analysis

Bioinformatics: Practical Application of Simulation and Data Mining Markov Modeling III

LNBIP 107 - When Process Mining Meets Bioinformatics · When Process Mining Meets Bioinformatics R.P.JagadeeshChandraBose 1,2 andWilM.P.vanderAalst 1 DepartmentofMathematicsandComputerScience,UniversityofTechnology,

When Process Mining Meets Bioinformatics · 2018. 1. 16. · When Process Mining Meets Bioinformatics R.P. Jagadeesh Chandra Bose 1;2 and Wil M.P. van der Aalst 1 Department of Mathematics

APPLICATION OF DATA MINING TECHNIQUES IN BIOINFORMATICSethesis.nitrkl.ac.in/4154/1/Application_of_Data_Mining_Techniques... · i application of data mining techniques in bioinformatics

Detecting the Knowledge Structure of Bioinformatics with Text Mining and Citation Analysis

Data Mining in Bioinformatics - UQAM

Data Integration & Data Mining Tool Donald Dunbar BHF CoRE Bioinformatics Team Edinburgh Bioinformatics Meeting April 2013.

Data mining in genetics - USPvision.ime.usp.br/~jb/lectures/bioinformatics/datamining.pdf · Data mining. P1 P2 Pn Pi : analytical and mining procedures ( kernel parallel ) Objected

ANSC644 Bioinformatics-Database Mining 1 ANSC644 Bioinformatics §Carl J. Schmidt §051 Townsend Hall §schmidtc@udel.edu §.

Bioinformatics Mining and Modeling Methods for the ...

Data Mining in Bioinformatics Erwin M. Bakker LIACS Leiden University.

Bioinformatics and data mining: application in dairy ... · Bioinformatics and data mining: application in dairy cattle nutrition and physiology ... Enrichment tools systematically

Bioinformatics: Practical Application of Simulation and Data Mining Protein Folding I