The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics.

Post on 27-Mar-2015

215 views 0 download

Tags:

Transcript of The new JKlustor suite Miklós Vargyas Solutions for Cheminformatics.

The new JKlustor suite

Miklós Vargyas

Solutions for Cheminformatics

2

Why do we cluster?

• to reduce the number of objects to deal with– group subsets together– represent each group by one member of it

3

What’s the matter with that?

– tedious• parameter tuning in a trial-and-error fashion

– lack of interpretability• the algorithm does not provide explanation

– often do not meet chemists’ expectation

4

Why is clustering molecules hard?

• lack of innate spatial arrangement– artificial arrangement

• infinite types of chemical spaces• various ‘distance metrics’• usually high dimensionality (hard to visualize)

– various approaches, no superior one• “best method” depends on application area, and on

actual data

5

What do we need?

• no/few tuning

• easy to understand simple “explanation”

• novel approach– structure based clustering– Maximum Common Substructure– Molecular frameworks

6

Maximum Common Substructure

• largest substructure shared by two molecules

• Simple concept! More “human”, visual.

• Yet hard (= expensive (= slow)) to compute.

7

MCS complexity

• Sub-structure searching – query structure is known, it “only” have to be found

as part of the target structure (subgraph isomorphism)

– graph isomorphism is even “simpler” yet NP-hard• finding the answer can take long (scales exponentially

with respect to the number graph vertexes) in the worst case

• validating an answer is fast

• MCS – “query” structure is not known

– all possible substructures need to be checked• even the number of substructures is exponential!

8

MCS algorithms

• two camps

backtracking clique detection

ad hoc high mathematical elegance

average complexity is better than worst case

average complexity is same as worst case

dynamic heuristics static (initial) heuristics

coloring is easy coloring is hard

fuzzy matching fussy matching

9

MCS of a structure set

10

LibraryMCS: Hierarchical MCS

11

Intuitive visualization

12

SAR table view

13

R-group decomposition

14

LibraryMCS scales linearly

-500

0

500

1000

1500

2000

2500

3000

3500

4000

0 5000 10000 15000 20000 25000 30000 35000

Structure count

Ru

nn

ing

tim

e (s

ec)

2006

2007

Linear (2007)

15

Clustering performance comparison

0

10

20

30

40

50

60

70

80

90

0 20000 40000 60000 80000 100000 120000

Structure count

Run

ning

tim

e (m

in)

LibraryMCSJarvis-PatrickWard-Murtagh

16

Behind performance

• MCS search– exhaustive– heuristics

• exact• inexact

• Predictive MCS coupling in clustering– all pairs are not feasible– rich fingerprinting

17

Live demonstration

• Affect of use of heuristics– on average < 10% misclassifications– useful for obtaining birds-eye-view of a

larger/diverse sets

1M< compounds libraries

• Molecular scaffolds, – Rings, ring systems– Bemis-Murcko frameworks

• Sphere exclusion – Variants… linear scaling

Fast clustering methods

20

Jklustor roadmap

• In the dev pipeline– IJC integration– Spotfire integration– new dynamic viewer

• Planned– disconnected MCS– multiple class members

21

Acknowledgements

• Gábor Imre

• Judit Vaskó-Szedlár

• Péter Vadász