Post on 27-Mar-2015
The new JKlustor suite
Miklós Vargyas
Solutions for Cheminformatics
2
Why do we cluster?
• to reduce the number of objects to deal with– group subsets together– represent each group by one member of it
3
What’s the matter with that?
– tedious• parameter tuning in a trial-and-error fashion
– lack of interpretability• the algorithm does not provide explanation
– often do not meet chemists’ expectation
4
Why is clustering molecules hard?
• lack of innate spatial arrangement– artificial arrangement
• infinite types of chemical spaces• various ‘distance metrics’• usually high dimensionality (hard to visualize)
– various approaches, no superior one• “best method” depends on application area, and on
actual data
5
What do we need?
• no/few tuning
• easy to understand simple “explanation”
• novel approach– structure based clustering– Maximum Common Substructure– Molecular frameworks
6
Maximum Common Substructure
• largest substructure shared by two molecules
• Simple concept! More “human”, visual.
• Yet hard (= expensive (= slow)) to compute.
7
MCS complexity
• Sub-structure searching – query structure is known, it “only” have to be found
as part of the target structure (subgraph isomorphism)
– graph isomorphism is even “simpler” yet NP-hard• finding the answer can take long (scales exponentially
with respect to the number graph vertexes) in the worst case
• validating an answer is fast
• MCS – “query” structure is not known
– all possible substructures need to be checked• even the number of substructures is exponential!
8
MCS algorithms
• two camps
backtracking clique detection
ad hoc high mathematical elegance
average complexity is better than worst case
average complexity is same as worst case
dynamic heuristics static (initial) heuristics
coloring is easy coloring is hard
fuzzy matching fussy matching
9
MCS of a structure set
10
LibraryMCS: Hierarchical MCS
11
Intuitive visualization
12
SAR table view
13
R-group decomposition
14
LibraryMCS scales linearly
-500
0
500
1000
1500
2000
2500
3000
3500
4000
0 5000 10000 15000 20000 25000 30000 35000
Structure count
Ru
nn
ing
tim
e (s
ec)
2006
2007
Linear (2007)
15
Clustering performance comparison
0
10
20
30
40
50
60
70
80
90
0 20000 40000 60000 80000 100000 120000
Structure count
Run
ning
tim
e (m
in)
LibraryMCSJarvis-PatrickWard-Murtagh
16
Behind performance
• MCS search– exhaustive– heuristics
• exact• inexact
• Predictive MCS coupling in clustering– all pairs are not feasible– rich fingerprinting
17
Live demonstration
• Affect of use of heuristics– on average < 10% misclassifications– useful for obtaining birds-eye-view of a
larger/diverse sets
1M< compounds libraries
• Molecular scaffolds, – Rings, ring systems– Bemis-Murcko frameworks
• Sphere exclusion – Variants… linear scaling
Fast clustering methods
20
Jklustor roadmap
• In the dev pipeline– IJC integration– Spotfire integration– new dynamic viewer
• Planned– disconnected MCS– multiple class members
21
Acknowledgements
• Gábor Imre
• Judit Vaskó-Szedlár
• Péter Vadász