What Makes a Good Structure Activity Landscape?
Transcript of What Makes a Good Structure Activity Landscape?
What Makes a Good Structure Ac2vity Landscape?
Rajarshi Guha NIH Chemical Genomics Center
August 25, 2010 Na=onal ACS Mee=ng, Boston
Outline
• Selec=on with SALI • Predic=ng the landscape • SALI in bulk
Structure Ac2vity Landscapes
• Rugged gorges or rolling hills? – Small structural changes associated with large ac=vity changes represent steep slopes in the landscape
– But tradi=onally, QSAR assumes gentle slopes – Machine learning is not very good for special cases
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
Characterizing the Landscape
• A cliff can be numerically characterized • Structure Ac=vity Landscape Index (SALI)
• Cliffs are characterized by elements of the matrix with very large values
€
SALIi, j =Ai − A j
1− sim(i, j)
Guha, R.; Van Drie, J.H., J. Chem. Inf. Model., 2008, 48, 646–658
Visualizing SALI Values
• The SALI graph – Compounds are nodes
– Nodes i,j are connected if SALI(i,j) > X – Only display connected nodes
!1
!2!3
!4
!5
!6
!7
!8
!9
!10!12
!13
!15
!17
!18
!19
!20 !21!22
!23!24
!25
!27
!28
!29
!30
!31
!34!35
!37
!38
!39
!40
!41 !42
!43
!44
!45
!46
!49
!50
!51
!52
!53
!54
!55
!56!57
!58
!59
!60
!62
!63
!64
!65
!66
!68
!69
!71 !72 !73
!75
!76
What Can We Do With SALI’s?
• SALI characterizes cliffs & non‐cliffs • For a given molecular representa=on, SALI’s gives us an idea of the smoothness of the SAR landscape
• Models try and encode this landscape
• Use the landscape to guide descriptor or model selec=on
Descriptor Space Smoothness
• Edge count of the SALI graph for varying cutoffs • Measures smoothness of the descriptor space
• Can reduce this to a single number (AUC)
SALI Cutoff
Num
ber
of E
dges in S
ALI G
raph
0
100
200
300
400
0.0 0.2 0.4 0.6 0.8 1.0
astemizole
cisaprideE-4031 pimozide
dofetilidesertindole haloperidol droperidolthioridazine
terfenadineverapamildomperidoneloratadinehalofantrine mizolastine
bepridilazimilidechlorpromazinemibefradil
imipraminegranisetrondolasetron perhexiline
amitriptylinediltiazemsparfloxacin
grepafloxacin sildenafilmoxifloxacin
gatifloxacin
astemizole
cisapride E-4031 sertindole pimozide dofetilide haloperidoldroperidol thioridazine domperidone loratadine mizolastine bepridil azimilide mibefradil chlorpromazine imipramine
granisetron dolasetron perhexiline amitriptyline diltiazem sparfloxacin grepafloxacin sildenafil moxifloxacin gatifloxacin
astemizole
grepafloxacin sildenafil moxifloxacin gatifloxacin
Other Examples
• Instead of fingerprints, we use molecular descriptors
• SALI denominator now uses Euclidean distance
• 2D & 3D random descriptor sets – None are really good – Too rough, or – Too flat
SALI Cutoff
Nu
mb
er
of
Ed
ge
s in
SA
LI
Gra
ph
0
100
200
300
400
0.0 0.2 0.4 0.6 0.8 1.0
SALI Cutoff
Nu
mb
er
of
Ed
ge
s in
SA
LI
Gra
ph
0
100
200
300
400
0.0 0.2 0.4 0.6 0.8 1.0
2D
3D
Feature Selec2on Using SALI
• Surprisingly, exhaus=ve search of 66,000 4‐descriptor combina=ons did not yield semi‐smoothly decreasing curves
• Not en=rely clear what type of curve is desirable
Measuring Model Quality
• A QSAR model should easily encode the “rolling hills” • A good model captures the most significant cliffs • Can be formalized as
How many of the edge orderings of a SALI graph does the model predict correctly?
• Define S (X ), represen=ng the number of edges correctly predicted for a SALI network at a threshold X
• Repeat for varying X and obtain the SALI curve
SALI Curves
0.0 0.2 0.4 0.6 0.8 1.0
!1.0
!0.5
0.0
0.5
1.0
X
S(X
)
3!descriptor
5!descriptor
Scrambled 3!descriptor
0.0 0.2 0.4 0.6 0.8 1.0
!1
.0!
0.5
0.0
0.5
1.0
X
S(X
)
SCI = 0.12
Model Search Using the SCI
• We’ve used the SALI to retrospec=vely analyze models
• Can we use SALI to develop models? – Iden=fy a model that captures the cliffs
• Tricky – Cliffs are fundamentally outliers
– Op=mizing for good SALI values implies overfikng – Need to trade‐off between SALI & generalizability
The Objec2ve Func2on
• S0 is a measure of the models ability to summarize the dataset (analogous to RMSE)
• S100 measures the models ability to capture cliffs
• Ideally, the curve starts high and stays high
SALI Cutoff
S(X)
0.6
0.7
0.8
0.9
1.0
0.0 0.2 0.4 0.6 0.8 1.0
S100 S0
€
F =1S100
€
F =1S0
+(S100 − S0)
2
€
F =1SCI
SALI Based Model Selec2on
• Considered the BZR dataset from Sutherland et al
• Iden=fied “best” models using a GA to select from a pool of 2D descriptors
• While SALI based op=miza=on can lead to a “bemer” curve, it doesn’t give the best model
SALI Cutoff
S(X)
-0.5
0.0
0.5
0.0 0.2 0.4 0.6 0.8 1.0
RMSE SCI S(100)
SALI Cutoff
S(X)
-0.5
0.0
0.5
0.00 0.02 0.04 0.06 0.08
RMSE SCI S(100)
Sutherland, J et al, J. Chem. Inf. Comput. Sci., 2003, 43, 1906‐1915
SALI Based Model Selec2on
• 107 aryl azoles as ER‐β agonists • Used a GA and 2D descriptors to iden=fy models
• In this case, a SALI based objec=ve func=on was able to iden=fy the best model
• Interes=ngly, SCI does not seem to perform very well
Malamas, M.S. et al, J Med Chem, 2004, 47, 5021‐5040
SALI Cutoff
S(X)
-0.5
0.0
0.5
0.0 0.2 0.4 0.6 0.8 1.0
RMSE SCI S(0) + D/2
SALI Cutoff
S(X)
-0.5
0.0
0.5
0.00 0.02 0.04 0.06 0.08
RMSE SCI S(0) + D/2
SALI Based Model Selec2on
• The size of the solu=on space explored depends on the SALI objec=ve func=on
RMSE S(100) SCI
0.90
0.95
1.00
1.05
1.10
1.15
Objective Function
RMSE
BZR
1/S(0) + D/2 RMSE SCI
0.55
0.60
0.65
Objective Function
RMSE
ER‐β
Predic2ng the Landscape
• Rather than predic=ng ac=vity directly, we can try to predict the SAR landscape
• Implies that we amempt to directly predict cliffs – Observa=ons are now pairs of molecules
• A more complex problem – Choice of features is trickier – S=ll face the problem of cliffs as outliers – Somewhat similar to predic=ng ac=vity differences
Scheiber et al, StaDsDcal Analysis and Data Mining, 2009, 2, 115‐122
Predic2ng Cliffs
• Dependent variable are pairwise SALI values, calculated using fingerprints
• Independent variables are molecular descriptors – but considered pairwise – Absolute difference of descriptor pairs, or – Geometric mean of descriptor pairs – …
• Develop a model to correlate pairwise descriptors to pairwise SALI values
A Test Case
• We first consider the Cavalli CoMFA dataset of 30 molecules with pIC50’s
• Evaluate topological and physicochemical descriptors
• Developed random forest models – On the original observed values (30 obs)
– On the SALI values (435 observa=ons)
Cavalli, A. et al, J Med Chem, 2002, 45, 3844‐3853
Double Coun2ng Structures?
• The dependent and independent variables both encode structure.
• But premy low correla=ons between individual pairwise descriptors and the SALI values
R2
Perc
ent of Tota
l0
10
20
30
40
50
60
0.00 0.05 0.10 0.15
AbsDiff
0
10
20
30
40
50
60
GeoMean
Model Summaries
• All models explain similar % of variance of their respec=ve datasets
• Using geometric mean as the descriptor aggrega=on func=on seems to perform best
• SALI models are more robust due to larger size of the dataset
Observed pIC50
Pre
dic
ted
pIC
50
4
5
6
7
8
9
4 5 6 7 8 9
!
!
!!!
!
!!
!
!
!!
!
!!
!!
!
!!
!!
!!
!
!
!
!
!
!
Original pIC50 RMSE = 0.97
Observed SALIP
red
icte
d S
AL
I
0
2
4
6
0 2 4 6
!!
!
!
!
!
!! !
!!
!!
!
!!
!!
!!!!
!
!
!
!
!
!
!
!!!!
!!! !!
!
!!
!
!
!!
!
!
! !!
!
!
!!
!
!!
!
!
!!
!!
!!
!
!!
!
!
!
!!
!
!
!
!!
!
!!
!
!
!
!
!!
!! !
!
!
!
!
!
!
!
!!
!
!
!
! !
!
! !
!
!
!
!!!
!!
!
!
!!
!!!!
!
!
!!!
!
!
!
!
!
!!
! !
!
!! !
!
! !
!!
!
!!!!
!!
!
!
!
!
!!
!! !
!!
!
!! !! !!
!
!!
!!!
!
!
!!
!
! !!
!
!
! !!!
!!
!
!!
!
!
!
!
!
!
!
!
!
!!!
!
!!!
!!
!
!!
!!!
!
!
!
!
!
!
!
!!
!!!
!! !
!! !! !
!!!!!
!
!!!
!!
!!!!!
!
!
!
!!
!
!
!
!
!
!!! !!
!!!
!!
!
!
!
!
!!
! !!!
!!
!
!!!
!!
!
!
!
!!
!!
!
!!!!!
!
!
!
!!
!
!!
!! !
! !
!!!
!
!
!!
!
!!
!!!
!!!
!
!
!
!
!
!
!
!
!
!!
!!!
!
!! !
!
!!
!!
!!!
!!
!
!
!
!
!
!!!
!
!
!
!
!
!!!
!!
!
!
!!
!!!!
!
!
!
!!
! !
!!!
!
!!
!
!
!
!
!
!
!!
!
!!!
!
!!!!
!!!
!
!
!!
!
!!!
!!!
!
SALI, AbsDiff RMSE = 1.10
Observed SALI
Pre
dic
ted
SA
LI
0
2
4
6
0 2 4 6
!
!
!
!
!
! !!
!!
!
!!
!
!
!
!!
!
!!
!
!
!
!
!
!
!
!
!
!!!
!!
! !
!
!
!
!
!
!
! !
!
! ! !!
!!
!
!
!
!
!
!!!!!
!
!
!!!! !
!
!
!!
!!
!!!
!
!
!
!!!
!!!!
! !
!
!!!
!
!
!
!!
!! !! !
!
!
!
!
!!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!!!
!!
!!
!
!!
!!
!
!
!
!
!!
!
!
!
!
!
!!!
!!
!
! !!
!
!
!!
!!!!
!!
!
!!
!
!!!
!
!
!
!!!
!!
!
!
!
!
!!
!!!
!
!!!!!!
!
!
!
!
!
!
!
!
!!!!
!
!
! !
!!
!
!
!!
!
!
!
!
!
!!!
!!!!
!
!
!
!!!!!
! !!!!
!!!
!
!!
!!!
!!!
!
!! !
!!!
!!
!
!
!
!
! !
!!!
!
!
! !! !
!
! !!!!!
!! !!! !
! !!!!
!!
!!
!
!!!!
!!
! !!!
!
!
!
!
!!
!!
!
!
!
!!
!
!
!
!
!
!!!
!
!!!
!!!!!
!
! !
!!!
!
!
!
!
!
!!
! !
!!!
!
!! !!!
!
!!!
!!
!
!
!
!
!!
!!!
!!! !
!!!!
!!
!! !
!
!!
!
!
!! !
!
!!
!
!
!!
!
!!!! !
!!
!
! !
!
!!!
!! !!
!
!
!
!
!
SALI, GeoMean RMSE = 1.04
Test Case 2
• Considered the Holloway docking dataset, 32 molecules with pIC50’s and Einter
• Similar strategy as before
• Need to transform SALI values
• Descriptors show minimal correla=on
SALI
Perc
ent of T
ota
l
0
10
20
30
40
50
0 20 40 60 80 100 120
log10 (SALI)
Perc
ent of T
ota
l
0
10
20
30
-1 0 1 2
Holloway, M.K. et al, J Med Chem, 1995, 38, 305‐317
Model Summaries
• The SALI models perform much poorer in terms of % of variance explained
• Descriptor aggrega=on method does not seem to have much effect
• The SALI models appear to perform decently on the cliffs – but misses the most significant
Observed pIC50
Pre
dic
ted
pIC
50
5
6
7
8
9
10
5 6 7 8 9 10
!!!
!
!!
!!!
!
!
!
!!
!!
!
!
!
!! !
!
!
!
!
!
!
!
!
!
!
Original pIC50 RMSE = 1.05
Observed log10(SALI)P
red
icte
d lo
g1
0(S
AL
I)
!1
0
1
2
!1 0 1 2
!
!!
!
!
!!
!
!
!
!
!
!
!
!
! ! !!
!!!!
!
!
!
!!
!!!
!
!
!
!
!!
!
!
!!
!
!!
!
!!
!
!
!! !
!!
!
! !!!
!!
!
!
!!!
!
!
!
!
!!
!!
!!!
!!!!!
!
!
!
!!!
!!
! !!
!
!
!
!
!
!
!
!!
!
!!
!
!! !
!
!
!
!
!!!
!
!
!!!
!
!
!
!
!
!
!!
!!! !!!
!!
!
!
!
!!
!
!
!
!!!
!
!
!
!
!!
!
!
!! !
!!!
!
!
!
!
!!!
!
!!
!
!
!
!
!
!
!
!
!!!
!
!!!!
!
!
!
!!
!
!!
!
!
!
!
!!
!
!
!
!!
!
!!!!
!
!
!
!!
!
!
!
!
!
!
!
!
!!
!
!! !!! !
!
!
!
!
!!!
!
!
!
!
!!
!
! !
!!
!!!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
! !!
!!!
!!
!
!
!
!!
!
!
!
!!!!
!
!!!
!!
!
!
!
!
!
!
!!
!
!
!!
!!
!!
!
!! !!
! !
!
!!
!
!
!!
!
!!
!!
!! !!!
!! !
!! !!
!
!!!
!!!!!
!
!
!
!!
!!!
!
!!!
!!
!!
!
!
!
!!!
!
! !
! !
!!
!
!
!
!
!
!!!
!
!
!
!
!
!
!!!
!
!
!!
!
!
!
! !
!!
!
!
!
!
!
!!
!!
!
!
!!
!
!
!!!
!
!
!!
!
!
!
!
!!
!!
!
!!!
!
!
!
!!!
!
!!
!
!
!
!
!!!
!
!
!
!
!! !
!
!
!
!
!!!
!!
!
!
!!
!
!
!!!
!
! !!
!
!
!
!
!
!
!
!
SALI, AbsDiff RMSE = 0.48
Observed log10(SALI)
Pre
dic
ted
lo
g1
0(S
AL
I)
!1
0
1
2
!1 0 1 2
!
!!
!
!
!
!!
!
!
!
!
!!
!
!
! !
!
!!!!
!
!
!
!
!
!
!! !!!
!
!
!
!
!
!
!
!!!
!
!
!
!!
!!
!! !
!
! !!
!
!
!
!!
!!!!
!
!
!
!
!
!
!
!
!!
!
!! !
!
!
!
!
!!
!
!
!
! !!!!
!
!
!
!!
!!
!
!!
!
!! !
!
!
!
!
!!
!
!
!
!!!
!
!
!
!
!
!!
!!
!!
!
!!
!
!
!
!
!
!!
!
!
!
!!!
!
!
!
!!!
!
!
!! !
!
!
!
!
!
!
!
!!!
!!!
!
!
!
!
!!
!
!
!!!
!
!!!!
!
!
!
!!
!
!!
!
!
!
!
! !
!
!
!
!!
!
!!
!!
!
!
!
!!
!
!!
!
!
!
! !!
!
!
!!
!!!!
!
!
!
!
!!!
!
!
!
!
!!!! !
!
!!
! !
!
!
!
!
!
!
!
!
!
!
!
!!
!
!
!!!
!
!!
!!
!
!
!
!!
!
! !
!!
!
! !
!
!
!
!!
!
!
!
!
!
!!!
!
!
!
!
!
!
!!
!
!! !!
! !
!
!!
!!!!
!
!!
!!!
!!!!!
!!!
! !!
!
!!!
!
!! !
!
!!
!
!
!
!
!!
!
!!!
!!
!
!
!
!
!
!!
!
!
!!!
!
!!!
!
!!
!
!!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!
!! !
!! !
!
!
!!
!!
!
! !
!
!
!
!
!
!
!
!
!
!
! !!
! !
!
!
!
!!
!
!!!
!
!
!
!!!
!
!
! !
!
!
!!!
!
!
!
!
!
!
!!
!
!
!
!!
!
!!
!
!
!
!!
!
!
!!!
!
!
!
!
!
!
!
!
!
!
!
!
SALI, GeoMean RMSE = 0.48
Model Summaries
• With untransformed SALI values, models perform similarly in terms of % of variance explained
• The most significant cliffs correspond to stereoisomers
Observed pIC50
Pre
dic
ted
pIC
50
5
6
7
8
9
10
5 6 7 8 9 10
!!!
!
!!
!!!
!
!
!
!!
!!
!
!
!
!! !
!
!
!
!
!
!
!
!
!
!
Original pIC50 RMSE = 1.05
Observed SALIP
red
icte
d S
AL
I
20
40
60
80
100
20 40 60 80 100
!
!!
!
!
!
!!
!
!
!
!!
!
!
!
!!
!!!
!!!
!
!
!!
!!!
!!
!
!
!
!
!
!
!
!
!!
!
!
!!
!!!!!
! !
!!
!!! !
!!!!!!!
!
!
!
! !!
!! !!
!
!!
!!
!!
!
!!!
!
!!!!!
!
!
!
!
!
!
!!
!
!!
!
!!!!!!
!
!!
!!!!
!
!!
!
!
!
!
!
!!
!
!!
!!!
!!
!
!
!
!!!
!!!!!
!
!
!
!!!
!!!
!!
!
!!!
!!
!
!!!!!
!!
!
!
!
!!
!!
!
!!
!
!!!!
!!
!
!!
!!!
!
!
!
!
!!
!
!
!!!
!
!!
!!
!
!
!
!!
!!!
!
!
!
! !!!! !
!
!
!!!!
!!
!
!!!
!!
!
!
!!
!
!
!
!
!!
!
!
!
!
!
!
!
!
!!
!
!
!
!!
!!
!!!
!
!!!!
!
!
!
!!
!!
!
!
!!!
!
!!!!!
!!
!
!
!
!!!
!
!
!
!!
!!
! !
!!!!
!!
!
!!
!!
!!! !
!!!!!!!!!
!!
!
!!
!!!!!
!
!!!!
!!
!
!!
!!!
! !!
!
!!!!
!
!
!
!!
!!!!
!
!
!
!!
!
!!
!
!!
!
!!
!!
!
!
!
! !
!!
!
!
!
!
!
!
!
!!
!!
!! !
!! !
!
!!
!!
!
!
!!
!
!
!
!!!
! !!
!!
!!
!
!
!! !!
!
!!! !!!
!
!
!!!!
!
!
!
!
!
!!!
!
!
!
!
!!!!!
!
!!!
!
!
!!!
!!
!!
!
!!
!
!
!!!
SALI, AbsDiff RMSE = 9.76
Observed SALI
Pre
dic
ted
SA
LI
20
40
60
80
100
20 40 60 80 100
!
!!!
!
!!!
!
!
!
!
!!
!!!!
!
!!!!
!
!
!
!
!
!!! !
!!!
!
!
!
!
!
!
!
!!!
!
!!
!
!!
!! !
!! !!
!
!
!!!!!!!
!
!
!
!!
!!!!!
!!!
!!
!
!
!
!!
!!
!!!!!!
!
!
!
!!
!!
!!!
!
!!!!
!
!
!
!!
!!!
!!!!
!
!
!
!
!
!!! !!
!
!!!!
!!
!
!!
!!!
!!!
!
!
!
! !!!
!!!!!!!!
!
!
!
!!!!!!
!
!
!
!
!!
!!
!!!
!
!!
!!
!!
!
!!
!!!
!
!
!
!
!!
!!
!!!
!
!!
!!
!
!
!
!!
!!!
!
!
!
!!
!!!
!!
!
!! !!
!
!
!
!!
!!!
!
!
!
!
!
! !
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!!
!!
! !!
!
!!
!!
!
!
!
!
!
!! !
!!
!
! !
!
!
!
!!!! !
!
!
!
!
!
!
!
!
!!!
!!
!
!!
!!!!
!
!!
!!!
!!
!!
!!!!!!!!
!!!
!!
!!!!!
!
!!!!
!!
!
!!
!!!! !!
!
!!!!
!!
!
!!
!!!!
!
!
!!!!!
!
!
!!
!!
!!
!
!!
!! !!
!
!
!
!
!!
!!!! !!!
! !
!
! !!!!
!!
!
!
!!
!
!
!
!
!!! !
!!
!
!! ! !!! !
!!
!!
!!!
! !!
! !!!
!
!
!
!
!
!!!
!
!
!
!!!
!!!!!
!!
!
!
!!!
!!
!!
!
!!
!
!!
!!
SALI, GeoMean RMSE = 10.01
Model Caveats
• Models based on SALI values are dependent on their being an SAR in the original ac=vity data
• Scrambling results for these models are poorer than the original models but aren’t as random as expected
Observed SALI
Pre
dic
ted S
ALI
0
2
4
6
0 2 4 6
SALI in Bulk
• Much of this material is exploratory • So we’re interested in trends across many assays • ChEMBL is an excellent source for ac=vity cliffs • Assay selec=on
– Human target, binding assay – High confidence (score = 9) – Number of compounds between 75 & 300 – Only consider non‐NA ac=vity values – Censored data is considered the same as exact data – 31 assays
• We iden=fy datasets with ac=vity cliffs by the skewness of the dependent variable
SALI in Bulk
• Used pIC50’s and CDK hashed fingerprints • Lots of material to explore here
SALI in Bulk
• But fingerprint quality is important • Assay 379744 has 83 cliffs of infinite height since 83 pairs of molecules have Tc = 1.0
• Probably should simply ignore such “iden=cal” molecules
Conclusions
• SALI is the first step in characterizing the SAR landscape
• Allows us to directly analyze the landscape, as opposed to individual molecules
• Being able to predict the landscape could serve as a useful way to extend an SAR landscape
Acknowledgements
• John Van Drie • Gerry Maggiora
• Mic Lajiness
• Jurgen Bajorath
Job Openings at NCGC/NCTT
• Sovware development (focusing on Tripod) – Java, Swing UI, algorithms
• Research Informa=cs Scien=st – Generalist, cheminforma=cs, comp chem, med chem
• Collaborate with chemists, biologists • Cukng edge problems • Lots of fresh data • Fun!
ER‐β Dataset
• 107 molecules, censored data taken as exact
• A few big cliffs • The best linear model performs decently
pIC50
Frequency
-4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5
05
10
15
20
25
Observed pIC50
Pre
dic
ted
pIC
50
-4
-3
-2
-1
-4 -3 -2 -1
Different Ac2vity Representa2ons
• Using the Hill parameters from a dose‐response curve represents richer data than a single IC50
Concentration
Activity
S0
50%
SInf
AC50
€
S0SinfAC50
H
€
SALIi, j =d(Pi,Pj )1− sim(i, j)
SALI Curves from DRCs
• No difference in major cliffs • Some of the minor cliffs are highlighted using the DRC instead of IC50
Clustering in the Holloway Dataset 17
14
23
25
26
18
9
27
16
19
1
10
32
6
29
8
33
12
30
11
4
22
5
28
2
7
13
3
31
24
20
15
21
0.5
0.6
0.7
0.8
0.9
1.0
hclust (*, "complete")
Height