Prediction Models Based on Classifying Compounds by...
![Page 1: Prediction Models Based on Classifying Compounds by ...cisrg.shef.ac.uk/shef2004/talks/KCross.pdf · Prediction Models Based on Classifying Compounds by Structural Features Chihae](https://reader035.fdocuments.in/reader035/viewer/2022063012/5fc8f59e38dba02c924a67f9/html5/thumbnails/1.jpg)
Prediction Models Based on Classifying Compounds by
Structural Features
Chihae Yang, Kevin Cross, Paul Blower, Glenn Myatt
Leadscope, Inc.
3rd Joint Sheffield Conference on Chemoinformatics
Objectives
1. Illustrate a strategy for building transparent models
2. Build models predicting pIC50 for a published test set and compare results with a published CoMFA study
3. Confirm the importance of macrostructures in a molecular descriptor set for predicting activity
Issues to Consider When Building Predictive Models
• Data
  – Size and distribution
  – Quality
  – Availability
• Structure
  – Diversity
  – Mechanistic complexity
• Interpretation
  – Chemical transparency
Structure-Data Surface
[Figure: structure-data surface. Axes: structural diversity (− to +), mechanistic complexity (− to +), data size (− to +). Data distribution (active/inactive) is the fourth factor.]
Assessment of 4 Factors in the Structure-Data Surface
• Local models
  – Structurally similar
  – Mechanistically homogeneous
  – Reasonable size
  – Data distribution
• Global models
  – Structurally dissimilar
  – Mechanistically complex
  – Large number of data points
  – Data distribution: balance of actives and inactives
Example of Local Dataset: PTP-1B Inhibitors
• Protein Tyrosine Phosphatase 1B (PTP-1B) is a therapeutic target for the treatment of diabetes, obesity, and cancer
• Dataset of 118 compounds from a literature study [1]
• SAR analysis identified two active classes
• Comparison with a literature CoMFA study [2]
[1] Malamas, M.S., et al.; J. Med. Chem. 2000, 43, 1293-1310 (Wyeth-Ayerst)
[2] Murthy, V.S., et al.; Bioorganic & Medicinal Chemistry 2002, 10, 2267-2282
Modeling Strategy
1. Diagnose the data set
2. Assemble discriminating macrostructures
3. Select descriptors – features and properties
4. Build predictive models
5. Evaluate the model with chemical inference
6. Rebuild the model with a refined feature set
1. Diagnosis of the PTP1B Dataset
• Dataset was the 118 structures published by Murthy
• The dataset was partitioned into training and test sets as published
  – 92-compound training set
  – 26-compound test set
• The 26-compound test set contains a higher proportion of actives
1. Diagnosis of the PTP1B Dataset
• Assessing similarity of training and test sets
  – Chemical space of the test set must lie within that of the training set
  1. Grouping by chemical class
  2. Diversity analysis
  3. Similarity by Sammon map (feature based)
  4. Feature similarity within the test set and between the test and training sets
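The feature-similarity check in item 4 can be sketched as follows. This is a minimal illustration, not the method used in the talk: it assumes each compound is represented as a set of structural-feature IDs and uses Tanimoto similarity, and the feature sets and the 0.5 cutoff are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity between two sets of feature IDs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(test, train):
    """For each test compound, the highest similarity to any training compound."""
    return [max(tanimoto(t, r) for r in train) for t in test]


# Hypothetical feature sets; integers stand in for structural-feature IDs.
train = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
test = [{1, 2, 3, 4}, {7, 8}]

sims = nearest_training_similarity(test, train)

# Test compounds whose nearest training neighbour is below an arbitrary
# cutoff are candidates for lying outside the training chemical space.
outside = [i for i, s in enumerate(sims) if s < 0.5]
```

A test compound with a low nearest-neighbour similarity is one whose prediction should be treated with caution, which is the point of requiring the test set to lie within the training set.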
Structural Diversity – 92 chemicals
[Figure: hierarchical clustering plot with 20 numbered clusters on the horizontal axis and distance (0.5 to 0.8) on the vertical axis; structure drawings not recoverable.]
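Chemical class groupings in the data

| Major classes | 92-training set (% frequency) | 26-test set (% frequency) | 19-unknown set (% frequency) |
|---|---|---|---|
| benzofuran | 53 | 61 | 63 |
| benzothiophene | 32 | 35 | 5.3 |
| oxazole | 7.6 | 3.8 | 21 |
| pyridine | 2.2 | 0 | 5.3 |
| tetrazole | 3.3 | 3.8 | 0 |
| (structure drawing) | 10.9 | 19.2 | 0 |
| (structure drawing) | 10.9 | 7.7 | 10.5 |
| (structure drawing) | 63 | 62 | 47 |
| (structure drawing) | 15 | 7.7 | 37 |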
Diversity Analysis of PTP1B through Multiple Subset Extraction
[Figure: Structural Diversity of PTP1B. Coverage (percent, 0.00 to 1.00) plotted against subset size (percent, 0.00 to 0.20).]
Similarity of Training and Test Sets
118 benzothiophenes and furans
– 92 training set
– 26 test set
– 19 test set (benzimidazoles, oxazoles, etc.)
• 26-test set correlation: 97%
• 19-test set correlation: 68%
[Figure: three-dimensional map of the 92C, 26C, and 19C sets; axis ticks run from about -1 to 1.]
Characterization of PTP-1B Inhibitors
• Activity distribution is not Gaussian
• Good balance between actives and inactives
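• Data size is small
• Mechanistically homogeneous
• Structurally similar
[Figure: position of the dataset on the structure-data surface (axes: structural diversity, mechanistic complexity, data size, each from − to +).]
[Figure: pIC50 distribution for the 92-train and 26-test sets. Horizontal axis: pIC50 (0.0 to 1.5); vertical axis: probability (0.00 to 0.25); active and inactive regions marked.]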
2. Assembly of Macrostructures
• 509 structural features describing the 92 training compounds were automatically extracted from over 27,000 features
• 71 macrostructures with discriminating activity were assembled
• 580 total features plus 8 properties were available for modeling
Advantages of Macro-Structure Assembly
• MSAs are actual substructures that are easily interpreted
• Connectivity of individual features is explicitly represented
• The assembly process is supervised by the selected biological response
• MSAs are chemically relevant
• Macrostructure assembly is computationally feasible for “larger” structure sets
Macrostructure Assemblies
[Figure: ten macrostructure assemblies (structure drawings not recoverable), each annotated with the statistics below.]

| MSA | t | PLS wt | Mean (92) | Mean (26) |
|---|---|---|---|---|
| MSA 1 | 14.7 | 0.076 | 1.57 | 1.32 |
| MSA 2 | 13.9 | 0.12 | 1.52 | 1.57 |
| MSA 3 | 9.7 | 0.063 | 1.30 | 1.34 |
| MSA 4 | 9.3 | 0.050 | 1.34 | 1.29 |
| MSA 6 | 7.9 | 0.052 | 1.32 | 1.28 |
| MSA 10 | 5.4 | 0.10 | 0.75 | 0.97 |
| MSA 13 | 4.9 | 0.090 | 0.70 | 0.97 |
| MSA 19 | 3.7 | 0.056 | 0.82 | 0.86 |
| MSA 26 | -5.9 | 0.048 | 0.0023 | 0.45 |
| MSA 28 | -6.4 | 0.096 | 0.16 | 0.13 |
3. Pre-selection of Structural Features
• Test significance - T2 test
• Of 150 influential features, 41 were MSAs
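The slide names the significance test without giving its form. As a rough illustration of how a structural feature can be scored against activity, the sketch below assumes a two-sample (Welch) t-statistic comparing the pIC50 of compounds that contain the feature with those that do not. The function name and the data are hypothetical, and this may differ from the test actually used.

```python
from math import sqrt
from statistics import mean, variance


def feature_t_statistic(activities, has_feature):
    """Welch t-statistic: activity of compounds with a feature vs. without it."""
    with_f = [y for y, f in zip(activities, has_feature) if f]
    without_f = [y for y, f in zip(activities, has_feature) if not f]
    if len(with_f) < 2 or len(without_f) < 2:
        raise ValueError("need at least two compounds in each group")
    # Standard error of the difference in means (unequal variances allowed).
    se = sqrt(variance(with_f) / len(with_f)
              + variance(without_f) / len(without_f))
    return (mean(with_f) - mean(without_f)) / se


# Hypothetical pIC50 values and presence flags for one feature.
pic50 = [1.5, 1.7, 1.6, 0.4, 0.6, 0.5]
present = [1, 1, 1, 0, 0, 0]
t_value = feature_t_statistic(pic50, present)
```

A large positive value marks a feature associated with higher activity and a large negative value one associated with lower activity, so ranking features by the absolute value is one way to pre-select the most influential ones.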
Property Descriptors Available
• aLogP
• Polar surface area
• Hydrogen bond donors and acceptors
• Molecular weight
• Rotatable bonds
• Lipinski scores
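The talk does not say how the Lipinski score is computed. One common reading, assumed here, is a count of rule-of-five violations derived from the other properties in the list. The function name and the example values are hypothetical.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count rule-of-five violations from four precomputed properties."""
    rules_broken = [
        mol_weight > 500,   # molecular weight above 500
        logp > 5,           # logP above 5
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ]
    return sum(rules_broken)


# Hypothetical property values for two compounds.
small = lipinski_violations(mol_weight=350.4, logp=3.2, h_donors=2, h_acceptors=5)
large = lipinski_violations(mol_weight=612.0, logp=6.1, h_donors=2, h_acceptors=11)
```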
4. Model Building
• Multivariate Least Squares
• Principal Component Regression
• k-nearest neighbors
• Partial Least Squares
• Neural Networks
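Of the methods listed, partial least squares is the one used for the results that follow. The sketch below is a minimal single-response PLS (the NIPALS procedure) in plain Python, written only to show the mechanics. It is not the implementation used in the talk, and the binary feature matrix and activities are made up.

```python
from math import sqrt


def fit_pls1(X, y, n_factors):
    """Fit a single-response PLS model by NIPALS; returns means and factors."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    factors = []
    for _ in range(n_factors):
        # Weight vector: direction in X with maximal covariance with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left in y to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next factor.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        factors.append((w, load, q))
    return x_mean, y_mean, factors


def predict_pls1(model, x):
    """Predict the response for one descriptor vector."""
    x_mean, y_mean, factors = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in factors:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * lj for ej, lj in zip(e, load)]
    return y_hat


# Hypothetical data: three binary structural features, five compounds.
X = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
y = [0.5 + 1.0 * a - 0.4 * b + 0.2 * c for a, b, c in X]
model = fit_pls1(X, y, n_factors=3)
new_prediction = predict_pls1(model, [1, 1, 1])
```

With as many factors as independent descriptors the fit reproduces ordinary least squares. Using fewer factors than descriptors is what makes PLS usable when, as here, the descriptors outnumber the compounds.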
Cross-validation of the training set
[Figure: three panels (Q2, mean R2, Scv), each plotted against the number of PLS factors and the number of pre-selected features.]

$$S_{cv} = \sqrt{\frac{\mathrm{PRESS}}{92 - 1 - \#\,\text{PLS factors}}}$$
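The cross-validation statistics on this slide can be computed as sketched below. The sketch assumes the conventional definitions: PRESS is the sum of squared leave-one-out prediction errors, Q2 = 1 - PRESS / SS_total about the overall mean, and Scv = sqrt(PRESS / (n - 1 - number of PLS factors)). The `fit` and `predict` callables and the mean-only model are hypothetical stand-ins for a real PLS model.

```python
from math import sqrt


def leave_one_out(X, y, fit, predict):
    """Predict each compound from a model fitted to all the others."""
    predictions = []
    for i in range(len(y)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        predictions.append(predict(model, X[i]))
    return predictions


def press(actual, predicted):
    """Predictive residual sum of squares."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def q_squared(actual, cv_predicted):
    """Cross-validated R2, relative to the overall mean of the response."""
    mean_y = sum(actual) / len(actual)
    ss_total = sum((a - mean_y) ** 2 for a in actual)
    return 1.0 - press(actual, cv_predicted) / ss_total


def s_cv(actual, cv_predicted, n_factors):
    """Cross-validated standard error with n - 1 - n_factors degrees of freedom."""
    return sqrt(press(actual, cv_predicted) / (len(actual) - 1 - n_factors))


# Stand-in model that ignores the descriptors and predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, x):
    return model


X = [[0], [1], [2], [3]]
y = [1.0, 2.0, 3.0, 4.0]
cv_pred = leave_one_out(X, y, fit_mean, predict_mean)
q2 = q_squared(y, cv_pred)
scv = s_cv(y, cv_pred, n_factors=0)
```

The mean-only model gives a negative Q2, which is the expected baseline: a model is only useful if its leave-one-out predictions beat the mean.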
Parameter Optimization for Cross Validation
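| Predictor types | # pre-selected structural features (G) | PLS factors (F) | Training set R2 | Training set RMSE | Leave-one-out CV Q2 | Leave-one-out CV RMSE |
|---|---|---|---|---|---|---|
| All: base features + MSA + 8 properties | 580 (all used) | 3 | 0.77 | 0.27 | 0.57 | 0.37 |
| | 200 | 5 | 0.83 | 0.23 | 0.68 | 0.31 |
| | 150 | 4 | 0.80 | 0.24 | 0.70 | 0.32 |
| | 150 | 12 | 0.89 | 0.18 | 0.62 | 0.34 |
| | 150 | 20 | 0.93 | 0.15 | 0.57 | 0.39 |
| | 100 | 6 | 0.78 | 0.26 | 0.66 | 0.32 |
| | 100 | 19 | 0.91 | 0.17 | 0.71 | 0.31 |
| | 50 | 12 | 0.83 | 0.23 | 0.71 | 0.30 |
| | 50 | 5 | 0.76 | 0.28 | 0.68 | 0.31 |
| Base features only | 150 | 4 | 0.71 | 0.30 | 0.56 | 0.37 |
| Base features + 8 properties | 150 | 4 | 0.76 | 0.27 | 0.62 | 0.34 |
| MSA only | 71 (all MSA) | 5 | 0.80 | 0.25 | 0.61 | 0.35 |
| MSA + 8 properties | 71 (all MSA) | 5 | 0.84 | 0.22 | 0.68 | 0.32 |
| 8 properties alone | 8 | 1 | 0.48 | 0.40 | 0.47 | 0.40 |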
Predictive Power of Molecular Descriptors
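| Descriptors | G | F | Training set R2 | Training set RMSE | Training set Q2 | 26 test set Q2 | 26 test set RMSE |
|---|---|---|---|---|---|---|---|
| All descriptors: basic features + MSAs + 8 properties | 150 | 4 | 0.80 | 0.24 | 0.70 | 0.72 | 0.32 |
| Basic features only | 150 | 4 | 0.71 | 0.30 | 0.56 | 0.59 | 0.38 |
| Basic features + MSAs | 150 | 4 | 0.78 | 0.26 | 0.67 | 0.68 | 0.33 |
| Basic features + 8 properties | 150 | 4 | 0.76 | 0.27 | 0.62 | 0.69 | 0.33 |
| MSAs only | 71 | 5 | 0.79 | 0.25 | 0.61 | 0.64 | 0.36 |
| MSAs + 8 properties | 71 | 5 | 0.83 | 0.22 | 0.68 | 0.65 | 0.37 |
| 8 properties only | 8 | 1 | 0.48 | 0.40 | 0.47 | 0.66 | 0.37 |

CoMFA model A*: 0.72, 0.51 (values in slide order; their columns are not recoverable from the transcript)
CoMFA model B (w/ aLogP)*: 0.84, 0.75 (values in slide order; their columns are not recoverable from the transcript)
G = number of pre-selected features; F = number of PLS factors used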
Comparison of actual and predicted pIC50 values for 92-training and 26-test sets.
[Figure: two scatter plots of predicted pIC50 against actual pIC50 (both axes -0.5 to 2.0), one for the 92-training set and one for the 26-test set. Labelled test-set compounds: 61, 62, 63, 66, 68, 74, 84, 85, 103, 106, 113, 119, 125, 130, 136, 141, 145, 148, 159, 160, 171, 177, 179, 180, 182, 183.]
5. Chemical Evaluation of Model
| Role | Compound | pIC50 |
|---|---|---|
| Test compound | Test-177 | 1.56 |
| Similar compound in training set | 175 | 0.98 |
| Similar compound in training set | 174 | 1.12 |
| Similar compound in training set | 176 | 1.41 |

[Structure drawings not recoverable.]
6. Refining the Model by Chemical Inference
[Figure: three additional features for test-177 (structure drawings not recoverable).]

| Feature | Mean (92) | Mean (26) |
|---|---|---|
| New Feature 1 | 1.07 | 1.53 |
| New Feature 2 | 1.22 | 1.53 |
| New Feature 3 | 1.28 | 1.59 |

• The additional features increased the predicted pIC50 of 117 from 0.886 to 0.947 (1.54 experimental) without reducing goodness of fit.
Building a Classification Model
• Actives and inactives defined as above and below the mean pIC50
• 50 new MSAs from binary response data
• 8 QSAR properties still used
• MSAs + properties provided the best balance
• Partial Logistic Regression model: PLS of binary response data followed by logistic regression
• More accurate results
  – Two false negatives (with probabilities near 0.5)
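The first and last points above can be illustrated with a short sketch: labelling compounds as active or inactive relative to the mean pIC50, and turning a model score into a probability so that borderline calls near 0.5 can be flagged. The intercept, slope, margin, and data are hypothetical; in the approach described, the logistic coefficients would be fitted to the PLS scores.

```python
from math import exp


def label_actives(pic50_values):
    """Label a compound 1 (active) if its pIC50 is above the mean, else 0."""
    cutoff = sum(pic50_values) / len(pic50_values)
    return [1 if v > cutoff else 0 for v in pic50_values]


def logistic(score, intercept=0.0, slope=1.0):
    """Map a model score to a probability of being active."""
    return 1.0 / (1.0 + exp(-(intercept + slope * score)))


def borderline(probabilities, margin=0.1):
    """Indices of compounds whose probability lies within margin of 0.5."""
    return [i for i, p in enumerate(probabilities) if abs(p - 0.5) <= margin]


# Hypothetical activities and model scores.
labels = label_actives([0.2, 0.8, 1.4, 1.6, 0.5])
probs = [logistic(s) for s in [-2.0, -0.2, 0.1, 3.0]]
uncertain = borderline(probs)
```

Flagging the borderline cases separately is useful because a misclassification at a probability close to 0.5 says less about the model than a confident wrong call.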
Classification model results
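| Descriptors | Factors | Training % concordance | Training % sensitivity | Training % specificity | Cross validation % concordance | Cross validation % sensitivity | Cross validation % specificity |
|---|---|---|---|---|---|---|---|
| All 150 feats | 2 | 84.8 | 81.4 | 87.8 | 82.6 | 81.4 | 83.7 |
| | 8 | 94.6 | 95.9 | 93.0 | 75.0 | 65.1 | 83.7 |
| All 50 MSAs | 2 | 83.7 | 81.4 | 85.7 | 82.6 | 79.1 | 85.7 |
| | 8 | 91.9 | 88.4 | 93.9 | 79.3 | 74.4 | 83.7 |
| MSA + props | 2 | 89.1 | 91.8 | 86 | 87.0 | 81.4 | 91.8 |
| | 9 | 95.7 | 95.3 | 95.9 | 83.7 | 81.4 | 85.7 |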
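For reference, the three rates reported here follow from the confusion counts as sketched below, assuming the usual definitions: sensitivity is the share of actives predicted active, specificity the share of inactives predicted inactive, and concordance the overall share of correct calls. The labels in the example are made up.

```python
def classification_rates(actual, predicted):
    """Percent sensitivity, specificity, and concordance for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # actives found
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # actives missed
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # inactives found
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # inactives missed
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "concordance": 100.0 * (tp + tn) / len(pairs),
    }


# Hypothetical labels: 4 actives and 6 inactives.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
rates = classification_rates(actual, predicted)
```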
Conclusion
• 2D PLS models performed as well as 3D QSAR models
• The 2D models are intuitive due to chemical structure descriptors and explain model strengths and weaknesses
• These transparent models provide insight for refinement using additional features
• Macrostructure assemblies provide an intuitive means to reduce high dimensionality and improve the ability to perform chemical inference
• Chemical inference enables efficient evaluation of hypotheses in the design of structures