Copyright © 2000 Chemical Computing Group Inc. All Rights Reserved.
CHEMICAL COMPUTING GROUP INC.

A Probabilistic Approach to High Throughput Drug Discovery
Introduction and Motivation
Probability Modeling in Drug Discovery
Representation of Chemical Structures (Descriptors)
Focused Combinatorial Library Design
Summary and Outlook

High Throughput Screening
Large-scale automation of biological assays (HTS)
  Use robotics to perform 10,000 to 100,000 screens per day
  Brute-force approach to drug discovery: “rapidly screen all compounds”
Noteworthy drawbacks to HTS:
  Economics: $1-$5 per assay (provided large collections are assayed)
  Logistics: compound formatting, inventory systems and other overhead
  Precision Loss: effective “binary” measurement: active/inactive (pass/fail)
  High Error Rate: assay, synthesis failure, sample degradation, registration
Resulting effects:
  Quality for quantity tradeoff - lots of low quality data
  High level of noise (error) in data makes interpretation very difficult
HTS has gained acceptance and is routinely used to generate lead compounds for drug discovery projects

Sources of Compounds for HTS
Initial screening libraries (first libraries used in project)
  Historical “in-house” collection of compounds augmented with compounds purchased from external suppliers
  1 million+ compounds available means initial screening library must be designed (diversity retained using fewer compounds)
  Receptor biased initial screening libraries are a possibility
Follow-up libraries
  Parallel synthesis / combinatorial chemistry is an excellent source of large numbers of (new) compounds
  Synthesis of “all” analogs around a lead structure exhibits poor diversity but is very good for “local” exploration and lead follow-up
External screening compound purchasing and in-house combinatorial chemistry efforts have gained acceptance and are routinely used in lead generation and follow-up

High Throughput Discovery Cycle
Brute-force HTS not practical
  At least 10 trillion stable drug candidates
  At 1 billion screens per day, >27 years are needed to screen all 10 trillion
A discovery cycle can be used to reduce total screens
  Use HTS data to affect the selection of compounds to screen next
  Scale-up of the traditional experimental discovery cycle
[Figure: discovery cycle linking HTS Bioassay, HTS Data Analysis, Focused Library Design and Parallel Synthesis]
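
The ">27 years" figure on this slide can be checked with a few lines of arithmetic. This is only a sketch of the slide's own numbers; the 365.25-day year is an assumption.

```python
# Brute-force screening time using the figures quoted on the slide.
candidates = 10 * 10**12       # "at least 10 trillion" stable drug candidates
screens_per_day = 10**9        # 1 billion screens per day (the slide's hypothetical rate)

days = candidates / screens_per_day
years = days / 365.25          # assumed average year length

print(f"{days:,.0f} days = {years:.1f} years")
```

Even at a rate far beyond the 10,000 to 100,000 screens per day quoted for HTS robotics, exhaustive screening stays out of reach, which is the motivation for the discovery cycle.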

Required Technology for HTD Cycle
High Throughput Screening facility
Parallel synthesis and combinatorial chemistry capabilities
Methodology for automatically analyzing HTS data
  Humans find it difficult to interpret large amounts of noisy data
  Automatic HTS QSAR technology necessary for HTD cycle
Methodology for designing focused combinatorial libraries
  HTS QSAR results are used to bias a combinatorial library towards activity
  ADME properties and other design criteria should be taken into account
Meaningful representation of compounds
  Collection of molecular descriptors meaningful across projects (avoid time-consuming variable selection procedures)
  Definition of a “chemistry space” for diversity studies (design of initial screening libraries)
Probability Modeling in Drug Discovery

Probabilistic Formalism (Bayesian Inference)
Step 1: Write all observables as a joint probability density; e.g., Pr (A,B,C)
Step 2: Decompose density using probability theory and Bayes’ theorem until components are measurable; e.g.,
  Pr (A,B,C) = Pr (B | A,C) Pr (C | A) Pr (A)
Step 3: Model each component in product from a database or experimental data set
Step 4: Make predictions or estimates using computed model of Pr(A,B,C)
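
As a sanity check on Step 2, the decomposition Pr (A,B,C) = Pr (B | A,C) Pr (C | A) Pr (A) can be verified numerically on a small joint table. The sketch below uses an arbitrary toy distribution over three binary observables; the weights and the helper names `marginal` and `decomposed` are illustrative assumptions, not part of the talk.

```python
from itertools import product

# Toy joint distribution over binary (A, B, C); the weights are arbitrary.
weights = {abc: 1 + i for i, abc in enumerate(product((0, 1), repeat=3))}
total = sum(weights.values())
joint = {abc: w / total for abc, w in weights.items()}

def marginal(**fixed):
    """Sum the joint over every outcome that matches the fixed values."""
    return sum(
        p for (a, b, c), p in joint.items()
        if all({"a": a, "b": b, "c": c}[name] == value for name, value in fixed.items())
    )

def decomposed(a, b, c):
    """Rebuild Pr(A,B,C) as Pr(B | A,C) * Pr(C | A) * Pr(A)."""
    pr_a = marginal(a=a)
    pr_c_given_a = marginal(a=a, c=c) / pr_a
    pr_b_given_ac = marginal(a=a, b=b, c=c) / marginal(a=a, c=c)
    return pr_b_given_ac * pr_c_given_a * pr_a

# Largest difference between the joint and its decomposition.
max_gap = max(abs(joint[abc] - decomposed(*abc)) for abc in joint)
```

The decomposition is exact; approximation only enters in Step 3, when each factor is replaced by a model estimated from data.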

Probabilities in Speech Recognition
Successful speech recognizers select (predict) an output word sequence from an input waveform by maximizing the joint likelihood Pr (WAVE, WORDS)
  This is used (in part) to solve the isophonetic word sequence problem; e.g., “imadam” can be “I’m Adam” or “I’m a Dam” or “eye mad am”
Pr (WAVE, WORDS) = Pr (WAVE | WORDS) Pr (WORDS)
  Pr(WORDS) is the prior probability of a word sequence (utterance)
  Pr(WAVE | WORDS) is used to score the waveform under the assumption or hypothesis that the word sequence is WORDS
  Build model of Pr(WORDS) by training on, say, 500,000,000 words of newspaper text (the prior knowledge)
Pr(WORDS) effectively depresses importance of unlikely utterances in favor of more plausible statements (real phrases)
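
A small illustration of how the prior settles the “imadam” example: the candidate with the best acoustic score need not win once Pr(WORDS) is multiplied in. All numbers below are invented for illustration; they are not measurements from any recognizer.

```python
# Hypothetical scores for three transcriptions of the same waveform.
acoustic = {"I'm Adam": 0.30, "I'm a dam": 0.32, "eye mad am": 0.38}   # Pr(WAVE | WORDS)
prior = {"I'm Adam": 1e-6, "I'm a dam": 1e-8, "eye mad am": 1e-11}     # Pr(WORDS)

# Joint likelihood Pr(WAVE, WORDS) = Pr(WAVE | WORDS) * Pr(WORDS).
joint = {words: acoustic[words] * prior[words] for words in acoustic}
best = max(joint, key=joint.get)
```

Here "eye mad am" has the highest acoustic score, yet the joint likelihood selects "I'm Adam" because the prior makes the implausible phrase negligible.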

Probabilities in Drug Discovery
Notation: Y = active (0/1), D = drugable (0/1), S = structure
Decompose:
  Pr (Y, D, S) = Pr (D | Y, S) Pr (Y | S) Pr (S)
  Pr (D | Y, S): drugable given active structure (approximated by “is drug-like” efforts)
  Pr (Y | S): activity assuming structure (probabilistic QSAR efforts)
Product of probabilities balances competing goals
  Classification alone (e.g., RP) is not enough: weighted outcomes needed
  Methodology similar to “soft” classification problems or fuzzy logic
  Any method of probability modeling is valid (e.g., histogram, analytic)
Approximations introduced can be clearly identified
  e.g., Pr (D | Y, S) ≈ Pr (D | S): drugability is independent of activity (!?)

Pr(Y|X) via Binary QSAR
If Y is “binary activity” and X is a descriptor vector then, by Bayes' theorem,
  $$\Pr(Y=1 \mid X=x) = \left[ 1 + \frac{\Pr(X=x \mid Y=0)\,\Pr(Y=0)}{\Pr(X=x \mid Y=1)\,\Pr(Y=1)} \right]^{-1}$$
Pathology of Binary QSAR is reasonable
  If new structure is outside the training set then Pr(Y=1), the hit rate, is used to make predictions (no other information available)
[Figure: training compounds X1 ... Xk (active) and Xk+1 ... Xn (inactive) give Pr(Y), Pr(X|Y) and Pr(X), which Bayes' theorem combines into Pr(Y|X)]
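
The Bayes' theorem expression for Pr(Y=1 | X=x) reduces to a one-line function of the two class likelihoods and the hit rate. The sketch below is illustrative; the function name `posterior_active` and the numbers are assumptions, and it presumes a non-zero likelihood under the active class.

```python
def posterior_active(lik_active, lik_inactive, hit_rate):
    """Pr(Y=1 | X=x) from Pr(X=x | Y=1), Pr(X=x | Y=0) and the prior Pr(Y=1)."""
    ratio = (lik_inactive * (1.0 - hit_rate)) / (lik_active * hit_rate)
    return 1.0 / (1.0 + ratio)

# Equal likelihoods carry no information, so the prediction is the hit rate.
p_equal = posterior_active(0.2, 0.2, hit_rate=0.01)

# A descriptor value 100 times more likely among actives lifts a 1% prior to ~50%.
p_enriched = posterior_active(0.5, 0.005, hit_rate=0.01)
```

The first case mirrors the behaviour described on the slide: when the descriptors do not separate the classes, the model falls back on Pr(Y=1).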

Distribution Estimates
Four distributions in formula are of two types
  Pr(Y=0), Pr(Y=1): prior probability of inactive/active
  Pr(X=x|Y=0), Pr(X=x|Y=1): probability of ligand assuming inactive/active
Modeling assumption: independent ≈ uncorrelated!
  Decompose multi-dimensional distribution into a product
  $$\Pr(X=x \mid Y=y) \approx \prod_i \Pr(X_i = x_i \mid Y=y)$$
  Estimate 2n+2 distributions instead of original four
Binary QSAR Algorithm
  Compute descriptor vectors d_i
  De-correlate descriptors: x_i = Q(d_i - u)
  Estimate distributions from {x_i, y_i}: Pr (X = x | Y = y)
  Assemble p(x) ≈ Pr (Y = 1 | X = x)
  Predict for new descriptors d: p(Q(d - u))
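
The algorithm can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MOE implementation: descriptors are taken to be already de-correlated (the x = Q(d - u) step is assumed done upstream), each one-dimensional distribution is a fixed-width histogram, and add-one smoothing is a substitution chosen here to keep empty bins from zeroing the product. The class name `BinaryQsarSketch` is hypothetical.

```python
import math
from collections import Counter

class BinaryQsarSketch:
    """Product-of-histograms estimate of Pr(Y=1 | X=x) for de-correlated descriptors."""

    def __init__(self, bin_width=1.0):
        self.bin_width = bin_width

    def _bin(self, value):
        return math.floor(value / self.bin_width)

    def fit(self, xs, ys):
        n_dims = len(xs[0])
        self.n = {y: sum(1 for v in ys if v == y) for y in (0, 1)}
        self.prior = {y: self.n[y] / len(ys) for y in (0, 1)}       # 2 priors
        # One histogram per (class, dimension): the 2n one-dimensional distributions.
        self.hist = {(y, j): Counter() for y in (0, 1) for j in range(n_dims)}
        for x, y in zip(xs, ys):
            for j, value in enumerate(x):
                self.hist[(y, j)][self._bin(value)] += 1
        # Number of occupied bins per dimension, used by the smoothing term.
        self.n_bins = [len(set(self.hist[(0, j)]) | set(self.hist[(1, j)]))
                       for j in range(n_dims)]
        return self

    def _likelihood(self, x, y):
        # Pr(X=x | Y=y) approximated as a product over descriptors, with
        # add-one smoothing (one extra slot reserved for unseen bins).
        like = 1.0
        for j, value in enumerate(x):
            count = self.hist[(y, j)][self._bin(value)]
            like *= (count + 1) / (self.n[y] + self.n_bins[j] + 1)
        return like

    def predict(self, x):
        active = self._likelihood(x, 1) * self.prior[1]
        inactive = self._likelihood(x, 0) * self.prior[0]
        return active / (active + inactive)

# Toy training set: inactives cluster near (0, 0), actives near (5, 5).
xs = [(0.2, 0.1), (0.4, 0.7), (0.9, 0.3), (0.5, 0.5), (5.1, 5.3), (5.6, 5.2)]
ys = [0, 0, 0, 0, 1, 1]
model = BinaryQsarSketch(bin_width=1.0).fit(xs, ys)
p_near_actives = model.predict((5.4, 5.5))
p_near_inactives = model.predict((0.3, 0.3))
```

With n descriptors the model stores 2n histograms plus the two priors, which is the "2n+2 distributions" count on the slide.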

Experience with Binary QSAR
Fundamental methodology publication (robustness study)
  Biocomputing: Proceedings of the 1999 Pacific Symposium, World Scientific Publishing, Singapore, 1999
Example literature data sets (non-HTS data)
  Estrogen receptor (Gao et al.; J. Chem. Info. Comput. Sci., 1999, 36)
  O-acyltransferase (ACAT) (Labute et al.; in press)
Example industrial data sets (HTS assays)
  ArQule: 24,000 cpds., ~200 active, 93% on inactives, 60% on actives
  Pharmacopeia: 24,000 cpds., >90% on inactives, >90% on actives
  SmithKline Beecham: 80,000 cpds., ~100 active, 90% on actives
Best success story: Pharmacia & Upjohn
  Binary QSAR model used to select building blocks in combi-chem library
  Improved activity from µM to nM (factor of 1000)

Combined Design Model for HTD Cycle
Use Binary QSAR method twice, once for activity model and once for drugability model
  Train drugability model Pr (D | X) on WDI/ACD for drug-like/non-drug-like or on specific data sets (e.g., blood-brain barrier permeability)
Complete model of activity and drugability is the product Pr(D | X) Pr(Y | X), which approximates Pr(D, Y | S)
[Figure: workflow diagram with boxes BioAssay, HTS Data, Binary QSAR, Activity Model, Drugability Data (e.g., BBB or drug-like), Binary QSAR, ADME Model, Design Model, Library Design and Combinatorial Library]
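
To see how the product Pr(D | X) Pr(Y | X) balances the two goals, consider three hypothetical compounds scored by an activity model and a drugability model. The compound names and probabilities below are invented for illustration.

```python
# Hypothetical model outputs for three virtual compounds.
compounds = {
    "cpd_A": {"pr_active": 0.90, "pr_drugable": 0.10},
    "cpd_B": {"pr_active": 0.60, "pr_drugable": 0.70},
    "cpd_C": {"pr_active": 0.20, "pr_drugable": 0.95},
}

# Combined design score: Pr(D | X) * Pr(Y | X).
design_score = {
    name: p["pr_active"] * p["pr_drugable"] for name, p in compounds.items()
}
ranked = sorted(design_score, key=design_score.get, reverse=True)
```

The most active compound (cpd_A) ranks last because it is unlikely to be drugable, while the balanced compound (cpd_B) ranks first.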
Representation of Chemical Structures (Descriptors)

A Brief History of QSAR
Original philosophy (Hansch & Leo): Use a fixed set of meaningful molecular properties to describe a wide variety of biological phenomena
Linear regression used to determine SAR
  The determination of linear relationships is basic science
  Statistical regression framework used to assess significance of SAR
Proliferation of descriptors
  Early successes led to introduction of a vast array of descriptors
  In principle, any number calculable from a chemical structure can be used as a molecular descriptor for SAR determination
Over-determination of SAR
  Multitude of descriptors leads to need for schemes for variable elimination
  3D methods treat each grid-point in field representation as a descriptor

Fundamental Notions
Use a fixed set of descriptors for diversity and QSAR/QSPR
  A meaningful chemistry space should not require customization
  In QSAR/QSPR automatic variable selection can be dangerous
  Make direct use of Hansch & Leo thinking (build on their experience)
Model 3D properties from 2D (connectivity) information
  3D information from 2D connectivity = 2½ D descriptors
  HTS QSAR and large-scale diversity require fast calculation times
  2D topological descriptors too weak, 3D descriptors too expensive
  Use approximate atomic surface areas as fundamental representation
  Complement substructure keys (stay property-based for class-hopping)
Intended applications
  QSAR/QSPR models - linear and nonlinear - early and late in project
  Chemistry space for library design

Exposed Van der Waals Surface Area (VSA)
Calculate exposed Van der Waals surface area for each atom by subtracting off surface area inside neighbors
Correction factors to sphere formula depend on atomic radii and inter-atomic distances
[Figure: atom of radius r with exposed area 4πr² when isolated, 4πr² - C_A with one overlapping neighbor A, and 4πr² - C_A - C_B with two overlapping neighbors A and B]

Connection Table VSA Calculation
Neglect
  Non-bonded neighbors (small molecules have little NB contact)
  Interaction between angles (1-3 interactions)
  Stretch of bond lengths (use ideal bond length)
Parameters
  Radii: Van der Waals (or solvation)
  Inter-atomic distances: Ideal bond lengths
Define V_i to be the exposed VSA of atom i:
  $$V_i = 4\pi r_i^2 - \sum_A 2\pi r_i \,(r_i - x_{iA}), \qquad x_{iA} = \frac{r_i^2 - s_A^2 + d_{iA}^2}{2\, d_{iA}}$$
  where the sum runs over bonded neighbors A, r_i and s_A are the radii of atom i and neighbor A, and d_iA is their ideal bond length
[Figure: atom of radius r bonded to neighbor A of radius s at distance d]
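
The connection-table calculation amounts to subtracting, for each bonded neighbor, the spherical cap of the atom that lies inside the neighbor's sphere. For two spheres of radii r and s whose centres are a distance d apart, the intersection plane sits at x = (r² - s² + d²) / (2d) from the first centre, and the buried cap has area 2πr(r - x). The sketch below assumes this geometry; the function name, the clamping of x and the example radius and bond length are illustrative choices, not values from the talk.

```python
import math

def exposed_vsa(radius, neighbors):
    """Approximate exposed VDW surface area of one atom.

    radius    -- VDW radius r of the atom
    neighbors -- iterable of (s, d): neighbor radius and ideal bond length
    """
    area = 4.0 * math.pi * radius ** 2
    for s, d in neighbors:
        # Distance from the atom centre to the plane where the two spheres meet.
        x = (radius ** 2 - s ** 2 + d ** 2) / (2.0 * d)
        # Clamp (an added safeguard) so a neighbor removes between nothing
        # and the whole sphere.
        x = max(-radius, min(radius, x))
        area -= 2.0 * math.pi * radius * (radius - x)   # cap buried in the neighbor
    return max(area, 0.0)

# Illustrative values only: two identical atoms of radius 1.7 bonded at 1.5.
isolated = exposed_vsa(1.7, [])
bonded = exposed_vsa(1.7, [(1.7, 1.5)])
```

Because overlaps between neighbors (1-3 interactions) are ignored, the caps are simply summed, which is why crowded fused ring systems are where this kind of approximation is expected to be weakest.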

Quality of Approximate VSA Calculation
Data set of 1,947 conformations
  MOE 2D to 3D converter, MMFF94 force field, 0.01 RMS gradient
  Molecular weights in [300,1600] range
VDW Surface Area
  3D dot calculation
Accuracy
  r = 0.9856, r2 = 0.9666
  <10% error
  Largest errors on steroids and other fused ring systems
[Figure: scatter plot of Approximate VSA (0 to 1600) against Van der Waals Surface Area (0 to 1500)]

Subdivision of VSA by Properties
Given an atomic property value P_i for each atom i
  O2 1.2, C3 4.5, C4 5.9, N7 0.2
Bin P_i by ranges and sum V_i
  P_i range:    [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)
  Descriptors:  D1     D2     D3     D4     D5     D6
[Figure: example molecule with atoms C1, O2, C3, C4, C5, C6, N7 and C8; the areas V1 ... V8 are placed in the bins of their P_i values and summed within each bin]
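
The binning step can be written as a short function: each descriptor D_k is the sum of the areas V_i of the atoms whose property P_i falls in the k-th range. The property values below are the four quoted on the slide; the surface areas are made-up placeholders, and the function name `vsa_descriptors` is an assumption.

```python
def vsa_descriptors(atoms, boundaries):
    """Sum per-atom surface areas V_i into property bins [b_k, b_{k+1})."""
    descriptors = [0.0] * (len(boundaries) - 1)
    for prop, area in atoms:
        for k in range(len(descriptors)):
            if boundaries[k] <= prop < boundaries[k + 1]:
                descriptors[k] += area
                break
    return descriptors

# (P_i, V_i) pairs for O2, C3, C4 and N7; the areas are illustrative only.
atoms = [(1.2, 10.0), (4.5, 8.0), (5.9, 6.0), (0.2, 12.0)]
d = vsa_descriptors(atoms, boundaries=[0, 1, 2, 3, 4, 5, 6])
```

When every atom falls inside the outer boundaries, the descriptors sum to the total surface area, so the set is a partition of the molecular surface by property value.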

8 Molar Refractivity Descriptors
Wildman & Crippen SMR model of Molar Refractivity
  Specific attention paid to calculation of atomic contributions
  Protonation state taken as-is from structure (specific species)
Property bins derived from ~50,000 structures
  8 descriptors result: SMR_VSAk
  Each bin is approximately equally populated over training set
Wildman, S.A., Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).

10 LogP (octanol/water) Descriptors
Wildman & Crippen SlogP model of LogP
  Specific attention paid to calculation of atomic contributions
  Protonation state taken as-is from structure (specific species)
Property bins derived from ~50,000 structures
  10 descriptors: SlogP_VSAk
  Each bin is approximately equally populated over training set
Wildman, S.A., Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).

SMR_VSA and SlogP_VSA Inter-correlation
Correlation Analysis
  SMR and SlogP descriptors weakly correlated
  Test made on ~2000 small molecules not used in definition of descriptors
  Displayed values are r values (not r2)
Descriptors encode “orthogonal” molecular properties

14 Partial Charge Descriptors
Gasteiger (PEOE) partial charge model
  Approximation to local pKa
  Electrostatic interactions
  Similar to Jurs descriptors
14 descriptors result from uniform interval boundaries
  Weak correlation
Stanton, D., Jurs, P. Anal. Chem., 62, 2323 (1990)
Gasteiger, J., Marsili, M. Iterative Partial Equalization of Orbital Electronegativity - A Rapid Access to Atomic Charges. Tetrahedron, 36, 3219 (1980)

Encoding of Traditional Descriptors
Traditional descriptors modeled with VSA descriptors
  1,932 small organic molecules with weights in (28,800)
  SlogP_VSA, SMR_VSA and PEOE_VSA descriptors calculated
  Principal components regression models for 64 traditional descriptors

  chi0      0.99    chi0v_C  0.97    b_ar      0.89    b_1rotN   0.78
  Kier1     0.99    KierA1   0.97    Kier2     0.89    b_double  0.77
  vdw_area  0.99    a_hyd    0.96    vsa_pol   0.89    b_rotN    0.77
  vdw_vol   0.99    a_nC     0.96    vsa_acc   0.88    a_ICM     0.73
  vsa_hyd   0.99    a_nH     0.96    diameter  0.87    vsa_don   0.73
  a_count   0.98    a_nO     0.95    VadjEq    0.87    KierFlex  0.69
  a_heavy   0.98    b_heavy  0.95    a_nN      0.86    balabanJ  0.61
  a_IC      0.98    chi1_C   0.95    KierA2    0.86    a_nP      0.60
  apol      0.98    chi1v_C  0.95    radius    0.86    Kier3     0.57
  b_count   0.98    SlogP    0.95    VdistMa   0.86    a_nCl     0.56
  chi0v     0.98    a_acc    0.94    wienPath  0.85    KierA3    0.55
  chi1      0.98    chi1v    0.94    wienPol   0.84    a_nS      0.53
  SMR       0.98    Weight   0.93    VadjMa    0.82    b_1rotR   0.50
  b_single  0.97    a_aro    0.91    VdistEq   0.82    density   0.49
  bpol      0.97    a_don    0.91    vsa_oth   0.82    b_rotR    0.48
  chi0_C    0.97    zagreb   0.91    a_nF      0.80    b_triple  0.46

Boiling Point
Data set
  Exp. boiling point (K)
  298 small molecules
  18 descriptors: SlogP_VSA(10), SMR_VSA(8)
PCA regression
  r2 = 0.96, RMSE = 15.53
  Leave-one-out: r2 = 0.94, RMSE = 21.37
  Random leave-100-out: r2 = 0.94
[Figure: scatter plot, both axes 200 to 700]

Free Energy of Solvation in Water
Data set
  Exp. ΔGs (kcal/mol)
  291 small molecules
  12 descriptors: PEOE_VSA(3), SlogP_VSA(7), SMR_VSA(2)
PCA regression
  r2 = 0.90, RMSE = 0.78
  Leave-one-out: r2 = 0.89, RMSE = 0.82
  Random leave-100-out: r2 = 0.88
Viswanadhan, V.N., Ghose, A.K., Singh, U.C., Wendoloski, J.J. Prediction of Solvation Free Energies of Small Organic Molecules: Additive-Constitutive Models Based on Molecular Fingerprints and Atomic Constants. J. Chem. Inf. Comput. Sci., 39, 405-412 (1999)
[Figure: scatter plot, both axes -10 to 4]

Thermodynamic Solubility in Water
Data set
  Exp. logW at 25ºC
  1,438 small molecules
  32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
PCA regression
  r2 = 0.75, RMSE = 2.4
  Leave-one-out: r2 = 0.74, RMSE = 2.5
Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212. URL: http://www.syyres.com.
[Figure: scatter plot, both axes -15 to 20]

Vapor Pressure
Data set
  Exp. vapor pressure at 25ºC
  1,771 small molecules
  32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
PCA regression
  r2 = 0.88, RMSE = 2.1
  Leave-one-out: r2 = 0.87, RMSE = 2.2
Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212. URL: http://www.syyres.com.
[Figure: scatter plot, both axes -30 to 20]

Compound Classification with Binary QSAR
Can Binary QSAR separate inhibitor classes using SlogP_VSAk and SMR_VSAk descriptors?
Data: 455 compounds active against one of 7 targets
Results (classification accuracy)
  Class 1: 98.7%  p=0.003  Serotonin receptor ligands
  Class 2: 96.7%  p=0.043  Benzodiazepine receptor ligands
  Class 3: 96.5%  p=0.290  Carbonic anhydrase II inhibitors
  Class 4: 98.7%  p=0.001  Cyclooxygenase-2 (Cox-2) inhibitors
  Class 5: 98.7%  p=0.014  H3 antagonists
  Class 6: 98.7%  p=0.012  HIV protease inhibitors
  Class 7: 99.1%  p=0.002  Tyrosine Kinase inhibitors
Labute, P. Binary QSAR: A New Method for Quantitative Structure Activity Relationships. Proceedings of the 1999 Pacific Symposium, World Scientific Publishing, Singapore (1999)

Compound Classification with CART
Learning set for CART (recursive partitioning)
  455 compounds active against one of 7 targets
  1,942 “random” organic compounds
  SlogP_VSA, SMR_VSA descriptors
Classification accuracy (32 node tree, depth 5)
  Class 1: 84.5%  p=0.07  Serotonin receptor ligands
  Class 2: 49.1%  p=0.30  Benzodiazepine receptor ligands
  Class 3: 92.5%  p=0.27  Carbonic anhydrase II inhibitors
  Class 4: 96.8%  p=0.01  Cyclooxygenase-2 (Cox-2) inhibitors
  Class 5: 82.7%  p=0.03  H3 antagonists
  Class 6: 85.4%  p=0.02  HIV protease inhibitors
  Class 7: 91.4%  p=0.01  Tyrosine Kinase inhibitors
Xue, L., Godden, J., Gao, H., Bajorath, J. Identification of a Preferred Set of Molecular Descriptors for Compound Classification Based on Principal Component Analysis. J. Chem. Inf. Comput. Sci., 39, 699 (1999)

Thrombin, Trypsin and Factor Xa Activity
Data and descriptors
  Exp. pKi data for 72 analogs against Thrombin, Trypsin and Factor Xa
  Descriptors: subsets of SlogP_VSAk, SMR_VSAk, PEOE_VSAk
  Principal components regression
Thrombin (10 descr.): r2 = 0.65, RMSE = 0.61
Trypsin (9 descr.): r2 = 0.72, RMSE = 0.47
Factor Xa (15 descr.): r2 = 0.69, RMSE = 0.35
Bohm, M., Sturzebecher, J., Klebe, G. J. Med. Chem., 42, 458-477 (1999).
[Figure: structure of the analog scaffold and three scatter plots, one per target, all axes 2 to 9]

Blood-Brain Barrier Permeability
Data set
  Exp. logBBB partition
  75 molecules (charged)
  14 descriptors: PEOE_VSA(3), SlogP_VSA(6), SMR_VSA(5)
PCA regression
  r2 = 0.83, RMSE = 0.32
  Leave-one-out: r2 = 0.73, RMSE = 0.43
Luco, J.M. Prediction of the Brain-Blood Distribution of a Large Set of Drugs from Structurally Derived Descriptors Using Partial Least Squares (PLS) Modeling. J. Chem. Inf. Comput. Sci., 39, 396-404 (1999)
[Figure: scatter plot, both axes -2.5 to 1.5]

Advantages
Orthogonality
  SlogP_VSA, SMR_VSA and PEOE_VSA exhibit weak correlation
  Binary QSAR and Recursive Partitioning methodologies benefit
  Less reliance on Principal Components Analysis
Relevance
  SlogP_VSA, SMR_VSA and PEOE_VSA useful for QSAR/QSPR
  SlogP_VSA, SMR_VSA and PEOE_VSA used successfully in HTS QSAR
  Pharmacokinetic, “drug-like” and ADME properties modeled reasonably
Additivity
  VSA conversion of logP and MR to non-whole molecule properties
  VSA descriptors are “group” additive (useful for combinatorial designs)
  Fundamental units are surface areas for all descriptors (Euclidean space)
  More continuous than simple atom/fragment counts

Focused Combinatorial Library Design

Combinatorial Library Design
Objective: Select reagents for combinatorial synthesis to minimize iterations in High Throughput Discovery Cycle
Combinatorial Library: all combinations of R1, R2, R3 and R4 groups.
Select building blocks that bias library towards drug-like actives
Use non-enumerative techniques to score building blocks in large virtual libraries
  4 connection points, 1000 R-groups = 10^12 compounds
  Enumeration impractical. Can’t even store compounds!
  Use statistical sampling to score building blocks
Use Binary QSAR model as focusing agent
[Figure: scaffold with substituent positions R1, R2, R3 and R4]

Building Block Scoring Methodology
Estimate probability that building block is in active compound:
  $$\Pr(X_i = r_{ij} \mid \text{active}) = \frac{\Pr(\text{active} \mid X_i = r_{ij})}{\sum_k \Pr(\text{active} \mid X_i = r_{ik})}$$
  where each Pr(active | ...) term is estimated with the Binary QSAR Model
Use random sampling to estimate terms in formula to avoid enumeration of entire virtual library
  Randomly choose a central group and R-groups
  Construct virtual product and calculate product descriptors
Focused library design using Binary QSAR model
  “Count” the number of times reagents appear in active compounds
  Binary QSAR model used to estimate activity of virtual products
  Select top scoring reagents for pure combinatorial design

Building Block Scoring Algorithm
  Select Random Building Blocks
  Construct Virtual Product
  Calculate Product Descriptors, e.g. (2.1,3.2,4.3,0,5.3,2.8, ...)
  Estimate Probability of Product Activity with the Binary QSAR Model, e.g. p = 0.63
  Add Probability to Building Blocks, e.g. +0.63 to each building block in the product
  Output Building Block Scores: $a_{ij} \leftarrow a_{ij} / \sum_k a_{ik}$
[Figure: flowchart of the steps above with an example scaffold, its R-groups and the resulting virtual product]
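
The sampling loop can be sketched as follows. This is a minimal illustration, not the MOE implementation: `activity_model` is a hypothetical callable standing in for "calculate product descriptors, then apply the Binary QSAR model", reagents are sampled uniformly, and the toy reagent lists and probabilities are invented. It also assumes the model gives at least one sampled product a non-zero probability.

```python
import random

def score_building_blocks(reagents, activity_model, n_samples=5000, seed=0):
    """Estimate Pr(X_i = r_ij | active) by sampling virtual products.

    reagents       -- list of lists; reagents[i] holds the R-groups at position i
    activity_model -- callable mapping a tuple of R-groups to Pr(active)
    """
    rng = random.Random(seed)
    totals = [{r: 0.0 for r in position} for position in reagents]
    for _ in range(n_samples):
        # Select random building blocks and form the virtual product.
        virtual_product = tuple(rng.choice(position) for position in reagents)
        p = activity_model(virtual_product)
        # Add the product's activity probability to each building block used.
        for i, r in enumerate(virtual_product):
            totals[i][r] += p
    # Normalize per position: a_ij / sum_k a_ik.
    scores = []
    for position_totals in totals:
        norm = sum(position_totals.values())
        scores.append({r: a / norm for r, a in position_totals.items()})
    return scores

def toy_model(virtual_product):
    """Invented stand-in for a trained activity model."""
    p = 0.05
    if "benzyl" in virtual_product:
        p += 0.4
    if "1-imidazolyl" in virtual_product:
        p += 0.4
    return p

reagents = [["benzyl", "propyl"], ["1-imidazolyl", "c-hexyl"]]
scores = score_building_blocks(reagents, toy_model)
```

The cost depends on the number of samples rather than on the size of the virtual library, which is what makes the approach usable when the library cannot be enumerated.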

cGMP PDE V Data Set
Cyclic GMP phosphodiesterase (human type 5) inhibitors
263 compounds from literature: imidazopyrines, quinazolines, 1,3-bis(cyclopropylmethyl)xanthines
1,534 random “inactives” added
IC50: 0.5 nM through 100+ µM
[Figure: scaffold structures of the compound classes with substituent positions R, R1, R2, R3 and R4]

Binary QSAR Model for cGMP PDE V
Binary QSAR setup
  Descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
  Threshold to simulate HTS data: active = - log IC50 > 0 (1 µM)
  Random 10% of data separated for validation set (9 active, 171 inactive)
  Remaining 90% of data used for training (52 active, 1568 inactive)
Results
  Training set accuracy: 69.2% on active, 98.4% on inactive (p=0.000227)
  Leave-one-out accuracy: 55.8% on active, 98.4% on inactive
  10-fold block cross-validation gave similar results
  Validation set accuracy: 55.6% on active, 98.2% on inactive
Resulting statistically significant model is assumed to “understand” the original data set
  Individual predictions of activity are suspect but a collection of predictions is meaningful: e.g., estimate the number of actives in a library

Quinazoline Virtual Library Definition
Quinazoline scaffold
27 x 12 x 10 x 2 = 6,480 products in virtual library
  R1: CH2-(2-thienyl), benzyl, CH2-benzyl, phenyl, CH2-(3-pyridyl), CH2-(2-furanyl), 2-pyridyl, 3-(5-Me-isoxazolyl), 2-ClPh, 3-ClPh, 4-ClPh, 3-pyridyl, 1-pyrrolyl, propyl, 3-CH3OPh, CH2CH2-2(3-Me-pyrrolyl), 3-NO2Ph, H, CH2(CH2)4OH, CH2(cPr), 4-(CO2Me)Ph, c-pentyl, c-hexyl, CH2-(2-THF), CH2(CH2)2OCOCH2CH3
  R2: 1-imidazolyl, 2-pyridyl, 3-pyridyl, H, 4-pyridyl, 2-thienyl, 2-furyl, Cl, 4-morpholine, c-hexyl, 4-Me-1-piperazinyl, styrenyl
  R3: H, CH3, F, Cl, Br, CH2OH, CH2SH, CH3SO2, NO2, HCC
  R4: H, Me
[Figure: quinazoline scaffold with substituent positions R1, R2, R3 and R4]

Quinazoline Library R-Group Scores

  R1                               R2                         R3             R4
  0.121 CH2-(2-thienyl)            0.191 1-imidazolyl         0.197 Me       0.960 H
  0.120 benzyl                     0.160 2-pyridyl            0.192 H        0.040 Me
  0.109 CH2-benzyl                 0.145 3-pyridyl            0.142 Cl
  0.083 phenyl                     0.127 H                    0.120 F
  0.081 CH2-(3-pyridyl)            0.111 4-pyridyl            0.112 CH3O
  0.062 CH2-(2-furanyl)            0.099 2-thienyl            0.093 Br
  0.061 2-pyridyl                  0.076 2-furyl              0.065 CH3S
  0.041 3-(5-Me-isoxazolyl)        0.072 Cl                   0.042 CH3SO2
  0.038 2-ClPh                     0.016 4-morpholine         0.024 NO2
  0.037 3-ClPh                     0.003 c-hexyl              0.013 NCC
  0.036 4-ClPh                     0.001 4-Me-1-piperazinyl
  0.034 3-pyridyl
  0.032 1-pyrrolyl
  0.023 3-CH3OPh
  0.015 CH2CH2-2(3-Me-pyrrolyl)
  0.015 3-NO2Ph
  0.011 H
  0.008 CH2(CH2)4OH
  0.007 CH2(cPr)
  0.007 4-(CO2Me)Ph
  0.006 CH2-(c-hexyl)
  0.005 3-(CO2Me)Ph, c-pentyl, c-hexyl
  0.004 CH2-(2-THF)
  0.003 CH2(CH2)2OCOCH2CH3

Use median cutoff at each position (selected R-groups shown in blue)
Retain only high-scoring R-groups that account for 50% of the probability
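
The 50% rule can be expressed as a cumulative cutoff over the sorted scores. The sketch below applies it to the R3 scores from the table; the function name `select_top_half` and the exact stopping rule (stop once the running total reaches 0.5) are assumptions about how the cutoff was applied.

```python
def select_top_half(scores, mass=0.5):
    """Keep the highest-scoring R-groups until their scores sum to `mass`."""
    kept, running = [], 0.0
    for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if running >= mass:
            break
        kept.append(name)
        running += score
    return kept

# R3 scores as listed in the R-group score table.
r3_scores = {"Me": 0.197, "H": 0.192, "Cl": 0.142, "F": 0.120, "CH3O": 0.112,
             "Br": 0.093, "CH3S": 0.065, "CH3SO2": 0.042, "NO2": 0.024, "NCC": 0.013}
r3_selected = select_top_half(r3_scores)
```

For R3 this keeps Me, H and Cl. For R2 the top three scores sum to 0.496, just under 0.5, so a strict threshold like this one would admit a fourth group; treating 0.496 as approximately 50% gives three.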

Resulting Focused Quinazoline Library
3 x 3 x 5 = 45 compounds (0.7% of original library)
  R1: CH2-(2-thienyl), benzyl, CH2-benzyl, phenyl, CH2-(3-pyridyl)
  R2: 1-imidazolyl, 2-pyridyl, 3-pyridyl
  R3: H, Me, Cl
SAR for R1
  -CH2- spacer in top scoring groups agrees with experiment
  Benzyl group agrees with experiment
SAR for R2
  1-imidazolyl as top scoring group agrees with experiment
  2-, 3-pyridyl groups agree with experiment
SAR for R3
  Preference for small hydrophobic groups agrees with experiment
[Figure: focused quinazoline scaffold showing the selected R1, R2 and R3 groups]
Summary and Outlook

Complete Methodology for HTDD
Binary QSAR: probabilistic QSAR
  QSAR models from HTS data (directly) and ADME data
VSA Descriptors: information rich low dimensional space
  Applicable for activity models and ADME models
Reagent Scoring: probabilistic scoring
  Non-enumerative technique that uses Binary QSAR models to focus
[Figure: workflow diagram with boxes BioAssay, HTS Data, Binary QSAR, Activity Model, Drugability Data (e.g., BBB or drug-like), Binary QSAR, ADME Model, Design Model, Library Design and Combinatorial Library]

Important Ideas
Binary QSAR
  Shift away from regression-based techniques: more robust to errors
  Predictions are “soft”: best suited to collection-based predictions
  Probability models can be combined (e.g., ADME & Potency)
VSA Descriptors
  Orthogonal, group-additive descriptors reminiscent of Hansch & Leo
  Wide applicability gives rise to meaningful chemistry space
  Less reliance on variable selection procedures (fewer false correlations)
Reagent Scoring
  Non-enumerative techniques can handle huge virtual libraries
  Resulting score complements other criteria (cost, availability, etc.)
  Sample + Reject + Estimate procedure can incorporate arbitrary filters
Statistically significant model = “understanding” of data

Future Work and Availability
Generation II VSA descriptors
  Improve fundamental VSA approximation: can do better than 10% error
  Handle “protonics”: average over tautomeric and protonation states
  Direct polarizability model might replace logP + MR
Extend probability decomposition to include receptor (T)
  Pr (Y, D, S, T) = Pr (D | Y, S, T) Pr (Y | S, T) Pr (T, S)
  Pr (D | Y, S, T): drugable for T or T’s type
  Pr (Y | S, T): probabilistic QSAR
  Pr (T, S): bioinformatics?
All methodology available in MOE version 2000.02