Copyright © 2000 Chemical Computing Group Inc. All Rights Reserved.

A Probabilistic Approach to High Throughput Drug Discovery

Introduction and Motivation
Probability Modeling in Drug Discovery
Representation of Chemical Structures (Descriptors)
Focused Combinatorial Library Design
Summary and Outlook



High Throughput Screening

Large-scale automation of biological assays (HTS)
- Use robotics to perform 10,000 to 100,000 screens per day
- Brute-force approach to drug discovery: “rapidly screen all compounds”

Noteworthy drawbacks to HTS:
- Economics: $1-$5 per assay (provided large collections are assayed)
- Logistics: compound formatting, inventory systems and other overhead
- Precision loss: effectively a “binary” measurement: active/inactive (pass/fail)
- High error rate: assay, synthesis failure, sample degradation, registration

Resulting effects:
- Quality-for-quantity tradeoff: lots of low-quality data
- High level of noise (error) in the data makes interpretation very difficult

HTS has gained acceptance and is routinely used to generate lead compounds for drug discovery projects


Sources of Compounds for HTS

Initial screening libraries (first libraries used in a project)
- Historical “in-house” collection of compounds augmented with compounds purchased from external suppliers
- 1 million+ compounds available means the initial screening library must be designed (diversity retained using fewer compounds)
- Receptor-biased initial screening libraries are a possibility

Follow-up libraries
- Parallel synthesis / combinatorial chemistry is an excellent source of large numbers of (new) compounds
- Synthesis of “all” analogs around a lead structure exhibits poor diversity but is very good for “local” exploration and lead follow-up

External screening compound purchasing and in-house combinatorial chemistry efforts have gained acceptance and are routinely used in lead generation and follow-up


High Throughput Discovery Cycle

Brute-force HTS is not practical
- At least 10 trillion stable drug candidates
- At 1 billion screens per day, more than 27 years are needed to screen all 10 trillion

A discovery cycle can be used to reduce total screens
- Use HTS data to affect the selection of compounds to screen next
- Scale-up of the traditional experimental discovery cycle

[Figure: the High Throughput Discovery cycle, linking HTS Bioassay, HTS Data Analysis, Focused Library Design and Parallel Synthesis]
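The 27-year figure quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope illustration; the function name and the 365.25-day year are assumptions, not part of the original talk.

```python
# Back-of-envelope check of the brute-force screening time quoted on the slide.
CANDIDATES = 10 * 10**12      # "at least 10 trillion" stable drug candidates
SCREENS_PER_DAY = 10**9       # 1 billion screens per day
DAYS_PER_YEAR = 365.25        # assumption: average calendar year


def years_to_screen(n_compounds, per_day, days_per_year=DAYS_PER_YEAR):
    """Years needed to screen n_compounds at a fixed daily throughput."""
    return n_compounds / per_day / days_per_year


years_at_billion = years_to_screen(CANDIDATES, SCREENS_PER_DAY)
# Same question at the upper end of the robotic throughput mentioned earlier.
years_at_hts_rate = years_to_screen(CANDIDATES, 100_000)

print(f"{years_at_billion:.1f} years at 1e9 screens/day")
print(f"{years_at_hts_rate:,.0f} years at 1e5 screens/day")
```

At the 100,000 screens per day quoted for current robotics the same calculation gives hundreds of thousands of years, which is the motivation for a focused discovery cycle.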


Required Technology for HTD Cycle

High Throughput Screening facility

Parallel synthesis and combinatorial chemistry capabilities

Methodology for automatically analyzing HTS data
- Humans find it difficult to interpret large amounts of noisy data
- Automatic HTS QSAR technology is necessary for the HTD cycle

Methodology for designing focused combinatorial libraries
- HTS QSAR results are used to bias a combinatorial library towards activity
- ADME properties and other design criteria should be taken into account

Meaningful representation of compounds
- Collection of molecular descriptors meaningful across projects (avoid time-consuming variable selection procedures)
- Definition of a “chemistry space” for diversity studies (design of initial screening libraries)


Probability Modeling in Drug Discovery


Probabilistic Formalism (Bayesian Inference)

Step 1: Write all observables as a joint probability density; e.g., Pr(A, B, C)

Step 2: Decompose the density using probability theory and Bayes’ theorem until the components are measurable; e.g.,

Pr(A, B, C) = Pr(B | A, C) Pr(C | A) Pr(A)

Step 3: Model each component in the product from a database or experimental data set

Step 4: Make predictions or estimates using the computed model of Pr(A, B, C)
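The decomposition in Step 2 is an exact identity, which can be verified numerically on a small discrete example. The following is a minimal sketch, assuming a made-up joint table over three discrete variables; the array sizes and variable names are illustrative only.

```python
import numpy as np

# Toy joint distribution Pr(A, B, C) on a 2 x 3 x 2 grid (sizes are arbitrary).
rng = np.random.default_rng(0)
joint = rng.random((2, 3, 2))          # axes: A, B, C
joint /= joint.sum()                   # normalize so the entries sum to 1

p_a = joint.sum(axis=(1, 2))           # Pr(A)
p_ac = joint.sum(axis=1)               # Pr(A, C), shape (A, C)
p_c_given_a = p_ac / p_a[:, None]      # Pr(C | A)
p_b_given_ac = joint / p_ac[:, None, :]  # Pr(B | A, C)

# Rebuild the joint from the three factors: Pr(B | A, C) Pr(C | A) Pr(A).
rebuilt = p_b_given_ac * p_c_given_a[:, None, :] * p_a[:, None, None]
max_error = float(np.abs(rebuilt - joint).max())
print(max_error)
```

In practice each factor would be modeled from data (Step 3) rather than read off a known table, so the product becomes an approximation of the joint rather than an identity.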


Probabilities in Speech Recognition

Successful speech recognizers select (predict) an output word sequence from an input waveform by maximizing the joint likelihood Pr(WAVE, WORDS)
- This is used (in part) to solve the isophonetic word sequence problem; e.g., “imadam” can be “I’m Adam” or “I’m a Dam” or “eye mad am”

Pr(WAVE, WORDS) = Pr(WAVE | WORDS) Pr(WORDS)
- Pr(WORDS) is the prior probability of a word sequence (utterance)
- Pr(WAVE | WORDS) is used to score the waveform under the assumption or hypothesis that the word sequence is WORDS
- Build a model of Pr(WORDS) by training on, say, 500,000,000 words of newspaper text (the prior knowledge)

Pr(WORDS) effectively depresses importance of unlikely utterances in favor of more plausible statements (real phrases)
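The effect of the prior can be illustrated with the “imadam” example. In the sketch below every number is invented for illustration: the three acoustic likelihoods are nearly equal (the candidates sound alike), and the priors stand in for what a language model trained on newspaper text might assign.

```python
# Hypothetical scores for the three candidate transcriptions of "imadam".
# likelihood = Pr(WAVE | WORDS), prior = Pr(WORDS); all values are made up.
candidates = {
    "I'm Adam":   {"likelihood": 0.30, "prior": 1e-6},
    "I'm a Dam":  {"likelihood": 0.32, "prior": 1e-9},
    "eye mad am": {"likelihood": 0.35, "prior": 1e-12},
}


def joint_score(entry):
    """Pr(WAVE, WORDS) = Pr(WAVE | WORDS) * Pr(WORDS)."""
    return entry["likelihood"] * entry["prior"]


best_by_joint = max(candidates, key=lambda w: joint_score(candidates[w]))
best_by_likelihood = max(candidates, key=lambda w: candidates[w]["likelihood"])

print("acoustics alone:", best_by_likelihood)
print("with prior:     ", best_by_joint)
```

With these assumed numbers the acoustics alone favor the implausible phrase, while the joint likelihood selects the plausible one.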


Probabilities in Drug Discovery

Notation: Y = active (0/1), D = drugable (0/1), S = structure

Decompose:

$$\Pr(Y, D, S) = \Pr(D \mid Y, S)\,\Pr(Y \mid S)\,\Pr(S)$$

- Pr(D | Y, S): drugable given active structure (approximated by “is drug-like” efforts)
- Pr(Y | S): activity assuming structure (probabilistic QSAR efforts)

Product of probabilities balances competing goals
- Classification alone (e.g., RP) is not enough: weighted outcomes are needed
- Methodology is similar to “soft” classification problems or fuzzy logic
- Any method of probability modeling is valid (e.g., histogram, analytic)

Approximations introduced can be clearly identified
- e.g., Pr(D | Y, S) ≈ Pr(D | S): drugability is independent of activity (!?)


Pr(Y|X) via Binary QSAR

If Y is “binary activity” and X is a descriptor vector then Bayes’ theorem gives

$$\Pr(Y=1 \mid X=x) = \left[\, 1 + \frac{\Pr(X=x \mid Y=0)\,\Pr(Y=0)}{\Pr(X=x \mid Y=1)\,\Pr(Y=1)} \,\right]^{-1}$$

Pathology of Binary QSAR is reasonable
- If a new structure is outside the training set then Pr(Y=1), the hit rate, is used to make predictions (no other information is available)

[Figure: training descriptors X1 ... Xn, split into active (X1 ... Xk) and inactive (Xk+1 ... Xn) sets, are used to estimate Pr(Y), Pr(X|Y) and Pr(X); Bayes’ theorem combines these into Pr(Y|X)]
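The Bayes’ theorem expression for Pr(Y=1 | X=x) on this slide can be evaluated directly once the two class-conditional likelihoods and the hit rate are known. The function below is a minimal sketch; the function name and the example numbers are assumptions chosen for illustration.

```python
def posterior_active(lik_active, lik_inactive, prior_active):
    """Pr(Y=1 | X=x) from Pr(X=x | Y=1), Pr(X=x | Y=0) and the hit rate Pr(Y=1)."""
    prior_inactive = 1.0 - prior_active
    ratio = (lik_inactive * prior_inactive) / (lik_active * prior_active)
    return 1.0 / (1.0 + ratio)


# Equal likelihoods carry no information, so the posterior is the hit rate.
p_uninformative = posterior_active(0.2, 0.2, prior_active=0.01)

# A descriptor value 100 times more likely among actives lifts a 1% hit rate
# to roughly an even chance.
p_informative = posterior_active(0.5, 0.005, prior_active=0.01)

print(round(p_uninformative, 4), round(p_informative, 4))
```

The first call shows the fallback behaviour described on the slide: when the data say nothing about a structure, the prediction reduces to the hit rate.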


Distribution Estimates

Four distributions in the formula are of two types
- Pr(Y=0), Pr(Y=1): prior probability of inactive/active
- Pr(X=x | Y=0), Pr(X=x | Y=1): probability of the ligand assuming inactive/active

Modeling assumption: uncorrelated descriptors are treated as independent (!)
- Decompose the multi-dimensional distribution into a product

$$\Pr(X=x \mid Y=y) \approx \prod_{i=1}^{n} \Pr(X_i=x_i \mid Y=y)$$

- Estimate 2n+2 distributions instead of the original four

Binary QSAR Algorithm
- Compute descriptor vectors d_i
- De-correlate descriptors: x_i = Q(d_i - u)
- Estimate distributions Pr(X = x | Y = y) from {x_i, y_i}
- Assemble p(x) ≈ Pr(Y = 1 | X = x)
- Predict for new descriptors d: p(Q(d - u))
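The algorithm on this slide can be sketched in a few dozen lines of NumPy. This is not the published Binary QSAR implementation: the class name, the use of plain histograms with an additive pseudo-count, the PCA whitening used for Q, and the rule that an out-of-range coordinate contributes no evidence are all assumptions made to keep the example short. It needs at least two descriptors and both classes present in the training data.

```python
import numpy as np


class BinaryQSARSketch:
    """Toy Binary QSAR: de-correlate, estimate 1-D histograms, apply Bayes."""

    def __init__(self, n_bins=10, pseudo=1.0):
        self.n_bins = n_bins      # histogram resolution (assumed)
        self.pseudo = pseudo      # additive smoothing count (assumed)

    def fit(self, D, y):
        D, y = np.asarray(D, float), np.asarray(y, int)
        self.u = D.mean(axis=0)
        evals, evecs = np.linalg.eigh(np.cov(D - self.u, rowvar=False))
        keep = evals > 1e-10
        # Rows of Q rotate and scale so that x = Q(d - u) is uncorrelated.
        self.Q = (evecs[:, keep] / np.sqrt(evals[keep])).T
        X = (D - self.u) @ self.Q.T
        self.prior = np.array([np.mean(y == 0), np.mean(y == 1)])
        self.edges = [np.linspace(c.min(), c.max(), self.n_bins + 1) for c in X.T]
        self.hist = np.empty((2, X.shape[1], self.n_bins))
        for cls in (0, 1):
            for i, e in enumerate(self.edges):
                counts, _ = np.histogram(X[y == cls, i], bins=e)
                self.hist[cls, i] = (counts + self.pseudo) / (
                    counts.sum() + self.pseudo * self.n_bins)
        return self

    def predict_proba(self, D):
        """Return Pr(Y=1 | X=x) for each row of D."""
        X = (np.atleast_2d(np.asarray(D, float)) - self.u) @ self.Q.T
        log_post = np.log(self.prior)[:, None] + np.zeros((2, X.shape[0]))
        for i, e in enumerate(self.edges):
            inside = (X[:, i] >= e[0]) & (X[:, i] <= e[-1])
            idx = np.clip(np.searchsorted(e, X[:, i], side="right") - 1,
                          0, self.n_bins - 1)
            # Outside the training range a coordinate adds no evidence.
            log_post += np.where(inside, np.log(self.hist[:, i, idx]), 0.0)
        post = np.exp(log_post - log_post.max(axis=0))
        return post[1] / post.sum(axis=0)


# Synthetic demonstration data: actives are shifted along the first descriptor.
rng = np.random.default_rng(0)
inactive = rng.normal(0.0, 1.0, size=(300, 2))
active = rng.normal(0.0, 1.0, size=(300, 2)) + np.array([6.0, 0.0])
D = np.vstack([inactive, active])
y = np.array([0] * 300 + [1] * 300)

model = BinaryQSARSketch().fit(D, y)
p_train = model.predict_proba(D)
p_far = model.predict_proba([[1000.0, 1000.0]])[0]
print(p_train[y == 1].mean(), p_train[y == 0].mean(), p_far)
```

Because both classes share the same bin edges, bin probabilities can be used in place of densities: the bin width cancels in the likelihood ratio. The far-away query point falls outside every histogram, so its prediction is the hit rate.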


Experience with Binary QSAR

Fundamental methodology publication (robustness study)
- Biocomputing: Proceedings of the 1999 Pacific Symposium; World Scientific Publishing, Singapore, 1999

Example literature data sets (non-HTS data)
- Estrogen receptor (Gao et al.; J. Chem. Inf. Comput. Sci., 1999, 39)
- O-acyltransferase (ACAT) (Labute et al.; in press)

Example industrial data sets (HTS assays)
- ArQule: 24,000 cpds., ~200 active; 93% on inactives, 60% on actives
- Pharmacopeia: 24,000 cpds.; >90% on inactives, >90% on actives
- SmithKline Beecham: 80,000 cpds., ~100 active; 90% on actives

Best success story: Pharmacia & Upjohn
- Binary QSAR model used to select building blocks in a combi-chem library
- Improved activity from μM to nM (factor of 1000)


Combined Design Model for HTD Cycle

Use the Binary QSAR method twice, once for the activity model and once for the drugability model
- Train drugability model Pr(D | X) on WDI/ACD for drug-like/non-drug-like or on specific data sets (e.g., blood-brain barrier permeability)

Complete model of activity and drugability is the product Pr(D | X) Pr(Y | X), which approximates Pr(D, Y | S)

[Figure: HTS data from the bioassay is passed through Binary QSAR to give the activity model; drugability data (e.g., BBB or drug-like) is passed through Binary QSAR to give the ADME model; the two are combined into the design model used for library design of the combinatorial library]
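The product Pr(D | X) Pr(Y | X) gives a single design score that balances the two goals. The sketch below ranks three hypothetical compounds; the compound names and probabilities are invented and would in practice come from the two Binary QSAR models.

```python
# Hypothetical model outputs per compound: (Pr(Y=1 | X), Pr(D=1 | X)).
compounds = {
    "cpd_A": (0.90, 0.10),   # likely active, unlikely to be drug-like
    "cpd_B": (0.60, 0.70),   # moderate on both counts
    "cpd_C": (0.30, 0.95),   # drug-like, but probably inactive
}


def design_score(p_active, p_drugable):
    """Combined design model: Pr(D | X) * Pr(Y | X)."""
    return p_drugable * p_active


scores = {name: design_score(*probs) for name, probs in compounds.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(name, round(scores[name], 3))
```

A hard classifier on activity alone would pick cpd_A first; the weighted product prefers the compound that is reasonable on both criteria.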


Representation of Chemical Structures (Descriptors)


A Brief History of QSAR

Original philosophy (Hansch & Leo): use a fixed set of meaningful molecular properties to describe a wide variety of biological phenomena

Linear regression used to determine SAR
- The determination of linear relationships is basic science
- Statistical regression framework used to assess significance of SAR

Proliferation of descriptors
- Early successes led to the introduction of a vast array of descriptors
- In principle, any number calculable from a chemical structure can be used as a molecular descriptor for SAR determination

Over-determination of SAR
- The multitude of descriptors leads to a need for variable elimination schemes
- 3D methods treat each grid point in a field representation as a descriptor


Fundamental Notions

Use a fixed set of descriptors for diversity and QSAR/QSPR
- A meaningful chemistry space should not require customization
- In QSAR/QSPR automatic variable selection can be dangerous
- Make direct use of Hansch & Leo thinking (build on their experience)

Model 3D properties from 2D (connectivity) information
- 3D information from 2D connectivity = 2½D descriptors
- HTS QSAR and large-scale diversity require fast calculation times
- 2D topological descriptors are too weak, 3D descriptors too expensive
- Use approximate atomic surface areas as the fundamental representation
- Complement substructure keys (stay property-based for class-hopping)

Intended applications
- QSAR/QSPR models, linear and nonlinear, early and late in a project
- Chemistry space for library design


Exposed Van der Waals Surface Area (VSA)

Calculate exposed Van der Waals surface area for each atom by subtracting off surface area inside neighbors

Correction factors to sphere formula depend on atomic radii and inter-atomic distances

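[Figure: exposed surface area of a sphere of radius r: 4πr² for an isolated atom, 4πr² - C_A with one overlapping neighbor A, and 4πr² - C_A - C_B with two overlapping neighbors A and B]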


Connection Table VSA Calculation

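Neglect
- Non-bonded neighbors (small molecules have little NB contact)
- Interaction between angles (1-3 interactions)
- Stretch of bond lengths (use ideal bond length)

Parameters
- Radii: Van der Waals (or solvation)
- Inter-atomic distances: ideal bond lengths

Define V_i to be the exposed VSA of atom i. For an atom A of radius r whose bonded neighbors i have radii s_i and lie at distances d_i:

$$V_A = 4\pi r^2 - \sum_i 2\pi r\,(r - x_i), \qquad x_i = \frac{r^2 - s_i^2 + d_i^2}{2 d_i}$$

[Figure: atom A of radius r overlapped by a neighbor of radius s at distance d]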
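The connection-table calculation amounts to subtracting one spherical cap per bonded neighbor from the full sphere area. The sketch below assumes the sphere-overlap geometry just described (a cap of height r - x on a sphere of radius r has area 2πr(r - x)); the function name, the clamping of d, and the example radius and bond length are illustrative assumptions.

```python
import math


def exposed_vsa(r, neighbors):
    """Approximate exposed surface area of an atom of radius r.

    neighbors is a list of (s, d) pairs: the neighbor's radius and the ideal
    bond length to it. Overlaps between different caps are neglected.
    """
    area = 4.0 * math.pi * r * r
    for s, d in neighbors:
        # Assumption: clamp d so the two spheres actually intersect.
        d = min(max(d, abs(r - s)), r + s)
        x = (r * r - s * s + d * d) / (2.0 * d)   # distance to the cut plane
        area -= 2.0 * math.pi * r * (r - x)       # remove the buried cap
    return area


# Illustrative values: a carbon-like atom (r = 1.70) with carbon-like
# neighbors at a single-bond distance of 1.54 (both in angstroms).
isolated = exposed_vsa(1.70, [])
one_neighbor = exposed_vsa(1.70, [(1.70, 1.54)])
two_neighbors = exposed_vsa(1.70, [(1.70, 1.54), (1.70, 1.54)])
print(round(isolated, 2), round(one_neighbor, 2), round(two_neighbors, 2))
```

Only radii and bond lengths from the connection table enter the calculation, which is why no 3D conformation is needed.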


Quality of Approximate VSA Calculation

Data set of 1,947 conformations
- MOE 2D → 3D converter, MMFF94 force field, 0.01 RMS gradient
- Molecular weights in the [300, 1600] range

VDW surface area: 3D dot calculation

Accuracy
- r = 0.9856, r2 = 0.9666
- <10% error
- Largest errors on steroids and other fused ring systems

[Figure: scatter plot of Approximate VSA (vertical axis, 0 to 1600) against Van der Waals Surface Area (horizontal axis, 0 to 1500)]


Subdivision of VSA by Properties

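Given an atomic property value P_i for each atom i, e.g., O2: 1.2, C3: 4.5, C4: 5.9, N7: 0.2

Bin P_i by ranges and sum V_i

P_i range:    [0,1)  [1,2)  [2,3)  [3,4)  [4,5)  [5,6)
Descriptors:  D1     D2     D3     D4     D5     D6

[Figure: example structure with atoms C1, O2, C3, C4, C5, C6, N7 and C8; under each P_i range the V_i values (V1 ... V8) of the atoms whose P_i falls in that range are added to give the descriptor for that range]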
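The binning step can be written as a short function. The sketch below uses the four property values listed on this slide (O2, C3, C4, N7) with the six unit-width ranges; the V_i areas are invented placeholders, since the slide does not give numerical surface areas.

```python
def vsa_descriptors(props, areas, edges):
    """Sum atomic areas V_i into bins defined by the atomic property P_i.

    props: atom label -> P_i, areas: atom label -> V_i,
    edges: bin boundaries; bin k covers [edges[k], edges[k+1]).
    """
    desc = [0.0] * (len(edges) - 1)
    for atom, p in props.items():
        for k in range(len(edges) - 1):
            if edges[k] <= p < edges[k + 1]:
                desc[k] += areas[atom]
                break
    return desc


props = {"O2": 1.2, "C3": 4.5, "C4": 5.9, "N7": 0.2}     # P_i from the slide
areas = {"O2": 10.0, "C3": 5.0, "C4": 7.5, "N7": 12.0}   # V_i, hypothetical
edges = [0, 1, 2, 3, 4, 5, 6]

D1_to_D6 = vsa_descriptors(props, areas, edges)
print(D1_to_D6)
```

Because each descriptor is a sum of surface areas, all descriptors share the same units, and the descriptor vector of a molecule is the sum of the contributions of its atoms.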


8 Molar Refractivity Descriptors

Wildman & Crippen SMR model of molar refractivity
- Specific attention paid to calculation of atomic contributions
- Protonation state taken as-is from structure (specific species)

Property bins derived from ~50,000 structures
- 8 descriptors result: SMR_VSAk
- Each bin is approximately equally populated over the training set

Wildman, S.A., Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).


10 LogP (octanol/water) Descriptors

Wildman & Crippen SlogP model of LogP
- Specific attention paid to calculation of atomic contributions
- Protonation state taken as-is from structure (specific species)

Property bins derived from ~50,000 structures
- 10 descriptors: SlogP_VSAk
- Each bin is approximately equally populated over the training set

Wildman, S.A., Crippen, G.M. Prediction of Physicochemical Parameters by Atomic Contributions. J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999).


SMR_VSA and SlogP_VSA Inter-correlation

Correlation analysis
- SMR_VSA and SlogP_VSA descriptors are weakly correlated
- Test made on ~2000 small molecules not used in the definition of the descriptors
- Displayed values are r values (not r2)

Descriptors encode “orthogonal” molecular properties


14 Partial Charge Descriptors

Gasteiger (PEOE) partial charge model
- Approximation to local pKa
- Electrostatic interactions
- Similar to Jurs descriptors

14 descriptors result from uniform interval boundaries
- Weak correlation

Stanton, D., Jurs, P. Anal. Chem., 62, 2323 (1990)

Gasteiger, J., Marsili, M. Iterative Partial Equalization of Orbital Electronegativity - A Rapid Access to Atomic Charges. Tetrahedron, 36, 3219 (1980)


Encoding of Traditional Descriptors

Traditional descriptors modeled with VSA descriptors
- 1,932 small organic molecules with molecular weights in (28, 800)
- SlogP_VSA, SMR_VSA and PEOE_VSA descriptors calculated
- Principal components regression models for 64 traditional descriptors

chi0      0.99    chi0v_C  0.97    b_ar      0.89    b_1rotN   0.78
Kier1     0.99    KierA1   0.97    Kier2     0.89    b_double  0.77
vdw_area  0.99    a_hyd    0.96    vsa_pol   0.89    b_rotN    0.77
vdw_vol   0.99    a_nC     0.96    vsa_acc   0.88    a_ICM     0.73
vsa_hyd   0.99    a_nH     0.96    diameter  0.87    vsa_don   0.73
a_count   0.98    a_nO     0.95    VadjEq    0.87    KierFlex  0.69
a_heavy   0.98    b_heavy  0.95    a_nN      0.86    balabanJ  0.61
a_IC      0.98    chi1_C   0.95    KierA2    0.86    a_nP      0.60
apol      0.98    chi1v_C  0.95    radius    0.86    Kier3     0.57
b_count   0.98    SlogP    0.95    VdistMa   0.86    a_nCl     0.56
chi0v     0.98    a_acc    0.94    wienPath  0.85    KierA3    0.55
chi1      0.98    chi1v    0.94    wienPol   0.84    a_nS      0.53
SMR       0.98    Weight   0.93    VadjMa    0.82    b_1rotR   0.50
b_single  0.97    a_aro    0.91    VdistEq   0.82    density   0.49
bpol      0.97    a_don    0.91    vsa_oth   0.82    b_rotR    0.48
chi0_C    0.97    zagreb   0.91    a_nF      0.80    b_triple  0.46
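Principal components regression, the modeling method named on this slide, regresses the target on the leading principal component scores of the descriptor matrix. The following is a generic NumPy sketch on synthetic data; the function names and the demonstration data are assumptions, and nothing here reproduces the numbers in the table.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Fit y on the first n_components principal component scores of X."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    V = Vt[:n_components].T                       # loadings, (n_features, k)
    scores = (X - x_mean) @ V                     # component scores
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    return x_mean, y_mean, V @ gamma              # coefficients in X space


def pcr_predict(model, X):
    x_mean, y_mean, coef = model
    return (np.asarray(X, float) - x_mean) @ coef + y_mean


def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic check: a noiseless linear target is recovered with all components.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 3.0

r2_full = r_squared(y_demo, pcr_predict(pcr_fit(X_demo, y_demo, 5), X_demo))
r2_two = r_squared(y_demo, pcr_predict(pcr_fit(X_demo, y_demo, 2), X_demo))
print(round(r2_full, 6), round(r2_two, 3))
```

Keeping fewer components trades some fit for stability when descriptors are correlated; with all components retained the model coincides with ordinary least squares.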


Boiling Point

Data set
- Exp. boiling point (K)
- 298 small molecules
- 18 descriptors: SlogP_VSA (10), SMR_VSA (8)

PCA regression
- r2 = 0.96, RMSE = 15.53
- Leave-one-out: r2 = 0.94, RMSE = 21.37
- Random leave-100-out: r2 = 0.94

[Figure: scatter plot for the boiling point model; both axes run from 200 to 700]
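The leave-one-out statistics quoted on this slide come from refitting the model once per molecule with that molecule held out. The sketch below shows the procedure generically; ordinary least squares stands in for the regression model, and the data are synthetic, so the printed values are unrelated to the boiling point results.

```python
import numpy as np


def ols_fit(X, y):
    """Least-squares fit with an intercept (stand-in for the real model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def ols_predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def leave_one_out(X, y, fit=ols_fit, predict=ols_predict):
    """Return (r2, RMSE) computed from leave-one-out predictions."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i             # hold out molecule i
        preds[i] = predict(fit(X[keep], y[keep]), X[i:i + 1])[0]
    rmse = float(np.sqrt(np.mean((y - preds) ** 2)))
    r2 = float(1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2))
    return r2, rmse


# Synthetic data: three descriptors, a linear target and a little noise.
rng = np.random.default_rng(1)
X_loo = rng.normal(size=(40, 3))
y_loo = X_loo @ np.array([2.0, -1.0, 0.5]) + 300.0 + rng.normal(scale=0.1, size=40)

r2_loo, rmse_loo = leave_one_out(X_loo, y_loo)
print(round(r2_loo, 4), round(rmse_loo, 4))
```

Any fit/predict pair can be passed in place of the least-squares stand-in. A leave-one-out r2 close to the fitted r2, as on this slide, suggests the model is not overfitted.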


Free Energy of Solvation in Water

Data set
- Exp. ΔGs (kcal/mol)
- 291 small molecules
- 12 descriptors: PEOE_VSA (3), SlogP_VSA (7), SMR_VSA (2)

PCA regression
- r2 = 0.90, RMSE = 0.78
- Leave-one-out: r2 = 0.89, RMSE = 0.82
- Random leave-100-out: r2 = 0.88

Viswanadhan, V.N., Ghose, A.K., Singh, U.C., Wendoloski, J.J. Prediction of Solvation Free Energies of Small Organic Molecules: Additive-Constitutive Models Based on Molecular Fingerprints and Atomic Constants. J. Chem. Inf. Comput. Sci., 39, 405-412 (1999)

[Figure: scatter plot for the solvation free energy model; both axes run from -10 to 4]


Thermodynamic Solubility in Water

Data set
- Exp. logW at 25ºC
- 1,438 small molecules
- 32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)

PCA regression
- r2 = 0.75, RMSE = 2.4
- Leave-one-out: r2 = 0.74, RMSE = 2.5

[Figure: scatter plot for the water solubility model; both axes run from -15 to 20]

Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212. URL: http://www.syyres.com.


Vapor Pressure

Data set
- Exp. vapor pressure at 25ºC
- 1,771 small molecules
- 32 descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)

PCA regression
- r2 = 0.88, RMSE = 2.1
- Leave-one-out: r2 = 0.87, RMSE = 2.2

[Figure: scatter plot for the vapor pressure model; both axes run from -30 to 20]

Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212. URL: http://www.syyres.com.


Compound Classification with Binary QSAR

Can Binary QSAR separate inhibitor classes using SlogP_VSAk and SMR_VSAk descriptors?

Data: 455 compounds active against one of 7 targets

Results (classification accuracy)
Class 1:  98.7%  p=0.003  Serotonin receptor ligands
Class 2:  96.7%  p=0.043  Benzodiazepine receptor ligands
Class 3:  96.5%  p=0.290  Carbonic anhydrase II inhibitors
Class 4:  98.7%  p=0.001  Cyclooxygenase-2 (Cox-2) inhibitors
Class 5:  98.7%  p=0.014  H3 antagonists
Class 6:  98.7%  p=0.012  HIV protease inhibitors
Class 7:  99.1%  p=0.002  Tyrosine kinase inhibitors

Labute, P. Binary QSAR: A New Method for Quantitative Structure Activity Relationships. Proceedings of the 1999 Pacific Symposium; World Scientific Publishing, Singapore (1999)


Compound Classification with CART

Learning set for CART (recursive partitioning)
- 455 compounds active against one of 7 targets
- 1,942 “random” organic compounds
- SlogP_VSA, SMR_VSA descriptors

Classification accuracy (32 node tree, depth 5)
Class 1:  84.5%  p=0.07  Serotonin receptor ligands
Class 2:  49.1%  p=0.30  Benzodiazepine receptor ligands
Class 3:  92.5%  p=0.27  Carbonic anhydrase II inhibitors
Class 4:  96.8%  p=0.01  Cyclooxygenase-2 (Cox-2) inhibitors
Class 5:  82.7%  p=0.03  H3 antagonists
Class 6:  85.4%  p=0.02  HIV protease inhibitors
Class 7:  91.4%  p=0.01  Tyrosine kinase inhibitors

Xue, L., Godden, J., Gao, H., Bajorath, J. Identification of a Preferred Set of Molecular Descriptors for Compound Classification Based on Principal Component Analysis. J. Chem. Inf. Comput. Sci., 39, p699 (1999)


Thrombin, Trypsin and Factor Xa Activity

Data and descriptors
- Exp. pKi data for 72 analogs against Thrombin, Trypsin and Factor Xa
- Descriptors: subsets of SlogP_VSAk, SMR_VSAk, PEOE_VSAk
- Principal components regression

Thrombin (10 descr.): r2 = 0.65, RMSE = 0.61
Trypsin (9 descr.): r2 = 0.72, RMSE = 0.47
Factor Xa (15 descr.): r2 = 0.69, RMSE = 0.35

Bohm, M., Sturzebecher, J., Klebe, G. J. Med. Chem., 42, p458-477 (1999).

[Figure: a chemical structure drawing and three scatter plots, one each for Thrombin, Trypsin and Factor Xa; all axes run from 2 to 9]


Blood-Brain Barrier Permeability

Data set
- Exp. logBBB partition
- 75 molecules (charged)
- 14 descriptors: PEOE_VSA (3), SlogP_VSA (6), SMR_VSA (5)

PCA regression
- r2 = 0.83, RMSE = 0.32
- Leave-one-out: r2 = 0.73, RMSE = 0.43

Luco, J.M. Prediction of the Brain-Blood Distribution of a Large Set of Drugs from Structurally Derived Descriptors Using Partial Least Squares (PLS) Modeling. J. Chem. Inf. Comput. Sci., 39, 396-404 (1999)

[Figure: scatter plot for the logBBB model; both axes run from -2.5 to 1.5]


Advantages

Orthogonality
- SlogP_VSA, SMR_VSA and PEOE_VSA exhibit weak correlation
- Binary QSAR and Recursive Partitioning methodologies benefit
- Less reliance on Principal Components Analysis

Relevance
- SlogP_VSA, SMR_VSA and PEOE_VSA useful for QSAR/QSPR
- SlogP_VSA, SMR_VSA and PEOE_VSA used successfully in HTS QSAR
- Pharmacokinetic, “drug-like” and ADME properties modeled reasonably

Additivity
- VSA conversion of logP and MR to non-whole-molecule properties
- VSA descriptors are “group” additive (useful for combinatorial designs)
- Fundamental units are surface areas for all descriptors (Euclidean space)
- More continuous than simple atom/fragment counts


Focused Combinatorial Library Design


Combinatorial Library Design

Objective: select reagents for combinatorial synthesis to minimize iterations in the High Throughput Discovery Cycle

Combinatorial library: all combinations of R1, R2, R3 and R4 groups.

Select building blocks that bias the library towards drug-like actives

Use non-enumerative techniques to score building blocks in large virtual libraries
- 4 connection points, 1000 R-groups = 10^12 compounds
- Enumeration is impractical. Can’t even store the compounds!
- Use statistical sampling to score building blocks

Use Binary QSAR model as focusing agent

[Figure: scaffolds with R-group attachment points R1 to R4]


Building Block Scoring Methodology

Estimate the probability that a building block is in an active compound:

$$\Pr(X_i = r_{ij} \mid \text{active}) = \frac{\Pr(\text{active} \mid X_i = r_{ij})}{\sum_k \Pr(\text{active} \mid X_i = r_{ik})}$$

Here X_i is the R-group at position i and r_ij is the j-th candidate at that position; the Pr(active | ...) terms are estimated with the Binary QSAR model.

Use random sampling to estimate the terms in the formula, to avoid enumeration of the entire virtual library
- Randomly choose a central group and R-groups
- Construct the virtual product and calculate product descriptors

Focused library design using Binary QSAR model
- “Count” the number of times reagents appear in active compounds
- Binary QSAR model used to estimate activity of virtual products
- Select top-scoring reagents for pure combinatorial design


Building Block Scoring Algorithm

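Step 1: Select random building blocks

Step 2: Construct virtual product

Step 3: Calculate product descriptors; e.g., (2.1, 3.2, 4.3, 0, 5.3, 2.8, ...)

Step 4: Estimate probability of product activity with the Binary QSAR model; e.g., p = 0.63

Step 5: Add probability to building blocks; e.g., +0.63 to each building block in the product

Step 6: Output building block scores, normalized at each position:

$$a_{ij} \leftarrow a_{ij} \Big/ \sum_k a_{ik}$$

[Figure: flowchart of the six steps, illustrated with a scaffold bearing R1, R2 and R3 groups]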
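The sampling loop behind this algorithm can be sketched as follows. The function `p_active` is a hypothetical stand-in for "build the virtual product, calculate its descriptors and apply the Binary QSAR model"; the R-group names and the toy scoring rule are invented for illustration.

```python
import random


def score_building_blocks(r_groups, p_active, n_samples=2000, seed=0):
    """Monte Carlo building block scores, normalized per position.

    r_groups: one list of candidate R-groups per attachment position.
    p_active: callable mapping a tuple of chosen R-groups to Pr(active).
    """
    rng = random.Random(seed)
    scores = [[0.0] * len(groups) for groups in r_groups]
    for _ in range(n_samples):
        pick = [rng.randrange(len(groups)) for groups in r_groups]
        product = tuple(groups[j] for groups, j in zip(r_groups, pick))
        p = p_active(product)                  # model estimate for the product
        for i, j in enumerate(pick):
            scores[i][j] += p                  # credit every block used
    # Normalize: a_ij / sum_k a_ik at each position i.
    return [[a / sum(row) for a in row] for row in scores]


def toy_model(product):
    """Hypothetical activity model: benzyl at position 1 is favorable."""
    return 0.8 if product[0] == "benzyl" else 0.1


library = [["benzyl", "propyl"], ["H", "Cl", "Me"]]
block_scores = score_building_blocks(library, toy_model)
for groups, row in zip(library, block_scores):
    print({g: round(s, 3) for g, s in zip(groups, row)})
```

The cost depends on the number of samples rather than on the size of the virtual library, which is what makes the approach usable for libraries that cannot be enumerated.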


cGMP PDE V Data Set

263 compounds from literature
- 263 imidazopyrines, quinazolines, 1,3-bis(cyclopropylmethyl)xanthines
- Cyclic GMP phosphodiesterase (human type 5) inhibitors

1,534 random “inactives” added

IC50: 0.5 nM through 100+ μM

[Figure: scaffolds of the compound classes in the data set, with their R-group attachment points]


Binary QSAR Model for cGMP PDE V

Binary QSAR setup
- Descriptors: SlogP_VSA (10), SMR_VSA (8), PEOE_VSA (14)
- Threshold to simulate HTS data: active = -log IC50 > 0 (1 μM)
- Random 10% of data separated for validation set (9 active, 171 inactive)
- Remaining 90% of data used for training (52 active, 1568 inactive)

Results
- Training set accuracy: 69.2% on active, 98.4% on inactive (p=0.000227)
- Leave-one-out accuracy: 55.8% on active, 98.4% on inactive
- 10-fold block cross-validation gave similar results
- Validation set accuracy: 55.6% on active, 98.2% on inactive

Resulting statistically significant model is assumed to “understand” the original data set
- Individual predictions of activity are suspect but a collection of predictions is meaningful: e.g., estimate the number of actives in a library
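Accuracy is reported separately for actives and inactives because the classes are very unbalanced. The sketch below shows the bookkeeping. The hit counts used (5 of 9 actives, 168 of 171 inactives) are inferred from the quoted validation percentages, not stated in the source, and the probabilities in the second part are invented.

```python
def per_class_accuracy(y_true, y_pred):
    """Return (accuracy on actives, accuracy on inactives)."""
    hits = {0: 0, 1: 0}
    totals = {0: 0, 1: 0}
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return hits[1] / totals[1], hits[0] / totals[0]


# Validation set: 9 actives, 171 inactives (counts from the slide).
# Assumed outcome: 5 actives and 168 inactives predicted correctly.
y_true = [1] * 9 + [0] * 171
y_pred = [1] * 5 + [0] * 4 + [0] * 168 + [1] * 3
acc_active, acc_inactive = per_class_accuracy(y_true, y_pred)
print(f"{100 * acc_active:.1f}% on active, {100 * acc_inactive:.1f}% on inactive")


def expected_actives(probabilities):
    """Expected number of actives in a library: the sum of Pr(active)."""
    return sum(probabilities)


# Hypothetical model outputs for a four-compound library.
n_expected = expected_actives([0.9, 0.2, 0.1, 0.05])
print(round(n_expected, 2))
```

Summing probabilities over a library is one way to use a collection of predictions even when each individual prediction is unreliable.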


Quinazoline Virtual Library Definition

Quinazoline scaffold

27 x 12 x 10 x 2 = 6,480 products in virtual library

R1 candidates: CH2-(2-thienyl), benzyl, CH2-benzyl, phenyl, CH2-(3-pyridyl), CH2-(2-furanyl), 2-pyridyl, 3-(5-Me-isoxazolyl), 2-ClPh, 3-ClPh, 4-ClPh, 3-pyridyl, 1-pyrrolyl, propyl, 3-CH3OPh, CH2CH2-2(3-Me-pyrrolyl), 3-NO2Ph, H, CH2(CH2)4OH, CH2(cPr), 4-(CO2Me)Ph, c-pentyl, c-hexyl, CH2-(2-THF), CH2(CH2)2OCOCH2CH3

R2 candidates: 1-imidazolyl, 2-pyridyl, 3-pyridyl, H, 4-pyridyl, 2-thienyl, 2-furyl, Cl, 4-morpholine, c-hexyl, 4-Me-1-piperazinyl, styrenyl

R3 candidates: H, CH3, F, Cl, Br, CH2OH, CH2SH, CH3SO2, NO2, HCC

R4 candidates: H, Me

[Figure: quinazoline scaffold with attachment points R1, R2, R3 and R4]


Quinazoline Library R-Group Scores

  R1                                      R2                          R3              R4
* 0.121 CH2-(2-thienyl)                 * 0.191 1-imidazolyl        * 0.197 Me      * 0.960 H
* 0.120 benzyl                          * 0.160 2-pyridyl           * 0.192 H         0.040 Me
* 0.109 CH2-benzyl                      * 0.145 3-pyridyl           * 0.142 Cl
* 0.083 phenyl                            0.127 H                     0.120 F
* 0.081 CH2-(3-pyridyl)                   0.111 4-pyridyl             0.112 CH3O
  0.062 CH2-(2-furanyl)                   0.099 2-thienyl             0.093 Br
  0.061 2-pyridyl                         0.076 2-furyl               0.065 CH3S
  0.041 3-(5-Me-isoxazolyl)               0.072 Cl                    0.042 CH3SO2
  0.038 2-ClPh                            0.016 4-morpholine          0.024 NO2
  0.037 3-ClPh                            0.003 c-hexyl               0.013 NCC
  0.036 4-ClPh                            0.001 4-Me-1-piperazinyl
  0.034 3-pyridyl
  0.032 1-pyrrolyl
  0.023 3-CH3OPh
  0.015 CH2CH2-2(3-Me-pyrrolyl)
  0.015 3-NO2Ph
  0.011 H
  0.008 CH2(CH2)4OH
  0.007 CH2(cPr)
  0.007 4-(CO2Me)Ph
  0.006 CH2-(c-hexyl)
  0.005 3-(CO2Me)Ph, c-pentyl, c-hexyl
  0.004 CH2-(2-THF)
  0.003 CH2(CH2)2OCOCH2CH3

Use median cutoff at each position (selected R-groups marked with *; shown in blue on the original slide)

Retain only high-scoring R-groups that account for 50% of the probability
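The cutoff rule can be sketched as a cumulative sum over the ranked scores at each position. The scores below are transcribed from the R-group score table; the function name and the rounding tolerance are assumptions of this example. The tolerance matters: the top three R2 scores sum to 0.496, so a strict 0.5 threshold would admit a fourth group, whereas the slides report a focused library of 3 x 3 x 5 = 45 compounds.

```python
from math import prod

# Scores from the R-group score table, highest first. Only the top six R1
# entries are listed; the lower-scoring R1 groups cannot affect the selection.
SCORES = {
    "R1": [("CH2-(2-thienyl)", 0.121), ("benzyl", 0.120), ("CH2-benzyl", 0.109),
           ("phenyl", 0.083), ("CH2-(3-pyridyl)", 0.081), ("CH2-(2-furanyl)", 0.062)],
    "R2": [("1-imidazolyl", 0.191), ("2-pyridyl", 0.160), ("3-pyridyl", 0.145),
           ("H", 0.127), ("4-pyridyl", 0.111), ("2-thienyl", 0.099),
           ("2-furyl", 0.076), ("Cl", 0.072), ("4-morpholine", 0.016),
           ("c-hexyl", 0.003), ("4-Me-1-piperazinyl", 0.001)],
    "R3": [("Me", 0.197), ("H", 0.192), ("Cl", 0.142), ("F", 0.120),
           ("CH3O", 0.112), ("Br", 0.093), ("CH3S", 0.065), ("CH3SO2", 0.042),
           ("NO2", 0.024), ("NCC", 0.013)],
    "R4": [("H", 0.960), ("Me", 0.040)],
}


def select_top(groups, cutoff=0.5, tol=0.005):
    """Keep the highest-scoring groups until they cover `cutoff` of the probability.

    `tol` is an assumed allowance for the three-decimal rounding of the scores.
    """
    ranked = sorted(groups, key=lambda g: g[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, score in ranked:
        if cumulative >= cutoff - tol:
            break
        selected.append(name)
        cumulative += score
    return selected


selection = {position: select_top(groups) for position, groups in SCORES.items()}
focused_size = prod(len(names) for names in selection.values())
fraction = focused_size / (27 * 12 * 10 * 2)  # share of the 6,480-product library
print(selection, focused_size, f"{fraction:.1%}")
```

Because the selection works on per-position scores, the 6,480 products never have to be enumerated to arrive at the focused set.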

Resulting Focused Quinazoline Library

3 x 3 x 5 = 45 compounds (0.7% of original library)

Focused quinazoline scaffold, with R4 = H (drawn as HN)

R1: CH2-(2-thienyl), benzyl, CH2-benzyl, phenyl, CH2-(3-pyridyl)

R2: 1-imidazolyl, 2-pyridyl, 3-pyridyl

R3: H, Me, Cl

SAR for R1
  -CH2- spacer in top scoring groups agrees with experiment
  Benzyl group agrees with experiment

SAR for R2
  1-imidazolyl as top scoring group agrees with experiment
  2-, 3-pyridyl groups agree with experiment

SAR for R3
  Preference for small hydrophobic groups agrees with experiment

Summary and Outlook

Complete Methodology for HTDD

Binary QSAR: probabilistic QSAR
  QSAR models from HTS data (directly) and ADME data

VSA Descriptors: information-rich, low-dimensional space
  Applicable for activity models and ADME models

Reagent Scoring: probabilistic scoring
  Non-enumerative technique that uses Binary QSAR models to focus

[Workflow diagram; box labels: HTS Data; Drugability Data (e.g., BBB or drug-like); Binary QSAR (two boxes); Activity Model; ADME Model; Library Design; Combinatorial Library; BioAssay; Design; Model]
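Reagent Scoring is described above as a non-enumerative, probabilistic technique built on Binary QSAR models. The sketch below shows one way such scoring could work by random sampling: draw products, discard those that fail a filter, and credit each R-group with the model probability of the products it appears in. This is an illustration under stated assumptions, not the method as implemented in MOE. The function name, the toy probability model, the toy filter, and the two-position library are all hypothetical.

```python
import random
from collections import defaultdict


def score_reagents(positions, activity_prob, passes_filter, n_samples=20000, seed=0):
    """Estimate per-position R-group scores without enumerating every product."""
    rng = random.Random(seed)
    totals = [defaultdict(float) for _ in positions]
    for _ in range(n_samples):
        # Sample: pick one R-group at random for every position.
        product = tuple(rng.choice(groups) for groups in positions)
        # Reject: drop products that fail the (arbitrary) filter.
        if not passes_filter(product):
            continue
        # Estimate: credit each R-group with the product's activity probability.
        p = activity_prob(product)
        for position_totals, group in zip(totals, product):
            position_totals[group] += p
    scores = []
    for position_totals, groups in zip(totals, positions):
        norm = sum(position_totals.values())
        scores.append({g: position_totals[g] / norm if norm else 0.0 for g in groups})
    return scores


# Hypothetical two-position library with toy stand-ins for the model and filter.
POSITIONS = [["a", "b"], ["x", "y", "z"]]
WEIGHTS = {"a": 0.8, "b": 0.2, "x": 0.5, "y": 0.3, "z": 0.2}


def toy_activity_prob(product):
    """Stand-in for a Binary QSAR model: returns a made-up probability."""
    return WEIGHTS[product[0]] * WEIGHTS[product[1]]


def toy_filter(product):
    """Stand-in for an arbitrary filter (e.g., cost or availability)."""
    return product != ("b", "z")


reagent_scores = score_reagents(POSITIONS, toy_activity_prob, toy_filter)
print(reagent_scores)
```

The scores at each position are normalized to sum to 1, so they can be read as the share of predicted activity attributable to each R-group. The cost depends on the number of samples rather than on the size of the virtual library.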

Important Ideas

Binary QSAR
  Shift away from regression-based techniques: more robust to errors
  Predictions are “soft”: best suited to collection-based predictions
  Probability models can be combined (e.g., ADME & Potency)

VSA Descriptors
  Orthogonal, group-additive descriptors reminiscent of Hansch & Leo
  Wide applicability gives rise to meaningful chemistry space
  Less reliance on variable selection procedures (fewer false correlations)

Reagent Scoring
  Non-enumerative techniques can handle huge virtual libraries
  Resulting score complements other criteria (cost, availability, etc.)
  Sample + Reject + Estimate procedure can incorporate arbitrary filters

Statistically significant model = “understanding” of data
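The point that probability models can be combined (e.g., ADME & Potency) can be illustrated by multiplying two model outputs. This sketch assumes the drug-likeness probability can be treated as conditional on (or independent of) activity, which is an assumption of the example. The function name, compound ids, and probability values are hypothetical.

```python
def combined_probability(p_active, p_druglike):
    """Probability that a compound is both active and drug-like.

    Assumes p_druglike is valid given activity (or independent of it), so the
    joint probability is the product of the two model outputs.
    """
    return p_active * p_druglike


# Hypothetical model outputs: (compound id, Pr(active), Pr(drug-like)).
predictions = [
    ("cmpd-1", 0.80, 0.30),
    ("cmpd-2", 0.60, 0.70),
    ("cmpd-3", 0.20, 0.90),
]

# Rank compounds by the combined probability, best first.
ranked = sorted(
    predictions,
    key=lambda row: combined_probability(row[1], row[2]),
    reverse=True,
)
for name, p_active, p_druglike in ranked:
    print(name, round(combined_probability(p_active, p_druglike), 3))
```

In this toy data the most potent compound does not rank first once drug-likeness is factored in, which is the practical reason for combining the two models.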

Future Work and Availability

Generation II VSA descriptors
  Improve fundamental VSA approximation: can do better than 10% error
  Handle “protonics”: average over tautomeric and protonation states
  Direct polarizability model might replace logP + MR

Extend probability decomposition to include receptor (T)

$$\Pr(Y, D, S, T) = \Pr(D \mid Y, S, T)\,\Pr(Y \mid S, T)\,\Pr(T, S)$$

  Pr(D | Y, S, T): Drugable for T or T’s type
  Pr(Y | S, T): Probabilistic QSAR
  Pr(T, S): Bioinformatics?

All methodology available in MOE version 2000.02