SAR/QSAR Modeling: State of the Art · 2013-08-31 · QSAR Modeling Workflow: the importance of...

Page 1:

SAR/QSAR Modeling: State of the Art

Alexander Tropsha

Laboratory for Molecular Modeling

UNC Eshelman School of Pharmacy

UNC-Chapel Hill

Page 2:

The expectations of our field are high

The newly-appointed President-Elect of the Royal Society of Chemistry today forecast the impact of advances in modelling and computational informatics on chemistry

Page 3:

The growth of QSAR modeling is coincidental with the growth of chemical data

“Chemical databases” vs “QSAR Modeling”

Page 4:

OUTLINE

1. METHODS

Predictive QSAR Modeling Workflow

-Chemical and biological data curation

-Applicability domains

-I’ve got the data: should I model it? (modelability index)

2. APPLICATIONS

-Cheminformatics approaches to structure-based virtual screening

-Chemocentric Informatics: integration of chemical and biological data streams and diverse data sources

-Novel approaches to model building and interpretation: Chemical-Biological Read Across

Conclusions: models are tools for testable hypothesis generation; focus on accurate, experimentally confirmed predictions

Page 5:

The chief utility of QSAR models: identification of novel hits in external libraries

[Diagram: CHEMICAL STRUCTURES and CHEMICAL DESCRIPTORS are related to PROPERTY/ACTIVITY by PREDICTIVE QSAR MODELS ("QSAR MAGIC"); the models drive VIRTUAL SCREENING of a CHEMICAL DATABASE (~10^6 – 10^9 molecules), yielding HITS (confirmed actives) and INACTIVES (confirmed inactives)]

Page 6:

Quantitative Structure Activity Relationships

[Figure: table of COMPOUNDS (structure drawings) with DESCRIPTORS and ACTIVITY columns; values shown: 0.613, 0.380, -0.222, 0.708, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005]

Thousands of molecular descriptors are available for organic compounds: constitutional, topological, structural, quantum mechanics based, fragmental, steric, pharmacophoric, geometrical, thermodynamical, conformational, etc.

- Building of models using machine learning methods (NN, SVM, etc.);

- Validation of models according to numerous statistical procedures, and their applicability domains.

Credit: Denis Fourches

Quantitative Structure Activity Relationships

[Figure: pattern matrix of n molecules x p descriptors, paired with an ACTIVITY column (e.g. toxicity, binding affinity)]

Page 7:

Data dependency and data quality are critical issues in QSAR modeling

Florian Prinz, Thomas Schlange and Khusru Asadullah. Nature Rev. Drug Disc. Sep 2011

Page 8:

QSAR modeling with non-curated datasets

[Figure: table of compound structures with DESCRIPTORS and a column of values: 0.613, 0.380, -0.222, 7.08, 1.146, 0.491, 0.301, 0.141, 0.956, 0.256, 0.799, 1.195, 1.005; the structure drawings include a sodium salt (O–, Na+)]

Presence of SALTS

Presence of MIXTURES

Presence of ERRONEOUS AND/OR WRONG STRUCTURES

Presence of DUPLICATES

Presence of MISPRINTS AND WRONG NAMES

Etc.

ERRORS in the calculation of DESCRIPTORS
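A minimal sketch of how two of the problems listed on this slide could be flagged automatically, assuming each record is a (SMILES, activity) pair. It relies only on the SMILES convention that "." separates disconnected components, so it flags salts and mixtures without telling them apart, and the duplicate check compares raw strings, so it misses the same structure written in two different ways. The record layout, the example values, and the name `flag_records` are hypothetical.

```python
from collections import defaultdict

def flag_records(records):
    """Flag multi-component entries and exact-string duplicates.

    records: list of (smiles, activity) tuples (hypothetical layout).
    Returns (multi_component, duplicates); duplicates maps a SMILES
    string to every activity value reported for it.
    """
    multi_component = []
    seen = defaultdict(list)
    for smiles, activity in records:
        # In SMILES, "." separates disconnected fragments (salts, mixtures).
        if "." in smiles:
            multi_component.append(smiles)
        seen[smiles].append(activity)
    duplicates = {s: acts for s, acts in seen.items() if len(acts) > 1}
    return multi_component, duplicates

# Illustrative records only; the values are made up.
records = [
    ("CC(=O)[O-].[Na+]", 0.613),   # sodium acetate: a salt
    ("c1ccccc1O", 0.380),
    ("c1ccccc1O", 7.08),           # same string, very different value
    ("CCO", 0.491),
]
multi, dups = flag_records(records)
print(multi)
print(dups)
```

A real curation pipeline would standardize structures with a cheminformatics toolkit before comparing them; this sketch only shows where the checks sit.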

Page 9:

[Workflow figure; legible labels: INITIAL LIST OF SMILES/STRUCTURES (2D representation); difficult cases]

Fourches, Muratov, Tropsha. Trust but verify. JCIM, 2010, 50:1189-204.

Page 10:

QSAR modeling of nitro-aromatic toxicants

- Five different representations of nitro groups.

- Case Study 1: 28 compounds tested in rats, log(LD50), mmol/kg.

- Case Study 2: 95 compounds tested against Tetrahymena pyriformis, log(IGC50), mmol/ml.

- Case Study 1: after the normalization of nitro groups, R2ext increased from ~0.45 to ~0.9.

- Case Study 2: after the normalization of nitro groups, R2ext increased from ~0 to ~0.5.

Artemenko, Muratov et al. J. SAR QSAR 2011, 22 (5-6), 1-27.

Even small differences in structure representation can lead to significant errors in prediction accuracy of models

Data curation affects the accuracy (up or down!) of QSAR models

Page 11:

Experimental Data quality: Comparison of the ToxCAST (Phase I) in vitro Assay Results for Duplicates

Compounds                          Total   ACEA   ATG    BSK    Cellumen   NVS    CellzDirect
(number of assays)                 500     7      81     87     33         239    48
3-Iodo-2-propynylbutylcarbamate    0.71    0.73   0.18   0.53   0.49       0.89   0.15
Bensulide                          0.64    0.09   0.71   0.4    0.69       0.95   0.04
Chlorsulfuron                      0.24    N/A    N/A    0.4    N/A        N/A    -0.1
Dibutyl phthalate                  0.55    N/A    0.62   0.51   0.7        0.81   -0.1
Diclofop-methyl                    0.36    1      0.89   0.15   N/A        -0     -0.1
EPTC                               0.13    N/A    N/A    -0.1   N/A        N/A    0.33
Fenoxaprop-ethyl                   0.47    N/A    0.56   0.59   0.31       0.35   0.01
Prosulfuron                        0.55    N/A    0.68   0.08   N/A        1      0.4

Page 12:

ChEMBL Statistics

• Used ChEMBL 14 – released 18 July 2012
  – 1,384,479 compound records
  – 1,213,242 distinct compounds
  – 644,734 assays
  – 10,129,256 bioactivities
  – 9,003 targets
  – 46,133 documents

• Primarily covers MedChem Literature

• Adds annotations for target data

• Successor to StARlite commercial database

Page 13:

Manual Curation of ChEMBL (following several automated steps)

• Input: 190,068 compound-target measures in pairs of papers
  – Used values as published in ChEMBL
  – Converted to standardized pKi values
  – Semi-automated (based on units and type of value reported)

• 23,956 failed to be automatically converted
  – Mostly Log Ki or –Log Ki values, but also others
  – Manually examined papers representing ~70% and hand converted affinity value, except when data was being recycled/recited

• Final: 178,317 total replicate pairs of values
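The standardization step can be sketched as follows, assuming the usual definition pKi = -log10(Ki in mol/L). The unit table and the function name `to_pki` are illustrative assumptions, not the curators' actual code. Records whose unit is not recognized are returned as `None` so they can be routed to manual review, mirroring the "failed to be automatically converted" category on this slide.

```python
import math

# Conversion factors to mol/L for a few common concentration units
# (an illustrative, incomplete table).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}

def to_pki(value, unit):
    """Convert a Ki value with a unit to pKi = -log10(Ki in mol/L).

    Returns None when the unit is unknown or the value is not positive,
    so the record can be examined by hand instead of being mis-converted.
    """
    factor = UNIT_TO_MOLAR.get(unit)
    if factor is None or value <= 0:
        return None
    return -math.log10(value * factor)

print(to_pki(1.0, "nM"))
print(to_pki(5.0, "ug/mL"))   # unknown unit: needs manual review
```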

Page 14:

Frequency distribution plot for differences in pKi values (>1%) for duplicates

Page 15:

A Recurrent Pattern

Note the peaks at integral pKi differences
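One plausible reading of these peaks is that a difference of a whole number of log units corresponds to a power-of-ten unit mix-up (for example, nM recorded as µM shifts pKi by 3). A small sketch of a check for that pattern is given below; the tolerance and the function name `suspicious_unit_shift` are assumptions made for illustration, and the example pairs are made up.

```python
def suspicious_unit_shift(pki_a, pki_b, tol=0.05):
    """Return the integer shift if two replicate pKi values differ by
    (nearly) a non-zero whole number of log units, else None."""
    diff = pki_a - pki_b
    nearest = round(diff)
    if nearest != 0 and abs(diff - nearest) <= tol:
        return nearest
    return None

# Made-up replicate pairs for illustration.
pairs = [(9.45, 6.45), (7.10, 7.32), (8.00, 6.98)]
for a, b in pairs:
    print(a, b, suspicious_unit_shift(a, b))
```

A flagged pair is only a candidate for review: two independent measurements can differ by about one log unit for legitimate reasons.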

Page 16:

Non-standard Units Used

Page 17:

Non-Ki measures given as Ki

These numbers made it into ChEMBL, too.

Page 18:

Ignorance of Biological Complexity

α2a? α2b? α2c?

Target   Doc_ID   Src_Key   Assay_ID   Activity_ID   Std_Type   Std_Value
α2a      10218    8b        32635      359172        pKi        9.45
α2b      10218    8b        32635      359172        pKi        9.45
α2c      10218    8b        32635      359172        pKi        9.45

Page 19:

No Units at All

Page 20:

No Citation For Data Sources

Page 21:

Summary of published data quality analysis

• A lot of the replicates in the literature aren’t actually independent determinations

• Many errors come from careless specification or interconversion of units

• 91% of the data are single reported measurements

• Modeling studies often are not explicitly identified as such

• ChEMBL 15 and going forward have started to address these issues

• These observations suggest new challenges to employ cheminformatics approaches for biological data curation

Page 22:

Cheminformatics Analysis of qHTS data: over 17,000 compounds screened against five major CYP isozymes using In Vitro bioluminescent qHTS assay

J. Chem. Inf. Model., 2011,

Nature Biotechnology, 2009,

Page 23:

Duplicate analysis

• Carried out by ISIDA/Duplicates program

• 1,280 duplicate couples were found
  – 406 had a complete matching profile
  – 874 had profile differences
  – A total of 1,535 discrepancies were found in the 874 duplicate couples

CYP annotation:       CYP2C19   CYP2D6   CYP3A4   CYP1A2   CYP2C9
# of discrepancies    170       422      426      363      154

PROBLEM: CYP bioprofiles for some duplicates are dramatically different. Need biological curation!

Page 24:

Neighborhood analysis helps to choose correct value

Case Study: structural duplicates found in NCGC CYP450 qHTS data

Duplicate records for Tocris-0740 (CID_6603937):

2C19   2D6    3A4    1A2    2C9    Supplier        SID        CID
-4.5   -6.2   -4.6   -4.4   -4.6   Tocris          11113673   CID_6603937
-5     -5.6   -8     INA    -4.4   Sigma Aldrich   11111504   CID_6603937

5 Nearest neighbors:

2C19   2D6    3A4    1A2    2C9    Supplier        SID        Tanimoto Similarity   CID
-5.5   INA    -4.5   INA    INA    Tocris          11114071   0.98                  6604862
INA    INA    -5.1   INA    INA    Sigma Aldrich   11112029   0.98                  6604106
INA    INA    INA    INA    INA    Tocris          11114012   0.98                  6604846
INA    -5.9   -4.8   INA    INA    Sigma Aldrich   11112054   0.95                  6604136
INA    -4.5   -4.7   -4.4   INA    Tocris          11113764   0.95                  6604137

[Figure labels: structure drawings of the five neighbors (6604862, 6604106, 6604846, 6604136, 6604137); Cytochrome P450; FALSE-POSITIVE]
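A hedged sketch of the idea, not the authors' procedure: when two records of the same compound disagree, prefer the value that lies closer to what was measured for its nearest structural neighbors. Using the median of the numeric neighbor values and ignoring inactive ("INA") neighbors are simplifying assumptions of this example, as is the function name `pick_by_neighbors`.

```python
from statistics import median

def pick_by_neighbors(candidates, neighbor_values):
    """Pick the duplicate measurement closest to the median of the
    values measured for structurally similar compounds.

    Inactive ("INA") neighbors are skipped here for simplicity.
    """
    numeric = [v for v in neighbor_values if v != "INA"]
    if not numeric:
        return None  # no numeric evidence from the neighborhood
    center = median(numeric)
    return min(candidates, key=lambda v: abs(v - center))

# CYP3A4 values read from this slide: the two records of CID_6603937
# and the values listed for its five nearest neighbors.
duplicates_3a4 = [-4.6, -8.0]
neighbors_3a4 = [-4.5, -5.1, "INA", -4.8, -4.7]
print(pick_by_neighbors(duplicates_3a4, neighbors_3a4))
```

In this example the neighborhood supports the weaker value and makes the -8 record the suspect one; a fuller treatment would also weight neighbors by similarity and count inactive neighbors as evidence.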

Page 25:

Published guidance on model development and validation: The OECD Principles

To facilitate the consideration of a QSAR model for regulatory purposes, it should be associated with the following information:

-a defined endpoint

-an unambiguous algorithm;

-a defined domain of applicability

-appropriate measures of goodness-of-fit, robustness and predictivity

-a mechanistic interpretation, if possible;

Should be added: data used for modeling should be carefully curated

Page 26:

Dearden JC et al., 2009, SAR and QSAR in Environmental Research, Vol. 20, Nos. 3–4, April–June 2009, 241–266.

Published guidance on model development and validation: J. Dearden’s 21 “how not to do QSAR” principles

Page 27:

Beware of poor published models: Bioorganic & Medicinal Chemistry (2007)

DATA AND METHODS:

• Genetic Algorithm driven descriptor selection + MLR for the entire dataset

• Two methods of dataset splitting:
  – SOM to split a dataset into DIVERSE Training (64) and Test (21) sets
  – Activity-based to split into Training (64) and Test (21) sets

• Modeling for the SOM-training set using MLR with Genetic Algorithm

STATISTICAL CLAIMS:

For SOM-generated split:
• R2 (train)=0.836, Q2 LOO=0.793
• R2 (predict)=0.730

For activity-based split:
• R2 (train)=0.813, Q2 LOO=0.766
• R2 (predict)=0.808

Compliance with OECD principles: “In addition, the proposed models took into full account fundamental points required by the OECD principles for QSAR models…”
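For readers unfamiliar with the three statistics quoted on this slide, the sketch below shows how R2 (train), Q2 LOO, and R2 (predict) are commonly computed for an MLR model. It uses synthetic data with the same 64/21 split sizes. The data, the helper names, and the choice of the training-set mean as the reference in R2 (predict) are assumptions; several definitions of external R2 are in use, and the paper's exact formula is not given here.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares MLR with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply fitted coefficients (intercept first) to a descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def r2(y_obs, y_pred, y_ref_mean):
    """1 - SS_res / SS_tot, with SS_tot taken around y_ref_mean."""
    ss_res = np.sum((y_obs - y_pred) ** 2)
    ss_tot = np.sum((y_obs - y_ref_mean) ** 2)
    return 1.0 - ss_res / ss_tot

def q2_loo(X, y):
    """Leave-one-out cross-validated R2 on the training set."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        preds[i] = predict_mlr(fit_mlr(X[keep], y[keep]), X[i:i + 1])[0]
    return r2(y, preds, y.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(85, 3))                       # synthetic descriptors
y = X @ np.array([1.0, -0.5, 0.2]) + rng.normal(scale=0.3, size=85)
X_train, y_train, X_test, y_test = X[:64], y[:64], X[64:], y[64:]

coef = fit_mlr(X_train, y_train)
r2_train = r2(y_train, predict_mlr(coef, X_train), y_train.mean())
q2 = q2_loo(X_train, y_train)
r2_pred = r2(y_test, predict_mlr(coef, X_test), y_train.mean())
print(r2_train, q2, r2_pred)
```

Note that none of these numbers is an external validation if the descriptors were selected using the entire dataset, because the test compounds have then already influenced the model.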

Page 28:

True external prediction: Bioorganic & Medicinal Chemistry (2007)

Our attempt to predict activity for new compounds reported in ChEMBL after 2007 using the published model (MLR: all selected descriptors and weights were reported)

Page 29:

Problems with the model: Bioorganic & Medicinal Chemistry (2007)

[Plot: Predicted pIC50 vs. Experimental pIC50]

• Training and test set are selected by SOM: both are diverse and highly similar to one another

• The same descriptors are used to develop model for the training set using activity-based split

Page 30:

Why Model Redevelopment and Application Failed?

Bioorganic & Medicinal Chemistry (2007)

• The model:
  1. built only for training set SOM-based split
  2. validated using the corresponding prediction set

• Model re-development using activity-based split for training and test sets:
  1. the same model and compounds were used
  2. the compounds were just rearranged between training and test sets

• The external model validation: missing

• Conclusion: despite claims in the paper, external predictive power of the model was not established; model is clearly overfitted.

Page 31:

Dataset Modelability: does it make sense to model any SAR data?

Example: Poor structure – in vivo or in vitro-in vivo correlations for Toxcast data*

[Figure panels: Chemical Structure; In Vitro Assays]

*Toxicol Sci. 2012 Aug;128(2):398-417.

Page 32:

The Concept of Modelability

• We often fail to build a predictive QSAR model. However, it may be possible to evaluate modelability of the dataset prior to QSAR study.

• MODI-index: Balanced accuracy (BA) of a kNN model with K=1 (the activity class of each compound is predicted to be the same as that of its nearest chemical neighbor)

CONFUSION MATRIX

PREDICTED   OBSERVED CLASS 0   OBSERVED CLASS 1   TOTAL
CLASS 0     N00                N10                N.0
CLASS 1     N01                N11                N.1
TOTAL       N0.=N0             N1.=N1             N..=N

SE = N00/N0

SP = N11/N1

BA = ½ (SE + SP)
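A minimal sketch of the MODI calculation as described on this slide: each compound is assigned the class of its nearest neighbor (excluding itself), and the balanced accuracy of those assignments is reported. Euclidean distance in descriptor space and the function name `modi` are assumptions of this example; the toy data are made up.

```python
import numpy as np

def modi(X, labels):
    """Balanced accuracy of leave-one-out 1-nearest-neighbor prediction.

    X: (n, p) descriptor matrix; labels: length-n array of 0/1 classes.
    Euclidean distance is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared distances; the diagonal is masked so that a
    # compound cannot be its own nearest neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    predicted = labels[d2.argmin(axis=1)]
    # Fraction of each observed class that is predicted correctly.
    se = np.mean(predicted[labels == 0] == 0)   # N00 / N0
    sp = np.mean(predicted[labels == 1] == 1)   # N11 / N1
    return 0.5 * (se + sp)

# Toy one-descriptor dataset: one class-1 compound sits among class-0
# compounds and degrades the index.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [0.15]]
y = [0, 0, 0, 1, 1, 1]
print(modi(X, y))
```

The full pairwise distance matrix is fine for a few thousand compounds; larger sets would need a neighbor-search structure instead.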

Page 33:

Prediction of Dataset Modelability

44 diverse datasets

60 ToxCast datasets

Page 34:

QSAR Modeling Workflow: the importance of rigorous validation

courtesy of L. Zhang

[Workflow diagram; labels in order:
Data curation
Datasets
5-fold External Validation (folds 1 to 5): Modeling set / External set
Combi-QSAR modeling: Modeling methods (K-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM)) x Descriptors (Dragon, MOE)
Internal validation, Model selection
An ensemble of QSAR Models
Evaluation of external performance
Virtual screening (with AD threshold)
Experimental confirmation]

Tropsha, A. Best Practices for QSAR Model Development, Validation, and Exploitation. Mol. Inf., 2010, 29, 476 – 488

Fully implemented on CHEMBENCH.MML.UNC.EDU
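The outer loop of this workflow can be sketched as below, assuming a plain random 5-fold split: every compound serves as external data exactly once, and within each fold every combination of modeling method and descriptor set is modeled on the modeling set only. The function name `external_folds`, the random split, and the empty model-building placeholder are assumptions; the sketch does not reproduce the Chembench implementation.

```python
import random
from itertools import product

def external_folds(n_compounds, n_folds=5, seed=0):
    """Yield (modeling_idx, external_idx) pairs; each compound lands in
    the external set exactly once."""
    idx = list(range(n_compounds))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        external = folds[k]
        modeling = [i for j, f in enumerate(folds) if j != k for i in f]
        yield modeling, external

METHODS = ["kNN", "RF", "SVM"]      # modeling methods named on the slide
DESCRIPTORS = ["Dragon", "MOE"]     # descriptor sets named on the slide

for fold, (modeling, external) in enumerate(external_folds(20), start=1):
    for method, descriptors in product(METHODS, DESCRIPTORS):
        # Placeholder: build and internally validate a model on `modeling`
        # only; `external` is used once, for the final evaluation.
        pass
    print(fold, len(modeling), len(external))
```

Keeping descriptor selection and model selection inside the inner loop is what makes the external set a fair estimate of predictive power.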

Page 35:

Recent examples of theoretically and experimentally validated QSAR-based predictions: Experiment-Assisted Computational Drug Discovery!!

• Anticonvulsants: Shen, M. et al, J. Med. Chem. 2004, 47, 2356-2364.

• HIV-1 reverse transcriptase inhibitors: Medina-Franco, J., et al, J. Comput. Aided. Mol. Des., 2005, 19, 229–242

• D1 receptor antagonists: Oloff et al, J. Med. Chem., 2005, 48, 7322-32

• AmpC inhibitors: Hsieh, J.-H.. et al, J. Comp. Aid. Molec. Des., 2008, 22(9):593-609

• HDAC inhibitors: Wang, S. et al, (JCIM, 2009, 49, 461-76)

• GGT-I inhibitors: Wang, Peterson, et al (JMC, 2009, 52(14):4210-20; provisional patent)

• 5HT6 binders: Hajjo et al, JMC, 2012, 55(12):5704-19

• Antimalarial: Zhang et al. J Chem Inf Model. 2013, 53(2):475-92

• Polypharmacological compounds: Besnard et al, Nature. 2012;492(7428):215-20

Page 36:

OUTLINE

1. METHODS

Predictive QSAR Modeling Workflow

-Chemical and biological data curation

-Applicability domains

-I’ve got the data: should I model it? (modelability index)

2. APPLICATIONS

-Cheminformatics approaches to structure-based virtual screening

-Chemocentric Informatics: integration of chemical and biological data streams and diverse data sources

-Novel approaches to model building and interpretation: Chemical-Biological Read Across

Conclusions: models are tools for testable hypothesis generation; focus on accurate, experimentally confirmed predictions

Page 37:

Bridging ligand- and structure-based drug Design: CSAR 2012 (Carlson et al, JCIM, ASAP)

Any public data and any computational method can be used.

1. Pose Prediction

2. Ligand Ranking

[Table with columns Reference Structure, Target Name, Number of ligands; the row alignment is not fully recoverable from the transcript.
Target Name: Urokinase (UK); Checkpoint kinase (CHK1); Pseudomonas aeruginosa deacetylase (LpxC); Extracellular signal-regulated kinase (ERK2)
Number of ligands, as extracted: 16 ligands; 47 ligands; 16 ligands (LpxC); 39 ligands
Reference Structure, as extracted: 3i5z, Ki = 2 nM; 1owe, Ki = 631 nM; 2e9n, Ki = 6.2 nM; 3p3e]

Page 38:

Study Design

[Diagram: TARGET and CSAR Ligands feed two branches. STRUCTURE-BASED: Medusa docking (“traditional” MM/MD based). LIGAND-BASED: QSAR models built with available data outside of CSAR. Each branch gives a LIGAND RANKING; the rankings are combined into CONSENSUS PREDICTIONS]

Page 39:

QSAR model accuracy in comparison with docking/scoring based models (all other groups)

Ranking Active Ligands

Spearman (ρ) and number of ligands ranked:

Our Group                    UK ρ    Ligands   CHK1 ρ   Ligands   ERK2 ρ   Ligands
D-1 QSAR no_AD               0.558   15        0.319    29        0.575    38
D-2 QSAR AD                  0.576   15        0.320    29        0.584    38
D-3 QSAR AD+SIM              0.529   15        0.330    29        0.589    38
D-4 QSAR no_AD + MEDUSA      0.673   15        0.182    29        0.492    38
D-5 QSAR AD + MEDUSA         0.615   15        0.171    29        0.490    38
D-6 QSAR AD+SIM + MEDUSA     0.579   15        0.194    29        0.477    38

Summary statistics for all groups (Spearman (ρ)):

           UK       CHK1     ERK2
Min        -0.282   -0.334   -0.077
25%tile    0.396    -0.141   0.183
Median     0.522    0.008    0.397
75%tile    0.576    0.116    0.491
Max        0.759    0.356    0.678
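For reference, Spearman's ρ is the Pearson correlation between the ranks of the experimental and predicted values. The sketch below is a plain-Python version that gives tied values their average rank; the affinity values and scores are made up for illustration and are not CSAR data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson correlation of the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

experimental = [8.7, 7.9, 6.2, 6.1, 5.0]   # made-up affinities
predicted = [7.5, 8.1, 6.0, 6.4, 5.2]      # made-up model scores
rho = spearman_rho(experimental, predicted)
print(rho)
```

Because only ranks matter, ρ rewards a model for ordering ligands correctly even when its predicted values are on a different scale from the experimental ones.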

Page 40:

QSAR model accuracy in comparison with docking/scoring based models (all other groups)

Identifying Inactive Ligands*

Our Group                    UK AUC   Actives   Inactives   CHK1 AUC   Actives   Inactives
D-1 QSAR no_AD               0.933    15        4           0.756      30        9
D-2 QSAR AD                  0.950    15        4           0.756      30        9
D-3 QSAR AD+SIM              0.950    15        4           0.733      30        9
D-4 QSAR no_AD + MEDUSA      0.967    15        4           0.696      30        9
D-5 QSAR AD + MEDUSA         0.967    15        4           0.696      30        9
D-6 QSAR AD+SIM + MEDUSA     0.967    15        4           0.685      30        9

Summary statistics for all groups (AUC):

           UK      CHK1
Min        0.250   0.378
25%tile    0.714   0.496
Median     0.828   0.563
75%tile    0.950   0.689
Max        1.000   0.756

*No inactive ligands were available for ERK2
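The AUC values here can be read as the probability that a randomly chosen active ligand is scored above a randomly chosen inactive one. A minimal sketch of that pairwise definition follows, with ties counted as one half; the scores are made up, and the challenge organizers' exact scoring procedure is not reproduced.

```python
def roc_auc(active_scores, inactive_scores):
    """Probability that a randomly chosen active is scored above a
    randomly chosen inactive (ties count one half)."""
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))

actives = [0.9, 0.8, 0.7, 0.4]     # made-up scores for active ligands
inactives = [0.6, 0.3]             # made-up scores for inactive ligands
print(roc_auc(actives, inactives))
```

With only 4 inactives for UK and 9 for CHK1, each misordered pair moves the AUC noticeably, so small differences between submissions should be read with caution.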

Page 41:

Integrated workflow incorporating cheminformatics and molecular simulation approaches for pose classification into native-like and non-native

[Diagram: Multiple poses created by docking and ranked by a MM scoring function (Medusa) pass through a Knowledge based pose scoring filter, which separates Native-like poses from Non-native poses]

Page 42:

Structure Based QSAR: Chemical descriptors of protein-ligand interface (PL/MCT-tess)

[Figure: Tessellated protein-ligand interface; a tetrahedron with vertices Cp, Op, Ol, Nl and edges d1, d2, d3, shown once with an EN value and once with an MCT value at each vertex]

Each tetrahedron is described by a) receptor/ligand atom b) chemical atom type. In total, there are 554 theoretical descriptor types (m).

D1: CpOpOlNl

D2: CpOlOlNl

Count descriptors →

D1    D2    D3    …    D554
2     0     5          0

Incorporate Pauling electronegativity (EN): ENTess descriptors (Zhang, 2006)

Incorporate the maximal charge transfer (MCT)* value derived from conceptual DFT (Hsieh et al, 2012): PL-MCT-tess descriptors

D1    D2    D3    …    D554
2.8   0     5.3        0

Each descriptor’s value is the SUM of protein (p)-ligand (l) pairwise potential for the same tetrahedral type at the interface (n)

[Equation: PL/MCT_m written as a sum over k = 1..n of terms in MCT_p (p = 1~3), MCT_l (l = 1~3), and d; the exact form is not recoverable from the transcript]

*Parr RG, Liu S J. Am. Chem. Soc. 1999, 121, 1922

Page 43:

Building a pose filter using binary QSAR and PL/MCT-tess chemical descriptors

[Figure: poses generated from an X-ray complex; plot of Number of poses vs. RMSD (Å)]

Native pose: RMSD = 0

Pose1: RMSD = 0.5 Å

Pose2: RMSD = 2 Å

Pose3: RMSD = 10 Å

RMSD classification threshold = 2.5 Å: NL (native like) vs. NN (non native)

          D1    D2    D3     …    D554   NL/NN
Pose 1    2.8   0     5.3         0      1
Pose 2    3.5   2.1   13.4        0      1
Pose n    5.2   0     3.7         0      0

CCR = 0.96

                           Predicted 0 (non native)   Predicted 1 (native like)
Observed 0 (non native)    2894                       127
Observed 1 (native like)   126                        2801
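As a check on the numbers on this slide, the sketch below labels poses with the 2.5 Å RMSD threshold and computes the correct classification rate from the confusion matrix. Treating CCR as the mean of the two per-class accuracies, and treating a pose exactly at the threshold as native-like, are assumptions of this example.

```python
def label_pose(rmsd, threshold=2.5):
    """1 = native-like (NL), 0 = non-native (NN), by RMSD in angstroms."""
    return 1 if rmsd <= threshold else 0

def ccr(tn, fp, fn, tp):
    """Correct classification rate: mean of the per-class accuracies."""
    return 0.5 * (tn / (tn + fp) + tp / (tp + fn))

# RMSD values of the native pose and the three example poses on the slide.
print([label_pose(r) for r in (0.0, 0.5, 2.0, 10.0)])
# Counts copied from the confusion matrix on the slide.
print(round(ccr(tn=2894, fp=127, fn=126, tp=2801), 2))
```

Under these assumptions the matrix reproduces the reported CCR of 0.96.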

Page 44:

Use of knowledge based filter to filter out pose decoys improves the accuracy of virtual screening*

[Diagram: Build filter; Apply to DUD dataset of true binders and decoys; Native-like?]

Use of target-specific pose filter combined with a physical force-field based scoring (MedusaScore) leads to significant improvement of hit rates in VS.

*Jui-Hua Hsieh; Shuangye Yin; Xiang S. Wang; Shubin Liu; Nikolay V. Dokholyan; Alexander Tropsha; J. Chem. Inf. Model. 2012, 52, 16-28.

Page 45:

Knowledge based target specific approach succeeded in filtering out “bad” poses generated by MedusaDock

Page 46:

Chemocentric Informatics: Integration of diverse data streams and data sources

Page 47:

[Figure: network of data sources linking Genes, Metabolites, and Diseases. Sources named: PubChem, DBpedia, ChEMBL, PDB, DrugBank, PubMed, STITCH, PDSP, PDD, CTD, Diseasome, CMAP, KEGG, GeneID, OMIM, Entrez Gene, Pfam, UniProt, CAS, ERGO, BioCyc, MetaCyc, miR2Disease, miRBase, TarBase, PuTmiR, HMDB, METLIN, UCbase, miRfunc, HMPDb]

Page 48:

[Workflow diagram; legible labels:
Disease-Target Association
Text/database mining (PubMed/ChemoText, CTD, HMDB): Disease related genes or proteins; Disease related proteins
Network mining (cmap, ChemoText): Disease gene signatures; New hypothesis about connectivity between chemicals and diseases
QSAR: Target related ligands (Binding data, Functional data); Predictive models; Database mining; Structural hypothesis “putative drug candidates”
Accept common hits only
New testable hypothesis with higher confidence]

Hajjo et al, Chemocentric Informatics Approach to Drug Discovery:…J Med Chem. 2012, 55(12):5704-19

Page 49:

5-HT6 receptor QSAR models & QSAR-based VS

Dataset (Source: PDSP Ki-DB): 196 cps.
- 102 Actives, Ki < 10 µM
- 94 Inactives, Ki ≥ 10 µM

Model statistics

[Bar chart: CCRevs (axis 0.0 to 1.0) by Model: kNN-Dragon Model, kNN-Dragon Random, CBA-SG Model, CBA-SG Random]

Virtual screening: 59 K cps. → 5-HT6 predictor → 300 VS Hits “Actives”

Page 50:

The connectivity map

Step1: upload signature (Input: Biological state 1 vs. Control → Signature)

Step2: query the cmap Database

Step3: list of correlated compounds (Output: High correlation / Null / Low correlation)

Lamb, J. et al. Science, 313, 1929-1935 (2006)

Lamb, J. Nature Rev. Cancer 7, 54-60 (2007)

Page 51:

Querying the cmap

Alzheimer’s disease gene signatures (S1) (S2)

Upload signature → Query the cmap → List of compounds

[Figure: compounds ordered by cmap SCORE, from 1.00 through 0.00 to -1.00]

S1: Hata, R. et al., Biochem. Biophys. Res. Commun 284, 310 (2001). S2: Ricciarelli, R. et al., IUBMB Life 56, 349 (2004).

Page 52:

Chemocentric Informatics

[Diagram; legible labels:
QSAR FILTER: WDI DATABASE (59 K compounds) → 300 5-HT6 Active HITS
cmap FILTER: cmap DATABASE (6.1 K Individual instances) → 881 instances with S1; 861 instances with S2
CONSENSUS HYPOTHESES: 97 COMMON HITS with S1; 106 COMMON HITS with S2; 73 COMMON HITS with S1 & S2
Further selection → 34 Higher Confidence Hits
Antipsychotics; Antidepressants; Calcium Channel Blockers; Selective Estrogen Receptor Modulators (SERMs)]

Page 53:

SERMs predicted as 5-HT6 receptor ligands

Page 54:

Raloxifene identified as a 5-HT6 receptor ligand and potential preventative for Alzheimer’s disease

Raloxifene binds to 5-HT6 receptor with a Ki = 750 nM.

Raloxifene given at a dose of 120 mg/day led to reduced risk of cognitive impairment in post-menopausal women. Yaffe, K. et al., Am J Psychiatry, 162, 683–690 (2005).

A study recently funded by NIH is ongoing to evaluate raloxifene’s effects in AD patients.

http://www.nia.nih.gov/alzheimers/publications/adprevented/

[Figure: Competition binding at 5-HT6 receptors for raloxifene (yellow triangle) and chlorpromazine (square) versus [3H] LSD. Tested by our collaborators at PDSP. Structures shown: Raloxifene, Chlorpromazine]

Page 55:

Exploration and exploitation of diverse data streams

[Diagram: Cheminformatics (Inherent chemical properties) and Bioinformatics (Multiple biological assays) → Human Effects]

Integrate cheminformatics and short term assay data to improve predictive power and interpretability

Page 56:

QSAR Table – qHTS descriptors

ID     Name                    Structure   3T3 9.2 µM   3T3 21 µM   …   SHSY 92 µM
1      Acrolein                            0            0           …   -92
2      2-Amino-4-nitrophenol               0            -22         …   0
...    ...                     …           …            …           …   …
369    Tebuconazole                        -21          -24         …   -18

Descriptor #:                              1            2           …   182

Page 57:

Modeling Workflow


Hybrid QSAR models have higher predictive power than those built with either chemical or biological descriptors

                         Chemical            Hybrid descriptors   Hybrid descriptors
                         descriptors only    (Original)           (THR=15%)
              TOPKAT     kNN      RF         kNN      RF          kNN      RF
Sensitivity   0.45       0.73     0.73       0.55     0.82        0.91     0.91
Specificity   0.93       0.78     0.80       0.85     0.78        0.85     0.83
BA            0.69*      0.75     0.77       0.70     0.80        0.88     0.87

Results are shown for 52 compounds in our external validation sets, which were also absent in the TOPKAT training set. *TOPKAT model was significantly different (p < 0.05) from all other models by the permutation test (10,000 times).
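The BA row is consistent with balanced accuracy taken as the mean of sensitivity and specificity. The sketch below shows that arithmetic. The confusion-matrix counts are hypothetical, chosen only because they round to one column of the table (sensitivity 0.82, specificity 0.78); they are not taken from the slide.

```python
def sensitivity(tp, fn):
    """Fraction of toxic compounds predicted as toxic."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-toxic compounds predicted as non-toxic."""
    return tn / (tn + fp)


def balanced_accuracy(sens, spec):
    """Mean of sensitivity and specificity."""
    return (sens + spec) / 2


# Hypothetical counts for a 52-compound external set (an assumption).
tp, fn, tn, fp = 9, 2, 32, 9
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
ba = balanced_accuracy(sens, spec)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} BA={ba:.2f}")
```

Because each class contributes equally, balanced accuracy is not inflated by the majority class when the external set is imbalanced.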


1. Integrative chem-bio modeling (existing method)

2. Integrative chem-bio modeling (new method)

3. Integrate cheminformatics prediction and pharmacoepidemiology validation

Data source: hepatotoxicity (28 day), 127 drugs

QSAR models (chemical descriptors): 55-61% BAcc

Toxicogenomics models (toxicogenomics expression at 24 h; 2,923 genes ranked by differential expression; top 400, 100, 30, or 4 genes): 69-78% BAcc

Hybrid models: 68-75% BAcc

QSAR models < Hybrid models < Toxicogenomics models

4 classification methods (RF, SVM, kNN, DWD)
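The gene-selection step (rank by differential expression, then keep the top N genes) can be sketched as below. The ranking statistic is an assumption: the slide does not say how differential expression was scored, so the absolute difference of class means is used purely for illustration, and the gene names and values are made up.

```python
from statistics import mean


def rank_genes(expression, labels):
    """Rank genes by |mean(toxic) - mean(non-toxic)|, largest first.

    expression: dict mapping gene name -> list of values, one per compound
    labels: list of 1 (toxic) / 0 (non-toxic), same order as the values
    """
    scores = {}
    for gene, values in expression.items():
        toxic = [v for v, y in zip(values, labels) if y == 1]
        non_toxic = [v for v, y in zip(values, labels) if y == 0]
        scores[gene] = abs(mean(toxic) - mean(non_toxic))
    return sorted(scores, key=scores.get, reverse=True)


def top_genes(expression, labels, n):
    """Keep the n highest-ranked genes as descriptors."""
    return rank_genes(expression, labels)[:n]


# Toy data: three hypothetical genes measured for four compounds.
expression = {
    "geneA": [1.0, 1.0, 5.0, 5.0],
    "geneB": [2.0, 2.0, 2.5, 2.5],
    "geneC": [0.0, 0.0, 0.0, 0.0],
}
labels = [0, 0, 1, 1]
selected = top_genes(expression, labels, 2)
print(selected)
```

To keep external validation honest, a ranking like this should be computed on training compounds only, not on the compounds held out for testing.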


How predictivity varied with number of genes and number of chemical descriptors

[Figure: balanced accuracy (color scale from bad to good) plotted against the number of genes and the number of chemical descriptors, with regions labeled QSAR models, hybrid models, and toxicogenomics models; the optimal toxicogenomics model is marked.]

Low et al. (2011) Chem. Res. Toxicol. 24, 1251-1262


Problem: Conflicting predictions by QSAR and toxicogenomics models

[Figure: two principal component plots (PC1 vs PC2) of toxic and non-toxic drugs, one in biological space and one in chemical space, with caffeine and carbamazepine highlighted in each. Percentage annotations on the slide: (20%), (70%), (4%), (10%).]

Carbamazepine: distant biological neighbors, close chemical neighbors => QSAR works better

Caffeine: close biological neighbors, distant chemical neighbors => TGx works better

Solution: Learn from both sets of neighbors


Chemical Read-Across: Learning from similar compounds


Traditional read-across predicts toxicity from chemically similar neighbors

ToxMatch, EU

QSAR Toolbox, OECD


Chemical-biological read-across (CBRA) learns from both sets of neighbors

[Figure: radial plot centered on carbamazepine, with biological neighbors on one side and chemical neighbors on the other, placed by similarity (rings at 0.9, 0.8, 0.7, and 0.6). Legend: correctly predicted as nontoxic; incorrectly predicted as toxic; conflicting predictions.]

Low et al. (2013) Chem. Res. Toxicol. In press.

Biological neighbors        Toxicity     Similarity
Bendazac                    Toxic        0.790
Flutamide                   Toxic        0.783
Chloramphenicol             Toxic        0.776
Disulfiram                  Toxic        0.770
Phenylanthranilic acid      Non-toxic    0.767

Chemical neighbors          Toxicity     Similarity
Phenytoin                   Non-toxic    0.813
Pemoline                    Non-toxic    0.766
Phenylbutazone              Non-toxic    0.737
Phenobarbital               Non-toxic    0.721

Carbamazepine: non-toxic; predicted as non-toxic (predicted toxicity = -0.099)

Predicted toxicity = similarity-weighted average of toxicity values:

$$A_{pred} = \frac{\sum_{i=1}^{k_{bio}} S_i A_i + \sum_{j=1}^{k_{chem}} S_j A_j}{\sum_{i=1}^{k_{bio}} S_i + \sum_{j=1}^{k_{chem}} S_j}$$
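The similarity-weighted average for carbamazepine can be worked through as a short sketch. Three things here are assumptions rather than statements from the slide: the split of the nine neighbors into biological and chemical sets, the encoding of toxic as +1 and non-toxic as -1, and the decision threshold at 0. The split and encoding were chosen because together they reproduce the predicted toxicity of -0.099 shown on the slide.

```python
TOXIC, NON_TOXIC = 1, -1  # assumed activity encoding


def cbra_predict(bio_neighbors, chem_neighbors):
    """Similarity-weighted average of neighbor activities.

    Each neighbor is a (similarity, activity) pair; both neighbor sets
    are pooled into one weighted average.
    """
    neighbors = list(bio_neighbors) + list(chem_neighbors)
    total_similarity = sum(s for s, _ in neighbors)
    if total_similarity == 0:
        raise ValueError("no neighbors with non-zero similarity")
    return sum(s * a for s, a in neighbors) / total_similarity


# Neighbors of carbamazepine as listed on the slide (grouping assumed).
bio = [
    (0.790, TOXIC),      # Bendazac
    (0.783, TOXIC),      # Flutamide
    (0.776, TOXIC),      # Chloramphenicol
    (0.770, TOXIC),      # Disulfiram
    (0.767, NON_TOXIC),  # Phenylanthranilic acid
]
chem = [
    (0.813, NON_TOXIC),  # Phenytoin
    (0.766, NON_TOXIC),  # Pemoline
    (0.737, NON_TOXIC),  # Phenylbutazone
    (0.721, NON_TOXIC),  # Phenobarbital
]

score = cbra_predict(bio, chem)
label = "toxic" if score > 0 else "non-toxic"  # threshold at 0 is assumed
print(f"predicted toxicity = {score:.3f} ({label})")
```

On their own, the biological neighbors would vote toxic and the chemical neighbors non-toxic; pooling both sets lets the larger total similarity of the chemical neighbors tip the prediction to non-toxic.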


CBRA allows visual comparison of multiple compounds


Low et al. (2013) Chem. Res. Toxicol. In press.


Types of chemical-biological models

Hybrid model (from pooling data): chemical structures + biological assays -> pooled data matrix -> hybrid model

Ensemble model (from pooling models): chemical structures -> QSAR model; biological assays -> biological model; QSAR model + biological model -> ensemble model

Chemical-biological read-across (CBRA): chemical structures + biological assays -> CBRA
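The difference between pooling data and pooling models can be illustrated with a minimal sketch. It uses a toy k-nearest-neighbor scorer, made-up descriptor values, and an unweighted average for the ensemble; none of these details come from the slides.

```python
import math


def knn_score(train_x, train_y, query, k=1):
    """Mean label (1 = toxic, 0 = non-toxic) of the k nearest training rows."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)


def hybrid_score(chem, bio, y, q_chem, q_bio, k=1):
    """Pool data: concatenate chemical and biological descriptors per compound."""
    pooled = [c + b for c, b in zip(chem, bio)]
    return knn_score(pooled, y, q_chem + q_bio, k)


def ensemble_score(chem, bio, y, q_chem, q_bio, k=1):
    """Pool models: average the outputs of a QSAR model and a biological model."""
    qsar = knn_score(chem, y, q_chem, k)
    biological = knn_score(bio, y, q_bio, k)
    return (qsar + biological) / 2


# Toy data: four compounds, two chemical descriptors and one assay readout.
chem = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
bio = [[5.0], [4.0], [0.0], [1.0]]
y = [1, 1, 0, 0]

print(hybrid_score(chem, bio, y, [0.0, 0.1], [4.4]))
print(ensemble_score(chem, bio, y, [0.0, 0.1], [4.4]))
```

In the pooled matrix the descriptor blocks share one distance calculation, so in practice they would need to be put on comparable scales first; the ensemble keeps the two spaces separate and only combines the outputs.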


Conclusions and Outlook

• Methodology

– Data curation is critical (NB: QSAR models can be used to spot and correct erroneous data!)

– Rigorous external model validation is mandatory and should precede any mechanistic interpretation

– Consensus (collaborative!) prediction using all acceptable models affords the highest accuracy and chemical space coverage

– Novel chemical descriptors are needed for (so far) uncommon substances (mixtures, materials, nanomaterials)

– Outcome: decision support tools for prioritizing compounds for experimental screening and/or regulatory decision making
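Consensus prediction over acceptable models can be sketched as follows. The sketch assumes each model returns a score together with a flag saying whether the compound falls inside that model's applicability domain, and that the consensus is a plain average; the slides do not specify either detail.

```python
def consensus_predict(model_outputs):
    """Average the scores of models whose applicability domain covers the compound.

    model_outputs: list of (score, in_domain) pairs, one per model.
    Returns None when the compound lies outside every model's domain.
    """
    usable = [score for score, in_domain in model_outputs if in_domain]
    if not usable:
        return None
    return sum(usable) / len(usable)


# Hypothetical outputs from three models for one compound.
outputs = [(0.9, True), (0.7, True), (0.1, False)]
print(consensus_predict(outputs))
```

Returning no prediction outside all domains trades coverage for reliability; adding more acceptable models is what widens the covered chemical space.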


Conclusions and Outlook

• Emerging trends in applications of QSAR modeling

– Rapid accumulation of large biomolecular datasets (especially in the public domain)

– Non-traditional sources of datasets (text mining of biomedical literature, patents, EMRs, …)

– Extension of QSAR modeling beyond organic molecules (mixtures, materials, nanomaterials, …)

– Integration of inherent chemical properties with short-term biological profiles (biodescriptors) in the context of structure – in vitro – in vivo extrapolation

– Interpretation of significant chemical and biological descriptors emerging from externally validated models to inform the selection or design of effective and safe chemicals


http://chembench.mml.unc.edu


The Laboratory for Molecular Modeling

Principal Investigator: Alexander Tropsha

Research Professors: Alexander Golbraikh, Denis Fourches, Eugene Muratov

Graduate Research Assistants: Andrew Fant, Stephen Bush, Yen Low

Mary La (Pharmacy student)

Postdoctoral Fellows: Aleck Sedykh, Regina Politi

Adjunct Members: Weifan Zheng, Shubin Liu

Former members: Jui-Hua Hsieh, Rima Hajjo

Collaborators: Bryan Roth, Ivan Rusyn, Nikolay Dokholyan

MAJOR FUNDING

NIH: R01-GM66940, R01-GM068665

NSF: ABI 9179-1165

EPA (STAR awards): RD832720, RD833825, RD834999

ONR