Predictive Qspr Epa Star

download Predictive Qspr Epa Star

of 44

Transcript of Predictive Qspr Epa Star

  • 8/3/2019 Predictive Qspr Epa Star

    1/44

    The Challenges in Predictive QSPR

    Modeling

    Alexander Tropsha, Hao Zhu, Kun Wang

    Laboratory for Molecular Modeling

    School of Pharmacy

    and

    Carolina Center for Exploratory

    Cheminformatics Research

    UNC-Chapel Hill

  • 8/3/2019 Predictive Qspr Epa Star

    2/44

    OUTLINE

    The need for developing validated models OECD programme

    NIHs Molecular Library Initiative and PubChem

    Possible insufficiency of HTS and -omics data for predictivetoxicology

    Predictive QSAR Modeling Workflow

    Examples of the Workflow applications Ames genotoxicity

    HTS/NTP dataset

    Carcinogenicity modeling (HTS/NTP/CPDB and CPDB)

    Discussion

  • 8/3/2019 Predictive Qspr Epa Star

    3/44

    Art. 3.2Art. 3.2 ..to keep animal testing to a minimumto keep animal testing to a minimumin the interest of timein the interest of time-- and costand cost--effectivenesseffectiveness

    particular research efforts are needed for development andparticular research efforts are needed for development and

    validation of modelling (e.g. QSAR) and screening methodsvalidation of modelling (e.g. QSAR) and screening methods

    for assessing the potential adverse effects of chemicalsfor assessing the potential adverse effects of chemicals

    Regulatory acceptance of QSAR modelsRegulatory acceptance of QSAR models::

    Setubal PrinciplesSetubal PrinciplesSetubal Principles

    Workshop ICCA/CEFIC (2002):Workshop ICCA/CEFIC (2002):

    *Slide is a courtesy of Prof. Paola Gramatica - QSAR Research Unit - DBSF - University of Insubria - Varese (Italy)

    EU- WHITE PAPER on the Strategy for aFuture Chemicals Policy (2001)*

    EUEU-- WHITE PAPER on theWHITE PAPER on the StrategyStrategy forfor aaFutureFuture ChemicalsChemicals PolicyPolicy ((2001)*2001)*

  • 8/3/2019 Predictive Qspr Epa Star

    4/44

    From Setubal to OECD PrinciplesFrom Setubal to OECD PrinciplesFrom Setubal to OECD PrinciplesTo facilitate the consideration of a QSAR modelTo facilitate the consideration of a QSAR model

    for regulatory purposes, it should be associated with thefor regulatory purposes, it should be associated with the

    following information:following information:

    --

    be associated with abe associated with a defined endpointdefined endpoint

    take the form of an unambiguous and easily applicabletake the form of an unambiguous and easily applicable

    algorithmalgorithm;;

    ideally, have aideally, have a mechanistic basis;mechanistic basis;

    be accompanied by a definition ofbe accompanied by a definition of domain of applicabilitydomain of applicability

    be associated with a measure of goodnessbe associated with a measure of goodness--of fit (of fit (internalinternal

    validationvalidation););

    be assessed in terms of its predictive power by using databe assessed in terms of its predictive power by using datanot used in the development of the model (not used in the development of the model (externalexternal

    validationvalidation).).*Slide is a courtesy of Prof. Paola Gramatica - QSAR Research Unit - DBSF - University of Insubria - Varese (Italy)

  • 8/3/2019 Predictive Qspr Epa Star

    5/44

  • 8/3/2019 Predictive Qspr Epa Star

    6/44

    4 Chemical SynthesisCenters

    4 Chemical SynthesisCenters

    Molecular Libraries Initiative

    Molecular Libraries Initiative

    NIH Roadmap Initiative

    NIH Roadmap Initiative

    MLSCN (9+1)9 centers

    1 NIH intramural20 x 10 = 200 assays

    MLSCN (9+1)9 centers

    1 NIH intramural20 x 10 = 200 assays

    PubChem(NLM)

    PubChem(NLM)

    ECCR (6)Exploratory

    Centers

    ECCR (6)ExploratoryCenters

    CombiChem

    Parallel synthesis

    DOS

    4 centers + DPI

    100K 1M compounds

    CombiChem

    Parallel synthesis

    DOS

    4 centers + DPI

    100K 1M compounds

    1M compounds

    200 assays

    SAR matrix

    NIHs Molecular Libraries Initiative in numbers

    PredictiveADMET

    (10)

    PredictiveADMET(10)

  • 8/3/2019 Predictive Qspr Epa Star

    7/44

    Recent MLSCN Screening

    Results in PubChem

    0.000536982637NCGCIkB Signalling

    0.036932140852NCGCCell Viability

    (BJ)

    0.100852

    1408142NCGCCell Viability

    (Jurkat)

    0.056818

    140880NCGCCell Viability

    (Hek293

    )

    0.037642

    140853NCGCCell Viability

    (HepG2)

    0.036222

    140851NCGCCell Viability

    (MRC5)

    0.065341

    140892NCGCCell Viability

    (SK-N-SH)

    0701580NCGCsOGT0655350NCGCPrx2

    % Active# Screened# ofActives

    CenterScreen

    0.0058211236972VUMLSCM40.00574

    853649VUMLSCThallium

    flux through

    GIRK

    0.019431

    9984194SDCCGCell

    Viability

    (HPDE-C7)

    0.021534

    9984215SDCCGCell

    Viability

    (HPDE-

    C7K)

    0.083811

    3317278SRMLSCA549 Cell

    Growth

    0.0002

    100112SRMLSCPantothenate

    Synthetase

    0.005104999351NMMLSCFPR0.002302

    999323NMMLSCFPRL1

    0.00153365239100PMLSCMKP-1

  • 8/3/2019 Predictive Qspr Epa Star

    8/44

    Subset of PubChem relevant to this presentation:

    NTP-HTS Content Summary of 1408 Compounds

    Chemical Structure Types:- Organic: 1,348

    - Inorganic: 27

    - Organometallic: 19- No structure: 14

    1348 Organic compounds contain:- Unique: 1,279

    - Complex: 51

    - Salt: 20

    - Duplicates: 53

    Curated subset: 1,289 unique organic compounds

  • 8/3/2019 Predictive Qspr Epa Star

    9/44

    HTS Screening Data (NCGC) for

    1,289 NTP Compounds

    1,208

    44

    37

    MRC5

    74416312142Actives

    1,1611,2011,1471,0791,203Inactives

    5447798944Inconclusives

    SK-N-

    SH

    HepG2Hek293JurkatBJ

  • 8/3/2019 Predictive Qspr Epa Star

    10/44

    Data interpretation: How was the Activity

    Classified?

    -0.5

    0

    0.5

    1

    1.5

    2

    2.5

    -9 -8 -7 -6 -5 -4 -3 -2 -1

    Log(EC50)

    Classifications

    The relationship between IC50 and classification

    of compounds in Jurkat cell line test.

    active

    inactive

    inconclusive

  • 8/3/2019 Predictive Qspr Epa Star

    11/44

    Summary of the experimental data

    for 1,289 compounds 141 compounds are active in at least one test.

    230 compounds are at least active orinconclusive in at least one test.

    1,059 compounds are inactive in all 6 tests

  • 8/3/2019 Predictive Qspr Epa Star

    12/44

    Additional biological data on

    1,289 NTP/HTS compounds*

    181383*4231,0531,1531,289

    IRISSICPDBHPVCSINTPGTZNTPBSINTP-

    HTS

    NTPBSI: National Toxicology Program Chemical Structure Index file

    NTPGTZ: National Toxicology Program genotoxicity

    HPVCSI: High Production Volume Chemicals

    CPDB: Carcinogenic Potency Data Base All SpeciesIRISSI: EPA Integrated Risk Information System

    *15 of 383 compounds in CPDB database are technique class.

    *Based on the DSSTox project of Dr. Ann Richard at EPA.

  • 8/3/2019 Predictive Qspr Epa Star

    13/44

    Overview of carcinogenic responses of

    the 383 compounds in rats and mice 229 compounds show positive response in at least

    one organ of one or more species.

    92 compounds show negative results in all tests.

    62 compounds show negative response in all testsbut the tests are not complete.

  • 8/3/2019 Predictive Qspr Epa Star

    14/44

    Are HTS results indicative of

    carcinogenicity?

    Results based on the IRIS database (EPA 1986, 1996, 1999 carcinogen risk assessment)

    1

    5

    HTS-Actives

    Non HumanCarcinogens

    Human

    Carcinogens

    332

    475

    HTS-InactivesHTS-

    Inconclusives

    93 compounds were tested in HTS, 57 of them are or likely to be human carcinogens, 36

    of them are not human carcinogens.

  • 8/3/2019 Predictive Qspr Epa Star

    15/44

    Can the explicit use of chemical structure help

    with the end point prediction: QSPR Modeling

    Chemistry Biology

    (IC50, Kd...)

    Cheminformatics(Molecular Descriptors)

    Comp.1 Value1 D1 D2 D3 D4

    Comp.2 Value2 " " " "

    Comp.3 Value3 " " " "

    Comp.N ValueN " " " "

    - - - - - - - - - - - - - -

    Goal: Establish correlations between descriptors and the target

    property capable of predicting activities of novel compounds

    BA (e.g., IC50) = F(D)

    =

    2

    2

    2

    )(

    )(1

    i

    ii

    yy

    yyq

  • 8/3/2019 Predictive Qspr Epa Star

    16/44

    Typical QSPR modeling result: Comparison

    between Actual and Predicted Activity

    0

    0.5

    1

    1.5

    2

    2.5

    3

    0 1 2 3 4Actual LogED50 (ED50 = mM/kg)

    PredictedLogED50

    Training

    Linear (Training)

    q2 = 0.79

    makes everyone happy

  • 8/3/2019 Predictive Qspr Epa Star

    17/44

    An example of mechanistic model

    of mutagenicity

    Possible remedies (per authors)

    retesting some of the compounds;

    testing further new compounds; checking (if necessary) the use of additional

    chemical descriptors.

    Beningni et al.Env. Mol. Mutagenesis 46:268-280 (2005)

  • 8/3/2019 Predictive Qspr Epa Star

    18/44

    y = 0.5958x + 2.3074

    R2 = 0.2135

    3

    5

    7

    9

    3 5 7 9

    Predicted

    Observed

    EXTERNAL TEST SET PREDICTIONS

    y = 0.4694x + 2.9313

    R2 = 0.1181

    3

    5

    7

    9

    3 5 7 9

    Predicted

    Observed

    The unbearable lightness of modeling

    (in this case, CoMFA)

    leads to unacceptable prediction accuracy.

    0

    0.5

    1

    1.5

    2

    2.5

    3

    0 1 2 3 4Actual LogED50 (ED50 = mM/kg)

    PredictedLogED50

    Training

    Linear (Training)

    2!!!

  • 8/3/2019 Predictive Qspr Epa Star

    19/44

    BEWARE OF q2!!!(Golbraikh & Tropsha,J. Mol. Graphics Mod. 2002, 20, 269-276. )

    -0.8

    -0.6

    -0.4

    -0.2

    0

    0.2

    0.4

    0.6

    0.8

    1

    0.5 0.6 0.7 0.8 0.9 1

    q2

    R2

    Poor Models

    Good Models

    0

    20

    40

    60

    80

    100

    120

    -0.7 -0.6 -0.5 -0.4 - 0.3 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

    R2

    F

    re

    q

    ue

    n

    c

  • 8/3/2019 Predictive Qspr Epa Star

    20/44

    COMPONENTS OF THE PREDICTIVE

    QSPR MODELING WORKFLOW*

    Model Building: Combination of various descriptorsets and variable selection data modeling methods

    (Combi-QSAR) Model Validation

    Y-randomization

    Training and test set selection

    Applicability domain Evaluation of external predictive power

    Virtual screening

    *Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.

  • 8/3/2019 Predictive Qspr Epa Star

    21/44

    COMBINATORIAL QSAR

    C-Qics

    KNNKNN

    BINARY QSAR,BINARY QSAR,

    COMFA descriptorsCOMFA descriptors

    MolconnMolconn ZZ

    descriptorsdescriptors

    Chirality descriptorsChirality descriptors

    VolsurfVolsurf descriptorsdescriptors

    Comma descriptorsComma descriptors

    MOE descriptorsMOE descriptors

    Dragon descriptorsDragon descriptors

    SAR DatasetSAR Dataset

    Compound representationCompound representation

    Selection of best modelsSelection of best models

    Model validationModel validation

    usingusing external testexternal test setset

    andand YY--RandomizationRandomization

    QSAR modelQSAR modeliningg SVMSVM

    DECISION TREEDECISION TREE

    Lima, P., Golbraikh, A., Oloff, S., Xiao, Y., Tropsha, A. Combinatorial QSAR Modeling of P-Glycoprotein

    Substrates. J. Chem. Info. Model., 2006, 46, 1245-1254

    Kovatcheva, A., Golbraikh, A., Oloff, S., Xiao, Y., Zheng, W., Wolschann, P., Buchbauer, G., Tropsha, A.

    Combinatorial QSAR of Ambergris Fragrance Compounds. J Chem. Inf. Comput. Sci. 2004, 44, 582-95

  • 8/3/2019 Predictive Qspr Epa Star

    22/44

    Example of application in a Combi-QSAR Study

    808088887878868676768383VOLSURFVOLSURF

    626284846969888853538989MOEMO E

    808094947676838380808787ATOM PAIRATOM PAIR

    676790906767888878789292MOLCONNZMOLCONNZ

    TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining

    SVMSVMDECISION TREEDECISION TREEkN NkN NMethodMethod

    DescriptorsDescriptors

    808088887878868676768383VOLSURFVOLSURF

    626284846969888853538989MOEMO E

    808094947676838380808787ATOM PAIRATOM PAIR

    676790906767888878789292MOLCONNZMOLCONNZ

    TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining

    SVMSVMDECISION TREEDECISION TREEkN NkN NMethodMethod

    DescriptorsDescriptors

    Percent Classification Accuracy for the PGP Dataset*

    75758383626275756565717189897676CoMFACoMFA

    69697373727274747070737375757777COMMA/MOECOMMA/MOE

    69697373727274747070737375757777COMMA/MOECOMMA/MOE

    65657777717174748686747465657777MOEMOE

    53539494606077777070747485857878VOLSURFVOLSURF

    53538787535362624747858560606767CMTD/MOLCONNZCMTD/MOLCONNZ

    58588181747467675050767665657272CMTDCMTD

    68688383787870707676727286867070DRAGONDRAGON

    TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining

    SVMSVMDECISION TREEDECISION TREEBINARYBINARY

    QSARQSAR

    kNNkNNMethodMethod

    DescriptorsDescriptors

    75758383626275756565717189897676CoMFACoMFA

    69697373727274747070737375757777COMMA/MOECOMMA/MOE

    69697373727274747070737375757777COMMA/MOECOMMA/MOE

    65657777717174748686747465657777MOEMOE

    53539494606077777070747485857878VOLSURFVOLSURF

    53538787535362624747858560606767CMTD/MOLCONNZCMTD/MOLCONNZ

    58588181747467675050767665657272CMTDCMTD

    68688383787870707676727286867070DRAGONDRAGON

    TestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTrainingTestTestTrainingTraining

    SVMSVMDECISION TREEDECISION TREEBINARYBINARY

    QSARQSAR

    kNNkNNMethodMethod

    DescriptorsDescriptors

    Percent Classification Accuracy for the Fragrance Dataset**

    *Lima, et al. JCIM, 2006, in press.**Kovatcheva, Golbraikh, Oloff, et al. JCICS, 44: 582-595, 2004.

  • 8/3/2019 Predictive Qspr Epa Star

    23/44

    Activity randomization: model robustnessStruc.1

    Struc.2

    Struc.n

    ..

    ..

    Pro.1

    Struc.3

    ..

    ..

    Pro.2Pro.3

    Pro.n

    Struc.1

    Struc.2

    Struc.n

    ..

    ..

    Pro.1

    Struc.3

    ..

    ..

    Pro.2

    Pro.3

    Pro.n

    -0.1

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0 10 20 30 40 50 60 70

    Number of Variables

    q2

    The lowest q2 = 0.51 in the top 10 models

    The highest q2 =0.14 for randomized datasets

    -0.1

    0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0 10 20 30 40 50 60 70

    Number of Variables

    q2

    The lowest q2 = 0.51 in the top 10 models

    The highest q2 =0.14 for randomized datasets

  • 8/3/2019 Predictive Qspr Epa Star

    24/44

    RATIONAL SELECTION OF TRAINING AND TEST SETS

    BASED ON DIVERSITY SAMPLING

    0

    0.2

    0.4

    0.6

    0.8

    1

    0 0.2 0.4 0.6 0.8 1

    R

    TRAINING SET TEST SET

    NVVp /=

    K

    p

    cVR /1=

    V - the occupied volume in the descriptor space

    Vp- volume corresponding to one point

    c - dissimilarity level

    K - dimensionality of the descriptor space

    ALGORITHMS 1 to 3

    1. Volume corresponding to one point is 1/N.

    2. Select a compound with the highest activity.3. Include this compound into the training set.

    4. Construct a sphere with the center in the

    representative point of this compound with

    radius R = c(V/N)1/K.

    5. Include compounds within this sphere except

    for the center in the test set.6. Exclude all points within this sphere. For

    algorithm 1, select randomly a compound

    and go to 3. If no compounds left, go to 10.

    7. n - the number of remaining compounds. m -

    the number of spheres already constructed.

    dij, i=1,,n, j=1,,m distances of

    compounds left to the sphere surfaces.8. Select a compound with the smallest

    (algorithm 2) or largest (algorithm 3) dij.

    9. Go to step 3.

    10. Stop.

    N number of points

    Golbraikh et al.,J. Comp. Aid. Mol. Design 2003, 17, 241253

  • 8/3/2019 Predictive Qspr Epa Star

    25/44

    DEFINING THE APPLICABILITY DOMAIN

    Distribution of distances between points and

    their nearest neighbors in the training set

    0

    0.1

    0.2

    0.3

    0.4

    0-0.1

    0.1-0.2

    0.2-0.3

    0.3-0.4

    0.4-0.5

    0.5-0.6

    0.6-0.7

    0.7-0.8

    0.8-0.9

    0.9-1.0

    Distances

    Ni/N

    Training set

    Test set

    Training set: 60 compoundsTest set: 35 compounds

    MODEL:Two nearest neighbors

    The number of descriptors: 8Q2(CV)=0.57 R2 =0.67

    DISTANCES:train=0.287

    StDev(D)train= =0.149

    Closest nearest neighbors oftest set compounds:

    Dtest = train+ ZCutOff(ZCutOff=0.5)

    N is the total number of distances

    (Ntrain=60 2=120; Ntest=70 )

    Ni is the number of distances in each

    category (bin)

    *Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.

  • 8/3/2019 Predictive Qspr Epa Star

    26/44

    y = 0.9383x

    R02= -3.3825

    y = 0.3154x + 3.4908

    R2

    = 0.9778

    0

    2

    4

    6

    8

    10

    0 2 4 6 8 10

    Predicted

    Observed

    y = 1.0023x

    R02 = 0.5238

    y = 3.1007x - 10.715

    R2 = 0.9778

    0

    2

    4

    6

    8

    10

    0 2 4 6 8 10

    Observed

    Predic

    ted

    =

    22 )~~()(

    )~~)((

    yyyy

    yyyyR

    ii

    ii

    2

    2

    2

    0)~~(

    )~(1

    0

    yy

    yyR

    i

    r

    ii

    =

    2

    2

    2

    0 )(

    )~(

    1'

    0

    yy

    yy

    R i

    r

    ii

    =

    ''~ byay r += ykyr '~ 0 =

    Criteria for Predictive QSAR Model.

    Correlation coefficient

    y = 1.2458x - 1.8812

    R2

    = 0.8604

    y = 0.9796x

    R02

    = 0.8209

    3

    5

    7

    9

    3 5 7 9

    Predicted

    Observed

    2

    ~

    'i

    ii

    y

    yyk

    =

    Coefficients of determinationRegression Regression throughthe origin

    CRITERIA

    22'

    0

    2

    0

    '

    22

    or;0.1or

    ;6.0;5.0

    RRRkk

    Rq

    >>

    Golbraikh et al.,J. Comp. Aid. Mol. Design 2003, 17, 241253

  • 8/3/2019 Predictive Qspr Epa Star

    27/44

    Why cant we get it right? Have not

    we tried enough?

    Descriptors? No, we have plenty (e.g., Dragon)

    Methods? No, we also have plenty, and still

    searching (e.g., adapting datamining techniques). Training set statistics? NO, it does not work

    Test set statistics? Maybe, but it is still insufficient

    Sowhat else can we do?????

    Change the success criteria!!!

    QSAR is an empirical data modeling exercise: justdo it any way you like but VALIDATE onindependent datasets!

  • 8/3/2019 Predictive Qspr Epa Star

    28/44

    Only accept models

    that have a

    q2 > 0.6

    R2 > 0.6, etc.

    MultipleTraining Sets

    Validated Predictive

    Models with High Internal

    & External Accuracy

    Predictive QSPR Workflow*

    Original

    Dataset

    Multiple

    Test Sets

    Combi-QSAR

    ModelingSplit into

    Training, Test,and External

    validation Sets

    Activity

    Prediction

    Y-Randomization

    External validation

    Using Applicability

    Domain

    Database

    Screening

    *Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.

  • 8/3/2019 Predictive Qspr Epa Star

    29/44

    3,363 diverse compounds (including >300drugs) tested for their Ames genotoxicity

    60% mutagens, 40% non mutagens 148 initial topological descriptors

    ANN, kNN, Decision Forest (DF) methods

    2963 compounds in the training set, 400compounds (39 drugs) in randomly selectedvalidation set

    Example. Consensus QSPR models for

    the prediction of Ames genotoxicity*

    *Votano JR, Parham M, Hall LH, Kier LB, Oloff S, Tropsha A, Xie Q, Tong W.Mutagenesis, 2004, 19, 365-77.

    Comparison of GenTox prediction

  • 8/3/2019 Predictive Qspr Epa Star

    30/44

    Comparison of GenTox prediction

    for 30 drugs in the external test set

  • 8/3/2019 Predictive Qspr Epa Star

    31/44

    Frequent MI

    descriptors map

    onto (someknown) structural

    alerts

    Bold, wide bonds show

    positions within structures

    where descriptors indicate

    a structural alert for Amesmutagenicity as found

    among most important E-

    State indices

  • 8/3/2019 Predictive Qspr Epa Star

    32/44

    Applicability domain vs. prediction

    accuracy (Ames Genotoxicity dataset)

    80

    82

    84

    86

    88

    90

    92

    0.5 1 2 3 5 10

    kNN Z-Score Used for Prediction Cutoff

    TestSet%Accurac

    y

  • 8/3/2019 Predictive Qspr Epa Star

    33/44

    QSAR modeling of the NTP/NCGC/HTS data only

    37103Actives

    2367Inconclusives

    157400Total97*230*Inactives

    Validation setModeling set

    The best k-NN models based on the modeling set:

    74.1%

    79.4%

    72.8%

    Pred. Test

    278.1%3

    278.8%2

    278.8%1

    NNNPred. Train.Nm

    *Inactives most similar to actives are selected

  • 8/3/2019 Predictive Qspr Epa Star

    34/44

    Prediction of the External Set

    88.7%63.9%Pred.

    Accuracy

    8613Pred.

    Inactives

    1123Pred.

    Actives

    InactivesActives

    92.1%76.2%Pred.

    Accuracy

    825Pred.

    Inactives

    716Pred.

    Actives

    InactivesActives

    No applicability domain.

    Accuracy 75.8

    Applicability domain filter applied.

    Accuracy 83.6%, Coverage 82.8%

  • 8/3/2019 Predictive Qspr Epa Star

    35/44

    Carcinogenicity Model Based on the 187

    Compounds Modeling set: 167 compounds

    External validation set: 20 compounds

    The number of kNN QSAR models based on modeling set fordifferent cutoff values:

    Training/test set predictivity

    cutoff

    Chemical descriptors only Combinational descriptors

    0.7/0.7 315 919

    0.75/0.75 29 86

    0.8/0.8 1 4

    Combined HTS+chemical descriptors

  • 8/3/2019 Predictive Qspr Epa Star

    36/44

    Prediction of the 20 External Compounds

    Chemial descriptors

    only

    Combined HTS+chemical

    descriptors

    Exp.

    Actives

    Exp.

    Inactives

    Exp.

    Actives

    Exp. Inactives

    Pred. actives 5 1 8 0

    Pred. inactives 5 4 3 5

    Accuracy 50.0% 80.0% 72.7% 100%

    Overall

    Accuracy

    65.0% 86.4%

    Modeling of the completeModeling of the complete

  • 8/3/2019 Predictive Qspr Epa Star

    37/44

    Modeling of the complete

    carcinogenicity dataset: The

    Carcinogenic Potency Database (CPDB)

    Modeling of the completecarcinogenicity dataset: The

    Carcinogenic Potency Database (CPDB) Lois Swirsky Gold, Ph.D., Director

    Unique and widely used international repositoryhttp://potency.berkeley.edu/

    1485 chemicals Species, strain, and sex of test animals

    Target organ, tumor types, and tumor incidence

    Carcinogenic potency (TD50)

    Shape of the dose-response

    Expertss conclusion on carcinogenicity

    Literature citation through1997

    Incorporated in The Distributed Structure-Searchable Toxicity(DSSTox) Database Network.http://www.epa.gov/nheerl/dsstox/sdf_cpdbas.html

  • 8/3/2019 Predictive Qspr Epa Star

    38/44

    Database CurationDatabase Curation

    Total entries: (1481)

    Delete entries with no structure (1444 left)

    Delete entries containing inorganicelements (1244 left)

    Clean duplicates / triplicates and keep one

    copy (1216 left) Delete chiral compounds (1214 left)

    Delete all entries missing mutagenicity data(693 left)

  • 8/3/2019 Predictive Qspr Epa Star

    39/44

    Statistics of a Working Subset for the

    Animal Carcinogenicity Modeling

    Statistics of a Working Subset for the

    Animal Carcinogenicity Modeling

    693140553Total

    42481343Active

    26959210Inactive

    TotalVal. SetTrain/Test SetT

  • 8/3/2019 Predictive Qspr Epa Star

    40/44

    Accuracy of kNN QSAR Models of

    Animal Carcinogenicity

    Accuracy of kNN QSAR Models of

    Animal Carcinogenicity

    Training SetTest Set

    Val. Set

    Min Pred. Acc.

    Max Pred. Acc.0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.60.7

    0.8

    0.9

    Training SetTest Set

    Val. Set

    Min Pred. Acc.

    Max Pred. Acc.0

    0.1

    0.2

    0.3

    0.4

    0.5

    0.60.7

    0.8

    0.9

    QSPR Workflow: Emphasis on Successful

  • 8/3/2019 Predictive Qspr Epa Star

    41/44

    Large fraction isconfirmed active

    Small number ofcomputational hits

    SAR datasetExternal

    database/library

    QSPR Workflow: Emphasis on Successful

    Predictions, not statistics or interpretations

    QSPR

    Magic

    Input

    Output

    Real Test

  • 8/3/2019 Predictive Qspr Epa Star

    42/44

    C-

  • 8/3/2019 Predictive Qspr Epa Star

    43/44

    Summary and thoughts

    HTS and omics data may be insufficient toachieve the desired accuracy of the end point

    property prediction. Should be explored asbiodescriptors in conjunction with chemicaldescriptors

    Predictive QSPR workflow with extensivevalidation affords statistically significant modelsthat can serve as reliable property predictors

    Mechanistic model interpretation should only beattempted IFF models have been externallyvalidated

    The public has an insatiable curiosity to know everything,

    except what is worth knowing.Oscar Wilde

  • 8/3/2019 Predictive Qspr Epa Star

    44/44

    ACKNOWLEDGMENTSUNC ASSOCIATES

    Former:

    -Stephen CAMMER-Sung Jin CHO

    -Weifan ZHENG

    - Min SHEN

    -Bala KRISHNAMOORTHY

    -Shuxing ZHANG

    Peter ITSKOWITZ

    Scott OLOFF- Shuquan ZONG

    Protein structure group:

    Yetian CHEN

    Tanarat KIETSAKORN

    Berk ZAFER

    Funding NIH RoadMap

    EPA (STAR award)

    Collaborators

    Ann Richard (EPA)

    Ivan Rusyn (UNC)

    Tudor Oprea (UNM)

    Cheminformatics group:- Kun WANG - Rima HAJJO

    Alex GOLBRAIKH - Mei WANG Raed KHASHAN - Julia GRACE

    Chris GRULKE - Hao HU

    - Hao TANG - Mihir SHAH

    - Simon WANG - Jui-Hua Hsieh

    Jun FENG

    Yun-De XIAO

    Yuanyuan QIAO

    Patricia LIMA

    Assia KOVACHEVA

    M. KARTHIKEYAN

    Current