QSAR/QSPR Model development and Validation for successful prediction and interpretation


Page 1: QSAR/QSPR  Model  development  and  Validation for successful  prediction and interpretation


QSAR/QSPR Model development and Validation

for successful prediction and interpretation

8th Iranian Workshop on Chemometrics, IASBS, 7-9 Feb 2009

Mohsen Kompany-Zareh

In the name of GOD

Page 2

Contents:

- Introduction
- Selwood data set (all descriptors)
- Model development
- Model validation
- Statistical diagnostics (R2, q2, RMSEC, RMSEP, RMSECV)
- Internal validation
- QUIK
- Selwood data (a number of descriptors)
- Descriptor selection
- LMO and Jackknife
- Cross model validation
- Bootstrapping
- Training and test set selection
- Leverage

Page 3

Introduction

QSPR/QSAR (quantitative structure-property / structure-activity relationship)

A mathematical relation between structural attribute(s) and a property (or an activity) of a set of chemicals.

Applications:

- Prediction of a property for a variety of chemicals, prior to expensive synthesis and experimental measurement.
- Determination of the environmental risk of thousands of untested industrial chemicals.
- Description of a mechanism of action for a variety of chemicals.

Page 4

Introduction

[Figure: schematic of a QSAR model. A descriptor matrix X (columns labelled Lipoph., LUMO, MW and Surf. Area; one row per molecule, molec. 1 to molec. 6) is linked through the QSAR model to the activity vector y. Activities marked "??" are unknown and are to be predicted.]

Descriptor rows shown in the figure:

1.885 120.93 476.92 122.04
2.913 108.77 508.56 150.17
3.312 122.85 554.01 164.08
3.711 123.92 571.26 178.10
2.696 120.49 505.61 156.01
3.106 119.98 518099 247.93

Activity values shown in the figure: 2.924, 1.992, 1.987, 1.544, 2.079, 1.530

Page 5

Introduction

Data preparation:

1. Collection and cleaning of target property data; selection of accurate, precise and consistent experimental data.

2. Calculation of molecular descriptors for chemicals with acceptable target properties (after optimization of the conformation). More than 3000 descriptors.

Software: DRAGON (Todeschini et al., 2001), ADAPT (Jurs 2002; Stuper and Jurs 1976), OASIS (Mekenyan and Bonchev 1986), CODESSA (Katritzky et al., 1994), Gaussian, …

Page 6

Introduction

A unique numerical representation of molecular structure in terms of a few molecular descriptors that capture salient compositional, electronic and steric attributes.

The starting point is a very large number of descriptors from different software packages.

As few explanatory descriptors as possible are kept, for simple interpretation of the model (sometimes by variable selection).

[Diagram: Structure, Model, Activity.]

Descriptors:

- Topologic (edges and vertices)
- Geometric (surface, volume, …)
- Electronic (electron density, local charges)
- Constitutional (#C, #OH, …)

Page 7

Data set

Selwood data: D (31x53), y (31x1)

31 molecules, 53 descriptors

>> load selwood.txt;
>> D=selwood(:,1:end-1);
>> y=selwood(:,end);

31 antifilarial antimycin analogues characterized by 53 physicochemical descriptors.

Selwood, et al., J Med Chem (1990) 33, 136.

Page 8

Model development

Model generation:

- Independent variables: descriptors
- Dependent variables: properties (activities)

Model development methods:

- Multiple linear regression (MLR)
- Partial least squares (PLS)
- Artificial neural networks (ANNs)
- k-nearest neighbor

#samples < #descriptors !!

Page 9

Model development

Multiple Linear Regression, the simplest model:

$$D\,b = y, \qquad b = D^{+} y$$

>> b= D\y;
>> yEST= D*b;

[Figure: regression coefficients b versus descriptor number (22 of 53 coefficients are zero!!) and y versus yEST (R2=1).]

[Diagram: D, b, y.]

The model is developed. Application of the model? Validation?
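
Why the fit is perfect can be seen with a small sketch. It uses random numbers with the same shape as the Selwood matrix (an assumption made only for illustration; these are not the real data) and the pseudo-inverse form b = D+ y. With more descriptors than samples the system can be solved exactly, so R2 = 1 says nothing about prediction. As far as I know, MATLAB's `D\y` returns a basic solution with at most rank(D) nonzero coefficients in this situation, which would be consistent with the 22 zero coefficients (53 - 31) reported on the slide; `pinv` returns the minimum-norm solution instead.

```matlab
% Synthetic stand-in for the Selwood matrix: 31 samples, 53 descriptors.
% (Random numbers, used only to show the effect; not the real data.)
D = randn(31, 53);
y = randn(31, 1);

b    = pinv(D) * y;      % b = D+ y, minimum-norm least-squares solution
yEST = D * b;            % fitted responses

% Coefficient of determination of the fit
R2 = 1 - sum((y - yEST).^2) / sum((y - mean(y)).^2);
fprintf('R2 = %.4f, largest residual = %.2e\n', R2, max(abs(y - yEST)));
```

Even a purely random y is reproduced to round-off level, which is the reason the slide asks for validation.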

Page 10

Model development

Other statistical diagnostics: coefficient of determination, R2

Fraction of the dependent variable variance explained by a model (e.g. an MLR model).

Closer to unity is better.

It is a measure of the quality of fit between model-predicted and experimental values, and does not reflect the predictive power at all.

$$R^2 = 1 - \frac{\sum_{i=1}^{train} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{train} (y_i - \bar{y}_{train})^2}$$

$y_i$: actual (experimental) value; $\hat{y}_i$: estimated value; $\bar{y}$: average.

Page 11

Model development

Many QSPR/QSAR practitioners find the data preparation and model generation steps sufficient to arrive at an acceptable model!!

They do not include model validation in model development.

Example: Schultz, et al., Toxicity of Tetrahymena Pyriformis, QSAR 2002 meeting, May 25-29, Ottawa, Canada.

log(1/IGC50) = 0.54 logKw - 8.90 LUMO - 0.99

n = 11, r2 = 0.82, s = 0.28, r2cv = 0.64

n/#descr = 11/2 > 5, but r2cv < r2fit: unstable model

Page 12

Model development

Example: Akers, et al., Struc.-tox. Relat. for selected halogenated aliphatic chemicals, Environ. Toxicol. Pharm. (1999) 7, 33-39.

Claim: The goodness of fit is satisfactory for predictive purposes.

Example: Benigni, et al., QSAR of mutagenic and carcinogenic aromatic amines, Chem. Rev. (2000) 100, 3697-3714.

“..use of a limited set of individual parameters with clear mechanistic significance is still the best approach that ensure the optimal comprehension of the results and gives the possibility of performing non-formal validations much superior to those provided by statistics” !!

Page 13

Problem:

Sometimes a model that fits the training set closely and accurately does not perform well on validation sets!!

..so, the model is not reliable!!

Page 14

Model validation

The real utility of a QSAR/QSPR model is its ability to accurately predict the modeled property/activity for new chemicals.

Model validation:

- Quantitative assessment of model robustness and its predictive power.
- Definition of the application domain of the model in the space of the applied chemical descriptors.

Page 15

Model validation: external validation

Division into calibration and test sets

calD = [D(1:3:end,:); D(2:3:end,:)];
valD = D(3:3:end,:);
caly = [y(1:3:end,:); y(2:3:end,:)];
valy = y(3:3:end,:);

b=calD\caly; %model development

Calibration samples: 1 4 7 10 13 … and 2 5 8 11 14 …; validation samples: 3 6 9 12 15 …

[Diagram: calD and caly are used for model development; valD and valy are used for validation.]

There are many different methods for selecting the members of the training and test sets.
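
As one alternative to the every-third-sample rule, a random split could look like the sketch here. The data are random placeholders and the one-third validation fraction is an assumption chosen to mirror the slide, not a recommendation from the talk.

```matlab
% Placeholder data with 31 samples (not the Selwood set)
D = randn(31, 5);
y = randn(31, 1);

n      = size(D, 1);
idx    = randperm(n);          % random order of the sample indices
nVal   = round(n / 3);         % about one third goes to the validation set
valIdx = idx(1:nVal);
calIdx = idx(nVal+1:end);

calD = D(calIdx, :);  caly = y(calIdx);   % calibration (training) set
valD = D(valIdx, :);  valy = y(valIdx);   % validation (test) set
```

Every sample lands in exactly one of the two sets; repeating the split with different random orders gives a feeling for how much the validation statistics depend on the division.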

Page 16

Model validation

>> calyEST=calD*b;
>> valyEST=valD*b; % model validation

[Figure: caly versus calyEST (R2=1) and testy versus testyEST (not good prediction).]

[Figure: residuals versus sample index. Calibration residuals are of the order of 10^-14; test residuals lie between about -4 and 4.]

Page 17

Model validation

>> calDr=size(calD,1); valDr=size(valD,1); % number of calibration and validation samples
>> calyEST=calD*b; % root mean square error of calibration
>> rmsec=sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)
>> valyEST=valD*b; % root mean square error of validation (prediction)
>> rmsep=sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)

RMSEC=2.9396e-014

RMSEP=2.2940

Not good prediction

$$RMSEC = \sqrt{\frac{\sum_{i=1}^{r_c} (y_i - \hat{y}_i)^2}{r_c}} \qquad RMSEP = \sqrt{\frac{\sum_{j=1}^{r_t} (y_j - \hat{y}_j)^2}{r_t}}$$

[Figure: residuals versus sample index. Calibration residuals are of the order of 10^-14; test residuals lie between about -4 and 4.]

Page 18

Model validation

A model with high R2 could be a poor predictor because of:

- Variable multicollinearity,
- Statistically insignificant model descriptors,
- High leverage points in the training set.

A regression model with k descriptors and n training set compounds may be acceptable for validation only if:

- n > 4 k
- For any of the k descriptors: pair-wise correlation coefficient < 0.9 and tolerance > 0.1.
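
These acceptance checks can be sketched in a few lines. The sketch assumes the usual definition of tolerance, 1 - R_j^2, where R_j^2 comes from regressing descriptor j on the other descriptors; this equals the reciprocal of the j-th diagonal element of the inverse correlation matrix. The data are random placeholders in which descriptor 3 is deliberately made almost identical to descriptor 1.

```matlab
X = randn(30, 3);                           % placeholder descriptors
X(:, 3) = X(:, 1) + 0.05 * randn(30, 1);    % descriptor 3 is nearly a copy of 1
[n, k] = size(X);

R       = corrcoef(X);                      % k-by-k pairwise correlation matrix
offDiag = abs(R - eye(k));                  % ignore the unit diagonal
maxR    = max(offDiag(:));                  % largest pair-wise |correlation|
tol     = 1 ./ diag(inv(R));                % tolerance_j = 1 - R_j^2 (assumed definition)

okSize = n > 4 * k;                         % enough compounds per descriptor
okCorr = maxR < 0.9;                        % no highly correlated pair
okTol  = all(tol > 0.1);                    % no descriptor explained by the others
fprintf('n > 4k: %d, max|r| < 0.9: %d, tolerance > 0.1: %d\n', okSize, okCorr, okTol);
```

Here the size rule passes while the correlation and tolerance rules fail, so the descriptor set would not be accepted.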

Page 19

Model validation

Validation strategies:

- Randomization of the modeled property (Y-scrambling).
- Internal validation: only the training set is used.
- External validation: division into training and test sets.
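
A minimal sketch of Y-scrambling, assuming an MLR model with an intercept and random placeholder data: the response is shuffled many times, the model is refitted each time, and the R2 of the real response is compared with the R2 values of the scrambled ones. The number of permutations (100) is an arbitrary choice.

```matlab
n = 30;  k = 3;
X  = randn(n, k);                              % placeholder descriptors
y  = X * [1; -2; 0.5] + 0.1 * randn(n, 1);     % placeholder response with real structure
Xc = [ones(n, 1) X];                           % add an intercept column

% R2 of an MLR fit of a response yy on Xc
r2 = @(yy) 1 - sum((yy - Xc * (Xc \ yy)).^2) / sum((yy - mean(yy)).^2);

R2true = r2(y);
nPerm  = 100;
R2perm = zeros(nPerm, 1);
for p = 1:nPerm
    R2perm(p) = r2(y(randperm(n)));            % refit on a shuffled response
end
fprintf('R2 = %.3f, mean scrambled R2 = %.3f\n', R2true, mean(R2perm));
```

If the scrambled responses give R2 values close to the real one, the original fit is likely a chance correlation.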

Page 20

Model validation

Predictive power of QSAR models:

It is estimated from a sufficiently large external test set of compounds that were not used in the model development.

$$q^2_{ext} = 1 - \frac{\sum_{i=1}^{test} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{test} (y_i - \bar{y}_{train})^2}$$

Golbraikh, et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.

Zefirov, et al., QSAR for boiling points of “small” sulfides. Are the “high-quality structure-property-activity regressions” the real high quality QSAR models?, J Chem Inf Comput Sci (2001) 41, 1022-1027.
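
A small worked example of the external q2, using made-up numbers (all three vectors are hypothetical). Following the formula on this slide, the test responses are referenced to the mean of the training responses.

```matlab
caly    = [1.0; 2.0; 3.0; 4.0];    % hypothetical training responses
valy    = [1.5; 2.5; 3.5];         % hypothetical test responses
valyEST = [1.4; 2.7; 3.3];         % hypothetical model predictions for the test set

press = sum((valy - valyEST).^2);       % residual sum of squares on the test set
ssTot = sum((valy - mean(caly)).^2);    % spread of the test responses around the training mean
q2ext = 1 - press / ssTot               % 1 - 0.09/2 = 0.955
```

When the predictions are worse than simply using the training mean, press exceeds ssTot and q2ext becomes negative.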

Page 21

Model validation

[Figure: y versus calibration sample number (Train) and y versus test sample number (Test).]

Train:

$$R^2 = 1 - \frac{\sum_{i=1}^{training} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{training} (y_i - \bar{y})^2}$$

Test:

$$q^2 = 1 - \frac{\sum_{j=1}^{test} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{test} (y_j - \bar{y})^2}$$

The numerators are the residual sums of squares (residual SS).

Page 22

Model validation

[Figure: y versus calibration sample number (Train) and y versus test sample number (Test).]

Train:

$$R^2 = 1 - \frac{\sum_{i=1}^{training} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{training} (y_i - \bar{y})^2}$$

Test:

$$q^2 = 1 - \frac{\sum_{j=1}^{test} (y_j - \hat{y}_j)^2}{\sum_{j=1}^{test} (y_j - \bar{y})^2}$$

The denominators are the total variance sums of squares (Tot variance SS).

Page 23

Model validation

[Figure: y versus calibration sample number (Train) and y versus test sample number (Test).]

Train: R2 = 1.0000

$$R^2 = 1 - \frac{8.11 \times 10^{-26}}{14.9}$$

Test: q2 = -8.5220

$$q^2 = 1 - \frac{52.6}{5.5}$$

Page 24

Internal validation

Cross validation (CV), applied to the training set:

- Leave-one-out (LOO) (common)
- Leave-many-out (LMO) (sometimes)

CV correlation coefficient (similar to R2!):

$$q^2_{LOO} = 1 - \frac{\sum_{i=1}^{train} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{train} (y_i - \bar{y})^2}$$

Here $\hat{y}_i$ is the value predicted for sample i by the model built without that sample.

Page 25

Internal validation

Training set only.

Cross validation: leave-one-out.

Useful when only a small number of molecules is available.

Page 26

Internal validation

Subsamples (copies of the training set)

# subsamples = # molecules

Page 27

Internal validation

Each subsample is split into a sub-training set (SubTrain1, SubTrain2, SubTrain3, …) and a sub-validation set (SubValid1, SubValid2, SubValid3, …). The squared prediction errors of the sub-validation sets are accumulated:

$$cumPRESS = (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + (y_3 - \hat{y}_3)^2 + (y_4 - \hat{y}_4)^2 + (y_5 - \hat{y}_5)^2$$

# subsamples = # molecules in the training set

Page 28

Internal validation: LOO CV

% X, y: training descriptors and responses
Dr = size(X,1); % number of training samples
for i = 1:Dr
  calX = [X(1:i-1,:); X(i+1:Dr,:)]; % leave sample i out
  valX = X(i,:);
  caly = [y(1:i-1,:); y(i+1:Dr,:)];
  valy = y(i,:);
  b = (calX\caly)';
  valyEST(i) = valX*b';
  press(i) = (valyEST(i)-valy).^2;
end
cumpress = sum(press);
rmsecv = sqrt(cumpress/Dr);
q2LOO = 1-((y-valyEST')'*(y-valyEST'))/((y-mean(y))'*(y-mean(y)))
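
The leave-many-out (LMO) variant can be sketched along the same lines. The sketch assumes random placeholder data and five random groups of equal size (both are illustrative choices): each group is left out once, the model is fitted on the rest, and the left-out samples are predicted.

```matlab
n = 30;
X = [ones(n, 1) randn(n, 2)];                  % placeholder: intercept + 2 descriptors
y = X * [0.5; 1; -1] + 0.2 * randn(n, 1);      % placeholder response

nGroups = 5;                                   % leave out n/nGroups samples at a time
group   = mod(randperm(n), nGroups)' + 1;      % random group label 1..nGroups per sample
yCV     = zeros(n, 1);
for g = 1:nGroups
    out = (group == g);                        % samples left out in this round
    b   = X(~out, :) \ y(~out);                % fit on the remaining samples
    yCV(out) = X(out, :) * b;                  % predict the left-out samples
end

q2LMO  = 1 - sum((y - yCV).^2) / sum((y - mean(y)).^2);
rmsecv = sqrt(sum((y - yCV).^2) / n);
fprintf('q2LMO = %.3f, RMSECV = %.3f\n', q2LMO, rmsecv);
```

Setting nGroups equal to n reduces this to leave-one-out.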

Page 29

Internal validation

[Figure: LOO CV, y versus training sample number.]

q2LOO = -4.8574

RMSECV = 2.0397

>> q2ASYMPTOT=1-(1-R2)*(calDr/(calDr-calDc))^2 % calDr, calDc: number of rows and columns of calD
q2ASYMPTOT = 1.0000

>> if q2LOO-q2ASYMPTOT<0.005,disp('reject'),end
REJECT

q2LOO and R2 should not be considerably different.

Page 30

Internal validation

Many authors consider q2LOO > 0.5 as an indicator of the high predictive power of a model, and either do not evaluate the model on an external test set or use a test set of only one or two compounds.

Example: Cronin, et al., The importance of hydrophobicity and … in mechanistically based QSARs for toxicological endpoints, SAR QSAR Environ. Res. (2002) 13, 167-176.

Example: Moss, et al., Q. S. Permeability Relationships for percutaneous absorption, Toxicol. In Vitro (2002) 16, 299-317.

Example: Suzuki, et al., Classification of environ. estrogens by physicochem. properties using PCA and hierarchical cluster analysis, J Chem Inf Comput Sci (2001) 41, 718-726.

Page 31

Internal validation

A small value of q2LOO or q2LMO indicates low prediction ability,

but the opposite is not necessarily true (a high q2LOO is necessary but not sufficient).

A high value indicates robustness, but not the prediction ability of the model.

Page 32

Internal validation

It has been shown that there exists no correlation between the LOO cross-validated q2LOO and the correlation coefficient R2 between the predicted and observed activities for an external test set.

Kubinyi, et al., Three dimensional quant. similarity-activ. relationships (QSiAR) from SEAL similarity matrices, J Med Chem (1998) 41, 2553-2564.

Golbraikh, et al., Beware of q2!, J Mol Graph Model (2002) 20, 269-276.

A high q2LOO is a necessary condition for a model to have high predictive power, but not a sufficient condition.

Page 33

QUIK

R. Todeschini, et al., Detecting bad regression models: multicriteria fitness functions in regression analysis, Anal. Chim. Acta (2004) 515, 199-208.

For illustration of the correlation (collinearity) among independent variables.

Based on the multivariate correlation index K.

Page 34

QUIK

Example with 4 correlated descriptors:

M =
1 1 1 2
2 2 2 4
3 3 3 6
4 4 4 8
5 5 5 10

y =
10
20
30
40
50

>> corr(M)
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1

>> p=size(M,2);
>> CorrEV=svds(corr(M),p);

[Figure: eigenvalue versus factor number (1 to 4).]

It seems possible to use svd(M).

Page 35

QUIK

>> K=sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p);

>> [KM]=QUIK(M) % function
KM = 1.0000 (maximum correlation between descriptors)

>> [KMY]=QUIK([M Y]) % in the presence of the dependent variable
KMY = 1.0000

>> if KMY-KM<0.05,disp('reject'),else,disp('NOT reject'), end
REJECT
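
The QUIK function itself is not listed on the slides. A possible sketch, built directly from the K formula above, is given here; `kIndex` and `quikK` are hypothetical helper names, and `corrcoef` with `eig` is used in place of `corr` with `svds` so that no toolbox is needed. It is applied to the fully correlated example and to a small design with uncorrelated columns.

```matlab
% K index from the eigenvalues ev of a correlation matrix (formula from the slide)
kIndex = @(ev) sum(abs(ev / sum(ev) - 1 / numel(ev))) / (2 * (numel(ev) - 1) / numel(ev));
% K index of a data matrix A (hypothetical stand-in for the slides' QUIK function)
quikK  = @(A) kIndex(eig(corrcoef(A)));

M = [1 1 1 2; 2 2 2 4; 3 3 3 6; 4 4 4 8; 5 5 5 10];   % fully correlated descriptors
y = [10; 20; 30; 40; 50];

KM  = quikK(M);          % 1: maximum correlation among the descriptors
KMY = quikK([M y]);      % 1: adding y changes nothing
if KMY - KM < 0.05, disp('reject'), else, disp('NOT reject'), end

K0 = quikK([1 1; 1 -1; -1 1; -1 -1]);   % 0: uncorrelated columns
```

K runs from 0 (no correlation, all eigenvalues equal) to 1 (a single nonzero eigenvalue), and the QUIK rule asks that including y raises K by more than the chosen threshold.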

Page 36

QUIK

>> M=rand(4,5)
M =
.79 .17 .87 .89 .28
.96 .98 .74 .20 .47
.52 .27 .14 .30 .06
.88 .25 .01 .66 .99

y =
1
2
3
4

>> corr(M)
1 .5468 .3863 .1101 .6879
.5468 1 .3623 -.7227 .0419
.3863 .3623 1 .1784 -.3545
.1101 -.7227 .1784 1 .2450
.6879 .0419 -.3545 .2450 1

Page 37

QUIK

>> [KM]=QUIK(M)
KM = 0.5000

>> [KMY]=QUIK([M Y])
KMY = 0.6000

>> if KMY-KM<0.05,disp('reject'),else,disp('NOT reject'), end
NOT REJECTED

[Figure: two plots of eigenvalue versus factor number.]

Page 38

QUIK

>> [KM]=QUIK(calD) % Selwood data, all descriptors
KM = 0.7919

>> [KMY]=QUIK([calD Y])
KMY = 0.7923

>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
REJECTED

[Figure: two plots of eigenvalue versus factor number (0 to 50).]

Page 39

Development of an MLR model using all descriptors is not acceptable.

The model can be improved by using a factor-based method,

…and by descriptor selection.
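
As an illustration of a factor-based method, a principal component regression (PCR) sketch is given here. PCR is only one possible choice (PLS is another), the data are random placeholders with the shape of the Selwood calibration set, and the number of factors is picked arbitrarily.

```matlab
calD = randn(21, 53);  caly = randn(21, 1);   % placeholder data, not the Selwood set
nFac = 3;                                     % number of factors kept (arbitrary here)

mD = mean(calD);  my = mean(caly);
Dc = calD - repmat(mD, size(calD, 1), 1);     % column-centre the descriptors
[U, S, V] = svd(Dc, 'econ');
T = U(:, 1:nFac) * S(1:nFac, 1:nFac);         % scores on the first nFac factors
q = T \ (caly - my);                          % regress the centred y on the scores
b = V(:, 1:nFac) * q;                         % back to one coefficient per descriptor

calyEST = Dc * b + my;
rmsec   = sqrt(sum((caly - calyEST).^2) / numel(caly));
fprintf('RMSEC with %d factors = %.3f\n', nFac, rmsec);
```

Because only a few factors enter the regression, the fit is no longer exact, and the number of factors would normally be chosen by cross validation.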

Page 40

A number of descriptors

Development of an MLR model using a number of descriptors:

>> D=Dini(:,[51 37 35 38 39 36 15]);

RMSEC= 0.4989

RMSEP= 0.4993

Comparable. Improved.

[Figure: caly versus calyEST and testy versus testyEST; calibration and test residuals versus sample index.]

Page 41

A number of descriptors

R2 = 0.6495

q2 = 0.5490

Comparable. Improved.

q2LOO = 0.2816

NOT REJECTED

[Figure: y versus calibration sample number, y versus test sample number, and LOO CV y versus training sample number.]

Page 42

A number of descriptors: QUIK

>> D=Dini(:,[51 37 35 38 39 36 15]);

KX = 0.6384

KXY = 0.5996

>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
REJECTED

[Figure: two plots of eigenvalue versus factor number (1 to 7 and 1 to 8).]

Page 43

A number of descriptors: QUIK

>> D=Dini(:,[51 1 38]);

KX = 0.3159

KXY = 0.3953

>> if KMY-KM<0.03,disp('reject'),else,disp('NOT reject'), end
NOT REJECTED

[Figure: two plots of eigenvalue versus factor number (1 to 3 and 1 to 4).]

Page 44

Using a proper set of descriptors, improved results can be obtained from MLR.

But how can the proper set of descriptors be selected?

Page 45

Descriptor Selection

Descriptor selection:

- Forward selection
- Backward elimination
- Genetic algorithm
- Kohonen map
- SPA
- CWSPA
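
The first method in the list, forward selection, can be sketched as follows. The sketch assumes placeholder data in which the response is built from descriptors 3 and 6, and it uses the residual sum of squares of the training fit as the selection criterion; a cross-validated criterion would be the safer choice in practice.

```matlab
n = 40;
X = randn(n, 8);                                 % placeholder descriptor pool
y = 2 * X(:, 3) - X(:, 6) + 0.1 * randn(n, 1);   % response built from columns 3 and 6
nSel = 2;                                        % number of descriptors to pick

selected = [];
for step = 1:nSel
    best = 0;  bestSSE = Inf;
    for j = setdiff(1:size(X, 2), selected)      % try every unused descriptor
        A   = [ones(n, 1) X(:, [selected j])];   % intercept + candidate subset
        res = y - A * (A \ y);                   % residuals of the MLR fit
        if res' * res < bestSSE
            bestSSE = res' * res;  best = j;
        end
    end
    selected = [selected best];                  % keep the best candidate
end
disp(selected)
```

Backward elimination works the other way round: it starts from all descriptors and removes the least useful one at each step.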

Page 46

Descriptor Selection

Kohonen map

Selwood data matrix: 53 × 31

Rows (descriptors) are the input for the Kohonen map:

1. Sampling from all regions of the descriptor space

2. Sampling from regions in which the descriptors have a high correlation with Y (activity)

By: Mehdi Vasighi

Page 47

Descriptor Selection

Successive projections algorithm (SPA)

Y. Akhlaghi and M. Kompany-Zareh, Application of RBFNN and successive projections algorithm in a QSAR study of anti-HIV activity of HEPT derivatives, Journal of Chemometrics (2006) 20, 1-12.

SPA is a forward selection method that starts with one variable and incorporates a new one at each iteration, until a specified number N of variables is reached. To minimize the collinearity between the selected descriptors, the criterion for the stepwise selection is the orthogonality of a candidate variable to the previously selected variables.
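
A compact sketch of the projection idea, written from the description above rather than from the published code: at each step every column is projected onto the subspace orthogonal to the last selected column, and the column with the largest remaining norm is selected next. The data are random placeholders in which column 7 is almost a copy of the starting column 2, so it should never be picked.

```matlab
X = randn(25, 10);                          % placeholder descriptor matrix
X(:, 7) = X(:, 2) + 0.01 * randn(25, 1);    % column 7 is almost a copy of column 2
X = X - repmat(mean(X), size(X, 1), 1);     % column-centre

start = 2;                                  % starting vector (a user choice)
N     = 4;                                  % number of descriptors to select

P   = X;                                    % working copy that gets projected
sel = start;
for it = 2:N
    v = P(:, sel(end));                     % most recently selected (projected) column
    P = P - v * (v' * P) / (v' * v);        % remove the component along v from every column
    normsq = sum(P.^2, 1);                  % what is left of each column
    normsq(sel) = 0;                        % never re-select a column
    [~, next] = max(normsq);
    sel = [sel next];
end
disp(sel)
```

After the first projection almost nothing is left of column 7, which is how the procedure avoids collinear descriptors.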

Page 48

Descriptor Selection

Araujo, et al., The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. (2001) 57, 65–73.

Important parameters:

1. Starting vector

2. N, maximum number of descriptors

Page 49

Descriptor Selection

Correlation weighted SPA

A limitation of SPA is that the only criterion for the stepwise selection of variables is their orthogonality to the previously selected variables; the relation of an entered vector, as an independent variable, to the response is not considered.

Incorporating a form of correlation ranking into the SPA procedure, by which the variables are weighted by their correlation coefficient with the dependent variable, overcomes this limitation.

M. Kompany-Zareh and Y. Akhlaghi, Correlation weighted successive projections algorithm: A QSAR study of anti-HIV activity of HEPT derivatives, J of Chemom (2007) 21, 239-250.
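
One way to read the weighting idea is sketched here. This is an assumption made for illustration and not the published CWSPA procedure: the norm of each projected column is multiplied by the absolute correlation of the original descriptor with y, and the best-correlated descriptor is used as the starting vector. The data are random placeholders in which the response is driven by column 4.

```matlab
n = 30;
X = randn(n, 6);                               % placeholder descriptors
y = X(:, 4) + 0.1 * randn(n, 1);               % placeholder response driven by column 4
X = X - repmat(mean(X), n, 1);                 % column-centre

C = corrcoef([X y]);
w = abs(C(1:end-1, end))';                     % |correlation| of each descriptor with y

P = X;  N = 3;                                 % N: number of descriptors to select
[~, sel] = max(w);                             % assumed start: best-correlated descriptor
for it = 2:N
    v = P(:, sel(end));
    P = P - v * (v' * P) / (v' * v);           % SPA projection step
    score = w .* sqrt(sum(P.^2, 1));           % projection norm weighted by correlation (assumption)
    score(sel) = 0;                            % never re-select a column
    [~, next] = max(score);
    sel = [sel next];
end
disp(sel)
```

With equal weights the score reduces to the plain SPA criterion, so the weights only shift the choice towards descriptors related to the response.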