Quantitative Structure-Activity Relationship (QSAR) · Ligand-based approach • Structure-Activity...

Post on 06-Aug-2019


Course Outline

1. Ligand-based approaches
   1. (Quantitative) structure-activity relationship (SAR & QSAR)
   2. Pharmacophore modeling

2. Bioinformatics approaches (target recognition and structural modeling)
   1. Sequence alignments and searches
   2. Gene identification and prediction
   3. Homology modeling

3. Structure-based approaches
   1. Molecular docking
      1. Ligand docking: theory and scoring functions
      2. Virtual screening
      3. Protein-protein docking and interaction
   2. Molecular dynamics simulation
      1. Introduction to molecular dynamics
   3. Estimation of ligand binding affinity
      1. Free energy perturbation
      2. Enhanced sampling methods

Ligand-based approach

• Structure-Activity Relationships (SAR)

• Quantitative Structure-Activity Relationships (QSAR)

Biological activity = f(Molecular descriptors)

QSAR: Historical perspective

1900. Meyer-Overton

Public Domain, https://commons.wikimedia.org/w/index.php?curid=6597630


1964. Hansch analysis

Hansch & Fujita, JACS 1964

[Equation: Hansch equation for log(1/C); the individual terms are not recoverable from this transcript.]

Quantitative Structure-Activity Relationships (QSAR)

Definition

QSAR is the building of a mathematical model that correlates a set of structural descriptors of a set of chemical compounds with their biological activity.

QYXR is the building of a mathematical model that correlates a set of independent variables (X) of a set of samples with a set of dependent variables (Y).

Quantitative Structure-Activity Relationships (QSAR)

1. Set of compounds

Considerations

• All compounds should belong to a congeneric series

• Same mechanism of action

• A similar binding mechanism

4. Biological activities

Considerations

• The biological activity measured should be exactly the same for all compounds

• Biological activity is correlated with binding affinity


Quantitative Structure-Activity Relationships (QSAR)

1. Set of compounds

2. Molecular descriptors

3. Mathematical models

4. Biological activities

$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$

Multiple Linear Regression (MLR)

Partial Least Squares (PLS)

Artificial Neural Network (ANN)

Genetic Algorithm (GA)

Molecular descriptors

1D descriptors: molecular weight, LogP, number of functional groups

2D descriptors: topological indices

3D descriptors: geometrical parameters, molecular surfaces, quantum chemistry descriptors

2D descriptors

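Topological indices based on adjacency matrix

[Figure: hydrogen-suppressed molecular graph with six atoms. Atoms 1 and 2 are bonded to atom 3, atoms 5 and 6 are bonded to atom 4, and atoms 3 and 4 are bonded to each other.]

Distance matrix D, where d_ij is the number of bonds on the shortest path between atoms i and j:

       1  2  3  4  5  6
  1    0  2  1  2  3  3
  2    2  0  1  2  3  3
  3    1  1  0  1  2  2
  4    2  2  1  0  1  1
  5    3  3  2  1  0  2
  6    3  3  2  1  2  0

$TI = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} = 29$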
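The value TI = 29 on this slide can be checked with a short sketch. It assumes the index is half the sum of all topological distances and that the six-atom graph has atoms 1 and 2 on atom 3, atoms 5 and 6 on atom 4, and a 3-4 bond; the bond list and the function names are illustrative, not taken from the slides.

```python
from collections import deque

def distance_matrix(adjacency):
    """Topological distances (bond counts) by breadth-first search from every atom."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for other in range(n):
                if adjacency[atom][other] and other not in seen:
                    seen[other] = seen[atom] + 1
                    queue.append(other)
        for atom, d in seen.items():
            dist[start][atom] = d
    return dist

def topological_index(adjacency):
    """Half the sum of all entries of the distance matrix."""
    dist = distance_matrix(adjacency)
    return sum(sum(row) for row in dist) // 2

# Assumed bond list (1-based atom labels) for the six-atom example graph
bonds = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6)]
adjacency = [[0] * 6 for _ in range(6)]
for a, b in bonds:
    adjacency[a - 1][b - 1] = adjacency[b - 1][a - 1] = 1

print(topological_index(adjacency))  # 29
```

The adjacency matrix is the only input; the distance matrix is derived from it, which is why the slide can describe the index as based on the adjacency matrix.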

3D descriptors

Quantum chemical descriptors

Descriptors calculated by quantum mechanical methods (semi-empirical, ab initio, or DFT)

Partial atomic charges

Lowest unoccupied molecular orbital energy (LUMO)

Highest occupied molecular orbital energy (HOMO)

Electrostatic potential

Molecular polarizability

Molecular descriptor software

Dragon

GAUSSIAN

HyperChem

CODESSA

MOE


Multiple Linear Regression (MLR)

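Model ($b_0$ is the intercept; $b_1, \ldots, b_k$ are the coefficients):

$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$

Objective function (minimized over the P samples):

$\sum_{i=1}^{P} \left( Y_i - b_0 - b_1 X_{i,1} - \cdots - b_k X_{i,k} \right)^2$

Least-squares solution:

$b = (X'X)^{-1} X'Y$

$\hat{Y} = Xb = X (X'X)^{-1} X'Y$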
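The least-squares solution b = (X'X)^-1 X'Y can be sketched in plain Python as follows. This is a minimal illustration, not the course's implementation: the helper names `solve`, `fit_mlr` and `predict` and the small data set are made up for the example, and the normal equations are solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_mlr(X, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in X]   # leading column of ones gives the intercept
    cols = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(cols)] for i in range(cols)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(cols)]
    return solve(xtx, xty)                      # solves (X'X) b = X'Y

def predict(b, X):
    """Evaluate Y = b0 + b1 X1 + ... + bk Xk for every row of X."""
    return [b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)) for row in X]

# Made-up data generated exactly from Y = 1 + 2 X1 - 3 X2
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in X]
b = fit_mlr(X, y)
print([round(v, 6) for v in b])
```

Because the toy responses contain no noise, the fitted coefficients reproduce the generating equation up to rounding.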

Multiple Linear Regression (MLR)

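$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$

Residuals, for P samples and k descriptors:

$r_i = Y_{i,\mathrm{exp}} - Y_{i,\mathrm{estimated}} = Y_i - \hat{Y}_i$

Residual variance:

$\sigma^2 = \frac{\sum_{i=1}^{P} r_i^2}{P - k - 1}$

F statistic:

$F = \frac{\sum_{i=1}^{P} (\hat{Y}_i - \bar{Y})^2 / k}{\sum_{i=1}^{P} (Y_i - \hat{Y}_i)^2 / (P - k - 1)}$

Coefficient of determination:

$R^2 = 1 - \frac{\sum_{i=1}^{P} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{P} (Y_i - \bar{Y})^2}$

Akaike Information Criterion:

$\mathrm{AIC} = P \log\left( \frac{\sum_{i=1}^{P} r_i^2}{P} \right) + 2(k + 1)$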
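The statistics on this slide can be computed from the experimental and estimated responses as in the sketch below. It assumes P samples, k descriptors, a natural logarithm in the AIC, and the (P - k - 1) denominator for the residual variance; the function name and the toy numbers are illustrative only.

```python
import math

def regression_statistics(y, y_hat, k):
    """Residual variance, R2, F and AIC for a fitted model with k descriptors."""
    p = len(y)
    mean_y = sum(y) / p
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual sum of squares
    tss = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
    ess = sum((fi - mean_y) ** 2 for fi in y_hat)           # explained sum of squares
    return {
        "s2": rss / (p - k - 1),
        "R2": 1 - rss / tss,
        "F": (ess / k) / (rss / (p - k - 1)),
        "AIC": p * math.log(rss / p) + 2 * (k + 1),
    }

# Toy values, not taken from the slides
y_exp = [1.0, 2.0, 3.0, 4.0, 5.0]
y_est = [1.1, 1.9, 3.2, 3.9, 4.9]
stats = regression_statistics(y_exp, y_est, k=1)
for name, value in stats.items():
    print(name, round(value, 3))
```

A lower AIC is better; the 2(k + 1) term penalizes every extra descriptor, which is what makes the criterion usable for variable selection.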

Multiple Linear Regression (MLR)

No.   X1      X2       X3     X4       Yexp   Ycalc   Residual
1     3.42    38.51    6.62   6.63     3      2.9     0.1
2     3.05    38.91    6.61   6.04     3.15   3.37    -0.22
3     2.52    54.28    6.58   6.23     3.28   3.07    0.21
4     3.29    54.27    6.63   6.09     4.24   3.91    0.33
5     2.25    54.62    6.61   6.03     3.28   3.14    0.14
6     2.42    55.37    6.59   5.67     4.35   3.75    0.6
7     3.15    70.6     6.67   6.51     3.88   3.69    0.19
8     1.67    69.77    6.49   5.79     3.64   3.3     0.34
9     2.91    70.03    6.64   6.11     4.35   3.99    0.36
10    1.73    70.57    6.61   6.04     3.4    3.11    0.29
11    1.36    86.18    6.64   6.12     3.3    3.12    0.18
12    2.81    85.83    6.62   6.05     4.7    4.38    0.32
13    2.96    102.96   6.66   6.52     4.67   4.35    0.32
14    0.65    102.7    6.61   6.04     3.34   3.06    0.28
15    2.22    117.89   6.62   6.04     4.11   4.74    -0.63
16    0.19    118.98   6.61   6.18     3.37   2.92    0.45
17    2.85    135.34   6.67   6.52     5.93   5.1     0.83
18    0.39    134.08   6.65   6.32     3.65   3.31    0.34
19    3.58    22.34    6.7    6.6      2.7    2.69    0.01
20    3.41    54.34    6.62   6.64     3.49   3.29    0.2
21    0.43    77.39    1.87   4.37     1.99   1.87    0.12
22    0.35    93.05    1.88   4.34     2.38   2.25    0.13
23    0.09    109.53   1.87   4.34     2.76   2.46    0.3
24    -0.2    125.8    1.88   4.34     3.29   2.65    0.64
25    1.41    87.61    0.35   -14.65   0.87   0.85    0.02

σ² = 0.170, R² = 0.899, F = 42.4

$Y = 4.224 - 1.305 X_1 + 0.535 X_2 + 0.026 X_3 + 0.817 X_4$

σ²(Y) = 0.712

Variable selection

1. Systematic approaches

1. Forward selection

2. Backward elimination

2. Heuristic approaches

1. Genetic algorithm

2. Simulated annealing

Forward selection

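$\mathrm{AIC} = P \log\left( \frac{\sum_{i=1}^{P} (Y_i - \hat{Y}_i)^2}{P} \right) + 2(k + 1)$

AIC of each candidate model (Xn is the descriptor being tried; "-" marks a descriptor already in the model):

Model                              X1     X2     X3     X4     X5
Y = a + Xn                         57.7   60.70  54.7   56.1   56.5
Y = a + X3 + Xn                    56.3   47.55  -      56.7   56.5
Y = a + X3 + X2 + Xn               29.4   -      -      49.5   48.3
Y = a + X3 + X2 + X1 + Xn          -      -      -      13.8   25.1
Y = a + X3 + X2 + X1 + X4 + Xn     -      -      -      -      15.7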
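Forward selection by AIC can be sketched as a loop that keeps adding the descriptor giving the lowest AIC and stops when no addition lowers it. The sketch below is one possible reading of the slide, under stated assumptions: the search starts from an intercept-only model, AIC uses the natural logarithm, and the data are a made-up orthogonal design in which only the first and third descriptors carry signal. All function names are illustrative.

```python
import math

def ols_rss(X, y, columns):
    """Residual sum of squares of a least-squares fit with intercept + chosen columns."""
    design = [[1.0] + [row[c] for c in columns] for row in X]
    n = len(design[0])
    # Augmented normal equations [X'X | X'y]
    m = [[sum(r[i] * r[j] for r in design) for j in range(n)]
         + [sum(r[i] * yi for r, yi in zip(design, y))] for i in range(n)]
    for col in range(n):                                   # Gaussian elimination
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [a - f * b for a, b in zip(m[r], m[col])]
    b = [0.0] * n
    for r in reversed(range(n)):                           # back substitution
        b[r] = (m[r][n] - sum(m[r][c] * b[c] for c in range(r + 1, n))) / m[r][r]
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(design, y))

def aic(X, y, columns):
    p = len(y)
    return p * math.log(ols_rss(X, y, columns) / p) + 2 * (len(columns) + 1)

def forward_selection(X, y):
    """Add one descriptor at a time while the AIC keeps decreasing."""
    selected, remaining = [], list(range(len(X[0])))
    best = aic(X, y, selected)                             # intercept-only model
    while remaining:
        scores = {c: aic(X, y, selected + [c]) for c in remaining}
        candidate = min(scores, key=scores.get)
        if scores[candidate] >= best:                      # no improvement: stop
            break
        best = scores[candidate]
        selected.append(candidate)
        remaining.remove(candidate)
    return selected, best

# Made-up data: four orthogonal +/-1 descriptors, response driven by columns 0 and 2
x0 = [-1, -1, -1, -1, 1, 1, 1, 1]
x1 = [-1, -1, 1, 1, -1, -1, 1, 1]
x2 = [-1, 1, -1, 1, -1, 1, -1, 1]
x3 = [a * b * c for a, b, c in zip(x0, x1, x2)]
X = [list(row) for row in zip(x0, x1, x2, x3)]
y = [2 + 3 * a - 1.5 * c + 0.1 * a * b for a, b, c in zip(x0, x1, x2)]

selected, final_aic = forward_selection(X, y)
print(selected)  # [0, 2] (0-based column indices)
```

Backward elimination follows the same pattern in reverse: start from all descriptors and repeatedly drop the one whose removal gives the lowest AIC, stopping when no removal lowers it.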

Backward elimination

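$\mathrm{AIC} = P \log\left( \frac{\sum_{i=1}^{P} (Y_i - \hat{Y}_i)^2}{P} \right) + 2(k + 1)$

Full model: Y = a + X1 + X2 + X3 + X4 + X5, AIC = 15.7

AIC after removing each descriptor in turn:

Descriptor removed    X1     X2     X3     X4     X5     Resulting model
Round 1               21.8   50.6   59.9   25.1   13.8   Y = a + X1 + X2 + X3 + X4
Round 2               31.9   49.5   58.0   29.4   -      Y = a + X1 + X2 + X3 + X4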

Genetic algorithm

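GENOME: one bit per descriptor X1 to X10 (1 = descriptor included in the model)

X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
0  1  0  0  1  0  0  1  0  0

$Y = b_0 + b_2 X_2 + b_5 X_5 + b_8 X_8$, scored by AIC

Genomes shown on the slide:

0 0 0 0 1 0 0 1 0 1     0 1 0 0 1 0 0 0 1 0
1 0 0 0 1 0 0 1 0 1     0 1 0 0 1 0 0 0 1 1
0 1 0 0 1 0 0 1 0 1     0 1 0 0 1 0 0 0 0 0

Mutation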
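A genetic algorithm over descriptor bit strings can be sketched as below. The slide names only the genome, the AIC score, and mutation; the truncation selection, one-point crossover, elitism, and all parameter values here are assumptions added for illustration. The fitness is passed in as a function: in QSAR use it would be the AIC of the MLR model built from the flagged descriptors, while the toy fitness below is only a stand-in that counts mismatches against the slide's example genome.

```python
import random

def genetic_selection(n_descriptors, fitness, pop_size=20, generations=30,
                      mutation_rate=0.1, seed=0):
    """Minimise fitness(genome) over bit strings; returns (best genome, score history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_descriptors)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness)                   # lower score = fitter
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]           # truncation selection
        children = [population[0][:]]                  # elitism: best genome kept unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, n_descriptors - 1)    # one-point crossover
            child = mother[:cut] + father[cut:]
            # mutation: flip each bit with a small probability
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = min(population, key=fitness)
    return best, history + [fitness(best)]

# Stand-in fitness (hypothetical): Hamming distance to the slide's example genome
target = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

def toy_fitness(genome):
    return sum(g != t for g, t in zip(genome, target))

best, history = genetic_selection(10, toy_fitness)
print(best, history[-1])
```

Because the best genome is always carried over unchanged, the best score can never get worse from one generation to the next.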

Partial least squares

Useful when:

• The X-variables are correlated

• The number of X-variables is relatively high compared with the number of samples

$X = T P^{T}$, $Y = U Q^{T}$

$Y = \beta X + \varepsilon$

$U = \beta T + \varepsilon$

Other modeling methods

Non-linear regression

Artificial neural network

Classification methods

Multiple logistic regression

Support vector machine

[Equation and figure: a non-linear regression equation for Y and a diagram with inputs X1, X2, X3 and output Y; the equation symbols were lost in extraction.]

Validation

Validation is required to ensure model quality

Over-fitting

Chance correlation

1. Cross-validation

1. Leave-one-out

2. Leave-N-out

2. Bootstrapping

3. External validation (prediction set)

4. Y randomization

Cross-validation

Leave-one-out

[Figure: the 20 samples Y1 to Y20; one sample at a time (Y20, then Y19, and so on) is left out and predicted by a model built on the remaining 19. Repeated P times.]

Result: R²cv(LOO)

Leave-N-out

[Figure: the 20 samples Y1 to Y20; N samples at a time (Y17 to Y20, then Y3 to Y6, and so on) are left out and predicted by a model built on the rest. Repeated P/N times.]

Result: R²cv(LNO)
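Leave-one-out and leave-N-out cross-validation differ only in how many samples are held out per round, so one sketch covers both. The slides do not give the formula for the cross-validated R², so this assumes the common definition 1 - PRESS / total sum of squares, uses a one-descriptor line as the model to keep the example short, and holds out consecutive blocks of samples; the function names and data are illustrative.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def r2_cv(x, y, n_out=1):
    """Cross-validated R2: hold out n_out consecutive samples, refit, predict them."""
    press = 0.0                                    # predictive residual sum of squares
    for start in range(0, len(x), n_out):
        held = range(start, min(start + n_out, len(x)))
        train_x = [v for i, v in enumerate(x) if i not in held]
        train_y = [v for i, v in enumerate(y) if i not in held]
        b0, b1 = fit_line(train_x, train_y)
        press += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held)
    mean_y = sum(y) / len(y)
    return 1 - press / sum((v - mean_y) ** 2 for v in y)

# Made-up data: roughly Y = 2 X + 1 with small deviations
x = list(range(1, 11))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.2, 20.9]

print(r2_cv(x, y, n_out=1))   # leave-one-out, P rounds
print(r2_cv(x, y, n_out=2))   # leave-2-out, P/2 rounds
```

Every held-out sample is predicted by a model that never saw it, which is what separates these values from the fitted R².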

Bootstrapping

[Figure: the 20 samples Y1 to Y20 are split at random into a training group and a left-out group (Y14, Y16, Y18, Y20 in one round; Y1, Y6, Y7, Y10 in another). Repeated N times.]

Result: R²BS

External validation

[Figure: the 20 samples Y1 to Y20 are divided into a training set (Y1, Y3 to Y8, Y10 to Y13, Y15 to Y17, Y19) and a prediction set (Y2, Y9, Y14, Y18, Y20).]

Workflow: variable selection, cross-validation, final model, then prediction of the prediction set.

Result: R²EV

Y-randomization

[Figure: the original responses Y1 to Y20 are paired with descriptor rows X1 to X20 and fitted as Y = βX + ε. The responses are then reordered (shown as Y20 to Y1) against the unchanged X1 to X20 and refitted as Ynew = βX + ε. Repeated N times.]

Result: R²Yrand
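Y-randomization can be sketched as: shuffle the responses, refit, record R², and repeat N times. The sketch below assumes the reported value is the mean R² over the shuffled fits and uses a one-descriptor line as the model so the example stays short; the function names, seed, and data are illustrative.

```python
import random

def r_squared(x, y):
    """R2 of a one-descriptor least-squares line (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_randomization(x, y, n_times=100, seed=0):
    """Mean R2 obtained after shuffling the responses n_times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_times):
        shuffled = y[:]
        rng.shuffle(shuffled)            # breaks the X-Y pairing, keeps the Y distribution
        scores.append(r_squared(x, shuffled))
    return sum(scores) / n_times

# Made-up data: a clear linear trend with a small alternating deviation
x = list(range(1, 21))
y = [0.5 * v + 1 + 0.2 * (-1) ** v for v in x]

print(r_squared(x, y))          # R2 of the real model
print(y_randomization(x, y))    # mean R2 of the scrambled models
```

A real structure-activity relationship should give a much higher R² than the scrambled fits; similar values would point to chance correlation.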

Good model?

$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$

Model robustness: σ², R² > 0.8, F, (R)MSE

Model quality: R²cv(LOO) > 0.6, R²cv(LNO) > 0.6, R²BS > 0.6, RMSEcv; R² - R²cv < 0.3

Model reliability: R²Yrand < 0.3, RMSEYrand; R² - R²Yrand > 0.4

Model predictability: R²EV > 0.6, RMSEEV; R² - R²EV < 0.3

$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{P} (Y_i - \hat{Y}_i)^2}{P - 1}}$
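The thresholds on this slide can be collected into a small checklist function. This is only a sketch: the function name and the grouping into four booleans are illustrative, a single cross-validated R² stands in for the LOO, LNO and bootstrap values, and the input numbers in the usage example are hypothetical rather than results from the course.

```python
def assess_model(r2, r2_cv, r2_yrand, r2_ev):
    """Apply the slide's rule-of-thumb thresholds to the four R2 values."""
    return {
        "robustness": r2 > 0.8,
        "quality": r2_cv > 0.6 and r2 - r2_cv < 0.3,
        "reliability": r2_yrand < 0.3 and r2 - r2_yrand > 0.4,
        "predictability": r2_ev > 0.6 and r2 - r2_ev < 0.3,
    }

# Hypothetical values for illustration
report = assess_model(r2=0.899, r2_cv=0.75, r2_yrand=0.10, r2_ev=0.70)
for criterion, passed in report.items():
    print(criterion, "pass" if passed else "fail")
```

The gap conditions matter as much as the absolute thresholds: a high fitted R² paired with a much lower cross-validated or external R² suggests over-fitting.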

Applicability domain

$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$

[Figure: compounds plotted in descriptor space, with axes X1 and X2.]

Principal component analysis

Prediction vs. description

Model 1:

Y = 2.34 + 3.5 VE_b(e) - 0.87 ATS1v + 3.76 SM02_AEA

σ² = 0.003, R² = 0.951, F = 260.2, R²EV = 0.891

VE_b(e): coefficient sum of the last eigenvector from the Burden matrix weighted by Sanderson electronegativity

ATS1v: Broto-Moreau autocorrelation of lag 1 (log function) weighted by van der Waals volume

SM02_AEA: spectral moment of order 2 from the augmented edge adjacency matrix weighted by resonance integral

Model 2:

Y = 8.34 + 2.5 LogP + 0.93 NAR

σ² = 0.113, R² = 0.811, F = 43.2, R²EV = 0.761

LogP: octanol-water partition coefficient

NAR: number of aromatic rings