Prediction Of Bioactivity From Chemical Structure

Prediction of bioactivity from chemical structure Small Molecule Bioactivity Resources At The EBI Jérémy Besnard [email protected] .uk

description

Presentation for the Small Molecule Bioactivity Resources At The EBI training course 2010

Transcript of Prediction Of Bioactivity From Chemical Structure

Page 1: Prediction Of Bioactivity From Chemical Structure

Prediction of bioactivity from chemical structure

Small Molecule Bioactivity Resources At The EBI

Jérémy Besnard

[email protected]

Page 2: Prediction Of Bioactivity From Chemical Structure

Myself

• PhD student at the University of Dundee
– Supervisor: Prof. Andrew Hopkins
– Lab: medicinal informatics

• Background
– Chemistry degree with some biology
– One industrial year at Pfizer in computational chemistry

Page 3: Prediction Of Bioactivity From Chemical Structure

Prediction of bioactivity

• Type of predictions
– How active is a compound?
  • Continuous model
– Is the compound active, or not?
  • Categorical model

QSAR – Quantitative Structure-Activity Relationship

Some slides are adapted from Richard Lewis's (Novartis) presentation at the University of Sheffield "Practical Introduction to Chemoinformatics" course (next in 2011).

Page 4: Prediction Of Bioactivity From Chemical Structure

Example

[Figure: scatter plot of Activity (pIC50, axis from 3 to 8) against Molecular Weight (axis from 150 to 550)]

Molecular Weight 180 220 250 290 340 380 450 500

Activity (pIC50) 4 4.3 4.8 5.4 4.8 5.8 7.5 7.7

Question: for Molecular Weight = 360, what is the activity?

Linear regression: Activity = 0.01 × Molecular Weight + 1.7 (R² = 0.900)

Predicted activity = 5.3

Question: is the compound active?

Category rule: Molecular Weight > 260 = active

Active: Yes
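The fit on this slide can be reproduced with an ordinary least-squares sketch in plain Python. The function name `fit_line` is an illustrative choice, not part of the course material; the data are the eight points tabulated above. With unrounded coefficients the slope is closer to 0.0118 than 0.01, so the prediction for a molecular weight of 360 comes out near 5.9 rather than the 5.3 obtained from the rounded equation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return m, c, r_squared


# Data from the Example slide
mol_weight = [180, 220, 250, 290, 340, 380, 450, 500]
activity = [4, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]

slope, intercept, r2 = fit_line(mol_weight, activity)

# Continuous model: predict the pIC50 of a compound with MW = 360
predicted = slope * 360 + intercept

# Categorical model: the slide's rule "Molecular weight > 260 = active"
is_active = 360 > 260

print(f"Activity = {slope:.4f} * MW + {intercept:.2f} (R2 = {r2:.3f})")
print(f"Predicted pIC50 at MW 360: {predicted:.2f}; active: {is_active}")
```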

Page 5: Prediction Of Bioactivity From Chemical Structure


QSAR

Activity = IC50, Ki, Ratios…

Molecular Descriptors

Topological (shape, size)

Physical & Thermodynamics

Chemical feature (substructure)

Activity = f(Molecular Descriptors)

[Diagram: statistics link the chemical structure, encoded as molecular descriptors (for example an FCFP_4 fingerprint), to the activity]

Page 6: Prediction Of Bioactivity From Chemical Structure

The absolute basics

• Activity + Representation + Method = QSAR

• Activity = experimental data

• Representation = description of the molecule

• Method = statistical tool to use
– Underlying principle: similar molecules should have similar activities

Page 7: Prediction Of Bioactivity From Chemical Structure

Advantages of Models

• Fast and cheap method
– Virtual screening: the computer does the manipulation
  • Human: days to weeks
  • Computer: seconds to hours

• Help understand the science behind the observation
– Tool to design compounds with a higher chance of being active

Page 8: Prediction Of Bioactivity From Chemical Structure

Activity

• It can be anything
– Continuous: IC50, % inhibition, EC50, ratios, …
– Categorical: Yes/No, Low/Medium/High

• Better if
– Data come from the same assay/conditions
– Data are of good quality (you trust the experimental data)

• For ADME endpoints
– Lots of software solutions, but not easy to predict!
  • Few experimental data points (and not very reliable)
  • In vivo phenomena

Page 9: Prediction Of Bioactivity From Chemical Structure

Molecular descriptors

www.moleculardescriptors.eu

• Many, many, many
• Simple counts
– Number of atoms, rings, hydrogen bond donors, acceptors, molecular weight…
• Physicochemical
– Hydrophobicity, polarity: cLogP, Polar Surface Area (PSA)
• Shape
– Topological indices
– Big, small, long, round
• 2D fingerprints
– Presence or absence of certain substructures
  • From a dictionary (MACCS, e.g. count of acids)
  • On the fly: look at the substructures present in the data
• 3D: fingerprints, electrostatics, shape

Page 10: Prediction Of Bioactivity From Chemical Structure

Fingerprint

• Binary vector: list of 0s and 1s
• Dictionary: fixed size, with each bit = one group (defined in advance)
• Hashed: fragment the molecule and assign each fragment to a bit position of the vector

[Figure: example molecule with a dictionary fingerprint; bit labels include Acid, Cl, Amide and aromatic ring]
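The difference between the two fingerprint types can be sketched in a few lines of Python. This is an illustration only: the dictionary of four groups, the fragment strings and the 16-bit length are made-up values, and `zlib.crc32` stands in for whatever hash function a real toolkit uses.

```python
import zlib

# Hypothetical dictionary: each bit position is tied to one predefined group.
DICTIONARY = ["acid", "Cl", "amide", "aromatic ring"]


def dictionary_fingerprint(groups_present):
    """One bit per dictionary entry: 1 if the group is in the molecule."""
    return [1 if group in groups_present else 0 for group in DICTIONARY]


def hashed_fingerprint(fragments, n_bits=16):
    """Hash every fragment to a bit position; different fragments may collide."""
    bits = [0] * n_bits
    for fragment in fragments:
        position = zlib.crc32(fragment.encode("utf-8")) % n_bits
        bits[position] = 1
    return bits


# A molecule described by the groups it contains (dictionary approach) ...
dict_fp = dictionary_fingerprint({"acid", "aromatic ring"})
# ... or by fragments generated on the fly (hashed approach).
hash_fp = hashed_fingerprint(["C(=O)O", "c1ccccc1", "CC"])

print(dict_fp)
print(hash_fp)
```

The dictionary version is easy to interpret because each bit has a fixed meaning; the hashed version needs no predefined list but loses that one-to-one mapping.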

Page 11: Prediction Of Bioactivity From Chemical Structure

Extending the Initial Atom Codes

• Fingerprint bits indicate presence and absence of certain structural features

• Fingerprints do not depend on a predefined set of substructural features

[Figure: atom environments of an example structure at Iteration 0, Iteration 1 and Iteration 2]

Each iteration adds bits that represent larger and larger structures

Page 12: Prediction Of Bioactivity From Chemical Structure

Generating the Fingerprint

• Iteration is repeated the desired number of times
– Each iteration extends the diameter by two bonds
• Codes from all iterations are collected
• Duplicate bits may be removed

> <FCFP_2#S>160131618154665203677720-154910344918721545241070061035...

> <FCFP_4#S>160131618154665203677720-154910344918721545241070061035991735244-453677277-581879738-1094243697690083042-975279903...

> <FCFP_0#S>16013

...
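The iteration scheme can be illustrated with a much simplified sketch. It is not the actual FCFP/ECFP algorithm: the initial atom code here is just a hash of the element symbol (an assumption), bond orders and duplicate-structure removal are ignored, and `circular_fingerprint` is a hypothetical name.

```python
import zlib


def circular_fingerprint(atoms, bonds, iterations):
    """Collect atom-environment codes for iterations 0..iterations.

    atoms: list of element symbols; bonds: list of (i, j) index pairs.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Iteration 0: one initial code per atom (here: hash of the symbol only).
    codes = [zlib.crc32(symbol.encode("utf-8")) for symbol in atoms]
    collected = set(codes)  # a set drops duplicate codes automatically

    for iteration in range(1, iterations + 1):
        new_codes = []
        for i in range(len(atoms)):
            # Combine the atom's own code with the sorted codes of its neighbours.
            environment = (iteration, codes[i],
                           tuple(sorted(codes[j] for j in neighbours[i])))
            new_codes.append(zlib.crc32(repr(environment).encode("utf-8")))
        codes = new_codes
        collected.update(codes)  # codes from all iterations are kept
    return collected


# Ethanol as a toy graph: C-C-O (hydrogens left implicit)
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
fp0 = circular_fingerprint(atoms, bonds, 0)  # diameter 0
fp1 = circular_fingerprint(atoms, bonds, 1)  # diameter 2
fp2 = circular_fingerprint(atoms, bonds, 2)  # diameter 4
print(len(fp0), len(fp1), len(fp2))
```

Each higher-diameter fingerprint contains all codes of the lower one, which mirrors how the FCFP_2 string above is a prefix of the FCFP_4 string.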

Page 13: Prediction Of Bioactivity From Chemical Structure

Data Sets

Page 14: Prediction Of Bioactivity From Chemical Structure

Validity of a model

• It is easy to introduce artefacts and “false correlation”

The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy), Johnson, J. Chem. Inf. Model., 2008, 48 (1), pp 25–26

Page 15: Prediction Of Bioactivity From Chemical Structure

Training and Test Sets

• Build the model from the training set

• Predict the test set

• Also called Leave-N-Out validation, where N ranges from 1 compound to 50% of the dataset.

• Cross validation: repeat the steps N times using complementary training and test sets.

http://www.cs.cmu.edu/~awm/tutorials

http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
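A minimal sketch of how complementary training and test sets can be generated for cross validation is shown below. The function name `k_fold_splits` and the choice of 10 compounds in 5 folds are illustrative assumptions; in practice the compound order would normally be shuffled first.

```python
def k_fold_splits(n_compounds, n_folds):
    """Yield (train_indices, test_indices) pairs; each compound is tested once."""
    indices = list(range(n_compounds))
    # Deal the indices into n_folds groups, like dealing cards.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


splits = list(k_fold_splits(10, 5))
for train, test in splits:
    # Build the model on `train`, then predict the compounds in `test`.
    print("train:", train, "test:", test)
```

Setting `n_folds` equal to the number of compounds gives leave-one-out validation; `n_folds = 2` leaves out 50% of the dataset each time.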

Page 16: Prediction Of Bioactivity From Chemical Structure


Space of the sets

• The training set should cover the representation space evenly

Page 17: Prediction Of Bioactivity From Chemical Structure

Training vs Test Sets

• The test set should not be too dissimilar to the training set
– Too similar = the model quality is overestimated
– Too dissimilar = difficult prediction

[Figure: diagram with three regions labelled "Test Set"]

Page 18: Prediction Of Bioactivity From Chemical Structure

Questions?

Page 19: Prediction Of Bioactivity From Chemical Structure


Statistical Methods

Activity Molecular Descriptors

Training and test sets

Activity = f(Molecular Descriptors)

[Diagram: statistics link the chemical structure, encoded as molecular descriptors (for example an FCFP_4 fingerprint), to the activity]

Page 20: Prediction Of Bioactivity From Chemical Structure

Categorical

• The focus is on a specific criterion:
– Is the activity < 10 µM? (as in an HTS assay)

• The data is not continuous
– Soluble/Insoluble

• Try to find a rule (or set of rules) to split the data into classes with the lowest rate of misclassification
– Different coefficients measure the quality (ref: Assessing the accuracy of prediction algorithms for classification: an overview, Baldi et al., Bioinformatics, 2000, 16:412-424)

Page 21: Prediction Of Bioactivity From Chemical Structure

Recursive Partitioning

• Using decision trees

• Rules are organized like a tree, each node = one rule
– Cut-off: Molecular weight < 450
– Absence/presence of a group: Acid group

• Usually easy to interpret

• Drawback: overfitting, giving a model too specific to the training data
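How a single cut-off rule (one node of the tree) might be chosen can be sketched as follows. The data are the eight compounds from the Example slide; calling a compound active when pIC50 ≥ 5 is an assumption made only for this illustration, as is the function name `best_cutoff`. A full recursive partitioning method would repeat this search on each resulting subset and over every descriptor.

```python
def best_cutoff(values, is_active):
    """Find t for the rule 'value > t means active' with the fewest errors."""
    ordered = sorted(set(values))
    # Candidate cut-offs: midpoints between consecutive descriptor values.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best_t, best_errors = None, len(values) + 1
    for t in candidates:
        errors = sum((v > t) != label for v, label in zip(values, is_active))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors


mol_weight = [180, 220, 250, 290, 340, 380, 450, 500]
pic50 = [4, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]
labels = [p >= 5 for p in pic50]  # assumed activity threshold

cutoff, errors = best_cutoff(mol_weight, labels)
print(f"Rule: molecular weight > {cutoff} = active ({errors} misclassified)")
```

On these data the search lands between 250 and 290, which classifies every compound the same way as the "Molecular weight > 260" rule of the Example slide.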

Page 22: Prediction Of Bioactivity From Chemical Structure

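[Figure: decision tree for a data set of 21 actives and 24 inactives. The root splits on Molecular Weight (>450 / ≤450); further nodes split on Polar surface area (>100 / ≤100), Acid Group (Yes / No) and cLogP (>4.2 / ≤4.2). Node counts, given as actives,inactives: 2,10 and 19,14; 0,10 and 2,0; 0,7 and 19,7; 18,2 and 1,5. Two example molecules are shown: MW 178, PSA 37, LogP 3 and MW 205, PSA 20, LogP 3.]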

Page 23: Prediction Of Bioactivity From Chemical Structure

Substructural Analysis

• Idea: each fragment of the molecule makes a contribution to the activity, independent of the other fragments in the molecule.

• Each fragment gets a score for its activity, and the score of a molecule is the sum of the scores of its fragments.

• A simple fragment scoring function:

$$w_i = \frac{act_i}{act_i + inact_i}$$

act_i = number of active compounds containing fragment i

inact_i = number of inactive compounds containing fragment i
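The scoring function can be sketched in plain Python, taking the weight of fragment i as act_i / (act_i + inact_i) and the molecule score as the sum of its fragment weights. The four training compounds and their fragment names are invented for the illustration, and the function names are hypothetical.

```python
def fragment_weights(compounds):
    """compounds: list of (set_of_fragments, is_active) pairs."""
    act, inact = {}, {}
    for fragments, is_active in compounds:
        counts = act if is_active else inact
        for fragment in fragments:
            counts[fragment] = counts.get(fragment, 0) + 1
    all_fragments = set(act) | set(inact)
    # w_i = act_i / (act_i + inact_i)
    return {f: act.get(f, 0) / (act.get(f, 0) + inact.get(f, 0))
            for f in all_fragments}


def molecule_score(fragments, weights):
    """Sum of fragment weights; fragments never seen in training count as 0."""
    return sum(weights.get(f, 0.0) for f in fragments)


# Made-up training set: two actives, two inactives
training = [
    ({"acid", "phenyl"}, True),
    ({"acid"}, True),
    ({"phenyl", "amide"}, False),
    ({"amide"}, False),
]

weights = fragment_weights(training)
score = molecule_score({"acid", "phenyl"}, weights)
print(weights, score)
```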

Page 24: Prediction Of Bioactivity From Chemical Structure

Naïve Bayesian Classifiers

• Related to substructural analysis (slight differences in the weight sum calculation; see Ref)

• Use with fingerprints
– Each substructure (bit in the fingerprint) gets a weight
– Fingerprints can be mixed with other properties
  • Properties are binned and each bin obtains a weight

• Molecules are scored: the higher the score, the higher the chance of being in a specific category

• Native implementation in Pipeline Pilot (practical)

Ref: New Methods for Ligand-Based Virtual Screening: Use of Data Fusion and Machine Learning to Enhance the Effectiveness of Similarity Searching, Hert et al., J. Chem. Inf. Model., 2006, 46 (2), pp 462–470
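The idea of per-bit weights summed into a score can be sketched as below. This is a generic textbook-style illustration, not the Pipeline Pilot implementation or the exact weighting discussed by Hert et al.: the weight is taken as the log ratio of smoothed bit frequencies in actives and inactives (add-one smoothing is an assumption), and only bits that are present contribute to the score.

```python
import math


def train_bit_weights(fingerprints, labels):
    """fingerprints: list of sets of 'on' bits; labels: 1 = active, 0 = inactive."""
    n_act = sum(labels)
    n_inact = len(labels) - n_act
    weights = {}
    for bit in set().union(*fingerprints):
        act = sum(1 for fp, y in zip(fingerprints, labels) if y and bit in fp)
        inact = sum(1 for fp, y in zip(fingerprints, labels) if not y and bit in fp)
        # Add-one smoothing avoids log(0) for bits seen in one class only.
        p_bit_given_act = (act + 1) / (n_act + 2)
        p_bit_given_inact = (inact + 1) / (n_inact + 2)
        weights[bit] = math.log(p_bit_given_act / p_bit_given_inact)
    return weights


def score(fingerprint, weights):
    """Higher score = more likely to belong to the active category."""
    return sum(weights.get(bit, 0.0) for bit in fingerprint)


# Made-up fingerprints, written as sets of named bits for readability
fingerprints = [{"acid", "ring"}, {"acid"}, {"ring", "amide"}, {"amide"}]
labels = [1, 1, 0, 0]

weights = train_bit_weights(fingerprints, labels)
print(score({"acid"}, weights), score({"amide"}, weights))
```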

Page 25: Prediction Of Bioactivity From Chemical Structure

Validation

• Collection of coefficients

• Most common ones
– Specificity and sensitivity
– ROC curve
– Enrichment plot

Page 26: Prediction Of Bioactivity From Chemical Structure

Specificity & Sensitivity

• Specificity

$$\text{Specificity} = \frac{TN}{TN + FP}$$

• Sensitivity

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

Example:

If all compounds are predicted inactive: Specificity = 1 (very good), Sensitivity = 0 (very bad)

If all compounds are predicted active: Specificity = 0 (very bad), Sensitivity = 1 (very good)

http://en.wikipedia.org/wiki/Sensitivity_and_specificity
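The two extreme cases on this slide can be checked with a short sketch, assuming the usual definitions sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The five-compound data set is made up, and the function assumes both classes are present (otherwise a denominator would be zero).

```python
def sensitivity_specificity(predicted, observed):
    """Both arguments are lists of booleans (True = active)."""
    tp = sum(1 for p, o in zip(predicted, observed) if p and o)
    tn = sum(1 for p, o in zip(predicted, observed) if not p and not o)
    fp = sum(1 for p, o in zip(predicted, observed) if p and not o)
    fn = sum(1 for p, o in zip(predicted, observed) if not p and o)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


observed = [True, True, False, False, False]

# Everything predicted inactive, then everything predicted active
all_inactive = sensitivity_specificity([False] * 5, observed)
all_active = sensitivity_specificity([True] * 5, observed)
print(all_inactive, all_active)
```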

Page 27: Prediction Of Bioactivity From Chemical Structure

ROC curve

• Plot sensitivity versus 1-specificity

Coefficient = Area Under the Curve (AUC): 1 is ideal, 0.5 is random

http://www.medcalc.be/manual/roc.php
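The area under the ROC curve can be computed without drawing the curve, because it equals the probability that a randomly chosen active is scored higher than a randomly chosen inactive (ties counting one half). The sketch below uses that pairwise formulation; the scores and labels are made up and `roc_auc` is a hypothetical name.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (active, inactive) pairs ranked correctly."""
    actives = [s for s, y in zip(scores, labels) if y]
    inactives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
mixed = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(perfect, mixed)
```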

Page 28: Prediction Of Bioactivity From Chemical Structure

Enrichment curve

• In some studies the exact rank of the compounds is not that important: the idea is to select X percent of the data
• Use the model to select the top X compounds, aiming to have most of the active molecules among them

[Figure: enrichment plot] Here 40% of the actives are in the top 10%. This plot doesn't tell how many compounds this represents (it could be 40 actives and 10,000 inactives in the top 10%).
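One point of an enrichment curve is simply the fraction of all actives found in the top X percent of the ranked list. A minimal sketch, with made-up scores and labels and a hypothetical function name:

```python
def fraction_of_actives_found(scores, labels, top_fraction):
    """Fraction of all actives that fall in the top-scoring part of the list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    found = sum(1 for _, is_active in ranked[:n_top] if is_active)
    total = sum(1 for is_active in labels if is_active)
    return found / total


# Ten compounds already sorted by model score; four of them are active
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

top_20_percent = fraction_of_actives_found(scores, labels, 0.2)
everything = fraction_of_actives_found(scores, labels, 1.0)
print(top_20_percent, everything)
```

As the slide warns, the fraction alone hides how many inactives share the selection: here the top 20% holds one active and one inactive.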

Page 29: Prediction Of Bioactivity From Chemical Structure

Other methods

• There are other statistical methods.

• There is no perfect method and it is project dependent (also “personal” choice)

• Most common:
– Forest of trees
– Support Vector Machine
– Neural Networks

Page 30: Prediction Of Bioactivity From Chemical Structure

Questions?

Page 31: Prediction Of Bioactivity From Chemical Structure

Regression

• Provides a value, with more information than yes or no

• Usually uses a smaller data set than classification

• Links activity to the structure by an equation (simple to complicated)

Page 32: Prediction Of Bioactivity From Chemical Structure

Historical

• First equation: Hansch in 1964
• Links activity to the molecule's electronic characteristics and to its hydrophobicity

$$\log(1/C) = k_1 \log P + k_2 \sigma + k_3$$

– C is the concentration required to produce a response
– log P is the octanol/water partition coefficient (possibility to cross membranes)
– σ is the Hammett substitution parameter (strength of the electron-withdrawing or -donating properties of the aromatic substituent)

• It is a linear equation
• Then improved with a parabolic function

$$\log(1/C) = -k_1 (\log P)^2 + k_2 \log P + k_3 \sigma + k_4$$

ρ-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure, Hansch et al., J. Am. Chem. Soc., 1964, 86 (8), pp 1616–1626

Parabolic dependence of drug action upon lipophilic character as revealed by a study of hypnotics, Hansch et al., J. Med. Chem., 1968, 11 (1), pp 1–11
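As a worked step, and assuming the parabolic equation is written with a negative quadratic term, log(1/C) = -k_1 (log P)^2 + k_2 log P + k_3 σ + k_4 with k_1 > 0, the lipophilicity that gives the highest predicted activity follows from setting the derivative with respect to log P to zero:

```latex
\frac{\partial \log(1/C)}{\partial \log P} = -2 k_1 \log P + k_2 = 0
\quad\Longrightarrow\quad
\log P_{\mathrm{opt}} = \frac{k_2}{2 k_1}
```

This is why the parabolic form can describe compounds that become less active once they are too lipophilic, which the linear form cannot.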

Page 33: Prediction Of Bioactivity From Chemical Structure

Deriving a QSAR equation

• The most common method is linear regression:

$$y = mx + c$$

• In QSAR, x is usually a descriptor (e.g. logP)

• Aim: minimise the sum of the squared differences between the predicted and the real values

• With more than one descriptor:

$$y = \sum_{i=1}^{n} m_i x_i + c$$

Page 34: Prediction Of Bioactivity From Chemical Structure

Quality

• Most common way is to use the square of the correlation coefficient, R²

• Need to review the data:

[Figure: plots of different data sets with almost the same R²]

Page 35: Prediction Of Bioactivity From Chemical Structure

Cross validation

• Involves removing some of the values from the dataset, building a QSAR model from the remaining data, and applying this model to the removed data.

• The R² of cross validation is written Q²; it represents the goodness of prediction (R² represents the goodness of fit).

• Q² should be lower than R², but not by much (otherwise the model is over-fitted).
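The relation between R² and Q² can be illustrated with leave-one-out cross validation on the eight compounds of the Example slide. This sketch assumes one common definition, Q² = 1 - PRESS / SS_total, where PRESS is the sum of squared prediction errors on the left-out compounds; the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """Goodness of fit: the model is built and evaluated on all compounds."""
    m, c = fit_line(xs, ys)
    sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / total_sum_of_squares(ys)


def q_squared_loo(xs, ys):
    """Goodness of prediction: each compound is predicted by a model built without it."""
    press = 0.0
    for i in range(len(xs)):
        m, c = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (m * xs[i] + c)) ** 2
    return 1 - press / total_sum_of_squares(ys)


mol_weight = [180, 220, 250, 290, 340, 380, 450, 500]
activity = [4, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]

r2 = r_squared(mol_weight, activity)
q2 = q_squared_loo(mol_weight, activity)
print(f"R2 = {r2:.3f}, Q2 = {q2:.3f}")
```

For this small data set Q² comes out somewhat below R², which is the expected pattern for a model that is not badly over-fitted.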

Page 36: Prediction Of Bioactivity From Chemical Structure

Designing a QSAR experiment

• Find the smallest number of variables that explains as much of the data as possible
– It is easy to calculate thousands of parameters with a computer in seconds.

• Rules of thumb
– More than 5 compounds for each descriptor
– Check the descriptors: remove the invariant ones
– Remove correlated factors (by deleting a descriptor, or using a data reduction technique such as PCA)

• Selection
– Algorithms to select the most significant descriptors
  • Forward-stepping regression: start from 1 and add
  • Backward-stepping regression: start with all and remove
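The two clean-up rules of thumb (drop invariant descriptors, drop one of each correlated pair) can be sketched as a simple filter. The descriptor table is invented, the 0.9 correlation threshold is an arbitrary assumption, and keeping the first descriptor of a correlated pair is only one possible policy.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def filter_descriptors(table, max_abs_r=0.9):
    """Return names of descriptors that vary and are not redundant."""
    kept = []
    for name, values in table.items():
        if max(values) == min(values):
            continue  # invariant: carries no information
        if any(abs(pearson(values, table[k])) > max_abs_r for k in kept):
            continue  # strongly correlated with a descriptor already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds
descriptors = {
    "mol_weight": [180, 220, 250, 290],
    "heavy_atoms": [13, 16, 18, 21],   # almost proportional to mol_weight
    "n_halogens": [0, 0, 0, 0],        # invariant
    "clogp": [3.0, 1.2, 2.5, 0.8],
}

selected = filter_descriptors(descriptors)
print(selected)
```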

Page 37: Prediction Of Bioactivity From Chemical Structure

Regression algorithms

• Multiple linear regression (see practical)
– Easy to interpret
– Problem of correlations between factors

• Partial Least Squares (PLS)
– Similar to PCA: reduces the number of factors (x_i) to new orthogonal “latent variables” (t_i)
– Compared to PCA, adds a correlation between the observed data and the latent variables (y ~ a_1 t_1)

$$y = a_1 t_1 + a_2 t_2 + \dots + a_n t_n$$

$$t_i = b_{i1} x_1 + b_{i2} x_2 + \dots + b_{ip} x_p$$

Page 38: Prediction Of Bioactivity From Chemical Structure

Not limited

• There are many regression algorithms, which differ in:
– Implementation
– Selection of factors
– How a good model is judged

• Other methods
– Gaussian Processes (http://dx.doi.org/10.1021/ci7000633)
– Molecular Field Analysis and Partial Least Squares: CoMFA and derivatives, using 3D steric and electrostatic information (http://www.wiley.com/legacy/wileychi/ecc/samples/sample05.pdf and http://www.netsci.org/Science/Compchem/feature11.html)

Page 39: Prediction Of Bioactivity From Chemical Structure

Regression + Category

• Poor regression but good classification

[Figure: Observed vs Predicted plot divided into regions labelled Good, Bad, False Positives and False Negative]

Page 40: Prediction Of Bioactivity From Chemical Structure

After

• Once the models are built and you have an idea of their mathematical quality:
– Look at the observed vs predicted plot
– Try to understand the model

• Do the descriptors make sense?
– LogP is important when modelling solubility
– Why is a certain substructure so important?

Page 41: Prediction Of Bioactivity From Chemical Structure

Outliers

• What to do with outliers?

• Prediction far from observed:
– Are the compounds similar to the training set?
– Outside your space of confidence

• Chemical similarity = activity similarity is not always true
– There are activity cliffs (see references below)
– Interesting for SAR study

On Outliers and Activity Cliffs−Why QSAR Often Disappoints, Maggiora, J. Chem. Inf. Model., 2006, 46, 1535−1535

Structure−Activity Relationship Anatomy by Network-like Similarity Graphs and Local Structure−Activity Relationship Indices, Wawer et al., J. Med. Chem., 2008, 51 (19), pp 6075–6084

Page 42: Prediction Of Bioactivity From Chemical Structure

A model is a model

• It is not the reality

• Provides help for experiments
– Understand what happens
– Reduce the number of experiments
– Does not replace lab work

• There is no one perfect model
– Depends on the method, data sets, descriptors, tuning parameters…

Page 43: Prediction Of Bioactivity From Chemical Structure

Real correlation?

• Does a decrease in marriages decrease the risk of death?

Should we ban Church of England Weddings?

Why do we Sometimes get Nonsense-Correlations between Time-Series?--A Study in Sampling and the Nature of Time-Series, Yule, Journal of the Royal Statistical Society, Vol. 89, No. 1. (Jan., 1926), pp. 1-63

Page 44: Prediction Of Bioactivity From Chemical Structure

Further – Multiple Targets

• Large-scale model:
– Prediction of multiple interactions at once
– Needs a large database
  • WOMBAT (literature), MDDR (patent)
  • ChEMBL

• Identify side effects, or unknown beneficial effects

Page 45: Prediction Of Bioactivity From Chemical Structure

Principle

• SEA approach:
– Similarity of a compound to active ligands (similar to BLAST); website: http://sea.bkslab.org/

• Multiple-category Bayesian model:
– Each fingerprint bit gets a different weight for each target: the sum is different for each target

• Output:
– List of proteins ranked by probability of binding

Page 46: Prediction Of Bioactivity From Chemical Structure

References

• An Introduction to Chemoinformatics, A. Leach and V. Gillet
• Sheffield course (next one in 2011): http://www.shef.ac.uk/is/research/groups/chem/courses.html ; conference: http://cisrg.shef.ac.uk/shef2010/
• Pipeline Pilot documentation, and Cheminformatics analysis and learning in a data pipelining environment, Hassan et al., Molecular Diversity, 2006, 10, pp 283–299
• Multiple targets:
– Predicting new molecular targets for known drugs, Keiser et al., Nature, 2009, 462, pp 175–181 (12 November 2009)
– Relating protein pharmacology by ligand chemistry, Keiser et al., Nat Biotech, 2007, 25 (2), pp 197–206
– Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases, Nidhi et al., J. Chem. Inf. Model., 2006, 46 (3), pp 1124–1133
– Global mapping of pharmacological space, Paolini et al., Nat Biotech, 2006, 24 (7), pp 805–815

Page 47: Prediction Of Bioactivity From Chemical Structure

Questions

Page 48: Prediction Of Bioactivity From Chemical Structure

Practicals

Using Pipeline Pilot

Regression and Classification