1 Modelling in Chemistry: High and Low-Throughput Regimes Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.


Modelling in Chemistry: High and Low-Throughput Regimes

Dr John Mitchell
Unilever Centre for Molecular Science Informatics
Department of Chemistry
University of Cambridge, U.K.


We look at data, analyse data, use data to find correlations ...

... to develop models ...

... and to make (hopefully) useful predictions.

Let’s look at some data ...


New York Times, 4th October 2005.


Happiness ≈ (GNP/$5000) - 1

Poor fit to linear model


Happiness ≈ (GNP/$5000) - 2

Outliers?


Fitting with a curve: reduce RMSE


Outliers?

Different linear models for different regimes


Only one obvious (to me) conclusion

This area is empty: no country is both rich and unhappy. All other combinations are observed.

Happiness ≈ (GNP/$5000) - 2


... but what is the connection with chemistry?


Modelling in Chemistry

[Diagram: map of modelling methods, labelled PHYSICS-BASED / EMPIRICAL and ATOMISTIC / NON-ATOMISTIC.]

Methods shown: ab initio, Density Functional Theory, Car-Parrinello, AM1, PM3 etc., Molecular Dynamics, Monte Carlo, Docking, DPD, Fluid Dynamics, CoMFA, 2-D QSAR/QSPR, Machine Learning.


[Diagram: the same map of methods, divided into HIGH THROUGHPUT and LOW THROUGHPUT regions.]


[Diagram: the same map of methods, divided into INFORMATICS and THEORETICAL CHEMISTRY.]

NO FIRM BOUNDARIES!


Theoretical Chemistry

• Calculations and simulations based on real physics.

• Calculations are either quantum mechanical or use parameters derived from quantum mechanics.

• Attempt to model or simulate reality.

• Usually Low Throughput.


Informatics and Empirical Models

• In general, Informatics methods represent phenomena mathematically, but not in a physics-based way.

• The relationship between inputs and output is based on an empirically parameterised equation or a more elaborate mathematical model.

• Do not attempt to simulate reality.

• Usually High Throughput.


QSPR

• Quantitative Structure Property Relationship

• Physical property related to more than one other variable.

• Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s).

• Property-property relationships date from the 1860s.

• General form (for non-linear relationships): y = f(descriptors)


QSPR

            Y           X1  X2  X3  X4  X5  X6
Molecule 1  Property 1  –   –   –   –   –   –
Molecule 2  Property 2  –   –   –   –   –   –
Molecule 3  Property 3  –   –   –   –   –   –
Molecule 4  Property 4  –   –   –   –   –   –
Molecule 5  Property 5  –   –   –   –   –   –
Molecule 6  Property 6  –   –   –   –   –   –
Molecule 7  Property 7  –   –   –   –   –   –

Y = f (X1, X2, ... , XN )

• Optimisation of Y = f(X1, X2, ... , XN) is called regression.

• Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
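The train-then-test idea can be illustrated with a minimal sketch. It assumes the simplest possible case, a single descriptor and a straight-line model fitted by ordinary least squares; the function names and the toy numbers are hypothetical and are not taken from the talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a*x + b for one descriptor x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new descriptor value."""
    a, b = model
    return a * x + b


# Hypothetical toy data: the property is exactly 2 * X1 + 1.
train_x, train_y = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
test_x, test_y = [5.0, 6.0], [11.0, 13.0]

model = fit_line(train_x, train_y)                 # optimise on training molecules
test_pred = [predict(model, x) for x in test_x]    # evaluate on unseen test molecules
```

The test molecules play no part in the fit, so agreement between `test_pred` and `test_y` says something about predictive ability rather than goodness of fit.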

QSPR

• Quality of the model is judged by three parameters:

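$$\mathrm{Bias} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}$$

$$r^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y_i^{\mathrm{pred}}\right)^2}{\sum_{i=1}^{n}\left(y_i^{\mathrm{obs}} - y^{\mathrm{average}}\right)^2}$$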
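The three quality measures on this slide (bias, RMSE and r²) can be computed as in the sketch below. It assumes residuals are taken as observed minus predicted, which fixes the sign of the bias; the function names and example values are illustrative only.

```python
import math


def bias(obs, pred):
    """Mean signed error; here a residual is observed minus predicted."""
    return sum(o - p for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """1 minus (residual sum of squares / total sum of squares about the mean)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted property values.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.5, 4.0]
```

In this toy case the errors cancel, so the bias is zero even though the RMSE is not, which is why the measures are reported together.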

QSPR

• Different methods for carrying out regression:

• LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.

• NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.

QSPR

• However, this does not guarantee a good predictive model….

QSPR

• Problems with experimental error.

• QSPR only as accurate as data it is trained upon.

• Therefore, we need accurate experimental data.

QSPR

• Problems with “chemical space”.

• “Sample” molecules must be representative of “Population”.

• Prediction results will be most accurate for molecules similar to training set.

• Global or Local models?


Relationship of Chemical Structure With Lattice Energy

Can we predict lattice energy from 2D molecular structure?

Dr Carole Ouvrard & Dr John Mitchell
Unilever Centre for Molecular Informatics
University of Cambridge

C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)


Why Do We Need a Predictive Model?

Existing techniques from Theoretical Chemistry can give us accurate sublimation and lattice energies ...

... but only in very low throughput.


Why Do We Need a Predictive Model?

A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials

From 2-D molecular structure only

Without knowing the crystal packing

Without expensive theoretical calculations

Should help predict solubility.


Why Do We Think it Will Work?

Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule.

Many molecules have a plurality of different experimentally observable polymorphs.

We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.


[Figure: calculated lattice energy (kJ/mol, -92.0 to -98.0) against density (g/cc, 1.40 to 1.60) for possible crystal packings, with points labelled by space group (P-1, P21/c, P212121, P21). The calculated lowest energy structure and the experimental crystal structure are marked.]


Expression for the Lattice Energy

U crystal = U molecule + U lattice

Theoretical lattice energy
– Crystal binding = Cohesive energy

Experimental lattice energy is related to -ΔH sublimation

ΔH sublimation = -U lattice - 2RT (Gavezzotti & Filippini)
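The relation between sublimation enthalpy and lattice energy is simple enough to evaluate directly. The sketch below assumes energies in kJ/mol and a temperature of 298.15 K; the temperature and the function names are assumptions made for illustration, not values stated on the slide.

```python
R = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1


def sublimation_enthalpy(u_lattice, temperature=298.15):
    """Sublimation enthalpy from lattice energy: dH_sub = -U_lattice - 2RT."""
    return -u_lattice - 2.0 * R * temperature


def lattice_energy(h_sub, temperature=298.15):
    """Inverse relation: U_lattice = -dH_sub - 2RT."""
    return -h_sub - 2.0 * R * temperature


# Hypothetical lattice energy of -96.0 kJ/mol (the 2RT term is about 5 kJ/mol at room temperature).
h_sub_example = sublimation_enthalpy(-96.0)
```

Because 2RT is small compared with typical lattice energies, the two quantities are close in magnitude, which is why the talk treats predicting one as nearly equivalent to predicting the other.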


Partitioning of the Lattice Energy

U crystal = U molecule + U lattice

ΔH sublimation = -U lattice - 2RT

Partitioning the lattice energy in terms of structural contributions

Choice of the significant parameters

– number of atoms of each type?

– Number of rings, aromatics?

– Number of bonds of each type?

– Symmetry?

– Hydrogen bond donors and acceptors? Intramolecular?

We choose counts of atom type occurrences.


Analysis of the Sublimation Energy Data

Experimental data: ΔHsublimation
– NIST (National Institute of Standards and Technology, USA)
– Scientific literature

Atom Types
– SATIS codes: 10-digit connectivity code + bond types
– Each 2-digit code = atomic number

HN   01 07 99 99 99
HO   01 08 99 99 99
O=C  08 06 99 99 99
-O-  08 06 06 99 99

Typically, several similar SATIS codes are grouped to define an atom type.

Statistical analysis
– Multi-Linear Regression Analysis
– ΔHsub modelled from # atoms of each type


Training Dataset of Model Molecules

226 organic compounds (cumulative totals in brackets)
– 19 linear alkanes (19)
– 14 branched alkanes (33)
– 17 aromatics (50)
– 106 other non-H-bonders (156)
– 70 H-bond formers (226)

Non-specific interacting
– Hydrocarbons
– Nitrogen compounds
– Nitro-, CN, halogens,
– S, Se substituents
– Pyridine

Potential hydrogen bonding interactions
– Amides
– Carboxylic acids
– Amino acids…

[Figure: ΔHsublimation (experimental) / kJ mol-1 against no. of C, N, O atoms, with series for amides, diamides, acids, diacids, amino acids and alkanes; valine is highlighted.]


Study of Non-specific Interactions: Linear Alkanes

19 compounds: CH4 to C20H42

Limit for van der Waals interactions

ΔHsub = 7.955 C - 2.714

r² = 0.977, s = 7.096 kJ/mol

[Figure: boiling point / °C and ΔHsub / kJ mol-1 plotted against no. of carbon atoms (0 to 20).]

Note odd-even variation in ΔHsub for this series.

Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.


Include Branched Alkanes

Add 14 branched alkanes to dataset. The graph below highlights the reduction of sublimation enthalpy due to bulky substituents.

[Figure: ΔHsub / kJ mol-1 against no. of carbon atoms (0 to 25); C(CH3)4 and (C(CH3)3)3CH are labelled.]

33 compounds: CH4 to C20H42

ΔHsub = 7.724 Cnonbranched + 3.703

r² = 0.959, s = 8.117 kJ/mol

If we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.


All Hydrocarbons: Include Aromatics

Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).

50 compounds

ΔHsub = 7.680 Cnonbranched + 6.185 Caromatic + 4.162

r² = 0.958, s = 7.478 kJ/mol

As before, if we also include the parameters for branched carbons, C3 & C4, the model doesn’t improve.

aliphatic

[Figure: predicted value against experimental value, both in kJ mol-1 (0 to 200).]


All Non-Hydrogen-Bonded Molecules:

Add 106 non-hydrocarbons to the dataset.

Include elements H, C, N, O, F, S, Cl, Br & I.

156 compounds

ΔHsub predicted by 16-parameter model

r² = 0.896, s = 9.976 kJ/mol

[Figure: predicted value against experimental value, both in kJ mol-1 (0 to 250).]

Parameters in model are counts of atom type occurrences.


General Predictive Model

Add 70 hydrogen bond forming molecules to the dataset.

226 compounds

ΔHsub predicted by 19-parameter model

r² = 0.925, s = 9.579 kJ/mol

Parameters in model are counts of atom type occurrences.

[Figure: predicted value against experimental value, both in kJ mol-1 (0 to 250).]


Predictive Model Determined by MLRA

ΔHsublimation (kJ mol-1) = 6.942 + 20.141 HN + 30.172 HO + 3.127 F + 10.456 Cl + 12.926 Br + 19.763 I + 3.297 C3 – 3.305 C4 + 5.970 Caromatic + 7.631 Cnonbranched + 7.341 CO + 19.676 CS + 11.415 Nnitrile + 8.953 Nnonnitrile + 8.466 NO + 18.249 Oether + 20.585 SO + 12.840 Sthioether

aliphatic

All these parameters are significantly larger than their standard errors
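Since the model is just an intercept plus a weighted sum of atom type counts, applying it takes only a few lines. The sketch below uses the coefficients from this slide; the dictionary layout, the function name and the example count vector are hypothetical, and assigning the atoms of a real molecule to these types requires the SATIS grouping, which is not reproduced here.

```python
INTERCEPT = 6.942  # kJ/mol

# Coefficients (kJ/mol per atom) as given on the slide, keyed by atom type.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172,
    "F": 3.127, "Cl": 10.456, "Br": 12.926, "I": 19.763,
    "C3": 3.297, "C4": -3.305, "Caromatic": 5.970, "Cnonbranched": 7.631,
    "CO": 7.341, "CS": 19.676,
    "Nnitrile": 11.415, "Nnonnitrile": 8.953, "NO": 8.466,
    "Oether": 18.249, "SO": 20.585, "Sthioether": 12.840,
}


def predict_hsub(counts):
    """Estimate the sublimation enthalpy (kJ/mol) from atom type counts."""
    unknown = set(counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[t] * n for t, n in counts.items())


# Hypothetical count vector: ten Cnonbranched atoms and nothing else.
example = predict_hsub({"Cnonbranched": 10})
```

The intercept plus 18 atom type coefficients make up the 19 parameters of the general model.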


Distribution of Residuals

The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.

[Figure: histogram of no. of observations against residuals (-30 to 30).]


Validation on an Independent Test Set

35 diverse compounds

r² = 0.928

s = 7.420 kJ/mol

[Figure: ΔHsub (predicted) against ΔHsub (experimental), both in kJ mol-1 (0 to 200). A labelled structure bearing CH3 and NO2 groups is marked.]

Nitro-compounds are often outliers.

Very encouraging result: accurate prediction possible.


Major Conclusion

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing!


Conclusions

We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol.

A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies.

Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing.

Avoids need for expensive calculations.

May help predict solubility.

Model gives good chemical insight.


Solubility is an important issue in drug discovery and a major source of attrition

This is expensive for the industry

A good model for predicting the solubility of druglike molecules would be very valuable.


Drug Disc. Today, 10 (4), 289 (2005)

Cohesive interactions in the lattice reduce solubility

Predicting lattice (or almost equivalently sublimation) energy should help predict solubility


Classifying the WADA 2005 Prohibited List Using CDK & Unity Fingerprints

www-mitchell.ch.cam.ac.uk/

Ed Cannon, Andreas Bender, David Palmer & John Mitchell,

J. Chem. Inf. and Model., 46, 2369-2380 (2006)


Classifying the WADA Prohibited List

• Aims & Background.

• Methods.

• Data.

• Results.

• Conclusions.


Aims & Background


• Much drug abuse in sport involves novel compounds such as the “designer steroid” THG.

tetrahydrogestrinone (THG)


Aims & Background

• Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules.

• Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.


WADA Prohibited Classes

• Anabolic Agents (S1)
• Hormones and Related Substances (S2)
• Beta-2-agonists (S3)
• Anti-estrogenic Agents (S4)
• Diuretics and Masking Agents (S5)
• Stimulants (S6)
• Narcotics (S7)
• Cannabinoids (S8)
• Glucocorticoids (S9)
• Alcohol (P1)
• Beta Blockers (P2)


Predicting Bioactivities

• We seek to predict whether a molecule exhibits one of these bioactivities.

• Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.


Methods


Chemical Space

• Use descriptor-based fingerprints to locate molecules in chemical space.

• Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.


Machine Learning

• Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space.

• Random Forest.

• k-Nearest Neighbours.


Fingerprints

• CDK (Chemistry Development Kit) fingerprint.
• Unity 2D.
• MACCS key.
• MOE 2D (2004).
• Typed Atom Distance.
• Typed Graph Distance.


CDK Fingerprint

• CDK fingerprint resembles Daylight.

• All bond paths up to a length of 6 are generated.

• A hashing function is used to map these paths onto a fingerprint of 1024 bits.
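The path-and-hash idea can be sketched in a few lines. This is an illustration of the general technique, not the CDK implementation: it assumes a molecule given as a list of element symbols plus a list of bonded atom index pairs, ignores bond orders, counts path length in bonds, includes single atoms as paths of length zero, and uses CRC32 as a stand-in hashing function. All names are hypothetical.

```python
import zlib


def enumerate_paths(atoms, bonds, max_bonds=6):
    """Return all linear atom paths with at most max_bonds bonds, as strings."""
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)
    paths = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        forward = "-".join(symbols)
        backward = "-".join(reversed(symbols))
        paths.add(min(forward, backward))  # one canonical string per direction pair
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:  # do not revisit atoms within one path
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths


def fingerprint(atoms, bonds, n_bits=1024, max_bonds=6):
    """Hash every path onto one position of an n_bits-long bit vector."""
    bits = [0] * n_bits
    for p in enumerate_paths(atoms, bonds, max_bonds):
        bits[zlib.crc32(p.encode("utf-8")) % n_bits] = 1
    return bits


# Heavy-atom skeleton of ethanol: C-C-O.
ethanol_paths = enumerate_paths(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_fp = fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Different paths can hash to the same bit, so a set bit indicates that at least one matching path is probably present rather than identifying it uniquely.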


Unity 2D Fingerprint

• Unity is similar to CDK, but based on sub-structures rather than just paths.

• Substructures present in the molecule are enumerated.

• A hashing function is used to map these substructures onto a fingerprint of 992 bits.


Classification Algorithms

• Random Forest (RF).

• k-Nearest Neighbours (k-NN).


Random Forest

• Decision based learner.
• Based on bootstrap sample of data.
• Number of trees in forest (ntree).
• Number of descriptors tried at each node (mtry).
• Each tree predicts label of molecule.
• Majority vote = class label of molecule.
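The ingredients listed above (bootstrap samples, ntree, mtry, majority vote) can be seen together in a deliberately simplified sketch. It assumes each "tree" is a single-split decision stump, so mtry descriptors are tried at the one node each tree has; a real Random Forest grows full trees. The function names and toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= threshold]
            right = [lab for row, lab in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            l_label = Counter(left).most_common(1)[0][0]
            r_label = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_label for v in left) + sum(v != r_label for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_label, r_label)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def train_forest(X, y, ntree=25, mtry=1, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]        # bootstrap sample
        features = rng.sample(range(n_features), mtry)    # descriptors tried at the node
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], features))
    return forest


def predict(forest, row):
    """Each tree votes for a label; the majority vote is the prediction."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-descriptor data set with two well separated classes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
     [1.0, 0.9], [0.8, 1.0], [0.9, 0.7], [0.7, 0.8]]
y = ["inactive"] * 4 + ["active"] * 4
forest = train_forest(X, y)
```

Using an odd ntree for a two-class problem avoids tied votes.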


Random Forest

Node: compare A with x1
  A > x1: compare B with x2
    B > x2: Decision: Yes
    B < x2: Decision: No
  A < x1: compare C with x3
    C > x3: Decision: No
    C < x3: Decision: Yes

A Random Forest contains many such trees.


k-Nearest Neighbours

• Instance based learner.

• Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space.

• k is a variable describing the number of neighbours to be considered.

• Class of x determined by majority vote of class labels of k neighbours.

• Ties broken randomly (only occurs for even k).
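The procedure described in these bullets can be written down directly. The sketch below assumes descriptors are plain lists of numbers and labels are strings; the function name, the seeded random generator used for tie-breaking and the toy data are all hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k=1, rng=None):
    """Classify query by majority vote among its k nearest training points."""
    rng = rng or random.Random(0)
    # Euclidean distance in descriptor space to every training point.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    top = max(votes.values())
    winners = sorted(label for label, count in votes.items() if count == top)
    # Ties between classes are broken randomly.
    return winners[0] if len(winners) == 1 else rng.choice(winners)


# Hypothetical two-descriptor training set with two clusters.
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = ["inactive"] * 3 + ["active"] * 3
```

With two classes a tied vote needs an even k, which is why odd values of k are usually preferred.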


k-Nearest Neighbours

[Figure: query point (?) among its Active and Inactive neighbours.]


k-Nearest Neighbours

• Local method.

• Uses only a very small number of near neighbours to make its prediction.

• Suitable for predicting activity classes with multiple clusters in chemical space.

• Therefore good for WADA classes with multiple receptors.


Performance Measure

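• Matthews Correlation Coefficient:

$$\mathrm{MCC} = \frac{t_p t_n - f_p f_n}{\sqrt{(t_p + f_p)(t_p + f_n)(t_n + f_p)(t_n + f_n)}}$$

where t_p, t_n, f_p and f_n are the numbers of true positives, true negatives, false positives and false negatives.

• Range: -1 ≤ MCC ≤ 1.

• Balance between predicting positives & negatives.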
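The Matthews Correlation Coefficient is computed from the four confusion-matrix counts. The sketch below is a minimal version; returning 0 when the denominator vanishes is a common convention assumed here, not something stated in the talk, and the function name is illustrative.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """MCC from counts of true/false positives and negatives."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = matthews_cc(tp=50, tn=50, fp=0, fn=0)
inverted = matthews_cc(tp=0, tn=0, fp=50, fn=50)
random_like = matthews_cc(tp=25, tn=25, fp=25, fn=25)
```

The three examples cover the ends and the middle of the range: perfect prediction, perfectly wrong prediction, and prediction no better than chance.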


Data


The Dataset

• 5245 molecules (5235 for CDK).

• Molecules taken from WADA banned list and from corresponding activity classes in MDDR.

• 367 explicitly allowed substances.


Data by Class

WADA Class  Number of Molecules
S1          47
S2          272
S3          367
S4          928
S5          1000
S6          804
S7          195
S8          1000
S9          26
P2          239
Allowed     367


Fivefold Cross-validation

• We test for membership of each prohibited class separately.

• All calculations use 5-fold cross-validation: 80% of the molecules form the training set and 20% the test set, repeated 5 times so that each molecule is in exactly 1 test set.
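The fold bookkeeping can be sketched as follows. The sketch assumes molecules are identified by integer index and shuffled once with a fixed seed; whether the original study stratified folds by class is not stated, so no stratification is attempted. The function name is hypothetical.

```python
import random


def five_fold_splits(n_molecules, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for n_folds-fold cross-validation."""
    indices = list(range(n_molecules))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint folds.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


# 5245 molecules, as in the dataset described in this talk.
splits = list(five_fold_splits(5245))
```

Each pass trains on four folds and tests on the fifth, so every molecule receives exactly one out-of-sample prediction.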


False Positives

• False Positives arise in two ways:

• (1) A molecule predicted positive on an incorrect activity class;

• (2) An explicitly allowed molecule predicted positive.


Results


Results: Random Forest

Aggregated over 10 classes


MCC for RF for Six Fingerprints

Unity ≈ CDK > MACCS > others.

Rank  Fingerprint  MCC
1     Unity        0.8214
2     CDK          0.8136
3     MACCS        0.7823
4     TGD          0.7283
5     MOE          0.7172
6     TAD          0.5902

[Figure: MCC against rank out of 6 fingerprints.]


MCC as a Function of ntree in RF models for Unity

100 trees sufficient; little improvement with more.

[Figure: MCC against ntree (0 to 1000) for Unity.]


Results: k-Nearest Neighbours

Aggregated over 10 classes

81

[Figure: MCC as a Function of k in k-NN Models for Six Fingerprints. x-axis: k (0 to 20); y-axis: MCC; series: Unity, MACCS, MOE, TAD, TGD, CDK.]

82

Unity ≈ CDK > MACCS > others.

[Figure: MCC for k = 1 for Six Fingerprints. x-axis: Rank out of 6 Fingerprints; y-axis: MCC.]

Unity 0.8363

CDK 0.8297

MACCS 0.8045

TGD 0.7404

MOE 0.6814

TAD 0.6152

83

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.

[Figure: MCC as a Function of k in k-NN Models for Unity. x-axis: k (0 to 20); y-axis: MCC.]
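Why k = 2 is awkward can be seen from a toy majority vote. The sketch below assumes a simple unweighted vote over binary labels and flags ties by returning None; how ties were actually resolved in the study is not stated on the slides, and the function name `knn_vote` is hypothetical.

```python
from collections import Counter

def knn_vote(sorted_neighbour_labels, k):
    """Majority vote over the k nearest labels; returns None on a tie."""
    counts = Counter(sorted_neighbour_labels[:k]).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # The top two labels have equal votes.
    return counts[0][0]

# Labels of a query molecule's neighbours, nearest first (made-up data).
labels = [True, False, False, True, False]
results = {k: knn_vote(labels, k) for k in (1, 2, 3)}
```

With these labels k = 1 returns True, k = 2 is a tie because the two nearest neighbours disagree, and k = 3 returns False. With two classes, an odd k can never tie.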

84

k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.

[Figure: MCC as a Function of k in k-NN Models for CDK. x-axis: k (0 to 20); y-axis: MCC.]

85

Results: Comparison

Recall v Precision

Aggregated over 10 classes

86

[Figure: Recall v Precision for Positives. x-axis: Recall (30 to 90); y-axis: Precision (60 to 100); fingerprints: Unity, MACCS, CDK; methods: RF, k-NN.]

RF gives higher precision, k-NN higher recall.
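The trade-off between the two measures follows directly from their definitions, as the sketch below shows. The counts are invented for illustration and are not results from the study; the function name `precision_recall` and the zero-division convention are likewise assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class, as fractions."""
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical classifiers tested on the same 100 true positives.
cautious = precision_recall(tp=60, fp=5, fn=40)   # few positive calls
liberal = precision_recall(tp=80, fp=20, fn=20)   # many positive calls
```

A classifier that calls fewer molecules positive tends to make fewer false-positive errors (higher precision) at the cost of missing more true positives (lower recall), and vice versa.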

87

Results: Comparison

Analysed by class

88

Classes vary in difficulty of prediction; independent of classification algorithm.

[Figure: MCC by Class for Random Forest Default and k-NN (k = 1) Models. x-axis: Class (S1, S2, S3, S4, S5, S6, S7, S8, S9, P2); y-axis: MCC; series: RF, k-NN.]

89

Conclusions

90

Major Conclusion

• Can use Informatics to predict whether or not a molecule exhibits a prohibited bioactivity.

91

Conclusions

• Can successfully predict active molecules (MCC ≈ 0.83).

• Unity ≈ CDK > MACCS > others.

• RF & k-NN give similar MCC.

• k-NN higher recall.

• RF higher precision; RF less likely to produce false positives.

92

Conclusions

• RF results vary little with ntree.

• k-NN results best for k = 1.

• Performance decreases at higher k.

• Odd k avoids problems with ties (k = 2 is worse than k = 3).

• Activity classes show consistent prediction difficulty pattern.

93

www-mitchell.ch.cam.ac.uk/

[email protected]

94

Acknowledgements: People

Carole Ouvrard, Ed Cannon, David Palmer,

Florian Nigsch, Chrysi Kirtay, Laura Hughes,

Jo Bailey, Noel O’Boyle, Daniel Almonacid,

Gemma Holliday, Jen Ryder,

Dushy Puvanendrampillai, Andreas Bender.

95

A¢know£€dg€m€nt$: Funding

Unilever

96

[Figure: MCC as a Function of Class Size. x-axis: Class Size (0 to 1200); y-axis: MCC; series: Unity, MACCS, MOE, TAD, TGD.]

No significant correlation overall, though the smallest class, S9, is hardest to predict.

97

[Figure: MCC as a Function of Intra Class Mean Tanimoto Score. x-axis: Intra Class Mean Tanimoto Score (0.00 to 0.60); y-axis: MCC; series: Unity, MACCS, MOE, TAD, TGD.]
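An intra-class mean Tanimoto score can be computed along the following lines. The sketch represents each fingerprint as the set of its "on" bit positions and averages over all distinct pairs within a class; whether the original analysis averaged in exactly this way is an assumption, and the fingerprints shown are toy values.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def mean_intra_class_tanimoto(fingerprints):
    """Mean Tanimoto score over all distinct pairs (needs at least two)."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)

# Toy class of three molecules (made-up bit positions).
toy_class = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
mean_score = mean_intra_class_tanimoto(toy_class)
```

For the toy class the pairwise scores are 0.5, 1.0 and 0.5, giving a mean of 2/3. A higher mean indicates a structurally more homogeneous class.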

98

tetrahydrogestrinone (THG)

gestrinone

trenbolone