Training pKa and logP prediction. Jozsef Szegezdi, Solutions for Cheminformatics.

logP calculation models in Marvin

Models         Training set size   Number of parameters
VG             1000                120
KLOP           1700                100
PHYS           10000               110
Weighted       >10000              120
User defined   Variable            <=100

Unfortunately, we cannot tell in advance which model will perform better for a molecule that is not included in the training set.

Three models are provided in Marvin. They share the same atom type definitions, taken from Viswanadhan, V. N., et al. J. Chem. Inf. Comput. Sci., 1989, 29, 3, 163-172.

Problem with logP models

Frequently occurring problems in constructing logP models

- The logP training set is too small
- The logP training set is unrepresentative
- The specification of atom types and interactions is subjective
- The number of logP parameters is restricted in order to ensure the ‘predictive power’

As a result, there will be missing interactions and atom types for the models.

[Figure: structures of the 25 ‘OH’-containing molecules, each labeled with its logP value: -0.77, -0.31, 0.25, 0.88, 1.51, 2.03, 2.62, 3.00, 3.77, 4.57, 1.29, 1.28, 1.48, 1.23, 1.19, 1.79, -3.24, -0.92, 0.15, 1.46, 0.16, 2.85, -1.76, -1.04, 0.88]

Example for creating a local logP model

The logP values of the molecules calculated with the standard weighted method are shown in the figure below.

[Figure: Calculated vs. experimental logP by the weighted method; n=25, R2=0.96, s=0.35. Axes: logP exp. (x), logP calc. (y).]

The ‘principle of uniformity of nature’ would suggest that other ‘OH’-containing molecules can also be predicted reasonably well by the standard ‘weighted’ method. Is this true? We test it with the hydroquinone molecule.

The logP value of hydroquinone is 0.59. The next table summarizes the ‘logP’ errors of the standard models.

Models         logP calc. - logP exp.
VG             0.88
KLOP           0.75
PHYS           0.68
Weighted       0.77
User defined   ?

Test of standard models

How can one improve the accuracy of the prediction?

Prediction error can be reduced by creating a local model, using linear regression, for the 25 molecules mentioned above. Command line call for creating the local model:

cxcalc -T logP -t LOGP -o logPparameters.txt training25.sdf
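The idea of a local model fitted by linear regression can be sketched in a few lines of Python. This is only an illustration, not Marvin's implementation: it assumes an additive model in which logP is the sum of atom type counts multiplied by per-type contributions, and it fits those contributions by ordinary least squares. The three atom types, the count matrix, and the logP values below are all made up for the example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: atom type counts are not independent")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_contributions(counts, logp_exp):
    """Least-squares contributions from the normal equations (X^T X) a = X^T y."""
    n_types = len(counts[0])
    xtx = [[sum(row[i] * row[j] for row in counts) for j in range(n_types)]
           for i in range(n_types)]
    xty = [sum(row[i] * y for row, y in zip(counts, logp_exp))
           for i in range(n_types)]
    return solve_linear_system(xtx, xty)


def predict_logp(count_row, contributions):
    """Additive prediction: sum of (atom type count * contribution)."""
    return sum(n * a for n, a in zip(count_row, contributions))


# Hypothetical training data: each row holds the counts of three made-up
# atom types in one molecule; the logP values are invented as well.
TOY_COUNTS = [[1, 0, 1], [2, 0, 1], [3, 0, 1], [4, 0, 1], [0, 6, 1], [0, 6, 2]]
TOY_LOGP = [-0.7, -0.2, 0.3, 0.8, 0.0, -1.2]

contributions = fit_contributions(TOY_COUNTS, TOY_LOGP)
```

With a small, chemically homogeneous training set such a fit assigns contributions tuned to that compound class, which is why the same atom type can receive different values in different local models.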

Error of the standard models is relatively large.

The logP values of the 25 molecules containing ‘OH’ groups, calculated with the ‘user defined’ method after logP training, are shown in the figure below.

[Figure: Calculated vs. experimental logP by the user method; n=25, R2=0.99, s=0.10. Axes: logP exp. (x), logP calc. (y).]

Model          n    R2     s      logP error of test molecule (hydroquinone)
Weighted       25   0.96   0.36   0.77
User defined   25   0.99   0.10   0.24
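For readers who want to compute comparable statistics for their own calculated vs. experimental values, here is a minimal Python sketch. The slides do not define R2 and s, so two assumptions are made here: R2 is taken as the squared Pearson correlation coefficient, and s as the root-mean-square deviation between calculated and experimental values. Other conventions (for example dividing by n-1 or n-2) give slightly different numbers. The sample data are invented.

```python
import math


def fit_statistics(exp_values, calc_values):
    """Return (n, R2, s) for paired experimental and calculated values.

    Assumed definitions (not taken from the slides):
      R2 = squared Pearson correlation coefficient
      s  = root-mean-square deviation of calc from exp
    """
    n = len(exp_values)
    mean_exp = sum(exp_values) / n
    mean_calc = sum(calc_values) / n
    sxy = sum((e - mean_exp) * (c - mean_calc)
              for e, c in zip(exp_values, calc_values))
    sxx = sum((e - mean_exp) ** 2 for e in exp_values)
    syy = sum((c - mean_calc) ** 2 for c in calc_values)
    r2 = sxy * sxy / (sxx * syy)
    s = math.sqrt(sum((c - e) ** 2
                      for e, c in zip(exp_values, calc_values)) / n)
    return n, r2, s


# Invented example values, for illustration only.
example_exp = [0.0, 1.0, 2.0, 3.0]
example_calc = [0.1, 0.9, 2.1, 2.9]
n, r2, s = fit_statistics(example_exp, example_calc)
```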

Comparison of the standard and the user model

The user-trained local model based on 25 molecules outperforms all of the standard models.

User’s model

Conclusions

The local model based on 25 molecules is more accurate than any of the standard global models.

Depending on the training set, different parameter values will be assigned to the same atom type. This is one of the main characteristics of the user model.

A ‘carefully’ created set of local models must be superior to any ‘large’ model. We plan to develop a model that combines many local models.

The ionization % vs. pH curves are shown in blue for basic centers and in red for acidic centers.

[Figure: Calculated ionization % vs. pH. Axes: pH, 0 to 12 (x); degree of ionization %, 0 to 100 (y). Values labeled on the curves: 10.28, 4.30, 5.10, 2.49.]

Apparent pKa and ionization % vs. pH curve
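The shape of such curves can be illustrated with the Henderson-Hasselbalch relation. The sketch below is a simplification that treats each ionizable center as an independent monoprotic site; it is not how Marvin computes the curves for molecules with several interacting centers. The function name and the pKa value used in the example are chosen only for illustration.

```python
def percent_ionized(ph, pka, kind):
    """Percent of an isolated monoprotic site that is ionized at a given pH.

    kind="acid": the ionized form is the deprotonated one.
    kind="base": the ionized form is the protonated one.
    """
    if kind == "acid":
        return 100.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":
        return 100.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError("kind must be 'acid' or 'base'")


# Illustrative curve for a hypothetical acidic site with pKa 4.30,
# sampled at integer pH values from 0 to 12.
acid_curve = [(ph, percent_ionized(ph, 4.30, "acid")) for ph in range(13)]
```

At pH equal to the pKa the site is half ionized, so the pKa can be read off a curve as the pH at which it crosses 50 %.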

Method for predicting pKa and training

Marvin’s prediction model considers:

• partial charges
• polarizability
• the effect of ionizable centers on each other

Training refines the existing parameters for ionizable centers and at the same time creates new modifier parameters based on structures and experimental values specified by the user.

Example for training pKa prediction

[Figure: structures of the 15 training molecules, numbered 1 to 15]

Molecule  1     2     3     4     5     6     7     8     9     10    11    12    13    14    15
pKa1      6.0   6.70  0.50  1.20  1.50  3.76  5.60  5.0   4.50  6.71  0.63  4.05  4.10  2.84  4.91

pKa2 (given for six of the molecules): -3.0, 1.0, -7.5, 8.35, 0.71, 9.01

[Figure: Calculated vs. experimental pKa before 'training'; n=20, R2=0.94, s=0.68. Axes: exp. pKa (x), calc. pKa (y).]

[Figure: Calculated vs. experimental pKa after 'training'; n=21, R2=0.99, s=0.26. Axes: exp. pKa (x), calc. pKa (y).]

Experimental vs. calculated pKa values

The input ‘sdf’ file may be created in IJC.

The training can be run using this command line:

cxcalc -T pka -o c:/output InputpKadata.sdf

Curating experimental pKa data

Conclusions

• The user-defined pKa model is more accurate than the built-in default model.
• IJC can be used for curating input data for the training.
• The new model is only a refinement of the default model, so the training assumes a robust base model, which is provided in Marvin.
