Selection of Molecular Descriptor Subsets for Property Prediction Inga Paster a, Neima Brauner b and...

Post on 20-Jan-2016


Selection of Molecular Descriptor Subsets for Property Prediction

Inga Paster a, Neima Brauner b and Mordechai Shacham a

a Department of Chemical Engineering, Ben-Gurion University, Beer-Sheva, Israel

b School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The Needs

Physicochemical and biological properties are needed for risk assessment, environmental impact assessment, and process design, analysis and optimization.

The number of compounds currently used by industry, or of immediate interest to it, is about 100,000. Compounds that are theoretically possible and may be of future interest number several tens of millions. The Toxic Substances Control Act (TSCA) inventory has 80,000 chemicals. Only 50% have some physicochemical property data, and only 15% have data from genotoxicity bioassays.

The DIPPR 801 database contains 2101 compounds (33 constant properties, 15 temperature-dependent properties).

Property Prediction Methods

“Group contribution” methods

Methods based on the “corresponding-states principle”

“Asymptotic behavior” correlations (ABCs)

“Quantitative Structure Property Relationships” (QSPRs), based on the use of molecular descriptors

The existing methods cannot provide satisfactory predictions for certain properties (such as normal melting temperature) and for certain groups of compounds. Thus, research and development of new prediction techniques are essential.

Collinearity Between Vectors of Descriptors of Similar Compounds

Figure: normalized descriptors of n-heptane (vertical axis) versus those of n-hexane (horizontal axis). Fitted line: y = 1.08809 x, R² = 0.99649.

99 normalized molecular descriptors of n-heptane versus those of n-hexane.

Linear relationship between the descriptors
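The collinearity check on this slide can be sketched as a least-squares fit through the origin. The block below is a minimal illustration only: the function name `fit_through_origin` is hypothetical, the data are synthetic stand-ins (not the 99 Dragon descriptors), and R² is computed as 1 - SS_res/SS_tot about the mean, which may differ from the convention used to produce the slide's value.

```python
import numpy as np

def fit_through_origin(x, y):
    """Least-squares slope of y = a*x (no intercept) and R^2 about the mean of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = float(x @ y / (x @ x))           # closed-form solution for a zero-intercept line
    residual = y - slope * x
    ss_res = float(residual @ residual)      # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return slope, 1.0 - ss_res / ss_tot

# Synthetic stand-in data (assumption): two nearly collinear descriptor vectors.
rng = np.random.default_rng(0)
hexane = rng.uniform(-0.2, 0.6, size=99)
heptane = 1.09 * hexane + rng.normal(0.0, 0.01, size=99)

slope, r2 = fit_through_origin(hexane, heptane)
print(f"slope = {slope:.4f}, R^2 = {r2:.4f}")
```

A slope near 1 with R² near 1 indicates that the two compounds' descriptor vectors are almost proportional, which is the sense of "collinearity" used here.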

Collinearity Between Vectors of Properties of Similar Compounds

Selected properties of n-heptane versus those of n-hexane.

Figure: properties of n-heptane (vertical axis) versus those of n-hexane (horizontal axis). Fitted line: y = 1.069266 x, R² = 0.999353.

Linear relationship between the vectors of properties

Basis of the QS2PR method (Shacham et al, AIChE J. 50(10), 2481-2492, 2004)

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds

VRD2: average Randic-type eigenvector-based index from distance matrix (eigenvalue-based indices)

Figure: critical temperature (K) versus descriptor VRD2. Fitted line: y = 106.6x + 262.65, R² = 0.9546.

The measured value for 3,3-dimethylhexane is marked on the plot. Prediction error: 0.68 %.

Similarity Group (Training Set) of 3,3-dimethylhexane

No.     Compound                    Corr. Coeff  Tc (K)  Reliability %
1       2,3-dimethylhexane          0.99221      563.5   <0.2
2       2,4-dimethylhexane          0.98968      553.5   <0.2
3       2,2,3-trimethylpentane      0.98552      563.5   <0.2
4       3-ethylhexane               0.98281      565.5   <0.2
5       2,3,3-trimethylpentane      0.98114      573.5   <0.2
6       2,3,4-trimethylpentane      0.98111      566.4   <0.2
7       3-methylhexane              0.97958      535.2   <0.2
8       2-methyl-3-ethylpentane     0.97829      567.1   <0.2
9       2,3-dimethylpentane         0.97781      537.3   <0.2
10      2,2,3,4-tetramethylpentane  0.97722      592.6   <0.2
Target  3,3-dimethylhexane                       562     <0.2

A similarity group of 10 predictive compounds has been found to be sufficient in most cases.

$\mathrm{GACC} = (r_{t,1} + r_{t,10})/2 = 0.985$: a measure of the level of group similarity

Basis of the Targeted QSPR method (Brauner et al, I&EC Research 45, 8430-8437, 2006)
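The selection of a similarity group can be sketched as ranking candidate compounds by the correlation coefficient between their descriptor vector and the target's. This is an illustration under stated assumptions, not the authors' implementation: the function names and the synthetic candidate data are hypothetical, and GACC is taken to be the mean of the first and tenth correlation coefficients, which agrees with the tabulated values ((0.99221 + 0.97722)/2 ≈ 0.985).

```python
import numpy as np

def similarity_group(target, candidates, names, size=10):
    """Return the `size` candidates whose descriptor vectors correlate best with the target."""
    target = np.asarray(target, dtype=float)
    scored = []
    for name, vector in zip(names, candidates):
        # Pearson correlation between the two descriptor vectors
        r = float(np.corrcoef(target, np.asarray(vector, dtype=float))[0, 1])
        scored.append((name, r))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:size]

def gacc(correlations):
    """Assumed definition: mean of the highest and lowest correlation in the group."""
    return 0.5 * (correlations[0] + correlations[-1])

# Synthetic candidates (assumption): noisy copies of a random target vector.
rng = np.random.default_rng(42)
target = rng.normal(size=99)
names = [f"candidate_{i}" for i in range(30)]
candidates = [target + rng.normal(scale=0.02 * (i + 1), size=99) for i in range(30)]

group = similarity_group(target, candidates, names)
group_r = [r for _, r in group]
print(f"GACC (synthetic group) = {gacc(group_r):.4f}")

# Correlation coefficients copied from the similarity-group table on the slide.
table_r = [0.99221, 0.98968, 0.98552, 0.98281, 0.98114,
           0.98111, 0.97958, 0.97829, 0.97781, 0.97722]
print(f"GACC (slide table) = {gacc(table_r):.3f}")  # 0.985
```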

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds

Collinearity between the descriptor VEv1 and normal boiling temperature for the n-alkanoic acid homologous series

Figure: normal boiling temperature (K) versus descriptor VEv1. Fitted line: y = 104.56x + 187.2, R² = 0.9989. Legend: DIPPR-Measured, DIPPR-Predicted, HS-QSPR Pred.
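A single-descriptor linear QSPR of the kind plotted on this slide can be sketched as follows. The slope and intercept are the fitted values reported on the slide; everything else is an assumption for illustration: the descriptor values, the added noise, and the helper name `predict_tb` are hypothetical and are not DIPPR data.

```python
import numpy as np

SLOPE, INTERCEPT = 104.56, 187.2   # fitted line reported on the slide

def predict_tb(vev1):
    """Normal boiling temperature (K) from descriptor VEv1 using the slide's line."""
    return SLOPE * vev1 + INTERCEPT

# Hypothetical descriptor values spanning the plotted range, with synthetic noise.
descriptor = np.linspace(1.5, 5.0, 12)
rng = np.random.default_rng(1)
tb = predict_tb(descriptor) + rng.normal(0.0, 2.0, size=descriptor.size)

# Hold out the last member, refit on the rest, and predict the held-out value.
slope, intercept = np.polyfit(descriptor[:-1], tb[:-1], 1)
predicted = slope * descriptor[-1] + intercept
error_pct = abs(predicted - tb[-1]) / tb[-1] * 100.0

print(f"refit: y = {slope:.2f}x + {intercept:.2f}")
print(f"held-out prediction error = {error_pct:.2f} %")
```

Holding out one member of the series and predicting it from the others mimics, in a simplified way, how a target compound is predicted from a training set of its homologues.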

Sources of Molecular Descriptors and Thermo-Physical Properties

The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package.

The Dragon program (http://www.talete.mi.it) was used to calculate 1664 descriptors for the 340 compounds in the database from minimized-energy molecular models.

Property data (measured and predicted) were taken from the DIPPR (http://dippr.byu.edu) and NIST (National Institute of Standards and Technology, http://webbook.nist.gov/chemistry) databases.

Descriptor Types Generated by the Dragon Program

3-D descriptors, very sensitive to molecular structure minimization

Identifying Inaccuracy and Inconsistency Among 1600 Molecular Descriptors

Sources of inaccuracy and inconsistency:

The descriptor cannot be calculated by DRAGON (-999)

The descriptor value is set at zero for certain compounds

Sensitivity of 3-D descriptors to the structure minimization method

Figure: descriptors of n-heptane (vertical axis) versus those of n-hexane (horizontal axis). Fitted line: y = 0.9815x, R² = 0.917.
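The first two sources of inconsistency listed on this slide can be screened for mechanically. The sketch below is one possible approach, not the procedure used by the authors: the function name, the small example table, and the decision to flag only descriptors that are zero for some (but not all) compounds are assumptions.

```python
import numpy as np

MISSING = -999.0   # value the slide associates with descriptors DRAGON cannot calculate

def usable_descriptor_mask(table):
    """table: rows are compounds, columns are descriptors. True marks a usable column."""
    table = np.asarray(table, dtype=float)
    has_missing = np.any(table == MISSING, axis=0)
    zero_counts = np.sum(table == 0.0, axis=0)
    # Zero for some compounds but not for all of them (assumed screening rule)
    partly_zero = (zero_counts > 0) & (zero_counts < table.shape[0])
    return ~(has_missing | partly_zero)

# Hypothetical 4-compound, 4-descriptor table.
table = [
    [1.2, -999.0, 0.0, 0.0],
    [1.5,    3.1, 2.0, 0.0],
    [1.9,    3.3, 2.4, 0.0],
    [2.2,    3.8, 2.9, 0.0],
]
mask = usable_descriptor_mask(table)
print(mask.tolist())   # [True, False, False, True]
```

The last column is kept because it is zero for every compound, so it behaves as a constant descriptor rather than an inconsistent one. Sensitivity of 3-D descriptors to the minimization method cannot be detected this way and needs a separate check.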

Presentation Outline

Categorizing the Molecular Descriptors According to the Trend of Their Change with nC for Homologous Series

Identifying Training Sets from Compounds Belonging to the Target Compound's Homologous Series

Predicting Critical Properties, Normal Boiling and Melting Temperatures, Liquid Molar Volume and Refractive Index for Five Homologous Series with and without the Use of 3-D descriptors.

Comparison of the Results and Conclusions

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series

Figure: descriptor ADDD versus number of carbon atoms. Fitted line: ADDD = 3.364 nC - 3.388, R² = 0.9995.

The descriptor ADDD changes with nC for the 1-alkene series in a trend similar to the change of liquid molar volume

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series

Figure: normalized values of the descriptors AGDD, ASP and H4m versus nC (number of carbon atoms) for the 1-alkene homologous series.

The trend is similar to that of TC.

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series

The descriptor ICR changes with nC for the 1-alkene series in a trend similar to the change of normal melting temperature

Figure: descriptor ICR versus number of carbon atoms.

Checking Consistency of Molecular Descriptors – Inconsistent Change with nC for Homologous Series

The descriptor Gm changes with nC for the 1-alkene series in an apparently random manner

Figure: descriptor Gm versus number of carbon atoms.

Trend of change of descriptors with nC for homologous series

Constant descriptors identify compounds belonging to the homologous series (HS) of the target compound, and linearly increasing descriptors are used to rank those compounds according to their distance from the target.

Category  % of descriptors in the database  % of 3D descriptors  Trend of change of the descriptor with nC
I         8.5                               7.3                  Constant
II        10.9                              41.7                 Linear or nearly linear increase
IIIA      25.2                              32.1                 Nonlinear monotonic increase or decrease with decreasing slope
IIIB      10.1                              66.7                 Nonlinear monotonic increase or decrease with increasing slope
IV        22.9                              83.6                 Inconsistent, no particular trend or different trends for different homologous series
V         21.9                              62.9                 Zero value for some nC, nonlinear monotonic increase for others
VI        0.4                               100                  Separate curves for odd and even nC
VII       0.2                               100                  Periodic
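A simplified version of this categorization can be automated for a single homologous series. The sketch below covers only categories I, II, IIIA, IIIB and IV; categories V to VII are not handled, and the tolerance, the R² threshold and the function name are assumptions chosen for illustration rather than criteria taken from the presentation.

```python
import numpy as np

def classify_trend(n_c, values, tol=1e-3, r2_linear=0.999):
    """Assign a descriptor to category I, II, IIIA, IIIB or IV from its change with nC."""
    n_c = np.asarray(n_c, dtype=float)
    values = np.asarray(values, dtype=float)

    # Category I: essentially constant along the series
    spread = values.max() - values.min()
    if spread <= tol * max(np.abs(values).max(), 1e-12):
        return "I"

    # Category II: straight-line fit with positive slope explains nearly all variation
    slope, intercept = np.polyfit(n_c, values, 1)
    residual = values - (slope * n_c + intercept)
    r2 = 1.0 - float(residual @ residual) / float(((values - values.mean()) ** 2).sum())
    if r2 >= r2_linear and slope > 0:
        return "II"

    # Categories IIIA / IIIB: monotonic, with slope magnitude shrinking or growing
    steps = np.diff(values)
    if np.all(steps > 0) or np.all(steps < 0):
        local_slope = np.abs(steps / np.diff(n_c))
        half = local_slope.size // 2
        if local_slope[half:].mean() < local_slope[:half].mean():
            return "IIIA"
        return "IIIB"

    # Category IV: no consistent trend
    return "IV"

n_c = np.arange(2, 31)
print(classify_trend(n_c, np.full(n_c.size, 5.0)))     # I
print(classify_trend(n_c, 3.364 * n_c - 3.388))        # II
print(classify_trend(n_c, np.sqrt(n_c)))               # IIIA
print(classify_trend(n_c, n_c.astype(float) ** 2))     # IIIB
print(classify_trend(n_c, np.sin(n_c)))                # IV
```

The linear test case reuses the ADDD line reported earlier for the 1-alkene series. A full implementation would also need to compare trends across several homologous series, since category IV includes descriptors whose trend differs from one series to another.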

Prediction of TC, Tb and RI (Refractive Index) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids

In about 93 % of the cases, a descriptor of category IIIA was used as the dominant descriptor (the first to enter, out of one or two). Exception: 3-D descriptors for 1-alcohols (category IV).


Prediction of VC and Vm (Liquid molar vol.) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids

In 90 % of the cases, descriptors of category II were used.

Exception: 3-D descriptors for 1-alkenes and 1-alcohols (category IV).


Prediction of PC and Tm (Melting Point.) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids

Descriptors of category IIIA were used in 40 % of the cases, category IV in 35 %, category V in 20 %, and category II in 5 %.


Uncertainty (%) in Predicting Various Properties Without 3-D Descriptors

The large prediction errors in VC (and PC) stem from the uncertainty of the DIPPR data. The errors in Tm are caused by the irregular shape of the melting point curve (3-D descriptors needed).

Group            Statistic  TC    PC    VC    Tm    Tb    Vm    RI
n-alkanes        mean       0.04  0.48  0.35  1.80  0.22  0.29  0.01
                 median     0.03  0.36  0.22  0.11  0.03  0.22  0.00
                 STDEV      0.05  0.38  0.39  6.86  0.81  0.24  0.01
1-alkenes        mean       0.14  0.97  1.78  0.89  0.19  0.31  0.06
                 median     0.07  0.56  0.19  0.16  0.03  0.09  0.04
                 STDEV      0.24  1.48  4.83  1.21  0.40  0.47  0.06
n-alkylbenzenes  mean       0.13  0.29  0.51  0.17  0.03  0.08  0.00
                 median     0.08  0.27  0.29  0.09  0.02  0.08  0.00
                 STDEV      0.15  0.23  0.76  0.20  0.04  0.05  0.00
1-alcohols       mean       0.15  1.74  3.53  2.52  0.17  0.50  0.01
                 median     0.07  0.91  1.37  0.17  0.17  0.55  0.01
                 STDEV      0.23  2.60  6.17  5.38  0.15  0.31  0.01
aliphatic acids  mean       0.30  0.97  2.83  0.17  0.18  0.75  0.02
                 median     0.22  0.70  0.40  0.11  0.10  0.23  0.01
                 STDEV      0.36  0.87  5.05  0.13  0.19  1.89  0.02
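The three statistics in the table above can be computed from per-compound errors as sketched below. This assumes that the tabulated uncertainty is the absolute relative error in percent and that STDEV is the sample standard deviation; neither is stated on the slide. The measured and predicted values are hypothetical.

```python
import statistics

def percent_errors(measured, predicted):
    """Absolute relative error of each prediction, in percent."""
    return [abs(p - m) / abs(m) * 100.0 for m, p in zip(measured, predicted)]

def summarize(errors):
    """Mean, median and sample standard deviation of a list of errors."""
    return {
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "stdev": statistics.stdev(errors),   # sample (n - 1) standard deviation
    }

# Hypothetical property values for four members of a series; the last is an outlier.
measured = [500.0, 520.0, 540.0, 560.0]
predicted = [500.5, 519.0, 540.0, 571.2]

errors = percent_errors(measured, predicted)
stats = summarize(errors)
print({name: round(value, 3) for name, value in stats.items()})
```

A mean well above the median, as in the Tm column for n-alkanes (1.80 versus 0.11), indicates that a few compounds with large errors dominate the average.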

Conclusions

1. The Dragon descriptors were divided into seven categories according to the trend of their change as a function of nC in homologous series.

2. It was observed that 3-D descriptors may exhibit very irregular (or even random) behavior.

3. The exclusive use of descriptors of two categories: “Constant” and “Linear Increase”, enabled selection of training sets belonging to the target compound’s homologous series.

4. Using the proposed method to predict 7 properties for 5 homologous series has shown that most properties can be predicted at the level of experimental uncertainty without using 3-D descriptors. This extends the method’s applicability, increases its reliability and reduces the probability of “chance correlations”.