Selection of Molecular Descriptor Subsets for Property Prediction

Inga Paster (a), Neima Brauner (b) and Mordechai Shacham (a)

(a) Department of Chemical Engineering, Ben-Gurion University, Beer-Sheva, Israel

(b) School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The Needs

Physicochemical and biological properties are needed for risk assessment, environmental impact assessment, and process design, analysis and optimization.

The number of compounds currently used by industry, or of immediate interest to it, is ~100,000. The number of compounds that are theoretically possible and may be of future interest is several tens of millions. The Toxic Substances Control Act (TSCA) inventory has 80,000 chemicals. Only 50% have some physicochemical property data, and only 15% have data from genotoxicity bioassays.

The DIPPR 801 database contains 2101 compounds (33 constant properties, 15 temperature-dependent properties).

Property Prediction Methods

"Group contribution" methods

Methods based on the "corresponding-states principle"

"Asymptotic behavior" correlations (ABCs)

"Quantitative Structure Property Relationships" (QSPRs), based on the use of molecular descriptors

The existing methods cannot provide satisfactory predictions for certain properties (such as normal melting temperature) and for certain groups of compounds. Thus, research and development of new prediction techniques are essential.

Collinearity Between Vectors of Descriptors of Similar Compounds

[Figure: 99 normalized molecular descriptors of n-heptane (y axis) versus those of n-hexane (x axis). Linear fit: y = 1.08809 x, R² = 0.99649.]

Linear relationship between the descriptors
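A minimal sketch of how such a collinearity check between two descriptor vectors could be computed. The descriptor values below are hypothetical placeholders, not the 99 values plotted on the slide, and the squared Pearson correlation is used as a stand-in for R², since the slide does not state which definition was used.

```python
import math

def through_origin_slope(x, y):
    """Least-squares slope a of the no-intercept model y = a * x."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least one non-zero value")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical normalized descriptor values for two similar compounds.
hexane = [-0.15, 0.02, 0.10, 0.25, 0.40, 0.55]
heptane = [1.09 * v for v in hexane]  # perfectly collinear by construction

slope = through_origin_slope(hexane, heptane)
r = pearson_r(hexane, heptane)
print(f"slope = {slope:.4f}, r^2 = {r * r:.4f}")
```

A slope close to 1 together with r² close to 1 indicates that the two compounds have nearly proportional descriptor vectors, which is the kind of similarity the slide illustrates.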

Collinearity Between Vectors of Properties of Similar Compounds

[Figure: selected properties of n-heptane (y axis) versus those of n-hexane (x axis). Linear fit: y = 1.069266 x, R² = 0.999353.]

Linear relationship between the vectors of properties

Basis of the QS2PR method (Shacham et al, AIChE J. 50(10), 2481-2492, 2004)

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds

VRD2: Average Randic-type eigenvector-based index from distance matrix (eigenvalue-based indices)

[Figure: critical temperature (K) versus descriptor VRD2. Linear fit: y = 106.6x + 262.65, R² = 0.9546. The measured value for 3,3-dimethylhexane is marked.]

Prediction error for 3,3-dimethylhexane: 0.68 %
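The single-descriptor relationship on this slide can be sketched as an ordinary least-squares line followed by a percent-error check. This is only an illustration: the training points are synthetic values placed exactly on the slide's line, and the target's VRD2 value is a hypothetical placeholder (the slide does not list the descriptor values), so the printed error is not the 0.68 % reported above.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx

def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured value."""
    return 100.0 * abs(predicted - measured) / abs(measured)

# Synthetic training set lying exactly on the line reported on the slide.
vrd2 = [2.6, 2.8, 2.9, 3.0, 3.1]
tc = [106.6 * v + 262.65 for v in vrd2]

slope, intercept = fit_line(vrd2, tc)

target_vrd2 = 2.85          # hypothetical descriptor value for the target
tc_measured = 562.0         # K, measured Tc of 3,3-dimethylhexane from the slides
tc_predicted = slope * target_vrd2 + intercept

print(f"Tc = {slope:.1f} * VRD2 + {intercept:.2f}")
print(f"predicted {tc_predicted:.1f} K, error {percent_error(tc_predicted, tc_measured):.2f} %")
```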

Similarity Group (Training Set) of 3,3-dimethylhexane

No.     Compound                      Corr. Coeff.   Tc (K)   Reliability (%)
1       2,3-dimethylhexane            0.99221        563.5    <0.2
2       2,4-dimethylhexane            0.98968        553.5    <0.2
3       2,2,3-trimethylpentane        0.98552        563.5    <0.2
4       3-ethylhexane                 0.98281        565.5    <0.2
5       2,3,3-trimethylpentane        0.98114        573.5    <0.2
6       2,3,4-trimethylpentane        0.98111        566.4    <0.2
7       3-methylhexane                0.97958        535.2    <0.2
8       2-methyl-3-ethylpentane       0.97829        567.1    <0.2
9       2,3-dimethylpentane           0.97781        537.3    <0.2
10      2,2,3,4-tetramethylpentane    0.97722        592.6    <0.2
Target  3,3-dimethylhexane                           562      <0.2

A similarity group of 10 predictive compounds has been found to be sufficient in most cases.

GACC = 0.985 (from r_t,1 and r_t,10, the correlation coefficients of the first and tenth compounds with the target): a measure of the level of group similarity

Basis of the Targeted QSPR method (Brauner et al, I&EC Research 45, 8430-8437, 2006)
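The construction of a similarity group can be sketched as ranking candidate compounds by the correlation coefficient between their descriptor vectors and that of the target, then keeping the top entries. This is a simplified illustration rather than the published procedure: the function name `similarity_group`, the compound names and all descriptor values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

def similarity_group(target, candidates, size=10):
    """Return (name, r) pairs for the `size` candidates most correlated with the target."""
    scored = [(name, pearson_r(target, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:size]

# Hypothetical descriptor vectors (one value per descriptor).
target = [0.1, 0.4, 0.2, 0.8, 0.5]
candidates = {
    "compound A": [0.11, 0.41, 0.19, 0.79, 0.52],  # close to the target
    "compound B": [0.8, 0.1, 0.5, 0.2, 0.4],       # dissimilar
    "compound C": [0.2, 0.8, 0.4, 1.6, 1.0],       # exactly proportional to the target
}

group = similarity_group(target, candidates, size=2)
for name, r in group:
    print(f"{name}: r = {r:.5f}")
```

With a real descriptor database, `size=10` would correspond to the group size the slide reports as sufficient in most cases.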

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds

Collinearity between the descriptor VEv1 and normal boiling temperature for the n-alkanoic acid homologous series

[Figure: normal boiling temperature (K) versus descriptor VEv1. Linear fit: y = 104.56x + 187.2, R² = 0.9989. Legend: DIPPR-Measured, DIPPR-Predicted, HS-QSPR Pred.]

Sources of Molecular Descriptors and Thermo-Physical Properties

The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package.

The Dragon program (http://www.talete.mi.it) was used to calculate 1664 descriptors for the 340 compounds in the database from energy-minimized molecular models.

Property data (measured and predicted) were taken from the DIPPR (http://dippr.byu.edu) and NIST (National Institute of Standards and Technology, http://webbook.nist.gov/chemistry) databases.

Descriptor Types Generated by the Dragon Program

3-D descriptors, very sensitive to molecular structure minimization

Identifying Inaccuracy and Inconsistency Among 1600 Molecular Descriptors

Sources of inaccuracy and inconsistency:

The descriptor cannot be calculated by DRAGON (-999)

The descriptor value is set at zero for certain compounds

Sensitivity of 3-D descriptors to the structure minimization method

[Figure: descriptors of n-heptane (y axis) versus those of n-hexane (x axis). Linear fit: y = 0.9815x, R² = 0.917.]
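The first two sources of inconsistency listed on this slide lend themselves to a simple automated screen. The sketch below assumes the descriptor data are held as a mapping from descriptor name to a list of values (one per compound); that layout, the function name `screen_descriptors` and the descriptor names D1 to D3 are hypothetical, while the -999 flag is the one quoted on the slide.

```python
MISSING = -999  # value reported when DRAGON cannot calculate a descriptor

def screen_descriptors(table):
    """Split descriptors into usable names and rejected names with a reason.

    `table` maps a descriptor name to its values, one per compound.
    """
    usable, rejected = [], {}
    for name, values in table.items():
        if any(v == MISSING for v in values):
            rejected[name] = "not calculated for some compounds"
        elif any(v == 0 for v in values) and any(v != 0 for v in values):
            rejected[name] = "zero for some compounds only"
        else:
            usable.append(name)
    return usable, rejected

# Hypothetical descriptor values for four compounds.
table = {
    "D1": [0.52, 0.61, 0.70, 0.79],
    "D2": [1.10, -999, 1.35, 1.48],
    "D3": [0.0, 0.0, 2.4, 2.9],
}

usable, rejected = screen_descriptors(table)
print("usable:", usable)
print("rejected:", rejected)
```

The third source, sensitivity of 3-D descriptors to the minimization method, cannot be detected from a single table and would need descriptors computed from differently minimized structures.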

Presentation Outline

Categorizing the Molecular Descriptors According to the Trend of Their Change with nC for Homologous Series

Identifying Training Sets from Compounds Belonging to the Target Compound's Homologous Series

Predicting Critical Properties, Normal Boiling and Melting Temperatures, Liquid Molar Volume and Refractive Index for Five Homologous Series with and without the Use of 3-D descriptors.

Comparison of the Results and Conclusions

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series

[Figure: descriptor ADDD versus No. of carbon atoms. Linear fit: ADDD = 3.364 nC - 3.388, R² = 0.9995.]

The descriptor ADDD changes with nC for the 1-alkene series in a trend similar to the change of liquid molar volume

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series

[Figure: normalized values of the descriptors AGDD, ASP and H4m versus nC for the 1-alkene homologous series. Axes: No. of carbon atoms (x), normalized descriptor value (y).]

Similar to the trend of TC

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series

The descriptor ICR changes with nC for the 1-alkene series in a trend similar to the change of normal melting temperature

[Figure: descriptor ICR versus No. of carbon atoms.]

Checking Consistency of Molecular Descriptors – Inconsistent Change with nC for Homologous Series

The descriptor Gm changes with nC for the 1-alkene series in an apparently random manner

[Figure: descriptor Gm versus No. of carbon atoms.]

Trend of change of descriptors with nC for homologous series

Constant descriptors identify the compounds that belong to the homologous series (HS) of the target compound, and linearly increasing descriptors are used to rank those compounds according to their distance from the target.
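A rough sketch of this two-step selection, under stated assumptions: each compound is represented by a tuple of constant-category descriptor values and one linearly increasing descriptor value. The data layout, the function name `select_training_set`, the matching tolerance and all numbers are hypothetical and are not taken from the presentation.

```python
def select_training_set(target, candidates, size=10, tol=1e-6):
    """Pick training compounds from the target's homologous series.

    Step 1: keep candidates whose constant descriptors match the target's.
    Step 2: rank them by distance from the target in the linear descriptor.
    """
    same_series = [
        (name, c) for name, c in candidates.items()
        if len(c["constant"]) == len(target["constant"])
        and all(abs(a - b) <= tol for a, b in zip(c["constant"], target["constant"]))
    ]
    same_series.sort(key=lambda item: abs(item[1]["linear"] - target["linear"]))
    return [name for name, _ in same_series[:size]]

# Hypothetical compounds: "constant" holds category I descriptors,
# "linear" holds one category II descriptor.
target = {"constant": (1.0, 0.5), "linear": 20.0}
candidates = {
    "series member n+3": {"constant": (1.0, 0.5), "linear": 30.0},
    "series member n-1": {"constant": (1.0, 0.5), "linear": 17.0},
    "other series":      {"constant": (2.0, 0.5), "linear": 20.1},
    "series member n+1": {"constant": (1.0, 0.5), "linear": 23.5},
}

training = select_training_set(target, candidates, size=2)
print(training)
```

The compound from the other series is excluded even though its linear descriptor is closest to the target's, which is the role the slide assigns to the constant descriptors.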

Category  Trend of change of the descriptor with nC                                               % of descriptors in the database  % of 3-D descriptors
I         Constant                                                                                8.5                               7.3
II        Linear or nearly linear increase                                                        10.9                              41.7
IIIA      Nonlinear monotonic increase or decrease with decreasing slope                          25.2                              32.1
IIIB      Nonlinear monotonic increase or decrease with increasing slope                          10.1                              66.7
IV        Inconsistent, no particular trend or different trends for different homologous series  22.9                              83.6
V         Zero value for some nC, nonlinear monotonic increase for others                         21.9                              62.9
VI        Separate curves for odd and even nC                                                     0.4                               100
VII       Periodic                                                                                0.2                               100
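To make the categories more concrete, the following sketch assigns a trend label to one descriptor from its values along a single homologous series. It is a heavily simplified illustration: the decision rules and the tolerances `const_tol` and `linear_tol` are assumptions made here, not the criteria used by the authors, and categories VI (odd/even curves) and VII (periodic) are not detected and would fall under IV.

```python
def classify_trend(n_c, values, const_tol=1e-3, linear_tol=0.05):
    """Return a category label (I, II, IIIA, IIIB, IV or V) for one descriptor."""
    if len(values) < 3 or len(values) != len(n_c):
        raise ValueError("need at least three (nC, value) pairs")
    if any(v == 0 for v in values) and any(v != 0 for v in values):
        return "V"                                    # zero for some nC only
    scale = max(abs(v) for v in values) or 1.0
    if max(values) - min(values) <= const_tol * scale:
        return "I"                                    # constant
    slopes = [(values[i + 1] - values[i]) / (n_c[i + 1] - n_c[i])
              for i in range(len(values) - 1)]
    if not (all(s > 0 for s in slopes) or all(s < 0 for s in slopes)):
        return "IV"                                   # not monotonic
    mags = [abs(s) for s in slopes]
    if max(mags) - min(mags) <= linear_tol * max(mags):
        return "II"                                   # nearly constant slope (a linear decrease also lands here)
    if all(b <= a for a, b in zip(mags, mags[1:])):
        return "IIIA"                                 # slope magnitude decreases
    if all(b >= a for a, b in zip(mags, mags[1:])):
        return "IIIB"                                 # slope magnitude increases
    return "IV"

n_c = [2, 3, 4, 5, 6, 7]
examples = {
    "constant":  [1.5] * 6,
    "linear":    [3.364 * n - 3.388 for n in n_c],   # fit reported for ADDD
    "flatten":   [n ** 0.5 for n in n_c],
    "steepen":   [n ** 2 for n in n_c],
    "irregular": [0.20, 0.35, 0.15, 0.30, 0.12, 0.28],
    "partly 0":  [0.0, 0.0, 1.2, 1.5, 1.7, 1.8],
}
labels = {name: classify_trend(n_c, vals) for name, vals in examples.items()}
print(labels)
```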

Prediction of TC, Tb and RI (Refractive Index) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids

In ~93 % of the cases, descriptors of category IIIA were used as the dominant descriptor (the first to enter, out of one or two). Exception: 3-D descriptors for 1-alcohols (category IV).

Prediction of VC and Vm (Liquid molar vol.) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids

In 90 % of the cases, descriptors of category II were used.

Exception: 3-D descriptors for 1-alkenes, 1-alcohols (category IV)

Prediction of PC and Tm (Melting Point) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids

Descriptors of category IIIA were used in 40 % of the cases, category IV in 35 %, category V in 20 % and category II in 5 %.

Uncertainty (%) in Predicting Various Properties Without 3-D Descriptors

The large prediction errors in VC (and PC) are attributed to the uncertainty of the DIPPR data. The errors in Tm are caused by the irregular shape of the melting point curve (3-D descriptors are needed for this property).

Group            Statistic  TC    PC    VC    Tm    Tb    Vm    RI
n-alkanes        mean       0.04  0.48  0.35  1.80  0.22  0.29  0.01
                 median     0.03  0.36  0.22  0.11  0.03  0.22  0.00
                 STDEV      0.05  0.38  0.39  6.86  0.81  0.24  0.01
1-alkenes        mean       0.14  0.97  1.78  0.89  0.19  0.31  0.06
                 median     0.07  0.56  0.19  0.16  0.03  0.09  0.04
                 STDEV      0.24  1.48  4.83  1.21  0.40  0.47  0.06
n-alkylbenzenes  mean       0.13  0.29  0.51  0.17  0.03  0.08  0.00
                 median     0.08  0.27  0.29  0.09  0.02  0.08  0.00
                 STDEV      0.15  0.23  0.76  0.20  0.04  0.05  0.00
1-alcohols       mean       0.15  1.74  3.53  2.52  0.17  0.50  0.01
                 median     0.07  0.91  1.37  0.17  0.17  0.55  0.01
                 STDEV      0.23  2.60  6.17  5.38  0.15  0.31  0.01
aliphatic acids  mean       0.30  0.97  2.83  0.17  0.18  0.75  0.02
                 median     0.22  0.70  0.40  0.11  0.10  0.23  0.01
                 STDEV      0.36  0.87  5.05  0.13  0.19  1.89  0.02
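Summary statistics of this kind could be produced as sketched below. The assumptions here are that the tabulated uncertainty is the absolute percent deviation of the predicted value from the reference value and that STDEV is the sample standard deviation; the slides state neither explicitly. The property values are hypothetical.

```python
import statistics

def error_summary(predicted, reference):
    """Mean, median and sample standard deviation of absolute percent errors."""
    errors = [100.0 * abs(p - m) / abs(m) for p, m in zip(predicted, reference)]
    return {
        "mean": statistics.mean(errors),
        "median": statistics.median(errors),
        "stdev": statistics.stdev(errors),
    }

# Hypothetical reference and predicted values of one property for one series.
reference = [100.0, 200.0, 400.0]
predicted = [101.0, 200.0, 396.0]

summary = error_summary(predicted, reference)
print({key: round(value, 3) for key, value in summary.items()})
```

Reporting the median alongside the mean is useful because a few compounds with large errors raise the mean and the standard deviation but leave the median almost unchanged, as in the Tm column for n-alkanes above.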

Conclusions

1. The Dragon descriptors were divided into seven categories according to the trend of their change as a function of nC in homologous series.

2. It was observed that 3-D descriptors may exhibit very irregular (or even random) behavior.

3. The exclusive use of descriptors of two categories: “Constant” and “Linear Increase”, enabled selection of training sets belonging to the target compound’s homologous series.

4. The use of the proposed method for predicting 7 properties for 5 homologous series has shown that most properties can be predicted at the experimental uncertainty level without using 3-D descriptors. This extends the method's applicability, increases its reliability and reduces the probability of "chance correlations".