Estimation of ionization constants for different classes of organic compounds with the use of the...

CHEMISTRY

Estimation of Ionization Constants for Different Classes of Organic Compounds with the Use of the Fragmental Approach to the Search of Structure–Property Relationships

A. A. Ivanova, I. I. Baskin, V. A. Palyulin, and Academician N. S. Zefirov

Received November 30, 2006

DOI: 10.1134/S0012500807040040

Moscow State University, Vorob’evy gory, Moscow, 119992 Russia

ISSN 0012-5008, Doklady Chemistry, 2007, Vol. 413, Part 2, pp. 90–94. © Pleiades Publishing, Ltd., 2007. Original Russian Text © A.A. Ivanova, I.I. Baskin, V.A. Palyulin, N.S. Zefirov, 2007, published in Doklady Akademii Nauk, 2007, Vol. 413, No. 64, pp. 766–770.

As is known, the ionization constant Ka (or –log Ka = pKa) is one of the most important characteristics of organic compounds, which reflects their acid–base properties. In addition, knowledge of pKa values is essential for understanding biological activity and transport of substances at the molecular level [1]. In this context, reliable estimation of the pKa values for both the known and hypothetical molecules is an important problem.
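The relation between Ka and pKa, and the reason pKa matters for transport, can be illustrated with a short sketch. The function names are hypothetical, the numerical value is illustrative rather than taken from the paper, and the ionized fraction uses the standard Henderson–Hasselbalch relation for a monoprotic acid, which the text does not state explicitly.

```python
import math

def pka_from_ka(ka: float) -> float:
    """Return pKa = -log10(Ka) for a positive ionization constant."""
    if ka <= 0:
        raise ValueError("Ka must be positive")
    return -math.log10(ka)

def fraction_deprotonated(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid HA present as A- at a given pH.

    Follows from pH = pKa + log10([A-]/[HA]) (Henderson-Hasselbalch).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))

# Illustrative value only: Ka = 1.0e-5 corresponds to pKa = 5
example_pka = pka_from_ka(1.0e-5)
# At pH equal to pKa the acid is half ionized
half_ionized = fraction_deprotonated(example_pka, ph=example_pka)
```

A shift of one pKa unit changes the ratio of ionized to neutral form tenfold, which is why prediction errors of a few tenths of a unit are practically relevant.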

A number of experimental methods have been developed for determining the ionization constants of organic compounds. However, computational methods of determining pKa are also available. They have definite advantages over the experimental methods. The basic advantage of such predictive methods is the possibility of estimating pKa (or another property) even for an unavailable compound. In addition, computational methods make it possible to quickly and accurately determine the values of a property for a large set of compounds (including hypothetical). Thus, computational methods are an efficient tool for solving many problems of modern chemistry.

Up to now, a number of papers have been published dealing with the prediction of ionization constants for different classes of organic compounds with the use of different approaches. For example, to estimate the pKa values of phenols, carboxylic acids, and nitrogen-containing compounds, linear regression models were constructed with the use of quantum-chemical descriptors [2–4]. In [5, 6], the pKa values for phenols and carboxylic acids were estimated by means of thermodynamic calculations with a preliminary calculation of atomic charges and geometry optimization. Comparative molecular field analysis (CoMFA) models were also constructed to predict the pKa values of 18 nitrogen-containing compounds [7]. The findings in [2–7] show that the standard error of prediction (s) of the pKa values obtained using regression models or thermodynamic calculations is 0.3–0.6 kcal/mol, whereas the s value obtained with the use of the CoMFA model is 0.193 kcal/mol. However, it is worth noting that, although CoMFA ensures the smallest error of prediction, this method is efficient only for estimating the pKa values of a congeneric set of compounds. This is, first of all, associated with the necessity of the spatial alignment of compounds with respect to the related fragments. In particular, such an alignment is hindered when a data set contains both aliphatic and aromatic compounds. This imposes severe restrictions on the applicability of the CoMFA method for predicting the properties of wide heterogeneous sets. Methods based on using quantum-chemical descriptors in regression models do not have this disadvantage. However, the use of these methods (as well as the CoMFA method) requires the calculation of atomic charges and geometry optimization for each of the compounds under consideration. The results turn out to be extremely dependent on the parameters used, such as force field parameters, atomic charge calculation scheme, and threshold energy values for geometry optimization. In addition, these methods require considerable CPU time not only for constructing the predictive model itself but also for using this model to estimate the properties of hypothetical molecules, for each of which the atomic charges should be calculated and the geometry should be optimized.

Currently, empirical approaches using molecular structure descriptors, in particular, the fragmental approach, are widely used for predicting different physicochemical properties of organic compounds [8–10]. This approach is based on using substructural descriptors, which are the numbers of occurrences of atom chains of different length, branched fragments, rings (as a rule, three- to 15-membered), and bicyclic and tricyclic fragments.
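To make the notion of a substructural descriptor concrete, the following minimal sketch counts linear atom chains in a molecular graph. It is an illustration only: the input format (a list of element symbols plus a list of bonded index pairs) and the function name are assumptions, bond orders and hydrogens are ignored, and branched, cyclic, bicyclic, and tricyclic fragments are not covered.

```python
from collections import Counter

def count_chain_fragments(atoms, bonds, max_len=3):
    """Count linear atom chains of 1..max_len atoms in a molecular graph.

    atoms: element symbols, one per heavy atom (hypothetical input format)
    bonds: (i, j) index pairs of bonded atoms
    """
    neighbors = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].add(j)
        neighbors[j].add(i)

    seen = set()       # each chain is stored once, whatever its direction
    counts = Counter()

    def extend(path):
        key = min(tuple(path), tuple(reversed(path)))
        if key not in seen:
            seen.add(key)
            symbols = [atoms[k] for k in key]
            # Canonical name: the lexicographically smaller reading direction
            name = min("-".join(symbols), "-".join(reversed(symbols)))
            counts[name] += 1
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return counts

# Heavy-atom graph of ethanol: C-C-O
ethanol = count_chain_fragments(["C", "C", "O"], [(0, 1), (1, 2)])
```

Each key of the returned counter plays the role of one descriptor, and its count is the descriptor value for the molecule.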

A studied property is often a nonlinear function of the descriptors used, the general form of such a function being unknown a priori. In such cases, one of the most efficient methods of predicting properties is the use of artificial neural networks (ANNs) [11]. ANNs in combination with the fragmental approach have been used for predicting important properties, such as lipophilicity, heat capacity, viscosity, density, etc. [12, 13].

In the present work, we attempted to use the fragmental and quantum-chemical approaches for modeling the pKa values of different classes of organic compounds and for developing a general QSPR model for all classes of compounds under consideration. To do this, we created four databases with the MOLED program [14]: (1) phenols (170 compounds), (2) carboxylic acids (238 compounds), (3) nitrogen-containing compounds (268 compounds), and (4) the general database (676 compounds).

Each of the databases was randomly divided into training and test data sets (90 and 10% of compounds, respectively).
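A random 90/10 division of this kind can be sketched as follows. The function name, the fixed seed, and the choice to round the test-set size down are assumptions; rounding down happens to reproduce the 609/67 division reported for the 676-compound database, but the procedure actually used is not described.

```python
import random

def split_train_test(items, test_share=10, seed=0):
    """Randomly place one item in `test_share` into the test set (rounded down)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = len(shuffled) // test_share
    return shuffled[n_test:], shuffled[:n_test]

# Compound indices stand in for the database entries
train, test = split_train_test(range(676))
```

Fixing the seed makes the division reproducible, which matters when several models are compared on the same test set.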

The ionization constants for different compounds were taken from [2, 3]. In the databases created, the BASTET program revealed 11 structures (2-aminophenol, 3-aminophenol, 4-aminophenol, 2-amino-4-nitrophenol, 2-aminobenzoic acid, 3-aminobenzoic acid, 4-aminobenzoic acid, 4-aminosalicylic acid, anabasine, nicotine, N,N-dimethylethylamine) that had two ionization constants and were encountered twice, as well as hydroxyacetophenone for which the position of the hydroxy group was not specified. Thus, 23 compounds were removed from the databases reported in [2, 3].
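The clean-up described above (11 structures entered twice, plus one underspecified structure, giving 11 × 2 + 1 = 23 removed entries) amounts to dropping every copy of a repeated structure. A minimal sketch of that filtering step is given below; it is not the BASTET procedure, and the record format and identifiers are placeholders.

```python
from collections import Counter

def drop_repeated_structures(records):
    """Return the records whose structure identifier occurs exactly once.

    records: list of (structure_id, pKa) pairs (hypothetical input format).
    Every copy of a repeated structure is dropped, because without atom
    labels it is unclear which constant belongs to which ionization site.
    """
    occurrences = Counter(structure_id for structure_id, _ in records)
    return [rec for rec in records if occurrences[rec[0]] == 1]

# Placeholder identifiers and values, for illustration only
records = [("compound_A", 1.0), ("compound_B", 2.0), ("compound_A", 3.0)]
kept = drop_repeated_structures(records)
```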

Structure–property models were constructed using the NASAWIN neural network program package [15]. To model local properties, we developed a special modification of fragmental descriptors that allows for labeling definite atoms in a molecule. The method implies that (1) the atoms for which local properties are modeled are labeled and each local property, e.g., pKa, has a unique label, e.g., a; (2) when fragmental descriptors are generated, each label is considered as a separate pseudoatom with the name corresponding to the label symbol; and (3) only the descriptors containing the pseudoatom corresponding to this label are used for constructing structure–property equations. Although the third condition is not absolutely necessary, its fulfillment allows one to obtain more adequate and readily interpretable QSPR models.
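The three conditions can be sketched for the simplest fragment type, linear atom chains. The sketch assumes that the label is attached to the labeled atom as an extra graph node named "a"; the text says only that the label is treated as a separate pseudoatom, so this representation, the function name, and the input format are all assumptions.

```python
from collections import Counter

def labeled_chain_fragments(atoms, bonds, labeled_atom, label="a", max_len=4):
    """Count atom chains (2..max_len nodes) that contain the label pseudoatom.

    atoms: element symbols of the heavy atoms (hypothetical input format)
    bonds: (i, j) index pairs of bonded atoms
    labeled_atom: index of the atom whose local property is modeled
    """
    nodes = list(atoms) + [label]          # condition (2): label as pseudoatom
    label_idx = len(nodes) - 1
    neighbors = {i: set() for i in range(len(nodes))}
    for i, j in list(bonds) + [(labeled_atom, label_idx)]:
        neighbors[i].add(j)
        neighbors[j].add(i)

    counts = Counter()

    def extend(path):
        if len(path) >= 2:
            counts["-".join(nodes[k] for k in path)] += 1
        if len(path) < max_len:
            for nxt in neighbors[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    # Condition (3): the label node has a single bond, so every chain that
    # contains it ends there; starting at the label visits each chain once.
    extend([label_idx])
    return counts

# Heavy-atom graph of acetic acid, with the hydroxyl oxygen (index 3) labeled
acid = labeled_chain_fragments(
    ["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)], labeled_atom=3
)
```

Because every retained fragment starts at the label, the descriptors describe only the environment of the ionizable site, which is what makes the resulting equations easier to interpret.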

As is known, the ionization constant correlates well with quantum-chemical descriptors [2–4]. In the present work, we calculated 12 descriptors describing intramolecular electronic properties of molecules, such as the energies of the highest occupied and lowest unoccupied molecular orbitals; the charge on the labeled atom; the maximal negative charge on an atom; the maximal charge on a hydrogen atom; the dipole moment of a molecule; the electron density of frontier orbitals; the electrophilic, nucleophilic, and radical superdelocalization; and the self-polarizability of the atom. The most significant descriptors for each QSPR model were selected using a stepwise linear regression (SLR) procedure. The stability of all models for each database was checked by predicting the pKa values for the compounds of the test sample.

Table 1. Individual QSPR models for phenols, carboxylic acids, and nitrogen-containing compounds

Class (n_training/n_test)               Descriptors                      Method  D   R²      s     F    rms_training  rms_test
Phenols (157/17)                        fragmental                       MLR     20  0.9746  0.40  252  0.38          0.57
Phenols (157/17)                        fragmental                       ANN     20  0.9815  –     –    0.32          0.53
Phenols (157/17)                        fragmental + quantum-chemical    MLR     27  0.9794  0.36  220  0.33          0.41
Phenols (157/17)                        fragmental + quantum-chemical    ANN     27  0.9831  –     –    0.30          0.42
Carboxylic acids (215/23)               fragmental                       MLR     25  0.8966  0.33  66   0.31          0.51
Carboxylic acids (215/23)               fragmental                       ANN     25  0.9115  –     –    0.28          0.48
Carboxylic acids (215/23)               fragmental + quantum-chemical    MLR     32  0.9122  0.31  59   0.28          0.34
Carboxylic acids (215/23)               fragmental + quantum-chemical    ANN     32  0.9534  –     –    0.21          0.27
Nitrogen-containing compounds (242/26)  fragmental                       MLR     25  0.9302  0.99  115  0.93          1.14
Nitrogen-containing compounds (242/26)  fragmental                       ANN     25  0.9306  –     –    0.93          1.13
Nitrogen-containing compounds (242/26)  fragmental + quantum-chemical    MLR     32  0.9611  0.75  161  0.69          0.94
Nitrogen-containing compounds (242/26)  fragmental + quantum-chemical    ANN     32  0.9692  –     –    0.62          0.60

Note: D is the number of descriptors used for constructing a model; R² is the squared correlation coefficient; rms_training and rms_test are the root-mean-square errors for the training and test samples, respectively; n_training/n_test is the number of compounds in the training/test set; s is the standard deviation; and F is the Fisher criterion. A dash indicates that no value is reported for the ANN models.

Fig. 1. QSPR models constructed by the ANN method for nitrogen-containing compounds with the use of (a) fragmental descriptors and (b) fragmental and quantum-chemical descriptors. Here and in Fig. 2, (1) is the training set and (2) is the test set. [Plots of pKa (calc) versus pKa (lit).]

Fig. 2. QSPR model constructed by the ANN method for the complete database with the use of 100 descriptors. [Plot of pKa (calc) versus pKa (lit).]

At the first stage, individual QSPR models for phenols, carboxylic acids, and nitrogen-containing compounds were constructed by the multiple linear regression (MLR) and ANN methods using both fragmental descriptors calculated for the compounds with labels and fragmental and quantum-chemical descriptors. Statistical parameters for these models are listed in Table 1, and Fig. 1 shows the dependences constructed with the use of fragmental and quantum-chemical descriptors for nitrogen-containing compounds.

Analysis of the resulting QSPR models shows that, in nearly all cases, the structure–property models constructed by the ANN method are characterized by somewhat higher correlation coefficients and give an equal or smaller error for both the training and test sets. It should be noted that the use of quantum-chemical descriptors improves the statistical parameters of the models as compared with the results obtained with the use of only fragmental descriptors. As follows from Table 1, the best statistical parameters were obtained by the ANN method for phenols. However, in this case, statistical parameters of the models constructed with the use of only 20 fragmental descriptors (R² = 0.9815, rms_training = 0.32, rms_test = 0.53) and with the use of seven additional quantum-chemical descriptors (R² = 0.9831, rms_training = 0.30, rms_test = 0.41) are close to each other. As distinct from phenols, the use of quantum-chemical descriptors for modeling the pKa values of carboxylic acids and nitrogen-containing compounds clearly improves statistical parameters.
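The two statistics compared throughout this analysis can be computed as sketched below. The sketch assumes that the rms error is the root of the mean squared residual over n compounds, with no correction for the number of descriptors, and that R² is the squared Pearson correlation between literature and calculated values; the paper does not give the exact formulas, and the sample numbers are placeholders.

```python
import math

def rms_error(observed, predicted):
    """Root-mean-square difference between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def squared_correlation(observed, predicted):
    """Squared Pearson correlation coefficient between two sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)

# Placeholder values, for illustration only: a constant offset of one unit
lit = [1.0, 2.0, 3.0]
calc = [2.0, 3.0, 4.0]
example_rms = rms_error(lit, calc)
example_r2 = squared_correlation(lit, calc)
```

The placeholder data show why both statistics are reported: a systematic offset leaves the correlation perfect while the rms error is a full unit.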

Inasmuch as the proposed approach turned out to be efficient for modeling three separate databases, we attempted to construct a structure–property model for the complete database. To check the predictability of the QSPR models, the complete database was also divided into the training (609 compounds) and test (67 compounds) data sets. Different numbers of descriptors were used for modeling the database. First, 226 fragmental descriptors and seven quantum-chemical descriptors were selected by the SLR method, and, then, their number was gradually decreased (Table 2). Such an approach seems to be efficient for revealing the optimal number of descriptors, which makes it possible to obtain a more stable predictive QSPR model.

Analysis of our findings showed that the structure–property model constructed using 96 fragmental and four quantum-chemical descriptors (Fig. 2) is optimal. As follows from Table 2, going from the model based on 200 descriptors to the model based on 100 descriptors leads to an insignificant decrease in the correlation coefficient, while the rms value for the training set remains unaltered and the rms value for the test set is improved. Upon further decrease in the number of descriptors to 50, a close correlation coefficient is retained, but the rms values for both the test and training sets sharply increase.

Thus, the quantitative structure–property relationship model was constructed for ionization constants of organic compounds from different families. Our findings demonstrate the applicability of the fragmental approach and artificial neural networks to modeling of this property and the possibility of using the structure–property models obtained for predicting the ionization constants of phenols, carboxylic acids, and nitrogen-containing compounds.

Table 2. Statistical parameters of QSPR models constructed using fragmental and quantum-chemical descriptors

Dfr/Dqc   R²       rms_training   rms_test
226/7     0.9938   0.34           0.40
194/6     0.9931   0.36           0.55
96/4      0.9862   0.36           0.46
46/4      0.9832   0.56           0.64
13/2      0.9658   0.78           0.92

Note: Dfr is the number of fragmental descriptors, and Dqc is the number of quantum-chemical descriptors.

REFERENCES

1. Matoga, M., Laborde-Kummer, E., Langlois, M.H., Dallet, P., Bose, J.J., and Jarry, C., J. Chromatogr., 2003, vol. 984, pp. 253–260.

2. Tehan, B.G., Lloyd, E.J., Wong, M.G., Pitt, W.R., Montana, J.R., Manallack, D.T., and Gancia, E., QSAR, 2002, vol. 21, pp. 457–472.

3. Tehan, B.G., Lloyd, E.J., Wong, M.G., Pitt, W.R., Gancia, E., and Manallack, D.T., QSAR, 2002, vol. 21, pp. 473–485.

4. Gross, K.C. and Seybold, P.G., J. Org. Chem., 2001, vol. 66, pp. 6919–6925.

5. Liptak, M.D., Gross, K.C., Seybold, P.G., Feldgus, S., and Shields, G.C., J. Am. Chem. Soc., 2002, vol. 124, pp. 6421–6427.

6. Liptak, M.D. and Shields, G.C., J. Am. Chem. Soc., 2001, vol. 123, pp. 7314–7319.

7. Gargallo, R., Sotriffer, C.A., Liedl, K.R., and Rode, B.M., J. Comput. Aided Mol. Des., 1999, vol. 13, pp. 611–623.

8. Zefirov, N.S., Palyulin, V.A., Oliferenko, A.A., Ivanova, A.A., and Ivanov, A.A., Dokl. Chem., 2001, vol. 381, nos. 4–6, pp. 356–358 [Dokl. Akad. Nauk, 2001, vol. 381, pp. 637–639].

9. Zefirov, N.S. and Palyulin, V.A., J. Chem. Inf. Comput. Sci., 2002, vol. 42, pp. 1112–1122.

10. Ivanova, A.A., Ivanov, A.A., Oliferenko, A.A., Palyulin, V.A., and Zefirov, N.S., SAR QSAR Environ. Res., 2005, vol. 16, pp. 231–246.

11. Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Vestn. Mosk. Univ., Ser. 2: Khim., 1999, vol. 40, pp. 323–326.

12. Ivanova, A.A., Palyulin, V.A., Zefirov, A.N., and Zefirov, N.S., Zh. Org. Khim., 2004, vol. 40, pp. 675–680.

13. Halberstam, N.M., Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Usp. Khim., 2003, vol. 72, pp. 706–727.

14. Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Abstracts of Papers, Mezhvuz. konf. “Molekulyarnye grafy v khimicheskikh issledovaniyakh” (Intercollegiate Conference “Molecular Graphs in Chemical Research”), Kalinin, 1990, p. 5.

15. Baskin, I.I., Halberstam, N.M., Artemenko, N.V., Palyulin, V.A., and Zefirov, N.S., in EuroQSAR 2002 Designing Drugs and Crop Protectants: Processes, Problems, and Solutions, Ford, M. et al., Eds., Melbourne: Blackwell, 2003, pp. 260–263.