An alternative approach for the use of water solubility of nonionic pesticides in the modeling of the soil sorption coefficients

Ralpho Rinaldo dos Reis a,*, Silvio César Sampaio a, Eduardo Borges de Melo b

a Graduate Program in Agricultural Engineering (PGEAGRI), Western Paraná State University (Universidade Estadual do Oeste do Paraná, UNIOESTE), Cascavel, Paraná, Brazil

b Theoretical Medicinal and Environmental Chemistry Laboratory (LQMAT), UNIOESTE, Cascavel, Paraná, Brazil

* Corresponding author. Dept of Pharmacy, UNIOESTE, 2069 Universitária st, 85819-110 Cascavel, PR, Brazil. Tel.: +55 45 32203256. E-mail addresses: [email protected], [email protected] (R.R. dos Reis).

Water Research 53 (2014) 191-199. Available online at www.sciencedirect.com. Journal homepage: www.elsevier.com/locate/watres. ISSN 0043-1354. http://dx.doi.org/10.1016/j.watres.2014.01.023. © 2014 Elsevier Ltd. All rights reserved.

Article history: Received 31 August 2013; received in revised form 6 January 2014; accepted 11 January 2014; available online 23 January 2014.

Keywords: Environmental risk; Hydrophobic parameter; Partition coefficient; Soil contamination; QSPR

Abstract

The data needed to study the damage caused by pesticides to the environment and its ecosystems are slow and costly to acquire. Large incentives have been established to encourage research projects aimed at building mathematical models for predicting physical, chemical or biological properties of environmental interest. The organic carbon normalized soil sorption coefficient (Koc) is an important physicochemical property used in environmental risk assessments for compounds released into the environment. Many models for predicting logKoc that use the parameters logP or logS as descriptors have been published in recent decades. The strong correlation between these properties (logP and logS) prevents them from being used together in multiple linear regressions. Because the sorption of a chemical compound in soil depends on both its water solubility and its water/organic matter partitioning, we assume that models capable of combining these two properties can generate more realistic results. Therefore, the objective of this study was to propose an alternative approach for modeling logKoc, using a simple descriptor of solubility, here designated as the logarithm of solubility corrected by octanol/water partitioning (logSP). Different models were built with this descriptor and with the conventional descriptors logP and logS, alone or associated with other explanatory variables representing easy-to-interpret physicochemical properties. The obtained models were validated according to current recommendations in the literature, and they were compared with other previously published models. The results showed that the use of logSP instead of the conventional descriptors led to simple models with greater statistical quality and predictive power than other more complex models found in the literature. Therefore, logSP can be a good alternative to consider for the modeling of logKoc and other properties that relate to both solubility and water/organic matter partitioning.


Nomenclature

ARE: Average relative error

CCC: Concordance correlation coefficient

F: F statistic

k: Slope of the regression line without intercept

Koc: Organic carbon normalized soil sorption coefficient

LNO: Leave-N-out validation

LOO: Leave-one-out validation

MAXDP: Maximum positive difference between the intrinsic states of the atoms of a molecule

MLR: Multiple linear regression

Mv: Mean van der Waals volume of the atoms of a molecule

P: Octanol/water partitioning coefficient

PCR: Principal component regression

PLS: Partial least squares regression

PRESScv: Residual error sum of squares of the cross-validation

PRESSev: Residual error sum of squares of the external validation

Q: Total solute quantity (mol)

Q²LNO: Coefficient of determination of the leave-N-out validation

Q²LOO: Coefficient of determination of the leave-one-out validation

QO: Solute quantity in the organic phase (mol)

QSPR: Quantitative Structure-Property Relationships

QW: Solute quantity in the aqueous phase (mol)

R²: Coefficient of determination

R²₀: Coefficient of determination of the regression line without intercept

r²m: Modified coefficient of determination of external validation

R²PRED: Coefficient of determination of the external validation

RSS: Residual sum of squares

S: Solubility in water (mol·L⁻¹)

SEC: Standard error of calibration

SECV: Standard error of the cross-validation

SEP: Standard error of the prediction

SO: Solubility in the organic phase (mol·L⁻¹)

SP: Solubility in water corrected by the octanol/water partition (mol·L⁻¹)

SW: Solubility in the aqueous phase (mol·L⁻¹)

V: Volume (L)

VAR: Variation (topological descriptor)

VO: Volume of organic phase (L)

VW: Volume of aqueous phase (L)


1. Introduction

Pesticides, or agrochemicals, are substances that are widely used in agriculture to increase productivity, improve crop quality and reduce labor and energy costs. However, only a small portion of the pesticides applied to crops reaches its intended target; the remainder tends to transfer into different environmental compartments and can thus contaminate surface water and groundwater (Arias-Estévez et al., 2008).

In recent years, some segments of society have expressed concern over the final destination of these substances and their potential environmental and health risks (Mackay and Webster, 2003). Therefore, studying the physical, chemical and biological properties of these chemicals is essential for planning recovery efforts in contaminated areas and for preserving sites that are not yet contaminated.

The organic carbon normalized soil sorption coefficient (Koc) is an important physicochemical property that can be used to determine the final destination of chemicals released to the environment (Doucette, 2003; Huuskonen, 2003). This coefficient is expressed as the ratio between the concentration of the chemical compound that is sorbed by the soil (mg·kg⁻¹) and its concentration in the surrounding aqueous phase (mg·L⁻¹), divided by the organic carbon content of the soil (g organic carbon/g dry soil). The smaller the Koc value of a compound, the greater its mobility in the soil and the greater its potential for groundwater contamination (Dearden, 2002; Arias-Estévez et al., 2008).
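The definition of Koc given above can be illustrated with a minimal Python sketch. The function name `koc_from_batch` and the numerical values are hypothetical and serve only as an illustration; they are not data from this study.

```python
import math


def koc_from_batch(c_sorbed_mg_kg, c_water_mg_l, f_oc):
    """Return (Kd, Koc) for a single equilibrium measurement.

    c_sorbed_mg_kg: concentration sorbed by the soil (mg/kg)
    c_water_mg_l:   concentration in the aqueous phase (mg/L)
    f_oc:           organic carbon content (g organic carbon / g dry soil)
    """
    if c_water_mg_l <= 0:
        raise ValueError("aqueous concentration must be positive")
    if not 0 < f_oc <= 1:
        raise ValueError("f_oc must lie in (0, 1]")
    kd = c_sorbed_mg_kg / c_water_mg_l  # soil/water distribution ratio (L/kg)
    koc = kd / f_oc                     # normalization by organic carbon content
    return kd, koc


if __name__ == "__main__":
    # Hypothetical measurement, chosen only for illustration.
    kd, koc = koc_from_batch(12.0, 2.0, 0.02)
    print(f"Kd = {kd:.1f} L/kg, Koc = {koc:.0f} L/kg, logKoc = {math.log10(koc):.2f}")
```

Models in this field are usually fitted to logKoc rather than to Koc itself, which is why the example also prints the base-10 logarithm.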

The Koc coefficient of a chemical compound can be determined experimentally by the batch equilibrium method (ASTM, 2001) or by the soil column method (Lee et al., 1991). The extent of the sorption of a chemical compound is influenced by its molecular structure, but it also depends on factors related to the intrinsic characteristics of the soil, such as particle size, porosity, chemical composition, pH and organic matter content (Dearden, 2002). The organic carbon content is a key factor that affects the sorption of hydrophobic compounds. Therefore, the hydrophobicity of these compounds is the driving force during soil sorption (Wen et al., 2012). Hydrophobic interactions are the primary mechanisms underlying the sorption of nonionic compounds, but for compounds with higher polarities, hydrophilic interactions are more relevant. Therefore, the interactions of clay and silica with polar functional groups become important for polar compounds. A more detailed description of the interactions involved in the sorption of organic compounds in soil can be found in the works of von Oepen et al. (1991) and Allen-King et al. (2002).

The need for an updated and reliable database of Koc values has motivated the proposal and publication of a considerable number of models for the prediction of this parameter (Gramatica et al., 2000). Several studies on quantitative structure-property relationships (QSPR) based on the molecular structures of chemical compounds have become available during the last few decades. Various descriptors have been used for the modeling of Koc, such as physicochemical properties, chromatographic parameters, steric factors, linear solvation energy relationships (LSER), and topological and quantum parameters (Gawlik et al., 1997; Doucette, 2003; Huuskonen, 2003; Nguyen et al., 2005; Gramatica et al., 2007; Razzaque and Grathwohl, 2008; Goudarzi et al., 2009; Wen et al., 2012). However, the vast majority of QSPR models that have been proposed were not properly validated. The lack of validation of a predictive model precludes its application because its predictive power is unknown (Kiralj and Ferreira, 2009; Chirico and Gramatica, 2011, 2012; Roy et al., 2012).

Numerous models have been proposed based on the relationship between logKoc and the octanol/water partitioning coefficient (logP) or water solubility (logS) (Razzaque and Grathwohl, 2008; Wen et al., 2012; dos Reis et al., 2013). Several studies in the literature discuss the positive and negative aspects of these models. These studies reveal that the majority of these models were constructed with small datasets or were designed for specific classes of compounds (Gawlik et al., 1997; Dearden, 2002; Doucette, 2003). The predictive power of a model is affected by the number and structural diversity of the compounds used in its construction. A model that has been calibrated for a particular class of compounds will not be suitable for the estimation of structurally different compounds (dos Reis et al., 2013). As the calibration dataset becomes larger and more diverse, the chemical space covered by the model increases. This dependence explains the potential limitations in the predictive power of these models.

A strong correlation exists between logP and logS. The literature presents numerous relationships between these two variables (Hansch et al., 1968; Wang and Hou, 2011). This correlation can be investigated at the molecular level because the electronic characteristics of substances may control the affinity of molecules for aqueous or organic phases. Thus, the partition process for the aqueous phase may be charge-controlled, whereas the partition process for the organic phase may be polarizability-controlled (Rogers and Cammarata, 1969). Hydrophobicity and hydrophilicity may be determined by quantum descriptors that can provide an electronic representation of the molecules (Cammarata and Rogers, 1971). The molecular descriptors that favor logP hinder logS. This trend is confirmed by the study of Gao and Cao (2008), in which logS and logP prediction models were obtained from the same molecular descriptors.

The strong correlation between logP and logS indicates that models for logKoc based on these parameters may be considered equivalent (Wen et al., 2012). This correlation also means that these two descriptors cannot be used together in an equation obtained by multiple linear regression (MLR). Instead, multivariate projection methods, such as principal component regression (PCR) or partial least squares regression (PLS), can be used (Livingstone, 2003).

Models using only logP as a descriptor are incapable of differentiating compounds that possess equal logP values but different logS values. Similarly, models that use only logS do not distinguish between compounds that have identical logS values but different logP values. The sorption of a chemical compound in soil is affected by the water/organic matter partitioning process (i.e., the higher the affinity of the substance for organic matter, the greater its sorption in the soil) and by the water solubility of the compound (i.e., the greater the solubility, the greater the leaching of the compound). Therefore, it is expected that models capable of combining information from partitioning and solubility would generate more realistic results.

In addition to the use of these two classical descriptors, this study proposes an alternative approach: the use of a simple solubility parameter that is generated by the mathematical combination of logP and logS. This parameter is designated as the logarithm of the solubility corrected by octanol/water partitioning (logSP). Therefore, models for logKoc based on all three descriptors (i.e., logP, logS and logSP) were constructed. The obtained models were then tested and validated according to the recommendations of experts in the field (Kiralj and Ferreira, 2009; Chirico and Gramatica, 2011, 2012; Roy et al., 2012) to ensure that they are reliable and can be used for accurate predictions.

2. Material and methods

2.1. Experimental data for Koc

Experimental data on logKoc for 163 nonionic organic pesticides were obtained from the literature (Sabljic et al., 1995; Gramatica et al., 2000). These data were divided into two sets: a training set (143 compounds) and a test set (20 compounds). The training set was used to construct the models, and the test set was employed in the external validation of the models. The experimental logKoc values are included in the Supplementary Material (Table S1). These datasets were used in the studies of Gramatica et al. (2000), Huuskonen (2003) and Duchowicz et al. (2007). The compounds in the training set belong to different pesticide classes (six acetonitriles, 29 carbamates, eight dinitroanilines, eight organochlorines, 28 organophosphates, 44 phenylureas, 13 triazines, seven di- and triazoles); their logKoc values ranged from 0.42 to 5.31. The logKoc values for the test set ranged from 0.56 to 4.50. Because these datasets have been used in other studies, our results can be compared with previously reported findings (Gramatica et al., 2000; Huuskonen, 2003; Duchowicz et al., 2007).

2.2. Descriptors of the QSPR models

This study proposes the use of an alternative and simple solubility parameter as a descriptor for modeling logKoc. This parameter, here referred to as the logarithm of solubility corrected by octanol/water partitioning, corresponds to the logarithm of the solubility (mol·L⁻¹) of a given chemical compound in the aqueous phase of a biphasic octanol/water system obtained from the saturated aqueous solution of the substance.

This property mathematically combines the logS and logP descriptors. Thus, this descriptor may be more suitable for the modeling of properties that simultaneously involve solubility and water/organic matter partitioning processes. To calculate the value of this alternative descriptor, we begin with an initial volume V (L) of a saturated aqueous solution of concentration S (mol·L⁻¹) for a given chemical compound. If an equal volume of n-octanol (V = VO, where VO is the octanol volume) is added and the biphasic system is stirred until equilibrium is established between the amounts of solute present in the aqueous phase and in the organic phase, then the total solute quantity Q (mol) is equivalent to the sum of the quantities of solute present in the two phases (Q = QW + QO). Because the system is biphasic and the volumes of water and n-octanol are equivalent (V = VO = VW, where VW is the aqueous volume), we obtain the following equations:

$$\frac{Q}{V} = \frac{Q_W}{V} + \frac{Q_O}{V} = \frac{Q_W}{V_W} + \frac{Q_O}{V_O} \tag{1}$$

$$S = \frac{Q}{V} \tag{2a}$$

$$S_W = \frac{Q_W}{V_W} \tag{2b}$$

$$S_O = \frac{Q_O}{V_O} \tag{2c}$$

Thus:

$$S = S_W + S_O \tag{3}$$

In this equation, SW and SO correspond to the solubilities (mol·L⁻¹) in the aqueous and organic phases, respectively. The octanol/water partition coefficient P is defined as follows:

$$P = \frac{S_O}{S_W} \tag{4}$$

Combining Equations (3) and (4) yields

$$S = S_W + S_O = S_W + P \cdot S_W = S_W \cdot (1 + P) \tag{5}$$

Therefore, the solubility (mol·L⁻¹) of the compound in the aqueous phase is defined as

$$S_W = \frac{S}{P + 1} \tag{6}$$

To avoid ambiguities with the other previously reported solubility descriptors, SW was renamed SP and expressed as follows:

$$\log S_P = \log S - \log(P + 1) \tag{7}$$

Thus, logSP values are calculated from the values of logP and logS using Equation (7). The logP and logS values were obtained by entering the Simplified Molecular Input Line Entry Specification (SMILES) of each compound into the program ALOGPS 2.1, which is available at http://www.vcclab.org/lab/alogps (Tetko et al., 2001). The decision to use algorithms to calculate the values of logP (AlogPs) and logS (AlogpS) was based on the lack of reliable experimental data for all compounds used in this study (dos Reis et al., 2013).
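Equation (7) can be evaluated directly from logP and logS. The following Python sketch is only an illustration: the function name `log_sp` and the example values are hypothetical, and the inputs are assumed to be base-10 logarithms with S in mol·L⁻¹.

```python
import math


def log_sp(log_s, log_p):
    """Equation (7): logSP = logS - log10(P + 1), with P = 10**logP."""
    p = 10.0 ** log_p
    return log_s - math.log10(p + 1.0)


if __name__ == "__main__":
    # Hypothetical compound: logS = -3.0 (mol/L) and logP = 2.0.
    print(round(log_sp(-3.0, 2.0), 4))
```

For strongly hydrophobic compounds (P much larger than 1) the descriptor approaches logS - logP, whereas for very hydrophilic compounds (P much smaller than 1) it approaches logS.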

Other useful descriptors for this work (constitutional descriptors, topological descriptors, molecular descriptors and indices of rings and connectivity) were calculated using the program Dragon 6.0. All values of the descriptors used in the models are presented in Table S1 (Supplementary Material).

2.3. Construction and interpretation of the QSPR models

The models were obtained by linear regression using the programs Minitab 15 and QSAR Modeling. Three models were obtained with a single explanatory variable (logP, logS or logSP), whereas three models with four variables were constructed by the addition of simple descriptors to the previous models. Because the variables in QSPR typically exhibit different numerical ranges, auto-scaling of the data (i.e., mean-centering the variables and scaling them by their variance) can be required prior to the regression analysis of the MLR models (Kiralj and Ferreira, 2009; Martins and Ferreira, 2013).
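The auto-scaling step mentioned above can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the procedure implemented in the programs used by the authors: the function name `autoscale` is hypothetical, and the sketch follows the common chemometric convention of dividing each mean-centered variable by its sample standard deviation so that it has unit variance.

```python
from statistics import mean, stdev


def autoscale(values):
    """Mean-center one descriptor column and scale it to unit variance."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("a constant descriptor cannot be auto-scaled")
    return [(v - m) / s for v in values]


if __name__ == "__main__":
    # Hypothetical descriptor values, for illustration only.
    print(autoscale([1.0, 2.0, 3.0]))
```

After this transformation the regression coefficients of different descriptors become directly comparable, which is the practical reason for applying it before MLR.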

To construct the models that contain four descriptors, the stepwise variable selection method (forward and backward) was employed. Only classes of descriptors with simple physicochemical interpretations were considered. Descriptors whose Pearson linear correlation coefficient with another explanatory variable was higher than 0.6 were excluded.
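The exclusion rule based on the Pearson coefficient can be sketched as follows. The greedy ordering used here is a simplifying assumption and does not reproduce the stepwise procedure of the statistical software; the names `pearson_r` and `drop_collinear` and the descriptor values are hypothetical.

```python
import math
from statistics import mean


def pearson_r(a, b):
    """Pearson linear correlation coefficient between two equally long lists."""
    ma, mb = mean(a), mean(b)
    s_ab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    s_aa = sum((u - ma) ** 2 for u in a)
    s_bb = sum((v - mb) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def drop_collinear(candidates, threshold=0.6):
    """Keep a descriptor only if |r| with every descriptor already kept is <= threshold.

    candidates: dict mapping descriptor name -> list of values, in priority order.
    """
    kept = []
    for name, values in candidates.items():
        if all(abs(pearson_r(values, candidates[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical descriptor columns, for illustration only.
    data = {
        "logSP": [1.0, 2.0, 3.0, 4.0],
        "nearly_duplicate": [2.0, 4.0, 6.0, 8.5],
        "independent": [1.0, -1.0, -1.0, 1.0],
    }
    print(drop_collinear(data))
```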

The Pearson linear correlation coefficients of the selected descriptors with each other and with the variable logKoc are shown in the Supplementary Material (Table S2). These coefficients demonstrate that the collinearity between the explanatory variables was controlled. The requirement of noncollinearity between the descriptors is essential for obtaining statistically reliable models. Conversely, as the correlation of a descriptor with the response variable increases, the contribution of this parameter to modeling the investigated property also increases. Although the interpretation of a QSPR model is an important and necessary step, it is not always feasible. Understanding how each descriptor influences the response variable enables the development and interpretation of a mechanism that may underlie the process under study (de Melo, 2012).

2.4. Statistical quality of the QSPR models

The goodness of fit of a model was evaluated using the coefficient of determination (R²), the residual sum of squares (RSS) and the standard error of calibration (SEC). For fitted QSPR models, recent studies have recommended that R² values should be greater than 0.7 and that values of RSS and SEC should be close to 0 (Chirico and Gramatica, 2011, 2012). The significance of the models was evaluated using the F test at the 5% level (Kiralj and Ferreira, 2009).
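The fit statistics named above can be sketched for a model with a single descriptor. The formulas used by the authors are given in their Supplementary Material and are not reproduced here, so the expressions below are assumptions: SEC = sqrt(RSS/(n - p - 1)) and F = (R²/p)/((1 - R²)/(n - p - 1)), which agree with the values of Table 2 to within rounding. Function names and data are hypothetical.

```python
import math
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def fit_statistics(x, y):
    """R2, RSS, SEC and F for a one-descriptor model (p = 1)."""
    n, p = len(y), 1
    b0, b1 = fit_simple_ols(x, y)
    rss = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    my = mean(y)
    tss = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - rss / tss
    sec = math.sqrt(rss / (n - p - 1))        # assumed definition of SEC
    f = (r2 / p) / ((1.0 - r2) / (n - p - 1))  # F statistic of the regression
    return {"R2": r2, "RSS": rss, "SEC": sec, "F": f}


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    print(fit_statistics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0]))
```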

Leave-one-out (LOO) cross-validation was assessed by calculating the sum of squares of the differences between the observed and the predicted values (PRESScv), the coefficient of determination of the leave-one-out cross-validation (Q²LOO) and the standard error of cross-validation (SECV). For a model to be approved, SECV and PRESScv should be close to 0, Q²LOO > 0.6 and |R² - Q²LOO| < 0.1 (Chirico and Gramatica, 2011, 2012).
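A minimal sketch of the LOO statistics for a one-descriptor model is given below. It assumes Q²LOO = 1 - PRESScv/TSS and SECV = sqrt(PRESScv/n); the latter is an assumption that reproduces the SECV values of Table 2 from the listed PRESScv values, since the authors' formulas are only given in the Supplementary Material. Function names and data are hypothetical.

```python
import math
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def loo_statistics(x, y):
    """Return (Q2_LOO, SECV, PRESS_cv) for a one-descriptor model."""
    x, y = list(x), list(y)
    n = len(y)
    press = 0.0
    for i in range(n):
        # Refit the model without compound i, then predict compound i.
        b0, b1 = fit_simple_ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    my = mean(y)
    tss = sum((v - my) ** 2 for v in y)
    return 1.0 - press / tss, math.sqrt(press / n), press


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    print(loo_statistics([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0]))
```

Because every prediction is made by a model that has not seen the predicted compound, PRESScv is never smaller than RSS, and Q²LOO is never larger than R².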

The leave-N-out (LNO) cross-validation, which is a robustness test, is achieved by the systematic removal of a maximum of N elements of the training set. In this study, N ranged from one to 40 with 12 replicates for each N value. For a model to be considered robust, the average value of Q²LNO should be close to the value of Q²LOO (Kiralj and Ferreira, 2009).
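One possible way to organize an LNO test is sketched below. The exact scheme used by the QSAR Modeling program is not described in the text, so the procedure here is an assumption: in each replicate the compound order is shuffled, consecutive blocks of at most N compounds are left out in turn, and Q²LNO is computed from the accumulated prediction errors. The function names, the seed and the data are hypothetical.

```python
import random
from statistics import mean


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope


def q2_lno(x, y, n_out, replicates=12, seed=0):
    """Average Q2 over several shuffled leave-N-out runs (one descriptor)."""
    rng = random.Random(seed)
    n = len(y)
    my = mean(y)
    tss = sum((v - my) ** 2 for v in y)
    q2_values = []
    for _ in range(replicates):
        order = list(range(n))
        rng.shuffle(order)
        press = 0.0
        for start in range(0, n, n_out):
            held = order[start:start + n_out]  # block of at most N compounds
            held_set = set(held)
            x_tr = [x[i] for i in range(n) if i not in held_set]
            y_tr = [y[i] for i in range(n) if i not in held_set]
            b0, b1 = fit_simple_ols(x_tr, y_tr)
            press += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held)
        q2_values.append(1.0 - press / tss)
    return mean(q2_values)


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    x = [1, 2, 3, 4, 5]
    y = [1.1, 1.9, 3.2, 3.8, 5.0]
    for n_out in (1, 2):
        print(n_out, round(q2_lno(x, y, n_out), 4))
```

With N = 1 the scheme reduces to LOO regardless of the shuffling, which provides a simple consistency check for an implementation.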

Chance correlation can be checked with the y-randomization test. The models are recalculated after randomization of the vector Y; a significant deterioration in these new regressions is expected. This procedure was repeated 50 times, and two graphs were constructed in which the parameters R² and Q²LOO (y axis) were plotted as a function of the correlation coefficient R(Yo,Yr) (x axis) between the original and randomized Y values. To rule out a chance correlation, the intercepts of these graphs should be less than 0.3 for R² and less than 0.05 for Q²LOO (Eriksson et al., 2003).
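The y-randomization test can be sketched as follows for the R² graph of a one-descriptor model. Several details are assumptions made for this illustration: the absolute value of R(Yo,Yr) is used on the x axis, only the randomized models enter the regression line, and the Q²LOO graph is omitted for brevity. The function names, the seed and the data are hypothetical.

```python
import math
import random
from statistics import mean


def pearson_r(a, b):
    """Pearson linear correlation coefficient between two equally long lists."""
    ma, mb = mean(a), mean(b)
    s_ab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    s_aa = sum((u - ma) ** 2 for u in a)
    s_bb = sum((v - mb) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def y_randomization(x, y, runs=50, seed=0):
    """Return (intercept, points), where points are (|R(Yo,Yr)|, R2) pairs."""
    rng = random.Random(seed)
    points = []
    for _ in range(runs):
        y_perm = list(y)
        rng.shuffle(y_perm)  # randomized response vector
        r_yo_yr = abs(pearson_r(y, y_perm))
        r2_random = pearson_r(x, y_perm) ** 2  # R2 of a one-descriptor model
        points.append((r_yo_yr, r2_random))
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)
    slope = sum((u - mx) * (v - my) for u, v in zip(xs, ys)) / sum(
        (u - mx) ** 2 for u in xs
    )
    return my - slope * mx, points  # intercept of R2 versus |R(Yo,Yr)|


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    x = [float(i) for i in range(1, 11)]
    y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.2, 8.8, 10.1]
    intercept, _ = y_randomization(x, y)
    print("intercept of R2 vs |R(Yo,Yr)|:", round(intercept, 3))
```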

Table 1 - Regression models for the prediction of logKoc for nonionic pesticides (n = 143).

Model(a)  Equation
A1        logKoc = 0.945 + 0.560 logP
A2        logKoc = 0.670 - 0.568 logS
A3        logKoc = 0.702 - 0.299 logSP
B1        logKoc = -1.46 + 0.386 logP + 4.45 Mv + 0.0152 VAR - 0.124 MAXDP
B2        logKoc = -0.834 - 0.435 logS + 3.05 Mv + 0.00909 VAR - 0.0850 MAXDP
B3        logKoc = -1.09 - 0.227 logSP + 3.48 Mv + 0.00987 VAR - 0.0884 MAXDP

(a) "A" designates models with one explanatory variable; "B" designates models with four explanatory variables.

Table 3 - LNO cross-validation and random correlation test.

Model  Robustness:     Random correlation:            Random correlation:
       average Q²LNO   intercept of R² vs R(Yo,Yr)    intercept of Q²LOO vs R(Yo,Yr)
A1     0.734           -0.040                         -0.070
A2     0.781           -0.041                         -0.070
A3     0.798           -0.046                         -0.077
B1     0.834           -0.023                         -0.096
B2     0.803           -0.028                         -0.105
B3     0.839           -0.024                         -0.100

The external validation of the obtained models was performed by predicting logKoc values for the compounds of the test set. Thus, the predictive power was assessed using the coefficient of determination of external validation (R²PRED), the modified coefficient of determination of external validation (r²m) and the Golbraikh-Tropsha method. The R²PRED values should be higher than 0.6 (Chirico and Gramatica, 2011, 2012). The average r²m value should be higher than 0.5, and |Δr²m| should be less than 0.2 (Ojha et al., 2011). According to the Golbraikh-Tropsha method, the slopes (k and k') of the regression lines that pass through the origin (observed values versus predicted values and predicted values versus observed values) should range from 0.85 to 1.15, and |R²₀ - R'²₀| should be less than 0.3 (Golbraikh and Tropsha, 2002; Golbraikh et al., 2003; Tropsha et al., 2003). The values of the standard error of prediction (SEP), the prediction residual error sum of squares for the external validation (PRESSev) and the average relative error (ARE) were also determined; these values should be as small as possible. In addition to these classical statistical parameters, we also employed the concordance correlation coefficient (CCC) proposed by Lin (1989). This parameter provides a measure of agreement between the calculated and experimental values. Recent studies recommend that the CCC should be greater than 0.85 for a model to be adopted (Chirico and Gramatica, 2011, 2012).
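Some of the external validation statistics described above can be sketched in Python. The definitions below are assumptions, because the authors' formulas are only given in their Supplementary Material: R²PRED is referenced to the mean of the training set, SEP = sqrt(PRESSev/n) (which is consistent with the SEP and PRESSev values of Table 4), CCC follows Lin (1989), and k and k' are the slopes of the regressions through the origin. The r²m metrics and ARE are omitted. The function name and the data are hypothetical.

```python
import math
from statistics import mean


def external_validation(y_obs, y_pred, y_train_mean):
    """R2_pred, PRESS_ev, SEP, CCC and the slopes k and k' for a test set."""
    n = len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_ref = sum((o - y_train_mean) ** 2 for o in y_obs)
    mo, mp = mean(y_obs), mean(y_pred)
    s_oo = sum((o - mo) ** 2 for o in y_obs)
    s_pp = sum((p - mp) ** 2 for p in y_pred)
    s_op = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    sum_op = sum(o * p for o, p in zip(y_obs, y_pred))
    return {
        "R2_pred": 1.0 - press / ss_ref,
        "PRESS_ev": press,
        "SEP": math.sqrt(press / n),
        # Concordance correlation coefficient (Lin, 1989).
        "CCC": 2.0 * s_op / (s_oo + s_pp + n * (mo - mp) ** 2),
        # Slope of observed versus predicted values through the origin.
        "k": sum_op / sum(p * p for p in y_pred),
        # Slope of predicted versus observed values through the origin.
        "k_prime": sum_op / sum(o * o for o in y_obs),
    }


if __name__ == "__main__":
    # Hypothetical test-set values, for illustration only.
    print(external_validation([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], 2.5))
```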

Residual plots were analyzed to verify that the residuals exhibited normal and zero-mean distributions. The formulas used to calculate the previously mentioned statistical parameters are shown in the Supplementary Material.

Thus, recent validation criteria, which are more stringent than the validation criteria commonly used in QSPR studies (Kiralj and Ferreira, 2009; Chirico and Gramatica, 2011, 2012; Roy et al., 2012), were employed in this study. This option was selected to ensure the statistical quality of the models obtained using the logSP descriptor and to demonstrate the viability of the approach. After the validation procedures were completed, the best models were compared to the models reported in the literature (Gramatica et al., 2000; Huuskonen, 2003; Duchowicz et al., 2007).

Table 2 - Statistical parameters of the logKoc prediction models.

Model  R²     SEC    RSS    F       Q²LOO  SECV   PRESScv
A1     0.744  0.447  28.14  410.04  0.735  0.451  29.15
A2     0.788  0.407  23.35  523.02  0.782  0.410  23.99
A3     0.804  0.391  21.50  580.04  0.798  0.394  22.20
B1     0.848  0.348  16.71  192.62  0.835  0.356  18.11
B2     0.821  0.378  19.72  157.93  0.805  0.387  21.44
B3     0.852  0.343  16.24  199.14  0.840  0.351  17.62

3. Results and discussion

3.1. Models for predicting logKoc and interpretation of the descriptors

The models for predicting logKoc are displayed in Table 1. The sorption of nonionic organic compounds in soil involves a mechanism in which hydrophobicity is the driving force (Wen et al., 2012; dos Reis et al., 2013). Thus, molecular descriptors that estimate or affect the hydrophobicity of a substance are relevant to the sorption process. This relevance explains the high correlation of logP, logS and logSP with logKoc (Table S2 in the Supplementary Material). In the models with four variables, the significance of other descriptors was also noted, including a constitutional descriptor (Mv) and two topological descriptors (VAR and MAXDP).

The constitutional descriptor Mv, which is the mean van der Waals volume of the atoms that comprise the molecule (scaled according to the volume of the carbon atom), is in agreement with the expected mechanism of hydrophobic interaction because this measure of volume is associated with the capacity of each atom to interact with the organic matter of the soil. An atom with a greater volume has a greater contact surface.

The topological descriptors are variables that describe different structural properties of a molecule, such as the degree of branching, size, flexibility or centricity (de Melo, 2012). The VAR descriptor (variation) is a topological distance parameter that emphasizes the strong dependence of the sorption process on the size and symmetry of the molecules.

Fig. 1 - LNO cross-validation plots of the (a) A3 and (b) B3 models.

Fig. 2 - Plots of the observed values as a function of the logKoc predicted values (95% confidence interval) for the external validation set of the (a) A3 and (b) B3 models.

The descriptor MAXDP (maximum positive difference between the intrinsic states of the atoms of a molecule) is a topological index proposed by Gramatica et al. (2000) that was deduced from the Kier-Hall atomic electronegativities. Thus, MAXDP is related to molecule electrophilicity; it highlights the importance of electrophilic properties in hindering the sorption process in soil and thus favoring pesticide leaching.
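As a usage sketch, the single-descriptor equations of Table 1 (models A1, A2 and A3) can be applied to a compound with known logP and logS. The coefficients are copied from Table 1; the function names and the input values are hypothetical, and the four-descriptor models are left out because they additionally require the Dragon descriptors Mv, VAR and MAXDP.

```python
import math

# (intercept, slope) of the one-descriptor models, copied from Table 1.
MODELS = {
    "A1": (0.945, 0.560),   # descriptor: logP
    "A2": (0.670, -0.568),  # descriptor: logS
    "A3": (0.702, -0.299),  # descriptor: logSP
}


def log_sp(log_s, log_p):
    """Equation (7): logSP = logS - log10(P + 1), with P = 10**logP."""
    return log_s - math.log10(10.0 ** log_p + 1.0)


def predict_log_koc(log_p, log_s):
    """Predicted logKoc from models A1, A2 and A3."""
    descriptors = {"A1": log_p, "A2": log_s, "A3": log_sp(log_s, log_p)}
    return {m: b0 + b1 * descriptors[m] for m, (b0, b1) in MODELS.items()}


if __name__ == "__main__":
    # Hypothetical compound with logP = 3.0 and logS = -4.0 (mol/L).
    for model, value in predict_log_koc(3.0, -4.0).items():
        print(model, round(value, 3))
```

Comparing the three predictions for the same compound shows how A3 responds to both inputs, whereas A1 and A2 each ignore one of them.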

3.2. Statistical analysis

The statistical parameters of all models are shown in Table 2. All models with a single independent variable (A1, A2 and A3) were approved with regard to goodness of fit because their R² values were higher than 0.7. However, it is important to note that the A3 model, which used the new descriptor logSP, showed a better fit and greater significance than the A1 and A2 models, which used the classical descriptors logP and logS. A similar trend was also observed for the models with four variables. These findings suggest that the models based on logSP fit the experimental data better than the other models.

The SEC values should be as small as possible within the limits imposed by the experimental error of the measured response variables. Lohninger (1994) demonstrated the limits of the accuracy of the experimental measurements of logKoc. In a study performed with a set of pesticides that contained a minimum of four different logKoc values for each compound, the average standard deviation of these measurements was 0.44 log units. The modeling of a dependent variable is limited by the quality of the available experimental data. Thus, obtaining prediction models with errors significantly smaller than the experimental errors is not an appropriate goal (dos Reis et al., 2013). Note that all of the models we constructed exhibited SEC values that were compatible with the experimental error. Therefore, they were considered to be reliable for estimating logKoc values.

Table 4 - Statistical data for the external validation (dataset, n = 20).

Model  R²PRED  SEP    PRESSev  ARE (%)  CCC   Average r²m  |Δr²m|  k     k'    |R²₀ - R'²₀|
A1     0.668   0.536  5.75     16.1     0.85  0.622        0.127   0.97  0.99  0.050
A2     0.708   0.503  5.06     24.2     0.84  0.614        0.157   0.96  1.00  0.064
A3     0.723   0.490  4.80     19.9     0.86  0.654        0.004   0.96  1.00  0.001
B1     0.710   0.501  5.03     19.0     0.87  0.678        0.076   0.94  1.03  0.023
B2     0.733   0.481  4.62     23.4     0.86  0.658        0.181   0.95  1.02  0.063
B3     0.743   0.472  4.45     20.5     0.87  0.688        0.028   0.95  1.02  -0.008

Fig. 3 - Histogram of residuals distribution for the (a) A3 and (b) B3 models.

All models were validated by LOO cross-validation because they possessed Q²LOO values higher than 0.6 and |R² - Q²LOO| < 0.1. Thus, based on the data shown in Table 2 and the above discussion regarding the goodness of fit and the LOO cross-validation, the models were arranged in descending order of statistical quality, considering the number of descriptors in the equations:

A3 > A2 > A1 (models with 1 descriptor)

B3 > B1 > B2 (models with 4 descriptors)
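As an illustration of the two LOO acceptance checks above (Q²LOO > 0.6 and |R² - Q²LOO| < 0.1), the following Python sketch computes R² and Q²LOO for an ordinary least-squares model. It is not the QSAR Modeling software used in this study: the function name `r2_and_q2_loo` and the synthetic data are assumptions made for the example, and Q²LOO is taken here as 1 - PRESS/TSS, a common definition that this section does not spell out.

```python
import numpy as np

def r2_and_q2_loo(X, y):
    """Return (R2, Q2_LOO) for an ordinary least-squares fit with intercept."""
    y = np.asarray(y, dtype=float)
    # Design matrix: a column of ones (intercept) followed by the descriptors.
    A = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    # Leverages h_ii, i.e. the diagonal of the hat matrix A A^+.
    leverage = np.einsum("ij,ji->i", A, np.linalg.pinv(A))
    # For OLS the leave-one-out residual equals e_i / (1 - h_ii),
    # so the model does not have to be refitted n times.
    loo_resid = resid / (1.0 - leverage)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    q2 = 1.0 - np.sum(loo_resid ** 2) / tss  # 1 - PRESS / TSS (assumed definition)
    return r2, q2

# Synthetic single-descriptor example (hypothetical data, not the pesticide dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 3.0 - 1.0 * x + rng.normal(scale=0.3, size=60)

r2, q2 = r2_and_q2_loo(x, y)
accepted = (q2 > 0.6) and (abs(r2 - q2) < 0.1)
print(f"R2 = {r2:.3f}, Q2_LOO = {q2:.3f}, accepted = {accepted}")
```

The leverage shortcut is exact only for linear least squares; for other regression methods each compound would have to be left out and the model refitted explicitly.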

The LNO cross-validation and random correlation tests are also important tools for verifying the internal quality of the models. Data from the LNO cross-validation showed that all models had average Q²LNO values similar to their respective Q²LOO values (Table 3). Plots of the LNO validation for the A3 and B3 models (both based on the descriptor logSP), shown in Fig. 1, indicated that the models were stable and resistant to small variations in the training data. The plots of the other models are available in the Supplementary Material and led to the same conclusion. However, the A3 model showed the best results in the robustness test among the models with a single descriptor, and the B3 model was the best among the models with 4 variables. Therefore, the models based on logSP also outperformed the other models in the LNO validation test.

Table 5. Comparison of logKoc models.

Model                     No. of variables   R²      SEC
A3 model                  1                  0.804   0.391
B3 model                  4                  0.852   0.343
Gramatica et al. (2000)   6                  0.843   0.350
Huuskonen (2003)          12                 0.82    0.37
Duchowicz et al. (2007)   6                  0.90    0.29

Random correlation was evaluated by the y-randomization test. The results shown in Table 3 indicate the absence of chance correlation in all of the models obtained, as the intercepts of the R² and Q²LOO plots were within the limits (less than 0.3 and 0.05, respectively) proposed by Eriksson et al. (2003).
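The y-randomization check described above can be sketched in Python as follows. This is a minimal illustration rather than the procedure implemented in the software used by the authors: it assumes that the R² values of models refitted to permuted responses are regressed against the absolute correlation between the permuted and original responses (with the original model included as the point at correlation 1), and that the intercept of this line is compared with the 0.3 limit. The function names and the synthetic data are hypothetical, and the analogous Q²LOO intercept is omitted for brevity.

```python
import numpy as np

def r2_ols(X, y):
    """R2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def y_randomization_intercept(X, y, n_perm=50, seed=0):
    """Intercept of the line R2 = f(|r(y, y_permuted)|)."""
    rng = np.random.default_rng(seed)
    corr = [1.0]          # the original response correlates perfectly with itself
    r2 = [r2_ols(X, y)]   # R2 of the real model
    for _ in range(n_perm):
        y_perm = rng.permutation(y)  # shuffle responses, keep descriptors fixed
        corr.append(abs(np.corrcoef(y, y_perm)[0, 1]))
        r2.append(r2_ols(X, y_perm))
    slope, intercept = np.polyfit(corr, r2, 1)
    return intercept

# Synthetic single-descriptor example (hypothetical data).
rng = np.random.default_rng(1)
x = rng.normal(size=60)
y = 3.0 - 1.0 * x + rng.normal(scale=0.3, size=60)

intercept_r2 = y_randomization_intercept(x, y)
print(f"R2 intercept = {intercept_r2:.3f} (limit: < 0.3)")
```

A low intercept indicates that models built on scrambled responses explain almost nothing, so the fit of the real model is unlikely to be due to chance.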

Results of the external validation (Table 4) show that all models were accepted because their R²PRED values were all greater than 0.6. The SEP values were consistent with the experimental error associated with the logKoc measurements (0.44 log units). Thus, from the data in Table 4, the models were arranged in decreasing order of predictive power, considering the number of descriptors in the equations:

A3 > A2 > A1 (models with 1 descriptor)

B3 > B2 > B1 (models with 4 descriptors)

Regarding predictive power, the A3 model was again the best among the single-descriptor models. Similarly, the B3 model was better than the other 4-variable models.
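The relationship between the PRESSev and SEP columns of Table 4 can be checked with a short calculation. The sketch below assumes that SEP = sqrt(PRESSev / n) with n = 20 external compounds; this definition is not stated in the section, but it reproduces the tabulated SEP values. The dictionary and variable names are illustrative only.

```python
import math

N_EXT = 20         # size of the external validation set (Table 4 caption)
EXP_ERROR = 0.44   # average experimental standard deviation of logKoc (log units)

# (PRESS_ev, reported SEP) as tabulated in Table 4.
table4 = {
    "A1": (5.75, 0.536),
    "A2": (5.06, 0.503),
    "A3": (4.80, 0.490),
    "B1": (5.03, 0.501),
    "B2": (4.62, 0.481),
    "B3": (4.45, 0.472),
}

sep_check = {}
for model, (press_ev, sep_reported) in table4.items():
    # Assumed definition: root of the mean squared prediction error.
    sep = math.sqrt(press_ev / N_EXT)
    sep_check[model] = sep
    print(f"{model}: SEP = {sep:.3f} (reported {sep_reported:.3f}), "
          f"SEP / experimental error = {sep / EXP_ERROR:.2f}")
```

Under this assumption the recomputed values agree with the tabulated SEP to within rounding, and all of them lie somewhat above the 0.44 log-unit experimental error, so none of the models claims more precision than the measurements allow.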

The CCC data (Table 4) suggested that the A2 model should be rejected because its value was less than 0.85, whereas the other models were validated.
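For reference, the concordance correlation coefficient (Lin, 1989) used in the check above can be computed as in the following Python sketch. The function name and the example values are hypothetical; the formula is the usual form with sums of squares and the squared difference of the means, which penalizes both scatter and systematic offset from the 1:1 line.

```python
import numpy as np

def concordance_correlation(y_obs, y_pred):
    """Concordance correlation coefficient between observed and predicted values."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_obs, mean_pred = y_obs.mean(), y_pred.mean()
    # Twice the cross-product of the centered values.
    numerator = 2.0 * np.sum((y_obs - mean_obs) * (y_pred - mean_pred))
    # Sums of squares of both variables plus the penalty for a shift in the means.
    denominator = (
        np.sum((y_obs - mean_obs) ** 2)
        + np.sum((y_pred - mean_pred) ** 2)
        + len(y_obs) * (mean_obs - mean_pred) ** 2
    )
    return numerator / denominator

# Hypothetical logKoc values, used only to show the call.
observed = [1.6, 2.1, 2.8, 3.5, 4.0]
predicted = [1.9, 2.0, 3.1, 3.3, 3.8]
ccc = concordance_correlation(observed, predicted)
print(f"CCC = {ccc:.3f}, passes the 0.85 threshold: {ccc >= 0.85}")
```

Unlike the ordinary correlation coefficient, the CCC drops below 1 when predictions are perfectly correlated with the observations but shifted by a constant.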

An analysis of the k, k' and |R²0 - R'²0| parameters (Golbraikh-Tropsha method) showed that all models were within the proposed limits (Tropsha et al., 2003). However, the A3 model had the best results among the models with a single descriptor, and the B3 model was again the best among the models with 4 descriptors. The same trend was observed for the parameters r²m and |Δr²m|. Fig. 2 shows the plots of the observed values as a function of the predicted logKoc values for the A3 and B3 models.

When the results were analyzed according to the more stringent r²m criterion (average r²m > 0.65), the A1 and A2 models were rejected (Chirico and Gramatica, 2012). Thus, the solubility descriptor logSP proved more appropriate than the conventional descriptors logP and logS for modeling the logKoc of nonionic pesticides with 1 or 4 explanatory variables.
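The external-validation decisions discussed above can be summarized by applying the three thresholds quoted in this section (R²PRED > 0.6, CCC not less than 0.85, average r²m > 0.65) to the values in Table 4. The Python sketch below is only a bookkeeping aid; the function and variable names are hypothetical, and the rows of Table 4 are assumed to be in the order A1 to B3.

```python
# External-validation statistics from Table 4: (R2_pred, CCC, average r2m).
table4 = {
    "A1": (0.668, 0.85, 0.622),
    "A2": (0.708, 0.84, 0.614),
    "A3": (0.723, 0.86, 0.654),
    "B1": (0.710, 0.87, 0.678),
    "B2": (0.733, 0.86, 0.658),
    "B3": (0.743, 0.87, 0.688),
}

def failed_criteria(r2_pred, ccc, avg_r2m):
    """Return the names of the thresholds that a model does not meet."""
    failed = []
    if not r2_pred > 0.6:
        failed.append("R2_pred")
    if ccc < 0.85:               # the text rejects values less than 0.85
        failed.append("CCC")
    if not avg_r2m > 0.65:       # stricter criterion of Chirico and Gramatica (2012)
        failed.append("average r2m")
    return failed

screening = {model: failed_criteria(*stats) for model, stats in table4.items()}
passing = [model for model, failed in screening.items() if not failed]

for model, failed in screening.items():
    print(model, "passes all checks" if not failed else f"fails: {', '.join(failed)}")
```

With these values, A2 fails the CCC check and both A1 and A2 fail the stricter r²m check, so both logSP-based models (A3 and B3) are among those that meet all three thresholds.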

Histograms of the residuals of the A3 and B3 models are shown in Fig. 3. Residuals analysis indicated that they had a normal distribution and a mean of 0. Histograms for the other models are available in the Supplementary Material.

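Table 5 (continued)

Model                     Q²LOO   R²PRED   ARE (%)   CCC     Average r²m   |Δr²m|
A3 model                  0.798   0.723    19.9      0.86    0.654         0.004
B3 model                  0.840   0.743    20.5      0.87    0.688         0.028
Gramatica et al. (2000)   0.824   0.670    27.4a     0.80a   0.539a        0.245a
Huuskonen (2003)          0.79    0.79     20.0a     0.87a   0.654a        0.190a
Duchowicz et al. (2007)   0.89    0.71     21.7a     0.83a   0.594a        0.134a

a Calculated by the authors of the present study, based on the experimental and predicted values available in the literature.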


The proposed descriptor logSP proved to be useful for modeling logKoc because models with better fits and predictive powers were generated when the information regarding logP and logS was combined than when only the conventional descriptors were used. Therefore, logSP can be a good alternative to consider for the modeling of properties that relate to both solubility and water/organic matter partitioning.

3.3. Comparison with QSPR models in the literature

The A3 and B3 models were compared with previously reported models that utilized the same dataset. The major statistical parameters of these models and their respective predictive powers are summarized in Table 5. Models A3 and B3 fit the experimental data well and had good predictive ability, similar to the models found in the literature. All models presented in Table 5 possessed a sufficient degree of fit and had SEC values compatible with the experimental error of logKoc. However, while the A3 and B3 models were validated according to all the external validation criteria recommended for QSPR studies, the models of Gramatica et al. (2000) and Duchowicz et al. (2007) were rejected due to their CCC values (less than 0.85). These models were also rejected when we applied the more rigorous criterion for r²m (values greater than 0.65) proposed by Chirico and Gramatica (2012). Thus, among the models from the literature, only the model of Huuskonen (2003) satisfied the stricter external validation criteria that are currently recommended.

It should be noted that the B3 model (our best model) outperformed the model of Huuskonen (the best model from the literature) in terms of fit. The external validation parameters of the two models had similar values. However, while Huuskonen's model was built with 12 explanatory variables, the B3 model was much simpler and was constructed with only 4 descriptors. We also note that these variables are in agreement with the mechanisms of soil sorption of nonionic compounds. Thus, the B3 model proposed in this paper is not only reliable in terms of its statistical quality and predictive power but also satisfies the requirement of greater simplicity.

4. Conclusions

We have demonstrated that the solubility descriptor logSP is a suitable alternative for logKoc modeling because all models that we obtained were validated according to statistical criteria that were stricter than the criteria typically used in QSPR studies. In addition, all selected descriptors are consistent with the mechanisms of the sorption process in soil for nonionic organic compounds.

The primary objective of a QSPR study is to obtain models that combine, when possible, mathematical simplicity, ease of physicochemical interpretation, high statistical quality and effective predictive power. The models that we built based on the solubility descriptor logSP possess these qualities.

However, some caution is necessary because this approach cannot necessarily be generalized to all QSPR studies. It should be tested case by case, and the property to be modeled must depend on both the water solubility and the water/organic matter partitioning of the compound under study.

Acknowledgments

The authors thank the MCT/CNPq/Fundacao Araucaria (www.fundacaoaraucaria.org.br) and the National Council for Scientific and Technological Development (CNPq, www.cnpq.br) for providing the financial support for this study.

In addition, the authors thank the Laboratory for Theoretical and Applied Chemometrics at State University of Campinas (LQTA/UNICAMP) for providing the QSAR Modeling software.

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.watres.2014.01.023.

References

Allen-King, R.M., Grathwohl, P., Ball, W.P., 2002. New modeling paradigms for the sorption of hydrophobic organic chemicals to heterogeneous carbonaceous matter in soils, sediments, and rocks. Adv. Water Resour. 25 (8-12), 985-1016.

Arias-Estevez, M., Lopez-Periago, E., Martinez-Carballo, E., Simal-Gandara, J., Mejuto, J.C., Garcia-Rio, L., 2008. The mobility and degradation of pesticides in soils and the pollution of groundwater resources. Agric. Ecosyst. Environ. 123 (4), 247-260.

ASTM, 2001. Standard Test Method 24-Hour Batch-type Measurement Containment Sorption by Soils and Sediments. American Society for Testing and Materials, Philadelphia, Pennsylvania.

Cammarata, A., Rogers, K.S., 1971. Electronic representation of the lipophilic parameter π. J. Med. Chem. 14 (4), 269-274.

Chirico, N., Gramatica, P., 2011. Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J. Chem. Inf. Model. 51 (9), 2320-2335.

Chirico, N., Gramatica, P., 2012. Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection. J. Chem. Inf. Model. 52 (8), 2044-2058.

Dearden, J.C., 2002. Prediction of environmental toxicity and fate using quantitative structure-activity relationships (QSARs). J. Braz. Chem. Soc. 13 (6), 754-762.

de Melo, E.B., 2012. A new quantitative structure-property relationship model to predict bioconcentration factors of polychlorinated biphenyls (PCBs) in fishes using E-state index and topological descriptors. Ecotoxicol. Environ. Saf. 75 (1), 213-222.

dos Reis, R.R., Sampaio, S.C., de Melo, E.B., 2013. The effect of different logP algorithms on the modeling of the soil sorption coefficient of nonionic pesticides. Water Res. 47 (15), 5751-5759.

Doucette, W.J., 2003. Quantitative structure-activity relationships for predicting soil-sediment sorption coefficients for organic chemicals. Environ. Toxicol. Chem. 22 (8), 1771-1788.

Dragon 6.0, 2011. Talete srl. Milano, Italy.

Duchowicz, P.R., Gonzalez, M.P., Helguera, A.M., Cordeiro, M.N.D.S., Castro, E.A., 2007. Application of the replacement method as novel variable selection in QSPR. 2. Soil sorption coefficients. Chemom. Intell. Lab. Syst. 88 (2), 197-203.

Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M., Gramatica, P., 2003. Methods for reliability and uncertainty assessment and for applicability evaluations of classification and regression-based QSARs. Environ. Health Perspect. 111 (10), 1361-1375.

Gao, S., Cao, C., 2008. A new approach on estimation of solubility and n-octanol/water partition coefficient for organohalogen compounds. Int. J. Mol. Sci. 9 (6), 962-977.

Gawlik, B.M., Sotiriou, N., Feicht, E.A., Schulte-Hostede, S., Kettrup, A., 1997. Alternatives for the determination of the soil adsorption coefficient, KOC, of non-ionic organic compounds - a review. Chemosphere 34 (12), 2525-2551.

Golbraikh, A., Shen, M., Xiao, Z., Xiao, Y., Lee, K., Tropsha, A., 2003. Rational selection of training and test set for the development of validated QSAR models. J. Comput. Aided Mol. Des. 17 (2-4), 241-253.

Golbraikh, A., Tropsha, A., 2002. Beware of q2! J. Mol. Graph. Model. 20 (4), 269-276.

Goudarzi, N., Goodarzi, M., Araujo, M.C., Galvao, R.K., 2009. QSPR modeling of soil sorption coefficients (KOC) of pesticides using SPA-ANN and SPA-MLR. J. Agric. Food Chem. 57 (15), 7153-7158.

Gramatica, P., Corradi, M., Consonni, V., 2000. Modelling and prediction of soil sorption coefficients of non-ionic organic pesticides by molecular descriptors. Chemosphere 41 (5), 763-777.

Gramatica, P., Giani, E., Papa, E., 2007. Statistical external validation and consensus modeling: a QSPR case study for Koc prediction. J. Mol. Graph. Model. 25 (6), 755-766.

Hansch, C., Quinlan, J.E., Lawrence, G.L., 1968. Linear free-energy relationship between partition coefficients and the aqueous solubility of organic liquids. J. Org. Chem. 33 (1), 347-350.

Huuskonen, J., 2003. Prediction of soil sorption coefficient of organic pesticides from the atom-type electrotopological state indices. Environ. Toxicol. Chem. 22 (4), 816-820.

Kiralj, R., Ferreira, M.M.C., 2009. Basic validation procedures for regression models in QSAR and QSPR studies: theory and application. J. Braz. Chem. Soc. 20 (4), 770-787.

Lee, L.S., Rao, P.S.C., Brusseau, M.L., 1991. Nonequilibrium sorption and transport of neutral and ionized chlorophenols. Environ. Sci. Technol. 25 (4), 722-729.

Lin, L., 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45 (1), 255-268.

Livingstone, D.J., 2003. Quantitative structure-activity relationships. In: King, F.D. (Ed.), Medicinal Chemistry: Principles and Practice, second ed. Royal Society of Chemistry, Cambridge.

Lohninger, H., 1994. Estimation of soil partition coefficients of pesticides from their chemical structure. Chemosphere 29 (8), 1611-1626.

Mackay, D., Webster, E., 2003. A perspective on environmental models and QSARs. SAR QSAR Environ. Res. 14 (1), 7-16.

Martins, J.P.A., Ferreira, M.M.C., 2013. QSAR modeling: a new open source computational package to generate and validate QSAR models. Quím. Nova 36 (4), 554-560.

Minitab Statistical Software 15, 2007. Minitab Inc., State College, Pennsylvania, USA.

Nguyen, T.H., Goss, K.U., Ball, P.W., 2005. Polyparameter linear free energy relationships for estimating the equilibrium partition of organic compounds between water and the natural organic matter in soils and sediments. Environ. Sci. Technol. 39 (4), 913-924.

Ojha, P.K., Mitra, I., Das, R.N., Roy, K., 2011. Further exploring rm² metrics for validation of QSPR models. Chemom. Intell. Lab. Syst. 107 (1), 194-205.

QSAR Modeling, 2013. Theoretical and Applied Chemometrics Laboratory. State University of Campinas, Campinas, Brazil.

Razzaque, M.M., Grathwohl, P., 2008. Predicting organic carbon-water partitioning of hydrophobic organic chemicals in soils and sediments based on water solubility. Water Res. 42 (14), 3775-3780.

Rogers, K.S., Cammarata, A., 1969. A molecular orbital description of the partitioning of aromatic compounds between polar and nonpolar phases. Biochim. Biophys. Acta - Biomemb. 193 (1), 22-29.

Roy, K., Mitra, I., Kar, S., Ojha, P.K., Das, R.N., Kabir, H., 2012. Comparative studies on some metrics for external validation of QSPR models. J. Chem. Inf. Model. 52 (2), 396-408.

Sabljic, A., Güsten, H., Verhaar, H., Hermens, J., 1995. QSAR modeling of soil sorption. Improvements and systematics of log KOC vs log KOW correlations. Chemosphere 31 (11-12), 4489-4514.

Tetko, I.V., Tanchuk, V.Y., Villa, A.E., 2001. Prediction of n-octanol/water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices. J. Chem. Inf. Comput. Sci. 41 (5), 1407-1421.

Tropsha, A., Gramatica, P., Gombar, V.K., 2003. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22 (1), 69-77.

von Oepen, B., Kordel, W., Klein, W., 1991. Sorption of nonpolar and polar compounds to soil processes measurements and experience with the applicability of the modified OECD-guideline 106. Chemosphere 22 (3-4), 285-304.

Wang, J., Hou, T., 2011. Recent advances on aqueous solubility prediction. Comb. Chem. High. Throughput Screen. 14 (5), 328-338.

Wen, Y., Su, L.M., Qin, W.C., Fu, L., He, J., Zhao, Y.H., 2012. Linear and non-linear relationships between soil sorption and hydrophobicity: model, validation and influencing factors. Chemosphere 86 (6), 634-640.