Water Research 53 (2014) 191-199
Available online at www.sciencedirect.com
ScienceDirect
journal homepage: www.elsevier.com/locate/watres
An alternative approach for the use of water solubility of nonionic pesticides in the modeling of the soil sorption coefficients
Ralpho Rinaldo dos Reis a,*, Silvio Cesar Sampaio a, Eduardo Borges de Melo b
a Graduate Program in Agricultural Engineering (PGEAGRI), Western Parana State University (Universidade Estadual do Oeste do Parana, UNIOESTE), Cascavel, Parana, Brazil
b Theoretical Medicinal and Environmental Chemistry Laboratory (LQMAT), UNIOESTE, Cascavel, Parana, Brazil
Article info
Article history:
Received 31 August 2013
Received in revised form
6 January 2014
Accepted 11 January 2014
Available online 23 January 2014
Keywords:
Environmental risk
Hydrophobic parameter
Partition coefficient
Soil contamination
QSPR
Abstract
Collecting data on the damage caused by pesticides to the environment and its ecosystems is slow and costly. Large incentives have been established to encourage research projects aimed at building mathematical models for predicting physical, chemical or biological properties of environmental interest. The organic carbon normalized soil sorption coefficient (Koc) is an important physicochemical property used in environmental risk assessments for compounds released into the environment. Many models for predicting logKoc that have used the parameters logP or logS as descriptors have been published in recent decades. The strong correlation between these properties (logP and logS) prevents them from being used together in multiple linear regressions. Because the sorption of a chemical compound in soil depends on both its water solubility and its water/organic matter partitioning, we assume that models capable of combining these two properties can generate more realistic results. Therefore, the objective of this study was to propose an alternative approach for modeling logKoc, using a simple descriptor of solubility, here designated as the logarithm of solubility corrected by octanol/water partitioning (logSP). Thus, different models were built with this descriptor and with the conventional descriptors logP and logS, alone or associated with other explanatory variables representing easy-to-interpret physicochemical properties. The obtained models were validated according to current recommendations in the literature, and they were compared with other previously published models. The results showed that the use of logSP instead of conventional descriptors led to simple models with greater statistical quality and predictive power than other more complex models found in the literature. Therefore, logSP can be a good alternative to consider for the modeling of logKoc and other properties that relate to both solubility and water/organic matter partitioning.
© 2014 Elsevier Ltd. All rights reserved.
* Corresponding author. Dept of Pharmacy, UNIOESTE, 2069 Universitaria st, 85819-110 Cascavel, PR, Brazil. Tel.: +55 45 32203256.
E-mail addresses: [email protected], [email protected] (R.R. dos Reis).
0043-1354/$ - see front matter © 2014 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.watres.2014.01.023
Nomenclature
ARE Average relative error
CCC Concordance correlation coefficient
F F statistic
k Slope of the regression line without intercept
Koc Organic carbon normalized soil sorption coefficient
LNO Leave-N-out validation
LOO Leave-one-out validation
MAXDP Maximum positive difference between the intrinsic states of the atoms of a molecule
MLR Multiple linear regression
Mv Mean van der Waals volume of the atoms of a molecule
P Octanol/water partitioning coefficient
PCR Principal component regression
PLS Partial least squares regression
PRESScv Residual error sum of squares of the cross-validation
PRESSev Residual error sum of squares of the external validation
Q Total solute quantity (mol)
Q2LNO Coefficient of determination of the leave-N-out validation
Q2LOO Coefficient of determination of the leave-one-out validation
QO Solute quantity in the organic phase (mol)
QSPR Quantitative Structure-Property Relationships
QW Solute quantity in the aqueous phase (mol)
R2 Coefficient of determination
R20 Coefficient of determination of the regression line without intercept
r2m Modified coefficient of determination of external validation
R2PRED Coefficient of determination of the external validation
RSS Residual sum of squares
S Solubility in water (mol.L⁻¹)
SEC Standard error of calibration
SECV Standard error of the cross-validation
SEP Standard error of the prediction
SO Solubility in the organic phase (mol.L⁻¹)
SP Solubility in water corrected by the octanol/water partition (mol.L⁻¹)
SW Solubility in the aqueous phase (mol.L⁻¹)
V Volume (L)
VAR Variation (topological descriptor)
VO Volume of organic phase (L)
VW Volume of aqueous phase (L)
1. Introduction
Pesticides or agrochemicals are substances that are widely
used in agriculture to increase productivity, improve crop
quality and reduce labor and energy costs. However, only a
small portion of the pesticides applied to crops reaches its
intended target; the remainder tends to transfer into different
environmental compartments and can thus contaminate
surface water and groundwater (Arias-Estevez et al., 2008).
In recent years, some segments of society have expressed
concern over the final destination of these substances and
their potential environmental and health risks (Mackay and
Webster, 2003). Therefore, studying the physical, chemical
and biological properties of these chemicals is essential for
planning recovery efforts in contaminated areas and for pre-
serving sites that are not yet contaminated.
The organic carbon normalized soil sorption coefficient
(Koc) is an important physicochemical property that can be
used to determine the final destination of chemicals released
to the environment (Doucette, 2003; Huuskonen, 2003). This
coefficient is expressed as the ratio between the concentration of the chemical sorbed by the soil (mg.kg⁻¹) and its concentration in the surrounding aqueous phase (mg.L⁻¹), divided by the organic carbon content of the soil (g organic carbon/g dry soil). The smaller the value of Koc for a given compound, the greater its mobility in the soil and the greater its potential for groundwater contamination (Dearden, 2002; Arias-Estevez et al., 2008).
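The definition above can be turned into a short calculation. The following Python sketch assumes a single batch-equilibrium measurement; the function name `koc_from_batch` and the numerical inputs are hypothetical and only illustrate how the units combine.

```python
import math


def koc_from_batch(c_sorbed_mg_per_kg, c_water_mg_per_l, f_oc):
    """Return (Kd, Koc), both in L/kg.

    Kd  = sorbed concentration / aqueous concentration
    Koc = Kd / f_oc, where f_oc is the mass fraction of organic carbon
    (g organic carbon per g dry soil).
    """
    if c_water_mg_per_l <= 0:
        raise ValueError("aqueous concentration must be positive")
    if not 0 < f_oc <= 1:
        raise ValueError("f_oc must be a mass fraction in (0, 1]")
    kd = c_sorbed_mg_per_kg / c_water_mg_per_l
    return kd, kd / f_oc


# Hypothetical measurement, not taken from the dataset used in this study.
kd, koc = koc_from_batch(c_sorbed_mg_per_kg=12.0, c_water_mg_per_l=0.5, f_oc=0.02)
log_koc = math.log10(koc)
print(f"Kd = {kd:.1f} L/kg, Koc = {koc:.0f} L/kg, logKoc = {log_koc:.2f}")
```

With these illustrative inputs, Kd is 24 L/kg and Koc is 1200 L/kg, which corresponds to a logKoc of about 3.08.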
The Koc coefficient of a chemical compound can be deter-
mined experimentally by the batch equilibrium method
(ASTM, 2001) or by the soil column method (Lee et al., 1991).
The extent of the sorption of a chemical compound is influenced by its molecular structure, but it also depends on factors
related to the intrinsic characteristics of the soil, such as
particle size, porosity, chemical composition, pH and organic
matter content (Dearden, 2002). The organic carbon content is
a key factor that affects the sorption of hydrophobic com-
pounds. Therefore, the hydrophobicity of these compounds is
the driving force during soil sorption (Wen et al., 2012). Hy-
drophobic interactions are the primary mechanisms under-
lying the sorption of nonionic compounds, but for compounds
with higher polarities, hydrophilic interactions are more
relevant. Therefore, the interactions of clay and silica with
polar functional groups become important for polar com-
pounds. A more itemized description of the interactions
involved in the sorption of organic compounds in soil can be
found in the works of von Oepen et al. (1991) and Allen-King
et al. (2002).
The need for an updated and reliable database of Koc values
has motivated the proposals and publications of a consider-
able number of models for the prediction of this parameter
(Gramatica et al., 2000). Several studies on quantitative
structure-property relationships (QSPR) based on the molecular structures of chemical compounds have become available during the last few decades. Accordingly, various
descriptors have been used for the modeling of Koc, such as
physicochemical properties, chromatographic parameters,
steric factors, linear solvation energy relationships (LSER),
topological and quantum parameters (Gawlik et al., 1997;
Doucette, 2003; Huuskonen, 2003; Nguyen et al., 2005;
Gramatica et al., 2007; Razzaque and Grathwohl, 2008;
Goudarzi et al., 2009; Wen et al., 2012). However, the vast
majority of QSPR models that have been proposed were not
properly validated. The non-validation of a predictive model
precludes its application because its predictive power is un-
known (Kiralj and Ferreira, 2009; Chirico and Gramatica, 2011,
2012; Roy et al., 2012).
Numerous models have been proposed based on the
relationship between logKoc and the octanol/water parti-
tioning coefficient (logP) or water solubility (logS) (Razzaque
and Grathwohl, 2008; Wen et al., 2012; dos Reis et al., 2013).
Several studies in the literature discuss the positive and
negative aspects of these models. These studies reveal that
the majority of these models were constructed with small
datasets or were designed for specific classes of compounds
(Gawlik et al., 1997; Dearden, 2002; Doucette, 2003). The predictive power of a model is affected by the number and structural diversity of the compounds used in its construction. A model that has been calibrated for a particular class of compounds will not be suitable for the estimation of structurally different compounds (dos Reis et al., 2013). As the calibration dataset becomes larger and more diverse, the chemical space covered by the model expands. This dependence explains the potential limitations in the predictive power of these models.
A strong correlation exists between logP and logS. The
literature presents numerous relationships between these
two variables (Hansch et al., 1968; Wang and Hou, 2011). This
correlation can be investigated at the molecular level because
the electronic characteristics of substances may control the
affinity of molecules with aqueous or organic phases. Thus,
the partition process for the aqueous phase may be charge-
controlled, whereas the partition process for the organic
phase may be polarizability-controlled (Rogers and
Cammarata, 1969). The hydrophobicity and hydrophilicity
may be determined by quantum descriptors that can provide
an electronic representation of the molecules (Cammarata
and Rogers, 1971). Molecular descriptors that increase logP tend to decrease logS. This trend can be confirmed by the study of Gao
and Cao (2008), in which logS and logP prediction models
were obtained from the same molecular descriptors.
The strong correlation between logP and logS indicates that
models for logKoc based on these parameters may be consid-
ered equivalent (Wen et al., 2012). This correlation also means
that these two descriptors cannot be used together in an
equation obtained by multiple linear regression (MLR).
Instead, multivariate projection methods, such as principal
component regression (PCR) or partial least squares regression
(PLS), can be used (Livingstone, 2003).
Models using only logP as a descriptor are incapable of
differentiating compounds that possess equal logP values but
different logS values. Similarly, models that use only logS do
not distinguish between compounds that have identical logS
values but different logP values. The sorption of a chemical
compound in soil is affected by the water/organic matter
partitioning process (i.e., the higher the affinity of the sub-
stance for organic matter, the greater its sorption in the soil)
and by the water solubility of the compound (i.e., the greater
the solubility, the greater the leaching of the compound).
Therefore, models capable of combining information on partitioning and solubility are expected to generate more realistic results.
In addition to the use of these two classical descriptors,
this study proposes an alternative approach: the use of a
simple solubility parameter that is generated by the mathe-
matical combination of logP and logS. This parameter is
designated as the logarithm of the solubility corrected by
octanol/water partitioning (logSP). Therefore, models for log-
Koc based on all three descriptors (i.e., logP, logS and logSP)
were constructed. The obtained models were then tested and
validated according to the recommendations of experts in the
field (Kiralj and Ferreira, 2009; Chirico and Gramatica, 2011,
2012; Roy et al., 2012) to ensure that they are reliable and
can be used for accurate predictions.
2. Material and methods
2.1. Experimental data for Koc
Experimental data on logKoc for 163 nonionic organic pesti-
cides were obtained from the literature (Sabljic et al., 1995;
Gramatica et al., 2000). These data were divided into two
sets of data: a training set (143 compounds) and a test set (20
compounds). The training set was used to construct the
models, and the test set was employed in the external vali-
dation of the models. The experimental logKoc values are
included in the Supplementary Material (Table S1). These
datasets were used in the studies of Gramatica et al. (2000),
Huuskonen (2003) and Duchowicz et al. (2007). The com-
pounds in the training set belong to different pesticide classes
(six acetonitriles, 29 carbamates, eight dinitroanilines, eight
organochlorines, 28 organophosphates, 44 phenylureas, 13
triazines, seven di- and triazoles); their logKoc values ranged
from 0.42 to 5.31. The logKoc values for the test set ranged from
0.56 to 4.50. Because these datasets have been used in other
studies, our results can be compared with previously reported
findings (Gramatica et al., 2000; Huuskonen, 2003; Duchowicz
et al., 2007).
2.2. Descriptors of the QSPR models
This study proposes the use of an alternative and simple
solubility parameter as a descriptor for modeling logKoc. This
parameter, here referred to as the logarithm of solubility
corrected by octanol/water partitioning, corresponds to the
logarithm of solubility (mol.L⁻¹) for a given chemical compound in the aqueous phase of a biphasic octanol/water system obtained from the saturated aqueous solution of the substance.
This property mathematically combines the logS and logP descriptors. Thus, this descriptor may be more suitable for the modeling of properties that simultaneously involve solubility and water/organic matter partitioning processes. To calculate the value of this alternative descriptor, we begin with an initial volume V (L) of a saturated aqueous solution of concentration S (mol.L⁻¹) for a given chemical compound. If an equal volume of n-octanol (V = VO, where VO is the octanol volume) is added and the biphasic system is stirred until equilibrium is established between the amounts of solute present in the aqueous phase and in the organic phase, then the total solute quantity Q (mol) is equivalent to the sum of the quantities of solute present in the two phases (Q = QW + QO). Because the system is biphasic and the volumes of water and n-octanol are equivalent (V = VO = VW, where VW is the aqueous volume), we obtain the following equations:

$$\frac{Q}{V} = \frac{Q_W}{V} + \frac{Q_O}{V} = \frac{Q_W}{V_W} + \frac{Q_O}{V_O} \tag{1}$$

$$S = \frac{Q}{V} \tag{2a}$$

$$S_W = \frac{Q_W}{V_W} \tag{2b}$$

$$S_O = \frac{Q_O}{V_O} \tag{2c}$$

Thus:

$$S = S_W + S_O \tag{3}$$

In this equation, SW and SO correspond to the solubilities (mol.L⁻¹) in the aqueous and organic phases, respectively. The octanol/water partition coefficient P is defined as follows:

$$P = \frac{S_O}{S_W} \tag{4}$$

Combining Equations (3) and (4) yields

$$S = S_W + S_O = S_W + P \cdot S_W = S_W \cdot (1 + P) \tag{5}$$

Therefore, the solubility (mol.L⁻¹) of the compound in the aqueous phase is defined as

$$S_W = \frac{S}{P + 1} \tag{6}$$

To avoid ambiguities with the other previously reported solubility descriptors, SW was renamed SP and expressed as follows:

$$\log S_P = \log S - \log(P + 1) \tag{7}$$

Thus, logSP values are calculated from the values of logP and logS using Equation (7). LogP and logS values were obtained by entering the Simplified Molecular Input Line Entry Specification (SMILES) string of each compound into the program ALOGPS 2.1, which is available at http://www.vcclab.org/lab/alogps (Tetko et al., 2001). The decision to use algorithms to calculate the values of logP (AlogPs) and logS (AlogpS) was based on the lack of reliable experimental data for all compounds used in this study (dos Reis et al., 2013).
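Equation (7) is simple to evaluate once logP and logS are available. The Python sketch below is a minimal illustration; the function name `log_sp` and the input values are hypothetical and are not taken from Table S1.

```python
import math


def log_sp(log_s, log_p):
    """Equation (7): logSP = logS - log10(P + 1), with P = 10**logP."""
    return log_s - math.log10(10.0 ** log_p + 1.0)


# Hypothetical (logS, logP) pairs chosen only to show the behaviour.
examples = [(-3.0, 3.0), (-3.0, 1.0), (-1.0, 0.0)]
for log_s, log_p in examples:
    print(f"logS = {log_s:5.1f}, logP = {log_p:4.1f} -> logSP = {log_sp(log_s, log_p):.3f}")
```

The first two pairs share the same logS but differ in logP, so they receive different logSP values; this is the distinction that a model based on logS alone cannot make. For strongly hydrophobic compounds (P much larger than 1), logSP is approximately logS minus logP.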
Other useful descriptors for this work (constitutional de-
scriptors, topological descriptors, molecular descriptors and
indices of rings and connectivity) were calculated using the
program Dragon 6.0. All values of the descriptors used in the
models are presented in Table S1 (Supplementary Material).
2.3. Construction and interpretation of the QSPR models
The models were obtained by linear regression using the
programs Minitab 15 and QSAR Modeling. Three models were
obtained with a single explanatory variable (logP, logS or
logSP), whereas three models with four variables were con-
structed by the addition of simple descriptors to the previous
models. Because the variables in QSPR typically exhibit different numerical ranges, data auto-scaling (i.e., the variables were mean-centered and scaled to unit variance) can be required for the MLR models prior to the regression analysis (Kiralj and Ferreira, 2009; Martins and Ferreira, 2013).
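Auto-scaling as described above can be sketched in a few lines of Python. The sketch assumes the sample standard deviation (n - 1 denominator); the text does not state which convention the programs used, and the column of values is hypothetical.

```python
import statistics


def autoscale(column):
    """Mean-center a descriptor column and scale it to unit variance."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("a constant column cannot be auto-scaled")
    return [(value - mean) / sd for value in column]


# Hypothetical descriptor values.
log_p = [1.0, 2.0, 3.0, 4.0, 5.0]
scaled = autoscale(log_p)
print([round(v, 3) for v in scaled])
```

After scaling, each descriptor has a mean of 0 and a standard deviation of 1, so regression coefficients for descriptors with very different numerical ranges become directly comparable.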
To construct the models that contain four descriptors, the stepwise variable selection method (forward and backward) was employed. Only classes of descriptors with simple physicochemical interpretations were considered. Descriptors whose Pearson linear correlation coefficient with another explanatory variable was higher than 0.6 were excluded.
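The collinearity filter can be illustrated with a small Python sketch. It assumes a greedy rule in which descriptors are examined in a fixed order and a candidate is kept only if its absolute Pearson correlation with every descriptor already kept does not exceed 0.6; the text does not specify which member of a correlated pair was discarded, so this rule, the function names and the data are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def drop_collinear(descriptors, threshold=0.6):
    """Keep descriptors (in insertion order) that are not collinear with kept ones."""
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor values for five compounds.
descriptors = {
    "logSP": [-6.0, -4.5, -3.0, -5.2, -2.1],
    "logS": [-3.1, -2.4, -1.5, -2.9, -1.0],
    "Mv": [0.62, 0.70, 0.58, 0.66, 0.71],
}
kept = drop_collinear(descriptors)
print(kept)
```

In this toy example logS is strongly correlated with logSP and is therefore dropped, whereas Mv is retained.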
The Pearson’s linear correlation coefficients for the
selected descriptors with each other and with the variable
logKoc are shown in Supplementary Material (Table S2). These
coefficients demonstrated that the collinearity between the
explanatory variables was controlled. The requirement of
noncollinearity between the descriptors is essential for
obtaining statistically reliable models. Conversely, as the
correlation of a descriptor with the response variable in-
creases, the contribution of this parameter to modeling the
investigated property also increases. Although the interpre-
tation of a QSPR model is an important and necessary step,
interpretation is not always feasible. Understanding how each
descriptor influences the response variable enables the
development and interpretation of a mechanism that may
motivate the process under study (de Melo, 2012).
2.4. Statistical quality of the QSPR models
The degree of fit of a model was evaluated using the coefficient of determination (R2), the residual sum of squares (RSS) and the standard error of calibration (SEC). For fitted QSPR models, recent studies have recommended that R2 values should be greater than 0.7 and that values of RSS and SEC should be close to 0 (Chirico and Gramatica, 2011, 2012). The significance of
the models was evaluated using the F test at the 5% level
(Kiralj and Ferreira, 2009).
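The formulas for these parameters are given in the Supplementary Material and are not reproduced here. As a hedged illustration, the Python sketch below uses the usual textbook definitions, SEC = sqrt(RSS / (n - p - 1)) and F = (R2 / p) / ((1 - R2) / (n - p - 1)), where n is the number of training compounds and p the number of descriptors. These definitions are an assumption, although they reproduce the values reported in Table 2 to within rounding.

```python
import math


def sec_from_rss(rss, n, p):
    """Standard error of calibration from the residual sum of squares."""
    return math.sqrt(rss / (n - p - 1))


def f_from_r2(r2, n, p):
    """F statistic of the regression from the coefficient of determination."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


n_train = 143
# RSS and R2 values as reported in Table 2 for models A1 and B3.
sec_a1 = sec_from_rss(28.14, n_train, p=1)
sec_b3 = sec_from_rss(16.24, n_train, p=4)
f_a1 = f_from_r2(0.744, n_train, p=1)
print(f"SEC(A1) = {sec_a1:.3f}, SEC(B3) = {sec_b3:.3f}, F(A1) = {f_a1:.1f}")
```

The small difference between the recomputed F value and the tabulated one is expected, because the calculation starts from an R2 that has already been rounded to three decimals.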
Leave-one-out (LOO) cross-validation is performed by
calculating the sum of squares for the differences between the
observed and the predicted values (PRESScv), the coefficient of
determination of the leave-one-out cross-validation (Q2LOO)
and the standard error of cross-validation (SECV). For approval
of a model, the SECV and PRESScv should be close to 0, Q2LOO > 0.6 and |R2 − Q2LOO| < 0.1 (Chirico and Gramatica, 2011, 2012).
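The LOO procedure can be sketched as follows for a model with a single descriptor. This Python example assumes Q2LOO = 1 - PRESScv / SStot (with SStot computed around the mean of the full training set) and SECV = sqrt(PRESScv / n); the second expression is an assumption that is consistent with the PRESScv and SECV values in Table 2. The data and function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def r2_of_fit(x, y):
    """Coefficient of determination of the fitted line."""
    b0, b1 = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - rss / sum((yi - my) ** 2 for yi in y)


def loo_cv(x, y):
    """Return (PRESScv, Q2LOO, SECV) for a one-descriptor model."""
    n = len(x)
    press = 0.0
    for i in range(n):
        # Refit without compound i, then predict compound i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    my = sum(y) / n
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return press, 1.0 - press / ss_tot, math.sqrt(press / n)


# Hypothetical logSP and logKoc values for eight compounds.
log_sp = [-7.0, -6.0, -5.5, -4.0, -3.0, -2.5, -1.0, -0.5]
log_koc = [2.9, 2.4, 2.5, 1.8, 1.7, 1.3, 1.1, 0.8]
press, q2_loo, secv = loo_cv(log_sp, log_koc)
r2_fit = r2_of_fit(log_sp, log_koc)
accepted = q2_loo > 0.6 and abs(r2_fit - q2_loo) < 0.1
print(f"R2 = {r2_fit:.3f}, Q2LOO = {q2_loo:.3f}, SECV = {secv:.3f}, accepted = {accepted}")
```

The last line applies the two acceptance criteria quoted above. Refitting the model n times is affordable here; for larger problems the same PRESS can be obtained from the leverages of a single fit.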
The leave-N-out (LNO) cross-validation, which is a
robustness test, is achieved by the systematic removal of a
maximum of N elements of the training set. In this study, N
ranged from one to 40 with 12 replicates for each N value. For a
model to be considered robust, the average value of Q2LNO
should be close to the value of Q2LOO (Kiralj and Ferreira, 2009).
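A leave-N-out test along these lines can be sketched in Python. The exact resampling scheme of the QSAR Modeling program is not described in the text, so the sketch assumes that, in each replicate, the compounds are shuffled and then left out in consecutive blocks of at most N; the function names and data are hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def q2_lno_once(x, y, n_out, rng):
    """One LNO pass: shuffle, then leave out consecutive blocks of at most n_out."""
    order = list(range(len(x)))
    rng.shuffle(order)
    press = 0.0
    for start in range(0, len(order), n_out):
        held_out = set(order[start:start + n_out])
        x_train = [x[i] for i in range(len(x)) if i not in held_out]
        y_train = [y[i] for i in range(len(y)) if i not in held_out]
        b0, b1 = fit_line(x_train, y_train)
        press += sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in held_out)
    my = sum(y) / len(y)
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)


def average_q2_lno(x, y, n_out, replicates=12, seed=0):
    """Average Q2LNO over several random replicates for a given N."""
    rng = random.Random(seed)
    return sum(q2_lno_once(x, y, n_out, rng) for _ in range(replicates)) / replicates


# Hypothetical data; n_out must stay well below the number of compounds.
log_sp = [-7.0, -6.0, -5.5, -4.0, -3.0, -2.5, -1.0, -0.5]
log_koc = [2.9, 2.4, 2.5, 1.8, 1.7, 1.3, 1.1, 0.8]
for n_out in (1, 2, 3):
    print(n_out, round(average_q2_lno(log_sp, log_koc, n_out), 3))
```

A robust model shows average Q2LNO values that stay close to Q2LOO as N grows; a sharp drop would indicate that the fit depends on a few influential compounds.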
The random correlation can be verified by the y-randomi-
zation test. The models are recalculated after randomization
of the vector Y; a significant deterioration in these new regressions is expected. This procedure was repeated 50 times; two graphs were constructed, in which the parameters R2 and Q2LOO (y axis) were plotted as a function of the correlation coefficient R(Yo,Yr) (x axis) between the original and randomized Y values. To eliminate the possibility of a chance correlation, the intercepts of these graphs
should be less than 0.3 for R2 and less than 0.05 for Q2LOO
(Eriksson et al., 2003).
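The y-randomization test can be sketched as follows for a one-descriptor model. This Python example computes only the R2 intercept (the Q2LOO intercept would be obtained in the same way), uses the absolute value of R(Yo,Yr), and includes the original model as the point with R(Yo,Yr) = 1; these choices, the function names and the synthetic data are assumptions made for illustration.

```python
import math
import random


def pearson(a, b):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return my - b1 * mx, b1


def y_randomization_intercept(x, y, n_perm=50, seed=0):
    """Intercept of the line R2 versus |R(Yo,Yr)| over shuffled responses."""
    rng = random.Random(seed)
    # Original model: R(Yo,Yo) = 1. With one descriptor, R2 = r(x, y)**2.
    r_values = [1.0]
    r2_values = [pearson(x, y) ** 2]
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        r_values.append(abs(pearson(y, y_perm)))
        r2_values.append(pearson(x, y_perm) ** 2)
    intercept, _slope = fit_line(r_values, r2_values)
    return intercept


# Synthetic, near-linear data with a small deterministic perturbation.
log_sp = [-7.0 + 0.2 * i for i in range(30)]
log_koc = [0.7 - 0.3 * v + 0.05 * ((7 * i) % 5 - 2) for i, v in enumerate(log_sp)]
intercept = y_randomization_intercept(log_sp, log_koc)
print(round(intercept, 3), intercept < 0.3)
```

When the original relationship is genuine, the shuffled responses give low R2 values and the fitted line reaches a small (often slightly negative) intercept, as seen for all models in Table 3.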
The external validation of the obtained models was per-
formed by predicting logKoc values for the compounds of the
test set. Thus, the predictive power was assessed using the
Table 1 - Regression models for the prediction of logKoc for non-ionic pesticides (n = 143).
Model(a)  Equation
A1        logKoc = 0.945 + 0.560 logP
A2        logKoc = 0.670 − 0.568 logS
A3        logKoc = 0.702 − 0.299 logSP
B1        logKoc = −1.46 + 0.386 logP + 4.45 Mv + 0.0152 VAR − 0.124 MAXDP
B2        logKoc = −0.834 − 0.435 logS + 3.05 Mv + 0.00909 VAR − 0.0850 MAXDP
B3        logKoc = −1.09 − 0.227 logSP + 3.48 Mv + 0.00987 VAR − 0.0884 MAXDP
(a) "A" designates models with one variable; "B" designates models with 4 explanatory variables.
Table 3 - LNO cross-validation and random correlation test.
Model  Average Q2LNO (robustness)  Intercept of R2 vs R(Yo,Yr)  Intercept of Q2LOO vs R(Yo,Yr)
A1     0.734                       −0.040                       −0.070
A2     0.781                       −0.041                       −0.070
A3     0.798                       −0.046                       −0.077
B1     0.834                       −0.023                       −0.096
B2     0.803                       −0.028                       −0.105
B3     0.839                       −0.024                       −0.100
coefficient of determination of external validation (R2PRED), the modified coefficient of determination of external validation (r2m) and the Golbraikh-Tropsha method. The R2PRED values should be higher than 0.6 (Chirico and Gramatica, 2011, 2012). The average r2m value should be higher than 0.5, and |Δr2m| should be less than 0.2 (Ojha et al., 2011). According to the Golbraikh-Tropsha method, the slopes (k and k') of the regression lines that pass through the origin (observed values versus predicted values and predicted values versus observed values) should range from 0.85 to 1.15, and |R20 − R'20| should be less than 0.3 (Golbraikh and Tropsha, 2002; Golbraikh et al., 2003; Tropsha et al., 2003). The values of the standard error of prediction (SEP), the prediction residual error sum of squares for the external validation (PRESSev) and the average relative error (ARE) were also determined; these values should be as small as possible.
In addition to these classical statistical parameters, we also
employed the concordance correlation coefficient (CCC) pro-
posed by Lin (1989). This parameter provides a measure of
agreement between the calculated and experimental values.
Recent studies recommend that the CCC should be greater
than 0.85 for a model to be adopted (Chirico and Gramatica,
2011, 2012).
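Several of the external validation parameters can be computed directly from observed and predicted values. The Python sketch below uses commonly cited definitions: R2PRED = 1 - PRESSev / Σ(y − mean of training y)², SEP = sqrt(PRESSev / n), the through-origin slopes k and k', and Lin's CCC. The authors' exact formulas are in the Supplementary Material, so these are assumptions (the SEP expression does reproduce the SEP values of Table 4 from the tabulated PRESSev). All data are hypothetical.

```python
import math


def external_validation(y_obs, y_pred, y_train_mean):
    """Return a dict with R2PRED, SEP, k, k' and CCC for a test set."""
    n = len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_ref = sum((o - y_train_mean) ** 2 for o in y_obs)
    sum_op = sum(o * p for o, p in zip(y_obs, y_pred))
    mean_o, mean_p = sum(y_obs) / n, sum(y_pred) / n
    s_oo = sum((o - mean_o) ** 2 for o in y_obs)
    s_pp = sum((p - mean_p) ** 2 for p in y_pred)
    s_op = sum((o - mean_o) * (p - mean_p) for o, p in zip(y_obs, y_pred))
    return {
        "R2PRED": 1.0 - press / ss_ref,
        "SEP": math.sqrt(press / n),
        # Slopes of the regression lines through the origin.
        "k": sum_op / sum(p * p for p in y_pred),        # observed vs predicted
        "k_prime": sum_op / sum(o * o for o in y_obs),   # predicted vs observed
        # Lin's concordance correlation coefficient.
        "CCC": 2.0 * s_op / (s_oo + s_pp + n * (mean_o - mean_p) ** 2),
    }


# Hypothetical test-set values and training-set mean.
y_obs = [1.2, 2.0, 2.8, 3.5, 4.1]
y_pred = [1.4, 1.9, 2.6, 3.6, 3.9]
stats = external_validation(y_obs, y_pred, y_train_mean=2.5)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The resulting values can then be compared with the thresholds quoted above (R2PRED > 0.6, 0.85 ≤ k, k' ≤ 1.15, CCC > 0.85).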
Residual plots were analyzed to verify that the residuals
exhibited normal and zero-mean distributions. The formulas
used to calculate the previously mentioned statistical pa-
rameters are shown in the Supplementary Material.
Thus, recent validation criteria, which are more stringent
than the validation criteria commonly used in QSPR studies
(Kiralj and Ferreira, 2009; Chirico and Gramatica, 2011, 2012;
Roy et al., 2012), were employed in this study. This option
was selected to ensure the statistical quality of the models
obtained using the logSP descriptor and to demonstrate the
viability of the approach. After the validation procedures were
Table 2 - Statistical parameters of the logKoc prediction models.
Model  R2     SEC    RSS
A1     0.744  0.447  28.14
A2     0.788  0.407  23.35
A3     0.804  0.391  21.50
B1     0.848  0.348  16.71
B2     0.821  0.378  19.72
B3     0.852  0.343  16.24
completed, the best models were compared to the models
reported in the literature (Gramatica et al., 2000; Huuskonen,
2003; Duchowicz et al., 2007).
3. Results and discussion
3.1. Models for predicting logKoc and interpretation of the descriptors
The models for predicting logKoc are displayed in Table 1. The
sorption of the nonionic organic compounds in soil involves a
mechanism in which hydrophobicity is the driving force (Wen
et al., 2012; dos Reis et al., 2013). Thus, molecular descriptors
that estimate or affect the hydrophobicity of a substance are
relevant to the sorption process. This finding explains the high
correlation of logP, logS and logSP with logKoc (Table S2 in the
Supplementary Material). In the models with 4 variables, the
significance of other descriptors was also noted, including a
constitutional descriptor (Mv) and 2 topological descriptors
(VAR and MAXDP).
The constitutional descriptor Mv, which is the mean van
der Waals volume of the atoms that comprise the molecule
(scaled according to the volume of the carbon atom), is in
agreement with the expected mechanism of hydrophobic
interaction because this measure of volume is associated with
the capacity of each atom to interact with the organic matter
of the soil. An atom with a greater volume has a greater con-
tact surface.
The topological descriptors are variables that describe
different structural properties of a molecule, such as the de-
gree of branching, size, flexibility or centricity (de Melo, 2012).
The VAR descriptor (variation) is a topological distance
parameter that emphasizes the strong dependence of the size
and symmetry of the molecules on the sorption process.
The descriptor MAXDP (maximum positive difference be-
tween the intrinsic states of the atoms of a molecule) is a
Table 2 (continued)
Model  F       Q2LOO  SECV   PRESScv
A1     410.04  0.735  0.451  29.15
A2     523.02  0.782  0.410  23.99
A3     580.04  0.798  0.394  22.20
B1     192.62  0.835  0.356  18.11
B2     157.93  0.805  0.387  21.44
B3     199.14  0.840  0.351  17.62
Fig. 1 - LNO cross-validation plots of the (a) A3 and (b) B3 models.
Fig. 2 - Plots of the observed values as a function of the logKoc predicted values (95% confidence interval) for the external validation set of the (a) A3 and (b) B3 models.
topological index proposed by Gramatica et al. (2000) that was
deduced from the Kier-Hall atomic electronegativities. Thus,
MAXDP is related to molecular electrophilicity; it highlights the importance of electrophilic properties, which hinder the sorption process in soil and thus favor pesticide leaching.
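The equations of Table 1 can be applied directly once the descriptor values are known. The Python sketch below evaluates models A3 and B3 with the coefficients listed in Table 1; the descriptor values passed to the functions are hypothetical and do not correspond to any compound in Table S1.

```python
def log_koc_a3(log_sp):
    """Model A3 (Table 1): logKoc = 0.702 - 0.299 logSP."""
    return 0.702 - 0.299 * log_sp


def log_koc_b3(log_sp, mv, var, maxdp):
    """Model B3 (Table 1): four-descriptor equation."""
    return -1.09 - 0.227 * log_sp + 3.48 * mv + 0.00987 * var - 0.0884 * maxdp


# Hypothetical descriptor values for a single compound.
pred_a3 = log_koc_a3(log_sp=-6.0)
pred_b3 = log_koc_b3(log_sp=-6.0, mv=0.65, var=50.0, maxdp=4.0)
print(f"A3: logKoc = {pred_a3:.3f}")
print(f"B3: logKoc = {pred_b3:.3f}")
```

The signs of the coefficients follow the interpretation given above: a more negative logSP and a larger Mv or VAR raise the predicted logKoc, whereas a larger MAXDP lowers it.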
3.2. Statistical analysis
The statistical parameters of all models are shown in Table 2.
All models with a single independent variable (A1, A2 and A3)
were approved with regard to curve fitting because their R2
values were higher than 0.7. However, it is important to note
that the A3 model, which used the new descriptor logSP,
showed a better fit and greater significance than the A1 and A2
models, which used the classical descriptors logP and logS. A
similar trend was also observed for the models with four
variables. These findings suggested that the models based on
logSP exhibit a better fit with the experimental data compared
with the other models.
The SEC values should be as small as possible within the
limits imposed by the experimental error of the measured
response variables. Lohninger (1994) demonstrated the limits
Table 4 - Statistical data for the external validation (dataset, n = 20).
Model  R2PRED  SEP    PRESSev  ARE (%)  CCC
A1     0.668   0.536  5.75     16.1     0.85
A2     0.708   0.503  5.06     24.2     0.84
A3     0.723   0.490  4.80     19.9     0.86
B1     0.710   0.501  5.03     19.0     0.87
B2     0.733   0.481  4.62     23.4     0.86
B3     0.743   0.472  4.45     20.5     0.87
of the accuracy of the experimental measurements of logKoc.
In a study performed with a set of pesticides that contained a
minimum of four different logKoc values for each compound,
the average standard deviation of these measurements was
0.44 log units. The modeling of a dependent variable is limited by the quality of the available experimental data. Thus, obtaining prediction models with errors significantly smaller than the experimental errors is not an appropriate goal (dos
Reis et al., 2013). Note that all of the models we constructed
exhibited SEC values that were compatible with the experi-
mental error. Therefore, they were considered to be reliable
for estimating logKoc values.
All models were validated by LOO cross-validation because they possessed Q2LOO values higher than 0.6 and |R2 − Q2LOO| <
Table 4 (continued; n = 20)
Model  Average r2m  |Δr2m|  k     k'    |R20 − R'20|
A1     0.622        0.127   0.97  0.99  0.050
A2     0.614        0.157   0.96  1.00  0.064
A3     0.654        0.004   0.96  1.00  0.001
B1     0.678        0.076   0.94  1.03  0.023
B2     0.658        0.181   0.95  1.02  0.063
B3     0.688        0.028   0.95  1.02  −0.008
Fig. 3 - Histogram of residuals distribution for the (a) A3 and (b) B3 models.
0.1. Thus, based on the data shown in Table 2 and the above discussion of the data fit and LOO cross-validation, the models were arranged in descending order of statistical quality, grouped by the number of descriptors in the equations:
A3 > A2 > A1 (models with 1 descriptor)
B3 > B1 > B2 (models with 4 descriptors)
The LNO cross-validation and random correlation tests are
also important tools for verifying the internal quality of the
models. Data from the LNO cross-validation showed that all
models had average Q2LNO values similar to their respective
Q2LOO values (Table 3). Plots of the LNO validation for the A3
and B3 models (both based on the descriptor logSP), shown in
Fig. 1, indicated that the models were stable and resistant to
small variations. The plots of the other models are available
in the Supplementary Material and led to the same conclu-
sion. However, it is important to note that the A3 model
Table 5 - Comparison of logKoc models.
Model                    No. of variables  R2     SEC
A3 Model                 1                 0.804  0.391
B3 Model                 4                 0.852  0.343
Gramatica et al. (2000)  6                 0.843  0.350
Huuskonen (2003)         12                0.82   0.37
Duchowicz et al. (2007)  6                 0.90   0.29
showed the best results in the robustness test among the
models with a single descriptor, and the B3 model was the best
model among the models with 4 variables. Therefore, the
models based on logSP were also better than the other models
in the LNO validation test.
Random correlation was evaluated by the y-randomization
test, and the results shown in Table 3 indicate the absence of
random correlation in all of the models obtained, as values of
the intercepts for the R2 and Q2LOO plots were within the limits
(less than 0.3 and 0.05, respectively) proposed by Eriksson
et al. (2003).
Results of the external validation (Table 4) show that all
models were considered approved because the R2PRED values
were all greater than 0.6. The SEP values were consistent
with the experimental errors associated with the logKoc
measurements (0.44). Thus, from the data in Table 4, the
models were arranged in decreasing order according to
predictive power, considering the number of descriptors in
the equations:
A3 > A2 > A1 (models with 1 descriptor)
B3 > B2 > B1 (models with 4 descriptors)
It is important to note that, regarding predictive power, the
A3 model was again the best model among the single-
descriptor models. Similarly, the B3 model was better than
the other 4-variable models.
The CCC data (Table 4) suggested that the A2 model should
be rejected because it had a value less than 0.85, whereas the
other models were validated.
An analysis of the k, k' and |R20 − R'20| parameters (Golbraikh-Tropsha method) showed that all models were within the proposed limits (Tropsha et al., 2003). However, it is
important to note that the A3 model had the best results
among the models with a single descriptor and that the B3
model was again the best model among the models with 4
descriptors. This same trend was also observed with respect
to the parameters r2m and jDr2mj. Fig. 2 shows the graphs of the
observed values as a function of the logKoc predicted values for
the A3 and B3 models.
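The parameters named above can be computed from the observed and predicted values alone. The sketch below uses widely cited forms of the Golbraikh–Tropsha slopes and of the r²m metrics; these are assumptions, since the exact expressions and axis conventions are defined in the cited papers and vary between authors. The function names and example data are hypothetical, and the limits 0.85 ≤ k ≤ 1.15 and |R²₀ − R'²₀| < 0.3 are the values commonly quoted for this method rather than values stated in this section.

```python
import numpy as np

def r2_through_origin(a, b):
    """R2 and slope of the zero-intercept line a = slope * b."""
    slope = np.sum(a * b) / np.sum(b ** 2)
    r2_0 = 1.0 - np.sum((a - slope * b) ** 2) / np.sum((a - a.mean()) ** 2)
    return r2_0, slope

def gt_and_rm2(y_obs, y_pred):
    """Golbraikh-Tropsha slopes and r2m metrics (assumed common definitions)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    r2_0, k = r2_through_origin(y_obs, y_pred)      # observed vs predicted
    r2_0p, k_p = r2_through_origin(y_pred, y_obs)   # predicted vs observed
    rm2 = r2 * (1.0 - np.sqrt(abs(r2 - r2_0)))
    rm2_p = r2 * (1.0 - np.sqrt(abs(r2 - r2_0p)))
    return {
        "k": k,
        "k_prime": k_p,
        "abs_r2_0_diff": abs(r2_0 - r2_0p),
        "avg_rm2": 0.5 * (rm2 + rm2_p),
        "delta_rm2": abs(rm2 - rm2_p),
    }

# Hypothetical observed and predicted logKoc values.
stats = gt_and_rm2([2.0, 3.0, 4.0, 5.0], [2.1, 2.9, 4.2, 4.9])
for name, value in stats.items():
    print(f"{name} = {value:.4f}")
```

Because r²m multiplies r² by a penalty that grows with the gap between r² and its through-origin counterpart, a model can have a high r² and still fail the average r²m threshold when its predictions are biased.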
When the results were analyzed according to the more
stringent r²m criterion (average r²m > 0.65; Chirico and
Gramatica, 2012), the A1 and A2 models were rejected. Thus,
the solubility descriptor logSP was more
appropriate than the conventional descriptors logP and logS
for modeling logKoc for nonionic pesticides with 1 or 4
explanatory variables.
Histograms of the residuals of the A3 and B3 models are shown
in Fig. 3. Analysis of the residuals indicated that they were
normally distributed with a mean of 0. Histograms for the other
models are available in the Supplementary Material.
LOO R²PRED ARE (%) CCC Average r²m |Δr²m|
798 0.723 19.9 0.86 0.654 0.004
840 0.743 20.5 0.87 0.688 0.028
824 0.670 27.4a 0.80a 0.539a 0.245a
79 0.79 20.0a 0.87a 0.654a 0.190a
89 0.71 21.7a 0.83a 0.594a 0.134a
tal and predicted values available in the literature.
The proposed descriptor logSP proved to be useful for
modeling logKoc because models with better fits and predictive
powers were generated when the information regarding logP
and logS was combined than when only the conventional
descriptors were used. Therefore, logSP can be a good
alternative to consider for the modeling of properties that relate to
both solubility and water/organic matter partitioning.
3.3. Comparison with QSPR models in the literature
The A3 and B3 models were compared with previously
reported models that utilized the same dataset. The major
statistical parameters of these models and their respective
predictive powers are summarized in Table 5. Models A3
and B3 fit the experimental data well and had good
predictive ability, similar to the models found in the literature. All
models presented in Table 5 possessed a sufficient degree of
fit and had SEC values compatible with the experimental
error of logKoc. However, while the A3 and B3 models were
validated according to all the external validation criteria
recommended for QSPR studies, the models of Gramatica
et al. (2000) and Duchowicz et al. (2007) were rejected because
their CCC values were less than 0.85. These models were also
rejected when we applied the more rigorous criterion for r²m
(values greater than 0.65) proposed by Chirico and
Gramatica (2012). Thus, among the models from the
literature, only the model of Huuskonen (2003) satisfied the
stricter external validation criteria that are currently
recommended.
It should be noted that the B3 model (our best model)
outperformed the model of Huuskonen (the best model from
the literature) in terms of fit, while the external validation
parameters of the two models had similar values. However,
whereas Huuskonen's model was built with 12 explanatory
variables, the B3 model is much simpler and was
constructed with only 4 descriptors. We also note that these
variables are in agreement with the mechanisms of soil
sorption of nonionic compounds. Thus, the B3 model
proposed in this paper is not only reliable in terms of its
statistical qualities and predictive power but also satisfies the
requirement of greater simplicity.
4. Conclusions
We have demonstrated that the solubility descriptor logSP is a
suitable alternative for logKoc modeling because all models
that we obtained were validated according to statistical
criteria that were stricter than criteria typically used in QSPR
studies. In addition, all selected descriptors are consistent
with the mechanisms of the sorption process in soil for
nonionic organic compounds.
The primary objective of a QSPR study is to obtain models
that combine, when possible, mathematical simplicity, ease
of physicochemical interpretation, high statistical quality and
effective predictive power. The models that we built
based on the solubility descriptor logSP possess these
qualities.
However, some caution is necessary because this
approach cannot necessarily be generalized to all QSPR
studies. It should be tested case-by-case, and the property to
be modeled must be dependent on both water solubility and
water/organic matter partitioning of the compound under
study.
Acknowledgments
The authors thank the MCT/CNPq/Fundação Araucária
(www.fundacaoaraucaria.org.br) and the National Council of
Technological and Scientific Development (CNPq, www.cnpq.br)
for providing the financial support for this study.
In addition, the authors thank the Laboratory for
Theoretical and Applied Chemometrics at the State University of
Campinas (LQTA/UNICAMP) for providing the QSAR Modeling
software.
Appendix A. Supplementary data
Supplementary data related to this article can be found at
http://dx.doi.org/10.1016/j.watres.2014.01.023.
References
Allen-King, R.M., Grathwohl, P., Ball, W.P., 2002. New modeling paradigms for the sorption of hydrophobic organic chemicals to heterogeneous carbonaceous matter in soils, sediments, and rocks. Adv. Water Resour. 25 (8–12), 985–1016.
Arias-Estevez, M., Lopez-Periago, E., Martinez-Carballo, E., Simal-Gandara, J., Mejuto, J.C., Garcia-Rio, L., 2008. The mobility and degradation of pesticides in soils and the pollution of groundwater resources. Agric. Ecosyst. Environ. 123 (4), 247–260.
ASTM, 2001. Standard Test Method 24-Hour Batch-type Measurement Containment Sorption by Soils and Sediments. American Society for Testing and Materials, Philadelphia, Pennsylvania.
Cammarata, A., Rogers, K.S., 1971. Electronic representation of the lipophilic parameter π. J. Med. Chem. 14 (4), 269–274.
Chirico, N., Gramatica, P., 2011. Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J. Chem. Inf. Model. 51 (9), 2320–2335.
Chirico, N., Gramatica, P., 2012. Real external predictivity of QSAR models. Part 2. New intercomparable thresholds for different validation criteria and the need for scatter plot inspection. J. Chem. Inf. Model. 52 (8), 2044–2058.
Dearden, J.C., 2002. Prediction of environmental toxicity and fate using quantitative structure–activity relationships (QSARs). J. Braz. Chem. Soc. 13 (6), 754–762.
de Melo, E.B., 2012. A new quantitative structure–property relationship model to predict bioconcentration factors of polychlorinated biphenyls (PCBs) in fishes using E-state index and topological descriptors. Ecotoxicol. Environ. Saf. 75 (1), 213–222.
dos Reis, R.R., Sampaio, S.C., de Melo, E.B., 2013. The effect of different logP algorithms on the modeling of the soil sorption coefficient of nonionic pesticides. Water Res. 47 (15), 5751–5759.
Doucette, W.J., 2003. Quantitative structure–activity relationships for predicting soil–sediment sorption coefficients for organic chemicals. Environ. Toxicol. Chem. 22 (8), 1771–1788.
Dragon 6.0, 2011. Talete srl. Milano, Italy.
Duchowicz, P.R., Gonzalez, M.P., Helguera, A.M., Cordeiro, M.N.D.S., Castro, E.A., 2007. Application of the replacement method as novel variable selection in QSPR. 2. Soil sorption coefficients. Chemom. Intell. Lab. Syst. 88 (2), 197–203.
Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M., Gramatica, P., 2003. Methods for reliability and uncertainty assessment and for applicability evaluations of classification and regression-based QSARs. Environ. Health Perspect. 111 (10), 1361–1375.
Gao, S., Cao, C., 2008. A new approach on estimation of solubility and n-octanol/water partition coefficient for organohalogen compounds. Int. J. Mol. Sci. 9 (6), 962–977.
Gawlik, B.M., Sotiriou, N., Feicht, E.A., Schulte-Hostede, S., Kettrup, A., 1997. Alternatives for the determination of the soil adsorption coefficient, KOC, of non-ionic-organic compounds – a review. Chemosphere 34 (12), 2525–2551.
Golbraikh, A., Shen, M., Xiao, Z., Xiao, Y., Lee, K., Tropsha, A., 2003. Rational selection of training and test set for the development of validated QSAR models. J. Comput. Aided Mol. Des. 17 (2–4), 241–253.
Golbraikh, A., Tropsha, A., 2002. Beware of q2! J. Mol. Graph. Model. 20 (4), 269–276.
Goudarzi, N., Goodarzi, M., Araujo, M.C., Galvao, R.K., 2009. QSPR modeling of soil sorption coefficients (KOC) of pesticides using SPA-ANN and SPA-MLR. J. Agric. Food Chem. 57 (15), 7153–7158.
Gramatica, P., Corradi, M., Consonni, V., 2000. Modelling and prediction of soil sorption coefficients of non-ionic organic pesticides by molecular descriptors. Chemosphere 41 (5), 763–777.
Gramatica, P., Giani, E., Papa, E., 2007. Statistical external validation and consensus modeling: a QSPR case study for Koc prediction. J. Mol. Graph. Model. 25 (6), 755–766.
Hansch, C., Quinlan, J.E., Lawrence, G.L., 1968. Linear free-energy relationship between partition coefficients and the aqueous solubility of organic liquids. J. Org. Chem. 33 (1), 347–350.
Huuskonen, J., 2003. Prediction of soil sorption coefficient of organic pesticides from the atom-type electrotopological state indices. Environ. Toxicol. Chem. 22 (4), 816–820.
Kiralj, R., Ferreira, M.M.C., 2009. Basic validation procedures for regression models in QSAR and QSPR studies: theory and application. J. Braz. Chem. Soc. 20 (4), 770–787.
Lee, L.S., Rao, P.S.C., Brusseau, M.L., 1991. Nonequilibrium sorption and transport of neutral and ionized chlorophenols. Environ. Sci. Technol. 25 (4), 722–729.
Lin, L., 1989. A concordance correlation coefficient to evaluate reproducibility. Biometrics 45 (1), 255–268.
Livingstone, D.J., 2003. Quantitative structure–activity relationships. In: King, F.D. (Ed.), Medicinal Chemistry: Principles and Practice, second ed. Royal Society of Chemistry, Cambridge.
Lohninger, H., 1994. Estimation of soil partition coefficients of pesticides from their chemical structure. Chemosphere 29 (8), 1611–1626.
Mackay, D., Webster, E., 2003. A perspective on environmental models and QSARs. SAR QSAR Environ. Res. 14 (1), 7–16.
Martins, J.P.A., Ferreira, M.M.C., 2013. QSAR modeling: a new open source computational package to generate and validate QSAR models. Quím. Nova 36 (4), 554–560.
Minitab Statistical Software 15, 2007. Minitab Inc., State College, Pennsylvania, USA.
Nguyen, T.H., Goss, K.U., Ball, P.W., 2005. Polyparameter linear free energy relationships for estimating the equilibrium partition of organic compounds between water and the natural organic matter in soils and sediments. Environ. Sci. Technol. 39 (4), 913–924.
Ojha, P.K., Mitra, I., Das, R.N., Roy, K., 2011. Further exploring r²m metrics for validation of QSPR models. Chemom. Intell. Lab. Syst. 107 (1), 194–205.
QSAR Modeling, 2013. Theoretical and Applied Chemometrics Laboratory. State University of Campinas, Campinas, Brazil.
Razzaque, M.M., Grathwohl, P., 2008. Predicting organic carbon-water partitioning of hydrophobic organic chemicals in soils and sediments based on water solubility. Water Res. 42 (14), 3775–3780.
Rogers, K.S., Cammarata, A., 1969. A molecular orbital description of the partitioning of aromatic compounds between polar and nonpolar phases. Biochim. Biophys. Acta – Biomemb. 193 (1), 22–29.
Roy, K., Mitra, I., Kar, S., Ojha, P.K., Das, R.N., Kabir, H., 2012. Comparative studies on some metrics for external validation of QSPR models. J. Chem. Inf. Model. 52 (2), 396–408.
Sabljic, A., Güsten, H., Verhaar, H., Hermens, J., 1995. QSAR modeling of soil sorption. Improvements and systematics of log KOC vs log KOW correlations. Chemosphere 31 (11–12), 4489–4514.
Tetko, I.V., Tanchuk, V.Y., Villa, A.E., 2001. Prediction of n-octanol/water partition coefficients from PHYSPROP database using artificial neural networks and E-state indices. J. Chem. Inf. Comput. Sci. 41 (5), 1407–1421.
Tropsha, A., Gramatica, P., Gombar, V.K., 2003. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22 (1), 69–77.
von Oepen, B., Kordel, W., Klein, W., 1991. Sorption of nonpolar and polar compounds to soil processes measurements and experience with the applicability of the modified OECD-guideline 106. Chemosphere 22 (3–4), 285–304.
Wang, J., Hou, T., 2011. Recent advances on aqueous solubility prediction. Comb. Chem. High. Throughput Screen. 14 (5), 328–338.
Wen, Y., Su, L.M., Qin, W.C., Fu, L., He, J., Zhao, Y.H., 2012. Linear and non-linear relationships between soil sorption and hydrophobicity: model, validation and influencing factors. Chemosphere 86 (6), 634–640.