

Contents lists available at ScienceDirect

Environmental Pollution

journal homepage: www.elsevier.com/locate/envpol

Environmental Pollution 206 (2015) 227–235

Using ensemble models to identify and apportion heavy metal pollution sources in agricultural soils on a local scale

Qi Wang a, Zhiyi Xie b, Fangbai Li a, *

a Guangdong Key Laboratory of Agricultural Environment Pollution Integrated Control, Guangdong Institute of Eco-Environmental and Soil Sciences, Guangzhou, China
b Guangdong Environmental Monitoring Center, Guangzhou, China

Article info

Article history:
Received 17 March 2015
Received in revised form 13 May 2015
Accepted 29 June 2015
Available online xxx

Keywords:
Heavy metals
Pollution source
Agricultural soil
Local scale
Ensemble model

* Corresponding author.
E-mail address: [email protected] (F. Li).

http://dx.doi.org/10.1016/j.envpol.2015.06.040
0269-7491/© 2015 Elsevier Ltd. All rights reserved.

Abstract

This study aims to identify and apportion multi-source and multi-phase heavy metal pollution from natural and anthropogenic inputs using ensemble models that include stochastic gradient boosting (SGB) and random forest (RF) in agricultural soils on the local scale. The heavy metal pollution sources were quantitatively assessed, and the results illustrated the suitability of the ensemble models for the assessment of multi-source and multi-phase heavy metal pollution in agricultural soils on the local scale. The results of SGB and RF consistently demonstrated that anthropogenic sources contributed the most to the concentrations of Pb and Cd in agricultural soils in the study region and that SGB performed better than RF.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Soil is a large and long-term sink for ubiquitous heavy metals and related compounds. In agricultural soils, the accumulation of heavy metals is a growing public concern because it threatens environmental health; elevated heavy metal uptake by crops may also affect food quality and security (Harmanescu et al., 2011; Wu et al., 2015). An important prerequisite in the control and remediation of heavy metal contaminated soils is determining the source of contamination (Lin et al., 2010; Zhang et al., 2009b). On a local scale, agricultural soils become contaminated by accumulated heavy metals released from multi-phase and diverse natural and anthropogenic sources (Gellrich and Zimmermann, 2007). Heavy metals in agricultural soils primarily originate from the weathering of parent materials but can also be accumulated from industrial emissions, such as mine tailings, disposal of high metal wastes and sewage sludge, and agricultural sources, such as livestock manure, inorganic fertilizers, lime, agrochemicals, irrigation water, atmospheric deposition and pesticides (Hu and Cheng, 2013; Khan et al., 2008; Mohammed et al., 2011). Every decision regarding the application of any measures in soil quality and management must be based on reliable information on the extent and sources of heavy metal pollution in the given area (Zovko and Romic, 2011). Therefore, the identification and apportionment of heavy metal pollution sources in agricultural soils on the local scale are crucial. The high spatial heterogeneity of heavy metals in soils, the complexity and diversity of pollution sources and the lack of long-term monitoring data have challenged researchers to assess multi-source and multi-phase heavy metal pollution in agricultural soils on a local scale; exploring suitable methods to address this challenge is imperative. To this end, models can serve as powerful tools for source identification and apportionment.

There are two competing modeling methods: the traditional approach (build one robust model) and the more recent ensemble learning approach (build many models and average the results). Numerous reports have shown that multivariate analysis and GIS are useful tools for the identification of probable pollution sources and the potential risks of heavy metals (Facchinelli et al., 2001). For example, multivariate analyses that have been applied to exclusively predict soil pollution sources include principal component analysis (Micó et al., 2006; Yongming et al., 2006), clustering analysis (Bhuiyan et al., 2010; Soares et al., 1999) and discriminant analysis (Qishlaqi and Moore, 2007). GIS-based models together with multivariate analysis have also been developed for mapping and evaluating the sources and distributions of heavy metal contaminants, such as those in Fragkos (1998), Zhou (2007a) and Facchinelli et al. (2001). Stochastic models, such as the conditional inference tree and the finite mixture distribution model, have been used to differentiate the effects and contributions of natural background and human activities across large-scale regions (Hu and Cheng, 2013; Lin et al., 2010). These modeling approaches are referred to as “traditional approaches”. Conventional multivariate analysis can help identify the pollution sources and distinguish natural versus anthropogenic contributions based on associations. However, such analyses are sensitive to outliers and to the non-normal distributions of geochemical datasets; examining the probability distributions of all variables is essential, and transforming the data consequently changes the original data (Micó et al., 2006). GIS methodologies can help predict the point sources that are responsible for particular areas of contamination. The accuracy of such maps depends fundamentally on the accuracy of the dispersion model. This model includes deductive components for assessing the sources of heavy metals, which leads to low prediction accuracy and large uncertainty (Fragkos et al., 1998). The common methods combining multivariate analysis, geo-statistics and GIS can qualitatively predict the potential pollution sources of heavy metals, but are unable to quantitatively apportion the contributions from the different sources. Furthermore, models of the identification and apportionment of heavy metal pollution sources have seldom been established at the local scale. The ensemble models provided in this study are superior in their quantitative assessment of the complex sources of multi-phase heavy metal pollution in agricultural soils on a local scale.

Stochastic gradient boosting (SGB) (Friedman, 2006) is a recent advance in ensemble methods. This technique has emerged as one of the most powerful methods for predictive data mining in recent years (Hastie et al., 2009). SGB produces the greatest increase in model accuracy by the gradient descent of the loss function in iterative tree construction (Friedman, 2001). Even though SGB models are complex, their predictive performance is superior to most traditional models (Friedman, 2006). The application of SGB to the interpretation of complex spatial patterns of ecological and remote sensing data has gained increasing attention in recent years (De'ath, 2007; Lawrence et al., 2004). To date, there have been no published applications of SGB in environmental soil science. SGB was used in the present study for the first time to identify and apportion the multi-source and multi-phase pollution from cadmium (Cd) and lead (Pb) in agricultural soils at the local scale. The interaction effects between predictors were also detected to render reliable variable selection. The ensemble-based random forest (RF) method was adopted as a supplemental tool to assess the diverse sources and their importance. In a random forest, each node is split using the best of a subset of predictors that are randomly chosen at that node. This somewhat counterintuitive strategy performs very well compared to many other data mining techniques, including discriminant analysis, support vector machines and neural networks, and is robust against over-fitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest) (Hothorn et al., 2006). Thus, RF was employed as a robust tool for comparative analysis in this study. Our case study was located in Dongtang Township in the north of Guangdong Province, China, which contains the largest lead and zinc mining and smelting base in Asia (Wang et al., 2012); children living there reportedly had considerably high blood Pb levels (Van Kerckhove, 2012).

2. Materials and methods

2.1. Field sampling and chemical analyses

The study region (Fig. 1) is bounded by the latitudes of 25°1′07″ N and 25°9′08″ N and the longitudes of 113°32′46″ E and 113°43′46″ E in northern Guangdong Province, covering more than 1.92 × 10² km² of land surface. A total of 250 samples of surface soils (0–20 cm deep) under agricultural use were collected along with corresponding samples of surface water (10–15 cm below the water surface) and atmosphere. The heavy metal concentrations (Cd and Pb) in the soils were measured following the procedures of Hu et al. (2013). The concentrations of Cd and Pb in surface water were obtained using the procedures of Reza and Singh (2010). The Pb and Cd contents in the atmosphere were determined using flame atomic absorption spectrometry (Perkin Elmer 1100).

2.2. Data collection and preparation

Six types of predictors were applied to assess the sources of heavy metals and their contributions: (1) background value, denoting the natural source; (2) atmospheric sources, including the contents of heavy metals in the atmosphere; (3) water sources, including the contents of heavy metals in surface water; (4) urbanization sources, including population density and road density (which refers to the lengths of the roads surrounding the sampling sites); (5) agricultural sources, consisting of irrigation and the application of fertilizers and pesticides; and (6) industrial sources related to the quantity of heavy metal emissions, represented by the distances from each sampling site to the three main plants releasing Pb and Cd (Fankou plant, Huayue plant and Danxia plant, Fig. 1) and by the mining areas of those plants. The roads were classified as highways and railways. We created a buffer zone with a 500-m radius for each sampling site and identified the total road length within the zone based on the region's roadmap. We also calculated the total area of the ponds and ditches (which represent irrigation) and the mining areas within the buffer zone based on the region's land use map. Data processing was carried out in ArcGIS 10.0.1. The population density and application of fertilizers and pesticides were obtained from the statistical yearbook and census data.
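The road-density predictor described above can be illustrated with a minimal sketch. The study performed this step in ArcGIS; the Python below is only an assumed stand-in that clips a road polyline to a circular buffer in planar coordinates (metres). The function names `length_inside_circle` and `road_length_in_buffer` are hypothetical and do not come from the paper.

```python
import math

def length_inside_circle(a, b, center, radius):
    """Length of the straight segment a-b lying inside a circle (planar metres)."""
    ax, ay = a[0] - center[0], a[1] - center[1]
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len = math.hypot(dx, dy)
    if seg_len == 0.0:
        return 0.0
    # Points on the segment are A + t*D with 0 <= t <= 1; solve |A + t*D|^2 = r^2.
    qa = dx * dx + dy * dy
    qb = 2.0 * (ax * dx + ay * dy)
    qc = ax * ax + ay * ay - radius * radius
    disc = qb * qb - 4.0 * qa * qc
    if disc <= 0.0:
        return 0.0  # the line misses (or only touches) the circle
    root = math.sqrt(disc)
    t0 = max(0.0, (-qb - root) / (2.0 * qa))
    t1 = min(1.0, (-qb + root) / (2.0 * qa))
    return max(0.0, t1 - t0) * seg_len

def road_length_in_buffer(polyline, site, radius=500.0):
    """Total length of a road polyline inside the buffer around one sampling site."""
    return sum(length_inside_circle(p, q, site, radius)
               for p, q in zip(polyline, polyline[1:]))

# Hypothetical road passing straight through a site at the origin.
road = [(-1000.0, 0.0), (1000.0, 0.0)]
print(road_length_in_buffer(road, (0.0, 0.0)))
```

Summing this quantity over all road features near a site would give the total road length within its 500-m buffer; a real workflow would also need a projected coordinate system, which is assumed here.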

2.3. Modeling methodology

Two common ensemble methods for classification and regression are Bagging (Soares et al., 1999) and Boosting (Bhuiyan et al., 2010). Boosting incorporates the important advantages of tree-based methods, such as handling different types of predictor variables and accommodating missing data and outliers, without requiring strong model assumptions (De'ath, 2007; Lawrence et al., 2004; Maloney et al., 2012). Fitting multiple boosted regression trees overcomes the biggest drawback of single tree models: their relatively poor predictive performance (Moisen et al., 2006). SGB based on boosting uses only a fraction of the training data to increase both the computation speed and the prediction accuracy, while also helping to avoid over-fitting the data.

The relationship between explanatory variables and response variables (the concentrations of soil heavy metals) was established using SGB. SGB (Friedman, 1999, 2001) is related to both boosting and bagging. Many small regression trees are built sequentially from the gradient of the loss function of the previous tree. At each iteration, a tree is built from a random sub-sample of the dataset (selected without replacement), incrementally improving the model. In the function estimation, the system consists of a random “response” variable $y$ and a set of random “explanatory” variables $X = \{x_1, \ldots, x_n\}$. Given a “training” sample $\{y_i, x_i\}_1^N$ of known $(y, x)$ values, the goal is to find a function $F^*(x)$ that maps $x$ to $y$, such that over the joint distribution of all $(y, x)$ values, the expected value of some specified loss function $J(y, F(x))$ is minimized.

Fig. 1. Map of the Dongtang Township in China; the locations of the soil sampling sites and main industries and land use classes in Dongtang Township.

$$F^*(x) = \arg\min_{F(x)} E_{y,X}\, J(y, F(x)) \qquad (1)$$

Boosting approximates $F^*(x)$ using an “additive” expansion of the form

$$F(x) = \sum_{m=0}^{M} \beta_m h(x; a_m) \qquad (2)$$

where the function $h(x; a)$ (“base learner”) is typically chosen to be a simple function of $x$ with parameters $a = \{a_1, a_2, \ldots\}$. The expansion coefficients $\{\beta_m\}_0^M$ and the parameters $\{a_m\}_0^M$ are jointly fit to the training data in a forward “stage-wise” manner. One starts with an initial guess $F_0(x)$, and then, for $m = 1, 2, \ldots, M$ (Friedman, 2001):

$$(\beta_m, a_m) = \arg\min_{\beta, a} \sum_{i=1}^{N} J(y_i, F_{m-1}(x_i) + \beta h(x_i; a)) \qquad (3)$$

and

$$F_m(x) = F_{m-1}(x) + \beta_m h(x; a_m) \qquad (4)$$

Gradient boosting (Friedman, 1999) approximately solves (3) for arbitrary loss functions $J(y, F(x))$ with a two-step procedure. First, the function $h(x; a)$ is fit by least squares:

$$a_m = \arg\min_{a, \rho} \sum_{i=1}^{N} [\tilde{y}_{im} - \rho h(x_i; a)]^2 \qquad (5)$$

to the current “pseudo”-residuals:

$$\tilde{y}_{im} = -\left[\frac{\partial J(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)} \qquad (6)$$

Then, given $h(x; a_m)$, the optimal value of the coefficient $\beta_m$ is determined by

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} J(y_i, F_{m-1}(x_i) + \beta h(x_i; a_m)) \qquad (7)$$

This strategy replaces a potentially difficult function optimization problem (3) with one based on least squares (5), followed by a single parameter optimization (7) based on the general loss criterion $J$ (Friedman, 2006).
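To make the pseudo-residuals of Eq. (6) concrete, the following sketch evaluates the negative gradient for three common loss functions. It is illustrative only: the function name `pseudo_residuals` is hypothetical, and the Poisson case assumes the fit $F$ is on the log scale with loss $\exp(F) - yF$, which is one standard parameterization rather than something the paper states.

```python
import math

def pseudo_residuals(y, f, loss="squared"):
    """Negative gradient of the loss with respect to the current fit F_{m-1}(x_i)."""
    if loss == "squared":    # J = (y - F)^2 / 2   ->  y - F
        return [yi - fi for yi, fi in zip(y, f)]
    if loss == "absolute":   # J = |y - F|         ->  sign(y - F)
        return [(yi > fi) - (yi < fi) for yi, fi in zip(y, f)]
    if loss == "poisson":    # J = exp(F) - y * F  ->  y - exp(F)   (assumed log link)
        return [yi - math.exp(fi) for yi, fi in zip(y, f)]
    raise ValueError("unknown loss: " + loss)

print(pseudo_residuals([3.0, 1.0], [2.0, 2.0], "squared"))
print(pseudo_residuals([3.0, 1.0], [2.0, 2.0], "absolute"))
```

For squared error the pseudo-residuals coincide with the ordinary residuals, which is why step (5) then amounts to fitting the base learner to what the current model still gets wrong.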

Gradient tree boosting specializes this approach to the case in which the base learner $h(x; a)$ is an $L$-terminal node regression tree. At each iteration $m$, a regression tree partitions the $X$-space into $L$ disjoint regions $\{R_{lm}\}_{l=1}^{L}$ and predicts a separate constant value in each one:

$$h(x; \{R_{lm}\}_1^L) = \sum_{l=1}^{L} \bar{y}_{lm} 1(x \in R_{lm}) \qquad (8)$$

Here $\bar{y}_{lm} = \mathrm{mean}_{x_i \in R_{lm}}(\tilde{y}_{im})$ is the mean of (6) in each region $R_{lm}$. The parameters of this base learner are the splitting variables and the corresponding split points defining the tree, which in turn define the corresponding regions $\{R_{lm}\}_1^L$ of the partition at the $m$th iteration. With the regression tree, (7) can be solved separately within each region $R_{lm}$ defined by the corresponding terminal node $l$ of the $m$th tree. Because tree (8) predicts a constant value $\bar{y}_{lm}$ within each region $R_{lm}$, the solution to (7) reduces to a simple “location” estimate based on the criterion $J$:


$$\gamma_{lm} = \arg\min_{\gamma} \sum_{x_i \in R_{lm}} J(y_i, F_{m-1}(x_i) + \gamma) \qquad (9)$$

The current approximation $F_{m-1}(x)$ is then separately updated in each corresponding region:

$$F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_{lm} 1(x \in R_{lm}) \qquad (10)$$

The “shrinkage” parameter $0 < \nu \le 1$ controls the learning rate of the procedure. Empirically, small values ($\nu \le 0.1$) lead to much lower generalization error (Friedman, 1999).
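As a small worked instance of Eqs. (9) and (10), consider absolute-error loss, for which the “location” estimate within a terminal region is the median of the current residuals. The sketch below is an assumption-laden illustration (the helper `update_region` and the toy numbers are hypothetical), not the study's implementation.

```python
from statistics import median

def update_region(y, f, region, nu=0.1):
    """Apply Eqs. (9)-(10) to one terminal region under absolute-error loss.

    y: observed responses; f: current fit F_{m-1}(x_i), updated in place;
    region: indices of the observations falling in this terminal node.
    """
    # Eq. (9): for J = |y - F|, the minimizing constant is the median residual.
    gamma = median(y[i] - f[i] for i in region)
    # Eq. (10): move the fit by only a fraction nu of that constant.
    for i in region:
        f[i] += nu * gamma
    return gamma

y = [1.0, 2.0, 10.0]
f = [0.0, 0.0, 0.0]
print(update_region(y, f, [0, 1, 2], nu=0.5), f)
```

The outlying response of 10.0 does not pull the update, which illustrates why the choice of loss criterion matters for skewed geochemical data.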

SGB, suggested by Breiman (1999), incorporated randomness into the subsample of the training data as an integral part of the gradient boosting procedure. For more details on the algorithm of SGB, see the Supplementary material.
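The overall procedure can be sketched end to end for the simplest case. The code below is a deliberately reduced illustration, assuming squared-error loss, a single predictor and two-node trees (stumps); the study itself used R with a Poisson distribution and deeper trees. All function names and default values here are hypothetical.

```python
import random

def fit_stump(xs, rs):
    """Least-squares regression stump (L = 2): returns (split, left_mean, right_mean)."""
    best = None
    for s in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, rs) if x < s]
        right = [r for x, r in zip(xs, rs) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, s, ml, mr)
    if best is None:  # only one distinct x value: predict the mean everywhere
        m = sum(rs) / len(rs)
        return (float("-inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    split, left_mean, right_mean = stump
    return left_mean if x < split else right_mean

def sgb_fit(xs, ys, n_trees=200, nu=0.1, bag_fraction=0.5, seed=0):
    """Stochastic gradient boosting with stumps under squared-error loss."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)                      # initial guess F_0
    fit = [f0] * len(ys)
    stumps = []
    k = max(2, int(bag_fraction * len(ys)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(ys)), k)     # random subsample, no replacement
        res = [ys[i] - fit[i] for i in idx]     # pseudo-residuals on the subsample
        stump = fit_stump([xs[i] for i in idx], res)
        stumps.append(stump)
        # Shrunken update applied to every observation, sampled or not.
        fit = [fi + nu * predict_stump(stump, x) for fi, x in zip(fit, xs)]
    return f0, nu, stumps

def sgb_predict(model, x):
    f0, nu, stumps = model
    return f0 + nu * sum(predict_stump(s, x) for s in stumps)

# Toy step-shaped response, loosely mimicking a distance threshold.
xs = [i / 10 for i in range(40)]
ys = [0.0 if x < 2.0 else 5.0 for x in xs]
model = sgb_fit(xs, ys)
print(sgb_predict(model, 0.5), sgb_predict(model, 3.5))
```

Each tree sees only half of the data, so no single tree can memorize the sample; the small learning rate then averages many such partial views.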

2.4. Relative variable influence in SGB

The variable importance of each predictor in SGB was calculated to obtain the contributions of the heavy metal sources. For more details on the calculation of the variable influence in SGB, see the Supplementary material. The influences were further standardized so that they sum to 100%. The resulting influences can then be used to select the variables. Model interpretation was visualized using partial dependence plots and two-dimensional interaction plots. The SGB was implemented in the R 3.1.2 software environment. Friedman (2001) provided guidelines on the appropriate settings for SGB model fitting options. The model fitting settings are as follows: A Poisson distribution was applied to the model using the concentrations of heavy metals as the response variables, and the interaction depth, which controls the number of nodes in the tree and thus the maximum possible interactions, was set at five nodes. The bagging fraction controls the fraction of the training data randomly selected for calculating each tree and was set to 0.5 for these analyses. The shrinkage rate (set at 0.005) controls the learning speed of the algorithm. The training fraction was left at its default value of 1.0, and the out-of-bag (OOB) method was used to determine the optimal number of boosting iterations.
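The standardization step mentioned above is simple enough to show directly. The sketch assumes raw influences are non-negative numbers keyed by predictor name; the function name and the example values are hypothetical and are not results from the study.

```python
def standardize_influence(raw):
    """Rescale raw variable influences so that they sum to 100%."""
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("raw influences must have a positive sum")
    return {name: 100.0 * value / total for name, value in raw.items()}

# Hypothetical raw influences for three predictors.
raw = {"distance_to_plant": 2.0, "background": 1.0, "metal_in_water": 1.0}
print(standardize_influence(raw))
```

After this rescaling, each value can be read as that predictor's percentage share of the total modeled influence.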

2.5. Random forest

To assess the contributions of the heavy metal sources, variable importance was calculated with random forests using the concentrations of heavy metals as the response variables. A random forest is an ensemble of trees in which each tree is grown on a sample obtained from the training set via bagging without replacement and fitted to the generated samples using random split selection at each node (Supplementary material). The Gini importance for regression forests is a well-known variable importance metric in CART trees and random forests. However, because of the bias of impurities for selecting split variables, the resulting variable importance metrics are also biased (Shih and Tsai, 2004; Strobl et al., 2007). The permutation-based MSE reduction suggested by Breiman (2002) has been employed as the state-of-the-art method of variable importance assessment by many authors (Diaz-Uriarte and de Andres, 2006; Genuer et al., 2008; Ishwaran, 2007). Therefore, this permutation-based “MSE reduction” was also adopted as the random forest importance criterion in the present study (Supplementary material). All variable importance metrics were standardized to sum to 100%. The RF was also implemented in the R 3.1.2 software environment.
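The idea behind permutation-based importance can be sketched independently of any forest implementation. The code below is a simplified illustration: it permutes one predictor at a time on a single evaluation set and records the increase in MSE, whereas random forest implementations typically do this per tree on out-of-bag samples. The function names and the toy model are hypothetical.

```python
import random

def mse(predict, rows, ys):
    return sum((predict(r) - y) ** 2 for r, y in zip(rows, ys)) / len(ys)

def permutation_importance(predict, rows, ys, seed=0):
    """Percent importance per predictor from the MSE increase after permuting it."""
    rng = random.Random(seed)
    base = mse(predict, rows, ys)
    raw = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        rng.shuffle(col)                       # break the link between x_j and y
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, col)]
        raw.append(max(0.0, mse(predict, permuted, ys) - base))
    total = sum(raw)
    return [100.0 * v / total if total else 0.0 for v in raw]

# Toy model in which only the first predictor matters.
rows = [[float(i), float((i * 7) % 5)] for i in range(20)]
ys = [2.0 * r[0] for r in rows]
print(permutation_importance(lambda r: 2.0 * r[0], rows, ys))
```

Because the measure is based on predictive loss rather than on split impurity, it avoids the selection bias noted above for Gini-type importance.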

2.6. Model validation and comparison

Model validation was based on ten runs of ten-fold cross-validation fits. In each run, we tested the agreement between measurements and predictions by calculating the area under the curve (AUC) of the receiver operating characteristic (ROC) approach. Then, the AUC values were averaged (Fielding and Bell, 1997). The percent deviance explained indicates how well the models fit the data. The percent deviance (pseudo-R2) was calculated using the residual deviance/total deviance (Mateo and Hanselman, 2014). The predictive performances of SGB and RF were also compared using pseudo-R2 and AUC values (Supplementary material, Table S2).
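The validation scheme can be sketched as follows. Two assumptions are made and should be read as such: the percent deviance explained is computed with the common convention 100 × (1 − residual deviance / total deviance), and squared-error (Gaussian) deviance is used for simplicity even though the study fitted a Poisson model. The function names are hypothetical.

```python
import random

def repeated_kfold(n, k=10, runs=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated shuffled k-fold splits."""
    rng = random.Random(seed)
    for run in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]                # every k-th shuffled index
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield run, fold, train, test

def percent_deviance_explained(ys, preds):
    """100 * (1 - residual / total), using squared-error deviance (assumed)."""
    mean = sum(ys) / len(ys)
    total = sum((y - mean) ** 2 for y in ys)
    resid = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 100.0 * (1.0 - resid / total)

splits = list(repeated_kfold(250))             # 250 soil samples, as in the study
print(len(splits), len(splits[0][2]), len(splits[0][3]))
print(percent_deviance_explained([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
```

With 250 samples, each of the 100 fits trains on 225 observations and is scored on the remaining 25; averaging the scores over folds and runs reduces the dependence on any one random partition.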

3. Results

3.1. Descriptive statistics and spatial patterns of heavy metal contents

Descriptive statistics of the heavy metal concentrations in the agricultural soils of Dongtang are given in the Supplementary material. The heavy metals in soils originate from several inputs, including the natural background, mining and smelting activities, atmospheric deposition, agrochemicals, water inflow and socio-economic activities. Because we cannot assess the source contributions through concentration measurements alone, spatial distribution maps of Pb and Cd in air, soil and surface water (Supplementary material, Figures S1–S6) might reveal potential sources based on the apparent overlap of hot spots of heavy metal pollution with industrial centers. Therefore, SGB and RF analyses were applied to determine the sources of the heavy metals in soils.

3.2. Source identification of soil Pb pollution

Fig. 2 shows the variable importance of individual predictors for the SGB and RF models of soil Pb concentrations (the red bar denotes SGB and the blue bar RF). The partial dependence plot estimates the effect of a predictor on the modeled response after accounting for all other covariates (Fig. 3). To identify the pairwise interactions, the interpretation was visualized using 3D interaction plots (Fig. 4). Overall, the percentage of deviance explained (pseudo-R2) was 74.3% for the SGB model and 49.3% for the RF model of soil Pb. SGB performed better than RF (see Supplementary material). These results show that the distances to the Huayue, Fankou and Danxia plants and the background Pb value were ranked as the most important predictor variables, and their contributions were 21.6%, 20.5%, 19.9% and 9.6%, respectively. Among all of the predictors, distance to the Huayue plant contributed the most. In contrast, the Pb concentrations in water were ranked as the least important variable (1.7%), followed by fertilizers & pesticides (2.8%) and the Pb concentrations in air (3.1%) (Fig. 2). The anthropogenic inputs clearly explain more of the soil Pb pollution than the natural inputs. Fig. 3 shows that distance to the Huayue plant and distance to the Danxia plant had similar functional relationships with the change in the soil Pb concentration. For sampling sites within 9.5 km of the Huayue plant, no significant relationship between the distance to Huayue and the soil Pb concentration was found. Soil Pb concentrations decreased with increasing distance to Huayue when the distance ranged from 9.5 km to 10 km. Soil Pb concentrations remained stable as the distance increased to 14 km. A strong positive correlation between soil Pb concentration and distance to Huayue was apparent at distances of 14–14.5 km. The distance of the sampling sites from the Danxia plant had an inverse relationship with the soil Pb concentration, with a critical threshold at a distance of 3 km. The soil Pb concentration decreased as the distance between the sampling site and the Fankou plant increased within 6 km. This relationship disappeared at distances greater than 6 km. The findings of RF were approximately consistent with those obtained from SGB. For the RF model, the distances to Fankou, Huayue and Danxia and the background value were ranked as the most important predictor variables, with contributions of 18.1%, 17.4%, 16.0% and 11.5%, respectively. Moreover, Pb in water was the least important variable for explaining the soil Pb content.

Fig. 2. Variable importance of the individual predictors for the SGB and RF models of soil Pb and Cd concentrations (the red bar denotes SGB and the blue bar denotes RF). All eleven predictors were used. The importance plots show their relative percent contributions to predicting the soil heavy metal concentrations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

The 3D interaction plots show a nonlinear relationship between the soil Pb concentrations and the six pairs of the most correlated predictors (Fig. 4). The plots were arrayed according to the correlation coefficients of the pair variables. The effects of two pair variables (the background value and the distance to Huayue, the Pb in air and the distance to Huayue) on the soil Pb concentration were similar, and they affected the soil Pb concentration below 200 mg/kg in a similar pattern. A complicated interaction structure was indicated in the effects of the pair variables, including distance to Danxia and distance to Fankou, road density and distance to Huayue, and distance to Huayue and distance to Fankou. The pair variable of distance to Danxia and distance to Fankou had the strongest effect on soil Pb concentration within 600 mg/kg. These interaction plots could lead to a better understanding of the resulting SGB model and its underlying effects.
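The partial dependence curves discussed in this section can be computed for any fitted model by averaging predictions while one predictor is held at a fixed value. The sketch below is a generic illustration under that definition; the function name and the toy linear model are hypothetical and unrelated to the fitted SGB models.

```python
def partial_dependence(predict, rows, j, grid):
    """Average prediction with predictor j fixed at each grid value.

    predict: callable taking one row (a list of predictor values);
    rows: observed predictor rows; j: index of the predictor of interest.
    """
    curve = []
    for v in grid:
        # Replace column j with v in every row, keep all other covariates as observed.
        preds = [predict(r[:j] + [v] + r[j + 1:]) for r in rows]
        curve.append(sum(preds) / len(preds))
    return curve

# Toy model: response = 3 * x0 + x1.
rows = [[0.0, 1.0], [0.0, 3.0]]
print(partial_dependence(lambda r: 3.0 * r[0] + r[1], rows, 0, [0.0, 1.0, 2.0]))
```

Fixing two predictors at once over a two-dimensional grid gives the surface shown in an interaction plot, at the cost of one averaged prediction per grid cell.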

3.3. Source identification of soil Cd pollution

Fig. 2 shows the variable importance for the SGB and RF models of soil Cd. Partial dependence plots of individual variables are shown in Fig. 3. As for Pb, we chose the six most correlated pairs to estimate the effects of the two-variable interactions on the soil Cd concentration (Fig. 5). The pseudo-R2 was 66.8% for the SGB model and 56.5% for the RF model for soil Cd. Compared to RF, SGB was clearly superior in this application (see Supplementary material). The population density, distance to Danxia, Cd in water, background Cd level, Cd in air and distance to Huayue were the most important predictor variables, explaining 17.3%, 14.3%, 13.0%, 12.4%, 10.7% and 10.3% of the variance, respectively. The soil Cd concentration rose with elevated population density, and this relationship disappeared when the population density was greater than 3 × 10⁵. The soil Cd concentration decreased as the distance between the sampling sites and the Danxia plant increased within 3 km. A positive correlation between the soil Cd concentrations and the Cd in water and air was observed. Soil Cd concentrations below 0.3 mg/kg might be ascribed to air, and those below 4.8 mg/kg could be related to water (Fig. 3). For the RF model, the water Cd concentration, population density, air Cd concentration, distance to Danxia and natural background contributed the most to the soil Cd concentration, explaining 13.7%, 11.2%, 11.1%, 10.9% and 10.1% of the variance, respectively.

Fig. 5 shows that the soil Cd concentrations had a nonlinear and complicated relationship with six pairs of the most correlated predictors. The plots were arranged according to the correlation coefficients of the pair variables. The effect of the distance to Danxia and distance to Huayue pair on soil Cd concentrations was the strongest, and they influenced the soil Cd concentrations ranging from 1.0 mg/kg to 2.8 mg/kg. The distance to Danxia and Cd in water pair variable impacted soil Cd concentrations between 1.7 mg/kg and 2.7 mg/kg, the smallest range. These types of interactions help elucidate the underlying relationship between the sources and heavy metal pollution and are the likely reason for the clear superiority of the SGB method, especially in this application.

Fig. 3. Partial dependence plots for the SGB model of the soil Pb and Cd concentrations (the red line denotes Pb and the blue line denotes Cd). The partial plots show the dependencies of the soil heavy metals on each of the predictors. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

4. Discussion

4.1. Model validation and reliability

Fig. 4. 3D interaction plots for the SGB model of the soil Pb concentrations. The six variable pairs with the largest correlation coefficients were used in the interaction plots.

In this paper, an ensemble-based framework was used to estimate the pollution sources of soil heavy metals on the local scale. This framework, referred to as stochastic gradient boosting, is intended to be a powerful alternative to conventional methods such as clustering analysis and artificial neural networks. By applying a gradient-descent algorithm, SGB analysis allowed the parameters of the models to vary in function space and established considerably stronger relationships with soil heavy metals for each predictor, alone and in combination, than conventional multivariate analysis. SGB was used to increase the predictive ability and to solve two issues in assessing the sources of heavy metals: identifying the sources (choosing the most informative subset of covariates) and evaluating the importance of individual sources (calculating the number of times a variable is selected for splitting). This analysis does not aim to prove that the predictions of other approaches cannot be true; it seeks to frame the question to illustrate that those predictions are weak from a theoretical point of view and do not inevitably play out as expected in the real world. On the other hand, coupled with the growing consensus among researchers (Kabata-Pendias, 2010; Micó et al., 2006; Wong et al., 2002; Zhang et al., 2009) that anthropogenic sources in Dongtang are more important in determining soil heavy metals than natural sources, the SGB method is more robust in capturing reality than the alternative methods. Our results indicate that anthropogenic sources contributed the most to soil Pb and Cd pollution in Dongtang (90.4% and 87.6%, respectively). Metallurgical industries (Supplementary material) can explain a total of 68.1% and 32.2% of soil Pb and Cd, respectively, and the contribution from natural sources is 9.6% for Pb and 12.4% for Cd (Fig. 2). Wang et al. (2010), Guan et al. (2014a) and Kachenko and Singh (2006) reported that anthropogenic inputs, including industrial and mining activities and local economies, were the predominant sources of soil Pb and Cd, which is consistent with our findings. Within a relatively small area, variation in environmental conditions is sufficient to yield very substantial differences in soil heavy metals. These differences are captured by ensemble models along with the variation due to other sources, and the models were verified using rigorous out-of-sample testing.
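The two ingredients described in this subsection, gradient descent in function space on random subsamples and importance measured by how often a variable is chosen for splitting, can be illustrated with a deliberately small sketch. It assumes squared-error loss and one-split trees (stumps); the function names, hyperparameters and synthetic data are all hypothetical and do not reproduce the authors' implementation or settings.

```python
import random
from statistics import fmean


def fit_stump(rows, residuals):
    """Return (feature, threshold, left_mean, right_mean) of the best single split."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r for row, r in zip(rows, residuals) if row[f] <= t]
            right = [r for row, r in zip(rows, residuals) if row[f] > t]
            lm, rm = fmean(left), fmean(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    return None if best is None else best[1:]


def fit_sgb(rows, y, n_trees=30, learning_rate=0.1, subsample=0.5, seed=0):
    """Boost stumps on random subsamples; count how often each feature is split on."""
    rng = random.Random(seed)
    base = fmean(y)
    preds = [base] * len(y)
    stumps, split_counts = [], [0] * len(rows[0])
    k = max(2, int(subsample * len(y)))
    for _ in range(n_trees):
        idx = rng.sample(range(len(y)), k)            # the "stochastic" part
        # Residuals are the negative gradient of squared-error loss.
        residuals = [y[i] - preds[i] for i in idx]
        stump = fit_stump([rows[i] for i in idx], residuals)
        if stump is None:                             # no valid split in this subsample
            continue
        f, t, lm, rm = stump
        stumps.append(stump)
        split_counts[f] += 1
        preds = [p + learning_rate * (lm if row[f] <= t else rm)
                 for p, row in zip(preds, rows)]
    return {"base": base, "rate": learning_rate,
            "stumps": stumps, "split_counts": split_counts}


def predict(model, row):
    """Sum the shrunken stump outputs on top of the baseline mean."""
    out = model["base"]
    for f, t, lm, rm in model["stumps"]:
        out += model["rate"] * (lm if row[f] <= t else rm)
    return out


# Synthetic data: the response depends only on feature 0; feature 1 is noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(40)]
y = [3.0 * row[0] for row in X]
model = fit_sgb(X, y)
```

In this toy setting `model["split_counts"]` should favour feature 0, which is the sense in which split frequency ranks candidate sources. Real applications would use deeper trees, tuned shrinkage and held-out data for validation.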

4.2. Model suitability

Spatial heterogeneity and accumulative processes of heavy metals in soils affect geochemical characteristics and bioavailability and demonstrate the complex, telecoupled and nonlinear nature of soil source-sink system changes. SGB provides a powerful tool to gain insight into the complex relationships between soil heavy metals and their sources with high predictive accuracy. Given that the resulting models of soil Cd involve dependent variables, the results suggest that atmospheric deposition and water inflow are more likely to affect soil heavy metal contents over the long term. Furthermore, atmospheric and water influences tend to be significant at the local scale (Donisa et al., 2000; Zhou et al., 2007b) because few external sources determine the heavy metal concentrations, while local sources play a leading role and generate elevated soil heavy metal concentrations. Overall, it appears that soil Pb and Cd in Dongtang come from mixed inputs, which is consistent with the findings of Guan et al. (2014b) using an isotope method. Effective environmental control of the heavy metals from these sources, and of the associated health risks, will require powerful methods that incorporate knowledge of pollution sources and basic soil characterization. An appropriate modeling approach for exploring information on pollution sources and attempting remediation of heavy metal contaminated soils is scale-dependent. SGB is an effective model for pinpointing pollution sources at the local scale and can aid decision makers in managing contaminated sites cost-effectively while preserving public and ecosystem health.

4.3. Complexity of heavy metal pollution sources

Fig. 5. 3D interaction plots for the SGB model of the soil Cd concentrations. The six variable pairs with the largest correlation coefficients were used in the interaction plots.

The multi-source and multi-phase nature of heavy metal pollution in soil makes the task of identifying and apportioning the diverse sources difficult. Agricultural soils in Dongtang are heavily polluted with heavy metals (13.2% of Pb and 34.8% of Cd samples exceed the class III standards). The contributions of the six source types to soil Pb and Cd contents, respectively, were: background value, 9.6% and 12.4%; atmospheric sources, 3.1% and 10.7%; water sources, 1.7% and 13.0%; urbanization sources, 10.1% and 20.0%; agricultural sources, 7.4% and 11.7%; and industrial sources, 68.1% and 32.2%. The total contribution of the atmospheric and water sources to the soil Cd concentration was 23.7% (Fig. 2). When pollution sources are complex, ensemble models, especially the SGB model, can achieve good results when identifying multiple sources. Among the sources identified, water, soil and air form a system through which soil pollution can be prevented and controlled. Pollution prevention and control of water and air are integral parts of systematic soil pollution remediation. Recent international documents that evaluated the role of agricultural soils in alleviating hunger and promoting sustainability also support the conclusion that pollution prevention and control of water, soil and air is the best option for soil pollution remediation, thus achieving both of these goals (Järup, 2003; Kabata-Pendias, 2010; Lone et al., 2008). In a region where people go hungry amid an abundance of food and where the great majority of the poor live in rural areas or are forced by economic and industrial necessity to live near environmental pollution, economic development at the cost of deteriorating environmental and human health is bound to fail. A comprehensive assessment of the sources of heavy metal pollution in soil using the stochastic gradient boosting approach is more likely to lead to effective control of agricultural soil pollution and to targeted policies that protect soils from long-term heavy metal accumulation, while also increasing food safety and environmental health.
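The source contributions reported at the start of this subsection can be checked with simple arithmetic. The sketch below uses only the percentages given in the text; treating every source type other than the background value as anthropogenic is an inference from the reported totals (for example, 90.4% = 100% − 9.6% for Pb), and the helper name `total` is hypothetical.

```python
# Reported contributions (%) of the six source types to soil Pb and Cd.
contributions = {
    "background":   {"Pb": 9.6,  "Cd": 12.4},
    "atmospheric":  {"Pb": 3.1,  "Cd": 10.7},
    "water":        {"Pb": 1.7,  "Cd": 13.0},
    "urbanization": {"Pb": 10.1, "Cd": 20.0},
    "agricultural": {"Pb": 7.4,  "Cd": 11.7},
    "industrial":   {"Pb": 68.1, "Cd": 32.2},
}


def total(metal, sources=None):
    """Sum the contributions of the given sources (all sources by default)."""
    names = contributions if sources is None else sources
    return round(sum(contributions[s][metal] for s in names), 1)


# Assumption: everything except the background value counts as anthropogenic.
anthropogenic = [s for s in contributions if s != "background"]

pb_total = total("Pb")                            # 100.0
cd_total = total("Cd")                            # 100.0
pb_anthropogenic = total("Pb", anthropogenic)     # 90.4
cd_anthropogenic = total("Cd", anthropogenic)     # 87.6
cd_air_water = total("Cd", ["atmospheric", "water"])  # 23.7
```

The shares for each metal sum to 100%, and the grouped totals match the anthropogenic contributions and the combined atmospheric and water contribution for Cd quoted in the text.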

The authors acknowledge that several types of predictor variables were overlooked by this analysis (i.e., the emissions of heavy metals from different metallurgical industries, vehicle emissions, types of parent materials and local socio-economic conditions). The tendency to conclude that metallurgical activities are the main sources of soil heavy metal pollution might be strengthened if variables describing heavy metal emissions were included. Questions to be further explored involve the processes by which soils accumulate and absorb heavy metals from anthropogenic inputs. Agricultural soils in Dongtang have been polluted by heavy metals for a long time, which makes it challenging to quantitatively assess the contributions from human activities and natural inputs. We often look for more proximate industrial sources or underlying socio-economic sources because they can be correlated directly or indirectly with the observed soil pollution. However, it is difficult to evaluate how much atmospheric deposition and water inflow contribute to soil concentrations when the respective heavy metal concentrations in air and water are acceptable or do not exceed local or national environmental standards. The ensemble models used in this study are robust and effective at identifying multi-source and multi-phase heavy metal pollution in agricultural soils on the local scale. With appropriate selection of predictors relevant to the sources and transport of heavy metals, SGB is a powerful tool for explaining the relationship between the concentrations of soil heavy metals and their sources while being resistant to outliers and over-fitting.

5. Conclusions

The sources of heavy metal pollution in soil were quantitatively assessed on the local scale using SGB and RF ensemble models. The models were verified using rigorous cross-validation procedures. The ensemble models produced good results for the multi-source and multi-phase heavy metal pollution in agricultural soils at the local scale. The results of SGB and RF consistently showed that anthropogenic sources contributed the most to the concentrations of Pb and Cd in the agricultural soils of Dongtang, and SGB performed better than RF. The information provided by our study can help control agricultural soil pollution and develop targeted policies to protect soils from long-term heavy metal accumulation.


Acknowledgments

The current work was financially supported by the National Natural Science Foundation of China (41330857), the Guangdong Province Foundation (CSJ143356) and the “863” Program (2013AA06A209).

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.envpol.2015.06.040.

References

Bhuiyan, M.A., Parvez, L., Islam, M., Dampare, S.B., Suzuki, S., 2010. Heavy metal pollution of coal mine-affected agricultural soils in the northern part of Bangladesh. J. Hazard. Mater. 173, 384–392.

Breiman, L., 1999. Using Adaptive Bagging to Debias Regressions. University of California, Berkeley.

Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32.

Breiman, L., 2002. Manual on Setting up, Using, and Understanding Random Forests V3.1.

De'ath, G., 2007. Boosted trees for ecological modeling and prediction. Ecology 88, 243–251.

Diaz-Uriarte, R., de Andres, S.A., 2006. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 7.

Donisa, C., Mocanu, R., Steinnes, E., Vasu, A., 2000. Heavy metal pollution by atmospheric transport in natural soils from the northern part of eastern Carpathians. Water Air Soil Pollut. 120, 347–358.

Facchinelli, A., Sacchi, E., Mallen, L., 2001. Multivariate statistical and GIS-based approach to identify heavy metal sources in soils. Environ. Pollut. 114, 313–324.

Fielding, A.H., Bell, J.F., 1997. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. 24, 38–49.

Fragkos, C., Rosenbaum, M.S., Ramsey, M.H., Goodyear, K.L., 1998. GIS Techniques for Mapping and Evaluating Sources and Distribution of Heavy Metal Contaminants. In: Geological Society, London, Engineering Geology Special Publications, vol. 15, pp. 365–372.

Friedman, J.H., 1999. Stochastic gradient boosting. Comput. Stat. Data Anal. 38, 367–378.

Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232.

Friedman, J.H., 2006. Recent advances in predictive (machine) learning. J. Classif. 23, 175–197.

Gellrich, M., Zimmermann, N.E., 2007. Investigating the regional-scale pattern of agricultural land abandonment in the Swiss mountains: a spatial statistical modelling approach. Landsc. Urban Plan. 79, 65–76.

Genuer, R., Poggi, J.-M., Tuleau, C., 2008. Random Forests: Some Methodological Insights. Institut National de Recherche en Informatique et en Automatique.

Guan, Y., Shao, C., Ju, M., 2014a. Heavy metal contamination assessment and partition for industrial and mining gathering areas. Int. J. Environ. Res. Public Health 11, 7286–7303.

Guan, Y., Shao, C.F., Ju, M.T., 2014b. Heavy metal contamination assessment and partition for industrial and mining gathering areas. Int. J. Environ. Res. Public Health 11, 7286–7303.

Harmanescu, M., Alda, L.M., Bordean, D.M., Gogoasa, I., Gergen, I., 2011. Heavy metals health risk assessment for population via consumption of vegetables grown in old mining area; a case study: Banat County, Romania. Chem. Central J. 5, 64–64.

Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed. Springer, New York, NY.

Hothorn, T., Hornik, K., Zeileis, A., 2006. Unbiased recursive partitioning: a conditional inference framework. J. Comput. Graph. Stat. 15, 651–674.

Hu, Y.N., Cheng, H.F., 2013. Application of stochastic models in identification and apportionment of heavy metal pollution sources in the surface soils of a large-scale region. Environ. Sci. Technol. 47, 3752–3760.

Hu, Y.N., Liu, X.P., Bai, J.M., Shih, K.M., Zeng, E.Y., Cheng, H.F., 2013. Assessing heavy metal pollution in the surface soils of a region that had undergone three decades of intense industrialization and urbanization. Environ. Sci. Pollut. Res. 20, 6150–6159.

Ishwaran, H., 2007. Variable importance in binary regression trees and forests. Electron. J. Stat. 1, 519–537.

Järup, L., 2003. Hazards of heavy metal contamination. Br. Med. Bull. 68, 167–182.

Kabata-Pendias, A., 2010. Trace Elements in Soils and Plants. CRC Press.

Kachenko, A., Singh, B., 2006. Heavy metals contamination in vegetables grown in urban and metal smelter contaminated sites in Australia. Water Air Soil Pollut. 169, 101–123.

Khan, S., Cao, Q., Zheng, Y.M., Huang, Y.Z., Zhu, Y.G., 2008. Health risks of heavy metals in contaminated soils and food crops irrigated with wastewater in Beijing, China. Environ. Pollut. 152, 686–692.

Lawrence, R., Bunn, A., Powell, S., Zambon, M., 2004. Classification of remotely sensed imagery using stochastic gradient boosting as a refinement of classification tree analysis. Remote Sens. Environ. 90, 331–336.

Lin, Y.P., Cheng, B.Y., Shyu, G.S., Chang, T.K., 2010. Combining a finite mixture distribution model with indicator kriging to delineate and map the spatial patterns of soil heavy metal pollution in Chunghua County, central Taiwan. Environ. Pollut. 158, 235–244.

Lone, M.I., He, Z.-l., Stoffella, P.J., Yang, X.-e, 2008. Phytoremediation of heavy metal polluted soils and water: progresses and perspectives. J. Zhejiang Univ. Sci. B 9, 210–220.

Maloney, K.O., Schmid, M., Weller, D.E., 2012. Applying additive modelling and gradient boosting to assess the effects of watershed and reach characteristics on riverine assemblages. Methods Ecol. Evol. 3, 116–128.

Mateo, I., Hanselman, D.H., 2014. A Comparison of Statistical Methods to Standardize Catch-per-unit-effort of the Alaska Longline Sablefish, NOAA Technical Memorandum ed. U.S. Department of Commerce.

Micó, C., Recatalá, L., Peris, M., Sánchez, J., 2006. Assessing heavy metal sources in agricultural soils of an European Mediterranean area by multivariate analysis. Chemosphere 65, 863–872.

Mohammed, A., Kapri, A., Goel, R., 2011. Heavy metal pollution: source, impact, and remedies. In: Khan, M.S., Zaidi, A., Goel, R., Musarrat, J. (Eds.), Biomanagement of Metal-contaminated Soils. Springer, Netherlands, pp. 1–28.

Moisen, G.G., Freeman, E.A., Blackard, J.A., Frescino, T.S., Zimmermann, N.E., Edwards, T.C., 2006. Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecol. Model. 199, 176–187.

Qishlaqi, A., Moore, F., 2007. Statistical analysis of accumulation and sources of heavy metals occurrence in agricultural soils of Khoshk River Banks, Shiraz, Iran. Am. Eurasian J. Agric. Environ. Sci. 2, 565–573.

Reza, R., Singh, G., 2010. Heavy metal contamination and its indexing approach for river water. Int. J. Environ. Sci. Technol. 7, 785–792.

Shih, Y.S., Tsai, H.W., 2004. Variable selection bias in regression trees with constant fits. Comput. Stat. Data Anal. 45, 595–607.

Soares, H., Boaventura, R., Machado, A., Esteves da Silva, J., 1999. Sediments as monitors of heavy metal contamination in the Ave river basin (Portugal): multivariate analysis of data. Environ. Pollut. 105, 311–323.

Strobl, C., Boulesteix, A.L., Zeileis, A., Hothorn, T., 2007. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinforma. 8.

Van Kerckhove, G., 2012. Toxic capitalism: the orgy of consumerism and waste: are we the last generation on earth? AuthorHouse, 58–87.

Wang, X., Wang, F.H., Chen, B., Sun, F.F., He, W., Wen, D., Liu, X.X., Wang, Q.F., 2012. Comparing the health risk of toxic metals through vegetable consumption between industrial polluted and non-polluted fields in Shaoguan, south China. J. Food Agric. Environ. 10, 943–948.

Wang, Z., Chai, L., Yang, Z., Wang, Y., Wang, H., 2010. Identifying sources and assessing potential risk of heavy metals in soils from direct exposure to children in a mine-impacted city, Changsha, China. J. Environ. Qual. 39, 1616–1623.

Wong, S., Li, X., Zhang, G., Qi, S., Min, Y., 2002. Heavy metals in agricultural soils of the Pearl River Delta, South China. Environ. Pollut. 119, 33–44.

Wu, Q., Leung, J.Y.S., Geng, X., Chen, S., Huang, X., Li, H., Huang, Z., Zhu, L., Chen, J., Lu, Y., 2015. Heavy metal contamination of soil and water in the vicinity of an abandoned e-waste recycling site: implications for dissemination of heavy metals. Sci. Total Environ. 506–507, 217–225.

Yongming, H., Peixuan, D., Junji, C., Posmentier, E.S., 2006. Multivariate analysis of heavy metal contamination in urban dusts of Xi'an, Central China. Sci. Total Environ. 355, 176–186.

Zhang, X., Lin, F., Wong, M.T., Feng, X., Wang, K., 2009. Identification of soil heavy metal sources from anthropogenic activities and pollution assessment of Fuyang County, China. Environ. Monit. Assess. 154, 439–449.

Zhou, F., Guo, H., Hao, Z., 2007a. Spatial distribution of heavy metals in Hong Kong's marine sediments and their human impacts: a GIS-based chemometric approach. Mar. Pollut. Bull. 54, 1372–1384.

Zhou, J.-M., Dang, Z., Cai, M.-F., Liu, C.-Q., 2007b. Soil heavy metal pollution around the dabaoshan mine, Guangdong Province, China. Pedosphere 17, 588–594.

Zovko, M., Romic, M., 2011. Soil contamination by trace metals: geochemical behaviour as an element of risk assessment. Earth and environmental sciences. InTech Rij. 437–456.