Evaluation of the Effectiveness of Multiple Machine ...

17
Research Article Evaluation of the Effectiveness of Multiple Machine Learning Methods in Remote Sensing Quantitative Retrieval of Suspended Matter Concentrations: A Case Study of Nansi Lake in North China Xiuyu Liu , 1 Zhen Zhang , 1 Tao Jiang , 1 Xuehua Li , 1 and Yanyi Li 2 1 College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266500, China 2 College of Surveying and Geo-Informatics, Tongji University, Shanghai 200092, China Correspondence should be addressed to Tao Jiang; [email protected] Received 20 April 2021; Revised 30 July 2021; Accepted 5 August 2021; Published 14 August 2021 Academic Editor: Arnaud Cuisset Copyright © 2021 Xiuyu Liu et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Total suspended matter (TSM) is a core parameter in the quantitative retrieval of ocean color remote sensing and an important indicator for evaluating the quality of the aquatic environment. is study selects part of Nansi Lake in North China as the study area. Researchers used Hyperion remote sensing data and field-measured TSM concentration as data sources. Firstly, the characteristic variables with high correlation were selected based on spectral analysis. en, seven methods such as linear re- gression, BP neural network (BP), KNN, random forest (RF), and random forest based on genetic algorithm optimization (GA_RF) are used to construct the inversion model of TSM concentration. e retrieval accuracy of each model shows that the machine learning models are much more accurate than the linear model. Among them, the GA_RF model retrieves the suspended solids concentration with the best performance and the highest prediction accuracy, with a determination coefficient R 2 of 0.98, a root mean square error (RMSE) of 1.715mg/L, and an average relative error (ARE) of 6.83%. Additionally, the spatial distribution of TSM concentration was inversed by Hyperion remote sensing image. e results showed that the concentration of TSM was lower in the northwest and higher in the southeast, and the concentration distribution was uneven, showing the characteristics of a typical shallow macrophytic lake. is study provides an effective method for monitoring TSM concentration and other water quality parameters in the shallow macrophytic lake and further proves the advantages of machine learning in ocean color inversion. All in all, this research provides some useful methods and suggestions for quantitative inversion of TSM concentration in shallow macrophytic lakes. 1. Introduction TSM is one of the important parameters of water quality, which directly affects the transparency and turbidity of the water body [1] and then affects the growth of aquatic or- ganisms and the primary productivity of the water body [2]. Traditional water quality monitoring mainly adopts the method of fixed section and fixed point for sampling analysis, which not only is costly and greatly affected by external conditions but also has defects in large-scale real- time detection. As one of the important methods of envi- ronmental investigation and detection, remote sensing technology can quickly obtain water quality information of large areas of water. With the advantages of rapid, real-time, large-scale, periodic dynamic detection, remote sensing has become an important method to detect the spatial and temporal distribution of TSM [3]. Ocean color remote sensing provides an effective way to monitor the three components of ocean color and their spatiotemporal changes. e sensor receives the spectral signal from the water body, analyzes the element infor- mation in the water, and then establishes a model to retrieve the component concentration in the water. e element information refers to the concentration of certain substances in the water, such as the concentration of suspended matter, the concentration of chlorophyll, and the concentration of CDOM (Colored Dissolved Organic Matter). e remote sensing retrieval methods of TSM concentration in water Hindawi Journal of Spectroscopy Volume 2021, Article ID 5957376, 17 pages https://doi.org/10.1155/2021/5957376

Transcript of Evaluation of the Effectiveness of Multiple Machine ...

Page 1: Evaluation of the Effectiveness of Multiple Machine ...

Research ArticleEvaluation of the Effectiveness of Multiple Machine LearningMethods in Remote Sensing Quantitative Retrieval of SuspendedMatterConcentrations ACase Study ofNansi Lake inNorthChina

Xiuyu Liu 1 Zhen Zhang 1 Tao Jiang 1 Xuehua Li 1 and Yanyi Li 2

1College of Geodesy and Geomatics Shandong University of Science and Technology Qingdao 266500 China2College of Surveying and Geo-Informatics Tongji University Shanghai 200092 China

Correspondence should be addressed to Tao Jiang tj801020921sdusteducn

Received 20 April 2021 Revised 30 July 2021 Accepted 5 August 2021 Published 14 August 2021

Academic Editor Arnaud Cuisset

Copyright copy 2021 Xiuyu Liu et al )is is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Total suspended matter (TSM) is a core parameter in the quantitative retrieval of ocean color remote sensing and an importantindicator for evaluating the quality of the aquatic environment )is study selects part of Nansi Lake in North China as the studyarea Researchers used Hyperion remote sensing data and field-measured TSM concentration as data sources Firstly thecharacteristic variables with high correlation were selected based on spectral analysis )en seven methods such as linear re-gression BP neural network (BP) KNN random forest (RF) and random forest based on genetic algorithm optimization(GA_RF) are used to construct the inversion model of TSM concentration )e retrieval accuracy of each model shows that themachine learning models are muchmore accurate than the linear model Among them the GA_RFmodel retrieves the suspendedsolids concentration with the best performance and the highest prediction accuracy with a determination coefficient R2 of 098 aroot mean square error (RMSE) of 1715mgL and an average relative error (ARE) of 683 Additionally the spatial distributionof TSM concentration was inversed by Hyperion remote sensing image )e results showed that the concentration of TSM waslower in the northwest and higher in the southeast and the concentration distribution was uneven showing the characteristics of atypical shallow macrophytic lake )is study provides an effective method for monitoring TSM concentration and other waterquality parameters in the shallow macrophytic lake and further proves the advantages of machine learning in ocean colorinversion All in all this research provides some useful methods and suggestions for quantitative inversion of TSM concentrationin shallow macrophytic lakes

1 Introduction

TSM is one of the important parameters of water qualitywhich directly affects the transparency and turbidity of thewater body [1] and then affects the growth of aquatic or-ganisms and the primary productivity of the water body [2]Traditional water quality monitoring mainly adopts themethod of fixed section and fixed point for samplinganalysis which not only is costly and greatly affected byexternal conditions but also has defects in large-scale real-time detection As one of the important methods of envi-ronmental investigation and detection remote sensingtechnology can quickly obtain water quality information oflarge areas of water With the advantages of rapid real-time

large-scale periodic dynamic detection remote sensing hasbecome an important method to detect the spatial andtemporal distribution of TSM [3]

Ocean color remote sensing provides an effective way tomonitor the three components of ocean color and theirspatiotemporal changes )e sensor receives the spectralsignal from the water body analyzes the element infor-mation in the water and then establishes a model to retrievethe component concentration in the water )e elementinformation refers to the concentration of certain substancesin the water such as the concentration of suspended matterthe concentration of chlorophyll and the concentration ofCDOM (Colored Dissolved Organic Matter) )e remotesensing retrieval methods of TSM concentration in water

HindawiJournal of SpectroscopyVolume 2021 Article ID 5957376 17 pageshttpsdoiorg10115520215957376

bodies can be divided into analytical semianalytical andempirical methods )e core of the above analysis method isthe bio-optical model [4] which is based on the radiativetransfer theory )e theory mainly analyzes the absorptioncharacteristics of upward radiation and optically activesubstances in water and analyzes the relationship betweenthese characteristics and backscattering characteristicsFurthermore the remote sensing reflectance spectrumcharacteristics of the actual water body are combined withthe inherent optical quantity of water components and thenthe water quality parameters are retrieved [5] )is type ofmethod has a clear physical meaning and the retrieval resultsare more reliable but it requires measurements of the in-trinsic optical properties of water components and the al-gorithms are difficult to establish)e semianalytical methodis a combination of known spectral characteristics of waterquality parameters and statistical models to select the op-timal band or a combination of bands as variables for remotesensing retrieval which has strong reliability and applica-bility and is widely used in the estimation of parameters ofCase II water [6 7] )e empirical method is based on thestatistical relationship between measured TSM concentra-tion data and remotely sensed spectral information to re-trieve TSM concentrations It is the most commonly usedocean color remote sensing retrieval algorithm at presentdue to its relatively simple method and fewer parameters)e empirical methods search for the best band or bandcombination to build a mathematical model mainly in-cluding single-band model [8 9] band ratio model [10] andmultiple regression model [11] With the continuous de-velopment and combination of computer technology andremote sensing technology the application of machinelearning in the field of remote sensing image processing hasgradually become a research hotspot Case II waters such asrivers lakes and near-shore marine waters are opticallycomplex and poorly suited to the use of linear models forstudying Case I waters in isolation [12] Machine learninghas the advantage of solving complex nonlinear problemsFor example Louis et al combined a neural network modelwith TM images to simulate suspended sediment concen-trations and the results demonstrated that the networkmodel performs better on nonlinear problems compared totraditional regression analysis [13] Chen et al used amultilayer backpropagation neural network (MBPNN)model to estimate TSS concentrations in the coastal zone ofeastern China and obtained results were all more accuratethan those of empirical models [14] )e construction of aneural network model requires a large number of samplesfor training and although a large number of samples canhelp improve the accuracy of the model the actual numberof water quality sampling points is affected by weatherconditions cost and other factors and fewer sample pointscan be obtained However due to factors such as weatherand costs the number of sampling points obtained is rel-atively small For instance if the sample point is located inthe pond or lake area under bad weather conditions therepeated measurement of the sampling point data com-pletely depends on manual work multiple repeated mea-surements also mean that the collected samples need to be

processed many times at the same time so the number ofactual water quality sampling points is generally small Incontrast support vector machines and random forest al-gorithms are more suitable for analyzing the little number ofsamples For example Park et al used both artificial neuralnetworks (ANN) and support vector machines (SVM) topredict Chl-a concentrations in Juam and Yeongsan res-ervoirs and the result was that the support vector machinemodel had higher prediction accuracy compared to ANN[15] By constructing a random forest model Fang Xinruiet al estimated the suspended sediment concentration in theriver section fromYichang to Chenglingji downstream of theChinese )ree Gorges Dam for each month before and afterthe dam construction [16] Overall the empirical method isrelatively simple has greater potential for developmentwhen combined with artificial intelligence techniques and isone of the common methods currently used for remotesensing retrieval of TSM concentrations

)e study area used in this study is the Nansi Lake regionin North China which is located in the northern part of theHuaihe River Basin at the junction of the provinces ofJiangsu Shandong Henan and Anhui and is a shallowmacrophytic lake As an important water storage lake of theEast Route of South-to-North Water Transfer Project it hasthe functions of flood control drainage breeding tourismand so forth [17] )e quality of the water in the area hasbeen gradually declining and neglected by the developmentof local industry and urbanization for some time so it ismeaningful to detect the TSM concentration in the water inthe area Meanwhile this study uses Hyperion hyperspectralsatellite imagery along with field-measured data to conductremote sensing retrieval studies of TSM concentrations inNansi Lake Models such as linear regression BP neuralnetwork KNN AdaBoost RF GA_RF and decision tree areconstructed using empirical methods By comparing theretrieval accuracies of different models the optimal model isused to retrieve the spatial distribution of TSMconcentrations

2 Data and Methods

21 Study Area Nansi Lake is located in the southwest ofShandong Province and belongs to Weishan County JiningCity Shandong Province China with geographical coor-dinates of 116deg34primeEndash117deg21primeE 34deg27primeNndash35deg20primeN (Figure 1)and belongs to the temperate monsoon climate It includesWeishan Lake Zhaoyang Lake Dushan Lake and NanyangLake with a total area of 1266 km2 and an average depth of15m It is the largest and best-preserved inland largefreshwater shallow macrophytic lake in the area north of theYellow River It is an important water source and waterstorage lake of the East Route of South-to-North WaterTransfer Project )erefore it is of great significance to useremote sensing to monitor its water quality

22 Remote Sensing Data and Measured Sample Data)is research uses Hyperion hyperspectral remote sensingdata Hyperion is a hyperspectral sensor carried by an Earth

2 Journal of Spectroscopy

observation satellite named Earth Observing-1 (EO-1) Inthe range of 355ndash2577 nm the Hyperion data provide 242bands 35 visible 35 near-infrared and 172 short-waveinfrared bands It has a spectral resolution of 10 nm and aspatial resolution of 30m A total of 26 sample points weremeasured in this study and the data were collected on July31 2010 with a clear and cloudless sky no wind and a calmwater surface )e highest value of TSM concentration at thesample sites was 548mgL the lowest value was 4mgL andthe mean value was 3031mgL )e image of July 26 2010was selected to extract the spectral data of the sampling sites)e imaging date of this hyperspectral image is most suitablewith the measured data and the distribution of samplingpoints is shown in Figure 1

3 Methodology

31 Backpropagation Algorithm (BP) )e backpropagation(BP) neural network algorithm has the characteristics of self-adaptability strong learning ability and fault tolerance BPneural networks combined with remote sensing technologycan effectively make a nonlinear prediction of water qualityparameters [18]

)e BP neural network algorithm is a multilayer feed-forward network trained according to the backpropagationalgorithm which is divided into input layers hidden layersand output layers )e core idea is the gradient descentmethod which modifies the weight values and thresholdvalues along the negative gradient direction of the function

to minimize the difference between the actual output of thenetwork and desired output [19] )e topology of the net-work structure is shown in Figure 2 )e network has twoprocesses forward propagation of operating signal andbackpropagation of error signal If the difference betweenthe output result and the expected result exceeds the re-quirement the error signal will be propagated from theoutput layer to the input layer and the weight values andthreshold values of each layer of the network will be adjustedto meet the error limitation

32 K-Nearest Neighbors (KNN) Algorithm )e K-nearestneighbor (KNN) algorithm a theoretically mature methodis widely used in remote sensing quantitative inversion)esample to be tested can be estimated with its nearestK-neighboring samples and the attributes of the samplecan be obtained by assigning the average of the attributes ofthe K-neighboring samples to the sample to be tested [20])e process is described as follows first calculate thedistance between the sample to be tested and all othersamples and then find the K-nearest neighbors of thesample to be tested and finally average the attributes of theselected K-neighbors to obtain the attributes of the sampleto be tested [21] )e advantage of using K sample pointswhen estimating pixel-level TSM concentrations using thismethod is that it reduces random variation due to noiseinternal variation and misalignment of sample pointcoordinates

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE116deg40prime0PrimeE

(a) (b)(c)117deg0prime0PrimeE 117deg20prime0PrimeE

116deg40prime0PrimeE 117deg0prime0PrimeE 117deg20prime0PrimeE 116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

34deg3

5prime0Prime

N35

deg0prime0Prime

N35

deg25prime

0PrimeN

34deg3

5prime0Prime

N35

deg0prime0Prime

N35

deg25prime

0PrimeN

N

Sampling Point

Location of Nan-Si Lake Remote Sensing Image of Nan-Si Lake

0 15 30 60 Kilometers

Kilometers0 15 3 6

Figure 1 Location and sampling point distribution of study area (a) Satellite strip map showing the location of the satellite image withinthe study area (b) )e location of the study area as shown by the red mark which is a typical shallow-grass lake area (c) Sampling pointdistribution map which shows the spatial distribution of sampling points in the study area

Journal of Spectroscopy 3

33 Decision Tree Regression (Decision Tree) )e decisiontree regression uses classification and regression trees withthe Gini coefficient for constructing a tree Using thetraining dataset to construct a decision tree of a certain scalelevel the validation dataset is used to prune the constructedtree and select the best subtree that best meets the actualrequirements at the cost of losing some information in thepruning process In this study the measure of uncertainty ofthe data at the time of execution of the algorithm is the Giniindex and also the Gini index is used to determine theoptimal classification of the variables

34 AdaBoost Algorithm (AdaBoost) )e AdaBoost algo-rithm is a widely used iterative algorithm as well as a boostingalgorithm with adaptive capabilities )e basic idea of thismethod is as follows first strengthen the training of weakclassifiers and then integrate trained weak classifiers to obtainthe final classifier with enough strength level After each it-eration the weights of the samples are adjusted and thesamples with larger fitting errors will increase the corre-sponding weight values )e weak learner obtains a sequenceof functions on the predicted values by iterative operationsand each prediction function is assigned a weight )efunction with better prediction results has a larger corre-sponding weight and after several iterations the final stronglearner is obtained by weighting the weak learner function[22] )e main idea is to integrate multiple weak learners toget the output of strong learners tomake accurate predictions

35 Random Forest (RF) Algorithm Random forest wasproposed by Leo Breiman in 2001 as a machine learningalgorithm formed by combining the ideas of boost aggre-gation integration and feature random selection [23] )ebasic idea of the algorithm is described as follows the classifierconsists of some classification regression trees and the finalresult of the classification is decided by voting Assuming thatthere are T variables when each tree is split into nodes togenerate a decision tree t (tltT) variables are randomly se-lected as candidates for node splitting with the same numberof variables at each node and the optimal branches are se-lected according to the branching optimality criterion Eachtree is recursive from top to bottom until the splitting con-dition is met )e predicted value of the dependent variable is

obtained by averaging the predictions of all trees [24])e keyto the algorithm is to determine the number of variables andthe number of decision trees )e algorithm does not overfitas the number of trees increases has good generalizationperformance is more robust and is suitable for dealing withhigh-dimensional nonlinear complex problems [25 26]

36 Optimisation and Selection of Parameters for GeneticAlgorithms (GA_RF) )e core idea of genetic algorithmcomes from the evolution of biology It is a global proba-bilistic random search algorithm It has been widely used inthe field of artificial intelligence and has a lot of experiencefor reference in solving optimization problems )ereforethe genetic algorithm is adopted in this study to optimize theparameter selection process of the random forest algorithmand the optimal combination of parameters is found withless manual intervention

In this study the optimized algorithm is defined asGA_RF algorithm )e basic idea of the algorithm is tosimulate the process of biological genetic evolution First thepopulation is initialized and each chromosome represents asolution )e fitness function measures the quality of thesolution and determines the ldquoparentsrdquo of the next genera-tion )en the next-generation population is generatedthrough crossover and mutation [27 28] In the process ofgenetic algorithm optimization one chromosome representsa different combination of two parameters of random forestalgorithm After that the effect of different combinationswas evaluated by the fitness function and the combinationmethod with the best accuracy was finally obtained

Zhou et al [29] took the number of features and thenumber of decision trees as the variables to be optimized inthe genetic algorithm and finally evolved into a better pa-rameter combination )is method can ensure computa-tional efficiency and improve the effectiveness of randomforest classification )erefore this study combines geneticalgorithm with random forest algorithm and the two pa-rameters of optimal design are the number of features andthe number of decision trees In this study the specificalgorithm parameter settings are shown in Table 1

As shown in Table 1 a certain amount of the initialpopulation is generated within the value range of m and kAfter that the fitness of the initial population is calculated Ifthe termination condition is not met a new ldquopopulationrdquo isgenerated through selection crossover and variogram untilthe set final condition is reached In other words when theldquopopulationrdquo evolves to the ideal effect the optimal solutionof the combination of two parameters of the random forestcan be output )e advantage of using genetic algorithm tooptimize stochastic forest model is that it can reduce manualintervention avoid local optimal solutions and can find theglobal optimal solution in complex space

4 Experimental Process and Preprocessing

41 Experimental Process Firstly the Hyperion remotesensing image is preprocessed and then the spectrum dataare extracted from the corresponding sample points the

Input

Input Layer Hidden layer Output Layer

Output

hellip

Figure 2 BP neural network structure diagram

4 Journal of Spectroscopy

spectral characteristics and sensitive bands of the TSMconcentration are analyzed and the high correlation featurevariables are screened On this basis different methods areused to establish the retrieval model for accuracy evaluationand finally the spatial distribution of the retrieved TSMconcentration in the study area was analyzed and theprocess of the experiment is shown in Figure 3

42 Data Preprocessing

421 Removing the Uncalibrated and Vapor-Affected BandsAmong the 242 bands of Hyperion data 8sim57 and 77sim224bands are radiometric calibration bands the rest of thebands are not calibrated and are set to 0 value 77sim78 bandsoverlap with 56sim57 bands and the latter is retained due tothe large noise while 121sim126 167sim180 and 222sim224 bandsare influenced by water vapor which also needs to be re-moved [30] and the remaining bands are recombined

422 Radiation Calibration and Atmospheric Correction)e radiometric correction is to convert the pixel value of theimage into an absolute radiometric value (formula (1))Atmospheric correction is to get the true reflectance ofground objects by eliminating the influence of atmospherelight and other environmental factors )e above processesare done in the module corresponding to ENVI53

L(k) C0(k) + C1(k)lowastDN(k) (1)

where k represents the band number and C0 and C1 arecorrection coefficients (offset and gain coefficientsrespectively)

423 Water Information Extraction Before analyzing theconcentration of TSM in the selected waters the nonwaterbody information should be removed to focus on the waterbody information in the study area )e indices popularlyused to extract water body information are NDWI [31](equation (2)) and MNDWI [32] (equation (3)) proposed byHanqiu Xu using the short-wave infrared band instead of thenear-infrared band in NDWI

NDWI G minus NIRG + NIR

(2)

MNDWI G minus SWIRG + SWIR

(3)

In the above two equations G NIR and SWIR are greenband near-infrared band and the short-wave infrared bandcorresponding to bands 21 50 and 146 in Hyperion data)e watershed information was extracted and merged usingNDWI and MNDWI respectively Since the watershed ofthe study area is a shallow macrophytic lake the waters arenot flat and the purpose of smoothing its interior is toremove patches

43 TSMSpectral Characteristics and Sensitive BandAnalysis)e atmospherically corrected satellite data is correlatedwith the components and their contents in the water column[33] In this study the measured TSM values are correlatedwith the spectral reflectance of the corresponding points)e spectral curves of the sampling sites are shown inFigure 4(a) )e spectral profile in the visible band is similarto that of vegetation mainly because the study area is ashallow macrophytic lake with many algae )e reflectionpeak at 690ndash780 nm is a typical spectral feature of waterbodies containing algae and the reflection peak at810ndash820 nm is formed due to the scattering of light by TSMAs the concentration of TSM increases the position of thereflection peak shifts towards the long wave direction )ePearson correlation coefficients for the atmosphericallycorrected reflectance and TSM concentration at each bandare shown in Figure 4(b))ey are negatively correlated with508 nm with a correlation coefficient of minus041609 positivelycorrelated with 1124 nm with a correlation coefficient of042522 and positively correlated with 844 nm with a cor-relation coefficient of 04211

In addition single bands were combined to analyze theircorrelation with TSM concentrations

(1) )e band difference algorithm the reflectance data ofany two bands are subjected to a different operation(B1minusB2) Since the correlation coefficients of theband difference calculation and the differential cal-culation are consistent the results of the two algo-rithms are comparable [34]

(2) Band ratio algorithm the reflectance of any twobands is calculated by the ratio (B1B2)

(3) Band normalized difference index (NDI) algorithmthe ratio of the difference of any two bands to thesum (B1minusB2)(B1 +B2)

In this study combined with the spectral characteristicsand correlation analysis of suspended solids concentrationas shown in Table 2 32 single band and band combination

Table 1 GA_RF algorithm parameter configuration

Attribute set A detailed description)e encoding type Real number codingGenetic algorithm chromosome length value 2Characteristics of the number m 1ltmlt 32Number of decision trees k 1lt klt 1000Genetic algorithm population size 10Genetic algorithm mutation rate 005Number of iterations of genetic algorithm evolution 500

Journal of Spectroscopy 5

Hyperion image

Preprocessing

Spectral analysis

Input featurevariables

Mapping andanalysis

Retrieved results

Precision evaluation

Measured data

Experimental algorithms

LinearAlgorithms BPNN KNN Adaboost RF GA_RF

Figure 3 Flowchart for TSM concentration retrieved

500 1000 1500 2000 2500-500

0

500

1000

1500

2000

2500

3000

64268376548

40208280540

TSM concentration

Wavelength (nm)

Refle

ctan

ce

(a)

500 1000 1500 2000 2500

-04

-02

00

02

04

Wavelength (nm)

Pear

son

Cor

relat

ion

Coe

ffici

ent

(b)

Figure 4 Spectral curves and correlation coefficients of sampling points

Table 2 Feature vector

Feature vector Pearsonrsquos correlation coefficientBand difference algorithm(49804ndash50822) nm 05527(209284ndash213324) nm 05773(52857ndash56927) nm 06112(164890ndash221392) nm 06248

6 Journal of Spectroscopy

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 2: Evaluation of the Effectiveness of Multiple Machine ...

bodies can be divided into analytical semianalytical andempirical methods )e core of the above analysis method isthe bio-optical model [4] which is based on the radiativetransfer theory )e theory mainly analyzes the absorptioncharacteristics of upward radiation and optically activesubstances in water and analyzes the relationship betweenthese characteristics and backscattering characteristicsFurthermore the remote sensing reflectance spectrumcharacteristics of the actual water body are combined withthe inherent optical quantity of water components and thenthe water quality parameters are retrieved [5] )is type ofmethod has a clear physical meaning and the retrieval resultsare more reliable but it requires measurements of the in-trinsic optical properties of water components and the al-gorithms are difficult to establish)e semianalytical methodis a combination of known spectral characteristics of waterquality parameters and statistical models to select the op-timal band or a combination of bands as variables for remotesensing retrieval which has strong reliability and applica-bility and is widely used in the estimation of parameters ofCase II water [6 7] )e empirical method is based on thestatistical relationship between measured TSM concentra-tion data and remotely sensed spectral information to re-trieve TSM concentrations It is the most commonly usedocean color remote sensing retrieval algorithm at presentdue to its relatively simple method and fewer parameters)e empirical methods search for the best band or bandcombination to build a mathematical model mainly in-cluding single-band model [8 9] band ratio model [10] andmultiple regression model [11] With the continuous de-velopment and combination of computer technology andremote sensing technology the application of machinelearning in the field of remote sensing image processing hasgradually become a research hotspot Case II waters such asrivers lakes and near-shore marine waters are opticallycomplex and poorly suited to the use of linear models forstudying Case I waters in isolation [12] Machine learninghas the advantage of solving complex nonlinear problemsFor example Louis et al combined a neural network modelwith TM images to simulate suspended sediment concen-trations and the results demonstrated that the networkmodel performs better on nonlinear problems compared totraditional regression analysis [13] Chen et al used amultilayer backpropagation neural network (MBPNN)model to estimate TSS concentrations in the coastal zone ofeastern China and obtained results were all more accuratethan those of empirical models [14] )e construction of aneural network model requires a large number of samplesfor training and although a large number of samples canhelp improve the accuracy of the model the actual numberof water quality sampling points is affected by weatherconditions cost and other factors and fewer sample pointscan be obtained However due to factors such as weatherand costs the number of sampling points obtained is rel-atively small For instance if the sample point is located inthe pond or lake area under bad weather conditions therepeated measurement of the sampling point data com-pletely depends on manual work multiple repeated mea-surements also mean that the collected samples need to be

processed many times at the same time so the number ofactual water quality sampling points is generally small Incontrast support vector machines and random forest al-gorithms are more suitable for analyzing the little number ofsamples For example Park et al used both artificial neuralnetworks (ANN) and support vector machines (SVM) topredict Chl-a concentrations in Juam and Yeongsan res-ervoirs and the result was that the support vector machinemodel had higher prediction accuracy compared to ANN[15] By constructing a random forest model Fang Xinruiet al estimated the suspended sediment concentration in theriver section fromYichang to Chenglingji downstream of theChinese )ree Gorges Dam for each month before and afterthe dam construction [16] Overall the empirical method isrelatively simple has greater potential for developmentwhen combined with artificial intelligence techniques and isone of the common methods currently used for remotesensing retrieval of TSM concentrations

)e study area used in this study is the Nansi Lake regionin North China which is located in the northern part of theHuaihe River Basin at the junction of the provinces ofJiangsu Shandong Henan and Anhui and is a shallowmacrophytic lake As an important water storage lake of theEast Route of South-to-North Water Transfer Project it hasthe functions of flood control drainage breeding tourismand so forth [17] )e quality of the water in the area hasbeen gradually declining and neglected by the developmentof local industry and urbanization for some time so it ismeaningful to detect the TSM concentration in the water inthe area Meanwhile this study uses Hyperion hyperspectralsatellite imagery along with field-measured data to conductremote sensing retrieval studies of TSM concentrations inNansi Lake Models such as linear regression BP neuralnetwork KNN AdaBoost RF GA_RF and decision tree areconstructed using empirical methods By comparing theretrieval accuracies of different models the optimal model isused to retrieve the spatial distribution of TSMconcentrations

2 Data and Methods

21 Study Area Nansi Lake is located in the southwest ofShandong Province and belongs to Weishan County JiningCity Shandong Province China with geographical coor-dinates of 116deg34primeEndash117deg21primeE 34deg27primeNndash35deg20primeN (Figure 1)and belongs to the temperate monsoon climate It includesWeishan Lake Zhaoyang Lake Dushan Lake and NanyangLake with a total area of 1266 km2 and an average depth of15m It is the largest and best-preserved inland largefreshwater shallow macrophytic lake in the area north of theYellow River It is an important water source and waterstorage lake of the East Route of South-to-North WaterTransfer Project )erefore it is of great significance to useremote sensing to monitor its water quality

22 Remote Sensing Data and Measured Sample Data)is research uses Hyperion hyperspectral remote sensingdata Hyperion is a hyperspectral sensor carried by an Earth

2 Journal of Spectroscopy

observation satellite named Earth Observing-1 (EO-1) Inthe range of 355ndash2577 nm the Hyperion data provide 242bands 35 visible 35 near-infrared and 172 short-waveinfrared bands It has a spectral resolution of 10 nm and aspatial resolution of 30m A total of 26 sample points weremeasured in this study and the data were collected on July31 2010 with a clear and cloudless sky no wind and a calmwater surface )e highest value of TSM concentration at thesample sites was 548mgL the lowest value was 4mgL andthe mean value was 3031mgL )e image of July 26 2010was selected to extract the spectral data of the sampling sites)e imaging date of this hyperspectral image is most suitablewith the measured data and the distribution of samplingpoints is shown in Figure 1

3 Methodology

31 Backpropagation Algorithm (BP) )e backpropagation(BP) neural network algorithm has the characteristics of self-adaptability strong learning ability and fault tolerance BPneural networks combined with remote sensing technologycan effectively make a nonlinear prediction of water qualityparameters [18]

)e BP neural network algorithm is a multilayer feed-forward network trained according to the backpropagationalgorithm which is divided into input layers hidden layersand output layers )e core idea is the gradient descentmethod which modifies the weight values and thresholdvalues along the negative gradient direction of the function

to minimize the difference between the actual output of thenetwork and desired output [19] )e topology of the net-work structure is shown in Figure 2 )e network has twoprocesses forward propagation of operating signal andbackpropagation of error signal If the difference betweenthe output result and the expected result exceeds the re-quirement the error signal will be propagated from theoutput layer to the input layer and the weight values andthreshold values of each layer of the network will be adjustedto meet the error limitation

32 K-Nearest Neighbors (KNN) Algorithm )e K-nearestneighbor (KNN) algorithm a theoretically mature methodis widely used in remote sensing quantitative inversion)esample to be tested can be estimated with its nearestK-neighboring samples and the attributes of the samplecan be obtained by assigning the average of the attributes ofthe K-neighboring samples to the sample to be tested [20])e process is described as follows first calculate thedistance between the sample to be tested and all othersamples and then find the K-nearest neighbors of thesample to be tested and finally average the attributes of theselected K-neighbors to obtain the attributes of the sampleto be tested [21] )e advantage of using K sample pointswhen estimating pixel-level TSM concentrations using thismethod is that it reduces random variation due to noiseinternal variation and misalignment of sample pointcoordinates

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE116deg40prime0PrimeE

(a) (b)(c)117deg0prime0PrimeE 117deg20prime0PrimeE

116deg40prime0PrimeE 117deg0prime0PrimeE 117deg20prime0PrimeE 116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

34deg3

5prime0Prime

N35

deg0prime0Prime

N35

deg25prime

0PrimeN

34deg3

5prime0Prime

N35

deg0prime0Prime

N35

deg25prime

0PrimeN

N

Sampling Point

Location of Nan-Si Lake Remote Sensing Image of Nan-Si Lake

0 15 30 60 Kilometers

Kilometers0 15 3 6

Figure 1 Location and sampling point distribution of study area (a) Satellite strip map showing the location of the satellite image withinthe study area (b) )e location of the study area as shown by the red mark which is a typical shallow-grass lake area (c) Sampling pointdistribution map which shows the spatial distribution of sampling points in the study area

Journal of Spectroscopy 3

33 Decision Tree Regression (Decision Tree) )e decisiontree regression uses classification and regression trees withthe Gini coefficient for constructing a tree Using thetraining dataset to construct a decision tree of a certain scalelevel the validation dataset is used to prune the constructedtree and select the best subtree that best meets the actualrequirements at the cost of losing some information in thepruning process In this study the measure of uncertainty ofthe data at the time of execution of the algorithm is the Giniindex and also the Gini index is used to determine theoptimal classification of the variables

34 AdaBoost Algorithm (AdaBoost) )e AdaBoost algo-rithm is a widely used iterative algorithm as well as a boostingalgorithm with adaptive capabilities )e basic idea of thismethod is as follows first strengthen the training of weakclassifiers and then integrate trained weak classifiers to obtainthe final classifier with enough strength level After each it-eration the weights of the samples are adjusted and thesamples with larger fitting errors will increase the corre-sponding weight values )e weak learner obtains a sequenceof functions on the predicted values by iterative operationsand each prediction function is assigned a weight )efunction with better prediction results has a larger corre-sponding weight and after several iterations the final stronglearner is obtained by weighting the weak learner function[22] )e main idea is to integrate multiple weak learners toget the output of strong learners tomake accurate predictions

35 Random Forest (RF) Algorithm Random forest wasproposed by Leo Breiman in 2001 as a machine learningalgorithm formed by combining the ideas of boost aggre-gation integration and feature random selection [23] )ebasic idea of the algorithm is described as follows the classifierconsists of some classification regression trees and the finalresult of the classification is decided by voting Assuming thatthere are T variables when each tree is split into nodes togenerate a decision tree t (tltT) variables are randomly se-lected as candidates for node splitting with the same numberof variables at each node and the optimal branches are se-lected according to the branching optimality criterion Eachtree is recursive from top to bottom until the splitting con-dition is met )e predicted value of the dependent variable is

obtained by averaging the predictions of all trees [24])e keyto the algorithm is to determine the number of variables andthe number of decision trees )e algorithm does not overfitas the number of trees increases has good generalizationperformance is more robust and is suitable for dealing withhigh-dimensional nonlinear complex problems [25 26]

36 Optimisation and Selection of Parameters for GeneticAlgorithms (GA_RF) )e core idea of genetic algorithmcomes from the evolution of biology It is a global proba-bilistic random search algorithm It has been widely used inthe field of artificial intelligence and has a lot of experiencefor reference in solving optimization problems )ereforethe genetic algorithm is adopted in this study to optimize theparameter selection process of the random forest algorithmand the optimal combination of parameters is found withless manual intervention

In this study the optimized algorithm is defined asGA_RF algorithm )e basic idea of the algorithm is tosimulate the process of biological genetic evolution First thepopulation is initialized and each chromosome represents asolution )e fitness function measures the quality of thesolution and determines the ldquoparentsrdquo of the next genera-tion )en the next-generation population is generatedthrough crossover and mutation [27 28] In the process ofgenetic algorithm optimization one chromosome representsa different combination of two parameters of random forestalgorithm After that the effect of different combinationswas evaluated by the fitness function and the combinationmethod with the best accuracy was finally obtained

Zhou et al [29] took the number of features and thenumber of decision trees as the variables to be optimized inthe genetic algorithm and finally evolved into a better pa-rameter combination )is method can ensure computa-tional efficiency and improve the effectiveness of randomforest classification )erefore this study combines geneticalgorithm with random forest algorithm and the two pa-rameters of optimal design are the number of features andthe number of decision trees In this study the specificalgorithm parameter settings are shown in Table 1

As shown in Table 1 a certain amount of the initialpopulation is generated within the value range of m and kAfter that the fitness of the initial population is calculated Ifthe termination condition is not met a new ldquopopulationrdquo isgenerated through selection crossover and variogram untilthe set final condition is reached In other words when theldquopopulationrdquo evolves to the ideal effect the optimal solutionof the combination of two parameters of the random forestcan be output )e advantage of using genetic algorithm tooptimize stochastic forest model is that it can reduce manualintervention avoid local optimal solutions and can find theglobal optimal solution in complex space

4 Experimental Process and Preprocessing

41 Experimental Process Firstly the Hyperion remotesensing image is preprocessed and then the spectrum dataare extracted from the corresponding sample points the

Input

Input Layer Hidden layer Output Layer

Output

hellip

Figure 2 BP neural network structure diagram

4 Journal of Spectroscopy

spectral characteristics and sensitive bands of the TSMconcentration are analyzed and the high correlation featurevariables are screened On this basis different methods areused to establish the retrieval model for accuracy evaluationand finally the spatial distribution of the retrieved TSMconcentration in the study area was analyzed and theprocess of the experiment is shown in Figure 3

42 Data Preprocessing

421 Removing the Uncalibrated and Vapor-Affected BandsAmong the 242 bands of Hyperion data 8sim57 and 77sim224bands are radiometric calibration bands the rest of thebands are not calibrated and are set to 0 value 77sim78 bandsoverlap with 56sim57 bands and the latter is retained due tothe large noise while 121sim126 167sim180 and 222sim224 bandsare influenced by water vapor which also needs to be re-moved [30] and the remaining bands are recombined

422 Radiation Calibration and Atmospheric Correction)e radiometric correction is to convert the pixel value of theimage into an absolute radiometric value (formula (1))Atmospheric correction is to get the true reflectance ofground objects by eliminating the influence of atmospherelight and other environmental factors )e above processesare done in the module corresponding to ENVI53

L(k) C0(k) + C1(k)lowastDN(k) (1)

where k represents the band number and C0 and C1 arecorrection coefficients (offset and gain coefficientsrespectively)

423 Water Information Extraction Before analyzing theconcentration of TSM in the selected waters the nonwaterbody information should be removed to focus on the waterbody information in the study area )e indices popularlyused to extract water body information are NDWI [31](equation (2)) and MNDWI [32] (equation (3)) proposed byHanqiu Xu using the short-wave infrared band instead of thenear-infrared band in NDWI

NDWI G minus NIRG + NIR

(2)

MNDWI G minus SWIRG + SWIR

(3)

In the above two equations G NIR and SWIR are greenband near-infrared band and the short-wave infrared bandcorresponding to bands 21 50 and 146 in Hyperion data)e watershed information was extracted and merged usingNDWI and MNDWI respectively Since the watershed ofthe study area is a shallow macrophytic lake the waters arenot flat and the purpose of smoothing its interior is toremove patches

43 TSMSpectral Characteristics and Sensitive BandAnalysis)e atmospherically corrected satellite data is correlatedwith the components and their contents in the water column[33] In this study the measured TSM values are correlatedwith the spectral reflectance of the corresponding points)e spectral curves of the sampling sites are shown inFigure 4(a) )e spectral profile in the visible band is similarto that of vegetation mainly because the study area is ashallow macrophytic lake with many algae )e reflectionpeak at 690ndash780 nm is a typical spectral feature of waterbodies containing algae and the reflection peak at810ndash820 nm is formed due to the scattering of light by TSMAs the concentration of TSM increases the position of thereflection peak shifts towards the long wave direction )ePearson correlation coefficients for the atmosphericallycorrected reflectance and TSM concentration at each bandare shown in Figure 4(b))ey are negatively correlated with508 nm with a correlation coefficient of minus041609 positivelycorrelated with 1124 nm with a correlation coefficient of042522 and positively correlated with 844 nm with a cor-relation coefficient of 04211

In addition single bands were combined to analyze theircorrelation with TSM concentrations

(1) )e band difference algorithm the reflectance data ofany two bands are subjected to a different operation(B1minusB2) Since the correlation coefficients of theband difference calculation and the differential cal-culation are consistent the results of the two algo-rithms are comparable [34]

(2) Band ratio algorithm the reflectance of any twobands is calculated by the ratio (B1B2)

(3) Band normalized difference index (NDI) algorithmthe ratio of the difference of any two bands to thesum (B1minusB2)(B1 +B2)

In this study combined with the spectral characteristicsand correlation analysis of suspended solids concentrationas shown in Table 2 32 single band and band combination

Table 1 GA_RF algorithm parameter configuration

Attribute set A detailed description)e encoding type Real number codingGenetic algorithm chromosome length value 2Characteristics of the number m 1ltmlt 32Number of decision trees k 1lt klt 1000Genetic algorithm population size 10Genetic algorithm mutation rate 005Number of iterations of genetic algorithm evolution 500

Journal of Spectroscopy 5

Hyperion image

Preprocessing

Spectral analysis

Input featurevariables

Mapping andanalysis

Retrieved results

Precision evaluation

Measured data

Experimental algorithms

LinearAlgorithms BPNN KNN Adaboost RF GA_RF

Figure 3 Flowchart for TSM concentration retrieved

500 1000 1500 2000 2500-500

0

500

1000

1500

2000

2500

3000

64268376548

40208280540

TSM concentration

Wavelength (nm)

Refle

ctan

ce

(a)

500 1000 1500 2000 2500

-04

-02

00

02

04

Wavelength (nm)

Pear

son

Cor

relat

ion

Coe

ffici

ent

(b)

Figure 4 Spectral curves and correlation coefficients of sampling points

Table 2 Feature vector

Feature vector Pearsonrsquos correlation coefficientBand difference algorithm(49804ndash50822) nm 05527(209284ndash213324) nm 05773(52857ndash56927) nm 06112(164890ndash221392) nm 06248

6 Journal of Spectroscopy

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 3: Evaluation of the Effectiveness of Multiple Machine ...

observation satellite named Earth Observing-1 (EO-1) Inthe range of 355ndash2577 nm the Hyperion data provide 242bands 35 visible 35 near-infrared and 172 short-waveinfrared bands It has a spectral resolution of 10 nm and aspatial resolution of 30m A total of 26 sample points weremeasured in this study and the data were collected on July31 2010 with a clear and cloudless sky no wind and a calmwater surface )e highest value of TSM concentration at thesample sites was 548mgL the lowest value was 4mgL andthe mean value was 3031mgL )e image of July 26 2010was selected to extract the spectral data of the sampling sites)e imaging date of this hyperspectral image is most suitablewith the measured data and the distribution of samplingpoints is shown in Figure 1

3 Methodology

31 Backpropagation Algorithm (BP) )e backpropagation(BP) neural network algorithm has the characteristics of self-adaptability strong learning ability and fault tolerance BPneural networks combined with remote sensing technologycan effectively make a nonlinear prediction of water qualityparameters [18]

)e BP neural network algorithm is a multilayer feed-forward network trained according to the backpropagationalgorithm which is divided into input layers hidden layersand output layers )e core idea is the gradient descentmethod which modifies the weight values and thresholdvalues along the negative gradient direction of the function

to minimize the difference between the actual output of thenetwork and desired output [19] )e topology of the net-work structure is shown in Figure 2 )e network has twoprocesses forward propagation of operating signal andbackpropagation of error signal If the difference betweenthe output result and the expected result exceeds the re-quirement the error signal will be propagated from theoutput layer to the input layer and the weight values andthreshold values of each layer of the network will be adjustedto meet the error limitation

32 K-Nearest Neighbors (KNN) Algorithm )e K-nearestneighbor (KNN) algorithm a theoretically mature methodis widely used in remote sensing quantitative inversion)esample to be tested can be estimated with its nearestK-neighboring samples and the attributes of the samplecan be obtained by assigning the average of the attributes ofthe K-neighboring samples to the sample to be tested [20])e process is described as follows first calculate thedistance between the sample to be tested and all othersamples and then find the K-nearest neighbors of thesample to be tested and finally average the attributes of theselected K-neighbors to obtain the attributes of the sampleto be tested [21] )e advantage of using K sample pointswhen estimating pixel-level TSM concentrations using thismethod is that it reduces random variation due to noiseinternal variation and misalignment of sample pointcoordinates

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE116deg40prime0PrimeE

(a) (b)(c)117deg0prime0PrimeE 117deg20prime0PrimeE

116deg40prime0PrimeE 117deg0prime0PrimeE 117deg20prime0PrimeE 116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

34deg3

5prime0Prime

N35

deg0prime0Prime

N35

deg25prime

0PrimeN

34deg3

5prime0Prime

N35

deg0prime0Prime

N35

deg25prime

0PrimeN

N

Sampling Point

Location of Nan-Si Lake Remote Sensing Image of Nan-Si Lake

0 15 30 60 Kilometers

Kilometers0 15 3 6

Figure 1 Location and sampling point distribution of study area (a) Satellite strip map showing the location of the satellite image withinthe study area (b) )e location of the study area as shown by the red mark which is a typical shallow-grass lake area (c) Sampling pointdistribution map which shows the spatial distribution of sampling points in the study area

Journal of Spectroscopy 3

33 Decision Tree Regression (Decision Tree) )e decisiontree regression uses classification and regression trees withthe Gini coefficient for constructing a tree Using thetraining dataset to construct a decision tree of a certain scalelevel the validation dataset is used to prune the constructedtree and select the best subtree that best meets the actualrequirements at the cost of losing some information in thepruning process In this study the measure of uncertainty ofthe data at the time of execution of the algorithm is the Giniindex and also the Gini index is used to determine theoptimal classification of the variables

34 AdaBoost Algorithm (AdaBoost) )e AdaBoost algo-rithm is a widely used iterative algorithm as well as a boostingalgorithm with adaptive capabilities )e basic idea of thismethod is as follows first strengthen the training of weakclassifiers and then integrate trained weak classifiers to obtainthe final classifier with enough strength level After each it-eration the weights of the samples are adjusted and thesamples with larger fitting errors will increase the corre-sponding weight values )e weak learner obtains a sequenceof functions on the predicted values by iterative operationsand each prediction function is assigned a weight )efunction with better prediction results has a larger corre-sponding weight and after several iterations the final stronglearner is obtained by weighting the weak learner function[22] )e main idea is to integrate multiple weak learners toget the output of strong learners tomake accurate predictions

35 Random Forest (RF) Algorithm Random forest wasproposed by Leo Breiman in 2001 as a machine learningalgorithm formed by combining the ideas of boost aggre-gation integration and feature random selection [23] )ebasic idea of the algorithm is described as follows the classifierconsists of some classification regression trees and the finalresult of the classification is decided by voting Assuming thatthere are T variables when each tree is split into nodes togenerate a decision tree t (tltT) variables are randomly se-lected as candidates for node splitting with the same numberof variables at each node and the optimal branches are se-lected according to the branching optimality criterion Eachtree is recursive from top to bottom until the splitting con-dition is met )e predicted value of the dependent variable is

obtained by averaging the predictions of all trees [24])e keyto the algorithm is to determine the number of variables andthe number of decision trees )e algorithm does not overfitas the number of trees increases has good generalizationperformance is more robust and is suitable for dealing withhigh-dimensional nonlinear complex problems [25 26]

36 Optimisation and Selection of Parameters for GeneticAlgorithms (GA_RF) )e core idea of genetic algorithmcomes from the evolution of biology It is a global proba-bilistic random search algorithm It has been widely used inthe field of artificial intelligence and has a lot of experiencefor reference in solving optimization problems )ereforethe genetic algorithm is adopted in this study to optimize theparameter selection process of the random forest algorithmand the optimal combination of parameters is found withless manual intervention

In this study the optimized algorithm is defined asGA_RF algorithm )e basic idea of the algorithm is tosimulate the process of biological genetic evolution First thepopulation is initialized and each chromosome represents asolution )e fitness function measures the quality of thesolution and determines the ldquoparentsrdquo of the next genera-tion )en the next-generation population is generatedthrough crossover and mutation [27 28] In the process ofgenetic algorithm optimization one chromosome representsa different combination of two parameters of random forestalgorithm After that the effect of different combinationswas evaluated by the fitness function and the combinationmethod with the best accuracy was finally obtained

Zhou et al [29] took the number of features and thenumber of decision trees as the variables to be optimized inthe genetic algorithm and finally evolved into a better pa-rameter combination )is method can ensure computa-tional efficiency and improve the effectiveness of randomforest classification )erefore this study combines geneticalgorithm with random forest algorithm and the two pa-rameters of optimal design are the number of features andthe number of decision trees In this study the specificalgorithm parameter settings are shown in Table 1

As shown in Table 1 a certain amount of the initialpopulation is generated within the value range of m and kAfter that the fitness of the initial population is calculated Ifthe termination condition is not met a new ldquopopulationrdquo isgenerated through selection crossover and variogram untilthe set final condition is reached In other words when theldquopopulationrdquo evolves to the ideal effect the optimal solutionof the combination of two parameters of the random forestcan be output )e advantage of using genetic algorithm tooptimize stochastic forest model is that it can reduce manualintervention avoid local optimal solutions and can find theglobal optimal solution in complex space

4 Experimental Process and Preprocessing

41 Experimental Process Firstly the Hyperion remotesensing image is preprocessed and then the spectrum dataare extracted from the corresponding sample points the

Input

Input Layer Hidden layer Output Layer

Output

hellip

Figure 2 BP neural network structure diagram

4 Journal of Spectroscopy

spectral characteristics and sensitive bands of the TSMconcentration are analyzed and the high correlation featurevariables are screened On this basis different methods areused to establish the retrieval model for accuracy evaluationand finally the spatial distribution of the retrieved TSMconcentration in the study area was analyzed and theprocess of the experiment is shown in Figure 3

42 Data Preprocessing

421 Removing the Uncalibrated and Vapor-Affected BandsAmong the 242 bands of Hyperion data 8sim57 and 77sim224bands are radiometric calibration bands the rest of thebands are not calibrated and are set to 0 value 77sim78 bandsoverlap with 56sim57 bands and the latter is retained due tothe large noise while 121sim126 167sim180 and 222sim224 bandsare influenced by water vapor which also needs to be re-moved [30] and the remaining bands are recombined

422 Radiation Calibration and Atmospheric Correction)e radiometric correction is to convert the pixel value of theimage into an absolute radiometric value (formula (1))Atmospheric correction is to get the true reflectance ofground objects by eliminating the influence of atmospherelight and other environmental factors )e above processesare done in the module corresponding to ENVI53

L(k) C0(k) + C1(k)lowastDN(k) (1)

where k represents the band number and C0 and C1 arecorrection coefficients (offset and gain coefficientsrespectively)

423 Water Information Extraction Before analyzing theconcentration of TSM in the selected waters the nonwaterbody information should be removed to focus on the waterbody information in the study area )e indices popularlyused to extract water body information are NDWI [31](equation (2)) and MNDWI [32] (equation (3)) proposed byHanqiu Xu using the short-wave infrared band instead of thenear-infrared band in NDWI

NDWI G minus NIRG + NIR

(2)

MNDWI G minus SWIRG + SWIR

(3)

In the above two equations G NIR and SWIR are greenband near-infrared band and the short-wave infrared bandcorresponding to bands 21 50 and 146 in Hyperion data)e watershed information was extracted and merged usingNDWI and MNDWI respectively Since the watershed ofthe study area is a shallow macrophytic lake the waters arenot flat and the purpose of smoothing its interior is toremove patches

43 TSMSpectral Characteristics and Sensitive BandAnalysis)e atmospherically corrected satellite data is correlatedwith the components and their contents in the water column[33] In this study the measured TSM values are correlatedwith the spectral reflectance of the corresponding points)e spectral curves of the sampling sites are shown inFigure 4(a) )e spectral profile in the visible band is similarto that of vegetation mainly because the study area is ashallow macrophytic lake with many algae )e reflectionpeak at 690ndash780 nm is a typical spectral feature of waterbodies containing algae and the reflection peak at810ndash820 nm is formed due to the scattering of light by TSMAs the concentration of TSM increases the position of thereflection peak shifts towards the long wave direction )ePearson correlation coefficients for the atmosphericallycorrected reflectance and TSM concentration at each bandare shown in Figure 4(b))ey are negatively correlated with508 nm with a correlation coefficient of minus041609 positivelycorrelated with 1124 nm with a correlation coefficient of042522 and positively correlated with 844 nm with a cor-relation coefficient of 04211

In addition single bands were combined to analyze theircorrelation with TSM concentrations

(1) )e band difference algorithm the reflectance data ofany two bands are subjected to a different operation(B1minusB2) Since the correlation coefficients of theband difference calculation and the differential cal-culation are consistent the results of the two algo-rithms are comparable [34]

(2) Band ratio algorithm the reflectance of any twobands is calculated by the ratio (B1B2)

(3) Band normalized difference index (NDI) algorithmthe ratio of the difference of any two bands to thesum (B1minusB2)(B1 +B2)

In this study combined with the spectral characteristicsand correlation analysis of suspended solids concentrationas shown in Table 2 32 single band and band combination

Table 1 GA_RF algorithm parameter configuration

Attribute set A detailed description)e encoding type Real number codingGenetic algorithm chromosome length value 2Characteristics of the number m 1ltmlt 32Number of decision trees k 1lt klt 1000Genetic algorithm population size 10Genetic algorithm mutation rate 005Number of iterations of genetic algorithm evolution 500

Journal of Spectroscopy 5

Hyperion image

Preprocessing

Spectral analysis

Input featurevariables

Mapping andanalysis

Retrieved results

Precision evaluation

Measured data

Experimental algorithms

LinearAlgorithms BPNN KNN Adaboost RF GA_RF

Figure 3 Flowchart for TSM concentration retrieved

500 1000 1500 2000 2500-500

0

500

1000

1500

2000

2500

3000

64268376548

40208280540

TSM concentration

Wavelength (nm)

Refle

ctan

ce

(a)

500 1000 1500 2000 2500

-04

-02

00

02

04

Wavelength (nm)

Pear

son

Cor

relat

ion

Coe

ffici

ent

(b)

Figure 4 Spectral curves and correlation coefficients of sampling points

Table 2 Feature vector

Feature vector Pearsonrsquos correlation coefficientBand difference algorithm(49804ndash50822) nm 05527(209284ndash213324) nm 05773(52857ndash56927) nm 06112(164890ndash221392) nm 06248

6 Journal of Spectroscopy

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 4: Evaluation of the Effectiveness of Multiple Machine ...

33 Decision Tree Regression (Decision Tree) )e decisiontree regression uses classification and regression trees withthe Gini coefficient for constructing a tree Using thetraining dataset to construct a decision tree of a certain scalelevel the validation dataset is used to prune the constructedtree and select the best subtree that best meets the actualrequirements at the cost of losing some information in thepruning process In this study the measure of uncertainty ofthe data at the time of execution of the algorithm is the Giniindex and also the Gini index is used to determine theoptimal classification of the variables

34 AdaBoost Algorithm (AdaBoost) )e AdaBoost algo-rithm is a widely used iterative algorithm as well as a boostingalgorithm with adaptive capabilities )e basic idea of thismethod is as follows first strengthen the training of weakclassifiers and then integrate trained weak classifiers to obtainthe final classifier with enough strength level After each it-eration the weights of the samples are adjusted and thesamples with larger fitting errors will increase the corre-sponding weight values )e weak learner obtains a sequenceof functions on the predicted values by iterative operationsand each prediction function is assigned a weight )efunction with better prediction results has a larger corre-sponding weight and after several iterations the final stronglearner is obtained by weighting the weak learner function[22] )e main idea is to integrate multiple weak learners toget the output of strong learners tomake accurate predictions

35 Random Forest (RF) Algorithm Random forest wasproposed by Leo Breiman in 2001 as a machine learningalgorithm formed by combining the ideas of boost aggre-gation integration and feature random selection [23] )ebasic idea of the algorithm is described as follows the classifierconsists of some classification regression trees and the finalresult of the classification is decided by voting Assuming thatthere are T variables when each tree is split into nodes togenerate a decision tree t (tltT) variables are randomly se-lected as candidates for node splitting with the same numberof variables at each node and the optimal branches are se-lected according to the branching optimality criterion Eachtree is recursive from top to bottom until the splitting con-dition is met )e predicted value of the dependent variable is

obtained by averaging the predictions of all trees [24])e keyto the algorithm is to determine the number of variables andthe number of decision trees )e algorithm does not overfitas the number of trees increases has good generalizationperformance is more robust and is suitable for dealing withhigh-dimensional nonlinear complex problems [25 26]

36 Optimisation and Selection of Parameters for GeneticAlgorithms (GA_RF) )e core idea of genetic algorithmcomes from the evolution of biology It is a global proba-bilistic random search algorithm It has been widely used inthe field of artificial intelligence and has a lot of experiencefor reference in solving optimization problems )ereforethe genetic algorithm is adopted in this study to optimize theparameter selection process of the random forest algorithmand the optimal combination of parameters is found withless manual intervention

In this study the optimized algorithm is defined asGA_RF algorithm )e basic idea of the algorithm is tosimulate the process of biological genetic evolution First thepopulation is initialized and each chromosome represents asolution )e fitness function measures the quality of thesolution and determines the ldquoparentsrdquo of the next genera-tion )en the next-generation population is generatedthrough crossover and mutation [27 28] In the process ofgenetic algorithm optimization one chromosome representsa different combination of two parameters of random forestalgorithm After that the effect of different combinationswas evaluated by the fitness function and the combinationmethod with the best accuracy was finally obtained

Zhou et al [29] took the number of features and thenumber of decision trees as the variables to be optimized inthe genetic algorithm and finally evolved into a better pa-rameter combination )is method can ensure computa-tional efficiency and improve the effectiveness of randomforest classification )erefore this study combines geneticalgorithm with random forest algorithm and the two pa-rameters of optimal design are the number of features andthe number of decision trees In this study the specificalgorithm parameter settings are shown in Table 1

As shown in Table 1 a certain amount of the initialpopulation is generated within the value range of m and kAfter that the fitness of the initial population is calculated Ifthe termination condition is not met a new ldquopopulationrdquo isgenerated through selection crossover and variogram untilthe set final condition is reached In other words when theldquopopulationrdquo evolves to the ideal effect the optimal solutionof the combination of two parameters of the random forestcan be output )e advantage of using genetic algorithm tooptimize stochastic forest model is that it can reduce manualintervention avoid local optimal solutions and can find theglobal optimal solution in complex space

4 Experimental Process and Preprocessing

41 Experimental Process Firstly the Hyperion remotesensing image is preprocessed and then the spectrum dataare extracted from the corresponding sample points the

Input

Input Layer Hidden layer Output Layer

Output

hellip

Figure 2 BP neural network structure diagram

4 Journal of Spectroscopy

spectral characteristics and sensitive bands of the TSMconcentration are analyzed and the high correlation featurevariables are screened On this basis different methods areused to establish the retrieval model for accuracy evaluationand finally the spatial distribution of the retrieved TSMconcentration in the study area was analyzed and theprocess of the experiment is shown in Figure 3

42 Data Preprocessing

421 Removing the Uncalibrated and Vapor-Affected BandsAmong the 242 bands of Hyperion data 8sim57 and 77sim224bands are radiometric calibration bands the rest of thebands are not calibrated and are set to 0 value 77sim78 bandsoverlap with 56sim57 bands and the latter is retained due tothe large noise while 121sim126 167sim180 and 222sim224 bandsare influenced by water vapor which also needs to be re-moved [30] and the remaining bands are recombined

422 Radiation Calibration and Atmospheric Correction)e radiometric correction is to convert the pixel value of theimage into an absolute radiometric value (formula (1))Atmospheric correction is to get the true reflectance ofground objects by eliminating the influence of atmospherelight and other environmental factors )e above processesare done in the module corresponding to ENVI53

L(k) C0(k) + C1(k)lowastDN(k) (1)

where k represents the band number and C0 and C1 arecorrection coefficients (offset and gain coefficientsrespectively)

423 Water Information Extraction Before analyzing theconcentration of TSM in the selected waters the nonwaterbody information should be removed to focus on the waterbody information in the study area )e indices popularlyused to extract water body information are NDWI [31](equation (2)) and MNDWI [32] (equation (3)) proposed byHanqiu Xu using the short-wave infrared band instead of thenear-infrared band in NDWI

NDWI G minus NIRG + NIR

(2)

MNDWI G minus SWIRG + SWIR

(3)

In the above two equations G NIR and SWIR are greenband near-infrared band and the short-wave infrared bandcorresponding to bands 21 50 and 146 in Hyperion data)e watershed information was extracted and merged usingNDWI and MNDWI respectively Since the watershed ofthe study area is a shallow macrophytic lake the waters arenot flat and the purpose of smoothing its interior is toremove patches

43 TSMSpectral Characteristics and Sensitive BandAnalysis)e atmospherically corrected satellite data is correlatedwith the components and their contents in the water column[33] In this study the measured TSM values are correlatedwith the spectral reflectance of the corresponding points)e spectral curves of the sampling sites are shown inFigure 4(a) )e spectral profile in the visible band is similarto that of vegetation mainly because the study area is ashallow macrophytic lake with many algae )e reflectionpeak at 690ndash780 nm is a typical spectral feature of waterbodies containing algae and the reflection peak at810ndash820 nm is formed due to the scattering of light by TSMAs the concentration of TSM increases the position of thereflection peak shifts towards the long wave direction )ePearson correlation coefficients for the atmosphericallycorrected reflectance and TSM concentration at each bandare shown in Figure 4(b))ey are negatively correlated with508 nm with a correlation coefficient of minus041609 positivelycorrelated with 1124 nm with a correlation coefficient of042522 and positively correlated with 844 nm with a cor-relation coefficient of 04211

In addition single bands were combined to analyze theircorrelation with TSM concentrations

(1) )e band difference algorithm the reflectance data ofany two bands are subjected to a different operation(B1minusB2) Since the correlation coefficients of theband difference calculation and the differential cal-culation are consistent the results of the two algo-rithms are comparable [34]

(2) Band ratio algorithm the reflectance of any twobands is calculated by the ratio (B1B2)

(3) Band normalized difference index (NDI) algorithmthe ratio of the difference of any two bands to thesum (B1minusB2)(B1 +B2)

In this study combined with the spectral characteristicsand correlation analysis of suspended solids concentrationas shown in Table 2 32 single band and band combination

Table 1 GA_RF algorithm parameter configuration

Attribute set A detailed description)e encoding type Real number codingGenetic algorithm chromosome length value 2Characteristics of the number m 1ltmlt 32Number of decision trees k 1lt klt 1000Genetic algorithm population size 10Genetic algorithm mutation rate 005Number of iterations of genetic algorithm evolution 500

Journal of Spectroscopy 5

Hyperion image

Preprocessing

Spectral analysis

Input featurevariables

Mapping andanalysis

Retrieved results

Precision evaluation

Measured data

Experimental algorithms

LinearAlgorithms BPNN KNN Adaboost RF GA_RF

Figure 3 Flowchart for TSM concentration retrieved

500 1000 1500 2000 2500-500

0

500

1000

1500

2000

2500

3000

64268376548

40208280540

TSM concentration

Wavelength (nm)

Refle

ctan

ce

(a)

500 1000 1500 2000 2500

-04

-02

00

02

04

Wavelength (nm)

Pear

son

Cor

relat

ion

Coe

ffici

ent

(b)

Figure 4 Spectral curves and correlation coefficients of sampling points

Table 2 Feature vector

Feature vector Pearsonrsquos correlation coefficientBand difference algorithm(49804ndash50822) nm 05527(209284ndash213324) nm 05773(52857ndash56927) nm 06112(164890ndash221392) nm 06248

6 Journal of Spectroscopy

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 5: Evaluation of the Effectiveness of Multiple Machine ...

spectral characteristics and sensitive bands of the TSMconcentration are analyzed and the high correlation featurevariables are screened On this basis different methods areused to establish the retrieval model for accuracy evaluationand finally the spatial distribution of the retrieved TSMconcentration in the study area was analyzed and theprocess of the experiment is shown in Figure 3

42 Data Preprocessing

421 Removing the Uncalibrated and Vapor-Affected BandsAmong the 242 bands of Hyperion data 8sim57 and 77sim224bands are radiometric calibration bands the rest of thebands are not calibrated and are set to 0 value 77sim78 bandsoverlap with 56sim57 bands and the latter is retained due tothe large noise while 121sim126 167sim180 and 222sim224 bandsare influenced by water vapor which also needs to be re-moved [30] and the remaining bands are recombined

422 Radiation Calibration and Atmospheric Correction)e radiometric correction is to convert the pixel value of theimage into an absolute radiometric value (formula (1))Atmospheric correction is to get the true reflectance ofground objects by eliminating the influence of atmospherelight and other environmental factors )e above processesare done in the module corresponding to ENVI53

L(k) C0(k) + C1(k)lowastDN(k) (1)

where k represents the band number and C0 and C1 arecorrection coefficients (offset and gain coefficientsrespectively)

423 Water Information Extraction Before analyzing theconcentration of TSM in the selected waters the nonwaterbody information should be removed to focus on the waterbody information in the study area )e indices popularlyused to extract water body information are NDWI [31](equation (2)) and MNDWI [32] (equation (3)) proposed byHanqiu Xu using the short-wave infrared band instead of thenear-infrared band in NDWI

NDWI G minus NIRG + NIR

(2)

MNDWI G minus SWIRG + SWIR

(3)

In the above two equations G NIR and SWIR are greenband near-infrared band and the short-wave infrared bandcorresponding to bands 21 50 and 146 in Hyperion data)e watershed information was extracted and merged usingNDWI and MNDWI respectively Since the watershed ofthe study area is a shallow macrophytic lake the waters arenot flat and the purpose of smoothing its interior is toremove patches

43 TSMSpectral Characteristics and Sensitive BandAnalysis)e atmospherically corrected satellite data is correlatedwith the components and their contents in the water column[33] In this study the measured TSM values are correlatedwith the spectral reflectance of the corresponding points)e spectral curves of the sampling sites are shown inFigure 4(a) )e spectral profile in the visible band is similarto that of vegetation mainly because the study area is ashallow macrophytic lake with many algae )e reflectionpeak at 690ndash780 nm is a typical spectral feature of waterbodies containing algae and the reflection peak at810ndash820 nm is formed due to the scattering of light by TSMAs the concentration of TSM increases the position of thereflection peak shifts towards the long wave direction )ePearson correlation coefficients for the atmosphericallycorrected reflectance and TSM concentration at each bandare shown in Figure 4(b))ey are negatively correlated with508 nm with a correlation coefficient of minus041609 positivelycorrelated with 1124 nm with a correlation coefficient of042522 and positively correlated with 844 nm with a cor-relation coefficient of 04211

In addition single bands were combined to analyze theircorrelation with TSM concentrations

(1) )e band difference algorithm the reflectance data ofany two bands are subjected to a different operation(B1minusB2) Since the correlation coefficients of theband difference calculation and the differential cal-culation are consistent the results of the two algo-rithms are comparable [34]

(2) Band ratio algorithm the reflectance of any twobands is calculated by the ratio (B1B2)

(3) Band normalized difference index (NDI) algorithmthe ratio of the difference of any two bands to thesum (B1minusB2)(B1 +B2)

In this study combined with the spectral characteristicsand correlation analysis of suspended solids concentrationas shown in Table 2 32 single band and band combination

Table 1 GA_RF algorithm parameter configuration

Attribute set A detailed description)e encoding type Real number codingGenetic algorithm chromosome length value 2Characteristics of the number m 1ltmlt 32Number of decision trees k 1lt klt 1000Genetic algorithm population size 10Genetic algorithm mutation rate 005Number of iterations of genetic algorithm evolution 500

Journal of Spectroscopy 5

Hyperion image

Preprocessing

Spectral analysis

Input featurevariables

Mapping andanalysis

Retrieved results

Precision evaluation

Measured data

Experimental algorithms

LinearAlgorithms BPNN KNN Adaboost RF GA_RF

Figure 3 Flowchart for TSM concentration retrieved

500 1000 1500 2000 2500-500

0

500

1000

1500

2000

2500

3000

64268376548

40208280540

TSM concentration

Wavelength (nm)

Refle

ctan

ce

(a)

500 1000 1500 2000 2500

-04

-02

00

02

04

Wavelength (nm)

Pear

son

Cor

relat

ion

Coe

ffici

ent

(b)

Figure 4 Spectral curves and correlation coefficients of sampling points

Table 2 Feature vector

Feature vector Pearsonrsquos correlation coefficientBand difference algorithm(49804ndash50822) nm 05527(209284ndash213324) nm 05773(52857ndash56927) nm 06112(164890ndash221392) nm 06248

6 Journal of Spectroscopy

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 6: Evaluation of the Effectiveness of Multiple Machine ...

Hyperion image

Preprocessing

Spectral analysis

Input featurevariables

Mapping andanalysis

Retrieved results

Precision evaluation

Measured data

Experimental algorithms

LinearAlgorithms BPNN KNN Adaboost RF GA_RF

Figure 3 Flowchart for TSM concentration retrieved

500 1000 1500 2000 2500-500

0

500

1000

1500

2000

2500

3000

64268376548

40208280540

TSM concentration

Wavelength (nm)

Refle

ctan

ce

(a)

500 1000 1500 2000 2500

-04

-02

00

02

04

Wavelength (nm)

Pear

son

Cor

relat

ion

Coe

ffici

ent

(b)

Figure 4 Spectral curves and correlation coefficients of sampling points

Table 2 Feature vector

Feature vector Pearsonrsquos correlation coefficientBand difference algorithm(49804ndash50822) nm 05527(209284ndash213324) nm 05773(52857ndash56927) nm 06112(164890ndash221392) nm 06248

6 Journal of Spectroscopy

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 7: Evaluation of the Effectiveness of Multiple Machine ...

results are selected as the eigenvectors of the subsequentlysuspended solids concentration inversion algorithm

)e results show that the highest correlation bands of thedifference algorithm are located at 164890 nm and221392 nm while the highest correlation bands of the ratioandNDI algorithms are located at 52857 nm and 56927 nmand the distribution of the highest correlation bands alsovaries greatly Combining features of different wave com-binations for the model prediction can increase the depth ofdata extraction

44 Constructing Inverse Models Based on the latitude andlongitude locations of the 26 sampling points the corre-sponding values in the reflectance image were extracted andformed into data pairs for inverse modeling and accuracyevaluation of TSM concentration To reduce the matchingerror between the sampling point and the image pixel it isrequired that the change rate of all pixels in the 3times 3neighborhood centered on the image pixel corresponding tothe sampling point is less than 40 [35]

)is paper adopts the foldout method [36] for validation6 random seeds are set and 6 sets of random numbers aregenerated for randomly selecting 6 training-testing datasetswhere the training dataset contains 20 sample points (77 ofthe total sample) and the testing dataset contains 6 samplepoints (23 of the total sample) )ese six training-testingdatasets were used to train and test every one of the modelsdetailed in the following paragraphs )e averages of R2RMSE and ARE from the six training and testing sessionswere used as the final model accuracy evaluation metrics foreach model

441 Linear Models After the Pearson correlation analysisand significance test of single band divided by combinationband the results are shown in Table 1 Among them164890 nmndash221392 nm has the highest correlation Asingle-band inversion model and a binary linear model areestablished )e model and accuracy are shown in Table 3

)e results show that the one regression or multipleregression model performs the worst since the model onlyconsiders the linear relationship between the independentand dependent variables while the optical properties ofClass II waters are complex and the relationship betweenTSM concentration and the spectrum cannot be fullyexpressed in linear terms [37]

442 Machine Learning Models Machine learning inver-sion models are constructed with the filtered eigenvectorbands as input and the measured total suspended matterconcentration as output )e model structure of the neuralnetwork is determined by several calculations and a 3-layerneural network with a network structure of 32-6-1 is usedDue to the small number of samples the Brute algorithm ischosen for the search algorithm of KNN model )e Ginimethod is chosen as the feature selection algorithm fordecision tree regression models and the Best method is usedto select the feature division points when the sample pointsare small )e parameters of the AdaBoost algorithm modelinclude base_estimator n_estimators learning_rate andloss )is algorithm is relatively simple more efficient andless likely to overfit )e main parameters of the randomforest model are the number of decision trees and thenumber of feature vectors and the rest of the parameters areset to default values )e specific parameter settings of theabove machine learning models are shown in Table 4

)e experimental results are shown in Figure 5 )ecomparison reveals that the random forest algorithm has thehighest accuracy As a new type of integrated model it hasthe characteristics of small training sample requirementsand less manual settings of parameters However the set-tings of two parameters of random forest decision tree andfeature vector have a considerable influence on the results)erefore this paper optimizes the selection of its modelparameters by introducing genetic algorithm to furtherimprove the accuracy )e main parameters of the geneticalgorithm are population size variation rate and themaximum number of genetic generations and the tour-nament method is introduced to select the next generationTo slow down the convergence rate of the genetic algorithmthe parameters are set as shown in Table 3 taking intoaccount the computing efficiency and effect

)e random forest algorithmrsquos parameters provide arange within which different combinations are made )eexperiment of the GA_RF algorithm concludes that when k iscertain the larger m is the higher the accuracy is when m iscertain the larger k is the higher the accuracy is However itis not the case that the larger the two parameters are thehigher the accuracy is After several experiments the modelaccuracy regression is higher and stable when kisin [220 290]andmisin [18 26] according to the robustness of the results)efinal test found that the effect was optimal when k 260 andm 22 )e comparison of the effectiveness of the above sixmachine learning algorithm methods is shown in Figure 6

Table 2 Continued

Feature vector Pearsonrsquos correlation coefficientBand ratio algorithm(145723199196) nm 05485(208275210294) nm 05497(4980450822) nm 05945(5285756927) nm 06076Band NDI algorithm(208275ndash212313)(208275 + 212313) nm 05614(49804ndash50822)(49804 + 50822) nm 05926(52857ndash56927)(52857 + 56927) nm 06074

Journal of Spectroscopy 7

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 8: Evaluation of the Effectiveness of Multiple Machine ...

Table 3 Linear model and accuracy of suspended solids concentration inversion

Model type Formula Feature bands (nm) R2 RMSE (mgmiddotLminus1) ARE ()

Single-band model y 4lowast 107 lowastX4

minus 1lowast 108 lowastX3

+ 2lowast 108 lowastX2

minus 1lowast 108 lowastX + 2lowast 107X 164890ndash221392 0466 246613 668671

Binary linear model

y minus389lowastX12

minus 202lowast 10minus 5 lowastX22

+ 000019lowastX1+ 012X2 + 018X1 lowastX2 minus 00005

X1 16489ndash22038301 0614 8679 44233X2 20928401ndash213324y 743lowastX1

2+ 00002lowastX2

2+ 000012lowastX1

+ 002lowastX2 + 018lowastX1 lowastX2 minus 00004X1 151783ndash220383 0637 8414 45274X2 145723ndash199196

Table 4 Parameter settings of the machine learning models

Machine learning model Parameter settings

BPNN model

Hidden layer neuron function log S-type functionOutput function linear function

Training function momentum BP algorithm with variable learning rateNumber of iterations 1000

Learning rate 0003

KNN modelSearch algorithm brute-force search algorithm

K 4)e rest of the parameters default settings

Decision tree model

Criterion GiniSplitter best

Max_depth noneMin_samples_split 2

Min_samples_leaf none)e rest of the parameters default settings

AdaBoost model

Base_estimator noneN_estimators 60Learning_rate 085

Loss linear)e rest of the parameters default settings

RF modelNumber of decision trees (k) 100

Number of features (m) 6)e rest of the parameters default settings

GA_RF model

Population size 10Variation rates 005

Maximum number of genetic generations 500Range of the number of decision trees 5 to 1000 step size 10

Range of number of features 1 to 32 step size 1)e rest of the parameters default settings

092

0678

0897

09480969 098

000060

065

070

075

080

085

090

095

100

R2 4142

8277

33872786

21351715

0

1

2

3

4

5

6

7

8

9

RMSE

(mg

L)

869

1667

1005929

702 683

00

6

8

10

12

14

16

18

5

ARE

()

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

BPN

N

KNN

Dec

ision

Tre

e

Adab

oost RF

GA

_RF

Figure 5 A comparison chart of index data

8 Journal of Spectroscopy

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 9: Evaluation of the Effectiveness of Multiple Machine ...

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

BPNN

RMSE=4142mgLR2=0921ARE=869Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(a)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

KNN

RMSE=8277mgLR2=0678ARE=1667Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(b)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

Decision Tree

RMSE=3387mgLR2=0897ARE=1005Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(c)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

AdaBoost

RMSE=2786mgLR2=0948ARE=929Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(d)

0 10 20 30 40 50 60

0

10

20

30

40

50

6011

Measured TSM concentration (mgL)

RF

RMSE=2135mgLR2=0969ARE=702Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(e)

0 10 20 30 40 50 60

0

10

20

30

40

50

60 11

Measured TSM concentration (mgL)

GA_RF

RMSE=1715mgLR2=0980ARE=683Pr

edic

ted

TSM

conc

entr

atio

n (m

gL)

(f )

Figure 6 Scatter verification of TSM concentration values estimated by six models and measured values (a) )e BP neural networkparameter is a 3-layer neural network )e network structure is 32-6-1 the number of hidden neurons is 6 the number of iterations is 1000and the learning rate is 0003 (b) KNN parameters include the search algorithm as brute-force algorithm for violent search k 4 (c)Decision tree parameters Gini coefficient is used for feature selection and the postpruning method is used for the tree correction (d)AdaBoost parameters the error calculation function uses the exponential function and the feature division point method in the weakclassifier selects best (e) RF algorithm parameters the number of decision trees k is 100 and the number of features m is 6 (f ) GA_RFalgorithm parameters the number of decision trees k is 260 and the number of features m is 22

Journal of Spectroscopy 9

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 10: Evaluation of the Effectiveness of Multiple Machine ...

443 Spatial Distribution and Analysis of TSMConcentration According to the above six machine learningmodels the TSM concentration was retrieved )e resultsshowed that the concentration of TSM in the study area washigher in the southeast area and lower in the northwest areaHowever the spatial distribution of TSM concentration isnot uniform )rough analysis it can be found that thepredicted results of different models are not consistent andthe spatial pattern varied greatly between the different re-sults )e predicted values of TSM concentration in KNNand decision tree models are relatively high while thepredicted values of the BP neural network and AdaBoostmodel are relatively low )e results are shown in Figure 7

444 Model Comparison and Evaluation To verify theprediction accuracy of the models independent sampleswere applied as the validation set and several quantitativeevaluation indicators were selected to evaluate the accuracyand generalization of the model comprehensively )emodel is evaluated using the coefficient of determination(R2) the root mean square error (RMSE) and the averagerelative error (ARE) with the following equations

R2

1 minus1113936

ni1 Xi minus Yi( 1113857

2

1113936ni1 Xi minus Yi( 1113857

2

RMSE

1113936ni1 Xi minus Yi( 1113857

2

n

1113971

ARE 1n

1113944

n

i1

Xi minus Yi

Yi

11138681113868111386811138681113868111386811138681113868

11138681113868111386811138681113868111386811138681113868times 100

(4)

where Yi is the measured value Yi is the mean of themeasured value Xi is the predicted value and n is the samplesize)e larger R2 is the smaller the RMSE and ARE indicatehigher accuracy and better results After calculation thespecific operation results of the above model are shown inTable 5

)rough comprehensive comparison and analysis of theabove nine models we found that the accuracy of the linearmodel is poor R2 of the single-band model is less than 05and the accuracy of the binary model has been improved tosome extent but it cannot reach the normal accuracy re-quirement which is higher than 065 )e accuracy of themachine learning model has been greatly improved com-pared with that of the linear model Among them R2 of theBP neural network model is 0921 and the accuracies of theAdaBoost and RF models are increased by 027 and 048respectively compared with R2 of the BP neural network)e machine learning model has a better nonlinear fittingand is more suitable for the retrieval of water quality pa-rameters of Class II waters in a complex environmentAmong them the neural network model belongs to theblack-box model and has certain restrictions in its useAlthough the RF model also belongs to the black-box modelit provides other effective ways to assist the interpretationmaking it easier to be explained In addition the intro-duction of two random parameters makes the RF model

have better antinoise ability and is less likely to fall intooverfitting [16] )e GA_RF results show that R2 is 0980RMSE is 1715mgl and ARE is 683 and the accuracy ofthe GA_RF model is further improved based on the RFmodel

445 Further Comparison and Analysis )e random forestalgorithm is flexible robust simple and convenient and ithas obvious advantages in parameter optimization variableranking and subsequent variable analysis and interpreta-tion Using its randomness in selecting samples and inde-pendent variables it can focus on the differences betweendifferent samples and independent variables when lookingfor the relationship between independent variables anddependent variables so that the regression results can takeinto account the influence of each sample and independentvariables without overtrending to individual samples )ealgorithm has a better effect when performing the inversionof water quality parameters with complex change trendssuch as suspended solids concentration CDOM

As shown in Table 6 Silveira Kupssinsku [36] usedmultiple models to invert the suspended matter concen-tration using Sentinel-2 imagery and aerial image dataobtained by Unmanned Aerial Vehicles (UAVs) )e ran-dom forest algorithm (RF) eventually yielded the highestaccuracy with an R2 of 085603 and an RMSE of 000867mminus1It was more accurate than other machine learning methodsIn addition Wu et al [38] established an inversion model ofCDOM concentration in inland lake waters in China basedon Sentinel-3A OLCI data )e results show that the RMSEof the random forest (RF) model is 014mminus1 and the meanrelative error (MRE) is 21 Compared with the BP neuralnetwork model the RMSE is reduced by 50 and theMRE isreduced by 38 and the inversion accuracy is significantlyimproved In the meantime Fang et al [16] constructeddifferent models to inverse the suspended matter concen-tration based on sediment site monitoring data and MODISsatellite remote sensing reflectance data and the resultsshowed that the random forest algorithm had higher pre-diction accuracy and outperformed linear regression sup-port vector machine and artificial neural network models

)e above results show that the random forest algorithmhas advantages over other machine learning methods indealing with the inversion of suspended matter concen-tration in water which is consistent with the experimentalresults obtained in this study

5 Discussion

In this study the southern part of Nansi Lake in North Chinawas selected as the study area which is a shallow macro-phytic freshwater lake )ese lakes are widely distributedand there are many such lakes with an average depth of fewerthan 4m in the middle and lower reaches of the YangtzeRiver and Huang-Huai-hai Plain in China and they aremore widely distributed in the world )ese lakes are locatedin inland areas and are important for the local ecologicalbalance fishery development and tourism development

10 Journal of Spectroscopy

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 11: Evaluation of the Effectiveness of Multiple Machine ...

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(a)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

897622

TSMConcentration

N

0 175 35 7 Kilometers

(b)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

541218

TSMConcentration

N

0 175 35 7 Kilometers

(c)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

931363

475919

TSMConcentration

N

0 175 35 7 Kilometers

(d)

Figure 7 Continued

Journal of Spectroscopy 11

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 12: Evaluation of the Effectiveness of Multiple Machine ...

However the water quality of the shallow macrophytic lakeis sensitive which is easily affected by many factors such asthe surrounding environment aquatic organisms and

meteorological and geological disasters as well as aqua-culture activities reclamation and tourism development Asa result the inversion of water quality parameters is more

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

0

644282

TSMConcentration

N

0 175 35 7 Kilometers

(e)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

109303

508292

TSMConcentration

N

0 175 35 7 Kilometers

(f )

Figure 7 Spatial distribution map of TSM concentration (a) BPNN (b) KNN (c) Decision tree (d) AdaBoost (e) RF (f ) GA_RF

Table 5 Operation results of the models

Model R2 RMSE (mgL) ARE ()Single-band model 0466 246613 668671Binary linear model 1 0614 8679 44233Binary linear model 2 0637 8414 45274BPNN 0921 4142 869KNN 0678 8277 1667Decision tree 0897 3387 1005AdaBoost 0948 2786 929RF 0969 2135 702GA_RF 0980 1715 683

Table 6 Algorithm test results of other researchers

Algorithm name Evaluation indexR2 MSE

Silveira Kupssinsku et al [36] conductedexperimental comparison results

Linear regression 022380 004681SVR 056024 002650ANN 085371 000882RF 085603 000867

Algorithm name Evaluation indexRMSE (mminus1) MRE ()

Wu et al [38] conducted experimentalcomparison results

BPNN 024 40Single-band model 023 36

RF 014 21

12 Journal of Spectroscopy

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 13: Evaluation of the Effectiveness of Multiple Machine ...

complex and it is difficult to use the conventional inversionmodel for quantitative monitoring )e ecological problemsin the study area selected for this paper are relativelyprominent so this paper proposes using the machinelearning method which is more suitable for the shallowmacrophytic lake to quantitatively estimate the TSM con-centration in the study area and hopes that the conclusionsobtained from this study will be of some guidance for theinversion of water quality parameters in other shallowmacrophytic lakes in the world

)is study is based on Hyperion hyperspectral datawhich have the advantage of high spectral resolution and canbetter respond to the fine spectral information of the groundmaterial )e spectral analysis shows that the number ofbands with high correlation coefficients with the concen-tration of TSM is small and their correlation coefficients arearound 04 )erefore further analysis using band combi-nations band difference ratio and NDI algorithms yieldedhigher correlations than single bands which is consistentwith the results of Chinese scholars Xiao et al [39] and Houet al [40] Some scholars also used the red band to constructthe inversion model of TSM concentration [41] and theaccuracy was unsatisfactory )erefore the univariate factorobtained from spectral analysis and the correlation analysis

of the reflectance of each band and the TSM concentrationcannot meet the accuracy requirements of the quantitativeinversion of TSM concentration in shallow macrophyticlakes It requires the involvement of multiple types ofvariables that is multiple bands and band combinationsjointly for inversion and this conclusion is consistent withthe study by Mao et al [42] In this study 31 feature vectorswere selected to participate in the model construction Bycomparing and analyzing different algorithms it can beconcluded that the linear model has the worst inversionaccuracy where R2 of the univariate model is 0466 )ereason for the poor fitting ability of the linear model may bethat the linear model can only deal with the linear rela-tionship between the independent and dependent variablesbut the TSM concentration parameters are affected byvarious complex environmental conditions which oftenpresent a nonlinear relationship with high complexity anddimensionality and even collinearity can occur )is phe-nomenon is especially true for Class II waters with complexoptical properties where the relationship between waterquality parameters and spectra cannot be fully representedusing a linear model [34] thus leading to a limited appli-cation of linear models )e advantage of machine learningmodels over linear models is that they can handle the

(A)

(B)

(C)

(D)

(E)

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

116deg34prime0PrimeE 116deg37prime30PrimeE 116deg41prime0PrimeE

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

35deg9prime0Prime

N35

deg12prime

0PrimeN

35deg1

5prime0Prime

N35

deg18prime

0PrimeN

F(GA_RF)

109303

508292

TSMConcentration

Figure 8 Spatial distribution map of TSM concentration

Journal of Spectroscopy 13

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 14: Evaluation of the Effectiveness of Multiple Machine ...

problem of nonlinear fitting well)ese results study that theaccuracy of machine learning models such as BPNN de-cision tree AdaBoost and RF models has been greatlyimproved compared with that of linear models for the TSMconcentration estimation in this study area with the RFmodel having the best accuracy and fitting ability )erandom forest model is an ensemble model and it can gethigher prediction accuracy than individual learners Ma-chine learning models such as neural network models andsupport vector machines have certain shortcomings Be-cause they are black-box models it is difficult to understandtheir internal mechanisms which restrict their use to someextent Problems such as overfitting and large calculationswill occur at any time)e random forest model is explainedby the assistance of each variable on the predictive im-portance of the model which some scholars call the gray boxmodel [43] It has fewer parameters is simple and easy touse and has the advantage of being flexible and robust )erandom forest model has been widely used because of itshigh classification accuracy and good robustness but there isno uniform standard for its parameter setting in academiaand there is no uniform and better method In this paper thefit of the model is further improved by introducing a geneticalgorithm to further optimize its parameters in the researchprocess )e GA_RF model had the best accuracy with R2 of0980 RMSE of 1715mgL and ARE of 683 In this studythe spatial distribution of the TSM concentration in thestudy area predicted by the six machine learning models inFigure 6 is mapped )e linear models are not discussed dueto their poor accuracy )e analysis of the mapping resultsobtained from the six machine learning models shows thatthere are differences in the spatial distribution of the TSMconcentration obtained from different models )e spatialdistribution of TSM concentrations predicted by the KNNalgorithm in Figure 6(b) shows that the vast majority of thesouthern and southeastern parts of the study area are high-value areas )e predicted values are high and deviate fromthe actual situation failing to effectively predict the TSMconcentration in this part of the area)e reason for this maybe the small sample size and the failure of KNN to learn itsrules effectively Figure 6(c) shows the inversion resultsobtained by the decision tree algorithm which can predictthe spatial distribution of suspended matter but the pre-dicted value is high for the eastern part of the study area)eaccuracy of the results obtained from the AdaBoost algo-rithm shown in Figure 6(d) can meet the requirementsHowever there is the problem of poor generalization abilityand the overall prediction value for the southern region isrelatively low which is contrary to the real situation )espatial distribution of TSM concentrations retrieved by therandom forest algorithm shown in Figure 6(e) is in betteragreement with the actual but there are also local areaswhere the predicted values are too high Overall the GA_RFalgorithm has the highest accuracy and matches the actualsituation most closely and can accurately predict the smallpollution areas in the region For example the northernmostpart of the study area is an industrial pollution area but thelocal government has prevented the pollution area fromspreading through strong and effective environmental

protection measures which makes the pollution area rela-tively small and the inversion results can prove that theenvironmental protection measures implemented by thelocal government are effective )erefore this study con-cludes that the GA_RF algorithm works best among thesealgorithms and has an important reference value for waterquality parametersrsquo retrieval of shallow macrophytic lakes Itwas found by Figure 6 that the TSM concentration showed atrend of high southeast and low northwest but the spatialdistribution of TSM concentration within the region wasmore fragmented )e lower TSM concentration values inthe northern part of the study area may be because the area isthe eastern half of Nanyang Lake located in the scenic areaof Taibai Lake where the water level is deeper and ecologicalprotection is better )e main sources of TSM in the watercolumn include fine-grained sediment carried into the waterby surface runoff phytoplankton and plant decay residuesin the water and resuspension of bottom sediment by windand wave action [44] Environmental factors meteorologicalfactors and human activities jointly influence the TSMconcentration in the study area )e high TSM concentra-tion in the study area is mainly caused by several factors (1))ere are relatively high TSM concentration values in areassuch as raised islands in the lake land and surroundingshallow water )is situation is shown in Figure 8(a) (2))ere are many fish ponds established around the lake in thesouth and east of the study area which have shallow depthsand high fish densities and the water bodies suffer fromincreased disturbances resulting in increased TSM con-centrations which is consistent with the findings of Meijeret alrsquos study [45] )is situation is shown in Figure 8(d) (3))e time of the study was July 31 during which a largenumber of minnows died in the lake )e flocculent sus-pended matter formed by the dead residue combined withwind and wave disturbance resulted in an increase in TSMconcentration )is situation is shown in Figure 8(c) (4)Frequent human activities paddling of the lake to makefields (as shown in Figure 8(b)) and the hydrodynamicforces generated by the navigation of many pleasure boatsand fishing vessels have caused the resuspension of sedi-ment as shown in Figure 8(e) (5))e study area is low-lyingand many rivers feed into it When heavy precipitationweather occurs surrounding rivers carry large amounts ofsediment to the confluence resulting in increased TSMconcentrations along the coast Most of the above factorsbelong to the characteristics of shallow macrophytic lakesand some other inland lakes and their water quality pa-rameters are affected by a variety of factors together )edifferences in the inversion accuracy of different models arerelated to not only their algorithms but also the smallnumber of samples and the accuracy of the atmosphericcorrection algorithm will have some influence on the pre-diction accuracy of the models

)ere are some weak points in this study )e lake alsohas different characteristics in different seasons but only theTSM concentration in a single time phase was inverted inthis study In this study only spectral factors were analyzedand the adaptability of the model may be further improved ifother influencing factors such as meteorology are combined

14 Journal of Spectroscopy

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 15: Evaluation of the Effectiveness of Multiple Machine ...

for joint retrieval)is work only covers the study area of theNansi Lake in North China and further discussion on itsapplicability to other shallow macrophytic lakes and non-shallow macrophytic lakes is required Since there are manyshallow macrophytic lakes in the world this study is basedon various algorithms for quantitative inversion of TSMconcentrations in this type of lake Comparing the effec-tiveness of various machine learning methods in analyzingthe TSM concentrations in the water environment of theseshallow macrophytic lakes it will provide a more intuitivereference for selecting methods for subsequent research onthis issue

6 Conclusions

Based on the advantage of Hyperion data with the highspectral resolution the most suitable wavebands and theircombinations were selected by spectral analysis for the re-trieval of TSM concentration Using the local area of NansiLake as the study area we quantitatively estimated the TSMconcentration using linear models and various machinelearning models verified the accuracy of the inversion dataand mapped the spatial distribution of the TSM concen-tration )e following conclusions were obtained from thestudy

(1) )e correlation coefficients of R1124 and R844 ofHyperion hyperspectral remote sensing data with theconcentration of TSM are about 042 and the cor-relation coefficient of R16489ndashR22139 is 0624 )eTSM concentration is influenced by various factorsand there are more variables involved in modelestimation )erefore TSM concentration inversioncannot use only a certain band but requires the jointparticipation of multiple variables

(2) Among the seven types of TSM concentration in-version methods the linear performed poorly withR2 of 0634 RMSE of 8414mgL and ARE of45274 )e accuracy of the machine learningmodels was greatly improved compared with that ofthe linear model in which R2 of BP neural networkAdaBoost and RF model were all greater than 092and the RMSE was less than 412 )e RF modeloptimized based on genetic algorithm had thehighest accuracy with R2 of 098 RMSE of 1715mgL and ARE of 683 It is concluded that the ma-chine learning models are better than the linearmodel

(3) )e RF model is simple and robust and has obviousadvantages in terms of easy parameter optimizationand variable ranking and the prediction accuracy issignificantly better those that of other models whichis more suitable for quantitative remote sensingestimation of TSM in shallow macrophytic lakes

(4) In the process of actual data processing the retrievalof TSM concentration in the shallow macrophyticlake is limited by many factors For example thereare few data sources low resolution difficult

preprocessing and few sample points As a result theaccuracy of the linear model is poor so the machinelearning model is more suitable for remote sensingTSM concentration inversion For different influ-encing factors their characteristics and patterns canbe further investigated to quantify their relationshipwith TSM concentration )e water quality variesgreatly in different seasons and multitemporalsampling point data are required in subsequentstudies for better inversion of TSM concentrations)e combination of multiangle analysis of differentfactors and machine learning models helps toachieve an accurate estimation of water quality pa-rameters by remote sensing methods Comparingvarious machine learning methods it will provide amore intuitive reference for selecting methods forsubsequent research on such problems and provide avaluable reference for water quality monitoring andprotection of shallow macrophytic lakes

Data Availability

)edata involved in this study can be obtained by contactingthe corresponding author

Conflicts of Interest

)e authors declare no conflicts of interest

Authorsrsquo Contributions

Xiuyu Liu Xuehua Li and Yanyi Li contributed equally tothis work Xiuyu Liu Xuehua Li and Yanyi Li carried out thecalculation and result analysis and drafted the manuscriptwhich was revised by all authors All authors gave theirapproval of the version submitted for publication

Acknowledgments

)is work was supported by Natural Science Foundation ofShandong Province (no ZR2019QD010) and Open Fund ofState Key Laboratory of Information Engineering in Sur-veying Mapping and Remote Sensing Wuhan University(no 19R07)

References

[1] J Zhao W Cao Z Xu et al ldquoEstimation of suspendedparticulate matter in turbid coastal waters application tohyperspectral satellite imageryrdquo Optics Express vol 26 no 8pp 10476ndash10493 2018

[2] Y Zhang B Qin W Chen et al ldquoExperimental study onunderwater light intensity and primary productivity caused byvariation of total suspended matterrdquo Advances in WaterScience vol 5 pp 615ndash620 2004

[3] J Guang W Yuchun J Huang et al ldquoStudy on seasonalremote sensing estimation model of suspended solids in Taihulakerdquo Journal of Lake Sciences vol 3 pp 241ndash249 2007

[4] D Pan and R Ma ldquoSome key problems in remote sensing oflake water qualityrdquo Journal of Lake Sciences vol 2 pp 139ndash144 2008

Journal of Spectroscopy 15

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 16: Evaluation of the Effectiveness of Multiple Machine ...

[5] P Forget S Ouillon F Lahet and P Broche ldquoInversion ofreflectance spectra of nonchlorophyllous turbid coastal wa-tersrdquo Remote Sensing of Environment vol 68 no 3pp 264ndash272 1999

[6] J Chen T Cui Z Qiu and C Lin ldquoA three-band semi-analytical model for deriving total suspended sedimentconcentration from HJ-1ACCD data in turbid coastal wa-tersrdquo ISPRS Journal of Photogrammetry and Remote Sensingvol 93 pp 1ndash13 2014

[7] S Buri Q Song and H Yanling ldquoRemote sensing retrieval ofsuspended solids concentration in the Yellow river estuarybased on semi analytical methodrdquo Marine Sciences vol 43no 12 pp 17ndash27 2019

[8] V Klemas D Bartlett W Philpot R Rogers and L ReedldquoCoastal and estuarine studies with ERTS-1 and skylabrdquoRemote Sensing of Environment vol 3 no 3 pp 153ndash1741974

[9] F Gao Y Wang and X Hu ldquoEvaluation of the suitability oflandsat MERIS and MODIS for identifying spatial distri-bution patterns of total suspended matter from a self-orga-nizing map (SOM) perspectiverdquo Catena vol 172pp 699ndash710 2018

[10] X Hou L Feng H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[11] S Lei J Xu Y Li et al ldquoAn approach for retrieval of hor-izontal and vertical distribution of total suspended matterconcentration from GOCI data over Lake Hongzerdquo gteScience of the Total Environment vol 700 Article ID 1345242020

[12] Z Zhou W Tian and M Xin ldquoQuantitative retrieval ofchlorophyll-a concentration by remote sensing in Honghulake based on landsat8 datardquo Journal of Hubei University(Natural Science) vol 39 no 2 pp 212ndash216 2017

[13] L E Keiner and X-H Yan ldquoA neural network model forestimating sea surface chlorophyll and sediments from the-matic mapper imageryrdquo Remote Sensing of Environmentvol 66 no 2 pp 153ndash165 1998

[14] J Chen W Quan T Cui and Q Song ldquoEstimation of totalsuspended matter concentration from MODIS data using aneural network model in the China eastern coastal zonerdquoEstuarine Coastal and Shelf Science vol 155 pp 104ndash1132015

[15] Y Park K H Cho J Park S M Cha and J H KimldquoDevelopment of early-warning protocol for predictingchlorophyll-a concentration using machine learning modelsin freshwater and estuarine reservoirs Koreardquo gte Science ofthe Total Environment vol 502 pp 31ndash41 2015

[16] X Fang Z Wen J Chen et al ldquoRemote sensing estimation ofsuspended sediment concentration based on random forestregression modelrdquo National Remote Sensing Bulletin vol 23no 4 pp 756ndash772 2019

[17] C Yin Y Ye H Zhao et al ldquoAnalysis of relationship betweenfield hyperspectrum and suspended solid concentration andturbidity of water in Nansi lakerdquoWater Resources and Powervol 34 no 1 pp 40ndash44 2016

[18] H Lv X Li and K Cao ldquoQuantitative retrieval of suspendedsolid concentration in lake taihu based on BP neural netrdquoGeomatics and Iformation Science of Wuhan Universty vol 8pp 683ndash735 2006

[19] J Li J H Cheng J Y Shi et al Brief Introduction of BackPropagation (BP) Neural Network Algorithm and its Im-provement Springer Berlin Germany 2012

[20] Y Chen P Xu Y Chu et al ldquoShort-term electrical loadforecasting using the support vector regression (SVR) modelto calculate the demand response baseline for office build-ingsrdquo Applied Energy vol 195 pp 659ndash670 2017

[21] A Jog A Carass S Roy D L Pham and J L PrinceldquoRandom forest regression for magnetic resonance imagesynthesisrdquoMedical Image Analysis vol 35 pp 475ndash488 2017

[22] H Reulen and T Kneib ldquoBoosting multi-state modelsrdquoLifetime Data Analysis vol 22 no 2 pp 241ndash262 2016

[23] L Breiman ldquoUsing iterated bagging to debias regressionsrdquoMachine Learning vol 45 no 3 pp 261ndash277 2001

[24] Y Qiu W Shen H Xiao et al ldquoWater depth inversion basedon worldview-2 data and random forest algorithmrdquo RemoteSensing Ifomation vol 34 no 2 pp 75ndash79 2019

[25] Y Wang Optimization and Application of Random ForestAlgorithm in Recommender System Zhejiang UniversityHangzhou China 2016

[26] E Isaac K S Easwarakumar and J Isaac ldquoUrban landcoverclassification from multispectral image data using optimizedAdaBoosted random forestsrdquo Remote Sensing Letters vol 8no 4 pp 350ndash359 2017

[27] H Chen G Zhao X Zhang et al ldquoHyperspectral charac-teristics and content estimation of alkali hydrolyzable ni-trogen in fluvo aquic soil based on genetic algorithm andpartial least squaresrdquo Chinese Agronomy Bulletin vol 31no 2 pp 209ndash214 2015

[28] A Yang J Ding Y Li et al ldquoEstimation of total phosphoruscontent in desert soil based on variable selectionrdquoVisible NearInfrared Spectroscopy vol 36 no 3 pp 691ndash696 2016

[29] T Zhou D Ming and R Zhao ldquoParametric optimization ofrandom forest algorithm for land cover classificationrdquo ScienceSurveying and Mapping vol 42 no 2 pp 88ndash94 2017

[30] B Tan Z Li E Chen et al ldquoPreprocessing of EO-1 Hyperionhyperspectral datardquo Remote Sensing Information vol 6pp 36ndash41 2005

[31] S K McFeeters ldquo)e use of the normalized difference waterindex (NDWI) in the delineation of open water featuresrdquoInternational Journal of Remote Sensing vol 17 no 7pp 1425ndash1432 1996

[32] X U Hanqiu ldquoModification of normalised difference waterindex (NDWI) to enhance open water features in remotelysensed imageryrdquo International Journal of Remote Sensingvol 27 no 1214 pp 3025ndash3033 2006

[33] S Cheng K Chang J Wang et al ldquoDiscussion on the the-oretical model of water color remote sensingrdquo Journal ofTsinghua University vol 8 pp 1027ndash1031 2002

[34] F Yan S Wang Y Zhou et al ldquoStudy on monitoring waterquality of Taihu lake by hyperion spaceborne hyperspectralsensorrdquo Journal of Infrared and Millimeter Wave vol 6pp 460ndash464 2006

[35] E A Anas C Karem L Isabelle et al ldquoComparative analysisof four models to estimate chlorophyll-a concentration incase-2 waters using moderate resolution imaging spectro-radiometer (MODIS) imageryrdquo Remote Sensing vol 4 no 8pp 2373ndash2400 2012

[36] L Silveira Kupssinsku T )omassim Guimaratildees E Menezesde Souza et al ldquoA method for chlorophyll-a and suspendedsolids prediction through remote sensing and machinelearningrdquo Sensors vol 20 no 7 2020

[37] G Liu Study on the Remote Sensing Estimation Method ofChlorophyll a Concentration Suitable for Different Optical

16 Journal of Spectroscopy

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17

Page 17: Evaluation of the Effectiveness of Multiple Machine ...

Characteristics of Two Types of Water Bodies Nanjing NormalUniversity Nanjing China 2016

[38] Z Wu J Li R Wang et al ldquoRemote sensing estimation ofcolored soluble organic matter (CDOM) concentration ininland lakes based on random forestrdquo Lake Science vol 30no 4 pp 979ndash991 2018

[39] X Xiao J Xu D Zhao et al ldquoRemote sensing retrieval ofsuspended solids in typical sections of the middle and lowerreaches of Hanjiang river based on hyperspectral datardquoJournal of Yangtze River Scientific Research Institute vol 37no 11 pp 141ndash148 2020

[40] X Hou F Lian H Duan X Chen D Sun and K ShildquoFifteen-year monitoring of the turbidity dynamics in largelakes and reservoirs in the middle and lower basin of theYangtze river Chinardquo Remote Sensing of Environmentvol 190 pp 107ndash121 2017

[41] K Shi Y Zhang G Zhu et al ldquoLong-term remotemonitoringof total suspended matter concentration in lake Taihu using250m MODIS-aqua datardquo Remote Sensing of Environmentvol 164 pp 43ndash56 2015

[42] Z Mao J Chen D Pan B Tao and Q Zhu ldquoA regionalremote sensing algorithm for total suspended matter in theeast China seardquo Remote Sensing of Environment vol 124pp 819ndash831 2012

[43] A M Prasad L R Iverson and A Liaw ldquoNewer classificationand regression tree techniques bagging and random forestsfor ecological predictionrdquo Ecosystems vol 9 no 2 2006

[44] S Wang X Jiang W Wang et al ldquoSpatiotemporal variationof suspended solids in Lihu lake and its influencing factorsrdquoEnvironmental Science in China vol 34 no 6 pp 1548ndash15552014

[45] M L Meijer A J P Raat and R W Doef ldquoRestoration bybiomanipulation of lake bleiswijkse zoom ()e Netherlands)first resultsrdquoHydrobiological Bulletin vol 23 no 1 pp 49ndash571989

Journal of Spectroscopy 17