Optimizing biodiversity prediction from abiotic parameters



Contents lists available at ScienceDirect

Environmental Modelling & Software

journal homepage: www.elsevier.com/locate/envsoft

Environmental Modelling & Software 53 (2014) 112–120

Optimizing biodiversity prediction from abiotic parameters

Androniki Tamvakis a,*, Vasilis Trygonis a, John Miritzis a, George Tsirtsis a, Sofie Spatharis a,b

a University of the Aegean, Department of Marine Sciences, University Hill, 81100 Mytilene, Greece
b University of Glasgow, Institute of Biodiversity, Animal Health and Comparative Medicine, Glasgow G12 8QQ, Scotland, UK

Article info

Article history:
Received 1 March 2013
Received in revised form 29 November 2013
Accepted 2 December 2013
Available online

Keywords:
Diversity indices
Phytoplankton
Simulated assemblages
Machine learning
Instance based learning

Abstract

An integrated methodology is proposed for the effective prediction of biodiversity exclusively from abiotic parameters. Phytoplankton biodiversity was expressed as richness, evenness and dominance indices and abiotic parameters included temperature, salinity, dissolved inorganic nitrogen and phosphates. Prediction was based on three machine learning techniques: model trees, multilayer perceptron and instance based learning. To optimize diversity prediction, indices were calculated on a large number of phytoplankton field assemblages, but also on corresponding noise-free simulated assemblages. Biodiversity was most accurately predicted by the instance based learning algorithm and the efficiency was doubled with simulated assemblages. Based on the optimal algorithm, indices, and dataset, a software package was developed for phytoplankton diversity prediction for Eastern Mediterranean waters. The proposed methodology can be adapted to any group of organisms in marine and terrestrial ecosystems whereas important applications are the integration of community structure in ecological models and in assessments of global change scenarios.

© 2013 Elsevier Ltd. All rights reserved.

Software availability

Name of software: PREPHYB
Developers: Androniki Tamvakis and Vasilis Trygonis
Contact address: Department of Marine Sciences, University of the Aegean, University Hill, 81100 Mytilene, Greece
Tel.: +30 22510 36811
Fax.: +30 22510 36809
E-mail: [email protected]
First available: 2013
Software required: (a) for MATLAB users: MS Windows or Mac or Linux; (b) for non-MATLAB users: MS Windows
Programming language: MATLAB R2010a
Program size: (a) for MATLAB users: zipped file of 0.5 MB; (b) for non-MATLAB users: zipped file of 160 MB
Availability and online documentation: http://www.mar.aegean.gr/biodiv/Prephyb
Cost: freely available

Abbreviations: CV, cross validation; IBk, instance based learning algorithm; LOOCV, leave one out cross validation; ML, machine learning; MAD, mean absolute deviation; MTs, model trees; MLP, multilayer perceptron; MLR, multiple linear regression; NN, neural network; PO4, phosphates; RMSE, root mean square error; S, salinity; T, temperature.
* Corresponding author. E-mail address: [email protected] (A. Tamvakis).
1364-8152/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.envsoft.2013.12.001

1. Introduction

Diversity prediction through a number of biotic and abiotic parameters is currently a challenging issue in ecology (Ingram and Steel, 2010; Gontier et al., 2006). Obtaining a measure of diversity from field data is not always feasible due to constraints related to the taxonomic analysis of samples (Maurer, 2000). However, estimates of diversity are essential when it comes to prioritizing sites for management purposes (Lockwood et al., 2012), for assessing the ecological status of ecosystems (WFD, 2000; Spatharis and Tsirtsis, 2010) or for predicting effects of global change on ecosystem diversity and function (Dawson et al., 2011). In this context, it is essential to develop methodologies that provide a realistic prediction of diversity based on a small number of abiotic parameters that are more straightforward to measure.

Recently, the emergence of powerful tools such as Machine Learning (ML) techniques and their application in ecology has significantly advanced the predictive power of models (Fielding, 1999; Kuo et al., 2007; Li et al., 2011). These techniques are effective for exploring complex ecological processes, and can handle non-linearity without relying on implicit assumptions on the relationships between parameters (Dzeroski and Drumm, 2003; Jeong et al., 2008; Junker et al., 2012; Kanevski et al., 2004). However, few attempts have been made so far to apply ML techniques for biodiversity prediction. Most studies are still based on classical statistical approaches such as regression analysis (Arias-Gonzalez et al., 2012; Brakstad et al., 1994; Denisenko, 2010; Thrush et al., 2001), which are constrained by assumptions on the data such as normality, homoscedasticity or absence of collinearity. In this context, the application of ML approaches seems most relevant in marine ecosystems, which are affected by multidimensional, complex and stochastic phenomena often characterized by non-linearity (Olden et al., 2008).

Fig. 1. Conceptual diagram of the methodological procedure followed in order to optimize diversity prediction from abiotic parameters. Input datasets (abiotic variables and 19 indices calculated on field assemblages; abiotic variables and 19 indices calculated on simulated assemblages) are used for algorithm training (MTs, IBk, MLP, MLR) and index prediction (11 diversity, 6 evenness, 2 dominance), leading to the selection of the optimal dataset, the optimal algorithm and the optimal indices.

Among the most frequently applied ML algorithms are Model Trees (MTs), Neural Networks (NNs) and Instance Based learning (IBk). These algorithms represent the three main ML categories (trees, neural networks and lazy algorithms), which use completely different predictive approaches (Solomatine et al., 2008). They span many applications in ecology (Dzeroski, 2001; Lek and Guegan, 1999; Recknagel, 2001), while in the marine environment they have been used in hydrodynamics, wave forecasting, habitat modelling, biomass prediction, and pollution assessment (e.g. Dakou et al., 2007; Etemad-Shahidi and Mahjoobi, 2009; Millie et al., 2012; Solomatine et al., 2006; Tamvakis et al., 2012; Tian et al., 2011). Concerning biodiversity prediction in particular, application of ML techniques in both marine and terrestrial ecosystems has been based on habitat features, biotic characteristics or a combination of both with some abiotic parameters (Cheng et al., 2012; Debeljak et al., 2007; Demsar et al., 2006; Dominguez-Granda et al., 2011; Dzeroski and Drumm, 2003; Jurc et al., 2006; Knudby et al., 2010; Kocev et al., 2009; Pittman et al., 2007). These studies have focused on one biodiversity component (e.g. species richness or Shannon diversity), and so far there has been no attempt to predict different diversity components (richness, evenness, and dominance) exclusively from abiotic parameters related to the physical and chemical environment.

Diversity can be expressed through a number of indices which quantify community structure and the changes it undergoes due to natural or anthropogenic stress (Magurran, 2004). However, field communities are also driven by multiple stochastic factors such as seasonality and spatial heterogeneity, which impose a degree of uncertainty and distortion on data (Van Straten, 1992). This 'environmental noise' inherent in field communities is also reflected in the subsequent calculation of indices (Vounatsou and Karydis, 1991). This problem can be overcome by using communities simulated via a species abundance distribution (e.g. the log-series or the lognormal) that nevertheless retain the structure of the field ones (Blackwood et al., 2007; Lyashevska and Farnsworth, 2012; Schloss and Handelsman, 2006; Spatharis and Tsirtsis, 2010). Calculations on noise-free simulated communities seem appropriate when trying to establish cause-and-effect relationships, e.g. between diversity and abiotic parameters, because removing noise or distortion makes possible signals easier to reveal.

In this paper we propose an integrated methodology for the optimization of diversity prediction exclusively from abiotic parameters (Fig. 1). Diversity is expressed by diversity, evenness, and dominance indices calculated on both field and simulated phytoplankton assemblages covering a wide productivity range typical of Eastern Mediterranean waters. Predictions were carried out based on three ML algorithms. The objectives of the study were thus: (a) to distinguish the ML technique offering the most accurate prediction, (b) to select the indices representative of all three diversity components (richness, evenness, and dominance), (c) to optimize prediction by calibrating the methodology with indices calculated on simulated assemblages, and (d) to develop a software tool for biodiversity prediction based on the proposed methodology.

2. Methodology

2.1. Datasets

The first dataset employed in the study includes 658 field samples and was compiled using existing data from coastal areas of the Aegean Sea, E. Mediterranean, representing a wide range of productivity. At each station of a coastal area, repetitive sampling was carried out covering at least a full annual cycle on a monthly basis. Detailed information about the sampling sites and data collection is provided in Spatharis et al. (2008). Inner Saronikos Gulf, near Athens, and Kalloni Gulf in Lesvos Island are characteristic of eutrophic conditions (Simboura et al., 2005). Outer Saronikos Gulf and Gera Gulf in Lesvos Island are more typical of mesotrophic conditions (Arhonditsis et al., 2000; Ignatiades et al., 1992), while offshore stations in Rhodes Island have been characterized as oligotrophic (Kitsiou et al., 2002). A detailed account of the eutrophication level and ecological status of these areas is provided in Spatharis and Tsirtsis (2010). Among the various abiotic parameters available in the dataset, a subset was selected for the aims of the present study, including: (a) concentrations of limiting nutrients, Dissolved Inorganic Nitrogen (DIN) and Phosphates (PO4), that directly influence the growth and composition of phytoplankton in the areas under consideration (Spatharis et al., 2008) and (b) Salinity (S) and Temperature (T), which may also indirectly affect phytoplankton synthesis through stratification in coastal waters (Spyropoulou et al., 2013). Nutrient concentrations were measured spectrophotometrically according to Parsons et al. (1984), whereas physical variables were recorded in situ. Moreover, the phytoplankton species-abundance data used in the current study were all analysed following the same protocol, according to the inverted microscope method of Utermöhl (1958). Dataset information and summary statistics of the above parameters in each of the four areas are provided in Table 1. The dataset covers a wide range of phytoplankton abundance (10^3 to 9 × 10^6 cells/L) and species richness (4–39 species). There were no missing values in the dataset and no special treatment was performed for outlying values. It was considered that the latter often correspond to extreme events, such as algal blooms due to episodic terrestrial inputs (Spatharis et al., 2007) or to the photoperiod increase during spring, that have to be included in the models to be developed. The variables' positive skewness (Table 1), which is almost always observed for environmental data, was taken into account in the application of the ML algorithms. According to the requirements of each algorithm, standardization or normalization procedures were applied, as described in detail below.

The second dataset includes 658 simulated phytoplankton assemblages with abundances corresponding exactly to the abundances of the 658 field samples. The simulation was based on the log-series statistical distribution, which assumes that most species in an assemblage are rare (Fisher et al., 1943). The log-series distribution is shaped by parameters x and α, which can be calculated knowing the ratio of species richness to total abundance (S/N) in an assemblage. The S/N ratio was estimated via a simple linear regression equation between S and N using the 658 field samples, as described in Spatharis and Tsirtsis (2010). Regression analysis was also used to identify the relation of the abundance of the most dominant species N1 with the total phytoplankton abundance N in the 658 field samples. When parameters x and α were estimated, the expected number of species S was allocated for each abundance (total cells N). By feeding the previous two relationships, which characterize field phytoplankton assemblages, into the log-series distribution, simulated assemblages are generated that retain the structure of the initial field ones (Fig. 2). This approach has been described in detail in previous studies (Spatharis and Tsirtsis, 2010; Tsirtsis et al., 2008), resulting in a wide range of assemblage diversity closely matching reality (Spatharis et al., 2011).
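To make the role of the log-series parameters concrete, the following minimal sketch solves for x and α from a given species richness S and total abundance N, using the standard log-series relations S = -α ln(1 - x) and N = αx/(1 - x). It is an illustration under stated assumptions, not the authors' simulation procedure: the function names are hypothetical, and the regression relationships between S, N1 and N described above are not reproduced.

```python
import math

def fit_log_series(S, N, tol=1e-12):
    """Return (x, alpha) of Fisher's log-series for S species and N individuals.

    Solves S / N = ((1 - x) / x) * (-ln(1 - x)) for x by bisection,
    then sets alpha = N * (1 - x) / x. Hypothetical helper for illustration.
    """
    if not 0 < S < N:
        raise ValueError("need 0 < S < N")
    target = S / N

    def ratio(x):
        # Decreases monotonically from 1 (x -> 0) to 0 (x -> 1).
        return (1.0 - x) / x * -math.log1p(-x)

    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:
            lo = mid  # ratio too large, so x must be larger
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    alpha = N * (1.0 - x) / x
    return x, alpha

def expected_species_counts(x, alpha, max_n):
    """Expected number of species represented by n = 1..max_n individuals."""
    return [alpha * x ** n / n for n in range(1, max_n + 1)]

x, alpha = fit_log_series(S=20, N=1000)
print(round(x, 4), round(alpha, 3))
print([round(c, 2) for c in expected_species_counts(x, alpha, 5)])
```

The expected counts decrease with n, which reflects the assumption that most species in an assemblage are rare.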

2.2. Indices expressing diversity components

Indices can express different aspects of biological diversity such as richness, evenness, and dominance. Thus, diversity indices weigh more on the richness component of assemblages, evenness indices account more for the distribution of individuals to species, and dominance indices consider only the proportion of most abundant species in an assemblage (Karydis and Tsirtsis, 1996). In the current study, the most commonly used diversity, evenness and dominance indices (Krebs, 1999; Magurran, 2004) were used in order to express all aspects of phytoplankton diversity (Table 2). These indices were considered as output parameters for the ML algorithms described below.

Table 1. Dataset information (mean annual values, range in parentheses, and skewness) of abiotic (input) and phytoplankton parameters for the coastal areas in the Aegean Sea. Each cell gives mean (range); skewness.

| Parameter | Rhodes offshore (n = 143) | Gera Gulf (n = 114) | Kalloni Gulf (n = 186) | Saronikos Gulf (n = 215) |
|---|---|---|---|---|
| Abiotic parameters | | | | |
| T (°C) | 19.67 (15.86–26.39); 0.66 | 19.06 (9.90–26.70); 0.32 | 17.73 (9.43–28.20); 0.11 | 19.21 (13.10–27.60); 0.33 |
| S (pcu) | 39.16 (38.92–39.39); 11.63 | 38.92 (36.39–40.28); 0.23 | 38.58 (34.02–41.06); 0.91 | 38.30 (37.20–39.70); 8.50 |
| DIN (μM) | 0.91 (0.21–12.45); 9.08 | 1.48 (0.40–5.82); 2.36 | 3.94 (0.47–45.20); 4.66 | 2.70 (0.36–37.95); 5.18 |
| PO4 (μM) | 0.0700 (0.010–4.090); 11.60 | 0.194 (0.050–0.850); 2.11 | 0.088 (0.00–1.577); 5.88 | 0.236 (0.010–6.00); 7.54 |
| Biotic parameters | | | | |
| Cell No. | 6291 (10^3 to 6 × 10^4); 4.37 | 47,237 (2 × 10^3 to 4 × 10^5); 3.04 | 592,441 (3 × 10^3 to 9 × 10^6); 5.03 | 283,201 (10^3 to 6 × 10^6); 6.30 |
| Species No. | 12 (5–23); 0.18 | 16 (4–37); 0.95 | 23 (4–39); 0.40 | 19 (5–39); 0.35 |

Table 2. Predictive performance in terms of R of MTs, IBk, MLP and MLR for all indices using data from field (F.A.) and simulated assemblages (S.A.), evaluated by 10-fold CV. Letters in parentheses refer to the sources listed in the table footnotes.

| Index | MTs F.A. | MTs S.A. | IBk F.A. | IBk S.A. | MLP F.A. | MLP S.A. | MLR F.A. | MLR S.A. |
|---|---|---|---|---|---|---|---|---|
| Abundance | 0.76 | 0.77 | 0.73 | 0.73 | 0.39 | 0.39 | 0.26 | 0.26 |
| Diversity indices | | | | | | | | |
| Sp. richness (a) | 0.33 | 0.73 | 0.69 | 0.81 | 0.47 | 0.54 | 0.28 | 0.31 |
| Margalef (b) | 0.39 | 0.69 | 0.60 | 0.79 | 0.33 | 0.58 | 0.19 | 0.29 |
| Gleason (a) | 0.37 | 0.69 | 0.59 | 0.79 | 0.29 | 0.58 | 0.18 | 0.28 |
| Menhinick (c) | 0.62 | 0.71 | 0.77 | 0.80 | 0.55 | 0.59 | 0.28 | 0.29 |
| Odum (d) | 0.35 | 0.65 | 0.75 | 0.74 | 0.58 | 0.59 | 0.26 | 0.23 |
| Simpson (a) | 0.60 | 0.67 | 0.66 | 0.77 | 0.35 | 0.51 | 0.25 | 0.35 |
| H2-Shannon (e) | 0.58 | 0.57 | 0.67 | 0.71 | 0.40 | 0.44 | 0.21 | 0.32 |
| Hill N1 (a) | 0.49 | 0.53 | 0.63 | 0.70 | 0.32 | 0.43 | 0.10 | 0.30 |
| Hill N2 (a) | 0.34 | 0.68 | 0.60 | 0.79 | 0.30 | 0.53 | 0.12 | 0.32 |
| Hurlbert (f) | 0.60 | 0.63 | 0.66 | 0.77 | 0.33 | 0.51 | 0.25 | 0.35 |
| McIntosh (g) | 0.55 | 0.61 | 0.66 | 0.78 | 0.32 | 0.52 | 0.22 | 0.35 |
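As an illustration of how richness, evenness and dominance indices differ, the sketch below computes a few of the indices named in Table 2 from a vector of species abundances. The paper does not list the formulas, so the standard textbook forms are assumed here: Menhinick S/√N, Margalef (S - 1)/ln N, Shannon H' = -Σ p ln p, Evenness E2 = exp(H')/S and Berger-Parker Nmax/N. The function name is hypothetical.

```python
import math

def diversity_indices(abundances):
    """Compute a few richness, evenness and dominance indices.

    `abundances` holds the number of individuals (cells) per species.
    Formulas are the common textbook forms, assumed rather than taken
    from the paper.
    """
    counts = [c for c in abundances if c > 0]
    S = len(counts)           # species richness
    N = sum(counts)           # total abundance
    if S == 0 or N < 2:
        raise ValueError("need at least one species and two individuals")
    proportions = [c / N for c in counts]
    shannon = -sum(p * math.log(p) for p in proportions)
    return {
        "species_richness": S,
        "menhinick": S / math.sqrt(N),
        "margalef": (S - 1) / math.log(N),
        "shannon": shannon,
        "evenness_e2": math.exp(shannon) / S,
        "berger_parker": max(counts) / N,
    }

# A perfectly even assemblage has E2 = 1; a dominated one scores lower.
print(diversity_indices([10, 10, 10, 10]))
print(diversity_indices([97, 1, 1, 1]))
```

Both example assemblages have the same richness, so only the evenness and dominance indices distinguish them.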

2.3. Description of the three ML algorithms

2.3.1. MTs

MTs are hierarchical structures consisting of nodes, branches and leaves. Nodes contain a test on an input variable, each branch corresponds to the outcome of this test, while leaves (terminal nodes) provide a prediction of the output variable based on a piecewise linear regression function (Quinlan, 1992). As in simple linear regression, MTs predict the value of the output (dependent) variable (e.g. the different indices) as a linear combination of the input (independent) parameters (e.g. temperature, salinity, DIN, and PO4) within different leaves of the tree. The M5 algorithm is one of the most well-known MT induction methods. The M5P algorithm, its Java implementation that is part of the WEKA machine learning package (Hall et al., 2009), was used for the MT induction. An optimization of the method was attempted based on the minimum number of instances reaching a leaf, which is crucial since it controls the tree pruning (Quinlan, 1999). To this aim, different values were tested, namely 4 (default), 8, 16, 32 and 64 instances. Prior to analysis, abiotic parameters were standardized using the z-score procedure to ensure equal weights during tree induction.
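The idea of a tree whose leaves hold linear regressions can be sketched with a single split on one input variable. This is deliberately not the M5 or M5P algorithm, which chooses splits by standard deviation reduction and then prunes and smooths the tree; here the split is simply the threshold that minimises the summed squared error of the two leaf regressions. All function names are hypothetical, and `min_leaf` only mimics the role of the minimum number of instances per leaf.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return mean_y - b * mean_x, b

def squared_error(xs, ys, model):
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def fit_one_split_model_tree(xs, ys, min_leaf=4):
    """Return (threshold, left_model, right_model) for a one-split model tree."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical input values
        left_x, left_y = zip(*pairs[:i])
        right_x, right_y = zip(*pairs[i:])
        left_model = fit_line(left_x, left_y)
        right_model = fit_line(right_x, right_y)
        err = (squared_error(left_x, left_y, left_model)
               + squared_error(right_x, right_y, right_model))
        if best is None or err < best[0]:
            threshold = 0.5 * (pairs[i - 1][0] + pairs[i][0])
            best = (err, threshold, left_model, right_model)
    if best is None:
        raise ValueError("not enough distinct data for a split")
    return best[1:]

def predict(tree, x):
    """Route x to a leaf and evaluate that leaf's linear model."""
    threshold, left_model, right_model = tree
    a, b = left_model if x <= threshold else right_model
    return a + b * x
```

Raising `min_leaf` forces larger leaves and therefore fewer candidate splits, which is the same trade-off explored when varying the minimum number of instances per leaf.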

2.3.2. IBk

IBk algorithms are derived from the nearest neighbour pattern classifier (Cover and Hart, 1967) and are based on the notion that similar instances behave similarly; new input instances are thus predicted from the most similar stored neighbouring instances (Aha et al., 1991; Payne, 1995). In this study, the IBk method was applied using the k nearest training instances (k-NN) to predict the value of the output variable in new unseen instances. The Manhattan (city-block) distance was used as the distance metric, as it was found to be more powerful than the classic Euclidean distance. The k-NN algorithm treats the input parameters as dimensions of a multidimensional space and the instances as points in this space (Cover and Hart, 1967). Once a new unseen instance is given, the output variable is predicted from the values of its k nearest neighbours using an inverse distance weighting function, with k determined by optimization (Wettschereck et al., 1997). In the software package WEKA (Hall et al., 2009) the initial setting of parameter k may significantly affect the prediction power. To optimise results, different values of this parameter were tested, i.e. 2, 4, 8, 12 and 20. Prior to data analysis, abiotic parameters were standardized using z-scores, as in the application of MTs.

Fig. 2. Schematic presentation of the procedure followed for the generation of 658 simulated phytoplankton assemblages corresponding to the 658 field assemblages.
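The prediction step described for IBk can be sketched as follows: z-score standardization of the inputs, Manhattan distance to every stored instance, and an inverse-distance-weighted average over the k nearest neighbours. This is a minimal illustration rather than WEKA's implementation; the function names are hypothetical, and the handling of exact matches (averaging them directly) is an assumption made to avoid division by zero.

```python
import math

def zscore_params(rows):
    """Column means and sample standard deviations of the training inputs."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(dims)]
    return means, stds

def standardize(row, means, stds):
    # Assumes no input column is constant (std > 0).
    return [(v - m) / s for v, m, s in zip(row, means, stds)]

def knn_predict(train_X, train_y, query, k=8):
    """Inverse-distance-weighted k-NN regression with Manhattan distance."""
    means, stds = zscore_params(train_X)
    z_query = standardize(query, means, stds)
    neighbours = []
    for row, target in zip(train_X, train_y):
        z_row = standardize(row, means, stds)
        distance = sum(abs(a - b) for a, b in zip(z_row, z_query))
        neighbours.append((distance, target))
    neighbours.sort(key=lambda pair: pair[0])
    nearest = neighbours[:k]
    exact = [target for distance, target in nearest if distance == 0.0]
    if exact:
        return sum(exact) / len(exact)  # assumption: average exact matches
    weights = [1.0 / distance for distance, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)

# Inputs could be [T, S, DIN, PO4]; two toy dimensions are used here.
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
y = [0.0, 1.0, 2.0, 10.0]
print(knn_predict(X, y, [0.5, 0.5], k=2))
```

Because no model is fitted in advance, every stored instance remains available at prediction time, which is why the method is classed as a lazy algorithm.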

2.3.3. MLP

Multilayer Perceptron (MLP) is a simple NN that consists of multiple layers of neurons that interact using weighted connections. Weights measure the degree of correlation between the neurons and are calculated during a learning procedure based on knowledge of the input parameters and desired output (supplied database). More details about the NN methodology can be found in the papers of Lek and Park (2008), Lek and Guegan (1999), Maier and Dandy (2000), and Pal and Mitra (1992). The MLPs used in this study belong to the classic group of feed-forward NNs with one hidden layer, in which a sigmoid activation function is applied to all nodes and training is performed by the backpropagation algorithm (Rumelhart et al., 1986). To select the network topology that maximizes the algorithm's effectiveness, five numbers of neurons were tested (4, 8, 10, 15, and 20).

The performance of the three algorithms was also compared to the classic multiple linear regression (MLR) technique, described in detail in textbooks of basic statistics (e.g. Zar, 1998). Prior to data analysis, abiotic parameters were log transformed to approach normality, as it is common for natural data to follow positively skewed distributions. Although NNs do not require any assumption regarding input data, it has been shown that their performance is often improved through data transformation using mathematical functions (Shi, 2000).

Table 2 (continued). Evenness and dominance indices.

| Index | MTs F.A. | MTs S.A. | IBk F.A. | IBk S.A. | MLP F.A. | MLP S.A. | MLR F.A. | MLR S.A. |
|---|---|---|---|---|---|---|---|---|
| Evenness indices | | | | | | | | |
| Evenness E1 (h) | 0.59 | 0.70 | 0.68 | 0.80 | 0.33 | 0.56 | 0.27 | 0.33 |
| Evenness E2 (i) | 0.53 | 0.72 | 0.67 | 0.81 | 0.35 | 0.58 | 0.24 | 0.30 |
| Evenness E3 (a) | 0.43 | 0.71 | 0.67 | 0.80 | 0.35 | 0.57 | 0.24 | 0.30 |
| Evenness E4 (a) | 0.23 | 0.71 | 0.45 | 0.78 | 0.20 | 0.54 | 0.03 | 0.27 |
| Evenness E5 (a) | 0.35 | 0.70 | 0.54 | 0.79 | 0.24 | 0.56 | 0.23 | 0.31 |
| Redundancy (j) | 0.59 | 0.69 | 0.64 | 0.78 | 0.35 | 0.55 | 0.27 | 0.33 |
| Dominance indices | | | | | | | | |
| Berger-Parker (k) | 0.51 | 0.70 | 0.64 | 0.80 | 0.34 | 0.54 | 0.23 | 0.34 |
| McNaughton (l) | 0.51 | 0.64 | 0.64 | 0.74 | 0.32 | 0.47 | 0.18 | 0.32 |

Table 2 footnotes: (a) Ludwig and Reynolds (1988). (b) Margalef (1958). (c) Menhinick (1964). (d) Odum et al. (1960). (e) Shannon and Weaver (1949). (f) Hurlbert (1971). (g) McIntosh (1967). (h) Pielou (1975). (i) Sheldon (1969). (j) Patten (1962). (k) Berger and Parker (1970). (l) McNaughton (1967).

Fig. 3. Box-plots of the percent errors of the values predicted for the four best-performing diversity indices calculated on simulated assemblages. Prediction was based on the IBk algorithm.

2.4. Assessment of optimal diversity prediction

To estimate the prediction accuracy of different algorithms on unseen data, the K-fold cross validation (CV) approach was employed (Stone, 1974). This technique is commonly used to estimate the error of algorithm predictions and is efficient for datasets containing neither too few (a few tens) nor too many (tens of thousands) records (Stone, 1978). In K-fold CV the dataset is randomly partitioned into K subsamples, K minus 1 of which are used as training data while the remaining subsample is retained for testing the algorithm. This process is repeated K times (the folds) and results are averaged to produce the performance estimation. To optimize algorithm results, biodiversity prediction was based on three numbers of CV folds: 10, 20 and 658, i.e. leave one out cross validation (LOOCV). The basic measure of performance for assessing the predictive power of the three algorithms is the correlation coefficient (R) between the calculated values of indices (based on field or simulated assemblages) and those predicted by the algorithm, while the root mean square error (RMSE) is also presented. The variability of the R coefficient within the K folds of each algorithm was estimated by the coefficient of variation, which quantifies the variability (or stability) of the results. A two-factor ANOVA was used to determine the relative effect of testing different CV folds and different values of algorithm parameters (i.e. number of instances reaching an MT leaf, number of neighbours for IBk, or number of neurons for MLP). Percent errors between calculated and predicted values were used as an additional measure of performance. Instead of relying solely on individual instances (e.g. the 658 samples) to assess the performance of predicted indices, we also assessed the behaviour of predictions using average monthly values of biodiversity for each of the four areas.
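The evaluation loop described above can be sketched in a few lines: a random partition into K folds, a per-fold correlation coefficient R and RMSE, and the coefficient of variation of R across folds. This is a hedged illustration, not the WEKA procedure used in the study; the function names and the `fit_predict` callback signature are hypothetical. Note that a per-fold R needs at least two test instances, so for LOOCV the predictions would have to be pooled before computing R.

```python
import math
import random

def pearson_r(a, b):
    """Correlation coefficient between calculated and predicted values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def k_fold_indices(n, k, seed=0):
    """Randomly partition range(n) into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, fit_predict, k=10, seed=0):
    """Return a list of (R, RMSE) pairs, one per fold.

    fit_predict(train_X, train_y, test_X) must return predictions for test_X.
    """
    scores = []
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        test_X = [X[i] for i in fold]
        test_y = [y[i] for i in fold]
        predicted = fit_predict(train_X, train_y, test_X)
        scores.append((pearson_r(test_y, predicted), rmse(test_y, predicted)))
    return scores

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return 100.0 * sd / mean
```

Averaging the per-fold R values gives the performance estimate, while the coefficient of variation of those values indicates how stable the algorithm is across folds.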

3. Results

3.1. Selection of optimal parameters for algorithm training

The relative effect of three numbers of CV folds (10, 20, 658) on algorithm performance based on R was tested using both field and simulated data. This factor was not significant for MTs and IBk (ANOVA, p > 0.05), but was statistically significant for MLP (ANOVA, p < 0.001). For the latter, the best performance was achieved with the 10-fold CV. The minimum number of instances reaching an MT leaf (4, 8, 16, 32, 64) had no statistically significant effect for the field dataset (ANOVA, p > 0.05) but was strongly significant when using indices calculated on simulated data (ANOVA, p < 0.005). The optimal number of instances selected for MT parameterisation was 8. Significant differences in R were observed among the different numbers of neighbours for IBk (2, 4, 8, 12, 20) and neurons for MLP (4, 8, 10, 15, 20) (ANOVA, p < 0.05). The optimal number of neighbours was 8 and the optimal number of neurons was 10.

3.2. Selection of optimal dataset, algorithm and indices

The performance of the three ML algorithms (using the optimal parameters described above), together with the respective performance of MLR, in terms of R for all indices is presented in Table 2. Additionally, the performance of the algorithms in terms of RMSE is presented in the Appendix. Overall, the use of indices calculated on simulated instead of field assemblages resulted in significantly improved predictive power. This was observed for all tested algorithms and almost all indices, as indicated by the higher and in some cases doubled correlation coefficients. The most efficient algorithm for diversity prediction was IBk and the least efficient was MLR. Based on IBk, the most effective indices calculated on simulated data were Species Richness, Menhinick, Evenness E1, Evenness E2, Evenness E3 and Berger-Parker (R ≥ 0.80). On the other hand, Shannon and Hill N1 indices had lower predictive power (R < 0.72). According to the coefficient of variation, variability of R among the 10 folds of CV was low (<25%) for the majority of indices tested, whereas it was minimised for IBk on simulated data (e.g. 5.1% for Species Richness, 5.9% for Menhinick, 5.3% for Evenness E2, 5.1% for Berger-Parker). Therefore, Species Richness, Menhinick, Evenness E2 and Berger-Parker were selected as representative of the three components of assemblage diversity, i.e. richness, evenness and dominance.

3.3. Prediction performance of optimal algorithm

The distribution of the percent error of prediction for the above indices is depicted in Fig. 3. Half of the produced errors fall within a ±10% range for all four indices. Moreover, almost none of the errors exceeds the ±30% limit. Menhinick and Berger-Parker seem to be overestimated by IBk, giving positive error values. On the other hand, the median is close to zero for Species Richness and Evenness E2, while the skewness is similar to that of a normal distribution, indicating that IBk does not unilaterally overestimate or underestimate these indices.

For each of the four sampling areas, monthly data of the Evenness E2 index calculated on simulated and field assemblages were compared with the corresponding values predicted by the IBk algorithm (Fig. 4). The deviation between predicted and field or simulated values was expressed quantitatively by calculating the Mean Absolute Deviation (MAD). Monthly predictions of Evenness E2 using the simulated assemblages shown in Fig. 4a were more accurate (MAD = 0.048) than those using the values calculated on field data in Fig. 4b (MAD = 0.053). However, the latter can also be considered satisfactory, indicating that IBk performs with high precision using mean monthly values not only for noise-free simulated data but also for field data. IBk predictions were least accurate for both simulated and field data in the case of Gera Gulf. This area is characterized by mesotrophic conditions, and for this reason the response of phytoplankton diversity to physico-chemical parameters is likely to be more unpredictable.
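The two error measures used in this section can be written down compactly. The sketch below assumes that the percent error is taken relative to the calculated (field or simulated) value, with positive values indicating overestimation as in the text, and that MAD is the mean of the absolute differences between predicted and calculated values. The function names and the example numbers are hypothetical, not data from the study.

```python
def percent_errors(calculated, predicted):
    """Signed percent error per instance; positive means overestimation."""
    return [100.0 * (p - c) / c for c, p in zip(calculated, predicted)]

def mean_absolute_deviation(calculated, predicted):
    """Mean of absolute differences between predicted and calculated values."""
    pairs = list(zip(calculated, predicted))
    return sum(abs(p - c) for c, p in pairs) / len(pairs)

# Hypothetical monthly Evenness E2 values for one area.
calculated = [0.50, 0.60, 0.55]
predicted = [0.55, 0.54, 0.55]
print(percent_errors(calculated, predicted))
print(mean_absolute_deviation(calculated, predicted))
```

Percent errors retain the sign of the deviation, which is what reveals systematic over- or underestimation, whereas MAD summarises only its magnitude.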

4. Discussion

Three novel ML techniques and 19 indices were tested in order to achieve the best biodiversity prediction using exclusively abiotic parameters. Algorithm training was based on an extensive dataset containing biotic (phytoplankton species abundances) and physico-chemical information representative of a wide productivity range of the E. Mediterranean Sea. Biodiversity prediction, particularly in the marine environment, is a complex task as multiple factors and stochastic processes are acting upon community structure (Adjou et al., 2012; Gontier et al., 2006). This problem was overcome by using diversity indices calculated on simulated assemblages, free of environmental noise. The use of powerful modelling tools such as ML techniques and the further optimization of the methodology with simulated assemblages provided an integrated framework for biodiversity prediction with high predictive power (R ≥ 0.80 for all selected indices between predicted and simulated values).

Fig. 4. Monthly IBk predictions of Evenness E2 (shown with rhombus) for each sampling area in comparison with the corresponding (a) simulated and (b) field data (shown with stars).

The simulated phytoplankton assemblages used in this study maintained the structural characteristics of the corresponding field assemblages across a wide productivity range (Tsirtsis et al., 2008), but were also free of noise related to stochastic extrinsic factors such as patchiness, grazing, and seasonality (Karydis, 1996). This property improved the relationship of diversity indices with abiotic parameters, given that noise can mislead algorithms (McCune, 1997; Van Straten, 1992). It also increased or even doubled the predictive power of algorithms while maintaining the realism of the natural system. Simulated communities originating from field ones have been successfully used in the past to investigate the behaviour of diversity indices in microbes (Blackwood et al., 2007; Schloss and Handelsman, 2006), benthos (Lyashevska and Farnsworth, 2012), and phytoplankton (Tsirtsis and Spatharis, 2011).

Our results indicate that ML techniques can greatly increase the predictive power of models; however, the three algorithms presented significant differences in their predictive performance. IBk was the most efficient and reliable in biodiversity prediction, in agreement with other marine applications of this algorithm (Dzeroski and Drumm, 2003; Hatzikos et al., 2008) and with applications in other scientific disciplines such as hydrology, weather forecasting, bioinformatics, banking and forensics (Bannayan and Hoogenboom, 2008; Bhasin et al., 2005; Buchholz et al., 2009; Diplaris et al., 2005; Hinwood et al., 2006; Solomatine et al., 2006, 2008). The observed increased efficiency of IBk in our study can be explained by the heterogeneous structure of our dataset, compiled from four different coastal areas, each one showing variability on a monthly basis. In this algorithm, every single input instance can be dynamically used with equal weight during prediction (Aha et al., 1991). Therefore, when indices are associated with the abiotic information, IBk maintains the localized information of the data in the heterogeneous dataset (Solomatine et al., 2008). This also makes IBk sensitive to instances that deviate from the main trends, giving a more accurate prediction.
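The instance-based principle described above can be sketched as a plain k-nearest-neighbour regression. This is an illustration only, not the IBk implementation used in the study: the value of k, the Euclidean distance on unscaled variables, and the training instances are all assumptions made for the example.

```python
import math


def knn_predict(training, query, k=3):
    """Predict a diversity index as the unweighted mean of the k nearest instances.

    `training` is a list of (features, target) pairs; `features` and `query`
    are equal-length sequences of abiotic values.
    """
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training instances")
    # Rank stored instances by Euclidean distance to the query and keep the closest k.
    nearest = sorted(training, key=lambda inst: math.dist(inst[0], query))[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical instances: (temperature, salinity, DIN, PO4) -> index value.
training = [
    ((15.0, 38.0, 1.0, 0.10), 0.40),
    ((16.0, 38.5, 1.2, 0.10), 0.50),
    ((25.0, 39.0, 0.3, 0.02), 0.80),
    ((26.0, 39.2, 0.2, 0.02), 0.90),
]

print(knn_predict(training, (15.5, 38.2, 1.1, 0.10), k=2))
```

Because no global model is fitted, a query is answered only from the stored instances closest to it, which is why locally deviating samples are not averaged away. In practice the abiotic variables would normally be rescaled to comparable ranges before distances are computed.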

Instances that deviate from main trends (and that characterize our heterogeneous dataset) are missed by algorithms such as MTs and MLP, resulting in reduced sensitivity. Contrary to IBk, these algorithms attempt to derive general relationships between diversity indices and abiotic parameters, described by linear models in MTs or weighted neurons in MLP. The latter has been proposed as a reliable model of ecological processes (Basheer and Hajmeer, 2000; Lek et al., 1996); however, its efficiency depends upon choosing the correct topology (i.e. number of layers and neurons) and applying elaborate adjustments such as pruning, constructive algorithms or recurrence (Rocha et al., 2007; Wang et al., 1994). Although these adjustments may improve predictive performance, they dramatically increase algorithm complexity and thus application runtime. In the present study, MLP was applied in its simplest form and its efficiency was inferior to that of both the MT and IBk algorithms. MLP has also given less accurate predictions than other ML techniques in several other studies (e.g. Etemad-Shahidi and Mahjoobi, 2009; Nisanci et al., 2011; Solomatine and Siek, 2006; Soysal and Schmidt, 2010).

Almost all indices calculated on simulated assemblages were sufficiently predicted by the IBk algorithm. However, Menhinick's, Evenness E2 and Berger-Parker, which scored higher based on their R values, are proposed for predicting the three diversity components, namely richness, evenness and dominance. Although calculations on simulated data increase the predictive power of algorithms, satisfactory predictions can also be made with field data. We tested the predictive power using 658 discrete samples but also by pooling together data from different stations within a sampling campaign at a given study site. The latter predictions were much more accurate, since the use of averaged data smoothed the effect of time, space (local dimensionality), and outlying values, in agreement with previous studies (Kumar, 2000; More and Deo, 2003).
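The pooling step mentioned above, averaging stations within a sampling campaign at a given site, might look as follows. This is a sketch under assumptions: the record layout, the field names (`site`, `campaign`, `temperature`, `e2`) and the values are hypothetical and do not reflect the structure of the original dataset.

```python
from collections import defaultdict
from statistics import mean


def pool_by_campaign(samples):
    """Average every numeric field over the stations of each (site, campaign) pair.

    `samples` is a list of dicts with keys 'site', 'campaign' and numeric fields.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[(s["site"], s["campaign"])].append(s)
    pooled = {}
    for key, rows in groups.items():
        numeric = [f for f in rows[0] if f not in ("site", "campaign")]
        pooled[key] = {f: mean(r[f] for r in rows) for f in numeric}
    return pooled


# Hypothetical station-level records (not data from the study).
samples = [
    {"site": "site_A", "campaign": "c1", "temperature": 14.0, "e2": 0.4},
    {"site": "site_A", "campaign": "c1", "temperature": 16.0, "e2": 0.6},
    {"site": "site_B", "campaign": "c1", "temperature": 20.0, "e2": 0.7},
]

pooled = pool_by_campaign(samples)
print(pooled[("site_A", "c1")])
```

Averaging reduces the influence of a single outlying station on both the abiotic inputs and the index being predicted, which is the smoothing effect referred to in the text.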

Presently we propose an optimization procedure for biodiversity prediction based on few abiotic parameters. Although optimization was based on phytoplankton data, this methodology can be easily adapted for any group of organisms, provided that there are sufficient samples covering a wide range of environmental conditions so that biodiversity can be fully represented. The proposed models are based on a black-box approach and do not offer mechanistic explanations for the observed relations between abiotic variables and diversity; however, a sensitivity analysis in future work could reveal the underlying processes and shed light on theoretical aspects (Refsgaard et al., 2007). The high predictive power (expressed with the R correlation coefficient) of the proposed methodology enables its integration in various crucial ecological applications. Until now, organisms (e.g. phytoplankton, zooplankton) have been represented in ecological models in terms of biomass of one or few components representing different size classes or the main groups (Arhonditsis et al., 2006). Based on the proposed methodology, a link can be established between the most important abiotic variables and diversity; therefore the whole diversity spectrum and its dynamics can be incorporated into an ecological model (Laniak et al., 2013). This approach supports both the testing of ecological questions regarding diversity and environmental quality assessment and protection, since changes in diversity are a focal point in recent environmental protection measures, as in the European Water Framework Directive (WFD, 2000). In this context, diversity prediction can be incorporated in models testing the effect of different scenarios of climate change, habitat loss, or ecosystem management. For phytoplankton in particular, diversity across a wide productivity range was predicted from temperature, salinity, DIN and PO4. Therefore important changes in phytoplankton structure can be foreseen based on temperature projections related to climate change scenarios, or in connection to nutrient loading originating from potential changes in land use and management practices.

Fig. 5. Graphical user interface of the PREPHYB software developed in MATLAB. Prediction of four indices and abundance of phytoplankton assemblages is based on abiotic variables that are either manually entered by the user, or batch processed from a comma-separated ASCII file.

5. Software features

PREdiction of PHYtoplankton Biodiversity (PREPHYB) is a MATLAB-based software with a user-friendly interface (Fig. 5) that is freely downloadable at http://www.mar.aegean.gr/biodiv/Prephyb. The software provides the optimal phytoplankton diversity prediction with high predictive power, implementing the IBk algorithm and methodological scheme described in Fig. 1. PREPHYB incorporates an extensive dataset of 658 samples, and the built-in IBk is trained through the relationship between physico-chemical parameters and indices that are calculated on noise-free simulated phytoplankton assemblages. User input is limited to four abiotic variables, i.e. temperature, salinity, DIN and PO4; these can be either entered manually, or automatically processed in batch mode through a standard comma-separated ASCII file. The output consists of the predicted diversity, which corresponds to a wide productivity range typical of coastal and offshore waters of the Eastern Mediterranean Sea (10^3 to 9 × 10^6 cells/L), expressed by indices representing all three diversity components (richness, evenness, dominance), as well as additional descriptors of phytoplankton assemblage structure such as species richness and cell number. It must be noted that the dataset used for model training (abiotic variables and corresponding phytoplankton assemblages) is characteristic of Eastern Mediterranean waters, as mentioned above. Therefore the use of PREPHYB as it is for biodiversity prediction with the already stated accuracy is limited to waters of similar characteristics.
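To make the batch input concrete, the sketch below parses comma-separated text holding the four abiotic variables. The header row, the column names and their order are assumptions made for illustration; the file layout actually expected by PREPHYB is given in its own documentation and is not reproduced here.

```python
import csv
import io

FIELDS = ("temperature", "salinity", "din", "po4")  # assumed column names


def read_abiotic_rows(text):
    """Parse comma-separated text into dicts of floats, one per sample."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [f for f in FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return [{f: float(row[f]) for f in FIELDS} for row in reader]


# Hypothetical file content (not data from the study).
example = "temperature,salinity,din,po4\n18.5,38.9,0.75,0.05\n24.0,39.2,0.30,0.02\n"
rows = read_abiotic_rows(example)
print(rows[0])
```

Each parsed row carries exactly the four inputs listed in the text, so it could be handed to any predictor that takes temperature, salinity, DIN and PO4.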


Appendix

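Table A1. Predictive performance in terms of RMSE of MTs, IBk, MLP and MLR for all indices using data from field (F.A.) and simulated assemblages (S.A.), evaluated by 10-CV.

| Group | Index | MTs F.A. | MTs S.A. | IBk F.A. | IBk S.A. | MLP F.A. | MLP S.A. | MLR F.A. | MLR S.A. |
|---|---|---|---|---|---|---|---|---|---|
| | Abundance | 4.2E+04 | 4.2E+04 | 4.9E+04 | 4.9E+04 | 7.0E+04 | 7.0E+04 | 9.0E+04 | 9.0E+04 |
| Diversity indices | Sp. Richness | 5.57 | 2.63 | 3.93 | 2.19 | 5.30 | 3.23 | 7.48 | 4.99 |
| Diversity indices | Margalef | 0.36 | 0.13 | 0.31 | 0.11 | 0.43 | 0.16 | 0.54 | 0.24 |
| Diversity indices | Gleason | 0.36 | 0.12 | 0.31 | 0.10 | 0.44 | 0.15 | 0.53 | 0.23 |
| Diversity indices | Menhinick | 0.06 | 0.02 | 0.03 | 0.02 | 0.04 | 0.03 | 0.06 | 0.05 |
| Diversity indices | Odum | 1.28 | 0.71 | 0.67 | 0.55 | 0.86 | 0.78 | 1.40 | 1.14 |
| Diversity indices | Simpson | 0.14 | 0.04 | 0.13 | 0.03 | 0.17 | 0.05 | 0.23 | 0.07 |
| Diversity indices | H2-Shannon | 0.60 | 0.10 | 0.54 | 0.09 | 0.78 | 0.12 | 1.00 | 0.17 |
| Diversity indices | Hill N1 | 2.33 | 0.60 | 2.05 | 0.50 | 2.95 | 0.66 | 3.71 | 0.96 |
| Diversity indices | Hill N2 | 1.90 | 0.61 | 1.60 | 0.49 | 2.20 | 0.69 | 2.82 | 1.07 |
| Diversity indices | Hurlbert | 0.14 | 0.04 | 0.13 | 0.03 | 0.17 | 0.05 | 0.23 | 0.07 |
| Diversity indices | McIntosh | 0.12 | 0.04 | 0.10 | 0.03 | 0.14 | 0.05 | 0.19 | 0.07 |
| Evenness indices | Evenness E1 | 0.17 | 0.07 | 0.12 | 0.06 | 0.17 | 0.09 | 0.23 | 0.14 |
| Evenness indices | Evenness E2 | 0.15 | 0.11 | 0.12 | 0.09 | 0.16 | 0.13 | 0.23 | 0.21 |
| Evenness indices | Evenness E3 | 0.19 | 0.12 | 0.13 | 0.09 | 0.17 | 0.14 | 0.24 | 0.22 |
| Evenness indices | Evenness E4 | 0.08 | 0.05 | 0.07 | 0.04 | 0.08 | 0.06 | 0.11 | 0.09 |
| Evenness indices | Evenness E5 | 0.11 | 0.06 | 0.10 | 0.05 | 0.12 | 0.07 | 0.16 | 0.11 |
| Evenness indices | Redundancy | 0.17 | 0.07 | 0.12 | 0.06 | 0.16 | 0.09 | 0.23 | 0.14 |
| Dominance indices | Berger-Parker | 0.15 | 0.04 | 0.13 | 0.04 | 0.17 | 0.06 | 0.23 | 0.10 |
| Dominance indices | McNaughton | 0.13 | 0.05 | 0.11 | 0.04 | 0.15 | 0.05 | 0.20 | 0.08 |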

References

Adjou, M., Bendtsen, J., Richardson, K., 2012. Modeling the influence from ocean transport, mixing and grazing on phytoplankton diversity. Ecol. Model. 225, 19–27.

Aha, D.W., Kibler, D., Albert, M.K., 1991. Instance-based learning algorithms. Mach. Learn. 6 (1), 37–66.

Arhonditsis, G., Tsirtsis, G., Angelidis, M.O., Karydis, M., 2000. Quantification of the effects of nonpoint nutrient sources to coastal marine eutrophication: applications to a semi-enclosed gulf in the Mediterranean Sea. Ecol. Model. 129 (2–3), 209–227.

Arhonditsis, G., Adams-VanHarn, B., Nielsen, L., Stow, C.A., Reckhow, K.H., 2006. Evaluation of the current state of mechanistic aquatic biogeochemical modeling: citation analysis and future perspectives. Environ. Sci. Technol. 40 (21), 6547–6554.

Arias-Gonzalez, J.E., Acosta-Gonzalez, G., Membrillo, N., Garza-Perez, J.R., Castro-Perez, J.M., 2012. Predicting spatially explicit coral reef fish abundance, richness and Shannon–Weaver index from habitat characteristics. Biodivers. Conserv. 21 (1), 115–130.

Bannayan, M., Hoogenboom, G., 2008. Weather analogue: a tool for real-time prediction of daily weather data realizations based on a modified k-nearest neighbour approach. Environ. Model. Softw. 23 (6), 703–713.

Basheer, I.A., Hajmeer, M., 2000. Artificial neural networks: fundamentals, computing, design, and application. J. Microbiol. Meth. 43 (1), 3–31.

Berger, W.H., Parker, F.L., 1970. Diversity of planktonic foraminifera in deep-sea sediments. Science 168, 1345–1347.

Bhasin, M., Zhang, H., Reinherz, E.L., Reche, P.A., 2005. Prediction of methylated CpGs in DNA sequences using a support vector machine. FEBS Lett. 579 (20), 4302–4308.

Blackwood, C.B., Hudleston, D., Zak, D.R., Buyer, J.S., 2007. Interpreting ecological diversity indices applied to terminal restriction fragment length polymorphism data: insights from simulated microbial communities. Appl. Environ. Microbiol. 73, 5276–5283.

Brakstad, F., Kvalheim, O.M., Ugland, K.I., Tjessem, K., Bryne, K., 1994. Prediction of the Shannon Wiener diversity index from trace element profiles in sediments around the Statfjord platforms. Chemosphere 29 (7), 1441–1465.

Buchholz, R., Kraetzer, C., Dittmann, J., 2009. Microphone classification using Fourier coefficients information hiding. In: Katzenbeisser, S., Sadeghi, A.R. (Eds.), Information Hiding, Lecture Notes in Computer Science. Springer, Berlin, Heidelberg, pp. 235–246.

Cheng, L., Lek, S., Lek-Ang, S., Li, Z., 2012. Predicting fish assemblages and diversity in shallow lakes in the Yangtze River basin. Limnol. Ecol. Manage. Inland Waters 42 (2), 127–136.

Cover, T.M., Hart, P.E., 1967. Nearest neighbor pattern classification. IEEE Trans. Inform. Theory IT 13 (1), 21–27.

Dakou, E., D’heygere, T., Dedecker, A.P., Goethals, P.L.M., Lazaridou-Dimitriadou, M., De Pauw, N., 2007. Decision tree models for prediction of macroinvertebrate taxa in the river Axios (Northern Greece). Aquat. Ecol. 41 (3), 399–411.

Dawson, T.P., Jackson, S.T., House, J.I., Prentice, I.C., Mace, G.M., 2011. Beyond predictions: biodiversity conservation in a changing climate. Science 332, 53–58.

Debeljak, M., Cortet, J., Demsar, D., Krogh, P.H., Dzeroski, S., 2007. Hierarchical classification of environmental factors and agricultural practices affecting soil fauna under cropping systems using Bt maize. Pedobiologia 51 (3), 229–238.

Demsar, D., Dzeroski, S., Larsen, T., Struyf, J., Axelsen, J., Pedersen, M.B., Krogh, P.H., 2006. Using multi-objective classification to model communities of soil microarthropods. Ecol. Model. 191 (1), 131–143.

Denisenko, N.V., 2010. The description and prediction of benthic biodiversity in high arctic and freshwater-dominated marine areas: the southern Onega Bay (the White Sea). Mar. Pollut. Bull. 61 (4–6), 224–233.

Diplaris, S., Tsoumakas, G., Mitkas, P., Vlahavas, I., 2005. Protein classification with multiple algorithms. In: Bozanis, P., Houstis, E. (Eds.), Advances in Informatics. Springer, Berlin, Heidelberg, pp. 448–456.

Dominguez-Granda, L., Lock, K., Goethals, P.L.M., 2011. Using multi-target clustering trees as a tool to predict biological water quality indices based on benthic macroinvertebrates and environmental parameters in the Chaguana watershed (Ecuador). Ecol. Inform. 6 (5), 303–308.

Dzeroski, S., 2001. Applications of symbolic machine learning to ecological modelling. Ecol. Model. 146 (1–3), 263–273.

Dzeroski, S., Drumm, D., 2003. Using regression trees to identify the habitat preference of the sea cucumber (Holothuria leucospilota) on Rarotonga, Cook Islands. Ecol. Model. 170 (2–3), 219–226.

Etemad-Shahidi, A., Mahjoobi, J., 2009. Comparison between M5’ model tree and neural networks for prediction of significant wave height in Lake Superior. Ocean Eng. 36 (15–16), 1175–1181.

Fielding, A.H. (Ed.), 1999. Machine Learning Methods for Ecological Applications. Kluwer Academic Publishers, Dordrecht.

Fisher, R.A., Corbet, A.S., Williams, C.B., 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. J. Anim. Ecol. 12, 42–58.

Gontier, M., Balfors, B., Mortberg, U., 2006. Biodiversity in environmental assessment – current practice and tools for prediction. Environ. Impact Assess. Rev. 26 (3), 268–286.

Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. SIGKDD Explor. 11 (1), 10–18.

Hatzikos, E.V., Tsoumakas, G., Tzanis, G., Bassiliades, N., Vlahavas, I., 2008. An empirical study on sea water quality prediction. Knowl.-Based Syst. 21 (6), 471–478.

Hinwood, A., Preston, P., Suaning, G.J., Lovell, N.H., 2006. Bank note recognition for the vision impaired. Australas. Phys. Eng. Sci. Med. 29 (2), 229–233.

Hurlbert, S.H., 1971. The nonconcept of species diversity: a critique and alternative parameters. Ecology 52, 577–586.

Ignatiades, L., Karydis, M., Vounatsou, P., 1992. A possible method for evaluating oligotrophy and eutrophication based on nutrient concentration scales. Mar. Pollut. Bull. 24 (5), 238–243.


Ingram, T., Steel, M., 2010. Modelling the unpredictability of future biodiversity in ecological networks. J. Theor. Biol. 264 (3), 1047–1056.

Jeong, K.S., Kim, D.K., Jung, J.M., Kim, M.C., Joo, G.J., 2008. Non-linear autoregressive modelling by temporal recurrent neural networks for the prediction of freshwater phytoplankton dynamics. Ecol. Model. 211 (3–4), 292–300.

Junker, K., Sovilj, D., Kroncke, I., Dippner, J.W., 2012. Climate induced changes in benthic macrofauna – a non-linear model approach. J. Mar. Syst. 96–97, 90–94.

Jurc, M., Perko, M., Dzeroski, S., Demsar, D., Hrasovec, B., 2006. Spruce bark beetles (Ips typographus, Pityogenes chalcographus, Col.: Scolytidae) in the Dinaric mountain forests of Slovenia: monitoring and modeling. Ecol. Model. 194 (1–3), 219–226.

Kanevski, M., Parkin, R., Pozdnukhov, A., Timonin, V., Maignan, M., Demyanov, V., Canu, S., 2004. Environmental data mining and modeling based on machine learning algorithms and geostatistics. Environ. Model. Softw. 19 (9), 845–855.

Karydis, M., Tsirtsis, G., 1996. Ecological indices: a biometric approach for assessing eutrophication levels in the marine environment. Sci. Total Environ. 186 (3), 209–219.

Karydis, M., 1996. Quantitative assessment of eutrophication: a scoring system for characterising water quality in coastal marine ecosystems. Environ. Monit. Assess. 41 (3), 233–246.

Kitsiou, D., Coccossis, H., Karydis, M., 2002. Multi-dimensional evaluation and ranking of coastal areas using GIS and multiple criteria choice methods. Sci. Total Environ. 284 (1–3), 1–17.

Knudby, A., LeDrew, E., Brenning, A., 2010. Predictive mapping of reef fish species richness, diversity and biomass in Zanzibar using IKONOS imagery and machine-learning techniques. Rem. Sens. Environ. 114 (6), 1230–1241.

Kocev, D., Dzeroski, S., White, M.D., Newell, G.R., Griffioen, P., 2009. Using single- and multi-target regression trees and ensembles to model a compound index of vegetation condition. Ecol. Model. 220 (8), 1159–1168.

Krebs, C.J., 1999. Ecological Methodology. Addison Wesley Longman, Menlo Park, California, p. 620.

Kumar, D.N., 2000. Multisite disaggregation of monthly to daily streamflow. Water Resour. Res. 36, 1823–1833.

Kuo, J.T., Hsieh, M.H., Lung, W.S., She, N., 2007. Using artificial neural network for reservoir eutrophication prediction. Ecol. Model. 200 (1–2), 171–177.

Laniak, G.F., Olchin, G., Goodall, J., Voinov, A., Hill, M., Glynn, P., Whelan, G., Geller, G., Quinn, N., Blind, M., Peckham, S., Reaney, S., Gaber, N., Kennedy, R., Hughes, A., 2013. Integrated environmental modeling: a vision and roadmap for the future. Environ. Model. Softw. 39, 3–23.

Lek, S., Delacoste, M., Baran, P., Dimopoulos, I., Lauga, J., Aulagnier, S., 1996. Application of neural networks to modelling nonlinear relationships in ecology. Ecol. Model. 90 (1), 39–52.

Lek, S., Guegan, J.F., 1999. Artificial neural networks as a tool in ecological modelling, an introduction. Ecol. Model. 120 (2–3), 65–73.

Lek, S., Park, Y.S., 2008. Multilayer perceptron. In: Jorgensen, S.E., Fath, B. (Eds.), Encyclopedia of Ecology. Academic Press, Oxford, pp. 2455–2462.

Li, J., Heap, A.D., Potter, A., Daniell, J.J., 2011. Application of machine learning methods to spatial interpolation of environmental variables. Environ. Model. Softw. 26 (12), 1647–1659.

Lockwood, M., Davidson, J., Hockings, M., Haward, M., Kriwoken, L., 2012. Marine biodiversity conservation governance and management: regime requirements for global environmental change. Ocean Coast. Manage. 69, 160–172.

Ludwig, A.J., Reynolds, J.F., 1988. Statistical Ecology: a Primer on Methods and Computing. John Wiley and Sons, New York.

Lyashevska, O., Farnsworth, K.D., 2012. How many dimensions of biodiversity do we need? Ecol. Indic. 18, 485–492.

Magurran, A.E., 2004. Measuring Biological Diversity, second ed. Blackwell Science, Oxford, p. 256.

Maier, H.R., Dandy, G.C., 2000. Neural networks for the prediction and forecasting of water resources variables: a review of modelling issues and applications. Environ. Model. Softw. 15 (1), 101–124.

Margalef, R., 1958. Information theory in ecology. Gen. Syst. 3, 36–71.

Maurer, D., 2000. The dark side of taxonomic sufficiency (TS). Mar. Pollut. Bull. 40 (2), 98–101.

McCune, B., 1997. Influence of noisy environmental data on canonical correspondence analysis. Ecology 78, 2617–2623.

McIntosh, R.P., 1967. An index of diversity and the relation of certain concepts to diversity. Ecology 48, 392–404.

McNaughton, J., 1967. Relationship among functional properties of California grassland. Nature 216, 168–169.

Menhinick, E.P., 1964. A comparison of some species-individuals diversity indices applied to samples of field insects. Ecology 45, 859–861.

Millie, D.F., Weckman, G.R., Young II, W.A., Ivey, J.E., Carrick, H.J., Fahnenstiel, G.L., 2012. Modeling microalgal abundance with artificial neural networks: demonstration of a heuristic ‘Grey-Box’ to deconvolve and quantify environmental influences. Environ. Model. Softw. 38, 27–39.

More, A., Deo, M.C., 2003. Forecasting wind with neural networks. Mar. Struct. 16 (1), 35–49.

Nisanci, M.H., Kucuksille, E.U., Cengiz, Y., Orlandi, A., Duffy, A., 2011. The prediction of the electric field level in the reverberation chamber depending on position of stirrer. Expert Syst. Appl. 38 (3), 1689–1696.

Odum, H.T., Cantlon, J.E., Kornicker, L.S., 1960. An organizational hierarchy postulate for the interpretation of species-individuals distribution, species entropy and ecosystem evolution and the meaning of a species variety index. Ecology 41, 395–399.

Olden, J.D., Lawler, J.J., Poff, LeRoy N., 2008. Machine learning methods without tears: a primer for ecologists. Q. Rev. Biol. 83, 171–193.

Pal, S.K., Mitra, S., 1992. Multilayer perceptron, fuzzy sets, and classification. IEEE Trans. Neural Netw. 3 (5), 683–697.

Parsons, T.R., Maita, Y., Lalli, C.M., 1984. A Manual of Chemical and Biological Methods for Seawater Analysis. Pergamon Press, Oxford, p. 175.

Pattern, B.C., 1962. Species diversity in net plankton of Raritan Bay. J. Mar. Res. 20, 57–75.

Payne, T.R., 1995. Instance-based Prototypical Learning of Set Valued Attributes (PhD first year report). University of Aberdeen, Scotland, pp. 1–23.

Pielou, E.C., 1975. Ecological Diversity. Wiley InterScience, New York.

Pittman, S.J., Christensen, J.D., Caldow, C., Menza, C., Monaco, M.E., 2007. Predictive mapping of fish species richness across shallow-water seascapes in the Caribbean. Ecol. Model. 204 (1–2), 9–21.

Quinlan, J.R., 1992. Learning with continuous classes. In: Proceedings Australian Joint Conference on Artificial Intelligence. World Scientific, Hobart, Tasmania, Singapore, pp. 343–348.

Quinlan, J.R., 1999. Simplifying decision trees. Int. J. Hum.-Comput. Stud. 51 (2), 497–510.

Recknagel, F., 2001. Applications of machine learning to ecological modelling. Ecol. Model. 146 (1–3), 303–310.

Refsgaard, J.C., Van der Sluijs, J.P., Hojberg, A.L., Vanrolleghem, P.A., 2007. Uncertainty in the environmental modelling process – a framework and guidance. Environ. Model. Softw. 22, 1543–1556.

Rocha, M., Cortez, P., Neves, J., 2007. Evolution of neural networks for classification and regression. Neurocomputing 70 (16–18), 2809–2816.

Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.

Schloss, P.D., Handelsman, J., 2006. Toward a census of bacteria in soil. PLoS Comput. Biol. 2 (7), 786–793.

Shannon, C.E., Weaver, W., 1949. The Mathematical Theory of Communication. University of Illinois Press, Urbana.

Sheldon, A.L., 1969. Equitability indices: dependence on species count. Ecology 50, 466–467.

Shi, J.J., 2000. Reducing prediction error by transforming input data for neural networks. J. Comput. Civil Eng. 14 (2), 109–116.

Simboura, N., Panayotidis, P., Papathanassiou, E., 2005. A synthesis of the biological quality elements for the implementation of the European Water Framework Directive in the Mediterranean ecoregion: the case of Saronikos Gulf. Ecol. Indic. 5 (3), 253–266.

Solomatine, D.P., Maskey, M., Durga, L.S., 2006. Eager and lazy learning methods, in the context of hydrologic forecasting. In: International Joint Conference on Neural Networks, Vancouver, Canada.

Solomatine, D.P., Siek, M.B., 2006. Modular learning models in forecasting natural phenomena. Neural Netw. 19 (2), 215–224.

Solomatine, D.P., Maskey, M., Shrestha, D.L., 2008. Instance-based learning compared to other data-driven methods in hydrological forecasting. Hydrol. Process. 22 (2), 275–287.

Soysal, M., Schmidt, E.G., 2010. Machine learning algorithms for accurate flow-based network traffic classification: evaluation and comparison. Perform. Eval. 67 (6), 451–467.

Spatharis, S., Danielidis, D., Tsirtsis, G., 2007. Recurrent Pseudo-nitzschia calliantha (Bacillariophyceae) and Alexandrium insuetum (Dinophyceae) winter blooms induced by agricultural runoff. Harmful Algae 6, 811–822.

Spatharis, S., Mouillot, D., Danielidis, D., Karydis, M., Do Chi, T., Tsirtsis, G., 2008. Influence of terrestrial runoff on phytoplankton species richness–biomass relationships: a double stress hypothesis. J. Mar. Biol. Ecol. 362, 55–62.

Spatharis, S., Tsirtsis, G., 2010. Ecological quality scales based on phytoplankton for the implementation of Water Framework Directive in the Eastern Mediterranean. Ecol. Indic. 10 (4), 840–847.

Spatharis, S., Roelke, D.L., Dimitrakopoulos, P.G., Kokkoris, G.D., 2011. Analyzing the (mis)behavior of Shannon index in eutrophication studies using field and simulated phytoplankton assemblages. Ecol. Indic. 11 (2), 697–703.

Spyropoulou, A., Spatharis, S., Papantoniou, G., Tsirtsis, G., 2013. Potential response of a semi-arid ecosystem to climate change. Hydrobiologia 705, 87–99.

Stone, M., 1978. Cross-validation: a review. Ser. Stat. 9 (1), 127–139.

Stone, M., 1974. Cross-validation and multinomial prediction. Biometrika 61 (3), 509–515.

Tamvakis, A., Miritzis, J., Tsirtsis, G., Spyropoulou, A., Spatharis, S., 2012. Effects of meteorological forcing on coastal eutrophication: modeling with model trees. Estuar. Coast. Shelf Sci. 115, 210–217.

Thrush, S.F., Hewitt, J.E., Funnell, G.A., Cummings, V.J., Ellis, J., Schultz, D., Talley, D., Norkko, A., 2001. Fishing disturbance and marine biodiversity: the role of habitat structure in simple soft-sediment systems. Mar. Ecol. Prog. Ser. 233, 277–286.

Tian, X., Ju, M., Shao, C., Fang, Z., 2011. Developing a new grey dynamic modeling system for evaluation of biology and pollution indicators of the marine environment in coastal areas. Ocean Coast. Manage. 54 (10), 750–759.

Tsirtsis, G., Spatharis, S., Karydis, M., 2008. Application of the lognormal equation to assess phytoplankton community structural changes induced by marine eutrophication. Hydrobiologia 605 (1), 89–98.

Tsirtsis, G., Spatharis, S., 2011. Simulating the structure of natural phytoplankton assemblages: descriptive vs. mechanistic models. Ecol. Model. 222 (12), 1922–1928.

Utermöhl, H., 1958. Zur Vervollkommnung der quantitativen phytoplankton-methodik. Mitt. Int. Ver. Theor. Angew. Limnol. 9, 1–38.


Van Straten, G., 1992. The predicting power of models for eutrophication. In: Sutcliffe, D.W., Jones, J.G. (Eds.), Eutrophication: Research and Application to Water Supply. Freshwater Biological Association, Ambleside, Cumbria, UK, pp. 44–58.

Vounatsou, P., Karydis, M., 1991. Environmental characteristics in oligotrophic waters: data evaluation and statistical limitations in water quality studies. Environ. Monit. Assess. 18 (3), 211–220.

Wang, Z., Di Massimo, C., Tham, M.T., Julian Morris, A., 1994. A procedure for determining the topology of multilayer feedforward neural networks. Neural Netw. 7 (2), 291–300.

Wettschereck, D., Aha, D.W., Mohri, T., 1997. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms. Artif. Intell. Rev. 11 (1–5), 273–314.

WFD, 2000. Directive 2000/60/EC of the European Parliament and of the Council of 23 October 2000 – establishing framework for community action in the field of water policy. Official J. Eur. Commun. L327, 1–71.

Zar, J.H., 1998. Biostatistical Analysis, fourth ed. Prentice-Hall, New Jersey.