
Some methods to improve the utility of conditioned Latin hypercube sampling

Brendan P. Malone 1,2, Budiman Minasny 2 and Colby Brungard 3

1 CSIRO, Agriculture and Food, Canberra, ACT, Australia
2 The Sydney Institute of Agriculture, The University of Sydney, Sydney, NSW, Australia
3 Plant and Environmental Sciences, New Mexico State University, Las Cruces, NM, USA

ABSTRACT

The conditioned Latin hypercube sampling (cLHS) algorithm is popularly used for planning field sampling surveys in order to understand the spatial behavior of natural phenomena such as soils. This technical note collates, summarizes, and extends existing solutions to problems that field scientists face when using cLHS. These problems include optimizing the sample size, re-locating sites when an original site is deemed inaccessible, and how to account for existing sample data, so that under-sampled areas can be prioritized for sampling. These solutions, which we also share as individual R scripts, will facilitate much wider application of what has been a very useful sampling algorithm for scientific investigation of soil spatial variation.

Subjects Agricultural Science, Soil Science, Statistics, Data Science, Spatial and Geographic Information Science
Keywords Soil sampling, Conditioned Latin Hypercube, Digital soil mapping, Optimization, Sampling, Sample optimization, Legacy soil data, Pedometrics, Soil survey, Fieldwork

INTRODUCTION

The conditioned Latin hypercube sampling (cLHS) algorithm (Minasny & McBratney, 2006) was designed with digital soil mapping (DSM) in mind. cLHS is a random stratified procedure that chooses sampling locations based on prior information pertaining to a suite of environmental variables in a given area. cLHS has been used extensively in DSM projects throughout the world, with recent examples in the last 5 years including Sun et al. (2017) in China, Jeong et al. (2017) in Korea, Scarpone et al. (2016) in Canada, and Thomas et al. (2015) in Australia. cLHS has also been used for other purposes and contexts. For example, in optimal soil spectral model calibration (Ramirez-Lopez et al., 2014; Kopačková et al., 2017), understanding the conditions which determine Phytophthora distribution in rainforests (Scarlett et al., 2015), and assessing the uncertainty of digital elevation models derived from light detection and ranging technology (Chu et al., 2014).

For DSM, the algorithm exploits collections of environmental variables pertaining to soil forming factors and proxies thereof (McBratney, Mendonça Santos & Minasny, 2003; e.g., digital elevation model derivatives, remote sensing imagery of vegetation type and distribution, climatic data, and geological maps) to derive a sample configuration (of specified size), such that the empirical distribution function of each environmental variable is replicated (Clifford et al., 2014). Presuming that soil variation is a function of the chosen environmental variables, it is reasoned that models fitted using data collected

How to cite this article Malone BP, Minasny B, Brungard C. 2019. Some methods to improve the utility of conditioned Latin hypercube sampling. PeerJ 7:e6451 DOI 10.7717/peerj.6451

Submitted 2 October 2018
Accepted 15 January 2019
Published 25 February 2019

Corresponding author
Brendan P. Malone, [email protected]

Academic editor
Paolo Giordani

Additional Information and Declarations can be found on page 15

DOI 10.7717/peerj.6451

Copyright 2019 Malone et al.

Distributed under Creative Commons CC-BY 4.0

via cLHS capture all the soil spatial variability and will be applicable across the whole spatial extent to be mapped.

However, in our own experience, and from personal communication with other researchers and field technicians, a common set of methodological questions arises when using cLHS. These questions are:

1. How many samples should I collect?

2. Where else can I sample when a cLHS location cannot be visited because of difficult terrain, a locked gate, safety reasons, etc.?

3. How do I account for existing samples when designing a new survey?

The purpose of this technical note is to describe solutions to each of these questions. We begin with a brief overview of the cLHS algorithm and then address each question separately.

MATERIALS AND METHODS

A short overview of cLHS

Conditioned Latin hypercube sampling is one of the many environmental surveying tools available for understanding the spatial characteristics of environmental phenomena. Extended discussions about soil sampling, surveying, and monitoring of natural resources in a broad context can be found in seminal publications such as de Gruijter et al. (2006) and Webster & Lark (2013). cLHS has its origins in Latin hypercube sampling (LHS), first proposed by McKay, Beckman & Conover (1979). LHS is an efficient way to reproduce an empirical distribution function, where the idea is to divide the empirical distribution function of a variable, X, into n equi-probable, non-overlapping strata, and then draw one random value from each stratum. In a multi-dimensional setting, for k variables, X1, X2, ..., Xk, the n random values drawn for variable X1 are combined randomly (or in some order to maintain correlation) with the n random values drawn for variable X2, and so on until n k-tuples are formed; that is, the Latin hypercube sample (Clifford et al., 2014). Its utility for soil sampling was noted by Minasny & McBratney (2006), but they recognized that some generalization of LHS was required so that selected samples actually existed in the real world. Subsequently, they proposed a conditioning of the LHS, which is achieved by drawing an initial Latin hypercube sample from the ancillary information, then using simulated annealing to permute the sample in such a way that an objective function is minimized. The objective function of Minasny & McBratney (2006) comprised three criteria:

1. Matching the sample with the empirical distribution functions of the continuous ancillary variables;

2. Matching the sample with the empirical distribution functions of the categorical ancillary variables; and

3. Matching the sample with the correlation matrix of the continuous ancillary variables.

See Minasny & McBratney (2006) for full detailing of the cLHS algorithm.
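The basic (unconditioned) LHS draw described above can be sketched in a few lines. The following is an illustrative Python sketch (the paper's own scripts are in R); the function name latin_hypercube is ours:

```python
import random

def latin_hypercube(n, k, seed=0):
    """Draw an n-point Latin hypercube sample in k dimensions on [0, 1).

    Each variable's range is divided into n equi-probable strata, one value
    is drawn per stratum, and the columns are shuffled independently so the
    strata are paired randomly across variables.
    """
    rng = random.Random(seed)
    cols = []
    for _ in range(k):
        col = [(j + rng.random()) / n for j in range(n)]  # one draw per stratum
        rng.shuffle(col)                                  # random pairing
        cols.append(col)
    # transpose into n k-tuples
    return [tuple(cols[v][i] for v in range(k)) for i in range(n)]

pts = latin_hypercube(10, 3)
# every variable has exactly one point in each of its 10 strata
for v in range(3):
    assert sorted(int(p[v] * 10) for p in pts) == list(range(10))
```

cLHS adds the conditioning step: the initial draw is taken from the actual ancillary data and then permuted by simulated annealing until the objective function above is minimized.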

Malone et al. (2019), PeerJ, DOI 10.7717/peerj.6451 2/17

How many samples should I collect?

Ironically, this question arose from a situation where we examined data from a ∼100 ha field. Within this field, 238 soil samples were collected by regular grid sampling. Soil samples were then analyzed for key soil properties. Also collected from the field were ancillary data including soil conductivity (from an EM38 sensor), elevation, and crop yield data. Our interest was in determining whether the 238 soil samples adequately covered the ancillary data space, and whether an alternative sampling configuration using cLHS could be used to determine a specific sample number that would achieve the same data coverage as the 238 sites.

We acknowledge that in most contexts practitioners will not experience the situation we have just described, but will only have a suite of ancillary data and will need to estimate an optimal sample number.

The solution to choosing an optimal number of samples is to compare the empirical distribution functions of incrementally larger sample sizes with those of the population. The resulting output will display a typical exponential growth (or decay) curve, depending on what comparative metric is used, that plateaus at some point with increasing sample size. This allows one to invoke a diminishing-returns rule to derive an optimal sample number. This approach was exemplified in both Ramirez-Lopez et al. (2014), for selecting a number of observations to optimize the fitting of soil spectral models, and Stumpf et al. (2016), for evaluating an optimal sample size to calibrate a DSM model.

Comparing the empirical distribution functions of the samples and the population can be done in a number of ways. Stumpf et al. (2016) compared the overall variance of the ancillary data from an entire region, hereinafter referred to as the population, with that of the sample. The application of this approach to categorical data was not established; for categorical data, however, it is only necessary to match the relative proportions of the categories. Other metrics for continuous variables could involve comparison of the quantiles of the empirical distributions. This would require computing an absolute deviation between the quantiles (which could be an absolute difference or a Euclidean distance) for each ancillary variable, then deriving a unified value by combining the distances (weighted averaging) from each variable. Alternatively, comparison could be made of the principal components of the ancillary variables (using the approach described by Krzanowski (1979)), which would alleviate the need to first compute distances for each variable and then combine them into a single measure of distance.
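The quantile comparison just described could be sketched for a single continuous variable as follows. This is an illustrative Python sketch (the paper's own code is R); the helper names are ours, and per-variable deviations would then be combined by weighted averaging as described above:

```python
def quantiles(values, probs):
    """Empirical quantiles by linear interpolation between order statistics."""
    xs = sorted(values)
    n = len(xs)
    out = []
    for p in probs:
        h = p * (n - 1)
        lo = int(h)
        hi = min(lo + 1, n - 1)
        out.append(xs[lo] + (h - lo) * (xs[hi] - xs[lo]))
    return out

def quantile_deviation(sample, population, n_q=9):
    """Mean absolute deviation between sample and population quantiles."""
    probs = [(j + 1) / (n_q + 1) for j in range(n_q)]
    qs = quantiles(sample, probs)
    qp = quantiles(population, probs)
    return sum(abs(a - b) for a, b in zip(qs, qp)) / n_q

population = [i / 999 for i in range(1000)]   # uniform grid on [0, 1]
half = [i / 999 for i in range(500)]          # a sample covering half the range
assert quantile_deviation(population, population) == 0.0
assert quantile_deviation(half, population) > 0.1
```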

Probably a more standard approach for comparing empirical distribution functions is the Kullback & Leibler (1951) divergence, which is specifically designed for comparing distributions (Clifford et al., 2014). The Kullback–Leibler (KL) divergence (also called relative entropy) compares the relative proportions of samples within each histogram bin of the distributions. The KL divergence can be computed as:

KL = Σ_i O_i (log_e O_i - log_e E_i)    (1)

where E_i is the distribution of an ancillary variable for the population (i is a histogram bin), and O_i is the distribution of the same ancillary variable from a sample.

The KL divergence decreases towards zero as the population and sample distributions converge (Clifford et al., 2014) and can be defined for both categorical and continuous ancillary data. Signal processing is probably the most common use case for KL divergence, such as the characterization of relative entropy in information systems, where the related Shannon information criterion (Shannon & Weaver, 1949) is widely applied. Although useful, the KL divergence is not incorporated into the Minasny & McBratney (2006) cLHS algorithm, where instead comparison is made using the quantity:

Σ_i |O_i - E_i|

This metric could also be useful for optimizing a sample size, but the difference between it and the KL divergence is that it penalizes all departures from the population values equally. With the KL divergence, the penalty of missing a sample from the tail of a distribution is larger than the penalty of missing a sample close to the mode.
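The contrast between the two metrics can be shown numerically. Below is an illustrative Python sketch operating on histogram-bin proportions; the small smoothing constant eps is our addition to guard against taking the log of zero:

```python
import math

def kl_divergence(obs, exp, eps=1e-9):
    """Eq. (1): sum_i O_i (ln O_i - ln E_i) over histogram-bin proportions."""
    return sum(o * (math.log(o + eps) - math.log(e + eps))
               for o, e in zip(obs, exp))

def abs_deviation(obs, exp):
    """The cLHS-style metric sum_i |O_i - E_i|."""
    return sum(abs(o - e) for o, e in zip(obs, exp))

pop = [0.05, 0.25, 0.40, 0.25, 0.05]        # population bin proportions
miss_tail = [0.00, 0.30, 0.40, 0.25, 0.05]  # 0.05 of mass moved out of a tail
miss_mode = [0.05, 0.25, 0.35, 0.30, 0.05]  # 0.05 of mass moved near the mode

# the absolute-deviation metric penalizes both errors equally ...
assert abs(abs_deviation(miss_tail, pop) - abs_deviation(miss_mode, pop)) < 1e-12
# ... whereas the KL divergence penalizes the tail error more heavily
assert kl_divergence(miss_tail, pop) > kl_divergence(miss_mode, pop)
```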

In the following, we demonstrate the usage of KL divergence to help identify an optimal sample size in the ∼100 ha field previously described. The cLHS algorithm was run sequentially with increasing sample sizes, beginning at 10 and finishing at 500, with a step size of 10 samples. Each sample size was repeated 10 times to assess the dispersion about the mean KL divergence estimate. KL divergence was estimated using a bin number of 25 for each ancillary variable and then aggregated over all variables by calculating the mean. The justification for selecting a bin number of 25 was to replicate the settings implemented by Clifford et al. (2014). Our own small investigations indicated this bin number to be adequate for our data, but we would recommend that other studies and contexts evaluate the KL divergence response to various bin numbers via iteration. Too few bins may not adequately capture the main features of a distribution, giving misleadingly low divergence values, while too many bins may be overly sensitive to noise and give high KL divergence values for otherwise near-matching distributions.

In this work, the R package clhs (Roudier, 2011) was used to perform the cLHS. The inputs to the clhs function included a table (R data.frame) of the ancillary data collected in the field of interest, that is, the proximal sensed data and yield data. We used the default 10,000 iterations of the annealing process. A schematic of the sample size optimization algorithm is given in Fig. 1.
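The sample-size optimization loop can be sketched as follows. This is an illustrative Python stand-in: simple random draws replace the cLHS draws (which in the paper come from the clhs R package), purely to show the loop structure and the expected KL decay; the helper names and the synthetic variable are ours.

```python
import math
import random

def hist_proportions(values, lo, hi, bins=25):
    """Proportion of values in each of `bins` equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return [c / len(values) for c in counts]

def kl_divergence(obs, exp, eps=1e-9):
    """Eq. (1), with a small eps (our addition) guarding log(0)."""
    return sum(o * (math.log(o + eps) - math.log(e + eps))
               for o, e in zip(obs, exp))

rng = random.Random(42)
population = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # one ancillary variable
lo, hi = min(population), max(population)
pop_dist = hist_proportions(population, lo, hi)

mean_kl = {}
for n in range(10, 510, 100):        # a step of 100 here; the paper used 10
    reps = [kl_divergence(
                hist_proportions(rng.sample(population, n), lo, hi), pop_dist)
            for _ in range(10)]      # 10 repeats per sample size
    mean_kl[n] = sum(reps) / len(reps)

# the mean divergence decays towards zero with increasing sample size,
# allowing a diminishing-returns choice of sample number
assert mean_kl[410] < mean_kl[10]
```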

Where else can I sample when a cLHS location cannot be visited because of difficult terrain, locked gate, safety reasons, etc.?

One of the biggest criticisms levelled at cLHS is the rigidness of the sampling configuration. For example, both Thomas et al. (2012) and Kidd et al. (2015) remarked on the lack of guidance on what to do when a site cannot be accessed. They arrived at the same conclusion: rather than continuing with cLHS sampling, the more flexible approach of fuzzy stratified sampling should be used. Kidd et al. (2015) demonstrated that this alternative sampling approach was comparable to cLHS in their work. Nevertheless, once a sampling location selected by the cLHS algorithm cannot be visited, for any number of reasons, a replacement site needs to be selected. The introduction of replacement sites can potentially degrade the objective of cLHS if the replacements are arbitrarily selected in nearby locations that happen to be accessible.

The cLHS objective function can be modified so that accessibility is considered. This has been done by Roudier, Beaudette & Hewitt (2012) and Mulder, De Bruin & Schaepman (2013), where the cost of getting to sampling locations was incorporated into the annealing schedule of the cLHS algorithm. Such modifications, however, do not guarantee accessibility, but they do increase the probability that a site will be accessible. Adhering to an original cLHS site configuration, Stumpf et al. (2016) selected alternative sample sites by deriving numerous test sample sets based on the original cLHS sites, with all possible combinations of the covariate data from the cLHS strata. They selected the test sample set that most closely represented the original cLHS sites via quantile matching. Implementing this method would require prior knowledge of accessible and inaccessible sites, which for Stumpf et al. (2016) was based on terrain slope and selected land use classes.

Neither of the approaches described above accommodates situations where issues of accessibility are discovered in the field, that is, where site inaccessibility is unplanned.

Figure 1 Schematic of the sample size optimization algorithm. Full-size DOI: 10.7717/peerj.6451/fig-1

Clifford et al. (2014) grappled with this issue and proposed a solution that involves simulated annealing for optimally selecting accessible sites from a region. This method involves multiple criteria, such as the KL divergence, ease of access, and geographical coverage, which are all combined into a single criterion. While useful, this approach is computationally demanding, to the extent that it would be difficult to apply directly in a field setting. To obviate this, an ordered list of alternative sites close to each of the primary target sites (should the primary target site prove inaccessible) is produced prior to entering the field.

An easier, pragmatic option, as demonstrated in the USA by Brungard & Johanson (2015), is to calculate a similarity measure to an inaccessible cLHS site within a given buffer zone. In their example, Gower's dissimilarity index (Gower, 1971) was used. With this approach, areas with high similarity to the inaccessible cLHS site can be identified and the new sample location moved to these areas if needed. This allows alternative sites to be determined in the field, given an appropriate computing device, or before field sampling begins if areas with high similarity to each location are generated in advance. This approach is also conveniently provided as a function in the clhs R package by Roudier (2011).

In the following, we demonstrate an approach similar to that proposed by Brungard & Johanson (2015). It differs by generalizing the metric used to calculate the distance between individuals for each of the variables, and it uses a membership function to derive estimates of similarity to a site that has been deemed inaccessible. Flexibility in the selection of the distance metric allows alternatives other than the Gower distance to be considered.

We demonstrate this approach using a collection of 341 soil samples that cover the Hunter Wine Country Private Irrigation District (HWCPID), a 220 km2 region in the Lower Hunter Valley of New South Wales, approximately 140 km north of Sydney, Australia. The sampling and data collection are described in Malone et al. (2014). For this example, we used the following ancillary variables derived from a 25 m digital elevation model: elevation, slope, terrain wetness index, multi-resolution valley bottom flatness, and potential incoming solar radiation. We also used categorical rasterized data from a 1:100,000 geological unit map (Hawley, Glen & Baker, 1995) and a 1:250,000 legacy soil map depicting the surveyed soil units (Kovac & Lawrie, 1990) that cover the HWCPID. In this example, each of the 341 sampling locations was assumed to be inaccessible and an alternative site was selected for each.

We then compared the KL divergence between the original and alternative sample sites and assessed whether the alternative sample sites capture the same environmental information, using the following algorithm (also illustrated in Fig. 2).

Given an inaccessible site location:

1. Create a sampling zone (i.e., buffer) from which an alternative site can be selected, and extract all the ancillary data from inside this zone. We used a circular sampling zone of radius 500 m.

2. Calculate the multivariate distance between the inaccessible site and all ancillary data in the sampling zone. If ancillary data are categorical, remove all areas of the sampling

zone that do not match the category at the initial sampling location. If ancillary data are continuous, calculate the Mahalanobis distance.

3. Transform the Mahalanobis distance to a similarity score using a negative sigmoid function:

Similarity = 1 - 1 / (1 + e^(-(dist_i - dist_med)))    (2)

where dist_i is the Mahalanobis distance between the ancillary data at location i and the inaccessible site, and dist_med is the median Mahalanobis distance of the available data within the buffer area. We used the median because it is less susceptible to outliers than the mean. The similarity ranges between 0 and 1, with values approaching 1 being highly similar, and is akin to a membership function.

4. Select an alternative site where the similarity is above a given threshold. In our example this threshold was set to 0.975. Alternatively, one could rank the sites (smallest to largest distance) and select a given number from the top for possible consideration. Whichever the approach, an alternative site may be selected at random from the sites that pass the threshold similarity value, or the one nearest to the inaccessible site could be selected, and so on until a sampling location that can be accessed is found. In our example, we selected the former option.

5. Repeat steps 1 to 4 for every inaccessible site location.
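Steps 1-4 can be sketched for a single continuous ancillary variable. In this illustrative Python sketch (the names are ours, not the authors' R code), a univariate absolute difference stands in for the multivariate Mahalanobis distance, and the buffer extraction and categorical masking are assumed to have been done already:

```python
import math
import random

def similarity(dist, dist_med):
    """Eq. (2): a negative sigmoid of distance, centred on the median distance."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(dist - dist_med)))

def select_alternative(target, zone, threshold=0.975, seed=0):
    """Select an alternative cell from a buffer zone for an inaccessible site.

    `zone` holds the ancillary value of each candidate cell and `target` the
    value at the inaccessible site.
    """
    rng = random.Random(seed)
    dists = [abs(v - target) for v in zone]      # stand-in for Mahalanobis
    dist_med = sorted(dists)[len(dists) // 2]    # median distance in the zone
    candidates = [i for i, d in enumerate(dists)
                  if similarity(d, dist_med) > threshold]
    # step 4: choose at random among the sufficiently similar cells
    return rng.choice(candidates) if candidates else None

zone = [10.0, 10.1, 10.2, 14.0, 18.0, 25.0, 30.0, 41.0]
idx = select_alternative(10.05, zone)
assert idx in (0, 1, 2, 3)  # only the nearer cells pass the 0.975 threshold
```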

How do I account for existing samples when designing a new survey?

This is the situation where existing soil samples exist within a sampling domain, and the user wants to derive a sampling configuration that captures environmental variation that the existing samples fail to capture appropriately. With its current functionality, the clhs R package by Roudier (2011) goes some way to addressing this problem by forming n1 + n2 marginal strata per covariate. Here n1 represents existing site data, and n2 represents marginal strata where new sampling locations can be derived. Essentially, one selects a sample size of n2, adds in the existing sample data (n1) that have previously been collected from the study area, and then runs the clhs algorithm. This appears a useful solution but probably needs to be tested or compared with other approaches such as those described below.

Possibly a more explicit approach to this problem is to first assess the environmental coverage of the existing samples, then look for gaps (where there is no coverage in the data space), and then prioritize new samples to those areas as they occur in the field. This general approach was performed by Stumpf et al. (2016). A more thoroughly described approach is the hypercube evaluation of a legacy sample (HELS) algorithm and its companion, the HISQ algorithm, described in Carré, McBratney & Minasny (2007). The HELS algorithm is used to check the occupancy of the existing samples in the hypercube of the quantiles of given ancillary data, and to determine whether they occupy the hypercube uniformly or whether there is over- or under-observation in partitions of the hypercube. The companion HISQ algorithm is used to preferentially select additional samples in areas where the degree of under-observation is

Figure 2 Selection of alternate sites. Illustrated process of the algorithm for selecting an alternative sampling site when a cLHS site is inaccessible. (A) A sample site (as indicated by a marked point) within the HWCPID area that has been determined to be inaccessible. (B) A circular buffer area is created around the site. If categorical data are being used, those categories that do not match that at the sampling

greatest. A limitation of the HELS algorithm is that the full hypercube of all ancillary variables needs to be computed (i.e., if there are k variables, k-cubes need to be formed). We present an algorithm that is an adaptation and simplification of the combined HELS and HISQ algorithms.

This example uses the same ancillary data and the 341 sample sites from the HWCPID study area described earlier. For this example, we want to add an additional 100 samples while accounting for the prior 341 sample sites. We call this algorithm adapted HELS (aHELS). A schematic of the aHELS algorithm is shown in Fig. 3 and described in the following:

Figure 2 (continued) site are excluded. In the given example a circular buffer zone is used, but a portion was excluded due to non-matching of the categorical variables. (C) The Mahalanobis distance is estimated between the inaccessible site and all cells in the buffer area. (D) Similarity is estimated using Eq. (2) and ranges between 0 and 1. An alternative sample site is selected if the similarity exceeds 0.975.

Full-size DOI: 10.7717/peerj.6451/fig-2

Figure 3 Adapted HELS algorithm. Schematic of the adapted HELS (aHELS) algorithm. To use this algorithm, one needs a collection of existing site data from the study area, together with a suite of ancillary data rasters. Finally, a specified sample size (s) needs to be given.

Full-size DOI: 10.7717/peerj.6451/fig-3

0a. Select a sample size s. In this example s = 100.

0b. Extract the ancillary data values at the existing observations (o). In this example o = 341.

1a. Construct a quantile matrix of the ancillary data. If there are k ancillary variables, the quantile matrix will have (s + 1) × k dimensions. The rows of the matrix are the quantiles of each ancillary variable.

1b. Calculate the data density of the ancillary data. For each element of the quantile matrix, tally the number of pixels within the bounds of the specified quantile for that matrix element. This number is divided by r, where r is the number of pixels of an ancillary variable.

1c. Calculate the data density of the ancillary information at the existing legacy samples. This is the same as 1b, except the number of existing observations is tallied within each quantile (quantile matrix from 1a) and the density is calculated by dividing the tallied number by o instead of r.

2. Evaluate the ratio of densities. This is calculated as the point data density divided by the grid data density.

3. Rank the density ratios from smallest to largest. Across all elements of the (s + 1) × k matrix of ratios from step 2, rank the values from smallest to largest. The smallest ratios indicate those quantiles that are under-sampled and should be prioritized when selecting additional sample locations. In this ranking, it is important to save the element row and column indexes.

4a. Begin selection of additional sample locations. Start by initiating a sample of size s.

4b. While s > 0, work down the ranked list of ratios:

5. Estimate how many samples (m) are required to make the point data density equal to the grid data density. This is calculated as o × the grid data density in the same row and column position as the density ratio.

6. Using the same row and column position of the quantile matrix from 1a, select from the grid data all possible locations that meet the criterion of the quantile at that position in the matrix. Select at random from this list of possible locations m sampling locations.

7. Set s = s - m, then go back to 4b.

The algorithm will terminate once s sample locations have been selected. A caveat to using the aHELS algorithm is that it is not immediately amenable to optimizing the sample size as described in the earlier section. Essentially, one needs to select a sample size, then run the algorithm.
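The aHELS steps above can be sketched for a single ancillary variable, so the quantile matrix collapses to one column of s strata. This Python sketch and its toy data are ours, not the authors' R implementation:

```python
import bisect
import random

def ahels_1d(grid, legacy, s, seed=1):
    """Sketch of aHELS for one ancillary variable.

    grid   : ancillary values of all r pixels
    legacy : ancillary values at the o existing observations
    s      : number of additional samples wanted
    """
    rng = random.Random(seed)
    o, r = len(legacy), len(grid)
    xs = sorted(grid)
    # 1a. quantile break points defining s marginal strata
    breaks = [xs[min(round(q * (r - 1) / s), r - 1)] for q in range(s + 1)]

    def stratum(v):
        """Index of the stratum containing value v."""
        return min(max(bisect.bisect_right(breaks, v) - 1, 0), s - 1)

    # 1b/1c. grid and legacy data densities per stratum
    grid_n, leg_n = [0] * s, [0] * s
    for v in grid:
        grid_n[stratum(v)] += 1
    for v in legacy:
        leg_n[stratum(v)] += 1
    grid_d = [c / r for c in grid_n]
    leg_d = [c / o for c in leg_n]

    # 2-3. rank strata by density ratio (legacy / grid), smallest first
    ranked = sorted(range(s),
                    key=lambda j: leg_d[j] / grid_d[j] if grid_d[j] else float("inf"))

    # 4-7. work down the ranked strata until s new locations are chosen
    need, chosen = s, []
    for j in ranked:
        if need <= 0:
            break
        m = min(max(round(o * grid_d[j]), 1), need)   # step 5
        pool = [v for v in grid if stratum(v) == j]   # step 6
        take = min(m, len(pool))
        chosen += rng.sample(pool, take)
        need -= take
    return chosen

grid = [i / 99 for i in range(1000)]        # hypothetical raster values
legacy = [v for v in grid if v > 5.0]       # legacy sites cover only high values
new = ahels_1d(grid, legacy, 20)
# new samples are drawn preferentially from the under-sampled low end
assert len(new) == 20 and max(new) < 5.0
```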

Possibly a more visual way to assess relatively adequate and under-sampled areas, complementary to the aHELS algorithm, is to create a map showing such patterns. We developed an algorithm called count of observations (COOBS) to achieve this objective. The COOBS algorithm is implemented on a pixel basis and requires a stack of ancillary data pertaining to the environmental variables for a given spatial domain, and the associated legacy data points from the same spatial domain. The legacy points will have been intersected with the ancillary data. We demonstrate the COOBS algorithm using the HWCPID example. A schematic of the COOBS algorithm is presented in Fig. 4.

1. For each pixel location i:

1a. Calculate the multivariate distance to every other location in the stacked ancillary data. We use the Mahalanobis distance because it preserves the correlation between variables, but other distance or similarity metrics can be considered.

1b. Record the maximum distance, which we define as magpd. Note that the minimum distance will always be zero.

2. For each pixel location i:

2a. Calculate the multivariate distance of the ancillary data between pixel i and each of the legacy data points. We call this the data distance (dd).

2b. Calculate a standardized distance (sdd) for each legacy data point, which is simply:

sdd_i = 1 - dd / magpd_i

As the sdd scales from 0 to 1, values close to 1 mean that at least one observation point is similar to the ancillary data for the selected pixel.

3. Set a similarity threshold close to 1, then count the number of legacy observations that meet or exceed that criterion. Save this number and the pixel location together, then move to the next pixel, that is, go back to Step 2a.

From the COOBS algorithm described above, and depending on how it is implemented, the first step will create a raster of the target area depicting the spatial pattern of magpd. This map is then interrogated pixel-by-pixel in Step 2 to estimate the associated COOBS number. Alternatively, the two steps of the COOBS algorithm could be implemented as a single workflow whereby the maps of magpd and COOBS are derived at near the same time, pixel-by-pixel.
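The two-step procedure above can be sketched compactly. The paper's implementation is in R; the following Python sketch is illustrative only, with function and variable names (`coobs`, `pixels`, `legacy`) that are ours, and it holds everything in memory rather than operating raster-by-raster.

```python
import numpy as np

def coobs(pixels, legacy, threshold=0.95):
    """Count of observations (COOBS) per pixel (sketch).

    pixels : (n, p) array of ancillary covariate values for every pixel
    legacy : (k, p) array of covariate values at the legacy sample points
    Returns an (n,) array of counts of legacy points whose standardized
    distance sdd = 1 - dd / magpd meets the similarity threshold.
    """
    # Inverse covariance for the Mahalanobis distance, which preserves
    # the correlation between covariates (steps 1a and 2a)
    vi = np.linalg.inv(np.cov(pixels, rowvar=False))

    def mahal(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt(np.einsum('ijk,kl,ijl->ij', diff, vi, diff))

    # Step 1: maximum distance from each pixel to any other pixel (magpd)
    magpd = mahal(pixels, pixels).max(axis=1)
    # Step 2: data distance (dd) from each pixel to each legacy point
    dd = mahal(pixels, legacy)
    # Step 2b: standardized distance, scaled to [0, 1]
    sdd = 1.0 - dd / magpd[:, None]
    # Step 3: count legacy points at least as similar as the threshold
    return (sdd >= threshold).sum(axis=1)
```

The all-pairs distance in Step 1 is the O(n^2) cost that makes COOBS computationally heavy, as the discussion later notes.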

Figure 4 COOBS algorithm. Schematic of the COOBS algorithm. To use this algorithm, one needs their collection of existing site data from the study area, together with a suite of ancillary data rasters.

Full-size DOI: 10.7717/peerj.6451/fig-4


RESULTS AND DISCUSSION

How many samples should I collect?

The resulting output, shown in Fig. 5A, is a plot of the mean KL divergence against sample size, which follows a classical exponential decay with increasing sample size. We do not show the standard deviation bars on this plot as they are very small and not visible at this scale. However, relatively larger standard deviations were observed at smaller sample sizes, decreasing as sample size increased. We also calculated the KL divergence for the 238 samples already

Figure 5 KL divergence as a function of sample size. (A) KL divergence between the ancillary data of a sample and the ancillary data of the population as a function of sample size. The fitted line is a negative exponential decay function, y = b1 * exp(-kx) + b0, where x and y are sample size and KL divergence, respectively. The fitted parameters b0, b1, and k were 0.275, 1.185, and 0.028, respectively. The "+" symbol is the KL divergence for a grid sample of 239. (B) Cumulative density function of 1 - KL divergence (from A) as a function of sample size. The point of intersection of the two straight lines is the optimal sample size, as this is the point where the CDF reaches 95%. Full-size DOI: 10.7717/peerj.6451/fig-5


collected from the field of interest. This was found to be 0.039, shown in Fig. 5A as the "+" symbol. This mark sits above the fitted line for the same sample size using cLHS, indicating that (for this example) cLHS is superior to grid sampling for capturing the variation in the ancillary data at the same sample size.

To estimate an optimal sample size, we derived a cumulative density function for the fitted line in Fig. 5A and identified the sample size required to capture 95% of the cumulative probability (Fig. 5B). The optimal sample size was 110. The KL divergence for this sample size was 0.033, slightly lower (better) than that achieved with the actual 238 grid samples.

In a practical setting, the 110 samples would be used as the minimum recommended sample size. One caveat, however, is that other comparison metrics are likely to yield different optimal sample sizes because the underlying calculations differ. Where multiple comparison metrics are used, it may be pragmatic to determine the minimum and maximum optimal sample sizes and then select a recommended sample size between them. Though likely not optimal, this approach would be an improvement on the arbitrary selection of sample size that currently occurs in most studies.
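The fit-then-CDF procedure can be sketched as follows. This is an illustrative Python sketch, not the authors' R scripts: the decay is fitted by a simple grid search over k with linear least squares for b1 and b0, and the exact construction of the cumulative density function in the paper may differ from the discretization used here.

```python
import numpy as np

def fit_decay(x, y, k_grid=np.linspace(1e-4, 0.2, 400)):
    """Fit y = b1*exp(-k*x) + b0 by grid search over k, with
    linear least squares for (b1, b0) at each candidate k."""
    best = None
    for k in k_grid:
        A = np.column_stack([np.exp(-k * x), np.ones_like(x)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        sse = float(np.sum((A @ coef - y) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[1], coef[0], k)   # (sse, b0, b1, k)
    return best[1], best[2], best[3]

def optimal_sample_size(x, y, level=0.95):
    """Sample size at which the cumulative density of (1 - fitted KL
    divergence) reaches `level` (95% in the paper)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    b0, b1, k = fit_decay(x, y)
    grid = np.arange(int(x.min()), int(x.max()) + 1)
    # 1 - KL divergence; assumed positive over the sample-size range
    gain = 1.0 - (b0 + b1 * np.exp(-k * grid))
    cdf = np.cumsum(gain) / np.sum(gain)
    return int(grid[np.searchsorted(cdf, level)])
```

In practice, x would be the trialled sample sizes and y the mean KL divergence at each size from repeated cLHS draws.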

Where else can I sample when a cLHS location cannot be visited because of difficult terrain, a locked gate, safety reasons, etc.?

Table 1 presents the KL divergences for each ancillary variable for both the original sample configuration and the alternative sample configuration for each of the 341 sites. Overall, the mean KL divergence is equal for both sampling configurations, which is a desired outcome of the algorithm. With this relatively simple procedure, or by adopting the similar approach described in Brungard & Johnanson (2015), it becomes possible to derive alternative sample locations within the field at the time of sampling. This gives the field technician an objective way to select an alternative sample site (or possibly even a general area) during the survey campaign.
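The per-variable comparison in Table 1 rests on computing a KL divergence between the binned distribution of each ancillary variable in a sample configuration and in the whole population of pixels. A minimal sketch, assuming a fixed bin count and a small offset for empty bins (neither taken from the paper's R scripts):

```python
import numpy as np

def kl_divergence(sample, population, bins=25):
    """KL divergence between the binned distributions of one ancillary
    variable in the sample versus the whole-field population (sketch)."""
    edges = np.linspace(population.min(), population.max(), bins + 1)
    p, _ = np.histogram(population, bins=edges)
    q, _ = np.histogram(sample, bins=edges)
    # A small offset keeps the log and the ratio finite in empty bins
    p = (p + 1e-12) / (p + 1e-12).sum()
    q = (q + 1e-12) / (q + 1e-12).sum()
    return float(np.sum(p * np.log(p / q)))
```

Applying this to each ancillary variable (slope, elevation, and so on) for the original and relocated configurations, and averaging, yields the per-variable columns and overall mean of Table 1.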

How do I account for existing samples when designing a new survey?

Figure 6A shows a map with the original 341 sites and the additional 100 sites selected using the aHELS algorithm. The COOBS algorithm created the map in Fig. 6B.

Table 1 KL-divergence comparison between the original sample and the relocated sample.

Ancillary variable                                   KL-divergence (original)   KL-divergence (relocated)
Terrain wetness index (unitless)                     0.021                      0.019
Slope (°)                                            0.038                      0.041
Multi-resolution valley bottom flatness (unitless)   0.009                      0.019
Potential incoming solar radiation (kWh/m²)          0.063                      0.052
Elevation (m)                                        0.034                      0.033
Mean                                                 0.033                      0.033

Note: KL-divergence of the original sample configuration and the relocated sample configuration (341 sites) for each of the ancillary data and an overall mean.


A caveat to the COOBS algorithm, though, is that it is computationally heavy. Some of this burden can be eased via parallel computation, however. Alternatively, in order to take advantage of spatial correlation, it could be possible to take a representative sample of a specified size from the provided ancillary data, calculate magpd for each of these locations, and then interpolate this target variable across the whole area. Future investigations would need to assess the viability and efficiency of this approach.

Overall, both the aHELS and COOBS algorithms allow one to understand which areas in a spatial domain are adequately and under-sampled. The aHELS algorithm is explicit in preferentially selecting sites at locations in the environment that are not captured by the existing site data. Relatedly, the COOBS algorithm provides a visualization of general areas where under-sampling is prevalent. As an aside, it may be possible to use the COOBS-derived map to design an additional survey by constraining the cLHS algorithm to areas where the COOBS value is below some specified threshold. Nevertheless, aHELS and COOBS are comparable in their objectives and ultimately generate similar outcomes, as shown in Table 2, where the proportion of sample sites selected using aHELS is much higher in the areas where the COOBS algorithm determined relatively low existing data coverage. For example, 87% of additional sites were allocated in areas where the existing data coverage is ≤10 sites.
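The suggested constraint of cLHS to low-COOBS areas amounts to masking the candidate pixel pool before running the sampler. A minimal sketch, with names (`clhs_candidates`, `max_coobs`) that are ours rather than from the paper:

```python
import numpy as np

def clhs_candidates(coobs_map, ancillary, max_coobs=5):
    """Restrict the cLHS candidate pool to under-sampled pixels (sketch).

    coobs_map : (n,) COOBS count per pixel
    ancillary : (n, p) covariate values per pixel
    Returns the indices and covariate rows eligible for cLHS selection.
    """
    mask = coobs_map < max_coobs   # keep only poorly covered pixels
    return np.flatnonzero(mask), ancillary[mask]
```

The returned covariate subset would then be passed to a cLHS implementation (e.g. the clhs R package) in place of the full raster stack.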

Figure 6 Comparative outputs of the aHELS and COOBS algorithms. (A) HWCPID existing 341 sites (crosses) and additional 100 sites (dots) selected using the adapted HELS algorithm. (B) Existing sample data coverage for the 341 HWCPID sites based on the COOBS algorithm. Values indicate the number of existing soil samples that are above a similarity threshold for each pixel. Lower values indicate areas with fewer similar observations, which likely require additional sampling.

Full-size DOI: 10.7717/peerj.6451/fig-6


CONCLUSIONS

This technical note provides some solutions to common questions that arise when the cLHS algorithm is used for designing a soil survey, or any other environmental survey. We have collated solutions that others have devised to deal with such questions, and have also presented new and modified solutions. We have presented a solution to optimize sample size and to take existing soil samples into consideration. Importantly, we have introduced a relatively simple approach to relocate a soil sampling site in situations of inaccessibility without degrading the original cLHS site configuration. R scripts with associated data examples for each of these solutions are shared in a data repository at: https://bitbucket.org/brendo1001/clhc_sampling.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding

The authors received no funding for this work.

Competing Interests

Budiman Minasny is an Academic Editor for PeerJ.

Author Contributions

- Brendan P. Malone conceived and designed the experiments, performed the experiments, analyzed the data, contributed reagents/materials/analysis tools, prepared figures and/or tables, authored or reviewed drafts of the paper, approved the final draft.

- Budiman Minasny conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, authored or reviewed drafts of the paper, approved the final draft.

- Colby Brungard authored or reviewed drafts of the paper, approved the final draft.

Data Availability

The following information was supplied regarding data availability:

Bitbucket: https://bitbucket.org/brendo1001/clhc_sampling/src/master/.

Table 2 Similarity between adapted HELS and COOBS algorithm outputs.

Legacy sample data coverage        Proportion of allocated    Proportion of
(derived from COOBS algorithm)     adapted HELS sites (100)   existing sites (341)
0–5                                0.66                       0.26
5–10                               0.21                       0.17
10–20                              0.12                       0.30
20–40                              0.01                       0.23
>40                                0.00                       0.03

Note: Proportions of original and additional sampling sites in the HWCPID, selected using the adapted HELS algorithm, that occur within groupings of the sample data coverage determined using the COOBS algorithm. COOBS means, at the pixel level, the count of observations estimated to be similar in terms of the given ancillary data.


REFERENCES

Brungard C, Johnanson J. 2015. The gate's locked! I can't get to the exact sampling spot … can I sample nearby? Pedometron: Newsletter of the Pedometrics Commission of the IUSS (37). Sydney. Available at http://www.pedometrics.org/Pedometron/Pedometron37.pdf.

Carré F, McBratney AB, Minasny B. 2007. Estimation and potential improvement of the quality of legacy soil samples for digital soil mapping. Geoderma 141(1–2):1–14 DOI 10.1016/j.geoderma.2007.01.018.

Chu H-J, Chen R-A, Tseng Y-H, Wang C-K. 2014. Identifying LiDAR sample uncertainty on terrain features from DEM simulation. Geomorphology 204:325–333 DOI 10.1016/j.geomorph.2013.08.016.

Clifford D, Payne JE, Pringle MJ, Searle R, Butler N. 2014. Pragmatic soil survey design using flexible Latin hypercube sampling. Computers & Geosciences 67:62–68 DOI 10.1016/j.cageo.2014.03.005.

de Gruijter J, Brus D, Bierkens M, Knotters M. 2006. Sampling for natural resource monitoring: statistics and methodology of sampling and data analysis. Berlin: Springer-Verlag, 343.

Gower JC. 1971. A general coefficient of similarity and some of its properties. Biometrics 27(4):857–871 DOI 10.2307/2528823.

Hawley SP, Glen RA, Baker CJ. 1995. Newcastle coalfield regional geology 1:100 000. First Edition. Sydney: Geological Survey of New South Wales.

Jeong G, Choi K, Spohn M, Park SJ, Huwe B, Ließ M. 2017. Environmental drivers of spatial patterns of topsoil nitrogen and phosphorus under monsoon conditions in a complex terrain of South Korea. PLOS ONE 12(8):e0183205 DOI 10.1371/journal.pone.0183205.

Kidd D, Malone B, McBratney A, Minasny B, Webb M. 2015. Operational sampling challenges to digital soil mapping in Tasmania, Australia. Geoderma Regional 4:1–10 DOI 10.1016/j.geodrs.2014.11.002.

Kopačková V, Ben-Dor E, Carmon N, Notesco G. 2017. Modelling diverse soil attributes with visible to longwave infrared spectroscopy using PLSR employed by an automatic modelling engine. Remote Sensing 9(2):134 DOI 10.3390/rs9020134.

Kovac M, Lawrie JM. 1990. Soil landscapes of the Singleton 1:250 000 sheet. Sydney: Soil Conservation Service of NSW.

Krzanowski WJ. 1979. Between-groups comparison of principal components. Journal of the American Statistical Association 74(367):703–707 DOI 10.2307/2286995.

Kullback S, Leibler RA. 1951. On information and sufficiency. Annals of Mathematical Statistics 22(1):79–86.

Malone BP, Hughes P, McBratney AB, Minasny B. 2014. A model for the identification of terrons in the Lower Hunter Valley, Australia. Geoderma Regional 1:31–47 DOI 10.1016/j.geodrs.2014.08.001.

McBratney AB, Mendonça Santos ML, Minasny B. 2003. On digital soil mapping. Geoderma 117(1):3–52 DOI 10.1016/S0016-7061(03)00223-4.

McKay MD, Beckman RJ, Conover WJ. 1979. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21(2):239–245 DOI 10.2307/1268522.

Minasny B, McBratney AB. 2006. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences 32(9):1378–1388 DOI 10.1016/j.cageo.2005.12.009.


Mulder VL, De Bruin S, Schaepman ME. 2013. Representing major soil variability at regional scale by constrained Latin hypercube sampling of remote sensing data. International Journal of Applied Earth Observation and Geoinformation 21:301–310 DOI 10.1016/j.jag.2012.07.004.

Ramirez-Lopez L, Schmidt K, Behrens T, Van Wesemael B, Demattê JAM, Scholten T. 2014. Sampling optimal calibration sets in soil infrared spectroscopy. Geoderma 226–227:140–150 DOI 10.1016/j.geoderma.2014.02.002.

Roudier P. 2011. clhs: an R package for conditioned Latin hypercube sampling. Available at https://cran.r-project.org/web/packages/clhs/index.html.

Roudier P, Beaudette D, Hewitt A. 2012. A conditioned Latin hypercube sampling algorithm incorporating operational constraints. In: Minasny B, Malone B, McBratney AB, eds. Digital Soil Assessments and Beyond: Proceedings of the 5th Global Workshop on Digital Soil Mapping. Leiden: CRC Press, 227–231.

Scarlett K, Daniel R, Shuttleworth LA, Roy B, Bishop TFA, Guest DI. 2015. Phytophthora in the Gondwana rainforests of Australia World Heritage Area. Australasian Plant Pathology 44(3):335–348 DOI 10.1007/s13313-015-0355-6.

Scarpone C, Schmidt MG, Bulmer CE, Knudby A. 2016. Modelling soil thickness in the critical zone for Southern British Columbia. Geoderma 282:59–69 DOI 10.1016/j.geoderma.2016.07.012.

Shannon CE, Weaver W. 1949. The mathematical theory of communication. Urbana: University of Illinois Press, 117.

Stumpf F, Schmidt K, Behrens T, Schönbrodt-Stitt S, Buzzo G, Dumperth C, Wadoux A, Xiang W, Scholten T. 2016. Incorporating limited field operability and legacy soil samples in a hypercube sampling design for digital soil mapping. Journal of Plant Nutrition and Soil Science 179(4):499–509 DOI 10.1002/jpln.201500313.

Sun X-L, Wang H-L, Zhao Y-G, Zhang C, Zhang G-L. 2017. Digital soil mapping based on wavelet decomposed components of environmental covariates. Geoderma 303:118–132 DOI 10.1016/j.geoderma.2017.05.017.

Thomas M, Clifford D, Bartley R, Philip S, Brough D, Gregory L, Willis R, Glover M. 2015. Putting regional digital soil mapping into practice in tropical Northern Australia. Geoderma 241–242:145–157 DOI 10.1016/j.geoderma.2014.11.016.

Thomas M, Odgers N, Ringrose-Voase A, Grealish G, Glover M, Dowling T. 2012. Soil survey design for management-scale digital soil mapping in a mountainous southern Philippine catchment. In: Minasny B, Malone BP, McBratney AB, eds. Digital Soil Assessments and Beyond: Proceedings of the 5th Global Workshop on Digital Soil Mapping. Leiden: CRC Press, 233–238.

Webster R, Lark M. 2013. Field sampling for environmental science and management. New York: Routledge, 192.
