Optimization of sample configurations for spatial trend estimation



Pedometrics 2015, 14 – 18 September 2015, Faculty of Labour Sciences, Avenida de Ollerías 2, Córdoba, Spain (37.891586, -4.777202)

Optimization of Sample Configurations for Spatial Trend Estimation

Alessandro Samuel-Rosa(1), Dick J Brus(2), Gustavo M Vasques(3), Lúcia H C Anjos(1)

(1) Universidade Federal Rural do Rio de Janeiro, Brazil ([email protected], [email protected]); (2) Alterra, Wageningen University and Research Centre, the Netherlands ([email protected]); (3) Embrapa Soils, Brazil ([email protected]).

Introduction

The spatial trend corresponds to the spatial variation of Z(s) that is explained linearly or non-linearly by the covariates. There are various methods to design samples for spatial trend estimation. One of the most widely used in soil science, the so-called conditioned Latin Hypercube Sampling (cLHS) (Minasny & McBratney, 2006), searches for a spatial sample that is optimal in terms of

1) coverage of the marginal distribution of numeric covariates,

2) linear correlation of numeric covariates, and

3) proportional sample sizes for the classes of factor covariates.

The idea is that such a sample lets us identify the “true” spatial trend even when we are ignorant about its form. We propose to improve on the existing cLHS and present our implementation in the R package spsann.

Measuring the Association Between Factor Covariates

Like the cLHS, our implementation is based on solving a multi-objective optimization problem (MOOP) using spatial simulated annealing. But instead of three objective functions, we define two. Accordingly, we redefine the optimization criterion as the reproduction of an Association/Correlation measure and the marginal Distribution of the Covariates (ACDC).

This is because the cLHS ignores the association among factor covariates and between factor and numeric covariates. We propose to use Pearson's r (correlation) only when all covariates are numeric, and Cramér's V (association) when some or all covariates are factors. In the latter case any numeric covariate is transformed to a factor covariate, with the factor levels defined by the marginal sampling strata.

Cramér's V is the square root of χ² / n divided by min(c − 1, r − 1), where r and c are the number of rows and columns of the contingency table, n is the number of observations, and χ² is the chi-squared statistic. The chi-squared statistic is the sum, over all cells of the contingency table, of (O_ij − E_ij)² / E_ij, where O_ij and E_ij are the observed and expected frequencies, respectively (Cramér, 1946).
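The definition of Cramér's V given in this section can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `cramers_v` is a hypothetical helper name (not a function of spsann), the `land_use` factor is made up, and the breakpoints are the unique breakpoints from the marginal strata example of this poster.

```r
# Cramér's V for two covariates treated as factors.
# `cramers_v` is a hypothetical helper, not part of spsann.
cramers_v <- function(x, y) {
  tab <- table(factor(x), factor(y))   # r x c contingency table, unused levels dropped
  n <- sum(tab)                        # number of observations
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  as.numeric(sqrt((chi2 / n) / min(ncol(tab) - 1, nrow(tab) - 1)))
}

# Numeric covariate cut into factor levels defined by its marginal strata
covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
breaks <- c(1, 2, 4, 9)
strata <- cut(covariate, breaks, include.lowest = TRUE)

# Made-up factor covariate for the same 15 sampling units
land_use <- rep(c("crop", "forest", "pasture"), times = 5)

cramers_v(strata, land_use)   # sqrt(0.2), about 0.447
```

The helper computes the chi-squared statistic directly from observed and expected frequencies rather than calling `chisq.test()`, which keeps every term of the definition visible and avoids the small-sample warnings that function would emit for toy data.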

Defining the Marginal Sampling Strata

The cLHS uses quantiles to create equal-area marginal sampling strata. Depending on the number of marginal strata, this may produce replicated breakpoints in regions with a relatively high frequency of covariate values.

R> # Replicated breakpoints
R> sample_size <- 5
R> covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
R> probs <- seq(0, 1, length.out = sample_size + 1)
R> breaks <- quantile(covariate, probs, na.rm = TRUE)
R> breaks
  0%  20%  40%  60%  80% 100%
 1.0  1.0  2.6  4.4  9.0  9.0

Replicated breakpoints create empty marginal strata, and their presence prevents the optimization algorithm from converging to the optimum. We propose defining marginal sampling strata using only the unique values of the sample quantiles estimated with a discontinuous function (Hyndman & Fan, 1996). This avoids creating empty marginal strata.

R> # Unique breakpoints
R> breaks <- quantile(covariate, probs, na.rm = TRUE, type = 3)
R> breaks <- unique(breaks)
R> breaks
[1] 1 2 4 9

As a result, each numeric covariate may have a different number of quasi-equal-size sampling strata. The number of sample points that should fall in each marginal sampling stratum is proportional to the number of sampling units in that stratum.

R> # Number of points per stratum
R> count <- hist(covariate, breaks, plot = FALSE)$counts
R> count <- count / sum(count) * sample_size
R> count
[1] 2 1 2
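The target counts per stratum can be used to judge how well a candidate sample covers the marginal distribution. The sketch below assumes a simple criterion, the total absolute deviation between observed and target counts; `strata_deviation` is a hypothetical helper and the exact criterion used in spsann may differ.

```r
# Data, unique breakpoints and target counts as in the examples above
sample_size <- 5
covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
probs <- seq(0, 1, length.out = sample_size + 1)
breaks <- unique(quantile(covariate, probs, na.rm = TRUE, type = 3))
target <- hist(covariate, breaks, plot = FALSE)$counts
target <- target / sum(target) * sample_size

# Hypothetical helper (assumed criterion): total absolute deviation between
# the number of sample points observed per stratum and the target number
strata_deviation <- function(idx) {
  observed <- hist(covariate[idx], breaks, plot = FALSE)$counts
  sum(abs(observed - target))
}

strata_deviation(c(1, 7, 4, 2, 11))  # covariate values 1, 2, 3, 5, 8 -> 0
strata_deviation(11:15)              # covariate values 8, 9, 9, 9, 9 -> 6
```

A value of zero means the candidate sample places exactly the target number of points in every marginal stratum; larger values indicate a poorer match.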

Avoiding Numerical Dominance

We also solve the MOOP by aggregating the objective functions into a single utility function U using a weighted sum, U = w_1 f_1(x) + ... + w_k f_k(x), the weights defining the relative importance of each objective function. Here w is a vector of positive weights that sum to unity and k is the number of objective functions (Marler & Arora, 2009). The improvement is that the objective functions are first scaled to the same approximate range of values using the upper-lower bound approach with the Pareto maximum (and minimum). The Pareto maximum of the ith objective function is the largest of the values f_i(x_j*), j = 1, ..., k, where x_j* is the point that minimizes the jth objective function, a vertex of the Pareto optimal set in the design space (Marler & Arora, 2005).
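The scaling and aggregation steps can be sketched with two toy objective functions on a one-dimensional design space. Everything in this example is made up for illustration: the functions, the candidate grid and the helper name `utility`. The scaling is assumed to take the upper-lower bound form (f_i(x) − f_i^min) / (f_i^max − f_i^min), with f_i^min the value of f_i at its own minimizer.

```r
# Two toy objective functions on very different numerical scales (made up)
f <- list(
  function(x) 100 * (x - 1)^2,
  function(x) (x + 1)^2
)
candidates <- seq(-2, 2, by = 0.5)   # toy design space

# x_j*: the candidate that minimizes the jth objective function
x_star <- sapply(f, function(fj) candidates[which.min(fj(candidates))])

# Matrix with element [i, j] = f_i(x_j*)
f_at_star <- sapply(x_star, function(x) sapply(f, function(fi) fi(x)))
f_min <- diag(f_at_star)             # Pareto minimum, f_i(x_i*)
f_max <- apply(f_at_star, 1, max)    # Pareto maximum, max over j of f_i(x_j*)

# Weighted sum of the scaled objective functions (hypothetical helper)
utility <- function(x, w = c(0.5, 0.5)) {
  values <- sapply(f, function(fi) fi(x))
  sum(w * (values - f_min) / (f_max - f_min))
}

f_max        # 400 4
utility(0)   # 0.25
```

After scaling, both objective functions range from 0 to 1 over the vertices of the Pareto set, so the weights alone determine their relative importance.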

Using the Pareto maximum (and minimum) avoids the numerical dominance (bias) of any single objective function. Such dominance occurs in the cLHS, where the first objective function (O1) yields criterion values much larger than the second (O2) and third (O3). It arises because O1 uses the number of points per stratum (0 to n), while O2 uses the proportion of points per stratum (0 to 1) and O3 uses the linear correlation coefficient (-1 to 1).
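A small arithmetic example shows how the difference in units alone lets one objective function dominate an unscaled sum. The numbers are made up: each criterion is given the same 10% deviation, expressed in its own units, and the divisors used for rescaling are assumed upper bounds chosen for illustration, not Pareto maxima.

```r
n <- 100                         # sample size (made up)
raw <- c(O1 = 0.10 * n,          # 10% of the points, as a count
         O2 = 0.10,              # the same deviation, as a proportion
         O3 = 0.10)              # a deviation in a correlation coefficient

# Share of each objective function in the unscaled sum
round(raw / sum(raw), 3)         # O1 accounts for about 98% of the total

# Dividing by assumed upper bounds puts the three on a common scale
scaled <- raw / c(n, 1, 1)
scaled / sum(scaled)             # equal shares of one third each
```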

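Cramér's V:

$$V = \sqrt{\frac{\chi^2 / n}{\min(c - 1, r - 1)}}$$

Chi-squared statistic:

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

Utility function:

$$U = \sum_{i=1}^{k} w_i f_i(x)$$

Pareto maximum:

$$f_i^{\max} = \max_{1 \le j \le k} f_i(x_j^*)$$

References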

Cramér, H. Mathematical methods of statistics. Princeton: Princeton University Press, p. 575, 1946.

Hyndman, R. J. & Fan, Y. Sample quantiles in statistical packages. The American Statistician, v. 50, p. 361-365, 1996.

Marler, R. T. & Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering Optimization, v. 37, p. 551-570, 2005.

Marler, R. T. & Arora, J. S. The weighted sum method for multi-objective optimization: new insights. Structural and Multidisciplinary Optimization, v. 41, p. 853-862, 2009.

Minasny, B. & McBratney, A. B. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Computers & Geosciences, v. 32, p. 1378-1388, 2006.

Preliminary Results

Our preliminary results indicate that the sample distributions obtained with our algorithm for the same set of covariates varied very little, which suggests that the criterion approaches the global optimum.

An in-depth study is being carried out to evaluate how our implementation performs compared to the original cLHS method (and other sample designs as well). Using simulated data, we will evaluate the ability of each design to capture the true form of the spatial trend (linear and non-linear) and to make accurate predictions.

Acknowledgements

We are grateful to Dr. Gerard Heuvelink, from ISRIC – World Soil Information, for his comments during the development of this work. The first author was supported by the CAPES Foundation, Ministry of Education of Brazil (Process BEX 11677/13-9), and by the CNPq Foundation, Ministry of Science and Technology of Brazil (Process 140720/2012-0). The last author was supported by the CNPq Foundation, Ministry of Science and Technology of Brazil (Process 480515/2013-1).

Student Presentation