Optimization of sample configurations for spatial trend estimation
Pedometrics 2015, 14 – 18 September 2015
Faculty of Labour Sciences, Avenida de Ollerías 2, Córdoba, Spain (37.891586, -4.777202)
Optimization of Sample Configurations for Spatial Trend Estimation
Alessandro Samuel-Rosa(1), Dick J Brus(2), Gustavo M Vasques(3), Lúcia H C Anjos(1)
(1) Universidade Federal Rural do Rio de Janeiro, Brazil ([email protected], [email protected]); (2) Alterra, Wageningen University and Research Centre, the Netherlands ([email protected]); (3) Embrapa Soils, Brazil ([email protected]).
Introduction
The spatial trend corresponds to the spatial variation of Z(s) that is explained linearly or non-linearly by the
covariates. There are various methods to design samples for spatial trend estimation. One of the most used in soil
science, the so-called conditioned Latin Hypercube Sampling (cLHS) (Minasny & McBratney, 2006), searches for a
spatial sample optimal in terms of
1) coverage of the marginal distribution of numeric covariates,
2) linear correlation of numeric covariates, and
3) proportional sample sizes for the classes of factor covariates.
The idea is that with such a sample we can identify the “true” spatial trend if we are ignorant about its form. We
propose to improve on the existing cLHS and present our implementation in the R-package spsann.
Measuring the Association Between Factor Covariates
Like the cLHS, our implementation is based on solving a multi-objective optimization problem (MOOP)
using spatial simulated annealing. But instead of three, we define two objective functions. As such, we redefine the
optimization criterion as the reproduction of an Association/Correlation measure and the marginal Distribution of
the Covariates (ACDC).
This is because the cLHS ignores the association among factor covariates and among factor and numeric
covariates. We propose to use Pearson's r (correlation) only when all covariates are numeric, and Cramér's V
(association) when some or all covariates are factors. In the latter case any numeric covariate is transformed to a
factor covariate, with the factor levels defined by the marginal sampling strata.
In the formula for Cramér's V, r and c are the number of rows and columns of the contingency table, n is the number of
observations, and $\chi^2$ is the chi-squared statistic, in which $O_{ij}$ and $E_{ij}$ are the observed and expected
frequencies, respectively, in cell (i, j) of the contingency table (Cramér, 1946).
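The calculation of Cramér's V can be sketched in base R as follows. This is a minimal illustration and not the code used in spsann: the function name `cramers_v` and the covariates `elevation` and `land_use` are hypothetical, and the sketch assumes that every factor level occurs at least once (otherwise an expected frequency of zero would produce `NaN`).

```r
# Cramér's V for two factor covariates (illustrative sketch, base R only).
cramers_v <- function(x, y) {
  tab <- unclass(table(x, y))                        # r x c contingency table
  n <- sum(tab)                                      # number of observations
  expected <- outer(rowSums(tab), colSums(tab)) / n  # expected frequencies
  chi2 <- sum((tab - expected)^2 / expected)         # chi-squared statistic
  sqrt((chi2 / n) / min(ncol(tab) - 1, nrow(tab) - 1))
}

# Hypothetical data: a numeric covariate converted to a factor using
# marginal sampling strata, and a factor covariate with three classes.
elevation <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
strata <- cut(elevation, breaks = c(1, 2, 4, 9), include.lowest = TRUE)
land_use <- factor(rep(c("crop", "forest", "pasture"), times = 5))

v_mixed <- cramers_v(strata, land_use)  # association between the two factors
v_self <- cramers_v(strata, strata)     # a factor against itself
```

Because V is bounded between 0 (no association) and 1 (perfect association), a factor compared with itself should return 1, which is a convenient sanity check for the sketch.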
Defining the Marginal Sampling Strata
The cLHS uses quantiles to create equal-area marginal sampling strata. Depending on the number of marginal
strata, this may produce replicated breakpoints in regions with a relatively high frequency of covariate values.
R> # Replicated breakpoints
R> sample_size <- 5
R> covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
R> probs <- seq(0, 1, length.out = sample_size + 1)
R> breaks <- quantile(covariate, probs, na.rm = TRUE)
R> breaks
0% 20% 40% 60% 80% 100%
1.0 1.0 2.6 4.4 9.0 9.0
The presence of replicated breakpoints prevents the optimization algorithm from converging to the optimum. We
propose defining marginal sampling strata using only the unique values of the sample quantiles estimated with a
discontinuous function (Hyndman & Fan, 1996). This avoids creating empty marginal strata.
R> # Unique breakpoints
R> breaks <- quantile(covariate, probs, na.rm = TRUE, type = 3)
R> breaks <- unique(breaks)
R> breaks
[1] 1 2 4 9
This approach results in each numeric covariate having a different number of quasi-equal-size sampling strata.
The number of sample points that should fall in each marginal sampling stratum is proportional to the number of
sampling units in that stratum.
R> # Number of points per stratum
R> count <- hist(covariate, breaks, plot = FALSE)$counts
R> count <- count / sum(count) * sample_size
R> count
[1] 2 1 2
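The steps above can be collected into a small helper, sketched here under stated assumptions. The function names `marginal_strata` and `count_mismatch` are hypothetical and are not part of spsann. The mismatch measure (sum of absolute differences between observed and target counts) is only an illustration in the spirit of the first cLHS objective function, not necessarily the criterion that spsann computes. The sketch also assumes that the covariate has at least two unique quantile values.

```r
# Define marginal sampling strata from unique type-3 sample quantiles and
# compute the target number of sample points per stratum.
marginal_strata <- function(x, sample_size) {
  probs <- seq(0, 1, length.out = sample_size + 1)
  breaks <- unique(quantile(x, probs, na.rm = TRUE, type = 3))
  pop <- as.vector(table(cut(x, breaks, include.lowest = TRUE)))
  list(breaks = breaks, target = pop / sum(pop) * sample_size)
}

# Illustrative criterion: how far the sample counts per stratum are from
# the target counts (0 means the marginal distribution is reproduced).
count_mismatch <- function(x_sample, strata) {
  obs <- as.vector(table(cut(x_sample, strata$breaks, include.lowest = TRUE)))
  sum(abs(obs - strata$target))
}

covariate <- c(1, 5, 1, 3, 4, 1, 2, 3, 2, 1, 8, 9, 9, 9, 9)
strata <- marginal_strata(covariate, sample_size = 5)

good <- count_mismatch(c(1, 2, 3, 8, 9), strata)  # 2, 1, 2 points per stratum
poor <- count_mismatch(c(1, 1, 1, 2, 2), strata)  # all points in one stratum
```

A sample that places 2, 1, and 2 points in the three strata matches the target exactly, whereas a sample concentrated in the first stratum is penalized.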
Avoiding Numerical Dominance
We also solve the MOOP by aggregating the objective functions into a single utility function, $U$, using a weighted
sum in which the weights define the relative importance of each objective function. Here $w$ is a vector of positive
weights that sum to unity and $k$ is the number of objective functions (Marler & Arora, 2009). The improvement is that
the objective functions are first scaled to the same approximate range of values using the upper-lower bound approach
with the Pareto maximum (and minimum), $f_i^{\max}$, where $x_j^*$ is the point that minimizes the jth objective
function, a vertex of the Pareto optimal set in the design space (Marler & Arora, 2005).
Using the Pareto maximum (and minimum) avoids the numerical dominance (bias) of any objective function,
such as occurs with the first objective function (O1) of the cLHS, which yields criterion values much larger than the
second (O2) and third (O3). The numerical dominance occurs because O1 uses the number of points per stratum (0 to
n), while O2 uses the proportion of points per stratum (0 to 1) and O3 uses the linear correlation coefficient (-1 to 1).
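The effect of scaling on the weighted sum can be illustrated with a short R sketch. It is not the spsann implementation: the function names `pareto_bounds` and `utility` and all numeric values are hypothetical. It assumes the usual form of the upper-lower bound transformation, in which each objective function is shifted by its Pareto minimum and divided by the difference between its Pareto maximum and Pareto minimum, and it takes the Pareto minimum of objective i to be its value at its own minimizer.

```r
# P[i, j] holds objective function i evaluated at the point that minimizes
# objective function j (hypothetical values for k = 2 objectives).
pareto_bounds <- function(P) {
  list(min = diag(P),           # Pareto minimum: f_i at its own minimizer
       max = apply(P, 1, max))  # Pareto maximum: largest f_i over all minimizers
}

# Weighted-sum utility after upper-lower bound scaling of each objective.
utility <- function(f, w, bounds) {
  stopifnot(all(w > 0), isTRUE(all.equal(sum(w), 1)))
  f_scaled <- (f - bounds$min) / (bounds$max - bounds$min)
  sum(w * f_scaled)
}

P <- rbind(c(2, 40),     # first objective: counts, large values
           c(0.9, 0.1))  # second objective: proportions, small values
bounds <- pareto_bounds(P)

f <- c(21, 0.5)   # current values of the two objective functions
w <- c(0.5, 0.5)  # equal weights

u_raw <- sum(w * f)                # unscaled sum, dominated by the first objective
u_scaled <- utility(f, w, bounds)  # both objectives contribute equally
```

In this example both objectives sit halfway between their Pareto minimum and maximum, so the scaled utility weighs them equally, while the unscaled sum is almost entirely determined by the first objective.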
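Cramér's V:
$$V = \sqrt{\frac{\chi^2 / n}{\min(c - 1, r - 1)}}$$
Chi-squared statistic:
$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Utility function:
$$U = \sum_{i=1}^{k} w_i f_i(x)$$
Pareto maximum:
$$f_i^{\max} = \max_{1 \le j \le k} f_i(x_j^*)$$
References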
Cramér, H. Mathematical methods of statistics. Princeton: Princeton University Press, p. 575, 1946.
Hyndman, R. J. & Fan, Y. Sample quantiles in statistical packages. The American Statistician, v. 50, p. 361-365,
1996.
Marler, R. T. & Arora, J. S. Function-transformation methods for multi-objective optimization. Engineering
Optimization, v. 37, p. 551-570, 2005.
Marler, R. T. & Arora, J. S. The weighted sum method for multi-objective optimization: new insights. Structural
and Multidisciplinary Optimization, v. 41, p. 853-862, 2009.
Minasny, B. & McBratney, A. B. A conditioned Latin hypercube method for sampling in the presence of ancillary
information. Computers & Geosciences, v. 32, p. 1378-1388, 2006.
Preliminary Results
Our preliminary results indicated that sampling distributions derived using our algorithm varied very little for
the same set of covariates, indicating that the criterion approaches the global optimum.
An in-depth study is being carried out to evaluate how our implementation performs compared to the original
cLHS method (and other sample designs as well). Using simulated data, we will evaluate their ability to capture the
true form of the spatial trend (linear and non-linear) and make accurate predictions.
Acknowledgements
We are grateful to Dr. Gerard Heuvelink, from ISRIC – World Soil Information, for his comments during the
development of this work. The first author was supported by the CAPES Foundation, Ministry of Education of
Brazil (Process BEX 11677/13-9), and by the CNPq Foundation, Ministry of Science and Technology of Brazil
(Process 140720/2012-0). The last author was supported by the CNPq Foundation, Ministry of Science and
Technology of Brazil (Process 480515/2013-1).
Student Presentation