1Correspondence to [email protected]
A Study on Prediction of Spatial Binomial Probabilities with an Application
to Spatial Design
Hao Zhang 1
Program in Statistics
Washington State University
Pullman, WA 99164
H. Holly Wang
Department of Agricultural Economics
Washington State University
Pullman, WA 99164
Abstract
This work studies some issues that are related to interpolation of binomial probabilities. In some situations, binomial
counts are observed at some spatial locations and binomial probabilities are interpolated at un-sampled locations
based on the sample data. An example in precision agriculture is considered in this work. A natural practical question
is how the number of sampling locations and the sampling sizes at the sampling locations affect the interpolation.
This question is studied in this work through simulations. The model-based geostatistics approach (Diggle et al.,
1998) is employed, in which the binomial counts are modeled through a spatial generalized linear mixed model. The
minimum mean-squared error prediction is carried out for the binomial probability.
1. Introduction
This work is motivated by an experiment on Cunningham farm that is located 8 miles north of Pullman, Washington.
The overall objective of the experiment is to identify factors that affect both yield and quality of wheat and barley,
the two major crops in the dryland farming area in the Pacific Northwest, so that measures of precision farming may
be applied to improve yield and quality so as to maximize profitability of the farm. Some variables that likely affect
both yield and quality are incidence rates of plant root diseases, soil properties, elevation and aspect ratio, all of
which are location-specific and measured locally. Once the dominating variables are identified and their effects on
yield and quality quantified, optimization of both yield and quality can be realized through the practice of precision
agriculture. One hundred locations were selected at random by the research team: the farm was divided into equally spaced rows
and columns, approximately 20 meters apart, and sites were then drawn at random from the resulting lattice. At each
sampling location, the incidence rate of some root disease and other variables were measured, along with the yield
obtained by hand-harvesting a square area 2 meters long and 2 meters wide. Variables such as protein content
that reflect the quality of wheat were also measured at the sites. These sample data are currently being studied to see how
they affect yield and quality. Once the dominating variables are identified and their effects on yield and quality
quantified, optimum profits can be achieved by locally adjusting or controlling the variables to the appropriate levels
through precision farming that utilizes local information. Clearly, data on these variables must be available at all
locations where quality and yield are to be optimized, and ideally we would like to optimize them across the whole
farm. This would require interpolation or prediction of the variables at all unsampled locations.
In this work, we will study some issues related to interpolation of a particular variable, namely, the incidence
rate of Rhizoctonia root rot caused by Rhizoctonia solani and Rhizoctonia oryzae. These fungi attach to the root
system and reduce the ability of plants to take up adequate water and nutrients, and consequently affect both yield
and quality of the crops. At each of the sampling sites on Cunningham farm, 15 barley plants were
sampled and pulled out of the ground in the summer of 2000, and the total number of crown roots and the number of
infected crown roots were obtained for each plant. The incidence rate of root rot at a site $s$ was obtained as
the total number of infected crown roots, $Y(s)$, divided by the total number of crown roots, $n(s)$, i.e., $Y(s)/n(s)$. These incidence
rates at the sampled sites are used to interpolate the incidence rate to as many sites across the farm as necessary. A
natural practical problem is how the number of sampling sites, $N$, and the sample sizes $n(s)$ affect the interpolation. For example, is it necessary to
sample more crops to increase $n(s)$, which currently ranges from 89 to 197? What is the gain from sampling more
locations? When the total sampling cost is controlled, do we prefer sampling more locations with each location
having a smaller sample size, or the reverse? These are the problems studied in this paper.
It is not possible to study these problems without first specifying how the prediction is to be carried out. Some
well-known geostatistical kriging (predicting) methods such as ordinary kriging, trans-Gaussian kriging and
disjunctive kriging might now emerge as possible interpolation methods for our problems. We refer to Cressie
(1993) for an introduction to each of the methods. However, if any of the methods were to be applied, it would be
applied to the ratio $Y(s)/n(s)$ of infected roots to sampled roots and not to the count $Y(s)$, because $Y(s)$ by itself, without the sample size $n(s)$, means nothing, and $n(s)$ varies from
site to site in the experiment. Once the ratios $Y(s)/n(s)$ are used for prediction, the sample sizes $n(s)$ no longer affect
prediction. Consequently, we would not be able to study how the sampling sizes affect the prediction.
Therefore, we will adopt the model-based geostatistics (Diggle et al., 1998), which incorporates the sample sizes
into the model and allows for the minimum mean-squared error (MMSE) prediction of the incidence rate.
This approach assumes that the counts $Y(s)$ of infected crown roots follow the following spatial generalized
linear mixed model (GLMM):
(a). $\{b(s)\}$ is a Gaussian stationary process with mean 0.
(b). Conditionally on $\{b(s)\}$, $\{Y(s)\}$ consists of independent variables. Moreover, the conditional
distribution of an individual $Y(s)$ is binomial with binomial index $n(s)$ and binomial probability
$p(s) = \exp\{\beta + b(s)\} / [1 + \exp\{\beta + b(s)\}]$.
It is reasonable to assume that the counts $Y(s)$ of infected crown roots are binomial with varying binomial
probabilities $p(s)$. Spatial variation is represented by the random term $b(s)$, which might be accounted for by
unknown or unobservable factors. In the absence of any physically based model for the spatial trend, the parametric
form may be reasonable. Under the spatial GLMM, the MMSE estimation or prediction of $p(s)$ is the
conditional expectation of $p(s)$ given the observed binomial responses at the sampling sites. Although it cannot be evaluated in
closed form, the MMSE prediction can be computed through Markov chain Monte Carlo (MCMC) methods.
Two approaches to the MMSE prediction of the binomial probability $p(s)$ have been considered under the spatial GLMM. Diggle et al.
(1998) considered Bayesian prediction of a function of random effects, such as $p(s)$, implemented using MCMC
methods. Zhang (2002a, b) considered MMSE prediction under the assumption that parameters are known or have
been estimated, and also relied on MCMC methods for the calculation. Zhang (2002b) showed that some analytical
results can be used in either the Bayesian or non-Bayesian approach to make the prediction computationally more
efficient.
Obviously, model parameters have to be known or estimated before prediction can be evaluated. For spatial
GLMMs, different inferential methods exist for model parameter estimation. Diggle et al. (1998) used a Bayesian
approach for parameter estimation. Zhang (2002a) considered the Monte Carlo EM algorithm for maximum
likelihood estimation. Penalized quasi-likelihood (Breslow and Clayton, 1993) can also be applied to this model.
Being aware that different parameter estimation methods exist for the model, and that these methods may lead to
different estimates, we will not study explicitly how the designs, especially the choice of the number of sampling sites $N$ and the sample sizes $n(s)$, affect
parameter estimates. Rather, we will focus on their effects on prediction.
Although this work is related to spatial sampling design and indeed deals with some aspects of design, it
differs from most works on spatial sampling design in several ways. Firstly, in most works on spatial designs, the
response variable $Z(s)$ can be decomposed into additive parts:
$$Z(s) = \mu(s) + \epsilon(s),$$
where the error term $\epsilon(s)$ has mean 0 and is either intrinsically stationary or second-order stationary, and its correlation
structure is independent of the mean surface $\mu(s)$ (see, for example, Warrick and Myers, 1987, Pesti et al., 1994,
Benedetti and Palma, 1995, and Muller and Zimmerman, 1999, among many others). Such a decomposition is
important in some of the works on spatial design. For example, when the mean is constant, the variogram of
$Z(s)$ can be estimated without estimating the mean. Hence the design criterion can be expressed in terms of the variogram
parameters alone (see, for example, Muller and Zimmerman, 1999). In our model, such a decomposition does not
exist because both the mean and the variance of the response variable depend on some common parameters.
Consequently, approaches to and results of classical spatial design need to be extended or modified to become
applicable to our model.
Secondly, the objective of a spatial sampling design is often the optimum allocation of spatial sampling
locations, where optimality may be defined with regard to trend estimation, variogram estimation or prediction.
Although it is an interesting and important problem, deciding optimum sampling locations is very complicated in the
Cunningham farm experiment, because not only is the response variable binomial, but other response variables are
also measured at the sampling sites. An optimum sampling design for one variable alone has little practical value to the
experiment. In addition, the response variable we consider in this paper is binomial. Hence the sample size at a
sampling location is a new variable that classical spatial designs do not entertain.
Lastly, unconditional variances of estimators or predictors are used in the criteria of spatial designs, such as
minimizing the weighted average variance of prediction in Fedorov (1996) and Muller (1998). However, most
applications of spatial prediction use the conditional mean and variance of the predicted variable given observed
data. For example, the kriging (or predicting) surface of the binomial probability $p(s)$ consists of the conditional mean of $p(s)$ given the
responses at sampling sites (Diggle et al., 1998). In our particular experiment, we may ask how this kriging
surface would change if fewer roots were sampled at each site (i.e., a smaller sample size $n(s)$). Hence, conditional inference is
more directly related to our problems.
Therefore, this work is not on spatial design in the classical sense. The primary focus of this paper is to study
how the number of sampling sites $N$ and the sample sizes $n(s)$ affect the prediction. There are no theoretical results that provide immediate answers to the
questions. We hence resort to simulation studies to see these effects on prediction. The rest of the paper is organized
as follows. In Section 2, we review some results of Zhang (2002b) that will be used in the subsequent section to
calculate the MMSE prediction of the binomial probability $p(s)$. Section 3 contains several simulation studies to see how the prediction of
$p(s)$ is affected by changes in $N$ and $n(s)$. When we change sampling sizes, we keep the incidence rates $Y(s)/n(s)$ unchanged
or approximately the same (the count $Y(s)$ has to be an integer), so that the conditional inferences, i.e., predicted values and
prediction variances given the same incidence rates, are comparable. Conclusions and discussion are presented in
Section 4.
2. MMSE Prediction in a Spatial GLMM
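In this section, we review some results of Zhang (2002b) that can be used to efficiently calculate the MMSE prediction
for $p(s_0)$ at an unsampled site $s_0$. Let $\{Y(s)\}$ satisfy the spatial GLMM defined in the introduction and $\{b(s)\}$ be the random effects. Let
$s_1, \ldots, s_N$ be the sampling sites and write $Y_i$, $b_i$ for $Y(s_i)$ and $b(s_i)$, respectively, $\mathbf{Y} = (Y_1, \ldots, Y_N)'$, and
$\mathbf{b} = (b_1, \ldots, b_N)'$. Zhang (2002b) showed that for any function $g$,
$$E\{g(b(s_0)) \mid \mathbf{Y}\} = E\big[\,E\{g(b(s_0)) \mid \mathbf{b}\} \mid \mathbf{Y}\,\big]. \qquad (1)$$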
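Monte Carlo samples from the conditional distribution of the random effects $\mathbf{b}$ at the sampling sites given the observed responses $\mathbf{Y}$ can be generated through an MCMC method, and the
right-hand side of the equation can be approximated by the appropriate sample average. For example, the
Metropolis-Hastings algorithm is quite easy to implement for the spatial GLMM, as seen in Diggle et al. (1998)
and Zhang (2002a, b). Once Monte Carlo samples $\mathbf{b}^{(1)}, \ldots, \mathbf{b}^{(M)}$ are generated from this conditional distribution, the following approximation
takes place:
$$E\{g(b(s_0)) \mid \mathbf{Y}\} \approx \frac{1}{M} \sum_{m=1}^{M} E\{g(b(s_0)) \mid \mathbf{b} = \mathbf{b}^{(m)}\}. \qquad (2)$$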
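We now need to calculate $E\{g(b(s_0)) \mid \mathbf{b}\}$ for a given vector of random effects $\mathbf{b}$ at the sampling sites. Fortunately, this conditional
expectation is an integral of the form $\int_{-\infty}^{\infty} f(t)\exp(-t^2)\,dt$, which can be fairly easily approximated to any given
precision (Crouch and Spiegelman, 1990) if it cannot be computed in closed form. Indeed, conditional on $\mathbf{b}$,
$b(s_0)$ has a normal distribution with mean $\mu_0$ and variance $\sigma_0^2$, and
$$E\{g(b(s_0)) \mid \mathbf{b}\} = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} g(\mu_0 + \sqrt{2}\,\sigma_0 t)\exp(-t^2)\,dt. \qquad (3)$$
It is well known that the conditional mean and conditional variance can be calculated from the covariances:
$$\mu_0 = \mathbf{c}'\Sigma^{-1}\mathbf{b}, \qquad \sigma_0^2 = \frac{1}{v^{11}},$$
where $V$ is the covariance matrix of $(b(s_0), \mathbf{b}')'$, $\mathbf{c}$ is the vector of covariances between $b(s_0)$ and $\mathbf{b}$, $\Sigma$ is the covariance matrix of $\mathbf{b}$, and $v^{11}$ is the (1,1)th element of $V^{-1}$.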
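The conditional mean and conditional variance of the random effect at an unsampled site, given the random effects at the sampled sites, can be checked numerically. The following is a minimal Python sketch, assuming NumPy; the function names and the toy covariance matrix are illustrative and do not come from the paper. It computes the two moments once from the partitioned covariance matrix and once from the inverse of the full covariance matrix; the two routes should agree.

```python
import numpy as np

def conditional_moments(V, b):
    """Mean and variance of b0 given b, for (b0, b) ~ N(0, V).

    V is the (N+1) x (N+1) covariance matrix with b0 in the first position.
    """
    V = np.asarray(V, dtype=float)
    b = np.asarray(b, dtype=float)
    c = V[0, 1:]                  # covariances between b0 and b
    S = V[1:, 1:]                 # covariance matrix of b
    w = np.linalg.solve(S, c)     # weights S^{-1} c
    return w @ b, V[0, 0] - w @ c

def conditional_moments_from_inverse(V, b):
    """Same moments, read off the inverse of the full covariance matrix."""
    Q = np.linalg.inv(np.asarray(V, dtype=float))
    var = 1.0 / Q[0, 0]           # reciprocal of the (1,1)th element of V^{-1}
    mean = -var * (Q[0, 1:] @ np.asarray(b, dtype=float))
    return mean, var

# Toy example: one unsampled site and two sampled sites (made-up numbers).
V = np.array([[1.0, 0.6, 0.3],
              [0.6, 1.0, 0.5],
              [0.3, 0.5, 1.0]])
b = np.array([0.8, -0.2])
mean_a, var_a = conditional_moments(V, b)
mean_b, var_b = conditional_moments_from_inverse(V, b)
```

Solving the linear system rather than inverting the matrix is usually the more stable choice when the number of sampled sites is large; the inverse-based version is shown only because it mirrors the (1,1)th-element statement in the text.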
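Equations (1)-(3) can be applied to calculate the MMSE prediction $E\{p(s_0) \mid \mathbf{Y}\}$ of the binomial probability $p(s_0)$ at an unsampled site $s_0$ and the
corresponding prediction variance $\mathrm{Var}\{p(s_0) \mid \mathbf{Y}\}$ given the observed responses $\mathbf{Y}$. The first two conditional moments of $p(s_0)$ given the random effects $\mathbf{b}$ at the sampling sites involve logistic-normal
integrals of the form
$$\int_{-\infty}^{\infty} \left[\frac{\exp(\beta + \mu_0 + \sqrt{2}\,\sigma_0 t)}{1 + \exp(\beta + \mu_0 + \sqrt{2}\,\sigma_0 t)}\right]^k \exp(-t^2)\,dt, \qquad k = 1, 2,$$
where $\mu_0$ and $\sigma_0^2$ denote the conditional mean and variance of $b(s_0)$ given $\mathbf{b}$.
Indeed, for $k = 1, 2$,
$$E\{p(s_0)^k \mid \mathbf{Y}\} = E\big[\,E\{p(s_0)^k \mid \mathbf{b}\} \mid \mathbf{Y}\,\big],$$
from which we obtain the prediction variance
$$\mathrm{Var}\{p(s_0) \mid \mathbf{Y}\} = E\{p(s_0)^2 \mid \mathbf{Y}\} - [E\{p(s_0) \mid \mathbf{Y}\}]^2.$$
The logistic-normal integrals cannot be calculated in closed form but can be evaluated through numerical
methods. For example, the method of Gaussian quadrature approximates the integral as follows:
$$\int_{-\infty}^{\infty} f(t)\exp(-t^2)\,dt \approx \sum_{i=1}^{m} w_i f(t_i),$$
where the abscissas $t_i$ and weights $w_i$, $i = 1, \ldots, m$, are available from standard tables for selected values of $m$ (Abramowitz and Stegun, 1967, p 924).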
Although this method is believed to approximate well, an analytical error bound is not known. If it is desirable to
control the error bound, then the method of Crouch and Spiegelman (1990) can be used. Given an error bound $\epsilon$, this
method calls for choosing a proper constant $h>0$ such that
(4)
where . The infinite sum is then truncated to satisfy any error-bound.
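To make the numerical evaluation of a logistic-normal integral concrete, here is a hedged Python sketch, assuming NumPy. It evaluates the k-th moment of a logistic transform of a normal variable in two ways: by Gauss-Hermite quadrature, and by a plain truncated trapezoidal sum with step h. The trapezoidal version is only a simple stand-in: it does not reproduce any refinement or the error control of Crouch and Spiegelman (1990). The function names, the default step and the truncation point are assumptions made for illustration.

```python
import numpy as np

def expit(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    return np.exp(-np.logaddexp(0.0, -x))

def logistic_normal_moment_gh(mu, sigma, k=1, m=20):
    """E[expit(mu + sigma*Z)^k], Z ~ N(0, 1), by m-point Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(m)   # nodes and weights for weight exp(-t^2)
    return np.sum(w * expit(mu + np.sqrt(2.0) * sigma * t) ** k) / np.sqrt(np.pi)

def logistic_normal_moment_trap(mu, sigma, k=1, h=0.25, t_max=8.0):
    """Same moment by a trapezoidal sum with step h, truncated at |t| <= t_max."""
    t = np.arange(-t_max, t_max + h / 2.0, h)
    f = expit(mu + np.sqrt(2.0) * sigma * t) ** k
    return h * np.sum(f * np.exp(-t ** 2)) / np.sqrt(np.pi)

# First two moments and the implied variance for made-up values of mu and sigma.
m1 = logistic_normal_moment_gh(-1.0, 0.8, k=1)
m2 = logistic_normal_moment_gh(-1.0, 0.8, k=2)
cond_var = m2 - m1 ** 2
```

Here `mu` plays the role of the linear predictor's conditional mean (intercept plus conditional mean of the random effect) and `sigma` its conditional standard deviation. When `mu` is zero the first moment equals 0.5 by symmetry, which gives a quick sanity check on either routine.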
Figure 1. Sampling locations (circle) and predicted locations (+); the black dot shows the sampling location where the sample size is changed.
We have now outlined a method for approximating the MMSE prediction of the binomial probability at an unsampled site and the corresponding prediction variance.
This method uses some analytical results and calls for generation of Monte Carlo samples of random
effects at the sampled sites only, and hence differs from the pure Monte Carlo method that generates samples of the
random effect at the interpolated site as in Diggle et al. (1998). The simulation results of Zhang (2002b) show that
using the analytical results provides faster convergence and is computationally less costly.
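The overall procedure can be sketched in a few lines of Python, assuming NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses a single-site random-walk Metropolis update (one of many possible Metropolis-Hastings schemes), a logit link with a constant intercept `beta`, Gauss-Hermite quadrature for the inner integral, and, in the toy usage, an exponential covariance chosen only for brevity. All function names, the step size and the data are hypothetical.

```python
import numpy as np

def log_target(b, y, n, beta, S_inv):
    """log f(b | y) up to a constant: binomial-logit likelihood times N(0, S) prior."""
    eta = beta + b
    return np.sum(y * eta - n * np.logaddexp(0.0, eta)) - 0.5 * b @ S_inv @ b

def sample_b_given_y(y, n, beta, S, M=2000, step=0.3, seed=0):
    """Single-site random-walk Metropolis draws of b at the sampled sites."""
    rng = np.random.default_rng(seed)
    S_inv = np.linalg.inv(S)
    b = np.zeros(len(y))
    lp = log_target(b, y, n, beta, S_inv)
    draws = np.empty((M, len(y)))
    for m in range(M):
        for i in range(len(y)):
            prop = b.copy()
            prop[i] += step * rng.standard_normal()
            lp_prop = log_target(prop, y, n, beta, S_inv)
            if np.log(rng.uniform()) < lp_prop - lp:   # accept or reject the move
                b, lp = prop, lp_prop
        draws[m] = b
    return draws

def mmse_prediction(draws, c, S, v00, beta, deg=20):
    """Average the conditional moments of p(s0) over the draws of b."""
    w = np.linalg.solve(S, c)
    mu = draws @ w                          # E[b(s0) | b] for each draw
    sd = np.sqrt(v00 - c @ w)               # sd of b(s0) given b (same for every draw)
    t, wt = np.polynomial.hermite.hermgauss(deg)
    p = 1.0 / (1.0 + np.exp(-(beta + mu[:, None] + np.sqrt(2.0) * sd * t)))
    m1 = (p @ wt) / np.sqrt(np.pi)          # E[p(s0) | b] per draw
    m2 = (p ** 2 @ wt) / np.sqrt(np.pi)     # E[p(s0)^2 | b] per draw
    pred = m1.mean()
    return pred, m2.mean() - pred ** 2

# Toy usage on five sites along a line (all numbers are made up).
x = np.arange(5.0)
S = np.exp(-np.abs(x[:, None] - x[None, :]) / 2.0)   # exponential covariance, illustration only
c = np.exp(-np.abs(x - 2.5) / 2.0)                    # covariances with the site s0 = 2.5
y = np.array([40, 42, 38, 41, 39])
n = np.full(5, 50)
draws = sample_b_given_y(y, n, beta=-1.5, S=S, M=600)
pred, pred_var = mmse_prediction(draws[200:], c, S, v00=1.0, beta=-1.5)
```

Only the random effects at the sampled sites are simulated; the random effect at the prediction site is integrated out analytically for each draw, which is the feature that distinguishes this approach from sampling the prediction site directly. The first 200 sweeps are discarded as burn-in in the usage example, a choice made here for illustration.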
3. Simulation Studies
In this section, we carry out several simulation studies to deal with the practical problems arising in the
Cunningham farm experiment. Each of the following three subsections deals with some specific problems, in which
data from the following model are generated. Let $\{b(s)\}$ be a second-order stationary Gaussian spatial process with
mean 0 and a spherical variogram: $\gamma(d) = c_0 + c_1\{1.5(d/a) - 0.5(d/a)^3\}$ for $0 < d \le a$ and equals $c_0 + c_1$ for $d > a$, where $d$ is the distance between two sites.
Conditional on the process $\{b(s)\}$, $\{Y(s)\}$ consists of independent binomial variables with binomial index $n(s)$ and
probability $p(s) = \exp\{\beta + b(s)\} / [1 + \exp\{\beta + b(s)\}]$. We use the method outlined in the previous section to calculate the MMSE
prediction and prediction variance for $p(s)$. The Metropolis-Hastings algorithm is employed to generate a Markov
chain of length 2000 in the implementation of the method. Zhang (2002a, b) showed that a chain of this length is sufficiently
large for predicting $p(s)$. Equation (4) is used in all simulations for predicting $p(s)$, with a prescribed error bound.
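A simulation from a model of this kind can be sketched as follows in Python, assuming NumPy. The parameter names (`sill`, `range_`, `nugget`), the numerical values, the 20-meter lattice extent and the logit link with a constant intercept are assumptions made for illustration; they are not the values used in the paper.

```python
import numpy as np

def spherical_cov(d, sill, range_, nugget=0.0):
    """Covariance implied by a spherical variogram with the given partial sill,
    range and nugget (hypothetical parameter names)."""
    r = np.minimum(np.asarray(d, dtype=float) / range_, 1.0)
    return sill * (1.0 - 1.5 * r + 0.5 * r ** 3) + nugget * (r == 0.0)

def simulate_binomial_glmm(coords, n, beta, sill, range_, nugget=0.0, seed=0):
    """Draw b(s) from the Gaussian process, then Y(s) ~ Binomial(n, p(s))."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    S = spherical_cov(d, sill, range_, nugget)
    L = np.linalg.cholesky(S + 1e-8 * np.eye(len(coords)))   # small jitter for stability
    b = L @ rng.standard_normal(len(coords))
    p = 1.0 / (1.0 + np.exp(-(beta + b)))
    y = rng.binomial(n, p)
    return b, p, y

# Illustrative setup: 100 sites drawn without replacement from a 20 m lattice.
rng = np.random.default_rng(1)
gx, gy = np.meshgrid(np.arange(0, 800, 20.0), np.arange(0, 800, 20.0))
lattice = np.column_stack([gx.ravel(), gy.ravel()])
coords = lattice[rng.choice(len(lattice), size=100, replace=False)]
b, p, y = simulate_binomial_glmm(coords, n=138, beta=-1.5, sill=0.5,
                                 range_=150.0, nugget=0.05)
```

Keeping the simulated random effects `b` and probabilities `p` alongside the counts makes it possible to compare predictions against the simulated truth when the sample sizes or the set of sites are later changed.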
3.1. Effects of sample sizes on prediction
There is a general belief that a larger sample size at a location should affect prediction favorably. The simulation
study in this subsection is intended to shed light on the exact effects and provide guidance as to what sample size should
be used in the experiment as far as prediction is concerned. The model parameters are fixed at specific values. We choose
these values because they are the estimates obtained in Zhang
(2002a) using the real experimental data. The sample size is fixed at 138 for all sampling sites, the average of the sample
sizes in the real experiment.
Figure 2. Comparisons of prediction at the 40 sites using unmodified data (circle) and modified data
(black) with one sample size reduced to 69 (a), 14 (b), and 0 (c).
We divide the study into two steps: (a) How does a single sample size affect prediction at its neighboring sites?
(b) How do the sample sizes at all sampling sites affect the prediction surface across the farm, so that explicit guidance
may be given on choosing the sample sizes?
3.1.1 The effect of the sample size at a single site
We choose a sampling site near the center of the farm, change its sample size, and predict at 40 sites located on a line crossing
the sampling site. This sampling site and the 40 predicted sites are shown in Figure 1.
We reduce this sample size by factors of 0.5 and 0.1, to 69 and 14, respectively. The binomial response is also reduced so
that the ratio of the response to the sample size remains approximately the same. Using the modified data from the site and data from the other sampling
sites, we calculated the predicted value and prediction variance for each of the 40 sites. We also eliminated this sample
site and used the remaining 99 sites to make predictions for the 40 sites. Figure 2 shows the plots of predicted values and
prediction variances using the modified data in comparison with those obtained using the original simulated data.
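The data modification described here, changing a sample size while keeping the observed ratio approximately fixed, amounts to a rounding step. A minimal Python sketch follows; the function name and the example count of 40 are made up for illustration.

```python
def rescale_site(y, n, factor):
    """Scale the sample size n by `factor` and adjust the count y so that y/n
    stays approximately unchanged (both must remain integers)."""
    n_new = int(round(n * factor))
    if n_new == 0:
        return 0, 0          # the site carries no data and is effectively dropped
    y_new = int(round(y * n_new / n))
    return y_new, n_new

# Made-up count y = 40 at a site with n = 138.
halved = rescale_site(40, 138, 0.5)    # sample size 69
tenth = rescale_site(40, 138, 0.1)     # sample size 14
doubled = rescale_site(40, 138, 2.0)   # sample size 276
```

Because the count must be an integer, the ratio is preserved exactly only when the scaled count happens to be whole; otherwise it changes slightly, which is why the comparison is described as holding the incidence rate approximately constant.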
We can see from Figure 2 that reducing the sample size at the site from 138 to 69 has very little effect on prediction. This
suggests that a sample size of 138 is large enough that it can be reduced. When the sample size of the
single site is reduced to 14, prediction variances for other sites, especially those near the sampling site, increase.
However, these increases are not as great as those when the site is eliminated. This suggests that we prefer using more
sites, each with a smaller sample size, rather than the other way around.
3.1.2 Effects of sample sizes on prediction over the farm
Having seen how a single sample size affects prediction, we now carry out another simulation study in which all
sampling sizes are changed so that we might see how these changes affect prediction across the farm. The farm is
divided into 1904 grid points, at which the incidence probability will be interpolated. We first use the original
simulated data to make these predictions.
We now double the sampling size of each site to 276 and also double the binomial variable proportionally so that
the ratio of the response to the sample size remains the same. We then use the modified data to predict at the same 1904 sites. Predicted values
and prediction variances are compared with those from the original data in Figure 3 (a). The horizontal axis is for the
original data and the vertical one is for the modified data.
We then reduce the sample size of each site from 138 to 69, 28 and 14, and again change the binomial responses
to keep the ratios approximately the same. The 1904 sites are interpolated using the modified data. Prediction
results are shown in Figure 3 (b), (c) and (d).
We clearly see that a sample size of 138 at every sampling site is sufficient in the sense that increasing the sample
size brings about little change in prediction. Indeed, the sample sizes can be reduced without greatly affecting the
prediction. For example, a sample size of 69 provides very close results. We also see that when sample sizes become
smaller, predicted values are more smoothed out, in the sense that the predicted value for a lower probability becomes larger
and that for a higher probability becomes smaller. Therefore, when sample sizes are smaller, the predicted values vary in a
smaller range. This can be explained as follows. If a sampling site has an incidence rate lower than most other sites, it
will affect the predicted incidence rates at nearby sites, and those predicted incidence rates will likely be lower too.
If this particular site has a large sample size, its impact on prediction at nearby sites will be greater than it
would be with a smaller size. In the latter case, other sampling sites would have more impact on the prediction, resulting in
an elevated prediction for the binomial probability at the nearby sites.
From the simulation results, we conclude that the sample size of 138 at each of 100 sites is large enough that
increasing the sample size would not much affect the prediction surface. It may be reduced to 50% without greatly
affecting prediction.
Figure 3. Comparison of prediction using different sample sizes. The horizontal axes in (a)-(d) show the
predicted value or prediction variance corresponding to a sample size of 138; the vertical axes in (a)-(d) show
those corresponding to sample sizes of 276, 69, 28 and 14, respectively.
3.2. Effects of the number of sampling locations
Due to the spatial correlation, the predicted value at a site is more influenced by observations near the site. Therefore,
when more sampling sites are added near an interpolation site, the MMSE prediction at the site will change. Since the
prediction variance is a conditional variance and depends on the observations at the sampling sites, it may become
smaller or larger. This is evident in the following simulation study.
Figure 4. Sampling sites randomly chosen (circle) and seven added sampling sites (black dot).
Figure 5. Comparisons of 40 predicted values and prediction standard deviations with 100 sites (circle) and with 107 sites (black dot).
We randomly draw 100 sites from the farm as the sampling sites, and then add 7 more sites located on the line
crossing the center of the farm. These locations are shown in Figure 4. We now simulate data on the 107 sites from the
binomial mixed effects model with the linear parameter , and the spherical variogram with parameters
, and ; the binomial index or sample size is 138 for all sites. Prediction of 40 sites
are made separately using the 107 sites and using the 100 sites with the seven sites removed. The predicted sites and the 7 sites
are marked by “+” and black dots in Figure 4, respectively. Figure 5 compares the predicted values and prediction
variances, in which black dots refer to prediction using the 107 sites and circles refer to prediction using the 100 sites. We
see that as more sites are sampled, the predicted values vary more, resulting in the prediction surface being less smooth.
The advantage of more sampling sites will be seen more clearly from the next simulation study, in which we
investigate the combined effects of the number of sampling sites and the sample sizes at the locations. In this simulation, we
randomly draw 500 sites from the farm and simulate data from the same model with the same parameters we just gave
but with two sets of sample sizes: 138 and 14 for all sites. We then randomly draw 200 sites from the 500
sites, and 100 sites from the 200 sites. The three sets of sampling sites are shown in Figure 6. Once the sites are chosen,
data on the sites are obtained accordingly from the simulated values. For each set of sampling sites (100, 200 and 500
sites), we make predictions for 1904 sites using sample sizes of 138 and 14 and plot the predicted values and prediction variances
together in Figures 7-9.
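The nested choice of sampling sites (500 sites, then 200 of those, then 100 of those) can be sketched in Python with NumPy as below. The function name, the seed and the number of candidate locations are hypothetical; the paper does not state how many candidate locations the 500 sites were drawn from.

```python
import numpy as np

def nested_designs(n_candidates, sizes=(500, 200, 100), seed=0):
    """Return index arrays of decreasing size, each a random subset of the previous one."""
    rng = np.random.default_rng(seed)
    pool = np.arange(n_candidates)
    designs = []
    for size in sizes:
        pool = rng.choice(pool, size=size, replace=False)   # draw without replacement
        designs.append(np.sort(pool))
    return designs

# Hypothetical pool of 2000 candidate locations.
idx500, idx200, idx100 = nested_designs(n_candidates=2000)
```

Nesting the designs means the smaller designs reuse simulated values already generated for the largest one, so differences between the prediction results reflect the number of sites rather than a fresh realization of the data.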
From Figures 7-9, we see that when the number of sampling sites is large, reducing the sampling size at each site
has less effect on prediction than it would when the number of sampling sites is small. Therefore, when more sites are
sampled, we can afford to sample less at each location without affecting the prediction as greatly as we would when fewer
locations are sampled. Since prediction near a sampling site is more accurate, we prefer to sample more locations with
each location having a small sample size, if the total sampling effort is controlled.
Figure 6. Change in the number of sampling sites from 100 (black dot) to 200 (+) to 500 (circle).
Figure 7. Comparison of prediction using sample sizes of 138 (horizontal) and 14 (vertical) at N=100 sampling sites.
Figure 8. Comparison of prediction of 1904 sites using sample sizes of 138 and 14 at N=200 sampling sites.
Figure 9. Comparison of prediction of 1904 sites using sample sizes of 138 and 14 at N=500 sampling sites.
4. Conclusion and Discussion
We studied the effects of the sample sizes at the sampling locations and the number of sampling locations on prediction
of binomial probability, employing model-based geostatistics and minimum mean-squared error prediction. Results of the
simulation studies support the following conclusions:
(1). The sample size at each location need not be too large. For the Cunningham farm experiment, a sample size
of about 69 seems sufficient for prediction. Too large a sample size hardly improves prediction.
(2). A smaller sample size makes the prediction surface smoother, and the predicted values vary in a smaller range.
(3). When more locations are sampled, the effects of sample size on the prediction surface become smaller.
(4). When the total sampling effort is controlled, we prefer sampling more locations with each location having a
moderate sample size over sampling fewer locations with each one having a larger sample size.
Theoretical justifications of the conclusions are not available. Nevertheless, the simulation results provide useful
guidance for the design of the experiment on Cunningham farm. In all the simulations, the parameters are fixed at some
values. We expect the conclusions to hold in general cases regardless of the parameter values.
Acknowledgment
The authors acknowledge the assistance of Xiaoping Jin on computing.
References
Abramowitz, M. and Stegun, I. A. (eds.) (1967) Handbook of Mathematical Functions. U.S. Government Printing Office,
Washington, D.C.
Benedetti, R. and Palma, D. (1995) Optimal sampling designs for dependent spatial units. Environmetrics, 6, 101-114.
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in generalized linear mixed models. Journal of the
American Statistical Association, 88, 9-25.
Crouch, E.A.C. and Spiegelman, D. (1990) The evaluation of integrals of the form $\int_{-\infty}^{+\infty} f(t)\exp(-t^2)\,dt$: Application to
logistic-normal models. Journal of the American Statistical Association, 85, 464-469.
Diggle, P., Tawn, J. A., and Moyeed, R.A. (1998) Model-based geostatistics (with discussion). Journal of the Royal
Statistical Society, Ser. C, Applied Statistics, 47, 299-350.
Fedorov, V.V. (1996) Design of spatial experiments: model fitting and prediction. In Rao, C.R. and Ghosh, S.,
editors, Handbook of Statistics, Vol. 13, North-Holland.
Muller, W. G. (1998) Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Physica-Verlag,
Heidelberg.
Muller, W. G. and Zimmerman, D. L. (1999) Optimum design for semivariogram estimation. Environmetrics, 10, 23-37.
Pesti, G., Kelly, W. E. and Bogardi, I. (1994) Observation network design for selecting locations for water supply
wells. Environmetrics, 5, 91-110.
Warrick, A.W. and Myers, D. E. (1987) Optimization of sampling locations for variogram calculations. Water
Resources Research, 23, 496-500.
Zhang, H. (2002a) On estimation and prediction for spatial generalized linear mixed models. Biometrics, Vol. 58,
No. 1, 129-136.
Zhang, H. (2002b) Optimal interpolation and the appropriateness of cross-validating variogram in spatial generalized
linear mixed models. Journal of Computational and Graphical Statistics (to appear).