
1Correspondence to [email protected]


A Study on Prediction of Spatial Binomial Probabilities with an Application to Spatial Design

Hao Zhang 1

Program in Statistics

Washington State University

Pullman, WA 99164

H. Holly Wang

Department of Agricultural Economics

Washington State University

Pullman, WA 99164

Abstract

This work studies some issues related to the interpolation of binomial probabilities. In some situations, binomial counts are observed at some spatial locations and binomial probabilities are interpolated at unsampled locations based on the sample data. An example in precision agriculture is considered in this work. A natural practical question is how the number of sampling locations and the sample sizes at the sampling locations affect the interpolation. This question is studied in this work through simulations. The model-based geostatistics approach (Diggle et al., 1998) is employed, in which the binomial counts are modeled through a spatial generalized linear mixed model. The minimum mean-squared error prediction is carried out for the binomial probability.

1. Introduction

This work is motivated by an experiment on Cunningham farm, located 8 miles north of Pullman, Washington. The overall objective of the experiment is to identify factors that affect both yield and quality of wheat and barley, the two major crops in the dryland farming area in the Pacific Northwest, so that measures of precision farming may be applied to improve yield and quality and thereby maximize the profitability of the farm. Some variables that likely affect both yield and quality are incidence rates of plant root diseases, soil properties, elevation and aspect ratio, all of which are location-specific and measured locally. Once the dominating variables are identified and their effects on yield and quality quantified, optimization of both yield and quality can be realized through the practice of precision agriculture. The research team selected 100 locations by dividing the farm into equally spaced rows and columns, approximately 20 meters apart, and then drawing randomly from the resulting lattice. At each sampling location, the incidence rate of some root disease and other variables were measured, along with the yield obtained by hand-harvesting a square area 2 meters long and 2 meters wide. Variables such as protein content that reflect the quality of wheat were also measured at the sites. These sample data are currently being studied to see how they affect yield and quality. Once the dominating variables are identified and their effects on yield and quality quantified, optimum profits can be achieved by locally adjusting or controlling the variables to the appropriate levels through precision farming that utilizes local information. Apparently, data on these variables must be available at all locations where quality and yield are to be optimized, and ideally we would like to optimize them across the whole farm. This requires interpolation or prediction of the variables at all unsampled locations.

In this work, we study some issues related to the interpolation of a particular variable, namely, the incidence rate of Rhizoctonia root rot caused by Rhizoctonia solani and Rhizoctonia oryzae. These fungi attach to the root system and reduce the ability of plants to take up adequate water and nutrients, and consequently affect both yield and quality of the crops. At each of the sampling sites on Cunningham farm, 15 barley plants were sampled and pulled out of the ground in the summer of 2000, and the number of total crown roots and that of infected crown roots were obtained for each plant. The incidence rate of root rot at a site $s$ was obtained as the total number of infected crown roots, $Y(s)$, divided by the total number of crown roots, $n(s)$, i.e., $Y(s)/n(s)$. These incidence rates at the sampled sites are used to interpolate the incidence rate at as many sites across the farm as necessary. A natural practical problem is how the number of sampling sites, $N$, and the sample sizes $n(s)$ affect the interpolation. For example, is it necessary to sample more crops to increase $n(s)$, which currently ranges from 89 to 197? What is the gain from sampling more locations? When the total sampling cost is controlled, do we prefer sampling more locations with a smaller sample size at each location, or the reverse? These are the problems studied in this paper.

It is not possible to study these problems without first specifying how the prediction is to be carried out. Some well-known geostatistical kriging (predicting) methods, such as ordinary kriging, trans-Gaussian kriging and disjunctive kriging, might emerge as possible interpolation methods for our problems. We refer to Cressie (1993) for an introduction to each of these methods. However, if any of these methods were to be applied, it would be applied to the ratio $Y(s)/n(s)$ and not to the count $Y(s)$, because $Y(s)$ itself without the sample size $n(s)$ means nothing, and $n(s)$ varies from site to site in the experiment. Once the ratios $Y(s)/n(s)$ are used for prediction, the sample sizes $n(s)$ no longer affect prediction. Consequently, we would not be able to study how the sample sizes affect the prediction. Therefore, we adopt model-based geostatistics (Diggle et al., 1998), which incorporates the sample sizes into the model and allows for the minimum mean-squared error (MMSE) prediction of the incidence rate.

This approach assumes that the counts of infected crown roots $Y(s)$ follow the following spatial generalized linear mixed model (GLMM):

(a). $\{b(s)\}$ is a Gaussian stationary process with mean 0.

(b). Conditionally on $\{b(s)\}$, $\{Y(s)\}$ consists of independent variables. Moreover, the conditional distribution of an individual $Y(s)$ is binomial with binomial index $n(s)$ and binomial probability

$$p(s) = \frac{\exp(\beta + b(s))}{1 + \exp(\beta + b(s))}.$$

It is reasonable to assume that the counts of infected crown roots are binomial with varying binomial probabilities $p(s)$. Spatial variation is represented by the random term $b(s)$, which might be accounted for by unknown or unobservable factors. In the absence of any physically based model for the spatial trend, the parametric form $\mathrm{logit}\, p(s) = \beta + b(s)$ may be reasonable. Under the spatial GLMM, the MMSE estimation or prediction of $p(s_0)$ at a site $s_0$ is the conditional expectation of $p(s_0)$ given the observed binomial responses $Y$. Although it cannot be evaluated in closed form, the MMSE prediction can be computed through Markov chain Monte Carlo (MCMC) methods.

Two approaches to the MMSE prediction of $p(s_0)$ have been considered under the spatial GLMM. Diggle et al. (1998) considered Bayesian prediction of a function of random effects, such as $p(s_0)$, implemented using MCMC methods. Zhang (2002a, b) considered MMSE prediction under the assumption that parameters are known or have been estimated, and also relied on MCMC methods for the calculation. Zhang (2002b) showed that some analytical results can be used in either the Bayesian or non-Bayesian approach to make the prediction computationally more efficient.

Obviously, model parameters have to be known or estimated before the prediction can be evaluated. For spatial GLMMs, different inferential methods exist for model parameter estimation. Diggle et al. (1998) used a Bayesian approach for parameter estimation. Zhang (2002a) considered the Monte Carlo EM algorithm for maximum likelihood estimation. Penalized quasi-likelihood (Breslow and Clayton, 1993) can also be applied to this model. Being aware that different parameter estimation methods exist for the model, and that these methods may lead to different estimates, we will not study explicitly how the designs, especially the choice of the number of sampling sites $N$ and the sample sizes $n(s)$, affect parameter estimates. Rather, we will focus on their effects on prediction.

Although this work is related to spatial sampling design and indeed deals with some aspects of the design, it differs from most works on spatial sampling design in several ways. Firstly, in most works on spatial designs, the response variable $Z(s)$ can be decomposed into additive parts:

$$Z(s) = \mu(s) + \varepsilon(s),$$

where the error term $\varepsilon(s)$ has mean 0 and is either intrinsically stationary or second-order stationary, and its correlation structure is independent of the mean surface $\mu(s)$ (see, for example, Warrick and Myers, 1987, Pesti et al., 1994, Benedetti and Palma, 1995, and Muller and Zimmerman, 1999, among many others). Such a decomposition is important in some of the works on spatial design. For example, when the mean is constant, the variogram of $Z(s)$ can be estimated without estimating the mean. Hence the design criterion can be expressed in the variogram parameters alone (see, for example, Muller and Zimmerman, 1999). In our model, such a decomposition does not exist because both the mean and the variance of the response variable depend on some common parameters. Consequently, approaches to and results of a classical spatial design need to be extended or modified to become applicable to our model.
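The dependence between mean and variance can be made explicit with the standard binomial moments. The display below is a brief illustration, writing $Y(s)$ for the count of infected crown roots, $n(s)$ for the binomial index, $p(s)$ for the binomial probability and $b(s)$ for the random effect; this notation is assumed here for the sketch.

```latex
\begin{align*}
E[\,Y(s) \mid b(s)\,] &= n(s)\,p(s), \\
\operatorname{Var}[\,Y(s) \mid b(s)\,] &= n(s)\,p(s)\,\{1 - p(s)\}.
\end{align*}
```

Both moments are functions of the same $p(s)$, so the variance cannot be modeled separately from the mean surface as in the additive decomposition.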

Secondly, the objective of a spatial sampling design is often the optimum allocation of spatial sampling locations, where the optimum may be defined with regard to trend estimation, variogram estimation or prediction. Although it is an interesting and important problem, deciding optimum sampling locations is very complicated in the Cunningham farm experiment, because not only is the response variable binomial, but other response variables are also measured at the sampling sites. An optimum sampling design for one variable alone has little practical value to the experiment. In addition, the response variable we consider in this paper is binomial. Hence the sample size at a sampling location is a new variable that classical spatial designs do not entertain.

Lastly, unconditional variances of estimators or predictors are used in the criteria of spatial designs, such as minimizing the weighted average variance of prediction in Fedorov (1996) and Muller (1998). However, most applications of spatial prediction use the conditional mean and variance of the predicted variable given the observed data. For example, the kriging (or predicting) surface of $p(s)$ consists of the conditional mean of $p(s)$ given the responses $Y$ at the sampling sites (Diggle et al., 1998). In our particular experiment, we may ask how this kriging surface would change if fewer roots were sampled at each site (i.e., a smaller $n(s)$). Hence, conditional inference is more directly related to our problems.

Therefore, this work is not on spatial design in the classical sense. The primary focus of this paper is to study how the number of sampling sites $N$ and the sample sizes $n(s)$ affect the prediction. There are no theoretical results that provide immediate answers to these questions. We hence resort to simulation studies to see these effects on prediction. The rest of the paper is organized as follows. In Section 2, we review some results of Zhang (2002b) that will be used in the subsequent section to calculate the MMSE prediction of $p(s_0)$. Section 3 contains several simulation studies to see how the prediction of $p(s_0)$ is affected by changes in $N$ and $n(s)$. When we change sample sizes, we keep the incidence rates $Y(s)/n(s)$ unchanged or approximately the same ($Y(s)$ has to be an integer), so that the conditional inferences, i.e., predicted values and prediction variances given the same incidence rates, are comparable. Conclusions and discussion are presented in Section 4.

2. MMSE Prediction in a Spatial GLMM

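In this section, we review some results of Zhang (2002b) that can be used to efficiently calculate the MMSE prediction for $p(s_0)$ at an unsampled site $s_0$. Let $\{Y(s)\}$ satisfy the spatial GLMM defined in the introduction and $\{b(s)\}$ be the random effects. Let $s_1, \dots, s_N$ be the sampling sites and write $Y_i$ and $b_i$ for $Y(s_i)$ and $b(s_i)$, respectively, $Y = (Y_1, \dots, Y_N)'$, and $b = (b_1, \dots, b_N)'$. Zhang (2002b) showed that for any function $g$,

$$E[\,g(b(s_0)) \mid Y\,] = E\{\,E[\,g(b(s_0)) \mid b\,] \mid Y\,\}. \qquad (1)$$

Monte Carlo samples from the conditional distribution $f(b \mid Y)$ can be generated through an MCMC method, and the right-hand side of the equation can be approximated by the appropriate sample average. For example, the Metropolis-Hastings algorithm is quite easy to implement for the spatial GLMM, as seen in Diggle et al. (1998) and Zhang (2002a, b). Once Monte Carlo samples $b^{(1)}, \dots, b^{(M)}$ are generated from $f(b \mid Y)$, the following approximation takes place:

$$E[\,g(b(s_0)) \mid Y\,] \approx \frac{1}{M} \sum_{m=1}^{M} E[\,g(b(s_0)) \mid b = b^{(m)}\,]. \qquad (2)$$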
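As an illustration of how samples of the random effects at the sampled sites might be drawn, the following is a minimal sketch of a single-site random-walk Metropolis sampler, one simple member of the Metropolis-Hastings family. It is not the implementation used in the cited papers. The function names, the proposal step size and the starting value are assumptions made for this sketch; `v` stands for the covariance matrix of the random effects at the sampled sites, and the logit link with a single linear parameter `beta` is assumed.

```python
import numpy as np

def log_target(b, y, n, beta, v_inv):
    """Log of f(b | Y) up to an additive constant."""
    eta = beta + b
    # Binomial log-likelihood under the logit link;
    # np.logaddexp(0, eta) equals log(1 + exp(eta)) without overflow.
    loglik = np.sum(y * eta - n * np.logaddexp(0.0, eta))
    # Gaussian log-density of the random effects with covariance V.
    logprior = -0.5 * b @ v_inv @ b
    return loglik + logprior

def sample_random_effects(y, n, beta, v, n_iter=2000, step=0.5, seed=0):
    """Single-site random-walk Metropolis sampler for b given Y.

    Returns the draws (one row per sweep) and the acceptance rate.
    """
    rng = np.random.default_rng(seed)
    v_inv = np.linalg.inv(v)
    b = np.zeros(len(y))                      # hypothetical starting value
    current = log_target(b, y, n, beta, v_inv)
    draws = np.empty((n_iter, len(y)))
    accepted = 0
    for it in range(n_iter):
        for i in range(len(y)):
            proposal = b.copy()
            proposal[i] += step * rng.standard_normal()
            candidate = log_target(proposal, y, n, beta, v_inv)
            # Accept with probability min(1, f(proposal | Y) / f(b | Y)).
            if np.log(rng.uniform()) < candidate - current:
                b, current = proposal, candidate
                accepted += 1
        draws[it] = b
    return draws, accepted / (n_iter * len(y))
```

In practice an initial portion of the chain would be discarded as burn-in before the sample average in (2) is formed; the sketch leaves that choice to the caller.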

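We now need to calculate $E[\,g(b(s_0)) \mid b\,]$ for a given vector of random effects $b$. Fortunately, this conditional expectation is an integral of the form $\int_{-\infty}^{\infty} f(t) \exp(-t^2)\, dt$, which can be fairly easily approximated to any given precision (Crouch and Spiegelman, 1990) if it cannot be computed in closed form. Indeed, conditional on $b$, $b(s_0)$ has a normal distribution with mean $\mu$ and variance $\sigma^2$, and

$$E[\,g(b(s_0)) \mid b\,] = \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} g(\mu + \sqrt{2}\,\sigma t) \exp(-t^2)\, dt. \qquad (3)$$

It is well known that the conditional mean and conditional variance can be calculated from the covariances:

$$\mu = c' V^{-1} b, \qquad \sigma^2 = \operatorname{Var}(b(s_0)) - c' V^{-1} c = 1/\sigma^{11},$$

where $V$ is the covariance matrix of $b$, $c$ is the vector of covariances between $b(s_0)$ and $b$, $\Sigma$ is the covariance matrix of $(b(s_0), b_1, \dots, b_N)'$, and $\sigma^{11}$ is the (1,1)th element of $\Sigma^{-1}$.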
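The conditional mean and variance of a Gaussian random effect at a new site, given the random effects at the sampled sites, can be computed as in the short sketch below. The function name and the layout of the input matrix (new site first, sampled sites after) are assumptions made for illustration.

```python
import numpy as np

def conditional_moments(sigma_full, b):
    """Mean and variance of b(s0) given b.

    sigma_full is the covariance matrix of (b(s0), b_1, ..., b_N),
    with the prediction site in the first row and column.
    """
    c = sigma_full[0, 1:]             # covariances between b(s0) and b
    v = sigma_full[1:, 1:]            # covariance matrix of b
    weights = np.linalg.solve(v, c)   # V^{-1} c without forming the inverse
    mu = weights @ b
    sigma2 = sigma_full[0, 0] - weights @ c
    return mu, sigma2
```

The conditional variance returned here equals the reciprocal of the (1,1)th element of the inverse of the full covariance matrix, which gives a convenient numerical check.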

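Equations (1)-(3) can be applied to calculate the MMSE prediction of $p(s_0) = \exp(\beta + b(s_0)) / \{1 + \exp(\beta + b(s_0))\}$ and the corresponding prediction variance given $Y$. The first two conditional moments of $p(s_0)$ given $b$ involve logistic-normal integrals of the form

$$\frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} \left\{ \frac{\exp(\beta + \mu + \sqrt{2}\,\sigma t)}{1 + \exp(\beta + \mu + \sqrt{2}\,\sigma t)} \right\}^{k} \exp(-t^2)\, dt,$$

where $\mu$ and $\sigma^2$ denote the conditional mean and variance of $b(s_0)$ given $b$. Indeed, for $k = 1, 2$,

$$E[\,p(s_0)^k \mid Y\,] = E\{\,E[\,p(s_0)^k \mid b\,] \mid Y\,\},$$

from which we obtain the prediction variance

$$\operatorname{Var}[\,p(s_0) \mid Y\,] = E[\,p(s_0)^2 \mid Y\,] - \{E[\,p(s_0) \mid Y\,]\}^2.$$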

The logistic-normal integrals cannot be calculated in closed form but can be evaluated through numerical methods. For example, the method of Gaussian quadrature approximates the integral as follows:

$$\int_{-\infty}^{\infty} f(t) \exp(-t^2)\, dt \approx \sum_{i=1}^{m} w_i f(x_i),$$

where the nodes $x_i$ and weights $w_i$, $i = 1, \dots, m$, are available from standard tables (Abramowitz and Stegun, 1967, p. 924).

Although this method is believed to approximate well, an analytical error bound is not known. If it is desirable to control the error bound, then the method of Crouch and Spiegelman (1990) can be used. Given an error bound $\epsilon$, this method calls for choosing a proper constant $h > 0$ such that

$$\left| \int_{-\infty}^{\infty} f(t) \exp(-t^2)\, dt - h \sum_{j=-\infty}^{\infty} f(t_j) \exp(-t_j^2) \right| < \epsilon, \qquad (4)$$

where $t_j = jh$. The infinite sum is then truncated to satisfy any error bound.
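The two numerical approaches to the logistic-normal integrals can be compared with a short sketch. The first function uses Gauss-Hermite quadrature through NumPy's `hermgauss`; the second uses a plain truncated trapezoidal sum with step `h`, which is only a simplified stand-in for the method of Crouch and Spiegelman (1990) and does not reproduce their error control. Function names, the default step and the truncation point are assumptions made for illustration.

```python
import numpy as np

def expit(x):
    """Numerically stable logistic function exp(x) / (1 + exp(x))."""
    return np.exp(-np.logaddexp(0.0, -x))

def moment_gauss_hermite(k, beta, mu, sigma, m=20):
    """k-th moment of expit(beta + X), X ~ N(mu, sigma^2), by m-point
    Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite.hermgauss(m)
    values = expit(beta + mu + np.sqrt(2.0) * sigma * nodes) ** k
    return np.sum(weights * values) / np.sqrt(np.pi)

def moment_trapezoid(k, beta, mu, sigma, h=0.25, t_max=7.0):
    """Same moment by a truncated trapezoidal sum with step h.

    Terms with |t| > t_max are dropped; exp(-t^2) is below about 1e-21
    there, so the truncation error is negligible in double precision.
    """
    j_max = int(np.ceil(t_max / h))
    t = h * np.arange(-j_max, j_max + 1)
    values = expit(beta + mu + np.sqrt(2.0) * sigma * t) ** k
    return h * np.sum(values * np.exp(-t ** 2)) / np.sqrt(np.pi)
```

Averaging either function over the Monte Carlo draws of the random effects, with `mu` recomputed for each draw, gives the approximation in (2) for $k = 1, 2$; the prediction variance then follows as the second moment minus the squared first moment.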


Figure 1. Sampling locations (circle) and predicted locations (+); the black dot shows the sampling location where the sample size is changed.

We have now outlined a method for approximating the MMSE prediction $E[\,p(s_0) \mid Y\,]$ and the prediction variance $\operatorname{Var}[\,p(s_0) \mid Y\,]$. This method uses some analytical results and calls for the generation of Monte Carlo samples of random effects at the sampled sites only; hence it differs from the pure Monte Carlo method, which also generates samples of the random effect at the interpolated site, as in Diggle et al. (1998). The simulation results of Zhang (2002b) show that using the analytical results provides faster convergence and is computationally less costly.

3. Simulation Studies

In this section, we carry out several simulation studies to deal with the practical problems arising in the Cunningham farm experiment. Each of the following three subsections deals with some specific problems, in which data are generated from the following model. Let $\{b(s)\}$ be a second-order stationary Gaussian spatial process with mean 0 and a spherical variogram: $\gamma(h) = c_0 + c_1 \{1.5\,(h/a) - 0.5\,(h/a)^3\}$ for $0 < h \le a$ and equals $c_0 + c_1$ for $h > a$. Conditional on the process $\{b(s)\}$, $\{Y(s)\}$ consists of independent binomial variables with binomial index $n(s)$ and probability $p(s) = \exp(\beta + b(s)) / \{1 + \exp(\beta + b(s))\}$. We use the method outlined in the previous section to calculate the MMSE prediction and prediction variance for $p(s_0)$. The Metropolis-Hastings algorithm is employed to generate a Markov chain of length 2000 in the implementation of the method. Zhang (2002a, b) showed that this length is sufficiently large for predicting $p(s_0)$. Equation (4) is used in all simulations for predicting $p(s_0)$, with a prescribed error bound $\epsilon$.
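Data generation from a model of this kind might be sketched as follows. The parameterization of the spherical variogram by a nugget `c0`, partial sill `c1` and range `a` is an assumption of this sketch, as are the function names and any parameter values a caller passes in; the values used in the paper are not reproduced here.

```python
import numpy as np

def spherical_covariance(dist, c0, c1, a):
    """Covariance implied by a spherical variogram with nugget c0,
    partial sill c1 and range a; it equals c0 + c1 at distance 0
    and vanishes at distances of a or more."""
    r = np.minimum(dist / a, 1.0)
    cov = c1 * (1.0 - 1.5 * r + 0.5 * r ** 3)
    return cov + c0 * (dist == 0.0)

def simulate_binomial_field(sites, n, beta, c0, c1, a, seed=0):
    """Simulate binomial counts at `sites` (an array of shape (N, 2)).

    Returns the counts y, the probabilities p and the random effects b.
    """
    rng = np.random.default_rng(seed)
    diff = sites[:, None, :] - sites[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    cov = spherical_covariance(dist, c0, c1, a)
    # A small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(sites)))
    b = chol @ rng.standard_normal(len(sites))   # Gaussian random effects
    p = 1.0 / (1.0 + np.exp(-(beta + b)))        # logit link
    y = rng.binomial(n, p)                       # independent given b
    return y, p, b
```

A call such as `simulate_binomial_field(sites, 138, beta, c0, c1, a)` with 100 randomly placed sites mimics the layout of the experiment, with all numeric parameter values supplied by the user.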

3.1. Effects of sample sizes on prediction

There is a general belief that a larger sample size at a location should affect prediction favorably. The simulation study in this subsection is intended to shed light on the exact effects and provide guidance as to what sample size should be used in the experiment as far as prediction is concerned. The model parameters, namely the linear parameter $\beta$ and the parameters of the spherical variogram, are set to the estimates obtained in Zhang (2002a) using the real experimental data. The sample size is fixed at 138 for all sampling sites, the average of the sample sizes in the real experiment.

Figure 2. Comparisons of prediction at the 40 sites using unmodified data (circle) and modified data (black) with one sample size reduced to 69 (a), 14 (b), and 0 (c).

We divide the study into two steps: (a) How does a single sample size affect the prediction at its neighboring sites? (b) How do the sample sizes at all sampling sites affect the prediction surface across the farm, so that explicit guidance may be given on choosing the sample sizes?


3.1.1 The effect of the sample size at a single site

We choose a sampling site near the center of the farm, change its sample size $n$, and predict at 40 sites located on a line crossing the sampling site. This sampling site and the 40 predicted sites are shown in Figure 1.

We reduce this sample size to 0.5 and 0.1 times its original value, i.e., to 69 and 14, respectively. The binomial response is also reduced so that the ratio of the response to the sample size remains approximately the same. Using the modified data from the site and the data from the other sampling sites, we calculate the predicted value and prediction variance for each of the 40 sites. We also eliminate this sampling site and use the remaining 99 sites to make predictions for the 40 sites. Figure 2 shows the predicted values and prediction variances obtained using the modified data in comparison with those obtained using the original simulated data.

We can see from Figure 2 that reducing the sample size at the site from 138 to 69 has very little effect on prediction. This probably means that a sample size like 138 is large enough that it can be reduced. When the sample size at the single site is reduced to 14, the prediction variances for other sites, especially those near the sampling site, increase. However, these increases are not as great as those seen when the site is eliminated. This suggests that we prefer using more sites with a smaller sample size at each site over the other way around.
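The modification of the data at a site, changing the sample size while keeping the incidence rate approximately the same, can be sketched as below. The function name and the rounding rule are assumptions made for illustration; the count of 40 infected roots in the usage note is a hypothetical value.

```python
import numpy as np

def rescale_site(y, n, factor):
    """Scale the binomial index n by `factor` and adjust the count y so
    that y / n stays approximately the same; both results are integers."""
    n_new = int(np.rint(factor * n))
    y_new = int(np.rint(y * n_new / n)) if n > 0 else 0
    return y_new, n_new
```

For a site with `n = 138` and a hypothetical count `y = 40`, calling `rescale_site(40, 138, 0.5)` and `rescale_site(40, 138, 0.1)` gives the modified data for sample sizes 69 and 14.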

3.1.2 Effects of sample sizes on prediction over the farm

Having seen how a single sample size affects prediction, we now carry out another simulation study in which all sample sizes are changed, so that we can see how these changes affect prediction across the farm. The farm is divided into 1904 grid points, at which the incidence probability is interpolated. We first use the original simulated data to make these predictions.

We now double the sample size at each site to 276 and also double the binomial response proportionally, so that the ratio of the response to the sample size remains the same. We then use the modified data to predict at the same 1904 sites. Predicted values and prediction variances are compared with those from the original data in Figure 3 (a). The horizontal axis is for the original data and the vertical one is for the modified data.

We then reduce the sample size at each site from 138 to 69, 28 and 14, and again change the binomial responses to keep the ratio of the response to the sample size approximately the same. The 1904 sites are interpolated using the modified data. Prediction results are shown in Figure 3 (b), (c) and (d).

We clearly see that a sample size of 138 at every sampling site is sufficient in the sense that increasing the sample size brings about little change in prediction. Indeed, the sample sizes can be reduced without greatly affecting the prediction. For example, a sample size of 69 provides very close results. We also see that when sample sizes become smaller, predicted values are more smoothed out, in the sense that the predicted value for a lower binomial probability becomes larger and that for a higher binomial probability becomes smaller. Therefore, when sample sizes are smaller, the predicted values vary in a smaller range. This can be explained as follows. If a sampling site has an incidence rate lower than most other sites, it will affect the predicted incidence rates at nearby sites, and those predicted incidence rates will likely be lower too. If this particular site has a large sample size, its impact on prediction at nearby sites will be greater than it would be with a smaller sample size. In the latter case, other sampling sites would have more impact on the prediction, resulting in an elevated prediction for the binomial probability at the nearby sites.

From the simulation results, we conclude that the sample size of 138 at each of the 100 sites is large enough that increasing the sample size would not much affect the prediction surface. It may be reduced to 50% without greatly affecting prediction.

Figure 3. Comparison of prediction using different sample sizes. The horizontal axes in (a)-(d) are the predicted value or prediction variance corresponding to a sample size of 138; the vertical axes in (a)-(d) are those corresponding to sample sizes of 276, 69, 28 and 14, respectively.
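The smoothing effect of smaller sample sizes can be illustrated in a much simpler, non-spatial setting with a single site. The sketch below computes the conditional mean of the binomial probability given the count at one site, assuming a logit link with linear parameter `beta` and a single Gaussian random effect with standard deviation `sigma`. It is a simplified illustration and not the spatial prediction used in the simulations; the function name and the grid-based integration are assumptions of the sketch.

```python
import numpy as np

def posterior_mean_probability(y, n, beta, sigma, grid_size=4001):
    """E[p | Y = y] for one site with logit p = beta + b and
    b ~ N(0, sigma^2), by numerical integration over a grid of b."""
    b = np.linspace(-8.0 * sigma, 8.0 * sigma, grid_size)
    eta = beta + b
    # Binomial log-likelihood (constant dropped) plus Gaussian log-density.
    log_w = y * eta - n * np.logaddexp(0.0, eta) - 0.5 * (b / sigma) ** 2
    w = np.exp(log_w - log_w.max())   # subtract the max to avoid underflow
    p = 1.0 / (1.0 + np.exp(-eta))
    # The grid is uniform, so the spacing cancels in the ratio.
    return np.sum(p * w) / np.sum(w)
```

With the observed rate held at a value below the overall level implied by `beta`, a smaller `n` gives a conditional mean that lies further above the observed rate, which mirrors the narrower range of predicted values seen for small sample sizes.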

3.2. Effects of the number of sampling locations

Due to the spatial correlation, the predicted value at a site is more influenced by observations near the site. Therefore, when more sampling sites are added near an interpolation site, the MMSE prediction at the site will change. Since the prediction variance is a conditional variance and depends on the observations at the sampling sites, it may become smaller or larger. This is evident in the following simulation study.

We randomly draw 100 sites as the sampling sites from the farm, and then add 7 more sites located on a line crossing the center of the farm. These locations are shown in Figure 4. We now simulate data on the 107 sites from the binomial mixed effects model with a linear parameter and a spherical variogram; the binomial index or sample size is 138 for all sites. Predictions at 40 sites are made separately using the 107 sites and using the 100 sites with the seven added sites removed. The predicted sites and the 7 added sites are marked by “+” and black dots in Figure 4, respectively. Figure 5 compares the predicted values and prediction variances, in which black dots refer to prediction using the 107 sites and circles refer to prediction using the 100 sites. We see that as more sites are sampled, the predicted values vary more, resulting in a less smooth prediction surface.

Figure 4. Sampling sites randomly chosen (circle) and seven added sampling sites (black dot).

Figure 5. Comparisons of 40 predicted values and prediction standard deviations with 100 sites (circle) and with 107 sites (black dot).

The advantage of more sampling sites will be seen more clearly from the next simulation study, in which we investigate the combined effects of the number of sampling sites and the sample sizes at the locations. In this simulation, we randomly draw 500 sites from the farm and simulate data from the same model with the same parameters as above, but with two sets of sample sizes: 138 and 14 for all sites. We then randomly draw 200 sites from the 500 sites, and 100 sites from the 200 sites. The three sets of sampling sites are shown in Figure 6. Once the sites are chosen, data on the sites are obtained accordingly from the simulated values. For each set of sampling sites (100, 200 and 500 sites), we make predictions for 1904 sites using sample sizes of 138 and 14 and plot the predicted values and prediction variances together in Figures 7-9.

From Figures 7-9, we see that when the number of sampling sites is large, reducing the sample size at each site has less effect on prediction than it would when the number of sampling sites is small. Therefore, when more sites are sampled, we can afford to sample less at each location without affecting the prediction as greatly as we would when fewer locations are sampled. Since prediction near a sampling site is more accurate, we prefer to sample more locations with a small sample size at each location, if the total sampling effort is controlled.

Figure 6. Change in the number of sampling sites from 100 (black dot) to 200 (+) to 500 (circle).

Figure 7. Comparison of prediction of 1904 sites using sample sizes of 138 (horizontal) and 14 (vertical) at N=100 sampling sites.

Figure 8. Comparison of prediction of 1904 sites using sample sizes of 138 and 14 at N=200 sampling sites.

Figure 9. Comparison of prediction of 1904 sites using sample sizes of 138 and 14 at N=500 sampling sites.

4. Conclusion and Discussion

We studied the effects of the sample sizes at the sampling locations and the number of sampling locations on the prediction of a binomial probability, employing model-based geostatistics and minimum mean-squared error prediction. Results of the simulation studies support the following conclusions:

(1). The sample size at each location need not be too large. For the Cunningham farm experiment, a sample size like 69 seems sufficient for prediction. Too large a sample size hardly improves prediction.

(2). A smaller sample size makes the prediction surface smoother, and the predicted values vary in a smaller range.

(3). When more locations are sampled, the effects of sample size on the prediction surface become smaller.

(4). When the total sampling effort is controlled, we prefer sampling more locations with a moderate sample size at each location over sampling fewer locations with a larger sample size at each.

Theoretical justifications of the conclusions are not available. Nevertheless, the simulation results provide useful guidance for the design of the experiment on Cunningham farm. In all the simulations, the parameters are fixed at some values. We expect the conclusions to hold in general cases regardless of the parameter values.

Acknowledgment

The authors acknowledge the assistance of Xiaoping Jin on computing.

References

Abramowitz, M. and Stegun, I. A. (eds.) (1967) Handbook of Mathematical Functions. U.S. Government Printing Office, Washington, D.C.

Benedetti, R. and Palma, D. (1995) Optimal sampling designs for dependent spatial units. Environmetrics, 6, 101-114.

Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9-25.

Crouch, E.A.C. and Spiegelman, D. (1990) The evaluation of integrals of the form $\int_{-\infty}^{+\infty} f(t) \exp(-t^2)\, dt$: Application to logistic-normal models. Journal of the American Statistical Association, 85, 464-469.

Diggle, P., Tawn, J. A., and Moyeed, R.A. (1998) Model-based geostatistics (with discussion). Journal of the Royal Statistical Society, Ser. C, Applied Statistics, 47, 299-350.

Fedorov, V.V. (1996) Design of spatial experiments: model fitting and prediction. In Rao, C.R. and Ghosh, S., editors, Handbook of Statistics, Vol. 13, North-Holland.

Muller, W. G. (1998) Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Physica-Verlag, Heidelberg.

Muller, W. G. and Zimmerman, D. L. (1999) Optimum design for semivariogram estimation. Environmetrics, 10, 23-37.

Pesti, G., Kelly, W. E. and Bogardi, I. (1994) Observation network design for selecting locations for water supply wells. Environmetrics, 5, 91-110.

Warrick, A.W. and Myers, D. E. (1987) Optimization of sampling locations for variogram calculations. Water Resources Research, 23, 496-500.

Zhang, H. (2002a) On estimation and prediction for spatial generalized linear mixed models. Biometrics, Vol. 58, No. 1, 129-136.

Zhang, H. (2002b) Optimal interpolation and the appropriateness of cross-validating variogram in spatial generalized linear mixed models. Journal of Computational and Graphical Statistics (to appear).
