Sampling, WLS, and Mixed Models

SPH&HS, UMASS Amherst
Ed Stanek and Julio Singer
U of Mass, Amherst, and U of Sao Paulo, Brazil
II ESAMP Meetings
Nov 6, 2009
Natal, Brazil


Finite Population Mixed Models Research Group
Luz Mery Gonzalez, Colombia; Viviana Lencina, Argentina; Julio Singer, Brazil; Silvina San Martino, Argentina; Wenjun Li, US; and Ed Stanek, US

Background
Motivation: 2-stage cluster sample of hospitals
  n hospitals
  m appendectomy operations per hospital
What is the average cost of an operation at a selected hospital (latent value)?
Choices:
  Use the average cost of the m operations for the selected hospital
  Use the shrunk cost, regressing to the mean for the other sample hospitals
Which should we use?


Consider/Account for:
  Study Design
  Sampling
  Response Error
  Model Assumptions
How do we make up models to get better insight from limited information?

An Example
What is a subject's saturated fat intake?


Seasons Study UMASS Worc

Seasons Study UMASS Worc - focus on 3 subjects


The Problem - Simplified
Observe: 1 measure of SFat on each subject
Assume: Response Error (RE) variance known
Question: How do we estimate a subject's true sat fat intake?

Begin with a Response Error Model, which leads to:
  Mixed Model
  Finite Population Mixed Model
Subjects: Daisy, Lily, Rose


Population

Population
Set

Response
[Figure: observed responses 11 and 4]

Response
[Figure: observed responses 11, .., 0 and 4]


Response Error Model for Set

Subject   Possible responses   Probabilities
Daisy     9, 11                0.5, 0.5
Rose      0, 4                 0.5, 0.5


Summary Response Error Model
Latent Value

Re-parameterized RE Model
Mean Latent Value - of what? The Set or the Population

Generating Response in the RE Model

Generating an Observed Response in the RE Model
[Figure: response errors -1, 1, -2, 2, -1, -2]

Sample Space
Response Error Model

Response Error Model


Mixed Model (MM)
Random Effect

Mixed Model (MM)
Latent Value

Mixed Model (MM) in action

Mixed Model (MM)

Mixed Model (MM)
Who Are They?

Mixed Model (MM)
What Does it Mean?

Sample Space (MM)
Real
Artificial


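MM - Latent Values?
Samples

            Daisy (j=1)                             Rose (j=2)
Response  MM-latent value  Response error   Response  MM-latent value  Response error
    3            2                1            12           10                2
    1            2               -1            12           10                2
    3            2                1             8           10               -2
    1            2               -1             8           10               -2
   11           10                1             4            2                2
    9           10               -1             4            2                2
   11           10                1             0            2               -2
    9           10               -1             0            2               -2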


What are they (for Daisy)?
Samples
[Table of samples repeated from the previous slide]

What are they (for Rose)?
Samples
[Table of samples repeated from the previous slide]


BLUPs of the MM-Latent Value

MSE of BLUPs for MM-Latent Values
Samples

        Daisy (j=1)                      Rose (j=2)
BLUP   MM-latent value   MSE       BLUP   MM-latent value   MSE
 3.1         2           0.81      11.5         10          2.19
 1.2         2           1.15      11.4         10          1.86
 3.1         2           0.71       7.7         10          5.24
 1.1         2           1.28       7.6         10          5.79
10.9        10           1.28       4.4          2          5.79
 8.9        10           0.71       4.3          2          5.24
10.8        10           1.15       0.64         2          1.86
 8.9        10           0.81       0.52         2          2.19
Ave                      0.986                              3.768


MSE of BLUPs | P=y
Samples

        Daisy (j=1)                   Rose (j=2)
BLUP    Latent value   MSE       BLUP   Latent value   MSE
10.90       10         0.81      4.41        2         5.79
 8.93       10         1.15      4.29        2         5.24
10.84       10         0.71      0.64        2         1.86
 8.87       10         1.28      0.52        2         2.19
MSE = Ave              0.986                           3.768
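The conditional MSE figures on this slide can be checked with a short sketch. It assumes the usual BLUP form for one measure per subject with known variances (a WLS mean plus a shrunk deviation), and it takes the random-effect variance to be gamma2 = 32, the variance of the two latent values 10 and 2 with divisor n-1. That value is not stated in the transcript; it is an inference, chosen because it reproduces the slide's numbers. Function and variable names are illustrative.

```python
from itertools import product

latent = {"Daisy": 10.0, "Rose": 2.0}   # subject latent values (conditioning on P = y)
sigma2 = {"Daisy": 1.0, "Rose": 4.0}    # known response error variances
gamma2 = 32.0                           # ASSUMED random-effect variance (inferred, see text)
subjects = list(latent)

def mm_blup(y):
    """MM BLUPs for one response vector y (dict: subject -> response)."""
    w = {s: 1.0 / (gamma2 + sigma2[s]) for s in subjects}
    mu_hat = sum(w[s] * y[s] for s in subjects) / sum(w.values())   # WLS mean
    k = {s: gamma2 / (gamma2 + sigma2[s]) for s in subjects}        # shrinkage constants
    return {s: mu_hat + k[s] * (y[s] - mu_hat) for s in subjects}

# Enumerate the four potentially observable sample points (error = +/- sigma_j)
sq_err = {s: [] for s in subjects}
for signs in product([1, -1], repeat=len(subjects)):
    y = {s: latent[s] + sg * sigma2[s] ** 0.5 for s, sg in zip(subjects, signs)}
    blup = mm_blup(y)
    for s in subjects:
        sq_err[s].append((blup[s] - latent[s]) ** 2)

ave_mse = {s: sum(v) / len(v) for s, v in sq_err.items()}
print({s: round(m, 3) for s, m in ave_mse.items()})
```

Under these assumptions the averages round to 0.986 for Daisy and 3.768 for Rose, matching the slide.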


Finite Population Mixed Model (FPMM)
Population

Response Error Model
Latent Value

Accounting for Sampling
Indicator random variable: 1 if the ith selected sample subject is subject s, 0 otherwise


Finite Population Mixed Model (FPMM)

FPMM - Sample Space

FPMM - Sample Space
[Figure: sample point (4, 11)]

FPMM - Sample Space
[Figure: sample points (-7, 4), (13, 4), (-7, 0), (13, 0)]

FPMM - Sample Space
[Figure: sample points (0, 13), (4, 13), (0, -7), (4, -7)]

FPMM - Sample Space
[Figure: sample points (-7, 11), (13, 11), (-7, 9), (13, 9)]

FPMM - Sample Space
[Figure: sample points (9, 13), (11, 13), (9, -7), (11, -7)]

FPMM - Sample Space
All sample points are Potentially Observable


FPMM - BLUPs of Realized Latent Values

FPMM - BLUPs of Realized Latent Values
Sample Sequence

Comparison of MM-BLUP and FPMM-BLUP
                         MM-BLUP           FPMM-BLUP
Target Random Variable   MM-Latent Value   Latent Value

Comparison of MM-BLUP and FPMM-BLUP
                         MM-BLUP           FPMM-BLUP
Predictor

Comparison of FPMM-BLUP and MM-BLUP - Sample Space
Artificial

To Compare, Focus on THIS Sample Space


Bigger Sample (n=3)
Population (N=4)

n=3, What is Lily's Latent value?
Use n=3 subject effects for MM
1 possible sample set

n=3, What is Lily's Latent value?
8 sample points

n=3, What is Lily's Latent value?
8 x (6 permutations) = 48 sample points

n=3, What is Lily's Latent value?
Combinations

n=3, What is Lily's Latent value?
192 Sample Points
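The counts on these slides (8, 48, and 192) follow from simple enumeration. The sketch below assumes, as in the earlier two-subject example, that each subject has exactly two possible responses, and that the fourth subject in the population is Violet. Variable names are illustrative.

```python
from itertools import combinations, permutations, product

population = ["Daisy", "Lily", "Rose", "Violet"]   # N = 4
n = 3                                              # sample size

# Two possible response errors per subject -> 2**n response vectors per sequence
points_per_sequence = len(list(product([-1, 1], repeat=n)))

# Orderings of the n subjects in one sample set
sequences_per_set = len(list(permutations(range(n))))

# Distinct sample sets of size n from the population
sample_sets = list(combinations(population, n))

points_per_set = points_per_sequence * sequences_per_set
total_points = points_per_set * len(sample_sets)

print(points_per_sequence, points_per_set, total_points)
```

The four sample sets enumerated here are the four sets that appear in the summary MSE table.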


Select one sequence

Select one sequence, Observe Sample Point

FPMM - Average MSE of Predictor over Permutations

Ave MSE
[Figure labels: 5.0; 16.2 FPMM; 4.6 MM; X]

Ave MSE
[Figure labels: 34.3 FPMM; 17.7; 29.4; MM; X]


Summary MSE Results

Sample
Set   j=1     j=2     j=3      Target    MM MSE    FPMM MSE
1     Daisy   Lily    Rose     Mean       2.667    11.667
1     Daisy   Lily    Rose     Daisy      0.9931   15.679
1     Daisy   Lily    Rose     Lily      12.3195   34.165
1     Daisy   Lily    Rose     Rose       3.7561    9.785
2     Daisy   Lily    Violet   Mean       7.409    14.000
2     Daisy   Lily    Violet   Daisy      0.993    18.362
2     Daisy   Lily    Violet   Lily      17.765    34.311
2     Daisy   Lily    Violet   Violet    18.929    18.487
3     Daisy   Rose    Violet   Mean       2.464     3.333
3     Daisy   Rose    Violet   Daisy      0.994     3.647
3     Daisy   Rose    Violet   Rose       3.540     3.304
3     Daisy   Rose    Violet   Violet    13.563    17.224
4     Lily    Rose    Violet   Mean       3.066    14.333
4     Lily    Rose    Violet   Lily       4.593    16.177
4     Lily    Rose    Violet   Rose       3.345    13.751
4     Lily    Rose    Violet   Violet     4.147    15.027
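As a quick way to read the summary table, the sketch below lists the rows where the FPMM MSE is smaller than the MM MSE. The values are transcribed from the table, reading the first MSE column as MM and the second as FPMM; the variable names are illustrative.

```python
# (sample set, target, MM MSE, FPMM MSE), transcribed from the summary table
rows = [
    (1, "Mean", 2.667, 11.667), (1, "Daisy", 0.9931, 15.679),
    (1, "Lily", 12.3195, 34.165), (1, "Rose", 3.7561, 9.785),
    (2, "Mean", 7.409, 14.000), (2, "Daisy", 0.993, 18.362),
    (2, "Lily", 17.765, 34.311), (2, "Violet", 18.929, 18.487),
    (3, "Mean", 2.464, 3.333), (3, "Daisy", 0.994, 3.647),
    (3, "Rose", 3.540, 3.304), (3, "Violet", 13.563, 17.224),
    (4, "Mean", 3.066, 14.333), (4, "Lily", 4.593, 16.177),
    (4, "Rose", 3.345, 13.751), (4, "Violet", 4.147, 15.027),
]

# Rows where the FPMM predictor has the smaller MSE
fpmm_smaller = [(s, target) for s, target, mm, fpmm in rows if fpmm < mm]
print(fpmm_smaller)
```

In this table the FPMM MSE is smaller in two rows (Violet in set 2 and Rose in set 3), and the MM MSE is smaller in the remaining fourteen.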


Conclusions
Population
Sample Space
Design Based

Conclusions
Population
Design Based
Evaluate Performance Conditional on the Sample

Conclusions
Model Based
Conceptual Priors

Conclusions
Model Based
Evaluate Performance Conditional on the Sample

Conclusions
To Evaluate Performance of BLUP Estimators:
  For the Mixed Model: condition on P=y, i.e. MM latent values match subject latent values
  For the FPMM: condition on the sample set
MSE for BLUPs not evaluated correctly
Extends to WLS estimate of mean
MM-BLUP not always best

Thanks

Any thoughts? Next steps? Questions?


Talk given on Tuesday, October 13, 2009 at the Carolina Club, UNC Chapel Hill, at the Festschrift in Honor of Professor Gary Koch, by Ed Stanek.

Finite Population Mixed Model Research Group, from left to right: Anne Stanek; Luz Mery Gonzalez; Viviana Lencina; Julio Singer; Alice Singer; Ed Stanek; Silvina San Martino; Maria Lucia Singer; and Wenjun Li.

In statistics, we look at information, and try to draw conclusions or make statements that go beyond the observations themselves. These conclusions, or insights, are the realm of inference in statistics. They are highly ritualized, but have as their origin the desire to simplify or summarize information in a meaningful way.

Inference may use the study design and sampling plan, additional assumptions, and model frameworks.

The example we consider is trying to determine a subject's true saturated fat intake from food and drink. True saturated fat intake is defined as the average saturated fat intake over a consecutive 21-day period prior to a blood draw. This time period is relevant for biological effects on serum cholesterol.

The Seasons Study was a longitudinal study of volunteers between the ages of 18 and 75 from the Fallon HMO in Worcester, Massachusetts. The study aim was to identify factors associated with seasonal variation in cholesterol, accounting for changes in diet and physical activity over the seasons. A total of 641 volunteer subjects were enrolled. The protocol involved collection of baseline data, including serum cholesterol measures, and quarterly follow-up measures for the four subsequent quarters. In the three weeks preceding each quarterly measure, three unannounced 24-hour telephone dietary recalls were conducted by trained nutritionists on each subject, with two recalls during the week and one on the weekend. More details and the results are given in the following references, with example data given at the study website at http://www.umass.edu/seasons/

Ockene, I.S., Chiriboga, D.E., Stanek, E.J. III, Harmatz, M.G., Nicolosi, R., Saperia, G., Well, A.D., Merriam, P.A., Reed, G., Ma, Y., Matthews, C.E. and Hebert, J.R. (2004). Seasonal variation in serum cholesterol: Treatment implications and possible mechanisms. Archives of Internal Medicine, 164:863-870.

This scatter plot is constructed using the mean and standard deviation of saturated fat intake for subjects with 10 or more 24-hour dietary recalls in the Seasons study.

We will discuss a simpler problem, one where the population consists of N=3 subjects.

We discuss how to estimate a subject's true saturated fat intake with minimal observed data (1 measure per subject) and fairly minimal assumptions (we assume the response error variance is known).

First, we use simple observations to motivate a response error model, and define the latent value.

Next, we discuss two approaches to add to the response error model:
  A model-based approach, where we replace a subject effect by a random variable.
  A design-based approach, where we assume the subjects were a realization of a sample.

We start with a population (where N=3). Subjects are identifiable, and correspond to Daisy, Lily, and Rose.

It is not necessary to define the population to begin with. Instead, we could simply begin with a set of subjects, say Daisy and Rose, where a response is observed on each subject.

We will assume that we only observe response on two subjects. Thus, whether or not Lily is in the population is not important.

In a study, we assume that we observe response on each subject in a set. For example, suppose the response corresponds to the grams of saturated fat reported to have been eaten in the previous 24-hour period.

For ease of presentation, we assume that 10 (g/day) has been subtracted from each observed value to keep the numbers small. As a result, to know the actual observed response, 10 must be added to the value given.

The response that we observe is not always the same, and may vary over days for the same subject.

Other possible responses are given.

Here are some more responses.

And some more responses.

After a while, we gain some experience as to what we may expect the response to be for a subject.

We would still be hard pressed to guess the saturated fat intake for a subject on a new day.

A response error model is a natural way of representing what we see.

First notice that we refer to Daisy with j=1, and to Rose with j=2. These subscripts are assigned as consecutive numbers to subjects after arranging subjects by name in ascending alphabetical order. This is the order commonly used in survey sampling when referring to elements in a set. Response is expressed as the sum of what we expect to see, plus a deviation. The response error model that we consider is additive. Multiplicative models could also be considered.

We consider a simple representation of response error with only two possible values, each equal in absolute value to the square root of sigma2. Note that sigma2 is the usual population variance.

A subject's latent value is defined to be the expected response for the subject. Notice that we never observe a response equal to the latent value for a subject in this example. The latent value is an average that we define, and claim to be able to interpret.
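As a minimal sketch of this response error model, the snippet below takes the two possible responses read from the slides for Daisy (9 and 11) and Rose (0 and 4), each with probability 0.5, and computes each subject's latent value (the expected response) and response error variance. The names `responses`, `latent_value`, and `response_error_variance` are illustrative, not from the talk.

```python
# Possible responses (after subtracting 10 g/day), each with probability 0.5.
# Values are read from the slides; the names below are illustrative.
responses = {"Daisy": [9, 11], "Rose": [0, 4]}
prob = 0.5

def latent_value(values):
    """Expected response for a subject: the subject's latent value."""
    return sum(prob * v for v in values)

def response_error_variance(values):
    """Variance of the response around the latent value (sigma_j squared)."""
    y = latent_value(values)
    return sum(prob * (v - y) ** 2 for v in values)

latent = {s: latent_value(v) for s, v in responses.items()}
sigma2 = {s: response_error_variance(v) for s, v in responses.items()}

for s in responses:
    print(s, "latent value:", latent[s], "response error variance:", sigma2[s])
```

Neither latent value (10 for Daisy, 2 for Rose) is itself a possible response, which is the point made above.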

For example, if a subject's latent value is high (say 30 g/day saturated fat intake), a nutritionist may advise the subject to lower their saturated fat intake. However, on some days, the subject may already have low saturated fat intake. More to the point, the nutritionist may want to focus advice on days where saturated fat intake is likely to be high, and pinpoint advice to those days. This focus is on responses, not latent values.

We define an average latent value for Daisy and Rose. This is an average for the set. Deviations for subjects are defined relative to this average, so that the sum of beta1 and beta2 is zero.

This is an urn model where we sample with replacement from possible response errors.

Selections of response error give rise, under the response error model, to the observed response.

The pair of observed responses, or response vector, is a point in the sample space. Other points correspond to other possible response vectors.
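The re-parameterization and the sample space can be sketched as follows, using the latent values 10 (Daisy) and 2 (Rose) and response errors of plus or minus 1 and plus or minus 2 implied by the slide figures. Variable names are illustrative.

```python
from itertools import product

latent = {"Daisy": 10, "Rose": 2}               # latent values for the set
errors = {"Daisy": [-1, 1], "Rose": [-2, 2]}    # possible response errors
subjects = list(latent)

# Mean latent value for the set, and subject deviations from it
mu = sum(latent.values()) / len(latent)
beta = {s: y - mu for s, y in latent.items()}

# Each sample point is a response vector: mu + beta_j + error_j for each subject
sample_space = [
    tuple(mu + beta[s] + e for s, e in zip(subjects, errs))
    for errs in product(*(errors[s] for s in subjects))
]

print("mu =", mu, "beta =", beta)
print(sample_space)
```

Here mu is 6, the deviations are +4 and -4 (summing to zero), and the sample space has four response vectors.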

We assume each subject has 2 possible responses, leading to four points in the sample space.

This summarizes the setting.

The mixed model substitutes a random variable for the subject effect, betaj.

The subject is identified in a mixed model.

Adding the random effect to the mean, we have the mixed model latent value. This value, when realized, is interpreted as the latent value for the subject. We represent it by Pj.

We assume the expected value (i.e. mean) is zero, and the variance is gamma2 for the random effect. Expectation is with respect to the random effect.

One way to think of the random effect is as follows. Suppose there is a population of possible subject effects which we call a superpopulation. Let us select one of these effects at random. We may interpret the random effect as a selection from this superpopulation. The superpopulation may (or may not) be identifiable.

Notice that the subscript for expectation is over the assumed model. It is the assumptions on the random effect that define this model.

We use this idea to define the random effect. We take as the superpopulation the random effects for the two subjects, Daisy and Rose.

Notice that in the mixed model, the subjects are identifiable. We have Daisy (j=1) with her possible response error values, and Rose (j=2) with her possible response error values.

We can obtain response with a two step process. First, we pick a subject effect. Second, we pick response error.

However, response could be obtained in one step: simultaneously select a subject effect and response error. We do not need to know what subject effect is realized to pick from possible response errors for a subject.

We have illustrated the subject effects as if they are an attribute of a subject (keeping the same color for effect as we had when we formed the superpopulation of subject effects). This is consistent with subjects having different latent values, with the subject effects colored to keep the correspondence with the subject.

The outcomes illustrated here for Daisy and Rose are never observed in practice. They have positive probability of occurring in the mixed model.

Some sample points may be observed. These are called Real. Other sample points have positive probability in the mixed model, but are never observed. These we call Artificial.

This table lists the MM response for the 8 possible sample points, with one row corresponding to a sample point. The MM-responses in the top 4 rows are not potentially observable, as indicated by the shaded region.

The four sample points in the bottom 4 rows of the table are potentially observable.

Notice that the MM-latent value changes for different sample points for the same subject. This is inconsistent with the definition of the latent value from the response error model. This is a reason for concern over interpretation when predicting the MM-latent value, since such latent values are not the same as the latent values under the response error model.
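The 8 sample points, and the split into Real and Artificial, can be enumerated with a short sketch. It assumes the two subject effects (+4 and -4) are assigned to Daisy and Rose without replacement; this is an inference, chosen because it reproduces the 8 rows of the table on the slide. Names are illustrative.

```python
from itertools import permutations, product

mu = 6                                   # mean latent value for the set
effects = [4, -4]                        # superpopulation of subject effects
errors = {"Daisy": [1, -1], "Rose": [2, -2]}
true_latent = {"Daisy": 10, "Rose": 2}   # latent values under the response error model
subjects = ["Daisy", "Rose"]

points = []
for b in permutations(effects):          # ASSUMPTION: effects assigned without replacement
    for e in product(*(errors[s] for s in subjects)):
        p = tuple(mu + bj for bj in b)                  # MM-latent values
        y = tuple(pj + ej for pj, ej in zip(p, e))      # MM responses
        # A point is "real" only if every MM-latent value matches the subject's own
        real = all(pj == true_latent[s] for pj, s in zip(p, subjects))
        points.append({"response": y, "mm_latent": p, "real": real})

for pt in points:
    print(pt)
```

Four of the eight points are real. In the other four, Daisy carries the MM-latent value 2 and Rose carries 10, which is the mismatch described above.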

The inconsistency in interpretation of the MM-latent values is a direct result of allowing the subject effects to be random.

The same interpretation conflict occurs for the latent value of Rose. We use the term MM-latent value to refer to the set of latent values that have positive probability under the MM. These are distinguished from the Latent Values that correspond to the latent value for the subject.

The solution to the MM equations results in the best linear unbiased predictor (BLUP). The mean, mu-hat, is a weighted least squares mean.

The shrinkage constant, kj, depends on the response error variance for the subject. It also depends on the variance of the random effect. We have defined this variance in terms of the two latent values in the set. It could be defined in a different way, for example as the variance of latent values in a larger finite population, or as the variance of a distribution of latent values (assumed infinite).
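The slide's own equations did not survive in this transcript. As a hedged reconstruction, the usual BLUP for one measure per subject with known variances has the form below, where Y_j is the observed response, sigma_j^2 is the response error variance for subject j, and gamma^2 is the variance of the random effect. The weights w_j are notation introduced here for the WLS mean.

```latex
\hat{P}_j = \hat{\mu} + k_j\,(Y_j - \hat{\mu}), \qquad
k_j = \frac{\gamma^2}{\gamma^2 + \sigma_j^2}, \qquad
\hat{\mu} = \frac{\sum_j w_j Y_j}{\sum_j w_j}, \qquad
w_j = \frac{1}{\gamma^2 + \sigma_j^2}.
```

As a check, take Y = (11, 4), sigma_1^2 = 1, sigma_2^2 = 4, and assume gamma^2 = 32 (the variance of the latent values 10 and 2 with divisor n-1, an inferred value). Then k_1 = 32/33, k_2 = 32/36, mu-hat is about 7.65, and the predictors are about 10.90 for Daisy and 4.41 for Rose, which agree with the first row of the P=y table on the slides.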

Notice that the expression for the variance treats the latent value as a random variable, even for the same subject. We refer to these changing latent values as MM-latent values. Also notice that the expression for the variance depends on the subject, j.

Each row in this table gives the BLUP for a subject. Notice that each row corresponds to a point in the Mixed Model sample space. The MM-latent value changes for different rows for the same subject. Finally, the term in the column for MSE corresponds to the square of the difference between the BLUP and the MM-latent value.

For Daisy, the response error variance is equal to 1. Using the BLUP, the average MSE is 0.986. Since this is less than 1, the MM BLUP is usually said to be a more efficient predictor of the realized latent value than the observed response (which is the subject mean response, since there is only one measure per subject).

Although the MM BLUP is developed over a MM sample space that includes points that cannot occur, the sample space also includes all points that are potentially observable for the set of subjects. We can evaluate the MSE of the MM BLUP over just these potentially observable points (i.e. conditional on the true latent values). In this example (with n=2), we obtain the same numeric values for the average MSE.

A finite population mixed model is developed directly from random variables underlying the two-stage sampling. The first stage is selection of subjects in the sample. The second stage is measuring each of the selected subjects.

Notice that the parameter corresponding to mu has a different definition here than it did in the mixed model. Here, mu is the average latent value of the subjects in the population.

Also notice that we specify the response error variance of Lily. As we proceed, the FPMM-BLUP will depend in part on this response error variance. It was not necessary to specify this variance for the MM development.

In the mixed model, there are no additional subjects beyond the subjects in the set. The MM mean was defined as the average latent value of the subjects in the set.

This illustrates the two-stage procedure that leads to a response on each sample subject.

Notice that the subscript, i=1, refers to an order of selection, not to a particular subject. As a result, this representation is for sample sequences. For different samples, different subjects may occupy the first position in the sample sequence.
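A small sketch may help with the distinction between position i in the sample sequence and subject s. It draws one sample sequence of n=2 from the N=3 subjects, builds the indicator variables (1 if the ith selected subject is subject s, 0 otherwise), and uses them to pick out the response in each position. The response values and the fixed seed are illustrative choices, not results from the talk.

```python
import random

population = ["Daisy", "Lily", "Rose"]    # N = 3 subjects, in alphabetical order
n = 2                                     # sample size

rng = random.Random(0)                    # fixed seed so the illustration is repeatable
sequence = rng.sample(population, n)      # ordered selection without replacement

# U[i][k] = 1 if the subject in position i of the sequence is population[k], else 0
U = [[1 if sequence[i] == s else 0 for s in population] for i in range(n)]

# One response per subject on a given occasion (illustrative values)
Y_pop = {"Daisy": 11, "Lily": 13, "Rose": 4}

# Response in position i of the sample: the sum over subjects of U[i][k] * Y_s
Y_sample = [
    sum(U[i][k] * Y_pop[s] for k, s in enumerate(population))
    for i in range(n)
]

print("sequence:", sequence)
print("indicators:", U)
print("sample responses:", Y_sample)
```

Each row of the indicator array has exactly one 1, so the response in position i is always that of the subject who was actually selected in that position.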

In the two-stage sampling approach, there is never a mismatch between the subject and the subject's latent value.

Notice that there is no confusion over which subject is realized in the finite population mixed model.

Each block represents a sample point in the sample space.

Here are four more sample points. Notice that these are considered to be different points in the FPMM than the previous sample points, since this sample sequence is different from the previous sample sequence. Notice that sample sequences are used as opposed to sample sets.

These are additional sample points.

Here are some more sample points.

And some more sample points.

Here is the final set of sample points.

This is a summary illustration of all the sample points in a finite population mixed model. The bottom row in this diagram illustrates the possible sample sets, and the sample points for each set. Additional rows illustrate permutations of the subjects in the sets. Since a sample is a sample sequence, the sample space consists of all points for each possible sample sequence.

This illustrates the two-stage process that is assumed to occur when generating the response for a sample.

The idea of a latent value in a FPMM does not identify a subject.

In the FPMM, by a realized subject, we mean that we know who was selected. In this example, Daisy was selected first.

The variance of the random effects corresponds to the variance of the latent values in the population. This definition differs from that used for the example with the MM.

Since all subjects can be selected at any position in the sample, the response error variance is the average response error variance over all subjects in the population.

It is possible to make similar comparisons between the FPMM and MM BLUPs using the same definitions for the mean of the latent values, and the variance of the random effects. To do so, we begin with a set of subjects. The MM is developed in the same manner as the earlier development. The FPMM is developed assuming that the sample size is n=N (where all subjects are selected in the sample). In the FPMM, sample points correspond to sequences of subjects.

The MSE for the FPMM-BLUP is typically defined over all possible sample points. Here, we limit the MSE to a sample sequence.

This illustrates the different possible latent values that are conceptualized in the mixed model and the finite population mixed model.

Suppose response on two individuals (Daisy and Rose) is obtained. Then the latent value for these two individuals occurs in the set of MM-latent values and the set of FPMM latent values.

Notice the similarity of the predictors. The WLS mean is used in the MM as opposed to the simple sample mean in the FPMM.

From the standpoint of sample sequences in a FPMM, a single sample sequence is selected. However, the subjects in the sequence form a set, and the same set will occur for permutations of the sample sequence. For this reason, we can compare the sample set in the mixed model with sample sequences representing permutations of the sample set in the FPMM.

Here we consider a slightly larger population where we have added another subject, Violet.
