Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in...

29
Inferential Problems with Nonprobability Samples Richard Valliant University of Michigan & University of Maryland 26 Feb 2016 (UMich & UMD) Ross-Royall Symposium 1 / 24

Transcript of Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in...

Page 1: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Inferential Problems with Nonprobability Samples

Richard Valliant

University of Michigan & University of Maryland

26 Feb 2016

(UMich & UMD) Ross-Royall Symposium 1 / 24

Page 2: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Probability v. Nonprobability samples

Goal in surveys is to use sample to make estimates for entire finitepopulation—external validityProbability samples became touchstone in surveys after Neyman(JRSS 1934) article

design-based approach: model-free inferencestratified sampling with Neyman allocationcluster samplingconfidence intervals

Early failure of nonprobability sample

1936 Literary Digest; 2.3 million mail surveys to subscribers plusautomobile and telephone owners predicted landslide win by AlfLandon over FDR

(UMich & UMD) Ross-Royall Symposium 2 / 24

Page 3: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Nonprobability samples

Standard in experiments—no finite population

New sources of dataTwitter, Facebook, Web-scraping

Billion Prices Project @ MIT, http://bpp.mit.edu/

� Price indexes for 22 countries based on web-scraped data

Keiding & Louis (2016). Perils and potentials of self-selected entryto epidemiological studies and surveys. JRSS-A

Are these data good for anything?

(UMich & UMD) Ross-Royall Symposium 3 / 24

Page 4: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

(UMich & UMD) Ross-Royall Symposium 4 / 24

Page 5: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

(UMich & UMD) Ross-Royall Symposium 5 / 24

Page 6: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Not all nonprobability samples are created equal

College sophomores in Psych 100

Mall intercepts

Volunteer samples, river samples, snowball samples

Probability samples with low response rates

Coalitions of the willing

AAPOR task force report on non-probability samples (2013)

(UMich & UMD) Ross-Royall Symposium 6 / 24

Page 7: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Not all nonprobability samples are created equal

College sophomores in Psych 100

Mall intercepts

Volunteer samples, river samples, snowball samples

Probability samples with low response rates

Coalitions of the willing

AAPOR task force report on non-probability samples (2013)

(UMich & UMD) Ross-Royall Symposium 6 / 24

Page 8: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Not all nonprobability samples are created equal

College sophomores in Psych 100

Mall intercepts

Volunteer samples, river samples, snowball samples

Probability samples with low response rates

Coalitions of the willing

AAPOR task force report on non-probability samples (2013)

(UMich & UMD) Ross-Royall Symposium 6 / 24

Page 9: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Declining response rates

Pew Research response rates in typical telephone surveysdropped from 36% in 1997 to 9% in 2012 (Kohut et al. 2012)

With such low RRs, a sample initially selected randomly canhardly be called a probability sample

Low RRs raise the question of whether probability sampling isworthwhile, at least for some applications

I Non-probs are faster, cheaperI No worse?

(UMich & UMD) Ross-Royall Symposium 7 / 24

Page 10: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

Polls that failed

� British parliamentary election May 2015

Final Ipsos/MORI East Anglia/LSE/Durham UParty (online panel) (using poll aggregation)Conservative 51% 36% 43%Labour 36% 35% 41%

� Israeli March 2015 election (seats); online panels

Final Smith- TNS/ Maariv ChannelParty Reshet Bet Walla 1Likud 30 21 23 21 25Zionist Union 24 25 25 25 25

� Scottish independence referendum, Sep 2014

(UMich & UMD) Ross-Royall Symposium 8 / 24

Page 11: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Types of samples

One that worked

Xbox gamers: 345,000 people surveyed in opt-in poll for 45 dayscontinuously before 2012 US presidential electionXboxers much different from overall electorate18- to 29-year olds were 65% of dataset, compared to 19% innational exit poll93% male vs. 47% in electorateUnadjusted data suggested landslide for RomneyGelman, et al. used some sort of regression and poststratificationto get good estimatesCovariates: sex, race, age, education, state, party ID, politicalideology, and who voted for in the 2008 pres. election.

Wang, W., D. Rothschild, S. Goel, and A. Gelman. 2015. Forecasting Elections withNon-representative Polls. International Journal of Forecasting

(UMich & UMD) Ross-Royall Symposium 9 / 24

Page 12: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Inference problem

Universe & sample 

s

Potentially covered

Fc U-F

 Not  

covered 

U

Fpc

Covered 

For example ...

U = adult population

Fpc = adults with internet access

Fc = adults with internet access who visit some webpage(s)

s = adults who volunteer for a panel

(UMich & UMD) Ross-Royall Symposium 10 / 24

Page 13: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Inference problem

Ideas used in missing data literature

MCAR–Every unit has same probability of appearing in sample

MAR–Probability of appearing depends on covariates known forsample and nonsample cases

NINR–Probability of appearing depends on covariates and y ’s

(UMich & UMD) Ross-Royall Symposium 11 / 24

Page 14: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Inference problem

Table: Percentages of US households with Internet subscriptions; 2013American Community Survey

Percent of householdswith Internet subscription

Total households 74

Race and Hispanic origin of householderWhite alone, non-Hispanic 77Black alone, non-Hispanic 61Asian alone, non-Hispanic 87Hispanic (of any race) 67

Household incomeLess than $25,000 48$25,000-$49,999 69$50,000-$99,999 85$100,000-$149,999 93$150,000 and more 95

Educational attainment of householderLess than high school graduate 44High school graduate 63Some college or associate’s degree 79Bachelor’s degree or higher 90

(UMich & UMD) Ross-Royall Symposium 12 / 24

Page 15: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Inference problem

Estimating a total

Pop total t =P

s yi +P

Fc�syi +

PFpc�Fc

yi +P

U�F yi

To estimate t , predict 2nd, 3rd, and 4th sums

What if non-covered units are much different from covered?

I No 70+ year old Black women in a web panel

I No 18-21 year old Hispanic males in a phone surveyDifference from a bad probability sample with a good frame butlow RR:I No unit in U � F or Fpc � Fc had any chance of appearing inthe sample

(UMich & UMD) Ross-Royall Symposium 13 / 24

Page 16: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Inference problem

Full pop vs. Domains

If domain is completely or mostly in the uncovered part (U � F ,Fpc � Fc), then direct domain estimates not possible

I Small area approach where nD = 0 might be tried

Full pop estimates may be OK if uncovered are "like" covered

(UMich & UMD) Ross-Royall Symposium 14 / 24

Page 17: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Quasi-randomization

Quasi-randomization

Model probability of appearing in sample

Pr(i 2 s) = Pr(has Internet)�

Pr(visits webpage j Internet)�

Pr(volunteers for panel j Internet ; visits webpage)�

Pr(participates in survey j Internet ; visits webpage ; volunteers)

(UMich & UMD) Ross-Royall Symposium 15 / 24

Page 18: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Quasi-randomization

Reference sample

Select a probability sample from a frame with good coverageCombine probability and non-probability samples togetherEstimate probability of being in non-probability sample usinglogistic regression (or similar)Use inverse probability as a Horvitz-Thompson-like weight

What does this probability mean?

In a volunteer sample, there are people who would never visitrecruiting webpage or never volunteer if they did visit

The probability has no relative frequency interpretation

(UMich & UMD) Ross-Royall Symposium 16 / 24

Page 19: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Quasi-randomization

Reference sample

Select a probability sample from a frame with good coverageCombine probability and non-probability samples togetherEstimate probability of being in non-probability sample usinglogistic regression (or similar)Use inverse probability as a Horvitz-Thompson-like weight

What does this probability mean?

In a volunteer sample, there are people who would never visitrecruiting webpage or never volunteer if they did visit

The probability has no relative frequency interpretation

(UMich & UMD) Ross-Royall Symposium 16 / 24

Page 20: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Quasi-randomization

Reference sample

Select a probability sample from a frame with good coverageCombine probability and non-probability samples togetherEstimate probability of being in non-probability sample usinglogistic regression (or similar)Use inverse probability as a Horvitz-Thompson-like weight

What does this probability mean?

In a volunteer sample, there are people who would never visitrecruiting webpage or never volunteer if they did visit

The probability has no relative frequency interpretation

(UMich & UMD) Ross-Royall Symposium 16 / 24

Page 21: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

Superpopulation model

Use a model to predict the value for each nonsample unitLinear model: yi = x

Ti � + �i

If this model holds, then

t =Xs

yi +XFc�s

yi +X

Fpc�Fc

yi +XU�F

yi

=Xs

yi + tT(U�s);x �

:= t

TUx �; yi = x

Ti �

(UMich & UMD) Ross-Royall Symposium 17 / 24

Page 22: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

Prediction Theory literature

Royall (BMKA 1976)

Royall (AJE 1976)

Royall (JASA 1976)

Royall & Pfeffermann (BMKA 1982)

Royall (Handbook of Stat 1988)

Valliant, Dorfman, Royall (2000). Finite Population Sampling andInference: A Prediction Approach

(UMich & UMD) Ross-Royall Symposium 18 / 24

Page 23: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

Unit-level weights

Prediction estimator does lead to weights

wi = 1+ tT(U�s);x

�XTs Xs

��1

xi

:= t

TUx

�XTs Xs

��1

xi

(UMich & UMD) Ross-Royall Symposium 19 / 24

Page 24: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

y ’s & Covariates

If y is binary, a linear model is being used to predict a 0-1 variable

I Done routinely in surveys without thinking explicitly about amodelEvery y may have a different model ) pick a set of x ’s good formany y ’s

I Same thinking as done for GREG and other calibrationestimatorsUndercoverage: use x ’s associated with coverage

I Also done routinely in surveys

(UMich & UMD) Ross-Royall Symposium 20 / 24

Page 25: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

Modeling considerations

Good modeling should consider how to predict y ’s and how tocorrect for coverage errors

Covariate selection: LASSO, CART

Covariates: an extensive set of covariates neededDever, Rafferty, & Valliant (2008). Svy. Rsch. Meth.Valliant, Dever (2011). Soc. Meth. Res.Gelman, et al. (2015). Intl. Jnl. Forecasting

Model fit for sample needs to hold for nonsample

Proving that model estimated from sample holds for nonsampleseems impossible

(UMich & UMD) Ross-Royall Symposium 21 / 24

Page 26: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

SE estimation

Variance estimator must be model-basedReplication is an option

Jackknife or bootstrap

Approximate jackknifeLarge sample approximation to jackknife

vJ =Xs

w2

i r2

i

(1� hii )2� n�1

�Ps wiri

(1� hii )

�2

ri = yi � yi ; hii is a leveragevJ is consistent if the linear model holds with uncorrelated errors

(UMich & UMD) Ross-Royall Symposium 22 / 24

Page 27: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Model for y

Variance estimation articles

Royall & Eberhardt (Sankhy�a 1975)

Royall & Herson (JASA 1973a, b)

Royall & Cumberland (JASA 1978, 1981a, 1981b, 1985)

Royall (JASA 1986)

Royall (ISR 1986)

Royall & Cumberland (JRSS-B 1988)

(UMich & UMD) Ross-Royall Symposium 23 / 24

Page 28: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Future

Trouble Spots

How to use social media data: Twitter, Facebook, Snapchat, datascraped from web ...Volunteer panelsCombining nonprobability samples with (better controlled)probability samplesOfficial statistics: are there faster, cheaper ways of producingunemployment rate, inflation rate, new job creation, prevalence ofhealth conditions?

????

(UMich & UMD) Ross-Royall Symposium 24 / 24

Page 29: Inferential Problems with Nonprobability Samples...Probability v. Nonprobability samples Goal in surveys is to use sample to make estimates for entire finite population—external

Methods of Inference Future

Trouble Spots

How to use social media data: Twitter, Facebook, Snapchat, datascraped from web ...Volunteer panelsCombining nonprobability samples with (better controlled)probability samplesOfficial statistics: are there faster, cheaper ways of producingunemployment rate, inflation rate, new job creation, prevalence ofhealth conditions?

????

(UMich & UMD) Ross-Royall Symposium 24 / 24