Transcript of: Statistical Paradises and Paradoxes in Big Data (ww2.amstat.org/misc/XiaoLiMengBDSSG.pdf)
Statistical Paradises and Paradoxes
in Big Data
Xiao-Li Meng
Department of Statistics,
Harvard University
Thanks to many students and colleagues
Paradises
• Much larger general pipeline:
• Much better airplane conversations
• Golden era for methodological research
• Emerging theoretical foundations
[Chart: Statistics Concentration (Major) size at Harvard College]
Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math
• Rigorous theory of the trade-off between
statistical and computational efficiency,
under confidentiality, etc., based on
classical statistical decision theory.
• Wide-ranging statistical machine learning
theory, methodology, algorithms, using
empirical process, signal processing &
information theory (e.g., MDL principle).
• Automated Targeted Learning and Super
Learning built upon well-established semi-
parametric and nonparametric theory.
• Algebraic statistics, e.g., studying
statistical hypothesis testing via algebraic
geometry and computational and
combinatorial techniques.
• ……
BFF group: Integrating Bayes, Frequentist, and Fiducial perspectives
• Fusion learning via confidence distributions (CD)
• Combining results from multiple analyses under
possibly different perspectives
Jianqing Fan’s Group (Princeton):
Bringing statistical theory and methods to the forefront of Big Data
Fan et al. (2014) Challenges of Big Data Analysis
National Science Review (China) 1: 293-314
Salient features of Big Data
• Heterogeneity (Individuality)
• Noise accumulation
• Spurious correlation
• Incidental endogeneity
Great Promises and Grand Challenges
• Multi-Resolution Inference
• Multi-Phase Inference
• Multi-Source Inference
o Meng (2014) A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (if you help fund it). COPSS 50th Anniversary Volume.
o Blocker and Meng (2013) The Potential and Perils of Preprocessing: Building New Foundations. Bernoulli, 19, 1176-1211.
o Xie and Meng (2016) Dissecting Multiple Imputation from a Multiphase Inference Perspective: What Happens When God’s, Imputer’s and Analyst’s Models are Uncongenial? (With discussion). Statistica Sinica, to appear.
OnTheMap Project of US Census Bureau
• Developed by LED (Local Employment Dynamics).
• Users zoom into any region of the US for paired employee-employer information.
• Used diverse data sources:
surveys and administrative
datasets with confidential
information.
Thanks to Jeremy Wu of the Census Bureau.
Multi-Resolution
Multi-Phase
• To protect confidentiality, the displayed data are synthetic: draws from a posterior.
• Each data source itself has gone through multiple “clean-up” processes, most of which are gray boxes or even black boxes.
Multi-Source
• Built from more than 20 data sources in the LEHD
(Longitudinal Employer-Household Dynamics) system.
• Survey Samples: Monthly survey of 60,000 households
covering only 0.05% of households.
• Administrative Records: Unemployment insurance wage
records covering more than 90% of the US workforce;
Never intended for inference purposes.
• Census Data: Quarterly census of earnings and wages
covering 98% of US jobs.
A Trio of NP-Hard Inference Problems
• Multi-Resolution: How do we infer estimands with resolution far
exceeding any possible estimators? Is it possible for such inference to
be qualitatively robust even if it cannot be quantitatively robust?
• Multi-Phase: (Big) Data are almost never collected, preprocessed,
and analyzed in a single phase. What theory and methods
accommodate this multi-phase setup?
• Multi-Source: Which one is better: a survey sample covering 1% or
an administrative record covering 95% of the population? How
should we combine information from these sources? Is it worth
combining?
So which one is better for estimating the population mean: a 1% simple random sample (SRS) or a 95% administrative (observational) dataset (AD)?
1. 1% SRS
2. 95% AD
3. It depends!
4. Is this a trick question?
A fundamental principle of statistics: the Variance-Bias Tradeoff
Total Error = Variance + Bias²
• Probabilistic SRS: [(1 − f_s)/n] S² + 0
• Large non-probability data: ≈ 0 + r² [(1 − f_a)/f_a] S²
• f is the data's fraction of the population: f = n/N
• r is the correlation between the (honest) responded/recorded value X and the probability of response/recording, P(X)
• “Big Data Paradox” – the larger the data, the more pronounced the bias
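A minimal simulation sketch of this trade-off (everything here, from the population parameters to the recording model, is an illustrative assumption, not from the talk): a tiny SRS versus a huge sample whose probability of being recorded is slightly correlated with the value being measured.

```python
import numpy as np

rng = np.random.default_rng(2015)

# Hypothetical population (size, mean, and spread are illustrative)
N = 1_000_000
X = rng.normal(loc=50.0, scale=10.0, size=N)
true_mean = X.mean()

# 1% simple random sample: zero bias, variance ~ (1 - f_s) S^2 / n
srs = rng.choice(X, size=N // 100, replace=False)

# ~95% non-probability sample: recording probability rises slightly
# with X, creating a small correlation r between X and being recorded
p_record = np.clip(0.95 + 0.002 * (X - true_mean), 0.0, 1.0)
ad = X[rng.random(N) < p_record]

print(f"true mean : {true_mean:.3f}")
print(f"1% SRS    : {srs.mean():.3f} (error {srs.mean() - true_mean:+.3f})")
print(f"~95% AD   : {ad.mean():.3f} (error {ad.mean() - true_mean:+.3f}, f = {len(ad)/N:.2f})")
```

With these assumed numbers the SRS error is typically within about ±0.2 (two standard errors of S/√n = 0.1), while the near-census sample carries a systematic bias of roughly +0.2 that more data cannot wash out: the Big Data Paradox above.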
For estimating a population mean with r = 0.1, how large does an AD need to be, as a percentage of the US population, to produce a more accurate sample average than an SRS with n = 100? (A worked calculation follows the options below.)
1. <0.5% (1.6M)
2. 5% (16M)
3. 10% (32M)
4. 20% (64M)
5. 50% (160M)
6. 75% (240M)
7. 90% (288M)
8. >95% (303M)
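A back-of-the-envelope calculation, using the MSE comparison above with f_s negligible (a sketch, not from the slides):

```latex
r^2\,\frac{1-f}{f}\,S^2 \;<\; \frac{1-f_s}{n}\,S^2 \;\approx\; \frac{S^2}{n}
\;\;\Longrightarrow\;\; \frac{1-f}{f} \;<\; \frac{1}{n r^2}
\;\;\Longrightarrow\;\; f \;>\; \frac{n r^2}{1 + n r^2}.
```

With n = 100 and r = 0.1, n r² = 1, so f must exceed 1/2: the AD needs to cover more than 50% of the population (about 160M people) just to match a 100-person SRS.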
Big Data: Big Size or Big Fraction?
• Size matters, but only after having quality
• Importance of combining non-probabilistic samples
with probabilistic ones, however small the latter are.
• More does NOT guarantee better:
• “I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?” (Meng and Xie, 2014, Econometric Reviews, 33, 218-250)
So when/why do we need Big Data?
• Individualized treatments (e.g., medical, educational, marketing, news)
• Inference/prediction with a very weak signal-to-noise ratio (e.g., climate change)
• Understanding deeply connected (spatial) networks and (temporal) dynamics
What does Big Data mean for you?
We see you and others more clearly.
Gift: Treatment for you based only on data from people like you.
Curse: No one is perfectly like you.
Personalized Treatment: Sounds heavenly, but where on Earth did they find the right guinea pig for me?
Liu and Meng (2014) A Fruitful Resolution to Simpson’s Paradox via Multi-Resolution Inference. The American Statistician, 68, 17-29.
A Painful Problem
Kidney Stone Treatment
C. R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (March 1986). Br Med J (Clin Res Ed) 292 (6524): 879-882.

                Treatment A      Treatment B
Overall         78% (273/350)    83% (289/350)
Small stones    93% (81/87)      87% (234/270)
Large stones    73% (192/263)    69% (55/80)

A: Open Surgery; B: Percutaneous Nephrolithotomy
[Figure: stacked bars of successful vs. unsuccessful outcomes for Treatments A and B, split by stone size; same rates as the table above]
Uneven distribution of stone sizes across treatments makes overall success rate misleading.
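A minimal numerical check of the reversal, using the Charig et al. counts above (a Python sketch; the variable names are mine):

```python
# (successes, total) from Charig et al. (1986)
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

for treatment, strata in counts.items():
    wins = sum(s for s, _ in strata.values())
    total = sum(n for _, n in strata.values())
    by_size = ", ".join(f"{size} {s}/{n} = {s/n:.0%}"
                        for size, (s, n) in strata.items())
    print(f"Treatment {treatment}: {by_size}; overall {wins}/{total} = {wins/total:.0%}")

# A beats B within each stratum (93% > 87%, 73% > 69%), yet B wins
# overall (83% > 78%): A was assigned most of the hard large-stone cases.
```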
Simpson’s Paradox
• Dealing with the disparities between
aggregated analysis and disaggregated
analyses
• Determining the right level (primary
resolution) for analysis
• Understanding the bias-variance (relevance-
robustness) trade-off
So what would be the right resolution?
Let’s take a Car Talk challenge (7/11/2015)
From Car Talk: “You tested positive for D by a test with 95% accuracy. What’s the chance you actually have D, given that the prevalence of D is 0.1%?”
1. 1-5%
2. 5-10%
3. 10-25%
4. 25-50%
5. 50-75%
6. 75-95%
7. Could be anything
8. I have no idea.
It could be anything … depending on the meaning of “accuracy” and …
• Need to know how accurate the test is among
those with no disease (specificity) AND among
those with the disease (sensitivity)
• The probability could be 1 if specificity = 100%
• For rare disease, overall accuracy ~ specificity
• Then the answer is less than 2%, if this was a
random screening test
Two scenarios with the same test (sensitivity 95%, specificity 95%):

1,000 people with symptoms (prevalence 10%: 100 D, 900 no D):
• Of the 100 with D: 95 test positive, 5 negative (95% / 5%)
• Of the 900 without D: 45 test positive, 855 negative (5% / 95%)
• P(D | positive) = 95/(95 + 45) = 67.9%

100,000 people in random screening (prevalence 0.1%: 100 D, 99,900 no D):
• Of the 100 with D: 95 test positive, 5 negative (95% / 5%)
• Of the 99,900 without D: 4,995 test positive, 94,905 negative (5% / 95%)
• P(D | positive) = 95/(95 + 4,995) = 1.87%
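The same two trees in a few lines of Python (a sketch; the function name is mine):

```python
def p_disease_given_positive(prevalence, sensitivity=0.95, specificity=0.95):
    """Bayes' rule: P(D | test positive) for the given error rates."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(f"Symptomatic, prevalence 10%:  {p_disease_given_positive(0.10):.1%}")   # 67.9%
print(f"Random screening, prev 0.1%:  {p_disease_given_positive(0.001):.2%}")  # 1.87%
```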
Conditioning is the Soul of Statistics
--- Joe Blitzstein
Bayes Theorem
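Spelled out, the rule behind both trees (the standard statement, with D for disease and + for a positive test):

```latex
P(D \mid +) \;=\; \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid D^{c})\,P(D^{c})}
\;=\; \frac{\text{sensitivity}\times\text{prevalence}}
           {\text{sensitivity}\times\text{prevalence} + (1-\text{specificity})(1-\text{prevalence})}
```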
“When the facts change, I change my opinion. What do you do, sir?”
~ John Maynard Keynes
Useful Statistical Principles/Concepts for Data Science
Data Selection and Replication Mechanisms:
Randomization, sampling, experiments, observational studies, missing
data mechanisms; latent variables/constructs; potential outcomes;
confidentiality protections
Conditioning vs. Marginalizing:
Disaggregation vs. aggregation, sub-population analysis,
individualized inference, Simpson’s paradox, ecological fallacy
Bias-Variance Trade-off:
Efficiency vs. Robustness, Relevance vs. Robustness; model
predictability vs. fitness
Inference principles/perspectives:
Likelihood principle; Bayesian thinking; fiducial argument for
objectivity; uncertainty quantification
…….
A Traditional Statistical Theme/Aim:
Seeking representative samples to infer about populations
A Big-Data Statistical Theme/Aim:
Constructing approximating populations to infer about individuals
[Diagram: the Targeted Individual within an Approximating Population]
One more V for Big Data: Veracity
I find your presentation …
1. Inspiring and thought-provoking
2. Informative and I learned a few things
3. Confusing and not very helpful
4. What a waste of my time!