Transcript of: Statistical Paradises and Paradoxes in Big Data (ww2.amstat.org/misc/XiaoLiMengBDSSG.pdf)
Statistical Paradises and Paradoxes
in Big Data
Xiao-Li Meng
Department of Statistics,
Harvard University
Thanks to many students and colleagues
Paradises
• Much larger general pipeline:
• Much better airplane conversations
• Golden era for methodological research
• Emerging theoretical foundations
[Chart: Statistics Concentration (Major) size at Harvard College]
Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math
• Rigorous theory of the trade-off between
statistical and computational efficiency,
under confidentiality, etc., based on
classical statistical decision theory.
• Wide-ranging statistical machine learning
theory, methodology, algorithms, using
empirical process, signal processing &
information theory (e.g., MDL principle).
• Automated Targeted Learning and Super
Learning built upon well-established semi-
parametric and nonparametric theory.
• Algebraic statistics, e.g., studying
statistical hypothesis testing via algebraic
geometry and computational and
combinatorial techniques.
• ……
BFF group: Integrating Bayes, Frequentist, and Fiducial perspectives
• Fusion learning via confidence distributions (CD)
• Combining results from multiple analyses under
possibly different perspectives
Jianqing Fan’s Group (Princeton):
Bringing statistical theory and methods to the forefront of Big Data
Fan et al. (2014) Challenges of Big Data Analysis
National Science Review (China) 1: 293-314
Salient features of Big Data
• Heterogeneity (Individuality)
• Noise accumulation
• Spurious correlation
• Incidental endogeneity
Great Promises and Grand Challenges
• Multi-Resolution Inference
• Multi-Phase Inference
• Multi-Source Inference
o Meng (2014) A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (if you help fund it). COPSS 50th Anniversary Volume.
o Blocker and Meng (2013) The Potential and Perils of Preprocessing: Building New Foundations. Bernoulli, 19, 1176-1211.
o Xie and Meng (2016) Dissecting Multiple Imputation from a Multiphase Inference Perspective: What Happens When God’s, Imputer’s and Analyst’s Models are Uncongenial? (With discussion). Statistica Sinica, to appear.
OnTheMap Project of US Census Bureau
• Developed by LED (Local Employment Dynamics).
• Users zoom into any region of the US for paired employee-employer information.
• Used diverse data sources:
surveys and administrative
datasets with confidential
information.
Thanks to Jeremy Wu of the Census Bureau.
Multi-Resolution
Multi-Phase
• To protect confidentiality, the displayed data are synthetic: draws from a posterior.
• Each data source itself has gone through multiple “clean-up” processes, most of which are gray boxes or even black boxes.
Multi-Source
• Built from more than 20 data sources in the LEHD
(Longitudinal Employer-Household Dynamics) system.
• Survey Samples: Monthly survey of 60,000 households
covering only 0.05% of households.
• Administrative Records: Unemployment insurance wage
records covering more than 90% of the US workforce;
Never intended for inference purposes.
• Census Data: Quarterly census of earnings and wages
covering 98% of US jobs.
A Trio of NP-Hard Inference Problems
• Multi-Resolution: How do we infer estimands with resolution far
exceeding any possible estimators? Is it possible for such inference to
be qualitatively robust even if it cannot be quantitatively robust?
• Multi-Phase: (Big) Data are almost never collected, preprocessed,
and analyzed in a single phase. What theory and methods
accommodate this multi-phase setup?
• Multi-Source: Which one is better: a survey sample covering 1% or
an administrative record covering 95% of the population? How
should we combine information from these sources? Is it worth
combining?
So which one is better for estimating the population mean: a 1% simple random sample (SRS) or a 95% administrative (observational) dataset (AD)?
1. 1% SRS
2. 95% AD
3. It depends!
4. Is this a trick question?
A fundamental principle of statistics: the Variance-Bias Tradeoff
Total Error = Variance + Bias²
• Probabilistic SRS: [(1 − f_s)/n] S² + 0
• Large non-probability data: ≈ 0 + r² [(1 − f_a)/f_a] S²
• f is the data's fraction of the population: f = n/N
• r is the correlation between the (honest) responded/recorded value X and the probability of response/recording, P(X)
• “Big Data Paradox” – the larger the data, the more pronounced the bias
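A minimal simulation sketch of this trade-off (everything here, from the population parameters to the recording model, is an illustrative assumption, not from the talk): a tiny SRS versus a huge sample whose probability of being recorded is slightly correlated with the value being measured.

```python
import numpy as np

rng = np.random.default_rng(2015)

# Hypothetical population (size, mean, and spread are illustrative)
N = 1_000_000
X = rng.normal(loc=50.0, scale=10.0, size=N)
true_mean = X.mean()

# 1% simple random sample: zero bias, variance ~ (1 - f_s) S^2 / n
srs = rng.choice(X, size=N // 100, replace=False)

# ~95% non-probability sample: recording probability rises slightly
# with X, creating a small correlation r between X and being recorded
p_record = np.clip(0.95 + 0.002 * (X - true_mean), 0.0, 1.0)
ad = X[rng.random(N) < p_record]

print(f"true mean : {true_mean:.3f}")
print(f"1% SRS    : {srs.mean():.3f} (error {srs.mean() - true_mean:+.3f})")
print(f"~95% AD   : {ad.mean():.3f} (error {ad.mean() - true_mean:+.3f}, f = {len(ad)/N:.2f})")
```

With these assumed numbers the SRS error is typically within about ±0.2 (two standard errors of S/√n = 0.1), while the near-census sample carries a systematic bias of roughly +0.2 that more data cannot wash out: the Big Data Paradox above.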
For estimating a population mean with r = 0.1, how large does an AD need to be, as a percentage of the US population, to produce a more accurate sample average than an SRS with n = 100? (A worked calculation follows the options below.)
1. <0.5% (1.6M)
2. 5% (16M)
3. 10% (32M)
4. 20% (64M)
5. 50% (160M)
6. 75% (240M)
7. 90% (288M)
8. >95% (303M)
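A back-of-the-envelope calculation, using the MSE comparison above with f_s negligible (a sketch, not from the slides):

```latex
r^2\,\frac{1-f}{f}\,S^2 \;<\; \frac{1-f_s}{n}\,S^2 \;\approx\; \frac{S^2}{n}
\;\;\Longrightarrow\;\; \frac{1-f}{f} \;<\; \frac{1}{n r^2}
\;\;\Longrightarrow\;\; f \;>\; \frac{n r^2}{1 + n r^2}.
```

With n = 100 and r = 0.1, n r² = 1, so f must exceed 1/2: the AD needs to cover more than 50% of the population (about 160M people) just to match a 100-person SRS.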
Big Data: Big Size or Big Fraction?
• Size matters, but only after having quality
• Importance of combining non-probabilistic samples
with probabilistic ones, however small the latter are.
• More does NOT guarantee better:
• “I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb?” (Meng and Xie, 2014, Econometric Reviews, 33, 218-250)
So when/why do we need Big Data?
• Individualized treatments (e.g., medical, educational, marketing, news)
• Inference/prediction with a very weak signal-to-noise ratio (e.g., climate change)
• Understanding deeply connected (spatial) networks and (temporal) dynamics
What does Big Data mean for you?
We see you and others more clearly.
Gift: Treatment for you based only on data from people like you.
Curse: No one is perfectly like you.
Personalized Treatment: Sounds heavenly, but where on Earth did they find the right guinea pig for me?
Liu and Meng (2014) A Fruitful Resolution to Simpson’s Paradox via Multi-Resolution Inference. The American Statistician, 68, 17-29.
A Painful Problem
Kidney Stone Treatment
C. R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (March 1986). Br Med J (Clin Res Ed) 292 (6524): 879-882.

                Treatment A      Treatment B
Overall         78% (273/350)    83% (289/350)
Small stones    93% (81/87)      87% (234/270)
Large stones    73% (192/263)    69% (55/80)

A: Open Surgery; B: Percutaneous Nephrolithotomy
[Figure: stacked bars of successful vs. unsuccessful outcomes for Treatments A and B, split by stone size; same rates as the table above]
Uneven distribution of stone sizes across treatments makes overall success rate misleading.
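A minimal numerical check of the reversal, using the Charig et al. counts above (a Python sketch; the variable names are mine):

```python
# (successes, total) from Charig et al. (1986)
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

for treatment, strata in counts.items():
    wins = sum(s for s, _ in strata.values())
    total = sum(n for _, n in strata.values())
    by_size = ", ".join(f"{size} {s}/{n} = {s/n:.0%}"
                        for size, (s, n) in strata.items())
    print(f"Treatment {treatment}: {by_size}; overall {wins}/{total} = {wins/total:.0%}")

# A beats B within each stratum (93% > 87%, 73% > 69%), yet B wins
# overall (83% > 78%): A was assigned most of the hard large-stone cases.
```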
Simpson’s Paradox
• Dealing with the disparities between
aggregated analysis and disaggregated
analyses
• Determining the right level (primary
resolution) for analysis
• Understanding the bias-variance (relevance-
robustness) trade-off
So what would be the right resolution?
Let’s take a Car Talk challenge (7/11/2015)
From Car Talk: “You tested positive for D by a test with 95% accuracy. What’s the chance you actually have D, given that the prevalence of D is 0.1%?”
1. 1-5%
2. 5-10%
3. 10-25%
4. 25-50%
5. 50-75%
6. 75-95%
7. Could be anything
8. I have no idea.
It could be anything … depending on the meaning of “accuracy” and …
• Need to know how accurate the test is among
those with no disease (specificity) AND among
those with the disease (sensitivity)
• The probability could be 1 if specificity = 100%
• For rare disease, overall accuracy ~ specificity
• Then the answer is less than 2%, if this was a
random screening test
Two scenarios with the same test (sensitivity 95%, specificity 95%):

1,000 people with symptoms (prevalence 10%: 100 D, 900 no D):
• Of the 100 with D: 95 test positive, 5 negative (95% / 5%)
• Of the 900 without D: 45 test positive, 855 negative (5% / 95%)
• P(D | positive) = 95/(95 + 45) = 67.9%

100,000 people in random screening (prevalence 0.1%: 100 D, 99,900 no D):
• Of the 100 with D: 95 test positive, 5 negative (95% / 5%)
• Of the 99,900 without D: 4,995 test positive, 94,905 negative (5% / 95%)
• P(D | positive) = 95/(95 + 4,995) = 1.87%
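The same two trees in a few lines of Python (a sketch; the function name is mine):

```python
def p_disease_given_positive(prevalence, sensitivity=0.95, specificity=0.95):
    """Bayes' rule: P(D | test positive) for the given error rates."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(f"Symptomatic, prevalence 10%:  {p_disease_given_positive(0.10):.1%}")   # 67.9%
print(f"Random screening, prev 0.1%:  {p_disease_given_positive(0.001):.2%}")  # 1.87%
```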
Conditioning is the Soul of Statistics
--- Joe Blitzstein
Bayes Theorem
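Spelled out, the rule behind both trees (the standard statement, with D for disease and + for a positive test):

```latex
P(D \mid +) \;=\; \frac{P(+ \mid D)\,P(D)}{P(+ \mid D)\,P(D) + P(+ \mid D^{c})\,P(D^{c})}
\;=\; \frac{\text{sensitivity}\times\text{prevalence}}
           {\text{sensitivity}\times\text{prevalence} + (1-\text{specificity})(1-\text{prevalence})}
```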
“When the facts change, I change my opinion. What do you do, sir?”
~ John Maynard Keynes
Useful Statistical Principles/Concepts for Data Science
Data Selection and Replication Mechanisms:
Randomization, sampling, experiments, observational studies, missing
data mechanisms; latent variables/constructs; potential outcomes;
confidentiality protections
Conditioning vs. Marginalizing:
Disaggregation vs. aggregation, sub-population analysis,
individualized inference, Simpson’s paradox, ecological fallacy
Bias-Variance Trade-off:
Efficiency vs. Robustness, Relevance vs. Robustness; model
predictability vs. fitness
Inference principles/perspectives:
Likelihood principle; Bayesian thinking; fiducial argument for
objectivity; uncertainty quantification
…….
A Traditional Statistical Theme/Aim:
Seeking representative samples to infer about populations
A Big-Data Statistical Theme/Aim:
Constructing approximating populations to infer about individuals
[Diagram: the Targeted Individual within an Approximating Population]
One more V for Big Data: Veracity
I find your presentation …
1. Inspiring and thought-provoking
2. Informative and I learned a few things
3. Confusing and not very helpful
4. What a waste of my time!