Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas...
Transcript of Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas...
![Page 1: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/1.jpg)
Arun Kumar KuchibhotlaUniversity of Pennsylvania
https://arun-kuchibhotla.github.io/
![Page 2: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/2.jpg)
Robust Statistics
Single Index Models
Post-selectionInference
Nonparametric Regression
MisspecificationAnalysis
2
Concentration Inequalities and
High-dimensional CLT
Overview of my research interests
![Page 3: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/3.jpg)
Valid Post-selection InferenceWhy and How
Arun Kumar KuchibhotlaUniversity of Pennsylvania
Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George
![Page 4: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/4.jpg)
Roadmap
Data Snooping: Effects and Examples
Formulation of the Problem
Solution for Covariate Selection
Example and Conclusions
4
![Page 5: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/5.jpg)
RoadmapData Snooping: Effects and Examples
● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.
Formulation of the Problem● The Problem & literature review for covariate selection.
Solution for Covariate Selection● Key contributions● Simulations & main components of the theory.
Example and Conclusions
● Real data example, Extensions & Summary
5
![Page 6: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/6.jpg)
RoadmapData Snooping: Effects and Examples
● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.
Formulation of the Problem
Solution for Covariate Selection
Example and Conclusions
6
![Page 7: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/7.jpg)
500 observations
most correlated with 95% CI for slope
Effects of Data Snooping: Stepwise Selection
![Page 8: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/8.jpg)
Effects of Data Snooping: Stepwise Selection
500 observations
most correlated with 95% CI for slope
![Page 9: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/9.jpg)
Effects of Data Snooping: Stepwise Selection
500 observations
most correlated with 95% CI for slope
![Page 10: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/10.jpg)
Some Notes
❖ Unadjusted inference after data snooping can be (very) misleading.
❖ Data snooping contributes to the
➢ Inability to replicate conclusions in future studies.➢ 95% CIs should imply correct conclusions in 95% of
studies.
❖ More concerningly, common practice of data snooping is more informal and imprecise than the example shown.
10
Replicability Crisis
![Page 11: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/11.jpg)
Case Study 1: Covariate Selection
11
BritishMedicalJournal
2005
![Page 12: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/12.jpg)
12
Case Study 1: Covariate SelectionBritishMedicalJournal
2005
![Page 13: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/13.jpg)
13
Case Study 2: Covariate Selection
![Page 14: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/14.jpg)
Case Study 2: Covariate SelectionScientific Reports, 2019
“The MLR models were tested by sequential introduction of predictors and interaction terms.
ultimately, from the three best models with similar adjusted R squared values the simplest one was chosen.”
Scientific Reports, 2019
![Page 15: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/15.jpg)
★H & R (1978) Hedonic housing prices and the demand for clean air.
Case Study 3: Transformations
Harrison and Rubinfeld (1978)★ write
![Page 16: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/16.jpg)
RoadmapData Snooping: Effects and Examples
● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.
Formulation of the Problem● The Problem & literature review for covariate selection.
Solution for Covariate Selection
Example and Conclusions
16
![Page 17: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/17.jpg)
500 observations
most correlated with 95% CI for slope
Effects of Data Snooping: Stepwise Selection
![Page 18: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/18.jpg)
There are covariates and for each
Want a valid CI for the parameter/target :
irrespective of how is chosen based on data.
18
Post-selection Inference: Problem 1
![Page 19: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/19.jpg)
For each
Want a valid CI for a coordinate of :
irrespective of how with size and are chosen based on data.
19
Post-selection Inference: Problem 2
![Page 20: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/20.jpg)
Allows arbitrary exploration in
❌ Cannot revise the model after using
❌ Invalidity when using multiple splits.
❌ Applicable only for independent data.
Variable selection,Transformations etc.
Finally, use once for inference.
20
Solution 0: Sample Splitting
✔
Rinaldo et al. (2019) Bootstrapping and sample splitting for high-dimensional, assumption-lean inference, Annals of Statistics.
![Page 21: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/21.jpg)
For each
Want a valid CI for a coordinate of :
irrespective of how with size and are chosen based on data.
21
Recap: Post-selection Inference
![Page 22: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/22.jpg)
Buehler and Feddersen (1963); Olshen (1973); Sen (1979);
Rencher and Pun (1980); Freedman (1983); Sen and Saleh
(1987); Dijkstra and Veldkamp (1988); Hurvich and Tsai (1990);
Potscher (1991); Pfeiffer, Redd and Carroll (2017).
Cox (1965); Kabaila (1998); Hjort and Claeskens (2003);
Claeskens and Carroll (2007); Berk et al. (2013); Lee, Sun, Sun
and Taylor (2016); Tibshirani, Taylor, Lockhart and Tibshirani
(2016); Bachoc, Preinerstorfer and Steinberger (2019); Rinaldo,
Wasserman, G’Sell and Lei (2019).
Literature Review
22
![Page 23: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/23.jpg)
RoadmapData Snooping: Effects and Examples
● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.
Formulation of the Problem● The Problem & literature review for covariate selection.
Solution for Covariate Selection● Key contributions● Simulations & main components of the theory.
Example and Conclusions
23
![Page 24: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/24.jpg)
Simultaneous Inference FWER Control
Post-selectionInference
➢ Simultaneity implies valid CIs for arbitrary selection
➢ Simultaneity implies infinite revisions of a selection.
➢ Simultaneity also guarantees validity if multiple
models are reported. 24
A Guiding Principle
![Page 25: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/25.jpg)
➢ Choose any model and get a confidence region.➢ The choice can be revised as many times as desired.➢ Simultaneity also guarantees validity if multiple
models are reported.
Theorem: Simultaneous inference is necessary for valid Post-selection inference.
K et al. (2019) Valid Post-selection Inference in Model-free Linear Regression. Annals of Statistics (Forthcoming).
25
Post-selectionInference
A Key Result
Simultaneous Inference FWER Control
![Page 26: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/26.jpg)
The classical interval for ist-score
Solution 1: Uniform Adjustment
For simultaneity, inflate the confidence regions:
quantile of
![Page 27: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/27.jpg)
27
➢ in an assumption-lean setting,➢ for independent and weakly dependent obs.,➢ for and maximal model size ,➢ for fixed or random covariates,
can be estimated using bootstrap.
In the worst case,
We show
Our Contributions
Berk et al. (2013): Homoscedastic Gaussian response, fixed X.Bachoc et al (2019): assumption-lean but fixed X and fixed p.
![Page 28: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/28.jpg)
Simulation Examples: Revisiting Stepwise Selection
![Page 29: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/29.jpg)
500 observations
most correlated with 95% CI for slope
Effects of Data Snooping: Stepwise Selection
![Page 30: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/30.jpg)
Effects of Data Snooping: Stepwise Selection
500 observations
most correlated with 95% CI for slope
![Page 31: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/31.jpg)
Effects of Data Snooping: Stepwise Selection
500 observations
most correlated with 95% CI for slope
![Page 32: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/32.jpg)
Effects of Data Snooping: Stepwise Selection
500 observations
most correlated with 95% CI for slope
![Page 33: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/33.jpg)
Key Steps in the Proof
![Page 34: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/34.jpg)
For independent, sub-Gaussian data
K et al. (2018) A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference. arXiv:1802.05801
➢ Doesn’t require any parametric model assumptions.
➢ A finite sample result. Allows for diverging
➢ Extends beyond independent & sub-Gaussian data.
34
Uniform Linear Representation
![Page 35: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/35.jpg)
implies
Close in Probability by triangle ineq.
35
Three Line Proof
![Page 36: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/36.jpg)
Close inDistribution by HDCLT
K et al. (2018) High-dimensional CLT: Improvements, Non-uniform Extensions and Large Deviations. arXiv:1806.06153
Close in Probability by triangle ineq.
36
Three Line Proof
implies
![Page 37: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/37.jpg)
Close in Probability by triangle ineq.
Close inDistribution by HDCLT
Close inDistributionby triangle ineq.
37
Three Line Proof
Justifies bootstrap!
implies
![Page 38: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/38.jpg)
RoadmapData Snooping: Effects and Examples
● Effects illustrated with stepwise selection.● Data snooping in textbooks and practice.
Formulation of the Problem● The Problem & literature review for covariate selection.
Solution for Covariate Selection● Key contributions● Simulations & main components of the theory.
Example and Conclusions● Real Data Example, Extensions & Summary.
38
![Page 39: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/39.jpg)
Telomere Length Analysis
39
![Page 40: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/40.jpg)
Case Study 2: Covariate SelectionScientific Reports, 2019
“The MLR models were tested by sequential introduction of predictors and interaction terms.
ultimately, from the three best models with similar adjusted R squared values the simplest one was chosen.”
40
Scientific Reports, 2019
![Page 41: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/41.jpg)
TL inheritance patterns based on 246 families.
41
Telomere Length Analysis
❖ Dependent Variable: MTL (Mean telomere length)
❖ Child Variables:
➢ Sex ➢ Age
❖ Parental Variables:
➢ mMTL (mother MTL) ➢ fMTL (father MTL)➢ MAC (mother’s age at conception)➢ PAC (father’s age at conception)
(Additionally, 15 interaction variables were considered.)
![Page 42: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/42.jpg)
42
Adjusted Inference: Telomere Length Analysis
Covariate Unadjusted Adjusted
AGE ✔ ❌
mMTL ✔ ✔
fMTL ✔ ✔
MAC ✔ ❌
PAC ❌ ❌
✔ : Significant at 5% level❌ : Insignificant at 5% level
![Page 43: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/43.jpg)
Summary and Conclusions❖ Data snooping contributes to replicability crisis.❖ Inference is possible after data snooping.
❖ The framework allows for➢ Misspecified models; Random covariates;➢ Dependent data; High-dimensional features;➢ Variable Transformations;➢ M-estimators: logistic/Poisson/Quantile/Cox.
43
Classical Framework New FrameworkFix the test & model Fix a universe
Collect the data Collect the data
![Page 44: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/44.jpg)
Thank you for your attention44
Kuchibhotla A., Brown L., Buja A., Cai J., George E., Zhao L. (2019) Valid Post-selection Inference in Model-free Linear Regression. Annals of Statistics (Forthcoming).
Kuchibhotla A., Brown L., Buja A., George E., Zhao L. (2018) A Model Free Perspective for Linear Regression: Uniform-in-model Bounds for Post Selection Inference. arXiv:1802.05801
Kuchibhotla A. (2018) Deterministic Inequalities for Smooth M-estimators. arXiv:1809.05172
Kuchibhotla A., Mukherjee S., and Banerjee D. (2018) High-dimensional CLT: Improvements, Non-uniform Extensions and Large Deviations. arXiv:1806.06153
Kuchibhotla A., Brown L., Buja A., Cai J. (2019) All of Linear Regression. arXiv:1910.06386
References
![Page 45: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/45.jpg)
For each
Want a valid CI for a coordinate of
irrespective of how is chosen based on data.
45
Post-selection for Transformations
![Page 46: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/46.jpg)
Inflate the classical intervals,
with being the quantile of
Bootstrap applies with validity for “nice” function classes , for example, Box-Cox family.
46
Solution for Transformations
![Page 47: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/47.jpg)
Transformations: Boston Housing DataThis dataset has 506 census tracts with 13 features.
❖ Dependent Variable: MV, Median value of house.
❖ Covariate of Interest: NOX, Nitrogen Oxide Conc.
❖ Structural Variables: RM (No. of rms), AGE (% of homes bfr 1940).
❖ Neighborhood Variables: CRIM (Crime rate), ZN (% of res. land
zoned for lots > 25K ft2), INDUS (% non-retail business acres per twn),
RIVER (Charles river dummy), TAX (Property tax rate),
PTRATIO (Pupil-teacher ratio), B (Racial diversity),
LSTAT (% of lwr socio-econ. status of population).
❖ Accessibility Variables: DIS (Distance to Employment Ctr.),
RAD (Distance to Radial Highway). 47
![Page 48: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/48.jpg)
Harrison and Rubinfeld (1978)★ write
★H & R (1978) Hedonic housing prices and the demand for clean air. 48
Transformations: Boston Housing Data
![Page 49: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/49.jpg)
49
Covariate Unadjusted Adjusted Covariate Unadjusted Adjusted
NOX2 ✔ ✔ TAX ✔ ✔
RM ✔ ✔ PTRATIO ✔ ✔
AGE ❌ ❌ B ✔ ❌
CRIM ✔ ✔ LSTAT ✔ ✔
ZN ❌ ❌ DIS ✔ ✔
INDUS ❌ ❌ RAD ✔ ✔
RIVER ✔ ❌
Adjusted Inference: Boston Housing
✔ : Significant at 5% level❌ : Insignificant at 5% level
![Page 50: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/50.jpg)
50
Implications of PoSI for Applications
Similar to Boston housing data and TL data,
How do conclusions in applied data analysis change when exploration is
accounted for?
Joint work with Junhui Cai and Linda Zhao.
![Page 51: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/51.jpg)
51
High-dimensional CLT
Anderson, Hall and Titterington (1998, JSPI):
Joint work with Somabha Mukherjee.
Let be the class of all rectangles in
![Page 52: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/52.jpg)
52
Maximal Inequalities
➢ High-dimensional rates are sensitive to tails.
➢ What is the order of
subject to
Joint work with Somabha Mukherjee and Sagnik Nandy.
![Page 53: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/53.jpg)
53
Multiple Testing under Dependence
➢ Multiple testing often requires independence.
➢ FWER is possible under arbitrary dependence.
➢ What about FDR control?
➢ How does BH procedure behave?
![Page 54: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/54.jpg)
Arun KumarKuchibhotla
University of Pennsylvania
Webpage: https://arun-kuchibhotla.github.io/Email: [email protected]
![Page 55: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/55.jpg)
55
Case Study 3: Model Building
2008
➢ 76 Oregon homes and 12 features.
➢ Try a linear model with 12 predictors “as is”.
➢ Residuals imply non-linearity.
➢ Bath and Bed also have high p-values, so add an
interaction to the model.
➢ Price is skewed suggesting a log-transformation.
Journal of Statistics Education
![Page 56: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/56.jpg)
Case Study 3: Transformations
★Stine and Foster, Statistics for Business: Decision Making and Analysis.
In the context of curve fitting to bivariate data, Stine and Foster (2014)★ on page 515, write
56
“Picking a transformation requires practice, and you may need to try several to find one that is interpretable and captures the pattern in the data.”
![Page 57: Arun Kumar Kuchibhotla · Arun Kumar Kuchibhotla University of Pennsylvania Larry Brown Andreas Buja Junhui Cai Linda Zhao Ed George. Roadmap Data Snooping: Effects and Examples Formulation](https://reader030.fdocuments.in/reader030/viewer/2022040205/5f203633185557093920bb11/html5/thumbnails/57.jpg)
➢ The solution applies to most other estimation problems.
➢ Examples include logistic/Poisson regression, quantile regression and Cox regression.
➢ Solution also applies to transformation of variables.
K (2018) Deterministic Inequalities for Smooth M-estimators, arXiv:1809.0517257
Extensions