Data science as a science
![Page 1: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/1.jpg)
![Page 2: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/2.jpg)
Evidence based data analysis @jtleek
![Page 3: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/3.jpg)
Data science as a Science (DSaaS) @jtleek
![Page 4: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/4.jpg)
![Page 5: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/5.jpg)
![Page 6: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/6.jpg)
![Page 7: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/7.jpg)
![Page 8: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/8.jpg)
![Page 9: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/9.jpg)
“Data science is as much art as it is science.”
![Page 10: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/10.jpg)
![Page 11: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/11.jpg)
“Wouldn’t it be amazing if we got 2,000 people to learn statistics!”
- Jeff Leek, 7/17/12
![Page 12: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/12.jpg)
date: 7/19/12
from: [email protected]
Roger let me know you gave him a ballpark figure for the number of students registered for his course "Computing for Data Analysis”. Could you give me an idea of how many have registered for my course "Data Analysis?”
![Page 13: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/13.jpg)
date: 7/19/12
from: [email protected]
Hi Jeff,
7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)
![Page 15: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/15.jpg)
9 classes
1 month long
Always open
![Page 16: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/16.jpg)
Data Science Specialization
Total Enrollments: 3,815,890
Total Completions: 409,712

Genomic Data Science Specialization
Total Enrollments: 173,495
Total Completions: 10,826

Executive Data Science Specialization
Total Enrollments: 62,076
Total Completions: 10,957
![Page 17: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/17.jpg)
![Page 18: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/18.jpg)
2 A theoretical model
Data
![Page 19: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/19.jpg)
1 A theoretical model
Data
![Page 20: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/20.jpg)
Y = some outcome
X = some covariate
D = (X,Y)
lm(Y ~ X)
![Page 21: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/21.jpg)
Y = some outcome
X = some covariate
D = (X,Y)
lm(Y ~ X)
Leek and Peng, Nature 2015
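As a hedged illustration of the setup on this slide, the sketch below simulates an outcome and a covariate and fits the same `lm(Y ~ X)` call. The sample size, true coefficients, and noise level are assumptions chosen for illustration; they are not from the talk.

```r
# Simulated data: n, the intercept (1), the slope (2), and the noise
# are illustrative assumptions, not values from the talk.
set.seed(1)
n <- 100
X <- rnorm(n)                  # some covariate
Y <- 1 + 2 * X + rnorm(n)      # some outcome
D <- data.frame(X = X, Y = Y)  # D = (X, Y)

fit <- lm(Y ~ X, data = D)     # the model from the slide
coef(fit)                      # estimated intercept and slope
```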
![Page 22: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/22.jpg)
$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$
Fithian, Sun and Taylor arXiv 2015
![Page 23: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/23.jpg)
σ-algebra: “what we know”
$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$
![Page 24: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/24.jpg)
“we’ve done nothing”
$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$
![Page 25: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/25.jpg)
“we did model selection”
$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$
![Page 26: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/26.jpg)
“we looked at all the data”
$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$
![Page 27: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/27.jpg)
$E[\beta \mid \mathcal{F}_0] \neq E[\beta \mid \mathcal{F}_0(S)]$
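One way to see why conditioning on a selection step matters is a small simulation. The sketch below uses assumed values for the true slope, the sample size, and a p < 0.05 selection rule, none of which come from the talk. It compares the average slope estimate over all simulated studies with the average over only the studies that pass selection. It works with estimates of β rather than the formal conditional expectations on the slide.

```r
set.seed(42)
n_sim <- 2000  # number of simulated studies (assumption)
n     <- 50    # sample size per study (assumption)
beta  <- 0.2   # true slope (assumption)

# Simulate one study and return its slope estimate and p-value.
sim_one <- function() {
  x <- rnorm(n)
  y <- beta * x + rnorm(n)
  s <- summary(lm(y ~ x))$coefficients
  c(est = s["x", "Estimate"], p = s["x", "Pr(>|t|)"])
}
res <- t(replicate(n_sim, sim_one()))  # one row per study

mean_all      <- mean(res[, "est"])                      # no selection
mean_selected <- mean(res[res[, "p"] < 0.05, "est"])     # keep only "significant" fits
c(all = mean_all, selected = mean_selected)
```

Under these assumptions the unselected average sits near the true slope, while the average after selection is pulled away from it.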
![Page 28: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/28.jpg)
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
Patil, Peng and Leek, bioRxiv 2016
![Page 29: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/29.jpg)
Population
Question
Hypothesis
Experimental Design
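Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
Patil, Peng and Leek, bioRxiv 2016
$\mathcal{F}_0 \subseteq \mathcal{F}_0(1_{P,Q}(H)) \subseteq \mathcal{F}_0(1_{ED}(E)) \subseteq \mathcal{F}_0(1_{ED;E}(D)) \subseteq \mathcal{F}_0(1_{AP;A}(C)) \subseteq \mathcal{F}_0(1_{C}(A^*))$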
![Page 30: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/30.jpg)
Population
Question
Hypothesis
Experimental Design
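Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
Patil, Peng and Leek, bioRxiv 2016
$\mathcal{F}_0 \subseteq \mathcal{F}_0(1_{P,Q}(H)) \subseteq \mathcal{F}_0(1_{ED}(E)) \subseteq \mathcal{F}_0(1_{ED;E}(D)) \subseteq \mathcal{F}_0(1_{AP;A}(C)) \subseteq \mathcal{F}_0(1_{C}(A^*))$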
![Page 31: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/31.jpg)
2 A theoretical model
Data
![Page 32: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/32.jpg)
Slide courtesy Hadley Wickham
![Page 33: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/33.jpg)
Who? What? When? Why? Where? How?
Slide courtesy Hadley Wickham
![Page 34: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/34.jpg)
Who? What? When? Why? Where? How?
Where Ingo is working
![Page 35: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/35.jpg)
Who? What? When? Why? Where? How?
Slide courtesy Hadley Wickham
Base R
Lasso
dplyr
googlesheets
ppt
![Page 36: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/36.jpg)
Who? What? When? Why? Where? How?
Slide courtesy Hadley Wickham
Bad life choices?
Sparsity!
David Robinson told me
Spreadsheets
Hegemony
![Page 37: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/37.jpg)
Cleveland and McGill JASA 1984
![Page 38: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/38.jpg)
![Page 39: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/39.jpg)
Leek & Peng 2015 PNAS
![Page 40: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/40.jpg)
Experiment 1
![Page 41: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/41.jpg)
Leek and Peng, Science 2015
![Page 42: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/42.jpg)
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
$E[S \mid \mathcal{F}_0(1_{c}(W))]$
![Page 43: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/43.jpg)
We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population.
![Page 44: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/44.jpg)
79% vs 17%
Inferential vs Causal
n=47,141
![Page 45: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/45.jpg)
We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain that we think the reason for this relationship is that cigarette smoke contains known carcinogens such as benzene, which make cells in lungs become cancerous.
![Page 46: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/46.jpg)
65% vs 32%
Inferential vs Causal
n=47,141
![Page 47: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/47.jpg)
Experiment 2
![Page 48: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/48.jpg)
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
$E[\mathrm{Est} \mid \mathcal{F}_0(1_{c}(A))]$
![Page 49: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/49.jpg)
![Page 50: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/50.jpg)
![Page 51: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/51.jpg)
69% vs 40%
n=1,985
![Page 52: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/52.jpg)
Experiment 3
![Page 53: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/53.jpg)
![Page 54: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/54.jpg)
$E[\mathrm{Claim} \mid \mathcal{F}_0(1_{\mathrm{set(base)}}(A))] - E[\mathrm{Claim} \mid \mathcal{F}_0(1_{\mathrm{set(ggplot2)}}(A))]$
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
![Page 55: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/55.jpg)
1. Make a plot that answers the question: what is the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?
2. Make a plot (possibly multi-panel) that answers the question: how does the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) vary by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?
Use only the [ggplot2/base R] graphics system (not base R or lattice) to make your figure.
![Page 56: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/56.jpg)
“Does the plot clearly show the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?”
G: 5/22 (23%) vs. B: 5/12 (42%)
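The counts on this slide can be arranged as a 2 x 2 table and compared directly. This is a minimal sketch in R, assuming G and B denote the two graphics-system groups and that each fraction is plots rated "yes" over plots graded. `fisher.test` is one reasonable choice for groups this small, not necessarily the analysis used in the talk.

```r
# Counts from the slide: plots rated "yes" out of plots graded, per group.
yes   <- c(G = 5, B = 5)
total <- c(G = 22, B = 12)

tab <- rbind(yes = yes, no = total - yes)  # 2 x 2 contingency table
pct <- round(100 * yes / total)            # percentages as shown on the slide
pct

ft <- fisher.test(tab)                     # exact test for a difference in proportions
ft$p.value
```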
![Page 57: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/57.jpg)
“Does the plot clearly show how the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) varies by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?”
G: 7/22 (32%) vs. B: 5/12 (42%)
![Page 58: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/58.jpg)
“Is the plot visually pleasing?”
G: 21/22 (95%) vs. B: 10/12 (83%)
G: 20/22 (91%) vs. B: 8/12 (67%)
![Page 59: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/59.jpg)
“Do the plot text and labels use full words instead of abbreviations?”
G: 21/22 (95%) vs. B: 12/12 (100%)
G: 11/22 (50%) vs. B: 5/12 (42%)
![Page 60: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/60.jpg)
2 A theoretical model
Data
![Page 61: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/61.jpg)
Data science as a Science (DSaaS) @jtleek
![Page 62: Data science as a science](https://reader035.fdocuments.in/reader035/viewer/2022062316/5871a1e81a28ab044e8b7035/html5/thumbnails/62.jpg)