Statistics for Programmers
Transcript of Statistics for Programmers
-
8/19/2019 Statistics for Programmers
1/129
Jake VanderPlas
Jake VanderPlas @jakevdp
Sept 17, 2015
-
8/19/2019 Statistics for Programmers
2/129
Jake VanderPlas
About Me Jake VanderPlas (Twitter/Github: jakevdp)
- Astronomer by training
- Data Scientist at UW eScience Institute
- Active in Python science & open source- Blog at Pythonic Perambulations
- Author of two books:
-
8/19/2019 Statistics for Programmers
3/129
-
8/19/2019 Statistics for Programmers
4/129
Jake VanderPlas
Statistics is Hard.
Using programming skills,it can be easy.
-
8/19/2019 Statistics for Programmers
5/129
-
8/19/2019 Statistics for Programmers
6/129
Jake VanderPlas
My thesis today:
If you can write a for-loop, you can do statistics
-
8/19/2019 Statistics for Programmers
7/129Jake VanderPlas
You toss a coin 30
times and see 22 heads. Is it a fair coin?
Warm-up:
Coin Toss
-
8/19/2019 Statistics for Programmers
8/129Jake VanderPlas
A fair coin shouldshow 15 heads in 30
tosses. This coin isbiased.
Even a fair coincould show 22 heads
in 30 tosses. It mightbe just chance.
-
8/19/2019 Statistics for Programmers
9/129Jake VanderPlas
Classic Method:
Assume the Skeptic is correct:test the Null Hypothesis.
Assuming a fair coin, compute
probability of seeing 22 headssimply by chance.
-
8/19/2019 Statistics for Programmers
10/129Jake VanderPlas
Classic Method:
Start computing probabilities . . .
-
8/19/2019 Statistics for Programmers
11/129
-
8/19/2019 Statistics for Programmers
12/129Jake VanderPlas
Classic Method:
Number of
arrangements(binomialcoefficient) Probability of
N H heads
Probability of N
T tails
-
8/19/2019 Statistics for Programmers
13/129Jake VanderPlas
Classic Method:
-
8/19/2019 Statistics for Programmers
14/129Jake VanderPlas
Classic Method:
-
8/19/2019 Statistics for Programmers
15/129Jake VanderPlas
Classic Method:
0.8 %
-
8/19/2019 Statistics for Programmers
16/129Jake VanderPlas
Classic Method:
0.8 %
Probability of 0.8% (i.e. p = 0.008) of
observations given a fair coin. → reject fair coin hypothesis at p < 0.05
-
8/19/2019 Statistics for Programmers
17/129Jake VanderPlas
Could there be
an easier way?
-
8/19/2019 Statistics for Programmers
18/129Jake VanderPlas
Easier Method:
Just simulate it!
M = 0
for i in range(10000):
trials = randint(2, size=30)
if (trials.sum() >= 22):
M += 1
p = M / 10000 # 0.008149
→ reject fair coin at p = 0.008
-
8/19/2019 Statistics for Programmers
19/129
Jake VanderPlas
In general . . .
Computing the SamplingDistribution is Hard.
-
8/19/2019 Statistics for Programmers
20/129
Jake VanderPlas
In general . . .
Computing the SamplingDistribution is Hard.
Simulating the Sampling
Distribution is Easy.
-
8/19/2019 Statistics for Programmers
21/129
Jake VanderPlas
Four Recipes for
Hacking Statistics:1. Direct Simulation
2. Shuffling3. Bootstrapping
4. Cross Validation
-
8/19/2019 Statistics for Programmers
22/129
Jake VanderPlas
Now, the Star-Belly Sneetches
had bellies with stars.The Plain-Belly Sneetches had none upon thars . . .
Sneeches:Stars and
Intelligence
*adapted from John Rauser’sStatistics Without All The Agonizing Pain
-
8/19/2019 Statistics for Programmers
23/129
Jake VanderPlas
★ ❌
84 72 81 69
57 46 74 61
63 76 56 87
99 91 69 65
66 44
62 69
★ mean: 73.5❌ mean: 66.9
difference: 6.6
Sneeches:Stars and
Intelligence
Test Scores
-
8/19/2019 Statistics for Programmers
24/129
Jake VanderPlas
★ mean: 73.5❌ mean: 66.9 difference: 6.6
Is this difference of 6.6statistically significant?
-
8/19/2019 Statistics for Programmers
25/129
-
8/19/2019 Statistics for Programmers
26/129
-
8/19/2019 Statistics for Programmers
27/129
Jake VanderPlas
Classic
Method
(Student’s t distribution)
-
8/19/2019 Statistics for Programmers
28/129
Jake VanderPlas
Classic
Method
(Student’s t distribution)
Degree of Freedom: “The number of independentways by which a dynamic system can move,without violating any constraint imposed on it.”
-Wikipedia
-
8/19/2019 Statistics for Programmers
29/129
Jake VanderPlas
Degree of Freedom: “The number of independentways by which a dynamic system can move,without violating any constraint imposed on it.”
-Wikipedia
Classic
Method
(Student’s t distribution)
https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation
-
8/19/2019 Statistics for Programmers
30/129
Jake VanderPlas
Classic
Method
( Welch–Satterthwaiteequation)
https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equationhttps://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equationhttps://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation
-
8/19/2019 Statistics for Programmers
31/129
Jake VanderPlas
Classic
Method
( Welch–Satterthwaiteequation)
https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equationhttps://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation
-
8/19/2019 Statistics for Programmers
32/129
-
8/19/2019 Statistics for Programmers
33/129
Jake VanderPlas
Classic
Method
-
8/19/2019 Statistics for Programmers
34/129
Jake VanderPlas
Classic
Method
1.7959
-
8/19/2019 Statistics for Programmers
35/129
-
8/19/2019 Statistics for Programmers
36/129
-
8/19/2019 Statistics for Programmers
37/129
Jake VanderPlas
Classic
Method
-
8/19/2019 Statistics for Programmers
38/129
Jake VanderPlas
“The difference of 6.6 is not
significant at the p=0.05 level”
-
8/19/2019 Statistics for Programmers
39/129
Jake VanderPlas
-
8/19/2019 Statistics for Programmers
40/129
-
8/19/2019 Statistics for Programmers
41/129
Jake VanderPlas
Let’s use a sampling
method instead
-
8/19/2019 Statistics for Programmers
42/129
Jake VanderPlas
Problem:
Unlike coin flipping, we don’t have a probabilistic model . . .
-
8/19/2019 Statistics for Programmers
43/129
Jake VanderPlas
Problem:
Unlike coin flipping, we don’t have a probabilistic model . . .
Solution:
Shuffling
★ d
-
8/19/2019 Statistics for Programmers
44/129
Jake VanderPlas
★ ❌
84 72 81 69
57 46 74 61
63 76 56 87
99 91 69 6566 44
62 69
Idea:Simulate the distributionby shuffling the labelsrepeatedly and computingthe desired statistic.
Motivation:if the labels really don’tmatter, then switchingthem shouldn’t changethe result!
★ Sh ffl L b l
-
8/19/2019 Statistics for Programmers
45/129
Jake VanderPlas
★ ❌
84 72 81 69
57 46 74 61
63 76 56 87
99 91 69 6566 44
62 69
1. Shuffle Labels2. Rearrange3. Compute means
★ Sh ffl L b l
-
8/19/2019 Statistics for Programmers
46/129
Jake VanderPlas
★ ❌
84 72 81 69
57 46 74 61
63 76 56 87
99 91 69 6566 44
62 69
1. Shuffle Labels2. Rearrange3. Compute means
★ Sh ffl L b l
-
8/19/2019 Statistics for Programmers
47/129
Jake VanderPlas
★ ❌
84 81 72 69
61 69 74 57
65 76 56 87
99 44 46 6366 91
62 69
1. Shuffle Labels2. Rearrange3. Compute means
★ Sh ffl L b l
-
8/19/2019 Statistics for Programmers
48/129
Jake VanderPlas
★ ❌
84 81 72 69
61 69 74 57
65 76 56 87
99 44 46 6366 91
62 69
★ mean: 72.4❌ mean: 67.6difference: 4.8
1. Shuffle Labels2. Rearrange3. Compute means
★ Sh ffl L b l
-
8/19/2019 Statistics for Programmers
49/129
Jake VanderPlas
★ ❌
84 81 72 69
61 69 74 57
65 76 56 87
99 44 46 6366 91
62 69
★ mean: 72.4❌ mean: 67.6difference: 4.8
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌ Sh ffl L b l
-
8/19/2019 Statistics for Programmers
50/129
Jake VanderPlas
★ ❌
84 81 72 69
61 69 74 57
65 76 56 87
99 44 46 6366 91
62 69
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌ 1 Sh ffl L b l
-
8/19/2019 Statistics for Programmers
51/129
Jake VanderPlas
★ ❌
84 56 72 69
61 63 74 57
65 66 81 87
62 44 46 6976 91
99 69
★ mean: 62.6❌ mean: 74.1difference: -11.6
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌ 1 Sh ffl L b l
-
8/19/2019 Statistics for Programmers
52/129
Jake VanderPlas
★ ❌
84 56 72 69
61 63 74 57
65 66 81 87
62 44 46 6976 91
99 69
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌ 1 Shuffle Labels
-
8/19/2019 Statistics for Programmers
53/129
Jake VanderPlas
★ ❌
74 56 72 69
61 63 84 57
87 76 81 65
91 99 46 6966 62
44 69
★ mean: 75.9❌ mean: 65.3difference: 10.6
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌ 1 Shuffle Labels
-
8/19/2019 Statistics for Programmers
54/129
Jake VanderPlas
★ ❌
84 56 72 69
61 63 74 57
65 66 81 87
62 44 46 6976 91
99 69
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌ 1 Shuffle Labels
-
8/19/2019 Statistics for Programmers
55/129
Jake VanderPlas
★ ❌
84 81 69 69
61 69 87 74
65 76 56 57
99 44 46 6366 91
62 72
1. Shuffle Labels2. Rearrange3. Compute means
-
8/19/2019 Statistics for Programmers
56/129
1 Shuffle Labels★ ❌
-
8/19/2019 Statistics for Programmers
57/129
Jake VanderPlas
1. Shuffle Labels2. Rearrange3. Compute means
★ ❌
84 81 72 69
61 69 74 57
65 76 56 87
99 44 46 6366 91
62 69
-
8/19/2019 Statistics for Programmers
58/129
Jake VanderPlas
score difference
n u m b
e r
-
8/19/2019 Statistics for Programmers
59/129
Jake VanderPlas
score difference
n u m b
e r
-
8/19/2019 Statistics for Programmers
60/129
Jake VanderPlas
16 %
score difference
n u m b
e r
-
8/19/2019 Statistics for Programmers
61/129
Jake VanderPlas
“A difference of 6.6 is notsignificant at p = 0.05.”
That day, all the Sneetches
forgot about stars
And whether they had one,
or not, upon thars.
N t Sh ffli
-
8/19/2019 Statistics for Programmers
62/129
Jake VanderPlas
Notes on Shuffling:
- Works when the Null Hypothesis assumes two groups are equivalent
- Like all methods, it will only work if your
samples are representative – always becareful about selection biases!
- For more discussion & references, seeStatistics is Easy by Shasha & Wilson
-
8/19/2019 Statistics for Programmers
63/129
Jake VanderPlas
Four Recipes for
Hacking Statistics:1. Direct Simulation
2. Shuffling3. Bootstrapping
4. Cross Validation
-
8/19/2019 Statistics for Programmers
64/129
How High can Yertle
-
8/19/2019 Statistics for Programmers
65/129
Jake VanderPlas
How High can Yertlestack his turtles?
- What is the mean of the number of
turtles in Yertle’s stack?- What is the uncertainty on this
estimate?
48 24 32 61 51 12 32 18 19 24
21 41 29 21 25 23 42 18 23 13
Observe 20 of Yertle’s turtle towers . . .
# o f t u r t l e s
Classic Method:
-
8/19/2019 Statistics for Programmers
66/129
Jake VanderPlas
Classic Method:
Sample Mean:
Standard Error of the Mean:
-
8/19/2019 Statistics for Programmers
67/129
Jake VanderPlas
What assumptions go intothese formulae?
Can we usesampling instead?
-
8/19/2019 Statistics for Programmers
68/129
Jake VanderPlas
Problem:
We need a way to simulatesamples, but we don’t have a
generating model . . .
-
8/19/2019 Statistics for Programmers
69/129
Jake VanderPlas
Problem:
We need a way to simulatesamples, but we don’t have a
generating model . . .
Solution:Bootstrap Resampling
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
70/129
Jake VanderPlas
p p g
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
71/129
Jake VanderPlas
p p g
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
72/129
Jake VanderPlas
p p g
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
73/129
Jake VanderPlas
p p g
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
74/129
Jake VanderPlas
p p g
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
75/129
Jake VanderPlas
g
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
76/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
77/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
78/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
79/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23
-
8/19/2019 Statistics for Programmers
80/129
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
81/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
-
8/19/2019 Statistics for Programmers
82/129
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
83/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
84/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
85/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
86/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 13
32 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42 42
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
87/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 1332 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42 42 19
-
8/19/2019 Statistics for Programmers
88/129
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
89/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 1332 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42 42 19 18 61
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
90/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 1332 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42 42 19 18 61 29
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
91/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 1332 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42 42 19 18 61 29 41
Bootstrap Resampling:
-
8/19/2019 Statistics for Programmers
92/129
Jake VanderPlas
48 24 51 12
21 41 25 23
32 61 19 24
29 21 23 1332 18 42 18
Idea:Simulate the distributionby drawing samples withreplacment.
Motivation:The data estimates itsown distribution – wedraw random samplesfrom this distribution.
21 19 25 24 23 19 41 23 41 18
61 12 42 42 42 19 18 61 29 41→ 31.05
-
8/19/2019 Statistics for Programmers
93/129
Jake VanderPlas
Repeat this
several thousand times . . .
Recovers The Analytic Estimate!
-
8/19/2019 Statistics for Programmers
94/129
Jake VanderPlas
for i in range(10000):
sample = N[randint(20, size=20)]
xbar[i] = mean(sample)mean(xbar), std(xbar)
# (28.9, 2.9)
Height = 29 ± 3 turtles
-
8/19/2019 Statistics for Programmers
95/129
Jake VanderPlas
Bootstrap samplingcan be applied even to
more involved statistics
-
8/19/2019 Statistics for Programmers
96/129
Bootstrap on LinearR i
-
8/19/2019 Statistics for Programmers
97/129
Jake VanderPlas
Regression:
for i in range(10000):
i = randint(20, size=20)
slope, intercept = fit(x[i], y[i])
results[i] = (slope, intercept)
Notes on Bootstrapping:
-
8/19/2019 Statistics for Programmers
98/129
Jake VanderPlas
pp g
- Bootstrap resampling rests on sound
theoretical grounds.
- Bootstrapping doesn’t work well for rank-based statistics (e.g. maximum value)
- Works poorly with very few samples(N > 20 is a good rule of thumb)
- Always be careful about selection biases!
Four Recipes for
-
8/19/2019 Statistics for Programmers
99/129
Jake VanderPlas
Four Recipes for
Hacking Statistics:1. Direct Simulation
2. Shuffling3. Bootstrapping
4. Cross Validation
Onceler Industries:
-
8/19/2019 Statistics for Programmers
100/129
Jake VanderPlas
Sales of Thneeds
I'm being quite useful! This thing is a Thneed.
A Thneed's a Fine-Something-
That-All-People-Need!
-
8/19/2019 Statistics for Programmers
101/129
Jake VanderPlas
Thneed sales seem to show atrend with temperature . . .
-
8/19/2019 Statistics for Programmers
102/129
Jake VanderPlas
y = a + b x
y = a + b x + c x2
But which model is a better fit?
-
8/19/2019 Statistics for Programmers
103/129
Jake VanderPlas
y = a + b x
y = a + b x + c x2
Can we judge by root-mean-square error?
RMS error = 63.0
RMS error = 51.5
In general, more flexible models will
-
8/19/2019 Statistics for Programmers
104/129
Jake VanderPlas
always have a lower RMS error.
y = a + b x
y = a + b x + c x2
y = a + b x + c x2 + d x3
y = a + b x + c x2 + d x3 + e x4
y = a + ⋯
RMS error does not
-
8/19/2019 Statistics for Programmers
105/129
Jake VanderPlas
y = a + b x + c x2 + d x3 + e x4 + f x5 + ⋯ + n x14
tell the whole story.
-
8/19/2019 Statistics for Programmers
106/129
Jake VanderPlas
Not to worry:
Statistics has figured this out.
Classic Method
-
8/19/2019 Statistics for Programmers
107/129
Jake VanderPlas
Difference in Mean
Squared Error follows
chi-square distribution:
Classic Method
-
8/19/2019 Statistics for Programmers
108/129
Jake VanderPlas
Difference in Mean
Squared Error follows
chi-square distribution:
Can estimate degrees of
freedom easily because the
models are nested . . .
Classic Method
-
8/19/2019 Statistics for Programmers
109/129
Jake VanderPlas
Difference in Mean-
Squared-Error follows
chi-square distribution:
Can estimate degrees of
freedom easily because the
models are nested . . .
Now plug in all our numbers and . . .
Classic Method
-
8/19/2019 Statistics for Programmers
110/129
Jake VanderPlas
Difference in Mean
Squared Error follows
chi-square distribution:
Can estimate degrees of
freedom easily because the
models are nested . . .
Now plug in all our numbers and . . .
-
8/19/2019 Statistics for Programmers
111/129
Jake VanderPlas
Easier Way:
Cross Validation
Cross-Validation
-
8/19/2019 Statistics for Programmers
112/129
Jake VanderPlas
Cross-Validation
-
8/19/2019 Statistics for Programmers
113/129
Jake VanderPlas
1. Randomly Split data
Cross-Validation
-
8/19/2019 Statistics for Programmers
114/129
Jake VanderPlas
1. Randomly Split data
Cross-Validation
-
8/19/2019 Statistics for Programmers
115/129
Jake VanderPlas
2. Find the best model for each subset
Cross-Validation
-
8/19/2019 Statistics for Programmers
116/129
Jake VanderPlas
3. Compare models across subsets
-
8/19/2019 Statistics for Programmers
117/129
Cross-Validation
-
8/19/2019 Statistics for Programmers
118/129
Jake VanderPlas
3. Compare models across subsets
Cross-Validation
-
8/19/2019 Statistics for Programmers
119/129
Jake VanderPlas
3. Compare models across subsets
Cross-Validation
-
8/19/2019 Statistics for Programmers
120/129
Jake VanderPlas
4. Compute RMS error for each
RMS = 48.9RMS = 55.1
RMS estimate = 52.1
Cross-Validation
-
8/19/2019 Statistics for Programmers
121/129
Jake VanderPlas
5. Compare cross-validated RMS for models:
Cross-Validation
-
8/19/2019 Statistics for Programmers
122/129
Jake VanderPlas
Best model minimizes thecross-validated error.
5. Compare cross-validated RMS for models:
-
8/19/2019 Statistics for Programmers
123/129
Jake VanderPlas
. . . I biggered the loads of the thneeds I shipped out! I was shipping them forth,to the South, to the East to the West, to the North!
Notes on Cross-Validation:
-
8/19/2019 Statistics for Programmers
124/129
Jake VanderPlas
- This was “2-fold” cross-validation; other
CV schemes exist & may perform betterfor your data (see e.g. scikit-learn docs)
- Cross-validation is the go-to method formodel evaluation in machine learning,as statistics of the models are often notknown in the classical sense.
Four Recipes for
-
8/19/2019 Statistics for Programmers
125/129
Jake VanderPlas
p
Hacking Statistics:1. Direct Simulation
2. Shuffling3. Bootstrapping
4. Cross Validation
Sampling Methods
-
8/19/2019 Statistics for Programmers
126/129
Jake VanderPlas
Sampling Methods
allow you to replacenon-intuitive statistical rules withintuitive computational approaches!
If you can write a for-loop
you can do statistical analysis.
Things I didn’t have time for:
-
8/19/2019 Statistics for Programmers
127/129
Jake VanderPlas
- Bayesian Methods: very intuitive & powerful
approaches to more sophisticated modeling.(see Bayesian Methods for Hackers by Cam Davidson-Pillon)
- Selection Bias: if you get data selection
wrong, you’ll have a bad time.(See Chris Fonnesbeck’s Scipy 2015 talk, Statistical Thinking for Data Science )
- Detailed considerations on use of sampling,shuffling, and bootstrapping.(I recommend Statistics Is Easy , a book by Shasha & Wilson)
-
8/19/2019 Statistics for Programmers
128/129
~ Thank You! ~
-
8/19/2019 Statistics for Programmers
129/129
Email: [email protected]
Twitter: @jakevdp
Github: jakevdp
Web: http://vanderplas.com/
Blog: http://jakevdp.github.io/
Slides available ath k d k j k d i i f h k
https://speakerdeck.com/jakevdp/statistics-for-hackers/https://speakerdeck.com/jakevdp/statistics-for-hackers/