09 bootstrapping

Hadley Wickham
Stat405 Bootstrapping
Tuesday, 21 September 2010
1. The project
2. Hypothesis testing revision
3. Bootstrapping
The project
Project
About the movies data (sorry for confusion)
Like a homework, but bigger, and you will work on it as a group. 4-5 main questions.
There will be a single grade for the project, but individual grades will be adjusted based on effort (peer rating form).
Team work
Working in teams is tough. But it is a vital skill to gain.
Take 10 minutes to discuss expectations with your team. Make sure to sign the sheet, and one team member should take responsibility for copying and distributing it.
Make sure you talk about how to get in touch.
Hand outs: team policies, team expectations.
Firing & Quitting
You may fire a nonparticipating team member, but you need to meet with me and issue a written warning.
If you feel that you are doing all the work in your team, you may quit. You’ll also need to meet with me and give a written warning to the rest of your team.
Project meetings
I think I have scheduled all teams. If your team doesn’t have a time yet, please get in touch ASAP.
Bring your main questions, as well as any initial plots you have made to answer them.
The more of a start you have made, the more I’ll be able to help you.
Hypothesis testing
Goal
The casino claims that its slot machines have a prize payout of 92%, but the payoff for the 345 pulls we observed is 67%. Is the casino lying?
(House advantage of 8% vs. 33%)
(Big caveat: we’re using a prize calculation function we know to be incorrect)
http://www.flickr.com/photos/joegratz/117048243
Hypothesis testing
The statistical justice system
A suspect is accused of a crime. The suspect is declared guilty or innocent based on a trial. Each trial has a defence and a prosecution. On the basis of how evidence compares to a standard, the judge declares them guilty or not guilty.
A dataset is accused of having a particular parameter value. The data is declared guilty or innocent based on the results of a statistical test. Each test has a null hypothesis and an alternative hypothesis. On the basis of how a test statistic compares to a standard distribution, we make the decision to reject the null or fail to reject the null hypothesis.
Your turn
Write down the null and alternative hypotheses for this case. How could we generate the distribution of payoffs under the null hypothesis?
Hypothesis testing
Null hypothesis: Mean payout is 92%
Alternative hypothesis: Mean payout is less than 92%
How could we generate samples from the null distribution?
# Could assume that the payoffs come from a normal
# distribution (with mean 0.67 and standard deviation 2.55).
# This gives a t-test:
> t.test(slots$prize, mu = 0.92, alternative = "less")

        One Sample t-test

data:  slots$prize
t = -1.8026, df = 344, p-value = 0.03616
alternative hypothesis: true mean is less than 0.92
95 percent confidence interval:
      -Inf 0.8989483
sample estimates:
mean of x
0.6724638
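The statistic in this output can be reproduced by hand from the summary numbers quoted on the slide. The sketch below assumes the rounded standard deviation of 2.55 from the comment, so it should agree with the printed values only to about three decimal places; the variable names are illustrative and not from the lecture.

```r
# Summary numbers taken from the t-test slide
n    <- 345        # number of observed pulls
xbar <- 0.6724638  # sample mean payoff
s    <- 2.55       # sample standard deviation (rounded on the slide)
mu0  <- 0.92       # payout claimed by the casino

# One-sample t statistic: distance from the null mean in standard errors
se     <- s / sqrt(n)
t_stat <- (xbar - mu0) / se

# One-sided p-value for the alternative "true mean is less than mu0"
p_value <- pt(t_stat, df = n - 1)

# Upper end of the one-sided 95% confidence interval
upper <- xbar + qt(0.95, df = n - 1) * se

print(c(t = t_stat, p = p_value, upper = upper))
```

The negative sign of t reflects that the observed mean lies below the claimed 0.92.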
[Figure: histogram of slots$prize. x-axis: slots$prize (ticks 0 to 20); y-axis: count (ticks 0 to 300).]
Alternative approach
Assume that the distribution under the null hypothesis is similar to the empirical (data) distribution, but with a different mean.
This is called bootstrapping, and is a very widely used statistical technique.
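One common way to turn this idea into a test is to shift the observed data so its mean equals the null value and then resample from the shifted data. The sketch below is an illustration under assumptions, not necessarily the approach the lecture goes on to build: `prize` is a made-up stand-in for `slots$prize`, and the number of resamples (2000) is arbitrary.

```r
set.seed(405)

# Hypothetical stand-in for slots$prize: 345 pulls, mostly zero payoffs
prize <- sample(c(0, 2, 5, 10), size = 345, replace = TRUE,
                prob = c(0.80, 0.10, 0.07, 0.03))

null_mean <- 0.92

# Keep the shape of the empirical distribution but move its mean to 0.92
shifted <- prize - mean(prize) + null_mean

# Resample with replacement and record the mean of each bootstrap sample
boot_means <- replicate(2000, mean(sample(shifted, replace = TRUE)))

# One-sided p-value: how often the null produces a mean this low or lower
p_value <- mean(boot_means <= mean(prize))
p_value
```

Note that the shifted values need not be valid prizes (some may be negative); the sketch ignores this.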
Bootstrapping
Bootstrapping
So to answer the question we’re interested in, we need to get a better grasp on the distribution of prizes.
One set of 345 pulls is too few; we want to simulate lots. We’ll start by simulating a single pull and then work our way up.
Your turn
Now we want to generate a new draw from the same distribution (a bootstrap sample).
Write a function that returns the prize from a randomly generated new draw. Call the function random_prize
Hint: sample(slots$w1, 1) will draw a single sample randomly from the original data. Next slide has code you need to get started.
slots <- read.csv("slots.csv", stringsAsFactors = FALSE)

# windows: character vector with the symbols shown in the three windows
calculate_prize <- function(windows) {
  payoffs <- c("DD" = 800, "7" = 80, "BBB" = 40,
    "BB" = 25, "B" = 10, "C" = 10, "0" = 0)

  same <- length(unique(windows)) == 1
  allbars <- all(windows %in% c("B", "BB", "BBB"))

  if (same) {
    # Three identical symbols: look up the payoff by name
    prize <- payoffs[windows[1]]
  } else if (allbars) {
    # Any mix of bars
    prize <- 5
  } else {
    cherries <- sum(windows == "C")
    diamonds <- sum(windows == "DD")

    # Look up the cherry prize and diamond multiplier by count
    prize <- c(0, 2, 5)[cherries + 1] *
      c(1, 2, 4)[diamonds + 1]
  }
  unname(prize)
}
w1 <- sample(slots$w1, 1)
w2 <- sample(slots$w2, 1)
w3 <- sample(slots$w3, 1)

calculate_prize(c(w1, w2, w3))
random_prize <- function() {
  w1 <- sample(slots$w1, 1)
  w2 <- sample(slots$w2, 1)
  w3 <- sample(slots$w3, 1)

  calculate_prize(c(w1, w2, w3))
}

# What is the implicit assumption here?
# How could we test that assumption?
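One possible reading of these questions: drawing each window separately treats the three windows as independent of one another. A minimal sketch of one way to check that for a pair of windows is a chi-squared test on their cross-tabulation. Here `fake_slots` is a hypothetical stand-in for `slots`, generated independently, so the test should usually not reject; with the real data the result may differ.

```r
set.seed(405)
symbols <- c("DD", "7", "BBB", "BB", "B", "C", "0")

# Hypothetical stand-in for slots; the real data come from slots.csv
fake_slots <- data.frame(
  w1 = sample(symbols, 345, replace = TRUE),
  w2 = sample(symbols, 345, replace = TRUE),
  stringsAsFactors = FALSE
)

# Cross-tabulate the first two windows and test for association.
# simulate.p.value = TRUE avoids relying on the chi-squared
# approximation when some cells have small expected counts.
tab  <- table(fake_slots$w1, fake_slots$w2)
test <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
test$p.value
```

A small p-value would suggest the windows are associated, in which case resampling them separately would misrepresent the machine.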
Your turn
Write a function to do this n times
Use a for loop: create an empty vector, and then fill with values
Draw a histogram of the results
n <- 100
prizes <- rep(NA, n)

for(i in seq_along(prizes)) {
  prizes[i] <- random_prize()
}
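The same preallocate-and-fill pattern can also be written with base R's `replicate()`. The sketch below defines a hypothetical stand-in for `random_prize()` so that it runs without slots.csv; the payoff values and probabilities are made up for illustration.

```r
set.seed(405)

# Hypothetical stand-in for random_prize(): returns one simulated payoff
random_prize <- function() {
  sample(c(0, 2, 5, 10), size = 1, prob = c(0.80, 0.10, 0.07, 0.03))
}

n <- 100

# Preallocate-and-fill loop, as on the slide
prizes_loop <- rep(NA, n)
for (i in seq_along(prizes_loop)) {
  prizes_loop[i] <- random_prize()
}

# replicate() evaluates the expression n times and collects the results
prizes_rep <- replicate(n, random_prize())

# Text summary of how often each prize occurred
table(prizes_rep)
```

Both versions return a vector of length n; the loop makes the mechanics explicit, while `replicate()` hides the bookkeeping.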
random_prizes <- function(n = 345) {
  prizes <- rep(NA, n)

  for(i in seq_along(prizes)) {
    prizes[i] <- random_prize()
  }
  prizes
}
library(ggplot2)
qplot(random_prizes())

payout <- function() mean(random_prizes(345))
payout()
# Windows should be a matrix with a column for each window
calculate_prize <- function(windows) {
  payoffs <- c("DD" = 800, "7" = 80, "BBB" = 40,
    "BB" = 25, "B" = 10, "C" = 10, "0" = 0)

  prize <- rep(NA, nrow(windows))

  same <- windows[, 1] == windows[, 2] & windows[, 2] == windows[, 3]
  prize[same] <- payoffs[windows[same, 1]]

  bars <- windows == "B" | windows == "BB" | windows == "BBB"
  all_bars <- bars[, 1] & bars[, 2] & bars[, 3]
  # Exclude rows already handled by `same`, to match the scalar version
  prize[all_bars & !same] <- 5

  other <- !same & !all_bars
  # drop = FALSE keeps a matrix even when only one row is selected
  other_windows <- windows[other, , drop = FALSE]
  cherries <- rowSums(other_windows == "C")
  diamonds <- rowSums(other_windows == "DD")
  prize[other] <- c(0, 2, 5)[cherries + 1] *
    c(1, 2, 4)[diamonds + 1]

  unname(prize)
}

random_prizes <- function(n) {
  w1 <- sample(slots$w1, n, replace = TRUE)
  w2 <- sample(slots$w2, n, replace = TRUE)
  w3 <- sample(slots$w3, n, replace = TRUE)
  calculate_prize(cbind(w1, w2, w3))
}
This is a much faster version. You should be able to figure out how it works.
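As a hint for working through the vectorised version, here is a small sketch of the two indexing idioms it relies on: looking up values by position and assigning with a logical vector. The toy counts are made up for illustration and do not come from the data.

```r
# Toy counts for four pulls (illustrative values, not from the data)
cherries <- c(0, 1, 2, 0)
diamonds <- c(0, 0, 1, 2)

# Counts start at 0 but R indexes from 1, hence the "+ 1".
# Indexing with a vector returns one looked-up value per pull.
cherry_prize   <- c(0, 2, 5)[cherries + 1]   # 0, 2, 5, 0
diamond_factor <- c(1, 2, 4)[diamonds + 1]   # 1, 1, 2, 4

prize <- cherry_prize * diamond_factor       # 0, 2, 10, 0

# Logical indexing assigns to a subset of pulls in one step
same <- c(FALSE, TRUE, FALSE, FALSE)
prize[same] <- 80
prize                                        # 0, 80, 10, 0
```

Because each operation works on all pulls at once, there is no explicit loop over pulls.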
Your turn
Now we want to do this repeatedly to learn about the distribution of the payout. Write a function that generates the payout from m sets of 345 pulls.
Then think about how this helps answer our question.
payouts <- function(m) {
  payouts <- rep(NA, m)

  for(i in seq_along(payouts)) {
    payouts[i] <- payout()
  }
  payouts
}
payouts(10)
quantile(payouts(10000), c(.05, 0.5, .95))
# However, this is too low because our prize
# function is incorrect - once you've done the
# homework try it out with your function.
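One way to relate these quantiles back to the original question is to check whether the claimed payout of 0.92 falls inside the middle 90% of the simulated payouts. The sketch below is an assumption-laden illustration: `fake_payout()` is a hypothetical stand-in for `payout()` with made-up prizes and probabilities, and 1000 replications is an arbitrary choice.

```r
set.seed(405)

# Hypothetical stand-in for payout(): mean payoff of 345 simulated pulls
fake_payout <- function() {
  mean(sample(c(0, 2, 5, 10), size = 345, replace = TRUE,
              prob = c(0.80, 0.10, 0.07, 0.03)))
}
sim_payouts <- replicate(1000, fake_payout())

claimed <- 0.92

# Middle 90% of the simulated payouts
interval <- quantile(sim_payouts, c(0.05, 0.95))

# Is the casino's claim a plausible value given the simulated payouts?
claim_inside <- claimed >= interval[[1]] && claimed <= interval[[2]]

# Share of simulated payouts at or above the claimed value
share_above <- mean(sim_payouts >= claimed)

interval
claim_inside
share_above
```

If the claimed value sits far outside the interval, the simulated machine rarely pays out as much as the casino says.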