STAT 101: Day 2 Data Collection: Sampling 1/18/12 Sample versus Population Statistical Inference...

STAT 101: Day 2
Data Collection: Sampling

1/18/12

• Sample versus Population
• Statistical Inference
• Sampling Bias
• Simple Random Sample
• Other Sources of Bias

Section 1.2

Course Website

http://stat.duke.edu/courses/Spring12/sta101.2/

Sample vs Population

• A population includes all individuals or objects of interest

• A sample is all the cases that we have collected data on, usually a subset of the population

• Statistical inference is the process of using data from a sample to gain information about the population

The Big Picture

[Diagram: Population → Sample via Sampling; Sample → Population via Statistical Inference]

Most Important to You

Which of the following is most important to you?

(a) Athletics
(b) Academics
(c) Social Life
(d) Community Service
(e) Other

Most Important to You

• Suppose researchers studying student life at Duke use the results of our clicker question to investigate what Duke students find important

• What is the sample?

• What is the population?

• Can the sample data be generalized to make inferences about the population? Why or why not?

Sampling

[Diagram: Population → Sample via Sampling]

GOAL: Select a sample that is similar to the population, only smaller

Dewey Defeats Truman?

• The paper, the Chicago Daily Tribune, was printed before the conclusion of the 1948 presidential election, and its headline was based on the results of a large telephone poll which showed Dewey sweeping Truman

• However, Harry S. Truman won the election

• What went wrong?

Sampling Bias

• Sampling bias occurs when the method of selecting a sample causes the sample to differ from the population in some relevant way

• If sampling bias exists, we cannot trust generalizations from the sample to the population

Sampling

[Diagram: Population and Sample]

Can you avoid sampling bias?

• The next slide shows Lincoln’s Gettysburg Address. The entire population, all words in his address, will be shown to you.

• Your task: Select a sample of 10 words that resembles the overall address. Write them down.

• Calculate the average number of letters for the words in your sample

• Place a dot above your sample average on the board

Lincoln’s Gettysburg Address

“Four score and seven years ago our fathers brought forth, on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they here gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.”

Can you avoid sampling bias?

• Actual average: 4.29 letters

• People are TERRIBLE at selecting a good sample, even when explicitly trying to avoid sampling bias!

• We need a better way…

Random Sampling

• How can we make sure to avoid sampling bias?

• Imagine putting the names of all the units of the population into a hat, and drawing out names at random to be in the sample

Take a RANDOM sample!

Random Sampling

• Before the 2008 election, the Gallup Poll took a random sample of 2,847 Americans. 52% of those sampled supported Obama

• In the actual election, 53% voted for Obama

• Random sampling is a very powerful tool!!!
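The Gallup result can be illustrated with a small simulation. This is only a sketch under stated assumptions: it treats the population as so large that each sampled voter independently supports the candidate with probability 0.53 (the election result quoted above), and the names `true_support` and `sample_props` and the seed are made up for illustration.

```r
# Assumption: a very large population in which 53% support the candidate.
set.seed(101)                  # arbitrary seed so the run is repeatable
true_support <- 0.53           # actual election result quoted on the slide
n <- 2847                      # Gallup's sample size quoted on the slide

# Draw 1000 random samples of n voters each and record, for every sample,
# the proportion of sampled voters who support the candidate.
sample_props <- rbinom(1000, size = n, prob = true_support) / n

range(sample_props)                             # spread of the 1000 sample proportions
mean(abs(sample_props - true_support) < 0.03)   # share of samples within 3 points of the truth
```

Almost every simulated sample lands within a few percentage points of the true value, even though each sample is a tiny fraction of the population.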

Selecting a Random Sample

• Option 1: Actually draw names out of a hat

• Option 2: Number all units in the population, and generate random numbers

Online: http://www.random.org/integers/

RStudio: To generate n random numbers between 1 and max, use

sample(1:max, n)

> sample(1:100,5)
[1] 66 4 51 18 70

Selecting a Random Sample

• Option 3: Use RStudio to randomly sample directly from a vector of population units

population = vector of population units
n = sample size

sample(population, n)
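As a minimal sketch of Option 3, the example below samples directly from a small vector. The population here (the first ten words of the Gettysburg Address word list) and the seed are assumptions chosen only for illustration.

```r
# Hypothetical population: the first ten words of the Gettysburg Address word list.
population <- c("Four", "score", "and", "seven", "years",
                "ago", "our", "fathers", "brought", "forth")
n <- 3                               # sample size

set.seed(2012)                       # arbitrary seed, only for repeatability
my_sample <- sample(population, n)   # n units drawn at random, without replacement
my_sample
```

By default `sample()` draws without replacement, so no unit can appear in the sample twice.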

“Random” Numbers

1. Pick 10 “random” numbers between 1 and 268. Write these numbers down.

(Note: When choosing a real sample, you should use technology to generate random numbers. This is simply for illustrative purposes in class.)

2. Using the next slide, calculate the average number of letters in the words corresponding to your random numbers

3. Place a dot above this average on the board

1 Four   35 in   69 dedicate   103 But,   137 add   171 here   205 these   239 that
2 score   36 a   70 a   104 in   138 or   172 to   206 honored   240 this
3 and   37 great   71 portion   105 a   139 detract.   173 the   207 dead   241 nation,
4 seven   38 civil   72 of   106 larger   140 The   174 unfinished   208 we   242 under
5 years   39 war,   73 that   107 sense,   141 world   175 work   209 take   243 God,
6 ago,   40 testing   74 field   108 we   142 will   176 which   210 increased   244 shall
7 our   41 whether   75 as   109 cannot   143 little   177 they   211 devotion   245 have
8 fathers   42 that   76 a   110 dedicate,   144 note,   178 who   212 to   246 a
9 brought   43 nation,   77 final   111 we   145 nor   179 fought   213 that   247 new
10 forth   44 or   78 resting   112 cannot   146 long   180 here   214 cause   248 birth
11 upon   45 any   79 place   113 consecrate,   147 remember,   181 have   215 for   249 of
12 this   46 nation   80 for   114 we   148 what   182 thus   216 which   250 freedom,
13 continent   47 so   81 those   115 cannot   149 we   183 far   217 they   251 and
14 a   48 conceived   82 who   116 hallow   150 say   184 so   218 gave   252 that
15 new   49 and   83 here   117 this   151 here,   185 nobly   219 the   253 government
16 nation:   50 so   84 gave   118 ground.   152 but   186 advanced.   220 last   254 of
17 conceived   51 dedicated,   85 their   119 The   153 it   187 It   221 full   255 the
18 in   52 can   86 lives   120 brave   154 can   188 is   222 measure   256 people,
19 liberty,   53 long   87 that   121 men,   155 never   189 rather   223 of   257 by
20 and   54 endure.   88 that   122 living   156 forget   190 for   224 devotion,   258 the
21 dedicated   55 We   89 nation   123 and   157 what   191 us   225 that   259 people,
22 to   56 are   90 might   124 dead,   158 they   192 to   226 we   260 for
23 the   57 met   91 live.   125 who   159 did   193 be   227 here   261 the
24 proposition   58 on   92 It   126 struggled   160 here.   194 here   228 highly   262 people,
25 that   59 a   93 is   127 here   161 It   195 dedicated   229 resolve   263 shall
26 all   60 great   94 altogether   128 have   162 is   196 to   230 that   264 not
27 men   61 battlefield   95 fitting   129 consecrated   163 for   197 the   231 these   265 perish
28 are   62 of   96 and   130 it,   164 us   198 great   232 dead   266 from
29 created   63 that   97 proper   131 far   165 the   199 task   233 shall   267 the
30 equal.   64 war.   98 that   132 above   166 living,   200 remaining   234 not   268 earth.
31 Now   65 We   99 we   133 our   167 rather,   201 before   235 have
32 we   66 have   100 should   134 poor   168 to   202 us,   236 died
33 are   67 come   101 do   135 power   169 be   203 that   237 in
34 engaged   68 to   102 this.   136 to   170 dedicated   204 from   238 vain,
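The exercise of picking word numbers at random and averaging the word lengths might be sketched in R as follows. To keep the example short it uses only words 1 to 30 of the numbered list as a stand-in population, with punctuation stripped; the in-class exercise uses all 268 words. The variable names and the seed are illustrative assumptions.

```r
# Stand-in population: words 1-30 of the numbered list (punctuation removed).
# The real exercise numbers all 268 words of the address.
words <- c("Four", "score", "and", "seven", "years", "ago", "our", "fathers",
           "brought", "forth", "upon", "this", "continent", "a", "new",
           "nation", "conceived", "in", "liberty", "and", "dedicated", "to",
           "the", "proposition", "that", "all", "men", "are", "created", "equal")

set.seed(19)                          # arbitrary seed for repeatability
ids <- sample(seq_along(words), 10)   # 10 distinct random word numbers
chosen <- words[ids]                  # look up the words with those numbers
mean(nchar(chosen))                   # average number of letters in the sample
```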

Random vs Non-Random Sampling

• Random samples have averages that are centered around the correct number

• Non-random samples may suffer from sampling bias, and averages may not be centered around the correct number

• Only random samples can truly be trusted when making generalizations to the population!
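A rough simulation can make the contrast concrete. The sketch below is an illustration under assumptions, not a record of the class results: the population is the first 30 words of the address (punctuation removed), and the non-random sampler is a made-up mechanism in which longer words are more likely to be picked, mimicking the eye being drawn to long words.

```r
# Stand-in population: the first 30 words of the address (punctuation removed).
words <- c("Four", "score", "and", "seven", "years", "ago", "our", "fathers",
           "brought", "forth", "upon", "this", "continent", "a", "new",
           "nation", "conceived", "in", "liberty", "and", "dedicated", "to",
           "the", "proposition", "that", "all", "men", "are", "created", "equal")
word_lengths <- nchar(words)
true_mean <- mean(word_lengths)

set.seed(21)                          # arbitrary seed for repeatability

# 2000 simple random samples of 5 words; record each sample's average length.
random_means <- replicate(2000, mean(sample(word_lengths, 5)))

# 2000 biased samples of 5 words. Assumed bias: the chance of picking a word
# is proportional to its length, so long words are over-represented.
biased_means <- replicate(2000, mean(sample(word_lengths, 5, prob = word_lengths)))

c(truth = true_mean, random = mean(random_means), biased = mean(biased_means))
```

The random-sample averages center near the true average, while the biased averages center noticeably above it.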

Bowl of Soup Analogy

Think of tasting a bowl of soup…

• Population = entire bowl of soup

• Sample = whatever is in your tasting bites

• If you take bites non-randomly from the soup (if you stab with a fork, or prefer noodles to vegetables), you may not get a very accurate representation of the soup

• If you take bites at random, even a few bites can give you a very good idea of the overall taste of the soup

Simple Random Sample

• These methods generate a simple random sample

• In a simple random sample, each unit of the population has the same chance of being selected, regardless of the other units chosen for the sample

• More complicated random sampling schemes exist, but will not be covered in this course
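The "same chance of being selected" property can be checked empirically. The sketch below assumes a hypothetical population of 20 numbered units and a sample size of 5; it repeats the sampling many times and counts how often each unit is chosen.

```r
set.seed(23)     # arbitrary seed for repeatability
N <- 20          # hypothetical population size
n <- 5           # sample size
reps <- 10000    # number of repeated samples

# Each column of draws is one simple random sample of n unit numbers.
draws <- replicate(reps, sample(1:N, n))

# Count how often each unit (1..N) appeared across all samples.
counts <- tabulate(as.vector(draws), nbins = N)

round(counts / reps, 3)   # each unit's selection frequency; n / N is 0.25
```

Every unit is selected in roughly the same fraction of samples, close to n / N.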

Realities of Sampling

• While a random sample is ideal, often it isn’t feasible. A list of the entire population may not be available, or it may be impossible or too difficult to contact all members of the population.

• Sometimes, your population of interest has to be altered to something more feasible to sample from. Generalization of results is limited to the population that was actually sampled.

• In practice, think hard about potential sources of sampling bias, and try your best to avoid them

Non-Random Samples

Suppose you want to estimate the average number of hours that Duke students spend studying each week. Which of the following is the best method of sampling?

(a) Go to the library and ask all the students there how much they study

(b) Email all Duke students asking how much they study, and use all the data you get

(c) Give a clicker question in STAT 101 and force every student to respond

(d) Stand outside the Bryan Center and ask everyone going in how much they study

Bad Methods of Sampling

• Sampling units based on something obviously related to the variable(s) you are studying

– Sampling only students in the library when asking how much they study, or sampling only students taking a statistics class

– “Today’s Poll” on fitnessmagazine.com asked “Have you ever hired a personal trainer?” 27% of respondents said “yes”. Can we infer that 27% of all humans have hired a personal trainer?

Bad Methods of Sampling

• Letting your sample consist of whoever chooses to participate (volunteer bias)

– Emailing or mailing the entire population, and then making conclusions about the population based on whoever chooses to respond

– Example: An airline emails all of its customers asking them to rate their satisfaction with their recent travel
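Volunteer bias in the airline example can be sketched with a simulation. Everything here is hypothetical: the satisfaction ratings, their distribution, and the response probabilities (unhappy customers are assumed to be the most likely to reply) are invented purely to show the mechanism.

```r
set.seed(27)     # arbitrary seed for repeatability
N <- 10000       # hypothetical number of customers emailed

# Hypothetical satisfaction ratings from 1 (very unhappy) to 5 (very happy).
satisfaction <- sample(1:5, N, replace = TRUE,
                       prob = c(0.05, 0.10, 0.25, 0.35, 0.25))

# Assumed response behavior: each customer's chance of replying depends on
# their rating, with very unhappy customers replying most often.
respond_prob <- c(0.60, 0.30, 0.05, 0.05, 0.15)[satisfaction]
responded <- runif(N) < respond_prob

volunteer_mean <- mean(satisfaction[responded])   # average among those who replied
random_mean <- mean(sample(satisfaction, 500))    # average in a random sample of 500

c(population = mean(satisfaction), volunteers = volunteer_mean, random = random_mean)
```

Under these assumptions the volunteers' average sits well below the population average, while the random sample of 500 stays close to it.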

Road Safety

• The Federal Office of Road Safety in Australia conducted a study on the effects of alcohol and marijuana on performance

• Participants were volunteers who responded to advertisements for the study on rock radio stations

• Volunteers were given a random combination of the two drugs, then their performance was observed

• What is the sample? What is the population?

• Is there sampling bias?

• Will the results be informative and/or do you think the study is worth conducting?

Data Collection and Bias

[Diagram: Population → Sample → DATA, with “Sampling Bias?” at the sampling step and “Other forms of bias?” at the data collection step]

Other Forms of Bias

• Even with a random sample, data can still be biased, especially when collected on humans

• Other forms of bias to watch out for in data collection:

– Question wording

– Context

– Inaccurate responses

– Many other possibilities: examine the specifics of each study!

Question Wording

• “Do you think the US should allow public speeches against democracy?”

21% said speeches should be allowed

• “Do you think the US should not forbid public speeches against democracy?”

39% said speeches should not be forbidden

Source: Rugg, D. (1941). “Experiments in wording questions,” Public Opinion Quarterly, 5, 91-92.

Question Wording

• A random sample was asked: “Should there be a tax cut, or should money be used to fund new government programs?”

Tax Cut: 60% Programs: 40%

• A different random sample was asked: “Should there be a tax cut, or should money be spent on programs for education, the environment, health care, crime-fighting, and military defense?”

Tax Cut: 22% Programs: 78%

Context

• Ann Landers' column asked readers: “If you had it to do over again, would you have children?”

• The first request for data contained a letter from a young couple which listed worries about parenting and various reasons not to have kids => 30% said “yes”

• The second request for data was in response to this number, in which Ann wrote how she was “stunned, disturbed, and just plain flummoxed” => 95% said “yes”

Having Children

• If we were to run the question all by itself in the newspaper with a request for responses, could we trust the results?

(a) Yes
(b) No

Having Children

Newsday surveyed a random sample of all US adults, and asked them the same question, without any additional leading material => 91% said “yes”

Do you think the true proportion of parents who are happy they had children is close to 91%?

(a) Yes
(b) No

Inaccurate Responses

• In a study on US students, 93% of the sample said they were in the top half of the sample regarding driving skill

Svenson, O. (February 1981). "Are we all less risky and more skillful than our fellow drivers?".  Acta Psychologica 47 (2): 143–148.

• From a random sample of all US college students, 22.7% reported using illicit drugs. Do you think this number is accurate?

Substance Abuse and Mental Health Services Administration (2010). “Results from the 2009 National Survey on Drug Use and Health: Volume 1.” Summary of National Findings (Office of Applied Studies, NSDUH Series H-38A, HHS Publication No. SMA 10-4856Findings). Rockville, MD, https://nsduhweb.rti.org/

Summary

• Data are collected on a sample, and we would like to use the data to make inferences about the larger population

• Sampling bias can occur when the sample does not resemble the population

• Sampling bias can be avoided by random sampling

• Bias exists when the sample data do not accurately reflect the true population data, and bias can occur in many ways

• When making conclusions based on data, STOP AND THINK ABOUT HOW THE DATA WERE COLLECTED!

Summary

Always think critically about how the data were collected, and recognize that not all forms of data collection lead to valid inferences

To Do

• Complete the class survey on Sakai (due Monday, 1/23)

• Email me if you still need a textbook

• Email me with your Gmail address if you still need an RStudio account

• Buy a clicker (grading starts 1/30) (go to this google doc if you want to buy one used from a previous student)