Unit 1: Introduction to data
1. Data Collection + Observational studies & experiments

Sta 101 - Spring 2016
Duke University, Department of Statistical Science
Dr. Çetinkaya-Rundel

Slides posted at http://bit.ly/sta101_s16

Outline

1. Main ideas
   1. Use a sample to make inferences about the population
   2. Ideally use a simple random sample, stratify to control for a variable, and cluster to make sampling easier
   3. Sampling schemes can suffer from a variety of biases
   4. Experiments use random assignment to treatment groups, observational studies do not
   5. Four principles of experimental design: randomize, control, block, replicate
   6. Random sampling helps generalizability, random assignment helps causality

2. Summary

1. Use a sample to make inferences about the population

▶ Ultimate goal: make inferences about populations

▶ Caveat: populations are difficult or impossible to access

▶ Solution: use a sample from that population, and use statistics from that sample to make inferences about the unknown population parameters

▶ The better (more representative) our sample, the more reliable our estimates and the more accurate our inferences will be

Suppose we want to know how many offspring female lemurs have, on average. It’s not feasible to obtain offspring data from all female lemurs, so we use data from the Duke Lemur Center. We use the sample mean from these data as an estimate for the unknown population mean. Can you see any limitations to using data from the Duke Lemur Center to make inferences about all lemurs?
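The idea of using a sample statistic to estimate a population parameter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population below is simulated (it is not real lemur data), and the names `sample_mean`, `population`, and `estimate` are made up for the example. In practice the population mean would be unknown; it is computed here only so the estimate can be compared against it.

```python
import random
import statistics

def sample_mean(population, n, rng):
    """Mean of a simple random sample of size n, drawn without replacement."""
    return statistics.mean(rng.sample(population, n))

rng = random.Random(101)  # fixed seed so the sketch is reproducible

# Hypothetical population: offspring counts for 10,000 females.
# These values are simulated for illustration, not real lemur data.
population = [rng.randint(0, 6) for _ in range(10_000)]

# The parameter (normally unknown, shown here only for comparison).
population_mean = statistics.mean(population)

# The statistic we can actually compute from a sample.
estimate = sample_mean(population, 200, rng)

print(f"population mean:     {population_mean:.2f}")
print(f"sample mean (n=200): {estimate:.2f}")
```

Because the sample here is a simple random sample from the whole population, the estimate tends to land close to the parameter. A sample from a single facility offers no such guarantee, which is the limitation the lemur question points at.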

Sampling is natural

▶ When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, that’s exploratory analysis

▶ If you generalize and conclude that your entire soup needs salt, that’s an inference

▶ For your inference to be valid, the spoonful you tasted (the sample) needs to be representative of the entire pot (the population)

2. Ideally use a simple random sample, stratify to control for a variable, and cluster to make sampling easier

▶ Simple random: drawing names from a hat

▶ Stratified: homogeneous strata; stratify to control for SES

▶ Cluster: heterogeneous clusters; sample all chosen clusters

▶ Multistage: random sample in chosen clusters

[Figure: four panels of points illustrating simple random sampling, stratified sampling (Strata 1 to 6), cluster sampling (Clusters 1 to 9), and multistage sampling (Clusters 1 to 9)]
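The four sampling schemes above can be sketched in Python as follows. This is a minimal sketch, not a survey-grade implementation: the sampling frame `households`, its `stratum` and `cluster` labels, and all function names are hypothetical and chosen only to mirror the six strata and nine clusters in the figure.

```python
import random
from collections import defaultdict

def group_by(units, key):
    """Group units (dicts) by the value stored under `key`."""
    groups = defaultdict(list)
    for u in units:
        groups[u[key]].append(u)
    return groups

def simple_random_sample(units, n, rng):
    """Every unit has the same chance of selection (names from a hat)."""
    return rng.sample(units, n)

def stratified_sample(units, key, n_per_stratum, rng):
    """Take a random sample from within every stratum."""
    out = []
    for members in group_by(units, key).values():
        out.extend(rng.sample(members, n_per_stratum))
    return out

def cluster_sample(units, key, n_clusters, rng):
    """Randomly choose clusters, then keep every unit in the chosen clusters."""
    groups = group_by(units, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [u for c in chosen for u in groups[c]]

def multistage_sample(units, key, n_clusters, n_per_cluster, rng):
    """Randomly choose clusters, then randomly sample within each one."""
    groups = group_by(units, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    out = []
    for c in chosen:
        out.extend(rng.sample(groups[c], n_per_cluster))
    return out

rng = random.Random(1)  # fixed seed for reproducibility

# Hypothetical sampling frame: 90 households, 6 strata, 9 clusters.
households = [
    {"id": i, "stratum": i % 6 + 1, "cluster": i % 9 + 1} for i in range(90)
]

srs = simple_random_sample(households, 12, rng)
strat = stratified_sample(households, "stratum", 2, rng)
clus = cluster_sample(households, "cluster", 3, rng)
multi = multistage_sample(households, "cluster", 3, 4, rng)

print(len(srs), len(strat), len(clus), len(multi))
```

Note the structural difference: stratified sampling touches every stratum, while cluster and multistage sampling visit only the chosen clusters, which is what makes them cheaper to carry out.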

Clicker question

A city council has requested a household survey be conducted in a suburban area of their city. The area is broken into many distinct and unique neighborhoods, some including large homes, some with only apartments, and others a diverse mixture of housing structures. Which approach would likely be the least effective?

(a) Simple random sampling
(b) Stratified sampling, where each stratum is a neighborhood
(c) Cluster sampling, where each cluster is a neighborhood

Answer: (c) Cluster sampling, where each cluster is a neighborhood

3. Sampling schemes can suffer from a variety of biases

▶ Non-response: If only a small fraction of the randomly sampled people choose to respond to a survey, the sample may no longer be representative of the population

▶ Voluntary response: Occurs when the sample consists of people who volunteer to respond because they have strong opinions on the issue; such a sample will also not be representative of the population

▶ Convenience sample: Individuals who are easily accessible are more likely to be included in the sample

Clicker question

A school district is considering whether it will no longer allow high school students to park at school after two recent accidents where students were severely injured. As a first step, they survey parents by mail, asking them whether or not the parents would object to this policy change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were completed, 960 agreed with the policy change and 240 disagreed. Which of the following statements are true?

I. Some of the mailings may have never reached the parents.
II. Overall, the school district has strong support from parents to move forward with the policy approval.
III. It is possible that the majority of the parents of high school students disagree with the policy change.
IV. The survey results are unlikely to be biased because all parents were mailed a survey.

(a) Only I   (b) I and II   (c) I and III   (d) III and IV   (e) Only IV

Answer: (c) I and III
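The arithmetic behind statements II and III can be checked directly from the numbers in the question. The sketch below uses only the counts given (6,000 mailed, 1,200 returned, 960 agree, 240 disagree); the variable names are made up for the example, and the bounds assume nothing about the non-respondents beyond the fact that their opinions are unknown.

```python
mailed = 6_000
returned = 1_200
agreed = 960
disagreed = 240

response_rate = returned / mailed              # share of surveys returned
agree_among_respondents = agreed / returned    # support among those who replied

non_respondents = mailed - returned            # parents whose opinions are unknown

# Bounds on the share of ALL surveyed parents who agree:
# worst case, every non-respondent disagrees; best case, every one agrees.
lowest_possible = agreed / mailed
highest_possible = (agreed + non_respondents) / mailed

print(f"response rate:               {response_rate:.0%}")
print(f"agree among respondents:     {agree_among_respondents:.0%}")
print(f"agree among all, lower bound: {lowest_possible:.0%}")
print(f"agree among all, upper bound: {highest_possible:.0%}")
```

With only a 20% response rate, overall support could be anywhere between 16% and 96%, so the 80% figure among respondents does not establish strong support, and a majority of parents disagreeing remains possible.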

What type of study is this? What is the scope of inference (causality / generalizability)?

http://www.nytimes.com/2014/06/30/technology/facebook-tinkers-with-users-emotions-in-news-feed-experiment-stirring-outcry.html

4. Experiments use random assignment to treatment groups, observational studies do not

A study that surveyed a random sample of otherwise healthy adults found that people are more likely to get muscle cramps when they’re stressed. The study also noted that people drink more coffee and sleep less when they’re stressed.

What type of study is this?
Observational

What is the conclusion of the study?
There is an association between increased stress and muscle cramps.

Can this study be used to conclude a causal relationship between increased stress and muscle cramps?
No. Muscle cramps might also be due to increased caffeine consumption or sleeping less; these are potential confounding variables.
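One way to see why random assignment matters is to compare how a potential confounder is distributed across groups in each kind of study. The sketch below is a simulation under stated assumptions: the participants, the stress indicator, and the coffee figures (cups per day) are all invented for illustration and do not come from the study described above.

```python
import random
import statistics

rng = random.Random(2016)  # fixed seed for reproducibility

# Hypothetical participants: stressed people tend to drink more coffee.
# All values are simulated for illustration only.
people = []
for _ in range(2_000):
    stressed = rng.random() < 0.5
    coffee = rng.gauss(3.0 if stressed else 2.0, 1.0)  # baseline cups per day
    people.append({"stressed": stressed, "coffee": coffee})

def mean_coffee(group):
    """Average baseline coffee consumption in a group."""
    return statistics.mean(p["coffee"] for p in group)

# Observational comparison: groups are defined by who happens to be stressed,
# so the groups also differ in coffee consumption.
stressed_group = [p for p in people if p["stressed"]]
calm_group = [p for p in people if not p["stressed"]]
obs_gap = mean_coffee(stressed_group) - mean_coffee(calm_group)

# Experimental comparison: groups are formed by random assignment,
# so baseline coffee habits are balanced on average.
shuffled = people[:]
rng.shuffle(shuffled)
treatment, control = shuffled[:1_000], shuffled[1_000:]
exp_gap = mean_coffee(treatment) - mean_coffee(control)

print(f"coffee gap, observational groups:    {obs_gap:.2f} cups/day")
print(f"coffee gap, randomly assigned groups: {exp_gap:.2f} cups/day")
```

In the observational comparison the groups differ in coffee consumption as well as stress, so any difference in cramps cannot be attributed to stress alone. Random assignment makes the groups similar on such variables before the treatment is applied.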

5. Four principles of experimental design: randomize, control, block, replicate

▶ We would like to design an experiment to investigate if increased stress causes muscle cramps:

– Treatment: increased stress
– Control: no or baseline stress

▶ It is suspected that the effect of stress might be different on younger and older people: block for age.

Why is this important? Can you think of other variables to block for?
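Blocking for age, as suggested above, amounts to randomizing separately within each age group. The following is a minimal sketch, assuming a hypothetical list of 40 participants split evenly into "younger" and "older"; the function name `blocked_assignment` and the field names are made up for the example.

```python
import random
from collections import Counter, defaultdict

def blocked_assignment(participants, block_key, rng):
    """Randomly assign half of each block to treatment, the rest to control."""
    blocks = defaultdict(list)
    for p in participants:
        blocks[p[block_key]].append(p)

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]          # copy so the input order is untouched
        rng.shuffle(shuffled)          # randomize within the block
        half = len(shuffled) // 2
        for p in shuffled[:half]:
            assignment[p["id"]] = "treatment"   # increased stress
        for p in shuffled[half:]:
            assignment[p["id"]] = "control"     # no or baseline stress
    return assignment

rng = random.Random(9)  # fixed seed for reproducibility

# Hypothetical participants: 20 younger and 20 older people.
participants = [
    {"id": i, "age_group": "younger" if i < 20 else "older"} for i in range(40)
]

assignment = blocked_assignment(participants, "age_group", rng)

# Each age group ends up evenly represented in both arms.
counts = Counter((p["age_group"], assignment[p["id"]]) for p in participants)
for key in sorted(counts):
    print(key, counts[key])
```

Because each age group is split evenly between treatment and control, a difference in outcomes between the arms cannot be explained by one arm simply having more older participants.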

6. Random sampling helps generalizability, random assignment helps causality

▶ Random sampling, random assignment (ideal experiment): causal conclusion, generalized to the whole population

▶ Random sampling, no random assignment (most observational studies): no causal conclusion; correlation statement generalized to the whole population

▶ No random sampling, random assignment (most experiments): causal conclusion, only for the sample

▶ No random sampling, no random assignment (bad observational studies): no causal conclusion; correlation statement only for the sample

Rows: random sampling gives generalizability; no random sampling gives no generalizability.
Columns: random assignment gives causation; no random assignment gives correlation.

Summary of main ideas

1. Use a sample to make inferences about the population
2. Ideally use a simple random sample, stratify to control for a variable, and cluster to make sampling easier
3. Sampling schemes can suffer from a variety of biases
4. Experiments use random assignment to treatment groups, observational studies do not
5. Four principles of experimental design: randomize, control, block, replicate
6. Random sampling helps generalizability, random assignment helps causality