STAT 113 Inferential Statistics I


Page 1: STAT 113 Inferential Statistics I

Inference Goals Parameter Estimation Sampling Distributions

STAT 113 Inferential Statistics I

Foundational Concepts

Colin Reimer Dawson

March 30, 2022

Page 2: STAT 113 Inferential Statistics I

Sampling and Inference: The “Big Picture”

Page 3: STAT 113 Inferential Statistics I

Population (AKA the “Process” or the “Phenomenon”): All potential cases that we are interested in saying something about, or, equivalently, the process that generated the data (and could generate a different dataset if the study were repeated).

Sample (AKA the Data Set): The set of cases we actually have data for (a subset of the population, or a single snapshot produced by the data-generating process).

Statistical Inference: Using a representative snapshot (a data set) to say something about an underlying population/process/phenomenon.

Page 4: STAT 113 Inferential Statistics I

Outline

Inference Goals

Parameter Estimation

Sampling Distributions

Page 5: STAT 113 Inferential Statistics I

Two Main Goals of Inference

1. Estimating unknown quantities in a population using a dataset (by reporting confidence intervals)

2. Assessing strength of evidence about “yes/no” questions (by carrying out hypothesis tests)

Page 6: STAT 113 Inferential Statistics I

Statistics vs. Parameters

• Summary values (like mean, median, standard deviation) exist for both Populations/Processes/Phenomena and for Samples/dataSets/Snapshots.
• In a Population/Process, such a summary value is called a Parameter.
• In a Sample/dataSet/Snapshot, these values are called Statistics, and are used to make inferences about the corresponding parameter.

Notation: Parameters and Statistics

Summary Value          Parameter    Statistic
Mean                   µ            X̄
Proportion             p            p̂
Correlation            ρ            r
Slope of a Line        β            b̂
Difference in Means    µ₁ − µ₂      X̄₁ − X̄₂
. . .                  . . .        . . .
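As a minimal sketch of the notation table, the Python below computes three of the listed statistics (X̄, p̂, and X̄₁ − X̄₂) from a tiny data set. The measurements, the 66-minute cutoff, and the variable names are all hypothetical, invented purely for illustration. The corresponding parameters (µ, p, µ₁ − µ₂) stay unknown; the statistics are what we would use to estimate them.

```python
from statistics import mean

# Hypothetical flavor-life measurements (minutes) from two small samples.
# These numbers are made up for illustration; they are not from the slides.
sample_1 = [64.0, 66.5, 68.0, 65.5, 67.0]
sample_2 = [70.0, 72.5, 69.0, 71.5, 73.0]

# Sample mean (X-bar): used to estimate the population mean mu.
x_bar_1 = mean(sample_1)
x_bar_2 = mean(sample_2)

# Sample proportion (p-hat): share of sample 1 lasting longer than 66 minutes.
# Used to estimate the population proportion p.
p_hat = sum(x > 66.0 for x in sample_1) / len(sample_1)

# Difference in sample means: used to estimate mu_1 - mu_2.
diff_in_means = x_bar_1 - x_bar_2

print(f"x_bar_1 = {x_bar_1:.2f}, x_bar_2 = {x_bar_2:.2f}")
print(f"p_hat = {p_hat:.2f}")
print(f"x_bar_1 - x_bar_2 = {diff_in_means:.2f}")
```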


Page 8: STAT 113 Inferential Statistics I

Using Data to Make Inferences About Phenomena

• I want to know the mean flavor-life (in minutes) of gumballs produced by my gumball factory.
• Gumball production at my factory is the process/phenomenon of interest (the population being all potential gumballs produced).
• The mean flavor-life of all gumballs produced from the factory is a parameter (write µ for the pop. mean).
• I can only test a subset (a sample/snapshot), ideally a random one.
• The mean flavor-life of the gumballs tested is a statistic (write x̄ for the sample mean).

• A Statistic is a summary of a Sample/DataSet/Snapshot.
• A Parameter is a summary of the Population/Process/Phenomenon itself.
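The gumball setup can be sketched in Python as follows. This is only an illustration under stated assumptions: flavor life is taken to be roughly normal with mean 66.8 minutes (the process mean shown in this deck's histograms) and a standard deviation of 4 minutes, which is invented here. The "population" is a large simulated list standing in for all potential gumballs.

```python
import random
from statistics import mean

rng = random.Random(113)  # fixed seed so the sketch is repeatable

# Assumption: flavor life ~ normal(mean 66.8, SD 4.0) minutes.
# The SD and the normal shape are invented for illustration.
population = [rng.gauss(66.8, 4.0) for _ in range(100_000)]

# Parameter: the mean over the whole (simulated) population.
# In a real factory this value would be unknown.
mu = mean(population)

# Statistic: the mean of a random sample of 10 gumballs we actually test.
sample = rng.sample(population, 10)
x_bar = mean(sample)

print(f"parameter mu    = {mu:.2f}")
print(f"statistic x_bar = {x_bar:.2f}")
```

In practice only `x_bar` is observable; `mu` is computable here only because the population is simulated.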


Page 13: STAT 113 Inferential Statistics I

Variability due to Sampling

If all potential gumballs have a flavor-life distribution represented by this histogram...

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 66.8]

then a random sample of 10 gumballs might have flavor lives like...

[Histogram: flavor life (minutes) of the sampled gumballs; sample mean = 65.7]

Page 14: STAT 113 Inferential Statistics I

Variability due to Sampling

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 66.8]

Another sample of 10 gumballs might have different flavor-lives:

[Histogram: flavor life (minutes) of the sampled gumballs; sample mean = 66.5]

Page 15: STAT 113 Inferential Statistics I

Variability due to Sampling

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 66.8]

Yet another sample of 10 gumballs might look like this:

[Histogram: flavor life (minutes) of the sampled gumballs; sample mean = 66.5]

Page 16: STAT 113 Inferential Statistics I

Variability due to Sampling

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 66.8]

Or this:

[Histogram: flavor life (minutes) of the sampled gumballs; sample mean = 66.3]

Page 17: STAT 113 Inferential Statistics I

Variability due to Sampling

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 66.8]

We could get this one, but it’s less likely:

[Histogram: flavor life (minutes) of the sampled gumballs; sample mean = 70.7]

Page 18: STAT 113 Inferential Statistics I

Variability due to Sampling

On the other hand, if the process produced gumballs with greater longevity in general...

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 71.8]

then the first four sample means (65.7, 66.5, 66.5, 66.3) are unlikely, whereas the last one (70.7) is more likely.

[Histogram: flavor life (minutes) of the sampled gumballs; sample mean = 70.7]
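One way to make "less likely" and "more likely" concrete is to simulate many samples of 10 under each process and count how often the sample mean comes out high. The sketch below does this for process means 66.8 and 71.8. It assumes normally distributed flavor lives with a standard deviation of 4 minutes; the slides give neither, so both are assumptions, and the function name and the 70-minute threshold are hypothetical.

```python
import random

def share_of_means_at_least(process_mean, threshold, n=10, reps=10_000,
                            sd=4.0, seed=113):
    """Estimate how often the mean of n simulated gumballs is >= threshold.

    Assumes normal flavor lives with the given SD (an illustrative choice).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.gauss(process_mean, sd) for _ in range(n)) / n
        if x_bar >= threshold:
            hits += 1
    return hits / reps

# How often does a sample of 10 have a mean of at least 70 minutes?
share_low = share_of_means_at_least(66.8, 70.0)   # process mean 66.8
share_high = share_of_means_at_least(71.8, 70.0)  # process mean 71.8

print(f"process mean 66.8: share of sample means >= 70 is {share_low:.3f}")
print(f"process mean 71.8: share of sample means >= 70 is {share_high:.3f}")
```

Under these assumptions, a high sample mean is rare for the first process and common for the second, which is the reasoning that lets an observed statistic count as evidence about the parameter.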

Page 19: STAT 113 Inferential Statistics I

Variability due to Sampling

• Each potential dataset (sample) is an imperfect/incomplete snapshot of the underlying population/process/phenomenon.
• Therefore, statistics are imperfect reflections of the underlying parameters.
• However, if samples are representative, statistics are usually close to the corresponding parameter.
• So, we can estimate (with some, but not full, certainty) that the unknown underlying parameter is probably close to the corresponding statistic.
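As a rough check on "usually close", the sketch below estimates how often the mean of a random sample of 10 lands within 2 minutes of the process mean. The normal shape, the 4-minute standard deviation, the 2-minute margin, and the function name are all assumptions made for illustration, not values from the slides.

```python
import random

def share_within(margin, mu=66.8, sd=4.0, n=10, reps=10_000, seed=113):
    """Estimate the share of sample means within `margin` of the parameter mu."""
    rng = random.Random(seed)
    close = 0
    for _ in range(reps):
        x_bar = sum(rng.gauss(mu, sd) for _ in range(n)) / n
        if abs(x_bar - mu) <= margin:
            close += 1
    return close / reps

share_close = share_within(2.0)
print(f"share of sample means within 2 minutes of mu: {share_close:.3f}")
```

The share is large but below 1, matching the "some, but not full, certainty" wording: most random samples give a statistic near the parameter, and a few do not.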



Page 24: STAT 113 Inferential Statistics I

Definition: Sampling Distribution

• Consider all possible datasets of a certain sample size, n, produced by taking a representative snapshot (sample) from a process/phenomenon/population.
• Each one has its own value for a particular statistic (like the mean of a certain variable).
• A sampling distribution is the collection of values of all of these statistics (such as sample means).
• Note that this is a hypothetical/theoretical construction; we almost never actually have more than one dataset/sample/statistic.
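Although the sampling distribution is a theoretical construction, it can be approximated by simulation when we are willing to assume a data-generating process. The sketch below draws many samples of size n = 10 and keeps each sample's mean. It covers a large number of possible datasets rather than all of them, and it assumes normal flavor lives with mean 66.8 and a standard deviation of 4 minutes; the SD, the normal shape, and the function name are assumptions.

```python
import random
from statistics import mean, stdev

def simulate_sample_means(mu=66.8, sd=4.0, n=10, reps=5_000, seed=113):
    """Return `reps` sample means, each from a fresh simulated sample of size n."""
    rng = random.Random(seed)
    return [sum(rng.gauss(mu, sd) for _ in range(n)) / n for _ in range(reps)]

# Each entry is one statistic (a sample mean) from one potential dataset.
sample_means = simulate_sample_means()

center = mean(sample_means)   # where the sample means cluster
spread = stdev(sample_means)  # how much they vary from sample to sample

print(f"number of simulated datasets: {len(sample_means)}")
print(f"center of the sample means:   {center:.2f}")
print(f"spread of the sample means:   {spread:.2f}")
```

With real data we would hold just one of these values; the simulation shows the other values we might have seen instead.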


Page 28: STAT 113 Inferential Statistics I

Sample Distribution ≠ Sampling Distribution

• The cases in a sample are individual observations.
• The cases in a sampling distribution are statistics (such as means), each from a different potential dataset.
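The distinction can be seen by building both objects side by side. In this sketch, `one_sample` holds 10 individual observations, while `sample_means` holds 2,000 statistics, each computed from its own simulated dataset of 10. The normal model with mean 66.8 and a standard deviation of 4 minutes is an assumption for illustration, as are the names.

```python
import random
from statistics import stdev

rng = random.Random(113)
MU, SD, N = 66.8, 4.0, 10  # assumed process mean, assumed SD, sample size

# Sample distribution: the cases are 10 individual gumballs.
one_sample = [rng.gauss(MU, SD) for _ in range(N)]

# Sampling distribution (approximated): the cases are sample means,
# one per simulated dataset of N gumballs.
sample_means = [
    sum(rng.gauss(MU, SD) for _ in range(N)) / N
    for _ in range(2_000)
]

print(f"cases in the sample:               {len(one_sample)} observations")
print(f"cases in the sampling distribution: {len(sample_means)} means")
print(f"SD of the individual observations: {stdev(one_sample):.2f}")
print(f"SD of the sample means:            {stdev(sample_means):.2f}")
```

Sample means typically vary less than individual observations, because averaging cancels some of the gumball-to-gumball variation.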

Page 29: STAT 113 Inferential Statistics I

If the process produces a flavor-life distribution like this:

[Histogram: flavor life (minutes) of all potential gumballs; process mean = 66.8]

which could yield any of the following data sets:

[Histograms: flavor life (minutes) for three samples of 10 gumballs; sample means = 65.7, 66.5, 66.5]

then each potential set of 10 gumballs has a mean flavor life. The sampling distribution of all such potential means might look like this:

[Histogram: mean flavor life (minutes) across potential samples]


Page 31: STAT 113 Inferential Statistics I

Demo: StatKey

http://lock5stat.com/statkey