Confidence Intervals Lecture 3. Confidence Intervals for the Population Mean (or percentage) For...

Confidence Intervals

Lecture 3

Confidence Intervals for the Population Mean (or percentage)

For studies with large samples, “approximately 95% of the time, the population mean will be in the interval given by the sample mean plus or minus two standard errors.”

A confidence interval is a range of values

WARNING! This DOES NOT imply that for a given experiment the population parameter has a 95% chance of being in the confidence interval.

Pregnancy Analogy

CORRECT Interpretation: “I am 95% confident that

Learning to work with the Gaussian/Normal Distribution

What if I want a 90% confidence interval?

We need to learn how to calculate probabilities based on the Normal Distribution!

For confidence intervals, we are working with the sample mean, a random variable which approximately follows a normal distribution. We will talk now in general about a random variable, Y, which follows a normal distribution with mean μ and standard deviation σ.

Framework for the Calculation

Define z as the value such that (1-α)100% of the population would fall within z standard deviations of the population mean (as in the Empirical Rule).

This is equivalent to saying that a single random variable has a (1-α)100% chance of falling within z standard deviations of the population mean.

In the context of confidence intervals (and hypothesis tests), z is called a critical value.
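As an illustration of how a critical value could be obtained without a printed table, here is a minimal Python sketch using the standard library's `statistics.NormalDist`. The helper name `z_critical` is an assumption made for this example and does not come from the lecture.

```python
from statistics import NormalDist


def z_critical(confidence: float) -> float:
    """Two-sided critical value z for a confidence level such as 0.95.

    Hypothetical helper for illustration only.
    """
    alpha = 1.0 - confidence
    # Leave alpha/2 in each tail, so z is the (1 - alpha/2) quantile
    # of the standard normal distribution.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

The 95% case returns the 97.5th percentile of the standard normal distribution, because 2.5% is left in each tail.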

Then (1-α)100% of the time, the following statements are true:

The Mathematics

FACT: $(Y - \mu)/\sigma$ follows a “standard normal” distribution.

The Z-score

All normal calculations are based on a z-score:

$z = (Y - \mu)/\sigma$

where Y is a normal random variable (which may be a single value, an average, etc.) with mean μ and standard deviation (or standard error) σ.

The z-score measures how many standard deviations (or standard errors) an observed value of a normal random variable is away from its mean.

Z-scores follow a standard normal distribution, a normal distribution with mean 0 and standard deviation 1.

Probabilities from the Standard Normal

Chapter 4 in textbook & Table A5.2 on p.366

Areas under the Normal curve represent probabilities or proportions of the population.

The total area under the curve equals one.

Column #2 gives the probability of falling within z standard deviations of the mean, where z is given in column #1.

May be used to refer to population distributions or sampling distributions.

Application of a Normal Distribution to a Population Distribution of IQ’s

Suppose that I am going to measure IQ scores of UMDNJ – School of Public Health students.

The national average of IQ scores is 100 points and the standard deviation is 16 points. For now assume that the national average equals the average for public health students.

According to the Empirical Rule, approximately

68% of scores are between 84 and 116, and

95% of scores are between 68 and 132.

IQ Example continued

Let’s calculate these percentages exactly.

Plus or minus 1 st. dev.: 68.27% = proportion of .6827

Plus or minus 2 st. dev.: 95.45% = proportion of .9545

The probability of a single observed IQ falling within one standard deviation of the mean is equivalent to the proportion of all IQ’s that fall within one standard deviation of the mean.

Probabilities for Single Observations

Calculate the probability that the IQ score of one public health student will be within 10 points of the national average.

Distance between mean and observation = 10

Z-score = 10/16 = 0.625 std. devs.

Probability = 0.4679
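The table lookups in the IQ example can be checked numerically. The sketch below assumes Python's standard-library `statistics.NormalDist`. The function name `prob_within` is hypothetical and introduced only for this example.

```python
from statistics import NormalDist


def prob_within(distance: float, sd: float) -> float:
    """P(|Y - mean| <= distance) for a normal Y with standard deviation sd.

    Hypothetical helper for illustration only.
    """
    z = distance / sd  # z-score of the distance
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Area under the standard normal curve between -z and +z.
    return std_normal.cdf(z) - std_normal.cdf(-z)


print(round(prob_within(16, 16), 4))  # within 1 standard deviation
print(round(prob_within(32, 16), 4))  # within 2 standard deviations
print(round(prob_within(10, 16), 4))  # within 10 IQ points of the mean
```

Small differences in the last decimal place relative to a printed table can arise from rounding the z-score before the lookup.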

Probabilities for Sample Means (Sampling Distributions)

Suppose I do repeated experiments (studies) in which I draw 10 students for each study.

Standard error for a mean of 10 measurements SEM = 16/(square root of 10) = 5.06

Therefore, in approximately 68% of these experiments, the average score will be between (100-5.06=) 94.94 and 105.06, and in approximately 95% of these experiments, the average score will be between (100-2*5.06=) 89.88 and 110.12.

Equivalently, for any one experiment, we have approximately a 95% chance that the sample mean will fall between 89.88 and 110.12.

Sampling Average IQ’s, cont.

Calculate the probability that the average IQ score of 10 public health students will be within 10 points of the national average, given that the public health distribution of IQ’s is the same as the national distribution.

Standard error for a mean of 10 measurements: SEM = 16/(square root of 10) = 5.060

• Z-score = 10/5.060 = 1.976
• Probability within ± 1.976 standard errors = about 95%

Much larger probability than when looking at a single measurement!
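The contrast between a single measurement and a sample mean can be reproduced with a short Python sketch. It assumes the population standard deviation (16) is known, as in the example. `prob_mean_within` is a hypothetical name used only here.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_within(distance: float, sd: float, n: int) -> float:
    """P(|sample mean - population mean| <= distance) for n observations.

    Assumes the sample mean is (approximately) normal with
    standard error sd / sqrt(n). Hypothetical helper.
    """
    sem = sd / sqrt(n)  # standard error of the mean
    z = distance / sem  # distance expressed in standard errors
    return 2.0 * NormalDist().cdf(z) - 1.0


# n = 1 is a single student; n = 10 is the mean of ten students.
for n in (1, 10):
    print(n, round(prob_mean_within(10, 16, n), 3))
```

Raising `n` shrinks the standard error, so the same 10-point window covers more standard errors and the probability rises.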

95% Large Sample Confidence Interval

Suppose that I sample 50 public health students and find that

Sample Mean = 110.7 points

and the Standard Error Estimated from Sample = SEM = 2.5 points.

What’s the value of z such that 95% of sample means would fall within z standard errors of the population mean?

z = 1.96

Therefore a 95% confidence interval is

110.7 ± 1.96(2.5) = (105.800, 115.600)

Interpretation: We are 95% confident that the mean IQ for all public health students falls between 105.8 and 115.6 points.

How likely is it that the mean for the public health students is the same as the national mean (100)?
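The interval arithmetic above can be sketched in Python as follows. The function name `z_interval` and its `confidence` parameter are assumptions for illustration. The critical value comes from `statistics.NormalDist` rather than a rounded table value, so the endpoints may differ from hand calculations in the third decimal place.

```python
from statistics import NormalDist


def z_interval(mean: float, sem: float, confidence: float = 0.95):
    """Large-sample interval: mean +/- z * SEM. Hypothetical helper."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z * sem
    return mean - margin, mean + margin


low, high = z_interval(110.7, 2.5)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

Passing a different `confidence` (for example 0.90) reuses the same calculation with a different critical value.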

90% Large Sample Confidence Interval

Sample Mean = 110.7 points

Standard Error of the Mean = SEM = 2.5 points

z = 1.645

Therefore a 90% confidence interval is 110.7 ± 1.645(2.5)

= (106.588, 114.813)

Interpretation:

What Confidence Level should I use?

The more confidence we would like to claim in our interval estimate, the wider the interval must be.

The standard confidence people use is 95%.

Smaller confidence levels (such as 90%) may be appropriate, especially for pilot studies or other studies with small sample sizes.

Larger confidence levels may also be appropriate when costly reforms rest on the conclusions from those confidence intervals.

Large Sample Confidence Interval for a Population Mean

Sample mean ± z SEM

Where SEM = square root of s²/n

Typically, 95% confidence intervals are used, with

z =1.96

Here, 1.96 is the 97.5th percentile of the standard normal distribution

To be modified for smaller sample sizes shortly

Overweight Status and Eating Patterns… (AJPH, 2002)

95% CI for mean BMI of girls is 23.3 +/- 1.96*4.9/sqrt(2000) = (23.09, 23.51)

95% CI for mean BMI of boys is 23.0 +/- 1.96*4.8/sqrt(2000) = (22.79, 23.21)

Unfortunately, we have target percentiles for BMI, not target means.
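When only summary statistics are reported, as in the BMI example, the interval can be computed from the sample mean, sample standard deviation, and sample size. The sketch below is one possible way to write this. `mean_ci_from_summary` is a hypothetical name, and z is fixed at 1.96 to match the slides.

```python
from math import sqrt


def mean_ci_from_summary(mean: float, s: float, n: int, z: float = 1.96):
    """Large-sample CI from summary statistics, with SEM = sqrt(s^2 / n).

    Hypothetical helper for illustration only.
    """
    sem = sqrt(s ** 2 / n)
    margin = z * sem
    return mean - margin, mean + margin


# Values taken from the BMI example above.
for label, mean, s in (("girls", 23.3, 4.9), ("boys", 23.0, 4.8)):
    low, high = mean_ci_from_summary(mean, s, n=2000)
    print(f"{label}: ({low:.2f}, {high:.2f})")
```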

Large Sample Confidence Intervals for Population Proportions

p ± z (SE)

Where the estimated variance of the sample proportion is SE² = p(1-p)/n

Typically, 95% confidence intervals are used, with

z =1.96

Actual Percent in the Top Targeted 5th Percentile – 95% Confidence Interval

For girls,

p = .125, n=2099

SE = sqrt((.125*.875)/2099) = .007219

p ± z (SE) = (0.111, 0.139), i.e., (11.1%, 13.9%)

For boys,

p = .166, n=2141

SE = sqrt((.166*.834)/2141) = .008041

p ± z (SE) = (0.150, 0.182), i.e., (15.0%, 18.2%)

I am 95% confident that the percent of boys that is in the targeted upper 5th percentile is actually between 15.0% and 18.2% of boys.
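The proportion intervals for girls and boys follow the same pattern, which a short Python sketch makes explicit. The helper `proportion_ci` is an assumed name for this example. It implements the large-sample formula p ± z·sqrt(p(1-p)/n) with z = 1.96.

```python
from math import sqrt


def proportion_ci(p: float, n: int, z: float = 1.96):
    """Large-sample CI for a proportion: p +/- z * sqrt(p(1-p)/n).

    Hypothetical helper for illustration only.
    """
    se = sqrt(p * (1.0 - p) / n)
    margin = z * se
    return p - margin, p + margin


# Sample proportions and sizes from the slide above.
for label, p, n in (("girls", 0.125, 2099), ("boys", 0.166, 2141)):
    low, high = proportion_ci(p, n)
    print(f"{label}: ({low:.3f}, {high:.3f})")
```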

“Men Carrying Pollutant Have More Boys”

For description, see news article.

n =101

p = .57

$SE = \sqrt{p(1-p)/n} = \sqrt{.57(.43)/101} = .049$

Thus a 95% confidence interval for the proportion is given by .57 ± 1.96 (.049)

Or equivalently (0.473, 0.667)

Interpretation for Pollutant – Boys E.g.

I am 95% confident that the true proportion of boy babies born from parents who both have detectable PCB levels in their blood is between 0.473 and 0.667.

I.e., if this is one of the 95% of the times that the true parameter falls in the interval, then the proportion is between 0.473 and 0.667.

If the proportion of boys in the general population is 0.51, is there a difference in the proportion of boys from parents with detectable PCB levels?
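One informal way to approach a question like this is to check whether the reference value lies inside the confidence interval. The Python sketch below does only that. It is not a formal hypothesis test, and the helper name `contains` is an assumption for this example. The interval endpoints are copied from the slides.

```python
def contains(interval, value) -> bool:
    """True if value lies inside the (low, high) interval, endpoints included.

    Hypothetical helper for illustration only.
    """
    low, high = interval
    return low <= value <= high


pcb_interval = (0.473, 0.667)  # 95% CI for the proportion of boys
iq_interval = (105.8, 115.6)   # 95% CI for the mean IQ of students

print(contains(pcb_interval, 0.51))  # general-population proportion of boys
print(contains(iq_interval, 100))    # national mean IQ
```

A reference value inside the interval is compatible with the data at that confidence level. A value outside it is not.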

Assumptions for “Large Sample CI’s”

Random sample
• Every individual in the population has an equal chance of being selected.

Independent observations
• Selecting one subject does not alter the chances of selection of another subject.

Large Sample Size – so that the Central Limit Theorem “kicks in” and the standard error is accurately estimated
• How large is “large”?
• Depends!
• If population distribution is approximately normal, then “large” is approximately 40 subjects.
• The farther the population distribution is away from normal, the larger the needed size of the sample.
• E.g., binary variables: at least 30 subjects, with at least 5 in each category.

Other Assumptions
• Study Population = Target Population
• Variable accurately measures characteristic of interest
• Values are correctly measured and recorded

Quick Re-evaluation of Large Sample CI’s for Population Means

Why do we use a z-critical value in the CI?

Due to the CLT, we can say that if the sample size is large, the sample mean will fall within the specified number (z) of standard errors of the population mean with the specified probability.

More specifically, the 95% CI is derived from noting that, 95% of the time,

$z = (\bar{Y} - \mu)/(\sigma/\sqrt{n})$

falls between –1.96 and +1.96.

• The standard deviation is assumed to be known.
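The step from this statement about z to an interval for μ is a rearrangement of inequalities. A sketch of the algebra, assuming σ is known and writing $\bar{Y}$ for the sample mean:

```latex
\begin{align*}
-1.96 \le \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \le 1.96
&\iff -1.96\,\frac{\sigma}{\sqrt{n}} \le \bar{Y} - \mu \le 1.96\,\frac{\sigma}{\sqrt{n}} \\
&\iff \bar{Y} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{Y} + 1.96\,\frac{\sigma}{\sqrt{n}}
\end{align*}
```

Since the first statement holds about 95% of the time, so does the last, which is the interval $\bar{Y} \pm 1.96\,\sigma/\sqrt{n}$.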

CI’s for Population Means when the Sample Sizes are Smaller

ISSUE: The standard deviation is unknown and, hence, it is estimated.

This estimate is generally OK for large sized samples, but not for smaller sized samples.

PROBLEM: There is more variation in the z-score with an estimated standard deviation than is allowed for by the Normal distribution.

Thanks to a smart person!

SOLUTION:
• In 1908, William Sealy Gosset, working at the Guinness Brewery in Dublin, Ireland, showed mathematically that the z-score for the sample average in which the population standard deviation has been replaced by the sample standard deviation follows a “t-distribution” with a specified degrees of freedom (d.f.).
• I.e., $t = (\bar{y} - \mu)/(s/\sqrt{n})$

T-Distribution

What does it look like?
• Like the Normal distribution
• With larger variance, depending on the d.f.

What are “degrees of freedom”?
• It is difficult to define
• “Amount of information about the variance.”
• Practically, for the sample mean, d.f. = n-1

Using the T-distribution for CI’s

Assumptions:
• Approximately normal
• Random and independent sample

d.f. = n-1

CI: $\bar{Y} \pm t^{*} \, s/\sqrt{n}$

See Table A5.3 on page 368 for a table of critical values for 90%, 95% and 99% CI’s.

Cadmium levels (ng/gram) in mothers

The sample sizes are 14 (for smoking) and 18 (for non-smoking mothers).

Distributions are bimodal.

CI using the t critical value

Means and standard errors are:
• Smoking: 20.414 & 6.814/sqrt(14) ng/gram
• Non-smoking: 14.722 & 6.199/sqrt(18) ng/gram

t-critical values for 95% confidence intervals are 2.160 and 2.110 respectively for 13 and 17 degrees of freedom

Cadmium Confidence Intervals

The confidence intervals are:

• Smoking: 20.414 ± 2.160 (1.821) = (16.481, 24.347)

• Non-smoking: 14.722 ± 2.110 (1.461) = (11.639, 17.805)

• These intervals overlap, so they do not by themselves indicate that the mean cadmium levels for the populations of smoking and non-smoking mothers are different.

• After the midterm, we will talk about more specific ways to test this hypothesis. (Comparison of two population means)
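The cadmium intervals can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the t critical values (2.160 and 2.110) are copied from the slides rather than computed, since the standard library has no t-distribution, and `t_interval` is a hypothetical helper name.

```python
from math import sqrt


def t_interval(mean: float, s: float, n: int, t_crit: float):
    """Small-sample CI: mean +/- t* x s / sqrt(n).

    t_crit must be looked up for n - 1 degrees of freedom.
    Hypothetical helper for illustration only.
    """
    sem = s / sqrt(n)  # estimated standard error of the mean
    margin = t_crit * sem
    return mean - margin, mean + margin


# Means, standard deviations, sample sizes, and critical values from the slides.
smoking = t_interval(20.414, 6.814, 14, t_crit=2.160)      # 13 d.f.
non_smoking = t_interval(14.722, 6.199, 18, t_crit=2.110)  # 17 d.f.

print("smoking:     (%.3f, %.3f)" % smoking)
print("non-smoking: (%.3f, %.3f)" % non_smoking)
```

Because the standard error is not rounded before multiplying, the endpoints may differ from the hand calculation in the third decimal place.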

HELP: When do I use which CI?

In general for means, use the t-interval.

• For means (not proportions), the t-interval is robust to violations of the Normality assumption.

• If the sample distribution is far from normal (as determined by, e.g., histograms), non-parametric or other more “exact” methods may be needed.

• These are easier to discuss in the context of hypothesis testing.

When do I use which CI?, cont…

For proportions with large sample sizes, use the z-interval.

• “Large” = 30 with between 5 and 25 subjects in the group of interest

• Is the pollutant-boys sample size OK?

• 101 subjects with at least 43 of each gender.
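The rule of thumb above can be written as a small check. The sketch below encodes one reading of it, namely at least 30 subjects overall and at least 5 in each of the two categories. The function name and its threshold parameters are assumptions for illustration, not a standard API.

```python
def large_enough_for_z(n: int, p_hat: float,
                       min_n: int = 30, min_per_group: int = 5) -> bool:
    """Rule-of-thumb check before using the z-interval for a proportion.

    Hypothetical helper; thresholds follow the slide's rule of thumb.
    """
    in_group = p_hat * n          # expected count in the group of interest
    out_group = (1.0 - p_hat) * n  # expected count in the other group
    return n >= min_n and min(in_group, out_group) >= min_per_group


print(large_enough_for_z(101, 0.57))  # pollutant-boys example
```

Other textbooks use slightly different thresholds, so `min_n` and `min_per_group` are left as parameters.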

For proportions with small sample sizes, there exist “exact” confidence intervals.

See Table A5.1 & pp.16-18

Assumptions for all CI’s presented so far

Observations within the sample are independent of one another.

Sample consists of randomly selected subjects that are representative of the target population.
• In particular, each subject in the study population has an equal chance of being selected.