Biostatistics

Unit 6

Confidence Intervals

Statistical inference

• Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population.

• Estimation involves the use of the data in the sample to calculate a statistic that serves as an estimate of the corresponding parameter in the population from which the sample was drawn.

Types of estimates

• A point estimate is a single numerical value used to estimate the corresponding population parameter.

• An interval estimate consists of two numerical values defining a range that, with a specified degree of confidence, we feel includes the parameter being estimated.

Estimator

• An estimator is a rule or formula that tells how to compute the estimate.

• An estimator is unbiased if its expected value, taken over all possible samples, equals the population parameter it estimates.

Table of unbiased estimators

Sampled and target populations

• The sampled population is the population from which we actually draw the sample.

• The target population is the population about which we wish to make an inference.

• These two populations may or may not be the same.

• When they are the same, it is possible to use statistical inference procedures to make conclusions about the target population.

• If the sampled and target populations are different, conclusions can be made about the target population only on the basis of nonstatistical considerations.

Random and nonrandom samples

The strict validity of statistical procedures depends on the assumption of random samples.

Confidence intervals to be studied

A) Confidence Interval for a Population Mean

B) Confidence Interval for the Difference of Two Population Means

C) Confidence Interval for a Population Proportion

D) Confidence Interval for the Difference of Two Population Proportions

E) Confidence Interval for the Variance of a Normally Distributed Population

F) Confidence Interval for the Ratio of Variances of Two Normally Distributed Populations

A) Confidence interval for a population mean

Estimating the mean

• Estimating the mean of a normally distributed population entails drawing a sample of size n and computing $\bar{x}$, which is used as a point estimate of $\mu$.

• It is more meaningful to estimate $\mu$ by an interval that communicates information regarding the probable magnitude of $\mu$.

Sampling distributions and estimation

Interval estimates are based on sampling distributions. When the sample mean is being used as an estimator of a population mean, and the population is normally distributed, the sample mean will be normally distributed with mean, $\mu_{\bar{x}}$, equal to the population mean, $\mu$, and variance of $\sigma^2_{\bar{x}} = \sigma^2/n$.

The 95% confidence interval

• 95% of the values of $\bar{x}$ making up the distribution will lie within approximately two standard deviations of the mean.

• The actual value is 1.96.

• The interval is marked by the two points $\mu - 1.96\sigma_{\bar{x}}$ and $\mu + 1.96\sigma_{\bar{x}}$, so that 95% of the values are in the interval $\mu \pm 1.96\sigma_{\bar{x}}$.

• Since $\mu$ and $\mu_{\bar{x}}$ are unknown, the location of the distribution is uncertain.

• We can use $\bar{x}$ as a point estimate of $\mu$.

• In constructing intervals of $\bar{x} \pm 1.96\sigma_{\bar{x}}$, about 95% of these intervals would contain $\mu$.

Example

Suppose a researcher, interested in obtaining an estimate of the average level of some enzyme in a certain human population, takes a sample of 10 individuals, determines the level of the enzyme in each, and computes a sample mean of $\bar{x}$ = 22. Suppose further it is known that the variable of interest is approximately normally distributed with a variance of 45. We wish to estimate $\mu$.

Solution

$\bar{x} \pm 1.96\,\sigma_{\bar{x}}$

22 ± 1.96 $\sqrt{45/10}$ = 22 ± 1.96 (2.1213) = 22 ± 4.158

(17.84, 26.16)
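The enzyme example can be checked with a short Python sketch. The function name `z_interval` is illustrative only (it is not part of the slides), and the code assumes the population variance is known, as the example states.

```python
import math

def z_interval(mean, variance, n, z=1.96):
    """Return (lower, upper) for mean ± z * sqrt(variance / n)."""
    standard_error = math.sqrt(variance / n)  # sigma of the sample mean
    margin = z * standard_error
    return mean - margin, mean + margin

# Values from the enzyme example: x-bar = 22, variance = 45, n = 10
lower, upper = z_interval(mean=22, variance=45, n=10)
print(f"({lower:.2f}, {upper:.2f})")
```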

Components of an interval estimate

• The interval estimate of $\mu$ is centered on the point estimate of $\mu$.

• 95% of the values of the standard normal curve lie within 1.96 standard deviations of the mean.

• The z score of 1.96 used in this case is called the reliability coefficient.

General expression for an interval estimate

estimator ± (reliability coefficient) × (standard error)

Table of reliability coefficients for confidence intervals

Interpretation of confidence intervals

The interval estimate for $\mu$ is expressed as:

$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$

If $\alpha$ = .05, we can say that, in repeated sampling, 95% of the intervals constructed this way will include $\mu$. This is based on the probability of occurrence of different values of $\bar{x}$.

The area under the curve of $\bar{x}$ that lies outside the interval is called $\alpha$.

The area inside the interval is called $1-\alpha$.

Probabilistic interpretation of the interval

In repeated sampling from a normally distributed population with a known standard deviation, 100(1-$\alpha$) percent of all intervals of the form

$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$

will, in the long run, include the population mean, $\mu$.

The quantity $1-\alpha$ is called the confidence coefficient or confidence level, and the interval, $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$, is called the confidence interval for $\mu$.

Practical interpretation of the interval

When sampling is from a normally distributed population with known standard deviation, we are 100(1-$\alpha$) percent confident that the single computed interval,

$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$

contains the population mean, $\mu$.

Precision

• Precision indicates how much the values deviate from their mean.

• Precision is found by multiplying the reliability factor by the standard error of the mean.

• This is also called the margin of error.

Exercise 6.2.2

We wish to estimate the mean serum indirect bilirubin level of 4-day-old infants. The mean for a sample of 16 infants was found to be 5.98 mg/dl. Assuming bilirubin levels in 4-day-old infants are approximately normally distributed with a standard deviation of 3.5 mg/dl, find:

A) The 90% confidence interval for $\mu$

B) The 95% confidence interval for $\mu$

C) The 99% confidence interval for $\mu$

Solution

(1) Given $\bar{x}$ = 5.98, $\sigma$ = 3.5, n = 16

(2) Sketch

(3) Calculations

A) 90% interval (z = 1.645)

5.98 ± 1.645 (.875), where .875 = 3.5/$\sqrt{16}$ is the standard error

5.98 - 1.439375, 5.98 + 1.439375

(4.5408, 7.4192)

B) 95% interval (z = 1.96)

5.98 ± 1.96 (.875)

(4.265, 7.695)

C) 99% interval (z = 2.575)

5.98 ± 2.575 (.875)

(3.7261, 8.2339)

(4) Results

A higher confidence level gives a wider interval. There is less chance of making an error, but the estimate is less precise. The intervals shown are calculator answers, which differ slightly from hand calculations because the calculator uses unrounded z values rather than rounded table values.
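The three intervals in Exercise 6.2.2 can be reproduced with Python's standard library. This is a minimal sketch: the helper name `z_interval_for_level` is an illustrative assumption, and `statistics.NormalDist().inv_cdf` supplies unrounded z values, so the results match the calculator answers rather than the rounded table values.

```python
import math
from statistics import NormalDist

def z_interval_for_level(mean, sigma, n, level):
    """Two-sided z interval for a mean when sigma is known."""
    # Reliability coefficient z_{1 - alpha/2}, with alpha = 1 - level
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / math.sqrt(n)
    return mean - margin, mean + margin

# Values from Exercise 6.2.2: x-bar = 5.98, sigma = 3.5, n = 16
intervals = {
    level: z_interval_for_level(5.98, 3.5, 16, level)
    for level in (0.90, 0.95, 0.99)
}
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%}: ({lower:.4f}, {upper:.4f})")
```

The printed widths grow with the confidence level, which is the point made under Results.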

The t distribution

In most real-life situations the variance of the population is unknown. We know that the z score,

$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

is normally distributed if the population is normally distributed and is approximately normally distributed when the sample is large. However, it cannot be used because $\sigma$ is unknown.

Estimation of the standard deviation

The sample standard deviation,

$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n-1}}$

can be used to replace $\sigma$. If n ≥ 30, then s is a good approximation of $\sigma$. An alternate procedure is used when the samples are small. It is based on Student's t distribution.

Student's t distribution

Student's t distribution is used as an alternative for z with small samples. It uses the following formula:

$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

Student's t distribution was developed in 1908 by W. S. Gosset (1876-1937) who worked for the Guinness Brewery.

Properties of the t distribution

1. Mean = 0

2. It is symmetrical about the mean.

3. Variance is greater than 1 but approaches 1 as the sample gets large. For df > 2, the variance = df/(df-2) or (n-1)/(n-3).

4. The range is $-\infty$ to $+\infty$.

5. t is really a family of distributions, one for each value of n-1, because the divisors are different.

6. Compared with the normal distribution, t is less peaked and has higher tails.

7. The t distribution approaches the normal distribution as n-1 approaches infinity.

Confidence interval for a mean using t

General relationship

The reliability coefficient is obtained from the t distribution.

Confidence interval

When sampling is from a normal distribution whose standard deviation, $\sigma$, is unknown, the 100(1-$\alpha$) percent confidence interval for the population mean, $\mu$, is given by:

$\bar{x} \pm t_{1-\alpha/2}\,\dfrac{s}{\sqrt{n}}$

Deciding between z and t

• When constructing a confidence interval for a population mean, we must decide whether to use z or t.

• Which one to use depends on the size of the sample, whether the population is normally distributed or not, and whether or not the population variance is known.

• There are various flowcharts and decision keys that can be used to help decide. Mine appears below.

Key for deciding between z and t in confidence interval construction

1. Population normally distributed ................ 2
   Population not normally distributed ............ 5
2. Sample size is large (30 or higher) ............ 3
   Sample size is small (less than 30) ............ 4
3. Population variance is known ................... use z
   Population variance not known .................. use t (or z)
4. Population variance is known ................... use z
   Population variance is not known ............... use t
5. Sample size is large ........................... 6
   Sample size is small ........................... 7
6. Population variance is known ................... use z
   Population variance not known (central limit theorem applies) ... use z
7. Must use a non-parametric method
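The key can also be written as a small function. This is only a sketch of the decision logic in the key above; the function name, argument names, and returned strings are illustrative assumptions, not part of the original slides.

```python
def choose_reliability_coefficient(population_normal, n, variance_known, large=30):
    """Walk the z-versus-t key and return the recommended choice as a string."""
    large_sample = n >= large  # "large" means 30 or higher in the key
    if population_normal:
        if variance_known:
            return "z"  # steps 3 and 4: known variance
        # Unknown variance: t always works; z is acceptable for large samples
        return "t (or z)" if large_sample else "t"
    if large_sample:
        return "z"  # step 6: the central limit theorem applies
    return "non-parametric method"  # step 7

# Usage: a normal population, n = 10, unknown variance leads to t
print(choose_reliability_coefficient(True, 10, False))
```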

Example

In a study of preeclampsia, Kaminski and Rechberger found the mean systolic blood pressure of 10 healthy, nonpregnant women to be 119 with a standard deviation of 2.1.

(continued)

Example

(Preeclampsia: Development of hypertension, albuminuria, or edema between the 20th week of pregnancy and the first week postpartum.

Eclampsia: Coma and/or convulsive seizures in the same time period, without other etiology.)

Example

a. What is the estimated standard error of the mean?

b. Construct the 99% confidence interval for the mean of the population from which the 10 subjects may be presumed to be a random sample.

c. What is the precision of the estimate?

d. What assumptions are necessary for the validity of the confidence interval you constructed?

Solution

(1) Given n = 10, $\bar{x}$ = 119, s = 2.1

(2) Sketch of t distribution

Reading the t table

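(3) Calculations

a. Estimated standard error of the mean

$s/\sqrt{n} = 2.1/\sqrt{10}$ = .6640783086

b. 99% confidence interval (t = 3.2498 with 9 degrees of freedom)

119 ± 3.2498 (.66407...)

(116.84, 121.16)

c. Precision = 3.2498 (.66407...) = 2.158121687

d. Assumptions

The population is normally distributed.

The 10 subjects represent a random sample from this population.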
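The blood pressure example can be verified with a few lines of Python. This is a sketch only: the function name `t_interval` is hypothetical, and the t value 3.2498 (9 degrees of freedom, 99% confidence) is taken from the table reading in the example rather than computed, since the standard library has no t distribution.

```python
import math

def t_interval(mean, s, n, t_crit):
    """Return the standard error, the precision, and the t interval for a mean."""
    standard_error = s / math.sqrt(n)
    precision = t_crit * standard_error  # also called the margin of error
    return standard_error, precision, (mean - precision, mean + precision)

# Values from the example: n = 10, x-bar = 119, s = 2.1, t = 3.2498 (table value)
se, precision, (lower, upper) = t_interval(mean=119, s=2.1, n=10, t_crit=3.2498)
print(f"SE = {se:.4f}, precision = {precision:.4f}, CI = ({lower:.2f}, {upper:.2f})")
```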

B) Confidence interval for the difference of two population means

Introduction

From each of two populations an independent random sample is drawn. Sample means, $\bar{x}_1$ and $\bar{x}_2$, are calculated.

The difference is $\bar{x}_1 - \bar{x}_2$, which is an unbiased estimator of the difference between the two population means, $\mu_1 - \mu_2$. The variance of the estimator is

$\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$

Conditions for use

Assuming the populations are normally distributed, there are three situations where we would determine the 100(1-$\alpha$) percent confidence interval for $\mu_1 - \mu_2$:

a) where the population variances are known (use z)

b) where the population variances are unknown but equal (use t)

c) where the population variances are unknown but unequal (use t').

Population variances are known

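When the population variances are known, the 100(1-$\alpha$) percent confidence interval for $\mu_1 - \mu_2$ is given by

$(\bar{x}_1 - \bar{x}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$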

Example 6.4.1

A research team is interested in the difference between serum uric acid levels in patients with and without Down's syndrome. In a large hospital for the treatment of the mentally retarded, a sample of 12 individuals with Down's syndrome yielded a mean of $\bar{x}_1$ = 4.5 mg/100 ml. In a general hospital a sample of 15 normal individuals of the same age and sex were found to have a mean value of $\bar{x}_2$ = 3.4 mg/100 ml. If it is reasonable to assume that the two populations of values are normally distributed with variances equal to 1 and 1.5, find the 95% confidence interval for $\mu_1 - \mu_2$.

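Solution

(1) Given

n1 = 12, $\bar{x}_1$ = 4.5, $\sigma_1^2$ = 1

n2 = 15, $\bar{x}_2$ = 3.4, $\sigma_2^2$ = 1.5

(2) Calculations

The point estimate for $\mu_1 - \mu_2$ is

$\bar{x}_1 - \bar{x}_2$ = 4.5 - 3.4 = 1.1

The standard error is

$\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2} = \sqrt{1/12 + 1.5/15}$ = .4282

The 95% confidence interval is 1.1 ± 1.96 (.4282)

(.26, 1.94)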
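The same calculation for Example 6.4.1 as a Python sketch. The function name `diff_means_z_interval` is illustrative, and the code assumes known population variances, as in the example.

```python
import math

def diff_means_z_interval(x1, x2, var1, var2, n1, n2, z=1.96):
    """z interval for mu1 - mu2 when both population variances are known."""
    diff = x1 - x2
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    margin = z * standard_error
    return diff - margin, diff + margin

# Values from Example 6.4.1
lower, upper = diff_means_z_interval(4.5, 3.4, 1.0, 1.5, 12, 15)
print(f"({lower:.2f}, {upper:.2f})")
```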

Population variances unknown but equal

If it can be assumed that the population variances are equal then each sample variance is actually a point estimate of the same quantity. Therefore, we can combine the sample variances to form a pooled estimate.

Weighted averages

The pooled estimate of the common variance is made using weighted averages. This means that each sample variance is weighted by its degrees of freedom.

Pooled estimate of the variance

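The pooled estimate of the variance comes from the formula:

$s_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$

Standard error of the estimate

The standard error of the estimate is

$\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

Confidence interval

The 100(1-$\alpha$) percent confidence interval for $\mu_1 - \mu_2$ is

$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2}\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

where t has n1 + n2 - 2 degrees of freedom.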

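Example

(1) Given

n1 = 13, $\bar{x}_1$ = 21.0, s1 = 4.9

n2 = 17, $\bar{x}_2$ = 12.1, s2 = 5.6

(2) Calculations

The point estimate for $\mu_1 - \mu_2$ is

$\bar{x}_1 - \bar{x}_2$ = 21.0 - 12.1 = 8.9

The pooled estimate of the variance is

$s_p^2 = \dfrac{12(4.9)^2 + 16(5.6)^2}{28}$ = 28.21

The standard error is

$\sqrt{28.21/13 + 28.21/17}$ = 1.9569

The 95% confidence interval (t = 2.0484 with 28 degrees of freedom) is 8.9 ± 2.0484 (1.9569)

8.9 ± 4.0085

(4.9, 12.9)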
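The pooled-variance steps in this example can be sketched in Python. The function name `pooled_t_interval` is hypothetical, and the t value 2.0484 (28 degrees of freedom, 95% confidence) is the table value used in the example, not something the code computes.

```python
import math

def pooled_t_interval(x1, x2, s1, s2, n1, n2, t_crit):
    """Pooled-variance t interval for mu1 - mu2 (equal variances assumed)."""
    # Each sample variance is weighted by its degrees of freedom
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var / n1 + pooled_var / n2)
    diff = x1 - x2
    margin = t_crit * standard_error
    return pooled_var, standard_error, (diff - margin, diff + margin)

# Values from the example; t_crit = 2.0484 is taken from the t table
pooled_var, se, (lower, upper) = pooled_t_interval(
    21.0, 12.1, 4.9, 5.6, 13, 17, t_crit=2.0484
)
print(f"sp^2 = {pooled_var:.2f}, SE = {se:.4f}, CI = ({lower:.1f}, {upper:.1f})")
```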

Population variances unknown and not equal

With unequal variances, the quantity used to calculate the test statistic does not follow the t distribution. A substitute reliability factor called t' has been proposed.

C) Confidence interval for a population proportion

To begin, a sample is drawn from the population of interest and the sample proportion, $\hat{p}$, is calculated. This sample proportion is used as the point estimator of the population proportion, p. The confidence interval is defined by the general formula:

estimator ± (reliability coefficient) × (standard error)

Distribution

When n is large, the reliability coefficient will be z from the standard normal distribution. Since p, the population proportion, is unknown, we use $\hat{p}$ as an estimate. The estimate of $\sigma_{\hat{p}}$, the standard error, is given by:

$\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

Confidence interval

The 100(1-$\alpha$) percent confidence interval for p is given by:

$\hat{p} \pm z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$

Probabilistic interpretation.

We say that we are 95% confident that the population proportion, p, lies between the calculated limits since, in repeated sampling, about 95% of the intervals constructed this way would contain p.

Practical interpretation.

In a specific example, we would expect, with 95% confidence, to find the population proportion between the two boundaries.

Example 6.5.2

A research study obtained data regarding sexual behavior from a sample of unmarried men and women between the ages of 20 and 44 residing in geographic areas characterized by high rates of sexually transmitted diseases and admission to drug programs. Fifty percent of 1229 respondents reported that they never used a condom. Construct a 95 percent confidence interval for the population proportion never using a condom.

Solution

(1) Given

n = 1229, $\hat{p}$ = .50

(for the TI-83, x = 615)

Solution

(2) Calculation
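The calculation for Example 6.5.2 can be sketched as follows. The function name `proportion_interval` is illustrative, and the code assumes the large-sample normal approximation with z = 1.96 for 95 percent confidence.

```python
import math

def proportion_interval(p_hat, n, z=1.96):
    """Large-sample z interval for a population proportion."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# Values from Example 6.5.2: p-hat = .50, n = 1229
lower, upper = proportion_interval(0.50, 1229)
print(f"({lower:.3f}, {upper:.3f})")
```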

D) Confidence interval for the difference of two population proportions

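When studying the difference between two population proportions, the difference between the two sample proportions, $\hat{p}_1 - \hat{p}_2$, can be used as an unbiased point estimator for the difference between the two population proportions, p1 – p2. This is used with the general formula:

estimator ± (reliability coefficient) × (standard error)

Distribution

When the central limit theorem applies, the normal distribution is used to obtain confidence intervals. The standard error is estimated by the formula:

$\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

Confidence interval

The 100(1-$\alpha$) percent confidence interval for p1 – p2 is given by:

$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$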

Probabilistic interpretation.

We say that we are 95% confident that the difference between the two population proportions, p1 – p2, lies between the calculated limits since, in repeated sampling, about 95% of the intervals constructed this way would contain p1 – p2.

Practical interpretation.

In a specific example, we would expect, with 95% confidence, to find the difference between the two population proportions between the two limits.

Example 6.6.1

A study of teenage suicide included a sample of 96 boys and 123 girls between ages of 12 and 16 years selected scientifically from admissions records to a private psychiatric hospital. Suicide attempts were reported by 18 of the boys and 60 of the girls. We assume that the girls constitute a simple random sample from a population of similar girls and likewise for the boys. Construct a 99 percent confidence interval for the difference between the two proportions.

Solution

(1) Given

n1 = 123 (girls), n2 = 96 (boys)

$\hat{p}_1$ = 60/123 = .4878, $\hat{p}_2$ = 18/96 = .1875

Solution

(2) Calculation
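The calculation for Example 6.6.1 can be sketched in Python. The function name `diff_proportions_interval` is hypothetical, and z = 2.575 is the 99 percent reliability coefficient used earlier in this unit.

```python
import math

def diff_proportions_interval(x1, n1, x2, n2, z):
    """Large-sample z interval for p1 - p2 from counts of successes."""
    p1, p2 = x1 / n1, x2 / n2
    standard_error = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    margin = z * standard_error
    return diff - margin, diff + margin

# Values from Example 6.6.1: 60 of 123 girls and 18 of 96 boys
lower, upper = diff_proportions_interval(60, 123, 18, 96, z=2.575)
print(f"({lower:.4f}, {upper:.4f})")
```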

Determining the sample size for estimating means

It is important to have a sample that is the correct size. It is also important to have a method that will allow prediction of the correct sample size for estimating a population mean or a population proportion. This is important especially in business or commercial situations where money is involved. Selecting a sample size that is too big wastes money. One that is too small may give inaccurate results.

Objectives

The width of the confidence interval is determined by the magnitude of the margin of error which is given by:

d = (reliability coefficient) (standard error)

The total width of the interval is twice this amount.

Reducing the margin of error

In the standard error, $\sigma/\sqrt{n}$, the value of $\sigma$ is a constant. If the reliability coefficient is fixed, the only way to reduce the margin of error is to take a larger sample. The size of the sample depends on the size of $\sigma$, the degree of reliability and the desired interval width.

Margin of error

Sample size for a large population

d = (reliability coefficient) × (standard error)

$d = z\,\dfrac{\sigma}{\sqrt{n}}$

Solving for n gives

$n = \dfrac{z^2\sigma^2}{d^2}$

Estimating $\sigma^2$

Generally the variance of the population under study is unknown. As a result $\sigma^2$ has to be estimated. The most common sources of estimates for $\sigma^2$ are:

1. A pilot sample which is drawn from the population and used to obtain an estimate of $\sigma^2$.

2. Estimates of $\sigma^2$ from previous or similar studies.

3. In a normally distributed population, the range is usually about 6 standard deviations, so $\sigma$ is estimated by R/6.
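As an illustration of solving the margin of error for n, here is a minimal Python sketch. The function name and the input values ($\sigma$ = 20, d = 5) are hypothetical and chosen only for demonstration; they do not come from the slides. The result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_for_mean(sigma, d, z=1.96):
    """Smallest n for which z * sigma / sqrt(n) does not exceed d."""
    return math.ceil((z * sigma / d) ** 2)

# Hypothetical inputs: sigma estimated as 20, desired margin of error 5
n_required = sample_size_for_mean(sigma=20, d=5)
print(n_required)
```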

Determination of the sample size for estimating proportions

The manner of finding sample sizes for estimating a population proportion is basically the same as for estimating a mean.

The general formula is:

$d = z\sqrt{\dfrac{p(1-p)}{n}}$

Sample size

Assuming proper random sampling and an approximately normal distribution, the sample size is

$n = \dfrac{z^2\,p(1-p)}{d^2}$

Estimating the population proportion

It is necessary to estimate the population proportion, p, to use in the determination of the sample size.

1. If an upper limit is suspected or presumed, it could be used to represent p.

2. A pilot sample could be drawn and used to obtain an estimate for p.

3. With no better estimate, one may use p = .5, which gives the maximum value of n.
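The sample size calculation for a proportion can be sketched the same way. The function name and the desired margin of error (d = .05) are hypothetical; the sketch also shows that p = .5 gives the largest n, as stated in the list above.

```python
import math

def sample_size_for_proportion(p, d, z=1.96):
    """Smallest n for which z * sqrt(p(1-p)/n) does not exceed d."""
    return math.ceil(z**2 * p * (1 - p) / d**2)

# Hypothetical margin of error d = .05 at 95 percent confidence
n_worst_case = sample_size_for_proportion(p=0.5, d=0.05)
n_other = sample_size_for_proportion(p=0.2, d=0.05)
print(n_worst_case, n_other)
```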

E) Confidence interval for the variance of a normally distributed population

Measures of dispersion

$E(s^2) = \sigma^2$ when sampling is with replacement.

$E(s^2) = S^2$ when sampling is without replacement.

Here $\sigma^2 = \sum (x_i - \mu)^2 / N$ and $S^2 = \sum (x_i - \mu)^2 / (N-1)$.

Large population size

When N is large, N and N-1 are approximately equal, so $\sigma^2$ and $S^2$ will be approximately equal. These results justify using $s^2$ to estimate the population variance.

Interval estimate of a population variance

• The value of $s^2$ is used as a point estimator of the population variance, $\sigma^2$.

• Confidence intervals for $\sigma^2$ are based on the sampling distribution of $(n-1)s^2/\sigma^2$.

• If samples of size n are drawn from a normally distributed population, this quantity has a distribution known as the chi-square distribution with n-1 degrees of freedom.

• The assumption that the sample is drawn from a normally distributed population is crucial.

The chi-square distribution

The chi-square distribution is not symmetrical. For low values of n, its shape is variable. The distribution does not have negative values.

Microsoft Excel Demonstration

Note how the shape of the curve changes depending on the degrees of freedom. With 1 degree of freedom, the curve is hyperbolic.

[Here follows the Excel Worksheet.]

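Reading the $\chi^2$ table

Finding $\chi^2$ values

Confidence interval on the $\chi^2$ distribution

The 100(1-$\alpha$) percent confidence interval for the distribution of $(n-1)s^2/\sigma^2$ lies between the two-tailed $\chi^2$ values $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$. This interval is given by

$\chi^2_{\alpha/2} < \dfrac{(n-1)s^2}{\sigma^2} < \chi^2_{1-\alpha/2}$

Confidence interval for $\sigma^2$

From the sampling distribution of $(n-1)s^2/\sigma^2$ the confidence interval for $\sigma^2$ is derived. The formula is:

$\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}} < \sigma^2 < \dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}$

Confidence interval for $\sigma$

To get the 100(1-$\alpha$) percent confidence interval for $\sigma$, the population standard deviation, the square root of each term is taken. The result is the formula below.

$\sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} < \sigma < \sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}}$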

Example 6.9.1

In a study on cholesterol levels a sample of 12 men and women was chosen. The plasma cholesterol levels (mmol/L) of the subjects were as follows: 6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, and 5.8. We assume that these 12 subjects constitute a simple random sample of a population of similar subjects. We wish to estimate the variance of the plasma cholesterol levels with a 95 percent confidence interval.

Solution

(1) Given

6.0 6.4 7.0 5.8 6.0 5.8 5.9 6.7 6.1 6.5 6.3 5.8

Estimate the variance with a 95% confidence interval.

Solution

(2) Calculations

Value of s = .3918680978

Values of $\chi^2$ from table (11 degrees of freedom)

$\chi^2_{.975}$ = 21.920, $\chi^2_{.025}$ = 3.816

Calculations
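The calculation for Example 6.9.1 can be sketched in Python. The variable names are illustrative, and the two $\chi^2$ values are the table values quoted in the example (11 degrees of freedom) rather than values computed by the code, since the standard library has no chi-square distribution.

```python
import math
from statistics import variance

# Plasma cholesterol levels (mmol/L) from Example 6.9.1
levels = [6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, 5.8]

n = len(levels)
s_squared = variance(levels)  # sample variance with divisor n - 1

chi2_upper = 21.920  # table value for the .975 point, 11 df
chi2_lower = 3.816   # table value for the .025 point, 11 df

# (n-1)s^2 divided by the larger chi-square value gives the lower limit
var_lower = (n - 1) * s_squared / chi2_upper
var_upper = (n - 1) * s_squared / chi2_lower
print(f"s = {math.sqrt(s_squared):.4f}")
print(f"variance CI: ({var_lower:.4f}, {var_upper:.4f})")
print(f"std dev CI:  ({math.sqrt(var_lower):.4f}, {math.sqrt(var_upper):.4f})")
```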

F) Confidence interval for the ratio of variances of two normally distributed populations

A way to compare the variances of two normally distributed populations is to use the variance ratio, $\sigma_1^2/\sigma_2^2$. The variance ratio is used, among other things, as the test statistic for analysis of variance (ANOVA). If the two variances are equal, then V. R. = 1.

Sampling distribution

The sampling distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is used. Since the population variances are usually not known, the sample variances are used. The assumptions are that $s_1^2$ and $s_2^2$ are computed from independent samples of size n1 and n2, respectively, drawn from two normally distributed populations.

If the assumptions are met, $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ follows a distribution known as the F distribution with two values used for degrees of freedom.

Degrees of freedom

• The F distribution uses two values for degrees of freedom.

• The numerator degrees of freedom is the value of n1 - 1, which is used in calculating $s_1^2$.

• The denominator degrees of freedom is the value of n2 - 1, which is used in calculating $s_2^2$.

The F distribution

• The F distribution is not symmetrical.

• The distribution does not have negative values.

• Because it uses two values of degrees of freedom, there are separate charts for different confidence intervals.

F distribution tables

Reading F tables

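F tables come in denominations based on selected one-tail values of $1-\alpha$.

For two-tail intervals, the lower boundary, $F_{\alpha/2}$, must be calculated as the reciprocal of a tabulated value with the degrees of freedom reversed:

$F_{\alpha/2,\,df_1,\,df_2} = 1/F_{1-\alpha/2,\,df_2,\,df_1}$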

Reading F tables

Two-tail F distribution boundaries

The F.95 table

The F.975 table

The F.995 table

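Confidence interval for $\sigma_1^2/\sigma_2^2$

The distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is used to establish the 100(1-$\alpha$) percent confidence interval for $\sigma_1^2/\sigma_2^2$. The starting point is

$F_{\alpha/2} < \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} < F_{1-\alpha/2}$

From this relation, it can be shown that the 100(1-$\alpha$) percent confidence interval for $\sigma_1^2/\sigma_2^2$ is

$\dfrac{s_1^2/s_2^2}{F_{1-\alpha/2}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2/s_2^2}{F_{\alpha/2}}$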

Example 6.10.1

Among 11 patients in a certain study, the standard deviation of the property of interest was 5.8. In another group of 4 patients, the standard deviation was 3.4. We wish to construct a 95 percent confidence interval for the ratio of the variances of these two populations.

Solution

(1) Given

n1 = 11, $s_1^2$ = (5.8)² = 33.64, $\alpha$ = .05

n2 = 4, $s_2^2$ = (3.4)² = 11.56

$F_{.975,\,10,\,3}$ = 14.42

$F_{.025,\,10,\,3} = 1/F_{.975,\,3,\,10}$ = 1/4.83 = .20704

(2) Calculations

Calculation of the 95% confidence interval for $\sigma_1^2/\sigma_2^2$
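The interval for Example 6.10.1 can be computed with a short Python sketch. The function name `variance_ratio_interval` is hypothetical, and the F values (14.42 and 4.83) are the table values given in the example, not values computed by the code.

```python
def variance_ratio_interval(s1_squared, s2_squared, f_upper, f_lower):
    """Interval for sigma1^2 / sigma2^2 from two sample variances."""
    ratio = s1_squared / s2_squared
    # Dividing by the larger F value gives the lower limit
    return ratio / f_upper, ratio / f_lower

f_upper = 14.42     # table value with 10 and 3 degrees of freedom
f_lower = 1 / 4.83  # reciprocal of the table value with 3 and 10 df

# Standard deviations from Example 6.10.1: 5.8 (n1 = 11) and 3.4 (n2 = 4)
lower, upper = variance_ratio_interval(5.8**2, 3.4**2, f_upper, f_lower)
print(f"({lower:.4f}, {upper:.3f})")
```

The interval is wide because the second sample has only 4 observations.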
