BIO5312 Biostatistics Lecture 5: Estimations

Yujin Chung

September 27th, 2016 (Fall 2016)


Recap


Today’s lecture and some following lectures

How to infer the properties of the underlying distribution in a dataset.

Two types of statistical inferences:

Estimation: concerned with estimating the values of specific population parameters. These specific values are referred to as point estimates. Sometimes, interval estimation is carried out to specify an interval which likely includes the parameter values.

Hypothesis testing: concerned with testing whether the value of a population parameter is equal to some specific value.


Point estimations

Let $X_1, \ldots, X_n$ be a random sample from a probability distribution. That is, $X_1, \ldots, X_n$ are independent and identically distributed (iid).

If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, what are the point estimates of $\mu$ and $\sigma^2$, respectively?

If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, what is the point estimate of $p$?

If $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$, what is the point estimate of $\lambda$?


A point estimation of the population mean

Consider a random sample $X_1, \ldots, X_n$ drawn from a distribution with mean $\mu = E(X)$ (unknown).

A natural estimator for the population mean $\mu$ is the sample mean:

$$\hat{\mu} = \widehat{E(X)} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, $\bar{X}$ is a point estimate of $E(X) = \mu$.

If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, $\bar{X}$ (the proportion of successes) is a point estimate of $E(X) = p$.

If $Y \sim B(n, p)$, $\widehat{E(Y)} = Y$ (the number of successes) is a point estimate of $E(Y) = np$. That is, $\hat{p} = Y/n = \bar{X}$, where $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$.

If $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$, $\bar{X}$ is a point estimate of $E(X) = \lambda$.


Examples

Suppose a random sample of 5000 women is selected from this age group, of whom 28 are found to have malignant melanoma. What is the probability of having the disease (prevalence)?

Let $p$ be the probability of having the disease. Let the random variable $X_i$ represent the disease status for the $i$th woman, where $X_i = 1$ if the $i$th woman has the disease and 0 if she does not, for $i = 1, \ldots, 5000$. Each $X_i$ is a Bernoulli trial; that is, $X_1, \ldots, X_{5000} \sim \text{Bernoulli}(p)$. Then a point estimate of $E(X) = p$ is $\hat{p} = \bar{x} = 28/5000 = 0.0056$.


Properties of $\bar{X}$ (1)

An unbiased estimator of $\mu$: $E(\bar{X}) = \mu$

Proof:

$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\sum_{i=1}^{n} \mu = \mu$$

Consider infinitely many random samples, each of size $n$. From each sample, the sample mean $\bar{X}$ is computed. Then the average value of $\bar{X}$ over the infinitely many samples is $\mu = E(X)$.

[Figure: average of sample means versus the number of sets of random samples (0 to 10,000); Sample ~ N(10, 1), sample size = 100.]
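The unbiasedness property can be checked numerically. The following is a minimal sketch, assuming the same setting as the figure (samples from N(10, 1) with sample size 100); the function name `average_of_sample_means` and the seed are illustrative choices, not part of the lecture.

```python
import random
import statistics

def average_of_sample_means(num_sets, n=100, mu=10.0, sigma=1.0, seed=0):
    """Average the sample means of `num_sets` independent samples of size n."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    means = []
    for _ in range(num_sets):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.fmean(means)

# The average of the sample means should approach mu = 10 as more sets are used.
for num_sets in (10, 100, 2000):
    print(num_sets, round(average_of_sample_means(num_sets), 4))
```

As the number of sets grows, the printed averages should settle near 10, which is what $E(\bar{X}) = \mu$ predicts.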


Properties of $\bar{X}$ (2)

The minimum variance unbiased estimator of $\mu$: if the underlying distribution is normal, then it can be shown that the unbiased estimator with the smallest variance is $\bar{X}$.

For example, from a random sample $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, we consider two estimators for $\mu$: one is $\bar{X}$ and the other is the first observation $X_1$. Both are unbiased estimators: $E(\bar{X}) = \mu$ and $E(X_1) = \mu$. However, $\bar{X}$ has a smaller variance than $X_1$:

$$Var(\bar{X}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}, \qquad Var(X_1) = \sigma^2.$$
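A small simulation can make the variance comparison concrete. This is only a sketch under assumed values ($n = 25$, $\mu = 10$, $\sigma = 2$, 4000 repetitions); the function name `compare_estimator_variances` is hypothetical.

```python
import random
import statistics

def compare_estimator_variances(n=25, mu=10.0, sigma=2.0, reps=4000, seed=1):
    """Empirical variance of the sample mean and of the first observation."""
    rng = random.Random(seed)
    sample_means, first_observations = [], []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_means.append(statistics.fmean(sample))
        first_observations.append(sample[0])
    return statistics.variance(sample_means), statistics.variance(first_observations)

var_mean, var_first = compare_estimator_variances()
# Compare with the theoretical values sigma^2 / n = 0.16 and sigma^2 = 4.
print(var_mean, var_first)
```

Both estimators are centred on $\mu$, but the empirical variance of $\bar{X}$ should be roughly $n$ times smaller than that of $X_1$.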


Properties of $\bar{X}$ (3)

A consistent estimator of $\mu$: the estimator $\bar{X}$ converges to the population mean $\mu$ as the sample size $n$ goes to infinity.

[Figure: sample mean versus sample size $n$ (0 to 100,000); Sample ~ N(10, 1).]


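A point estimation for $Var(X)$

Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with $\mu$ known.

A point estimate of $Var(X) = \sigma^2$ is $\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2$ (unbiased).

Proof:

$$E\left(\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu)^2\right) = \frac{1}{n}\sum_{i=1}^{n} E[(X_i - \mu)^2] = \frac{1}{n}\sum_{i=1}^{n} \sigma^2 = \sigma^2$$

Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, but now $\mu$ is unknown.

$$\hat{\sigma}^2 = S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2 \ \text{(unbiased)}, \qquad \hat{\sigma} = \sqrt{\hat{\sigma}^2} = S \ \text{(biased)}$$

Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. A point estimate of $Var(X) = p(1-p)$ is $\widehat{Var}(X) = \hat{p}(1-\hat{p}) = \bar{X}(1-\bar{X})$ (biased).

Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$. A point estimate of $Var(X) = \lambda$ is $\hat{\lambda} = \bar{X}$, the same as the estimate of $E(X) = \lambda$ (unbiased).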
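To see why the divisor $n-1$ matters when $\mu$ is unknown, the sketch below averages two variance estimators over many small normal samples. The sample size ($n = 5$), $\sigma = 2$, and the function name `mean_variance_estimates` are assumptions made for illustration.

```python
import random
import statistics

def mean_variance_estimates(n=5, sigma=2.0, reps=20000, seed=2):
    """Average S^2 (divisor n-1) and the divisor-n estimator over many samples."""
    rng = random.Random(seed)
    s2_values, divisor_n_values = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2_values.append(statistics.variance(sample))          # divides by n - 1
        divisor_n_values.append(statistics.pvariance(sample))  # divides by n
    return statistics.fmean(s2_values), statistics.fmean(divisor_n_values)

avg_s2, avg_divisor_n = mean_variance_estimates()
# Compare with sigma^2 = 4 and (n - 1) / n * sigma^2 = 3.2.
print(avg_s2, avg_divisor_n)
```

The average of $S^2$ should sit near $\sigma^2$, while the divisor-$n$ version centred at $\bar{X}$ should be systematically too small by the factor $(n-1)/n$.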


Consistency of a variance estimator

The estimators of variances on the previous slide are consistent! Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. As the sample size goes to infinity, $\widehat{Var}(X) = \hat{p}(1-\hat{p}) = \bar{X}(1-\bar{X})$ converges to $p(1-p)$.

[Figure: variance estimate versus sample size (0 to 100,000); random sample ~ Bernoulli(0.7).]
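The convergence shown in the figure can be reproduced with a short simulation. This sketch assumes Bernoulli(0.7) data as in the figure; the function name `bernoulli_variance_estimate` and the seed are illustrative.

```python
import random

def bernoulli_variance_estimate(n, p=0.7, seed=3):
    """Plug-in variance estimate x_bar * (1 - x_bar) from n Bernoulli(p) draws."""
    rng = random.Random(seed)
    successes = sum(1 for _ in range(n) if rng.random() < p)
    x_bar = successes / n
    return x_bar * (1.0 - x_bar)

# The estimate should approach the true value p(1 - p) = 0.21 as n grows.
for n in (10, 1000, 100000):
    print(n, bernoulli_variance_estimate(n))
```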


Examples

LEAD data example: What are the point estimates of the mean and standard deviation of the full IQ of children in the exposed group?

The point estimate of the mean is $\bar{x} = 88.02$ and the estimate of the standard deviation is $s = 12.207$.


Interval estimation of µ = E(X): σ known

Interval estimation: specify an interval which likely includes a parameter value of interest.

Point estimates do not reflect our uncertainty when estimating a parameter. We always remain uncertain regarding the true value of the parameter when we estimate it using a sample from the population. To address this issue, we can present our estimates in terms of an interval of possible values (as opposed to a single value).


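Interval estimation: the uncertainty of $\bar{X}$

Let $\mathbf{X} = (X_1, \ldots, X_n)$, where $X_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$. The sample mean $\bar{X}$ is a point estimator of $\mu$. How is $\bar{X}$ distributed?

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

What is the interval $(u(\mathbf{X}), v(\mathbf{X}))$ such that $\Pr[(u(\mathbf{X}), v(\mathbf{X})) \ni \mu] = 0.95$?

Standardization:

$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

Consider the constant $a = z_{0.975} = 1.96$ such that

$$\Pr\left(-a < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < a\right) = 0.95$$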


Interval estimation for µ

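$$\begin{aligned}
\Pr\left(-a < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < a\right)
&= \Pr\left(-a\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < a\frac{\sigma}{\sqrt{n}}\right) \\
&= \Pr\left(\bar{X} + a\frac{\sigma}{\sqrt{n}} > \mu \ \&\ \bar{X} - a\frac{\sigma}{\sqrt{n}} < \mu\right) \\
&= \Pr\left[\left(\bar{X} - a\frac{\sigma}{\sqrt{n}},\ \bar{X} + a\frac{\sigma}{\sqrt{n}}\right) \ni \mu\right] \\
&= 0.95
\end{aligned}$$

Therefore, $u(\mathbf{X}) = \bar{X} - 1.96\,\sigma/\sqrt{n}$ and $v(\mathbf{X}) = \bar{X} + 1.96\,\sigma/\sqrt{n}$.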


Confidence interval for µ

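Let $\mathbf{X} = (X_1, \ldots, X_n)$, where $X_i \sim N(\mu, \sigma^2)$ for $i = 1, \ldots, n$. Assume $\sigma$ is known.

$100(1-\alpha)\%$ confidence interval for $\mu$:

$$\left(\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\ \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right)$$

For shorthand, $\bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$.

$1-\alpha$: confidence level

$z_{1-\alpha/2}$: critical value for confidence level $1-\alpha$

For example, the 95% confidence interval for $\mu$ is

$$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$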


Examples

LEAD data example: The point estimate of the mean of the full IQ of children in the exposed group is $\bar{x} = 88.02$.

Assume the full IQ follows a normal distribution with standard deviation $\sigma = 12.207$. Compute the 95% confidence interval for the mean IQ.

There are 46 children in the exposed group. The standard error is $\sigma/\sqrt{n} = 1.799$. Therefore, the 95% CI is

$$(88.02 - 1.96 \times 1.799,\ 88.02 + 1.96 \times 1.799) = (84.494, 91.549)$$
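The calculation above can be sketched in a few lines of Python. The function name `z_confidence_interval` is an illustrative choice; the critical value comes from `statistics.NormalDist().inv_cdf` in the Python standard library rather than the rounded 1.96, so the result may differ from the slide in the third decimal place.

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """CI for the mean of a normal distribution when sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{1 - alpha/2}
    margin = z * sigma / math.sqrt(n)            # critical value times standard error
    return x_bar - margin, x_bar + margin

# LEAD example values taken from the slide: x_bar = 88.02, sigma = 12.207, n = 46.
lower, upper = z_confidence_interval(88.02, 12.207, 46)
print(round(lower, 2), round(upper, 2))
```

The printed interval should land close to the one reported on the slide.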


Confidence interval (CI)

Interpretations of 95% CI:

The probability that the interval contains the true value (parameter) is 0.95.

Consider infinitely many random samples of size $n$ and compute the CI from each. 95% of the infinitely many CIs will contain the true value.
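The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes normal data with known $\sigma$ (here $\mu = 10$, $\sigma = 1$, $n = 30$, all illustrative values) and counts how often the interval covers the true mean; the function name `coverage` is hypothetical.

```python
import math
import random
import statistics
from statistics import NormalDist

def coverage(reps=4000, n=30, mu=10.0, sigma=1.0, confidence=0.95, seed=4):
    """Fraction of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = statistics.fmean(sample)
        if x_bar - margin < mu < x_bar + margin:
            hits += 1
    return hits / reps

print(coverage())  # should be close to 0.95
```

With a finite number of repetitions the fraction will not be exactly 0.95, but it should be near it.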


Factors Affecting the Length of a CI

The 95% confidence interval (CI) for $\mu$ is

$$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

The length of the CI indicates the precision of the point estimate $\bar{X}$. The length of a $100(1-\alpha)\%$ CI for $\mu$ equals $2 z_{1-\alpha/2}\,\sigma/\sqrt{n}$ and is determined by $\alpha$ and the standard error $\sigma/\sqrt{n}$.

$\alpha$: as the desired confidence level $1-\alpha$ increases (that is, as $\alpha$ decreases), the length of the CI increases.

$n$: as the sample size $n$ increases, the standard error decreases and the length of the CI decreases.

$\sigma$: as the variability of the distribution increases, the length of the CI increases.
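The three factors listed above can be explored with a small helper. This is a sketch only; `ci_length` is a hypothetical name and the input values are chosen for illustration.

```python
import math
from statistics import NormalDist

def ci_length(sigma, n, confidence=0.95):
    """Length 2 * z_{1-alpha/2} * sigma / sqrt(n) of the known-sigma CI for the mean."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return 2.0 * z * sigma / math.sqrt(n)

print(ci_length(1.0, 100, 0.95), ci_length(1.0, 100, 0.99))  # higher confidence, longer CI
print(ci_length(1.0, 100), ci_length(1.0, 400))              # larger n, shorter CI
print(ci_length(1.0, 100), ci_length(2.0, 100))              # larger sigma, longer CI
```

Because the length scales with $1/\sqrt{n}$, quadrupling the sample size halves the length of the interval.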


Interval estimation of µ = E(X): σ unknown

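If $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$, the 95% confidence interval (CI) for $\mu$ is

$$\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\ \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right)$$

What if we don't know $\sigma$?

$\dfrac{\bar{X} - \mu}{S/\sqrt{n}}$ is distributed as a $t$ distribution with $(n-1)$ df.

A $100(1-\alpha)\%$ CI is given by

$$\left(\bar{X} - t_{n-1,1-\alpha/2}\,S/\sqrt{n},\ \bar{X} + t_{n-1,1-\alpha/2}\,S/\sqrt{n}\right),$$

where $t_{n-1,1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the $t_{n-1}$ distribution.

If $n > 200$, use the standard normal distribution instead of $t_{n-1}$:

$$\left(\bar{X} - z_{1-\alpha/2}\,S/\sqrt{n},\ \bar{X} + z_{1-\alpha/2}\,S/\sqrt{n}\right)$$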


Examples

LEAD data example: The point estimate of the mean of the full IQ of children in the exposed group is $\bar{x} = 88.02$. There are 46 children in the exposed group.

Assume the full IQ follows a normal distribution. Compute the 95% confidence interval for the mean IQ.

The standard error is $s/\sqrt{n} = 1.799$. The critical value is $t_{45,.975} = 2.014$. Therefore, the 95% CI is

$$(88.02 - 2.014 \times 1.799,\ 88.02 + 2.014 \times 1.799) = (84.396, 91.643)$$

Note: The CI with unknown $\sigma$ is wider than the CI (84.494, 91.549) with known $\sigma$.
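The same computation can be sketched in Python. The Python standard library has no quantile function for the $t$ distribution, so this sketch takes the critical value quoted on the slide ($t_{45,.975} = 2.014$) as an input; the function name `t_confidence_interval` is hypothetical.

```python
import math

def t_confidence_interval(x_bar, s, n, t_critical):
    """CI for the mean when sigma is unknown; t_critical is t_{n-1, 1-alpha/2}."""
    margin = t_critical * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

T_45_975 = 2.014  # critical value as quoted on the slide, not computed here
lower, upper = t_confidence_interval(88.02, 12.207, 46, T_45_975)
print(round(lower, 2), round(upper, 2))

# Width of the known-sigma interval with z = 1.96, for comparison.
z_width = 2 * 1.96 * 12.207 / math.sqrt(46)
print(upper - lower > z_width)
```

If SciPy is available, `scipy.stats.t.ppf(0.975, 45)` could supply the critical value instead of the hard-coded constant.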


Interval Estimation of the Variance of a Distribution

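Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$. A point estimate of $\sigma^2$ is $S^2$.

Using $\dfrac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$, a 95% CI is $(u(\mathbf{X}), v(\mathbf{X}))$ such that

$$\Pr\left(u(\mathbf{X}) < \sigma^2 \ \&\ v(\mathbf{X}) > \sigma^2\right) = 0.95$$

Find $(u(\mathbf{X}), v(\mathbf{X}))$:

$$\begin{aligned}
0.95 &= \Pr\left(\chi^2_{n-1,0.025} < \frac{(n-1)S^2}{\sigma^2} < \chi^2_{n-1,0.975}\right) \\
&= \Pr\left(\frac{(n-1)S^2}{\chi^2_{n-1,0.025}} > \sigma^2 \ \&\ \frac{(n-1)S^2}{\chi^2_{n-1,0.975}} < \sigma^2\right)
\end{aligned}$$

A $100(1-\alpha)\%$ CI for $\sigma^2$ is given by

$$\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}},\ \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right)$$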


Examples

LEAD data example: The point estimate of the variance of the full IQ of children in the exposed group is $s^2 = 149.99$. There are 46 children in the exposed group.

Assume the full IQ follows a normal distribution. Compute the 95% confidence interval for the variance of IQ.

The critical values are $\chi^2_{45,.025} = 28.366$ and $\chi^2_{45,.975} = 65.41$. Therefore, the 95% CI is

$$(45 \times 149.99/65.41,\ 45 \times 149.99/28.366) = (103.188, 237.945).$$
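The variance interval can be sketched the same way. The chi-square critical values are taken from the slide, since the Python standard library does not provide chi-square quantiles; the function name `variance_confidence_interval` is hypothetical.

```python
def variance_confidence_interval(s2, n, chi2_lower, chi2_upper):
    """CI for sigma^2: ((n-1) s^2 / chi2_upper, (n-1) s^2 / chi2_lower).

    chi2_lower and chi2_upper are the alpha/2 and 1 - alpha/2 quantiles
    of the chi-square distribution with n - 1 degrees of freedom.
    """
    df = n - 1
    return df * s2 / chi2_upper, df * s2 / chi2_lower

# LEAD example: s^2 = 149.99, n = 46, critical values as quoted on the slide.
lower, upper = variance_confidence_interval(149.99, 46, 28.366, 65.41)
print(round(lower, 3), round(upper, 3))
```

Note that the larger critical value produces the lower limit, because it appears in the denominator. With SciPy, `scipy.stats.chi2.ppf` could be used to obtain the two quantiles.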


CI for a binomial parameter p

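Let $X$ be a binomial random variable with parameters $n$ and $p$. An unbiased estimator of $p$ is given by the sample proportion of events $\hat{p} = X/n$. Its standard error is estimated by $\sqrt{\hat{p}(1-\hat{p})/n}$.

By the central limit theorem,

$$\frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \to Z, \quad \text{where } Z \sim N(0, 1), \text{ as } n \to \infty.$$

We replace $p$ by $\hat{p}$ in the standard error:

$$\frac{\hat{p} - p}{\sqrt{\hat{p}(1-\hat{p})/n}} \overset{\text{approx.}}{\sim} N(0, 1).$$

When $np(1-p) \ge 5$ (that is, $n\hat{p}(1-\hat{p}) \ge 5$), an approximate $100(1-\alpha)\%$ CI for the binomial parameter $p$ is

$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$$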


‘Exact’ CI for a binomial parameter p

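When $n\hat{p}(1-\hat{p}) < 5$ and $X = x$ is observed, we use the 'exact' binomial distribution to build a CI. As $p$ increases, $\Pr(X \ge x \mid p)$ increases, while $\Pr(X \le x \mid p)$ decreases. A $100(1-\alpha)\%$ CI for the binomial parameter $p$ is obtained as $(p_1, p_2)$ such that

$$p_1 = \min\{p \mid \Pr(X \ge x \mid p) > \alpha/2\} \quad \& \quad p_2 = \max\{p \mid \Pr(X \le x \mid p) > \alpha/2\}$$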

[Figure: Pr(X <= 2|p) and Pr(X >= 2|p) plotted against the parameter p (0 to 1); panel title: p=0.2, n=5, X=2, p̂=0.4, Exact CI (0.15, 0.85).]
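The definition of $(p_1, p_2)$ suggests a direct numerical search. The sketch below finds each limit by bisection on the binomial tail probabilities; it is one possible way to compute an interval of this kind, and the helper names (`binom_cdf`, `bisect_root`, `exact_binomial_ci`) are illustrative.

```python
import math

def binom_cdf(x, n, p):
    """Pr(X <= x | p) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(x + 1))

def bisect_root(f, lo, hi, iterations=100):
    """Locate the sign change of an increasing function f on [lo, hi]."""
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def exact_binomial_ci(x, n, confidence=0.95):
    half_alpha = (1.0 - confidence) / 2.0
    # Lower limit: Pr(X >= x | p) increases in p; find where it reaches alpha/2.
    if x == 0:
        lower = 0.0
    else:
        lower = bisect_root(lambda p: (1.0 - binom_cdf(x - 1, n, p)) - half_alpha, 0.0, 1.0)
    # Upper limit: Pr(X <= x | p) decreases in p; find where it reaches alpha/2.
    if x == n:
        upper = 1.0
    else:
        upper = bisect_root(lambda p: half_alpha - binom_cdf(x, n, p), 0.0, 1.0)
    return lower, upper

# Illustrative call with 28 events out of 5000 trials.
print(exact_binomial_ci(28, 5000))
```

Bisection works here because each tail probability is monotone in $p$, as noted in the text above.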


Examples

Suppose a random sample of 5000 women is selected from this age group, of whom 28 are found to have malignant melanoma. What is the probability of having the disease (prevalence) and the 95% CI?

Let $p$ be the probability of having the disease. Let $X_i = 1$ if the $i$th woman has the disease; 0 otherwise, for $i = 1, \ldots, 5000$. Then a point estimate of $E(X) = p$ is $\hat{p} = \bar{x} = 28/5000 = 0.0056$.

Since $n\hat{p}(1-\hat{p}) = 27.8432 \ge 5$, we use the normal approximation to compute a CI for $p$. The standard error is estimated by $\sqrt{\hat{p}(1-\hat{p})/n} = 0.00106$ and hence the 95% CI for $p$ is

$$(0.0056 - 1.96 \times 0.00106,\ 0.0056 + 1.96 \times 0.00106) = (0.0035, 0.0077).$$
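The worked example can be reproduced with a short function. This is a sketch of the normal-approximation interval under the $n\hat{p}(1-\hat{p}) \ge 5$ rule from the lecture; the function name `wald_proportion_ci` and the choice to raise an error when the rule fails are assumptions.

```python
import math
from statistics import NormalDist

def wald_proportion_ci(x, n, confidence=0.95):
    """Point estimate, estimated standard error, and approximate CI for p."""
    p_hat = x / n
    if n * p_hat * (1.0 - p_hat) < 5.0:
        raise ValueError("n * p_hat * (1 - p_hat) < 5: normal approximation not advised")
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# Melanoma example from the slide: 28 cases among 5000 women.
p_hat, se, (lower, upper) = wald_proportion_ci(28, 5000)
print(p_hat, round(se, 5), round(lower, 4), round(upper, 4))
```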


CI for Poisson Distribution

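Let $X_1, \ldots, X_n \sim \text{Poi}(\lambda)$. A point estimate of $\lambda$ is $\hat{\lambda} = \bar{X}$ and its standard error is estimated by $\sqrt{\hat{\lambda}/n}$.

By the central limit theorem, an approximate $100(1-\alpha)\%$ CI for $\lambda$ is

$$\hat{\lambda} \pm z_{1-\alpha/2}\sqrt{\hat{\lambda}/n}$$

Let $S = \sum_{i=1}^{n} X_i$. Then $S \sim \text{Poi}(n\lambda)$. An "exact" CI for $\lambda$, given the observed value $S = s$, is $(\lambda_1, \lambda_2)$ such that

$$\lambda_1 = \min\{\lambda \mid \Pr(S \ge s \mid \lambda) > \alpha/2\} \quad \& \quad \lambda_2 = \max\{\lambda \mid \Pr(S \le s \mid \lambda) > \alpha/2\}$$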
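Both Poisson intervals can be sketched with the standard library. The approximate interval follows the formula above; the "exact" limits are found by bisection on the Poisson tail probabilities of $S \sim \text{Poi}(n\lambda)$. The function names are hypothetical, and the direct evaluation of $e^{-m}$ means this sketch is only suitable for modest counts.

```python
import math
from statistics import NormalDist

def poisson_cdf(s, mean):
    """Pr(S <= s) for S ~ Poisson(mean)."""
    term = math.exp(-mean)
    total = term
    for k in range(1, s + 1):
        term *= mean / k
        total += term
    return total

def approximate_poisson_ci(total_count, n, confidence=0.95):
    lam_hat = total_count / n
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * math.sqrt(lam_hat / n)
    return lam_hat - margin, lam_hat + margin

def exact_poisson_ci(total_count, n, confidence=0.95):
    half_alpha = (1.0 - confidence) / 2.0

    def solve(f):
        # f is increasing in the mean m = n * lambda, with f(0) < 0.
        lo, hi = 0.0, 1.0
        while f(hi) < 0.0:   # expand the bracket until the sign changes
            hi *= 2.0
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if f(mid) < 0.0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    s = total_count
    lower_mean = 0.0 if s == 0 else solve(lambda m: (1.0 - poisson_cdf(s - 1, m)) - half_alpha)
    upper_mean = solve(lambda m: half_alpha - poisson_cdf(s, m))
    return lower_mean / n, upper_mean / n

# Illustrative data: a total of 10 events observed over n = 4 units.
print(approximate_poisson_ci(10, 4))
print(exact_poisson_ci(10, 4))
```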


Bootstrap confidence interval

Real data have a complex structure, and we may be interested in more complex parameters or quantities.

We have a large sample (e.g., $n = 1000$), but its distribution is very skewed. We'd like to compute a CI for the population mean, but the normal approximation may not be good enough.

[Figure: histogram of a data set with n = 1000; x-axis from 0 to 400, y-axis Frequency.]

If we are interested in the median of the data, how do we compute a CI for the median?


Bootstrap confidence interval

Let $X_1, \ldots, X_n$ be randomly sampled from an unknown distribution. We are interested in estimating a parameter $\theta$, and the estimator of $\theta$ is $\hat{\theta} = S(\mathbf{X})$. How can we build a CI for $\theta$?

A confidence interval for $\theta$ is in the form of

(point estimate) $\pm\ z_{1-\alpha/2} \times$ (standard error of the estimate).

That is, $\hat{\theta} \pm z_{1-\alpha/2}\,SE(\hat{\theta})$!

However, we do NOT know the standard error of $\hat{\theta}$.

We use the bootstrap method to estimate the standard error.


Bootstrap method: resampling method

Goal: estimating the distribution of $\hat{\theta}$ and hence the s.d. of $\hat{\theta}$

1 Estimate the distribution from which the data were sampled: the population is estimated by the sampled data $\mathbf{x} = (x_1, \ldots, x_n)$

2 Sample many data sets from the estimated distribution $\hat{P}$: $\mathbf{x}^*_1, \ldots, \mathbf{x}^*_B$

3 Compute the estimate of $\theta$ from each: $\hat{\theta}(\mathbf{x}^*_1), \ldots, \hat{\theta}(\mathbf{x}^*_B)$ (this forms the distribution of $\hat{\theta}$)


Bootstrap confidence intervals

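Now we have estimated the distribution of $\hat{\theta}$: $\hat{\theta}(\mathbf{x}^*_1), \ldots, \hat{\theta}(\mathbf{x}^*_B)$.

A $100(1-\alpha)\%$ CI for $\theta$ (normal approximation) is

$$\left(\hat{\theta} + (\hat{\theta} - \bar{\theta}^*)\right) \pm z_{1-\alpha/2}\, se^*(\hat{\theta}),$$

where

$$\bar{\theta}^* = \frac{1}{B}\sum_{i=1}^{B} \hat{\theta}(\mathbf{x}^*_i) \quad \text{and} \quad se^*(\hat{\theta}) = \sqrt{\frac{1}{B-1}\sum_{i=1}^{B} \left(\hat{\theta}(\mathbf{x}^*_i) - \bar{\theta}^*\right)^2}.$$

A percentile CI uses the $100(\alpha/2)$th and $100(1-\alpha/2)$th percentiles of $\hat{\theta}(\mathbf{x}^*_1), \ldots, \hat{\theta}(\mathbf{x}^*_B)$:

$$(q^*_{\alpha/2},\ q^*_{1-\alpha/2})$$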
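The resampling procedure and both intervals can be sketched with the standard library. This is a minimal illustration, assuming simple resampling with replacement from the observed data and a basic order-statistic rule for the percentiles; the function name `bootstrap_cis`, the simulated skewed data, and the seeds are all assumptions, not the lecture's code.

```python
import random
import statistics
from statistics import NormalDist

def bootstrap_cis(data, statistic, num_replicates=1000, confidence=0.95, seed=5):
    """Return (normal-approximation CI, percentile CI) for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)
    # Resample with replacement from the observed data (the estimated population).
    replicates = sorted(statistic(rng.choices(data, k=n)) for _ in range(num_replicates))
    theta_star_bar = statistics.fmean(replicates)
    se_star = statistics.stdev(replicates)              # divisor B - 1
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    center = theta_hat + (theta_hat - theta_star_bar)   # bias-adjusted centre
    normal_ci = (center - z * se_star, center + z * se_star)
    # Percentile CI from the sorted bootstrap estimates.
    lo_idx = int(num_replicates * alpha / 2.0)
    hi_idx = int(num_replicates * (1.0 - alpha / 2.0)) - 1
    percentile_ci = (replicates[lo_idx], replicates[hi_idx])
    return normal_ci, percentile_ci

# Hypothetical skewed data: 201 draws from an exponential distribution with mean 50.
data_rng = random.Random(6)
skewed = [data_rng.expovariate(1.0 / 50.0) for _ in range(201)]
normal_ci, percentile_ci = bootstrap_cis(skewed, statistics.median)
print(normal_ci, percentile_ci)
```

Passing `statistics.fmean` instead of `statistics.median` gives bootstrap intervals for the mean, with no other change to the code.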


Examples

LEAD data example: The point estimate of the mean of the full IQ of children in the exposed group is $\bar{x} = 88.02$. There are 46 children in the exposed group.

Compute the 95% normal and percentile CIs for the mean IQ.

Resample the data and generate 1,000 replicates. The normal CI is (84.64, 91.35) and the percentile CI is (84.78, 91.54).

Previously, under the normal assumption, the CIs were (84.494, 91.549) with known $\sigma = s$ and (84.396, 91.643) with unknown $\sigma$.


Summary

A point or interval estimation of a parameter of interest from a random sample.

1 Identify the data type: continuous? Normal or non-normal? Bernoulli? Poisson? Unknown?

2 What parameter or quantity to estimate? Are there any other unknown parameters?

3 What is a point estimate of the parameter of interest? Is the estimator good enough?

4 What is the standard error of the estimate?

5 What is a CI for the parameter? What is the distribution of the point estimate?
  - Normal, chi-square, t-dist, binomial dist, etc.
  - Normal approximation?
  - Difficult to discover or unknown: bootstrap


Next week

Statistical hypothesis testing: concerned with testing whether the value of a population parameter is equal to some specific value.
