Stat 101: Lecture 18

Summer 2006

Outline

Designed Experiments and Descriptive Statistics

Simple Linear Regression

Probability

The normal distribution and the Central Limit Theorem

Confidence Intervals

Significance Tests

Multiple Linear Regression

Multiple Regression

The Bootstrap

Bayesian Statistics

Designed Experiments

- Double-blinded, randomized, controlled study versus observational study.
- Causation and association.
- Confounding factors may exist.
- Weighted average and the chi-square test.
- Summary statistics: mean, median, SD, IQR.
- Plots: histogram, boxplot, scatterplot.

Mathematical model for regression

- Each point (X_i, Y_i) in the scatterplot satisfies:

  Y_i = a + b X_i + \varepsilon_i

- \varepsilon_i \sim N(0, \mathrm{sd} = \sigma), where σ is usually unknown. The ε's have nothing to do with one another (independent); e.g., a big ε_i does not imply a big ε_j.
- We know the X_i's exactly. This implies that all error occurs in the vertical direction.

Estimating the regression line

The residual e_i = Y_i - (a + b X_i) measures the vertical distance from a point to the regression line. One estimates a and b by minimizing

  f(a, b) = \sum_{i=1}^{n} (Y_i - (a + b X_i))^2

Taking the derivatives of f(a, b) with respect to a and b and setting them to 0, we get

  a = \bar{Y} - b \bar{X}; \qquad b = \frac{\frac{1}{n} \sum_{i=1}^{n} X_i Y_i - \bar{X} \bar{Y}}{\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}^2}

f(a, b) is also referred to as the Sum of Squared Errors (SSE).
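For concreteness, here is a minimal Python sketch of these closed-form formulas (fit_line is my name, not the lecture's); np.polyfit serves as an independent cross-check:

```python
import numpy as np

def fit_line(x, y):
    """Slope and intercept from the closed-form least-squares formulas above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    b = (np.sum(x * y) / n - x.mean() * y.mean()) / (np.sum(x ** 2) / n - x.mean() ** 2)
    a = y.mean() - b * x.mean()
    return a, b

# Points scattered around the line y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
a, b = fit_line(x, y)
print(a, b)                 # close to 1 and 2
print(np.polyfit(x, y, 1))  # cross-check: returns [b, a]
```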

Probability

- Definition: frequentist view versus Bayesian view.
- Kolmogorov's axioms.
- Conditional probability:

  P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)}

- Independence:

  P(A \mid B) = P(A)

- The addition rule:

  P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)

- The total probability rule:

  P(A) = P(A \mid B) P(B) + P(A \mid \text{not } B) P(\text{not } B)

- Bayes' rule: let A_1, ..., A_n be mutually exclusive and suppose that P(A_1 or A_2 or ... or A_n) = 1. Then

  P(A_1 \mid B) = \frac{P(B \mid A_1) \times P(A_1)}{\sum_{i=1}^{n} P(B \mid A_i) P(A_i)}

- The binomial formula:

  P(\text{exactly } r \text{ successes}) = \binom{n}{r} p^r (1 - p)^{n - r}

- The Poisson formula:

  P(\text{exactly } k \text{ events}) = \frac{\lambda^k}{k!} \exp(-\lambda)
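A quick numerical check of the binomial and Poisson formulas in Python (function names are mine):

```python
from math import comb, exp, factorial

def binom_pmf(r, n, p):
    """P(exactly r successes) = C(n, r) p^r (1 - p)^(n - r)."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def poisson_pmf(k, lam):
    """P(exactly k events) = lambda^k / k! * exp(-lambda)."""
    return lam ** k / factorial(k) * exp(-lam)

print(binom_pmf(2, 10, 0.3))  # about 0.233
print(poisson_pmf(3, 2.0))    # about 0.180
```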

The normal distribution and the Central Limit Theorem

- The normal distribution and use of the normal table:

  f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^2} (x - \mu)^2\right)

- Box model: EV and σ.
- The Central Limit Theorem for averages:

  \frac{\bar{X} - EV}{\sigma / \sqrt{n}} \sim N(0, 1)

- The Central Limit Theorem for sums:

  \frac{n \bar{X} - n \cdot EV}{\sqrt{n}\,\sigma} \sim N(0, 1)

- The Central Limit Theorem for proportions:

  \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} \sim N(0, 1)
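To see the averages version at work, a minimal simulation sketch (my assumptions: draws come from a Uniform(0, 1) box, so EV = 0.5 and σ = √(1/12); 10,000 samples of size 50):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                 # draws per sample
ev, sigma = 0.5, np.sqrt(1 / 12)       # EV and SD of the Uniform(0, 1) box

# Standardize 10,000 sample averages: (xbar - EV) / (sigma / sqrt(n))
xbar = rng.uniform(0, 1, size=(10_000, n)).mean(axis=1)
z = (xbar - ev) / (sigma / np.sqrt(n))

# If the CLT holds, about 95% of the z's fall between -1.96 and 1.96
print(np.mean(np.abs(z) < 1.96))       # close to 0.95
```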

Confidence Intervals

- The formulas:
  - Two-sided (L, U):

    L = pe - se × cv_{(1-C)/2}; U = pe + se × cv_{(1-C)/2}

  - One-sided (-∞, L): L = pe + se × cv_C.
  - One-sided (U, +∞): U = pe + se × cv_{(1-C)}.
- Confidence intervals for:
  - Average: pe = X̄, se = σ/√n.
  - Sum: pe = nX̄, se = √n σ.
  - Proportion: pe = p̂ (the sample proportion, i.e., X̄ for 0/1 data), se = √(p(1 - p)/n).
- Interpretation: what is random, and what is constant.
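A minimal sketch of the two-sided interval for an average, assuming a known σ and the 95% critical value 1.96 from the normal table:

```python
import numpy as np

def ci_average(x, sigma, cv=1.96):
    """Two-sided interval pe +/- se * cv for a population average, known sigma."""
    x = np.asarray(x, float)
    pe = x.mean()                    # point estimate
    se = sigma / np.sqrt(len(x))     # standard error of the average
    return pe - cv * se, pe + cv * se

rng = np.random.default_rng(1)
sample = rng.normal(10, 2, size=100)  # draws from a box with EV 10, SD 2
print(ci_average(sample, sigma=2))    # should usually cover 10
```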

Significance Tests

A significance test requires:

- a null and alternative hypothesis.
- a test statistic.
- a significance probability (P-value).

I: Possible hypotheses

1. H_0: θ = θ_0; H_a: θ ≠ θ_0.
2. H_0: θ ≤ θ_0; H_a: θ > θ_0.
3. H_0: θ ≥ θ_0; H_a: θ < θ_0.

Here θ represents a generic parameter. It could be a population mean, a population proportion, the difference of two population means, or many other things.

II: Possible test statistics

a. For a test about a population mean, take θ to be the population mean μ. If you know the population SD, or if n > 26 and you use the sample SD as an estimate of it, then you get the significance probability from a z-table and the test statistic is:

   ts = \frac{\bar{X} - \mu_0}{SD / \sqrt{n}}

b. For the previous case, if you have a sample of size n ≤ 26 and use the sample SD to estimate the population SD, then the significance probability comes from a t_{n-1} table and the test statistic is:

   ts = \frac{\bar{X} - \mu_0}{SD / \sqrt{n - 1}}

c. For a test about a proportion, θ = p. The significance probability comes from a z-table, and the test statistic is:

   ts = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0)/n}}

d. For a test of the difference of two means, θ = μ_1 - μ_2. Assuming that the sample sizes from each population satisfy n_1 > 26 and n_2 > 26, the significance probability comes from a z-table and the test statistic is:

   ts = \frac{\bar{X}_1 - \bar{X}_2 - \theta_0}{\sqrt{SD_1^2 / n_1 + SD_2^2 / n_2}}

e. For a test of the difference of two proportions, take θ = p_1 - p_2. Use a z-table for the significance probability and the test statistic:

   ts = \frac{\hat{p}_1 - \hat{p}_2 - \theta_0}{\sqrt{\hat{p}_1 (1 - \hat{p}_1)/n_1 + \hat{p}_2 (1 - \hat{p}_2)/n_2}}

f. For n ≤ 26, with θ = μ_1 - μ_2 and n paired differences X_i - Y_i, use t_{n-1} for the significance probability and the test statistic:

   ts = \frac{\bar{X} - \bar{Y} - \theta_0}{SD_d / \sqrt{n - 1}}

   Here SD_d is the sample standard deviation of the n differences.

III: The significance probability

- The significance probability of the test statistic depends on the hypothesis chosen in Part I. For that choice, let W be a random variable with the z or t_{n-1} distribution indicated in Part II. Then:

  1. The significance probability is P(W ≤ -|ts|) + P(W ≥ |ts|).
  2. The significance probability is P(W ≥ ts).
  3. The significance probability is P(W ≤ ts).

- The significance probability is "the chance of observing data that supports the alternative hypothesis as or more strongly than the data you have seen, when the null hypothesis is correct."
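Combining Parts II and III, a minimal sketch of case (a) with the two-sided alternative (hypothesis 1); scipy's norm is used for the tail areas:

```python
import numpy as np
from scipy.stats import norm

def z_test_mean(x, mu0, sd):
    """Two-sided z test of H0: mu = mu0 with known population SD."""
    x = np.asarray(x, float)
    ts = (x.mean() - mu0) / (sd / np.sqrt(len(x)))
    p_value = norm.cdf(-abs(ts)) + norm.sf(abs(ts))  # P(W <= -|ts|) + P(W >= |ts|)
    return ts, p_value

rng = np.random.default_rng(2)
x = rng.normal(10.5, 2, size=100)    # data generated with true mean 10.5
print(z_test_mean(x, mu0=10, sd=2))  # small P-value: evidence against H0
```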

- Goodness-of-fit tests:

  H_0: the model holds; H_a: the model fails

  ts = \sum \frac{(O_i - E_i)^2}{E_i}

  with degrees of freedom k = #categories - 1.

- Contingency tables and tests of independence:

  H_0: the two criteria are independent; H_a: some dependence exists

  ts = \sum_{\text{all cells}} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, \qquad E_{ij} = \frac{i\text{th row sum} \times j\text{th column sum}}{\text{total}}

  with degrees of freedom k = (number of rows - 1) × (number of columns - 1).
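A minimal sketch of the independence test on a 2×2 table, computing E_ij and ts exactly as above and taking the P-value from a chi-square distribution with k degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

obs = np.array([[30, 20],   # observed counts O_ij
                [20, 30]])

row = obs.sum(axis=1, keepdims=True)  # row sums
col = obs.sum(axis=0, keepdims=True)  # column sums
exp = row * col / obs.sum()           # E_ij = (row sum x column sum) / total

ts = ((obs - exp) ** 2 / exp).sum()
k = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print(ts, chi2.sf(ts, df=k))          # ts = 4.0, P-value about 0.046
```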

Multiple Regression

- In multiple regression, there is more than one explanatory variable. The model is

  Y_i = a + b_1 X_{1i} + b_2 X_{2i} + \ldots + b_p X_{pi} + \varepsilon_i

  Again, the ε_i are independent normal random variables with mean 0.
- The null and alternative hypotheses are:

  H_0: b_1 ≥ 0; H_a: b_1 < 0

- The test statistic is

  ts = \frac{b_1 - 0}{se}

- This is compared to a t-distribution with n - p - 1 degrees of freedom, where p is the number of explanatory variables in the regression model.
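A minimal sketch of this t test (variable names are mine; the standard error of b_1 comes from the usual least-squares formula, which the slide does not spell out):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(3)
n, p = 100, 2
X = rng.normal(size=(n, p))
y = 1.0 - 0.5 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=n)

# Fit Y = a + b1*X1 + b2*X2 by least squares (column of ones for the intercept)
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# se of b1 from the residual variance and the (1,1) entry of (A'A)^(-1)
resid = y - A @ coef
s2 = resid @ resid / (n - p - 1)
se_b1 = np.sqrt(s2 * np.linalg.inv(A.T @ A)[1, 1])

ts = (coef[1] - 0) / se_b1
print(ts, t.cdf(ts, df=n - p - 1))  # one-sided P-value for Ha: b1 < 0
```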

The Bootstrap

The pivot confidence interval assumes that the behavior of θ̂ - θ is approximately the same as the behavior of θ̂* - θ̂. And:

- Suppose we use a computer to draw 1000 bootstrap samples of size n. For each such sample, it calculates a new estimate of the parameter of interest.
- Rank these estimates from smallest to largest. We denote the ordered bootstrap estimates by

  θ̂*_(1), ..., θ̂*_(1000)

  where the number in parentheses shows the order in terms of size. Thus θ̂*_(1) is the smallest estimate found in one of the 1000 bootstrap samples, and θ̂*_(1000) is the largest.
- The 95% confidence interval is given by

  L = 2θ̂ - θ̂*_(0.975); U = 2θ̂ - θ̂*_(0.025)

  where θ̂*_(0.975) and θ̂*_(0.025) denote the 0.975 and 0.025 quantiles of the ordered bootstrap estimates (here, roughly the 975th and 25th of the 1000).
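A minimal sketch of the pivot interval for a sample mean (1000 resamples of size n, drawn with replacement):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=50)  # the original sample; true mean is 2
theta_hat = x.mean()

# 1000 bootstrap samples of size n; each gives a new estimate of the mean
boot = np.array([rng.choice(x, size=len(x), replace=True).mean()
                 for _ in range(1000)])

# Pivot interval: L = 2*theta_hat - q(0.975), U = 2*theta_hat - q(0.025)
q_lo, q_hi = np.quantile(boot, [0.025, 0.975])
L, U = 2 * theta_hat - q_hi, 2 * theta_hat - q_lo
print(L, U)  # should usually cover the true mean, 2
```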

Bayesian Statistics

- Recall Bayes' theorem:

  P(A_1 \mid B) = \frac{P(B \mid A_1) \times P(A_1)}{\sum_{i=1}^{k} P(B \mid A_i) \times P(A_i)}

  where A_1, ..., A_k are mutually exclusive and

  P(A_1 or A_2 or ... or A_k) = 1

- Specify a prior distribution; calculate the likelihood and the posterior.
- Posterior predictive probability: use the posterior probabilities as weights.

The Prior, Likelihood, and Posterior

  Model p   Prior P(model)   Likelihood P(k = 0 | p)   Product   Posterior P(model | data)
  0.1       1/9              0.656                     0.0729    0.427
  0.2       1/9              0.410                     0.0455    0.267
  0.3       1/9              0.240                     0.0266    0.156
  0.4       1/9              0.130                     0.0144    0.084
  0.5       1/9              0.063                     0.0070    0.041
  0.6       1/9              0.026                     0.0029    0.017
  0.7       1/9              0.008                     0.0009    0.005
  0.8       1/9              0.002                     0.0002    0.001
  0.9       1/9              0.000                     0.0000    0.000
  Total     1                                          0.1704    1
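The likelihood column matches (1 - p)^4, i.e., zero successes in four binomial trials; that n = 4 is my inference from the numbers, not stated on the slide. A minimal sketch that rebuilds the table under that assumption:

```python
import numpy as np

models = np.arange(0.1, 1.0, 0.1)    # the nine candidate values of p
prior = np.full(9, 1 / 9)            # flat prior: P(model) = 1/9
likelihood = (1 - models) ** 4       # P(k = 0 | p), assuming 4 trials
product = prior * likelihood
posterior = product / product.sum()  # P(model | data)

for p_val, post in zip(models, posterior):
    print(f"p = {p_val:.1f}: posterior = {post:.3f}")
print(product.sum())                 # about 0.1704, the normalizing constant
```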