
Math 344: Regression Analysis
Chapter 1: Linear Regression with One Predictor Variable

Wenge Guo

January 19, 2012


Why regression?

- Want to model a functional relationship between a "predictor variable" (input, independent variable, etc.) and a "response variable" (output, dependent variable, etc.)
- Examples?
- But the real world is noisy; there is no clean f = ma
  - Observation noise
  - Process noise
- Two distinct goals:
  - (Estimation) Understanding the relationship between predictor variables and response variables
  - (Prediction) Predicting the future response given newly observed predictors.

History

- Sir Francis Galton, 19th century
  - Studied the relation between heights of parents and children and noted that the children "regressed" to the population mean
- "Regression" stuck as the term to describe statistical relations between variables

Example Applications

Trend lines, e.g., Google over 6 months.

Others

- Epidemiology
  - Relating lifespan to obesity, smoking habits, etc.
- Science and engineering
  - Relating physical inputs to physical outputs in complex systems
- Brain

Aims for the course

- Given something you would like to predict and some number of covariates:
  - What kind of model should you use?
  - Which variables should you include?
  - Which transformations of variables and interaction terms should you use?
- Given a model and some data:
  - How do you fit the model to the data?
  - How do you express confidence in the values of the model parameters?
  - How do you regularize the model to avoid over-fitting and other related issues?


Data for Regression Analysis

- Observational data. Example: the relation between the age of an employee (X) and the number of days of illness last year (Y). Cannot be controlled!
- Experimental data. Example: an insurance company wishes to study the relation between the productivity of its analysts in processing claims (Y) and the length of training (X).
  - Treatment: the length of training
  - Experimental units: the analysts included in the study
- Completely randomized design: the most basic type of statistical design. Example: the same setting, but every experimental unit has an equal chance to receive any one of the treatments.


Simple Linear Regression

- Want to find parameters for a function of the form

  Y_i = β_0 + β_1 X_i + ε_i

- The distribution of the error random variable is not specified

Quick Example - Scatter Plot

[Scatter plot of example data: variable a on the horizontal axis (roughly −2 to 2), variable b on the vertical axis (roughly −2 to 8).]

Formal Statement of Model

Y_i = β_0 + β_1 X_i + ε_i

- Y_i is the value of the response variable in the i-th trial
- β_0 and β_1 are parameters
- X_i is a known constant, the value of the predictor variable in the i-th trial
- ε_i is a random error term with mean E(ε_i) = 0 and variance Var(ε_i) = σ²
- i = 1, ..., n

Properties

- The response Y_i is the sum of two components:
  - Constant term β_0 + β_1 X_i
  - Random term ε_i
- The expected response is

  E(Y_i) = E(β_0 + β_1 X_i + ε_i)
         = β_0 + β_1 X_i + E(ε_i)
         = β_0 + β_1 X_i
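A minimal simulation sketch (not part of the slides) of this property: with hypothetical parameter values, the sample mean of Y at a fixed X is close to the regression function β_0 + β_1 X.

```python
# Simulate Y_i = beta0 + beta1*X_i + eps_i at one fixed X and check E(Y_i).
# The parameter values here are assumptions chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0
x = 3.0                                     # a fixed level of the predictor
eps = rng.normal(0.0, sigma, size=100_000)
y = beta0 + beta1 * x + eps

print(y.mean())                             # ~ 3.5
print(beta0 + beta1 * x)                    # the regression function at x
```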

Expectation Review

- Definition

  E(X) = ∫ X P(X) dX,  X ∈ ℝ

- Linearity property

  E(aX) = a E(X)
  E(aX + bY) = a E(X) + b E(Y)

- Obvious from the definition

Example Expectation Derivation

P(X) = 2X,  0 ≤ X ≤ 1

Expectation:

  E(X) = ∫₀¹ X P(X) dX
       = ∫₀¹ 2X² dX
       = (2X³/3) |₀¹
       = 2/3
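A quick numeric check (illustration only, not from the slides) of E(X) = 2/3 for the density P(X) = 2X on [0, 1], using a simple Riemann sum.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_001)
p = 2.0 * x                      # the density 2x on [0, 1]
dx = x[1] - x[0]
expectation = np.sum(x * p) * dx # Riemann sum of the integral of x * p(x)
print(expectation)               # ~ 0.6667 = 2/3
```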

Expectation of a Product of Random Variables

If X and Y are random variables with joint distribution P(X, Y), then the expectation of the product is given by

  E(XY) = ∫∫ X Y P(X, Y) dX dY.

Expectation of a product of random variables

What if X and Y are independent? If X and Y are independent with density functions f and g respectively, then

  E(XY) = ∫∫ X Y f(X) g(Y) dX dY
        = ∫_X ∫_Y X Y f(X) g(Y) dY dX
        = ∫_X X f(X) [ ∫_Y Y g(Y) dY ] dX
        = ∫_X X f(X) E(Y) dX
        = E(X) E(Y)
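A small Monte Carlo sketch (illustration only) of this fact: for independently drawn X and Y, the sample average of XY is close to the product of the sample averages. The particular distributions below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)   # E(X) = 2
y = rng.normal(loc=3.0, scale=1.0, size=1_000_000)  # E(Y) = 3

print((x * y).mean())        # ~ 6 = E(X) E(Y)
print(x.mean() * y.mean())
```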

Regression Function

- The response Y_i comes from a probability distribution with mean

  E(Y_i) = β_0 + β_1 X_i

- This means the regression function is

  E(Y) = β_0 + β_1 X

  since the regression function relates the means of the probability distributions of Y for a given X to the level of X

Error Terms

- The response Y_i in the i-th trial exceeds or falls short of the value of the regression function by the error term amount ε_i
- The error terms ε_i are assumed to have constant variance σ²

Response Variance

Responses Y_i have the same constant variance:

  Var(Y_i) = Var(β_0 + β_1 X_i + ε_i)
           = Var(ε_i)
           = σ²

Variance (2nd central moment) Review

- Continuous distribution

  Var(X) = E((X − E(X))²) = ∫ (X − E(X))² P(X) dX,  X ∈ ℝ

- Discrete distribution

  Var(X) = E((X − E(X))²) = ∑_i (X_i − E(X))² P(X_i),  X ∈ ℤ

Alternative Form for Variance

Var(X) = E((X − E(X))²)
       = E(X² − 2X E(X) + E(X)²)
       = E(X²) − 2 E(X) E(X) + E(X)²
       = E(X²) − 2 E(X)² + E(X)²
       = E(X²) − E(X)².

Example Variance Derivation

P(X) = 2X,  0 ≤ X ≤ 1

Var(X) = E((X − E(X))²) = E(X²) − E(X)²
       = ∫₀¹ 2X · X² dX − (2/3)²
       = (2X⁴/4) |₀¹ − 4/9
       = 1/2 − 4/9
       = 1/18
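A numeric check (illustration only, not from the slides) of Var(X) = 1/18 for P(X) = 2X on [0, 1], using E(X²) − E(X)² with a simple Riemann sum.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1_000_001)
p = 2.0 * x
dx = x[1] - x[0]
ex  = np.sum(x * p) * dx         # E(X)   ~ 2/3
ex2 = np.sum(x**2 * p) * dx      # E(X^2) ~ 1/2
print(ex2 - ex**2)               # ~ 0.0556 = 1/18
```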

Variance Properties

Var(aX) = a² Var(X)

Var(aX + bY) = a² Var(X) + b² Var(Y)   if X ⊥ Y

Var(a + cX) = c² Var(X)   if a and c are both constants

More generally,

  Var(∑_i a_i X_i) = ∑_i ∑_j a_i a_j Cov(X_i, X_j)

Covariance

- The covariance between two real-valued random variables X and Y, with expected values E(X) = µ and E(Y) = ν, is defined as

  Cov(X, Y) = E((X − µ)(Y − ν))

- which can be rewritten as

  Cov(X, Y) = E(XY − νX − µY + µν)
            = E(XY) − ν E(X) − µ E(Y) + µν
            = E(XY) − µν.

Covariance of Independent Variables

If X and Y are independent, then their covariance is zero. This follows because under independence

  E(XY) = E(X) E(Y) = µν,

and then

  Cov(X, Y) = µν − µν = 0.

Formal Statement of Model

Y_i = β_0 + β_1 X_i + ε_i

- Y_i is the value of the response variable in the i-th trial
- β_0 and β_1 are parameters
- X_i is a known constant, the value of the predictor variable in the i-th trial
- ε_i is a random error term with mean E(ε_i) = 0 and variance Var(ε_i) = σ²
- i = 1, ..., n

Example

An experimenter gave three subjects a very difficult task. Data on the age of the subject (X) and on the number of attempts to accomplish the task before giving up (Y) follow:

  Subject i                 1    2    3
  Age X_i                  20   55   30
  Number of attempts Y_i    5   12   10

Least Squares Linear Regression

- Seek to minimize

  Q = ∑_{i=1}^{n} (Y_i − (β_0 + β_1 X_i))²

- Choose b_0 and b_1 as estimators for β_0 and β_1. b_0 and b_1 will minimize the criterion Q for the given sample observations (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n).

Guess #1, Guess #2: [two slides showing candidate straight-line fits plotted against the scatter data]

Function maximization

- Important technique to remember!
  - Take the derivative
  - Set the result equal to zero and solve
  - Test the second derivative at that point
- Question: does this always give you the maximum?
- Going further: multiple variables, convex optimization


Function Maximization

Find the value of x that maximizes the function

  f(x) = −x² + ln(x),  x > 0

1. Take the derivative and set it to zero:

   d/dx (−x² + ln(x)) = 0      (1)
   −2x + 1/x = 0               (2)
   2x² = 1                     (3)
   x² = 1/2                    (4)
   x = √2/2                    (5)

2. Check the second derivative:

   d²/dx² (−x² + ln(x)) = d/dx (−2x + 1/x)   (6)
                        = −2 − 1/x²          (7)
                        < 0                  (8)

Then we have found the maximum: x* = √2/2, f(x*) = −(1/2)[1 + ln(2)].
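A quick numeric sanity check (illustration only) that f(x) = −x² + ln(x) is maximized near x = √2/2 ≈ 0.7071, with maximum value −(1 + ln 2)/2 ≈ −0.8466, by evaluating f on a fine grid.

```python
import numpy as np

x = np.linspace(0.01, 3.0, 100_000)
f = -x**2 + np.log(x)
i = np.argmax(f)
print(x[i], f[i])                        # ~ 0.7071, ~ -0.8466
print(np.sqrt(2) / 2, -(1 + np.log(2)) / 2)
```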

Least Squares Max(min)imization

- Function to minimize w.r.t. b_0 and b_1 (b_0 and b_1 are called point estimators of β_0 and β_1 respectively):

  Q = ∑_{i=1}^{n} (Y_i − (b_0 + b_1 X_i))²

- Minimize this by maximizing −Q
- Either way, find the partial derivatives and set both equal to zero:

  dQ/db_0 = 0,   dQ/db_1 = 0

Normal Equations

- The results of this maximization step are called the normal equations. b_0 and b_1 are called point estimators of β_0 and β_1 respectively:

  ∑ Y_i = n b_0 + b_1 ∑ X_i
  ∑ X_i Y_i = b_0 ∑ X_i + b_1 ∑ X_i²

- This is a system of two equations in two unknowns. The solution is given by...

Solution to Normal Equations

After a lot of algebra one arrives at

  b_1 = ∑ (X_i − X̄)(Y_i − Ȳ) / ∑ (X_i − X̄)²

  b_0 = Ȳ − b_1 X̄

where

  X̄ = ∑ X_i / n,   Ȳ = ∑ Y_i / n
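A small sketch (not from the slides) applying the closed-form solution to the three-subject example data given earlier, X = (20, 55, 30) and Y = (5, 12, 10), and cross-checking against NumPy's least-squares polynomial fit.

```python
import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
print(b1, b0)                     # ~ 0.1769, ~ 2.808

print(np.polyfit(X, Y, deg=1))    # returns (slope, intercept), same values
```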

Least Squares Fit

Questions to Ask

- Is the relationship really linear?
- What is the distribution of the "errors"?
- Is the fit good?
- How much of the variability of the response is accounted for by including the predictor variable?
- Is the chosen predictor variable the best one?

Goals for First Half of Course

- How to do linear regression
  - Self-familiarization with software tools
- How to interpret standard linear regression results
- How to derive tests
- How to assess and address deficiencies in regression models

Estimators for β0, β1, σ2

- We want to establish properties of estimators for β_0, β_1, and σ² so that we can construct hypothesis tests and so forth
- We will start by establishing some properties of the regression solution.

Properties of Solution

- The i-th residual is defined to be

  e_i = Y_i − Ŷ_i

- The sum of the residuals is zero:

  ∑_i e_i = ∑ (Y_i − b_0 − b_1 X_i)
          = ∑ Y_i − n b_0 − b_1 ∑ X_i
          = ∑ Y_i − n(Ȳ − b_1 X̄) − b_1 ∑ X_i
          = 0

Properties of Solution

The sum of the observed values Y_i equals the sum of the fitted values Ŷ_i:

  ∑_i Ŷ_i = ∑_i (b_1 X_i + b_0)
          = ∑_i (b_1 X_i + Ȳ − b_1 X̄)
          = b_1 ∑_i X_i + n Ȳ − b_1 n X̄
          = b_1 n X̄ + ∑_i Y_i − b_1 n X̄
          = ∑_i Y_i

Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the level of the predictor variable in the i-th trial:

  ∑_i X_i e_i = ∑ (X_i (Y_i − b_0 − b_1 X_i))
              = ∑_i X_i Y_i − b_0 ∑ X_i − b_1 ∑ X_i²
              = 0

The last step is due to the second normal equation.

Properties of Solution

The sum of the weighted residuals is zero when the residual in the i-th trial is weighted by the fitted value of the response variable in the i-th trial:

  ∑_i Ŷ_i e_i = ∑_i (b_0 + b_1 X_i) e_i
              = b_0 ∑_i e_i + b_1 ∑_i X_i e_i
              = 0

Properties of Solution

The regression line always goes through the point (X̄, Ȳ).

Using the alternative form of the linear regression model,

  Y_i = β*_0 + β_1 (X_i − X̄) + ε_i,

the least squares estimator b_1 for β_1 remains the same as before. The least squares estimator for β*_0 = β_0 + β_1 X̄ becomes

  b*_0 = b_0 + b_1 X̄ = (Ȳ − b_1 X̄) + b_1 X̄ = Ȳ

Hence the estimated regression function is

  Ŷ = Ȳ + b_1 (X − X̄)

Evidently, the regression line always goes through the point (X̄, Ȳ).
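A brief numeric sketch (illustration only) verifying the stated properties of the least squares solution on the three-subject example data: the residuals sum to zero, the fitted and observed values have the same sum, the X- and Ŷ-weighted residual sums are zero, and the line passes through (X̄, Ȳ).

```python
import numpy as np

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])

b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
b0 = Y.mean() - b1 * X.mean()
fitted = b0 + b1 * X
resid = Y - fitted

print(resid.sum())                    # ~ 0
print(fitted.sum(), Y.sum())          # equal
print((X * resid).sum())              # ~ 0
print((fitted * resid).sum())         # ~ 0
print(b0 + b1 * X.mean(), Y.mean())   # the line passes through (X̄, Ȳ)
```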

Estimating Error Term Variance σ2

- Review estimation in a non-regression setting.
- Show estimation results for the regression setting.

Estimation Review

- An estimator is a rule that tells how to calculate the value of an estimate based on the measurements contained in a sample
- e.g., the sample mean

  Ȳ = (1/n) ∑_{i=1}^{n} Y_i

Point Estimators and Bias

- Point estimator

  θ̂ = f({Y_1, ..., Y_n})

- Unknown quantity / parameter

  θ

- Definition: bias of an estimator

  B(θ̂) = E(θ̂) − θ

Example

- Samples from a Normal(µ, σ²) distribution

  Y_i ∼ Normal(µ, σ²)

- Estimate the population mean

  θ = µ,   θ̂ = Ȳ = (1/n) ∑_{i=1}^{n} Y_i

Sampling Distribution of the Estimator

- First moment

  E(θ̂) = E((1/n) ∑_{i=1}^{n} Y_i)
        = (1/n) ∑_{i=1}^{n} E(Y_i) = nµ/n = θ

- This is an example of an unbiased estimator:

  B(θ̂) = E(θ̂) − θ = 0

Variance of Estimator

- Definition: variance of an estimator

  Var(θ̂) = E([θ̂ − E(θ̂)]²)

- Remember:

  Var(cY) = c² Var(Y)

  Var(∑_{i=1}^{n} Y_i) = ∑_{i=1}^{n} Var(Y_i)

  only if the Y_i are independent with finite variance

Example Estimator Variance

- For the Normal(µ, σ²) mean estimator

  Var(θ̂) = Var((1/n) ∑_{i=1}^{n} Y_i)
          = (1/n²) ∑_{i=1}^{n} Var(Y_i) = nσ²/n² = σ²/n

- Note the assumptions
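A simulation sketch (illustration only, parameter values are assumptions): for Y_i ∼ Normal(µ, σ²) the sample mean is unbiased for µ, and its variance across repeated samples is close to σ²/n.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 5.0, 2.0, 25, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
theta_hat = samples.mean(axis=1)      # one sample-mean estimate per replication

print(theta_hat.mean())               # ~ mu = 5.0   (unbiasedness)
print(theta_hat.var())                # ~ sigma^2 / n = 0.16
```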

Bias Variance Trade-off

- The mean squared error of an estimator is

  MSE(θ̂) = E([θ̂ − θ]²)

- It can be re-expressed as

  MSE(θ̂) = Var(θ̂) + (B(θ̂))²

  i.e., MSE = VARIANCE + BIAS²

Proof

MSE(θ̂) = E((θ̂ − θ)²)
        = E(([θ̂ − E(θ̂)] + [E(θ̂) − θ])²)
        = E([θ̂ − E(θ̂)]²) + 2 E([E(θ̂) − θ][θ̂ − E(θ̂)]) + E([E(θ̂) − θ]²)
        = Var(θ̂) + 2 E(E(θ̂)[θ̂ − E(θ̂)] − θ[θ̂ − E(θ̂)]) + (B(θ̂))²
        = Var(θ̂) + 2(0 − 0) + (B(θ̂))²
        = Var(θ̂) + (B(θ̂))²

The cross term vanishes because E(θ̂ − E(θ̂)) = 0 and both E(θ̂) and θ are constants.

Trade-off

- Think of variance as confidence and bias as correctness.
- Intuitions (largely) apply
- Sometimes choosing a biased estimator can result in an overall lower MSE if it exhibits lower variance.
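A simulation sketch (illustration only) of this trade-off: a deliberately biased "shrunken" mean 0.5·Ȳ can have lower MSE than the unbiased sample mean Ȳ when the true mean is small relative to the noise. All parameter values below are assumptions chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
unbiased = samples.mean(axis=1)
shrunken = 0.5 * unbiased                # biased toward zero, but lower variance

print(np.mean((unbiased - mu) ** 2))     # ~ sigma^2 / n = 0.8
print(np.mean((shrunken - mu) ** 2))     # ~ 0.45, smaller despite the bias
```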

s2 estimator for σ2 for Single population

Sum of squares:

  ∑_{i=1}^{n} (Y_i − Ȳ)²

Sample variance estimator:

  s² = ∑_{i=1}^{n} (Y_i − Ȳ)² / (n − 1)

- s² is an unbiased estimator of σ².
- The sum of squares has n − 1 "degrees of freedom" associated with it; one degree of freedom is lost by using Ȳ as an estimate of the unknown population mean µ.

Estimating Error Term Variance σ2 for Regression Model

- Regression model
- The variance of each observation Y_i is σ² (the same as for the error term ε_i)
- Each Y_i comes from a different probability distribution, with different means that depend on the level X_i
- The deviation of an observation Y_i must be calculated around its own estimated mean

s2 estimator for σ2

s² = MSE = SSE / (n − 2) = ∑ (Y_i − Ŷ_i)² / (n − 2) = ∑ e_i² / (n − 2)

- MSE is an unbiased estimator of σ²:

  E(MSE) = σ²

- The sum of squares SSE has n − 2 "degrees of freedom" associated with it.
- Cochran's theorem (later in the course) tells us where degrees of freedom come from and how to calculate them.
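A simulation sketch (illustration only): averaging MSE = SSE/(n − 2) over many simulated data sets gives a value close to the true σ², while dividing SSE by n underestimates it. The parameter values below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1, sigma, n, reps = 1.0, 0.5, 2.0, 10, 20_000
X = np.linspace(0.0, 9.0, n)

mse_vals, naive_vals = [], []
for _ in range(reps):
    Y = beta0 + beta1 * X + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    sse = np.sum((Y - (b0 + b1 * X)) ** 2)
    mse_vals.append(sse / (n - 2))
    naive_vals.append(sse / n)

print(np.mean(mse_vals))     # ~ sigma^2 = 4            (unbiased)
print(np.mean(naive_vals))   # ~ (n-2)/n * sigma^2 = 3.2 (biased low)
```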

Normal Error Regression Model

- No matter how the error terms ε_i are distributed, the least squares method provides unbiased point estimators of β_0 and β_1
  - that also have minimum variance among all unbiased linear estimators
- To set up interval estimates and make tests we need to specify the distribution of the ε_i
- We will assume that the ε_i are normally distributed.

Normal Error Regression Model

Y_i = β_0 + β_1 X_i + ε_i

- Y_i is the value of the response variable in the i-th trial
- β_0 and β_1 are parameters
- X_i is a known constant, the value of the predictor variable in the i-th trial
- ε_i ∼iid N(0, σ²); note this is different, now we know the distribution
- i = 1, ..., n

Notational Convention

- When you see ε_i ∼iid N(0, σ²),
- it is read as: ε_i is distributed identically and independently according to a normal distribution with mean 0 and variance σ²
- Examples:
  - θ ∼ Poisson(λ)
  - z ∼ G(θ)

Maximum Likelihood Principle

The method of maximum likelihood chooses as estimates those values of the parameters that are most consistent with the sample data.

Likelihood Function

If

  X_i ∼ F(Θ),  i = 1, ..., n

then the likelihood function is

  L({X_i}_{i=1}^n, Θ) = ∏_{i=1}^{n} F(X_i; Θ)

Maximum Likelihood Estimation

- The likelihood function can be maximized w.r.t. the parameter(s) Θ; doing this, one can arrive at estimators for the parameters as well.

  L({X_i}_{i=1}^n, Θ) = ∏_{i=1}^{n} F(X_i; Θ)

- To do this, find solutions (analytically or by following the gradient) to

  dL({X_i}_{i=1}^n, Θ) / dΘ = 0

Important Trick

Never (well, almost never) maximize the likelihood function itself; maximize the log-likelihood function instead:

  log L({X_i}_{i=1}^n, Θ) = log ∏_{i=1}^{n} F(X_i; Θ)
                          = ∑_{i=1}^{n} log F(X_i; Θ)

Quite often the log of the density is easier to work with mathematically.
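A tiny numerical illustration (not from the slides) of another reason to prefer the log-likelihood: the raw likelihood of many observations underflows to 0.0 in floating point, while the sum of log-densities stays finite. The standard normal data below are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=2_000)

densities = np.exp(-data**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
print(np.prod(densities))            # underflows to 0.0
print(np.sum(np.log(densities)))     # finite (a large negative number)
```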

ML Normal Regression

Likelihood function:

  L(β_0, β_1, σ²) = ∏_{i=1}^{n} [1 / (2πσ²)^{1/2}] exp( −(Y_i − β_0 − β_1 X_i)² / (2σ²) )
                  = [1 / (2πσ²)^{n/2}] exp( −(1/(2σ²)) ∑_{i=1}^{n} (Y_i − β_0 − β_1 X_i)² )

which, if you maximize (how?) w.r.t. the parameters, gives you...

Maximum Likelihood Estimator(s)

- β_0: b_0, the same as in the least squares case
- β_1: b_1, the same as in the least squares case
- σ²:

  σ̂² = ∑_i (Y_i − Ŷ_i)² / n

- Note that the ML estimator is biased, since s² is unbiased and

  s² = MSE = [n / (n − 2)] σ̂²
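A sketch (illustration only) of the "maximize numerically" route: minimizing the negative normal log-likelihood with a general-purpose optimizer recovers the least squares estimates b_0 and b_1 and the ML variance estimate SSE/n. It assumes SciPy is available and reuses the three-subject example data; the log-scale parameterization of σ is a convenience choice.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([20.0, 55.0, 30.0])
Y = np.array([5.0, 12.0, 10.0])
n = len(Y)

def neg_log_lik(params):
    b0, b1, log_sigma = params
    sigma2 = np.exp(2.0 * log_sigma)          # keep sigma positive via the log scale
    resid = Y - (b0 + b1 * X)
    return 0.5 * n * np.log(2.0 * np.pi * sigma2) + np.sum(resid ** 2) / (2.0 * sigma2)

fit = minimize(neg_log_lik, x0=np.array([Y.mean(), 0.0, np.log(Y.std())]))
b0, b1, log_sigma = fit.x
print(b0, b1)                                 # ~ the least squares estimates
print(np.exp(2.0 * log_sigma))                # ~ SSE / n, the (biased) ML variance estimate
```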

Comments

- Least squares minimizes the squared error between the prediction and the true output
- The normal distribution is fully characterized by its first two central moments (mean and variance)