Chapter 2: Estimation
MATH 531: Regression - I

Ganggang Xu
Office: OW 133

Fall 2015, Binghamton University

1 / 51


Outline

1 Introduction to R

2 Simple Linear Regression Model

3 Multiple Linear Regression

2 / 51


The next section will be . . .

1 Introduction to R
  Data types in R
  Data operations in R
  Data visualization in R
  Write your own functions

2 Simple Linear Regression Model

3 Multiple Linear Regression

3 / 51


Some basics about R software

Download: www.r-project.org

Install an R package: e.g., install.packages("MASS")
  There are tons of packages available at https://cran.cnr.berkeley.edu/
  If you need to do something special, you can probably find an existing package out there;
  You can create your own package and share it with others online;
  Everything is free;

Load an R package: e.g., library("MASS")
  If you want to use an R package, you need to load it first;

Set the working directory: e.g., setwd("C:/regression")
  After you have done this, R will look for everything under the folder "C:/regression";
  This is not mandatory but will be very helpful;

Read a file/dataset into R:
  For a .txt file: e.g., read.table("C:/regression/awesome.txt", header=TRUE)
  For a .csv file: e.g., read.csv("C:/regression/awesome.csv", header=TRUE)
  Type "?read.table" or "?read.csv" for more information;
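Putting these pieces together, a typical first session might look like the following sketch; the folder and file name are placeholders for your own:

  # install.packages("MASS")                  # only needed once per machine
  library("MASS")                             # load the package for this session
  setwd("C:/regression")                      # paths below are now relative to this folder
  dat <- read.csv("awesome.csv", header=TRUE) # placeholder file name
  head(dat)                                   # peek at the first few rows
  str(dat)                                    # check the type of each variable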

4 / 51


Some helpful tips for learning R programming

You are expected to master necessary skills of R programming byyourself;

The more you learn, the more complicated things you can do with R;

Google is your best friend!
  For example, if you want to draw a 3d picture with R:
  Google "draw 3d plot R";
  Research all the options you find;

Study the manual of an R package.
  Every R package comes with a manual;
  for example, for the package "MASS",
  go to https://cran.r-project.org/web/packages/MASS/index.html

If you don't know how to use a function in R, use "help()" or "?".
  For example, if you don't know how to use the function "outer()",
  type "help(outer)" or "?outer";

Best way to learn an R function: experimenting! Just play with it using simple examples.

Feel free to knock on my door for any of your R questions.

5 / 51


Data types in R

Basic types:
  Numerical: pi, 3.14;
  Character: "hello", "world";
  Logical: FALSE, TRUE;
  Complex: 2+3i;
  Others: NA (missing value), NaN (not a number), Inf, -Inf, etc.

Checking basic types: these calls return values of type logical.
  is.numeric(3)                  # TRUE
  is.numeric("Someone's name")   # FALSE
  is.character(pi)               # FALSE
  is.character("Someone's name") # TRUE
  is.logical(0)                  # FALSE
  is.logical(F)                  # TRUE
  is.logical(FALSE)              # TRUE
  is.complex(2+3i)               # TRUE
  is.numeric(2+3i)               # FALSE
  is.na(NaN)                     # TRUE
  is.na(Inf)                     # FALSE

6 / 51


Data types in R: Composite types

Uniform basic types: all elements must be of the same type.

Vectors:
  valid vectors: 1:10, 1:(-10), seq(1,10,by=0.5), c("hello","world"), c(FALSE,FALSE,TRUE,FALSE);
  "invalid" vectors: c(0.5,"hello"), c(0.5,TRUE); these do not produce errors, because automatic coercion converts all elements to a common type, as sketched below;
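A minimal sketch of automatic coercion, with results as comments:

  c(0.5, "hello")    # both become character: "0.5" "hello"
  c(0.5, TRUE)       # both become numeric: 0.5 1.0
  c(FALSE, 1L, 2.5)  # all become numeric: 0.0 1.0 2.5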

Matrices:
  mat1 <- matrix(NA, nrow=3, ncol=2); mat1
  mat2 <- matrix(1:6, nrow=3); mat2
  mat3 <- cbind(1:3, 4:6); mat3
  mat4 <- rbind(c(1,4), c(2,5), c(3,6)); mat4

Arrays (3-dimensional):
  arr1 <- array(1:12, dim=c(2,3,4)); arr1
  arr2 <- array(rnorm(20), dim=c(2,5,2)); arr2

A matrix is a 2-dimensional array;

7 / 51


Data types in R: Composite types

Arbitrary types: elements can be of different types.

Lists:
  lst <- list(1:5, matrix(runif(10),5,2), letters, list(NaN,1:3,"A"))
  lst[1]; lst[[1]]   # what is the difference? (see the sketch below)
  Useful, e.g., for returning output from model fitting;

Data frames:
  df <- data.frame("Var1"=1:3, "Var2"=c("A","B","C"), "Var3"=c(T,TRUE,F))
  df
  Purpose: statistical data tables with heterogeneous variable types;
  A data frame is a constrained list, such that all elements are vectors (statistical variables) of the same length but of arbitrary type;
  Coercion to matrix: as.matrix(df);
  The function data.frame() creates a data frame;
  The "$" sign gives a variable in the data frame, e.g., df$Var1 gives the vector (1, 2, 3);
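A short sketch of the single-bracket vs. double-bracket distinction, with results as comments:

  lst <- list(1:5, letters)
  lst[1]        # a list of length 1 that contains the vector 1:5
  lst[[1]]      # the vector 1:5 itself
  df <- data.frame(Var1=1:3, Var2=c("A","B","C"))
  df$Var1       # the vector 1 2 3
  df[["Var1"]]  # the same vector, extracted by name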

8 / 51


Get used to Vectorization!

Avoid loops! What is done in C?

x <- runif(1000); y <- 1:1000;

tmp <- 0;

for(i in 1:1000) tmp <- tmp + x[i]*y[i];

What should be done in R? sum(x*y)!

Get used to Vectorization:

sum(x) # Try this: x[1] <- NA; sum(x)

mean(x)

var(x)

sd(x)

min(x)

max(x)

range(x)

rowSums(mat)

colSums(mat)
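A rough timing comparison sketches why vectorization pays off (exact timings vary by machine):

  x <- runif(1e6); y <- 1:1e6
  system.time({tmp <- 0; for(i in 1:1e6) tmp <- tmp + x[i]*y[i]})  # slow: explicit loop
  system.time(sum(x*y))                                            # fast: vectorized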

9 / 51


Basic plots in R

Plots are among the most important tools in statistical analysis. Useful plots include:

Histogram: hist()

Boxplots: boxplot()

Scatter plots: plot(x,y)

QQ plots: qqnorm()

You will need to learn how to use them in your team project;
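A minimal sketch trying each of these on simulated data:

  x <- rnorm(100); y <- 2*x + rnorm(100)
  hist(x)               # histogram
  boxplot(x)            # boxplot
  plot(x, y)            # scatter plot
  qqnorm(x); qqline(x)  # normal QQ plot, with a reference line added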

10 / 51


Tuning plots in R

Do not submit a plot generated without fine-tuning! Good plots take time, effort, and thought. For example, you need to learn the plotting glyphs available in R; the default open circles are often suboptimal.

plot(runif(200)) # Bad: Plot content has no visual impact.

plot(runif(200), pch=16) # Better.

plot(runif(200), pch=16, cex=1.2) # Even better.

plot(1:25, pch=1:25, cex=2)

plot(1:25, pch=letters[1:25], cex=2)

plot(runif(100000)) # versus:

plot(runif(100000), pch=".")

11 / 51


Write your own functions

You can always write your own functions to serve your specific purpose. For example:

newDef <- function(a, b)
{
  x = runif(10, a, b)
  mean(x)
}

> newDef(-1, 1)
[1] 0.06177728

> newDef
function(a, b)
{
  x = runif(10, a, b)
  mean(x)
}

Detailed instructions on how to do so: https://cran.r-project.org/doc/manuals/R-intro.pdf

12 / 51


The next section will be . . .

1 Introduction to R

2 Simple Linear Regression Model
  Least square estimator
  Properties of Least Square Estimators
  Goodness of fit

3 Multiple Linear Regression

13 / 51


Example 1: Car weight vs. MPG

What is the relationship between the weight of a vehicle and the number of miles it travels per gallon of gasoline? To answer this question, a random sample of 25 vehicles was taken, and the weight (in pounds) and the miles per gallon (MPG) were recorded for each car. Which variable is the explanatory and which is the response?

Explanatory:
Response:

14 / 51


Observations

There is a negative, somewhat linear relationship between MPG and weight.

How do we determine whether a linear association between two variables is statistically significant? We begin by trying to find the straight line that best fits the data.

This technique is known as Simple Linear Regression. It is an attempt to describe the relationship between the variables.

15 / 51


Simple Linear Regression Model

The Simple Linear Regression Model is given as:

yi = β0 + β1xi + εi, i = 1, . . . , n

yi: dependent variable or response variable.

xi: independent variable or explanatory variable.

εi: the error term or noise term, included to explain the discrepancy: something in the response variable yi that cannot be completely determined by the explanatory variable xi.

For example, cars of the same weight will not have exactly the same MPG, due to road conditions, drivers' driving habits, gasoline quality, and so on; all these factors contribute to MPG in ways that are essentially unpredictable, and their impact can be put into the error term εi.

16 / 51


Simple Linear Regression Model (continued)

The Simple Linear Regression Model is given as:

yi = β0 + β1xi + εi, i = 1, . . . , n

β0 is the intercept term; it is fixed but unknown (needs to be estimated!).

β1 is the slope term; it is fixed but unknown (needs to be estimated!).

The explanatory variable xi is usually observable and is thus considered fixed!

εi’s: independent, mean Eεi = 0, constant variance var(εi) = σ2.

17 / 51


Something to keep in mind when using the simple linear regression model

Is the association between the two variables really linear?
  Not always true;
  Nonlinear regression models are also available to deal with that.

Does the association really imply causation?
  Not always true;
  In Florida, when ice cream sales go up, the number of shark attacks also goes up!
  Definite association. But is there causation going on here?

18 / 51


Something to keep in mind when using the simple linear regression model

Does the association really imply causation?
  Not always true;
  In Florida, when ice cream sales go up, the number of shark attacks also goes up!
  Definite association. But is there causation going on here?

19 / 51


Least Square Estimator: Method of Least Squares

We attempt to find estimators of β0 and β1 such that the yi's are overall "close" to the fitted line;

Define the fitted line as ŷi ≡ β̂0 + β̂1xi and the residual as ei = yi − ŷi;

We define the sum of squared errors (or residual sum of squares) to be

SSE (RSS) ≡ ∑_{i=1}^n (yi − ŷi)² = ∑_{i=1}^n [yi − (β̂0 + β̂1xi)]²

We find the pair of β0 and β1 that minimizes this sum of squares. The resulting estimators are called least square estimators (LSE), or ordinary least square (OLS) estimators.

20 / 51


∂SSE/∂β0 = ∂{∑_{i=1}^n [yi − (β0 + β1xi)]²}/∂β0 = 2 ∑_{i=1}^n [yi − (β0 + β1xi)] · (−1) = 0

∂SSE/∂β1 = ∂{∑_{i=1}^n [yi − (β0 + β1xi)]²}/∂β1 = 2 ∑_{i=1}^n [yi − (β0 + β1xi)] · (−xi) = 0

21 / 51


OLS solution

The OLS estimators for the model

yi = β0 + β1xi + εi, i = 1, . . . , n

are

β̂0 = ȳ − β̂1x̄,   β̂1 = ∑_{i=1}^n (xi − x̄)yi / ∑_{i=1}^n (xi − x̄)²

Moreover, since ∑_{i=1}^n ȳ(xi − x̄) = 0, the estimator for β1 can also be written as

β̂1 = ∑_{i=1}^n yi(xi − x̄) / ∑_{i=1}^n (xi − x̄)² = ∑_{i=1}^n (yi − ȳ)(xi − x̄) / ∑_{i=1}^n (xi − x̄)²
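A quick numerical check of these formulas in R, assuming x and y are numeric vectors of equal length (simulated here):

  x <- runif(50); y <- 1 + 2*x + rnorm(50, sd=0.3)
  b1 <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)  # slope: Sxy/Sxx
  b0 <- mean(y) - b1 * mean(x)                         # intercept: ybar - b1*xbar
  c(b0, b1)
  coef(lm(y ~ x))  # should match the hand-computed values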

22 / 51


Notations

Define Sxx = ∑_{i=1}^n (xi − x̄)²;

Define Sxy = ∑_{i=1}^n (xi − x̄)(yi − ȳ);

We have the model setup:

yi = β0 + β1xi + εi, i = 1, . . . , n

Conditions: Eεi = 0 and Var(εi) = σ²;

The least square estimators are:

β̂0 = ȳ − β̂1x̄;   β̂1 = Sxy / Sxx;

23 / 51


Unbiasedness

E[β̂1] = E[∑_{i=1}^n (xi − x̄)Yi / Sxx] = ∑_{i=1}^n (xi − x̄)E[Yi] / Sxx = ∑_{i=1}^n (xi − x̄)(β0 + β1xi) / Sxx = 0 + β1 ∑_{i=1}^n (xi − x̄)² / Sxx = β1.

E[β̂0] = E[ȳ − β̂1x̄] = (β0 + β1x̄) − β1x̄ = β0.

24 / 51


Var(β̂0) and Var(β̂1)

Var(β̂1) = Var[∑_{i=1}^n (xi − x̄)Yi / Sxx]
        = ∑_{i=1}^n (xi − x̄)² Var(Yi) / Sxx²   (by independence)
        = [∑_{i=1}^n (xi − x̄)²] σ² / Sxx²
        = σ² / Sxx

Var(β̂0) = σ² (∑_{i=1}^n xi²) / (n Sxx)? (Homework.)

Cov(β̂0, β̂1) = −σ² (x̄ / Sxx)? (Homework.)

Questions:

What does it mean that Var(β̂1) ≠ 0?

What is the source of randomness in β̂1?

How can you reduce Var(β̂1)?

25 / 51


Visualization of the randomness

You can copy the following code into your R program; change the values of N and sigma and see what happens.

N <- 30                                  # pick a sample size
X <- cbind(x0=rep(1,N), x1=runif(N))     # make up a predictor matrix
sigma <- 1                               # assume a 'true' error sd
beta <- c(b0=1, b1=2)                    # assume a 'true' coefficient vector
y <- X %*% beta + rnorm(N, 0, sd=sigma)  # simulate a response vector
b <- solve(t(X) %*% X) %*% t(X) %*% y    # OLS estimator

## Simulate and animate parallel universes/possible worlds/...:
for(i in 1:50) {
  y <- rnorm(N, mean=X %*% beta, sd=sigma)  # simulate a new response vector
  b <- solve(t(X) %*% X) %*% t(X) %*% y
  plot(x=X[,2], y=y, pch=16, col="gray20", ylim=c(-2,6))
  abline(b, lwd=3, col="red")
  dev.flush()
  Sys.sleep(0.5)
}

26 / 51


Covariance between β̂0 and β̂1

You can copy the following code into your R program; change the values of N and sigma and see what happens.

N <- 30                               # pick a sample size
X <- cbind(x0=rep(1,N), x1=runif(N))  # make up a predictor matrix
sigma <- 1                            # assume a 'true' error sd
beta <- c(b0=1, b1=2)                 # assume a 'true' coefficient vector

## Next, simulate many response vectors (possible worlds, parallel universes, ...)
## and store the results:
Nsim <- 10000
bs <- matrix(NA, nrow=Nsim, ncol=length(beta))         # storage for coefficient estimates
colnames(bs) <- paste("b", 0:(length(beta)-1), sep="") # cosmetics
for(isim in 1:Nsim) {
  y <- X %*% beta + rnorm(N, 0, sd=sigma)  # simulate a response vector
  bs[isim,] <- solve(t(X) %*% X) %*% t(X) %*% y
  if(isim %% 100 == 0) cat(isim, "...")    # progress report
}

## Plot slopes versus intercepts to illustrate the 'sampling' or
## 'dataset-to-dataset' variability of b:
plot(bs, pch=16, cex=.5)
## ==> Negative correlation between b0 and b1 !!!

27 / 51


Evaluating the fit of simple linear regression model

After fitting the simple regression model, how do we know whether the model is a good fit?

1 Use a scatter plot to make sure the relationship between X and Y is linear;

2 Can we have a numerical measure?

28 / 51


ANOVA Decomposition

Define notations:

TSS = ∑_{i=1}^n (yi − ȳ)²,   Total sum of squares

RSS = ∑_{i=1}^n (yi − ŷi)²,   Residual sum of squares

SSreg = ∑_{i=1}^n (ŷi − ȳ)²,   Sum of squares in regression

where ŷi = β̂0 + β̂1xi;

ANOVA Decomposition:

TSS (total variation in Y) = SSreg (variation in Y explained by the model) + RSS (variation in Y unexplained by the model)
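A small sketch verifying the decomposition numerically on simulated data:

  x <- runif(40); y <- 1 + 2*x + rnorm(40)
  fit <- lm(y ~ x)
  TSS <- sum((y - mean(y))^2)
  RSS <- sum(residuals(fit)^2)
  SSreg <- sum((fitted(fit) - mean(y))^2)
  c(TSS, SSreg + RSS)  # the two numbers agree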

29 / 51


ANOVA Decomposition

Proof: Write yi − ȳ = (ŷi − ȳ) + (yi − ŷi) and expand the square. The cross term 2 ∑_{i=1}^n (ŷi − ȳ)(yi − ŷi) vanishes, because the normal equations give ∑_{i=1}^n ei = 0 and ∑_{i=1}^n eixi = 0, hence ∑_{i=1}^n eiŷi = 0.

30 / 51


Definition of R²

The R² is defined as

R² = SSreg / TSS = 1 − RSS / TSS

R² is between 0 and 1;

The bigger the R², the better (loosely speaking...);

For simple linear regression, R² equals r², the squared sample correlation between X and Y:

R² = r² = {∑_{i=1}^n (xi − x̄)(yi − ȳ)}² / [∑_{i=1}^n (xi − x̄)² · ∑_{i=1}^n (yi − ȳ)²].

Proof: Homework!

31 / 51


Some comments on R²

R² can be used to measure the strength of association;
R² can only measure a LINEAR relationship;
If the relationship between x and y is not linear, R² is useless!
R² and the scatter plot need to be used together! (See the sketch below.)
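A classic illustration using R's built-in anscombe data: four datasets with essentially identical R² but very different scatter plots.

  r2 <- sapply(1:4, function(j)
    summary(lm(anscombe[[paste0("y", j)]] ~ anscombe[[paste0("x", j)]]))$r.squared)
  round(r2, 3)        # all four R^2 values are essentially equal (about 0.67)
  par(mfrow=c(2, 2))  # 2 x 2 grid of scatter plots
  for(j in 1:4) plot(anscombe[[paste0("x", j)]], anscombe[[paste0("y", j)]], pch=16)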

32 / 51


The next section will be . . .

1 Introduction to R

2 Simple Linear Regression Model

3 Multiple Linear Regression
  Least square estimator
  Properties of Least Square Estimators
  Goodness of fit
  Estimating σ²

33 / 51


Multiple Linear Regression

In general, there can be more than one independent variable; this is called multiple linear regression.

Y = β0 + β1x(1) + β2x(2) + · · ·+ βpx(p) + ε

Notation: I deliberately use x(j) (subscript with parentheses) to denote the jth variable, instead of the jth observation.

The term "linear" means: Y is a linear function of the parameters β1, . . . , βp!

As an example,

Y = β0 + β1x(1)² + β2 log(x(2)) + · · · + βp exp(x(p)) + ε

is still a linear regression model! On the other hand,

Y = β0 + x(1)^β1 + x(2)/(β2 + 1) + · · · + βpx(p) + ε

is NOT a linear regression model!

34 / 51


Matrix representation

Observations: (yi, xi1, xi2, . . . , xip), i = 1, . . . , n.

y1 ← Y1 = β0 + β1x11 + β2x12 + · · ·+ βpx1p + ε1

y2 ← Y2 = β0 + β1x21 + β2x22 + · · ·+ βpx2p + ε2

...

yi ← Yi = β0 + β1xi1 + β2xi2 + · · ·+ βpxip + εi

...

yn ← Yn = β0 + β1xn1 + β2xn2 + · · ·+ βpxnp + εn

We have (p + 1) coefficients (1 for the intercept term and p for the independent variables) and n observations. Each observation is represented by yi and xi := (1, xi1, xi2, . . . , xip)^T.

35 / 51


Matrix representation

Use matrix notation to ease the derivation:

Y = (y1, y2, . . . , yn)^T,

X =
  [ 1  x11  x12  . . .  x1p ]
  [ 1  x21  x22  . . .  x2p ]
  [ .   .    .           .  ]
  [ 1  xn1  xn2  . . .  xnp ]

(p + 1 columns); i.e., X has rows x1^T, x2^T, . . . , xn^T,

β = (β0, β1, . . . , βp)^T,   ε = (ε1, ε2, . . . , εn)^T

Rewrite the model as

Y = Xβ + ε

36 / 51


The least square estimator is obtained by minimizing the ResidualSum of Squares (RSS)

RSS(β) = ∑_{i=1}^n {yi − (β0 + β1xi1 + β2xi2 + · · · + βpxip)}²
       = ∑_{i=1}^n {yi − xi^T β}²
       = [Y − Xβ]^T [Y − Xβ]

To minimize RSS(β), we solve

0 = ∂RSS(β)/∂β = −2X^T [Y − Xβ] = −2X^T Y + 2X^T X β

⇒ X^T X β = X^T Y  ⇒  β̂OLS = (X^T X)^{−1} X^T Y

X^T X β = X^T Y is called the estimating equation (also known as the normal equations).
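A sketch of this formula in R on simulated data, checked against lm(), which computes the same estimate internally via a numerically stabler QR decomposition:

  n <- 100; p <- 3
  X <- cbind(1, matrix(runif(n * p), n, p))  # design matrix with an intercept column
  beta <- c(1, 2, -1, 0.5)                   # made-up 'true' coefficients
  y <- X %*% beta + rnorm(n)
  bhat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y
  cbind(bhat, coef(lm(y ~ X[,-1])))          # the two columns agree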

37 / 51


Matrix Calculus/Gradient

For a scalar function f(x1, . . . , xk) = f(x) ∈ R, where x ∈ R^k is a vector, the gradient vector is defined as the vector of partial derivatives of f with respect to the xi's, i.e.,

∂f(x)/∂x = (∂f(x)/∂x1, ∂f(x)/∂x2, . . . , ∂f(x)/∂xk)^T

Two examples:

∂(x^T x)/∂x = ∂[∑_{j=1}^k xj²]/∂x = (2x1, 2x2, . . . , 2xk)^T = 2x

∂(Ax)/∂x = A, where A is a matrix that does not depend on x.

38 / 51


The chain rule still applies, so our previous derivation can be viewed as an application of these rules:

∂RSS(β)/∂β = ∂{[Y − Xβ]^T [Y − Xβ]}/∂β = (∂[Y − Xβ]/∂β)^T (2[Y − Xβ]) = 2(−X)^T [Y − Xβ] = 0

39 / 51


More results from multivariate distribution

If X ∈ R^p is a p-dimensional random vector, then its expectation is defined as the vector whose jth element is the expected value of the jth coordinate of X, i.e.,

E(X) = (E(X1), E(X2), . . . , E(Xp))^T

If A is an m × p matrix, then AX is an m-dimensional random vector, whose expectation is

E(AX) = AE(X)

40 / 51


More results from multivariate distribution

If X ∈ R^p is a p-dimensional random vector, then its (variance-)covariance matrix is defined as the p × p matrix whose (i, j)th element is the covariance between the ith and jth coordinates of X, i.e.,

Cov(X) =
  [ Cov(X1, X1)  Cov(X1, X2)  . . .  Cov(X1, Xp) ]
  [ Cov(X2, X1)  Cov(X2, X2)  . . .  Cov(X2, Xp) ]
  [      .            .       . . .       .      ]
  [ Cov(Xp, X1)  Cov(Xp, X2)  . . .  Cov(Xp, Xp) ]

For example, Cov(ε) = σ²I, where I is the identity matrix, the diagonal matrix with 1's on the diagonal.

If A is an m × p matrix, then the variance-covariance matrix of AX is

Cov(AX) = A Cov(X) A^T

41 / 51


Properties of β̂OLS

β̂OLS = (X^T X)^{−1} X^T Y is a (p + 1)-vector, whose first element is the estimator for β0 and whose (j + 1)th element is the estimator for βj, for j = 1, . . . , p.

E{β̂OLS} = E{(X^T X)^{−1} X^T Y} = (X^T X)^{−1} X^T E{Y}. But recall that EY = Xβ. Hence E{β̂OLS} = (X^T X)^{−1} X^T Xβ = β, i.e., unbiasedness.

And the covariance matrix is

Cov(β̂OLS) = σ²(X^T X)^{−1}.

42 / 51


Properties of β̂OLS

Cov(β̂OLS) = Cov((X^T X)^{−1} X^T Y)
          = (X^T X)^{−1} X^T Cov(Y) [(X^T X)^{−1} X^T]^T
          = (X^T X)^{−1} X^T Cov(Y) X (X^T X)^{−1}

But recall that Cov(Y) = σ²I. So

Cov(β̂OLS) = (X^T X)^{−1} X^T Cov(Y) X (X^T X)^{−1} = σ²(X^T X)^{−1} X^T X (X^T X)^{−1} = σ²(X^T X)^{−1}

43 / 51


Gauss-Markov Theorem

Theorem 1 (Gauss-Markov Theorem). Under conditions (i) Eεi = 0 (mean zero), (ii) Var(εi) = σ² (constant variance), and (iii) Cov(εi, εj) = 0 for i ≠ j (uncorrelated), the least square estimator β̂OLS is the best linear unbiased estimator of β.

Question: is β̂OLS always the best estimator for β? In what sense is it the best?

44 / 51


Prediction, hat matrix and RSS

The prediction for the original data X is

Ŷ = Xβ̂OLS = X(X^T X)^{−1} X^T Y = HY

where H := X(X^T X)^{−1} X^T is called the hat matrix.

The residual vector is

ε̂ = Y − Ŷ = Y − HY = (I − H)Y

The residual sum of squares (sum of squared errors) is

RSS = [(I − H)Y]^T (I − H)Y = Y^T (I − H)^T (I − H)Y

Note that (I − H)^T (I − H) = I − H (H is symmetric and idempotent!). So

RSS = Y^T (I − H)Y = Y^T Y − Y^T H Y
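A sketch verifying these identities numerically on simulated data (any design matrix with an intercept column would do):

  n <- 50
  X <- cbind(1, runif(n))                     # design matrix with an intercept column
  y <- X %*% c(1, 2) + rnorm(n)
  H <- X %*% solve(t(X) %*% X) %*% t(X)       # the hat matrix
  max(abs(H %*% H - H))                       # ~ 0: H is idempotent
  max(abs(H %*% y - fitted(lm(y ~ X[,-1]))))  # ~ 0: HY gives the fitted values
  sum(((diag(n) - H) %*% y)^2)                # RSS = Y'Y - Y'HY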

45 / 51


ANOVA Decomposition still holds!

Use the same notations. Total sum of squares:

TSS = ∑_{i=1}^n (yi − ȳ)² = (Y − (1/n)11^T Y)^T (Y − (1/n)11^T Y),

where 1 = (1, 1, . . . , 1)^T is n × 1. Residual sum of squares:

RSS = ∑_{i=1}^n (yi − ŷi)² = Y^T Y − Y^T H Y.

Sum of squares in regression:

SSreg = ∑_{i=1}^n (ŷi − ȳ)² = (HY − (1/n)11^T Y)^T (HY − (1/n)11^T Y).

The ANOVA Decomposition still holds (prove it, homework!):

TSS (total variation in Y) = SSreg (variation in Y explained by the model) + RSS (variation in Y unexplained by the model)

46 / 51


ANOVA Decomposition for multiple regression

Proof. First of all, the hat matrix H is idempotent, so that H² = H and therefore (I − H)² = I − H. Then

TSS = (Y − (1/n)11^T Y)^T (Y − (1/n)11^T Y) = Y^T Y − (1/n)(1^T Y)²,

SSreg = (HY − (1/n)11^T Y)^T (HY − (1/n)11^T Y) = Y^T H Y − (2/n) Y^T 11^T H Y + (1/n)(1^T Y)².

It suffices to show that 1^T H Y = 1^T Y. (This is ONLY true when the first column of the design matrix X is all 1's; in other words, your linear regression model must have an "intercept" term. Why?) To see this,

X^T H Y = X^T X (X^T X)^{−1} X^T Y = X^T Y.

If there is an intercept term, the first column of X is 1; reading off the first row of X^T H Y = X^T Y then gives 1^T H Y = 1^T Y.

47 / 51


Definition of R² for multiple regression

The R² is also defined as

R² = SSreg / TSS = 1 − RSS / TSS

R² is between 0 and 1 (if an intercept term is included!);

The bigger the R², the better (loosely speaking...).

Lemma 2. It is easy to show that

R² = cor²(Y, Ŷ),

the squared (mean-adjusted) correlation between Y and the fitted values Ŷ (this has a geometric interpretation in R^n that you don't need to know).
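A one-line numerical check of Lemma 2 on simulated data:

  x1 <- runif(60); x2 <- runif(60); y <- 1 + x1 - 2*x2 + rnorm(60)
  fit <- lm(y ~ x1 + x2)
  c(summary(fit)$r.squared, cor(y, fitted(fit))^2)  # the two numbers agree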

48 / 51


More useful linear algebra results

For a square m × m matrix A, its trace, denoted Trace(A), is defined as the sum of its diagonal elements:

Trace(A) := ∑_{i=1}^m aii

Trace(AB) = Trace(BA) when both AB and BA are square matrices.

If A is a random square matrix, then

Trace(E(A)) = E[Trace(A)]

If AA = A, then A is called an idempotent matrix.

For example, H := X(X^T X)^{−1} X^T is idempotent. Consequently, (I − H) is also idempotent.

49 / 51


What is E(RSS)?

To see this, note that the residual can also be written as

ε̂ = Y − Ŷ = (I − H)Y = (I − H)(Xβ + ε) = (I − H)Xβ + (I − H)ε.

Show that (I − H)X = 0, so that ε̂ = (I − H)ε.

Then the RSS is

RSS = ε̂^T ε̂ = ε^T (I − H)^T (I − H)ε = ε^T (I − H)ε = ε^T ε − ε^T H ε

Its expected value is E[ε^T ε − ε^T H ε] = E[ε^T ε] − E[ε^T H ε], where

1. E[ε^T ε] = nσ²;

2. E[ε^T H ε] = E[Trace(ε^T H ε)] = E[Trace(H εε^T)] = Trace(H E[εε^T]) = σ² Trace(H I_n) = σ² Trace(H) = σ² Trace(X(X^T X)^{−1} X^T) = σ² Trace(X^T X (X^T X)^{−1}) = σ² Trace(I_{p+1}) = (p + 1)σ².

Overall, the expected value of RSS is E(RSS) = (n − p − 1)σ².
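A simulation sketch checking E(RSS) = (n − p − 1)σ², with made-up values σ² = 1, n = 30, p = 2:

  n <- 30; p <- 2; sigma2 <- 1
  X <- cbind(1, matrix(runif(n * p), n, p))
  beta <- c(1, 2, -1)
  rss <- replicate(5000, {
    y <- X %*% beta + rnorm(n, sd=sqrt(sigma2))
    sum(residuals(lm(y ~ X[,-1]))^2)
  })
  c(mean(rss), (n - p - 1) * sigma2)  # both close to 27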

50 / 51


So σ̂² = RSS/(n − p − 1) is an unbiased estimator for σ². Recall that:

in simple linear regression, p = 1, and σ̂² = RSS/(n − 2) was the unbiased estimator;

in the iid case, p = 0, and σ̂² = RSS/(n − 1) was the unbiased estimator; in this case RSS = ∑_{i=1}^n (yi − ȳ)², and hence σ̂² = RSS/(n − 1) = (1/(n − 1)) ∑_{i=1}^n (yi − ȳ)² is the sample variance.

dfreg = p + 1 is the degrees of freedom of the multiple regression model.

51 / 51