Regression and Model Selection
Book Chapters 3 and 6.

Carlos M. Carvalho
The University of Texas McCombs School of Business
1. Simple Linear Regression
2. Multiple Linear Regression
3. Dummy Variables
4. Residual Plots and Transformations
5. Variable Selection and Regularization
6. Dimension Reduction Methods
1. Regression: General Introduction
- Regression analysis is the most widely used statistical tool for understanding relationships among variables.

- It provides a conceptually simple method for investigating functional relationships between one or more factors and an outcome of interest.

- The relationship is expressed in the form of an equation or a model connecting the response or dependent variable and one or more explanatory or predictor variables.
1st Example: Predicting House Prices
Problem:
I Predict market price based on observed characteristics
Solution:
- Look at property sales data where we know the price and some observed characteristics.

- Build a decision rule that predicts price as a function of the observed characteristics.
Predicting House Prices
It is much more useful to look at a scatterplot
[Scatterplot: price vs. size]
In other words, view the data as points in the X × Y plane.
Regression Model
Y = response or outcome variable
X1, X2, X3, . . . , Xp = explanatory or input variables
The general relationship approximated by:
Y = f (X1,X2, . . . ,Xp) + e
And a linear relationship is written
Y = b0 + b1X1 + b2X2 + . . .+ bpXp + e
Linear Prediction
There appears to be a linear relationship between price and size: as size goes up, price goes up.
The line shown was fit by the “eyeball” method.
Linear Prediction
[Diagram: a line in the X-Y plane with intercept b0 and slope b1 (rise over run)]

Y = b0 + b1X
Our “eyeball” line has b0 = 35, b1 = 40.
Linear Prediction
Can we do better than the eyeball method?
We desire a strategy for estimating the slope and intercept parameters in the model Y = b0 + b1X.

A reasonable way to fit a line is to minimize the amount by which the fitted value differs from the actual value.
This amount is called the residual.
Linear Prediction
What is the “fitted value”?
[Diagram: observed points (Xi, Yi) and the fitted values Ŷi on the line]

The dots are the observed values and the line represents our fitted values given by Ŷi = b0 + b1Xi.
Linear Prediction
What is the “residual” for the ith observation?
[Diagram: the residual ei = Yi − Ŷi is the vertical distance from the point (Xi, Yi) to the line]

We can write Yi = Ŷi + (Yi − Ŷi) = Ŷi + ei.
Least Squares
Ideally we want to minimize the size of all residuals:
- If they were all zero we would have a perfect line.

- Trade-off between moving closer to some points and at the same time moving away from other points.

- Minimize the “total” of residuals to get the best fit.
Least Squares chooses b0 and b1 to minimize the sum of squared residuals:

  ∑(i=1 to N) ei² = e1² + e2² + · · · + eN²
                  = (Y1 − Ŷ1)² + (Y2 − Ŷ2)² + · · · + (YN − ŶN)²
                  = ∑(i=1 to N) (Yi − [b0 + b1Xi])²
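The criterion above is easy to compute directly. A minimal sketch, using made-up house data (sizes in thousands of square feet, prices in $1000s; not the actual data behind the slides) and the “eyeball” line b0 = 35, b1 = 40 from the earlier slide:

```python
# Made-up illustration data, not the course's house data.
sizes = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [72.0, 92.0, 118.0, 132.0, 158.0]

def sse(b0, b1, xs, ys):
    """Sum of squared residuals e_i = Y_i - (b0 + b1*X_i)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Evaluate the "eyeball" line from the slides: b0 = 35, b1 = 40.
print(sse(35, 40, sizes, prices))
```

Least squares simply searches over (b0, b1) for the pair that makes this number as small as possible.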
Least Squares – Excel Output
SUMMARY OUTPUT
Regression Statistics
  Multiple R          0.909
  R Square            0.827
  Adjusted R Square   0.813
  Standard Error      14.138
  Observations        15

ANOVA
              df          SS          MS          F   Significance F
  Regression   1    12393.11    12393.11      62.00         2.66E-06
  Residual    13     2598.63      199.89
  Total       14    14991.73

             Coefficients   Std Error   t Stat    P-value   Lower 95%   Upper 95%
  Intercept        38.885       9.094     4.28    0.00090      19.238      58.531
  Size             35.386       4.494     7.87   2.66E-06      25.677      45.095
2nd Example: Offensive Performance in Baseball
1. Problems:
   - Evaluate/compare traditional measures of offensive performance
   - Help evaluate the worth of a player
2. Solutions:
   - Compare prediction rules that forecast runs as a function of either AVG (batting average), SLG (slugging percentage) or OBP (on-base percentage)
2nd Example: Offensive Performance in Baseball
Baseball Data – Using AVG

Each observation corresponds to a team in MLB. Each quantity is the average over a season.
- Y = runs per game; X = AVG (batting average)
LS fit: Runs/Game = -3.93 + 33.57 AVG
Baseball Data – Using SLG
- Y = runs per game

- X = SLG (slugging percentage)
LS fit: Runs/Game = -2.52 + 17.54 SLG
Baseball Data – Using OBP
- Y = runs per game

- X = OBP (on base percentage)
LS fit: Runs/Game = -7.78 + 37.46 OBP
Place your Money on OBP!!!
We compare the rules by their average squared error:

  (1/N) ∑(i=1 to N) ei² = (1/N) ∑(i=1 to N) (Runsi − R̂unsi)²

Average Squared Error
  AVG   0.083
  SLG   0.055
  OBP   0.026
The Least Squares Criterion
The formulas for b0 and b1 that minimize the least squares criterion are:

  b1 = rxy × (sy / sx)        b0 = Ȳ − b1X̄

where,

- X̄ and Ȳ are the sample means of X and Y

- corr(x, y) = rxy is the sample correlation

- sx and sy are the sample standard deviations of X and Y
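These closed-form solutions can be checked numerically. A minimal sketch on made-up data (not the course data), comparing the moment-based formulas with numpy's least squares fit:

```python
import numpy as np

# Made-up illustration data.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([72.0, 92.0, 118.0, 132.0, 158.0])

r_xy = np.corrcoef(x, y)[0, 1]               # sample correlation
b1 = r_xy * y.std(ddof=1) / x.std(ddof=1)    # b1 = r_xy * s_y / s_x
b0 = y.mean() - b1 * x.mean()                # b0 = Ybar - b1 * Xbar

# Same answer as numpy's polynomial least squares fit of degree 1:
b1_np, b0_np = np.polyfit(x, y, 1)
print(round(b0, 4), round(b1, 4))
```

Both routes give the identical line, since `polyfit` is minimizing the same sum of squared residuals.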
Sample Mean and Sample Variance
- Sample Mean: measure of centrality

    Ȳ = (1/n) ∑(i=1 to n) Yi

- Sample Variance: measure of spread

    sy² = (1/(n−1)) ∑(i=1 to n) (Yi − Ȳ)²

- Sample Standard Deviation:

    sy = √(sy²)
Example
  sy² = (1/(n−1)) ∑(i=1 to n) (Yi − Ȳ)²

[Two scatterplots: sample values of X and of Y plotted against observation index, with the deviations (Xi − X̄) and (Yi − Ȳ) marked]

  sx = 9.7    sy = 16.0
Covariance

Measures the direction and strength of the linear relationship between Y and X:

  Cov(Y, X) = ∑(i=1 to n) (Yi − Ȳ)(Xi − X̄) / (n − 1)

[Scatterplot of Y vs. X split into quadrants at (X̄, Ȳ): (Yi − Ȳ)(Xi − X̄) > 0 in the upper-right and lower-left quadrants, < 0 in the other two]

- sy = 15.98, sx = 9.7

- Cov(X, Y) = 125.9

How do we interpret that?
Correlation
Correlation is the standardized covariance:

  corr(X, Y) = cov(X, Y) / √(sx² sy²) = cov(X, Y) / (sx sy)

The correlation is scale invariant and the units of measurement don't matter: it is always true that −1 ≤ corr(X, Y) ≤ 1.

This gives the direction (− or +) and strength (0 → 1) of the linear relationship between X and Y.
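Scale invariance is easy to see numerically. A minimal sketch on made-up data: rescaling X (say, feet to meters) rescales the covariance but leaves the correlation untouched.

```python
import numpy as np

# Made-up illustration data (x in "feet").
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 6.0])

def cov(a, b):
    """Sample covariance with the n - 1 denominator."""
    return np.sum((a - a.mean()) * (b - b.mean())) / (len(a) - 1)

def corr(a, b):
    """Covariance standardized by the two sample standard deviations."""
    return cov(a, b) / (a.std(ddof=1) * b.std(ddof=1))

x_m = x * 0.3048   # feet -> meters rescales the covariance...
print(cov(x, y), cov(x_m, y))
print(corr(x, y), corr(x_m, y))   # ...but not the correlation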
Correlation
  corr(Y, X) = cov(X, Y) / (sx sy) = 125.9 / (15.98 × 9.7) = 0.812

[Scatterplot of Y vs. X with the four quadrants and the signs of (Yi − Ȳ)(Xi − X̄) marked, as before]
Correlation
[Four scatterplots of standardized data illustrating corr = 1, corr = .5, corr = .8, and corr = −.8]
Correlation
Only measures linear relationships: corr(X, Y) = 0 does not mean the variables are not related!
[Two scatterplots: a strong nonlinear pattern with corr = 0.01, and a pattern dominated by an influential observation with corr = 0.72]
Also be careful with influential observations.
Back to Least Squares
1. Intercept:

     b0 = Ȳ − b1X̄  ⇒  Ȳ = b0 + b1X̄

   - The point (X̄, Ȳ) is on the regression line!
   - Least squares finds the point of means and rotates the line through that point until getting the “right” slope.

2. Slope:

     b1 = corr(X, Y) × (sY / sX) = ∑(i=1 to n) (Xi − X̄)(Yi − Ȳ) / ∑(i=1 to n) (Xi − X̄)² = Cov(X, Y) / var(X)

   - So, the right slope is the correlation coefficient times a scaling factor that ensures the proper units for b1.
Decomposing the Variance
How well does the least squares line explain variation in Y?
Remember that Y = Ŷ + e.

Since Ŷ and e are uncorrelated, i.e. corr(Ŷ, e) = 0,

  var(Y) = var(Ŷ + e) = var(Ŷ) + var(e)

  ∑(i=1 to n) (Yi − Ȳ)² / (n−1) = ∑(i=1 to n) (Ŷi − mean(Ŷ))² / (n−1) + ∑(i=1 to n) (ei − ē)² / (n−1)

Given that ē = 0, and mean(Ŷ) = Ȳ (why?), we get to:

  ∑(i=1 to n) (Yi − Ȳ)² = ∑(i=1 to n) (Ŷi − Ȳ)² + ∑(i=1 to n) ei²
Decomposing the Variance
SSR: Variation in Y explained by the regression line.
SSE: Variation in Y that is left unexplained.
SSR = SST ⇒ perfect fit.
Be careful of similar acronyms; e.g. SSR for “residual” SS.
A Goodness of Fit Measure: R²

The coefficient of determination, denoted by R², measures goodness of fit:

  R² = SSR/SST = 1 − SSE/SST

- 0 < R² < 1.

- The closer R² is to 1, the better the fit.
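The variance decomposition and both forms of R² can be verified numerically. A minimal sketch on made-up data (not the course data):

```python
import numpy as np

# Made-up illustration data.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([72.0, 92.0, 118.0, 132.0, 158.0])

b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x
resid = y - fitted

sst = np.sum((y - y.mean()) ** 2)        # total variation
ssr = np.sum((fitted - y.mean()) ** 2)   # variation explained by the line
sse = np.sum(resid ** 2)                 # variation left unexplained

print(round(sst, 3), round(ssr + sse, 3))            # the decomposition holds
print(round(ssr / sst, 4), round(1 - sse / sst, 4))  # two equal forms of R^2
```

The first line shows SST = SSR + SSE; the second shows SSR/SST and 1 − SSE/SST agree.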
Back to the House Data
From the Excel output above:

  R² = SSR/SST = 12393/14991 = 0.82
Back to Baseball
Three very similar, related ways to look at a simple linear regression... with only one X variable, life is easy!
        R²     corr    SSE
  OBP   0.88   0.94    0.79
  SLG   0.76   0.87    1.64
  AVG   0.63   0.79    2.49
Prediction and the Modeling Goal
There are two things that we want to know:
- What value of Y can we expect for a given X?

- How sure are we about this forecast? Or how different could Y be from what we expect?

Our goal is to measure the accuracy of our forecasts, or how much uncertainty there is in the forecast. One method is to specify a range of Y values that are likely, given an X value.
Prediction Interval: probable range for Y-values given X
Prediction and the Modeling Goal
Key Insight: To construct a prediction interval, we will have to assess the likely range of error values corresponding to a Y value that has not yet been observed!

We will build a probability model (e.g., normal distribution).

Then we can say something like “with 95% probability the error will be between −$28,000 and $28,000”.

We must also acknowledge that the “fitted” line may be fooled by particular realizations of the residuals.
Prediction and the Modeling Goal
- Suppose you only had the purple points in the graph. The dashed line fits the purple points. The solid line fits all the points. Which line is better? Why?
[Scatterplot: RPG vs. AVG, with a dashed line fit to the purple subsample and a solid line fit to all points]
- In summary, we need to work with the notion of a “true line” and a probability distribution that describes deviation around the line.
The Simple Linear Regression Model
The power of statistical inference comes from the ability to make precise statements about the accuracy of the forecasts.
In order to do this we must invest in a probability model.
Simple Linear Regression Model: Y = β0 + β1X + ε
ε ∼ N(0, σ2)
- β0 + β1X represents the “true line”; the part of Y that depends on X.

- The error term ε is independent “idiosyncratic noise”; the part of Y not associated with X.
The Simple Linear Regression Model
Y = β0 + β1X + ε
[Plot: simulated data scattered around the true line Y = β0 + β1X]
The Simple Linear Regression Model – Example
You are told (without looking at the data) that
β0 = 40; β1 = 45; σ = 10
and you are asked to predict the price of a 1500-square-foot house.
What do you know about Y from the model?
  Y = 40 + 45(1.5) + ε = 107.5 + ε

Thus our prediction for price is Y | X = 1.5 ∼ N(107.5, 10²), and a 95% Prediction Interval for Y is 87.5 < Y < 127.5.
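The arithmetic above can be sketched directly, using the slide's parameter values and the “plus or minus two standard deviations” rule:

```python
# Parameter values from the slide: beta0 = 40, beta1 = 45, sigma = 10.
beta0, beta1, sigma = 40.0, 45.0, 10.0
x = 1.5   # house size in thousands of square feet

pred = beta0 + beta1 * x    # point prediction E[Y | X = 1.5]
lower = pred - 2 * sigma    # 95% prediction interval, lower end
upper = pred + 2 * sigma    # 95% prediction interval, upper end
print(pred, lower, upper)
```

This reproduces the interval 87.5 < Y < 127.5 around the prediction 107.5.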
Conditional Distributions

  Y = β0 + β1X + ε
[Plot: Normal curves for Y | X = x centered on the line at several x values]

The conditional distribution for Y given X is Normal: Y | X = x ∼ N(β0 + β1x, σ²).
Estimation of Error Variance
We estimate σ² with:

  s² = (1/(n−2)) ∑(i=1 to n) ei² = SSE/(n−2)

(2 is the number of regression coefficients, i.e. one each for β0 and β1.)

We have n − 2 degrees of freedom because 2 have been “used up” in the estimation of b0 and b1.

We usually use s = √(SSE/(n−2)), in the same units as Y. It's also called the regression standard error.
Estimation of Error Variance
Where is s in the Excel output?
Remember that whenever you see “standard error”, read it as estimated standard deviation: s is the estimate of the standard deviation σ.
One Picture Summary of SLR

- The plot below has the house data, the fitted regression line (b0 + b1X) and ±2s...

- From this picture, what can you tell me about β0, β1 and σ²? How about b0, b1 and s²?
[Scatterplot of price vs. size with the fitted line and bands at ±2s]
Understanding Variation... Runs per Game and AVG
- blue line: all points

- red line: only purple points

- Which slope is closer to the true one? How much closer?
[Scatterplot: RPG vs. AVG with the two fitted lines]
The Importance of Understanding Variation
When estimating a quantity, it is vital to develop a notion of the precision of the estimation; for example:

- estimate the slope of the regression line

- estimate the value of a house given its size

- estimate the expected return on a portfolio

- estimate the value of a brand name

- estimate the damages from patent infringement

Why is this important?
We are making decisions based on estimates, and these may be very sensitive to the accuracy of the estimates!
Sampling Distribution of b1
The sampling distribution of b1 describes how the estimator b1 varies over different samples, with the X values fixed.

It turns out that b1 is normally distributed (approximately): b1 ∼ N(β1, s²_b1).

- b1 is unbiased: E[b1] = β1.

- s_b1 is the standard error of b1. In general, the standard error is the standard deviation of an estimate. It determines how close b1 is to β1.

- This is a number directly available from the regression output.
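The sampling distribution can be made concrete by simulation: hold the X values fixed, repeatedly draw new errors from N(0, σ²), refit the line each time, and look at how the slope estimate varies. A sketch using the parameter values from the earlier house-price example (β0 = 40, β1 = 45, σ = 10); the X grid and sample count here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1, sigma = 40.0, 45.0, 10.0
x = np.linspace(1.0, 3.5, 15)          # X values held fixed across samples

slopes = []
for _ in range(2000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=len(x))
    b1_hat, _ = np.polyfit(x, y, 1)    # slope estimate for this sample
    slopes.append(b1_hat)

slopes = np.array(slopes)
se_theory = sigma / np.sqrt(np.sum((x - x.mean()) ** 2))
print(round(slopes.mean(), 2))                       # scatters around beta1
print(round(slopes.std(), 2), round(se_theory, 2))   # spread matches the formula
```

The simulated slopes center on the true β1 (unbiasedness) and their spread matches the standard error formula on the next slide.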
Sampling Distribution of b1
Can we intuit what should be in the formula for sb1?
- How should s figure in the formula?

- What about n?

- Anything else?

  s²_b1 = s² / ∑(Xi − X̄)² = s² / ((n − 1) s²_x)

Three factors: sample size (n), error variance (s²), and X-spread (s_x).
Sampling Distribution of b0
The intercept is also normal and unbiased: b0 ∼ N(β0, s²_b0).

  s²_b0 = var(b0) = s² × (1/n + X̄² / ((n − 1) s²_x))

What is the intuition here?
Understanding Variation... Runs per Game and AVG
Regression with all points
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.798
  R Square            0.638
  Adjusted R Square   0.625
  Standard Error      0.298
  Observations        30

ANOVA
              df        SS       MS       F   Significance F
  Regression   1     4.389    4.389   49.26         1.24E-07
  Residual    28     2.495    0.089
  Total       29     6.884

             Coefficients   Std Error   t Stat    P-value   Lower 95%   Upper 95%
  Intercept       -3.936        1.294    -3.04     0.0051      -6.587      -1.286
  AVG             33.572        4.783     7.02   1.24E-07      23.774      43.370

s_b1 = 4.78
Understanding Variation... Runs per Game and AVG
Regression with subsample
SUMMARY OUTPUT

Regression Statistics
  Multiple R          0.934
  R Square            0.872
  Adjusted R Square   0.829
  Standard Error      0.245
  Observations        5

ANOVA
              df        SS       MS       F   Significance F
  Regression   1     1.221    1.221   20.37           0.0203
  Residual     3     0.180    0.060
  Total        4     1.400

             Coefficients   Std Error   t Stat   P-value   Lower 95%   Upper 95%
  Intercept       -7.956        2.874    -2.77    0.0697     -17.104       1.191
  AVG             48.694       10.790     4.51    0.0203      14.356      83.033

s_b1 = 10.78
Confidence Intervals
- 68% Confidence Interval: b1 ± 1 × s_b1

- 95% Confidence Interval: b1 ± 2 × s_b1

- 99% Confidence Interval: b1 ± 3 × s_b1

Same thing for b0:

- 95% Confidence Interval: b0 ± 2 × s_b0

The confidence interval provides you with a set of plausible values for the parameters.
Example: Runs per Game and AVG
Regression with all points (the same output as before): b1 = 33.57 with s_b1 = 4.78, so

  [b1 − 2 × s_b1; b1 + 2 × s_b1] ≈ [23.77; 43.36]
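The interval quoted from the Excel output uses the exact t quantile with n − 2 = 28 degrees of freedom (about 2.048) rather than the rough "plus or minus 2" rule. A sketch reproducing it from the slide's coefficient and standard error:

```python
# Numbers from the slide's regression output for AVG.
b1, s_b1 = 33.57186945, 4.783211061
t_crit = 2.0484   # 0.975 quantile of t with 28 df (from a t table)

lower = b1 - t_crit * s_b1
upper = b1 + t_crit * s_b1
print(round(lower, 2), round(upper, 2))
```

This recovers Excel's "Lower 95%" and "Upper 95%" columns; with 28 degrees of freedom the t quantile is close enough to 2 that the rough rule gives nearly the same interval.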
Testing
Suppose we want to assess whether or not β1 equals a proposed value β1⁰. This is called hypothesis testing.

Formally we test the null hypothesis:

  H0: β1 = β1⁰

vs. the alternative:

  H1: β1 ≠ β1⁰
51
Testing
There are two ways we can think about testing:

1. Building a test statistic... the t-stat,

t = (b1 − β1⁰) / sb1

This quantity measures how many standard deviations the estimate (b1) is from the proposed value (β1⁰).

If the absolute value of t is greater than 2, we need to worry (why?)... we reject the hypothesis.
52
Testing
2. Looking at the confidence interval. If the proposed value is outside the confidence interval you reject the hypothesis.

Notice that this is equivalent to the t-stat. An absolute value for t greater than 2 implies that the proposed value is outside the confidence interval... therefore reject.

This is my preferred approach for the testing problem. You can't go wrong by using the confidence interval!
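A sketch of the equivalence between the two testing routes above. The numbers plugged in are from the Windsor example that follows (b1 ≈ 0.9357, sb1 ≈ 0.0291, implied by t = −0.0643/0.0291); everything else is illustrative.

```python
# Two equivalent ways to test H0: beta1 = null_value.
def t_stat(b1, null_value, s_b1):
    return (b1 - null_value) / s_b1

def reject_by_t(b1, null_value, s_b1):
    # Route 1: reject when |t| > 2.
    return abs(t_stat(b1, null_value, s_b1)) > 2

def reject_by_ci(b1, null_value, s_b1):
    # Route 2: reject when the null value falls outside the ~95% interval.
    lo, hi = b1 - 2 * s_b1, b1 + 2 * s_b1
    return not (lo <= null_value <= hi)
```

The two rules always agree, which is why the slides call them equivalent.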
53
Example: Mutual Funds

Let's investigate the performance of the Windsor Fund, an aggressive large cap fund by Vanguard...
Another Example of Conditional Distributions
The plot shows monthly returns for Windsor vs. the S&P500.

54
Example: Mutual Funds
Consider a CAPM regression for the Windsor mutual fund.
rw = β0 + β1rsp500 + ε
Let’s first test β1 = 0
H0 : β1 = 0. Is the Windsor fund related to the market?
H1 : β1 6= 0
55
Example: Mutual Funds
[Excel output for the Windsor regression, showing b1, sb1, and b1/sb1]
Hypothesis Testing – Windsor Fund Example
I t = 32.10... reject β1 = 0!!
I the 95% confidence interval is [0.87; 0.99]... again, reject!!
56
Example: Mutual Funds
Now let’s test β1 = 1. What does that mean?
H0 : β1 = 1 Windsor is as risky as the market.

H1 : β1 ≠ 1 Windsor softens or exaggerates market moves.
We are asking whether or not Windsor moves in a different way than the market (e.g., is it more conservative?).
57
Example: Mutual Funds
Hypothesis Testing – Windsor Fund Example
I t = (b1 − 1)/sb1 = −0.0643/0.0291 = −2.205... reject.
I the 95% confidence interval is [0.87; 0.99]... again, reject, but...
58
Testing – Why I like Conf. Int.
I Suppose in testing H0 : β1 = 1 you got a t-stat of 6 and the confidence interval was
[1.00001, 1.00002]
Do you reject H0 : β1 = 1? Could you justify that to your boss? Probably not! (why?)
59
Testing – Why I like Conf. Int.
I Now, suppose in testing H0 : β1 = 1 you got a t-stat of -0.02 and the confidence interval was
[−100, 100]
Do you accept H0 : β1 = 1? Could you justify that to your boss? Probably not! (why?)
The Confidence Interval is your best friend when it comes to testing!!
60
Testing – Summary
I A large t-stat and a small p-value mean the same thing...

I A p-value < 0.05 is equivalent to a t-stat > 2 in absolute value

I A small p-value means something weird would have happened if the null hypothesis were true...

I Bottom line: small p-value → REJECT! Large t → REJECT!
I But remember, always look at the confidence interval!
61
Forecasting
[Figure: Price vs. Size scatter plot]
I Careful with extrapolation!
62
House Data – one more time!
I R2 = 82%
I Great R2, we are happy using this model to predict house prices, right?
[Figure: price vs. size scatter plot]
63
House Data – one more time!

I But, s = 14, leading to a predictive interval width of about US$60,000!! How do you feel about the model now?

I As a practical matter, s is a much more relevant quantity than R2. Once again, intervals are your friend!
[Figure: price vs. size scatter plot]
64
2. The Multiple Regression Model
Many problems involve more than one independent variable or factor which affects the dependent or response variable.
I More than size to predict house price!
I Demand for a product given prices of competing brands, advertising, household attributes, etc.
In SLR, the conditional mean of Y depends on X. The Multiple Linear Regression (MLR) model extends this idea to include more than one independent variable.
65
The MLR Model
Same as always, but with more covariates.
Y = β0 + β1X1 + β2X2 + · · ·+ βpXp + ε
Recall the key assumptions of our linear regression model:
(i) The conditional mean of Y is linear in the Xj variables.
(ii) The error terms (deviations from the line)
I are normally distributed
I independent from each other
I identically distributed (i.e., they have constant variance)
Y | X1, . . . , Xp ∼ N(β0 + β1X1 + · · ·+ βpXp, σ2)
66
The MLR Model

If p = 2, we can plot the regression surface in 3D. Consider sales of a product as predicted by the price of this product (P1) and the price of a competing product (P2).
Sales = β0 + β1P1 + β2P2 + ε
67
Least Squares
Y = β0 + β1X1 . . .+ βpXp + ε, ε ∼ N(0, σ2)
How do we estimate the MLR model parameters?
The principle of Least Squares is exactly the same as before:
I Define the fitted values
I Find the best fitting plane by minimizing the sum of squared residuals.
68
Least Squares
Model: Salesi = β0 + β1P1i + β2P2i + εi , εi ∼ N(0, σ2)

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.99
R Square            0.99
Adjusted R Square   0.99
Standard Error      28.42
Observations        100

ANOVA
            df   SS           MS           F         Significance F
Regression  2    6004047.24   3002023.62   3717.29   0.00
Residual    97   78335.60     807.58
Total       99   6082382.84

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   115.72         8.55             13.54    0.00      98.75       132.68
p1          -97.66         2.67             -36.60   0.00      -102.95     -92.36
p2          108.80         1.41             77.20    0.00      106.00      111.60

b0 = β̂0 = 115.72, b1 = β̂1 = −97.66, b2 = β̂2 = 108.80, s = σ̂ = 28.42
69
Least Squares
Just as before, each bi is our estimate of βi .

Fitted Values: Ŷi = b0 + b1X1i + b2X2i + · · ·+ bpXpi .

Residuals: ei = Yi − Ŷi .

Least Squares: Find b0, b1, b2, . . . , bp to minimize the sum of squared residuals, ∑ e2i .

In MLR the formulas for the bi 's are too complicated, so we won't talk about them...
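Even though we skip the closed-form formulas, a sketch can show what least squares computes: a linear solver finds the b's minimizing the sum of squared residuals. The data below are simulated with made-up coefficients (this is NOT the sales dataset); the true values are only used to generate Y.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Simulate from an MLR model with illustrative (made-up) coefficients.
Y = 115.72 - 97.66 * X1 + 108.80 * X2 + rng.normal(0, 28.42, size=n)

X = np.column_stack([np.ones(n), X1, X2])   # design matrix: intercept, X1, X2
b, *_ = np.linalg.lstsq(X, Y, rcond=None)   # b0, b1, b2 minimizing sum(e_i^2)

Y_hat = X @ b        # fitted values
e = Y - Y_hat        # residuals (sum to zero when an intercept is included)
```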
70
Fitted Values in MLR

A useful way to plot the results for MLR problems is to look at Y (true values) against Ŷ (fitted values).
[Figure: y = Sales vs. ŷ (MLR: p1 and p2)]
If things are working, these values should form a nice straight line. Can you guess the slope of the blue line?
71
Fitted Values in MLR

With just P1...
[Figures: Sales vs. p1, and Sales vs. ŷ (SLR: p1)]
I Left plot: Sales vs P1
I Right plot: Sales vs. ŷ (only P1 as a regressor)

72
Fitted Values in MLR
Now, with P1 and P2...
[Figures: Sales vs. ŷ for SLR on p1, SLR on p2, and MLR on p1 and p2]
I First plot: Sales regressed on P1 alone...
I Second plot: Sales regressed on P2 alone...
I Third plot: Sales regressed on P1 and P2
73
R-squared
I We still have our old variance decomposition identity...
SST = SSR + SSE
I ... and R2 is once again defined as

R2 = SSR/SST = 1− SSE/SST

telling us the percentage of variation in Y explained by the X 's.
I In Excel, R2 is in the same place and “Multiple R” refers to the correlation between Y and Ŷ .
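A sketch of the decomposition on a small made-up dataset. An actual least-squares fit (with intercept) is needed here, since the identity SST = SSR + SSE only holds for fitted values from such a fit.

```python
import numpy as np

# Tiny illustrative dataset (not from the slides).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 3.0, 5.0, 4.0, 6.0])

X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ b

SST = np.sum((Y - Y.mean()) ** 2)       # total variation
SSR = np.sum((Y_hat - Y.mean()) ** 2)   # variation explained by the fit
SSE = np.sum((Y - Y_hat) ** 2)          # leftover (residual) variation

R2 = SSR / SST                            # equivalently 1 - SSE/SST
multiple_R = np.corrcoef(Y, Y_hat)[0, 1]  # "Multiple R" = corr(Y, Y_hat)
```

Note that R2 equals multiple_R squared, which is why Excel reports both.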
74
Intervals for Individual Coefficients
As in SLR, the sampling distribution tells us how close we can expect bj to be to βj .
The LS estimators are unbiased: E [bj ] = βj for j = 0, . . . , p.

I We denote the sampling distribution of each estimator as

bj ∼ N(βj , s2bj )
75
Intervals for Individual Coefficients
Intervals and t-statistics are exactly the same as in SLR.
I A 95% C.I. for βj is approximately bj ± 2sbj
I The t-stat tj = (bj − βj⁰)/sbj is the number of standard errors between the LS estimate and the null value (βj⁰)
I As before, we reject the null when the t-stat is greater than 2 in absolute value
I Also as before, a small p-value leads to a rejection of the null
I Rejecting when the p-value is less than 0.05 is equivalent to rejecting when |tj | > 2
76
Intervals for Individual Coefficients
IMPORTANT: Intervals and testing via bj & sbj are one-at-a-time procedures:

I You are evaluating the j th coefficient conditional on the other X 's being in the model, but regardless of the values you've estimated for the other b's.
77
Understanding Multiple Regression
The Sales Data:
I Sales : units sold in excess of a baseline
I P1: our price in $ (in excess of a baseline price)
I P2: competitor's price (again, over a baseline)
78
Understanding Multiple Regression
I If we regress Sales on our own price, we obtain a somewhat surprising conclusion... the higher the price the more we sell!!
[Data: weekly observations on Sales (# units over a base level), p1 = our price ($ over a base), and p2 = the competitor's price ($ over a base). Fitted line: Sales = 211.165 + 63.713 p1, with S = 223.401 and R-Sq = 19.6%. The regression line has a positive slope!]
I It looks like we should just raise our prices, right? NO, not if you have taken this statistics class!
79
Understanding Multiple Regression
I The regression equation for Sales on own price (P1) is:
Sales = 211 + 63.7P1
I If now we add the competitor's price to the regression we get
Sales = 116− 97.7P1 + 109P2
I Does this look better? How did it happen?
I Remember: −97.7 is the effect on Sales of a change in P1 with P2 held fixed!!
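A simulated sketch of how the sign flip happens (made-up numbers, not the actual sales dataset): when P2 both tracks P1 and raises Sales, leaving P2 out of the regression biases the P1 slope upward, here all the way to a positive value.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
P1 = rng.uniform(0, 9, size=n)
P2 = 1.5 * P1 + rng.normal(0, 1, size=n)          # P1 and P2 move together
Sales = 116 - 97.7 * P1 + 109 * P2 + rng.normal(0, 25, size=n)

def ls_coefs(xs, y):
    # Least-squares coefficients (intercept first) for regressors in xs.
    X = np.column_stack([np.ones(len(y))] + list(xs))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_slr = ls_coefs([P1], Sales)        # Sales on P1 alone: slope ~ -97.7 + 109*1.5 > 0
b_mlr = ls_coefs([P1, P2], Sales)    # Sales on P1 and P2: recovers roughly -97.7
```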
80
Understanding Multiple Regression
I How can we see what is going on? Let's compare Sales in two different observations: weeks 82 and 99.

I We see that an increase in P1, holding P2 constant, corresponds to a drop in Sales!
[Figures: p1 vs. p2 and Sales vs. p1, with weeks 82 and 99 highlighted. An increase in p1 holding p2 roughly constant (week 82 to week 99) corresponds to a drop in Sales.]

I Note the strong relationship (dependence) between P1 and P2!!

81
Understanding Multiple Regression
I Let's look at a subset of points where P1 varies and P2 is held approximately constant...
[Figures: p1 vs. p2 and Sales vs. p1, with colors indicating different ranges of p2. For each fixed level of p2 there is a negative relationship between Sales and p1; larger p1 values are associated with larger p2 values.]
I For a fixed level of P2, variation in P1 is negatively correlated with Sales!!
82
Understanding Multiple Regression
I Below, different colors indicate different ranges for P2...
[Figure: Sales vs. p1, colored by ranges of p2]
83
Understanding Multiple Regression
I Summary:

1. A larger P1 is associated with a larger P2, and the overall effect leads to bigger Sales.
2. With P2 held fixed, a larger P1 leads to lower Sales.
3. MLR does the trick and unveils the “correct” economic relationship between Sales and prices!
84
Understanding Multiple Regression
Beer Data (from an MBA class)
I nbeer – number of beers before getting drunk
I height and weight
[Figures and output: nbeer vs. height (clearly related) and height vs. weight (strongly related). Correlations: corr(nbeer, weight) = 0.692, corr(nbeer, height) = 0.582, corr(height, weight) = 0.806. The two x's are highly correlated!]
Is number of beers related to height?

85
Understanding Multiple Regression
nbeers = β0 + β1height + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.58
R Square            0.34
Adjusted R Square   0.33
Standard Error      3.11
Observations        50

ANOVA
            df   SS       MS       F       Significance F
Regression  1    237.77   237.77   24.60   0.00
Residual    48   463.86   9.66
Total       49   701.63

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   -36.92         8.96             -4.12    0.00      -54.93      -18.91
height      0.64           0.13             4.96     0.00      0.38        0.90
Yes! Beers and height are related...
86
Understanding Multiple Regression
nbeers = β0 + β1weight + β2height + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.69
R Square            0.48
Adjusted R Square   0.46
Standard Error      2.78
Observations        50

ANOVA
            df   SS       MS       F       Significance F
Regression  2    337.24   168.62   21.75   0.00
Residual    47   364.38   7.75
Total       49   701.63

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   -11.19         10.77            -1.04    0.30      -32.85      10.48
weight      0.09           0.02             3.58     0.00      0.04        0.13
height      0.08           0.20             0.40     0.69      -0.32       0.47
What about now?? Height is not necessarily a factor...
87
Understanding Multiple Regression
[Figures: nbeer vs. height and height vs. weight, with the correlations and regression output shown earlier. The two x's are highly correlated!]
I If we regress “beers” only on height we see an effect. Bigger heights go with more beers.
I However, when height goes up weight tends to go up as well... in the first regression, height was a proxy for the real cause of drinking ability. Bigger people can drink more and weight is a more accurate measure of “bigness”.
88
Understanding Multiple Regression
I In the multiple regression, when we consider only the variation in height that is not associated with variation in weight, we see no relationship between height and beers.
89
Understanding Multiple Regression
nbeers = β0 + β1weight + ε

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.69
R Square            0.48
Adjusted R Square   0.47
Standard Error      2.76
Observations        50

ANOVA
            df   SS       MS       F       Significance F
Regression  1    336.03   336.03   44.12   2.60E-08
Residual    48   365.59   7.62
Total       49   701.63

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   -7.021         2.213            -3.172   0.003     -11.471     -2.571
weight      0.093          0.014            6.642    0.000     0.065       0.121
Why is this a better model than the one with weight and height??
90
Understanding Multiple Regression
In general, when we see a relationship between y and x (or x 's), that relationship may be driven by variables “lurking” in the background which are related to your current x 's.

This makes it hard to reliably find “causal” relationships. Any correlation (association) you find could be caused by other variables in the background... correlation is NOT causation.

Any time a report says two variables are related and there's a suggestion of a “causal” relationship, ask yourself whether or not other variables might be the real reason for the effect. Multiple regression allows us to control for all important variables by including them in the regression. “Once we control for weight, height and beers are NOT related”!!
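A simulated sketch of "controlling for" a lurking variable, mimicking the beer story with made-up numbers (not the MBA class data): weight truly drives nbeer, and height only looks predictive because it tracks weight.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
height = rng.normal(68, 3, size=n)
weight = 4 * height - 120 + rng.normal(0, 10, size=n)  # taller -> heavier
nbeer = 0.09 * weight + rng.normal(0, 2, size=n)       # only weight matters

def slope_on(x, y):
    # Simple-regression slope of y on x (with intercept).
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

slr_slope = slope_on(height, nbeer)   # height looks predictive on its own...

# ...but the variation in height NOT associated with weight predicts nothing:
W = np.column_stack([np.ones(n), weight])
height_resid = height - W @ np.linalg.lstsq(W, height, rcond=None)[0]
partial_slope = slope_on(height_resid, nbeer)   # roughly zero
```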
91
Back to Baseball – Let’s try to add AVG on top of OBP
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.948
R Square            0.899
Adjusted R Square   0.891
Standard Error      0.161
Observations        30

ANOVA
            df   SS      MS      F        Significance F
Regression  2    6.188   3.094   120.11   3.64E-14
Residual    27   0.696   0.026
Total       29   6.884

            Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept   -7.934         0.844            -9.396   5.31E-10   -9.666      -6.201
AVG         7.810          4.015            1.945    0.0622     -0.427      16.048
OBP         31.779         3.803            8.357    5.74E-09   23.977      39.581
R/G = β0 + β1AVG + β2OBP + ε
Is AVG any good?

92
Back to Baseball - Now let’s add SLG
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.956
R Square            0.913
Adjusted R Square   0.907
Standard Error      0.149
Observations        30

ANOVA
            df   SS      MS      F        Significance F
Regression  2    6.287   3.144   142.32   4.56E-15
Residual    27   0.596   0.022
Total       29   6.884

            Coefficients   Standard Error   t Stat   P-value    Lower 95%   Upper 95%
Intercept   -7.014         0.820            -8.555   3.61E-09   -8.697      -5.332
OBP         27.593         4.003            6.893    2.09E-07   19.379      35.807
SLG         6.031          2.022            2.983    0.0060     1.883       10.179
R/G = β0 + β1OBP + β2SLG + ε
What about now? Is SLG any good?

93
Back to Baseball
Correlations:
        AVG    OBP    SLG
AVG     1
OBP     0.77   1
SLG     0.75   0.83   1
I When AVG is added to the model with OBP, no additional information is conveyed. AVG does nothing “on its own” to help predict Runs per Game...

I SLG however, measures something that OBP doesn't (power!) and by doing something “on its own” it is relevant to help predict Runs per Game. (Okay, but not much...)
94
3. Dummy Variables... Example: House Prices
We want to evaluate the difference in house prices in a couple of different neighborhoods.
Nbhd SqFt Price
1 2 1.79 114.3
2 2 2.03 114.2
3 2 1.74 114.8
4 2 1.98 94.7
5 2 2.13 119.8
6 1 1.78 114.6
7 3 1.83 151.6
8 3 2.16 150.7
... ... ... ...
Dummy Variables... Example: House Prices
Let’s create the dummy variables dn1, dn2 and dn3...
Nbhd SqFt Price dn1 dn2 dn3
1 2 1.79 114.3 0 1 0
2 2 2.03 114.2 0 1 0
3 2 1.74 114.8 0 1 0
4 2 1.98 94.7 0 1 0
5 2 2.13 119.8 0 1 0
6 1 1.78 114.6 1 0 0
7 3 1.83 151.6 0 0 1
8 3 2.16 150.7 0 0 1
... ... ...
Dummy Variables... Example: House Prices
Pricei = β0 + β1dn1i + β2dn2i + β3Sizei + εi
E [Price|dn1 = 1, Size] = β0 + β1 + β3Size (Nbhd 1)
E [Price|dn2 = 1, Size] = β0 + β2 + β3Size (Nbhd 2)
E [Price|dn1 = 0, dn2 = 0, Size] = β0 + β3Size (Nbhd 3)
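To see the dummy coding in action, here is a minimal Python sketch. The data rows are the hypothetical ones from the table above, and `pd.get_dummies` is one standard way to build the dn columns:

```python
import pandas as pd

# Hypothetical rows mirroring the house-price table above.
df = pd.DataFrame({
    "Nbhd": [2, 2, 2, 2, 2, 1, 3, 3],
    "SqFt": [1.79, 2.03, 1.74, 1.98, 2.13, 1.78, 1.83, 2.16],
    "Price": [114.3, 114.2, 114.8, 94.7, 119.8, 114.6, 151.6, 150.7],
})

# One 0/1 column per neighborhood: dn_1, dn_2, dn_3.
dummies = pd.get_dummies(df["Nbhd"], prefix="dn").astype(int)

# Keep only dn_1 and dn_2 so that Nbhd 3 is the baseline absorbed
# by the intercept, exactly as in the model above.
X = pd.concat([dummies[["dn_1", "dn_2"]], df["SqFt"]], axis=1)
print(X)
```

Dropping one dummy is essential: with all three plus an intercept, the columns would sum to the intercept column and the regression would not have a unique solution.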
Dummy Variables... Example: House Prices
Pricei = β0 + β1dn1 + β2dn2 + β3Size + εi

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.828
R Square            0.685
Adjusted R Square   0.677
Standard Error      15.260
Observations        128

ANOVA
            df    SS          MS      F        Significance F
Regression    3   62809.1504  20936   89.9053  5.8E-31
Residual    124   28876.0639  232.87
Total       127   91685.2143

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept   62.78         14.25           4.41     0.00     34.58      90.98
dn1        -41.54          3.53         -11.75     0.00    -48.53     -34.54
dn2        -30.97          3.37          -9.19     0.00    -37.63     -24.30
size        46.39          6.75           6.88     0.00     33.03      59.74
Pricei = 62.78− 41.54dn1− 30.97dn2 + 46.39Size + εi
Dummy Variables... Example: House Prices
[Figure: scatter plot of Price vs. Size with separate fitted lines for Nbhd = 1, Nbhd = 2, and Nbhd = 3.]
Dummy Variables... Example: House Prices
Pricei = β0 + β1Size + εi

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.553
R Square            0.306
Adjusted R Square   0.300
Standard Error      22.476
Observations        128

ANOVA
            df    SS       MS        F       Significance F
Regression    1   28036.4  28036.36  55.501  1E-11
Residual    126   63648.9  505.1496
Total       127   91685.2

           Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept  -10.09         18.97          -0.53    0.60    -47.62      27.44
size        70.23          9.43           7.45    0.00     51.57      88.88
Pricei = −10.09 + 70.23Size + εi
Dummy Variables... Example: House Prices
[Figure: scatter plot of Price vs. Size with the three neighborhood fits and the single "Just Size" regression line.]
Sex Discrimination Case
[Figure: scatter plot of Salary vs. Year Hired.]
Does it look like the effect of experience on salary is the same for males and females?
Sex Discrimination Case
Could we try to expand our analysis by allowing a different slope for each group?
Yes... Consider the following model:
Salaryi = β0 + β1Expi + β2Sexi + β3Expi × Sexi + εi
For Females:

Salaryi = β0 + β1Expi + εi
For Males:
Salaryi = (β0 + β2) + (β1 + β3)Expi + εi
Sex Discrimination Case
What does the data look like?
YrHired Gender Salary Sex SexExp
1 92 Male 32.00 1 92
2 81 Female 39.10 0 0
3 83 Female 33.20 0 0
4 87 Female 30.60 0 0
5 92 Male 29.00 1 92
... ... ...
208 62 Female 30.00 0 0
Sex Discrimination Case
Salaryi = β0 + β1Sexi + β2Expi + β3Expi × Sexi + εi
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.799130351
R Square            0.638609318
Adjusted R Square   0.63329475
Standard Error      6.816298288
Observations        208

ANOVA
            df    SS        MS       F       Significance F
Regression    3   16748.88  5582.96  120.16  7.513E-45
Residual    204   9478.232  46.4619
Total       207   26227.11

            Coefficients   Standard Error  t Stat    P-value  Lower 95%   Upper 95%
Intercept    61.12479795    8.770854        6.96908  4E-11     43.831649   78.41795
Gender      114.4425931    11.7012          9.78041  9E-19     91.371794  137.5134
YrHired      -0.279963351   0.102456       -2.7325   0.0068    -0.4819713  -0.077955
GenderExp    -1.247798369   0.136676       -9.1296   7E-17     -1.5172765  -0.97832
Salaryi = 61 + 114Sexi − 0.27Expi − 1.24Expi × Sexi + εi
Sex Discrimination Case
[Figure: scatter plot of Salary vs. Year Hired with the fitted lines for each group.]
Variable Interaction
So, the effect of experience on salary is different for males and females... in general, when the effect of the variable X1 on Y depends on another variable X2 we say that X1 and X2 interact with each other.

We can extend this notion by the inclusion of multiplicative effects through interaction terms.
Yi = β0 + β1X1i + β2X2i + β3(X1iX2i) + ε
∂E[Y |X1, X2] / ∂X1 = β1 + β3X2
This is our first non-linear model!
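A small simulation can make the interaction effect concrete. This is a sketch with made-up coefficients, not the salary data: it fits Y = b0 + b1 X1 + b2 X2 + b3 (X1 X2) by least squares and recovers a different slope on X1 for each value of X2:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.integers(0, 2, n).astype(float)   # a 0/1 group indicator, like Sex
# Made-up "true" model in which the slope on x1 depends on x2.
y = 2.0 + 1.5 * x1 - 0.5 * x2 - 0.8 * (x1 * x2) + rng.normal(0, 1, n)

# Fit Y = b0 + b1*X1 + b2*X2 + b3*(X1*X2) by ordinary least squares.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Marginal effect of X1 is b1 + b3*X2: one slope per group.
print("slope when x2 = 0:", b[1])          # close to 1.5
print("slope when x2 = 1:", b[1] + b[3])   # close to 1.5 - 0.8 = 0.7
```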
4. Residual Plots and Transformations
What kind of properties should the residuals have??
ei ∼ N(0, σ^2), iid and independent of the X's
I We should see no pattern between e and each of the X ’s
I This can be summarized by looking at the plot between Ŷ and e
I Remember that Ŷ is "pure X", i.e., a linear function of the X's.
If the model is good, the regression should have pulled out of Y all of its "x-ness"... what is left over (the residuals) should have nothing to do with X.
Non Linearity
Example: Telemarketing
I How does length of employment affect productivity (number of calls per day)?
Non Linearity
Example: Telemarketing
I Residual plot highlights the non-linearity!
Non Linearity
What can we do to fix this?? We can use multiple regression and transform our X to create a nonlinear model...

Let's try

Y = β0 + β1X + β2X^2 + ε
The data...
months months2 calls
10 100 18
10 100 19
11 121 22
14 196 23
15 225 25
... ... ...
Telemarketing: Adding Polynomials

Week VIII, Slide 5. Applied Regression Analysis, Carlos M. Carvalho
Linear Model
Telemarketing: Adding Polynomials
With X^2
Telemarketing: Adding Polynomials
Telemarketing
What is the marginal effect of X on Y?
∂E[Y |X] / ∂X = β1 + 2β2X
I To better understand the impact of changes in X on Y you should evaluate different scenarios.
I Moving from 10 to 11 months of employment raises productivity by 1.47 calls.
I Going from 25 to 26 months only raises the number of calls by 0.27.
Polynomial Regression
Even though we are limited to a linear mean, it is possible to get nonlinear regression by transforming the X variable.

In general, we can add powers of X to get polynomial regression:

Y = β0 + β1X + β2X^2 + ... + βmX^m

You can fit any mean function if m is big enough. Usually, m = 2 does the trick.
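A sketch of polynomial regression as plain least squares with an added X^2 column. The data here is synthetic, loosely in the spirit of the telemarketing example, with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(1)
months = rng.uniform(10, 30, 60)
# Made-up concave relationship, loosely like the telemarketing data.
calls = 5 + 2.0 * months - 0.03 * months**2 + rng.normal(0, 1, 60)

# Polynomial regression is still linear regression:
# just add a months^2 column to the design matrix.
X = np.column_stack([np.ones_like(months), months, months**2])
b, *_ = np.linalg.lstsq(X, calls, rcond=None)

grid = np.linspace(10, 30, 5)
fitted = b[0] + b[1] * grid + b[2] * grid**2
print("coefficients:", np.round(b, 3))
print("fitted curve:", np.round(fitted, 1))
```

The fit is "linear in the parameters" even though the curve is not linear in months; that is why ordinary least squares still applies.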
Closing Comments on Polynomials
We can always add higher powers (cubic, etc) if necessary.
Be very careful about predicting outside the data range. The curve may do unintended things beyond the observed data.
Watch out for over-fitting... remember, simple models are“better”.
Be careful when extrapolating...
[Figure: calls vs. months, with the fitted curve extrapolated beyond the observed data.]
...and, be careful when adding more polynomial terms!
[Figure: calls vs. months, with a higher-order polynomial fit.]
Non-constant Variance
Example...
This violates our assumption that all εi have the same σ2.
Non-constant Variance
Consider the following relationship between Y and X :
Y = γ0 X^β1 (1 + R)
where we think about R as a random percentage error.
I On average we assume R is 0...
I but when it turns out to be 0.1, Y goes up by 10%!
I Often we see this: the errors are multiplicative, and the variation is something like ±10% and not ±10.
I This leads to non-constant variance (or heteroskedasticity).
The Log-Log Model
We have data on Y and X and we still want to use a linear regression model to understand their relationship... what if we take the log (natural log) of Y?
log(Y) = log[γ0 X^β1 (1 + R)]
log(Y) = log(γ0) + β1 log(X) + log(1 + R)
Now, if we call β0 = log(γ0) and ε = log(1 + R) the above leads to
log(Y ) = β0 + β1 log(X ) + ε
a linear regression of log(Y ) on log(X )!
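The algebra above can be checked by simulation. This sketch generates data with multiplicative errors and recovers γ0 and β1 from a linear fit on the log scale; all the numbers are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
# Simulate Y = g0 * X^b1 * (1 + R) with multiplicative errors:
# exp(noise) plays the role of (1 + R).
g0, b1 = 50.0, -1.75
X = rng.uniform(1.0, 4.0, 100)
Y = g0 * X**b1 * np.exp(rng.normal(0, 0.1, 100))

# A linear regression of log(Y) on log(X) recovers log(g0) and b1.
A = np.column_stack([np.ones_like(X), np.log(X)])
coef, *_ = np.linalg.lstsq(A, np.log(Y), rcond=None)
print("intercept, should be near log(50):", coef[0])
print("slope, should be near -1.75:", coef[1])
```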
Elasticity and the log-log Model
In a log-log model, the slope β1 is sometimes called elasticity.
In English, a 1% increase in X gives a β1% increase in Y.

β1 ≈ d%Y / d%X   (Why?)
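The "(Why?)" is one line of calculus. A sketch of the derivation, using d log Y = dY/Y and d log X = dX/X:

```latex
\beta_1 \;=\; \frac{\partial \log Y}{\partial \log X}
        \;=\; \frac{\partial Y / Y}{\partial X / X}
        \;\approx\; \frac{\%\,\Delta Y}{\%\,\Delta X}
```

So setting dX/X = 1% gives dY/Y ≈ β1%.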
Price Elasticity
In economics, the slope coefficient β1 in the regressionlog(sales) = β0 + β1 log(price) + ε is called price elasticity.
This is the % change in sales per 1% change in price.
The model implies that E[sales] = A · price^β1, where A = exp(β0).
Price Elasticity of OJ
A chain of gas station convenience stores was interested in the dependency between Price and Sales of orange juice... They decided to run an experiment and change prices randomly at different locations. With the data in hand, let's first run a regression of Sales on Price:
Sales = β0 + β1Price + ε

SUMMARY OUTPUT
Regression Statistics
Multiple R          0.719
R Square            0.517
Adjusted R Square   0.507
Standard Error      20.112
Observations        50

ANOVA
            df   SS         MS         F       Significance F
Regression   1   20803.071  20803.071  51.428  0.000
Residual    48   19416.449  404.509
Total       49   40219.520

           Coefficients  Standard Error  t Stat  P-value  Lower 95%  Upper 95%
Intercept   89.642        8.610          10.411   0.000    72.330    106.955
Price      -20.935        2.919          -7.171   0.000   -26.804    -15.065
Price Elasticity of OJ
[Figure: left panel "Fitted Model", Sales vs. Price with the fitted line; right panel "Residual Plot", residuals vs. Price.]
No good!!
Price Elasticity of OJ
But... would you really think this relationship would be linear? Moving a price from $1 to $2 is the same as changing it from $10 to $11?? We should probably be thinking about the price elasticity of OJ...

log(Sales) = γ0 + γ1 log(Price) + ε
SUMMARY OUTPUT
Regression Statistics
Multiple R          0.869
R Square            0.755
Adjusted R Square   0.750
Standard Error      0.386
Observations        50

ANOVA
            df   SS      MS      F        Significance F
Regression   1   22.055  22.055  148.187  0.000
Residual    48   7.144   0.149
Total       49   29.199

           Coefficients  Standard Error  t Stat   P-value  Lower 95%  Upper 95%
Intercept   4.812         0.148          32.504    0.000    4.514      5.109
LogPrice   -1.752         0.144         -12.173    0.000   -2.042     -1.463
How do we interpret γ1 = −1.75? (When prices go up 1%, sales go down by 1.75%.)
Price Elasticity of OJ
[Figure: left panel "Fitted Model", Sales vs. Price for the log-log fit; right panel "Residual Plot", residuals vs. Price.]
Much better!!
Making Predictions
What if the gas station store wants to predict their sales of OJ if they decide to price it at $1.8?
The predicted log(Sales) = 4.812 + (−1.752)× log(1.8) = 3.78
So, the predicted Sales = exp(3.78) = 43.82. How about the plug-in prediction interval?

In the log scale, our prediction interval is

[log(Sales) − 2s; log(Sales) + 2s] = [3.78 − 2(0.38); 3.78 + 2(0.38)] = [3.02; 4.54].

In terms of actual Sales the interval is [exp(3.02); exp(4.54)] = [20.5; 93.7].
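The back-transformation can be scripted directly from the numbers above (4.812, −1.752, and s = 0.386 from the regression output; the slide rounds s to 0.38, which is why its interval reads [20.5; 93.7]):

```python
import numpy as np

# Coefficients and residual SD taken from the regression output above.
b0, b1, s = 4.812, -1.752, 0.386
price = 1.8

log_pred = b0 + b1 * np.log(price)     # predicted log(Sales), about 3.78
sales_pred = np.exp(log_pred)          # point prediction, about 43.9

# Plug-in ~95% interval on the log scale, then exponentiate the endpoints.
lo, hi = np.exp(log_pred - 2 * s), np.exp(log_pred + 2 * s)
print(round(log_pred, 2), round(sales_pred, 1), round(lo, 1), round(hi, 1))
```

Note that exponentiating makes the interval asymmetric around the point prediction, which matches multiplicative errors.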
Making Predictions
[Figure: two "Plug-in Prediction" panels. Left: Sales vs. Price in the original scale. Right: log(Sales) vs. log(Price) in the log scale.]
I In the log scale (right) we have [Ŷ − 2s; Ŷ + 2s]
I In the original scale (left) we have [exp(Ŷ) exp(−2s); exp(Ŷ) exp(2s)]
Some additional comments...
I Another useful transformation to deal with non-constant variance is to take only the log(Y) and keep X the same. Clearly the "elasticity" interpretation no longer holds.
I Always be careful in interpreting the models after a transformation.
I Also, be careful in using the transformed model to make predictions.
Summary of Transformations
Coming up with a good regression model is usually an iterative procedure. Use plots of residuals vs. X or Ŷ to determine the next step.

Log transform is your best friend when dealing with non-constant variance (log(X), log(Y), or both).

Add polynomial terms (e.g. X^2) to get nonlinear regression.

The bottom line: you should combine what the plots and the regression output are telling you with your common sense and knowledge about the problem. Keep playing around with it until you get something that makes sense and has nothing obviously wrong with it.
Airline Data
Monthly passengers in the U.S. airline industry (in 1,000s of passengers) from 1949 to 1960... we need to predict the number of passengers in the next couple of months.
[Figure: time series plot of Passengers vs. Time.]
Any ideas?
Airline Data
How about a “trend model”? Yt = β0 + β1t + εt
[Figure: Passengers vs. Time with fitted values from the trend model.]
What do you think?
Airline Data
Let’s look at the residuals...
[Figure: standardized residuals vs. Time.]
Is there any obvious pattern here? YES!!
Airline Data
The variance of the residuals seems to be growing in time... Let's try taking the log.

log(Yt) = β0 + β1t + εt
[Figure: log(Passengers) vs. Time with fitted values.]
Any better?
Airline Data
Residuals...
[Figure: standardized residuals vs. Time.]
Still we can see some obvious temporal/seasonal pattern....
Airline Data
Okay, let's add dummy variables for months (only 11 dummies)...

log(Yt) = β0 + β1t + β2Jan + ... + β12Dec + εt
[Figure: log(Passengers) vs. Time with fitted values from the trend plus monthly dummies model.]
Much better!!
Airline Data
Residuals...
[Figure: standardized residuals vs. Time.]
I am still not happy... it doesn't look normal iid to me...
Airline Data
Residuals...
[Figure: scatter plot of e(t) vs. e(t−1); corr(e(t), e(t−1)) = 0.786.]
I was right! The residuals are dependent on time...
Airline Data
We have one more tool... let's add one lagged term.

log(Yt) = β0 + β1t + β2Jan + ... + β12Dec + β13 log(Yt−1) + εt
[Figure: log(Passengers) vs. Time with fitted values from the model with the lagged term.]
Okay, good...
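The design matrix for this final model can be assembled in a few lines. Here is a hedged sketch on a synthetic series (the real airline data is not reproduced here), showing the trend, the 11 month dummies, and the lagged log response:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 144                       # 12 years of monthly observations
t = np.arange(n)
month = t % 12
# Synthetic series in the spirit of the airline data: trend + seasonality + noise.
log_y = 4.7 + 0.01 * t + 0.15 * np.sin(2 * np.pi * month / 12) + rng.normal(0, 0.05, n)

df = pd.DataFrame({"log_y": log_y, "t": t, "month": month})
# 11 month dummies (one month is the baseline) and the lagged response.
dummies = pd.get_dummies(df["month"], prefix="m", drop_first=True).astype(float)
df["lag_log_y"] = df["log_y"].shift(1)
df = df.dropna()              # the first observation has no lag

X = np.column_stack([np.ones(len(df)), df["t"], dummies.iloc[1:], df["lag_log_y"]])
b, *_ = np.linalg.lstsq(X, df["log_y"], rcond=None)
print("design matrix:", X.shape, "-> intercept + trend + 11 dummies + lag")
```

Note the one lost observation: the lagged term is undefined for the first month, so the regression runs on n − 1 rows.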
Airline Data
Residuals...
[Figure: standardized residuals vs. Time.]
Much better!!
Airline Data
Residuals...
[Figure: scatter plot of e(t) vs. e(t−1); corr(e(t), e(t−1)) = −0.11.]
Much better indeed!!
Model Building Process
When building a regression model remember that simplicity is your friend... smaller models are easier to interpret and have fewer unknown parameters to be estimated.
Keep in mind that every additional parameter represents a cost!!
The first step of every model building exercise is the selection of the universe of variables to be potentially used. This task is entirely solved through your experience and context-specific knowledge...
I Think carefully about the problem
I Consult subject matter research and experts
I Avoid the mistake of selecting too many variables
Model Building Process
With a universe of variables in hand, the goal now is to select the model. Why not include all the variables?

Big models tend to over-fit and find features that are specific to the data in hand... i.e., not generalizable relationships. The results are bad predictions and bad science!

In addition, bigger models have more parameters and potentially more uncertainty about everything we are trying to learn...

We need a strategy to build a model in a way that accounts for the trade-off between fitting the data and the uncertainty associated with the model.
5. Variable Selection and Regularization
When working with linear regression models where the number of X variables is large, we need to think about strategies to select what variables to use...
We will focus on 3 ideas:
I Subset Selection
I Shrinkage
I Dimension Reduction
Subset Selection
The idea here is very simple: fit as many models as you can and compare their performance based on some criteria!
Issues:
I How many possible models? Total number of models = 2^p.
Is this large?
I What criteria to use? Just as before, if prediction is what we have in mind, out-of-sample predictive ability should be the criterion.
Information Criteria
Another way to evaluate a model is to use Information Criteria metrics which attempt to quantify how well our model would have predicted the data (regardless of what you've estimated for the βj's).

A good alternative is the BIC: Bayes Information Criterion, which is based on a "Bayesian" philosophy of statistics.

BIC = n log(s^2) + p log(n)
You want to choose the model that leads to minimum BIC.
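A minimal implementation of the formula above. One detail is an assumption: s^2 is taken here as RSS/n (the ML-style estimate); other conventions use RSS/(n − p):

```python
import numpy as np

def bic(residuals, p):
    """BIC = n*log(s^2) + p*log(n); here s^2 is taken as RSS/n."""
    n = len(residuals)
    s2 = np.sum(np.asarray(residuals) ** 2) / n
    return n * np.log(s2) + p * np.log(n)

# With identical residuals, the model with more parameters is always
# penalized more: the fit term is the same and p*log(n) grows with p.
rng = np.random.default_rng(4)
e = rng.normal(0, 1.0, 100)
print("p = 2:", round(bic(e, 2), 2))
print("p = 8:", round(bic(e, 8), 2))
```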
Information Criteria
One nice thing about the BIC is that you can interpret it in terms of model probabilities. Given a list of possible models {M1, M2, ..., MR}, the probability that model i is correct is

P(Mi) ≈ exp(−BIC(Mi)/2) / Σ_{r=1..R} exp(−BIC(Mr)/2) = exp(−[BIC(Mi) − BICmin]/2) / Σ_{r=1..R} exp(−[BIC(Mr) − BICmin]/2)

(Subtract BICmin = min{BIC(M1), ..., BIC(MR)} for numerical stability.)
Similar, alternative criteria include AIC, Cp, adjusted R^2... bottom line: these are only useful if we lack the ability to compare models based on their out-of-sample predictive ability!!!
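A quick numerical sketch of the model-probability formula (the BIC values here are hypothetical):

```python
import numpy as np

bics = np.array([410.2, 413.7, 425.0])   # hypothetical BICs for 3 models
d = bics - bics.min()                    # subtract BICmin for stability
w = np.exp(-0.5 * d)
probs = w / w.sum()
# probs sums to 1; the smallest-BIC model gets the largest probability.
```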
Search Strategies: Stepwise Regression
One computational approach to build a regression model step-by-step is "stepwise regression". There are 3 options:
I Forward: adds one variable at a time until no remaining variable makes a significant contribution (or meets a certain criterion... which could be out-of-sample prediction)

I Backward: starts with all possible variables and removes one at a time until further deletions would do more harm than good

I Stepwise: just like the forward procedure but allows for deletions at each step
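A minimal sketch of the forward procedure, using in-sample RSS as the greedy criterion (a real run would score each step out of sample or with an information criterion); the data are synthetic and all names illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 3 * X[:, 2] - 2 * X[:, 4] + rng.normal(size=n)

def rss(cols):
    # residual sum of squares of the OLS fit using the given columns
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return r @ r

selected = []
for _ in range(2):                        # add two variables, one at a time
    candidates = [j for j in range(p) if j not in selected]
    best = min(candidates, key=lambda j: rss(selected + [j]))
    selected.append(best)
# The two truly active predictors (columns 2 and 4) should be picked.
```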
Shrinkage Methods
An alternative way to deal with selection is to work with all p predictors at once while placing a constraint on the size of the estimated coefficients.

This idea is a regularization technique that reduces the variability of the estimates and tends to lead to better predictions.

The hope is that by having the constraint in place, the estimation procedure will be able to focus on "the important β's"
Ridge Regression
Ridge Regression is a modification of the least squares criterion that minimizes (as a function of the β's)

Σ_{i=1..n} (Yi − β0 − Σ_{j=1..p} βj Xij)^2 + λ Σ_{j=1..p} βj^2

for some value of λ > 0
I The first part of the equation (the sum of squared errors) is the traditional objective function of LS

I The second part is the shrinkage penalty, i.e., something that makes it costly to have big values for β
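The penalized objective has a closed-form minimizer, (X'X + λI)^{-1} X'Y, once the data are centered so the (unpenalized) intercept drops out. A small numpy sketch on synthetic data, with illustrative names and numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -1.0]) + rng.normal(size=n)

# Center so the intercept is not needed (and not penalized).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge(lam):
    # closed-form ridge solution: (X'X + lam*I)^{-1} X'y
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

b_ls = ridge(0.0)        # lambda = 0: back to ordinary least squares
b_shrunk = ridge(1e6)    # huge lambda: everything pushed toward zero
```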
Ridge Regression
Σ_{i=1..n} (Yi − β0 − Σ_{j=1..p} βj Xij)^2 + λ Σ_{j=1..p} βj^2
I if λ = 0 we are back to least squares
I when λ→∞, it is "too expensive" to allow any β to be different from 0...

I So, for different values of λ we get a different solution to the problem
Ridge Regression
I What ridge regression is doing is exploring the bias-variance trade-off! The larger the λ, the more bias (towards zero) is being introduced into the solution, i.e., the less flexible the model becomes... at the same time, the solution has less variance

I As always, the trick is to find the "right" value of λ that makes the model not too simple but not too complex!

I Whenever possible, we will choose λ by comparing the out-of-sample performance (usually via cross-validation)
Ridge Regression
[Figure: squared bias (black), variance (green), and test MSE (purple) plotted against λ (left) and against ‖βRλ‖2/‖β‖2 (right).]

Comments:
I Ridge is computationally very attractive, as the "computing cost" is almost the same as least squares (contrast that with subset selection!)

I It's good practice to always center and scale the X's before running ridge
LASSO
The LASSO is a shrinkage method that performs automatic selection. It is similar to ridge but it will provide solutions that are sparse, i.e., some β's exactly equal to 0! This facilitates interpretation of the results...
[Figure: standardized coefficient paths (Income, Limit, Rating, Student) for ridge (left) and LASSO (right), plotted against λ and against the normalized coefficient norm.]
LASSO
The LASSO solves the following problem:

argmin_β Σ_{i=1..n} (Yi − β0 − Σ_{j=1..p} βj Xij)^2 + λ Σ_{j=1..p} |βj|
I Once again, λ controls how flexible the model gets to be
I Still a very efficient computational strategy
I Whenever possible, we will choose λ by comparing the out-of-sample performance (usually via cross-validation)
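One standard way to solve this problem is cyclic coordinate descent (the algorithm behind packages such as glmnet); a compact numpy sketch on synthetic data, with one strong predictor and all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 8
X = rng.normal(size=(n, p))
X /= np.sqrt((X ** 2).sum(axis=0))       # columns scaled to unit norm
y = 4 * X[:, 0] + rng.normal(size=n)     # one strong predictor

def lasso_cd(lam, iters=200):
    b = np.zeros(p)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual
            rho = X[:, j] @ r                       # univariate fit on it
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0)  # soft-threshold
    return b

b = lasso_cd(lam=2.0)
# Some coefficients come out exactly zero; the strong one survives.
```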
Ridge vs. LASSO
Why does the LASSO output zeros?
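The zeros come from the kink of |β| at zero: in the one-predictor (or orthonormal) case the lasso solution is the OLS estimate soft-thresholded, so any estimate smaller than the threshold is set exactly to zero, while ridge only rescales. A tiny sketch of the two one-dimensional solutions:

```python
import numpy as np

def soft_threshold(b_ols, t):
    # argmin over beta of (beta - b_ols)**2 / 2 + t * abs(beta)  (lasso)
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - t, 0.0)

def ridge_scale(b_ols, lam):
    # argmin over beta of (beta - b_ols)**2 / 2 + lam * beta**2 / 2  (ridge)
    return b_ols / (1.0 + lam)
```

Small OLS estimates are mapped exactly to 0 by the lasso but only shrunk by ridge.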
Ridge vs. LASSO
Which one is better?
I It depends...
I In general, LASSO will perform better than ridge when a relatively small number of predictors have a strong effect on Y, while ridge will do better when Y is a function of many of the X's and the coefficients are of moderate size

I LASSO can be easier to interpret (the zeros help!)

I But, if prediction is what we care about, the only way to decide which method is better is comparing their out-of-sample performance
Choosing λ
The idea is to solve the ridge or LASSO objective function over a grid of possible values for λ...
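A sketch of the grid search with 10-fold cross-validation, numpy only (library routines such as scikit-learn's RidgeCV automate this); the data and grid are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

folds = np.array_split(rng.permutation(n), 10)   # fixed fold assignment

def ridge_fit(Xtr, ytr, lam):
    return np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(p), Xtr.T @ ytr)

def cv_mse(lam):
    errs = []
    for f in folds:
        mask = np.ones(n, dtype=bool)
        mask[f] = False                           # hold this fold out
        b = ridge_fit(X[mask], y[mask], lam)
        errs.append(np.mean((y[f] - X[f] @ b) ** 2))
    return np.mean(errs)

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_lam = min(grid, key=cv_mse)   # lambda with the smallest CV error
```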
[Figure: 10-fold cross-validated RMSE against log(lambda) for ridge (left) and LASSO (right).]
6. Dimension Reduction Methods
Sometimes, the number (p) of X variables available is too large for us to work with the methods presented above.

Perhaps we could first summarize the information in the predictors into a smaller set of variables (m << p) and then try to predict Y.

These summaries are often linear combinations of the original variables.
Principal Components Regression
A very popular way to summarize multivariate data is Principal Components Analysis (PCA).

PCA is a dimensionality reduction technique that tries to represent p variables with k < p "new" variables.

These "new" variables are created by linear combinations of the original variables, and the hope is that a small number of them is able to effectively represent what is going on in the original data.
Principal Components Analysis
Assume we have a dataset where p variables are observed. Let Xi be the i-th observation of the p-dimensional vector X. PCA writes:

xij = b1j zi1 + b2j zi2 + · · · + bkj zik + eij

where zij is the i-th observation of the j-th principal component.

You can think about these z variables as the "essential variables" responsible for all the action in X.
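A numpy sketch of extracting the z's, on a toy dataset driven by one underlying factor (data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
f = rng.normal(size=n)                         # one underlying "essential" variable
X = np.column_stack([f + 0.1 * rng.normal(size=n),
                     f + 0.1 * rng.normal(size=n)])

Xc = X - X.mean(axis=0)                        # center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                             # principal component scores (the z's)
var_explained = s ** 2 / np.sum(s ** 2)
# PC1 should account for nearly all the variability here.
```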
Principal Components Analysis

Here's a picture...
[Figure: scatter of x1 vs x2 with the PC1 and PC2 directions drawn in. PCA looks for high-variance projections from the multivariate x (i.e., the long direction) and finds the least squares fit; components are ordered by the variance of the fitted projection.]

These two variables x1 and x2 are very correlated. PC1 tells you almost everything that is going on in this dataset! PCA will look for linear combinations of the original variables that account for most of their variability!
Principal Components Analysis
Let's look at a simple example... Data: protein consumption per person by country for 9 variables: red meat, white meat, eggs, milk, fish, cereals, starch, nuts, vegetables.
[Figure: left, the countries plotted by RedMeat vs WhiteMeat; right, the countries plotted on PC1 vs PC2.]

Looks to me like PC1 measures how rich you are and PC2 something to do with the Mediterranean diet!
Principal Components Analysis
[Figure: the countries plotted on PC1 vs PC2.]
These are the weights defining the principal components...
Principal Components Analysis

3 variables might be enough to represent this data... most of the variability (75%) is explained by PC1, PC2 and PC3.
[Figure: screeplot of the food principal component variances.]
Principal Components Analysis: Comments
I PCA is a great way to summarize data
I It “clusters” both variables and observations simultaneously!
I The choice of k can be evaluated as a function of the interpretation of the results or via the fit (% of the variation explained)

I The units of each PC are not interpretable in an absolute sense. However, relative to each other they are... see the example above.
I Always a good idea to center the data before running PCA.
Principal Components Regression (PCR)
Let's go back and think about predicting Y with a potentially large number of X variables...

PCA is sometimes used as a way to reduce the dimensionality of X... if only a small number of PCs is enough to represent X, I don't need to use all the X's, right? Remember, smaller models tend to do better in predictive terms!

This is called Principal Components Regression. First represent X via k principal components (Z) and then run a regression of Y onto Z. PCR assumes that the directions in which X shows the most variation (the PCs) are the directions associated with Y.

The choice of k can be done by comparing the out-of-sample predictive performance.
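A sketch of the two-step procedure (PCs via the SVD, then OLS on the scores) on synthetic data where one factor drives both X and Y; in practice k would be chosen by cross-validation, and all names here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 200, 10, 1
f = rng.normal(size=n)                         # common factor
X = np.outer(f, np.ones(p)) + 0.3 * rng.normal(size=(n, p))
y = f + 0.5 * rng.normal(size=n)

# Step 1: first k principal component scores of centered X
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:k].T

# Step 2: regress y on the scores
Zd = np.column_stack([np.ones(n), Z])
gamma, *_ = np.linalg.lstsq(Zd, y, rcond=None)
yhat = Zd @ gamma
# One component already predicts y well, because PC1 recovers the factor.
```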
Principal Components Regression (PCR)
Example: Roll Call Votes in Congress... all votes in the 111th Congress (2009-2011); p = 1647, n = 445.
Goal: predict how "liberal" a district is as a function of the votes cast by its representative.

Let's first take the principal component decomposition of the votes...
[Figure: screeplot of the roll-call vote principal component variances.]

It looks like 2 PCs capture much of what is going on...
Principal Components Regression (PCR)

Histogram of loadings on PC1... what bills are important in defining PC1?

[Figure: histogram of the 1st principal component vote loadings; the Afford. Health (amdt.) and TARP votes sit in the tails.]
Principal Components Regression (PCR)

Histogram of PC1... an "Ideology Score"

[Figure: histogram of the 1st principal component vote scores; Royball (CA-34), Deutch (FL-19), Rosleht (FL-18), and Conaway (TX-11) are marked.]
Principal Components Regression (PCR)
[Figure: the districts TX-11 and CA-34.]

The two extremes...
Principal Components Regression (PCR)
[Figure: the districts FL-18 and FL-19.]

The swing state!
Principal Components Regression (PCR)
[Figure: left, % Obama vote against PC1; right, PC2 against PC1.]
All we need is PC1 to predict party affiliation!
How can this picture help you understand what happened in the 2010 election?
Principal Components Regression (PCR)
[Figure: % Obama vote against PC1, with the districts whose representatives lost in 2010 highlighted.]
Partial Least Squares (PLS)
PLS works very similarly to PCR, as it will create "new" variables (Z) by taking linear combinations of the original variables (X).

The difference is that PLS attempts to find the directions of variation in X that help explain BOTH X and Y.

It is a supervised learning alternative to PCR...
Partial Least Squares (PLS)
PLS works as follows:
1. The weights of the first linear combination (Z1) are defined by the regression of Y onto each of the X's... i.e., large weights are going to be placed on the X variables most related to Y in a univariate sense
2. Regress each X variable onto Z1 and compute the residuals
3. Repeat step (1) using the residuals from (2) in place of X
4. iterate
As always, the choice of where to stop, i.e., how many Z variables to use, should be done by comparing the out-of-sample predictive performance.
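The steps above can be sketched in numpy (one component and one deflation shown; the data are synthetic and the names illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2 * X[:, 0] + rng.normal(size=n)
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Step 1: weights from the univariate regressions of y on each X
w = Xc.T @ yc / np.sum(Xc ** 2, axis=0)
z1 = Xc @ w                                   # first PLS component

# Step 2: regress each X on z1 and keep the residuals (deflation)
coefs = Xc.T @ z1 / (z1 @ z1)
X_res = Xc - np.outer(z1, coefs)

# Step 3 would repeat step 1 with X_res in place of Xc to get z2, and so on.
```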
Partial Least Squares (PLS)
Roll Call Data again... it looks like the first component from PLS is the same as the first principal component!
[Figure: % Obama vote against PC1 (left) and against the first PLS component, Comp1 (right).]
Partial Least Squares (PLS)
But, using two components PLS does better than PCR!!
[Figure: % Obama vote against fitted values: PCR with two components (R^2 = 0.448, left) vs PLS with two components (R^2 = 0.558, right).]
Partial Least Squares (PLS)

Not easy to understand the difference between the second component in each method (how is that for a homework!)... the bottom line is that by using the information from Y in summarizing the X variables, PLS finds a second component that has the ability to explain part of Y.
[Figure: PC2 against PC1 for PCR (left) and Comp 2 against Comp 1 for PLS (right).]