
Linear Regression Models
Andy Wang
CIS 5930-03
Computer Systems Performance Analysis

2

Linear Regression Models

• What is a (good) model?
• Estimating model parameters
• Allocating variation
• Confidence intervals for regressions
• Verifying assumptions visually

3

What Is a (Good) Model?

• For correlated data, model predicts response given an input
• Model should be equation that fits data
• Standard definition of "fits" is least-squares
  – Minimize squared error
  – Keep mean error zero
  – Minimizes variance of errors

4

Least-Squared Error

• If $\hat{y} = b_0 + b_1 x$, then the error in the estimate for $x_i$ is $e_i = y_i - \hat{y}_i$
• Minimize the Sum of Squared Errors (SSE):

  $\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2$

• Subject to the constraint:

  $\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0$

5

Estimating Model Parameters

• Best regression parameters are

  $b_1 = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2} \qquad b_0 = \bar{y} - b_1 \bar{x}$

  where $\bar{x} = \dfrac{1}{n} \sum x_i$ and $\bar{y} = \dfrac{1}{n} \sum y_i$

• Note error in book!

6

Parameter Estimation Example

• Execution time of a script for various loop counts:

  Loops | 3   | 5   | 7   | 9   | 10
  Time  | 1.2 | 1.7 | 2.5 | 2.9 | 3.3

• $\bar{x} = 6.8$, $\bar{y} = 2.32$, $\sum x_i y_i = 88.7$, $\sum x_i^2 = 264$

• $b_1 = \dfrac{88.7 - 5(6.8)(2.32)}{264 - 5(6.8)^2} = 0.30$

• $b_0 = 2.32 - (0.30)(6.8) = 0.28$ (see the code sketch below)
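These two formulas are easy to check numerically. Below is a minimal Python/NumPy sketch, not part of the original slides (variable names are illustrative), that reproduces the example:

```python
import numpy as np

x = np.array([3, 5, 7, 9, 10], dtype=float)  # loop counts
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])      # execution times

n = len(x)
# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # ~0.284 and ~0.299, i.e., 0.28 and 0.30 after rounding
```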

7

Graph of Parameter Estimation Example

[Scatter plot of the example data (loop count 0–12 on the x axis, time 0–3 on the y axis) with the fitted line $\hat{y} = 0.28 + 0.30x$]

8

Allocating Variation

• If no regression, best guess of $y$ is $\bar{y}$
• Observed values of $y$ differ from $\bar{y}$, giving rise to errors (variance)
• Regression gives better guess, but there are still errors
• We can evaluate quality of regression by allocating sources of errors

9

The Total Sum of Squares

• Without regression, squared error is

  $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} \left( y_i^2 - 2 y_i \bar{y} + \bar{y}^2 \right)$

  $= \sum_{i=1}^{n} y_i^2 - 2n\bar{y}^2 + n\bar{y}^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = \mathrm{SSY} - \mathrm{SS0}$

10

The Sum of Squares from Regression

• Recall that regression error is $\mathrm{SSE} = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$
• Error without regression is SST
• So regression explains SSR = SST − SSE
• Regression quality measured by coefficient of determination:

  $R^2 = \dfrac{\mathrm{SSR}}{\mathrm{SST}} = \dfrac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}$

11

Evaluating Coefficient of Determination

• Compute $\mathrm{SST} = \sum y^2 - n\bar{y}^2$
• Compute $\mathrm{SSE} = \sum y^2 - b_0 \sum y - b_1 \sum xy$
• Compute $R^2 = \dfrac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}$
• where $R = R(x, y)$ = correlation(x, y)

12

Example of Coefficient of Determination

• For previous regression example (x = 3, 5, 7, 9, 10; y = 1.2, 1.7, 2.5, 2.9, 3.3)
  – $\sum y = 11.60$, $\sum y^2 = 29.88$, $\sum xy = 88.7$, $n\bar{y}^2 = 5(2.32)^2 = 26.9$
  – SSE = 29.88 − (0.28)(11.60) − (0.30)(88.7) = 0.028
  – SST = 29.88 − 26.9 = 2.97
  – SSR = 2.97 − 0.03 = 2.94
  – R² = (2.97 − 0.03)/2.97 = 0.99 (verified in the sketch below)
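As a sketch (reusing the example data; not from the slides), the allocation of variation can be verified with NumPy:

```python
import numpy as np

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

sst = np.sum(y**2) - n * y.mean()**2                      # SST ~2.97
sse = np.sum(y**2) - b0 * np.sum(y) - b1 * np.sum(x * y)  # SSE ~0.03
print(sst, sse, (sst - sse) / sst)                        # R^2 ~0.99
```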

13

Standard Deviation of Errors

• Variance of errors is SSE divided by degrees of freedom
  – DOF is n − 2 because we've calculated 2 regression parameters from the data
  – So variance (mean squared error, MSE) is SSE/(n − 2)
• Standard dev. of errors is square root:

  $s_e = \sqrt{\dfrac{\mathrm{SSE}}{n - 2}}$  (minor error in book)

14

Checking Degrees of Freedom

• Degrees of freedom always equate:
  – SS0 has 1 (computed from $\bar{y}$)
  – SST has n − 1 (computed from data and $\bar{y}$, which uses up 1)
  – SSE has n − 2 (needs 2 regression parameters)
  – So SST = SSY − SS0 = SSR + SSE, with DOF n − 1 = n − 1 = 1 + (n − 2)

15

Example of Standard Deviation of Errors

• For regression example, SSE was 0.03, so MSE is 0.03/3 = 0.01 and $s_e$ = 0.10 (sketched in code below)
• Note high quality of our regression:
  – $R^2$ = 0.99
  – $s_e$ = 0.10
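A short sketch of the same computation (assuming the example data; the residual-based SSE matches the shortcut formula):

```python
import numpy as np

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)          # residuals
mse = np.sum(e**2) / (n - 2)   # SSE over n-2 degrees of freedom, ~0.01
se = np.sqrt(mse)              # ~0.10
print(mse, se)
```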

16

Confidence Intervals for Regressions

• Regression is done from a single population sample (size n)
  – Different sample might give different results
  – True model is $y = \beta_0 + \beta_1 x$
  – Parameters $b_0$ and $b_1$ are really means taken from a population sample

17

Calculating Intervals for Regression Parameters

• Standard deviations of parameters:

  $s_{b_0} = s_e \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum x_i^2 - n \bar{x}^2}} \qquad s_{b_1} = \dfrac{s_e}{\sqrt{\sum x_i^2 - n \bar{x}^2}}$

• Confidence intervals are $b_i \mp t \, s_{b_i}$, where t has n − 2 degrees of freedom
  – Not divided by sqrt(n)

18

Example of Regression Confidence Intervals

• Recall $s_e$ = 0.10, n = 5, $\sum x_i^2$ = 264, $\bar{x}$ = 6.8
• So

  $s_{b_0} = 0.10 \sqrt{\dfrac{1}{5} + \dfrac{(6.8)^2}{264 - 5(6.8)^2}} = 0.12$

  $s_{b_1} = \dfrac{0.10}{\sqrt{264 - 5(6.8)^2}} = 0.017$

• Using 90% confidence level, $t_{0.95;3}$ = 2.353

19

Regression Confidence Example, cont’d

• Thus, $b_0$ interval is 0.28 ∓ 2.353(0.12) = (−0.004, 0.57)
  – Not significant at 90%
• And $b_1$ is 0.30 ∓ 2.353(0.017) = (0.26, 0.34)
  – Significant at 90% (and would survive even a 99.9% test)
  – Both intervals are reproduced in the sketch below
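A sketch of the whole interval computation in Python (assuming SciPy for the t-quantile; not part of the original slides):

```python
import numpy as np
from scipy import stats

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se = np.sqrt(np.sum(e**2) / (n - 2))              # ~0.10

sxx = np.sum(x**2) - n * x.mean()**2
sb0 = se * np.sqrt(1.0 / n + x.mean()**2 / sxx)   # ~0.12
sb1 = se / np.sqrt(sxx)                           # ~0.017
t = stats.t.ppf(0.95, df=n - 2)                   # 2.353 for a 90% interval
print(b0 - t * sb0, b0 + t * sb0)  # spans zero: b0 not significant
print(b1 - t * sb1, b1 + t * sb1)  # excludes zero: b1 significant
```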

20

Confidence Intervals for Predictions

• Previous confidence intervals are for parameters
  – How certain can we be that the parameters are correct?
• Purpose of regression is prediction
  – How accurate are the predictions?
  – Regression gives mean of predicted response, based on sample we took

21

Predicting m Samples

• Standard deviation for mean of future sample of m observations at $x_p$ is

  $s_{\hat{y}_{mp}} = s_e \sqrt{\dfrac{1}{m} + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum x_i^2 - n \bar{x}^2}}$

• Note deviation drops as m → ∞
• Variance minimal at $x_p = \bar{x}$
• Use t-quantiles with n − 2 DOF for interval

22

Example of Confidence of Predictions

• Using previous equation, what is predicted time for a single run of 8 loops?
• Time = 0.28 + 0.30(8) = 2.68
• Standard deviation of errors $s_e$ = 0.10:

  $s_{\hat{y}_p} = 0.10 \sqrt{\dfrac{1}{1} + \dfrac{1}{5} + \dfrac{(8 - 6.8)^2}{264 - 5(6.8)^2}} = 0.11$

• 90% interval is then 2.68 ∓ 2.353(0.11) = (2.42, 2.93) (see the sketch below)
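The same prediction interval, sketched in Python (again assuming SciPy; illustrative only):

```python
import numpy as np
from scipy import stats

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se = np.sqrt(np.sum(e**2) / (n - 2))
sxx = np.sum(x**2) - n * x.mean()**2

xp, m = 8.0, 1                     # a single future run at 8 loops
yp = b0 + b1 * xp                  # predicted time, ~2.68
syp = se * np.sqrt(1.0 / m + 1.0 / n + (xp - x.mean())**2 / sxx)  # ~0.11
t = stats.t.ppf(0.95, df=n - 2)
print(yp - t * syp, yp + t * syp)  # ~(2.42, 2.93)
```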

23

Prediction Confidence

[Plot: confidence bands for predicted y versus x around the regression line, narrowest near $\bar{x}$]

24

Verifying Assumptions Visually

• Regressions are based on assumptions:
  – Linear relationship between response y and predictor x
    • Or nonlinear relationship used in fitting
  – Predictor x nonstochastic and error-free
  – Model errors statistically independent
    • With distribution N(0, c) for constant c
• If assumptions violated, model misleading or invalid

25

Testing Linearity

• Scatter plot x vs. y to see basic curve type
  [Four example scatter plots: Linear, Piecewise Linear, Outlier, Nonlinear (Power)]

26

Testing Independence of Errors

• Scatter-plot $e_i$ versus $\hat{y}_i$
• Should be no visible trend
• Example from our curve fit (code sketch below):
  [Residual plot: errors between −0.1 and 0.2 against predicted values 0–4, no visible trend]
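Such a residual plot can be generated with a few lines of Python (a sketch using matplotlib, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x
plt.scatter(yhat, y - yhat)     # residual e_i vs. predicted y_i
plt.axhline(0, linestyle="--")  # errors should straddle zero with no trend
plt.xlabel("predicted y")
plt.ylabel("residual")
plt.show()
```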

27

More on Testing Independence

• May be useful to plot error residuals versus experiment number
  – In previous example, this gives same plot except for x scaling
• No foolproof tests
  – "Independence" test really disproves particular dependence
  – Maybe next test will show different dependence

28

Testing for Normal Errors

• Prepare quantile-quantile plot of errors
• Example for our regression (code sketch below):
  [Q-Q plot: error quantiles (−0.2 to 0.2) against normal quantiles (−1.5 to 1.5)]
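A normal quantile-quantile check can be sketched with scipy.stats.probplot (illustrative, not from the slides); points falling near a straight line suggest approximately normal errors:

```python
import numpy as np
from scipy import stats

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)  # residuals

# osm: theoretical normal quantiles; osr: ordered residuals
(osm, osr), (slope, intercept, r) = stats.probplot(e, dist="norm")
for q, resid in zip(osm, osr):
    print(f"{q:6.2f}  {resid:6.2f}")
```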

29

Testing for Constant Standard Deviation

• Tongue-twister: homoscedasticity
• Return to independence plot
• Look for trend in spread
• Example:
  [Same residual plot as before; the spread shows no trend]

30

Linear Regression Can Be Misleading

• Regression throws away some information about the data
  – To allow more compact summarization
• Sometimes vital characteristics are thrown away
  – Often, looking at data plots can tell you whether you will have a problem

31

Example of Misleading Regression

     I            II           III          IV
  x     y      x     y      x     y      x     y
 10   8.04    10   9.14    10   7.46     8   6.58
  8   6.95     8   8.14     8   6.77     8   5.76
 13   7.58    13   8.74    13  12.74     8   7.71
  9   8.81     9   8.77     9   7.11     8   8.84
 11   8.33    11   9.26    11   7.81     8   8.47
 14   9.96    14   8.10    14   8.84     8   7.04
  6   7.24     6   6.13     6   6.08     8   5.25
  4   4.26     4   3.10     4   5.39    19  12.50
 12  10.84    12   9.13    12   8.15     8   5.56
  7   4.82     7   7.26     7   6.42     8   7.91
  5   5.68     5   4.74     5   5.73     8   6.89

32

What Does Regression Tell Us About These Data Sets?

• Exactly the same thing for each!
• n = 11
• Mean of y = 7.5
• $\hat{y}$ = 3 + 0.5x
• Standard error of the slope estimate is 0.118
• All the sums of squares are the same
• Correlation coefficient = 0.82
• $R^2$ = 0.67
• (Reproduced in the sketch below)
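These identical summaries are easy to reproduce (a sketch using the quartet data from the previous slide; all four fits print essentially the same numbers):

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
for xs, ys in quartet:
    x, y = np.asarray(xs, float), np.asarray(ys, float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    r = np.corrcoef(x, y)[0, 1]
    # each line: b0~3.00, b1~0.50, mean(y)~7.50, r~0.82
    print(f"b0={b0:.2f} b1={b1:.2f} ybar={y.mean():.2f} r={r:.2f}")
```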

33

Now Look at the Data Plots

[Four scatter plots, panels I–IV, each on the same axes (x 0–20, y 0–12), all with the identical fitted line y = 3 + 0.5x]
