
Linear Regression Models
Andy Wang
CIS 5930-03
Computer Systems Performance Analysis

2

Linear Regression Models

• What is a (good) model?
• Estimating model parameters
• Allocating variation
• Confidence intervals for regressions
• Verifying assumptions visually

3

What Is a (Good) Model?

• For correlated data, model predicts response given an input
• Model should be equation that fits data
• Standard definition of "fits" is least-squares
  – Minimize squared error
  – Keep mean error zero
  – Minimizes variance of errors

4

Least-Squared Error

• If $\hat{y} = b_0 + b_1 x$, then the error in the estimate for $x_i$ is $e_i = y_i - \hat{y}_i$
• Minimize the Sum of Squared Errors (SSE):

  $\mathrm{SSE} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2$

• Subject to the constraint:

  $\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right) = 0$

5

Estimating Model Parameters

• Best regression parameters are

  $b_1 = \dfrac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2} \qquad b_0 = \bar{y} - b_1 \bar{x}$

  where $\bar{x} = \dfrac{1}{n} \sum x_i$ and $\bar{y} = \dfrac{1}{n} \sum y_i$

• Note error in book!

6

Parameter Estimation Example

• Execution time of a script for various loop counts:

  Loops | 3   | 5   | 7   | 9   | 10
  Time  | 1.2 | 1.7 | 2.5 | 2.9 | 3.3

• $\bar{x} = 6.8$, $\bar{y} = 2.32$, $\sum x_i y_i = 88.7$, $\sum x_i^2 = 264$

• $b_1 = \dfrac{88.7 - 5(6.8)(2.32)}{264 - 5(6.8)^2} = 0.30$

• $b_0 = 2.32 - (0.30)(6.8) = 0.28$ (see the code sketch below)
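These two formulas are easy to check numerically. Below is a minimal Python/NumPy sketch, not part of the original slides (variable names are illustrative), that reproduces the example:

```python
import numpy as np

x = np.array([3, 5, 7, 9, 10], dtype=float)  # loop counts
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])      # execution times

n = len(x)
# b1 = (sum(x*y) - n*xbar*ybar) / (sum(x^2) - n*xbar^2)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
print(b0, b1)  # ~0.284 and ~0.299, i.e., 0.28 and 0.30 after rounding
```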

7

Graph of Parameter Estimation Example

[Scatter plot of the example data (loop count 0–12 on the x axis, time 0–3 on the y axis) with the fitted line $\hat{y} = 0.28 + 0.30x$]

8

Allocating Variation

• If no regression, best guess of $y$ is $\bar{y}$
• Observed values of $y$ differ from $\bar{y}$, giving rise to errors (variance)
• Regression gives better guess, but there are still errors
• We can evaluate quality of regression by allocating sources of errors

9

The Total Sum of Squares

• Without regression, squared error is

  $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} \left( y_i^2 - 2 y_i \bar{y} + \bar{y}^2 \right)$

  $= \sum_{i=1}^{n} y_i^2 - 2n\bar{y}^2 + n\bar{y}^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 = \mathrm{SSY} - \mathrm{SS0}$

10

The Sum of Squares from Regression

• Recall that regression error is $\mathrm{SSE} = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$
• Error without regression is SST
• So regression explains SSR = SST − SSE
• Regression quality measured by coefficient of determination:

  $R^2 = \dfrac{\mathrm{SSR}}{\mathrm{SST}} = \dfrac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}$

11

Evaluating Coefficient of Determination

• Compute $\mathrm{SST} = \sum y^2 - n\bar{y}^2$
• Compute $\mathrm{SSE} = \sum y^2 - b_0 \sum y - b_1 \sum xy$
• Compute $R^2 = \dfrac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}}$
• where $R = R(x, y)$ = correlation(x, y)

12

Example of Coefficient of Determination

• For previous regression example (x = 3, 5, 7, 9, 10; y = 1.2, 1.7, 2.5, 2.9, 3.3)
  – $\sum y = 11.60$, $\sum y^2 = 29.88$, $\sum xy = 88.7$, $n\bar{y}^2 = 5(2.32)^2 = 26.9$
  – SSE = 29.88 − (0.28)(11.60) − (0.30)(88.7) = 0.028
  – SST = 29.88 − 26.9 = 2.97
  – SSR = 2.97 − 0.03 = 2.94
  – R² = (2.97 − 0.03)/2.97 = 0.99 (verified in the sketch below)
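As a sketch (reusing the example data; not from the slides), the allocation of variation can be verified with NumPy:

```python
import numpy as np

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

sst = np.sum(y**2) - n * y.mean()**2                      # SST ~2.97
sse = np.sum(y**2) - b0 * np.sum(y) - b1 * np.sum(x * y)  # SSE ~0.03
print(sst, sse, (sst - sse) / sst)                        # R^2 ~0.99
```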

13

Standard Deviation of Errors

• Variance of errors is SSE divided by degrees of freedom
  – DOF is n − 2 because we've calculated 2 regression parameters from the data
  – So variance (mean squared error, MSE) is SSE/(n − 2)
• Standard dev. of errors is square root:

  $s_e = \sqrt{\dfrac{\mathrm{SSE}}{n - 2}}$  (minor error in book)

14

Checking Degrees of Freedom

• Degrees of freedom always equate:
  – SS0 has 1 (computed from $\bar{y}$)
  – SST has n − 1 (computed from data and $\bar{y}$, which uses up 1)
  – SSE has n − 2 (needs 2 regression parameters)
  – So SST = SSY − SS0 = SSR + SSE, with DOF n − 1 = n − 1 = 1 + (n − 2)

15

Example of Standard Deviation of Errors

• For regression example, SSE was 0.03, so MSE is 0.03/3 = 0.01 and $s_e$ = 0.10 (sketched in code below)
• Note high quality of our regression:
  – $R^2$ = 0.99
  – $s_e$ = 0.10
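A short sketch of the same computation (assuming the example data; the residual-based SSE matches the shortcut formula):

```python
import numpy as np

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)          # residuals
mse = np.sum(e**2) / (n - 2)   # SSE over n-2 degrees of freedom, ~0.01
se = np.sqrt(mse)              # ~0.10
print(mse, se)
```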

16

Confidence Intervals for Regressions

• Regression is done from a single population sample (size n)
  – Different sample might give different results
  – True model is $y = \beta_0 + \beta_1 x$
  – Parameters $b_0$ and $b_1$ are really means taken from a population sample

17

Calculating Intervals for Regression Parameters

• Standard deviations of parameters:

  $s_{b_0} = s_e \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum x_i^2 - n \bar{x}^2}} \qquad s_{b_1} = \dfrac{s_e}{\sqrt{\sum x_i^2 - n \bar{x}^2}}$

• Confidence intervals are $b_i \mp t \, s_{b_i}$, where t has n − 2 degrees of freedom
  – Not divided by sqrt(n)

18

Example of Regression Confidence Intervals

• Recall $s_e$ = 0.10, n = 5, $\sum x_i^2$ = 264, $\bar{x}$ = 6.8
• So

  $s_{b_0} = 0.10 \sqrt{\dfrac{1}{5} + \dfrac{(6.8)^2}{264 - 5(6.8)^2}} = 0.12$

  $s_{b_1} = \dfrac{0.10}{\sqrt{264 - 5(6.8)^2}} = 0.017$

• Using 90% confidence level, $t_{0.95;3}$ = 2.353

19

Regression Confidence Example, cont’d

• Thus, $b_0$ interval is 0.28 ∓ 2.353(0.12) = (−0.004, 0.57)
  – Not significant at 90%
• And $b_1$ is 0.30 ∓ 2.353(0.017) = (0.26, 0.34)
  – Significant at 90% (and would survive even a 99.9% test)
  – Both intervals are reproduced in the sketch below
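A sketch of the whole interval computation in Python (assuming SciPy for the t-quantile; not part of the original slides):

```python
import numpy as np
from scipy import stats

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se = np.sqrt(np.sum(e**2) / (n - 2))              # ~0.10

sxx = np.sum(x**2) - n * x.mean()**2
sb0 = se * np.sqrt(1.0 / n + x.mean()**2 / sxx)   # ~0.12
sb1 = se / np.sqrt(sxx)                           # ~0.017
t = stats.t.ppf(0.95, df=n - 2)                   # 2.353 for a 90% interval
print(b0 - t * sb0, b0 + t * sb0)  # spans zero: b0 not significant
print(b1 - t * sb1, b1 + t * sb1)  # excludes zero: b1 significant
```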

20

Confidence Intervals for Predictions

• Previous confidence intervals are for parameters
  – How certain can we be that the parameters are correct?
• Purpose of regression is prediction
  – How accurate are the predictions?
  – Regression gives mean of predicted response, based on sample we took

21

Predicting m Samples

• Standard deviation for mean of future sample of m observations at $x_p$ is

  $s_{\hat{y}_{mp}} = s_e \sqrt{\dfrac{1}{m} + \dfrac{1}{n} + \dfrac{(x_p - \bar{x})^2}{\sum x_i^2 - n \bar{x}^2}}$

• Note deviation drops as m → ∞
• Variance minimal at $x_p = \bar{x}$
• Use t-quantiles with n − 2 DOF for interval

22

Example of Confidence of Predictions

• Using previous equation, what is predicted time for a single run of 8 loops?
• Time = 0.28 + 0.30(8) = 2.68
• Standard deviation of errors $s_e$ = 0.10:

  $s_{\hat{y}_p} = 0.10 \sqrt{\dfrac{1}{1} + \dfrac{1}{5} + \dfrac{(8 - 6.8)^2}{264 - 5(6.8)^2}} = 0.11$

• 90% interval is then 2.68 ∓ 2.353(0.11) = (2.42, 2.93) (see the sketch below)
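The same prediction interval, sketched in Python (again assuming SciPy; illustrative only):

```python
import numpy as np
from scipy import stats

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)
se = np.sqrt(np.sum(e**2) / (n - 2))
sxx = np.sum(x**2) - n * x.mean()**2

xp, m = 8.0, 1                     # a single future run at 8 loops
yp = b0 + b1 * xp                  # predicted time, ~2.68
syp = se * np.sqrt(1.0 / m + 1.0 / n + (xp - x.mean())**2 / sxx)  # ~0.11
t = stats.t.ppf(0.95, df=n - 2)
print(yp - t * syp, yp + t * syp)  # ~(2.42, 2.93)
```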

23

Prediction Confidence

[Plot: confidence bands for predicted y versus x around the regression line, narrowest near $\bar{x}$]

24

Verifying Assumptions Visually

• Regressions are based on assumptions:
  – Linear relationship between response y and predictor x
    • Or nonlinear relationship used in fitting
  – Predictor x nonstochastic and error-free
  – Model errors statistically independent
    • With distribution N(0, c) for constant c
• If assumptions violated, model misleading or invalid

25

Testing Linearity

• Scatter plot x vs. y to see basic curve type
  [Four example scatter plots: Linear, Piecewise Linear, Outlier, Nonlinear (Power)]

26

Testing Independence of Errors

• Scatter-plot $e_i$ versus $\hat{y}_i$
• Should be no visible trend
• Example from our curve fit (code sketch below):
  [Residual plot: errors between −0.1 and 0.2 against predicted values 0–4, no visible trend]
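Such a residual plot can be generated with a few lines of Python (a sketch using matplotlib, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x
plt.scatter(yhat, y - yhat)     # residual e_i vs. predicted y_i
plt.axhline(0, linestyle="--")  # errors should straddle zero with no trend
plt.xlabel("predicted y")
plt.ylabel("residual")
plt.show()
```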

27

More on Testing Independence

• May be useful to plot error residuals versus experiment number
  – In previous example, this gives same plot except for x scaling
• No foolproof tests
  – "Independence" test really disproves particular dependence
  – Maybe next test will show different dependence

28

Testing for Normal Errors

• Prepare quantile-quantile plot of errors
• Example for our regression (code sketch below):
  [Q-Q plot: error quantiles (−0.2 to 0.2) against normal quantiles (−1.5 to 1.5)]
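A normal quantile-quantile check can be sketched with scipy.stats.probplot (illustrative, not from the slides); points falling near a straight line suggest approximately normal errors:

```python
import numpy as np
from scipy import stats

x = np.array([3, 5, 7, 9, 10], dtype=float)
y = np.array([1.2, 1.7, 2.5, 2.9, 3.3])
n = len(x)
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
b0 = y.mean() - b1 * x.mean()
e = y - (b0 + b1 * x)  # residuals

# osm: theoretical normal quantiles; osr: ordered residuals
(osm, osr), (slope, intercept, r) = stats.probplot(e, dist="norm")
for q, resid in zip(osm, osr):
    print(f"{q:6.2f}  {resid:6.2f}")
```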

29

Testing for Constant Standard Deviation

• Tongue-twister: homoscedasticity
• Return to independence plot
• Look for trend in spread
• Example:
  [Same residual plot as before; the spread shows no trend]

30

Linear Regression Can Be Misleading

• Regression throws away some information about the data
  – To allow more compact summarization
• Sometimes vital characteristics are thrown away
  – Often, looking at data plots can tell you whether you will have a problem

31

Example of Misleading Regression

     I            II           III          IV
  x     y      x     y      x     y      x     y
 10   8.04    10   9.14    10   7.46     8   6.58
  8   6.95     8   8.14     8   6.77     8   5.76
 13   7.58    13   8.74    13  12.74     8   7.71
  9   8.81     9   8.77     9   7.11     8   8.84
 11   8.33    11   9.26    11   7.81     8   8.47
 14   9.96    14   8.10    14   8.84     8   7.04
  6   7.24     6   6.13     6   6.08     8   5.25
  4   4.26     4   3.10     4   5.39    19  12.50
 12  10.84    12   9.13    12   8.15     8   5.56
  7   4.82     7   7.26     7   6.42     8   7.91
  5   5.68     5   4.74     5   5.73     8   6.89

32

What Does Regression Tell Us About These Data Sets?

• Exactly the same thing for each!
• n = 11
• Mean of y = 7.5
• $\hat{y}$ = 3 + 0.5x
• Standard error of the slope estimate is 0.118
• All the sums of squares are the same
• Correlation coefficient = 0.82
• $R^2$ = 0.67
• (Reproduced in the sketch below)
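These identical summaries are easy to reproduce (a sketch using the quartet data from the previous slide; all four fits print essentially the same numbers):

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
for xs, ys in quartet:
    x, y = np.asarray(xs, float), np.asarray(ys, float)
    n = len(x)
    b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)
    b0 = y.mean() - b1 * x.mean()
    r = np.corrcoef(x, y)[0, 1]
    # each line: b0~3.00, b1~0.50, mean(y)~7.50, r~0.82
    print(f"b0={b0:.2f} b1={b1:.2f} ybar={y.mean():.2f} r={r:.2f}")
```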

33

Now Look at the Data Plots

[Four scatter plots, panels I–IV, each on the same axes (x 0–20, y 0–12), all with the identical fitted line y = 3 + 0.5x]
