Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.
![Page 1: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/1.jpg)
Multiple Linear Regression
Andy Wang
CIS 5930-03
Computer Systems Performance Analysis
![Page 2: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/2.jpg)
2
Multiple Linear Regression
• Models with more than one predictor variable
• But each predictor variable has a linear relationship to the response variable
• Conceptually, plotting a regression line in n-dimensional space, instead of 2-dimensional
![Page 3: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/3.jpg)
3
Basic Multiple Linear Regression Formula
• Response y is a function of k predictor variables x1, x2, . . . , xk
$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k + e$$
![Page 4: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/4.jpg)
4
A Multiple Linear Regression Model
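Given sample of n observations
$$\{(x_{11}, x_{12}, \ldots, x_{1k}, y_1), \ldots, (x_{n1}, x_{n2}, \ldots, x_{nk}, y_n)\}$$
model consists of n equations (note + vs. - typo in book):
$$\begin{aligned}
y_1 &= b_0 + b_1 x_{11} + b_2 x_{12} + \cdots + b_k x_{1k} + e_1 \\
y_2 &= b_0 + b_1 x_{21} + b_2 x_{22} + \cdots + b_k x_{2k} + e_2 \\
&\;\;\vdots \\
y_n &= b_0 + b_1 x_{n1} + b_2 x_{n2} + \cdots + b_k x_{nk} + e_n
\end{aligned}$$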
![Page 5: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/5.jpg)
5
Looks Like It’s Matrix Arithmetic Time
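y = Xb + e
$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1k} \\
1 & x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{nk}
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} +
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}$$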
![Page 6: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/6.jpg)
6
Analysis of Multiple Linear Regression
• Listed in box 15.1 of Jain
• Not terribly important (for our purposes) how they were derived
– This isn’t a class on statistics
• But you need to know how to use them
• Mostly matrix analogs to simple linear regression results
![Page 7: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/7.jpg)
7
Example of Multiple Linear Regression
• IMDB keeps numerical popularity ratings of movies
• Postulate popularity of Academy Award winning films is based on two factors:
– Year made
– Running time
• Produce a regression
rating = b0 + b1(year) + b2(length)
![Page 8: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/8.jpg)
8
Some Sample Data
| Title | Year | Length | Rating |
|---|---|---|---|
| Silence of the Lambs | 1991 | 118 | 8.1 |
| Terms of Endearment | 1983 | 132 | 6.8 |
| Rocky | 1976 | 119 | 7.0 |
| Oliver! | 1968 | 153 | 7.4 |
| Marty | 1955 | 91 | 7.7 |
| Gentleman’s Agreement | 1947 | 118 | 7.5 |
| Mutiny on the Bounty | 1935 | 132 | 7.6 |
| It Happened One Night | 1934 | 105 | 8.0 |
![Page 9: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/9.jpg)
9
Now for Some Tedious Matrix Arithmetic
• We need to calculate $X$, $X^T$, $X^T X$, $(X^T X)^{-1}$, and $X^T y$
• Because
$$b = (X^T X)^{-1} X^T y$$
• We will see that
b = (18.5430, -0.0051, -0.0086)^T
• Meaning the regression predicts:
rating = 18.5430 - 0.0051*year - 0.0086*length
![Page 10: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/10.jpg)
10
X Matrix for Example
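$$X = \begin{bmatrix}
1 & 1991 & 118 \\
1 & 1983 & 132 \\
1 & 1976 & 119 \\
1 & 1968 & 153 \\
1 & 1955 & 91 \\
1 & 1947 & 118 \\
1 & 1935 & 132 \\
1 & 1934 & 105
\end{bmatrix}$$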
![Page 11: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/11.jpg)
11
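Transpose to Get $X^T$
$$X^T = \begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1991 & 1983 & 1976 & 1968 & 1955 & 1947 & 1935 & 1934 \\
118 & 132 & 119 & 153 & 91 & 118 & 132 & 105
\end{bmatrix}$$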
![Page 12: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/12.jpg)
12
Multiply to Get $X^T X$
$$X^T X = \begin{bmatrix}
8 & 15689 & 968 \\
15689 & 30771385 & 1899083 \\
968 & 1899083 & 119572
\end{bmatrix}$$
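The product above can be checked mechanically. The following is a minimal sketch in plain Python, assuming the eight films from the sample data; the names `FILMS` and `matmul` are illustrative and not part of the slides.

```python
# Sample data from the slides: (year, length in minutes, rating).
FILMS = [
    (1991, 118, 8.1),  # Silence of the Lambs
    (1983, 132, 6.8),  # Terms of Endearment
    (1976, 119, 7.0),  # Rocky
    (1968, 153, 7.4),  # Oliver!
    (1955, 91, 7.7),   # Marty
    (1947, 118, 7.5),  # Gentleman's Agreement
    (1935, 132, 7.6),  # Mutiny on the Bounty
    (1934, 105, 8.0),  # It Happened One Night
]

# Design matrix: a leading 1 for the intercept b0, then year and length.
X = [[1, year, length] for year, length, _ in FILMS]

# Transpose: each column of X becomes a row of XT.
XT = [list(col) for col in zip(*X)]


def matmul(a, b):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


XTX = matmul(XT, X)
for row in XTX:
    print(row)
```

Because every entry of X is an integer, the product is exact and can be compared entry by entry with the matrix on the slide.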
![Page 13: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/13.jpg)
13
Invert to Get $C = (X^T X)^{-1}$
$$C = (X^T X)^{-1} = \begin{bmatrix}
1207.7585 & -0.6240 & 0.1328 \\
-0.6240 & 0.0003 & -0.0001 \\
0.1328 & -0.0001 & 0.0004
\end{bmatrix}$$
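One way to reproduce the inverse without a matrix library is to solve $X^T X \, c_j = e_j$ for each unit vector $e_j$. The sketch below assumes the $X^T X$ entries from the previous slide; `solve` and `inverse` are hypothetical helper names, and Gauss-Jordan elimination is simply one convenient method, not necessarily the one used to prepare the slides.

```python
# X^T X for the eight-film example, as given on the previous slide.
XTX = [
    [8, 15689, 968],
    [15689, 30771385, 1899083],
    [968, 1899083, 119572],
]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augmented matrix [a | rhs], converted to floats.
    m = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def inverse(a):
    """Invert a by solving a x = e_j for each unit vector e_j."""
    n = len(a)
    columns = [solve(a, [float(i == j) for i in range(n)]) for j in range(n)]
    # The solutions are the columns of the inverse; return them as rows.
    return [[columns[j][i] for j in range(n)] for i in range(n)]


C = inverse(XTX)
for row in C:
    print(["%.4f" % value for value in row])
```

The diagonal entries of C are the ones reused later for the standard deviations of the regression parameters.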
![Page 14: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/14.jpg)
14
Multiply to Get $X^T y$
$$X^T y = \begin{bmatrix} 60.1 \\ 117840.7 \\ 7247.5 \end{bmatrix}$$
![Page 15: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/15.jpg)
15
Multiply $(X^T X)^{-1}(X^T y)$ to Get b
$$b = \begin{bmatrix} 18.5430 \\ -0.0051 \\ -0.0086 \end{bmatrix}$$
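The whole chain from data to b can be sketched in a few lines of plain Python. This version assumes the sample data from the slides and solves the normal equations $(X^T X)\,b = X^T y$ directly instead of forming the inverse first; the two are algebraically equivalent. The helper name `solve` is illustrative.

```python
# Sample data from the slides: (year, length in minutes, rating).
FILMS = [
    (1991, 118, 8.1), (1983, 132, 6.8), (1976, 119, 7.0), (1968, 153, 7.4),
    (1955, 91, 7.7), (1947, 118, 7.5), (1935, 132, 7.6), (1934, 105, 8.0),
]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Design matrix with an intercept column, and the response vector.
X = [[1, year, length] for year, length, _ in FILMS]
y = [rating for _, _, rating in FILMS]

# Normal equations: (X^T X) b = X^T y.
XTX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
XTy = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
b = solve(XTX, XTy)

print("X^T y =", [round(v, 1) for v in XTy])
print("b     =", [round(v, 4) for v in b])
```

Solving the system rather than inverting is a common numerical preference; for a 3 x 3 problem like this one either route should agree with the slide to the digits shown.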
![Page 16: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/16.jpg)
16
How Good Is This Regression Model?
• How accurately does the model predict the rating of a film based on its age and running time?
• Best way to determine this analytically is to calculate the errors
$$SSE = \sum e_i^2$$
or
$$SSE = y^T y - b^T X^T y$$
![Page 17: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/17.jpg)
17
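Calculating the Errors

| Rating | Year | Length | Estimated Rating | e_i | e_i^2 |
|---|---|---|---|---|---|
| 8.1 | 1991 | 118 | 7.4 | -0.71 | 0.51 |
| 6.8 | 1983 | 132 | 7.3 | 0.51 | 0.26 |
| 7.0 | 1976 | 119 | 7.5 | 0.45 | 0.21 |
| 7.4 | 1968 | 153 | 7.2 | -0.20 | 0.04 |
| 7.7 | 1955 | 91 | 7.8 | 0.10 | 0.01 |
| 7.5 | 1947 | 118 | 7.6 | 0.11 | 0.01 |
| 7.6 | 1935 | 132 | 7.6 | -0.05 | 0.00 |
| 8.0 | 1934 | 105 | 7.8 | -0.21 | 0.04 |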
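The error table can be regenerated from the data. The sketch below assumes the sample data from the slides, refits b from the normal equations so that the errors are not distorted by the four-digit rounding of b, and follows the table's sign convention (estimated minus actual). The names `solve`, `predicted`, and `errors` are illustrative.

```python
# Sample data from the slides: (year, length in minutes, rating).
FILMS = [
    (1991, 118, 8.1), (1983, 132, 6.8), (1976, 119, 7.0), (1968, 153, 7.4),
    (1955, 91, 7.7), (1947, 118, 7.5), (1935, 132, 7.6), (1934, 105, 8.0),
]


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


X = [[1, year, length] for year, length, _ in FILMS]
y = [rating for _, _, rating in FILMS]
XTX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
XTy = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(3)]
b = solve(XTX, XTy)

# Estimated rating for each film, then error = estimated - actual,
# which is the sign convention used in the table.
predicted = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
errors = [p - actual for p, actual in zip(predicted, y)]
sse = sum(e * e for e in errors)

for actual, p, e in zip(y, predicted, errors):
    print("%.1f  %.1f  %+.2f  %.2f" % (actual, p, e, e * e))
print("SSE = %.2f" % sse)
```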
![Page 18: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/18.jpg)
18
Calculating the Errors, Continued
• So SSE = 1.08
• SSY = $\sum y_i^2 = 452.9$
• SS0 = $n\bar{y}^2 = 451.5$
• SST = SSY - SS0 = 452.9 - 451.5 = 1.4
• SSR = SST - SSE = .33
• $R^2 = \frac{SSR}{SST} = \frac{0.33}{1.41} = 0.23$
• In other words, this regression stinks
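The sums of squares need only the eight ratings plus the SSE. A minimal sketch, assuming the ratings from the sample data and taking SSE = 1.08 from the slide rather than recomputing it; the variable names are illustrative.

```python
# Ratings from the sample data, in the order listed on the slides.
RATINGS = [8.1, 6.8, 7.0, 7.4, 7.7, 7.5, 7.6, 8.0]
SSE = 1.08  # sum of squared errors, taken from the slide

n = len(RATINGS)
mean = sum(RATINGS) / n

ssy = sum(r * r for r in RATINGS)  # SSY: sum of squared responses
ss0 = n * mean ** 2                # SS0: n times the squared mean
sst = ssy - ss0                    # total variation around the mean
ssr = sst - SSE                    # variation explained by the regression
r_squared = ssr / sst              # coefficient of determination

print("SSY = %.1f, SS0 = %.1f, SST = %.2f" % (ssy, ss0, sst))
print("SSR = %.2f, R^2 = %.2f" % (ssr, r_squared))
```

An R^2 near 0.23 means the two predictors account for less than a quarter of the variation in rating, which is the basis for the slide's verdict.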
![Page 19: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/19.jpg)
19
Why Does It Stink?
• Let’s look at properties of the regression parameters
$$s_e = \sqrt{\frac{SSE}{n-3}} = \sqrt{\frac{1.08}{5}} = 0.46$$
• Now calculate standard deviations of the regression parameters
![Page 20: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/20.jpg)
20
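Calculating STDEV of Regression Parameters
• Estimations only, since we’re working with a sample
• Estimated stdev of
$$b_0: \; s_e\sqrt{c_{00}} = 0.46\sqrt{1207.76} = 16.16$$
$$b_1: \; s_e\sqrt{c_{11}} = 0.46\sqrt{0.0003} = 0.0084$$
$$b_2: \; s_e\sqrt{c_{22}} = 0.46\sqrt{0.0004} = 0.0097$$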
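These standard deviations are sensitive to rounding: multiplying the rounded figures 0.46 and 0.0003 gives about 0.0080 rather than 0.0084, so the slide values appear to come from unrounded intermediates. The sketch below assumes the $X^T X$ matrix and SSE = 1.08 from the earlier slides and recomputes the diagonal of C at full precision; `solve` is a hypothetical helper.

```python
from math import sqrt

# X^T X and SSE for the eight-film example, as given on the earlier slides.
XTX = [
    [8, 15689, 968],
    [15689, 30771385, 1899083],
    [968, 1899083, 119572],
]
SSE = 1.08
N, K = 8, 2  # observations and predictor variables


def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [list(map(float, row)) + [float(r)] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# c_jj is entry j of the solution of (X^T X) x = e_j.
c_diag = [solve(XTX, [float(i == j) for i in range(3)])[j] for j in range(3)]

# Standard deviation of errors, with n - k - 1 = n - 3 degrees of freedom.
s_e = sqrt(SSE / (N - K - 1))

# Estimated standard deviation of each regression parameter.
s_b = [s_e * sqrt(c) for c in c_diag]

print("s_e = %.2f" % s_e)
print("s_b = %.2f, %.4f, %.4f" % tuple(s_b))
```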
![Page 21: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/21.jpg)
21
Calculating Confidence Intervals of STDEVs
• We will use 90% level
• Confidence intervals for
$$b_0: \; 18.54 \mp 2.015(16.16) = (-14.02, 51.10)$$
$$b_1: \; -0.005 \mp 2.015(0.0084) = (-0.022, 0.012)$$
$$b_2: \; -0.009 \mp 2.015(0.0097) = (-0.028, 0.011)$$
• None is significant, at this level
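The significance check is just a test of whether each interval contains zero. A small sketch, assuming the rounded estimates, standard deviations, and t-value 2.015 exactly as they appear on the slides; `confidence_interval` is an illustrative name.

```python
T_VALUE = 2.015  # t-value used on the slide (90% level, n - 3 = 5 degrees of freedom)

# (name, estimate, estimated standard deviation), rounded as on the slides.
PARAMS = [
    ("b0", 18.54, 16.16),
    ("b1", -0.005, 0.0084),
    ("b2", -0.009, 0.0097),
]


def confidence_interval(estimate, stdev, t=T_VALUE):
    """Return (low, high) = estimate -/+ t * stdev."""
    half_width = t * stdev
    return estimate - half_width, estimate + half_width


intervals = {name: confidence_interval(b, s) for name, b, s in PARAMS}

# A parameter is significant at this level only if its interval excludes zero.
significant = {name: not (low <= 0.0 <= high)
               for name, (low, high) in intervals.items()}

for name, (low, high) in intervals.items():
    print("%s: (%.3f, %.3f) significant=%s" % (name, low, high, significant[name]))
```

Because the inputs are already rounded, the last digit of an endpoint can differ slightly from the slide, but the conclusion is unaffected: all three intervals straddle zero.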
![Page 22: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/22.jpg)
22
Analysis of Variance
• So, can we really say that none of the predictor variables are significant?
– Not yet; predictors may be correlated
• F-tests can be used for this purpose
– E.g., to determine if the SSR is significantly higher than the SSE
– Equivalent to testing that y does not depend on any of the predictor variables
• Alternatively, that no bi is significantly nonzero
![Page 23: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/23.jpg)
23
Running an F-Test
• Need to calculate SSR and SSE
• From those, calculate mean squares of regression (MSR) and errors (MSE)
• MSR/MSE has an F distribution
• If MSR/MSE > F-table, predictors explain a significant fraction of response variation
• Note typos in book’s table 15.3
– SSR has k degrees of freedom
– SST matches $y - \bar{y}$, not $y - \hat{y}$
![Page 24: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/24.jpg)
24
F-Test for Our Example
• SSR = .33
• SSE = 1.08
• MSR = SSR/k = .33/2 = .16
• MSE = SSE/(n-k-1) = 1.08/(8 - 2 - 1) = .22
• F-computed = MSR/MSE = .76
• F[90; 2,5] = 3.78 (at 90%)
• So it fails the F-test at 90% (miserably)
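The F-test arithmetic is short enough to script. A minimal sketch, assuming SSR, SSE, and the table value 3.78 exactly as given on the slide; the variable names are illustrative.

```python
SSR = 0.33      # regression sum of squares, from the slide
SSE = 1.08      # error sum of squares, from the slide
N, K = 8, 2     # observations and predictor variables
F_TABLE = 3.78  # F[90; 2, 5] as quoted on the slide

msr = SSR / K            # mean square of regression, k degrees of freedom
mse = SSE / (N - K - 1)  # mean square of errors, n - k - 1 degrees of freedom
f_computed = msr / mse

# The regression explains a significant fraction of the variation
# only if the computed ratio exceeds the table value.
passes = f_computed > F_TABLE

print("MSR = %.3f, MSE = %.3f, F = %.2f, passes = %s"
      % (msr, mse, f_computed, passes))
```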
![Page 25: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/25.jpg)
25
Multicollinearity
• If two predictor variables are linearly dependent, they are collinear
– Meaning they are related
– And thus second variable does not improve the regression
– In fact, it can make it worse

| | X1 = 1 | X1 = 0 |
|---|---|---|
| X2 = 1 | data | No data |
| X2 = 0 | No data | data |
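The extreme case in the table, where data exist only in the cells with X1 = X2, makes $X^T X$ singular, so $(X^T X)^{-1}$ does not exist and b cannot be determined uniquely. The sketch below uses four made-up observations (hypothetical, not from the slides) to show the determinant collapsing to zero; `det3` is an illustrative helper.

```python
# Hypothetical observations in which x2 always equals x1, as in the table:
# data exist only where both predictors are 1 or both are 0.
X1 = [1, 1, 0, 0]
X2 = [1, 1, 0, 0]

# Design matrix with an intercept column.
X = [[1, a, b] for a, b in zip(X1, X2)]

# X^T X computed with exact integer arithmetic.
XTX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]


def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


print(XTX)
print("det =", det3(XTX))
```

With real data the dependence is rarely exact, so the determinant is small rather than zero and the inverse exists, but its entries (and therefore the parameter standard deviations) become very large.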
![Page 26: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/26.jpg)
26
Multicollinearity
• Typical symptom is inconsistent results from various significance tests
![Page 27: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/27.jpg)
27
Finding Multicollinearity
• Must test correlation between predictor variables
• If it’s high, eliminate one and repeat regression without it
• If significance of regression improves, it’s probably due to collinearity between the variables
![Page 28: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/28.jpg)
28
Is Multicollinearity a Problem in Our Example?
• Probably not, since significance tests are consistent
• But let’s check, anyway
• Calculate correlation of age and length
• After tedious calculation, 0.25
– Not especially correlated
• Important point - adding a predictor variable does not always improve a regression
– See example on p. 253 of book
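The "tedious calculation" is the sample correlation coefficient between the two predictors. The sketch below assumes the year and length columns of the sample data and works with year directly; measuring age as a reference year minus the year made would flip the sign of the coefficient but not its magnitude. The function name `pearson` is illustrative.

```python
from math import sqrt

# Predictor columns from the sample data, in the order listed on the slides.
YEARS = [1991, 1983, 1976, 1968, 1955, 1947, 1935, 1934]
LENGTHS = [118, 132, 119, 153, 91, 118, 132, 105]


def pearson(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson(YEARS, LENGTHS)
print("correlation of year and length = %.2f" % r)
```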
![Page 29: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/29.jpg)
29
Why Didn’t Regression Work Well Here?
• Check scatter plots
– Rating vs. age
– Rating vs. length
• Regardless of how good or bad regressions look, always check the scatter plots
![Page 30: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/30.jpg)
30
Rating vs. Length
[Scatter plot: Rating (y-axis, 6.0 to 9.0) vs. Length (x-axis, 80 to 160)]
![Page 31: Multiple Linear Regression Andy Wang CIS 5930-03 Computer Systems Performance Analysis.](https://reader035.fdocuments.in/reader035/viewer/2022062720/56649f145503460f94c28a61/html5/thumbnails/31.jpg)
31
Rating vs. Age
[Scatter plot: Rating (y-axis, 6.0 to 9.0) vs. Age (x-axis, 0 to 80)]