SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)
Review: Linear Regression (from Cycle 3)

[Figure: scatter plot of Time (hrs) vs. Est Proxy Size (LOC), with a fitted regression line]
y_k = \beta_0 + \beta_1 x_k

where:
x_k = estimated LOC (in this example)
y_k = estimated time (in this example)
\beta_0 = offset
\beta_1 = slope

\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\,x_{avg}\,y_{avg}}{\sum_{i=1}^{n} x_i^2 - n\,x_{avg}^2}
\qquad
\beta_0 = y_{avg} - \beta_1 x_{avg}
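The review formulas above can be sketched in code; this is a minimal illustration (the class and method names are my own, not from the course library):

```java
// Least-squares fit of y = beta0 + beta1*x, using the Cycle-3 formulas above.
class LinearRegression {
    // Returns {beta0, beta1}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xAvg = 0, yAvg = 0;
        for (int i = 0; i < n; i++) { xAvg += x[i]; yAvg += y[i]; }
        xAvg /= n;
        yAvg /= n;
        double sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) { sumXY += x[i] * y[i]; sumXX += x[i] * x[i]; }
        double beta1 = (sumXY - n * xAvg * yAvg) / (sumXX - n * xAvg * xAvg);
        double beta0 = yAvg - beta1 * xAvg;
        return new double[] { beta0, beta1 };
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};        // exactly y = 1 + 2x
        double[] b = fit(x, y);
        System.out.println("beta0=" + b[0] + " beta1=" + b[1]);  // prints beta0=1.0 beta1=2.0
    }
}
```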
The regression algorithm assumed a single independent variable.
Estimated proxy size (E)
Added+modified size (A+M)
Development effort (time)
y_{size} = \beta_{s0} + \beta_{s1} x_k
y_{time} = \beta_{t0} + \beta_{t1} x_k
Can we still apply regression if our estimates involve more than one independent variable?
Web pages (JSP)
Database tables
Java classes
If development of each component type is completely independent, we can make separate estimates and add them up.
But what if they are so interdependent that we can't do that?
One possible solution is to use multiple regression:

y_k = \beta_0 + \beta_1 x_{k,1} + \beta_2 x_{k,2} + \cdots + \beta_m x_{k,m}

where:
m = number of independent variables (j = 1..m)
\beta_0 = offset
\beta_j = "slope" relative to each independent variable
x_{k,j} = current independent-value estimates (e.g., proxy size)
y_k = projected value (e.g., time)
Where do the β values come from?
The β values are calculated by solving a system of linear equations:
n\beta_0 + \beta_1\sum_{i=1}^{n} x_{i,1} + \beta_2\sum_{i=1}^{n} x_{i,2} + \cdots + \beta_m\sum_{i=1}^{n} x_{i,m} = \sum_{i=1}^{n} y_i

\beta_0\sum_{i=1}^{n} x_{i,1} + \beta_1\sum_{i=1}^{n} x_{i,1}^2 + \beta_2\sum_{i=1}^{n} x_{i,1}x_{i,2} + \cdots + \beta_m\sum_{i=1}^{n} x_{i,1}x_{i,m} = \sum_{i=1}^{n} x_{i,1}\,y_i

\beta_0\sum_{i=1}^{n} x_{i,2} + \beta_1\sum_{i=1}^{n} x_{i,2}x_{i,1} + \beta_2\sum_{i=1}^{n} x_{i,2}^2 + \cdots + \beta_m\sum_{i=1}^{n} x_{i,2}x_{i,m} = \sum_{i=1}^{n} x_{i,2}\,y_i

\vdots

\beta_0\sum_{i=1}^{n} x_{i,m} + \beta_1\sum_{i=1}^{n} x_{i,m}x_{i,1} + \beta_2\sum_{i=1}^{n} x_{i,m}x_{i,2} + \cdots + \beta_m\sum_{i=1}^{n} x_{i,m}^2 = \sum_{i=1}^{n} x_{i,m}\,y_i
where:
n = number of historical data points (i = 1..n)
x_{i,j} = historical independent variable values
y_i = historical dependent variable values
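The summation terms above can be computed directly from the historical data. Here is an illustrative sketch (class and method names are my own; `xs[i][j-1]` holds x_{i,j} for j = 1..m, and the x_{i,0} terms are treated as 1.0):

```java
// Build the coefficient matrix A and vector b of the normal equations.
class NormalEquations {
    // x_{i,p}, with the fictitious x_{i,0} = 1 column.
    static double x(double[][] xs, int i, int p) {
        return p == 0 ? 1.0 : xs[i][p - 1];
    }

    // a[p][q] = sum over i of x_{i,p} * x_{i,q}
    static double[][] buildA(double[][] xs, int m) {
        double[][] a = new double[m + 1][m + 1];
        for (int p = 0; p <= m; p++)
            for (int q = 0; q <= m; q++)
                for (int i = 0; i < xs.length; i++)
                    a[p][q] += x(xs, i, p) * x(xs, i, q);
        return a;
    }

    // b[p] = sum over i of x_{i,p} * y_i
    static double[] buildB(double[][] xs, double[] y, int m) {
        double[] b = new double[m + 1];
        for (int p = 0; p <= m; p++)
            for (int i = 0; i < xs.length; i++)
                b[p] += x(xs, i, p) * y[i];
        return b;
    }

    public static void main(String[] args) {
        // Data from test case 4-3 (x1, x2, y), as used later in these slides.
        double[][] xs = {{23, 279}, {28, 421}, {11, 256}, {2, 42},
                         {6, 164}, {16, 265}, {4, 144}};
        double[] y = {367, 584, 387, 131, 351, 218, 400};
        // A rows: [7, 90, 1571], [90, 1746, 26905], [1571, 26905, 440239]
        // b: [2438, 36506, 625765]
        System.out.println(java.util.Arrays.deepToString(buildA(xs, 2)));
        System.out.println(java.util.Arrays.toString(buildB(xs, y, 2)));
    }
}
```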
The same equation in matrix form
\begin{bmatrix}
n & \sum x_{i,1} & \sum x_{i,2} & \cdots & \sum x_{i,m} \\
\sum x_{i,1} & \sum x_{i,1}^2 & \sum x_{i,1}x_{i,2} & \cdots & \sum x_{i,1}x_{i,m} \\
\sum x_{i,2} & \sum x_{i,2}x_{i,1} & \sum x_{i,2}^2 & \cdots & \sum x_{i,2}x_{i,m} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\sum x_{i,m} & \sum x_{i,m}x_{i,1} & \sum x_{i,m}x_{i,2} & \cdots & \sum x_{i,m}^2
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_m \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_{i,1} y_i \\ \sum x_{i,2} y_i \\ \vdots \\ \sum x_{i,m} y_i \end{bmatrix}

(all sums over i = 1..n), i.e. A\boldsymbol{\beta} = \mathbf{b}.
Use these slides as a reference when you implement Cycle 3.
When rank(A)=2 (that is, 1 independent variable), the familiar regression equations result when the equations are refactored:
With m = 1, the matrix equation reduces to

\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}

which, solved algebraically, gives the familiar

\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\,x_{avg}\,y_{avg}}{\sum_{i=1}^{n} x_i^2 - n\,x_{avg}^2}
\qquad
\beta_0 = y_{avg} - \beta_1 x_{avg}
To solve numerically for the β values, we need to calculate the values in the A matrix and the b vector.
A\boldsymbol{\beta} = \mathbf{b}:

\begin{bmatrix}
a_{0,0} & a_{0,1} & \cdots & a_{0,m} \\
a_{1,0} & a_{1,1} & \cdots & a_{1,m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,0} & a_{m,1} & \cdots & a_{m,m}
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{bmatrix}
=
\begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_m \end{bmatrix}

where a_{0,0} = n, a_{0,1} = \sum_{i=1}^{n} x_{i,1}, etc.

If you look closely, you will see a pattern to the equation coefficients:

a_{p,q} = \sum_{i=1}^{n} x_{i,p}\,x_{i,q} \qquad b_p = \sum_{i=1}^{n} x_{i,p}\,y_i

What about the x_{i,0} terms? They are "fictitious" values that are treated as being equal to one (1.0)!
In general (especially for rank(A) > 2), we have to solve the system of equations to get the β_i values.

Mathematically, we can do this by inverting the coefficient matrix A: A\boldsymbol{\beta} = \mathbf{b} \Rightarrow \boldsymbol{\beta} = A^{-1}\mathbf{b}.
However, it's more common to solve the equations using a technique like Gauss-Jordan elimination with back substitution (remember that?).
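As an illustration of the elimination approach, here is a minimal sketch of a solver for A·β = b (Gaussian elimination with partial pivoting and back substitution; my own code, not the course's Matrix class, and with no special handling of singular matrices):

```java
// Solve A*beta = b by elimination with back substitution.
class GaussSolve {
    static double[] solve(double[][] aIn, double[] bIn) {
        int n = bIn.length;
        double[][] a = new double[n][];
        double[] b = bIn.clone();
        for (int i = 0; i < n; i++) a[i] = aIn[i].clone();

        for (int col = 0; col < n; col++) {
            // Partial pivot: bring the row with the largest |a[row][col]| up.
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] tmpRow = a[col]; a[col] = a[pivot]; a[pivot] = tmpRow;
            double tmpB = b[col]; b[col] = b[pivot]; b[pivot] = tmpB;

            // Eliminate the column below the pivot.
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }

        // Back substitution.
        double[] beta = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * beta[c];
            beta[r] = s / a[r][r];
        }
        return beta;
    }

    public static void main(String[] args) {
        // 2x + y = 5, x + 3y = 10 has solution x = 1, y = 3.
        double[] s = solve(new double[][]{{2, 1}, {1, 3}}, new double[]{5, 10});
        System.out.println(java.util.Arrays.toString(s));  // prints [1.0, 3.0]
    }
}
```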
OK, let's try an example, using test case 4-3 from Cycle 3.
 x1   x2    y   "x0"
 23  279  367    1
 28  421  584    1
 11  256  387    1
  2   42  131    1
  6  164  351    1
 16  265  218    1
  4  144  400    1

We pretend there is an x0 column, with all "1" values.

a_{p,q} = \sum_{i=1}^{n} x_{i,p}\,x_{i,q}

What is the value of a_{0,0}?
Since A0,0 is always "n", the number of points, let's try another matrix element.
A row 0 so far: [7, ?, ?].

What is the value of a_{0,1}?
Since a0,1 is the sum of the x1 values, a0,2 must be the sum of the x2 values.
a_{0,1} = \sum x_{i,1} = 90 \qquad a_{0,2} = \sum x_{i,2} = 1571

A row 0 is now [7, 90, 1571].
Now, what about the rest of the first column in the A matrix?
What is the value of a_{1,0}?
Yes, the values in column 0 are the same as those in row 0, making the matrix symmetric (at least so far).
Column 0 of A is now [7, 90, 1571], matching row 0.
Next, let's try an element on the diagonal of matrix A.
What is the value of a_{1,1}?

a_{1,1} = \sum x_{i,1}^2 = 1746 \quad (sum of squares)

Since the same independent variable appears twice in the diagonal terms, they are computed by summing the squares.
Likewise, a_{2,2} = \sum x_{i,2}^2 = 440239.

A so far:
Row 0:  7      90     1571
Row 1:  90     1746   ?
Row 2:  1571   ?      440239
Let's try one of the remaining terms.
What is the value of a_{1,2}?

a_{1,2} = \sum x_{i,1}\,x_{i,2} = 26905 \quad (sum of products)
Since the sum of products for x1 and x2 is the same as for x2 and x1, the last remaining value is the same: a_{2,1} = 26905.
A is now complete:
Row 0:  7      90     1571
Row 1:  90     1746   26905
Row 2:  1571   26905  440239
OK, now how about the b vector?
b_p = \sum_{i=1}^{n} x_{i,p}\,y_i

What is the value of b_0 (b_{0,0})?

b_0 = \sum x_{i,0}\,y_i = \sum y_i = 2438
b0 is always the sum of the y values; let's try the next one.
What is the value of b_1 (b_{1,0})?

b_1 = \sum x_{i,1}\,y_i = 36506 \quad (sum of products)
OK, we need just one more value.
What is the value of b_2 (b_{2,0})?

b_2 = \sum x_{i,2}\,y_i = 625765 \quad (sum of products)
Finally, we have our A and b matrices, and can solve for the β values.
A:
Row 0:  7      90     1571
Row 1:  90     1746   26905
Row 2:  1571   26905  440239

b:
Row 0:  2438
Row 1:  36506
Row 2:  625765

β:
Row 0:  98.472
Row 1:  -11.261
Row 2:  1.7582

Matrix betas = a.solve(b);
To extend the correlation calculation to handle multiple independent variables, the only change is in calculating the predicted y values.
One independent variable ("linear regression"):

y_{pred_i} = \beta_0 + \beta_1 x_i

One or more independent variables ("multiple regression"):

y_{pred_i} = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_m x_{i,m}

Obviously, both are forms of "linear" regression, despite the names.
We use the usual approach to calculate the correlation coefficient.
r^2 = \frac{\sum_{i=1}^{n} (y_{pred_i} - y_{avg})^2}{\sum_{i=1}^{n} (y_i - y_{avg})^2}
\qquad
r = \sqrt{r^2}
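A sketch of the r² calculation above (illustrative names; `xs[i]` holds the independent-variable values for point i, and the only change from the single-variable case is in `predict`):

```java
// Correlation coefficient for multiple regression.
class Correlation {
    // y_pred = beta[0] + beta[1]*x_1 + ... + beta[m]*x_m
    static double predict(double[] beta, double[] xRow) {
        double y = beta[0];
        for (int j = 1; j < beta.length; j++) y += beta[j] * xRow[j - 1];
        return y;
    }

    // r^2 = sum((y_pred - y_avg)^2) / sum((y - y_avg)^2)
    static double rSquared(double[][] xs, double[] y, double[] beta) {
        double yAvg = 0;
        for (double v : y) yAvg += v;
        yAvg /= y.length;
        double explained = 0, total = 0;
        for (int i = 0; i < y.length; i++) {
            double pred = predict(beta, xs[i]);
            explained += (pred - yAvg) * (pred - yAvg);
            total += (y[i] - yAvg) * (y[i] - yAvg);
        }
        return explained / total;
    }

    public static void main(String[] args) {
        // A perfect fit (y = 1 + 2x) should give r^2 = 1.
        double[][] xs = {{0}, {1}, {2}};
        double[] y = {1, 3, 5};
        double[] beta = {1, 2};
        System.out.println(rSquared(xs, y, beta));  // prints 1.0
    }
}
```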
\sum_{i=1}^{n} (y_i - y_{avg})^2 = \sum_{i=1}^{n} (y_{pred_i} - y_{avg})^2 + \sum_{i=1}^{n} (y_i - y_{pred_i})^2

Just in case you are curious, the statisticians label the sum-square values like this: the left side is the total sum of squares (variability); the first term on the right is the sum of squares, predicted (explained); the second is the sum of squares, error (unexplained).
Another Numerical Example

y  = {16, 30, 44, 58, 96}
x1 = {0, 2, 4, 6, 8}
x2 = {1, 3, 5, 7, 11}
x3 = {1, 2, 3, 4, 10}

Determine β values such that

y_k = \beta_0 + \beta_1 x_{k,1} + \beta_2 x_{k,2} + \beta_3 x_{k,3}
The values in the data vectors are used in the summation terms of the matrices:

\begin{bmatrix}
5 & \sum x_{i,1} & \sum x_{i,2} & \sum x_{i,3} \\
\sum x_{i,1} & \sum x_{i,1}^2 & \sum x_{i,1}x_{i,2} & \sum x_{i,1}x_{i,3} \\
\sum x_{i,2} & \sum x_{i,2}x_{i,1} & \sum x_{i,2}^2 & \sum x_{i,2}x_{i,3} \\
\sum x_{i,3} & \sum x_{i,3}x_{i,1} & \sum x_{i,3}x_{i,2} & \sum x_{i,3}^2
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
=
\begin{bmatrix} \sum y_i \\ \sum x_{i,1} y_i \\ \sum x_{i,2} y_i \\ \sum x_{i,3} y_i \end{bmatrix}

(all sums over i = 1..5). For example:

\sum_{i=1}^{5} x_{i,1} = 0 + 2 + 4 + 6 + 8 = 20
Evaluating terms in the matrices:

\begin{bmatrix}
5 & 20 & 27 & 20 \\
20 & 120 & 156 & 120 \\
27 & 156 & 205 & 160 \\
20 & 120 & 160 & 130
\end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \end{bmatrix}
=
\begin{bmatrix} 244 \\ 1352 \\ 1788 \\ 1400 \end{bmatrix}
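These entries can be checked mechanically: every coefficient is the sum of element-wise products of two data columns, with the fictitious x0 column of ones included. A small illustrative sketch:

```java
// Recompute the A | b entries for the second numerical example.
class TermCheck {
    static double[] y  = {16, 30, 44, 58, 96};
    static double[] x0 = {1, 1, 1, 1, 1};    // fictitious column of ones
    static double[] x1 = {0, 2, 4, 6, 8};
    static double[] x2 = {1, 3, 5, 7, 11};
    static double[] x3 = {1, 2, 3, 4, 10};

    // Sum of element-wise products: one coefficient a_{p,q} or b_p.
    static double dot(double[] u, double[] v) {
        double s = 0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }

    public static void main(String[] args) {
        double[][] cols = {x0, x1, x2, x3};
        // Prints the augmented rows, matching the matrices above:
        // "5 20 27 20 | 244", "20 120 156 120 | 1352",
        // "27 156 205 160 | 1788", "20 120 160 130 | 1400"
        for (double[] row : cols) {
            for (double[] col : cols) System.out.print((int) dot(row, col) + " ");
            System.out.println("| " + (int) dot(row, y));
        }
    }
}
```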