SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)


Page 1: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

SE-280 Dr. Mark L. Hornick

Multiple Regression (Cycle 4)

Page 2: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Review: Linear Regression (from Cycle 3)

[Figure: scatter plot of Time (hrs) vs. Est Proxy Size (LOC), x-axis 0-200 LOC, y-axis 0-3.5 hrs, with the fitted regression line]

    y_k = β0 + β1·x_k

x_k = estimated LOC (in this example)
y_k = estimated time (in this example)
β0 = offset
β1 = slope

    β1 = (Σ_{i=1..n} x_i·y_i − n·x_avg·y_avg) / (Σ_{i=1..n} x_i² − n·x_avg²)

    β0 = y_avg − β1·x_avg
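The Cycle 3 formulas above translate directly to code. A minimal sketch, assuming nothing about the course's actual class layout (the class and method names here are illustrative):

```java
public class SimpleRegression {
    // Least-squares fit of y = beta0 + beta1*x; returns {beta0, beta1}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xAvg = 0.0, yAvg = 0.0;
        for (int i = 0; i < n; i++) { xAvg += x[i]; yAvg += y[i]; }
        xAvg /= n;
        yAvg /= n;
        double sumXY = 0.0, sumXX = 0.0;
        for (int i = 0; i < n; i++) { sumXY += x[i] * y[i]; sumXX += x[i] * x[i]; }
        // beta1 = (sum(x*y) - n*xAvg*yAvg) / (sum(x^2) - n*xAvg^2)
        double beta1 = (sumXY - n * xAvg * yAvg) / (sumXX - n * xAvg * xAvg);
        // beta0 = yAvg - beta1*xAvg
        double beta0 = yAvg - beta1 * xAvg;
        return new double[] { beta0, beta1 };
    }

    public static void main(String[] args) {
        // Points on the exact line y = 1 + 2x, so beta0 = 1 and beta1 = 2.
        double[] beta = fit(new double[] {1, 2, 3}, new double[] {3, 5, 7});
        System.out.printf("beta0 = %.3f, beta1 = %.3f%n", beta[0], beta[1]);
    }
}
```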

Page 3: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The regression algorithm assumed a single independent variable.

Estimated proxy size (E)
Added+modified size (A+M)
Development effort (time)

    y_size = β_s0 + β_s1·x_k
    y_time = β_t0 + β_t1·x_k

Page 4: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Can we still apply regression if our estimates involve more than one independent variable?

Web pages (JSP)
Database tables
Java classes

If development of each component type is completely independent, we can make separate estimates and add them up. But what if they are so interdependent that we can't do that?

Page 5: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

One possible solution is to use multiple regression:

    y_k = β0 + β1·x_k,1 + β2·x_k,2 + ... + βm·x_k,m

m = number of independent variables (j = 1..m)
β0 = offset
βj = "slope" relative to each independent variable
x_k,j = current independent-variable value estimates (e.g., proxy size)
y_k = projected value (e.g., time)

Where do the β values come from?

Page 6: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The β values are calculated by solving a system of linear equations:

    β0·n       + β1·Σ x_i,1       + β2·Σ x_i,2       + ... + βm·Σ x_i,m       = Σ y_i
    β0·Σ x_i,1 + β1·Σ x_i,1²      + β2·Σ x_i,1·x_i,2 + ... + βm·Σ x_i,1·x_i,m = Σ x_i,1·y_i
    β0·Σ x_i,2 + β1·Σ x_i,2·x_i,1 + β2·Σ x_i,2²      + ... + βm·Σ x_i,2·x_i,m = Σ x_i,2·y_i
    ...
    β0·Σ x_i,m + β1·Σ x_i,m·x_i,1 + β2·Σ x_i,m·x_i,2 + ... + βm·Σ x_i,m²      = Σ x_i,m·y_i

(all sums run over i = 1..n)

n = number of historical data points (i = 1..n)
x_i,j = historical independent variable values
y_i = historical dependent variable values

Page 7: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The same equation in matrix form:

    A·β = b

    | n         Σ x_i,1        Σ x_i,2        ...  Σ x_i,m       |   | β0 |   | Σ y_i       |
    | Σ x_i,1   Σ x_i,1²       Σ x_i,1·x_i,2  ...  Σ x_i,1·x_i,m |   | β1 |   | Σ x_i,1·y_i |
    | Σ x_i,2   Σ x_i,2·x_i,1  Σ x_i,2²       ...  Σ x_i,2·x_i,m | · | β2 | = | Σ x_i,2·y_i |
    | ...       ...            ...            ...  ...           |   | ...|   | ...         |
    | Σ x_i,m   Σ x_i,m·x_i,1  Σ x_i,m·x_i,2  ...  Σ x_i,m²      |   | βm |   | Σ x_i,m·y_i |

(all sums run over i = 1..n)

Use these slides as reference when you implement Cycle 3.

Page 8: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

When rank(A) = 2 (that is, one independent variable), the familiar regression equations result when the equations are refactored. The matrix equation reduces to:

    | n         Σ x_i,1  |   | β0 |   | Σ y_i       |
    | Σ x_i,1   Σ x_i,1² | · | β1 | = | Σ x_i,1·y_i |

which, refactored, gives:

    β1 = (Σ_{i=1..n} x_i·y_i − n·x_avg·y_avg) / (Σ_{i=1..n} x_i² − n·x_avg²)

    β0 = y_avg − β1·x_avg

(all sums run over i = 1..n)

Page 9: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

To solve numerically for the β values, we need to calculate values for the A and b matrices.

    A·β = b

    | a_0,0  a_0,1  ...  a_0,m |   | β0 |   | b_0 |
    | a_1,0  a_1,1  ...  a_1,m | · | β1 | = | b_1 |
    | ...    ...    ...  ...   |   | ...|   | ... |
    | a_m,0  a_m,1  ...  a_m,m |   | βm |   | b_m |

where a_0,0 = n,  a_0,1 = Σ_{i=1..n} x_i,1,  etc.

Page 10: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

If you look closely, you will see a pattern to the equation coefficients:

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

    b_p = Σ_{i=1..n} x_i,p · y_i

What about the x_i,0 terms? They are "fictitious" values that are treated as being equal to one (1.0)!

Page 11: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

In general (especially for rank(A) > 2), we have to solve the system of equations to get the βi values.

Mathematically, we can do this by inverting the coefficient matrix A:

    A·β = b   ⇒   β = A⁻¹·b

However, it's more common to solve the equations using a technique like Gaussian elimination with back substitution (remember that?).
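As a sketch of that elimination approach (class and method names are illustrative; the course materials may use a matrix library's `solve` instead), here is Gaussian elimination with partial pivoting and back substitution:

```java
public class GaussSolver {
    // Solves A x = b by Gaussian elimination with partial pivoting
    // and back substitution. A and b are modified in place.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            // Partial pivoting: bring the row with the largest pivot up.
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] rowTmp = a[col]; a[col] = a[pivot]; a[pivot] = rowTmp;
            double bTmp = b[col]; b[col] = b[pivot]; b[pivot] = bTmp;
            // Eliminate entries below the pivot.
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        // Back substitution, bottom row up.
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // 2x + y = 5, x + 3y = 10  =>  x = 1, y = 3.
        double[] x = solve(new double[][] {{2, 1}, {1, 3}}, new double[] {5, 10});
        System.out.printf("x = %.3f, y = %.3f%n", x[0], x[1]);
    }
}
```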

Page 12: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

OK, let's try an example, using test case 4-3 from Cycle 3.

x0   x1   x2    y
 1   23  279  367
 1   28  421  584
 1   11  256  387
 1    2   42  131
 1    6  164  351
 1   16  265  218
 1    4  144  400

A     Col 0  Col 1  Col 2        b     Col 0
Row 0                            Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

What is the value of a_0,0?

We pretend there is an "x0" column, with all 1 values.
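The a_p,q and b_p sums, including the fictitious x_i,0 = 1 column, can be sketched as follows (class and method names are illustrative, not from the course materials):

```java
public class NormalEquations {
    // x[i] holds historical point i's values {x1, ..., xm};
    // the fictitious x_i,0 = 1.0 column is supplied implicitly.
    static double xVal(double[][] x, int i, int j) {
        return (j == 0) ? 1.0 : x[i][j - 1];
    }

    // a[p][q] = sum over i of x_i,p * x_i,q  (an (m+1) x (m+1) matrix)
    static double[][] buildA(double[][] x) {
        int n = x.length, m = x[0].length;
        double[][] a = new double[m + 1][m + 1];
        for (int p = 0; p <= m; p++)
            for (int q = 0; q <= m; q++)
                for (int i = 0; i < n; i++)
                    a[p][q] += xVal(x, i, p) * xVal(x, i, q);
        return a;
    }

    // b[p] = sum over i of x_i,p * y_i
    static double[] buildB(double[][] x, double[] y) {
        int n = x.length, m = x[0].length;
        double[] b = new double[m + 1];
        for (int p = 0; p <= m; p++)
            for (int i = 0; i < n; i++)
                b[p] += xVal(x, i, p) * y[i];
        return b;
    }

    public static void main(String[] args) {
        double[][] x = {{23, 279}, {28, 421}, {11, 256}, {2, 42},
                        {6, 164}, {16, 265}, {4, 144}};
        double[] y = {367, 584, 387, 131, 351, 218, 400};
        // a_0,0 is n, the number of data points.
        System.out.println("a0,0 = " + buildA(x)[0][0]);
    }
}
```

Running it against this slide's data reproduces the matrix entries filled in on the following pages.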

Page 13: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since a_0,0 is always n, the number of points, let's try another matrix element.

(data table with the "x0" column of 1s, as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7                      Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

What is the value of a_0,1?

Page 14: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since a_0,1 is the sum of the x1 values, a_0,2 must be the sum of the x2 values.

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 15: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Now, what about the rest of the first column in the A matrix?

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

What is the value of a_1,0?

Page 16: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Yes, the values in column 0 are the same as those in row 0, making the matrix symmetric (at least so far).

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1    90                      Row 1
Row 2  1571                      Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 17: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Next, let's try an element on the diagonal of matrix A.

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1    90                      Row 1
Row 2  1571                      Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Sum of squares = 1746

What is the value of a_1,1?

Page 18: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since the same independent variable appears twice in the diagonal terms, they are computed by summing the squares.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746               Row 1
Row 2  1571         440239       Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 19: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Let's try one of the remaining terms.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746               Row 1
Row 2  1571         440239       Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Sum of products = 26905

What is the value of a_1,2?

Page 20: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since the sum of products for x1 and x2 is the same as for x2 and x1, the last remaining value is the same.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746   26905       Row 1
Row 2  1571  26905  440239       Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 21: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

OK, now how about the b vector?

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746   26905       Row 1
Row 2  1571  26905  440239       Row 2

    b_p = Σ_{i=1..n} x_i,p · y_i

What is the value of b_0 (b_0,0)?

Sum of y values = 2438

Page 22: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

b_0 is always the sum of the y values; let's try the next one.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0   2438
Row 1    90   1746   26905       Row 1
Row 2  1571  26905  440239       Row 2

    b_p = Σ_{i=1..n} x_i,p · y_i

Sum of products = 36506

What is the value of b_1 (b_1,0)?

Page 23: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

OK, we need just one more value.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0   2438
Row 1    90   1746   26905       Row 1  36506
Row 2  1571  26905  440239       Row 2

    b_p = Σ_{i=1..n} x_i,p · y_i

What is the value of b_2 (b_2,0)?

Sum of products = 625765

Page 24: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Finally, we have our A and b matrices, and can solve for the β values.

A     Col 0  Col 1   Col 2       b     Col 0        β     Col 0
Row 0     7     90    1571       Row 0   2438       Row 0  98.472
Row 1    90   1746   26905       Row 1  36506       Row 1 -11.261
Row 2  1571  26905  440239       Row 2 625765       Row 2  1.7582

Matrix betas = a.solve(b);
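As a cross-check of the β values above, this sketch hard-codes the slide's A and b and solves the system without a matrix library (class and method names are illustrative; plain forward elimination suffices here because the pivots stay nonzero):

```java
public class SolveExample {
    // Gaussian elimination (no pivoting) with back substitution.
    // Modifies a and b in place; returns the solution vector.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // A and b from the worked example (test case 4-3 data).
        double[][] a = { {    7,    90,   1571 },
                         {   90,  1746,  26905 },
                         { 1571, 26905, 440239 } };
        double[] b = { 2438, 36506, 625765 };
        double[] beta = solve(a, b);
        System.out.printf("beta = (%.3f, %.3f, %.4f)%n", beta[0], beta[1], beta[2]);
    }
}
```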

Page 25: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

To extend the correlation calculation to handle multiple independent variables, the only change is in calculating the predicted y values.

One independent variable ("linear regression"):

    y_pred_i = β0 + β1·x_i

One or more independent variables ("multiple regression"):

    y_pred_i = β0 + β1·x_i,1 + β2·x_i,2 + ... + βm·x_i,m

Obviously, both are forms of "linear" regression, despite the names.

Page 26: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

We use the usual approach to calculate the correlation coefficient.

    r² = Σ_{i=1..n} (y_pred_i − y_avg)²  /  Σ_{i=1..n} (y_i − y_avg)²

    r = √(r²)

Just in case you are curious, the statisticians label the sum-square values like this:

    Σ_{i=1..n} (y_i − y_avg)²   =   Σ_{i=1..n} (y_pred_i − y_avg)²   +   Σ_{i=1..n} (y_i − y_pred_i)²
    Total sum of squares            Sum of squares – predicted           Sum of squares – error
    (variability)                   (explained)                          (unexplained)
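A sketch of the r² calculation above (class and method names are illustrative); the caller supplies the predicted values, so it works unchanged for one or many independent variables:

```java
public class Correlation {
    // r^2 = sum (yPred_i - yAvg)^2 / sum (y_i - yAvg)^2
    static double rSquared(double[] y, double[] yPred) {
        int n = y.length;
        double yAvg = 0.0;
        for (double v : y) yAvg += v;
        yAvg /= n;
        double ssPredicted = 0.0, ssTotal = 0.0;   // explained / total variability
        for (int i = 0; i < n; i++) {
            ssPredicted += (yPred[i] - yAvg) * (yPred[i] - yAvg);
            ssTotal     += (y[i] - yAvg) * (y[i] - yAvg);
        }
        return ssPredicted / ssTotal;
    }

    public static void main(String[] args) {
        // A perfect prediction explains all of the variability: r^2 = 1.
        System.out.println(rSquared(new double[] {1, 2, 3}, new double[] {1, 2, 3}));
    }
}
```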

Page 27: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Another Numerical Example

y  = {16, 30, 44, 58, 96}
x1 = {0, 2, 4, 6, 8}
x2 = {1, 3, 5, 7, 11}
x3 = {1, 2, 3, 4, 10}

Determine β values such that

    y_k = β0 + β1·x_k,1 + β2·x_k,2 + β3·x_k,3

Page 28: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The values in the data vectors are used in the summation terms of the matrices:

y  = {16, 30, 44, 58, 96}
x1 = {0, 2, 4, 6, 8}
x2 = {1, 3, 5, 7, 11}
x3 = {1, 2, 3, 4, 10}

    | 5         Σ x_i,1        Σ x_i,2        Σ x_i,3       |   | β0 |   | Σ y_i       |
    | Σ x_i,1   Σ x_i,1²       Σ x_i,1·x_i,2  Σ x_i,1·x_i,3 |   | β1 |   | Σ x_i,1·y_i |
    | Σ x_i,2   Σ x_i,2·x_i,1  Σ x_i,2²       Σ x_i,2·x_i,3 | · | β2 | = | Σ x_i,2·y_i |
    | Σ x_i,3   Σ x_i,3·x_i,1  Σ x_i,3·x_i,2  Σ x_i,3²      |   | β3 |   | Σ x_i,3·y_i |

(all sums run over i = 1..5)

For example:  Σ_{i=1..5} x_i,1 = 0 + 2 + 4 + 6 + 8 = 20

Page 29: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Evaluating terms in the matrices:

    |  5   20   27   20 |   | β0 |   |  244 |
    | 20  120  156  120 |   | β1 |   | 1352 |
    | 27  156  205  160 | · | β2 | = | 1788 |
    | 20  120  160  130 |   | β3 |   | 1400 |
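The evaluated entries above can be double-checked with two small helpers (illustrative names): every matrix element is either the sum of one data vector or the dot product of two of them (using the x0 column of ones where needed).

```java
public class TermCheck {
    // Sum of one data vector, e.g. sum(x1) = a_0,1.
    static double sum(double[] u) {
        double s = 0.0;
        for (double v : u) s += v;
        return s;
    }

    // Dot product of two data vectors, e.g. dot(x1, x2) = a_1,2.
    static double dot(double[] u, double[] v) {
        double s = 0.0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }

    public static void main(String[] args) {
        double[] x1 = {0, 2, 4, 6, 8};
        double[] x2 = {1, 3, 5, 7, 11};
        double[] y  = {16, 30, 44, 58, 96};
        // a_0,1, a_1,2, and b_1 for this example:
        System.out.println(sum(x1) + " " + dot(x1, x2) + " " + dot(x1, y));
    }
}
```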