SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)


Page 1: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

SE-280 Dr. Mark L. Hornick

Multiple Regression (Cycle 4)

Page 2: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Review: Linear Regression (from Cycle 3)

[Figure: scatter plot of Time (hrs) vs. Est Proxy Size (LOC), x-axis 0-200 LOC, y-axis 0-3.5 hrs, with the fitted regression line]

    y_k = β0 + β1·x_k

x_k = estimated LOC (in this example)
y_k = estimated time (in this example)
β0 = offset
β1 = slope

    β1 = (Σ_{i=1..n} x_i·y_i − n·x_avg·y_avg) / (Σ_{i=1..n} x_i² − n·x_avg²)

    β0 = y_avg − β1·x_avg
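The Cycle 3 formulas above translate directly to code. A minimal sketch, assuming nothing about the course's actual class layout (the class and method names here are illustrative):

```java
public class SimpleRegression {
    // Least-squares fit of y = beta0 + beta1*x; returns {beta0, beta1}.
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xAvg = 0.0, yAvg = 0.0;
        for (int i = 0; i < n; i++) { xAvg += x[i]; yAvg += y[i]; }
        xAvg /= n;
        yAvg /= n;
        double sumXY = 0.0, sumXX = 0.0;
        for (int i = 0; i < n; i++) { sumXY += x[i] * y[i]; sumXX += x[i] * x[i]; }
        // beta1 = (sum(x*y) - n*xAvg*yAvg) / (sum(x^2) - n*xAvg^2)
        double beta1 = (sumXY - n * xAvg * yAvg) / (sumXX - n * xAvg * xAvg);
        // beta0 = yAvg - beta1*xAvg
        double beta0 = yAvg - beta1 * xAvg;
        return new double[] { beta0, beta1 };
    }

    public static void main(String[] args) {
        // Points on the exact line y = 1 + 2x, so beta0 = 1 and beta1 = 2.
        double[] beta = fit(new double[] {1, 2, 3}, new double[] {3, 5, 7});
        System.out.printf("beta0 = %.3f, beta1 = %.3f%n", beta[0], beta[1]);
    }
}
```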

Page 3: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The regression algorithm assumed a single independent variable.

Estimated proxy size (E)
Added+modified size (A+M)
Development effort (time)

    y_size = β_s0 + β_s1·x_k
    y_time = β_t0 + β_t1·x_k

Page 4: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Can we still apply regression if our estimates involve more than one independent variable?

Web pages (JSP)
Database tables
Java classes

If development of each component type is completely independent, we can make separate estimates and add them up. But what if they are so interdependent that we can't do that?

Page 5: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

One possible solution is to use multiple regression:

    y_k = β0 + β1·x_k,1 + β2·x_k,2 + ... + βm·x_k,m

m = number of independent variables (j = 1..m)
β0 = offset
βj = "slope" relative to each independent variable
x_k,j = current independent-variable value estimates (e.g., proxy size)
y_k = projected value (e.g., time)

Where do the β values come from?

Page 6: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The β values are calculated by solving a system of linear equations:

    β0·n       + β1·Σ x_i,1       + β2·Σ x_i,2       + ... + βm·Σ x_i,m       = Σ y_i
    β0·Σ x_i,1 + β1·Σ x_i,1²      + β2·Σ x_i,1·x_i,2 + ... + βm·Σ x_i,1·x_i,m = Σ x_i,1·y_i
    β0·Σ x_i,2 + β1·Σ x_i,2·x_i,1 + β2·Σ x_i,2²      + ... + βm·Σ x_i,2·x_i,m = Σ x_i,2·y_i
    ...
    β0·Σ x_i,m + β1·Σ x_i,m·x_i,1 + β2·Σ x_i,m·x_i,2 + ... + βm·Σ x_i,m²      = Σ x_i,m·y_i

(all sums run over i = 1..n)

n = number of historical data points (i = 1..n)
x_i,j = historical independent variable values
y_i = historical dependent variable values

Page 7: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The same equation in matrix form:

    A·β = b

    | n         Σ x_i,1        Σ x_i,2        ...  Σ x_i,m       |   | β0 |   | Σ y_i       |
    | Σ x_i,1   Σ x_i,1²       Σ x_i,1·x_i,2  ...  Σ x_i,1·x_i,m |   | β1 |   | Σ x_i,1·y_i |
    | Σ x_i,2   Σ x_i,2·x_i,1  Σ x_i,2²       ...  Σ x_i,2·x_i,m | · | β2 | = | Σ x_i,2·y_i |
    | ...       ...            ...            ...  ...           |   | ...|   | ...         |
    | Σ x_i,m   Σ x_i,m·x_i,1  Σ x_i,m·x_i,2  ...  Σ x_i,m²      |   | βm |   | Σ x_i,m·y_i |

(all sums run over i = 1..n)

Use these slides as reference when you implement Cycle 3.

Page 8: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

When rank(A) = 2 (that is, one independent variable), the familiar regression equations result when the equations are refactored. The matrix equation reduces to:

    | n         Σ x_i,1  |   | β0 |   | Σ y_i       |
    | Σ x_i,1   Σ x_i,1² | · | β1 | = | Σ x_i,1·y_i |

which, refactored, gives:

    β1 = (Σ_{i=1..n} x_i·y_i − n·x_avg·y_avg) / (Σ_{i=1..n} x_i² − n·x_avg²)

    β0 = y_avg − β1·x_avg

(all sums run over i = 1..n)

Page 9: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

To solve numerically for the β values, we need to calculate values for the A and b matrices.

    A·β = b

    | a_0,0  a_0,1  ...  a_0,m |   | β0 |   | b_0 |
    | a_1,0  a_1,1  ...  a_1,m | · | β1 | = | b_1 |
    | ...    ...    ...  ...   |   | ...|   | ... |
    | a_m,0  a_m,1  ...  a_m,m |   | βm |   | b_m |

where a_0,0 = n,  a_0,1 = Σ_{i=1..n} x_i,1,  etc.

Page 10: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

If you look closely, you will see a pattern to the equation coefficients:

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

    b_p = Σ_{i=1..n} x_i,p · y_i

What about the x_i,0 terms? They are "fictitious" values that are treated as being equal to one (1.0)!

Page 11: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

In general (especially for rank(A) > 2), we have to solve the system of equations to get the βi values.

Mathematically, we can do this by inverting the coefficient matrix A:

    A·β = b   ⇒   β = A⁻¹·b

However, it's more common to solve the equations using a technique like Gaussian elimination with back substitution (remember that?).
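As a sketch of that elimination approach (class and method names are illustrative; the course materials may use a matrix library's `solve` instead), here is Gaussian elimination with partial pivoting and back substitution:

```java
public class GaussSolver {
    // Solves A x = b by Gaussian elimination with partial pivoting
    // and back substitution. A and b are modified in place.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            // Partial pivoting: bring the row with the largest pivot up.
            int pivot = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            double[] rowTmp = a[col]; a[col] = a[pivot]; a[pivot] = rowTmp;
            double bTmp = b[col]; b[col] = b[pivot]; b[pivot] = bTmp;
            // Eliminate entries below the pivot.
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        // Back substitution, bottom row up.
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // 2x + y = 5, x + 3y = 10  =>  x = 1, y = 3.
        double[] x = solve(new double[][] {{2, 1}, {1, 3}}, new double[] {5, 10});
        System.out.printf("x = %.3f, y = %.3f%n", x[0], x[1]);
    }
}
```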

Page 12: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

OK, let's try an example, using test case 4-3 from Cycle 3.

x0   x1   x2    y
 1   23  279  367
 1   28  421  584
 1   11  256  387
 1    2   42  131
 1    6  164  351
 1   16  265  218
 1    4  144  400

A     Col 0  Col 1  Col 2        b     Col 0
Row 0                            Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

What is the value of a_0,0?

We pretend there is an "x0" column, with all 1 values.
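The a_p,q and b_p sums, including the fictitious x_i,0 = 1 column, can be sketched as follows (class and method names are illustrative, not from the course materials):

```java
public class NormalEquations {
    // x[i] holds historical point i's values {x1, ..., xm};
    // the fictitious x_i,0 = 1.0 column is supplied implicitly.
    static double xVal(double[][] x, int i, int j) {
        return (j == 0) ? 1.0 : x[i][j - 1];
    }

    // a[p][q] = sum over i of x_i,p * x_i,q  (an (m+1) x (m+1) matrix)
    static double[][] buildA(double[][] x) {
        int n = x.length, m = x[0].length;
        double[][] a = new double[m + 1][m + 1];
        for (int p = 0; p <= m; p++)
            for (int q = 0; q <= m; q++)
                for (int i = 0; i < n; i++)
                    a[p][q] += xVal(x, i, p) * xVal(x, i, q);
        return a;
    }

    // b[p] = sum over i of x_i,p * y_i
    static double[] buildB(double[][] x, double[] y) {
        int n = x.length, m = x[0].length;
        double[] b = new double[m + 1];
        for (int p = 0; p <= m; p++)
            for (int i = 0; i < n; i++)
                b[p] += xVal(x, i, p) * y[i];
        return b;
    }

    public static void main(String[] args) {
        double[][] x = {{23, 279}, {28, 421}, {11, 256}, {2, 42},
                        {6, 164}, {16, 265}, {4, 144}};
        double[] y = {367, 584, 387, 131, 351, 218, 400};
        // a_0,0 is n, the number of data points.
        System.out.println("a0,0 = " + buildA(x)[0][0]);
    }
}
```

Running it against this slide's data reproduces the matrix entries filled in on the following pages.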

Page 13: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since a_0,0 is always n, the number of points, let's try another matrix element.

(data table with the "x0" column of 1s, as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7                      Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

What is the value of a_0,1?

Page 14: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since a_0,1 is the sum of the x1 values, a_0,2 must be the sum of the x2 values.

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 15: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Now, what about the rest of the first column in the A matrix?

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1                            Row 1
Row 2                            Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

What is the value of a_1,0?

Page 16: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Yes, the values in column 0 are the same as those in row 0, making the matrix symmetric (at least so far).

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1    90                      Row 1
Row 2  1571                      Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 17: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Next, let's try an element on the diagonal of matrix A.

(data table as on page 12)

A     Col 0  Col 1  Col 2        b     Col 0
Row 0     7     90   1571        Row 0
Row 1    90                      Row 1
Row 2  1571                      Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Sum of squares = 1746

What is the value of a_1,1?

Page 18: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since the same independent variable appears twice in the diagonal terms, they are computed by summing the squares.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746               Row 1
Row 2  1571         440239       Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 19: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Let's try one of the remaining terms.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746               Row 1
Row 2  1571         440239       Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Sum of products = 26905

What is the value of a_1,2?

Page 20: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Since the sum of products for x1 and x2 is the same as for x2 and x1, the last remaining value is the same.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746   26905       Row 1
Row 2  1571  26905  440239       Row 2

    a_p,q = Σ_{i=1..n} x_i,p · x_i,q

Page 21: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

OK, now how about the b vector?

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0
Row 1    90   1746   26905       Row 1
Row 2  1571  26905  440239       Row 2

    b_p = Σ_{i=1..n} x_i,p · y_i

What is the value of b_0 (b_0,0)?

Sum of y values = 2438

Page 22: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

b_0 is always the sum of the y values; let's try the next one.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0   2438
Row 1    90   1746   26905       Row 1
Row 2  1571  26905  440239       Row 2

    b_p = Σ_{i=1..n} x_i,p · y_i

Sum of products = 36506

What is the value of b_1 (b_1,0)?

Page 23: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

OK, we need just one more value.

(data table as on page 12)

A     Col 0  Col 1   Col 2       b     Col 0
Row 0     7     90    1571       Row 0   2438
Row 1    90   1746   26905       Row 1  36506
Row 2  1571  26905  440239       Row 2

    b_p = Σ_{i=1..n} x_i,p · y_i

What is the value of b_2 (b_2,0)?

Sum of products = 625765

Page 24: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Finally, we have our A and b matrices, and can solve for the β values.

A     Col 0  Col 1   Col 2       b     Col 0        β     Col 0
Row 0     7     90    1571       Row 0   2438       Row 0  98.472
Row 1    90   1746   26905       Row 1  36506       Row 1 -11.261
Row 2  1571  26905  440239       Row 2 625765       Row 2  1.7582

Matrix betas = a.solve(b);
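As a cross-check of the β values above, this sketch hard-codes the slide's A and b and solves the system without a matrix library (class and method names are illustrative; plain forward elimination suffices here because the pivots stay nonzero):

```java
public class SolveExample {
    // Gaussian elimination (no pivoting) with back substitution.
    // Modifies a and b in place; returns the solution vector.
    static double[] solve(double[][] a, double[] b) {
        int n = b.length;
        for (int col = 0; col < n; col++) {
            for (int r = col + 1; r < n; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c < n; c++) a[r][c] -= f * a[col][c];
                b[r] -= f * b[col];
            }
        }
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = b[r];
            for (int c = r + 1; c < n; c++) s -= a[r][c] * x[c];
            x[r] = s / a[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // A and b from the worked example (test case 4-3 data).
        double[][] a = { {    7,    90,   1571 },
                         {   90,  1746,  26905 },
                         { 1571, 26905, 440239 } };
        double[] b = { 2438, 36506, 625765 };
        double[] beta = solve(a, b);
        System.out.printf("beta = (%.3f, %.3f, %.4f)%n", beta[0], beta[1], beta[2]);
    }
}
```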

Page 25: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

To extend the correlation calculation to handle multiple independent variables, the only change is in calculating the predicted y values.

One independent variable ("linear regression"):

    y_pred_i = β0 + β1·x_i

One or more independent variables ("multiple regression"):

    y_pred_i = β0 + β1·x_i,1 + β2·x_i,2 + ... + βm·x_i,m

Obviously, both are forms of "linear" regression, despite the names.

Page 26: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

We use the usual approach to calculate the correlation coefficient.

    r² = Σ_{i=1..n} (y_pred_i − y_avg)²  /  Σ_{i=1..n} (y_i − y_avg)²

    r = √(r²)

Just in case you are curious, the statisticians label the sum-square values like this:

    Σ_{i=1..n} (y_i − y_avg)²   =   Σ_{i=1..n} (y_pred_i − y_avg)²   +   Σ_{i=1..n} (y_i − y_pred_i)²
    Total sum of squares            Sum of squares – predicted           Sum of squares – error
    (variability)                   (explained)                          (unexplained)
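A sketch of the r² calculation above (class and method names are illustrative); the caller supplies the predicted values, so it works unchanged for one or many independent variables:

```java
public class Correlation {
    // r^2 = sum (yPred_i - yAvg)^2 / sum (y_i - yAvg)^2
    static double rSquared(double[] y, double[] yPred) {
        int n = y.length;
        double yAvg = 0.0;
        for (double v : y) yAvg += v;
        yAvg /= n;
        double ssPredicted = 0.0, ssTotal = 0.0;   // explained / total variability
        for (int i = 0; i < n; i++) {
            ssPredicted += (yPred[i] - yAvg) * (yPred[i] - yAvg);
            ssTotal     += (y[i] - yAvg) * (y[i] - yAvg);
        }
        return ssPredicted / ssTotal;
    }

    public static void main(String[] args) {
        // A perfect prediction explains all of the variability: r^2 = 1.
        System.out.println(rSquared(new double[] {1, 2, 3}, new double[] {1, 2, 3}));
    }
}
```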

Page 27: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Another Numerical Example

y  = {16, 30, 44, 58, 96}
x1 = {0, 2, 4, 6, 8}
x2 = {1, 3, 5, 7, 11}
x3 = {1, 2, 3, 4, 10}

Determine β values such that

    y_k = β0 + β1·x_k,1 + β2·x_k,2 + β3·x_k,3

Page 28: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

The values in the data vectors are used in the summation terms of the matrices:

y  = {16, 30, 44, 58, 96}
x1 = {0, 2, 4, 6, 8}
x2 = {1, 3, 5, 7, 11}
x3 = {1, 2, 3, 4, 10}

    | 5         Σ x_i,1        Σ x_i,2        Σ x_i,3       |   | β0 |   | Σ y_i       |
    | Σ x_i,1   Σ x_i,1²       Σ x_i,1·x_i,2  Σ x_i,1·x_i,3 |   | β1 |   | Σ x_i,1·y_i |
    | Σ x_i,2   Σ x_i,2·x_i,1  Σ x_i,2²       Σ x_i,2·x_i,3 | · | β2 | = | Σ x_i,2·y_i |
    | Σ x_i,3   Σ x_i,3·x_i,1  Σ x_i,3·x_i,2  Σ x_i,3²      |   | β3 |   | Σ x_i,3·y_i |

(all sums run over i = 1..5)

For example:  Σ_{i=1..5} x_i,1 = 0 + 2 + 4 + 6 + 8 = 20

Page 29: SE-280 Dr. Mark L. Hornick Multiple Regression (Cycle 4)

Evaluating terms in the matrices:

    |  5   20   27   20 |   | β0 |   |  244 |
    | 20  120  156  120 |   | β1 |   | 1352 |
    | 27  156  205  160 | · | β2 | = | 1788 |
    | 20  120  160  130 |   | β3 |   | 1400 |
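The evaluated entries above can be double-checked with two small helpers (illustrative names): every matrix element is either the sum of one data vector or the dot product of two of them (using the x0 column of ones where needed).

```java
public class TermCheck {
    // Sum of one data vector, e.g. sum(x1) = a_0,1.
    static double sum(double[] u) {
        double s = 0.0;
        for (double v : u) s += v;
        return s;
    }

    // Dot product of two data vectors, e.g. dot(x1, x2) = a_1,2.
    static double dot(double[] u, double[] v) {
        double s = 0.0;
        for (int i = 0; i < u.length; i++) s += u[i] * v[i];
        return s;
    }

    public static void main(String[] args) {
        double[] x1 = {0, 2, 4, 6, 8};
        double[] x2 = {1, 3, 5, 7, 11};
        double[] y  = {16, 30, 44, 58, 96};
        // a_0,1, a_1,2, and b_1 for this example:
        System.out.println(sum(x1) + " " + dot(x1, x2) + " " + dot(x1, y));
    }
}
```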