Regression Analysis. Lecturer: Dr. Bo Yuan. E-mail: yuanb@sz.tsinghua.edu.cn

Post on 29-Mar-2015


Regression Analysis

Lecturer: Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

Regression

To express the relationship between two or more variables by a mathematical formula.

x : predictor (independent) variable

y : response (dependent) variable

Identify how y varies as a function of x.

y is also considered as a random variable.

Real-World Example:

Footwear impressions are commonly observed at crime scenes. While there are numerous forensic properties that can be obtained from these impressions, one in particular is the shoe size. The detectives would like to be able to estimate the height of the impression maker from the shoe size.

The relationship between shoe sizes and heights

Shoe Size vs. Height


What is the predictor?

What is the response?

Can the height be accurately estimated from the shoe size?

If a shoe size is 11, what would you advise the police?

What if the size is 7 or 12.5?


General Regression Model

The systematic part m(x) is deterministic.

The error ε(x) is a random variable.

Measurement Error

Natural Variations

Additive

$$y(x) = m(x) + \varepsilon(x)$$

Example: Sin Function

$$y(x) = A \sin(\omega x + \phi) + \varepsilon(x)$$
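The additive model can be illustrated with a minimal Python sketch in which a deterministic sine curve m(x) is corrupted by a random error ε(x). The function name `noisy_sine`, the frequency and phase parameters, the noise level `sigma`, and the Gaussian form of the error are assumptions made for illustration; the slides only state that the error is an additive random variable.

```python
import math
import random


def noisy_sine(xs, amplitude=1.0, omega=1.0, phase=0.0, sigma=0.1, seed=0):
    """Sample y(x) = A*sin(omega*x + phase) + eps(x), with eps ~ N(0, sigma^2).

    The Gaussian error and all default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    ys = []
    for x in xs:
        systematic = amplitude * math.sin(omega * x + phase)  # m(x): deterministic
        error = rng.gauss(0.0, sigma)                          # eps(x): random
        ys.append(systematic + error)
    return ys


# Two "measurements" of the same curve: same m(x), different errors.
xs = [i * 0.1 for i in range(63)]
ys_a = noisy_sine(xs, seed=1)
ys_b = noisy_sine(xs, seed=2)
```

Repeating the experiment changes only the error term, which is why y is treated as a random variable even though m(x) is fixed.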

Standard Assumptions

A1

A2

A3

Back to Shoes


Simple Linear Regression

$$m(x) = \beta_0 + \beta_1 x$$

Model Parameters


Derivation

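$$R(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

$$\frac{\partial R}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Rightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}$$

$$\frac{\partial R}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Rightarrow\; \sum_{i=1}^{n} x_i (y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i) = 0$$

$$\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$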
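The closed-form least-squares estimates for the line m(x) = β0 + β1 x can be checked numerically. The sketch below is one possible implementation in plain Python; the function name `fit_line` and the five data points are made up for illustration and do not come from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # beta1 = (sum x_i y_i - n x_bar y_bar) / (sum x_i^2 - n x_bar^2)
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    beta1 = sxy / sxx
    # beta0 = y_bar - beta1 * x_bar
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


# Hypothetical data, roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
beta0, beta1 = fit_line(xs, ys)
```

The fitted line always passes through the point (x̄, ȳ), which follows directly from β0 = ȳ − β1 x̄.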

Standard Deviations

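$$\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} \varepsilon_i^2$$

$$\hat{\sigma}_{\beta_0} = \hat{\sigma} \left( \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right)^{1/2}$$

$$\hat{\sigma}_{\beta_1} = \hat{\sigma} \left( \frac{1}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} \right)^{1/2}$$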
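The standard deviations of the two estimates can be computed from the residuals of the fitted line. The sketch below assumes the usual textbook estimator, in which the residual sum of squares is divided by n − 2; the function name `line_standard_errors` and the data are hypothetical.

```python
import math


def line_standard_errors(xs, ys):
    """Return (sigma_hat, se_beta0, se_beta1) for a simple linear regression.

    Assumes sigma_hat^2 = SSE / (n - 2), the usual unbiased estimator.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    beta1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / sxx
    beta0 = y_bar - beta1 * x_bar
    # Residuals of the fitted line.
    residuals = [y - beta0 - beta1 * x for x, y in zip(xs, ys)]
    sigma_hat = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    se_beta0 = sigma_hat * math.sqrt(1.0 / n + x_bar ** 2 / sxx)
    se_beta1 = sigma_hat * math.sqrt(1.0 / sxx)
    return sigma_hat, se_beta0, se_beta1


# Hypothetical data, roughly following y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]
sigma_hat, se_beta0, se_beta1 = line_standard_errors(xs, ys)
```

Spreading the x values further apart increases the denominator Σx² − n x̄², so both standard deviations shrink.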

Polynomial Terms

Modeling the data as a line is not always adequate.

Polynomial Regression

This is still a linear model!

m(x) is a linear combination of β.

Danger of Overfitting

$$m(x) = \beta_0 + \beta_1 x + \dots + \beta_p x^p = \sum_{k=0}^{p} \beta_k x^k$$

Matrix Representation

$$y_i = \sum_{k=0}^{p} \beta_k x_i^k + \varepsilon_i$$

$$Y = X\beta + \varepsilon$$

Matrix Representation

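$$R(\beta) = (Y - X\beta)^T (Y - X\beta) = Y^T Y - Y^T X \beta - \beta^T X^T Y + \beta^T X^T X \beta$$

$$\frac{\partial R}{\partial \beta} = -2 X^T Y + 2 X^T X \beta = 0 \;\Rightarrow\; X^T X \beta = X^T Y$$

$$\hat{\beta} = (X^T X)^{-1} X^T Y$$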
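The matrix solution can be sketched in plain Python by building the design matrix X, forming XᵀX and XᵀY, and solving the resulting linear system. The helper names below are hypothetical, and the small Gaussian-elimination solver is only a stand-in for a proper linear algebra library; it does not handle a singular XᵀX.

```python
def design_matrix(xs, degree):
    """Rows [1, x, x^2, ..., x^degree], one row per observation."""
    return [[x ** k for k in range(degree + 1)] for x in xs]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Solve the normal equations (X^T X) beta = X^T Y."""
    X = design_matrix(xs, degree)
    m = degree + 1
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(m)]
    return solve_linear_system(xtx, xty)


# Noise-free data from y = 1 + 2x + 3x^2: the fit should recover the coefficients.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
beta = fit_polynomial(xs, ys, 2)
```

Forming XᵀX explicitly mirrors the derivation but becomes ill-conditioned for high polynomial degrees; library least-squares routines based on QR or SVD factorizations are generally the safer choice in practice.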

Model Comparison

$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \quad \text{(Total Sum of Squares)}$$

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \quad \text{(Error Sum of Squares)}$$

R²

$$R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$

$$R^2_{adj} = 1 - \frac{SSE / (n - (p + 1))}{SST / (n - 1)}$$
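Both quantities are straightforward to compute once the fitted values are available. The following sketch assumes p is the polynomial degree, so the model has p + 1 coefficients; the function name `r_squared` and the numbers are made up for illustration.

```python
def r_squared(ys, y_hats, p):
    """Return (R^2, adjusted R^2) for observations ys and fitted values y_hats.

    p is the polynomial degree, so the model has p + 1 coefficients.
    """
    n = len(ys)
    y_bar = sum(ys) / n
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total sum of squares
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))   # error sum of squares
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (sse / (n - (p + 1))) / (sst / (n - 1))
    return r2, r2_adj


# Hypothetical observations and fitted values from a straight-line model (p = 1).
ys = [1.0, 2.0, 3.0, 4.0]
y_hats = [1.1, 1.9, 3.2, 3.8]
r2, r2_adj = r_squared(ys, y_hats, p=1)
```

Adding terms to a least-squares model never increases SSE, so R² alone always favors the larger model; the adjusted version penalizes the extra coefficients.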

Example

[Figure: Y plotted against X for X from 0 to 5, with two fitted curves]

Data: Y = X² + N(0,1)

Linear fit: Y = -3.6029 + 4.8802X, R² = 0.9131

Quadratic fit: Y = 0.7341 - 0.4303X + 1.0621X², R² = 0.9880
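An experiment of this kind can be re-created in a few lines of Python. The sketch below draws data from Y = X² + N(0,1) on [0, 5] and compares a linear and a quadratic fit by R². The sample size, the evenly spaced X values, the random seed, and the helper names are assumptions, so the numbers it produces will not match the slide exactly.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    m = degree + 1
    # Augmented system [X^T X | X^T Y], built from power sums.
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot_row = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(m):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[m] for row in aug]


def r_squared(xs, ys, beta):
    """R^2 = 1 - SSE / SST for the polynomial with coefficients beta."""
    y_bar = sum(ys) / len(ys)
    y_hats = [sum(b * x ** k for k, b in enumerate(beta)) for x in xs]
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hats))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - sse / sst


# Assumed setup: 50 evenly spaced points on [0, 5], fixed seed.
rng = random.Random(0)
xs = [5.0 * i / 49 for i in range(50)]
ys = [x ** 2 + rng.gauss(0.0, 1.0) for x in xs]

beta_line = fit_polynomial(xs, ys, 1)
beta_quad = fit_polynomial(xs, ys, 2)
r2_line = r_squared(xs, ys, beta_line)
r2_quad = r_squared(xs, ys, beta_quad)
print("linear:   ", beta_line, r2_line)
print("quadratic:", beta_quad, r2_quad)
```

Because the data are generated from a quadratic, the quadratic fit should score the higher R², as it does on the slide.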

Summary

Regression is the oldest data mining technique.

Probably the first thing that you want to try on a new data set.

No need to do programming! Matlab, Excel …

Quality of Regression

R²

Residual Plot

Cross Validation

What you should learn after class:

Confidence Interval

Multiple Regression

Nonlinear Regression
