Simple linear regression and
correlation analysis
1. Regression
2. Correlation
3. Significance testing
1. Simple linear regression analysis
Simple regression describes the relationship between two variables.
Two variables, generally Y = f(X): Y = dependent variable (regressand), X = independent variable (regressor)
Simple linear regression
f (x) – regression equation ei – random error, residual deviation
independent random quantity N (0, σ2)
iii exfy )(
Simple linear regression – straight line

$y_{i,est} = b_0 + b_1 x_i$

$b_0$ = constant
$b_1$ = coefficient of regression
Parameter estimates → least squares condition:
the difference of the actual Y from the estimated $y_{est}$ is minimal, hence

$S = \sum_{i=1}^{n} (y_i - y_{i,est})^2 = \sum_{i=1}^{n} d_i^2 \rightarrow \min$

where $d_i = y_i - y_{i,est} = y_i - (b_0 + b_1 x_i)$ and n is the number of observations $(y_i, x_i)$.

Substituting the straight line $y_{i,est} = b_0 + b_1 x_i$:

$S = f(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \rightarrow \min$

The partial derivatives of the sum of squared deviations S with respect to the parameters $b_0$, $b_1$ are equated to zero:

$\dfrac{\partial f}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$

$\dfrac{\partial f}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\, x_i = 0$
Two approaches to parameter estimation using the least squares condition (shown for the straight line equation):

1. Normal equation system for the straight line

$n b_0 + b_1 \sum x_i = \sum y_i$
$b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i$

2. Matrix computation approach

$\mathbf{y} = \mathbf{X}\mathbf{b} + \boldsymbol{\varepsilon}$

y = dependent variable vector, X = independent variable matrix, b = vector of regression coefficients (straight line → $b_0$ and $b_1$), ε = vector of random errors

$\mathbf{b} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{y}$
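As a sketch, the normal equation system can be solved directly in a few lines; the sample values below are made up purely for illustration:

```python
# Toy data (invented values) for fitting y_est = b0 + b1*x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y))

# Solve the normal equations:
#   n*b0 + b1*Σx  = Σy
#   b0*Σx + b1*Σx² = Σxy
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = (sy - b1 * sx) / n
```

The matrix form $(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{y}$ gives the same pair of coefficients; for a single regressor the scalar formulas above are just that product written out.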
Simple linear regression

observation: $y_i$
smoothed values: $y_{i,est}$ ($y_i'$)
residual deviation: $d_i = y_i - y_{i,est} = y_i - (b_0 + b_1 x_i)$
residual sum of squares: $S_r = \sum_{i=1}^{n} (y_i - y_{i,est})^2 = \sum_{i=1}^{n} d_i^2$
residual variance: $s_r^2 = \dfrac{S_r}{n - k}$, where k is the number of parameters
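The residual quantities can be verified numerically; a minimal sketch with invented data (k = 2 parameters for a straight line):

```python
# Invented sample; k = 2 parameters (b0, b1) for a straight line
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n, k = len(x), 2

# Least-squares fit via the normal-equation solution
sx, sy = sum(x), sum(y)
b1 = (n * sum(a * b for a, b in zip(x, y)) - sx * sy) / (n * sum(v * v for v in x) - sx ** 2)
b0 = (sy - b1 * sx) / n

y_est = [b0 + b1 * v for v in x]            # smoothed values
d = [yi - yh for yi, yh in zip(y, y_est)]   # residual deviations
S_r = sum(di ** 2 for di in d)              # residual sum of squares
s_r2 = S_r / (n - k)                        # residual variance
```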
Simple lin. reg. → dependence Y on X

Straight line equation: $y_{i,est} = b_{0yx} + b_{1yx} x_i$

Normal equation system:
$n b_{0yx} + b_{1yx} \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
$b_{0yx} \sum_{i=1}^{n} x_i + b_{1yx} \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

Parameter estimates – computational formula:
$b_{1yx} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$
$b_{0yx} = \bar{y} - b_{1yx}\,\bar{x}$
Simple lin. reg. → dependence X on Y

Associated straight line equation: $x_{i,est} = b_{0xy} + b_{1xy} y_i$

Parameter estimates – computational formula:
$b_{1xy} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum y_i^2 - \left(\sum y_i\right)^2}$
$b_{0xy} = \bar{x} - b_{1xy}\,\bar{y}$
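The two associated lines can be computed side by side; note that only the denominator of the slope changes. The data below are illustrative only:

```python
# Invented sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy             # shared numerator of both slopes
b1_yx = num / (n * sxx - sx ** 2)   # slope of Y on X
b1_xy = num / (n * syy - sy ** 2)   # slope of X on Y
b0_yx = (sy - b1_yx * sx) / n       # intercept of Y on X
b0_xy = (sx - b1_xy * sy) / n       # intercept of X on Y
```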
2. Correlation analysis

Correlation analysis measures the strength of dependence – the coefficient of correlation "r"; |r| lies in ⟨0; 1⟩.

|r| in ⟨0; 0.33⟩ – weak dependence
|r| in ⟨0.34; 0.66⟩ – medium strong dependence
|r| in ⟨0.67; 1⟩ – strong to very strong dependence

$r^2$ = coefficient of determination: the proportion (%) of the variance of Y that is caused by the effect of X.

$r_{yx} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$

$r_{yx} = \sqrt{b_{1yx}\, b_{1xy}}$
$b_{1yx} = r_{yx}\,\dfrac{s_y}{s_x}$,  $b_{1xy} = r_{xy}\,\dfrac{s_x}{s_y}$,  $r_{yx} = r_{xy}$
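The identity $r = \sqrt{b_{1yx}\,b_{1xy}}$ can be checked numerically; a sketch with made-up data:

```python
import math

# Invented sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy
r = num / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

b1_yx = num / (n * sxx - sx ** 2)
b1_xy = num / (n * syy - sy ** 2)

# |r| is the geometric mean of the two slopes; r**2 is the
# coefficient of determination
assert abs(abs(r) - math.sqrt(b1_yx * b1_xy)) < 1e-12
```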
3. Significance testing in simple regression

Significance test of parameter $b_1$ (straight line), two-sided:

$H_0: \beta_1 = 0$   $H_1: \beta_1 \neq 0$

test criterion: $t = \dfrac{b_1}{s_b}$

estimate $s_b$ for parameter $b_1$: $s_b = \dfrac{s_y}{s_x}\sqrt{\dfrac{1 - r^2}{n - 2}}$

table value (two-sided): $t_{\alpha}(n - k)$

If test criterion > table value → $H_0$ is rejected and $H_1$ is valid; if α > p-value → $H_0$ is rejected.
Coefficient of regression estimation

Interval estimate for the unknown $\beta_i$:

$P\left(b_i - t_{\alpha}(n-k)\,s_b \;\le\; \beta_i \;\le\; b_i + t_{\alpha}(n-k)\,s_b\right) = 1 - \alpha$
Significance test of the coefficient of correlation r (straight line), two-sided:

$H_0: \rho = 0$   $H_1: \rho \neq 0$

test criterion: $t = \dfrac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$

table value (two-sided): $t_{\alpha}(n - k)$

If test criterion > table value → $H_0$ is rejected and $H_1$ is valid; if α > p-value → $H_0$ is rejected.
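For a straight line the two test criteria coincide: t computed from $b_1/s_b$ equals t computed from r. A short sketch with made-up data confirms this:

```python
import math

# Invented sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy
b1 = num / (n * sxx - sx ** 2)
r = num / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Sample standard deviations of x and y
s_x = math.sqrt((sxx - sx ** 2 / n) / (n - 1))
s_y = math.sqrt((syy - sy ** 2 / n) / (n - 1))

s_b = (s_y / s_x) * math.sqrt((1 - r ** 2) / (n - 2))  # estimate for b1
t_b = b1 / s_b                                         # test of b1
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)     # test of r
```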
Coefficient of correlation estimation

For small samples and non-normal distributions use the Fisher Z transformation: first r is assigned to Z (by tables).

Interval estimate for the unknown ρ:

$Z_{1,2} = Z \mp u_{1-\alpha/2}\,\dfrac{1}{\sqrt{n - 3}}$

In the last step $Z_1$ and $Z_2$ are assigned back to $r_1$ and $r_2$.
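A minimal sketch of the Fisher Z interval; r, n and the standard normal quantile u (1.96 for 95 %) are invented inputs, and the table lookups are replaced by the closed-form transform:

```python
import math

r, n, u = 0.77, 30, 1.96

Z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher Z transformation (atanh r)
half = u / math.sqrt(n - 3)
Z1, Z2 = Z - half, Z + half

# Back-transform Z1, Z2 to r1, r2 (the inverse of the Z transform)
r1, r2 = math.tanh(Z1), math.tanh(Z2)
```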
The summary ANOVA

Variation | Sum of squared deviations | df | Variance
along the regression function | $S_1 = \sum_{i=1}^{n} (y_{i,est} - \bar{y})^2$ | k − 1 | $s_1^2 = \dfrac{S_1}{k - 1}$
across the regression function | $S_r = \sum_{i=1}^{n} (y_i - y_{i,est})^2$ | n − k | $s_r^2 = \dfrac{S_r}{n - k}$

Test criterion: $F = \dfrac{s_1^2}{s_r^2}$
The summary ANOVA (alternatively)

test criterion: $F = \dfrac{R^2/(k - 1)}{(1 - R^2)/(n - k)}$

table value: $F_{\alpha}(k - 1,\, n - k)$
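Both ANOVA forms must yield the same F for the same fit; a sketch with toy data and k = 2:

```python
# Invented sample; k = 2 parameters for a straight line
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 2

sx, sy = sum(x), sum(y)
b1 = (n * sum(a * b for a, b in zip(x, y)) - sx * sy) / (n * sum(v * v for v in x) - sx ** 2)
b0 = (sy - b1 * sx) / n
y_est = [b0 + b1 * v for v in x]
y_bar = sy / n

S1 = sum((yh - y_bar) ** 2 for yh in y_est)           # along the regression function
Sr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_est))  # across the regression function

F_anova = (S1 / (k - 1)) / (Sr / (n - k))   # variance-ratio form
R2 = S1 / (S1 + Sr)                         # coefficient of determination
F_alt = (R2 / (k - 1)) / ((1 - R2) / (n - k))  # R²-based form
```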
Multicollinearity

A relationship between (among) the independent variables: when there is an almost perfect linear relationship among the independent variables ($X_1; X_2 \ldots X_N$), multicollinearity is high.
The relationships must be analyzed before the model is formed, because the linear independence of the columns (variables) is disturbed.
Causes of multicollinearity
trends in time series, similar tendencies among variables (regressors)
inclusion of exogenous variables, lags
using 0/1 coding in the sample
Consequences of multicollinearity
wrong inference: the null hypothesis of a zero regression coefficient is not rejected although it really should be
confidence intervals are wide
regression coefficient estimates are very sensitive to changes in the data
regression coefficients can have the wrong sign
the regression equation is not suitable for prediction
Testing of multicollinearity

Paired coefficients of correlation – t-test

Farrar–Glauber test

test criterion: $B = -\left(n - 1 - \dfrac{2p + 5}{6}\right)\ln|R|$

table value: $\chi^2_{1-\alpha}\!\left(\dfrac{p(p-1)}{2}\right)$

If test criterion > table value → $H_0$ is rejected.
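A sketch of the Farrar–Glauber criterion for the simplest case of p = 2 explanatory variables, where the determinant of the correlation matrix R reduces to $1 - r_{12}^2$; n and $r_{12}$ are invented:

```python
import math

n, p = 30, 2
r12 = 0.85                 # made-up pairwise correlation of X1 and X2
det_R = 1 - r12 ** 2       # |R| for a 2x2 correlation matrix

B = -(n - 1 - (2 * p + 5) / 6) * math.log(det_R)  # test criterion
df = p * (p - 1) // 2                             # chi-square degrees of freedom
```

Here B far exceeds the χ²₀.₉₅(1) table value of about 3.84, so H₀ (no multicollinearity) would be rejected.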
Elimination of multicollinearity
excluding variables
getting a new sample
re-formulating and rethinking the model once again (the chosen variables)
transforming variables – recalculating the chosen variables (e.g. not total consumption, but consumption per capita, etc.)
Regression diagnostics
Data quality for the chosen model
Suitable model for the chosen dataset
Method conditions
Data quality evaluation
A) outlying observations in the "y" set – Studentized residuals
|SR| > 2 → outlying observation
An outlying observation need not be influential (an influential observation has a cardinal influence on the regression).
Data quality evaluation
B) outlying observations in the "x" set – Hat Diag leverage
$h_{ii}$ – diagonal values of the hat matrix H
$H = X\,(X^{\mathsf T} X)^{-1}\,X^{\mathsf T}$
$h_{ii} > \dfrac{2p}{n}$ → outlying observation
Data quality evaluation
C) influential observations
Cook's D (an influential observation influences the whole equation):
$D_i > \dfrac{4}{n}$ → influential observation
Welsch–Kuh DFFITS distance (an influential observation influences the smoothed observation):
$|DFFITS| > 2\sqrt{\dfrac{p}{n}}$ → influential observation
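The two threshold rules above can be written as small predicates; the function names are mine, not a standard API, and p denotes the number of model parameters:

```python
def outlying_by_leverage(h_ii, p, n):
    """h_ii above 2p/n flags an outlying observation in the x set."""
    return h_ii > 2 * p / n

def influential_by_dffits(dffits, p, n):
    """|DFFITS| above 2*sqrt(p/n) flags an influential observation."""
    return abs(dffits) > 2 * (p / n) ** 0.5
```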
Method conditions
regression parameters can take any value in (−∞; +∞)
the regression model is linear in its parameters (if not linear – transform the data)
independence of the residuals
normal distribution of the residuals, N(0; σ²)