Simple linear regression and
correlation analysis
1. Regression
2. Correlation
3. Significance testing
1. Simple linear regression analysis
Simple regression describes the relationship between two variables.
Two variables, generally Y = f(X): Y = dependent variable (regressand), X = independent variable (regressor)
Simple linear regression
f (x) – regression equation ei – random error, residual deviation
independent random quantity N (0, σ2)
iii exfy )(
Simple linear regression – straight line

$y_{i,est} = b_0 + b_1 x_i$

$b_0$ = constant
$b_1$ = coefficient of regression
Parameter estimates → least squares condition:
the difference of the actual Y from the estimated $y_{est}$ is minimal, hence

$S = \sum_{i=1}^{n} (y_i - y_{i,est})^2 = \sum_{i=1}^{n} d_i^2 \rightarrow \min$

where $d_i = y_i - y_{i,est} = y_i - (b_0 + b_1 x_i)$ and n is the number of observations $(y_i, x_i)$.

Substituting the straight line $y_{i,est} = b_0 + b_1 x_i$:

$S = f(b_0, b_1) = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2 \rightarrow \min$

The partial derivatives of the sum of squared deviations S with respect to the parameters $b_0$, $b_1$ are equated to zero:

$\dfrac{\partial f}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0$

$\dfrac{\partial f}{\partial b_1} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)\, x_i = 0$
Two approaches to parameter estimation using the least squares condition (shown for the straight line equation):

1. Normal equation system for the straight line

$n b_0 + b_1 \sum x_i = \sum y_i$
$b_0 \sum x_i + b_1 \sum x_i^2 = \sum x_i y_i$

2. Matrix computation approach

$\mathbf{y} = \mathbf{X}\mathbf{b} + \boldsymbol{\varepsilon}$

y = dependent variable vector, X = independent variable matrix, b = vector of regression coefficients (straight line → $b_0$ and $b_1$), ε = vector of random errors

$\mathbf{b} = (\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{y}$
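As a sketch, the normal equation system can be solved directly in a few lines; the sample values below are made up purely for illustration:

```python
# Toy data (invented values) for fitting y_est = b0 + b1*x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
sxy = sum(a * b for a, b in zip(x, y))

# Solve the normal equations:
#   n*b0 + b1*Σx  = Σy
#   b0*Σx + b1*Σx² = Σxy
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = (sy - b1 * sx) / n
```

The matrix form $(\mathbf{X}^{\mathsf T}\mathbf{X})^{-1}\mathbf{X}^{\mathsf T}\mathbf{y}$ gives the same pair of coefficients; for a single regressor the scalar formulas above are just that product written out.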
Simple linear regression

observation: $y_i$
smoothed values: $y_{i,est}$ ($y_i'$)
residual deviation: $d_i = y_i - y_{i,est} = y_i - (b_0 + b_1 x_i)$
residual sum of squares: $S_r = \sum_{i=1}^{n} (y_i - y_{i,est})^2 = \sum_{i=1}^{n} d_i^2$
residual variance: $s_r^2 = \dfrac{S_r}{n - k}$, where k is the number of parameters
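The residual quantities can be verified numerically; a minimal sketch with invented data (k = 2 parameters for a straight line):

```python
# Invented sample; k = 2 parameters (b0, b1) for a straight line
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n, k = len(x), 2

# Least-squares fit via the normal-equation solution
sx, sy = sum(x), sum(y)
b1 = (n * sum(a * b for a, b in zip(x, y)) - sx * sy) / (n * sum(v * v for v in x) - sx ** 2)
b0 = (sy - b1 * sx) / n

y_est = [b0 + b1 * v for v in x]            # smoothed values
d = [yi - yh for yi, yh in zip(y, y_est)]   # residual deviations
S_r = sum(di ** 2 for di in d)              # residual sum of squares
s_r2 = S_r / (n - k)                        # residual variance
```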
Simple lin. reg. → dependence Y on X

Straight line equation: $y_{i,est} = b_{0yx} + b_{1yx} x_i$

Normal equation system:
$n b_{0yx} + b_{1yx} \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i$
$b_{0yx} \sum_{i=1}^{n} x_i + b_{1yx} \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i$

Parameter estimates – computational formula:
$b_{1yx} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}$
$b_{0yx} = \bar{y} - b_{1yx}\,\bar{x}$
Simple lin. reg. → dependence X on Y

Associated straight line equation: $x_{i,est} = b_{0xy} + b_{1xy} y_i$

Parameter estimates – computational formula:
$b_{1xy} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum y_i^2 - \left(\sum y_i\right)^2}$
$b_{0xy} = \bar{x} - b_{1xy}\,\bar{y}$
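The two associated lines can be computed side by side; note that only the denominator of the slope changes. The data below are illustrative only:

```python
# Invented sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy             # shared numerator of both slopes
b1_yx = num / (n * sxx - sx ** 2)   # slope of Y on X
b1_xy = num / (n * syy - sy ** 2)   # slope of X on Y
b0_yx = (sy - b1_yx * sx) / n       # intercept of Y on X
b0_xy = (sx - b1_xy * sy) / n       # intercept of X on Y
```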
2. Correlation analysis

Correlation analysis measures the strength of dependence – the coefficient of correlation "r"; |r| lies in ⟨0; 1⟩.

|r| in ⟨0; 0.33⟩ – weak dependence
|r| in ⟨0.34; 0.66⟩ – medium strong dependence
|r| in ⟨0.67; 1⟩ – strong to very strong dependence

$r^2$ = coefficient of determination: the proportion (%) of the variance of Y that is caused by the effect of X.

$r_{yx} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$

$r_{yx} = \sqrt{b_{1yx}\, b_{1xy}}$
$b_{1yx} = r_{yx}\,\dfrac{s_y}{s_x}$,  $b_{1xy} = r_{xy}\,\dfrac{s_x}{s_y}$,  $r_{yx} = r_{xy}$
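The identity $r = \sqrt{b_{1yx}\,b_{1xy}}$ can be checked numerically; a sketch with made-up data:

```python
import math

# Invented sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy
r = num / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

b1_yx = num / (n * sxx - sx ** 2)
b1_xy = num / (n * syy - sy ** 2)

# |r| is the geometric mean of the two slopes; r**2 is the
# coefficient of determination
assert abs(abs(r) - math.sqrt(b1_yx * b1_xy)) < 1e-12
```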
3. Significance testing in simple regression

Significance test of parameter $b_1$ (straight line), two-sided:

$H_0: \beta_1 = 0$   $H_1: \beta_1 \neq 0$

test criterion: $t = \dfrac{b_1}{s_b}$

estimate $s_b$ for parameter $b_1$: $s_b = \dfrac{s_y}{s_x}\sqrt{\dfrac{1 - r^2}{n - 2}}$

table value (two-sided): $t_{\alpha}(n - k)$

If test criterion > table value → $H_0$ is rejected and $H_1$ is valid; if α > p-value → $H_0$ is rejected.
Coefficient of regression estimation

Interval estimate for the unknown $\beta_i$:

$P\left(b_i - t_{\alpha}(n-k)\,s_b \;\le\; \beta_i \;\le\; b_i + t_{\alpha}(n-k)\,s_b\right) = 1 - \alpha$
Significance test of the coefficient of correlation r (straight line), two-sided:

$H_0: \rho = 0$   $H_1: \rho \neq 0$

test criterion: $t = \dfrac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$

table value (two-sided): $t_{\alpha}(n - k)$

If test criterion > table value → $H_0$ is rejected and $H_1$ is valid; if α > p-value → $H_0$ is rejected.
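For a straight line the two test criteria coincide: t computed from $b_1/s_b$ equals t computed from r. A short sketch with made-up data confirms this:

```python
import math

# Invented sample
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

num = n * sxy - sx * sy
b1 = num / (n * sxx - sx ** 2)
r = num / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Sample standard deviations of x and y
s_x = math.sqrt((sxx - sx ** 2 / n) / (n - 1))
s_y = math.sqrt((syy - sy ** 2 / n) / (n - 1))

s_b = (s_y / s_x) * math.sqrt((1 - r ** 2) / (n - 2))  # estimate for b1
t_b = b1 / s_b                                         # test of b1
t_r = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)     # test of r
```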
Coefficient of correlation estimation

For small samples and non-normal distributions use the Fisher Z transformation: first r is assigned to Z (by tables).

Interval estimate for the unknown ρ:

$Z_{1,2} = Z \mp u_{1-\alpha/2}\,\dfrac{1}{\sqrt{n - 3}}$

In the last step $Z_1$ and $Z_2$ are assigned back to $r_1$ and $r_2$.
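A minimal sketch of the Fisher Z interval; r, n and the standard normal quantile u (1.96 for 95 %) are invented inputs, and the table lookups are replaced by the closed-form transform:

```python
import math

r, n, u = 0.77, 30, 1.96

Z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher Z transformation (atanh r)
half = u / math.sqrt(n - 3)
Z1, Z2 = Z - half, Z + half

# Back-transform Z1, Z2 to r1, r2 (the inverse of the Z transform)
r1, r2 = math.tanh(Z1), math.tanh(Z2)
```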
The summary ANOVA

Variation | Sum of squared deviations | df | Variance
along the regression function | $S_1 = \sum_{i=1}^{n} (y_{i,est} - \bar{y})^2$ | k − 1 | $s_1^2 = \dfrac{S_1}{k - 1}$
across the regression function | $S_r = \sum_{i=1}^{n} (y_i - y_{i,est})^2$ | n − k | $s_r^2 = \dfrac{S_r}{n - k}$

Test criterion: $F = \dfrac{s_1^2}{s_r^2}$
The summary ANOVA (alternatively)

test criterion: $F = \dfrac{R^2/(k - 1)}{(1 - R^2)/(n - k)}$

table value: $F_{\alpha}(k - 1,\, n - k)$
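Both ANOVA forms must yield the same F for the same fit; a sketch with toy data and k = 2:

```python
# Invented sample; k = 2 parameters for a straight line
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n, k = len(x), 2

sx, sy = sum(x), sum(y)
b1 = (n * sum(a * b for a, b in zip(x, y)) - sx * sy) / (n * sum(v * v for v in x) - sx ** 2)
b0 = (sy - b1 * sx) / n
y_est = [b0 + b1 * v for v in x]
y_bar = sy / n

S1 = sum((yh - y_bar) ** 2 for yh in y_est)           # along the regression function
Sr = sum((yi - yh) ** 2 for yi, yh in zip(y, y_est))  # across the regression function

F_anova = (S1 / (k - 1)) / (Sr / (n - k))   # variance-ratio form
R2 = S1 / (S1 + Sr)                         # coefficient of determination
F_alt = (R2 / (k - 1)) / ((1 - R2) / (n - k))  # R²-based form
```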
Multicollinearity

A relationship between (among) the independent variables: when there is an almost perfect linear relationship among the independent variables ($X_1; X_2 \ldots X_N$), multicollinearity is high.
The relationships must be analyzed before the model is formed, because the linear independence of the columns (variables) is disturbed.
Causes of multicollinearity
trends in time series, similar tendencies among variables (regressors)
inclusion of exogenous variables, lags
using 0/1 coding in the sample
Consequences of multicollinearity
wrong inference: the null hypothesis of a zero regression coefficient is not rejected although it really should be
confidence intervals are wide
regression coefficient estimates are very sensitive to changes in the data
regression coefficients can have the wrong sign
the regression equation is not suitable for prediction
Testing of multicollinearity

Paired coefficients of correlation – t-test

Farrar–Glauber test

test criterion: $B = -\left(n - 1 - \dfrac{2p + 5}{6}\right)\ln|R|$

table value: $\chi^2_{1-\alpha}\!\left(\dfrac{p(p-1)}{2}\right)$

If test criterion > table value → $H_0$ is rejected.
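A sketch of the Farrar–Glauber criterion for the simplest case of p = 2 explanatory variables, where the determinant of the correlation matrix R reduces to $1 - r_{12}^2$; n and $r_{12}$ are invented:

```python
import math

n, p = 30, 2
r12 = 0.85                 # made-up pairwise correlation of X1 and X2
det_R = 1 - r12 ** 2       # |R| for a 2x2 correlation matrix

B = -(n - 1 - (2 * p + 5) / 6) * math.log(det_R)  # test criterion
df = p * (p - 1) // 2                             # chi-square degrees of freedom
```

Here B far exceeds the χ²₀.₉₅(1) table value of about 3.84, so H₀ (no multicollinearity) would be rejected.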
Elimination of multicollinearity
excluding variables
getting a new sample
re-formulating and rethinking the model once again (the chosen variables)
transforming variables – recalculating the chosen variables (e.g. not total consumption, but consumption per capita, etc.)
Regression diagnostics
Data quality for the chosen model
Suitable model for the chosen dataset
Method conditions
Data quality evaluation
A) outlying observations in the "y" set – Studentized residuals
|SR| > 2 → outlying observation
An outlying observation need not be influential (an influential observation has a cardinal influence on the regression).
Data quality evaluation
B) outlying observations in the "x" set – Hat Diag leverage
$h_{ii}$ – diagonal values of the hat matrix H
$H = X\,(X^{\mathsf T} X)^{-1}\,X^{\mathsf T}$
$h_{ii} > \dfrac{2p}{n}$ → outlying observation
Data quality evaluation
C) influential observations
Cook's D (an influential observation influences the whole equation):
$D_i > \dfrac{4}{n}$ → influential observation
Welsch–Kuh DFFITS distance (an influential observation influences the smoothed observation):
$|DFFITS| > 2\sqrt{\dfrac{p}{n}}$ → influential observation
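The two threshold rules above can be written as small predicates; the function names are mine, not a standard API, and p denotes the number of model parameters:

```python
def outlying_by_leverage(h_ii, p, n):
    """h_ii above 2p/n flags an outlying observation in the x set."""
    return h_ii > 2 * p / n

def influential_by_dffits(dffits, p, n):
    """|DFFITS| above 2*sqrt(p/n) flags an influential observation."""
    return abs(dffits) > 2 * (p / n) ** 0.5
```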
Method conditions
regression parameters can take any value in (−∞; +∞)
the regression model is linear in its parameters (if not linear – transform the data)
independence of the residuals
normal distribution of the residuals, N(0; σ²)