Data mining and statistical learning - lab2-4

Lab 2, assignment 1: OLS regression of electricity consumption on temperature at 53 sites

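[Figure: estimated OLS regression parameters (vertical axis "Parameter", -10000 to 10000) for each temperature site (horizontal axis "Predictor"). Labelled sites: ARJEPLOG, BR_M_N, FLODA, GUSTAVSFORS, HELSINGBORG, KOLM_RDEN_STR_MS..., LULE__KALLAX, MORA, R_NGEDALA, SKILLINGE, SVANBERGA, VILHELMINA, _LVSBYN, _VERKALIX, _SVARTBYN.]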


SAS code for ridge regression

/* Ridge parameter from 0 to 10 in steps of 1; estimates saved in dtempbeta */
proc reg data=mining.dailytemperature outest=dtempbeta ridge=0 to 10 by 1;
   model daily_consumption = stockholm g_teborg malm_ / p;
   output out=olsoutput pred=olspred;
run;

proc print data=dtempbeta;
run;

_TYPE_  _DEPVAR_           _RIDGE_   _RMSE_  Intercept  STOCKHOLM  G_TEBORG    MALM_
PARMS   Daily_Consumption           30845.8   480268.9    -5364.6    -548.3  -3598.2
RIDGE   Daily_Consumption        0  30845.8   480268.9    -5364.6    -548.3  -3598.2
RIDGE   Daily_Consumption        1  36314.6   462824.0    -2327.8   -2357.6  -2512.6
RIDGE   Daily_Consumption        2  43008.7   450349.7    -1830.1   -1899.4  -2011.6
RIDGE   Daily_Consumption        3  48325.9   442054.5    -1514.3   -1584.8  -1674.9
RIDGE   Daily_Consumption        4  52401.2   436146.6    -1292.7   -1358.6  -1434.4
RIDGE   Daily_Consumption        5  55571.5   431726.2    -1128.0   -1188.6  -1254.1
RIDGE   Daily_Consumption        6  58092.1   428294.6    -1000.8   -1056.3  -1114.1
RIDGE   Daily_Consumption        7  60138.0   425553.4     -899.4    -950.4  -1002.1
RIDGE   Daily_Consumption        8  61829.0   423313.5     -816.7    -863.8   -910.6
RIDGE   Daily_Consumption        9  63248.9   421448.8     -747.9    -791.7   -834.4
RIDGE   Daily_Consumption       10  64457.3   419872.4     -689.8    -730.6   -770.0
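The overall pattern in the output above (as the ridge parameter grows, the coefficient vector shrinks and the root mean squared error increases) can be reproduced on a small scale. The following is a minimal Python sketch, not the SAS computation: the data are made up, the function names are hypothetical, and the ridge parameter k is applied directly to the centred cross-product matrix, so its scale need not match the SAS ridge= values.

```python
# Minimal ridge-regression sketch on hypothetical data (standard library only).
# With centred data the coefficients solve (X'X + k*I) b = X'y.

def centre(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def ridge_two_predictors(x1, x2, y, k):
    """Return (b1, b2) for centred data and ridge parameter k >= 0."""
    s11 = sum(a * a for a in x1) + k
    s22 = sum(b * b for b in x2) + k
    s12 = sum(a * b for a, b in zip(x1, x2))
    t1 = sum(a * c for a, c in zip(x1, y))
    t2 = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12          # determinant of the 2x2 system
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

def rmse(x1, x2, y, b):
    resid = [c - b[0] * a1 - b[1] * a2 for a1, a2, c in zip(x1, x2, y)]
    return (sum(r * r for r in resid) / len(resid)) ** 0.5

# Hypothetical, strongly correlated "temperatures" and a response.
x1 = centre([-5.0, -2.0, 0.0, 3.0, 6.0, 10.0, 15.0, 18.0])
x2 = centre([-4.0, -2.5, 1.0, 2.5, 7.0, 9.0, 16.0, 17.0])
y = centre([640.0, 600.0, 570.0, 540.0, 480.0, 450.0, 380.0, 350.0])

results = []
for k in [0, 1, 10, 100, 1000]:
    b = ridge_two_predictors(x1, x2, y, k)
    err = rmse(x1, x2, y, b)
    results.append((k, b, err))
    print(f"k={k:5d}  b1={b[0]:8.3f}  b2={b[1]:8.3f}  RMSE={err:7.3f}")
```

k = 0 gives the OLS fit. Larger k trades a worse fit to the training data (bias) for more stable coefficients (lower variance).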


Estimated regression parameters in ridge regression

[Figure: estimated parameters (vertical axis "Parameter", -6000 to 0) against the ridge parameter (horizontal axis "Shrinkage", 0 to 10) for STOCKHOLM, G_TEBORG and MALM_.]


Predicted vs observed values in OLS regression and ridge regression

- trade-off between variance and bias

[Figure: predicted against observed values (both axes 200000 to 700000) for OLS regression and ridge regression.]


Fat content vs absorbance in different channels (wavelengths)

[Figure: fat content (%), 0 to 60, against absorbance, 2 to 6, for Channel 1, Channel 40, Channel 60 and Channel 100.]


OLS regression fat vs channel10, channel30, channel50, channel70, channel90

Model: MODEL1
Dependent Variable: Fat
Number of Observations Used: 215

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5            29469    5893.82795    233.90   <.0001
Error            209       5266.30507      25.19763
Corrected Total  214            34735

Root MSE          5.01972    R-Square   0.8484
Dependent Mean   18.14233    Adj R-Sq   0.8448
Coeff Var        27.66858

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1             42.20859          3.82059     11.05     <.0001
Channel10   Channel10    1           -245.26494         10.05660    -24.39     <.0001
Channel30   Channel30    1            361.41787         23.53244     15.36     <.0001
Channel50   Channel50    1           -203.28522         33.52937     -6.06     <.0001
Channel70   Channel70    1            104.37041         19.91571      5.24     <.0001
Channel90   Channel90    1            -34.48938          9.15823     -3.77     0.0002


OLS regression fat vs channel1 – channel100

Model: MODEL1
Dependent Variable: Fat
Number of Observations Used: 215

Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             53            34326     647.66185    254.72   <.0001
Error            161        409.36692       2.54265
Corrected Total  214            34735

Root MSE          1.59457    R-Square   0.9882
Dependent Mean   18.14233    Adj R-Sq   0.9843
Coeff Var         8.78922

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.

NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

Channel3 = -9.37E-6 * Intercept - 0.03975 * Channel1 + 0.47341 * Channel2
           - 0.66366 * Channel4 + 0.1448 * Channel6 - 0.04202 * Channel8
           - 0.0296 * Channel10 + 0.04022 * Channel12 - 0.1013 * Channel14
           + 0.08297 * Channel16 + 0.09432 * Channel18 - 0.1725 * Channel20
           + 0.07997 * Channel21 - 0.00495 * Channel23 + 0.02818 * Channel25
           + 0.00606 * Channel27 - 0.08143 * Channel28 + 0.08083 * Channel30
           - 0.05219 * Channel32 + 0.01912 * Channel33 + 0.01284 * Channel35
           - 0.01179 * Channel36 + 0.03298 * Channel37 - 0.02684 * Channel38
           + 0.00346 * Channel39 - 0.04165 * Channel41 + 0.04493 * Channel42
           - 0.01572 * Channel44 + 0.01452 * Channel46 + 0.00074 * Channel48
           - 0.0342 * Channel49 + 0.08672 * Channel51 - 0.0911 * Channel52
           + 0.03303 * Channel53 - 0.00125 * Channel55 - 0.00744 * Channel56
           + 0.01541 * Channel58 - 0.00663 * Channel59 - 0.02578 * Channel61
           + 0.02883 * Channel63 - 0.01135 * Channel65 + 0.04673 * Channel67
           - 0.04764 * Channel69 - 0.00365 * Channel71 + 0.01601 * Channel73
           - 0.01333 * Channel75 - 0.00651 * Channel77 - 0.00392 * Channel80
           + 0.03827 * Channel83 - 0.02069 * Channel86 + 0.01285 * Channel89
           - 0.01378 * Channel92 - 0.00849 * Channel95 + 0.0093 * Channel98


OLS regression fat vs channel1 – channel100

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    B              7.67989          2.01644      3.81     0.0002
Channel1    Channel1     B           7550.89847       3181.94418      2.37     0.0188
Channel2    Channel2     B          -6236.59799       4650.43463     -1.34     0.1818
Channel3    Channel3     0                    0                .         .          .
Channel4    Channel4     B          -2576.07036       3776.80152     -0.68     0.4962
Channel5    Channel5     0                    0                .         .          .
Channel6    Channel6     B          -7766.73338       4103.41990     -1.89     0.0602
Channel7    Channel7     0                    0                .         .          .
Channel8    Channel8     B           5660.86411       4248.60674      1.33     0.1846
Channel9    Channel9     0                    0                .         .          .
Channel10   Channel10    B           4509.28620       4503.11172      1.00     0.3182
Channel11   Channel11    0                    0                .         .          .
Channel12   Channel12    B           8050.98503       4080.26245      1.97     0.0502
Channel13   Channel13    0                    0                .         .          .
Channel14   Channel14    B          -7368.85561       4319.59587     -1.71     0.0900
Channel15   Channel15    0                    0                .         .          .
Channel16   Channel16    B          -5251.52459       3382.29352     -1.55     0.1225
Channel17   Channel17    0                    0                .         .          .
...


OLS regression with strongly correlated predictors

If the X^T X matrix does not have full rank (some X-variables are linearly dependent), the least-squares solution is not unique

If the X-variables are strongly correlated, then:

(i) the regression coefficients will be uncertain;

(ii) the predictions may be OK
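Both points can be illustrated with the extreme case of exact linear dependence. The following is a minimal Python sketch on made-up data (all names and numbers are hypothetical): the second predictor is an exact multiple of the first, so X^T X is singular, and very different coefficient vectors give identical fitted values.

```python
# Sketch: exact linear dependence makes the least-squares solution non-unique.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0 * v for v in x1]            # x2 is an exact multiple of x1
y = [3.1, 4.9, 7.2, 8.8, 11.0]

# Entries of X'X for the two predictors (no intercept, to keep it short).
s11 = sum(a * a for a in x1)
s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
det = s11 * s22 - s12 * s12           # zero, so X'X has no inverse

def fitted(b1, b2):
    return [b1 * a + b2 * b for a, b in zip(x1, x2)]

def rss(b1, b2):
    return sum((obs - f) ** 2 for obs, f in zip(y, fitted(b1, b2)))

# Least-squares slope on x1 alone; every (b1, b2) with b1 + 2*b2 = slope
# produces the same fitted values and hence the same residual sum of squares.
slope = sum(a * c for a, c in zip(x1, y)) / s11
candidates = [(slope, 0.0), (0.0, slope / 2.0), (slope + 200.0, -100.0)]
for b1, b2 in candidates:
    print(f"b1={b1:9.3f}  b2={b2:9.3f}  RSS={rss(b1, b2):.6f}")
```

With strongly but not exactly correlated predictors the solution is unique, yet nearly equivalent coefficient vectors fit almost equally well, which is why the individual estimates are uncertain while the predictions remain stable.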


Principal Component Analysis of lake survey data

Some variables vary much more than others

How does this influence principal components derived from the covariance and correlation matrices, respectively?
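One way to explore the question is a two-variable example. The following is a minimal Python sketch on made-up data (the variable values and function names are hypothetical, not taken from the lake survey): the first principal component of the covariance matrix is dominated by the variable with the largest variance, whereas the correlation matrix weights both variables equally.

```python
import math

def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def first_pc(a, b, c):
    """Leading eigenvalue and unit eigenvector of [[a, b], [b, c]] (b != 0)."""
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)

# Hypothetical data: one variable ranges over thousands, the other over about one unit.
cl = [12.0, 35.0, 80.0, 150.0, 400.0, 900.0, 2500.0, 6000.0]
ph = [6.1, 6.3, 6.2, 6.6, 6.8, 6.7, 7.1, 7.4]

s_cc, s_pp, s_cp = cov(cl, cl), cov(ph, ph), cov(cl, ph)
r = s_cp / math.sqrt(s_cc * s_pp)     # correlation coefficient

lam_cov, vec_cov = first_pc(s_cc, s_cp, s_pp)   # PCA on the covariance matrix
lam_cor, vec_cor = first_pc(1.0, r, 1.0)        # PCA on the correlation matrix

print("covariance PC1 loadings: ", vec_cov, "share:", lam_cov / (s_cc + s_pp))
print("correlation PC1 loadings:", vec_cor, "share:", lam_cor / 2.0)
```

In this sketch the covariance-based first component is almost entirely the high-variance variable, while the correlation-based first component loads equally on both. Standardizing removes the effect of measurement scale.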

[Figure: Tot-N plotted against Cl (meq/l); both axes run from 0 to 7000.]



- score plot derived from the correlation matrix



- eigenvectors derived from the correlation matrix

[Figure: loadings of PRIN1 and PRIN2 (vertical axis -0.2 to 0.5) for the variables pH_, Cond__mS_m25_C, Ca_meq_l, Mg_meq_l, Na_meq_l, K_meq_l, Alk__Acid_meq_l, SO4_IC_meq_l, Cl_meq_l, NO2_NO3_N_ug_l, Tot_N_ps_ug_l, Tot_P_ug_l, Abs__F_420nm_5c, TOC_mg_l, Si_mg_l.]


Principal Component Analysis of lake survey data with outliers removed

- score plot derived from the correlation matrix



- eigenvectors derived from the correlation matrix (outliers removed)

[Figure: loadings of PRIN 1 and PRIN 2 (vertical axis "Loading", -0.6 to 0.8) for the variables pH_, Cond__mS_m25_C, Ca_meq_l, Mg_meq_l, Na_meq_l, K_meq_l, Alk__Acid_meq_l, SO4_IC_meq_l, Cl_meq_l, NO2_NO3_N_ug_l, Tot_N_ps_ug_l, Tot_P_ug_l, Abs__F_420nm_5cm, TOC_mg_l, Si_mg_l.]



- MINITAB score plot derived from the correlation matrix (outliers removed)

[Figure: "Score Plot of pH, ..., Si mg/l"; First Component (0 to 40) on the horizontal axis, Second Component (-5.0 to 12.5) on the vertical axis.]



- MINITAB loading plot derived from the correlation matrix (outliers removed)

[Figure: "Loading Plot of pH, ..., Si mg/l"; First Component (0.0 to 0.4) on the horizontal axis, Second Component (-0.50 to 0.50) on the vertical axis. Labelled variables: Si mg/l, TOC mg/l, Abs._F 420nm/5cm, Tot-P ug/l, Tot-N_ps ug/l, NO2+NO3-N ug/l, Cl meq/l, SO4_IC meq/l, Alk./Acid meq/l, K meq/l, Na meq/l, Mg meq/l, Ca meq/l, Cond. mS/m 25°C, pH.]


Regression of an indicator matrix

[Figure: objects of Class 1 and Class 2 plotted in the (x1, x2) plane; x1 from 2 to 10, x2 from 0 to 16.]

Find a linear function

$$\hat{f}_1(x_1, x_2) = \hat{\beta}_{10} + \hat{\beta}_{11} x_1 + \hat{\beta}_{12} x_2$$

which is (on average) one for objects in class 1 and otherwise (on average) zero

Find a linear function

$$\hat{f}_2(x_1, x_2) = \hat{\beta}_{20} + \hat{\beta}_{21} x_1 + \hat{\beta}_{22} x_2$$

which is (on average) one for objects in class 2 and otherwise (on average) zero

Assign a new object to class 1 if

$$\hat{f}_1(x_1, x_2) > \hat{f}_2(x_1, x_2)$$
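The rule above can be sketched in a few lines. The following is a minimal Python illustration under stated assumptions: the training points are made up, the helper names (solve, ols, predict, classify) are hypothetical, and each linear function is fitted by ordinary least squares to a 0/1 indicator of class membership.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, point):
    return beta[0] + sum(b * v for b, v in zip(beta[1:], point))

# Hypothetical training data: (x1, x2) points and their class labels.
points = [(3, 4), (4, 6), (5, 5), (4, 3), (6, 11), (7, 13), (8, 12), (9, 14)]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

# One indicator response per class: 1 if the object belongs to it, else 0.
beta1 = ols(points, [1.0 if g == 1 else 0.0 for g in labels])
beta2 = ols(points, [1.0 if g == 2 else 0.0 for g in labels])

def classify(point):
    return 1 if predict(beta1, point) > predict(beta2, point) else 2

print(classify((4, 5)), classify((8, 13)))
```

Because the two indicators sum to one and the model contains an intercept, the two fitted functions also sum to one at every point, so the rule is equivalent to assigning class 1 when the first function exceeds 0.5.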


Discriminant analysis

- decision border

[Figure: Class 1 and Class 2 in the (x1, x2) plane together with the decision border ("Discr."); x1 from 2 to 10, x2 from 0 to 16.]


3D-plot of an indicator matrix for class 1

[Figure: "3D Scatterplot of Class_1 vs x2 vs x1"; the Class_1 axis runs from 0.0 to 1.0.]


3D-plot of an indicator matrix for class 2

[Figure: "3D Scatterplot of Class_2 vs x2 vs x1"; the Class_2 axis runs from 0.0 to 1.0.]


Regression of an indicator matrix

- discriminating function

[Figure: Class 1, Class 2 and Class 3 in the (x1, x2) plane; x1 from 2 to 10, x2 from 0 to 25.]

Estimate discriminant functions $\delta_k(x)$ for each class, and then assign a new object to the class whose discriminant function takes the largest value


Linear discriminant analysis (LDA)

LDA is an optimal classification method when the data arise from Gaussian distributions with different means and a common covariance matrix
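Under these assumptions the standard linear discriminant function for class k is x^T S^-1 m_k - 0.5 m_k^T S^-1 m_k + log(p_k), where m_k is the class mean, S the pooled covariance matrix and p_k the class proportion. The following is a minimal Python sketch for two predictors; the data and the function names (lda_fit, discriminant, classify) are hypothetical and serve only as an illustration.

```python
import math

def lda_fit(groups):
    """groups: dict label -> list of (x1, x2). Returns means, priors, inverse pooled covariance."""
    n_total = sum(len(pts) for pts in groups.values())
    means, priors = {}, {}
    s11 = s12 = s22 = 0.0
    for label, pts in groups.items():
        m1 = sum(p[0] for p in pts) / len(pts)
        m2 = sum(p[1] for p in pts) / len(pts)
        means[label], priors[label] = (m1, m2), len(pts) / n_total
        for a, b in pts:                 # accumulate within-class scatter
            s11 += (a - m1) ** 2
            s12 += (a - m1) * (b - m2)
            s22 += (b - m2) ** 2
    dof = n_total - len(groups)          # pooled within-class degrees of freedom
    s11, s12, s22 = s11 / dof, s12 / dof, s22 / dof
    det = s11 * s22 - s12 * s12
    inv = ((s22 / det, -s12 / det), (-s12 / det, s11 / det))
    return means, priors, inv

def discriminant(x, mean, prior, inv):
    """Linear discriminant: x' S^-1 m - 0.5 m' S^-1 m + log(prior)."""
    a1 = inv[0][0] * mean[0] + inv[0][1] * mean[1]   # first element of S^-1 m
    a2 = inv[1][0] * mean[0] + inv[1][1] * mean[1]   # second element of S^-1 m
    return x[0] * a1 + x[1] * a2 - 0.5 * (mean[0] * a1 + mean[1] * a2) + math.log(prior)

def classify(x, means, priors, inv):
    return max(means, key=lambda k: discriminant(x, means[k], priors[k], inv))

# Hypothetical training data for three classes.
groups = {
    "Class 1": [(3, 6), (4, 7), (5, 6), (4, 5)],
    "Class 2": [(7, 11), (8, 12), (9, 11), (8, 10)],
    "Class 3": [(5, 15), (6, 16), (7, 15), (6, 14)],
}
means, priors, inv = lda_fit(groups)
for point in [(4, 6), (8, 11), (6, 15), (6, 9)]:
    print(point, "->", classify(point, means, priors, inv))
```

Since the covariance matrix is shared by all classes, the quadratic term in x cancels when two classes are compared, which is why the resulting decision borders are straight lines.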

[Figure: three classes (Class 1, Class 2, Class 3) plotted in two dimensions; horizontal axis 2 to 12, vertical axis 4 to 18.]
