Date posted: 19-Dec-2015
Data mining and statistical learning - lab2-4
Lab 2, assignment 1: OLS regression of electricity
consumption on temperature at 53 sites
[Figure: estimated regression parameters (y-axis "Parameter", -10000 to 10000) for each predictor site (x-axis "Predictor"); labelled sites include ARJEPLOG, FLODA, GUSTAVSFORS, HELSINGBORG, MORA and SVANBERGA]
SAS code for ridge regression
/* ridge= requests ridge estimates for k = 0, 1, ..., 10; outest= saves them */
proc reg data=mining.dailytemperature outest=dtempbeta ridge=0 to 10 by 1;
    model daily_consumption = stockholm g_teborg malm_ / p;
    output out=olsoutput pred=olspred;
proc print data=dtempbeta;
run;

_TYPE_  _DEPVAR_           _RIDGE_   _RMSE_   Intercept  STOCKHOLM  G_TEBORG    MALM_
PARMS   Daily_Consumption           30845.8    480268.9    -5364.6    -548.3  -3598.2
RIDGE   Daily_Consumption      0    30845.8    480268.9    -5364.6    -548.3  -3598.2
RIDGE   Daily_Consumption      1    36314.6    462824.0    -2327.8   -2357.6  -2512.6
RIDGE   Daily_Consumption      2    43008.7    450349.7    -1830.1   -1899.4  -2011.6
RIDGE   Daily_Consumption      3    48325.9    442054.5    -1514.3   -1584.8  -1674.9
RIDGE   Daily_Consumption      4    52401.2    436146.6    -1292.7   -1358.6  -1434.4
RIDGE   Daily_Consumption      5    55571.5    431726.2    -1128.0   -1188.6  -1254.1
RIDGE   Daily_Consumption      6    58092.1    428294.6    -1000.8   -1056.3  -1114.1
RIDGE   Daily_Consumption      7    60138.0    425553.4     -899.4    -950.4  -1002.1
RIDGE   Daily_Consumption      8    61829.0    423313.5     -816.7    -863.8   -910.6
RIDGE   Daily_Consumption      9    63248.9    421448.8     -747.9    -791.7   -834.4
RIDGE   Daily_Consumption     10    64457.3    419872.4     -689.8    -730.6   -770.0
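The shrinkage seen in the ridge output can be illustrated with a minimal Python/NumPy sketch. It is not a reimplementation of PROC REG: the data are synthetic stand-ins for three strongly correlated temperature series, the helper name `ridge_fit` is hypothetical, and the penalty `lam` is applied to centred, unscaled predictors, so its scale need not match the SAS `ridge=` values.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate with an unpenalised intercept (predictors are centred)."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    p = X.shape[1]
    # Solve (Xc'Xc + lam * I) beta = Xc'yc
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ beta
    return intercept, beta

# Synthetic data (assumption): three temperature series sharing one signal
rng = np.random.default_rng(0)
n = 200
base = rng.normal(5.0, 8.0, size=n)
X = np.column_stack([base + rng.normal(0.0, 1.0, n) for _ in range(3)])
y = 480000.0 - 3000.0 * X.sum(axis=1) + rng.normal(0.0, 30000.0, n)

# lam = 0 reproduces OLS; larger values shrink the slopes towards zero
ridge_path = {lam: ridge_fit(X, y, lam) for lam in (0.0, 10.0, 100.0, 1000.0)}
for lam, (b0, b) in ridge_path.items():
    print(lam, round(float(b0), 1), np.round(b, 1))
```

With `lam = 0` the function returns the OLS solution, and the length of the slope vector decreases as `lam` grows, which is the pattern in the table above.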
Estimated regression parameters in ridge regression
[Figure: parameter estimates (y-axis "Parameter", -6000 to 0) for STOCKHOLM, G_TEBORG and MALM_ versus shrinkage (x-axis "Shrinkage", 0 to 10)]
Predicted vs observed values
in OLS regression and ridge regression
- trade-off between variance and bias
[Figure: predicted versus observed values (both axes 200000 to 700000) for OLS regression and ridge regression]
Fat content vs absorbance in different channels (wavelengths)
[Figure: fat content (%) (y-axis, 0 to 60) versus absorbance (x-axis, 2 to 6) for Channel 1, Channel 40, Channel 60 and Channel 100]
OLS regression fat vs channel10, channel30, channel50,
channel70, channel90
Model: MODEL1
Dependent Variable: Fat
Number of Observations Used: 215

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5            29469    5893.82795    233.90   <.0001
Error            209       5266.30507      25.19763
Corrected Total  214            34735

Root MSE          5.01972    R-Square   0.8484
Dependent Mean   18.14233    Adj R-Sq   0.8448
Coeff Var        27.66858

Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1             42.20859          3.82059     11.05     <.0001
Channel10   Channel10    1           -245.26494         10.05660    -24.39     <.0001
Channel30   Channel30    1            361.41787         23.53244     15.36     <.0001
Channel50   Channel50    1           -203.28522         33.52937     -6.06     <.0001
Channel70   Channel70    1            104.37041         19.91571      5.24     <.0001
Channel90   Channel90    1            -34.48938          9.15823     -3.77     0.0002
OLS regression fat vs channel1 – channel 100
Model: MODEL1
Dependent Variable: Fat
Number of Observations Used: 215

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model             53            34326     647.66185    254.72   <.0001
Error            161        409.36692       2.54265
Corrected Total  214            34735

Root MSE          1.59457    R-Square   0.9882
Dependent Mean   18.14233    Adj R-Sq   0.9843
Coeff Var         8.78922

NOTE: Model is not full rank. Least-squares solutions for the parameters are not unique. Some statistics will be misleading. A reported DF of 0 or B means that the estimate is biased.
NOTE: The following parameters have been set to 0, since the variables are a linear combination of other variables as shown.

Channel3 = -9.37E-6 * Intercept - 0.03975 * Channel1 + 0.47341 * Channel2
    - 0.66366 * Channel4 + 0.1448 * Channel6 - 0.04202 * Channel8
    - 0.0296 * Channel10 + 0.04022 * Channel12 - 0.1013 * Channel14
    + 0.08297 * Channel16 + 0.09432 * Channel18 - 0.1725 * Channel20
    + 0.07997 * Channel21 - 0.00495 * Channel23 + 0.02818 * Channel25
    + 0.00606 * Channel27 - 0.08143 * Channel28 + 0.08083 * Channel30
    - 0.05219 * Channel32 + 0.01912 * Channel33 + 0.01284 * Channel35
    - 0.01179 * Channel36 + 0.03298 * Channel37 - 0.02684 * Channel38
    + 0.00346 * Channel39 - 0.04165 * Channel41 + 0.04493 * Channel42
    - 0.01572 * Channel44 + 0.01452 * Channel46 + 0.00074 * Channel48
    - 0.0342 * Channel49 + 0.08672 * Channel51 - 0.0911 * Channel52
    + 0.03303 * Channel53 - 0.00125 * Channel55 - 0.00744 * Channel56
    + 0.01541 * Channel58 - 0.00663 * Channel59 - 0.02578 * Channel61
    + 0.02883 * Channel63 - 0.01135 * Channel65 + 0.04673 * Channel67
    - 0.04764 * Channel69 - 0.00365 * Channel71 + 0.01601 * Channel73
    - 0.01333 * Channel75 - 0.00651 * Channel77 - 0.00392 * Channel80
    + 0.03827 * Channel83 - 0.02069 * Channel86 + 0.01285 * Channel89
    - 0.01378 * Channel92 - 0.00849 * Channel95 + 0.0093 * Channel98
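The "not full rank" note can be reproduced on a small scale with a Python/NumPy sketch. The data are synthetic (an assumption, not the fat/absorbance data), and the predictor `x3` is constructed as an exact linear combination of `x1` and `x2`. Several different coefficient vectors then give identical fitted values, including one where the redundant coefficient is set to 0 as in the SAS output.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.5 * x1 - 2.0 * x2                  # exact linear combination
X = np.column_stack([np.ones(n), x1, x2, x3])
y = 1.0 + 2.0 * x1 - x2 + rng.normal(scale=0.1, size=n)

# The design matrix has 4 columns but only 3 independent ones
rank = np.linalg.matrix_rank(X)

# Solution 1: the minimum-norm least-squares solution
beta_min, *_ = np.linalg.lstsq(X, y, rcond=None)

# Solution 2: move along a null-space direction of X (X @ null_dir == 0)
null_dir = np.array([0.0, 0.5, -2.0, -1.0])
beta_alt = beta_min + 3.0 * null_dir

# Solution 3: drop the redundant column, i.e. set its coefficient to 0
beta_sub, *_ = np.linalg.lstsq(X[:, :3], y, rcond=None)
beta_zero = np.append(beta_sub, 0.0)

fits = [X @ b for b in (beta_min, beta_alt, beta_zero)]
print("rank:", rank)
print("same fitted values:", np.allclose(fits[0], fits[1]) and np.allclose(fits[0], fits[2]))
print("same coefficients:", np.allclose(beta_min, beta_alt))
```

The fitted values agree while the coefficients do not, so individual estimates from a rank-deficient model should not be interpreted.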
OLS regression fat vs channel1 – channel 100
Parameter Estimates
Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    B              7.67989          2.01644      3.81     0.0002
Channel1    Channel1     B           7550.89847       3181.94418      2.37     0.0188
Channel2    Channel2     B          -6236.59799       4650.43463     -1.34     0.1818
Channel3    Channel3     0                    0                .         .          .
Channel4    Channel4     B          -2576.07036       3776.80152     -0.68     0.4962
Channel5    Channel5     0                    0                .         .          .
Channel6    Channel6     B          -7766.73338       4103.41990     -1.89     0.0602
Channel7    Channel7     0                    0                .         .          .
Channel8    Channel8     B           5660.86411       4248.60674      1.33     0.1846
Channel9    Channel9     0                    0                .         .          .
Channel10   Channel10    B           4509.28620       4503.11172      1.00     0.3182
Channel11   Channel11    0                    0                .         .          .
Channel12   Channel12    B           8050.98503       4080.26245      1.97     0.0502
Channel13   Channel13    0                    0                .         .          .
Channel14   Channel14    B          -7368.85561       4319.59587     -1.71     0.0900
Channel15   Channel15    0                    0                .         .          .
Channel16   Channel16    B          -5251.52459       3382.29352     -1.55     0.1225
Channel17   Channel17    0                    0                .         .          .
...
OLS regression with strongly correlated predictors
If the $X^{T}X$ matrix does not have full rank (some X-variables are linearly dependent), the least-squares solution is not unique
If the X-variables are strongly correlated, then:
(i) the estimated regression coefficients will be uncertain (large standard errors);
(ii) the predictions may still be good
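Points (i) and (ii) can be checked by simulation. The Python/NumPy sketch below uses synthetic data with two almost identical predictors (all values are assumptions chosen for illustration), refits OLS on many fresh samples, and compares the spread of the slope estimates with the spread of the prediction at one typical point.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 500
x_new = np.array([1.0, 1.0, 1.0])       # intercept, x1, x2 at a typical point

coefs = np.empty((reps, 3))
preds = np.empty(reps)
for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.01, size=n)    # correlation with x1 close to 1
    X = np.column_stack([np.ones(n), x1, x2])
    y = 2.0 + x1 + x2 + rng.normal(scale=1.0, size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs[r] = beta
    preds[r] = x_new @ beta

coef_sd = coefs.std(axis=0)     # spread of intercept and slope estimates
pred_sd = preds.std()           # spread of the prediction at x_new
print("sd of estimates (intercept, x1, x2):", np.round(coef_sd, 2))
print("sd of prediction at x_new:", round(float(pred_sd), 3))
```

The two slope estimates vary widely between samples because only their sum is well determined, while the prediction, which depends mainly on that sum, stays stable.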
Principal Component Analysis of lake survey data
Some variables vary much more than others
How does this influence principal components derived from the covariance and correlation matrices, respectively?
[Figure: scatter plot of Tot-N (mg/l) versus Cl (meq/l); both axes run from 0 to 7000]
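The question about covariance versus correlation matrices can be explored with a small Python/NumPy sketch. The three synthetic variables are assumptions (they are not the lake survey data): two are correlated with each other but have small variance, and one is independent with very large variance. The helper `pca` is a hypothetical name.

```python
import numpy as np

def pca(data, use_correlation):
    """Eigen-decomposition of the covariance or correlation matrix."""
    centred = data - data.mean(axis=0)
    if use_correlation:
        centred = centred / data.std(axis=0, ddof=1)   # standardise each column
    cov = np.cov(centred, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]                  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centred @ eigvecs                         # principal component scores
    return eigvals, eigvecs, scores

rng = np.random.default_rng(3)
n = 300
common = rng.normal(size=n)
small = common + rng.normal(scale=0.5, size=n)
medium = 10.0 * (common + rng.normal(scale=0.5, size=n))
large = 1000.0 * rng.normal(size=n)                    # unrelated, huge variance
data = np.column_stack([small, medium, large])

val_cov, vec_cov, _ = pca(data, use_correlation=False)
val_cor, vec_cor, _ = pca(data, use_correlation=True)
print("first eigenvector, covariance matrix: ", np.round(vec_cov[:, 0], 3))
print("first eigenvector, correlation matrix:", np.round(vec_cor[:, 0], 3))
```

With the covariance matrix, the first component is essentially the high-variance variable alone. With the correlation matrix, every variable has unit variance, so the first component picks up the shared structure of the two correlated variables instead.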
Principal Component Analysis of lake survey data
- score plot derived from the correlation matrix
Principal Component Analysis of lake survey data
- eigenvectors derived from the correlation matrix
[Figure: PRIN1 and PRIN2 eigenvector elements (y-axis, -0.2 to 0.5) for the variables pH, Cond. (mS/m, 25 °C), Ca, Mg, Na, K, Alk./Acid, SO4_IC and Cl (meq/l), NO2+NO3-N, Tot-N_ps and Tot-P (ug/l), Abs._F 420nm/5cm, TOC (mg/l) and Si (mg/l)]
Principal Component Analysis of lake survey data with
outliers removed
- score plot derived from the correlation matrix
Principal Component Analysis of lake survey data with
outliers removed
- eigenvectors derived from the correlation matrix
[Figure: PRIN 1 and PRIN 2 loadings (y-axis "Loading", -0.6 to 0.8) for the variables pH, Cond. (mS/m, 25 °C), Ca, Mg, Na, K, Alk./Acid, SO4_IC and Cl (meq/l), NO2+NO3-N, Tot-N_ps and Tot-P (ug/l), Abs._F 420nm/5cm, TOC (mg/l) and Si (mg/l)]
Principal Component Analysis of lake survey data with
outliers removed
- MINITAB score plot derived from the correlation matrix
[Figure: Score Plot of pH, ..., Si mg/l; x-axis "First Component" (0 to 40), y-axis "Second Component" (-5.0 to 12.5)]
Principal Component Analysis of lake survey data with
outliers removed
- MINITAB loading plot derived from the correlation matrix
[Figure: Loading Plot of pH, ..., Si mg/l; x-axis "First Component" (0.0 to 0.4), y-axis "Second Component" (-0.50 to 0.50); plotted variables: Si mg/l, TOC mg/l, Abs._F 420nm/5cm, Tot-P ug/l, Tot-N_ps ug/l, NO2+NO3-N ug/l, Cl meq/l, SO4_IC meq/l, Alk./Acid meq/l, K meq/l, Na meq/l, Mg meq/l, Ca meq/l, Cond. mS/m 25 °C, pH]
Regression of an indicator matrix
[Figure: scatter plot of x2 (0 to 16) versus x1 (2 to 10) for objects in Class 1 and Class 2]
Find a linear function
$\hat{f}_1(x_1, x_2) = \hat{\beta}_{10} + \hat{\beta}_{11} x_1 + \hat{\beta}_{12} x_2$
which is (on average) one for objects in class 1 and otherwise (on average) zero
Find a linear function
$\hat{f}_2(x_1, x_2) = \hat{\beta}_{20} + \hat{\beta}_{21} x_1 + \hat{\beta}_{22} x_2$
which is (on average) one for objects in class 2 and otherwise (on average) zero
Assign a new object to class 1 if
$\hat{f}_1(x_1, x_2) > \hat{f}_2(x_1, x_2)$
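The two-class rule can be sketched in Python/NumPy as follows. The class centres and spreads are made-up values (assumptions), and `classify` is a hypothetical helper. Each indicator variable is regressed on x1 and x2 by ordinary least squares, and the two fitted functions are compared.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
# Synthetic training data (assumption): two well-separated point clouds
class1 = rng.normal(loc=[4.0, 5.0], scale=1.0, size=(n, 2))
class2 = rng.normal(loc=[8.0, 11.0], scale=1.0, size=(n, 2))
points = np.vstack([class1, class2])
X = np.column_stack([np.ones(2 * n), points])      # columns: 1, x1, x2

y1 = np.r_[np.ones(n), np.zeros(n)]     # indicator: 1 for class 1, else 0
y2 = 1.0 - y1                           # indicator: 1 for class 2, else 0

# OLS fit of each indicator gives the coefficients of f1 and f2
beta1, *_ = np.linalg.lstsq(X, y1, rcond=None)
beta2, *_ = np.linalg.lstsq(X, y2, rcond=None)

def classify(x1, x2):
    """Assign to class 1 if f1(x1, x2) > f2(x1, x2), otherwise to class 2."""
    row = np.array([1.0, x1, x2])
    return 1 if row @ beta1 > row @ beta2 else 2

print(classify(4.0, 5.0), classify(8.0, 11.0))
```

Because the two indicators sum to one and the model contains an intercept, the two fitted functions also sum to one, so comparing them is the same as checking whether the class 1 function exceeds 0.5.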
Discriminant analysis
- decision border
[Figure: scatter plot of x2 (0 to 16) versus x1 (2 to 10) for Class 1 and Class 2, with the discriminant decision border ("Discr.")]
3D-plot of an indicator matrix for class 1
[Figure: 3D Scatterplot of Class_1 (0.0 to 1.0) vs x2 vs x1]
3D-plot of an indicator matrix for class 2
[Figure: 3D Scatterplot of Class_2 (0.0 to 1.0) vs x2 vs x1]
Regression of an indicator matrix
- discriminating function
[Figure: scatter plot of x2 (0 to 25) versus x1 (2 to 10) for Class 1, Class 2 and Class 3]
Estimate discriminant functions $\delta_k(x)$ for each class, and then classify a new object to the class with the largest value of its discriminant function
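For more than two classes the same idea can be written in matrix form. The Python/NumPy sketch below assumes that the discriminant functions are the fitted linear functions from regressing an N x K indicator matrix on the predictors. The three class centres are made-up values, classes are coded 0, 1, 2 in the code, and both function names are hypothetical.

```python
import numpy as np

def fit_indicator_regression(points, labels, n_classes):
    """Regress the N x K indicator matrix on (1, x1, ..., xp) by least squares."""
    X = np.column_stack([np.ones(len(points)), points])
    Y = np.eye(n_classes)[labels]            # row i has a 1 in column labels[i]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B                                 # shape (p + 1, K), one column per class

def predict(B, points):
    """Pick the class whose fitted function is largest for each object."""
    X = np.column_stack([np.ones(len(points)), points])
    return np.argmax(X @ B, axis=1)

# Synthetic training data (assumption): three clouds that are not on one line
rng = np.random.default_rng(5)
centres = np.array([[3.0, 4.0], [8.0, 6.0], [5.0, 20.0]])
n_per_class = 100
labels = np.repeat(np.arange(3), n_per_class)
points = centres[labels] + rng.normal(scale=1.0, size=(3 * n_per_class, 2))

B = fit_indicator_regression(points, labels, n_classes=3)
accuracy = np.mean(predict(B, points) == labels)
print("training accuracy:", round(float(accuracy), 3))
print("classes assigned to the three centres:", predict(B, centres))
```

The class centres here form a triangle on purpose. When the centres lie close to a straight line, the fitted function of the middle class may never be the largest, so that class can be masked by the other two.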