Automatic Feature Selection
Feb 2015
Update on Hadoop / R
Try the HortonWorks Sandbox
Get a VM player
Download and install the OVA (VM file from HortonWorks): http://hortonworks.com/products/hortonworks-sandbox/#install
Do the tutorials here: http://hortonworks.com/tutorials/
Add R / RStudio Server to your VM
Use RHadoop to interface Hadoop and R
Issue
Many predictive analytical models will work on a given data set. Which of them is best?
Example Data – HVAC building log data
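| Variable | Row 1 | Row 2 | Row 3 | Row 4 |
| --- | --- | --- | --- | --- |
| date | 6/1/13 | 6/1/13 | 6/1/13 | 6/25/13 |
| time | 0:00:01 | 0:00:01 | 0:00:01 | 0:13:19 |
| target.temp | 69 | 66 | 69 | 70 |
| actual.temp | 55 | 58 | 60 | 71 |
| system | 14 | 13 | 5 | 19 |
| system.age | 6 | 20 | 8 | 14 |
| building.id | 17 | 4 | 7 | 18 |
| temp.diff | 14 | 8 | 9 | -1 |
| temp.range | COLD | COLD | COLD | NORMAL |
| extreme.temp | 1 | 1 | 1 | 0 |
| country | Egypt | Finland | South Africa | Indonesia |
| hvac.product | FN39TG | GG1919 | FN39TG | JDNS77 |
| building.age | 11 | 17 | 13 | 25 |
| building.manager | M17 | M4 | M7 | M18 |
| service.center.distance | 150 | 115 | 100 | 68 |
| days.since.service | 142 | 109 | 164 | 86 |
| he.efficiency | 12 | 22 | 2 | 36 |
| fan.hours | 17 | 16 | 15 | 8 |
| coolant.type | B12 | B12 | B12 | B12 |
| software.release | P10 | P10 | P10 | P10 |
| ave.outside.temp | 91 | 46 | 77 | 80 |
| software.P12 | 0 | 0 | 0 | 0 |
| coolant.B12 | 1 | 1 | 1 | 1 |
| neg.diff | 1 | 1 | 1 | -1 |
| abs.diff | 14 | 8 | 9 | 1 |
| diff.size | 3 | 2 | 2 | 1 |
| cut.off | 1 | 1 | 1 | 0 |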
What to look for among models
R-squared (linear models)
Variable Significance
# of Variables that are significant
Sign of Variables
Confusion Matrix “Score” (non-linear models)
AIC number (non-linear models)
What to look for among models
Variables and Significance
AIC Score
Confusion Matrix
Confusion Matrix Score
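The slides do not say how the confusion matrix "score" is calculated. A common choice is accuracy, the share of cases that land on the diagonal of the matrix. The sketch below assumes that definition and is written in Python for illustration (the deck itself works in R); the function names are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def matrix_score(matrix):
    """Accuracy: correctly classified cases / all cases (an assumed definition)."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total if total else 0.0

# Made-up labels in the style of the cut.off column (1 / 0).
actual    = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]
cm = confusion_matrix(actual, predicted)
score = matrix_score(cm)  # 6 of 8 predictions match
```

Other scores (sensitivity, specificity, kappa) could be swapped in without changing the rest of the procedure.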
Hand Done Model Outcome
Approach
Calculate all combinations of the independent variables
Write a function to:
Run each candidate model on X (~10) random training / test splits of the data
Collect the number of variables with significance < .1
"Score" the confusion matrix
Multiply the number of significant variables by the confusion matrix score, average over the sampling range, and sort the results data frame
Step 1 – set up empty data frame to hold results
Step 2 – calculate all combinations of variables
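The code for this step did not survive in the transcript. As a rough illustration only, the sketch below (Python rather than the deck's R) builds one formula string per combination of predictors. The helper name `model_formulas` and the short predictor list are assumptions; the results table later in the deck suggests seven predictors per model, but a size of two keeps the example small.

```python
from itertools import combinations

def model_formulas(response, predictors, size):
    """Return one R-style formula string per combination of `size` predictors."""
    return [f"{response} ~ " + " + ".join(combo)
            for combo in combinations(predictors, size)]

# A small subset of the HVAC columns, chosen only for illustration.
predictors = ["system", "system.age", "building.id", "hvac.product"]
formulas = model_formulas("cut.off", predictors, 2)
# 4 predictors taken 2 at a time gives 6 formulas,
# starting with "cut.off ~ system + system.age"
```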
Step 3 – run function to estimate all models and save parameters
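The original function is not in the transcript, so the following is only a sketch of the resampling loop the Approach slide describes, written in Python rather than R. The `evaluate` callback is hypothetical: it stands for whatever fits the model on the training rows and returns the number of significant variables and the confusion matrix score on the test rows. The 70 / 30 split and the seed are also assumptions.

```python
import random

def run_model(formula, rows, evaluate, n_splits=10, train_share=0.7, seed=1):
    """Evaluate one formula on several random train/test splits.

    `evaluate(formula, train, test)` is a hypothetical callback that fits the
    model and returns (n_significant, matrix_score) for that split.
    """
    rng = random.Random(seed)
    results = []
    for split in range(n_splits):
        shuffled = rows[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_share)
        train, test = shuffled[:cut], shuffled[cut:]
        n_sig, score = evaluate(formula, train, test)
        results.append({"model": formula, "split": split,
                        "sig": n_sig, "matrix": score})
    return results

# Stand-in evaluator so the sketch runs; a real one would fit a model.
def dummy_evaluate(formula, train, test):
    return formula.count("+") + 1, len(test) / (len(train) + len(test))

records = run_model("cut.off ~ system + building.id",
                    list(range(100)), dummy_evaluate)
```

Keeping one record per split, rather than a running total, leaves the averaging step free to combine the numbers in whatever order is wanted.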
Step 4 – average all models and sort
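A minimal sketch of the averaging and sorting, again in Python for illustration. It assumes one record per model and split, with hypothetical field names `sig` and `matrix`, and it follows the order given on the Approach slide: multiply within each split first, then average. Whether the original code multiplied before or after averaging cannot be confirmed from the transcript.

```python
from collections import defaultdict
from statistics import mean

def summarise(records):
    """Average per-split results for each model and sort by the weighted score."""
    by_model = defaultdict(list)
    for rec in records:
        by_model[rec["model"]].append(rec)
    summary = []
    for model, recs in by_model.items():
        summary.append({
            "model": model,
            "matrix_mean": mean(r["matrix"] for r in recs),
            "sig_mean": mean(r["sig"] for r in recs),
            # weight each split first, then average across splits
            "weighted": mean(r["sig"] * r["matrix"] for r in recs),
        })
    return sorted(summary, key=lambda row: row["weighted"], reverse=True)

# Made-up per-split results for two models, A and B.
records = [
    {"model": "A", "sig": 5, "matrix": 0.8},
    {"model": "A", "sig": 6, "matrix": 0.9},
    {"model": "B", "sig": 3, "matrix": 0.9},
    {"model": "B", "sig": 4, "matrix": 0.7},
]
ranking = summarise(records)  # model A ranks first
```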
Averaged Results for the Top Models

| Model | MatrixMean | SigMean | Weighted |
| --- | --- | --- | --- |
| cut.off ~ + system + building.id + hvac.product + building.age + building.manager + coolant.type + software.P12 | 0.79 | 5.60 | 4.45 |
| cut.off ~ + system.age + building.id + hvac.product + building.age + building.manager + he.efficiency + coolant.type | 0.88 | 5.00 | 4.39 |
| cut.off ~ + building.id + hvac.product + building.age + building.manager + coolant.type + software.release + ave.outside.temp | 0.85 | 4.90 | 4.17 |
| cut.off ~ + system + building.id + hvac.product + building.manager + service.center.distance + coolant.type + ave.outside.temp | 0.77 | 4.30 | 3.30 |
| cut.off ~ + building.id + service.center.distance + days.since.service + fan.hours + coolant.type + software.release + software.P12 | 0.91 | 3.60 | 3.28 |
| cut.off ~ + system + system.age + building.id + days.since.service + fan.hours + ave.outside.temp + software.P12 | 0.86 | 3.80 | 3.25 |
| cut.off ~ + system + system.age + building.id + building.age + days.since.service + fan.hours + software.P12 | 0.84 | 3.80 | 3.18 |
| cut.off ~ + building.id + country + building.manager + service.center.distance + days.since.service + fan.hours + coolant.type | 0.88 | 3.60 | 3.17 |
| cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.P12 | 0.87 | 3.60 | 3.14 |
| cut.off ~ + system.age + building.id + country + building.manager + service.center.distance + coolant.type + software.release | 0.85 | 3.70 | 3.14 |
| cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + software.P12 | 0.89 | 3.50 | 3.11 |
| cut.off ~ + building.id + hvac.product + building.age + building.manager + service.center.distance + coolant.type + ave.outside.temp | 0.89 | 3.50 | 3.10 |
| cut.off ~ + building.id + building.age + building.manager + service.center.distance + days.since.service + he.efficiency + coolant.type | 0.88 | 3.50 | 3.09 |
| cut.off ~ + building.id + country + building.manager + days.since.service + coolant.type + ave.outside.temp + software.P12 | 0.85 | 3.60 | 3.06 |
| cut.off ~ + building.id + hvac.product + building.age + fan.hours + software.release + ave.outside.temp + software.P12 | 0.81 | 3.70 | 3.00 |
| cut.off ~ + hvac.product + building.age + days.since.service + he.efficiency + coolant.type + ave.outside.temp + software.P12 | 0.91 | 3.30 | 3.00 |
Each of these should be tested again
More extensive use of varied train / test data sample sets
Stability of each model beyond the scoring
Chosen model “makes sense”
Alternative ways to do this …
Caret package function “rfe” (recursive feature elimination)
Try all variables first
Train and test the model with cross-validation
Calculate the most important variables
Eliminate the least important variables
Train and test the model again
Calculate the most important variables
Eliminate the least important variables
Repeat …
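The loop described above can be sketched generically. This is not caret's implementation, only an illustration in Python of the train, rank, eliminate, repeat cycle; `score_fn` and `importance_fn` are hypothetical callbacks, and the toy weights below are made up so the example runs.

```python
def recursive_elimination(features, score_fn, importance_fn, min_features=1):
    """Drop the least important feature until `min_features` remain.

    `score_fn(features)` and `importance_fn(features)` are hypothetical
    callbacks: the first returns a cross-validated score for a feature set,
    the second returns {feature: importance} from a model fitted on that set.
    """
    current = list(features)
    history = []
    while True:
        history.append((tuple(current), score_fn(current)))
        if len(current) <= min_features:
            break
        importance = importance_fn(current)
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    # Keep the feature set that scored best along the way.
    best_features, best_score = max(history, key=lambda item: item[1])
    return list(best_features), best_score, history

# Toy callbacks: two useful features and two noise features (made-up weights).
true_weight = {"system.age": 3.0, "fan.hours": 2.0, "country": 0.1, "time": 0.0}

def toy_importance(features):
    return {f: true_weight[f] for f in features}

def toy_score(features):
    # Reward useful features, penalise model size slightly.
    return sum(true_weight[f] for f in features) - 0.5 * len(features)

best, best_score, history = recursive_elimination(
    list(true_weight), toy_score, toy_importance)
```

Unlike the exhaustive search earlier in the deck, this visits only one feature set per size, so the number of model fits grows linearly with the number of variables.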
Setting it up & running RFE
data frame of predictor variables
vector of outcome variable
max number of variables to keep
control functions
run recursive elimination model
Outcome of the RFE
Problems
The number of variable combinations can get HUGE
Might need multicore or parallel processing to get through it
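To give a sense of scale, the counts can be computed directly. The sketch below is in Python and the predictor counts (20 and 30) are illustrative, not taken from the deck. Each candidate model is also refit on roughly 10 train / test splits, so the number of fits is about ten times these figures.

```python
from math import comb

def n_models(n_predictors, size=None):
    """Number of candidate models: one fixed size, or every non-empty subset."""
    if size is not None:
        return comb(n_predictors, size)
    return 2 ** n_predictors - 1

fixed_7_of_20 = n_models(20, 7)  # 77,520 models with exactly 7 predictors
all_of_20 = n_models(20)         # 1,048,575 non-empty subsets
all_of_30 = n_models(30)         # 1,073,741,823 non-empty subsets
```

Because each model is fitted independently of the others, the work splits cleanly across cores, which is why a multicore or parallel backend helps here.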