India software developers conference 2013 Bangalore
![Page 1: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/1.jpg)
Data Science 101: Using R Language to get Big Insights
Satnam Singh,
Senior Chief Engineer,
Samsung Research India – Bangalore
[ Twitter - @satnam74s]
India Software Developers Conference, Bangalore
March 16, 2013
![Page 2: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/2.jpg)
Motivation: Using Data to get Business Insights
[Figure: Databases & Clusters → Insights?]
![Page 3: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/3.jpg)
Ref. [kaggle.com]
Data Science Programming Languages
Why R?
• Popular, free
• Open source
• Multi-platform
• Vectorization
• Many statistical packages
• Large support base
• Object-oriented programming language
Ref [http://www.r-project.org]
![Page 4: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/4.jpg)
R Language Basics
> y <- 21
> y
[1] 21
> z = 233
> z
[1] 233
> y <- c(1,2,3,4)
> y
[1] 1 2 3 4
Simple Operations
Vector Operations
Function Calls
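The three labels on this slide can be illustrated with a minimal sketch. Apart from `y` and `z`, the variable names are hypothetical and chosen only for illustration.

```r
# Simple operations: assignment with <- or =
y <- 21
z = 233
total <- y + z            # scalar arithmetic

# Vector operations: arithmetic is applied element-wise (vectorization)
v <- c(1, 2, 3, 4)
doubled <- v * 2          # each element multiplied by 2
squares <- v^2            # each element squared

# Function calls: built-in functions operate on whole vectors
v_mean <- mean(v)
v_sum  <- sum(v)

print(total)
print(doubled)
print(v_mean)
```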
![Page 5: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/5.jpg)
R Language: Data Structures Examples
• Data frame
• Matrix
• List
> MyFamilyage <- c(5,6,40,38)
> MyFamilyName <- c("Sat","Veera","Minu","Dummy")
> MyFamilyweight <- c(72,70,12,40)
> MyFamily <- data.frame(MyFamilyName,MyFamilyage,MyFamilyweight)
> MyMatrix <- as.matrix(MyFamilyage)
> Mydataframe <- as.data.frame(MyMatrix)
> MyList <- as.list(Mydataframe)
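To see how the three structures differ, the objects can be rebuilt and inspected. This is a sketch using the slide's names; the `adults` variable and the age threshold are hypothetical additions.

```r
# Rebuild the slide's objects (names follow the slide)
MyFamilyName   <- c("Sat", "Veera", "Minu", "Dummy")
MyFamilyage    <- c(5, 6, 40, 38)
MyFamilyweight <- c(72, 70, 12, 40)
MyFamily <- data.frame(MyFamilyName, MyFamilyage, MyFamilyweight)

str(MyFamily)                                   # column types and a preview
dim(MyFamily)                                   # number of rows and columns
adults <- MyFamily[MyFamily$MyFamilyage > 30, ] # row filtering (hypothetical threshold)

MyMatrix <- as.matrix(MyFamilyage)              # 4 x 1 numeric matrix
Mydataframe <- as.data.frame(MyMatrix)          # back to a one-column data frame
MyList <- as.list(Mydataframe)                  # list with one element per column
```

A data frame can mix column types (character and numeric here), whereas a matrix holds a single type.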
![Page 6: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/6.jpg)
Case Study: Activity Recognition
• Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
Example of Accelerometer data
Smartphone’s Accelerometer Sensor
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
[Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
![Page 7: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/7.jpg)
Data Analysis - Steps
Time Series Data: 200 samples (10 sec)
Feature Extraction: 43 Features
• Mean for each acc. axis (3)
• Std. dev. for each acc. axis (3)
• Avg. abs. diff. from mean for each acc. axis (3)
• Avg. resultant acc. (1)
• Histogram (30)
Classifiers
• CART: Decision Tree
• RF: Random Forest
Classify the Activity
[Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
[Ref] Jordan Frank, McGill University
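The feature extraction step can be sketched on one 200-sample window. The data here is synthetic, `extract_features` is a hypothetical helper, and the 30 histogram features are assumed to be 10 bins per axis. The sketch produces only the 40 features listed above; it does not attempt the remaining 3 of the 43.

```r
set.seed(42)
# Hypothetical 10-second window: 200 samples per axis (synthetic data)
window <- data.frame(x = rnorm(200), y = rnorm(200, mean = 9.8), z = rnorm(200))

extract_features <- function(w, bins = 10) {
  avg       <- colMeans(w)                                   # mean per axis (3)
  std       <- apply(w, 2, sd)                               # std. dev. per axis (3)
  absdev    <- sapply(w, function(v) mean(abs(v - mean(v)))) # avg. abs. diff. from mean (3)
  resultant <- mean(sqrt(rowSums(w^2)))                      # avg. resultant acceleration (1)
  # Assumed layout: fraction of samples in each of `bins` equal-width bins per axis
  hist_bins <- unlist(lapply(w, function(v) {
    breaks <- seq(min(v), max(v), length.out = bins + 1)
    as.numeric(table(cut(v, breaks = breaks, include.lowest = TRUE))) / length(v)
  }))
  c(avg = avg, std = std, absdev = absdev, resultant = resultant, hist = hist_bins)
}

features <- extract_features(window)
length(features)   # number of features produced for this window
```

Each window becomes one row of the classifier's training table, with the activity label as the class variable.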
![Page 8: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/8.jpg)
Data Visualization – Activity (Class Variable)
[Ref] Rattle R Data Mining Tool
require(gplots, quietly=TRUE)      # provides barplot2
require(colorspace, quietly=TRUE)  # provides rainbow_hcl
ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
            summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
ord <- order(ds[1,], decreasing=TRUE)
bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class", ylim=c(0, 2497), col=rainbow_hcl(7))
dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="", xlab="Frequency", ylab="class", pch=c(1:6, 19))
Bar Plot
Dot Plot
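The Rattle-generated code above depends on Rattle's `crs` environment. A self-contained sketch of the same idea, counting the class variable and plotting the counts, needs only base R. The labels are randomly generated stand-ins, not the real dataset.

```r
set.seed(1)
activities <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")
# Hypothetical labels standing in for crs$dataset$class
labels <- factor(sample(activities, 500, replace = TRUE), levels = activities)

counts <- table(labels)                   # frequency of each class
ord <- order(counts, decreasing = TRUE)   # most frequent class first

barplot(counts[ord], ylab = "Frequency", xlab = "class",
        col = rainbow(length(activities)))
dotchart(as.numeric(counts[ord]), labels = names(counts)[ord],
         xlab = "Frequency", ylab = "class")
```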
![Page 9: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/9.jpg)
Data Visualization Example – Variable Yavg
require(colorspace, quietly=TRUE)  # provides rainbow_hcl
ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
            data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7), xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
require(doBy, quietly=TRUE)
points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency", col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
[Ref] Rattle R Data Mining Tool
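A base-R sketch of the same box plot and histogram, without the `crs` environment or the doBy package. The values are synthetic stand-ins for YAVG, and the three groups and their means are hypothetical.

```r
set.seed(2)
# Hypothetical stand-in for the YAVG feature grouped by activity
ds <- data.frame(
  dat = c(rnorm(100, mean = 9), rnorm(100, mean = 5), rnorm(100, mean = 2)),
  grp = factor(rep(c("Walking", "Jogging", "Sitting"), each = 100))
)

bp <- boxplot(dat ~ grp, data = ds, xlab = "class", ylab = "YAVG",
              varwidth = TRUE, notch = TRUE)

# Per-group means, in the same factor-level order as the boxes
grp_means <- tapply(ds$dat, ds$grp, mean)
points(seq_along(grp_means), grp_means, pch = 8)

# Histogram with Freedman-Diaconis bin widths, as breaks="fd" requests
hs <- hist(ds$dat, main = "", xlab = "YAVG", ylab = "Frequency",
           col = "grey90", breaks = "fd")
```

Here `tapply` plays the role that `summaryBy` plays in the Rattle code: it computes one mean per group so the means can be overlaid on the boxes.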
![Page 10: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/10.jpg)
Correlation Plot
• Easy to interpret
• Blue: positive correlation
• Red: negative correlation
[Ref] Rattle R Data Mining Tool
require(ellipse, quietly=TRUE)
crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
crs$ord <- order(crs$cor[1,])
crs$cor <- crs$cor[crs$ord, crs$ord]
print(crs$cor)
plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
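The color mapping in the `plotcorr` call can be checked without the ellipse package. This sketch uses the built-in `mtcars` data as a stand-in for the accelerometer features; the column choice is arbitrary.

```r
# Built-in mtcars data stands in for crs$dataset[, crs$numeric]
num <- mtcars[, c("mpg", "disp", "hp", "wt")]
cm <- cor(num, use = "pairwise", method = "pearson")

# Reorder variables by their correlation with the first variable
ord <- order(cm[1, ])
cm <- cm[ord, ord]

# Map correlations in [-1, 1] to palette positions 1..11:
# -1 -> 1 (red), 0 -> 6 (white), +1 -> 11 (blue)
palette11 <- colorRampPalette(c("red", "white", "blue"))(11)
idx <- 5 * cm + 6
cell_cols <- palette11[idx]   # fractional indices are truncated toward zero

print(round(cm, 2))
```

The expression `5 * r + 6` is what turns a correlation `r` into a palette index, which is why strongly negative pairs come out red and strongly positive pairs blue.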
![Page 11: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/11.jpg)
Data Science R Packages

| Category | Function | Library | Description |
|---|---|---|---|
| Cluster | hclust | stats | Hierarchical cluster analysis |
| | kmeans | stats | K-means clustering |
| Classifiers | glm | stats | Logistic regression |
| | rpart | rpart | Recursive partitioning and regression trees |
| | ksvm | kernlab | Support Vector Machine |
| | apriori | arules | Rule based classification |
| Ensemble | ada | ada | Stochastic boosting |
| | randomForest | randomForest | Random Forests classification and regression |
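As an illustration of the two clustering functions from the stats package, here is a minimal sketch on the built-in `iris` measurements. The dataset and the choice of 3 clusters are assumptions made for the example, not part of the talk.

```r
# Built-in iris measurements as stand-in data (the species label is dropped)
x <- scale(iris[, 1:4])

set.seed(123)
km <- kmeans(x, centers = 3, nstart = 10)    # k-means with 3 clusters
hc <- hclust(dist(x), method = "complete")   # agglomerative hierarchical clustering
hc_groups <- cutree(hc, k = 3)               # cut the dendrogram into 3 groups

table(kmeans = km$cluster, hclust = hc_groups)   # compare the two partitions
```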
![Page 12: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/12.jpg)
Decision Tree - Visualization
[Ref] Rattle R Data Mining Tool
![Page 13: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/13.jpg)
• Decision Tree Model Results:
n= 3792
1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
  2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
    4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
    5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
  3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
    6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
Variables actually used in tree construction:
RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
Root node error: 2364/3792 = 0.62342
Decision Tree
rpart(formula = class ~ ., data = smartphone_data, method = "class", parms = list(split = "information"), control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
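The same call can be tried on a built-in dataset. This is a sketch in which `iris` stands in for `smartphone_data` and `Species` for `class`; the accuracy it reports is for the training data only and says nothing about the activity dataset.

```r
library(rpart)  # recommended package shipped with standard R distributions

# Built-in iris data stands in for smartphone_data; Species plays the role of class
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

print(fit)                                   # node listing in the same layout as the results above
pred <- predict(fit, iris, type = "class")   # predicted class per row
train_acc <- mean(pred == iris$Species)      # accuracy on the training data
```

In the printed tree, each node shows the split rule, the number of observations, the number misclassified, the predicted class, and the class probabilities; a trailing `*` marks a leaf.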
![Page 14: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/14.jpg)
Random Forest: Ensemble of Trees
[Ref] Rattle R Data Mining Tool
[Figure: Tree 1, Tree 2, …, Tree n combined (Σ) into the Random Forest]
![Page 15: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/15.jpg)
• Random Forest Model Results:
Number of observations used to build the model: 3792
Type of random forest: classification
OOB estimate of error rate: 11.05%
Confusion matrix:

| | Downstairs | Jogging | Sitting | Standing | Upstairs | Walking | class.error |
|---|---|---|---|---|---|---|---|
| Downstairs | 204 | 7 | 0 | 1 | 64 | 97 | 0.45308311 |
| Jogging | 6 | 1117 | 0 | 0 | 8 | 7 | 0.01845343 |
| Sitting | 0 | 0 | 209 | 5 | 1 | 0 | 0.02790698 |
| Standing | 4 | 0 | 0 | 177 | 4 | 0 | 0.04324324 |
| Upstairs | 48 | 31 | 1 | 0 | 276 | 97 | 0.39072848 |
| Walking | 20 | 1 | 1 | 1 | 15 | 1390 | 0.02661064 |
Random Forest Package in R
randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6, importance = TRUE, replace = FALSE, na.action = na.roughfix)
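The error figures in the results can be recomputed from the confusion matrix alone. This sketch uses base R and the counts reported above; the variable names are hypothetical.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")
# Confusion matrix copied from the results above (rows: actual, columns: predicted)
cm <- matrix(c(204,    7,   0,   1,  64,   97,
                 6, 1117,   0,   0,   8,    7,
                 0,    0, 209,   5,   1,    0,
                 4,    0,   0, 177,   4,    0,
                48,   31,   1,   0, 276,   97,
                20,    1,   1,   1,  15, 1390),
             nrow = 6, byrow = TRUE, dimnames = list(classes, classes))

class_error <- 1 - diag(cm) / rowSums(cm)   # per-class error rate
oob_error   <- 1 - sum(diag(cm)) / sum(cm)  # overall error rate

round(class_error, 4)
round(100 * oob_error, 2)
```

The off-diagonal counts show where the errors concentrate: Downstairs and Upstairs are mostly confused with each other and with Walking, while Jogging, Sitting, Standing and Walking are classified with under 5% error.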
![Page 16: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/16.jpg)
Summary
• Fusing data science with domain knowledge enables big insights from data
• The R language provides a platform to rapidly build prototypes and test ideas
• Getting insights from data is the outcome of an intense team effort among various stakeholders
![Page 17: India software developers conference 2013 Bangalore](https://reader031.fdocuments.in/reader031/viewer/2022012916/54c64e8c4a7959ea028b4572/html5/thumbnails/17.jpg)
References
• R Project: http://www.r-project.org
• Activity Recognition Dataset: “The Impact of Personalization on Smartphone-Based Activity Recognition”, Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
• “Activity and Gait Recognition with Time-Delay Embeddings”, Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
• R wiki: http://rwiki.sciviews.org/doku.php
• R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
• Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
• Rattle, R Data Mining Tool: http://rattle.togaware.com/
• Sensor Platforms: http://www.sensorplatforms.com/context-aware/
• Movea: http://www.movea.com/
• Alohar: https://www.alohar.com