IBM PureData for Analytics Clustering three ways with Open Source R
© 2012 IBM Corporation
Using R with PureData for Analytics
Small data outside database (single model, serial model processing): pull data down from database; run R on desktop or dedicated server
Large data inside database (single model, serial model processing): call INZA functions from R; process data directly against DB tables
Many small data inside database (many models, parallel model processing, e.g. bulk parallel execution): push R into database; process data directly against DB tables
Small data inside database (single model, serial model processing): push R into database; process data directly against DB tables
Analysis only looks at the last three scenarios
Comparing performance for single model in-database
Timings are R system.time output in seconds (user / system / elapsed).

Number of Observations    INZA wrapper from R: nzKMeans    cclust run IDB with nzSingleModel
500,000                   0.01 / 0.01 / 327.06             0.44 / 0.02 / 30.51
1,000,000                 0.00 / 0.00 / 215.64             1.09 / 0.01 / 42.04
2,000,000                 0.05 / 0.01 / 212.24             1.88 / 0.05 / 59.89
4,000,000                 0.03 / 0.00 / 250.05             4.07 / 0.03 / 141.13
5,000,000                 0.03 / 0.00 / 217.14             4.78 / 0.03 / 203.63

Based on these timings, nzKMeans would be expected to outperform in-database cclust somewhere between 5M and 6M observations.
Note: Tests were run on a first-gen TwinFin.
Note: Variations in the performance numbers are relative, because the system was in use during testing.
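The expected crossover between 5M and 6M observations can be sanity-checked with a rough linear extrapolation of the elapsed times in the table. This is only a sketch and not the authors' method. It assumes that cclust keeps growing at the slope observed between 4M and 5M observations and that nzKMeans stays at its 5M elapsed time. The variable names are illustrative.

```r
# Elapsed seconds transcribed from the single-model comparison table.
obs      <- c(0.5e6, 1e6, 2e6, 4e6, 5e6)
nzkmeans <- c(327.06, 215.64, 212.24, 250.05, 217.14)
cclust   <- c(30.51, 42.04, 59.89, 141.13, 203.63)

# Assumption: cclust elapsed time keeps the slope seen between the last two
# measurements, and nzKMeans elapsed time stays flat at its last value.
n <- length(obs)
slope <- (cclust[n] - cclust[n - 1]) / (obs[n] - obs[n - 1])  # seconds per row
crossover <- obs[n] + (nzkmeans[n] - cclust[n]) / slope

# Estimated number of observations at which the two elapsed times meet.
print(round(crossover))
```

Under these assumptions the estimate falls inside the 5M to 6M range. Given the noise in the nzKMeans timings, it should be read as a rough indication only.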
Bulk-parallel execution of cclust (10K observations for each)
Timings are R system.time output in seconds (user / system / elapsed).

Number of Models    cclust run IDB with nzBulkModel    Average time per model
50                  0.02 / 0.00 / 6.18                 0.1236
100                 0.03 / 0.00 / 7.23                 0.0723
500                 0.00 / 0.02 / 14.25                0.0285
In general, these results would be significantly better than running cclust serially in a dedicated environment, both because of R execution overhead and because of the additional time required for data movement and/or partitioning.
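The average time per model in the bulk-parallel results is the overall elapsed time divided by the number of models. A short check, using only the figures reported on this slide (variable names are illustrative):

```r
# Number of models and overall elapsed seconds from the bulk-parallel results.
models  <- c(50, 100, 500)
elapsed <- c(6.18, 7.23, 14.25)

# Average elapsed seconds per model.
avg_per_model <- elapsed / models

print(data.frame(models, elapsed, avg_per_model))
```

The per-model average drops as the number of models grows, which is consistent with fixed startup cost being spread over more models.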
Clustering three ways with Open Source R and IBM PureData for Analytics

Using wrapper for INZA KMEANS (stores resulting model in-database), single model

data.nz <- nz.data.frame("BENCHMARK_DATA")
system.time(
  nz.clust5 <- nzKMeans(data.nz, k=5, maxiter=1000, distance="euclidean", id="ID",
                        getLabels=FALSE, randseed=1234, outtable="admin.DATA_2_clust5d",
                        format="kmeans", dropAfter=TRUE)
)

Running R in-database, single model (returns resulting model to client)

system.time(
  data.cclust <- nzSingleModel(data.nz[,2:16],
    function(df) {
      require(cclust)
      cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE, dist="euclidean", method="kmeans")
    },
    force=TRUE)
)

Running R in-database, bulk parallel model (stores resulting models in-database, returns list of models by INDEX)

# ua_ct is col 6, the "index" or grouping column
system.time(
  data.cclust <- nzBulkModel(data.nz[data.nz$ID<1000001, 2:16], 6,
    function(df) {
      require(cclust)
      cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE, dist="euclidean", method="kmeans")
    },
    output.name="CCLUSTBULKMODEL", clear.existing=TRUE)
)
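For readers without access to the appliance, the "one model per group" idea behind the bulk-parallel call can be illustrated locally in base R. This is only a conceptual sketch. It runs serially on the client, uses stats::kmeans in place of cclust so that it needs no extra packages, and works on made-up data. It does not show how nzBulkModel is implemented. The names df, grp, x1, x2 and fit_one are hypothetical.

```r
set.seed(1234)

# Hypothetical stand-in data: three groups of 200 rows, two numeric features.
df <- data.frame(
  grp = rep(c("a", "b", "c"), each = 200),
  x1  = rnorm(600),
  x2  = rnorm(600)
)

# Fit one k-means model (k = 5) on the feature columns of a single group.
fit_one <- function(d) {
  kmeans(as.matrix(d[, c("x1", "x2")]), centers = 5, iter.max = 1000)
}

# Split by the grouping column and fit a model per group; the result is a
# list of models keyed by group value.
models <- lapply(split(df, df$grp), fit_one)

print(names(models))
print(sapply(models, function(m) nrow(m$centers)))
```

The in-database version differs in that the groups are processed in parallel next to the data, so no rows are moved to the client.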
Bulk-parallel execution of cclust: Result Details
Timings are R system.time output in seconds (user / system / elapsed).

Number of Rows    Number of Models    Timings Overall       Average Elapsed per Model    Rows per Model
0.5 M             50                  0.02 / 0.00 / 6.18    0.1236                       10K
1 M               100                 0.03 / 0.00 / 7.23    0.0723                       10K
2 M               100                 0.01 / 0.00 / 6.85    0.0685                       20K
4 M               500                 0.01 / 0.19 / 12.95   0.0259                       8K
5 M               500                 0.00 / 0.02 / 14.25   0.0285                       10K