IBM PureData for Analytics Clustering three ways with Open Source R

© 2012 IBM Corporation



Using R with PureData for Analytics

Scenarios:
Small data outside database: Single Model, Serial Model Processing
Large data inside database: Single Model, Serial Model Processing
Many small data inside database: Many Model, Parallel Model Processing (e.g. Bulk Parallel Execution)
Small data inside database: Single Model, Serial Model Processing

Approaches:
Pull data down from database: Run R on desktop or dedicated server
Call INZA functions from R: Process data directly against DB tables
Push R into database: Process data directly against DB tables





This analysis looks only at the last three scenarios, where the data stays inside the database.


Comparing performance for single model in-database

Number of Observations | INZA wrapper from R: nzKMeans (user / system / elapsed) | cclust run in-database (IDB) with nzSingleModel (user / system / elapsed)
500,000 | 0.01 / 0.01 / 327.06 | 0.44 / 0.02 / 30.51
1,000,000 | 0.00 / 0.00 / 215.64 |
5,000,000 | 0.03 / 0.00 / 217.14 |


We would expect nzKMeans to start outperforming in-database cclust somewhere between 5M and 6M observations.

Note: Tests were run on a first-generation TwinFin.
Note: Variations in the performance numbers are relative, because the system was in use during testing.
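A note on reading the timings above: R's system.time reports three numbers. The user and system values are CPU time consumed by the calling R process, while elapsed is wall-clock time. That would be consistent with the pattern in these results, where user and system stay near zero while elapsed runs to hundreds of seconds: the client mostly waits while the work happens elsewhere. A minimal sketch using only base R, with Sys.sleep standing in (as an assumption, purely for illustration) for a call the client waits on:

```r
# Sys.sleep stands in for a call whose work happens outside this R process:
# the process waits, so it uses almost no CPU time of its own.
timing <- system.time(Sys.sleep(0.5))

# Prints in the same "user system elapsed" layout as the timings above.
print(timing)

# Individual components can be pulled out by name.
client_cpu <- timing[["user.self"]] + timing[["sys.self"]]
wall_clock <- timing[["elapsed"]]
cat("client CPU seconds:", client_cpu, "\n")
cat("wall-clock seconds:", wall_clock, "\n")
```

When comparing runs like these, elapsed is the figure that reflects what the user actually waits for.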


Bulk-parallel execution of cclust (10K observations for each)

Number of Models | cclust run IDB with nzBulkModel (user / system / elapsed) | Average time per model
50 | 0.02 / 0.00 / 6.18 | 0.1236
100 | 0.03 / 0.00 / 7.23 | 0.0723
500 | 0.00 / 0.02 / 14.25 | 0.0285

In general, these results would be significantly better than running cclust serially in a dedicated environment, both because of R execution overhead and because of the additional time required for data movement and/or partitioning.


Clustering three ways with Open Source R and IBM PureData for Analytics

Using wrapper for INZA KMEANS (stores resulting model in-database), single model

data.nz <- nz.data.frame("BENCHMARK_DATA")
system.time(
  nz.clust5 <- nzKMeans(data.nz, k=5, maxiter=1000, distance="euclidean", id="ID",
                        getLabels=FALSE, randseed=1234,
                        outtable="admin.DATA_2_clust5d", format="kmeans", dropAfter=TRUE)
)

Running R in-database, single model (returns resulting model to client)

system.time(
  data.cclust <- nzSingleModel(data.nz[,2:16], function(df) {
    require(cclust)
    cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE,
           dist="euclidean", method="kmeans")
  }, force=TRUE)
)

Running R in-database, bulk parallel model (stores resulting models in-database, returns list of models by INDEX)

# ua_ct is col 6, the "index" or grouping column
system.time(
  data.cclust <- nzBulkModel(data.nz[data.nz$ID<1000001,2:16], 6, function(df) {
    require(cclust)
    cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE,
           dist="euclidean", method="kmeans")
  }, output.name="CCLUSTBULKMODEL", clear.existing=TRUE)
)
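The "one model per group" idea behind the bulk call can be tried locally in plain R. The sketch below is a serial, in-memory analogy only and does not reproduce how nzBulkModel works inside the database. The toy data frame, the column names grp, x1, x2, and the helper fit_one are hypothetical, and base R's stats::kmeans is swapped in for cclust so that no extra package is needed.

```r
# Local, serial analogy of "one model per group"; this is NOT nzBulkModel.
set.seed(1234)

# Hypothetical stand-in for the benchmark table: 3 groups of 200 rows each.
toy <- data.frame(
  grp = rep(c("a", "b", "c"), each = 200),
  x1  = rnorm(600),
  x2  = rnorm(600)
)

# Same shape as the function passed in above: data frame in, model out.
# stats::kmeans replaces cclust here (an assumption made for portability).
fit_one <- function(df) {
  kmeans(as.matrix(df), centers = 5, iter.max = 1000)
}

# Split the feature columns on the grouping column, then fit one model per piece.
pieces <- split(toy[, c("x1", "x2")], toy$grp)
models <- lapply(pieces, fit_one)

# The result is a list of models keyed by the group value.
print(names(models))
print(sapply(models, function(m) nrow(m$centers)))
```

Because each group is fitted independently of the others, the per-group fits can in principle be distributed, which is the property the bulk-parallel scenario relies on.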


Bulk-parallel execution of cclust: Result Details

Number of Rows | Number of Models | Overall Timings (user / system / elapsed) | Average Elapsed per Model | Rows per Model
0.5 M | 50 | 0.02 / 0.00 / 6.18 | 0.1236 | 10K
1 M | 100 | 0.03 / 0.00 / 7.23 | 0.0723 | 10K
2 M | 100 | 0.01 / 0.00 / 6.85 | 0.0685 | 20K
4 M | 500 | 0.01 / 0.19 / 12.95 | 0.0259 | 8K
5 M | 500 | 0.00 / 0.02 / 14.25 | 0.0285 | 10K
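The two derived columns in these results follow directly from the others: average elapsed per model is overall elapsed time divided by the number of models, and rows per model is total rows divided by the number of models. A short check in base R, with the values copied from the results above (the data frame and its column names are just illustrative):

```r
# Values copied from the bulk-parallel result details above.
results <- data.frame(
  rows    = c(0.5e6, 1e6, 2e6, 4e6, 5e6),
  models  = c(50, 100, 100, 500, 500),
  elapsed = c(6.18, 7.23, 6.85, 12.95, 14.25)
)

# Derived columns: average elapsed time per model and rows per model.
results$avg_elapsed_per_model <- results$elapsed / results$models
results$rows_per_model        <- results$rows / results$models

print(results)
```

The computed averages fall as the number of models grows, even though overall elapsed time rises, which suggests the fixed cost of launching a bulk run is being spread over more models.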
