IBM PureData for Analytics Clustering three ways with Open Source R

© 2012 IBM Corporation


Page 1: IBM PureData for Analytics Clustering three ways with Open Source R

IBM PureData for Analytics
Clustering three ways with Open Source R

Page 2: IBM PureData for Analytics Clustering three ways with Open Source R

Using R with PureData for Analytics

Small data outside database: single model, serial model processing.
Pull data down from database; run R on desktop or dedicated server.

Large data inside database: single model, serial model processing.
Call INZA functions from R; process data directly against DB tables.

Many small data inside database: many models, parallel model processing (e.g. bulk parallel execution).
Push R into database; process data directly against DB tables.

Small data inside database: single model, serial model processing.
Push R into database; process data directly against DB tables.

Page 3: IBM PureData for Analytics Clustering three ways with Open Source R

Using R with PureData for Analytics

The analysis looks only at the last three scenarios, the ones where the data stays inside the database.

Page 4: IBM PureData for Analytics Clustering three ways with Open Source R

Comparing performance for single model in-database

Timings are R system.time output, in seconds.

Number of       INZA wrapper from R: nzKMeans     cclust run in-database (IDB) with nzSingleModel
Observations    user    system   elapsed          user    system   elapsed
500,000         0.01    0.01     327.06           0.44    0.02      30.51
1,000,000       0.00    0.00     215.64           1.09    0.01      42.04
2,000,000       0.05    0.01     212.24           1.88    0.05      59.89
4,000,000       0.03    0.00     250.05           4.07    0.03     141.13
5,000,000       0.03    0.00     217.14           4.78    0.03     203.63

We would expect nzKMeans to start outperforming in-database cclust somewhere between 5M and 6M observations.

Note: Tests were run on a first-generation TwinFin.
Note: Variations in the performance numbers are relative, since the system was in use during the testing.
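One rough way to land on a crossover figure in that range is sketched below. It assumes that nzKMeans elapsed time is roughly flat (taken here as the median of the five runs) and that cclust elapsed time keeps growing at the rate seen between the last two measurements. Both assumptions, and all variable names, are introduced here for illustration; the slides do not say how the estimate was made.

```r
# Elapsed times (seconds) copied from the table above.
obs              <- c(5e5, 1e6, 2e6, 4e6, 5e6)
nzkmeans_elapsed <- c(327.06, 215.64, 212.24, 250.05, 217.14)
cclust_elapsed   <- c(30.51, 42.04, 59.89, 141.13, 203.63)

# Assumption: nzKMeans is roughly constant over this range of sizes.
flat <- median(nzkmeans_elapsed)

# Assumption: cclust continues along the line through its last two points.
n     <- length(obs)
slope <- (cclust_elapsed[n] - cclust_elapsed[n - 1]) / (obs[n] - obs[n - 1])

# Number of observations at which the extrapolated cclust time reaches 'flat'.
crossover <- obs[n] + (flat - cclust_elapsed[n]) / slope
print(crossover)
```

Under these assumptions the crossover falls a little above 5M observations. A different choice of baseline or fit would move it, so it should be read as an order-of-magnitude check only.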

Page 5: IBM PureData for Analytics Clustering three ways with Open Source R

Bulk-parallel execution of cclust (10K observations for each)

Timings are R system.time output, in seconds.

Number of    cclust run in-database (IDB) with nzBulkModel    Average time
Models       user    system   elapsed                         per model
50           0.02    0.00      6.18                           0.1236
100          0.03    0.00      7.23                           0.0723
500          0.00    0.02     14.25                           0.0285

In general, these results would be significantly better than running cclust serially in a dedicated environment, simply because of R execution overhead and the additional time required for data movement and/or partitioning.
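The average time per model is the overall elapsed time divided by the number of models. A small sketch that reproduces that column from the figures above (the variable names are illustrative, not from the slides):

```r
# Number of models and overall elapsed time (seconds) from the table above.
n_models <- c(50, 100, 500)
elapsed  <- c(6.18, 7.23, 14.25)

# Average elapsed time per model, in seconds.
avg_per_model <- elapsed / n_models
print(data.frame(n_models, elapsed, avg_per_model))
```

The per-model cost drops as the number of models grows, which is consistent with a fixed start-up cost being shared across more models.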

Page 6: IBM PureData for Analytics Clustering three ways with Open Source R

Clustering three ways with Open Source R and IBM PureData for Analytics

Using wrapper for INZA KMEANS (stores resulting model in-database), single model

# Assumes the Netezza R packages that provide these functions are loaded
# and a database connection is already open.
data.nz <- nz.data.frame("BENCHMARK_DATA")
system.time(
  nz.clust5 <- nzKMeans(data.nz, k=5, maxiter=1000, distance="euclidean",
                        id="ID", getLabels=FALSE, randseed=1234,
                        outtable="admin.DATA_2_clust5d", format="kmeans",
                        dropAfter=TRUE)
)

Running R in-database, single model (returns resulting model to client)

system.time(
  data.cclust <- nzSingleModel(data.nz[,2:16], function(df) {
    require(cclust)
    cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE,
           dist="euclidean", method="kmeans")
  }, force=TRUE)
)

Running R in-database, bulk parallel model (stores resulting models in-database, returns list of models by INDEX)

# ua_ct is col 6, the "index" or grouping column
system.time(
  data.cclust <- nzBulkModel(data.nz[data.nz$ID<1000001, 2:16], 6, function(df) {
    require(cclust)
    cclust(as.matrix(df), 5, iter.max=1000, verbose=FALSE,
           dist="euclidean", method="kmeans")
  }, output.name="CCLUSTBULKMODEL", clear.existing=TRUE)
)
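To make the "one model per index value" idea concrete without an appliance, here is a local, serial stand-in written in base R. It is not the nzBulkModel API: it uses stats::kmeans in place of cclust, synthetic data in place of BENCHMARK_DATA, and a hypothetical grouping column named grp. It shows only the split-then-fit pattern that the bulk call runs in parallel inside the database.

```r
# Synthetic stand-in data: 4 groups of 200 rows, two numeric features.
set.seed(1234)
toy <- data.frame(
  grp = rep(1:4, each = 200),   # hypothetical grouping ("index") column
  x1  = rnorm(800),
  x2  = rnorm(800)
)

# Split the feature columns by the grouping column ...
pieces <- split(toy[, c("x1", "x2")], toy$grp)

# ... and fit one k-means model per piece, serially.
models <- lapply(pieces, function(df) {
  kmeans(as.matrix(df), centers = 5, iter.max = 1000)
})

# The result is a list of models keyed by the group value.
print(names(models))
print(models[["1"]]$centers)
```

Run this way, the models are fitted one after another on a single machine; the in-database bulk call is presented in the slides as doing the same per-group work in parallel, next to the data.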

Page 7: IBM PureData for Analytics Clustering three ways with Open Source R

Bulk-parallel execution of cclust: Result Details

Timings are R system.time output, in seconds.

Number     Number of    Timings overall              Average elapsed    Rows per
of Rows    Models       user    system   elapsed     per model          model
0.5 M      50           0.02    0.00      6.18       0.1236             10K
1 M        100          0.03    0.00      7.23       0.0723             10K
2 M        100          0.01    0.00      6.85       0.0685             20K
4 M        500          0.01    0.19     12.95       0.0259             8K
5 M        500          0.00    0.02     14.25       0.0285             10K