Page 1: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure

© 2015 IBM Corporation

Declarative Machine Learning: Bring your Own Algorithm, Data, Syntax and Infrastructure

Shivakumar Vaithyanathan, IBM Fellow

Watson & IBM Research

Page 2: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure

IBM Research

© 2012 IBM Corporation

Credit Risk Scoring Application at a Large Financial Institution

To execute on one machine (with a hypothetical statistical package/engine):

– 3.6 TB of RAM required (an underestimate)

– Reduced set: 1.2 TB of RAM (an underestimate)

In practice more RAM is required

– Outputs and intermediates also need to be stored along with the input


Prototypical of problems in other industries ranging from automotive to insurance to transportation

Credit Risk Scoring

Payment History

Amount Owed

Length of Credit History

New Credit

Types of Credit Used

Problem size: 300 million rows, 1,500 features (reduced set: 500 features)

Data size on disk: 3.6 TB uncompressed (even for the reduced set: 1.2 TB)

Algorithm of interest: Regression
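
The 3.6 TB and 1.2 TB figures are consistent with storing every cell of a dense matrix as an 8-byte double. The slide does not state the storage format, so the 8-byte cell size below is an assumption; the sketch only reproduces the arithmetic, and `dense_size_tb` is a hypothetical helper name.

```python
BYTES_PER_CELL = 8  # assumption: dense double-precision values (not stated on the slide)


def dense_size_tb(rows, features, bytes_per_cell=BYTES_PER_CELL):
    """Size of a dense rows x features matrix in decimal terabytes."""
    return rows * features * bytes_per_cell / 1e12


full_tb = dense_size_tb(300_000_000, 1500)    # full feature set
reduced_tb = dense_size_tb(300_000_000, 500)  # reduced feature set
print(f"full: {full_tb:.1f} TB, reduced: {reduced_tb:.1f} TB")
```

Outputs and intermediates come on top of this input size, which is why the slide calls both numbers underestimates.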

Page 3: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure



Big Data Analytics Usecases

Large Number of Data Points, Attributes and Dense

Problem Description
– Consumer risk modeling
– Consumer data with ~300 M rows and ~500 attributes

Large Number of Models

Problem Description
– Predict customer monetary loss
– Multi-million observations, 95 features, evaluate several hundred models for optimal subset of features

Large Number of Features

Problem Description
– Customer Satisfaction
– Multi-million cars with few reacquired cars
– Feature expansion from ~250 to ~21,800

Automotive

DaaS (Retail Finance)


Page 4: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


A Day in the life of a Data Scientist ….


data sample

data characteristics

Develop new algorithm or modify existing algorithm

original data

Data scientist

Bayesian networks, neural networks, random forests, support vector machines, …




Page 5: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


Bottleneck: Moving the algorithm onto Big Data Infrastructure


Data scientist




Page 6: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


What If …


Data scientist




compiler optimizer

Page 7: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


Simplified view of what we want to build …


The What The How

language tooling compiler optimizer

High-level language

Write any algorithm

Adapt to different data and program characteristics

Support different backend architectures and configurations

Page 8: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


SystemML: IBM Research Project will soon be in Open Source


• IBM Research Project started 6 years ago

• More than 10 papers in major conferences

• In Beta for more than a year and used in multiple applications

What
• R-like, Python-like syntax, …
• Rich set of statistical functions
• User-defined & external functions

How
• Single-node, embeddable, and Hadoop & Spark
• Dense / sparse matrix representation
• Library of more than 15 algorithms

In-Memory Single Node

Hadoop / Spark

Lower Ops (LOP)

Higher Ops (HOP)


Writing a Python-syntax parser took less than 2


Page 9: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


How should the “What” work?


package gnmf;

import java.io.IOException;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Driver: runs one GNMF iteration (update H, then update W) as a chain of MapReduce jobs.
public class MatrixGNMF {
  public static void main(String[] args) throws IOException, URISyntaxException {
    if(args.length < 10) {
      System.out.println("missing parameters");
      System.out.println("expected parameters: [directory of v] [directory of w] [directory of h] " +
          "[k] [num mappers] [num reducers] [replication] [working directory] " +
          "[final directory of w] [final directory of h]");
      System.exit(1);
    }
    String vDir = args[0];
    String wDir = args[1];
    String hDir = args[2];
    int k = Integer.parseInt(args[3]);
    int numMappers = Integer.parseInt(args[4]);
    int numReducers = Integer.parseInt(args[5]);
    int replication = Integer.parseInt(args[6]);
    String outputDir = args[7];
    String wFinalDir = args[8];
    String hFinalDir = args[9];

    JobConf mainJob = new JobConf(MatrixGNMF.class);
    String vDirectory;
    String wDirectory;
    String hDirectory;
    FileSystem.get(mainJob).delete(new Path(outputDir));
    vDirectory = vDir;
    hDirectory = hDir;
    wDirectory = wDir;
    String workingDirectory;
    String resultDirectoryX;
    String resultDirectoryY;
    long start = System.currentTimeMillis();
    System.gc();
    System.out.println("starting calculation");

    // Update H: H = H .* (WT * V) ./ (WT * W * H)
    System.out.print("calculating X = WT * V... ");
    workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
        UpdateWHStep1.UPDATE_TYPE_H, vDirectory, wDirectory, outputDir, k);
    resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
        workingDirectory, outputDir);
    FileSystem.get(mainJob).delete(new Path(workingDirectory));
    System.out.println("done");
    System.out.print("calculating Y = WT * W * H... ");
    workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
        wDirectory, outputDir);
    resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
        UpdateWHStep4.UPDATE_TYPE_H, hDirectory, outputDir);
    FileSystem.get(mainJob).delete(new Path(workingDirectory));
    System.out.println("done");
    System.out.print("calculating H = H .* X ./ Y... ");
    workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
        hDirectory, resultDirectoryX, resultDirectoryY, hFinalDir, k);
    System.out.println("done");
    FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
    FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
    System.out.print("storing back H... ");
    FileSystem.get(mainJob).delete(new Path(hDirectory));
    hDirectory = workingDirectory;
    System.out.println("done");

    // Update W: W = W .* (V * HT) ./ (W * H * HT)
    System.out.print("calculating X = V * HT... ");
    workingDirectory = UpdateWHStep1.runJob(numMappers, numReducers, replication,
        UpdateWHStep1.UPDATE_TYPE_W, vDirectory, hDirectory, outputDir, k);
    resultDirectoryX = UpdateWHStep2.runJob(numMappers, numReducers, replication,
        workingDirectory, outputDir);
    FileSystem.get(mainJob).delete(new Path(workingDirectory));
    System.out.println("done");
    System.out.print("calculating Y = W * H * HT... ");
    workingDirectory = UpdateWHStep3.runJob(numMappers, numReducers, replication,
        hDirectory, outputDir);
    resultDirectoryY = UpdateWHStep4.runJob(numMappers, replication, workingDirectory,
        UpdateWHStep4.UPDATE_TYPE_W, wDirectory, outputDir);
    FileSystem.get(mainJob).delete(new Path(workingDirectory));
    System.out.println("done");
    System.out.print("calculating W = W .* X ./ Y... ");
    workingDirectory = UpdateWHStep5.runJob(numMappers, numReducers, replication,
        wDirectory, resultDirectoryX, resultDirectoryY, wFinalDir, k);
    System.out.println("done");
    FileSystem.get(mainJob).delete(new Path(resultDirectoryX));
    FileSystem.get(mainJob).delete(new Path(resultDirectoryY));
    System.out.print("storing back W... ");
    FileSystem.get(mainJob).delete(new Path(wDirectory));
    wDirectory = workingDirectory;
    System.out.println("done");

    // Break the elapsed time down into hours, minutes, seconds and milliseconds.
    long requiredTime = System.currentTimeMillis() - start;
    long requiredTimeMilliseconds = requiredTime % 1000;
    requiredTime -= requiredTimeMilliseconds;
    requiredTime /= 1000;
    long requiredTimeSeconds = requiredTime % 60;
    requiredTime -= requiredTimeSeconds;
    requiredTime /= 60;
    long requiredTimeMinutes = requiredTime % 60;
    requiredTime -= requiredTimeMinutes;
    requiredTime /= 60;
    long requiredTimeHours = requiredTime;
  }
}

package gnmf;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Step 2: sums the partial vectors produced by step 1, one result vector per key.
public class UpdateWHStep2 {
  // Identity mapper: passes each (index, vector) pair through unchanged.
  static class UpdateWHStep2Mapper extends MapReduceBase
      implements Mapper<TaggedIndex, MatrixVector, TaggedIndex, MatrixVector> {
    @Override
    public void map(TaggedIndex key, MatrixVector value,
        OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter) throws IOException {
      out.collect(key, value);
    }
  }

  // Adds up all vectors that share a key and emits the sum tagged as an X vector.
  static class UpdateWHStep2Reducer extends MapReduceBase
      implements Reducer<TaggedIndex, MatrixVector, TaggedIndex, MatrixObject> {
    @Override
    public void reduce(TaggedIndex key, Iterator<MatrixVector> values,
        OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter) throws IOException {
      MatrixVector result = null;
      while(values.hasNext()) {
        MatrixVector current = values.next();
        if(result == null) {
          result = current.getCopy();
        } else {
          result.addVector(current);
        }
      }
      if(result != null) {
        out.collect(new TaggedIndex(key.getIndex(), TaggedIndex.TYPE_VECTOR_X),
            new MatrixObject(result));
      }
    }
  }

  public static String runJob(int numMappers, int numReducers, int replication,
      String inputDir, String outputDir) throws IOException {
    String workingDirectory = outputDir + System.currentTimeMillis() + "-UpdateWHStep2/";

    JobConf job = new JobConf(UpdateWHStep2.class);
    job.setJobName("MatrixGNMFUpdateWHStep2");
    job.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job, new Path(inputDir));
    job.setOutputFormat(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(workingDirectory));
    job.setNumMapTasks(numMappers);
    job.setMapperClass(UpdateWHStep2Mapper.class);
    job.setMapOutputKeyClass(TaggedIndex.class);
    job.setMapOutputValueClass(MatrixVector.class);
    job.setNumReduceTasks(numReducers);
    job.setReducerClass(UpdateWHStep2Reducer.class);
    job.setOutputKeyClass(TaggedIndex.class);
    job.setOutputValueClass(MatrixObject.class);
    JobClient.runJob(job);
    return workingDirectory;
  }
}


package gnmf;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

// Step 1: multiplies each cell of V with the matching vector of W or H.
public class UpdateWHStep1 {
  public static final int UPDATE_TYPE_H = 0;
  public static final int UPDATE_TYPE_W = 1;

  static class UpdateWHStep1Mapper extends MapReduceBase
      implements Mapper<TaggedIndex, MatrixObject, TaggedIndex, MatrixObject> {
    private int updateType;

    @Override
    public void map(TaggedIndex key, MatrixObject value,
        OutputCollector<TaggedIndex, MatrixObject> out, Reporter reporter) throws IOException {
      if(updateType == UPDATE_TYPE_W && key.getType() == TaggedIndex.TYPE_CELL) {
        // For the W update, re-key each cell by its column (this transposes V).
        MatrixCell current = (MatrixCell) value.getObject();
        out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_CELL),
            new MatrixObject(new MatrixCell(key.getIndex(), current.getValue())));
      } else {
        out.collect(key, value);
      }
    }

    @Override
    public void configure(JobConf job) {
      updateType = job.getInt("gnmf.updateType", 0);
    }
  }

  static class UpdateWHStep1Reducer extends MapReduceBase
      implements Reducer<TaggedIndex, MatrixObject, TaggedIndex, MatrixVector> {
    private double[] baseVector = null;
    private int vectorSizeK;

    @Override
    public void reduce(TaggedIndex key, Iterator<MatrixObject> values,
        OutputCollector<TaggedIndex, MatrixVector> out, Reporter reporter) throws IOException {
      if(key.getType() == TaggedIndex.TYPE_VECTOR) {
        // Remember the vector; the cells for the same index arrive in the next call.
        if(!values.hasNext())
          throw new RuntimeException("expected vector");
        MatrixFormats current = values.next().getObject();
        if(!(current instanceof MatrixVector))
          throw new RuntimeException("expected vector");
        baseVector = ((MatrixVector) current).getValues();
      } else {
        while(values.hasNext()) {
          MatrixCell current = (MatrixCell) values.next().getObject();
          if(baseVector == null) {
            out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                new MatrixVector(vectorSizeK));
          } else {
            if(baseVector.length == 0)
              throw new RuntimeException("base vector is corrupted");
            // Scale a copy of the base vector by the cell value.
            MatrixVector resultingVector = new MatrixVector(baseVector);
            resultingVector.multiplyWithScalar(current.getValue());
            if(resultingVector.getValues().length == 0)
              throw new RuntimeException("multiplying with scalar failed");
            out.collect(new TaggedIndex(current.getColumn(), TaggedIndex.TYPE_VECTOR),
                resultingVector);
          }
        }
        baseVector = null;
      }
    }

    @Override
    public void configure(JobConf job) {
      vectorSizeK = job.getInt("dml.matrix.gnmf.k", 0);
      if(vectorSizeK == 0)
        throw new RuntimeException("invalid k specified");
    }
  }

  public static String runJob(int numMappers, int numReducers, int replication, int updateType,
      String matrixInputDir, String whInputDir, String outputDir, int k) throws IOException {

Java implementation of Non-negative Matrix Factorization for Hadoop (>1500 lines of code)

R syntax (10 lines of code)

Python syntax (10 lines of code)

A factor of 7 – 10 advantage in man-months over multiple algorithms
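
For comparison, the update steps that the Java driver logs (X = WT * V, Y = WT * W * H, H = H .* X ./ Y, and the matching W update) fit in a few lines once matrix operations are available as primitives. The following is a minimal sketch in plain Python, not the SystemML script from the slide: the helper names, the random initialisation, the iteration count and the small `eps` guard against division by zero are all assumptions made for illustration.

```python
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def multiplicative_update(m, num, den, eps=1e-12):
    """Element-wise m .* num ./ den; eps (an assumption) avoids division by zero."""
    return [[mv * nv / (dv + eps) for mv, nv, dv in zip(mr, nr, dr)]
            for mr, nr, dr in zip(m, num, den)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W*H."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for vr, whr in zip(v, wh) for a, b in zip(vr, whr))


def gnmf(v, k, iterations=50, seed=0):
    """Factor a non-negative matrix V (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        wt = transpose(w)
        # H = H .* (WT * V) ./ (WT * W * H)
        h = multiplicative_update(h, matmul(wt, v), matmul(matmul(wt, w), h))
        ht = transpose(h)
        # W = W .* (V * HT) ./ (W * H * HT)
        w = multiplicative_update(w, matmul(v, ht), matmul(w, matmul(h, ht)))
    return w, h


v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
w, h = gnmf(v, k=2)
print("reconstruction error:", frobenius_error(v, w, h))
```

The loop body is the whole algorithm; everything the Java version spends on job configuration, intermediate directories and key tagging is what a declarative language leaves to the compiler.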

Page 10: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


Scalability and Performance – GNMF Example


All operations execute on a single machine: 0 MR jobs

Hybrid execution (majority of operations execute on a single machine): 4 MR jobs

Hybrid execution (majority of operations execute in map-reduce): 6 MR jobs
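
One way such a split between single-machine and map-reduce execution could arise is a per-operation memory check: an operation stays on one machine when its inputs and output fit in a memory budget, and is pushed to the cluster otherwise. The sketch below illustrates that idea only. The rule, the dense 8-byte cell size, the 2 GB budget and the function names are assumptions for illustration, not SystemML's actual cost model.

```python
BYTES_PER_DOUBLE = 8  # assumption: dense double-precision cells


def dense_bytes(rows, cols):
    """Estimated in-memory size of a dense rows x cols matrix."""
    return rows * cols * BYTES_PER_DOUBLE


def place_matmul(a_shape, b_shape, budget_bytes):
    """Pick where A %*% B runs under a hypothetical fits-in-memory rule."""
    out_shape = (a_shape[0], b_shape[1])
    needed = dense_bytes(*a_shape) + dense_bytes(*b_shape) + dense_bytes(*out_shape)
    return "single-node" if needed <= budget_bytes else "map-reduce"


budget = 2 * 1024 ** 3  # hypothetical 2 GB budget
print(place_matmul((1000, 100), (100, 10), budget))
print(place_matmul((300_000_000, 500), (500, 1), budget))
```

Under a rule like this the same script produces zero MR jobs on small inputs and more of them as the data grows, without any change to the script itself.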

Page 11: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


What does the “How” do?


Page 12: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


What does the “How” do?


[Figure: execution plans generated for the same program (compute X’X and X’y, then solve) under different conditions. Panels: "Original data"; "Change in data characteristics" ("X has 3 times more columns", "X has 2 times more rows"); "Change in Cluster configuration" ("From 2.5 to GB Map Task JVM", "7 GB In-Mem Master JVM"). Plan boxes: "X’X and X’y job", "X’X job", "X’X job1", "X’X job2", "X’y job", "X’y job1", "X’y job2", "solve". Annotation: "3X faster!"]
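
The job labels on this slide (X’X, X’y, solve) match solving the normal equations (X’X) b = X’y for linear regression. A minimal sketch of why the two products can be spread over several jobs while the solve stays small: each block of rows contributes a partial sum to X’X and X’y, and the combined result has only as many rows and columns as X has features. The function names and the row-block partitioning are illustrative assumptions, not SystemML's implementation.

```python
def partial_products(rows, targets):
    """Accumulate X'X and X'y over one block of rows (what one task could compute)."""
    n = len(rows[0])
    xtx = [[0.0] * n for _ in range(n)]
    xty = [0.0] * n
    for x, y in zip(rows, targets):
        for i in range(n):
            xty[i] += x[i] * y
            for j in range(n):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(parts):
    """Sum the per-block partial results."""
    n = len(parts[0][1])
    xtx = [[sum(p[0][i][j] for p in parts) for j in range(n)] for i in range(n)]
    xty = [sum(p[1][i] for p in parts) for i in range(n)]
    return xtx, xty


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


# Toy data generated from y = 3*x0 - 2*x1, split into two row blocks.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [3.0, -2.0, 1.0, 4.0]
parts = [partial_products(X[:2], y[:2]), partial_products(X[2:], y[2:])]
xtx, xty = combine(parts)
beta = solve(xtx, xty)
print(beta)
```

Because the partial sums are independent, how many jobs compute them, and whether X’X and X’y share a job, is a choice an optimizer is free to make from data shape and cluster memory.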

Page 13: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


Compilation Chain Overview with Example





b sb







Parse Tree

If dimensions are unknown at compile time, validate will pass through and additional checks will be made at run time.

Runtime Instructions:
CP: b+sb _mvar1
MR-Job: [map=X%*%_mvar1 _mvar2]
CP: y*_mvar2 _mvar3
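
The validation rule on this slide can be pictured as a shape check that tolerates unknown dimensions. The sketch below is a hypothetical illustration: shapes are `(rows, cols)` tuples, `None` stands for a dimension unknown at compile time, and the function names are invented for this example rather than taken from SystemML.

```python
def validate_matmul(left, right):
    """Compile-time check for left %*% right.

    Returns the output shape and a flag saying whether the inner
    dimensions still have to be checked at run time.
    """
    inner_left, inner_right = left[1], right[0]
    if inner_left is None or inner_right is None:
        # Unknown dimension: let validation pass, defer the check.
        return (left[0], right[1]), True
    if inner_left != inner_right:
        raise ValueError(f"dimension mismatch: {inner_left} vs {inner_right}")
    return (left[0], right[1]), False


def run_matmul(a, b, needs_runtime_check):
    """Run-time multiply that performs the deferred check when requested."""
    if needs_runtime_check and len(a[0]) != len(b):
        raise ValueError(f"dimension mismatch at run time: {len(a[0])} vs {len(b)}")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# X has an unknown column count at compile time; (b + sb) is a 2 x 1 vector.
shape, deferred = validate_matmul((3, None), (2, 1))
print(shape, deferred)
result = run_matmul([[1, 2], [3, 4], [5, 6]], [[1], [1]], deferred)
print(result)
```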



Page 14: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


Some Performance Numbers for Spark / Hadoop

In-Memory Data Set (160GB)

Data fits in aggregated memory: SystemML optimizations give ~10X over Hadoop

ML Program   MR Backend         Spark Backend      Spark Backend
             (All ML optims)    (All ML optims)    (Limited ML optims)
LinregDS     479s               342s               456s
LinregCG     954s               188s               243s
L2SVM        1,517s             237s               531s
GLM          1,989s             205s               318s

Large-Scale Data Set (1.6TB)

Data larger than aggregated memory: SystemML optimizations give ~2X

ML Program   MR Backend         Spark Backend
             (All ML optims)    (All ML optims)
LinregDS     5,429s             6,779s
LinregCG     12,469s            10,014s
L2SVM        24,360s            12,795s
GLM          32,521s            17,301s
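
One way to read these runtimes is as per-program ratios of MR backend time to Spark backend time (both with all ML optimizations). The sketch below only does that division on the numbers from the slide; how the slide's own "~10X" and "~2X" summaries were derived is not stated, so treating them as these ratios is an assumption.

```python
# Runtimes in seconds copied from the slide: (MR backend, Spark backend), all ML optims.
in_memory_160gb = {
    "LinregDS": (479, 342),
    "LinregCG": (954, 188),
    "L2SVM": (1517, 237),
    "GLM": (1989, 205),
}
large_scale_1_6tb = {
    "LinregDS": (5429, 6779),
    "LinregCG": (12469, 10014),
    "L2SVM": (24360, 12795),
    "GLM": (32521, 17301),
}


def speedups(table):
    """Ratio of MR runtime to Spark runtime for each program."""
    return {name: mr / spark for name, (mr, spark) in table.items()}


in_memory_ratio = speedups(in_memory_160gb)
large_scale_ratio = speedups(large_scale_1_6tb)
for label, ratios in (("160GB", in_memory_ratio), ("1.6TB", large_scale_ratio)):
    for name, ratio in ratios.items():
        print(f"{label} {name}: {ratio:.2f}x")
```

A ratio below 1 means the MR backend was faster for that program, which is the case for LinregDS on the 1.6TB data set.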

Page 15: Declarative Machine Learning: Bring your own Syntax, Algorithm, Data and Infrastructure


Thank You