Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017

Apache Mahout Distributed Matrix Math for Machine Learning

Transcript of Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle 2017

Page 1

Apache Mahout Distributed Matrix Math for Machine Learning

Page 2

About Me

• Senior Director of Data Science at Lucidworks (Apache Solr/Lucene, Fusion search tools)

• Formerly Chief Data Scientist, Technical Lead of Data Science Practice at Accenture

• Committer and PMC Member, Apache Mahout

• On Twitter @akm

• Email at [email protected], [email protected]

• Adversarial Learning podcast with @joelgrus at http://adversariallearning.com

Page 3

Apache Mahout Recent Trends in 0.12/0.13

• Simplify and improve performance of distributed matrix-math programming

• Provide flexible computation options for software and hardware

• Enable easier and quicker new algorithm development

• Allow polyglot programming and plotting in notebooks via Apache Zeppelin

Page 4

Introduction to Apache Mahout

Apache Mahout is an environment for creating scalable, performant machine-learning applications

Apache Mahout provides:

• Mathematically expressive Scala DSL

• A collection of pre-canned math and statistics algorithms

• Interchangeable distributed engines

• Interchangeable native solvers (JVM, CPU, GPU, CUDA, or custom)

Page 5

Feature Highlights in Recent Releases

• v 0.13.1, Soon — CUDA Solvers, Apache Spark 2.1/Scala 2.11 support

• New web site platform, May 2017 — Moved from ASF CMS system to Markdown and Jekyll; allows documentation pull requests to be merged in and published automatically

• v 0.13.0, Apr 2017 — GPU/CPU Solvers, algorithm framework

• v 0.12.2, Nov 2016 — Apache Zeppelin integration for notebooks and visualization

• v 0.12.0, Apr 2016 — Apache Flink backend support

• New Mahout book, Feb 2016 — ‘Apache Mahout: Beyond MapReduce’ by Dmitriy Lyubimov and Andrew Palumbo

• v 0.10.0, Apr 2015 — Mahout-Samsara vector-math DSL, MapReduce jobs soft-deprecated, Spark backend support

Page 6

Topic Overview

• Mahout-Samsara: Declarative, R-like, domain-specific language (DSL) for matrix math

• Backend-agnostic programming

• Apache Zeppelin notebooks

• Algorithm development framework (modeled after scikit-learn)

• Solve on available CPU cores, single or multiple GPUs, or in the JVM

• Next steps, and how to get involved

Page 7

Mahout-Samsara

Page 8

Mahout-Samsara

MapReduce is dead; long live the little clip-art blue man!

Page 9

Mahout-Samsara

• Mahout-Samsara is an easy-to-use domain-specific language (DSL) for large-scale machine learning on distributed systems like Apache Spark and Flink

• Uses Scala as programming/scripting environment

• Algebraic expression optimizer for distributed linear algebra

• Provides a translation layer to distributed engines

• Support for Spark RDDs and Flink DataSets

• System-agnostic, R-like DSL; actual formula from (d)spca (distributed stochastic PCA):

val G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
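To unpack the operators for readers new to the DSL: %*% is matrix multiplication, .t is transpose, dot is a vector inner product, and cross is a vector outer product. The same expression can be rendered in plain Python; this is an illustration of the semantics only, not the Mahout API, with made-up small inputs:

```python
# Plain-Python illustration of: G = B %*% B.t - C - C.t + (ksi dot ksi) * (s_q cross s_q)
# Matrices are lists of rows. This mirrors the operator semantics, not Mahout's API.

def t(m):                      # transpose (.t)
    return [list(col) for col in zip(*m)]

def mmul(a, b):                # matrix multiply (%*%)
    bt = t(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def dot(u, v):                 # vector inner product (dot)
    return sum(x * y for x, y in zip(u, v))

def cross(u, v):               # vector outer product (cross)
    return [[x * y for y in v] for x in u]

def madd(a, b, sign=1):        # elementwise add (sign=1) or subtract (sign=-1)
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(s, m):               # scalar times matrix
    return [[s * x for x in row] for row in m]

# Made-up small inputs, just to exercise the expression.
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[0.5, 0.0], [1.0, 0.5]]
ksi = [1.0, 2.0]
s_q = [0.5, 1.5]

G = madd(madd(madd(mmul(B, t(B)), C, -1), t(C), -1),
         scale(dot(ksi, ksi), cross(s_q, s_q)))
```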

Page 10

Mahout-Samsara

Example of an algebraic optimization

• Computation of A’A:

val C = A.t %*% A

• Naïve execution

• 1st pass: transpose A (requires repartitioning of A)

• 2nd pass: multiply result with A (expensive, potentially requires repartitioning again)

• Logical optimization

• Optimizer rewrites the plan to use a logical Transpose-Times-Self operator for the matrix multiplication

• Single pass: multiply partitioned rows by themselves as transposed columns

• Mahout-Samsara thus computes C = A’A via a row-outer-product formulation, executing in a single pass over row-partitioned A
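Both plans produce the same matrix; the single-pass plan exploits the identity that A’A is the sum, over the rows of A, of each row’s outer product with itself, so each partition only needs its own rows. A minimal plain-Python check of that identity (illustration only, not Mahout code):

```python
# Check the identity A'A == sum over rows a_i of the outer product (a_i x a_i).
# Each partition only needs its own rows, which is why one pass suffices.

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]

n = len(A[0])

# Naive two-pass plan: (A'A)[i][j] is the dot product of columns i and j of A.
At = [list(col) for col in zip(*A)]
naive = [[sum(x * y for x, y in zip(r, c)) for c in zip(*A)] for r in At]

# Single-pass plan: accumulate the outer product of each row with itself.
single = [[0.0] * n for _ in range(n)]
for row in A:
    for i, x in enumerate(row):
        for j, y in enumerate(row):
            single[i][j] += x * y

assert naive == single   # both plans yield A'A
```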

Page 15

Backend-Agnostic Programming

Page 16

Backend-Agnostic Programming

Page 17

Apache Zeppelin Notebooks

Page 18

Apache Zeppelin Notebooks

• Notebooks for polyglot programming with all types of data

• Plotting with R and Python, using data computed by other tools in the same notebook

• Share variables between interpreters

• For more: https://zeppelin.apache.org

• Mahout interpreter for Zeppelin released June 2016

• Post by Trevor Grant on how to use it at https://rawkintrevo.org/2016/05/19/visualizing-apache-mahout-in-r-via-apache-zeppelin-incubating

• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/mahout-in-zeppelin/

Page 19

Apache Zeppelin Notebooks

Add the Mahout Interpreter

Page 20

Apache Zeppelin Notebooks

Add the Mahout Interpreter, click “Create”

Page 21

Apache Zeppelin Notebooks

Example usage

Page 22

Apache Zeppelin Notebooks

Example usage

Page 23

Apache Zeppelin Notebooks

Hand results to R for plotting

Page 24

Algorithm Development Framework

Page 25

Algorithm Development Framework

• Patterned after R and Python (scikit-learn) APIs

• A Fitter populates a Model, which contains the parameter estimates, fit statistics, and a summary, and exposes a predict() method

• https://rawkintrevo.org/2017/05/02/introducing-pre-canned-algorithms-apache-mahout

• https://mahout.apache.org/docs/0.13.1-SNAPSHOT/tutorials/misc/contributing-algos
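The fitter/model split can be sketched in plain Python; the class and method names below (OrdinaryLeastSquares and its model) are hypothetical stand-ins, not Mahout's actual classes, but they follow the pattern described above: fit() estimates parameters and returns a populated model carrying fit statistics, a summary, and predict().

```python
# Sketch of the fitter -> model pattern described above.
# Class and method names are illustrative, NOT Mahout's actual API.

class OrdinaryLeastSquaresModel:
    """Model: holds parameter estimates and fit statistics, can predict."""
    def __init__(self, slope, intercept, r_squared):
        self.slope = slope              # parameter estimates
        self.intercept = intercept
        self.r_squared = r_squared      # a fit statistic

    def predict(self, xs):
        return [self.slope * x + self.intercept for x in xs]

    def summary(self):
        return (f"y = {self.slope:.3f} * x + {self.intercept:.3f} "
                f"(R^2 = {self.r_squared:.3f})")

class OrdinaryLeastSquares:
    """Fitter: fit() estimates parameters and returns a populated model."""
    def fit(self, xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = my - slope * mx
        ss_res = sum((y - (slope * x + intercept)) ** 2
                     for x, y in zip(xs, ys))
        ss_tot = sum((y - my) ** 2 for y in ys)
        return OrdinaryLeastSquaresModel(slope, intercept,
                                         1.0 - ss_res / ss_tot)

model = OrdinaryLeastSquares().fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```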

Page 26

Solve on CPU, GPU, or JVM

Page 27

Solve on CPU, GPU, or JVM

Current architecture with native CPU and GPU support and unreleased jCUDA bindings

Page 28

Solve on CPU, GPU, or JVM

Initial benchmarking on latest release

Page 29

Solve on CPU, GPU, or JVM

Initial benchmarking on latest release

• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.2, with 5 runs: Mahout JVM sparse multiplication time: 1501 ms; Mahout jCUDA sparse multiplication time: 49 ms

~30x speedup

• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.02, with 5 runs: Mahout JVM sparse multiplication time: 34 ms; Mahout jCUDA sparse multiplication time: 4 ms

~8.5x speedup

• Sparse MMul at geometry of 1000 x 1000 %*% 1000 x 1000, density = 0.002, with 5 runs: Mahout JVM sparse multiplication time: 1 ms; Mahout jCUDA sparse multiplication time: 1 ms

No speedup; at this density the workload is too small for GPU offload to pay off

Page 30

Solve on CPU, GPU, or JVM

Next steps

• jCUDA work is still in a branch and will be merged to master in the next couple of months

• Currently the modes of compute are JVM, CPU (using all available cores), and single GPU

• Multi-GPU support is the next priority

• Currently multiplication is dispatched to different solvers based on matrix shape (banding, triangularity, etc.)

• Directing where data and compute should live, based on matrix shape and density, is another priority

• Watch this space for further speedups
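The shape- and density-based solver dispatch can be sketched roughly as follows; the function name and thresholds are hypothetical, not Mahout's actual selection logic, but they capture the idea (and echo the benchmark above, where very sparse inputs saw no GPU benefit):

```python
# Illustrative sketch of dispatching a multiply to different solvers based on
# matrix size and density. Names and thresholds are hypothetical, not Mahout's.

def choose_solver(rows, cols, density, gpu_available):
    if density < 0.01 or rows * cols < 10_000:
        return "jvm"            # tiny or very sparse work: offload overhead dominates
    if gpu_available:
        return "gpu"            # large and dense enough to amortize transfer cost
    return "cpu"                # multicore native solver

# Mirrors the benchmark: dense-ish work pays off on GPU, very sparse does not.
assert choose_solver(1000, 1000, 0.2, gpu_available=True) == "gpu"
assert choose_solver(1000, 1000, 0.002, gpu_available=True) == "jvm"
assert choose_solver(1000, 1000, 0.2, gpu_available=False) == "cpu"
```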

Page 31

How to Use Mahout and Get Involved

Page 32

How to Use Mahout and Get Involved

Web: https://mahout.apache.org

Source code, PRs welcome: https://github.com/apache/mahout

Mailing lists: https://mahout.apache.org/community/mailing-lists.html

Download, install, embed: https://mahout.apache.org/downloads.html

Page 33

Thank You

Q&A

https://mahout.apache.org
https://github.com/apache/mahout