What's Right and Wrong with Apache Mahout

Post on 10-May-2015


This is a summary of what I think is good and bad about Mahout. Presented on the eve of the 2013 Hadoop Summit.

Transcript of What's Right and Wrong with Apache Mahout

© MapR Technologies 2013 - Confidential

Apache Mahout

How it's good, how it's awesome, and where it falls short


What is Mahout?

“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.

Components
– math library
– clustering
– classification
– decompositions
– recommendations


What is Right and Wrong with Mahout?

Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff


All the stuff that isn’t there


Mahout Math


Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data

But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature


Matrices and Vectors

At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix

Highly composable API

Important ideas:
– view*, assign and aggregate
– iteration

m.viewDiagonal().assign(v)


Assign

Matrices

Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);

Vectors

Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);


Views

Matrices

Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();

Vectors

Vector viewPart(int offset, int length);


Examples

The trace of a matrix

m.viewDiagonal().zSum()

Random projection

m.times(new DenseMatrix(1000, 3).assign(new Normal()))

Low rank random matrix
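The trace and random projection one-liners in these examples can be spelled out without Mahout. The following is a minimal plain-Java sketch, assuming dense `double[][]` arrays in place of Mahout matrices; the class and method names (`RandomProjection`, `trace`, `gaussian`, `times`) are hypothetical and are not part of the Mahout API.

```java
import java.util.Random;

class RandomProjection {
    // Trace: sum of the diagonal, the plain-array analogue of viewDiagonal().zSum().
    static double trace(double[][] m) {
        double sum = 0;
        for (int i = 0; i < Math.min(m.length, m[0].length); i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Fill a rows x cols matrix with independent standard normal samples.
    static double[][] gaussian(int rows, int cols, Random rng) {
        double[][] r = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                r[i][j] = rng.nextGaussian();
            }
        }
        return r;
    }

    // Plain dense matrix product a * b.
    static double[][] times(double[][] a, double[][] b) {
        int n = a.length, k = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int l = 0; l < k; l++) {
                for (int j = 0; j < p; j++) {
                    c[i][j] += a[i][l] * b[l][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        // Stand-in for the data matrix m: 5 rows, 1000 columns (sizes are assumptions).
        double[][] m = gaussian(5, 1000, rng);
        // Project the 1000 columns down to 3 with a random Gaussian matrix.
        double[][] projected = times(m, gaussian(1000, 3, rng));
        System.out.println(projected.length + " x " + projected[0].length);
    }
}
```

The point of the Mahout versions is that a view plus an aggregate, or a product with a freshly assigned matrix, replaces each of these loops with a single expression.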


Recommenders


Examples of Recommendations

Customers buying books (Linden et al.)
Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google Maps)


Recommendation Basics

History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1


Recommendation Basics

History as matrix:

(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once

    t1  t2  t3  t4
u1   1   0   1   0
u2   1   0   1   1
u3   0   1   0   1
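The cooccurrence counts quoted for this matrix can be checked by multiplying the history matrix by its own transpose. The sketch below is plain Java on `int[][]` arrays rather than Mahout code; the class and method names (`Cooccurrence`, `cooccurrence`) are hypothetical.

```java
class Cooccurrence {
    // Compute A^T A for a binary user-by-item history matrix.
    // Off-diagonal entry [i][j] counts users who interacted with both items i and j;
    // diagonal entry [i][i] counts users who interacted with item i.
    static int[][] cooccurrence(int[][] history) {
        int items = history[0].length;
        int[][] c = new int[items][items];
        for (int[] user : history) {
            for (int i = 0; i < items; i++) {
                if (user[i] == 0) {
                    continue; // this user contributes nothing to row i
                }
                for (int j = 0; j < items; j++) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = {
            {1, 0, 1, 0}, // u1
            {1, 0, 1, 1}, // u2
            {0, 1, 0, 1}, // u3
        };
        int[][] c = cooccurrence(a);
        System.out.println("(t1, t3): " + c[0][2]); // 2
        System.out.println("(t1, t4): " + c[0][3]); // 1
        System.out.println("(t2, t4): " + c[1][3]); // 1
        System.out.println("(t3, t4): " + c[2][3]); // 1
    }
}
```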


A Quick Simplification

Users who do h

Also do r

User-centric recommendations

Item-centric recommendations
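The formulas on this slide did not survive in the transcript. One common reading, assumed here and not confirmed by the text, is that with a user-by-item history matrix A and one user's history vector h, the user-centric form computes A^T (A h) (find the users who share items with h, then add up what they did), while the item-centric form computes (A^T A) h (apply the item cooccurrence matrix to h). The plain-Java sketch below, with hypothetical names throughout, shows that the two groupings give the same scores.

```java
import java.util.Arrays;

class Recommend {
    // y = M v: one score per row of M.
    static int[] times(int[][] m, int[] v) {
        int[] y = new int[m.length];
        for (int r = 0; r < m.length; r++) {
            for (int c = 0; c < v.length; c++) {
                y[r] += m[r][c] * v[c];
            }
        }
        return y;
    }

    // r = M^T w: one score per column of M, summing rows weighted by w.
    static int[] transposeTimes(int[][] m, int[] w) {
        int[] r = new int[m[0].length];
        for (int u = 0; u < m.length; u++) {
            for (int t = 0; t < r.length; t++) {
                r[t] += m[u][t] * w[u];
            }
        }
        return r;
    }

    // C = A^T A, the item-by-item cooccurrence matrix.
    static int[][] gram(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int[] user : a) {
            for (int i = 0; i < items; i++) {
                for (int j = 0; j < items; j++) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }

    // User-centric grouping: A^T (A h).
    static int[] userCentric(int[][] a, int[] h) {
        return transposeTimes(a, times(a, h));
    }

    // Item-centric grouping: (A^T A) h.
    static int[] itemCentric(int[][] a, int[] h) {
        return times(gram(a), h);
    }

    public static void main(String[] args) {
        int[][] a = {{1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1}};
        int[] h = {1, 0, 0, 0}; // a user whose only history is t1
        System.out.println(Arrays.toString(userCentric(a, h))); // [2, 0, 2, 1]
        System.out.println(Arrays.toString(itemCentric(a, h))); // [2, 0, 2, 1]
    }
}
```

Under this reading, the practical difference is only where the work happens: the item-centric grouping lets the cooccurrence matrix be computed once, offline, and reused for every user.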


Clustering


An Example



Diagonalized Cluster Proximity


Parallel Speedup?


Lots of Clusters Are Fine


Decompositions


Low Rank Matrix

Or should we see it differently?

Are these scaled up versions of all the same column?

1 2 5

2 4 10

10 20 50

20 40 100


Low Rank Matrix

Matrix multiplication is designed to make this easy

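We can see weighted column patterns, or weighted row patterns

All the same mathematically

$$
\underbrace{\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix}}_{\text{column pattern (or weights)}}
\times
\underbrace{\begin{bmatrix} 1 & 2 & 5 \end{bmatrix}}_{\text{weights (or row pattern)}}
$$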


Low Rank Matrix

What about here?

This is like before, but there is one exceptional value

1 2 5

2 4 10

10 100 50

20 40 100


Low Rank Matrix

OK … add in a simple fixer upper

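$$
\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix}
\times
\begin{bmatrix} 1 & 2 & 5 \end{bmatrix}
+
\underbrace{\begin{bmatrix} 0 \\ 0 \\ 10 \\ 0 \end{bmatrix}}_{\text{which row}}
\times
\underbrace{\begin{bmatrix} 0 & 8 & 0 \end{bmatrix}}_{\text{exception pattern}}
$$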
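The decomposition on this slide can be verified numerically: one outer product gives the regular pattern, and a second, mostly-zero outer product patches the single exceptional value. The sketch below is plain Java with hypothetical names (`LowRank`, `outer`, `plus`), not Mahout code.

```java
import java.util.Arrays;

class LowRank {
    // Outer product u v^T: entry [i][j] = u[i] * v[j].
    static int[][] outer(int[] u, int[] v) {
        int[][] m = new int[u.length][v.length];
        for (int i = 0; i < u.length; i++) {
            for (int j = 0; j < v.length; j++) {
                m[i][j] = u[i] * v[j];
            }
        }
        return m;
    }

    // Element-wise sum of two matrices of the same shape.
    static int[][] plus(int[][] a, int[][] b) {
        int[][] m = new int[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                m[i][j] = a[i][j] + b[i][j];
            }
        }
        return m;
    }

    public static void main(String[] args) {
        // Rank-1 part: every row is a multiple of [1 2 5].
        int[][] base = outer(new int[]{1, 2, 10, 20}, new int[]{1, 2, 5});
        // Correction: touches only row 3, column 2, adding 10 * 8 = 80.
        int[][] fix = outer(new int[]{0, 0, 10, 0}, new int[]{0, 8, 0});
        for (int[] row : plus(base, fix)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The sum has rank 2, so two column/row pairs (7 + 7 numbers) describe all 12 entries, including the exception.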


Random Projection


SVD Projection


Classifiers


Mahout Classifiers

Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve

SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit

Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy


The stuff that isn’t there


What Mahout Isn’t

Mahout isn’t R, isn’t SAS

It doesn’t aim to do everything

It aims to scale some few problems of practical interest

The stuff that isn’t there is a feature, not a defect


Contact:
– tdunning@maprtech.com
– @ted_dunning
– @apachemahout
– user-subscribe@mahout.apache.org

Slides and such: http://www.slideshare.net/tdunning

Hash tags: #mapr #apachemahout