What's Right and Wrong with Apache Mahout



This is a summary of what I think is good and bad about Mahout. Presented on the eve of the 2013 Hadoop Summit.

Transcript of What's Right and Wrong with Apache Mahout

Page 1: What's Right and Wrong with Apache Mahout

© MapR Technologies 2013 - Confidential

Apache Mahout

How it's good, how it's awesome, and where it falls short

Page 2: What's Right and Wrong with Apache Mahout


What is Mahout?

“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.

Components
– math library
– clustering
– classification
– decompositions
– recommendations

Page 3: What's Right and Wrong with Apache Mahout


What is Right and Wrong with Mahout?

Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff

Page 4: What's Right and Wrong with Apache Mahout


What is Right and Wrong with Mahout?

Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff

Page 5: What's Right and Wrong with Apache Mahout


What is Right and Wrong with Mahout?

Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff

All the stuff that isn’t there

Page 6: What's Right and Wrong with Apache Mahout


Mahout Math

Page 7: What's Right and Wrong with Apache Mahout


Mahout Math

Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data

But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature

Page 8: What's Right and Wrong with Apache Mahout


Matrices and Vectors

At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix

Highly composable API

Important ideas:
– view*, assign and aggregate
– iteration

m.viewDiagonal().assign(v)

Page 9: What's Right and Wrong with Apache Mahout


Assign

Matrices

Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);

Vectors

Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
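
The assign family above can be read as "overwrite in place, optionally through a function". A minimal plain-Java analogy follows. It is a sketch on raw arrays, not Mahout code: the class name AssignSketch is hypothetical, and java.util.function types stand in for Mahout's DoubleFunction and DoubleDoubleFunction.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Plain-Java analogy for the assign family; this is NOT Mahout code.
final class AssignSketch {
    // Like assign(double value): overwrite every element with a constant.
    static double[] assign(double[] v, double value) {
        Arrays.fill(v, value);
        return v;
    }

    // Like assign(DoubleFunction f): replace each element x with f(x).
    static double[] assign(double[] v, DoubleUnaryOperator f) {
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i]);
        }
        return v;
    }

    // Like assign(Vector other, DoubleDoubleFunction f): v[i] = f(v[i], other[i]).
    static double[] assign(double[] v, double[] other, DoubleBinaryOperator f) {
        if (v.length != other.length) {
            throw new IllegalArgumentException("size mismatch");
        }
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i], other[i]);
        }
        return v;
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3};
        assign(v, x -> x * x);                          // square each element
        assign(v, new double[] {1, 1, 1}, Double::sum); // add another vector element-wise
        System.out.println(Arrays.toString(v));         // prints [2.0, 5.0, 10.0]
    }
}
```

Each variant returns the receiver, which is what lets calls chain in the composable style the deck describes.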

Page 10: What's Right and Wrong with Apache Mahout


Views

Matrices

Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();

Vectors

Vector viewPart(int offset, int length);
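
A view is a window onto existing storage, so writing through the view changes the underlying matrix. The sketch below illustrates that idea for the diagonal in plain Java. It is an analogy, not Mahout's implementation: ViewSketch and DiagonalView are hypothetical names, and only the method names assign and zSum are borrowed from the deck.

```java
// Plain-Java analogy for a diagonal view; this is NOT Mahout code.
final class ViewSketch {
    /** A window onto the diagonal of a 2-D array; it copies nothing. */
    static final class DiagonalView {
        private final double[][] m;

        DiagonalView(double[][] m) {
            this.m = m;
        }

        int size() {
            return m.length == 0 ? 0 : Math.min(m.length, m[0].length);
        }

        // Writes go straight through to the backing matrix.
        DiagonalView assign(double[] values) {
            if (values.length != size()) {
                throw new IllegalArgumentException("size mismatch");
            }
            for (int i = 0; i < size(); i++) {
                m[i][i] = values[i];
            }
            return this;
        }

        // Aggregate: sum of the viewed elements, i.e. the trace.
        double zSum() {
            double sum = 0;
            for (int i = 0; i < size(); i++) {
                sum += m[i][i];
            }
            return sum;
        }
    }

    public static void main(String[] args) {
        double[][] m = new double[3][3];
        new DiagonalView(m).assign(new double[] {1, 2, 3});
        System.out.println(m[1][1]);                    // prints 2.0
        System.out.println(new DiagonalView(m).zSum()); // prints 6.0
    }
}
```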

Page 11: What's Right and Wrong with Apache Mahout


Examples

The trace of a matrix

Random projection

Low rank random matrix

Page 12: What's Right and Wrong with Apache Mahout


Examples

The trace of a matrix: m.viewDiagonal().zSum()

Random projection

Low rank random matrix

Page 13: What's Right and Wrong with Apache Mahout


Examples

The trace of a matrix: m.viewDiagonal().zSum()

Random projection: m.times(new DenseMatrix(1000, 3).assign(new Normal()))

Low rank random matrix
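
The random projection example multiplies m by a 1000 x 3 matrix of normal samples, which maps each 1000-dimensional row of m down to 3 dimensions. A plain-Java sketch of the same computation follows. It is not Mahout code: ProjectionSketch, gaussian and times are hypothetical names, and java.util.Random with a fixed seed stands in for Mahout's Normal sampler.

```java
import java.util.Random;

// Plain-Java analogy for random projection; this is NOT Mahout code.
final class ProjectionSketch {
    // A rows x cols matrix of standard normal samples.
    static double[][] gaussian(int rows, int cols, Random rng) {
        double[][] g = new double[rows][cols];
        for (double[] row : g) {
            for (int j = 0; j < cols; j++) {
                row[j] = rng.nextGaussian();
            }
        }
        return g;
    }

    // Dense matrix product a * b.
    static double[][] times(double[][] a, double[][] b) {
        if (a[0].length != b.length) {
            throw new IllegalArgumentException("inner dimensions differ");
        }
        int n = a.length, k = b.length, p = b[0].length;
        double[][] c = new double[n][p];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                double aij = a[i][j];
                for (int l = 0; l < p; l++) {
                    c[i][l] += aij * b[j][l];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[][] m = gaussian(5, 1000, rng); // stand-in data: 5 rows, 1000 columns
        double[][] projected = times(m, gaussian(1000, 3, rng));
        System.out.println(projected.length + " x " + projected[0].length); // prints 5 x 3
    }
}
```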

Page 14: What's Right and Wrong with Apache Mahout


Recommenders

Page 15: What's Right and Wrong with Apache Mahout


Examples of Recommendations

Customers buying books (Linden et al)

Web visitors rating music (Shardanand and Maes) or movies (Riedl et al), (Netflix)

Internet radio listeners not skipping songs (Musicmatch)

Internet video watchers watching >30 s (Veoh)

Visibility in a map UI (new Google maps)

Page 16: What's Right and Wrong with Apache Mahout


Recommendation Basics

History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1

Page 17: What's Right and Wrong with Apache Mahout


Recommendation Basics

History as matrix:

(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once

     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1
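
The cooccurrence counts quoted above are the off-diagonal entries of A^T A, where A is the history matrix. The plain-Java sketch below recomputes them from the slide's matrix. The class name CooccurrenceSketch and its method names are hypothetical, and this is an illustration rather than Mahout's recommender code.

```java
// Plain-Java sketch of cooccurrence counting; this is NOT Mahout code.
final class CooccurrenceSketch {
    // History matrix from the slide: rows are users u1..u3, columns are things t1..t4.
    static final int[][] HISTORY = {
        {1, 0, 1, 0},
        {1, 0, 1, 1},
        {0, 1, 0, 1},
    };

    // C = A^T A, so C[i][j] is the number of users who have both thing i and thing j.
    static int[][] cooccurrence(int[][] a) {
        int items = a[0].length;
        int[][] c = new int[items][items];
        for (int[] user : a) {
            for (int i = 0; i < items; i++) {
                for (int j = 0; j < items; j++) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] c = cooccurrence(HISTORY);
        // Print each distinct pair that cooccurs at least once.
        for (int i = 0; i < c.length; i++) {
            for (int j = i + 1; j < c.length; j++) {
                if (c[i][j] > 0) {
                    System.out.println("(t" + (i + 1) + ", t" + (j + 1) + ") " + c[i][j]);
                }
            }
        }
    }
}
```

Running main lists (t1, t3) with count 2 and (t1, t4), (t2, t4), (t3, t4) with count 1 each, matching the slide.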

Page 18: What's Right and Wrong with Apache Mahout


A Quick Simplification

Users who do h also do r

User-centric recommendations

Item-centric recommendations
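
One common way to formalize the two views, assumed here rather than taken from the slide: with history matrix A and a history vector h, the user-centric form computes r = A^T (A h) (find users who share h, then collect what they did), and the item-centric form computes r = (A^T A) h (precompute item cooccurrence, then apply it to h). The two are equal because matrix multiplication is associative. The sketch below checks this on the 3 x 4 history matrix from the earlier slide; all names are hypothetical and this is not Mahout code.

```java
import java.util.Arrays;

// Plain-Java sketch of user-centric vs item-centric scoring; this is NOT Mahout code.
final class RecommendSketch {
    static int[] times(int[][] m, int[] v) {
        int[] out = new int[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) {
                out[i] += m[i][j] * v[j];
            }
        }
        return out;
    }

    static int[][] times(int[][] a, int[][] b) {
        int[][] out = new int[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    out[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return out;
    }

    static int[][] transpose(int[][] m) {
        int[][] t = new int[m[0].length][m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[0].length; j++) {
                t[j][i] = m[i][j];
            }
        }
        return t;
    }

    // r = A^T (A h): weight users by overlap with h, then sum their histories.
    static int[] userCentric(int[][] a, int[] h) {
        return times(transpose(a), times(a, h));
    }

    // r = (A^T A) h: apply the item cooccurrence matrix to h.
    static int[] itemCentric(int[][] a, int[] h) {
        return times(times(transpose(a), a), h);
    }

    public static void main(String[] args) {
        int[][] a = {{1, 0, 1, 0}, {1, 0, 1, 1}, {0, 1, 0, 1}};
        int[] h = {1, 0, 1, 0}; // a user who did t1 and t3
        System.out.println(Arrays.toString(userCentric(a, h))); // prints [4, 0, 4, 2]
        System.out.println(Arrays.toString(itemCentric(a, h))); // prints [4, 0, 4, 2]
    }
}
```

The item-centric grouping is usually the more convenient of the two, since A^T A can be computed offline and reused for every h.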

Page 19: What's Right and Wrong with Apache Mahout


Clustering

Page 20: What's Right and Wrong with Apache Mahout


An Example

Page 21: What's Right and Wrong with Apache Mahout


An Example

Page 22: What's Right and Wrong with Apache Mahout


Diagonalized Cluster Proximity

Page 23: What's Right and Wrong with Apache Mahout


Parallel Speedup?

Page 24: What's Right and Wrong with Apache Mahout


Lots of Clusters Are Fine

Page 25: What's Right and Wrong with Apache Mahout


Decompositions

Page 26: What's Right and Wrong with Apache Mahout


Low Rank Matrix

Or should we see it differently?

Are these all scaled-up versions of the same column?

1 2 5

2 4 10

10 20 50

20 40 100

Page 27: What's Right and Wrong with Apache Mahout


Low Rank Matrix

Matrix multiplication is designed to make this easy

We can see weighted column patterns, or weighted row patterns

All the same mathematically

[1 2 10 20]^T  x  [1 2 5]

Column pattern (or weights): [1 2 10 20]^T

Weights (or row pattern): [1 2 5]
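
The product of the column pattern and the row pattern is an outer product, and it reproduces the 4 x 3 matrix from the previous slide exactly. A minimal plain-Java check, with the hypothetical names RankOneSketch and outer:

```java
import java.util.Arrays;

// Plain-Java sketch of a rank-one matrix as an outer product; this is NOT Mahout code.
final class RankOneSketch {
    // result[i][j] = column[i] * row[j]
    static int[][] outer(int[] column, int[] row) {
        int[][] result = new int[column.length][row.length];
        for (int i = 0; i < column.length; i++) {
            for (int j = 0; j < row.length; j++) {
                result[i][j] = column[i] * row[j];
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] m = outer(new int[] {1, 2, 10, 20}, new int[] {1, 2, 5});
        for (int[] r : m) {
            System.out.println(Arrays.toString(r));
        }
        // prints:
        // [1, 2, 5]
        // [2, 4, 10]
        // [10, 20, 50]
        // [20, 40, 100]
    }
}
```

Storing the two vectors takes 4 + 3 numbers instead of 4 x 3, which is the saving that low-rank decompositions exploit at scale.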

Page 28: What's Right and Wrong with Apache Mahout


Low Rank Matrix

What about here?

This is like before, but there is one exceptional value

1 2 5

2 4 10

10 100 50

20 40 100

Page 29: What's Right and Wrong with Apache Mahout


Low Rank Matrix

OK … add in a simple fixer upper

[1 2 10 20]^T  x  [1 2 5]   +   [0 0 10 0]^T  x  [0 8 0]

Which row: [0 0 10 0]^T

Exception pattern: [0 8 0]
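
The fix-up term contributes 10 x 8 = 80 in the third row, second column, which lifts the base value 20 to the exceptional 100. The plain-Java sketch below adds the two outer products and recovers the matrix with the exceptional value; ExceptionSketch, outer and plus are hypothetical names, not Mahout API.

```java
import java.util.Arrays;

// Plain-Java sketch of a rank-two matrix as a sum of outer products; this is NOT Mahout code.
final class ExceptionSketch {
    // result[i][j] = column[i] * row[j]
    static int[][] outer(int[] column, int[] row) {
        int[][] result = new int[column.length][row.length];
        for (int i = 0; i < column.length; i++) {
            for (int j = 0; j < row.length; j++) {
                result[i][j] = column[i] * row[j];
            }
        }
        return result;
    }

    // Element-wise sum of two matrices of the same shape.
    static int[][] plus(int[][] a, int[][] b) {
        int[][] sum = new int[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                sum[i][j] = a[i][j] + b[i][j];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] base = outer(new int[] {1, 2, 10, 20}, new int[] {1, 2, 5});
        int[][] fix = outer(new int[] {0, 0, 10, 0}, new int[] {0, 8, 0});
        for (int[] r : plus(base, fix)) {
            System.out.println(Arrays.toString(r));
        }
        // prints:
        // [1, 2, 5]
        // [2, 4, 10]
        // [10, 100, 50]
        // [20, 40, 100]
    }
}
```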

Page 30: What's Right and Wrong with Apache Mahout


Random Projection

Page 31: What's Right and Wrong with Apache Mahout


SVD Projection

Page 32: What's Right and Wrong with Apache Mahout


Classifiers

Page 33: What's Right and Wrong with Apache Mahout


Mahout Classifiers

Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve

SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit

Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy

Page 34: What's Right and Wrong with Apache Mahout


The stuff that isn’t there

Page 35: What's Right and Wrong with Apache Mahout


What Mahout Isn’t

Mahout isn’t R, isn’t SAS

It doesn’t aim to do everything

It aims to scale a few problems of practical interest

The stuff that isn’t there is a feature, not a defect

Page 36: What's Right and Wrong with Apache Mahout


Contact:
– [email protected]
– @ted_dunning
– @apachemahout
– @[email protected]

Slides and such: http://www.slideshare.net/tdunning

Hash tags: #mapr #apachemahout