What's Right and Wrong with Apache Mahout
ted-dunning
© MapR Technologies 2013 - Confidential
Apache Mahout
How it's good, how it's awesome, and where it falls short
What is Mahout?
“Scalable machine learning”
– not just Hadoop-oriented machine learning
– not entirely, that is. Just mostly.
Components
– math library
– clustering
– classification
– decompositions
– recommendations
What is Right and Wrong with Mahout?
Components
– recommendations
– math library
– clustering
– classification
– decompositions
– other stuff
All the stuff that isn’t there
Mahout Math
Goals are
– basic linear algebra,
– and statistical sampling,
– and good clustering,
– decent speed,
– extensibility,
– especially for sparse data
But not
– totally badass speed
– comprehensive set of algorithms
– optimization, root finders, quadrature
Matrices and Vectors
At the core:
– DenseVector, RandomAccessSparseVector
– DenseMatrix, SparseRowMatrix
Highly composable API
Important ideas:
– view*, assign and aggregate
– iteration
m.viewDiagonal().assign(v)
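To make the view idea concrete, here is a minimal plain-Java sketch of a diagonal view that writes through to its backing matrix. `DiagonalViewSketch`, `DiagonalView` and `sum` are hypothetical names invented for illustration; this is not Mahout's implementation, only the pattern that `m.viewDiagonal().assign(v)` relies on.

```java
import java.util.Arrays;

// Hypothetical stand-in for the view idea; this is NOT Mahout's implementation.
class DiagonalViewSketch {
    // A "view" holds a reference to the backing array, not a copy.
    static final class DiagonalView {
        private final double[][] backing;

        DiagonalView(double[][] backing) {
            this.backing = backing;
        }

        int size() {
            return Math.min(backing.length, backing[0].length);
        }

        // Analogue of assign(Vector): writes go straight into the backing matrix.
        DiagonalView assign(double[] values) {
            for (int i = 0; i < size(); i++) {
                backing[i][i] = values[i];
            }
            return this;
        }

        // Analogue of an aggregate such as zSum(): sums the viewed elements.
        double sum() {
            double total = 0;
            for (int i = 0; i < size(); i++) {
                total += backing[i][i];
            }
            return total;
        }
    }

    public static void main(String[] args) {
        double[][] m = new double[3][3];
        new DiagonalView(m).assign(new double[] {1, 2, 3});
        System.out.println(Arrays.deepToString(m));
        System.out.println(new DiagonalView(m).sum());
    }
}
```

Because the view shares storage with the matrix, assigning through it changes the matrix itself, which is what makes view, assign and aggregate compose.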
Assign
Matrices
Matrix assign(double value);
Matrix assign(double[][] values);
Matrix assign(Matrix other);
Matrix assign(DoubleFunction f);
Matrix assign(Matrix other, DoubleDoubleFunction f);
Vectors
Vector assign(double value);
Vector assign(double[] values);
Vector assign(Vector other);
Vector assign(DoubleFunction f);
Vector assign(Vector other, DoubleDoubleFunction f);
Vector assign(DoubleDoubleFunction f, double y);
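The function-taking overloads are the interesting ones: they update elements in place from a function of the current value, optionally combined with a second operand. The sketch below mimics that behavior on plain arrays with the JDK's functional interfaces. `AssignSketch` is a hypothetical name, and the element-wise semantics shown are my reading of the signatures above rather than something the slide spells out.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Plain-Java stand-in for the assign idea; names here are illustrative, not Mahout's.
class AssignSketch {
    // Like assign(DoubleFunction f): x[i] = f(x[i]), in place.
    static double[] assign(double[] x, DoubleUnaryOperator f) {
        for (int i = 0; i < x.length; i++) {
            x[i] = f.applyAsDouble(x[i]);
        }
        return x;
    }

    // Like assign(Vector other, DoubleDoubleFunction f): x[i] = f(x[i], y[i]), in place.
    static double[] assign(double[] x, double[] y, DoubleBinaryOperator f) {
        for (int i = 0; i < x.length; i++) {
            x[i] = f.applyAsDouble(x[i], y[i]);
        }
        return x;
    }

    public static void main(String[] args) {
        double[] x = {1, 4, 9};
        assign(x, Math::sqrt);                              // element-wise square root
        assign(x, new double[] {10, 20, 30}, Double::sum);  // element-wise addition
        System.out.println(Arrays.toString(x));
    }
}
```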
Views
Matrices
Matrix viewPart(int[] offset, int[] size);
Matrix viewPart(int row, int rlen, int col, int clen);
Vector viewRow(int row);
Vector viewColumn(int column);
Vector viewDiagonal();
Vectors
Vector viewPart(int offset, int length);
Examples
The trace of a matrix
m.viewDiagonal().zSum()
Random projection
m.times(new DenseMatrix(1000, 3).assign(new Normal()))
Low rank random matrix
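For readers without Mahout at hand, the same two computations can be spelled out with plain arrays. This is a sketch only: `TraceAndProjectionSketch` and its helpers are hypothetical names, `java.util.Random` stands in for the sampler used on the slide, and the 5 x 1000 input is an arbitrary size chosen so that it can be multiplied by a 1000 x 3 Gaussian matrix.

```java
import java.util.Random;

// Plain-Java sketch of the two one-liners; not Mahout code.
class TraceAndProjectionSketch {
    // Trace: sum of the diagonal, the same quantity as m.viewDiagonal().zSum().
    static double trace(double[][] m) {
        double sum = 0;
        int n = Math.min(m.length, m[0].length);
        for (int i = 0; i < n; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Matrix filled with independent standard normal samples.
    static double[][] gaussian(int rows, int cols, Random rng) {
        double[][] g = new double[rows][cols];
        for (double[] row : g) {
            for (int j = 0; j < cols; j++) {
                row[j] = rng.nextGaussian();
            }
        }
        return g;
    }

    // Dense matrix product a * b.
    static double[][] times(double[][] a, double[][] b) {
        double[][] c = new double[a.length][b[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int k = 0; k < b.length; k++) {
                for (int j = 0; j < b[0].length; j++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        Random rng = new Random(42); // fixed seed, an arbitrary choice
        double[][] m = gaussian(5, 1000, rng);
        // Random projection: 1000 columns are reduced to 3.
        double[][] projected = times(m, gaussian(1000, 3, rng));
        System.out.println(projected.length + " x " + projected[0].length);
        System.out.println(trace(new double[][] {{1, 2}, {3, 4}}));
    }
}
```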
Recommenders
Examples of Recommendations
Customers buying books (Linden et al)
Web visitors rating music (Shardanand and Maes) or movies (Riedl, et al), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google maps)
Recommendation Basics
History:
User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1
Recommendation Basics
History as matrix:
    t1 t2 t3 t4
u1  1  0  1  0
u2  1  0  1  1
u3  0  1  0  1
(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once
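The step from the history list to the matrix and then to cooccurrence counts can be sketched in a few lines of plain Java. `CooccurrenceSketch` and its method names are hypothetical; the data is the (user, thing) history from the slides, and the counts are computed as the product of the transposed history matrix with itself.

```java
import java.util.Arrays;

// Plain-Java sketch; class and method names are illustrative only.
class CooccurrenceSketch {
    // The (user, thing) history from the slides, both 1-based.
    static final int[][] HISTORY = {{1, 3}, {2, 4}, {3, 4}, {2, 3}, {3, 2}, {1, 1}, {2, 1}};

    // Build the binary user-by-thing matrix from (user, thing) pairs.
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] a = new int[users][things];
        for (int[] p : pairs) {
            a[p[0] - 1][p[1] - 1] = 1;
        }
        return a;
    }

    // c[i][j] = number of users who have both thing i and thing j (A transposed times A).
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] row : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += row[i] * row[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] a = historyMatrix(HISTORY, 3, 4);
        System.out.println(Arrays.deepToString(a));
        System.out.println(Arrays.deepToString(cooccurrence(a)));
    }
}
```

The off-diagonal entries of the result are the cooccurrence counts quoted on the slide; the diagonal simply counts how many users have each thing.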
A Quick Simplification
Users who do h
Also do r
User-centric recommendations
Item-centric recommendations
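The formulas that went with these labels did not survive in this transcript. One plausible reading, offered as an assumption rather than a reconstruction of the slide: let $A$ be the user-by-thing history matrix and $h$ a vector marking the things in one user's history (both symbols are mine). Then $Ah$ scores the users who do $h$, and multiplying by $A^{\top}$ gives what those users also do:

$$
r = A^{\top}(A h) = (A^{\top} A)\, h
$$

The left grouping goes through users first (user-centric); the right grouping goes through the item cooccurrence matrix $A^{\top}A$ first (item-centric). Matrix multiplication is associative, so both give the same $r$.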
Clustering
An Example
Diagonalized Cluster Proximity
Parallel Speedup?
✓
Lots of Clusters Are Fine
Decompositions
Low Rank Matrix
Or should we see it differently?
Are these scaled up versions of all the same column?
1 2 5
2 4 10
10 20 50
20 40 100
Low Rank Matrix
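Matrix multiplication is designed to make this easy
We can see weighted column patterns, or weighted row patterns
All the same mathematically
$$
\underbrace{\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix}}_{\text{column pattern (or weights)}}
\times
\underbrace{\begin{bmatrix} 1 & 2 & 5 \end{bmatrix}}_{\text{weights (or row pattern)}}
$$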
Low Rank Matrix
What about here?
This is like before, but there is one exceptional value
1 2 5
2 4 10
10 100 50
20 40 100
Low Rank Matrix
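OK … add in a simple fixer upper
$$
\begin{bmatrix} 1 \\ 2 \\ 10 \\ 20 \end{bmatrix}
\times
\begin{bmatrix} 1 & 2 & 5 \end{bmatrix}
+
\underbrace{\begin{bmatrix} 0 \\ 0 \\ 10 \\ 0 \end{bmatrix}}_{\text{which row}}
\times
\underbrace{\begin{bmatrix} 0 & 8 & 0 \end{bmatrix}}_{\text{exception pattern}}
$$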
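The arithmetic behind the fixer upper is easy to check by hand or in code. The sketch below, with hypothetical names `LowRankSketch`, `outer` and `plus`, builds the rank-one matrix from its column and row patterns, adds the rank-one exception term, and compares the sum with the matrix that has the exceptional value 100.

```java
import java.util.Arrays;

// Plain-Java check of the decomposition; names are illustrative only.
class LowRankSketch {
    // Outer product: result[i][j] = column[i] * row[j].
    static double[][] outer(double[] column, double[] row) {
        double[][] m = new double[column.length][row.length];
        for (int i = 0; i < column.length; i++) {
            for (int j = 0; j < row.length; j++) {
                m[i][j] = column[i] * row[j];
            }
        }
        return m;
    }

    // Element-wise sum of two matrices with the same shape.
    static double[][] plus(double[][] a, double[][] b) {
        double[][] m = new double[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                m[i][j] = a[i][j] + b[i][j];
            }
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] base = outer(new double[] {1, 2, 10, 20}, new double[] {1, 2, 5});
        double[][] fix = outer(new double[] {0, 0, 10, 0}, new double[] {0, 8, 0});
        double[][] withException = {{1, 2, 5}, {2, 4, 10}, {10, 100, 50}, {20, 40, 100}};
        System.out.println(Arrays.deepEquals(plus(base, fix), withException));
    }
}
```

The exception term contributes 10 x 8 = 80 to a single cell, turning the 20 in the third row into 100 and leaving every other entry untouched.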
Random Projection
SVD Projection
Classifiers
Mahout Classifiers
Naïve Bayes
– high quality implementation
– uses idiosyncratic input format
– … but it is naïve
SGD
– sequential, not parallel
– auto-tuning has foibles
– learning rate annealing has issues
– definitely not state of the art compared to Vowpal Wabbit
Random forest
– scaling limits due to decomposition strategy
– yet another input format
– no deployment strategy
The stuff that isn’t there
What Mahout Isn’t
Mahout isn’t R, isn’t SAS
It doesn’t aim to do everything
It aims to scale some few problems of practical interest
The stuff that isn’t there is a feature, not a defect
Contact:
– [email protected]
– @ted_dunning
– @apachemahout
– @[email protected]
Slides and such: http://www.slideshare.net/tdunning
Hash tags: #mapr #apachemahout