Intro to Mahout

Ofer Vugman May 2012


A short introduction to Mahout during SCISR meetup http://bit.ly/scisr

Transcript of Intro to Mahout

Page 1: Intro to Mahout

Ofer Vugman, May 2012

Page 2: Intro to Mahout

Agenda and such…

What is ML (Machine Learning)
ML Common Use Cases
Mahout Overview
Algorithms in Mahout
Mahout Commercial Use
Mahout Summary

Page 3: Intro to Mahout

What is ML

“Machine Learning is programming computers to optimize a performance criterion using example data or past experience”

Introduction to Machine Learning by E. Alpaydin

Page 4: Intro to Mahout

ML Common Use Cases

Recommendation

Page 5: Intro to Mahout

ML Common Use Cases

Classification

Page 6: Intro to Mahout

ML Common Use Cases

Clustering

Page 7: Intro to Mahout

ML Common Libraries

Page 8: Intro to Mahout

Mahout Overview – What ?

A mahout is a person who keeps and drives an elephant

Page 9: Intro to Mahout

Mahout Overview – What ?

A scalable machine learning library

Page 10: Intro to Mahout

Mahout Overview – What ?

Began life in 2008 as a subproject of Apache’s Lucene project

In 2010, Mahout became a top-level Apache project in its own right

Implemented in Java

Built upon Apache’s Hadoop (Look! An elephant!)

Page 11: Intro to Mahout

Mahout Overview – Why ?

Many open source ML libraries either:

Lack community
Lack documentation and examples
Lack scalability
Lack the Apache license
Are research oriented
Are not well tested
Are not built over existing production-quality libraries

Page 12: Intro to Mahout

Mahout Overview – Why ?

Scalability

Scalable to reasonably large datasets (core algorithms implemented in Map/Reduce, runnable on Hadoop)

Scalable to support your business case (Apache License)

Scalable community

Page 13: Intro to Mahout

Mahout Overview – Why ?

Built over existing production quality libraries

Page 14: Intro to Mahout

Mahout Overview – Use Cases

Mahout currently supports four main use cases:

1. Recommendation
2. Clustering
3. Classification
4. Frequent Itemset Mining

Page 15: Intro to Mahout

Mahout Overview - Technical

System requirements:

Linux (or Cygwin on Windows)
Java 1.6.x or greater
Maven 2.0.11 or greater to build the source code
Hadoop 0.20 or greater*

* Not all algorithms are implemented to work on Hadoop clusters

Page 16: Intro to Mahout

Algorithms in Mahout

We’ll focus on one example: Collaborative Filtering (Recommenders)

There are many (many!) more; you can find them all at https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms

Page 17: Intro to Mahout

Algorithms Examples – Recommendation

Help users find items they might like based on historical preferences

Based on example by Sebastian Schelter in “Distributed Itembased Collaborative Filtering with Apache Mahout”

Page 18: Intro to Mahout

Algorithms Examples – Recommendation

        Matrix  Alien  Inception
Alice   5       1      4
Bob     ?       2      5
Peter   4       3      2

Page 19: Intro to Mahout

Algorithms Examples – Recommendation

Algorithm: neighborhood-based approach

Works by finding similarly rated items in the user-item matrix (e.g. cosine, Pearson correlation, Tanimoto coefficient)

Estimates a user's preference towards an item by looking at his/her preferences towards similar items
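To make the three similarity measures concrete, here is a minimal Python sketch applied to the deck's own ratings (Alice, Bob, Peter on Matrix, Alien, Inception). The function names and the choice to compute cosine and Pearson over co-rated pairs only are assumptions for illustration. The slides do not say which measure or normalization produced their similarity values, so the numbers from this sketch need not match them.

```python
from math import sqrt

# Ratings from the example deck: user -> {item: rating}.
RATINGS = {
    "Alice": {"Matrix": 5, "Alien": 1, "Inception": 4},
    "Bob": {"Alien": 2, "Inception": 5},
    "Peter": {"Matrix": 4, "Alien": 3, "Inception": 2},
}


def co_ratings(item_a, item_b):
    """Rating pairs from users who rated both items."""
    return [
        (prefs[item_a], prefs[item_b])
        for prefs in RATINGS.values()
        if item_a in prefs and item_b in prefs
    ]


def cosine(pairs):
    """Cosine of the angle between the two rating vectors."""
    dot = sum(a * b for a, b in pairs)
    norm_a = sqrt(sum(a * a for a, _ in pairs))
    norm_b = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b)


def pearson(pairs):
    """Pearson correlation: cosine of the mean-centered rating vectors."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    return cosine([(a - mean_a, b - mean_b) for a, b in pairs])


def tanimoto(item_a, item_b):
    """Tanimoto coefficient: overlap of the sets of users who rated each item."""
    users_a = {u for u, prefs in RATINGS.items() if item_a in prefs}
    users_b = {u for u, prefs in RATINGS.items() if item_b in prefs}
    return len(users_a & users_b) / len(users_a | users_b)


if __name__ == "__main__":
    pairs = co_ratings("Alien", "Inception")
    print(cosine(pairs), pearson(pairs), tanimoto("Matrix", "Alien"))
```

Cosine and Pearson use the rating values, while Tanimoto only looks at who rated what, which makes it a natural fit for data without explicit ratings.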

Page 20: Intro to Mahout

Algorithms Examples – Recommendation

Prediction: estimate Bob's preference towards “The Matrix”

1. Look at all items that
   a) are similar to “The Matrix”
   b) have been rated by Bob
   => “Alien”, “Inception”
2. Estimate the unknown preference with a weighted sum
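One common way to write the weighted sum from step 2 is shown below. The notation is an assumption, not taken from the slides: $\hat{r}_{u,i}$ is the predicted rating of user $u$ for item $i$, $S(i,u)$ is the set of items similar to $i$ that $u$ has rated, and $\mathrm{sim}(i,j)$ is the item similarity. Normalizing by the sum of absolute similarities is also an assumption, though it is consistent with the value 1.5 that the deck later fills in for Bob.

```latex
\hat{r}_{u,i} \;=\; \frac{\sum_{j \in S(i,u)} \mathrm{sim}(i,j)\, r_{u,j}}
                         {\sum_{j \in S(i,u)} \left| \mathrm{sim}(i,j) \right|}
```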

Page 21: Intro to Mahout

Algorithms Examples – Recommendation

MapReduce phase 1, Map: make the user the key

Input:

(Alice, Matrix, 5)
(Alice, Alien, 1)
(Alice, Inception, 4)
(Bob, Alien, 2)
(Bob, Inception, 5)
(Peter, Matrix, 4)
(Peter, Alien, 3)
(Peter, Inception, 2)

Output:

Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)
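The map step of phase 1 can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Mahout's actual mapper; the name `map_phase1` is hypothetical.

```python
# Input triples from the slide: (user, item, rating).
TRIPLES = [
    ("Alice", "Matrix", 5),
    ("Alice", "Alien", 1),
    ("Alice", "Inception", 4),
    ("Bob", "Alien", 2),
    ("Bob", "Inception", 5),
    ("Peter", "Matrix", 4),
    ("Peter", "Alien", 3),
    ("Peter", "Inception", 2),
]


def map_phase1(record):
    """Re-key one (user, item, rating) triple as user -> (item, rating)."""
    user, item, rating = record
    return user, (item, rating)


# Apply the mapper to every record, as the framework would do in parallel.
mapped = [map_phase1(record) for record in TRIPLES]

if __name__ == "__main__":
    for key, value in mapped:
        print(key, value)
```

Keying by user matters because the framework then routes all of a user's ratings to the same reducer.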

Page 22: Intro to Mahout

Algorithms Examples – Recommendation

MapReduce phase 1, Reduce: create inverted index

Input:

Alice (Matrix, 5)
Alice (Alien, 1)
Alice (Inception, 4)
Bob (Alien, 2)
Bob (Inception, 5)
Peter (Matrix, 4)
Peter (Alien, 3)
Peter (Inception, 2)

Output:

Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)
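The reduce step of phase 1 amounts to grouping values by key. A minimal Python sketch follows; the name `reduce_phase1` is hypothetical, and the in-memory grouping stands in for the shuffle that Hadoop performs between map and reduce.

```python
from collections import defaultdict

# Output of the phase 1 map step: user -> (item, rating).
MAPPED = [
    ("Alice", ("Matrix", 5)),
    ("Alice", ("Alien", 1)),
    ("Alice", ("Inception", 4)),
    ("Bob", ("Alien", 2)),
    ("Bob", ("Inception", 5)),
    ("Peter", ("Matrix", 4)),
    ("Peter", ("Alien", 3)),
    ("Peter", ("Inception", 2)),
]


def reduce_phase1(pairs):
    """Collect every (item, rating) pair under the user who produced it."""
    by_user = defaultdict(list)
    for user, item_rating in pairs:
        by_user[user].append(item_rating)
    return dict(by_user)


user_vectors = reduce_phase1(MAPPED)

if __name__ == "__main__":
    for user, item_ratings in user_vectors.items():
        print(user, item_ratings)
```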

Page 23: Intro to Mahout

Algorithms Examples – Recommendation

MapReduce phase 2, Map: isolate all co-occurred ratings (all cases where a user rated both items)

Input:

Alice (Matrix, 5) (Alien, 1) (Inception, 4)
Bob (Alien, 2) (Inception, 5)
Peter (Matrix, 4) (Alien, 3) (Inception, 2)

Output:

Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)
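Isolating co-occurred ratings means emitting every pair of items that a single user rated. A minimal Python sketch, assuming each user's list keeps items in one consistent order so that an item pair always yields the same key; `map_phase2` is a hypothetical name, not Mahout's mapper.

```python
from itertools import combinations

# Output of phase 1: each user's list of (item, rating) pairs.
USER_VECTORS = {
    "Alice": [("Matrix", 5), ("Alien", 1), ("Inception", 4)],
    "Bob": [("Alien", 2), ("Inception", 5)],
    "Peter": [("Matrix", 4), ("Alien", 3), ("Inception", 2)],
}


def map_phase2(item_ratings):
    """Emit ((item_a, item_b), (rating_a, rating_b)) for each pair of items
    rated by the same user."""
    for (item_a, r_a), (item_b, r_b) in combinations(item_ratings, 2):
        yield (item_a, item_b), (r_a, r_b)


# Run the mapper over every user's vector and collect all emitted pairs.
co_ratings = [
    pair for ratings in USER_VECTORS.values() for pair in map_phase2(ratings)
]

if __name__ == "__main__":
    for item_pair, rating_pair in co_ratings:
        print(item_pair, rating_pair)
```

Note that the number of emitted pairs grows quadratically with the number of items a user has rated, so very active users dominate the cost of this step.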

Page 24: Intro to Mahout

Algorithms Examples – Recommendation

MapReduce phase 2, Reduce: compute similarities

Input:

Matrix, Alien (5,1)
Matrix, Alien (4,3)
Alien, Inception (1,4)
Alien, Inception (2,5)
Alien, Inception (3,2)
Matrix, Inception (4,2)
Matrix, Inception (5,4)

Output:

Matrix, Alien (-0.47)
Matrix, Inception (0.47)
Alien, Inception (-0.63)
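The shape of this reduce step can be sketched as "group the co-ratings by item pair, then apply a similarity function to each group". The sketch below plugs in cosine similarity over the co-rated pairs purely as a stand-in. The slides do not state which measure or normalization produced -0.47, 0.47 and -0.63, so this code is not expected to reproduce those values; `reduce_phase2` is a hypothetical name.

```python
from collections import defaultdict
from math import sqrt

# Output of the phase 2 map step: item pair -> one user's pair of ratings.
CO_RATINGS = [
    (("Matrix", "Alien"), (5, 1)),
    (("Matrix", "Alien"), (4, 3)),
    (("Alien", "Inception"), (1, 4)),
    (("Alien", "Inception"), (2, 5)),
    (("Alien", "Inception"), (3, 2)),
    (("Matrix", "Inception"), (4, 2)),
    (("Matrix", "Inception"), (5, 4)),
]


def cosine(pairs):
    """Stand-in similarity: cosine over the co-rated pairs (an assumption)."""
    dot = sum(a * b for a, b in pairs)
    norm_a = sqrt(sum(a * a for a, _ in pairs))
    norm_b = sqrt(sum(b * b for _, b in pairs))
    return dot / (norm_a * norm_b)


def reduce_phase2(co_ratings, similarity=cosine):
    """Group rating pairs by item pair, then score each group."""
    grouped = defaultdict(list)
    for item_pair, rating_pair in co_ratings:
        grouped[item_pair].append(rating_pair)
    return {pair: similarity(ratings) for pair, ratings in grouped.items()}


similarities = reduce_phase2(CO_RATINGS)

if __name__ == "__main__":
    for item_pair, score in similarities.items():
        print(item_pair, round(score, 2))
```

Passing the similarity as a parameter mirrors the fact that the measure is a choice (cosine, Pearson correlation, Tanimoto coefficient), while the grouping logic stays the same.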

Page 25: Intro to Mahout

Algorithms Examples – Recommendation

        Matrix  Alien  Inception
Alice   5       1      4
Bob     1.5     2      5
Peter   4       3      2
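The 1.5 for Bob can be reproduced from the phase 2 similarities and Bob's two known ratings. The sketch below assumes the weighted sum is normalized by the sum of absolute similarities; the slides do not spell out the formula, but this form yields the value shown. The function name `predict` is hypothetical.

```python
# Similarities between "Matrix" and the items Bob has rated (from phase 2).
SIMILARITY_TO_MATRIX = {"Alien": -0.47, "Inception": 0.47}

# Bob's known ratings.
BOB_RATINGS = {"Alien": 2, "Inception": 5}


def predict(ratings, similarities):
    """Similarity-weighted average of the user's ratings.

    Assumption: the weights are normalized by the sum of their absolute values.
    """
    numerator = sum(similarities[item] * rating for item, rating in ratings.items())
    denominator = sum(abs(similarities[item]) for item in ratings)
    return numerator / denominator


prediction = predict(BOB_RATINGS, SIMILARITY_TO_MATRIX)

if __name__ == "__main__":
    # (2 * -0.47 + 5 * 0.47) / (0.47 + 0.47)
    print(round(prediction, 2))
```

Bob liked Inception, which is positively similar to Matrix, but the negative similarity to Alien pulls the estimate down, so the prediction lands at the low end of the scale.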

Page 26: Intro to Mahout

Mahout Commercial Use

Commercial use

Page 27: Intro to Mahout

Mahout Resources

Mahout website - http://mahout.apache.org/

Introducing Apache Mahout – http://www.ibm.com/developerworks/java/library/j-mahout/

“Mahout In Action” by Sean Owen and Robin Anil

Page 28: Intro to Mahout

Mahout Summary

ML is all over the web today
Mahout is about scalable machine learning
Mahout has functionality for many of today’s common machine learning tasks
MapReduce magic in action

Page 29: Intro to Mahout

Mahout Summary

Thank you and good night