Introducing Apache Mahout

Introducing Apache Mahout Scalable Machine Learning for All! Grant Ingersoll Lucid Imagination



Page 1: Introducing Apache Mahout

Introducing Apache Mahout

Scalable Machine Learning for All!

Grant Ingersoll

Lucid Imagination

Page 2: Introducing Apache Mahout

Overview

• What is Machine Learning?

• Mahout

Page 3: Introducing Apache Mahout

Definition

• “Machine Learning is programming computers to optimize a performance criterion using example data or past experience”

– Intro. To Machine Learning by E. Alpaydin

• Subset of Artificial Intelligence

– Many other fields: comp sci., biology, math, psychology, etc.

Page 4: Introducing Apache Mahout

Types

• Supervised

– Using labeled training data, create function that predicts output of unseen inputs

• Unsupervised

– Using unlabeled data, create function that predicts output

• Semi-Supervised

– Uses labeled and unlabeled data

Page 5: Introducing Apache Mahout

Characterizations

• Lots of Data

• Identifiable Features in that Data

• Too big/costly for people to handle

– People still can help

Page 6: Introducing Apache Mahout

Clustering

• Unsupervised

• Find Natural Groupings

– Documents

– Search Results

– People

– Genetic traits in groups

– Many, many more uses
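To make "find natural groupings" concrete, here is a minimal sketch of k-means, one of the clustering algorithms the Current Status slide lists. It is plain Java with no Mahout or Hadoop dependency. The class name `KMeansSketch`, the method names, and the choice to seed the centroids with the first k points are all assumptions made for illustration, not how Mahout implements it.

```java
import java.util.Arrays;

// Hypothetical in-memory k-means sketch; not Mahout's implementation.
final class KMeansSketch {

    /** Returns, for each point, the index of the cluster it ends up in. */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        int dim = points[0].length;

        // Assumption: seed the centroids with the first k points (deterministic, but naive).
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: attach every point to its nearest centroid.
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched cluster
            }

            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[i]][d] += points[i][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Calling `KMeansSketch.cluster(points, 2, 10)` on two well-separated groups of 2-D points should put each group in its own cluster. A distributed version would split the assignment step across mappers and the centroid update across reducers, which is the part Hadoop contributes.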

Page 7: Introducing Apache Mahout

Example: Clustering

Google News

Page 8: Introducing Apache Mahout

Collaborative Filtering

• Unsupervised

• Recommend people and products

– User-User

• User likes X, you might too

– Item-Item

• People who bought X also bought Y
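The item-item idea ("people who bought X also bought Y") can be sketched as simple co-occurrence counting over purchase baskets. This is an illustrative plain-Java sketch, not the Taste API; the class `ItemCooccurrenceSketch`, its methods, and the basket representation are hypothetical names chosen for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical item-item co-occurrence sketch; not the Taste API.
final class ItemCooccurrenceSketch {

    /** Convenience helper to build one user's basket of purchased items. */
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    /** Counts how often every other item appears in a basket that contains `item`. */
    static Map<String, Integer> alsoBought(List<Set<String>> baskets, String item) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Set<String> b : baskets) {
            if (!b.contains(item)) {
                continue; // this user never bought the item
            }
            for (String other : b) {
                if (!other.equals(item)) {
                    counts.merge(other, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Returns the item that co-occurs most often with `item`, if any. */
    static Optional<String> topRecommendation(List<Set<String>> baskets, String item) {
        return alsoBought(baskets, item).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Raw counts favor popular items, so a real recommender would normally replace them with a similarity measure; the sketch only shows the shape of the computation.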

Page 9: Introducing Apache Mahout

Example: Collab Filtering

Amazon.com

Page 10: Introducing Apache Mahout

Classification/Categorization

• Many, many types

• Spam Filtering

• Named Entity Recognition

• Phrase Identification

• Sentiment Analysis

• Classification into a Taxonomy
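As one concrete instance of the list above, spam filtering is often approached with a Naïve Bayes classifier, which the Current Status slide names as one of Mahout's classifiers. The following is a minimal plain-Java sketch of a multinomial Naïve Bayes with add-one smoothing. The class `NaiveBayesSketch`, its method names, and the simple tokenizer are assumptions for illustration and do not reflect Mahout's classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical multinomial Naive Bayes sketch; not Mahout's classifier.
final class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Adds one labeled training document. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, l -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the most probable label, or null if nothing has been trained. */
    String classify(String text) {
        List<String> words = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log P(label) + sum of log P(word | label), with add-one smoothing.
            // Assumes at least one word was seen in training, so the denominator is > 0.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = wordsPerLabel.getOrDefault(label, 0);
            for (String word : words) {
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    /** Lowercases and splits on non-word characters, dropping empty tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps a single unseen word from zeroing out a label.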

Page 11: Introducing Apache Mahout

Example: NER

NER?

Excerpt from Yahoo News

Page 12: Introducing Apache Mahout

Example: Categorization

Page 13: Introducing Apache Mahout

Info. Retrieval

• Learning Ranking Functions

• Learning Spelling Corrections

• User Click Analysis and Tracking

Page 14: Introducing Apache Mahout

Other

• Image Analysis

• Robotics

• Games

• Higher level natural language processing

• Many, many others

Page 15: Introducing Apache Mahout

What is Apache Mahout?

• A Mahout is an elephant trainer/driver/keeper, hence…

[Image: Hadoop logo] (and other distributed techniques) + Machine Learning = [Image: Mahout logo]

Page 16: Introducing Apache Mahout

What?

• Hadoop brings:

– Map/Reduce API

– HDFS

– In other words, scalability and fault-tolerance

• Mahout brings:

– Library of machine learning algorithms

– Examples
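For readers new to the Map/Reduce model mentioned above, the sketch below imitates its three phases (map, shuffle, reduce) in memory with a word count. It deliberately does not use the Hadoop API: the class `MapReduceSketch` and its methods are hypothetical, and a real job would run the map and reduce functions on many machines over data stored in HDFS.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the Map/Reduce model; not the Hadoop API.
final class MapReduceSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: collapse the values of one key into a single result. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs the three phases in sequence over the given lines. */
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

Because each `map` call touches only one line and each `reduce` call touches only one key, both phases can run in parallel, which is where the scalability comes from.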

Page 17: Introducing Apache Mahout

Why Mahout?

• Many Open Source ML libraries either:

– Lack Community

– Lack Documentation and Examples

– Lack Scalability

– Lack the Apache License ;-)

– Or are research-oriented

Page 18: Introducing Apache Mahout

Why Mahout?

• Intelligent Apps are the Present and Future

• Thus, Mahout’s Goal is:

– Scalable Machine Learning with Apache License

Page 19: Introducing Apache Mahout

Current Status

• What’s in it:

– Simple Matrix/Vector library

– Taste Collaborative Filtering

– Clustering

• Canopy/K-Means/Fuzzy K-Means/Mean-shift/Dirichlet

– Classifiers

• Naïve Bayes

• Complementary NB

– Evolutionary

• Integration with Watchmaker for fitness function

Page 20: Introducing Apache Mahout

How?

• Examples

– Taste

– Clustering

– Classification

– Evolutionary

Page 21: Introducing Apache Mahout

Taste: Movie Recommendations

• Given ratings by users of movies, recommend other movies

• http://lucene.apache.org/mahout/taste.html#demo

Page 22: Introducing Apache Mahout

Taste Demo

• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=12&debug=true

• http://localhost:8080/mahout-taste-webapp/RecommenderServlet?userID=43&debug=true

Page 23: Introducing Apache Mahout

Clustering: Synthetic Control Data

• http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series

• Each clustering impl. has an example Job for running in <MAHOUT_HOME>/examples

– o.a.mahout.clustering.syntheticcontrol.*

• Outputs clusters…

Page 24: Introducing Apache Mahout

Classification: NB and CNB Examples

• 20 Newsgroups

– http://cwiki.apache.org/confluence/display/MAHOUT/TwentyNewsgroups

• Wikipedia

– http://cwiki.apache.org/confluence/display/MAHOUT/WikipediaBayesExample

Page 25: Introducing Apache Mahout

Evolutionary

• Traveling Salesman

– http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman

• Class Discovery

– http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery

Page 26: Introducing Apache Mahout

What’s Next?

• More Examples

• Winnow/Perceptron (MAHOUT-85)

• Text Clustering

• Association Rules (MAHOUT-108)

• Logistic Regression

• Solr Integration (SOLR-769)

• GSOC

Page 27: Introducing Apache Mahout

When, Who

• When? Now!

– Mahout is growing

• Who? You!

– We want programmers who:

• Are comfortable with math

• Like to work on hard problems

– We want others to:

• Kick the tires

Page 28: Introducing Apache Mahout

Where?

• http://lucene.apache.org/mahout

– Hadoop - http://hadoop.apache.org

• http://cwiki.apache.org/MAHOUT

• mahout-{user|dev}@lucene.apache.org

– http://www.lucidimagination.com/search/p:mahout

Page 29: Introducing Apache Mahout

Resources

• “Programming Collective Intelligence” by Segaran

• “Data Mining - Practical Machine Learning Tools and Techniques” by Witten and Frank

• “Taming Text” by Ingersoll and Morton