Mahout Introduction BarCampDC
Uploaded by drew-farris · Category: Technology
![Page 1: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/1.jpg)
Learning with Mahout
![Page 2: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/2.jpg)
About me
Drew Farris
▪ Committer to Apache Mahout since 2/2010 (not as active in the past year)
▪ Author: Taming Text
▪ My Company: (and BarCamp DC Sponsor)
![Page 3: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/3.jpg)
What is Mahout?
Mahout (as in hoot) or Mahout (as in trout)?
A scalable machine learning library
▪ For ‘large’ data sets
▪ Often on Hadoop, but sometimes not
▪ Recommendation Mining
▪ Clustering
▪ Classification
▪ Association Mining
A reasonable linear algebra library
A reasonable library of collections
Other Stuff
![Page 11: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/11.jpg)
Mahout
Getting Started
Check out & build the code
▪ git clone git://git.apache.org/mahout.git
▪ mvn install -DskipTests=true
▪ The tests take a looong time to run and are not needed for the initial build
Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)
Examples in examples/bin
Wiki (http://mahout.apache.org/)
Articles & Presentations
▪ Grant’s IBM developerWorks article: http://ibm.co/LUbptg (Nov 2011)
▪ Others @ http://bit.ly/IZ6PqE (wiki)
Mailing Lists
▪ [email protected] (http://bit.ly/L1GSHB)
▪ [email protected] (http://bit.ly/JPeNoE)
Books!
▪ Mahout in Action: http://bit.ly/IWMvaz
▪ Taming Text: http://bit.ly/KkODZV
![Page 17: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/17.jpg)
Mahout Examples
Kicking the Tires in examples/bin
▪ classify-20newsgroups.sh
▪ cluster-reuters.sh
▪ cluster-syntheticcontrol.sh
▪ asf-email-examples.sh
![Page 18: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/18.jpg)
Mahout Examples
Kicking the Tires in examples/bin
classify-20newsgroups.sh
▪ Premise: Classify News Stories
▪ Algorithm: sgd
▪ Data: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
![Page 19: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/19.jpg)
Mahout Examples
Kicking the Tires in examples/bin
cluster-reuters.sh
▪ Premise: Group Related News Stories
▪ Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
![Page 20: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/20.jpg)
Mahout Examples
Kicking the Tires in examples/bin
cluster-syntheticcontrol.sh
▪ Premise: Cluster time series data (normal, cyclic, increasing, decreasing, upward shift, downward shift)
▪ Algorithms: canopy, kmeans, fuzzykmeans, dirichlet, meanshift
▪ See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html
▪ Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
![Page 21: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/21.jpg)
Mahout Examples
Kicking the Tires in examples/bin
asf-email-examples.sh
▪ Recommendation (user based)
▪ Clustering (kmeans, dirichlet, minhash)
▪ Classification (naïve bayes, sgd)
![Page 22: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/22.jpg)
Learning Outline
General Outline:
Data Transformation
▪ From native format to…
▪ ..Sequence Files; typed key/value pairs
▪ ..Labeled Vectors
Model Training
Model Evaluation
Lather, Rinse, Repeat
Production
Lather, Rinse, Repeat
![Page 28: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/28.jpg)
Text to Sparse Vectors
mahout seq2sparse
▪ Tokenize Documents
▪ Count Words
▪ Make Partial/Merge Vectors
▪ TFIDF
▪ Make Partial/Merge TFIDF Vectors
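The seq2sparse stages (tokenize, count words, weight by TF-IDF) can be illustrated on a single machine. The sketch below is plain Python, not Mahout code: it assumes whitespace tokenization and the textbook weight tf × ln(N/df), which may differ from the weighting Mahout actually applies. The function names `tokenize` and `tfidf_vectors` and the three sample documents are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split; a real analyzer does much more.
    return text.lower().split()

def tfidf_vectors(docs):
    """Return (dictionary, vectors); each vector maps term index -> weight."""
    token_lists = [tokenize(d) for d in docs]
    # Dictionary step: give every distinct term an integer index.
    terms = sorted({t for tokens in token_lists for t in tokens})
    dictionary = {term: i for i, term in enumerate(terms)}
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for tokens in token_lists for t in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)  # word counts for this document
        # Sparse vector: only terms present in the document get an entry.
        vectors.append({dictionary[t]: count * math.log(n / df[t])
                        for t, count in tf.items()})
    return dictionary, vectors

# Hypothetical sample documents.
docs = ["mahout runs on hadoop", "mahout clusters text", "hadoop stores text"]
dictionary, vectors = tfidf_vectors(docs)
```

Each vector stores only the terms that occur in its document, which is what makes the representation sparse; terms that appear in fewer documents receive larger weights.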
![Page 29: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/29.jpg)
Tips
View Sequence Files with:
▪ mahout seqdumper -i /path/to/sequence/file
Check out shortcuts in:
▪ src/conf/driver.classes.props
Run classes with:
▪ mahout org.apache.mahout.SomeCoolNewFeature …
Standalone vs. Distributed
▪ Standalone mode is default
▪ Set HADOOP_CONF_DIR to use Hadoop
▪ MAHOUT_LOCAL will force standalone
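The three standalone-versus-distributed rules can be read as a small decision function. The sketch below only restates those rules in Python; it is not the logic of the real mahout launcher script, and `choose_run_mode` and the example path are hypothetical.

```python
def choose_run_mode(env):
    """Return 'standalone' or 'distributed' for a dict of environment variables."""
    # MAHOUT_LOCAL forces standalone, even when Hadoop is configured.
    if env.get("MAHOUT_LOCAL"):
        return "standalone"
    # HADOOP_CONF_DIR opts in to running on Hadoop.
    if env.get("HADOOP_CONF_DIR"):
        return "distributed"
    # Standalone is the default.
    return "standalone"

# Hypothetical configuration directory, for illustration only.
mode = choose_run_mode({"HADOOP_CONF_DIR": "/etc/hadoop/conf"})
```

In a shell session the same function could be fed `os.environ` to check which mode a given environment would select under these rules.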
![Page 30: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/30.jpg)
Example: Recommendation
asf-email-examples.sh (recommendation)
▪ Premise: Recommend Interesting Threads
▪ User based recommendation
▪ Boolean preferences based on thread contribution
▪ Implies boolean similarity measure: tanimoto, log-likelihood
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
![Page 31: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/31.jpg)
Recommendation Process
Recommendation Steps
▪ Convert Mail to Sequence Files
▪ Convert Sequence Files to Preferences
▪ Prepare Preference Matrix
▪ Row Similarity Job
▪ Recommender Job
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
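The idea behind boolean preferences and Tanimoto similarity can be shown in a few lines. This is a single-machine Python sketch, not the Mahout jobs listed in the steps: it assumes a user "prefers" a thread if they posted to it at least once, scores unseen threads by summing the Tanimoto similarity of the users who posted to them, and uses invented user and thread names.

```python
from collections import defaultdict

def build_preferences(messages):
    """Turn (user, thread) pairs into boolean preferences: {user: set of threads}."""
    prefs = defaultdict(set)
    for user, thread in messages:
        prefs[user].add(thread)  # repeated posts collapse into one preference
    return dict(prefs)

def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def recommend(prefs, user, top_n=3):
    """Rank threads the user has not posted to by the similarity of users who did."""
    seen = prefs[user]
    scores = defaultdict(float)
    for other, threads in prefs.items():
        if other == user:
            continue
        sim = tanimoto(seen, threads)
        if sim == 0.0:
            continue
        for thread in threads - seen:
            scores[thread] += sim
    # Highest score first; ties broken by thread name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [thread for thread, _ in ranked[:top_n]]

# Hypothetical mail data: who posted to which thread.
messages = [
    ("alice", "t1"), ("alice", "t2"), ("alice", "t1"),
    ("bob", "t1"), ("bob", "t2"), ("bob", "t3"),
    ("carol", "t2"), ("carol", "t4"),
]
prefs = build_preferences(messages)
print(recommend(prefs, "alice"))  # prints ['t3', 't4']
```

Here bob shares two of three threads with alice (similarity 2/3) and carol shares one of three (similarity 1/3), so bob's extra thread ranks above carol's.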
![Page 32: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/32.jpg)
Example: Classification
asf-email-examples.sh (classification)
▪ Premise: Predict project mailing lists for incoming messages
▪ Data labeled based on the mailing list it arrived on
▪ Hold back a random 20% of data for testing, the rest for training
▪ Algorithms: Naïve Bayes (Standard, Complementary), SGD
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
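The labeling and hold-out step is simple to sketch. The Python below assumes each message is already paired with the mailing list it arrived on and holds back a random 20% for testing; `label_and_split`, the fixed seed, and the sample data are illustrative choices, not part of the example script.

```python
import random

def label_and_split(messages, test_fraction=0.2, seed=42):
    """messages: list of (mailing_list, text) pairs; the list name is the label.

    Returns (training, held_out) after a seeded random shuffle.
    """
    labeled = list(messages)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(labeled)
    n_test = int(round(len(labeled) * test_fraction))
    return labeled[n_test:], labeled[:n_test]

# Hypothetical messages from three made-up mailing lists.
messages = [("list-%d" % (i % 3), "message %d" % i) for i in range(10)]
training, held_out = label_and_split(messages)
```

Shuffling before splitting keeps the held-out set from being dominated by whichever list happens to come last in the input.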
![Page 33: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/33.jpg)
Classification Process
Classification Steps
▪ Convert Mail to Sequence Files
▪ Sequence Files to Sparse Vectors
▪ Modify Sequence File Labels
▪ Split into Training and Test Sets
▪ Train the Model
▪ Test the Model
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
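The train and test steps can be made concrete with a tiny classifier. The sketch below is a textbook multinomial Naïve Bayes with add-one smoothing written in plain Python; it is not Mahout's trainer and does not implement the complementary variant. The labels and messages are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (label, text). Returns per-label counts and the vocabulary."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"labels": label_counts, "words": word_counts, "vocab": vocab}

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(doc_count / total_docs)  # log prior
        for token in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def accuracy(model, examples):
    """Fraction of held-out examples whose predicted label matches the true one."""
    correct = sum(1 for label, text in examples if classify(model, text) == label)
    return correct / len(examples)

# Hypothetical training and held-out messages labeled by mailing list.
training = [
    ("user", "how do I run the kmeans example"),
    ("user", "question about running seq2sparse"),
    ("dev", "patch attached for the failing unit test"),
    ("dev", "please review this patch before the release"),
]
held_out = [("user", "how do I run seq2sparse"), ("dev", "review the unit test patch")]
model = train_naive_bayes(training)
score = accuracy(model, held_out)
```

Evaluating only on messages the model never saw during training is what makes the accuracy figure meaningful.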
![Page 34: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/34.jpg)
Example: Clustering
asf-email-examples.sh (clustering)
▪ Premise: Grouping Messages by Subject
▪ Same Prep as Classification
▪ Different Algorithms: (kmeans, dirichlet, minhash)
12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398 ms (Minutes: 342.95663333333334)
See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
![Page 35: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/35.jpg)
Clustering Process
Clustering Steps
▪ Convert Mail to Sequence Files
▪ Sequence Files to Sparse Vectors
▪ Run Clustering (iterate)
▪ Dump Results
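The "iterate" in the clustering step is easiest to see in k-means. The sketch below is a minimal in-memory k-means in Python (3.8+ for `math.dist`), not Mahout's distributed implementation: it assumes dense numeric points, Euclidean distance, and caller-supplied starting centroids, and the sample points are invented.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and centroid update until the centroids stop moving."""
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            new_centroids.append(mean(members) if members else old)
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assignments

# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, [(0, 0), (10, 10)])
print(assignments)  # prints [0, 0, 0, 1, 1, 1]
```

On real mail data the points would be the sparse TF-IDF vectors, and each pass over the data is one iteration of the clustering job.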
![Page 36: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/36.jpg)
Where to now?
Insert Bar Camp Style Discussion Here
![Page 37: Mahout Introduction BarCampDC](https://reader033.fdocuments.in/reader033/viewer/2022061218/54b777d64a7959e6038b45cb/html5/thumbnails/37.jpg)
Resources
▪ Mahout in Action: Owen, Anil, Dunning and Friedman (http://bit.ly/IWMvaz)
▪ Taming Text: Ingersoll, Morton and Farris (http://bit.ly/KkODZV)