Download Materials

Introducing Apache Mahout: Scalable Machine Learning for All! (Grant Ingersoll)

Transcript of Download Materials

  • 1. Introducing Apache Mahout: Scalable Machine Learning for All! Grant Ingersoll

  • 2. Agenda: What is Machine Learning? (Definitions, Types, Applications); Mahout (What? Why? How? Who?)
  • 3. What is Machine Learning? NOT! Or? [Images: http://en.wikipedia.org/wiki/Image:Hal-9000.jpg, http://upload.wikimedia.org/wikipedia/en/4/49/Terminator.jpg]
  • 4. How about? Google News
  • 5. Or? Amazon.com
  • 6. Definition: "Machine Learning is programming computers to optimize a performance criterion using example data or past experience" (Intro. to Machine Learning by E. Alpaydin). Subset of Artificial Intelligence. Many other fields: comp. sci., biology, math, psychology, etc.
  • 7. Characterizations: Lots of data. Identifiable features in that data. Too big/costly for people to handle. People still can help.
  • 8. Types: Supervised: using labeled training data, create a function that predicts the output of unseen inputs. Unsupervised: using unlabeled data, create a function that predicts output. Semi-supervised: uses labeled and unlabeled data.
  • 9. Classification/Categorization: Spam filtering; Named entity recognition; Phrase identification; Sentiment analysis; Classification into a taxonomy
  • 10. Clustering: Find natural groupings: Documents; Search results; People; Genetic traits in groups; Many, many more uses
  • 11. Collaborative Filtering: Recommend people and products. User-User: user likes X, you might too. Item-Item: people who bought X also bought Y.
  • 12. Info. Retrieval: Learning ranking functions; Learning spelling corrections; User click analysis and tracking
  • 13. Other: Image analysis; Robotics; Games; Higher-level natural language processing; Many, many others
  • 14. What is Apache Mahout? A mahout is an elephant trainer/driver/keeper, hence: [image] + (and other distributed techniques) Machine Learning =
  • 15. What? Hadoop brings: Map/Reduce API; HDFS. In other words, scalability and fault-tolerance. Thus, Mahout's goal is: Scalable Machine Learning with Apache License
  • 16. Why Mahout? Many open source ML libraries either: lack community; lack documentation and examples; lack scalability; lack the Apache License ;-) ; or are research-oriented. Personal: learn more ML. Intelligent apps are the present and future. See the Hadoop talks tomorrow and Friday! Goal: overcome gaps the Apache Way!
  • 17. Current Status: Close to initial release. Focused on examples, docs, bug fixes. What's in it: Simple Matrix/Vector library; Taste collaborative filtering; Clustering (Canopy, K-Means, Fuzzy K-Means, Mean-shift); Classifiers (Naive Bayes, Complementary NB); Evolutionary (integration with Watchmaker for fitness function)
  • 18. How? Examples: Taste; Clustering; Classification; Evolutionary
  • 19. Taste: Movie Recommendations. Given ratings by users of movies, recommend other movies. http://lucene.apache.org/mahout/taste.html#demo
  • 20. Clustering: Synthetic Control Data. http://archive.ics.uci.edu/ml/datasets/Synthetic+ Each clustering impl. has an example Job for running in /examples: o.a.mahout.clustering.syntheticcontrol.* Outputs clusters.
  • 21. Classification: NB and CNB Examples. 20 Newsgroups: http://cwiki.apache.org/confluence/display/MA Wikipedia: http://cwiki.apache.org/confluence/display/MA
  • 22. Evolutionary: Traveling Salesman: http://cwiki.apache.org/confluence/display/MAHOUT/Traveling+Salesman Class Discovery: http://cwiki.apache.org/confluence/display/MAHOUT/Class+Discovery
  • 23. What's Next? Release 0.1! Shared Amazon images (others?); More examples; Winnow/Perceptron (MAHOUT-85); HBase and HAMA support; Normalize I/O format for data; Solr integration (SOLR-769); Other algorithms: SVM, Linear Regression, etc.
  • 24. When, Where, Who. When? Now! Mahout is growing. Who? You! We want Java programmers who: are comfortable with math; like to work on large, hard problems. Where? http://lucene.apache.org/mahout http://cwiki.apache.org/MAHOUT mahout-{user|dev}@lucene.apache.org
  • 25. Resources: Programming Collective Intelligence by Toby Segaran; Data Mining: Practical Machine Learning Tools and Techniques by Ian H. Witten and Eibe Frank; Hadoop: http://hadoop.apache.org; http://mloss.org/software/
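
Example (not from the slides): the Collaborative Filtering slide sums up the item-item approach as "people who bought X also bought Y". The sketch below shows one simple way that idea could be realized, by counting how often items appear together in the same purchase history. It is plain Python for illustration only, not Taste's or Mahout's API. The function names (cooccurrence_counts, also_bought) and the sample baskets are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(baskets):
    """Count, for every pair of items, how many baskets contain both."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        # Each unordered pair is counted once per basket, in both directions.
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def also_bought(item, counts, top_n=3):
    """Return up to top_n items most often bought together with `item`."""
    related = counts.get(item, {})
    # Highest count first; ties broken alphabetically for a stable order.
    ranked = sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


# Hypothetical purchase histories, one set of items per user.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "W"},
    {"Y", "Z"},
]

counts = cooccurrence_counts(baskets)
print(also_bought("X", counts))
```

Raw co-occurrence counts favor popular items. A real recommender would normally use a normalized similarity measure, and at the data sizes the slides have in mind the counting step would be distributed rather than run in memory.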