Introduction to Mahout and Machine Learning


Description

This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.

Transcript of Introduction to Mahout and Machine Learning

Page 1: Introduction to Mahout and Machine Learning

{ “Mahout” : “Scalable Machine Learning Library” }

{ “Presented By” : “Varad Meru”, “Company” : “Orzota, Inc”,

“Twitter” : “@vrdmr” }


Page 2: Introduction to Mahout and Machine Learning

{ “Mahout” : “Introduction” }


Page 3: Introduction to Mahout and Machine Learning

{ “Introduction” : “History and Etymology” }

• A Scalable Machine Learning Library built on Hadoop, written in Java.

• Driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore”

• Started as a Lucene sub-project. Became Apache TLP in April 2010.

• Latest version out – 0.6 (released on 6th Feb 2012).

• Mahout – keeper/driver of elephants; a fitting name, since many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.

• Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.

• The Taste recommendation framework was added later by Sean Owen.



Figure 1.1 Apache Mahout and its related projects within the Apache Foundation.

Much of Mahout’s work has been to not only implement these algorithms conventionally, in an efficient and scalable way, but also to convert some of these algorithms to work at scale on top of Hadoop. Hadoop’s mascot is an elephant, which at last explains the project name!

Mahout incubates a number of techniques and algorithms, many still in development or in an experimental phase. At this early stage in the project's life, three core themes are evident: collaborative filtering / recommender engines, clustering, and classification. This is by no means all that exists within Mahout, but they are the most prominent and mature themes at the time of writing. These, therefore, are the scope of this book.

Chances are that if you are reading this, you are already aware of the interesting potential of these three families of techniques. But just in case, read on.

1.2 Mahout’s Machine Learning Themes

While Mahout is, in theory, a project open to implementations of all kinds of machine learning techniques, it is in practice a project that focuses on three key areas of machine learning at the moment. These are recommender engines (collaborative filtering), clustering, and classification.

1.2.1 Recommender Engines

Recommender engines are the most immediately recognizable machine learning technique in use today. You will have seen services or sites that attempt to recommend books or movies or articles based on your past actions. They try to infer tastes and preferences and identify unknown items that are of interest:

• Amazon.com is perhaps the most famous commerce site to deploy recommendations. Based on purchases and site activity, Amazon recommends books and other items likely to be of interest. See Figure 1.2.

• Netflix similarly recommends DVDs that may be of interest, and famously offered a $1,000,000 prize to researchers who could improve the quality of their recommendations.

• Dating sites like Líbímseti (discussed later) can even recommend people to people.

• Social networking sites like Facebook use variants on recommender techniques to identify people most likely to be an as-yet-unconnected friend.



Page 4: Introduction to Mahout and Machine Learning

{ “Mahout” : “Machine Learning” }


Page 5: Introduction to Mahout and Machine Learning

{ “Machine Learning” : “Introduction” }

“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”

• Branch of Artificial Intelligence

• Design and Development of Algorithms

• Computers evolve behavior based on empirical data.

• Supervised Learning

• Using labeled training data to create a classifier that can predict outputs for unseen inputs.

• Unsupervised Learning

• Using unlabeled training data to find structure in the data, such as clusters.

• Semi-Supervised Learning

• Combining a small amount of labeled data with a large amount of unlabeled data.


Page 6: Introduction to Mahout and Machine Learning

{ “Machine Learning” : “Applications” }

• Recommend Friends, Dates, Products to end-users.

• Classify content into pre-defined groups.

• Find Similar content based on Object Properties.

• Identify key topics in large Collections of Text.

• Detect Anomalies within given data.

• Ranking Search Results with User Feedback Learning.

• Classifying DNA sequences.

• Sentiment Analysis / Opinion Mining.

• Computer Vision.

• Natural Language Processing.

• Bioinformatics.

• Speech and Handwriting Recognition.

• Others...

Page 7: Introduction to Mahout and Machine Learning

{“Machine Learning”: “Challenges”}

• Big Data

• Yesterday's processing on next-generation data.

• Time for processing.

• Large and cheap storage.


Size            | Classification | Task                       | Tools
Lines           | Sample Data    | Analysis and Visualization | Whiteboard, bash, ...
KBs - low MBs   | Prototype Data | Analysis and Visualization | Matlab, Octave, R, Processing, bash, ...
MBs - low GBs   | Online Data    | Storage                    | MySQL (DBs), ...
MBs - low GBs   | Online Data    | Analysis                   | NumPy, SciPy, Weka, BLAS/LAPACK, ...
MBs - low GBs   | Online Data    | Visualization              | Flare, AmCharts, Raphael, Protovis, ...
GBs - TBs - PBs | Big Data       | Storage                    | HDFS, HBase, Cassandra, ...
GBs - TBs - PBs | Big Data       | Analysis                   | Hive, Mahout, Hama, Giraph, ...

Page 8: Introduction to Mahout and Machine Learning

{ “Machine Learning” : “Mahout for Big Data”}

• Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.

• Some Algorithms won’t scale to massive machine clusters

• Others fit logically on a MapReduce framework such as Apache Hadoop

• Most Mahout implementations are MapReduce enabled

• Focus: “Scalability with Hadoop’s MapReduce processing framework over Big Data stored in Hadoop’s HDFS”.

• The only machine learning library built on a MapReduce framework; other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or lack an ML library.

• The only scalable machine learning framework with MapReduce and Hadoop support (www.mloss.org: Machine Learning Open-Source Software).


Page 9: Introduction to Mahout and Machine Learning

{ “Mahout” : “Internals” }


Page 10: Introduction to Mahout and Machine Learning


{ “Internals” : “Architecture” }

[Architecture diagram: a layered stack]

• Applications and Examples (top layer)
• Algorithms: Recommenders, Clustering, Classification, Regression, Dimension Reduction, Freq. Pattern Mining, Evolutionary Algorithms
• Support: Math (Vectors/Matrices/SVD), Utilities (Lucene/Vectorizer), Collections (primitives)
• Apache Hadoop (bottom layer)

Page 11: Introduction to Mahout and Machine Learning

• Scalable

• Dual-Mode (Sequential and MapReduce Enabled)

• Support for easy Extension.

• Supports a large number of data sources, including the newer NoSQL variants.

• A Java library: a framework of tools intended to be used and adapted by developers.

• Advanced Implementations of Java’s Collections Framework for better Performance.


{ “Internals” : “Features” }
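To give a feel for the library side, here is a small sketch using the math module's vector types. It assumes the org.apache.mahout.math API (DenseVector, RandomAccessSparseVector) as shipped in the 0.x releases; the values are illustrative.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class VectorDemo {
      public static void main(String[] args) {
        // Dense vector backed by a double[]
        Vector dense = new DenseVector(new double[] {1.0, 0.0, 2.0});
        // Sparse vector of cardinality 3; only non-zero entries are stored
        Vector sparse = new RandomAccessSparseVector(3);
        sparse.setQuick(2, 4.0);
        // Dot product: 1*0 + 0*0 + 2*4 = 8.0
        System.out.println(dense.dot(sparse));
      }
    }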

Page 12: Introduction to Mahout and Machine Learning

{ “Mahout” : “Algorithms” }


Page 13: Introduction to Mahout and Machine Learning

• Help Users find items they might like based on historical behavior and preferences

• Top-level packages define the Mahout interfaces to these key abstractions:

• DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel

• UserSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity

• ItemSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity

• UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood.

• Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender.


{ “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
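To make these abstractions concrete, here is a minimal user-based recommender composed from them in the Taste API. This is a sketch: the file name intro.csv, the neighborhood size of 2, and the user ID are illustrative.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderDemo {
      public static void main(String[] args) throws Exception {
        // CSV of "userID,itemID,preference" lines (file name is illustrative)
        DataModel model = new FileDataModel(new File("intro.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, similarity, neighborhood);
        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
          System.out.println(item);
        }
      }
    }

Swapping a different DataModel, UserSimilarity, or UserNeighborhood implementation changes the behavior without touching the rest of the pipeline; that is the point of the interfaces listed above.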

Page 14: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Recommender Systems”, “id” : “Example”}

Binary Values Recommendation. Binary purchase matrix (users × products); a 1 means the user bought the product:

User  | Product A | Product B | Product C | Product D
Alice |     0     |     1     |     1     |     1
Bob   |     1     |     0     |     1     |     1
John  |     0     |     1     |     0     |     0
Jane  |     1     |     0     |     1     |     1
Bill  |     1     |     1     |     1     |     1
Steve |     1     |     0     |     1     |     1
Larry |     1     |     0     |     0     |     0
Don   |     1     |     1     |     1     |     0
Jack  |     1     |     1     |     0     |     1

Page 15: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}

Tanimoto similarity between products, computed from the binary table on the previous slide:

          | Product A   | Product B   | Product C   | Product D
Product A | 1           | 1/3 (0.33)  | 5/8 (0.625) | 5/8 (0.625)
Product B | 1/3 (0.33)  | 1           | 3/8 (0.375) | 3/8 (0.375)
Product C | 5/8 (0.625) | 3/8 (0.375) | 1           | 5/7 (0.714)
Product D | 5/8 (0.625) | 3/8 (0.375) | 5/7 (0.714) | 1

Tanimoto Coefficient: T(A, B) = N_C / (N_A + N_B - N_C), where

N_A = number of customers who bought product A

N_B = number of customers who bought product B

N_C = number of customers who bought both product A and product B
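As a spot check in plain Java: from the binary table, N_A = 7, N_B = 5, and N_C = 3 customers bought both A and B, so T(A, B) = 3 / (7 + 5 - 3) = 1/3, matching the matrix above.

    // Spot-check of the Tanimoto coefficient for products A and B
    public class Tanimoto {
      static double tanimoto(int nA, int nB, int nC) {
        return (double) nC / (nA + nB - nC);
      }

      public static void main(String[] args) {
        System.out.println(tanimoto(7, 5, 3)); // prints 0.3333...
      }
    }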

Page 16: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}

Cosine similarity between products:

          | Product A | Product B | Product C | Product D
Product A | 1         | 0.507     | 0.772     | 0.772
Product B | 0.507     | 1         | 0.707     | 0.707
Product C | 0.772     | 0.707     | 1         | 0.833
Product D | 0.772     | 0.707     | 0.833     | 1

Cosine Coefficient (for binary purchase vectors): cos(A, B) = N_C / √(N_A · N_B), where

N_A = number of customers who bought product A

N_B = number of customers who bought product B

N_C = number of customers who bought both product A and product B
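The same spot check works here: with N_A = 7, N_B = 5, N_C = 3 from the binary table, cos(A, B) = 3 / √35 ≈ 0.507, matching the matrix above.

    // Spot-check of the cosine coefficient for products A and B
    public class CosineCoefficient {
      static double cosine(int nA, int nB, int nC) {
        return nC / Math.sqrt((double) nA * nB);
      }

      public static void main(String[] args) {
        System.out.println(cosine(7, 5, 3)); // prints ~0.507
      }
    }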

Page 17: Introduction to Mahout and Machine Learning

• Assigning data to discrete categories.

• Train a model on Labeled Data

• Run the Model on new, Unlabeled Data

• Classifier: An algorithm that implements classification, especially in a concrete implementation.

• Classification Algorithms

• Maximum entropy classifier

• Naïve Bayes classifier

• Decision trees, decision lists

• Support vector machines

• Kernel estimation and K-nearest-neighbor algorithms

• Perceptrons

• Neural networks (multi-layer perceptrons)


{ “Algorithms” : “Classification” , “id” : “Introduction”}

[Figure: an unlabeled message (“?”) to be classified as Spam or Not Spam]

Page 18: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}

Train: Not Spam

President Obama’s Nobel Prize Speech

Page 19: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}

Train: Spam

Spam Email Content

Page 20: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}

Run

“Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”

Page 21: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}

• Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.

• Training:

• Read the Features

• Calculate per-Document Statistics

• Normalize across Categories

• Calculate normalizing factor of each label

• Testing

• Classification (fifth job, explicitly invoked)
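For intuition about what the trained model computes at classification time, here is the scoring step of a naive Bayes spam classifier in plain Java. This is a conceptual sketch, not Mahout's implementation: the word counts and the 0.5 priors are made up, and smoothing over each label's own vocabulary is a simplification (a shared vocabulary would be used in practice).

    import java.util.Map;

    public class NaiveBayesSketch {
      public static void main(String[] args) {
        // Toy per-label word counts (illustrative, not from the slides)
        Map<String, Integer> spam = Map.of("order", 3, "savings", 4, "trial", 2);
        Map<String, Integer> ham  = Map.of("prize", 2, "speech", 3, "peace", 4);
        String[] message = "order a trial adobe chicken daily".split(" ");

        double spamScore = score(message, spam, 0.5);
        double hamScore  = score(message, ham, 0.5);
        System.out.println(spamScore > hamScore ? "Spam" : "Not spam");
      }

      // log P(label) + sum over words of log P(word | label),
      // with add-one (Laplace) smoothing
      static double score(String[] words, Map<String, Integer> counts, double prior) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int vocab = counts.size();
        double s = Math.log(prior);
        for (String w : words) {
          s += Math.log((counts.getOrDefault(w, 0) + 1.0) / (total + vocab));
        }
        return s;
      }
    }

Mahout's MapReduce jobs above are, in effect, computing these word and label statistics at scale so that this cheap per-document scoring step can run afterwards.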


Choosing the algorithm through which the system will learn, and the variables used as input, are key steps in the first phase of building the classification system.

The basic steps in building a classification system are illustrated in figure 13.2.

Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable.

The figure shows two phases of the classification process, with the upper path representing training of the classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions. For the training phase, input for the training algorithm consists of example data labeled with known target variables. The target variables are, of course, unknown to the model when using new examples during production. In evaluation, the values of the target variables are known but are not given to the model. In production, the target variable values are not known, which is why the model is built in the first place.

For testing, which is not shown explicitly in figure 13.2, new examples from held-out training data are presented to the model. The results chosen by the model are compared to known answers for the target variables in order to evaluate performance, a process described in depth throughout chapter 15.

The terminology used by different people to describe classification is highly variable. For consistency, we have limited the terms used for key ideas in the book. Many of these terms are provided in table 13.2. Note the relationship between a record and a field: the record is the repository for values related to a training example or production example, and a field is where the value of a feature is stored for each example.

Table 13.2 Terminology for the key ideas in classification

Key idea      | Description
Model         | A computer program that makes decisions; in classification, the output of the training algorithm is a model
Training Data | Subset of training examples labeled with the value of the target variable and used as input to the learning algorithm to produce the model
Test Data     | Withheld portion of training examples given to the model without the value for the target variable (although the value is known) and used to evaluate the model
Training      | Learning process that uses training data to produce a model; that model can then compute estimates of the target variable given the predictor variables as inputs



Page 22: Introduction to Mahout and Machine Learning

• Grouping unstructured data without any training data.

• Self learning from experience.

• Small intra-cluster distance (the algorithm seeks a local, ideally global, minimum)

• Large inter-cluster distance

• Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.


{ “Algorithms” : “Clustering” , “id” : “Introduction”}

Page 23: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 24: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 25: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 26: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 27: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 28: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 29: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 30: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 31: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 32: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Cats

Dogs

Page 33: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}

[Diagram: MapReduce data flow for k-means. Input chunks C0-C3 feed mappers M0-M3 (map phase); their intermediate outputs IO0-IO3 are shuffled to reducers R0-R1 (reduce phase), which write final outputs FO0-FO1.]

Page 34: Introduction to Mahout and Machine Learning

• Assume: the number of clusters is far smaller than the number of points.

• Therefore, |Clusters| << |Points|

• Hadoop’s DistributedCache is used in order to give each Mapper access to all the current cluster centroids.


{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}

[Diagram: mappers M0-M3 emit <clusterID, observation> pairs, which are shuffled to reducers R0-R1.]

Important arguments: --maxIter, --convergenceDelta, --method

Page 35: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “MapReduce KMeans Clustering”}

Map phase: assign cluster IDs

Reduce phase: reset centroids
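One k-means iteration combines these two phases. Here is a minimal sequential sketch in plain Java, the single-node analogue of the map and reduce steps above; it is not Mahout's implementation, and the sample points are made up.

    import java.util.Arrays;

    public class KMeansIteration {
      static double squaredDistance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
      }

      // One iteration: returns the recomputed centroids.
      static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // "Map" phase: assign each observation to its nearest centroid.
        for (double[] p : points) {
          int nearest = 0;
          for (int c = 1; c < k; c++) {
            if (squaredDistance(p, centroids[c])
                < squaredDistance(p, centroids[nearest])) {
              nearest = c;
            }
          }
          counts[nearest]++;
          for (int i = 0; i < dim; i++) sums[nearest][i] += p[i];
        }

        // "Reduce" phase: reset each centroid to the mean of its cluster.
        for (int c = 0; c < k; c++) {
          if (counts[c] > 0) {
            for (int i = 0; i < dim; i++) sums[c][i] /= counts[c];
          } else {
            sums[c] = centroids[c].clone(); // keep an empty cluster's old centroid
          }
        }
        return sums;
      }

      public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}};
        double[][] centroids = {{0, 0}, {10, 10}};
        // Prints roughly [[1.25, 1.5], [8.5, 8.75]]
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
      }
    }

In the distributed version, the assignment loop is partitioned across mappers (each holding all centroids via the DistributedCache, as noted earlier) and the averaging is partitioned across reducers, one cluster ID per reduce group; the driver repeats iterations until --maxIter or --convergenceDelta is reached.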

Page 36: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Other Algorithms” }

• Classification
‣ Stochastic Gradient Descent
‣ Support Vector Machines
‣ Random Forests

• Clustering
‣ Latent Dirichlet Allocation - topic models
‣ Fuzzy K-Means - points are assigned multiple clusters
‣ Canopy clustering - fast approximations of clusters
‣ Spectral clustering - treat points as a graph

• Evolutionary Algorithms - integration with Watchmaker for genetic programming fitness functions

• Dimensionality Reduction

• Regression

Page 37: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Future” }

• Classification
‣ Decision Trees such as J48 and ID3

• Clustering
‣ DBSCAN and COBWEB clustering techniques

• Evolutionary Algorithms
‣ Classical Genetic Algorithms

• Association Rules
‣ Apriori (Mahout already has an alternative frequent-itemset implementation).

Page 38: Introduction to Mahout and Machine Learning

{ “Mahout” : “Summary” }


Page 39: Introduction to Mahout and Machine Learning

{ “Summary”: “Apache Mahout” }


• Scalable Library

Page 40: Introduction to Mahout and Machine Learning


• Scalable Library

• Three Primary Areas of Focus

{ “Summary”: “Apache Mahout” }

Page 41: Introduction to Mahout and Machine Learning


• Scalable Library

• Three Primary Areas of Focus

• Other Algorithms

{ “Summary”: “Apache Mahout” }

Page 42: Introduction to Mahout and Machine Learning


• Scalable Library

• Three Primary Areas of Focus

• Other Algorithms

• All in your friendly neighborhood MapReduce

{ “Summary”: “Apache Mahout” }

Page 43: Introduction to Mahout and Machine Learning

{ “Mahout” : “Demo” }


Page 44: Introduction to Mahout and Machine Learning

{ “Mahout” : “Questions” }


Page 45: Introduction to Mahout and Machine Learning

{ “Mahout” : “References” }


Page 46: Introduction to Mahout and Machine Learning

• Books

• “Mahout in Action”, Owen et al., Manning Publications.

• “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer.

• “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer.

• Videos

• CS-229, Machine Learning at Stanford University - Prof. Andrew Ng.

• Collaborative filtering at scale - Sean Owen

• Distributed Item-based Collaborative Filtering - Sebastian Schelter

• EMail Classification with Mahout - Grant Ingersoll @ Lucid Imagination


{ “References” : “Mahout Books, Tutorials, Links”, “id” : 1}

Page 47: Introduction to Mahout and Machine Learning

• WWW

• http://mahout.apache.org - Mahout@Apache

• http://hadoop.apache.org - Hadoop@Apache

• [email protected] - Developer mailing list

• [email protected] - User mailing list

• http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout


{ “References” : “Mahout Books, Tutorials, Links”, “id” : 2}

Page 48: Introduction to Mahout and Machine Learning

{ “Mahout” : “The End” }


{“Thank You” : “Have a Nice and Green Day” }