Introduction to Mahout and Machine Learning


Description

This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.

Transcript of Introduction to Mahout and Machine Learning

Page 1: Introduction to Mahout and Machine Learning

{ “Mahout” : “Scalable Machine Learning Library” }

{ “Presented By” : “Varad Meru”, “Company” : “Orzota, Inc”,

“Twitter” : “@vrdmr” }


Page 2: Introduction to Mahout and Machine Learning

{ “Mahout” : “Introduction” }


Page 3: Introduction to Mahout and Machine Learning

{ “Introduction” : “History and Etymology” }

• A Scalable Machine Learning Library built on Hadoop, written in Java.

• Driven by Ng et al.’s paper “Map-Reduce for Machine Learning on Multicore”

• Started as a Lucene sub-project. Became Apache TLP in April 2010.

• Latest version out – 0.6 (released on 6th Feb 2012).

• Mahout – keeper/driver of elephants; a fitting name, since many of the algorithms are implemented in MapReduce on Hadoop, whose mascot is an elephant.

• Mahout was started by Isabel Drost, Grant Ingersoll, and Karl Wettin.

• The Taste recommendation framework was added later by Sean Owen.



Figure 1.1 Apache Mahout and its related projects within the Apache Foundation.

Much of Mahout’s work has been to not only implement these algorithms conventionally, in an efficient and scalable way, but also to convert some of these algorithms to work at scale on top of Hadoop. Hadoop’s mascot is an elephant, which at last explains the project name!

Mahout incubates a number of techniques and algorithms, many still in development or in an experimental phase. At this early stage in the project's life, three core themes are evident: collaborative filtering / recommender engines, clustering, and classification. This is by no means all that exists within Mahout, but they are the most prominent and mature themes at the time of writing. These, therefore, are the scope of this book.

Chances are that if you are reading this, you are already aware of the interesting potential of these three families of techniques. But just in case, read on.

1.2 Mahout’s Machine Learning Themes

While Mahout is, in theory, a project open to implementations of all kinds of machine learning techniques, it is in practice a project that focuses on three key areas of machine learning at the moment. These are recommender engines (collaborative filtering), clustering, and classification.

1.2.1 Recommender Engines

Recommender engines are the most immediately recognizable machine learning technique in use today. You will have seen services or sites that attempt to recommend books or movies or articles based on your past actions. They try to infer tastes and preferences and identify unknown items that are of interest:

• Amazon.com is perhaps the most famous commerce site to deploy recommendations. Based on purchases and site activity, Amazon recommends books and other items likely to be of interest. See Figure 1.2.

• Netflix similarly recommends DVDs that may be of interest, and famously offered a $1,000,000 prize to researchers who could improve the quality of their recommendations.

• Dating sites like Líbímseti (discussed later) can even recommend people to people.

• Social networking sites like Facebook use variants on recommender techniques to identify people most likely to be an as-yet-unconnected friend.



Page 4: Introduction to Mahout and Machine Learning

{ “Mahout” : “Machine Learning” }


Page 5: Introduction to Mahout and Machine Learning

{ “Machine Learning” : “Introduction” }

“Machine Learning is Programming Computers to optimize a Performance Criterion using Example Data or Past Experience”

• Branch of Artificial Intelligence

• Design and Development of Algorithms

• Computers evolve behavior based on empirical data.

• Supervised Learning

• Using labeled training data to create a classifier that can predict outputs for unseen inputs.

• Unsupervised Learning

• Using unlabeled training data to find structure in the data, such as clusters.

• Semi-Supervised Learning

• Combining a small amount of labeled data with a large amount of unlabeled data.


Page 6: Introduction to Mahout and Machine Learning

{ “Machine Learning” : “Applications” }

• Recommend Friends, Dates, Products to end-users.

• Classify content into pre-defined groups.

• Find Similar content based on Object Properties.

• Identify key topics in large Collections of Text.

• Detect Anomalies within given data.

• Ranking Search Results with User Feedback Learning.

• Classifying DNA sequences.

• Sentiment Analysis / Opinion Mining.

• Computer Vision.

• Natural Language Processing.

• Bioinformatics.

• Speech and Handwriting Recognition.

• Others...

Page 7: Introduction to Mahout and Machine Learning

{“Machine Learning”: “Challenges”}

• Big Data

• Yesterday's processing on next-generation data.

• Time for processing.

• Large and cheap storage.


Size            | Classification | Task                       | Tools
Lines           | Sample Data    | Analysis and Visualization | Whiteboard, bash, ...
KBs - low MBs   | Prototype Data | Analysis and Visualization | Matlab, Octave, R, Processing, bash, ...
MBs - low GBs   | Online Data    | Storage                    | MySQL (DBs), ...
MBs - low GBs   | Online Data    | Analysis                   | NumPy, SciPy, Weka, BLAS/LAPACK, ...
MBs - low GBs   | Online Data    | Visualization              | Flare, AmCharts, Raphael, Protovis, ...
GBs - TBs - PBs | Big Data       | Storage                    | HDFS, HBase, Cassandra, ...
GBs - TBs - PBs | Big Data       | Analysis                   | Hive, Mahout, Hama, Giraph, ...

Page 8: Introduction to Mahout and Machine Learning

{ “Machine Learning” : “Mahout for Big Data”}

• Goal: “Be as Fast and Efficient as possible given the intrinsic design of the Algorithm”.

• Some Algorithms won’t scale to massive machine clusters

• Others fit logically on a MapReduce framework such as Apache Hadoop

• Most Mahout implementations are MapReduce enabled

• Focus: “Scalability with Hadoop’s MapReduce processing framework over Big Data stored in Hadoop’s HDFS”.

• The only machine learning library built on a MapReduce framework; other MapReduce frameworks such as Disco, Skynet, FileMap, Phoenix, and AEMR either don’t scale or lack an ML library.

• The only scalable machine learning framework with MapReduce and Hadoop support (www.mloss.org: Machine Learning Open-Source Software).


Page 9: Introduction to Mahout and Machine Learning

{ “Mahout” : “Internals” }


Page 10: Introduction to Mahout and Machine Learning


{ “Internals” : “Architecture” }

[Architecture diagram: a layered stack]

• Applications and Examples (top layer)
• Algorithms: Recommenders, Clustering, Classification, Regression, Dimension Reduction, Freq. Pattern Mining, Evolutionary Algorithms
• Support: Math (Vectors/Matrices/SVD), Utilities (Lucene/Vectorizer), Collections (primitives)
• Apache Hadoop (bottom layer)

Page 11: Introduction to Mahout and Machine Learning

• Scalable

• Dual-Mode (Sequential and MapReduce Enabled)

• Support for easy Extension.

• Supports a large number of data sources, including the newer NoSQL variants.

• A Java library: a framework of tools intended to be used and adapted by developers.

• Advanced Implementations of Java’s Collections Framework for better Performance.


{ “Internals” : “Features” }
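To give a feel for the library side, here is a small sketch using the math module's vector types. It assumes the org.apache.mahout.math API (DenseVector, RandomAccessSparseVector) as shipped in the 0.x releases; the values are illustrative.

    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class VectorDemo {
      public static void main(String[] args) {
        // Dense vector backed by a double[]
        Vector dense = new DenseVector(new double[] {1.0, 0.0, 2.0});
        // Sparse vector of cardinality 3; only non-zero entries are stored
        Vector sparse = new RandomAccessSparseVector(3);
        sparse.setQuick(2, 4.0);
        // Dot product: 1*0 + 0*0 + 2*4 = 8.0
        System.out.println(dense.dot(sparse));
      }
    }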

Page 12: Introduction to Mahout and Machine Learning

{ “Mahout” : “Algorithms” }


Page 13: Introduction to Mahout and Machine Learning

• Help Users find items they might like based on historical behavior and preferences

• Top-level packages define the Mahout interfaces to these key abstractions:

• DataModel – FileDataModel, MySQLJDBCDataModel, PostgreSQLJDBCDataModel, MongoDBDataModel, CassandraDataModel

• UserSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity

• ItemSimilarity – Pearson-Correlation, Tanimoto, Log-Likelihood, Uncentered Cosine Similarity, Euclidean Distance Similarity

• UserNeighborhood – Nearest N-User Neighborhood, Threshold User Neighborhood.

• Recommender – KNN Item-Based Recommender, Slope One Recommender, Tree Clustering Recommender.


{ “Algorithms” : “Recommender Systems”, “id” : “Introduction”}
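To make these abstractions concrete, here is a minimal user-based recommender composed from them in the Taste API. This is a sketch: the file name intro.csv, the neighborhood size of 2, and the user ID are illustrative.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderDemo {
      public static void main(String[] args) throws Exception {
        // CSV of "userID,itemID,preference" lines (file name is illustrative)
        DataModel model = new FileDataModel(new File("intro.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender =
            new GenericUserBasedRecommender(model, similarity, neighborhood);
        // Top 3 recommendations for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem item : recommendations) {
          System.out.println(item);
        }
      }
    }

Swapping a different DataModel, UserSimilarity, or UserNeighborhood implementation changes the behavior without touching the rest of the pipeline; that is the point of the interfaces listed above.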

Page 14: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Recommender Systems”, “id” : “Example”}

Binary Values Recommendation. Binary purchase matrix (users × products); a 1 means the user bought the product:

User  | Product A | Product B | Product C | Product D
Alice |     0     |     1     |     1     |     1
Bob   |     1     |     0     |     1     |     1
John  |     0     |     1     |     0     |     0
Jane  |     1     |     0     |     1     |     1
Bill  |     1     |     1     |     1     |     1
Steve |     1     |     0     |     1     |     1
Larry |     1     |     0     |     0     |     0
Don   |     1     |     1     |     1     |     0
Jack  |     1     |     1     |     0     |     1

Page 15: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Recommender Systems” , “Similarity” : “Tanimoto”}

Tanimoto similarity between products, computed from the binary table on the previous slide:

          | Product A   | Product B   | Product C   | Product D
Product A | 1           | 1/3 (0.33)  | 5/8 (0.625) | 5/8 (0.625)
Product B | 1/3 (0.33)  | 1           | 3/8 (0.375) | 3/8 (0.375)
Product C | 5/8 (0.625) | 3/8 (0.375) | 1           | 5/7 (0.714)
Product D | 5/8 (0.625) | 3/8 (0.375) | 5/7 (0.714) | 1

Tanimoto Coefficient: T(A, B) = N_C / (N_A + N_B - N_C), where

N_A = number of customers who bought product A

N_B = number of customers who bought product B

N_C = number of customers who bought both product A and product B
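As a spot check in plain Java: from the binary table, N_A = 7, N_B = 5, and N_C = 3 customers bought both A and B, so T(A, B) = 3 / (7 + 5 - 3) = 1/3, matching the matrix above.

    // Spot-check of the Tanimoto coefficient for products A and B
    public class Tanimoto {
      static double tanimoto(int nA, int nB, int nC) {
        return (double) nC / (nA + nB - nC);
      }

      public static void main(String[] args) {
        System.out.println(tanimoto(7, 5, 3)); // prints 0.3333...
      }
    }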

Page 16: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Recommender Systems” , “Similarity” : “Cosine”}

Cosine similarity between products:

          | Product A | Product B | Product C | Product D
Product A | 1         | 0.507     | 0.772     | 0.772
Product B | 0.507     | 1         | 0.707     | 0.707
Product C | 0.772     | 0.707     | 1         | 0.833
Product D | 0.772     | 0.707     | 0.833     | 1

Cosine Coefficient (for binary purchase vectors): cos(A, B) = N_C / √(N_A · N_B), where

N_A = number of customers who bought product A

N_B = number of customers who bought product B

N_C = number of customers who bought both product A and product B
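The same spot check works here: with N_A = 7, N_B = 5, N_C = 3 from the binary table, cos(A, B) = 3 / √35 ≈ 0.507, matching the matrix above.

    // Spot-check of the cosine coefficient for products A and B
    public class CosineCoefficient {
      static double cosine(int nA, int nB, int nC) {
        return nC / Math.sqrt((double) nA * nB);
      }

      public static void main(String[] args) {
        System.out.println(cosine(7, 5, 3)); // prints ~0.507
      }
    }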

Page 17: Introduction to Mahout and Machine Learning

• Assigning data to discrete categories.

• Train a model on Labeled Data

• Run the Model on new, Unlabeled Data

• Classifier: An algorithm that implements classification, especially in a concrete implementation.

• Classification Algorithms

• Maximum entropy classifier

• Naïve Bayes classifier

• Decision trees, decision lists

• Support vector machines

• Kernel estimation and K-nearest-neighbor algorithms

• Perceptrons

• Neural networks (multi-layer perceptrons)


{ “Algorithms” : “Classification” , “id” : “Introduction”}

[Figure: an unlabeled message (“?”) to be classified as Spam or Not Spam]

Page 18: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}

Train: Not Spam

President Obama’s Nobel Prize Speech

Page 19: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}

Train: Spam

Spam Email Content

Page 20: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes Example”}

Run

“Order a trial Adobe chicken daily EAB-List new summer savings, welcome!”

Page 21: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Classification” , “id” : “Naïve Bayes in Mahout”}

• Naïve Bayes is a pretty complex process in Mahout: training the classifier requires four separate Hadoop jobs.

• Training:

• Read the Features

• Calculate per-Document Statistics

• Normalize across Categories

• Calculate normalizing factor of each label

• Testing

• Classification (fifth job, explicitly invoked)
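For intuition about what the trained model computes at classification time, here is the scoring step of a naive Bayes spam classifier in plain Java. This is a conceptual sketch, not Mahout's implementation: the word counts and the 0.5 priors are made up, and smoothing over each label's own vocabulary is a simplification (a shared vocabulary would be used in practice).

    import java.util.Map;

    public class NaiveBayesSketch {
      public static void main(String[] args) {
        // Toy per-label word counts (illustrative, not from the slides)
        Map<String, Integer> spam = Map.of("order", 3, "savings", 4, "trial", 2);
        Map<String, Integer> ham  = Map.of("prize", 2, "speech", 3, "peace", 4);
        String[] message = "order a trial adobe chicken daily".split(" ");

        double spamScore = score(message, spam, 0.5);
        double hamScore  = score(message, ham, 0.5);
        System.out.println(spamScore > hamScore ? "Spam" : "Not spam");
      }

      // log P(label) + sum over words of log P(word | label),
      // with add-one (Laplace) smoothing
      static double score(String[] words, Map<String, Integer> counts, double prior) {
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        int vocab = counts.size();
        double s = Math.log(prior);
        for (String w : words) {
          s += Math.log((counts.getOrDefault(w, 0) + 1.0) / (total + vocab));
        }
        return s;
      }
    }

Mahout's MapReduce jobs above are, in effect, computing these word and label statistics at scale so that this cheap per-document scoring step can run afterwards.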


Choosing the algorithm through which the system will learn, and the variables used as input, are key steps in the first phase of building the classification system.

The basic steps in building a classification system are illustrated in figure 13.2.

Figure 13.2. How a classification system works. Inside the dotted lasso is the heart of the classification system, a training algorithm that learns a model to emulate human decisions. A copy of the model is then used in evaluation or in production with new input examples to estimate the target variable.

The figure shows two phases of the classification process, with the upper path representing training of the classification model and the lower path providing new examples for which the model will assign categories (the target variables) as a way to emulate decisions. For the training phase, input for the training algorithm consists of example data labeled with known target variables. The target variables are, of course, unknown to the model when using new examples during production. In evaluation, the values of the target variables are known but are not given to the model. In production, the target variable values are not known, which is why the model is built in the first place.

For testing, which is not shown explicitly in figure 13.2, new examples from held-out training data are presented to the model. The results chosen by the model are compared to known answers for the target variables in order to evaluate performance, a process described in depth throughout chapter 15.

The terminology used by different people to describe classification is highly variable. For consistency, we have limited the terms used for key ideas in the book. Many of these terms are provided in table 13.2. Note the relationship between a record and a field: the record is the repository for values related to a training example or production example, and a field is where the value of a feature is stored for each example.

Table 13.2 Terminology for the key ideas in classification

Key idea      | Description
Model         | A computer program that makes decisions; in classification, the output of the training algorithm is a model
Training Data | Subset of training examples labeled with the value of the target variable and used as input to the learning algorithm to produce the model
Test Data     | Withheld portion of training examples given to the model without the value for the target variable (although the value is known) and used to evaluate the model
Training      | Learning process that uses training data to produce a model; that model can then compute estimates of the target variable given the predictor variables as inputs



Page 22: Introduction to Mahout and Machine Learning

• Grouping unstructured data without any training data.

• Self learning from experience.

• Small intra-cluster distance (the algorithm seeks a local, ideally global, minimum)

• Large inter-cluster distance

• Mahout’s Canopy Clustering MapReduce algorithm is often used to compute initial cluster centroids.


{ “Algorithms” : “Clustering” , “id” : “Introduction”}

Page 23: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 24: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 25: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 26: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 27: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 28: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 29: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 30: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 31: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Page 32: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means Clustering Example”}

Cats

Dogs

Page 33: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}

[Diagram: MapReduce data flow for k-means. Input chunks C0-C3 feed mappers M0-M3 (map phase); their intermediate outputs IO0-IO3 are shuffled to reducers R0-R1 (reduce phase), which write final outputs FO0-FO1.]

Page 34: Introduction to Mahout and Machine Learning

• Assume: the number of clusters is far smaller than the number of points.

• Therefore, |Clusters| << |Points|

• Hadoop’s DistributedCache is used in order to give each Mapper access to all the current cluster centroids.


{ “Algorithms” : “Clustering” , “id” : “K-Means in Mahout”}

[Diagram: mappers M0-M3 emit <clusterID, observation> pairs, which are shuffled to reducers R0-R1.]

Important arguments: --maxIter, --convergenceDelta, --method

Page 35: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Clustering” , “id” : “MapReduce KMeans Clustering”}

Map phase: assign cluster IDs

Reduce phase: reset centroids
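One k-means iteration combines these two phases. Here is a minimal sequential sketch in plain Java, the single-node analogue of the map and reduce steps above; it is not Mahout's implementation, and the sample points are made up.

    import java.util.Arrays;

    public class KMeansIteration {
      static double squaredDistance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += (a[i] - b[i]) * (a[i] - b[i]);
        return d;
      }

      // One iteration: returns the recomputed centroids.
      static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length, dim = centroids[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];

        // "Map" phase: assign each observation to its nearest centroid.
        for (double[] p : points) {
          int nearest = 0;
          for (int c = 1; c < k; c++) {
            if (squaredDistance(p, centroids[c])
                < squaredDistance(p, centroids[nearest])) {
              nearest = c;
            }
          }
          counts[nearest]++;
          for (int i = 0; i < dim; i++) sums[nearest][i] += p[i];
        }

        // "Reduce" phase: reset each centroid to the mean of its cluster.
        for (int c = 0; c < k; c++) {
          if (counts[c] > 0) {
            for (int i = 0; i < dim; i++) sums[c][i] /= counts[c];
          } else {
            sums[c] = centroids[c].clone(); // keep an empty cluster's old centroid
          }
        }
        return sums;
      }

      public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}};
        double[][] centroids = {{0, 0}, {10, 10}};
        // Prints roughly [[1.25, 1.5], [8.5, 8.75]]
        System.out.println(Arrays.deepToString(iterate(points, centroids)));
      }
    }

In the distributed version, the assignment loop is partitioned across mappers (each holding all centroids via the DistributedCache, as noted earlier) and the averaging is partitioned across reducers, one cluster ID per reduce group; the driver repeats iterations until --maxIter or --convergenceDelta is reached.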

Page 36: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Other Algorithms” }

• Classification
‣ Stochastic Gradient Descent
‣ Support Vector Machines
‣ Random Forests

• Clustering
‣ Latent Dirichlet Allocation - topic models
‣ Fuzzy K-Means - points are assigned multiple clusters
‣ Canopy clustering - fast approximations of clusters
‣ Spectral clustering - treat points as a graph

• Evolutionary Algorithms - integration with Watchmaker for genetic programming fitness functions

• Dimensionality Reduction

• Regression

Page 37: Introduction to Mahout and Machine Learning


{ “Algorithms” : “Future” }

• Classification
‣ Decision Trees such as J48 and ID3

• Clustering
‣ DBSCAN and COBWEB clustering techniques

• Evolutionary Algorithms
‣ Classical Genetic Algorithms

• Association Rules
‣ Apriori (Mahout already has an alternative frequent-itemset implementation).

Page 38: Introduction to Mahout and Machine Learning

{ “Mahout” : “Summary” }


Page 39: Introduction to Mahout and Machine Learning

{ “Summary”: “Apache Mahout” }


• Scalable Library

Page 40: Introduction to Mahout and Machine Learning


• Scalable Library

• Three Primary Areas of Focus

{ “Summary”: “Apache Mahout” }

Page 41: Introduction to Mahout and Machine Learning


• Scalable Library

• Three Primary Areas of Focus

• Other Algorithms

{ “Summary”: “Apache Mahout” }

Page 42: Introduction to Mahout and Machine Learning


• Scalable Library

• Three Primary Areas of Focus

• Other Algorithms

• All in your friendly neighborhood MapReduce

{ “Summary”: “Apache Mahout” }

Page 43: Introduction to Mahout and Machine Learning

{ “Mahout” : “Demo” }


Page 44: Introduction to Mahout and Machine Learning

{ “Mahout” : “Questions” }


Page 45: Introduction to Mahout and Machine Learning

{ “Mahout” : “References” }


Page 46: Introduction to Mahout and Machine Learning

• Books

• “Mahout in Action”, Owen et al., Manning Publications.

• “Pattern Recognition and Machine Learning”, Christopher Bishop, Springer.

• “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Hastie et al., Springer.

• Videos

• CS-229, Machine Learning at Stanford University - Prof. Andrew Ng.

• Collaborative filtering at scale - Sean Owen

• Distributed Item-based Collaborative Filtering - Sebastian Schelter

• EMail Classification with Mahout - Grant Ingersoll @ Lucid Imagination


{ “References” : “Mahout Books, Tutorials, Links”, “id” : 1}

Page 47: Introduction to Mahout and Machine Learning

• WWW

• http://mahout.apache.org - Mahout@Apache

• http://hadoop.apache.org - Hadoop@Apache

• [email protected] - Developer mailing list

• [email protected] - User mailing list

• http://www.ibm.com/developerworks/java/library/j-mahout/ - Introducing Apache Mahout


{ “References” : “Mahout Books, Tutorials, Links”, “id” : 2}

Page 48: Introduction to Mahout and Machine Learning

{ “Mahout” : “The End” }


{“Thank You” : “Have a Nice and Green Day” }