Machine Learning with Hadoop

October 4, 2011. Presented to Hadoop-DC. Training on a pluggable machine learning platform. Machine Learning on Hadoop at Huffington Post | AOL

description

Sangchul Song and Thu Kyaw discuss machine learning at AOL, and the challenges and solutions they encountered when trying to train a large number of machine learning models using Hadoop. Algorithms including SVM and packages like Mahout are discussed. Finally, they discuss their analytics pipeline, which includes some custom components used to interoperate with a range of machine learning libraries, as well as integration with the query language Pig.

Transcript of Machine Learning with Hadoop

Page 1: Machine Learning with Hadoop

October 4, 2011
Presented to Hadoop-DC

Training on a pluggable machine learning platform

Machine Learning on Hadoop at Huffington Post | AOL

Page 2: Machine Learning with Hadoop

A Little Bit about Us

Core Services Team at HPMG | AOL

Thu Kyaw
• Principal Software Engineer
• Worked on machine learning, data mining, and natural language processing

Sang Chul Song, Ph.D.
• Senior Software Engineer
• Worked on data intensive computing: data archiving / information retrieval

Page 3: Machine Learning with Hadoop

Machine Learning: Supervised Classification

1. Learning Phase

Labeled training examples:

Business                                  Non-business
Investments are taxed as …                Are you dense or just clueless?
the top tax bracket for …                 numbas is numbas …
Well, Mr. Geithner, …                     This is a joke, right?
the financial crises are unfair …         My nephew is a hedge fund manager …

Labeled examples → Train → Model

2. Classifying Phase

capital gains to be taxed … → Classify (with Model) → Result

Labels shown on the slide: “Business”, “Entertainment”, “Politics”
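The two phases on this slide can be illustrated with a very small classifier. This is only a sketch: it uses a multinomial naive Bayes model with add-one smoothing, which is one of many possible algorithms and not necessarily what the presenters used, and the training sentences are made-up toy data that loosely echo the slide.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Learning phase: count words per label (multinomial naive Bayes)."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def classify(model, text):
    """Classifying phase: pick the label with the highest log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a label.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up toy data, loosely echoing the slide's two classes.
examples = [
    ("investments are taxed as capital gains", "business"),
    ("the top tax bracket for investors", "business"),
    ("are you dense or just clueless", "non-business"),
    ("this is a joke right", "non-business"),
]
model = train(examples)
label = classify(model, "capital gains to be taxed")
```

With this toy data, `label` comes out as "business", because "capital", "gains" and "taxed" were only seen in business examples.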

Page 4: Machine Learning with Hadoop

Two Machine Learning Use Cases at HuffPost | AOL

Comment Moderation
• Evaluate All New HuffPost User Comments Every Day
• Identify Abusive / Aggressive Comments
• Auto Delete / Publish ~25% of Comments Every Day

Article Classification
• Tag Articles for Advertising
• E.g.: scary, salacious, …

Page 5: Machine Learning with Hadoop

Our Classification Tasks

Comment Moderation: each comment is labeled abusive or non-abusive

Article Classification: each article is tagged with labels such as scary, sexy

Page 6: Machine Learning with Hadoop

In Order to Meet Our Needs, We Require…

Support for important algorithms, including
• SVM
• Perceptron / Winnow
• Bayesian
• Decision Tree
• AdaBoost
• …

Ability to build tons of models on a regular basis, and pick the best
• Because, in general, it’s difficult to know in advance what algorithm / parameter set will work best

Page 7: Machine Learning with Hadoop

However,

N algorithms, K parameters each, L values in each parameter → There are N × L^K combinations, which is often too many to deal with sequentially.

For example, N=5, K=5, L=10 → 5 × 10^5 = 500K
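The count on this slide is easy to check. The sketch below computes N × L^K directly and cross-checks the formula by brute-force enumeration on a small case; the function names are illustrative only.

```python
from itertools import product

def count_combinations(n_algorithms: int, k_params: int, l_values: int) -> int:
    """Number of (algorithm, parameter setting) pairs: N * L**K."""
    return n_algorithms * l_values ** k_params

def enumerate_combinations(n_algorithms: int, k_params: int, l_values: int):
    """Yield every (algorithm index, tuple of K value indices) pair."""
    for algorithm in range(n_algorithms):
        for setting in product(range(l_values), repeat=k_params):
            yield algorithm, setting

slide_example = count_combinations(5, 5, 10)  # N=5, K=5, L=10
small_case = sum(1 for _ in enumerate_combinations(2, 3, 4))
```

Here `slide_example` is 500000, matching the 500K on the slide, and the enumerated small case agrees with the formula (2 × 4^3 = 128).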

Page 8: Machine Learning with Hadoop

So, we parallelize on Hadoop

Good news:
• Mahout, a parallel machine learning tool, is already available.
• Mallet, libsvm, Weka, … support the necessary algorithms.

Bad news:
• Mahout doesn’t support the necessary algorithms yet.
• The other libraries do not run natively on Hadoop.

Page 9: Machine Learning with Hadoop

Therefore, we do…

We build a flexible ML platform running on Hadoop that supports a wide range of algorithms, leveraging publicly available implementations.

On top of our platform, we generate / test hundreds of thousands of models, and choose the best.

We use Pig for the Hadoop implementation.

Page 10: Machine Learning with Hadoop

Our Approach

CONVENTIONAL
Training Data → Train (sequential) → 1000s of Models (one for each param set) → Select Best → Model

OUR APPROACH
More algorithms (thus better model), and faster parallel processing
Train Request / Return
AdaBoost, SVM, Decision Tree, Bayesian and a Lot of Others

Page 11: Machine Learning with Hadoop

What Parallelization?

[Diagram: five Training Task boxes]

Page 12: Machine Learning with Hadoop

General Processing Flow

Preprocess Parameters
• Stopword use, n-gram size, stemming, etc.

Train Parameters
• Algorithm and algorithm specific parameters
• (e.g. for SVM: C, ε, and other kernel parameters)

Training Docs → Preprocess → Vectorized Docs → Train → Model

Page 13: Machine Learning with Hadoop

Our Parallel Processing Flow

[Diagram: one Training Docs box fans out to three Vectorized Docs boxes, each of which fans out to three Model boxes]

Page 14: Machine Learning with Hadoop

Preprocessing on Hadoop

Preprocessing Request (a parameter set per line)

279 68 ngram_stem_stopword 1 snowball true
279 68 ngram_stem_stopword 2 snowball true
279 68 ngram_stem_stopword 3 snowball true
279 68 ngram_stem_stopword 1 porter true
279 68 ngram_stem_stopword 2 porter true
279 68 ngram_stem_stopword 3 none false
…

Training Data

business Investments are taxed as capital gains …
business It was the overleveraged and underregulated banks …
none I am afraid we may be headed for …
none In the famous words of Homer Simpson, “it takes 2 to lie …”
…

Preprocessing Request + Training Data → Preprocessing on Hadoop (see next slide) → Vector 1, Vector 2, Vector 3, Vector 4, Vector 5, …, Vector k
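A request file of this shape could be generated from a parameter grid rather than written by hand. The sketch below is an assumption-heavy illustration: the slide does not explain the two leading numbers, so they are passed through as an opaque prefix, and the meaning of the last three fields (n-gram size, stemmer, stopword flag) is inferred from the pipeline name `ngram_stem_stopword`.

```python
from itertools import product

def preprocessing_requests(prefix, pipeline, ngram_sizes, stemmers, stopword_flags):
    """Yield one space-separated parameter set per line, mirroring the slide's layout."""
    for stemmer, n, use_stopwords in product(stemmers, ngram_sizes, stopword_flags):
        # Field order (n-gram size, stemmer, stopword flag) is an assumption.
        yield f"{prefix} {pipeline} {n} {stemmer} {str(use_stopwords).lower()}"

lines = list(
    preprocessing_requests(
        prefix="279 68",                 # opaque leading fields copied from the slide
        pipeline="ngram_stem_stopword",
        ngram_sizes=(1, 2, 3),
        stemmers=("snowball", "porter"),
        stopword_flags=(True,),
    )
)
```

This yields six lines, starting with `279 68 ngram_stem_stopword 1 snowball true`, one per combination of stemmer and n-gram size.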

Page 15: Machine Learning with Hadoop

Preprocessing on Hadoop: Big Picture

par = LOAD 'param_file' AS (par1, par2, …);
run = FOREACH par GENERATE RunPreprocess(par1, par2, …);
STORE run INTO …;

Through UDF call: RunPreprocess() (UDF) → Vector 1, Vector 2, …, Vector k

Preprocessors (Pluggable Pipes)
• Tokenizer
• StopwordFilter
• Stemmer
• Vectorizer
• FeatureSelector
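One way to picture "pluggable pipes" is as a list of small functions that each take and return a token list, so that a parameter set simply selects which pipes to chain. The sketch below is hypothetical: the pipe names, the toy suffix stemmer (a stand-in for Porter or Snowball) and the term-frequency output are illustrative assumptions, not the presenters' implementation.

```python
from collections import Counter
from typing import Callable, Iterable, List

Pipe = Callable[[List[str]], List[str]]

def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()

def stopword_filter(stopwords: Iterable[str]) -> Pipe:
    blocked = set(stopwords)
    return lambda tokens: [t for t in tokens if t not in blocked]

def suffix_stemmer(suffixes=("ing", "ed", "s")) -> Pipe:
    """Toy stemmer: strips one known suffix, keeping at least three characters."""
    def stem(token: str) -> str:
        for suffix in suffixes:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
        return token
    return lambda tokens: [stem(t) for t in tokens]

def ngrams(n: int) -> Pipe:
    """Join every run of n consecutive tokens into one feature."""
    return lambda tokens: [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def run_pipeline(text: str, pipes: List[Pipe]) -> Counter:
    tokens = tokenize(text)
    for pipe in pipes:
        tokens = pipe(tokens)
    return Counter(tokens)  # sparse term-frequency vector

text = "Investments are taxed as capital gains"
unigram_vector = run_pipeline(text, [stopword_filter({"are", "as"}), suffix_stemmer(), ngrams(1)])
bigram_vector = run_pipeline(text, [stopword_filter({"are", "as"}), suffix_stemmer(), ngrams(2)])
```

Swapping `ngrams(1)` for `ngrams(2)`, or dropping the stemmer, changes the resulting vector without touching the other pipes, which is the property that makes it cheap to produce one vector set per parameter line.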

Page 16: Machine Learning with Hadoop

Training on Hadoop

Train Request (a parameter set per line)

73 923 balanced_winnow 5 1 10 …
73 923 balanced_winnow 5 2 10 …
73 923 balanced_winnow 5 3 10 …
73 923 balanced_winnow 5 1 20 …
73 923 balanced_winnow 5 2 20 …
73 923 balanced_winnow 5 3 20 …
…

Vectors

010101101020101100010101110100010101011100…
010111010100010100100010101011100110110101…
011101011010101011101011011010001010010101…
010010111010100010101010001010111010101010…
111010110001110101011010100101011010001011…

Train Request + Vectors → Training on Hadoop (see next slide), using Mahout, Weka, Mallet or libsvm → Model 1, Model 2, Model 3, Model 4, Model 5, …, Model k

Page 17: Machine Learning with Hadoop

Training on Hadoop: Big Picture

par = LOAD 'param_file' AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run INTO …;

Through UDF call: RunTrainer() (UDF) → Model 1, Model 2, …, Model k

Mahout
• Bayesian
• Logistic Regression
• …

Mallet
• AdaBoost (M2)
• Bagging
• Balanced Winnow
• C45
• Decision Tree
• …

Weka
• AdaBoostM1
• Bagging
• Additive Regression
• …

libsvm
• SVM
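A trainer UDF that fronts several libraries needs some way to dispatch on the algorithm named in each request line. The sketch below shows one possible shape, a name-to-function registry. Everything in it is hypothetical: the registry, the adapter functions and the returned dictionaries are placeholders, and the only detail taken from the slides is that the algorithm name is the third field of a train request such as `73 923 balanced_winnow 5 1 10 …`.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical registry: algorithm name -> function(training vectors, params) -> model.
TRAINERS: Dict[str, Callable[[Sequence, List[str]], dict]] = {}

def register(name: str):
    """Decorator that plugs a trainer adapter into the registry."""
    def decorator(fn):
        TRAINERS[name] = fn
        return fn
    return decorator

@register("balanced_winnow")
def train_balanced_winnow(vectors, params):
    # Placeholder: a real adapter would call the Mallet trainer here.
    return {"algorithm": "balanced_winnow", "params": params, "n_examples": len(vectors)}

@register("svm")
def train_svm(vectors, params):
    # Placeholder: a real adapter would call libsvm here.
    return {"algorithm": "svm", "params": params, "n_examples": len(vectors)}

def run_trainer(request_line: str, vectors):
    """Parse one train request and dispatch on the algorithm name (third field)."""
    fields = request_line.split()
    algorithm, params = fields[2], fields[3:]
    if algorithm not in TRAINERS:
        raise ValueError(f"no trainer registered for {algorithm!r}")
    return TRAINERS[algorithm](vectors, params)

model = run_trainer("73 923 balanced_winnow 5 1 10", vectors=[[0, 1], [1, 0]])
```

Adding support for another library then means registering one more adapter, with no change to the dispatch code.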

Page 18: Machine Learning with Hadoop

Training on Hadoop: Trick #1

Each model can be generated independently → an easy parallelization problem (aka ‘embarrassingly parallel’)

But, how do we achieve parallelism with Pig?

Plain script:

par = LOAD 'param_file' AS (par1, par2, …);
run = FOREACH par GENERATE RunTrainer(par1, par2, …);
STORE run INTO …;

Script with GROUP … PARALLEL:

par = LOAD 'param_file' AS (par1, par2, …);
grp = GROUP par BY (par1, par2, …) PARALLEL 50;
fltn = FOREACH grp GENERATE group.par1 AS par1, …;
run = FOREACH fltn GENERATE RunTrainer(par1, …);
STORE run INTO …;
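The likely reason the GROUP step matters: a LOAD / FOREACH / STORE script is map-only, and the number of map tasks follows the input splits, so a small parameter file tends to end up in a single task. GROUP forces a reduce phase, and PARALLEL 50 asks for 50 reducers, across which the distinct parameter tuples are spread by hashing the group key. The sketch below is a rough local model of that spreading, not Pig's actual partitioner: the CRC32 hash and the made-up grid are assumptions chosen only to keep the example deterministic.

```python
import zlib
from collections import defaultdict
from itertools import product

def reducer_for(key: tuple, num_reducers: int) -> int:
    """Stable hash of the group key, modulo the reducer count."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers

def partition(param_sets, num_reducers=50):
    """Assign each parameter set to a reducer bucket."""
    buckets = defaultdict(list)
    for params in param_sets:
        buckets[reducer_for(params, num_reducers)].append(params)
    return buckets

# Made-up grid: 3 x 10 x 10 = 300 distinct parameter sets.
param_sets = list(product(("balanced_winnow", "svm", "c45"), range(10), range(10)))
buckets = partition(param_sets, num_reducers=50)
busiest = max(len(bucket) for bucket in buckets.values())
```

Because every parameter tuple is its own group, each reducer receives a share of the tuples and calls the trainer once per tuple. `busiest` shows how uneven the shares are for a given grid.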

Page 19: Machine Learning with Hadoop

Training on Hadoop: Trick #2

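We call ML functions from a UDF. Some functions can take too long to return, and Hadoop will kill the task (and ultimately fail the job) if they do, because no progress is reported in the meantime.

RunTrainer() runs in the Main Thread, alongside a “Pig Heartbeat” Thread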
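The heartbeat idea can be sketched independently of Pig: the long-running call stays on the main thread while a helper thread periodically invokes a progress callback until the call returns. In the sketch below, `report_progress` is a hypothetical stand-in for whatever progress hook the framework exposes to a UDF, and the sleep stands in for slow training.

```python
import threading
import time

def run_with_heartbeat(task, report_progress, interval_seconds=1.0):
    """Run task() on the calling thread while a helper thread reports progress."""
    done = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout and True once the task has finished.
        while not done.wait(interval_seconds):
            report_progress()

    worker = threading.Thread(target=heartbeat, daemon=True)
    worker.start()
    try:
        return task()
    finally:
        done.set()      # stop the heartbeat even if task() raised
        worker.join()

def slow_training():
    time.sleep(0.3)     # stand-in for a long ML library call
    return "model"

beats = []
result = run_with_heartbeat(
    task=slow_training,
    report_progress=lambda: beats.append(time.time()),
    interval_seconds=0.05,
)
```

Using an event with a timed wait, rather than a sleep loop, lets the heartbeat thread stop as soon as training finishes instead of waiting out the current interval.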

Page 20: Machine Learning with Hadoop

As a result, we now see…

We are now able to build tens of thousands of models within an hour and choose the best.

• Previously, the same task took us days.

As we can generate more models more frequently, we become more adaptive to the fast-changing Internet community, catching up with newly-coined terms, etc.

Page 21: Machine Learning with Hadoop

Useful Resources

Mahout: http://mahout.apache.org/
Mallet: http://mallet.cs.umass.edu/
Weka: http://www.cs.waikato.ac.nz/ml/weka/
libsvm: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
OpenNLP: http://incubator.apache.org/opennlp/
Pig: http://pig.apache.org/

Page 22: Machine Learning with Hadoop

THANK YOU!