Hadoop Summit 2010: Machine Learning Using Hadoop

Transcript
Page 1: Hadoop Summit 2010 Machine Learning Using Hadoop

Krishna Prasad Chitrapura

Sr. Scientist, Yahoo! Labs

[email protected]

Machine Learning on Hadoop

Page 2

Outline

• ML 101
  – Basic formulation: ML is not Data Mining
   Generalization and Optimality

• Issues using Hadoop for ML
  – Iterations
  – Sparseness

• Case Study:
  – Learning URL Patterns for Webpage De-duplication, published in WSDM 2010.
  – PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, VLDB 2009.

Page 3

ML 101

• Basic problem:
  – A matrix of data points and features.
  – Each data point is labeled.
  – Learn the labeling function and predict the labels of unseen data points. If the label is numeric the task is regression; otherwise it is classification. (A toy example follows the diagram below.)

[Diagram: an N×M table of N data points (rows) by M features/attributes (columns), with a column of labels]
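As a concrete toy instance of this setup (all values made up for illustration), the N×M table and its labels might look like:

```python
import numpy as np

# N = 4 data points (rows), M = 3 features/attributes (columns).
X = np.array([[88, 76, 43],
              [60, 45, 32],
              [90, 80, 55],
              [55, 40, 30]])

y_class = np.array(["Pass", "Fail", "Pass", "Fail"])  # categorical label -> classification
y_reg   = np.array([3.7, 2.1, 3.9, 1.8])              # numeric label     -> regression
```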

Page 4

Data Mining vs Machine Learning

• Machine learning is about finding an approximation, with generalization guarantees, to the boundary separating the classes.

• Data mining is about describing the data using simple algebra.
  – Hadoop is perfect for data processing and mining.

• An example: students' marks per course, with a Pass/Fail class label.

• A hard problem:
  – Students who fail may not all fail due to the same course.
  – Finding the boundary per course is not easy (lenient courses/evaluation).

Student  Course1  Course2  Course3  Course4  Course5  Course6  Course7  Class
R1          88       76       43       54       90       55       49    Pass
R2          60       45       32       51       80       53       60    Fail
…           …        …        …        …        …        …        …     …

Page 5

How does a typical learning algorithm solve this?

• Intuition 1: Courses in which everyone fails or everyone passes are not of much use here (comments? Let's assume the range is unknown).

• Intuition 2: Courses in which roughly 50% pass and 50% fail? (Good, but they can over-fit if there is a big spread in marks.)

• Overall Intuition: Courses which have high density of labels and good separation are best.

• Optimality:
  – Criteria:
     Separability assumption – convexity guarantee (we don't pass someone who got low marks in a course based on performance in other courses).
     Metric space of features (triangle inequality).
  – An approximation to optimality can be obtained by greedy iteration or hill climbing (see the sketch below).
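A minimal sketch of that greedy intuition in plain Python (the data, course names, and threshold grid are made up for illustration, not the talk's implementation): score every {course, cut-off} pair by information gain and keep the best one.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split(marks, labels, thresholds=range(30, 100, 5)):
    """Greedily pick the (course, threshold) pair with the highest information gain.

    marks  : list of dicts, one per student, e.g. {"Course1": 88, ...}
    labels : list of "Pass"/"Fail" strings aligned with marks
    """
    base = entropy(labels)
    best = (None, None, -1.0)
    for course in marks[0]:
        for t in thresholds:
            hi = [l for m, l in zip(marks, labels) if m[course] >= t]
            lo = [l for m, l in zip(marks, labels) if m[course] < t]
            if not hi or not lo:
                continue  # everyone lands on one side: useless split (Intuition 1)
            gain = (base
                    - len(hi) / len(labels) * entropy(hi)
                    - len(lo) / len(labels) * entropy(lo))
            if gain > best[2]:
                best = (course, t, gain)
    return best

# Toy data in the spirit of the marks table on the previous slide (values illustrative).
marks  = [{"Course1": 88, "Course2": 76}, {"Course1": 60, "Course2": 45},
          {"Course1": 90, "Course2": 80}, {"Course1": 55, "Course2": 40}]
labels = ["Pass", "Fail", "Pass", "Fail"]
print(best_split(marks, labels))  # ('Course1', 65, 1.0) on this toy data
```

This single-box version is exactly what becomes expensive at web scale, which is what the Hadoop formulation later in the talk addresses.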

Page 6

A Typical Tree:

[Figure: a small decision tree, with a root split such as B >= 45 and a child split such as D >= 35]

Page 7

How does ML work – continued?

• An old class of learners: tree induction.
  – [Split] Choose the attribute (subject) that best describes the final class with the least encoding.
     If the {attribute {=,≤,≥} value} test homogeneously describes the outcome, you are done.
     Else, for each {attribute {=,≤,≥} value} group, choose another attribute and iterate from the step above.
  – Intuition: look at the toughest course; whoever got low marks there also fails overall. Amongst those who passed this course, look at which course they failed and split on that (and so on). A recursive sketch follows after this list.
  – When do we stop? What do we mean by homogeneous?
  – What is over-fit? How do we prune?
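A compact sketch of that split-and-iterate loop (plain Python; it reuses the hypothetical best_split helper from the earlier sketch, and the stopping rules here are simplified stand-ins for the pruning questions above):

```python
def majority(labels):
    """Most frequent class label, used for leaf predictions."""
    return max(set(labels), key=labels.count)

def grow_tree(marks, labels, depth=0, max_depth=5, min_size=2):
    """Recursive tree induction: split on the best attribute, then iterate on
    each branch until a node is homogeneous or a crude stopping rule fires."""
    if len(set(labels)) == 1 or len(labels) < min_size or depth >= max_depth:
        return {"leaf": majority(labels)}

    course, t, gain = best_split(marks, labels)  # helper from the earlier sketch
    if course is None or gain <= 0:
        return {"leaf": majority(labels)}

    hi = [(m, l) for m, l in zip(marks, labels) if m[course] >= t]
    lo = [(m, l) for m, l in zip(marks, labels) if m[course] < t]
    return {
        "split": (course, ">=", t),
        "yes": grow_tree([m for m, _ in hi], [l for _, l in hi], depth + 1, max_depth, min_size),
        "no":  grow_tree([m for m, _ in lo], [l for _, l in lo], depth + 1, max_depth, min_size),
    }
```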

Page 8

How would I implement this in Map-Reduce?

• Series of Map-Reduces

• Each stage:
  – Map:
     Collect stats: {attribute {=,≤,≥} value} → {#Class1, #Class2, …}
  – Reducer:
     Choose the best split (e.g., by gain ratio).

• How good is this?
  – Pretty bad (3B data points took well over 100 hours on 100 nodes). Why?
     The map output blows up the space to (N×M) × the number of maps.
  – One quick solution: combiners. (A streaming-style sketch of one such stage follows the formula below.)

∀k ∈ K:  IG(k) = Entropy(C) − ∑_{v ∈ {c(k)}} ( #{c(k) = v} / #c(k) ) · Entropy(C | c(k) = v)
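A hedged, Hadoop Streaming-style sketch of one such stage (Python on stdin/stdout; the attribute names, record format, and file names are assumptions for illustration, not the talk's actual Java implementation). The mapper emits a class count for every candidate {attribute ≥ value} split it sees; because the reduce step only sums counts per key, the same script can also be plugged in as the combiner, which is the "quick solution" mentioned above.

```python
#!/usr/bin/env python3
# split_stats.py -- one stage of the tree builder as a Hadoop Streaming job.
# Input records: "label<TAB>value1<TAB>value2<TAB>..." (schema is illustrative).
import sys

ATTRS = ["Course1", "Course2", "Course3"]   # assumed attribute names

def mapper():
    """Emit a count of 1 for every candidate {attribute >= value} split seen."""
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        label, values = parts[0], parts[1:]
        for attr, v in zip(ATTRS, values):
            print(f"{attr}>={v}\t{label}\t1")   # key, class label, count

def reducer():
    """Sum class counts per candidate predicate.
    Output format equals input format, so this also serves as the combiner."""
    current, counts = None, {}
    def flush():
        for lbl, n in sorted(counts.items()):
            print(f"{current}\t{lbl}\t{n}")
    for line in sys.stdin:
        key, label, n = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                flush()
            current, counts = key, {}
        counts[label] = counts.get(label, 0) + int(n)
    if current is not None:
        flush()

if __name__ == "__main__":
    # e.g. `split_stats.py map` or `split_stats.py reduce`
    mapper() if sys.argv[1] == "map" else reducer()

# Run (illustrative):
#   hadoop jar hadoop-streaming.jar \
#     -mapper "split_stats.py map" -combiner "split_stats.py reduce" \
#     -reducer "split_stats.py reduce" -input marks/ -output stats/ \
#     -file split_stats.py
# A small follow-up pass scores each predicate's class histogram with the
# information-gain formula above (or gain ratio) and keeps the best split.
```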

Page 9

What else is bad?

• Data sparsity on the internet:
  – Any attribute we choose on the internet follows a power law (the layman's 80:20 rule).
  – Lots of attribute values occur only once.

• Why is this bad? (Not a blame game.)
   Hadoop's problem:
    – Too many files.
    – Each file is a map.
    – Empty reducers.
   Our problem:
    – The majority of the splits are useless. (A small simulation below illustrates the long tail.)
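A quick way to see the problem (a made-up simulation, not the talk's data): draw attribute values from a Zipf/power-law distribution and count how many distinct values show up exactly once.

```python
import numpy as np
from collections import Counter

# 1M observations of a single power-law-distributed attribute (parameter illustrative).
values = np.random.zipf(a=1.5, size=1_000_000)
freq = Counter(values)

singletons = sum(1 for n in freq.values() if n == 1)
print(f"{len(freq)} distinct values, {singletons} ({singletons / len(freq):.0%}) seen exactly once")
# Every singleton value is a candidate split with support 1: it still costs a
# shuffle key (and potentially a tiny file or a near-empty reducer), but it
# cannot generalise to unseen data.
```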

Page 10

What tricks did we use?

• Observations:
  – The first split is the hardest (you have to look at all the data).
     In fact, it is difficult to beat the performance of a single box with sampling.
  – Most of the long tail can be grouped together.

• Tricks:
  – Speculation helps:
     Not only Hadoop speculative execution.
     While doing the first split, you can choose the candidates for the next few levels.
  – At each split, group all attribute values that are meaninglessly small (also use Gnu Natural Hash); a sketch follows after this list.
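A hedged sketch of the grouping trick (names and thresholds are assumptions; Python's built-in hash stands in for the "Gnu Natural Hash" mentioned on the slide): before emitting split candidates, any attribute value whose approximate frequency is below a support threshold is folded into one of a few shared "rare" buckets instead of getting its own key.

```python
RARE_BUCKETS = 16    # all long-tail values share this many keys per attribute
MIN_SUPPORT = 50     # values seen fewer times than this are treated as "rare"

def split_key(attr, value, approx_counts):
    """Map-side key for a candidate split, with the long tail grouped.

    approx_counts maps (attr, value) -> rough frequency, e.g. from a cheap
    counting pass or a sample (an assumption for illustration).
    """
    if approx_counts.get((attr, value), 0) >= MIN_SUPPORT:
        return f"{attr}={value}"                   # head value: keeps its own key
    bucket = hash((attr, value)) % RARE_BUCKETS    # tail value: folded together
    return f"{attr}=__RARE_{bucket}"
```

This keeps the number of shuffle keys (and hence small files, maps, and near-empty reducers) bounded while still letting the rare values contribute to the class histograms.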

Page 11

Performance

• Our observations
• Panda et al.

[Chart: Time Taken (s) on the y-axis (0 to 25,000) vs. Depth of the Tree (1 to 10) on the x-axis, comparing Single Node (Sampling), 100 Node (No grouping), 100 Node (Grouping), and 100 Node (Speculation)]

Page 12

To Conclude

• Hadoop is a great tool for data aggregations.

• With careful handling, you can obtain perfect scale-ups.

• Lots of research still needs to go into building ML tools on Hadoop.
  – http://lucene.apache.org/mahout/
  – Main pieces to build:
     Smart ways to carry information across iterations.
     Smart ways to avoid data sparsity.
  – Small things Hadoop can help with:
     Avoid unnecessary small files (maps across a single file).
     Automatic balanced distribution of keys into reducers.