
Sr. Scientist, Yahoo! Labs


• Outline

ML 101: basic formulation; ML is not data mining; generalization and optimality.

Issues using Hadoop for ML: iterations and sparseness.

Case study: Learning URL Patterns for Webpage De-duplication, published in WSDM 2010.

PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, VLDB 2009.

• ML 101

Basic problem: the data is a matrix of data points and features, and each data point is labeled. The task is to learn the labeling function and predict the labels of unseen data points.

If the label is numeric, the task is regression; otherwise it is classification.

[Figure: an N x M table whose rows are the N data points and whose columns are the M features/attributes, with an accompanying column of labels.]

• Data Mining vs Machine Learning

Machine learning is about finding a generalized approximation, with guarantees, to the boundary separating the classes.

Data mining is about describing the data using simple algebra. Hadoop is perfect for data processing and mining.

An example: each student's marks are mapped to a class (Pass/Fail).

A hard problem: students who fail may not all fail because of the same course, and finding the boundary per course is not easy (lenient courses/evaluation).

Student  Course1  Course2  Course3  Course4  Course5  Course6  Course7  Class
R1       88       76       43       54       90       55       49       Pass
R2       60       45       32       51       80       53       60       Fail
..       ..       ..

• How does a typical learning algorithm solve this?

Intuition 1: Courses in which everyone fails or everyone passes are not of much use here. (Comments? Let's assume an unknown range.)

Intuition 2: Courses in which 50% pass and 50% fail? Good, but this can over-fit if there is a big spread in marks.

Overall intuition: Courses which have a high density of labels and good separation are best.

Optimality criteria:

Separability assumption.

Convex guarantee (we don't pass someone who got low marks in a course based on performance in other courses).

Metric space of features (triangle inequality).

An approximation to optimality can be obtained by greedy iterations or hill climbing.
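The intuitions on this slide can be made concrete with a minimal sketch. It assumes a small hypothetical marks table: only rows R1 and R2 come from the example table, while rows R3 and R4, the helper name `best_threshold`, and the use of a misclassification count as the score are illustrative assumptions. For each course it greedily tries every observed mark as a pass threshold and counts how many students the rule "mark >= threshold means Pass" gets wrong.

```python
# Hypothetical marks: R1 and R2 are from the slide, R3 and R4 are made up.
MARKS = {
    "R1": ([88, 76, 43, 54, 90, 55, 49], "Pass"),
    "R2": ([60, 45, 32, 51, 80, 53, 60], "Fail"),
    "R3": ([75, 70, 50, 52, 85, 54, 40], "Pass"),
    "R4": ([55, 40, 30, 53, 82, 56, 65], "Fail"),
}


def best_threshold(course):
    """Return (errors, threshold) for the best rule 'mark >= threshold => Pass'."""
    rows = [(marks[course], label) for marks, label in MARKS.values()]
    best = None
    # Greedy search: every observed mark is a candidate threshold.
    for threshold in sorted({mark for mark, _ in rows}):
        errors = sum((mark >= threshold) != (label == "Pass") for mark, label in rows)
        if best is None or errors < best[0]:
            best = (errors, threshold)
    return best


scores = {f"Course{c + 1}": best_threshold(c) for c in range(7)}
for name, (errors, threshold) in scores.items():
    print(f"{name}: mark >= {threshold}, misclassified {errors} of {len(MARKS)}")
```

On this made-up data some courses separate the classes cleanly while others, where marks are bunched together or low marks go with passing, do not. That is the "good separation" criterion in miniature.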

• A Typical Tree:

[Figure: a decision tree whose internal nodes are threshold tests such as B >= 45 and D >= 35.]

• How does ML work? (continued)

An old class of learners: tree induction.

[Split] Choose the attribute (subject) that best describes the final class with the least encoding.

If {attribute {=, ≤, ≥} value} homogeneously describes the outcome, you are done.

Otherwise, for each {attribute {=, ≤, ≥} value} group, choose another attribute and iterate from the split step.

Intuition: look at the toughest course; whoever got low marks there also fails the exam. Among those who passed this course, look at which course they failed and split on that (and so on).

When do we stop? What do we mean by homogeneous? What is over-fit? How do we prune?
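The split-and-iterate procedure on this slide can be sketched as a short recursive function. This is only an illustration under stated assumptions: "least encoding" is approximated by the weighted entropy of the two groups, only ">=" threshold tests are tried, the stopping rule (pure node or `max_depth`) is one possible answer to "when do we stop?", and there is no pruning. The function names and the five-row data set are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def best_split(rows):
    """Try every 'value >= threshold' test; return (score, attr, threshold) or None."""
    best = None
    for attr in range(len(rows[0][0])):
        for threshold in sorted({features[attr] for features, _ in rows}):
            high = [label for features, label in rows if features[attr] >= threshold]
            low = [label for features, label in rows if features[attr] < threshold]
            if not high or not low:
                continue  # this test does not divide the rows
            score = (len(high) * entropy(high) + len(low) * entropy(low)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, threshold)
    return best


def build_tree(rows, depth=0, max_depth=3):
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is homogeneous or the tree is deep enough.
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows)
    if split is None:
        return majority
    _, attr, threshold = split
    high = [row for row in rows if row[0][attr] >= threshold]
    low = [row for row in rows if row[0][attr] < threshold]
    return {"attr": attr, "threshold": threshold,
            "high": build_tree(high, depth + 1, max_depth),
            "low": build_tree(low, depth + 1, max_depth)}


def predict(tree, features):
    while isinstance(tree, dict):
        tree = tree["high"] if features[tree["attr"]] >= tree["threshold"] else tree["low"]
    return tree


# Hypothetical data: marks in two courses and the final class.
ROWS = [([50, 60], "Pass"), ([30, 70], "Fail"), ([60, 20], "Fail"),
        ([70, 50], "Pass"), ([20, 30], "Fail")]
tree = build_tree(ROWS)
print(tree)
```

On this toy data the first split is on the first course, and the students who clear it are then split on the second course, mirroring the "toughest course first" intuition.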

• How would I implement this in Map-Reduce

A series of Map-Reduce jobs. Each stage:

Map: collect stats as pairs {attribute {=, ≤, ≥} value}, {#Class1, #Class2, ...}.

Reducer: choose the best split (e.g. gain ratio).

How good is this? Pretty bad (3B data took well over 100 hours on 100 nodes).

Why? The map phase blows up the space: (N x M) x number of maps.

One quick solution: combiners.

$$\forall k \in K,\quad IG(k) = \mathrm{Entropy}(C) - \sum_{v \in \{c(k)\}} \frac{\#\{c(k) = v\}}{\#c(k)}\,\mathrm{Entropy}(C \mid c(k) = v)$$
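One stage of this pipeline can be simulated in plain Python to show where the combiner helps. This is a sketch, not the implementation behind the numbers on this slide: the function names, the two in-memory "input splits", and the toy records are all hypothetical, only equality tests on categorical values are handled, and the reducer scores attributes by information gain rather than gain ratio.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (bits) of a Counter of class counts."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values() if n)


def map_record(record, label):
    """Map: emit ((attribute, value), label) once per attribute of the record."""
    for attribute, value in record.items():
        yield (attribute, value), label


def combine(pairs):
    """Combiner: collapse one mapper's output into per-key class counts."""
    counts = defaultdict(Counter)
    for key, label in pairs:
        counts[key][label] += 1
    return counts


def reduce_best_split(partials):
    """Reducer: merge partial counts, then pick the attribute with the highest IG."""
    by_attribute = defaultdict(lambda: defaultdict(Counter))
    for partial in partials:
        for (attribute, value), class_counts in partial.items():
            by_attribute[attribute][value].update(class_counts)
    gains = {}
    for attribute, values in by_attribute.items():
        overall = sum(values.values(), Counter())
        total = sum(overall.values())
        conditional = sum(sum(c.values()) / total * entropy(c) for c in values.values())
        gains[attribute] = entropy(overall) - conditional
    return max(gains, key=gains.get), gains


# Two hypothetical input splits, each handled by its own mapper.
SPLITS = [
    [({"color": "red", "size": "big"}, "C1"), ({"color": "red", "size": "small"}, "C1")],
    [({"color": "blue", "size": "big"}, "C2"), ({"color": "blue", "size": "small"}, "C2")],
]
partials = [combine(pair for record, label in split for pair in map_record(record, label))
            for split in SPLITS]
best, gains = reduce_best_split(partials)
print(best, gains)
```

Without the combiner, every record would send one pair per attribute to the reducer, which is the (N x M) blow-up described on this slide. With it, each mapper sends at most one count vector per distinct key.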

Data sparsity on the Internet: any attribute we choose on the Internet follows a power law (the layman's 80:20 rule). Lots of attribute values occur only once.

Too many files: each file is a map. Empty reducers.

Our problem: the majority of the splits are useless.

• What tricks did we use?

Observations:

The first split is the hardest (you have to look at all the data). In fact, it is difficult to beat the performance of a single box with sampling.

Most of the long tail can be grouped together.

Tricks:

Speculation helps, and not only Hadoop's speculative execution: when doing the first split, you can choose the candidates for the next few levels.

At each split, group all attribute values that are meaninglessly small. (Also use Gnu Natural Hash.)
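The grouping trick can be illustrated with a few lines of Python. The slides do not say how "meaninglessly small" is defined, so the `min_count` cutoff, the bucket name `__OTHER__`, and the sample host names below are all assumptions made for the sake of the sketch.

```python
from collections import Counter


def group_rare_values(values, min_count=2, other="__OTHER__"):
    """Replace every value seen fewer than min_count times with one shared bucket."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other for value in values]


# Hypothetical long-tailed attribute: a few frequent values, many singletons.
hosts = ["a.com", "a.com", "a.com", "b.com", "b.com", "c.com", "d.com", "e.com"]
grouped = group_rare_values(hosts)
print(Counter(grouped))
```

Collapsing the singletons into one bucket cuts the number of distinct keys, and so the number of candidate splits the reducers must score, from five to three in this toy case.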

• Performance

Our observations Panda et al

[Figure: time taken (s), on an axis from 0 to 25000, against depth of the tree (1 to 10), for four configurations: Single Node (Sampling), 100 Node (No grouping), 100 Node (Grouping), and 100 Node (speculation).]

• To Conclude

Hadoop is a great tool for data aggregations. With careful handling, one can obtain perfect scale-ups.

Lots of research still needs to go on to build ML tools on Hadoop: http://lucene.apache.org/mahout/

Main pieces to build: smart ways to carry information across iterations, and smart ways to avoid data sparsity.

Small things Hadoop can help with: avoiding unnecessary small files (maps across a single file), and automatic balanced distribution of keys into reducers.