Hadoop Summit 2010 Machine Learning Using Hadoop

Krishna Prasad Chitrapura Sr. Scientist, Yahoo! Labs [email protected] Machine Learning on Hadoop





  • Krishna Prasad Chitrapura

    Sr. Scientist, Yahoo! Labs

    [email protected]

    Machine Learning on Hadoop

  • Outline

    ML 101: Basic formulation. ML is not data mining. Generalization and optimality.

    Issues using Hadoop for ML: Iterations. Sparseness.

    Case study: Learning URL Patterns for Webpage De-duplication, published in WSDM 2010.

    PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, VLDB 2009.

  • ML 101

    Basic problem: A matrix of data points and features. Each data point is labeled. Learn the labeling function and predict the labels of unseen data points.

    If the label is numeric, the task is regression; otherwise it is classification.

    [Figure: N x M data table with N data points as rows, M features/attributes as columns, and a label per row.]




  • Data Mining vs Machine Learning

    Machine learning is about finding a guaranteed generalized approximation to the boundary separating the classes.

    Data mining is about describing the data using simple algebra. Hadoop is perfect for data processing and mining.

    An example (Student: Marks -> Class (Pass/Fail))

    A hard problem: Not all students who fail do so because of the same course. Finding the boundary per course is not easy (Lenient Courses/


    Student  Course1  Course2  Course3  Course4  Course5  Course6  Course7  Class
    R1       88       76       43       54       90       55       49       Pass
    R2       60       45       32       51       80       53       60       Fail
    ..       ..       ..

  • How does a typical learning algorithm solve this?

    Intuition 1: Courses in which everyone fails or everyone passes are not of much use here. (Comments? Let's assume an unknown range.)

    Intuition 2: Courses in which 50% pass and 50% fail? (Good, but can over-fit if there is a big spread in marks.)

    Overall Intuition: Courses which have high density of labels and good separation are best.

    Optimality:

    Criteria: Separability assumption. Convex guarantee (we don't pass someone who got low marks in a course based on performance in other courses). Metric space of features (triangle inequality).

    An approximation to optimality can be obtained by greedy iterations or hill climbing.

  • A Typical Tree:

    [Figure: decision tree with split nodes B >= 45 and D >= 35.]

  • How does ML work continued?

    An old class of learners: tree induction.

    [Split] Choose the attribute (subject) that best describes the final class with the least encoding.

    If the {attribute {=, ≤, ≥} value} test can homogeneously describe the outcome, you are done.

    Else, for each {attribute {=, ≤, ≥} value} group, choose another attribute and iterate from above.

    Intuition: Look at the toughest course; whoever got low marks here also fails the exam. Amongst those who passed this course, look at which course they failed and split on that (and so on).

    When do we stop? What do we mean by homogeneous? What is over-fit? How do we prune?
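
    The split-and-recurse loop described on this slide can be sketched in a few lines of Python. This is a minimal single-machine sketch, not the speaker's implementation. It assumes numeric marks, tests only of the form attribute >= threshold, plain information gain as the split criterion, and a hypothetical max_depth cutoff as the stopping rule. All function names and the toy marks are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, attribute index, threshold) for the best '>= threshold' test."""
    base = entropy(labels)
    best = (0.0, None, None)
    for attr in range(len(rows[0])):
        for threshold in sorted({row[attr] for row in rows}):
            low = [lab for row, lab in zip(rows, labels) if row[attr] < threshold]
            high = [lab for row, lab in zip(rows, labels) if row[attr] >= threshold]
            if not low or not high:
                continue  # a test that does not separate anything is useless
            remainder = (len(low) * entropy(low) + len(high) * entropy(high)) / len(labels)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, attr, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow the tree greedily; a leaf is the majority label of its group."""
    _, attr, threshold = best_split(rows, labels)
    if attr is None or depth >= max_depth:
        # Homogeneous group (no split has positive gain) or depth cutoff reached.
        return Counter(labels).most_common(1)[0][0]
    low = [(r, l) for r, l in zip(rows, labels) if r[attr] < threshold]
    high = [(r, l) for r, l in zip(rows, labels) if r[attr] >= threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "low": build_tree([r for r, _ in low], [l for _, l in low], depth + 1, max_depth),
        "high": build_tree([r for r, _ in high], [l for _, l in high], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf label."""
    while isinstance(tree, dict):
        tree = tree["high"] if row[tree["attr"]] >= tree["threshold"] else tree["low"]
    return tree

# Toy marks for three courses (invented, not from the talk).
rows = [[88, 76, 43], [60, 45, 32], [70, 50, 60], [55, 40, 30]]
labels = ["Pass", "Fail", "Pass", "Fail"]
tree = build_tree(rows, labels)
print(tree, predict(tree, [90, 10, 10]))
```

    The depth cutoff is only one possible answer to "when do we stop?"; pruning and a minimum group size are left out to keep the sketch short.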

  • How would I implement this in Map-Reduce

    A series of Map-Reduces. Each stage:

    Map: Collect stats {Attribute {=, ≤, ≥} value}, {#Class1, #Class2, ...}

    Reducer: Choose the best split (e.g., Gain Ratio).

    How good is this? Pretty bad (3B data took well over 100 hours on 100 nodes).

    Why? The map blows up the space: (N x M) x number of maps.

    One quick solution: Combiners.

    $\forall k \in K,\quad IG(k) = \mathrm{Entropy}(C) - \sum_{v \in \{c(k)\}} \frac{\#(c(k) = v)}{\#c(k)}\, \mathrm{Entropy}(C \mid c(k) = v)$
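
    One stage of this pipeline can be imitated in plain Python to show what the combiner buys. The sketch below is an assumption-laden illustration, not Hadoop code: each call to the hypothetical map_with_combiner stands for one map task that pre-aggregates its own (attribute, value, class) counts, and reduce_best_split merges those counts and applies the information gain formula. It assumes categorical attributes and that every record carries every attribute. The attribute names and labels are invented.

```python
import math
from collections import Counter, defaultdict

def map_with_combiner(records):
    """One map task: emit (attribute, value, class) counts, summed locally."""
    counts = Counter()
    for features, label in records:
        for attr, value in features.items():
            counts[(attr, value, label)] += 1
    return counts

def entropy(class_counts):
    """Shannon entropy (in bits) of a class -> count mapping."""
    total = sum(class_counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in class_counts.values() if n)

def reduce_best_split(partial_counts):
    """Merge the per-map counts; return (best attribute, gain per attribute)."""
    merged = Counter()
    for counts in partial_counts:
        merged.update(counts)
    by_attr = defaultdict(lambda: defaultdict(Counter))
    for (attr, value, label), n in merged.items():
        by_attr[attr][value][label] += n
    gains = {}
    for attr, values in by_attr.items():
        class_totals = Counter()
        for class_counts in values.values():
            class_totals.update(class_counts)
        total = sum(class_totals.values())
        # Weighted entropy of the class within each attribute value.
        remainder = sum(sum(cc.values()) / total * entropy(cc)
                        for cc in values.values())
        gains[attr] = entropy(class_totals) - remainder
    return max(gains, key=gains.get), gains

# Two input splits with invented URL-like attributes.
split_a = [({"host": "a.com", "ext": "html"}, "dup"),
           ({"host": "b.com", "ext": "html"}, "unique")]
split_b = [({"host": "a.com", "ext": "php"}, "dup"),
           ({"host": "b.com", "ext": "php"}, "unique")]
best, gains = reduce_best_split([map_with_combiner(split_a),
                                 map_with_combiner(split_b)])
print(best, gains)
```

    Without the local Counter, each map task would emit one pair per record per attribute, which is the N x M blow-up; with it, the output per task is bounded by the number of distinct (attribute, value, class) triples it sees.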

  • What else is bad?

    Data sparsity on the Internet: Any attribute we choose on the Internet follows a power law (the layman's 80:20 rule). Lots of attribute values occur only once.

    Why is this bad? (Not a blame game.)

    Hadoop's problem: Too many files; each file is a map. Empty reducers.

    Our problem: The majority of the splits are useless.

  • What tricks did we use?

    Observations:

    The first split is the hardest (you have to look at all the data). In fact, it is difficult to beat the performance of a single box with sampling.

    Most of the long tail can be grouped together.

    Tricks:

    Speculation helps, and not only Hadoop speculative execution: when doing the first split, you can choose the candidates for the next few levels.

    At each split, group all attribute values that are meaninglessly small. (Also use Gnu Natural Hash.)
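
    The grouping trick can be illustrated with a very small sketch. The talk does not say how small "meaninglessly small" is, so the min_count threshold and the "__OTHER__" bucket name below are assumptions, as is the function name.

```python
from collections import Counter

def group_rare_values(values, min_count=2, other="__OTHER__"):
    """Replace attribute values seen fewer than min_count times with one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Invented attribute column with a long tail of one-off values.
hosts = ["a.com", "a.com", "b.com", "c.com", "a.com"]
grouped = group_rare_values(hosts)
print(grouped)
```

    After grouping, the one-off values share a single key, so they produce one candidate split instead of many useless ones.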

  • Performance

    Our observations Panda et al







    [Figure: chart plotted against depth of the tree (1 to 10), with series Single Node (Sampling), 100 Node (No grouping), 100 Node (Grouping), and 100 Node (Speculation).]



  • To Conclude

    Hadoop is a great tool for data aggregations.

    With careful handling, one can obtain perfect scale-ups.

    Lots of research still needs to go on to build ML tools on Hadoop: http://lucene.apache.org/mahout/

    Main pieces to build: Smart ways to carry information across iterations. Smart ways to avoid data sparsity.

    Small things Hadoop can help with: Avoiding unnecessary small files (maps across a single file). Automatic balanced distribution of keys into reducers.