Hadoop and Machine Learning

Machine Learning and Hadoop: Present and Future. Josh Wills, Tom Pierce, and Jeff Hammerbacher, Cloudera Data Science Team. December 17th, 2011

Slides for the talk by the Cloudera Data Science team on the state of machine learning and Hadoop at NIPS 2011.

Transcript of Hadoop and Machine Learning

Page 1: Hadoop and Machine Learning

Machine Learning and Hadoop
Present and Future
Josh Wills, Tom Pierce, and Jeff Hammerbacher
Cloudera Data Science Team
December 17th, 2011

Page 2: Hadoop and Machine Learning

High Availability for Data Scientists

Copyright 2011 Cloudera Inc. All rights reserved

NIPS

Page 3: Hadoop and Machine Learning

Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
    • State of the World
    • Where Things Are Headed
• Part 3: Things Industry Needs From Academia

Page 4: Hadoop and Machine Learning

Industrial Machine Learning

Page 5: Hadoop and Machine Learning

Delta One: Model Evaluation

• ML Systems Are One Piece of a Complex System
    • Well-defined objective functions are the exception
        • Multiple, often conflicting goals
        • Weights are fuzzy and shift with business priorities
        • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
    • Examples
        • Computational advertising
        • Friend recommendations on social networks
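
As one way to picture the "Pareto optimization" bullet, here is a minimal sketch that keeps only the candidate models not dominated on every objective. The model names, the three objectives, and the scores are hypothetical and are not taken from the talk; higher is treated as better on each objective, so latency is negated.

```python
from typing import Dict, List, Tuple


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


# Hypothetical scores: (click-through lift, revenue lift, negated latency in ms).
candidates = {
    "model_a": (0.10, 0.02, -40.0),
    "model_b": (0.08, 0.05, -35.0),
    "model_c": (0.07, 0.01, -60.0),
    "model_d": (0.10, 0.02, -50.0),
}
front = pareto_front(candidates)
print(front)
```

Choosing among the models left on the front is still a business decision, which fits the point that the weights are fuzzy and shift over time.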

Page 6: Hadoop and Machine Learning

Delta Two: Systems Precede Algorithms

• Greenfield Projects Hardly Ever Happen
    • (and don’t usually launch)
• Industrial Computational Infrastructure
    • General-purpose
    • Cheap
    • Shared
• Constraints Drive Innovation
    • Vowpal Wabbit Hashing Trick
    • SETI @ Google
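
The hashing trick named above can be sketched roughly as follows: feature names are hashed straight into a fixed-size index space, so no feature dictionary has to be built or kept in memory. This is an illustration only. The function name `hash_features` and the 18-bit default are assumptions, and `zlib.crc32` is a stand-in hash chosen because it is in the standard library, not the hash Vowpal Wabbit itself uses.

```python
import zlib
from typing import Dict, Iterable


def hash_features(tokens: Iterable[str], num_bits: int = 18) -> Dict[int, float]:
    """Map raw feature names to a sparse vector with at most 2**num_bits slots."""
    mask = (1 << num_bits) - 1
    vector: Dict[int, float] = {}
    for token in tokens:
        # Stand-in hash (assumption); any stable hash of the bytes would do.
        index = zlib.crc32(token.encode("utf-8")) & mask
        # Colliding features simply share a slot and their counts add up.
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector


tokens = ["user=42", "ad=shoes", "hour=13", "ad=shoes"]
hashed = hash_features(tokens, num_bits=18)
print(len(hashed), sum(hashed.values()))
```

The memory footprint is bounded by `num_bits` regardless of how many distinct features appear, at the cost of occasional collisions.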

Page 7: Hadoop and Machine Learning

Delta Three: Workflow

Practice Over Theory Blog

Page 8: Hadoop and Machine Learning

Delta Three: Workflow

• Optimize the Overall Process
    • Model fitting is a small piece of the overall flow time
    • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
    • Common Feature Extraction Logic
    • Servable Models
• Validation as Sanity Checking
    • Deploy to a small subset of real data and evaluate
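
One possible reading of "Common Feature Extraction Logic" is that the offline training job and the online scorer call the same function, so the two paths cannot drift apart. The sketch below assumes a hypothetical event schema (`query`, `device`) and a hypothetical linear scorer; none of these names come from the talk.

```python
from typing import Dict, Iterable, List


def extract_features(event: Dict[str, object]) -> Dict[str, float]:
    """Single source of truth for features, shared by training and serving."""
    query = str(event.get("query", ""))
    return {
        "query_length": float(len(query.split())),
        "is_mobile": 1.0 if event.get("device") == "mobile" else 0.0,
    }


def build_training_rows(events: Iterable[Dict[str, object]]) -> List[Dict[str, float]]:
    """Offline path: featurize a batch of logged events."""
    return [extract_features(event) for event in events]


def score_online(event: Dict[str, object], weights: Dict[str, float]) -> float:
    """Online path: featurize one live event and apply a linear model."""
    features = extract_features(event)
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


event = {"query": "cheap flights to nyc", "device": "mobile"}
rows = build_training_rows([event])
score = score_online(event, {"query_length": 0.5, "is_mobile": 1.0})
print(rows, score)
```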

Page 9: Hadoop and Machine Learning

Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
    • State of the World
    • Where Things Are Headed
• Part 3: Things Industry Needs From Academia

Page 10: Hadoop and Machine Learning

Hadoop: It’s Where The Data Is

Page 11: Hadoop and Machine Learning

Hadoop Platform: Substrate

• Commodity servers
    • Open Compute
• Open source operating system
    • Linux
• Open source configuration management
    • Puppet
    • Chef
• Coordination service
    • ZooKeeper

Page 12: Hadoop and Machine Learning

Hadoop Platform: Storage

• Distributed schema-less storage
    • HDFS
    • Ceph
• Append-only storage formats and metadata
    • Avro
    • RCFile
    • HCatalog
• Mutable key-value storage and metadata
    • HBase

Page 13: Hadoop and Machine Learning

Hadoop Platform: Integration

• Tool Access
    • FUSE
    • JDBC
    • ODBC
• Data Ingestion
    • Flume
    • Sqoop

Page 14: Hadoop and Machine Learning

ML and Hadoop: The State of the World

Page 15: Hadoop and Machine Learning

Computation: Plain Old MapReduce

• Great for:
    • Data Preparation
    • Feature Engineering
    • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
    • Recommendation Systems
    • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
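
To make the "great for feature engineering" claim concrete, here is a small in-memory imitation of the map, shuffle, and reduce phases that computes a per-user click-through-rate feature. It is a sketch of the programming model only and does not use Hadoop; the log schema and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # hypothetical log row: (user_id, clicked as 0 or 1)


def ctr_mapper(record: Record) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Emit (user_id, (clicks, impressions)) for one log row."""
    user_id, clicked = record
    yield user_id, (clicked, 1)


def ctr_reducer(user_id: str, values: List[Tuple[int, int]]) -> float:
    """Combine all values for one user into a click-through rate."""
    clicks = sum(c for c, _ in values)
    impressions = sum(n for _, n in values)
    return clicks / impressions


def run_mapreduce(records: Iterable[Record], mapper: Callable, reducer: Callable) -> Dict[str, float]:
    """Group mapper output by key (the shuffle), then reduce each group."""
    groups: Dict[str, list] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


logs = [("u1", 1), ("u1", 0), ("u2", 0)]
ctr_by_user = run_mapreduce(logs, ctr_mapper, ctr_reducer)
print(ctr_by_user)
```

Each user's aggregate is independent of every other user's, which is why this kind of job parallelizes cleanly, while online learning, which needs frequent small updates to shared state, does not fit the batch model well.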

Page 16: Hadoop and Machine Learning

Tools for Data Preparation/Feature Engineering

• Languages/Environments
    • PigLatin
    • HiveQL
    • Need to deal with mismatch between offline/online feature generation
• Java/Scala APIs
    • Crunch (Cloudera)
    • Scoobi (NICTA)
    • Cascading (Concurrent)
    • Jaql (IBM)

Page 17: Hadoop and Machine Learning

Apache Mahout

• The starting place for MapReduce-based machine learning algorithms
    • Not machine-learning-in-a-box
    • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
    • Recommendations
    • Clustering
    • Classification
    • Frequent Itemset Mining

Page 18: Hadoop and Machine Learning

Apache Mahout (cont.)

• Best Library: Taste Recommender
    • Oldest project, most widely-deployed in production
    • SVD implementation is particularly active
• Good Libraries: Online SGD
    • Does not use MapReduce
    • Vowpal Wabbit + AllReduce is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
    • “Secret sauce” effect
    • Delta between Mahout + the cutting edge in ML

Page 19: Hadoop and Machine Learning

More Machine Learning Interfaces for Hadoop

• Based on MapReduce
    • SystemML (IBM)
    • AllReduce (Vowpal Wabbit)
• No MapReduce
    • Spark
• R-Based Systems (Augment MapReduce with R)
    • Segue
    • RHIPE
    • RHadoop
    • Ricardo (IBM)
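
For readers unfamiliar with the AllReduce pattern listed above, the following single-process sketch simulates it: each worker computes a gradient on its own data shard, the gradients are summed, and every worker receives the same total. This is a simulation under stated assumptions, not Vowpal Wabbit's implementation; the squared-error linear model, the shard contents, and the function names are all hypothetical.

```python
from typing import List, Sequence, Tuple

Example = Tuple[List[float], float]  # (feature vector, target)


def local_gradient(weights: Sequence[float], shard: Sequence[Example]) -> List[float]:
    """Unnormalized squared-error gradient of a linear model on one shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += error * xj
    return grad


def allreduce_sum(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Simulated AllReduce: every worker ends up with the element-wise sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


# Two hypothetical workers, each holding its own shard of the data.
shards = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0)],
]
weights = [0.0, 0.0]
reduced = allreduce_sum([local_gradient(weights, shard) for shard in shards])
print(reduced)
```

Because every worker holds the same summed gradient after the reduction, each can apply the identical update without a central parameter server.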

Page 20: Hadoop and Machine Learning

ML and Hadoop: Where Things are Headed

Page 21: Hadoop and Machine Learning

MRv2 and YARN

• Eliminates JobTracker bottleneck
    • Separate Resource Manager/Scheduler
    • Individual jobs have their own task masters
• Moves MapReduce into user-land
    • Enables Hadoop clusters to run all sorts of jobs
        • MPI (Hamster; MAPREDUCE-2911)
        • Native BSP (Giraph)
        • Spark
        • AllReduce, GraphLab

Page 22: Hadoop and Machine Learning

Agenda

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
    • State of the World
    • Where Things Are Headed
• Part 3: Things Industry Needs From Academia

Page 23: Hadoop and Machine Learning

Machine Learning on Multivariate Time Series

• 1e5 writes/sec
• Positive events are relatively rare
• Feature extraction challenge
    • May not be clear what the right time horizon is
• Tight SLAs
• Very high stakes
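
When the right time horizon is unclear, one common workaround is to compute the same feature over several candidate horizons and let later evaluation pick among them. The sketch below counts events in trailing windows of different lengths; the function name, the horizons, and the timestamps are hypothetical, and the windows are treated as inclusive at both ends.

```python
from bisect import bisect_left, bisect_right
from typing import Dict, Sequence


def trailing_counts(timestamps: Sequence[float], now: float, horizons: Sequence[float]) -> Dict[float, int]:
    """Count events in [now - h, now] for each horizon h.

    Assumes `timestamps` is sorted ascending; events after `now` are ignored.
    """
    upper = bisect_right(timestamps, now)
    return {h: upper - bisect_left(timestamps, now - h) for h in horizons}


# Hypothetical event times in seconds.
event_times = [0.0, 50.0, 95.0, 100.0, 130.0]
counts = trailing_counts(event_times, now=100.0, horizons=(10.0, 60.0, 300.0))
print(counts)
```

Binary search keeps each lookup logarithmic in the number of stored events, which matters at high write rates.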

Page 24: Hadoop and Machine Learning

An Academic Language For Feature Engineering

• Feature extraction/selection is as important as model fitting
    • e.g., hierarchical feature representation, impact on training time and experiment design, feature cost modeling, etc.
• Academic literature on this problem is sparse and dispersed across multiple fields
    • NIPS 2003
    • HCI, NLP, Information Retrieval, etc.
• We need a common language for talking about these problems across disciplines

Page 25: Hadoop and Machine Learning

A Broader Ontology For Model Selection

• Practical factors that enter into the “best” choice of model…
    • Data arrival rate
    • Data volume
    • Scoring latency
    • Model refresh time
    • Robustness/reliability
• …in addition to the standard predictive power/simplicity tradeoffs

Page 26: Hadoop and Machine Learning

Questions?
Want A Job?

@josh_wills