Machine Learning and Hadoop Present and Future Josh Wills Cloudera Data Science Team February 7th, 2012


Today’s Speaker – Josh Wills

[email protected]

• Formerly of Google (2008 – 2011)
  • Worked on the ad auction
  • Led the team that built the data infrastructure for Google+
• Before that: a bunch of startups
  • Sometimes as a software engineer, sometimes as a statistician
• Math degree from Duke and a half-finished PhD from The University of Texas at Austin
• Now: Director of Data Science at Cloudera

Copyright 2012 Cloudera Inc. All rights reserved


High Availability for Data Scientists

NIPS

Outline

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Offline/Online Batch/Real-Time


Industrial Machine Learning


Delta One: Model Evaluation

• Machine Learning is One Piece of a Complex System
  • Well-defined objective functions are the exception
    • Multiple, often conflicting goals
    • Weights are fuzzy and shift with business priorities
    • Pareto optimization is the safest play
• Predictive Accuracy Is Only Useful Up to a Point
  • Examples
    • Computational advertising
    • Friend recommendations on social networks
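As a rough illustration of the Pareto point on this slide: when goals conflict and their weights are fuzzy, one conservative option is to keep only candidate models that no other candidate beats on every metric. The sketch below assumes higher is better for each metric; the model names and metric values are hypothetical and not from the talk.

```python
from typing import List, Tuple

Candidate = Tuple[str, Tuple[float, ...]]


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [
        (name, metrics)
        for name, metrics in candidates
        if not any(dominates(other, metrics) for _, other in candidates)
    ]


# Hypothetical (revenue lift, user satisfaction) pairs, for illustration only.
models = [
    ("baseline", (0.10, 0.50)),
    ("model_a", (0.12, 0.55)),
    ("model_b", (0.15, 0.40)),
    ("model_c", (0.11, 0.52)),
]
front = pareto_front(models)
```

Choosing among the survivors is still a business decision; the filter only removes options that are worse on every goal.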


Delta Two: Systems Precede Algorithms

• Greenfield Projects Hardly Ever Happen
  • (and don’t usually launch)
• Industrial Computational Infrastructure
  • General-purpose
  • Cheap
  • Shared
• Constraints Drive Innovation
  • Vowpal Wabbit Hashing Trick
  • SETI @ Google
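For readers unfamiliar with the hashing trick named on this slide, here is a minimal sketch of the general idea: feature names are hashed straight into a fixed number of buckets, so memory stays bounded and no feature dictionary is needed. This is not Vowpal Wabbit's implementation. `zlib.crc32` stands in for its hash function, and `hash_features`, the bucket count, and the sign-bit choice are assumptions made for illustration.

```python
import zlib
from typing import Dict, Iterable


def hash_features(tokens: Iterable[str], num_buckets: int = 2 ** 18) -> Dict[int, float]:
    """Map string features to a sparse vector of fixed dimensionality."""
    vec: Dict[int, float] = {}
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))  # unsigned 32-bit hash
        idx = h % num_buckets                # bucket index in [0, num_buckets)
        # Use the top bit as a sign so that colliding features tend to cancel
        # rather than always add up (a common variant of the trick).
        sign = 1.0 if (h >> 31) & 1 == 0 else -1.0
        vec[idx] = vec.get(idx, 0.0) + sign
    return vec


example = hash_features(["query=cheap", "query=flights", "device=mobile"], num_buckets=16)
```

The constraint (a fixed memory budget) is what drives the design: collisions are accepted as noise in exchange for a model whose size does not depend on the vocabulary.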


Delta Three: Workflow

Practice Over Theory Blog

Delta Three: Workflow

• Optimize the Overall Process
  • Model fitting is a small piece of the overall flow time
  • Parallelize everything
• Better Features > Better Models
• Fast Model Deployment
  • Common Feature Extraction Logic
  • Servable Models
• Validation as Sanity Checking
  • Deploy to a small subset of real data and evaluate
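One way to read "Common Feature Extraction Logic" is that the offline training job and the online serving path call the same function, so the two cannot drift apart. The sketch below is an assumption about what that might look like; the event fields (`query`, `device`, `label`), the function names, and the logistic scoring are all hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple

Event = Dict[str, object]


def extract_features(event: Event) -> Dict[str, float]:
    """Single source of truth for features, shared by training and serving."""
    words = str(event.get("query", "")).lower().split()
    return {
        "bias": 1.0,
        "query_len": float(len(words)),
        "is_mobile": 1.0 if event.get("device") == "mobile" else 0.0,
    }


def batch_training_rows(events: Iterable[Event]) -> List[Tuple[Dict[str, float], object]]:
    """Offline path: turn logged events into (features, label) rows."""
    return [(extract_features(e), e["label"]) for e in events]


def serve(weights: Dict[str, float], event: Event) -> float:
    """Online path: score one live event with a logistic model."""
    feats = extract_features(event)
    z = sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Because both paths go through `extract_features`, a feature change ships to training and serving together.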


Outline

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Offline/Online Batch/Real-Time


“Hadoop. It’s Where The Data Is.”


Hadoop Platform: Substrate

• Commodity servers
  • Open Compute
• Open source operating system
  • Linux
• Open source configuration management
  • Puppet
  • Chef
• Coordination service
  • ZooKeeper


Hadoop Platform: Storage

• Distributed schema-less storage
  • HDFS
  • Ceph
• Append-only storage formats and metadata
  • Avro
  • RCFile
  • HCatalog
• Mutable key-value storage and metadata
  • HBase


Hadoop Platform: Integration

• Tool Access
  • FUSE
  • JDBC
  • ODBC
• Data Ingestion
  • Flume
  • Sqoop


ML and Hadoop: The State of the World


Computation: Plain Old MapReduce

• Great for:
  • Data Preparation
  • Feature Engineering
  • Model Validation/Evaluation
• Works For Certain Model Fitting Problems
  • Recommendation Systems
  • Expectation Maximization
  • Decision Trees (PLANET; Gradient Boosted Decision Trees)
• Not A Practical Option for Online Learning
• Way More Detail from the KDD 2011 Talk
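To make the "great for feature engineering" claim concrete, here is a toy, single-process imitation of the map, shuffle, and reduce phases that computes a per-user click-through rate from log records. It does not use Hadoop's API; `run_mapreduce`, the record layout, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # (user_id, clicked), hypothetical log layout


def map_fn(record: Record) -> Iterator[Tuple[str, Tuple[int, int]]]:
    """Emit (user, (impressions, clicks)) for each log record."""
    user, clicked = record
    yield user, (1, clicked)


def reduce_fn(key: str, values: List[Tuple[int, int]]) -> Tuple[str, float]:
    """Aggregate all values for one user into a click-through rate."""
    impressions = sum(v[0] for v in values)
    clicks = sum(v[1] for v in values)
    return key, clicks / impressions


def run_mapreduce(records: Iterable[Record], mapper: Callable, reducer: Callable) -> Dict[str, float]:
    """Local stand-in for the framework: map, group by key, then reduce."""
    groups: Dict[str, list] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle" step
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


logs = [("u1", 1), ("u1", 0), ("u2", 1), ("u1", 1)]
ctr = run_mapreduce(logs, map_fn, reduce_fn)
```

Each record is handled independently and each key is reduced once, which is why this style fits data preparation well and fits online learning poorly: a model that updates after every example has no natural single pass of this shape.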


Tools for Data Preparation/Feature Engineering

• Languages/Environments
  • PigLatin
  • HiveQL
  • Need to deal with mismatch between offline/online feature generation
• Java/Scala APIs
  • Crunch (Cloudera)
  • Scoobi (NICTA)
  • Cascading (Concurrent)
  • Jaql (IBM)


Apache Mahout

• The starting place for MapReduce-based machine learning algorithms
  • Not machine-learning-in-a-box
  • Custom tweaks/modifications are the rule
• A disparate collection of algorithms for:
  • Recommendations
  • Clustering
  • Classification
  • Frequent Itemset Mining


Apache Mahout (cont.)

• Best Library: Taste Recommender
  • Oldest project, most widely-deployed in production
  • SVD implementation is particularly active
• Good Libraries: Online SGD
  • Does not use MapReduce
  • Vowpal Wabbit is faster, has L-BFGS option
• Roll Your Own Instead: Naïve Bayes
• Challenges
  • “Secret sauce” effect
  • Delta between Mahout and the cutting edge in ML


More Machine Learning Interfaces for Hadoop

• Based on MapReduce
  • SystemML (IBM)
• R-Based Systems (Augment MapReduce with R)
  • Segue
  • RHIPE
  • RHadoop
  • Ricardo (IBM)


ML and Hadoop: Where Things are Headed


MRv2 and YARN

• Eliminates JobTracker bottleneck
  • Separate Resource Manager/Scheduler
  • Individual jobs have their own task masters
  • No more map slots and reduce slots
• Moves MapReduce into user-land
  • Hadoop clusters can run all sorts of jobs
• Will also allow fine-grained resource allocation
  • CPU
  • Memory
  • Disk


YARN Job Flows


The Contenders


AllReduce

• Developed at Yahoo! Research
• Defines the allreduce operation
  • N machines each have a number => each machine has the sum of the numbers
• At the heart of Vowpal Wabbit’s performance
• Implemented in C++
• Can be patched into Apache Hadoop and used today
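The allreduce contract on this slide (every machine contributes a number, every machine ends up with the sum) can be simulated in one process. The sketch below assumes a binary-tree layout in which partial sums flow up to a root and the total is then broadcast back down. That layout is one common way to implement allreduce and is an assumption here, not a description of the actual C++ code.

```python
from typing import List


def allreduce_sum(values: List[float]) -> List[float]:
    """Simulate allreduce: values[i] is node i's input, result[i] is what node i ends with."""
    n = len(values)
    if n == 0:
        return []
    partial = list(values)
    # Reduce phase: node i sends its partial sum to its parent (i - 1) // 2.
    # Walking from the highest index down guarantees that a node has already
    # absorbed its own children before it reports to its parent.
    for i in range(n - 1, 0, -1):
        partial[(i - 1) // 2] += partial[i]
    total = partial[0]
    # Broadcast phase: the root's total is pushed back down to every node.
    return [total] * n


per_node_gradients = [1.0, 2.0, 3.0, 4.0]
after = allreduce_sum(per_node_gradients)
```

In a learning setting each node's "number" would typically be a gradient or weight vector computed on its shard of the data, so after the call every node can take the same update step without a separate reduce job.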


Spark

• Developed at Berkeley’s AMP Lab
• Defines operations on distributed in-memory collections
• Written in Scala
• Supports reading from and writing to HDFS


GraphLab

• Developed at CMU
• Lower-level primitives
  • (but higher than MPI)
• Map/Reduce => Update/Sync
• Flexible, allows for asynchronous computations*
• C++/Java/Python/Matlab


Outline

• Part 1: Industrial Machine Learning
• Part 2: Machine Learning and Hadoop
  • State of the World
  • Where Things Are Headed
• Part 3: Offline/Online Batch/Real-Time


Offline vs. Online Learning


Batch vs. Real-Time: The CAP Theorem

• Impossible for a distributed computer system to simultaneously provide:
  • Consistency
  • Availability
  • Partition Tolerance
• Instead, we end up with BASE
  • Basically Available, Soft State, Eventual consistency
  • High availability
  • Cleanup mechanism for providing consistency (eventually)


Nathan Marz: Beating the CAP Theorem


Models as Queries


Collapsing Distinctions


Systems Drive Algorithms, Redux


Questions? Want A [email protected]