Machine Learning and Hadoop

September 2011 – HUG – Atlanta, GA Machine Learning With Hadoop Josh Patterson | Sr Solution Architect

Transcript of Machine Learning and Hadoop

Page 1: Machine Learning and Hadoop

September 2011 – HUG – Atlanta, GA

Machine Learning With Hadoop

Josh Patterson | Sr Solution Architect

Page 2: Machine Learning and Hadoop

Who is Josh Patterson?

[email protected]

• Master’s Thesis: self-organizing mesh networks
  – Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
• Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
  – Led team which designed classification techniques for time series and Map Reduce
• Open source work at
  – http://openpdc.codeplex.com
  – https://github.com/jpatanooga
• Today
  – Sr. Solutions Architect at Cloudera

Page 3: Machine Learning and Hadoop

Outline

• Hadoop Today
• Data Mining
• Mahout and Friends
• A Peek at the Road Ahead


Page 4: Machine Learning and Hadoop

Hadoop Today: The Oil Industry Circa 1900

“After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.”

--- Excerpt from the book “The American Gas Station”


Page 5: Machine Learning and Hadoop

DNA Sequencing Trends

• Cost of DNA Sequencing Falling Very Fast


Page 6: Machine Learning and Hadoop

Unstructured Data Explosion

• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

[Figure labels: Relational; Complex, Unstructured]

Page 7: Machine Learning and Hadoop

Obstacles to Leveraging Data

Copyright 2010 Cloudera Inc. All rights reserved

• Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)
  • Sometimes makes the data unwieldy
• Customers are not creating schemas for all of their data
  • Yet still may want to join data sets
• Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
  • They are throwing data away because it’s too expensive to hold
  • Similar to the oil industry in 1900

Page 8: Machine Learning and Hadoop

A New Platform for an Evolving Landscape

• Ability to look at true distribution of data
  – Previously impossible due to scale
• Lower cost of analysis
  – Ad hoc analysis now more open and flexible
• Speed @ Scale is the new Killer App
  – Results that previously took 1 day to process can gain new value when created in 10 minutes
• Greater Flexibility
  – Less restrictive than SQL-only systems


Page 9: Machine Learning and Hadoop

Data Mining


“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”

--- Peter Norvig, “Artificial Intelligence: A Modern Approach”

Page 10: Machine Learning and Hadoop

Basic Concepts

• What is Data Mining?
  – “the process of extracting patterns from data”
• Why are we interested in Data Mining?
  – Raw data essentially useless
    • Data is simply recorded facts
    • Information is the patterns underlying the data
    • We want to learn these patterns
  – Information is key

Page 11: Machine Learning and Hadoop

How does Machine Learning differ from Data Mining?

• Data Mining
  – Extracting information from data
  – Finds patterns in data
• Machine Learning
  – Algorithms for acquiring structural descriptions from data “examples”
    • Process of learning “concepts”
  – “structural descriptions” represent patterns explicitly

Page 12: Machine Learning and Hadoop

Shades of Gray

• Information Retrieval
  – information science, information architecture, cognitive psychology, linguistics, and statistics
• Natural Language Processing
  – grounded in machine learning, especially statistical machine learning
• Statistics
  – Math and stuff
• Machine Learning
  – Considered a branch of artificial intelligence

Page 13: Machine Learning and Hadoop

Types of Machine Learning

• Classification
• Association
• Clustering
• Numeric Prediction
  – AKA: “Regression”

Page 14: Machine Learning and Hadoop

Tools, Applications, and Mahout


Page 15: Machine Learning and Hadoop

ML Focused on in Mahout

• Classification
  – Naïve Bayes in Text Classification
  – Stochastic Gradient Descent (Logistic Regression)
  – Random Forests
• Recommendation
  – Collaborative Filtering, Taste Engine
    • Item to item
• Clustering
  – K-means, Fuzzy K-means
  – (Latent) Dirichlet Process

Page 16: Machine Learning and Hadoop

Naïve Bayes and Text

• Doc classification is an important domain in Machine Learning
• Docs are characterized by the words that appear in them
  – One approach is to treat presence / absence of each word as a boolean attribute
  – Naïve Bayes is popular here, fast, accurate
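A minimal sketch of the presence / absence idea, assuming a Bernoulli-style Naïve Bayes with add-one smoothing. The function names `train` and `classify` and the toy documents are hypothetical and are not Mahout's API; Mahout's own implementation may differ in its details.

```python
import math
from collections import defaultdict

def train(docs):
    """docs: list of (set_of_words, label) pairs."""
    vocab = set().union(*(words for words, _ in docs))
    label_counts = defaultdict(int)
    word_counts = defaultdict(lambda: defaultdict(int))
    for words, label in docs:
        label_counts[label] += 1
        for w in words:
            word_counts[label][w] += 1  # number of docs of this label containing w
    return vocab, label_counts, word_counts

def classify(model, words):
    vocab, label_counts, word_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for w in vocab:
            # P(word present | label) with add-one (Laplace) smoothing
            p = (word_counts[label][w] + 1) / (n + 2)
            # Each vocabulary word is a boolean attribute: present or absent
            score += math.log(p if w in words else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [
    ({"cheap", "pills"}, "spam"),
    ({"cheap", "offer"}, "spam"),
    ({"meeting", "notes"}, "ham"),
    ({"project", "meeting"}, "ham"),
]
model = train(docs)
```

Calling `classify(model, {"cheap", "offer"})` scores each label by combining the prior with one term per vocabulary word, whether the word is present or absent.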

Page 17: Machine Learning and Hadoop

What Are Recommenders?

• An algorithm that looks at a user’s past actions and suggests
  – Products
  – Services
  – People

Page 18: Machine Learning and Hadoop

Collaborative Filtering

• Collaborative filtering produces recommendations based on
  – user preferences for items
  – “User Based”
  – does not require knowledge of the specific properties of the items
• In contrast,
  – content-based recommendation produces recommendations based on intimate knowledge of the properties of items
  – “Item based”
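A rough sketch of the collaborative filtering idea above: recommendations come only from user preferences, with no knowledge of item properties. The names `cosine` and `recommend`, the rating scale, and the toy data are assumptions for illustration, not the Taste engine's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {item: rating} dicts."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    scores, weights = {}, {}
    for other, prefs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], prefs)
        if sim <= 0:
            continue
        for item, rating in prefs.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:top_n]

ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 5},
    "cat": {"A": 1, "D": 2},
}
```

Here `recommend(ratings, "ann")` ranks item C ahead of item D, because the only rating of C is higher than the only rating of D.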

Page 19: Machine Learning and Hadoop

Clustering: Topic Modeling

• Cluster words across docs to identify topics

• Latent Dirichlet Allocation

Page 20: Machine Learning and Hadoop

What is time series data?

• Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals
• Examples in finance
  – daily adjusted close price of a stock at the NYSE
• Example in Sensors / Signal Processing / Smart Grid
  – sensor readings on a power grid occurring 30 times a second
• For more reference on time series data
  – http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
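The blog post linked above covers a simple moving average in MapReduce. As a single-machine sketch of the same calculation on a uniformly spaced series, assuming a hypothetical helper `simple_moving_average` and made-up prices:

```python
from collections import deque

def simple_moving_average(values, window):
    """Return the average of each full window of consecutive points."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]   # oldest point is about to be evicted
        buf.append(v)
        total += v
        if len(buf) == window:
            out.append(total / window)
    return out

closes = [10, 11, 12, 13, 14]  # e.g. daily closing prices (made-up numbers)
sma = simple_moving_average(closes, 3)
```

In the MapReduce setting the hard part is getting each key's points to the reducer in time order, which is what the secondary sort in the linked post addresses.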

Page 21: Machine Learning and Hadoop

NERC Sensor Data Collection

openPDC PMU Data Collection circa 2009

• 120 Sensors
• 30 samples/second
• 4.3B Samples/day
• Housed in Hadoop

Page 22: Machine Learning and Hadoop

Story Time: Keogh, SAX, and the openPDC

• NERC wanted high res smart grid data tracked
  – Started openPDC project @ TVA
    • http://openpdc.codeplex.com/
  – We used Hadoop to store and process time series data
    • https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/
• Needed to find “unbounded oscillations”
  – Time series unwieldy to work with at scale
• We found “SAX” by Keogh and his folks for dealing with time series
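For readers new to SAX (Symbolic Aggregate approXimation), here is a minimal sketch of the usual steps: z-normalize the series, average it into equal-size segments, then map each segment mean to a letter using breakpoints that split a standard normal distribution into equal-probability regions. The function name `sax` and its parameters are hypothetical, and the sketch assumes the series length divides evenly by the segment count. It is not the openPDC or Lumberyard code.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, segments, alphabet="abcd"):
    """Convert a numeric series into a short SAX word."""
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]  # z-normalize

    # Piecewise Aggregate Approximation: mean of each equal-size segment
    size = len(z) // segments
    paa = [mean(z[i * size:(i + 1) * size]) for i in range(segments)]

    # Breakpoints cut N(0, 1) into len(alphabet) equal-probability regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]

    # A segment's symbol index is the number of breakpoints below its mean
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

word = sax([1, 1, 1, 1, 10, 10, 10, 10], segments=2)
```

The payoff is that a long, unwieldy numeric series becomes a short string that can be indexed and compared cheaply.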

Copyright 2011 Cloudera Inc. All rights reserved

Page 23: Machine Learning and Hadoop

What is Lumberyard?

• Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage
• It’s interesting for
  – Indexing large amounts of time series data
  – Low latency fuzzy pattern matching queries on time series data
• Lumberyard is open source and ASF 2.0 Licensed at Github:
  – https://github.com/jpatanooga/Lumberyard/


Page 24: Machine Learning and Hadoop

Genome Data as Time Series

• A, C, G, and T
  – Could be thought of as “1, 2, 3, and 4”!
• If we have sequence X, what is the “closest” subsequence in a genome that is most like it?
  – Doesn’t have to be an exact match!
  – Example:
    • ATATAT
    • TATATA
• Useful in proteomics as well
• iSAX Indexing
  – Lumberyard use case
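A brute-force sketch of the "closest subsequence" question, using the 1, 2, 3, 4 encoding from the slide and Euclidean distance over a sliding window. This is only an illustration of the query and does not use iSAX; an index like Lumberyard exists precisely to avoid scanning every window. The names `encode` and `closest_subsequence` and the toy genome are hypothetical.

```python
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(seq):
    """Turn a base string into a numeric series."""
    return [ENCODING[base] for base in seq]

def closest_subsequence(genome, query):
    """Return (position, subsequence, distance) of the best-matching window."""
    g, q = encode(genome), encode(query)
    best_pos, best_dist = -1, float("inf")
    for start in range(len(g) - len(q) + 1):
        window = g[start:start + len(q)]
        dist = sum((a - b) ** 2 for a, b in zip(window, q)) ** 0.5
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos, genome[best_pos:best_pos + len(query)], best_dist

# No exact copy of the query exists here; the nearest window differs by one base
pos, match, dist = closest_subsequence("CCATGTATCC", "ATATAT")
```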


Page 25: Machine Learning and Hadoop

Bioinformatics

• Applications in DNA Sequencing
• Shortest Superstring Problem (SSP)
  – Take lots of reads from sequencing
  – We want the “superstring” of all the reads
    • We want a long string that “explains” all the reads we generated
    • We want the shortest string possible
  – NP-complete
• We can reduce SSP to the Traveling Salesman Problem
  – Graph processing / algorithms now applicable
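Because the problem is intractable in general, practical approaches settle for approximations. The sketch below uses a common greedy heuristic: repeatedly merge the two reads with the largest suffix/prefix overlap. It is not guaranteed to find the shortest superstring, and it is not the TSP reduction mentioned above. The function names and the toy reads are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Greedy approximation: merge the best-overlapping pair until one string is left."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are already contained in another read
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""

reads = ["ATTAGA", "TAGACC", "GACCTG"]
superstring = greedy_superstring(reads)
```

Treating each read as a graph node and each overlap as an edge weight is what opens the door to the graph algorithms the slide mentions.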


Page 26: Machine Learning and Hadoop

Packages For Hadoop

• DataFu
  – http://sna-projects.com/datafu/
  – UDFs in Pig
  – used at LinkedIn in many off-line workflows for data derived products
    • “People You May Know”
    • “Skills”
  – Techniques
    • PageRank
    • Quantiles (median), variance, etc.
    • Sessionization
    • Convenience bag functions
    • Convenience utility functions
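To make "sessionization" concrete: it groups a user's events into sessions, starting a new session whenever the gap since the previous event exceeds a timeout. The sketch below is plain Python for illustration only. `sessionize` and the 30-minute default are assumptions, not DataFu's UDF or its signature.

```python
def sessionize(timestamps, gap_seconds=1800):
    """Split event timestamps (in seconds) into sessions separated by long gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_seconds:
            sessions[-1].append(ts)   # close enough to the previous event
        else:
            sessions.append([ts])     # gap too long: start a new session
    return sessions

clicks = [0, 60, 120, 4000, 4100, 10000]
sessions = sessionize(clicks)
```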


Page 27: Machine Learning and Hadoop

Integration with Libs

• Mix MapReduce with Machine Learning Libs
  – WEKA
  – KXEN
  – CPLEX
• Map side “groups data”
• Reduce side processes groups of data with Lib in parallel
  – Involves tricks in getting K/V pairs into lib
  – Pipes, tmp files, task cache dir, etc
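The pattern can be sketched as a local simulation: the map side emits a grouping key, the shuffle collects values per key, and each reduce call hands one whole group to a library. Everything here is an assumption for illustration. Python's `statistics` module stands in for an external library such as WEKA, and `map_phase`, `shuffle`, `reduce_phase`, and `run` are hypothetical names, not Hadoop APIs.

```python
from collections import defaultdict
from statistics import mean, pstdev

def map_phase(records):
    """Map side: emit (grouping key, value) pairs."""
    for sensor_id, value in records:
        yield sensor_id, value

def shuffle(pairs):
    """Stand-in for Hadoop's shuffle: collect all values for each key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce side: pass one complete group to the library (here: statistics)."""
    return key, {"mean": mean(values), "stdev": pstdev(values)}

def run(records):
    groups = shuffle(map_phase(records))
    # On a cluster, each key's reduce call could run on a different node
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

result = run([("a", 1.0), ("b", 10.0), ("a", 3.0)])
```

On a real cluster the awkward part is the hand-off inside the reducer, which is where the pipes, tmp files, and task cache dir tricks come in.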


Page 28: Machine Learning and Hadoop

What Hadoop Is Not Good At in Data Mining

• Anything highly iterative
• Anything that is extremely CPU bound and not disk bound
• Algorithms that aren’t inherently parallelizable
  – Examples
    • Stochastic Gradient Descent (SGD)
    • Support Vector Machines (SVM)
  – Doesn’t mean they aren’t great to use
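A small sketch of why SGD is an awkward fit: every update reads the weights written by the previous update, so the inner loop is a chain of dependent steps rather than independent work per record. This is a plain single-machine logistic regression with hypothetical names (`sgd_logistic`, `predict`) and toy data, not Mahout's SGD implementation.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under the logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(data, epochs=200, lr=0.1):
    """data: list of (feature_list, label) with label in {0, 1}."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):          # many passes over the data: highly iterative
        for x, y in data:
            err = predict(w, b, x) - y
            # This update depends on the w and b left by the previous example
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

data = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
w, b = sgd_logistic(data)
```

In classic MapReduce, each of those passes would be a separate job with its own startup and disk I/O cost.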

Page 29: Machine Learning and Hadoop


MRv2: A Peek at the Road Ahead

Page 30: Machine Learning and Hadoop

MRv2

• Not everything fits great in MapReduce
  – Mahout as evidence of this
  – Examples
    • Stochastic Gradient Descent (SGD)
    • Support Vector Machines (SVM)
• As we build further into verticals our analysis needs will become more complicated
  – MRv2 gives us new options
• CDH4 will be based on 0.23.x (or later)
  – 0.23.0 doesn't include MRv1
  – (via Tom White) CDH4 will *only* include MRv2


Page 31: Machine Learning and Hadoop


Existing Parallel Frameworks

• MapReduce
  – Java, Pig, Hive
• Spark
  – Scala, hides complexity like Hive/Pig
  – Runs on Hadoop, MRv2 already
• Giraph
  – Bulk-synchronous parallel model
  – relative to graphs where vertices can send messages to other vertices during a given superstep
• MPI
  – Older parallel lib
  – Includes primitives for data exchange, synchronization
  – Standardized and portable
• GraphLab
  – “graph parallel” vs MR’s “data parallel”
  – Better at iterative style

Page 32: Machine Learning and Hadoop

Frameworks Currently in Dev – MRv2

• Giraph
  – https://issues.apache.org/jira/browse/GIRAPH-13
• Hama BSP plans to integrate with MRv2
  – https://issues.apache.org/jira/browse/HAMA-431
• MPI
  – https://issues.apache.org/jira/browse/MAPREDUCE-2911
• Spark
  – https://github.com/mesos/spark-yarn
• GraphLab
  – Discussion in user-mahout


Page 33: Machine Learning and Hadoop


The Rise of the Meta Heuristic?

• We’re seeing a data deluge drive demand for new data products
  – MapReduce applications are still relatively new
• Customers have gotten a taste of data products with Hadoop
  – They like it
  – They want more
• MRv2 has the potential to open up a range of meta heuristics to the Hadoop sector
  – Techniques like genetic algorithms that were previously considered “boutique”

Page 34: Machine Learning and Hadoop


The Shape of Things to Come

Pig, Hive, Scala, Java

HDFS for large streaming files

HBase for small low latency transactions

MRv2

Compiler to build workflows of { Data, Algorithm, Framework }

Algorithm Library: Mahout, SGD, SVM, NeuralNetworks

Framework Library: MPI, Spark, GraphLab, MapReduce

Page 35: Machine Learning and Hadoop

Questions? (Thanks!)

• Hadoop World 2011
  – You should go
  – Talks are high quality
  – Lots more Machine Learning talks
• Developer class 10/10/2011
  – http://www.eventbrite.com/event/1951335497
  – 10% discount with code atlhug
