Machine Learning and Hadoop

September 2011 – HUG – Atlanta, GA

Machine Learning With Hadoop
Josh Patterson | Sr Solution Architect

Who is Josh Patterson?

• [email protected]
• Master’s Thesis: self-organizing mesh networks
  – Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
• Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
  – Led team which designed classification techniques for time series and Map Reduce
• Open source work at
  – http://openpdc.codeplex.com
  – https://github.com/jpatanooga
• Today
  – Sr. Solutions Architect at Cloudera


Outline

• Hadoop Today
• Data Mining
• Mahout and Friends
• A Peek at the Road Ahead


Hadoop Today: The Oil Industry Circa 1900

“After the refining process, one barrel of crude oil yielded more than 40% gasoline and only 3% kerosene, creating large quantities of waste gasoline for disposal.”

--- Excerpt from the book “The American Gas Station”


DNA Sequencing Trends

• Cost of DNA Sequencing Falling Very Fast


Unstructured Data Explosion


• 2,500 exabytes of new information in 2012 with Internet as primary driver
• Digital universe grew by 62% last year to 800K petabytes and will grow to 1.2 “zettabytes” this year

[Chart labels: Relational; Complex, Unstructured]

Obstacles to Leveraging Data

Copyright 2010 Cloudera Inc. All rights reserved

• Data comes in many shapes and sizes: relational tuples, log files, semistructured textual data (e.g., e-mail)
• Sometimes makes the data unwieldy
• Customers are not creating schemas for all of their data
• Yet still may want to join data sets
• Customers are moving some of it to tape or cold storage, throwing it away because “it doesn’t fit”
• They are throwing data away because it’s too expensive to hold
• Similar to the oil industry in 1900

A New Platform for an Evolving Landscape

• Ability to look at true distribution of data
  – Previously impossible due to scale
• Lower cost of analysis
  – Ad Hoc analysis now more open and flexible
• Speed @ Scale is the new Killer App
  – Results that previously took 1 day to process can gain new value when created in 10 minutes.
• Greater Flexibility
  – Less restrictive than SQL-only systems


Data Mining


“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”

--- Peter Norvig, “Artificial Intelligence: A Modern Approach”

Basic Concepts

• What is Data Mining?
  – “the process of extracting patterns from data”
• Why are we interested in Data Mining?
  – Raw data essentially useless
    • Data is simply recorded facts
    • Information is the patterns underlying the data
    • We want to learn these patterns
  – Information is key

How does Machine Learning differ from Data Mining?

• Data Mining
  – Extracting information from data
  – Finds patterns in data
• Machine Learning
  – Algorithms for acquiring structural descriptions from data “examples”
    • Process of learning “concepts”
  – “structural descriptions” represent patterns explicitly

Shades of Gray

• Information Retrieval
  – information science, information architecture, cognitive psychology, linguistics, and statistics.
• Natural Language Processing
  – grounded in machine learning, especially statistical machine learning
• Statistics
  – Math and stuff
• Machine Learning
  – Considered a branch of artificial intelligence

Types of Machine Learning

• Classification
• Association
• Clustering
• Numeric Prediction
  – AKA: “Regression”

Tools, Applications, and Mahout


ML Focused on in Mahout

• Classification
  – Naïve Bayes in Text Classification
  – Stochastic Gradient Descent (Logistic Regression)
  – Random Forests
• Recommendation
  – Collaborative Filtering, Taste Engine
    • Item to item
• Clustering
  – K-means, Fuzzy K-means
  – (Latent) Dirichlet Process

Naïve Bayes and Text

• Doc classification is an important domain in Machine Learning

• Docs are characterized by the words that appear in them
  – One approach is to treat presence / absence of each word as a boolean attribute
  – Naïve Bayes is popular here, fast, accurate
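The presence / absence idea can be sketched in a few lines of plain Python. This is a minimal Bernoulli-style Naïve Bayes under stated assumptions: the toy documents, the labels, and the function names `train_bernoulli_nb` and `classify` are made up for illustration and are not Mahout's API.

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs, labels):
    """Fit a Bernoulli Naive Bayes model on tokenized documents."""
    vocab = sorted({w for d in docs for w in d})
    n_docs = defaultdict(int)                       # label -> number of documents
    n_with = defaultdict(lambda: defaultdict(int))  # label -> word -> docs containing word
    for tokens, label in zip(docs, labels):
        n_docs[label] += 1
        for w in set(tokens):                       # presence only, counts are ignored
            n_with[label][w] += 1
    total = len(docs)
    model = {"vocab": vocab, "prior": {}, "p_word": {}}
    for label, n in n_docs.items():
        model["prior"][label] = n / total
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        model["p_word"][label] = {w: (n_with[label][w] + 1) / (n + 2) for w in vocab}
    return model

def classify(model, tokens):
    """Return the label with the highest log-posterior for the document."""
    present = set(tokens)
    best_label, best_score = None, -math.inf
    for label, prior in model["prior"].items():
        score = math.log(prior)
        for w in model["vocab"]:
            p = model["p_word"][label][w]
            # Each vocabulary word is a boolean attribute: present or absent.
            score += math.log(p if w in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set (illustrative only).
docs = [
    "cheap pills buy now".split(),
    "buy cheap watches".split(),
    "meeting agenda for monday".split(),
    "monday project meeting notes".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model = train_bernoulli_nb(docs, labels)
prediction = classify(model, "buy cheap pills".split())
```

The "naïve" part is the loop over the vocabulary: every word contributes independently to the score, which is what keeps training and classification fast.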

What Are Recommenders?

• An algorithm that looks at a user’s past actions and suggests
  – Products
  – Services
  – People

Collaborative Filtering

• Collaborative filtering produces recommendations based on
  – user preferences for items
  – “User Based”
  – does not require knowledge of the specific properties of the items
• In contrast,
  – content-based recommendation produces recommendations based on intimate knowledge of the properties of items
  – “Item based”
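A rough sketch of the collaborative filtering side, assuming a tiny in-memory ratings table: it only looks at who rated what, never at item properties. The data, the cosine similarity choice, and the function names are illustrative assumptions, not the Taste engine's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)

def recommend(ratings, user, top_n=2):
    """Rank unseen items by similarity-weighted ratings from other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0.0:
            continue  # users with nothing in common contribute nothing
        for item, r in other_ratings.items():
            if item in ratings[user]:
                continue  # only suggest items the user has not rated
            scores[item] = scores.get(item, 0.0) + sim * r
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=lambda i: (-predicted[i], i))[:top_n]

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "alice": {"book_a": 5, "book_b": 4},
    "bob":   {"book_a": 5, "book_b": 5, "book_c": 5, "book_d": 1},
    "carol": {"book_d": 5, "book_e": 4},
}
suggestions = recommend(ratings, "alice")
```

Here alice overlaps only with bob, so bob's highly rated `book_c` ranks ahead of `book_d`, and carol's items never enter the picture.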

Clustering: Topic Modeling

• Cluster words across docs to identify topics

• Latent Dirichlet Allocation

What is time series data?

• Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals

• Examples in finance
  – daily adjusted close price of a stock at the NYSE
• Example in Sensors / Signal Processing / Smart Grid
  – sensor readings on a power grid occurring 30 times a second.
• For more reference on time series data
  – http://www.cloudera.com/blog/2011/03/simple-moving-average-secondary-sort-and-mapreduce-part-1/
  – http://en.wikipedia.org/wiki/Time_series

NERC Sensor Data Collection
openPDC PMU Data Collection circa 2009

• 120 Sensors
• 30 samples/second
• 4.3B Samples/day
• Housed in Hadoop

Story Time: Keogh, SAX, and the openPDC

• NERC wanted high res smart grid data tracked
  – Started openPDC project @ TVA
    • http://openpdc.codeplex.com/
  – We used Hadoop to store and process time series data
    • https://openpdc.svn.codeplex.com/svn/Hadoop/Current%20Version/
• Needed to find “unbounded oscillations”
  – Time series unwieldy to work with at scale
• We found “SAX” by Keogh and his folks for dealing with time series
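For readers who have not met SAX (Symbolic Aggregate approXimation), here is a minimal sketch of the usual three steps: z-normalize the series, average it into equal-width segments (PAA), then map each segment mean to a letter using breakpoints that split a standard normal curve into equal-probability regions. The function name `sax`, the four-letter alphabet, and the requirement that the length divide evenly are assumptions made for this example; this is not the openPDC code.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of n_segments letters."""
    if len(series) % n_segments:
        raise ValueError("series length must be divisible by n_segments")
    mu, sigma = mean(series), pstdev(series)
    # Step 1: z-normalize (a flat series maps to all zeros).
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Step 2: Piecewise Aggregate Approximation, one mean per segment.
    width = len(z) // n_segments
    paa = [mean(z[i * width:(i + 1) * width]) for i in range(n_segments)]
    # Step 3: equal-probability breakpoints under a standard normal curve.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    word = ""
    for value in paa:
        idx = sum(1 for b in breakpoints if value > b)
        word += alphabet[idx]
    return word

signal = [0, 0, 0, 0, 10, 10, 10, 10]
word = sax(signal, n_segments=4)
```

A step from low to high comes out as `aadd`: two segments in the lowest region followed by two in the highest. Short words like this are much easier to index and compare at scale than raw samples.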

Copyright 2011 Cloudera Inc. All rights reserved



What is Lumberyard?

• Lumberyard is time series iSAX indexing stored in HBase for persistent and scalable index storage

• It’s interesting for
  – Indexing large amounts of time series data
  – Low latency fuzzy pattern matching queries on time series data
• Lumberyard is open source and ASF 2.0 Licensed at GitHub:
  – https://github.com/jpatanooga/Lumberyard/


Genome Data as Time Series

• A, C, G, and T
  – Could be thought of as “1, 2, 3, and 4”!
• If we have sequence X, what is the “closest” subsequence in a genome that is most like it?
  – Doesn’t have to be an exact match!
  – Example:
    • ATATAT
    • TATATA
• Useful in proteomics as well
• iSAX Indexing
  – Lumberyard use case
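To make the "closest subsequence" question concrete, the sketch below encodes bases as 1 to 4 and slides the query along the genome, keeping the window with the smallest Euclidean distance. It is a brute-force scan for illustration only: an iSAX index exists to avoid exactly this full scan. The genome string and function names are invented, and the numeric encoding imposes an artificial ordering on the bases.

```python
import math

ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode(seq):
    """Turn a base string into a numeric series."""
    return [ENCODING[base] for base in seq]

def closest_subsequence(genome, query):
    """Return (start, distance) of the window most similar to the query."""
    g, q = encode(genome), encode(query)
    best_pos, best_dist = -1, math.inf
    for start in range(len(g) - len(q) + 1):
        window = g[start:start + len(q)]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, q)))
        if dist < best_dist:
            best_pos, best_dist = start, dist
    return best_pos, best_dist

genome = "GGCATATACGTT"          # made-up sequence
position, distance = closest_subsequence(genome, "ATATAT")
```

No window matches `ATATAT` exactly here; the best one, `ATATAC` at offset 3, differs only in its last base.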


Bioinformatics

• Applications in DNA Sequencing
• Shortest Superstring Problem (SSP)
  – Take lots of reads from sequencing
  – We want the “superstring” of all the reads
    • We want a long string that “explains” all the reads we generated
    • We want the shortest string possible
  – NP-complete
• We can reduce SSP to the Traveling Salesman Problem
  – Graph processing / algorithms now applicable
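Because the exact problem is intractable at scale, a common shortcut is a greedy heuristic: repeatedly merge the two reads with the largest suffix/prefix overlap. The sketch below shows that heuristic on four made-up reads. It is an approximation swapped in for illustration, not the TSP reduction named above and not a guaranteed-shortest answer.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(reads):
    """Merge the best-overlapping pair until a single string remains."""
    # Drop duplicates and reads fully contained in another read.
    reads = list(dict.fromkeys(reads))
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads[0] if reads else ""

reads = ["ATTAGA", "AGACCT", "CCTGCC", "GCCGGA"]
superstring = greedy_superstring(reads)
```

The pairwise overlaps are the edge weights of the overlap graph, which is where the graph-processing view of the problem comes from.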


Packages For Hadoop

• DataFu
  – http://sna-projects.com/datafu/
  – UDFs in Pig
  – used at LinkedIn in many off-line workflows for data derived products
    • “People You May Know”
    • “Skills”
  – Techniques
    • PageRank
    • Quantiles (median), variance, etc.
    • Sessionization
    • Convenience bag functions
    • Convenience utility functions


Integration with Libs

• Mix MapReduce with Machine Learning Libs
  – WEKA
  – KXEN
  – CPLEX
• Map side “groups data”
• Reduce side processes groups of data with Lib in parallel
  – Involves tricks in getting K/V pairs into lib
  – Pipes, tmp files, task cache dir, etc
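The pattern can be simulated in plain Python without a cluster: the map step keys each record by group, the shuffle collects values per key, and the reduce step hands each group to a library call. In this sketch `fit_trend` is a hypothetical stand-in for a call into an external library such as WEKA; it just computes a least-squares slope. The sensor records are invented.

```python
from collections import defaultdict
from statistics import mean

def map_phase(records):
    """Map side: emit (sensor_id, (t, value)) so the shuffle groups by sensor."""
    for sensor_id, t, value in records:
        yield sensor_id, (t, value)

def shuffle(pairs):
    """Stand-in for the framework's shuffle: collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def fit_trend(points):
    """Hypothetical library call: least-squares slope of value over time."""
    ts = [t for t, _ in points]
    vs = [v for _, v in points]
    t_bar, v_bar = mean(ts), mean(vs)
    denom = sum((t - t_bar) ** 2 for t in ts)
    return sum((t - t_bar) * (v - v_bar) for t, v in points) / denom

def reduce_phase(groups):
    """Reduce side: each group is processed independently of the others."""
    return {key: fit_trend(points) for key, points in groups.items()}

records = [
    ("s1", 0, 1.0), ("s2", 0, 5.0),
    ("s1", 1, 3.0), ("s2", 1, 4.0),
    ("s1", 2, 5.0), ("s2", 2, 3.0),
]
slopes = reduce_phase(shuffle(map_phase(records)))
```

On a real cluster each reducer would receive one or more groups and run the library on them in parallel; the awkward part is feeding key/value pairs into a library that expects files or in-memory tables, hence the pipes and temp files.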


What Hadoop Is Not Good At in Data Mining

• Anything highly iterative
• Anything that is extremely CPU bound and not disk bound
• Algorithms that can’t be inherently parallelized
  – Examples
    • Stochastic Gradient Descent (SGD)
    • Support Vector Machines (SVM)
  – Doesn’t mean they aren’t great to use
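A small logistic regression trained with SGD shows why this class of algorithm is awkward for a single MapReduce pass: every update reads the weights written by the previous update, and each extra pass over the data would be another job. This is a sketch under assumptions; the toy data, learning rate, and function names are invented for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(examples, epochs=500, lr=0.1, seed=0):
    """Train logistic regression one example at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(examples[0][0])
    b = 0.0
    data = list(examples)
    for _ in range(epochs):              # highly iterative: many passes over the data
        rng.shuffle(data)
        for x, y in data:
            # This step depends on the weights produced by the previous step,
            # so the updates form a sequential chain.
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) >= 0.5 else 0

# One feature, classes separated around x = 2.
examples = [([0.0], 0), ([1.0], 0), ([3.0], 1), ([4.0], 1)]
w, b = sgd_logistic(examples)
```

On one machine this loop is trivial and fast, which is the point of the last bullet: the algorithm is fine, it just does not map naturally onto independent map and reduce tasks.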

MRv2

• Not everything fits great in MapReduce
  – Mahout as evidence of this
  – Examples
    • Stochastic Gradient Descent (SGD)
    • Support Vector Machines (SVM)
• As we build further into verticals our analysis needs will become more complicated
  – MRv2 gives us new options
• CDH4 will be based on 0.23.x (or later)
  – 0.23.0 doesn't include MRv1
  – (via Tom White) CDH4 will *only* include MRv2



Existing Parallel Frameworks

• MapReduce
  – Java, Pig, Hive
• Spark
  – Scala, hides complexity like Hive/Pig
  – Runs on Hadoop, MRv2 already
• Giraph
  – Bulk-synchronous parallel model
  – relative to graphs where vertices can send messages to other vertices during a given superstep
• MPI
  – Older parallel lib
  – Includes primitives for data exchange, synchronization
  – Standardized and portable
• GraphLab
  – “graph parallel” vs MR’s “data parallel”
  – Better at iterative style
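The bulk-synchronous parallel model is easiest to see on the classic "propagate the maximum value" example. In each superstep a vertex reads the messages sent to it in the previous superstep, updates its own value, and sends messages that only become visible after the barrier. The sketch below simulates that on one machine; the graph, values, and function name are assumptions, and this is not Giraph's API.

```python
def bsp_max_value(graph, values):
    """graph: vertex -> list of neighbours; values: vertex -> initial number."""
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])
    supersteps = 1
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            # A vertex only sees its own state and its incoming messages.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        inbox = outbox      # barrier: messages are delivered in the next superstep
        supersteps += 1
    return values, supersteps

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = bsp_max_value(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
```

The computation stops when a superstep produces no messages. Unlike chained MapReduce jobs, the graph state stays in place between supersteps, which is why this style suits iterative graph algorithms.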

Frameworks Currently in Dev – MRv2

• Giraph
  – https://issues.apache.org/jira/browse/GIRAPH-13
• Hama BSP plans to integrate with MRv2
  – https://issues.apache.org/jira/browse/HAMA-431
• MPI
  – https://issues.apache.org/jira/browse/MAPREDUCE-2911
• Spark
  – https://github.com/mesos/spark-yarn
• GraphLab
  – Discussion in user-mahout



The Rise of the Meta Heuristic?

• We’re seeing a data deluge drive demand for new data products
  – MapReduce applications are still relatively new
• Customers have gotten a taste of data products with Hadoop
  – They like it
  – They want more
• MRv2 has the potential to open up a range of meta heuristics to the Hadoop sector
  – Techniques like genetic algorithms that were previously considered “boutique”


The Shape of Things to Come

Pig, Hive, Scala, Java

HDFS For Large Streaming Files
HBase for small low latency transactions

MRv2

Compiler to build workflows of { Data, Algorithm, Framework }

Algorithm Library: Mahout, SGD, SVM, NeuralNetworks

Framework Library: MPI, Spark, GraphLab, MapReduce

Questions? (Thanks!)

• Hadoop World 2011
  – You should go
  – Talks are high quality
  – Lots more Machine Learning talks
• Developer class 10/10/2011
  – http://www.eventbrite.com/event/1951335497
  – 10% discount with code atlhug
