Hadoop for Data Science

Donald Miner, Data Science MD, October 9, 2013

Description

This is a talk I gave at the Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.

Transcript of Hadoop for Data Science

Page 1: Hadoop for Data Science

Hadoop for Data Science

Donald Miner

Data Science MD, October 9, 2013

Page 2: Hadoop for Data Science

About Don

@donaldpminer

[email protected]

Page 3: Hadoop for Data Science

I’ll talk about…

Intro to Hadoop (HDFS and MapReduce)

Some reasons why I think Hadoop is cool

(is this cliché yet?)

Step 1: Hadoop
Step 2: ????
Step 3: Data Science!

Some examples of data science work on Hadoop

What can Hadoop do to enable data science work?

Page 4: Hadoop for Data Science

Hadoop Distributed File System (HDFS)

• Stores files in folders (that’s it)
– Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block (better safe than sorry)
• Blocks are scattered all over the place

[Diagram: a file chunked into replicated blocks]

Page 5: Hadoop for Data Science

MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers

Mappers (you code this)
– Loads data from HDFS
– Filter, transform, parse
– Outputs (key, value) pairs

Reducers (you code this, too)
– Automatically groups by the mapper’s output key
– Aggregate, count, statistics
– Outputs to HDFS
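
To make the mapper/reducer split concrete, here is a minimal word count as a pair of Hadoop Streaming scripts. The use of Streaming and Python (rather than plain Java MapReduce) and the script names are my choices for brevity, not from the slides:

    # mapper.py: read raw lines from HDFS via stdin, emit (word, 1) pairs
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

    # reducer.py: Streaming sorts by key, so all counts for a word arrive together
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

You would submit this with the hadoop-streaming jar, shipping the two scripts with -files and naming them with -mapper and -reducer.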

Page 6: Hadoop for Data Science

Hadoop Ecosystem

• Higher-level languages like Pig and Hive

• Data systems built on HDFS, like HBase and Accumulo

• Close friends like ZooKeeper, Flume, Storm, Cassandra, Avro

Page 7: Hadoop for Data Science

Mahout

• Mahout is a Machine Learning library

• Has both MapReduce and non-parallel implementations of a number of algorithms:
– Recommenders
– Clustering
– Classification

Page 8: Hadoop for Data Science

Cool Thing #1: Linear Scalability

• HDFS and MapReduce scale linearly

• If you have twice as many computers, jobs run twice as fast

• If you have twice as much data, jobs run twice as slow

• If you have twice as many computers, you can store twice as much data

DATA LOCALITY!!

Page 9: Hadoop for Data Science

Cool Thing #2: Schema on Read

LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS

What implications does this have?

BEFORE: ETL, schema design upfront, tossing out original data, comprehensive data study

WITH HADOOP: Keep original data around! Have multiple views of the same data! Work with unstructured data sooner! Store first, figure out what to do with it later!
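
As a sketch of “multiple views of the same data”: the raw bytes stay untouched in HDFS, and each job bakes its own schema into its parsing code. The pipe-delimited record layout here is hypothetical:

    # Two jobs can interpret the same raw line differently, because
    # the schema lives in the reading code, not in the file.
    raw = "2013-10-09T18:02:11|baltimore|GET /index.html|200"

    def view_for_traffic_stats(line):
        ts, city, request, status = line.split("|")
        return (city, int(status))          # one view: city and status code

    def view_for_url_analysis(line):
        fields = line.split("|")
        method, url = fields[2].split(" ", 1)
        return (url, method)                # another view: URL and HTTP verb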

Page 10: Hadoop for Data Science

Cool Thing #3: Transparent Parallelism

Network programming? Inter-process communication? Threading? Distributed stuff? Fault tolerance? Code deployment? RPC? Message passing? Locking? Data storage? Scalability? Data center fires?

With MapReduce, I DON’T CARE: the MapReduce framework handles all of that.

… I just have to be sure my solution fits into this tiny box.

Page 11: Hadoop for Data Science

Cool Thing #4: Unstructured Data

• Unstructured data: media, text, forms, log data, lumped structured data

• Query languages like SQL and Pig assume some sort of “structure”

• MapReduce is just Java: You can do anything Java can do in a Mapper or Reducer

Page 12: Hadoop for Data Science

The rest of the talk

• Four threads:
– Data exploration
– Classification
– NLP
– Recommender systems

I’m using these to illustrate some points

Page 13: Hadoop for Data Science

Exploration

• Hadoop is great at exploring data!
• I like to explore data in a couple of ways:
– Filtering
– Sampling
– Summarization
– Evaluating cleanliness

• I like to spend 50% of my time doing exploration (but unfortunately it’s the first thing to get cut)

Page 14: Hadoop for Data Science

Filtering

• Filtering is like a microscope: I want to take a closer look at a subset

• In MapReduce, you do this in the mapper
• Identify nasty records you want to get rid of
• Examples:
– Only Baltimore data (sketched below)
– Remove gibberish
– Only 5 minutes
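
As a sketch, the “only Baltimore data” filter is a one-screen Streaming mapper; the tab-delimited layout and the city column position are hypothetical:

    # filter_mapper.py: keep only records whose city field is Baltimore
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 2 and fields[2].lower() == "baltimore":
            print(line.rstrip("\n"))

    # No reducer needed: a map-only job writes the surviving records to HDFS.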

Page 15: Hadoop for Data Science

Sampling

• Hadoop isn’t the king of interactive analysis
• Sampling is a good way to grab a set of data, then work with it locally (Excel?)
• Pig has a handy SAMPLE keyword
• Types of sampling:
– Sample randomly across the entire data set (see the sketch below)
– Sub-graph extraction
– Filters (from the last slide)
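
A random sample in raw MapReduce is just a probabilistic filter in the mapper. This Streaming sketch keeps roughly 1% of records (the rate is illustrative); Pig’s SAMPLE keyword does the equivalent in one line:

    # sample_mapper.py: emit each record with probability 0.01
    import random
    import sys

    for line in sys.stdin:
        if random.random() < 0.01:
            sys.stdout.write(line)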

Page 16: Hadoop for Data Science

Summarization

• Summarization is a bird’s-eye view
• MapReduce is good at summarization:
– Mappers extract the group-by keys
– Reducers do the aggregation
• I like to:
– Count the number, get stdev, get average, get min/max of records in several groups (see the reducer sketch below)
– Count nulls in columns (if applicable)
– Grab top-10 lists
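
A sketch of the group-by-and-summarize pattern: assume mappers emit "group<TAB>value" lines, and this Streaming reducer computes count, min, max, and mean per group (stdev would follow the same shape):

    # summary_reducer.py: input is "group\tvalue" lines, sorted by group
    import sys

    def emit(group, values):
        print("%s\tcount=%d min=%g max=%g mean=%g" % (
            group, len(values), min(values), max(values),
            sum(values) / len(values)))

    current, values = None, []
    for line in sys.stdin:
        group, value = line.rstrip("\n").split("\t")
        if group != current:
            if current is not None:
                emit(current, values)
            current, values = group, []
        values.append(float(value))
    if current is not None:
        emit(current, values)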

Page 17: Hadoop for Data Science

Evaluating Cleanliness

• I’ve never been burned twice
• Things to check for:
– Fields that shouldn’t be null but are
– Duplicates (does unique records = records?)
– Dates (look for 1970; look at formats; time zones)
– Things that should be normalized
– Keys that are different because of trash, e.g. “ abc “ != “abc”
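
Most of these checks reduce to one summarization job. A quick sketch of catching trash in keys and counting nulls per column (the null markers and tab-delimited layout are assumptions):

    # cleanliness_mapper.py: flag suspicious values, one counter per problem
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        for i, f in enumerate(fields):
            if f != f.strip():
                print("untrimmed_col_%d\t1" % i)   # " abc " != "abc"
            if f.strip() in ("", "null", "NULL", "\\N"):
                print("null_col_%d\t1" % i)

    # Feed this into a summing reducer to get problem counts per column.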


Page 18: Hadoop for Data Science

What’s the point?

• Hadoop is really good at this stuff!
• You probably have a lot of data and a lot of it is garbage!
• Take the time to do this and your further work will be much easier
• It’s hard to tell what methods you should use until you explore your data

Page 19: Hadoop for Data Science

Classification

• Classification is taking feature vectors (derived from your data), and then guessing some sort of label, e.g.:

sunny, Saturday, summer -> play tennis
rainy, Wednesday, winter -> don’t play tennis

• Most classification algorithms either aren’t easily parallelizable or don’t have good parallel implementations

• You need a training set of true feature vectors and labels… how often is your data labeled?

• I’ve found classification rather hard, except for when…

Page 20: Hadoop for Data Science

Overall Classification Workflow

EXPLORATION -> EXPERIMENTATION WITH DIFFERENT METHODS -> REFINING PROMISING METHODS

The Model Training Workflow

DATA -> [FEATURE EXTRACTION] -> FEATURE VECTORS -> [MODEL TRAINING] -> MODEL -> [USE MODEL] -> OUTPUT

Page 21: Hadoop for Data Science

Data volumes in training

[Bar chart; y-axis: DATA VOLUME]

DATA: I have a lot of data

Page 22: Hadoop for Data Science

Data volumes in training

[Bar chart; y-axis: DATA VOLUME]

DATA -> feature extraction -> FEATURE VECTORS

Is this result “big data”?

Examples:
– 10TB of network traffic distilled into 9K IP address FVs
– 10TB of medical records distilled into 50M patient FVs
– 10TB of documents distilled into 5TB of document FVs

Page 23: Hadoop for Data Science

Data volumes in training

[Bar chart; y-axis: DATA VOLUME]

DATA -> feature extraction -> FEATURE VECTORS -> model training -> MODEL

The model itself is usually pretty tiny

Page 24: Hadoop for Data Science

Data volumes in training

[Bar chart; y-axis: DATA VOLUME]

DATA -> feature extraction -> FEATURE VECTORS -> model training -> MODEL

Applying that model to all the data is a big data problem!
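
A sketch of the “use model” stage as a map-only job: train the (tiny) model offline however you like, ship the serialized model file to every node, and score records in the mapper. The pickle file, the scikit-learn-style predict() call, and the record format are all assumptions:

    # score_mapper.py: apply a pre-trained model to every record
    import pickle
    import sys

    # model.pkl was trained offline (possibly not on Hadoop at all) and
    # shipped to each worker, e.g. with Streaming's -files option.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    for line in sys.stdin:
        record_id, rest = line.rstrip("\n").split("\t", 1)
        features = [float(x) for x in rest.split(",")]
        print("%s\t%s" % (record_id, model.predict([features])[0]))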

Page 25: Hadoop for Data Science

Some hurdles

• Where do I run non-Hadoop code?
• How do I serve results out to the application?
• How do I use my model on streaming data?
• How do I automate performance measurement?

Page 26: Hadoop for Data Science

So what’s the point?

• Not all stages of the model training workflow are Hadoop problems

• Use the right tool for the job in each phase, e.g., non-parallel model training in some cases

DATA -> [FEATURE EXTRACTION] -> FEATURE VECTORS -> [MODEL TRAINING] -> MODEL -> [USE MODEL] -> OUTPUT

Page 27: Hadoop for Data Science

Natural Language Processing

• A lot of classic tools in NLP are “embarrassingly parallel” over an entire corpus, since words split nicely:
– Stemming
– Lexical analysis
– Parsing
– Tokenization
– Normalization
– Removing stop words
– Spell check

Each of these applies to a segment of text and doesn’t have much to do with any other piece of text in the corpus.

Page 28: Hadoop for Data Science

Python, NLTK, and Pig

• Pig is a higher-level abstraction over MapReduce
• NLTK is a popular natural language toolkit for Python
• Pig allows you to stream data through arbitrary processes (including Python scripts)
• You can use UDFs to wrap NLTK methods, but the need to use Jython sucks
• Use Pig to move your data around, and use a real package to do the work on the records:

postdata = STREAM data THROUGH `my_nltk_script.py`;

(I do the same thing with SciPy and NumPy)
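
For illustration, a minimal my_nltk_script.py behind that STREAM statement could look like this; it assumes NLTK (and its tokenizer models) is installed on every worker node:

    # my_nltk_script.py: tokenize whatever Pig streams through us
    import sys
    from nltk.tokenize import word_tokenize

    for line in sys.stdin:
        tokens = word_tokenize(line.strip())
        print("\t".join(tokens))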

Page 29: Hadoop for Data Science

OpenNLP and MapReduce

• OpenNLP is an Apache NLP library
• “It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.”
• Written in Java with reasonable APIs
• MapReduce is just Java, so you can link in just about anything you want
• Use OpenNLP in the Mapper to enrich, normalize, and cleanse your data

Page 30: Hadoop for Data Science

So what’s the point?

• Hadoop can be used to glue together already existing libraries
– You just have to figure out how to split the problem up yourself

Page 31: Hadoop for Data Science

Recommender Systems

• Hadoop is good at recommender systems
– Recommender systems like a lot of data
– Systems want to make a lot of recommendations

• A number of methods available in Mahout

Page 32: Hadoop for Data Science

Collaborative Filtering: Base recommendations on others

• Collaborative Filtering is cool because it doesn’t have to understand the user or the item… just the relationships

• Relationships are easy to extract, features and labels not so much
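
To show why relationships are easy to work with, here is a sketch of the first step of item-based collaborative filtering: a Streaming reducer that turns "user<TAB>item" lines, sorted by user, into item co-occurrence pairs. A follow-on summing job would aggregate the pair counts. This illustrates the idea; it is not Mahout’s actual implementation:

    # cooccur_reducer.py: emit one line per item pair seen together
    # in a single user's history
    import itertools
    import sys

    def emit_pairs(items):
        for a, b in itertools.combinations(sorted(set(items)), 2):
            print("%s,%s\t1" % (a, b))

    current, items = None, []
    for line in sys.stdin:
        user, item = line.rstrip("\n").split("\t")
        if user != current:
            if current is not None:
                emit_pairs(items)
            current, items = user, []
        items.append(item)
    if current is not None:
        emit_pairs(items)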

Page 33: Hadoop for Data Science

What’s the point?

• Recommender systems parallelize and there is a Hadoop library for it

• They use relationships, not features, so the input data is easier to extract

• If you can fit your problem into the recommendation framework, you can do something interesting

Page 34: Hadoop for Data Science

Other stuff: Graphs

• Graphs are useful and a lot can be done with Hadoop

• Check out Giraph
• Check out how Accumulo has been used to store graphs (google: “Graph 500 Accumulo”)
• Stuff to do:
– Subgraph extraction
– Missing edge recommendation
– Cool visualizations
– Summarizing relationships

Page 35: Hadoop for Data Science

Other stuff: Clustering

• Provides interesting insight into groups
• Some methods parallelize well
• Mahout has:
– Dirichlet process clustering
– K-means
– Fuzzy K-means

Page 36: Hadoop for Data Science

Other stuff: R and Hadoop

• RHIPE and RHadoop allow you to write MapReduce jobs in R, instead of Java
• Can also use Hadoop Streaming to use R
• This doesn’t magically parallelize all your R code
• Useful to integrate Hadoop into R more seamlessly

Page 37: Hadoop for Data Science

Wrap up

Hadoop can’t do everything, and you have to do the rest

Page 38: Hadoop for Data Science

THANKS!

[email protected]

@donaldpminer