Data science and Hadoop
Hadoop for Data Science
Donald Miner
NYC Pig User Group
August 22, 2013
I’ll talk about…
Intro to Hadoop
Some reasons why I think Hadoop is cool
(is this cliché yet?)
Step 1: Hadoop
Step 2: ????
Step 3: Data Science!
Some examples of data science work on hadoop
What can Hadoop do to enable data science work?
Hadoop
• Distributed platform for thousands of nodes
• Data storage and computation framework
• Open source
• Runs on commodity hardware
Hadoop Distributed File System (HDFS)
• Stores files in folders (that’s it)
  – Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-2GB)
• 3 replicas of each block (better safe than sorry)
• Blocks are scattered all over the place
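The chunk-and-replicate idea can be sketched in a few lines of Python. This is a toy model, not how HDFS actually places blocks (real placement is rack-aware). The function names `split_into_blocks` and `place_replicas`, the node names, and the 200 MB file are all made up for illustration.

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the low end of the range on the slide
REPLICATION = 3                # three replicas per block, as on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(blocks, nodes, replication=REPLICATION, seed=0):
    """Assign each block to `replication` distinct nodes, chosen at random.

    Random choice is a simplification; it only shows that copies of one
    block never share a node and that blocks end up scattered.
    """
    rng = random.Random(seed)
    return {index: rng.sample(nodes, replication)
            for index, _ in enumerate(blocks)}


nodes = ["node%02d" % i for i in range(10)]    # hypothetical 10-node cluster
blocks = split_into_blocks(200 * 1024 * 1024)  # a hypothetical 200 MB file
placement = place_replicas(blocks, nodes)
for index, holders in placement.items():
    print(index, blocks[index], holders)
```

Losing any one node still leaves two copies of every block it held, which is the "better safe than sorry" part.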
FILE BLOCKS
MapReduce
• Analyzes raw data in HDFS where the data is
• Jobs are split into Mappers and Reducers
Mappers (you code this)
• Loads data from HDFS
• Filter, transform, parse
• Outputs (key, value) pairs
Reducers (you code this, too)
• Automatically groups by the mapper’s output key
• Aggregate, count, statistics
• Outputs to HDFS
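To make the mapper/reducer split concrete, here is a single-process Python sketch of a word count. It only imitates the shape of a MapReduce job; `run_job` is a hypothetical stand-in for the framework, and nothing here talks to Hadoop.

```python
from collections import defaultdict


def mapper(line):
    """Parse one raw line and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Aggregate all values that share one key."""
    return key, sum(values)


def run_job(lines):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)   # the grouping the framework does for you
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_job(["the quick brown fox", "the lazy dog", "The fox"])
print(counts)
```

The only parts you would write yourself are `mapper` and `reducer`; the grouping in the middle is what the framework provides.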
Hadoop Ecosystem
• Higher-level languages like Pig and Hive
• Data systems on top of HDFS, like HBase and Accumulo
• Close friends like ZooKeeper, Flume, Storm, Cassandra, Avro
Pig
• Pig is a fantastic query language that runs MapReduce jobs
• Higher-level than MapReduce: write code in terms of GROUP BY, DISTINCT, FOREACH, FILTER, etc.
• Custom loaders and storage functions make this good glue
• I use this a lot
-- Count people and average their age, per state
A = LOAD 'data.txt' AS (name:chararray, age:int, state:chararray);
B = GROUP A BY state;
C = FOREACH B GENERATE group, COUNT(A), AVG(A.age);
DUMP C;
Mahout
• Mahout is a Machine Learning library
• Has both parallel and non-parallel implementations of a number of algorithms:
  – Recommenders
  – Clustering
  – Classification
Cool Thing #1: Linear Scalability
• HDFS and MapReduce scale linearly
• If you have twice as many computers, jobs run twice as fast
• If you have twice as much data, jobs take twice as long
• If you have twice as many computers, you can store twice as much data
DATA LOCALITY!!
Cool Thing #2: Schema on Read
LOAD DATA FIRST, ASK QUESTIONS LATER
Data is parsed/interpreted as it is loaded out of HDFS
What implications does this have?
BEFORE:
ETL, schema design upfront, tossing out original data, comprehensive data study
WITH HADOOP:
Keep original data around!
Have multiple views of the same data!
Work with unstructured data sooner!
Store first, figure out what to do with it later!
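A small Python sketch of what schema on read means in practice: the raw lines are stored untouched, and each question brings its own parser. The log format, field names, and both parser functions are invented for this example.

```python
# Raw data is kept exactly as it arrived; nothing is parsed at write time.
RAW_LINES = [
    "2013-08-22 10:15:02 alice login NY",
    "2013-08-22 10:15:09 bob purchase CA",
    "2013-08-22 10:16:40 alice logout NY",
]


def as_events(line):
    """View 1: who did what."""
    date, time, user, action, state = line.split()
    return {"user": user, "action": action}


def as_activity_by_state(line):
    """View 2: when and where, ignoring the user entirely."""
    date, time, user, action, state = line.split()
    return {"timestamp": date + "T" + time, "state": state}


# Two different "schemas" applied to the same stored bytes, at read time.
events = [as_events(line) for line in RAW_LINES]
activity = [as_activity_by_state(line) for line in RAW_LINES]
print(events[1])
print(activity[1])
```

If a third question comes up next month, it gets a third parser; the original lines are still there to re-read.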
Cool Thing #3: Transparent Parallelism
Network programming?
Inter-process communication?
Threading?
Distributed stuff?
Fault tolerance?
Code deployment?
RPC?
Message passing?
Locking?
Data storage?
Scalability?
Data center fires?
With MapReduce, I DON’T CARE
… I just have to fit my solution into this tiny box
[Figure: “Your solution” as a small box inside the MapReduce Framework]
Cool Thing #4: Unstructured Data
• Unstructured data: media, text, forms, log data, lumped structured data
• Query languages like SQL and Pig assume some sort of “structure”
• MapReduce is just Java: You can do anything Java can do in a Mapper or Reducer
One of the things Hadoop can do for you is turn your unstructured data into structured data
The rest of the talk
• Four threads:
  – Data exploration
  – Classification
  – NLP
  – Recommender systems
I’m using these to illustrate some points
Exploration
• Hadoop is great at exploring data!
• I like to explore data in a couple ways:
  – Filtering
  – Sampling
  – Summarization
  – Evaluate cleanliness
• I like to spend 50% of my time doing exploration (but unfortunately it’s the first thing to get cut)
Filtering
• Filtering is like a microscope: I want to take a closer look at a subset
• In MapReduce, you do this in the mapper
• Identify nasty records you want to get rid of
• Examples:
  – Only New York data
  – Only millennials
  – Remove gibberish
  – Only 5 minutes
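A filtering mapper is small enough to sketch in Python. The three-field, tab-delimited record layout (name, age, state) is an assumption made up for this example, as are the function names; the point is only that a mapper either passes a record through or emits nothing.

```python
def parse(line):
    """Split a tab-delimited record into (name, age, state), or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3 or not fields[1].isdigit():
        return None
    return fields[0], int(fields[1]), fields[2]


def filter_mapper(line):
    """Emit the line unchanged if it passes every filter, otherwise emit nothing."""
    record = parse(line)
    if record is None:          # remove gibberish
        return
    name, age, state = record
    if state != "NY":           # only New York data
        return
    yield line


lines = ["ann\t31\tNY", "bob\t45\tCA", "@@@garbage@@@", "cat\tx\tNY", "dan\t27\tNY"]
kept = [out for line in lines for out in filter_mapper(line)]
print(kept)
```

No reducer is needed for a job like this; the mapper output is the result.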
Sampling
• Hadoop isn’t the king of interactive analysis
• Sampling is a good way to grab a set of data then work with it locally (Excel?)
• Pig has a handy SAMPLE keyword
• Types of sampling:
  – Sample randomly across the entire data set
  – Sub-graph extraction
  – Filters (from the last slide)
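Random sampling across the whole data set can be sketched as keeping each record independently with some probability. As far as I know this is also roughly how Pig's SAMPLE behaves, but treat that as an assumption; the `sample` function below is plain Python written for illustration.

```python
import random


def sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [record for record in records if rng.random() < fraction]


records = list(range(100000))
subset = sample(records, 0.01, seed=42)
print(len(subset))   # roughly 1000; the exact size depends on the random draws
```

Because each record is decided on its own, this works in a mapper with no coordination between nodes. The trade-off is that you get approximately, not exactly, the requested fraction.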
Summarization
• Summarization is a bird’s-eye view
• MapReduce is good at summarization:
  – Mappers extract the group-by keys
  – Reducers do the aggregation
• I like to:
  – Count number, get stdev, get average, get min/max of records in several groups
  – Count nulls in columns (if applicable)
  – Grab top-10 lists
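Here is a rough Python sketch of the summary statistics listed above, computed per group. The records and the `summarize` function are invented for illustration, and the sketch assumes every group has at least one non-null value.

```python
import statistics
from collections import defaultdict

records = [
    {"state": "NY", "age": 31},
    {"state": "NY", "age": 45},
    {"state": "CA", "age": 27},
    {"state": "CA", "age": None},   # a null to be counted, not averaged
    {"state": "NY", "age": 38},
]


def summarize(records, key, field):
    """Group records by `key` and compute basic statistics of `field`."""
    groups = defaultdict(list)
    for record in records:                # the mapper's job: extract the key
        groups[record[key]].append(record[field])
    summary = {}
    for group, values in groups.items():  # the reducer's job: aggregate
        present = [v for v in values if v is not None]
        summary[group] = {
            "count": len(values),
            "nulls": len(values) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": statistics.mean(present),
            "stdev": statistics.pstdev(present),   # population standard deviation
        }
    return summary


summary = summarize(records, "state", "age")
for state, stats in sorted(summary.items()):
    print(state, stats)
```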
Evaluating Cleanliness
• I’ve never been burned twice:
  – There is a list of things that I like to check
• Things to check for:
  – Fields that shouldn’t be null that are
  – Duplicates (does unique records = records?)
  – Dates (look for 1970; look at formats; time zones)
  – Things that should be normalized
  – Keys that are different because of trash, e.g. “ abc ” != “abc”
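A few of these checks, sketched in Python over a handful of made-up records. The field names (`id`, `user`, `date`) and the `cleanliness_report` function are hypothetical; a real check list would be driven by your own schema.

```python
records = [
    {"id": "abc",   "user": "ann", "date": "2013-08-22"},
    {"id": " abc ", "user": "ann", "date": "2013-08-22"},  # key with stray spaces
    {"id": "def",   "user": None,  "date": "1970-01-01"},  # null user, epoch date
    {"id": "abc",   "user": "ann", "date": "2013-08-22"},  # exact duplicate
]


def cleanliness_report(records):
    """Count a few common data-quality problems in a list of dict records."""
    # Turn each record into a hashable tuple so duplicates can be detected.
    rows = [tuple(sorted(r.items(), key=lambda kv: kv[0])) for r in records]
    return {
        "records": len(records),
        "unique_records": len(set(rows)),
        "null_user": sum(1 for r in records if r["user"] is None),
        "epoch_dates": sum(1 for r in records if r["date"].startswith("1970")),
        "untrimmed_ids": sum(1 for r in records if r["id"] != r["id"].strip()),
    }


report = cleanliness_report(records)
print(report)
```

Each of these is a count, so each one fits the summarization pattern: mappers flag records, reducers add up the flags.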
What’s the point?
• Hadoop is really good at this stuff!
• You probably have a lot of data and a lot of it is garbage!
• Take the time to do this and your further work will be much easier
• It’s hard to tell what methods you should use until you explore your data
Classification
• Classification is taking feature vectors (derived from your data), and then guessing some sort of label
  – E.g.,
    sunny, Saturday, summer -> play tennis
    rainy, Wednesday, winter -> don’t play tennis
• Most classification algorithms aren’t easily parallelizable or lack good implementations
• You need a training set of true feature vectors and labels… how often is your data labeled?
• I’ve found classification rather hard, except for when…
Overall Classification Workflow
EXPLORATION -> EXPERIMENTATION OF DIFFERENT METHODS -> REFINING PROMISING METHODS
The Model Training Workflow
DATA -> FEATURE EXTRACTION -> FEATURE VECTORS -> MODEL TRAINING -> MODEL -> USE MODEL -> OUTPUT
Data volumes in training
[Figure: DATA VOLUME at each stage; DATA]
I have a lot of data
Data volumes in training
[Figure: DATA VOLUME at each stage; DATA -> feature extraction -> FEATURE VECTORS]
Is this result “big data”?
Examples:
- 10TB of network traffic distilled into 9K IP address FVs
- 10TB of medical records distilled into 50M patient FVs
- 10TB of documents distilled into 5TB of document FVs
Data volumes in training
[Figure: DATA VOLUME at each stage; DATA -> feature extraction -> FEATURE VECTORS -> Model Training -> MODEL]
The model itself is usually pretty tiny
Data volumes in training
[Figure: DATA VOLUME at each stage; DATA -> feature extraction -> FEATURE VECTORS -> Model Training -> MODEL]
Applying that model to all the data is a big data problem!
Some hurdles
• Where do I run non-Hadoop code?
• How do I host out results to the application?
• How do I use my model on streaming data?
• Automate performance measurement
Miscellaneous: Train all the classifiers!
Training a classifier might not be a big data problem…
… but training lots of them is!
Examples:
- Train a model per user to detect anomalous events
- Train a Boolean model per label possibility
- Ensemble methods
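As an illustration of the model-per-user idea, here is a deliberately simple Python sketch. The "model" is just a mean and standard deviation per user, and an event is flagged when it falls more than a few standard deviations away. That is a stand-in chosen for brevity, not a method from the talk; the function names, data, and the threshold of 3.0 are all assumptions.

```python
import statistics
from collections import defaultdict


def train_per_user(events):
    """events: iterable of (user, value). Returns {user: (mean, stdev)}."""
    by_user = defaultdict(list)
    for user, value in events:          # group by user, as a reducer would see them
        by_user[user].append(value)
    return {user: (statistics.mean(values), statistics.pstdev(values))
            for user, values in by_user.items()}


def is_anomalous(models, user, value, threshold=3.0):
    """Flag a value more than `threshold` standard deviations from the user's mean."""
    mean, stdev = models[user]
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > threshold


history = [("ann", 10), ("ann", 12), ("ann", 11), ("ann", 9),
           ("bob", 200), ("bob", 220), ("bob", 180)]
models = train_per_user(history)
print(is_anomalous(models, "ann", 200))   # far outside ann's usual range
print(is_anomalous(models, "bob", 200))   # ordinary for bob
```

Each user's model is tiny and trains independently of every other user's, so the user ID makes a natural group-by key: one reducer call per user, one model out.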
So what’s the point?
• Not all stages of the model training workflow are Hadoop problems
• Use the right tool for the job in each phase
  e.g., non-parallel model training in some cases
DATA -> FEATURE EXTRACTION -> FEATURE VECTORS -> MODEL TRAINING -> MODEL -> USE MODEL -> OUTPUT
Natural Language Pre-Processing
• A lot of classic tools in NLP are “embarrassingly parallel”
  – Stemming
  – Lexical analysis
  – Parsing
  – Tokenization
  – Normalization
  – Removing stop words
  – Spell check
Each of these applies to a segment of text and doesn’t have much to do with any other piece of text in the corpus.
Python, NLTK, and Pig
• Pig is a higher-level abstraction over MapReduce
• NLTK is a popular natural language toolkit for Python
• Pig allows you to stream data through arbitrary processes (including Python scripts)
• You can use UDFs to wrap NLTK methods, but the need to use Jython sucks
• Use Pig to move your data around, use a real package to do the work on the records
postdata = STREAM data THROUGH `my_nltk_script.py`;
(I do the same thing with SciPy and NumPy)
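A streaming script in this style might look roughly like the Python sketch below. It assumes records arrive on stdin as tab-delimited lines of the form `doc_id<TAB>text`, which is my understanding of Pig's default streaming format. The regex tokenizer is a stand-in so the sketch has no dependencies; in a real script that is where an NLTK call would go.

```python
import re
import sys

TOKEN = re.compile(r"[a-z0-9']+")   # stand-in tokenizer, not NLTK's


def process(line):
    """Turn 'doc_id<TAB>text' into 'doc_id<TAB>space-separated tokens'."""
    doc_id, _, text = line.rstrip("\n").partition("\t")
    tokens = TOKEN.findall(text.lower())
    return doc_id + "\t" + " ".join(tokens)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read records from stdin and write one transformed record per line."""
    for line in stdin:
        stdout.write(process(line) + "\n")
```

To use this as the streamed script, call `main()` at the bottom of the file. Keeping the per-record work in `process` means it can be tested locally without Pig or Hadoop.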
OpenNLP and MapReduce
• OpenNLP is an Apache project that provides an NLP library
• “It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.”
• Written in Java with reasonable APIs
• MapReduce is just Java, so you can link into just about anything you want
• Use OpenNLP in the Mapper to enrich, normalize, cleanse your data
One of my favorites: TF-IDF
• TF-IDF (Term Frequency, Inverse Document Frequency)
  – TF: how common is the word in the document
  – IDF: how common is this word everywhere (inverse)
  – Multiply both and get a score for each term
• Easily pulls out topics in documents (or lack of topics)
• Parallelizable (examples online)
Example: The quick brown fox jumps over the lazy dog
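For a feel of the arithmetic, here is a small in-memory Python sketch of one common TF-IDF variant: term count divided by document length, times the natural log of (number of documents / documents containing the term). Other weightings exist, and this is not a parallel implementation. The three documents and the `tf_idf` function are made up for the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: {doc_id: text}. Returns {doc_id: {term: score}}."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))       # count each term once per document
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the lazy dog sleeps",
    "d3": "the fox hunts",
}
scores = tf_idf(docs)
print(sorted(scores["d1"].items(), key=lambda kv: -kv[1])[:3])
```

A word like "the" appears in every document, so its IDF is log(1) = 0 and it drops out. Words unique to one document score highest. On Hadoop, the document-frequency count and the per-document scoring would typically be separate jobs.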
Somewhat related: Text extraction
• Extracting text with OCR or Speech-to-text (for example) can be an expensive operation
• Use Hadoop’s parallelism to apply your method against a large corpus of data
• You can’t really make individual extraction faster, but you can make the overall process faster
So what’s the point?
• Hadoop can be used to glue together already existing libraries
  – You just have to figure out how to split the problem up yourself
• Utilize a lot of the NLP toolkits to process text
Recommender Systems
• Hadoop is good at recommender systems
  – Recommender systems like a lot of data
  – Systems want to make a lot of recommendations
• A number of methods available in Mahout
• I’ll be talking about Collaborative Filtering
  1. Find similar users
  2. Make recommendations based on those
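The two steps can be sketched in plain Python. This is a toy user-based recommender using Jaccard similarity over sets of liked items. It is not Mahout's implementation, and the users, items, and function names are invented for illustration.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, top_n=3):
    """likes: {user: set of items}. Rank items the user has not seen by
    the total similarity of the other users who liked them."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], items)   # step 1: find similar users
        for item in items - likes[user]:           # step 2: score their items
            scores[item] = scores.get(item, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


likes = {
    "ann": {"pig", "hive", "hbase"},
    "bob": {"pig", "hive", "mahout"},
    "cat": {"storm", "flume"},
    "dan": {"pig", "hbase", "giraph", "mahout"},
}
print(recommend(likes, "ann"))
```

Nothing here knows what the items are or who the users are. The only input is who liked what.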
I have no idea what I’m doing
• Collaborative Filtering is cool because it doesn’t have to understand the user or the item… just the relationships
• Relationships are easy to extract, features and labels not so much
• Features can be folded into the similarity metrics
What’s the point?
• Recommender systems parallelize and there is a Hadoop library for it
• They use relationships, not features, so the data is easier to extract
• If you can fit your problem into the recommendation framework, you can do something interesting
Other stuff: Graphs
• Graphs are useful and a lot can be done with Hadoop
• Check out Giraph
• Check out how Accumulo has been used to store graphs (google: “Graph 500 Accumulo”)
• Stuff to do:
  – Subgraph extraction
  – Missing edge recommendation
  – Cool visualizations
  – Summarizing relationships
Other stuff: Clustering
• Provides interesting insight into groups
• Some methods parallelize well
• Mahout has:
  – Dirichlet process clustering
  – K-means
  – Fuzzy K-means
Other stuff: R and Hadoop
• RHIPE and RHadoop allow you to write MapReduce jobs in R, instead of Java
• Can also use Hadoop streaming to use R
• This doesn’t magically parallelize all your R code
• Useful to integrate into R more seamlessly
Wrap up
• Hadoop is good at certain things
• Hadoop can’t do everything and you have to do the rest