Using Hadoop: Best Practices
Casey Stella
March 14, 2012
Table of Contents
Introduction
Background
Using Hadoop Professionally (Indexing, Performance)
Staying Sane (Testing, Debugging)
State of Big Data and Hadoop
Conclusion
Introduction

- Hi, I'm Casey
- I work at Explorys
- I work with Hadoop and the Hadoop ecosystem daily
- I'm going to talk about some of the best practices that I've seen
- Some of these are common knowledge
- Some of these don't show up until you've been up 'til 3 AM debugging a problem
- These are my opinions and not necessarily the opinions of my employer
The Lay of the Land – The Bad

- There are two APIs: the mapreduce and the mapred packages
  - Prefer the mapred package; it is deprecated, but still preferred
  - Hortonworks just kind of screwed up
- The Pipes interface is really poorly implemented and very slow
- HDFS currently has a single point of failure
The Lay of the Land – The Good

- Hortonworks is actively working on Map-Reduce v2
  - This means other distributed computing models
  - Included in 0.23
- HDFS is dramatically faster in 0.23
  - Socket communication is made more efficient
  - Smarter checksumming
Indexing

- Hadoop is a batch processing system, but you need realtime access
- Options are:
  - Roll your own (Jimmy Lin talks about how one might serve up inverted indices in Chapter 3)
  - Use an open source indexing infrastructure, like Katta
  - Serve them directly from HDFS with an on-disk index, a.k.a. Hadoop MapFiles
  - Serve them through HBase or Cassandra
  - If data permits, push them to a database
- Katta can serve up both Lucene indices and MapFiles
- Indexing is hard; be careful.
Performance Considerations

- There are setup and teardown costs, so keep the HDFS block size large
- Mappers, Reducers and Combiners have memory constraints
- Transmission costs dearly:
  - Use Snappy, LZO, or (soon) LZ4 compression at every phase
  - Serialize your objects tightly (e.g. not using Java Serialization)
  - Key/values emitted from the map phase had better be linear with a small constant, preferably below 1
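As a rough illustration of why tight serialization matters, here is a plain-Java sketch with no Hadoop dependencies (the Point class and method names are hypothetical stand-ins for a real Writable) contrasting a write/readFields pair against Java Serialization of the same two ints:

```java
import java.io.*;

// Illustrative sketch: hand-rolled, Writable-style serialization versus
// java.io Serialization. The Point class is hypothetical.
public class TightSerialization {
    public static class Point implements Serializable {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        // Writable-style contract: write exactly the bytes you need.
        void write(DataOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }
        void readFields(DataInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }
    }

    static int tightSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        return bytes.size(); // 8 bytes: two ints
    }

    static int javaSerializedSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(p); // class metadata dwarfs the 8-byte payload
        out.close();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point(3, 4);
        System.out.println("tight: " + tightSize(p) + " bytes, Java Serialization: "
                + javaSerializedSize(p) + " bytes");
    }
}
```

The gap is per record; multiplied by billions of shuffled key/value pairs, it dominates transmission cost.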
Performance Considerations
- Strategies:
  - Intelligent use of the combiners
  - Use local aggregation in the mapper to emit a more complex value (you already know this)
  - Ensure that all components of your keys are necessary in the sorting logic; if any are not, push them into the value
- Profile via JobConf.setProfileEnabled(boolean)¹
- Use Hadoop Vaidya²

¹ http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Profiling
² http://hadoop.apache.org/common/docs/current/vaidya.html
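The local-aggregation strategy can be sketched without any Hadoop dependencies; the class below is a hypothetical stand-in for a Mapper that buffers partial counts and emits once per distinct key when the task finishes, rather than emitting (word, 1) per token:

```java
import java.util.*;

// Sketch of in-mapper ("local") aggregation, no Hadoop dependencies:
// buffer partial counts in a map and emit once per distinct word at
// cleanup time. Class and method names are illustrative.
public class LocalAggregationMapper {
    private final Map<String, Long> buffer = new HashMap<>();

    // Analogous to Mapper.map(): aggregate locally instead of emitting per token.
    public void map(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty())
                buffer.merge(word, 1L, Long::sum);
        }
    }

    // Analogous to Mapper.cleanup()/close(): one emit per distinct key,
    // shrinking shuffle traffic from O(tokens) to O(distinct words).
    public List<String> close() {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Long> e : buffer.entrySet())
            emitted.add(e.getKey() + "=" + e.getValue());
        Collections.sort(emitted);
        return emitted;
    }

    public static void main(String[] args) {
        LocalAggregationMapper m = new LocalAggregationMapper();
        m.map("to be or not to be");
        System.out.println(m.close()); // [be=2, not=1, or=1, to=2]
    }
}
```

Unlike a combiner, which the framework may or may not run, this aggregation is guaranteed to happen, at the cost of holding the buffer within the mapper's memory constraints.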
Unit/Integration Testing Methodologies
- First off, do it.
- Unit test individual mappers, reducers, combiners and partitioners
  - Actual unit tests. This will help debugging, I promise.
  - Design components so that dependencies can be injected via polymorphism when testing
- Minimally verify that keys:
  - Can be serialized and deserialized
  - hashCode() is sensible (remember: the hashCode() for an enum is not stable across different JVM instances)
  - compareTo() is reflexive, symmetric, and consistent with equals()
- Integration test via single-user mode Hadoop
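A minimal sketch of these key checks in plain Java (the CompositeKey class and its fields are hypothetical; a real key would implement WritableComparable):

```java
import java.util.Objects;

// Minimal sketch of verifying a composite key's contracts outside Hadoop.
// The CompositeKey class is hypothetical.
public class KeyContractCheck {
    public static class CompositeKey implements Comparable<CompositeKey> {
        final String id;
        final long timestamp;
        CompositeKey(String id, long timestamp) { this.id = id; this.timestamp = timestamp; }

        @Override public int compareTo(CompositeKey o) {
            int c = id.compareTo(o.id);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof CompositeKey)) return false;
            CompositeKey k = (CompositeKey) o;
            return id.equals(k.id) && timestamp == k.timestamp;
        }
        // Stable across JVM instances: derived only from the field values,
        // unlike an enum's identity-based hashCode().
        @Override public int hashCode() { return Objects.hash(id, timestamp); }
    }

    // Consistent with equals: compareTo() == 0 exactly when equals() is true.
    static boolean consistent(CompositeKey a, CompositeKey b) {
        return (a.compareTo(b) == 0) == a.equals(b);
    }
    // Symmetric: sgn(a.compareTo(b)) == -sgn(b.compareTo(a)).
    static boolean symmetric(CompositeKey a, CompositeKey b) {
        return Integer.signum(a.compareTo(b)) == -Integer.signum(b.compareTo(a));
    }

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey("x", 1), b = new CompositeKey("x", 1),
                     c = new CompositeKey("y", 0);
        // Reflexive, consistent, symmetric:
        System.out.println(a.compareTo(a) == 0 && consistent(a, b) && symmetric(a, c));
    }
}
```

These properties matter because the shuffle partitions by hashCode() and sorts by compareTo(); a key that violates them sends equal records to different reducers or orders them inconsistently.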
Quality Assurance Testing
- The output of processing large amounts of data is often large
- Verify statistical properties:
  - If the statistical tests fit within MapReduce, then use MR
  - If not, then sample the dataset with MR and verify with R, Python, or whatever
- Do outlier analysis and thresholding-based QA
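One hypothetical sketch of thresholding-based outlier QA on a sampled metric, using a simple z-score cutoff (the data, metric, and threshold are all illustrative; real QA would pick tests suited to the distribution):

```java
import java.util.*;

// Illustrative thresholding-based QA: flag sampled metric values whose
// z-score exceeds a cutoff. Data and threshold are hypothetical.
public class OutlierCheck {
    public static List<Double> outliers(double[] xs, double zThreshold) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / xs.length); // population standard deviation
        List<Double> flagged = new ArrayList<>();
        for (double x : xs)
            if (sd > 0 && Math.abs(x - mean) / sd > zThreshold)
                flagged.add(x);
        return flagged;
    }

    public static void main(String[] args) {
        // e.g. per-partition output row counts sampled from a job run
        double[] rowCounts = {100, 101, 99, 102, 98, 100, 5000};
        System.out.println(outliers(rowCounts, 2.0)); // [5000.0]
    }
}
```

A flagged value does not prove the job is wrong, but it tells you which partition or run to pull into an investigatory job.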
Debugging Methodologies
- Better to catch it at the unit test level
- If you can't, I suggest the following technique:
  - Write an investigatory map-reduce job to find the data causing the issue
  - A single point if you're lucky; if not, then a random sample using reservoir sampling
  - Take the data and integrate it into a unit test
- DO NOT:
  - Use print statements to debug unless you're sure of the scope
  - Use counters where the group or name count grows more than a fixed amount
- DO:
  - Use a single counter in the actual job if the job doesn't finish
  - Use a map-reduce job that outputs suspect input data into HDFS
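Reservoir sampling keeps a uniform k-element sample in one pass and O(k) memory over a stream whose length you don't know in advance, which is exactly the shape of an investigatory job's input. A self-contained sketch (names are illustrative):

```java
import java.util.*;

// Reservoir sampling (Algorithm R): after n items, each item survives in
// the reservoir with probability k/n. Useful for pulling a debuggable
// sample of suspect records out of a large input.
public class ReservoirSampler {
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);                       // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < k)
                    reservoir.set((int) j, item);          // replace with probability k/seen
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) records.add(i);
        System.out.println(sample(records.iterator(), 5, new Random(42)));
    }
}
```

In a real investigatory job each mapper would keep its own reservoir and a final pass would merge them; the sampled records then seed the unit test.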
Hadoop Opinions
- "We're about a year behind Google" – Doug Cutting, Hadoop World
- Giraph and Mahout are just not there yet
- HBase is getting there (Facebook is dragging HBase into being serious)
- ZooKeeper is the real deal
- Cassandra is cool, but eventual consistency is too hard to seriously consider
Big Data
- We kind of went overboard w.r.t. MapReduce
  - Easier than MPI, but really not as flexible
  - Bringing distributed computing to the masses... meh, maybe the masses don't need it
  - MR v2 opens up a broader horizon
- Data analysis is hard and often requires specialized skills
  - Enter a new breed: the data scientist
  - Stats + computer science + domain knowledge
  - Often not a software engineer
Conclusion
- Thanks for your attention
- Follow me on Twitter: @casey_stella
- Find me at:
  - http://caseystella.com
  - https://github.com/cestella
- P.S. If you dig this stuff, come work with me.