Using Hadoop: Best Practices
Casey Stella
March 14, 2012
Table of Contents
Introduction
Background
Using Hadoop Professionally (Indexing, Performance)
Staying Sane (Testing, Debugging)
State of Big Data and Hadoop
Conclusion
Introduction

- Hi, I'm Casey
- I work at Explorys
- I work with Hadoop and the Hadoop ecosystem daily
- I'm going to talk about some of the best practices that I've seen
- Some of these are common knowledge
- Some of these don't show up until you've been up 'til 3 AM debugging a problem
- These are my opinions and not necessarily the opinions of my employer
The Lay of the Land – The Bad

- There are two APIs: the mapreduce and the mapred packages
  - Prefer the mapred package; it is deprecated, but still preferred
  - Hortonworks just kind of screwed up
- The Pipes interface is really poorly implemented and very slow
- HDFS currently has a single point of failure
The Lay of the Land – The Good

- Hortonworks is actively working on Map-Reduce v2
  - This means other distributed computing models
  - Included in 0.23
- HDFS is dramatically faster in 0.23
  - Socket communication is made more efficient
  - Smarter checksumming
Indexing

- Hadoop is a batch processing system, but you need realtime access
- Options are:
  - Roll your own (Jimmy Lin talks about how one might serve up inverted indices in Chapter 3)
  - Use an open source indexing infrastructure, like Katta
  - Serve them directly from HDFS with an on-disk index, a.k.a. Hadoop MapFiles
  - Serve them through HBase or Cassandra
  - If data permits, push them to a database
- Katta can serve up both Lucene indices and MapFiles
- Indexing is hard; be careful.
Performance Considerations

- There are setup and teardown costs, so keep the HDFS block size large
- Mappers, Reducers and Combiners have memory constraints
- Transmission costs dearly:
  - Use Snappy, LZO, or (soon) LZ4 compression at every phase
  - Serialize your objects tightly (e.g. not using Java Serialization)
  - Key/values emitted from the map phase had better be linear with a small constant, preferably below 1
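As a rough illustration of why tight serialization matters, here is a plain-Java sketch with no Hadoop dependencies (the Point class and method names are hypothetical stand-ins for a real Writable) contrasting a write/readFields pair against Java Serialization of the same two ints:

```java
import java.io.*;

// Illustrative sketch: hand-rolled, Writable-style serialization versus
// java.io Serialization. The Point class is hypothetical.
public class TightSerialization {
    public static class Point implements Serializable {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        // Writable-style contract: write exactly the bytes you need.
        void write(DataOutput out) throws IOException {
            out.writeInt(x);
            out.writeInt(y);
        }
        void readFields(DataInput in) throws IOException {
            x = in.readInt();
            y = in.readInt();
        }
    }

    static int tightSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        return bytes.size(); // 8 bytes: two ints
    }

    static int javaSerializedSize(Point p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(p); // class metadata dwarfs the 8-byte payload
        out.close();
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        Point p = new Point(3, 4);
        System.out.println("tight: " + tightSize(p) + " bytes, Java Serialization: "
                + javaSerializedSize(p) + " bytes");
    }
}
```

The gap is per record; multiplied by billions of shuffled key/value pairs, it dominates transmission cost.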
Performance Considerations
- Strategies:
  - Intelligent use of the combiners
  - Use local aggregation in the mapper to emit a more complex value (you already know this)
  - Ensure that all components of your keys are necessary in the sorting logic; if any are not, push them into the value
- Profile via JobConf.setProfileEnabled(boolean)¹
- Use Hadoop Vaidya²

¹ http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Profiling
² http://hadoop.apache.org/common/docs/current/vaidya.html
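The local-aggregation strategy can be sketched without any Hadoop dependencies; the class below is a hypothetical stand-in for a Mapper that buffers partial counts and emits once per distinct key when the task finishes, rather than emitting (word, 1) per token:

```java
import java.util.*;

// Sketch of in-mapper ("local") aggregation, no Hadoop dependencies:
// buffer partial counts in a map and emit once per distinct word at
// cleanup time. Class and method names are illustrative.
public class LocalAggregationMapper {
    private final Map<String, Long> buffer = new HashMap<>();

    // Analogous to Mapper.map(): aggregate locally instead of emitting per token.
    public void map(String line) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty())
                buffer.merge(word, 1L, Long::sum);
        }
    }

    // Analogous to Mapper.cleanup()/close(): one emit per distinct key,
    // shrinking shuffle traffic from O(tokens) to O(distinct words).
    public List<String> close() {
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<String, Long> e : buffer.entrySet())
            emitted.add(e.getKey() + "=" + e.getValue());
        Collections.sort(emitted);
        return emitted;
    }

    public static void main(String[] args) {
        LocalAggregationMapper m = new LocalAggregationMapper();
        m.map("to be or not to be");
        System.out.println(m.close()); // [be=2, not=1, or=1, to=2]
    }
}
```

Unlike a combiner, which the framework may or may not run, this aggregation is guaranteed to happen, at the cost of holding the buffer within the mapper's memory constraints.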
Unit/Integration Testing Methodologies
- First off, do it.
- Unit test individual mappers, reducers, combiners and partitioners
  - Actual unit tests. This will help debugging, I promise.
  - Design components so that dependencies can be injected via polymorphism when testing
- Minimally verify that keys:
  - Can be serialized and deserialized
  - hashCode() is sensible (remember: the hashCode() for an enum is not stable across different JVM instances)
  - compareTo() is reflexive, symmetric, and consistent with equals()
- Integration test via single-user mode Hadoop
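A minimal sketch of these key checks in plain Java (the CompositeKey class and its fields are hypothetical; a real key would implement WritableComparable):

```java
import java.util.Objects;

// Minimal sketch of verifying a composite key's contracts outside Hadoop.
// The CompositeKey class is hypothetical.
public class KeyContractCheck {
    public static class CompositeKey implements Comparable<CompositeKey> {
        final String id;
        final long timestamp;
        CompositeKey(String id, long timestamp) { this.id = id; this.timestamp = timestamp; }

        @Override public int compareTo(CompositeKey o) {
            int c = id.compareTo(o.id);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof CompositeKey)) return false;
            CompositeKey k = (CompositeKey) o;
            return id.equals(k.id) && timestamp == k.timestamp;
        }
        // Stable across JVM instances: derived only from the field values,
        // unlike an enum's identity-based hashCode().
        @Override public int hashCode() { return Objects.hash(id, timestamp); }
    }

    // Consistent with equals: compareTo() == 0 exactly when equals() is true.
    static boolean consistent(CompositeKey a, CompositeKey b) {
        return (a.compareTo(b) == 0) == a.equals(b);
    }
    // Symmetric: sgn(a.compareTo(b)) == -sgn(b.compareTo(a)).
    static boolean symmetric(CompositeKey a, CompositeKey b) {
        return Integer.signum(a.compareTo(b)) == -Integer.signum(b.compareTo(a));
    }

    public static void main(String[] args) {
        CompositeKey a = new CompositeKey("x", 1), b = new CompositeKey("x", 1),
                     c = new CompositeKey("y", 0);
        // Reflexive, consistent, symmetric:
        System.out.println(a.compareTo(a) == 0 && consistent(a, b) && symmetric(a, c));
    }
}
```

These properties matter because the shuffle partitions by hashCode() and sorts by compareTo(); a key that violates them sends equal records to different reducers or orders them inconsistently.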
Quality Assurance Testing
- The output of processing large amounts of data is often large
- Verify statistical properties:
  - If the statistical tests fit within MapReduce, then use MR
  - If not, then sample the dataset with MR and verify with R, Python, or whatever
- Do outlier analysis and thresholding-based QA
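One hypothetical sketch of thresholding-based outlier QA on a sampled metric, using a simple z-score cutoff (the data, metric, and threshold are all illustrative; real QA would pick tests suited to the distribution):

```java
import java.util.*;

// Illustrative thresholding-based QA: flag sampled metric values whose
// z-score exceeds a cutoff. Data and threshold are hypothetical.
public class OutlierCheck {
    public static List<Double> outliers(double[] xs, double zThreshold) {
        double mean = 0;
        for (double x : xs) mean += x;
        mean /= xs.length;
        double var = 0;
        for (double x : xs) var += (x - mean) * (x - mean);
        double sd = Math.sqrt(var / xs.length); // population standard deviation
        List<Double> flagged = new ArrayList<>();
        for (double x : xs)
            if (sd > 0 && Math.abs(x - mean) / sd > zThreshold)
                flagged.add(x);
        return flagged;
    }

    public static void main(String[] args) {
        // e.g. per-partition output row counts sampled from a job run
        double[] rowCounts = {100, 101, 99, 102, 98, 100, 5000};
        System.out.println(outliers(rowCounts, 2.0)); // [5000.0]
    }
}
```

A flagged value does not prove the job is wrong, but it tells you which partition or run to pull into an investigatory job.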
Debugging Methodologies
- Better to catch it at the unit test level
- If you can't, I suggest the following technique:
  - Write an investigatory map-reduce job to find the data causing the issue
  - A single point if you're lucky; if not, then a random sample using reservoir sampling
  - Take the data and integrate it into a unit test
- DO NOT:
  - Use print statements to debug unless you're sure of the scope
  - Use counters where the group or name count grows more than a fixed amount
- DO:
  - Use a single counter in the actual job if the job doesn't finish
  - Use a map-reduce job that outputs suspect input data into HDFS
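Reservoir sampling keeps a uniform k-element sample in one pass and O(k) memory over a stream whose length you don't know in advance, which is exactly the shape of an investigatory job's input. A self-contained sketch (names are illustrative):

```java
import java.util.*;

// Reservoir sampling (Algorithm R): after n items, each item survives in
// the reservoir with probability k/n. Useful for pulling a debuggable
// sample of suspect records out of a large input.
public class ReservoirSampler {
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(item);                       // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * seen); // uniform in [0, seen)
                if (j < k)
                    reservoir.set((int) j, item);          // replace with probability k/seen
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) records.add(i);
        System.out.println(sample(records.iterator(), 5, new Random(42)));
    }
}
```

In a real investigatory job each mapper would keep its own reservoir and a final pass would merge them; the sampled records then seed the unit test.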
Hadoop Opinions
- "We're about a year behind Google" – Doug Cutting, Hadoop World
- Giraph and Mahout are just not there yet
- HBase is getting there (Facebook is dragging HBase into being serious)
- ZooKeeper is the real deal
- Cassandra is cool, but eventual consistency is too hard to seriously consider
Big Data
- We kind of went overboard w.r.t. MapReduce
  - Easier than MPI, but really not as flexible
  - Bringing distributed computing to the masses... meh, maybe the masses don't need it
  - MR v2 opens up a broader horizon
- Data analysis is hard and often requires specialized skills
  - Enter a new breed: the data scientist
  - Stats + computer science + domain knowledge
  - Often not a software engineer
Conclusion
- Thanks for your attention
- Follow me on Twitter: @casey_stella
- Find me at:
  - http://caseystella.com
  - https://github.com/cestella
- P.S. If you dig this stuff, come work with me.