
Using Hadoop: Best Practices

Casey Stella

March 14, 2012


Table of Contents

Introduction
Background
Using Hadoop Professionally
  - Indexing
  - Performance
Staying Sane
  - Testing
  - Debugging
State of Big Data and Hadoop
Conclusion


Introduction

- Hi, I'm Casey
- I work at Explorys
- I work with Hadoop and the Hadoop ecosystem daily
- I'm going to talk about some of the best practices that I've seen
  - Some of these are common knowledge
  - Some of these don't show up until you've been up 'til 3 AM debugging a problem
- These are my opinions and not necessarily the opinions of my employer


The Lay of the Land – The Bad

- There are two APIs; prefer the mapred package
  - The mapreduce and the mapred packages
  - mapred is deprecated, but still preferred
  - Hortonworks just kind of screwed up
- The Pipes interface is really poorly implemented and very slow
- HDFS currently has a single point of failure
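For reference, here is a minimal sketch of what writing against the older (deprecated but still preferred) mapred API looks like; the class name and tokenization are illustrative, not from the original slides:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Word-count mapper in the mapred (old-style) interface: implement
// Mapper and push output through an OutputCollector, with a Reporter
// for counters and progress.
public class WordCountMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);
  private final Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    for (String token : value.toString().split("\\s+")) {
      word.set(token);
      output.collect(word, ONE);
    }
  }
}
```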


The Lay of the Land – The Good

- Hortonworks is actively working on Map-Reduce v2
  - This means other distributed computing models
  - Included in 0.23
- HDFS is dramatically faster in 0.23
  - Socket communication is made more efficient
  - Smarter checksumming


Indexing

- Hadoop is a batch processing system, but you need realtime access
- Options are:
  - Roll your own (Jimmy Lin talks about how one might serve up inverted indices in Chapter 3)
  - Use an open source indexing infrastructure, like Katta
  - Serve them directly from HDFS with an on-disk index, a.k.a. Hadoop MapFiles
  - Serve them through HBase or Cassandra
  - If data permits, push them to a database
- Katta can serve up both Lucene indices and MapFiles
- Indexing is hard; be careful.
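As a sketch of the MapFile option: a MapFile is a sorted SequenceFile plus a small index, so a point lookup from HDFS is an in-memory binary search followed by a seek. The index path and key below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

// Serve a point query straight out of a MapFile stored in HDFS.
public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // "/indices/terms" is a placeholder MapFile directory.
    MapFile.Reader reader = new MapFile.Reader(fs, "/indices/terms", conf);
    try {
      Text value = new Text();
      // get() consults the in-memory index, then seeks the data file;
      // it returns null when the key is absent.
      if (reader.get(new Text("hadoop"), value) != null) {
        System.out.println("hadoop -> " + value);
      }
    } finally {
      reader.close();
    }
  }
}
```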


Performance Considerations

- There are setup and teardown costs per task, so keep the HDFS block size large
- Mappers, Reducers, and Combiners have memory constraints
- Transmission costs dearly
  - Use Snappy, LZO, or (soon) LZ4 compression at every phase
  - Serialize your objects tightly (e.g., not using Java Serialization)
  - The volume of key/value data emitted from the map phase had better be linear in the input with a small constant, preferably below 1
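A minimal sketch of turning on compression for both the shuffle and the final output via the old-API JobConf; the helper class is mine, and SnappyCodec assumes the native snappy library is installed on every node (fall back to the LZO or gzip codecs if it is not):

```java
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionSetup {
  // Compress intermediate map output (what crosses the network in the
  // shuffle) and the job's final output files.
  public static void enableSnappy(JobConf conf) {
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(SnappyCodec.class);
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, SnappyCodec.class);
  }
}
```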


- Strategies:
  - Intelligent use of the combiners
  - Use local aggregation in the mapper to emit a more complex value (you already know this; see the sketch below)
  - Ensure that all components of your keys are necessary in the sorting logic. If any are not, push them into the value.
- Profile with JobConf.setProfileEnabled(boolean) [1]
- Use Hadoop Vaidya [2]

[1] http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Profiling
[2] http://hadoop.apache.org/common/docs/current/vaidya.html
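The local aggregation point is the classic in-mapper combining pattern; here is a word-count sketch against the mapred API (class name mine; real code should bound the map's size and flush early rather than hold everything to close()):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Buffer partial counts in the mapper and emit them once in close(),
// instead of emitting (word, 1) per token. This cuts shuffle volume
// at the cost of mapper memory.
public class LocalAggregationMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {
  private final Map<String, Long> counts = new HashMap<String, Long>();
  private OutputCollector<Text, LongWritable> out;

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    this.out = output; // stash the collector so close() can flush
    for (String token : value.toString().split("\\s+")) {
      Long c = counts.get(token);
      counts.put(token, c == null ? 1L : c + 1L);
    }
  }

  @Override
  public void close() throws IOException {
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      out.collect(new Text(e.getKey()), new LongWritable(e.getValue()));
    }
  }
}
```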


Unit/Integration Testing Methodologies

- First off, do it.
- Unit test individual mappers, reducers, combiners, and partitioners
  - Actual unit tests. This will help debugging, I promise.
  - Design components so that dependencies can be injected via polymorphism when testing
- Minimally verify that keys (see the test sketch after this list):
  - Can be serialized and deserialized
  - hashCode() is sensible (remember: the hashCode() for an enum is not stable across different JVM instances)
  - compareTo() is reflexive, symmetric, and consistent with equals()
- Integration test via single-node (standalone mode) Hadoop
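A minimal JUnit sketch of those key checks, using Text as a stand-in for your custom Writable key class:

```java
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.Text;
import org.junit.Test;

// Round-trip a key through Writable serialization and verify the
// contracts listed above.
public class KeyContractTest {
  @Test
  public void keySurvivesSerialization() throws Exception {
    Text original = new Text("member|2012-03-14");

    DataOutputBuffer out = new DataOutputBuffer();
    original.write(out);

    DataInputBuffer in = new DataInputBuffer();
    in.reset(out.getData(), out.getLength());
    Text copy = new Text();
    copy.readFields(in);

    assertEquals(original, copy);                       // equals() survives the trip
    assertEquals(original.hashCode(), copy.hashCode()); // hashCode() is value-based
    assertEquals(0, original.compareTo(copy));          // compareTo() agrees with equals()
  }
}
```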


Quality Assurance Testing

- The output of processing large amounts of data is often itself large
- Verify statistical properties
  - If the statistical tests fit within Map Reduce, then use MR
  - If not, then sample the dataset with MR and verify with R, Python, or whatever
- Do outlier analysis and thresholding-based QA (see the sketch below)
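A toy illustration of thresholding-based QA on a sampled metric; the 3-sigma cutoff in the usage example is my assumption, not from the talk:

```java
import java.util.ArrayList;
import java.util.List;

// Flag a sample whose values stray more than `sigmas` standard
// deviations from the sample mean. Tune the cutoff to your data.
public class ThresholdQA {
  public static boolean hasOutliers(List<Double> sample, double sigmas) {
    double mean = 0;
    for (double v : sample) mean += v;
    mean /= sample.size();

    double var = 0;
    for (double v : sample) var += (v - mean) * (v - mean);
    double stddev = Math.sqrt(var / sample.size());

    for (double v : sample) {
      if (Math.abs(v - mean) > sigmas * stddev) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    List<Double> sample = new ArrayList<Double>();
    for (int i = 0; i < 20; i++) sample.add(10.0 + (i % 5) * 0.1);
    sample.add(100.0); // injected bad value
    System.out.println(hasOutliers(sample, 3.0)); // true: 100.0 is flagged
  }
}
```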


Debugging Methodologies

- Better to catch it at the unit test level
- If you can't, I suggest the following technique:
  - Run an investigatory map reduce job to find the data causing the issue
  - A single data point if you're lucky; if not, then a random sample using reservoir sampling (sketched after this list)
  - Take the data and integrate it into a unit test
- DO NOT:
  - Use print statements to debug unless you're sure of the scope
  - Use counters where the group or name count grows more than a fixed amount
- DO:
  - Use a single counter in the actual job if the job doesn't finish
  - Use a map reduce job that outputs suspect input data into HDFS
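Reservoir sampling (Algorithm R) keeps a uniform random sample of k records from a stream of unknown length in O(k) memory; a standalone sketch, where in an investigatory job each suspect record would pass through add():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Algorithm R: after n items have been seen, every item has an equal
// k/n chance of being in the reservoir.
public class ReservoirSampler<T> {
  private final List<T> reservoir;
  private final int k;
  private final Random rng = new Random();
  private long seen = 0;

  public ReservoirSampler(int k) {
    this.k = k;
    this.reservoir = new ArrayList<T>(k);
  }

  public void add(T item) {
    seen++;
    if (reservoir.size() < k) {
      reservoir.add(item); // fill the reservoir first
    } else {
      // Keep the new item with probability k / seen by overwriting
      // a uniformly chosen slot.
      long j = (long) (rng.nextDouble() * seen);
      if (j < k) {
        reservoir.set((int) j, item);
      }
    }
  }

  public List<T> sample() {
    return reservoir;
  }
}
```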


Hadoop Opinions

- "We're about a year behind Google" – Doug Cutting, Hadoop World
- Giraph and Mahout are just not there yet
- HBase is getting there (Facebook is dragging HBase into being serious)
- Zookeeper is the real deal
- Cassandra is cool, but eventual consistency is too hard to seriously consider.


Big Data

- We kind of went overboard w.r.t. Map Reduce
  - Easier than MPI, but really not as flexible
  - Bringing distributed computing to the masses... meh, maybe the masses don't need it
  - M.R. v2 opens up a broader horizon
- Data analysis is hard and often requires specialized skills
  - Enter a new breed: the data scientist
  - Stats + Computer Science + Domain knowledge
  - Often not a software engineer


Conclusion

- Thanks for your attention
- Follow me on twitter: @casey_stella
- Find me at:
  - http://caseystella.com
  - https://github.com/cestella
- P.S. If you dig this stuff, come work with me.