Hadoop - Lessons Learned


Hadoop has proven to be an invaluable tool for many companies over the past few years. Yet it has its ways, and knowing them up front can save valuable time. This session is a rundown of the ever-recurring lessons learned from running various Hadoop clusters in production since version 0.15. What to expect from Hadoop, and what not? How to integrate Hadoop into existing infrastructure? Which data formats to use? What compression? Small files vs. big files? Append or not? Essential configuration and operations tips. What about querying all the data? The project, the community, and pointers to interesting projects that complement the Hadoop experience.

Transcript of Hadoop - Lessons Learned

  • 1. Hadoop lessons learned

  • 2. @tcurdt, github.com/tcurdt, yourdailygeekery.com
  • 3. Data
  • 4. hiring
  • 5. Agenda
      hadoop? really?
      cloud?
      integration
      mapreduce
      operations
      community and outlook
  • 6. Why Hadoop?
  • 7. It is a new and improved version of enterprise tape drive
  • 8. 20 machines, Map Reduce
      20 files, 1.5 GB each
      hadoop job grep.jar
      grep needle file
      [chart; axis ticks 0, 17.5, 35.0, 52.5, 70.0; labels not recoverable]
  • 9. Run your own?
      http://bit.ly/elastic-mr-pig
  • 10. Integration
  • 11. black box
  • 12. Engineers
      hadoop-cat
      hadoop-grep
      hadoop-range --prefix /logs --from 2012-05-15 --until 2012-05-22 --postfix /*play*.seq | xargs hadoop jar
      streaming jobs
  • 13. Non-Engineering Folks
      mount hdfs
      pig / hive
      data dumps
  • 14. Map Reduce
      HDFS files
      InputFormat
      Split | Split | Split | Split
      Map | Map | Map | Map
      Combiner | Combiner | Combiner | Combiner
      Sort | Sort | Sort | Sort
      Partitioner
      Copy and Merge
      Combiner | Combiner
      Reducer | Reducer
      OutputFormat
  • 15. Job Counters
      12/05/25 01:27:38 INFO mapred.JobClient: Reduce input records=106
      ..
      12/05/25 01:27:38 INFO mapred.JobClient: Combine output records=409
      12/05/25 01:27:38 INFO mapred.JobClient: Map input records=112705844
      12/05/25 01:27:38 INFO mapred.JobClient: Reduce output records=4
      12/05/25 01:27:38 INFO mapred.JobClient: Combine input records=64842079
      ..
      12/05/25 01:27:38 INFO mapred.JobClient: Map output records=64841776

      map in      : 112705844 *********************************
      map out     :  64841776 *****************
      combine in  :  64842079 *****************
      combine out :       409 |
      reduce in   :       106 |
      reduce out  :         4 |

      MAPREDUCE-346 (since 2009)
  • 16. Job Counters
      map in      : 20000 **************
      map out     : 40000 ******************************
      combine in  : 40000 ******************************
      combine out : 10001 ********
      reduce in   : 10001 ********
      reduce out  : 10001 ********
  • 17. Map-only
      mapred.reduce.tasks = 0
  • 18. EOF on append
      public class EofSafeSequenceFileInputFormat extends SequenceFileInputFormat {
        ...
      }
      public class EofSafeRecordReader extends RecordReader {
        ...
        public boolean nextKeyValue() throws IOException, InterruptedException {
          try {
            return this.delegate.nextKeyValue();
          } catch (EOFException e) {
            return false;
          }
        }
        ...
      }
  • 19. Serialization
      before: ASN1, custom java serialization, Thrift
      now: protobuf
  • 20. Custom Writables
      public static class Play extends CustomWritable {
        public final LongWritable time = new LongWritable();
        public final LongWritable owner_id = new LongWritable();
        public final LongWritable track_id = new LongWritable();
        public Play() {
          fields = new WritableComparable[] { owner_id, track_id, time };
        }
      }
  • 21. Fear the State
      BytesWritable bytes = new BytesWritable();
      ...
      byte[] buffer = bytes.getBytes();
  • 22. Re-Iterate
      public void reduce(LongTriple key, Iterable<LongWritable> values, Context ctx) {
        for (LongWritable v : values) { }
        for (LongWritable v : values) { }
      }

      public void reduce(LongTriple key, Iterable<LongWritable> values, Context ctx) {
        buffer.clear();
        for (LongWritable v : values) { buffer.add(v); }
        for (LongWritable v : buffer.values()) { }
      }

      HADOOP-5266 (applied to 0.21.0)
  • 23. BitSets
      long min = 1;
      long max = 10000000;
      FastBitSet set = new FastBitSet(min, max);
      for(long i = min; i
  • [Transcript gap: the text breaks off above and resumes mid-slide below. Slide numbers 24 to 36 do not appear in the transcript.]
      topology.script.file.name /path/to/script/location-from-ip true

      topology script
      #!/usr/bin/ruby
      location = {
        'hdd001.some.net' => '/ams/1',
        [key missing in transcript] => '/ams/1',
        'hdd002.some.net' => '/ams/2',
        [key missing in transcript] => '/ams/2',
      }
      puts ARGV.map { |ip| location[ip] || '/default-rack' }.join(' ')
  • 37. Fix the Policy
      for f in `hdfs hadoop fsck / | grep "Replica placement policy is violated" | awk -F: '{print $1}' | sort | uniq | head -n1000`; do
        hadoop fs -setrep -w 4 $f
        hadoop fs -setrep 3 $f
      done
  • 38. Fsck
      hadoop fsck / -openforwrite -files | grep -i "OPENFORWRITE: MISSING 1 blocks of total size" | awk '{print $1}' | xargs -L 1 -i hadoop dfs -mv {} /lost+notfound
  • 39. Community
      hadoop* from markmail.org
  • 40. Community
      The Enterprise Effect
      The Community Effect (in 2011)
  • 41. Community
      core mapreduce * from markmail.org
  • 42. The Future
      incremental
      real time
      refined API
      flexible pipelines
      refined implementation
  • 43. Real Time Datamining and Aggregation at Scale (Ted Dunning)
      Eventually Consistent Data Structures (Sean Cribbs)
      Real-time Analytics with HBase (Alex Baranau)
      Profiling and performance-tuning your Hadoop pipelines (Aaron Beppu)
      From Batch to Realtime with Hadoop (Lars George)
      Event-Stream Processing with Kafka (Tim Lossen)
      Real-/Neartime analysis with Hadoop & VoltDB (Ralf Neeb)
  • 44. Take Aways
      use hadoop only if you must
      really understand the pipeline
      unbox the black box
  • 45. That's it
      [email protected], github.com/tcurdt, yourdailygeekery.com
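Note on slide 22 (Re-Iterate): the slide buffers the reduce values before walking them a second time. One detail matters when doing so: Hadoop reuses the same value object while iterating, so a buffer needs to hold copies, not references. The sketch below simulates that behaviour in plain Java with no Hadoop on the classpath. `MutableLong` and `reusingValues` are hypothetical stand-ins for `LongWritable` and the framework's values iterable, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReiterateSketch {

    /** Hypothetical stand-in for a mutable value type such as LongWritable. */
    static final class MutableLong {
        long value;
    }

    /**
     * Simulates a single-pass values iterable that returns the SAME object
     * on every step and only changes its contents.
     */
    static Iterable<MutableLong> reusingValues(long... data) {
        final MutableLong shared = new MutableLong();
        final int[] pos = {0}; // shared cursor: a second iteration finds nothing left
        return () -> new Iterator<MutableLong>() {
            @Override
            public boolean hasNext() {
                return pos[0] < data.length;
            }

            @Override
            public MutableLong next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.value = data[pos[0]++];
                return shared;
            }
        };
    }

    /** Buffers references: every entry ends up pointing at the one shared object. */
    static List<MutableLong> bufferReferences(Iterable<MutableLong> values) {
        List<MutableLong> buffer = new ArrayList<>();
        for (MutableLong v : values) {
            buffer.add(v);
        }
        return buffer;
    }

    /** Buffers copies of the contents, so the buffer can be walked again safely. */
    static List<Long> bufferCopies(Iterable<MutableLong> values) {
        List<Long> buffer = new ArrayList<>();
        for (MutableLong v : values) {
            buffer.add(v.value);
        }
        return buffer;
    }

    public static void main(String[] args) {
        List<MutableLong> refs = bufferReferences(reusingValues(1, 2, 3));
        List<Long> copies = bufferCopies(reusingValues(1, 2, 3));
        // prints "references: 3, 3, 3"
        System.out.println("references: " + refs.get(0).value + ", "
                + refs.get(1).value + ", " + refs.get(2).value);
        // prints "copies: [1, 2, 3]"
        System.out.println("copies: " + copies);
    }
}
```

Whether the slide's `buffer.add(v)` copies internally is not visible in the transcript; the sketch only shows why it would have to.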
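Note on slide 18 (EOF on append): the slide wraps a record reader so that an `EOFException` ends the input instead of failing the task. The same idea can be sketched with plain `java.io`, assuming fixed 8-byte records and a file whose last record was only partly written. `EofSafeSketch`, `readAll` and `sample` are hypothetical names for illustration and are not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class EofSafeSketch {

    /** Reads 8-byte records until the data ends, even if it ends mid-record. */
    static List<Long> readAll(byte[] data) {
        List<Long> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (true) {
                // readLong throws EOFException if fewer than 8 bytes remain
                records.add(in.readLong());
            }
        } catch (EOFException e) {
            // Clean end or truncated trailing record: keep what was read so far
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds `complete` whole records followed by `strayBytes` of a partial one. */
    static byte[] sample(int complete, int strayBytes) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (long i = 0; i < complete; i++) {
                out.writeLong(i);
            }
            for (int i = 0; i < strayBytes; i++) {
                out.writeByte(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Three whole records plus five stray bytes: prints [0, 1, 2]
        System.out.println(readAll(sample(3, 5)));
    }
}
```

The trade-off is that a truncated record is dropped silently, so genuine corruption at the end of a file looks the same as a file that is still being written.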