Hadoop - Lessons Learned

Description

Hadoop has proven to be an invaluable tool for many companies over the past few years. Yet it has its ways, and knowing them up front can save valuable time. This session is a rundown of the ever-recurring lessons learned from running various Hadoop clusters in production since version 0.15. What to expect from Hadoop - and what not? How to integrate Hadoop into existing infrastructure? Which data formats to use? What compression? Small files vs. big files? Append or not? Essential configuration and operations tips. What about querying all the data? The project, the community, and pointers to interesting projects that complement the Hadoop experience.

Transcript of Hadoop - Lessons Learned

Hadoop - lessons learned

@tcurdt
github.com/tcurdt

yourdailygeekery.com

Data

hiring

Agenda

· hadoop? really? cloud?
· integration
· mapreduce
· operations
· community and outlook

Why Hadoop?

“It is a new and improved version of enterprise tape drive”

20 machines, 20 files, 1.5 GB each

grep “needle” file

hadoop job grep.jar

[bar chart: runtime of grep vs. the hadoop job]

unfair

Map Reduce

Run your own?

http://bit.ly/elastic-mr-pig

Integration

black box

· hadoop-cat (see the sketch below)

· hadoop-grep

· hadoop-range --prefix /logs --from 2012-05-15 --until 2012-05-22 --postfix /*play*.seq | xargs hadoop jar

· streaming jobs

Engineers
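hadoop-cat, hadoop-grep and hadoop-range are the speaker's in-house wrappers and their code isn't shown; as a rough illustration of what unboxing the black box amounts to, here is a minimal hadoop-cat style sketch, assuming SequenceFile inputs and the old-style Reader API (class name and details are illustrative, not the original tool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// hypothetical stand-in for a hadoop-cat wrapper: dump a SequenceFile as text
public class HadoopCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(args[0]), conf);
    // instantiate whatever key/value classes the file was written with
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value);
    }
    reader.close();
  }
}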

· mount hdfs

· pig / hive

· data dumps

Non-Engineering Folks

Map Reduce

[diagram: MapReduce data flow - HDFS files are read by the InputFormat and divided into Splits; each Split runs through Map, Combiner and a local Sort; the Partitioner assigns map output to reducers; Copy and Merge ships it across; the Reducers write results through the OutputFormat]

MAPREDUCE-346 (since 2009)

12/05/25 01:27:38 INFO mapred.JobClient: Reduce input records=106
..
12/05/25 01:27:38 INFO mapred.JobClient: Combine output records=409
12/05/25 01:27:38 INFO mapred.JobClient: Map input records=112705844
12/05/25 01:27:38 INFO mapred.JobClient: Reduce output records=4
12/05/25 01:27:38 INFO mapred.JobClient: Combine input records=64842079
..
12/05/25 01:27:38 INFO mapred.JobClient: Map output records=64841776

map in      : 112705844 *********************************
map out     :  64841776 *****************
combine in  :  64842079 *****************
combine out :       409 |
reduce in   :       106 |
reduce out  :         4 |

Job Counters

map in      : 20000 **************
map out     : 40000 ******************************
combine in  : 40000 ******************************
combine out : 10001 ********
reduce in   : 10001 ********
reduce out  : 10001 ********

Job Counters
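The counter dumps above are the quickest sanity check for whether a combiner is actually earning its keep. A minimal sketch of pulling the same numbers out programmatically after the job finishes, assuming the newer mapreduce API and its built-in TaskCounter group:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CombinerCheck {
  // call after job.waitForCompletion(true) has returned
  static void printCombinerEffect(Job job) throws IOException {
    Counters counters = job.getCounters();
    long mapOut     = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
    long combineIn  = counters.findCounter(TaskCounter.COMBINE_INPUT_RECORDS).getValue();
    long combineOut = counters.findCounter(TaskCounter.COMBINE_OUTPUT_RECORDS).getValue();
    long reduceIn   = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
    // a combiner that barely shrinks its input is just burning CPU
    System.out.printf("map out %d, combine %d -> %d, reduce in %d%n",
        mapOut, combineIn, combineOut, reduceIn);
  }
}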

mapred.reduce.tasks = 0

Map-only
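mapred.reduce.tasks = 0 is the config-level knob; a minimal sketch of the same map-only setup through the Job API (job name is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnly {
  static Job mapOnlyJob(Configuration conf) throws Exception {
    Job job = new Job(conf, "map-only");   // Job.getInstance(conf, ...) on newer versions
    // zero reducers: map output goes straight to the OutputFormat,
    // skipping sort, shuffle and merge entirely
    job.setNumReduceTasks(0);
    return job;
  }
}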

public class EofSafeSequenceFileInputFormat<K,V>
    extends SequenceFileInputFormat<K,V> {
  ...
}

public class EofSafeRecordReader<K,V> extends RecordReader<K,V> {
  ...
  public boolean nextKeyValue() throws IOException, InterruptedException {
    try {
      return this.delegate.nextKeyValue();
    } catch (EOFException e) {
      return false;
    }
  }
  ...
}

EOF on append
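A one-liner, but worth spelling out: the tolerant input format only helps if the job actually uses it. A minimal wiring sketch, assuming the class above is on the classpath:

import org.apache.hadoop.mapreduce.Job;

public class EofSafeWiring {
  static void useEofSafeInput(Job job) {
    // tolerate EOFException on files that are still being appended to
    job.setInputFormatClass(EofSafeSequenceFileInputFormat.class);
  }
}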

ASN1, custom java serialization, Thrift

Serialization

before

now

protobuf

public static class Play extends CustomWritable {

  public final LongWritable time = new LongWritable();
  public final LongWritable owner_id = new LongWritable();
  public final LongWritable track_id = new LongWritable();

  public Play() {
    fields = new WritableComparable[] { owner_id, track_id, time };
  }
}

Custom Writables
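CustomWritable is the speaker's own base class and its code isn't shown; a hypothetical sketch of what such a base class could look like, assuming the fields array from the Play constructor drives (de)serialization in order:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// hypothetical sketch, not the original implementation
public abstract class CustomWritable implements Writable {

  // subclasses fill this in, e.g. { owner_id, track_id, time }
  protected WritableComparable[] fields;

  public void write(DataOutput out) throws IOException {
    for (WritableComparable field : fields) {
      field.write(out);
    }
  }

  public void readFields(DataInput in) throws IOException {
    for (WritableComparable field : fields) {
      field.readFields(in);
    }
  }
}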

BytesWritable bytes = new BytesWritable();
...
byte[] buffer = bytes.getBytes();

Fear the State
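The snippet above is the classic trap: getBytes() hands back the padded, reused backing array, so only the first getLength() bytes are valid. A minimal defensive-copy sketch:

import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public class BytesWritableCopy {
  // only the first getLength() bytes of the backing array are the payload;
  // the rest is padding, and the array is reused between records
  static byte[] payload(BytesWritable bytes) {
    return Arrays.copyOf(bytes.getBytes(), bytes.getLength());
  }
}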

public void reduce(LongTriple key, Iterable<LongWritable> values, Context ctx) {
  for (LongWritable v : values) {
  }
  for (LongWritable v : values) {
  }
}

public void reduce(LongTriple key, Iterable<LongWritable> values, Context ctx) {
  buffer.clear();
  for (LongWritable v : values) {
    buffer.add(v);
  }
  for (LongWritable v : buffer.values()) {
  }
}

Re-Iterate

HADOOP-5266 (applied to 0.21.0)
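The buffer in the slide is the speaker's own helper; a minimal sketch of the same idea with a plain ArrayList. The copy matters: Hadoop reuses the same LongWritable instance for every value, so buffering the references alone would leave a list of identical entries.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;

public class ReIterate {
  static void twoPasses(Iterable<LongWritable> values) {
    List<Long> buffer = new ArrayList<Long>();
    for (LongWritable v : values) {
      buffer.add(v.get());   // copy the primitive, not the reused Writable
    }
    for (long v : buffer) {
      // second pass over the buffered copies
    }
  }
}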

long min = 1;
long max = 10000000;

FastBitSet set = new FastBitSet(min, max);

for (long i = min; i < max; i++) {
  set.set(i);
}

BitSets

org.apache.lucene.util.*BitSet
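One of the org.apache.lucene.util variants (in Lucene 3.x/4.x) is OpenBitSet; a rough sketch of the same loop with it, assuming the FastBitSet(min, max) range maps onto a zero-based Lucene bit set:

import org.apache.lucene.util.OpenBitSet;

public class BitSetExample {
  public static void main(String[] args) {
    long min = 1;
    long max = 10000000;
    // OpenBitSet is zero-based, so shift the [min, max) range down by min
    OpenBitSet set = new OpenBitSet(max - min);
    for (long i = min; i < max; i++) {
      set.set(i - min);
    }
    System.out.println(set.cardinality());
  }
}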

Data Structures

http://bit.ly/data-structures
http://bit.ly/bloom-filters
http://bit.ly/stream-lib

General Tips

· test on small datasets, test on your machine

· many reducers

· always consider a combiner and partitioner (see the sketch below)

· pig / streaming for one-time jobs, java/scala for recurring

http://bit.ly/map-reduce-book
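For the combiner/partitioner tip above, a minimal wiring sketch against the newer mapreduce API. SumReducer and ModPartitioner are placeholder classes, and reusing the reducer as the combiner only works because a sum is commutative and associative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

// placeholder classes, for illustration only
public class CombinerAndPartitioner {

  // sums can be pre-aggregated map-side, so the reducer doubles as the combiner
  public static class SumReducer
      extends Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {
    @Override
    protected void reduce(LongWritable key, Iterable<LongWritable> values, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      ctx.write(key, new LongWritable(sum));
    }
  }

  // route by key so related records end up on the same reducer
  public static class ModPartitioner extends Partitioner<LongWritable, LongWritable> {
    @Override
    public int getPartition(LongWritable key, LongWritable value, int numPartitions) {
      return (int) ((key.get() & Long.MAX_VALUE) % numPartitions);
    }
  }

  static void wire(Job job) {
    job.setReducerClass(SumReducer.class);
    job.setCombinerClass(SumReducer.class);
    job.setPartitionerClass(ModPartitioner.class);
    job.setNumReduceTasks(42);   // "many reducers"
  }
}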

Operations

pdsh -w "hdd[001-019]" \
  "sudo sv restart /etc/sv/hadoop-tasktracker"

runit / init.d

pdsh / dsh

use chef / puppet

Hardware

· 2x name nodes, raid 1

· 12 cores, 48GB RAM, xfs, 2x1TB

· n x data nodes, no raid

· 12 cores, 16GB RAM, xfs, 4x2TB

Monitoring

dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
dfs.period=10
dfs.servers=...

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=...

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=...

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=...

# ignore
ugi.class=org.apache.hadoop.metrics.spi.NullContext

Monitoring

[graphs: total capacity / capacity used]

Compression

# of 64MB blocks
# of bytes needed
# of bytes used
# bytes reclaimed

bzip2 / gzip / lzo / snappy

io.seqfile.compression.type = BLOCK
io.seqfile.compression.blocksize = 512000
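The same choice can also be made per file when writing. A minimal sketch using the older SequenceFile.createWriter signature, with Snappy as the codec; the path and key/value types are throwaway placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class WriteCompressedSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // BLOCK compression batches many records per compressed block,
    // which is what io.seqfile.compression.blocksize tunes
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/example.seq"),
        LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK, new SnappyCodec());
    writer.append(new LongWritable(1L), new Text("one record"));
    writer.close();
  }
}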

Janitor

hadoop-expire -url namenode.here -path /tmp -mtime 7d -delete

The last block of an HDFS file only occupies the required space. So a 4k file only consumes 4k on disk. -- Owen

BUSTED

find \
  -wholename "/var/log/hadoop/hadoop-*" \
  -wholename "/var/log/hadoop/job_*.xml" \
  -wholename "/var/log/hadoop/history/*" \
  -wholename "/var/log/hadoop/history/\\.*.crc" \
  -wholename "/var/log/hadoop/history/done/*" \
  -wholename "/var/log/hadoop/history/done/\\.*.crc" \
  -wholename "/var/log/hadoop/userlogs/attempt_*" \
  -mtime +7 \
  -daystart \
  -delete

Logfiles

Limits

hdfs   hard nofile 128000
hdfs   soft nofile  64000
mapred hard nofile 128000
mapred soft nofile  64000

fs.file-max = 128000

sysctl.conf

limits.conf

Localhost

127.0.0.1 localhost localhost.localdomain
127.0.1.1 hdd01

127.0.0.1 localhost localhost.localdomain
127.0.1.1 hdd01.some.net hdd01

before

hadoop

Rackaware

<property>
  <name>topology.script.file.name</name>
  <value>/path/to/script/location-from-ip</value>
  <final>true</final>
</property>

#!/usr/bin/ruby
location = {
  'hdd001.some.net' => '/ams/1',
  '10.20.2.1'       => '/ams/1',
  'hdd002.some.net' => '/ams/2',
  '10.20.2.2'       => '/ams/2',
}

puts ARGV.map { |ip| location[ip] || '/default-rack' }.join(' ')

site config

topology script

for f in `hadoop fsck / | grep "Replica placement policy is violated" | awk -F: '{print $1}' | sort | uniq | head -n1000`; do
  hadoop fs -setrep -w 4 $f
  hadoop fs -setrep 3 $f
done

Fix the Policy

hadoop fsck / -openforwrite -files \
  | grep -i "OPENFORWRITE: MISSING 1 blocks of total size" \
  | awk '{print $1}' \
  | xargs -L 1 -i hadoop dfs -mv {} /lost+notfound

Fsck

Community

[chart: hadoop mailing list activity, from markmail.org]

Community

The Enterprise Effect

“The Community Effect” (in 2011)

Community

[charts: mapreduce and core mailing list activity, from markmail.org]

The Future

real time
incremental

flexible pipelines
refined API

refined implementation

Real Time Datamining and Aggregation at Scale (Ted Dunning)

Eventually Consistent Data Structures (Sean Cribbs)

Real-time Analytics with HBase (Alex Baranau)

Profiling and performance-tuning your Hadoop pipelines (Aaron Beppu)

From Batch to Realtime with Hadoop (Lars George)

Event-Stream Processing with Kafka (Tim Lossen)

Real-/Neartime analysis with Hadoop & VoltDB (Ralf Neeb)

Takeaways

· use hadoop only if you must
· really understand the pipeline
· unbox the black box

@tcurdt
github.com/tcurdt

yourdailygeekery.com

That’s it, folks!