Hadoop I/O Analysis

Architect's view of Hadoop I/O: I/O analysis using vProbes. Richard McDougall, V1.0, April 2012

Description

Some initial analysis of the Hadoop Stack using vProbes

Transcript of Hadoop I/O Analysis

Page 1: Hadoop I/O Analysis

Architect's view of Hadoop I/O

I/O analysis using vProbes

Richard McDougall, V1.0

April 2012

Page 2: Hadoop I/O Analysis

Architect's Questions

•  Does Hadoop really need compute + data locality?

•  How much and what I/O rates of ephemeral data do we need to design for?

•  What I/O patterns do we need to support HDFS?

•  What is the I/O pattern of M-R tasks?

•  Are there opportunities for caching – map input, output or ephemeral?

Page 3: Hadoop I/O Analysis

Controlled Small Study

•  Focus on developing tooling

•  Using vProbes + Perl + R

•  Hadoop 0.20.204

•  Terasort @ 1GB

•  One Namenode, Tasktracker, Datanode

Page 4: Hadoop I/O Analysis

Terasort

[Figure: Terasort data flow. The input file is divided into input splits (x16); each map task reads a split and sorts a chunk of key-values; shuffle sends map output to the reducer, which combines and sorts (Reduce (Sort)) and writes the output file.]
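The data flow above can be sketched as a toy, in-memory program. Everything here (the function name, the single-reducer simplification, hash partitioning) is illustrative; real Terasort streams records through spills and merges and uses a range partitioner so that concatenated reducer outputs are globally sorted.

```python
# Toy, in-memory sketch of the Terasort flow pictured above:
# split input -> map -> partition/shuffle -> per-reducer sort -> output.

def terasort_sketch(records, num_splits=16, num_reducers=1):
    # Input splits (x16): divide the input into chunks.
    splits = [records[i::num_splits] for i in range(num_splits)]

    # Map tasks: emit (key, value) pairs; Terasort's map is the identity.
    # Hash partitioning is used here for brevity; real Terasort range-partitions
    # so the concatenated reducer outputs are globally sorted.
    partitions = [[] for _ in range(num_reducers)]
    for split in splits:
        for rec in split:
            key = rec
            partitions[hash(key) % num_reducers].append((key, rec))

    # Shuffle + reduce: each reducer sorts its partition and emits output.
    output = []
    for part in partitions:
        output.extend(k for k, _ in sorted(part))
    return output

print(terasort_sketch([3, 1, 2]))
```

With the study's single reducer, the per-reducer sort alone yields the totally ordered output.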

Page 5: Hadoop I/O Analysis

Log of the sort 'Job'

$ log.pl job_201201261301_0005_1327649126255_rmc_TeraSort
Item Time Jobname Taskname Phase Start-Time End-Time Elapsed
Job 0.000 201201261301_0005
Job 201201261301_0005
Job 0.475 201201261301_0005 PREP
Task 1.932 201201261301_0005 m_000017 SETUP
MapAttempt 3.066 201201261301_0005 m_000017 SETUP
MapAttempt 10.409 201201261301_0005 m_000017 SETUP SUCCESS 1.932 10.409 8.477 "setup"
Task 10.966 201201261301_0005 m_000017 SETUP SUCCESS 1.932 10.966 9.034
Job 201201261301_0005 RUNNING
Task 10.970 201201261301_0005 m_000000 MAP
Task 10.972 201201261301_0005 m_000001 MAP
MapAttempt 10.981 201201261301_0005 m_000000 MAP
MapAttempt 65.819 201201261301_0005 m_000000 MAP SUCCESS 10.970 65.819 54.849 ""
Task 68.063 201201261301_0005 m_000000 MAP SUCCESS 10.970 68.063 57.093
MapAttempt 10.998 201201261301_0005 m_000001 MAP
MapAttempt 65.363 201201261301_0005 m_000001 MAP SUCCESS 10.972 65.363 54.391 ""
Task 68.065 201201261301_0005 m_000001 MAP SUCCESS 10.972 68.065 57.093
Task 68.066 201201261301_0005 m_000002 MAP
Task 68.067 201201261301_0005 m_000003 MAP
Task 68.068 201201261301_0005 r_000000 REDUCE
MapAttempt 68.075 201201261301_0005 m_000002 MAP
MapAttempt 139.789 201201261301_0005 m_000002 MAP SUCCESS 68.066 139.789 71.723 ""
Task 140.193 201201261301_0005 m_000002 MAP SUCCESS 68.066 140.193 72.127
MapAttempt 68.076 201201261301_0005 m_000003 MAP
MapAttempt 139.927 201201261301_0005 m_000003 MAP SUCCESS 68.067 139.927 71.860 ""
Task 140.198 201201261301_0005 m_000003 MAP SUCCESS 68.067 140.198 72.131
…
ReduceAttempt 68.112 201201261301_0005 r_000000 REDUCE
ReduceAttempt 795.299 201201261301_0005 r_000000 REDUCE SUCCESS 68.068 795.299 727.231 "reduce > reduce"
Task 798.223 201201261301_0005 r_000000 REDUCE SUCCESS 68.068 798.223 730.155
Task 798.226 201201261301_0005 m_000016 CLEANUP
MapAttempt 798.241 201201261301_0005 m_000016 CLEANUP
MapAttempt 806.113 201201261301_0005 m_000016 CLEANUP SUCCESS 798.226 806.113 7.887 "cleanup"
Task 807.252 201201261301_0005 m_000016 CLEANUP SUCCESS 798.226 807.252 9.026
Job 807.253 201201261301_0005 SUCCESS 0.000 807.253 807.253
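Output in this shape is easy to aggregate per phase. The Python sketch below is hypothetical (log.pl itself is not shown); the field layout is assumed from the header row of the listing above.

```python
# Hypothetical mini-parser for lines like those printed by log.pl, e.g.
# "Task 68.063 201201261301_0005 m_000000 MAP SUCCESS 10.970 68.063 57.093"
# Field layout assumed from the header: Item Time Jobname Taskname Phase
# [Status Start-Time End-Time Elapsed].

def phase_elapsed(lines):
    """Sum the Elapsed column of successful Task records, per Phase."""
    totals = {}
    for line in lines:
        f = line.split()
        if f and f[0] == "Task" and "SUCCESS" in f:
            phase, elapsed = f[4], float(f[-1])
            totals[phase] = totals.get(phase, 0.0) + elapsed
    return totals

log = [
    "Task 68.063 201201261301_0005 m_000000 MAP SUCCESS 10.970 68.063 57.093",
    "Task 798.223 201201261301_0005 r_000000 REDUCE SUCCESS 68.068 798.223 730.155",
]
print(phase_elapsed(log))
```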

Page 6: Hadoop I/O Analysis

Terasort: Map and Reduce Phases

[Figure: Gantt chart of task phases vs. elapsed time (seconds): setup map, mappers, reducer, cleanup map.]

Page 7: Hadoop I/O Analysis

Terasort: Map and Reduce Phases

[Figure: the same Gantt chart, with callouts to zoom in on map task I/O and on reduce task I/O.]

Page 8: Hadoop I/O Analysis

VMware vProbes

•  Dynamic Instrumentation

•  Probe multiple VMs

•  Probe Virtualization Layer

•  VMware Fusion and Workstation

Page 9: Hadoop I/O Analysis

vProbes

GUEST:ENTER:system_call {
    string path;
    comm = curprocname();
    tid = curtid();
    pid = curpid();
    ppid = curppid();
    syscall_num = sysnum;

    if (syscall_num == NR_open) {
        path = guestloadstr(sys_arg0);
        syscall_name = "open";
        sprintf(syscall_args, "\"%s\", %x, %x", path, sys_arg1, sys_arg2);
    …
}

GUEST:OFFSET:ret_from_sys_call:0 {
    printf("%s/%d/%d/%d %s(%s) = %d <0>\n", comm, pid, tid, ppid, syscall_name,
        syscall_args, getgpr(REG_RAX));
}

Sample output:

java/14774/15467/1 open("/host/hadoop/hdfs/data/current/subdir0/blk_1719908349220085071_1649.meta", 0, 1b6) = 144 <0>
java/14774/15467/1 stat("/host/hadoop/hdfs/data/current/subdir0/blk_1719908349220085071_1649.meta", 7f0b80a4e590) = 0 <0>
java/14774/15467/1 read(144, 7f0b80a4c470, 4096) = 167 <0>

Page 10: Hadoop I/O Analysis

Pathname Resolution

filetracevp.pl:

if ($syscall =~ m/open/) {
    $path1 = $line;
    $path1 =~ s/[A-z\/0-9]+[ ]+[a-z]+\("([^"]+)".*\n/\1/;
    $fd1 = $line;
    if ($fd1 =~ s/.* ([0-9]+) <.*>\n/\1/) {
        $fds{$pid,$fd1} = $path1;

if ($syscall =~ m/write/) {
    $params = $line;
    if ($params =~ s/^[A-z\/0-9]+[ ]+[a-z]+\(([0-9]+),.* ([0-9]+)\) = ([0-9]+) <(.*)>\n/\1,\2,\3,\4/) {
        ($fd1, $size, $bytes, $lat) = split(',', $params);
        $path1 = $fds{$pid, $fd1};
…

Output (CSV):

java,14774,15467,,open,0,0,0,0,144,/host/hadoop/hdfs/data/current/subdir0/blk_1719908349220085071_1649.meta,0,
java,14774,15467,,stat,0,0,0,0,0,/host/hadoop/hdfs/data/current/subdir0/blk_1719908349220085071_1649.meta,0,
java,14774,15467,,read,4096,167,0,0,144,/host/hadoop/hdfs/data/current/subdir0/blk_1719908349220085071_1649.meta,0,
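The same fd-to-path bookkeeping can be re-sketched in Python against the trace format on the previous page: remember the path returned by each successful open(), then attribute later read()/write() calls on that (pid, fd) back to the pathname. The regexes and field positions below are assumptions derived from the sample lines, not the actual filetracevp.pl logic.

```python
# Minimal fd -> path resolver for vProbes-style trace lines such as:
#   java/14774/15467/1 open("/tmp/blk.meta", 0, 1b6) = 144 <0>
#   java/14774/15467/1 read(144, 7f0b80a4c470, 4096) = 167 <0>
import re

OPEN_RE = re.compile(r'^(\S+)/(\d+)/(\d+)/(\d+) open\("([^"]+)"[^)]*\) = (\d+)')
RW_RE = re.compile(r'^(\S+)/(\d+)/(\d+)/(\d+) (read|write)\((\d+),.*\) = (-?\d+)')

def resolve(lines):
    fds, events = {}, []
    for line in lines:
        m = OPEN_RE.match(line)
        if m:
            # Successful open: record the path under (pid, fd).
            pid, path, fd = m.group(2), m.group(5), m.group(6)
            fds[(pid, fd)] = path
            continue
        m = RW_RE.match(line)
        if m:
            # read/write: look the fd back up to get the pathname.
            pid, call, fd, ret = m.group(2), m.group(5), m.group(6), int(m.group(7))
            events.append((call, fds.get((pid, fd), "?"), ret))
    return events

trace = [
    'java/14774/15467/1 open("/tmp/blk.meta", 0, 1b6) = 144 <0>',
    'java/14774/15467/1 read(144, 7f0b80a4c470, 4096) = 167 <0>',
]
print(resolve(trace))
```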

Page 11: Hadoop I/O Analysis

Controlled Small-Scale Study

Job Counters
    Launched reduce tasks=1
    SLOTS_MILLIS_MAPS=1146887
    Launched map tasks=16
    Data-local map tasks=16
    SLOTS_MILLIS_REDUCES=766823
File Input Format Counters
    Bytes Read=1000057358
File Output Format Counters
    Bytes Written=1000000000
FileSystemCounters
    FILE_BYTES_READ=2382257412
    HDFS_BYTES_READ=1000059070
    FILE_BYTES_WRITTEN=3402627838
    HDFS_BYTES_WRITTEN=1000000000
Map-Reduce Framework
    Map output materialized bytes=1020000096
    Map input records=10000000
    Reduce shuffle bytes=1020000096
    Spilled Records=33355441
    Map output bytes=1000000000
    Map input bytes=1000000000
    Combine input records=0
    SPLIT_RAW_BYTES=1712
    Reduce input records=10000000
    Reduce input groups=10000000
    Combine output records=0
    Reduce output records=10000000
    Map output records=10000000

Category                                  MB
Hadoop Distro                             236
Hadoop Logs                               132
Hadoop clienttmp unjar                    1
Mappers files jobcache - spills           1753
Mappers files jobcache - output           1777
Reducer Intermediate                      764
Reducers Shuffle and Intermediate         1744
Jobcache class files and shell scripts    1
Hadoop Datanode                           1690
JVM - /usr/lib/jvm…                       98
Total MB                                  7987

$ hadoop jar hadoop-examples-0.20.204.0.jar teragen 10000000 teradata
<begin trace>
$ hadoop jar hadoop-examples-0.20.204.0.jar terasort teradata teraout
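A quick sanity check on the FileSystemCounters above: comparing local-disk (FILE_*) traffic against HDFS traffic gives the temp-data amplification that later slides model as roughly 3x. The values below are copied from the counter dump on this page.

```python
# Temp-data amplification from the FileSystemCounters above.
HDFS_BYTES_READ = 1000059070
HDFS_BYTES_WRITTEN = 1000000000
FILE_BYTES_READ = 2382257412
FILE_BYTES_WRITTEN = 3402627838

temp_total = FILE_BYTES_READ + FILE_BYTES_WRITTEN
hdfs_total = HDFS_BYTES_READ + HDFS_BYTES_WRITTEN
print(round(temp_total / hdfs_total, 2))                  # ~2.9x total local I/O per HDFS byte
print(round(FILE_BYTES_WRITTEN / HDFS_BYTES_WRITTEN, 2))  # ~3.4x write amplification
```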

[Figure: bar chart of MB per category (scale 0–2000 MB), matching the table above: Hadoop Distro, Hadoop Logs, Hadoop clienttmp unjar, Mappers files jobcache - spills, Mappers files jobcache - map output, Reducer intermediate file, Reducers files jobcache - output, Jobcache class files and shell scripts, Hadoop Datanode, JVM - /usr/lib/jvm….]

Page 12: Hadoop I/O Analysis

Hadoop I/O Model (with some data from early observations)

[Figure: I/O model of a job. Map Tasks read DFS input data from HDFS (~12% of bandwidth) and write spills & logs (spill*.out) and a map output file (file.out) to local disk (~75% of disk bandwidth); the shuffle (map_*.out) feeds the Reduce tasks, which sort and combine (intermediate.out) and write DFS output data back to HDFS (~12% of bandwidth).]

Page 13: Hadoop I/O Analysis

One Mapper Task: Temp Data

path                                                                                                                          bytes
/host/hadoop/clienttmp/mapred/local/taskTracker/rmc/jobcache/job_201201251035_0001/attempt_201201251035_0001_m_000000_0/output/file.out   67586124
/host/hadoop/clienttmp/mapred/local/taskTracker/rmc/jobcache/job_201201251035_0001/attempt_201201251035_0001_m_000000_0/output/spill1.out 52762519
/host/hadoop/clienttmp/mapred/local/taskTracker/rmc/jobcache/job_201201251035_0001/attempt_201201251035_0001_m_000000_0/output/spill0.out 52508540
/host/hadoop/clienttmp/mapred/local/taskTracker/rmc/jobcache/job_201201251035_0001/attempt_201201251035_0001_m_000000_0/output/spill2.out 29698564
/usr/lib/jvm/java-6-openjdk/jre/lib/rt.jar                                                                                    5057763
/home/rmc/untars/hadoop-0.20.204.0/hadoop-core-0.20.204.0.jar                                                                 895582
/home/rmc/untars/hadoop-0.20.204.0/lib/log4j-1.2.15.jar                                                                       82522
/home/rmc/untars/hadoop-0.20.204.0/lib/commons-lang-2.4.jar                                                                   70477
/home/rmc/untars/hadoop-0.20.204.0/lib/commons-configuration-1.6.jar                                                          61007
/usr/lib/x86_64-linux-gnu/gconv/gconv-modules                                                                                 51772
/host/hadoop/clienttmp/mapred/local/taskTracker/rmc/jobcache/job_201201251035_0001/job.xml                                    44420
/home/rmc/untars/hadoop-0.20.204.0/lib/commons-collections-3.2.1.jar                                                          29974
/host/hadoop/clienttmp/mapred/local/taskTracker/rmc/jobcache/job_201201251035_0001/attempt_201201251035_0001_m_000000_0/job.xml 21695
/usr/lib/jvm/java-6-openjdk/jre/lib/amd64/libnio.so                                                                           15946
/home/rmc/untars/hadoop-0.20.204.0/conf/core-site.xml                                                                         11024
/usr/lib/jvm/java-6-openjdk/jre/lib/security/java.security                                                                    10081
/proc/self/maps                                                                                                               7523

Page 14: Hadoop I/O Analysis

One Mapper Task: Temp I/O Counts (I/O measured at syscall)

[Figure: histograms of number of I/Os per power-of-two I/O size bucket (1 byte to 131072 bytes): all I/Os (y-axis 0–60,000), read I/Os (0–30,000), and write I/Os (0–30,000).]
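The histograms summarized above bucket each syscall's size into power-of-two bins (1, 2, 4, … 131072 bytes). A minimal sketch of that bucketing (the helper names are illustrative, not from the tooling):

```python
# Power-of-two I/O size bucketing, as used by the histograms above.
from collections import Counter

def size_bucket(nbytes):
    """Round an I/O size up to the next power of two (minimum bucket = 1)."""
    b = 1
    while b < nbytes:
        b *= 2
    return b

def histogram(sizes):
    """Count I/Os per size bucket."""
    return Counter(size_bucket(s) for s in sizes)

print(histogram([1, 100, 4096, 4096, 167]))
```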

Page 15: Hadoop I/O Analysis

One Mapper Task: Tmp Bytes Transferred

[Figure: bytes transferred per I/O size bucket. Left: I/O measured at syscall, buckets 1–131072 bytes, y-axis 0–2.5e8 bytes. Right: logical I/O (sequential grouping of syscalls), buckets 1 byte to 134217728 bytes, y-axis 0–6e7 bytes.]

Page 16: Hadoop I/O Analysis

Reducer Task: Temp Data

Page 17: Hadoop I/O Analysis

Reducer Task: Temp I/O Counts (I/O measured at syscall)

[Figure: histograms of number of I/Os per power-of-two I/O size bucket (1–131072 bytes): all I/Os (y-axis 0–4e5), read I/Os (0–300,000), and write I/Os (0–80,000).]

Page 18: Hadoop I/O Analysis

Reducer Task: Tmp Bytes Transferred

[Figure: bytes transferred per I/O size bucket. Left: I/O measured at syscall, buckets 1–131072 bytes, y-axis 0–1.5e9 bytes. Right: logical I/O (sequential grouping of syscalls), buckets 1 byte to 8388608 bytes, y-axis 0–5e8 bytes.]

Page 19: Hadoop I/O Analysis

Datanode – Bytes Transferred

[Figure: bytes transferred per I/O size bucket, four panels: all I/Os with buckets up to 134217728 bytes (y-axis 0–5e8), read I/Os (buckets 1–131072 bytes, 0–7e8), write I/Os (1–131072 bytes, 0–1e9), and a combined panel (1–131072 bytes, 0–1.5e9).]

Page 20: Hadoop I/O Analysis

Datanode – Actual vs. Logical I/O Size

[Figure: bytes per I/O size bucket, comparing I/O measured at syscall with logical I/O (sequential grouping of syscalls). Three panels: two with buckets up to 134217728 bytes (y-axis 0–5e8) and one with buckets 1–131072 bytes (y-axis 0–1.5e9).]

Page 21: Hadoop I/O Analysis

Datanode – IOPS

[Figure: left, bytes per I/O size bucket (buckets up to 134217728 bytes, y-axis 0–5e8); then number of I/Os per bucket (1–131072 bytes): all I/Os (0–25,000), read I/Os (0–10,000), and write I/Os (0–15,000).]

Page 22: Hadoop I/O Analysis

Back of the Envelope Modeling

•  How much bandwidth does terasort need?
   – 10 seconds of CPU/core time per task
   – 128MB of HDFS per task
   – ~3x, 384MB of temporary data per task

I/O Component   Per-task   Per-task Bandwidth   Per-host (24 cores)
HDFS I/O        128MB      ~13 MBytes/s         312 MBytes/s
Temp            384MB      ~38 MBytes/s         912 MBytes/s
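The table's numbers follow directly from the stated assumptions; the small script below makes the arithmetic explicit. The slide rounds the per-task rates to 13 and 38 MB/s before multiplying by 24 cores (hence 312 and 912).

```python
# Back-of-the-envelope bandwidth model, using only the slide's assumptions:
# ~10 s of CPU time per task, 128 MB of HDFS data per task, ~3x (384 MB)
# of temp data per task, 24 cores (concurrent tasks) per host.
TASK_SECONDS = 10
HDFS_MB_PER_TASK = 128
TEMP_MB_PER_TASK = 3 * HDFS_MB_PER_TASK   # 384 MB
CORES_PER_HOST = 24

hdfs_mb_s = HDFS_MB_PER_TASK / TASK_SECONDS   # per-task HDFS bandwidth
temp_mb_s = TEMP_MB_PER_TASK / TASK_SECONDS   # per-task temp bandwidth
print(hdfs_mb_s, round(hdfs_mb_s * CORES_PER_HOST, 1))  # 12.8 307.2 (slide: ~13, 312)
print(temp_mb_s, round(temp_mb_s * CORES_PER_HOST, 1))  # 38.4 921.6 (slide: ~38, 912)
```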

Page 23: Hadoop I/O Analysis

Do we need locality?

•  Main issue is cross-sectional bandwidth
   – Secondary issue is per-host link speed
   – Just look at storage I/O now, consider shuffle next

I/O Component   Per-host (24 cores)   Network Bandwidth w/ 0% locality   Rack Bandwidth w/ 40 hosts
HDFS I/O        312 MBytes/s          2.5 Gbit/s                         100 Gbit/s
Temp            912 MBytes/s          7.3 Gbit/s                         300 Gbit/s

•  Possible Conclusion
   – Must have locality with a 1 Gbit host link
   – Feasible to have remote data with 10 Gbit, keeping temp local only
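The locality table converts the per-host MBytes/s figures to network Gbit/s under the worst case of 0% locality (every byte crosses the network) and scales by 40 hosts per rack. A sketch of that conversion; note 912 MB/s works out to ~292 Gbit/rack, which the slide rounds to ~300:

```python
# Cross-sectional bandwidth arithmetic behind the locality table above.
def mbytes_to_gbits(mb_per_s):
    # Decimal conversion: 1 byte = 8 bits, 1000 Mbit = 1 Gbit.
    return mb_per_s * 8 / 1000.0

HOSTS_PER_RACK = 40
for name, per_host_mb_s in [("HDFS I/O", 312), ("Temp", 912)]:
    per_host_gbit = mbytes_to_gbits(per_host_mb_s)
    print(name, round(per_host_gbit, 1), round(per_host_gbit * HOSTS_PER_RACK))
# HDFS I/O: ~2.5 Gbit/host -> ~100 Gbit/rack; Temp: ~7.3 Gbit/host -> ~292 Gbit/rack
```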