Performance of Hadoop on OpenStack

  • Performance of Hadoop on OpenStack

    Andrew Lazarev, Mirantis, 2014

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

What Is Hadoop?

    Ecosystem diagram: HDFS (File System) and MapReduce (Programming Framework) form core Apache Hadoop; around them sit Ambari (Management), ZooKeeper (Coordination), Oozie (Scheduling), HBase (NoSQL Store), Pig (Data Flow), Hive (SQL), and Storm (Real-time computation).

  • Easy-to-operate cluster
    One-click self-service provisioning
    Sharing hardware between several Hadoop clusters
    Tenant isolation at the hypervisor and network layers
    Comparable performance with much more flexibility

    Why Virtualize Hadoop?

  • Sahara - the OpenStack Data Processing project
    Integrated with OpenStack
    Supports Hadoop 1 and 2
    Supports different vendors (Apache, Hortonworks, Intel*)
    Cluster provisioning and on-demand job execution

    How To Virtualize?

  • Direct impact:
    Disk write
    Disk read
    Network
    CPU

    Virtualization Impact

  • Indirect impact:
    Lack of low-level system control
    Resources consumed by the hypervisor itself

    Virtualization Impact

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Mirantis OpenStack Express cluster, 20 nodes:
    CPU: 24 x 2.10 GHz (2 x Intel Xeon CPU E5-2620)
    Memory: 8 x 4.0 GB, 32.0 GB total
    Disk: 1 drive, 0.9 TB (WDC WD1003FBYX-0)
    Network: 2 x 1 GbE

    Environment

  • Host OS: CentOS 6.5
    VM OS: CentOS 6.5
    Mirantis OpenStack
    QEMU-KVM 1.2.0
    Network: Neutron + GRE
    Open vSwitch 1.10.2

    Environment (continued)

  • Hadoop: vanilla Apache 1.2.1
    Bare-metal setup: 19 Hadoop nodes
    OpenStack setup: 1 controller + 19 compute nodes, running 19 (or 57) VMs with Hadoop

    Environment (continued)

  • Introduction Environment description Direct virtualization impact Real-life workload Data locality Conclusion

    Agenda

  • Disk Write (using dd)

    *greater is better
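
    The deck shows only the chart; a sequential-write measurement of this kind typically looks like the sketch below, where the target path, block size, and count are illustrative assumptions rather than the exact invocation used:

        # Sequential write; conv=fdatasync flushes data to disk so the
        # page cache does not inflate the measured throughput.
        dd if=/dev/zero of=/mnt/data/ddtest bs=1M count=10240 conv=fdatasync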

  • TestDFSIO - built-in Hadoop IO test
    Write test and read test
    1,000 files of 1 GB each (1 TB total)

    Disk Write (Hadoop test)
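
    For reference, the workload described above maps onto TestDFSIO invocations like the following sketch; the jar location assumes a vanilla Apache 1.2.1 install, and fileSize is given in MB in Hadoop 1.x:

        # Write, then read, 1,000 files of 1 GB each (1 TB total).
        hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 1000 -fileSize 1000
        hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -read -nrFiles 1000 -fileSize 1000
        # Clean up the generated files afterwards.
        hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -clean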

  • Disk Write (Hadoop test)

    *less is better

  • Disk Write (Hadoop test)

    *less is better

  • disk_cachemodes param in nova.conf:
    writethrough (default) - guest disk write cache is disabled
    writeback - guest disk write cache is enabled

    Disk Cache Mode
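
    A minimal sketch of the compute-node configuration, assuming the libvirt driver (depending on the OpenStack release the option lives in [DEFAULT] or [libvirt]):

        # /etc/nova/nova.conf on each compute node
        [libvirt]
        # Enable the guest disk write cache for file-backed disks.
        disk_cachemodes = "file=writeback"

    nova-compute must be restarted for the change to take effect, and it applies only to newly started VMs.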

  • Writeback cache enabled
    One large VM per host, with all of the host's memory

    Disk Write (dd, writeback cache)

  • Disk Write (dd, writeback cache)

    *greater is better

  • Disk Write (Hadoop test, writeback cache)

    *less is better

  • QEMU 1.4: high-performance virtio-blk data plane implementation
    +108.0% on rnd-write (based on a Red Hat presentation at KVM Forum)

    Disk Write - Way To Improve
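
    The data plane was an experimental per-device option in QEMU 1.4 and was not exposed through Nova at the time; enabling it by hand looks roughly like this sketch (image name and memory size are illustrative):

        # virtio-blk data plane needs a raw backing file opened with
        # cache=none and aio=native, and the guest write cache disabled.
        qemu-system-x86_64 -enable-kvm -m 2048 \
            -drive if=none,id=drive0,file=disk.img,format=raw,cache=none,aio=native \
            -device virtio-blk-pci,drive=drive0,scsi=off,config-wce=off,x-data-plane=on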

  • Disk Read (using hdparm)

    *greater is better

  • Disk Read (using hdparm)

    *greater is better
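
    hdparm measures buffered read throughput straight from the device; a typical invocation is sketched below (device names are illustrative):

        # Timed buffered disk reads, bypassing filesystem overhead.
        hdparm -t /dev/sda     # on bare metal
        hdparm -t /dev/vda     # inside a VM (virtio disk)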

  • Disk Read (Hadoop test)

    *less is better

  • Network (OVS+GRE)

    *greater is better
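
    The slide does not name the measurement tool; throughput across an OVS+GRE tunnel is commonly measured with something like iperf, as in this sketch:

        # On one VM: start the server.
        iperf -s
        # On a VM on another host, so traffic crosses the GRE tunnel;
        # the address is a placeholder.
        iperf -c 10.0.0.5 -t 60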

  • PI - built-in Hadoop test
    Depends mostly on CPU
    50 series of 10,000,000,000 samples each

    CPU (Hadoop test)
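
    Assuming the 50 series correspond to 50 map tasks, the run maps onto the built-in example like this sketch (jar name assumes a vanilla 1.2.1 install):

        # pi <nMaps> <nSamples>: 50 maps, 10,000,000,000 samples each.
        hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar pi 50 10000000000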

  • CPU (Hadoop test)

    *less is better

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Built-in Hadoop test
    Represents a real Hadoop workload
    Involves IO, networking, and computation
    Sorting 200,000,000 100-byte entries (20 GB)
    Writeback cache enabled

    TeraSort
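
    The equivalent commands from the bundled examples jar look like this sketch; the HDFS paths and jar name are illustrative for a vanilla 1.2.1 install:

        # Generate 200,000,000 rows of 100 bytes (20 GB), then sort them.
        hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 200000000 /benchmarks/tera-in
        hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar terasort /benchmarks/tera-in /benchmarks/tera-out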

  • TeraSort

    *less is better

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Hadoop can take the distance between nodes into account:
    Intelligent task scheduling
    Reading data from nearby data nodes

    Data Locality

    (Diagram: interconnected cluster nodes at varying distances.)
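
    In Hadoop 1.x the distance information comes from an admin-supplied topology script; a minimal sketch, assuming a lookup file that maps addresses to rack paths (all names here are illustrative):

        <!-- core-site.xml: point Hadoop at the topology script -->
        <property>
          <name>topology.script.file.name</name>
          <value>/etc/hadoop/topology.sh</value>
        </property>

        #!/bin/sh
        # /etc/hadoop/topology.sh - Hadoop passes one or more IPs/hostnames;
        # print one rack path per argument, defaulting when unknown.
        for host in "$@"; do
            awk -v h="$host" \
                '$1 == h { print $2; found=1 } END { if (!found) print "/default-rack" }' \
                /etc/hadoop/topology.data
        done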

  • Data Locality

    *greater is better

  • Network within a host is comparable to disk speed
    Allows Hadoop process isolation (one VM per process)
    Test setup:
    1 master node (JobTracker + NameNode)
    18 DataNodes
    18 TaskTrackers
    TeraSort of 20 GB of data

    Data Locality

  • TeraSort (data locality)

    *less is better

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Only 6% performance impact on the composite test
    Performance continuously improving as external components are upgraded (QEMU, Open vSwitch)
    Much more topology flexibility
    Isolation at low cost: between clusters, and between nodes within a cluster

    Conclusion

  • Q&A

  • Thank you!

    Andrew Lazarev
    Launchpad/GitHub/IRC: alazarev
    E-mail: [email protected]