Performance of Hadoop on OpenStack

  • Performance of Hadoop on OpenStack

    Andrew Lazarev, Mirantis, 2014

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

What Is Hadoop?

    Ecosystem diagram: HDFS (File System) and MapReduce (Programming Framework) form core Apache Hadoop; around them sit Ambari (Management), ZooKeeper (Coordination), Oozie (Scheduling), HBase (NoSQL Store), Pig (Data Flow), Hive (SQL), and Storm (Real-time computation).

  • Easy-to-operate cluster
    One-click self-service provisioning
    Sharing hardware between several Hadoop clusters
    Tenant isolation at the hypervisor and network layers
    Comparable performance with much more flexibility

    Why Virtualize Hadoop?

  • Sahara - the OpenStack Data Processing project
    Integrated with OpenStack
    Supports Hadoop 1 and 2
    Supports different vendors (Apache, Hortonworks, Intel*)
    Cluster provisioning and on-demand job execution

    How To Virtualize?

  • Direct impact:
    Disk write
    Disk read
    Network
    CPU

    Virtualization Impact

  • Indirect impact:
    Lack of low-level system control
    Resources consumed by the hypervisor itself

    Virtualization Impact

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Mirantis OpenStack Express cluster, 20 nodes:
    CPU: 24 x 2.10 GHz (2 x Intel Xeon CPU E5-2620)
    Memory: 8 x 4.0 GB, 32.0 GB total
    Disk: 1 drive, 0.9 TB (WDC WD1003FBYX-0)
    Network: 2 x 1 GbE

    Environment

  • Host OS: CentOS 6.5
    VM OS: CentOS 6.5
    Mirantis OpenStack
    QEMU-KVM 1.2.0
    Network: Neutron + GRE
    Open vSwitch 1.10.2

    Environment (continued)

  • Hadoop: vanilla Apache 1.2.1
    Bare-metal setup: 19 Hadoop nodes
    OpenStack setup: 1 controller + 19 compute nodes, running 19 (or 57) VMs with Hadoop

    Environment (continued)

  • Introduction Environment description Direct virtualization impact Real-life workload Data locality Conclusion

    Agenda

  • Disk Write (using dd)

    *greater is better
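
    The deck shows only the chart; a sequential-write measurement of this kind typically looks like the sketch below, where the target path, block size, and count are illustrative assumptions rather than the exact invocation used:

        # Sequential write; conv=fdatasync flushes data to disk so the
        # page cache does not inflate the measured throughput.
        dd if=/dev/zero of=/mnt/data/ddtest bs=1M count=10240 conv=fdatasync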

  • TestDFSIO - built-in Hadoop IO test
    Write test and read test
    1,000 files of 1 GB each (1 TB total)

    Disk Write (Hadoop test)
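
    For reference, the workload described above maps onto TestDFSIO invocations like the following sketch; the jar location assumes a vanilla Apache 1.2.1 install, and fileSize is given in MB in Hadoop 1.x:

        # Write, then read, 1,000 files of 1 GB each (1 TB total).
        hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -write -nrFiles 1000 -fileSize 1000
        hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -read -nrFiles 1000 -fileSize 1000
        # Clean up the generated files afterwards.
        hadoop jar $HADOOP_HOME/hadoop-test-1.2.1.jar TestDFSIO -clean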

  • Disk Write (Hadoop test)

    *less is better

  • Disk Write (Hadoop test)

    *less is better

  • disk_cachemodes param in nova.conf:
    writethrough (default) - guest disk write cache is disabled
    writeback - guest disk write cache is enabled

    Disk Cache Mode
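
    A minimal sketch of the compute-node configuration, assuming the libvirt driver (depending on the OpenStack release the option lives in [DEFAULT] or [libvirt]):

        # /etc/nova/nova.conf on each compute node
        [libvirt]
        # Enable the guest disk write cache for file-backed disks.
        disk_cachemodes = "file=writeback"

    nova-compute must be restarted for the change to take effect, and it applies only to newly started VMs.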

  • Writeback cache enabled
    One large VM per host, with all of the host's memory

    Disk Write (dd, writeback cache)

  • Disk Write (dd, writeback cache)

    *greater is better

  • Disk Write (Hadoop test, writeback cache)

    *less is better

  • QEMU 1.4: high-performance virtio-blk data plane implementation
    +108.0% on rnd-write (based on a Red Hat presentation at KVM Forum)

    Disk Write - Way To Improve
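
    The data plane was an experimental per-device option in QEMU 1.4 and was not exposed through Nova at the time; enabling it by hand looks roughly like this sketch (image name and memory size are illustrative):

        # virtio-blk data plane needs a raw backing file opened with
        # cache=none and aio=native, and the guest write cache disabled.
        qemu-system-x86_64 -enable-kvm -m 2048 \
            -drive if=none,id=drive0,file=disk.img,format=raw,cache=none,aio=native \
            -device virtio-blk-pci,drive=drive0,scsi=off,config-wce=off,x-data-plane=on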

  • Disk Read (using hdparm)

    *greater is better

  • Disk Read (using hdparm)

    *greater is better
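
    hdparm measures buffered read throughput straight from the device; a typical invocation is sketched below (device names are illustrative):

        # Timed buffered disk reads, bypassing filesystem overhead.
        hdparm -t /dev/sda     # on bare metal
        hdparm -t /dev/vda     # inside a VM (virtio disk)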

  • Disk Read (Hadoop test)

    *less is better

  • Network (OVS+GRE)

    *greater is better
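
    The slide does not name the measurement tool; throughput across an OVS+GRE tunnel is commonly measured with something like iperf, as in this sketch:

        # On one VM: start the server.
        iperf -s
        # On a VM on another host, so traffic crosses the GRE tunnel;
        # the address is a placeholder.
        iperf -c 10.0.0.5 -t 60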

  • PI - built-in Hadoop test
    Depends mostly on CPU
    50 series of 10,000,000,000 samples each

    CPU (Hadoop test)
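
    Assuming the 50 series correspond to 50 map tasks, the run maps onto the built-in example like this sketch (jar name assumes a vanilla 1.2.1 install):

        # pi <nMaps> <nSamples>: 50 maps, 10,000,000,000 samples each.
        hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar pi 50 10000000000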

  • CPU (Hadoop test)

    *less is better

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Built-in Hadoop test
    Represents a real Hadoop workload
    Involves IO, networking, and computation
    Sorting 200,000,000 100-byte entries (20 GB)
    Writeback cache enabled

    TeraSort
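
    The equivalent commands from the bundled examples jar look like this sketch; the HDFS paths and jar name are illustrative for a vanilla 1.2.1 install:

        # Generate 200,000,000 rows of 100 bytes (20 GB), then sort them.
        hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar teragen 200000000 /benchmarks/tera-in
        hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar terasort /benchmarks/tera-in /benchmarks/tera-out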

  • TeraSort

    *less is better

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Hadoop can take the distance between nodes into account:
    Intelligent task scheduling
    Reading data from nearby data nodes

    Data Locality

    (Diagram: interconnected cluster nodes at varying distances.)
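
    In Hadoop 1.x the distance information comes from an admin-supplied topology script; a minimal sketch, assuming a lookup file that maps addresses to rack paths (all names here are illustrative):

        <!-- core-site.xml: point Hadoop at the topology script -->
        <property>
          <name>topology.script.file.name</name>
          <value>/etc/hadoop/topology.sh</value>
        </property>

        #!/bin/sh
        # /etc/hadoop/topology.sh - Hadoop passes one or more IPs/hostnames;
        # print one rack path per argument, defaulting when unknown.
        for host in "$@"; do
            awk -v h="$host" \
                '$1 == h { print $2; found=1 } END { if (!found) print "/default-rack" }' \
                /etc/hadoop/topology.data
        done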

  • Data Locality

    *greater is better

  • Network within a host is comparable to disk speed
    Allows Hadoop process isolation (one VM per process)
    Test setup:
    1 master node (JobTracker + NameNode)
    18 DataNodes
    18 TaskTrackers
    TeraSort of 20 GB of data

    Data Locality

  • TeraSort (data locality)

    *less is better

  • Introduction · Environment description · Direct virtualization impact · Real-life workload · Data locality · Conclusion

    Agenda

  • Only 6% performance impact on the composite test
    Performance continuously improving as external components are upgraded (QEMU, Open vSwitch)
    Much more topology flexibility
    Isolation at low cost: between clusters, and between nodes within a cluster

    Conclusion

  • Q&A

  • Thank you!

    Andrew Lazarev
    Launchpad/GitHub/IRC: alazarev
    E-mail: [email protected]