Transcript of "Big Data: Hadoop and Memcached" - HPC Advisory Council

Page 1:

Big Data: Hadoop and Memcached

by Dhabaleswar K. (DK) Panda
The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Talk at the HPC Advisory Council Stanford Conference and Exascale Workshop (Feb 2014)

Page 2:

Introduction to Big Data Applications and Analytics

• Big Data has become one of the most important elements of business analytics
• Provides groundbreaking opportunities for enterprise information management and decision making
• The amount of data is exploding; companies are capturing and digitizing more information than ever
• The rate of information growth appears to be exceeding Moore's Law
• Commonly accepted 3 V's of Big Data: Volume, Velocity, Variety
  – Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf
• 5 V's of Big Data: the 3 V's + Value, Veracity

Page 3:

Not Only in Internet Services - Big Data in Scientific Domains

• Scientific Data Management, Analysis, and Visualization
• Application examples
  – Climate modeling
  – Combustion
  – Fusion
  – Astrophysics
  – Bioinformatics
• Data-intensive tasks
  – Run large-scale simulations on supercomputers
  – Dump data to parallel storage systems
  – Collect experimental / observational data
  – Move experimental / observational data to analysis sites
  – Visual analytics - help understand data visually

Page 4:

Typical Solutions or Architectures for Big Data Applications and Analytics

• Hadoop: http://hadoop.apache.org
  – The most popular framework for Big Data analytics
  – HDFS, MapReduce, HBase, RPC, Hive, Pig, ZooKeeper, Mahout, etc.
• Spark: http://spark-project.org
  – Provides primitives for in-memory cluster computing; jobs can load data into memory and query it repeatedly
• Shark: http://shark.cs.berkeley.edu
  – A fully Apache Hive-compatible data warehousing system
• Storm: http://storm-project.net
  – A distributed real-time computation system for real-time analytics, online machine learning, continuous computation, etc.
• S4: http://incubator.apache.org/s4
  – A distributed system for processing continuous unbounded streams of data
• Web 2.0: RDBMS + Memcached (http://memcached.org)
  – Memcached: a high-performance, distributed memory object caching system

Page 5:

Who Is Using Hadoop?

• Focuses on large data and data analysis
• The Hadoop environment (e.g., HDFS, MapReduce, HBase, RPC) is gaining a lot of momentum
• http://wiki.apache.org/hadoop/PoweredBy

Page 6:

Hadoop at Yahoo!

• Scale: 42,000 nodes, 365 PB of HDFS storage, 10M daily slot hours
  – http://developer.yahoo.com/blogs/hadoop/apache-hbase-yahoo-multi-tenancy-helm-again-171710422.html
• A new Jim Gray Sort record: 1.42 TB/min over 2,100 nodes
  – Hadoop 0.23.7, an early branch of Hadoop 2.X
  – http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-sets-gray-sort-record-yellow-elephant-092704088.html
• http://developer.yahoo.com/blogs/ydn/hadoop-yahoo-more-ever-54421.html

Page 7:

Presentation Outline

• Overview of Hadoop and Memcached
• Trends in Networking Technologies and Protocols
• Challenges in Accelerating Hadoop and Memcached
• Designs and Case Studies
  – Hadoop
    • HDFS
    • MapReduce
    • HBase
    • RPC
    • Combination of components
  – Memcached
• Open Challenges
• Conclusion and Q&A

Page 8:

Big Data Processing with Hadoop

• The open-source implementation of the MapReduce programming model for Big Data analytics
• Major components
  – HDFS
  – MapReduce
• The underlying Hadoop Distributed File System (HDFS) is used by both MapReduce and HBase
• The model scales, but the high amount of communication during intermediate phases can be further optimized

[Diagram: Hadoop Framework - User Applications running over MapReduce and HBase, which sit on Hadoop Common (RPC) and HDFS]

Page 9:

Hadoop MapReduce Architecture & Operations

• Map and Reduce tasks together carry out the total job execution
• Communication in the shuffle phase uses HTTP over Java sockets

[Diagram: MapReduce execution flow, highlighting the disk operations between the map, shuffle, and reduce phases]
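To make the division of work between Map and Reduce tasks concrete, below is a minimal sketch of the classic WordCount job against the Hadoop 1.x "mapred" API (the generation matching the Hadoop 1.2.1 base referenced later in this talk). It is illustrative only and not part of the original slides.

```java
// Minimal WordCount against the Hadoop 1.x "mapred" API.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Map task: emit <word, 1> for every token in its input split.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE); // intermediate output, moved by the shuffle
      }
    }
  }

  // Reduce task: sum the per-word counts delivered by the shuffle.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

Every out.collect(...) in the map task produces intermediate data that the shuffle phase moves over HTTP; that traffic is exactly what the RDMA designs later in the talk target.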

Page 10:

Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop; highly reliable and fault-tolerant
• Adopted by many reputed organizations, e.g., Facebook and Yahoo!
• NameNode: stores the file system namespace
• DataNode: stores data blocks
• Developed in Java for platform-independence and portability
• Uses sockets for communication!

[Diagram: HDFS architecture - a Client communicating over RPC with the NameNode and the DataNodes]
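As a concrete example of this socket-based client path, here is a minimal HDFS write sketch using the standard org.apache.hadoop.fs API; the NameNode URI and file path are placeholders.

```java
// A minimal sketch of an HDFS client write; this is the (socket-based)
// write path that the RDMA designs discussed later accelerate.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // Hadoop 1.x key

    FileSystem fs = FileSystem.get(conf);

    // The client asks the NameNode for block locations over RPC, then
    // streams data to a pipeline of DataNodes (3 replicas by default).
    FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"));
    out.writeBytes("hello, HDFS\n");
    out.close();
    fs.close();
  }
}
```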

Page 11:

System-Level Interaction Between Clients and Data Nodes in HDFS

[Diagram: HDFS clients connected through high performance networks to HDFS DataNodes backed by HDD/SSD storage]

Page 12:

HBase

• The Apache Hadoop database (http://hbase.apache.org/)
• A semi-structured database which is highly scalable
• Integral part of many datacenter applications, e.g., the Facebook Social Inbox
• Developed in Java for platform-independence and portability
• Uses sockets for communication!

[Diagram: HBase architecture - HBase clients talking to the HMaster and HRegion Servers, coordinated through ZooKeeper, with data stored in HDFS]
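For reference, a minimal Put/Get sketch against the HBase client API of that era (table and column names are invented for illustration):

```java
// Minimal HBase Put/Get with the 0.9x-generation client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "usertable");

    // Put: write one cell; the client routes it to the owning HRegion Server.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("cf"), Bytes.toBytes("q1"), Bytes.toBytes("value1"));
    table.put(put);

    // Get: read the cell back; this round trip is the kind measured in the
    // HBase latency results later in the talk.
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q1"))));
    table.close();
  }
}
```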

Page 13:

System-Level Interaction Among HBase Clients, Region Servers, and Data Nodes

[Diagram: HBase clients connected through high performance networks to HRegion Servers, which in turn reach Data Nodes (HDD/SSD) through high performance networks]

Page 14:

Memcached Architecture

• Integral part of the Web 2.0 architecture
• Distributed caching layer
  – Aggregates spare memory from multiple nodes
  – General purpose
• Typically used to cache database queries and results of API calls
• Scalable model, but typical usage is very network intensive

[Diagram: Web frontend servers (Memcached clients) facing the Internet, connected through high performance networks to Memcached servers and database servers; each server node has CPUs, main memory, SSD and HDD]
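The cache-aside usage pattern implied above, sketched with the spymemcached Java client (choosing spymemcached is an assumption; any Memcached client library works the same way):

```java
// Cache-aside: check the cache, fall back to the database on a miss,
// then populate the cache for subsequent reads.
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class CacheAsideExample {
  public static void main(String[] args) throws Exception {
    MemcachedClient cache =
        new MemcachedClient(new InetSocketAddress("memcached-server", 11211));

    String key = "user:42:profile";
    Object profile = cache.get(key);          // 1. try the cache first
    if (profile == null) {
      profile = queryDatabase(key);           // 2. miss: hit the database
      cache.set(key, 3600, profile);          // 3. populate cache (1h TTL)
    }
    cache.shutdown();
  }

  // Placeholder for an expensive database query or API call.
  static Object queryDatabase(String key) { return "profile-for-" + key; }
}
```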

Page 15:

Presentation Outline (repeated)

Page 16:

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: timeline of the number of clusters (0-500) and percentage of clusters (0-100%) in the Top500 list]

Page 17:

HPC and Advanced Interconnects

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols
  – InfiniBand
  – 10 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
• Very good performance
  – Low latency (a few microseconds)
  – High bandwidth (100 Gb/s with dual FDR InfiniBand)
  – Low CPU overhead (5-10%)
• The OpenFabrics software stack with IB, iWARP and RoCE interfaces is driving HPC systems
• Many such systems in the Top500 list

Page 18:

Trends of Networking Technologies in TOP500 Systems

[Chart: Interconnect Family - Systems Share over time]

• The percentage share of InfiniBand is steadily increasing

Page 19:

Use of High-Performance Networks for Scientific Computing

• Message Passing Interface (MPI)
  – The most commonly used programming model for HPC applications
• Parallel file systems
• Storage systems
• Almost 13 years of research and development since InfiniBand was introduced in October 2000
• Other programming models are emerging to take advantage of high-performance networks
  – UPC
  – OpenSHMEM
  – Hybrid MPI+PGAS (OpenSHMEM and UPC)

Page 20:

One-way Latency: MPI over IB with MVAPICH2

[Charts: small-message and large-message one-way latency (us) vs. message size (bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR; reported small-message latencies range from 0.99 us to 1.82 us]

Platforms: DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) and 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

Page 21:

Bandwidth: MPI over IB with MVAPICH2

[Charts: unidirectional bandwidth (up to 12,810 MBytes/sec) and bidirectional bandwidth (up to 24,727 MBytes/sec) vs. message size (bytes) for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR and Ivy-ConnectIB-DualFDR]

Platforms: DDR, QDR - 2.4 GHz Quad-core (Westmere) Intel PCI Gen2 with IB switch; FDR - 2.6 GHz Octa-core (SandyBridge) Intel PCI Gen3 with IB switch; ConnectIB-Dual FDR - 2.6 GHz Octa-core (SandyBridge) and 2.8 GHz Deca-core (IvyBridge) Intel PCI Gen3 with IB switch

Page 22:

Presentation Outline (repeated)

Page 23:

Can High-Performance Interconnects Benefit Hadoop?

• Most current Hadoop environments use Ethernet infrastructure with sockets
• Concerns for performance and scalability
• Usage of high-performance networks is beginning to draw interest from many companies
• What are the challenges?
• Where do the bottlenecks lie?
• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
• Can HPC clusters with high-performance networks be used for Big Data applications using Hadoop?

Page 24:

Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Diagram: layered stack - Applications on top of Big Data Middleware (HDFS, MapReduce, HBase and Memcached); middleware over Programming Models (Sockets; other protocols?); below that a Communication and I/O Library covering point-to-point communication, threaded models and synchronization, virtualization, I/O and file systems, QoS, fault-tolerance and benchmarks; all running on Commodity Computing System Architectures (multi- and many-core architectures and accelerators), Networking Technologies (InfiniBand, 1/10/40GigE and intelligent NICs) and Storage Technologies (HDD and SSD)]

Page 25:

Common Protocols using Open Fabrics

Application/middleware interfaces and the protocol stacks beneath them:

• Sockets over kernel TCP/IP: Ethernet driver, 1/10/40 GigE adapter, Ethernet switch
• Sockets over hardware-offloaded TCP/IP: 10/40 GigE-TOE adapter, Ethernet switch
• Sockets over IPoIB (kernel TCP/IP): InfiniBand adapter, InfiniBand switch
• Sockets over RSockets (user space, RDMA): InfiniBand adapter, InfiniBand switch
• Sockets over SDP (user space, RDMA): InfiniBand adapter, InfiniBand switch
• Verbs over iWARP (user space, RDMA): iWARP adapter, Ethernet switch
• Verbs over RoCE (user space, RDMA): RoCE adapter, Ethernet switch
• Verbs over native IB (user space, RDMA): InfiniBand adapter, InfiniBand switch

Page 26:

Can Hadoop be Accelerated with High-Performance Networks and Protocols?

• Current design: Application → Sockets → 1/10 GigE network
• Enhanced designs: Application → Accelerated Sockets (Verbs / hardware offload) → 10 GigE or InfiniBand
• Our approach: Application → OSU Design over the Verbs interface → 10 GigE or InfiniBand

• Sockets are not designed for high performance
  – Stream semantics often mismatch the upper layers (Hadoop and HBase)
  – Zero-copy is not available for non-blocking sockets

Page 27:

Presentation Outline (repeated)

Page 28:

RDMA for Apache Hadoop Project

• High-performance design of Hadoop over RDMA-enabled interconnects
  – High-performance design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
• Current release: 0.9.8
  – Based on Apache Hadoop 1.2.1
  – Compliant with Apache Hadoop 1.2.1 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR and FDR)
    • RoCE
    • Various multi-core platforms
    • Different file systems with disks and SSDs
  – http://hadoop-rdma.cse.ohio-state.edu

Page 29:

Design Overview of HDFS with RDMA

• Enables high performance RDMA communication, while supporting the traditional socket interface
• A JNI layer bridges the Java-based HDFS with a communication library written in native code
• Only the communication part of the HDFS write path is modified; no change in the HDFS architecture
• Design features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – InfiniBand/RoCE support

[Diagram: Applications over HDFS; the Write path goes through the Java Native Interface (JNI) to the OSU Design over Verbs and RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...), while other operations use the Java socket interface over 1/10 GigE or IPoIB networks]
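The JNI bridging pattern described above can be pictured as follows; this is a hypothetical sketch, not the actual OSU design, and all class and method names are invented for illustration.

```java
// Hypothetical sketch of JNI bridging: Java-level HDFS code calls into a
// native communication library that issues verbs/RDMA operations.
public class NativeRdmaChannel {
  static {
    System.loadLibrary("rdmacomm"); // native library built over IB verbs
  }

  // Native methods implemented in C against the verbs API (illustrative).
  private native long connect(String host, int port);
  private native int rdmaWrite(long connHandle, byte[] data, int len);
  private native void close(long connHandle);

  // Java-side wrapper that a modified HDFS write path could use.
  public void writeBlock(String dataNodeHost, int port, byte[] block) {
    long conn = connect(dataNodeHost, port);
    try {
      rdmaWrite(conn, block, block.length);
    } finally {
      close(conn);
    }
  }
}
```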

Page 30:

Communication Time in HDFS

• Cluster with HDD DataNodes
  – 30% improvement in communication time over IPoIB (QDR)
  – 56% improvement in communication time over 10GigE
• Similar improvements are obtained for SSD DataNodes

[Chart: HDFS communication time (s) vs. file size (2-10 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); communication time reduced by 30%]

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

Page 31:

Evaluations using TestDFSIO

• Cluster with 8 HDD DataNodes, single disk per node
  – 24% improvement over IPoIB (QDR) for 20 GB file size
• Cluster with 4 SSD DataNodes, single SSD per node
  – 61% improvement over IPoIB (QDR) for 20 GB file size

[Charts: total TestDFSIO throughput (MBps) vs. file size (5-20 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); left: 8 HDD nodes with 8 maps (throughput increased by 24%), right: 4 SSD nodes with 4 maps (throughput increased by 61%)]

Page 32:

Evaluations using TestDFSIO

• Cluster with 4 DataNodes, 1 HDD per node
  – 24% improvement over IPoIB (QDR) for 20 GB file size
• Cluster with 4 DataNodes, 2 HDDs per node
  – 31% improvement over IPoIB (QDR) for 20 GB file size
• 2 HDDs vs. 1 HDD
  – 76% improvement for OSU-IB (QDR)
  – 66% improvement for IPoIB (QDR)

[Chart: total throughput (MBps) vs. file size (5-20 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR), each with 1 HDD and 2 HDDs per node; throughput increased by 31%]

Page 33:

Evaluations using Enhanced DFSIO of Intel HiBench on SDSC-Gordon

• Cluster with 32 DataNodes, single SSD per node
  – 47% improvement in throughput over IPoIB (QDR) for 128 GB file size
  – 16% improvement in latency over IPoIB (QDR) for 128 GB file size

[Charts: aggregated throughput (MBps) and execution time (s) vs. file size (32-128 GB) for IPoIB (QDR) and OSU-IB (QDR); throughput increased by 47%, execution time reduced by 16%]

Page 34:

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede

• Cluster with 64 DataNodes, single HDD per node
  – 64% improvement in throughput over IPoIB (FDR) for 256 GB file size
  – 37% improvement in latency over IPoIB (FDR) for 256 GB file size

[Charts: aggregated throughput (MBps) and execution time (s) vs. file size (64-256 GB) for IPoIB (FDR) and OSU-IB (FDR); throughput increased by 64%, execution time reduced by 37%]

Page 35:

Presentation Outline (repeated)

Page 36:

Design Overview of MapReduce with RDMA

• Enables high performance RDMA communication, while supporting the traditional socket interface
• A JNI layer bridges the Java-based MapReduce with a communication library written in native code
• InfiniBand/RoCE support
• Design features for RDMA
  – Prefetch/caching of MOFs (map output files)
  – In-memory merge
  – Overlap of merge & reduce

[Diagram: Applications over the MapReduce framework (JobTracker, TaskTracker, Map, Reduce); the RDMA path goes through the Java Native Interface (JNI) to the OSU Design over Verbs and RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...), alongside the Java socket interface over 1/10 GigE or IPoIB networks]

Page 37:

Evaluations using Sort

• With 8 HDD DataNodes, for a 40 GB sort
  – 41% improvement over IPoIB (QDR)
  – 39% improvement over Hadoop-A-IB (QDR)
• With 8 SSD DataNodes, for a 40 GB sort
  – 52% improvement over IPoIB (QDR)
  – 44% improvement over Hadoop-A-IB (QDR)

[Charts: job execution time (sec) vs. data size (25-40 GB) for 10GigE, IPoIB (QDR), Hadoop-A-IB (QDR) and OSU-IB (QDR); left: 8 HDD DataNodes, right: 8 SSD DataNodes]

M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC Workshop, held in conjunction with IPDPS, May 2013

Page 38:

Evaluations using TeraSort

• 100 GB TeraSort with 8 DataNodes with 2 HDDs per node
  – 51% improvement over IPoIB (QDR)
  – 41% improvement over Hadoop-A-IB (QDR)

[Chart: job execution time (sec) vs. data size (60-100 GB) for 10GigE, IPoIB (QDR), Hadoop-A-IB (QDR) and OSU-IB (QDR)]

Page 39:

Performance Evaluation on Larger Clusters

• For a 240 GB Sort on 64 nodes
  – 40% improvement over IPoIB (QDR) with HDD used for HDFS
• For a 320 GB TeraSort on 64 nodes
  – 38% improvement over IPoIB (FDR) with HDD used for HDFS

[Charts: job execution time (sec) for IPoIB, Hadoop-A-IB and OSU-IB; left: Sort in the OSU cluster (60/120/240 GB on cluster sizes 16/32/64, QDR), right: TeraSort on TACC Stampede (80/160/320 GB on cluster sizes 16/32/64, FDR)]

Page 40:

Evaluations using PUMA Workload

• 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
• 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

[Chart: normalized execution time for 10GigE, IPoIB (QDR) and OSU-IB (QDR) on the PUMA benchmarks AdjList (30 GB), SelfJoin (80 GB), SeqCount (30 GB), WordCount (30 GB) and InvertIndex (30 GB)]

Page 41:

Presentation Outline (repeated)

Page 42:

Motivation - Detailed Analysis of HBase Put/Get

• HBase 1 KB Put - communication time: 8.9 us
  – A factor of 6X improvement over 10GigE in communication time
• HBase 1 KB Get - communication time: 8.9 us
  – A factor of 6X improvement over 10GigE in communication time

[Charts: time breakdown (us) of HBase 1 KB Put and 1 KB Get for IPoIB (DDR), 10GigE and OSU-IB (DDR), split into communication, communication preparation, server processing, server serialization, client processing and client serialization]

M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS'12

Page 43:

HBase Single-Server-Multi-Client Results

• HBase Get latency
  – 4 clients: 104.5 us; 16 clients: 296.1 us
• HBase Get throughput
  – 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
• 27% improvement in throughput for 16 clients over 10GigE

[Charts: Get latency (us) and throughput (ops/sec) vs. number of clients (1-16) for IPoIB (DDR), OSU-IB (DDR) and 10GigE]

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12

Page 44:

HBase - YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Serving Benchmark)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients

[Charts: read latency and write latency (us) vs. number of clients (8-128) for IPoIB (QDR), OSU-IB (QDR) and 10GigE]

Page 45:

Presentation Outline (repeated)

Page 46:

Design Overview of Hadoop RPC with RDMA

• Enables high performance RDMA communication, while supporting the traditional socket interface
• Design features
  – JVM-bypassed buffer management
  – On-demand connection setup
  – RDMA- or send/recv-based adaptive communication
  – InfiniBand/RoCE support

[Diagram: Applications over Hadoop RPC; the default path uses the Java socket interface over 1/10 GigE or IPoIB networks, while the OSU design goes through the Java Native Interface (JNI) to Verbs over RDMA-capable networks (IB, 10GE/iWARP, RoCE, ...)]
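To illustrate the "JVM-bypassed buffer management" idea, here is a hedged sketch of a direct (off-heap) buffer pool: direct ByteBuffers live outside the Java heap, so their native addresses stay stable and a native verbs layer can register them with the NIC, avoiding an extra copy through the heap. The class is illustrative, not the OSU implementation.

```java
// Sketch of a pool of direct ByteBuffers suitable for handing to a
// native RDMA layer (registration itself would happen in native code).
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

public class DirectBufferPool {
  private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();
  private final int bufSize;

  public DirectBufferPool(int count, int bufSize) {
    this.bufSize = bufSize;
    for (int i = 0; i < count; i++) {
      // Direct buffers are off-heap; the GC does not relocate their
      // backing memory, so the addresses remain valid for RDMA use.
      free.add(ByteBuffer.allocateDirect(bufSize));
    }
  }

  public synchronized ByteBuffer acquire() {
    ByteBuffer b = free.poll();
    return (b != null) ? b : ByteBuffer.allocateDirect(bufSize);
  }

  public synchronized void release(ByteBuffer b) {
    b.clear();
    free.add(b);
  }
}
```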

Page 47:

Hadoop RPC over IB - Gain in Latency and Throughput

• Hadoop RPC over IB ping-pong latency
  – 1 byte: 39 us; 4 KB: 52 us
  – 42%-49% and 46%-50% improvements compared with the performance on 10GigE and IPoIB (32 Gbps), respectively
• Hadoop RPC over IB throughput
  – 512 bytes & 48 clients: 135.22 Kops/sec
  – 82% and 64% improvements compared with the peak performance on 10GigE and IPoIB (32 Gbps), respectively

[Charts: latency (us) vs. payload size (1 byte-4 KB) and throughput (Kops/sec) vs. number of clients (8-64) for 10GigE, IPoIB (QDR) and OSU-IB (QDR)]

Page 48:

Presentation Outline (repeated)

Page 49:

Performance Benefits - TestDFSIO

• For 5 GB
  – 53.4% improvement over IPoIB, 126.9% improvement over 10GigE with HDD used for HDFS
  – 57.8% improvement over IPoIB, 138.1% improvement over 10GigE with SSD used for HDFS
• For 20 GB
  – 10.7% improvement over IPoIB, 46.8% improvement over 10GigE with HDD used for HDFS
  – 73.3% improvement over IPoIB, 141.3% improvement over 10GigE with SSD used for HDFS

[Charts: total throughput (MBps) vs. file size (5-20 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); left: cluster with 4 HDD nodes, right: cluster with 4 SSD nodes; TestDFSIO with 4 maps in total]

Page 50:

Performance Benefits - Sort

• For an 80 GB Sort in a cluster of size 8
  – 40.7% improvement over IPoIB, 32.2% improvement over 10GigE with HDD used for HDFS
  – 43.6% improvement over IPoIB, 42.9% improvement over 10GigE with SSD used for HDFS

[Charts: job execution time (sec) vs. sort size (48-80 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); left: cluster with 8 HDD nodes, right: cluster with 8 SSD nodes; Sort with <8 maps, 8 reduces> per host]

Page 51:

Performance Benefits - TeraSort

• For a 100 GB TeraSort in a cluster of size 8
  – 45% improvement over IPoIB, 39% improvement over 10GigE with HDD used for HDFS
  – 46% improvement over IPoIB, 40% improvement over 10GigE with SSD used for HDFS

[Charts: job execution time (sec) vs. size (80-100 GB) for 10GigE, IPoIB (QDR) and OSU-IB (QDR); left: cluster with 8 HDD nodes, right: cluster with 8 SSD nodes; TeraSort with <8 maps, 8 reduces> per host]

Page 52:

Performance Benefits - RandomWriter & Sort on SDSC-Gordon

• RandomWriter in a cluster with a single SSD per node
  – 16% improvement over IPoIB (QDR) for 50 GB in a cluster of 16
  – 20% improvement over IPoIB (QDR) for 300 GB in a cluster of 64
• Sort in a cluster with a single SSD per node
  – 20% improvement over IPoIB (QDR) for 50 GB in a cluster of 16
  – 36% improvement over IPoIB (QDR) for 300 GB in a cluster of 64

[Charts: job execution time (sec) for IPoIB (QDR) and OSU-IB (QDR) at 50/100/200/300 GB on cluster sizes 16/32/48/64; RandomWriter time reduced by 20%, Sort time reduced by 36%]

Page 53:

Presentation Outline (repeated)

Page 54:

Memcached-RDMA Design

• Server and client perform a negotiation protocol
  – The master thread assigns clients to the appropriate worker thread
• Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread
• All other Memcached data structures are shared among RDMA and sockets worker threads

[Diagram: sockets clients and RDMA clients connect to a master thread, which hands each off to a sockets worker thread or a verbs worker thread; all workers operate on shared data (memory slabs, items)]
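A toy sketch of the master/worker dispatch pattern described above, using plain Java sockets and thread pools; the one-byte negotiation and all names are invented for illustration (the real design binds RDMA clients to verbs worker threads, not to a second socket pool).

```java
// Master thread accepts connections, negotiates the transport, and binds
// each client to one worker type for the rest of its lifetime.
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MasterDispatcher {
  // Separate pools, mirroring the sockets vs. verbs worker threads.
  private final ExecutorService socketsWorkers = Executors.newFixedThreadPool(2);
  private final ExecutorService verbsWorkers  = Executors.newFixedThreadPool(2);

  public void run(int port) throws Exception {
    try (ServerSocket server = new ServerSocket(port)) {
      while (true) {
        Socket client = server.accept();        // master thread accepts
        boolean wantsRdma = negotiate(client);  // simple capability exchange
        (wantsRdma ? verbsWorkers : socketsWorkers)
            .submit(() -> serve(client, wantsRdma)); // client stays bound
      }
    }
  }

  private boolean negotiate(Socket client) throws Exception {
    return client.getInputStream().read() == 'R'; // toy protocol: 'R' = RDMA
  }

  private void serve(Socket client, boolean rdma) {
    // Worker loop: handle requests against the shared slabs/items (not shown).
  }
}
```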

Page 55:

Memcached Get Latency (Small Message)

• Memcached Get latency
  – 4 bytes, RC/UD - DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
  – 2K bytes, RC/UD - DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
• Almost a factor of four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster

[Charts: Get latency (us) vs. message size (1 byte-2K); left: Intel Clovertown cluster (IB: DDR) with IPoIB (DDR), 10GigE (DDR), OSU-RC-IB (DDR) and OSU-UD-IB (DDR); right: Intel Westmere cluster (IB: QDR) with IPoIB (QDR), 10GigE (QDR), OSU-RC-IB (QDR) and OSU-UD-IB (QDR)]

Page 56:

Memcached Get TPS (4 bytes)

• Memcached Get transactions per second for 4 bytes
  – On IB QDR: 1.4M TPS (RC) and 1.3M TPS (UD) for 8 clients
• Significant improvement with native IB QDR compared to SDP and IPoIB

[Charts: thousands of transactions per second (TPS) vs. number of clients (1-1K on the left, 4 and 8 on the right) for IPoIB (QDR), OSU-RC-IB (QDR) and OSU-UD-IB (QDR)]

Page 57:

Memcached Performance (FDR Interconnect)

• Memcached Get latency
  – 4 bytes: OSU-IB 2.84 us, IPoIB 75.53 us
  – 2K bytes: OSU-IB 4.49 us, IPoIB 123.42 us
  – Latency reduced by nearly 20X
• Memcached throughput (4 bytes)
  – 4080 clients: OSU-IB 556 Kops/sec, IPoIB 233 Kops/sec
  – Nearly 2X improvement in throughput

Experiments on TACC Stampede (Intel SandyBridge cluster, IB: FDR)

[Charts: Get latency (us) vs. message size (1 byte-4K) and throughput (thousands of TPS) vs. number of clients (16-4080) for IPoIB (FDR) and OSU-IB (FDR)]

Page 58:

Application Level Evaluation - Olio Benchmark

• Olio Benchmark
  – RC: 1.6 sec, UD: 1.9 sec, Hybrid: 1.7 sec for 1024 clients
• 4X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to that of the pure RC design

[Charts: time (ms) vs. number of clients (1-8 and 64-1024) for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR)]

Page 59:

Application Level Evaluation - Real Application Workloads

• Real application workload
  – RC: 302 ms, UD: 318 ms, Hybrid: 314 ms for 1024 clients
• 12X better than IPoIB for 8 clients
• The hybrid design achieves performance comparable to that of the pure RC design

[Charts: time (ms) vs. number of clients (1-8 and 64-1024) for IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR) and OSU-Hybrid-IB (QDR)]

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached design for InfiniBand Clusters using Hybrid Transport, CCGrid'12

Page 60:

Presentation Outline (repeated)

Page 61:

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

[Diagram: the same layered stack as on Page 24, now with an RDMA protocol added alongside sockets under the programming models, and an open question of upper-level changes in the Big Data middleware]

Page 62:

More Challenges

• Multi-threading and synchronization
  – Multi-threaded model exploration
  – Fine-grained synchronization and lock-free design
  – Unified helper threads for different components
  – Multi-endpoint design to support multi-threaded communications
• QoS and virtualization
  – Network virtualization and locality-aware communication for Big Data middleware
  – Hardware-level virtualization support for end-to-end QoS
  – I/O scheduling and storage virtualization
  – Live migration

Page 63:

More Challenges (Cont'd)

• Support for accelerators
  – Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
  – Offload computation-intensive workloads to accelerators
  – Explore maximum overlap between communication and offloaded computation
• Fault tolerance enhancements
  – Novel data replication schemes for high-efficiency fault tolerance
  – Exploration of light-weight fault tolerance mechanisms for Big Data processing
• Support for parallel file systems
  – Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters

Page 64:

Performance Improvement Potential of MapReduce over Lustre on HPC Clusters - An Initial Study

• For a 500 GB Sort on 64 nodes (TACC Stampede)
  – 44% improvement over IPoIB (FDR)
• For a 100 GB Sort on 16 nodes (SDSC Gordon)
  – 33% improvement over IPoIB (QDR)
• Can more optimizations be achieved by leveraging more features of Lustre?

[Charts: job execution time (sec) vs. data size for IPoIB and OSU-IB; left: Sort on TACC Stampede (300-500 GB, FDR), right: Sort on SDSC Gordon (60-100 GB, QDR)]

Page 65:

Are the Current Benchmarks Sufficient for Big Data?

• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can a performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Page 66:

Challenges in Benchmarking of RDMA-based Designs

[Diagram: the same layered stack as on Page 24, with RDMA protocols added under the programming models; current benchmarks exercise only the application level, no benchmarks exist for the communication and I/O library level, and the correlation between the two levels is an open question]

Page 67:

OSU MPI Micro-Benchmarks (OMB) Suite

• A comprehensive suite of benchmarks to
  – Compare performance of different MPI libraries on various networks and systems
  – Validate low-level functionalities
  – Provide insights into the underlying MPI-level designs
• Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
• Extended later to
  – MPI-2 one-sided
  – Collectives
  – GPU-aware data movement
  – OpenSHMEM (point-to-point and collectives)
  – UPC
• Has become an industry standard
• Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack

Page 68:

Iterative Process - Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

[Diagram: the layered stack from Page 24, with applications-level benchmarks at the top and micro-benchmarks at the communication and I/O library level feeding into each other iteratively]

Page 69:

Future Plans of OSU Big Data Project

• Upcoming RDMA for Apache Hadoop releases will support
  – HDFS parallel replication
  – Advanced MapReduce
  – HBase
  – Hadoop 2.0.x
• Memcached-RDMA
• Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
• Advanced designs with upper-level changes and optimizations
• Comprehensive set of benchmarks
• More details at http://hadoop-rdma.cse.ohio-state.edu

Page 70:

Concluding Remarks

• Presented an overview of Big Data, Hadoop and Memcached
• Provided an overview of networking technologies
• Discussed challenges in accelerating Hadoop
• Presented initial designs that take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, HBase and Memcached
• Results are promising
• Many other open issues still need to be solved
• This work will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

Page 71:

Personnel Acknowledgments

Current Students: S. Chakraborthy (Ph.D.), N. Islam (Ph.D.), J. Jose (Ph.D.), M. Li (Ph.D.), S. Potluri (Ph.D.), R. Rajachandrasekhar (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), R. Shir (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientist: S. Sur

Current Post-Docs: X. Lu, K. Hamidouche

Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne

Current Programmers: M. Arnold, J. Perkins

Past Programmers: D. Bureddy

Current Senior Research Associate: H. Subramoni