Harp: Collective Communication on Hadoop
Bingjing Zhang, Yang Ruan, Judy Qiu
Outline
• Motivations
  – Why do we bring collective communications to big data processing?
• Collective Communication Abstractions
  – Our approach to optimize data movement
  – Hierarchical data abstractions and the operations defined on top of them
• MapCollective Programming Model
  – Extended from the MapReduce model to support collective communications
  – Two-level BSP parallelism
• Harp Implementation
  – A plugin on Hadoop
  – Component layers and the job flow
• Experiments
• Conclusion
Motivation
K-means Clustering in (Iterative) MapReduce
• Map tasks (M) compute the local point sums; the sums are shuffled and gathered to reduce tasks (R), which compute the global centroids; the new centroids are broadcast back to the map tasks for the next iteration.

K-means Clustering in Collective Communication
• Map tasks (M) control the iterations and compute the local point sums; a single allreduce operation gives every task the global centroids.
• More efficient and much simpler!
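The collective version can be sketched as follows. This is a minimal, framework-agnostic sketch of the allreduce-based K-means iteration; the Collective interface and its allreduce method are hypothetical stand-ins for whatever collective API is available, not Harp's actual classes.

```java
// Sketch of collective-style K-means: one allreduce per iteration replaces shuffle + reduce + broadcast.
public final class KMeansSketch {

    /** Hypothetical collective operation: element-wise sum of the arrays from all map tasks. */
    interface Collective {
        double[] allreduce(double[] localSums);
    }

    static double[][] runIterations(double[][] points, double[][] centroids,
                                    int iterations, Collective comm) {
        int k = centroids.length, dim = centroids[0].length;
        for (int iter = 0; iter < iterations; iter++) {
            // Local step: accumulate per-centroid point sums and counts, packed into one array.
            double[] sums = new double[k * (dim + 1)];
            for (double[] p : points) {
                int nearest = nearestCentroid(p, centroids);
                for (int d = 0; d < dim; d++) {
                    sums[nearest * (dim + 1) + d] += p[d];
                }
                sums[nearest * (dim + 1) + dim] += 1.0;   // point count for this centroid
            }
            // Collective step: every task obtains the global sums and recomputes the centroids.
            double[] global = comm.allreduce(sums);
            for (int c = 0; c < k; c++) {
                double count = global[c * (dim + 1) + dim];
                if (count > 0) {
                    for (int d = 0; d < dim; d++) {
                        centroids[c][d] = global[c * (dim + 1) + d] / count;
                    }
                }
            }
        }
        return centroids;
    }

    private static int nearestCentroid(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```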
Large Scale Data Analysis Applications
• Iterative Applications
  – Cached and reused local data between iterations
  – Complicated computation steps
  – Large intermediate data in communications
  – Various communication patterns
• Example domains: Computer Vision, Complex Networks, Bioinformatics, Deep Learning
The Models of Contemporary Big Data Tools
[Figure: contemporary big data tools organized by programming model (MapReduce, DAG, Graph, BSP/Collective) and by purpose (for query, for iterations/learning, for streaming). Tools shown include Hadoop, Pig, Hive, Tez, MRQL, Spark SQL, DryadLINQ, Dryad, Spark, Stratosphere / Flink, Twister, HaLoop, Storm, S4, Samza, Spark Streaming, Giraph, Hama, GraphLab, GraphX and Harp.]
Many of them have fixed communication patterns!
Contributions
• Parallelism model: the MapCollective model, which replaces the shuffle between map (M) and reduce (R) tasks in the MapReduce model with collective communication among the map tasks.
• Architecture: Harp is added at the framework layer beside MapReduce V2, on top of the YARN resource manager, so that MapReduce applications and MapCollective applications run side by side.
Collective Communication Abstractions
• Hierarchical Data Abstractions
  – Basic types: arrays, key-values, vertices, edges and messages
  – Partitions: array partitions, key-value partitions, vertex partitions, edge partitions and message partitions
  – Tables: array tables, key-value tables, vertex tables, edge tables and message tables
• Collective Communication Operations (a usage sketch follows this list)
  – Broadcast, allgather, allreduce
  – Regroup
  – Send messages to vertices, send edges to vertices
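As a rough illustration of the hierarchy, the sketch below models double arrays as a basic type, partitions as ID-tagged arrays, and a table as a set of partitions. All class and method names here are illustrative stand-ins, not Harp's actual API.

```java
// Illustrative model of the table / partition / basic-type hierarchy (hypothetical names).
import java.util.HashMap;
import java.util.Map;

class DoubleArray {                      // basic type: a plain double array
    final double[] data;
    DoubleArray(double[] data) { this.data = data; }
}

class ArrayPartition {                   // partition: a basic-type array plus a partition ID
    final int partitionId;
    final DoubleArray array;
    ArrayPartition(int partitionId, DoubleArray array) {
        this.partitionId = partitionId;
        this.array = array;
    }
}

class ArrayTable {                       // table: a set of partitions indexed by partition ID
    private final Map<Integer, ArrayPartition> partitions = new HashMap<>();
    void addPartition(ArrayPartition p) { partitions.put(p.partitionId, p); }
    ArrayPartition getPartition(int id) { return partitions.get(id); }
    Iterable<ArrayPartition> getPartitions() { return partitions.values(); }
}
```

Collective operations such as broadcast, allgather, allreduce and regroup are then defined on tables, while broadcast and send are also defined on individual partitions and basic types.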
Hierarchical Data Abstractions
[Figure: the hierarchy of transferable data abstractions.]
• Basic types (support broadcast, send): byte arrays, int arrays, long arrays, double arrays, objects, key-values, and vertices, edges and messages
• Partitions (support broadcast, send): array partition <array type>, key-value partition, vertex partition, edge partition, message partition
• Tables (support broadcast, allgather, allreduce, regroup, message-to-vertex, …): array table <array type>, key-value table, vertex table, edge table, message table
Example: regroup
[Figure: Process 0, Process 1 and Process 2 each hold a table with a local set of partitions (Partition 0 through Partition 4, with some partition IDs appearing on more than one process). After the regroup operation the partitions are redistributed so that all partitions with the same ID end up together at a single destination process.]
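One simple way to realize the "same ID meets at one process" behavior shown above is a fixed partition-to-process mapping. The sketch below assumes a modulo rule purely for illustration; it is not necessarily the rule Harp uses.

```java
// Illustrative regroup planning: every partition ID is mapped to exactly one destination
// process, so partitions with the same ID from different processes meet at that destination.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RegroupSketch {
    static Map<Integer, List<Integer>> planRegroup(List<Integer> partitionIds, int numProcesses) {
        Map<Integer, List<Integer>> destToPartitions = new HashMap<>();
        for (int partitionId : partitionIds) {
            int dest = partitionId % numProcesses;   // assumed modulo rule: same ID, same destination
            destToPartitions.computeIfAbsent(dest, d -> new ArrayList<>()).add(partitionId);
        }
        return destToPartitions;
    }
}
```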
Operations
• broadcast: arrays, key-value pairs & vertices; chain algorithm (sketched below)
• allgather: arrays, key-value pairs & vertices; bucket algorithm
• allreduce: arrays, key-value pairs; bi-directional exchange or regroup-allgather algorithm
• regroup: arrays, key-value pairs & vertices; point-to-point direct sending
• send messages to vertices: messages, vertices; point-to-point direct sending
• send edges to vertices: edges, vertices; point-to-point direct sending
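The chain algorithm used for broadcast can be sketched as follows; send and receive are hypothetical point-to-point helpers, not Harp's actual API.

```java
// Minimal sketch of chain broadcast: the root (rank 0) sends its data to the next process,
// and every process forwards what it receives to its successor in the chain.
interface PointToPoint {
    void send(int destRank, byte[] data);
    byte[] receive(int srcRank);
}

class ChainBroadcast {
    static byte[] broadcast(byte[] data, int myRank, int numProcesses, PointToPoint comm) {
        if (myRank != 0) {
            data = comm.receive(myRank - 1);      // receive from the predecessor in the chain
        }
        if (myRank != numProcesses - 1) {
            comm.send(myRank + 1, data);          // forward to the successor in the chain
        }
        return data;                              // every process ends up with the root's data
    }
}
```

For large data, the chain can be pipelined into chunks so that all links of the chain transfer data at the same time.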
MapCollective Programming Model
• Two-level BSP parallelism
  – Inter-node parallelism at the process level and intra-node parallelism at the thread level (a sketch follows below)
[Figure: the thread-level BSP nested inside the process-level BSP.]
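A minimal sketch of one superstep under two-level BSP parallelism: threads inside a process compute local partial results in parallel (thread level), and a single collective call then combines the results across processes (process level). The Collective interface is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

class TwoLevelBspSketch {
    /** Hypothetical process-level collective operation. */
    interface Collective {
        double[] allreduce(double[] local);
    }

    static double[] superstep(List<double[]> localChunks, Collective comm, int numThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            // Thread level: each thread reduces one chunk of the process-local data in parallel.
            List<Future<Double>> futures = new ArrayList<>();
            for (double[] chunk : localChunks) {
                futures.add(pool.submit(() -> {
                    double sum = 0;
                    for (double v : chunk) {
                        sum += v;
                    }
                    return sum;
                }));
            }
            double localSum = 0;
            for (Future<Double> f : futures) {
                localSum += f.get();              // thread-level synchronization point
            }
            // Process level: combine the per-process results across all processes.
            return comm.allreduce(new double[] { localSum });
        } finally {
            pool.shutdown();
        }
    }
}
```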
The Harp Library
• A Hadoop plugin which targets Hadoop 2.2.0
• Provides the implementation of the collective communication abstractions and the MapCollective programming model
• Project Link
  – http://salsaproj.indiana.edu/harp/index.html
• Source Code Link
  – https://github.com/jessezbj/harp-project
[Figure: Harp sits beside MapReduce V2 on top of YARN, supporting both MapReduce applications and MapCollective applications.]
Component Layers
[Figure: Harp's component layers on top of MapReduce. The MapCollective programming model provides the MapCollective interface and task management. The collective communication abstractions provide the collective communication APIs; the array, key-value and graph data abstractions; the collective communication operators; the hierarchical data types (tables & partitions); and the memory resource pool. On top run applications such as K-Means, WDA-SMACOF and Graph-Drawing.]
A MapCollective Job
[Figure: MapCollective job flow.]
• The client submits the job through the MapCollective Runner to the YARN Resource Manager, which (I) launches the MapCollectiveAppMaster; the MapCollectiveContainerAllocator and MapCollectiveContainerLauncher then (II) launch the tasks.
• Step 1: Map task locations are recorded from the original MapReduce AppMaster.
• Each task runs a CollectiveMapper with setup, mapCollective and cleanup phases, which (2) reads key-value pairs, (3) invokes the collective communication APIs, and (4) writes output to HDFS. (A skeleton of the mapper follows below.)
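A skeleton of a CollectiveMapper following the phases named above; the KeyValReader type and the exact method signatures are assumptions made for illustration and may differ from Harp's source.

```java
// Skeleton of a MapCollective task: setup, mapCollective and cleanup phases.
// Generic parameters and reader methods are assumed for illustration only.
public class MyCollectiveMapper extends CollectiveMapper<String, String, Object, Object> {

    @Override
    protected void setup(Context context) {
        // Read the job configuration, allocate tables, load cached data for the iterations.
    }

    @Override
    protected void mapCollective(KeyValReader reader, Context context) throws Exception {
        // Step 2: read the assigned key-value pairs.
        while (reader.nextKeyValue()) {
            String key = reader.getCurrentKey();
            String value = reader.getCurrentValue();
            // ... build local tables and partitions from the input ...
        }
        // Step 3: invoke collective communication APIs (broadcast, allgather,
        //         allreduce, regroup) inside the iteration loop.
        // Step 4: write the output to HDFS.
    }

    @Override
    protected void cleanup(Context context) {
        // Release tables and other resources when the task finishes.
    }
}
```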
Experiments
• Applications
  – K-means Clustering
  – Force-directed Graph Drawing Algorithm
  – WDA-SMACOF
• Test Environment
  – Big Red II
    • http://kb.iu.edu/data/bcqt.html
K-means Clustering
• In each iteration, the map tasks (M) allreduce the centroids.
[Figure: execution time (seconds) and speedup vs. number of nodes (up to 128) for two datasets: 500M points with 10K centroids and 5M points with 1M centroids.]
Force-directed Graph Drawing Algorithm
T. Fruchterman, M. Reingold. “Graph Drawing by Force-Directed Placement”, Software Practice & Experience 21 (11), 1991.
• In each iteration, the map tasks (M) allgather the positions of the vertices (a sketch of one iteration follows below).
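A minimal sketch of one parallel iteration, assuming each task owns a contiguous range of vertices and a hypothetical allgather helper; only the repulsive forces are shown and the constants are arbitrary.

```java
// Sketch of one iteration of parallel force-directed layout: each task moves only its own
// vertices but needs every vertex position to compute forces, so the updated positions are
// allgathered after the local update. The Collective interface is a hypothetical stand-in.
class ForceDirectedSketch {
    interface Collective {
        double[][] allgather(double[][] localPositions);
    }

    static double[][] iterate(double[][] allPositions, int myStart, int myEnd, Collective comm) {
        double[][] updated = new double[myEnd - myStart][2];
        for (int v = myStart; v < myEnd; v++) {
            double fx = 0, fy = 0;
            // Repulsive force from every other vertex (attractive edge forces omitted for brevity).
            for (int u = 0; u < allPositions.length; u++) {
                if (u == v) continue;
                double dx = allPositions[v][0] - allPositions[u][0];
                double dy = allPositions[v][1] - allPositions[u][1];
                double dist2 = dx * dx + dy * dy + 1e-9;
                fx += dx / dist2;
                fy += dy / dist2;
            }
            updated[v - myStart][0] = allPositions[v][0] + 0.01 * fx;
            updated[v - myStart][1] = allPositions[v][1] + 0.01 * fy;
        }
        // Collective step: gather every task's updated positions so all tasks see the full layout.
        return comm.allgather(updated);
    }
}
```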
[Figure: execution time (seconds) and speedup vs. number of nodes (up to 128).]
WDA-SMACOF
Y. Ruan et al. “A Robust and Scalable Solution for Interpolative Multidimensional Scaling With Weighting”. E-Science, 2013.
• The map tasks (M) allreduce the stress value, and allgather and allreduce intermediate results in the conjugate gradient process.
[Figure: execution time (seconds) vs. number of nodes (up to 128) for 100K, 200K, 300K and 400K points, and speedup for 100K, 200K and 300K points.]
Conclusions
• Harp is designed as a pluggable component that brings high performance to the Apache Big Data Stack and bridges the gap between the Hadoop ecosystem and HPC systems through a clear collective communication abstraction, which did not previously exist in the Hadoop ecosystem.
• The experiments show that with Harp we can scale three applications to 128 nodes with 4096 CPUs on the Big Red II supercomputer, with close-to-linear speedup in most tests.