PREGEL: A System for Large-Scale Graph Processing
The Problem
• Large graphs are often part of the computations required in modern systems (social networks, Web graphs, etc.).
• There are many graph computing problems like shortest paths, clustering, PageRank, minimum cut, and connected components, but there exists no scalable general-purpose system for implementing them.
2 Pregel
Characteristics of the algorithms
• They often exhibit poor locality of memory access.
• Very little computation work required per vertex.
• Changing degree of parallelism over the course of execution.
Refer [1, 2]
3 Pregel
Possible solutions
• Crafting a custom distributed framework for every new algorithm.
• Existing distributed computing platforms like MapReduce. – These are sometimes used to mine large graphs [3, 4], but often give suboptimal performance and have usability issues.
• Single-computer graph algorithm libraries – Limiting the scale of the graph is necessary – BGL, LEDA, NetworkX, JDSL, Stanford GraphBase, or FGL
• Existing parallel graph systems that do not handle fault tolerance and other issues – The Parallel BGL [5] and CGMgraph [6]
Pregel 4
Pregel
To overcome these challenges, Google came up with Pregel.
• Provides scalability
• Fault tolerance
• Flexibility to express arbitrary algorithms
The high level organization of Pregel programs is inspired by Valiant’s Bulk Synchronous Parallel model[7].
Pregel 5
Message passing model
A pure message passing model has been used, omitting remote reads and ways to emulate shared memory because:
1. Message passing model was found sufficient for all graph algorithms
2. The message passing model performs better than reading remote values because latency can be amortized by delivering large batches of messages asynchronously.
Pregel 6
Message passing model
Pregel 7
Example
Find the largest value of a vertex in a strongly connected graph
8 Pregel
[Figure, supersteps 0-3: vertex values start as 3, 6, 2, 1 and converge to 6 as the maximum propagates. Blue arrows are messages; blue vertices have voted to halt.]
9 Pregel
Finding the largest value in a graph
Basic Organization
• Computations consist of a sequence of iterations called supersteps.
• During a superstep, the framework invokes a user-defined function for each vertex, which specifies the behavior at a single vertex V and a single superstep S. The function can:
– Read messages sent to V in superstep S-1
– Send messages to other vertices that will be received in superstep S+1
– Modify the state of V and of its outgoing edges
– Make topology changes (introduce/delete/modify edges/vertices)
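To make the superstep contract above concrete, here is a rough single-machine sketch of the driving loop. It is not Google's implementation; the names SimVertex, Inbox, and RunSuperstep are made up for illustration.

// A rough sketch only: simulates how a framework could drive one superstep
// over the active vertices; all names here are illustrative.
#include <unordered_map>
#include <vector>

struct SimVertex {
  int id;
  double value;
  bool halted = false;            // set when the vertex votes to halt
  std::vector<int> out_edges;
};

// Messages delivered to a vertex in the previous superstep, keyed by its id.
using Inbox = std::unordered_map<int, std::vector<double>>;

template <typename ComputeFn>
Inbox RunSuperstep(std::vector<SimVertex>& vertices, const Inbox& in,
                   ComputeFn compute) {
  Inbox out;                              // messages to deliver in superstep S+1
  const std::vector<double> no_msgs;
  for (SimVertex& v : vertices) {
    auto it = in.find(v.id);
    bool has_msgs = (it != in.end()) && !it->second.empty();
    if (v.halted && !has_msgs) continue;  // inactive vertices are skipped
    v.halted = false;                     // an incoming message reactivates v
    compute(v, has_msgs ? it->second : no_msgs, out);  // may vote to halt
  }
  return out;
}

Execution ends once every vertex stays halted and the returned Inbox is empty, which matches the termination condition described below.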
10 Pregel
Basic Organization - Superstep
11 Pregel
Model Of Computation: Entities
VERTEX
• Identified by a unique identifier.
• Has a modifiable, user defined value.
EDGE
• Source vertex and Target vertex identifiers.
• Has a modifiable, user defined value.
Pregel 12
Model Of Computation: Progress
• In superstep 0, all vertices are active.
• Only active vertices participate in a superstep. – They can go inactive by voting to halt.
– They can be reactivated by a message from another vertex.
• The algorithm terminates when all vertices have voted to halt and there are no messages in transit.
13 Pregel
Model Of Computation: Vertex
State machine for a vertex
14 Pregel
Comparison with MapReduce
Graph algorithms can be implemented as a series of MapReduce invocations, but this requires passing the entire state of the graph from one stage to the next, which is not the case with Pregel.
Also, the Pregel framework simplifies programming by using supersteps.
15 Pregel
The C++ API
Creating a Pregel program typically involves subclassing the predefined Vertex class.
• The user overrides the virtual Compute() method. This method is executed for every active vertex in each superstep.
• Compute() can get the vertex’s associated value with GetValue() or modify it using MutableValue().
• Values of edges can be inspected and modified using the out-edge iterator.
16 Pregel
The C++ API – Message Passing
Each message consists of a value and the name of the destination vertex. – The type of the value is specified in the template parameter of the Vertex class.
Any number of messages can be sent in a superstep. – The framework guarantees delivery and non-duplication but not in-order delivery.
A message can be sent to any vertex if its identifier is known.
17 Pregel
The C++ API – Pregel Code
Pregel code for finding the max value
class MaxFindVertex
    : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    double currMax = GetValue();
    SendMessageToAllNeighbors(currMax);
    for ( ; !msgs->Done(); msgs->Next()) {
      if (msgs->Value() > currMax)
        currMax = msgs->Value();
    }
    if (currMax > GetValue())
      *MutableValue() = currMax;
    else VoteToHalt();
  }
};
18 Pregel
The C++ API – Combiners
Sending a message to a vertex on a different machine incurs some overhead. However, if the algorithm does not need each message individually but only a function of them (for example, their sum), combiners can be used.
This can be done by overriding the Combine() method.
It can be used only for associative and commutative operations.
19 Pregel
The C++ API – Combiners
Example: Say we want to count the number of incoming links to all the pages in a set of interconnected pages.
In the first iteration, for each link from a vertex (page), we send a message to the destination page.
Here, a count function over the incoming messages can be used as a combiner to optimize performance.
In the MaxValue Example, a Max combiner would reduce the communication load.
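The combiner interface itself is not shown in the slides (and Pregel's code is not public), so the sketch below assumes a hypothetical Combiner<M> base class that merges two pending message values bound for the same destination vertex; that is enough to express the Max combiner mentioned above.

#include <algorithm>

// Hypothetical stand-in for the combiner interface: the framework may merge
// any two messages addressed to the same vertex using Combine().
template <typename M>
class Combiner {
 public:
  virtual ~Combiner() {}
  virtual M Combine(const M& a, const M& b) const = 0;
};

// Max is associative and commutative, so messages can be merged at the
// sending worker or at the receiving worker without changing the result.
class MaxCombiner : public Combiner<double> {
 public:
  double Combine(const double& a, const double& b) const override {
    return std::max(a, b);
  }
};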
20 Pregel
The C++ API – Combiners
21 Pregel
The C++ API – Aggregators
They are used for global communication, monitoring, and data.
Each vertex can produce a value in a superstep S for the Aggregator to use. The aggregated value is available to all the vertices in superstep S+1.
Aggregators can be used for statistics and for global communication.
Can be implemented by subclassing the Aggregator class.
Commutativity and associativity are required.
22 Pregel
The C++ API – Aggregators
Example: A Sum operator applied to the out-edge count of each vertex can be used to generate the total number of edges in the graph and communicate it to all the vertices. More complex reduction operators can even generate histograms. In the MaxValue example, we can finish the entire program in a single superstep by using a Max aggregator.
23 Pregel
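As with combiners, the real Aggregator class is not public; a minimal sketch of the edge-counting aggregator from the example above, with assumed method names, could look like this.

#include <cstdint>

// Hypothetical aggregator interface: each vertex contributes a value during
// superstep S; every vertex can read the combined result in superstep S+1.
template <typename T>
class Aggregator {
 public:
  virtual ~Aggregator() {}
  virtual void Reset() = 0;
  virtual void Accumulate(const T& value) = 0;
  virtual T Get() const = 0;
};

// Summing each vertex's out-degree yields the total number of edges.
class EdgeCountAggregator : public Aggregator<int64_t> {
 public:
  void Reset() override { total_ = 0; }
  void Accumulate(const int64_t& out_degree) override { total_ += out_degree; }
  int64_t Get() const override { return total_; }
 private:
  int64_t total_ = 0;
};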
The C++ API – Topology Mutations
The Compute() function can also be used to modify the structure of the graph.
Example: Hierarchical Clustering
Mutations take effect in the superstep after the requests were issued. Mutations are partially ordered:
– deletions take place before additions,
– deletion of edges happens before deletion of vertices, and
– addition of vertices happens before addition of edges.
This ordering resolves most of the conflicts; the rest are handled by user-defined handlers.
24 Pregel
Implementation
Pregel is designed for the Google cluster architecture.
The cluster architecture schedules jobs to optimize resource allocation, which may involve killing instances or moving them to different locations.
Persistent data is stored as files on a distributed storage system like GFS[8] or BigTable.
25 Pregel
Basic Architecture
The Pregel library divides a graph into partitions, based on the vertex ID, each consisting of a set of vertices and all of those vertices’ outgoing edges.
The default function is hash(ID) mod N, where N is the number of partitions.
The next few slides describe the several stages of the execution of a Pregel program.
26 Pregel
Pregel Execution
1. Many copies of the user program begin executing on a cluster of machines. One of these copies acts as the master.
The master is not assigned any portion of the graph, but is responsible for coordinating worker activity.
27 Pregel
Pregel Execution
2. The master determines how many partitions the graph will have and assigns one or more partitions to each worker machine.
Each worker is responsible for maintaining the state of its section of the graph, executing the user’s Compute() method on its vertices, and managing messages to and from other workers.
28 Pregel
Pregel Execution
29 Pregel
Pregel Execution
3. The master assigns a portion of the user’s input to each worker.
The input is treated as a set of records, each of which contains an arbitrary number of vertices and edges.
After the input has finished loading, all vertices are marked as active.
30 Pregel
Pregel Execution
4. The master instructs each worker to perform a superstep. The worker loops through its active vertices and calls Compute() for each one. It also delivers messages that were sent in the previous superstep.
When the worker finishes, it responds to the master with the number of vertices that will be active in the next superstep.
31 Pregel
Pregel Execution
32 Pregel
Pregel Execution
33 Pregel
Fault Tolerance
• Checkpointing is used to implement fault tolerance.
– At the start of every superstep the master may instruct the workers to save the state of their partitions in stable storage.
– This includes vertex values, edge values and incoming messages.
• Master uses “ping“ messages to detect worker failures.
34 Pregel
Fault Tolerance
• When one or more workers fail, the current state of their associated partitions is lost.
• The master reassigns these partitions to the currently available set of workers. – They reload their partition state from the most recent available checkpoint, which may be several supersteps old.
– The entire system is then restarted from that superstep.
• Confined recovery can be used to reduce this load.
35 Pregel
Applications
PageRank
36 Pregel
PageRank
PageRank is a link analysis algorithm that is used to determine the importance of a document based on the number of references to it and the importance of the source documents themselves.
[It was named after Larry Page, not after the rank of a webpage.]
37 Pregel
PageRank
A = A given page
T1 …. Tn = Pages that point to page A (citations)
d = Damping factor between 0 and 1 (usually kept as 0.85)
C(T) = number of links going out of T
PR(A) = the PageRank of page A
PR(A) = (1 - d) + d * (PR(T1)/C(T1) + PR(T2)/C(T2) + ... + PR(Tn)/C(Tn))
38 Pregel
PageRank
Courtesy: Wikipedia
39 Pregel
PageRank
40 Pregel
PageRank can be solved in two ways:
• A system of linear equations
• An iterative loop until convergence
We look at the pseudocode of the iterative version:
  Initial value of PageRank of all pages = 1.0;
  while (sum of PageRank of all pages - numPages > epsilon) {
    for each page Pi in list {
      PageRank(Pi) = (1 - d);
      for each page Pj linking to page Pi {
        PageRank(Pi) += d * (PageRank(Pj) / numOutLinks(Pj));
      }
    }
  }
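The pseudocode above is not directly runnable; a small self-contained C++ version of the same loop is sketched below. The four-page example graph is made up, and convergence is tested with an L1 difference between iterations rather than the slide's sum-based test.

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  const double d = 0.85, epsilon = 1e-6;
  // out_links[i] = pages that page i links to (a tiny made-up web graph).
  std::vector<std::vector<int>> out_links = {{1, 2}, {2}, {0}, {0, 2}};
  const int n = static_cast<int>(out_links.size());
  std::vector<double> pr(n, 1.0), next(n);

  for (double diff = 1.0; diff > epsilon; ) {
    std::fill(next.begin(), next.end(), 1.0 - d);
    for (int j = 0; j < n; ++j)               // page j donates rank to targets
      for (int i : out_links[j])
        next[i] += d * pr[j] / out_links[j].size();
    diff = 0.0;
    for (int i = 0; i < n; ++i) diff += std::fabs(next[i] - pr[i]);
    pr.swap(next);
  }
  for (int i = 0; i < n; ++i) std::printf("PR(%d) = %.4f\n", i, pr[i]);
  return 0;
}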
PageRank in MapReduce – Phase I
Parsing HTML
• Map task takes (URL, page content) pairs and maps them to (URL, (PRinit, list-of-urls))
– PRinit is the “seed” PageRank for URL
– list-of-urls contains all pages pointed to by URL
• Reduce task is just the identity function
41 Pregel
PageRank in MapReduce – Phase 2
PageRank Distribution
• Map task takes (URL, (cur_rank, url_list))
– For each u in url_list, emit (u, cur_rank/|url_list|)
– Emit (URL, url_list) to carry the points-to list along through iterations
• Reduce task gets (URL, url_list) and many (URL, val) values
– Sum the vals and fix up with d
– Emit (URL, (new_rank, url_list))
42 Pregel
PageRank in MapReduce - Finalize
• A non-parallelizable component determines whether convergence has been achieved
• If so, write out the PageRank lists; we are done
• Otherwise, feed the output of Phase 2 into another Phase 2 iteration
43 Pregel
PageRank in Pregel
class PageRankVertex
    : public Vertex<double, void, double> {
 public:
  virtual void Compute(MessageIterator* msgs) {
    if (superstep() >= 1) {
      double sum = 0;
      for (; !msgs->Done(); msgs->Next())
        sum += msgs->Value();
      *MutableValue() = 0.15 + 0.85 * sum;
    }
    if (superstep() < 30) {
      const int64 n = GetOutEdgeIterator().size();
      SendMessageToAllNeighbors(GetValue() / n);
    } else {
      VoteToHalt();
    }
  }
};
44 Pregel
PageRank in Pregel
The Pregel implementation contains the PageRankVertex, which inherits from the Vertex class.
The class has vertex value type double to store the tentative PageRank and message type double to carry PageRank fractions.
The graph is initialized so that in superstep 0, the value of each vertex is 1.0.
45 Pregel
PageRank in Pregel
In each superstep, each vertex sends out along each outgoing edge its tentative PageRank divided by the number of outgoing edges.
Also, each vertex sums up the values arriving on messages into sum and sets its own tentative PageRank to 0.15 + 0.85 * sum.
For convergence, either there is a limit on the number of supersteps or aggregators are used to detect convergence.
46 Pregel
Apache Giraph: Large-scale Graph Processing on Hadoop
Claudio Martella <[email protected]> @claudiomartella
Hadoop Summit @ Amsterdam - 3 April 2014
2
Graphs are simple
3
A computer network
4
A social network
5
A semantic network
6
A map
7
Graphs are huge
•Google’s index contains 50B pages
•Facebook has around 1.1B users
•Google+ has around 570M users
•Twitter has around 530M users
VERY rough estimates!
8
9
Graphs aren’t easy
10
Graphs are nasty.
11
Each vertex depends on its neighbours, recursively.
12
Recursive problems are nicely solved iteratively.
13
PageRank in MapReduce
•Record: < v_i, pr, [ v_j, ..., v_k ] >
•Mapper: emits < v_j, pr / #neighbours >
•Reducer: sums the partial values
14
MapReduce dataflow
15
Drawbacks
•Each job is executed N times
•Job bootstrap
•Mappers send PR values and structure
•Extensive IO at input, shuffle & sort, output
16
17
Timeline
•Inspired by Google Pregel (2010)
•Donated to ASF by Yahoo! in 2011
•Top-level project in 2012
•1.0 release in January 2013
•1.1 release due in days (2014)
18
Plays well with Hadoop
19
Vertex-centric API
20
BSP machine
21
BSP & Giraph
22
Advantages
•No locks: message-based communication
•No semaphores: global synchronization
•Iteration isolation: massively parallelizable
23
Architecture
24
Giraph job lifetime
25
Designed for iterations
•Stateful (in-memory)
•Only intermediate values (messages) sent
•Hits the disk at input, output, checkpoint
•Can go out-of-core
26
A bunch of other things
•Combiners (minimises messages)
•Aggregators (global aggregations)
•MasterCompute (executed on master)
•WorkerContext (executed per worker)
•PartitionContext (executed per partition)
27
Shortest Paths
[Figure sequence, slides 28-32: the vertex-centric shortest-paths computation illustrated step by step.]
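The shortest-paths slides above are figure-only in this transcript. For reference, here is a minimal single-machine simulation of the vertex-centric single-source shortest paths idea (the 5-vertex graph and all names are made up; Giraph's Java API expresses the same pattern). Each loop iteration plays the role of a superstep: a vertex that receives a shorter tentative distance updates itself and messages its out-neighbors; otherwise it effectively votes to halt.

#include <algorithm>
#include <climits>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// Weighted out-edges per vertex: (target, edge length).
using Graph = std::vector<std::vector<std::pair<int, int>>>;

int main() {
  Graph g = {{{1, 4}, {2, 1}}, {{3, 1}}, {{1, 2}, {3, 5}}, {{4, 3}}, {}};
  const int n = static_cast<int>(g.size()), INF = INT_MAX;
  std::vector<int> dist(n, INF);

  // inbox[v] = tentative distances sent to v in the previous superstep.
  std::map<int, std::vector<int>> inbox = {{0, {0}}};   // seed the source
  while (!inbox.empty()) {                              // one superstep per turn
    std::map<int, std::vector<int>> next_inbox;
    for (auto& entry : inbox) {
      int v = entry.first;
      int best = dist[v];
      for (int m : entry.second) best = std::min(best, m);
      if (best < dist[v]) {                 // improved: update and notify
        dist[v] = best;
        for (auto& e : g[v]) next_inbox[e.first].push_back(best + e.second);
      }                                     // otherwise: vote to halt
    }
    inbox.swap(next_inbox);
  }
  for (int v = 0; v < n; ++v) {
    if (dist[v] == INF) std::printf("dist(%d) = unreachable\n", v);
    else std::printf("dist(%d) = %d\n", v, dist[v]);
  }
  return 0;
}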
Composable API
33
Checkpointing
34
No SPoFs
35
Giraph scales
36
ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Giraph is fast
•100x over MR (Pr)
• jobs run within minutes
•given you have resources ;-)
37
Serialised objects
38
Primitive types
•Autoboxing is expensive
•Objects overhead (JVM)
•Use primitive types on your own
•Use primitive types-based libs (e.g. fastutils)
39
Sharded aggregators
40
Many stores with Gora
41
And graph databases
42
Current and next steps
•Out-of-core graph and messages
•Jython interface
•Remove Writable from < I V E M >
•Partitioned supernodes
•More documentation
43
GraphLab: A New Framework for Parallel Machine Learning
Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein
Presented by Guozhang Wang
DB Lunch, Nov.8, 2010
Overview
Programming ML Algorithms in Parallel
  ◦ Common Parallelism and MapReduce
  ◦ Global Synchronization Barriers
GraphLab
  ◦ Data Dependency as a Graph
  ◦ Synchronization as Fold/Reduce
Implementation and Experiments
From Multicore to Distributed Environment
Parallel Processing for ML
Parallel ML is a Necessity
  ◦ 13 Million Wikipedia Pages
  ◦ 3.6 Billion photos on Flickr
  ◦ etc.
Parallel ML is Hard to Program
  ◦ Concurrency vs. Deadlock
  ◦ Load Balancing
  ◦ Debugging
  ◦ etc.
MapReduce is the Solution?
High-level abstraction: Statistical Query Model [Chu et al, 2006]
Weighted Linear Regression: only sufficient statistics
𝚹 = A⁻¹b,  A = 𝚺ᵢ wᵢ(xᵢxᵢᵀ),  b = 𝚺ᵢ wᵢ(xᵢyᵢ)
MapReduce is the Solution?
High-level abstraction: Statistical Query Model [Chu et al, 2006]
K-Means: only data assignments
class mean = avg(xi), xi in class
Embarrassingly Parallel independent computation
No Communication needed
ML in MapReduce
Multiple Mapper
Single Reducer
Iterative MapReduce needs global synchronization at the single reducer
  ◦ K-means
  ◦ EM for graphical models
  ◦ gradient descent algorithms, etc.
Not always Embarrassingly Parallel
Data Dependency: not MapReducible
  ◦ Gibbs Sampling
  ◦ Belief Propagation
  ◦ SVM
  ◦ etc.
Capture Dependency as a Graph!
Overview
Programming ML Algorithms in Parallel◦ Common Parallelism and MapReduce◦ Global Synchronization Barriers
GraphLab◦ Data Dependency as a Graph◦ Synchronization as Fold/Reduce
Implementation and Experiments From Multicore to Distributed
Environment
Key Idea of GraphLab
Sparse Data Dependencies
Local Computations
[Figure: a sparse dependency graph over variables X1-X9.]
GraphLab for ML
High-level Abstraction
  ◦ Express data dependencies
  ◦ Iterative
Automatic Multicore Parallelism
  ◦ Data Synchronization
  ◦ Consistency
  ◦ Scheduling
Main Components of GraphLab
Data Graph
Shared Data Table
Scheduling
Update Functions and Scopes
GraphLab Model
Data Graph
A Graph with data associated with every vertex and edge.
[Figure: a data graph over variables X1-X11; each vertex stores data such as x3: sample value and C(X3): sample counts, and each edge stores data such as Φ(X6,X9): binary potential.]
Update Functions
Operations applied on a vertex that transform data in the scope of the vertex
Gibbs Update:
- Read samples on adjacent vertices
- Read edge potentials
- Compute a new sample for the current vertex
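A conceptual C++ sketch of this idea (not the real GraphLab API; Scope, VertexData, and Update are illustrative names): an update function may only read and write data in the scope of one vertex, i.e. the vertex itself, its adjacent edges, and its neighbors.

#include <cstddef>
#include <vector>

struct VertexData { double value = 0.0; };
struct EdgeData   { double weight = 1.0; };

// Everything the update function is allowed to touch.
struct Scope {
  VertexData* center;
  std::vector<const VertexData*> neighbors;   // adjacent vertices (read-only)
  std::vector<const EdgeData*> edges;         // adjacent edges   (read-only)
};

// Stand-in for a Gibbs-style update: recompute the center value from the
// neighbors' values and the edge data, then write it back into the scope.
void Update(Scope& s) {
  double acc = 0.0;
  for (std::size_t i = 0; i < s.neighbors.size(); ++i)
    acc += s.edges[i]->weight * s.neighbors[i]->value;
  s.center->value = s.neighbors.empty() ? 0.0 : acc / s.neighbors.size();
}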
Scope Rules
Consistency vs. Parallelism
  ◦ Belief Propagation: only uses edge data
  ◦ Gibbs Sampling: needs to read adjacent vertices
Scheduling
Scheduler determines the order of Update Function evaluations
Static Scheduling
  ◦ Round Robin, etc.
Dynamic Scheduling
  ◦ FIFO, Priority Queue, etc.
Dynamic Scheduling
[Figure: a dynamic scheduler distributing a queue of vertex tasks (a-k) across CPU 1 and CPU 2.]
Global Information
Shared Data Table in Shared Memory
  ◦ Model parameters (updatable)
  ◦ Sufficient statistics (updatable)
  ◦ Constants, etc. (fixed)
Sync Functions for Updatable Shared Data
  ◦ Accumulate performs an aggregation over vertices
  ◦ Apply makes a final modification to the accumulated data
Sync Functions
Much like Fold/Reduce
  ◦ Execute Aggregate over every vertex in turn
  ◦ Execute Apply once at the end
Can be called
  ◦ Periodically when update functions are active (asynchronous), or
  ◦ By the update function or user code (synchronous)
GraphLab
GraphLab Model
Data Graph
Shared Data Table
Scheduling
Update Functions and Scopes
Overview
Programming ML Algorithms in Parallel◦ Common Parallelism and MapReduce◦ Global Synchronization Barriers
GraphLab◦ Data Dependency as a Graph◦ Synchronization as Fold/Reduce
Implementation and Experiments From Multicore to Distributed
Environment
Implementation and Experiments
Shared-Memory Implementation in C++ using Pthreads
Applications:
  ◦ Belief Propagation
  ◦ Gibbs Sampling
  ◦ CoEM
  ◦ Lasso
  ◦ etc. (more on the project page)
Parallel Performance
[Figure: speedup vs. number of CPUs (up to 16) for the round-robin and colored schedules, compared against optimal linear speedup.]
From Multicore to Distributed Environment
MapReduce and GraphLab work well for multicores
  ◦ Simple high-level abstraction
  ◦ Local computation + global synchronization
When migrating to clusters
  ◦ Rethink Scope synchronization
  ◦ Rethink Shared Data single “reducer”
  ◦ Think about load balancing
  ◦ Maybe rethink the abstract model?
22.06.2015 DIMA – TU Berlin 1
Fachgebiet Datenbanksysteme und Informationsmanagement Technische Universität Berlin
http://www.dima.tu-berlin.de/
Hot Topics in Information Management
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
Igor Shevchenko
Mentor: Sebastian Schelter
22.06.2015 DIMA – TU Berlin 2
Agenda
1. Natural Graphs: Properties and Problems;
2. PowerGraph: Vertex Cut and Vertex Programs;
3. GAS Decomposition;
4. Vertex Cut Partitioning;
5. Delta Caching;
6. Applications and Evaluation;
Paper: Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs.
22.06.2015 DIMA – TU Berlin 3
■ Natural graphs are graphs derived from real-world or natural phenomena;
■ Graphs are big: billions of vertices and edges and rich metadata;
■ Natural graphs have a power-law degree distribution;
Natural Graphs
22.06.2015 DIMA – TU Berlin 4
Power-Law Degree Distribution
(Andrei Broder et al. Graph structure in the web)
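In symbols, a power-law degree distribution means the probability of seeing a vertex of degree d falls off polynomially, so most vertices have few neighbors while a handful have enormous degree (the exponent below is a generic assumption, commonly around 2 for web and social graphs):

% Power-law degree distribution
\Pr[\deg(v) = d] \;\propto\; d^{-\alpha}, \qquad \alpha > 1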
22.06.2015 DIMA – TU Berlin 5
■ We want to analyze natural graphs;
■ Essential for Data Mining and Machine Learning;
Goal
Identify influential people and information; Identify special nodes and communities; Model complex data dependencies;
Target ads and products; Find communities; Flow scheduling;
22.06.2015 DIMA – TU Berlin 6
■ Existing distributed graph computation systems perform poorly on natural graphs (Gonzalez et al. OSDI ’12);
■ The reason is presence of high degree vertices;
Problem
High-Degree Vertices: Star-like motif
22.06.2015 DIMA – TU Berlin 7
Possible problems with high degree vertices:
■ Limited single-machine resources;
■ Work imbalance;
■ Sequential computation;
■ Communication costs;
■ Graph partitioning;
Applicable to: ■ Hadoop; GraphLab; Pregel (Piccolo);
Problem Continued
22.06.2015 DIMA – TU Berlin 8
■ High degree vertices can exceed the memory capacity of a single machine;
■ Store edge meta-data and adjacency information;
Problem: Limited Single-Machine Resources
22.06.2015 DIMA – TU Berlin 9
■ The power-law degree distribution can lead to significant work imbalance and frequent barriers;
■ For ex. with synchronous execution (Pregel):
Problem: Work Imbalance
22.06.2015 DIMA – TU Berlin 10
■ No parallelization of individual vertex-programs;
■ Edges are processed sequentially;
■ Locking does not scale well to high-degree vertices (for ex. in GraphLab);
Problem: Sequential Computation
Sequentially process edges
Asynchronous execution requires heavy locking
22.06.2015 DIMA – TU Berlin 11
■ Generate and send large amount of identical messages (for ex. in Pregel);
■ This results in communication asymmetry;
Problem: Communication Costs
22.06.2015 DIMA – TU Berlin 12
■ Natural graphs are difficult to partition;
■ Pregel and GraphLab use random (hashed) partitioning on natural graphs, thus maximizing the network communication;
Problem: Graph Partitioning
22.06.2015 DIMA – TU Berlin 13
■ Natural graphs are difficult to partition;
■ Pregel and GraphLab use random (hashed) partitioning on natural graphs, thus maximizing the network communication;
Expected fraction of edges cut: 1 - 1/p, where p = number of machines.
Examples:
■ 10 machines: 90% of edges cut;
■ 100 machines: 99% of edges cut;
Problem: Graph Partitioning Continued
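A one-line justification of the percentages above, assuming each vertex is hashed independently and uniformly to one of p machines:

% The two endpoints of an edge land on the same machine with probability 1/p, so
\mathbb{E}\left[\frac{\#\,\text{edges cut}}{|E|}\right] = 1 - \frac{1}{p}
\qquad (p = 10 \Rightarrow 0.9,\quad p = 100 \Rightarrow 0.99)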
22.06.2015 DIMA – TU Berlin 14
■ GraphLab and Pregel are not well suited for computations on natural graphs;
Reasons: ■ Challenges of high-degree vertices; ■ Low quality partitioning;
Solution: ■ PowerGraph, a new abstraction;
In Summary
22.06.2015 DIMA – TU Berlin 15
PowerGraph
22.06.2015 DIMA – TU Berlin 16
Two approaches for partitioning the graph in a distributed environment:
■ Edge Cut;
■ Vertex Cut;
Partition Techniques
22.06.2015 DIMA – TU Berlin 17
■ Used by Pregel and GraphLab abstractions; ■ Evenly assign vertices to machines;
Edge Cut
22.06.2015 DIMA – TU Berlin 18
■ Used by the PowerGraph abstraction; ■ Evenly assign edges to machines;
Vertex Cut The strong point of the paper
22.06.2015 DIMA – TU Berlin 19
Think like a Vertex [Malewicz et al. SIGMOD’10]
User-defined Vertex-Program:
1. Runs on each vertex;
2. Interactions are constrained by graph structure;
Pregel and GraphLab also use this concept, where parallelism is achieved by running multiple vertex-programs simultaneously;
Vertex Programs
22.06.2015 DIMA – TU Berlin 20
■ Vertex cut distributes a single vertex-program across several machines;
■ Allows parallelizing high-degree vertices;
GAS Decomposition The strong point of the paper
22.06.2015 DIMA – TU Berlin 21
Generalize the vertex-program into three phases:
1. Gather: accumulate information about the neighborhood;
2. Apply: apply the accumulated value to the center vertex;
3. Scatter: update adjacent edges and vertices;
GAS Decomposition
Gather, Apply, and Scatter are user-defined functions;
The strong point of the paper
22.06.2015 DIMA – TU Berlin 22
■ Executed on the edges in parallel; ■ Accumulate information about neighborhood;
Gather Phase
22.06.2015 DIMA – TU Berlin 23
■ Executed on the central vertex; ■ Apply accumulated value to center vertex;
Apply Phase
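A conceptual sketch of the three user-defined functions for PageRank in GAS form (this is not the actual PowerGraph API; the accumulator type, signatures, and names are assumptions for illustration):

#include <cmath>

struct VData { double rank = 1.0; int out_degree = 1; };  // assumes out_degree > 0

// Gather: run per in-edge, possibly on different machines; partial results
// are merged with an associative, commutative sum.
double Gather(const VData& neighbor) {
  return neighbor.rank / neighbor.out_degree;
}

// Apply: run once on the master copy of the center vertex.
void Apply(VData& v, double gathered_sum) {
  v.rank = 0.15 + 0.85 * gathered_sum;
}

// Scatter: run per out-edge; returning true signals the neighbor to be
// rescheduled, which is what makes delta caching / asynchronous runs possible.
bool Scatter(const VData& v, double old_rank) {
  return std::fabs(v.rank - old_rank) > 1e-4;
}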
2D Partitioning (Aydin Buluc and Kamesh Madduri)
8 / 13
Graph Partitioning for Scalable Distributed Graph Computations
Aydın Buluç Kamesh Madduri [email protected] [email protected]
10th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering, February 13-14, 2012
Atlanta, GA
Overview of our study
• We assess the impact of graph partitioning for computations on ‘low diameter’ graphs
• Does minimizing edge cut lead to lower execution time?
• We choose parallel Breadth-First Search as a representative distributed graph computation
• Performance analysis on DIMACS Challenge instances
2
Key Observations for Parallel BFS
• Well-balanced vertex and edge partitions do not guarantee load-balanced execution, particularly for real-world graphs
– Range of relative speedups (8.8X to 50X, 256-way parallel concurrency) for low-diameter DIMACS graph instances.
• Graph partitioning methods reduce overall edge cut and communication volume, but lead to increased computational load imbalance
• Inter-node communication time is not the dominant cost in our tuned bulk-synchronous parallel BFS implementation
3
Talk Outline
• Level-synchronous parallel BFS on distributed-memory systems
– Analysis of communication costs
• Machine-independent counts for inter-node communication cost
• Parallel BFS performance results for several large-scale DIMACS graph instances
4
Parallel BFS strategies
5
1. Expand current frontier (level-synchronous approach, suited for low-diameter graphs)
• O(D) parallel steps
• Adjacencies of all vertices in the current frontier are visited in parallel
2. Stitch multiple concurrent traversals (Ullman-Yannakakis, for high-diameter graphs)
• Path-limited searches from “super vertices”
• APSP between “super vertices”
[Figure: both strategies illustrated on a 10-vertex example graph, starting from a source vertex.]
• Consider a logical 2D processor grid (pr * pc = p) and the dense matrix representation of the graph
• Assign each processor a submatrix (i.e., the edges within the submatrix)
“2D” graph distribution
[Figure: 9 vertices, 9 processors, 3x3 processor grid; the adjacency matrix is divided into submatrices, which are flattened into sparse matrices to give the per-processor local graph representation.]
BFS with a 1D-partitioned graph
Steps:
1. Local discovery: explore adjacencies of vertices in the current frontier.
2. Fold: All-to-all exchange of adjacencies.
3. Local update: update distances/parents for unvisited vertices.
Consider an undirected graph with n vertices and m edges.
Each processor ‘owns’ n/p vertices and stores their adjacencies (~2m/p per processor, assuming balanced partitions).
[Figure: a 7-vertex example graph (vertices 0-6), with each processor storing the adjacency pairs of the vertices it owns.]
BFS with a 1D-partitioned graph: example
Current frontier: vertices 1 (partition Blue) and 6 (partition Green).
1. Local discovery: the owners of the frontier vertices generate the adjacency pairs [1,0] [1,4] [1,6] and [6,1] [6,2] [6,3] [6,4]; the processors that own no frontier vertices have no work.
2. Fold (All-to-all exchange): each pair is routed to the processor that owns the target vertex.
3. Local update: each processor marks its previously unvisited targets (vertices 0, 2, 3, and 4) with updated distances/parents; these form the frontier for the next iteration.
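To make the three steps concrete, below is a self-contained C++ sketch that simulates them on a single machine for the 7-vertex example graph above; the block partition and the owner() helper are illustrative, and the per-processor inboxes stand in for the fold (All-to-all) exchange of a real distributed run.

#include <cstdio>
#include <vector>

int main() {
  const int p = 4;                 // number of simulated processors
  // The 7-vertex example graph (vertices 0..6) as adjacency lists.
  std::vector<std::vector<int>> adj = {
      {1, 3}, {0, 4, 6}, {3, 5, 6}, {0, 2, 6}, {1, 5, 6}, {2, 4}, {1, 2, 3, 4}};
  const int n = static_cast<int>(adj.size());
  auto owner = [&](int v) { return v * p / n; };   // simple block 1D partition

  std::vector<int> dist(n, -1);
  std::vector<int> frontier = {0};                 // source vertex
  dist[0] = 0;

  for (int level = 1; !frontier.empty(); ++level) {
    // 1. Local discovery: owners of frontier vertices enumerate adjacencies.
    // 2. Fold: route each target to the processor owning it; the per-owner
    //    inboxes below model the All-to-all exchange.
    std::vector<std::vector<int>> inbox(p);
    for (int u : frontier)
      for (int v : adj[u]) inbox[owner(v)].push_back(v);
    // 3. Local update: each owner marks its unvisited targets and contributes
    //    them to the frontier of the next iteration.
    std::vector<int> next;
    for (int proc = 0; proc < p; ++proc)
      for (int v : inbox[proc])
        if (dist[v] == -1) { dist[v] = level; next.push_back(v); }
    frontier.swap(next);
  }
  for (int v = 0; v < n; ++v) std::printf("dist(%d) = %d\n", v, dist[v]);
  return 0;
}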
Modeling parallel execution time
• Time dominated by local memory references and inter-node communication
• Assuming perfectly balanced computation and communication, we have
12
Local memory references: ≈ L(n/p) · (n/p) + β · (m/p), where L(n/p) is the local latency on a working set of size |n/p| and β is the inverse local RAM bandwidth.
Inter-node communication: ≈ β_a2a(p) · N_edgecut, where β_a2a(p) is the inverse All-to-all remote bandwidth with p participating processors and N_edgecut is the number of edges cut by the partitioning.
BFS with a 2D-partitioned graph
• Avoid the expensive p-way All-to-all communication step
• Each processor row collectively ‘owns’ n/p_r vertices
• Additional ‘Allgather’ communication step for processes in a row
13
Local memory references: analogous to the 1D case, but with smaller per-phase working sets (on the order of n/p_r and n/p_c).
Inter-node communication: the p-way All-to-all over the cut edges is replaced by collective steps within processor rows and columns (an All-to-all plus an Allgather among smaller groups of processes).
Temporal effects and communication-minimizing tuning prevent us from obtaining tighter bounds.
• The volume of communication can be further reduced by maintaining state of non-local visited vertices
14
Local pruning prior to All-to-all step
[Figure: on P0, the discovered pairs [0,3] [0,3] [1,3] [0,4] [1,4] [0,6] [1,6] [1,6] are pruned locally to [0,3] [0,4] [1,6] before the All-to-all exchange.]
Predictable BFS execution time for synthetic small-world graphs
• Randomly permuting vertex IDs ensures load balance on RMAT graphs (used in the Graph 500 benchmark).
• Our tuned parallel implementation for the NERSC Hopper system (Cray XE6) is ranked #2 on the current Graph 500 list.
15 Buluc & Madduri, Parallel BFS on distributed memory systems, Proc. SC’11, 2011.
Execution time is dominated by work performed in a few parallel phases
Modeling BFS execution time for real-world graphs
• Can we further reduce communication time utilizing existing partitioning methods?
• Does the model predict execution time for arbitrary low-diameter graphs?
• We try out various partitioning and graph distribution schemes on the DIMACS Challenge graph instances
– Natural ordering, Random, Metis, PaToH
16
Experimental Study
• The (weak) upper bound on aggregate communication data volume can be statically computed (based on the partitioning of the graph)
• We determine runtime estimates of – Total aggregate communication volume
– Sum of max. communication volume during each BFS iteration
– Intra-node computational work balance
– Communication volume reduction with 2D partitioning
• We obtain and analyze execution times (at several different parallel concurrencies) on a Cray XE6 system (Hopper, NERSC)
17
Orderings for the CoPapersCiteseer graph
18
[Figure: adjacency-matrix spy plots of the CoPapersCiteseer graph under Natural, Random, PaToH checkerboard, PaToH, and Metis orderings.]
BFS All-to-all phase: total communication volume normalized to # of edges (m)
[Table: graph name vs. # of partitions; volume as % of m, for Natural, Random, and PaToH partitionings.]
19
Ratio of max. communication volume across iterations to total communication volume
[Table: graph name vs. # of partitions; ratio of max. per-iteration volume to total volume, for Natural, Random, and PaToH partitionings.]
20
Reduction in total All-to-all communication volume with 2D partitioning
21
[Table: graph name vs. # of partitions; ratio compared to 1D, for Natural, Random, and PaToH partitionings.]
Edge count balance with 2D partitioning
[Table: graph name vs. # of partitions; max/avg. edge-count ratio, for Natural, Random, and PaToH partitionings.]
Parallel speedup on Hopper with 16-way partitioning
23
Execution time breakdown
24
[Figure: BFS time (ms) and communication time (ms) for the eu2005 and kron-simple-logn18 graphs under Random-1D, Random-2D, Metis-1D, and PaToH-1D partitioning, broken down into Computation, Fold, and Expand phases.]
Imbalance in parallel execution
25
eu2005, 16 processes*
[Figure: execution timelines under PaToH and Random partitioning.]
* Timeline of 4 processes shown in figures. The PaToH-partitioned graph suffers from severe load imbalance in computational phases.
Conclusions
• Randomly permuting vertex identifiers improves computational and communication load balance, particularly at higher process concurrencies
• Partitioning methods reduce overall communication volume, but introduce significant load imbalance
• Substantially lower parallel speedup with real-world graphs compared to synthetic graphs (8.8X vs 50X at 256-way parallel concurrency) – Points to the need for dynamic load balancing
26