Transcript of Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm

Paragon: Parallel Architecture-Aware Graph Partition Refinement Algorithm

Angen Zheng, Alexandros Labrinidis, Patrick Pisciuneri, Panos K. Chrysanthis, and Peyman Givi

University of Pittsburgh

1

Importance of Graph Partitioning

Applications of Graph Partitioning

o Scientific Simulations

o Distributed Graph Computation

o Pregel, Hama, Giraph

o VLSI Design

o Task Scheduling

o Linear Programming

2

Target Workloads

3

★ Vertex
○ a unique identifier
○ a modifiable, user-defined value

★ Edge
○ a modifiable, user-defined value
○ a target vertex identifier

★ Vertex-Centric UDF
○ Change vertex/edge state
○ Send msgs to neighbors
○ Receive msgs from neighbors
○ Mutate the graph topology
○ Deactivate at end of the superstep
○ Reactivate by external msgs

★ Goals
○ Balanced load distribution!
○ Minimizing the comm cost!
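The vertex-centric UDF list above maps naturally onto a Pregel-style compute function. A minimal sketch for single-source shortest path; the `Vertex` class and its method names are illustrative stand-ins, not the actual API of Pregel, Hama, or Giraph:

```python
class Vertex:
    """Minimal stand-in for a Pregel-style vertex (illustrative, not a real system's API)."""
    def __init__(self, vid, out_edges):
        self.id = vid
        self.out_edges = out_edges        # list of (target vertex id, edge weight)
        self.value = float("inf")         # modifiable, user-defined value
        self.outbox = []
        self.active = True

    def send(self, target, msg):
        self.outbox.append((target, msg))

    def vote_to_halt(self):
        self.active = False               # reactivated when external msgs arrive


def sssp_compute(vertex, messages):
    """Vertex-centric UDF, run once per active vertex per superstep."""
    best = min(messages, default=vertex.value)
    if best < vertex.value:               # change vertex state
        vertex.value = best
        for target, weight in vertex.out_edges:
            vertex.send(target, best + weight)   # send msgs to neighbors
    vertex.vote_to_halt()                 # deactivate at end of the superstep
```

Graph mutation is omitted; the point is only the shape of the per-vertex computation whose messages the partitioner must place well.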

A Balanced Partitioning = Even Load Distribution

(figure: the graph evenly partitioned across nodes N1, N2, N3)

4

Minimal Edge-Cut = Minimal Data Comm

5

(figure: a minimal edge-cut partitioning across nodes N1, N2, N3)

Minimal Data Comm ≠ Minimal Comm Cost

Roadmap

6

(chart: sections plotted against “# of slides” on one axis and “% of audience asleep” on the other)

Introduction ✔
Heterogeneity
State of the Art
PARAGON
Contention
Experiments

Nonuniform Inter-Node Network Comm Cost

Comm costs vary a lot as their locations change!

7

Nonuniform Intra-Node Network Comm Cost

Cores sharing more cache levels communicate faster!

8

Inter-Node Comm Cost > Intra-Node Comm Cost

Network(Ethernet, IPoIB)

Node#1 Node#2

9

Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost

Relative network comm cost matrix:

      N1   N2   N3
N1     –    1    6
N2     1    –    1
N3     6    1    –

(figure: partitioning with partitions placed on N1, N2, N3)

• 3 edge-cut
• 3 unit data comm
• 8 unit comm cost (8 = 1 × 6 + 2 × 1)

10

Minimal Edge-Cut = Minimal Data Comm ≠ Minimal Comm Cost

      N1   N2   N3
N1     –    1    6
N2     1    –    1
N3     6    1    –

(figure: alternative placement of the partitions on N1, N2, N3)

• 4 edge-cut
• 4 unit data comm
• 4 unit comm cost (4 = 1 × 1 + 3 × 1)

Group neighbouring vertices as close as possible!
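The two placements above can be checked in a few lines. The relative cost matrix is the one from the slides; the dictionary-of-pairs layout is just an illustrative choice:

```python
# Relative network comm cost between node pairs (the slides' matrix; diagonal omitted).
COST = {("N1", "N2"): 1, ("N2", "N3"): 1, ("N1", "N3"): 6}

def comm_cost(volumes):
    """Total comm cost: data volume per node pair times that pair's relative link cost."""
    return sum(v * COST[tuple(sorted(pair))] for pair, v in volumes.items())

# Slide 10 placement: 3 edge-cut, 1 unit over N1-N3 and 2 units over N2-N3 -> cost 8.
# Slide 11 placement: 4 edge-cut, 1 unit over N1-N2 and 3 units over N2-N3 -> cost 4.
```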

Roadmap

12

Introduction ✔
Heterogeneity ✔
State of the Art
PARAGON
Contention
Experiments

Overview of the State-of-the-Art

13

Balanced Graph (Re)Partitioning

★ Partitioners (static graphs)
○ Offline methods (high quality, poor scalability): Metis
○ Online methods (moderate quality, high scalability): ICA3PP’08, SoCC’12, TKDE’15, BigData’15, DG/LDG, Fennel

★ Repartitioners (dynamic graphs)
○ Offline methods (high quality, poor scalability): Parmetis; heterogeneity-aware: Aragon
○ Online methods (moderate~high quality, high scalability): xdgp, Mizan, CatchW, LogGP, Hermes; heterogeneity-aware: Paragon

Our Prior Work: Aragon

A sequential architecture-aware graph partition refinement algorithm.o Input:

• A partitioned graph• The relative network comm cost matrix

o Output:• A partitioning with improved mapping of the comm

pattern to the underlying hardware topology.

14

[1]. Angen Zheng, Alexandors Labrinidis, and Panos K. Chrysanthis. Archiitecture-Aware Graph Repartitioning for Data-Intensive Scientific

Computing. BigGraphs, 2014
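Aragon's refinement can be pictured as a Kernighan-Lin/FM-style pass in which the gain of moving a vertex is weighted by the relative comm cost matrix. A sketch under assumed data layouts (my reading of the idea, not the paper's exact formulation):

```python
def move_gain(v, dst, part, adj, cost):
    """Comm-cost reduction if vertex v moves from its current partition to dst.

    part[v]    -> v's current partition id
    adj[v]     -> iterable of (neighbor, edge weight)
    cost[a][b] -> relative network comm cost between the nodes hosting a and b
    """
    src = part[v]
    return sum(w * (cost[src][part[u]] - cost[dst][part[u]]) for u, w in adj[v])
```

With the cost matrix from slide 10, moving a vertex whose neighbor sits across a cost-6 link onto a cost-1 link yields a gain of 5 per unit of edge weight.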

Our Prior Work: Aragon

15

(figure: graph G partitioned into P1–P9 across nodes N1–N9; Aragon refines one pair of partitions at a time)

★ Heterogeneity-Aware Refinement (more details in the paper)
★ Aragon prefers to work in offline mode: a single node (N5 in the figure) must be able to hold the entire graph in memory

Roadmap

16

Introduction ✔
Heterogeneity ✔
State of the Art ✔
PARAGON
Contention
Experiments

Paragon

17

★ Overview:
○ Parallel Architecture-Aware Graph Partition Refinement Algorithm

★ Goal:
○ Group neighbouring vertices as close as possible

★ Paragon vs Aragon:
○ lower overhead
○ scales to much larger graphs

Paragon: Partition Grouping

18

(figure: partitions P1–P9 on nodes N1–N9)

★ Partitions are divided into groups: {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}

Paragon: Group Server Selection

19

★ One node is selected as the group server for each group (here: N2, N9, and N8)

Paragon: Sending “Partition” to Group Servers

20

(figure: the partitions of each group, {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}, are sent to group servers N2, N9, and N8)

★ Only send boundary vertices
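“Only send boundary vertices” is a simple filter over each group's partitions; the adjacency list and partition map below are assumed layouts for illustration:

```python
def boundary_vertices(adj, part, group):
    """Vertices whose partition is in `group` and that have a neighbor in a
    different partition; only these need to be shipped to the group server."""
    return {v for v, nbrs in adj.items()
            if part[v] in group and any(part[u] != part[v] for u in nbrs)}
```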

Paragon: Parallel Refinement

21

(figure: group servers N2, N9, and N8 each run Aragon on partition pairs within their group, in parallel across groups)

★ # of Groups
○ Degree of Parallelism

Paragon: Parallel Refinement

22

(figure: same setup; chart annotations: 36, 16, 96)

★ # of Groups
○ Degree of Parallelism
○ Parallelism vs Quality

Paragon: Shuffle Refinement

23

★ Swap partitions across groups: {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8} become N2: {P1, P2, P4}, N9: {P5, P7, P9}, N8: {P3, P6, P8}
★ Each group server runs Aragon on its new group, in parallel
★ Repeat k times
○ To increase the # of partition pairs being refined!
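Grouping, parallel refinement, and shuffling compose into a simple loop. A sequential sketch, where `refine_group` stands in for running Aragon at a group server; in Paragon the inner loop runs in parallel, and the real shuffle need not be random:

```python
import random

def shuffle_refine(groups, k, refine_group, seed=0):
    """Refine each partition group, then reshuffle partitions into new groups
    so that new partition pairs get refined; repeat k times."""
    rng = random.Random(seed)
    for _ in range(k):
        for g in groups:
            refine_group(g)      # one group server per group, in parallel in Paragon
        flat = [p for g in groups for p in g]
        rng.shuffle(flat)        # swap partitions across groups
        n = len(groups)
        groups = [flat[i::n] for i in range(n)]
    return groups
```

The number of groups fixes the degree of parallelism; k trades extra refinement time for more refined partition pairs.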

Roadmap

24

Introduction ✔
Heterogeneity ✔
State of the Art ✔
PARAGON ✔
Contention
Experiments

Inter-Node Comm Cost ? Intra-Node Comm Cost

Network

Node#1 Node#2

RDMA-enabled

25

Inter-Node Comm Cost ≅ Intra-Node Comm Cost

[2]. C. Binnig, U. Çetintemel, A. Crotty, A. Galakatos, T. Kraska, E. Zamanian, and S. B. Zdonik. The End of Slow Networks: It’s Time for a Redesign. CoRR, 2015.

★ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket

★ Infiniband: 1.7GB/s~37.5GB/s
★ DDR3: 6.25GB/s~16.6GB/s

Revisit the Impact of the Memory Subsystem Carefully!

26

Intra-Node Shared Resource Contention

27

(figure: intra-node message passing: the sending core loads the send buffer and writes it into the shared buffer; the receiving core loads the shared buffer and writes it into the receive buffer)

Cached Send/Shared/Receive Buffer

Intra-Node Shared Resource Contention

Multiple copies of the same data in LLC, contending for LLC and MC.

28

Intra-Node Shared Resource Contention

29

(figure: cross-socket case: the send/shared buffer cached on one socket, the receive/shared buffer cached on the other)

Multiple copies of the same data in LLC, contending for LLC, MC, and QPI.

Paragon: Avoiding Contention

30

★ Paragon scales the intra-node network comm cost up toward the maximal inter-node network comm cost according to the degree of contention:
○ memory is the bottleneck in small HPC clusters with fast networks
○ the network is the bottleneck in cloud/large clusters
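One way to read this slide (my interpretation; the paper gives the precise model): the effective intra-node comm cost is interpolated between its measured value and the maximal inter-node comm cost by a contention factor 𝛌 in [0, 1], the same 𝛌 that reappears in the experiments:

```python
def effective_intra_cost(intra, max_inter, lam):
    """lam = 1: memory-bound (small HPC clusters), intra-node comm is priced like the
    most expensive inter-node link; lam = 0: network-bound (cloud/large clusters)."""
    return intra + lam * (max_inter - intra)
```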

Paragon: Avoiding Contention

31

(figure: RDMA inter-node path: the send buffer on Node#1 goes through the IB HCA directly to the receive buffer on Node#2, avoiding the shared-buffer copies)

Roadmap

32

Introduction ✔
Heterogeneity ✔
State of the Art ✔
PARAGON ✔
Contention ✔
Experiments

Evaluation

MicroBenchmarks

o Degree of Refinement Parallelism

o Varying Shuffle Refinement Times

o Varying Initial Partitioners

Real-World Workloads

o Breadth First Search (BFS)

o Single Source Shortest Path (SSSP)

Billion-Edge Graph Scaling

33


Degree of Refinement Parallelism: Refinement Time

35

Aragon

★ com-lj: |V|=4M, |E| = 69M

★ 40 partitions: two 20-core machines

★ Initial Partitioner: DG (deterministic greedy)

★ # of Shuffle Times: 0

Degree of Refinement Parallelism: Partitioning Quality

36

★ com-lj: |V|=4M, |E| = 69M

★ 40 partitions: two 20-core machines

★ Initial Partitioner: DG (deterministic greedy)

★ # of Shuffle Times: 0

Evaluation

MicroBenchmarks

o Degree of Refinement Parallelism

o Varying Shuffle Refinement Times

o Varying Initial Partitioners

Real-World Workloads

o Breadth First Search (BFS)

o Single Source Shortest Path (SSSP)

Billion-Edge Graph Scaling

37

Varying Shuffle Refinement Times

38

★ com-lj: |V|=4M, |E| = 69M

★ 40 partitions: two 20-core machines

★ Initial Partitioner: DG (deterministic greedy)

★ Deg. of Parallelism: 8

★ With # of shuffle refinement times > 10:
○ Paragon had lower refinement overhead
■ 8~10s vs 33s (Paragon vs Aragon)
○ Paragon produced better decompositions
■ better by 0~2.6% (Paragon vs Aragon)

Evaluation

MicroBenchmarks

o Degree of Refinement Parallelism

o Varying Shuffle Refinement Times

o Varying Initial Partitioners

Real-World Workloads

o Breadth First Search (BFS)

o Single Source Shortest Path (SSSP)

Billion-Edge Graph Scaling

39

Varying Initial Partitioners

40

Dataset: 12 datasets from various areas
# of Parts: 40 (two 20-core machines)
Initial Partitioner: HP/DG/LDG
Deg. of Parallelism: 8
# of Refinement Times: 8

HP: Hashing Partitioning
DG: Deterministic Greedy Partitioning
LDG: Linear Deterministic Greedy Partitioning

Impact of Varying Initial Partitioners: Partitioning Quality

Improv.   Max   Avg.
HP        58%   43%
DG        29%   17%
LDG       53%   36%

41

Evaluation

MicroBenchmarks

o Degree of Refinement Parallelism

o Varying Shuffle Refinement Times

o Varying Initial Partitioners

Real-World Workloads

o Breadth First Search (BFS)

o Single Source Shortest Path (SSSP)

Billion-Edge Graph Scaling

42

BFS: Platform

43

• PittMPICluster: 32 nodes connected via a single switch with 56Gbps FDR Infiniband. Bottleneck: memory (𝛌=1).
• Gordon Supercomputer: 4x4x4 3D torus of switches connected via QDR Infiniband, with 16 compute nodes attached to each switch (8Gbps link bandwidth). Bottleneck: network (𝛌=0).

BFS: Competitors

44

★ uniParagon
★ Initial Partitioner: DG

(the state-of-the-art chart from slide 13 is shown again to position the competitors: Metis, DG/LDG, Fennel, ICA3PP’08, SoCC’12, TKDE’15, BigData’15; Parmetis, Aragon; xdgp, Mizan, CatchW, LogGP, Hermes)

BFS on PittMPICluster (𝛌=1): Exec. Time

45

★ as-skitter: |V|=1.6M, |E| = 22M
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8

(chart: speedups of 5.9x, 6.7x, 5.9x, and 2.7x over the competitors)

BFS on PittMPICluster (𝛌=1): Comm Vol

46

★ as-skitter: |V|=1.6M, |E| = 22M
★ 60 partitions: three 20-core machines
★ deg. of parallelism: 8
★ # of shuffle refinement times: 8

Reduction     Intra-Socket   Inter-Socket
DG            62%            55%
METIS         53%            55%
PARMETIS      15%            17%
uniPARAGON    62%            39%

(chart annotations: 2.5x, 1.5x, 50%, 38%)

BFS on Gordon (𝛌=0)

47

★ as-skitter: |V|=1.6M, |E| = 22M

★ 48 partitions: three 16-core machines

★ deg. of parallelism: 8

★ # of shuffle refinement times: 8

Evaluation

MicroBenchmarks

o Degree of Refinement Parallelism

o Varying Shuffle Refinement Times

o Varying Initial Partitioners

Real-World Workloads

o Breadth First Search (BFS)

o Single Source Shortest Path (SSSP)

Billion-Edge Graph Scaling

48

Billion-Edge Graph Scaling: BFS Exec. Time

49

★ friendster: |V|=124M, |E| = 3.6B

★ 60 partitions: three 20-core machines

★ deg. of parallelism: 10

★ # of shuffle refinement times: 10

★ 1.65x speedup at 60 cores

Billion-Edge Graph Scaling: Refine Time

★ friendster: |V|=124M, |E| = 3.6B

★ 60 partitions: three 20-core machines

★ deg. of parallelism: 10

★ # of shuffle refinement times: 10

50

1.36x

Conclusions

★ Introduced PARAGON
○ Parallel Architecture-Aware Graph Partition Refinement Algorithm

★ PARAGON addresses:
○ Scalability
○ Communication Heterogeneity
○ Shared Resource Contention

★ Extensive experimental study:
○ Achieved up to 6.7x speedups on real-world workloads
○ Scaled up to 3.6 billion edges

51

Acknowledgments:

• Jack Lange

• Albert DeFusco

• Kim Wong

• Mark Silvis

Funding:

• NSF OIA-1028162

• NSF CBET-1250171

Thank You!

Email: [email protected] or [email protected]
Web: http://people.cs.pitt.edu/~anz28/
ADMT: http://db.cs.pitt.edu/group/

52