1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed...

30
1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Transcript of 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed...

Page 1: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

1

If you were plowing a field, which would you rather

use?Two oxen, or 1024 chickens?(Attributed to S. Cray)

Page 2: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

2

Our ‘field’ to plow : Graph processing

|V| = 1.4B, |E| = 6.6B

Page 3: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Abdullah Gharaibeh, Lauro Beltrão Costa, Elizeu Santos-Neto

Matei Ripeanu

NetSysLabThe University of British Columbia

http://netsyslab.ece.ubc.ca

Page 4: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Graph Processing: The Challenges

Data-dependent memory access

patterns

Caches + summary data structures

Large memory footprint

>128GB

CPUs

Poor locality

Varying degrees of parallelism(both intra- and inter- stage)

Low compute-to-memory access

ratio

Page 5: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Graph Processing: The GPU Opportunity

Data-dependent memory access

patterns

Caches + summary data structures

Large memory footprint

>128GB

CPUs

Poor locality

Assemble a heterogeneous platform

Massive hardware multithreading

6GB!

GPUs

Low compute-to-memory access

ratio

Caches + summary data structures

Varying degrees of parallelism(both intra- and inter- stage)

Page 6: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

6

YES WE CAN!2x speedup (8 billion edges)

Motivating Question

Can we efficiently use hybrid systems for

large-scale graph processing?

Page 7: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

7

Performance Model Predicts speedup Intuitive

Totem A graph processing engine for hybrid

systems Applies algorithm-agnostic optimizations

Evaluation Predicated vs. achieved Hybrid vs. Symmetric

Methodology

Page 8: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

8

Core0

Core4

Core1

Core0

.

.

Core1

Core2

Corem

Core3

System MemoryDevice Memory

Host GPUb

The Performance Model (I)

GgpuGcpu

||/|| EEcpuα = ||/|| EEboundaryβ =

)(/|| GTErcpu =

c

rSpeedup

cpu

1

Goal: Predict the speedup obtained from offloading part of the graph to the GPU (when compared to processing only on the host)

)(/ msizeofbc =

Page 9: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

9

GgpuGcpu

The Performance Model (II)

β = 20% rcpu = 0.5 BEPS

It is beneficial to process the graph on a hybrid system

if communication overhead is low

Core0

Core4

Core1

Core0

.

.

Core1

Core2

Corem

Core3

System MemoryDevice Memory

Host GPU

||/|| EEcpuα = ||/|| EEboundaryβ =

)(/|| GTErcpu =

)(/ msizeofbc =

c

rSpeedup

cpu

1

Assume PCI-E bus, b ≈ 4 GB/secand per edge state m = 4 bytes

=> c = 1 billion EPS

x

Best reported single-node BFS performance

[Agarwal, V. 2010]

|V| = 32M, |E| = 1B

Worst case (e.g., bipartite graph)

Page 10: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

10

Totem: Programming ModelBulk Synchronous Parallel

Rounds of computation and communication phases

Updates to remote vertices are delivered in the next round

Partitions vote to terminate execution

.

.

.

Page 11: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

11

Totem: A BSP-based Engine

1

0

3

2

4

5

CPU GPU0

3

2

1S S

outbox

0 20 1

outbox

40

0 1 2 3 4 5

0 1 3 54 6

10 20 30 40 40E

V

01 21 21

0 1 2 3

0 1 3 3

31 40 E

V

2121

Compressed sparse row representation

Computation: kernel manipulates local state

Updates to remote vertices aggregated locally

Comm2: merge with local state

Comm1: transfer outbox buffer to remote input buffer

inbox4

inbox

0 2

Page 12: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

12

|E| = 512 Million

Random

The Aggregation Opportunity

real-world graphs are mostly scale-free: skewed

degree distribution

sparse graph: ~5x reduction

Denser graph has better opportunity for aggregation:

~50x reduction

Page 13: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

13

Evaluation Setup

Workload R-MAT graphs |V|=32M, |E|=1B, unless otherwise noted

Algorithms Breadth-first Search PageRank

Metrics Speedup compared to processing on the host only

Testbed Host: dual-socket Intel Xeon with 16GB GPU: Nvidia Tesla C2050 with 3GB

Page 14: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

14

Predicted vs. Achieved Speedup

Linear speedup with respect to offloaded part

GPU partition fills GPU memory

After aggregation, β = 2%. A low value

is critical for BFS

Page 15: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

15

Breakdown of Execution Time

PageRank is dominated by the compute phase

Aggregation significantly reduced communication

overhead

GPU is > 5x faster than the host

Page 16: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

16

So far …

Performance modeling Simple Useful for initial system provisioning

Totem Generic graph processing framework Algorithm-agnostic optimizations

Evaluation (Graph500 scale-28) 2x speedup over a symmetric system 1.13 Billion TEPS edges on a dual-socket, dual-GPU

system

But, random partitioning! Can we do better?

Page 17: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Better partitioning strategies.

The search space. Handles large (billion-edge scale) graphs.

o Low space and time complexity.o Ideally, quasi-linear!

Handles well scale-free graphs. Minimizes algorithm’s execution time by

reducing computation time o (rather than communication)

17

Page 18: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

The strategies we explore HIGH: vertices with high degree left on the

hostLOW: vertices with low degree left on the

host RAND: random

18The percentage of vertices placed on the CPU for a scale-28 RMAT graph (|V|=256m, |E|=4B)

Page 19: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Evaluation platform

19

Intel Nehalem Fermi GPU Xeon X5650 Tesla C2075

(2x sockets) (2x GPUs)

Core Frequency 2.67GHz 1.15GHz

Num Cores (SMs) 6 14

HW-thread/Core 2 x 48warps (x32/warp)

Last Level Cache 12MB 2MB

Main Memory 144GB 6GB

Memory Bandwidth 32GB/sec 144GB/sec

Total Power (TDP) 95W 225W

Page 20: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

BSF performance

20BFS traversal rate for a scale-28 RMAT graph (|V|=256m, |E|=4B)

2x performance gain!

LOW: No gain over random!

Exploring the 75% data point

Page 21: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Host is the bottleneck in all cases !

BSF performance – more details

Page 22: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

PageRank performance

22PageRank processing rate for a scale-28 RMAT graph (|V|=256m, |E|=4B)

LOW: Minimal gain over random!

Better packing

25% performance gain!

Page 23: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Small graphs(scale-25 RMAT graphs: |V|=32m, |E|=512m)

23

BFS PageRank

Intelligent partitioning provides benefits Key for performance: load balancing

Page 24: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Uniform graphs (not scale free)

24

BFS on scale-25 uniform graph |V|=32m, |E|=512m) BFS on scale-28

Hybrid techniques not useful for uniform graphs

Page 25: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Scalability Graph size: RMAT graphs: scale 25 to 29 (|V|=512m, |E|=8B)Platform size: 1,2, 4 sockets 2xsockets + 2 x GPU

25

BFS PageRank

Page 26: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Power Normalizing by power (TDP – thermal design power)Metric: million TEPS / watt

26

BFS PageRank

2.0 1.3 1.1 1.3 1.3

1.9

2.3

2.4 1.8 1.01.4

1.8

Page 27: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

Conclusions

Q: Does it make sense to use a hybrid system? A: Yes! (for large scale-free graphs)

Q: Can one design a processing engine for hybrid platforms that both generic and efficient?

A: Yes.

Q: Are there near-linear complexity partitioning strategies that enable higher performance?

A: Yes, partitioning strategies based on vertex connectivity provide in all cases better performance than random.

Q: Should one search for partitioning strategies that reduce the communication overheads (and hope for higher performance)?

A: No. (for scale free graphs)

Q: Which strategies work best?A: It depends!

Large graphs: shape the load. Small graphs: load balancing.

Page 28: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

28

If you were plowing a field, which would you rather

use?- Two oxen, or 1024 chickens?- Both!

Page 29: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

29

code available at: netsyslab.ece.ubc.ca

Papers: • A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph

Processing, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, PACT 2012

• On Graphs, GPUs, and Blind Dating: A Workload to Processor Matchmaking Quest, A. Gharaibeh, L. Costa, E. Santos-Neto, M. Ripeanu, IPDPS 2013

29

Page 30: 1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)

30

A golf course …

… a (nudist) beach

(… and 199 days of rain each year)

Networked Systems Laboratory (NetSysLab)University of British Columbia