Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems

Matthew Agostini, Francis O’Brien and Tarek S. Abdelrahman

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

University of Toronto

matt.agostini@mail.utoronto.ca [email protected] [email protected]

ICPP - August 20, 2020

Outline

• Research context and motivation

• Work stealing

• Heterogeneous Work Stealing (HWS)

• Evaluation

• Related work

• Conclusions and future work

Accelerator-based Computing

• Accelerators are prevalent in computing, from personal to cloud platforms (e.g., GPUs and FPGAs)

• Field Programmable Gate Arrays (FPGAs)

– Enable user-defined application-specific circuits

– Potential for faster, more power-efficient computing

Emerging FPGA-Accelerated Servers

• A new generation of high-performance systems tightly integrates FPGAs with multicores, targeting data centers

– Exemplified by the Intel HARP, IBM CAPI and Xilinx Zynq systems

• An FPGA circuit can directly access system memory in a manner that is coherent with the processor caches

– Enables CPU threads and FPGA hardware to cooperatively accelerate an application, sharing data in system memory

– In contrast to typical offload FPGA acceleration that leaves the CPU idle during FPGA processing

Research Question

• The concurrent use of (multiple) CPU threads and FPGA hardware requires balancing of workloads

• Research question: how to balance workloads between software (CPU threads) and hardware (FPGA accelerator) such that:

– The accelerator is fully utilized

– Load imbalance is minimized

– Scheduling overheads are reduced

Graph Analytics

• We answer our research question in the context of Graph Analytics: applications that process large graphs to deduce some properties

– Prevalent in social networks, targeted advertising and web searches

• Processing graphs is notoriously load imbalanced

– Graph structure (varying outgoing edge degrees)

– Distribution of active/inactive vertices (computations vary across processing iterations)

This Work

• We develop Heterogeneous Work Stealing (HWS): a strategy for balancing graph processing workloads on tightly-coupled CPU+FPGA systems

– Identify and address some unique challenges that arise in this context

• We implement and evaluate HWS on the Intel HARP platform

– Use it for 3 kernels processing large real-world graphs

– Effectively balances workloads

– Outperforms state-of-the-art strategies

• Supported by Intel Strategic Research Alliance (ISRA) grant

Work Stealing

Thread T[i](Workload: workItems)
  while true do
    if has workItems then    // Normal execution
      Process(workItem)
    else                     // Steal
      AcquireWork(k)         // k = id of victim thread

• Allows fine-grained workload partitioning with low overhead

• Previously considered unsuitable for heterogeneous systems due to explicit copying of data to accelerators

Work Stealing for Graph Processing

Thread T[i](Start, End, sync)
  while true do
    if Start < End then          // Normal execution
      Process(vtx[Start])
      Start = Start + 1
    else                         // Start == End: steal
      if CAS(T[k].sync) then     // k = id of randomly chosen thread
        T[i].Start = (T[k].Start + T[k].End) / 2    // Steal half
        T[i].End = T[k].End
        T[k].End = T[i].Start
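The range-splitting steal can be sketched in Python as follows. This is a minimal single-process illustration, not the authors' implementation: the `Worker` class, its `lock` (standing in for the CAS on `sync`), the guard against splitting very small ranges, and `try_steal` are all hypothetical.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A worker that owns the half-open vertex range [start, end)."""
    start: int
    end: int
    # Stand-in for the CAS-protected sync word in the pseudocode (assumption).
    lock: threading.Lock = field(default_factory=threading.Lock)


def try_steal(thief: Worker, victim: Worker) -> bool:
    """Move the upper half of the victim's remaining range to the thief."""
    # A non-blocking acquire plays the role of a successful CAS.
    if not victim.lock.acquire(blocking=False):
        return False
    try:
        if victim.end - victim.start < 2:
            return False  # added guard (assumption): nothing worth splitting
        mid = (victim.start + victim.end) // 2
        thief.start, thief.end = mid, victim.end  # thief takes the upper half
        victim.end = mid                          # victim keeps the lower half
        return True
    finally:
        victim.lock.release()


victim = Worker(start=0, end=100)
thief = Worker(start=0, end=0)  # idle: start == end
stolen = try_steal(thief, victim)
```

In a real system each worker would run on its own thread and pick the victim at random; the sketch only shows how the bounds move during one steal.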

Challenges with Heterogeneity

• Non-Linear FPGA performance with workload size

• How to steal from hardware?

• Duplicate work caused by FPGA read latency

• Hardware Limitations

Non-Linear FPGA Performance

• The FPGA accelerator performance depends on the size of the workload assigned to it

– Larger workloads better amortize accelerator startup and initial latency

– HWS assigns large enough workloads, stealing only when CPU threads are idle
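One way to see why larger partitions help is a simple linear cost model, sketched below. The model and its parameters (`startup_s`, `per_vertex_s`) are illustrative assumptions, not measurements from the talk: a fixed startup cost is spread over the vertices in a partition, so the effective per-vertex time falls as the partition grows.

```python
def per_vertex_time(n_vertices: int, startup_s: float, per_vertex_s: float) -> float:
    """Effective time per vertex under the assumed model T(n) = startup + n * per_vertex."""
    if n_vertices <= 0:
        raise ValueError("partition must contain at least one vertex")
    total = startup_s + n_vertices * per_vertex_s
    return total / n_vertices


# Hypothetical numbers, chosen only to show the trend.
small = per_vertex_time(1_000, startup_s=1e-3, per_vertex_s=1e-8)
large = per_vertex_time(1_000_000, startup_s=1e-3, per_vertex_s=1e-8)
```

Under these made-up parameters the small partition is dominated by the startup cost, while the large one approaches the steady-state per-vertex cost.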

Figure: execution time (s) versus workload partition size (vertices), FPGA only compared with a single CPU thread.

Steal Mechanism

• How does software steal from hardware?

– Internal accelerator state is often not accessible by software

• Accelerator exposes two CSRs: start and end

– Thief thread reads start to determine how much to steal

– Thief thread calculates new FPGA end bound and then writes to end
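A minimal sketch of the steal from hardware, assuming the two CSRs behave like plain integer registers. The `AcceleratorCSRs` class and `steal_from_accelerator` function are hypothetical stand-ins; on real hardware the read and write would be memory-mapped register accesses, and the split policy (half of the remaining range) is carried over from the software steal as an assumption.

```python
from dataclasses import dataclass


@dataclass
class AcceleratorCSRs:
    """Hypothetical software view of the two registers the accelerator exposes."""
    start: int  # next vertex the accelerator will process
    end: int    # one past the last vertex assigned to the accelerator


def steal_from_accelerator(csrs: AcceleratorCSRs) -> tuple[int, int]:
    """Return the stolen range [new_end, old_end) and shrink the accelerator's range."""
    observed_start = csrs.start   # read the start CSR
    old_end = csrs.end            # end was set by software, so the thief knows it
    new_end = (observed_start + old_end) // 2
    csrs.end = new_end            # write the end CSR; the accelerator is not paused
    return new_end, old_end


csrs = AcceleratorCSRs(start=200, end=1000)
stolen_range = steal_from_accelerator(csrs)
```

The thief never touches the accelerator's internal state: it only observes `start` and lowers `end`.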

Figure: the thief thread on the CPU reads the start CSR and writes the end CSR of the accelerator on the FPGA. FPGA processing is uninterrupted during the steal.

Duplicate Work

• The delay in reading CSR registers leads to potential work duplication

– The value of the start register read by the thief is stale

– When a small amount of work remains, the thief may steal work already performed by the FPGA

• We estimate FPGA progress P, ensuring that a steal from the FPGA fails if too small a workload remains

– Enabled by the relatively deterministic nature of the accelerator

T[i].Start = ((T[k].Start + P) + T[k].End)/2
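The adjusted split point can be sketched as below. How P is estimated and the size below which a steal fails are not given on the slide, so `progress_estimate` and `min_steal` are hypothetical parameters.

```python
from typing import Optional


def split_with_progress(stale_start: int, end: int,
                        progress_estimate: int, min_steal: int) -> Optional[int]:
    """Return the thief's new Start, or None if the steal should fail.

    stale_start is the (possibly out-of-date) value read from the start CSR;
    progress_estimate is P, the work the FPGA is assumed to have done since.
    """
    estimated_start = stale_start + progress_estimate
    remaining = end - estimated_start
    if remaining < min_steal:
        return None  # too little left: stealing risks duplicating FPGA work
    return (estimated_start + end) // 2


ok = split_with_progress(stale_start=100, end=1000, progress_estimate=60, min_steal=64)
too_late = split_with_progress(stale_start=950, end=1000, progress_estimate=60, min_steal=64)
```

In the second call the estimated FPGA position is already past the end of its range, so the steal is refused rather than handing the thief vertices the FPGA has probably processed.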

Hardware Limitations

• FPGA memory requests are aligned to cache lines

– Misaligned requests can negatively affect performance

• HWS aligns FPGA workloads with cache lines and imposes a lower bound on stealing granularity

– Only 8 vertices per cache line
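A sketch of cache-line alignment for a steal boundary, using the figure of 8 vertices per cache line. The function name, the choice to round the split point up, and the rule that each side must keep at least one cache line's worth of vertices are illustrative assumptions.

```python
from typing import Optional

VERTICES_PER_CACHE_LINE = 8  # figure taken from the slide


def aligned_split(start: int, end: int) -> Optional[int]:
    """Return a cache-line-aligned split point for [start, end), or None if
    either side would be left with fewer than one cache line of vertices."""
    line = VERTICES_PER_CACHE_LINE
    mid = (start + end) // 2
    aligned = ((mid + line - 1) // line) * line  # round up to a line boundary
    # Lower bound on stealing granularity (assumed: one line per side).
    if aligned - start < line or end - aligned < line:
        return None
    return aligned


a = aligned_split(0, 100)  # midpoint 50 rounds up to 56
b = aligned_split(0, 12)   # midpoint 6 rounds up to 8, leaving only 4 vertices to steal
```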

Evaluation

• Graph Benchmarks

• Platform

• Metrics of performance

• Results

– Load balancing effectiveness

– Comparison to state-of-the-art

– Steal characteristics

– Graph processing throughput

Graph Benchmarks

• We use three common graph processing benchmarks:

– Breadth-First Search (BFS)

– Single Source Shortest Path (SSSP)

– PageRank (PR)

• Implemented in the Scatter-Gather paradigm

– A common paradigm for graph processing

– Scatter: sweep over vertices, producing updates to neighboring vertices

– Gather: sweep over updates, applying them to destination vertices
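As an illustration of the paradigm, BFS can be written as alternating scatter and gather sweeps. This is a minimal sequential sketch over a hypothetical adjacency-list graph; it is not the authors' AFU or CPU code.

```python
def bfs_scatter_gather(adj: dict[int, list[int]], source: int) -> dict[int, int]:
    """Level-synchronous BFS expressed as alternating scatter and gather phases."""
    level = {source: 0}
    active = [source]
    depth = 0
    while active:
        # Scatter: sweep over active vertices, producing updates for neighbours.
        updates = [(dst, depth + 1) for src in active for dst in adj.get(src, [])]
        # Gather: sweep over updates, applying them to destination vertices.
        next_active = []
        for dst, new_level in updates:
            if dst not in level:
                level[dst] = new_level
                next_active.append(dst)
        active = next_active
        depth += 1
    return level


graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_scatter_gather(graph, source=0)
```

The scatter sweep is the part whose cost varies with out-degree and with the set of active vertices, which is where load imbalance arises.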

Note: BFS and SSSP are used by Graph500

Evaluation Graphs

• Process 7 large graphs, mostly drawn from SNAP

Graph          Vertices  Edges    Description
Twitter        62M       1,468M   Follower data
LiveJournal    4.8M      69M      Friendship relations data
Orkut          3M        234M     Social connections
StackOverflow  2.6M      36M      Questions and answers
Skitter        1.7M      22M      2005 Internet topology graph
Pokec          1.6M      31M      Social connections
Higgs          460K      15M      Twitter subset

Platform

• Intel’s Heterogeneous Architecture Research Platform (HARP)

– Xeon E5-2680 v4 CPU + Arria 10 GX1150 FPGA

– AFU issues cache coherent reads/writes to system memory

• AFUs for the scatter phase of graph processing [O’Brien 2020]

– The gather phase is done by CPU threads

Figure: HARP block diagram. The Xeon multicore (CPUs) and system memory connect over a QPI/PCIe interconnect to the Arria 10 FPGA, which contains the FIU and the AFU.

AFU: User’s Accelerator Function Unit

FIU: QPI/PCIe link protocols, data cache, address translation

Performance Metrics

• Execution time: time for processing, excluding loading graph into memory

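• Load imbalance (λ): the maximum useful work time of a thread relative to the average useful work time

$\lambda = \dfrac{\max_i t_i}{\frac{1}{N}\sum_{i=1}^{N} t_i}$, where $t_i$ is the useful work time of thread $i$ and $N$ is the number of threads; ideally, λ is 1

• Throughput: millions of traversed edges per second (MTEPS)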
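The load imbalance and throughput metrics can be computed as in the sketch below. The function names and the sample numbers are hypothetical; only the definitions (maximum over average useful work time, and traversed edges per second in millions) come from the slide.

```python
def load_imbalance(work_times: list[float]) -> float:
    """lambda = maximum useful work time / average useful work time."""
    if not work_times:
        raise ValueError("need at least one worker")
    average = sum(work_times) / len(work_times)
    return max(work_times) / average


def mteps(edges_traversed: int, seconds: float) -> float:
    """Throughput in millions of traversed edges per second."""
    return edges_traversed / seconds / 1e6


balanced = load_imbalance([2.0, 2.0, 2.0, 2.0])  # every worker equally busy
skewed = load_imbalance([4.0, 2.0, 1.0, 1.0])    # one worker does twice the average
rate = mteps(1_000_000, 0.5)                     # made-up run, for illustration only
```

A value of 1 means no worker is busier than the average; larger values mean the slowest worker dominates the execution time.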

Comparisons

• We compare HWS to different load balancing strategies

– Static: equal-sized partitions for all threads, with the FPGA given 2.5X more

– Best-Dynamic: a chunk self-scheduling load balancer with a priori knowledge of the optimal chunk size

– HAP: Heterogeneous Adaptive Partitioning scheduler [Rodriguez 2019]

• We define speedup as the ratio of the execution time of static to that of a load balancing strategy
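One reading of the Static baseline and the speedup definition is sketched below. The interpretation that the FPGA receives a share weighted 2.5 times that of a CPU thread, the handling of rounding leftovers, and the helper names are all assumptions.

```python
def static_partition(num_vertices: int, cpu_threads: int,
                     fpga_weight: float = 2.5) -> tuple[int, list[int]]:
    """Split vertices so the FPGA gets fpga_weight times a CPU thread's share.

    Returns (fpga_share, per_thread_shares); vertices left over from rounding
    go to the FPGA so that every vertex is assigned.
    """
    unit = num_vertices / (cpu_threads + fpga_weight)
    cpu_share = int(unit)
    fpga_share = num_vertices - cpu_share * cpu_threads
    return fpga_share, [cpu_share] * cpu_threads


def speedup(static_time: float, strategy_time: float) -> float:
    """Execution time of Static divided by that of the strategy under test."""
    return static_time / strategy_time


fpga, cpus = static_partition(1_750, cpu_threads=15)
s = speedup(static_time=3.0, strategy_time=2.0)  # hypothetical timings
```

By this definition Static itself has a speedup of 1, and values above 1 mean the strategy finished sooner than Static.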

Disclaimer

• The results in this paper were generated using pre-production hardware and software, and may not reflect the performance of production or future systems.

BFS Scatter λ

Figure: load imbalance λ per graph for Static, Best-Dynamic, HAP and HWS (HARPv2, 15 threads + AFU).

BFS Scatter Performance

Figure: speedup per graph for Static, Best-Dynamic, HAP and HWS (HARPv2, 15 threads + AFU).

HWS Steal Characteristics

Figure: steal count per graph, broken down into steals by FPGA, steals from FPGA and aborted steals (HARPv2, SSSP, 7 threads + AFU).

Average FPGA Chunk Size

Figure: FPGA stream size per graph, on a logarithmic axis from 1 to 1,000,000, for Best-Dynamic, HAP and HWS (HARPv2, SSSP, 7 threads + AFU).

Overall Throughput

Figure: throughput (MTEPS) per graph for HWS 4, HWS 8 and HWS 16 (HARPv2, BFS).

Related Work

• Heterogeneous scheduling: Tripp (2005), Belviranli (2013), Vilches (2015), Song (2016), Navarro (2019), Rodriguez (2019), Wang (2019)

– Focus is on adaptive chunk size selection

– We introduce work stealing and demonstrate its effectiveness

• Work stealing: Acar (2013), Cong (2008), Dinan (2009), Hendler (2002), Khayyat (2013), Nakashima (2019)

– We extend work stealing to heterogeneous systems

• FPGA accelerators for graph processing: Dai (2016), Engelhardt (2016), Zhou (2017), Zhou (2019)

– We extend to concurrent CPU-FPGA usage

Concluding Remarks

• HWS addresses some unique challenges when processing graphs on CPU-FPGA systems

– HWS maximizes FPGA throughput by ensuring large workloads

– HWS achieves perfect load balance (λ = 1)

– HWS outperforms other schedulers

• Our results collectively demonstrate that work stealing is an effective solution for balancing graph processing workloads on tightly-coupled heterogeneous CPU-FPGA systems

Future Work

• Heterogeneous acceleration of the gather phase of graph processing

• Integration of HWS with multiple FPGAs

• Work stealing optimizations: priority-based victim selection

• Processing of dynamically changing graphs

Thank You