Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all...

36
Fast Triangle Counting on the GPU Oded Green

Transcript of Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all...

Page 1: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Fast Triangle Counting

on the GPU

Oded Green

Page 2: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Triangle Counting

Building block for clustering coefficients

Defined by [Watts & Strogatz; 1998]

Used to state how tightly bound vertices are in a network

Used for the analysis many types of networks:

Communication

Social

Biological

[email protected], GTC, 2015 Twitter social network using Large

Graph Layout

Page 3: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Relevant network properties

Sparse-network

System utilization

Power-law distribution

[Faloutsos & Faloutsos & Faloutsos; 1999]

[Broder et al.; 2000]

Load balancing issues

[email protected], GTC, 2015

Page 4: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Computational Approaches

Enumerating over all node-triples - ๐‘‚(๐‘‰3).

Using matrix multiplication - ๐‘‚(๐‘‰๐‘ค), ๐‘ค โ‰ค 2.376.

Adjacency list intersection - ๐‘‚ ๐‘‰ โ‹… ๐‘‘๐‘š๐‘Ž๐‘ฅ2 ,

where ๐‘‘ ๐‘š๐‘Ž๐‘ฅ is the vertex with largest

adjacency.

[email protected], GTC, 2015

Page 5: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

For given e = ๐‘ข, ๐‘ฃ โˆˆ ๐ธ:

Take:

uโ€™s list

vโ€™s list

Intersection and find common neighbors

Then intersect (๐‘ข, ๐‘ฅ) and (๐‘ข, ๐‘ฆ)

Adjacency List Intersection

x

v u

y

[email protected], GTC, 2015

Page 6: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU Challenges: List Intersection

Partitioning / Load balancing

Must be computationally cheap

Must be parallel (high utilization)

Minimal synchronization/communication

Scalable

Simple

[email protected], GTC, 2015

Page 7: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

The only other GPU triangle counting algorithm

Uses the GPU like a CPU

One CUDA thread per

intersection

Load balancing issues

Low utilization

Limited scalability

[Heist et al. ;2012]

[email protected], GTC, 2015

Page 8: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Merge-Path and GPU Triangle Counting

[email protected], GTC, 2015

Page 9: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Merge-Path

Visual approach for merging

Highly scalable

Load-balanced

Two legal moves

Right

Down

Reminder: Merging and intersecting are similar

[email protected], GTC, 2015 Merge Path : [Odeh et al. ; 2012]

GPU Merge Path : [Green et al. ;2012]

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Page 10: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Partitioning

Find the intersection of the

cross-diagonal and the

path

Uses binary search

[email protected], GTC, 2015

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Page 11: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Merge-Path Phases

Partitioning

Log(n)

Synchronization

Merging

Synchronization

[email protected], GTC, 2015

Page 12: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Intersect-Path

Building block for triangle counting and clustering coefficients

Modified Merge-Path

Right

Down

Right-down โ€“ when values are equal

[email protected], GTC, 2015

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Page 13: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

B[1] B[2] B[3] B[4] B[5] B[6] B[7] B[8]

1 2 4 5 8 9 12 13

A[1] 4

A[2] 6

A[3] 7

A[4] 11

A[5] 12

A[6] 14

A[7] 15

A[8] 16

Differences

Typically unique values in each set

Two possible start locations when cross-diagonal goes through โ€œequal-pathโ€.

Requires:

Checking the initial condition

Avoiding double counting

Shared-memory and synchronization

[email protected], GTC, 2015

Page 14: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Triangle-Counting

Split vertices to thread blocks

|V|/Thread_Blocks

Iteratively (over the vertices)

Partition

Log(n)

Warp Synchronization

Multiple Intersection

Thread-block Synchronization

Vertices

Thread-block

Warp 0 Warp 1

[email protected], GTC, 2015

Page 15: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Algorithm Control Parameters

๐ผ: # intersection in a thread block

Small number : under utilization

Large number : not useful for sparse networks

๐‘‡โ„Ž: # threads per intersection

Small number : load-balancing, under utilization

Large number : adds overhead compared to the intersection

๐ผ โ‹… ๐‘‡โ„Ž โ‰ฅ ๐‘ค๐‘Ž๐‘Ÿ๐‘ ๐‘ ๐‘–๐‘ง๐‘’

[email protected], GTC, 2015

Page 16: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Performance Analysis

[email protected], GTC, 2015

Page 17: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Experiments

Graphs taken from 10th DIMACS Challenge

GPU

NVIDIA K40

12 GB

14 MPs x 192 SPs (per MP) = 2880 CUDA threads

[email protected], GTC, 2015

Page 18: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Experiments

CPU - OpenMP based [Green et al. ;2014]

4x10 Intel Xeon E7-8870 processers

256 GB

30 MB L3 cache per processor

[email protected], GTC, 2015

Page 19: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Implementation comparison

GPU algorithm as discussed here

CPU algorithms

Threads are independent

Partitioning stage not needed

[email protected], GTC, 2015

Page 20: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU Vs. Sequential CPU

[email protected], GTC, 2015

0

5

10

15

20

25

30

35

40

Spe

edu

p

GPU IntersectPath Sequential (Baseline)

Page 21: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU Vs. CPU โ€“ OpenMP Straightforward

OpenMP results from[Green et al. ;2014] [email protected], GTC, 2015

0

5

10

15

20

25

30

35

40

Spe

edu

p

Naรฏve OMP GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 22: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU Vs. CPU OpenMP Optimized I

[email protected], GTC, 2015 [Green et al. ;2014]

0

5

10

15

20

25

30

35

40

Spe

edu

p

Optimized OMP I GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 23: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

[email protected], GTC, 2015

Page 24: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU Vs. CPU OpenMP Optimized II

[email protected], GTC, 2015 [Green et al. ;2014]

0

5

10

15

20

25

30

35

40

Spe

edu

p

Optimized OMP II GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 25: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU โ€“ Parameter control -prefAttachment

[email protected], GTC, 2015

Single thread per intersection

X-Y-Z configuration:

โ€ข X: threads per

block

โ€ข Y: thread per

intersection

โ€ข Z: number of

intersection per

thread block

Page 26: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

[email protected], GTC, 2015

Single Intersection per warp

GPU โ€“ Parameter control -prefAttachment

X-Y-Z configuration:

โ€ข X: threads per

block

โ€ข Y: thread per

intersection

โ€ข Z: number of

intersection per

thread block

Page 27: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

These parameters count

3๐‘‹ โˆ’ 7๐‘‹ difference between slowest and

fastest

Offer better system utilization

Offer better load-balancing

[email protected], GTC, 2015

Page 28: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Final thoughts

Intersect-Path

Modified Merge-Path

Performance

9X-32X faster than a sequential

[email protected], GTC, 2015

Page 29: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Additional Challenges (Future work)

Fine-tuning the GPU parameters based on

graph properties.

More load-balancing across the MPs.

Create the capability to analyze large

graphs on the GPU

Limited memory size

[email protected], GTC, 2015

Page 30: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Thank you!

Join our Open-Source community

https://github.com/arrayfire/arrayfire

[email protected], GTC, 2015

Page 31: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Backup slides

[email protected], GTC, 2015

Page 32: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Examples โ€“ Clustering Coefficient

๐ถ๐ถ = ๐ถ๐ถ๐‘ฃ =

๐‘ฃ

๐‘ก๐‘Ÿ๐‘–(๐‘ฃ)

deg ๐‘ฃ โ‹… deg ๐‘ฃ โˆ’ 1๐‘ฃ

[email protected], GTC, 2015

Page 33: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Immense volume of data

Facebook: >1B users

average 130 friends

30B pieces of content shared / month

Twitter: 200M active users

400M tweets / day

Goal: Use the interaction

network to understand

and characterize

information flow

Motivation: Social Media

Sources: Facebook, Twitter

[email protected], GTC, 2015

Page 34: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU Vs. CPU

[email protected], GTC, 2015

0

5

10

15

20

25

30

35

40

Spee

du

p

Naรฏve OMP Optimized OMP I Optimized OMP II

GPU IntersectPath Sequential (Baseline) Maximal for 40 cores

Page 35: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

Speedup/Dollar

[email protected], GTC, 2015

0

0.001

0.002

0.003

0.004

0.005

0.006

0.007

0.008

Spee

du

p/d

olla

r

Naรฏve OMP Optimized OMP I Optimized OMP II GPU IntersectPath

Assuming

CPU : $20K

GPU : $5K

Page 36: Fast Triangle Counting on the GPUโ‚ฌยฆย ยท Computational Approaches Enumerating over all node-triples - ๐‘‚(๐‘‰3). Using matrix multiplication - ๐‘‚(๐‘‰ ), โ‰ค t. u y x. Adjacency

GPU โ€“ Parameter control

[email protected], GTC, 2015

Growing thread block size