
Parallel Clustering Coefficient Computation using GPUs
Tahsin Reza, Matei Ripeanu

Electrical and Computer Engineering

University of British Columbia, Vancouver

Clustering coefficient is a measure of how tightly vertices are bound together in a network [1]. The Triangle Counting problem is at the core of clustering coefficient computation, and performance depends on how efficiently triangles are counted. If two vertices u, v ∈ V are neighbours, and there exists a third vertex w ∈ V where w ∈ F(u) and w ∈ F(v), then {u, v, w} forms a triangle. The clustering coefficient of v is computed using the following equation:

cc_v = 2 · T_v / (d_v · (d_v − 1))
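A quick worked example (with illustrative numbers, not from the poster): for a vertex v with degree d_v = 4 whose neighbours form T_v = 3 triangles with it,

cc_v = \frac{2 T_v}{d_v (d_v - 1)} = \frac{2 \cdot 3}{4 \cdot 3} = 0.5,

i.e., half of the 4 · 3 / 2 = 6 possible neighbour pairs are connected by an edge.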


The CPU implementation performs better than the GPU implementation for RMAT graphs but not for real-world graphs – why?

Power-law graphs are known to be highly irregular, i.e., vertex degrees vary heavily across vertices. While the real-world graphs we experimented with have power-law degree distributions, they are less skewed than the RMAT graphs.

Irregularity leads to work imbalance among GPU threads and causes a scattered memory access pattern, which in turn results in uncoalesced global memory accesses on the GPU and hurts performance [8].

Graph          Vertices   Edges
LiveJournal*   4.8M       69M
web-BerkStan*  0.69M      13M
RMAT20         1M         16M
RMAT21         2M         32M
RMAT22         4M         64M
RMAT23         8M         128M
RMAT24         16M        256M
RMAT25         32M        512M
RMAT26         64M        1B
RMAT27         128M       2B
RMAT28         256M       4B

The GPU implementation achieves a 7x speedup over the best known work [5] for the same graph (LiveJournal), while the CPU implementation offers a 3.3x speedup.
The edge-centric implementation scales up to graphs with 512M edges (RMAT25) on a Kepler GPU with 12 GB of device memory.
Edge-centric is consistently better than vertex-centric as it leverages an order of magnitude more parallelism.
Kepler GPUs offer a speedup over Fermi GPUs, but not as large as the gain from moving from vertex-centric to edge-centric.
The CPU implementation scales up to graphs with 4B edges (RMAT28).

Figure: example graph with its per-vertex sorted neighbour lists (referenced by the NodeIterator caption below).

Graph          CPU      GPU (Kepler)   [2]      [4]       [5]      [9]       [10]
LiveJournal    9.94s    4.0s           32.98s   1.7min    29.43s   3.63min   4.70min
web-BerkStan   28.70s   17.0s          -        5.33min   -        1.31min   2.03min

Graph          Type         d_max    Std. dev. of vertex degree   CPU      GPU (Kepler)
LiveJournal    Real-world   20293    65.08                        9.94s    4.0s
RMAT20         RMAT         146092   355.75                       35.17s   88.5s

Warp Execution Efficiency: ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor. (Higher is better)
Achieved Occupancy (0.0 – 1.0): ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor. (Higher is better)
Stall Execution Dependency: percentage of stalls occurring because an input required by the instruction is not yet available. (Lower is better)
Global Memory Load/Store Efficiency: ratio of requested global memory load/store transactions to actual global memory load/store transactions. (Higher is better)


Table (above): Dataset used. *Real-world graphs were obtained from [7]. RMAT (Recursive MATrix) [6] graphs were generated using parameters (A, B, C) = (0.57, 0.19, 0.19) and an average vertex degree of 16. All the graphs we used for evaluation are undirected.

Figure (left): NodeIterator would count the triangle incident on 0 twice, once for 1 and once for 2, while scanning through their respective neighbour lists. We can avoid this by not counting the triangle shown in red dotted lines.

Table (right): Comparison of running time with [2], [4], [5], [9] and [10]. Column headings CPU and GPU indicate our work.

Discussions and Future Work

The RMAT graph suffers from low occupancy as a result of high stall execution dependency. A low number of active warps per processor cycle means more cycles are required to complete a task.

The RMAT graph incurs low memory efficiency as a result of a highly scattered memory access pattern, which increases the number of memory transactions and wastes memory bandwidth.

Future work

A load-balancing mechanism on the GPU to improve the performance of highly irregular graphs, e.g., the virtual-warp technique for handling work imbalance in irregular graphs [8].

A CPU-GPU or multi-GPU implementation to process graphs larger than what a single GPU can accommodate.

Figure: execution time on the CPU (log scale, time in seconds).

Pseudocode (right): Array U has the same length as E. (U[x], E[x]) represents an edge, where x is the array index. For storage efficiency, we recycle U to hold intermediate triangle counts.

Vertex-centric algorithm

C[|V|] = 0                               // clustering coefficient of every vertex
for each v ∈ V in parallel               // one thread per vertex
    T = 0                                // triangles incident on v
    for each e ∈ F(v)
        count_triangles(v, e, T)
    C[v] = 2 * T / (d_v * (d_v - 1))

procedure count_triangles(v, e, T)
    // scan the shorter neighbour list S against the longer list L
    if (d_v > d_e) L = v, S = e
    else           L = e, S = v
    i = 0, j = 0
    while (j < F(S).length)
        a = F(S)[j], b = F(L)[i]
        // skip candidates with ID smaller than e: the neighbour with the
        // smaller vertex ID is responsible for counting that triangle
        if (S == e and a < e)
            j++
            if (a < b) i++
            continue
        if (L == e and b < e)
            i++
            if (a > b) j++
            continue
        if (a == b) i++, j++, T++        // common neighbour found: one triangle
        else if (a > b) j++
        else if (a < b) i++
        if (i == (F(L).length - 1)) break
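For illustration, a minimal serial C++ sketch of the sorted-list intersection at the core of count_triangles, written as the work a single GPU thread would do for one edge (v, e). The degree-based choice of S and L and the early break are omitted, and all names are ours rather than the poster's code.

#include <vector>

// Count the triangles {v, e, w} that edge (v, e) is responsible for on behalf of v.
// Fv, Fe: neighbour lists of v and e, sorted in ascending order of vertex ID.
// Only common neighbours w with w > e are counted, so each triangle incident on v
// is counted exactly once (by the neighbour with the smaller vertex ID).
int count_triangles_serial(const std::vector<int>& Fv, const std::vector<int>& Fe, int e) {
    int i = 0, j = 0, t = 0;
    while (i < static_cast<int>(Fv.size()) && j < static_cast<int>(Fe.size())) {
        if (Fv[i] == Fe[j]) {            // common neighbour found
            if (Fv[i] > e) ++t;          // smaller-ID neighbours already own this triangle
            ++i; ++j;
        } else if (Fv[i] < Fe[j]) {
            ++i;                         // advance whichever list holds the smaller ID
        } else {
            ++j;
        }
    }
    return t;
}

For example, with the illustrative lists F(v) = {1, 2, 4, 7}, F(e) = {0, 2, 4, 5} and e = 1, the function returns 2 (common neighbours 2 and 4, both greater than 1).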

Edge-centric algorithm

C[|V|] = 0                               // clustering coefficient of every vertex
U[|E|] = 0                               // companion array of E (see caption above)

procedure initialize
    for each v ∈ V in parallel           // one thread per vertex
        for each e ∈ F(v)
            U[e] = v                     // record the source vertex of edge slot e

procedure triangle_counting
    for each edge slot x, i.e. (U[x], E[x]), in parallel   // one thread per edge
        t = 0
        count_triangles(U[x], E[x], t)
        U[x] = t                         // recycle U to hold the per-edge triangle count

procedure cc_computation
    for each v ∈ V in parallel           // one thread per vertex
        T = 0
        for each e ∈ F(v)
            T = T + U[e]                 // sum the counts of v's edge slots
        C[v] = 2 * T / (d_v * (d_v - 1))
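A minimal serial C++ sketch of our reading of the three edge-centric phases, assuming a CSR-style layout in which offsets[v]..offsets[v+1] delimits F(v) inside the flat, sorted edge array E. In the GPU implementation each marked loop is executed by one thread per vertex or per edge rather than sequentially; array and function names are illustrative.

#include <cstddef>
#include <vector>

// Edge-centric clustering coefficient (serial reference sketch).
// offsets: |V|+1 CSR offsets; E: |E| neighbour IDs, sorted within each vertex's range.
std::vector<double> clustering_coefficients(const std::vector<int>& offsets,
                                            const std::vector<int>& E) {
    const int V = static_cast<int>(offsets.size()) - 1;
    std::vector<int> U(E.size(), 0);     // companion array of E

    // initialize: U[x] holds the source vertex of edge slot x. (GPU: one thread per vertex)
    for (int v = 0; v < V; ++v)
        for (int x = offsets[v]; x < offsets[v + 1]; ++x) U[x] = v;

    // triangle_counting: intersect the sorted lists of the endpoints of every edge slot
    // (U[x], E[x]) and recycle U[x] to store the count. (GPU: one thread per edge)
    for (std::size_t x = 0; x < E.size(); ++x) {
        const int v = U[x], w = E[x];
        int i = offsets[v], j = offsets[w], t = 0;
        while (i < offsets[v + 1] && j < offsets[w + 1]) {
            if (E[i] == E[j]) { if (E[i] > w) ++t; ++i; ++j; }   // count only neighbours > w (smaller-ID rule)
            else if (E[i] < E[j]) ++i;
            else ++j;
        }
        U[x] = t;
    }

    // cc_computation: T_v is the sum of the counts of v's edge slots. (GPU: one thread per vertex)
    std::vector<double> C(V, 0.0);
    for (int v = 0; v < V; ++v) {
        const int d = offsets[v + 1] - offsets[v];
        long long T = 0;
        for (int x = offsets[v]; x < offsets[v + 1]; ++x) T += U[x];
        if (d > 1) C[v] = 2.0 * static_cast<double>(T) / (static_cast<double>(d) * (d - 1));
    }
    return C;
}

For the complete graph on vertices {0, 1, 2} (offsets = {0, 2, 4, 6}, E = {1, 2, 0, 2, 0, 1}) this returns 1.0 for every vertex, as expected.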

Our technique targets CSR graphs [3] and requires the neighbour list of each vertex to be sorted in ascending order with respect to vertex ID (a preprocessing sketch follows below).

Sorted lists allow us to skip many expensive memory accesses while computing the intersection of two neighbour lists, the key building block of triangle counting.

The neighbour with the smaller vertex ID is responsible for counting a triangle. We count each triangle incident on a vertex only one time.
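Since the technique assumes ascending-order neighbour lists, a preprocessing pass along the following lines can establish and verify the invariant before counting; this is an illustrative sketch, not the poster's code.

#include <algorithm>
#include <vector>

// Sort each vertex's neighbour range of a CSR graph in ascending vertex-ID order.
// offsets: |V|+1 offsets; E: |E| neighbour IDs (modified in place).
void sort_neighbour_lists(const std::vector<int>& offsets, std::vector<int>& E) {
    const int V = static_cast<int>(offsets.size()) - 1;
    for (int v = 0; v < V; ++v)
        std::sort(E.begin() + offsets[v], E.begin() + offsets[v + 1]);
}

// Check the invariant (useful as a sanity test after graph construction).
bool neighbour_lists_sorted(const std::vector<int>& offsets, const std::vector<int>& E) {
    const int V = static_cast<int>(offsets.size()) - 1;
    for (int v = 0; v < V; ++v)
        if (!std::is_sorted(E.begin() + offsets[v], E.begin() + offsets[v + 1])) return false;
    return true;
}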

NodeIterator [4] is the baseline algorithm. It counts the same triangle six times. Complexity: O(d_max^2 · |V|).

NodeIterator++ [4] counts each triangle only once. The neighbour list is sorted with respect to vertex degree, and only the lowest-degree vertex in each triangle is responsible for counting it.

Using NodeIterator++ it is not possible to explicitly count the number of triangles incident on each vertex, which is essential for computing the clustering coefficient of each vertex.

Graph         Achieved Occupancy   Warp Execution Efficiency   Stall Execution Dependency   Global Memory Load Efficiency   Global Memory Store Efficiency
LiveJournal   0.5                  34.80%                      32.77%                       14.28%                          71.28%
RMAT20        0.1                  7.03%                       88.19%                       13.30%                          35.51%

Figure: execution time of the edge-centric implementation on Fermi vs Kepler GPUs, with the Kepler-over-Fermi speedup (log scale, time in seconds).

Figure: execution time on the GPU (Fermi), vertex-centric vs edge-centric, with the edge-centric-over-vertex-centric speedup (log scale, time in seconds).

Complexity
Vertex-centric                      O(d_max^2 · |V|)
Edge-centric: initialize            O(|V| · d_max)
Edge-centric: triangle_counting     O(|E| · d_max)
Edge-centric: cc_computation        O(|V| · d_max)
Edge-centric total: 2 · O(|V| · d_max) + O(|E| · d_max) < O(d_max^2 · |V|)
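A brief justification of the final inequality (our addition, not from the poster): every neighbour list has at most d_max entries, so

|E| = \sum_{v \in V} d_v \le |V| \cdot d_{\max}
\quad\Rightarrow\quad
2\,O(|V| \cdot d_{\max}) + O(|E| \cdot d_{\max}) \subseteq O(d_{\max}^2 \cdot |V|),

and the edge-centric bound is much smaller in practice when the average degree is far below d_max, as in skewed power-law graphs.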

Tanuj Kr Aasawat
Department of Information Technology
Jadavpur University, Kolkata, India

Contribution highlights

Scalable implementation of parallel clustering coefficient algorithm on GPUs.

Report performance numbers for large graphs not previously reported in the literature for single-node, in-memory systems.

GPU implementation offers 7x speedup over the best reported running time for the same graph.

We can compute the clustering coefficient of each vertex in power-law [2] graphs with up to 32M vertices and 512M edges using a single GPU.

For the parallel CPU implementation, we present results for graphs with up to 4B edges.

Figure: a triangle formed by vertices u, v and w.

Problem and Contributions
Parallel Clustering Coefficient Algorithm

Performance

G = (V, E) – an undirected graph
V – set of vertices, E – set of edges
F(v) – set of neighbours of any vertex v ∈ V
d_v – degree of v
d_max – maximum vertex degree in G
T_v – number of triangles incident on v
cc_v – clustering coefficient of v

We present two implementations:
Vertex-centric – a single thread counts the triangles incident on a vertex.
Edge-centric – each edge is processed by a single thread. The number of threads used to count the triangles incident on a vertex is equal to its vertex degree.

References

[1] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of “small-world” networks. Nature, 393:440–442, 1998.
[2] Latapy, M. Main-memory triangle computations for very large (sparse (power-law)) graphs. Theor. Comp. Sci., 407(1–3):458–473, 2008.
[3] Bell, N. and Garland, M. Efficient sparse matrix-vector multiplication on CUDA. In Proc. Conf. Supercomputing, 2009.
[4] Suri, S. and Vassilvitskii, S. Counting triangles and the curse of the last reducer. In WWW. ACM, 2011.
[5] Chu, S. and Cheng, J. Triangle listing in massive networks. ACM Trans. Knowl. Discov. Data, 6(4), Article 17, 32 pages, 2012.
[6] Chakrabarti, D., Zhan, Y. and Faloutsos, C. R-MAT: A Recursive Model for Graph Mining. In SDM, 2004.
[7] SNAP: http://snap.stanford.edu/
[8] Hong, S., Kim, S. K., Oguntebi, T. and Olukotun, K. Accelerating CUDA graph algorithms at maximum warp. In Proc. of the ACM Symposium on Principles and Practice of Parallel Programming, 2011.
[9] Park, H-M. and Chung, C-W. An efficient MapReduce algorithm for counting triangles in a very large graph. In Proc. of the Conference on Information & Knowledge Management. ACM, New York, NY, USA, 539–548, 2013.
[10] Cohen, J. Graph twiddling in a MapReduce world. Computing in Science & Engineering, 11(4):29–41, 2009.

