Performance Optimization in X10 Graph Processing...

Post on 27-Jul-2020


Performance Optimization in X10 Graph Processing Library

Koji Ueno (Tokyo Institute of Technology), Toyotaro Suzumura (IBM Research)

Large-Scale Graph Mining is Everywhere

Example domains: Internet map, symbolic networks, protein interactions, social networks, cyber security (15 billion log entries / day for a large enterprise), medical informatics, data enrichment.

Problem on Existing Graph Analytics Libraries

Many graph analytics libraries already exist.

Single node: igraph (R package); GraphLab/GraphChi (Carnegie Mellon University and start-up, C++).

Distributed systems: PBGL2 (Parallel Boost Graph Library, C++) [Gregor, 2005]; Apache Giraph (Pregel model, Java); PEGASUS (Hadoop based); GPS (Graph Processing System, Pregel model, Stanford, Java + NIO); Distributed GraphLab (CMU).

However, they are not optimized for state-of-the-art hardware: high-speed networks, multi-core CPUs, NVRAM, etc.

ScaleGraph: X10 based Graph Processing Library

ScaleGraph is based on our optimized X10 (C++ backend). It utilizes high-speed networks and multi-core CPUs. NVRAM support is a work in progress.

XPregel is the main graph processing framework in ScaleGraph.

[Figure: software stack]

http://scalegraph.org

XPregel Overview

Pregel: distributed graph processing framework and programming model proposed by Google [Malewiczʼ10].

XPregel is a Pregel-like distributed graph processing framework optimized for supercomputers. It utilizes MPI collective communication and natively supports hybrid (MPI and multi-threading) parallelism.

Pregel Programming Model

1. Each vertex initializes its state.
2. Each vertex sends messages to other vertices.
3. Each vertex processes the received messages and updates its state.
4. Each vertex sends messages to other vertices again.
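The steps of the programming model can be illustrated with a small single-process sketch. It is not XPregel code: the function name `pregel_hop_distance`, the adjacency-list input, and the choice of hop distance as the vertex state are all assumptions made for illustration. Messages sent in one superstep become visible only in the next one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, single-process illustration of Pregel-style supersteps.
// Vertex state: hop distance from `source` (-1 means "not reached yet").
std::vector<long> pregel_hop_distance(
    const std::vector<std::vector<std::size_t>>& out_edges,
    std::size_t source) {
  const long kUnreached = -1;
  const std::size_t n = out_edges.size();

  // Each vertex initializes its state.
  std::vector<long> state(n, kUnreached);
  std::vector<std::vector<long>> inbox(n), next_inbox(n);
  bool any_message = source < n;
  if (any_message) inbox[source].push_back(0);

  while (any_message) {  // one loop iteration is one superstep
    any_message = false;
    for (std::size_t v = 0; v < n; ++v) {
      if (inbox[v].empty()) continue;
      // Each vertex processes received messages and updates its state.
      long best = inbox[v].front();
      for (long m : inbox[v]) {
        if (m < best) best = m;
      }
      inbox[v].clear();
      if (state[v] != kUnreached && state[v] <= best) continue;
      state[v] = best;
      // Each vertex sends messages to other vertices.
      for (std::size_t w : out_edges[v]) {
        next_inbox[w].push_back(best + 1);
        any_message = true;
      }
    }
    inbox.swap(next_inbox);  // delivery happens at the superstep boundary
  }
  return state;
}
```

The loop ends when a superstep produces no messages, which plays the role of every vertex voting to halt.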

Parallel Serialization for Collective Communication

Message communication in XPregel is implemented with Team collective communication, mainly Alltoallv.

Sometimes we need to serialize messages sent with XPregel, namely when a message is an array whose length is unknown at compile time (e.g. HyperANF).

We developed parallel serialization to improve the performance of collective communication with serialization.
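To make the Alltoallv data layout concrete, here is a minimal sketch of how fixed-size messages might be grouped by destination before the collective call. The struct `PackedMessages` and the function `pack_by_destination` are hypothetical names, not part of XPregel or the X10 Team API; only the counts/displacements layout mirrors what an Alltoallv-style call expects.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical layout mirroring the arguments of an Alltoallv-style call:
// one contiguous send buffer plus per-destination counts and displacements.
struct PackedMessages {
  std::vector<double> buffer;  // payloads grouped by destination rank
  std::vector<int> counts;     // counts[r]: number of payloads for rank r
  std::vector<int> displs;     // displs[r]: offset of rank r's region
};

// Each input pair is (destination rank, payload).
// Assumption: every rank lies in [0, num_ranks).
PackedMessages pack_by_destination(
    const std::vector<std::pair<int, double>>& messages, int num_ranks) {
  PackedMessages out;
  out.counts.assign(num_ranks, 0);
  out.displs.assign(num_ranks, 0);

  // First pass: count the messages per destination.
  for (const auto& m : messages) ++out.counts[m.first];

  // Exclusive prefix sum of the counts gives the region offsets.
  for (int r = 1; r < num_ranks; ++r) {
    out.displs[r] = out.displs[r - 1] + out.counts[r - 1];
  }

  // Second pass: scatter each payload into its destination region.
  out.buffer.resize(messages.size());
  std::vector<int> cursor = out.displs;  // next free slot per region
  for (const auto& m : messages) out.buffer[cursor[m.first]++] = m.second;
  return out;
}
```

Because the payload here is a plain `double`, the buffer can be handed to the collective as is; variable-length payloads are the case that needs serialization.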

Collective Communication without Serialization

We do not need serialization when the data has no pointers.

[Figure: data with no pointers is handed directly to communication.]

Collective Communication with Serialization

We need to serialize data when it has pointers.

[Figure: data that has pointers is serialized into data with no pointers, which is then handed to communication.]

Each region can be serialized independently, so we serialize the regions in parallel.

[Figure: data that has pointers goes through parallel serialization into data with no pointers, which is then handed to communication.]
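A minimal sketch of serializing regions independently, assuming one region per destination and variable-length integer arrays as messages. The record format (`[length][elements...]`), the type aliases, and the one-thread-per-region strategy are assumptions for illustration; a real implementation would more likely use a fixed worker pool.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

using Message = std::vector<std::int32_t>;  // variable-length payload
using Region = std::vector<Message>;        // all messages for one destination

// Serializes one region as a sequence of [length][elements...] records.
std::vector<unsigned char> serialize_region(const Region& region) {
  std::vector<unsigned char> bytes;
  for (const Message& msg : region) {
    const std::uint64_t len = msg.size();
    const std::size_t old_size = bytes.size();
    bytes.resize(old_size + sizeof(len) + len * sizeof(std::int32_t));
    std::memcpy(bytes.data() + old_size, &len, sizeof(len));
    if (len > 0) {
      std::memcpy(bytes.data() + old_size + sizeof(len), msg.data(),
                  len * sizeof(std::int32_t));
    }
  }
  return bytes;
}

// Regions share no state, so each one is serialized by its own thread
// and writes only to its own output slot.
std::vector<std::vector<unsigned char>> serialize_parallel(
    const std::vector<Region>& regions) {
  std::vector<std::vector<unsigned char>> out(regions.size());
  std::vector<std::thread> workers;
  workers.reserve(regions.size());
  for (std::size_t r = 0; r < regions.size(); ++r) {
    workers.emplace_back([&, r] { out[r] = serialize_region(regions[r]); });
  }
  for (std::thread& t : workers) t.join();
  return out;
}
```

The per-region byte counts produced here are what the counts and displacements of the subsequent collective call would be derived from.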

Avoid Serialization (C++ Backend)

Sometimes we need to send multiple fields as a message, e.g. edge direction and weight.

When we use “class” to define a message type, we need to serialize it since it is referenced by a pointer. But in many cases, the class itself has no pointers.

Potentially, we can avoid serialization when a message has no pointers.

    class ValueMessage {
        val excess:Double;
        val height:Long;
        val id:Long;
    }

One of the message types used by ScaleGraphʼs MaxFlow implementation.

To avoid serialization, we:

1. Created our own array type, MemoryChunk, to flatten memory. In a MemoryChunk for class types, each instance is embedded in the array, unlike a Rail for class types.
2. Modified the X10 compiler to generate a flag that indicates whether a class type has pointers.

With this optimization, we achieved a 6x speedup in some applications (MaxFlow).
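The idea of embedding instances and skipping serialization can be sketched in C++ as follows. This is not the generated backend code: the struct is a hand-written counterpart of `ValueMessage`, the helper names are hypothetical, and `std::is_trivially_copyable` is only a stand-in for the compiler-generated flag (it is weaker, since a struct holding a raw pointer would still pass).

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hand-written C++ counterpart of the ValueMessage class (assumption).
struct ValueMessage {
  double excess;
  long long height;
  long long id;
};

// Stand-in for the "has pointers" flag. Assumption: the real flag comes
// from the modified X10 compiler, not from a type trait.
template <typename T>
constexpr bool needs_serialization() {
  return !std::is_trivially_copyable<T>::value;
}

// Instances are embedded in the array, so the whole chunk can be copied
// to the wire with a single memcpy.
template <typename T>
std::vector<unsigned char> to_wire(const std::vector<T>& chunk) {
  static_assert(!needs_serialization<T>(), "type must be serialized");
  std::vector<unsigned char> bytes(chunk.size() * sizeof(T));
  if (!bytes.empty()) std::memcpy(bytes.data(), chunk.data(), bytes.size());
  return bytes;
}

// Rebuilds the embedded instances on the receiving side.
template <typename T>
std::vector<T> from_wire(const std::vector<unsigned char>& bytes) {
  static_assert(!needs_serialization<T>(), "type must be serialized");
  std::vector<T> chunk(bytes.size() / sizeof(T));
  if (!chunk.empty()) {
    std::memcpy(chunk.data(), bytes.data(), chunk.size() * sizeof(T));
  }
  return chunk;
}
```

With an array of pointers to instances, the same transfer would need one traversal and copy per element, which is the cost the flattened layout removes.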

Optimize Memory Allocation (C++ Backend): Hybrid Memory

The X10 C++ backend uses the BDW GC, which is a conservative GC. False pointers are a big problem for programs that deal with large data.

False pointer problem: a value in an integer array on the GC heap is not a pointer, but a conservative GC assumes all data may be a pointer. If this integer value accidentally equals an address within another array, the conservative GC cannot release that array's memory.
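The false pointer problem can be reproduced with a toy scan. This sketch is not how the BDW collector is implemented; the function names are hypothetical and the scan only checks one block, which is enough to show why an integer can pin unrelated memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A conservative collector cannot tell integers from addresses, so any
// word whose value falls inside a heap block is treated as a reference.
bool looks_like_pointer_into(std::uintptr_t word, const void* block,
                             std::size_t block_bytes) {
  const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(block);
  return word >= begin && word - begin < block_bytes;
}

// Scans an array of plain integers the way a conservative GC scans memory
// and reports whether the block would be kept alive.
bool scan_keeps_block_alive(const std::vector<std::uintptr_t>& words,
                            const void* block, std::size_t block_bytes) {
  for (std::uintptr_t w : words) {
    if (looks_like_pointer_into(w, block, block_bytes)) return true;
  }
  return false;
}
```

The larger the scanned arrays and the larger the heap blocks, the more likely such a coincidence becomes, which is why programs that deal with large data are hit hardest.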

We introduce hybrid memory to solve the false pointer problem.

The array memory is allocated outside the GC heap, using malloc/free.

The array reference has two pointers to ensure the array is collected by the GC. One points to the corresponding object allocated with the GC; the other points to the array memory.

[Figure: with X10ʼs original Rail, the array lives inside the GC heap. With hybrid memory, a MemoryChunk points to the array outside the GC heap and to an object inside the GC heap that releases the array when it is collected.]
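The two-pointer array reference can be sketched as follows. This is an analogy, not the ScaleGraph implementation: `HybridArray` and `make_hybrid_array` are hypothetical names, and `std::shared_ptr` stands in for the small GC-managed object, so the release is driven by reference counting here rather than by the BDW collector.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical stand-in for a hybrid-memory array reference.
struct HybridArray {
  std::shared_ptr<void> owner;  // pointer 1: object that releases the array
  long* data = nullptr;         // pointer 2: array memory outside the GC heap
  std::size_t size = 0;
};

// Allocates the payload with calloc (malloc family) and ties its lifetime
// to the owner object, whose deleter calls free.
HybridArray make_hybrid_array(std::size_t n) {
  HybridArray a;
  void* raw = std::calloc(n > 0 ? n : 1, sizeof(long));
  if (raw == nullptr) return a;  // allocation failed: return an empty array
  a.owner = std::shared_ptr<void>(raw, [](void* p) { std::free(p); });
  a.data = static_cast<long*>(raw);
  a.size = n;
  return a;
}
```

Because the payload sits outside the collected heap, its contents are never scanned for pointers and no stray integer elsewhere can keep it alive; only the small owner object takes part in collection.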

Hybrid memory reduces both memory consumption and execution time.

However, it currently suffers from a segmentation fault problem.

Improve Activity Scheduler

There is a performance bug in X10ʼs activity scheduler. We fixed it.

[Figure: running activities 1, 2 and 3 per thread over time, for X10_NTHREADS=3 on X10 2.3.1, comparing the schedule we expected with the one that sometimes occurs.]

Effect of fixing the activity scheduler: 3.4x speedup (PageRank, 30 iterations, weak scaling, RMAT Scale 22 per node).

Performance Evaluation: Strong Scaling and Weak Scaling

HyperANF is an algorithm that computes the approximate average distance between all vertex pairs in the graph (degree of separation).

[Figures: HyperANF in Weak Scaling (B=5, Scale 22, 1 it...); HyperANF in Strong Scaling (B=5, Scale 28, 1 iteration).]

Performance Evaluation: Comparing with Giraph and PBGL

Giraph is an open-source Pregel-style graph processing framework written in Java.

PBGL is the Parallel Boost Graph Library, written in C++.

[Figures: ScaleGraph vs. Giraph and PBGL. PageRank in Weak Scaling (Scale 22, 30 iteration...); PageRank in Strong Scaling (Scale 25, 30 iterations). Speedup labels in the plots: 9.4x and 38.4x.]

Summary

We improved the performance of XPregel and the ScaleGraph library with the following optimizations:

1. Parallel serialization for collective communication
2. Avoiding serialization of “class” types
3. Hybrid memory
4. Improved activity scheduler

ScaleGraph outperforms Giraph and PBGL by an order of magnitude.

Problems We Are Facing

1. When we use hybrid memory, we sometimes encounter segmentation faults. We have been investigating for a long time, but the cause is still unknown.

2. When we stress the network by using “at” or remote array copy, we encounter segmentation faults.

Our environment: X10 2.3.1, C++ backend, MPI transport.

Improve Activity Scheduler

There is a performance bug in X10ʼs activity scheduler. We fixed it.

[Figure: running activities 1, 2 and 3 per thread over time, for X10_NTHREADS=3 on X10 2.5, comparing the schedule we expected with the one that sometimes occurs.]