Galois Performance Mario Mendez-Lojo Donald Nguyen.

Post on 19-Dec-2015


Galois Performance

Mario Mendez-Lojo, Donald Nguyen

Overview

• Galois system is a test bed for exploring optimizations
  – Safe but not fast out of the box

• Important optimizations
  – Select least transactional overhead
  – Select right scheduling
  – Select appropriate data structure

• Quantify optimizations on applications

Algorithms

Irregular algorithms

• topology
  – general graph
  – grid
  – tree

• operator
  – morph
  – local computation
  – reader

• ordering
  – unordered
  – ordered

1. Barnes-Hut

2. Delaunay Mesh Refinement

3. Preflow-push

Methodology

[Figure: threads vs. time, with segments labeled Idle, Serial, GC, and Compute]

• Abort ratio: aborted iterations / total iterations

• GC options
  – UseParallelGC
  – UseParallelOldGC
  – NewRatio=1

Terms

• Base
  – Default scheduling, default graph

• Serial
  – Galois classes => classes with no concurrency control

• Speedup
  – Measured against the best mean performance of a serial variant

• Throughput
  – Number of serial iterations / time
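The metrics above (abort ratio, speedup, throughput) reduce to simple ratios. A minimal sketch follows; the class and method names are hypothetical and not part of Galois, and the speedup formula (serial time divided by parallel time) is the usual definition, assumed here because it matches the reported Barnes-Hut figures (10271 ms serial, 1553 ms parallel, 6.6X).

```java
// Illustrative helper only; RunMetrics and its methods are hypothetical names.
final class RunMetrics {
    private RunMetrics() {}

    /** Abort ratio: aborted iterations divided by total iterations. */
    static double abortRatio(long abortedIterations, long totalIterations) {
        if (totalIterations <= 0) {
            throw new IllegalArgumentException("totalIterations must be positive");
        }
        return (double) abortedIterations / totalIterations;
    }

    /** Speedup: best serial time divided by the parallel time (assumed definition). */
    static double speedup(double bestSerialMillis, double parallelMillis) {
        if (parallelMillis <= 0) {
            throw new IllegalArgumentException("parallelMillis must be positive");
        }
        return bestSerialMillis / parallelMillis;
    }

    /** Throughput: number of serial iterations divided by elapsed time (iterations per ms). */
    static double throughput(long serialIterations, double elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return serialIterations / elapsedMillis;
    }

    public static void main(String[] args) {
        // Barnes-Hut timings from the results slide: 10271 ms serial, 1553 ms parallel.
        System.out.println("speedup = " + speedup(10271, 1553));
    }
}
```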

Numbers

• Runtime
  – Last of 5 runs in same VM
  – Ignore time to read and construct initial graph

• Other statistics
  – Last of 5 runs
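The measurement rule above (run five times in one VM, report the last run, exclude input construction) can be sketched as a small harness. This is an assumption about how such a harness might look, not the authors' code: `LastOfFiveTimer`, `setup`, and `workload` are hypothetical names, and the lambdas require Java 8 or later rather than the Java 1.6 of the test environment.

```java
// Hypothetical timing harness: reports only the last of five runs in the same VM.
final class LastOfFiveTimer {
    static final int RUNS = 5;

    private LastOfFiveTimer() {}

    /**
     * Runs setup + workload RUNS times and returns the elapsed milliseconds
     * of the final workload run. Setup time is never included.
     */
    static long timeLastRunMillis(Runnable setup, Runnable workload) {
        long lastMillis = -1;
        for (int run = 0; run < RUNS; run++) {
            setup.run();                          // untimed, e.g. read and build the initial graph
            long start = System.nanoTime();
            workload.run();                       // timed region
            lastMillis = (System.nanoTime() - start) / 1_000_000;
        }
        return lastMillis;                        // earlier runs only warm up the VM
    }

    public static void main(String[] args) {
        long[] sink = new long[1];
        long millis = timeLastRunMillis(
            () -> sink[0] = 0,
            () -> { for (int i = 0; i < 5_000_000; i++) sink[0] += i; });
        System.out.println("last of " + RUNS + " runs: " + millis + " ms");
    }
}
```

Reporting the last run lets JIT compilation and class loading settle during the earlier runs.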

Test Environment

• 2 x Xeon X5570 (4 core, 2.93 GHz)
• Java 1.6.0_0-b11
• Linux 2.6.24-27 x86_64
• 20GB heap size

BARNES-HUT

[Image: Most Distant Galaxy Candidates in the Hubble Ultra Deep Field]


Barnes-Hut

• N-body algorithm
  – Oct-tree acceleration structure
  – Serial
    • Tree build, center of mass, particle update
  – Parallel
    • Force computation

• Structure
  – Reader on tree

• Variants
  – Splash2, Reader Galois

Reader Optimization

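// Before: default method flag
child = octree.getNeighbor(nn, 1);

// After: MethodFlag.NONE, because the operator is a reader on the tree
child = octree.getNeighbor(nn, 1, MethodFlag.NONE);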

ParaMeter Profile


Barnes-Hut Results

100,000 points, 1 time step

Best serial: base
Serial time: 10271 ms
Best parallel time: 1553 ms
Best speedup: 6.6X


Barnes-Hut Scalability


DELAUNAY MESH REFINEMENT


Delaunay Mesh Refinement

• Refine “bad” triangles
  – Maintained in worklist

• Structure
  – Cautious operator on graph

• Variants
  – Flag optimized, locallifo

base: Priority.defaultOrder()

local lifo: Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class)

Cautious Optimization

// Before: default method flags on every call
mesh.contains(item);
...
mesh.remove(preNodes.get(i));
...
mesh.add(node);

// After: conflicts are checked only on the reads that precede the first write
mesh.contains(item, MethodFlag.CHECK_CONFLICT);
...
mesh.remove(preNodes.get(i), MethodFlag.NONE);
...
mesh.add(node, MethodFlag.NONE);

• No need to save undo info
• Only check conflicts up to first write

LIFO Optimization

// Before: default scheduling
GaloisRuntime.foreach(..., Priority.defaultOrder());

// After: chunked FIFO first, then LIFO locally
GaloisRuntime.foreach(..., Priority.first(ChunkedFIFO.class).thenLocally(LIFO.class));
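To make the scheduling policy concrete, here is a single-threaded toy model of "chunked FIFO, then locally LIFO". It is a sketch under stated assumptions, not the Galois worklist implementation: the class `ChunkedFifoLocalLifo`, its methods, and the choice to treat each fetched chunk as the worker's local stack are all hypothetical. The intent is only to show the resulting processing order, where newly created work runs before older work in the same chunk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BiConsumer;

// Toy model of a chunked-FIFO worklist with LIFO order inside each chunk (hypothetical).
final class ChunkedFifoLocalLifo<T> {
    private final int chunkSize;
    private final Deque<Deque<T>> chunks = new ArrayDeque<>(); // global FIFO of chunks
    private Deque<T> filling;                                   // chunk currently being filled

    ChunkedFifoLocalLifo(int chunkSize) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be >= 1");
        this.chunkSize = chunkSize;
    }

    /** Adds initial work; items are grouped into chunks in arrival order. */
    void add(T item) {
        if (filling == null || filling.size() == chunkSize) {
            filling = new ArrayDeque<>();
            chunks.addLast(filling);
        }
        filling.addLast(item);
    }

    /**
     * Drains the worklist and returns the processing order. The operator receives
     * the current item and the local stack, onto which it may push new work.
     */
    List<T> run(BiConsumer<T, Deque<T>> operator) {
        List<T> order = new ArrayList<>();
        while (!chunks.isEmpty()) {
            Deque<T> local = chunks.pollFirst();   // oldest chunk first (FIFO)
            if (local == filling) filling = null;  // never refill a chunk that was taken
            while (!local.isEmpty()) {
                T item = local.pollLast();         // newest item first (LIFO)
                order.add(item);
                operator.accept(item, local);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        ChunkedFifoLocalLifo<Integer> worklist = new ChunkedFifoLocalLifo<>(2);
        for (int i = 1; i <= 4; i++) worklist.add(i);
        // Processing item 2 creates new work item 20, which runs next from the local stack.
        List<Integer> order = worklist.run((item, local) -> {
            if (item == 2) local.addLast(20);
        });
        System.out.println(order); // [2, 20, 1, 4, 3]
    }
}
```

In mesh refinement, new bad triangles tend to appear next to the cavity that was just rebuilt, so processing them immediately plausibly improves locality; that rationale is an interpretation, not a claim made on the slide.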


ParaMeter Profile


DMR Results

0.5M triangles, 0.25M bad triangles

Best serial: locallifo.flagopt
Serial time: 17002 ms
Best parallel time: 3745 ms
Best speedup: 4.5X


PREFLOW-PUSH

Preflow-push

• Max-flow algorithm
  – Nodes push flow downhill

• Structure
  – Cautious, local computation

• Variants
  – Flag optimized, local computation graph

base (discharge): Priority.first(Bucketed.class, numHeight+1, false, indexer).then(FIFO.class)

base (relabel): Priority.first(ChunkedFIFO.class, 8)
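For readers unfamiliar with the algorithm, the following is a minimal serial sketch of preflow-push (push-relabel) max-flow, showing what "nodes push flow downhill" means. It is not the Galois benchmark code: it uses a dense capacity matrix and a plain FIFO worklist of active nodes instead of the bucketed-by-height worklist named above, and `PreflowPushSketch` and `maxFlow` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Serial FIFO push-relabel sketch on a dense capacity matrix (illustrative only).
final class PreflowPushSketch {
    private PreflowPushSketch() {}

    /** Returns the max flow from source to sink; capacity[u][v] is the capacity of edge u->v. */
    static long maxFlow(long[][] capacity, int source, int sink) {
        int n = capacity.length;
        long[][] residual = new long[n][];
        for (int u = 0; u < n; u++) residual[u] = capacity[u].clone();
        int[] height = new int[n];
        long[] excess = new long[n];
        Deque<Integer> active = new ArrayDeque<>();   // nodes with excess, other than source/sink

        height[source] = n;
        for (int v = 0; v < n; v++) {                 // initial preflow: saturate edges out of source
            long c = residual[source][v];
            if (v != source && c > 0) {
                residual[source][v] = 0;
                residual[v][source] += c;
                excess[v] += c;
                if (v != sink) active.addLast(v);
            }
        }

        while (!active.isEmpty()) {
            int u = active.pollFirst();
            while (excess[u] > 0) {                   // discharge u
                for (int v = 0; v < n && excess[u] > 0; v++) {
                    // Push only "downhill": along a residual edge to a node exactly one level lower.
                    if (residual[u][v] > 0 && height[u] == height[v] + 1) {
                        long delta = Math.min(excess[u], residual[u][v]);
                        boolean wasInactive = excess[v] == 0;
                        residual[u][v] -= delta;
                        residual[v][u] += delta;
                        excess[u] -= delta;
                        excess[v] += delta;
                        if (wasInactive && v != source && v != sink) active.addLast(v);
                    }
                }
                if (excess[u] > 0) {                  // stuck: relabel u just above its lowest neighbor
                    int lowest = Integer.MAX_VALUE;
                    for (int v = 0; v < n; v++) {
                        if (residual[u][v] > 0) lowest = Math.min(lowest, height[v]);
                    }
                    height[u] = lowest + 1;
                }
            }
        }
        return excess[sink];
    }

    public static void main(String[] args) {
        // Nodes: 0 = source, 1 = a, 2 = b, 3 = sink.
        long[][] capacity = new long[4][4];
        capacity[0][1] = 10; capacity[0][2] = 10;
        capacity[1][2] = 3;  capacity[1][3] = 5;
        capacity[2][3] = 7;
        System.out.println(maxFlow(capacity, 0, 3)); // 12
    }
}
```

Each discharge touches only a node and its neighbors, which is consistent with the "cautious, local computation" structure listed above.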


Local Computation Optimization

// Before
graph = ...

// After: build a LocalComputationGraph from the existing graph
graph = ...
b = new LocalComputationGraph.ObjectGraphBuilder();
graph = b.from(graph).create();

ParaMeter Profile


Preflow-push Results

From challenge problem (genmf-wide)
14 linearly connected grids (194x194), 526,904 nodes, 2,586,020 edges
http://avglab.com/andrew/CATS/maxflow_synthetic.htm

C: 11450 ms
Java: 30234 ms

Best serial: lc.flagopt
Serial time: 57121 ms
Best parallel time: 18242 ms
Best speedup: 3.1X

Preflow-push Scalability


What performance did we expect?

[Figure: threads vs. time, with segments labeled Idle, Serial, GC, // Compute (parallel compute), and Mis-speculation; annotations: "Measured Indirectly", "Synchronization, …", "Error"]

What performance did we expect?

• Naïve: r(x) = t_1 / x

• Amdahl: r(x) = t_p / x + t_s

  t_1 = t_p + t_s

  t_s = t_idle + t_gc + t_serial

• Simple: r(x) = (t_p (i_x / i_1)) / x + t_s
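The three models can be compared side by side with a few lines of code. This is a sketch with made-up inputs: the timings and iteration counts in `main` are hypothetical, and reading i_x as the number of iterations executed with x threads (so that i_x / i_1 inflates the parallel work) is an assumption, since the slide does not define it.

```java
// Expected-runtime models from the slide; all inputs in main are hypothetical.
final class ExpectedRuntime {
    private ExpectedRuntime() {}

    /** Naive model: the whole one-thread time t1 is divided across x threads. */
    static double naive(double t1, int x) {
        return t1 / x;
    }

    /** Amdahl model: only tp scales with x; ts = tIdle + tGc + tSerial is paid in full. */
    static double amdahl(double tp, double ts, int x) {
        return tp / x + ts;
    }

    /** "Simple" model: parallel time is scaled by ix / i1 before being divided by x. */
    static double simple(double tp, double ts, double ix, double i1, int x) {
        return (tp * (ix / i1)) / x + ts;
    }

    public static void main(String[] args) {
        double tp = 9000, ts = 1000;   // hypothetical split of t1 = 10000 ms
        int x = 8;
        System.out.println("naive:  " + naive(tp + ts, x));
        System.out.println("amdahl: " + amdahl(tp, ts, x));
        System.out.println("simple: " + simple(tp, ts, 1200, 1000, x));
    }
}
```

With these inputs the naive model predicts the shortest runtime and the simple model the longest, because each refinement adds a cost the previous one ignores.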


Barnes-Hut


Delaunay Mesh Refinement


Preflow-push


Summary

• Many profitable optimizations
  – Selecting among method flags, worklists, graph variants

• Open topics
  – Automation
  – Static, dynamic and performance analysis
  – Efficient ordered algorithms