STREAM and RA HPC Challenge Benchmarks - Chapel

Post on 03-Feb-2022


STREAM and RA HPC Challenge Benchmarks
• simple, regular 1D computations
• results from SC ’09 competition

AMR Computations
• hierarchical, regular computation

SSCA #2
• unstructured graph computation

Chapel: Sample Codes 2

Two classes of competition:
• Class 1: “best performance”
• Class 2: “most productive”, judged on 50% performance, 50% elegance

Four recommended benchmarks: STREAM, RA, FFT, HPL

Use of library routines: discouraged

Why you may care: provides an alternative to the Top500’s focus on peak performance

Recent Class 2 winners:
• 2008: performance: IBM (UPC/X10); productive: Cray (Chapel), IBM (UPC/X10), Mathworks (Matlab)
• 2009: performance: IBM (UPC+X10); elegance: Cray (Chapel)

Figure: Code Size Summary (Source Lines of Code). Bar chart comparing the Chapel and Reference versions of Global STREAM Triad, EP STREAM Triad, Random Access, FFT, and HPL. Bar labels: 64, 77, 107, 165, 176, 329, 787, 1130, 8800.

Benchmark       2008                          2009                          Improvement
Global STREAM   1.73 TB/s (512 nodes)         10.8 TB/s (2048 nodes)        6.2x
EP STREAM       1.59 TB/s (256 nodes)         12.2 TB/s (2048 nodes)        7.7x
Global RA       0.00112 GUPs (64 nodes)       0.122 GUPs (2048 nodes)       109x
Global FFT      single-threaded single-node   multi-threaded multi-node     multi-node parallel
Global HPL      single-threaded single-node   multi-threaded single-node    single-node parallel

All timings on ORNL Cray XT4:
• 4 cores/node
• 8 GB/node
• no use of library routines
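The Improvement column for the three measured benchmarks is the ratio of the 2009 rate to the 2008 rate. A quick check of that arithmetic, written as a small Python sketch (the dictionary layout and the `improvement` helper are illustrative names, not part of the original material):

```python
# Rates copied from the 2008 and 2009 columns (TB/s for STREAM, GUPs for RA).
RESULTS = {
    "Global STREAM": (1.73, 10.8),
    "EP STREAM": (1.59, 12.2),
    "Global RA": (0.00112, 0.122),
}


def improvement(rate_2008, rate_2009):
    """Return the speedup factor of the 2009 result over the 2008 result."""
    return rate_2009 / rate_2008


factors = {name: improvement(*rates) for name, rates in RESULTS.items()}

for name, factor in factors.items():
    print(f"{name}: {factor:.1f}x")
```

Note that the 2008 and 2009 runs used different node counts, so these factors combine scaling to more nodes with improvements in the implementation.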

const ProblemSpace: domain(1, int(64))
                    dmapped Block([1..m])
                  = [1..m];

var A, B, C: [ProblemSpace] real;

forall (a,b,c) in (A,B,C) do
  a = b + alpha * c;

This loop should eventually be written:
  A = B + alpha * C;
(and can be today, but performance is worse)
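To make the role of the Block distribution concrete, here is a minimal serial Python sketch of the same triad over a block-partitioned index space. The helper names `block_ranges` and `global_triad` are hypothetical, and the near-equal split shown is an assumption for illustration; Chapel's Block distribution may place block boundaries differently.

```python
def block_ranges(m, num_locales):
    """Split the indices 1..m into num_locales contiguous, near-equal blocks."""
    base, extra = divmod(m, num_locales)
    ranges = []
    lo = 1
    for i in range(num_locales):
        size = base + (1 if i < extra else 0)  # first `extra` blocks get one more
        ranges.append(range(lo, lo + size))
        lo += size
    return ranges


def global_triad(b, c, alpha, num_locales):
    """Compute a = b + alpha * c, visiting the indices one block at a time.

    Each block stands in for the portion of the arrays owned by one locale.
    """
    m = len(b)
    a = [0.0] * m
    for block in block_ranges(m, num_locales):
        for i in block:  # indices are 1-based, as in the slide
            a[i - 1] = b[i - 1] + alpha * c[i - 1]
    return a


if __name__ == "__main__":
    print([list(r) for r in block_ranges(10, 4)])
    print(global_triad([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 2.0, 2))
```

In the Chapel version the per-block loops run in parallel on the owning locales; the sketch only shows which indices each block covers.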

coforall loc in Locales do
  on loc {
    local {
      var A, B, C: [1..m] real;

      forall (a,b,c) in (A,B,C) do
        a = b + alpha * c;
    }
  }
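The embarrassingly parallel (EP) variant gives every locale its own private arrays and runs the triad independently on each. A rough Python analogue, using one thread per "locale" (the names `local_triad` and `ep_triad` and the initial values of `b` and `c` are assumptions made for this sketch):

```python
from concurrent.futures import ThreadPoolExecutor


def local_triad(m, alpha):
    """Allocate private arrays of length m and compute a = b + alpha * c."""
    b = [1.0] * m  # placeholder initial values, chosen for the sketch
    c = [2.0] * m
    return [bi + alpha * ci for bi, ci in zip(b, c)]


def ep_triad(num_locales, m, alpha):
    """Run one independent triad per locale; nothing is shared between them."""
    with ThreadPoolExecutor(max_workers=num_locales) as pool:
        return list(pool.map(lambda _: local_triad(m, alpha), range(num_locales)))


if __name__ == "__main__":
    results = ep_triad(num_locales=4, m=8, alpha=3.0)
    print(len(results), results[0])
```

Because no task touches another task's data, no communication is needed, which is what the `local` block asserts in the Chapel code.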

Figure: Performance of HPCC STREAM Triad (Cray XT4). x-axis: Number of Locales (1 to 2048); y-axis: GB/s (0 to 14000). Series: 2008 Chapel Global TPL=1, 2, 3, 4; MPI EP PPN=1, 2, 3, 4; Chapel Global TPL=1, 2, 3, 4; Chapel EP TPL=4.

const TableDist = new dmap(new Block([0..m-1])),
      UpdateDist = new dmap(new Block([0..N_U-1]));

const TableSpace: domain … dmapped TableDist = …,
      Updates: domain … dmapped UpdateDist = …;

var T: [TableSpace] uint(64);

forall ( ,r) in (Updates,RAStream()) do
  on TableDist.idxToLocale(r & indexMask) {
    const myR = r;
    local T(myR & indexMask) ^= myR;
  }

This body should eventually simply be written:
  on T(r&indexMask) do
    T(r&indexMask) ^= r;
(and again, can be today, but performance is worse)
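The update loop above can be sketched serially in Python to show the three moving parts: a stream of pseudo-random 64-bit values, the masked table index, and the locale that owns that index. Everything named here (`ra_stream`, `owner`, `random_access`, the seed of 1, and the block-ownership rule) is an assumption for illustration; the generator follows the shift-and-XOR form of the HPCC reference code with polynomial 0x7, but it does not reproduce the official starting values.

```python
POLY = 0x7
MASK64 = (1 << 64) - 1


def ra_stream(num_updates, seed=1):
    """Yield num_updates 64-bit values from a shift-and-XOR generator."""
    r = seed
    for _ in range(num_updates):
        high_bit_set = r >> 63
        r = ((r << 1) & MASK64) ^ (POLY if high_bit_set else 0)
        yield r


def owner(index, table_size, num_locales):
    """Return the locale owning `index` under a simple block partition."""
    block = -(-table_size // num_locales)  # ceiling division
    return index // block


def random_access(table, num_updates, num_locales):
    """Apply T[r & indexMask] ^= r for each r; return update counts per locale."""
    index_mask = len(table) - 1  # table size must be a power of two
    updates_per_locale = [0] * num_locales
    for r in ra_stream(num_updates):
        i = r & index_mask
        updates_per_locale[owner(i, len(table), num_locales)] += 1
        table[i] ^= r
    return updates_per_locale


if __name__ == "__main__":
    table = list(range(16))
    print(random_access(table, 100, 4))
```

Since XOR is its own inverse, running the same update stream a second time restores the table, which is a convenient correctness check for a sketch like this.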

Figure: Performance of HPCC Random Access (Cray XT4). x-axis: Number of Locales (1 to 2048); y-axis: GUP/s (0 to 0.14). Series: Chapel TPL=1, 2, 4, 8.

Figure: Efficiency of HPCC Random Access on 32+ Locales (Cray XT4). x-axis: Number of Locales (32, 64, 128, 256, 512, 1024, 2048); y-axis: % Efficiency (of scaled Chapel TPL=4 local GUP/s), 0% to 7%. Series: Chapel TPL=1, 2, 4, 8; MPI PPN=4; MPI No Buckets PPN=4; MPI+OpenMP TPN=4.

Figure: Performance of HPCC Random Access (Cray CX1). x-axis: Number of Locales (1 to 4); y-axis: GUP/s (0 to 0.02).

Figure: Performance of HPCC STREAM Triad (Cray CX1). x-axis: Number of Locales (1 to 4); y-axis: GB/s (0 to 35).

Figure: Performance of HPCC STREAM Triad (IBM pSeries 575). x-axis: Number of Locales (1 to 64); y-axis: GB/s (0 to 350).

Figure: Performance of HPCC STREAM Triad (SGI Altix). x-axis: Tasks per Locale (1 to 64); y-axis: GB/s (0 to 12).


Adaptive Mesh Refinement in Chapel

What’s so great about domains?
• Ability to reason about unions of rectangular index spaces (as unions of domains)
• Trivial shared-memory parallelism; easy access to distributed parallelism with distributions
• Fewer nested loops, and no bounds to mess up
• Striding allows much better description of grids (vertices, edges, cell centers)
• Dimension-independent code

Figure: AMR grid hierarchy, showing the coarse grid, finer grids, and finest grids.


Strided grid description

var cell_centers = [1..7 by 2, 1..5 by 2];

var vertical_edges = [0..8 by 2, 1..5 by 2];

var horizontal_edges = [1..7 by 2, 0..6 by 2];

var vertices = [0..8 by 2, 0..6 by 2];
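To see what the four strided domains contain, the sketch below rebuilds them in Python with `range` and `itertools.product`. The helper names `strided` and `grid` are made up for this example; the bounds and strides are copied from the declarations above, where odd indices mark cell centers and even indices mark cell boundaries.

```python
from itertools import product


def strided(lo, hi, step):
    """Inclusive strided range, mirroring Chapel's `lo..hi by step`."""
    return range(lo, hi + 1, step)


def grid(*dims):
    """Return every index tuple in the Cartesian product of the given ranges."""
    return list(product(*dims))


cell_centers = grid(strided(1, 7, 2), strided(1, 5, 2))
vertical_edges = grid(strided(0, 8, 2), strided(1, 5, 2))
horizontal_edges = grid(strided(1, 7, 2), strided(0, 6, 2))
vertices = grid(strided(0, 8, 2), strided(0, 6, 2))

if __name__ == "__main__":
    # A 4 x 3 grid of cells has 5 x 3 vertical edges, 4 x 4 horizontal
    # edges, and 5 x 4 vertices.
    print(len(cell_centers), len(vertical_edges),
          len(horizontal_edges), len(vertices))
```

All four index sets live in the same 0..8 by 0..6 coordinate system, so a cell center at (3, 3) has its left edge at (2, 3) and its lower-left vertex at (2, 2) without any index translation.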

Dimension-free stencils

Tasks:
1. Create an N-dimensional grid.
2. Evaluate the function f(x) = sin(x_1) · sin(x_2) · … · sin(x_N) on the grid.
3. Approximate the Laplacian, ∇²f, on the interior of the grid.

1. Create an N-dimensional grid

config param N: int = 2;
const dimensions = [1..N];

// num_points grid points per dimension, spaced dx apart
config const num_points = 20;
const dx = 1.0 / (num_points-1);

var grid_points: domain(N);
var ranges: N*range;
for d in dimensions do
  ranges(d) = 1..num_points;
grid_points = ranges;

2. Evaluate the function

var f: [grid_points] real = 1.0;
forall point in grid_points {
  for d in dimensions do
    // (point(d)-1)*dx calculates the real coordinate x_d
    f(point) *= sin( (point(d)-1)*dx );
}

3. Approximate the Laplacian

// The Laplacian is only defined on the interior of the grid
var interior_points = grid_points.expand(-1);
var laplacian: [interior_points] real;

forall point in interior_points {
  var shift: N*int;
  for d in dimensions {
    shift(d) = 1;   // shift now translates by one point in dimension d
    // Second difference in dimension d, using the neighbors
    // point+shift and point-shift; approximates the second
    // partial derivative of f with respect to x_d
    laplacian(point) += ( f(point+shift)
                          - 2*f(point)
                          + f(point-shift)
                        ) / dx**2;
    shift(d) = 0;   // reset before moving to the next dimension
  }
}
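The three steps can be sketched end to end in Python to show that nothing in the stencil depends on the number of dimensions. The function name `laplacian_on_grid` and the dictionary-based storage are choices made for this sketch; the grid spacing, the function f (a product of sines), and the second-difference stencil follow the Chapel code above.

```python
import itertools
import math


def laplacian_on_grid(n_dims, num_points):
    """Evaluate f = prod_d sin(x_d) on an n_dims-dimensional grid and
    approximate its Laplacian on the interior with second differences."""
    dx = 1.0 / (num_points - 1)

    # Step 1: the grid is the Cartesian product of 1..num_points per dimension.
    grid_ranges = [range(1, num_points + 1)] * n_dims

    # Step 2: evaluate f at every grid point.
    f = {}
    for point in itertools.product(*grid_ranges):
        f[point] = math.prod(math.sin((p - 1) * dx) for p in point)

    # Step 3: the interior drops one layer of points on every side.
    interior_ranges = [range(2, num_points)] * n_dims
    laplacian = {}
    for point in itertools.product(*interior_ranges):
        total = 0.0
        for d in range(n_dims):
            plus = point[:d] + (point[d] + 1,) + point[d + 1:]
            minus = point[:d] + (point[d] - 1,) + point[d + 1:]
            total += (f[plus] - 2 * f[point] + f[minus]) / dx ** 2
        laplacian[point] = total
    return f, laplacian


if __name__ == "__main__":
    f, lap = laplacian_on_grid(n_dims=2, num_points=20)
    worst = max(abs(lap[p] + 2 * f[p]) for p in lap)
    print(f"max deviation from -N*f: {worst:.2e}")
```

For this particular f the exact Laplacian is -N·f, so comparing the computed values against -N·f gives a simple accuracy check; changing `n_dims` requires no change to the loop bodies.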

SSCA #2 unstructured graph computation

Given a set of heavy edges HeavyEdges in directed graph G, find sub-graphs of outgoing paths with length ≤ maxPathLength

Figure: example weighted directed graph, with the sub-graphs found for maxPathLength = 0, maxPathLength = 1, and maxPathLength = 2.

def rootedHeavySubgraphs(
      G,
      type vertexSet,
      HeavyEdges    : domain,
      HeavyEdgeSubG : [],
      in maxPathLength: int ) {

  // One subgraph per heavy edge, built in parallel
  forall (e, subgraph) in
         (HeavyEdges, HeavyEdgeSubG) {
    const (x,y) = e;
    var ActiveLevel: vertexSet;

    ActiveLevel += y;

    subgraph.edges += e;
    subgraph.nodes += x;
    subgraph.nodes += y;

    // Grow the subgraph outward one level per iteration
    for pathLength in 1..maxPathLength {
      var NextLevel: vertexSet;

      forall v in ActiveLevel do
        forall w in G.Neighbors(v) do
          atomic {
            if !subgraph.nodes.member(w) {
              NextLevel += w;
              subgraph.nodes += w;
              subgraph.edges += (v, w);
            }
          }

      if (pathLength < maxPathLength) then
        ActiveLevel = NextLevel;
    }
  }
}
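A serial Python sketch of the same kernel may help in following the level-by-level expansion. It assumes the graph is given as a dictionary mapping each vertex to a list of its out-neighbors, and the function name `rooted_heavy_subgraphs` and the returned (nodes, edges) pairs are choices made here rather than anything from the Chapel code; the `atomic` block disappears because the sketch has no parallelism.

```python
def rooted_heavy_subgraphs(neighbors, heavy_edges, max_path_length):
    """For each heavy edge (x, y), collect the vertices and edges reachable
    from y along outgoing paths of length <= max_path_length."""
    subgraphs = []
    for (x, y) in heavy_edges:
        nodes = {x, y}
        edges = {(x, y)}
        active_level = {y}

        for _ in range(max_path_length):
            next_level = set()
            for v in active_level:
                for w in neighbors.get(v, ()):
                    if w not in nodes:  # first visit claims the vertex
                        next_level.add(w)
                        nodes.add(w)
                        edges.add((v, w))
            active_level = next_level

        subgraphs.append((nodes, edges))
    return subgraphs


if __name__ == "__main__":
    graph = {1: [2], 2: [3, 4], 3: [5], 4: [], 5: [1]}
    for limit in range(3):
        nodes, edges = rooted_heavy_subgraphs(graph, [(1, 2)], limit)[0]
        print(limit, sorted(nodes), sorted(edges))
```

With maxPathLength = 0 the subgraph is just the heavy edge itself, and each increment adds one more frontier of previously unseen vertices.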


Generic Implementation of Graph G

G.Vertices: A domain whose indices represent the vertices
• For toroidal graphs, a domain(d), so vertices are d-tuples
• For other graphs, a domain(1), so vertices are integers

G.Neighbors: An array over G.Vertices
• For toroidal graphs, a fixed-size array over the domain [1..2*d]
• For other graphs…
  …an associative domain with indices of type index(G.vertices)
  …a sparse subdomain of G.Vertices

This kernel and the others are generic w.r.t. these decisions!
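The idea that a kernel only needs "vertices" and "neighbors of a vertex" can be illustrated in Python with two interchangeable graph representations. The names `torus_neighbors`, `adjacency_neighbors`, and `count_edges` are hypothetical; the torus case mirrors the fixed 2*d neighbors per vertex described above, and the adjacency-list case stands in for the "other graphs" option.

```python
from itertools import product


def torus_neighbors(shape):
    """Return a neighbors(v) function for a d-dimensional torus whose
    vertices are d-tuples; every vertex has exactly 2*d neighbors."""
    def neighbors(vertex):
        result = []
        for d, extent in enumerate(shape):
            for step in (-1, 1):
                coord = (vertex[d] + step) % extent  # wrap around the torus
                result.append(vertex[:d] + (coord,) + vertex[d + 1:])
        return result
    return neighbors


def adjacency_neighbors(adjacency):
    """Return a neighbors(v) function backed by an explicit adjacency list."""
    return lambda vertex: adjacency[vertex]


def count_edges(vertices, neighbors):
    """A kernel that is generic in how vertices and neighbors are stored."""
    return sum(len(neighbors(v)) for v in vertices)


if __name__ == "__main__":
    shape = (4, 4)
    torus_vertices = list(product(*(range(n) for n in shape)))
    print(count_edges(torus_vertices, torus_neighbors(shape)))

    adjacency = [[1], [2, 3], [4], [], [0]]
    print(count_edges(range(len(adjacency)), adjacency_neighbors(adjacency)))
```

`count_edges` never inspects the vertex type, so the same code runs over tuple-indexed and integer-indexed graphs.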


Generic with respect to vertex sets

vertexSet: A type argument specifying how to represent vertex subsets

Requirements:
• Parallel iteration
• Ability to add members, test for membership

Options:
• An associative domain over vertices: domain(index(G.vertices))
• A sparse subdomain of the vertices: sparse subdomain(G.vertices)
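As a loose analogy in Python, the sketch below passes the vertex-set representation to a traversal as a class argument. `HashVertexSet` and `BitmapVertexSet` are stand-ins invented for this example (they are not the Chapel associative or sparse domains), and `reachable_within` is a hypothetical kernel that needs only the listed requirements: iteration, adding members, and membership tests.

```python
class HashVertexSet:
    """Vertex subset stored in a hash set."""

    def __init__(self, num_vertices):
        self._members = set()

    def add(self, v):
        self._members.add(v)

    def __contains__(self, v):
        return v in self._members

    def __iter__(self):
        return iter(sorted(self._members))


class BitmapVertexSet:
    """Vertex subset stored as one flag per vertex of the parent graph."""

    def __init__(self, num_vertices):
        self._bits = [False] * num_vertices

    def add(self, v):
        self._bits[v] = True

    def __contains__(self, v):
        return self._bits[v]

    def __iter__(self):
        return (v for v, flag in enumerate(self._bits) if flag)


def reachable_within(neighbors, start, max_steps, vertex_set):
    """Return the vertices reachable from `start` in at most max_steps hops,
    using whichever vertex-set class the caller supplies."""
    n = len(neighbors)
    seen = vertex_set(n)
    seen.add(start)
    active = vertex_set(n)
    active.add(start)
    for _ in range(max_steps):
        next_level = vertex_set(n)
        for v in active:
            for w in neighbors[v]:
                if w not in seen:
                    seen.add(w)
                    next_level.add(w)
        active = next_level
    return list(seen)


if __name__ == "__main__":
    graph = [[1], [2, 3], [4], [], [0]]
    print(reachable_within(graph, 0, 2, HashVertexSet))
    print(reachable_within(graph, 0, 2, BitmapVertexSet))
```

Both representations give the same answer; which is faster depends on how dense the vertex subsets are relative to the whole graph.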


The same genericity applies to subgraphs
