STREAM and RA HPC Challenge Benchmarks - Chapel

STREAM and RA HPC Challenge Benchmarks simple, regular 1D computations

results from SC ’09 competition

AMR Computations hierarchical, regular computation

SSCA #2 unstructured graph computation

Chapel: Sample Codes 2

Two classes of competition: Class 1: “best performance”

Class 2: “most productive” Judged on: 50% performance 50% elegance

Four recommended benchmarks: STREAM, RA, FFT, HPL

Use of library routines: discouraged

Why you may care: provides an alternative to the top-500’s focus on peak performance

Recent Class 2 Winners:2008: performance: IBM (UPC/X10)

productive: Cray (Chapel), IBM (UPC/X10), Mathworks (Matlab)

2009: performance: IBM (UPC+X10)

elegance: Cray (Chapel)

Chapel: Sample Codes

64 77 107 165 176329787

Global STREAM Triad

EP STREAM Triad Random Access FFT HPL

Code Size Summary (Source Lines of Code)

Chapel

Reference

Benchmark 2008 2009 Improvement

Global STREAM1.73 TB/s(512 nodes)

10.8 TB/s(2048 nodes)

EP STREAM1.59 TB/s(256 nodes)

12.2 TB/s(2048 nodes)

Global RA0.00112 GUPs(64 nodes)

0.122 GUPs(2048 nodes)

Global FFTsingle-threadedsingle-node

multi-threaded multi-node

multi-node parallel

Global HPLsingle-threaded single-node

multi-threaded single-node

single-node parallel

All timings on ORNL Cray XT4:

• 4 cores/node

• 8 GB/node

• no use of library routines

Chapel: Sample Codes

const ProblemSpace: domain(1, int(64))

dmapped Block([1..m])

= [1..m];

var A, B, C: [ProblemSpace] real;

forall (a,b,c) in (A,B,C) do

a = b + alpha * c;

This loop should eventually be written:A = B + alpha * C;

(and can be today, but performance is worse)

coforall loc in Locales do

on loc {

local {

var A, B, C: [1..m] real;

forall (a,b,c) in (A,B,C) do

a = b + alpha * c;

1 2048

Number of Locales

Performance of HPCC STREAM Triad (Cray XT4)

2008 Chapel Global TPL=1

MPI EP PPN=1

MPI EP PPN=2

MPI EP PPN=3

MPI EP PPN=4

Chapel Global TPL=1

Chapel Global TPL=2

Chapel Global TPL=3

Chapel Global TPL=4

Chapel EP TPL=4

const TableDist = new dmap(new Block([0..m-1])),

UpdateDist = new dmap(new Block([0..N_U-1]));

const TableSpace: domain … dmapped TableDist = …,

Updates: domain … dmapped UpdateDist = …;

var T: [TableSpace] uint(64);

forall ( ,r) in (Updates,RAStream()) do

on TableDist.idxToLocale(r & indexMask) {

const myR = r;

local T(myR & indexMask) ^= myR;

This body should eventually simply be written:on T(r&indexMask) do

T(r&indexMask) ^= r;

(and again, can be today, but performance is worse)

1 2048

Number of Locales

Performance of HPCC Random Access (Cray XT4)

Chapel TPL=1

Chapel TPL=2

Chapel TPL=4

Chapel TPL=8

32 64 128 256 512 1024 2048

Number of Locales

Efficiency of HPCC Random Access on 32+ Locales (Cray XT4)

Chapel TPL=1

Chapel TPL=2

Chapel TPL=4

Chapel TPL=8

MPI PPN=4

MPI No Buckets PPN=4

MPI+OpenMP TPN=4

Number of Locales

Performance of HPCC Random Access (Cray CX1)

Number of Locales

Performance of HPCC STREAM Triad (Cray CX1)

Number of Locales

Performance of HPCC STREAM Triad (IBM pSeries 575)

Tasks per Locale

Performance of HPCC STREAM Triad (SGI Altix)

Chapel (14)

Adaptive Mesh Refinement in Chapel

What’s so great about domains?

Ability to reason about unions of rectangular index spaces

(as unions of domains)

Trivial shared-memory

parallelism; easy access

to distributed parallelism with

distributions

Fewer nested loops, and no

bounds to mess up

Striding allows much better

description of grids (vertices,

edges, cell centers)

Dimension-independent code

coarse

gridfiner grids

finest grids

Chapel (15)

Strided grid description

Chapel (16)

var cell_centers = [1..7 by 2, 1..5 by 2];

Chapel (17)

var vertical_edges = [0..8 by 2, 1..5 by 2];

Chapel (18)

var horizontal_edges = [1..7 by 2, 0..6 by 2];

Chapel (19)

var horizontal_edges = [1..7 by 2, 0..6 by 2];

var vertices = [0..8 by 2, 0..6 by 2];

Chapel (20)

Dimension-free stencils

Tasks:

1. Create an N-dimensional grid.

2. Evaluate the function

on the grid.

3. Approximate the Laplacian,

Chapel (21)

1. Create an N-dimensional grid

config param N: int = 2;

const dimensions = [1..N];

Chapel (22)

1. Create an N-dimensional gridnum_points

config const num_points = 20;

const dx = 1.0 / (num_points-1);

Chapel (23)

1. Create an N-dimensional grid

var grid_points: domain(N);

var ranges: N*range;

for d in dimensions do

ranges(d) = 1..num_points;

grid_points = ranges;

Chapel (24)

var f: [grid_points] real = 1.0;

forall point in points {

f(point) *= sin( (point(d)-1)*dx );

Chapel (25)

var f: [grid_points] real = 1.0;

forall point in points {

f(point) *= sin( (point(d)-1)*dx );

Calculates real

coordinate xd

Chapel (26)

var interior_points = grid_points.expand(-1);

var laplacian: [interior_points] real;

Chapel (27)

var interior_points = grid_points.expand(-1);

var laplacian: [interior_points] real;

Laplacian is only defined

on the interior of the grid

Chapel (28)

forall point in interior_points {

var shift: N*int;

for d in dimensions {

shift(d) = 1;

laplacian(point) += ( f(point+shift)

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

Chapel (29)

var shift: N*int;

shift(d) = 1;

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

}Approximates

Chapel (30)

var shift: N*int;

shift(d) = 1;

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

Translates in dimension d

Chapel (31)

var shift: N*int;

shift(d) = 1;

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

Chapel (32)

var shift: N*int;

shift(d) = 1;

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

+shift

Chapel (33)

var shift: N*int;

shift(d) = 1;

- 2*f(point)

+ f(point-shift)

)/ dx**2;

shift(d) = 0;

+shift

-shift

Given a set of heavy edges HeavyEdges in directed graph G, find

sub-graphs of outgoing paths with length ≤ maxPathLength

maxPathLength = 0

maxPathLength = 0 maxPathLength = 1

maxPathLength = 0 maxPathLength = 1 maxPathLength = 2

def rootedHeavySubgraphs(

type vertexSet;

HeavyEdges : domain,

HeavyEdgeSubG : [],

in maxPathLength: int ) {

forall (e, subgraph) in

(HeavyEdges, HeavyEdgeSubG) {

const (x,y) = e;

var ActiveLevel: vertexSet;

ActiveLevel += y;

subgraph.edges += e;

subgraph.nodes += x;

subgraph.nodes += y;

for pathLength in 1..maxPathLength {

var NextLevel: vertexSet;

forall v in ActiveLevel do

forall w in G.Neighbors(v) do

atomic {

if !subgraph.nodes.member(w) {

NextLevel += w;

subgraph.nodes += w;

subgraph.edges += (v, w);

if (pathLength < maxPathLength) then

ActiveLevel = NextLevel;

type vertexSet;

HeavyEdgeSubG : [],

const (x,y) = e;

ActiveLevel += y;

atomic {

NextLevel += w;

Generic Implementation of Graph G

G.Vertices: A domain whose indices represent the vertices• For toroidal graphs, a domain(d), so vertices are d-tuples• For other graphs, a domain(1), so vertices are integers

G.Neighbors: An array over G.Vertices• For toroidal graphs, a fixed-size array over the domain [1..2*d]• For other graphs…

…an associative domain with indices of type index(G.vertices)…a sparse subdomain of G.Vertices

This kernel and the others are generic w.r.t. these decisions!

type vertexSet;

HeavyEdgeSubG : [],

const (x,y) = e;

ActiveLevel += y;

atomic {

NextLevel += w;

Generic with respect to vertex sets

vertexSet: A type argument specifying how to represent vertex subsets

Requirements:• Parallel iteration• Ability to add members, test for membership

Options:• An associative domain over verticesdomain(index(G.vertices))

• A sparse subdomain of the verticessparse subdomain(G.vertices)

type vertexSet;

HeavyEdgeSubG : [],

const (x,y) = e;

ActiveLevel += y;

atomic {

NextLevel += w;

The same genericity applies to subgraphs

type vertexSet;

HeavyEdgeSubG : [],

const (x,y) = e;

ActiveLevel += y;

atomic {

NextLevel += w;

STREAM and RA HPC Challenge Benchmarks - Chapel

Documents

Transcript of STREAM and RA HPC Challenge Benchmarks - Chapel

Active Harmony and the Chapel HPC Language Ray Chen, UMD Jeff Hollingsworth, UMD Michael P. Ferguson, LTS.

HPC-Cloud / Cloud-HPC - Approaches to HPC with OpenStack

Software Libraries and Middleware for Exascale Systems · 2019. 12. 2. · Benchmarks and Middleware for Designing Convergent HPC, Big Data and Deep Learning Software Stacks for Exascale

Energy Profiling And Analysis Of The HPC Challenge Benchmarks Scalable Performance Laboratory Department of Computer Science Virginia Tech Shuaiwen Song,

Active Harmony and the Chapel HPC Language

Tax Benchmarks and Variations Statement · Web viewTax Benchmarks and Variations Statement Tax Benchmarks and Variations Statement Tax Benchmarks and Variations Statement Tax Benchmarks

A Performance Comparison using HPC Benchmarks ......A PERFORMANCE COMPARISON USING HPC BENCHMARKS: WINDOWS HPC SERVER 2008 AND RED HAT ENTERPRISE LINUX 5 R. Henschel, S. Teige, H.

The HPC Challenge (HPCC) Benchmark Suite · 2007. 10. 2. · 3 Luszczek_HPCC_SC07 HPCC: Motivation and measurement HPC Challenge Benchmarks Select Applications 0.00 0.20 0.40 0.60

November 25, 20021WORKLOADS THAT SCALE IN MULTIPLE DIMENSIONS Workload Benchmarks that Scale in Multiple Dimensions John L. Gustafson Sun Labs HPC Workload.

Benchmarks - June, 2013 | Benchmarks Onlineit.unt.edu/sites/default/files/benchmarks-06-2013.pdf · Benchmarks - June, 2013 | Benchmarks Online 4/26/16, 8:52:25 AM] Skip to content

< QC | HPC >: Quantum for HPC

2006 Open Grid Forum HPC Job Delegation Best Practices Grid Scheduling Architecture Research Group (GSA-RG) May 26, 2009, Chapel Hill, NC, US.

The Swiss HPC Service Provider Community hpc-ch community building BoF HPC for HPC June 24th, 2014.

Chapel Tutorial Using Global HPCC Benchmarks: STREAM Triad, Random Access, and FFT

The UberCloud - From Project to Product - From HPC Experiment to HPC Marketplace - From HPC Shop to HPC Shopping Mall

Benchmarks - May, 2011 | Benchmarks Onlineit.unt.edu/sites/default/files/benchmarks-05-2011.pdf · Benchmarks - May, 2011 | Benchmarks Online 4/28/16, 9:13:42 AM] By Patrick McLoud,

Chapel: HPCC Benchmarks Brad Chamberlain Cray Inc. CSEP 524 May 20, 2010.

A Performance Comparison using HPC Benchmarks: Windows HPC

THE MLPERF BENCHMARKS: DEEP LEARNING AT ......THE MLPERF BENCHMARKS: DEEP LEARNING AT SCALE WITH NVIDIA GPUS Vishal Mehta, DevTech Compute HPC-AI Advisory Council, Swiss Conference,

HPC Cloud Security Stakeholders Platform (HCSSP) · 2019-09-03 · HPC yyy WG HPC xxx WG HPC Cloud Security WG CCM for AWS HPC Cloud CCM for Google HPC Cloud CCM for Microsoft HPC