www.anl.gov
Argonne Leadership Computing Facility and Computational Science Division
ECP Annual Meeting, Houston, Feb. 3-7
An Overview of Aurora, Argonne’s Upcoming Exascale System
Yasaman Ghadar, Tim Williams
2
Argonne staff who can help with your questions
• Yasaman Ghadar – Application Porting – EXAALT
• Colleen Bertoni – Application Porting, Programming Models – GAMESS
• Thomas Applencourt – Application Porting, SYCL/DPC++ – QMCPACK
• Victor Anisimov – Programming Models – NWChemEx
• Kevin Harms – I/O Libraries – DAOS
• Sudheer Chunduri – Network Performance – Slingshot
• Brian Homerding – Programming Models, Raja/Kokkos – EQSIM
• Ray Loy – Tools, Debuggers – E3SM-MMF
• JaeHyuk Kwack – Performance Tools, OpenMP, Kokkos – ExaWind
• Servesh Muralidharan – Performance Optimization – ExaFEL
• Kris Rowe – I/O Libraries – ExaSMR
• Kevin Brown – Network Performance, Hardware Integration
• Tim Williams – Early Science Project – WDMApp
• James Osborn – LatticeQCD
• Saumil Patel – ExaSMR
• Silvio Rizzi – ALPINE
3
Outline
• Hardware Architecture (Yasaman Ghadar)
  • Node-level hardware
  • Interconnect
  • I/O
  • Aurora's testbed
• Software Stack (Tim Williams)
  • Three pillars
  • Programming models
  • Tools and libraries
  • Development platforms
  • Support for ECP teams
• Further details on Aurora can be discussed under NDA. If you are not covered by an Aurora NDA, please talk to your ECP project PI.
4
Aurora: A High-level View
• Intel-Cray machine arriving at Argonne in 2021
• Sustained performance > 1 exaflops
• Intel Xeon processors and Intel Xe GPUs
  • 2 Xeons (Sapphire Rapids)
  • 6 GPUs (Ponte Vecchio [PVC])
• Greater than 10 PB of total memory
• Cray Slingshot fabric and Shasta platform
• Filesystems
  • Distributed Asynchronous Object Store (DAOS): ≥ 230 PB of storage capacity, bandwidth > 25 TB/s
  • Lustre: 150 PB of storage capacity, bandwidth ~1 TB/s
5
The Evolution of Intel GPUs
Source: Intel
6
Intel GPUs
• Recent integrated generations:
  • Gen9 – used in Skylake
  • Gen11 – used in Ice Lake
• Gen9 double precision peak performance: 100-300 GFLOPS
  • Low by design due to power and space limits
• Upcoming Xe GPU is discrete
[Figure: layout of architecture components for an Intel Core i7-6700K desktop processor (91 W TDP, 122 mm² die)]
7
Intel Discrete Graphics 1 (DG1)
• Intel Xe DG1 software development vehicle (SDV) shown at CES 2020
• The Xe DG1 GPU is part of the low-power (LP) series and is not a high-end product
• The chip will be used in laptops, with the GPU integrated onto the motherboard
• The PCIe form factor is only for the SDV and is not a product
• The SDV is shipping worldwide to independent software vendors (ISVs) to enable broad software optimization for the Xe architecture
https://newsroom.intel.com/image-archive/images-intel-dgi-sdv/?wapkw=DG1#gs.u0a8t8
8
Aurora Compute Node
• 2 Intel Xeon (Sapphire Rapids) processors
• 6 Xe architecture based GPUs (Ponte Vecchio)
  • All-to-all connection
  • Low latency and high bandwidth
• 8 Slingshot fabric endpoints
• Unified memory architecture across CPUs and GPUs
Network Interconnect
10
Cray Shasta System
Shasta Hardware Workshop, May 6, 2019, Cray User Meeting
• Supports a diversity of processors
• Scale-optimized cabinets for density, cooling, and high network bandwidth
• Standard cabinets for flexibility
• Cray software stack with improved modularity
• Unified by a single, high-performance interconnect
[Figure: optimized rack and standard rack]
11
Cray Slingshot Network
• Slingshot is Cray's next-generation scalable interconnect
  • 8th major network generation
• Builds on Cray's expertise in high-performance networks, following
  • Gemini (Titan, Blue Waters)
  • Aries (Theta, Cori)
  • 5-hop dragonfly topology
• Slingshot introduces:
  • Congestion management
  • Traffic classes
  • 3-hop dragonfly topology
[Figure: Cray Slingshot]
https://www.cray.com/products/computing/slingshot
https://www.cray.com/resources/slingshot-interconnect-for-exascale-era
12
Dragonfly Topology
• Several groups are connected together using all-to-all links
• Optical cables are used for the long links between groups
• Low-cost electrical links connect the NICs in each node to their local router and to the other routers in the group
https://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf
[Figure: each group contains switches serving sets of nodes; the switches within a group are connected all-to-all, and Group 0 through Group n are connected all-to-all by the long optical links]
13
High Bandwidth Switch: Rosetta
• Multi-level congestion management
  • Minimizes the impact of congested applications on others
  • Very low average and tail latency
• Quality of Service (QoS) – traffic classes
  • Class: a collection of buffers, queues, and bandwidth
  • Intended to provide isolation between applications via traffic shaping
• Aggressive adaptive routing
  • Expected to be more effective for the 3-hop dragonfly due to closer congestion information
• High-performance multicast and reductions
Rosetta switch: 25.6 Tb/s per switch, from 64 × 200 Gb/s ports (25 GB/s per direction)
14
Congestion Management with Slingshot
• The GPCNeT benchmark suite was used to evaluate congestion management
15
GPCNeT (Global Performance and Congestion Network Tests)
• A new benchmark developed by Cray in collaboration with Argonne and NERSC
• Publicly available at https://github.com/netbench/GPCNET
• Goals:
  • Proxy for real-world application communication patterns
  • Measure network performance under load
  • Look at both mean and tail latency
  • Look at interference between workloads
  • Assess congestion management: how well the system performs under congestion
18
Congestion Management with Slingshot
• The GPCNeT benchmark suite was used
• Tail latency was measured in congested and isolated runs
• The benchmark was run over the full machine for multiple runs
• Congestion impact = latency (congested) / latency (isolated)
• Aries congestion impact is on the order of 1000x
• InfiniBand has reduced congestion impact (20-1000x)
• The Slingshot testbed (Malbec) shows minimal congestion impact
[Chart: congestion impact on a log scale (1 to 10000) for Crystal, Theta, Edison, Osprey, Summit, Sierra, and Malbec, showing CI (99% tail latency) and CI (99% all-reduce)]
https://dl.acm.org/doi/pdf/10.1145/3295500.3356215
19
Slingshot Quality of Service Classes
• Highly tunable QoS classes
• Supports multiple, overlaid virtual networks
• Jobs can use multiple traffic classes
• Provides performance isolation for different types of traffic
[Figure: three traffic classes sharing bandwidth with QoS vs. three traffic types sharing bandwidth without QoS]
I/O
21
Distributed Asynchronous Object Store (DAOS)
• Open-source storage solution
• Offers high performance in bandwidth and I/O operations
  • ≥ 230 PB capacity
  • ≥ 25 TB/s
  • Using DAOS is critical to achieving good I/O performance on Aurora
• Provides compatibility with existing I/O models such as POSIX, MPI-IO, and HDF5
• Provides a flexible storage API that enables new I/O paradigms
DAOS (Distributed Asynchronous Object Storage) for Applications; Thursday 8:30-10:00AM
22
User Environment
• Users see a single storage namespace, which is in Lustre (compatibility with legacy systems)
  • Links point to DAOS containers within the /project directory
  • DAOS-aware software interprets these links and accesses the DAOS containers
• Data resides in a single place (Lustre or DAOS)
  • Explicit data movement, no auto-migration
• Suggested storage locations
  • Source and binaries in Lustre
  • Bulk data in DAOS
DAOS (Distributed Asynchronous Object Storage) for Applications; Thursday 8:30-10:00AM
23
Summary of Aurora Hardware Features
• Intel/Cray machine arriving at Argonne in 2021 with sustained performance > 1 exaflops
• Aurora node architecture
  • 2 Intel Xeon (Sapphire Rapids) processors
  • 6 Xe architecture based GPUs (Ponte Vecchio)
  • 8 Slingshot fabric endpoints
  • Unified memory architecture across CPUs and GPUs
• Cray Slingshot network and Shasta platform
  • Highly tunable congestion management and traffic classes
• I/O
  • DAOS (≥ 230 PB capacity and ≥ 25 TB/s)
  • Lustre (150 PB of storage and ~1 TB/s)
Aurora Testbed
25
Aurora Testbed
• JLSE provides testbeds for Aurora – available today
  • 20 nodes of Intel Xeons with Gen9 Iris Pro integrated GPU
  • Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required]
    • OpenMP
    • DPC++/SYCL
    • OpenCL
    • Intel VTune and Advisor
• If you'd like to develop your application on the integrated Gen9 GPU, request access by sending email to [email protected]
  • Note: your home institution needs a CNDA agreement with Intel
Joint Laboratory for System Evaluation (JLSE) https://jlse.anl.gov
26
Intel Gen9 Integrated Graphics
• CPU cores and the GPU are integrated on the same chip:
  • GPU and CPU share the same memory
  • GPU, cores, and memory are connected by an internal ring interconnect
  • Shared last-level cache
IDF-2015 release of “The Compute Architecture of Intel Processor Graphics Gen9” – Stephen Junkins
27
Intel Gen9 Hierarchy
• GPU architectures have hardware hierarchies
• Built out of smaller scalable units; each level shares a set of resources
[Figure: slice, sub-slice, and execution unit (EU) hierarchy]
28
Intel Gen9 Building Blocks: EU
EU: Execution Unit
• These are the compute processors that execute instructions
  • 28 KB register file
  • 7 hardware threads
  • Can issue 4 instructions per cycle (from different threads)
  • SIMD-width instructions
    • SIMD hardware is 128 bits wide (4 SP, 2 DP)
    • Instruction SIMD width can vary (1-32)
IDF-2015 release of “The Compute Architecture of Intel Processor Graphics Gen9” – Stephen Junkins
29
Intel Gen9 Building Blocks: EU
EU: Execution Unit
• Issues instructions to 4 functional units
  • 2 SIMD "FPUs"
    • Floating-point instructions
    • Integer instructions
    • One unit performs double precision
    • Fully pipelined
    • Supports FMA
  • Branch
  • Send (memory load/store)
IDF-2015 release of “The Compute Architecture of Intel Processor Graphics Gen9” – Stephen Junkins
30
Intel Gen9 Building Blocks: Sub-slice
A sub-slice contains:
• 8 EUs
• Instruction cache
• Local thread dispatcher
• L1 & L2 sampler caches
• Data port (memory load/store)
  • Efficient read/write operations
  • Provides scatter/gather
• Shared local memory
31
Intel Gen9 Building Blocks: Slice
A slice contains:
• 3 sub-slices
• L3 cache
• Shared local memory
• Fixed-function units
• Slices are a scalable architectural unit
  • Products available with 1-3 slices
  • Slices are connected at the L3
32
Gen9 (GT4) GPU Characteristics
• Clock frequency: 1.15 GHz
• Slices: 3
• EUs: 72 (3 slices × 3 sub-slices × 8 EUs)
• Hardware threads: 504 (72 EUs × 7 threads)
• Concurrent kernel instances: 16,128 (504 threads × SIMD-32)
• L3 data cache size: 1.5 MB (3 slices × 0.5 MB/slice)
• Max shared local memory: 576 KB (3 slices × 3 sub-slices × 64 KB/sub-slice)
• Last-level cache size: 8 MB
• eDRAM size: 128 MB
• 32-bit float: 1152 FLOPS/cycle (72 EUs × 2 FPUs × SIMD-4 × (MUL + ADD))
• 64-bit float: 288 FLOPS/cycle (72 EUs × 1 FPU × SIMD-2 × (MUL + ADD)), i.e. 331.2 DP GFLOPS at 1.15 GHz
• 32-bit integer: 576 IOPS/cycle (72 EUs × 2 FPUs × SIMD-4)
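As a quick arithmetic check on the last rows (the single precision figure is derived here; only the double precision number appears on the slide):

\[ 1152\ \tfrac{\mathrm{FLOPS}}{\mathrm{cycle}} \times 1.15\ \mathrm{GHz} = 1324.8\ \mathrm{GFLOPS\ (SP)} \]
\[ 288\ \tfrac{\mathrm{FLOPS}}{\mathrm{cycle}} \times 1.15\ \mathrm{GHz} = 331.2\ \mathrm{GFLOPS\ (DP)} \]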
SOFTWARE
34
Pre-exascale and Exascale US Landscape
System – delivery – CPU + accelerator vendor:
• Summit – 2018 – IBM + NVIDIA
• Sierra – 2018 – IBM + NVIDIA
• Perlmutter – 2020 – AMD + NVIDIA
• Aurora – 2021 – Intel + Intel
• Frontier – 2021 – AMD + AMD
• El Capitan – 2022 – TBA
• Heterogeneous computing (CPU + accelerator)
• Varying vendors
35
Three Pillars
• Simulation: HPC languages, directives, parallel runtimes, solver libraries
• Data: productivity languages, big data stack, statistical libraries, databases
• Learning: productivity languages, DL frameworks, linear algebra libraries, statistical libraries
• Common software stack underneath all three pillars: math libraries, C++ standard library, libc; I/O and messaging; scheduler; Linux kernel and POSIX; compilers, performance tools, debuggers; containers and visualization
PROGRAMMING MODELS
37
Heterogeneous System Programming Models
• Applications will be using a variety of programming models for exascale:
  • CUDA
  • OpenCL
  • HIP
  • OpenACC
  • OpenMP
  • DPC++/SYCL
  • Kokkos
  • Raja
• Not all systems will support all models
• Libraries may help you abstract some programming models
38
Available Aurora Programming Models
• Aurora applications may use:
  • OpenMP
  • DPC++/SYCL
  • OpenCL
  • Kokkos
  • Raja
• Codes written to CUDA, HIP, or OpenACC will need to move to one of these models (see the mapping on the next slide)
Preparing Applications for Aurora: Thursday, 1:30-3:00 PM, Champions III
39
Mapping of Existing Programming Models to Aurora
Existing model → suggested Aurora model:
• MPI + OpenMP (without target) → MPI + OpenMP with target
• OpenMP with target → OpenMP with target
• OpenACC → OpenMP with target
• OpenCL → OpenCL
• CUDA → DPC++/SYCL
• Kokkos → Kokkos (sketch below)
• Raja → Raja
Vendor-supported Aurora programming models: OpenMP, DPC++/SYCL, OpenCL. ECP-provided programming models: Kokkos, Raja.
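Because Kokkos and Raja carry over unchanged, porting such codes is largely a matter of rebuilding against the Aurora backend. As a rough illustration (a generic Kokkos sketch, not taken from any ECP application), the loop below is written once and targets whichever backend – OpenMP, CUDA, or the DPC++/SYCL backend under development – is enabled at build time:

  #include <Kokkos_Core.hpp>

  int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
      const int N = 1 << 20;
      // Views are allocated in the default memory space of the enabled backend
      Kokkos::View<double *> x("x", N), y("y", N);

      // The same parallel_for runs on whichever device the build enables
      Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
        y(i) = 2.0 * x(i) + y(i);
      });
      Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
  }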
40
OpenMP 5
• OpenMP 5 constructs will provide a directive-based programming model for Intel GPUs
• Available for C, C++, and Fortran
• A portable model expected to be supported on a variety of platforms (Aurora, Frontier, Perlmutter, ...)
• Optimized for Aurora
• For Aurora, OpenACC codes could be converted into OpenMP (a small before/after sketch follows below)
  • ALCF staff will assist with conversion, training, and best practices
  • Automated translation is possible through the clacc conversion tool (for C/C++)
https://www.openmp.org/
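As a rough, hand-written illustration of that conversion (a generic vector add, not an excerpt from any ECP code), an OpenACC parallel loop and a roughly equivalent OpenMP target construct look like this:

  // OpenACC version, as it might appear in an existing code
  void vadd_acc(const float *a, const float *b, float *c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
      c[i] = a[i] + b[i];
  }

  // Roughly equivalent OpenMP 4.5+ target offload
  void vadd_omp(const float *a, const float *b, float *c, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
      c[i] = a[i] + b[i];
  }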
41
OpenMP 4.5/5 for Aurora
• The OpenMP 4.5/5 specifications have significant updates to allow for improved support of accelerator devices

Offloading code to run on an accelerator:
  #pragma omp target [clause[[,] clause],...]             structured-block
  #pragma omp declare target                              declarations-definition-seq
  #pragma omp declare variant*(variant-func-id) clause    function definition or declaration

Distributing iterations of the loop to threads:
  #pragma omp teams [clause[[,] clause],...]              structured-block
  #pragma omp distribute [clause[[,] clause],...]         for-loops
  #pragma omp loop* [clause[[,] clause],...]              for-loops

Controlling data transfer between devices:
  map([map-type:] list)     map-type := alloc | tofrom | from | to | ...
  #pragma omp target data [clause[[,] clause],...]        structured-block
  #pragma omp target update [clause[[,] clause],...]

* denotes OpenMP 5

Environment variables:
• Control the default device through OMP_DEFAULT_DEVICE
• Control offload with OMP_TARGET_OFFLOAD

Runtime support routines:
• void omp_set_default_device(int dev_num)
• int omp_get_default_device(void)
• int omp_get_num_devices(void)
• int omp_get_num_teams(void)
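A minimal sketch of how the device-related runtime routines behave (a generic example, not from the slides); setting OMP_TARGET_OFFLOAD=MANDATORY in the environment makes the program fail instead of silently falling back to the host:

  #include <omp.h>
  #include <cstdio>

  int main() {
    // Number of offload devices visible to the OpenMP runtime
    std::printf("devices: %d, default device: %d\n",
                omp_get_num_devices(), omp_get_default_device());

    int on_host = 1;
    #pragma omp target map(from: on_host)
    {
      on_host = omp_is_initial_device();  // 0 when the region actually ran on a device
    }
    std::printf("target region ran on the %s\n", on_host ? "host" : "device");
    return 0;
  }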
44
DPC++ (Data Parallel C++) and SYCL
• SYCL
  • Khronos standard specification
  • SYCL is a C++-based abstraction layer (standard C++11)
  • Builds on OpenCL concepts (but single-source)
  • SYCL is designed to be as close to standard C++ as possible
• Current implementations of SYCL:
  • ComputeCpp (www.codeplay.com)
  • Intel SYCL (github.com/intel/llvm)
  • triSYCL (github.com/triSYCL/triSYCL)
  • hipSYCL (github.com/illuhad/hipSYCL)
  • Runs on today's CPUs and NVIDIA, AMD, and Intel GPUs
• DPC++
  • Part of the Intel oneAPI specification
  • Intel extension of SYCL to support new innovative features
  • Incorporates the SYCL 1.2.1 specification and Unified Shared Memory
  • Adds language or runtime extensions as needed to meet user needs
  • Layering: Intel DPC++ builds on SYCL 1.2.1 or later, which builds on C++11 or later

DPC++ extensions:
• Unified Shared Memory (USM): defines pointer-based memory accesses and management interfaces (sketch below)
• In-order queues: defines simple in-order semantics for queues, to simplify common coding patterns
• Reduction: provides a reduction abstraction for the ND-range form of parallel_for
• Optional lambda name: removes the requirement to manually name lambdas that define kernels
• Subgroups: defines a grouping of work-items within a work-group
• Data flow pipes: enables efficient first-in, first-out (FIFO) communication (FPGA-only)

https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcpp_root.html#extensions-table
New open-source DPC++ for NVIDIA from Codeplay: https://github.com/codeplaysoftware/sycl-for-cuda
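A small illustration of the USM extension listed above (written against a recent DPC++/oneAPI compiler; the exact spellings of these calls changed across early beta releases): one pointer is usable from both host and device, without buffers or accessors.

  #include <CL/sycl.hpp>
  using namespace sycl;

  int main() {
    queue q;  // default device selection
    const size_t N = 1024;

    // Unified Shared Memory: one allocation visible to host and device
    float *data = malloc_shared<float>(N, q);
    for (size_t i = 0; i < N; ++i) data[i] = 1.0f;

    // Kernel reads and writes the same pointer; no explicit copies
    q.parallel_for(range<1>(N), [=](id<1> i) {
      data[i] *= 2.0f;
    }).wait();

    // Results are visible on the host after the wait
    free(data, q);
    return 0;
  }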
45
OpenCL
• Open standard for heterogeneous device programming (CPU, GPU, FPGA)
  • Utilized for GPU programming
• Standardized by the multi-vendor Khronos Group; v1.0 released in 2009
  • AMD, Intel, NVIDIA, ...
  • Many implementations from different vendors
• Intel's implementation for its GPUs is open source (https://github.com/intel/compute-runtime)
• SIMT programming model
  • Distributes work by abstracting loops and assigning work to threads
  • Does not use pragmas/directives for specifying parallelism
  • Similar model to CUDA
• Consists of a C-compliant library plus a kernel language
  • The kernel language extends C
  • Has extensions that may be vendor-specific
• Programming aspects:
  • Requires host and device code to be in different files
  • Typically uses JIT compilation
• Example: https://github.com/alcf-perfengr/alcl
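To make the host/kernel split and the JIT step concrete, here is a minimal, self-contained vector-add sketch (error checking omitted; this is a generic illustration, not taken from the ALCF example repository linked above):

  #include <CL/cl.h>
  #include <cstdio>
  #include <vector>

  // Kernel source, written in the OpenCL C kernel language and compiled at run time (JIT)
  static const char *src = R"CLC(
  __kernel void vadd(__global const float *a, __global const float *b, __global float *c) {
      int i = get_global_id(0);
      c[i] = a[i] + b[i];
  })CLC";

  int main() {
    const size_t N = 1024, bytes = N * sizeof(float);
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);

    cl_platform_id platform;  cl_device_id device;  cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);   // JIT compile
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a.data(), &err);
    cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b.data(), &err);
    cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dB);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dC);

    size_t gsize = N;   // SIMT: one work-item per element
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &gsize, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dC, CL_TRUE, 0, bytes, c.data(), 0, nullptr, nullptr);

    std::printf("c[0] = %f\n", c[0]);
    return 0;
  }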
46
MPI on Aurora
• Intel MPI and Cray MPI
• MPI 3.0 standard compliant
• The MPI library will be thread safe
  • Allows applications to use MPI from individual threads
  • Efficient MPI_THREAD_MULTIPLE (locking optimizations) – sketch below
• Asynchronous progress in all types of nonblocking communication
  • Nonblocking send-receive and collectives
  • One-sided operations
• Hardware- and topology-optimized collective implementations
• Supports the MPI tools interface
  • Control variables
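A minimal sketch of requesting full thread support at startup (generic MPI code, nothing Aurora-specific):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
    int provided;
    // Ask for MPI_THREAD_MULTIPLE so individual threads may call MPI concurrently
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
      std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // ... threads created here can each post sends, receives, and collectives ...
    MPI_Finalize();
    return 0;
  }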
47
oneAPI
• Industry specification from Intel (https://www.oneapi.com/spec/)
  • Language and libraries to target programming across diverse architectures (DPC++, APIs, low-level interface)
• Intel oneAPI products and toolkits (https://software.intel.com/ONEAPI)
  • Implementations of the oneAPI specification plus analysis and debug tools to help programming
48
Other Aurora Languages
• C++
• C
• DPC++
• Python
(OpenMP 5 is available from C, C++, and Fortran)

Intel Fortran for Aurora
• Fortran 2008
• OpenMP 5
• New compiler with an LLVM backend
  • Strong Intel history of optimizing Fortran compilers
• Part of the SDK (pre-beta)
  • Beta expected early in the 2nd half of 2020
TOOLS AND LIBRARIES
50
Intel VTune and Advisor
• VTune Profiler
  • Widely used performance analysis tool
  • Currently supports analysis on Intel integrated GPUs
  • Will support future Intel GPUs
• Advisor
  • Provides roofline analysis
  • Offload analysis will identify components profitable to offload
    • Measures performance and behavior of the original code
    • Models accelerator-specific performance to determine offload opportunities
    • Considers overhead from data transfer and kernel launch
51
Intel MKL – Math Kernel Library
• Highly tuned algorithms
  • FFT
  • Linear algebra (BLAS, LAPACK) – example below
  • Sparse solvers
  • Statistical functions
  • Vector math
  • Random number generators
• Optimized for every Intel platform
• oneAPI MKL (oneMKL)
  • https://software.intel.com/en-us/oneapi/mkl
  • The oneAPI beta includes DPC++ support
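For reference, a plain CBLAS call against MKL looks like the sketch below (standard DGEMM interface; the oneMKL DPC++ interface in the oneAPI beta adds queue-based versions of the same functionality):

  #include <mkl.h>
  #include <vector>

  int main() {
    const int m = 64, n = 64, k = 64;
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major storage
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A.data(), k, B.data(), n, 0.0, C.data(), n);
    return 0;
  }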
52
AI and Analytics
• Libraries to support AI and analytics
  • oneAPI Deep Neural Network Library (oneDNN)
    • High-performance primitives to accelerate deep learning frameworks
    • Powers TensorFlow, PyTorch, MXNet, Intel Caffe, and more
    • Running on Gen9 today (via OpenCL)
  • oneAPI Data Analytics Library (oneDAL)
    • Classical machine learning algorithms
    • Easy-to-use one-line daal4py Python interfaces
    • Powers Scikit-Learn and Apache Spark MLlib
53
DAOS
• Fast for a broad spectrum of read/write patterns
• Wide concurrency
• Redundancy
• Common layer to build I/O middleware on
• Open source (Apache 2.0 license)
[Diagram: applications use I/O middleware (HDF5, MPI-IO, POSIX, Spark RDD, Apache Arrow, SQL, or custom interfaces) built on DAOS; simulation data flows through these layers]
DAOS (Distributed Asynchronous Object Storage) for Applications: Thursday, 8:30-10:00 AM, Discovery A
Try it now!
55
Intel oneAPI Toolkits
• Intel's DevCloud
• oneAPI toolkits: https://software.intel.com/ONEAPI
57
Argonne's Testbed
• Argonne's Joint Laboratory for System Evaluation (JLSE) provides testbeds for Aurora – available today
  • 20 nodes of Intel Xeons with Gen9 Iris Pro
  • Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required, ECP/ESP]
• Or, roll your own
  • oneAPI public beta
  • An Intel Gen9 system such as a NUC
Questions?
59
Aurora Support for ECP Teams
• ECP is funding Argonne staff and Intel COE personnel to prepare applications for Aurora
  • 17 Argonne staff (8 FTE)
  • 5 Argonne postdocs (2 hired)
  • 2 on-site Intel COE positions (hiring)
• Some overlap exists between ALCF Early Science Projects (ESP) and ECP AD teams
  • Support activities through the ESP program include:
    • COE workshops
    • Hackathons
    • Applications working group calls (Argonne/Intel)
    • Webinars
• Additional support activities targeting ECP teams
  • Aurora programming workshops
  • Aurora Early Adopters webinar series
60
Aurora Programming Workshop
• Held at Argonne on September 17-19, 2019
  • First Aurora event for ECP teams (some ESP and industry teams by invitation)
• The workshop was geared towards developers, with emphasis on using the Intel Software Development Kit (SDK)
  • Hands-on sessions every day
  • Opportunities to discuss feedback and project requirements
• 163 total attendees, including faculty, staff, postdocs, and students
  • 13 universities
  • 10 DOE/other laboratories (plus Argonne)
  • 2 vendors (plus Intel)
  • 2 industry participants (Boeing & GE)
• This workshop was a giant step forward in our ability to share information with ECP teams
61
Aurora Programming Workshop
• Representation by many AD & ST ECP teams
  • 16 AD (9 KPP-1 projects), 21 ST, and 3 co-design centers
  • Argonne is continuing to engage many AD teams
• Represented projects: ALPINE (ST), ARGO (ST), CANDLE (AD), CEED (CD), CODAR (CD), COPA (CD), E3SM-MMF (AD), EQSIM (AD), EXAALT (AD), ExaBiome (AD), ExaFEL (AD), ExaGraph (ST), ExaLearn (ST), ExaPAPI (ST), ExaSGD (AD), ExaSky (AD), ExaSMR (AD), ExaStar (AD), ExaWind (AD), EZ (ST), GAMESS (AD), HPCToolkit (ST), Kokkos/RAJA (ST), LatticeQCD (AD), Packaging (ST), PETSc/TAO (ST), Pre-exascale, PROTEAS-TUNE (ST), Proxy Apps (ST), QMCPACK (AD), SICM (AD), SOLLVE (ST), STRUMPACK/SUPERLU (ST), SUNDIALS (ST), Training, UPC++/GASNet (ST), VeloC (ST), VTK-m (ST), WDMApp (AD), xGA (ST), xSDK (ST), Y-Tune (ST)
62
Upcoming Events
• February 2020 COE Workshop
  • Next in the series of Aurora events for ESP teams
  • Emphasis on hands-on work with the latest SDK and application development plans
• Fall 2020 Workshop
  • Next in the series of Aurora events for ECP teams
  • Targeting a September time frame
  • Preparations for updated testbed hardware
  • Keep an eye out for the invite!
• Aurora Early Adopters webinar series
qhttps://www.alcf.anl.gov/support-center/aurora
63
Acknowledgements
• Argonne Leadership Computing Facility and Computational Science Division staff
• This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.
• This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Thank You
65
References
• Intel Gen9 GPU Programmer's Reference Manual: https://en.wikichip.org/w/images/3/31/intel-gfx-prm-osrc-skl-vol03-gpu_overview.pdf
• Raja Koduri, Intel Chief Architect, Keynote at Intel HPC Developer Conference, Nov 17, 2019: https://s21.q4cdn.com/600692695/files/doc_presentations/2019/11/DEVCON-2019_16x9_v13_FINAL.pdf
• DAOS:
  • DAOS Community Home: https://wiki.hpdd.intel.com/display/DC/DAOS+Community+Home
  • DAOS User Group meeting: https://wiki.hpdd.intel.com/display/DC/DUG19
• Slingshot:
  • Hot Interconnects keynote by Steve Scott, CTO, Cray: http://www.hoti.org/hoti26/slides/sscott.pdf
  • HPC User Forum talk by Steve Scott, Sep 2019
  • https://www.nextplatform.com/2019/08/16/how-cray-makes-ethernet-suited-for-hpc-and-ai-with-slingshot/
66
OpenMP 5 Example (host transfers data and execution control to the device)

  extern void init(float*, float*, int);
  extern void output(float*, int);

  void vec_mult(float *p, float *v1, float *v2, int N)
  {
    int i;
    init(v1, v2, N);
    // Creates teams of threads on the target device, distributes iterations
    // to the threads (each thread using SIMD parallelism), and controls the
    // data transfer with the map clauses.
    #pragma omp target teams distribute parallel for simd \
            map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    for (i = 0; i < N; i++) {
      p[i] = v1[i] * v2[i];
    }
    output(p, N);
  }
67
SYCL Example (host transfers data and execution control to the device)

  void foo(int *A) {
    default_selector selector;   // Selectors determine which device to dispatch to (get a device).
    {
      queue myQueue(selector);   // Create a queue to submit work to, based on the selector.

      // Wrap the host pointer in a sycl::buffer.
      buffer<cl_int, 1> bufferA(A, 1024);

      myQueue.submit([&](handler &cgh) {
        // Create an accessor for the SYCL buffer.
        auto writeResult = bufferA.get_access<access::mode::discard_write>(cgh);

        // Kernel
        cgh.parallel_for<class hello_world>(range<1>{1024}, [=](id<1> idx) {
          writeResult[idx] = idx[0];
        });  // End of the kernel function
      });    // End of the queue commands
    }  // End of scope: the queue goes out of scope and waits for the queued work to finish.
    // ...
  }