www.anl.gov
Argonne Leadership Computing Facility and Computational Science Division
ECP Annual Meeting, Houston, Feb. 3-7
An Overview of Aurora, Argonne’s Upcoming Exascale System
Yasaman Ghadar, Tim Williams
2
Argonne staff who can help with your questions
• Yasaman Ghadar – Application Porting – EXAALT
• Colleen Bertoni – Application Porting, Programming Models – GAMESS
• Thomas Applencourt – Application Porting, SYCL/DPC++ – QMCPACK
• Victor Anisimov – Programming Models – NWChemEx
• Kevin Harms – I/O Libraries – DAOS
• Sudheer Chunduri – Network Performance – Slingshot
• Brian Homerding – Programming Models, Raja/Kokkos – EQSIM
• Ray Loy – Tools, Debuggers – E3SM-MMF
• JaeHyuk Kwack – Performance Tools, OpenMP, Kokkos – ExaWind
• Servesh Muralidharan – Performance Optimization – ExaFEL
• Kris Rowe – I/O Libraries – ExaSMR
• Kevin Brown – Network Performance, Hardware Integration
• Tim Williams – Early Science Project – WDMApp
• James Osborn – LatticeQCD
• Saumil Patel – ExaSMR
• Silvio Rizzi – ALPINE
3
Outline
• Hardware Architecture (Yasaman Ghadar)
  • Node-level hardware
  • Interconnect
  • I/O
  • Aurora's testbed
• Software Stack (Tim Williams)
  • Three pillars
  • Programming models
  • Tools and libraries
  • Development platforms
  • Support for ECP teams
• Further details on Aurora can be discussed under NDA. If you are not covered by an Aurora NDA, please talk to your ECP project PI.
4
Aurora: A High-level View
• Intel-Cray machine arriving at Argonne in 2021
• Sustained performance > 1 exaflops
• Intel Xeon processors and Intel Xe GPUs
  • 2 Xeons (Sapphire Rapids)
  • 6 GPUs (Ponte Vecchio [PVC])
• Greater than 10 PB of total memory
• Cray Slingshot fabric and Shasta platform
• Filesystems
  • Distributed Asynchronous Object Store (DAOS): ≥ 230 PB of storage capacity, bandwidth > 25 TB/s
  • Lustre: 150 PB of storage capacity, bandwidth ~1 TB/s
5
The Evolution of Intel GPUs
Source: Intel
6
Intel GPUs
• Recent integrated generations:
  • Gen9 – used in Skylake
  • Gen11 – used in Ice Lake
• Gen9 double precision peak performance: 100-300 GFLOPS
  • Low by design due to power and space limits
• Upcoming Xe GPU is discrete
[Figure: layout of architecture components for an Intel Core i7-6700K desktop processor (91 W TDP, 122 mm² die)]
7
Intel Discrete Graphics 1 (DG1)
• Intel Xe DG1 software development vehicle (SDV) shown at CES 2020
• The Xe DG1 GPU is part of the low-power (LP) series and is not a high-end product
• The chip will be used in laptops, with the GPU integrated onto the motherboard
• The PCIe form factor is only for the SDV and is not a product
• The SDV is shipping worldwide to independent software vendors (ISVs) to enable broad software optimization for the Xe architecture
https://newsroom.intel.com/image-archive/images-intel-dgi-sdv/?wapkw=DG1#gs.u0a8t8
8
Aurora Compute Node
• 2 Intel Xeon (Sapphire Rapids) processors
• 6 Xe architecture based GPUs (Ponte Vecchio)
  • All-to-all connection
  • Low latency and high bandwidth
• 8 Slingshot fabric endpoints
• Unified memory architecture across CPUs and GPUs
Network Interconnect
10
Cray Shasta System
Shasta Hardware Workshop, May 6, 2019, Cray User Meeting
• Supports a diversity of processors
• Scale-optimized cabinets for density, cooling, and high network bandwidth
• Standard cabinets for flexibility
• Cray software stack with improved modularity
• Unified by a single, high-performance interconnect
[Figure: optimized rack and standard rack]
11
Cray Slingshot Network
• Slingshot is Cray's next-generation scalable interconnect
  • 8th major network generation
• Builds on Cray's expertise in high-performance networks, following
  • Gemini (Titan, Blue Waters)
  • Aries (Theta, Cori)
  • 5-hop dragonfly topology
• Slingshot introduces:
  • Congestion management
  • Traffic classes
  • 3-hop dragonfly topology
[Figure: Cray Slingshot]
https://www.cray.com/products/computing/slingshot
https://www.cray.com/resources/slingshot-interconnect-for-exascale-era
12
Dragonfly Topology
• Several groups are connected together using all-to-all links
• Optical cables are used for the long links between groups
• Low-cost electrical links connect the NICs in each node to their local router and to the other routers in the group
https://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf
[Figure: each group contains switches serving sets of nodes; the switches within a group are connected all-to-all, and Group 0 through Group n are connected all-to-all by the long optical links]
13
High Bandwidth Switch: Rosetta
• Multi-level congestion management
  • Minimizes the impact of congested applications on others
  • Very low average and tail latency
• Quality of Service (QoS) – traffic classes
  • Class: a collection of buffers, queues, and bandwidth
  • Intended to provide isolation between applications via traffic shaping
• Aggressive adaptive routing
  • Expected to be more effective for the 3-hop dragonfly due to closer congestion information
• High-performance multicast and reductions
Rosetta switch: 25.6 Tb/s per switch, from 64 × 200 Gb/s ports (25 GB/s per direction)
14
Congestion Management with Slingshot
• The GPCNeT benchmark suite was used to evaluate congestion management
15
GPCNeT (Global Performance and Congestion Network Tests)
• A new benchmark developed by Cray in collaboration with Argonne and NERSC
• Publicly available at https://github.com/netbench/GPCNET
• Goals:
  • Proxy for real-world application communication patterns
  • Measure network performance under load
  • Look at both mean and tail latency
  • Look at interference between workloads
  • Assess congestion management: how well the system performs under congestion
18
Congestion Management with Slingshot
• The GPCNeT benchmark suite was used
• Tail latency was measured in congested and isolated runs
• The benchmark was run over the full machine for multiple runs
• Congestion impact = latency (congested) / latency (isolated)
• Aries congestion impact is on the order of 1000x
• InfiniBand has reduced congestion impact (20-1000x)
• The Slingshot testbed (Malbec) shows minimal congestion impact
[Chart: congestion impact on a log scale (1 to 10000) for Crystal, Theta, Edison, Osprey, Summit, Sierra, and Malbec, showing CI (99% tail latency) and CI (99% all-reduce)]
https://dl.acm.org/doi/pdf/10.1145/3295500.3356215
19
Slingshot Quality of Service Classes
• Highly tunable QoS classes
• Supports multiple, overlaid virtual networks
• Jobs can use multiple traffic classes
• Provides performance isolation for different types of traffic
[Figure: three traffic classes sharing bandwidth with QoS vs. three traffic types sharing bandwidth without QoS]
I/O
21
Distributed Asynchronous Object Store (DAOS)
• Open-source storage solution
• Offers high performance in bandwidth and I/O operations
  • ≥ 230 PB capacity
  • ≥ 25 TB/s
  • Using DAOS is critical to achieving good I/O performance on Aurora
• Provides compatibility with existing I/O models such as POSIX, MPI-IO, and HDF5
• Provides a flexible storage API that enables new I/O paradigms
DAOS (Distributed Asynchronous Object Storage) for Applications; Thursday 8:30-10:00AM
22
User Environment
• Users see a single storage namespace, which is in Lustre (compatibility with legacy systems)
  • Links point to DAOS containers within the /project directory
  • DAOS-aware software interprets these links and accesses the DAOS containers
• Data resides in a single place (Lustre or DAOS)
  • Explicit data movement, no auto-migration
• Suggested storage locations
  • Source and binaries in Lustre
  • Bulk data in DAOS
DAOS (Distributed Asynchronous Object Storage) for Applications; Thursday 8:30-10:00AM
23
Summary of Aurora Hardware Features
• Intel/Cray machine arriving at Argonne in 2021 with sustained performance > 1 exaflops
• Aurora node architecture
  • 2 Intel Xeon (Sapphire Rapids) processors
  • 6 Xe architecture based GPUs (Ponte Vecchio)
  • 8 Slingshot fabric endpoints
  • Unified memory architecture across CPUs and GPUs
• Cray Slingshot network and Shasta platform
  • Highly tunable congestion management and traffic classes
• I/O
  • DAOS (≥ 230 PB capacity and ≥ 25 TB/s)
  • Lustre (150 PB of storage and ~1 TB/s)
Aurora Testbed
25
Aurora Testbed
• JLSE provides testbeds for Aurora – available today
  • 20 nodes of Intel Xeons with Gen9 Iris Pro integrated GPU
  • Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required]
    • OpenMP
    • DPC++/SYCL
    • OpenCL
    • Intel VTune and Advisor
• If you'd like to develop your application on the integrated Gen9 GPU, request access by sending email to [email protected]
  • Note: your home institution needs a CNDA agreement with Intel
Joint Laboratory for System Evaluation (JLSE) https://jlse.anl.gov
26
Intel Gen9 Integrated Graphics
• CPU cores and the GPU are integrated on the same chip:
  • GPU and CPU share the same memory
  • GPU, cores, and memory are connected by an internal ring interconnect
  • Shared last-level cache
IDF-2015 release of “The Compute Architecture of Intel Processor Graphics Gen9” – Stephen Junkins
27
Intel Gen9 Hierarchy
• GPU architectures have hardware hierarchies
• Built out of smaller scalable units; each level shares a set of resources
[Figure: slice, sub-slice, and execution unit (EU) hierarchy]
28
Intel Gen9 Building Blocks: EU
EU: Execution Unit
• These are the compute processors that execute instructions
  • 28 KB register file
  • 7 hardware threads
  • Can issue 4 instructions per cycle (from different threads)
  • SIMD-width instructions
    • SIMD hardware is 128 bits wide (4 SP, 2 DP)
    • Instruction SIMD width can vary (1-32)
IDF-2015 release of “The Compute Architecture of Intel Processor Graphics Gen9” – Stephen Junkins
29
Intel Gen9 Building Blocks: EU
EU: Execution Unit
• Issues instructions to 4 functional units
  • 2 SIMD "FPUs"
    • Floating-point instructions
    • Integer instructions
    • One unit performs double precision
    • Fully pipelined
    • Supports FMA
  • Branch
  • Send (memory load/store)
IDF-2015 release of “The Compute Architecture of Intel Processor Graphics Gen9” – Stephen Junkins
30
Intel Gen9 Building Blocks: Sub-slice
A sub-slice contains:
• 8 EUs
• Instruction cache
• Local thread dispatcher
• L1 & L2 sampler caches
• Data port (memory load/store)
  • Efficient read/write operations
  • Provides scatter/gather
• Shared local memory
31
Intel Gen9 Building Blocks: Slice
A slice contains:
• 3 sub-slices
• L3 cache
• Shared local memory
• Fixed-function units
• Slices are a scalable architectural unit
  • Products available with 1-3 slices
  • Slices are connected at the L3
32
Gen9 (GT4) GPU Characteristics
• Clock frequency: 1.15 GHz
• Slices: 3
• EUs: 72 (3 slices × 3 sub-slices × 8 EUs)
• Hardware threads: 504 (72 EUs × 7 threads)
• Concurrent kernel instances: 16,128 (504 threads × SIMD-32)
• L3 data cache size: 1.5 MB (3 slices × 0.5 MB/slice)
• Max shared local memory: 576 KB (3 slices × 3 sub-slices × 64 KB/sub-slice)
• Last-level cache size: 8 MB
• eDRAM size: 128 MB
• 32-bit float: 1152 FLOPS/cycle (72 EUs × 2 FPUs × SIMD-4 × (MUL + ADD))
• 64-bit float: 288 FLOPS/cycle (72 EUs × 1 FPU × SIMD-2 × (MUL + ADD)), i.e. 331.2 DP GFLOPS at 1.15 GHz
• 32-bit integer: 576 IOPS/cycle (72 EUs × 2 FPUs × SIMD-4)
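As a quick arithmetic check on the last rows (the single precision figure is derived here; only the double precision number appears on the slide):

\[ 1152\ \tfrac{\mathrm{FLOPS}}{\mathrm{cycle}} \times 1.15\ \mathrm{GHz} = 1324.8\ \mathrm{GFLOPS\ (SP)} \]
\[ 288\ \tfrac{\mathrm{FLOPS}}{\mathrm{cycle}} \times 1.15\ \mathrm{GHz} = 331.2\ \mathrm{GFLOPS\ (DP)} \]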
SOFTWARE
34
Pre-exascale and Exascale US Landscape
System – delivery – CPU + accelerator vendor:
• Summit – 2018 – IBM + NVIDIA
• Sierra – 2018 – IBM + NVIDIA
• Perlmutter – 2020 – AMD + NVIDIA
• Aurora – 2021 – Intel + Intel
• Frontier – 2021 – AMD + AMD
• El Capitan – 2022 – TBA
• Heterogeneous computing (CPU + accelerator)
• Varying vendors
35
Three Pillars
• Simulation: HPC languages, directives, parallel runtimes, solver libraries
• Data: productivity languages, big data stack, statistical libraries, databases
• Learning: productivity languages, DL frameworks, linear algebra libraries, statistical libraries
• Common software stack underneath all three pillars: math libraries, C++ standard library, libc; I/O and messaging; scheduler; Linux kernel and POSIX; compilers, performance tools, debuggers; containers and visualization
PROGRAMMING MODELS
37
Heterogeneous System Programming Models
• Applications will be using a variety of programming models for exascale:
  • CUDA
  • OpenCL
  • HIP
  • OpenACC
  • OpenMP
  • DPC++/SYCL
  • Kokkos
  • Raja
• Not all systems will support all models
• Libraries may help you abstract some programming models
38
Available Aurora Programming Models
• Aurora applications may use:
  • OpenMP
  • DPC++/SYCL
  • OpenCL
  • Kokkos
  • Raja
• Codes written to CUDA, HIP, or OpenACC will need to move to one of these models (see the mapping on the next slide)
Preparing Applications for Aurora: Thursday, 1:30-3:00 PM, Champions III
39
Mapping of Existing Programming Models to Aurora
Existing model → suggested Aurora model:
• MPI + OpenMP (without target) → MPI + OpenMP with target
• OpenMP with target → OpenMP with target
• OpenACC → OpenMP with target
• OpenCL → OpenCL
• CUDA → DPC++/SYCL
• Kokkos → Kokkos (sketch below)
• Raja → Raja
Vendor-supported Aurora programming models: OpenMP, DPC++/SYCL, OpenCL. ECP-provided programming models: Kokkos, Raja.
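Because Kokkos and Raja carry over unchanged, porting such codes is largely a matter of rebuilding against the Aurora backend. As a rough illustration (a generic Kokkos sketch, not taken from any ECP application), the loop below is written once and targets whichever backend – OpenMP, CUDA, or the DPC++/SYCL backend under development – is enabled at build time:

  #include <Kokkos_Core.hpp>

  int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    {
      const int N = 1 << 20;
      // Views are allocated in the default memory space of the enabled backend
      Kokkos::View<double *> x("x", N), y("y", N);

      // The same parallel_for runs on whichever device the build enables
      Kokkos::parallel_for("axpy", N, KOKKOS_LAMBDA(const int i) {
        y(i) = 2.0 * x(i) + y(i);
      });
      Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
  }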
40
OpenMP 5
• OpenMP 5 constructs will provide a directive-based programming model for Intel GPUs
• Available for C, C++, and Fortran
• A portable model expected to be supported on a variety of platforms (Aurora, Frontier, Perlmutter, ...)
• Optimized for Aurora
• For Aurora, OpenACC codes could be converted into OpenMP (a small before/after sketch follows below)
  • ALCF staff will assist with conversion, training, and best practices
  • Automated translation is possible through the clacc conversion tool (for C/C++)
https://www.openmp.org/
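As a rough, hand-written illustration of that conversion (a generic vector add, not an excerpt from any ECP code), an OpenACC parallel loop and a roughly equivalent OpenMP target construct look like this:

  // OpenACC version, as it might appear in an existing code
  void vadd_acc(const float *a, const float *b, float *c, int n) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i)
      c[i] = a[i] + b[i];
  }

  // Roughly equivalent OpenMP 4.5+ target offload
  void vadd_omp(const float *a, const float *b, float *c, int n) {
    #pragma omp target teams distribute parallel for \
            map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; ++i)
      c[i] = a[i] + b[i];
  }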
41
OpenMP 4.5/5 for Aurora
• The OpenMP 4.5/5 specifications have significant updates to allow for improved support of accelerator devices

Offloading code to run on an accelerator:
  #pragma omp target [clause[[,] clause],...]             structured-block
  #pragma omp declare target                              declarations-definition-seq
  #pragma omp declare variant*(variant-func-id) clause    function definition or declaration

Distributing iterations of the loop to threads:
  #pragma omp teams [clause[[,] clause],...]              structured-block
  #pragma omp distribute [clause[[,] clause],...]         for-loops
  #pragma omp loop* [clause[[,] clause],...]              for-loops

Controlling data transfer between devices:
  map([map-type:] list)     map-type := alloc | tofrom | from | to | ...
  #pragma omp target data [clause[[,] clause],...]        structured-block
  #pragma omp target update [clause[[,] clause],...]

* denotes OpenMP 5

Environment variables:
• Control the default device through OMP_DEFAULT_DEVICE
• Control offload with OMP_TARGET_OFFLOAD

Runtime support routines:
• void omp_set_default_device(int dev_num)
• int omp_get_default_device(void)
• int omp_get_num_devices(void)
• int omp_get_num_teams(void)
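A minimal sketch of how the device-related runtime routines behave (a generic example, not from the slides); setting OMP_TARGET_OFFLOAD=MANDATORY in the environment makes the program fail instead of silently falling back to the host:

  #include <omp.h>
  #include <cstdio>

  int main() {
    // Number of offload devices visible to the OpenMP runtime
    std::printf("devices: %d, default device: %d\n",
                omp_get_num_devices(), omp_get_default_device());

    int on_host = 1;
    #pragma omp target map(from: on_host)
    {
      on_host = omp_is_initial_device();  // 0 when the region actually ran on a device
    }
    std::printf("target region ran on the %s\n", on_host ? "host" : "device");
    return 0;
  }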
44
DPC++ (Data Parallel C++) and SYCL
• SYCL
  • Khronos standard specification
  • SYCL is a C++-based abstraction layer (standard C++11)
  • Builds on OpenCL concepts (but single-source)
  • SYCL is designed to be as close to standard C++ as possible
• Current implementations of SYCL:
  • ComputeCpp (www.codeplay.com)
  • Intel SYCL (github.com/intel/llvm)
  • triSYCL (github.com/triSYCL/triSYCL)
  • hipSYCL (github.com/illuhad/hipSYCL)
  • Runs on today's CPUs and NVIDIA, AMD, and Intel GPUs
• DPC++
  • Part of the Intel oneAPI specification
  • Intel extension of SYCL to support new innovative features
  • Incorporates the SYCL 1.2.1 specification and Unified Shared Memory
  • Adds language or runtime extensions as needed to meet user needs
  • Layering: Intel DPC++ builds on SYCL 1.2.1 or later, which builds on C++11 or later

DPC++ extensions:
• Unified Shared Memory (USM): defines pointer-based memory accesses and management interfaces (sketch below)
• In-order queues: defines simple in-order semantics for queues, to simplify common coding patterns
• Reduction: provides a reduction abstraction for the ND-range form of parallel_for
• Optional lambda name: removes the requirement to manually name lambdas that define kernels
• Subgroups: defines a grouping of work-items within a work-group
• Data flow pipes: enables efficient first-in, first-out (FIFO) communication (FPGA-only)

https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcpp_root.html#extensions-table
New open-source DPC++ for NVIDIA from Codeplay: https://github.com/codeplaysoftware/sycl-for-cuda
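A small illustration of the USM extension listed above (written against a recent DPC++/oneAPI compiler; the exact spellings of these calls changed across early beta releases): one pointer is usable from both host and device, without buffers or accessors.

  #include <CL/sycl.hpp>
  using namespace sycl;

  int main() {
    queue q;  // default device selection
    const size_t N = 1024;

    // Unified Shared Memory: one allocation visible to host and device
    float *data = malloc_shared<float>(N, q);
    for (size_t i = 0; i < N; ++i) data[i] = 1.0f;

    // Kernel reads and writes the same pointer; no explicit copies
    q.parallel_for(range<1>(N), [=](id<1> i) {
      data[i] *= 2.0f;
    }).wait();

    // Results are visible on the host after the wait
    free(data, q);
    return 0;
  }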
45
OpenCL
• Open standard for heterogeneous device programming (CPU, GPU, FPGA)
  • Utilized for GPU programming
• Standardized by the multi-vendor Khronos Group; v1.0 released in 2009
  • AMD, Intel, NVIDIA, ...
  • Many implementations from different vendors
• Intel's implementation for its GPUs is open source (https://github.com/intel/compute-runtime)
• SIMT programming model
  • Distributes work by abstracting loops and assigning work to threads
  • Does not use pragmas/directives for specifying parallelism
  • Similar model to CUDA
• Consists of a C-compliant library plus a kernel language
  • The kernel language extends C
  • Has extensions that may be vendor-specific
• Programming aspects:
  • Requires host and device code to be in different files
  • Typically uses JIT compilation
• Example: https://github.com/alcf-perfengr/alcl
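To make the host/kernel split and the JIT step concrete, here is a minimal, self-contained vector-add sketch (error checking omitted; this is a generic illustration, not taken from the ALCF example repository linked above):

  #include <CL/cl.h>
  #include <cstdio>
  #include <vector>

  // Kernel source, written in the OpenCL C kernel language and compiled at run time (JIT)
  static const char *src = R"CLC(
  __kernel void vadd(__global const float *a, __global const float *b, __global float *c) {
      int i = get_global_id(0);
      c[i] = a[i] + b[i];
  })CLC";

  int main() {
    const size_t N = 1024, bytes = N * sizeof(float);
    std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N);

    cl_platform_id platform;  cl_device_id device;  cl_int err;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);   // JIT compile
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    cl_mem dA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a.data(), &err);
    cl_mem dB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b.data(), &err);
    cl_mem dC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, nullptr, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &dA);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dB);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dC);

    size_t gsize = N;   // SIMT: one work-item per element
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &gsize, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dC, CL_TRUE, 0, bytes, c.data(), 0, nullptr, nullptr);

    std::printf("c[0] = %f\n", c[0]);
    return 0;
  }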
46
MPI on Aurora
• Intel MPI and Cray MPI
• MPI 3.0 standard compliant
• The MPI library will be thread safe
  • Allows applications to use MPI from individual threads
  • Efficient MPI_THREAD_MULTIPLE (locking optimizations) – sketch below
• Asynchronous progress in all types of nonblocking communication
  • Nonblocking send-receive and collectives
  • One-sided operations
• Hardware- and topology-optimized collective implementations
• Supports the MPI tools interface
  • Control variables
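A minimal sketch of requesting full thread support at startup (generic MPI code, nothing Aurora-specific):

  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char **argv) {
    int provided;
    // Ask for MPI_THREAD_MULTIPLE so individual threads may call MPI concurrently
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
      std::fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }
    // ... threads created here can each post sends, receives, and collectives ...
    MPI_Finalize();
    return 0;
  }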
47
oneAPI
• Industry specification from Intel (https://www.oneapi.com/spec/)
  • Language and libraries to target programming across diverse architectures (DPC++, APIs, low-level interface)
• Intel oneAPI products and toolkits (https://software.intel.com/ONEAPI)
  • Implementations of the oneAPI specification plus analysis and debug tools to help programming
48
Other Aurora Languages
• C++
• C
• DPC++
• Python
(OpenMP 5 is available from C, C++, and Fortran)

Intel Fortran for Aurora
• Fortran 2008
• OpenMP 5
• New compiler with an LLVM backend
  • Strong Intel history of optimizing Fortran compilers
• Part of the SDK (pre-beta)
  • Beta expected early in the 2nd half of 2020
TOOLS AND LIBRARIES
50
Intel VTune and Advisor
• VTune Profiler
  • Widely used performance analysis tool
  • Currently supports analysis on Intel integrated GPUs
  • Will support future Intel GPUs
• Advisor
  • Provides roofline analysis
  • Offload analysis will identify components profitable to offload
    • Measures performance and behavior of the original code
    • Models accelerator-specific performance to determine offload opportunities
    • Considers overhead from data transfer and kernel launch
51
Intel MKL – Math Kernel Library
• Highly tuned algorithms
  • FFT
  • Linear algebra (BLAS, LAPACK) – example below
  • Sparse solvers
  • Statistical functions
  • Vector math
  • Random number generators
• Optimized for every Intel platform
• oneAPI MKL (oneMKL)
  • https://software.intel.com/en-us/oneapi/mkl
  • The oneAPI beta includes DPC++ support
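For reference, a plain CBLAS call against MKL looks like the sketch below (standard DGEMM interface; the oneMKL DPC++ interface in the oneAPI beta adds queue-based versions of the same functionality):

  #include <mkl.h>
  #include <vector>

  int main() {
    const int m = 64, n = 64, k = 64;
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major storage
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0, A.data(), k, B.data(), n, 0.0, C.data(), n);
    return 0;
  }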
52
AI and Analytics
• Libraries to support AI and analytics
  • oneAPI Deep Neural Network Library (oneDNN)
    • High-performance primitives to accelerate deep learning frameworks
    • Powers TensorFlow, PyTorch, MXNet, Intel Caffe, and more
    • Running on Gen9 today (via OpenCL)
  • oneAPI Data Analytics Library (oneDAL)
    • Classical machine learning algorithms
    • Easy-to-use one-line daal4py Python interfaces
    • Powers Scikit-Learn and Apache Spark MLlib
53
DAOS
• Fast for a broad spectrum of read/write patterns
• Wide concurrency
• Redundancy
• Common layer to build I/O middleware on
• Open source (Apache 2.0 license)
[Diagram: applications use I/O middleware (HDF5, MPI-IO, POSIX, Spark RDD, Apache Arrow, SQL, or custom interfaces) built on DAOS; simulation data flows through these layers]
DAOS (Distributed Asynchronous Object Storage) for Applications: Thursday, 8:30-10:00 AM, Discovery A
Try it now!
55
Intel oneAPI Toolkits
• Intel's DevCloud
• oneAPI toolkits: https://software.intel.com/ONEAPI
57
Argonne's Testbed
• Argonne's Joint Laboratory for System Evaluation (JLSE) provides testbeds for Aurora – available today
  • 20 nodes of Intel Xeons with Gen9 Iris Pro
  • Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required, ECP/ESP]
• Or, roll your own
  • oneAPI public beta
  • An Intel Gen9 system such as a NUC
Questions?
59
Aurora Support for ECP Teams
• ECP is funding Argonne staff and Intel COE personnel to prepare applications for Aurora
  • 17 Argonne staff (8 FTE)
  • 5 Argonne postdocs (2 hired)
  • 2 on-site Intel COE positions (hiring)
• Some overlap exists between ALCF Early Science Projects (ESP) and ECP AD teams
  • Support activities through the ESP program include:
    • COE workshops
    • Hackathons
    • Applications working group calls (Argonne/Intel)
    • Webinars
• Additional support activities targeting ECP teams
  • Aurora programming workshops
  • Aurora Early Adopters webinar series
60
Aurora Programming Workshop
• Held at Argonne on September 17-19, 2019
  • First Aurora event for ECP teams (some ESP and industry teams by invitation)
• The workshop was geared towards developers, with emphasis on using the Intel Software Development Kit (SDK)
  • Hands-on sessions every day
  • Opportunities to discuss feedback and project requirements
• 163 total attendees, including faculty, staff, postdocs, and students
  • 13 universities
  • 10 DOE/other laboratories (plus Argonne)
  • 2 vendors (plus Intel)
  • 2 industry participants (Boeing & GE)
• This workshop was a giant step forward in our ability to share information with ECP teams
61
Aurora Programming Workshop
• Representation by many AD & ST ECP teams
  • 16 AD (9 KPP-1 projects), 21 ST, and 3 co-design centers
  • Argonne is continuing to engage many AD teams
• Represented projects: ALPINE (ST), ARGO (ST), CANDLE (AD), CEED (CD), CODAR (CD), COPA (CD), E3SM-MMF (AD), EQSIM (AD), EXAALT (AD), ExaBiome (AD), ExaFEL (AD), ExaGraph (ST), ExaLearn (ST), ExaPAPI (ST), ExaSGD (AD), ExaSky (AD), ExaSMR (AD), ExaStar (AD), ExaWind (AD), EZ (ST), GAMESS (AD), HPCToolkit (ST), Kokkos/RAJA (ST), LatticeQCD (AD), Packaging (ST), PETSc/TAO (ST), Pre-exascale, PROTEAS-TUNE (ST), Proxy Apps (ST), QMCPACK (AD), SICM (AD), SOLLVE (ST), STRUMPACK/SUPERLU (ST), SUNDIALS (ST), Training, UPC++/GASNet (ST), VeloC (ST), VTK-m (ST), WDMApp (AD), xGA (ST), xSDK (ST), Y-Tune (ST)
62
Upcoming Events
• February 2020 COE Workshop
  • Next in the series of Aurora events for ESP teams
  • Emphasis on hands-on work with the latest SDK and application development plans
• Fall 2020 Workshop
  • Next in the series of Aurora events for ECP teams
  • Targeting a September time frame
  • Preparations for updated testbed hardware
  • Keep an eye out for the invite!
• Aurora Early Adopters webinar series
qhttps://www.alcf.anl.gov/support-center/aurora
63
Acknowledgements
• Argonne Leadership Computing Facility and Computational Science Division staff
• This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, in support of the nation's exascale computing imperative.
• This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.
Thank You
65
References
• Intel Gen9 GPU Programmer's Reference Manual: https://en.wikichip.org/w/images/3/31/intel-gfx-prm-osrc-skl-vol03-gpu_overview.pdf
• Raja Koduri, Intel Chief Architect, Keynote at Intel HPC Developer Conference, Nov 17, 2019: https://s21.q4cdn.com/600692695/files/doc_presentations/2019/11/DEVCON-2019_16x9_v13_FINAL.pdf
• DAOS:
  • DAOS Community Home: https://wiki.hpdd.intel.com/display/DC/DAOS+Community+Home
  • DAOS User Group meeting: https://wiki.hpdd.intel.com/display/DC/DUG19
• Slingshot:
  • Hot Interconnects keynote by Steve Scott, CTO, Cray: http://www.hoti.org/hoti26/slides/sscott.pdf
  • HPC User Forum talk by Steve Scott, Sep 2019
  • https://www.nextplatform.com/2019/08/16/how-cray-makes-ethernet-suited-for-hpc-and-ai-with-slingshot/
66
OpenMP 5 Example (host transfers data and execution control to the device)

  extern void init(float*, float*, int);
  extern void output(float*, int);

  void vec_mult(float *p, float *v1, float *v2, int N)
  {
    int i;
    init(v1, v2, N);
    // Creates teams of threads on the target device, distributes iterations
    // to the threads (each thread using SIMD parallelism), and controls the
    // data transfer with the map clauses.
    #pragma omp target teams distribute parallel for simd \
            map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    for (i = 0; i < N; i++) {
      p[i] = v1[i] * v2[i];
    }
    output(p, N);
  }
67
SYCL Example (host transfers data and execution control to the device)

  void foo(int *A) {
    default_selector selector;   // Selectors determine which device to dispatch to (get a device).
    {
      queue myQueue(selector);   // Create a queue to submit work to, based on the selector.

      // Wrap the host pointer in a sycl::buffer.
      buffer<cl_int, 1> bufferA(A, 1024);

      myQueue.submit([&](handler &cgh) {
        // Create an accessor for the SYCL buffer.
        auto writeResult = bufferA.get_access<access::mode::discard_write>(cgh);

        // Kernel
        cgh.parallel_for<class hello_world>(range<1>{1024}, [=](id<1> idx) {
          writeResult[idx] = idx[0];
        });  // End of the kernel function
      });    // End of the queue commands
    }  // End of scope: the queue goes out of scope and waits for the queued work to finish.
    // ...
  }