LCU13: HSA Architecture Presentation

HETEROGENEOUS SYSTEM ARCHITECTURE OVERVIEW VINOD TIPPARAJU PRINCIPAL MEMBER OF TECHNICAL STAFF, AMD AUSTIN.


Resource: LCU13 Name: HSA Architecture Presentation Date: 31-10-2013 Speaker: Vinod Tipparaju Video: http://lcu-13.zerista.com/event/member/87862

Transcript of LCU13: HSA Architecture Presentation

Page 1: LCU13: HSA Architecture Presentation

HETEROGENEOUS SYSTEM ARCHITECTURE OVERVIEW VINOD TIPPARAJU

PRINCIPAL MEMBER OF TECHNICAL STAFF,

AMD AUSTIN.

Page 2: LCU13: HSA Architecture Presentation

AGENDA

•  Introduction (terminology)
•  Memory Model & Queues
•  HSAIL
•  Architecture Details

AMD Confidential - NDA Required

Page 3: LCU13: HSA Architecture Presentation

SOME TERMINOLOGY

•  HSA is Heterogeneous System Architecture, not just GPUs
•  HSA Component – IP that satisfies architecture requirements and provides identified features
•  SoC – System on Chip, a collection of various IPs
    •  E.g. AMD APU (Accelerated Processing Unit) integrates AMD/ARM CPU cores and Graphics IP
    •  It is possible to conceive of companies building just parts of the IP
•  HSAIL – HSA Intermediate Language, a very low-level SIMT language
•  HSA Agent – something that can participate in the HSA memory subsystem (i.e. respect page sizes, memory properties, atomics, etc.)


Page 4: LCU13: HSA Architecture Presentation

WHAT IS HSA?

RUNTIME

•  API that wraps features such as user mode queues, clocks, signalling, etc.
•  Provides execution control
•  Supports tools

TOOLS

•  Supporting profilers, debuggers and compilers
•  Unique debugging support that greatly simplifies implementing debuggers
•  Excellent profiling support with some user mode access

Systems Architecture

•  Defines the system architecture requirements that are necessary from a hardware point of view
•  Specifies shared memory, cache coherence domains, concept of clocks, context switching, memory based signaling, topology
•  Rules governing design and agent behavior

Programmers Reference (HSAIL)

•  An intermediate representation, very low level
•  Vendor independence, device compiler optimizations
•  Abstracts HW, or can serve as the lowest level instruction set

Page 5: LCU13: HSA Architecture Presentation

TAKING THE HW INTEGRATION TO ITS NATURAL CONCLUSION

•  Architectural and System integration
•  Extend architecture to make the component a first class citizen on the SoC
•  Fully-evolved MMU
•  Provide same level of support for tools as CPU
•  Provide context switching, preemption, full-coherence
    •  Helps simulators, migrations, checkpoints, etc
•  Future, other HSA IP


Page 6: LCU13: HSA Architecture Presentation

HIGH LEVEL USAGE SCENARIOS

•  Bulk-Synchronous-Parallelism-like concurrent computation
    •  Rather large parallel sections followed by synchronization
•  Outstanding support for task-based parallelism
    •  Wavefront is 64 threads
    •  256 threads sufficient to fully fill the pipeline
    •  Launch in under a microsecond (same with nested launches)
•  Support for execution schedules – excellent compiler target
    •  Architected Queueing Language (AQL), dependencies
•  Advanced language support
    •  Function calls
    •  Virtual functions
    •  Exception handling (throw-catch)


Page 7: LCU13: HSA Architecture Presentation

MEMORY AND QUEUING MODEL

Page 8: LCU13: HSA Architecture Presentation

HSA MEMORY MODEL

•  Defines visibility ordering between all threads in the HSA System
•  Designed to be compatible with C++11, Java, OpenCL and .NET Memory Models
•  Relaxed consistency memory model for parallel compute performance
•  Visibility controlled by:
    •  Load.Acquire
    •  Store.Release
    •  Barriers


Page 9: LCU13: HSA Architecture Presentation

HSA QUEUING MODEL

•  User mode queuing for low latency dispatch
    •  Application dispatches directly
    •  No OS or driver in the dispatch path
•  Architected Queuing Layer
    •  Single compute dispatch path for all hardware
    •  No driver translation, direct to hardware
•  Allows for dispatch to queue from any agent
    •  CPU or GPU
•  GPU self enqueue enables lots of solutions
    •  Recursion
    •  Tree traversal
    •  Wavefront reforming


Page 10: LCU13: HSA Architecture Presentation

HSAIL


Page 11: LCU13: HSA Architecture Presentation

HSA INTERMEDIATE LAYER — HSAIL

•  HSAIL is a virtual ISA for parallel programs
    •  Finalized to ISA by a JIT compiler or “Finalizer”
    •  ISA independent by design for CPU & GPU
•  Explicitly parallel
    •  Designed for data parallel programming
•  Support for exceptions, virtual functions, and other high level language features
•  Lower level than OpenCL SPIR
    •  Fits naturally in the OpenCL compilation stack
•  Suitable to support additional high level languages and programming models:
    •  Java, C++, OpenMP, Fortran etc


Page 12: LCU13: HSA Architecture Presentation

WHAT IS HSAIL?

•  HSAIL is the intermediate language for parallel compute in HSA
    •  Generated by a high level compiler (LLVM, gcc, Java VM, etc)
    •  Low-level IR, close to machine ISA level
    •  Compiled down to target ISA by an IHV “Finalizer”
    •  Finalizer may execute at run time, install time, or build time
•  Example: OpenCL™ Compilation Stack using HSAIL


[Figure: OpenCL™ compilation stack using HSAIL. High-Level Compiler Flow (Developer): OpenCL™ Kernel -> EDG or CLANG -> SPIR -> LLVM -> HSAIL. Finalizer Flow (Runtime): HSAIL -> Finalizer -> Hardware ISA.]

Page 13: LCU13: HSA Architecture Presentation

KEY HSAIL FEATURES

•  Parallel
•  Shared virtual memory
•  Portable across vendors in HSA Foundation
•  Stable across multiple product generations
•  Consistent numerical results (IEEE-754 with defined min accuracy)
•  Fast, robust, simple finalization step (no monthly updates)
•  Good performance (little need to write in ISA)
•  Supports all of OpenCL™ and C++ AMP™
•  Support Java, C++, and other languages as well


Page 14: LCU13: HSA Architecture Presentation

SIMT EXECUTION MODEL

•  HSAIL presents a “SIMT” execution model to the programmer
    •  “Single Instruction, Multiple Thread”
    •  Programmer writes program for a single thread of execution
    •  Each work-item appears to have its own program counter
    •  Branch instructions look natural
•  Hardware Implementation
    •  Most hardware uses SIMD (Single-Instruction Multiple Data) vectors for efficiency
    •  Actually one program counter for the entire SIMD instruction
    •  Branches implemented with predication
•  SIMT Advantages
    •  Easier to program (branch code in particular)
    •  Natural path for mainstream programming models
    •  Scales across a wide variety of hardware (programmer doesn’t see vector width)
    •  Cross-lane operations available for those who want peak performance
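The gap between the SIMT view and a predicated SIMD implementation can be illustrated with a small simulation. This is only a sketch in Python, not HSAIL or any real ISA: the function names, the even/odd branch, and the list-of-lanes representation are all assumptions made for illustration.

```python
from typing import List


def simt_view(values: List[int]) -> List[int]:
    """What the programmer writes: code for one work-item, with a natural branch."""
    out = []
    for x in values:  # each iteration stands in for one work-item
        if x % 2 == 0:
            out.append(x // 2)
        else:
            out.append(3 * x + 1)
    return out


def simd_predicated(values: List[int]) -> List[int]:
    """How SIMD hardware could run it: one program counter, per-lane predicate."""
    mask = [x % 2 == 0 for x in values]         # predicate bit for every lane
    then_path = [x // 2 for x in values]        # all lanes step through the 'then' side
    else_path = [3 * x + 1 for x in values]     # all lanes step through the 'else' side
    # Each lane keeps only the result selected by its own predicate bit.
    return [t if m else e for m, t, e in zip(mask, then_path, else_path)]
```

Both functions return the same values for any input; the second one simply never branches per lane, which is why the programmer does not need to know the vector width.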


Page 15: LCU13: HSA Architecture Presentation

ARCHITECTURE DETAILS – WALK THROUGH OF FEATURES AND BENEFITS

Page 16: LCU13: HSA Architecture Presentation

HIGH LEVEL FEATURES OF HSA

•  Features currently being defined in the HSA Working Groups**
    •  Unified addressing across all processors
    •  Operation into pageable system memory
    •  Full memory coherency
    •  User mode dispatch
    •  Architected queuing language
    •  High level language support for GPU compute processors
    •  Preemption and context switching


** All features subject to change, pending completion and ratification of specifications in the HSA Working Groups

Page 17: LCU13: HSA Architecture Presentation

STATE OF GPU COMPUTING

Today’s Challenges

•  Separate address spaces
    •  Copies
    •  Can’t share pointers
•  New language required for compute kernel
    •  EX: OpenCL™ runtime API
    •  Compute kernel compiled separately from host code

Emerging Solution

•  HSA Hardware
    •  Single address space
    •  Coherent
    •  Virtual address space
    •  Fast access from all components
    •  Can share pointers
•  Bring GPU computing to existing, popular, programming models
    •  Single-source, fully supported by compiler
    •  HSAIL compiler IR (Cross-platform!)

•  GPUs are fast and power efficient: high compute density per-mm and per-watt
•  But: can be hard to program

PCIe

Page 18: LCU13: HSA Architecture Presentation

MOTIVATION (TODAY’S PICTURE)

[Figure: job dispatch timeline across three columns (Application, OS, GPU): Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]

Page 19: LCU13: HSA Architecture Presentation

SHARED VIRTUAL MEMORY (TODAY)

•  Multiple virtual memory address spaces

[Figure: CPU0 and GPU each have their own virtual address space (VIRTUAL MEMORY1, VIRTUAL MEMORY2) mapped onto the same PHYSICAL MEMORY: VA1->PA1, VA2->PA1.]

Page 20: LCU13: HSA Architecture Presentation

SHARED VIRTUAL MEMORY (HSA)

•  Common Virtual Memory for all HSA agents

[Figure: CPU0 and GPU share a single VIRTUAL MEMORY address space mapped onto PHYSICAL MEMORY: VA->PA for both.]

Page 21: LCU13: HSA Architecture Presentation

SHARED VIRTUAL MEMORY

•  Advantages
    •  No mapping tricks, no copying back-and-forth between different PA addresses
    •  Send pointers (not data) back and forth between HSA agents.
•  Implications
    •  Common Page Tables (and common interpretation of architectural semantics such as shareability, protection, etc).
    •  Common mechanisms for address translation (and servicing address translation faults)
    •  Concept of a process address space ID (PASID) to allow multiple, per process virtual address spaces within the system.
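The implications above can be made concrete with a toy model in which every agent translates through the same per-process page tables, selected by PASID. This is a sketch under stated assumptions: the class names, the 4 KiB page size, and the dictionary-based page table are illustrative and do not come from the HSA specification.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class SharedPageTables:
    """One page table per process (keyed by PASID), shared by every agent."""

    def __init__(self):
        self._tables = {}  # pasid -> {virtual page number -> physical frame number}

    def map_page(self, pasid, vaddr, frame):
        self._tables.setdefault(pasid, {})[vaddr // PAGE_SIZE] = frame

    def translate(self, pasid, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        try:
            frame = self._tables[pasid][vpn]
        except KeyError:
            # Stand-in for an address translation fault that system software would service.
            raise LookupError(f"translation fault: pasid={pasid} va={vaddr:#x}") from None
        return frame * PAGE_SIZE + offset


class Agent:
    """A CPU, GPU, or other agent; it owns no private page tables."""

    def __init__(self, name, page_tables):
        self.name = name
        self.page_tables = page_tables

    def resolve(self, pasid, pointer):
        return self.page_tables.translate(pasid, pointer)


tables = SharedPageTables()
tables.map_page(pasid=7, vaddr=0x10000000, frame=42)
cpu = Agent("CPU0", tables)
gpu = Agent("GPU", tables)
pointer = 0x10000010  # the same pointer value can be handed to either agent
```

Because both agents consult the same tables, `cpu.resolve(7, pointer)` and `gpu.resolve(7, pointer)` give the same physical address, which is what lets a pointer be passed between agents instead of copying the data.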


Page 22: LCU13: HSA Architecture Presentation

GETTING THERE …

[Figure: the Application / OS / GPU dispatch timeline from page 18, repeated: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]

Page 23: LCU13: HSA Architecture Presentation

SHARED VIRTUAL MEMORY

•  Specifics
    •  Minimum supported VA width is 48b for 64b systems, and 32b for 32b systems.
    •  HSA agents may reserve VA ranges for internal use via system software.
    •  All HSA agents other than the host unit must use the lowest privilege level
    •  If present, read/write access flags for page tables must be maintained by all agents.
    •  Read/write permissions apply to all HSA agents, equally.
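For a sense of scale, the minimum VA widths above translate into address-space sizes as follows. The helper name is an assumption for illustration; the arithmetic is just powers of two.

```python
TIB = 1 << 40  # bytes in one tebibyte
GIB = 1 << 30  # bytes in one gibibyte


def address_space_bytes(va_bits: int) -> int:
    """Number of distinct byte addresses reachable with va_bits of virtual address."""
    return 1 << va_bits


size_64b_system = address_space_bytes(48) // TIB  # 256 TiB minimum on 64b systems
size_32b_system = address_space_bytes(32) // GIB  # 4 GiB minimum on 32b systems
```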


Page 24: LCU13: HSA Architecture Presentation

CACHE COHERENCY

Page 25: LCU13: HSA Architecture Presentation

CACHE COHERENCY DOMAINS (1/2)

•  Data accesses to the global memory segment from all HSA Agents shall be coherent without the need for explicit cache maintenance.


Page 26: LCU13: HSA Architecture Presentation

CACHE COHERENCY DOMAINS (2/2)

•  Advantages
    •  Composability
    •  Reduced SW complexity when communicating between agents
    •  Lower barrier to entry when porting software
•  Implications
    •  Hardware coherency support between all HSA agents
    •  Can take many forms
        •  Stand alone Snoop Filters / Directories
        •  Combined L3/Filters
        •  Snoop-based systems (no filter)
        •  Etc …


Page 27: LCU13: HSA Architecture Presentation

GETTING CLOSER …

[Figure: the Application / OS / GPU dispatch timeline from page 18, repeated and annotated with "Cache coherency": Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]

Page 28: LCU13: HSA Architecture Presentation

SIGNALING

Page 29: LCU13: HSA Architecture Presentation

SIGNALING (1/2)

•  HSA agents support the ability to use signaling objects
    •  All creation/destruction of signaling objects occurs via HSA runtime APIs
        •  Object creation/destruction
    •  From an HSA Agent you can directly access signaling objects.
        •  Signal a signaling object (this will wake up HSA agents waiting upon the object)
        •  Query the current object
        •  Wait on the current object (various conditions supported).
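The three operations listed above (signal, query, wait on a condition) can be sketched with ordinary host threads. This is not the HSA runtime API: the `Signal` class, its method names, and the use of `threading.Condition` are assumptions chosen to show the idiom only.

```python
import threading


class Signal:
    """Hypothetical signaling object: a shared value agents can set, query and wait on."""

    def __init__(self, initial=0):
        self._value = initial
        self._cond = threading.Condition()

    def store(self, value):
        """Signal the object; every waiter wakes up and re-checks its condition."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def load(self):
        """Query the current value."""
        with self._cond:
            return self._value

    def wait(self, condition, timeout=None):
        """Block until condition(value) holds or the timeout expires; return the value seen."""
        with self._cond:
            self._cond.wait_for(lambda: condition(self._value), timeout)
            return self._value


def run_demo():
    """Offload a job to a worker thread and wait for its completion signal."""
    done = Signal(initial=1)
    result = []

    def worker():
        result.append(sum(range(10)))  # the offloaded work
        done.store(0)                  # completion: signal reaches zero

    thread = threading.Thread(target=worker)
    thread.start()
    seen = done.wait(lambda v: v == 0, timeout=5.0)
    thread.join()
    return seen, result
```

The waiter sleeps inside `wait` rather than spinning, which mirrors the low power waiting idiom; in this sketch the wake-up is provided by the operating system's threading primitives, not by hardware.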


Page 30: LCU13: HSA Architecture Presentation

SIGNALING (2/2)

•  Advantages
    •  Enables asynchronous interrupts between HSA agents, without involving the kernel
    •  Common idiom for work offload
    •  Low power waiting
•  Implications
    •  Runtime support required
    •  Commonly implemented on top of cache coherency flows


Page 31: LCU13: HSA Architecture Presentation

ALMOST THERE…

[Figure: the Application / OS / GPU dispatch timeline from page 18, repeated and annotated with "signaling": Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]

Page 32: LCU13: HSA Architecture Presentation

USER MODE QUEUEING

Page 33: LCU13: HSA Architecture Presentation

USER MODE QUEUEING (1/3)

•  User mode Queueing
    •  Enables user space applications to directly, without OS intervention, enqueue jobs (“Dispatch Packets”) for HSA agents.
    •  Dispatch packet is a job of work
•  Support for multiple queues per PASID
    •  Multiple threads/agents within a PASID may enqueue Packets in the same Queue.
    •  Dependency mechanisms created for ensuring ordering between packets.


Page 34: LCU13: HSA Architecture Presentation

USER MODE QUEUEING (2/3)

•  Advantages
    •  Avoid involving the kernel/driver when dispatching work for an Agent.
    •  Lower latency job dispatch enables finer granularity of offload
    •  Standard memory protection mechanisms may be used to protect communication with the consuming agent.
•  Implications
    •  Packet formats/fields are Architected – standard across vendors!
        •  Guaranteed backward compatibility
    •  Packets are enqueued/dequeued via an Architected protocol (all via memory accesses and signalling)
    •  More on this later……
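A queue driven purely by memory accesses and a doorbell can be sketched as a ring buffer. The following is a single-threaded simulation under stated assumptions: the packet fields, the `UserModeQueue` class, and the doorbell being a plain integer are illustrative and are not the architected packet format or protocol.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DispatchPacket:
    kernel: Callable[[int], int]  # stand-in for a kernel to launch
    arg: int                      # stand-in for the kernel argument


class UserModeQueue:
    """Ring buffer in ordinary memory; enqueueing needs no system call."""

    def __init__(self, size: int):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.slots: List[Optional[DispatchPacket]] = [None] * size
        self.mask = size - 1
        self.write_index = 0  # advanced by producers
        self.read_index = 0   # advanced by the consuming agent
        self.doorbell = -1    # index of the last packet announced to the consumer

    def enqueue(self, packet: DispatchPacket) -> bool:
        if self.write_index - self.read_index == len(self.slots):
            return False                         # queue is full
        index = self.write_index
        self.slots[index & self.mask] = packet   # write the packet into its slot
        self.write_index = index + 1
        self.doorbell = index                    # ring the doorbell: just a memory write
        return True

    def process(self) -> List[int]:
        """Consumer side: run every packet announced through the doorbell, in order."""
        results = []
        while self.read_index <= self.doorbell:
            slot = self.read_index & self.mask
            packet = self.slots[slot]
            results.append(packet.kernel(packet.arg))
            self.slots[slot] = None              # free the slot for reuse
            self.read_index += 1
        return results
```

A producer calls `enqueue` and the consuming agent calls `process`; a real implementation would need atomic index updates so that multiple threads or agents can share one queue, which this sketch leaves out.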


Page 35: LCU13: HSA Architecture Presentation

SUCCESS!

[Figure: the Application / OS / GPU dispatch timeline from page 18, repeated: Transfer buffer to GPU, Copy/Map Memory, Queue Job, Schedule Job, Start Job, Finish Job, Schedule Application, Get Buffer, Copy/Map Memory.]

Page 36: LCU13: HSA Architecture Presentation

SUCCESS!

[Figure: the Application / OS / GPU dispatch timeline reduced to three steps: Queue Job, Start Job, Finish Job.]

Page 37: LCU13: HSA Architecture Presentation

ACCELERATING SUFFIX ARRAY CONSTRUCTION CLOUD SERVER WORKLOAD

Page 38: LCU13: HSA Architecture Presentation

SUFFIX ARRAYS

•  Suffix Arrays are a fundamental data structure
    •  Designed for efficient searching of a large text
    •  Quickly locate every occurrence of a substring S in a text T
•  Suffix Arrays are used to accelerate in-memory cloud workloads
    •  Full text index search
    •  Lossless data compression
    •  Bio-informatics
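To show what "locate every occurrence of a substring S in a text T" means in practice, here is a minimal sketch. It uses a naive sort-based construction, not the skew algorithm or the GPU implementation discussed in this talk, and the function names are assumptions.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Naive construction: sort suffix start positions by the suffix they name."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return every start position of pattern in text using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "banana"
suffix_array = build_suffix_array(text)
```

All suffixes that begin with the pattern sit next to each other in the sorted order, so each query costs two binary searches over the array instead of a scan of the whole text.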


Page 39: LCU13: HSA Architecture Presentation

ACCELERATED SUFFIX ARRAY CONSTRUCTION ON HSA


M. Deo, “Parallel Suffix Array Construction and Least Common Prefix for the GPU”, submitted to “Principles and Practice of Parallel Programming (PPoPP’13)”, February 2013.

AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G, 6 compute units, 685 MHz; 4 GB RAM

By offloading data parallel computations to GPU, HSA increases performance and reduces energy for Suffix Array Construction versus Single Threaded CPU.

By efficiently sharing data between CPU and GPU, HSA lets us move compute to data without penalty of intermediate copies.

[Figure: Skew Algorithm for Compute SA, with stages labelled Radix Sort::GPU, Lexical Rank::CPU, Compute SA::CPU, Radix Sort::GPU and Merge Sort::GPU. Results: +5.8x INCREASED PERFORMANCE, -5x DECREASED ENERGY.]

Page 40: LCU13: HSA Architecture Presentation

THE HSA FUTURE

Architected heterogeneous processing on the SOC

Programming of accelerators becomes much easier

Accelerated software that runs across multiple hardware vendors

Scalability from smart phones to super computers on a common architecture

GPU acceleration of parallel processing is the initial target, with DSPs and other accelerators coming to the HSA system architecture model

Heterogeneous software ecosystem evolves at a much faster pace

Lower power, more capable devices in your hand, on the wall or in a super computer.


Page 41: LCU13: HSA Architecture Presentation


BACKUP

Page 42: LCU13: HSA Architecture Presentation

SEGMENTS
