"Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA...


Page 1: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

© Copyright 2012-2016 HSA Foundation. All Rights Reserved. 1

ENABLING EFFICIENT HETEROGENEOUS PROCESSING THROUGH COHERENCY: AN HSA FOUNDATION UPDATE

EMBEDDED VISION ALLIANCE MEMBER MEETING

DR. JOHN GLOSSNER, PRESIDENT, HSA FOUNDATION / CEO GPT-US

HARMONIZING THE INDUSTRY AROUND HETEROGENEOUS COMPUTING

Page 2: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

AGENDA

Heterogeneous Programming Problem

About HSA: Founding Member Companies, Open / Royalty Free Solutions

HSA Solution: Hardware, Software Infrastructure, HSAIL

Portable Applications Programming: C/C++, Python, OpenCL

Performance Results: AMD

Products and Announcements: AMD, GPT, Imagination, MediaTek

What’s Next

Page 3: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

THE PROBLEM HETEROGENEOUS APPLICATION DEVELOPMENT

Page 4: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

WHAT IS A HETEROGENEOUS SYSTEM?

A “CPU+” system: +GPU, +Vision Processors, +DSP, +FPGA, +Accelerators

Typically:

Different development tools

Different memory spaces

Communication via I/O only (data copies)

[Figure: CPU 1 … CPU N, GPU 1 … GPU M, DSP, and accelerators (ACC) sharing Unified Coherent Memory]

Page 5: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

WHAT’S THE PROBLEM?

Heterogeneous processors are widely available

Huge compute capability: acceleration units (GPU, DSP, FPGA), CPU cluster-based computers

Coherency: established in high-end, migrating to mainstream mobile and consumer

BUT…

Heterogeneous programming models not standardized

Multi-core/device applications difficult to optimize or scale

Non-portable application developer ecosystems

HSAF brings compute app abstraction to heterogeneous platforms

Page 6: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA TECHNOLOGY

Developing a new platform for heterogeneous systems

Reducing heterogeneous system complexity:

Provides software ecosystem

Abstracts away complexities of heterogeneous systems

Cache coherent shared virtual memory hardware

Removes time-consuming operating system calls

Runs at user level

Exploiting compute capabilities:

Single source programming

Control and compute code reside in the same file or project

Page 7: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

ABOUT HSA HETEROGENEOUS SYSTEM ARCHITECTURE FOUNDATION

Page 8: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA FOUNDATION

A non-profit foundation, founded in June 2012

Programming heterogeneous systems (“CPU +” era)

Industry standards body

V1.1 specifications released May 2016

Backward compatibility with V1.0 hardware

First compatible hardware: AMD

Measured performance improvements

Page 9: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA – AN OPEN PLATFORM

Open Architecture, membership open to all

HSA Programmers Reference Manual

HSA Platform System Architecture

HSA Runtime

HSA Multivendor Specification

Royalty Free IP, Specifications, and APIs

Open Source Tools, Compilers, etc.

Runtime implementations

Tests

Page 10: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

MEMBERS DRIVING HSA

Founders

Promoters

Supporters

Contributors

Academic

Page 11: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSAF HARDWARE CONTRIBUTIONS

Page 12: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

JIM MCGREGOR, TIRIAS RESEARCH

…HSAF has had a profound impact on hardware architectures … even Intel’s

Cache Coherency

Unified memory (Shared Virtual Memory)

Page 13: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

THE PLATFORM PILLARS OF HSA

Unified memory (SVM)

User mode dispatch

Platform atomics

Architected signals

Formal relaxed memory model

Cache coherency

Quality of service

Some non-HSA platforms support a few of these platform features

In combination they form a well-rounded base for application programmability

Page 14: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSAF SOFTWARE INFRASTRUCTURE

Page 15: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

THE VISION

Make Heterogeneous Programming Much Easier

1. Single source programming: single tool chain

2. Any programming language: C++, Python, JavaScript, …

3. Eliminate data copies: performance!

4. Common address space: a pointer is a pointer

5. Standardized command submission to Agents (GPU / DSP): a common dispatch language

6. Eliminate software layers between application and hardware: efficient

7. ISA agnostic for CPU, GPU, DSP, and more: x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, …

8. Open source software stack: open access!

High performance

Low power

Extensible to other accelerators on the SoC

Page 16: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

MOTIVATION (TODAY’S PICTURE)

[Figure: timeline across Application, OS, and Agent (GPU/DSP) with the steps Transfer buffer to GPU → Copy/Map Memory → Queue Job → Schedule Job → Start Job → Finish Job → Schedule Application → Get Buffer → Copy/Map Memory]

Page 17: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

WITH SHARED VIRTUAL MEMORY

[Figure: timeline across Application, OS, and Agent (GPU/DSP) with the steps Transfer buffer to GPU → Copy/Map Memory → Queue Job → Schedule Job → Start Job → Finish Job → Schedule Application → Get Buffer → Copy/Map Memory]

Page 18: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

WITH COHERENT CACHE MEMORY

[Figure: timeline across Application, OS, and Agent (GPU/DSP) with the steps Transfer buffer to GPU → Copy/Map Memory → Queue Job → Schedule Job → Start Job → Finish Job → Schedule Application → Get Buffer → Copy/Map Memory]

Page 19: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

SIGNALS

HSA agents support signaling creation/destruction using runtime APIs

Any agent can access signals

Wake up agents waiting upon the object

Query/wait for current object

Allows conditions

Hardware-assisted signaling and synchronization primitives

Memory semantics synchronizes work items processed by HSA agents

Synchronizes execution between threads on HSA agents and host CPU

One-to-one and one-to-many signaling

System software, runtime & application SW use infrastructure to build higher-level synchronization primitives like mutexes, semaphores, …

Advantages

Asynchronous events between agents

Doesn’t require CPU

Common idiom for work offload

Low power waiting
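The work-offload idiom described on this slide can be sketched in plain Python. This is only an illustration under stated assumptions: the `Signal` class, its method names, and the `offload` helper are hypothetical and are not the HSA runtime API; a thread stands in for an agent, and a condition variable stands in for hardware-assisted, low-power waiting.

```python
import threading


class Signal:
    """Hypothetical stand-in for an HSA signal: a shared integer that any
    agent can update and wait on. Not the HSA runtime API."""

    def __init__(self, initial):
        self._value = initial
        self._cond = threading.Condition()

    def load(self):
        with self._cond:
            return self._value

    def subtract(self, amount=1):
        # Update the value and wake every waiter (one-to-many signaling).
        with self._cond:
            self._value -= amount
            self._cond.notify_all()

    def wait_eq(self, expected, timeout=None):
        # Block without busy-waiting until the value matches `expected`.
        # Returns True if the condition held, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._value == expected, timeout)


def offload(work, completion):
    """Run `work` on a helper thread that plays the role of an agent."""
    results = []

    def agent():
        results.append(work())
        completion.subtract(1)  # tell any waiter that the job has finished

    threading.Thread(target=agent).start()
    return results


completion = Signal(1)                      # 1 = one job outstanding
results = offload(lambda: sum(range(10)), completion)
finished = completion.wait_eq(0, timeout=5.0)
```

The host only touches the signal twice (create, wait); the agent side decrements it when done, which is the pattern higher-level primitives such as semaphores can be built on.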

Page 20: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

WITH SIGNALING

[Figure: timeline across Application, OS, and Agent (GPU/DSP) with the steps Transfer buffer to GPU → Copy/Map Memory → Queue Job → Schedule Job → Start Job → Finish Job → Schedule Application → Get Buffer → Copy/Map Memory]

Page 21: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA QUEUING MODEL

User mode queuing

Low latency dispatch

Application dispatches directly

No OS or driver required

Architected Queuing Language (AQL)

Single compute dispatch path for all hardware

No driver translation, direct to hardware

Standard across vendors!

Guaranteed backward compatibility

Allows for dispatch to queue from any agent: CPU or GPU or DSP or FPGA, etc.

Agent self-enqueue enables recursion, tree traversal, wavefront reforming

Requires coherency and shared virtual memory
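A toy model may help make user-mode queuing and agent self-enqueue concrete. The sketch below is an assumption-laden illustration, not the HSA queue implementation: `UserModeQueue`, its fields, and the `visit` helper are hypothetical names, packets are plain Python callables, and a single-threaded loop plays the packet processor. It shows a producer writing a slot and publishing it through a doorbell value with no OS call, and a running packet enqueuing follow-up work on the same queue (tree traversal).

```python
class UserModeQueue:
    """Toy ring buffer modelling a user-mode queue. Names are illustrative."""

    def __init__(self, size):
        if size <= 0 or size & (size - 1):
            raise ValueError("size must be a power of two")
        self.slots = [None] * size
        self.mask = size - 1
        self.write_index = 0   # advanced by producers (any agent)
        self.read_index = 0    # advanced by the packet processor
        self.doorbell = -1     # index of the last published packet

    def enqueue(self, packet):
        if self.write_index - self.read_index > self.mask:
            return None        # queue full; caller must retry later
        index = self.write_index
        self.slots[index & self.mask] = packet
        self.write_index += 1
        self.doorbell = index  # "ring the doorbell": publish the packet
        return index

    def process(self):
        """Consumer side: run every packet published via the doorbell."""
        done = []
        while self.read_index <= self.doorbell:
            packet = self.slots[self.read_index & self.mask]
            self.read_index += 1
            done.append(packet(self))  # a packet may enqueue more work
        return done


def visit(node):
    """Build a packet that visits `node` and self-enqueues its children."""
    value, children = node

    def packet(queue):
        for child in children:
            queue.enqueue(visit(child))
        return value

    return packet


tree = (1, [(2, [(4, [])]), (3, [])])
queue = UserModeQueue(8)
queue.enqueue(visit(tree))
order = queue.process()   # breadth-first visit order
```

Because the queue lives in memory that producer and consumer both see, the design depends on the coherency and shared virtual memory the slide lists as requirements.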

Page 22: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

WITH USER MODE QUEUING

[Figure: timeline across Application CPU, OS, and Agent (GPU/DSP) with the steps Transfer buffer to GPU → Copy/Map Memory → Queue Job → Schedule Job → Start Job → Finish Job → Schedule Application → Get Buffer → Copy/Map Memory]

Page 23: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

FINAL PICTURE: SVM + CACHE COHERENCY + SIGNALS + USER MODE QUEUES

[Figure: timeline across Application, OS, and Agent (GPU/DSP) with only the steps Queue Job → Start Job → Finish Job]

Page 24: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA COMMAND AND DISPATCH FLOW

[Figure: Applications A, B, and C; an optional dispatch buffer; agent hardware with Hardware Queues A, B, and C holding packets labeled for the matching application]

HW view:

HW / microcode controlled

HW scheduling

Architected Queuing Language (AQL)

HW-managed protection

SW view:

User-mode dispatches to HW

No OS Driver overhead

Low dispatch times

Host & Kernel Agent dispatch APIs
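To make the idea of a fixed, vendor-neutral dispatch packet concrete, here is a small sketch that packs one 64-byte kernel dispatch packet with the Python `struct` module. The field order and widths are an assumption modelled on the kernel dispatch packet in the HSA runtime specification and should be checked against the spec before use; `pack_dispatch` and its parameter names are hypothetical, and the addresses in the usage lines are made-up placeholders.

```python
import struct

# Assumed little-endian layout (64 bytes total):
#   6 x uint16: header, setup, workgroup size x/y/z, reserved
#   5 x uint32: grid size x/y/z, private segment size, group segment size
#   4 x uint64: kernel object, kernarg address, reserved, completion signal
AQL_DISPATCH = struct.Struct("<HHHHHHIIIIIQQQQ")


def pack_dispatch(grid, workgroup, kernel_object, kernarg_address,
                  completion_signal, dimensions=1, header=0,
                  private_size=0, group_size=0):
    """Pack one dispatch packet into bytes (illustrative helper)."""
    gx, gy, gz = grid
    wx, wy, wz = workgroup
    return AQL_DISPATCH.pack(
        header, dimensions,        # header bits, setup (dimension count)
        wx, wy, wz, 0,             # workgroup size x/y/z, reserved
        gx, gy, gz,                # grid size x/y/z
        private_size, group_size,  # per-work-item and per-group memory
        kernel_object,             # handle of the finalized kernel
        kernarg_address,           # pointer to the kernel arguments
        0,                         # reserved
        completion_signal,         # signal handle updated on completion
    )


packet = pack_dispatch(grid=(1024, 1, 1), workgroup=(64, 1, 1),
                       kernel_object=0x1000, kernarg_address=0x2000,
                       completion_signal=0x3000)
fields = AQL_DISPATCH.unpack(packet)
```

Because pointers are ordinary 64-bit values in a shared address space, the packet can carry the argument buffer and completion signal by reference rather than by copy.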

Page 25: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA INTERMEDIATE LANGUAGE (HSAIL) BYTECODE FOR HETEROGENEOUS SYSTEMS

Page 26: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

THE PORTABILITY CHALLENGE

CPU ISAs – backwards compatible

ISA innovations added incrementally (e.g., NEON, AVX, etc.)

ISA retains backwards compatibility with previous generation

HSA instruction-set architectures: ARM, GPT, MIPS, and x86

Kernel Agent ISAs – no backwards compatibility

GPU, DSP, DNN, Image Signal Processor, custom accelerators, etc.

Massive diversity of architectures in the market

Each vendor has its own ISA, and often several in market at the same time

Compatibility via APIs (OpenGL, DirectX, OpenCV)

Page 27: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA INTERMEDIATE LAYER — HSAIL

Virtual ISA for parallel programs

Finalized to native ISA by a compiler

Dynamic or offline

ISA independent by design

Explicitly parallel

Designed for data parallel programming

Multiple HLL support

Exceptions, virtual functions, etc.

Java, C++, OpenMP, Python, etc.

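#define N 1024  /* illustrative size; not specified on the slide */

int main(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        /* loop body elided on the slide */
    }
    return 0;
}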

[Figure: High-Level Compiler → BRIG → Finalizer → Component ISA, alongside Host ISA]

Page 28: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSAIL FEATURES

A virtual, explicitly parallel ISA: ~135 opcodes

RISC: register-based load/store

Arithmetic: IEEE 754 floating point including 16-bit; integer (32/64-bit); DSP fixed point

Packed / SIMD: f16x2, f16x4, f16x8, f32x2, f32x4, f64x2; signed/unsigned 8x4, 8x8, 8x16, 16x2, 16x4, 16x8, 32x2, 32x4, 64x2

Branches & function calls

Atomic operations

Wavefronts: 1, 2, 4, 8, 16, 32, or 64 SIMD lanes; lanes can be active or inactive

Memory: shared virtual memory

Exceptions

ld_global_u64 $d0, [$d6 + 120]; // $d0 = load($d6 + 120)

add_u64 $d1, $d0, 24; // $d1 = $d0 + 24
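The note that wavefront lanes can be active or inactive can be illustrated with a small emulation of the `add_u64` line above. This is a sketch under assumptions, not how any finalizer or hardware works: `wavefront_add` is a hypothetical helper, each register is modelled as a Python list with one value per lane, and the width of 8 and the per-lane values are chosen only for the example.

```python
def wavefront_add(dest, src, immediate, active):
    """Apply dest[lane] = src[lane] + immediate on active lanes only.

    Inactive lanes keep their previous destination value.
    """
    if not (len(dest) == len(src) == len(active)):
        raise ValueError("all per-lane lists must have the same width")
    return [s + immediate if on else d
            for d, s, on in zip(dest, src, active)]


WIDTH = 8                                          # one of the allowed wavefront sizes
d0 = [lane * 100 for lane in range(WIDTH)]         # per-lane source register
d1 = [0] * WIDTH                                   # per-lane destination register
active = [lane % 2 == 0 for lane in range(WIDTH)]  # even lanes active

d1 = wavefront_add(d1, d0, 24, active)
```

Every lane executes the same instruction; the mask is what lets divergent control flow leave some lanes untouched.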

Page 29: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

PORTABLE APPLICATIONS PROGRAMMING FROM OPENCL TO C++17

Page 30: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA OPEN SOURCE SOFTWARE

Full open source Linux stack: tools, compilers and OS support

Allows a single shared implementation for many components

Enables university research and industry collaboration in all areas

Because it’s the right thing to do

Many open source applications & frameworks

Native languages: HCC (C++17), LLVM, GCC, CLOC/SNACK, Python, Java, …

Tools, APIs, frameworks: CodeXL, POCL, Docker, OpenMP, OKRA, HIP, …

Research: Multi2sim, HSAEmu, gem5, ViennaCL, …

And many applications using OCL 2.x or HSA stack

GitHub & Bitbucket repositories have much, much more…

gccbrig: any processor with a gcc machine description can finalize HSAIL

Page 31: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

ARCHITECTED PROFILING AND DEBUGGING

Profiling

• Common timeline across HSA accelerators & system

• Common HSA hardware events (+ HW specific)

• Common HSA profiling counter definitions (+ HW specific counters)

• Consistent profiling methodology for all HSA accelerators

Debugging

• Breakpoints

• Exception handling

• Single-step

• Tracing

• HSAIL Disassembly

• Emulation support

• Libraries

• Plugins

Page 32: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

Python on GPUs

Numba: NumPy-aware Python compiler

Open source, available on GitHub

Sponsored by Continuum Analytics

Direct HSA support

Automatic parallelization

2x-200x speedup

Page 33: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

PERFORMANCE RESULTS

Page 34: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

Python Geographic Locality

What is the distance from a set of points to a target point?

How many points are within a specified range?

Numba can auto-parallelize user universal functions for HSA

Ufuncs broadcast an operation over elements of a NumPy array

ZERO HSA developer knowledge required

1M points: >8X speedup

https://github.com/ContinuumIO/Numba-HSA-Webinar
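The two questions on this slide reduce to a scalar distance kernel applied to every point, which is the shape of function a ufunc broadcasts. The sketch below uses only the Python standard library as a stand-in; it does not use Numba or NumPy, the function names and sample points are made up for illustration, and it is not the code from the linked webinar repository.

```python
import math


def distance(x, y, tx, ty):
    """Scalar kernel: Euclidean distance from (x, y) to the target (tx, ty)."""
    return math.hypot(x - tx, y - ty)


def count_within(points, target, radius):
    """Apply the scalar kernel to every point and count those in range."""
    tx, ty = target
    return sum(1 for x, y in points if distance(x, y, tx, ty) <= radius)


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
nearby = count_within(points, target=(0.0, 0.0), radius=5.0)
```

Each element is processed independently of the others, which is what makes this kind of workload a candidate for automatic parallelization across an accelerator's lanes.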

Page 35: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

GEN1: FIR & AES

FIR is a memory-intensive streaming workload

AES is a compute-intensive streaming workload

CL12 – cl_mem buffer: copy to/from the device

CL20 – SVM buffer – coarse grain sync: copy to/from SVM; data copy cannot be avoided, since the space for SVM is limited

HSA – unified memory space – fine grained sync: regular pointer, no explicit copy

Results: HSA compute abstraction, NO performance penalty

Benchmark: NUCAR HeteroMark

Note: Not all algorithms run faster
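For readers unfamiliar with the FIR workload named above, a direct-form FIR filter can be sketched as follows. This is a plain-Python illustration, not the benchmark's implementation; the `fir` function, the two-tap averaging filter, and the sample values are assumptions chosen for the example.

```python
def fir(samples, taps):
    """Direct-form FIR: out[n] = sum over k of taps[k] * samples[n - k].

    Samples before the start of the stream are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out


# Two-tap moving average over a short stream.
filtered = fir([1.0, 2.0, 3.0, 4.0], [0.5, 0.5])
```

Each output needs only one multiply-add per tap but reads a window of the input stream, so moving data tends to dominate; that is why the cost of copying buffers to and from a device matters for this kind of workload.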

Page 36: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HSA PRODUCT UPDATES - AMD FROM HSA FOUNDATION MEMBER COMPANIES

Page 37: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

Heterogeneous System Architecture Is at the Core of ROCm

Rich foundation for HPC and ultrascale computing; supports our APUs and discrete GPUs

HSA drives rich capabilities into the ROCm systems architecture

‒ User Mode Queues

‒ Architected Queuing Language

‒ Flat Memory Addressing

‒ Atomic Memory Transactions

‒ Process Concurrency & Preemption

HSA Runtime enables a programming-language-neutral systems interface

Supports standardized loader and linker interface

ROCm: Radeon Open Compute Platform

Page 38: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

ROCm Enabled Hardware 2016

S9150, W9100

Radeon R9 Nano, S9300x2, Radeon RX480 (Oct, ROCm 1.3)

S9170

AMD Embedded R-Series SOC

AMD FX 98xx, A12-97xx

Page 39: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HXGPT ANNOUNCEMENT: UNITY PROCESSOR ARCHITECTURE

Working silicon for Unity Architecture

Focus on out-of-order pipeline: superscalar, control flow

See our demo: image processing filter

[Figure: image processing filter demo, before and after]

Page 40: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

HXGPT ANNOUNCEMENT

HSAIL-Based Deep Learning Neural Network Open Source Initiative

Open source machine learning HSAIL library

Deep neural network library

Delivered in HSAIL

Any HSA-compatible platform can execute

Optimized for hxGPT using gccbrig

Development now underway

Page 41: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

IMAGINATION HSA COMPLIANT IP (COMING SOON)

We will be rolling out:

• HSA across all MIPS I-class and P-class CPUs

• HSA across all PowerVR GPUs

• HSA compliant fabric solutions

[Figure: Imagination Smart Vision IP Platform block diagram built around a coherent HSA-compliant SoC fabric. Blocks shown: HSA-compliant MIPS CPU with L2 cache; HSA-compliant PowerVR GPU (PowerVR GX7200 Series6XT, 2 cluster); PowerVR Video Encode, Video Decode, JPEG Encode, and Camera ISP; Display Pipeline; Ensigma RPU; DMAC; ROM; RAM; OTP; eFuse; DDR3/4; TE & Crypto; AFE; HDMI Tx & Rx; USB3; MIPI; NAND; bridge and peripheral bus with peripherals (GPIO, UART, I2C, I2S, SPI, SD); clock & reset control; JTAG & test; PSU & power control; customer IP & interfaces]

Page 42: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

Copyright © MediaTek Inc. All rights reserved.

Evolution of Heterogeneity at MediaTek

HMP – 2013: Heterogeneous Multi-Processing

HC – 2015: Heterogeneous Computing

Tri-cluster – 2016: Hybrid Tri-cluster Multi-Processing

HSA Features: Heterogeneous System Architecture

[Figure: four SoC block diagrams built from LITTLE/BIG CPUs, Min/Mid/Max CPU clusters, GPUs, accelerators, coherent memory, and an MMU]

Page 43: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

CONCLUSIONS

Page 44: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

THE RESULT

Make Heterogeneous Programming Much Easier

1. Single source programming: single tool chain

2. Any programming language: C++, Python, JavaScript, …

3. Eliminate data copies: performance!

4. Common address space: a pointer is a pointer

5. Standardized command submission to Agents (GPU / DSP): a common dispatch language

6. Eliminate software layers between application and hardware: efficient

7. ISA agnostic for CPU, GPU, DSP, and more: x86, ARM, MIPS, PowerVR, Mali, Adreno, GPT, …

8. Open source software stack: open access!

High performance

Low power

Extensible to other accelerators on the SoC

Page 45: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

SUMMARY

2012 goal of changing chip H/W architecture achieved

Cache coherent shared virtual memory

2014-2015: S/W architecture to support H/W

March 2015 V1.0 specs

Programmed in any language (C++, Python, OpenCL)

2015-2016: May 2016 V1.1 specs

Multivendor support

Wider range of processors

2016: H/W platforms arriving

AMD’s Carrizo (Dell, Asus, Lenovo)

Licensable IP available

Page 46: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

V1.2 SPECIFICATIONS IN PROGRESS

Improved Data Interop

Fixed function accelerators (e.g. FPGA)

Local device memory

Coarse grain memory

Architected Debug

BRIG, new linking formats

Architecture

Fully formalized memory model

HSAIL

Parallel loops

Flexible API and access semantics

Programming Models

Page 47: "Enabling Efficient Heterogeneous Processing Through Coherency," a Presentation from the HSA Foundation

JOIN US! WWW.HSAFOUNDATION.COM

THANK YOU