HSA Foundation - Politecnico di Milano (home.deib.polimi.it/.../1.HSA_hsafoundation_v1.pdf)
Advanced Topics on Heterogeneous System Architectures
Politecnico di Milano
Seminar Room (Bld 20)
15 December, 2017
Antonio R. Miele
Marco D. Santambrogio
Politecnico di Milano
HSA Foundation
References
• This presentation is based on the material and slides published on the HSA Foundation website:
– http://www.hsafoundation.com/
Heterogeneous processors have proliferated – make them better
• Heterogeneous SoCs have arrived and are a tremendous advance over previous platforms
• SoCs combine CPU cores, GPU cores and other accelerators, with high-bandwidth access to memory
• How do we make them even better?
– Easier to program
– Easier to optimize
– Easier to load balance
– Higher performance
– Lower power
• HSA unites accelerators architecturally
• Early focus on the GPU compute accelerator, but HSA will go well beyond the GPU
HSA foundation
• Founded in June 2012
• Developing a new platform for heterogeneous systems
• www.hsafoundation.com
• Specifications under development in working groups to define the platform
• Membership consists of 43 companies and 16 universities
• Adding 1-2 new members each month
HSA consortium
[Figure: HSA consortium]
HSA goals
• To enable power-efficient performance
• To improve programmability of heterogeneous processors
• To increase the portability of code across processors and platforms
• To increase the pervasiveness of heterogeneous solutions throughout the industry
Paradigm shift
• Inflection in processor design and programming
Key features of HSA
• hUMA – Heterogeneous Unified Memory Architecture
• hQ – Heterogeneous Queuing
• HSAIL – HSA Intermediate Language
Legacy GPU compute
• Multiple memory pools
• Multiple address spaces
– No pointer-based data structures
• Explicit data copying across PCIe
– High latency
– Low bandwidth
• High overhead dispatch
• Need lots of compute on GPU to amortize copy overhead
• Very limited GPU memory capacity
• Dual source development
• Proprietary environments
• Expert programmers only
Existing APUs and SoCs
• Physical integration of GPUs and CPUs
• Data copies on an internal bus
• Two memory pools remain
• Still queue through the OS
• Still requires expert programmers
• FPGAs and DSPs have the same issues
APU = Accelerated Processing Unit (i.e. a SoC that also contains a GPU)
Existing APUs and SoCs
• CPU and GPU still have separate memories for the programmer (different virtual memory spaces)
1. CPU explicitly copies data to GPU memory
2. GPU executes computation
3. CPU explicitly copies results back to its own memory
An HSA enabled SoC
• Unified Coherent Memory enables data sharing across all processors
– Enabling the usage of pointers
– No explicit data transfer -> values move on demand
– Pageable virtual addresses for GPUs -> no GPU capacity constraints
• Processors architected to operate cooperatively
• Designed to enable the application to run on different processors at different times
Unified coherent memory
• CPU and GPU have a unified virtual memory space
1. CPU simply passes a pointer to GPU
2. GPU executes computation
3. CPU can read the results directly – no explicit copy needed!
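The difference between the copy-based flow of existing SoCs and the pointer-passing flow above can be sketched in plain Python. This is only a toy model under stated assumptions: no real GPU is involved, the names `gpu_kernel`, `legacy_offload` and `hsa_offload` are hypothetical, and a Python list stands in for memory, with a second list playing the role of a separate GPU memory pool.

```python
def gpu_kernel(buf):
    # Stand-in for a GPU computation: double every element in place.
    for i in range(len(buf)):
        buf[i] *= 2


def legacy_offload(host_data):
    """Separate address spaces: copy in, compute, copy out."""
    elements_copied = 0
    device_data = list(host_data)       # 1. explicit copy to "GPU memory"
    elements_copied += len(device_data)
    gpu_kernel(device_data)             # 2. GPU computes on its own copy
    host_data[:] = device_data          # 3. explicit copy of the results back
    elements_copied += len(device_data)
    return elements_copied


def hsa_offload(host_data):
    """Unified address space: the GPU works through the same reference."""
    gpu_kernel(host_data)               # 1-2. pass the reference, compute in place
    return 0                            # 3. results are already visible to the CPU


a = [1, 2, 3, 4]
b = [1, 2, 3, 4]
copied_legacy = legacy_offload(a)
copied_hsa = hsa_offload(b)
print(a, copied_legacy)  # [2, 4, 6, 8] 8
print(b, copied_hsa)     # [2, 4, 6, 8] 0
```

Both flows produce the same result; the only difference in this model is how many elements had to be moved between the two pools.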
[Figure: transmission of input data]
[Figure: transmission of results]
• OpenCL 2.0 leverages the HSA memory organization to implement a virtual shared memory (VSM) model
• VSM can be used to share pointers in the same context among devices and the host
hQ: heterogeneous queuing
• Task queuing runtimes
– Popular pattern for task and data parallel programming on Symmetric Multiprocessor (SMP) systems
– Characterized by:
  • A work queue per core
  • Runtime library that divides large loops into tasks and distributes them to queues
  • A work-stealing scheduler that keeps the system balanced
• HSA is designed to extend this pattern to run on heterogeneous systems
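A minimal sketch of the task-queuing pattern described above, assuming a sequential round-robin simulation in place of real threads or cores. The names `split_loop` and `run_with_stealing` are illustrative and do not belong to any HSA or runtime API. All tasks are initially placed on worker 0 so that the effect of stealing is visible.

```python
from collections import deque


def split_loop(n_items, chunk):
    """Divide a loop over range(n_items) into (start, stop) tasks."""
    return [(s, min(s + chunk, n_items)) for s in range(0, n_items, chunk)]


def run_with_stealing(tasks, n_workers, work):
    # One work queue per worker ("core").
    queues = [deque() for _ in range(n_workers)]
    queues[0].extend(tasks)
    executed = [0] * n_workers
    results = []
    while any(queues):
        for w in range(n_workers):              # one scheduling round
            if queues[w]:
                task = queues[w].pop()          # owner takes from its own tail
            else:
                # Idle worker: pick the fullest queue as the victim.
                victim = max(range(n_workers), key=lambda v: len(queues[v]))
                if not queues[victim]:
                    continue                    # nothing left to steal
                task = queues[victim].popleft() # thief takes from the victim's head
            results.append(work(*task))
            executed[w] += 1
    return results, executed


data = list(range(100))
tasks = split_loop(len(data), chunk=10)
partials, executed = run_with_stealing(
    tasks, n_workers=4, work=lambda start, stop: sum(data[start:stop]))
total = sum(partials)
print(total, executed)
```

Taking from opposite ends of the deque (owner at the tail, thief at the head) is a common design choice in work-stealing runtimes because it reduces contention between the owner and the thieves.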
hQ: heterogeneous queuing
• How compute dispatch operates today in the driver model
hQ: heterogeneous queuing
• How compute dispatch improves under HSA
– Application codes to the hardware
– User mode queuing
– Hardware scheduling
– Low dispatch times!
– No Soft Queues
– No User Mode Drivers
– No Kernel Mode Transitions
– No Overhead!
hQ: heterogeneous queuing
• AQL (Architected Queuing Layer) enables any agent to enqueue tasks
– Single compute dispatch path for all hardware
– No driver translation, direct access to hardware
– Standard across vendors
• All agents can enqueue
– Self-enqueuing is also allowed
• Requires coherency and shared virtual memory
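The user-mode dispatch path can be pictured as a ring buffer that any agent writes to directly. The sketch below is a simplified, single-threaded model and not the AQL packet format: the class `UserModeQueue`, its fields, and the `countdown` kernel are assumptions made for illustration, and a real implementation would need atomic index updates and a hardware doorbell.

```python
class UserModeQueue:
    """Simplified model of a user-mode dispatch queue (not the real AQL layout)."""

    def __init__(self, size):
        assert size > 0 and size & (size - 1) == 0, "size must be a power of two"
        self.size = size
        self.slots = [None] * size
        self.write_index = 0   # next slot a producer will reserve
        self.read_index = 0    # next slot the consumer will process
        self.doorbell = -1     # index of the last packet made visible

    def enqueue(self, kernel, *args):
        """Called by any agent: plain memory writes, no OS or driver call."""
        if self.write_index - self.read_index >= self.size:
            raise BufferError("queue full")
        index = self.write_index
        self.write_index += 1
        self.slots[index % self.size] = (kernel, args)
        self.doorbell = index  # "ring the doorbell" to notify the consumer
        return index

    def process(self):
        """Stand-in for the hardware packet processor draining the ring."""
        completed = 0
        while self.read_index <= self.doorbell:
            kernel, args = self.slots[self.read_index % self.size]
            self.read_index += 1
            kernel(self, *args)  # the kernel receives the queue so it can self-enqueue
            completed += 1
        return completed


log = []


def countdown(queue, n):
    log.append(n)
    if n > 0:
        queue.enqueue(countdown, n - 1)  # self-enqueue from the running task


q = UserModeQueue(size=4)
q.enqueue(countdown, 3)                  # enqueue from the host side
completed = q.process()
print(log, completed)  # [3, 2, 1, 0] 4
```

The same `enqueue` method serves the host and the running task, which mirrors the single dispatch path for all agents listed above.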
hQ: heterogeneous queuing
• A work-stealing scheduler keeps the system balanced
Advantages of the queuing model
• Today’s picture: [figure]
• The unified shared memory allows pointers to be shared among different processing elements, thus avoiding explicit memory transfer requests
• Coherent caches remove the need to perform explicit synchronization operations
• The supported signaling mechanism enables asynchronous events between agents without involving the OS kernel
• Tasks are directly enqueued by the applications without using OS mechanisms
• HSA picture: [figure]
Device side queuing
• Let’s consider a tree traversal problem:
– Every node in the tree is a job to be executed
– We may not know a priori the size of the tree
– Input parameters of a job may depend on parent execution
• Each node is a job
• Each job may generate some child jobs
Device side queuing
• State-of-the-art solution:
– The job has to communicate the new jobs to the host (possibly transmitting input data)
– The host queues the child jobs on the device
• Considerable memory traffic!
Device side queuing
• Device side queuing:
– The job running on the device directly queues new jobs in the device/host queues
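The two approaches to the tree traversal problem can be compared with a small simulation. This is a sketch under stated assumptions: the job tree `TREE`, the function names, and the notion of counting one "host round trip" per job that spawns children are all hypothetical, and queues are ordinary Python deques standing in for device queues.

```python
from collections import deque

# Hypothetical job tree: each node (job) lists the child jobs it generates.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"],
        "a1": [], "a2": [], "b1": []}


def run_job(node):
    """Stand-in for a device kernel; returns the child jobs it discovers."""
    return TREE[node]


def host_mediated(root):
    """The device reports children to the host, which enqueues them."""
    device_queue = deque([root])
    executed, host_round_trips = [], 0
    while device_queue:
        node = device_queue.popleft()
        executed.append(node)
        children = run_job(node)
        if children:
            host_round_trips += 1          # device -> host -> device
            device_queue.extend(children)  # enqueued by the host
    return executed, host_round_trips


def device_side(root):
    """The running job enqueues its children directly on the device queue."""
    device_queue = deque([root])
    executed = []
    while device_queue:
        node = device_queue.popleft()
        executed.append(node)
        device_queue.extend(run_job(node))  # enqueued by the job itself
    return executed, 0                      # the host is never involved


order_host, trips_host = host_mediated("root")
order_dev, trips_dev = device_side("root")
print(order_host, trips_host)
print(order_dev, trips_dev)
```

Both versions visit the same jobs in the same order; in this model only the number of host interactions changes.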
Device side queuing
• Benefits of device side queuing:
– Enables a more natural expression of nested parallelism, necessary for applications with irregular or data-driven loop structures (e.g. breadth-first search)
– Removes synchronization and communication with the host to launch new threads (removes expensive data transfers)
– Finer granularities of parallelism are exposed to the scheduler and load balancer
Device side queuing
• OpenCL 2.0 supports device side queuing
– Device-side command queues are out-of-order
– Parent and child kernels execute asynchronously
– Synchronization has to be explicitly managed by the programmer
Summary on the queuing model
• User mode queuing for low latency dispatch
– Application dispatches directly
– No OS or driver required in the dispatch path
• Architected Queuing Layer
– Single compute dispatch path for all hardware
– No driver translation, direct to hardware
• Allows for dispatch to queue from any agent
– CPU or GPU
• GPU self-enqueue enables lots of solutions
– Recursion
– Tree traversal
– Wavefront reforming
Other necessary HW mechanisms
• Task preemption and context switching have to be supported by all computing resources (including GPUs)
HSA intermediate layer (HSAIL)
• A portable “virtual ISA” for vendor-independent compilation and distribution
– Like Java bytecodes for GPUs
• Low-level IR, close to machine ISA level
– Most optimizations (including register allocation) performed before HSAIL
• Generated by a high-level compiler (LLVM, gcc, Java VM, etc.)
– Application binaries may ship with embedded HSAIL
• Compiled down to target ISA by a vendor-specific “finalizer”
– Finalizer may execute at run time, install time, or build time
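The split between a portable IR and a vendor-specific finalizer can be illustrated with a toy analogy. Nothing here is HSAIL syntax or a real finalizer API: `KERNEL_IR`, the two opcodes, and the functions `finalize_scalar` and `finalize_fused` are hypothetical stand-ins for "one shipped IR, several target-specific code generators".

```python
# A tiny "portable IR": a kernel is a list of (opcode, constant) instructions
# applied to each work-item's input value. This is NOT HSAIL syntax.
KERNEL_IR = [("mul", 3), ("add", 1)]


def apply_op(op, k, x):
    if op == "mul":
        return x * k
    if op == "add":
        return x + k
    raise ValueError(f"unknown opcode: {op}")


def finalize_scalar(ir):
    """Hypothetical vendor A finalizer: runs the IR one work-item at a time."""
    def run(values):
        out = []
        for x in values:
            for op, k in ir:
                x = apply_op(op, k, x)
            out.append(x)
        return out
    return run


def finalize_fused(ir):
    """Hypothetical vendor B finalizer: folds the IR into a single a*x + b step."""
    a, b = 1, 0
    for op, k in ir:
        if op == "mul":
            a, b = a * k, b * k
        elif op == "add":
            b += k
        else:
            raise ValueError(f"unknown opcode: {op}")
    return lambda values: [a * x + b for x in values]


code_a = finalize_scalar(KERNEL_IR)   # could run at build, install, or run time
code_b = finalize_fused(KERNEL_IR)
result_a = code_a([0, 1, 2])
result_b = code_b([0, 1, 2])
print(result_a, result_b)  # [1, 4, 7] [1, 4, 7]
```

The application ships only `KERNEL_IR`; each target is free to generate different code as long as the results agree.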
HSA intermediate layer (HSAIL)
• HSA compilation stack
• HSA runtime stack
HSA intermediate layer (HSAIL)
• Explicitly parallel
– Designed for data parallel programming
• Support for exceptions, virtual functions, and other high level language features
• Syscall methods
– GPU code can directly call system services, IO, printf, etc.
HSA intermediate layer (HSAIL)
• Lower level than OpenCL SPIR
– Fits naturally in the OpenCL compilation stack
• Suitable to support additional high level languages and programming models:
– Java, C++, OpenMP, Python, etc.
HSA software stack
• HSA supports many languages
[Figure label: HSAIL]
HSA and OpenCL
• HSA is an optimized platform architecture for OpenCL
– Not an alternative to OpenCL
• OpenCL on HSA will benefit from
– Avoidance of wasteful copies
– Low latency dispatch
– Improved memory model
– Pointers shared between CPU and GPU
– Device side queuing
• OpenCL 2.0 leverages HSA features
– Shared Virtual Memory
– Platform Atomics
HSA and Java
• Targeted at Java 9 (2015 release)
• Allows developers to efficiently represent data parallel algorithms in Java
• Sumatra “repurposes” Java 8’s multi-core Stream/Lambda APIs to enable either CPU or GPU computing
• At runtime, a Sumatra-enabled Java Virtual Machine (JVM) will dispatch selected constructs to available HSA-enabled devices
HSA and Java
• Evolution of Java acceleration before the Sumatra project
HSA runtime
• A thin, user-mode API that provides the interface necessary for the host to launch compute kernels to the available HSA components
• The overall goal is to provide a high-performance dispatch mechanism that is portable across multiple HSA vendor architectures
• The dispatch mechanism differentiates the HSA runtime from other language runtimes by architected argument setting and kernel launching at the hardware and specification level
HSA runtime
• The HSA core runtime API is standard across all HSA vendors, so that languages which use the HSA runtime can run on different vendors’ platforms that support the API
• The implementation of the HSA runtime may include kernel-level components (required for some hardware components, e.g. AMD Kaveri) or may be entirely user-space (for example, simulators or CPU implementations)
HSA taking platform to programmers
• Balance between CPU and GPU for performance and power efficiency
• Make GPUs accessible to a wider audience of programmers
– Programming models close to today’s CPU programming models
– Enabling more advanced language features on GPU
– Shared virtual memory enables complex pointer-containing data structures (lists, trees, etc.) and hence more applications on GPU
– Kernel can enqueue work to any other device in the system (e.g. GPU->GPU, GPU->CPU)
  • Enabling task-graph style algorithms, ray tracing, etc.
HSA taking platform to programmers
• Complete tool-chain for programming, debugging and profiling
• HSA provides a compatible architecture across a wide range of programming models and HW implementations
HSA programming model
• Single source
– Host and device code side-by-side in same source file
– Written in same programming language
• Single unified coherent address space
– Freely share pointers between host and device
– Similar memory model as multi-core CPU
• Parallel regions identified with existing language syntax
– Typically same syntax used for multi-core CPU
• HSAIL is the compiler IR that supports these programming models
Specifications and software
[Figure: specifications and software]
HSA architecture V1
• GPU compute C++ support
• User Mode Scheduling
• Fully coherent memory between CPU & GPU
• GPU uses pageable system memory via CPU pointers
• GPU graphics pre-emption
• GPU compute context switch
Partners roadmaps
[Figures: partner roadmap slides, one labeled 2015]