SEQUENTIAL CONSISTENCY FOR HETEROGENEOUS-RACE-FREE
Bradford M. Beckmann
11/13/2013
2 | SC FOR HRF | NOVEMBER 19, 2013 | PUBLIC
EXECUTIVE SUMMARY
- Existing APU memory models are ambiguous and opaque to programmers
- From the CPU world: sequential consistency for data-race-free (SC for DRF)
  - Relaxed HW, precise semantics, programmer-friendly...
  - ...but it took 30 years for CPUs -- our goal is to improve on that for GPGPUs
- Primary difference between CPUs and GPGPUs: scoped synchronization
  - Our work adds scoped (partial) synchronization to SC for DRF
- Two specific models: HRF-direct and HRF-indirect
- Case study: GPGPU task-sharing runtime
  - HRF-indirect provides up to 10% performance improvement
                       HRF-direct    HRF-indirect
Model complexity       Simple        Advanced
HW flexibility         Significant   Limited
SW optimizations       Limited       Significant
Leverage current HW?   No            Yes
OUTLINE
BACKGROUND
SCOPES AND GPU HARDWARE
CASE STUDY: TASK-SHARING RUNTIME
BACKGROUND
- Programmers use a memory model to understand memory behavior
  - Sequential consistency (SC) [1979]: threads interleave like a multi-tasking uni-processor
  - HW/compiler actually implements TSO [1991] or a more relaxed model
  - Java(TM) [2005] and C++ [2008] ensure SC for data-race-free (DRF) programs
- Programmers need a GPU memory model for abstraction and portability
  - Current GPU models expose ad hoc HW mechanisms
  - SC for DRF is a start, but...
  - ...many GPU operations are not global; they have limited scopes (e.g., workgroup)

[Figure: OpenCL(TM) execution model]
SEQUENTIAL CONSISTENCY FOR DATA-RACE-FREE
- Two memory accesses participate in a data race if they
  - access the same location,
  - at least one access is a store, and
  - they can occur simultaneously
    - i.e., appear as adjacent operations in an interleaving
- A program is data-race-free if no possible execution results in a data race
- Sequential consistency for data-race-free programs
  - DRF programs get SC behavior; avoid everything else (undefined)
GPUs: Not good enough!
How do different types of scoped-synchronization operations interact?
SCOPES
- Scope: a subset of threads
- Scoped synchronization: synchronization w.r.t. a scope
  - CUDA: __threadfence_{block,system}, __syncthreads, etc.
  - PTX: membar.{cta,gl,sys}, bar, etc.
  - HSA: partial acquire/release
  - OpenCL 2.0: atomic_work_item_fence with {work_group, device, all_svm_devices}
- Scopes introduce a new class of races:
  - What happens when threads use different scopes?
  wf1 (WG 1)        wf2 (WG 1)         wf3 (WG 2)
  ST X = 1
  Release_S12       Acquire_S12
                    LD X (1)
                    Release_SGlobal    Acquire_SGlobal
                                       LD X (??)

[Figure: Workgroups 1 and 2, each with its own L1 cache over a shared L2;
scopes: S12 = {wf1, wf2}, S34 = {wf3, wf4}, SGlobal = all wavefronts]
RUNNING EXAMPLE: CURRENT GPU HARDWARE
- Current GPU: write-combining cache hierarchy
  - WG release -> flush stores from coalescer
  - WG acquire -> stall until coalescer is empty
  - Global release -> flush all dirty locations in L1 cache
  - Global acquire -> invalidate all valid locations in L1 cache
[Figure: the example on the write-combining hierarchy. wf1's ST X = 1
drains from the coalescer on Release_S12; wf2's Acquire_S12 / LD X hits
X = 1 in the shared L1; wf2's Release_SGlobal flushes every dirty L1 line,
including X = 1, down to L2; wf3's Acquire_SGlobal invalidates its L1, so
its LD X fetches X = 1 from L2 -- t3 sees (1)]
RUNNING EXAMPLE: OPTIMIZED GPU
- Optimized GPU: per-wavefront L1 cache management
  - WG release -> flush stores from coalescer
  - WG acquire -> stall until coalescer is empty
  - Global release -> flush locations written by the releasing wavefront in L1
  - Global acquire -> invalidate locations read by the acquiring wavefront in L1
[Figure: the same example with per-wavefront L1 management. wf2's LD X
still hits X = 1 in the shared L1, but its Release_SGlobal flushes only
locations wf2 itself wrote -- X stays dirty in L1 while L2 still holds
X = 0 -- so wf3's Acquire_SGlobal / LD X returns 0: t3 sees (0)]
DEFINING PERMITTED BEHAVIOR
- Which scenario should be allowed?
  - Can programmers assume transitivity?
    - Permitting the "Current GPU Hardware" scenario
  - Or must producers and consumers use the same scope?
    - Permitting the "Optimized GPU" scenario
- Our notation:
  - HRF-direct
    - Requires communicating threads to synchronize using the same scope
    - Communication using different scopes is explicitly undefined
  - HRF-indirect
    - Extends HRF-direct to support transitive communication using different scopes
    - Allows indirect communication via a third party
- Both models require direct synchronization using the same, matching scope
  - In other words, an acq/rel pair using scopes that are subset/superset of each other is undefined
  - We call this form of synchronization scope inclusion
  - While possible on current GPUs, it is extremely difficult to reason about
HRF MODELS: IMPLICATIONS FOR PROGRAMMERS
- What value will t3 see?
  - HRF-direct: the final LD X forms a race (inexact scopes between wf1 and wf3)
    - Undefined behavior -> don't try!
  - HRF-indirect: no race (scope transitivity)
    - SC behavior -> t3 sees (1)
- Consequences:
  - HRF-direct:
    - Must use global scope without future sync. knowledge -> slower on existing HW
  - HRF-indirect:
    - Can use local scope without future sync. knowledge -> faster on existing HW
    - Will NOT work with a potentially optimized future GPU
  wf1               wf2                wf3
  ST X = 1
  Release_S12       Acquire_S12
                    LD X (1)
                    Release_SGlobal    Acquire_SGlobal
                                       LD X (??)
CASE STUDY – TASK-SHARING RUNTIME: OVERVIEW

Hierarchical task queue:
- Wavefronts produce/consume tasks independently
- Use the local queue until:
  - Local queue empty -> pull from global
  - Local queue full -> push to global

Challenge:
- Unpredictable synchronization

HRF-direct:
- Either all sync must be global, or wavefronts must coordinate to push/pull

HRF-indirect:
- A single wavefront can push/pull independently
[Figure: queue hierarchy mirroring the OpenCL execution model -- one
NDRange (kernel) queue at the top, a workgroup queue per workgroup, and a
wavefront queue per wavefront, with work-items beneath]
CASE STUDY – TASK-SHARING RUNTIME: DETAILS
- The runtime hides the OpenCL(TM) execution model
  - The application defines only independent tasks
  - The runtime uses persistent threads
- The same function is assigned to a wavefront
  - Tasks are grouped together using taskfronts
- Synchronization occurs when tasks are enqueued/dequeued
  - The enqueuer/producer does not know the eventual consumer
  - HRF-direct: must always use device/kernel-scope synchronization
  - HRF-indirect: kernel-scope synchronization only for global donations/consumption
- Evaluation: unbalanced tree search (UTS) synthetic workload
  - Traversal of an unbalanced tree whose topology is determined dynamically
  - 4 different input sets
[Figure: a queue holds taskfronts; each taskfront groups tasks that run the
same function on one wavefront]
CASE STUDY – TASK-SHARING RUNTIME: RESULTS

[Chart: performance normalized to HRF-direct for input sets uts_t1, uts_t2,
uts_t4, and uts_t5 (y-axis 0.95 to 1.15); HRF-indirect outperforms
HRF-direct on every input set, by up to 10%]
SUMMARY
- Our general approach (SC for HRF):
  - Define a heterogeneous race:
    - Two conflicting accesses not separated by synchronization, or
    - Synchronization does not use "enough" scope
  - An execution is SC if it has no heterogeneous races
    - Undefined otherwise
- Proposing two memory models:
  - HRF-direct
    - Conflicts separated by synchronization of identical scope
    + Easier to define/understand
    + Permits more future HW optimizations
    - Prohibits some SW optimizations on current hardware
  - HRF-indirect
    - Relaxes the identical-scope requirement of HRF-direct:
      - Scope transitivity: A syncs with B and B syncs with C -> A syncs with C
    + More accurately describes current hardware capabilities
    + Has some SW benefits (e.g., more composable)
    - May limit future HW optimizations
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. OpenCL is a registered trademark of Apple Inc. Other names are for informational purposes only and may be trademarks of their respective owners.
Backup
SIMULATION CONFIGURATION
- APU simulator
  - Based on the gem5 open-source simulator
  - Extended with a GPU execution model that directly executes HSAIL
- Configuration parameters:
  Parameter                     Value
  # Compute Units               8
  # SIMD Units / Compute Unit   4
  L1 Cache                      32 KB, 16-way assoc.
  L2 Cache                      2 MB, 16-way assoc.
HRF DESIGN SPACE
[Figure: design space of HRF memory models and other variants]
PROGRAMMABLE PIPELINE EXAMPLE

[Figure: a three-stage programmable pipeline (Stage 1 -> Stage 2 -> Stage 3)
over an L1/L2/DRAM hierarchy; Scope 1-2 spans Stages 1-2, Scope 2-3 spans
Stages 2-3, and Scope Global spans all stages]