Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand...
Transcript of Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand...
![Page 1: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/1.jpg)
Dark Secrets: Heterogeneous Memory Models
Lee Howes
2015-02-07, Qualcomm Technologies Inc.
![Page 2: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/2.jpg)
2
Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Te chnologies, Inc., a
wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s enginee ring, research and
development functions, and substantially all of its product and services businesses, including its semiconductor business, QC T, and QWI. References
to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as a pplicable.
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may
be trademarks or registered trademarks of their respective owners.
![Page 3: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/3.jpg)
3
Introduction – why memory consistency
IntroductionCurrent
restrictions
Basics of
OpenCL 2.0
Heterogeneity in
OpenCL 2.0
Heterogeneous
memory
ordering
Summary
![Page 4: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/4.jpg)
4
Many programming languages have coped with weak memory models
These are fixed in various ways:
− Platform-specific rules
− Conservative compilation behavior
A clear memory model allows developers to understand their program behavior!
− Or part of it anyway. It’s not the only requirement but it is a start.
Many current models are based on the Data-Race-Free work
− Special operations are used to maintain order
− Between special instructions reordering is valid for efficiency
Weak memory models
![Page 5: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/5.jpg)
5
Many languages have tried to formalize their memory models
− Java, C++, C…
− Why?
OpenCL and HSA are not exceptions
− Both have developed models based on the DRF work, via C11/C++11
Ben Gaster, Derek Hower and I have collaborated on HRF-Relaxed
− An extension of DRF models for heterogeneous systems
− To appear in ACM TACO
Unfortunately, even a stronger memory model doesn’t solve all your problems…
Better memory models
![Page 6: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/6.jpg)
6
Current restrictions
IntroductionCurrent
restrictions
Basics of
OpenCL 2.0
Heterogeneity in
OpenCL 2.0
Heterogeneous
memory
ordering
Summary
![Page 7: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/7.jpg)
7
Three specific issues:
− Coarse grained access to data
− Separating addressing between devices and the host process
− Weak and poorly defined ordering controls
Any of these can be worked around with vendor extensions or device knowledge
Weakness in the OpenCL 1.x memory model
![Page 8: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/8.jpg)
8
Host controls data
Coarse host->device synchronization
Data
Host
OpenCL
Device
Allocate
![Page 9: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/9.jpg)
9
A single running command owns an entire allocation
Coarse host->device synchronization
Data
Host
OpenCL
Device
Run command
![Page 10: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/10.jpg)
10
The host can access it using map/unmap operations
Coarse host->device synchronization
Data
Host
OpenCL
Device
Map
![Page 11: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/11.jpg)
11
Coarse host->device synchronization
Data
Host
OpenCL
Device
Unmap
![Page 12: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/12.jpg)
12
Coarse host->device synchronization
Data
Host
OpenCL
Device
Run command
![Page 13: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/13.jpg)
13
Separate addressing
We can update memory with a pointer
Host
OpenCL
Device
data[n].ptr = &data[n+1];
data[n+1].val = 3;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
![Page 14: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/14.jpg)
14
Separate addressing
We try to read it – but what is the value of a?
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
![Page 15: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/15.jpg)
15
Separate addressing
Unfortunately, the address of data may change
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
![Page 16: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/16.jpg)
16
Bounds on visibility
Controlling memory ordering is challenging
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
![Page 17: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/17.jpg)
17
Physically, this probably means within a single core
Bounds on visibility
Within a group we can synchronize
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Write
![Page 18: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/18.jpg)
18
Barrier operations synchronize active threads and constituent work-items
Bounds on visibility
Within a group we can synchronize
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Barrier
![Page 19: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/19.jpg)
19
Bounds on visibility
Within a group we can synchronize
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Read
![Page 20: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/20.jpg)
20
Bounds on visibility
Between groups we can’t
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Read?
![Page 21: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/21.jpg)
21
Bounds on visibility
What if we use fences?
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Write
![Page 22: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/22.jpg)
22
Ensure the write completes for a given work-item
Bounds on visibility
What if we use fences?
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Fence
![Page 23: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/23.jpg)
23
Fence at the other end
Bounds on visibility
What if we use fences?
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Fence
![Page 24: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/24.jpg)
24
Ensure a read is after the fence
Bounds on visibility
Within a group we can synchronize
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 0
Memory
Work
item 0
Work
item 1
Work
item 2
Work
item 3
WorkGroup 1
Read
![Page 25: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/25.jpg)
25
Probably seeing a write that was written after a fence guarantees that the fence completed
− The spec is not very clear on this
There is no coherence guarantee
− Need the write ever complete?
− If it doesn’t complete, who can see it, who can know that the fence happened?
Can the flag be updated without a race?
− For that matter, what is a race?
Spliet et al: KMA.
− Weak ordering differences between platforms due to poorly defined model.
Meaning of fences
When did the fence happen?
{{
data[n] = value;
fence(…);
flag = trigger;
||
if(flag) {
fence(…);
value = data[n];
}}
![Page 26: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/26.jpg)
26
Basics of OpenCL 2.0
IntroductionCurrent
restrictions
Basics of
OpenCL 2.0
Heterogeneity in
OpenCL 2.0
Heterogeneous
memory
ordering
Summary
![Page 27: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/27.jpg)
27
Sharing virtual addresses
We can update memory with a pointer
Host
OpenCL
Device
data[n].ptr = &data[n+1];
data[n+1].val = 3;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
![Page 28: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/28.jpg)
28
Sharing virtual addresses
We try to read it – but what is the value of a?
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
![Page 29: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/29.jpg)
29
Sharing virtual addresses
Now the address does not change!
Host
OpenCL
Device
int a = data[n].ptr->val;
assert(a == 3);
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
![Page 30: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/30.jpg)
30
Unmap on the host, event dependency on device
Sharing data – when does the value change?
Coarse grained
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
data[n].ptr = &data[n+1];
data[n+1].val = 3;
clUnmapMemObject
Event e
![Page 31: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/31.jpg)
31
Granularity of data race covers the whole buffer
Sharing data – when does the value change?
Coarse grained
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
data[n].ptr = &data[n+1];
data[n+1].val = 3;
clUnmapMemObject
Event e
![Page 32: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/32.jpg)
32
Event dependency on device – caches will flush as necessary
Sharing data – when does the value change?
Fine grained
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
data[n].ptr = &data[n+1];
data[n+1].val = 3;
Event e
![Page 33: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/33.jpg)
33
Data will be merged – data race at byte granularity
Sharing data – when does the value change?
Fine grained
Host
OpenCL
Device
int a = data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
data[n].ptr = &data[n+1];
data[n+1].val = 3;
Event e
![Page 34: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/34.jpg)
34
No dispatch-level ordering necessary
Sharing data – when does the value change?
Fine grained with atomic support
Host
OpenCL
Deviceint a = atomic-load data[n].ptr->val;
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
data[n].ptr = &data[n+1];
Atomic-store data[n+1].val = 3;
![Page 35: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/35.jpg)
35
Races at byte level – avoided using the atomics in the memory model
Sharing data – when does the value change?
Fine grained with atomic support
Host
OpenCL
Device
struct Foo {
Foo *ptr;
int val;
};
struct Foo {
Foo *ptr;
int val;
};
Data[n]
Data[n+1]
int a = atomic-load data[n].ptr->val;
data[n].ptr = &data[n+1];
Atomic-store data[n+1].val = 3;
![Page 36: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/36.jpg)
36
This is an ease of programming concern
− Complex apps with complex data structures can more effectively work with data
− Less work to package and repackage data
It can also improve performance
− Less overhead in updating pointers to convert to offsets
− Less overhead in repacking data
− Lower overhead of data copies when appropriate hardware support present
Sharing virtual addresses
![Page 37: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/37.jpg)
37
Most memory operations are entirely unordered
− The compiler can reorder
− The caches can reorder
Unordered relations to the same location are races
− Behaviour is undefined
Ordering operations (atomics) may update the same location
− Ordering operations order other operations relative to the ordering operation
Ordering operations default to sequentially consistent semantics
Sharing virtual addresses
SC-for Data-Race-Free by default – release consistency for flexibility
![Page 38: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/38.jpg)
38
Commonly tried on OpenCL 1.x
− Not at all portable!
− Due to: weak memory ordering, lack of forward progress
Is the situation any better for OpenCL 2.0?
− Yes, memory ordering is now under control!
− Is forward progress?
So what can we safely do with synchronization
Take a spin wait…
![Page 39: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/39.jpg)
39
Yes and no
− Conceptually similar to CPU waiting on an event – ie all work-items to complete
− Other app could occupy device, graphics could consume device, work may interfere with graphics
− Risk of whole device being occupied elsewhere so work on the assumption that this context owns the
device
So what can we safely do with synchronization
CPU thread waits on work-item – works?
CPU threadOpenCL Work-item
// Do work dependent on flag
Store-release value to flag
while(load-acquire flag != value) {}
// Do work dependent on flag
Happens-before
![Page 40: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/40.jpg)
40
Probably
− The spec doesn’t guarantee this
− How do you know what core your work-item is on?
− All you know is which work-group it is in.
So what can we safely do with synchronization
Work-item on one core waits for work-item on another core – works?
Core 1
OpenCL Work-item
Core 0
OpenCL Work-item
// Do work dependent on flag
Store-release value to flag
while(load-acquire flag != value) {}
// Do work dependent on flag
Happens-before
![Page 41: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/41.jpg)
41
Sometimes
− On some architectures this is fine
− On other architectures if both SIMD threads are on the same core: starvation
− Thread 0 may never run to satisfy thread 1’s spin wait
So what can we safely do with synchronization
Work-item on one thread waits for work-item on another thread – works?
{SIMD} thead 1
OpenCL Work-item
{SIMD} thread 0
OpenCL Work-item
// Do work dependent on flag
Store-release value to flag
while(load-acquire flag != value) {}
// Do work dependent on flag
Happens-before
![Page 42: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/42.jpg)
42
No (well, sometimes yes, but rarely)
− Fairly widely understood, but sometimes hard for new developers
− If you think about the mapping to SIMD it is fairly obvious
− A single program counter can’t be in two places at once – some architectures can track multiple
So what can we safely do with synchronization
Work-item waits for work-item in the same SIMD thread – works?
{SIMD} thead 0
OpenCL Work-item 1
{SIMD} thread 0
OpenCL Work-item 0
// Do work dependent on flag
Store-release value to flag
while(load-acquire flag != value) {}
// Do work dependent on flag
Happens-before
![Page 43: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/43.jpg)
43
Maybe, maybe not
− It depends entirely on where the work-items are mapped in the group
− Same thread – no
− Different threads – maybe
− The developer can often tell, but it isn’t portable and the compiler can easily break it
So what can we safely do with synchronization
Work-items in a work-group
Work-group 0
OpenCL Work-item
Work-group 0
OpenCL Work-item
// Do work dependent on flag
Store-release value to flag
while(load-acquire flag != value) {}
// Do work dependent on flag
Happens-before
![Page 44: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/44.jpg)
44
Maybe, maybe not
− It depends entirely on where the work-groups are placed on the device
− Two work-groups on the same core – you have the thread to thread case
− Two work-groups on different cores – it probably works
− No way to control the mapping!
So what can we safely do with synchronization
Work-groups
Work-group 1
OpenCL Work-item
Work-group 0
OpenCL Work-item
// Do work dependent on flag
Store-release value to flag
while(load-acquire flag != value) {}
// Do work dependent on flag
Happens-before
![Page 45: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/45.jpg)
45
Realistically
− Spin waits on other OpenCL work-items are just not portable
− Very limited use of the memory model
So what can you do?
− Communicating that work has passed a certain point
− Updating shared data buffers with flags
− Lock-free FIFO data structures to share data
− OpenCL 2.0’s sub-group extension provides limited but important forward progress guarantees
So what can we safely do with synchronization
Overall, a fairly poor situation
![Page 46: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/46.jpg)
46
Heterogeneity in OpenCL 2.0
IntroductionCurrent
restrictions
Basics of
OpenCL 2.0
Heterogeneity in
OpenCL 2.0
Heterogeneous
memory
ordering
Summary
![Page 47: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/47.jpg)
47
Even acquire-release consistency can be expensive
In particular, always synchronizing the whole system is expensive
− A discrete GPU does not want to always make data visible across the PCIe interface
− A single core shuffling data in local cache does not want to interfere with DRAM-consuming throughput
tasks
OpenCL 2 optimizes this using the concept of synchronization scopes
Lowering implementation cost
![Page 48: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/48.jpg)
48
WI WI
SG SG
WG WG
Not all communication is global – so bound it
Hierarchical memory synchronization
Synchronization is expensive
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
![Page 49: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/49.jpg)
49
WI WI
SG SG
WG WG
Sub-group scope
Hierarchical memory synchronization
Scopes!
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
![Page 50: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/50.jpg)
50
WI WI
SG SG
WG WG
Sub-group scope; Work-group scope
Hierarchical memory synchronization
Scopes!
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
![Page 51: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/51.jpg)
51
WI WI
SG SG
WG WG
Sub-group scope; Work-group scope; Device scope
Hierarchical memory synchronization
Scopes!
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
![Page 52: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/52.jpg)
52
WI WI
SG SG
WG WG
Sub-group scope; Work-group scope; Device scope; All-SVM-Devices scope
Hierarchical memory synchronization
Scopes!
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
![Page 53: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/53.jpg)
53
WI WI
SG SG
WG WG
Hierarchical memory synchronization
Release to the appropriate scope, acquire from the matching scope
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
store-release work-group-scope x load-acquire work-group-scope x
![Page 54: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/54.jpg)
54
WI WI
SG SG
WG WG
Hierarchical memory synchronization
Release to the appropriate scope, acquire from the matching scope
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
store-release device-scope x load-acquire device-scope x
![Page 55: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/55.jpg)
55
WI WI
SG SG
WG WG
Hierarchical memory synchronization
If scopes do not reach far enough, this is a race
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
store-release work-group-scope x load-acquire work-group-scope x
![Page 56: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/56.jpg)
56
Allows aggressive hardware optimization to coherence traffic
− GPU coherence in particular is often expensive – GPUs generate a lot of memory traffic
The memory model defines synchronization rules in terms of scopes
− Insufficient scope is a race
− Non-matching scopes race (in the OpenCL 2.0 model, this isn’t necessary)
Scoped synchronization
![Page 57: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/57.jpg)
57
Four address spaces in OpenCL 2.0
− Constant and private are not relevant for communication
− Global and local maintain separate orders in the memory model
Synchronization, acquire/release behavior etc apply only to local OR global, not both
− The global release->acquire order below does not order the updates to a!
Address space orderings
{SIMD} thead 1
OpenCL Work-item
{SIMD} thread 0
OpenCL Work-item
local-store a = 2
Global-store-release value to flag
while(global-load-acquire flag != value) {}
assert local-load a==2
Local-happens-before
![Page 58: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/58.jpg)
58
Take an example like this:
− void updateRelease(int *flag, int *data);
If I lock the flag, do I safely update data or not?
− The function takes pointers with no address space
− The ordering depends on the address space
− The address space depends on the type of the pointers at the call site, or even earlier!
There are ways to get the fence flags, but it is messy, care must be taken
Multiple orderings in the presence of generic pointers
![Page 59: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/59.jpg)
59
Heterogeneous memory ordering
IntroductionCurrent
restrictions
Basics of
OpenCL 2.0
Heterogeneity in
OpenCL 2.0
Heterogeneous
memory
ordering
Summary
![Page 60: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/60.jpg)
60
Sequential consistency aims to be a simple, easy to understand, model
− Behave as if the ordering operations are simply interleaved
In the OpenCL model, sequential consistency is guaranteed only in specific cases:
− All memory_order_seq_cst operations have the scope memory_scope_all_svm_devices and all affected
memory locations are contained in system allocations or fine grain SVM buffers with atomics support
− All memory_order_seq_cst operations have the scope memory_scope_device and all affected memory
locations are not located in system allocated regions or fine-grain SVM buffers with atomics support
Consider what this means…
− You start modifying your app to have more fine-grained sharing of some structures
− Suddenly your atomics are not sequentially consistent at all!
− What about SC operations to local memory?
Limits of sequential consistency
![Page 61: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/61.jpg)
61
These are data-race-free memory models
They only guarantee ordering in the absence of races
− So we only actually order things we can observe! Order can be relaxed between atomics.
Is such a limit necessary?
First, scopes…
WI WI
SG SG
WG WG
GPU Device
Core
WG WG
SG SG
WI WI
Core
WIWI
SGSG SGSG
WIWI
WG WG
![Page 62: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/62.jpg)
62
In one part of the execution, we have SC operations at device scope
− Let’s assume this is valid
Is such a limit necessary?
First, scopes…
WI WI
SG SG
WG WG
GPU Device
Core
WG WG
SG SG
WI WI
Core
WIWI
SGSG SGSG
WIWI
WG scope SC
Operations.
Ordered SC.
WG WG
![Page 63: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/63.jpg)
63
Elsewhere we have another set of SC operations
Is such a limit necessary?
First, scopes…
WI WI
SG SG
WG WG
GPU Device
Core
WG WG
SG SG
WI WI
Core
WIWI
SGSG SGSG
WIWI
WG scope SC
Operations.
Ordered SC.
WG scope SC
Operations.
Ordered SC.
WG WG
![Page 64: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/64.jpg)
64
Any access across from one work-group to another is a race
− It is equivalent to a non-atomic operation
− Therefore it is invalid
Is such a limit necessary?
First, scopes…
WI WI
SG SG
WG WG
GPU Device
Core
WG WG
SG SG
WI WI
Core
WIWI
SGSG SGSG
WIWI
WG scope SC
Operations.
Ordered SC.
WG WG
An access like this is a race
WG scope SC
Operations.
Ordered SC.
![Page 65: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/65.jpg)
65
In Hower et. al.’s work on Heterogeneous-Race-Free memory models this is made explicit
Sequential consistency can be maintained with scopes
Access to invalid scope
− Is unordered
− Is not observable
− So this is still valid sequential consistency: everything observable, ie that is race-free, is SC
Making sequential consistency a partial order
![Page 66: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/66.jpg)
66
We can apply SC semantics here for the same reason
− Actions to coarse-grained memory is not visible to other clients
− However – coarse buffers don’t fit cleanly in a hierarchical model
In Gaster et al. (to appear ACM TACO) we use the concept of observability as a memory
model extension
− “At any given point in time a given location will be available in a particular set of scope instances out to
some maximum instance and by some set of actors. Only memory operations that are inclusive with that
maximal scope instance will observe changes to those locations.”
Observability can migrate using API actions
− Map, unmap, and event dependencies
Extending to coarse-grained memory
![Page 67: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/67.jpg)
67
Observability
Initial state
WI WI
SG SG
WG WG
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
Memory
Allocation
Observability bounds – device scope on the GPU
![Page 68: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/68.jpg)
68
Observability
Map operation
WI WI
SG SG
WG WG
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
Memory
Allocation
Observability moves to device scope on the CPU
![Page 69: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/69.jpg)
69
Observability
Unmap
WI WI
SG SG
WG WG
OpenCL System
GPU Device
Core
WG WG
SG SG
WI WI
CPU Device DSPDevice
Core Core CoreCore Core
Memory
Allocation
Unmap will transfer dependence back to an OpenCL device
![Page 70: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/70.jpg)
70
We can cleanly put scopes and coarse memory in the same memory consistency model
− It is more complicated than, say, C++
− It is practical to formalize, and hopefully easy to explain
There is no need for quirky corner cases
We merely rethink slightly
− Instead of a total order S for all SC operations we have a total order Sa for each agent a of all SC operations
observable by that agent such that all these orders are consistent with each other.
− This is a relatively small step from the DRF idea, which is effectively that non-atomic operations are not
observable, and thus do not need to be strictly ordered.
The point
![Page 71: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/71.jpg)
71
Separate orders for local and global memory are likely to be painful for programmers
Do we need them?
− We have formalized the concept (also in the TACO paper) using multiple happens-before orders that subset
memory operations
− We also describe bridging-synchronization-order as a formal way to allow join points
− It is messy as a formalization. The harm to the developer is probably significant.
Joining address spaces
![Page 72: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/72.jpg)
72
However!
− Hardware really does have different instructions and different timing to access different memories
− Can all hardware efficiently synchronize these memory interfaces?
Joining address spaces
Processing core
Local
memory
L1 Cache
L2 Cache DRAMTexture
Cache
![Page 73: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/73.jpg)
73
However!
− Hardware really does have different instructions and different timing to access different memories
− Can all hardware efficiently synchronize these memory interfaces?
Joining address spaces
Processing core
Local
memory
L1 Cache
L2 Cache DRAMTexture
Cache
Entirely different interfaces from the processing core
![Page 74: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/74.jpg)
74
However!
− Hardware really does have different instructions and different timing to access different memories
− Can all hardware efficiently synchronize these memory interfaces?
− We will have to see how this plays out
− Consider this a warning to take care if you try to use this aspect of OpenCL 2.0
Joining address spaces
Processing core
Local
memory
L1 Cache
L2 Cache DRAMTexture
Cache
Entirely different interfaces from the processing core
![Page 75: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/75.jpg)
75
Summary
IntroductionCurrent
restrictions
Basics of
OpenCL 2.0
Heterogeneity in
OpenCL 2.0
Heterogeneous
memory
ordering
Summary
![Page 76: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/76.jpg)
76
Like mainstream languages, heterogeneous programming models are adopting firm memory
models
Without fundamental execution model guarantees the usefulness is limited
We are making progress on both these counts
Things are improving
![Page 77: Dark Secrets: Heterogeneous Memory Models · A clear memory model allows developers to understand their program behavior! − Or part of it anyway. It’s not the only requirement](https://reader034.fdocuments.in/reader034/viewer/2022050116/5f4d40de152e306760221df9/html5/thumbnails/77.jpg)
77
Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Te chnologies, Inc., a
wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s enginee ring, research and
development functions, and substantially all of its product and services businesses, including its semiconductor business, QC T, and QWI. References
to “Qualcomm” may mean Qualcomm Incorporated, or subsidiaries or business units within the Qualcomm corporate structure, as a pplicable.
For more information on Qualcomm, visit us at:
www.qualcomm.com & www.qualcomm.com/blog
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may
be trademarks or registered trademarks of their respective owners.
Thank youFollow us on: