GPU ArchitectureAn OpenCL Programmer’s Introduction
Lee Howes | November 3, 2010
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone2
The aim of this webinar
To provide a general background to modern GPU architectures
To place the AMD GPU designs in context:
– With other types of architecture
– With other GPU designs
To give an idea of why certain optimizations become necessary on such architectures and why the architectures are designed in that way
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone3
Agenda
Talk about GPUs as graphics processing devices
– What they are designed for
– What this means architecturally
The implications of SIMD execution on application development
LDS and latency hiding
How the GPU fits in the “CPU” design space.
A description of features of AMD Radeon HD5870 GPU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone4
What is a GPU?
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone5
In a nutshell
The GPU is a multicore processor optimized for graphics workloads
ShaderCore
ShaderCore
ShaderCore
ShaderCore
ShaderCore
ShaderCore
ShaderCore
ShaderCore
Tex
Tex
Tex
Tex
Rasterizer
Output blend
Video decode
Scheduler
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone6
Processing pixels
Pixel
Direction of light
Normal at surface
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone7
Processing pixels
Pixel
Direction of light
Normal at surface
Pixel
Pixel Pixel
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone8
Processing pixels
Pixel
Direction of light
Normal at surface
Pixel
Pixel Pixelsampler mySamp;Texture2D<float3> myTex;float3 lightDir;float4 diffuseShader(float3 norm, float2 uv){float3 kd;kd = myTex.Sample(mySamp, uv);kd *= clamp( dot(lightDir, norm), 0.0, 1.0);return float4(kd, 1.0);
}
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone9
Processing pixels
Pixel
Direction of light
Normal at surface
Pixel
Pixel Pixel
Pixel Pixel
Pixel Pixel
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone10
SIMD execution and its implications
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone11
SIMD pixel execution
Pixel Pixel
Pixel Pixel
sampler mySamp;Texture2D<float3> myTex;float3 lightDir;float4 diffuseShader(float3 norm, float2 uv){float3 kd;kd = myTex.Sample(mySamp, uv);kd *= clamp( dot(lightDir, norm), 0.0, 1.0);return float4(kd, 1.0);
}
sampler mySamp;Texture2D<float3> myTex;float3 lightDir;float4 diffuseShader(float3 norm, float2 uv){float3 kd;kd = myTex.Sample(mySamp, uv);kd *= clamp( dot(lightDir, norm), 0.0, 1.0);return float4(kd, 1.0);
}
sampler mySamp;Texture2D<float3> myTex;float3 lightDir;float4 diffuseShader(float3 norm, float2 uv){float3 kd;kd = myTex.Sample(mySamp, uv);kd *= clamp( dot(lightDir, norm), 0.0, 1.0);return float4(kd, 1.0);
}
sampler mySamp;Texture2D<float3> myTex;float3 lightDir;float4 diffuseShader(float3 norm, float2 uv){float3 kd;kd = myTex.Sample(mySamp, uv);kd *= clamp( dot(lightDir, norm), 0.0, 1.0);return float4(kd, 1.0);
}
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone12
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone13
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone14
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone15
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone16
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone17
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone18
Branches that diverge
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
sampler mySamp;Buffer<float> myTex;
float diffuseShader(float threshold, float index)
{float brightness = myTex[index];float output;if( brightness > threshold )
output = threshold;else
output = brightness;return output;
}
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone19
SIMD execution: SIMD instructions?
Programming SIMD with
SIMD instructions
Most vector instructions sets:
SSE, AVX
Intel’s Larrabee and Knight’s Corner
Programming SIMD with
“scalar” instructions
GPU shader languages
GPU intermediate languages
OpenCL
Programmed masking
Hardware controlled masking
OpenCL compiling to SSE
Pixel shaders compiling to Larrabee.
Current generation GPUs
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone20
Why does this matter for Compute?
Graphics code traditionally has relatively short shaderson large triangles
– The level of branch divergence overall will not be high
With graphics code you can not necessarily control it
– SIMD batches are constructed by the hardware depending on the scene properties.
For OpenCL code you are defining your execution space
– You choose what work is performed by which work item
– You choose how to structure your algorithm to avoid this divergence
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone21
Throughput execution and latency hiding
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone22
Covering pipeline latency
Stall
Instruction 0
Instruction 1
Lanes 0-3
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone23
Covering pipeline latency: logical vector
Lanes 0-3 Lanes 4-7
Stall
Stall
Lanes 8-11 Lanes 12-15
Stall
Stall
Instruction 0
Instruction 1
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone24
Covering pipeline latency: ALU operations
Lanes 0-3 Instruction 0
Lanes 4-7 Instruction 0
Lanes 8-11 Instruction 0
Lanes 12-15 Instruction 0
Lanes 0-3 Instruction 1
Lanes 4-7 Instruction 1
Lanes 8-11 Instruction 1
Lanes 12-15 Instruction 1
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone25
Covering memory latency
Instruction 0
Instruction 1
Stall
Lanes 0-3
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone26
Covering memory latency: we still stall
Lanes 0-3 Lanes 4-7
Stall
Stall
Lanes 8-11 Lanes 12-15
Stall
Stall
Instruction 0
Instruction 1
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone27
Covering memory latency: another wavefront
Instruction 1
Instruction 0
Lanes 0-3 Lanes 4-7 Lanes 8-11 Lanes 12-15
Instruction 0
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone28
Latency hiding in the SIMD engine
Pixel Pixel Pixel Pixel
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone29
Latency hiding in the SIMD engine
Pixel Pixel Pixel Pixel
ALU ALU ALU ALU
Pixel Pixel Pixel Pixel
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone30
Latency hiding in the SIMD engine
Pixel Pixel Pixel Pixel
ALU ALU ALU ALU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone31
A throughput-oriented SIMD engine
Pixel Pixel Pixel Pixel
ALU ALU ALU ALU
Pixel Pixel Pixel Pixel
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone32
A throughput-oriented SIMD engine
Pixel Pixel Pixel Pixel
ALU ALU ALU ALU
Pixel Pixel Pixel Pixel
State
State
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone33
Adding the memory hierarchy
Unlike most CPUs, GPUs do not have vast cache hierarchies.
– Caches on CPUs allow primarily for lower access latency
Heavy multithreading reduces the latency requirement
– Latency is not an issue, we cover that with other threads
– Total bandwidth still an issue, even with high-latency high-speed memory interfaces
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone34
Texture caches and local memory
Designed to support sharing between work items
– Reduce bandwidth, not latency
Global
High latency load. Limited bandwidth.
Local
SIMD engine
Efficient random accesses. Very high bandwidth
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone35
A throughput-oriented SIMD engine
Pixel Pixel Pixel Pixel
ALU ALU ALU ALU
Pixel Pixel Pixel Pixel
State
State
Fetch
Decode
Execute
Local storage/Cache
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone36
A throughput-oriented SIMD engine
Fetch
Decode
Execute
Local storage/Cache
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone37
The GPU shader cores
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone38
The design space
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone39
The AMD Phenom™ II X6
6 cores
One state set per core
4-wide SIMD (actually there are two pipes)
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone40
The Intel i7 6-core variants
6 cores
Two state sets per core (SMT/Hyperthreading)
4-wide SIMD
Phenom II X6
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone41
Sun UltraSPARC T2
8 cores
Eight state sets per core
No SIMD
Phenom II X6Intel i7 6-core
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone42
The AMD HD5870 GPU
20 cores
Up to 24 logical 64-SIMD wide state sets per core (number depends on register requirements)
16-wide physical
Phenom II X6Intel i7 6-core UltraSPARC T2
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
Core
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone43
The AMD Radeon HD5870 GPU
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone44
Features of the Radeon™ HD5870 architecture
Area 334 mm2
Transistors 2.15 billion
Memory Bandwidth 153 GB/sec
L2-L1 Rd Bandwidth 512 bytes/clk
L1 Bandwidth 1280 bytes/clk
Vector GPR 5.24 MB
LDS Memory 640kB
LDS Bandwidth 2560 bytes/clk
Concurrent Wavefronts 496
Shader (ALU units) 1600
Idle power 27 W
Max power 188 W
ATI Radeon™ HD 5870
2.72 Teraflops architecture
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone45
High level viewCommand Processor/Group Generator
Sequencer SequencerGDS
Rd
cach
e C
ro
ssb
ar
In
s c
ach
e
SIM
D E
ngin
es 0
-9
SIM
D E
ngin
es10-1
9
In
s c
ach
e
Write crossbar
Read L2 caches8kB/mem channel
Write combine cachesWrite combine caches
R/W cache for global atomics
R/W cache for global atomics
8 channel memory controller
GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone46
Clause execution
Command Processor/Group Generator
Sequencer SequencerGDS
Rd
cach
e C
ro
ssb
ar
In
s c
ach
e
SIM
D E
ngin
es 0
-9
SIM
D E
ngin
es10-1
9
In
s c
ach
e
Write crossbar
Read L2 caches
Write combine caches
Write combine caches
R/W cache for global atomics
R/W cache for global atomics
8 channel memory controller
GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5
08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
15 x: SETGT_INT R0.x, R2.x, R4.x
16 x: PREDNE_INT ____, R0.x, 0.0f…
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
17 x: SUB_INT R5.x, R3.x, R2.x
y: LSHL ____, R3.x, (…).x
z: LSHL ____, R2.x, (…).x VEC_120
t: MOV R8.x, 0.0f
18 x: SUB_INT R6.x, PV17.y, PV17.z
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
19 x: LSHL ____, R5.x, (…).x
w: ADD_INT ____, R6.x, R1.x VEC_120
t: ADD_INT R7.x, R1.x, (…).y
13 TEX: ADDR(368) CNT(2)
22 VFETCH R0.x___, R0.y, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
23 VFETCH R1.x___, R0.z, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone47
Clause execution
Command Processor/Group Generator
Sequencer SequencerGDS
Rd
cach
e C
ro
ssb
ar
In
s c
ach
e
SIM
D E
ngin
es 0
-9
SIM
D E
ngin
es10-1
9
In
s c
ach
e
Write crossbar
Read L2 caches
Write combine caches
Write combine caches
R/W cache for global atomics
R/W cache for global atomics
8 channel memory controller
GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5
08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
15 x: SETGT_INT R0.x, R2.x, R4.x
16 x: PREDNE_INT ____, R0.x, 0.0f…
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
17 x: SUB_INT R5.x, R3.x, R2.x
y: LSHL ____, R3.x, (…).x
z: LSHL ____, R2.x, (…).x VEC_120
t: MOV R8.x, 0.0f
18 x: SUB_INT R6.x, PV17.y, PV17.z
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
19 x: LSHL ____, R5.x, (…).x
w: ADD_INT ____, R6.x, R1.x VEC_120
t: ADD_INT R7.x, R1.x, (…).y
13 TEX: ADDR(368) CNT(2)
22 VFETCH R0.x___, R0.y, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
23 VFETCH R1.x___, R0.z, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone48
Clause execution
Command Processor/Group Generator
Sequencer SequencerGDS
Rd
cach
e C
ro
ssb
ar
In
s c
ach
e
SIM
D E
ngin
es 0
-9
SIM
D E
ngin
es10-1
9
In
s c
ach
e
Write crossbar
Read L2 caches
Write combine caches
Write combine caches
R/W cache for global atomics
R/W cache for global atomics
8 channel memory controller
GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5
08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
15 x: SETGT_INT R0.x, R2.x, R4.x
16 x: PREDNE_INT ____, R0.x, 0.0f…
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
17 x: SUB_INT R5.x, R3.x, R2.x
y: LSHL ____, R3.x, (…).x
z: LSHL ____, R2.x, (…).x VEC_120
t: MOV R8.x, 0.0f
18 x: SUB_INT R6.x, PV17.y, PV17.z
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
19 x: LSHL ____, R5.x, (…).x
w: ADD_INT ____, R6.x, R1.x VEC_120
t: ADD_INT R7.x, R1.x, (…).y
13 TEX: ADDR(368) CNT(2)
22 VFETCH R0.x___, R0.y, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
23 VFETCH R1.x___, R0.z, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone49
Clause execution
Command Processor/Group Generator
Sequencer SequencerGDS
Rd
cach
e C
ro
ssb
ar
In
s c
ach
e
SIM
D E
ngin
es 0
-9
SIM
D E
ngin
es10-1
9
In
s c
ach
e
Write crossbar
Read L2 caches
Write combine caches
Write combine caches
R/W cache for global atomics
R/W cache for global atomics
8 channel memory controller
GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5
08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
15 x: SETGT_INT R0.x, R2.x, R4.x
16 x: PREDNE_INT ____, R0.x, 0.0f…
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
17 x: SUB_INT R5.x, R3.x, R2.x
y: LSHL ____, R3.x, (…).x
z: LSHL ____, R2.x, (…).x VEC_120
t: MOV R8.x, 0.0f
18 x: SUB_INT R6.x, PV17.y, PV17.z
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
19 x: LSHL ____, R5.x, (…).x
w: ADD_INT ____, R6.x, R1.x VEC_120
t: ADD_INT R7.x, R1.x, (…).y
13 TEX: ADDR(368) CNT(2)
22 VFETCH R0.x___, R0.y, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
23 VFETCH R1.x___, R0.z, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone50
Clause execution
Command Processor/Group Generator
Sequencer SequencerGDS
Rd
cach
e C
ro
ssb
ar
In
s c
ach
e
SIM
D E
ngin
es 0
-9
SIM
D E
ngin
es10-1
9
In
s c
ach
e
Write crossbar
Read L2 caches
Write combine caches
Write combine caches
R/W cache for global atomics
R/W cache for global atomics
8 channel memory controller
GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5 GDDR5
08 ALU_PUSH_BEFORE: ADDR(62) CNT(2)
15 x: SETGT_INT R0.x, R2.x, R4.x
16 x: PREDNE_INT ____, R0.x, 0.0f…
09 JUMP POP_CNT(1) ADDR(18)
10 ALU: ADDR(64) CNT(9)
17 x: SUB_INT R5.x, R3.x, R2.x
y: LSHL ____, R3.x, (…).x
z: LSHL ____, R2.x, (…).x VEC_120
t: MOV R8.x, 0.0f
18 x: SUB_INT R6.x, PV17.y, PV17.z
y: MOV R8.y, 0.0f
z: MOV R8.z, 0.0f
w: MOV R8.w, 0.0f
11 LOOP_DX10 i0 FAIL_JUMP_ADDR(17)
12 ALU: ADDR(73) CNT(12)
19 x: LSHL ____, R5.x, (…).x
w: ADD_INT ____, R6.x, R1.x VEC_120
t: ADD_INT R7.x, R1.x, (…).y
13 TEX: ADDR(368) CNT(2)
22 VFETCH R0.x___, R0.y, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
23 VFETCH R1.x___, R0.z, fc156 MEGA(4)
FETCH_TYPE(NO_INDEX_OFFSET)
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone51
The SIMD Engine
16 processing elements: physical vector
Executes two 64-element wavefronts over 4 cycles: logical vector
– Each lane executes a 5-way VLIW instruction
Lane masking and branching to support divergence
Hardware barrier support for up to 8 work groups per SIMD engine
SEQ/Branch control
32kB Local Data Share: 32 banks with integer atomic units
8kB Read L1
Filter & Format
Filter & Format
Filter & Format
Filter & Format
Address & Load
Address & Load
Address & Load
Address & Load
Exp/Ld/Store
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone52
The SIMD Engine
16 processing elements: physical vector
Executes two 64-element wavefronts over 4 cycles: logical vector
– Each lane executes a 5-way VLIW instruction
Lane masking and branching to support divergence
Hardware barrier support for up to 8 work groups per SIMD engine
SEQ/Branch control
32kB Local Data Share: 32 banks with integer atomic units
8kB Read L1
Filter & Format
Filter & Format
Filter & Format
Filter & Format
Address & Load
Address & Load
Address & Load
Address & Load
Exp/Ld/Store
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone53
The SIMD Engine
16 processing elements: physical vector
Executes two 64-element wavefronts over 4 cycles: logical vector
– Each lane executes a 5-way VLIW instruction
Lane masking and branching to support divergence
Hardware barrier support for up to 8 work groups per SIMD engine
SEQ/Branch control
32kB Local Data Share: 32 banks with integer atomic units
8kB Read L1
Filter & Format
Filter & Format
Filter & Format
Filter & Format
Address & Load
Address & Load
Address & Load
Address & Load
Exp/Ld/Store
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone54
The Local Data Share
SIMD Engine
Conflict detection
andcontrol
scheduling
Source collectors and return data staging
Input address cross bar
Read data cross bar
Write data cross bar
B0
B1
B2
B3
B4
B5
B6
B7
B8
B9
B10
B11
B12
B13
B14
B15
B16
B17
B18
B19
B20
B21
B22
B23
B24
B25
B26
B27
B28
B29
B30
B31
Integer atomic units
SEQ PE0 PE15
Pre
-op re
turn
valu
e
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone55
LDS features
High bandwidth: twice external bandwidth (1024b/clock compared with 512b/clock per SIMD).
– Fully coalesced reads, writes and atomics with optimization for broadcast reads
Low latency access
– 0 latency direct reads
– 1 VLIW instruction latency for indirect ops (8 real cycles in the pipeline)
Bank conflicts hardware detected and serialized
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone56
The processing element (using OpenCL terms)
Operand Preparation
4 32b FP FMA or MullAdd1 64b FMA or Mul2 64b FP Add4 24b Int Mul or MulAdd4 32b Int Add, Logical or Special
General Purpose Registers
1 32b FP MulAdd1 32b Integer1 32b Special(log, exp, rcp, sin…)
Fetch Addr/Data
Export Addr/Data
LDS Requests
LDS op 0
LDS op 1
Constants
Input Data
Fetch Return
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone57
Dependent operations
Co-issue of dependent operations in a single VLIW packet
– Full IEEE intermediate rounding & normalization
– Dot4 (A= A*B + C*D + E*F + G*H)
– Dual Dot2 (A = A*B + C*D; F = F*H +
I*J)
– Dual dependent multiplies (A = B * C * D; F = G * H * I)
– Dual dependent adds (A = B * C * D; F = G * H * I)
– Dependent MulAdd (A = B * C + D + E * F)
24 bit integer
– Mul and Muladd (4 way c-issue)
– Heavy use for workgroup address calculation
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone58
The Global Data Share
Dual arrays of SIMD Engines
Conflict detection
andcontrol
scheduling
Fast append counter control
Left and right source collectors (4 Wis/clock each) and return data staging
Input address cross bar
Read data cross bar
Write data cross bar
B0
B1
B2
B3
B4
B5
B6
B7
B8
B9
B10
B11
B12
B13
B14
B15
B16
B17
B18
B19
B20
B21
B22
B23
B24
B25
B26
B27
B28
B29
B30
B31
Integer atomic units
SEQ Left bus Right bus
Pre
-op re
turn
valu
e
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone59
GDS features
Low latency access to a global shared memory
– 25 clocks latency
– Issued in separate clause in the same way as texture accesses
– Fully coalesced reads, writes and atomics as LDS.
8 work items per clock can request
Driver allocation and initialization
Useful for low latency global reductions
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone60
Summary
We’ve:
– looked at the basic principles of the GPU architecture in the processor design space
– seen some of the tradeoffs that lead to GPU features
– gone over the basic features of the HD 5870 architecture that affect compute applications
Later talks will go into detail on GPU optimizations on these architecture features
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone61
Questions and Answers
Visit the OpenCL Zone on developer.amd.com
http://developer.amd.com/zones/OpenCLZone/
The OpenCL Programming Webinars page includes:
Schedule of upcoming webinars
On-demand versions of this and past webinars
Slide decks of this and past webinars
– Upcoming webinars include
OpenCL Programming In Detail
Real World Application Example
Optimization Techniques
And Device Fission Extensions for OpenCL
| GPU Architecture | November 2010 | developer.amd.com -> OpenCL™ Zone62
Trademark Attribution
AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2009 Advanced Micro Devices, Inc. All rights reserved.
Top Related