D2MA: Accelerating Coarse-Grained Data Transfer for GPUs
Transcript of D2MA: Accelerating Coarse-Grained Data Transfer for GPUs
D2MA: Accelerating Coarse-Grained Data Transfer for GPUs
D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke
University of Michigan
PACT-23, August 27th, 2014
Slide 2: Achieving Peak GPU Performance: Theory and Practice

[Chart: peak GFLOPS of NVIDIA GPUs (8800, 9800, GTX 280, GTX 480, GTX 580, GTX 680, GTX 780) from 2007 to 2013, versus achieved matrix multiplication performance of the SDK and CUBLAS implementations. Peak climbs toward 5000 GFLOPS while achieved performance lags far behind.]

Not easy to fully utilize GPU capabilities!
Slide 3: A Quick Overview of GPUs

[Diagram: a GPU chip contains many SMs connected through an interconnect to a shared L2 cache and DRAM. Each SM has a Fetch/Decode/Issue front end, a register file, SPs, LD/ST units, writeback, and an L1D cache alongside shared memory. Data flows from DRAM through the hierarchy to the SPs, which write results back.]
Slide 4: A Quick Overview of GPUs

[Diagram: the same SM/interconnect/L2/DRAM organization, highlighting that a round trip to DRAM costs ~100's of cycles.]
Slide 5: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
  – Exploit data reuse when possible

[Diagram: four SPs each issue a store that hits the same cache line, illustrating reuse.]
Slide 6: How do GPUs Achieve Great Performance?

• Effectively use available memory bandwidth
  – Exploit data reuse when possible
  – Regular, well coalesced memory accesses

[Diagram: four SPs issue one coalesced store that covers a single cache line.]
Slide 7: Buffering to Optimize Bandwidth

[Diagram: the SM datapath with tiles Tile[0], Tile[1], Tile[2] staged from DRAM (~100's of cycles away) into shared memory (<10 cycles away).]

Buffer data in fast Shared Memory
Slide 8: Buffering Problem 1: Wasted Storage

[Diagram: Tile[0] is duplicated along the path from DRAM through the L2, the L1D cache, and the register file before finally landing in shared memory.]

Duplicated data in Shared Mem, Caches, Reg. File
Roundabout path to Shared Memory
Slide 9: Buffering Problem 2: Code Expansion
cvt.s64.s32 %rl6, %r13;add.s64 %rl7, %rl5, %rl6;shl.b64 %rl8, %rl7, 2;mov.u64 %rl9, __cuda_local_var_42177_35_non_const_block;add.s64 %rl10, %rl9, %rl8;cvta.to.global.u64 %rl11, %rl2;mul.wide.u32 %rl12, %r15, 4;add.s64 %rl13, %rl11, %rl12;ld.global.f32 %f1, [%rl13];st.shared.f32 [%rl10], %f1; cvt.u64.u32 %rl14, %r1; add.s64 %rl15, %rl14, %rl4;shl.b64 %rl16, %rl15, 2;add.s64 %rl17, %rl11, %rl16;ld.global.f32 %f2, [%rl17];st.shared.f32 [%rl10+132], %f2; shl.b32 %r21, %r1, 1;cvt.u64.u32 %rl18, %r21;add.s64 %rl19, %rl18, %rl4;shl.b64 %rl20, %rl19, 2;add.s64 %rl21, %rl11, %rl20;ld.global.f32 %f3, [%rl21];st.shared.f32 [%rl10+264], %f3; mul.lo.s32 %r24, %r1, 3;cvt.u64.u32 %rl22, %r24;add.s64 %rl23, %rl22, %rl4;shl.b64 %rl24, %rl23, 2;add.s64 %rl25, %rl11, %rl24;ld.global.f32 %f4, [%rl25];st.shared.f32 [%rl10+396], %f4; shl.b32 %r27, %r1, 2;cvt.u64.u32 %rl26, %r27;add.s64 %rl27, %rl26, %rl4;shl.b64 %rl28, %rl27, 2;add.s64 %rl29, %rl11, %rl28;ld.global.f32 %f5, [%rl29];st.shared.f32 [%rl10+528], %f5; mul.lo.s32 %r30, %r1, 5;cvt.u64.u32 %rl30, %r30;add.s64 %rl31, %rl30, %rl4;shl.b64 %rl32, %rl31, 2;add.s64 %rl33, %rl11, %rl32;ld.global.f32 %f6, [%rl33];st.shared.f32 [%rl10+660], %f6; mul.lo.s32 %r33, %r1, 6;cvt.u64.u32 %rl34, %r33;add.s64 %rl35, %rl34, %rl4;shl.b64 %rl36, %rl35, 2;add.s64 %rl37, %rl11, %rl36;ld.global.f32 %f7, [%rl37];st.shared.f32 [%rl10+792], %f7;mul.lo.s32 %r36, %r1, 7;cvt.u64.u32 %rl38, %r36;add.s64 %rl39, %rl38, %rl4;shl.b64 %rl40, %rl39, 2;add.s64 %rl41, %rl11, %rl40;ld.global.f32 %f8, [%rl41];st.shared.f32 [%rl10+924], %f8;bar.sync 15;
IADD R4.CC, R7, c [0x0] [0x150];SHL.W R18, R7, 0x2;IMUL.U32.U32.HI R20, R7, 0x4;MOV R12, c [0x0] [0x150];IADD.X R5, RZ, RZ;IADD R2.CC, R18, c [0x0] [0x148];IADD.X R3, R20, c [0x0] [0x14c];IADD R0, R0, R19;IMAD.U32.U32 R6.CC, R12, 0x2, R7;LD.E R14, [R2];IADD.X R8, RZ, RZ;IMAD R10.CC, R12, 0x3, R7;SHL R21, R0, 0x2;IADD.X R9, RZ, RZ;IMAD.U32.U32 R11.CC, R12, 0x4, R7;STS [R21], R14;SHR.U32 R0, R4, 0x1e;SHL R22, R4, 0x2;IADD.X R4, RZ, RZ;IMAD R27.CC, R12, 0x5, R7;SHR.U32 R13, R6, 0x1e;SHL R24, R6, 0x2;IADD.X R6, RZ, RZ;ISCADD R23, R5, R0, 0x2;IMAD R0.CC, R12, 0x6, R7;IADD.X R5, RZ, RZ;IMAD R33.CC, R12, 0x7, R7;SHR.U32 R15, R10, 0x1e;SHL R26, R10, 0x2;SHR.U32 R10, R11, 0x1e;SHL R28, R11, 0x2;IADD.X R11, RZ, RZ;IADD R12.CC, R22, c [0x0] [0x148];ISCADD R25, R8, R13, 0x2;IADD.X R13, R23, c [0x0] [0x14c];IADD R8.CC, R24, c [0x0] [0x148];SHR.U32 R7, R27, 0x1e;LD.E R13, [R12];SHL R30, R27, 0x2;STS [R21+0x84], R13;ISCADD R27, R9, R15, 0x2;IADD.X R9, R25, c [0x0] [0x14c];IADD R2.CC, R26, c [0x0] [0x148];ISCADD R29, R4, R10, 0x2;IADD.X R3, R27, c [0x0] [0x14c];IADD R4.CC, R28, c [0x0] [0x148];SHR.U32 R10, R0, 0x1e;ISCADD R31, R6, R7, 0x2;ISCADD R32, R5, R10, 0x2;LD.E R9, [R8];IADD.X R5, R29, c [0x0] [0x14c];IADD R6.CC, R30, c [0x0] [0x148];SHL R0, R0, 0x2;IADD.X R7, R31, c [0x0] [0x14c];SHR.U32 R34, R33, 0x1e;IADD R10.CC, R0, c [0x0] [0x148];SHL R33, R33, 0x2;LD.E R3, [R2];ISCADD R34, R11, R34, 0x2;LD.E R5, [R4];IADD.X R11, R32, c [0x0] [0x14c];IADD R14.CC, R33, c [0x0] [0x148];LD.E R6, [R6];IADD.X R15, R34, c [0x0] [0x14c];LD.E R8, [R10];LD.E R2, [R14];STS [R21+0x108], R9;STS [R21+0x18c], R3;STS [R21+0x210], R5;STS [R21+0x294], R6;STS [R21+0x318], R8;STS [R21+0x39c], R2;BAR.SYNC 0xf; PTX
CUDA: 4 lines → PTX: 59 instructions → SASS: 73 instructions

```cuda
__global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[TILE_HEIGHT * STRIDE];

    // Preliminary address calculations
    …
    float *tile_ptr = tile + <offset>;

    // Buffer into shared memory
    #pragma unroll
    for (unsigned int i = 0; i < TILE_SIZE; i++)
        tile_ptr[i * STRIDE] = src[i * ImgStride];
    __syncthreads();

    // Processing data
    …
}
```
Each tile transfer requires many arithmetic ops to calculate addresses
Address generation consumes ~50% of tile transfer cycles
Slide 10: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: on a CPU, a DMA engine moves data between DRAM and the cache without occupying the CPU.]
Slide 11: Objective

• A tool to help achieve better memory performance
• Inspired by Direct Memory Access (DMA)

[Diagram: naively attaching a CPU-style DMA engine to a GPU SM and its caches raises problems.]

Not interruptible! Heavy bookkeeping!
Slide 12: D2MA: The Big Picture

[Diagram: a D2MA engine inside each SM transfers data directly between DRAM and the SM's shared memory, bypassing the cache hierarchy.]
Slide 13: D2MA: Data-Parallel Direct Memory Access

• Take advantage of regular memory accesses & unified L1D/Shared Memory space
• Decouple tile transfers from SM resources
  – Simplify address generation
  – Improve memory pipelining
• Direct path to shared memory

[Diagram: the D2MA engine sits beside the SM pipeline and uses the MSHRs to stream Tile[0] straight into shared memory.]
Slide 14: D2MA Programming Model

Original Code (CUDA: 4 lines → PTX: 59 instructions):

```cuda
__global__ void CUDAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[T_HEIGHT * T_STRIDE];

    int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
    int OffsThreadInCol = threadIdx.z * T_SIZE;
    src += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

    // buffer into shared memory
    #pragma unroll
    for (unsigned int i = 0; i < T_SIZE; i++)
        tile_ptr[i * T_STRIDE] = src[i * ImgStride];
    __syncthreads();

    // process rows then columns
    CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE
                                   + OffsThreadInRow - threadIdx.x, 1);
    CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

    for (unsigned int i = 0; i < T_SIZE; i++)
        dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
}
```

D2MA-Optimized Code (CUDA: 4 lines → PTX: 12 instructions):

```cuda
__global__ void D2MAkernel2DCT(float *dst, float *src, int ImgStride)
{
    __shared__ float tile[T_HEIGHT * T_STRIDE];

    int OffsThreadInRow = threadIdx.y * T_SIZE + threadIdx.x;
    int OffsThreadInCol = threadIdx.z * T_SIZE;
    src += FMUL(blockIdx.y * T_HEIGHT, ImgStride) + blockIdx.x * T_WIDTH;
    dst += FMUL(blockIdx.y * T_HEIGHT + OffsThreadInCol, ImgStride)
         + blockIdx.x * T_WIDTH + OffsThreadInRow;
    float *tile_ptr = tile + OffsThreadInCol * T_STRIDE + OffsThreadInRow;

    // buffer into shared memory via D2MA
    d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
    d2ma_set_datatype_float();
    d2ma_enable_shmem_blank_col();
    d2ma_ignite_buffer(0);

    // process rows then columns
    CUDAsubroutineInplaceDCTvector(tile + (OffsThreadInCol + threadIdx.x) * T_STRIDE
                                   + OffsThreadInRow - threadIdx.x, 1);
    CUDAsubroutineInplaceDCTvector(tile_ptr, T_STRIDE);

    for (unsigned int i = 0; i < T_SIZE; i++)
        dst[i * ImgStride] = tile_ptr[i * T_STRIDE];
}
```
Slide 15: D2MA Overview

[Diagram: the D2MA engine attaches to the SM next to the LD/ST units and MSHRs. It contains a controller with four buffer entries (Buf. 0-3), each recording a global address, shared-memory address, number of elements, element size, and stride, plus AGEN logic and a consistency checker.]
Slide 16: D2MA Operation: Configuration

The configuration calls from the kernel write a descriptor into one of the engine's buffer entries:

```cuda
d2ma_configure_matrix(tile, src, T_HEIGHT, T_WIDTH, ImgStride);
d2ma_set_datatype_float();
d2ma_enable_shmem_blank_col();
d2ma_ignite_buffer(0);
```

[Diagram: the config command reaches the controller, filling Buf. 0 with Glob. Addr 0x1020, Shr. Addr 0x20, 64 elements, element size 4, stride 1.]
Slide 17: D2MA Operation: Addr. Generation

[Diagram: d2ma_ignite_buffer(0) sends an "Ignite #0" command to the controller. The Buf. 0 descriptor (Glob. Addr 0x1020, Shr. Addr 0x20, 64 elements, element size 4, stride 1) is handed to the AGEN logic, which contains separate global-memory and shared-memory address generators under common control and begins producing the address pair 0x1020 → 0x20.]
Slide 18: D2MA Operation: Memory Transfer

[Diagram: the AGEN logic walks the descriptor, inserting each generated request into the MSHRs as a (global address, shared-memory address) pair: first 0x1020 / 0x20, then 0x10A0 / 0xA0.]
Slide 19: D2MA Operation: Memory Transfer

[Diagram: as requests return from memory, the MSHR entries direct each fill (&0x1020, &0x10A0) to its shared-memory destination (&0x20, &0xA0), completing the transfer without involving the SM pipeline.]
Slide 20: D2MA Operation: Enforcing Synchronization

Without D2MA (syncthreads(); programmer must guarantee consistency):
• Thread Block 1: start TX 1, load from buffer, end TX 1
• Thread Block 2: start TX 2, load from buffer, re-exec load

With D2MA (no syncthreads()! Synchronization handled transparently by H/W):
• Thread Block 1: start TX 1 with a thread barrier; code independent of the buffer executes while the transfer is in flight; once the barrier is satisfied, TX 1 ends and loads from the buffer proceed
• Thread Block 2: start TX 2 with a thread barrier; no warp is ready to schedule until the transfer completes
Slide 21: Experimental Evaluation

• GPGPU-Sim v3.2.1
• Benchmarks from NVIDIA CUDA SDK, Rodinia
  – Must perform shared memory buffering

Number of SMs: 15
Thread blocks/SM: 8
Shared memory/SM: 48 KB
D2MA engines/SM: 1
Controllers/engine: 8
Buffers/controller: 4
AGEN/engine: 1
Warp scheduling policy: Greedy-then-oldest
L1 cache (size/assoc/block size): 16 KB / 4-way / 128 B
L2 cache (size/assoc/block size): 768 KB / 16-way / 128 B
L2/DRAM latency: 240/440 cycles
Slide 22: Results: Performance

[Chart: speedup over baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and GEOMEAN; y-axis 0 to 2.5x.]

Geomean speedup: 1.36x
Slide 23: Results: Cycle Breakdown

[Chart: breakdown of tile transfer cycles into AGEN and MEM for baseline vs. D2MA across DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and AVG.]

Avg TX cycles: ~5x reduction
Addr. Gen: improved by 98%
Mem. TX: reduced by 66%
Slide 24: Results: Overheads

• Model of D2MA Engine synthesized using Synopsys
• Compared to NVIDIA GTX 480
  – Die area: 529 mm²
  – TDP: 250 W
  – One D2MA Engine per SM (15 SMs):
    • Area overhead: 0.016%
    • Power overhead: 0.022%

[Pie chart: engine area is 92% controller array, 5% AGEN logic, 3% consistency checker.]
Slide 25: Conclusion

• Programmer must optimize memory traffic to achieve good performance on GPUs
  – Shared memory buffering improves b/w utilization
  – Buffering still has overheads
• D2MA decouples tiled data buffering from existing SM resources
  – Reduces costs of address generation by 98%
  – Improves memory transfer times by 66%
  – Performance improves by 1.36x
  – Dynamic instructions executed reduced by 7%
  – Enforces synchronization transparently
  – Low area and power overheads (<0.03%)
Slide 26: Thank You!

• Questions?

Image credits:
http://www.opengraphicdesign.com/web/ajax-loading-graphics/
http://www.barriersandbollards.com/html-pages/mb50-1.png
Slide 27: D2MA: Accelerating Coarse-Grained Data Transfer for GPUs (closing title slide)

D. Anoushe Jamshidi, Mehrzad Samadi, and Scott Mahlke
University of Michigan
PACT-23, August 27th, 2014
Slide 28: Special Addressing Modes

• Blank Column Mode
• Halo Addressing Mode
Slide 29: Results: Dynamic Instruction Count Reduction

[Chart: dynamic instruction count relative to baseline for DCT, MM, DWT, TRANS1, TRANS2, FW, GAUSS, MEAN, SOBEL, PATH, SGEMV, LUD1, LUD2, and GEOMEAN; y-axis 0 to 1.2.]