For Inter-Warp DMR
• A Replay Checker compares the types of consecutively issued instructions and commands a replay (DMR) in the following cycle when they differ
• When instructions of the same type are issued consecutively, the ReplayQ buffers the unverified instructions so that they can be verified later, whenever the corresponding execution unit becomes available
– Each entry holds the opcode, operands, and original execution result for 32 threads (around 500 B per entry)
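The ReplayQ idea above can be sketched in a few lines. This is an illustrative software model, not the hardware interface; the class and function names are made up, and a toy ALU stands in for an SP unit:

```python
# Hypothetical sketch of the ReplayQ: when same-type instructions issue
# back-to-back, the unverified one is buffered (opcode, operands, original
# result for the warp's threads) and replayed once the matching unit is free.
from collections import deque

class ReplayQ:
    def __init__(self, size=10):
        self.q = deque(maxlen=size)      # bounded queue; ~500 B per entry

    def enqueue(self, opcode, operands, original_results):
        # operands/original_results hold per-thread values for the warp
        self.q.append((opcode, operands, original_results))

    def replay(self, execute):
        # When the execution unit becomes idle, re-run the oldest entry
        # and compare against the stored original result.
        opcode, operands, original = self.q.popleft()
        redone = [execute(opcode, ops) for ops in operands]
        return "OK" if redone == original else "ERROR"

# toy ALU standing in for an SP unit (illustrative only)
def alu(opcode, ops):
    a, b = ops
    return a + b if opcode == "add" else a - b

rq = ReplayQ()
rq.enqueue("add", [(1, 2), (3, 4)], [3, 7])
print(rq.replay(alu))  # matching results -> OK
```

A mismatch between the stored original and the replayed result is exactly the error-detection event described below.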
Inter-Warp DMR: Exploiting underutilized resources among heterogeneous units
• In any fully utilized warp, the unused execution units re-execute (DMR) the unverified execution of a previous warp
• If the stored original result and the re-executed result mismatch → ERROR detected!!
Warped-DMR
Light-weight Error Detection for GPGPU
Hyeran Jeon and Murali Annavaram
University of Southern California
MOTIVATION
ARCHITECTURAL SUPPORT
WARPED-DMR ABSTRACT
CONTACT
Hyeran Jeon
Email: hyeranje@usc.edu
Murali Annavaram
Email: annavara@usc.edu
For many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors could corrupt results and potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a light-weight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is light-weight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.
Intra-Warp DMR: Exploiting underutilized resources among homogeneous units
• In any underutilized warp, the inactive threads within the warp duplicate the active threads' execution
• The active mask gives a hint for selecting which threads to duplicate
• If the results of an inactive/active thread pair mismatch → ERROR detected!!
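A minimal sketch of this pairing, assuming a simple one-to-one pairing of active and spare lanes (the function names and fault-injection harness are illustrative, not the hardware):

```python
# Illustrative model of Intra-Warp DMR: spare (inactive) lanes re-execute
# the active lanes' instruction; a comparator flags any mismatch between
# the original result and the duplicate.
def intra_warp_dmr(active_mask, inputs, execute):
    active = [i for i, m in enumerate(active_mask) if m]
    inactive = [i for i, m in enumerate(active_mask) if not m]
    results = {i: execute(inputs[i]) for i in active}   # original execution
    for a, s in zip(active, inactive):                  # pair active/spare lanes
        duplicate = execute(inputs[a])                  # spare lane, same operands
        if duplicate != results[a]:
            return "ERROR"                              # mismatch -> error detected
    return "OK"

print(intra_warp_dmr([1, 1, 0, 0], [5, 6, 0, 0], lambda x: x * 2))  # -> OK

# Inject a single fault on the third execution to show detection:
calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    return x * 2 + (1 if calls["n"] == 3 else 0)  # 3rd execution is faulty
print(intra_warp_dmr([1, 1, 0, 0], [5, 6, 0, 0], flaky))  # -> ERROR
```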
For Intra-Warp DMR
• A Register Forwarding Unit (RFU) lets each pair of active and inactive threads use the same operands: the RFU forwards the active thread's register value to the inactive thread according to the active mask
– Overhead: 0.08 ns and 390 µm² @ Synopsys Design Compiler
• Thread-core mapping increases error coverage by modifying the thread-core affinity in the scheduler
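The forwarding step can be sketched as follows. This is a behavioral model only (the function name and lane ordering are assumptions); it shows how the active mask drives the copy of operand registers into the spare lanes so both lanes of a pair compute on identical inputs:

```python
# Behavioral sketch of the Register Forwarding Unit: for an active mask,
# copy each active lane's register operand into a paired inactive lane.
def forward_registers(active_mask, regs):
    active = [i for i, m in enumerate(active_mask) if m]
    inactive = [i for i, m in enumerate(active_mask) if not m]
    out = list(regs)
    for a, s in zip(active, inactive):
        out[s] = regs[a]          # RFU forwards the active lane's value
    return out

# lanes 0 and 1 active: their operands are mirrored into lanes 2 and 3
print(forward_registers([1, 1, 0, 0], [10, 20, 0, 0]))  # -> [10, 20, 10, 20]
```

After forwarding, the paired lanes execute the same instruction on the same operands, so any result mismatch can be attributed to a fault rather than to differing inputs.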
• Scientific computing is different from multimedia
– Correctness matters
– Some vendors have begun to add memory protection schemes to GPUs
• But what about execution units?
– A larger portion of the die area is assigned to execution units in GPUs
– Vast number of cores → higher probability of computation errors
• Underutilization among Homogeneous Units
– Since threads within a warp share a PC, in a diverged control flow some threads must execute one path while the others stay idle
• Underutilization among Heterogeneous Units
– The dispatcher issues an instruction to one of the three execution unit types at a time
– In the worst case, two of the three execution unit groups become idle
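The heterogeneous-unit case can be illustrated with a toy issue trace (the trace and numbers are assumed for illustration): if each cycle feeds exactly one of the three unit groups, the other two idle that cycle.

```python
# Toy illustration: one instruction per cycle goes to one of three unit
# groups (SP, LD/ST, SFU), so two groups sit idle each cycle.
trace = ["SP", "LDST", "SP", "SFU", "SP", "LDST"]   # assumed issue trace
units = ["SP", "LDST", "SFU"]
busy_slots = len(trace)                              # one busy group per cycle
total_slots = len(trace) * len(units)                # all group-cycles
utilization = busy_slots / total_slots
print(f"unit-group utilization: {utilization:.0%}")  # -> 33%
```

These idle group-cycles are exactly the slack that Inter-Warp DMR reclaims for verification.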
RESULTS
Warped-DMR (Intra-Warp DMR + Inter-Warp DMR) covers 96% of computations with a 16% performance overhead, without extra execution units
BACKGROUND
• Instructions are executed in batches of threads called warps (or wavefronts)
– Threads within a warp run in lock-step by sharing a PC
• Instructions are categorized into 3 types and executed on the corresponding execution units
– Arithmetic operations on SPs, memory operations on LD/ST units, transcendental instructions (e.g. sin, cosine) on SFUs
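This three-way split can be sketched as a simple classifier. The opcode lists are illustrative, not the full ISA, and the function name is made up:

```python
# Minimal sketch of the three-way instruction split: each opcode is
# steered to the SP, LD/ST, or SFU unit group.
def unit_for(opcode):
    if opcode in ("sin", "cos", "rsqrt"):
        return "SFU"     # transcendental ops
    if opcode.startswith(("ld", "st")):
        return "LD/ST"   # memory ops
    return "SP"          # arithmetic ops

print([unit_for(op) for op in ("add", "ld.shared.f32", "sin")])
# -> ['SP', 'LD/ST', 'SFU']
```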
< GPU hierarchy: a kernel is decomposed into thread blocks, thread blocks into warps of threads; each SM contains SPs, LD/ST units, SFUs, a register file, local memory, and a scheduler/dispatcher, with global memory shared across SMs >
< Example issue trace over SPs/LD-STs/SFUs: warp4: sin.f32 %f3, %f1 · warp1: ld.shared.f32 %f20, [%r99+824] · warp2: add.f32 %f16, %f14, %f15 · warp1: ld.shared.f32 %f21, [%r99+956] · warp2: add.f32 %f18, %f12, %f17 · warp3: ld.shared.f32 %f2, [%r70+4] — each cycle only one unit group is busy >
if (cond) {
  b++;
} else {
  b--;
}
a = b;
< Figure (Code / Typical GPU execution / With Intra-Warp DMR): normally SP1 executes b++ for the taken lanes and SP2 executes b-- for the others, each leaving its partner idle; with Intra-Warp DMR the idle SP duplicates the computation, a comparator (COMP) checks each pair (same → OK, different → ERROR!!), and a mismatch triggers flush & error handling >
< Figure (Code / Typical GPU execution / With Inter-Warp DMR): idle SP, LD/ST, and SFU slots in each cycle of the sin/ld/add issue trace are filled with DMR re-executions of unverified instructions from other warps >
< Execution time breakdown with respect to the number of active threads >
Over 30% of BitonicSort's execution time runs with only 16 active threads; 40% of BFS's execution time runs with a single thread
2 types of Underutilization in GPGPU computing
WARPED-DMR : EXPLOITING THE UNDERUTILIZATIONS FOR ERROR DETECTION
Can we use these idle resources?
< Register Forwarding Unit: in the RF→EXE→WB pipeline, the RFU uses the active mask (e.g. 1100) to forward the active threads' register values (th3.r1, th2.r1) to the inactive SP lanes so each pair computes on identical operands; a comparator at writeback raises ERROR!! on mismatch >
< Inter-Warp DMR datapath: alongside the Fetch/Decode/RF/EXE pipeline, a checker compares consecutive instructions' types; same-type instructions are enqueued into the ReplayQ and searched later, so DMR runs on whichever SP, MEM, or SFU unit in the SIMT cluster becomes idle >
< Error coverage w.r.t. SIMT cluster organization and thread-to-core mapping >
Error coverage reaches 89.60%, 91.91%, and 96.43% across the evaluated configurations (4-core vs. 8-core clusters and modified thread-to-core mapping)
< Normalized kernel simulation cycles w.r.t. ReplayQ size >
Normalized simulation cycles: 1.41, 1.32, 1.24, and 1.16 for ReplayQ sizes of 0, 1, 5, and 10 entries, respectively