For Inter-Warp DMR
• A Replay Checker compares the types of consecutively issued instructions and commands a replay (DMR) in the following cycle when they differ
• When instructions of the same type are issued consecutively, the ReplayQ buffers the unverified instructions so that they can be verified later, whenever the corresponding execution unit becomes available
– Each entry holds the opcode, operands, and original execution result for 32 threads (around 500 B per entry)
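The ReplayQ idea above can be sketched in a few lines. This is an illustrative software model, not the hardware interface; the class and function names are made up, and a toy ALU stands in for an SP unit:

```python
# Hypothetical sketch of the ReplayQ: when same-type instructions issue
# back-to-back, the unverified one is buffered (opcode, operands, original
# result for the warp's threads) and replayed once the matching unit is free.
from collections import deque

class ReplayQ:
    def __init__(self, size=10):
        self.q = deque(maxlen=size)      # bounded queue; ~500 B per entry

    def enqueue(self, opcode, operands, original_results):
        # operands/original_results hold per-thread values for the warp
        self.q.append((opcode, operands, original_results))

    def replay(self, execute):
        # When the execution unit becomes idle, re-run the oldest entry
        # and compare against the stored original result.
        opcode, operands, original = self.q.popleft()
        redone = [execute(opcode, ops) for ops in operands]
        return "OK" if redone == original else "ERROR"

# toy ALU standing in for an SP unit (illustrative only)
def alu(opcode, ops):
    a, b = ops
    return a + b if opcode == "add" else a - b

rq = ReplayQ()
rq.enqueue("add", [(1, 2), (3, 4)], [3, 7])
print(rq.replay(alu))  # matching results -> OK
```

A mismatch between the stored original and the replayed result is exactly the error-detection event described below.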
Inter-Warp DMR: Exploiting underutilized resources among heterogeneous units
• In any fully utilized warp, the unused execution units re-execute (DMR) the unverified execution of a previous warp
• If the stored original result and the re-executed result mismatch → ERROR detected!!
Warped-DMR
Light-weight Error Detection for GPGPU
Hyeran Jeon and Murali Annavaram
University of Southern California
MOTIVATION
ARCHITECTURAL SUPPORT
WARPED-DMR ABSTRACT
CONTACT
Hyeran Jeon
Email: hyeranje@usc.edu
Murali Annavaram
Email: annavara@usc.edu
For many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors could corrupt results and potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a light-weight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is light-weight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.
Intra-Warp DMR: Exploiting underutilized resources among homogeneous units
• In any underutilized warp, the inactive threads within the warp duplicate the active threads' execution
• The active mask gives a hint for selecting which threads to duplicate
• If the results of an inactive/active thread pair mismatch → ERROR detected!!
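A minimal sketch of this pairing, assuming a simple one-to-one pairing of active and spare lanes (the function names and fault-injection harness are illustrative, not the hardware):

```python
# Illustrative model of Intra-Warp DMR: spare (inactive) lanes re-execute
# the active lanes' instruction; a comparator flags any mismatch between
# the original result and the duplicate.
def intra_warp_dmr(active_mask, inputs, execute):
    active = [i for i, m in enumerate(active_mask) if m]
    inactive = [i for i, m in enumerate(active_mask) if not m]
    results = {i: execute(inputs[i]) for i in active}   # original execution
    for a, s in zip(active, inactive):                  # pair active/spare lanes
        duplicate = execute(inputs[a])                  # spare lane, same operands
        if duplicate != results[a]:
            return "ERROR"                              # mismatch -> error detected
    return "OK"

print(intra_warp_dmr([1, 1, 0, 0], [5, 6, 0, 0], lambda x: x * 2))  # -> OK

# Inject a single fault on the third execution to show detection:
calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    return x * 2 + (1 if calls["n"] == 3 else 0)  # 3rd execution is faulty
print(intra_warp_dmr([1, 1, 0, 0], [5, 6, 0, 0], flaky))  # -> ERROR
```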
For Intra-Warp DMR
• A Register Forwarding Unit (RFU) lets each pair of active and inactive threads use the same operands: the RFU forwards the active thread's register value to the inactive thread according to the active mask
– Overhead: 0.08 ns and 390 µm² @ Synopsys Design Compiler
• Thread-core mapping increases error coverage by modifying the thread-core affinity in the scheduler
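The forwarding step can be sketched as follows. This is a behavioral model only (the function name and lane ordering are assumptions); it shows how the active mask drives the copy of operand registers into the spare lanes so both lanes of a pair compute on identical inputs:

```python
# Behavioral sketch of the Register Forwarding Unit: for an active mask,
# copy each active lane's register operand into a paired inactive lane.
def forward_registers(active_mask, regs):
    active = [i for i, m in enumerate(active_mask) if m]
    inactive = [i for i, m in enumerate(active_mask) if not m]
    out = list(regs)
    for a, s in zip(active, inactive):
        out[s] = regs[a]          # RFU forwards the active lane's value
    return out

# lanes 0 and 1 active: their operands are mirrored into lanes 2 and 3
print(forward_registers([1, 1, 0, 0], [10, 20, 0, 0]))  # -> [10, 20, 10, 20]
```

After forwarding, the paired lanes execute the same instruction on the same operands, so any result mismatch can be attributed to a fault rather than to differing inputs.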
• Scientific computing is different from multimedia
– Correctness matters
– Some vendors have begun to add memory protection schemes to GPUs
• But what about execution units?
– A larger portion of the die area is assigned to execution units in GPUs
– Vast number of cores → higher probability of computation errors
• Underutilization among Homogeneous Units
– Since threads within a warp share a PC, in a diverged control flow some threads must execute one path while the others stay idle
• Underutilization among Heterogeneous Units
– The dispatcher issues an instruction to one of the three execution unit types at a time
– In the worst case, two of the three execution unit groups become idle
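The heterogeneous-unit case can be illustrated with a toy issue trace (the trace and numbers are assumed for illustration): if each cycle feeds exactly one of the three unit groups, the other two idle that cycle.

```python
# Toy illustration: one instruction per cycle goes to one of three unit
# groups (SP, LD/ST, SFU), so two groups sit idle each cycle.
trace = ["SP", "LDST", "SP", "SFU", "SP", "LDST"]   # assumed issue trace
units = ["SP", "LDST", "SFU"]
busy_slots = len(trace)                              # one busy group per cycle
total_slots = len(trace) * len(units)                # all group-cycles
utilization = busy_slots / total_slots
print(f"unit-group utilization: {utilization:.0%}")  # -> 33%
```

These idle group-cycles are exactly the slack that Inter-Warp DMR reclaims for verification.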
RESULTS
Warped-DMR (Intra-Warp DMR + Inter-Warp DMR) covers 96% of computations with a 16% performance overhead, without extra execution units
BACKGROUND
• Instructions are executed in batches of threads called warps (or wavefronts)
– Threads within a warp run in lock-step by sharing a PC
• Instructions are categorized into 3 types and executed on the corresponding execution units
– Arithmetic operations on SPs, memory operations on LD/ST units, transcendental instructions (e.g. sin, cosine) on SFUs
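This three-way split can be sketched as a simple classifier. The opcode lists are illustrative, not the full ISA, and the function name is made up:

```python
# Minimal sketch of the three-way instruction split: each opcode is
# steered to the SP, LD/ST, or SFU unit group.
def unit_for(opcode):
    if opcode in ("sin", "cos", "rsqrt"):
        return "SFU"     # transcendental ops
    if opcode.startswith(("ld", "st")):
        return "LD/ST"   # memory ops
    return "SP"          # arithmetic ops

print([unit_for(op) for op in ("add", "ld.shared.f32", "sin")])
# -> ['SP', 'LD/ST', 'SFU']
```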
< GPU hierarchy: a kernel is decomposed into thread blocks, thread blocks into warps of threads; each SM contains SPs, LD/ST units, SFUs, a register file, local memory, and a scheduler/dispatcher, with global memory shared across SMs >
< Example issue trace over SPs/LD-STs/SFUs: warp4: sin.f32 %f3, %f1 · warp1: ld.shared.f32 %f20, [%r99+824] · warp2: add.f32 %f16, %f14, %f15 · warp1: ld.shared.f32 %f21, [%r99+956] · warp2: add.f32 %f18, %f12, %f17 · warp3: ld.shared.f32 %f2, [%r70+4] — each cycle only one unit group is busy >
if (cond) {
  b++;
} else {
  b--;
}
a = b;
< Figure (Code / Typical GPU execution / With Intra-Warp DMR): normally SP1 executes b++ for the taken lanes and SP2 executes b-- for the others, each leaving its partner idle; with Intra-Warp DMR the idle SP duplicates the computation, a comparator (COMP) checks each pair (same → OK, different → ERROR!!), and a mismatch triggers flush & error handling >
< Figure (Code / Typical GPU execution / With Inter-Warp DMR): idle SP, LD/ST, and SFU slots in each cycle of the sin/ld/add issue trace are filled with DMR re-executions of unverified instructions from other warps >
< Execution time breakdown with respect to the number of active threads >
Over 30% of BitonicSort's execution time runs with only 16 active threads; 40% of BFS's execution time runs with a single thread
2 types of Underutilization in GPGPU computing
WARPED-DMR : EXPLOITING THE UNDERUTILIZATIONS FOR ERROR DETECTION
Can we use these idle resources?
< Register Forwarding Unit: in the RF→EXE→WB pipeline, the RFU uses the active mask (e.g. 1100) to forward the active threads' register values (th3.r1, th2.r1) to the inactive SP lanes so each pair computes on identical operands; a comparator at writeback raises ERROR!! on mismatch >
< Inter-Warp DMR datapath: alongside the Fetch/Decode/RF/EXE pipeline, a checker compares consecutive instructions' types; same-type instructions are enqueued into the ReplayQ and searched later, so DMR runs on whichever SP, MEM, or SFU unit in the SIMT cluster becomes idle >
< Error coverage w.r.t. SIMT cluster organization and thread-to-core mapping >
Error coverage reaches 89.60%, 91.91%, and 96.43% across the evaluated configurations (4-core vs. 8-core clusters and modified thread-to-core mapping)
< Normalized kernel simulation cycles w.r.t. ReplayQ size >
Normalized simulation cycles: 1.41, 1.32, 1.24, and 1.16 for ReplayQ sizes of 0, 1, 5, and 10 entries, respectively