A Hardware Redundancy and Recovery Mechanism for Reliable ...

28
1 University of Virginia Computer Science 2 NVIDIA Research A Hardware Redundancy and Recovery Mechanism for Reliable Scientific Computation on Graphics Processors Jeremy W. Sheaffer 1 David P. Luebke 2 Kevin Skadron 1

Transcript of A Hardware Redundancy and Recovery Mechanism for Reliable ...

Page 1: A Hardware Redundancy and Recovery Mechanism for Reliable ...

1University of Virginia Computer Science2NVIDIA Research

A Hardware Redundancy andRecovery Mechanism for Reliable

Scientific Computation onGraphics Processors

Jeremy W. Sheaffer1

David P. Luebke2

Kevin Skadron1

Page 2: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Reliable Graphics?

• Transient errors can causeundesirable visual artifacts, such as:• Single pixel errors

• Single texel errors

• Single vertex errors

• Corrupt a frame

• Crash the computer

• Corrupt rendering state

Page 3: A Hardware Redundancy and Recovery Mechanism for Reliable ...
Page 4: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Motivation

• GPGPU

• One or more correct answers are expected

• Very different expectation from that of graphics

• More (exactly) like CPU expectations

• Massive parallelism provides opportunities thatare impractical or impossible on CPUs

Page 5: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Reducing Error Rates

• Error rates can be reduced by

• Reducing chip operating temperature

• Reducing crosstalk

• Increasing transistor sizes

• Increasing supply voltage

• Decreasing clock frequency

• Increasing power supply quality

• …

Page 6: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Reducing Error Rates

• Error rates can be reduced by

• …

• Detection and correction

Page 7: A Hardware Redundancy and Recovery Mechanism for Reliable ...

CPU Transient Fault Mitigation

• ECC and parity

• Scrubbing

• Used in conjunction with ECC to reduce 2-bit errors

• Larger or radiation-hardened gates

• Hardware fingerprinting or state dump withrollback

• Redundancy

• Primarily employed to protect logic

• Also sometimes used for memory

Page 8: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Reliability Through Redundancy

• Primary topic in recent transient fault reliabilityliterature

• Many clever ideas including• Triple redundancy with voting

• Lockstepped processors

• Redundant Multithreading

• CRT—Chip-level Redundantly Threaded processors

• SRT—Simultaneous and Redundantly Threaded processors

• The concepts of a ‘Sphere of Replication’ and leading andtrailing threads

• Memoization of redundant results

Page 9: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Designing a Reliable Functional Unit

• It is impossible to guarantee 100%reliability

• Anything outside of the sphere ofreplication must be either:

• Protected, as with ECC, or

• Unprotected and unimportant (as per an ACEanalysis)

Page 10: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Example: A Reliable ALU

Page 11: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Motivation for Reliable GPGPU

• GPGPU is becoming important enough thatvendors are devoting (human) resources toit

• GPGPU is already being applied in domainswhere errors are unacceptable

• GPGPU offers a much higher performanceper dollar than the traditionalsupercomputing infrastructure

Page 12: A Hardware Redundancy and Recovery Mechanism for Reliable ...

GPGPU Redundancy

• At what granularity should the redundancy occur?

• Possibilities include:

• Multiple GPUs

• Shader binary (software)

• Quad/Warp

• Shader unit (hardware)

• ALU

• Tightly coupled with comparator placement and datapath

• Possible to analytically eliminate many possibilities

• Experimentally evaluate remaining

Page 13: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Design Concerns

• Solution must not impact graphics performance

• Solution must be very cheap to implement

• GPU vendors are very reluctant to sacrifice real estate foranything which does not boost performance

• GPGPU is arguably becoming important

• But it does not drive the market

• (and it probably never will)

Page 14: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Performance Concerns

• It should be faster than 2x

• It should use less than 2x energy

• A well designed solution should be able toachieve these goals by taking advantage ofincreased memory locality of redundanttexture fetches

Page 15: A Hardware Redundancy and Recovery Mechanism for Reliable ...

A Reliable GPGPU Solution

Page 16: A Hardware Redundancy and Recovery Mechanism for Reliable ...

A Reliable GPGPU Solution

VS: Vertex Stream

DB: Domain Buffer

GP: Geometry Processing

FC: Fragment Core

FB: Framebuffer

Page 17: A Hardware Redundancy and Recovery Mechanism for Reliable ...

The Domain Buffer

•Stores assembled triangleinformation in protected memory

•Reads datapath from ROP forreissue

•Datapath could berepurposed from the unifiedshader model or f-bufferdatapaths

•Reissues with fragment(s) fromROP as stencil mask(s)

Page 18: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Other Pipeline Changes

•Rasterizer produces two of everyfragment

•Guarantees that identicalfragments are not computedon the same core

•The fragment engine has nochanges

•ROP uses a modified full/empty-bit semantic to act as thecomparator

Page 19: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Experiments

• Using a series of stressmarks to challengethe memory system

• Compare with a baseline, unreliable system andwith a perfect cache system in which cacheaccesses never go to memory

• Measure performance, power, energy, cache hit rate,and memory throughput

Page 20: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Improved Cache Performance

Page 21: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Reduced Memory Traffic

Page 22: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Better Core Utilization

Page 23: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Less than 2x performance overhead

Page 24: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Less than 2x power and energy

Page 25: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Conclusions

• We have presented a reliable GPGPU system

• Our solution utilizes a domain buffer, to reissuecorrupt fragments, dual issue from the rasterizer,and repurposes ROP as the comparator

• This work provides a complete solution to GPUreliability for last-generation hardware

• The important ideas map directly to current andforeseeable future hardware, but details become moredifficult

Page 26: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Future Work

• Scatter functionality in CTM and CUDAprovide difficult challenges

• Other aspects of the presented work mapvery well to new architectures, thoughdetails must be worked out

Page 27: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Acknowledgements

• The simulation framework on which we built thiswork was developed by Greg Johnson, Chris Burns,Alexander Joly, and William R. Mark at theUniversity of Texas, Austin

• This work was supported by NSF grants CCF-0429765, CCR-0306404, the Army Research Officeunder grant no. W911NF-04-1-0288, a researchgrant from Intel MRL, and an ATI graduatefellowship

Page 28: A Hardware Redundancy and Recovery Mechanism for Reliable ...

Thank You

• Questions?