Online Design Bug Detection: RTL Analysis, Flexible Mechanisms, and Evaluation
Kypros Constantinides, University of Michigan
Onur Mutlu, Microsoft Research & Carnegie Mellon University
Todd Austin, University of Michigan
Challenges of Correct Microprocessor Design
MICRO-41, November 11th, 2008
1.2 bugs per month
3.5 bugs per month
More bugs as more complex and diverse resources are integrated into a single chip
Chip-Multiprocessors
New Features:
• 64-bit extensions
• Virtualization
• Power Management
• SSE3
Design Bugs: Deviations from the product specifications
*Data compiled from Intel product specification update documents
Why is Online Design Bug Detection Needed?
Cost of design bugs:
- Lower System Performance
- System Security: Attacks exploit HW design bugs
- Diminishing Brand/Company Reputation
- Lower Customer Satisfaction
- Financial Loss
- Expensive Recalls
Microprocessor companies rely on ad-hoc techniques that change the software and hardware configuration to work around design bugs
Online Design Bug Detection and Avoidance
Online Design Bug Detection
- Bug detection mechanism is updated by firmware with new design bugs
- In this work we focus on online design bug detection
Online System Recovery
- Recover system from design bug effects
- Low overhead periodic checkpoint and recovery
- Existing mechanisms:
• ReVive + ReViveI/O
• SafetyNet
Bug Avoidance Techniques
- Avoid the reoccurrence of the design bug
- Existing mechanisms:
• Scale down to safe-mode
• Disable buggy part
• Hypervisor execution guidance
Microprocessor Errata Documents
R31. Interactions between the Instruction Translation Lookaside Buffer (ITLB) and the Instruction Streaming Buffer May Cause Unpredictable Software Behavior
Problem: Complex interactions within the instruction fetch/decode unit may make it possible for the processor to execute instructions from an internal streaming buffer containing stale or incorrect information.
Implication: When this erratum occurs, an incorrect instruction stream may be executed resulting in unpredictable software behavior.
From the Intel Pentium 4 Specification Update Document
Limitations:
- Provide high-level description of the design bug
- Hard to relate the design bug to the actual hardware implementation
Characterizing RTL Design Bugs
[Figure: OpenSPARC core block diagram with MMU, Trap Logic Unit (TLU), Load Store Unit (LSU), IFU, EXU, and MUL]
assign intrpt_taken = rstint_taken | hwint_taken | sirint_taken;
...
// modified for bug 3919
// assign trap_to_redmode = trp_lvl_at_maxtlless1 & ~intrpt_taken;
assign trap_to_redmode = trp_lvl_at_maxtlless1 &
                         ~(rstint_taken | sirint_taken);
Example of RTL design bug in Verilog code – tlu_ctl.v
OpenSPARC T1 (Niagara), OpenSPARC Core
- RTL design bugs in Verilog code
- Fixed and documented in the code
Load Store Unit (LSU): 157 bugs
Trap Logic Unit (TLU): 139 bugs
Total of 296 bugs in SPARC core
In the tlu_ctl.v example, the commented-out assignment is the buggy code and the active assignment is the corrected code.
Online Detection of Design Bugs
Example: bug 3919 in tlu_ctl.v. The buggy assignment masks trap_to_redmode with ~intrpt_taken; the corrected assignment masks it with ~(rstint_taken | sirint_taken).

Triggering condition:
trp_lvl_at_maxtlless1 = 1
rstint_taken = 0
hwint_taken = 1
sirint_taken = 0

Buggy Implementation: trap_to_redmode = 0
Correct Implementation: trap_to_redmode = 1
Design bug is exposed

Monitoring these signals can detect the bug occurrence

[Figure: the monitored signals are produced by combinational logic fed from flip-flops (D, Q, Clk)]
Monitoring the flip-flops can detect the bug occurrence
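As a rough check of the example above, the two assignments can be modelled as Boolean functions and compared over every input combination. This is only a Python sketch of the two Verilog expressions quoted from tlu_ctl.v; the function names are made up for illustration and nothing here comes from the authors' tooling.

```python
from itertools import product

def trap_to_redmode_buggy(trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken):
    # Buggy assignment: any taken interrupt, including a hardware interrupt, masks the trap.
    intrpt_taken = rstint_taken or hwint_taken or sirint_taken
    return trp_lvl_at_maxtlless1 and not intrpt_taken

def trap_to_redmode_fixed(trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken):
    # Corrected assignment: hwint_taken no longer masks the trap.
    return trp_lvl_at_maxtlless1 and not (rstint_taken or sirint_taken)

# Input order: (trp_lvl_at_maxtlless1, rstint_taken, hwint_taken, sirint_taken)
exposing_inputs = [
    bits for bits in product([False, True], repeat=4)
    if trap_to_redmode_buggy(*bits) != trap_to_redmode_fixed(*bits)
]
print(exposing_inputs)
```

Under this model the two versions differ for exactly one input combination, which is the triggering condition listed on the slide.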
Insights from RTL Design Bug Analysis
RTL Analysis Observations:
- ~20 signals need to be monitored per bug
- >1000 unique signals need to be monitored for all the bugs studied
- Each bug has ~7 source signals not monitored for any other bug
- Set of monitored signals is expanding for every new bug
- All bug source signals are coming from control flip-flops
- Monitoring data buffers or data registers will not provide significant benefit
Limitations of online bug detection techniques in the literature:
1. Monitor only a few hundred signals (~200-300)
2. Monitored signals are selected at design time
Flexible Bug Detection at the Flip-Flop Level
[Figure: each operating flip-flop is augmented with a scan portion and a bug detection portion; the bug detection portion outputs 0 on a match and 1 on a mismatch]
Monitor ALL control flip-flops in the design
Bug Signature Encoding (loaded using field-programmable scan chains)
Flexible Bug Signature: 1 X X 0 X X … X X X 0 X X
- 1: FF needs to be 1 to expose bug
- X: FF is not a bug source signal
- 0: FF needs to be 0 to expose bug
[Figure: bug detection flip-flop holding a Detection Value bit and a Monitor Enable bit; example values shown: 011, 001]
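The per-flip-flop check can be sketched in software as follows. This is a behavioural model only: the slide names a Detection Value and a Monitor Enable bit, but how the symbols 1, 0, and X map onto those two bits is an assumption made here, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DetectionFF:
    monitor_enable: bool   # assumed: False encodes the don't-care symbol X
    detection_value: bool  # assumed: the value the operating flip-flop must hold

def encode(signature: str) -> List[DetectionFF]:
    """Translate a ternary signature such as '1XX0' into per-flip-flop settings."""
    return [DetectionFF(sym != "X", sym == "1") for sym in signature]

def mismatch(ff: DetectionFF, operating_value: bool) -> int:
    """Return 0 on match and 1 on mismatch, following the slide's convention."""
    if not ff.monitor_enable:
        return 0  # an unmonitored flip-flop never reports a mismatch
    return int(operating_value != ff.detection_value)

def signature_matches(ffs: List[DetectionFF], state: List[bool]) -> bool:
    """The bug condition holds when no monitored flip-flop mismatches."""
    return all(mismatch(ff, value) == 0 for ff, value in zip(ffs, state))

ffs = encode("1XX0")
print(signature_matches(ffs, [True, False, True, False]))   # monitored bits agree
print(signature_matches(ffs, [False, False, True, False]))  # first bit disagrees
```

Because the signature is just data held next to each flip-flop, changing which signals are monitored amounts to loading a different string, which is the flexibility the slide argues for.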
Distributed Global Bug Detection Checking
[Figure: 64 control flip-flops at the flip-flop level are grouped into 8-bit bug detection segments, whose match signals feed a checking tree]
Checking tree table entries (Bug ID, Flag, Match-bitvector) are loaded at system startup by firmware
[Figure: example table entries for bug IDs 7, 9, and 12 with match-bitvectors containing 1, 0, and X; in the example, Bug #9 is detected and Bug #12 is detected]
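A flattened software model of segment-based checking is sketched below. It assumes each table entry lists, per segment, whether that segment must match (1) or is ignored (X); the Flag field and the tree structure shown on the slide are left out because their exact semantics are not recoverable from the transcript. All names and example values are illustrative.

```python
SEGMENT_BITS = 8

def segment_matches(signature: str, state: str) -> list:
    """One boolean per 8-bit segment: True when every monitored bit agrees."""
    assert len(signature) == len(state)
    result = []
    for start in range(0, len(signature), SEGMENT_BITS):
        seg_sig = signature[start:start + SEGMENT_BITS]
        seg_state = state[start:start + SEGMENT_BITS]
        result.append(all(s == "X" or s == v for s, v in zip(seg_sig, seg_state)))
    return result

def detected_bugs(table: dict, matches: list) -> list:
    """table maps a bug ID to a per-segment string: '1' = must match, 'X' = ignored."""
    return [
        bug_id for bug_id, required in table.items()
        if all(r == "X" or m for r, m in zip(required, matches))
    ]

# Three segments (24 flip-flops) keep the example short.
system_signature = "1XXXXXX0" + "XXXX01XX" + "XXXXXXX1"
state            = "10000000" + "00000100" + "00000000"
table = {9: "11X", 12: "X11"}  # hypothetical entries for two bugs

matches = segment_matches(system_signature, state)
print(matches)
print(detected_bugs(table, matches))
```

Here the first two segments match and the third does not, so only the entry that ignores the third segment fires.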
Detecting Multiple Design Bugs
Design Bug Database: Design Bugs & Triggering Conditions
Bug Sign. #1, Bug Sign. #2, …, Bug Sign. #N
Merge Bug Signatures into the System Bug Signature
- Use "Don't cares" to resolve signal conflicts between bug signatures
- No false negatives, but false positive bug detections are possible
Encode & Load the System Bug Signature into the Bug Detection Flip-Flops
Bug Signature Conflict: a 0 in one signature and a 1 in another are merged to X
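The "no false negatives, but false positives are possible" property can be illustrated with a small exhaustive check. The merge rule below (keep a bit only where both signatures agree, otherwise use X) is an assumed simplification for two 4-bit signatures, not the authors' exact procedure, and the signature values are invented.

```python
from itertools import product

def merge(sig_a: str, sig_b: str) -> str:
    """Keep a position only where both signatures agree; otherwise use a don't-care."""
    return "".join(a if a == b else "X" for a, b in zip(sig_a, sig_b))

def matches(signature: str, state: str) -> bool:
    return all(s == "X" or s == v for s, v in zip(signature, state))

sig1, sig2 = "1X01", "1X10"
merged = merge(sig1, sig2)

states = ["".join(bits) for bits in product("01", repeat=4)]

# A false negative would be a state that triggers a bug but not the merged signature.
false_negatives = [s for s in states
                   if (matches(sig1, s) or matches(sig2, s)) and not matches(merged, s)]

# A false positive triggers the merged signature without triggering either bug.
false_positives = [s for s in states
                   if matches(merged, s) and not (matches(sig1, s) or matches(sig2, s))]

print(merged, false_negatives, false_positives)
```

Replacing a conflicting position with X can only widen the set of matching states, which is why every real trigger is still caught while some harmless states are flagged as well.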
Online Tuning of Coverage/Performance Trade-Off
[Flowchart]
1. Firmware loads initial system bug signature
2. Design bug detected (Bug ID#)
3. Execution recovery & design bug avoidance
4. False positive? Update log (log of the false positive rate of each bug, kept in physical memory)
5. False positive rate > threshold?
   Yes: Remove bug with highest false positive rate
   No: Add bug with lowest false positive rate
Adjust the design bugs being covered by dynamically updating the system bug signature
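One step of the tuning loop might look like the sketch below. The flowchart does not say whether the threshold test uses an overall or a per-bug rate, so this model assumes an overall observed rate plus a per-bug log; the function name, parameters, and numbers are all hypothetical.

```python
def tune_step(covered, uncovered, fp_log, observed_fp_rate, threshold):
    """Return new (covered, uncovered) lists after one tuning decision.

    covered / uncovered: bug IDs in and out of the system bug signature.
    fp_log: logged false positive rate per bug ID.
    """
    covered, uncovered = list(covered), list(uncovered)
    if observed_fp_rate > threshold and covered:
        # Too many false positives: drop the bug that causes the most of them.
        worst = max(covered, key=lambda bug: fp_log[bug])
        covered.remove(worst)
        uncovered.append(worst)
    elif observed_fp_rate <= threshold and uncovered:
        # Headroom available: restore coverage for the cheapest uncovered bug.
        best = min(uncovered, key=lambda bug: fp_log[bug])
        uncovered.remove(best)
        covered.append(best)
    return covered, uncovered

fp_log = {7: 0.01, 9: 0.30, 12: 0.05}
step1 = tune_step([9, 12], [7], fp_log, observed_fp_rate=0.35, threshold=0.10)
step2 = tune_step(*step1, fp_log, observed_fp_rate=0.05, threshold=0.10)
print(step1, step2)
```

After each step the firmware would regenerate and reload the system bug signature for the new covered set, which is the dynamic update the slide describes.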
Area Overhead and Design Bug Coverage
RTL prototype implementation:
- Synthesized with IBM 130nm process technology
- Covers the whole OpenSPARC T1 Chip
- 39K control flip-flops monitored (15% of all Flip-flops in OpenSPARC T1)
- Bug detection flip-flops have an area overhead of 3%
[Figure: design bug coverage versus Total Area Overhead (%), with the overhead axis running from 0 to 25]
Critical Design Bugs in 10 commercial processors: ~65% [Sarangi et al., MICRO'06]
80% Coverage, 10% Overhead
Power Consumption Overhead
OpenSPARC T1 Power Budget: 58W, IBM 130nm @ 1.2V
Component Power Share
Cores & L1 Caches 14.4W 24.7%
L2 Cache 9W 15.4%
Leakage 13.7W 23.5%
Crossbar 0.6W 1.1%
Misc. Units (I/O Bridge, DRAM Ctrl, CTU) 0.9W 1.5%
I/O Pads 6.9W 12%
Wires & Repeaters 10.7W 18.4%
39K Bug Detection Augmented Flip-Flops 0.9W 1.5%
Field Programmable Framework 0.35W 0.6%
Segment Checking Tree (16 entries per node) 0.74W 1.3%
3.5% Power Overhead
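The slide does not state how the 3.5% figure was derived. One reading that reproduces it is the power of the three added components divided by the power of the unmodified chip, as in the arithmetic sketch below. The grouping of components into "baseline" and "overhead" is an assumption; the wattages are the ones listed on the slide.

```python
baseline_w = {
    "Cores & L1 Caches": 14.4,
    "L2 Cache": 9.0,
    "Leakage": 13.7,
    "Crossbar": 0.6,
    "Misc. Units": 0.9,
    "I/O Pads": 6.9,
    "Wires & Repeaters": 10.7,
}
overhead_w = {
    "Bug Detection Augmented Flip-Flops": 0.9,
    "Field Programmable Framework": 0.35,
    "Segment Checking Tree": 0.74,
}

baseline_total = sum(baseline_w.values())   # about 56.2 W
overhead_total = sum(overhead_w.values())   # about 1.99 W
overhead_pct = 100 * overhead_total / baseline_total

print(f"baseline {baseline_total:.2f} W, overhead {overhead_total:.2f} W, "
      f"overhead vs. baseline {overhead_pct:.1f}%")
```

Under the same assumption, baseline plus overhead comes to roughly 58.2 W, which is consistent with the 58W power budget quoted on the slide.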
Contributions
RTL-level analysis of the design bugs of a commercial processor
- Bugs have unique source signals that are hard to predict at design time
- Monitored signals need to be selected in the field after bug discovery
- Current techniques not flexible enough - select signals at design time
Proposed a flexible online bug detection mechanism
- Monitor all control flip-flops in OpenSPARC T1
- Set of monitored signals can be selected in the field using firmware
- RTL prototype: 80% bug coverage for 10% area overhead
Future Work - Evaluation Challenges
Current infrastructure insufficient to measure false positive rate
- Functional simulators: Lack of RTL level detail
- RTL simulators: Too slow to run applications
Developing a hardware prototype of our framework on FPGA
- Uncomment design bug fixes in RTL code of OpenSPARC T1
- Evaluate the effectiveness of our framework on real applications
- Measure false positive rate
- Explore trade-off between bug coverage and performance
Thank You!
Questions?
Online Bug Detection & Avoidance: A Microprocessor Airbag
- Extra cost without any performance/utility benefits
- The microprocessor designers shouldn't rely on it
- No guarantee of success - Doesn't cover all possible design bugs
Car airbags reduce fatalities by 8% when seat belts are worn
Objective: Reduce the risk of serious implications when critical design bugs are discovered after product release
RTL Algorithmic Design Bugs
//bug4814 - change rrobin_picker1 to rrobin_picker2
// Choose one among 4 loads.
//lsu_rrobin_picker1 ld4_rrobin (
//.events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
...
//.se(se),
//.so()
//);

lsu_rrobin_picker2 ld4_rrobin (
.events({ld3_pcx_rq_vld,ld2_pcx_rq_vld,ld1_pcx_rq_vld,ld0_pcx_rq_vld}),
...
.se(se),
.so()
);
Design bug in Verilog code – lsu_qctl1.v
- Algorithmic deviations from the design specifications
- Require major modifications to be fixed
RTL Timing Design Bugs
// Begin - Bug3487.
...
dff #(48) ifu_std_d1 (
        .din (tlb_st_data[47:0]),
        .q (lsu_ifu_stxa_data[47:0]),
        .clk (asi_data_clk),
        .se (1'b0), .si (), .so ()
);

// select is now a stage earlier, which should be
// fine as selects stay constant.
//assign lsu_ifu_stxa_data[47:0] = tlb_st_data_d1[47:0] ;

// End - Bug3487.
Design bug in Verilog code – lsu_qdp1.v
- Signals need to be latched a cycle earlier or later to keep correctness
- Addition or removal of flip-flops is the most common fix
RTL OpenSPARC T1 Design Bug Distribution
Load/Store Unit (LSU): 157 Design Bugs
Trap Logic Unit (TLU): 139 Design Bugs
Power Consumption Estimation Methodology
Methodology/Tools Used Design Components
Synopsys Power Compiler 1) SPARC Cores, 2) Crossbar, 3) FPU, 4) Misc. Units (I/O Bridge, DRAM Controllers, Control & Test Unit), 5) ACE Framework, 6) Online Design Bug Detection Mechanism
CACTI 4.2 1) L1 Inst. & Data Caches, 2) L2 Cache
Taken from * 1) I/O Pads, 2) Wires & Repeaters
* A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher. A Power-Efficient High-Throughput 32-Thread SPARC Processor, In IEEE Journal of Solid-State Circuits, 42(1), 2006
RTL Analysis Results
Metrics LSU TLU
Min./Average/Max. number of first-level monitor signals per logic design bug 2/8/43 2/12/44
Min./Average/Max. number of source-level monitor signals per logic design bug 2/17/97 2/24/89
Source-level monitor signal sharing among different design bugs 68% 64%
Average number of unique source-level monitor signals per logic design bug 6 9
Unique source-level monitor signals (for all logic design bugs) 516 602
Merging Bug Signatures
4-bit Bug Detection Segments
Design Bug #1
Bug Signature #1: X X 1 0  X 0 1 X  X X X X
Bug Signature #2: X X 1 1  X 0 1 X  X X X X
Intermediate Signature #1: X X 1 X  X 0 1 X  X X X X
Design Bug #2
Bug Signature #1: X X X X  X 0 X 1  1 X 1 0
Bug Signature #2: X X X X  X 0 X 1  0 X 1 1
Intermediate Signature #2: X X X X  X 0 X 1  X X 1 X
System Bug Signature: X X 1 X  X 0 X X  X X 1 X
Merge case per segment, left to right: CASE 2, CASE 1, CASE 2
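The slide's example can be reproduced with a segment-wise merge. The transcript does not define the two cases, so the rule below is inferred from the example values: when one signature leaves a segment entirely as don't-cares, the other signature's segment is kept (taken here to be CASE 2); when both use the segment, only the positions where they agree are kept (taken here to be CASE 1). Function names are illustrative.

```python
SEGMENT_BITS = 4

def merge_segment(seg_a: str, seg_b: str) -> str:
    # Assumed CASE 2: one side does not use this segment, so keep the other side as-is.
    if set(seg_a) == {"X"}:
        return seg_b
    if set(seg_b) == {"X"}:
        return seg_a
    # Assumed CASE 1: both sides use the segment; disagreeing positions become X.
    return "".join(a if a == b else "X" for a, b in zip(seg_a, seg_b))

def merge(sig_a: str, sig_b: str) -> str:
    return "".join(
        merge_segment(sig_a[i:i + SEGMENT_BITS], sig_b[i:i + SEGMENT_BITS])
        for i in range(0, len(sig_a), SEGMENT_BITS)
    )

intermediate_1 = merge("XX10X01XXXXX", "XX11X01XXXXX")  # design bug #1
intermediate_2 = merge("XXXXX0X11X10", "XXXXX0X10X11")  # design bug #2
system_signature = merge(intermediate_1, intermediate_2)

print(intermediate_1, intermediate_2, system_signature)
```

Keeping a segment unchanged in the CASE 2 situation is presumably safe because a per-bug table entry can mark that segment as ignored for the bug that does not use it, though the slide does not spell this out.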
High-Level Overview
1. Generate the bug signatures based on bug triggering conditions (Design Bugs & Triggering Conditions)
   Bug Signature Collection:
   BUG#1 XXX1X0…X1X0XX
         XXX0X0…X1X1XX
   BUG#2 X101XX…XX01XX
   …
   BUG#N XXXX1X…X101XX
2. Merge Bug Signatures into the System Bug Signature (example: X 1 X 0 X 1 0 X)
3. Firmware encodes and loads the system bug signature to the bug detection segments
4. Firmware loads the segment match detection entries into the Segment Match Detection Tables
5. Cycle-by-cycle online checking for design bugs: the Bug Detection Segments compare the System State (Flip-Flops) against the signature and produce match/mismatch signals
6. Aggregate bug detection segment match/mismatch signals to a global bug detection signal (Segment Checking Tree, Global Bug Detection Signal)
7. If the global bug detection signal flags a bug, system recovery is triggered (Design Bug Recovery Handler)
OpenSPARC T1 Data & Control Flip-Flops
Chip Submodule Data Signals Control Signals
SPARC Core (x8) 15632 (79.06%) 4140 (20.94%)
CPU-Cache Crossbar 27283 (98.69%) 362 (1.31%)
Floating-Point Unit 4054 (87.75%) 566 (12.25%)
Control & Test Unit 2325 (55.29%) 1880 (44.71%)
Input/Output Bridge 10251 (95.14%) 524 (4.86%)
DRAM Controller (x4) 13449 (94.70%) 752 (5.30%)
Total 222765 (84.95%) 39460 (15.05%)
Synergistic Online Bug & Defect Detection
[Timeline figure]
- System Startup: Firmware Load Design Bug Data (Bug Signature, Segment Match Entries)
- Computation & Online Design Bug Checking, separated by State Checkpoints
- Design Bug Detected: State Recovery, then Avoid Design Bug
- Firmware Test for Hardware Defects (No Design Bug Checking during the test)
- Hardware Defect, Defect Detected: State Recovery, then Hardware Repair