Quick Error Detection for Effective Post-Silicon Validation
Run tests → Detect bugs → Localize bugs → Root-cause & fix
Debug time: 1-4 weeks per bug
Stanford University, Intel Corporation
David Lin, Christine Cheng, Ted Hong, Yanjing Li, Farzan Fallah, Donald S. Gardner, Nagib Hakim, Subhasish Mitra
QED: Core + Uncore, wide variety, diversity
[Figure: 8 Cores FFT Test from Splash2. Cumulative bugs detected (0-100%) vs. error detection latency (cycles, 100 to >10M); Original vs. PLC+QED; annotations: 10^4X, 2X.]
[Figure: 8 Cores LU Test from Splash2. Cumulative bugs detected (0-100%) vs. error detection latency (cycles, 100 to >10M); Original vs. PLC+QED; annotations: 10^4X, 2X.]
Long Error Detection Latency
Timeline of test execution: Error occurred → Error detected
Error detection latency: the time from when an error occurs to when it is detected
Ideal: ~1,000 cycles
Reality: ~billions of cycles
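The gap between ideal and actual latency can be made concrete with a toy model, assuming (hypothetically) that an original test only compares its final result at the end of execution, while a QED-style test checks at a fixed interval. The function `detection_latency` and its parameters are illustrative and not part of QED itself:

```python
from typing import Optional


def detection_latency(t_err: int, n_cycles: int, check_interval: Optional[int]) -> int:
    """Cycles from the error occurring (at cycle t_err) to its detection.

    check_interval=None models an original test that only compares its final
    result at the end of execution (cycle n_cycles). Otherwise a check runs
    every check_interval cycles (hypothetical uniform spacing) and catches the
    error at the first check after t_err, or at the end of the test.
    """
    if check_interval is None:
        detect_at = n_cycles
    else:
        detect_at = min((t_err // check_interval + 1) * check_interval, n_cycles)
    return detect_at - t_err


# An error at cycle 1,000 of a 1-billion-cycle test:
print(detection_latency(1_000, 10**9, None))   # 999999000 (end-of-test check only)
print(detection_latency(1_000, 10**9, 1_000))  # 1000 (check every 1,000 cycles)
```

In this invented scenario the end-only check waits almost the entire run, while periodic checks bound the latency near the ideal ~1,000 cycles. It is arithmetic on an assumed setup, not a reproduction of the measured results.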
Original Tests (Test 1, Test 2, …, Test N) → QED Tests (QED Test 1, QED Test 2, …, QED Test N)
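Each original test is turned into a QED test by an automated program transformation. A minimal sketch of one such rewrite, modeled on the EDDI-V-style transformation example in this section, follows each instruction with a duplicate on primed registers and a Check. Assumptions: straight-line three-address code only (no loads, stores, or branches), all duplicates initialized up front, an ASCII `'` for the prime, and hypothetical names (`Instr`, `dup`, `eddi_v`):

```python
from typing import List, Tuple, Union

Operand = Union[str, int]                  # register name or integer literal
Instr = Tuple[str, str, Operand, Operand]  # (dst, op, src1, src2), e.g. A = B * 2


def dup(x: Operand) -> Operand:
    """Duplicate-register name (A -> A'); literals are shared, not duplicated."""
    return x + "'" if isinstance(x, str) else x


def eddi_v(program: List[Instr]) -> List[str]:
    """Emit duplicate initialization, then for every instruction: the original,
    its duplicate on primed registers, and a Check comparing the two results."""
    regs = sorted({r for ins in program for r in (ins[0], ins[2], ins[3])
                   if isinstance(r, str)})
    out = [f"{dup(r)}={r}" for r in regs]  # A'=A, B'=B, ...
    for dst, op, a, b in program:
        out.append(f"{dst} = {a} {op} {b}")
        out.append(f"{dup(dst)} = {dup(a)} {op} {dup(b)}")
        out.append(f"Check({dst}=={dup(dst)})")
    return out


for line in eddi_v([("A", "*", "B", 2), ("H", "+", "D", "E")]):
    print(line)
```

Because the rewrite is purely mechanical, it can be applied to every test in a suite without hand-editing.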
[Figure: Detected error count (normalized to QED) vs. error detection latency (clock cycles), bins 0-10K and 1-10 Billion; QED vs. No-QED; annotations: 10^6X, 4X.]
[Figure: 8 Cores Industrial Validation Test. Cumulative memory bugs detected (0-100%) vs. error detection latency (cycles: 100, 1K, 10K, 10 Billion); Original vs. PLC+QED Improved; annotation: 10^6X.]
Intel® Core™ i7 hardware
Localization Dominates Cost
Quick Error Detection
QED Core + Uncore Transformation Example
QED technique          | Code change | Hardware change | Detection latency | Targeted component
EDDI-V                 | Some        | None            | Small             | Core
CFCSS-V                | Some        | None            | Small             | Core
SW-RMT-V               | Some        | None            | Small             | Core
HW-RMT-V               | None        | Some            | Small             | Core
Proactive Load & Check | Some        | None            | Small             | Uncore
Core 1, Core 2, ..., Core N: each core's instruction stream contains periodic <PLC mem[a..z]> calls.
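Placing PLC calls in every core's instruction stream can be sketched as a simple rewrite of each thread. The fixed interval `every` is an assumption made for illustration (the slides do not state how often PLC runs), and `insert_plc` / `insert_plc_all_cores` are hypothetical names:

```python
from typing import List

PLC_CALL = "<PLC mem[a..z]>"


def insert_plc(thread: List[str], every: int) -> List[str]:
    """Append a PLC call after every `every` instructions of one thread."""
    if every <= 0:
        raise ValueError("every must be positive")
    out: List[str] = []
    for n, instr in enumerate(thread, start=1):
        out.append(instr)
        if n % every == 0:
            out.append(PLC_CALL)
    return out


def insert_plc_all_cores(threads: List[List[str]], every: int) -> List[List[str]]:
    """Apply the same insertion to every thread on every core (Core 1 ... Core N)."""
    return [insert_plc(t, every) for t in threads]


print(insert_plc(["i1", "i2", "i3", "i4", "i5"], every=2))
# ['i1', 'i2', '<PLC mem[a..z]>', 'i3', 'i4', '<PLC mem[a..z]>', 'i5']
```

A smaller interval checks memory more often (shorter latency for uncore errors) at the cost of more PLC instructions in the test.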
A’=A  B’=B  C’=C
A = B * 2          A’ = B’ * 2          Check(A==A’)
D’=D  E’=E  F’=F  G’=G  H’=H
E = F * G          E’ = F’ * G’         Check(E==E’)
H = D + E          H’ = D’ + E’         Check(H==H’)
E’=E  I’=I  J’=J  K’=K
I = E / 2          I’ = E’ / 2          Check(I==I’)
Load J ← mem[z]    Load J’ ← mem[z’]    Check(J==J’)
K = J + 1          K’ = J’ + 1          Check(K==K’)
Lock(a); Lock(a’); Store mem[a] ← C; Store mem[a’] ← C’; Unlock(a’); Unlock(a)
Lock(c); Lock(c’); Store mem[c] ← H; Store mem[c’] ← H’; Unlock(c’); Unlock(c)
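To see why an immediate Check shortens detection latency, the transformed example can be executed in a small Python model. Assumptions: registers and memory are plain dicts (with an ASCII `'` for the prime), `/` is integer division, every duplicate is initialized at the start, and `fault` is a hypothetical hook that flips one bit of an original result to emulate a bug; `run_example` and `QEDError` are illustrative names:

```python
import threading
from typing import Callable, Dict, Optional


class QEDError(Exception):
    """Raised when an original value and its duplicate disagree."""


def run_example(regs: Dict[str, int], mem: Dict[str, int],
                fault: Optional[str] = None) -> Dict[str, int]:
    """Execute the transformed example; mem holds both x and x' cells.

    `fault` (hypothetical injection hook) flips bit 0 of the named original
    register right after that register is written.
    """
    locks = {addr: threading.Lock() for addr in mem}
    r = dict(regs)  # original registers
    d = dict(regs)  # duplicates, initialized as A'=A, B'=B, ...

    def step(dst: str, f: Callable[[Dict[str, int], str], int]) -> None:
        r[dst] = f(r, "")     # original instruction
        if fault == dst:
            r[dst] ^= 1
        d[dst] = f(d, "'")    # duplicate instruction on primed registers/addresses
        if r[dst] != d[dst]:  # Check(dst == dst')
            raise QEDError(f"Check({dst}=={dst}') failed: {r[dst]} != {d[dst]}")

    step("A", lambda x, s: x["B"] * 2)
    step("E", lambda x, s: x["F"] * x["G"])
    step("H", lambda x, s: x["D"] + x["E"])
    step("I", lambda x, s: x["E"] // 2)   # integer division assumed
    step("J", lambda x, s: mem["z" + s])  # Load J <- mem[z]; Load J' <- mem[z']
    step("K", lambda x, s: x["J"] + 1)

    # Stores write original and shadow copies under Lock(a); Lock(a').
    # `with` releases in reverse order: Unlock(a'); Unlock(a).
    for addr, src in (("a", "C"), ("c", "H")):
        with locks[addr], locks[addr + "'"]:
            mem[addr] = r[src]
            mem[addr + "'"] = d[src]
    return r


memory = {"a": 0, "a'": 0, "c": 0, "c'": 0, "z": 10, "z'": 10}
print(run_example({"B": 3, "C": 7, "D": 1, "F": 4, "G": 5}, memory)["K"])  # 11
```

Passing `fault="E"` makes `run_example` raise `QEDError` at `Check(E==E')`, right after the corrupted instruction, instead of letting the bad value propagate into H, I, and memory.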
<PLC mem[a..z]>, executed by ALL threads on ALL cores:
  for i in [a..z], i’ in [a’..z’]:
    Lock(i); Lock(i’)
    Load X ← mem[i]
    Load X’ ← mem[i’]
    Check(X == X’)
    Unlock(i’); Unlock(i)
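A minimal Python model of the PLC routine above, assuming a parallel dict stands in for the shadow range [a’..z’] and one `threading.Lock` guards each copy. Stores take the same two locks in the same order, so a store can never land between PLC's two loads, and a mismatch therefore indicates a corrupted copy rather than a race. `ShadowMemory`, `store`, `plc`, and `PLCMismatch` are hypothetical names:

```python
import threading
from typing import Dict, Iterable


class PLCMismatch(Exception):
    """Raised when mem[i] and its shadow mem[i'] disagree."""


class ShadowMemory:
    """Toy memory in which every address i has a shadow copy i'.

    Assumption: shadows live in a parallel dict with one lock per copy,
    standing in for the separate address range [a'..z'].
    """

    def __init__(self, addrs: Iterable[str]) -> None:
        self.mem: Dict[str, int] = {a: 0 for a in addrs}
        self.shadow: Dict[str, int] = dict(self.mem)
        self.lock = {a: threading.Lock() for a in self.mem}
        self.shadow_lock = {a: threading.Lock() for a in self.mem}

    def store(self, addr: str, value: int, value_dup: int) -> None:
        # Lock(i); Lock(i'); store both copies; Unlock(i'); Unlock(i)
        with self.lock[addr], self.shadow_lock[addr]:
            self.mem[addr] = value
            self.shadow[addr] = value_dup

    def plc(self) -> None:
        """Proactive Load & Check over every address, one pair at a time."""
        for i in self.mem:
            with self.lock[i], self.shadow_lock[i]:
                x, x_dup = self.mem[i], self.shadow[i]  # Load X; Load X'
                if x != x_dup:                          # Check(X == X')
                    raise PLCMismatch(f"mem[{i}]={x} but mem[{i}']={x_dup}")
```

Several threads can call `store` and `plc` concurrently without false alarms, while overwriting only the original copy of one address (emulating a bug that corrupts a single copy) makes the next `plc()` call raise `PLCMismatch`.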
Key challenge: long error detection latency
New technique: Quick Error Detection (QED)
• Systematic, structured, automated
• Error detection latency: 10^6X improved
• Coverage: 4X improved
• Software only: readily applicable