Hardware Errors and the OS
Transcript of Hardware Errors and the OS
Introduction
Hardware Errors and the OS
Bjorn Dobel () OS Resilience 06.08.2013 30 / 58
Hardware Errors in Theory
[Figure: transistor cross-section showing bulk substrate, source, drain, gate, and oxide layer]
Radiation-induced errors
  Cosmic radiation
  Alpha particles emitted by packaging
Thermal stress
Aging of circuitry
  Electromigration
  Hot Carrier Injection
  Negative-Bias Temperature Instability
Hardware Errors in the Real World
Several studies investigated manifestation of hardware errors in software:
Saggese, 2005
  85% of hardware errors masked
  Error outcome depends on affected HW unit
Li, 2008, focus on permanent errors
  Permanent errors mainly lead to crashes / HW exceptions
  65% of errors corrupt OS state before crashing
Arlat, 2002, Chorus and LynxOS microkernels
  Significant amount (30%) of "no change" errors
  Some OS components are more error-prone than others
Wang, 2003, focus on branching errors
  Several cases (up to 40%) where taking a different branch does not change the program result
Challenges and Opportunities
Challenge: detect and correct hardware errors in software
Optimization Potential: don’t track harmless errors
Challenge: Binary applications
Optimization Potential: Hardware-Level Concurrency
Fault Tolerance: State of the Union
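[Figure: 2x2 grid of fault-tolerance approaches. Columns: non-COTS, COTS. Rows: hardware errors, software errors. Entries shown: RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS, SeL4, Minix3, Carburizer, SWIFT, Encoded Processing, Romain]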
CS 101
[Figure: Inputs -> Application (Compute) -> Outputs]
Determinism property: the same inputs always produce the same outputs
Redundant execution
[Figure: collected inputs are fed to three replicas (App, App', App''), whose outputs are compared for equality (=)]
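The redundant-execution idea can be sketched in a few lines of Python. This is an illustration only, not Romain's implementation: the names `run_replicated` and `flaky_app` are hypothetical, and replicas are modeled as plain function calls rather than separate address spaces. Every replica receives the same collected inputs and the outputs are compared. With three replicas (TMR) a single deviating replica can be outvoted; with two (DMR) a mismatch can only be detected.

```python
from collections import Counter

def run_replicated(app, inputs, replicas=3):
    """Run `app` on identical inputs in each replica and vote on the outputs."""
    outputs = [app(replica_id, inputs) for replica_id in range(replicas)]
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes == replicas:
        return winner, "all replicas agree"
    if votes > replicas // 2:
        return winner, "error corrected by majority vote"
    raise RuntimeError("replica outputs diverge and no majority exists")

def flaky_app(replica_id, inputs):
    """Hypothetical deterministic app; replica 1 suffers a simulated bit flip."""
    result = sum(inputs)
    if replica_id == 1:
        result ^= 0x4  # simulated single-bit hardware error
    return result

print(run_replicated(flaky_app, (1, 2, 3)))
```

Calling `run_replicated(flaky_app, (1, 2, 3), replicas=2)` raises `RuntimeError`, since two disagreeing replicas cannot form a majority.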
Inputs and Outputs
Inputs                                Outputs
System Calls                          System Calls
Shared Memory                         Shared Memory
I/O Memory                            I/O Memory
Special Instructions (e.g., rdtsc)
Hardware Interrupts                   Hardware Exceptions (e.g., page faults)
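Nondeterministic inputs such as the result of rdtsc would make replicas diverge even without a fault, so they have to be obtained once and handed to every replica. A minimal sketch of that idea, assuming a hypothetical `InputReplicator` class and using `time.perf_counter_ns()` as a stand-in for a timestamp-counter read:

```python
import time

class InputReplicator:
    """Reads each nondeterministic input once and replays it to all replicas."""

    def __init__(self, source):
        self._source = source      # callable producing a nondeterministic value
        self._log = []             # values already handed out, in request order

    def read(self, request_index):
        """Return the value for the given request; sample the source only once."""
        while len(self._log) <= request_index:
            self._log.append(self._source())
        return self._log[request_index]

replicator = InputReplicator(time.perf_counter_ns)

# Three replicas each issue their first timestamp request (index 0).
seen = [replicator.read(0) for _replica in range(3)]
print(len(set(seen)) == 1)  # checks that all replicas observed the same value
```

Keying the log by request index rather than by replica is a design choice of this sketch: it lets replicas reach the same request at different times and still observe identical values.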
Process-Level Redundancy [Shye 2007]
Binary recompilation
Complex, unprotected compiler
Architecture-dependent
Reuse OS mechanisms
System calls for replica synchronization
Additional synchronization events
Virtual memory fault isolation
Restricted to Linux user-level programs
Microkernel-based
Transparent Replication as OS Service
[Figure: system architecture with the labels Reliable Computing Base, Replicated Driver, Unreplicated Application, Replicated Application, L4 Runtime Environment, Romain, and L4/Fiasco.OC microkernel]
Romain: Structure
[Figure: three Replicas on top of a Master; the Master contains a System Call Proxy and a Resource Manager and compares replica states (=)]
Resource Management: Capabilities
[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master]
Partitioned Capability Tables
[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master. Legend: marked used, master private]
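One way to read the legend: slots that the master needs for its own capabilities are marked as used in each replica's table, so a replica's allocator never hands them out, and every replica obtains the same slot index for the same request. The sketch below illustrates that reading only. `CapTable`, the slot numbers, and the choice of which slots are master-private are assumptions, not taken from the slides.

```python
class CapTable:
    """Toy capability table with a fixed number of slots (numbered from 1)."""

    def __init__(self, size, reserved=()):
        self.size = size
        self.used = set(reserved)   # slots pre-marked as used (master-private)

    def alloc(self):
        """Return the lowest free slot index and mark it used."""
        for slot in range(1, self.size + 1):
            if slot not in self.used:
                self.used.add(slot)
                return slot
        raise MemoryError("capability table full")

MASTER_PRIVATE = {5, 6}             # assumed partition, for illustration only

replica1 = CapTable(6, reserved=MASTER_PRIVATE)
replica2 = CapTable(6, reserved=MASTER_PRIVATE)

# Both replicas perform the same allocations and obtain identical indices,
# none of which collide with the master's private slots.
slots1 = [replica1.alloc() for _ in range(4)]
slots2 = [replica2.alloc() for _ in range(4)]
print(slots1, slots2)
```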
Replica Memory Management
[Figure: memory regions of Replica 1 and Replica 2, each with one read-write (rw) and two read-only (ro) regions, above the Master]
Shared Memory
Not in complete control of master
Standard technique: trap & emulate
  Execution overhead (x100 - x1000)
  Adds complexity to RCB
    Disassembler 6,000 LoC
    Tiny emulator 500 LoC
Our implementation: copy & execute
Copy&Execute
[Figure: Master and Replica. The replica's instruction mov eax, [ebx] is copied into a NOP-padded buffer in the master, which runs: load replica state; the copied instruction; restore master state]
Runtime Overhead
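[Figure: SPEC INT 2006 runtime normalized vs. native execution for Single, DMR, and TMR configurations. Benchmarks: 400.perl, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnet++, 473.astar. Y-axis from 1 to 1.3; two bars are annotated with the values 1.45 and 1.95]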
Replica-Core Placement Matters
[Figure: runtime normalized vs. native execution for 429.mcf, 462.libquantum, and 471.omnet++, each shown next to an "adj" variant. Y-axis from 1 to 1.3]
Romain Lines of Code
Component                             LoC
Base code (main, logging, locking)    325
Application loader                    375
Replica manager                       628
Redundancy                            153
Memory manager                        445
System call proxy                     311
Shared memory                         281
Total                               2,518

Fault injector                        668
GDB server stub                     1,304
User land is covered!
[Figure: system architecture with the labels Reliable Computing Base, Replicated Driver, Unreplicated Application, Replicated Application, L4 Runtime Environment, Romain, and L4/Fiasco.OC microkernel]
Minimizing the RCB
What to minimize?
Lines of Code (as in TCB)?
Time spent executing RCB code?
More likely: runtime × vulnerability
Hardening the RCB
We need: Dedicated mechanisms to protect the RCB (HW or SW)
We have: Full control over software
RAD-hardened hardware?
  Too expensive
Embrace heterogeneity!
  IBM Cell
  ARM big.LITTLE
Our proposal: Split HW into ResCores and NonResCores
[Figure: one ResCore alongside ten NonResCores]
Signaling Performance
[Figure: Overhead by notification method. Overhead in % (axis from 10 to 60) for the susan and CRC32 benchmarks under DMR and TMR. Methods: Local Faults, Migration, Sync IPC, Shared Mem]
Fast shared-memory message passing would be good!
  Intel SCC / Knights Corner
RCB/Non-RCB boundary is vulnerable
  Messaging / Exceptions need to function
  Must not overwrite other data
Is software-level protection feasible?
We have full source of the RCB.
Compiler support for fault tolerance (SWIFT, AN-Encoded Processing) may help.
Hasn’t been done for kernel code yet.
Gedankenexperiment:
  We know how much RCB-related execution is added due to replication.
  We know average overheads for SWIFT (9.5%) and AN encoding (390%)
Modeling software-level RCB protection
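Time components:
  Application code: $t_{app}$
  Kernel (system calls): $t_{kern}$
  Romain master code: $t_{master}$
  Additional kernel invocations: $t'_{kern}$
  Hardware stalls (e.g., caching): $t_{hw}$
Native execution time: $t_{nat} = t_{app} + t_{kern}$
Replication overhead: $t_{rep} = t_{master} + t'_{kern} + t_{hw}$
$T = t_{nat} + t_{rep} = t_{app} + t_{kern} + t_{master} + t'_{kern} + t_{hw}$
$T_{prot} = t_{app} + C \times (t_{kern} + t_{master} + t'_{kern} + t_{hw})$
With $t_{kern} = t'_{kern} = t_{hw} = 0$:
$T_{prot} = t_{app} + C \times t_{master}$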
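As a worked example of the model, the sketch below evaluates T_prot = t_app + C × t_master for the two overhead figures quoted in the Gedankenexperiment (SWIFT 9.5%, AN encoding 390%). Interpreting C as 1 plus the relative overhead is an assumption, and the values for t_app and t_master are invented purely for illustration.

```python
def protected_runtime(t_app, t_master, overhead):
    """T_prot = t_app + C * t_master, with C = 1 + relative overhead (assumed)."""
    c = 1.0 + overhead
    return t_app + c * t_master

# Hypothetical measurements in seconds; not taken from the slides.
t_app, t_master = 100.0, 5.0

unprotected = t_app + t_master
for name, overhead in [("SWIFT", 0.095), ("AN encoding", 3.90)]:
    t_prot = protected_runtime(t_app, t_master, overhead)
    print(f"{name}: T_prot = {t_prot:.3f} s, "
          f"normalized to unprotected Romain = {t_prot / unprotected:.3f}")
```

Because only t_master is scaled, even the large AN-encoding factor adds a bounded cost when the master accounts for a small share of total runtime.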
Estimating RCB protection runtime
[Figure: SPEC INT 2006 runtime normalized vs. native execution for Romain only, Romain+SWIFT, and Romain+ANBD. Benchmarks: 400.perl, 401.bzip2, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnet++, 473.astar. Y-axis from 1 to 1.6]
Summary
OS-level techniques to tolerate SW and HW faults
Address-space isolation
Microreboots
Various ways of handling session state
Replication against hardware errors
Special care needed to protect Reliable Computing Base
Further Reading
Minix3: Jorrit Herder, Ben Gras, Philip Homburg, Andrew S. Tanenbaum: Fault Isolation for Device Drivers, DSN 2009
CuriOS: Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle and Roy H. Campbell: CuriOS: Improving Reliability through Operating System Structure, OSDI 2008
L4ReAnimator: Dirk Vogt, Bjorn Dobel, Adam Lackorzynski: Stay strong, stay safe: Enhancing Reliability of a Secure Operating System, IIDS 2010
Reliability Analysis:
  Saggese et al.: An Experimental Study of Soft Errors in Microprocessors, IEEE Micro 2005
  Li et al.: Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design, ASPLOS 2008
  Arlat et al.: Dependability of COTS Microkernel-Based Systems, IEEE ToCS 2002
  Wang et al.: Y-Branches: When you come to a Fork in the Road: Take it!, PACT 2003
PLR: Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomsted, Ramesh Peri: Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, DSN 2007
Romain:
  Bjorn Dobel, Hermann Hartig, Michael Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
  Bjorn Dobel, Hermann Hartig: Who watches the watchmen? – Protecting Operating System Reliability Mechanisms, HotDep 2012