Hardware Errors and the OS


Lecture by Bjoern Doebel for Summer Systems School'13 (http://sss.ksyslabs.org)

Transcript of Hardware Errors and the OS

Page 1: Hardware Errors and the OS

Introduction

Hardware Errors and the OS

Björn Döbel, OS Resilience, 06.08.2013, 30 / 58

Page 2: Hardware Errors and the OS

Hardware Errors in Theory

[Figure: transistor cross-section with the labels Gate, Oxide Layer, Source, Drain, and Bulk Substrate]

Radiation-induced errors
  Cosmic radiation
  Alpha particles emitted by packaging

Thermal stress

Aging of circuitry
  Electromigration
  Hot Carrier Injection
  Negative-Bias Temperature Instability

Page 4: Hardware Errors and the OS

Hardware Errors in the Real World

Several studies investigated how hardware errors manifest in software:

Saggese, 2005
  85% of hardware errors are masked
  Error outcome depends on the affected HW unit

Li, 2008, focus on permanent errors
  Permanent errors mainly lead to crashes / HW exceptions
  65% of errors corrupt OS state before crashing

Arlat, 2002, Chorus and LynxOS microkernels
  Significant amount (30%) of "no change" errors
  Some OS components are more error-prone than others

Wang, 2003, focus on branching errors
  In several cases (up to 40%), taking a different branch does not change the program result

Page 6: Hardware Errors and the OS

Challenges and Opportunities

Challenge: detect and correct hardware errors in software

Optimization Potential: don’t track harmless errors

Challenge: Binary applications

Optimization Potential: Hardware-Level Concurrency


Page 10: Hardware Errors and the OS

Fault Tolerance: State of the Union

[Figure: matrix with the columns non-COTS / COTS and the rows Hardware errors / Software errors. Entries, in order of appearance: RAD-hard CPUs, Redundant Multithreading, HP NonStop, IBM z/OS, SeL4, Minix3, Carburizer, SWIFT, Encoded Processing, Romain]

Page 16: Hardware Errors and the OS

CS 101

[Figure: Inputs -> Application (Compute) -> Outputs]

Determinism property: the same inputs produce the same outputs.

Page 17: Hardware Errors and the OS

Redundant execution

[Figure: the collected inputs are fed to three replicas (App, App’, App”), and the replicas' outputs are compared (=)]
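The redundant-execution idea can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the names `run_replicated`, `vote`, and `NoMajorityError` are hypothetical, and the replicas run sequentially in one process instead of as isolated address spaces. It shows why two replicas (DMR) can only detect a mismatch while three (TMR) can also outvote a single faulty result.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class NoMajorityError(Exception):
    """Raised when the replicas disagree and no strict majority exists."""


def vote(outputs: Sequence[T]) -> T:
    # Majority vote over the replica outputs (outputs must be hashable).
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise NoMajorityError(f"replicas disagree: {list(outputs)}")
    return value


def run_replicated(app: Callable[[Sequence[int]], T],
                   inputs: Sequence[int],
                   replicas: int = 3) -> T:
    # Every replica receives the same collected inputs.
    outputs = [app(tuple(inputs)) for _ in range(replicas)]
    return vote(outputs)


print(run_replicated(sum, [1, 2, 3]))  # prints 6
print(vote([42, 42, 41]))              # prints 42: the faulty replica is outvoted
```

With only two outputs that differ, `vote([42, 41])` raises `NoMajorityError`: the error is detected, but there is no way to tell which replica is correct.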

Page 22: Hardware Errors and the OS

Inputs and Outputs

Inputs                                Outputs
System Calls                          System Calls
Shared Memory                         Shared Memory
I/O Memory                            I/O Memory
Special Instructions (e.g., rdtsc)
Hardware Interrupts                   Hardware Exceptions (e.g., page faults)
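Why inputs such as rdtsc need special handling can be illustrated with a small Python sketch. It is an assumption-laden stand-in: `time.perf_counter_ns` plays the role of a timestamp-counter read, and `replicate_input` and `app` are hypothetical names. If every replica read the clock itself, the replicas could see different values and diverge; reading once and forwarding the value keeps them deterministic.

```python
import time
from typing import Callable, List


def replicate_input(read_once: Callable[[], int], replicas: int) -> List[int]:
    # Perform the nondeterministic read a single time ...
    value = read_once()
    # ... and hand the identical value to every replica.
    return [value] * replicas


def app(timestamp: int) -> int:
    # Stand-in for application logic that depends on an input value.
    return timestamp % 1000


# Each replica reading the clock on its own may observe a different value.
independent = [time.perf_counter_ns() for _ in range(3)]

# Reading once and forwarding the result keeps the replicas in agreement.
forwarded = replicate_input(time.perf_counter_ns, 3)
outputs = [app(t) for t in forwarded]
print(len(set(outputs)))  # prints 1: all replicas computed the same output
```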

Page 23: Hardware Errors and the OS

Process-Level Redundancy [Shye 2007]

Binary recompilation

Complex, unprotected compiler

Architecture-dependent

Reuse OS mechanisms

System calls for replica synchronization

Additional synchronization events

Virtual memory fault isolation

Restricted to Linux user-level programs

Microkernel-based


Page 25: Hardware Errors and the OS

Transparent Replication as OS Service

[Figure: system components: Replicated Driver, Unreplicated Application, Replicated Application, L4 Runtime Environment, Romain, and the L4/Fiasco.OC microkernel. A region is marked as the Reliable Computing Base]

Page 30: Hardware Errors and the OS

Romain: Structure

[Figure: a Master and three Replicas. The master contains a System Call Proxy and a Resource Manager and compares the replicas (=)]

Page 34: Hardware Errors and the OS

Resource Management: Capabilities

[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master]

Page 37: Hardware Errors and the OS

Partitioned Capability Tables

[Figure: capability tables with slots 1 to 6 for Replica 1, Replica 2, and the Master. Legend: Marked used; Master private]

Page 38: Hardware Errors and the OS

Replica Memory Management

[Figure: memory regions of Replica 1 and Replica 2, each labeled rw, ro, ro, together with the Master]

Page 41: Hardware Errors and the OS

Shared Memory

Not in complete control of master

Standard technique: trap & emulate
  Execution overhead (x100 - x1000)
  Adds complexity to RCB
  Disassembler 6,000 LoC
  Tiny emulator 500 LoC

Our implementation: copy & execute

Page 42: Hardware Errors and the OS

Copy&Execute

[Figure, built up over several steps: Master and Replica. The replica's instruction mov eax, [ebx] faults (X). The master holds a code buffer of the form: load repl. state; NOP; NOP; ...; NOP; restore master state. In the later steps, mov eax, [ebx] appears inside the master's buffer]

Page 50: Hardware Errors and the OS

Runtime Overhead

[Figure: SPEC INT 2006, runtime normalized vs. native execution for the configurations Single, DMR, and TMR. Benchmarks: 400 perl, 401 bzip2, 403 gcc, 429 mcf, 445 gobmk, 456 hmmer, 458 sjeng, 462 libquantum, 464 h264ref, 471 omnet++, 473 astar. Axis ticks from 1 to 1.3 in steps of 0.05; the values 1.45 and 1.95 also appear in the plot]

Page 51: Hardware Errors and the OS

Replica-Core Placement Matters

[Figure: runtime normalized vs. native execution (axis ticks from 1 to 1.3) for 429 mcf, 462 libquantum, and 471 omnet++, each shown in a plain and an "adj" variant]

Page 52: Hardware Errors and the OS

Romain Lines of Code

Base code (main, logging, locking)      325
Application loader                      375
Replica manager                         628
Redundancy                              153
Memory manager                          445
System call proxy                       311
Shared memory                           281
Total                                 2,518

Fault injector                          668
GDB server stub                       1,304

Page 53: Hardware Errors and the OS

User land is covered!

[Figure: system components: Replicated Driver, Unreplicated Application, Replicated Application, L4 Runtime Environment, Romain, and the L4/Fiasco.OC microkernel. A region is marked as the Reliable Computing Base]

Page 55: Hardware Errors and the OS

Minimizing the RCB

What to minimize?

Lines of Code (as in TCB)?

Time spent executing RCB code?

More likely: runtime × vulnerability

Page 58: Hardware Errors and the OS

Hardening the RCB

We need: Dedicated mechanisms to protect the RCB (HW or SW)

We have: Full control over software

RAD-hardened hardware?
  Too expensive

Embrace heterogeneity!
  IBM Cell
  ARM big.LITTLE

Our proposal: Split HW into ResCores and NonResCores

[Figure: grid of cores with one ResCore and ten NonResCores]

Page 60: Hardware Errors and the OS

Signaling Performance

[Figure: "Overhead by notification method", overhead in % (axis ticks from 10 to 60), for the benchmarks susan and CRC32 under DMR and TMR. Notification methods: Local Faults, Migration, Sync IPC, Shared Mem]

Fast shared-memory message passing would be good!
  Intel SCC / Knights Corner

RCB/Non-RCB boundary is vulnerable
  Messaging / Exceptions need to function
  Must not overwrite other data

Page 62: Hardware Errors and the OS

Is software-level protection feasible?

We have full source of the RCB.

Compiler support for fault tolerance (SWIFT, AN-Encoded Processing) may help.

Hasn’t been done for kernel code yet.

Gedankenexperiment:
  We know how much RCB-related execution is added due to replication.
  We know average overheads for SWIFT (9.5%) and AN encoding (390%)
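For readers unfamiliar with AN encoding, the basic arithmetic can be sketched as follows. This is the textbook idea only, not the compiler scheme referenced in the lecture: the constant `A = 641` and all function names are hypothetical choices for illustration. Values are stored as multiples of A, addition works directly on the encoded values, and a result that is no longer a multiple of A reveals corruption.

```python
A = 641  # hypothetical odd constant, chosen here only for illustration


def encode(x: int) -> int:
    # A valid code word is a multiple of A.
    return x * A


def is_valid(code: int) -> bool:
    return code % A == 0


def decode(code: int) -> int:
    if not is_valid(code):
        raise ValueError("corrupted code word detected")
    return code // A


def flip_bit(code: int, bit: int) -> int:
    # Simulate a single-bit hardware error.
    return code ^ (1 << bit)


# Addition is preserved by the encoding: A*x + A*y == A*(x + y).
total = encode(20) + encode(22)
print(decode(total))  # prints 42

# A single bit flip changes the value by +/- 2**bit, which is never a
# multiple of an odd A > 1, so the check below fails for the corrupted word.
corrupted = flip_bit(total, 5)
print(is_valid(corrupted))  # prints False
```

The extra multiplications and checks on every operation are one intuition for why AN encoding costs far more runtime than the figure quoted for SWIFT.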

Page 64: Hardware Errors and the OS

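Modeling software-level RCB protection

[Figure: execution time split into components]

  $t_{app}$      Application code
  $t_{kern}$     Kernel: system calls
  $t_{master}$   Romain master code
  $t'_{kern}$    Additional kernel invocations
  $t_{hw}$       Hardware stalls (e.g., caching)

Native execution time ($t_{nat}$): $t_{app}$ and $t_{kern}$. Replication overhead ($t_{rep}$): $t_{master}$, $t'_{kern}$, and $t_{hw}$.

\[
\begin{aligned}
T &= t_{nat} + t_{rep} \\
  &= t_{app} + t_{kern} + t_{master} + t'_{kern} + t_{hw} \\
T_{prot} &= t_{app} + C \times (t_{kern} + t_{master} + t'_{kern} + t_{hw})
\end{aligned}
\]

Assuming $t_{kern} = t'_{kern} = t_{hw} = 0$:

\[
T_{prot} = t_{app} + C \times t_{master}
\]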
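A worked example may make the simplified model, T_prot = t_app + C × t_master, more concrete. The timings below are invented for illustration, and the slides do not define C explicitly; the sketch assumes C = 1 + overhead, using the average overheads quoted for SWIFT (9.5%) and AN encoding (390%).

```python
def protected_runtime(t_app: float, t_master: float, c: float) -> float:
    """Simplified model: T_prot = t_app + C * t_master (kernel and HW terms set to 0)."""
    return t_app + c * t_master


# Hypothetical measurement: 100 s of application code, 5 s in the Romain master.
t_app, t_master = 100.0, 5.0

# Assumption: C is the slowdown factor of the protected code, C = 1 + overhead.
factors = {
    "Romain only": 1.0,
    "Romain+SWIFT": 1 + 0.095,  # 9.5% average overhead
    "Romain+AN": 1 + 3.90,      # 390% average overhead
}

for name, c in factors.items():
    t_prot = protected_runtime(t_app, t_master, c)
    print(f"{name:13s} T_prot = {t_prot:8.3f} s, normalized = {t_prot / t_app:.3f}")
```

Under these made-up numbers the protected runtimes come out at roughly 105.5 s with SWIFT and 124.5 s with AN encoding, compared with 105 s unprotected. Because only the master's share of the runtime is multiplied by C, even a costly mechanism adds a bounded amount to the total.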

Page 65: Hardware Errors and the OS

Estimating RCB protection runtime

[Figure: runtime normalized vs. native execution (axis ticks from 1 to 1.6) for the configurations Romain only, Romain+SWIFT, and Romain+ANBD. Benchmarks: 400 perl, 401 bzip2, 429 mcf, 445 gobmk, 456 hmmer, 458 sjeng, 462 libquantum, 464 h264ref, 471 omnet++, 473 astar]

Page 66: Hardware Errors and the OS

Summary

OS-level techniques to tolerate SW and HW faults

Address-space isolation

Microreboots

Various ways of handling session state

Replication against hardware errors

Special care needed to protect Reliable Computing Base


Page 67: Hardware Errors and the OS

Further Reading

Minix3: Jorrit Herder, Ben Gras, Philip Homburg, Andrew S. Tanenbaum: Fault Isolation for Device Drivers, DSN 2009

CuriOS: Francis M. David, Ellick M. Chan, Jeffrey C. Carlyle, Roy H. Campbell: CuriOS: Improving Reliability through Operating System Structure, OSDI 2008

L4ReAnimator: Dirk Vogt, Björn Döbel, Adam Lackorzynski: Stay strong, stay safe: Enhancing Reliability of a Secure Operating System, IIDS 2010

Page 68: Hardware Errors and the OS

Further Reading

Reliability Analysis:
  Saggese et al.: An Experimental Study of Soft Errors in Microprocessors, IEEE Micro 2005
  Li et al.: Understanding the Propagation of Hard Errors to Software and Implications for Resilient System Design, ASPLOS 2008
  Arlat et al.: Dependability of COTS Microkernel-Based Systems, IEEE ToCS 2002
  Wang et al.: Y-Branches: When you come to a Fork in the Road: Take it!, PACT 2003

PLR: Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Ramesh Peri: Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance, DSN 2007

Romain:
  Björn Döbel, Hermann Härtig, Michael Engel: Operating System Support for Redundant Multithreading, EMSOFT 2012
  Björn Döbel, Hermann Härtig: Who watches the watchmen? – Protecting Operating System Reliability Mechanisms, HotDep 2012