Automatic Data Structure Repair for Self-Healing Systems
Brian Demsky, Martin Rinard
Laboratory for Computer Science, Massachusetts Institute of Technology
Motivation
[Figure: Broken Data Structure. Objects with F = 20, G = 5 and F = 20, G = 10; I = 5; J = 2]
Errors
• Missing elements
• Inappropriate sharing
• Dangling references
• Out of bounds array indices
• Inconsistent values
Goal
[Figure: Broken Data Structure (F = 20, G = 5; F = 20, G = 10; I = 5; J = 2) is passed to the Repair Algorithm, which takes Consistency Properties From Developer and produces a Consistent Data Structure (F = 10, G = 5; F = 20, G = 10; F = 2, G = 1; I = 3; J = 2)]
What Does Repair Algorithm Produce?
• Data structure that
  • Satisfies consistency properties, and
  • Is heuristically close to broken data structure
• Not necessarily the same data structure as (hypothetical) correct program would produce
• But enough to keep program operating successfully
Precursors
• Data structure repair has historically appeared in systems with extreme reliability goals
  • 5ESS switch: hand-coded audit routines
  • IBM MVS operating system: hand-coded failure recovery routines
• Key component of these systems
Where Is This Likely To Be Useful?
• Not for systems with slack, which can just reboot
  • Cause of error must go away after reboot
  • Must be OK to lose volatile state
  • Must be OK to wait for reboot
• Persistent data structures (file systems, application files)
• Autonomous and/or safety critical systems
• Monitor/control unstable physical phenomena
• Largely independent subcomputations
• Moving time window
Architecture
[Figure: Broken Bits are mapped by Model Definition & Translation to a Broken Abstract Model; Internal Consistency Properties drive its repair into a Repaired Abstract Model; External Consistency Properties map that model to Repaired Bits]
Architecture Rationale
Why go through the abstract model?
• Simple, uniform structure
  • Sets of objects
  • Relations between objects
• Simplifies both
  • Expression of consistency properties
  • Repair algorithm
• Enables system to support full range of efficient, heavily encoded data structures
File System Example
[Figure: Directory Entries abst (firstBlock 0) and intro (firstBlock 2); Disk Blocks with nextBlock values 1, -5, 1, -1]
struct Entry {
  byte name[Length];
  int firstBlock;
}
struct Block {
  int nextBlock;
  byte data[BlockSize];
}
struct Disk {
  Entry dir[NumEntries];
  Block block[NumBlocks];
}
Disk D;
Model Definition
• Sets of objects
    set blocks of integer : partition used | free;
• Relations between objects: values of object fields, referencing relationships between objects
    relation next : used, used;
[Figure: set blocks partitioned into used and free; relation next between used blocks]
Model Translation
Bits translated to sets and relations in abstract model using statements of the form:
    Quantifiers, Condition ⇒ Inclusion Constraint
for i in 0..NumEntries, 0 ≤ D.dir[i].firstBlock and D.dir[i].firstBlock < NumBlocks ⇒ D.dir[i].firstBlock in used
for b in used, 0 ≤ D.block[b].nextBlock and D.block[b].nextBlock < NumBlocks ⇒ ⟨b, D.block[b].nextBlock⟩ in next
for ⟨b,n⟩ in next, true ⇒ n in used
for b in 0..NumBlocks, not (b in used) ⇒ b in free
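The four translation rules can be read as a small fixed-point computation. The C sketch below is only an illustration under stated assumptions: the `Model` layout (membership flags plus an adjacency matrix), the helper names, the sizes, and the disk contents in `example_model` (read off the example figure) are all hypothetical, and the `Disk` struct keeps only the two fields the rules read.

```c
#include <stdbool.h>

/* Assumed sizes for the slide example: 2 directory entries, 4 blocks. */
enum { NumEntries = 2, NumBlocks = 4 };

/* Only the fields the translation rules read (names and data omitted). */
struct Disk {
    int firstBlock[NumEntries]; /* D.dir[i].firstBlock  */
    int nextBlock[NumBlocks];   /* D.block[b].nextBlock */
};

/* Abstract model: sets as membership flags, relation as adjacency matrix. */
struct Model {
    bool used[NumBlocks];
    bool free_[NumBlocks];
    bool next[NumBlocks][NumBlocks]; /* next[b][n] means <b,n> in next */
};

static bool valid_block(int b) { return 0 <= b && b < NumBlocks; }

/* Apply the rules until nothing changes (rules 2 and 3 feed each other). */
void build_model(const struct Disk *d, struct Model *m)
{
    *m = (struct Model){0};

    /* Rule 1: valid firstBlock values are in used. */
    for (int i = 0; i < NumEntries; i++)
        if (valid_block(d->firstBlock[i]))
            m->used[d->firstBlock[i]] = true;

    bool changed = true;
    while (changed) {
        changed = false;
        for (int b = 0; b < NumBlocks; b++) {
            int n = d->nextBlock[b];
            if (!m->used[b] || !valid_block(n))
                continue;
            /* Rule 2: <b, nextBlock> in next. Rule 3: n in used. */
            if (!m->next[b][n]) { m->next[b][n] = true; changed = true; }
            if (!m->used[n])    { m->used[n] = true;    changed = true; }
        }
    }

    /* Rule 4: every block not in used is in free. */
    for (int b = 0; b < NumBlocks; b++)
        m->free_[b] = !m->used[b];
}

/* Assumed broken disk: firstBlock 0 and 2, nextBlock 1, -5, 1, -1. */
struct Model example_model(void)
{
    struct Disk d = { {0, 2}, {1, -5, 1, -1} };
    struct Model m;
    build_model(&d, &m);
    return m;
}
```

With those assumed contents, the invalid nextBlock value -5 simply contributes no pair to next, which is why the model can be built even from corrupted bits.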
Model in Example
[Figure: blocks = {0, 1, 2, 3}; used = {0, 1, 2}; free = {3}; next = {⟨0,1⟩, ⟨2,1⟩}, shown next to the file system example (Directory Entries abst and intro, Disk Blocks)]
Internal Consistency Properties
    Quantifiers, Body
• Body is first-order property of basic propositions
  • Inequality constraints on values of numeric fields
    • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
  • Presence of required number of objects
    • size(S) = C, size(S) ≤ C, size(S) ≥ C
  • Topology of region surrounding each object
    • size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C
    • size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
  • Inclusion constraints: V in S, V1 in V2.R, ⟨V1,V2⟩ in R
• Example: for b in used, size(next.b) ≤ 1
Internal Consistency Violations
Evaluate consistency properties, find violations
for b in used, size(next.b) ≤ 1 is false for b = 1
[Figure: model with used = {0, 1, 2}, free = {3}, next = {⟨0,1⟩, ⟨2,1⟩}; block 1 has two incoming next edges]
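Evaluating the example property amounts to counting, for each used block b, the pairs in next whose second component is b. A minimal C sketch, assuming a hypothetical adjacency-matrix encoding of the model (the struct and function names are illustrative, not the system's API):

```c
#include <stdbool.h>

enum { NumBlocks = 4 };

struct Model {
    bool used[NumBlocks];
    bool next[NumBlocks][NumBlocks]; /* next[a][b] means <a,b> in next */
};

/* size(next.b): number of blocks a with <a,b> in next. */
int incoming(const struct Model *m, int b)
{
    int count = 0;
    for (int a = 0; a < NumBlocks; a++)
        if (m->next[a][b])
            count++;
    return count;
}

/* Stores each violating binding of b in out; returns how many were found. */
int find_violations(const struct Model *m, int out[NumBlocks])
{
    int n = 0;
    for (int b = 0; b < NumBlocks; b++)
        if (m->used[b] && incoming(m, b) > 1)
            out[n++] = b;
    return n;
}

/* Model from the example: used = {0,1,2}, next = {<0,1>, <2,1>}. */
struct Model example_model(void)
{
    struct Model m = {0};
    m.used[0] = m.used[1] = m.used[2] = true;
    m.next[0][1] = true;
    m.next[2][1] = true;
    return m;
}
```

Each entry written to `out` is a binding for the quantified variable, which is what the repair step consumes.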
Repairing Violations of Internal Consistency Properties
• Violation provides binding for quantified variables
• Convert Body to disjunctive normal form
    (p1 ∧ … ∧ pn) ∨ … ∨ (q1 ∧ … ∧ qm)
    p1 … pn, q1 … qm are basic propositions
• Choose a conjunction to satisfy
• Repair violated basic propositions in conjunction
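The slide does not say how a conjunction is chosen. One plausible policy, shown here purely as an assumption, is to pick the conjunction with the fewest currently violated basic propositions. The `Conjunction` type and both function names are hypothetical.

```c
#include <stdbool.h>

enum { MaxProps = 8 };

/* One conjunction of a DNF body: indices into an array of basic propositions. */
struct Conjunction {
    int count;
    int props[MaxProps];
};

/* Number of basic propositions in c that are currently false. */
int violated(const struct Conjunction *c, const bool holds[])
{
    int n = 0;
    for (int i = 0; i < c->count; i++)
        if (!holds[c->props[i]])
            n++;
    return n;
}

/* Index of the conjunction needing the fewest repairs; -1 if dnf is empty.
   A result with 0 violations means the body already holds. */
int choose_conjunction(const struct Conjunction dnf[], int n, const bool holds[])
{
    int best = -1, best_cost = 0;
    for (int i = 0; i < n; i++) {
        int cost = violated(&dnf[i], holds);
        if (best < 0 || cost < best_cost) {
            best = i;
            best_cost = cost;
        }
    }
    return best;
}
```

Counting violated propositions is only a stand-in for whatever cost measure a real implementation would use.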
Repairing Violations of Basic Propositions
• Inequality constraints on values of numeric fields
  • V.R = E, V.R < E, V.R ≤ E, V.R ≥ E, V.R > E
  • Compute value of expression, assign field
• Presence of required number of objects
  • size(S) = C, size(S) ≤ C, size(S) ≥ C
  • Remove or insert objects from/to set
• Topology of region surrounding each object
  • size(V.R) = C, size(V.R) ≤ C, size(V.R) ≥ C
  • size(R.V) = C, size(R.V) ≤ C, size(R.V) ≥ C
  • Remove or insert pairs from/to relation
• Inclusion constraints: V in S, V1 in V2.R, ⟨V1,V2⟩ in R
  • Remove or add the object or pair from/to set or relation
Repair in Example
for b in used, size(next.b) ≤ 1 is false for b = 1
Must repair size(next.1) ≤ 1
Can remove either ⟨0,1⟩ or ⟨2,1⟩ from next
[Figure: before repair, next = {⟨0,1⟩, ⟨2,1⟩}; after repair, a single next edge remains; used = {0, 1, 2}, free = {3}]
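The repair for this proposition removes pairs from next until at most one pair points at the violating block. The C sketch below assumes a hypothetical adjacency-matrix encoding of next and an arbitrary tie-break (keep the pair with the lowest source block). The slide allows either pair to be removed.

```c
#include <stdbool.h>

enum { NumBlocks = 4 };

/* next[a][b] means <a,b> in next. */
typedef bool Relation[NumBlocks][NumBlocks];

/* Enforce size(next.b) <= 1 by removing surplus pairs <a,b>.
   Keeps the pair with the lowest source a (arbitrary tie-break);
   returns the number of pairs removed. */
int repair_incoming(Relation next, int b)
{
    int removed = 0;
    bool kept = false;
    for (int a = 0; a < NumBlocks; a++) {
        if (!next[a][b])
            continue;
        if (!kept) {
            kept = true;          /* first pair found stays in the relation */
        } else {
            next[a][b] = false;   /* every later pair is removed */
            removed++;
        }
    }
    return removed;
}
```

Applied to b = 1 in the example, this keeps the pair from block 0 and removes the pair from block 2. Calling it again removes nothing, since the proposition then holds.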
Acyclic Repair Dependences
• Questions
  • Isn’t it possible for the repair of one constraint to invalidate another constraint?
  • What about infinite repair loops?
  • What about unsatisfiable specifications?
• Answer
  • We require specifications to have no cyclic repair dependences between constraints
  • So all repair sequences terminate
  • Repair can fail only because of resource limitations
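The acyclicity requirement can be checked with a standard depth-first search once the repair dependences are known. The sketch below covers only that graph check. How the edges are derived from a specification is not shown in the slides, so the `DepGraph` input (an edge from i to j meaning that repairing constraint i may invalidate constraint j) is a hypothetical representation.

```c
#include <stdbool.h>

enum { MaxConstraints = 16 };

/* dep[i][j]: repairing constraint i may invalidate constraint j. */
typedef bool DepGraph[MaxConstraints][MaxConstraints];

enum Color { WHITE, GRAY, BLACK };

static bool reaches_cycle(DepGraph dep, int n, int i, enum Color color[])
{
    color[i] = GRAY;                  /* i is on the current DFS path */
    for (int j = 0; j < n; j++) {
        if (!dep[i][j])
            continue;
        if (color[j] == GRAY)
            return true;              /* back edge: cycle found */
        if (color[j] == WHITE && reaches_cycle(dep, n, j, color))
            return true;
    }
    color[i] = BLACK;                 /* fully explored, no cycle through i */
    return false;
}

/* True if the dependence graph over constraints 0..n-1 contains a cycle. */
bool has_cyclic_dependences(DepGraph dep, int n)
{
    enum Color color[MaxConstraints] = { WHITE };
    for (int i = 0; i < n; i++)
        if (color[i] == WHITE && reaches_cycle(dep, n, i, color))
            return true;
    return false;
}
```

A specification whose graph passes this check cannot send the repair process around a loop of mutually invalidating constraints.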
External Consistency Constraints
    Quantifiers, Condition ⇒ Body
• Body of form V = E, V.F = E, V.F[I] = E
• Example
    for b in free, true ⇒ D.block[b].nextBlock = -2
    for ⟨i,j⟩ in next, true ⇒ D.block[i].nextBlock = j
    for b in used, size(b.next) = 0 ⇒ D.block[b].nextBlock = -1
• Repair simply performs assignments
• Translates model repairs to bit repairs
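The three example constraints translate directly into assignments to the nextBlock fields. The C sketch below assumes a hypothetical `Model` encoding and takes the nextBlock values as a plain array. It also assumes the internal properties already hold, so each used block has at most one outgoing next pair.

```c
#include <stdbool.h>

enum { NumBlocks = 4 };

struct Model {
    bool used[NumBlocks];
    bool free_[NumBlocks];
    bool next[NumBlocks][NumBlocks]; /* next[i][j] means <i,j> in next */
};

/* Write the repaired model back into the nextBlock fields. */
void write_back(const struct Model *m, int nextBlock[NumBlocks])
{
    for (int b = 0; b < NumBlocks; b++) {
        if (m->free_[b]) {
            nextBlock[b] = -2;        /* for b in free */
            continue;
        }
        if (!m->used[b])
            continue;
        nextBlock[b] = -1;            /* used and size(b.next) = 0 */
        for (int j = 0; j < NumBlocks; j++)
            if (m->next[b][j])
                nextBlock[b] = j;     /* for <b,j> in next */
    }
}
```

Starting from nextBlock values 1, -5, 1, -1 and a repaired model with used = {0, 1, 2}, free = {3} and a single next pair from block 0 to block 1, the assignments leave 1, -1, -1, -2.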
Repair in Example
[Figure: Inconsistent File System: Directory Entries abst (firstBlock 0) and intro (firstBlock 2); Disk Blocks with nextBlock values 1, -5, 1, -1. Repaired File System: same Directory Entries; Disk Blocks with nextBlock values 1, -1, -1, -2]
When to Test for Consistency and Repair
• Persistent data structures
  • Repair can be independent activity, or
  • Repair when data written out or read in
• Volatile data structures in running program
  • Under programmer control
  • Transaction-based approach
    • Identify transaction start and end
    • Repair at start, end, or both
  • Failure-based approach
    • Wait until program fails
    • Repair and restart from latest safe point
Experience
• We acquired four benchmarks (written in C/C++)
  • CTAS (air-traffic control tool)
  • Simplified Linux file system
  • Freeciv interactive game
  • Microsoft Word files
• We developed specifications for all four
  • Very little development time (days, not weeks)
  • Most of time spent figuring out Freeciv and CTAS
• Each benchmark has
  • Workload
  • Fault insertion methodology
• Ran benchmarks with and without repair
CTAS
• Set of air-traffic control tools
  • Traffic management
  • Arrival planning
  • Flow visualization
  • Shortcut planning
• Deployed in centers around country (Dallas/Ft. Worth, Los Angeles, Denver, Miami, Minneapolis/St. Paul, Atlanta, Oakland)
• Approximately 1 million lines of C/C++ code
Results
• Workload: recorded radar feed from DFW
• Fault insertion
  • Simulate error in flight plan processing
  • Bad airport index in flight plan data structure
• Without repair
  • System crashes with a segmentation fault
• With repair
  • Aircraft has different origin or destination
  • System continues to execute
  • Anomaly eventually flushed from system
Aspects of CTAS
• Lots of independent subcomputations
  • System processes hundreds of aircraft: problem with one should not affect others
  • Multipurpose system (visualization, arrival planning, shortcuts, …): problem in one purpose should not affect others
• Sliding time window: anomalies eventually flushed
• Rebooting ineffective: system will crash again as soon as it sees the problematic flight plan
Simplified Linux File System
[Figure: disk blocks labeled superblock, group block, inode bitmap block, block bitmap block, inode block (inode, inode, …) and directory block]
Some Consistency Properties
• inode bitmap consistent with inode usage
• block bitmap consistent with block usage
• directory entries refer to valid inodes
• files contain valid blocks only
• files do not share blocks
Results
• Workload: write and verify several files
• Fault insertion: crash file system
  • Inode and block bitmap errors
  • Partially initialized directory and inode entries
• Without repair
  • Incorrect file contents because of inode and disk block sharing
• With repair
  • Bitmaps repaired, preventing illegal sharing; correct file contents
Freeciv
[Figure: Terrain Grid of tiles (O = Ocean, P = Plain, M = Mountain; rows as extracted: PO MM, OO MP, PO MM, PP MP) and City Structures with loc: 3,0 and loc: 2,3]
Consistency Properties
• Tiles have valid terrain values
• Cities are not in the ocean
• Each city has exactly one reference from city location grid
• City locations are consistent in
  • City structures and
  • Tile grid
Results
• Workload: Freeciv software plays against itself
• Fault insertion: randomly corrupt terrain values
• Without repair: program fails (seg fault)
• With repair
  • Game runs just fine
  • But game plays out differently because of the different terrain values
Microsoft Word Files
• Files consist of a sequence of streams
• Streams stored using FAT-based data structure
• Consistency Properties
  • FAT blocks exist and contain valid entries
  • FAT streams are properly terminated
  • Free blocks properly marked
  • Streams contain valid blocks
  • No sharing of blocks between streams
[Figure: Directory Entries (abst, intro), FAT, and Disk Blocks]
Results
• Workload: several Microsoft Word files
• Fault insertion: scramble FAT
• Without repair
  • If blocks containing the FAT were incorrectly marked as free, Word successfully loads file
  • Otherwise, “The document name or path is not valid”
• With repair
  • Word loads all files
Extensions
• Elimination of external consistency constraints
  • Eliminates problems with translating repairs on the abstract model to the actual data structure
  • Repair algorithm analyzes model definition rules to generate repair actions for the actual data structure
Extensions
• Support for doubly linked data structures
  • Enables the repair algorithm to regenerate back links
Extensions
• Compilation and optimization of consistency checking
  • Achieved significant speedups (n x) by compiling the specification
  • Achieved further speedups () by partially optimizing away the construction of the abstract model
Related Work
• Hand-coded repair
  • Lucent 5ESS switch
  • IBM MVS operating system
• Self-stabilizing algorithms
• Log-based recovery for database systems
• Recovery-oriented computing
  • Recursive restartability
  • Undo framework