Self-healing Software Systems Mauro Pezzè University of Lugano and University of Milano Bicocca.
Self-healing Software Systems
Mauro Pezzè
University of Lugano and University of Milano Bicocca
Why self healing?
Software fails
Verification & validation are hard
New factors amplify problems
– dynamic behavior / emerging scenarios
– unexpected environment interactions
– multi vendors / multi owners
Dynamic autonomous changes
• the provider independently updates the service implementation
• the application dynamically reconfigures the services
• the broker dynamically discovers new services
[Figure: service broker, service provider, and service requestor, connected by publish, find, bind, and communicate]
statically unpredictable evolution
unpredictable environment interactions
statically unpredictable interactions
Multi vendors / owners
Self-healing
Self-healing natural systems
Self-healing software systems?
• similarly to natural systems
– focus on some classes of problems
– maybe incomplete recovery
– may imply changes in the body
– does not work for all problems
• differently from natural systems
– recover from expected as well as unexpected problems
– built-in as well as emerging mechanisms
– potentially hazardous novel interactions
Take inspiration from natural systems, but do not copy them
focus on some classes of problems
Integration failures
• common in presence of evolving/emerging behaviors
• often due to uncovered incompatibilities and misunderstandings
• hard/impossible to identify during classic testing
• easy to correct once diagnosed
Inconsistent interpretation of parameters or values
Mars Climate Orbiter
FAULT: meters – yard mismatch
FIX: converter
Violations of domains, capacity, size
Buffer overflow
FAULT: meters – yard mismatch
FIX: converter
… Integration Faults
Side effects on parameters or resources
FAULT: conflict on temporary file
FIX: rename
Misunderstood functionality
FAULT: inconsistent interpretation of web hits
FIX: convert
Explicit control loops
IBM
Shaw
Detect failures automatically
Detecting failures automatically
• Application independent failures
– memory faults
– deadlocks
– race conditions
– exceptions
• Application dependent failures
– oracles
– assertions
From design specs to code assertions
setId
public abstract void setId(java.lang.String id)
Set the component identifier of this UIComponent (if any). […] Component identifiers must obey the following semantic restrictions (note that this restriction is NOT enforced by the setId() implementation):
• The specified identifier must be unique among all the components […] that are descendants of the nearest ancestor UIComponent [...], or within the scope of the entire component tree […].
(JSF Specification 1.2, javax.faces.component.UIComponent)
Getting Assertions Right
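public class UIComponent {
    private String id = "default";          // id assigned at declaration
    public UIComponent(String id) {
        this.id = id;                       // id assigned in the constructor
    }
    public void setId(String id) {
        this.id = id;                       // id assigned in the setter
    }
    public void doSomething() {
        this.id = "whatever";               // id assigned as a side effect
    }
}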
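A minimal sketch of how the uniqueness restriction quoted from the JSF specification could be turned into a runtime check. It assumes a simplified, hypothetical component tree (the nested Component class is a stand-in, not the real javax.faces API) and checks uniqueness over the entire tree only, ignoring the "nearest ancestor" scope. Because the id can change in several places, such a check would have to run after every assignment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class UniqueIdCheck {
    // Hypothetical stand-in for a UI component; NOT the real javax.faces API.
    static final class Component {
        private String id;
        private final List<Component> children = new ArrayList<>();

        Component(String id) { this.id = id; }
        String getId() { return id; }
        void setId(String id) { this.id = id; }
        void add(Component child) { children.add(child); }
        List<Component> getChildren() { return children; }
    }

    // Returns true if no two components in the tree rooted at 'root' share an id.
    static boolean idsAreUnique(Component root) {
        return visit(root, new HashSet<>());
    }

    // Depth-first traversal that records every id seen so far.
    private static boolean visit(Component c, Set<String> seen) {
        if (c.getId() != null && !seen.add(c.getId())) {
            return false; // duplicate id found
        }
        for (Component child : c.getChildren()) {
            if (!visit(child, seen)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Component root = new Component("form");
        Component name = new Component("name");
        Component email = new Component("email");
        root.add(name);
        root.add(email);
        System.out.println("unique before setId: " + idsAreUnique(root));
        email.setId("name"); // introduces a clash with the sibling component
        System.out.println("unique after setId:  " + idsAreUnique(root));
    }
}
```

The check is deliberately separate from the component class, so it can be invoked from any of the places where the id is assigned.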
[Figure labels: requirement; Java service pages specification; Java server faces implementation; annotation; generation]
Observations
• 1 Property = 56 Assertions
Properties and Frequency
Property | Description | Occurrences (spec)
Explicit/comparable | Classes have to implement specific interface directly | 20
Caching | Correct caching protocol |
Concurrency | Race conditions | 48
Immutability | Object state may not change | 25
Initialization | Specific component/class initialization before use | 36
Language | Data values must match a regular language | 3
Resource Mgmt | Locking/releasing resources | 8
Uniqueness | Objects must be unique within their context | 25
Property | Description | Occurrences (spec) | Occurrences (bugs)
Explicit/comparable | Classes have to implement specific interface directly | 20 | 3
Caching | Correct caching protocol | | 3
Concurrency | Race conditions | 48 | 11
Immutability | Object state may not change | 25 | 2
Initialization | Specific component/class initialization before use | 36 | 10
Language | Data values must match a regular language | 3 | 9
Resource Mgmt | Locking/releasing resources | 8 | 3
Uniqueness | Objects must be unique within their context | 25 |
Properties and Frequency
Property | Description | Occurrences (spec) | Occurrences (bugs)
Explicit/comparable | Classes have to implement specific interface directly | 20 | 6
Immutability | Object state may not change | 25 | 2
Initialization | Specific component/class initialization before use | 36 | 10
Language | Data values must match a regular language | 3 | 9
Uniqueness | Objects must be unique within their context | 25 |
Properties and Frequency
[Figure: from Properties to Runtime Checks. Labels: Concept, Prototype, Platform independent, Platform specific, Property Templates, UML Stereotypes, AJ Advice]
Diagnosing Faults
Debugging
• hard manual activity
• compare multiple executions (need multiple runs)
Locating faults automatically
Infer information from running systems
Generating models from system runs
System behavior
System behavior
Reality is Different!
Over-Generalization
Over-Restriction
Over-Generalization and Over-Restriction
Models derived dynamically …
[Figure: example finite state automaton with states 1 to 5, transitions labeled a to f, and the data constraint x < 0]
kTail, Daikon, Adabu, gkTail
kTail
A. Biermann and J. Feldman. On the synthesis of finite-state machines from samples of their behavior. IEEE Transactions on Computers, 21:592–597, 1972.
From Sequence of Events to Protocols
a -> a -> a -> b -> c
a -> b -> c
a -> a -> b -> c
a -> a -> a -> a -> a -> c
kTail
a -> a -> a -> b -> c
a -> b -> c
a -> a -> b -> c
a -> a -> a -> a -> a -> c
(1) TRACES to PTA
(2) PTA to FSA
Build the PTA
(1) TRACES to PTA
a -> a -> a -> b -> c
a -> b -> c
a -> a -> b -> c
a -> a -> a -> a -> a -> c
k=2
2-future(2) = {aa, ab, bc}
2-future(5) = {aa, bc}
2-future(11) = {}
2-future(8) = {c}
…
2 FUTURES
2-future(8) = {c} 2-future(12) = {c}
2-future(11) = {} 2-future(13) = {}
2-future(2) = {aa, ab, bc} 2-future(3) = {aa, ab, bc}
…
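The k-future computation illustrated above can be sketched as follows. This is a minimal illustration, not the original kTail implementation: it builds a prefix tree acceptor from the four example traces, computes the k-future of each state as the set of event sequences of length k (shorter when a leaf is reached first), and groups states with identical k-futures as merge candidates. State numbers follow insertion order and therefore differ from those on the slides; event labels are assumed to be single characters so that concatenated sequences stay unambiguous.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class KTailSketch {
    // transitions.get(s) maps an event label to the successor state of s.
    private final List<Map<String, Integer>> transitions = new ArrayList<>();

    KTailSketch() {
        transitions.add(new TreeMap<>()); // state 0 is the root of the PTA
    }

    // Step (1): add one trace to the prefix tree acceptor.
    void addTrace(String... events) {
        int state = 0;
        for (String e : events) {
            Integer next = transitions.get(state).get(e);
            if (next == null) {
                next = transitions.size();
                transitions.add(new TreeMap<>());
                transitions.get(state).put(e, next);
            }
            state = next;
        }
    }

    int stateCount() {
        return transitions.size();
    }

    // k-future of a state; a leaf has an empty k-future.
    Set<String> kFuture(int state, int k) {
        Set<String> out = new TreeSet<>();
        collect(state, k, "", out);
        out.remove("");
        return out;
    }

    private void collect(int state, int k, String prefix, Set<String> out) {
        Map<String, Integer> next = transitions.get(state);
        if (k == 0 || next.isEmpty()) {
            out.add(prefix);
            return;
        }
        for (Map.Entry<String, Integer> t : next.entrySet()) {
            collect(t.getValue(), k - 1, prefix + t.getKey(), out);
        }
    }

    // Step (2), first half: states with identical k-futures are merge candidates.
    Map<Set<String>, List<Integer>> mergeClasses(int k) {
        Map<Set<String>, List<Integer>> classes = new LinkedHashMap<>();
        for (int s = 0; s < transitions.size(); s++) {
            classes.computeIfAbsent(kFuture(s, k), f -> new ArrayList<>()).add(s);
        }
        return classes;
    }

    public static void main(String[] args) {
        KTailSketch pta = new KTailSketch();
        pta.addTrace("a", "a", "a", "b", "c");
        pta.addTrace("a", "b", "c");
        pta.addTrace("a", "a", "b", "c");
        pta.addTrace("a", "a", "a", "a", "a", "c");
        pta.mergeClasses(2).forEach((future, states) ->
                System.out.println(future + " <- states " + states));
    }
}
```

Collapsing each class into a single state would complete step (2); that part is omitted here to keep the sketch short.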
Observations
K: trade-off between over-restriction and over-generalization
only local
Daikon
[Figure: observed values of totalCost and unitCost (43, 1, 7, …; 53, 8, 12, …) matched against the templates _ + _, _ < _, _ = _, _ * _]
unitCost = totalCost
unitCost < totalCost
unitCost <= totalCost
unitCost + totalCost > unitCost
…
preserve expressions with perfect confidence
unitCost <= totalCost
unitCost + totalCost > unitCost
totalCost > 0
…
1 < _
remove properties that are not statistically relevant
unitCost <= totalCost
unitCost + totalCost > unitCost
totalCost > 0
remove redundant properties
unitCost <= totalCost
totalCost > 0
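The first filtering step (keep only the candidate properties that hold on every observed sample) can be sketched as below. This is an illustration of the idea, not Daikon's implementation or API: the class name, the fixed candidate list, and the sample values are assumptions chosen for the example, and the statistical-relevance and redundancy steps are not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

class InvariantFilterSketch {
    // Candidate properties over (unitCost, totalCost), instantiated from templates.
    static Map<String, BiPredicate<Integer, Integer>> candidates() {
        Map<String, BiPredicate<Integer, Integer>> c = new LinkedHashMap<>();
        c.put("unitCost = totalCost", (u, t) -> u.intValue() == t.intValue());
        c.put("unitCost < totalCost", (u, t) -> u < t);
        c.put("unitCost <= totalCost", (u, t) -> u <= t);
        c.put("unitCost + totalCost > unitCost", (u, t) -> u + t > u);
        c.put("totalCost > 0", (u, t) -> t > 0);
        return c;
    }

    // Keeps the candidates that hold on every sample {unitCost, totalCost}.
    static List<String> holdingOnAll(int[][] samples) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, BiPredicate<Integer, Integer>> e : candidates().entrySet()) {
            boolean holds = true;
            for (int[] s : samples) {
                if (!e.getValue().test(s[0], s[1])) {
                    holds = false; // one counterexample discards the candidate
                    break;
                }
            }
            if (holds) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical samples, each given as {unitCost, totalCost}.
        int[][] samples = { {1, 43}, {7, 7}, {8, 53} };
        for (String property : holdingOnAll(samples)) {
            System.out.println(property);
        }
    }
}
```

With these samples, the equality and strict-inequality candidates are each refuted by one observation, while the three remaining properties survive.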
Daikon in a nutshell
Adabu
ADABU = Learning how objects can be used
• Statically analyze target class… and add state observers
• run… and trace
• infer the model
Analyze
public int getAge() {
    return age;
}
Inspector method = non-void return type, no parameters, no side effects
Mutator = any method that is NOT an inspector method
… and instrument: all the inspector methods are invoked before and after the execution of mutators
Vector
• has 9 inspectors, example with 3 (isEmpty(), capacity(), size())
• Traces are sequences of <state, method, state>
• Example
(true, 20, 0) (true, 20, 0)
From Concrete States to Abstract States
(true, 20, 0) becomes (isEmpty(), capacity()>0, size()=0)
abstraction rules
• numerical values: <0, =0, >0
• references: null, !null
• enumerations and boolean: concrete value
(true, 20, 0) (true, 20, 0)
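A minimal sketch of the abstraction rules, applied to the three Vector inspectors used in the example. It is not ADABU's code: the class and method names are hypothetical, and the textual form of the abstract predicates (for instance "!isEmpty()" or "=null") is an assumption made for readability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

class StateAbstractionSketch {
    // Maps one concrete inspector value to an abstract predicate.
    static String abstractValue(String inspector, Object value) {
        if (value instanceof Boolean) {
            // booleans keep their concrete value
            return ((Boolean) value) ? inspector : "!" + inspector;
        }
        if (value instanceof Enum) {
            // enumerations keep their concrete value
            return inspector + "=" + value;
        }
        if (value instanceof Number) {
            // numerical values are abstracted to their sign
            double d = ((Number) value).doubleValue();
            return inspector + (d < 0 ? "<0" : d == 0 ? "=0" : ">0");
        }
        // any other reference is abstracted to null / not null
        return inspector + (value == null ? "=null" : "!=null");
    }

    // Abstract state of a Vector, observed through three of its inspectors.
    static List<String> abstractState(Vector<?> v) {
        List<String> state = new ArrayList<>();
        state.add(abstractValue("isEmpty()", v.isEmpty()));
        state.add(abstractValue("capacity()", v.capacity()));
        state.add(abstractValue("size()", v.size()));
        return state;
    }

    public static void main(String[] args) {
        Vector<String> v = new Vector<>(20); // concrete state (true, 20, 0)
        System.out.println(abstractState(v));
        v.add("x");                          // mutator call changes the state
        System.out.println(abstractState(v));
    }
}
```

Recording the abstract state before and after each mutator call yields the <state, method, state> triples from which the model is inferred.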
gkTail
Motivating Example
A catalog interacts with an imageDB component only if the added item is associated with a picture, i.e., the picture attribute is different from null.
catalog.addItem
catalog.addItem
imageDB.addPicture
catalog.addItem
catalog.addItem
imageDB.addPicture
item.getPicture() != null
item.getPicture() == null
GKTail
Merge similar traces
Derive guards
Synthesize EFSMs
Merge Similar Traces
merge
Derive Guards
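[Figure: guards derived with Daikon while merging traces, e.g. the observations (x=0, y=0) and (x=0, y=20) generalize to x=0, 0≤y≤20; other labels: x≥0, x≥1, processed events, events to be processed]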
Synthesize EFSM - PTA
[Figure: prefix tree acceptor (PTA) built from the merged traces. States are numbered from 0; transitions are labeled with the methods m1, m2, m3 and with guards such as 0≤x≤15, x=1, x=0, (x=0, y=0, x=y), (x=0, 0≤y≤20), (x=0, y=3), (x=0, y=15), (x=0, y=30), z={'IT','UK'}, z='UK', z='IT']
Synthesize EFSM - K-future
[Figure: 2-futures extracted from the PTA. 2Future(0): m1 [0≤x≤15] followed by m1 [x=1]; 2Future(8): m3 [z='UK'] followed by m3 [z='UK']]
Merge states - Equivalence
[Figure: states 1 and 8 both have the outgoing transitions m3 [z='UK'], m1 [x=1], m2 [x=0, y=3]]
1 is 2-equivalent to 8
Merge states – Weak Subsumption
[Figure: state 1 has the transitions m3 [z='UK'], m1 [x=1], m2 [x≥0, y=3]; state 8 has m3 [z='UK'], m1 [x=1], m2 [x=0, y=3]]
1 2-weakly subsumes 8
Merge states – Strong Subsumption
[Figure: state 1 has the transitions m3 [z='UK'], m1 [x=1], m2 [x≥0, y=3]; state 8 has m3 [z='UK'], m1 [x=1]]
1 2-strongly subsumes 8
Example: weak subsumption with k=2
[Figure: step-by-step merging of states; the merged guard is y≤20]
Result
[Figure: resulting EFSM after merging. Transitions are labeled m1, m2, m3 with guards such as 0≤x≤15, x=1, x=0, (x=0, y=0, x=y), (x=0, 0≤y≤20), y>x, (x=0, y≤15), (x=0, y=30), y≤20, z={'IT','UK'}, z='UK', z='IT', and with the annotations nUsr=1, nUsr=2, nUsr=3]
Inference
Any algorithm can be applied to derive a model from a rewritten trace
kBehavior
• incremental
• based on merging of patterns rather than states
a b b f d e d e
kBehavior by example
[Figure: FSA with states 1 to 6 and transitions a to f, extended incrementally with the traces below]
a b b f d e d e c
a b b d e d e c (k = 2)
Recursion
a b a h j h j h j l
a b a h j h j h j l
Spurious Loop Avoidance
h
a c h d e d f
a c d e d fh
Locating faults automatically through behavioral anomalies
Inferred Behavior
Program Behavior
Legal Behavior
Failing Behavior
example
• known issue in Tomcat 6.0.0 (to 6.0.9)
Locating faults is difficult when faults are far from failures
[Figure: Tomcat hosting Web App 1, Web App 2, and Web App 3, with the Servlet, Catalina, and Jasper components; the failure and the fault are marked in different places]
public void lifecycleEvent(LifecycleEvent event) {
    …
    this.getClass().getClassLoader().loadClass("org.apache.jasper.compiler.JspRuntimeContext");
    …
failure
fault
Locating faults
[Figure: Tomcat hosting Web App 1, Web App 2, and Web App 3, with the Servlet, Catalina, and Jasper components; three anomalies are marked]
• capture correct behavior
• trace failing executions
• locate faulty components
[Figure labels: parameter[0] == "localhost", parameter[1] == 8080; GenericServlet.<init>; JspFactory.<clinit>; JspFactory.<init>; URL.getFile; URL.getProtocol; Log.log]
trace failing executions
[Figure: Tomcat hosting Web App 1, Web App 2, and Web App 3, with the Servlet, Catalina, and Jasper components; anomalies are marked along the failing execution]
Anomaly 1
Bootstrap init
HostConfig start
HostConfig deployWar
StandardManager start
Anomaly 2: IO, JspFactory.getDefaultFactory returnValue != null does not hold
Anomaly 4
Anomaly 3: FSA for JspServlet.init in state q7, unexpected event: JspFactory.<init>
...
Failure
May 7, 2009 11:16:10 PM org.apache.catalina.core.StandardHost start
INFO: XML validation disabled
May 7, 2009 11:16:10 PM org.apache.catalina.startup.HostConfig deployWAR
INFO: Deploying web application archive ELResolverTest.war
May 7, 2009 11:16:34 PM org.apache.catalina.core.StandardContext start
SEVERE: Error listenerStart
May 7, 2009 11:16:34 PM org.apache.catalina.core.StandardContext start
SEVERE: Context [/ELResolverTest] startup failed due to previous errors
May 7, 2009 11:19:45 PM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8080
monitor
Eliminate spurious anomalies
System Failure!!
unexpected interaction! (3 occurrences)
unexpected value! (2 occurrences)
violations detected during both successful and failing executions are ignored
violations detected during failing executions only are re-arranged according to likely cause-effects
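The first rule above amounts to a set difference, which can be sketched as follows. The class and method names are hypothetical, and the violation labels in the usage example are placeholders: which violations also occur in successful runs is an assumption made only for illustration. The cause-effect re-arrangement is not modeled.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class AnomalyFilterSketch {
    // Keeps only the violations observed in failing executions and never in successful ones.
    static Set<String> suspicious(Set<String> failingRunViolations,
                                  Set<String> successfulRunViolations) {
        Set<String> result = new LinkedHashSet<>(failingRunViolations);
        result.removeAll(successfulRunViolations);
        return result;
    }

    public static void main(String[] args) {
        // Placeholder labels; the split between the two sets is hypothetical.
        Set<String> failing = new LinkedHashSet<>(Arrays.asList(
                "unexpected interaction A",
                "unexpected value B",
                "unexpected interaction C"));
        Set<String> successful = new LinkedHashSet<>(Arrays.asList(
                "unexpected value B"));
        System.out.println(suspicious(failing, successful));
    }
}
```

A LinkedHashSet preserves the order in which violations were observed, which is convenient if the surviving violations are later ordered by likely cause and effect.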
locate faulty components
ContainerBase.start
LifecycleSupport.fireLifecycleEvent
HostConfig.lifecycleEvent
HostConfig.start
ChipsListener.contextInitialized
JspFactory.<clinit>
Process
JspFactory.getDefaultFactory
Bootstrap.main
JspFactory.<init>
LogFactory.getLog
...
[Figure: the anomaly graph is extracted from the dynamic call tree]
Building Anomaly Graphs
Dynamic call graph for the Tomcat case study
initial anomaly graph
Anomaly Graphs Can Be Messy
• initial anomaly graph for a bug in Eclipse 3.3– multiple issues – false positives
locate faulty components
[Figure: the same dynamic call tree, from ContainerBase.start to LogFactory.getLog; the anomaly graph is extracted from the dynamic call tree and then clustered to identify the faulty locations]
Refining Anomaly Graphs
• incrementally remove edges with highest weights
• measure cohesion of the individual resulting graphs
– when removing edges the initial graph is partitioned into multiple graphs
• stop the process when cohesion does not significantly improve anymore
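The partitioning step can be sketched as below: edges heavier than a threshold are dropped, and the connected components that remain are the candidate sub-graphs. The class, the Edge type, and the example weights are hypothetical; the cohesion measure and the stopping criterion are not specified in enough detail here to be modeled, so the threshold is simply passed in. Every edge endpoint is assumed to appear in the node list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class AnomalyGraphSketch {
    // Hypothetical weighted edge between two anomalies.
    static final class Edge {
        final String from;
        final String to;
        final double weight;

        Edge(String from, String to, double weight) {
            this.from = from;
            this.to = to;
            this.weight = weight;
        }
    }

    // Drops edges with weight > threshold and returns the remaining connected components.
    static List<TreeSet<String>> partition(List<String> nodes, List<Edge> edges, double threshold) {
        Map<String, String> parent = new HashMap<>();
        for (String n : nodes) {
            parent.put(n, n); // each node starts in its own component
        }
        for (Edge e : edges) {
            if (e.weight <= threshold) {
                parent.put(find(parent, e.from), find(parent, e.to)); // union
            }
        }
        Map<String, TreeSet<String>> byRoot = new LinkedHashMap<>();
        for (String n : nodes) {
            byRoot.computeIfAbsent(find(parent, n), r -> new TreeSet<>()).add(n);
        }
        return new ArrayList<>(byRoot.values());
    }

    // Follows parent links up to the representative of the component.
    private static String find(Map<String, String> parent, String n) {
        while (!parent.get(n).equals(n)) {
            n = parent.get(n);
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> nodes = new ArrayList<>();
        nodes.add("A"); nodes.add("B"); nodes.add("C"); nodes.add("D");
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("A", "B", 0.2));
        edges.add(new Edge("B", "C", 0.9)); // heaviest edge, removed first
        edges.add(new Edge("C", "D", 0.1));
        System.out.println(partition(nodes, edges, 0.5));
    }
}
```

Lowering the threshold step by step and re-measuring cohesion after each call would reproduce the incremental process described in the list.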
Stopping Criterion
biggest change
edges with weights greater than this value are removed
Inverse cohesion of single graphs
Results
inspect big first
first two graphs enough to explain the problem
Fixing faults automatically
Fixing faults automatically
• Application-independent approaches
– Reboot/micro-reboot/rejuvenation
• Design redundancy
– Multi version programming
– Exception handling
– Wrappers
• Genetic approaches
• Exploit intrinsic redundancy
Automatic workarounds
Manual workaround: exploiting intrinsic redundancy
✖✔
Your family == Anyone, Your family
Exploiting intrinsic redundancy automatically
Your family = Anyone, Your family
equivalent sequences
Equivalent sequences
• Functionally null operations: affect timing, scheduling, not functionality
– setTimeout()
• Idempotent operations: globally invariant functional effect
– m.hide(); m.show()
• Alternative operations: sequences of operations that have the same intended effect
– setTag('tag1','tag2'); is equivalent to setTag('tag1'); addTag('tag2');
Functionally null operations
issue 519
// original sequence
map = new GMap2(document.getElementById("map"));
map.setCenter(new GLatLng(37,-122),15);
map.openInfoWindow(new GLatLng(37.4,-122), 'Hello World!');

// workaround: setCenter is delayed with setTimeout
map = new GMap2(document.getElementById("map"));
setTimeout("map.setCenter(new GLatLng(37,-122),15)",500);
map.openInfoWindow(new GLatLng(37.4,-122), 'Hello World!');
Idempotent operations
issue 1305
// original sequence
polyline.enableDrawing();

// workaround: delete and re-insert the last vertex first
v = polyline.deleteVertex(polyline.getVertexCount()-1);
polyline.insertVertex(polyline.getVertexCount()-1,v);
polyline.enableDrawing();
Alternative operations
issue 585
// original sequence
map.addOverlay(first);
function showOverlay(){ first.show(); }
function hideOverlay(){ first.hide(); }

// workaround: add and remove the overlay instead of hiding it
map.addOverlay(first);
function showOverlay(){ map.addOverlay(first); first.show(); }
function hideOverlay(){ map.removeOverlay(first); }
From equivalent sequences to workarounds
• Each equivalent sequence has a priority: Priority = <success rate, success>
– setTimeout - setCenter: Successful workaround: 7, No. times used: 14, Priority = <1/2, 7> ✔
– add() -> add() show(): Successful workaround: 1, No. times used: 2, Priority = <1/2, 2>
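A small sketch of how candidate workarounds could be ranked by the pair <success rate, success>: higher success rate first, ties broken by the absolute number of successes. The Candidate class, the treatment of never-used candidates (rate 0), and all names and counts other than 7 successes out of 14 uses are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class WorkaroundPrioritySketch {
    // Hypothetical record of how an equivalent sequence performed so far.
    static final class Candidate {
        final String name;
        final int successes;
        final int uses;

        Candidate(String name, int successes, int uses) {
            this.name = name;
            this.successes = successes;
            this.uses = uses;
        }

        // Assumption: a candidate that was never used has success rate 0.
        double successRate() {
            return uses == 0 ? 0.0 : (double) successes / uses;
        }
    }

    // Priority = <success rate, success>, compared lexicographically, highest first.
    static final Comparator<Candidate> BY_PRIORITY = (x, y) -> {
        int byRate = Double.compare(y.successRate(), x.successRate());
        return byRate != 0 ? byRate : Integer.compare(y.successes, x.successes);
    };

    // Returns the candidate names in the order in which they should be tried.
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(BY_PRIORITY);
        List<String> names = new ArrayList<>();
        for (Candidate c : sorted) {
            names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = new ArrayList<>();
        candidates.add(new Candidate("rarelyUsed", 1, 2));   // <1/2, 1>
        candidates.add(new Candidate("delayCall", 7, 14));   // <1/2, 7>
        candidates.add(new Candidate("alternative", 3, 4));  // <3/4, 3>
        System.out.println(rank(candidates));
    }
}
```

Breaking ties by the number of successes favors sequences with more supporting evidence when two candidates have the same rate.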
Check system consistency
Open problems
• Check that the changes fix the problem
– Invariants
– Oracles
– Models of correct execution
Societies of digital systems