[IEEE 2012 7th International Symposium on Software Engineering for Adaptive and Self-Managing Systems (SEAMS), Zurich, Switzerland, 4–5 June 2012]

Evaluation of Resilience in Self-Adaptive Systems Using Probabilistic Model-Checking

Javier Cámara, University of Coimbra, Portugal

[email protected]

Rogério de Lemos, University of Kent, United Kingdom

[email protected]

Abstract—The provision of assurances for self-adaptive systems presents challenges, since uncertainties associated with their operating environments often hamper the provision of absolute guarantees that system properties can be satisfied. In this paper, we define an approach for the verification of self-adaptive systems that relies on stimulation and probabilistic model-checking to provide levels of confidence regarding service delivery. In particular, we focus on resilience properties that enable us to assess whether the system is able to maintain trustworthy service delivery in spite of changes in its environment. The feasibility of our proposed approach for the provision of assurances is evaluated in the context of the Znn.com case study.

Keywords—resilience; assurances; evaluation; probabilistic model checking; stimulation; self-adaptation

I. INTRODUCTION

Despite recent advances in self-adaptive systems, a key aspect that still remains a challenge is the provision of assurances, that is, the collection, structuring and combination of evidence that a system satisfies a set of stated functional and non-functional properties during its operation. The main reason for this is the high degree of uncertainty associated with changes that may occur to the system itself, its environment or its goals [5]. In particular, since the behavior of the environment cannot be predicted nor controlled by the system (e.g., load, network conditions, etc.), this prevents obtaining absolute guarantees that system properties can be satisfied. Moreover, unlike in consolidated model-based verification and validation techniques, models in self-adaptive systems cannot be assumed to be fixed, since changes that might affect the system, its environment or its goals might also affect the models.

This intrinsic uncertainty associated with self-adaptive systems has established the need to devise new approaches for assessing whether a set of stated properties is satisfied by the system during operation while changes occur. A major requirement for these approaches is that they need to be able to provide levels of confidence, instead of establishing absolute guarantees about property satisfaction. For that, we need to determine the limits of the system when facing changes while still providing a service that can justifiably be trusted (i.e., the avoidance of failures that are unacceptably frequent or severe) [17].

In this paper, we propose a new approach, based on stimulation and probabilistic model-checking, for the verification of self-adaptive systems whose environments exhibit stochastic behavior [12]. In a nutshell, the underlying idea is to stimulate the environment of the system in order to exercise its adaptive capabilities, and to collect data about how the system reacts to those environmental changes. The collected data is aggregated into a probabilistic model of the system's behavior that is used as input to a model checker for evaluating whether system properties are satisfied within certain confidence levels. The proposed approach consists of four basic steps (see Figure 1):

1. Stimulus generation. Determine how the system's environment should be stimulated to collect information relevant to the properties of interest. Stimulation is based on system adaptation alternatives, domain knowledge, levers (i.e., effectors on the environment), and the properties to be verified.

2. Experimental data collection. Stimulate the system's environment through levers to trigger adaptation mechanisms, and collect information about system behavior undergoing adaptation as a set of execution traces.

3. Model generation. Aggregate execution traces into a probabilistic model of the system's behavior.

4. Property verification. Check system properties against the obtained probabilistic models.

Although the concept of updating probabilistic model parameters with run-time system information has been explored [9], our approach enables the construction from scratch of a model of the system's response to changes in its environment. The contribution of this paper is twofold:

• A method to obtain levels of confidence regarding the satisfaction of a set of stated non-functional (resilience) properties within a particular time bound;

• An analysis of how different parameters (e.g., resolution) in model construction impact the accuracy of probability estimations.

The rest of this paper is organized as follows. Section II provides an overview of the case study used to illustrate

978-1-4673-1787-0/12/$31.00 © 2012 IEEE. SEAMS 2012, Zürich, Switzerland.


[Figure 1. Overview of the approach: the designer supplies (1.a) domain knowledge, (1.b) adaptations, (1.c) levers, and (1.d) properties to a stimulus framework (1. Stimulus Generation); the framework stimulates and monitors the system environment through a changeload (2. Experimental Data Collection); an aggregator turns observed data and execution traces into a probabilistic model (3. Model Generation), which is checked by a probabilistic model-checker (4. Property Verification).]

the proposed approach. Section III introduces the formal model, as well as the language for expressing system properties. Section IV details the steps comprised by our approach. Section V describes the validation and experimental results. Section VI describes related work, comparing it with the proposed approach. Finally, Section VII concludes the paper and indicates future research directions.

II. CASE STUDY

To illustrate our approach, we use the Znn.com case study [7], which reproduces the typical infrastructure of a news website. It has a three-tier architecture consisting of a set of servers that provide contents from backend databases to clients via frontend presentation logic. Architecturally, it is a web-based client-server system that satisfies an N-tier style, as illustrated in Figure 2. The system uses a load balancer to balance requests across a pool of replicated servers, the size of which can be adjusted to balance server utilization against service response time. A set of client processes makes stateless requests, and the servers deliver the requested contents (i.e., text, images, and videos).

[Figure 2. Znn.com system architecture: clients c0–c2 connect through a load-balancer proxy (lbproxy) to the server pool s0–s3.]

The main objective for Znn.com is to provide content to customers within a reasonable response time, while keeping the cost of the server pool within a certain operating budget. From time to time, due to highly popular events, Znn.com experiences spikes in requests that it cannot serve adequately, even at maximum pool size. To prevent losing customers, the system can provide minimal textual contents during such peak times instead of denying service to a part of the customers. In particular, we identify two main quality objectives for the self-adaptation of the system: (i) performance, which depends on request response time, server load, and available network bandwidth; and (ii) cost, which is associated with the number of active servers.

Self-adaptive features in Znn.com are implemented using Rainbow [6], [10], a platform for architecture-based self-adaptation. Rainbow is capable of analyzing trade-offs among the different objectives and executing different adaptations according to the particular run-time conditions of the system. For instance, in Znn.com, when response time becomes too high, the system should increase the server pool size to improve its performance if this is within budget; otherwise, servers should be switched to textual mode (start serving minimal text content) if cost is near the budget limit.

III. SYSTEM MODEL

In this section, we present Discrete-Time Markov Chains (DTMCs) [15], the probabilistic model used as the basis to represent the behavior of the system. Moreover, we introduce the kind of properties that we check on the system, as well as Probabilistic Computation Tree Logic (PCTL) [2], the language used for expressing them. The section concludes by describing how the aforementioned elements are combined in order to obtain system models.

A. Discrete-Time Markov Chains

In order to model the behaviour of our system, we employ DTMCs, defined as state-transition systems augmented with probabilities. States represent possible configurations of the system, and transitions occur at discrete time steps and have an associated probability. DTMCs are discrete stochastic processes where the probability distribution of future states depends only upon the current state. Let AP be a fixed, finite set of atomic propositions used to label states with properties of interest.

Definition 1 (DTMC): A (labelled) DTMC is a tuple (S, S0, P, L) where:

• S is a finite set of states;
• S0 ⊆ S is a set of initial states;
• P : S × S → [0, 1] is a transition probability matrix, where Σs′∈S P(s, s′) = 1 for all s ∈ S;
• L : S → 2^AP is a labelling function that assigns to each state s ∈ S the set L(s) of atomic propositions that are true in that state.

Each element P(s, s′) of the transition probability matrix represents the probability that the next state of the process will be s′, given that the current state is s. Terminating states, i.e., those from which the system cannot move to another state, are modeled by adding a self-loop (a single transition going back to the same state with probability one).
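As an aside, Definition 1 can be rendered as a minimal Python sketch; the DTMC class and its close helper are illustrative names of our own, not part of any tool mentioned in this paper. The close step adds the probability-one self-loops for terminating states and checks that every row of P is stochastic.

```python
# Minimal labelled DTMC sketch (illustrative names, not from the paper).

class DTMC:
    def __init__(self, states, initial, labels):
        self.states = set(states)          # S
        self.initial = set(initial)        # S0 subset of S
        self.labels = labels               # L : S -> set of atomic propositions
        self.P = {s: {} for s in states}   # transition probability matrix

    def add_transition(self, s, s2, p):
        self.P[s][s2] = p

    def close(self):
        """Add self-loops to terminating states; check row-stochasticity."""
        for s in self.states:
            if not self.P[s]:              # terminating state: self-loop, prob. 1
                self.P[s][s] = 1.0
            assert abs(sum(self.P[s].values()) - 1.0) < 1e-9

m = DTMC({"ok", "slow", "fail"}, {"ok"},
         {"ok": set(), "slow": {"cViolation"}, "fail": {"cViolation"}})
m.add_transition("ok", "ok", 0.8)
m.add_transition("ok", "slow", 0.2)
m.add_transition("slow", "ok", 0.7)
m.add_transition("slow", "fail", 0.3)
m.close()   # "fail" receives a probability-one self-loop
```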

B. Resilience Properties

In order to express resilience properties, we use PCTL [2], a logic inspired by CTL [2]. Instead of the universal and existential quantification of CTL, PCTL provides the probabilistic operator P⋈p(·), where p ∈ [0, 1] is a probability bound and ⋈ ∈ {≤, <, ≥, >}. PCTL is defined by the following syntax:

Φ ::= true | a | Φ ∧ Φ | ¬Φ | P⋈p(ψ)
ψ ::= X Φ | Φ U Φ | Φ U≤t Φ

Formulae Φ are named state formulae, and can be evaluated over a Boolean domain (true, false) in each state. Formulae ψ are named path formulae, and describe a pattern over the set of all possible paths originating in the state where they are evaluated. In state formulae, other Boolean operators, such as disjunction (∨), implication (⇒), etc., can be specified based on the primary Boolean operators. Moreover, we employ the abbreviations (bounded) finally, F(≤t)Φ = true U(≤t) Φ, and globally, G Φ = ¬F¬Φ.

The satisfaction relation for a state s in PCTL is:

s |= true
s |= a iff a ∈ L(s)
s |= ¬Φ iff s ⊭ Φ
s |= Φ1 ∧ Φ2 iff s |= Φ1 and s |= Φ2
s |= P⋈p(ψ) iff Pr(s |= ψ) ⋈ p

A formal definition of how to compute Pr(s |= ψ) is presented in [2]. The intuition is that its value corresponds to the fraction of paths originating in s that satisfy ψ, over the entire set of paths originating in s. The satisfaction relation of a path formula with respect to a path π that originates in state s (i.e., π[0] = s) is:

π |= X Φ iff π[1] |= Φ
π |= Φ U Ψ iff ∃ j ≥ 0 . (π[j] |= Ψ ∧ ∀ 0 ≤ k < j . π[k] |= Φ)
π |= Φ U≤t Ψ iff ∃ 0 ≤ j ≤ t . (π[j] |= Ψ ∧ ∀ 0 ≤ k < j . π[k] |= Φ)
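The next and bounded-until clauses above can be checked mechanically over a single finite path. The following Python sketch is our own illustration (not code from the paper): phi and psi are state predicates, a path is a list of states, and t counts transition steps.

```python
# Illustrative evaluation of PCTL path formulae over one finite path.

def holds_next(pi, phi):
    # pi |= X phi  iff  pi[1] |= phi
    return len(pi) > 1 and phi(pi[1])

def holds_bounded_until(pi, phi, psi, t):
    # pi |= phi U<=t psi iff psi holds at some position j <= t,
    # and phi holds at every earlier position
    for j in range(min(t, len(pi) - 1) + 1):
        if psi(pi[j]):
            return True
        if not phi(pi[j]):
            return False
    return False

# Example: response-time samples; a "violation" persists until recovery.
viol = lambda s: s > 100
ok = lambda s: s <= 100
print(holds_bounded_until([110, 120, 90], viol, ok, 2))   # True: recovery at j=2
```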

With the aid of PCTL, a designer can express properties about the system that are typically domain-dependent. Furthermore, in order to ease the formulation of probabilistic properties, we make use of property specification patterns [8] that describe generalized recurring properties in probabilistic temporal logics. In our case, we are interested in how the system responds to changes in the environment; therefore we restrict ourselves to properties that can be instantiated using probabilistic response patterns [11], adapted to PCTL syntax (see Table I). These patterns include a premise Φ1 that represents, in our case, a change in the operating environment of the system, and a subformula enclosed by the probabilistic operator P⋈p(·) that represents the response to that change that we expect from the system (with a probability bound p and a time bound t).

Table I. Probabilistic Response Specification Patterns

• P≥1[G(Φ1 ⇒ P⋈p(F≤t Φ2))] — Probabilistic Response. After state formula Φ1 holds, state formula Φ2 must become true within time bound t, with a probability bound ⋈ p.

• P≥1[G(Φ1 ⇒ P⋈p(¬Φ2 U≤t Φ3))] — Probabilistic Constrained Response. After state formula Φ1 holds, state formula Φ3 must become true, without Φ2 ever holding, within time bound t, with a probability bound ⋈ p.

Example 1: In Znn.com, we are interested in assessing how the system reacts when request response time goes above a particular threshold. Let expRspTime be the variable associated with experienced request response time, and totalCost the variable associated with operating cost. We can define the following predicates:

cViolation = expRspTime > MAX_RSPTIME
hiCost = totalCost ≥ THRESHOLD_COST

where MAX_RSPTIME is a threshold that establishes the maximum acceptable response time, and THRESHOLD_COST determines the maximum operating budget expected for the system. Based on these predicates, we may instantiate the following PCTL property, making use of the probabilistic constrained response pattern included in Table I:

P≥1[G(cViolation ⇒ P≥0.85(¬hiCost U≤120 ¬cViolation))]

This property reads as: "When response time goes above threshold MAX_RSPTIME, the probability of lowering response time below MAX_RSPTIME, without exceeding the expected budget THRESHOLD_COST, within 120 seconds is greater than 0.85".
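Computing the probability bound itself requires a model checker; over a single recorded snapshot trace, however, one can check the inner constrained-until directly. The Python sketch below is our own illustration of the pattern's intent: the threshold values and the 10-second sample period are illustrative assumptions, not values fixed by the paper.

```python
# Checking the constrained-response pattern of Example 1 along one
# recorded snapshot trace (illustrative thresholds and sample period).

MAX_RSPTIME, THRESHOLD_COST = 100, 6
TAU = 10  # seconds between snapshots (illustrative)

def c_violation(s): return s["expRspTime"] > MAX_RSPTIME
def hi_cost(s):     return s["totalCost"] >= THRESHOLD_COST

def constrained_response(trace, t_bound):
    """For every snapshot where cViolation holds, cViolation must be
    lifted within t_bound seconds without hiCost ever holding."""
    steps = t_bound // TAU
    for i, s in enumerate(trace):
        if not c_violation(s):
            continue
        ok = False
        for snap in trace[i:i + steps + 1]:
            if hi_cost(snap):
                break
            if not c_violation(snap):
                ok = True
                break
        if not ok:
            return False
    return True

trace = [{"expRspTime": 90, "totalCost": 3},
         {"expRspTime": 120, "totalCost": 4},
         {"expRspTime": 110, "totalCost": 5},
         {"expRspTime": 80, "totalCost": 5}]
print(constrained_response(trace, 120))   # True: recovery within bound, on budget
```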

C. Operational Profiles and Adaptations

Within a self-adaptive system, we may distinguish between a conventional operational profile, in which the system is operating without experiencing any anomalies, and non-conventional operational profiles associated with changes in the environment that induce anomalies in the system (typically triggering adaptations).

Definition 2 (Conventional Operational Profile): The conventional operational profile C of a system is the region of the state space where no anomalies hold:

C = {s ∈ S | ∀α ∈ A, s ⊭ α}


where A is the set of possible anomalies that the system can experience, expressed as PCTL state formulae.

Definition 3 (Non-conventional Operational Profile): A non-conventional operational profile N associated with an anomaly α ∈ A is the region of the state space where the anomaly holds:

N = {s ∈ S | s |= α}

where α is a PCTL state formula built over the set of atomic propositions AP.

An adaptation is triggered as a response to a system anomaly caused by a change in the environment of the system. When the system experiences an anomaly, it moves from its conventional operational profile into a non-conventional operational profile. As a general rule, the intended purpose of the adaptation is to steer the system back to its conventional operational profile.

Definition 4 (Adaptation): An adaptation is a tuple (c, g, d) where:

• c is the applicability condition (e.g., a system anomaly), expressed as a PCTL state formula built over the set of atomic propositions AP;
• g is an adaptation goal, expressed as a PCTL formula built over the set of atomic propositions AP;
• d ∈ R+ is a deadline for the satisfaction of the goal of the adaptation in the system.

It is worth observing that we define adaptations in terms of applicability conditions and intended effects on the system, abstracting away from the specific actions that they carry out. Moreover, the system may comprise several alternative adaptations as a response to the same anomaly, and choose among them at run-time according to the particular system state. In any case, an adaptation (c, g, d) is always related to a non-conventional operational profile N through its applicability condition; that is, if we let α be the anomaly associated with N, then c = α.

Example 2: Let us consider the predicates introduced in Example 1 as the set of possible anomalies in the system, A = {cViolation, hiCost}. We may then define the conventional operational profile of the system as:

C = {s ∈ S | s |= ¬(cViolation ∨ hiCost)}

That is, the set of states in which the system is operating on budget and with an acceptable request response time. Furthermore, associated with each anomaly, we identify a non-conventional operational profile:

N_cViolation = {s ∈ S | s |= cViolation}
N_hiCost = {s ∈ S | s |= hiCost}

Rainbow uses a language called Stitch [6] to represent high-level adaptation concepts, embodied in adaptation strategies. Let us consider the code presented at the end of this section of a sample strategy intended to reduce response time in Znn.com. In line 2, we can observe the definition of cViolation, which is also the applicability condition for the strategy ReduceResponseTime (line 5, between brackets). In Stitch, adaptation goals and deadlines are not explicitly specified; therefore, in this case, we can complement the specification of the adaptation using as adaptation goal the property described in Example 1:

ReduceResponseTime = (cViolation, P≥0.85(¬hiCost U≤120 ¬cViolation), 120)

Let us remark that although the deadline in this case coincides with the time bound of the adaptation goal, this may not always be the case (e.g., when the adaptation goal is a conjunction of several PCTL formulae with different time bounds).

1
2  define boolean cViolation = exists c : T.ClientT in
3      M.components | c.expRspTime > M.MAX_RSPTIME;
4
5  strategy ReduceResponseTime [cViolation] {
6    define boolean hiLatency =
7      exists k : T.HttpConnT in M.connectors | k.latency > M.MAX_LATENCY;
8    define boolean hiLoad =
9      exists s : T.ServerT in M.components | s.load > M.MAX_UTIL;
10
11   t1: (#[Pr{t1}] hiLatency) -> switchToTextualMode() @[1000 /*ms*/] {
12     t1a: (success) -> done;
13   }
14   t2: (#[Pr{t2}] hiLoad) -> enlistServer(1) @[2000 /*ms*/] {
15     t2a: (!hiLoad) -> done;
16     t2b: (!success) -> do[1] t1;
17   }
18   t3: (default) -> fail;
19 }

D. Modeling Non-conventional Operational Profiles

The high degree of variability in the operating environment of self-adaptive systems hampers the construction of accurate system models. However, since our interest is to assess the system's ability to provide trustworthy service when environmental changes occur, the analysis can be restricted to the non-conventional operational profiles of the system.

For our approach, we assume that a system model consists of a set of non-conventional operational profile models Ni, i ∈ {1, ..., p}, each of which is associated with a set of adaptations Adi = {(c1, g1, d1), ..., (cq, gq, dq)} through their applicability conditions (letting αi be the anomaly associated with Ni, we have cj = αi for all j ∈ {1, ..., q}).

In the following, we formally define what a non-conventional operational profile model is, but before that we formulate some basic concepts associated with that definition. In those definitions, for the sake of clarity, we abstract from the probabilities in DTMCs. Therefore, we assume that there is a transition between states s and s′ in a DTMC (denoted as s → s′) iff P(s, s′) ≠ 0.

Definition 5 (Boundary): The boundary of a non-conventional operational profile N is defined as:

boundary(N) = {n ∈ N | ∃ s → n : s ∈ S \ N}.

Definition 6 (Trajectory): A trajectory is a finite sequence of transitions σ = s1 → ... → sn.


We assume that for all transitions s → s′, the time needed by the system to move from state s to state s′ is given by τ ∈ R+. Then, for any trajectory σ of the system, we define its duration as ∆(σ) = nτ.

A state s′ ∈ S is reachable from a state s in time t (denoted as s →t s′) if there exists a trajectory σ = s → ... → s′ such that ∆(σ) ≤ t.

Definition 7 (Non-conventional Operational Profile Model): A model for a non-conventional operational profile N is a DTMC such that:

• the set of initial states is S0^N = boundary(N);
• the set of states is S^N = {s ∈ S | ∃ s0 ∈ S0^N : s0 →max(dj) s};

where Ad = {(c1, g1, d1), ..., (cq, gq, dq)}, j ∈ {1, ..., q}, is the set of adaptations associated with N. Hence, initial states correspond to those where the system enters the non-conventional operational profile, and the state space of the model includes all the states reachable before the maximum of the deadlines for adaptations expires.

IV. APPROACH

In this section, we present an overview of the different steps comprised by our approach (Figure 1): (i) stimulus generation, where we determine how the environment must be stimulated to trigger the system's adaptation mechanisms; (ii) experimental data collection, where the system's environment is stimulated in order to collect execution traces of the system undergoing adaptation; (iii) model generation, where DTMC-based models are obtained from the aggregated execution traces; and (iv) property verification, where system properties are verified against the obtained probabilistic model.
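Step (iii) leaves open how the transition probabilities of the DTMC are estimated from the traces. A natural choice, assumed here purely for illustration, is maximum-likelihood frequency counting: the probability of s → s′ is the number of observed s → s′ steps divided by the total number of steps leaving s.

```python
# Illustrative sketch of model generation by frequency counting
# (our assumption of the estimator, not a procedure stated in the paper).

from collections import Counter, defaultdict

def aggregate(traces):
    counts = defaultdict(Counter)
    for trace in traces:
        for s, s2 in zip(trace, trace[1:]):   # consecutive snapshot pairs
            counts[s][s2] += 1
    # normalize each row of counts into a probability distribution
    return {s: {s2: n / sum(c.values()) for s2, n in c.items()}
            for s, c in counts.items()}

traces = [["ok", "slow", "ok"], ["ok", "ok", "slow", "fail"]]
P = aggregate(traces)
print(P["ok"])   # e.g. ok -> slow with probability 2/3
```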

A. Stimulus Generation

To obtain a representative operational model of the system, we need to subject its environment to equally representative changes. A changeload is a collection of scenarios comprising representative changes of the environment that stimulate the system's adaptive capabilities, enabling us to observe its response. Obtaining a representative changeload is a challenging task that typically involves domain knowledge, either coming from experts in the application domain or from real data available from similar existing systems. However, accessing such data is not possible in many situations, since the type of experienced changes and their probability of occurrence are not usually recorded in already deployed systems. Moreover, the wide range and complexity of changes that a self-adaptive system can experience [1] makes it even more difficult to automatically generate a representative changeload. However, for the evaluation of the kind of resilience properties described in Section III-B, the test designer can be aided by a partial automation of the process, as we discuss further in this section. Concretely, we provide a formal model of the scenarios used in our approach to stimulate the system environment, followed by a description of a method to help identify a changeload.

1) Scenario Model: A scenario defines a postulated sequence of events, which should be able to capture, during a given time frame, changes in the environment, in the system, or at the interface between them (i.e., system goals). However, in this work we focus on self-adaptive systems where system goals do not change at run-time; therefore these are not considered in our definition of scenario. In particular, scenarios consist of a set of traces that correspond to a set of monitored variables associated with the system or its environment. Each trace consists of a sequence of variable values collected at discrete time points according to a rate τ that coincides with the transition time in our model (introduced in Section III-D). Moreover, collected values are discretized according to a particular resolution in order to bound the growth of the state space of our models.

Definition 8 (Trace): A trace of a variable is a tuple (η, [α, β], V) where:

• η ∈ R+ is a quantization parameter that determines the resolution of the monitored variable;
• [α, β], α, β ∈ R, is the range of values associated with the variable;
• V = 〈v1, ..., vn〉 is a vector, where vi∈{1,...,n} ∈ [R]η corresponds to the value of the monitored variable at time instant i.

The values of a variable at different time instants are based on a quantization of the state space R, approximated by the set:

[R]η = {r ∈ R | r = kη, k ∈ Z, α ≤ r ≤ β}.

It is worth observing that vi is the quantized value of the monitored variable. Let vRi be the actual value of the variable observed at instant i. The quantized value vi is obtained as:

quant(vRi) = arg min r∈[R]η |vRi − r|.

Definition 9 (Scenario): A scenario defined over a set of traces is a tuple (δ, τ, T) where:

• δ ∈ R+ determines the duration of the time frame associated with the scenario, [0, δ];
• τ ∈ R+ is a time sampling parameter that indicates the rate at which the variables included in the scenario are sampled. Let us remark that the number of samples included in the scenario is n = δ/τ. We assume that δ is always a multiple of τ;
• T = 〈tr1, ..., trm〉 is a vector of traces tri∈{1,...,m} ∈ Tenv ∪ Tsys, where m ∈ N corresponds to the number of variables included in the scenario, and Tenv, Tsys correspond respectively to the sets of traces of environment and system variables.
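The quantization in Definition 8 amounts to snapping each observation to the nearest grid point kη and keeping it inside [α, β]. A minimal Python sketch of quant, assuming (as the grid definition suggests) that α and β are themselves multiples of η:

```python
# Sketch of quant from Definition 8: nearest multiple of eta, clamped
# to [alpha, beta] (assumes alpha and beta are multiples of eta, so the
# clamped value is still a grid point).

def quant(v, eta, alpha, beta):
    r = round(v / eta) * eta        # nearest grid point k * eta
    return min(max(r, alpha), beta) # restrict to the admissible range

print(quant(47.3, 5, 0, 100))   # 45
print(quant(112.0, 5, 0, 100))  # 100 (clamped to beta)
```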

Moreover, in order to be able to determine the overall state of the system and its environment at a given time point, we introduce the concept of snapshot.


Definition 10 (Snapshot): A snapshot is a vector 〈v1, ..., vm〉, vj∈{1,...,m} ∈ [R]ηj, that contains the values of a set of variables at a particular time instant.

We refer to a snapshot as an environment snapshot or a system snapshot if it includes only variables from Tenv or Tsys, respectively.

Definition 11 (Snapshot Trace): A snapshot trace is a vector 〈s1, ..., sn〉 such that si∈{1,...,n} is a snapshot for time iτ.

Therefore, we can characterize a run of the system as an ordered collection of snapshots that captures state changes from system initialization to termination.

Example 3: Figure 3 outlines a sample scenario of Znn.com that comprises two clients c1 and c2, a server s, a load balancer, and the connectors among them, conn1–conn3. The duration of the scenario is one hour, and the state of the system and the environment is sampled periodically every 10 seconds. Traces of environment variables Tenv include connector latency and server load, whereas traces for system variables Tsys comprise the experienced response time for each of the clients and the operating cost. The quantization parameters used are ηlatency = 50, ηload = 5, ηexpRspTime = 10, and ηtotalCost = 1. Let us remark that all variable values are multiples of their respective quantization parameters.

[Figure 3 omitted from text: it depicts traces for the environment variables conn1.latency, conn2.latency, conn3.latency and s.load, and for the system variables c1.expRspTime, c2.expRspTime and totalCost, sampled at τ = 10s over δ = 3600s (n = 360). For instance, the snapshot at t = 20s is conn1.latency=100, conn2.latency=100, conn3.latency=150, s.load=30, c1.expRspTime=60, c2.expRspTime=60, totalCost=3.]

Figure 3. Znn.com sample scenario

2) Identifying a Changeload: The first step to generating a representative changeload for the system is determining how changes in its environment can steer the system towards non-conventional operational profiles. To achieve this goal, we assume that the stimulus framework provides a number of effectors or levers that can be used to modify the values of environment variables (i.e., acting as a closed-loop controller, similarly to frameworks such as Rainbow, which use effectors through operators to steer the system according to a set of objectives [10]).

Secondly, we need to define a metric to estimate the distance between any state of the system and the objective, that is, a particular non-conventional operational profile. For this purpose, we define the distance from a snapshot s to a non-conventional operational profile N as:

DstN(s, N) = min_{sN ∈ N} dst(s, sN);  with  dst(s, s′) = max_{j=1,...,m} |vj − v′j| / ηj;

where function dst corresponds to the distance between any two snapshots s = 〈v1, . . . , vm〉 and s′ = 〈v′1, . . . , v′m〉.
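A direct transcription of these two definitions, representing snapshots as tuples of numbers and the quantization parameters η as a parallel tuple (both representation choices are assumptions):

```python
def dst(s, s_prime, eta):
    """Distance between two snapshots: the maximum over all
    variables of the difference normalized by the quantization
    parameter, max_j |v_j - v'_j| / eta_j."""
    return max(abs(v - vp) / e for v, vp, e in zip(s, s_prime, eta))

def dst_to_profile(s, profile, eta):
    """DstN: distance from snapshot s to a non-conventional
    operational profile, i.e., the minimum distance from s to any
    snapshot in the profile."""
    return min(dst(s, s_n, eta) for s_n in profile)
```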

Algorithm 1 Environment Stimulation
Input: Initial state of the system s0, non-conventional operational profile N, anomaly α associated with N, observation time parameter Oτ, settling threshold Oλ, settling period k.
Output: Environment snapshot trace t to steer the system towards N.
 1: t := 〈〉; current := s0; d := DstN(current, N);
 2: while current ⊭ α ∧ ¬cond do
 3:   ng := generate_neighbor(Env(current));
 4:   apply(ng);
 5:   wait(Oτ);
 6:   next := observe;
 7:   if DstN(next, N) < d then
 8:     append(t, next);
 9:     d := DstN(next, N);
10:     current := next;
11:   else
12:     apply(Env(current));
13:     clock := 0;
14:     while clock ≤ k ∧ ¬cond do
15:       wait(τ);
16:       next := observe;
17:       if dst(next, current) < Oλ then
18:         clock := clock + 1;
19:       else
20:         clock := 0;
21:       end if
22:     end while
23:   end if
24: end while
25: if cond then
26:   return 〈〉;
27: else
28:   return t;
29: end if
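The loop in Algorithm 1 can be sketched in Python as follows; the effector callbacks (env, generate_neighbor, apply_change, observe, wait) are hypothetical stand-ins for the stimulus framework's levers, and a max_iters counter plays the role of the abstract stop condition cond:

```python
def stimulate(s0, profile, holds_anomaly, eta, env, generate_neighbor,
              apply_change, observe, wait, o_tau, o_lambda, k, tau,
              max_iters=1000):
    """Sketch of Algorithm 1 (Environment Stimulation). Callback
    names and max_iters are assumptions, not part of the paper."""
    def dst(a, b):
        # normalized distance between two snapshots
        return max(abs(x - y) / e for x, y, e in zip(a, b, eta))

    def dst_n(s):
        # distance from a snapshot to the operational profile N
        return min(dst(s, s_n) for s_n in profile)

    trace, current = [], s0
    d = dst_n(current)
    iters = 0
    while not holds_anomaly(current) and iters < max_iters:
        iters += 1
        ng = generate_neighbor(env(current))   # candidate env change
        apply_change(ng)
        wait(o_tau)                            # let the system settle
        nxt = observe()
        if dst_n(nxt) < d:                     # closer to N: keep change
            trace.append(nxt)
            d = dst_n(nxt)
            current = nxt
        else:                                  # roll back, wait to settle
            apply_change(env(current))
            clock = 0
            while clock <= k and iters < max_iters:
                iters += 1
                wait(tau)
                nxt = observe()
                clock = clock + 1 if dst(nxt, current) < o_lambda else 0
    return trace if holds_anomaly(current) else []
```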

Using this metric, Algorithm 1 stimulates the environment for steering the system towards a non-conventional operational profile N, starting from an initial condition s0. It is important to emphasize that this algorithm is intended as an aid for the stimulus designer, rather than as the only way to determine the changeload, since the controllability of the system from the initial condition s0 (∃t ∈ R+ : s0 →t sN, sN ∈ N) cannot be guaranteed due to the uncertainties regarding how environmental changes affect the system. Each iteration of the algorithm determines a neighbor ng (line 3) of the current system state by modifying the value of one or more environment variables. Let us remark that the concrete criteria for the generation of a neighbor state and its application on the environment are abstracted away by two functions: generate_neighbor (line 3), which has to be customized for every case by using domain knowledge (e.g., in Znn.com, an increase in the latency of a network link will probably increase response time, thus reducing the distance to NcViolation); and apply (line 4), which maps changes in environment variables to levers in the environment (e.g., induction of an artificial delay in a network link to increase its associated latency). Function Env (line 3) removes system variables from a snapshot, returning an environment snapshot only with variables from Tenv. Function observe (line 6) returns a snapshot with the current state of the system. In particular, the algorithm waits to observe the effect of a change by using an explicit timing delay Oτ (line 5) to capture the settling time of the system after a change is applied. If the distance to N has been reduced w.r.t. the previous state, the algorithm goes to its next iteration (line 2). Otherwise, the algorithm ensures that the system has settled, controlling that the output is bounded and does not exceed a particular threshold Oλ (line 17) for a particular period of time (tracked by the clock variable, line 14). If the system does not settle by a given deadline or a maximum number of iterations (stop condition abstracted by cond, line 14), the algorithm returns an empty trace, meaning that a sequence of changes in the environment for steering the system towards N has not been found. Function append (line 8) adds a new element at the end of a vector V = 〈e1, e2, . . . , en〉: append(V, e) = 〈e1, . . . , en, e〉.

B. Experimental Data Collection

Building a representative model of the system also implies collecting data in a meaningful way. In particular, data related to the operation of the system in its conventional operational profile is not useful for evaluating resilience properties, so it must be discarded. Concretely, we collect information corresponding to time intervals that start when an anomaly occurs and end when the deadline to fulfill adaptation goals expires.

Algorithm 2 Data Collection
Input: Snapshot trace ST, set of anomalies A, vector of maximum deadlines for adaptations MD = 〈md1, . . . , mdn〉 : mdi = max(dj), i ∈ {1, . . . , n}, j ∈ {1, . . . , p}.
Output: Vector of sets of traces TN.
 1: TN := 〈T1, . . . , Tn〉; ∀i ∈ {1, . . . , n} TN[i] := ∅;
 2: tr := 〈tr1, . . . , trn〉; ∀i ∈ {1, . . . , n} tr[i] := 〈〉;
 3: while ST ≠ 〈〉 do
 4:   current := extract(ST);
 5:   for αc ∈ A : (current ⊨ αc) ∧ (clocks[c] = 0) do
 6:     clocks[c] := MD[c];
 7:     tr[c] := 〈〉;
 8:   end for
 9:   for i ∈ {1, . . . , n} : clocks[i] ≠ 0 do
10:     append(tr[i], current);
11:     clocks[i] := clocks[i] − 1;
12:     if clocks[i] = 0 then
13:       TN[i] := TN[i] ∪ {tr[i]};
14:     end if
15:   end for
16: end while
17: return TN;
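The same collection logic can be sketched in Python, with anomalies given as predicates over snapshots and deadlines expressed in numbers of samples (both representation choices are assumptions):

```python
def collect_traces(snapshot_trace, anomalies, max_deadlines):
    """Sketch of Algorithm 2 (Data Collection): for each anomaly,
    gather the snapshots from its occurrence until the maximum
    adaptation deadline (in samples) expires."""
    n = len(anomalies)
    TN = [set() for _ in range(n)]     # collected traces per profile
    tr = [[] for _ in range(n)]        # trace currently being built
    clocks = [0] * n
    for current in snapshot_trace:
        for c, alpha in enumerate(anomalies):
            if alpha(current) and clocks[c] == 0:
                clocks[c] = max_deadlines[c]   # open a collection window
                tr[c] = []
        for i in range(n):
            if clocks[i] != 0:
                tr[i].append(current)
                clocks[i] -= 1
                if clocks[i] == 0:
                    TN[i].add(tuple(tr[i]))    # window closed: store trace
    return TN
```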

Algorithm 2 obtains a set of snapshot traces for each one of the non-conventional operational profiles Ni, given a set of associated anomalies A = {α1, . . . , αn} expressed as PCTL state formulae, a set of adaptations associated with each of the anomalies Adi∈{1,...,n} = {(c1, g1, d1), . . . , (cp, gp, dp)} (s.t. ∀c j∈{1,...,p}, c j = αi), and a scenario represented as a snapshot trace ST = 〈s1, . . . , sq〉. The algorithm returns a vector of sets of snapshot traces Ti ∈ TN, and maintains a current snapshot trace tri as well as a clock associated with each of the non-conventional operational profiles. For every snapshot in the input snapshot trace ST, the algorithm evaluates whether any of the anomalies hold (line 5). If this is the case and the clock for the corresponding anomaly is zero, the clock is set to the maximum deadline for any of the adaptations in Adi (line 6) and a new current trace for Ni is allocated (line 7). Next, the current snapshot is added to all current traces with non-zero clocks (line 10). These clocks are decremented after the processing of each subsequent snapshot (line 11), and when the value of a clock becomes zero, the algorithm stops collecting snapshots for the current trace and adds it to Ti (line 13). Function extract (line 4) returns and removes the first element of a vector V = 〈e1, e2, . . . , en〉: extract(V) = e1, with V = 〈e2, . . . , en〉.

C. Model Generation

Once snapshot traces for a non-conventional operational profile have been obtained, we use them to build its corresponding probabilistic model.

Algorithm 3 DTMC Synthesis
Input: Set of traces T associated with a non-conventional operational profile N, set of atomic propositions AP.
Output: DTMC (S, S0, P, L) associated with N.
Auxiliary variables: Transition observation matrix O.
 1: S := ∅; S0 := ∅; P := 0; L := ∅;
 2: O := 0;
 3: for tr ∈ T do
 4:   current := ∅;
 5:   first := true;
 6:   while tr ≠ 〈〉 do
 7:     prev := current;
 8:     current := extract(tr);
 9:     if first ∧ current ∉ S0 then
10:       S0 := S0 ∪ {current};
11:       first := false;
12:     end if
13:     if current ∉ S then
14:       S := S ∪ {current};
15:       L := L ∪ {(current, {α ∈ AP | current ⊨ α})};
16:     end if
17:     if prev ≠ ∅ then
18:       O(prev, current) := O(prev, current) + 1;
19:       SC := {s′ ∈ S | O(prev, s′) ≠ 0};
20:       for s′ ∈ SC do
21:         P(prev, s′) := O(prev, s′) / Σ_{sc∈SC} O(prev, sc);
22:       end for
23:     end if
24:   end while
25: end for
26: return (S, S0, P, L);

In particular, Algorithm 3 synthesizes a DTMC from a set of snapshot traces T and the set of atomic propositions of interest AP. The algorithm starts by incrementally building the sets of states (initial and global, lines 10, 14) and their corresponding labelling (line 15). Moreover, for the second part of the algorithm, where the transition probability matrix is built, we employ an auxiliary matrix O that keeps track of the number of times that a transition between any two states prev and current is observed (line 18). Hence, values in the transition probability matrix are updated according to this information by assigning to each of the successors of state prev (i.e., all states s′ ∈ S s.t. O(prev, s′) ≠ 0) a value proportional to the number of times that the transition prev → current has been observed w.r.t. the total number of transitions observed from source state prev (line 21).
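A Python sketch of this synthesis, with snapshots as hashable tuples and atomic propositions given as named predicates (both representation choices are assumptions):

```python
from collections import Counter

def synthesize_dtmc(traces, atomic_props):
    """Sketch of Algorithm 3: build a DTMC (S, S0, P, L) from
    snapshot traces by counting observed transitions and
    normalizing per source state."""
    S, S0, L = set(), set(), {}
    O = Counter()                       # transition observation counts
    for tr in traces:
        prev = None
        for idx, current in enumerate(tr):
            if idx == 0:
                S0.add(current)         # first snapshot: initial state
            if current not in S:
                S.add(current)
                L[current] = {a for a, pred in atomic_props.items()
                              if pred(current)}
            if prev is not None:
                O[(prev, current)] += 1
            prev = current
    # normalize counts into transition probabilities
    totals = Counter()
    for (src, _), cnt in O.items():
        totals[src] += cnt
    P = {(src, dst): cnt / totals[src] for (src, dst), cnt in O.items()}
    return S, S0, P, L
```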

D. Property Verification

To verify properties, we need to translate the probabilistic models obtained for each of the non-conventional operational profiles into an appropriate language to be used as input to a probabilistic model checker. Concretely, we use PRISM [16], a widely used tool in which DTMCs can be written as a set of commands of the form:

[] guard -> prob_1 : update_1 + ... + prob_n : update_n;

Here, the guard is a predicate over all the variables in the model, and each update describes a valid transition if the guard is true. A transition is specified by giving the new values of the variables considered, and each update occurs with an assigned probability. Translation of our DTMC models results in a set of commands, each of them associated with a state (snapshot) s = 〈v1, . . . , vn〉 and its set of successor states S′ = {s′1, . . . , s′m}, with s′j = 〈v′j1, . . . , v′jn〉, j ∈ {1, . . . , m}, according to the following pattern:

[] (v1 ∧ . . . ∧ vn) -> P(s, s′1) : (v′11 ∧ . . . ∧ v′1n) + . . . + P(s, s′m) : (v′m1 ∧ . . . ∧ v′mn);
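This translation can be sketched as follows, assuming the DTMC's transition matrix is given as a map from (source, successor) state pairs to probabilities and states are tuples of variable values (the module and variable declarations that a complete PRISM model also needs are omitted):

```python
def to_prism_commands(P, var_names):
    """Emit one PRISM guarded command per source state, following
    the translation pattern above. A sketch: probability formatting
    and state ordering are assumptions."""
    by_src = {}                         # group transitions by source state
    for (src, dst), p in P.items():
        by_src.setdefault(src, []).append((p, dst))
    cmds = []
    for src, succs in sorted(by_src.items()):
        guard = " & ".join(f"{v}={x}" for v, x in zip(var_names, src))
        updates = " + ".join(
            f"{p}:" + "&".join(f"({v}'={x})" for v, x in zip(var_names, dst))
            for p, dst in succs)
        cmds.append(f"[] {guard} -> {updates};")
    return cmds
```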

Finally, to assess the reliability of the probabilities obtained from the model checker, we need to estimate how good the resulting model is. This is achieved by comparing representative indicators measured directly from the execution of the system against the same measures estimated in the synthesized model. In particular, in the case of resilience properties, we use as an indicator the Mean Time To Recovery (MTTR), that is, the average time that the system takes to recover from an experienced anomaly. Comparing the real MTTR of the system against the one obtained from a probabilistic model helps us to estimate how accurate the model is. In order to evaluate the MTTR in models, DTMCs can be easily extended into Discrete Markov Reward Models (DMRMs) [14], in which rewards (or costs) can be quantified.
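The real MTTR referred to here can be measured directly from the collected snapshot traces, for example as the mean time until the anomaly first ceases to hold (a sketch, assuming each trace starts at the occurrence of the anomaly and is sampled every τ seconds):

```python
def empirical_mttr(traces, anomaly, tau):
    """Mean time to recovery measured directly from snapshot
    traces: for each trace, the time of the first snapshot where
    the anomaly no longer holds (traces are assumed to begin at
    the anomaly's occurrence)."""
    times = []
    for tr in traces:
        for i, s in enumerate(tr):
            if not anomaly(s):
                times.append(i * tau)   # recovery observed at sample i
                break
    return sum(times) / len(times)
```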

V. EXPERIMENTAL VALIDATION

The aim of our validation is to determine the accuracy of probability estimations, and how different parameters in model construction (i.e., resolution and time) affect this accuracy. For this purpose, we compare probability estimations against indicators that can be measured directly in the running system. In particular, we have evaluated the resilience of Znn.com to a phenomenon known as the slashdot effect, in which the system gets flooded with visits within a period of time (from a few hours to a couple of days [19]), causing a range of problems, including total system shutdown in the worst cases. Concretely, we estimate the levels of confidence expected w.r.t. a particular set of resilience properties when the system adapts to a sudden rise in requests received. To obtain our models, we identified a changeload characteristic of a slashdot-type effect, based on a sample collected by Juric [13] (Figure 4), previously used for a general evaluation of the effectiveness of Rainbow in Znn.com [7].

Figure 4. Graph of traffic for a website experiencing a Slashdot-like effect

Scenarios in the changeload conform to the pattern: (i) initial period of low activity; (ii) period of sharp rise in requests; and (iii) sustained period of peak requests.

Our data sets were obtained from 100 separate runs of Rainbow running on top of the Znn.com simulator included in the Rainbow SDK. Information was collected according to Algorithm 2, and using Algorithm 3 we synthesized three different abstractions for the behavior of the system under the non-conventional operational profile NcViolation, using different quantization parameters for the experienced response time ηexpRspTime ∈ {10, 50, 100} for high (Mh), medium (Mm), and low-level (Ml) resolution models, respectively. A fixed quantization parameter was chosen for the cost of operation, ηtotalCost = 100 (cost/active server). The time sampling parameter used for all models is τ = 1s. Models Mx are time-homogeneous DTMCs, where one-step transition probabilities are always the same, independently of time [14]. Moreover, in order to compare results against their time-homogeneous counterparts, we synthesized three additional abstractions (Mτh, Mτm, Mτl) of similar characteristics, but including time in the model. Adaptations that do not quickly counteract the sudden rise in demand are not effective, so the maximum time bound for the properties checked is 100s (the length of collected traces).

Table II details the experimental results obtained for the different models. As expected, the size of the models with coarser quantization parameters is remarkably smaller (up to 79% and 46% reduction between Mh − Ml and Mτh − Mτl, respectively), due to the higher ratio of observations grouped per abstract state of the model. This is also the reason for the drastic reduction in size (about one order of magnitude) of time-homogeneous models against those including time. However, this contrasts with the accuracy of the MTTR in the different models, which indicates a high representativity of the system's behavior and only presents slight variations (≈ 95−96% for Mx, 96−99% for Mτx) independently of the size and type of model. It is worth observing that the best accuracy in MTTR is not achieved by the models with the smallest quantization parameters, but when a medium quantization parameter (i.e., ηexpRspTime = 50) is chosen in both Mx and Mτx models. This is because moderately coarse resolutions in quantization parameters yield more representative information about the dynamics of the system than finer resolutions if the number of system observations available is not particularly high. However, if the quantization parameter chosen is too high w.r.t. the range of possible values of the variable, the loss in resolution will also distort the representativeness of the system dynamics. This puts forward the importance of striking a balance between required resolution and representativeness of the system's dynamics to choose appropriate quantization parameters for the models.

Table II
MODEL METRICS FOR NcViolation
(τ = 1s; ηtotalCost = 100 in all models)

Metric                          Mh / Mτh           Mm / Mτm           Ml / Mτl
                                (ηexpRspTime=10)   (ηexpRspTime=50)   (ηexpRspTime=100)
Model Size (#States-#Trans.)    582-977 /          221-435 /          120-218 /
                                3691-4344          2507-3154          1995-2559
Measured MTTR/MaxTTR (s)        19.74 / 43 (all models)
Estimated MTTR (s)              20.74 / 20.41      20.43 / 19.72      20.46 / 19.57
Accuracy MTTR (%)               95.17 / 96.71      96.66 / 99.89      96.48 / 99.13

Regarding property verification, Table III presents information about the probability of satisfaction for two PCTL formulas specified according to the patterns in Table I, with the implicit premise cViolation. These probabilities are computed for the subformula enclosed by the probabilistic operator P./p(.) with the mean (MTTR) and the maximum (MaxTTR) times to recovery as time bounds. In general, the accuracy of the measures obtained in Mτx models is higher. This can be observed in property P1, which can be used as an indicator of accuracy, since a perfect probability estimation with t = MaxTTR should be 1 (by definition, the system always recovers before MaxTTR), and should be very close to 0.5 with t = MTTR. While the error range in the estimation of P1 with t = MaxTTR for Mτx models is 1.5-2%, in the case of Mx it is 9-10.4%. Moreover, accuracy in Mτx experiences less degradation with lower resolutions. This can be observed in probabilities for the same property and time bound experiencing a maximum variation of 0.5% in Mτx models, against 5% in Mx models. Regarding property P2, we can observe that its probability of satisfaction is low (≈ 0.3), and does not vary with different time bounds, since in all paths where hiCost holds in any of its states, this always happens before time MTTR. Let us remark that, unlike in P1, the probability of satisfaction for P2 with time bound MaxTTR is unknown and can only be estimated using a model.

Table III
SATISFACTION PROBABILITIES FOR PROPERTIES IN NcViolation

PCTL Formula                  Model      Satisfaction Probability
                                         t = MTTR         t = MaxTTR
P1: F≤t ¬cViolation           Mh / Mτh   0.594 / 0.517    0.910 / 0.985
                              Mm / Mτm   0.626 / 0.517    0.898 / 0.983
                              Ml / Mτl   0.644 / 0.517    0.896 / 0.980
P2: ¬hiCost U≤t ¬cViolation   Mh / Mτh   0.303 / 0.276    0.303 / 0.276
                              Mm / Mτm   0.331 / 0.276    0.334 / 0.276
                              Ml / Mτl   0.328 / 0.276    0.334 / 0.276
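For reference, the bounded-until subformulas of P1 and P2 (with F≤t φ ≡ true U≤t φ) are evaluated over a DTMC by the standard PCTL bounded-until recursion, which the model checker computes internally; a minimal Python sketch over a DTMC given as a transition-probability map (the representation is an assumption):

```python
def prob_bounded_until(states, P, safe, goal, t):
    """Probability of `safe U<=t goal` from each state of a DTMC,
    via the standard PCTL bounded-until recursion. F<=t goal is
    the special case where every state is safe."""
    prob = {s: 1.0 if s in goal else 0.0 for s in states}
    for _ in range(t):
        new = {}
        for s in states:
            if s in goal:
                new[s] = 1.0            # goal reached: probability 1
            elif s not in safe:
                new[s] = 0.0            # left the safe set before goal
            else:
                new[s] = sum(p * prob[d] for (src, d), p in P.items()
                             if src == s)
        prob = new
    return prob
```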

VI. RELATED WORK

Many methodologies support the analysis of non-functional properties, based either on modeling or on direct measurement using an existing system implementation.

While modeling is a useful approach to help in the development of a system for which there is no available implementation yet, it heavily relies on model parameter estimations obtained either from domain experts or from other similar existing systems. In particular, if we focus on the intersection between self-adaptive systems and modeling using probabilistic models, Calinescu and Kwiatkowska [4] introduce an autonomic architecture that uses Markov-chain quantitative analysis to dynamically adjust the parameters of an IT system according to its environment and objectives. However, this work assumes that Markov chains describing the different components in the system are readily available and are used as input to the problem.

Regarding direct measurement, Epifani et al. [9] present a methodology and framework to keep models alive by feeding them with run-time data to update their internal parameters, which should otherwise be provided by domain experts. The framework focuses on reliability and performance, and uses DTMCs and Queueing Networks as models to reason about non-functional properties. Moreover, Metzger et al. [18] describe a testing and monitoring framework oriented towards quality prediction in Service-Based Applications. Specifically, the framework is focused on dynamic selection of service bindings. To achieve this goal, the main activity of the framework is test case selection for online testing, which is combined with monitoring to predict the quality of the services and proactively determine the need for adaptation.

A complementary approach is taken by Calinescu et al. [3], who extend and combine [4] and [9], describing a tool-supported framework for the development of adaptive service-based systems. QoS requirements are translated into probabilistic temporal logic formulae used for analysis to identify and enforce optimal system configurations.

Our approach focuses on quantitative analysis using measurement, without assuming the existence of a Markov chain for the system or its components. Moreover, most proposals deal with service-based applications that rely on estimates of the future behavior of the system to optimize its operation by adjusting parameters, trying to enforce policies, or proactively determining the need for adaptation. In contrast, our approach focuses on providing levels of confidence w.r.t. the behavior of the system related to adaptations which have already taken place. In order to manage complexity, information is selectively collected and aggregated into models specific to the particularities of resilience properties.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have defined an approach for the verification of self-adaptive systems that relies on probabilistic model-checking for obtaining levels of confidence regarding trustworthy service delivery when a system undergoes adaptation. Moreover, we have studied how different parameters in model construction, such as resolution or the inclusion of time, affect the accuracy of the results. Our approach has been validated by assessing probabilistic response properties in Znn.com facing a slashdot-like effect. Experimental results have shown that including time in models yields more accurate estimations of the satisfaction probability of the properties, although at the cost of increasing model size by one order of magnitude w.r.t. time-homogeneous models. However, a moderate reduction in resolution always results in a remarkable reduction in the size of models without compromising accuracy in probability estimations, especially in models including time. Nevertheless, time-homogeneous models still yield reasonable results, making them useful in run-time applications that require a compromise between accuracy and resources.

Regarding future work, we plan to extend our approach to enable the comparison of alternative adaptations, and to incorporate feedback about probabilities into the selection of the best course of action at run-time in a self-adaptive system.

ACKNOWLEDGEMENTS

Co-financed by the Foundation for Science and Technology via project CMU-PT/ELE/0030/2009 and by FEDER via the “Programa Operacional Factores de Competitividade” of QREN with COMPETE reference FCOMP-01-0124-FEDER-012983. The authors would like to thank Paulo Casanova and Bradley Schmerl from CMU, Shang-Wen Cheng at NASA JPL, and Rafael Ventura at the University of Coimbra for their support with the deployment of Rainbow.

REFERENCES

[1] J. Andersson, R. de Lemos, S. Malek, and D. Weyns. Modeling Dimensions of Self-Adaptive Software Systems. In SEfSAS, volume 5525 of LNCS, pages 27–47. Springer, 2009.

[2] C. Baier and J.-P. Katoen. Principles of Model Checking. MIT Press, 2008.

[3] R. Calinescu, L. Grunske, M. Z. Kwiatkowska, R. Mirandola, and G. Tamburrelli. Dynamic QoS Management and Optimization in Service-Based Systems. IEEE Trans. Software Eng., 37(3):387–409, 2011.

[4] R. Calinescu and M. Z. Kwiatkowska. Using Quantitative Analysis to Implement Autonomic IT Systems. In ICSE, pages 100–110. IEEE, 2009.

[5] B. H. Cheng, R. de Lemos, et al. Software Engineering for Self-Adaptive Systems: A Research Roadmap. In SEfSAS, volume 5525 of LNCS, pages 1–26. Springer, 2009.

[6] S.-W. Cheng. Rainbow: Cost-Effective Software Architecture-Based Self-Adaptation. PhD thesis, CMU, 2008.

[7] S.-W. Cheng, D. Garlan, and B. R. Schmerl. Evaluating the Effectiveness of the Rainbow Self-Adaptive System. In SEAMS, pages 132–141. IEEE, 2009.

[8] M. B. Dwyer, G. S. Avrunin, and J. C. Corbett. Patterns in Property Specifications for Finite-State Verification. In ICSE, pages 411–420, 1999.

[9] I. Epifani, C. Ghezzi, R. Mirandola, and G. Tamburrelli. Model Evolution by Run-Time Parameter Adaptation. In ICSE, pages 111–121. IEEE CS, 2009.

[10] D. Garlan, S.-W. Cheng, A.-C. Huang, B. Schmerl, and P. Steenkiste. Rainbow: Architecture-Based Self-Adaptation with Reusable Infrastructure. IEEE Computer, 37(10):46–54, 2004.

[11] L. Grunske. Specification Patterns for Probabilistic Quality Properties. In ICSE, pages 31–40. ACM, 2008.

[12] S. Hart, M. Sharir, and A. Pnueli. Termination of Probabilistic Concurrent Programs. In POPL, pages 1–6. ACM, 1982.

[13] M. Juric. Slashdotting of mjuric/universe. http://www.astro.princeton.edu/∼mjuric/universe/slashdotting/, 2004.

[14] V. Kulkarni. Modeling and Analysis of Stochastic Systems. Chapman and Hall, 1995.

[15] M. Kwiatkowska, G. Norman, and D. Parker. Stochastic Model Checking. In SFM, volume 4486 of LNCS, pages 220–270. Springer, 2007.

[16] M. Z. Kwiatkowska, G. Norman, and D. Parker. Quantitative Analysis with the Probabilistic Model Checker PRISM. Electr. Notes Theor. Comput. Sci., 153(2):5–31, 2006.

[17] J.-C. Laprie. From Dependability to Resilience. In DSN Fast Abstracts. IEEE CS, 2008.

[18] O. Sammodi, A. Metzger, X. Franch, M. Oriol, J. Marco, and K. Pohl. Usage-Based Online Testing for Proactive Adaptation of Service-Based Applications. In COMPSAC, pages 582–587. IEEE CS, 2011.

[19] D. Terdiman. Solution for Slashdot Effect? http://www.wired.com/science/discoveries/news/2004/10/65165, 2004.
