
Optimizing Job Reliability Through Contention-Free, Distributed Checkpoint Scheduling

Yu Xiang, Hang Liu, Tian Lan, Howie Huang, Suresh Subramaniam
Department of ECE, George Washington University

Abstract

A datacenter that consists of hundreds or thousands of servers can provide virtualized environments to a large number of cloud applications and jobs that value the requirement of reliability very differently. Checkpointing a virtual machine (VM) is a proven technique to improve reliability. However, existing checkpoint scheduling techniques for enhancing reliability of distributed systems have fallen short, either because they tend to offer the same, fixed reliability to all jobs, or because their solutions are tied to specific applications and rely on centralized checkpoint control mechanisms. In this work, we first show that reliability can be significantly improved through contention-free scheduling of checkpoints. Then, inspired by the Carrier Sense Multiple Access (CSMA) protocol in wireless congestion control, we propose a novel framework for distributed and contention-free scheduling of VM checkpointing to provide reliability as a transparent, elastic service. We quantify reliability in closed form by studying system stationary behaviours, and maximize job reliability through utility optimization. Our design is validated via a proof-of-concept prototype that leverages readily available implementations in Xen hypervisors. The proposed checkpoint scheduling is shown to significantly reduce checkpointing interference and improve reliability by as much as one order of magnitude over contention-oblivious checkpoint schemes.

1 Introduction

Reliability is a critical requirement for modern datacenters. Clearly, high reliability is desirable for many jobs and applications, because even a small service downtime may potentially lead to business interruption with hefty financial penalties. In a public cloud, reliability is provided as a fixed service parameter, e.g., all Amazon EC2 users are expected to receive 99.95% reliability [1]. In other words, the best reliability that cloud customers can get right now is also the worst. Thus it is up to the cloud customers to harden the jobs running within their VM instances in order to enhance reliability for critical applications. Now, applications that provide increased reliability do exist, e.g., Oracle's payroll and general ledger programs. However, these approaches are specific to applications or require specific problem structures, and providing elastic reliability for the masses remains an elusive goal in cloud computing today.

This paper introduces an approach for assigning elastic reliability to heterogeneous datacenter jobs via distributed checkpoint scheduling and reliability optimization. Virtual Machine (VM) checkpointing is a widely employed, application-transparent solution to improve reliability in public clouds [4, 5]. To optimize reliability of a single job, prior work has proposed a number of models for calculating the optimal checkpoint schedule [6, 7, 8, 9, 10, 11], and several algorithms for balancing checkpoint workload and performance overhead have also been proposed in [14, 15, 16]. Unfortunately, these solutions fall short in optimizing checkpoints of multiple jobs whose reliability requirements may vary significantly, because they fail to take into account the resource contention among different jobs' checkpoints.

In a multi-job scenario, uncoordinated VM checkpoints taken independently run the risk of interfering with each other [12, 13] and may cause significant resource contention and reliability degradation [2]. In particular, the time to save local checkpoint images is determined largely by how I/O resources are shared, while the overhead to transfer locally saved images to networked storage depends on how network resources are shared. In a large datacenter, unmanaged and uncoordinated VM checkpointing is likely to encounter severe network and I/O congestion, resulting in high VM checkpointing overhead and reliability loss. Clearly, for a large datacenter, a centralized checkpoint scheduling scheme that micro-manages each job's checkpoints is impractical for handling tens of thousands of jobs. Distributed checkpoint scheduling is needed to achieve our goal of providing elastic reliability as a service.

To this end, we propose a novel job-level self-management approach that not only enables distributed checkpoint scheduling but also optimizes reliability assignments to individual jobs. Our contention-free scheduling solution is inspired by the Carrier Sense Multiple Access (CSMA) method, a distributed protocol for accessing a shared transmission medium, wherein a node verifies the absence of other traffic before transmitting on the medium [24, 25, 26]. If a job senses any ongoing checkpoint activity at its serving hosts, it keeps silent; otherwise it waits (or backs off) for a random amount of time before checkpointing, suspending the back-off if any of its hosts becomes busy. We compare our method with contention-oblivious checkpoint scheduling, wherein each job simply checkpoints its VMs at a predetermined rate regardless of any contention from other jobs' checkpoints. To the best of our knowledge, this is the first work using a CSMA-based scheme for distributed datacenter resource scheduling and reliability optimization. The main contributions of this paper are summarized as follows:

• We harness CSMA-based interference management to provide a distributed and contention-free checkpoint scheduling protocol. Our solution is well suited for implementation in large-scale datacenters, as its job-level distributed checkpoint scheduling mechanism can effectively mitigate resource contentions caused by concurrent job checkpoints.

• Reliability received by each individual job is characterized in closed form by studying the stationary behavior of our proposed protocol. It enables a joint reliability optimization where flexible service-level agreements (SLAs) are negotiated through a joint assessment of all jobs' reliability requirements and total datacenter resources available.

• Results are validated via a proof-of-concept prototype that leverages readily available implementations in Xen and Linux. The proposed CSMA-based checkpoint scheduling is shown to significantly reduce checkpoint interference and improve reliability by an order of magnitude over contention-oblivious checkpoint schemes.

The rest of the paper is organized as follows. Section II introduces the system model and illustrates the necessity for distributed, contention-free checkpoint scheduling. Our theoretical analysis of CSMA-based checkpoint scheduling and reliability optimization is presented in Section III. Section IV contains experimental results via a proof-of-concept prototype, and Section V concludes the paper.

2 System Model and Motivations

2.1 Reliability Model Using Checkpoints

In the simplest form, a Virtual Machine Monitor (VMM) can periodically record a clear state of the running VMs, including a full image of the VM's memory, CPU, and all the device states, and flush the resulting VM images to a central storage server to establish recovery points [12, 17, 18]. For a single job consisting of multiple VMs, uncoordinated checkpoints taken independently of each other [12, 13] run the risk of cascaded rollbacks if causality is not respected. This can be avoided by taking synchronous checkpoints of all the VMs that a job comprises, although doing so increases the probability of checkpoint contention since jobs often consist of multiple VMs.

As shown in Figure 1, a single job periodically checkpoints all its VMs every Ti seconds. Each checkpoint requires time Tn + Tf to complete, which includes time Tn to suspend all its VM executions in order to save a local checkpoint image, as well as time Tf to transfer the saved VM images to a remote destination. After a failure occurs, the job can be restored from an available checkpoint and rolled back to the last saved state with recovery time Tr. We define reliability as 1 minus the expected fraction of service downtime. More precisely, let a failure occur at time t after the nth checkpoint is fully completed. Then,

$$R = 1 - E\!\left[\frac{\text{Service Downtime}}{\text{Total Service Time}}\right] = 1 - E\!\left[\frac{t - (n-1)T_i - T_n - T_f + nT_n + T_r}{t + T_r}\right], \qquad (1)$$

where nTn is the total service downtime due to taking checkpoints, t − (n−1)Ti − Tn − Tf is the lost service time due to roll-back, and E is an expectation over all random factors, e.g., checkpoint overhead and failure time. For clarification, we summarize the main notation used in this paper in Table 1.
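For concreteness, the following small sketch (our own illustrative code, with arbitrary parameter values) evaluates the downtime fraction inside (1) for one fixed failure scenario; Eq. (1) itself takes an expectation of this quantity over the random failure time and checkpoint overheads.

```python
# Minimal sketch (not from the paper): evaluate the downtime fraction inside Eq. (1)
# for a single concrete failure scenario, using illustrative parameter values.

def downtime_fraction(t, n, T_i, T_n, T_f, T_r):
    """Fraction of service time lost when a failure hits at time t, after the
    n-th checkpoint has fully completed (see Eq. (1))."""
    rollback_loss = t - (n - 1) * T_i - T_n - T_f   # work lost since the last recovery point
    checkpoint_loss = n * T_n                        # VM suspension for n checkpoints
    return (rollback_loss + checkpoint_loss + T_r) / (t + T_r)

if __name__ == "__main__":
    # Example: checkpoint every 600 s, 30 s suspension, 40 s transfer, 80 s recovery,
    # failure at t = 1340 s (after the 2nd checkpoint completed). Eq. (1) averages this
    # quantity over random failure times; here we only evaluate one realization.
    frac = downtime_fraction(t=1340, n=2, T_i=600, T_n=30, T_f=40, T_r=80)
    print(f"Downtime fraction for this single scenario: {frac:.3f}")
```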

For a single job, a number of proposals for determining the optimal temporal scheduling of checkpoints have been provided in [14, 15, 16, 19, 20]. Further, protocols for taking consistent snapshots of distributed services in virtualized environments using OpenFlow hardware to improve fault tolerance are presented in [21]. However, these results do not take into account the contention among different checkpoints in a multi-job scenario. It has been shown that as the size of the datacenter grows, uncoordinated VM checkpoints from different jobs may cause significant resource contention and result in high checkpoint overheads Tn (due to I/O resource contention) and Tf (due to network resource contention) [2], which directly translate into severe reliability degradation according to (1).


Figure 1: Multiple VMs belonging to the same job must be checkpointed simultaneously to avoid cascaded rollbacks. It increases the chance of checkpoint contention.

Figure 2: Fully coordinated checkpoint scheduling in a pipeline mode significantly reduces resource contention over parallel checkpoints.

Recently, a utility-based framework for joint reliability maximization under datacenter resource constraints was proposed in [2]. It is shown to reduce expected service downtime by as much as an order of magnitude, although the solution requires centralized coordination and scheduling and does not allow a distributed implementation at large scale.

2.2 Need for Contention-Free Checkpoint Scheduling

Clearly, job reliability benefits from mitigating checkpoint overhead, while the checkpoint frequency also has to be chosen to amortize not only the service downtime due to taking checkpoints but also the potential service loss due to failure and roll-back. Because of resource sharing in datacenters, all of these require joint checkpoint scheduling over all jobs that share a common pool of resources. Consider two extreme cases for multi-job checkpoint scheduling: parallel and pipeline scheduling, as illustrated in Figure 2.

Table 1: Main notation.

Symbol        Meaning
N             Number of jobs, indexed by i = 1, ..., N
S             Number of hosts, indexed by h = 1, ..., S
λ_i           Sensing rate of job i
µ_i           Service rate to checkpoint job i
τ^c_i         Mean checkpoint time of job i
τ^r_i         Mean rollback and recovery time of job i
f_i           Mean failure rate of job i
R_i           Reliability of job i
X_k           A state in our Markov Chain model
P_{X_k,X_l}   Transition rate between states X_k and X_l
π_k           Stationary distribution in state X_k
A_i           The set of all states containing job i
E[Y]          Expectation of a random variable Y

In parallel mode, the checkpoints of all N jobs are taken at the same time and the total I/O and network bandwidth is shared among them. In theory, the time Tn to save a local checkpoint and the time Tf to transfer VM images will be at least N times higher than when checkpoints are taken one at a time, and there can also be an overhead To for switching between VM checkpoints. On the other hand, if fine-grained checkpoint control is possible, checkpoints of jobs can be taken one immediately after another in a pipelined fashion by overlapping the image-saving time of one job's checkpoint with the image-transfer time of another job. With such completely coordinated checkpoints, jobs can take full advantage of all the I/O and network bandwidth available, causing minimal interference to each other.

Figure 3: Fully coordinated pipeline checkpoint schedule significantly reduces contention and improves reliability over parallel checkpoint schedule. Reliability calculated with 8 failures/year.

To demonstrate the advantage of checkpoint coordination, we set up a simple experiment involving two hosts with four VMs on each host to quantify how much reliability is achieved under each scheme. We implement both parallel and pipeline scheduling, and measure the checkpoint overhead and VM image transfer time. Figure 3 shows that pipeline scheduling outperforms parallel scheduling by nearly an order of magnitude for various VM sizes. Further, the reliability improvement increases as the VM size grows: the larger the VM, the longer it takes to save and transfer checkpoint images, and the more likely checkpoint contention occurs.

We conclude that checkpoint scheduling is crucial for providing high reliability in a multi-job scenario. Even though pipeline scheduling completely avoids checkpoint interference and is able to efficiently utilize all I/O and network bandwidth available, such a centralized coordination and micro-management approach is prohibitive in large-scale datacenters hosting tens of thousands of jobs. Therefore, a practical checkpoint scheduling scheme should (i) allow a distributed implementation without relying on any centralized, fine-grained checkpoint control, (ii) be able to schedule contention-free checkpoints for a large number of jobs that may have heterogeneous parameters and demands, and (iii) enable a joint reliability maximization that assigns each job the optimal reliability level for its demand. To this end, this paper makes novel use of the CSMA protocol from wireless interference control to derive a distributed, contention-free checkpoint scheduling protocol with joint reliability optimization.

3 Our Proposed Protocol and Reliability Optimization

In this section, we propose a CSMA-based checkpoint scheduling protocol and quantify the resulting reliability received by each job in closed form. Unlike existing work [24, 25, 26] that applies CSMA to wireless interference management and is mostly concerned with data throughput, our analysis quantifies the reliability received by each job and enables joint reliability optimization of all jobs via a utility framework. The protocol is inspired by CSMA but is applied to datacenter resource sharing to achieve distributed, contention-free checkpoint scheduling in large-scale datacenters. Figure 4 shows an overview of the system architecture. Each job may consist of one or more VMs, which are distributed to different physical machines (or servers). Our checkpoints are organized at the job level: if a checkpoint of a job is triggered, all VMs that belong to the job first save their checkpoint images to local storage (in order to minimize VM downtime) and then transfer them to networked storage to survive host failures.

In our design, each job achieves reliability optimization via self-management in two ways. First, each job autonomously determines its own checkpoint scheduling based on locally available information, e.g., the co-location of other jobs and the occurrence of checkpoint contention.


Figure 4: Our proposed architecture for checkpoint scheduling.

Second, each job autonomously updates its checkpoint rate based on the locally available optimal solution, which is done at runtime with no dependence on any centralized management decisions.

In this section, we will first introduce the CSMA-based checkpoint scheduling protocol, characterize the resulting reliability via a Markov Chain analysis of system stationary distributions, and then present a joint reliability optimization. Theoretical results obtained in this section will be validated through a prototype implementation in Section IV.

3.1 CSMA-Based Checkpoint Scheduling

We consider a datacenter serving N jobs, denoted by N = {1, 2, ..., N}, on S servers, denoted by S = {1, 2, ..., S}. Each job i comprises h_i VMs that are hosted on a subset of servers H_i ⊆ S.

As discussed in Section II, mitigating checkpoint contention can significantly reduce service downtime and improve reliability. To develop a checkpoint scheduling protocol that is not only contention-free but also fully distributed, we extend an idealized model of CSMA as in [24, 25, 26]. CSMA is a probabilistic medium access control protocol in which a node verifies the absence of other traffic before transmitting on a shared transmission medium. Our proposed checkpoint scheduling works as follows. Each job i makes the decision to create a remote checkpoint image based only on its local parameters and its observation of contention. If job i senses ongoing checkpoints at any of its serving hosts (i.e., any host s such that s ∈ H_i), then it keeps silent. If none of its serving hosts is busy, then job i waits (or backs off) for a random period of time which is exponentially distributed with mean 1/λ_i and then starts its checkpoint.¹ During the back-off, if some contending job starts taking checkpoints, then job i suspends its back-off and resumes it after the contending checkpoint is complete.

¹The random backoff time is to ensure that two potentially contending jobs that sense no contention from other jobs do not start checkpointing at the same time and trigger a contention.


For analytical tractability, we assume that the total time of saving a local checkpoint and transferring it to a remote destination is exponentially distributed with mean 1/µ_i = E(T_n + T_f). This assumption of exponential checkpoint time can be removed using results in [26]. In such an idealized CSMA model, if the sensing time is negligible and the back-off time follows a continuous distribution, then the probability that two contending checkpoints start at the same time is 0 [24]. Therefore, the CSMA-based protocol, summarized in Figure 5, achieves contention-free, distributed scheduling of job checkpoints.

Assign positive sensing rates λ_i > 0, ∀i

Each job i independently performs:
  Initialize backoff timer B_i
  while job i is running
    while B_i > 0
      if any server in H_i is busy
        Job i keeps silent
        Generate new backoff: B_i = exponential with mean 1/λ_i
      end if
      Update B_i = B_i − 1
    end while
    Checkpoint all VMs of job i
    Generate new backoff: B_i = exponential with mean 1/λ_i
  end while

Figure 5: Our contention-free, distributed checkpoint scheduling protocol inspired by CSMA.
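For concreteness, the following is a small discrete-event sketch of the back-off rule in Figure 5 (our own illustrative code, not the paper's prototype). The job-to-host mapping, the sensing and service rates, and the one-second tick granularity are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Minimal simulation sketch of the CSMA-style checkpoint protocol in Figure 5.
# Job-to-host mapping and rates are made-up examples (cf. the Figure 6 topology).
HOSTS = {1: {"A"}, 2: {"B"}, 3: {"A", "B"}}      # H_i: hosts serving each job
LAMBDA = {1: 0.02, 2: 0.02, 3: 0.01}             # sensing rates (1/s)
MU = {1: 0.05, 2: 0.05, 3: 0.03}                 # checkpoint service rates (1/s)

def simulate(t_end=100_000.0):
    busy_until = defaultdict(float)              # per-host time until which it is busy
    backoff_left = {i: random.expovariate(LAMBDA[i]) for i in HOSTS}
    checkpoints = defaultdict(int)
    t, dt = 0.0, 1.0                             # advance time in 1-second ticks
    while t < t_end:
        for i, hosts in HOSTS.items():
            if any(busy_until[h] > t for h in hosts):
                # A contending checkpoint is in progress: keep silent, redraw backoff.
                backoff_left[i] = random.expovariate(LAMBDA[i])
                continue
            backoff_left[i] -= dt
            if backoff_left[i] <= 0:
                duration = random.expovariate(MU[i])   # T_n + T_f ~ Exp(mu_i)
                for h in hosts:
                    busy_until[h] = t + duration       # occupy all serving hosts
                checkpoints[i] += 1
                backoff_left[i] = random.expovariate(LAMBDA[i])
        t += dt
    return dict(checkpoints)

if __name__ == "__main__":
    # Jobs 1 and 2 may checkpoint concurrently; job 3 never overlaps with either.
    print(simulate())
```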

In contrast to existing CSMA analysis that focuses on data throughput, this paper aims to quantify the individual-job reliability resulting from such a contention-free, distributed checkpoint scheduling protocol in large-scale datacenters. This requires us to investigate the distribution of the checkpoint interval Ti, which is a random variable due to the CSMA-based, probabilistic checkpoint scheduling protocol. Given a set of sensing rates λ1, ..., λN, we use Ri to denote the reliability received by job i under the proposed checkpoint scheduling protocol. Notice that reliability also depends on the service rates µ1, ..., µN and the job failure rate fi, which may further depend on the server failure model and VM placement. In this paper, we focus on the checkpoint scheduling protocol and reliability maximization by optimizing the parameters λ1, ..., λN. Using our model, we are able to find the reliability received by each job in closed form and then perform reliability optimization jointly over all jobs with respect to their utilities.

3.2 Markov Chain Model

In order to optimize reliability, we first need to obtain the reliability each job receives under the CSMA-based checkpoint scheduling protocol for given sensing rates λ1, ..., λN. We make use of a Markov Chain model, which is commonly employed for CSMA analysis in wireless interference management. The Markov Chain for analyzing the protocol depends on the sensing rates λi and checkpoint overheads µi, as well as the datacenter VM placement that determines the pattern of job interference. We first describe the model in this subsection and then use it to derive job reliability in closed form to enable reliability optimization.

For any time t, we define the system state as the set of jobs actively taking checkpoints at t. Since our CSMA-based protocol achieves contention-free checkpoint scheduling, in each state a set of non-conflicting jobs (known as an Independent Set) is scheduled. We assume that there exist K ≤ 2^N possible states, represented by X_k ⊆ N, for k = 1, ..., K. In state X_k, if job i is not taking checkpoints and none of its conflicting jobs is taking checkpoints, state X_k can transit to state X_k ∪ {i} with rate λi (i.e., job i starts its checkpoint). Similarly, state X_k ∪ {i} can transit to state X_k with rate µi (i.e., job i completes its checkpoint). It is easy to see that the system state at any time forms a Continuous Time Markov Chain (CTMC).

Unlike existing CSMA analyses for wireless systems that focus on throughput, our goal is to quantify job reliability using the Markov Chain model. According to (1), this requires characterizing the distributions of the checkpoint overheads Tn, Tf and the checkpoint interval Ti, which are related to the sojourn time and returning time of the CTMC. We first transform the CTMC into an embedded Discrete Time Markov Chain (DTMC) that is easier to analyze. Since the embedded chain also has different holding times for its states, we further apply the uniformization technique to obtain a randomized DTMC. It is sufficient to consider transitions between states that differ by one job because there is no contention in our idealized CSMA model. Let v be a sufficiently large uniformization constant. Then the DTMC has the following transition probabilities:

$$P_{X_k,\,X_k\cup\{i\}} = \frac{\lambda_i}{v} \quad \text{and} \quad P_{X_k\cup\{i\},\,X_k} = \frac{\mu_i}{v}, \qquad (2)$$

where P_{X_k,X_l} denotes the transition probability from state X_k to state X_l. Due to uniformization, we define v_k = ∑_{l≠k} v · P_{X_k,X_l} as the total transition rate out of state X_k and add a self-transition probability 1 − v_k/v so that the transition probabilities form a stochastic matrix. This means we have

$$P_{X_k, X_k} = 1 - \frac{v_k}{v}. \qquad (3)$$

Now we can study properties of the original CTMC through the DTMC, whose state transitions occur according to the jump times of an independent Poisson Process with rate v.

Figure 6(a) gives an example datacenter with 3 jobs and 2 hosts. If each host is able to checkpoint one VM at a time without incurring any performance loss, then the checkpoints of job 3 conflict with those of jobs 1 and 2, whereas jobs 1 and 2 can take parallel checkpoints without any resource contention. Therefore, this system has K = 5 feasible states (or Independent Sets): {·}, {1}, {2}, {3}, {1,2}. State {·} means no job is taking checkpoints, {i} means a single job i takes checkpoints for i = 1, 2, 3, and {1,2} means jobs 1 and 2 take checkpoints at the same time.

Figure 6: Example: 3 jobs and the corresponding Markov Chain.

Given the above DTMC model, we are interested in analyzing its stationary behavior, which reveals the distributions of the checkpoint overheads Tn, Tf and the checkpoint interval Ti. The transition probability matrix P of the DTMC has size K by K, and its stationary distribution is denoted by π1, ..., πK, satisfying

$$(\pi_1, \ldots, \pi_K) = (\pi_1, \ldots, \pi_K)\cdot P, \qquad (4)$$

where πk is the stationary probability that the DTMC stays in state X_k. In the following lemma, we show that the stationary distribution can be obtained in closed form for our DTMC model.

Lemma 1 When no checkpoint interference (i.e., contention) is permitted, the DTMC has stationary distribution

$$\pi_k = \frac{\prod_{i \in X_k} \lambda_i \cdot \prod_{j \notin X_k} \mu_j}{C_\lambda}, \qquad (5)$$

where C_λ is a normalization factor such that ∑_k π_k = 1.

Proof: This lemma can be directly proved by showing that the stationary distribution in (5) satisfies the detailed balance equation π_k P_{X_k,X_l} = π_l P_{X_l,X_k}, ∀k, l. Therefore, the DTMC is time-reversible and its stationary distribution depends on the rates λi, µi of all jobs. □
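As an illustration, the sketch below (our own code with made-up rates, not from the paper) evaluates (5) for the 3-job example of Figure 6 and checks that the probabilities sum to one.

```python
from math import prod

# Stationary distribution (5) for the Figure 6 example: jobs 1 and 2 do not
# conflict, job 3 conflicts with both. Rates are illustrative placeholders.
LAM = {1: 0.02, 2: 0.02, 3: 0.01}
MU = {1: 0.05, 2: 0.05, 3: 0.03}
STATES = [frozenset(), frozenset({1}), frozenset({2}),
          frozenset({3}), frozenset({1, 2})]           # the K = 5 independent sets

def unnormalized(state):
    on = prod(LAM[i] for i in state)                   # product of lambda_i, i in X_k
    off = prod(MU[j] for j in MU if j not in state)    # product of mu_j, j not in X_k
    return on * off

weights = {s: unnormalized(s) for s in STATES}
C_lambda = sum(weights.values())                       # normalization factor
pi = {s: w / C_lambda for s, w in weights.items()}

# pi_{A_i}: fraction of time job i is checkpointing (sum over states containing i).
pi_A = {i: sum(p for s, p in pi.items() if i in s) for i in LAM}
print(pi_A, sum(pi.values()))                          # second value should be 1.0
```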

3.3 Reliability Analysis

Now we can analyze the stationary behavior of the CTMC through the DTMC and an independent Poisson Process with rate v. From (1), reliability is defined as 1 minus the expected fraction of service downtime. This means that we need to obtain the distributions of the checkpoint overheads Tn, Tf and the checkpoint interval Ti from the Markov Chain model. We assume that each job has a known Mean Time To Failure (MTTF) 1/fi and that its failure time is modeled by an exponential distribution. In practice, the MTTF can be estimated from existing failure models or large-scale datacenter event logs [27, 28]. For example, if each server fails independently according to a Poisson Process with rate f0 and job i is hosted by mi different servers, then fi = mi · f0.

Consider the checkpoint overheads Tn, Tf and the checkpoint interval Ti in our CSMA-based protocol for a single job i. Let A_i = {X_k : i ∈ X_k} be the set of all states containing job i. It is not hard to see that the total checkpoint overhead Tn + Tf is the sojourn time that the CTMC spends within A_i, i.e., the time to checkpoint job i's VMs. Similarly, the checkpoint interval Ti is the first returning time of the CTMC to A_i. Clearly, both the sojourn time and the first returning time are random variables whose distributions depend on the Markov Chain model. Using the definition in (1), we first rewrite the reliability Ri in terms of the random checkpoint overhead and checkpoint interval.

Lemma 2 Let Ti be the random checkpoint interval of job i. If job i has Poisson failures with rate fi, then its reliability is given by

$$R_i = 1 - \tau_i^c \mu_i \pi_{A_i} - f_i \pi_{A_i}\, E T_i - f_i \tau_i^r - \frac{f_i\, E(T_i^2)}{2\, E T_i}, \qquad (6)$$

where τ^c_i is the mean time to save a local checkpoint image, τ^r_i is the mean repair time, and π_{A_i} = ∑_{k ∈ A_i} π_k is the sum of the stationary probabilities of all states in A_i.

The result is very intuitive. First, π_{A_i} is the fraction of time that the Markov Chain spends in the states A_i (i.e., checkpointing job i's VMs). Since our protocol is contention-free, out of E(Tn + Tf) = 1/µi seconds on average for each checkpoint, job i's VMs have to be suspended for E(Tn) = τ^c_i seconds to save consistent, local checkpoint images. Therefore, the service downtime due to taking checkpoints is given by τ^c_i µi π_{A_i}. Second, fi τ^r_i is the expected downtime due to failure recovery and repair. Further, because of our assumption of Poisson failures, the lost service time due to VM roll-back after each failure can be derived using the Poisson Arrivals See Time Averages (PASTA) property, i.e., fi E(T_i^2)/(2 E T_i). Finally, when a failure arrives before a checkpoint is completed (which happens with probability π_{A_i}), all VMs must be recovered from the last available checkpoint images.


This implies that an additional roll-back time of π_{A_i} E T_i is incurred on average. A formal proof of this lemma can be found in our online technical report [32].

Therefore, the reliability of job i can be obtained if E T_i and E T_i^2 are known. Next we derive them via their counterparts in the embedded DTMC. Since job i takes a checkpoint if the DTMC is in a state belonging to A_i, its checkpoint interval T_i can be measured by the first returning time to A_i, denoted by t_{A_i}. Let Y_1, Y_2, ... be a sequence of i.i.d. exponentially distributed variables with mean 1/v. We have $T_i = \sum_{l=1}^{t_{A_i}} Y_l$, which results in

$$E T_i = E t_{A_i} \cdot E Y_l = \frac{1}{v}\, E t_{A_i} \qquad (7)$$

and

$$E T_i^2 = \mathrm{var}(Y_l) \cdot E t_{A_i} + (E Y_l)^2 \cdot E t_{A_i}^2 = \frac{1}{v^2}\left(E t_{A_i} + E t_{A_i}^2\right). \qquad (8)$$

Here we used the i.i.d. property of Y_l, as well as the independence between the DTMC and the underlying Poisson Process.

It now remains to find E t_{A_i} and E t_{A_i}^2 in the embedded DTMC. When the number of jobs is large, we can approximate the first returning time t_{A_i} by an exponential distribution [29]; then its second-order moment is E t_{A_i}^2 = 2 (E t_{A_i})^2. To find E t_{A_i}, we apply Kac's formula in [29] and obtain the following result:

Lemma 3 The expectation of the first returning time t_{A_i} for the DTMC is given by

$$E t_{A_i} = 1 + \frac{v}{\mu_i}\left(\frac{1}{\pi_{A_i}} - 1\right). \qquad (9)$$

Proof: The checkpoint interval t_{A_i} is the time until the DTMC first returns to any state in A_i after it last left. Let X(n) be the DTMC state at time n under the stationary distribution. It is easy to see that

$$t_{A_i} = \min\{T \mid X(0) \in A_i,\, X(1) \notin A_i,\, X(T) \in A_i\}, \qquad (10)$$

that is, the minimum (random) time the chain takes to return to A_i after leaving it at time n = 0. Applying Kac's formula [29] to the DTMC, we have $1/\pi_{A_i} = E[\tau^+_{A_i}]$, where $\tau^+_{A_i} = \min\{T \mid X(0) \in A_i,\, X(T) \in A_i,\, T \geq 1\}$ is the first hitting time from a stationary distribution. Using the law of total probability, we further have

$$E[\tau^+_{A_i}] = \frac{v-\mu_i}{v}\, E[\tau^+_{A_i} \mid X(1) \in A_i] + \frac{\mu_i}{v}\, E[\tau^+_{A_i} \mid X(1) \notin A_i] = \frac{v-\mu_i}{v} + \frac{\mu_i}{v}\, E[t_{A_i}], \qquad (11)$$

where the first step uses P{X(1) ∈ A_i} = 1 − µi/v and P{X(1) ∉ A_i} = µi/v, because the departure probability from A_i is a constant µi/v from all states. The second step uses the definition of t_{A_i} in (10), as well as the fact that $E[\tau^+_{A_i} \mid X(1) \in A_i] = 1$ by the definition of hitting time. Combining (11) with Kac's formula $1/\pi_{A_i} = E[\tau^+_{A_i}]$, we derive the desired equation (9). This completes the proof. □

Plugging these results into (6), we can quantify the reliability received by each job i under our contention-free, distributed checkpoint scheduling protocol. Again, we refer the reader to our online technical report [32] for a complete proof.

Theorem 1 For given rates λ1, ..., λN, each job i in our protocol receives the following reliability Ri:

$$R_i = 1 - f_i \tau_i^r - \tau_i^c \mu_i \pi_{A_i} - \frac{f_i}{\mu_i}\left(\pi_{A_i} + \frac{1}{\pi_{A_i}}\right). \qquad (12)$$
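As a quick numerical illustration of (12), the snippet below (our own sketch; all parameter values are arbitrary assumptions) evaluates the closed-form reliability for a single job.

```python
# Sketch: evaluate the closed-form reliability (12) for one job, given pi_{A_i}
# from the stationary distribution (5). All numbers are illustrative assumptions.

def reliability(pi_Ai, f_i, mu_i, tau_c, tau_r):
    """Eq. (12): R_i = 1 - f_i*tau_r - tau_c*mu_i*pi_Ai - (f_i/mu_i)*(pi_Ai + 1/pi_Ai)."""
    return 1.0 - f_i * tau_r - tau_c * mu_i * pi_Ai \
               - (f_i / mu_i) * (pi_Ai + 1.0 / pi_Ai)

# Example: 8 failures/year, 30 s local-save time, 80 s repair, 1/mu = 70 s checkpoint,
# and a job that spends 0.1% of its time checkpointing (pi_Ai = 0.001).
f = 8 / (365 * 24 * 3600)           # failures per second
print(reliability(pi_Ai=1e-3, f_i=f, mu_i=1 / 70.0, tau_c=30.0, tau_r=80.0))
```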

3.4 Reliability Optimization

We can use Theorem 1 to numerically calculate the reliability of each job i for any given rates λ1, ..., λN and failure rate fi. Let Ui(Ri) be a utility function representing the value of assigning reliability level Ri to job i. Our goal is to derive an autonomous reliability optimization where flexible SLAs are negotiated through a joint assessment of the users' utilities and the total datacenter resources available. Toward this end, we formulate a joint reliability optimization through a utility optimization framework [30, 31] that maximizes the total utility ∑_i Ui(Ri), i.e.,

$$\begin{aligned}
\max \quad & \sum_i U_i(R_i) & (13)\\
\text{s.t.} \quad & R_i = 1 - f_i\tau_i^r - \tau_i^c\mu_i\pi_{A_i} - \frac{f_i}{\mu_i}\left(\pi_{A_i} + \frac{1}{\pi_{A_i}}\right),\\
& \pi_{A_i} = \frac{1}{C_\lambda}\sum_{X_k \in A_i}\ \prod_{j\in X_k}\lambda_j \cdot \prod_{l\notin X_k}\mu_l,\\
\text{var.} \quad & \lambda_1, \ldots, \lambda_N
\end{aligned}$$

where C_λ is a normalization factor such that ∑_k π_k = 1. Here we used the closed-form reliability characterization in (12) and the stationary distribution in (5).

The reliability optimization is performed by maximizing the aggregate utility ∑_i Ui(Ri) over all feasible sensing rates λ1, ..., λN. In a dynamic setting, this reliability optimization must be re-solved at each job arrival and departure to rebalance the reliability assignment autonomously. We note that many local search heuristics, such as Hill Climbing [33] and Simulated Annealing [34], can be employed to solve the reliability optimization in (13) by incrementally improving the total utility along single search directions, as sketched below.
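A minimal version of such a local search, assuming for simplicity a single shared bottleneck so that at most one job checkpoints at a time, and linear utilities; the rates, weights, and helper names here are our own illustrative choices, not the paper's implementation.

```python
import random
from math import prod

# Hill-climbing sketch for (13), assuming a single shared bottleneck so that at most
# one job checkpoints at a time (A_i = {i}); all parameters are illustrative.
MU    = {1: 1/200, 2: 1/100}         # checkpoint service rates (large, small job)
F     = {1: 6.43e-7, 2: 1.29e-7}     # failure rates (per second)
TAU_C = {1: 50.0, 2: 25.0}           # local-save times
TAU_R = {1: 400.0, 2: 200.0}         # recovery times
W     = {1: 2.0, 2: 1.0}             # linear utility weights: sum_i W[i] * R_i

def pi_A_of(lam):
    # Stationary fractions (5) restricted to the idle state and single-job states.
    weights = {i: lam[i] * prod(MU[l] for l in MU if l != i) for i in MU}
    C = prod(MU.values()) + sum(weights.values())
    return {i: w / C for i, w in weights.items()}

def utility(lam):
    pa = pi_A_of(lam)
    R = {i: 1 - F[i]*TAU_R[i] - TAU_C[i]*MU[i]*pa[i]
            - (F[i]/MU[i]) * (pa[i] + 1/pa[i]) for i in MU}      # Eq. (12)
    return sum(W[i] * R[i] for i in MU)

lam = {i: 1e-5 for i in MU}          # initial sensing rates
best = utility(lam)
for _ in range(2000):
    i = random.choice(list(MU))
    trial = dict(lam)
    trial[i] *= random.choice([0.5, 2.0])                        # perturb one rate
    if (u := utility(trial)) > best:                             # keep only improvements
        lam, best = trial, u
print(lam, best)
```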



Figure 7: Comparison of the reliability values from our theoretical analysis with a prototype experiment using 24 VMs in Xen. Our reliability analysis can accurately estimate reliability in the proposed contention-free checkpoint scheduling protocol within a margin of ±1%.


Figure 8: Convergence of the sensing rates λ1, λ2 when Hill Climbing local search [33] is employed to solve the reliability optimization in (13) with 2 classes of jobs and a utility 2R1 + R2. The algorithm converges to the optimal sensing rates within only a few local updates.

Under certain conditions, we can also characterize the optimal solution in closed form.

Theorem 2 If there exists a set of rates λ1, ..., λN and a positive constant C_λ satisfying the following system of equations, then these rates maximize the aggregate utility in (13) for arbitrary non-decreasing utility functions:

$$\sum_{X_k\in A_i}\ \prod_{j\in X_k}\lambda_j \cdot \prod_{l\notin X_k}\mu_l = C_\lambda\cdot\sqrt{\frac{\tau_i^c\mu_i^2}{f_i}+1}, \ \ \forall i; \qquad \sum_{k=1}^{K}\ \prod_{j\in X_k}\lambda_j \cdot \prod_{l\notin X_k}\mu_l = C_\lambda. \qquad (14)$$

These rates simultaneously maximize the reliabilities received by all jobs, i.e.,

$$R_i = 1 - f_i\tau_i^r - 2\sqrt{\tau_i^c f_i + \frac{f_i^2}{\mu_i^2}}, \quad \forall i. \qquad (15)$$

Proof: We apply the inequality $ax + b/x \geq 2\sqrt{ab}$, valid for all positive $a, b, x > 0$, to the reliability in (12). It implies

$$R_i = 1 - f_i\tau_i^r - \tau_i^c\mu_i\pi_{A_i} - \frac{f_i}{\mu_i}\left(\pi_{A_i} + \frac{1}{\pi_{A_i}}\right) \leq 1 - f_i\tau_i^r - 2\sqrt{\tau_i^c f_i + \frac{f_i^2}{\mu_i^2}}, \qquad (16)$$

where we used $x = \pi_{A_i}$, $a = \tau_i^c\mu_i + \frac{f_i}{\mu_i}$ and $b = \frac{f_i}{\mu_i}$ in the inequality. Notice that the last step holds with equality only if $x = \sqrt{b/a}$. For arbitrary non-decreasing utility functions Ui(Ri), it is easy to see that the aggregate utility ∑_i Ui(Ri) is maximized if (16) holds with equality for all i = 1, ..., N, i.e., all reliability values are maximized simultaneously. This proves the maximum achievable reliability in (15), which can be achieved only if $\pi_{A_i} = \sqrt{\frac{\tau_i^c\mu_i^2}{f_i}+1}$, ∀i. Plugging in the stationary distribution in (5), these are exactly the conditions (14). □

Remark: Theorem 2 establishes the maximum utility our checkpoint scheduling algorithm can achieve. If the conditions in Theorem 2 are satisfied, then solving (14) gives us a set of rates λ1, ..., λN that maximizes the aggregate utility in (13) for arbitrary non-decreasing utility functions. As an example, if all jobs share a common resource bottleneck that allows only a single checkpoint at a time, then A_i = {i} for all i because every pair of jobs conflicts. It is easy to verify that the following rates satisfy conditions (14), so the reliability optimization can be solved in closed form for arbitrary non-decreasing utility functions:

$$\lambda_i = \frac{\sqrt{\frac{\tau_i^c\mu_i^2}{f_i}+1}}{\sum_{j=1}^{N}\sqrt{\frac{\tau_j^c\mu_j^2}{f_j}+1}} \cdot \frac{1-\prod_{j=1}^{N}\mu_j}{\sum_{j=1}^{N}\prod_{l\neq j}\mu_l}, \quad \forall i. \qquad (17)$$

Once the optimal solutions are obtained (through either local search heuristics or the sufficient conditions above), each job only has to update its checkpoint rate according to the optimal solutions. Due to the distributed nature of CSMA-based scheduling, jobs can easily reconfigure their checkpoint rates on the fly without relying on any centralized checkpoint coordination.

Validation of theoretical analysis.



Figure 9: Reliability for different failure rates.


Figure 10: Reliability for different checkpoint time intervals.

To validate the reliability analysis in Theorem 1, we implement a prototype of the contention-free, distributed checkpoint scheduling protocol with 3 servers supporting 24 Xen VMs, each with 1GB DRAM. The detailed implementation parameters are provided in Section IV. We first benchmark the necessary parameters of our theoretical Markov Chain model, i.e., mean checkpoint local-saving time τ^c_i = 30.2 seconds, mean checkpoint overhead 1/µi = 71.5 seconds, and mean repair time τ^r_i = 80.2 seconds for all jobs i = 1, ..., 24, so all jobs in the experiment receive an equal reliability value. For a sensing rate of λi = 1/(2.5 days) and exponential failures with fi ranging from 2 to 16 failures per year, we compare the reliability values from our theoretical analysis to the values obtained from the experiment. Figure 7 shows that our theoretical analysis can accurately estimate the reliability values received under the proposed protocol, with a small error margin of ±1%. This implies that our theoretical reliability analysis provides a powerful tool for reliability estimation and optimization.

Example for reliability optimization. To give a numerical example of the proposed reliability optimization, consider a datacenter with 2 classes of jobs: 10 large jobs that contain 10 VMs each and 100 small jobs that contain 2 VMs each. Assume that at most 2 jobs can take non-contending checkpoints at a time. The average checkpoint overhead is τ^c_1 = 50 seconds for large jobs and τ^c_2 = 25 seconds for small jobs. The recovery time is τ^r_1 = 400 seconds and τ^r_2 = 200 seconds. Assume that each host fails independently with rate f0 = 2/year. Then large jobs have failure rate f1 = 10 · f0 = 6.43e−7 per second and small jobs f2 = 2 · f0 = 1.29e−7 per second. Finally, the total checkpoint time is 1/µ_1 = 200 seconds for a large job and 1/µ_2 = 100 seconds for a small job. We implement Hill Climbing local search [33] to find the optimal sensing rates λ1, λ2 that maximize the utility 2R1 + R2. As shown in Figure 8, the algorithm converges to the optimal sensing rates within a few local updates. At the optimum, large jobs receive a higher reliability (R1 = 0.99) than small jobs (R2 = 0.90), because the weight of large jobs is twice that of small jobs in the optimization objective 2R1 + R2.
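For readers checking the example's numbers, the following back-of-the-envelope conversion (our own arithmetic; it assumes a 360-day year, which is what reproduces the 6.43e−7 figure) relates the per-host failure rate to the per-job rates:

```python
# Convert the example's per-host failure rate into per-second job failure rates.
# A 360-day year reproduces the paper's 6.43e-7 figure (a 365-day year gives ~6.34e-7).
SECONDS_PER_YEAR = 360 * 24 * 3600
f0 = 2 / SECONDS_PER_YEAR            # per-host failure rate, 2 failures/year
f_large = 10 * f0                    # large job spans 10 hosts
f_small = 2 * f0                     # small job spans 2 hosts
print(f"{f_large:.3g} /s, {f_small:.3g} /s")   # ~6.43e-07 /s, ~1.29e-07 /s
```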

4 Implementation and Evaluations

We have implemented a prototype of the contention-free checkpoint scheduling in C. We use a cluster of four machines with Intel Atom D525 CPUs, 4GB DRAM, 7200 RPM 1TB hard drives, interconnected by a 1GB/s network. Note that I/O and network bandwidth, rather than CPU and memory, are the major limiting factors in our tests. To simulate the workload, each VM runs a CPU-intensive benchmark [35] with 1 VCPU, 512MB or 1GB DRAM, and a 10GB VDisk. The host OS is Linux 2.6.32 with Xen 4.0. Unless otherwise specified, the failure rate is eight failures per year, and each reliability result is the average of three runs.

Figure 9 shows the reliability of a job when the annual failure rate varies from 4 to 128 failures per year. In this experiment, we run three jobs (two VMs per job, six VMs in total) and present the average reliability. For small failure rates, the reliability of both contention-oblivious and contention-free scheduling is very high. But as more failures occur, the benefit of contention-free scheduling becomes very obvious, achieving much higher reliability.

Reliability as a function of checkpoint interval is shown in Figure 10. Overall, contention-free scheduling can achieve a reliability of two nines (> 99%), compared to one nine (> 90%) for contention-oblivious scheduling.



Figure 12: Normalized downtime for different annual failure rates.


Figure 13: Reliability for different VM sizes.


Figure 11: Reliability of 128 jobs for both contention-free and contention-oblivious checkpoint scheduling.

For contention-free scheduling, the reliability of the system keeps increasing as the checkpoint interval becomes larger. The contention-oblivious mechanism improves at a slower pace, but it can also potentially reach as high a reliability as contention-free scheduling. This happens because when the checkpoint interval becomes large enough, the chance of checkpoint contention between different jobs is small. To demonstrate the scalability of our approach, we also extend this test to simulate 128 jobs. In this experiment, we intentionally intensify the job checkpointing rate in our cluster. As shown in Figure 11, almost all jobs in the contention-free configuration achieve a reliability of two nines, but the majority of contention-oblivious jobs fall into the one-nine reliability range. In addition, we present the normalized downtime for different annual failure rate settings in Figure 12. Note that the downtime of a system includes the checkpoint time, plus the recovery time if the host goes down. All times are normalized to the downtime of contention-oblivious scheduling with 128 failures per year. One can see that our contention-free checkpointing achieves a reduction in downtime of up to 18.3% compared to contention-oblivious scheduling.

In Figure 13, we show the effect of VM memory size (and, therefore, checkpoint duration) on reliability. Under the same checkpoint interval and failure rate, a job with a VM memory size of 512 MB achieves higher reliability due to its smaller memory footprint: a larger DRAM size requires more time to suspend the VM and transfer the VM image from the host machine to the destination storage. In either case, the overall trend shows that our contention-free checkpointing mechanism significantly outperforms contention-oblivious scheduling.

5 Conclusions

Inspired by the CSMA protocol for wireless interference management, we propose a new protocol for distributed and contention-free checkpoint scheduling in large-scale datacenters, where jobs' requirements for reliability vary significantly. The protocol enables datacenter operators to provide elastic reliability as a transparent service to their customers. Using a Markov Chain analysis of the system's stationary behaviours, the reliability that each job receives under our protocol is characterized in closed form. We also present optimization algorithms to jointly maximize all reliability levels with respect to an aggregate utility. Our design is validated through a prototype implementation in Xen and Linux, and significant reliability improvements over contention-oblivious checkpoint scheduling are demonstrated via experiments in realistic settings.


6 Collection of Proofs

6.1 Proof of Lemma 1

It is easy to verify that the stationary distribution in Lemma 1 satisfies the detailed balance equation π_k P_{X_k,X_l} = π_l P_{X_l,X_k}, ∀k, l, which also implies that the Markov chain is time-reversible [29].

The stationary distribution of the DTMC satisfies the detailed balance equation π_k P_{X_k,X_l} = π_l P_{X_l,X_k}, ∀k, l, where P is the Markov transition matrix, i.e., P_{X_k,X_l} is the probability of a transition from state X_k to state X_l, and π_k and π_l are the equilibrium probabilities of being in states X_k and X_l, respectively. Substituting $\pi_k = \frac{\prod_{i\in X_k}\lambda_i \cdot \prod_{j\notin X_k}\mu_j}{C(\vec{\lambda})}$ into the detailed balance equation gives

$$\frac{\prod_{i\in X_k}\lambda_i \cdot \prod_{j\notin X_k}\mu_j}{C(\vec{\lambda})}\cdot P_{X_k,X_l} = \frac{\prod_{i\in X_l}\lambda_i \cdot \prod_{j\notin X_l}\mu_j}{C(\vec{\lambda})}\cdot P_{X_l,X_k}.$$

Every job is either in a state or not in it, and since adjacent states of our DTMC differ in a single job, X_k and X_l differ in exactly one element, say job m. Hence $\prod_{i\in X_k}\lambda_i$ and $\prod_{i\in X_l}\lambda_i$ differ only in the factor λ_m, and likewise for $\prod_{j\notin X_k}\mu_j$ and $\prod_{j\notin X_l}\mu_j$. Assuming m ∈ X_l and m ∉ X_k (the argument is symmetric otherwise), we have

$$\prod_{i\in X_l}\lambda_i = \lambda_m\cdot\prod_{i\in X_k}\lambda_i, \qquad \prod_{j\notin X_k}\mu_j = \mu_m\cdot\prod_{j\notin X_l}\mu_j.$$

After cancelling the common factors, the left- and right-hand sides become equal:

$$\frac{\mu_m\,\lambda_m}{v} = \frac{\lambda_m\,\mu_m}{v}.$$

This shows that detailed balance holds for $\pi_k = \frac{\prod_{i\in X_k}\lambda_i \cdot \prod_{j\notin X_k}\mu_j}{C(\vec{\lambda})}$, which completes the proof. □

6.2 Proof of Lemma 2

Let R denote the downtime due to rollback when a failure happens. By the PASTA property, the probability that a failure arrives in any sub-interval of a checkpoint interval is the same as its time average. Then we have

$$P(R \geq x) = \frac{\int_x^\infty (u - x)\, f_{T_i}(u)\, du}{E T_i},$$

where f_{T_i}(u) is the probability density function of the checkpoint interval T_i. Let F_R(x) and f_R(x) be the cumulative distribution function and probability density function of the rollback downtime R, respectively. Thus

$$F_R(x) = 1 - P(R \geq x) = 1 - \frac{\int_x^\infty (u - x)\, f_{T_i}(u)\, du}{E T_i}.$$

Taking the derivative of both sides with respect to x,

$$f_R(x) = \frac{\int_x^\infty f_{T_i}(u)\, du}{E T_i}.$$

Then the average downtime due to rollback can be expressed as

$$E R = \frac{\int_0^\infty \left(\int_x^\infty f_{T_i}(u)\, du\right) x\, dx}{E T_i} = \frac{\int_0^\infty \left(1 - F_{T_i}(x)\right) x\, dx}{E T_i},$$

where we used $\int_x^\infty f_{T_i}(u)\, du = 1 - \int_0^x f_{T_i}(u)\, du = 1 - F_{T_i}(x)$ in the last step.

Similarly,

$$E T_i^2 = \int_0^\infty u^2 f_{T_i}(u)\, du = -\int_0^\infty u^2\, d\big(1 - F_{T_i}(u)\big) = 2\int_0^\infty \big(1 - F_{T_i}(u)\big)\, u\, du - u^2\big(1 - F_{T_i}(u)\big)\Big|_0^\infty = 2\int_0^\infty \big(1 - F_{T_i}(u)\big)\, u\, du,$$

since $F_{T_i}(u) \to 1$ and $u^2(1 - F_{T_i}(u)) \to 0$ as $u \to \infty$, while $u^2(1 - F_{T_i}(u)) = 0$ at $u = 0$ as well.

It is easy to verify that the second moment of T_i can therefore be expressed as $E T_i^2 = 2\int_0^\infty (1 - F_{T_i}(u))\, u\, du$. Hence we have $E R = \frac{E T_i^2}{2\, E T_i}$. With failure rate f_i, the average downtime due to rollback when a failure arrives is therefore $E R_{failure} = \frac{f_i\, E T_i^2}{2\, E T_i}$. This completes the proof. □

6.3 Proof of (7) and (8)

Assume that the i.i.d. random variables {Y_l} have moment generating function (MGF) φ_Y(s), and that the random sum $T_i = \sum_{l=1}^{t_{A_i}} Y_l$ has MGF φ_T(s). Conditioning on the number of summands t_{A_i} = t and taking the expectation over t_{A_i}, we have

$$\phi_T(s) = \sum_{t=0}^{\infty} E\big[e^{s(Y_1+Y_2+\cdots+Y_t)}\big]\, P_{t_{A_i}}(t).$$

Let $W = Y_1 + Y_2 + \cdots + Y_t$; then by the definition of the MGF, $\phi_W(s) = [\phi_Y(s)]^t$, which gives

$$\phi_T(s) = \sum_{t=0}^{\infty} [\phi_Y(s)]^t\, P_{t_{A_i}}(t).$$

In addition we can write $[\phi_Y(s)]^t = [e^{\ln\phi_Y(s)}]^t = e^{t\ln\phi_Y(s)}$, which then gives

$$\phi_T(s) = \sum_{t=0}^{\infty} e^{t\ln\phi_Y(s)}\, P_{t_{A_i}}(t),$$

where the sum on the right-hand side is exactly $\phi_{t_{A_i}}(\cdot)$ evaluated at $\ln\phi_Y(s)$; therefore

$$\phi_T(s) = \phi_{t_{A_i}}\big(\ln\phi_Y(s)\big).$$

Then by the chain rule of derivatives,

$$\phi_T'(s) = \phi_{t_{A_i}}'\big(\ln\phi_Y(s)\big)\,\frac{\phi_Y'(s)}{\phi_Y(s)}.$$

With $\phi_Y(0) = 1$, $\phi_Y'(0) = E[Y_l]$ and $\phi_{t_{A_i}}'(0) = E[t_{A_i}]$, setting s = 0 in the equation above yields

$$E[T_i] = \phi_T'(0) = \phi_{t_{A_i}}'(0)\,\frac{\phi_Y'(0)}{\phi_Y(0)} = E t_{A_i}\, E[Y_l] = \frac{1}{v}\, E t_{A_i}.$$

Taking the second derivative of $\phi_T(s)$, we have

$$\phi_T''(s) = \phi_{t_{A_i}}''\big(\ln\phi_Y(s)\big)\left(\frac{\phi_Y'(s)}{\phi_Y(s)}\right)^2 + \phi_{t_{A_i}}'\big(\ln\phi_Y(s)\big)\,\frac{\phi_Y(s)\,\phi_Y''(s) - [\phi_Y'(s)]^2}{\phi_Y^2(s)}.$$

Setting s = 0 we have

$$E T_i^2 = E t_{A_i}^2\, (E Y_l)^2 + E t_{A_i}\, \mathrm{var}(Y_l) = \frac{1}{v^2}\big(E t_{A_i} + E t_{A_i}^2\big).$$
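These two moment identities are easy to sanity-check numerically; the sketch below (our own illustrative code, using an arbitrary rate v and an arbitrary geometric count in place of t_{A_i}) does so by Monte Carlo.

```python
import random

# Monte Carlo check of Eqs. (7)-(8): for T = sum_{l=1}^{t} Y_l with an independent
# integer count t and Y_l ~ Exp(v), we expect E[T] = E[t]/v and
# E[T^2] = (E[t] + E[t^2]) / v^2. The rate v and the count distribution (geometric
# with parameter p) are arbitrary choices for this check.
v, p, n = 4.0, 0.3, 200_000

def geometric(p):
    k = 1
    while random.random() > p:
        k += 1
    return k

sum_T, sum_T2, sum_t, sum_t2 = 0.0, 0.0, 0.0, 0.0
for _ in range(n):
    t = geometric(p)
    s = sum(random.expovariate(v) for _ in range(t))
    sum_T += s; sum_T2 += s * s; sum_t += t; sum_t2 += t * t
ET, ET2, Et, Et2 = sum_T / n, sum_T2 / n, sum_t / n, sum_t2 / n
print(ET, Et / v)                    # the two values should be close (Eq. 7)
print(ET2, (Et + Et2) / v ** 2)      # the two values should be close (Eq. 8)
```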

6.4 Proof of Theorem 1

From Lemma 2,

$$R_i = 1 - \tau_i^c\mu_i\pi_{A_i} - f_i\tau_i^r - \left(f_i\pi_{A_i}\,E T_i + \frac{f_i\,E T_i^2}{2\,E T_i}\right),$$

and Theorem 1 states

$$R_i = 1 - f_i\tau_i^r - \tau_i^c\mu_i\pi_{A_i} - \frac{f_i}{\mu_i}\left(\pi_{A_i} + \frac{1}{\pi_{A_i}}\right).$$

Applying equations (7) and (8) to Lemma 2, we have

$$f_i\pi_{A_i}\,E T_i = f_i\pi_{A_i}\,\frac{1}{v}\,E t_{A_i}, \qquad \frac{f_i\,E T_i^2}{2\,E T_i} = f_i\,\frac{\frac{1}{v^2}\big(E t_{A_i}+E t_{A_i}^2\big)}{2\,\frac{E t_{A_i}}{v}},$$

where, since t_{A_i} is exponentially distributed, its second moment is $E t_{A_i}^2 = 2\,(E t_{A_i})^2$, which gives

$$\frac{f_i\,E T_i^2}{2\,E T_i} = \frac{f_i}{2v}\big(1+2\,E t_{A_i}\big)$$

in Lemma 2. Plugging these results into Lemma 2, we have

$$R_i = 1 - \tau_i^c\mu_i\pi_{A_i} - f_i\tau_i^r - f_i\left(\pi_{A_i}\,\frac{1}{v}\,E t_{A_i} + \frac{1}{2v} + \frac{1}{v}\,E t_{A_i}\right).$$

Combining this with equation (9), we have

$$\frac{1}{v}\,E t_{A_i} = \frac{1}{v} + \frac{1}{\mu_i}\left(\frac{1}{\pi_{A_i}}-1\right).$$

Applying this to Lemma 2,

$$R_i = 1 - \tau_i^c\mu_i\pi_{A_i} - f_i\tau_i^r - f_i\left(\frac{\pi_{A_i}}{v} + \frac{1-\pi_{A_i}}{\mu_i} + \frac{1}{2v} + \frac{1}{v} + \frac{1}{\mu_i\pi_{A_i}} - \frac{1}{\mu_i}\right),$$

which simplifies to

$$R_i = 1 - \tau_i^c\mu_i\pi_{A_i} - f_i\tau_i^r - f_i\left(\frac{\pi_{A_i}+\frac{3}{2}}{v} - \frac{\pi_{A_i}}{\mu_i} + \frac{1}{\mu_i\pi_{A_i}}\right).$$

As $v\to\infty$ we have $\frac{\pi_{A_i}+\frac{3}{2}}{v}\to 0$, and thus

$$R_i = 1 - f_i\tau_i^r - \tau_i^c\mu_i\pi_{A_i} - \frac{f_i}{\mu_i}\left(\pi_{A_i} + \frac{1}{\pi_{A_i}}\right).$$

This completes the proof.


References

[1] Amazon, "We Promise Our EC2 Cloud Will Only Crash Once A Week," Amazon online technical report, October 2008.

[2] N. Limrungsi, J. Zhao, Y. Xiang, T. Lan, H. Huang, and S. Subramaniam, "Providing Reliability as an Elastic Service in Cloud Computing," IEEE ICC 2012, Aug. 2012.

[3] M. Wiboonrat, "An Empirical Study on Data Center System Failure Diagnosis," Internet Monitoring and Protection (ICIMP), July 2008.

[4] J. Hui, "Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment," CCGrid, IEEE/ACM International Symposium, May 2012.

[5] B. Nicolae, "BlobCR: Efficient checkpoint-restart for HPC applications on IaaS clouds using virtual disk image snapshots," International Conference for High Performance Computing, Networking, Storage and Analysis, Nov. 2011.

[6] N. Kobayashi and T. Dohi, "Bayesian perspective of optimal checkpoint placement," Ninth IEEE International Symposium on High-Assurance Systems Engineering (HASE 2005), pp. 143-152, 2005.

[7] K. M. Chandy, "A Survey of Analytic Models of Rollback and Recovery Strategies," Computer, vol. 8, no. 5, pp. 40-47, May 1975.

[8] T. Dohi, N. Kaio, and K. S. Trivedi, "Availability models with age-dependent checkpointing," 21st IEEE Symposium on Reliable Distributed Systems, pp. 130-139, 2002.

[9] N. H. Vaidya, "Impact of checkpoint latency on overhead ratio of a checkpointing scheme," IEEE Transactions on Computers, vol. 46, no. 8, pp. 942-947, 1997.

[10] H. Okamura, Y. Nishimura, and T. Dohi, "A dynamic checkpointing scheme based on reinforcement learning," 10th IEEE Pacific Rim International Symposium on Dependable Computing, pp. 151-158, Mar. 2004.

[11] A. Duda, "The effects of checkpointing on program execution time," Information Processing Letters, pp. 221-229, 1983.

[12] P. Ta-Shma, G. Laden, M. Ben-Yehuda, and M. Factor, "Virtual machine time travel using continuous data protection and checkpointing," ACM SIGOPS Operating Systems Review, vol. 42, pp. 127-134, 2008.

[13] M. Sun and D. M. Blough, "Fast, Lightweight Virtual Machine Checkpointing," Georgia Tech technical report, 2010.

[14] I. Goiri, F. Julià, J. Guitart, and J. Torres, "Checkpoint-based Fault-tolerant Infrastructure for Virtualized Service Providers," in Proceedings of IEEE/IFIP Network Operations and Management Symposium, Aug. 2010.

[15] M. Zhang, H. Jin, X. Shi, and S. Wu, "VirtCFT: A Transparent VM-Level Fault-Tolerant System for Virtual Clusters," in Proceedings of Parallel and Distributed Systems (ICPADS), Dec. 2010.

[16] Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. L. Scott, "An Optimal Checkpoint/Restart Model for a Large Scale High Performance Computing System," in Proceedings of Parallel and Distributed Processing (IPDPS), Apr. 2008.

[17] A. Warfield, R. Ross, K. Fraser, C. Limpach, and S. Hand, "Parallax: Managing storage for a million machines," HotOS, Jun. 2005.

[18] R. Badrinath, R. Krishnakumar, and R. Rajan, "Virtualization aware job schedulers for checkpoint-restart," ICPADS, Dec. 2007.

[19] T. Ozaki, T. Dohi, H. Okamura, and N. Kaio, "Distribution-free checkpoint placement algorithms based on min-max principle," IEEE Transactions on Dependable and Secure Computing, vol. 3, no. 2, pp. 130-140, 2006.

[20] T. Dohi, T. Ozaki, and N. Kaio, "Optimal checkpoint placement with equality constraints," 2nd IEEE International Symposium on Dependable, Autonomic and Secure Computing, pp. 77-84, Oct. 2006.

[21] A. Kangarlou, D. Xu, U. Kozat, P. Padala, B. Lantz, and K. Igarash, "In-network Live Snapshot Service for Recovering Virtual Infrastructure," IEEE Network, vol. 25, no. 14, pp. 12-19, 2011.

[22] H. Liu, "Optimize Performance of Virtual Machine Checkpointing via Memory Exclusion," ChinaGrid Annual Conference, Aug. 2009.

[23] B. Schroeder, E. Pinheiro, and W. Weber, "DRAM errors in the wild: a large-scale field study," in Proceedings of SIGMETRICS, pp. 193-204, 2009.

[24] L. Jiang and J. Walrand, "A distributed CSMA algorithm for throughput and utility maximization in wireless networks," IEEE/ACM Transactions on Networking, vol. 18, no. 3, pp. 960-972, Jun. 2010.

[25] X. Wang and K. Kar, "Throughput Modelling and Fairness Issues in CSMA/CA Based Ad-Hoc Networks," Proceedings of IEEE Infocom, Miami, Mar. 2005.

[26] S. C. Liew, C. Kai, J. Leung, and B. Wong, "Back-of-the-Envelope Computation of Throughput Distributions in CSMA Wireless Networks," http://arxiv.org/pdf/0712.1854.

[27] International Working Group on Cloud Computing Resiliency (IWGCR), "Downtime statistics of current cloud solutions," http://iwgcr.files.wordpress.com/2012/06/iwgcr-paris-ranking-001-en1.pdf, 2012.

[28] H. Gunawi, T. Do, J. M. Hellerstein, I. Stoica, D. Borthakur, and J. Robbins, "Failure as a Service (FaaS): A cloud service for large-scale, online failure drills," http://techreports.lib.berkeley.edu/accessPages/EECS-2011-87.html, 2012.

[29] D. Aldous, "Reversible Markov Chains and Random Walks on Graphs," University of California, Berkeley, 2002.

[30] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle, "Layering as Optimization Decomposition: A Mathematical Theory of Network Architectures," Proceedings of the IEEE, vol. 95, no. 1, pp. 255-312, 2007.

[31] T. Lan, X. Lin, M. Chiang, and R. B. Lee, "Stability and Benefits of Suboptimal Utility Maximization," IEEE/ACM Transactions on Networking, vol. 19, no. 4, pp. 1194-1207, Aug. 2011.

[32] Y. Xiang, H. Liu, T. Lan, H. Huang, and S. Subramaniam, "Optimizing Job Reliability Through Contention-Free, Distributed Checkpoint Scheduling," online technical report, www.seas.gwu.edu/~tlan/papers/ICAC.pdf, 2013.

[33] S. J. Russell and P. Norvig, "Artificial Intelligence: A Modern Approach (2nd ed.)," Upper Saddle River, New Jersey: Prentice Hall, pp. 111-114.

[34] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, pp. 671-680, doi:10.1126/science.220.4598.671.

[35] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK Benchmark: past, present and future," Concurrency and Computation: Practice and Experience, vol. 15, no. 9, pp. 803-820, 2003.
