Optimization Algorithms for Task Offloading and Scheduling in Cloud Computing
by
Sowndarya Sundar
A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2019 by Sowndarya Sundar
Abstract
Optimization Algorithms for
Task Offloading and Scheduling in Cloud Computing
Sowndarya Sundar
Doctor of Philosophy
Graduate Department of Electrical and Computer Engineering
University of Toronto
2019
Cloud computing can augment the capabilities of resource-poor local devices with the help of
resourceful servers. Computational offloading refers to the migration of application tasks from
local devices for execution at the cloud. Intelligent task offloading and scheduling can help
optimize parameters such as energy consumption or execution time. In this dissertation, we
address such optimization problems in cloud computing environments.
Many existing works that address this problem make simplistic or impractical assumptions with respect to the system model or propose inefficient solutions. We consider more
practical system models with finite-capacity and heterogeneous local devices, cloudlets, and
edge-clouds. We also address offloading of applications consisting of dependent tasks, multi-user
environments, and scheduling tasks that arrive over time. We consider meaningful optimization
objectives and constraints. Our aim is to propose efficient algorithms with high performance
to obtain these task offloading and scheduling decisions.
We first consider the problem where a single user wishes to execute applications consisting of
dependent tasks in a multi-tier cloud computing environment that may consist of cloudlets,
peer devices, and a remote cloud. We propose the Individual Time Allocation with Greedy
Scheduling (ITAGS) algorithm in order to minimize total cost subject to application deadlines.
We then consider a multi-user cloud environment where each user has certain budget constraints.
We propose the Single Task Unload for Budget Resolution (STUBR) algorithm to minimize
a weighted sum completion time objective, and we prove performance guarantees for it. Finally, we
address the online problem where tasks arrive over time, and we do not know task information
in advance. We propose the Task Dispatch through Online Training (TDOT) algorithm in
order to maximize profit to a cloud service provider subject to processor load constraints, and
provide performance guarantees.
For each of these problems, we also use trace-driven simulation results to compare against
existing alternatives and analyze the performance of the proposed algorithms in various sce-
narios. We see that the proposed algorithms are efficient, outperform other alternatives, and
exhibit near-optimal performance.
Contents
List of Tables vi
List of Figures vi
1 Introduction 1
1.1 Computational Offloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Optimization and Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Summary of Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Offloading Dependent Tasks with Communication Delay . . . . . . . . . . 3
1.3.2 Multi-user Task Scheduling with Budget Constraints . . . . . . . . . . . . 4
1.3.3 Online Scheduling for Profit Maximization at the Cloud . . . . . . . . . . 5
1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Literature Review 6
2.1 Computational Offloading Frameworks . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Partitioning and Offloading Tasks . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Common Cloud Computing Environments . . . . . . . . . . . . . . . . . . 7
2.2 Offloading Independent Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Single-user Task Offloading . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Multi-user Task Offloading . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Offloading Dependent Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Energy Consumption Objective . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Makespan Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Energy Consumption under a Deadline Objective . . . . . . . . . . . . . . 11
2.4 Online Task Offloading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Related Theoretical Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5.1 Job-shop Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 Online Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Review and Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Offloading Dependent Tasks with Communication Delay 15
3.1 System Model and Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Local Processors and Remote Cloud . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Task Dependency Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.1.3 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Individual Time Allocation with Greedy Scheduling (ITAGS) . . . . . . . . . . . 20
3.2.1 Binary Relaxation and Individual Time Allowance . . . . . . . . . . . . . 20
3.2.2 Alternative Discretization Heuristic . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 ITAGS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.4 Feasibility and Complexity Analysis . . . . . . . . . . . . . . . . . . . . . 26
3.3 Scheduling multiple applications with different deadlines . . . . . . . . . . . . . . 26
3.3.1 Binary Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Modified Alternative Discretization Heuristic . . . . . . . . . . . . . . . . 27
3.3.3 Modified ITAGS Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.4 Feasibility and Complexity Analysis . . . . . . . . . . . . . . . . . . . . . 29
3.4 Trace-driven and Randomized Simulations . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Comparison Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Trace-driven Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.3 Simulation with Randomly Generated Task Trees . . . . . . . . . . . . . . 36
3.4.4 Run-Time Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.5 Multiple Applications and Uncertain Processing and Communication Times 37
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4 Multi-user Task Scheduling with Budget Constraints 41
4.1 System Model and Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 The STUBR algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.1 Relaxed Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Rounded Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Dealing with Budget Violation . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.4 WSPT Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.5 Feasibility and Complexity Analysis . . . . . . . . . . . . . . . . . . . . . 53
4.3 STUBR Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.1 With Task Release Times . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 With Fixed Communication Times . . . . . . . . . . . . . . . . . . . . . . 55
4.3.3 With Sequence-dependent Communication Times . . . . . . . . . . . . . . 56
4.4 Trace-driven Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.1 Traces and Parameter Setting . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.2 Comparison Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.4.3 For Release Times and Fixed Communication Times . . . . . . . . . . . . 62
4.4.4 For Sequence-dependent Communication Times . . . . . . . . . . . . . . . 64
4.4.5 Runtime Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Online Scheduling for Profit Maximization at the Cloud 66
5.1 System Model and Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.1 Cloud Processors and Online Task Arrival . . . . . . . . . . . . . . . . . . 67
5.1.2 Profit Maximization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.2 Task Dispatch through Online Training . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.1 Offline Solution through Lagrange Relaxation . . . . . . . . . . . . . . . . 70
5.2.2 Online Scheduling with Partial-Task Profit Taking . . . . . . . . . . . . . 70
5.2.3 Performance Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Modified Algorithm Without Partial-Task Profit Taking . . . . . . . . . . . . . . 79
5.4 TDOT with Data Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4.1 Offline Solution through Lagrange Relaxation . . . . . . . . . . . . . . . . 80
5.4.2 Online Scheduling Algorithm with Partial-Task Profit Taking . . . . . . . 81
5.4.3 Performance Bound Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.4 TDOT-G with Data Requirements . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5.1 Comparison Targets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5.2 Simulation Setup and Task Requirements . . . . . . . . . . . . . . . . . . 91
5.5.3 I.I.D. Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.5.4 Google-cluster Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.5 Overall Profit and ε Values . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6 Concluding Remarks 97
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.1 Task Scheduling in the Presence of Zero Task Information . . . . . . . . . 98
6.2.2 Online Dependent-Task Scheduling . . . . . . . . . . . . . . . . . . . . . . 98
6.2.3 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.4 Straggler Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.5 Fuzzy Load Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Bibliography 99
List of Tables
2.1 Literature Review on Offloading Independent Tasks . . . . . . . . . . . . . . . . 8
2.2 Literature Review on Offloading Dependent Tasks . . . . . . . . . . . . . . . . . 10
2.3 Literature Review on Online Task Offloading . . . . . . . . . . . . . . . . . . . . 12
3.1 Chapter 3 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Run-time (sec) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Chapter 4 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
List of Figures
3.1 Example network of local processors and cloud. . . . . . . . . . . . . . . . . . . . 17
3.2 Dummy tasks, d1 and d2, added to a DAG of 5 tasks. . . . . . . . . . . . . . . . 17
3.3 Simulation topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Cost vs. application deadline for Gaussian elimination and FFT in Scenario 1 . . 32
3.5 Cost vs. application deadline for Gaussian elimination and FFT for Scenario 2 . 32
3.6 Cost vs. application deadline for Gaussian elimination and FFT for Scenario 3 . 33
3.7 Cost vs. application deadline for Gaussian elimination and FFT in Scenario 4 . . 33
3.8 Task Graph for Gaussian Elimination Application with a matrix size of 5 . . . . 34
3.9 Cost vs. application deadline for randomly generated task trees . . . . . . . . . . 36
3.10 Cost vs. application deadline for different number of applications A . . . . . . . 38
3.11 Cost vs. Realizations for 15% error . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.12 Cost vs. error with known and unknown processing and communication times
(with 95% confidence intervals) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1 Example system of 3 users and 5 cloud processors. . . . . . . . . . . . . . . . . . 43
4.2 For chess application on Galaxy S5. . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3 For compute intensive application on Nexus 10. . . . . . . . . . . . . . . . . . . . 62
4.4 For chess application on Galaxy S5. . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5 For compute intensive application on Nexus 10. . . . . . . . . . . . . . . . . . . . 64
5.1 Example system with two CSs consisting of two processors each. . . . . . . . . . 68
5.2 Effect of arrival rate λ on non-training set profit for i.i.d. tasks . . . . . . . . . . 92
5.3 Effect of max. data load Q on non-training set profit for i.i.d. tasks . . . . . . . 93
5.4 Effect of max. processing load L on non-training set profit for Google-cluster tasks 94
5.5 Effect of max. data load Q on non-training set profit for Google-cluster tasks . . 94
5.6 Effect of ε on overall profit for Google-cluster tasks . . . . . . . . . . . . . . . . . 95
Chapter 1
Introduction
The usage of computationally intensive and resource-demanding applications has been rapidly
increasing in recent times. However, the improvements in the hardware and battery of local
devices are not sufficient to keep up with the power and time requirements of these applications.
For example, several solutions have been proposed to enhance CPU performance [1], [2]
and to manage the disk and screen in an intelligent manner [3], [4]. However, these solutions
require changes to the structure of local devices, or new hardware, which increases cost and
may not be feasible for all devices [5]. As a result, resource poverty is a major obstacle for
many applications [6].
A Mobile Cloud Computing (MCC) system is one where mobile devices offload their compu-
tational tasks to cloud resource providers [7]. The cloud abstracts the complexities of provision-
ing computation and storage infrastructure [8], and provides the local devices access to nearly
unlimited computing power. This helps combat resource poverty and augments the capabilities
of the devices through a technique called computational offloading. MCC has been proven
to be advantageous for several applications such as mobile commerce, mobile learning, mobile
healthcare, gaming [5], image and language processing [7], sharing GPS/internet data [9], and
crowd computing [10].
One of the major challenges facing mobile cloud computing is the communication delay
and overhead incurred due to the transfer of data to the remote cloud. Edge computing [6], [11]
is a more recent advancement in MCC where finite computational/cloud resources are made
available at the edge of the network or in the vicinity of the mobile users. For example, in a
Mobile Edge Computing (MEC) system, MEC servers are deployed at the cellular base stations
and are shared by the mobile users.
Computational offloading has several benefits and can be used to optimize multiple param-
eters, as described in Section 1.1. In order to fully exploit the advantages of computational
offloading, we need to make intelligent task offloading decisions and utilize the resources at the
cloud efficiently. This is explained in Section 1.2.
1.1 Computational Offloading
Computational offloading refers to the migration of the computationally intensive parts (or
tasks) of an application from the local device to more powerful servers at the remote cloud.
This means that the execution of these resource-hungry parts of the application takes place in
the cloud and the results of the execution can be communicated to the local device for further
execution or to output results. This can prove to be extremely beneficial for the application
users for the following reasons:
• Augments capabilities of the mobile device
As the servers in the cloud are much more powerful in terms of speed and capability than
the local processors in the mobile device, computational offloading enables the mobile
device to run even computationally-intensive applications that the local processors alone
cannot handle. In other words, computational offloading gives the mobile user access to
nearly unlimited computing power thereby rendering the mobile device more resourceful.
• Decreases energy consumption
Offloading parts of the application means that the mobile device has less work to execute,
which reduces its energy consumption and, as a result, helps improve its battery lifetime.
• Improves response time
Executing some parts of the application on the faster servers at the cloud can reduce the
time taken to execute the overall application, also known as the makespan of the application.
This, in turn, improves the application's response time.
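The energy argument above can be made concrete with a back-of-envelope comparison: offloading saves energy when the radio energy spent shipping the task's data is less than the CPU energy spent computing it locally. The sketch below is illustrative only; every parameter value is a hypothetical assumption, not a measurement.

```python
def offload_saves_energy(cycles, cpu_speed, cpu_power,
                         data_bits, bandwidth, radio_power):
    """Compare local-compute energy with radio-transfer energy (joules)."""
    e_local = cpu_power * cycles / cpu_speed         # energy to compute locally
    e_offload = radio_power * data_bits / bandwidth  # energy to ship the data
    return e_offload < e_local, e_local, e_offload

# Hypothetical task: 2 Gcycles at 1 GHz and 0.9 W locally, versus
# 4 Mb of data over a 5 Mb/s link at 1.25 W radio power.
saves, e_local, e_offload = offload_saves_energy(
    cycles=2e9, cpu_speed=1e9, cpu_power=0.9,
    data_bits=4e6, bandwidth=5e6, radio_power=1.25)
```

With these numbers, local execution costs 1.8 J while transmission costs 1.0 J, so offloading saves energy; a larger input or a slower link would reverse the conclusion.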
Existing research work has resulted in several computational offloading systems such as
energy-aware migration decisions at run-time [7], multiple virtual machine images [12], and
trusted cloudlets [6]. The aforementioned benefits of offloading systems can be optimized
through effective task scheduling.
1.2 Optimization and Task Scheduling
Each application can be modelled as a number of tasks, and each task can be executed either
locally at the mobile device or remotely at the cloud. This binary offloading decision on each
task should be taken such that the offloading for the entire application is optimal in terms of
the objective. This objective could be the overall energy consumption or application makespan.
It could also be a more sophisticated objective, such as cost or energy consumption subject to
latency constraints on the execution of the application, in order to provide a Quality of Service
guarantee to the application users [7], [13], [14].
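As a minimal illustration of this binary decision problem, the sketch below brute-forces all 2^n per-task assignments to minimize total energy subject to a latency constraint. The additive timing model and all numbers are simplifying assumptions; the chapters ahead use richer models and efficient algorithms rather than exhaustive search.

```python
from itertools import product

def best_offload_decision(local_energy, remote_energy,
                          local_time, remote_time, deadline):
    """Brute-force the 2^n binary per-task offloading decisions.

    Minimizes total energy subject to an application deadline, under the
    simplifying assumption that task execution times add up sequentially.
    """
    n = len(local_energy)
    best_energy, best_decision = float("inf"), None
    for decision in product((0, 1), repeat=n):  # 0 = local, 1 = cloud
        energy = sum(remote_energy[i] if d else local_energy[i]
                     for i, d in enumerate(decision))
        time = sum(remote_time[i] if d else local_time[i]
                   for i, d in enumerate(decision))
        if time <= deadline and energy < best_energy:
            best_energy, best_decision = energy, decision
    return best_energy, best_decision

# Three hypothetical tasks: the cloud is cheaper in energy but slower
# end-to-end because of data transfer.
energy, decision = best_offload_decision(
    local_energy=[5.0, 8.0, 3.0], remote_energy=[1.0, 2.0, 1.5],
    local_time=[2.0, 4.0, 1.0], remote_time=[3.0, 3.0, 2.5],
    deadline=8.0)
```

Here the best feasible choice offloads the first two tasks and keeps the third local; offloading everything would violate the deadline.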
The tasks that constitute an application can be considered to be independent or dependent
in nature based on precedence constraints and possible data communication between them.
Dependent tasks are often modeled using a task graph with nodes of the graph representing
the tasks and the edges in the graph representing the dependencies between the tasks. There
are several different existing techniques that can be used to identify the best task scheduling
decision for both independent and dependent tasks, and these are investigated in Chapter 2.
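Once a placement of tasks onto processors is fixed, such a task graph can be evaluated with a single earliest-finish-time pass in topological order. The sketch below is a hypothetical illustration that charges a fixed communication delay on edges whose endpoints run on different processors and, for simplicity, ignores processor contention.

```python
def earliest_finish(preds, proc_time, placement, comm_delay):
    """Earliest finish time of each task for a fixed placement.

    preds[t]     : predecessors of task t (task ids 0..n-1, topologically ordered)
    proc_time[t] : execution time of t on its assigned processor
    placement[t] : processor id where t runs
    comm_delay   : extra delay on each edge crossing processors
    """
    finish = {}
    for t in range(len(proc_time)):
        ready = 0.0
        for p in preds[t]:
            extra = comm_delay if placement[p] != placement[t] else 0.0
            ready = max(ready, finish[p] + extra)
        finish[t] = ready + proc_time[t]
    return finish

# Diamond-shaped DAG 0 -> {1, 2} -> 3, with task 2 offloaded to processor 1.
preds = {0: [], 1: [0], 2: [0], 3: [1, 2]}
finish = earliest_finish(preds, proc_time=[1.0, 2.0, 1.0, 1.0],
                         placement=[0, 0, 1, 0], comm_delay=0.5)
```

In this example, task 3 must wait for both branches, including the cross-processor delay on the edges touching task 2, before it can start.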
Cloud service providers (CSPs) such as Amazon EC2 [15] often have multiple different
instances or servers with different storage, price, and computational capability. Consequently,
objectives such as makespan and energy consumption at the cloud can further be optimized
by scheduling tasks intelligently at the cloud, i.e., identifying which servers in the cloud should
run which tasks. It may also be in the CSP's interest to optimize objectives such as revenue
or profit. Solving these optimization problems in practice is quite challenging due to the lack of
prior information about arriving tasks. Only a few existing works address this problem of online
task arrival and scheduling at the cloud; these are also reviewed in Chapter 2.
The task offloading problem may also involve a multi-tier cloud computing environment
where a user is allowed to offload its task to a nearby edge-cloud or peer device, or to a
far away remote cloud. Based on the available communication and computation resources,
task scheduling decisions can be made in these more sophisticated environments to optimize
aforementioned objectives. Throughout this dissertation, we assume that the software required
to run the offloaded tasks already exists at the cloud, edge-clouds, and peer devices.
The main motivation of this dissertation lies in designing efficient algorithms to identify ef-
fective task scheduling decisions for computational offloading in cloud computing environments.
1.3 Summary of Contribution
Through this dissertation, we study optimization problems in cloud computing environments
through intelligent computational offloading and task scheduling. The results presented in
Chapters 3, 4, and 5 have appeared in [16], [17], [18], [19], [20]. In particular, this dissertation
makes the following contributions.
1.3.1 Offloading Dependent Tasks with Communication Delay
In Chapter 3, we consider the offloading of applications comprising multiple tasks, over a
generic cloud computing system consisting of a network of heterogeneous local processors and
a remote cloud. The local processors can represent the processing cores in a single mobile
device, local peer devices, and/or nearby cloudlets, depending on their computational speed
and communication distance from the user device. There is a time and a cost associated with
task execution, which depend on both the task and the processor where the task is scheduled.
We allow each application to consist of inter-dependent tasks with possible data commu-
nication between them. Each task may have predecessor tasks that must be completed before
the task can start. Furthermore, if data need to be transferred between tasks on different
processors, a communication delay, as well as communication cost, is incurred. The objective
of this work is to identify a task scheduling decision that minimizes the total cost of running
applications, subject to application completion deadlines. Our cost model is general and may
include, for example, energy consumption or usage charges for task processing and communication.
We observe that the precedence constraints and data transfer requirements between
tasks can drastically complicate their scheduling decision. Furthermore, the need to account
for both the cost and the run-time of the application adds to the challenge. Prior studies have
assumed simplified processor models to facilitate tractable analysis, such as non-concurrent
local and remote processors [7], infinite-capacity local processors [14], [21], [22], and negligible
delay between local processors [23], [24]. We use a more realistic processor model in this study.
This problem can be shown to be NP-hard, and, as such, there is no polynomial run-time
guarantee for finding an optimal solution. We propose a heuristic algorithm, termed ITAGS,
that utilizes a relaxed solution to the problem to obtain good task scheduling decisions. Through
trace-based simulation using real applications, as well as various randomly generated task trees,
we investigate the performance of ITAGS, highlighting the effect of the application deadline,
communication delay, number of processors, and number of tasks. We compare against existing
alternatives, including a discretization heuristic, and against the cost lower bound. We observe
that ITAGS demonstrates superior performance in all scenarios considered.
1.3.2 Multi-user Task Scheduling with Budget Constraints
In Chapter 4, we study a problem of task scheduling and offloading in a cloud computing system
to minimize the computational delays of the users' tasks. We consider a multi-user scenario
with finite-capacity user devices, a finite-capacity cloud consisting of heterogeneous servers,
budget constraints for the users, and an objective to minimize weighted sum completion time.
In particular, we consider user tasks that may have different processing times, release times,
communication times, and weights. A task may be executed locally on the user’s device or
offloaded to a server at the finite-capacity cloud. The servers at the cloud are heterogeneous
processors with different speeds. The users are required to pay a certain monetary price based
on the usage time of a processor at the cloud, and the price may potentially depend on the
processor speed. Each user has a specific budget which determines the monetary cost that the
user is willing to spend for offloading tasks to the cloud.
Our objective is to identify the task scheduling decision that minimizes the sum of weighted
completion times of all tasks subject to all users’ budget constraints. The problem is NP-hard
since minimizing the sum of weighted completion times of jobs with release times on a single
processor is NP-hard. Our solution approach is inspired by an interval-indexed Integer Linear
Program (ILP) introduced in [25]. We exploit the structure of an approximation solution to
such an ILP to solve our problem.
We propose the Single-Task Unload through Budget Resolution (STUBR) algorithm, and
prove performance bounds for different task and channel models. We also assess its performance
through trace-based simulation. We see that STUBR exhibits maximum performance gains of
more than 50% for both chess and compute intensive applications [22] in comparison with the
Greedy Weighted Shortest Processing Time (WSPT) scheme. Finally, our simulation results
demonstrate that STUBR is highly scalable with respect to the number of users in the system.
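The Greedy WSPT comparison scheme mentioned above is built on Smith's classical single-machine rule: sequence tasks in decreasing weight-to-processing-time ratio. The sketch below illustrates that baseline rule only (it is not STUBR, and it ignores release times, budgets, and multiple processors):

```python
def wspt_order(tasks):
    """Smith's rule (WSPT): sort by decreasing weight/processing-time ratio.

    tasks: list of (weight, processing_time) pairs for one processor.
    Returns the sequence and the resulting weighted sum of completion times.
    """
    order = sorted(tasks, key=lambda wp: wp[0] / wp[1], reverse=True)
    clock = total = 0.0
    for weight, ptime in order:
        clock += ptime            # completion time of this task
        total += weight * clock   # accumulate weighted completion time
    return order, total

# Three hypothetical tasks: the short, heavily weighted task goes first.
order, total = wspt_order([(1, 4.0), (3, 2.0), (2, 1.0)])
```

For a single machine without release times this ordering is optimal; the budget constraints and heterogeneous cloud processors studied in Chapter 4 are what break that optimality and motivate STUBR.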
1.3.3 Online Scheduling for Profit Maximization at the Cloud
In Chapter 5, we adopt a more practical online task model in which tasks arrive over time.
We consider the offloading of these tasks to multiple cloud servers.
Each cloud server in our system model consists of finite-capacity and heterogeneous processors.
As a result, the cloud servers could represent cloudlets, edge-clouds, or peer devices, in a generic
and hybrid cloud computing environment.
In this chapter, we address the task scheduling problem from the perspective of a cloud
service provider (CSP) that obtains profit by processing user tasks. In our model, the profit
obtained is a function of the task processing time and the profit generated per unit time on the
scheduled processor. We aim to obtain the scheduling decision that maximizes the total profit
across all tasks arriving within a time interval, subject to processor load constraints. In our
online model, we do not know in advance the total number of tasks that will arrive within the
time interval. We also do not know the processing times of a task until it arrives at the CSP’s
controller, which then dispatches the task to the scheduled cloud server for processing.
We propose the Task Dispatch through Online Training (TDOT) algorithm, which consists
of training and exploitation phases. We provide performance bound analysis to show that
TDOT can generate profit that is close to the optimum, given a suitable size for the training
task set. TDOT assumes that profit can be obtained from partially-completed tasks, so we
further propose a modified version of TDOT, termed TDOT-G, for implementations where
profit can only be obtained from fully-completed tasks. Through simulation, using randomly-
generated as well as Google cluster data, we compare the performance of TDOT and TDOT-G
with that of greedy scheduling, logistic regression, and an offline upper-bound solution.
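TDOT itself is developed in Chapter 5; purely to illustrate the training/exploitation structure, the hypothetical sketch below estimates a profit-density threshold from an initial training prefix of arrivals and then greedily admits later tasks subject to a load budget. This is an assumed toy scheme, not the actual TDOT algorithm.

```python
def two_phase_dispatch(tasks, train_frac, load_budget):
    """Illustrative training/exploitation scheduler (not the actual TDOT).

    tasks: (profit, load) pairs in arrival order. The first train_frac
    fraction is only observed, to estimate a profit-density threshold;
    subsequent tasks are admitted while they clear the threshold and
    fit within the remaining load budget.
    """
    k = max(1, int(len(tasks) * train_frac))
    densities = sorted(p / l for p, l in tasks[:k])
    threshold = densities[len(densities) // 2]   # median training density
    used = profit = 0.0
    accepted = []
    for i, (p, l) in enumerate(tasks[k:], start=k):
        if p / l >= threshold and used + l <= load_budget:
            used += l
            profit += p
            accepted.append(i)
    return profit, accepted

# Five hypothetical arrivals; the first two are used only for training.
profit, accepted = two_phase_dispatch(
    tasks=[(2, 1.0), (1, 2.0), (6, 2.0), (1, 1.0), (4, 1.0)],
    train_frac=0.4, load_budget=3.0)
```

The training prefix sacrifices some profit in exchange for a usable estimate of what a "good" task looks like, which is the trade-off the performance bounds in Chapter 5 quantify.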
1.4 Thesis Organization
This dissertation is organized as follows. Related works are reviewed in Chapter 2. Chapter 3
presents our work on computational offloading of dependent tasks with communication delay.
Chapter 4 deals with a multi-user task offloading and scheduling problem. Chapter 5 addresses
an online scenario where tasks arrive over time. We conclude this dissertation in Chapter 6.
Chapter 2
Literature Review
In this chapter, we study the existing works that have addressed computational offloading
and task scheduling problems for cloud computing. In Section 2.1, we provide an overview of
computational offloading frameworks focusing on traditional offloading techniques and various
cloud computing environments. In Sections 2.2, 2.3 and 2.4, we review the works that address
offloading independent, dependent, and online tasks respectively. Finally, in Section 2.5, we
focus on related theoretical works in other domains that address similar problems.
2.1 Computational Offloading Frameworks
In this section, we provide some background on the framework required for computational of-
floading, namely, the infrastructure required to offload and schedule tasks, and the various cloud
computing environments for which the computational offloading problem has been studied.
2.1.1 Partitioning and Offloading Tasks
The earliest studies of computational offloading considered just two possible outcomes for the
scheduling decision, namely, either offloading the entire application to the cloud or executing
the entire application locally on the mobile device [6], [26], [27]. In [27], this decision is made
with an objective to conserve energy for the
mobile device. The energy-optimal execution policy is obtained by solving two constrained
optimization problems, i.e., optimizing the clock frequency to complete CPU cycles for mo-
bile execution and the data transmission rate for cloud execution. However, offloading in finer
granularity has been shown to provide more flexibility and better performance than standalone
mobile or standalone cloud execution [13], [28].
As a result, each mobile application can be partitioned into different parts or tasks. This
partitioning can be either done by the programmers [29], [30] or by a profiler that partitions
applications automatically [7], [31], [32]. Each task can be executed either locally at the mobile
device or remotely at the cloud. This binary decision on each task should be taken such that
the offloading for the entire application is optimal in terms of an objective such as minimizing
the overall energy consumption or the makespan. If a task is to be executed at the cloud, we
may further decide which cloud server should execute it in order to further optimize our
objective.
2.1.2 Common Cloud Computing Environments
Traditionally, most existing works study offloading tasks to powerful servers at a remote cloud
[7], [12]. However, offloading from a local device to a distant remote cloud can incur significant
delays, particularly if a large amount of input/output data needs to be communicated between
the cloud and the mobile device.
Consequently, edge computing [11, 33–36] is a recent advancement in Mobile Cloud Com-
puting (MCC) where finite computational/cloud resources are made available at the edge of
the network or in the vicinity of the mobile users. For example, in a Mobile/Multi-access Edge
Computing (MEC) system [11, 35, 36], MEC servers are deployed at the cellular base stations
and are shared by the mobile users. Some works also consider offloading tasks to peer devices
such as nearby mobiles [37], [38], [39].
Some existing works consider a two-tier or three-tier offloading system where a task can be
executed 1) locally on the mobile device, 2) on a finite-capacity nearby computational resource,
or 3) at the remote cloud. A centralized decision engine or scheduler may decide whether
to offload or where to schedule each task. The existing decision-making and task scheduling
schemes are analyzed in the following sections.
2.2 Offloading Independent Tasks
Several works look to solve this offline task scheduling problem by assuming that the required
task information is known in advance. Such works can be broadly split into two categories, i.e.,
those that consider scheduling 1) independent tasks and 2) dependent tasks. In this section,
we study existing works that look to offload or schedule a number of independent tasks in a
cloud computing environment. Practically, these independent tasks may belong to a particular
mobile application or be individual applications by themselves. These tasks may belong to a
single user or multiple users. In Table 2.1, we summarize these existing works.
2.2.1 Single-user Task Offloading
In [41], a context-aware decision-making algorithm is formulated to schedule independent tasks,
taking into account the wireless medium and cloud resources. Here, the objective is to schedule
the tasks such that the overall execution time and energy across all cloud resources are mini-
mized. However, the cloud virtual machines (VMs) are assumed to be homogeneous in nature.
Similarly, [42] aims to optimize the offloading decision of the user to minimize the overall cost
of energy, computation, and delay for an application consisting of multiple independent tasks
Table 2.1: Literature Review on Offloading Independent Tasks

Ref. | Objective | Solution | Assumptions
Single-user
[40] | Weighted sum completion time & makespan | Heuristic | Homogeneous VMs
[41] | Execution time + energy | Heuristic | Homogeneous VMs
[42] | Overall cost | Heuristic | Single cloud server
[43] | Weighted sum of execution time + energy | Sub-optimal | Single MEC server
Multi-user
[44] | Number of beneficial cloud users | Game-theoretic; Nash Equilibrium | Constant task processing times at cloud
[45] | Response time | Game-theoretic; Nash Equilibrium | Infinite-capacity cloud server
[46] | Cost of weighted energy + delay + monetary | Game-theoretic; Nash Equilibrium | Tasks can be rejected without execution
[47] | Cost of energy + delay + computation | Approximation algorithm using SDR | Infinite-capacity cloud server
using semidefinite relaxation and randomization mapping approaches. This work assumes that
tasks can be offloaded to only a single remote server. In [43], an objective of weighted sum of
the execution delay and device energy consumption is considered to make an offloading decision
in a MEC (Mobile Edge Computing) system, and a sub-optimal algorithm is proposed.
2.2.2 Multi-user Task Offloading
In [44], [45], and [46], a game-theoretic approach is adopted to obtain offloading decisions for
independent tasks from multiple users. In [44], a shared mobile-edge cloud is considered, and a
distributed algorithm is proposed to compute a Nash Equilibrium solution. In [45], a three-tier
mobile cloud computing system consisting of peer devices, cloudlets, and a remote cloud is
considered, and the problem is modeled as a Generalized Nash Equilibrium game, whereas in
[46], the problem is modeled as a potential game.
Unlike these decentralized techniques, in [47], a centralized approximation algorithm is
proposed to make offloading decisions with the objective of minimizing the overall cost of energy,
delay, and computation of all users. However, this work assumes a single infinite-capacity
remote cloud server.
2.3 Offloading Dependent Tasks
Each of the following techniques has been employed to identify an offloading decision on an
application consisting of dependent tasks satisfying some objective. In Table 2.2, we summarize
these techniques, and their advantages and disadvantages.
2.3.1 Energy Consumption Objective
Computational offloading decisions can be taken with the sole objective of minimizing the
energy consumption or overall cost due to application execution. Cost and energy can be
considered to be analogous in nature and hence, we study the works that look to minimize
either of these objectives. In [54] and [48], the problem of minimizing the total cost is
addressed using graph partitioning. The proposed branch-and-bound algorithm provides an
optimal solution but has exponential-time complexity. Furthermore, [48] also makes several
impractical assumptions with respect to the system model.
2.3.2 Makespan Objective
Optimal offloading techniques to minimize makespan can be found for an application consisting
of sequential tasks, i.e., considering only task graphs with linear topology. In [55], the one-time
offload property has been proven for sequential tasks. This property states that the optimal
set of tasks to be offloaded in a sequential task graph while minimizing makespan will always
be a sequence of consecutive tasks in the task graph. In [49], an algorithm is proposed to find
Table 2.2: Literature Review on Offloading Dependent Tasks

Ref. | Min. Objective | Solution | Assumptions
[8] | Total cost | Optimal in O(n^3) time | Cloud & client tasks do not execute simultaneously
[48] | Energy consumption | Approximate; high complexity | Cloud & client tasks do not execute simultaneously
[49] | Makespan | Optimal | Sequential tasks
[50] | Makespan | Approximate | Infinite-capacity mobile device
[51] | Makespan | Approximate; high complexity | No communication
[7] | Energy consumption under deadline | Optimal; exponential time complexity | Cloud & client tasks do not execute simultaneously; infinite-capacity mobile device
[14] | Energy consumption under deadline | Optimal; step-size-dependent time complexity | Infinite-capacity mobile device
[13] | Energy consumption under deadline | Approximate | Sequential tasks
[52] | Overall latency under resource cost | PTAS solution | Devices possess infinite capacity
[23] | Cost under deadline | Heuristic | Delay between local processors negligible
[53] | Weighted sum of energy and completion time | Heuristic | Device and cloud possess infinite capacity
the entry-task and exit-task of the optimal one-time offload such that makespan or completion
time is minimized.
For generic task graphs, a load balancing heuristic is suggested in [49] to identify the offload-
ing decision. Numerically, this load balancing heuristic is shown to give better performance in
comparison with the greedy offloading algorithm in Odessa [56]. In [50], applications consisting
of dependent modules (or tasks), multiple finite-capacity servers at the cloud, and multi-user
scenarios are considered. A Mixed Integer Linear Programming (MILP) problem is formulated
with the objective of minimizing makespan. Furthermore, two greedy heuristics are proposed
to obtain solutions. The problem of task scheduling onto unrelated parallel machines is con-
sidered in [51], with an application to cloud computing. In this work, there exist precedence
constraints between tasks of the application, but no communication delay is considered between
the processors and no data communication is considered between tasks.
2.3.3 Energy Consumption under a Deadline Objective
Purely minimizing the energy consumption without regard to the makespan could result in
large application delays, particularly in practical systems wherein faster processors consume
more energy and vice-versa. Similarly, purely minimizing the makespan could result in large
amounts of energy being consumed. As a result, the objective of minimizing energy under
an application deadline has been considered in order to achieve a trade-off between these two
important quantities. In [53], an energy-efficiency cost objective, formulated as a weighted
sum of application completion time and energy consumption, is considered. However, this
work assumes that the mobile device and the remote cloud have an infinite number of available
servers.
The problem of maximizing energy savings at the mobile device due to computational of-
floading, subject to an application deadline, is formulated as an integer linear program in [7].
While integer linear programming provides an optimal solution, it is NP-hard in general and
hence, there is no polynomial run-time guarantee. In [14] and [52], a dynamic programming
approach is proposed to obtain a polynomial-time solution. However, the algorithms assume
that mobile devices are capable of simultaneously processing any number of tasks without any
loss in speed or efficiency. This assumption simplifies the problem and consequently allows dy-
namic programming to be used to provide a solution. In [13], the classical LARAC (Lagrangian
Relaxation Based Aggregated Cost) algorithm is adopted to obtain an approximate solution for
the specific case of sequential tasks.
2.4 Online Task Offloading
In [57], a stochastic optimization problem is formulated in order to schedule tasks either locally
on the mobile device or on a single MEC server. The tasks are assumed to be fluid in nature,
and a task is generated at the beginning of each time slot with a certain probability. Similarly,
Table 2.3: Literature Review on Online Task Offloading

Ref. | Objective | Solution | Assumptions
[57] | Power-constrained delay | Optimal stochastic policy | Fluid tasks; single MEC server
[64] | Resource usage | Heuristics | Preemptable tasks
[58] | Cost | Approximate | Fluid tasks; infinite-capacity datacenter
[59] | Response time | Reinforcement learning | Homogeneous VMs
[60] | Weighted makespan + offloading cost | Approximate | Identical processors
[63] | Queue-length-constrained cost | Approximate | Identical tasks & cost functions for an application
in [58], data is offloaded with an objective to minimize bandwidth, storage and computation
costs. Two online approximation algorithms are proposed. However, data is scheduled in a fluid
fashion, the processing capacity of the datacenters is not accounted for, and data can be sent
to only one chosen datacenter in a time slot.
In [59], a reinforcement learning approach is proposed to schedule online tasks to VMs, where
the VMs have a fixed buffer size. Similarly, in [60], an approximation algorithm is proposed to
schedule online tasks to identical processors. In [61], the CloudSim toolkit is used to simulate the
proposed task scheduling algorithm with an objective to minimize makespan, whereas in [62],
Monte-Carlo simulation is used to test the proposed heuristic.
In [63], tasks from a certain number of applications arrive at an MEC system in each time
slot, and an approximation algorithm is proposed to minimize long-term average cost under
a queue-length constraint. In [64], two algorithms are proposed to optimize cloud resource
usage through preemptable task execution. These works are summarized in Table 2.3.
2.5 Related Theoretical Works
There are several theoretical works across domains that we can gain inspiration from in order
to solve task scheduling problems in cloud computing environments. Here, we address two
well-studied theoretical techniques:
• Job-shop scheduling to help solve task offloading problems, particularly for offline inde-
pendent tasks.
• Online learning to help solve online task offloading problems.
2.5.1 Job-shop Scheduling
Several job-shop scheduling works [25, 65–68] address the problem of scheduling jobs to par-
allel processors, and propose algorithms with performance guarantees. However, they often
make simplistic assumptions, and consequently the proposed techniques cannot be trivially
extended to work with a practical cloud computing system. For example, [67] assumes equal-
length jobs, while [68] assumes a single-processor system. [25] and [66] address the problem of
scheduling jobs to heterogeneous machines for objectives of cost and weighted sum completion
time, respectively, and provide performance guarantees for their proposed methods. However,
neither of these works accommodates multiple users, dependent jobs, or communication times.
Hence, they cannot be easily applied to solve computational offloading problems for practical
system models.
2.5.2 Online Learning
Online learning algorithms focus on making decisions online in the presence of partial or no
information by learning information over time. However, existing works require convexity of
functions and utilize specific objectives such as regret [69]. Some works propose online learning
algorithms to solve auction problems [70] or adwords problems [71]. Again, while these works
provide inspiration to solve online task scheduling problems for cloud computing environments,
they cannot be directly applied to solve these problems because of lack of practical objectives,
constraints, or assumptions.
2.6 Review and Contribution
In this dissertation, we aim to propose algorithms to solve optimization problems in cloud
computing environments by identifying task offloading or scheduling decisions. In Chapter 4,
we address the problem of offloading independent tasks to a network of heterogeneous
processors. We consider an objective of minimizing sum completion time subject to multiple
user budget constraints. In Section 2.2, we reviewed the existing works that address similar
problems. However, we can see from Table 2.1 that these works address the problem from a de-
centralized game-theoretic perspective or make certain assumptions with respect to the system
model such as homogeneous resources or an infinite-capacity cloud. We propose centralized al-
gorithms with performance guarantees, and consider a heterogeneous and finite-capacity system
model.
Similarly, in Section 2.3, we reviewed the existing works that deal with dependent task
scheduling, and we can see from Table 2.2 that these works make similar assumptions with
respect to the system model. In Chapter 3, we consider a system model consisting of finite-
capacity devices and generic DAG applications in order to do away with these assumptions. We
wish to identify a task scheduling decision that minimizes overall cost subject to application
completion deadlines, and propose an efficient heuristic algorithm to obtain effective solutions.
In Chapter 5, we address the online scheduling of tasks, where the total number of tasks
and task information are not known a priori. From Table 2.3, we see that existing works make
assumptions such as fluid tasks and homogeneous resources. We consider a system model that
does away with these assumptions, and propose algorithms with performance guarantees. We
also compare against other alternatives through trace-driven simulation.
Chapter 3
Offloading Dependent Tasks with
Communication Delay
In this chapter, we study the scheduling decision for applications consisting of dependent tasks,
in a generic cloud computing system comprising a network of heterogeneous local processors
and a remote cloud. We formulate an optimization problem to find the offloading decision that
minimizes the overall execution cost, subject to application completion deadlines. Since this
problem is NP-hard, we propose a heuristic algorithm termed Individual Time Allocation with
Greedy Scheduling (ITAGS) to obtain an efficient solution, and study its performance through
simulation.
The contributions of this work are as follows:
• We formulate a problem of cost minimization in scheduling a single application with
dependent tasks and a completion deadline, over a generic cloud computing system with
heterogeneous processors and communication delay. We relax the binary constraints to
obtain a convex problem and a lower bound to the optimal objective of the original
problem.
• We observe that a scheduling solution obtained by directly discretizing the binary-relaxed
solution does not provide satisfactory performance. Therefore, we propose a new heuristic
algorithm, termed Individual Time Allocation with Greedy Scheduling (ITAGS), which
utilizes the binary-relaxed solution to allocate a completion deadline to each individual
task and then greedily optimizes the scheduling of each task subject to its time allowance.
• We consider an extension to this problem where we need to schedule multiple applications
with different completion deadlines, and propose a modified version of ITAGS to obtain
a solution.
• Through trace-based simulation with real-world applications from [72], as well as various
randomly generated task trees, we study the impact of the application deadline and
other system settings on the performance of ITAGS. We further compare ITAGS with
the dynamic programming approach from [14,21], other alternatives including the above
discretization heuristic, and the cost lower bound, demonstrating its superior effectiveness.
We also evaluate through simulation the robustness of ITAGS to variation in processing
times and communication delays.
The problem of offloading dependent tasks to multiple types of processors has been con-
sidered in [21], [22], [23], and [24]. However, in [21] and [22], the devices are assumed to possess
infinite capacity in terms of the number of tasks that can be processed simultaneously without
reduction in the processing speed for each task. On the other hand, in [24], the local pro-
cessor cores are assumed to exist on a single mobile device, and an objective of only energy
consumption by the mobile device is considered. Similarly, in [23], we investigate the objective
of cost minimization subject to an application deadline, for heterogeneous local and remote
processors. However, the delay between the local processors is assumed to be negligible. In
this chapter, we account for the delay between all processors in order to arrive at a general
model that encompasses scenarios such as offloading to peer devices and cloudlets in addition
to the cloud. The local processors have finite capacity, and there is a time and a cost associated
with both task execution and data communication between any two locations. This leads to a
unique and novel problem formulation. Additionally, we consider an extension with multiple
applications consisting of dependent tasks, which has not been considered in existing literature.
The rest of the chapter is organized as follows. Section 3.1 describes the system model
and the problem formulation. In Section 3.2, we present the motivation and details of ITAGS.
In Section 3.3, we consider the extension with multiple applications and propose the modified
version of ITAGS. Section 3.4 presents the simulation results, and a concluding summary is given
in Section 3.5.
3.1 System Model and Problem Formulation
In this section, we consider the problem of offloading a single application with a completion
deadline. We extend this model to multiple applications in Section 3.3.
3.1.1 Local Processors and Remote Cloud
We consider a system with a finite number of local processors. These processors may be installed
in mobile edge computing hosts, cloudlet devices, or peer mobile devices. These processors may
have different speeds but are assumed to be unary, i.e., each processor executes one task at a
time, while the other tasks assigned to the processor wait in a queue. We emphasize that, with
respect to the cost and delay in task processing, this assumption is without loss of generality.1
1 It is easy to see that there is no benefit in processor-sharing with respect to the sum queueing-and-execution delay of the tasks.
Figure 3.1: Example network of local processors and cloud.
Figure 3.2: Dummy tasks, d1 and d2, added to a DAG of 5 tasks.
We further assume a remote cloud center that provides an essentially infinite number of
processors, possibly through leasing of virtual machines. Consequently, the remote cloud can
be viewed as an additional processor having infinite capacity in terms of the number of tasks
it can process simultaneously. Let the set of all processors, including the remote cloud, be P and its size be M. Let dij be the delay per unit data transfer between processors i and j. For
simplicity of illustration, we assume dij = dji and dij = 0 if i = j. An example of such a system
is depicted in Figure 3.1.
3.1.2 Task Dependency Graph
Consider a single application that must be completed before a deadline L. The application
is partitioned into tasks, whose dependency is modeled as a directed acyclic graph (DAG)
G = 〈V, E〉 where V is the set of tasks and E is the set of edges. The edge (i, k) on the graph
specifies that there is some required data transfer, eik, from task i to task k and hence, k
cannot start before i finishes. Furthermore, if they are scheduled at different processors j and
v respectively, the communication delay is eikdjv and the communication cost is ceikdjv, where
c is the communication cost per unit time. The delay values are typically smaller when offloading
to nearby local processors than when offloading to the remote cloud.
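As a concrete reading of this model, the following sketch (hypothetical helper name, not code from this thesis) computes the communication delay and cost for a single DAG edge when its two endpoint tasks are placed on given processors:

```python
# Hypothetical sketch of the communication model of Section 3.1.2.
def comm_delay_and_cost(e_ik, d_jv, c):
    """Delay and cost of transferring task i's output to task k, when i runs
    on processor j and k on processor v (both zero if j == v, since d_jj = 0)."""
    delay = e_ik * d_jv          # communication delay: e_ik * d_jv
    cost = c * delay             # communication cost: c * e_ik * d_jv
    return delay, cost

# 4 units of data over a link with delay 0.5 per unit, at price c = 2 per unit time.
print(comm_delay_and_cost(4.0, 0.5, 2.0))  # (2.0, 4.0)
print(comm_delay_and_cost(4.0, 0.0, 2.0))  # same processor: (0.0, 0.0)
```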
If task i is executed on processor j, the execution time is tij and the execution cost is pjtij ,
Table 3.1: Chapter 3 Notations
Notation Description
tij execution time for task i on processor j
pj processing cost per unit time on processor j
c communication cost per unit time
eik amount of data to be communicated from task i to k
djv delay per unit data between processors j and v
L application deadline
M total number of processors
N ′ total number of tasks
where pj is the processing price per unit time on processor j. In practice, the processing times
and data transfer requirement may be obtained by applying a program profiler as shown in
experimental studies such as MAUI [7] and Thinkair [12]. In this work, we proceed assuming
that such information is already given.
We assume that an application is initiated at a particular local processor and must end at the
same local processor. To model this requirement, for a given DAG representing an application,
we insert two dummy nodes, i.e., tasks having zero execution time and zero communication
cost. One dummy task is inserted at the start to trigger the application at the local device,
and another task is inserted at the very end to receive all the results back at the local device.
This is depicted in Figure 3.2. This insertion is without loss of generality since it preserves the
application. Hence, the total number of tasks can be considered to be
N ′ = |V|+ 2. (3.1)
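The dummy-task insertion can be sketched as follows; this is a minimal illustration with a hypothetical helper, not code prescribed by the thesis:

```python
# Sketch of the dummy-task insertion of Figure 3.2 (hypothetical helper).
def add_dummy_tasks(tasks, edges):
    """tasks: list of task ids; edges: set of (i, k) precedence pairs.
    Returns the DAG augmented with zero-time, zero-data 'entry'/'exit'
    tasks, which trigger the application at the local device and collect
    the results back at the local device."""
    have_pred = {k for _, k in edges}
    have_succ = {i for i, _ in edges}
    new_edges = set(edges)
    new_edges |= {("entry", t) for t in tasks if t not in have_pred}
    new_edges |= {(t, "exit") for t in tasks if t not in have_succ}
    return ["entry"] + list(tasks) + ["exit"], new_edges

tasks, edges = add_dummy_tasks([1, 2, 3], {(1, 2), (1, 3)})
print(len(tasks))  # N' = |V| + 2 = 5
```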
3.1.3 Problem Formulation
The task scheduling decision contains both the mapping between tasks and processors and the
order of the tasks allocated to each processor. We define the scheduling decision variables as
follows:
x_{ijr} :=
\begin{cases}
1 & \text{if task } i \text{ is on processor } j \text{ at position } r, \\
0 & \text{otherwise},
\end{cases}

for all i = 1, . . . , N', j = 1, . . . , M, and r = 1, . . . , N'. Each task is to be scheduled to exactly one
of the existing positions on the processors. Hence,

\sum_{j=1}^{M} \sum_{r=1}^{N'} x_{ijr} = 1, \quad \forall i = 1, \dots, N'. \qquad (3.2)
Furthermore, each position on each processor can be assigned to at most one task, which is
given by

\sum_{i=1}^{N'} x_{ijr} \le 1, \quad \forall r = 1, \dots, N', \; j = 1, \dots, M. \qquad (3.3)
The positions in each processor are filled by the tasks sequentially, i.e., until one position
on a processor is occupied, tasks cannot be assigned to subsequent positions. This is imposed
by the following constraint:
\sum_{i=1}^{N'} x_{ijr} - \sum_{i=1}^{N'} x_{ij(r-1)} \le 0, \quad \forall r = 2, \dots, N', \; j = 1, \dots, M. \qquad (3.4)
The two dummy tasks inserted are required to be scheduled on a local processor, so we have
\sum_{r=1}^{N'} x_{11r} = 1, \qquad \sum_{r=1}^{N'} x_{N'1r} = 1. \qquad (3.5)
Furthermore, our task scheduling decision is required to meet the application deadline,
which imposes constraints on the finishing times of the tasks. Let Fi be the finish time of task
i, for i = 1, . . . N ′. Then
F_{N'} \le L \qquad (3.6)
ensures that the last task, and consequently the overall application, is completed by the deadline.
In addition,
F_1 = 0 \qquad (3.7)
sets the finish time of the first task to zero as it is a dummy task and has zero execution time.
The relationship between the finish times of tasks on the same local processor j and the
decision variables is given by
F_i - F_k + C(2 - x_{ijr} - x_{kj(r-1)}) \ge t_{ij}, \quad \forall i, k = 1, \dots, N', \; r = 2, \dots, N', \; j = 1, \dots, (M-1), \qquad (3.8)
where we assign a large positive number to C. This ensures that the finish time of a task
on processor j is at least equal to the sum of the finish time of the preceding task and the
processing time of the present task. Note that 2− xijr − xkj(r−1) is zero if and only if tasks k
and i are placed consecutively on processor j.
Finally, since the tasks of the application are dependent, the finish time of a task must be
greater than that of each of its predecessors by the amount of its predecessor’s execution time
and communication time from its predecessor. Thus, we have
F_i - F_k \ge \sum_{r=1}^{N'} \sum_{j=1}^{M} t_{ij} x_{ijr} + \sum_{j=1}^{M} \sum_{t=1}^{N'} \sum_{r=1}^{N'} \sum_{v=1}^{M} e_{ki} d_{vj} x_{ijr} x_{kvt}, \quad \forall i = 1, \dots, N', \; (k, i) \in \mathcal{E}. \qquad (3.9)
The first term on the right hand side of (3.9) is the total execution time, and the second term
is the total data communication time, which occurs when task i is executed on processor j and
its predecessor k is executed on another processor v.
We define the total cost of application execution as the sum of the total execution cost and
the total communication cost. Our goal is to identify the schedule that minimizes this total
cost, subject to the application deadline, L. This can be formulated as an optimization problem
as follows:
\underset{\{x_{ijr}\}}{\text{minimize}} \quad \sum_{r=1}^{N'} \sum_{j=1}^{M} \sum_{i=1}^{N'} p_j t_{ij} x_{ijr} + \sum_{i=1}^{N'} \sum_{k=1}^{N'} \sum_{j=1}^{M} \sum_{v=1}^{M} \sum_{r=1}^{N'} \sum_{t=1}^{N'} c\, e_{ik} d_{jv} x_{ijr} x_{kvt}, \qquad (3.10)

subject to (3.2)–(3.9),

x_{ijr} \in \{0, 1\}, \quad \forall i = 1, \dots, N', \; r = 1, \dots, N', \; j = 1, \dots, M. \qquad (3.11)
This problem is NP-hard since it contains the Generalized Assignment Problem (GAP) as
a special case, and GAP is NP-hard. Hence, we do not expect to find an optimal solution in
polynomial time. Consequently, we propose the ITAGS algorithm and study its effectiveness in
solving this problem.
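For intuition on the formulation, the following sketch enumerates all schedules of a toy two-task chain on two processors with made-up parameters. Exhaustive enumeration is shown only to illustrate the objective (3.10) and the deadline constraint; it is exactly what becomes intractable for realistic instances, and the dummy tasks are omitted for brevity:

```python
# Brute-force illustration of the cost-minimization problem on a toy chain
# 1 -> 2 with two processors; all numbers are made up for this example.
from itertools import product

t = [[2.0, 1.0],           # t[i][j]: execution time of task i on processor j
     [3.0, 1.5]]
p = [1.0, 4.0]             # p[j]: processing price per unit time
d = [[0.0, 0.5],           # d[j][v]: delay per unit data between processors
     [0.5, 0.0]]
e12, c, L = 2.0, 1.0, 5.0  # data on edge (1, 2), comm price, deadline

best = None
for s1, s2 in product(range(2), repeat=2):   # processor choice for each task
    comm = e12 * d[s1][s2]                   # communication delay term
    makespan = t[0][s1] + comm + t[1][s2]    # chain: task 2 waits for task 1
    if makespan > L:                         # deadline constraint
        continue
    cost = p[s1] * t[0][s1] + p[s2] * t[1][s2] + c * comm   # objective (3.10)
    if best is None or cost < best[0]:
        best = (cost, (s1, s2))
print(best)  # (5.0, (0, 0)): running both tasks locally is cheapest here
```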
3.2 Individual Time Allocation with Greedy Scheduling (ITAGS)
The ITAGS algorithm is built on the concept of appropriately allocating the application dead-
line among the individual tasks. To provide a guideline on the suitable amount of individual
time allocation, we first consider a binary-relaxed version of the original problem in the next
subsection. We follow that by discussing how one might design, as an inferior alternative to
ITAGS, a feasible binary solution via direct discretization. We then present the details of
ITAGS, concluding with a discussion of its feasibility and computational complexity.
3.2.1 Binary Relaxation and Individual Time Allowance
Optimization problem (3.10) is a mixed integer program, and it is non-convex due to its non-
convex objective and constraints (3.9). However, we note that the communication delay and
cost terms in (3.9) and (3.10) can be modified as follows:
\sum_{j=1}^{M} \sum_{t=1}^{N'} \sum_{r=1}^{N'} \sum_{v=1}^{M} e_{ki} d_{vj} x_{ijr} x_{kvt} \qquad (3.12)

= \sum_{j=1}^{M} \sum_{v=1}^{M} e_{ki} d_{vj} \left( \sum_{r=1}^{N'} x_{ijr} \right) \left( \sum_{t=1}^{N'} x_{kvt} \right)

= \sum_{j=1}^{M} \sum_{v=1}^{M} e_{ki} d_{vj} \max\!\left[ \left( \sum_{r=1}^{N'} x_{ijr} \right) + \left( \sum_{t=1}^{N'} x_{kvt} \right) - 1, \; 0 \right], \qquad (3.13)
where the last equality holds because {xijr} are binary. This converts the non-convex (3.12) to
a convex form in (3.13).
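The identity underlying (3.13) is easy to verify exhaustively for binary variables:

```python
# Quick check of the linearization used in (3.13): for binary a and b, the
# bilinear product a*b equals the convex expression max(a + b - 1, 0).
from itertools import product

assert all(a * b == max(a + b - 1, 0) for a, b in product((0, 1), repeat=2))
print("a*b == max(a + b - 1, 0) on all binary pairs")
```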
Therefore, we perform the following two-step binary relaxation on the original problem:
• Replace the communication terms in (3.10) and (3.9) by (3.13);
• Replace the binary constraints in (3.11) with linear constraints by simply restricting the
decision variables to be non-negative.
This leads to the following convex problem over decision variables {xijr}:
\underset{\{x_{ijr}\}}{\text{minimize}} \quad \sum_{r=1}^{N'} \sum_{j=1}^{M} \sum_{i=1}^{N'} p_j t_{ij} x_{ijr} + \sum_{i=1}^{N'} \sum_{k=1}^{N'} \sum_{j=1}^{M} \sum_{v=1}^{M} c\, e_{ki} d_{vj} \max\!\left[ \left( \sum_{r=1}^{N'} x_{ijr} \right) + \left( \sum_{t=1}^{N'} x_{kvt} \right) - 1, \; 0 \right] \qquad (3.14)

subject to (3.2)–(3.8),

F_i - F_k \ge \sum_{r=1}^{N'} \sum_{j=1}^{M} t_{ij} x_{ijr} + \sum_{j=1}^{M} \sum_{v=1}^{M} e_{ki} d_{vj} \max\!\left[ \left( \sum_{r=1}^{N'} x_{ijr} \right) + \left( \sum_{t=1}^{N'} x_{kvt} \right) - 1, \; 0 \right],
\quad \forall i = 1, \dots, N', \; (k, i) \in \mathcal{E}, \qquad (3.15)

x_{ijr} \ge 0, \quad \forall i = 1, \dots, N', \; r = 1, \dots, N', \; j = 1, \dots, M. \qquad (3.16)
An optimal solution to problem (3.14) can be efficiently computed using convex program-
ming solvers such as CVX. Note that replacing (3.11) with (3.16) is equivalent to allowing a
single task to be distributed and executed partially across several processors and positions.
This is unrealistic, but solving this relaxed problem is useful for two purposes. First, since
the relaxed problem has a larger feasible set, it serves as a lower bound to the optimum of
the original problem, which can be used for numerical performance benchmarking. Second,
the relaxed solution can be leveraged to recover a binary solution to the original problem. In
particular, as a part of the ITAGS algorithm, it supplies the individual time allowance for each
task as explained in Section 3.2.3.
3.2.2 Alternative Discretization Heuristic
Before presenting the details of ITAGS, we first consider a conventional approach to recover a
binary solution from the above relaxed solution, by discretizing the fractional solution xijr to
binary values. We will show later that such an approach, although non-trivial, does not provide
satisfactory performance. Therefore, it will be used mainly for performance benchmarking
against ITAGS.
We note that discretizing the fractional xijr solutions is challenging. Directly rounding
them to binary values will violate some constraints of the original problem. In particular, the
constraints on relative positions of tasks on a processor need to be taken into consideration,
to ensure that the scheduled tasks are in proper order to satisfy the dependency requirement.
Consequently, we consider the following algorithm, termed the discretization heuristic, which 1) dis-
regards the task positions in the relaxed solution, 2) schedules each task to a processor based
on the fractional solution, and 3) calculates the resultant task starting times to obtain their
relative position values for the final binary solution.
Reduction to Task-on-Processor
In this step, we assign xijr values to their corresponding task-on-processor variables yij as
follows:
y_{ij} = \sum_{r=1}^{N'} x_{ijr}, \quad \forall i = 1, \dots, N', \; j = 1, \dots, M. \qquad (3.17)

Thus, the yij variables contain just the fractional solution for each task i on each processor
j, which disregards the position information in xijr. It should be noted that the yij obey the
scheduling constraint \sum_{j=1}^{M} y_{ij} = 1, \forall i = 1, \dots, N'.
Discretization
We next discretize the fractional yij solutions to decide the processor assignment decision si for
every task i by picking the processor that has the maximum yij value. The intuition behind
this is that a yij value can be viewed as the probability of scheduling task i on processor j, and
thus we take the decision with the highest probability:
s_i = \arg\max_j \; y_{ij}, \quad \forall i = 1, \dots, N'. \qquad (3.18)

Thus, the decision for every task i is

y_{ij} :=
\begin{cases}
1 & \text{if } j = s_i, \\
0 & \text{if } j \ne s_i.
\end{cases}
Mapping to Positions
Although we have determined the processor on which each task needs to be scheduled, we still
need to decide the positions of tasks on each processor, or the starting times for each task.
Towards this end, we sort the tasks in the order of increasing Fi values from the solution to
the relaxed problem (3.14). This sorting will ensure that the precedence constraints in (3.9)
are obeyed between any two consecutive tasks. Thus, scheduling the tasks to their assigned
processors in this order will give us their corresponding positions and starting time values.
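The three steps of the discretization heuristic can be sketched as follows, on made-up fractional data and with a hypothetical helper name:

```python
# Sketch of the discretization heuristic of Section 3.2.2: marginals (3.17),
# argmax discretization (3.18), then ordering tasks by relaxed finish times.
def discretize(x, F):
    """x[i][j][r]: fractional relaxed solution; F[i]: relaxed finish times."""
    # Step 1: task-on-processor variables y_ij = sum_r x_ijr   (3.17)
    y = [[sum(per_pos) for per_pos in per_proc] for per_proc in x]
    # Step 2: s_i = argmax_j y_ij                              (3.18)
    s = [max(range(len(yi)), key=lambda j: yi[j]) for yi in y]
    # Step 3: fill positions in order of increasing F_i, which keeps the
    # precedence constraints satisfied between consecutive tasks.
    schedule = {}
    for i in sorted(range(len(x)), key=lambda i: F[i]):
        schedule.setdefault(s[i], []).append(i)
    return s, schedule

# Two tasks, two processors, two positions each (each task's entries sum to 1).
x = [[[0.6, 0.1], [0.2, 0.1]],   # task 0 leans toward processor 0
     [[0.1, 0.2], [0.4, 0.3]]]   # task 1 leans toward processor 1
print(discretize(x, F=[1.0, 3.0]))  # ([0, 1], {0: [0], 1: [1]})
```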
Feasibility Check
For the above task schedule, we check if the total delay meets the application deadline L. If so,
the corresponding cost is the resulting solution, or else the algorithm fails to produce a feasible
schedule. In the latter case, we will use the same fallback procedure as ITAGS described below,
to offload all tasks to the cloud.
3.2.3 ITAGS Algorithm
The task scheduling decision is required to meet the overall application completion deadline.
A purely greedy algorithm might schedule each task to the processor where it achieves the
minimum cost such that the overall application deadline is still met. However, such an algorithm
prioritizes the tasks that are scheduled in the beginning as these tasks would be able to take
away a larger chunk of the overall deadline allowance and make cost-effective decisions for
themselves. On the other hand, the tasks that are scheduled later in the greedy process would
have a relatively smaller portion of the overall deadline allowance available, resulting in possible
infeasibility and performance degradation.
Thus, the guiding principle behind the design of ITAGS is that the overly greedy aspect
of the above approach should be countered, by assigning individual deadlines to the tasks to
ensure uniform priority for all tasks regardless of their scheduling order. ITAGS consists of
three major steps: 1) Set individual time allowance for each task; 2) Schedule each task to a
processor based on a greedy approach subject to its individual deadline; and 3) Check feasibility
by testing if the last task meets the overall application deadline.
Step 1: Individual Time Allocation
In Step 1, we identify the time allowance to be given to each task. This is achieved by performing
binary relaxation on the original problem as detailed in Section 3.2.1, and solving the relaxed
problem to obtain the finish time Fi for each task i. These finish times are treated as individual
task deadlines in the next step.
Step 2: Task Scheduling
Once the individual deadlines are set in Step 1, Step 2 of ITAGS aims at assigning a processor
si for each task i. This task scheduling process has a principled greedy nature as the algorithm
takes one task at a time and schedules it to the processor where the task 1) can complete its
execution before its individual deadline and 2) incurs minimum additional cost.
ITAGS schedules the tasks starting from the top of the DAG and works its way down to
the bottom. Specifically, the tasks are scheduled in the increasing order of individual deadline
Fi. We note that this ensures that a task is scheduled only after its predecessors have been
scheduled since (3.15) ensures that the Fi value of task i exceeds that of its predecessors. The
topmost task is the first dummy task, and it is always scheduled to the local processor where
the application is initiated. Then, as ITAGS moves down the list of unscheduled tasks, for each
task i, we decide its start time STi and processor si.
First, for each potential processor j, we compute the accumulated execution delay Dij and
cost Cij , due to the execution of i on processor j, as follows:
Dij = max_{(k,i)∈E} ( STk + t_{k s_k} + Tki )  (3.19)

Cij = pj tij + c ∑_{(k,i)∈E} Tki  (3.20)

where Tki = d_{s_k j} eki is the communication delay from processor sk to processor j with respect to
the data from task k to task i. In (3.19), the sum inside the max calculates the time when a parent
task k completes execution and its data transfer to task i has arrived at processor j. Therefore,
Dij is the earliest start time of task i on processor j, taking into account all parents of task
i. Note that if both tasks k and i are scheduled onto the same processor, i.e., sk = j, then the
communication delay per unit data d_{s_k j} = 0 and consequently Tki = 0.
However, knowing Dij is not sufficient to decide whether task i should be placed on processor
j, since Dij does not take into account the waiting time for a task on processor j if the processor
is local and is already executing another task. Therefore, we keep a tab on the end of busy time
on each local processor j, denoted by SLj , and we update it every time a task is scheduled onto
the processor. In other words, every time that some task k is assigned to processor j, we set
SLj = STk + tkj . (3.21)
This takes into account the amount of time that a task will have to wait for processor j if
it is assigned to this processor. Note that for the remote cloud M , SLM is always zero as
we assume that the cloud has infinite capacity in terms of the number of tasks it can process
simultaneously, resulting in zero waiting time for any task scheduled to the cloud.
As a result, the start time of a task i assigned to processor j is the maximum of the
accumulated execution delay Dij and current end of busy time SLj . Thus, in order for task i
to complete execution by its individual deadline Fi, the following condition must be satisfied:
max{Dij , SLj}+ tij ≤ Fi. (3.22)
We then choose processor si to schedule task i as follows:
si = argmin_{j∈Ji} Cij  if Ji ≠ ∅,
     argmin_j Dij       if Ji = ∅.  (3.23)
where Ji is the set of all processors for which (3.22) is satisfied for task i. From (3.23) we
see that if the individual deadline Fi is too tight and cannot be met by any processor, ITAGS
gracefully falls back to a greedy-time algorithm, i.e., one that tries to minimize makespan.
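The per-task decision of Step 2, equations (3.19)-(3.23), can be sketched as follows. This is an illustrative sketch, not the thesis implementation; the first dummy task, which has no parents, is assumed to be handled separately by the caller, and all argument names are hypothetical.

```python
def schedule_task(i, parents, ST, s, t, d, e, p, c, SL, F_i):
    """One ITAGS Step-2 decision for task i.

    parents  -- predecessors k of i, i.e. (k, i) in E (already scheduled)
    ST, s    -- start times and processor choices of scheduled tasks
    t[i][j]  -- execution time of task i on processor j
    d[u][v]  -- communication delay per unit data between processors (d[u][u] = 0)
    e[(k,i)] -- data sent from task k to task i
    p[j], c  -- processor and communication prices; SL[j] -- end of busy time
    F_i      -- individual deadline of task i from the relaxed solution
    """
    D, C = {}, {}
    for j in range(len(p)):
        T = {k: d[s[k]][j] * e[(k, i)] for k in parents}          # T_ki, zero if s_k = j
        D[j] = max(ST[k] + t[k][s[k]] + T[k] for k in parents)    # eq. (3.19)
        C[j] = p[j] * t[i][j] + c * sum(T.values())               # eq. (3.20)
    # Processors on which i can still meet its individual deadline, eq. (3.22).
    J = [j for j in D if max(D[j], SL[j]) + t[i][j] <= F_i]
    if J:                                 # eq. (3.23): cheapest feasible processor
        sj = min(J, key=lambda j: C[j])
    else:                                 # greedy-time fallback: minimize delay
        sj = min(D, key=lambda j: D[j])
    return sj, max(D[sj], SL[sj])         # chosen processor and actual start time
```

After each call, the caller would record the returned start time, set s[i], and, for a local processor, advance SL to the new end of busy time as in (3.21).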
Step 3: Feasibility Check
The process outlined in Step 2 is repeated until the last dummy task. This dummy task is to be
scheduled to the local processor that initiated the application in order to obtain the results at the
initiating device. If this last task does not meet the overall application deadline L, infeasibility
occurs and the algorithm fails to produce a feasible schedule. Alternatively, if every task has
been scheduled successfully to some processor, then a feasible decision is obtained. These two
possibilities result in termination of the algorithm.
The details of ITAGS are given in Algorithm 1.
Algorithm 1 ITAGS algorithm (after Step 1)
Input: DAG G = 〈V, E〉, P, L, and solution to problem (3.14).
Output: Scheduling decision variables {xijr}
  SLj ← 0 for all j ∈ P
  STi ← 0 for all i ∈ V
  s1 = 1 {Schedule first dummy task to initiating processor}
  while there exist tasks not scheduled do
    Choose unscheduled task i with minimum Fi
    for all j ∈ P do
      Calculate Dij from (3.19)
      Calculate Cij from (3.20)
    end for
    if i = N′ then
      sN′ = 1 {Schedule last dummy task to initiating processor}
    else
      Find si from (3.23)
    end if
    STi ← max{D_{i si}, SL_{si}} {Setting actual starting time}
    if si < M then
      SL_{si} ← STi + t_{i si} {Updating the end of busy time for local processors}
    end if
  end while
  if D_{N′ s_{N′}} > L then
    No feasible decision produced.
    return
  end if
  xijr ← 0 for all i, j and r
  Sort the tasks scheduled to each processor in increasing order of STi and obtain their positions ri.
  for all i ∈ V do
    x_{i si ri} = 1
  end for
3.2.4 Feasibility and Complexity Analysis
For our NP-hard optimization problem (3.10), neither the discretization heuristic nor ITAGS
provides a feasibility guarantee. Consequently, for practical cases, we may consider a fallback
option that simply offloads all tasks belonging to the application to the remote cloud, if a
feasible solution is not found by the algorithm. Such a fallback option is applicable under the
assumption that the cloud has fast processors and high-speed access, so that offloading to the
cloud can meet the overall application deadline, albeit with an added cost.
The computational complexity of the discretization heuristic is O(2|V|+ |E|). On the other
hand, the computational complexity for the ITAGS algorithm excluding the time to compute
the lower bound solution is O(M(|V| + |E|)), which is polynomial with respect to the size of
the application. The time to compute the lower bound is dependent on the algorithms used by
the software to arrive at the solution. Assuming that a primal barrier algorithm and a υ-self-concordant
barrier function with µ as the barrier parameter are used, the number of iterations
needed to arrive at the solution is O(√υ log(υµ/ε)) for a convex program [73].
3.3 Scheduling multiple applications with different deadlines
Now, we consider an extension where multiple applications need to be executed, each one
having its own application completion deadline. Let the total number of applications be A.
Each application a is modeled as a DAG Ga = 〈Va, Ea〉 where Va is the set of tasks and Ea is
the set of edges. Edge eaki specifies the amount of data to be communicated from task i to task
k belonging to application a.
Similar to the single-application case in Section 3.1.2, the number of tasks in application
a, including the dummy tasks, is given by N ′a. Each application a has an application deadline
given by La, and if its task i is executed on processor j, the execution time is taij .
We redefine the task scheduling decision variables as follows.
xaijr := 1 if task i belonging to application a is on processor j at position r,
         0 otherwise,

for all a = 1, . . . , A, i = 1, . . . , N′a, j = 1, . . . , M and r = 1, . . . , P, where P = ∑_{a=1}^{A} N′a.
3.3.1 Binary Relaxation
The relaxed problem (3.14) is modified for this new multiple-application model as follows.
minimize_{xaijr}  ∑_{a=1}^{A} ∑_{r=1}^{P} ∑_{j=1}^{M} ∑_{i=1}^{N′a} pj taij xaijr
                + ∑_{a=1}^{A} ∑_{i=1}^{N′a} ∑_{k=1}^{N′a} ∑_{j=1}^{M} ∑_{v=1}^{M} eaki dvj max[ (∑_{r=1}^{P} xaijr) + (∑_{t=1}^{P} xakvt) − 1, 0 ]  (3.24)

subject to

∑_{j=1}^{M} ∑_{r=1}^{P} xaijr = 1,  ∀a = 1, . . . , A, i = 1, . . . , N′a,  (3.25)

∑_{a=1}^{A} ∑_{i=1}^{N′a} xaijr ≤ 1,  ∀r = 1, . . . , P, j = 1, . . . , M,  (3.26)

∑_{a=1}^{A} ∑_{i=1}^{N′a} xaijr − ∑_{a=1}^{A} ∑_{i=1}^{N′a} xaij(r−1) ≤ 0,  ∀r = 2, . . . , P, j = 1, . . . , M,  (3.27)

F_{aN′a} ≤ La,  ∀a = 1, . . . , A,  (3.28)

Fa1 = 0,  ∀a = 1, . . . , A,  (3.29)

Fai − Fak ≥ ∑_{r=1}^{P} ∑_{j=1}^{M} taij xaijr + ∑_{j=1}^{M} ∑_{v=1}^{M} eaki dvj max[ (∑_{r=1}^{P} xaijr) + (∑_{t=1}^{P} xakvt) − 1, 0 ],
    ∀a = 1, . . . , A, i = 1, . . . , N′a, (k, i) ∈ Ea,  (3.30)

Fai − Fbk + C(2 − xaijr − xbkj(r−1)) ≥ taij,  ∀a, b = 1, . . . , A, i = 1, . . . , N′a,
    k = 1, . . . , N′b, r = 2, . . . , P, j = 1, . . . , M,  (3.31)

∑_{r=1}^{P} xa11r = 1,  ∑_{r=1}^{P} x_{aN′a 1r} = 1,  ∀a = 1, . . . , A,  (3.32)

xaijr ≥ 0,  ∀a = 1, . . . , A, i = 1, . . . , N′a, r = 1, . . . , P, j = 1, . . . , M.  (3.33)
Consequently, using this relaxed solution, we can extend the discretization heuristic and
ITAGS proposed in Sections 3.2.2 and 3.2.3 respectively, with some modifications, to solve this
problem.
3.3.2 Modified Alternative Discretization Heuristic
The discretization heuristic can be adjusted to accommodate multiple applications with different
deadlines by tweaking its three constituent steps presented in Section 3.2.2 as follows.
Reduction to Task-on-Processor
In this step, we assign xaijr values to their corresponding task-on-processor variables yaij as
follows:
yaij = ∑_{r=1}^{P} xaijr,  ∀a = 1, . . . , A, i = 1, . . . , N′a, j = 1, . . . , M.  (3.34)
Discretization
We next discretize the fractional yaij solutions to decide the processor assignment decision sai
for every task i by picking the processor that has the maximum yaij value.
sai = argmax_j yaij,  ∀a = 1, . . . , A, i = 1, . . . , N′a.  (3.35)

Thus, the decision for every task i is

yaij := 1 if j = sai,  0 if j ≠ sai.
Mapping to Positions
We sort the tasks in the order of increasing Fai values, ∀a = 1, . . . , A, i = 1, . . . , N′a, from the solution
to the relaxed problem (3.24). We schedule the tasks to their assigned processors in this order
to obtain their corresponding positions and starting time values.
Feasibility Check
For the above task schedule, we check if the application deadline La is met for every application
a. If so, the corresponding cost is the resulting solution, or else the algorithm fails to produce
a feasible task schedule for one or more applications. In the latter case, we will use the fallback
procedure described in Section 3.3.4.
3.3.3 Modified ITAGS Algorithm
Step 1: Individual Time Allocation
We solve the relaxed problem to obtain the finish time Fai for each task i belonging to appli-
cation a. These finish times are treated as individual task deadlines in the next step.
Step 2: Task Scheduling
Tasks are scheduled in the increasing order of individual deadline Fai. The topmost task is the
first dummy task, and it is always scheduled to the local processor where the application is
initiated. Then, as ITAGS moves down the list of unscheduled tasks, for each task i, we decide
its start time STai and processor sai.
First, for each potential processor j, we compute the accumulated execution delay Daij and
cost Caij , due to the execution of i of application a on processor j, as follows:
Daij = max_{(k,i)∈Ea} ( STak + t_{ak s_ak} + Taki )  (3.36)

Caij = pj taij + c ∑_{(k,i)∈Ea} Taki  (3.37)
where Taki = d_{s_ak j} eaki is the communication delay from processor sak to processor j with respect
to the data from task k to task i. As in the single-application case, every time that task i of
application a is assigned to processor j, we update the end of busy time of processor j as

SLj = STai + taij,  (3.38)
where STai is the start time of task i of application a.
In order for the task to complete execution by its individual deadline Fai, the following
condition must be satisfied:
max{Daij , SLj}+ taij ≤ Fai. (3.39)
We then choose processor sai to schedule task i of application a as follows:
sai = argmin_{j∈Jai} Caij  if Jai ≠ ∅,
      argmin_j Daij        if Jai = ∅.  (3.40)
where Jai is the set of all processors for which (3.39) is satisfied.
Step 3: Feasibility Check
The process outlined in Step 2 is repeated until all tasks are scheduled. The dummy tasks
are to be scheduled to the local processor that initiated each application. If the last dummy task of any application a
does not meet its application deadline La, infeasibility occurs and the algorithm fails to
produce a feasible schedule for that application. Alternatively, if every task has been scheduled
successfully to some processor, then a feasible decision is obtained. These two possibilities
result in termination of the algorithm.
3.3.4 Feasibility and Complexity Analysis
For this NP-hard optimization problem (3.24), neither the discretization heuristic nor ITAGS
provides a feasibility guarantee. If application deadline La is not met for some application a,
we use a fallback option where the algorithm simply offloads all tasks belonging to application
a to the remote cloud, similar to Section 3.2.4. Note that this will not alter the feasibility of
the other applications.
The computational complexity of the discretization heuristic, excluding the time to compute
the lower bound solution, is O(2 ∑_{a=1}^{A} (|Va| + |Ea|)). On the other hand, the computational
complexity of the ITAGS algorithm is O(M ∑_{a=1}^{A} (|Va| + |Ea|)), which is polynomial with
respect to the total number of applications and their sizes.
Algorithm 2 Modified ITAGS algorithm (after Step 1)
Input: DAGs Ga = 〈Va, Ea〉, P, La, ∀a ∈ {1, . . . , A}, and solution to problem (3.24).
Output: Scheduling decision variables {xaijr}
  SLj ← 0 for all j ∈ P
  STai ← 0 for all a ∈ {1, . . . , A}, i ∈ Va
  sa1 = 1, ∀a ∈ {1, . . . , A} {Schedule first dummy task of each a to initiating processor}
  while there exist tasks not scheduled do
    Choose unscheduled task i with minimum Fai across all applications a
    a ← the application that includes task i
    for all j ∈ P do
      Calculate Daij from (3.36)
      Calculate Caij from (3.37)
    end for
    if i = N′a then
      s_{aN′a} = 1 {Schedule last dummy task to initiating processor}
    else
      Find sai from (3.40)
    end if
    STai ← max{D_{ai sai}, SL_{sai}} {Setting actual starting time}
    if sai < M then
      SL_{sai} ← STai + t_{ai sai} {Updating the end of busy time for local processors}
    end if
  end while
  for all a ∈ {1, . . . , A} do
    if D_{aN′a s_{aN′a}} > La then
      No feasible decision produced.
      return
    end if
  end for
  xaijr ← 0 for all a, i, j and r
  Sort the tasks scheduled to each processor in increasing order of STai and obtain their positions rai.
  for all a ∈ {1, . . . , A}, i ∈ Va do
    x_{ai sai rai} = 1
  end for
3.4 Trace-driven and Randomized Simulations
We investigate the performance of ITAGS with extensive simulation over multiple offloading
scenarios and applications, using both real-world applications and randomly generated task
trees with practical parameter values.
3.4.1 Comparison Targets
We compare ITAGS with the following alternatives:
• Discretization heuristic in Section 3.2.2 for single application, and modified discretization
heuristic in Section 3.3.2 for multiple applications.
• Purely local: Scheduling all tasks on the local device, i.e., user’s own device/processor.
• Purely remote: Scheduling all tasks on the remote cloud.
• Greedy algorithm: Picking tasks starting from the top of the DAG and scheduling each
task onto the processor where it has the least accumulated cost such that the overall
application deadline is still met.
• Kao’s dynamic programming: The dynamic programming method proposed in [14, 21].
Since the local device in [14,21] can execute any number of tasks simultaneously without
increasing the required processing time of each task, it is essentially assumed to have
an infinite number of identical unary-capacity processors. Furthermore, there is zero
delay between the local device and the remote cloud. Thus, we study the performance
of this dynamic programming algorithm by allowing only a finite number of identical
local processors and practical delay between these processors. In other words, we run
their algorithm to obtain a scheduling decision and apply this decision to our system
by queuing the tasks appropriately and calculating the cost and deadline accounting for
inter-processor delay.
All of the above algorithms are provided with the same fallback option as ITAGS. The lower
bound solution, described in Section 3.2.1, is also observed for benchmarking purposes. It is
calculated using the SDPT3 solver of CVX.
We present the results for the single-application case in Sections 3.4.2, 3.4.3 and 3.4.4, and for the
multiple-applications case in Section 3.4.5. Additionally, in Section 3.4.5, we evaluate the performance
of ITAGS when the task processing times and communication times are not precisely known
or are subject to variation.
3.4.2 Trace-driven Simulation
The delay constrained cost minimization problem under consideration and the ITAGS algorithm
have general application to different network topologies. Here we consider the following typical
scenarios.
[Figure 3.3: Simulation topology — the user’s own mobile device/processor, a peer device/local processor, a cloudlet, and the remote cloud, interconnected by 802.11n, 802.11ac, and LTE-A links.]
[Figure 3.4: Cost vs. application deadline for (a) Gaussian elimination and (b) FFT in Scenario 1.]
[Figure 3.5: Cost vs. application deadline for (a) Gaussian elimination and (b) FFT in Scenario 2.]
[Figure 3.6: Cost vs. application deadline for (a) Gaussian elimination and (b) FFT in Scenario 3.]
[Figure 3.7: Cost vs. application deadline for (a) Gaussian elimination and (b) FFT in Scenario 4.]
Figure 3.8: Task Graph for Gaussian Elimination Application with a matrix size of 5
• Scenario 1 : Identical local processors (including the initiating processor) and a remote
cloud
• Scenario 2 : The initiating processor, a peer device/local processor, and a remote cloud
• Scenario 3 : The initiating processor, a cloudlet, and a remote cloud
• Scenario 4 : The initiating processor and a 3-tier architecture (peer device, cloudlet and
remote cloud) given in Figure 3.3.
Note that we apply Kao’s dynamic programming scheme only to Scenario 1, as the system
model considered in [14,21] cannot be extended to more complicated scenarios. We always label
the local processor initiating the application as processor 1.
We use the application DAG structures presented in [72] for Gaussian elimination (depicted
in Figure 3.8) and the FFT algorithm, as well as additional information provided in [72] with
respect to the computation and communication times, to test our proposed algorithms for the
aforementioned scenarios. We consider the Gaussian elimination application with a matrix size
of 5. We generate random values for the processing time of a single loop uniformly from the
interval (0.5, 5) ms and allocate processing times ti1 for each task i accordingly based on the
number of loops required for the execution of the task in the Gaussian elimination algorithm.
Similarly, the input/output data is drawn uniformly from the interval (10, 100) KB. For the
FFT algorithm, we generate the processing times ti1 for each task i uniformly in (0.5, 20) ms
and input/output data amount is drawn uniformly from the interval (10, 100) KB. Further, we
enforce that the computation times for the tasks in each level are equal and the communication
times between the tasks at two particular levels are equal as given in [72]. The rest of the
parameter values are kept the same as those of the Gaussian elimination application.
We use energy consumption as the measurement of cost. We set c = 0.935 watt [74].
The local processor initiating the application has p1 = 0.944 watt [74]. The following are the
parameter settings for the different scenarios.
• Scenario 1 : 3 additional identical local processors which may all be on the same initiating
device or on different devices. We assume that all these local processors have pj = 0.944
watt and tij = ti1, for each task i and processor j = 2, 3, 4.
• Scenario 2 : A single additional local processor, representing the peer device, having
p2 = 1.5 watts and ti2 = 0.75ti1 for task i.
• Scenario 3 : For Scenario 3, we consider a cloudlet consisting of two processors p2 = p3 = 4
watts and ti2 = ti3 = 0.5ti1 for each task i.
• Scenario 4 : Both the peer device from Scenario 2 and the cloudlet from Scenario 3.
For each of the above scenarios, we additionally consider a more powerful and consequently more
expensive remote cloud, consisting of an unlimited number of available processors, with pM = 10
watts and tiM = 0.12ti1 for each task i. For communication between processors, we consider
practical communication delay based on the links as given in Figure 3.3, with 6.15 ns/byte
for 802.11ac, 17.77 ns/byte for 802.11n, and 80 ns/byte for Long Term Evolution Advanced
(LTE-A).
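As a quick sanity check on these per-byte delays, the transfer time of a hypothetical 100 KB inter-task data item (an illustrative size; 1 KB is taken as 1000 bytes here) on each link type works out as follows:

```python
# Transfer delay of an illustrative 100 KB inter-task data item on each link,
# using the per-byte delays quoted above.
data_bytes = 100 * 1000
links_ns_per_byte = {"802.11ac": 6.15, "802.11n": 17.77, "LTE-A": 80.0}
for link, ns_per_byte in links_ns_per_byte.items():
    delay_ms = data_bytes * ns_per_byte * 1e-6   # ns -> ms
    print(f"{link}: {delay_ms:.3f} ms")
# 802.11ac: 0.615 ms, 802.11n: 1.777 ms, LTE-A: 8.000 ms
```

Even a moderate transfer over LTE-A therefore costs several milliseconds, which is on the same order as the task execution times used here; this is why ignoring inter-processor delay, as in [14, 21], can distort the resulting schedule.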
Figures 3.4-3.7 depict the cost versus application deadline for the Gaussian elimination and
FFT applications for these four scenarios. We see from these figures that ITAGS performs
consistently better than the other alternatives. From Figures 3.4a and 3.4b, we see that the
dynamic programming approach in [14, 21] performs poorly when subjected to practical con-
straints such as finite capacity processors and inter-processor delay. Naive algorithms such as
purely local, purely remote, and greedy do not give satisfactory cost, particularly for non-trivial
values of application deadline. The discretization heuristic generally performs better than the
naive alternatives but is out-performed by ITAGS. We also see that with increasing values of
application deadline, the cost decreases due to the cost-time tradeoff. For large values of appli-
cation deadline, the decisions tend towards being purely local as the local device is the cheapest
and the slowest in our settings.
[Figure 3.9: Cost vs. application deadline for randomly generated task trees — (a) effect of the number of processors M; (b) effect of communication cost c; (c) effect of local delay per byte dl; (d) effect of the number of tasks N.]
3.4.3 Simulation with Randomly Generated Task Trees
In order to further assess the behavior of ITAGS over richer parameter settings, we conduct
simulation based on randomly generated task trees in terms of the DAG structure, task execu-
tion times on the processors, and input/output data between tasks. For each parameter setting,
we observe the average performance of various algorithms over multiple realizations of the ran-
domly generated task trees. From our observations in the previous section, the discretization
heuristic mostly outperforms the other naive alternatives. Hence, we present comparison only
with the discretization heuristic, but also use the lower bound solution for benchmarking.
In Figure 3.9, we study the effect of various parameters on the performance of ITAGS. We
again use energy consumption as the measurement of cost. We consider a general topology with
multiple local helper processors and a remote cloud. We assume the processor initiating the
application, labeled as processor 1, has p1 = 0.944 watt [74], and any additional local helper
processors, representing faster cloudlets or peer devices, have pj = 1.5 watts and tij = 0.75ti1 for
each task i and processor j = 2, . . . , (M − 1). We consider a more powerful and consequently
more expensive remote cloud with pM = 10 watts and tiM = 0.25ti1 for each task i. Here,
Table 3.2: Run-time (sec)

            M = 3                      M = 4
 N     Disc. Heu.    ITAGS       Disc. Heu.    ITAGS
 5       5.0612      5.0620        8.6019      8.6029
 7      10.0160     10.0167       17.0474     17.0484
10      22.9529     22.9538       41.5981     41.5991
15      79.7471     79.7484      178.5292    178.5308
ti1 = (number of cycles)/(1.2 GHz), where the processor speed is 1.2 GHz and the number of cycles is drawn from
a uniform distribution in the interval (100, 200) mega cycles. We set by default M = 3, N = 5,
c = 0.935 watt [74], and communication delay between the local processors dl = 10 ns/byte
but vary each of them in different plots. The input/output data amount is drawn uniformly
from the interval (1, 3) MB. The communication delay is taken as 50 ns/byte between a local
processor and the remote cloud.
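For reproducibility, the numeric parameters described above can be drawn as in the following sketch; the DAG structure itself is generated separately and is not shown, and the function and variable names are illustrative rather than taken from the thesis code.

```python
import random

def draw_instance(N, M, seed=None):
    """Draw the randomized per-task parameters used in the simulations.

    Returns t[i][j] (execution times, seconds), p[j] (prices, watts), and a
    sampler for the per-edge input/output data amount (bytes).  Processor 1
    is the initiating device, processors 2..M-1 are local helpers, and
    processor M is the remote cloud.
    """
    rng = random.Random(seed)
    # t_i1 = (number of cycles) / 1.2 GHz, cycles uniform in (100, 200) Mcycles.
    t1 = [rng.uniform(100e6, 200e6) / 1.2e9 for _ in range(N)]
    # Helpers run in 0.75x the local time, the remote cloud in 0.25x.
    t = [[ti] + [0.75 * ti] * (M - 2) + [0.25 * ti] for ti in t1]
    # Prices: initiating processor, local helpers, remote cloud.
    p = [0.944] + [1.5] * (M - 2) + [10.0]
    # Per-edge data drawn uniformly from (1, 3) MB.
    edge_data = lambda: rng.uniform(1e6, 3e6)
    return t, p, edge_data
```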
We see that ITAGS substantially outperforms the discretization heuristic, and by inference
the other alternatives, over a wide range of parameter values in the number of processors,
the communication price, and the application size. Furthermore, as the application deadline
increases, ITAGS converges to the lower bound solution, and hence also converges to the opti-
mum, faster than the discretization heuristic.
3.4.4 Run-Time Comparison
In Table 3.2, we show the run-time of ITAGS under the settings of Figure 3.9a, averaged over all
L values. We observe that the run-time of ITAGS is nearly identical to that of the discretization
heuristic. Therefore, the substantial performance benefit of ITAGS is achieved with negligible
run-time penalty. Furthermore, ITAGS scales well with respect to the application size.
3.4.5 Multiple Applications and Uncertain Processing and Communication
Times
In this section, we consider multiple applications with different deadlines and the modified
algorithms proposed in Section 3.3. For this general scenario, we also study the robustness of
ITAGS to variation in the processing times and communication times.
Figure 3.10 depicts cost versus application deadline for different number of applications A.
For this figure, we assume that all applications have the same deadline, and use the default
parameter settings used in Section 3.4.3. We see that when A = 2 and A = 3, the performance
gap between ITAGS and the discretization heuristic is larger than when A = 1, i.e., the single-
application case. In other words, ITAGS can be expected to perform comparatively even better for
larger systems with multiple applications.
We now consider the scenario where we do not know the exact task processing times and
[Figure 3.10: Cost vs. application deadline for different number of applications A.]
[Figure 3.11: Cost vs. realizations for 15% error.]
[Figure 3.12: Cost vs. error with known and unknown processing and communication times (with 95% confidence intervals) — (a) single Gaussian elimination application; (b) two randomly-generated applications.]
communication times. We assume that we know certain estimates of these times, and allow
for an error about these estimates. We use the default parameter settings from Section 3.4.3,
except treating the processing time values (tij for all task i and processor j) and communication
time values (djr between processors j and r) there as estimates. We allow the actual times to
vary uniformly randomly between ((1 − ε)tij , (1 + ε)tij) and ((1 − ε)djr, (1 + ε)djr) for some
error ε ∈ [0, 1].
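This error model amounts to scaling every estimate by an independent uniform factor; a minimal helper (with illustrative names) is:

```python
import random

def perturb(estimate, eps, rng=random):
    """Draw an 'actual' time uniformly from ((1 - eps)*est, (1 + eps)*est),
    matching the error model described above."""
    return estimate * rng.uniform(1 - eps, 1 + eps)

def perturb_instance(t, d, eps, rng=random):
    """Perturb every processing time t[i][j] and delay d[j][r] of one realization."""
    t_act = [[perturb(tij, eps, rng) for tij in row] for row in t]
    d_act = [[perturb(djr, eps, rng) for djr in row] for row in d]
    return t_act, d_act
```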
Figures 3.11 and 3.12 compare the costs for the cases with known and unknown task pro-
cessing times and communication times. The labels marked ’unknown’ refer to the cases where
we know just the estimates and not the exact times. The labels marked ’known’ refer to the
cases where we know the exact times. For Figure 3.11, we consider two randomly-generated
applications with deadlines L1 = 0.6 and L2 = 0.8. We fix error ε = 0.15, and run ITAGS and
the discretization heuristic for multiple realizations. We see that ITAGS performs consistently
better than the discretization heuristic for every realization, for both known and unknown cases.
Additionally, we see that for some realizations, ITAGS and the discretization heuristic provide
the same solution for the ’known’ scenario but discretization heuristic provide a worse solution
for the ’unknown’ scenario. This implies that ITAGS is comparatively a more robust algorithm.
Figure 3.12 depicts the cost performance versus error. For each value of the error $\varepsilon$, we run several realizations and average the performance. We plot the 95% confidence intervals to understand the variation in cost values for a particular value of the error, and consequently gain insight into the robustness of the schemes. Figure 3.12a considers a single Gaussian elimination application with a deadline $L = 0.6$, and Figure 3.12b considers two randomly-generated applications with deadlines $L_1 = 0.6$ and $L_2 = 0.8$. We see that ITAGS exhibits better cost performance than the discretization heuristic in both the known and unknown cases. We also see that the confidence intervals for ITAGS are smaller on average than those of the discretization heuristic, which implies that ITAGS is more robust to variation in processing/communication times.
3.5 Summary
We study the scheduling of applications consisting of dependent tasks on heterogeneous pro-
cessors with communication delay and application completion deadlines. The proposed cost
minimization formulation is generic, allowing different cost structures and processor topologies.
To overcome the obstacles of task dependency and deadline constraint, we have developed the
ITAGS approach, where the scheduling of each task is assisted by an individual time allowance
obtained from a binary-relaxed version of the original optimization problem. Through trace-
driven and randomized simulations, we show that ITAGS substantially outperforms a wide
range of known algorithms. Furthermore, as the deadline constraint is relaxed, it converges to
optimality much faster than other alternatives.
Chapter 4
Multi-user Task Scheduling with Budget Constraints
In this chapter, we study task scheduling and offloading in a cloud computing system with
multiple users where tasks have different processing times, release times, communication times,
and weights. Each user may schedule a task locally or offload it to a shared cloud with het-
erogeneous processors by paying a price for the resource usage. Our work aims at identifying
a task scheduling decision that minimizes the weighted sum completion time of all tasks, while
satisfying the users’ budget constraints.
Our main contributions are summarized below:
• We first consider the problem where all tasks are available at time zero and communication
times are negligible. We formulate an interval-indexed ILP inspired by [25]. Using a
relaxed LP-solution, we obtain an integer solution that is shown to provide a constant-
factor approximation to the minimum weighted sum completion time. Even though this integer solution may violate the budget constraints, we make the interesting observation that the average budget violation decreases as the number of users grows.
• Based on the above observation, we propose an algorithm termed Single Task Unload for
Budget Resolution (STUBR). In addition to finding a relaxed and rounded LP-solution for
the above interval-indexed ILP, STUBR resolves budget violations. We prove performance
bounds for this budget-resolved solution. We then use a greedy task ordering scheme on
each processor to further reduce the weighted sum completion time. We also study the
computational complexity of STUBR.
• We then extend STUBR to more practical models (a) with task release times, (b) with
fixed communication times, and (c) with sequence-dependent communication times, i.e.,
considering a finite-capacity channel model where tasks must be sequenced and commu-
nicated. We obtain performance bounds for these cases as well.
• Our trace-driven simulation shows that STUBR performs consistently better than the existing alternatives. It exhibits maximum performance gains of more than 50% for both chess and compute-intensive applications [22] in comparison with the Greedy Weighted Shortest Processing Time (WSPT) scheme. Finally, our simulation results demonstrate that STUBR is highly scalable with respect to the number of users in the system.
The general problem of minimizing the weighted sum completion time on a single processor
has been well studied [65], but few works in the literature have considered the same objective
to schedule tasks in a multi-processor cloud environment. In [40], the authors proposed an Ant
Colony Optimization based algorithm to solve this NP-hard problem. The same objective was
also considered in [75] for scheduling coflows in data center networks and approximation algo-
rithms were proposed. In [25], the authors considered this objective for scheduling tasks with
release times on parallel processors, and proposed an 8-approximation algorithm. Our solu-
tion approach is inspired by [25]. However, our problem, in addition to considering a weighted
sum completion time objective and task release times, also accounts for multiple users, and
per-user budget constraints, which renders our problem more challenging than those addressed
in [25,40,65,75]. Additionally, in this work, we also address the proposed problem under more
generic task communication time models. The existing works that consider the sum completion time objective in multiprocessor environments, i.e., [25, 40, 65, 75], do not consider communication times and cannot be easily extended to accommodate a finite-capacity communication channel model as considered in this work.
The rest of the chapter is organized as follows. Section 4.1 describes the system model
and the problem formulation. In Section 4.2, we propose the STUBR algorithm, and present
performance guarantees. In Section 4.3, we extend this to the problem with release times, fixed
communication times, and with sequence-dependent communication times. Section 4.4 presents
the simulation results, and we summarize the chapter in Section 4.5.
4.1 System Model and Problem Formulation
In this section, we present details of the system model and problem formulation. Initially,
we consider the problem of scheduling immediately available tasks to heterogeneous processors
under user budget constraints. In Section 4.3, we extend this to the case where tasks have
release times, fixed communication times, and sequence-dependent communication times.
4.1.1 System Model
Processors and Tasks
We consider a system with N user/mobile devices. Each user i ∈ {1, . . . , N} wishes to complete
a set of independent tasks, denoted by Ji. Each user has its own unary local processor, i.e., it
can execute only one task at a time. This assumption is without loss of generality, as allowing
Figure 4.1: Example system of 3 users and 5 cloud processors.
Table 4.1: Chapter 4 Notations

Notation             Description
$t_j$                local processing time of task $j$
$c_j$                communication time of task $j$
$t^R_j$              release time of task $j$
$w_j$                weight of task $j$
$\alpha_{ir}$        speed-up achieved by user $i$'s tasks on processor $r$
$\beta_r$            cost per unit time to utilize processor $r$
$B_i$                budget of user $i$
$J_i$                set of tasks user $i$ wishes to execute
$R$                  set of all processors (local and cloud)
$C$                  set of cloud processors
$R_i$                set of cloud processors and user $i$'s local processor
$R'$                 set of machine-interval processors
$N$                  total number of users
$(\tau_{l-1}, \tau_l)$   time interval $l$
$L$                  number of intervals
multiple tasks to share a processor simultaneously will not provide any improvement to our
weighted sum completion time objective. The system includes a finite-capacity cloud consisting
of a number of heterogeneous processors that run at different speeds. Let C be the set of cloud
processors. Each processor at the cloud is assumed to be unary. Similar to the local processor
scenario, this unary-capacity assumption is also without loss of generality. An example of such
a system is illustrated in Figure 4.1.
The processing time for each task j ∈ Ji is tj on user i’s local processor. The speed-up
factor for each cloud processor r is αir ≥ 0, so that the processing time for task j at processor r
is αirtj . Each user can execute its tasks either locally or remotely at one of the cloud processors.
The processing times may be obtained by applying a program profiler as shown in experimental
studies such as MAUI [7], Clonecloud [31], and Thinkair [12]. In this work, we proceed assuming
that such information is already given. We also consider a weight $w_j$ associated with each task, to signify the relative urgency of certain tasks with respect to the others. For notational simplicity, we further define $R$ as the set of all processors (including all users' local processors) and $R_i$ as the set of processors to which user $i$ can offload its tasks, i.e., its own local processor and the cloud processors.
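As a minimal illustration of the processing-time model above (the helper name and values here are purely illustrative, not part of the thesis), a cloud processor scales a task's local time $t_j$ by the user-specific speed-up factor $\alpha_{ir}$:

```python
# Illustrative sketch of the processing-time model: p_jr = alpha_ir * t_j on
# a cloud processor, and t_j on the user's own local processor.

def processing_time(t_j, alpha_ir, on_cloud):
    """Return p_jr: cloud processors scale t_j by the user-specific
    speed-up factor alpha_ir; the local processor runs the task at t_j."""
    return alpha_ir * t_j if on_cloud else t_j

# Example: a task with local time 4.0 and cloud speed-up factor 0.5
print(processing_time(4.0, 0.5, on_cloud=True))   # -> 2.0
print(processing_time(4.0, 0.5, on_cloud=False))  # -> 4.0
```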
User Budget
The users are required to pay a certain price per unit time to use the processors at the cloud,
but no price to execute tasks locally on their own device. Let βr be the cost per unit time for
executing a task on processor r. Each user i has a budget Bi that determines the total expense
that the user is willing to incur for offloading tasks to the cloud.
Release Times and Communication Times
Each task j has a release time tRj , i.e., the time at which the task j becomes available at the local
processor. Furthermore, each task may require some input data that needs to be communicated
if the task is to be executed at the cloud. The time to transmit the input data for task j to the
cloud is given by cj . We consider two different communication models:
1. Fixed Communication Times: The input data for each task j can be transmitted to the cloud as soon as the task is available. Hence, the communication delay for task j is simply $c_j$. This allows us to model a communication link with a large number of channels.
2. Sequence-dependent Communication Times: The input data for each task cannot necessarily be transmitted as soon as the task is available. We assume that the data is transmitted to the scheduled processor one task at a time, i.e., the channel to a processor is unary. Hence, the overall communication delay for task j is the sum of the transmission times of itself and all tasks sequenced before it. This allows us to model a communication link with finite capacity.
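The difference between the two models can be sketched as follows (a toy sketch with hypothetical names; in the sequence-dependent case we assume a transmission order on the unary channel is already given):

```python
# Toy sketch of the two communication-delay models (illustrative names).

def fixed_delay(c, j):
    """Fixed model: task j's delay is just its own transmission time c[j]."""
    return c[j]

def sequence_dependent_delay(c, order, j):
    """Unary-channel model: task j waits for every task sequenced before it,
    so its delay is the sum of transmission times up to and including j."""
    k = order.index(j)
    return sum(c[t] for t in order[:k + 1])

c = {1: 2.0, 2: 1.0, 3: 3.0}   # transmission times c_j
order = [2, 1, 3]               # transmission sequence on the channel
print(fixed_delay(c, 1))                      # -> 2.0
print(sequence_dependent_delay(c, order, 1))  # -> 3.0 (c_2 + c_1)
```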
4.1.2 Problem Formulation
For clarity of presentation, we initially consider the case where all tasks are released and available at time zero, and the links between processors are fast enough that the communication delay between them is negligible. This leads to the problem formulation in this section and the corresponding STUBR algorithm in Section 4.2. We provide details on how they are extended to the cases with non-zero task release times, fixed communication times, and sequence-dependent communication times in Section 4.3.
We wish to identify the task scheduling decision that minimizes the weighted sum completion
time of all tasks subject to user budget constraints. We formulate the proposed problem by
using an interval-indexing method proposed in [25]. Towards this end, we divide the time axis
into intervals $(\tau_{l-1}, \tau_l)$, where $\tau_0 = 1$ and $\tau_l = 2^{l-1}$ for $l \in \{1, \dots, L\}$, and $L$ is the smallest integer such that

$$2^{L-1} \ge \sum_j t_j.$$

This means that $2^{L-1}$ is a sufficiently large time horizon for the scheduling of all given tasks, since it accounts for the worst-case completion time $\sum_j t_j$. The task scheduling decision determines the processor on which each task should be scheduled, as well as the order of the tasks. We define decision variables $\{x_{jrl}\}$, where $x_{jrl} = 1$ if and only if task $j$ finishes execution on processor $r$ in time interval $l \in \{1, \dots, L\}$. Such an approach reduces the number of variables in our formulation in comparison with a time-indexed formulation with constant-size intervals, making it computationally tractable, with a small penalty in the precision of quantifying the optimization objective. The optimization problem is defined below.
$$\min_{\{x_{jrl}\}} \sum_{i=1}^{N} \sum_{j \in J_i} w_j \sum_{r \in R_i} \sum_{l=1}^{L} \tau_{l-1} x_{jrl}, \tag{4.1}$$
s.t.
$$\sum_{l=1}^{L} \sum_{r \in R_i} x_{jrl} = 1, \quad \forall i \in \{1, \dots, N\},\ j \in J_i, \tag{4.2}$$
$$\sum_{i=1}^{N} \sum_{j \in J_i} \alpha_{ir} t_j x_{jrl} \le \tau_l, \quad \forall r \in R,\ l \in \{1, \dots, L\}, \tag{4.3}$$
$$\sum_{j \in J_i} \sum_{l=1}^{L} \sum_{r \in R_i} \beta_r \alpha_{ir} t_j x_{jrl} \le B_i, \quad \forall i \in \{1, \dots, N\}, \tag{4.4}$$
$$x_{jrl} = 0, \ \text{if } \tau_l < \alpha_{ir} t_j, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in C,\ l \in \{1, \dots, L\}, \tag{4.5}$$
$$x_{jrl} = 0, \ \text{if } \tau_l < t_j, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \notin C,\ l \in \{1, \dots, L\}, \tag{4.6}$$
$$x_{jrl} = 0, \ \text{if } B_i < \beta_r \alpha_{ir} t_j, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}, \tag{4.7}$$
$$x_{jrl} \in \{0, 1\}, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}. \tag{4.8}$$
The objective (4.1) is to minimize the weighted sum completion times of tasks across all users.
Constraint (4.2) ensures that every task is assigned to exactly one processor and one interval.
Constraint (4.3) enforces that for each interval l, the total load on every processor r cannot
exceed $\tau_l$. Constraint (4.4) enforces the budget constraint for each user. Constraints (4.5)-(4.7) ensure that individual tasks do not exceed the interval deadline $\tau_l$ or the budget. Constraint (4.8) forces the decision variables to take binary values.
Remark 1. One may note that $\tau_{l-1}$ is a lower bound on the completion time of a task completing in interval $l$, and consequently, (4.1) is a lower bound on the weighted sum completion time.
In Sections 4.2 and 4.3, we present algorithms that provide worst-case performance guarantees
in terms of constant factors above this lower-bound objective. Hence, the same algorithms also
have at least the same worst-case performance guarantees with respect to the optimal weighted
sum completion time.
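The interval-indexed time axis above can be sketched as follows (a small sketch under the stated convention $\tau_0 = 1$, $\tau_l = 2^{l-1}$; the helper name is illustrative):

```python
# Sketch of the geometric interval structure: tau_0 = 1, tau_l = 2**(l-1),
# and L is the smallest integer with 2**(L-1) >= sum_j t_j.
import math

def interval_boundaries(local_times):
    """Return (L, [tau_0, ..., tau_L]) for the given local processing times."""
    horizon = sum(local_times)
    # smallest L with 2**(L-1) >= horizon
    L = max(1, math.ceil(math.log2(horizon)) + 1)
    taus = [1] + [2 ** (l - 1) for l in range(1, L + 1)]
    return L, taus

# sum of times = 10, so L = 5 and the horizon 2**(L-1) = 16 covers it
print(interval_boundaries([3.0, 4.0, 3.0]))  # -> (5, [1, 1, 2, 4, 8, 16])
```

Note how $L$ grows only logarithmically with the total processing time, which is the source of the formulation's tractability.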
4.2 The STUBR algorithm
In this section, we present the STUBR algorithm to solve problem (4.1). We then prove
some guarantees and properties of this algorithm, to better understand its functionality and
performance.
STUBR has the following steps:
1. Relax the integer constraints in problem (4.1) and obtain a relaxed solution.
2. Round this solution to obtain an integer solution whose objective value is no higher than 8 times the objective value achieved by the relaxed solution, and thus also no higher than 8 times the optimal objective value of problem (4.1). While this rounded solution may violate the budgets, we prove that the average cost over a large number of users meets the average user budget.
3. We resolve any budget violation by strategically moving some tasks to the local device.
4. To further reduce the total weighted completion time, we note that the well-known WSPT
is optimal for a single processor and jobs without release times. Hence, on each processor,
we reorder the tasks allocated to it by WSPT ordering.
These steps are explained in detail in the following sections.
4.2.1 Relaxed Solution
For each user $i \in \{1, \dots, N\}$, $j \in J_i$, and $r \in R_i$, let $p_{jr}$ and $b_{jr}$ be the processing time and cost for scheduling task $j$ on processor $r$. For our initial plain model with no release times and
communication times, we have

$$p_{jr} := \begin{cases} \alpha_{ir} t_j & \text{if } r \in C, \\ t_j & \text{otherwise}, \end{cases} \tag{4.9}$$

$$b_{jr} := \begin{cases} \beta_r \alpha_{ir} t_j & \text{if } r \in C, \\ 0 & \text{otherwise}. \end{cases} \tag{4.10}$$
Using (4.9) and (4.10), we reformulate the optimization problem in Section 4.1.2, and relax the
integer constraints to obtain the following linear program.
$$\min_{\{x_{jrl}\}} \sum_{i=1}^{N} \sum_{j \in J_i} w_j \sum_{r \in R_i} \sum_{l=1}^{L} \tau_{l-1} x_{jrl}, \tag{4.11}$$
s.t.
$$\sum_{l=1}^{L} \sum_{r \in R_i} x_{jrl} = 1, \quad \forall i \in \{1, \dots, N\},\ j \in J_i, \tag{4.12}$$
$$\sum_{i=1}^{N} \sum_{j \in J_i} p_{jr} x_{jrl} \le \tau_l, \quad \forall r \in R,\ l \in \{1, \dots, L\}, \tag{4.13}$$
$$\sum_{j \in J_i} \sum_{r \in R_i} \sum_{l=1}^{L} b_{jr} x_{jrl} \le B_i, \quad \forall i \in \{1, \dots, N\}, \tag{4.14}$$
$$x_{jrl} = 0, \ \text{if } \tau_l < p_{jr}, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}, \tag{4.15}$$
$$x_{jrl} = 0, \ \text{if } B_i < b_{jr}, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}, \tag{4.16}$$
$$x_{jrl} \ge 0, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}. \tag{4.17}$$
The above linear program can be solved efficiently in polynomial time to obtain a relaxed solution to problem (4.1). This formulation also resembles the LP-relaxed version of the problem of minimizing the weighted sum completion time on a system of unrelated1 machines, i.e., $R\,||\sum w_j C_j$ in the standard parallel-processor-scheduling notation, as formulated in [25]. However, our formulation has additional budget constraints, (4.14) and (4.16), that must be met for each user, and it accommodates multiple users, unlike the formulation in [25]. These aspects make our formulation more complex, requiring more sophisticated techniques for recovering an integer solution and resolving budget overage.
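The LP relaxation (4.11)-(4.17) can be sketched on a toy two-user instance as follows (the instance and all names are illustrative, and we assume `scipy` is available; the code simply enumerates the feasible variables $x_{jrl}$ and hands the LP to `linprog`):

```python
# Toy sketch of the LP relaxation (4.11)-(4.17) with scipy.optimize.linprog.
# Two users, one task each; user j owns task j. All values are illustrative.
import numpy as np
from scipy.optimize import linprog

tasks = [0, 1]
w = [1.0, 2.0]                 # weights w_j
t = [2.0, 3.0]                 # local processing times t_j
procs = ["local0", "local1", "cloud"]
alpha = {0: 0.5, 1: 0.5}       # cloud speed-up alpha_ir per user
beta = 1.0                     # cloud price per unit time beta_r
B = [10.0, 0.5]                # per-user budgets B_i
taus = [1, 1, 2, 4, 8]         # tau_0..tau_L with L = 4

def p(j, r):                   # processing time p_jr, cf. eq. (4.9)
    return alpha[j] * t[j] if r == "cloud" else t[j]

def b(j, r):                   # cost b_jr, cf. eq. (4.10)
    return beta * alpha[j] * t[j] if r == "cloud" else 0.0

allowed = {0: ["local0", "cloud"], 1: ["local1", "cloud"]}  # R_i
idx = {}                       # variable index for each feasible x_jrl
for j in tasks:
    for r in allowed[j]:
        for l in range(1, len(taus)):
            if taus[l] >= p(j, r) and B[j] >= b(j, r):      # (4.15)-(4.16)
                idx[(j, r, l)] = len(idx)

c = np.zeros(len(idx))         # objective (4.11): w_j * tau_{l-1}
for (j, r, l), k in idx.items():
    c[k] = w[j] * taus[l - 1]

A_eq, b_eq = [], []
for j in tasks:                # assignment constraint (4.12)
    row = np.zeros(len(idx))
    for key, k in idx.items():
        if key[0] == j:
            row[k] = 1.0
    A_eq.append(row); b_eq.append(1.0)

A_ub, b_ub = [], []
for r in procs:                # capacity constraint (4.13)
    for l in range(1, len(taus)):
        row = np.zeros(len(idx))
        for (j, rr, ll), k in idx.items():
            if rr == r and ll == l:
                row[k] = p(j, r)
        A_ub.append(row); b_ub.append(taus[l])
for i in tasks:                # budget constraint (4.14); user i owns task i
    row = np.zeros(len(idx))
    for (j, r, l), k in idx.items():
        if j == i:
            row[k] = b(j, r)
    A_ub.append(row); b_ub.append(B[i])

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
print(round(res.fun, 3))       # -> 5.0
```

Here user 1's tight budget excludes the cloud via (4.16), forcing its task local, while user 0's task can finish in the first interval.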
4.2.2 Rounded Solution
In [25], the authors used a method proposed in [66] for solving the generalized assignment
problem, and obtained an integer solution for their problem at hand. However, our formulation
1In the model of unrelated machines, the processing times of a task on any two machines are independent.
renders a further constrained version of the generalized assignment problem due to the budget
constraints. We therefore extend the method proposed in [66] to obtain a rounded solution to
problem (4.1). We study the behavior of this solution and later employ techniques to improve
this solution for our problem. We also provide worst-case performance and incurred cost guar-
antees. Additionally, we study the behavior of the average incurred cost as the number of users
increases in the system.
Rounding Technique
We first convert the LP solution $x_{jrl}$ to $x_{jr'}$, where each machine-interval pair $(r, l)$ is viewed as a single virtual processor $r' \in R'$, such that $R'$ is the set of machine-interval processors. This facilitates the application of the rounding method proposed in [66] to our problem.

The rounding technique lists the tasks in non-increasing order of $p_{jr'}$, for $r' \in R'$, and constructs a bipartite fractional matching. A fractional matching between task nodes and machine nodes assigns each task node partially to multiple machine nodes, and all allocated fractions for a particular task node must sum to 1. Let $f(v_{r's}, u_j)$ denote the fractional matching between task nodes $u_j$, for $j \in J_i$, $i \in \{1, \dots, N\}$, and machine nodes $v_{r's}$, for $r' \in R'$, $s \in \{1, \dots, k_{r'}\}$, where $k_{r'} = \lceil \sum_j x_{jr'} \rceil$. This is constructed in accordance with the following:

$$x_{jr'} = \sum_{s : (v_{r's}, u_j) \in E} f(v_{r's}, u_j), \quad \forall r' \in R',\ j \in J_i,\ i \in \{1, \dots, N\}, \tag{4.18}$$

$$\sum_{j : (v_{r's}, u_j) \in E} f(v_{r's}, u_j) = 1, \quad \forall r' \in R',\ s \in \{1, \dots, k_{r'} - 1\}, \tag{4.19}$$
where E is the set of edges of the bipartite graph. This fractional matching is then converted
to a minimum cost integer matching where each task is assigned to a single machine node.
For our problem, the matching cost is the contribution to the weighted sum completion time, so this yields a minimum weighted sum completion time integer matching. We call this integer solution x = {x_jrl}. This integer matching solution, however, is likely to violate the interval deadline constraints ($\tau_l$) as well as the user budget constraints, since the relaxed solution that meets these constraints has been rounded. We now analyze the extent to which these constraints can be violated, and the resulting performance guarantee.
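The final matching step can be illustrated with a toy stand-in (brute force over permutations in place of a cubic-time Hungarian implementation, purely to show what the matching computes; the cost matrix is hypothetical):

```python
# Toy stand-in for the minimum-cost integer matching between task nodes and
# machine nodes v_{r's}: brute force over permutations (stdlib only). Entries
# are illustrative objective contributions of placing each task on each node.
from itertools import permutations

cost = [
    [1.0, 2.0, 4.0],   # task 0 on machine nodes 0, 1, 2
    [2.0, 4.0, 8.0],   # task 1
    [1.5, 3.0, 6.0],   # task 2
]

def min_cost_matching(cost):
    """Assign each task to a distinct machine node, minimizing total cost.
    (The Hungarian algorithm does this in cubic time; brute force suffices
    for a toy instance.)"""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[j][perm[j]] for j in range(n)))
    return list(best), sum(cost[j][best[j]] for j in range(n))

assignment, total = min_cost_matching(cost)
print(assignment, total)   # -> [2, 0, 1] 9.0
```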
Interval Deadline Violation and Performance Guarantee
Lemma 1. With the rounded solution, the total processing time of all tasks on every machine-interval processor $r' = (r, l) \in R'$, for $l \in \{1, \dots, L\}$, is at most $2\tau_l$, i.e., constraint (4.13) is violated by at most $\tau_l$.
Proof. For each machine node $v_{r's}$, let the maximum possible processing time be

$$p^{\max}_{r's} = \max_{j : (v_{r's}, u_j) \in E} p_{jr'}, \tag{4.20}$$

and the minimum possible processing time be

$$p^{\min}_{r's} = \min_{j : (v_{r's}, u_j) \in E} p_{jr'}. \tag{4.21}$$

Consequently, $p^{\min}_{r's} \ge p^{\max}_{r'(s+1)}$, since tasks are allocated in non-increasing order of $p_{jr'}$ while constructing the fractional bipartite matching. Along the lines of the proof in [66], we have, for each $r' \in R'$,

$$\sum_{s=2}^{k_{r'}} p^{\max}_{r's} \le \sum_{s=1}^{k_{r'}-1} p^{\min}_{r's} \tag{4.22}$$
$$\le \sum_{s=1}^{k_{r'}-1} \sum_{j : (v_{r's}, u_j) \in E} p_{jr'} f(v_{r's}, u_j) \tag{4.23}$$
$$\le \sum_{s=1}^{k_{r'}} \sum_{j : (v_{r's}, u_j) \in E} p_{jr'} f(v_{r's}, u_j) \tag{4.24}$$
$$= \sum_{i=1}^{N} \sum_{j \in J_i} p_{jr'} x_{jr'} \le \tau_l. \tag{4.25}$$

Furthermore, $p^{\max}_{r'1} \le \tau_l$ for all $r'$, from (4.15). Hence, we have

$$\sum_{i=1}^{N} \sum_{j \in J_i} p_{jr} x_{jrl} = \sum_{i=1}^{N} \sum_{j \in J_i} p_{jr'} x_{jr'} \tag{4.26}$$
$$\le \sum_{s=1}^{k_{r'}} p^{\max}_{r's} \le 2\tau_l, \quad \forall r \in R,\ l \in \{1, \dots, L\}. \tag{4.27}$$
We derive an approximation ratio for the integer matching x which is presented in the
following theorem.
Theorem 1. The objective value of the rounded solution obtained from the integer matching x
cannot be worse than 8 times the optimal objective of problem (4.1).
Proof. We define new intervals $\bar\tau_l := 2^{l+1}$, for all $l \in \{1, \dots, L\}$. From Lemma 1, we can see that every task $j$ that was scheduled in the $l$th interval will be completed by time $\bar\tau_l$. This is because $\bar\tau_l - \bar\tau_{l-1} = 2\tau_l$, and from Lemma 1, we know that the total processing time for the tasks assigned to the $l$th interval does not exceed $2\tau_l$. Let the contribution to the objective by task $j$ be $O^{\text{relax}}_j$ in the relaxed solution and $O^{\text{round}}_j$ in the rounded solution. If task $j$ is scheduled to complete in interval $l$, we have

$$O^{\text{relax}}_j = w_j \tau_{l-1} = w_j 2^{l-2}. \tag{4.28, 4.29}$$

Similarly, for the rounded solution, we have

$$O^{\text{round}}_j \le w_j \bar\tau_l \le w_j 2^{l+1} = w_j 2^{l-2} \cdot 2^3 \le 8\, O^{\text{relax}}_j. \tag{4.30-4.33}$$

This implies that

$$\sum_{i=1}^{N} \sum_{j \in J_i} O^{\text{round}}_j \le \sum_{i=1}^{N} \sum_{j \in J_i} 8\, O^{\text{relax}}_j. \tag{4.34}$$
We see that the rounded objective value is at most 8 times the relaxed solution, and hence,
at most 8 times the optimal objective of problem (4.1) since the relaxed solution by definition
returns an objective value that is below the optimal objective.
Multiple Users and Incurred Cost Guarantees
Since our problem also accounts for multiple users and budget constraints, we wish to evaluate
the performance of this rounded solution with respect to these parameters.
Theorem 2. With the rounded solution, the sum of the incurred cost of all users cannot be
worse than (|R′|+ 1) times the sum of user budgets.
Proof. Let $b^{\max}_{r's}$ and $b^{\min}_{r's}$ be the maximum and minimum possible costs at machine node $(r', s)$, respectively. For each processor $r'$, we have $p^{\min}_{r's} \ge p^{\max}_{r'(s+1)}$, as explained in Lemma 1. Consequently, we have $b^{\min}_{r's} \ge b^{\max}_{r'(s+1)}$ for our model from (4.9) and (4.10). Then, we have

$$\sum_{i=1}^{N} \sum_{r' \in R'} \sum_{s=2}^{k_{r'}} b^{\max}_{r's} \le \sum_{i=1}^{N} \sum_{r' \in R'} \sum_{s=1}^{k_{r'}-1} b^{\min}_{r's} \tag{4.35}$$
$$\le \sum_{i=1}^{N} \sum_{r' \in R'} \sum_{s=1}^{k_{r'}-1} \sum_{j : (v_{r's}, u_j) \in E} b_{jr'} f(v_{r's}, u_j) \tag{4.36}$$
$$\le \sum_{i=1}^{N} \sum_{r' \in R'} \sum_{s=1}^{k_{r'}} \sum_{j : (v_{r's}, u_j) \in E} b_{jr'} f(v_{r's}, u_j) \tag{4.37}$$
$$= \sum_{i=1}^{N} \sum_{r' \in R'} \sum_{j \in J_i} b_{jr'} x_{jr'} \le \sum_{i=1}^{N} B_i. \tag{4.38}$$
Hence, if we take out the tasks allocated to machine nodes vr′1 for every r′ ∈ R′, the remaining
tasks have a sum cost that is less than the sum of user budgets. There are at most |R′| such
tasks. Furthermore, for each r′ ∈ R′ and j such that (vr′1,uj) ∈ E , we know from (4.16) that
$$b_{jr'} \le b^{\max}_{r'1} \le \sum_{i=1}^{N} B_i. \tag{4.39}$$
Hence, we have

$$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{r \in R} \sum_{l=1}^{L} b_{jr} x_{jrl} = \sum_{i=1}^{N} \sum_{j \in J_i} \sum_{r' \in R'} b_{jr'} x_{jr'} \tag{4.40}$$
$$\le \sum_{i=1}^{N} \sum_{r' \in R'} \sum_{s=1}^{k_{r'}} b^{\max}_{r's} \le (|R'| + 1) \sum_{i=1}^{N} B_i. \tag{4.41}$$
Remark 2. We see from the above that at most one task on each machine-interval processor violates the sum of the user budgets. Consequently, we can find a subset of at most $|R'|$ tasks that violate the sum of the user budgets.
The following conclusions follow directly from Theorem 2.
Corollary 1. If $b_{jr}$ is independent of task $j$, let $b_r = b_{jr}$. We further define $S = \{r \in R' : \exists j,\ x_{jr} = 1\}$. Then, we have

$$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{r \in R} \sum_{l=1}^{L} b_r x_{jrl} \le \sum_{i=1}^{N} B_i + \sum_{r \in S} b_r. \tag{4.42}$$
Corollary 2. If $C_i$ is the incurred cost for user $i$, then

$$\frac{1}{N} \sum_{i=1}^{N} C_i \le \frac{1}{N} \sum_{i=1}^{N} B_i + \frac{1}{N} |R'| B^{\max}, \tag{4.43}$$

where $B^{\max} = \max_i B_i$, and for the specific case from Corollary 1,

$$\frac{1}{N} \sum_{i=1}^{N} C_i \le \frac{1}{N} \sum_{i=1}^{N} B_i + \frac{1}{N} \sum_{r \in S} b_r. \tag{4.44}$$
If we increase the number of users N , the total processing time increases, and consequently,
the number of intervals L increases. But we note that since the interval size increases expo-
nentially, the number of intervals L only increases logarithmically. Additionally, the number
of processors |R| is fixed. This implies that |R′| = L|R| increases more slowly in comparison
with N . Hence, we can see that as N → ∞, the second term on the right-hand side of (4.43)
approaches zero, leading to Corollary 3.
Corollary 3. As N →∞, the average cost incurred across all users meets the average budget.
Thus, the average user cost performance improves as the number of users in the system
increases. This property indicates that the proposed algorithm is highly scalable and is a
suitable choice for multi-user systems.
4.2.3 Dealing with Budget Violation
Even if the budget constraints are met on average, the budget constraints for each individual
user could still be violated. In cases where the users expect strict budget constraints, we need to
identify a technique by which this rounded solution can be modified to ensure that each user’s
budget is met, while not significantly affecting the weighted sum completion time. Since there
is no budget constraint on executing tasks on a user’s local device, we propose the following
technique to move certain tasks to the local device in the event of a budget violation:
1. Check whether the budget is violated for user i.
2. If so, sort all of its offloaded tasks, $\{j \in J_i \mid x_{jrl} = 1 \text{ for some } r \in C\}$, in non-decreasing order of $w_j t_j$. We do this because we expect a task with a smaller weight and a smaller local processing time to do less damage to the weighted sum completion time objective when transferred to the local device.
3. Start with the first task (with the least $w_j t_j$) and schedule it on the local device. Update the incurred cost of user i by subtracting the previously incurred cost of this task.
4. If the incurred cost now meets the budget, stop. If not, repeat Step 3 with the next task in the sorted order until the budget is met.
5. Repeat for all users.
6. Once every user meets its budget, we apply our modified WSPT (presented in Section 4.2.4) on all processors.
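The repair loop above can be sketched for one user as follows (an illustrative structure with hypothetical names, not the thesis implementation):

```python
# Sketch of the budget-resolution step for a single user: repeatedly move the
# offloaded task with the smallest w_j * t_j back to the local device until
# the incurred cloud cost meets the budget.

def resolve_budget(offloaded, budget):
    """offloaded: list of (w_j, t_j, cloud_cost_j) for one user's cloud tasks.
    Returns (tasks kept on the cloud, tasks moved to the local device)."""
    kept = sorted(offloaded, key=lambda job: job[0] * job[1])  # by w_j * t_j
    moved = []
    cost = sum(job[2] for job in kept)
    while cost > budget and kept:
        job = kept.pop(0)      # least w_j * t_j does the least damage
        moved.append(job)
        cost -= job[2]         # refund this task's cloud cost
    return kept, moved

kept, moved = resolve_budget([(1.0, 2.0, 3.0), (2.0, 3.0, 4.0)], budget=4.0)
print(len(moved))              # -> 1: moving one task restores the budget
```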
Now we wish to understand the impact on performance of moving tasks to the local device in order to meet the budget.
Theorem 3. The objective value of the final solution is at most $2^{\lceil \log_2(2 + \frac{1}{a}) \rceil + 2}$ times the optimal solution, where $a = \min_{i,r} \alpha_{ir}$ is the minimum speed-up factor in the system.
Proof. We know, from Remark 2, that for every interval $l$, at most one task from every cloud processor needs to be moved to the local device, and this task has a maximum processing time of $\tau_l$. We also know, from Lemma 1, that the total processing time on a local processor for the tasks assigned to the $l$th interval does not exceed $2\tau_l$. Furthermore, from (4.15), the processing time of a task scheduled to finish in interval $l$ cannot exceed $\tau_l$. Thus, after moving a task belonging to user $i$ from cloud processor $r$ to user $i$'s local processor, the total processing time on the local processor will be $(2 + \frac{1}{\alpha_{ir}})\tau_l$ in the worst case. Since this task that we move back may belong to any user, this value will be at most $(2 + \frac{1}{a})\tau_l$, where $a = \min_{i,r} \alpha_{ir}$ is the minimum speed-up factor. In other words, we now have

$$\sum_{i=1}^{N} \sum_{j \in J_i} p_{jr} x_{jrl} \le \left(2 + \frac{1}{a}\right)\tau_l, \quad \forall r \in R,\ l \in \{1, \dots, L\}. \tag{4.45}$$

We need to redefine the $\bar\tau_l$ defined in Theorem 1 such that every task that is assigned to the $l$th interval may be run entirely within the interval $(\bar\tau_{l-1}, \bar\tau_l)$. In other words, we need

$$\left(2 + \frac{1}{a}\right)\tau_l \le \bar\tau_l - \bar\tau_{l-1}. \tag{4.46}$$

Towards this end, we set $x = \lceil \log_2(2 + \frac{1}{a}) \rceil + 1$ and $\bar\tau_l = 2^x \tau_l = 2^x 2^{l-1} = 2^{x+l-1}$.

We now get, for every task $j$,

$$\frac{O^{\text{round}}_j}{O^{\text{relax}}_j} \le \frac{\bar\tau_l}{\tau_{l-1}} \le 2^{x+1} \le 2^{\lceil \log_2(2 + \frac{1}{a}) \rceil + 2}. \tag{4.47}$$

Thus, the objective value of the final solution is at most $2^{\lceil \log_2(2 + \frac{1}{a}) \rceil + 2}$ times the optimal solution.
4.2.4 WSPT Ordering
From the above, we obtain a scheduling decision for every task that specifies on which processor
the task should be executed. Some processors will be assigned multiple tasks. We know that
the WSPT ordering is optimal for the weighted sum completion time objective for a single
processor and jobs without release times [76]. Thus, we perform a WSPT ordering on the tasks
allocated to a particular processor to further improve our objective value as follows:
1. Step 1: Obtain the task scheduling decision, i.e., the processor on which each task should be scheduled.
2. Step 2: On each processor $r \in R$, order the scheduled tasks in non-decreasing order of $p_{jr}/w_j$. This ensures that tasks with smaller weights and longer processing times (without accounting for wait times) are scheduled later.
3. Step 3: Modify the task completion times correspondingly, and obtain the new objective value.
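The WSPT reordering on one processor can be sketched as follows (an illustrative helper; jobs are $(p_{jr}, w_j)$ pairs):

```python
# Sketch of WSPT on a single processor: sort by p_jr / w_j, then recompute
# completion times and the weighted sum completion time.

def wspt_order(jobs):
    """jobs: list of (p_jr, w_j). Returns (ordered jobs, weighted sum of
    completion times) under the WSPT ordering."""
    ordered = sorted(jobs, key=lambda job: job[0] / job[1])
    finish, total = 0.0, 0.0
    for p, w in ordered:
        finish += p            # completion time of this job
        total += w * finish    # its weighted contribution
    return ordered, total

# The heavy, short job (1.0, 2.0) runs first; total drops from 11.0 to 6.0
ordered, total = wspt_order([(3.0, 1.0), (1.0, 2.0)])
print(ordered, total)          # -> [(1.0, 2.0), (3.0, 1.0)] 6.0
```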
4.2.5 Feasibility and Complexity Analysis
It can be readily noted that the STUBR algorithm provides a feasible solution. In other words,
the user budgets are always met, and all the tasks are always scheduled. Thus, in the worst
case with extremely tight budgets, the algorithm will execute all tasks locally.
The time complexity of STUBR is dominated by the LP-solving step (in Section 4.2.1) and the rounding step (in Section 4.2.2), which involves finding the weighted sum completion time bipartite matching. An LP can be solved in $O(n^{3.5})$ time, where $n$ is the number of variables [77]. For our problem, this implies that the time complexity for solving the LP is $O((P|R|L)^{3.5})$, where $P = \sum_{i=1}^{N} |J_i|$ is the total number of tasks. On the other hand, bipartite matching can be solved in time cubic in the number of vertices by utilizing the Hungarian algorithm, proposed in [78]. If $P > |R|$, the time complexity of this step is $O(P^3)$. Thus, we see that the overall worst-case time complexity of STUBR is $O((P^2 L)^{3.5})$.
4.3 STUBR Extensions
In this section, we consider the models with release times, fixed communication times, and
sequence-dependent communication times, introduced in Section 4.1.1.
4.3.1 With Task Release Times
STUBR can also be applied to solve the problem of scheduling tasks with release times. We
reformulate problem (4.11) to incorporate release times tRj for every task j as follows.
$$\min_{\{x_{jrl}\}} \sum_{i=1}^{N} \sum_{j \in J_i} w_j \sum_{r \in R_i} \sum_{l=1}^{L} \tau_{l-1} x_{jrl}, \tag{4.48}$$
s.t.
$$\sum_{l=1}^{L} \sum_{r \in R_i} x_{jrl} = 1, \quad \forall i \in \{1, \dots, N\},\ j \in J_i, \tag{4.49}$$
$$\sum_{i=1}^{N} \sum_{j \in J_i} p_{jr} x_{jrl} \le \tau_l, \quad \forall r \in R,\ l \in \{1, \dots, L\}, \tag{4.50}$$
$$\sum_{j \in J_i} \sum_{r \in R_i} \sum_{l=1}^{L} b_{jr} x_{jrl} \le B_i, \quad \forall i \in \{1, \dots, N\}, \tag{4.51}$$
$$x_{jrl} = 0, \ \text{if } \tau_l < t^R_j + p_{jr}, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}, \tag{4.52}$$
$$x_{jrl} = 0, \ \text{if } B_i < b_{jr}, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}, \tag{4.53}$$
$$x_{jrl} \ge 0, \quad \forall i \in \{1, \dots, N\},\ j \in J_i,\ r \in R,\ l \in \{1, \dots, L\}. \tag{4.54}$$
On applying the same rounding method proposed in Section 4.2.2, we can easily see that
Lemma 1 is satisfied for this case as well. Additionally, we can also extend the results in
Theorem 1 to prove the following.
Theorem 4. The objective value of the rounded solution obtained from the integer matching x
cannot be worse than 8 times the optimal objective of problem (4.48).
Proof. Every task that is assigned to the $l$th interval may be run entirely within the interval $(\bar\tau_{l-1}, \bar\tau_l)$, where $\bar\tau_l := 2^{l+1}$. This is because $\bar\tau_l - \bar\tau_{l-1} = 2\tau_l$, and from Lemma 1, we know that the total processing time for the tasks assigned to the $l$th interval does not exceed $2\tau_l$. Additionally, every task $j$ that is assigned to the $l$th interval will have been released by $\bar\tau_{l-1}$, because $\bar\tau_{l-1} > \tau_l \ge t^R_j + p_{jr} \ge t^R_j$. Thus, similar to Theorem 1, we see that the rounded objective value is at most 8 times that of the relaxed solution, and hence, at most 8 times the optimal objective of problem (4.48), since the relaxed solution by definition returns an objective value that is below the optimal objective.
We can also see that Theorem 2 and the corresponding corollaries are satisfied for this case. We apply the budget resolution technique proposed in Section 4.2.3 and can easily see that Theorem 3 can be proved for this case as well. Additionally, we apply a modified WSPT, similar to that in Section 4.2.4, which we call m-WSPT ordering, to accommodate task release times. We do this by scheduling tasks in the non-decreasing order of $\frac{t_j^R + p_{jr}}{w_j}$ in Step 2.
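As a concrete illustration, the m-WSPT ordering rule can be sketched in a few lines (Python here for brevity; the field names are hypothetical, not from the thesis):

```python
# Hedged sketch of the m-WSPT ordering rule: tasks are sorted in
# non-decreasing order of (release time + processing time) / weight.
def m_wspt_order(tasks):
    """tasks: list of dicts with keys 'release', 'proc', 'weight'."""
    return sorted(tasks, key=lambda t: (t["release"] + t["proc"]) / t["weight"])

tasks = [
    {"id": "a", "release": 0.0, "proc": 4.0, "weight": 2.0},  # ratio 2.0
    {"id": "b", "release": 1.0, "proc": 1.0, "weight": 2.0},  # ratio 1.0
    {"id": "c", "release": 0.0, "proc": 3.0, "weight": 1.0},  # ratio 3.0
]
order = [t["id"] for t in m_wspt_order(tasks)]
print(order)  # ['b', 'a', 'c']
```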
4.3.2 With Fixed Communication Times
We can further extend the solution in Section 4.3.1 to the case where every task $j$ has release time $t_j^R$ and communication time $c_j$. This is equivalent to defining a new per-processor release time $t_{jr}^R$, the release time for scheduling task $j$ on processor $r$, as follows:

$t_{jr}^R := \begin{cases} t_j^R + c_j & \text{if } r \in C, \\ t_j^R & \text{otherwise}. \end{cases}$ (4.55)
Thus, the new version of problem (4.11) becomes

$\min_{\{x_{jrl}\}} \ \sum_{i=1}^{N} \sum_{j \in J_i} w_j \sum_{r \in R_i} \sum_{l=1}^{L} \tau_{l-1}\, x_{jrl},$ (4.56)
s.t. $\sum_{l=1}^{L} \sum_{r \in R_i} x_{jrl} = 1, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,$ (4.57)
$\sum_{i=1}^{N} \sum_{j \in J_i} p_{jr} x_{jrl} \le \tau_l, \quad \forall r \in R,\ l \in \{1,\dots,L\},$ (4.58)
$\sum_{j \in J_i} \sum_{r \in R_i} \sum_{l=1}^{L} b_{jr} x_{jrl} \le B_i, \quad \forall i \in \{1,\dots,N\},$ (4.59)
$x_{jrl} = 0, \ \text{if } \tau_l < t_{jr}^R + p_{jr}, \quad \forall i \in \{1,\dots,N\},\ r \in R,\ l \in \{1,\dots,L\},$ (4.60)
$x_{jrl} = 0, \ \text{if } B_i < b_{jr}, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \in R,\ l \in \{1,\dots,L\},$ (4.61)
$x_{jrl} \ge 0, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \in R,\ l \in \{1,\dots,L\}.$ (4.62)
We can see that the fixed communication times are incorporated in constraint (4.60). On
applying the same rounding method proposed in Section 4.2.2, we can easily see that Lemma
1 is satisfied for this case as well. Additionally, we can also extend the results in Theorem 4 to
prove the following.
Theorem 5. The objective value of the rounded solution obtained from the integer matching x
cannot be worse than 8 times the optimal objective of problem (4.56).
Proof. The proof is similar to the proof of Theorem 4, except that we note that every task $j$ that is assigned to the $l$th interval, i.e., $(\bar{\tau}_{l-1}, \bar{\tau}_l)$, will have been released by $\bar{\tau}_{l-1}$ because $\bar{\tau}_{l-1} = 2\tau_l > \tau_l \ge t_{jr}^R + p_{jr} > t_{jr}^R$.
We see that Theorem 2, the corresponding corollaries, as well as Theorem 3 can be proved for this case as well. Additionally, we can accommodate both task release times and communication times by scheduling tasks in the non-decreasing order of $\frac{t_{jr}^R + p_{jr}}{w_j}$ in Step 2 of m-WSPT proposed in Section 4.2.4.
4.3.3 With Sequence-dependent Communication Times
Modified Channel Model and Problem Formulation
In a more practical model of a communication channel with finite channel capacity, the input
data is communicated to the scheduled processor one task at a time. To extend STUBR to this
more complicated scenario, we introduce the following new decision variables:
$x_{jrpl} := \begin{cases} 1 & \text{if task } j \text{ is communicated in interval } p \text{ to processor } r \text{ and executed in interval } l, \\ 0 & \text{otherwise}. \end{cases}$ (4.63)
We define the communication time for a task $j$ on a processor $r$ as

$c_{jr} := \begin{cases} c_j & \text{if } r \in C, \\ 0 & \text{otherwise}. \end{cases}$ (4.64)
It should be noted that, under this channel model, the release time of a task $j$ at the local device is $t_j^R$, but the release time of the task at a cloud processor is now determined by the sequence in which tasks are communicated to this processor.
Relaxed Solution
Incorporating (4.63), (4.64), and the consideration of sequence-dependent communication times
into the first step of STUBR, we first solve the following optimization problem to obtain an
LP-relaxed solution.
$\min_{\{x_{jrpl}\}} \ \sum_{i=1}^{N} \sum_{j \in J_i} w_j \sum_{r \in R_i} \sum_{l=1}^{L} \tau_{l-1} \sum_{p=1}^{L} x_{jrpl},$ (4.65)
s.t. $\sum_{l=1}^{L} \sum_{r \in R_i} \sum_{p=1}^{l} x_{jrpl} = 1, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,$ (4.66)
$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{p=1}^{l} p_{jr} x_{jrpl} \le \tau_l, \quad \forall r \in R,\ l \in \{1,\dots,L\},$ (4.67)
$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{l=p}^{L} c_j x_{jrpl} \le \tau_p, \quad \forall r \in C,\ p \in \{1,\dots,L\},$ (4.68)
$\sum_{j \in J_i} \sum_{r \in R_i} \sum_{l=1}^{L} \sum_{p=1}^{L} b_{jr} x_{jrpl} \le B_i, \quad \forall i \in \{1,\dots,N\},$ (4.69)
$\sum_{l=1}^{L} x_{jrpl} = 0, \ \text{if } \tau_p < t_j^R + c_j, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \in C,\ p \in \{1,\dots,L\},$ (4.70)
$\sum_{p=1}^{L} x_{jrpl} = 0, \ \text{if } \tau_l < t_j^R + p_{jr}, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \notin C,\ l \in \{1,\dots,L\},$ (4.71)
$x_{jrpl} = 0, \ \text{if } \tau_l < \tau_{p-1} + p_{jr} \text{ and } \tau_p > t_j^R + c_j, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \in C,\ p \in \{1,\dots,L\},\ l \in \{1,\dots,L\},$ (4.72)
$x_{jrpl} = 0, \ \text{if } B_i < b_{jr}, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \in R,\ l \in \{1,\dots,L\},\ p \in \{1,\dots,L\},$ (4.73)
$x_{jrpl} \ge 0, \quad \forall i \in \{1,\dots,N\},\ j \in J_i,\ r \in R,\ p, l \in \{1,\dots,L\}.$ (4.74)
Constraint (4.68) enforces that, for each interval $p$, the total load on the channel cannot exceed $\tau_p$. Constraints (4.70) and (4.71) ensure that individual tasks do not exceed $\tau_p$ and $\tau_l$, respectively. Constraint (4.72) ensures that a task cannot be communicated in interval $p$ and executed in interval $l$ if, even though the task can be communicated by $\tau_p$, it cannot complete execution by $\tau_l$.
Rounded Solution
We convert the LP solution $x_{jrpl}, \forall j, r, p, l$, to

$y_{jr'} = \sum_{p=1}^{L} x_{jrpl},$ (4.75)

where each $(r, l)$ pair is viewed as a single virtual processor $r'$, and

$z_{jr} = \sum_{l=1}^{L} x_{jrpl},$ (4.76)

where each $(r, p)$ pair is viewed as a single virtual processor. We then apply the rounding technique proposed in Section 4.2.2 to both $y_{jr'}$ and $z_{jr}$ separately.
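The two marginalizations can be sketched as follows (an illustrative NumPy snippet, not the thesis implementation; the array shapes and the random test data are assumptions):

```python
import numpy as np

# Given an LP solution x[j, r, p, l], form the two marginals used for
# rounding: y sums over communication intervals p (each (r, l) pair acts
# as a virtual processor), z sums over execution intervals l.
rng = np.random.default_rng(0)
J, R, L = 3, 2, 4
x = rng.random((J, R, L, L))                # x[j, r, p, l]
x /= x.sum(axis=(1, 2, 3), keepdims=True)   # each task's mass sums to 1

y = x.sum(axis=2)   # y[j, r, l]: marginal over communication intervals
z = x.sum(axis=3)   # z[j, r, p]: marginal over execution intervals

# Both marginals preserve each task's total assignment mass.
print(np.allclose(y.sum(axis=(1, 2)), 1.0),
      np.allclose(z.sum(axis=(1, 2)), 1.0))  # → True True
```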
Lemma 2. With the rounded solution, the total processing time of all tasks for every $r' \in R'$ and interval $l \in \{1,\dots,L\}$ cannot be worse than $2\tau_l$, i.e., constraint (4.67) is violated by at most $\tau_l$.

Proof. From inequality (4.24) of Lemma 1 and using constraint (4.67), we can see that for each $r' \in R'$,

$\sum_{s=2}^{k_{r'}} p_{r's}^{\max} \le \sum_{i=1}^{N} \sum_{j \in J_i} p_{jr'}\, y_{jr'}$ (4.77)
$\le \tau_l.$ (4.78)

Furthermore, $p_{r'1}^{\max} \le \tau_l, \forall r'$, from (4.71), (4.72), and (4.74). Hence, we have

$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{p=1}^{L} p_{jr}\, x_{jrpl} \le 2\tau_l, \quad \forall r \in R,\ l \in \{1,\dots,L\}.$ (4.79)
Lemma 3. With the rounded solution, the total communication time of all tasks for every $r \in R$ and interval $p \in \{1,\dots,L\}$ cannot be worse than $2\tau_p$, i.e., constraint (4.68) is violated by at most $\tau_p$.

Proof. For each machine node $v_{rs}$, let the maximum possible communication time be

$c_{rs}^{\max} = \max_{j:(v_{rs},u_j)\in E} c_j,$ (4.80)

and the minimum possible communication time be

$c_{rs}^{\min} = \min_{j:(v_{rs},u_j)\in E} c_j.$ (4.81)

From inequality (4.24) of Lemma 1 and using constraint (4.68), we can see that for each $r \in R$,

$\sum_{s=2}^{k_r} c_{rs}^{\max} \le \sum_{i=1}^{N} \sum_{j \in J_i} c_{jr}\, z_{jr}$ (4.82)
$\le \tau_p.$ (4.83)

Furthermore, $c_{r1}^{\max} \le \tau_p, \forall r$, from (4.70) and (4.74). Hence, we have

$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{l=1}^{L} c_j\, x_{jrpl} \le 2\tau_p, \quad \forall r \in R,\ p \in \{1,\dots,L\}.$ (4.84)
Theorem 6. The objective value of the rounded solution obtained from the integer matching $x$ cannot be worse than 32 times the optimal objective of problem (4.65).

Proof. We define new communication intervals $\bar{\tau}_p := 2^{p+1} = 4\tau_p$, similar to Theorem 1. We can easily show using Lemma 3 that every task $j$ assigned to finish communication in the $p$th interval may be communicated entirely within the interval $(\bar{\tau}_{p-1}, \bar{\tau}_p)$. Thus, the actual release time of these tasks at the cloud processors is $\bar{\tau}_p$.

We also define new execution intervals $\hat{\tau}_l := 2^{l+3}$. We can easily show using Lemma 2 that every task $j$ assigned to finish execution in the $l$th interval may be run entirely within the interval $(\hat{\tau}_{l-1}, \hat{\tau}_l)$. Every task $j$ that is assigned to the $l$th interval will have been released by $\hat{\tau}_{l-1}$ because

$\hat{\tau}_{l-1} = 8\tau_l \ge 8\tau_{p-1} + 8 p_{jr} > 8\tau_{p-1} = 4\tau_p = \bar{\tau}_p.$ (4.85)

If task $j$ is scheduled to complete in interval $l$, we have

$O_j^{\text{relax}} = w_j 2^{l-2}.$ (4.86)

Similarly, for the rounded solution, we have

$O_j^{\text{round}} \le w_j 2^{l+3}$ (4.87)
$= w_j 2^{l-2} \cdot 2^5$ (4.88)
$\le 32\, O_j^{\text{relax}}.$ (4.89)

This implies that

$\sum_{i=1}^{N} \sum_{j \in J_i} O_j^{\text{round}} \le \sum_{i=1}^{N} \sum_{j \in J_i} 32\, O_j^{\text{relax}}.$ (4.90)

Thus, we see that the rounded objective value is at most 32 times the relaxed solution, and hence, at most 32 times the optimal objective of problem (4.65), since the relaxed solution by definition returns an objective value that is no greater than the optimal objective.
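The powers-of-two arithmetic behind the factor of 32 can be checked mechanically (a sketch assuming base intervals $\tau_l = 2^{l-1}$, relaxed cost $w_j 2^{l-2}$, and rounded cost at most $w_j 2^{l+3}$, consistent with (4.86)-(4.88)):

```python
# Sanity check of the Theorem 6 arithmetic: the ratio of the rounded
# bound 2**(l+3) to the relaxed value 2**(l-2) is 2**5 = 32 for every l.
for l in range(2, 10):
    relaxed = 2.0 ** (l - 2)
    rounded = 2.0 ** (l + 3)
    assert rounded / relaxed == 32.0
print("ratio is 32 for all intervals checked")
```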
Theorem 2 and the corresponding corollaries on budget violation can be trivially extended
for this case since the costs are only dependent on task processing times.
Dealing with Budget Violation
We apply the same budget resolution technique proposed in Section 4.2.3 in order to ensure
that individual user budgets are met. Consequently, along the lines of Theorem 3, we can prove
the following.
Theorem 7. The objective value of the final solution is at most $2^{\lceil \log_2(2 + \frac{1}{a}) \rceil + 3}$ times the optimal solution, where $a = \min_{i,r} \alpha_{ir}$ is the minimum value of the speed-up factor in the system.

Proof. Similar to Theorem 3, we can show that after moving a task belonging to user $i$ back to the local device, the local total processing time will be at most $(2 + \frac{1}{a})\tau_l$, where $a = \min_{i,r} \alpha_{ir}$ is the minimum value of the speed-up factor. In other words, we now have, using (4.67),

$\sum_{i=1}^{N} \sum_{j \in J_i} \sum_{p=1}^{L} p_{jr}\, x_{jrpl} \le \left(2 + \frac{1}{a}\right) \tau_l, \quad \forall r \in R,\ l \in \{1,\dots,L\}.$ (4.91)

We need to redefine the $\hat{\tau}_l$ defined in Theorem 6 such that every task that is assigned to the $l$th interval is available for processing by $\hat{\tau}_{l-1}$ and may be run entirely within the interval $(\hat{\tau}_{l-1}, \hat{\tau}_l)$. Towards this end, we set $x = \lceil \log_2(2 + \frac{1}{a}) \rceil + 1$ and $\hat{\tau}_l = 2^{x+l}$.

We now get, for every task $j$,

$\frac{O_j^{\text{round}}}{O_j^{\text{relax}}} \le \frac{\hat{\tau}_l}{\tau_{l-1}} \le 2^{x+2} = 2^{\lceil \log_2(2 + \frac{1}{a}) \rceil + 3}.$ (4.92)

Thus, the objective value of the final solution is at most $2^{\lceil \log_2(2 + \frac{1}{a}) \rceil + 3}$ times the optimal solution.
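The bound in Theorem 7 is easy to evaluate numerically (a sketch; the speed-up values below are illustrative, not from the thesis):

```python
import math

# Evaluate the Theorem 7 bound 2**(ceil(log2(2 + 1/a)) + 3) for a few
# candidate minimum speed-up factors a = min_{i,r} alpha_{ir}.
def stubr_bound(a):
    return 2 ** (math.ceil(math.log2(2 + 1.0 / a)) + 3)

for a in (1.0, 0.5, 0.2, 0.1):
    print(a, stubr_bound(a))
```

The bound grows slowly with $1/a$: halving the minimum speed-up factor at most doubles the constant.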
Remark 3. Even though the proven worst-case bounds look large, from the trace-driven simula-
tion results in Section 4.4, we see that the performance of STUBR in practice is no worse than
4 times the relaxed solution and consequently the optimal, for all models and cases considered.
We do not apply the modified WSPT ordering here as the release times on processors depend
on the sequence of tasks transmitted, and we cannot trivially extend this technique to improve
performance for this case.
For all of the extensions described in this section, using similar arguments as in Section
4.2.5, it can be verified that the STUBR algorithm always provides a feasible solution and has
worst-case time complexity O((P 2L)3.5).
4.4 Trace-driven Simulation
In addition to the worst-case bounds derived in Sections 4.2 and 4.3, in this section, we inves-
tigate the performance of STUBR, using trace-driven simulation. We study the effect of user
budget and number of tasks on algorithm performance. We evaluate STUBR for the model
with release times and fixed communication times described in Section 4.3.2, and for the model
with sequence-dependent release times and communication times described in Section 4.3.3.
4.4.1 Traces and Parameter Setting
In [22], the authors conducted experiments on four different applications, and provided task
characteristics in terms of input data, computation need, and arrival rates. Additionally, they
consider different mobile devices with varying computational capacities. We use the traces from
this paper in order to test our proposed algorithm as follows:
1. We take the computation need and input data given in [22] as mean, and allow a maximum
of ±50 % variation. In other words, we randomly pick these values from a uniform
distribution in (50% mean, 150% mean).
(a) We calculate the mean local processing time tj of tasks.
(b) We calculate the mean communication time of a task by dividing the input data (in
MBytes) by the available data rate, which is 20 Mbps from [22].
2. We pick the release time values from a uniform distribution in the range $(0, \frac{\text{number of tasks}}{\text{arrival rate (in tasks/sec)}})$.
3. We pick the task weights from a uniform distribution in the range (0,1).
4. We run multiple randomized iterations (for different values of input data and computation)
for each parameter setting, and take the average among them to plot each point on the
graph.
We run our simulation using MATLAB, and utilize the CVX programming package to solve
our linear programs.
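The sampling steps above can be sketched as follows (Python rather than MATLAB; the mean values, the cycles-per-second figure, and the MBytes-to-Mbits conversion are illustrative assumptions, not values from [22]):

```python
import random

random.seed(42)

def sample_task(mean_cycles, mean_data_mb, n_tasks, arrival_rate,
                cycles_per_sec, rate_mbps=20.0):
    """Draw one task following the trace-driven setup: each quantity is
    uniform in (50%, 150%) of its mean; release times are uniform in
    (0, number of tasks / arrival rate); weights are uniform in (0, 1)."""
    cycles = random.uniform(0.5, 1.5) * mean_cycles
    data_mb = random.uniform(0.5, 1.5) * mean_data_mb
    return {
        "local_time": cycles / cycles_per_sec,
        # MBytes -> Mbits, then divide by the 20 Mbps rate (the thesis
        # simply divides data by the rate; the factor 8 is our assumption).
        "comm_time": data_mb * 8.0 / rate_mbps,
        "release": random.uniform(0.0, n_tasks / arrival_rate),
        "weight": random.uniform(0.0, 1.0),
    }

tasks = [sample_task(1e9, 2.0, n_tasks=10, arrival_rate=0.5,
                     cycles_per_sec=2e9) for _ in range(10)]
print(len(tasks))  # 10
```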
4.4.2 Comparison Targets
We use the following targets for comparison with the STUBR algorithm:
• Lower bound : This is the relaxed solution obtained from Section 4.2.1.
• Rounded infeasible: This is the solution obtained from Section 4.2.2, without dealing with
budget violation. We also perform a modified WSPT, proposed in Section 4.2.4, on this
solution. Hence, this solution has an objective value that is a constant times the optimal,
but may violate the user budgets.
• Greedy: All tasks are sorted in the non-decreasing order of weighted local processing time $\frac{t_j}{w_j}$ for all $j$, and each task is scheduled in this order onto the processor where it meets its user's budget and has the fastest processing time.
[Figure 4.2: For chess application on Galaxy S5. (a) Effect of user budget; (b) effect of the number of tasks per user. Plots of weighted sum completion time (s), comparing STUBR, local processing, greedy, comm. sensitive, rounded infeasible, and lower bound.]
[Figure 4.3: For compute intensive application on Nexus 10. (a) Effect of user budget; (b) effect of the number of tasks per user. Plots of weighted sum completion time (s), comparing STUBR, local processing, greedy, comm. sensitive, rounded infeasible, and lower bound.]
• Local processing: All tasks are scheduled locally, and ordered in the non-decreasing order of $\frac{t_j^R + t_j}{w_j}$ for all $j$. This illustrates the benefits of offloading using our algorithm.
• Comm. sensitive: All tasks are sorted in the non-decreasing order of communication time $c_j$ for all $j$, and each task is scheduled in this order onto the processor where it meets its user's budget and has the fastest processing time. This sorting tries to offload the tasks that have shorter communication times, thereby decreasing the overall contribution of communication time to the objective.
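The greedy baseline above can be sketched as follows (an illustrative Python sketch; the data structures, the toy numbers, and the assumption that cost is proportional to processing time via the processor price are ours, not from the thesis):

```python
# Greedy baseline: sort by t_j / w_j, then place each task on the fastest
# affordable processor. A smaller speed-up factor alpha means a faster
# processor (processing time = alpha * t_local).
def greedy_schedule(tasks, procs, budgets):
    """tasks: (user, t_local, weight); procs: (proc_id, alpha, price);
    budgets is mutated in place as tasks consume their user's budget."""
    schedule = {}
    for uid, t_local, w in sorted(tasks, key=lambda t: t[1] / t[2]):
        best = None
        for pid, alpha, price in procs:
            cost = price * alpha * t_local   # pay per unit of processing time
            if cost <= budgets[uid] and (best is None or alpha < best[1]):
                best = (pid, alpha, price)
        if best is not None:
            pid, alpha, price = best
            budgets[uid] -= price * alpha * t_local
            schedule[(uid, t_local, w)] = pid
    return schedule

procs = [("local", 1.0, 0.0), ("fast", 0.2, 2.0), ("slow", 0.5, 0.5)]
tasks = [(0, 4.0, 2.0), (0, 2.0, 1.0)]
sched = greedy_schedule(tasks, procs, {0: 3.0})
print(sched)
```

With a budget of 3, both toy tasks are affordable on the fastest processor, so the greedy rule offloads both; with a tighter budget, later tasks would fall back to slower or local processors.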
4.4.3 For Release Times and Fixed Communication Times
In this section, we evaluate STUBR for the model with release times and fixed communication
times described in Section 4.3.2, for chess and compute-intensive applications presented in [22].
[Figure 4.4: For chess application on Galaxy S5. (a) Effect of user budget; (b) effect of the number of tasks per user. Plots of weighted sum completion time (s), comparing STUBR, local processing, greedy, comm. sensitive, rounded solution, and lower bound.]

In Figures 4.2a and 4.2b, we consider three Galaxy S5 users and chess applications. We consider a five-processor cloud with speed-up factors $\alpha_{i1} = 0.5$, $\alpha_{i2} = \alpha_{i3} = 0.1$, and $\alpha_{i4} = \alpha_{i5} = 0.2$ for every user $i$. We set the processor prices as $\beta_1 = 0.5$, $\beta_2 = \beta_3 = 3$, and $\beta_4 = \beta_5 = 2$. This parameter setting ensures that the users will have to pay a higher price to use a faster processor.
For Figure 4.2a, we consider users with equal budget, and constant number of tasks |J1| = 5,
|J2| = 5, and |J3| = 10. This allows us to study the impact of user budget on the weighted
sum completion time and algorithm performance. We see that as the user budget increases,
the weighted sum completion time decreases as expected. We also see that the STUBR curve
appears to plateau beyond a particular value of budget that is large enough to offload all tasks
to the fastest processors. On the other hand, for tighter values of budget, we see that the
STUBR curve coincides with the local execution curve. Additionally, we also note that the gap
between STUBR and the rounded solution decreases with increasing budget until eventually
the STUBR curve meets the rounded solution curve. This illustrates that the amount of budget
violation decreases with increasing budget, and consequently, the STUBR solution approaches
the rounded solution.
In Figure 4.2b, we observe the impact of the number of tasks per user, for user budgets
B1 = B2 = B3 = 5. The total weighted completion time increases with increasing number
of tasks per user (and total number of tasks) as expected. We see that the performance gap
between STUBR and other schemes increases with increasing number of tasks, indicating that
STUBR is more scalable.
In Figures 4.3a and 4.3b, we consider three Nexus 10 users running compute-intensive
applications (as described in [22]). We consider the same five-processor simulation setup as
that of Figures 4.2a and 4.2b. For Figure 4.3a, we consider constant number of tasks |J1| = 5,
|J2| = 5, and |J3| = 10. For Figure 4.3b, we set B1 = B2 = B3 = 20. We again see that
STUBR provides superior performance and scales well.
[Figure 4.5: For compute intensive application on Nexus 10. (a) Effect of user budget; (b) effect of the number of tasks per user. Plots of weighted sum completion time (s), comparing STUBR, local processing, greedy, comm. sensitive, rounded infeasible, and lower bound.]
4.4.4 For Sequence-dependent Communication Times
We now consider the model with sequence-dependent release times and communication times
described in Section 4.3.3. In Figures 4.4 and 4.5, we use the same simulation setting from
Section 4.4.3 for the modified channel model and STUBR proposed in Section 4.3.3.
Interestingly, we observe that STUBR performs better than even the rounded solution
for some samples. This happens because moving tasks to the local device may significantly
reduce task completion times in some cases because of the reduction of sequence-dependent
release/communication times, particularly when these dominate the processing times. Further-
more, STUBR outperforms all other alternatives for both chess and compute-intensive appli-
cations. In fact, the performance gap between STUBR and other alternatives is even greater
than for the fixed communication case. It is also interesting to note that the communication
sensitive scheme performs better than the greedy scheme for this sequence-dependent commu-
nication model since the communication times now contribute more to the overall objective. In
some cases, we see that the objective of the greedy and communication sensitive schemes may even increase with increasing user budget; these naive schemes let the initial tasks use up all of the budget and do not take release times into account when making scheduling decisions. We again see that STUBR scales well with increasing number of tasks per user.
4.4.5 Runtime Overhead
We note that the sum completion time values that we are dealing with in these figures are of
the order of hundreds to thousands of seconds. This is much greater than the runtime of the
algorithm, and thus, the overhead due to the runtime of the algorithm is still compensated by
the improvements in the completion times due to scheduling and offloading.
4.5 Summary
We have studied a multi-user computational offloading problem, for a system consisting of a
finite-capacity cloud with heterogeneous processors. The offloaded tasks incur monetary cost
for the cloud resource usage and each user has a budget constraint. We have formulated the problem of minimizing the weighted sum completion time subject to the user budget constraints. The proposed STUBR algorithm relaxes, rounds, and resolves budget
violations, and it sorts the tasks to obtain an effective solution. We have also obtained inter-
esting performance bounds for both the underlying rounded solution and the budget-resolved
solution for different release-time and communication-time models. Through simulation us-
ing real-world application traces, we have observed that STUBR is scalable and substantially
outperforms the existing alternatives especially for larger systems.
Chapter 5
Online Scheduling for Profit
Maximization at the Cloud
In this chapter, we study the scheduling of tasks that arrive dynamically at a networked cloud
computing system consisting of heterogeneous processors. Execution of tasks yields some profit
to the cloud service provider. We intend to maximize the total profit across all tasks arriving
within a time interval, subject to processor load constraints, without prior knowledge of the
task arrival times or processing requirements. We propose polynomial-time algorithms that use
a combination of learning and dual-optimization techniques to obtain effective solutions.
The main contributions of this work are as follows:
• We formulate our task scheduling problem and propose the Task Dispatch through Online
Training (TDOT) algorithm. It consists of two broad phases: (1) a training phase where
we observe the processing times of some arriving tasks to obtain information about task
characteristics, and (2) an exploitation phase where we make decisions on future tasks with
the help of the information obtained. We draw inspiration from a relaxed solution to the
offline problem to identify the parameters that bridge TDOT’s training and exploitation
phases. This algorithm assumes that profit can be obtained on a partially-completed task,
if the processor load constraint is met before the task could complete execution.
• We derive performance bounds that quantify TDOT’s effectiveness against the offline
benchmark. For example, for Poisson task arrivals, we present a scenario (below Corollary
4) where TDOT achieves an expected profit that is at least half of the maximum profit
achievable by any offline algorithm.
• We consider an extension where we allow each task to have data requirements in addition
to the computation requirements. We then extend TDOT to deal with both computation
and storage resources.
• We then propose a modified version of TDOT, namely TDOT with Greedy Scheduling
(TDOT-G), for implementation in systems where profit can only be obtained from fully-
completed tasks. We use tasks generated from randomly-generated i.i.d. data as well
as Google cluster traces [79] to investigate the practical performance of the proposed algorithms. We compare them with alternatives such as greedy scheduling, logistic regression, and an offline upper-bound solution. We observe that TDOT and TDOT-G generally outperform all other online alternatives and achieve near-optimal performance over the non-training set of tasks.
The existing works that tackle online problems in cloud computing often make certain
assumptions such as a single server [57,80], purely fluid tasks [57,58,62], homogeneous resources
[59, 60, 81], preemptable tasks [64], or propose heuristic solutions [61, 62]. On the other hand,
certain theoretical works address generic online problems such as assigning items to agents with
budgets [71,82] and scheduling jobs to machines [67,68], providing performance guarantees for
their schemes. However, these works solve a simpler problem [82], address a considerably
different objective [67, 68], or make certain impractical assumptions such as equal-length jobs
[67] or a single-processor system [68]. Some of the techniques we apply are similar to those
proposed in [71], but unlike [71], we (i) have no prior knowledge of the total number of tasks,
(ii) propose an algorithm to obtain feasible task scheduling decisions, (iii) consider a second
resource to accommodate for task data requirements, (iv) propose a modified algorithm for
implementations where profit can only be obtained from fully-completed tasks, and (v) assess
the performance of the proposed schemes through both performance bound analysis and trace-
based simulation.
The rest of this chapter is organized as follows. In Section 5.1, we describe the system
model comprising cloud processors and online tasks, and formulate the optimization problem
to maximize profit. In Section 5.2, we propose the Task Dispatch through Online Training
(TDOT) algorithm, and provide a performance bound analysis for TDOT. We further propose
an improved version of TDOT, termed TDOT-G, in Section 5.3. In Section 5.4, we extend our
work to consider an additional task requirement, such as data requirements. In Section 5.5,
we present the simulation results, to compare the performance of TDOT and TDOT-G with
other alternatives including greedy scheduling, logistic regression, and an offline upper-bound
solution.
5.1 System Model and Problem Statement
5.1.1 Cloud Processors and Online Task Arrival
We consider a CSP with K broadly defined cloud servers (CSs), which can be, for example,
remote cloud servers, mobile edge hosts, or cloudlets. Each CS k has Pk processors, which may
not be identical. Tasks arrive at the CSP’s controller at an average rate of λ tasks per unit
time over a duration of length T . The role of the controller is to dispatch the tasks to the CSs
for execution. The processing requirements, e.g. number of cycles or processing time, for each
Table 5.1: Notations
$t_{jrk}$: processing time for task $j$ on processor $r$ at CS $k$
$p_{rk}$: profit obtained per unit time on processor $r$ at CS $k$
$T$: time duration of the system
$L$: maximum load on each processor
$\lambda$: task arrival rate
$P_k$: total number of processors at CS $k$
$K$: total number of CSs
Figure 5.1: Example system with two CSs consisting of two processors each.
task $j$ on processor $r$ in CS $k$ is given by $t_{jrk}, \forall r, k$, and is known only once the task arrives at
the controller.
5.1.2 Profit Maximization
We assume that the work of processor r in CS k generates profit prk per unit time, which may
account for multiple contributing factors such as the revenue from user payment and the cost
of maintaining the processor. Then, the profit obtained by executing task j having processing
requirements tjrk on processor r in CS k is given by prktjrk. By intelligently scheduling this
task, we may maximize the profit we gain from it. In this paper, we aim to identify a task
dispatching decision that maximizes the total profit across all tasks arriving within duration T .
It is important to note that since tasks arrive dynamically, we have no prior knowledge of the
exact number of tasks that arrive within duration T , nor their processing requirements on any
processor.
Let $M$ be the random number of tasks that arrive within duration $T$. We define the scheduling decision variables as $x_{jrk} = 1$ when task $j$ is scheduled to processor $r$ in cloud server $k$, and 0 otherwise. We consider a given load constraint $L$ on each processor$^1$, so that

$\sum_{j=1}^{M} t_{jrk}\, x_{jrk} \le L, \quad \forall k \in \{1,\dots,K\},\ r \in \{1,\dots,P_k\}.$ (5.1)

Additionally, each task is executed at most once:

$\sum_{k=1}^{K} \sum_{r=1}^{P_k} x_{jrk} \le 1, \quad \forall j \in \{1,\dots,M\}.$ (5.2)
We aim to maximize the profit of the CSP. Hence, we formulate an optimization problem as follows:

$\underset{\{x_{jrk}\}}{\text{maximize}} \ \sum_{j=1}^{M} \sum_{k=1}^{K} \sum_{r=1}^{P_k} p_{rk}\, t_{jrk}\, x_{jrk}$ (5.3)
subject to (5.1)-(5.2),
$x_{jrk} \in \{0,1\}, \quad \forall j \in \{1,\dots,M\},\ k \in \{1,\dots,K\},\ r \in \{1,\dots,P_k\}.$ (5.4)
Here, objective (5.3) is to maximize the total profit across all tasks, CSs, and processors, while
we ensure that the maximum load L is well-utilized by packing these tasks efficiently.
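On a toy instance, the offline version of problem (5.3) can be brute-forced to make the objective and constraints concrete (illustrative numbers; in the online setting $M$ and the $t_{jrk}$ are unknown in advance, which is precisely the difficulty the later sections address):

```python
from itertools import product

# Brute-force the tiny offline version of problem (5.3): each task is
# either dropped or placed on one processor (here the CS index is folded
# into the processor index), subject to the per-processor load cap L.
def offline_opt(t, p, L):
    """t[j][r]: processing time of task j on processor r;
    p[r]: profit per unit time on processor r."""
    M, R = len(t), len(p)
    best = 0.0
    for assign in product(range(-1, R), repeat=M):   # -1 = not scheduled
        load = [0.0] * R
        profit = 0.0
        for j, r in enumerate(assign):
            if r >= 0:
                load[r] += t[j][r]
                profit += p[r] * t[j][r]
        if all(load[r] <= L for r in range(R)):
            best = max(best, profit)
    return best

t = [[2.0, 3.0], [2.0, 1.0], [3.0, 2.0]]
p = [1.0, 2.0]
print(offline_opt(t, p, L=4.0))  # → 11.0
```

The exponential enumeration is only viable for a handful of tasks; it serves here to pin down what "offline optimal" means before the online algorithm is introduced.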
Remark 4. The multiple cloud servers in our model allow us to differentiate between groups of processors. For example, all the processors in a particular CS may incur a different cost from those in another, and we may have profits $p_{rk} = p_k, \forall r, k$. However, one may also visualize this model as just processors with different profit rates.
In the offline version of the problem, the number of tasks and the task processing times
are known in advance. On the other hand, the online nature of the proposed problem is more
challenging due to the lack of prior information. Consequently, we propose a polynomial-time
online algorithm that uses training to identify appropriate scheduling decisions.
5.2 Task Dispatch through Online Training
In this section, we first obtain an optimal solution to the binary-relaxed offline problem for
performance benchmarking and to gain insights into the online algorithm construction. We
then propose the TDOT algorithm for online task scheduling, and provide a performance bound
with respect to the relaxed offline solution.
$^1$The load constraint can be generalized to be processor dependent, i.e., $L_{rk}$. See Section 5.2.3 for a discussion on this.
5.2.1 Offline Solution through Lagrange Relaxation
We relax the binary constraints (5.4) in the offline version of problem (5.3). The dual of this problem is then given by

$\underset{\{u_{rk}\ge 0,\ v_j \ge 0\}}{\text{minimize}} \ \sum_{k=1}^{K} \sum_{r=1}^{P_k} u_{rk} L + \sum_{j=1}^{M} v_j$ (5.5)
subject to $u_{rk} t_{jrk} + v_j \ge p_{rk} t_{jrk}, \quad \forall j \in \{1,\dots,M\},\ r \in \{1,\dots,P_k\},\ k \in \{1,\dots,K\},$ (5.6)

where $u_{rk}$ and $v_j$ are Lagrange multipliers corresponding to constraints (5.1) and (5.2), respectively.
Constraint (5.6) implies that an optimal solution must satisfy

$v_j = \max_{r,k}\, (p_{rk} - u_{rk})\, t_{jrk}, \quad \forall j \in \{1,\dots,M\}.$ (5.7)

In other words, given optimal $u = \{u_{rk}, \forall r, k\}$, we should assign each task $j$ to the processor given by $\arg\max_{r,k}\, (p_{rk} - u_{rk})\, t_{jrk}$. Thus, the dual problem can be rewritten as follows:

$\underset{\{u \ge 0\}}{\text{minimize}} \ D(u)$ (5.8)

where

$D(u) = \sum_{k=1}^{K} \sum_{r=1}^{P_k} u_{rk} L + \sum_{j=1}^{M} \max_{r,k}\, (p_{rk} - u_{rk})\, t_{jrk}.$ (5.9)
This solution is optimal for the binary-relaxed, offline version of problem (5.3), and is an upper
bound to the optimal online solution. We call this solution OPT and use it in Section 5.2.3 to
define the performance bound.
5.2.2 Online Scheduling with Partial-Task Profit Taking
Now we consider the online problem where tasks arrive dynamically. We neither know the total
number of tasks arriving within duration T nor the processing times of the tasks in advance.
Hence, we need to dynamically learn about the processing time characteristics, i.e., optimal
u values defined in (5.9). The proposed TDOT algorithm utilizes a technique that initially
performs training to learn from arriving tasks, and then uses the information to produce profit
on the remaining set of tasks. TDOT assumes that profit can also be obtained from partially-
completed tasks within the load constraints. In other words, if only a part of the task scheduled
to a processor can be completed before the load L is met, then we retain the partial profit
generated due to the execution of that task up to that point. This assumption is eliminated in
Section 5.3, i.e., we consider a model where profits can only be obtained from fully-completed
tasks. The TDOT algorithm consists of two phases and involves a user-defined parameter $0 < \epsilon < 1$.
Training
We observe the first $\lfloor \epsilon \lambda T \rfloor$ arriving tasks, denoted by $A = \{1, \dots, \lfloor \epsilon \lambda T \rfloor\}$. For each task $j \in A$, we record its computing requirement and hence $t_{jrk}, \forall r, k$. These tasks may be arbitrarily scheduled. For simplicity, we may for now ignore the profit earned from these tasks and set $x_{jrk} = 0, \forall j \in \{1, \dots, \lfloor \epsilon \lambda T \rfloor\}, r, k$, which is shown later not to affect our derivations of the competitive ratios for TDOT.
If we allocate only $\epsilon L$ load to $A$, then we can write the dual problem objective (5.9) purely for $A$ as follows:

$D(u, A) = \sum_{k=1}^{K} \sum_{r=1}^{P_k} u_{rk}\, \epsilon L + \sum_{j \in A} \max_{r,k}\, (p_{rk} - u_{rk})\, t_{jrk}.$ (5.10)
Since the dual of an LP is also an LP, we can use any existing LP solver to efficiently obtain $u^* = \arg\min_{u\ge 0} D(u,\mathcal{A})$.
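To make the training step concrete, the dual minimization above can be handed to a generic LP solver. The following is a minimal Python sketch, not the thesis implementation: function and variable names are illustrative, processors are flattened into a single index, and the inner max in (5.10) is linearised with one epigraph variable per training task before calling SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog

def train_weights(t, p, eps_L):
    """Solve the dual LP (5.10) over the training set A.

    t: (n_tasks, n_proc) processing times t_jrk, processors flattened over (r, k)
    p: (n_proc,) per-unit profits p_rk
    eps_L: the reduced load eps * L allocated to the training set

    Decision variables are [u_1..u_P, v_1..v_n], where v_j is the epigraph
    variable standing in for max_{r,k} (p_rk - u_rk) * t_jrk:
        min  eps_L * sum(u) + sum(v)
        s.t. v_j >= (p_q - u_q) * t_jq   for all tasks j and processors q
             u >= 0, v >= 0
    """
    n, P = t.shape
    c = np.concatenate([np.full(P, eps_L), np.ones(n)])
    # One inequality row per (task, processor): -t_jq*u_q - v_j <= -p_q*t_jq
    A_ub = np.zeros((n * P, P + n))
    b_ub = np.zeros(n * P)
    for j in range(n):
        for q in range(P):
            row = j * P + q
            A_ub[row, q] = -t[j, q]
            A_ub[row, P + j] = -1.0
            b_ub[row] = -p[q] * t[j, q]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (P + n))
    return res.x[:P]  # the learned weights u*
```

The returned weights play the role of u* in the exploitation phase below.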
Exploitation
Let $\mathcal{A}^c$ denote all tasks arriving after task ⌊ελT⌋. Now, for each arriving task j ∈ $\mathcal{A}^c$, we apply weights $u^*$ to obtain the scheduling decision as follows. We set
$$(r', k') = \arg\max_{r,k}\,(p_{rk} - u^*_{rk})\,t_{jrk}, \qquad (5.11)$$
if the task can be scheduled on r′ in CS k′ without violating (1− ε)L. Otherwise, we schedule
just a fraction of the task that can be scheduled while meeting (1− ε)L. This is because TDOT
assumes we may obtain profit from partially-completed tasks as well. We achieve this by defining
load variables $l_{rk}$ for every r and k. If task j satisfies $t_{jr'k'} < (1-\varepsilon)L - l_{r'k'}$, we set $x_{jr'k'} = 1$; else we set $x_{jr'k'} = \frac{(1-\varepsilon)L - l_{r'k'}}{t_{jr'k'}}$. We then update the load variable, $l_{r'k'} = l_{r'k'} + x_{jr'k'}\,t_{jr'k'}$. We
stop at the end of duration T .
After the above two phases, we obtain scheduling decisions $x_{jrk}$, ∀j, r, k, and the resulting profit can be calculated by
$$\sum_{j=1}^{M}\sum_{k=1}^{K}\sum_{r=1}^{P_k} p_{rk}\,t_{jrk}\,x_{jrk}.$$
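A single exploitation step, including the partial-task case, can be sketched as follows. This is a hypothetical Python fragment with illustrative names, assuming a flattened processor index; it is not the thesis implementation.

```python
import numpy as np

def tdot_schedule(task_t, p, u_star, load, cap):
    """One TDOT exploitation step with partial-task profit taking.

    task_t: processing times of the arriving task on each (flattened) processor
    p, u_star: per-unit profits and trained dual weights
    load: mutable array of current loads l_rk; cap: the budget (1 - eps) * L
    Returns (chosen processor, scheduled fraction x in [0, 1]).
    """
    score = (p - u_star) * task_t      # weighted marginal profit, as in (5.11)
    best = int(score.argmax())
    remaining = cap - load[best]
    if remaining <= 0:
        return best, 0.0               # budget on this processor exhausted
    # Schedule the whole task if it fits, otherwise only the fitting fraction.
    x = 1.0 if task_t[best] <= remaining else remaining / task_t[best]
    load[best] += x * task_t[best]
    return best, x
```

Repeating this for every arrival in $\mathcal{A}^c$ until time T reproduces the exploitation loop described above.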
Remark 5. TDOT uses single-shot learning, unlike a reinforcement learning approach, often used to solve multi-armed bandit problems, which iteratively explores and exploits for each incoming task. Single-shot learning is more efficient to implement and can be more effective when tasks arriving within duration T have similar characteristics. As shown below, it also has a performance bound, unlike reinforcement learning.
5.2.3 Performance Bound Analysis
The TDOT algorithm, despite its simple premise of training and exploitation, obtains performance that is close to the optimum. In this section, we present a performance bound for the expected profit produced by TDOT in comparison to the upper-bound offline solution.
Let $S(u^*,\mathcal{A}^c)$ be the profit obtained by TDOT on the non-training set $\mathcal{A}^c$. We next provide in Lemma 4 a conditional performance bound on $S(u^*,\mathcal{A}^c)$ with respect to the upper bound OPT. The following definitions are necessary. We define $R(u^*)$ as the profit obtained in the absence of load constraints by applying weights $u^*$ to the entire set of M tasks, and $R_{rk}(u^*)$ as the contribution of processor r in CS k to $R(u^*)$. We further define $R_{rk}(u^*,\mathcal{A})$ similarly to $R_{rk}(u^*)$, except over just the set of tasks $\mathcal{A}$.
Lemma 4. For any given M number of tasks that arrive within duration T, if we have
$$\sum_{k=1}^{K}\sum_{r=1}^{P_k} \left|R_{rk}(u^*,\mathcal{A}) - \varepsilon R_{rk}(u^*)\right| \le \varepsilon^2\sqrt{\frac{\lambda T}{M}}\,\max\{\mathrm{OPT},\, R(u^*)\}, \qquad (5.12)$$
then $S(u^*,\mathcal{A}^c) \ge \left(1 - \varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}$.
Proof. We first define a few functions for a given M, for the purposes of the proof. Let $y_{jrk}(u^*)$ be the scheduling decision in the absence of the load constraint L, obtained by applying weights $u^*$ to the entire set of M tasks. Specifically, ∀j, r, k, $y_{jrk}(u^*) = 1$ if $(r,k) = \arg\max_{r',k'} \left(1 - \frac{u^*_{r'k'}}{p_{r'k'}}\right) p_{r'k'}\,t_{jr'k'}$ and 0 otherwise. Then, we obtain
$$R(u^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} R_{rk}(u^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j=1}^{M} p_{rk}\,t_{jrk}\,y_{jrk}(u^*). \qquad (5.13)$$
Note that Rrk(u∗) is the profit obtained due to utilizing processor r in CS k. Similarly, by
applying this to just A, we have
$$R(u^*,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} R_{rk}(u^*,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j\in\mathcal{A}} p_{rk}\,t_{jrk}\,y_{jrk}(u^*). \qquad (5.14)$$
We define the contribution of processor r in CS k to the dual objective (5.9) as
$$D_{rk}(u^*) = u^*_{rk}L + \left(1 - \frac{u^*_{rk}}{p_{rk}}\right)R_{rk}(u^*), \qquad (5.15)$$
and to the dual objective (5.10) over just the set of tasks $\mathcal{A}$ as
$$D_{rk}(u^*,\mathcal{A}) = u^*_{rk}\,\varepsilon L + \left(1 - \frac{u^*_{rk}}{p_{rk}}\right)R_{rk}(u^*,\mathcal{A}). \qquad (5.16)$$
Let xjrk(u∗) be the scheduling decision obtained by applying u∗ in the presence of load
constraints. We define S(u∗) as the profit obtained while applying u∗ to the entire set of tasks
in the presence of load constraints. We can write S(u∗) as follows.
$$S(u^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j=1}^{M} p_{rk}\,t_{jrk}\,x_{jrk}(u^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} \min\Big\{p_{rk}L,\ \sum_{j=1}^{M} p_{rk}\,t_{jrk}\,y_{jrk}(u^*)\Big\}. \qquad (5.17)$$
This is because the tasks are divisible and TDOT assigns a fraction of a task to ensure that L is exactly met when $\sum_{j=1}^{M} t_{jrk}\,y_{jrk}(u^*) > L$. We again define the contribution of processor r in CS k to $S(u^*)$ as
$$S_{rk}(u^*) = \min\Big\{p_{rk}L,\ \sum_{j=1}^{M} p_{rk}\,t_{jrk}\,y_{jrk}(u^*)\Big\} \qquad (5.18)$$
$$= \min\{p_{rk}L,\ R_{rk}(u^*)\}. \qquad (5.19)$$
The profit obtained by TDOT can be written as
$$S(u^*,\mathcal{A}^c) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j\in\mathcal{A}^c} p_{rk}\,t_{jrk}\,x_{jrk}(u^*), \qquad (5.20)$$
where the contribution of processor r in CS k to $S(u^*,\mathcal{A}^c)$ is
$$S_{rk}(u^*,\mathcal{A}^c) = \sum_{j\in\mathcal{A}^c} p_{rk}\,t_{jrk}\,x_{jrk}(u^*) \qquad (5.21)$$
$$= \min\{(1-\varepsilon)p_{rk}L,\ R_{rk}(\mathcal{A}^c)\}. \qquad (5.22)$$
For simplicity, for the rest of the proof, we drop $u^*$ from all functions. We also define some $s_{rk}$, ∀r, k, such that
$$|R_{rk}(\mathcal{A}) - \varepsilon R_{rk}| \le s_{rk}, \qquad (5.23)$$
and
$$\sum_{r,k} s_{rk} \le \varepsilon^2\sqrt{\frac{\lambda T}{M}}\,\max\{\mathrm{OPT},\, R\}, \qquad (5.24)$$
which is given by the hypothesis of the lemma in (5.12). Additionally, we set $a_{rk} = s_{rk}/\varepsilon$.
We first prove that for all r, k,
$$\max\{R_{rk},\, D_{rk}\} - S_{rk} \le a_{rk}. \qquad (5.25)$$
We consider the following two cases.
• Case 1: $u^*_{rk} > 0$.
We can see, from (5.15), that $D_{rk} \le \max\{p_{rk}L,\, R_{rk}\}$. Consequently,
$$\max\{R_{rk},\, D_{rk}\} - S_{rk} \le \max\{p_{rk}L,\, R_{rk}\} - \min\{p_{rk}L,\, R_{rk}\} \qquad (5.26)$$
$$= |p_{rk}L - R_{rk}|. \qquad (5.27)$$
Since $u^*_{rk} > 0$, by the complementary slackness conditions on the LP for just the tasks in $\mathcal{A}$, we have $R_{rk}(\mathcal{A}) = \varepsilon p_{rk}L$. Thus, from (5.23), we have
$$|\varepsilon p_{rk}L - \varepsilon R_{rk}| \le s_{rk}. \qquad (5.28)$$
This implies that $|p_{rk}L - R_{rk}| \le a_{rk}$. From (5.27), we can now prove (5.25).
• Case 2: $u^*_{rk} = 0$. We have
$$D_{rk} = R_{rk} \qquad (5.29)$$
and
$$R_{rk} \le \frac{R_{rk}(\mathcal{A})}{\varepsilon} + a_{rk} \quad (\text{from } (5.23)) \quad \le\ p_{rk}L + a_{rk} \quad (\text{by complementary slackness}). \qquad (5.30)$$
Thus, we get
$$S_{rk} + a_{rk} = \min\{p_{rk}L + a_{rk},\ R_{rk} + a_{rk}\} \ge \min\{R_{rk},\ R_{rk} + a_{rk}\} \quad (\text{from } (5.30)) \quad = R_{rk} = D_{rk} \quad (\text{from } (5.29)). \qquad (5.31)$$
Hence, (5.25) is proven for Case 2.
We can sum (5.25) over r and k to obtain
$$\max\{R,\, D\} - S \le \sum_{k=1}^{K}\sum_{r=1}^{P_k} a_{rk}. \qquad (5.32)$$
Note that S ≤ OPT ≤ D, by weak duality. Therefore, from (5.32) and using (5.24), we can see that
$$R - \mathrm{OPT} \le \frac{1}{\varepsilon}\sum_{k=1}^{K}\sum_{r=1}^{P_k} s_{rk} \le \varepsilon\sqrt{\frac{\lambda T}{M}}\,R. \qquad (5.33)$$
Again using weak duality and (5.32), we have $\mathrm{OPT} - S \le \varepsilon\sqrt{\frac{\lambda T}{M}}\,R$. Consequently, using (5.33),
$$S \ge \frac{1 - 2\varepsilon\sqrt{\frac{\lambda T}{M}}}{1 - \varepsilon\sqrt{\frac{\lambda T}{M}}}\,\mathrm{OPT}. \qquad (5.34)$$
Now, from (5.23), since $R_{rk}(\mathcal{A}^c) = R_{rk} - R_{rk}(\mathcal{A})$, we see that $R_{rk}(\mathcal{A}^c) > (1-\varepsilon)R_{rk} - s_{rk}$, taking both cases into consideration. Applying this to (5.22), we have
$$S_{rk}(\mathcal{A}^c) \ge \min\{(1-\varepsilon)p_{rk}L,\ (1-\varepsilon)R_{rk} - s_{rk}\} \qquad (5.35)$$
$$\ge (1-\varepsilon)S_{rk} - s_{rk}. \qquad (5.36)$$
Summing over r and k,
$$S(\mathcal{A}^c) \ge \sum_{k=1}^{K}\sum_{r=1}^{P_k} (1-\varepsilon)S_{rk} \quad (\text{ignoring second-order terms}) \qquad (5.37)$$
$$= (1-\varepsilon)S \qquad (5.38)$$
$$\ge (1-\varepsilon)\,\frac{1 - 2\varepsilon\sqrt{\frac{\lambda T}{M}}}{1 - \varepsilon\sqrt{\frac{\lambda T}{M}}}\,\mathrm{OPT} \quad (\text{from } (5.34)) \qquad (5.39)$$
$$\ge \left(1 - \varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}. \qquad (5.40)$$
This lemma states that if $u^*$ produces an unconstrained profit on the entire set of tasks that is proportionally close to that on the training set $\mathcal{A}$ for each (r, k), then we obtain a performance bound on the profit on the non-training set, i.e., $S(u^*,\mathcal{A}^c)$, for a given M.
Lemma 4 is used in our main theorem below, for which we need to define $P_{\max} = \max_k P_k$ and $c_{\max} = \max_{r,k} p_{rk}$.
Definition 1. $\Lambda \subseteq [0,1]^{KP_{\max}}$ is a (y, ε)-net for a scheduling rule $y_{jrk}(\cdot)$ and some ε > 0 if, for all $u \in [0,1]^{KP_{\max}}$, there exists a $u' \in \Lambda$ such that $|y_{jrk}(u) - y_{jrk}(u')| \le \varepsilon$ for all j, r, k, where $y_{jrk}(u)$ is the scheduling decision obtained by applying u in the absence of load constraints.
Theorem 8. If $\frac{\mathrm{OPT}}{c_{\max}} \ge \frac{KP_{\max}\ln(KP_{\max}|\Lambda|/\varepsilon)}{\varepsilon^3}$, then we have
$$\mathbb{E}_M\big[\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M]\big] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\lambda T}\,\mathbb{E}_M\!\left[\frac{1}{\sqrt{M}}\right]\right)\mathrm{OPT}.$$
Proof. For each 1 ≤ r ≤ $P_k$, 1 ≤ k ≤ K, and u ∈ Λ, we define a bad event $B_{r,k,u}$ as one that satisfies $|R_{rk}(u,\mathcal{A}) - \varepsilon R_{rk}(u)| > s_{rk,u}$, where
$$s_{rk,u} = \frac{2}{3}c_{\max}\ln\!\left(\frac{2}{\delta}\right) + \|R_{rk}(u)\|_2\sqrt{2\left(\frac{\varepsilon\lambda T}{M}\right)\ln\frac{2}{\delta}}. \qquad (5.41)$$
The form of (5.41) is inspired by Lemma 3 from [71], except that $|\mathcal{A}|/M \neq \varepsilon$ in our case due to the randomness of task arrivals.
Similar to [71], we can show that $\Pr(B_{r,k,u}) \le \delta$ for every 0 < δ < 1. To prove our theorem, we set $\delta = \frac{\varepsilon}{KP_{\max}|\Lambda|}$.
We next show that the choice of $s_{rk,u}$ in (5.41) satisfies the hypothesis in Lemma 4. We sum up (5.41) over r and k. The first term on the right-hand side is bounded by $O\!\left(KP_{\max}c_{\max}\ln\frac{1}{\delta}\right)$. Since $\frac{\mathrm{OPT}}{c_{\max}} \ge \frac{KP_{\max}\ln(1/\delta)}{\varepsilon^3}$, this is less than $\varepsilon^3\,\mathrm{OPT}$. In order to bound the contribution of the second terms, we use the following two inequalities:
$$\|R_{rk}(u)\|_2 \le \sqrt{c_{\max}\,R_{rk}(u)} \qquad (5.42)$$
and
$$\sum_{r,k}\sqrt{R_{rk}(u)} \le \sqrt{KP_{\max}\sum_{r,k}R_{rk}(u)} = \sqrt{KP_{\max}\,R(u)}. \qquad (5.43)$$
Now, combining these, we have
$$\sum_{r,k}\|R_{rk}(u)\|_2\sqrt{2\left(\frac{\varepsilon\lambda T}{M}\right)\ln\!\left(\frac{2}{\delta}\right)} \le \sqrt{KP_{\max}c_{\max}\,\frac{\varepsilon\lambda T}{M}\ln\!\left(\frac{1}{\delta}\right)R(u)} \le \sqrt{\frac{\varepsilon^4\lambda T}{M}\,\mathrm{OPT}\,R(u)}, \qquad (5.44)$$
where the last inequality is due to $\frac{\mathrm{OPT}}{c_{\max}} \ge \frac{KP_{\max}\ln(1/\delta)}{\varepsilon^3}$. Thus, we see that
$$\sum_{r,k} s_{rk,u} \le \varepsilon^2\sqrt{\frac{\lambda T}{M}}\,\max\{\mathrm{OPT},\, R\}. \qquad (5.45)$$
Hence, setting $s_{rk} = s_{rk,u}$ gives us (5.24).
Suppose $u^* \in \Lambda$. Then, using the fact that $\Pr(B_{r,k,u}) \le \frac{\varepsilon}{KP_{\max}|\Lambda|}$ and by simply applying a union bound over all u ∈ Λ and r, k, we have that with probability ≥ 1 − ε, none of the events $B_{r,k,u}$ happen. Thus, we can apply Lemma 4 to conclude that
$$\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M] \ge (1-\varepsilon)\left(1 - \varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT} \qquad (5.46)$$
$$\ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}. \qquad (5.47)$$
Alternatively, suppose $u^* \notin \Lambda$. Because Λ is a (y, ε)-net, there exists $u' \in \Lambda$ such that $|y_{jrk}(u^*) - y_{jrk}(u')| \le \varepsilon$. Consequently, we can prove that
$$|R_{rk}(u',\mathcal{A}) - R_{rk}(u^*,\mathcal{A})| \le \sum_{j\in\mathcal{A}} p_{rk}\,t_{jrk}\,|y_{jrk}(u') - y_{jrk}(u^*)| \le \varepsilon C(\mathcal{A}),$$
where $C(\mathcal{A})$ is a constant for the set of tasks $\mathcal{A}$. Similarly, we have a constant C for the entire set of tasks. Now, by using this and the triangle inequality, we have
$$|R_{rk}(u^*,\mathcal{A}) - \varepsilon R_{rk}(u^*)| \le |R_{rk}(u',\mathcal{A}) - \varepsilon R_{rk}(u')| + |R_{rk}(u',\mathcal{A}) - R_{rk}(u^*,\mathcal{A})| + \varepsilon|R_{rk}(u') - R_{rk}(u^*)| \qquad (5.48)$$
$$\le s_{rk,u'} + \varepsilon C(\mathcal{A}) + \varepsilon^2 C \qquad (5.49)$$
$$\le s_{rk,u'} + O(\varepsilon^2 C). \qquad (5.50)$$
Summing over r and k, and applying Lemma 4, we again have
$$\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}. \qquad (5.51)$$
Combining both cases, we get $\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}$.
We then take the expectation over M to obtain
$$\mathbb{E}_M\big[\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M]\big] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\lambda T}\,\mathbb{E}_M\!\left[\frac{1}{\sqrt{M}}\right]\right)\mathrm{OPT}. \qquad (5.52)$$
Note that the conclusion of Theorem 8 gives us a bound on the expected performance of TDOT.
Remark 6. The condition on $\frac{\mathrm{OPT}}{c_{\max}}$ in Theorem 8 is easily met in all practical scenarios, as it is the ratio of the total profit across all tasks and processors to the profit per unit time on a single processor, which is generally a large value.
Furthermore, by Jensen's inequality, we have $\mathbb{E}_M\!\left[\frac{1}{\sqrt{M}}\right] \le \sqrt{\mathbb{E}_M\!\left[\frac{1}{M}\right]}$. Using this in Theorem 8, we now have
$$\mathbb{E}_M\big[\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M]\big] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\lambda T\,\mathbb{E}_M\!\left[\frac{1}{M}\right]}\right)\mathrm{OPT} \ge \left(1 - 2\varepsilon - \sqrt{\varepsilon\,\mathbb{E}_M\!\left[\frac{\varepsilon\lambda T}{M}\right]}\right)\mathrm{OPT}. \qquad (5.53)$$
Thus, we can see that the profit performance gap depends on $\mathbb{E}_M\!\left[\frac{\varepsilon\lambda T}{M}\right]$, which is the expected proportion of tasks in the training set. Furthermore, using a lower bound on $\mathbb{E}_M\!\left[\frac{1}{M}\right]$ when M is Poisson [83], we have the following corollary.
Corollary 4. Assume the condition of Theorem 8 is met. If M has a Poisson distribution with mean λT, we have
$$\mathbb{E}_M\big[\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M]\big] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{(3+\lambda T)(1-e^{-\lambda T})}{\lambda T}}\right)\mathrm{OPT}. \qquad (5.54)$$
This corollary allows precise numerical calculation. As an example, if λ = 0.1, T = 1000 s, and we choose ε = 0.15, then $\mathbb{E}_M\big[\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M]\big] \ge 0.5\,\mathrm{OPT}$.
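The numerical claim can be checked directly by evaluating the right-hand side of (5.54); a short Python computation using the example's values:

```python
import math

lam, T, eps = 0.1, 1000.0, 0.15
lam_T = lam * T

# Coefficient of OPT on the right-hand side of Corollary 4, eq. (5.54)
gap = eps * math.sqrt((3 + lam_T) * (1 - math.exp(-lam_T)) / lam_T)
coeff = 1 - 2 * eps - gap   # approximately 0.548, hence at least 0.5 OPT
```

Since the coefficient evaluates to roughly 0.55, the expected profit is indeed at least 0.5 OPT for these parameters.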
Corollary 5. For λ → ∞, (5.54) reduces to
$$\mathbb{E}_M\big[\mathbb{E}[S(u^*,\mathcal{A}^c)\,|\,M]\big] \ge (1 - 3\varepsilon)\,\mathrm{OPT}. \qquad (5.55)$$
When the task arrival rate is high, there are enough tasks for training so that ε can be set
small. In this case, Corollary 5 suggests that TDOT can perform close to an optimal offline
algorithm.
Remark 7. Instead of a single load constraint L, we can consider processor-dependent load
constraints $L_{rk}$, ∀r, k, and rewrite equation (5.1) as follows:
$$\sum_{j=1}^{M} t_{jrk}\,x_{jrk} \le L_{rk}, \quad \forall r \in \{1,\ldots,P_k\},\ k \in \{1,\ldots,K\}.$$
We note that all results can be trivially extended to this case.
Remark 8. Our performance bound is computed purely based on the profit from the non-training
set of tasks Ac, but it is compared against the profit of an upper-bound offline algorithm that
considers the entire set of tasks. Consequently, any additional profit we obtain on training set
A is a bonus and further improves profit performance. We note that the value of ε we choose
splits the tasks into sets A and Ac, and consequently impacts profit performance. We study the
effect of ε on the total profit in Section 5.5.
5.2.4 Complexity Analysis
An LP can be solved in $O(n^{3.5}B)$ time, where n is the number of variables and B is the number of bits in the input [77]. Thus, for a given M, the dual minimization during the training phase of TDOT can be done in $O((\varepsilon M P)^{3.5}B)$ time, where $P = \sum_{k=1}^{K} P_k$ is the total number of processors. On the other hand, the time complexity of the exploitation phase is $O((1-\varepsilon)M)$. Thus, the time complexity of TDOT is dominated by LP solving in the training phase.
Remark 9. We note that $|\mathcal{A}| = \varepsilon M$, and hence the above complexity is equivalent to $O((|\mathcal{A}|P)^{3.5}B)$. This is usually small, since the number of training tasks, i.e., $|\mathcal{A}|$, is much smaller relative to M.
5.3 Modified Algorithm Without Partial-Task Profit Taking
The TDOT algorithm proposed in the previous section assumes that we may obtain profit from partially-completed tasks if the load constraint on the scheduled processor is met. Hence, we propose a variant, namely TDOT with Greedy scheduling (TDOT-G), for scenarios where profit can be obtained only for tasks that have fully completed execution while meeting the load constraints. This algorithm consists of the same two broad phases as TDOT, namely the training phase and the exploitation phase.
In this version, if an incoming task cannot be scheduled on the maximum profit processor,
we try to schedule it on the second maximum profit processor, and then the third maximum
profit processor, and so on. We expect this technique to result in better practical performance
than simply discarding a task that cannot be scheduled on the maximum profit processor as we
greedily try to ensure that the current task is at least executed on some processor, which will
produce some profit. The following are the steps of this algorithm.
• Step 1: Observe the processing times of the first ⌊ελT⌋ arriving tasks, $\mathcal{A}$.
• Step 2: Find weights $u^* = \arg\min_{u\ge 0} D(u,\mathcal{A})$.
• Step 3: For each incoming task j, we initialize $\mathcal{P}$ to be the total set of processors.
– Step 3a: Schedule the task to processor $(r',k') = \arg\max_{r,k\in\mathcal{P}} (p_{rk} - u^*_{rk})\,t_{jrk}$, if (1 − ε)L is not violated on processor r′ in CS k′. If (1 − ε)L is violated on processor r′ in CS k′, go to Step 3b. This violation is checked by using load variables $l_{rk}$, ∀r, k, similar to TDOT.
– Step 3b: Remove (r′, k′) from P and repeat Step 3a unless P is empty, i.e., the task
cannot be scheduled on any processor.
• Step 4: Stop at the end of duration T .
The overall complexity of TDOT-G is still dominated by the LP-solving step, and is given by $O((|\mathcal{A}|P)^{3.5}B)$, as shown in Section 5.2.4.
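Step 3 above amounts to a ranked greedy fallback over processors. The following is a minimal Python sketch under stated assumptions: hypothetical names, a flattened processor index, and a single load budget; it is not the thesis implementation.

```python
import numpy as np

def tdot_g_schedule(task_t, p, u_star, load, cap):
    """TDOT-G Step 3: try processors in decreasing weighted-profit order and
    schedule the whole task on the first one whose load budget still fits it."""
    score = (p - u_star) * task_t
    for q in np.argsort(-score):           # best-scoring processor first
        if load[q] + task_t[q] <= cap:     # (1 - eps) * L not violated
            load[q] += task_t[q]
            return int(q), 1.0             # task fully scheduled
    return None, 0.0                       # all processors exhausted: discard
```

Unlike TDOT, no fractional assignment is made; a task that fits nowhere earns zero profit, which matches the no-partial-profit model of this section.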
5.4 TDOT with Data Requirements
In this section, we allow each arriving task j to have data requirements djrk on processor r
in CS k, in addition to the processing requirements tjrk defined in Section 5.1. We assume
that, similar to the task processing requirements, these data requirements are only known once the task becomes available for dispatch. Each processor r in CS k also has a data load constraint Q, similar to the processing load constraint L. Problem (5.3) can then be reformulated to accommodate these as follows:
$$\begin{aligned}
\max_{\{x_{jrk}\}}\quad & \sum_{j=1}^{M}\sum_{k=1}^{K}\sum_{r=1}^{P_k} p_{rk}\,t_{jrk}\,x_{jrk} + \sum_{j=1}^{M}\sum_{k=1}^{K}\sum_{r=1}^{P_k} q_{rk}\,d_{jrk}\,x_{jrk} && (5.56)\\
\text{s.t.}\quad & \sum_{j=1}^{M} t_{jrk}\,x_{jrk} \le L, \quad \forall r \in \{1,\ldots,P_k\},\ k \in \{1,\ldots,K\}, && (5.57)\\
& \sum_{j=1}^{M} d_{jrk}\,x_{jrk} \le Q, \quad \forall r \in \{1,\ldots,P_k\},\ k \in \{1,\ldots,K\}, && (5.58)\\
& \sum_{k=1}^{K}\sum_{r=1}^{P_k} x_{jrk} \le 1, \quad \forall j \in \{1,\ldots,M\}, && (5.59)\\
& x_{jrk} \in \{0,1\}, \quad \forall j \in \{1,\ldots,M\},\ r \in \{1,\ldots,P_k\},\ k \in \{1,\ldots,K\}. && (5.60)
\end{aligned}$$
5.4.1 Offline Solution through Lagrange Relaxation
Similar to Section 5.2, the dual problem of the LP relaxation of (5.56) is given by
$$\begin{aligned}
\min_{\{u_{rk}\ge0,\,w_{rk}\ge0,\,v_j\ge0\}}\quad & \sum_{k=1}^{K}\sum_{r=1}^{P_k} u_{rk}L + \sum_{k=1}^{K}\sum_{r=1}^{P_k} w_{rk}Q + \sum_{j=1}^{M} v_j && (5.61)\\
\text{s.t.}\quad & u_{rk}t_{jrk} + w_{rk}d_{jrk} + v_j \ge p_{rk}t_{jrk} + q_{rk}d_{jrk},\\
& \quad \forall j \in \{1,\ldots,M\},\ r \in \{1,\ldots,P_k\},\ k \in \{1,\ldots,K\}, && (5.62)
\end{aligned}$$
where urk, wrk and vj are Lagrange multipliers corresponding to constraints (5.57), (5.58) and
(5.59) respectively. We can rewrite constraint (5.62) as follows:
$$v_j \ge \left(1-\frac{u_{rk}}{p_{rk}}\right)p_{rk}\,t_{jrk} + \left(1-\frac{w_{rk}}{q_{rk}}\right)q_{rk}\,d_{jrk}, \quad \forall j \in \{1,\ldots,M\},\ r \in \{1,\ldots,P_k\},\ k \in \{1,\ldots,K\}. \qquad (5.63)$$
In other words, given the optimal $z = \{u_{rk}, w_{rk}, \forall r, k\}$, we should assign each task j to the processor given by $\arg\max_{r,k} \left(1-\frac{u_{rk}}{p_{rk}}\right)p_{rk}t_{jrk} + \left(1-\frac{w_{rk}}{q_{rk}}\right)q_{rk}d_{jrk}$.
Thus, the dual problem can be rewritten as
$$\min_{z\ge0}\ D(z), \qquad (5.64)$$
where
$$D(z) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} u_{rk}L + \sum_{k=1}^{K}\sum_{r=1}^{P_k} w_{rk}Q + \sum_{j=1}^{M}\max_{r,k}\left[\left(1-\frac{u_{rk}}{p_{rk}}\right)p_{rk}t_{jrk} + \left(1-\frac{w_{rk}}{q_{rk}}\right)q_{rk}d_{jrk}\right]. \qquad (5.65)$$
5.4.2 Online Scheduling Algorithm with Partial-Task Profit Taking
Now we consider the online problem where tasks arrive dynamically, and we need to learn the optimal z values defined in (5.65). We modify the TDOT algorithm proposed in Section 5.2.2 for this problem as follows.
Training
We observe the first ⌊ελT⌋ arriving tasks, denoted by $\mathcal{A} = \{1, \ldots, \lfloor\varepsilon\lambda T\rfloor\}$. For each task j ∈ $\mathcal{A}$, we record its computing requirement and hence $t_{jrk}$, ∀r, k. These tasks may be arbitrarily scheduled. For simplicity, we may ignore for now the profit earned from these tasks and set $x_{jrk} = 0$, ∀j ∈ $\{1, \ldots, \lfloor\varepsilon\lambda T\rfloor\}$, r, k, which is shown later not to affect our derivations of the competitive ratios for TDOT.
If we allocate only εL and εQ loads to A, then we can write the dual problem objective
(5.65) purely for A as follows.
$$D(z,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} u_{rk}\,\varepsilon L + \sum_{k=1}^{K}\sum_{r=1}^{P_k} w_{rk}\,\varepsilon Q + \sum_{j\in\mathcal{A}}\max_{r,k}\left[\left(1-\frac{u_{rk}}{p_{rk}}\right)p_{rk}t_{jrk} + \left(1-\frac{w_{rk}}{q_{rk}}\right)q_{rk}d_{jrk}\right]. \qquad (5.66)$$
Since the dual of an LP is also an LP, we can use any existing LP solver to efficiently obtain $z^* = \arg\min_{z\ge 0} D(z,\mathcal{A})$.
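The training LP now carries two budget terms and two weight vectors. Extending the earlier single-budget sketch (hypothetical names, flattened processors, SciPy's `linprog`), constraint (5.62) becomes one inequality row per (task, processor) pair:

```python
import numpy as np
from scipy.optimize import linprog

def train_weights_data(t, d, p, q, eps_L, eps_Q):
    """Solve the dual LP (5.66) over the training set.

    t, d: (n_tasks, n_proc) processing and data requirements
    p, q: (n_proc,) per-unit profits for processing and data
    Decision variables are [u (P), w (P), v (n)]; each v_j is the epigraph
    variable for the inner max in (5.66).
    """
    n, P = t.shape
    c = np.concatenate([np.full(P, eps_L), np.full(P, eps_Q), np.ones(n)])
    A_ub = np.zeros((n * P, 2 * P + n))
    b_ub = np.zeros(n * P)
    for j in range(n):
        for k in range(P):
            row = j * P + k
            # (5.62): u_k*t_jk + w_k*d_jk + v_j >= p_k*t_jk + q_k*d_jk
            A_ub[row, k] = -t[j, k]
            A_ub[row, P + k] = -d[j, k]
            A_ub[row, 2 * P + j] = -1.0
            b_ub[row] = -(p[k] * t[j, k] + q[k] * d[j, k])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * P + n))
    return res.x[:P], res.x[P:2 * P]       # (u*, w*)
```

The returned pair (u*, w*) forms the weight vector z* used in the exploitation phase below.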
Exploitation
Let $\mathcal{A}^c$ denote all tasks arriving after task ⌊ελT⌋. Now, for each arriving task j ∈ $\mathcal{A}^c$, we apply weights $z^*$ to obtain the scheduling decision as follows. We set
$$(r', k') = \arg\max_{r,k}\left[\left(1-\frac{u^*_{rk}}{p_{rk}}\right)p_{rk}t_{jrk} + \left(1-\frac{w^*_{rk}}{q_{rk}}\right)q_{rk}d_{jrk}\right]. \qquad (5.67)$$
If task j can be scheduled on r′ in CS k′ without violating either (1 − ε)L or (1 − ε)Q, we schedule the entire task to processor r′ in CS k′. Otherwise, we schedule just the fraction of the task that can be scheduled on processor r′ in CS k′ while meeting both (1 − ε)L and (1 − ε)Q.
We achieve this by defining load variables $l_{rk}$ and $m_{rk}$ for every r and k. If task j satisfies both $t_{jr'k'} < (1-\varepsilon)L - l_{r'k'}$ and $d_{jr'k'} < (1-\varepsilon)Q - m_{r'k'}$, we schedule the entire task, i.e., $x_{jr'k'} = 1$, and update the load variables $l_{r'k'} = l_{r'k'} + t_{jr'k'}$ and $m_{r'k'} = m_{r'k'} + d_{jr'k'}$. Otherwise, for task j, we execute just an $x^l_{jr'k'} = \frac{(1-\varepsilon)L - l_{r'k'}}{t_{jr'k'}}$ fraction of the task, store an $x^m_{jr'k'} = \frac{(1-\varepsilon)Q - m_{r'k'}}{d_{jr'k'}}$ fraction of the data, and update the load variables $l_{r'k'} = l_{r'k'} + x^l_{jr'k'}\,t_{jr'k'}$ and $m_{r'k'} = m_{r'k'} + x^m_{jr'k'}\,d_{jr'k'}$. We stop at the end of duration T.
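The two-budget exploitation step can be sketched as below (hypothetical Python, flattened processor index, not the thesis implementation); the fractions mirror the $x^l$ and $x^m$ update rules above, and the score $(p - u^*)t + (q - w^*)d$ equals the bracketed expression in (5.67).

```python
import numpy as np

def tdot_data_schedule(task_t, task_d, p, q, u_star, w_star, l, m, cap_L, cap_Q):
    """One TDOT exploitation step with processing and data budgets.

    Picks the processor maximising the weighted profit in (5.67); schedules
    the whole task if both budgets fit, else only the fitting fractions.
    """
    score = (p - u_star) * task_t + (q - w_star) * task_d
    k = int(score.argmax())
    if l[k] + task_t[k] <= cap_L and m[k] + task_d[k] <= cap_Q:
        xl = xm = 1.0                                  # whole task fits
    else:
        xl = max(0.0, min(1.0, (cap_L - l[k]) / task_t[k]))
        xm = max(0.0, min(1.0, (cap_Q - m[k]) / task_d[k]))
    l[k] += xl * task_t[k]
    m[k] += xm * task_d[k]
    return k, xl, xm
```

As before, applying this to every arrival until time T completes the exploitation phase.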
5.4.3 Performance Bound Analysis
Let $S(z^*,\mathcal{A}^c)$ be the profit obtained by TDOT on the non-training set $\mathcal{A}^c$. We provide a conditional performance bound on $S(z^*,\mathcal{A}^c)$ with respect to the offline optimum OPT in Lemma 5. We define some functions similar to Section 5.2.3. We define $R(z^*)$ as the profit obtained due to the total processing time in the absence of load constraints and by applying weights $z^*$ to the entire set of M tasks, and $R_{rk}(z^*)$ as the contribution of processor r in CS k to $R(z^*)$. We further define $R_{rk}(z^*,\mathcal{A})$ similarly to $R_{rk}(z^*)$, except over just the set of tasks $\mathcal{A}$. Similarly, we define $C(z^*)$, $C_{rk}(z^*)$, and $C_{rk}(z^*,\mathcal{A})$ as the corresponding profit obtained due to the data requirements. Let $y_{jr'k'}(z^*)$ be the scheduling decision in the absence of load constraints L and Q, obtained by applying weights $z^*$ to the entire set of M tasks.
We now prove a performance bound on the revenue obtained on the test set $\mathcal{A}^c$ with respect to the total offline solution OPT.
Lemma 5. For any given M number of tasks that arrive within duration T, if we have
$$\sum_{k=1}^{K}\sum_{r=1}^{P_k}\max\big\{|R_{rk}(z,\mathcal{A}) + C_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z) - \varepsilon C_{rk}(z)|,\ |R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)|,\ |C_{rk}(z,\mathcal{A}) - \varepsilon C_{rk}(z)|,\ |(R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)) - (C_{rk}(z,\mathcal{A}) - \varepsilon C_{rk}(z))|\big\} \le \varepsilon^2\sqrt{\frac{\lambda T}{M}}\,\max\{\mathrm{OPT},\ R(z^*) + C(z^*)\}, \qquad (5.68)$$
then $S(z^*,\mathcal{A}^c) \ge \left(1 - \varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}$.
Proof. We first define a few functions for a given M, for the purposes of the proof. Let $y_{jrk}(z^*)$ be the scheduling decision in the absence of load constraints L and Q, obtained by applying weights $z^*$ to the entire set of M tasks. Specifically, ∀j, r, k, $y_{jrk}(z^*) = 1$ if
$$(r,k) = \arg\max_{r',k'}\left[\left(1-\frac{u^*_{r'k'}}{p_{r'k'}}\right)p_{r'k'}t_{jr'k'} + \left(1-\frac{w^*_{r'k'}}{q_{r'k'}}\right)q_{r'k'}d_{jr'k'}\right]$$
and 0 otherwise. Then, we obtain
$$R(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} R_{rk}(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j=1}^{M} p_{rk}\,t_{jrk}\,y_{jrk}(z^*). \qquad (5.69)$$
Note that Rrk(z∗) is the profit obtained due to utilizing processor r in CS k. Similarly, by
applying this to just A, we have
$$R(z^*,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} R_{rk}(z^*,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j\in\mathcal{A}} p_{rk}\,t_{jrk}\,y_{jrk}(z^*). \qquad (5.70)$$
Similarly, we have
$$C(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} C_{rk}(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j=1}^{M} q_{rk}\,d_{jrk}\,y_{jrk}(z^*) \qquad (5.71)$$
and
$$C(z^*,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} C_{rk}(z^*,\mathcal{A}) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j\in\mathcal{A}} q_{rk}\,d_{jrk}\,y_{jrk}(z^*). \qquad (5.72)$$
For simplicity, we also define the function $Y_{rk}(z^*) = R_{rk}(z^*) + C_{rk}(z^*)$, ∀r, k, and
$$Y(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k} Y_{rk}(z^*) = R(z^*) + C(z^*). \qquad (5.73)$$
We define the contribution of processor r in CS k to the dual objective (5.65) as
$$D_{rk}(z^*) = u^*_{rk}L + w^*_{rk}Q + \left(1-\frac{u^*_{rk}}{p_{rk}}\right)R_{rk}(z^*) + \left(1-\frac{w^*_{rk}}{q_{rk}}\right)C_{rk}(z^*), \qquad (5.74)$$
and to the dual objective over just the set of tasks $\mathcal{A}$ as
$$D_{rk}(z^*,\mathcal{A}) = u^*_{rk}\,\varepsilon L + w^*_{rk}\,\varepsilon Q + \left(1-\frac{u^*_{rk}}{p_{rk}}\right)R_{rk}(z^*,\mathcal{A}) + \left(1-\frac{w^*_{rk}}{q_{rk}}\right)C_{rk}(z^*,\mathcal{A}). \qquad (5.75)$$
Let xjrk(z∗) be the scheduling decision obtained by applying z∗ in the presence of load
constraints. We define S(z∗) as the profit obtained while applying z∗ to the entire set of tasks
in the presence of load constraints. We can write S(z∗) as follows.
$$S(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j=1}^{M} p_{rk}\,t_{jrk}\,x_{jrk}(z^*) + \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j=1}^{M} q_{rk}\,d_{jrk}\,x_{jrk}(z^*) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\min\{p_{rk}L + C_{rk}(z^*),\ q_{rk}Q + R_{rk}(z^*),\ p_{rk}L + q_{rk}Q,\ Y_{rk}(z^*)\}. \qquad (5.76)$$
This is because the tasks are divisible and TDOT assigns a fraction of a task to ensure that L and Q are exactly met when $\sum_{j=1}^{M} t_{jrk}\,y_{jrk}(z^*) > L$ and $\sum_{j=1}^{M} d_{jrk}\,y_{jrk}(z^*) > Q$. We again define the contribution of processor r in CS k to $S(z^*)$ as
$$S_{rk}(z^*) = \min\{p_{rk}L + C_{rk}(z^*),\ q_{rk}Q + R_{rk}(z^*),\ p_{rk}L + q_{rk}Q,\ Y_{rk}(z^*)\}. \qquad (5.77)$$
The profit obtained by TDOT can be written as
$$S(z^*,\mathcal{A}^c) = \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j\in\mathcal{A}^c} p_{rk}\,t_{jrk}\,x_{jrk}(z^*) + \sum_{k=1}^{K}\sum_{r=1}^{P_k}\sum_{j\in\mathcal{A}^c} q_{rk}\,d_{jrk}\,x_{jrk}(z^*), \qquad (5.78)$$
where the contribution of processor r in CS k to $S(z^*,\mathcal{A}^c)$ is
$$S_{rk}(z^*,\mathcal{A}^c) = \min\{(1-\varepsilon)p_{rk}L + C_{rk}(z^*,\mathcal{A}^c),\ (1-\varepsilon)q_{rk}Q + R_{rk}(z^*,\mathcal{A}^c),\ (1-\varepsilon)p_{rk}L + (1-\varepsilon)q_{rk}Q,\ Y_{rk}(z^*,\mathcal{A}^c)\}. \qquad (5.79)$$
For simplicity, for the rest of the proof, we drop $z^*$ from all functions. We also define some $s_{rk}$, ∀r, k, such that
$$\max\big\{|R_{rk}(\mathcal{A}) + C_{rk}(\mathcal{A}) - \varepsilon R_{rk} - \varepsilon C_{rk}|,\ |R_{rk}(\mathcal{A}) - \varepsilon R_{rk}|,\ |C_{rk}(\mathcal{A}) - \varepsilon C_{rk}|,\ |(R_{rk}(\mathcal{A}) - \varepsilon R_{rk}) - (C_{rk}(\mathcal{A}) - \varepsilon C_{rk})|\big\} \le s_{rk}, \qquad (5.80)$$
and
$$\sum_{r,k} s_{rk} \le \varepsilon^2\sqrt{\frac{\lambda T}{M}}\,\max\{\mathrm{OPT},\, Y\}, \qquad (5.81)$$
which is given by the hypothesis of the lemma in (5.68). Additionally, we set $a_{rk} = s_{rk}/\varepsilon$.
We first prove that for all r, k,
$$\max\{Y_{rk},\, D_{rk}\} - S_{rk} \le a_{rk}. \qquad (5.82)$$
We consider the following four cases.
• Case 1: $u^*_{rk} > 0$ and $w^*_{rk} > 0$.
We can see, from (5.74), that
$$D_{rk} \le \max\{p_{rk}L + C_{rk},\ q_{rk}Q + R_{rk},\ p_{rk}L + q_{rk}Q,\ Y_{rk}\}. \qquad (5.83)$$
Consequently,
$$\max\{Y_{rk},\, D_{rk}\} - S_{rk} \le \max\{p_{rk}L + C_{rk},\ q_{rk}Q + R_{rk},\ p_{rk}L + q_{rk}Q,\ Y_{rk}\} - \min\{p_{rk}L + C_{rk},\ q_{rk}Q + R_{rk},\ p_{rk}L + q_{rk}Q,\ Y_{rk}\} \qquad (5.84)$$
$$\le \max\{|p_{rk}L + q_{rk}Q - Y_{rk}|,\ |q_{rk}Q - C_{rk}|,\ |p_{rk}L - R_{rk}|,\ |p_{rk}L + C_{rk} - q_{rk}Q - R_{rk}|\} \qquad (5.85)$$
$$\le \max\{|p_{rk}L + q_{rk}Q - Y_{rk}|,\ |q_{rk}Q - C_{rk}| + |p_{rk}L - R_{rk}|,\ |p_{rk}L + C_{rk} - q_{rk}Q - R_{rk}|\} \qquad (5.86)$$
Since $u^*_{rk} > 0$ and $w^*_{rk} > 0$, by the complementary slackness conditions on the LP for just the tasks in $\mathcal{A}$, we have $R_{rk}(\mathcal{A}) = \varepsilon p_{rk}L$ and $C_{rk}(\mathcal{A}) = \varepsilon q_{rk}Q$. Thus, from (5.80), we have
$$|\varepsilon p_{rk}L + \varepsilon q_{rk}Q - \varepsilon R_{rk} - \varepsilon C_{rk}| \le s_{rk}, \qquad (5.87)$$
which implies that $|p_{rk}L + q_{rk}Q - Y_{rk}| \le a_{rk}$. Similarly, we also have
$$|\varepsilon p_{rk}L + \varepsilon C_{rk} - \varepsilon q_{rk}Q - \varepsilon R_{rk}| \le s_{rk}, \qquad (5.88)$$
which implies that $|p_{rk}L + C_{rk} - q_{rk}Q - R_{rk}| \le a_{rk}$. Similarly, we can prove that
$$|q_{rk}Q - C_{rk}| \le a_{rk} \qquad (5.89)$$
and
$$|p_{rk}L - R_{rk}| \le a_{rk}. \qquad (5.90)$$
From (5.86), we can now prove (5.82).
• Case 2: $u^*_{rk} > 0$ and $w^*_{rk} = 0$. This implies that
$$D_{rk} = u^*_{rk}L + \left(1-\frac{u^*_{rk}}{p_{rk}}\right)R_{rk} + C_{rk}. \qquad (5.91)$$
Thus,
$$D_{rk} \le \max\{p_{rk}L + C_{rk},\ Y_{rk}\} \qquad (5.92)$$
$$\le \max\{p_{rk}L + C_{rk},\ q_{rk}Q + R_{rk},\ p_{rk}L + q_{rk}Q,\ Y_{rk}\}. \qquad (5.93)$$
Now, similar to Case 1, we can prove (5.82).
• Case 3: $u^*_{rk} = 0$ and $w^*_{rk} > 0$. This implies that
$$D_{rk} = w^*_{rk}Q + \left(1-\frac{w^*_{rk}}{q_{rk}}\right)C_{rk} + R_{rk}. \qquad (5.94)$$
Thus,
$$D_{rk} \le \max\{q_{rk}Q + R_{rk},\ Y_{rk}\} \qquad (5.95)$$
$$\le \max\{p_{rk}L + C_{rk},\ q_{rk}Q + R_{rk},\ p_{rk}L + q_{rk}Q,\ Y_{rk}\}. \qquad (5.96)$$
Now, similar to Case 1, we can prove (5.82).
• Case 4: $u^*_{rk} = 0$ and $w^*_{rk} = 0$. We have
$$D_{rk} = Y_{rk} \qquad (5.97)$$
and, from (5.80),
$$R_{rk} \le \frac{R_{rk}(\mathcal{A})}{\varepsilon} + a_{rk} \le p_{rk}L + a_{rk} \quad (\text{by complementary slackness}), \qquad (5.98)$$
$$C_{rk} \le \frac{C_{rk}(\mathcal{A})}{\varepsilon} + a_{rk} \le q_{rk}Q + a_{rk} \quad (\text{by complementary slackness}), \qquad (5.99)$$
$$Y_{rk} \le \frac{R_{rk}(\mathcal{A}) + C_{rk}(\mathcal{A})}{\varepsilon} + a_{rk} \le p_{rk}L + q_{rk}Q + a_{rk} \quad (\text{by complementary slackness}). \qquad (5.100)$$
Thus, we get
$$S_{rk} + a_{rk} = \min\{p_{rk}L + C_{rk} + a_{rk},\ q_{rk}Q + R_{rk} + a_{rk},\ p_{rk}L + q_{rk}Q + a_{rk},\ Y_{rk} + a_{rk}\} \ge Y_{rk} \quad (\text{from } (5.98)\text{–}(5.100)) \quad = D_{rk} \quad (\text{from } (5.97)). \qquad (5.101)$$
Hence, (5.82) is proven for Case 4.
We can sum (5.82) over r and k to obtain
$$\max\{Y,\, D\} - S \le \sum_{k=1}^{K}\sum_{r=1}^{P_k} a_{rk}. \qquad (5.102)$$
Note that S ≤ OPT ≤ D, by weak duality. Therefore, from (5.102) and using (5.81), we can see that
$$Y - \mathrm{OPT} \le \frac{1}{\varepsilon}\sum_{k=1}^{K}\sum_{r=1}^{P_k} s_{rk} \le \varepsilon\sqrt{\frac{\lambda T}{M}}\,Y. \qquad (5.103)$$
Again using weak duality and (5.102), we have $\mathrm{OPT} - S \le \varepsilon\sqrt{\frac{\lambda T}{M}}\,Y$. Consequently, using (5.103),
$$S \ge \frac{1 - 2\varepsilon\sqrt{\frac{\lambda T}{M}}}{1 - \varepsilon\sqrt{\frac{\lambda T}{M}}}\,\mathrm{OPT}. \qquad (5.104)$$
Now, from (5.80), since $R_{rk}(\mathcal{A}^c) + C_{rk}(\mathcal{A}^c) = R_{rk} + C_{rk} - R_{rk}(\mathcal{A}) - C_{rk}(\mathcal{A})$, we see that $R_{rk}(\mathcal{A}^c) + C_{rk}(\mathcal{A}^c) > (1-\varepsilon)(R_{rk} + C_{rk}) - s_{rk}$, taking both cases into consideration. Using this, we can see that
$$Y_{rk}(\mathcal{A}^c) > (1-\varepsilon)Y_{rk} - s_{rk}, \qquad (5.105)$$
$$C_{rk}(\mathcal{A}^c) > (1-\varepsilon)C_{rk} - s_{rk}, \qquad (5.106)$$
and
$$R_{rk}(\mathcal{A}^c) > (1-\varepsilon)R_{rk} - s_{rk}. \qquad (5.107)$$
Applying these to (5.79), we have
$$S_{rk}(\mathcal{A}^c) = \min\{(1-\varepsilon)p_{rk}L + C_{rk}(\mathcal{A}^c),\ (1-\varepsilon)q_{rk}Q + R_{rk}(\mathcal{A}^c),\ (1-\varepsilon)(p_{rk}L + q_{rk}Q),\ Y_{rk}(\mathcal{A}^c)\} \qquad (5.108)$$
$$\ge \min\{(1-\varepsilon)p_{rk}L + (1-\varepsilon)C_{rk} - s_{rk},\ (1-\varepsilon)q_{rk}Q + (1-\varepsilon)R_{rk} - s_{rk},\ (1-\varepsilon)(p_{rk}L + q_{rk}Q),\ (1-\varepsilon)Y_{rk} - s_{rk}\} \qquad (5.109)$$
$$\ge (1-\varepsilon)S_{rk} - s_{rk}. \qquad (5.110)$$
Summing over r and k,
$$S(\mathcal{A}^c) \ge \sum_{k=1}^{K}\sum_{r=1}^{P_k} (1-\varepsilon)S_{rk} \quad (\text{ignoring second-order terms}) \qquad (5.111)$$
$$= (1-\varepsilon)S \qquad (5.112)$$
$$\ge (1-\varepsilon)\,\frac{1 - 2\varepsilon\sqrt{\frac{\lambda T}{M}}}{1 - \varepsilon\sqrt{\frac{\lambda T}{M}}}\,\mathrm{OPT} \quad (\text{from } (5.104)) \qquad (5.113)$$
$$\ge \left(1 - \varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}. \qquad (5.114)$$
Theorem 9. If $\frac{\mathrm{OPT}}{c_{\max}} \ge \frac{KP_{\max}\ln(KP_{\max}|\Lambda|/\varepsilon)}{\varepsilon^3}$ for some Λ that is a (y, ε)-net, then we have
$$\mathbb{E}_M\big[\mathbb{E}[S(z^*,\mathcal{A}^c)\,|\,M]\big] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\lambda T}\,\mathbb{E}_M\!\left[\frac{1}{\sqrt{M}}\right]\right)\mathrm{OPT}.$$
Proof. For each 1 ≤ r ≤ $P_k$, 1 ≤ k ≤ K, and z ∈ Λ, we define a bad event $B_{r,k,z}$ as one that satisfies $\max\{|R_{rk}(z,\mathcal{A}) + C_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z) - \varepsilon C_{rk}(z)|,\ |R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)|,\ |C_{rk}(z,\mathcal{A}) - \varepsilon C_{rk}(z)|,\ |(R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)) - (C_{rk}(z,\mathcal{A}) - \varepsilon C_{rk}(z))|\} > s_{rk,z}$. Thus, the probability of a bad event is given by
$$\Pr(B_{r,k,z}) \le 4\delta, \qquad (5.115)$$
where
$$s_{rk,z} = \frac{2}{3}c_{\max}\ln\!\left(\frac{2}{\delta}\right) + \|Y_{rk}(z)\|_2\sqrt{2\left(\frac{\varepsilon\lambda T}{M}\right)\ln\frac{2}{\delta}}, \qquad (5.116)$$
for 0 < δ < 1. The form of (5.116) is inspired by Lemma 3 from [71], except that $|\mathcal{A}|/M \neq \varepsilon$ in our case due to the randomness of task arrivals. We use Lemma 3 from [71] to prove that
$$\Pr\big(|R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)| > p_{rk,z}\big) \le \delta, \qquad (5.117)$$
for
$$p_{rk,z} = \frac{2}{3}c_{\max}\ln\!\left(\frac{2}{\delta}\right) + \|R_{rk}(z)\|_2\sqrt{2\left(\frac{\varepsilon\lambda T}{M}\right)\ln\frac{2}{\delta}}. \qquad (5.118)$$
Since $p_{rk,z} \le s_{rk,z}$, this implies
$$\Pr(S_1) = \Pr\big(|R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)| > s_{rk,z}\big) \le \delta. \qquad (5.119)$$
Similarly, we can prove
$$\Pr(S_2) = \Pr\big(|C_{rk}(z,\mathcal{A}) - \varepsilon C_{rk}(z)| > s_{rk,z}\big) \le \delta, \qquad (5.120)$$
$$\Pr(S_3) = \Pr\big(|Y_{rk}(z,\mathcal{A}) - \varepsilon Y_{rk}(z)| > s_{rk,z}\big) \le \delta, \qquad (5.121)$$
and
$$\Pr(S_4) = \Pr\big(|(R_{rk}(z,\mathcal{A}) - \varepsilon R_{rk}(z)) - (C_{rk}(z,\mathcal{A}) - \varepsilon C_{rk}(z))| > s_{rk,z}\big) \le \delta. \qquad (5.122)$$
Thus, by the union bound,
$$\Pr(B_{r,k,z}) = \Pr(S_1 \cup S_2 \cup S_3 \cup S_4) \qquad (5.123)$$
$$\le \Pr(S_1) + \Pr(S_2) + \Pr(S_3) + \Pr(S_4) \qquad (5.124)$$
$$\le 4\delta. \qquad (5.125)$$
To prove our theorem, we set $\delta = \frac{\varepsilon}{KP_{\max}|\Lambda|}$. We show that the choice of $s_{rk,z}$ satisfies the hypothesis in Lemma 5. We sum up (5.116) over r and k. The first term on the right-hand side is bounded by $O\!\left(KP_{\max}c_{\max}\ln\frac{1}{\delta}\right)$. Since $\frac{\mathrm{OPT}}{c_{\max}} \ge \frac{KP_{\max}\ln(1/\delta)}{\varepsilon^3}$, this is less than $\varepsilon^3\,\mathrm{OPT}$. In order to bound the contribution of the second terms, we use the following two inequalities:
$$\|Y_{rk}(z)\|_2 \le \sqrt{c_{\max}\,Y_{rk}(z)} \qquad (5.126)$$
and
$$\sum_{r,k}\sqrt{Y_{rk}(z)} \le \sqrt{KP_{\max}\sum_{r,k}Y_{rk}(z)} = \sqrt{KP_{\max}\,Y(z)}. \qquad (5.127)$$
Now, combining these, we have
$$\sum_{r,k}\|Y_{rk}(z)\|_2\sqrt{2\left(\frac{\varepsilon\lambda T}{M}\right)\ln\!\left(\frac{2}{\delta}\right)} \le \sqrt{KP_{\max}c_{\max}\,\frac{\varepsilon\lambda T}{M}\ln\!\left(\frac{1}{\delta}\right)Y(z)} \le \sqrt{\frac{\varepsilon^4\lambda T}{M}\,\mathrm{OPT}\,Y(z)}, \qquad (5.128)$$
where the last inequality is due to $\frac{\mathrm{OPT}}{c_{\max}} \ge \frac{KP_{\max}\ln(1/\delta)}{\varepsilon^3}$. Thus, we see that
$$\sum_{r,k} s_{rk,z} \le \varepsilon^2\sqrt{\frac{\lambda T}{M}}\,\max\{\mathrm{OPT},\, Y\}. \qquad (5.129)$$
Hence, setting $s_{rk} = s_{rk,z}$ gives us (5.81).
Suppose $z^* \in \Lambda$. Then, using the fact that $\Pr(B_{r,k,z}) \le \frac{\varepsilon}{KP_{\max}|\Lambda|}$ and by simply applying a union bound over all z ∈ Λ and r, k, we have that with probability ≥ 1 − ε, none of the events $B_{r,k,z}$ happen. Thus, we can apply Lemma 5 to conclude that
$$\mathbb{E}[S(z^*,\mathcal{A}^c)\,|\,M] \ge (1-\varepsilon)\left(1 - \varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT} \qquad (5.130)$$
$$\ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}. \qquad (5.131)$$
Alternatively, suppose $z^* \notin \Lambda$. Because Λ is a (y, ε)-net, there exists $z' \in \Lambda$ such that $|y_{jrk}(z^*) - y_{jrk}(z')| \le \varepsilon$. Consequently, we can prove that
$$|R_{rk}(z',\mathcal{A}) - R_{rk}(z^*,\mathcal{A})| \le \sum_{j\in\mathcal{A}} p_{rk}\,t_{jrk}\,|y_{jrk}(z') - y_{jrk}(z^*)| \le \varepsilon C(\mathcal{A}),$$
where $C(\mathcal{A})$ is a constant for the set of tasks $\mathcal{A}$. Similarly, we have a constant C for the entire set of tasks. Now, by using this and the triangle inequality, we have
$$|R_{rk}(z^*,\mathcal{A}) - \varepsilon R_{rk}(z^*)| \le |R_{rk}(z',\mathcal{A}) - \varepsilon R_{rk}(z')| + |R_{rk}(z',\mathcal{A}) - R_{rk}(z^*,\mathcal{A})| + \varepsilon|R_{rk}(z') - R_{rk}(z^*)| \le s_{rk,z'} + \varepsilon C(\mathcal{A}) + \varepsilon^2 C \qquad (5.132)$$
$$\le s_{rk,z'} + O(\varepsilon^2 C). \qquad (5.133)$$
We can similarly prove bounds for $|C_{rk}(z^*,\mathcal{A}) - \varepsilon C_{rk}(z^*)|$, $|Y_{rk}(z^*,\mathcal{A}) - \varepsilon Y_{rk}(z^*)|$, and $|(R_{rk}(z^*,\mathcal{A}) - \varepsilon R_{rk}(z^*)) - (C_{rk}(z^*,\mathcal{A}) - \varepsilon C_{rk}(z^*))|$. Summing over r and k, and applying Lemma 5, we again have
$$\mathbb{E}[S(z^*,\mathcal{A}^c)\,|\,M] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}. \qquad (5.134)$$
Combining both cases, we get $\mathbb{E}[S(z^*,\mathcal{A}^c)\,|\,M] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\frac{\lambda T}{M}}\right)\mathrm{OPT}$.
We then take the expectation over M to obtain
$$\mathbb{E}_M\big[\mathbb{E}[S(z^*,\mathcal{A}^c)\,|\,M]\big] \ge \left(1 - 2\varepsilon - \varepsilon\sqrt{\lambda T}\,\mathbb{E}_M\!\left[\frac{1}{\sqrt{M}}\right]\right)\mathrm{OPT}. \qquad (5.135)$$
5.4.4 TDOT-G with Data Requirements
TDOT-G proposed in Section 5.3 can be modified to accommodate task data requirements by
modifying just Step 3a as follows:
Step 3a: Schedule each task to processor
$$(r',k') = \arg\max_{r,k\in\mathcal{P}}\left\{\left(1-\frac{u^*_{rk}}{p_{rk}}\right)p_{rk}t_{jrk} + \left(1-\frac{w^*_{rk}}{q_{rk}}\right)q_{rk}d_{jrk}\right\}, \qquad (5.136)$$
only if both (1 − ε)L and (1 − ε)Q are not violated on processor r′ in CS k′. If (1 − ε)L or (1 − ε)Q is violated on processor r′ in CS k′, go to Step 3b. This violation is checked by using load variables $l_{rk}$ and $m_{rk}$, ∀r, k, similar to TDOT.
5.5 Simulation Results
We investigate the performance of our proposed algorithms through extensive simulation, using i.i.d. task data as well as Google cluster traces with practical parameter values. We present the comparison targets and simulation setup in Sections 5.5.1 and 5.5.2, respectively. We study the non-training set profit performance on randomly-generated i.i.d. tasks in Section 5.5.3, Google-cluster tasks in Section 5.5.4, and the entire set of tasks in Section 5.5.5.
5.5.1 Comparison Targets
We use the following comparison targets to evaluate the performance of TDOT and TDOT-G:
• Logistic Regression - Greedy (LR-G): We use
$$(r',k') = \arg\max_{r,k}\{(p_{rk} - u_{rk})t_{jrk} + (q_{rk} - w_{rk})d_{jrk}\}$$
as the training labels for each task j in the training set $\mathcal{A}$. We then perform multi-class classification using logistic regression [84] to obtain the label for each non-training task, and schedule the task to the corresponding processor as long as (1 − ε)L and (1 − ε)Q are not violated. Else, we use a technique similar to TDOT-G. We set $\mathcal{P}$ to be the remaining set of processors. We schedule task j to processor $(r',k') = \arg\max_{r,k\in\mathcal{P}} p_{rk}t_{jrk} + q_{rk}d_{jrk}$, if (1 − ε)L and (1 − ε)Q are not violated on processor r′ in CS k′. If either is violated, we remove the processor from $\mathcal{P}$ and repeat until the task is scheduled or all processors are exhausted.
• Naive Bayes: We prepare the training labels in a manner similar to LR-G above, and per-
form Naive Bayes classification to obtain labels for non-training tasks. We then schedule
the task to the corresponding processor as long as (1− ε)L and (1− ε)Q are not violated.
Else, we discard the task.
• Support Vector Machine (SVM): Similar to Naive Bayes above, but we use SVM classifier
instead to obtain labels for non-training tasks.
• Greedy Algorithm: This is similar to the 'G' portion of TDOT-G. We set $\mathcal{P}$ to be the entire set of processors. We schedule task j to processor $(r',k') = \arg\max_{r,k\in\mathcal{P}} p_{rk}t_{jrk} + q_{rk}d_{jrk}$, if (1 − ε)L and (1 − ε)Q are not violated on processor r′ in CS k′. If either is violated, we remove the processor from $\mathcal{P}$ and repeat until the task is scheduled or all processors are exhausted.
• Upper Bound Offline: Solve formulation (5.61) to obtain an upper bound.
Based on whether we plot the profit on just the non-training set Ac or on the overall set of
tasks, we modify these comparison targets accordingly. For fair comparison, in Section 5.5.5
we ensure that every comparison target obtains profit on the training set as well.
5.5.2 Simulation Setup and Task Requirements
We consider two different CSs with two processors each. The profits are set to p11 = 0.5,
p12 = 0.7, p21 = 0.3, and p22 = 0.3, and q11 = 0.2, q12 = 0.2, q21 = 0.1, and q22 = 0.1. We set
default values of system duration D = 3000, task arrival rate λ = 0.1, maximum processing
load L = 3500, maximum data load Q = 4000 for Google-cluster tasks and Q = 1500 for i.i.d.
tasks, and ε = 0.2.
• Randomly-generated i.i.d. tasks: We draw the task processing requirements tjrk, ∀r, k, and
task data requirements djrk, ∀r, k, from independent and identically distributed (i.i.d.)
uniform distributions over [5, 20] and [2, 10], respectively. We set default values of system
duration D = 3000, task arrival rate λ = 0.1, maximum processing load L = 3500, maximum
data load Q = 800, and ε = 0.2.
• Google-cluster tasks: We use the task events information from Google cluster data [79] to
obtain the task arrival times, and compute the average number of task arrivals per unit time as λ = 1/(average inter-arrival time)
Figure 5.2: Effect of arrival rate λ on non-training set profit for i.i.d. tasks (curves: TDOT-G, TDOT, LR-G, Greedy, Naive Bayes, SVM, Upper Bound Offline)
from these values. We consider Poisson task arrival at the controller, so that the total
number of tasks that arrive within duration T is a Poisson random variable with mean
λT . We also use the task usage information from [79], i.e., task start times and end times,
to obtain task processing times. We set mean task processing time mj = (task end time
- task start time). We then consider processors with different relative speeds, α11 = 1,
α12 = 2, α21 = 1.5, and α22 = 1.5, to obtain varied processing times on different processors.
Furthermore, we add ±50% randomness to the task processing times and data requirements
to simulate unrelated processors/resources. We set default
values of system duration D = 3000, maximum processing load L = 2500, maximum data
load Q = 600, and ε = 0.2.
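The randomly-generated i.i.d. workload above can be produced by a short script. The following sketch is our own illustration (the helper name and task representation are hypothetical): inter-arrival times are drawn as i.i.d. exponentials so that arrivals form a Poisson process of rate λ, and each task draws its per-processor requirements from the stated uniform distributions.

```python
import random

def generate_iid_tasks(duration=3000, lam=0.1,
                       processors=((1, 1), (1, 2), (2, 1), (2, 2)),
                       t_range=(5, 20), d_range=(2, 10), seed=0):
    """Poisson arrivals of rate lam over [0, duration]: inter-arrival
    times are i.i.d. Exponential(lam).  Each task draws its per-processor
    processing requirement t_jrk ~ U(5, 20) and data requirement
    d_jrk ~ U(2, 10), independently across processors and tasks."""
    rng = random.Random(seed)
    tasks, now = [], 0.0
    while True:
        now += rng.expovariate(lam)      # next arrival time
        if now > duration:
            break
        tasks.append({
            "arrival": now,
            "t": {rk: rng.uniform(*t_range) for rk in processors},
            "d": {rk: rng.uniform(*d_range) for rk in processors},
        })
    return tasks
```

Under this model, the total number of tasks arriving within a duration T is a Poisson random variable with mean λT, matching the arrival model described for the Google-cluster trace as well.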
In the implementation of TDOT, while picking processor $(r', k') = \arg\max_{r,k} (p_{rk} - u^*_{rk})t_{jrk}$,
we break ties uniformly at random if there are multiple processors that give the maximum value
within a tolerance of 0.001.
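This randomized tie-break amounts to the following small helper (a sketch; the function name is ours, with the 0.001 tolerance from the text as the default):

```python
import random

def argmax_with_tiebreak(scores, tol=1e-3, rng=random):
    """Return a key attaining the maximum of `scores`, chosen uniformly
    at random among all keys whose score is within tol of the maximum."""
    best = max(scores.values())
    near_best = [k for k, v in scores.items() if best - v <= tol]
    return rng.choice(near_best)
```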
In Sections 5.5.3 and 5.5.4, we analyze the profit obtained on the non-training set of tasks,
Ac, for i.i.d tasks and Google-cluster tasks respectively. In Section 5.5.5, we then analyze the
overall profit obtained.
5.5.3 I.I.D. Tasks
In this section, we consider the case where task processing times and data requirements are
drawn from i.i.d. uniform distributions (tjrk ∼ U(5, 20) and djrk ∼ U(2, 10)). According to
the performance bound proven in Theorem 9 for i.i.d. task requirements, TDOT approaches
near-optimal performance for increasing values of arrival rate λ and number of tasks M , and
decreasing values of ε. In order to study this, we plot the non-training profit versus λ in Figure
5.2. As we increase λ in the x-axis, we also decrease ε proportionally to maintain a constant
training-set size, i.e., ⌊ελT⌋. The other values are set to the defaults indicated in Section
5.5.2. We observe that TDOT outperforms the other alternatives, particularly as the arrival rate and
Figure 5.3: Effect of max. data load Q on non-training set profit for i.i.d. tasks (curves: TDOT-G, TDOT, LR-G, Greedy, Naive Bayes, SVM, Upper Bound Offline)
consequently the total number of tasks increases. This is in line with the proven performance
bound. Thus, for a fixed training set size, as we increase the non-training set size, we expect
TDOT to approach near-optimality.
In Figure 5.3, we vary the data load Q for the fixed default values of L, λ, and ε. We see that
TDOT and TDOT-G achieve superior non-training-set profit relative to the other alternatives.
Thus, a reasonable guideline is to pick TDOT if the CSP may obtain profit from
partially-completed tasks, and TDOT-G if not.
While these figures validate the proven theorems and indicate the effectiveness of the pro-
posed algorithms for i.i.d. task requirements, we also wish to evaluate the performance of the
schemes under more practical settings. Towards this end, in the following sections, we consider
tasks obtained from Google cluster data [79].
5.5.4 Google-cluster Tasks
Figures 5.4 and 5.5 show the non-training profit versus the processing load L and the data
load Q, respectively. The other values are set to the defaults indicated in Section 5.5.2.
We see that TDOT again outperforms the other algorithms, particularly for tighter L and Q
load values. This makes sense particularly because TDOT allows profit to be obtained from
partially-completed tasks and this effect is prominent when the loads are tight. On the other
hand, TDOT-G performs better for loose loads, and is the alternative to choose if the CSP can
obtain profit only from fully-completed tasks. Both TDOT and TDOT-G have low computational
complexity compared to alternatives such as LR-G, Naive Bayes, and SVM, since they only involve
solving a linear program to find the weights, rather than training a classifier by convex
gradient descent.
5.5.5 Overall Profit and ε Values
Although we use the first ⌊ελT⌋ tasks for training purposes, in practice some profit can
be made on these tasks. We use the Greedy Algorithm presented in Section 5.5.1 applied to the
Figure 5.4: Effect of max. processing load L on non-training set profit for Google-cluster tasks (curves: TDOT-G, TDOT, LR-G, Greedy, Naive Bayes, SVM, Upper Bound Offline)
Figure 5.5: Effect of max. data load Q on non-training set profit for Google-cluster tasks (curves: TDOT-G, TDOT, LR-G, Greedy, Naive Bayes, SVM, Upper Bound Offline)
Figure 5.6: Effect of ε on overall profit for Google-cluster tasks (curves: TDOT-G, TDOT, LR-G, Greedy, Naive Bayes, SVM)
training set of tasks. Adding this profit obtained on the training set to the profit obtained on
the non-training set, from Section 5.5.4, gives us the overall profit. Thus, for fair comparison,
we add this training-set profit to TDOT, TDOT-G, LR-G, SVM, and Naive Bayes. Both the greedy
and upper-bound solutions obtain profit on the entire set of tasks, by definition.
Figure 5.6 shows the overall profit performance for different values of ε using Google-cluster
task data similar to Section 5.5.4. We see that the best profit is achieved for ε = 0.3, and
TDOT-G exhibits superior performance for all values of ε relative to other alternatives.
We see that the upper-bound offline and greedy solutions have performance that is independent
of ε, as expected. However, a value of ε = 0.2 produces the best profit performance on
average when using TDOT and TDOT-G. This suggests that training on around 20% of the expected
number of tasks in this setting gives the algorithm enough tasks to learn well while leaving
enough tasks to exploit the benefit of training. We note that TDOT
and TDOT-G still outperform the other online alternatives.
5.6 Summary
We study the online scheduling of tasks to multiple cloud servers with an objective to maximize
profit subject to load constraints. The processors in our model are heterogeneous and unary-
capacity, and the tasks arrive dynamically, resulting in a challenging problem. We have proposed
a polynomial-time TDOT algorithm that consists of a training phase and an exploitation phase
to obtain effective scheduling solutions. We provided a performance bound for TDOT under
the assumption that profit can also be obtained on partially-completed tasks if the load is
already met. We also proposed a modified algorithm, TDOT-G, for implementations where
profit can only be obtained on fully-completed tasks. Through trace-driven simulation, we saw
that TDOT and TDOT-G consistently outperform the comparison targets and can be tuned to
exhibit near-optimal performance.
Chapter 6
Concluding Remarks
6.1 Conclusions
In this dissertation, we have studied task offloading and scheduling in cloud computing
environments. We have investigated different optimization problems, making scheduling
decisions that minimize cost or completion time, or maximize profit, and we have proposed
efficient algorithms to solve these challenging NP-hard problems.
In Chapter 3, we have investigated the scheduling of applications consisting of dependent
tasks on heterogeneous processors with communication delay and application completion dead-
lines. The proposed cost minimization formulation is generic, allowing different cost structures
and processor topologies. To overcome the obstacles of task dependency and deadline con-
straint, we have developed the ITAGS approach, where the scheduling of each task is assisted
by an individual time allowance obtained from a binary-relaxed version of the original opti-
mization problem. Through trace-driven and randomized simulations, we show that ITAGS
substantially outperforms a wide range of known algorithms. Furthermore, as the deadline
constraint is relaxed, it converges to optimality much faster than other alternatives.
In Chapter 4, we have considered a multi-user computational offloading problem, for a
system consisting of a finite-capacity cloud with heterogeneous processors and tasks with het-
erogeneous release times, processing times, and communication times. The offloaded tasks incur
monetary cost for the cloud resource usage and each user has a budget constraint. We have
formulated a problem to minimize the weighted sum completion time subject to the user budget
constraints. We have proposed the STUBR algorithm and proved performance guarantees for it.
Using trace-driven simulation, we have compared it against existing alternatives and observed
that STUBR is scalable and substantially outperforms them, especially for larger systems.
In Chapter 5, we have addressed the online scheduling of tasks to multiple cloud servers
with an objective to maximize profit subject to load constraints. The processors in our model
are heterogeneous and unary-capacity, and the tasks arrive dynamically, resulting in a chal-
lenging problem. We have proposed a polynomial-time TDOT algorithm that consists of a
training phase and an exploitation phase to obtain effective scheduling solutions. We provided
a performance bound for TDOT under the assumption that profit can also be obtained on
partially-completed tasks if the load is already met. We also proposed a modified algorithm,
TDOT-G, for implementations where profit can only be obtained on fully-completed tasks.
Through trace-driven simulation, we saw that TDOT and TDOT-G consistently outperform
the comparison targets and can be tuned to exhibit near-optimal performance.
For all these three problems considered in this dissertation, we proposed efficient and effec-
tive algorithms and evaluated the performance of the algorithms through mathematical analysis
as well as trace-driven simulation results.
6.2 Future Directions
6.2.1 Task Scheduling in the Presence of Zero Task Information
In Chapter 5, we consider the problem where tasks arrive over time and we know the task
processing time once the task arrives at the controller. We then utilize this information in
order to make better task scheduling decisions. However, we are also interested in considering
the more practical problem where we know the task processing time only after making the
scheduling decision and executing the task. We still expect a learning algorithm to perform
well, and could potentially use a training-exploitation technique similar to TDOT, but it would
need to be modified to cope with the lack of information.
6.2.2 Online Dependent-Task Scheduling
A further extension would be to incorporate dependencies or priorities among online task
arrivals. So far, we have assumed that tasks arriving online are independent and can be executed
in any order. But it is possible that multiple tasks belong to a particular application and
need to be executed in a certain order, or we may associate certain tasks with higher priority
if they are urgent. Accounting for these while scheduling tasks will allow us to create a more
robust and practical scheduling scheme.
6.2.3 Caching
In all the chapters of this dissertation, we assume that the scheduling decision needs to be
computed each time for an application or an arriving task. However, some computation time
and power can be saved if these decisions can be cached and used to make future decisions
when the same application or task arrives. Incorporating caching in our scheduling will require
an additional layer of sophistication from our algorithms.
6.2.4 Straggler Nodes
A processor or node that performs poorly or more slowly than anticipated, due to issues such
as faulty hardware, is called a straggler node [85]. In our work, we have assumed that the
processors are all well-behaved. However, scheduling algorithms can be implemented such
that they take the straggler nodes into account and preempt or restart tasks to manage these
situations.
6.2.5 Fuzzy Load Constraint
In Chapter 5, we consider load constraints for each processor that are strict and deterministic.
A more practical and interesting model might be to consider the case where these constraints
are probabilistic, or can be violated in some situations (e.g., in order to complete the
execution of a task that has already begun).
Bibliography
[1] R. Kakerow, “Low power design methodologies for mobile communication,” in Computer
Design: VLSI in Computers and Processors, 2002. Proceedings. 2002 IEEE International
Conference on. IEEE, 2002, pp. 8–13.
[2] L. D. Paulson, “Low-power chips for high-powered handhelds,” Computer, vol. 36, no. 1,
pp. 21–23, 2003.
[3] J. W. Davis, “Power benchmark strategy for systems employing power management,”
in Electronics and the Environment, 1993., Proceedings of the 1993 IEEE International
Symposium on. IEEE, 1993, pp. 117–119.
[4] R. N. Mayo and P. Ranganathan, “Energy consumption in mobile devices: why future
systems need requirements–aware energy scale-down,” in Power-Aware Computer Systems.
Springer, 2003, pp. 26–40.
[5] H. T. Dinh, C. Lee, D. Niyato, and P. Wang, “A survey of mobile cloud computing: archi-
tecture, applications, and approaches,” Wireless communications and mobile computing,
vol. 13, no. 18, pp. 1587–1611, 2013.
[6] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, “The case for VM-based cloudlets
in mobile computing,” IEEE Pervasive Computing, vol. 8, no. 4, pp. 14–23, 2009.
[7] E. Cuervo, A. Balasubramanian, D.-k. Cho, A. Wolman, S. Saroiu, R. Chandra, and
P. Bahl, “Maui: making smartphones last longer with code offload,” in Proc. ACM Inter-
national Conference on Mobile Systems, Applications, and Services (MobiSys), 2010.
[8] D. Kovachev, T. Yu, and R. Klamma, “Adaptive computation offloading from mobile
devices into the cloud,” in Parallel and Distributed Processing with Applications (ISPA),
2012 IEEE 10th International Symposium on. IEEE, 2012, pp. 784–791.
[9] N. Vallina-Rodriguez and J. Crowcroft, “Erdos: achieving energy savings in mobile os,” in
Proc. ACM workshop on MobiArch, pp. 37–42, 2011.
[10] M. Satyanarayanan, “Mobile computing: the next decade,” ACM SIGMOBILE Mobile
Computing and Communications Review, vol. 15, no. 2, pp. 2–10, 2011.
[11] B. Liang, “Mobile edge computing,” in Key Technologies for 5G Wireless Systems, V. W.
S. Wong, R. Schober, D. W. K. Ng, and L.-C. Wang, Eds., Cambridge University Press,
2017.
[12] S. Kosta, A. Aucinas, P. Hui, R. Mortier, and X. Zhang, “Thinkair: Dynamic resource
allocation and parallel execution in the cloud for mobile code offloading,” in Proc. IEEE
INFOCOM, 2012.
[13] W. Zhang, Y. Wen, and D. O. Wu, “Energy-efficient scheduling policy for collaborative
execution in mobile cloud computing,” in Proc. IEEE INFOCOM, 2013.
[14] B. Y.-H. Kao and B. Krishnamachari, “Optimizing mobile computational offloading with
delay constraints,” in Proc. IEEE GLOBECOM, 2014.
[15] Amazon EC2, “Pricing of on-demand instances,” https://aws.amazon.com/ec2/pricing/on-demand/.
[16] S. Sundar and B. Liang, “Offloading dependent tasks with communication delay and dead-
line constraint,” in Proc. IEEE Conference on Computer Communications (INFOCOM),
2018.
[17] S. Sundar, J. P. Champati, and B. Liang, “Completion time minimization in multi-user task
scheduling with heterogeneous processors and budget constraints,” in Proc. IEEE/ACM
International Symposium on Quality of Service (IWQoS), Short Paper, 2018.
[18] S. Sundar and B. Liang, “Individual time allocation with greedy scheduling for offloading
dependent tasks with communication delay,” under review IEEE Transactions on Cloud
Computing (TCC), 2018.
[19] S. Sundar, J. P. Champati, and B. Liang, “Multi-user task offloading to heterogeneous
processors with communication delay and budget constraints,” under review IEEE Trans-
actions on Cloud Computing (TCC), 2018.
[20] S. Sundar and B. Liang, “Task dispatch through online training for profit maximization
at the cloud,” in Proc. IEEE INFOCOM Workshop on Network Intelligence, 2018.
[21] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, “Hermes: Latency optimal task
assignment for resource-constrained mobile computing,” in Proc. IEEE INFOCOM, 2015.
[22] K. Habak, M. Ammar, K. A. Harras, and E. Zegura, “Femto clouds: Leveraging mobile
devices to provide cloud service at the edge,” in Proc. IEEE CLOUD, 2015.
[23] S. Sundar and B. Liang, “Communication augmented latest possible scheduling for cloud
computing with delay constraint and task dependency,” in Proc. IEEE INFOCOM Work-
shop on Green and Sustainable Networking and Computing (GSNC 2016), 2016.
[24] X. Lin, Y. Wang, Q. Xie, and M. Pedram, “Task scheduling with dynamic voltage and
frequency scaling for energy minimization in the mobile cloud computing environment,”
IEEE Transactions on Services Computing, vol. 8, no. 2, pp. 175–186, 2015.
[25] L. A. Hall, A. S. Schulz, D. B. Shmoys, and J. Wein, “Scheduling to minimize average com-
pletion time: Off-line and on-line approximation algorithms,” Mathematics of Operations
Research, vol. 22, no. 3, pp. 513–544, 1997.
[26] B.-G. Chun and P. Maniatis, “Augmented smartphone applications through clone cloud
execution.” in HotOS, vol. 9, 2009, pp. 8–11.
[27] Y. Wen, W. Zhang, and H. Luo, “Energy-optimal mobile application execution: Taming
resource-poor mobile devices with cloud clones,” in Proc. IEEE INFOCOM, 2012.
[28] P. Balakrishnan and C.-K. Tham, “Energy-efficient mapping and scheduling of task inter-
action graphs for code offloading in mobile cloud computing,” in Proc. IEEE/ACM 6th
International Conference on Utility and Cloud Computing, pp. 34–41, 2013.
[29] J. Flinn, D. Narayanan, and M. Satyanarayanan, “Self-tuned remote execution for per-
vasive computing,” in Hot Topics in Operating Systems, 2001. Proceedings of the Eighth
Workshop on. IEEE, 2001, pp. 61–66.
[30] J. Flinn, S. Y. Park, and M. Satyanarayanan, “Balancing performance, energy, and qual-
ity in pervasive computing,” in Distributed Computing Systems, 2002. Proceedings. 22nd
International Conference on. IEEE, 2002, pp. 217–226.
[31] B.-G. Chun, S. Ihm, P. Maniatis, M. Naik, and A. Patti, “Clonecloud: elastic execution
between mobile device and cloud,” in Proceedings of the sixth conference on Computer
systems. ACM, 2011, pp. 301–314.
[32] R. K. Balan, M. Satyanarayanan, S. Y. Park, and T. Okoshi, “Tactics-based remote exe-
cution for mobile computing,” in Proceedings of the 1st international conference on Mobile
systems, applications and services. ACM, 2003, pp. 273–286.
[33] M. Satyanarayanan, P. Bahl, R. Caceres, and N. Davies, “The case for VM-based cloudlets
in mobile computing,” IEEE Pervasive Computing, vol. 8, no. 4, pp. 14–23, Oct. 2009.
[34] F. Bonomi, R. Milito, J. Zhu, and S. Addepalli, “Fog computing and its role in the internet
of things,” in Proc. ACM SIGCOMM Workshop on Mobile Cloud Computing (MCC), 2012.
[35] ETSI, “Mobile edge computing (MEC); framework and reference architecture,” ETSI GS
MEC 003 V1.1.1, 2016.
[36] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge
computing: The communication perspective,” IEEE Communications Surveys & Tutorials,
vol. 19, no. 4, pp. 2322–2358, 2017.
[37] R. K. Balan, D. Gergle, M. Satyanarayanan, and J. Herbsleb, “Simplifying cyber foraging
for mobile devices,” in Proc. ACM MobiSys, 2007.
[38] A. Mtibaa, K. A. Harras, and A. Fahim, “Towards computational offloading in mobile
device clouds,” in Proc. IEEE International Conference on Cloud Computing Technology
and Science (CloudCom), 2013.
[39] A. Mtibaa, K. A. Harras, K. Habak, M. Ammar, and E. W. Zegura, “Towards mobile
opportunistic computing,” in Proc. IEEE CLOUD, 2015.
[40] C. Mateos, E. Pacini, and C. G. Garino, “An ACO-inspired algorithm for minimizing
weighted flowtime in cloud-based parameter sweep experiments,” Advances in Engineering
Software, vol. 56, pp. 38–50, 2013.
[41] B. Zhou, A. V. Dastjerdi, R. N. Calheiros, S. N. Srirama, and R. Buyya, “A context sen-
sitive offloading scheme for mobile cloud computing service,” in Proc. IEEE International
Conference on Cloud Computing (CLOUD), 2015.
[42] M.-H. Chen, B. Liang, and M. Dong, “A semidefinite relaxation approach to mobile cloud
offloading with computing access point,” in Proc. IEEE Signal Processing Advances in
Wireless Communications (SPAWC), 2015.
[43] Y. Mao, J. Zhang, and K. B. Letaief, “Joint task offloading scheduling and transmit power
allocation for mobile-edge computing systems,” in Proc. IEEE Wireless Communications
and Networking Conference (WCNC), pp. 1–6, 2017.
[44] X. Chen, L. Jiao, W. Li, and X. Fu, “Efficient multi-user computation offloading for mobile-
edge cloud computing,” IEEE/ACM Transactions on Networking, vol. 24, no. 5, pp. 2795–
2808, 2016.
[45] V. Cardellini, V. D. N. Persone, V. Di Valerio, F. Facchinei, V. Grassi, F. L. Presti,
and V. Piccialli, “A game-theoretic approach to computation offloading in mobile cloud
computing,” Mathematical Programming, vol. 157, no. 2, pp. 421–449, 2016.
[46] L. Tianze, W. Muqing, Z. Min, and L. Wenxing, “An overhead-optimizing task scheduling
strategy for ad-hoc based mobile edge computing,” IEEE Access, vol. 5, pp. 5609–5622,
2017.
[47] M.-H. Chen, B. Liang, and M. Dong, “Multi-user multi-task offloading and resource allo-
cation in mobile cloud systems,” IEEE Transactions on Wireless Communications, vol. 17,
no. 10, pp. 6790–6805, 2018.
[48] Z. Li, C. Wang, and R. Xu, “Computation offloading to save energy on handheld devices:
a partition scheme,” in Proc. ACM International Conference on Compilers, Architecture,
and Synthesis for Embedded Systems, 2001.
[49] M. Jia, J. Cao, and L. Yang, “Heuristic offloading of concurrent tasks for computation-
intensive applications in mobile cloud computing,” in Proc. IEEE INFOCOM Workshops,
2014.
[50] L. Yang, J. Cao, and H. Cheng, “Resource constrained multi-user computation partitioning
for interactive mobile cloud applications,” Technical report, Dept. of Computing, Hong
Kong Polytechnic Univ, 2012.
[51] M.-A. Hassan Abdel-Jabbar, I. Kacem, and S. Martin, “Unrelated parallel machines with
precedence constraints: application to cloud computing,” in Proc. IEEE CLOUDNET,
2014.
[52] Y.-H. Kao, B. Krishnamachari, M.-R. Ra, and F. Bai, “Hermes: Latency optimal task
assignment for resource-constrained mobile computing,” IEEE Transactions on Mobile
Computing, vol. 16, no. 11, pp. 3056–3069, 2017.
[53] S. Guo, B. Xiao, Y. Yang, and Y. Yang, “Energy-efficient dynamic offloading and re-
source scheduling in mobile cloud computing,” in Proc. IEEE International Conference on
Computer Communications (INFOCOM), 2016.
[54] C. Wang and Z. Li, “Parametric analysis for adaptive computation offloading,” ACM
SIGPLAN Notices, vol. 39, no. 6, pp. 119–130, 2004.
[55] Y. Zhang, H. Liu, L. Jiao, and X. Fu, “To offload or not to offload: an efficient code
partition algorithm for mobile cloud computing,” in Proc. IEEE CLOUDNET, 2012.
[56] M.-R. Ra, A. Sheth, L. Mummert, P. Pillai, D. Wetherall, and R. Govindan, “Odessa:
enabling interactive perception applications on mobile devices,” in Proceedings of the 9th
international conference on Mobile systems, applications, and services. ACM, 2011, pp.
43–56.
[57] J. Liu, Y. Mao, J. Zhang, and K. B. Letaief, “Delay-optimal computation task scheduling
for mobile-edge computing systems,” in Proc. IEEE International Symposium on Informa-
tion Theory (ISIT), 2016.
[58] L. Zhang, C. Wu, Z. Li, C. Guo, M. Chen, and F. C. Lau, “Moving big data to the cloud:
An online cost-minimizing approach,” IEEE Journal on Selected Areas in Communications,
vol. 31, no. 12, pp. 2710–2721, 2013.
[59] Z. Peng, D. Cui, J. Zuo, Q. Li, B. Xu, and W. Lin, “Random task scheduling scheme
based on reinforcement learning in cloud computing,” Cluster computing, vol. 18, no. 4,
pp. 1595–1607, 2015.
[60] J. P. Champati and B. Liang, “One-restart algorithm for scheduling and offloading in a
hybrid cloud,” in Proc. IEEE International Symposium on Quality of Service (IWQoS),
2015.
[61] Y. Fang, F. Wang, and J. Ge, “A task scheduling algorithm based on load balancing in
cloud computing,” in Proc. International Conference on Web Information Systems and
Mining, 2010.
[62] H. Goudarzi and M. Pedram, “Maximizing profit in cloud computing system via resource
allocation,” in Proc. IEEE Distributed Computing Systems Workshops (ICDCSW), 2011.
[63] Y. Chen, N. Zhang, Y. Zhang, and X. Chen, “Dynamic computation offloading in edge
computing for internet of things,” IEEE Internet of Things Journal, 2019.
[64] J. Li, M. Qiu, Z. Ming, G. Quan, X. Qin, and Z. Gu, “Online optimization for scheduling
preemptable tasks on iaas cloud systems,” Journal of Parallel and Distributed Computing,
vol. 72, no. 5, pp. 666–677, 2012.
[65] D. P. Williamson and D. B. Shmoys, The Design of Approximation Algorithms, 1st ed.
New York, NY, USA: Cambridge University Press, 2011.
[66] D. B. Shmoys and E. Tardos, “An approximation algorithm for the generalized assignment
problem,” Mathematical Programming, vol. 62, no. 1-3, pp. 461–474, 1993.
[67] M. Chrobak, W. Jawor, J. Sgall, and T. Tichy, “Online scheduling of equal-length jobs:
Randomization and restarts help,” SIAM Journal on Computing, vol. 36, no. 6, pp. 1709–
1728, 2007.
[68] B. Kalyanasundaram and K. Pruhs, “Maximizing job completions online,” in Proc. Euro-
pean Symposium on Algorithms, 1998.
[69] S. Shalev-Shwartz, “Online learning and online convex optimization,” Foundations
and Trends® in Machine Learning, vol. 4, no. 2, pp. 107–194, 2012.
[70] M.-F. Balcan, A. Blum, J. D. Hartline, and Y. Mansour, “Mechanism design via machine
learning,” in Proc. IEEE Symposium on Foundations of Computer Science (FOCS), 2005, pp. 605–614.
[71] N. R. Devanur and T. P. Hayes, “The adwords problem: online keyword matching with
budgeted bidders under random permutations,” in Proc. ACM Conference on Electronic
Commerce, 2009.
[72] H. Topcuoglu, S. Hariri, and M.-y. Wu, “Performance-effective and low-complexity task
scheduling for heterogeneous computing,” IEEE Transactions on Parallel and Distributed
Systems, vol. 13, no. 3, pp. 260–274, 2002.
[73] F. A. Potra and S. J. Wright, “Interior-point methods,” Journal of Computational and
Applied Mathematics, vol. 124, no. 1, pp. 281–302, 2000.
[74] B. Flipsen, J. Geraedts, A. Reinders, C. Bakker, I. Dafnomilis, and A. Gudadhe, “Envi-
ronmental sizing of smartphone batteries,” in Proc. IEEE Electronics Goes Green (EGG),
pp. 1–9, 2012.
[75] Z. Qiu, C. Stein, and Y. Zhong, “Minimizing the total weighted completion time of coflows
in datacenter networks,” in Proc. ACM symposium on Parallelism in Algorithms and Ar-
chitectures, 2015.
[76] W. E. Smith, “Various optimizers for single-stage production,” Naval Research Logistics,
vol. 3, no. 1-2, pp. 59–66, 1956.
[77] N. Karmarkar, “A new polynomial-time algorithm for linear programming,” in Proc. ACM
Symposium on Theory of Computing, 1984.
[78] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logis-
tics, vol. 2, no. 1-2, pp. 83–97, 1955.
[79] J. Wilkes, “More google cluster data,” Available at https://github.com/google/cluster-data,
Nov. 2011.
[80] J. P. Champati and B. Liang, “Semi-online algorithms for computational task offload-
ing with communication delay,” IEEE Transactions on Parallel and Distributed Systems,
vol. 28, no. 4, pp. 1189–1201, 2017.
[81] J. P. Champati and B. Liang, “Single restart with time stamps for computational offload-
ing in a semi-online setting,” in Proc. IEEE Conference on Computer Communications
(INFOCOM), 2017.
[82] G. Aggarwal, G. Goel, C. Karande, and A. Mehta, “Online vertex-weighted bipartite
matching and single-bid budgeted allocations,” in Proc. ACM-SIAM Symposium on Dis-
crete Algorithms, 2011.
[83] E. L. Grab and I. R. Savage, “Tables of the expected value of 1/X for positive Bernoulli
and Poisson variables,” Journal of the American Statistical Association, vol. 49, no. 265,
pp. 169–177, 1954.
[84] B. Krishnapuram, L. Carin, M. A. Figueiredo, and A. J. Hartemink, “Sparse multinomial
logistic regression: Fast algorithms and generalization bounds,” IEEE Transactions on
Pattern Analysis & Machine Intelligence, no. 6, pp. 957–968, 2005.
[85] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, “Improving mapreduce
performance in heterogeneous environments.” USENIX Symposium on Operating Systems
Design and Implementation, vol. 8, no. 4, p. 7, 2008.