Joshua Bambrick

Vilfredo

Optimising cluster resource allocations, one Pareto improvement at a time

Computer Science Tripos — Part II

Jesus College

2016

Vilfredo Federico Damaso Pareto (1848–1923) was an Italian economist, industrialist, sociologist, engineer, and philosopher [17].

Pareto received an education emphasising Classics and Mathematics; his scientific interest in Economics can be traced to his doctoral studies in Engineering [4], and a chance encounter with Maffeo Pantaleoni on a train [17]. His contributions to the field left a profound legacy; according to Benoit Mandelbrot and Richard L. Hudson:

“Partly because of him, the field evolved from a branch of moral philosophy as practised by Adam Smith into a data intensive field of scientific research and mathematical equations. His books look more like modern economics than most other texts of that day: tables of statistics from across the world and ages, rows of integral signs and equations, intricate charts and graphs.”

— B. Mandelbrot and R. Hudson (2004) [59]

He made many important contributions that shaped the way we discuss Economics. For example, his Pareto principle (80–20 rule), introduced in Cours d’Économie Politique [67], states that, for many events, 80% of the effects are due to 20% of the causes. In his later book, Manuale d’economia politica [68], he outlined the notion of a Pareto improvement, defined as a change in an allocation of resources that delivers an improvement for at least one individual and is no worse for any other. Such improvements help achieve a state of Pareto optimality, describing an allocation of resources in which it is impossible to make any individual better off without making at least one other individual worse off.

This dissertation, titled in honour of Pareto, describes how the application of Pareto improvements to resources in cluster systems can be used to optimise allocations and hence improve overall performance.

The image depicts Pareto in his later years, during the early 20th century. Taken before 1923, the photograph is available in the public domain under UK copyright law.

Proforma

Name: Joshua Bambrick

College: Jesus College

Project Title: Vilfredo: Optimising cluster resource allocations, one Pareto improvement at a time

Examination: Computer Science Tripos, Part II (June 2016)

Word Count: 11,906

Project Originators: Malte Schwarzkopf & Joshua Bambrick

Supervisors: Ionel Gog & Malte Schwarzkopf

Special Difficulties: None

Original aims of the project

The implementation and evaluation of a system to support dynamic resource reservation adjustments in a cluster management system. This aims to deliver significant performance and utilisation gains over a traditional cluster manager that keeps reservations fixed at a user-specified request throughout task execution. The deliverable involves substantial modifications to a research cluster system, Firmament, to execute and monitor users’ tasks, adapt scheduling considerations, and adjust task reservations. Analysis will look at how task throughput improves and will investigate the means used to achieve these gains. The final system should operate automatically with no additional input required from end-users.

Work completed

Vilfredo has been highly successful. All success criteria were met, and many extensions were implemented. I developed this resource management system as a flexible, testable module, based upon portable principles applicable to most state-of-the-art cluster managers. I demonstrated that the system is capable of achieving 35% higher task throughput than the baseline Firmament system, by utilising 29% greater resource capacity, and successfully producing Pareto-improved resource allocations in 86% of additional task placements. With utilisation burstiness profiling, and timeslice-based similar task utilisation prediction, I believe this system is more advanced and accurate than the estimation adapted from Google’s Borg cluster manager [89].

Declaration

I, Joshua Bambrick of Jesus College, being a candidate for Part II of the Computer Science Tripos, hereby declare that this dissertation and the work described in it are my own work, unaided except as may be specified below, and that the dissertation does not contain material that has already been used to any substantial extent for a comparable purpose.

Signed:

Date: 12 May, 2016

Contents

1 Introduction
  1.1 Motivation
  1.2 Challenges
  1.3 Related work
  1.4 Achievements

2 Preparation
  2.1 Cluster systems
  2.2 Resource management in practice
    2.2.1 Terminology
    2.2.2 Problem
    2.2.3 Approaches
    2.2.4 Google Borg
  2.3 Containment
    2.3.1 Purpose
    2.3.2 Technology
  2.4 Firmament
    2.4.1 Scheduling
    2.4.2 Architecture
    2.4.3 Task submission
  2.5 K-nearest neighbour problem
    2.5.1 k-d trees
    2.5.2 BBD trees
    2.5.3 ANN
  2.6 Linux disk I/O scheduling
  2.7 Project management
    2.7.1 Requirement analysis
    2.7.2 Development life cycle
    2.7.3 Testing strategy
  2.8 Choice of tools
    2.8.1 Programming languages
    2.8.2 Libraries
    2.8.3 Revision control and backup strategy
  2.9 Summary

3 Implementation
  3.1 Establishing exponential reservation decay
    3.1.1 Measuring and communicating task resource usage
    3.1.2 Reservation updates
    3.1.3 Task limit enforcement
    3.1.4 Cost model reservation consideration
    3.1.5 Machine over-allocation policy
    3.1.6 Cgroup-enforced limits
  3.2 Accounting for burstiness
    3.2.1 Directly measuring burstiness
    3.2.2 Determining decay coefficient and safety margin
    3.2.3 Applying exponential averaging to usage observations
    3.2.4 Calculating the smoothing coefficient
  3.3 Similar task-based usage prediction
    3.3.1 Notion of timeslicing
    3.3.2 Producing a set of records
    3.3.3 Record creation
    3.3.4 Usage record weighting
    3.3.5 Setting initial reservations
    3.3.6 Updating reservations
    3.3.7 Integrated reservation update pipeline
  3.4 Summary

4 Evaluation
  4.1 Overall success
    4.1.1 Work Completed
    4.1.2 Testing
  4.2 Empirical evaluation approach
    4.2.1 Test environment
  4.3 Exponential decay achievements
    4.3.1 Typical decay scenario
    4.3.2 Pareto-improved scheduling
    4.3.3 Task preemption
    4.3.4 Terminating tasks
    4.3.5 Studying memory reclamation
  4.4 Task resource throttling
  4.5 Accounting for burstiness
    4.5.1 Bursty tasks
    4.5.2 Non-bursty tasks
    4.5.3 Window size
  4.6 Similar task resource usage prediction
    4.6.1 Experiment design and considerations
    4.6.2 Prediction approaches re-cap
    4.6.3 Accuracy rating definition re-cap
    4.6.4 Outcome
  4.7 Incrementality comparison
    4.7.1 Experimental construction
    4.7.2 Analysis of task state change improvements
    4.7.3 Investigating resource utilisation
  4.8 Summary

5 Conclusions
  5.1 Achievements
  5.2 Lessons learnt
  5.3 Further work

Bibliography

A Workload generator

B Project proposal
  B.1 Introduction & Background
  B.2 Special Resources
    B.2.1 Personal Machine
    B.2.2 Systems Research Group (SRG) Cluster
  B.3 Starting Point and Previous Experience
  B.4 Project Structure
    B.4.1 Phase 1 – Core Implementation
    B.4.2 Phase 2 – Evaluation
    B.4.3 Phase 3 – Enhancements
  B.5 Success Criteria
  B.6 Possible Extensions
  B.7 Timetable: Work Plan and Milestones
    B.7.1 Michaelmas Term
    B.7.2 Christmas Break
    B.7.3 Lent Term
    B.7.4 Easter Break
    B.7.5 Easter Term

List of Figures

1.1 Complete system event rate evaluation

2.1 Data-centre memory utilisation
2.2 Data-centre request over-estimation
2.3 Aggregate memory at Google
2.4 Borg resource reclamation
2.5 Linux containers architecture
2.6 Monolithic architecture with Firmament
2.7 Firmament architecture
2.8 K-d tree data structure
2.9 BBD tree shrink operation
2.10 Complete Fairness Queueing I/O scheduler
2.11 Project requirements
2.12 Development methodologies implemented
2.13 Backup and collaboration construction

3.1 Linux containers architecture
3.2 AUFS system with a Linux container
3.3 Firmament architecture with Vilfredo
3.4 Vilfredo reservation decay
3.5 Enhanced CoCo cost model resource amalgamation
3.6 Task preemption sorting heap
3.7 Comparison of utilisation profiles for varying burstiness
3.8 Approximate Fano factor calculation process
3.9 Burstiness-adjusted decay coefficient
3.10 Burstiness-adjusted safety margin coefficient
3.11 Two safety margins
3.12 No timeslicing
3.13 Fixed-duration timeslicing
3.14 Variable-duration timeslicing
3.15 Randomly-generated record set
3.16 Similar request search pipeline
3.17 Transformed logistic function
3.18 Exponential decay function
3.19 Timeslice prediction pipeline
3.20 Integrated reservation update pipeline

4.1 Test suite execution
4.2 Exponential decay for memory and disk storage usage
4.3 Exponential decay for disk I/O
4.4 Pareto-improved scheduling
4.5 Preemption behaviour
4.6 Task termination event
4.7 Reservation while varying safety margin
4.8 Reclamation while varying safety margin
4.9 Reservation while varying decay coefficient
4.10 Reclamation while varying decay coefficient
4.11 Memory usage throttling
4.12 Comparison of reservation behaviour with a bursty task
4.13 Comparison of burstiness-adjusted reservation with a non-bursty task
4.14 Comparison of burstiness-adjusting reservation with varying window sizes
4.15 Task usage prediction
4.16 Comparison of task state changes
4.17 Comparison cluster system memory utilisation

List of Tables

2.1 Comparison of container solutions
2.2 Project deliverables assessment
2.3 Libraries used to implement Vilfredo

3.1 Reservation terminology
3.2 Comparison of burstiness metrics
3.3 Comparison of machine learning task-similarity approaches

4.1 Project deliverable completion
4.2 Prediction and incrementality task mix
4.3 Mean reclaimed memory with varying safety margin
4.4 Task usage prediction strategies
4.5 Task usage prediction relative error
4.6 Incrementality reservation parameters

List of Algorithms

3.3.1 Create timesliced measurements
3.3.2 Resource usage weighting
3.3.3 Determining the next timeslice prediction

Acknowledgements

Throughout this project, I have very much appreciated the support and suggestions from several people.

In particular, I am extremely grateful for the contributions from:

• Ionel Gog and Malte Schwarzkopf: for supervising this project, providing significant technical advice, and encouraging me to try ambitious approaches that proved successful.

• My parents: who, despite confessing to limited technical knowledge, provided much-needed emotional support and endless cups of tea.

Chapter 1

Introduction

This dissertation describes the implementation and evaluation of Vilfredo, an intelligent and scalable solution to a prevalent problem plaguing contemporary cluster infrastructure engineers: optimising task resource allocations. Designed to integrate portably into modern cluster-management systems, Vilfredo builds upon cutting-edge research and extends it with several new approaches and resource allocation algorithms. Evaluation has shown 29% greater resource utilisation, which achieves a throughput improvement of over 35%.

1.1 Motivation

Rising computational demand has led to considerable industrial investment into the establishment of vast clusters of thousands of commodity servers [29]. Such systems now lie at the heart of today’s networks and are critical to the ongoing operation of some of the largest firms. Clusters are coordinated by cluster management systems, such as Borg [89], Mesos [41], or Omega [77].

In a typical cluster system, when a user submits a task for execution, they provide a resource request, indicating what processing resources should be reserved for that task. The size of such requests is normally determined by end-user guesstimates. Yet, to mitigate the risk of task termination, such estimates are usually far too high, with a 30-day trace of Twitter’s production cluster finding that around 20% of workloads requested over 400% of their actual usage [21].

Researchers working on Google’s Borg cluster manager [89] observed that

“jobs usually reserve resources to handle rare workload spikes, but don’t use these resources most of the time”

— Verma et al. (2015) [89]

Additionally, an analysis of a 29-day trace [73, 92] of tasks on a 12.5k-machine Google cluster concluded that if a cluster could

“predict actual task resource usage more accurately than that suggested by the requests, tasks could be packed more tightly without degrading performance”

— Reiss et al. (2012) [72]

Indeed, this is a present sore point for leading technology companies, and an active research area in distributed systems, with recent developments such as Quasar [28], Tetrisched [84], and Paragon [27]. Several such modern strategies are discussed in §2.2.

The complexities surrounding resource allocations are not limited to Computer Science, but have been extensively investigated in the domain of Economics. Perhaps transitioning our thinking from computational resources to capital resources could lend some inspiration for a better solution. In his 1906 book, Manuale d’economia politica [68], Italian economist Vilfredo Pareto outlined the notion of a Pareto improvement, defined below. Such improvements deliver improved allocative efficiency (and notably, state nothing about fairness).

Definition (Pareto improvement) A change in an allocation of resources that delivers an improvement for at least one individual and is no worse for any other.
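The definition of a Pareto improvement can be made concrete with a minimal sketch. It assumes each individual's welfare is summarised by a single number (a hypothetical utility value, higher is better); the function name and this representation are illustrative assumptions and are not taken from the Vilfredo implementation.

```python
from typing import Sequence


def is_pareto_improvement(before: Sequence[float], after: Sequence[float]) -> bool:
    """Return True if `after` is a Pareto improvement over `before`.

    Each sequence holds one hypothetical utility value per individual,
    listed in the same order; a higher value means better off.
    """
    if len(before) != len(after):
        raise ValueError("both allocations must cover the same individuals")
    # Nobody may end up worse off than before the change.
    nobody_worse = all(a >= b for b, a in zip(before, after))
    # At least one individual must end up strictly better off.
    somebody_better = any(a > b for b, a in zip(before, after))
    return nobody_worse and somebody_better
```

For example, moving from utilities `[3, 5]` to `[4, 5]` qualifies, whereas moving to `[4, 4]` does not, because the second individual is made worse off. An unchanged allocation is also not an improvement. Note that the check says nothing about fairness: it only compares each individual with their own earlier position.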

Viewing the problem in economic terms, the over-reservation of resources for a task imposed by the system (as opposed to the task itself) acts as a negative externality [70]. The difference between the resources that a task uses and its reservation is of no value to it at that time, and hence forms a deadweight loss [40]. This reduces the allocative efficiency [60] of the system, as additional resources reserved for the task cannot provide a marginal benefit [90] to it (zero) greater than the marginal cost of providing them.

By reducing the resource reservation for a task, closer to a level that it will actually utilise, the deadweight loss can be minimised, and one can hence improve the allocative efficiency of the system. Therefore, it may be possible to induce better resource allocations that support scheduling additional tasks. Such an event would achieve a Pareto improvement, and increase overall system throughput.
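A small worked example may help illustrate this argument. The sketch below assumes a hypothetical 16 GB machine running three tasks, and shrinks each reservation in one step to the observed usage plus a 25% safety margin. The numbers, the function names, and the one-step shrink are illustrative assumptions only; they do not describe how Vilfredo itself adjusts reservations.

```python
from typing import Sequence


def deadweight_loss(reservation: float, usage: float) -> float:
    """Reserved-but-unused resource for one task (never negative)."""
    return max(reservation - usage, 0.0)


def shrink_reservation(usage: float, safety_margin: float) -> float:
    """Reservation sized to observed usage plus a fractional margin."""
    return usage * (1.0 + safety_margin)


def free_capacity(capacity: float, reservations: Sequence[float]) -> float:
    """Capacity not yet claimed by any reservation."""
    return capacity - sum(reservations)


# Hypothetical machine and tasks; all values are in GB of memory.
capacity = 16.0
reservations = [6.0, 6.0, 4.0]
usages = [2.0, 3.0, 1.0]

total_loss = sum(deadweight_loss(r, u) for r, u in zip(reservations, usages))
free_before = free_capacity(capacity, reservations)

# Assumed policy: reserve observed usage plus a 25% safety margin.
adjusted = [shrink_reservation(u, safety_margin=0.25) for u in usages]
free_after = free_capacity(capacity, adjusted)

print(total_loss)   # 10.0 GB reserved but unused
print(free_before)  # 0.0 GB free, so no further task can be placed
print(free_after)   # 8.5 GB free after shrinking the reservations
```

In this example the three existing tasks keep everything they actually use, while the 8.5 GB that is freed would allow an additional task to be scheduled. No task is worse off and one is better off, which is the Pareto improvement described above.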

In this dissertation, I investigate the hypothesis that, by dynamically adjusting resource reservations, a cluster system can achieve more efficient allocations. One may then be able to schedule more tasks simultaneously, utilising more available resources, and achieving a significantly higher task throughput. Furthermore, by profiling resource usage variation, and using utilisation information of similar tasks, reservations can be sized such that they improve overall performance.

This project addresses performance optimisation of internal cluster systems in large firms without focusing on pricing structures. Within Google, this is handled by a chargeback system [89]; in an Infrastructure as a Service (IaaS) platform, such as Amazon EC2 (https://aws.amazon.com/ec2/), it might be exposed via differential pricing.

1.2 Challenges

The challenges this project presents are substantial. Given the prevalence of the problem within industry, and its scale, providing a novel and performant solution would not prove trivial.

Most engineers treat cluster systems as a black box, and have little to no understanding of the complexities and important decisions that the few infrastructure developers are faced with. It is paramount to ensure that all integrated components help deliver high performance, but create no additional headaches for end-users.

Assimilating progress made in recent research, and establishing a firm grasp of the inner workings and low-level implementation details of a state-of-the-art cluster manager, was a non-trivial first step.

A wide range of technical challenges were confronted throughout. New algorithms were developed, such as an approximate burstiness estimator (see §3.2.1). Machine learning was integrated with distributed systems techniques to predict task resource utilisation (see §3.3). All modifications extended the already-complex Firmament architecture and were developed in the fairly unfamiliar language of C++. Furthermore, great care had to be taken to perform thorough evaluation, spanning multiple machines using the Systems Research Group cluster.

Any work in this domain requires an appreciation, and persistent consideration, of the implications that occur across many scales, from individual bits to several-thousand-machine data centres. Going beyond the solutions provided by modern alternative systems presented many significant intellectual and creative challenges along the way.

Although a daunting challenge, the performance gains that this project could deliver may have a significant and wide-reaching impact.

1.3 Related work

The Vilfredo project is particularly broad, drawing on previous work from a very large set of domains in Computer Science.

My reading of the surrounding literature is summarised in chapter 2, with a brief description of the important findings made in relevant areas.

1.4 Achievements

Vilfredo is a modern system that successfully derives from, and extends, cutting-edge research in distributed systems. In particular, note that:

• The baseline deliverable for this project is a similar resource reclamation system to that offered by Borg (see §2.2.4).

• Vilfredo presents a novel strategy to adjust how reservations change based upon an estimate of each task’s utilisation burstiness, described in §3.2. Additionally, a new, faster approximation to this algorithm is introduced.

• Vilfredo integrates a pioneering approach to predicting resource utilisation, described in §3.3. It finds similar tasks using Approximate Nearest Neighbours (see §2.5.3), weights the measurements based upon other similarity metrics, introduces timeslicing as a mechanism to profile utilisation over time, and intelligently varies predictions based upon observed accuracy.

The evaluation in chapter 4 has shown 29% greater resource utilisation than the baseline Firmament system (see §2.4). The resulting system achieves a 35% throughput improvement with a very low reschedule rate (discussed further in §4.7), seen in figure 1.1. Furthermore, Pareto-improved allocations are achieved with 86% of additional task placements.

[Figure: mean event rate (events / minute) for task completions (higher is better) and task reschedules (lower is better), comparing Firmament + Vilfredo with Firmament.]

Figure 1.1: Complete system event rate evaluation: The task throughput is over 35% higher with Vilfredo. Task reschedule occurrence with Vilfredo is very low, at below 0.09 reschedules per minute.

Chapter 2

Preparation

The objective of this chapter is to explain the preparatory steps taken, and research carried out, prior to commencing implementation of Vilfredo. The scope extends to a number of projects that provide necessary foundations for this work, and explanation of the academic background to set the tone for discussion.

First, I discuss relevant research and background material to lay the groundwork. Given the scale of this project, to maximise the chance of success, this preparation was rather substantive, so each relevant consideration is discussed concisely. Later, I detail the goal analysis conducted and the development strategy established.

2.1 Cluster systems

Increasing demand for data centres to support an ever-widening range of cluster computing frameworks, such as Hadoop [91], MapReduce [26] or MPI [80], has driven the recent development of flexible cluster managers [41].

A cluster manager is a user-space platform that performs traditional OS duties at the level of a computing cluster. The implicit responsibilities are numerous and include scheduling tasks to machines, and overseeing task monitoring and execution.

Resource management is a crucial component of a cluster manager’s role. This entails measuring machine resource capacities, allocating resources to tasks, and ensuring tasks are only placed on machines with sufficient unallocated resources.

2.2 Resource management in practice

2.2.1 Terminology

To enable a coherent discussion of the problems and solutions in the domain of cluster systems, a few important terms are worth noting. From system to system, approaches and terminology may vary, but the following offers a reasonable overview of that used in industry:

• A resource refers to a cluster infrastructure component, such as memory capacity, disk storage capacity, or disk I/O bandwidth.

• The capacity of a machine is the totality of the resources that it can make available to end users to execute tasks.

• The resource request of a given submitted task is a value that indicates what resources may be necessary for it to execute.

• The resource limit for a given task indicates the maximum resource that it is permitted to use.

• The resource reservation for a given task represents the resources that the system has allocated to it for ongoing use during its execution.

2.2.2 Problem

Resource under-utilisation

Low resource utilisation in cluster systems is a major unsolved problem confronted by leading technology companies.

For example, Twitter’s internal cluster system runs on Mesos, and has several thousand machines [42], yet over the course of one month, memory use was typically between 40% and 50% [28]. Likewise, at Google, a 29-day trace of a 12.5k-machine cluster, running on Borg (see §2.2.4), only saw memory utilisation of around 40% [72]. A study of Amazon EC2 estimated CPU utilisation to be in the range of 3% to 17% [56].

Figure 2.1 illustrates low memory utilisation throughout long traces of cluster activity at two leading technology companies, demonstrated in recent publications.

Using such a small proportion of the available resources increases energy and operational expenses. At the scale of thousands of machines, this amounts to millions of dollars being wasted every year.

Resource over-estimation

One source of this low utilisation is the over-estimation of future task resource usage by users.

[Figure panels: (a) memory utilisation (%) against time (days); (b) memory utilisation (%) against time (hours).]

(a) Low memory utilisation of a large cluster at Google throughout a 29-day period. Adapted from [72, Fig. 8] to match my terminology. Tasks of different priorities are stacked.

(b) Low memory utilisation of a large cluster at Twitter throughout a 700-hour period. Adapted from [28, 44, Fig. 1b, 3:15–3:44] to match my terminology.

Figure 2.1: Memory utilisation of large clusters at two leading technology companies.

Task execution and resource utilisation data from six internal production computing clusters at Google, recorded at 5-minute intervals from December 2012 to November 2013, was analysed by Carvalho et al. [21]. Analysis found that median per-user utilisation of a task’s request varies from 30% to 60%. In 99% of measurements taken, at least 43% of the memory, and 89% of the disk capacity, remain unused [21, §3]. The high levels of unused reservation at Google are illustrated in figure 2.2a. These findings led the analysts to declare that:

“Apparently, users do not do a very good job of using their [requests], or the [requests] were not correctly set for them, or their [resource utilization] varied over time, with high values being reached only occasionally – or some combination of these factors.”

– Carvalho et al. (2014) [21, §3]

Analysis of a 30-day trace of Twitter’s production cluster likewise found that 70% of workloads over-estimated their request, with around 20% having requested over 400% of their actual usage, as can be seen in figure 2.2b.

Figure 2.2: High task resource request over-estimation at two leading technology companies.
(a) High resource request over-estimation at a large cluster at Google throughout a 29-day period, plotted as the cumulative fraction against total unused reservation (% capacity) for CPU, memory and disk in Cluster 5. Adapted from [21, Fig. 4].
(b) High resource request over-estimation at a large cluster at Twitter throughout a 700-hour period, plotted as usage reserved (%) against proportion of tasks (%). Adapted from [28, Fig. 1d].


Machine over-subscription

Due to such poor estimation, some state-of-the-art cluster systems support over-subscription of machines, whereby the sum of the resource requests from the tasks executing on a given machine actually exceeds its capacity. The extent of such over-subscription is determined based upon a calculated risk of system failure.
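As a minimal illustration of this definition, the sketch below checks whether a single machine is over-subscribed for one resource. The function names, the 64 GiB capacity and the task requests are all hypothetical values chosen for the example; they are not taken from any of the traces discussed here.

```python
def oversubscription_ratio(requests, capacity):
    """Return the sum of task requests as a multiple of machine capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return sum(requests) / capacity


def is_oversubscribed(requests, capacity):
    """A machine is over-subscribed when its requests sum to more than its capacity."""
    return sum(requests) > capacity


# Hypothetical machine with 64 GiB of memory hosting four tasks.
requests_gib = [24, 20, 16, 12]
print(oversubscription_ratio(requests_gib, 64))  # 1.125
print(is_oversubscribed(requests_gib, 64))       # True
```

A ratio above 1 is only safe while actual utilisation stays below capacity, which is why the degree of over-subscription is chosen against a calculated risk of failure.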

Analysis of the trace of Google’s 12.5k-machine cluster [72] identified that reserved memory was typically at 80% of capacity but regularly exceeded 100%. Reserved CPU was usually over 100% of capacity. However, the corresponding utilisation was far lower, at around 50% and 60% of capacity respectively.

Figure 2.3 illustrates the over-subscription of machines for memory, as the reservation repeatedly exceeds capacity. However, the extent of over-subscription appears to be safe as the utilisation remains well below available capacity.

Figure 2.3: Aggregate memory utilisation at Google, identifying machine over-subscription. Utilisation and reservation are plotted as aggregate memory (%) against time (days). Adapted from [72, Fig. 8] to match my terminology. Tasks of different priorities are stacked.

2.2.3 Approaches

In most cluster systems, the assumption is made either that groups of tasks require uniform resources (for example, fixed-size MapReduce worker “slots”), or that users indicate resource requests on task submission.

The YARN [86] system, for example, bundles resources into packages that are then allocated to submitted jobs on the basis of their requests.

A more flexible approach is offered by Mesos [41], which delegates resource decisions to pluggable allocation modules. An organisation must then develop its own such module to interface with Mesos.


More advanced resource management solutions

To improve throughput, a wide assortment of algorithms for resource allocation has been implemented.

Omega [77] supports specialised schedulers, which tailor reservations to particular types of task. MapReduce tasks, in particular, are supported by a dedicated scheduler on the basis that they contribute 20% of jobs at Google [77, §6], and that one can build a fairly accurate model of how their reservation affects execution duration [88]. Omega determines the resource utilisation of the entire system and may increase task reservations on the basis of predicted benefits.

Microsoft’s Apollo [18] system uses opportunistic scheduling [18, §3.5] in an attempt to optimise throughput. Tasks can execute in either regular mode or opportunistic mode. Regular tasks are scheduled first, before opportunistic ones are scheduled randomly to machines where spare capacity is observed. Opportunistic tasks are preempted when utilisation rises and nears capacity, but will otherwise eventually be upgraded to regular status.

Paragon [27], and its successor Quasar [28], perform a priori task profiling. A small subset of a submitted workload is executed for a short period of time, and the measured usage is used to “right-size” resource requests for the rest of the load. If this is found to be inaccurate, the allocation is adjusted if possible, or tasks will be rescheduled if not.

2.2.4 Google Borg

Google’s cluster manager, Borg [89], forms a significant part of the company’s internal infrastructure. It acts as a platform for a wide range of user-facing products including Gmail, Google Docs, and web search [89, §2.1]. The resource management approach implemented by Borg forms the baseline for Vilfredo.

With Borg, users define resource requests with task submissions, and these determine an absolute limit on what the corresponding task may use.

To increase resource utilisation, Borg uses resource reclamation, as illustrated by figure 2.4. Each task has an associated reservation, which is initialised to its request and decays slowly during its execution. Such reservations are held above the task’s resource usage, plus an additional safety margin. If, at any point, a given task’s resource utilisation exceeds its reservation, then a boost event occurs, where the reservation is rapidly increased.

Additionally, each machine keeps track of the sum of reservations for all tasks that it is responsible for. If this exceeds the machine’s capacity, tasks may be terminated. Task termination also occurs whenever a task exceeds its limit.

The scheduler compares a task’s resource limit to unreserved machine capacities to determine whether they can support its execution.
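The reclamation behaviour described above can be sketched as a single update step for one task and one resource. This is a simplified illustration rather than Borg’s actual algorithm: the function name and the decay, safety margin and boost parameters are assumptions made for the example.

```python
def update_reservation(reservation, usage, limit,
                       decay=0.95, safety_margin=0.1, boost_factor=1.5):
    """Return the next reservation for one task and one resource.

    decay, safety_margin and boost_factor are illustrative assumptions,
    not values taken from Borg.
    """
    # The reservation is held above usage plus a safety margin, capped at the limit.
    floor = min(limit, usage * (1 + safety_margin))
    if usage > reservation:
        # Boost event: usage overtook the reservation, so raise it rapidly.
        return min(limit, max(floor, reservation * boost_factor))
    # Otherwise decay slowly towards the floor.
    return max(floor, reservation * decay)


# A task requests 100 units (its limit) but only uses 40.
reservation = 100.0
for _ in range(200):
    reservation = update_reservation(reservation, usage=40.0, limit=100.0)
print(round(reservation, 6))  # 44.0: usage plus a 10% safety margin

# Usage then jumps to 60, above the reservation, triggering a boost.
print(update_reservation(reservation, usage=60.0, limit=100.0) > reservation)  # True
```

The gap between the limit and the decayed reservation corresponds to the reclaimed resources that can be offered to other tasks.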


Figure 2.4: Borg resource reclamation: The reservation for a resource falls over time, to within a given safety margin of the actual usage. If the usage exceeds the reservation, the reservation is increased by a large amount. The plot shows memory against time for a task’s utilisation, reservation and limit, with the reservation decay, reclaimed resources, safety margin and a boost event annotated.

In practice, on a median cell (group of machines), approximately 20% of the workload runs in reclaimed resources [89, §5.5].

2.3 Containment

2.3.1 Purpose

Modern cluster managers typically execute tasks in an isolated environment. This offers significant advantages over simply running binaries on the same platform as the cluster system [20], such as:

• Application developers don’t need to consider details of infrastructure machines and operating systems.

• Infrastructure engineers can upgrade systems with minimal impact on applications.

• Low-level resource management requirements can be supported (see §2.3.2).

Google uses containers for tasks in its internal cluster manager, Borg (§2.2.4), and its more recent system Kubernetes [54].


2.3.2 Technology

Traditional task isolation approaches might have included hypervisors [71, 85] (using virtual machines), which bring significant resource overhead, or application virtualisation [49], which offers limited resource management capabilities.

Recent developments in operating system-level virtualisation [64, 30] introduce a more modern approach.

Control groups

Control groups (cgroups) [63] is a Linux kernel feature that forms the basis for many recent containment technologies. Initially developed internally at Google as process groups [62], cgroups enable low-level isolation of process sets.

This can be used to deliver important resource-related features:

• Accounting: administrators can determine which cgroups have used what processing resources.

• Throttling: administrators are able to place limits on what processing resource individual cgroups are able to use.

Containers

Since the development of cgroups, many containment abstractions have been built. A selection of the possible solutions is outlined in table 2.1.

| Container solution | Advantage(s) | Disadvantage(s) |
|---|---|---|
| ✗ Docker [64] | Extensive image library. | Limits task use of kernel functionality (e.g. cron jobs). No official C API. |
| ✗ LMCTFY [57] | Official C API support. More flexible provisions for tasks. | Limited control over cgroups. Not maintained. |
| ✓ Linux containers [39] | Transparent access to cgroups. | |

Table 2.1: Comparison of container solutions: Possible solutions for providing isolated environments for task executables.

Based on these options, I decided at an early stage to deploy Linux containers, the architecture of which is depicted in figure 2.5.


Figure 2.5: Linux containers architecture: The infrastructure composition of an environment running three Linux containers, two of which are executing tasks. The diagram labels three containers (two holding an application), LXC, the system call interface separating user space from kernel space, the kernel facilities (IPC, namespacing, cgroups, memory, filesystem), drivers and devices.

2.4 Firmament

A philosophical intention of the Vilfredo project is to design a resource management solution based upon portable principles, which can integrate with traditional and modern cluster systems with minimal development overhead.

Firmament¹ [76, Ch. 5] was the cluster manager selected for integration with Vilfredo. To illustrate cluster manager considerations and integrative work, I here examine relevant structural and high-level functional details.

¹ Firmament was designed by Malte Schwarzkopf and Ionel Gog, within the Cambridge Systems at Scale (CamSaS) initiative (http://camsas.org) at the University of Cambridge Computer Laboratory.

2.4.1 Scheduling

Analogous to an OS thread scheduler on a distributed scale, the role of a cluster manager’s scheduler is to map submitted tasks to individual processing units.

Firmament takes inspiration in its scheduling approach from Microsoft’s Quincy [45]. Scheduling preferences are specified in the form of a flow network [2], which is treated as an input to a minimum-cost, maximum-flow optimisation [36]. Such a network forms a tree representing the cluster in a hierarchical structure. For example, multiple processing units (see §2.4.2) may be descendants of one machine, and multiple machines may be grouped into an equivalence class to improve efficiency.

Each arc of such a network has an associated cost, which is determined in Firmament by the cost model [76, §5.5]. The baseline version of Firmament accepts task submissions with a user-specified resource request, which determines the amount of unallocated resources that a machine must have available to support it. To consider such requests, the cost model integrates information regarding the machine capacities and resource allocations of the tasks scheduled to them.
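To make the flow-based formulation concrete, the following toy example solves a minimum-cost, maximum-flow problem for two tasks and two machines using successive shortest paths. It is a sketch under simplifying assumptions and not Firmament’s solver or network layout: the `FlowNetwork` class, the node numbering and the arc costs are all hypothetical, and each machine is given a single slot.

```python
from collections import deque


class FlowNetwork:
    """Toy min-cost max-flow solver (successive shortest paths, queue-based Bellman-Ford)."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # node -> indices into self.edges
        self.edges = []  # [head, residual capacity, cost]; edge i ^ 1 reverses edge i

    def add_arc(self, u, v, capacity, cost):
        self.graph[u].append(len(self.edges))
        self.edges.append([v, capacity, cost])
        self.graph[v].append(len(self.edges))
        self.edges.append([u, 0, -cost])

    def min_cost_max_flow(self, s, t):
        flow = cost = 0
        while True:
            dist = [float("inf")] * self.n
            prev_edge = [-1] * self.n
            in_queue = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:  # cheapest augmenting path in the residual network
                u = queue.popleft()
                in_queue[u] = False
                for ei in self.graph[u]:
                    v, cap, c = self.edges[ei]
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev_edge[v] = ei
                        if not in_queue[v]:
                            in_queue[v] = True
                            queue.append(v)
            if dist[t] == float("inf"):
                return flow, cost  # no augmenting path remains
            push, v = float("inf"), t
            while v != s:  # bottleneck capacity along the path
                push = min(push, self.edges[prev_edge[v]][1])
                v = self.edges[prev_edge[v] ^ 1][0]
            v = t
            while v != s:  # apply the augmentation
                self.edges[prev_edge[v]][1] -= push
                self.edges[prev_edge[v] ^ 1][1] += push
                v = self.edges[prev_edge[v] ^ 1][0]
            flow += push
            cost += push * dist[t]


# Hypothetical network: source -> tasks -> machines -> sink.
SRC, TASK_A, TASK_B, MACH_X, MACH_Y, SINK = range(6)
net = FlowNetwork(6)
net.add_arc(SRC, TASK_A, 1, 0)
net.add_arc(SRC, TASK_B, 1, 0)
net.add_arc(TASK_A, MACH_X, 1, 1)  # arc costs stand in for a cost model's output
net.add_arc(TASK_A, MACH_Y, 1, 4)
net.add_arc(TASK_B, MACH_X, 1, 2)
net.add_arc(TASK_B, MACH_Y, 1, 6)
net.add_arc(MACH_X, SINK, 1, 0)    # one slot per machine
net.add_arc(MACH_Y, SINK, 1, 0)

flow, cost = net.min_cost_max_flow(SRC, SINK)
placements = {}
for task in (TASK_A, TASK_B):
    for ei in net.graph[task]:
        head, cap, _ = net.edges[ei]
        if ei % 2 == 0 and cap == 0:  # saturated forward arc: task placed on head
            placements[task] = head
print(flow, cost)    # 2 6
print(placements)    # {1: 4, 2: 3}
```

Note that a greedy choice would place task A on machine X (cost 1) and force task B onto Y (cost 6); the optimisation instead finds the cheaper joint placement with total cost 6.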

2.4.2 Architecture

When integrating Vilfredo with Firmament, we assume a monolithic cluster design, as depicted in figure 2.6. Here, there is a single, centralised master machine that accepts new tasks and schedules them to subordinate machines using the algorithm outlined in §2.4.1. This is the approach deployed by Maui [47], its successor Moab [31], Kubernetes [54], and Borg [89].

Figure 2.6: Monolithic architecture with Firmament: A master and several subordinate machines, linked by network communication, integrated in a monolithic structure with Firmament.

With Firmament, each cluster machine runs a coordinator, which behaves differently depending on whether it is a master or a subordinate. The coordinator can be broken down into several simpler sub-components; those relevant to this project, as illustrated in figure 2.7, include:

Executors

Executors are a representation of processing units (PUs), typically individual CPU cores.

A master coordinator possesses a remote executor to represent each PU on all subordinate machines. When a task is scheduled to a remote executor, the executor sends a delegation request to the corresponding subordinate machine.

A subordinate machine represents its local PUs as local executors, which are responsible for monitoring tasks delegated from the remote executors, and executing the task’s binary (as indicated in the delegation request).


The knowledge base

The knowledge base is a centralised store of structured bookkeeping data. This includes, for example, machine resource utilisation information, and data representing important statistics about completed tasks.

The coordinated co-location (CoCo) cost model

The cost model is responsible for assigning costs to arcs in the flow network, following a scheme of preferences, to implicitly determine where tasks are placed. In particular, the CoCo model [76, §5.5.3], which I extended for use with Vilfredo, as described in §3.1.4, represents costs as multi-dimensional cost vectors.

Notably, such vectors indicate, among other things, whether a particular machine has sufficient available resources to schedule a task.

The scheduler

A master machine scheduler is responsible for placing tasks on subordinate machines, on the basis of their available resources, and user preferences based on the cost model. A subordinate machine scheduler is much simpler, and merely selects a local executor to execute the task.

Messages

Heartbeats are messages generated by both executing tasks and running subordinate coordinators. They hold vital bookkeeping information, such as executing task start and end times, and are passed upwards in the coordinator’s hierarchical structure.

State changes are another type of message; these are produced by executors to indicate whenever a task changes state (for example, starts running).

2.4.3 Task submission

A master coordinator runs an HTTP [34] interface which supports the submission of tasks by users. The user defines a per-task resource request indicating the resources that the task may require for execution (see §2.2.1), and a per-task priority used in the CoCo cost model.

On submission, the coordinator defers to its scheduler, which applies the cost model to determine on what executor to place the task (if one with sufficient unallocated resources can be found).

Figure 2.7: Firmament architecture: Simplified outline of the components that combine to form a Firmament master system. The subordinate system is similar, but without a cost model. The diagram shows the master coordinator with its scheduler, executors, cost model and knowledge base, together with a subordinate executor; the legend distinguishes internal, shared and external modules, object ownership and pointers, and in-memory and HTTP communication.

A delegation request is sent to the selected subordinate machine over HTTP, via a master remote executor; the subordinate coordinator passes it to the relevant local executor to commence execution.

2.5 K-nearest neighbour problem

In order to estimate future task resource utilisation more accurately, I anticipated that it would be useful to determine sets of completed tasks that are similar to a given new task. The approach I took is described in §3.3. For this purpose, I decided to research the k-nearest neighbour (K-NN) [3] problem, a form of similarity search [93].

Here, a set of data points in a multidimensional space is provided. For any query point, the k points in that space which are nearest to the query must be determined.
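The problem statement can be made concrete with a brute-force solution, which examines every point and so takes time linear in the size of the data set. This is a baseline sketch only; the function name and sample points are hypothetical, and Euclidean distance is assumed as the metric.

```python
import heapq
import math


def k_nearest(points, query, k):
    """Return the k points closest to query, nearest first (Euclidean distance)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))


points = [(0, 0), (1, 1), (5, 5), (2, 2)]
print(k_nearest(points, (0.9, 0.9), 2))  # [(1, 1), (0, 0)]
```

The data structures discussed in the following sections aim to answer the same query without visiting every point.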


2.5.1 k-d trees

The k-d tree [12, 13] data structure is a multidimensional binary search tree. It represents points in a fixed-dimensional space, and delivers reasonable performance for a number of operations. Most notably, search for a given point operates in O(log n) time [37]. A k-d tree, and its corresponding spatial representation (for two dimensions, conveniently) is shown in figure 2.8.

Figure 2.8: K-d tree data structure: A two-dimensional space of points (p1 to p10), and a corresponding k-d tree, of bucket-size one. The points under a particular node are highlighted in grey. Adapted from [65, Fig. 2].

Every non-leaf node represents a hyperplane (a split) that divides the space into two. Leaf nodes form a box that holds a representation of the points that feature on one side of the split determined by their parent.

The total number of points that a box may hold has a fixed limit, the bucket size. On insertion, if this limit is going to be exceeded, the tree recursively subdivides, generating a new split, and two new boxes to replace the old. Such splits occur on lines perpendicular to one of the axes of the box that they divide, and are typically positioned to distribute data points as evenly as possible. This is achieved by selecting the dimension with maximal spread, and then splitting on the median value of the points in that dimension.

In tree form, this split corresponds to converting one node into two, whereby all the points in one node have a value in the selected dimension greater than the median, and all the points in the other have a value no greater than it.

As such, the tree will have height $\lceil \log_2 n \rceil$, and size O(n). Furthermore, during a search descending from the root, the number of points associated with the nodes on the path followed will decrease exponentially.
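The construction rule above (split on the dimension with maximal spread, at the median) can be sketched as a bulk build over a fixed point set. This is an illustrative simplification: it builds the tree in one pass rather than by repeated insertion, and the dictionary-based node layout and function names are assumptions made for the example.

```python
import math


def build_kd_tree(points, bucket_size=1):
    """Recursively split on the dimension of maximal spread at the median point."""
    if len(points) <= bucket_size:
        return {"leaf": True, "points": list(points)}
    dims = range(len(points[0]))
    dim = max(dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[dim])
    mid = len(ordered) // 2
    return {
        "leaf": False,
        "dim": dim,                  # splitting dimension
        "split": ordered[mid][dim],  # median value in that dimension
        "left": build_kd_tree(ordered[:mid], bucket_size),
        "right": build_kd_tree(ordered[mid:], bucket_size),
    }


def height(node):
    """Number of splits on the longest root-to-leaf path."""
    if node["leaf"]:
        return 0
    return 1 + max(height(node["left"]), height(node["right"]))


def leaf_points(node):
    """All points stored in the leaves beneath a node."""
    if node["leaf"]:
        return list(node["points"])
    return leaf_points(node["left"]) + leaf_points(node["right"])


pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2), (1, 9), (6, 8), (3, 5), (10, 10)]
tree = build_kd_tree(pts)
print(height(tree), math.ceil(math.log2(len(pts))))  # 4 4
```

Because each split halves the points as evenly as possible, the height for these ten points matches the stated bound.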

2.5.2 BBD trees

With a cluster system, we expect that similar tasks will be submitted many times, leading to high levels of clustering.


Whilst, on the surface, k-d trees appear to offer excellent performance, the boxes represented can have arbitrarily high aspect ratio (that is, the ratio of the longest side to the shortest). Thus, during a search, there is no guarantee that the geometric size (that is, the length of its longest side) of the associated box decreases exponentially. As such, the search complexity is in fact exponential in the dimension, and delivers very poor performance with highly-clustered data sets. This issue is prevalent among proposed data structures for this problem [37, 23, 53].

The balanced box-decomposition tree [8] is an augmentation of the k-d tree, inspired by quadtrees [35]. It addresses the highlighted concern by guaranteeing a maximum aspect ratio. The behaviour exhibited is much the same; however, during insertion, if there is no available split that maintains this ratio, a shrink will occur.

Much like a split, a shrink divides a given node, but instead produces an outer box and an inner box. Such a shrink takes place by performing provisional splits until a box is found that contains fewer than a threshold proportion of the number of points in the original node; it then simply takes these points as the inner box, and the remaining points as the outer. The shrinking process is illustrated in figure 2.9.

Figure 2.9: BBD tree shrink operation: A set of provisional splits, and the inner and outer box resulting from a BBD tree shrink operation. Adapted from [8, Fig. 3b].
(a) BBD tree provisional splits (only midpoint splits are shown).
(b) BBD tree inner and outer boxes resulting from a shrink.

Hence, the tree additionally may contain shrinking nodes, consisting of a set of orthogonal halfspaces, and supports rapidly zooming into regions where data points are clustered.

2.5.3 ANN

The Approximate Nearest Neighbours (ANN) library, due to Mount et al., is a popular open-source package that supports highly-optimised algorithms for solving the K-NN problem [65].

Beyond introducing and incorporating the BBD tree data structure to optimise for clustered data, the algorithm introduces an approximation factor, ε.

For some positive real ε, the point p is defined as a (1 + ε)-approximate nearest neighbour of q if its distance from q is no greater than (1 + ε) times the distance to the actual nearest neighbour. The search proceeds down the BBD tree in the fashion of a binary search for neighbours of q, tracking the closest point observed so far as p. Once the distance of the current point from q exceeds the distance between q and p, divided by (1 + ε), the search can terminate: any subsequently visited point cannot be sufficiently close to q to contradict the claim that p is a (1 + ε)-approximate nearest neighbour.

Moreover, investigation [7, 8] has shown that for a set, S, of n data points in $\mathbb{R}^d$, there is a constant, $c_{d,\varepsilon} \le d \lceil 1 + 6d/\varepsilon \rceil^{d}$, such that given $\varepsilon > 0$, $q \in \mathbb{R}^d$, and any k, $1 \le k \le n$, ANN can compute a sequence of k (1 + ε)-approximate nearest neighbours of q in S in $O((c_{d,\varepsilon} + kd) \log n)$ time.
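The definition of a (1 + ε)-approximate nearest neighbour can be checked directly by brute force, as in the sketch below. This illustrates the definition only and is not how the ANN library searches; the function name and sample points are hypothetical, and Euclidean distance is assumed.

```python
import math


def is_approximate_nearest_neighbour(p, q, points, eps):
    """True if p is within a factor (1 + eps) of the true nearest distance to q."""
    true_nearest = min(math.dist(x, q) for x in points)
    return math.dist(p, q) <= (1 + eps) * true_nearest


points = [(1, 0), (1.5, 0), (3, 0)]
q = (0, 0)
# The true nearest neighbour of q is (1, 0), at distance 1.
print(is_approximate_nearest_neighbour((1.5, 0), q, points, eps=0.5))   # True
print(is_approximate_nearest_neighbour((1.5, 0), q, points, eps=0.25))  # False
```

A larger ε admits more candidate answers, which is what allows the tree search to stop earlier.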

2.6 Linux disk I/O scheduling

Disk I/O bandwidth is an important machine resource that I aim to take into consideration with this resource management system.

On a Linux machine, the order of block device I/O requests, such as reads and writes on disk, is determined by the operating system’s chosen I/O scheduler [6, Ch. 37]. The default scheduler on modern Ubuntu machines is Deadline [9]. As a policy, Deadline imposes deadlines on I/O operations, so as to guarantee a start service time for every incoming request and thereby avert starvation [78].

The CFQ (Completely Fair Queuing) [10] scheduler is an alternative illustrated by figure 2.10. It assigns synchronous requests to per-process queues and determines disk access by allocating timeslices for each such queue.

CFQ achieves improved fairness, and has the advantage that I/O usage under it is officially supported both by cAdvisor [61], for measurement, and by LXC, for cgroup-based throttling (see §3.1.6). Thus, I opted to transition any machines running Vilfredo to a CFQ setup.
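The per-process queueing idea behind CFQ can be illustrated with a toy round-robin model. This is a heavily simplified sketch and not the kernel’s algorithm: real CFQ allocates timeslices measured in time and handles priorities and asynchronous requests, whereas here a timeslice is assumed to be a fixed number of requests, and the function name and request labels are hypothetical.

```python
from collections import deque


def cfq_dispatch_order(requests, slice_size=2):
    """Return the order in which requests reach the disk.

    requests is a list of (pid, request) pairs in arrival order. Each
    process gets its own queue; queues are served round-robin, at most
    slice_size requests per turn, oldest request first.
    """
    queues = {}
    for pid, request in requests:
        queues.setdefault(pid, deque()).append(request)
    order = []
    while queues:
        for pid in list(queues):
            queue = queues[pid]
            for _ in range(min(slice_size, len(queue))):
                order.append((pid, queue.popleft()))
            if not queue:
                del queues[pid]
    return order


arrivals = [(1, "a"), (1, "b"), (1, "c"), (2, "x"), (3, "y"), (3, "z")]
print(cfq_dispatch_order(arrivals))
# [(1, 'a'), (1, 'b'), (2, 'x'), (3, 'y'), (3, 'z'), (1, 'c')]
```

Process 1 submitted the most requests first, yet its third request waits until processes 2 and 3 have had their turn, which is the fairness property of interest.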

2.7 Project management

2.7.1 Requirement analysis

Figure 2.10: Completely Fair Queuing I/O scheduler: Multiple process queues within the CFQ I/O scheduler, holding process-submitted requests. One such queue is selected with its oldest request submitted to disk. The diagram shows a process submitting a request to the scheduler, which classifies it into the selected or a waiting per-process request queue, selects a request, and submits it to disk.

Having garnered sufficient understanding of the domain to consider development, I set about establishing requirements for Vilfredo. I first determined which deliverables were the most important, and associated them with estimates of the likelihood of success and the difficulty of the problem faced. These findings are outlined in table 2.2.

| Deliverable | Priority | Risk | Difficulty |
|---|---|---|---|
| Implement container-based task execution | High | Medium | Medium |
| Build container-monitoring system | High | Medium | High |
| Track and exponentially decay task reservations | High | High | Medium |
| Enforce cgroup-based resource throttling | Low | Low | Low |
| Enable usage measurement smoothing | Medium | Medium | Low |
| Develop task burstiness profiling | Medium | Medium | High |
| Set up similarity-based resource prediction | Medium | High | High |
| Integrate timeslice-based resource prediction | Medium | Medium | High |

Table 2.2: Assessment of project deliverables. Risk indicates how severely inability to deliver was predicted to adversely affect overall performance.

Additionally, I broke down the planned system into a set of sub-components and determined their corresponding dependencies, as shown in figure 2.11.

2.7.2 Development life cycle

Figure 2.11: Requirements and dependencies for the Vilfredo project, where a → b means a is dependent on b. The components shown are: reservation updating, usage measurement tracking, reservation tracking, task preemption, reservation decay, reservation boost, reservation safety, reservation initialise, burstiness-adjusting and usage smoothing, similar task-based usage prediction, pull-based message communication, container monitoring, container-based execution, container usage throttling, and usage timeslicing.

Having established my requirements, I noticed that the core deliverable of the project had a number of interdependent sub-components, and hence it made sense for me to adopt the Waterfall model [11]. This model is well-suited to projects whose requirements can be known in advance and are unlikely to evolve, so that development can progress linearly through a series of discrete phases. The waterfall approach I adopted can be seen in figure 2.12a.

With a workable prototype of this in place, I anticipated that a number of riskier extensions could be independently developed within isolated iterations of the Iterative model [55]. This permitted greater flexibility for tweaking systems, and rapid responses when initial designs needed adaptation. The particular iterative methodology I deployed is outlined in figure 2.12b.

Figure 2.12: Development methodologies implemented.
(a) Waterfall development model, with the phases requirements, design, implementation and testing.
(b) Iterative development model, with the steps design, implement, testing, evaluate and integrate, and branches for whether or not performance improved.

This approach is well-encapsulated by the Spiral model [16].

2.7.3 Testing strategy

To ensure that algorithms performed as intended, and that past bugs did not reappear, I developed a comprehensive test suite using the Google Test [79] framework. This consisted of a series of white- and black-box tests [81], ensuring that correct behaviour was observed with a wide variety of input. Special care was taken to ensure that edge cases, corner cases and base cases were all handled correctly.

Extensive evaluation of the performance of the system was carried out, for which I developed:

• A task workload generator, with probabilistic task inter-arrival times (see appendix A).

• A set of example tasks, based upon the research of realistic workloads discussed in §4.2.1.

2.8 Choice of tools

2.8.1 Programming languages

C++

The bulk of the software solution was developed in C++ [82].

I determined that this language was suitable for the purpose due to its high performance and the fact that this decision would simplify the process of integrating with the C++-based Firmament system.

C++ was also used to build a suite of test tasks, for evaluation purposes, as outlined in §4.2.1.

Shell and Python

I wrote a number of shell [50, Ch. 3] scripts for the purpose of easing development, for example, to transfer updated files and test binaries to the Systems Research Group cluster, to install dependencies, and to process Vilfredo logs for evaluation.

For simplicity, a combination of Python [22] and shell scripts was chosen for use with the probabilistic task-submission simulator discussed in §4.2.1.

Python was additionally used for a variety of plotting scripts, the output of which can be seen throughout chapter 4.

2.8.2 Libraries

The libraries I selected to use for this project are outlined in table 2.3.

| Library | Version | Purpose | License |
|---|---|---|---|
| C++ Standard Library | C++11 | Containers, algorithms, time, memory management, functions, strings. | GPLv3 or MIT² |
| Boost | 1.55 | Threading, filesystem management. | Boost Software License |
| Linux Containers | 1.0.8 | Task containment (see §2.3). | GNU LGPL |
| cAdvisor | 0.22.0 | Container monitoring. | ASL |
| cURL | 7.35.0 | Querying HTTP APIs. | MIT/X derivative |
| Jansson | 2.5 | Parsing API responses. | MIT |
| Google Test | 1.7 | Developing unit and integration tests. | BSD 3-clause |

Table 2.3: Libraries used to implement Vilfredo.

² License depends upon implementation.

2.8.3 Revision control and backup strategy

Given the scope of this project, sufficient planning of revision control and backup strategy was crucial prior to commencing implementation. For this purpose, I adopted the construction outlined in figure 2.13.

Figure 2.13: Backup and collaboration construction.³ ⁴ ⁵ The diagram shows the local repository on my personal laptop, a local backup on an external drive (manual, bi-weekly), a remote backup on joshbambrick.com (scheduled weekly snapshot), and a manual repository push to GerritHub.com for code review, with accepted commits reaching the remote repository on GitHub.com.

I opted to use the git revision control system [83] to manage development via the popular GitHub⁶ platform. Since I was integrating my work with a larger system, it was important to ensure that new code was consistent with the standards set in the Firmament system. Thus, I decided to use GerritHub⁷ for code review, prior to merging commits into the larger codebase.

³ The GitHub Mark is a trademark of GitHub, Inc. and is used with express permission.
⁴ The Gerrit logo is in the public domain.
⁵ GerritHub is a trademark of GerritForge Ltd.
⁶ https://github.com/

Whilst this set-up offered a degree of resilience to system failures and user error, I felt that it did not provide sufficient protection. Therefore, I additionally constructed a local backup system, manually taking backups on a bi-weekly basis, and added a further remote snapshot, taken on Saturday mornings and placed on a private server.

A similar set-up was arranged for the writing of this LaTeX dissertation.

2.9 Summary

In this chapter, I outlined the preparatory steps I took, and the research that I carried out, before commencing work on the implementation of Vilfredo.

Significant analysis of the current state of resource management in modern cluster systems was presented, the case was made for the necessity of novel solutions in the area, and investigations into potentially useful tools were shown. Additionally, planning work on the requirements analysis and development strategy, undertaken to help maximise the likelihood of success, was explained.

The upcoming chapter offers an in-depth discussion of the implementation details, and important design considerations, in building the Vilfredo system.

⁷ http://gerrithub.io/

Chapter 3

Implementation

The objective of this chapter is to describe how the Vilfredo system was implemented. I explain both the standalone modules developed and the extensions I made to the Firmament cluster manager.

Given the scale of the Firmament project and the large number of interdependent components that were integrated, I will restrict this chapter to describe the architecture of the Vilfredo system and notable algorithmic developments; I intend not to over-invest in descriptions of technicalities or low-level software implementations.

The noteworthy implementation approaches covered in this chapter are:

• Establishing exponential reservation decay (§3.1), which outlines the key components involved in setting up a reservation decay system, where reservations fall to a given fraction of their previous value. Vital infrastructure enhancements, scheduling improvements, and the task execution and monitoring approach are also discussed.

• Accounting for burstiness (§3.2), which presents strategies implemented to dynamically adjust how the reservation varies over time when task resource usage is bursty, i.e. the utilisation changes rapidly.

• Similar task-based usage prediction (§3.3), which explains a variety of features that were introduced for the purposes of predicting the future resource usage of a task, based on observing similar tasks. This includes not just determining how to use previous usage records, but also determining whether a task is similar to another.

3.1 Establishing exponential reservation decay

A significant challenge of this project was in establishing the first complete version of Vilfredo, comprising an end-to-end solution which was capable of executing tasks, tracking resource usage, controlling resource reservations, and effecting these reservations within Firmament.

3.1.1 Measuring and communicating task resource usage

Task execution

One of the earliest goals in the Vilfredo project was to establish a way of isolating tasks’ resource usage from each other, and to measure this on an individual, per-task basis. As discussed in §2.3.2, I opted to use Linux containers (LXC) [39] for this purpose. As a recap, the LXC architecture is depicted in figure 3.1.

Figure 3.1: Linux containers architecture: The infrastructure composition of an environment running three Linux containers, two of which are executing tasks. The diagram labels three containers (two holding an application), LXC, the system call interface separating user space from kernel space, the kernel facilities (IPC, namespacing, cgroups, memory, filesystem), drivers and devices.

To communicate information, such as a task binary, to a container, a handful of directories are mounted [51] into the container’s filesystem.

Task resource measurement

To track per-task resource utilisation information throughout execution, I adapted Firmament’s local executor (see §2.4.2). A new module launches cAdvisor (Container Advisor) [61], which uses cgroups (see §2.3.2) to expose container resource measurements via a RESTful [74] HTTP API. I used cURL¹ to periodically query the API, and Jansson² to parse the responses from JSON [25] format.

1https://curl.haxx.se/2http://www.digip.org/jansson/

https://curl.haxx.se/

http://www.digip.org/jansson/

26 CHAPTER 3. IMPLEMENTATION

Measuring memory

Memory is the easiest resource to measure as it can be taken directly from the cAdvisor API.

Tracking disk I/O and storage

Both disk I/O bandwidth and disk storage were targets for measurement, and both required extensions beyond directly parsing the cAdvisor API. For this purpose, I devised a separate tool for tracking the disk I/O and storage on a per-container basis. The approach used to determine the tracked values is discussed below.

Measuring disk I/O

The cAdvisor system reads disk I/O information, on a per-block-device basis, from /sys/fs/cgroup/blkio/ [15]. This information is then exposed via the API as:

• blkio.io_service_bytes: the number of bytes transferred to/from the disk by the cgroup which includes the container.

• blkio.io_service_time: the total time interval between request dispatch and completion for I/O requests made by the cgroup which includes the container.

Tracking the increments of the io_service_bytes value, d, and the io_service_time value, t, between readings can be used to estimate the I/O bandwidth during the period up to time i, as $bw_i$, according to eq. (3.1):

$$bw_i = \frac{\Delta d}{\Delta t} \tag{3.1}$$

In the absence of service time information,³ Vilfredo records the time interval between I/O measurements from the API, and uses this as an upper bound for $\Delta t$, to determine a lower bound for disk I/O bandwidth.
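The estimate in eq. (3.1), together with the wall-clock fallback, can be sketched in Python as follows. This is an illustration only, not Vilfredo's code: the function and parameter names are hypothetical, and the service time is assumed to be reported in nanoseconds.

```python
from typing import Optional


def estimate_io_bandwidth(
    prev_bytes: int,
    curr_bytes: int,
    prev_service_ns: Optional[int],
    curr_service_ns: Optional[int],
    wall_interval_s: float,
) -> float:
    """Estimate disk I/O bandwidth (bytes/s) between two counter readings."""
    delta_d = curr_bytes - prev_bytes
    if delta_d <= 0:
        return 0.0  # no bytes transferred in this period

    # Preferred: use the service-time increment as delta t (assumed nanoseconds).
    if prev_service_ns is not None and curr_service_ns is not None:
        delta_t = (curr_service_ns - prev_service_ns) / 1e9
        if delta_t > 0:
            return delta_d / delta_t

    # Fallback: the wall-clock interval is an upper bound on delta t,
    # so the result is a lower bound on the true bandwidth.
    return delta_d / wall_interval_s
```

For example, 1000 bytes served in 0.5 s of service time within a 2 s polling interval yields 2000 bytes/s, whereas the fallback alone would report 500 bytes/s.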

Measuring disk storage

cAdvisor does not offer direct, reliable storage usage information. To determine such values, I therefore had to recursively compute the disk footprint of a container. Since recursing through an entire directory structure can be slow, I decided to use AUFS⁴, a multi-layered unification filesystem [69], for the container. AUFS stacks multiple directories and exposes them as a unified view through a single mount point, as illustrated in figure 3.2. For this use case, the directories are rootfs and delta0:

³ The cAdvisor API documentation states that it only officially supports the CFQ I/O scheduler [10]; however, in practice it also exposes byte read information when Deadline [9] is used.

⁴ http://aufs.sourceforge.net/aufs.html


[Figure 3.2 diagram. Labels: Container; AUFS; delta0; rootfs; file rows F0 F2 F3 F4 and F0 F1 F2 F3; marked: deleted; operations read, write, delete.]

Figure 3.2: AUFS system with a Linux container: Reads occur from delta0 if the file exists there and rootfs if not. All writes (including marking files as deleted) interact with delta0, using a copy-on-write scheme.

• rootfs is the state of the filesystem containing all the files originally in the container before it is launched.

• delta0 contains additions to the container’s filesystem after it has started.

By calculating the size of rootfs (containing around 17,000 files) only once, a considerable speedup can be achieved on tasks that produce relatively few files.

The computation is implemented as a depth-first traversal [32, Ch. 3], since this minimises its memory footprint.
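A minimal Python sketch of this idea follows: the size of rootfs is computed once and cached, and only delta0 is re-traversed depth-first on each measurement. The function names are hypothetical, the sketch sums apparent file sizes rather than allocated blocks, and it makes no attempt to account for AUFS deletion markers.

```python
import os


def directory_size(root: str) -> int:
    """Sum the sizes of regular files under root using an explicit-stack DFS."""
    total = 0
    stack = [root]
    while stack:
        path = stack.pop()  # LIFO order gives a depth-first traversal
        with os.scandir(path) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    total += entry.stat(follow_symlinks=False).st_size
    return total


def container_footprint(cached_rootfs_size: int, delta_dir: str) -> int:
    """Combine the once-computed rootfs size with a fresh scan of delta0."""
    return cached_rootfs_size + directory_size(delta_dir)
```

The stack only ever holds directories awaiting a visit, so memory use depends on the shape of the tree rather than on the number of files.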

Communication within Firmament

The baseline Firmament system executed tasks by starting a process and inserting a monitoring library into it, which pushed periodic heartbeats (see §2.4.2).

With Vilfredo, tasks execute in containers managed via the LXC API [58], so monitoring them is more challenging: injecting a library is no longer trivial. In addition, unnecessary network bandwidth is expended in the baseline system to push messages to the master coordinator, which runs in the same executable as the executor itself. For this reason, I opted to replace this push-based communication system with a pull-based one. Task heartbeats and state change messages are regularly pulled from local executors to the coordinators, as per the system depicted in figure 3.3.

With the infrastructure in place to communicate messages about tasks to coordinators, the container monitor can then be used to periodically measure task resource usage, and attach this information to the task heartbeats. At the coordinator, these measurements are loaded into the knowledge base (see §2.4.2), which can be queried using task identifiers to determine the associated records.


[Figure 3.3 diagram. Components: Master; Coordinator (subordinate); Scheduler; Knowledge Base; Vilfredo; Executors, with Local Executor (busy) and Local Executor (idle); Task; Container (LXC); Container Monitor; cAdvisor. Data flows: task reservations, task state, task state requests, task messages, machine & task messages, heartbeats, state changes, usage measurements, usage data. Legend: internal module (hierarchical), internal module (shared), external module, object ownership, object pointer, in-memory communication, HTTP communication.]

Figure 3.3: Firmament architecture with Vilfredo: A much-simplified schematic identifying the key components and data flows in the new subordinate architecture. The master remains much the same, but with the addition of Vilfredo, as discussed in §3.3. All modules shown were modified for this project, and Vilfredo, the Container Monitor, the Container and cAdvisor were all introduced.


3.1.2 Reservation updates

Vilfredo updates resource reservations (sometimes called allocations in the literature) in a similar manner to Borg (see §2.2.4). The goal is to reduce reservations towards actual utilisation to support more tasks on the same number of machines.

Terminology

For clarity, table 3.1 defines terms used throughout the rest of this dissertation.

| Term | Symbol | Explanation |
|---|---|---|
| Resource | | A cloud infrastructure component, such as memory usage, disk capacity, or disk I/O bandwidth. |
| Capacity | | The total resources an infrastructure service can make available to end users to execute tasks. |
| Resource utilisation/usage | | The resources that a given task is using at a given time. This is measured periodically as per §3.1.1. |
| Resource request | $\vec{q}$ | For a given task, $\vec{q}$ is a vector indicating what resources may be necessary for execution. Determined by the user, $\vec{q}$ often provides a very loose upper bound on what the task will actually use. |
| Resource limit | $\vec{l}$ | For a given task, $\vec{l}$ is a vector indicating the maximum resource it is allowed to use. Set equal to $\vec{q}$ by Vilfredo, any task exceeding its $\vec{l}$ is terminated as per §3.1.3. |
| Resource reservation | $\vec{r}$ | For a given task, $\vec{r}$ is a vector representing the resources the system has allocated to it. Vilfredo varies $\vec{r}$ during a task’s execution (by strategies discussed in this chapter). |
| Reservation decay coefficient | $c_d$ | A coefficient between 0 and 1. When a resource reservation, $\vec{r}$, is decreased periodically (as per §3.1.2), it falls to $c_d$ times its current value. |
| Safety margin | $\vec{m}$ | To ensure the reservation, $\vec{r}$, does not fall too close to a task’s measured utilisation, $\vec{m}$ is the amount of extra resource that should be left reserved, in addition to $\vec{r}$. |
| Safety margin coefficient | $c_m$ | A coefficient between 0 and 1. With each $\vec{r}$ update, a task’s $\vec{m}$ is $c_m$ times the most recent measurement. |
| Boost | | If $\vec{r}$ ever falls below a task’s measured utilisation, a boost event occurs, where it is rapidly increased. |
| Boost coefficient | $c_b$ | A coefficient greater than 1. During a boost, the new $\vec{r}$ value is $c_b$ times the most recent utilisation measurement. |

Table 3.1: Vilfredo reservation terminology.


Reservation tracking

With regular records of task resource usage available, my next challenge was to calculate the task resource reservations during each task’s execution. The knowledge base was adapted to track per-task reservation resource vectors, representing the current values for the task’s reservation for each resource monitored.

Reservation initialisation

Prior to task execution, Vilfredo initialises this resource reservation, $\vec{r}$, to the resource request, $\vec{q}$, passed by the user on task submission. This resource request also provides the absolute upper bound for resource usage; if this is exceeded the task is terminated, as detailed below.

Reservation decay

Vilfredo periodically updates the reservations of each executing task.

The reservation is decayed as per the reservation decay coefficient, $c_d$, determined by a parameter between 0 and 1 (perhaps quite high, at around 0.9, to avoid freeing reservations only to have to return them immediately, as discussed in §4.3.5). Each time the reservation is updated, the decayed value is equal to the old value multiplied by $c_d$.

Updates happen at consistent time intervals, so the curve of reservation values over time displays exponential decay, falling at a rate proportional to its current value.

Ensuring safe reservations

It is imperative to ensure that each task is granted ample resources to cover its complete usage in practice. Thus, I introduced a safety margin, $\vec{m}$, which ensures that the reservation remains at least a constant multiple of the resource usage (perhaps around 1.2). Vilfredo queries the knowledge base to determine the most recent resource usage measurements for the task. The margin is determined by a safety margin coefficient, $c_m$, and the latest resource usage, for each type of resource monitored.

Additionally, if the reservation has fallen too quickly, or resource utilisation suddenly increases, a situation might arise where a utilisation measurement exceeds the reservation. If this ever occurs, Vilfredo will boost the reservation, using the boost coefficient, to $c_b$ times the utilisation (higher than the safety margin).

Finally, Vilfredo enforces the limit, $\vec{l}$, on the reservation. Regardless of the safety margin, or usage measured, each resource’s reservation is clamped to a maximum value, determined by the task’s original resource request.


Combined reservation updates

To update the value for the resource reservation, $\vec{r}$, the effects of the reservation decay coefficient, $c_d$, safety margin, $\vec{m}$, and limit, $\vec{l}$, are applied consecutively, considering the actual usage measurement, $\vec{u}$. This updates the reservation, $\vec{r}$, with exponential decay as per the recursive definition in eq. (3.2), using vector index notation.

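$$r_i = \begin{cases} \min\bigl(l_i,\ \max(u_i \cdot m_i,\ r_i \cdot c_d)\bigr) & \text{if } r_i \geq u_i \\ c_b \cdot u_i & \text{otherwise} \end{cases} \tag{3.2}$$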

Figure 3.4 demonstrates this reservation variation behaviour with idealised task resource utilisation measurements. Note that the terms in the diagram are for single vector elements and the numerical indices represent arbitrarily-spaced times.

[Figure 3.4 plot: memory against time, showing utilisation, reservation, and limit. Annotations: $r_0$ = limit; decay steps $r_1 = r_0 \cdot c_d$ and $r_2 = r_1 \cdot c_d$; $r_3 = u_1 \cdot c_m$; boost event; $r_4 = u_2 \cdot c_m$; safety margin; reclaimed resources.]

Figure 3.4: Vilfredo reservation decay: The reservation falls over time, to within a given safety margin of the actual usage. If the usage exceeds the reservation, it will be boosted.
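The combined update can be sketched in Python as follows. This is an illustration rather than Vilfredo's implementation: the function name and the default coefficients are hypothetical (0.9 and 1.2 echo the indicative values in the text, 1.5 is arbitrary), the margin floor is assumed to be $c_m$ times the latest usage as the annotations in figure 3.4 suggest, and a boost is assumed to occur whenever the reservation is below the measured usage, following the prose definition. The limit is applied to both branches.

```python
from typing import List


def update_reservation(
    r: List[float],
    u: List[float],
    limit: List[float],
    c_d: float = 0.9,   # decay coefficient (assumed value)
    c_m: float = 1.2,   # safety margin coefficient (assumed value)
    c_b: float = 1.5,   # boost coefficient (assumed value)
) -> List[float]:
    """Return the next reservation vector, one element per resource."""
    updated = []
    for r_i, u_i, l_i in zip(r, u, limit):
        if r_i < u_i:
            # Boost: usage has overtaken the reservation.
            candidate = c_b * u_i
        else:
            # Decay, but never below the safety margin above current usage.
            candidate = max(c_m * u_i, c_d * r_i)
        # The reservation never exceeds the task's limit.
        updated.append(min(l_i, candidate))
    return updated
```

Calling the function repeatedly at a fixed interval with steady usage reproduces the shape of figure 3.4: geometric decay from the limit until the safety margin floor is reached.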

3.1.3 Task limit enforcement

In the baseline Firmament implementation, a task’s resource request, $\vec{q}$, was submitted by the user, but the concept of a resource limit, $\vec{l}$, did not exist. With each additional measurement, Vilfredo compares resource utilisation to this limit (set equal to $\vec{q}$), and terminates tasks that exceed this value for any resource.


3.1.4 Cost model reservation consideration

The overarching concept behind Vilfredo is to permit newly-scheduled tasks to use the resources freed by decreasing the reservation for existing tasks below their original resource request. As such, I had to extend Firmament with new features, to allow tasks to be scheduled to machines with sufficient unreserved resource. Two key areas required modification: the cost model (see §2.4.2), and the calculation of machine resource capacities.

I decided to modify the Coordinated Co-location (CoCo) cost model [76, §5.5.3]. This model amalgamates reserved resources under nodes by starting at the level of a Processing Unit (PU), and iterating upwards towards the scheduling node, combining the reservations found at each level. At the PU level, such reservations are calculated as the difference between the capacity allocated to the unit, and the resource request of the task it executes, if any. I improved the cost model by determining whether the task a PU was executing had known reservations, as calculated by Vilfredo, and using these in place of resource requests where available. The modified CoCo cost model is represented by figure 3.5.

[Figure 3.5 diagram: a tree with the master machine (M) at the root, above a rack of machines (R), subordinate machines (S), and idle and busy PUs. Nodes are annotated with reservation, request, and capacity values, and an arrow shows the direction of amalgamation.]

Figure 3.5: Enhanced CoCo cost model resource amalgamation: Where there are executing tasks, the reservation is now considered (or the request if none is available) and subtracted from the capacity. The available capacity is amalgamated by summing the value seen at lower nodes, at each progressively higher node in the tree.
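The amalgamation can be sketched in Python for a single scalar resource. The `Node` class and `available` function are hypothetical stand-ins for Firmament's resource topology, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    capacity: float = 0.0                 # meaningful for PUs (leaves) only
    request: Optional[float] = None       # request of the running task, if any
    reservation: Optional[float] = None   # Vilfredo reservation, if known
    children: List["Node"] = field(default_factory=list)


def available(node: Node) -> float:
    """Available capacity at a node, amalgamated upwards from the PUs."""
    if not node.children:
        if node.request is None:
            return node.capacity  # idle PU: everything is available
        # Busy PU: prefer the reservation, fall back to the request.
        used = node.reservation if node.reservation is not None else node.request
        return node.capacity - used
    return sum(available(child) for child in node.children)
```

With three PUs of capacity 4 (one busy with a request of 3 and a reservation of 1, one idle, one busy with a request of 3 and no known reservation), the parent sees 3 + 4 + 1 = 8 units available.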

However, reservations are calculated at executing subordinate coordinator nodes, and task scheduling occurs at the level of the master coordinator, which hence does not know the reservation. To solve this, the task heartbeats (see §2.4.2), sent to master nodes via HTTP, were augmented to include tasks’ reservations, which are then recorded by the master.

Calculation of machine resource capacities is performed by subordinate coordinators, and passed up to the master, for consideration in the cost model. The original version of Firmament supported capacities for memory and disk I/O bandwidth, but not disk space capacity. I decided to use statvfs(2) [52] to determine the available space, and include this in the capacity messages.

3.1.5 Machine over-allocation policy

As described, Firmament with Vilfredo now supports allocations where the sum of resource requests for the executing tasks on a machine exceeds its capacity, but where the sum of the reservations does not. However, as the reservations are regularly updated during task execution, and are only bounded by the task’s resource request, a situation may arise where the sum of the reservations rises above the machine’s capacity. Where this occurs, to mitigate risk of system failure, tasks must be preempted to free reserved resources. Since the tasks selected for preemption have not exceeded their resource request, they should be rescheduled elsewhere in the cluster.

[Figure 3.6 diagram: a heap holding Tasks 1 to 5, with the next candidate for rescheduling marked; Task 5 is annotated with Reservation: 20, Priority: 1.]

Figure 3.6: Task preemption sorting heap: The item at the top of the heap will be the next selected for preemption if insufficient machine resources remain unreserved. Note that, due to its high priority, the task with the greatest reservation will be the last selected.

Vilfredo periodically looks at how much resource has been reserved on a machine and, if this exceeds the machine capacity, it determines the amount of resources that have to be freed. Vilfredo then sorts the tasks using a heap, ordered first in reverse order of priority (see §2.4.3), and then by the size of their resource reservations, as seen in figure 3.6. Until there are no more resources to free, tasks are iteratively popped off the heap, their reservations subtracted from the resources to free, and they are added to a set of tasks to reschedule. The purpose of the task ordering is to ensure that important tasks are less likely to be rescheduled, but tasks with higher reservations are more likely.
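A minimal Python sketch of this selection, for a single scalar resource, is given below. The tuple layout and function name are hypothetical, and the sketch assumes that a larger priority value denotes a more important task; the first key would need negating if the convention is the opposite.

```python
import heapq
from typing import List, Tuple

Task = Tuple[int, int, float]  # (task_id, priority, reservation); assumed layout


def select_for_preemption(tasks: List[Task], capacity: float) -> List[int]:
    """Return the ids of tasks to reschedule so reservations fit in capacity."""
    to_free = sum(reservation for _, _, reservation in tasks) - capacity

    # Min-heap keyed on (priority, -reservation): the least important task
    # comes first, and larger reservations break ties.
    heap = [(priority, -reservation, task_id)
            for task_id, priority, reservation in tasks]
    heapq.heapify(heap)

    chosen = []
    while to_free > 0 and heap:
        _, negated_reservation, task_id = heapq.heappop(heap)
        to_free += negated_reservation  # subtracts the task's reservation
        chosen.append(task_id)
    return chosen
```

With tasks `(1, 0, 5.0)`, `(2, 0, 10.0)` and `(3, 2, 20.0)` on a machine of capacity 25, only task 2 is selected: it shares the lowest priority with task 1 but frees more resource.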

I adapted the coordinator to terminate tasks locally before sending a state change message (see §2.4.2) to the master coordinator. The state change messages were augmented to incorporate a parameter to indicate the task should be rescheduled. On receipt of such a message, the enhanced master coordinator removes the task from the flow graph, and re-submits it to the scheduler.

3.1.6 Cgroup-enforced limits

As explained, task limits are enforced by terminating tasks that exceed their resource request. This may lead to reduced overall performance of a cluster system, due to time wasted allocating resources to a task that never completes.

Vilfredo minimises the likelihood of such events by throttling tasks’ access to certain resources, by means of cgroups (see §2.3.2), in addition to enforcing hard limits on containers.

When a container has used all of its allocated memory, the container acts as though the system has insufficient unused accessible physical memory; hence swap space is used instead. The per-container swap space allocation itself is determined by a parameter.

3.2 Accounting for burstiness

An important observation to make when setting reservations is that if they blindly follow the pattern of utilisation over short periods of time, they are likely to result in higher levels of task preemption. Such a situation might arise if a task experiences a short period of low resource usage, followed by a large burst. The reservation may decay during the first period, permitting another task to be scheduled to the machine, but a task must be preempted when the burst forces the original task’s reservation higher (especially so if a boost event is induced). Hence, for burstier tasks, it might be useful to decay the reservation less and set a higher safety margin.

Furthermore, if we can determine that a given task does not exhibit bursty resource usage variation, then it may be possible to both apply a smaller safety margin, and accelerate reservation decay, to reclaim more allocated resources faster.


3.2.1 Directly measuring burstiness

Burstiness metrics

I take burstiness to refer to the intermittent increases and decreases in the measured value of time series data, such that a larger burstiness is associated with greater fluctuations [66].

With this definition providing a general idea of what we seek to measure, recordings of resource usage can be examined to quantify the current burstiness of usage. Several such metrics are discussed in the literature.

| Chosen | Burstiness metric | Formula | Advantage(s) | Disadvantage(s) |
|---|---|---|---|---|
| ✗ | Variance [14] | $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$ | Provides a mathematical measure for how much a set of numbers is dispersed. | Does not account for the magnitude of values in the set. |
| ✗ | Index of dispersion [24] | $\frac{\sigma^2}{\mu}$ | Accounts for the magnitude of values in the set. | Requires iteration over all the values in the time series. |
| ✓ | Fano factor [33] | $\frac{\sigma_w^2}{\mu_w}$ | Windowing enables calculation with unbounded time series. | |

Table 3.2: Comparison of burstiness metrics: Possible calculations that could be used to estimate the burstiness of a given resource in a task.

Based upon my findings detailed in table 3.2, I decided upon the Fano factor as the metric for estimating task burstiness.

Being derived from the index of dispersion, the Fano factor can be used to split distributions into three distinct groups, for ease of discussion:

• An under-dispersed distribution has a Fano factor of value < 1. These are less bursty than a “normal” task.

• A task which exhibits a “normal” degree of burstiness (that is, the utilisation variance and mean are equal), has a Fano factor of exactly 1.

• An over-dispersed distribution has a Fano factor of value > 1. This is burstier than a “normal” task.

I modified Vilfredo’s decay iteration to calculate the Fano factor for each task. At time $i$, this fetches usage records from the knowledge base, and iterates through the last $w$ measurements, of $m$ total, where $w$ is a parameter-determined window size. Thus, for the window of previous measurements, of size $w$, at time $i$, the mean, $\mu_{w_i}$, and variance, $\sigma^2_{w_i}$, for each resource type are calculated, and used to determine the Fano factor as:

$$b_i = \frac{\sigma^2_{w_i}}{\mu_{w_i}} \tag{3.3}$$

Such calculations are made for each type of resource tracked, in order to determine a Fano factor for each. Each resource’s burstiness can then be considered independently.
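For a single resource, the windowed calculation of eq. (3.3) can be sketched in Python as follows. The function name is hypothetical, the population variance of table 3.2 is used, and returning zero for an all-zero window is an assumption made to avoid division by zero.

```python
from typing import Sequence


def fano_factor(measurements: Sequence[float], w: int) -> float:
    """Fano factor (variance / mean) over the last w usage measurements."""
    window = list(measurements[-w:])
    if not window:
        raise ValueError("at least one measurement is required")
    n = len(window)
    mean = sum(window) / n
    if mean == 0:
        return 0.0  # assumed convention for a task with no usage
    variance = sum((x - mean) ** 2 for x in window) / n
    return variance / mean
```

Each call iterates over the whole window, which is the per-task cost of O(w) that motivates the approximation discussed next.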

Figure 3.7 presents examples of functions with different Fano factors.

[Figure 3.7 plots: three panels, each showing usage (0 to 40) against time (0 to 60).]

(a) Utilisation profile for a normal task: The Fano factor for a task with this “normal” utilisation profile is 1.00.

(b) Utilisation profile for an under-dispersed task: The Fano factor for a task with this stable utilisation profile is 0.11.

(c) Utilisation profile for an over-dispersed task: The Fano factor for a task with this bursty utilisation profile is 13.12.

Figure 3.7: Comparison of utilisation profiles for varying burstiness.


Approximate Fano factor

Calculating the burstiness for each task requires iterating through $w$ records for each, every time the reservation is updated. With $n$ tasks running on the same machine, determining the exact burstiness for each takes $O(wn)$ time. With Firmament, $n$ is limited by the number of CPU cores, so one would typically expect that $w \gg n$.

Since reservations can be updated arbitrarily frequently, when $w$ is large, some reasonable parameter settings could cause Firmament to require substantial CPU resources, simply to update reservations. Hence, getting an idea for the burstiness of a task over a longer stretch of its execution is problematic.

For this reason, I devised an alternative calculation to approximate the Fano factor, which can instead be computed in $O(n)$ time, independent of $w$.

As depicted in figure 3.8, two queues are instead tracked throughout the task’s execution, holding, for each resource measurement at time $i$:

• The utilisation measurement, $x_i$.

• The value $S_i$, as described by eq. (3.4), where $\mu_{w_i}$ is the mean of the utilisation measurements in the window at time $i$, calculated using the above queue.

$$S_i = \sum_{j=i-w}^{i} (x_j - \mu_{w_j})^2 \tag{3.4}$$

Hence, by tracking the sum of items in each queue, the Fano factor is approximated, as $\hat{b}_m$ at time $m$, according to eq. (3.5), where $\mu_{w_m}$ is the mean of the utilisation measurements, $x_i$, in the window at that time.

$$\hat{b}_m = \frac{\sum_{i=m-w}^{m} S_i}{\mu_{w_m}} \tag{3.5}$$

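[Figure 3.8 diagram: task T burstiness data, held as two queues, $S_{n-w-1}, S_{n-w}, S_{n-w+1}, \dots, S_{n-1}, S_n$ and $x_{n-w-1}, x_{n-w}, x_{n-w+1}, \dots, x_{n-1}, x_n$. The oldest entry is subtracted from the running sum and deleted; the newest is added and inserted.]

Figure 3.8: The lists and sums stored by Vilfredo to calculate an approximation to the Fano factor in constant time, independent of the window size.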


A minimum queue size is enforced to ensure that sufficient data is available to estimate burstiness, and a maximum is used to allow for memory constraints and ensure that the value can dynamically adjust to the current usage pattern.
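One way to realise a constant-time update in the spirit of this two-queue scheme is sketched below in Python. It is an interpretation, not Vilfredo's code: the class and method names are hypothetical, each stored S entry is assumed to be the squared deviation of a measurement from the window mean at the time it arrived, and the running sum of S is divided by the number of entries so that the result is comparable to a variance-to-mean ratio.

```python
from collections import deque
from typing import Optional


class ApproxFano:
    """Running approximation of the Fano factor for one task and resource."""

    def __init__(self, max_window: int, min_window: int = 2) -> None:
        self.max_window = max_window
        self.min_window = min_window
        self.xs = deque()   # utilisation measurements in the window
        self.ss = deque()   # squared deviations recorded on arrival
        self.sum_x = 0.0
        self.sum_s = 0.0

    def add(self, x: float) -> None:
        """Insert a measurement in O(1), evicting the oldest if needed."""
        self.xs.append(x)
        self.sum_x += x
        if len(self.xs) > self.max_window:
            self.sum_x -= self.xs.popleft()
            self.sum_s -= self.ss.popleft()
        mean = self.sum_x / len(self.xs)
        s = (x - mean) ** 2  # deviation from the window mean at this time
        self.ss.append(s)
        self.sum_s += s

    def value(self) -> Optional[float]:
        """Approximate Fano factor, or None until min_window entries exist."""
        n = len(self.xs)
        if n < self.min_window:
            return None
        mean = self.sum_x / n
        if mean == 0:
            return 0.0
        return max(self.sum_s, 0.0) / n / mean  # guard against float drift
```

Because older deviations were computed against earlier window means, the result only approximates the exact windowed value, but no iteration over the window is needed on update.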

3.2.2 Determining decay coefficient and safety margin

Having calculated the Fano factor for a resource-task pair, I adapted Vilfredo to alter the task’s reservation, accounting for the estimated burstiness.

Modifying the reservation decay

First, the Fano factor, $b$ (perhaps using the approximation described in §3.2.1), is scaled using the transformed logistic function described in §3.3.4, to determine a new coefficient, $c_d'$, in the range of 0 to 1. This can then be used directly as the reservation decay coefficient, as per eq. (3.6), with a low minimum value enforced.

$$c_d' = \max(\mathrm{TLogistic}(b, k),\ \mathit{min\_decay\_coeff}) \tag{3.6}$$

As such, burstier tasks have reservations that decay slowly, or not at all; whilst tasks that exhibit very little burstiness see accelerated reservation decay, freeing up resources faster.

In fact, Vilfredo is capable of calculating the $k$ value in eq. (3.6), such that a task that is neither under-dispersed, nor over-dispersed, will have a coefficient equal to the flag-defined value, $c_d$.

To determine this value, first note that the expanded formula to calculate the adjusted $c_d$ value is as per eq. (3.7).

$$c_d' = \max\left(\left(\frac{2}{1 + \exp(-bk)} - 1\right),\ \mathit{min\_decay\_coeff}\right) \tag{3.7}$$

We can re-arrange this to determine the necessary $k$ value algebraically, such that $c_d' = c_d$ when $b = 1$, as per eq. (3.8).

$$k = -\ln\left(\frac{2}{c_d + 1} - 1\right) \tag{3.8}$$

Figure 3.9 provides a plot representing the functional relationship between the calculated burstiness, $b$, and the resulting adjusted decay coefficient, $c_d'$.


[Figure 3.9 plot: $c_d'$ (0.80 to 1.00) against $b$ (0.0 to 3.0).]

Figure 3.9: Burstiness-adjusted decay coefficient, with a minimum of 0.85 and $k$ calculated such that $c_d' = 0.95$ when $b = 1$.
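Equations (3.6) to (3.8) can be checked numerically with a short Python sketch. The function names are hypothetical, and the form of the transformed logistic function is taken from eq. (3.7); the sketch assumes $0 < c_d < 1$.

```python
import math


def solve_k(c_d: float) -> float:
    """Eq. (3.8): the k for which the adjusted coefficient equals c_d at b = 1."""
    return -math.log(2.0 / (c_d + 1.0) - 1.0)


def t_logistic(b: float, k: float) -> float:
    """Transformed logistic function, as expanded in eq. (3.7)."""
    return 2.0 / (1.0 + math.exp(-b * k)) - 1.0


def adjusted_decay(b: float, c_d: float, min_decay_coeff: float) -> float:
    """Eq. (3.6): burstiness-adjusted decay coefficient with a lower bound."""
    return max(t_logistic(b, solve_k(c_d)), min_decay_coeff)
```

Using the parameters of figure 3.9 ($c_d = 0.95$, minimum 0.85), `adjusted_decay(1.0, 0.95, 0.85)` recovers 0.95 up to rounding, a task with $b = 0$ is held at the minimum of 0.85, and larger values of $b$ approach, but never reach, 1.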

Modifying the safety margin

Additionally, I decided to introduce an alternative safety margin: the burstiness-adjusted safety margin is more likely to exceed the value of bursts observed, and hence a boost will be less likely to occur. Based on practical observations of the relationship between the calculated Fano factors, and the shape of the usage plots (such as those seen in §4.5), I determined that a linear relationship between safety margin and Fano factor would be suitable, clamped within a given range of reasonable values.

Hence, the adjusted safety margin coefficient, $c_m'$, is found using the burstiness, $b$, as per eq. (3.9). Once again, I set this up such that a task that is neither under-dispersed, nor over-dispersed, has a safety margin equal to the flag-defined value, $c_m$.

$$c_m' = \max(\min((b - 1) + c_m,\ \mathit{max\_safety}),\ \mathit{min\_safety}) \tag{3.9}$$

Crucially, when burstiness is considered, the minimum safety margin (used with very low burstiness) is half of the standard value. Hence, double the resources can be reclaimed from tasks that exhibit very stable behaviour.

Figure 3.10 provides a plot representing the functional relationship between the calculated burstiness, $b$, and the resulting adjusted safety margin coefficient, $c_m'$.


[Figure 3.10 plot: $c_m'$ (0.0 to 2.5) against $b$ (0.0 to 3.0).]

Figure 3.10: Burstiness-adjusted safety margin coefficient, with a minimum of 1.15 and a maximum of 2.
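The clamped linear relationship of eq. (3.9) is small enough to sketch directly in Python. The function name is hypothetical, and the value $c_m = 1.2$ in the usage note is an assumption borrowed from the indicative figure given earlier in the chapter.

```python
def adjusted_safety_margin(
    b: float, c_m: float, min_safety: float, max_safety: float
) -> float:
    """Eq. (3.9): linear in the burstiness b, clamped to [min_safety, max_safety]."""
    return max(min((b - 1.0) + c_m, max_safety), min_safety)
```

With the bounds of figure 3.10 (1.15 and 2) and an assumed $c_m$ of 1.2, a task with $b = 1$ keeps the coefficient 1.2, a very stable task with $b = 0$ is clamped to 1.15, and a bursty task with $b = 3$ is clamped to 2.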

3.2.3 Applying exponential averaging to usage observations

In a bursty task, the observed usage often comes in the form of sharp, sporadic peaks. The purpose of the resource reservation of a task is to represent the amount of resources reserved for a task. Rapidly increasing the reservation on the basis of a single burst that is unlikely to be followed by another does not serve this purpose; it is better to attempt to get an idea of the general usage of resources, and let the Fano factor account for burstiness of tasks.

For this reason, I opted to extend Vilfredo by considering the exponential average [19] of resource usages over all previous observations. With each decay iteration, Vilfredo updates a record of the exponentially-smoothed usage, initialised to the first measured usage. Thus the smoothed utilisation measurements, $\overrightarrow{sm}$, are updated based upon the most recent utilisation measurements, $\vec{u}$, using the utilisation smoothing coefficient, $c_s$, as per eq. (3.10).

$$\overrightarrow{sm} = (1 - c_s) \cdot \overrightarrow{sm} + c_s \cdot \vec{u} \tag{3.10}$$

I designed this feature for optimal compatibility with the burstiness calculation, leading me to apply two safety margins:

• With a bursty task, the burstiness-adjusted safety margin will be larger; it would be problematic to over-compensate for individual bursts, so the burstiness-adjusted safety margin is applied to the exponentially-smoothed usage.

• To ensure that the reservation does not fall below such bursts, the original safety margin is also adhered to, but here considering the single most recent usage observation.

How these safety margins vary throughout the execution of a bursty task is illustrated in figure 3.11. For simplicity, this diagram assumes that the reservation has already decayed to within the safety margin, and makes no adjustments for decays or boosts.


[Figure 3.11 plot: memory against time, showing utilisation, smoothed utilisation, utilisation + non-adjusted safety margin, and utilisation + adjusted safety margin.]

Figure 3.11: Idealised representation of the two safety margins: Ignoring boost and decay, the maximum of these two, at any given time, will be the actual reservation.
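The smoothing step of eq. (3.10) and the combination of the two safety margins can be sketched in Python as follows. The function names are hypothetical, and both margins are assumed to be applied multiplicatively, with a per-resource list of burstiness-adjusted coefficients.

```python
from typing import List


def smooth(prev_smoothed: List[float], u: List[float], c_s: float) -> List[float]:
    """Eq. (3.10): exponentially-smoothed usage, per resource."""
    return [(1.0 - c_s) * s + c_s * x for s, x in zip(prev_smoothed, u)]


def margin_floor(
    u: List[float],
    smoothed: List[float],
    c_m: float,
    c_m_adjusted: List[float],
) -> List[float]:
    """Lower bound on the reservation: the larger of the two safety margins.

    The standard margin uses the latest usage; the burstiness-adjusted
    margin uses the smoothed usage.
    """
    return [
        max(c_m * x, adjusted * s)
        for x, s, adjusted in zip(u, smoothed, c_m_adjusted)
    ]
```

For instance, a smoothed usage of 10 updated with a reading of 20 at $c_s = 0.5$ becomes 15; with a standard coefficient of 1.2 and an adjusted coefficient of 2, the floor is then max(24, 30) = 30.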

3.2.4 Calculating the smoothing coefficient

The smoothing coefficient, $c_s$, affects how rapidly the exponentially-smoothed usage varies with each additional measurement.

In effect, this usage is a weighted arithmetic mean, where more recent measurements have greater weights. In fact, the sequence of weights forms a geometric progression, as the $i$th most recent weight has the form of eq. (3.11), where $a = 1 - r = c_s$.

$$w_i = a r^{i-1} \tag{3.11}$$

Hence, the sum of the $n$ most recent weights can be calculated using a geometric series, as per eq. (3.12).

$$\sum_{i=1}^{n} w_i = \frac{a(1 - r^n)}{1 - r} \tag{3.12}$$

For example, one may predict that a reasonable coefficient would be such that the $n$ most recent measurements contribute a fraction $w$ of the total weight of the smoothed average. Setting the sum in eq. (3.12) equal to $w$, and substituting $a = 1 - r = c_s$, gives $1 - (1 - c_s)^n = w$, from which we can determine the necessary smoothing coefficient as per eq. (3.13).

$$c_s = 1 - (1 - w)^{\frac{1}{n}} \tag{3.13}$$

Vilfredo permits the cluster administrator either to specify the $n$ value, with $w = 0.5$ assumed, or to explicitly provide $c_s$.
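Equation (3.13) can be verified numerically with a brief Python sketch; the function name is hypothetical.

```python
def smoothing_coefficient(n: int, w: float = 0.5) -> float:
    """Eq. (3.13): c_s such that the n most recent measurements carry weight w."""
    return 1.0 - (1.0 - w) ** (1.0 / n)


# Check against eq. (3.11): the n most recent weights should sum to w.
c_s = smoothing_coefficient(10)
recent_weight = sum(c_s * (1.0 - c_s) ** i for i in range(10))
```

Here `recent_weight` evaluates to 0.5 up to floating-point error, confirming that the ten most recent measurements account for half of the smoothed average.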


3.3 Similar task-based usage prediction

When running at large scale, many tasks submitted to the cluster system for execution have similar resource utilisation patterns. Records of such similar tasks can be deployed to predict the usage of newly-submitted tasks.

This section presents a selection of approaches I implemented to utilise such information in an effort to optimise prediction of task resource utilisation.

3.3.1 Notion of timeslicing

After a task completes, the measurements taken of its utilisation might be useful to predict utilisation of similar tasks in future. Timeslicing offers a means of describing how utilisation measurements that are taken at different times are treated for this purpose.

When reservations are updated, an early Vilfredo system would implicitly estimate the next resource utilisation values based on the most recent readings, and vary the reservation accordingly. By looking at timeslice information, it might be possible to determine more accurate predictions of future resource utilisation; this might then be useful to set more accurate reservations.

Here, I take a timeslice to refer to a subset of the execution time of a given task. A task’s entire execution can be broken down into an ordered sequence of timeslices, where each has a corresponding usage resource vector in its record. Timeslices do not overlap, but their aggregate represents the entire execution of a task.

The implementations supported by Vilfredo are:

No timeslicing

No timeslicing refers to the situation where a single usage measurement is held in the task’s record. In practice, the 90th percentile value of all observations is held (and is represented as a single timeslice in the record). This situation is represented in figure 3.12.

This value is simple to calculate, and requires the least memory to store. This may mean more records can be held, and hence the set of tasks found might be more similar to the query when using the approach described in §3.3.2.

This value might also offer a better value for initialisation, since it is predictive of the entire execution of the program, preventing the rapid decay that a low first timeslice measurement might induce.


[Figure 3.12 diagram: fifteen measurements collected between t = 0s and t = 300s (m = 0, 0, 0, 0, 0, 7, 7, 4, 4, 2, 30, 12, 9, 9, 7), shown by time and then re-sorted by size. Selected measurement = 90th percentile = 14th largest. Number of measurements in record = 1.]

Figure 3.12: No timeslicing: A completed task’s utilisation measurements are sorted by size, and the 90th percentile value is stored in the record.

Fixed-duration timeslicing

Fixed-duration timeslicing is where each resource utilisation measurement, as found in the knowledge base, corresponds to a separate, individual timeslice in the task’s record. This situation is represented in figure 3.13.

[Figure 3.13 diagram: Task 1 timeslices from t = 0s to t = 300s (timeslice duration = 20s, number of timeslices in record = 15), and Task 2 timeslices from t = 0s to t = 60s.]

Figure 3.13: Fixed-duration timeslicing: A completed task’s utilisation measurements are each taken to correspond to their own timeslice, and the value of that measurement is stored in the record.

These timeslice values are easier to calculate than those for variable timeslicing. Additionally, a granularity of information that corresponds directly to the availability of measurements may provide better predictions. However, the per-task memory requirement is unbounded, and this approach is unlikely to provide accurate predictions where similar tasks follow similar profiles but have varying durations of execution.

Variable-duration timeslicing

Variable-duration timeslicing requires a positive integer number of per-task timeslices to be specified globally. Hence, every task record, regardless of the corresponding task’s duration, has the same number of values. This approach is represented in figure 3.14.

[Diagram: Task 1 (t = 0s to t = 300s) and Task 2 (t = 0s to t = 60s) are each divided into the same number of timeslices.]

Figure 3.14: Variable-duration timeslicing: A completed task’s utilisation measurements are grouped such that every record has an equal number of timeslices. A timeslice’s value is taken as the 90th percentile measurement value in the corresponding group.

This strategy is more complicated to understand, and is dependent on estimates of the current task’s duration based upon the median value of the similar tasks (see §3.3.3). However, this approach supports reasoning about normalised task durations. For example, if two tasks have similar requests, they may be likely to follow similar resource utilisation profiles, but be working on different amounts of input data, and hence one could take longer.

The implementation implications of each approach are addressed at the relevant points throughout the rest of this section.

3.3.2 Producing a set of records

The rate of task submission to a large production cluster system can be very high. Clearly, the greatest problem faced when attempting to use observed task records to make these predictions is reducing the set of records to a size that makes the problem tractable.

Vilfredo determines the set of records at the master coordinator. The master receives the measurements of all tasks that execute in the system, and hence has all possible information to determine the record sets that are likely to produce better predictions. This set is then forwarded to the subordinate machine that is selected to execute the new task. Since the usages in all records in the set may not be equally well-correlated with the new task, the subordinate weights each record to make its predictions (see §3.3.4).


Random sampling approach

Perhaps the most obvious, and simplest, solution to narrowing down the set of tasks is to randomly generate a set, X, of distinct whole numbers in the range of 0 to n, the number of records held. Then, to produce the sample set, one could simply include the ith most recent record, for every i in X.

Figure 3.15 offers a schematic for this approach.

[Diagram: a record generator draws randomly-generated indices into the stored records R0 to R11; a clashing index is rejected (×), and the selected records form the output record set.]

Figure 3.15: Randomly-generated record set: A random number generator could select records by index. Clashes could be dealt with by iteratively selecting another record until all items are unique.

In practice, I decided that such a solution would likely require such a large value of n to make accurate predictions that the problem would not be practical in a production system, and sought a better alternative.

K nearest-neighbours task-similarity approach

With each task submission to Firmament, a user-defined resource usage request is provided. I hypothesised that such requests would be systematically correlated with the resource utilisation in practice. Therefore, by using a set of tasks with similar resource requests, accurate usage predictions can be made.

Request similarity algorithmic approach

A number of approaches discussed in the literature seemed feasible when considering using request similarity to generate sets of records. Some of these are outlined in table 3.3.

To statistically determine a set of tasks with similar resource requests, I deployed the Approximate Nearest Neighbour (ANN) library due to Mount et al. [65], the details of which were explained in §2.5.3.

A key feature of the ANN library is the application of an approximation factor, ε. By varying ε, Vilfredo can minimise query time as the number of records rises. This ensures that the solution can scale to a reasonable size without dramatically increasing processing time as new tasks are submitted.


|   | Statistical solution | Advantage(s) | Disadvantage(s) |
|---|---|---|---|
| ✗ | Artificial neural networks [75] | Support combinations of categorical and continuous features. Can represent complicated functions. | Require large training sequences to provide valuable results. |
| ✗ | K-nearest neighbours with a k-d tree [12] | Capable of providing accurate similarities with few inputs. | Poor performance with high clustering or dimensionality. |
| ✓ | Approximate nearest neighbours with a BBD tree [8] | Provides accurate similarities. Can vary error bound to improve speed at larger scales. | |

Table 3.3: Comparison of machine learning task-similarity approaches: Possible statistical solutions for determining sets of similar tasks.

ANN tree construction

The ANN solution takes time linear in the number of data points it represents to create a new tree [65, Ch. 2]. As such, it would be infeasible to recreate the tree with each additional record, while supporting a large number of tasks. Instead, Vilfredo adds each additional submitted record into a queue, awaiting insertion into the tree. Vilfredo then rebuilds the tree when the length of this queue exceeds a parameter-defined threshold times the number of elements in the tree. Hence, the rebuilding process takes amortised constant time, over the duration of execution of the master coordinator (by the same argument used to determine the time complexity of insertion with a dynamically-sized array [38]).
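The queue-and-rebuild policy can be illustrated with a small sketch. This is not Vilfredo’s code and does not call the ANN library: the class `RebuildingIndex`, its `rebuild_ratio` parameter and the `work` counter are hypothetical names, and a plain list stands in for the tree. It only shows why the total rebuild work stays proportional to the number of submitted records.

```python
class RebuildingIndex:
    """Toy stand-in for the tree plus its insertion queue (not the ANN API)."""

    def __init__(self, rebuild_ratio):
        self.rebuild_ratio = rebuild_ratio  # hypothetical threshold parameter
        self.indexed = []   # points currently represented in the "tree"
        self.pending = []   # submitted points awaiting insertion
        self.rebuilds = 0   # number of rebuilds performed
        self.work = 0       # total points touched across all rebuilds

    def submit(self, point):
        self.pending.append(point)
        # Rebuild once the queue exceeds threshold x current tree size.
        if len(self.pending) > self.rebuild_ratio * len(self.indexed):
            self._rebuild()

    def _rebuild(self):
        # Stand-in for a linear-time tree construction over all points.
        self.indexed = self.indexed + self.pending
        self.pending = []
        self.rebuilds += 1
        self.work += len(self.indexed)


index = RebuildingIndex(rebuild_ratio=0.5)
for i in range(1000):
    index.submit((float(i), float(i)))
print(index.rebuilds, index.work)
```

Because each rebuild grows the tree by a constant factor, the rebuild sizes form a geometric series, so the accumulated `work` remains a small constant multiple of the number of submissions.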

During the tree-building (or rebuilding) process, Vilfredo determines the resource requests from the metadata held in the queue, and holds pointers to queue items themselves, as depicted in figure 3.16. To construct the tree, the resource requests are normalised (as described below) and passed to ANN. Each query returns a set of k indices indicating the matched resource request, which are used by Vilfredo to determine the corresponding usage record and metadata.

Delivering a scalable solution is a significant consideration of this project, so memory overuse is a concern. Hence, I adopted a fixed limit on the number of items that the record queue can contain. Items cannot be removed from the queue as new items are submitted, since their corresponding resource requests are still represented in the tree. Instead, Vilfredo addresses this problem at the point of tree-rebuilding, popping items off the queue prior to tree reconstruction.


[Diagram: Vilfredo holds the records (including un-inserted record data), a record pointer array and a normalised request array. Points are normalised and passed to ANN; a submitted query returns the nearest query array indices (3 in the example), which are used to add records to the record set.]

Figure 3.16: Pipeline used by Vilfredo to query the tree for tasks with similar resource requests. The ANN response indexes into an array of pointers which themselves point to list items of metadata (including the similar tasks’ usage measurements).

Normalised resource request representation

During tree construction and querying, one must decide how to map the resource requests into graph points. The naïve solution would be to convert each integer resource component of the request to a real number, and use each as the corresponding point’s value in a separate dimension. However, this has a key disadvantage.

The request values for different resources are likely to form clusters around different areas. Consider, for example, a resource request consisting of just RAM and disk bandwidth. With Vilfredo, Firmament represents requests for RAM in MiB⁵, and requests for disk bandwidth in MiB/s. The values considered with each of these resources are likely to be on different scales. On a cluster using traditional HDDs, disk bandwidth might be limited to 50 MiB/s, leaving a range of 0 to 50 for reasonable request values. However, the memory supported could quite reasonably reach 6144 MiB, or be as low as 512 MiB. Hence, with the naïve solution, distances are dominated by the memory dimension: a request that matches the query’s memory but differs by the full 50 MiB/s range of disk bandwidth is considered closer than one that matches the disk bandwidth but differs by only 100 MiB of memory.

To account for this issue, Vilfredo normalises the request values to the mean. Each point added to the tree, and each point queried, is divided, in each dimension, by the corresponding mean.
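A minimal sketch of the effect of mean normalisation follows. The request values and the helper names (`dimension_means`, `normalise`, `euclidean`) are hypothetical, and a linear scan stands in for the ANN query; the point is only that dividing each dimension by its mean stops the memory dimension from dominating the distance.

```python
import math


def dimension_means(points):
    """Mean of each dimension over the points held in the tree."""
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]


def normalise(point, means):
    """Divide each component by that dimension's mean (assumed non-zero)."""
    return [value / mean for value, mean in zip(point, means)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical requests as (RAM in MiB, disk bandwidth in MiB/s).
stored = [(700.0, 10.0), (600.0, 50.0)]
query = (600.0, 10.0)

# Naive approach: compare raw values directly.
raw_nearest = min(stored, key=lambda p: euclidean(p, query))

# Normalised approach: divide every point, and the query, by the means.
means = dimension_means(stored)
norm_query = normalise(query, means)
norm_nearest = min(
    stored, key=lambda p: euclidean(normalise(p, means), norm_query)
)

print(raw_nearest, norm_nearest)
```

With raw values, the request with a five-fold larger bandwidth demand is judged nearest; after normalisation, the request that differs only modestly in memory is preferred.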

3.3.3 Record creation

After a task completes, Vilfredo creates its usage record. The approach taken is determined by the timeslicing policy enabled (see §3.3).

⁵ 1 MiB = 1024² bytes, where MiB means mebibyte.


With no timeslicing, the resource usage values held in each record correspond to the 90th percentile of the resource usage. Once a task has completed, Vilfredo calculates this value in time linear in the number of resource utilisation measurements made; it first queries the knowledge base for this data, before deploying quickselect [43] to find the correct measurement.
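As an illustration of the selection step, the following sketch finds a percentile with quickselect. It is not Vilfredo’s implementation: the nearest-rank percentile convention, the random pivot and the sample measurements are assumptions made for the example.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-indexed) in expected linear time."""
    items = list(values)
    lo, hi = 0, len(items) - 1
    while True:
        if lo == hi:
            return items[lo]
        pivot = items[random.randint(lo, hi)]
        # Three-way partition of items[lo..hi] around the pivot value.
        i, lt, gt = lo, lo, hi
        while i <= gt:
            if items[i] < pivot:
                items[lt], items[i] = items[i], items[lt]
                lt += 1
                i += 1
            elif items[i] > pivot:
                items[i], items[gt] = items[gt], items[i]
                gt -= 1
            else:
                i += 1
        # Now items[lo..lt-1] < pivot, items[lt..gt] == pivot, rest > pivot.
        if k < lt:
            hi = lt - 1
        elif k > gt:
            lo = gt + 1
        else:
            return pivot


def percentile(values, pct):
    """Nearest-rank percentile (assumed convention): ceil(pct/100 * n)-th smallest."""
    rank = max(1, (pct * len(values) + 99) // 100)  # integer ceiling
    return quickselect(values, rank - 1)


# Illustrative measurements (15 values, as in a 300s task sampled every 20s).
measurements = [0, 0, 0, 0, 0, 7, 7, 4, 4, 2, 30, 12, 9, 9, 7]
print(percentile(measurements, 90))
```

Under the nearest-rank convention, the 90th percentile of 15 measurements is the 14th value in sorted order, so a single outlier such as the largest measurement does not determine the stored value.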

With fixed-duration timeslicing, Vilfredo iterates over each measurement found in the knowledge base, and simply adds each to the record.

The approach with variable-duration timeslicing is summarised in algorithm 3.3.1. Here, Vilfredo computes the per-timeslice minimum and maximum indices of the resource utilisation records in the knowledge base. The system supports all scenarios where the number of timeslices is greater than, equal to, or less than the number of measurements. Vilfredo calculates indices as floating-point numbers and then rounds them to the nearest integer; hence, where there is ambiguity, a measurement will be included in all possible timeslices (for example, if there is only one measurement, this will feature in all timeslices). With these ranges determined, Vilfredo then selects the 90th percentile measurement value in each range, again applying quickselect to determine the solution in time linear in the maximum of the number of timeslices and the number of records.

In addition, Vilfredo will determine the task’s timeslice duration in milliseconds, and include this as part of the record metadata. Prior to executing a task, its duration is estimated as the median of timeslice duration values found in the set of similar tasks.

3.3.4 Usage record weighting

Of course, not all of the records in the set provide equally-accurate estimations of the new task’s resource utilisation. For example, outliers may be contributed by poorly-performing “straggler” executors [26]. To address this concern, Vilfredo implements a record-weighting system. This calculates the weighted arithmetic mean [48] of a set of resource usage record observations, based on the metadata associated with each.

The relevant metadata consists of:

• The unique identifier of the subordinate machine on which the task executed.

• The number of equivalence classes (as described in §2.4.1) the record’s task had in common with the new one.

• The Euclidean distance between the transformed (see §3.3.2) resource requests, $\vec{m}$ and $\vec{n}$, on the graph represented by ANN, as described by eq. (3.14).

$$\mathrm{dist} = \sqrt{\sum_i (m_i - n_i)^2} \tag{3.14}$$


Algorithm 3.3.1 Create timesliced measurements: Here the parameter ms is an array of the measurements taken throughout the complete execution of a finished task.

function CreateTimeslicedMeasurements(ms)
    Precondition: timeslice_strategy is none or fixed or variable
    Precondition: ms is not empty
    mins ← empty list
    maxs ← empty list
    if timeslice_strategy is none then
        Push(mins, 0)
        Push(maxs, MaxIndex(ms))
    else if timeslice_strategy is fixed then
        for mi in ms do
            Push(mins, Index(mi))
            Push(maxs, Index(mi))
    else if timeslice_strategy is variable then
        ratio ← Size(ms) / NUM_TIMESLICES
        for t in 0 to NUM_TIMESLICES − 1 do
            Push(mins, Round(t × ratio))
            Push(maxs, Round((t + 1) × ratio))
    return SelectRangeMeasurements(ms, mins, maxs)

function SelectRangeMeasurements(ms, mins, maxs)
    Precondition: Size(mins) = Size(maxs)
    r ← empty list
    for min in mins and max in maxs do
        Push(r, SelectPercentile(ms, min, max, 90))
    return r
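The record-creation algorithm above can be sketched in Python as follows. Several details are assumptions rather than facts from the text: index ranges are treated as inclusive and clamped to valid indices, rounding is round-half-up, the percentile uses the nearest-rank convention, and a sort stands in for quickselect to keep the sketch short. `NUM_TIMESLICES = 3` is an arbitrary illustrative setting.

```python
import math

NUM_TIMESLICES = 3  # hypothetical global setting


def round_half_up(x):
    """Round a non-negative float to the nearest integer, halves upwards."""
    return int(math.floor(x + 0.5))


def percentile(values, pct):
    """Nearest-rank percentile; sorting stands in for quickselect here."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def create_timesliced_measurements(ms, strategy, num_timeslices=NUM_TIMESLICES):
    """Return one 90th-percentile value per timeslice for a finished task."""
    if not ms:
        raise ValueError("ms must not be empty")
    last = len(ms) - 1
    if strategy == "none":
        ranges = [(0, last)]
    elif strategy == "fixed":
        ranges = [(i, i) for i in range(len(ms))]
    elif strategy == "variable":
        ratio = len(ms) / num_timeslices
        ranges = [
            (round_half_up(t * ratio), round_half_up((t + 1) * ratio))
            for t in range(num_timeslices)
        ]
    else:
        raise ValueError("unknown timeslice strategy: " + strategy)

    record = []
    for lo, hi in ranges:
        # Assumption: bounds are inclusive and clamped to valid indices.
        lo = min(lo, last)
        hi = min(max(hi, lo), last)
        record.append(percentile(ms[lo:hi + 1], 90))
    return record


print(create_timesliced_measurements([1, 2, 3, 4, 5, 6], "variable"))
print(create_timesliced_measurements([7], "variable"))
```

With a single measurement, every timeslice range collapses onto index 0, so the one value appears in all timeslices, matching the behaviour described in the text.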

Using the functions defined below, the number of common equivalence classes is scaled with the transformed logistic function (with a parameter-defined k value), and the request distance is scaled by exponential decay (with a parameter-defined exponential decay constant, λ).


Scaling approaches

The logistic function is a sigmoid curve, first described by Pierre François Verhulst in the domain of Biology [87]. The transformed function that Vilfredo uses to scale larger input values closer to 1, with a minimum of 0, follows eq. (3.15), and thus has the form seen in figure 3.17.

$$\mathrm{TLogistic}(k, x) = \frac{2}{1 + \exp(-k \times x)} - 1 \tag{3.15}$$

[Plot: TLogistic(k, x) against x, for x from 0.0 to 2.0; the vertical axis runs from 0.0 to 1.0.]

Figure 3.17: Transformed logistic function with a k value of 3.

An exponential decay curve decreases at a rate proportional to its current value, tending to 0 as the x value approaches infinity. The exponential decay function that Vilfredo uses to scale larger input values closer to 0, with a maximum value of 1, follows eq. (3.16), and thus has the form seen in figure 3.18.

$$\mathrm{ExpDecay}(\lambda, x) = e^{-\lambda x} \tag{3.16}$$

[Plot: ExpDecay(λ, x) against x, for x from 0.0 to 2.0; the vertical axis runs from 0.0 to 1.0.]

Figure 3.18: Exponential decay function with a λ value of 3.
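The two scaling functions are short enough to state directly in code. The sketch below uses the transformed logistic with a − 1 offset, which gives the stated minimum of 0 at x = 0 and approaches 1 for large x; the function names are illustrative.

```python
import math


def t_logistic(k, x):
    """Transformed logistic: 0 at x = 0, approaching 1 as x grows (x >= 0)."""
    return 2.0 / (1.0 + math.exp(-k * x)) - 1.0


def exp_decay(lam, x):
    """Exponential decay: 1 at x = 0, approaching 0 as x grows."""
    return math.exp(-lam * x)


for x in (0.0, 0.5, 1.0, 2.0):
    print(x, round(t_logistic(3.0, x), 3), round(exp_decay(3.0, x), 3))
```

Both functions map non-negative inputs into [0, 1], which keeps the contributions of the distance and equivalence-class terms to a record’s weight on a comparable scale.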


Weighting algorithm

As per algorithm 3.3.2, Vilfredo considers each item of record metadata in turn, associating a weight with every record, before using these to determine a single resource reservation. With timeslicing enabled, such means are calculated over the observations corresponding to one timeslice index in each record (if available).

Algorithm 3.3.2 Resource usage weighting: Here, the parameter rs is a list of records from previous tasks and t is the current timeslice.

function WeightUsages(rs, t)
    ws ← empty list
    for r in rs do
        wr ← 0
        if HasTimeslice(r, t) then
            wr ← base_weight
            wr ← wr + ExpDecay(dist_dropoff, GetRecordRequestDist(r))
            wr ← wr + TLogistic(equiv_dropoff, GetMatchedEquivClasses(r))
            if LocalMachine(r) then
                wr ← wr + same_machine_weight
        Push(ws, wr)
    return WeightUsageMean(rs, ws, t)

function WeightUsageMean(rs, ws, t)
    vt ← zero vector
    wtot ← Sum(ws)
    for wi in ws and ri in rs do
        vt ← vt + (wi / wtot) · GetTimesliceUsage(ri, t)
    return vt
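A compact Python sketch of the weighting scheme follows. It simplifies the algorithm in ways that are assumptions of the example: usage is a single scalar per timeslice rather than a vector, records lacking the timeslice are skipped when summing, and the `Record` fields and all parameter values (`BASE_WEIGHT`, `DIST_DROPOFF`, `EQUIV_DROPOFF`, `SAME_MACHINE_WEIGHT`) are hypothetical.

```python
import math
from dataclasses import dataclass

# Hypothetical parameter values, chosen only for illustration.
BASE_WEIGHT = 1.0
DIST_DROPOFF = 3.0
EQUIV_DROPOFF = 3.0
SAME_MACHINE_WEIGHT = 0.5


def t_logistic(k, x):
    return 2.0 / (1.0 + math.exp(-k * x)) - 1.0


def exp_decay(lam, x):
    return math.exp(-lam * x)


@dataclass
class Record:
    usages: list                 # one usage value per timeslice (one resource)
    request_dist: float          # distance between normalised requests
    matched_equiv_classes: int   # equivalence classes shared with the new task
    same_machine: bool           # did the task run on this machine?


def record_weight(record, t):
    """Weight of a record for timeslice t; 0 if it has no such timeslice."""
    if t >= len(record.usages):
        return 0.0
    weight = BASE_WEIGHT
    weight += exp_decay(DIST_DROPOFF, record.request_dist)
    weight += t_logistic(EQUIV_DROPOFF, record.matched_equiv_classes)
    if record.same_machine:
        weight += SAME_MACHINE_WEIGHT
    return weight


def weighted_usage(records, t):
    """Weighted arithmetic mean of usage at timeslice t, or None if no data."""
    weights = [record_weight(r, t) for r in records]
    total = sum(weights)
    if total == 0:
        return None
    return sum(w * r.usages[t] for w, r in zip(weights, records) if w > 0) / total


records = [
    Record(usages=[100.0], request_dist=0.0, matched_equiv_classes=2, same_machine=True),
    Record(usages=[200.0], request_dist=2.0, matched_equiv_classes=0, same_machine=False),
]
print(weighted_usage(records, 0))
```

The close, same-machine record receives a much larger weight, so the mean lands nearer its usage than the plain average of the two records would.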

3.3.5 Setting initial reservations

With the original Vilfredo system, or if no similar tasks are found, the reservation is initialised to the resource request, and takes time to approach the actual usage.

Using records of usage observations, Vilfredo predicts the resource utilisation at the start of the task’s execution, and uses this to determine an appropriate initial reservation, as per §3.1.2. Thus the period of decay towards utilisation early in a task’s execution can be avoided.

Two simple approaches may be deployed to make the first prediction:

• In the case of no timeslicing, the initial reservation is set to the weighted mean of the observed values in the record set.

• Where timeslicing is enabled, the weighted mean is calculated based on the first timeslice in each record.


3.3.6 Updating reservations

With timeslicing disabled, the reservation-update process is unchanged; it simply uses the most recent usage measurements to predict the next.

Algorithm 3.3.3 Determining the timeslice prediction. The parameters are a_u and a_s (accuracy ratings for utilisation and similar task-based predictions respectively), u_{i−1} and u_i (smoothed utilisation measurements for previous and current timeslice), and s_i and s_{i+1} (similar task-based predictions for current and next timeslice). All parameters are vectors.

function DeterminePrediction(a_u, a_s, u_{i−1}, u_i, s_i, s_{i+1})
    Precondition: a_u, a_s, u_{i−1}, u_i, s_i and s_{i+1} have same dimensions
    UpdateAccuracy(a_u, u_{i−1}, u_i)
    UpdateAccuracy(a_s, s_i, u_i)
    return SelectBestRated(a_u, a_s, u_i, s_{i+1})

procedure UpdateAccuracy(a, p, m)
    Precondition: a, p and m have same dimensions
    for j in indices of a do
        acc ← TLogistic(smooth_dropoff, m_j / |m_j − p_j|)
        a_j ← ((1 − smooth_coeff) · a_j) + (smooth_coeff · acc)

function SelectBestRated(a1, a2, p1, p2)
    Precondition: a1, a2, p1 and p2 have same dimensions
    r ← empty vector
    for j in indices of a1 do
        if a1_j > a2_j then
            r_j ← p1_j
        else
            r_j ← p2_j
    return r

The strategy Vilfredo deploys when updating task resource reservations with timeslicing enabled is summarised in algorithm 3.3.3. The index of the next timeslice, t, is calculated from the number of measurements, n, the measurement capture period, p, and the median timeslice duration, d, according to eq. (3.17). The measurement made at this timeslice is then found in each record, and a weighted average is taken as in §3.3.4.

$$t = \begin{cases} n & \text{if using fixed-duration timeslicing} \\ \min\left(\dfrac{n \cdot p}{d},\ \mathit{timeslice\_limit}\right) & \text{if using variable-duration timeslicing} \end{cases} \tag{3.17}$$
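As a worked illustration of eq. (3.17), the sketch below computes the next timeslice index. The function name and argument names are hypothetical, and truncating the quotient to an integer is an assumption, since the text does not state how a fractional index is handled.

```python
def next_timeslice_index(n, period_ms, median_duration_ms, strategy, timeslice_limit):
    """Index of the next timeslice, following eq. (3.17).

    n: number of measurements taken so far
    period_ms: measurement capture period, p
    median_duration_ms: median timeslice duration of the similar tasks, d
    """
    if strategy == "fixed":
        return n
    if strategy == "variable":
        # Assumption: the fractional index n*p/d is truncated to an integer.
        return min((n * period_ms) // median_duration_ms, timeslice_limit)
    raise ValueError("timeslicing must be fixed or variable")


# Six measurements at a 5s period, with similar tasks' timeslices lasting 10s.
print(next_timeslice_index(6, 5000, 10000, "variable", timeslice_limit=9))
```

In the variable case, the elapsed time n · p is expressed in units of the expected timeslice duration, so a task predicted to run for longer advances through its timeslices more slowly.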

For different tasks, it is quite possible that the timeslice-based weighted averages achieve different levels of accuracy. Furthermore, the original version of Vilfredo effectively predicted that the usage stays the same between successive measurements. In some cases, that approach might offer better accuracy than the new predictions. Hence, I opted to incorporate an accuracy-rating system.

To rate the accuracy of a prediction, Vilfredo compares the previously-predicted usage value (for each resource), p, to the measured value, m. It then calculates the inverse relative error, scaled by the transformed logistic function into a range of [0,1] as per eq. (3.18). This approach is applied for both the previous usage and the previous timeslice-based prediction. To get a better picture of the accuracy of timeslice-based predictions over the full execution of the task, exponential smoothing is deployed (similar to §3.2.3).

$$\frac{2}{1 + \exp\left(-k \, \dfrac{m}{|m - p|}\right)} - 1 \tag{3.18}$$

Having determined these smoothed accuracy ratings, the prediction method that has proved most accurate so far, for each resource, is taken as the prediction for the next timeslice.
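The rating and selection steps can be sketched as follows. The parameter values `SMOOTH_DROPOFF` and `SMOOTH_COEFF` are hypothetical, the treatment of a zero error (rated as perfectly accurate) is an assumption, and the code selects, per resource, the prediction belonging to the method with the higher smoothed rating, as the text describes.

```python
import math

SMOOTH_DROPOFF = 1.0  # hypothetical k value for the transformed logistic
SMOOTH_COEFF = 0.5    # hypothetical exponential smoothing coefficient


def t_logistic(k, x):
    return 2.0 / (1.0 + math.exp(-k * x)) - 1.0


def rate_accuracy(measured, predicted):
    """Inverse relative error scaled into [0, 1], as in eq. (3.18)."""
    error = abs(measured - predicted)
    if error == 0:
        return 1.0  # assumption: an exact prediction gets the maximum rating
    return t_logistic(SMOOTH_DROPOFF, measured / error)


def update_accuracy(ratings, predicted, measured):
    """Exponentially smooth each resource's rating with the newest accuracy."""
    return [
        (1 - SMOOTH_COEFF) * a + SMOOTH_COEFF * rate_accuracy(m, p)
        for a, p, m in zip(ratings, predicted, measured)
    ]


def select_best_rated(a1, a2, p1, p2):
    """Per resource, take p1 where method 1 is rated higher, else p2."""
    return [x if r1 > r2 else y for r1, r2, x, y in zip(a1, a2, p1, p2)]


def determine_prediction(acc_usage, acc_similar, u_prev, u_cur, s_cur, s_next):
    # The previous usage was the "usage stays the same" prediction for now.
    acc_usage = update_accuracy(acc_usage, u_prev, u_cur)
    acc_similar = update_accuracy(acc_similar, s_cur, u_cur)
    prediction = select_best_rated(acc_usage, acc_similar, u_cur, s_next)
    return prediction, acc_usage, acc_similar


# Two resources: usage was stable for the first, and well predicted by
# similar tasks for the second.
prediction, acc_u, acc_s = determine_prediction(
    acc_usage=[0.5, 0.5], acc_similar=[0.5, 0.5],
    u_prev=[100.0, 50.0], u_cur=[100.0, 80.0],
    s_cur=[60.0, 80.0], s_next=[70.0, 90.0],
)
print(prediction)
```

Because ratings are kept per resource, a task may follow its own recent usage for one resource while following the similar-task profile for another.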

[Diagram: the errors of the usage-based and similar task-based predictions are computed and smoothed, the smoothed errors are compared, and either u_i or the similar task-based prediction is output as prediction_{i+1}. Legend: function invocation, comparison, input/output.]

Figure 3.19: Next timeslice utilisation prediction pipeline.

A simplified schematic showing the pipeline for calculating the prediction for timeslice i + 1, prediction_{i+1}, is presented in figure 3.19. This uses the usage measurements of timeslice i, $u_i$, the similar task-based prediction, $s_{i+1}$, the errors from the usage and similar task predictions, $\Delta u_{i-1}$ and $\Delta s_i$, and their corresponding smoothed values, $\overline{\Delta u_{i-1}}$ and $\overline{\Delta s_i}$.

Vilfredo then uses this to update the reservation as per §3.1.2, considering the reservation decay coefficient, safety margin, and limit. It further integrates with the features detailed for accounting for burstiness (see §3.2).


3.3.7 Integrated reservation update pipeline

The various reservation adjustments supported by Vilfredo integrate into a single pipeline, depicted in figure 3.20.

[Flowchart: started on a timer, the pipeline gets usage from the knowledge base, smooths the usage and determines burstiness. Decision nodes: “Is reservation too low?”, “Is less than usage?”, “Is similar-task prediction accurate?”. Action nodes: boost, apply burstiness-adjusted safety margin, select similar-task prediction, select smoothed usage, apply burstiness-adjusted decay.]

Figure 3.20: Integrated reservation update pipeline.

3.4 Summary

In this chapter, I discussed the features delivered by the Vilfredo project, and detailed some of the more complex approaches I adopted. A review of the end-to-end system was presented, detailing how Vilfredo integrates with Firmament, and the overall system architecture was discussed at a high level (see §3.1.1).


Finer details, explaining the implementation of key components, such as task and reservation management, were provided (see §3.1.1 and §3.1.2), identifying their roles and some of the important decisions that were made. Substantial discussion was made regarding novel extensions, which attempt to further optimise cluster resource allocations, beyond the current state-of-the-art. Such extensions include accounting for precise and approximate burstiness (see §3.2) and similar-task based utilisation prediction (see §3.3). Explanation of issues encountered, such as by tasks overusing allocation (see §3.1.3 and §3.1.6), highly-varied task resource profiles (see §3.2), and attempting to match large numbers of tasks (see §3.3.2), was given, in addition to my intellectual responses and material solutions.

The final product is a complete system, designed to work successfully at scale and be capable of delivering meaningful performance improvements over modern cluster managers.

The forthcoming chapter presents a detailed analysis of the achievements of this project and Vilfredo’s real-world performance.

Chapter 4

Evaluation

The objective of this chapter is to evaluate the success of Vilfredo at meeting the project objectives, and to measure the performance gains Vilfredo can bring to state-of-the-art cluster scheduling systems.

First, I briefly summarise the success criteria and completed extensions. Following on, I discuss the testing strategy and how this fits into the software development model that I followed throughout the implementation process. The final, and most extensive, part of this chapter makes in-depth analyses of how each iterative improvement to Vilfredo helps to achieve the overarching goal of the project, and seeks to measure the incremental effects of the enhancements it can provide modern cluster managers.

With each additional feature integrated into the project, it is important to bear in mind that the overarching goal is to achieve a Pareto improvement (as defined in §1.1) over the original Firmament system. Thus, the aim is to adjust resource allocations such that unused reservations are reduced; these can then be allocated to alternative tasks that could use them to execute.

Forced by the provider policy, the resource request for a task is typically far too high (see §2.2.2). By reserving an amount closer to the actual utilisation, Vilfredo renders more resource on a given machine unreserved. This then may be allocated to other tasks, enabling them to be scheduled to a machine. There is an ongoing balance to be maintained between the aggressiveness of reservation reduction, and the concerns of preemption.

4.1 Overall success

This project has been highly successful. All success criteria that were established in the project proposal (see appendix B) have been met, and several proposed extensions have been successfully implemented.


4.1.1 Work Completed

Table 4.1 outlines the work completed, including relevant implementation and experiment references.

| Deliverable | Success criterion | Impl. details | Relevant experiments | Completed |
|---|---|---|---|---|
| Tracking reservation and usage | 1 | §3.1 | §4.3.1 | ✓ |
| Periodic decay | 2 | §3.1.2 | §4.3.1 | ✓ |
| Boost underestimations | 3 | §3.1.2 | §4.3.1 | ✓ |
| Terminate tasks reaching their limit | 4 | §3.1.3 | §4.3.4 | ✓ |
| Use machine learning to set reservation (e.g. K-NN) | 5 | §3.3 | §4.6 | ✓ |
| Enforce cgroup-based resource throttling | | §3.1.6 | §4.4 | ✓ |
| Enable usage measurement smoothing | | §3.2.3 | §4.5 | ✓ |
| Develop task burstiness profiling | | §3.2.1 | §4.5 | ✓ |
| Integrate timeslice-based resource prediction | | §3.3 | §4.6 | ✓ |

Table 4.1: Project deliverable completion: A summary of the work considered in this project.

4.1.2 Testing

Vilfredo was developed as a stand-alone module, for the purposes of flexibility, maintainability, and most importantly, potential portability with other cluster systems seeking to integrate its behaviour.

This eased the process of developing a comprehensive test suite, providing a combination of unit and integration tests. I deployed the Google Test [79] framework to develop a set of white- and black-box tests [81], to verify that sub-components produced valid output, and all possible code paths were considered. Additionally, when bugs were found, I introduced regression tests, to ensure that patches provided permanent fixes. Furthermore, I developed several integration tests to ensure that not only does each module function correctly, but as a whole, the Vilfredo system behaves reasonably.


Figure 4.1 demonstrates the final execution of the test suite. Each test integrates a variety of sub-tests relevant to a particular independent module.

Figure 4.1: Test suite execution: Identifying valid behaviour of all sub-components and the integrated system.

4.2 Empirical evaluation approach

This section describes the high-level approaches involved in empirically investigating how well Vilfredo achieves its purpose, and in analysing the benefits offered by individual features.

Overall, Vilfredo achieves over 35% higher task throughput than the baseline Firmament system, by utilising 29% greater available capacity, and successfully producing Pareto-improved resource allocations with 86% of additional task placements.

The remainder of this chapter is divided into a series of individual sections:


• Exponential decay achievements (§4.3) which investigates the behaviour of the core unified Firmament-Vilfredo system. This section will confirm that the new cluster manager behaves as intended.

• Task resource throttling (§4.4) which demonstrates the improved behaviour provided by throttling resource utilisation of over-demanding tasks.

• Accounting for burstiness (§4.5) which provides an in-depth examination of how Vilfredo now treats tasks that demonstrate different degrees of burstiness.

• Similar task resource usage prediction (§4.6) which analyses how successful the strategies implemented as per §3.3 are at predicting resource utilisation.

• Incrementality comparison (§4.7) which delivers a complete comparison of the performance gains offered by Vilfredo when scheduling tasks. It compares a variety of metrics, tracked over an extended period, and demonstrates the significant achievements the new system has made.

4.2.1 Test environment

Development and evaluation of Vilfredo largely took place on my personal 64-bit laptop running Ubuntu 14.04.4 LTS (Trusty Tahr), with a four-thread, dual-core Intel Core i5-3210M 2.5 GHz processor and 6 GiB of memory.

For evaluation purposes, I developed a suite of test executables to represent an assortment of tasks, each following different resource-utilisation patterns, controllable via a set of command-line parameters. This offered an unbounded number of possible executions; to support such a variety of tasks, Vilfredo was required to be highly flexible, and this is what the evaluation sought to prove.

To imitate clients that make requests, I constructed a workload generator (see appendix A) which submits tasks with probabilistic inter-arrival rates. Parameters were used to control the probability distributions and hence permit tests at a range of scales. For the experiments presented in §4.6 to §4.7, the workload mix that I developed was intended to represent a realistic subset of those that a typical industry cluster system handles (based on a 29-day trace released by Google [1]), and hence followed the probabilities laid out in table 4.2.
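A workload generator of this kind might be sketched as below. This is not the generator from appendix A: exponentially distributed inter-arrival times (a Poisson arrival process) are an assumption, the function and key names are hypothetical, and only the task-mix probabilities are taken from table 4.2.

```python
import random

# Task-mix probabilities from table 4.2; the key names are illustrative.
TASK_MIX = {"small_batch": 0.70, "disk_batch": 0.23, "service": 0.07}


def generate_workload(num_tasks, mean_interarrival_s, seed=None):
    """Yield (submission_time_s, task_set) pairs.

    Assumption: inter-arrival times are exponentially distributed.
    """
    rng = random.Random(seed)
    names = list(TASK_MIX)
    weights = list(TASK_MIX.values())
    now = 0.0
    for _ in range(num_tasks):
        now += rng.expovariate(1.0 / mean_interarrival_s)
        yield now, rng.choices(names, weights=weights, k=1)[0]


for submit_time, task_set in generate_workload(5, mean_interarrival_s=2.0, seed=1):
    print(round(submit_time, 2), task_set)
```

Changing `mean_interarrival_s` scales the load without altering the mix, which is one way of running the same experiment at a range of scales.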

4.3 Exponential decay achievements

4.3.1 Typical decay scenario

Without Vilfredo, Firmament supports no reservation decay; the resources reserved for each task remain fixed throughout their execution, and typically well above the actual utilisation.


| Task set | Probability | Description |
|---|---|---|
| Small batch tasks | 70% | A set of short-lived tasks (seconds to minutes). Utilise a variable amount of memory, but have minimal interaction with disk. |
| Disk batch tasks | 23% | A set of slightly longer tasks (minutes). Use rather large amounts of memory and exhibit considerable bursts of disk I/O activity, and variations in storage usage. |
| Service tasks | 7% | A set of long and high-priority tasks, intended to represent user-facing service jobs on a production cluster system. Run for several hours. Fairly stable, but high, memory utilisation. Limited disk interaction. |

Table 4.2: Prediction and incrementality task mix: An overview of the sets of tasks that were submitted during experimentation, and their respective probabilities with each submission.

In §2.2.2, I identified that users typically drastically over-estimate the amount of resources to request for the tasks they submit. Now let us observe how, with Vilfredo, the reservation for each individual task decays over time, towards its actual utilisation.

[Plot (a): memory (MiB, 0 to 300) against execution time (s, 0 to 200), showing memory usage, memory reservation and memory limit.]

(a) Exponential decay for memory usage: The memory usage measurements, and corresponding memory reservation and limit, during the 200s execution of a task.

[Plot (b): disk storage (MiB, 0 to 1800) against execution time (s, 0 to 90), showing storage usage, storage reservation and storage limit.]

(b) Exponential decay for disk storage: The disk storage measurements, and corresponding storage reservation and limit, during the 90s execution of a task.

Figure 4.2: Exponential decay for memory and disk storage usage


Memory

The memory reservation for a task, along with its corresponding memory usage and memory request (and hence, limit) varying throughout its execution, is shown in figure 4.2a. The task allocates another 4 MiB, every two seconds, for 72 seconds, before waiting for 64s, after which all the memory is de-allocated at the same rate. Vilfredo’s safety margin coefficient was set to 1.25, the reservation decay coefficient was 0.95, and decays occurred every 5 seconds.

With the baseline Firmament system, the reservation would remain fixed at 256 MiB throughout the entire duration of the experiment.

Now with Vilfredo, observe the following behaviour:

• The reservation is initialised to the resource request of 256 MiB.

• The limit remains fixed throughout at 256 MiB.

• The resource usage is measured and varies with time: starting at 0 MiB at 0s, and reaching a maximum of just below 150 MiB, at around 70s.

• Near the beginning, the reservation begins to fall, following an exponential decay pattern: with each update, it falls to 0.95× its previous value.

• The reservation does not drop below the safety margin, of 1.25× the resource usage.

• The reservation begins to rise with the safety margin, ensuring the resources reserved for the task remain sufficient for its continued execution.

• No boost event occurs, as the resource usage never surpasses the reservation.

Observation: Reservations vary discretely. Resource usages vary continuously.

Additionally, observe that in the interval between resource usage measurements, usage is assumed to vary continuously. However, it is only at the periodic reservation updates that the resource reservations will change. As such, in several of the plots, the reservation jumps at points, in between which the measured usage varies linearly (linearity is assumed to aid illustration).
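The behaviour listed above can be reproduced with a small simulation. The update rule here is an assumption inferred from the observations in this section (decay by the coefficient, never below the safety margin, boost when usage exceeds the reservation, cap at the limit); the precise rule is defined in §3.1.2. The boost coefficient of 1.5 is borrowed from the disk I/O experiment, and `memory_usage` is a hypothetical model of the test task.

```python
def update_reservation(reservation, usage, limit, decay=0.95, margin=1.25, boost=1.5):
    """One periodic update; returns (new_reservation, boosted).

    Assumed rule: boost if usage exceeds the reservation, otherwise decay
    without dropping below the safety margin; never exceed the limit.
    """
    if usage > reservation:
        return min(boost * usage, limit), True
    return min(max(decay * reservation, margin * usage), limit), False


def memory_usage(t):
    """Task model: +4 MiB every 2s for 72s, hold for 64s, then release."""
    if t <= 72:
        return 2.0 * t
    if t <= 136:
        return 144.0
    return max(0.0, 144.0 - 2.0 * (t - 136))


limit = 256.0
reservation = 256.0  # initialised to the resource request
boosts = 0
history = []
for t in range(5, 210, 5):  # reservation updates every 5 seconds
    usage = memory_usage(t)
    reservation, boosted = update_reservation(reservation, usage, limit)
    if boosted:
        boosts += 1
    history.append((t, usage, reservation))

for t, usage, res in history[:3]:
    print(t, usage, round(res, 1))
print("boost events:", boosts)
```

Under these assumptions the reservation first decays geometrically, then tracks 1.25× the usage once the safety margin becomes the binding constraint, and no boost is triggered because the margin always covers the growth between updates.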

Disk capacity

The disk capacity reservation for a task, and its corresponding disk capacity usage and disk capacity request varying throughout its execution, is shown in figure 4.2b.

The task adds a file of size 23 MiB to disk every two seconds, for 60 seconds, before waiting. Vilfredo’s safety margin coefficient was 1.25, the decay coefficient was 0.95, and decays occurred every five seconds.

Observe similar behaviour taking place here to the case of memory, as in figure 4.2a. Once again, the reservation initially falls, but remains within, or at, the safety limit, and never does a boost event occur.


Disk I/O

[Plot: disk I/O bandwidth (MiB/s, 0 to 50) against execution time (s, 0 to 60), showing disk I/O bandwidth usage and disk I/O bandwidth reservation.]

Figure 4.3: Exponential decay for disk I/O: Very bursty disk I/O measurements, and corresponding I/O reservation and limit, during the 60s execution of a task.

The disk I/O bandwidth reservation for a task, along with its correspondingdisk bandwidth usage and limit varying throughout its execution, is shown infigure 4.3.

For 50 seconds, the task continuously writes data to disk in blocks of about 50 MiB.Vilfredo’s reservation decay coefficient was set to 0.9, decaying every second, with asafety margin of 1.2, and boost coefficient of 1.5.

In this case, observe that, as we might expect, the usage is very bursty. Whilethe reservation falls due to low usage measurements, several boost events occur (ataround 10s, and again at around 28s) due to the sharp rises in bandwidth use.

At these points, Vilfredo observed a situation where the reservation fell below the resource utilisation, so to avoid a system fault, it responded by aggressively increasing the reservation to 1.5× the measured disk I/O value. Such behaviour is to be expected, but may lead to a task being scheduled where insufficient bandwidth is available, and hence preemptions may occur, as seen in §4.3.3.
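The decay, safety margin and boost behaviour described here can be sketched as a single update step. This is a minimal illustration under stated assumptions, not Vilfredo's actual implementation: the function name `update_reservation`, the order of the checks, and the choice to cap the result at the task's limit are all hypothetical. The default coefficients are those quoted for this disk I/O experiment.

```python
def update_reservation(reservation, usage, limit,
                       decay=0.9, safety=1.2, boost=1.5):
    """Return the next reservation for one resource of one task.

    Hypothetical sketch: the name, signature and ordering are assumptions.
    """
    if usage > reservation:
        # Boost event: usage overtook the reservation, so jump well above it.
        proposed = boost * usage
    else:
        # Normal step: decay, but never below the safety margin over usage.
        proposed = max(decay * reservation, safety * usage)
    # Assumption: never reserve more than the task's limit (its request).
    return min(proposed, limit)
```

With these defaults, a reservation of 100 MiB/s and a usage of 10 MiB/s decays to 90 MiB/s, whereas a usage of 40 MiB/s against a reservation of 20 MiB/s triggers a boost to 60 MiB/s.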

In this case, since reservation updates occurred every second, it took up to a second for Vilfredo to respond to the burst, causing a short period, just before the boost, where utilisation exceeds reservation.

Observation: Lower reservations are not always better.

This exemplifies another key observation. The ideal reservation is the one which permits as many more tasks to run on the same machine as possible, without being preempted later during their execution. This consideration means, for example, that simply setting the reservation to be equal to the current usage is a non-starter, since many tasks may end up being scheduled to a given machine, but later most would likely be preempted as the machine is in fact not capable of supporting their resource requirements.

That issue in this particular example is addressed by the extensions explained in §3.2. In practice, with a high reservation decay coefficient such events remain unlikely to cause task preemptions unless they occur in synchrony.

4.3.2 Pareto-improved scheduling

[Figure 4.4 plots. Both panels: x-axis: Execution time (s), 0 to 140; y-axis: Memory (MiB), 0 to 600. Panel (a) series: Task 1 reservation, machine reservation, machine capacity. Panel (b) series: Task 1 reservation, Task 2 reservation, machine reservation, machine capacity.]

(a) No Pareto-improved scheduling without Vilfredo: A single task runs, and its reservation stays high so that no other tasks can be scheduled.

(b) Pareto-improved scheduling with Vilfredo: The memory reservations for two tasks, with their executing machine’s total reservation and capacity. By deallocating resources from one task, another is successfully scheduled to the same machine. A Pareto improvement takes place.

Figure 4.4: Pareto-improved scheduling.

The key purpose of reducing task reservations is to schedule more tasks on the same number of machines, where the sum of resource requests would previously have prevented this. Such a situation is observed in figure 4.4b.

When determining the value to set the safety margin to, one must strike a balance between higher reservation reclamation (to support scheduling more tasks), and stability of the reservation over time (to minimise the likelihood of task preemptions).

Note that here the machine reservation falls with that of the first, and originally only, task, Task 1. It is only because of this decay that the machine is deemed to have sufficient unreserved resources to support Task 2. Having started to execute a new task, the machine’s reservation rises to the sum of the reservations of its tasks, but remains below its capacity.

Hence, the machine now executes an extra task, Task 2, in parallel with the original, Task 1, reclaiming idle resources as is the goal of Vilfredo.

Here, Vilfredo has achieved a Pareto improvement (as defined in §1.1, with the goal explained at the start of this chapter). The original task was deallocated resources that it did not use, and therefore did not value; these resources were allocated to another task that was capable of utilising them.
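The admission check implied by this discussion can be sketched as follows. The function names are hypothetical and the numbers in the usage note are illustrative only (they are not read from figure 4.4); the sketch assumes a machine's reservation is simply the sum of its tasks' reservations and that a new task is admitted when its request fits in what remains.

```python
def unreserved(capacity, task_reservations):
    """Capacity not currently reserved by tasks on the machine."""
    return capacity - sum(task_reservations)


def can_schedule(capacity, task_reservations, new_task_request):
    """True if the new task's request fits in the unreserved capacity."""
    return new_task_request <= unreserved(capacity, task_reservations)


# Illustrative numbers (MiB), not measurements from the experiment.
before_decay = can_schedule(600, [400], 300)  # Task 1 still holds its request
after_decay = can_schedule(600, [250], 300)   # Task 1's reservation has decayed
```

In this illustration the second task is rejected while Task 1 holds 400 MiB, and admitted once Task 1's reservation has decayed to 250 MiB, without Task 1 losing anything it was using.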

4.3.3 Task preemption


[Figure 4.5 plots. y-axis: Memory (MiB), 0 to 600. Panel (a) series: Task 1 reservation, machine reservation, machine capacity. Panel (b) annotation: "Preempt low priority task".]

(a) Behaviour if no reservation decay occurs: A single task runs and no attempt at improving throughput occurs.

(b) Preemption behaviour with reservation decay: Too many tasks have been scheduled to a machine. A low priority task is preempted and rescheduled.

Figure 4.5: Preemption behaviour.

As discussed in §3.1.5, if a machine’s reservation rises above its capacity, Vilfredo intelligently preempts tasks to prevent system failure. Such a scenario is seen in figure 4.5b.

A key observation is that the lower priority task was selected for preemption. Data centre tasks are typically divided into high priority ones (critical for ongoing performance of production systems), and low priority ones (less important). Thus, we can mitigate the risk of affecting critical production systems by targeting low-priority tasks. Furthermore, in a system such as Borg (see §2.2.4), where tasks are explicitly partitioned into these two categories, one could perhaps only schedule batch tasks to reclaimed resources.

This situation arose as an extra task was scheduled as per §4.3.2 (indeed, this figure shows the aftermath of the situation presented there). However, the machine reservations rose, and hence it was no longer deemed capable of supporting its tasks. Vilfredo successfully spots this situation, and preempts a sufficient number of tasks (here, just one task, Task 1, is preempted), so as to free these reserved resources. The machine reservations fall back down below the capacity, and overcommitment is averted.

In this scenario, the Pareto improvement achieved in §4.3.2 could be temporarily lost. However, by additionally targeting tasks that have high resource reservation (as per the policy defined in §3.1.5) Vilfredo minimises wasted computation by preempting fewer tasks.

Furthermore, note that the additional priority considerations have here created a situation whereby a high priority task is able to run when it may have otherwise had to wait longer. This would be a favourable outcome in the eyes of most cluster administrators.
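One way to picture the selection described in this section is the sketch below. It is an assumption-laden illustration, not the policy of §3.1.5 itself: the `Task` type, the numeric priority encoding (larger means more important) and the exact tie-breaking are hypothetical. It preempts the lowest-priority tasks first and, within a priority level, those with the largest reservation, stopping as soon as the machine's reservation fits within its capacity.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int       # assumption: larger number means higher priority
    reservation: float  # MiB


def select_preemptions(tasks, capacity):
    """Pick tasks to preempt until the total reservation fits the capacity."""
    total = sum(t.reservation for t in tasks)
    # Lowest priority first; within a priority, largest reservation first,
    # so that fewer tasks need to be preempted.
    candidates = sorted(tasks, key=lambda t: (t.priority, -t.reservation))
    preempted = []
    for task in candidates:
        if total <= capacity:
            break
        preempted.append(task)
        total -= task.reservation
    return preempted
```

Preferring large reservations among the low-priority tasks reflects the stated aim of freeing enough resources while preempting as few tasks as possible.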

The overall success of Vilfredo, as analysed in §4.7, is dependent on the assumption that such events are less likely than successful, enduring Pareto improvements.

4.3.4 Terminating tasks

[Figure 4.6 plot. Annotation: "Terminate task".]

Figure 4.6: Task termination event: The memory usage measurements, and corresponding memory reservation and limit. The task usage rises rapidly, and then exceeds its limit, before Vilfredo terminates the task.

Where a task exceeds its resource request, and hence its limit, it is terminated, as seen above in figure 4.6.


In this scenario, the task’s usage has surpassed its overall limit. This situation only arises due to a poorly-behaved task that fails to use only the resources that it has requested. Once again, Vilfredo successfully finds such situations, and responds accordingly by terminating the task altogether. It will not be rescheduled.

4.3.5 Studying memory reclamation

Vilfredo allows the cluster administrator to make a number of decisions to control its behaviour. A variety of parameters directly affect the decay behaviour, and hence can have dramatic consequences on the performance benefits offered over the enhanced system.

In this section, I shall briefly discuss a few of the key considerations, and identify the impacts these decisions can have on the resulting performance.

Varying safety margin coefficient

The safety margin, defined in §3.1.2, is determined by the most recent resource usage measurement and the safety margin coefficient, cm. The key purpose behind this concept is to ensure that the task can vary its utilisation, whilst minimising the likelihood of exceeding its reservation (which would induce a boost event).


[Figure 4.7 plot. y-axis: Memory (MiB), 0 to 300; series: memory usage, medium safety (1.1) memory reservation, low safety (1.025) memory reservation, high safety (1.75) memory reservation, memory limit.]

Figure 4.7: Reservation while varying safety margin: The memory reservation, usage and limit with different safety margin settings throughout execution of a task.

Figure 4.7 and figure 4.8 identify the effect of setting cm to different levels, given the same resource utilisation pattern. The effect of cm on the reservation can be observed in figure 4.7, and the resulting resource reclamation can be identified in figure 4.8 with the means tabulated in table 4.3.



[Figure 4.8 plot. y-axis: Request reclamation (%), 0 to 100; series: medium safety (1.1) memory reclamation, low safety (1.025) memory reclamation, high safety (1.75) memory reclamation.]

Figure 4.8: Reclamation while varying safety margin: The proportion of the memory request reclaimed by Vilfredo with different safety margin settings throughout execution of a task.

Safety margin coefficient | Mean reclaimed memory (MiB)
Medium (1.1)              | 105
Low (1.025)               | 97
High (1.75)               | 45

Table 4.3: Comparison of the mean reclaimed memory with different safety margin settings.
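The two reclamation measures used here, the proportion of the request reclaimed (figure 4.8) and the mean reclaimed memory (table 4.3), can be written down as below. This is a sketch under the assumption that "reclaimed" means the gap between a task's request and its current reservation; the function names are hypothetical and the numbers in the usage note are illustrative rather than taken from the experiment.

```python
def reclaimed(request, reservation):
    """Resources handed back: the gap between request and reservation."""
    return max(request - reservation, 0.0)


def reclamation_percent(request, reservation):
    """Reclaimed resources as a percentage of the original request."""
    return 100.0 * reclaimed(request, reservation) / request


def mean_reclaimed(request, reservations):
    """Mean reclaimed amount over a series of reservation samples."""
    return sum(reclaimed(request, r) for r in reservations) / len(reservations)
```

For instance, a task that requested 300 MiB and currently holds a 195 MiB reservation has 105 MiB reclaimed, which is 35% of its request.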

Observe, in particular, that a very small cm value, 1.025, at first manages to reclaim more memory than any other setting; however, the resulting safety margin is insufficient to contain the resource usage. The boost event induced just after 50s reduces the reclaimed resources below those achieved with the medium cm value of 1.1. More importantly, the resulting reservation varies more widely during the task’s execution, significantly increasing the probability of a task preemption (see §3.1.5).

By way of contrast, a very large cm value, 1.75, offers a stable resulting reservation by minimising the chance of boost events. However, the resulting resource reclamation is far less, and the reservation stays at the limit for a large portion of the execution. This may prevent scheduling of tasks that the machine could easily support.

The ideal cm value offers the best balance between lowering the probability of boost events, and increasing resource reclamation. Indeed, with the value of 1.1, no boost event is seen, but more resources are reclaimed than with a very large margin.

Varying reservation decay coefficient

In a given interval on an exponential decay curve, each successive value is a constant (< 1) proportion of the previous. With Vilfredo, this constant is determined by the reservation decay coefficient, cd.
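To give a feel for what different cd values mean in practice, the short sketch below computes how many decay steps are needed before an undisturbed reservation falls to a given fraction of its starting value. It deliberately ignores the safety margin and boost events, and the function names are hypothetical.

```python
import math


def reservation_after(initial, cd, steps):
    """Reservation after `steps` undisturbed decays with coefficient cd."""
    return initial * cd ** steps


def steps_to_fraction(cd, fraction):
    """Smallest number of decay steps after which the reservation is at or
    below `fraction` of its starting value (requires 0 < cd < 1)."""
    return math.ceil(math.log(fraction) / math.log(cd))


# Steps needed to halve the reservation for the three coefficients compared
# in this section.
halving_steps = {cd: steps_to_fraction(cd, 0.5) for cd in (0.8, 0.95, 0.99)}
# halving_steps == {0.8: 4, 0.95: 14, 0.99: 69}
```

With decays every five seconds, for example, a coefficient of 0.8 would halve an undisturbed reservation in about 20 seconds, whereas 0.99 would take almost six minutes.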



[Figure 4.9 plot. y-axis: Memory (MiB), 0 to 300; series: memory usage, medium decay (0.95) memory reservation, slow decay (0.99) memory reservation, fast decay (0.8) memory reservation, memory limit.]

Figure 4.9: Reservation while varying decay coefficient: The memory reservation, usage and limit with different decay coefficient settings throughout execution of a task.


[Figure 4.10 plot. y-axis: Request reclamation (%), 0 to 100; series: medium decay (0.95) memory reclamation, slow decay (0.99) memory reclamation, fast decay (0.8) memory reclamation.]

Figure 4.10: Reclamation while varying decay coefficient: The proportion of the request reclaimed by Vilfredo with different decay coefficient settings throughout execution of a task.

By selecting a lower cd value, resources are reclaimed more aggressively, but the likelihood of effecting insufficient reservations for sustainable execution increases. This behaviour can be seen in figures 4.9 and 4.10.

Note the resemblance of this behaviour to that for cm. For similar reasons:

• Overly-aggressive (fast) decay: Low cd values are seen to lead to more variable overall reservations, but increased resource reclamation.

• Insufficient (slow) decay: High cd values minimise resource reclamation, leading to more unused reserved resources.


4.4 Task resource throttling

[Figure 4.11 plots. Panel (a) annotations: "Limit reached", "Terminate task". Panel (b) annotation: "Limit reached".]

(a) Over-using task without cgroup throttling: The shortened execution of a task, terminated by Vilfredo for memory beyond its limit.

(b) Over-using task with cgroup throttling: The completed execution of a task with memory throttled within its limit using cgroups.

Figure 4.11: Memory usage throttling.

Users typically request levels of resource that are far too high (see §2.2.2). However, if these requests are ever exceeded, Vilfredo supports either terminating the task, as seen in figure 4.11a, or throttling its usage (for example, swap space would be used for memory), as seen in figure 4.11b.

Observe that in figure 4.11b the memory usage rises just as rapidly as before. The limit is reached (at 14s, as in figure 4.11a) but is not exceeded. In this case, the container responded as though the system had insufficient unused accessible physical memory, and swap space was used instead. Swap space itself can also be limited, as described in §3.1.6, in which case the task may fail instead of compromising the system.


4.5 Accounting for burstiness

Where a task is bursty, the reservation tends to fall too quickly, resulting in a series of boost events taking place; this causes issues with over-scheduling tasks and over-reserving of resources (see §4.3). Furthermore, in very stable tasks, a considerable safety margin and slow reservation decay result in overzealous reservation; fewer resources could be reserved, and potentially allocated to other tasks, achieving further Pareto improvements.

Vilfredo calculates the Fano factor per resource-task combination, and uses it to vary the reservation decay coefficient and safety margin, as described in §3.2.2. Additionally, usage is exponentially smoothed, in order to give Vilfredo a better idea of longer-term resource utilisation, but with an emphasis on more recent values.
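As a rough illustration of the two quantities mentioned here, the sketch below computes a Fano factor (the variance of a set of samples divided by their mean) and an exponentially smoothed usage series. The function names, the use of the population variance, the handling of an all-zero window, and the smoothing weight `alpha` are assumptions made for this example; §3.2 gives the definitions Vilfredo actually uses.

```python
from statistics import mean, pvariance


def fano_factor(samples):
    """Variance-to-mean ratio of a window of usage samples."""
    mu = mean(samples)
    if mu == 0:
        return 0.0  # assumption: treat an all-zero window as non-bursty
    return pvariance(samples, mu) / mu


def exponentially_smooth(samples, alpha=0.3):
    """Exponentially weighted running average of the samples.

    `alpha` (hypothetical value) is the weight given to the newest sample,
    so more recent measurements count for more.
    """
    smoothed = samples[0]
    out = [smoothed]
    for x in samples[1:]:
        smoothed = alpha * x + (1 - alpha) * smoothed
        out.append(smoothed)
    return out
```

A perfectly steady series such as `[10, 10, 10, 10]` has a Fano factor of 0, while an alternating series such as `[0, 20, 0, 20]` has the same mean but a Fano factor of 10, in the same units as the measurements.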

4.5.1 Bursty tasks

[Figure 4.12 plots. Both panels: x-axis: Execution time (s), 0 to 300; y-axis labelled Memory (MiB), 0 to 60. Panel (a) series: disk I/O bandwidth usage, disk I/O bandwidth reservation, disk I/O bandwidth limit. Panel (b) adds a right-hand axis, Burstiness (MiB), 0 to 50; series: disk I/O bandwidth usage, smoothed disk I/O bandwidth usage, disk I/O bandwidth reservation, disk I/O bandwidth limit, burstiness (plotted on right axis).]

(a) Original bursty reservation: Behaviour of non-burstiness-adjusted reservation on a bursty task.

(b) Burstiness-adjusted bursty reservation: Behaviour of burstiness-adjusted reservation on a bursty task.

Figure 4.12: Comparison of reservation behaviour with a bursty task.

A comparison of behaviour with a bursty task can be seen in figure 4.12.


The experiment was conducted with a safety margin coefficient of 1.25, with decays of 0.95 occurring every five seconds (Vilfredo adjusts these values with burstiness considerations enabled). The task writes to disk, for five seconds, four times, with five second intervals, before waiting.

In figure 4.12a we can observe that, without burstiness adjustments, the decaying reservation falls below the level of disk I/O usage on several occasions, and hence multiple boost events occur. Furthermore, once the bursty period has ended, the reservation decays slowly.

In contrast, when burstiness is adjusted for, the reservation sees very slow decay and a large safety margin is maintained, as in figure 4.12b (here with a window size of 50). After all bursts have completed, the mean of the burstiness calculation window falls, leading to a brief spike in the burstiness value, as per its definition given in §3.2.1. Since this is due to very bursty past behaviour (and burstiness-adjusted safety margins have a maximum), this causes no burst in the actual reservation.

In this case, the disk I/O reservation remains at its limit throughout the bursty period. However, once the bursty period appears to be over, Vilfredo decreases the reservation rapidly, falling towards 0 before the task completes.

4.5.2 Non-bursty tasks

The benefits of detecting burstiness extend to non-bursty tasks.

[Figure 4.13 plot. x-axis: Execution time (s), 0 to 180; y-axis: Memory (MiB), 0 to 600; series: memory usage, burstiness-adjusted memory reservation, standard memory reservation, memory limit.]

Figure 4.13: Comparison of burstiness-adjusted reservation with a non-bursty task: The reservation falls rapidly during non-bursty periods, and reclaims more memory.

With a non-bursty task, the burstiness will fall quickly to a low value, hence minimising the safety margin and accelerating reservation decay, as in figure 4.13.


This experiment was conducted with a safety margin of 1.25, and with decays of coefficient 0.95 occurring every 15s (Vilfredo adjusts these values with burstiness considerations enabled). The task allocates 100 MiB of memory in five-second periods three times with 15-second intervals, before waiting.

Observe that, with burstiness-adjustments enabled, Vilfredo determines the burstiness to be high during the initial period, up to about 60s, whilst the resource utilisation is rising, and hence does not decrease the reservation much. However, when usage remains steady, the reservation rapidly falls from over 400 MiB at 90s to just over 300 MiB shortly after. Notice that the reserved resources are indeed lower during stable usage periods than when burstiness is not considered, since Vilfredo anticipates that this is a safe bet.

With burstiness-adjustments enabled, the index of dispersion (a measure of burstiness, presented in §3.2.1) of the reservation itself throughout the experiment was 4.9 MiB, less than half of the standard reservation’s index of dispersion of 13.5 MiB. Hence, much more stable scheduling can take place, minimising the risk of task preemptions.

4.5.3 Window size

The key decision, when using Vilfredo to calculate burstiness, is the size of the window considered. In effect, the window size represents the period over which the dispersion of usage measurements is calculated, and hence, how quickly reservations respond to changes in variation.

[Figure 4.14 plot. x-axis: Execution time (s), 0 to 300; y-axis labelled Memory (MiB), 0 to 60; series: disk I/O bandwidth usage, medium window disk I/O bandwidth reservation, big window disk I/O bandwidth reservation, small window disk I/O bandwidth reservation, limit.]

Figure 4.14: Comparison of burstiness-adjusting reservation with varying window sizes: A well-set window size offers a balance of adjusting to change and remembering recent bursty behaviour.


As above, this experiment was conducted with a safety margin of 1.25, with decays of 0.95 occurring every five seconds (Vilfredo now adjusts these values). The task writes to disk for five seconds four times, with five second intervals, before waiting.

With a larger window size, transient bursts that are detected have longer-lasting effects: the big window reservation in figure 4.14 remains at the limit throughout the duration of execution, due to the initial bursty period in the first third.

A small window will rapidly drop reservations after short burst-less periods, and hence the behaviour looks somewhat worse than a non-burstiness-adjusted reservation.

An ideal window size accounts for recent bursts, but allows the reservation to adjust, and fall rapidly, after a stable period has commenced. The medium window size reservation in figure 4.14 offers an example of such behaviour.
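The effect of the window size on how long a burst is "remembered" can be seen in a small sketch. It assumes burstiness is the variance-to-mean ratio over the most recent `window` samples; the function name and the trace are made up for illustration and are not measurements from this experiment.

```python
from collections import deque
from statistics import mean, pvariance


def rolling_burstiness(samples, window):
    """Variance-to-mean ratio over the most recent `window` samples."""
    recent = deque(maxlen=window)  # old samples fall out automatically
    values = []
    for x in samples:
        recent.append(x)
        mu = mean(recent)
        values.append(pvariance(recent, mu) / mu if mu else 0.0)
    return values


# Illustrative trace (not measured data): a bursty start, then steady usage.
trace = [0, 50, 0, 50, 0, 50] + [10] * 20
small = rolling_burstiness(trace, window=3)
large = rolling_burstiness(trace, window=50)
```

By the end of this trace the small window has forgotten the burst entirely (its burstiness is 0), while the large window still reports a non-zero value because the early burst remains inside it.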

4.6 Similar task resource usage prediction

As described in §3.3, the more accurately we can predict future resource utilisation, the better resource adjustments can be made [72]. In this section, I analyse the extent to which the developments described in §3.3 achieve their goal of accurately predicting resource usage.

4.6.1 Experiment design and considerations

This experiment specifically regards future utilisation predictions made when:

• Initialising the resource reservations (and hence, with no usage measurements for the task available).

• Updating the resource reservations during execution.

I deployed each prediction approach over an hour-long period, with an additional hour of training, before collecting results. Tasks were submitted using the probabilistic workload generator (see §4.2.1), with the realistic task mix described. The accuracy ratings recorded were all those calculated when updating the reservation as per §3.3.6 (thus, with every reservation update, a new rating was logged).

4.6.2 Prediction approaches re-cap

To re-cap the approaches, the investigation considered all the strategies listed in table 4.4.


Prediction strategy | Description
Firmament | Uses the resource request to set its usage “prediction”.
Firmament + Vilfredo with no timeslicing | Initialises reservations using the 90th percentile resource held in the records of the provided similar tasks, and updates them based on the most recent measurement.
Firmament + Vilfredo with fixed-duration timeslicing | Initialises reservations and updates them based on a prediction of the upcoming fixed-duration timeslice, using measurements held in the records of the provided similar tasks for the same timeslice.
Firmament + Vilfredo with variable-duration timeslicing | As above, but using variable-duration timeslices (splitting every execution into an equal number of blocks, here 50).

Table 4.4: Task usage prediction strategies: The strategies that Vilfredo supports to predict task resource usage.
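The difference between the two timeslicing schemes, and the percentile used for initialisation, can be illustrated with a few helper functions. These are a sketch only: the names are hypothetical, the nearest-rank percentile is one of several common definitions, and the variable-duration index assumes the total duration of the recorded (already completed) similar task is known.

```python
import math


def fixed_slice_index(elapsed_s, slice_s):
    """Index of the fixed-duration timeslice containing `elapsed_s`."""
    return int(elapsed_s // slice_s)


def variable_slice_index(elapsed_s, total_s, n_slices=50):
    """Index when every execution is split into `n_slices` equal blocks."""
    index = int(elapsed_s / total_s * n_slices)
    return min(index, n_slices - 1)  # the final instant joins the last block


def percentile_nearest_rank(values, pct=90):
    """Nearest-rank percentile of a collection of measurements."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]
```

With variable-duration slices, two executions of different lengths map to the same slice at the same relative progress: halfway through a 60-second task and halfway through a 600-second task both fall in slice 25.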

4.6.3 Accuracy rating definition re-cap

To compare accuracy, the accuracy rating, in the range of [0,1], as described in §3.3.6 is used. This is calculated as per eq. (4.1), with a k value of 0.35, a measurement of m, and a prediction of p.

\[
\frac{2}{1 + e^{-k \frac{m}{|m - p|}}} - 1 \tag{4.1}
\]
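Reading eq. (4.1) as 2 / (1 + e^(−k·m/|m−p|)) − 1, the rating depends only on the relative error |m−p|/m, which makes it easy to check against the reported figures. The sketch below is an illustration of that reading; the function name and the explicit handling of a perfect prediction are assumptions.

```python
import math


def accuracy_rating(m, p, k=0.35):
    """Accuracy rating of prediction p for measurement m, per eq. (4.1)."""
    if m == p:
        return 1.0  # perfect prediction: the exponential term vanishes
    return 2.0 / (1.0 + math.exp(-k * m / abs(m - p))) - 1.0


# A relative error of 0.16 (e.g. measuring 100 when 116 was predicted)
# gives a rating of roughly 0.80.
example = accuracy_rating(100, 116)
```

This agrees with the results that follow: the lowest full-execution relative error in table 4.5, 0.16, corresponds to the highest mean accuracy rating of about 0.8.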

4.6.4 Outcome

Prediction strategy | Initialisation relative error | Full execution relative error
Firmament | 3.98 | 0.64
Firmament + Vilfredo with no timeslicing | 0.30 | 0.22
Firmament + Vilfredo with fixed-duration timeslicing | 0.50 | 0.21
Firmament + Vilfredo with variable-duration timeslicing | 0.56 | 0.16

Table 4.5: The relative error found when using each strategy to predict task resource usage.


[Figure 4.15 plot. Bar chart of mean accuracy rating (0.0 to 1.0) at initialisation and over full execution; series: Firmament, Firmament + Vilfredo with no timeslicing, Firmament + Vilfredo with fixed timeslicing, Firmament + Vilfredo with variable timeslicing.]

Figure 4.15: Task usage prediction: Comparison of the accuracy ratings for the predictions made via different implementations of usage, both initially and throughout execution.

As demonstrated in figure 4.15, considerable gains have been made in predicting future resource usage of tasks, on the basis of the task-similarity estimation and weighting process described in §3.3. These accuracy rating values correspond to the relative errors presented in table 4.5.

The initial values set by every Vilfredo strategy achieve an accuracy at least 6× better than the baseline Firmament approach (and up to over 10× better) with 95% confidence. Notably, the best approach appears to be the one using the 90th percentile measurement throughout the duration of execution of similar tasks.

Remarkably accurate predictions were maintainable throughout execution, with the highest mean accuracy rating of 0.8, 3× better than using the limit. Interestingly, the variable-duration timeslicing approach achieves significantly better results than simply using the latest usage or deploying fixed-duration timeslicing.

It is likely that the timeslicing offered finer-grained information to Vilfredo to make better ongoing predictions, and that variable-duration timeslices gave more accurate results for tasks that had similar resource utilisation profiles, but differing lengths. The accuracy of a percentile-based initialisation prediction may be due to the greater variability of tasks that is often seen in the early period [72].

Note that in future versions of Vilfredo, these approaches could be combined easily.


4.7 Incrementality comparison

In this final section, I analyse the extent to which Vilfredo meets its key goal: delivering significant performance improvements in realistic cluster systems, over a cluster manager without dynamic reservation adjustments.

Notably, we shall confirm the hypothesis set forth in §1.1, that

Hypothesis:

“by dynamically-adjusting resource reservations, a cluster system might be able to achieve more efficient allocations ... [by] utilising more of its available resources ... [it could achieve] a significantly higher task throughput ... [and] by profiling resource usage variation, and using utilisation information of similar tasks, reservations can be set that improve overall performance.”

4.7.1 Experimental construction

Careful consideration at this stage was necessary to ensure a scientifically-meaningful analysis could be performed.

To ensure that a realistic environment was established:

• I deployed Vilfredo with Firmament across four machines on the Computer Laboratory Systems Research Group cluster. Each was a 64-bit server running Ubuntu 14.04.2 LTS (Trusty Tahr), with an eight-thread, four-core Intel Xeon E5520 2.27 GHz processor and 12 GiB of memory.

• I used the probabilistic workload generator, operating on the task mix defined in §4.2.1, which was based upon a 29-day trace of a 12.5k-machine Google cluster system, to provide a realistic representation of the requests.

Comparative evaluations took place over 120-minute periods of time, with the same subordinate and master coordinators.

I first look at the effects of Vilfredo on the state change events in §4.7.2. This includes both task completions (where higher throughput is obviously better), and the number of task reschedules (the effect of which is discussed below). Following on, I investigate how performance improvements have been achieved by considering the percentage of resource capacity that is utilised, in §4.7.3.


[Figure 4.16 plot. Bar chart of mean event rate (events/minute), 0.0 to 2.5, for task completions (higher is better) and task reschedules (lower is better); series: Firmament + Vilfredo, Firmament.]

Figure 4.16: Comparison of task state changes: The task throughput is over 35% higher with Vilfredo. Task reschedule occurrence with Vilfredo is very low, at below 0.09 reschedules per minute.

4.7.2 Analysis of task state change improvements

Throughput assessment

Task throughput when running with Vilfredo is over 35% higher than without, as demonstrated in figure 4.16. It completes over 235 tasks on average, versus the 174 completed in the same 120-minute window by original Firmament.

My investigation of additional data logged by Vilfredo indicates that these improvements are due to significant reclamation of resources in long service tasks (as described in §4.2.1). In particular, across the several iterations of this experiment, the most competitive resource seems to be memory: in approximately 90% of scheduling iterations where a task could not be scheduled to a machine, insufficient unreserved memory on the cluster was a factor. The next most competitive resource was disk I/O, and there were no cases when disk storage was insufficient (disk storage bottlenecks are very rare in modern systems [21]).

Reschedule event assessment

As a feature of Vilfredo, where the sum of the reservations of tasks executing on a machine exceeds its capacity, an executing task will be stopped and rescheduled. This is not performed by the baseline system, and hence machine failure is more likely. No reschedules were observed in that case; no system failures occurred either, since the machine ran with lower resource utilisation than Vilfredo can achieve, as observed in §4.7.3.


Parameter | Value
Reservation decay coefficient, cd | 0.95
Minimum reservation decay coefficient | 0.8
Safety margin coefficient, cm | 1.25
Minimum safety margin coefficient | 1.1
Burstiness estimation window size, w | 5

Table 4.6: Incrementality reservation parameters: The aggressive parameters used to update the reservation during the incrementality experiment.
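For reference, the parameters in table 4.6 could be grouped into a single configuration object along the following lines. The class and field names are hypothetical (they are not taken from Vilfredo's code), and the validation merely encodes the relationships visible in the table: each minimum lies at or below its default, decay coefficients are below 1, and safety margins are at least 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReservationParams:
    """Hypothetical grouping of the table 4.6 settings."""
    decay_coefficient: float = 0.95             # cd
    min_decay_coefficient: float = 0.8
    safety_margin_coefficient: float = 1.25     # cm
    min_safety_margin_coefficient: float = 1.1
    burstiness_window: int = 5                  # w

    def __post_init__(self):
        if not 0 < self.min_decay_coefficient <= self.decay_coefficient < 1:
            raise ValueError("decay coefficients must satisfy 0 < min <= cd < 1")
        if not 1 <= self.min_safety_margin_coefficient <= self.safety_margin_coefficient:
            raise ValueError("safety margins must satisfy 1 <= min <= cm")
        if self.burstiness_window < 1:
            raise ValueError("window size must be at least 1")


aggressive = ReservationParams()
```

Keeping the values immutable and validated in one place makes it harder to run an experiment with an inconsistent combination, such as a minimum decay coefficient above the default.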

For the purpose of investigation, I performed this experiment with aggressive resource reclamation settings, detailed in table 4.6. As a result, a minority, on average fewer than 3.5%, of cases were seen to induce task execution reschedule events. In such scenarios, resources were wasted by attempting to schedule a task to a machine which could not support it. This implies that the resource requests were too high to permit scheduling this task in original Firmament, so it is likely that those resources would have been wasted in the equivalent scenario.

Where reschedule events occurred, Vilfredo logs indicate that in all cases resubmission to the scheduler was successful.

By combining the increase in throughput with the reschedule rate, we can determine that, with these parameters, Vilfredo has an 86% success rate when inducing task placements beyond those of the original system.

4.7.3 Investigating resource utilisation

Figure 4.17 identifies the median measurements of the proportion of utilised memory throughout the 120-minute duration of the experiment. When running with Vilfredo, over 29% greater memory (11.7% of the total capacity) was utilised by the end of the experiment. Memory, in particular, was selected for this analysis since this was the key limiting resource throughout.

Importantly, as time passes, the unused resource level falls, as Vilfredo reclaims increasingly more, up to a fixed margin, by a combination of the strategies:

• Decaying reservations of long tasks.

• Achieving more accurate usage estimates.

• Detecting non-bursty tasks, and reducing safety margins.

As suggested in §1.1, by dynamically decaying reservations, Vilfredo minimises deadweight loss to achieve higher allocative efficiency. Thus, it induces improved resource allocations, enabling scheduling of tasks to utilise more resources, as seen in figure 4.17. This increases overall system throughput, as observed in figure 4.16.


[Figure 4.17 plot. Annotation: 11.7%.]

Figure 4.17: Comparison of used cluster system memory: The median proportion of memory used by executing tasks within ten-minute periods, across an hour-long execution.

By a series of Pareto improvements (defined in §1.1), with a success rate of 86%, we can confirm the hypothesis set forth at the start of this dissertation (see §1.1).

4.8 Summary

This chapter conducted a thorough analysis of the advantages that Vilfredo is capable of bringing to modern cluster systems.

Most notably, the hypothesis posed was confirmed: dynamically-adjusted resource reservation can indeed deliver greater parallelisation, and furthermore, by profiling resource usage variation, and using utilisation information of similar tasks, reservations can be set to achieve significantly higher task throughput.

Simple exponential decay was analysed, and found to be capable of reducing task reservations over time. Significant improvements were enabled by the introduction of more intelligent features, looking more closely at the resource allocation.

Overall, when introduced to Firmament, Vilfredo was shown capable of:

✓ Achieving over 35% higher task completion rate.

✓ Achieving over 29% higher resource utilisation.

✓ Achieving Pareto-improved resource allocations with 86% of induced attempts.

Beyond an iterative improvement on an exciting research project, the novel approaches explored by Vilfredo may be the first steps in a new direction for resource management in cutting-edge cluster systems development.

Chapter 5

Conclusions

This dissertation has presented a detailed overview of the considerations, implementation, and evaluation of Vilfredo. It is an intelligent, and novel, solution to resource management, that is capable of delivering significant performance improvements in cluster systems. Not only has task throughput shown a 35% increase, but the developments made exceed the capabilities of modern industrial alternatives, such as Borg (see §2.2.4), with burstiness-detection and similar task-based utilisation prediction.

5.1 Achievements

Not only have all the success criteria been achieved, but I integrated a numberof pioneering extensions to further optimise performance, to create a cohesive,harmonious system. I have demonstrated the capability of Vilfredo to deliversignificant performance gains by introducing Pareto-improved resource allocations,and thus utilising greater available system resource capacity (see §1.1).

Indeed, the development of this system did prove to be a challenging undertaking.Considerable research and skill-development was required, and the experience ofdeveloping low-level systems operating on a large scale is one that will stay with me.I particularly enjoyed the experience of contributing to a substantial research projectin the domain of distributed systems, and very much hope to be able to contributemore research in this area in future.

I am very pleased by the achievements of Vilfredo, and by the considerable, wide-reaching potential impact that these developments could bring. Particularly exciting is the possibility for future research into dynamic task resource utilisation profiling, such as the burstiness estimation described in §3.2, and the real potential for large-scale deployment of this concept in industrial systems.

5.2 Lessons learnt

Previous experience helped immensely with the process of developing from a simple notional concept to a substantial system integrated into a large cluster manager. Perhaps, however, greater initial research could have aided and motivated the process. Given the scale and scope of this project, grappling with the large volume of related literature proved to be a more considerable time investment than I had initially anticipated: shortly after commencing implementation work, I found myself back in the books. Additionally, a project in distributed systems has given me a greater appreciation of the implications that each development may have across many scales; this will likely help me in future work.

5.3 Further work

The Vilfredo system has been successful in achieving, and surpassing, the initial specification laid out in the proposal in Appendix B. Nevertheless, there is, of course, potential for further research and evaluative work that could deliver even greater benefits to systems in this domain.

Such work might include:

• Additional resource support: the current system supports utilisation and reservation adjustments for memory, disk I/O, and disk storage. This could be further extended to look at other competitive resources, such as CPU cycles or network bandwidth.

• Compressible resource adjustments: some modern cluster managers support the notion of compressible resources [89, §6.2], such as disk I/O or CPU cycles, that are rate-based. These are treated slightly differently from the others to support dynamic throttling, whereby utilised resources are reclaimed from tasks by throttling them rather than terminating them.

• Extension of task similarity features: Vilfredo produces sets of tasks by using ANN (see §2.5.3) to find tasks with similar resource requests, and weights these to make utilisation predictions. Prior to weighting, an additional layer could further filter these sets based upon criteria such as machine proximity, matched equivalence classes, or task submission time. Such a module might require more complex machine learning strategies, such as neural networks, and thus scalability could be a concern.

I am delighted with the success of this project, and intend to release the source code on the internet under the Apache License, version 2.0 [5]. I hope to continue working in this domain, and to publish Vilfredo’s contributions to resource management as a paper in the future.

Bibliography

[1] O. Abdul-Rahman and K. Aida. ‘Towards Understanding the Usage Behavior of Google Cloud Users: The Mice and Elephants Phenomenon’. In: Proceedings of the 6th IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 2014, pp. 272–277 (cit. on p. 59).

[2] R. Ahuja, T. Magnanti and J. Orlin. Network Flows: Theory, Algorithms and Applications. Prentice Hall, 1993 (cit. on p. 12).

[3] N. Altman. ‘An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression’. In: The American Statistician 46.3 (1992), pp. 175–185 (cit. on p. 15).

[4] L. Amoroso. ‘Vilfredo Pareto’. In: Econometrica: Journal of the Econometric Society (1938), pp. 1–21 (cit. on p. i).

[5] Apache License. Version 2. The Apache Software Foundation, Jan. 2002. url: https://www.apache.org/licenses/LICENSE-2.0.txt (visited on 01/05/2016) (cit. on p. 81).

[6] R. Arpaci-Dusseau and A. Arpaci-Dusseau. Operating Systems: Three Easy Pieces. Arpaci-Dusseau Books, 2014 (cit. on p. 18).

[7] S. Arya and D. Mount. ‘Approximate Nearest Neighbor Queries in Fixed Dimensions’. In: Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA). Vol. 93. ACM, 1993, pp. 271–280 (cit. on p. 18).

[8] S. Arya, D. Mount, N. Netanyahu, R. Silverman and A. Wu. ‘An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions’. In: Journal of the ACM (JACM) 45.6 (1998), pp. 891–923 (cit. on pp. 17–18, 46).

[9] J. Axboe. Deadline I/O scheduler. Linux Kernel Organization, Inc. 2002. url: https://www.kernel.org/doc/Documentation/block/deadline-iosched.txt (visited on 19/12/2015) (cit. on pp. 18, 26).

[10] J. Axboe. CFQ (Complete Fairness Queueing) I/O scheduler. Linux Kernel Organization, Inc. 2003. url: https://www.kernel.org/doc/Documentation/block/cfq-iosched.txt (visited on 19/12/2015) (cit. on pp. 18, 26).

[11] H. Benington. ‘Production of Large Computer Programs’. In: IEEE Annals of the History of Computing 4 (1983), pp. 350–361 (cit. on p. 19).

[12] J. Bentley. ‘Multidimensional Binary Search Trees Used for Associative Searching’. In: Communications of the ACM 18.9 (1975), pp. 509–517 (cit. on pp. 16, 46).

[13] J. Bentley. ‘K-d Trees for Semidynamic Point Sets’. In: Proceedings of the 6th ACM Symposium on Computational geometry. ACM, 1990, pp. 187–197 (cit. on p. 16).

[14] J. Bland and D. Altman. ‘Statistics Notes: Measurement error’. In: British Medical Journal 313.7059 (1996), p. 744 (cit. on p. 35).

[15] Block I/O Controller. Linux Kernel Organization, Inc. Jan. 2016. url: https://www.kernel.org/doc/Documentation/cgroup-v1/blkio-controller.txt (visited on 22/05/2016) (cit. on p. 26).

[16] B. Boehm. ‘A Spiral Model of Software Development and Enhancement’. In: Computer 21.5 (1988), pp. 61–72 (cit. on p. 20).

[17] G. Bousquet. Vilfredo Pareto: Sa vie et son oeuvre. Payot, 1928 (cit. on p. i).

[18] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian et al. ‘Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing’. In: Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Oct. 2014, pp. 285–300 (cit. on p. 9).

[19] R. Brown, R. Meyer and D. D’Esopo. ‘Exponential Smoothing for Predicting Demand’. In: Operations Research. Vol. 5. 1. INFORMS, 1957, pp. 145–145 (cit. on p. 40).

[20] B. Burns, B. Grant, D. Oppenheimer, E. Brewer and J. Wilkes. ‘Borg, Omega, and Kubernetes’. In: ACM Queue 14.1 (2016), p. 10 (cit. on p. 10).

[21] M. Carvalho, W. Cirne, F. Brasileiro and J. Wilkes. ‘Long-term SLOs for Reclaimed Cloud Computing Resources’. In: Proceedings of the 2014 ACM Symposium on Cloud Computing. ACM, 2014, 20:1–20:13 (cit. on pp. 1, 6–7, 77).

[22] W. Chun. Core Python Programming. Vol. 1. Prentice Hall, 2001 (cit. on p. 21).

[23] K. Clarkson. ‘Nearest Neighbor Queries in Metric Spaces’. In: Discrete & Computational Geometry 22.1 (1999), pp. 63–93 (cit. on p. 17).

[24] D. Cox and P. Lewis. ‘The Statistical Analysis of Series of Events’. In: (1966) (cit. on p. 35).

[25] D. Crockford. The JSON Data Interchange Format. Tech. rep. ECMA International, 2013 (cit. on p. 25).

[26] J. Dean and S. Ghemawat. ‘MapReduce: Simplified Data Processing on Large Clusters’. In: Communications of the ACM 51.1 (2008), pp. 107–113 (cit. on pp. 5, 48).

[27] C. Delimitrou and C. Kozyrakis. ‘Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters’. In: ACM SIGARCH Computer Architecture News 41.1 (2013), pp. 77–88 (cit. on pp. 2, 9).

[28] C. Delimitrou and C. Kozyrakis. ‘Quasar: Resource-efficient and QoS-aware Cluster Management’. In: Proceedings of the 19th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2014, pp. 127–144 (cit. on pp. 2, 6–7, 9).

[29] S. Di, D. Kondo and F. Cappello. ‘Characterizing Cloud Applications on a Google Data Center’. In: Proceedings of the 42nd IEEE International Conference on Parallel Processing (ICPP). IEEE, 2013, pp. 468–473 (cit. on p. 1).

[30] R. Dua, A. Raja and D. Kakadia. ‘Virtualization vs Containerization to support PaaS’. In: Proceedings of the 2014 IEEE International Conference on Cloud Engineering (IC2E). IEEE, 2014, pp. 610–614 (cit. on p. 11).

[31] Y. Etsion and D. Tsafrir. ‘A Short Survey of Commercial Cluster Batch Schedulers’. In: School of Computer Science and Engineering, The Hebrew University of Jerusalem 44221 (2005), pp. 2005–13 (cit. on p. 13).

[32] S. Even. Graph Algorithms. Cambridge University Press, 2011 (cit. on p. 27).

[33] U. Fano. ‘Ionization Yield of Radiations II. The Fluctuations of the Number of Ions’. In: Physical Review 72.1 (1947), p. 26 (cit. on p. 35).

[34] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach et al. Hypertext Transfer Protocol – HTTP/1.1. June 1999 (cit. on p. 14).

[35] R. Finkel and J. Bentley. ‘Quad Trees: A Data Structure for Retrieval on Composite Keys’. In: Acta informatica 4.1 (1974), pp. 1–9 (cit. on p. 17).

[36] A. Frangioni and A. Manca. ‘A Computational Study of Cost Reoptimization for Min-Cost Flow Problems’. In: INFORMS Journal on Computing 18.1 (2006), pp. 61–70 (cit. on p. 12).

[37] J. Friedman, J. Bentley and R. Finkel. ‘An Algorithm for Finding Best Matches in Logarithmic Expected Time’. In: ACM Transactions on Mathematical Software (TOMS) 3.3 (1977), pp. 209–226 (cit. on pp. 16–17).

[38] M. Goodrich and R. Tamassia. Algorithm Design: Foundations, Analysis and Internet Examples. John Wiley & Sons, 2006 (cit. on p. 46).

[39] M. Helsley. ‘LXC: Linux container tools’. In: IBM developerWorks Technical Library (2009) (cit. on pp. 11, 25).

[40] J. Hicks. ‘Consumers’ Surplus and Index-Numbers’. In: The Review of Economic Studies 9.2 (1942), pp. 126–137 (cit. on p. 2).

[41] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz et al. ‘Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center’. In: Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI). Vol. 11. USENIX, 2011, pp. 22–22 (cit. on pp. 1, 5, 8, 93).

[42] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. Joseph, R. Katz et al. ‘Mesos: Flexible Resource Sharing for the Cloud’. In: ;login: The USENIX Magazine 41.1 (Aug. 2011) (cit. on p. 6).

[43] C. Hoare. ‘Algorithm 65: Find’. In: Communications of the ACM 4.7 (July 1961), pp. 321–322 (cit. on p. 48).

[44] Improving Resource Efficiency with Apache Mesos. Twitter University, 8th Apr. 2014. url: https://www.youtu.be/YpmElyi94AA (visited on 18/04/2016) (cit. on p. 7).

[45] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar and A. Goldberg. ‘Quincy: Fair Scheduling for Distributed Computing Clusters’. In: Proceedings of the 22nd ACM-SIGOPS Symposium on Operating systems principles. ACM, 2009, pp. 261–276 (cit. on pp. 12, 93, 97).

[46] M. Iverson, F. Ozguner and L. Potter. ‘Statistical Prediction of Task Execution Times Through Analytic Benchmarking for Scheduling in a Heterogeneous Environment’. In: IEEE Transactions on Computers archive (Dec. 1999), pp. 1374–1379 (cit. on p. 94).

[47] D. Jackson, Q. Snell and M. Clement. ‘Core Algorithms of the Maui Scheduler’. In: Job Scheduling Strategies for Parallel Processing. Springer, 2001, pp. 87–102 (cit. on p. 13).

[48] F. James. Statistical Methods in Experimental Physics. World Scientific, 2006 (cit. on p. 48).

[49] M. Jones. ‘Application virtualization, past and future. An introduction to application virtualization’. In: IBM Developer Works (2011) (cit. on p. 11).

[50] B. Kernighan and J. Mashey. ‘The Unix Programming Environment’. In: Computer 4 (1981), pp. 12–24 (cit. on p. 21).

[51] M. Kerrisk. mount(8). Linux man-pages project. 2016. url: http://man7.org/linux/man-pages/man8/mount.8.html (visited on 10/01/2016) (cit. on p. 25).

[52] M. Kerrisk. statvfs(2). Linux man-pages project. 2016. url: http://man7.org/linux/man-pages/man2/statfs.2.html (visited on 04/02/2016) (cit. on p. 33).

[53] J. Kleinberg. ‘Two Algorithms for Nearest-Neighbor Search in High Dimensions’. In: Proceedings of the 29th ACM Symposium on Theory of computing. ACM, 1997, pp. 599–608 (cit. on p. 17).

[54] Kubernetes – Accelerate Your Delivery. 23rd Apr. 2015. url: http://kubernetes.io/ (visited on 18/04/2016) (cit. on pp. 10, 13).

[55] C. Larman and V. Basili. ‘Iterative and Incremental Development: A Brief History’. In: Computer 6 (2003), pp. 47–56 (cit. on p. 20).

[56] H. Liu. ‘A Measurement Study of Server Utilization in Public Clouds’. In: Proceedings of the 9th IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC). IEEE, 2011, pp. 435–442 (cit. on p. 6).

[57] lmctfy – Let Me Contain That For You. GitHub. 28th May 2015. url: https://github.com/google/lmctfy (visited on 18/04/2016) (cit. on p. 11).

[58] lxccontainer.h. Linux container projects. 31st Mar. 2016. url: https://github.com/lxc/lxc/blob/master/src/lxc/lxccontainer.h (visited on 01/05/2016) (cit. on p. 27).

[59] B. Mandelbrot and R. Hudson. The (Mis) Behavior of Markets, 2004 (cit. on p. i).

[60] R. Markovits. Matters of Principle: Legitimate Legal Argument and Constitutional Interpretation. NYU Press, 1998 (cit. on p. 2).

[61] R. McKendrick. Monitoring Docker. Packt Publishing Ltd, 2015 (cit. on pp. 18, 25).

[62] P. Menage. ‘Adding Generic Process Containers to the Linux Kernel’. In: Proceedings of the 2007 Ottawa Linux Symposium. Vol. 2. ACM, 2007, pp. 45–57 (cit. on p. 11).

[63] P. Menage, P. Jackson and C. Lameter. Cgroups. Linux Kernel Organization, Inc. 2008. url: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt (visited on 18/12/2015) (cit. on p. 11).

[64] D. Merkel. ‘Docker: Lightweight Linux Containers for Consistent Development and Deployment’. In: Linux Journal 2014.239 (2014), p. 2 (cit. on p. 11).

[65] D. Mount. ANN Programming Manual. Department of Computer Science, University of Maryland. 2010. url: https://www.cs.umd.edu/~mount/ANN/Files/1.1.2/ANNmanual_1.1.pdf (visited on 04/11/2015) (cit. on pp. 16, 18, 45–46).

[66] M. Neuts. ‘The Burstiness of Point Processes’. In: Stochastic Models 9.3 (1993), pp. 445–466 (cit. on p. 35).

[67] V. Pareto. ‘Cours d’économie politique’. In: Lausanne: Librairie de l’Université (1896) (cit. on p. i).

[68] V. Pareto. Manuale di economia politica. Vol. 13. Societa Editrice, 1906 (cit. on pp. i, 2).

[69] J. Pendry and M. McKusick. ‘Union Mounts in 4.4BSD-Lite’. In: Proceedings of the 1995 USENIX Conference on UNIX and Advanced Computing Systems. USENIX, Jan. 1995, pp. 25–33 (cit. on p. 26).

[70] A. Pigou. The Economics of Welfare. Palgrave Macmillan, 2013 (cit. on p. 2).

[71] G. Popek and R. Goldberg. ‘Formal Requirements for Virtualizable Third Generation Architectures’. In: Communications of the ACM 17.7 (1974), pp. 412–421 (cit. on p. 11).

[72] C. Reiss, A. Tumanov, G. Ganger, R. Katz and M. Kozuch. ‘Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis’. In: Proceedings of the 3rd ACM Symposium on Cloud Computing. ACM, 2012, p. 7 (cit. on pp. 2, 6–8, 73, 75).

[73] C. Reiss, J. Wilkes and J. Hellerstein. google/cluster-data. Tech. rep. Google, Feb. 2016. url: https://github.com/google/cluster-data (visited on 29/11/2015) (cit. on pp. 1, 97).

[74] L. Richardson and S. Ruby. RESTful Web Services. O’Reilly Media, Inc., 2008 (cit. on p. 25).

[75] D. Rumelhart, J. McClelland, PDP Research Group et al. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, 1986 (cit. on p. 46).

[76] M. Schwarzkopf. ‘Operating system support for warehouse-scale computing’. PhD thesis. University of Cambridge, 2015 (cit. on pp. 12, 14, 32).

[77] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek and J. Wilkes. ‘Omega: flexible, scalable schedulers for large compute clusters’. In: Proceedings of the 8th ACM European Conference on Computer Systems. ACM, 2013, pp. 351–364 (cit. on pp. 1, 9).

[78] S. Seelam, R. Romero, P. Teller and B. Buros. ‘Enhancements to Linux I/O Scheduling’. In: Proceedings of the 2005 Ottawa Linux Symposium. Vol. 2. ACM, 2005, pp. 183–200 (cit. on p. 18).

[79] A. Sen. ‘A quick introduction to the Google C++ Testing Framework’. In: IBM DeveloperWorks (2010) (cit. on pp. 20, 57).

[80] M. Snir. MPI – The Complete Reference: The MPI Core. Vol. 1. MIT press, 1998 (cit. on p. 5).

[81] Standard for Software Component Testing. Tech. rep. British Computer Society Specialist Interest Group in Software Testing (BCS SIGIST), Apr. 2001, p. 67 (cit. on pp. 21, 57).

[82] B. Stroustrup. The C++ Programming Language. Pearson Education India, 1986 (cit. on p. 21).

[83] T. Swicegood. Pragmatic Version Control Using Git. Pragmatic Bookshelf, 2008 (cit. on p. 22).

[84] A. Tumanov, T. Zhu, M. Kozuch, M. Harchol-Balter and G. Ganger. Tetrisched: Space-Time Scheduling for Heterogeneous Datacenters. Tech. rep. Carnegie Mellon University, 2013 (cit. on p. 2).

[85] G. Vallee, T. Naughton, C. Engelmann, H. Ong and S. Scott. ‘System-level Virtualization for High Performance Computing’. In: Proceedings of the 16th IEEE Euromicro Conference on Parallel, Distributed and Network-Based Processing. 2008, pp. 636–643 (cit. on p. 11).

[86] V. Kumar Vavilapalli, A. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans et al. ‘Apache Hadoop YARN: Yet Another Resource Negotiator’. In: Proceedings of the 4th ACM Symposium on Cloud Computing. ACM, 2013, p. 5 (cit. on p. 8).

[87] P. Verhulst. ‘Recherches mathématiques sur la loi d’accroissement de la population’. In: Nouveaux Mémoires de l’Académie Royale des Sciences et Belles-Lettres de Bruxelles 18 (1845), pp. 14–54 (cit. on p. 50).

[88] A. Verma, L. Cherkasova and R. Campbell. ‘SLO-Driven Right-Sizing and Resource Provisioning of MapReduce Jobs’. In: Proceedings of the ACM-SIGOPS Workshop on Large-Scale Distributed Systems and Middleware (LADIS) in conjunction with VLDB. ACM, Aug. 2011 (cit. on p. 9).

[89] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune and J. Wilkes. ‘Large-scale cluster management at Google with Borg’. In: Proceedings of the 10th ACM European Conference on Computer Systems. ACM, 2015, p. 18 (cit. on pp. ii, 1–2, 9–10, 13, 81, 92).

[90] F. von Wieser. ‘Theorie der gesellschaftlichen Wirtschaft’. In: Grundriss der Sozialökonomik 1 (1914) (cit. on p. 2).

[91] T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 2012 (cit. on p. 5).

[92] J. Wilkes. More Google Cluster Data. Google. 2011. url: http://googleresearch.blogspot.com/2011/11/more-google-cluster-data.html (visited on 29/11/2015) (cit. on pp. 1, 97).

[93] P. Zezula, G. Amato, V. Dohnal and M. Batko. Similarity Search: The Metric Space Approach. Vol. 32. Springer Science & Business Media, 2006, pp. 16–17 (cit. on p. 15).

Appendix A

Workload generator

# Fixed positional parameters.
secs_mid=$1; secs_var=$2; tasks_dir=$3; scripts_dir=$4

# Read (task label, weight) pairs from positional parameter $start_index onwards.
declare -A weights
arg_index=0
start_index=6
for arg in "$@"; do
    next_arg_index=$((arg_index + 1))
    if [ $((arg_index % 2)) -eq $((start_index % 2)) -a $arg_index -ge $start_index ]; then
        weights+=(["${!arg_index}"]="${!next_arg_index}")
    fi
    arg_index=$next_arg_index
done

# Repeat each label once per unit of weight, so that a uniform draw is weight-proportional.
weight_sum=0
task_names=()
for task_label in "${!weights[@]}"; do
    weight_sum=$((weight_sum + weights["$task_label"]))
    for ((i = 0; i < $((weights["$task_label"])); i++)); do
        task_names+=("$task_label")
    done
done

# Forever: pick a task at random, wait, then submit it.
while true; do
    task_name="${task_names[$RANDOM % weight_sum]}"
    echo "SUBMITTING:" "$task_name"
    # Sleep for (secs_mid - secs_var) seconds plus a random 0..(secs_mid - secs_var) seconds.
    sleep $((secs_mid - secs_var + (RANDOM % (secs_mid - secs_var + 1))))
    bash "$tasks_dir/$task_name/submit.sh" "$scripts_dir" "$tasks_dir/$task_name" "$tasks_dir/programs"
done

Appendix B

Project proposal

Joshua Bambrick

Jesus College

jpb80

Part II Project Proposal

Task resource limitation for cluster computing systems

23 October, 2015

Project Originators: Malte Schwarzkopf & Joshua Bambrick

Resources Required: Yes, please see Special Resources

Project Supervisors: Ionel Gog & Malte Schwarzkopf

Director of Studies: Cecilia Mascolo

Overseers: Jean Bacon & Ross Anderson

B.1 Introduction & Background

Cluster computing systems have become increasingly widely used in recent years, driven by the rising accessibility of such systems offered by the paradigm of Infrastructure as a Service (IaaS), with packages supplied by cloud computing providers.

The role of a cluster manager is, given a set of tasks to execute, to ensure that execution occurs on the cluster in accordance with a set of preferences. The two main aims of a traditional cluster manager are to handle failures of tasks and to utilise resources efficiently, so that some metric of the throughput of the system, such as the number of tasks completed per unit time, is maximised. Responsibilities of a cluster manager might include scheduling tasks by mapping them to particular processing units at a particular time, preempting and restarting tasks that are taking too long to complete, and monitoring the health of processing units and restarting their assigned tasks if they fail.

To perform its duties, a cluster manager must be capable of tracking what tasks and processing units there are, and what resources each will have available at a particular time. To partition resources, it is common for the cluster manager to specify an upper bound for each task; the bounds are traditionally specified explicitly by the user or the framework that interacts with the cluster manager, and remain fixed throughout the execution of the task.

Resources that may be considered include, for example, the number of CPU cores needed for a task, the amount of RAM needed, the amount of disk space required, and the network bandwidth.

Underestimating the resources a task requires may lead to the task being halted by the cluster manager, requiring it to be restarted and perhaps never completing successfully. As such, users are motivated to drastically overestimate the resources their tasks will require for successful execution. Furthermore, the upper bounds apply throughout the entire execution of a task, while the resources it requires for the majority of its execution may be substantially lower. Such misallocation of resources will reduce the parallelism that the cluster system can achieve, as new tasks may need to wait to be scheduled due to the cluster manager’s unfounded belief that insufficient resources are available.¹

This problem is beginning to be addressed in modern cluster managers, such as Google’s Borg [89]; the approach of this cluster manager is to periodically measure the resources a task is utilising, and decay the resource allocation, which is initialised to the user-specified limit, towards the true value. In order to bias allocation towards over-estimation, a safety margin is added to the estimation. The remaining unallocated resources can then be reclaimed and considered in the scheduling of new tasks.
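The decay described above can be sketched as a single update step. This is a minimal illustration rather than Borg's or Vilfredo's actual algorithm: the function name, the geometric form of the decay, and the values chosen for `safety_margin` and `decay_rate` are all assumptions made for the example.

```python
def decay_reservation(reservation, usage, safety_margin=0.2, decay_rate=0.5):
    """One reclamation step: move the reservation part-way towards usage plus a margin.

    safety_margin and decay_rate are illustrative values, not taken from the source.
    """
    target = usage * (1.0 + safety_margin)  # padded estimate of the true requirement
    if reservation <= target:
        return reservation  # never decay below the padded usage estimate
    return reservation - decay_rate * (reservation - target)


# Hypothetical task: user-specified limit of 1000 MB, steady measured usage of 200 MB.
limit = 1000.0
reservation = limit  # the reservation is initialised to the user-specified limit
history = []
for _ in range(5):
    reservation = decay_reservation(reservation, usage=200.0)
    history.append(reservation)

# Resources that could be offered to newly scheduled tasks after five steps.
reclaimable = limit - reservation
```

With these assumed parameters the reservation approaches, but never drops below, the padded usage of 240 MB, so the allocation stays biased towards over-estimation.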

¹ It is important to note, however, that a slight overestimation of a task’s resource requirements is a far better condition than a slight underestimation; the former provides a tight bound, supporting near-optimal parallelism, but the latter requires the task to be entirely restarted, and all of its completed work is discarded.

The Firmament cluster manager is a locally-developed research system, similar to Borg or Mesos [41]. Firmament adopts an approach to scheduling inspired by the Quincy [45] cluster manager, whereby the user’s scheduling preferences are specified in the form of a flow network. This flow network is then treated as an input to a min-cost flow optimisation. A key feature of this scheduler is that similar entities, such as identical machines, or the tasks within the same job, are aggregated into tiered equivalence classes, represented as vertices in the flow network. The requirement that entities within a particular equivalence class be fungible makes the min-cost flow problem tractable, and this helps the system to scale better. The flexibility of this approach has been exhibited by successfully applying it to implement a variety of common scheduling policies at large scales.
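To make the flow-network formulation concrete, the following sketch schedules two tasks onto two machines by solving a tiny min-cost flow problem with successive shortest paths. It is a deliberately simplified illustration and not Firmament's solver: the single super-source `S`, the unscheduled aggregator `U`, the sink `K`, and all capacities and costs are assumptions invented for the example.

```python
from collections import defaultdict


def min_cost_flow(edges, source, sink):
    """Successive shortest paths (Bellman-Ford) on a small integer network.

    edges: list of (u, v, capacity, cost). Returns (flow, cost, per_edge_flow).
    """
    arcs = []                # each arc is [head, residual capacity, cost]
    out = defaultdict(list)  # node -> indices of arcs leaving it
    for u, v, cap, cost in edges:
        out[u].append(len(arcs))
        arcs.append([v, cap, cost])   # forward arc at an even index
        out[v].append(len(arcs))
        arcs.append([u, 0, -cost])    # reverse arc at the following odd index
    nodes = list(out)
    flow = total_cost = 0
    while True:
        dist, via = {source: 0}, {}
        for _ in nodes:               # at most |V| relaxation rounds
            changed = False
            for u in nodes:
                if u not in dist:
                    continue
                for i in out[u]:
                    v, cap, cost = arcs[i]
                    if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                        dist[v], via[v] = dist[u] + cost, i
                        changed = True
            if not changed:
                break
        if sink not in dist:
            break                     # no augmenting path remains
        path, v = [], sink
        while v != source:            # walk back along the cheapest path
            path.append(via[v])
            v = arcs[via[v] ^ 1][0]   # head of the paired arc is this arc's tail
        push = min(arcs[i][1] for i in path)
        for i in path:
            arcs[i][1] -= push
            arcs[i ^ 1][1] += push
        flow += push
        total_cost += push * dist[sink]
    per_edge = {(u, v): arcs[2 * k + 1][1] for k, (u, v, _, _) in enumerate(edges)}
    return flow, total_cost, per_edge


# Hypothetical network: tasks T1, T2; machines M1, M2 (one slot each); U = unscheduled.
edges = [
    ("S", "T1", 1, 0), ("S", "T2", 1, 0),
    ("T1", "M1", 1, 2), ("T1", "M2", 1, 3),
    ("T2", "M1", 1, 1), ("T2", "M2", 1, 5),
    ("T1", "U", 1, 10), ("T2", "U", 1, 10),
    ("M1", "K", 1, 0), ("M2", "K", 1, 0), ("U", "K", 2, 0),
]
flow, cost, per_edge = min_cost_flow(edges, "S", "K")
assignment = {u: v for (u, v), f in per_edge.items() if f > 0 and u.startswith("T")}
```

In this toy instance the cheapest complete assignment places T1 on M2 and T2 on M1; leaving a task unscheduled is possible through `U` but is priced high enough that the solver avoids it.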

When estimating task resources, there are a number of features that might have a correlative relationship with the resource requirements of a particular task. Tasks within the same equivalence class will exhibit similar patterns of resource requirements throughout their execution; but it may be the case that the execution of tasks on different processing units follows different patterns. Both of these effects could feasibly be modelled using a feature of the equivalence class in which the entity falls. One would certainly hope that the user-defined resource estimations for a particular task would follow some correlative pattern with the true resource requirements. The level of trust in a user’s resource estimation might be affected by who the user is, or what framework provided the estimate. It may also be worth considering the previous accuracy of estimates for a particular task, or equivalence class, and providing a wider safety margin if this is lower.

In this project I will be working on adapting Firmament to make an automatic estimation of the resources that should be assigned to each task, and to impose bounds on the resources used, in order to increase the level of parallelism that the system can support. This will involve developing support for tracking resource usage, and setting and updating a limit for the resources a task can use. Furthermore, Firmament’s profiling abilities will need to be extended to support limitations on resources that are not currently considered.

B.2 Special Resources

B.2.1 Personal Machine

I intend to use my personal 64-bit laptop, running Linux, for development. The laptop has an Intel Core i5 processor and 6 GB of memory.

Git will be used for version control, with all new code being pushed to GitHub. This will help protect my work from data loss due to hardware or software issues. I intend to set up automatic backups to the Managed Cluster Service.


B.2.2 Systems Research Group (SRG) Cluster

Other computing resources might be useful throughout the implementation and evaluation stages of the project, including:

• A dedicated server might be helpful for executing new code faster, both during development, when testing new approaches, and during evaluation, when taking measurements, and hence for decreasing development times.

• The SRG cluster could be used to confirm that the expected performance is achieved in practice on a distributed system implementing my modifications, and to obtain equivalent results for a version of Firmament not implementing them, in order to compare the two for evaluation.

B.3 Starting Point and Previous Experience

To deliver good results in this project, I will draw upon work previously done in the area of resource optimisation, and skills that I have developed in recent years.

• Existing research

There has been a considerable amount of research effort in the area of resource estimation for tasks in large systems; however, little of this seeks to integrate such a feature into the domain of a cluster manager.

The Google Borg scheduler implements a task resource reclamation strategy which periodically measures the resources a task uses, and decays the allocated resources towards the observed value. Implementing a similar algorithm that reclaims resources for Firmament is the core deliverable of this project.

A combination of a statistical k-Nearest Neighbor (k-NN) regression algorithm with analytic code profiling techniques has been applied to information representing the input data set and the machine type for estimating execution time in a heterogeneous distributed computing environment, with some success [46].

• Firmament

A cluster manager developed in the SRG. The project offers a modern approach by representing data centres as flow networks, offering many scheduling policies and efficient solvers for the min-cost flow problem.

• Systems and Programming Experience

During an internship at Amazon.com, I was able to gain experience optimising machine learning linear regression systems, and developing distributed cluster systems with Apache Spark. Throughout the Tripos, I have developed skills in the high-level, object-oriented language Java, and have also been exposed to the systems programming languages C and C++; furthermore, I expect to be applying theory taught during the Part IA and Part IB courses, Operating Systems, Algorithms and Data Structures, Computer Design, Concurrent and Distributed Systems, and Artificial Intelligence I, as well as that which I will get the opportunity to hear about in Part II courses, such as Artificial Intelligence II.

B.4 Project Structure

I have divided my project into multiple stages. Within each stage, I shall be working towards well-defined milestones, which will allow me to measure the development of my project, and ensure progress towards the overarching goal.

B.4.1 Phase 1 – Core Implementation

In the first phase, a simple resource reclamation system will be developed. This system must be able to decrease the maximum resource estimation (or reservation) of a task during execution, without causing the execution of the task to be stopped.

This system must:

1. Keep track of a reservation for tasks during their execution.

2. Initialise this reservation to the user-defined limit.

3. Periodically decay each task’s reservation during its execution, towards its usage.

4. Rapidly increase a task’s reservation towards the limit, if usage exceeds the reservation.

5. Preempt tasks that exceed their reservation, if the reservation cannot be increased.

6. Enforce the user-specified limits by killing tasks that exceed this limit.
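The requirements above can be read as a per-task decision taken at each monitoring interval. The sketch below is one hypothetical way to express that decision; the `Action` names, the `choose_action` function, and the use of `machine_free` (spare capacity on the task's machine) to decide whether a reservation can be increased are assumptions for illustration, not the interface that was implemented.

```python
from enum import Enum


class Action(Enum):
    DECAY = "decay reservation towards usage"
    GROW = "increase reservation towards the limit"
    PREEMPT = "preempt task"
    KILL = "kill task"


def choose_action(usage, reservation, limit, machine_free):
    """Pick the action for one task at one monitoring tick (requirements 3 to 6).

    machine_free is the unreserved capacity on the task's machine (assumed input).
    """
    if usage > limit:
        return Action.KILL          # requirement 6: the user-specified limit is enforced
    if usage > reservation:
        shortfall = usage - reservation
        if shortfall <= machine_free:
            return Action.GROW      # requirement 4: the reservation can be raised
        return Action.PREEMPT       # requirement 5: no room to raise the reservation
    return Action.DECAY             # requirement 3: usage is within the reservation


# Hypothetical readings for a task with a 1000 MB limit and a 400 MB reservation.
within = choose_action(usage=300, reservation=400, limit=1000, machine_free=0)
burst = choose_action(usage=500, reservation=400, limit=1000, machine_free=200)
squeezed = choose_action(usage=500, reservation=400, limit=1000, machine_free=50)
runaway = choose_action(usage=1100, reservation=400, limit=1000, machine_free=500)
```

Checking the limit before the reservation means that a task exceeding its user-specified limit is killed even when the machine has spare capacity.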

This will involve a variety of modifications and additions to the current Firmament system, such as:

1. Addition of support for imposing and adopting resource limits for tasks.

2. Addition of support for setting, updating, and considering resource reservations for tasks.

3. Extension of Firmament’s statistics collection with more profiling options, for example I/O bandwidth and disk space.

I will then seek to apply a machine learning approach, such as implementing the k-NN algorithm, to set the value of the reservation.
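As a rough illustration of how k-NN regression could set a reservation, the sketch below predicts a new task's usage from the observed usage of previously seen tasks with similar resource requests. The feature vector, the inverse-distance weighting, the example history, and the 20% safety margin are all assumptions made for this example.

```python
import math


def predict_usage(history, request, k=3):
    """Inverse-distance-weighted k-NN regression.

    history: list of (request_vector, observed_peak_usage) pairs.
    request: request vector of the new task.
    """
    nearest = sorted(history, key=lambda pair: math.dist(pair[0], request))[:k]
    # A small epsilon avoids division by zero when a past request matches exactly.
    weights = [1.0 / (math.dist(past, request) + 1e-9) for past, _ in nearest]
    weighted = sum(w * usage for w, (_, usage) in zip(weights, nearest))
    return weighted / sum(weights)


# Hypothetical history: (requested memory in MB, requested disk in GB) -> peak memory used (MB).
history = [
    ((1000, 10), 300.0),
    ((1200, 10), 350.0),
    ((900, 5), 280.0),
    ((4000, 50), 2500.0),
]
prediction = predict_usage(history, request=(1100, 10))
reservation = prediction * 1.2  # add a 20% safety margin (assumed value)
```

The features here are left unscaled for brevity; in practice the dimensions would probably need normalising so that no single resource dominates the distance.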


B.4.2 Phase 2 – Evaluation

Following completion of this initial stage, I will seek to evaluate how utilisation has improved. In order to do this, I may consider multiple features, and compare them with a version of the system that doesn’t implement a resource reclamation strategy, but which is otherwise identical.

A key aim of this project is investigating how tight a bound can be placed on each estimate of resource requirements. A tighter bound may improve utilisation but increase the probability of a task being halted or preempted for exceeding it. As such, during the evaluation stage, I will seek to vary the parameters that implicitly determine properties of this bound, and take measurements for comparison. Parameters of interest are, for example, the safety margin, the period between reservation decay decisions, and the rate of decreasing and increasing the reservation.

In order to get an idea of the optimality of the whole system, I must consider:

• throughput: the total amount of work completed per time unit.

• wasted resources: what resources have been wasted due to misallocation.

• task halts: a measure of how often the estimation is too low.

• fairness: a measure of how equally tasks are treated during scheduling.

The measure of throughput may take the form of:

• the mean total number of tasks that execute simultaneously.

• the total number of CPU cycles that are completed per time unit.

The measure of wasted resources may take the form of:

• the fraction of resources that are idle in the cluster.

• the fraction of resources that are idle, but reserved, in the cluster.

The level of wasted resources is hypothesised to correlate inversely with throughput: as fewer resources are wasted, I anticipate that more tasks will be able to run at one time, and hence throughput will increase.
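The two candidate measures of wasted resources listed above could be computed from a snapshot of the cluster roughly as follows. This is a sketch under assumptions: `waste_metrics` is a hypothetical helper, a single resource dimension is considered, and each task is represented as a `(reservation, usage)` pair.

```python
def waste_metrics(capacity, tasks):
    """Return idle and idle-but-reserved fractions for one resource.

    capacity: total amount of the resource in the cluster
    tasks:    list of (reservation, usage) pairs for the running tasks
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    used = sum(usage for _, usage in tasks)
    # Only count reservation that is above usage; a task over its
    # reservation holds no idle reserved resources.
    idle_reserved = sum(max(reservation - usage, 0.0) for reservation, usage in tasks)
    return {
        "idle_fraction": (capacity - used) / capacity,
        "idle_reserved_fraction": idle_reserved / capacity,
    }
```

For example, with a capacity of 100 units and two tasks holding `(40, 10)` and `(30, 30)`, 60% of the resource is idle and 30% is idle but reserved.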

The measure of task kills may take the form of:

• the probability that any given task will be killed.

• the total number of tasks that are killed per time unit.

• the total number of CPU cycles that are wasted due to kills.

The measure of fairness may take the form of:

• the probability that any given task in a given equivalence class will be killed.

• the mean number of CPU cycles that any given task in a given equivalence class will execute before being killed.
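The kill and fairness measures above could be derived from per-task records grouped by equivalence class. The sketch below assumes a hypothetical record format of `(equivalence_class, killed, cpu_cycles)`; neither the function name nor the record layout comes from Firmament.

```python
from collections import defaultdict


def kill_stats_by_class(records):
    """Summarise kills per task equivalence class.

    records: iterable of (equivalence_class, killed, cpu_cycles) tuples,
             where cpu_cycles is the number of cycles the task had run.
    """
    # Per class: [number of tasks, number of kills, cycles run by killed tasks]
    counts = defaultdict(lambda: [0, 0, 0])
    for equiv_class, killed, cpu_cycles in records:
        entry = counts[equiv_class]
        entry[0] += 1
        if killed:
            entry[1] += 1
            entry[2] += cpu_cycles
    stats = {}
    for equiv_class, (tasks, kills, cycles) in counts.items():
        stats[equiv_class] = {
            "kill_probability": kills / tasks,
            # Undefined when no task in the class was killed.
            "mean_cycles_before_kill": cycles / kills if kills else None,
            "wasted_cycles": cycles,
        }
    return stats
```

Comparing `kill_probability` across classes gives one simple view of fairness: large differences would suggest that some classes of task are penalised more than others.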


An ideal system would strike a balance whereby it maintains high levels of utilisation and fairness, while killing only a few tasks, and wasting few resources. It may be valuable to note the distinction between throughput and workload runtime: whilst one would expect a degree of correlation, it may also be necessary to consider the trade-off between optimising for these two features.

The modifications that I make to the profiling for Firmament and the support already available in the knowledge base will help me to make these measures and run experiments to identify how successful the resource reclamation strategy has been. Concretely, this might take the form of:

• To determine the correctness of the reservation, the profiling system could track the actual resources used by individual tasks, and the worker responsible for running this task can compare these resources to the reservation it holds, in order to determine how correct the reservation was, for each resource it takes into consideration. This data will also be useful to calculate the throughput, as the difference between the reservation and the measured values will indicate what idle resources were held in reservation. I will then be able to compare the distributions of the estimation and real resource usage.

• To measure wasted resources, one could implement the above strategy, and take the wasted resources as the value of the difference between the reservation and resources used, assuming that the reservation is greater.

• To measure task kills, the workers could also keep track of the number of tasks that they kill due to exceeding their reservation; they may also record the period of time for which this record applied. When killing a task, the worker might also record the number of CPU cycles that the task had run.

• To consider fairness, the task equivalence class profiles could track the number of tasks within that class, and the workers could update these profiles whenever they kill a task, perhaps including the number of CPU cycles that the task had run up to that point.

These tests should ideally be run in a realistic, distributed environment, for which it might be useful to execute them on the SRG cluster, instead of my personal laptop. Furthermore, it would be best to run tests using a realistic set of tasks, for which it might be possible to use the trace Google released from one of their data centres [92, 73], or I could attempt to use benchmarks similar to those used in the Quincy paper [45].

These features may conflict at times, and part of the evaluation stage will be a subjective judgement as to the optimality of each system. One might, for example, develop a system that achieves high parallelism, but low fairness, where certain tasks (such as longer tasks) never get to run to completion. This is something I will seek to avoid, and hence will consider at evaluation.


B.4.3 Phase 3 – Enhancements

The final phase of the project will focus on approaches to improve on the concept of resource reclamation. These come in the form of extensions that will be implemented only if time permits. Current thoughts on possible extensions are outlined in Possible Extensions, but alternatives may develop as progress is made on the earlier stages of the project.

B.5 Success Criteria

For this project to be deemed successful, the following tasks must be completed:

1. Tracking reservation and usage – Tasks scheduled via Firmament have their resource reservations and usage tracked during execution.

2. Periodic decay – Implementation of a periodic decay of each task’s reservation during its execution, towards its usage.

3. Boost underestimations – Implementation of a rapid increase of a task’s reservation towards its user-defined limit, if usage approaches the reservation.

4. Terminate tasks reaching their limit – Tasks that reach the user-specified resource limit are terminated.

5. Use machine learning to set reservation – Use a machine learning approach, such as the k-NN algorithm, to set the reservation for tasks.

6. Evaluation – Comparison of the throughput and incorrect estimation rate of the cluster manager incorporating resource reclamation. Comparison of resource utilisation with and without resource reclamation.

7. Parameter trade-off – Investigation of the effect of altering parameters, such as a safety margin, the period between decay decisions and the rate of decreasing and increasing the reservation.

B.6 Possible Extensions

There are a number of possible approaches to this problem that might be implemented to make more precise estimations of tasks’ resource usage.

• Analytic decay value – when decaying the reservation, the core implementation calls for taking a single measurement across the entire period since the last evaluation, and using this to determine the level of decay. Better results might instead be achieved by using a statistical property observed during this period, such as the maximum, median or mean of each resource usage property. For example, if the period is 1000 ms, then instead of basing the decay value for a particular resource on the value measured at the end of the period (i.e. at the 1000th millisecond), you could use the maximum, or upper quartile, value observed throughout the entire period (i.e. from the 1st to the 1000th millisecond).

• Exponential smoothing of the decay value – when decaying the reservation, previous levels of decay might give a stronger indication of the general pattern of variation in resource requirements that a given task follows. It might then be worthwhile to implement exponential smoothing to determine the level of decay, which would make this level more consistent, by basing it on the entire execution of the task, with a temporal bias, rather than just the preceding period.

• Statistical analysis of similar tasks – Tasks that fall into the same equivalence class may be likely to follow similar patterns of resource usage during execution. As such, the statistical properties of their resource usage might be used to estimate the resources required of unscheduled tasks which are also deemed similar.

– use analysis to specify unchanging resources – use this statistical analysis to specify the resources of a task and leave them unchanged throughout execution. For example, if you record the maximum value of each type of resource used throughout the execution of each task, then you could set the reservation of new tasks to the 99th percentile of this.

– combining analysis with periodic decay – use this statistical analysis to specify the initial resource allocation for tasks, and periodically decay this towards the actual usage. For example, if you record the maximum value of each type of resource used throughout the execution of each task, then you could set the reservation of new tasks to the 99th percentile of this, and decay the reservation towards the actual usage.

– combining analysis with timeslices – make finer-grained statistical calculations about tasks based on their equivalence class and the timeslice for which the records were made (a combination of a temporal duration and the duration expired since the task was started). Periodically re-evaluate the allocation and determine its new value based on the current timeslice. Potentially deploy an exponential smoothing equation to make an assertion about the accuracy of previous estimates, and provide a looser bound if this is low.

• Supervised learning approach – By using tasks as training examples, train general models to estimate the resources that unscheduled tasks should be allocated. A biased loss function might be useful to ensure that it is more likely to overestimate the resource allocation than to underestimate it.

– use task equivalence class as a feature – build models that map the equivalence class of a task to the resources that it should be allocated.

– combining learning with periodic decay – use the learning approach to specify the initial resource allocation for tasks, and periodically decay this towards the true value.

– introduce timeslice as a feature – the notion of a timeslice described above might be a useful feature in my models, and make finer-grained assertions about the resources a task should be allocated. Exponential smoothing could again be a potential strategy to deploy in order to estimate the accuracy of my model for the current task.

– introduce the user-defined limit as a feature – the user specifies a limit to the resources each task may be allocated. I might find that using these limits as features in my model produces more accurate estimates. I may also seek to join this feature with a class representing the user who made the prediction, since some users might be more inclined to make accurate predictions than others.

– introduce processing unit equivalence class as a feature – I might find that the machine on which a task runs affects the resources that the task requires. For example, some machines might be attached to slower disks, they might communicate with memory more slowly, or they might have different clock speeds. By considering the machine equivalence class of the machine which executes tasks that themselves fall into a particular task equivalence class, I might be able to implicitly model some of these features, and hence obtain more accurate estimates.
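Three of the ideas in the list above lend themselves to short sketches: taking a percentile over a period of samples (for the analytic decay value, or the 99th percentile of recorded peaks), exponentially smoothing a value across periods, and a biased loss that penalises underestimates more heavily. The helper names, the nearest-rank percentile definition, the smoothing factor `alpha` and the penalty weight `under_penalty` are all assumptions chosen for illustration.

```python
def nearest_rank_percentile(samples, q):
    """Return the q-th percentile (integer q from 1 to 100), nearest-rank method."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # ceil(q * n / 100) computed with integers to avoid rounding surprises.
    rank = (q * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def smoothed(previous, observation, alpha=0.3):
    """One step of simple exponential smoothing.

    A larger alpha weights the latest period more heavily; a smaller alpha
    retains more of the task's earlier history.
    """
    return alpha * observation + (1 - alpha) * previous


def asymmetric_loss(predicted, actual, under_penalty=4.0):
    """Squared error that costs more when the prediction is too low."""
    error = predicted - actual
    weight = under_penalty if error < 0 else 1.0
    return weight * error ** 2
```

For instance, a decay decision could use `nearest_rank_percentile(period_samples, 75)` as the usage figure for the period, then pass it through `smoothed` together with the figure carried over from earlier periods, instead of relying on the single measurement taken at the end of the period.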

B.7 Timetable: Work Plan and Milestones

My work will be divided into fortnightly packages, as described below. I intend to develop the core deliverable over the course of Michaelmas, and to have completed all work up to a draft of the dissertation in advance of the Easter break. The Easter break will ideally be used just to make finishing touches, and for revision of course material, in preparation for the June examinations.

B.7.1 Michaelmas Term

Weeks 1 and 2 : 8/10/15 – 21/10/15

• Research the fields of cluster computing and task resource estimation.

• Set up Git repository.

• Set up development environment.

• Write project proposal.

• Familiarise myself with the Firmament codebase.


• Produce experimental code to identify programming and integration challenges.

Deliverable: Project proposal.

Weeks 3 and 4 : 22/10/15 – 4/11/15

• Start development of reclamation algorithm.

• Submit proposal.

Deliverable: Initial bibliography and basic reclamation algorithm.

Important date (Fri 23 Oct 2015 – 12 noon): Proposal submission deadline.

Weeks 5 and 6 : 5/11/15 – 18/11/15

• Complete decay feature of reclamation algorithm.

• Develop unit tests for decay feature of reclamation algorithm.

• Integrate reclamation algorithm with the Firmament codebase.

Deliverable: Tested and integrated reclamation algorithm with decay feature.

Weeks 7 and 8 : 19/11/15 – 2/12/15

• Develop boost feature for underestimated resources in reclamation algorithm.

• Develop unit tests for boost feature of reclamation algorithm.

Deliverable: Tested, integrated and fully completed reclamation algorithm.

Milestone: Phase 1 complete.

B.7.2 Christmas Break

Weeks 1 and 2 : 3/12/15 – 16/12/15

• Get algorithm running on a real cluster system.


• Evaluate the performance of the system with resource reclamation against the system without it.

Deliverable: Metrics representing a quantitative measurement of performance for both setups.

Milestone: Phase 2 complete.

Weeks 3 and 4 : 17/12/15 – 30/12/15

• Write the progress report draft.

• Buffer time.

Deliverable: Progress report draft.

Weeks 5 and 6 : 31/12/15 – 13/1/16

• Write presentation draft.

• Develop extensions of analytic decay value and exponential smoothing of decay.

Deliverable: Extensions completed.

B.7.3 Lent Term

Weeks 1 and 2 : 14/1/16 – 27/1/16

• Commence work on an extension involving alternative resource estimation systems.

• Develop unit tests to account for new features.

Deliverable: A basic alternative extension started and corresponding unit tests.

Weeks 3 and 4 : 28/1/16 – 10/2/16

• Continue implementation of extensions and unit tests.

• Finalise progress report and presentation.


Deliverable: Further developed extensions.

Deliverable: Progress report and presentation.

Important date (Fri 29 Jan 2016 – 12 noon): Progress report submission deadline.

Important dates (Thu 4, Fri 5, Mon 8, Tue 9 Feb 2016 – 2pm): Report presentation.

Weeks 5 and 6 : 11/2/16 – 24/2/16

• Continue implementation of extensions and unit tests.

Deliverable: Further developed extensions.

Weeks 7 and 8 : 25/2/16 – 9/3/16

• Commence writing dissertation.

• Perform additional evaluation, incorporating extensions.

Deliverable: Outline of dissertation.

Deliverable: Additional evaluation metrics.

B.7.4 Easter Break

Weeks 1 and 2 : 10/3/16 – 23/3/16

• Continue writing dissertation.

• Submit dissertation for review.

Deliverable: Draft of dissertation.

Weeks 3 and 4 : 24/3/16 – 6/4/16

• Edit dissertation.


Weeks 5 and 6 : 7/4/16 – 20/4/16

• Complete dissertation.

Deliverable: Final dissertation.

B.7.5 Easter Term

Weeks 1 and 2 : 21/4/16 – 4/5/16

• Prepare dissertation for submission.

Weeks 3 and 4 : 5/5/16 – 18/5/16

• Submit dissertation.

Important date (Fri 13 May 2016 – 12 noon): Dissertation submission deadline.
