
  • Forgoing hypervisor fidelity for measuring virtual machine

    performance

    Oliver R. A. Chick

    Gonville and Caius College

    This dissertation is submitted for the degree of Doctor of Philosophy

    http://orcid.org/0000-0002-6889-8561

  • FORGOING HYPERVISOR FIDELITY FOR MEASURING VIRTUAL MACHINE PERFORMANCE

    OLIVER R. A. CHICK

    For the last ten years there has been rapid growth in cloud computing, which

    has largely been powered by virtual machines. Understanding the performance

    of a virtual machine is hard: There is limited access to hardware counters, tech-

    niques for probing have a higher probe effect than on physical machines, and per-

    formance is tightly coupled with the hypervisor's scheduling decisions. Yet, the

    need for measuring virtual machine performance is high as virtual machines are

    slower than physical machines and have highly-variable performance.

    Current performance-measurement techniques demand hypervisor fidelity:

    They execute the same instructions on a virtual machine and physical machine.

    Whilst fidelity has historically been considered an advantage as it allows the hy-

    pervisor to be transparent to virtual machines, the use case of hypervisors has

    changed from multiplexing access to a single mainframe across an institution to

    forming a building block of the cloud.

    In this dissertation I reconsider the argument for hypervisor fidelity and show

    the advantages of software that co-operates with the hypervisor. I focus on pro-

    ducing software that explains the performance of virtual machines by forgoing

    hypervisor fidelity. To this end, I develop three methods of exposing the hy-

    pervisor interface to performance measurement tools: (i) Kamprobes is a tech-

    nique for probing virtual machines that uses unprivileged instructions rather

    than interrupt-based techniques. I show that this brings the time required to

    fire a probe in a virtual machine to within twelve cycles of native performance.

    (ii) Shadow Kernels is a technique that uses the hypervisor's memory manage-

    ment unit so that an operating system kernel can have per-process specialisation,

    which can be used to selectively fire probes, with low overheads (8353 ± 54 cycles per page) and minimal operating system changes (340 LoC). (iii) Soroban uses

    machine learning on the hypervisor's scheduling activity to report the virtualisa-

    tion overhead in servicing requests and can distinguish between latency caused

    by high virtual machine load and latency caused by the hypervisor.

    Understanding the performance of a machine is particularly difficult when

    executing in the cloud due to the combination of the hypervisor and other virtual

  • machines. This dissertation shows that it is worthwhile forgoing hypervisor

    fidelity to improve the visibility of virtual machine performance.

  • DECLARATION

    This dissertation is my own work and contains nothing which is the outcome

    of work done in collaboration with others, except where specified in the text.

    This dissertation is not substantially the same as any that I have submitted for a

    degree or diploma or other qualification at any other university. This dissertation

    does not exceed the prescribed limit of 60 000 words.

    Oliver R. A. Chick

    November 30, 2015

    http://orcid.org/0000-0002-6889-8561

  • ACKNOWLEDGEMENTS

    This work was principally supported by the Engineering and Physical Sciences

    Research Council [grant number EP/K503009/1] and by internal funds from the

    University of Cambridge Computer Laboratory.

    I should like to pay personal thanks to Dr Andrew Rice and Dr Ripduman So-

    han for their countless hours of supervision and technical expertise, without

    which I would have been unable to conduct my research. Further thanks to

    Dr Ramsey M. Faragher for encouragement and help in wide-ranging areas.

    Special thanks to Lucian Carata and James Snee for their efforts in cod-

    ing reviews and being prudent collaborators, as well as Dr Jeunese A. Payne,

    Daniel R. Thomas, and Diana A. Vasile for proof reading this dissertation.

    My gratitude goes to Prof. Andy Hopper for his support for the Resourceful

    project.

    All members of the DTG, especially Daniel R. Thomas and other inhabitants

    of SN14, have provided me with both wonderful friendships and technical assis-

    tance, which has been invaluable throughout my Ph.D.

    Final thanks naturally go to my parents for their perpetual support.


  • CONTENTS

    1 Introduction 15

    1.1 Defining forgoing hypervisor fidelity . . . . . . . . . . . . . . . . 16

    1.2 Limitations of hypervisor fidelity in performance measurement tools 17

    1.3 The case for forgoing hypervisor fidelity in performance measure-

    ment tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

    1.4 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    1.5 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    1.6 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    1.7 Scope of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    1.7.1 Xen hypervisor . . . . . . . . . . . . . . . . . . . . . . . . 23

    1.7.2 GNU/Linux operating system . . . . . . . . . . . . . . . . 23

    1.7.3 Paravirtualised guests . . . . . . . . . . . . . . . . . . . . . 24

    1.7.4 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    1.8 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2 Background 27

    2.1 Historical justification for hypervisor fidelity . . . . . . . . . . . . 28

    2.2 Contemporary uses for virtualisation . . . . . . . . . . . . . . . . 29

    2.3 Virtualisation performance problems . . . . . . . . . . . . . . . . 33

    2.3.1 Privileged instructions . . . . . . . . . . . . . . . . . . . . 33

    2.3.2 I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    2.3.3 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    2.3.4 Increased contention . . . . . . . . . . . . . . . . . . . . . 34

    2.3.5 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    2.3.6 Unpredictable timing . . . . . . . . . . . . . . . . . . . . . 35

    2.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.4 The changing state of hypervisor fidelity . . . . . . . . . . . . . . 35

    2.4.1 Historical changes to hypervisor fidelity . . . . . . . . . . 35

    2.4.2 Recent changes to hypervisor fidelity . . . . . . . . . . . . 36

    2.4.3 Current state of hypervisor fidelity . . . . . . . . . . . . . 38

  • 2.4.3.1 Installing guest additions . . . . . . . . . . . . . 38

    2.4.3.2 Moving services into dedicated domains . . . . . 38

    2.4.3.3 Lack of transparency of HVM containers . . . . 39

    2.4.3.4 Hypervisor/operating system semantic gap . . . . 39

    2.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    2.5 Rethinking operating system design for hypervisors . . . . . . . . 40

    2.6 Virtual machine performance measurement . . . . . . . . . . . . . 41

    2.6.1 Kernel probing . . . . . . . . . . . . . . . . . . . . . . . . 41

    2.6.2 Kernel specialisation . . . . . . . . . . . . . . . . . . . . . 42

    2.6.3 Performance interference . . . . . . . . . . . . . . . . . . . 43

    2.6.3.1 Measurement . . . . . . . . . . . . . . . . . . . . 43

    2.6.3.2 Modelling . . . . . . . . . . . . . . . . . . . . . . 44

    2.6.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . 45

    2.7 Application to a broader context . . . . . . . . . . . . . . . . . . 46

    2.7.1 Containers . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    2.7.2 Microkernels . . . . . . . . . . . . . . . . . . . . . . . . . 47

    2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    3 Kamprobes: Probing designed for virtualised operating systems 49

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    3.2 Current probing techniques . . . . . . . . . . . . . . . . . . . . . 51

    3.2.1 Linux: Kprobes . . . . . . . . . . . . . . . . . . . . . . . . 51

    3.2.2 Windows: Detours . . . . . . . . . . . . . . . . . . . . . . 52

    3.2.3 FreeBSD, NetBSD, OS X: DTrace function boundary tracers 53

    3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.3 Experimental evidence against virtualising current probing tech-

    niques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    3.3.1 Cost of virtualising Kprobes . . . . . . . . . . . . . . . . . 54

    3.3.2 Cost of virtualised interrupts . . . . . . . . . . . . . . . . . 57

    3.3.3 Other causes of slower performance when virtualised . . . 58

    3.4 Kamprobes design . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    3.5.1 Kamprobes API . . . . . . . . . . . . . . . . . . . . . . . . 60

    3.5.2 Kernel module . . . . . . . . . . . . . . . . . . . . . . . . 61

    3.5.3 Changes to the x86-64 instruction stream . . . . . . . . . 61

  • 3.5.3.1 Inserting Kamprobes into an instruction stream . 61

    3.5.3.2 Kamprobe wrappers . . . . . . . . . . . . . . . . 62

    3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    3.6.1 Inserting probes . . . . . . . . . . . . . . . . . . . . . . . . 69

    3.6.2 Firing probes . . . . . . . . . . . . . . . . . . . . . . . . . 71

    3.6.3 Kamprobes executing on bare metal . . . . . . . . . . . . . 74

    3.7 Evaluation summary . . . . . . . . . . . . . . . . . . . . . . . . . 75

    3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    3.8.1 Backtraces . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    3.8.2 FTrace compatibility . . . . . . . . . . . . . . . . . . . . . 76

    3.8.3 Instruction limitations . . . . . . . . . . . . . . . . . . . . 76

    3.8.4 Applicability to other instruction sets and ABIs . . . . . . 76

    3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    4 Shadow kernels: A general mechanism for kernel specialisation in exist-

    ing operating systems 79

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    4.2.1 Shadow Kernels for probing . . . . . . . . . . . . . . . . . 82

    4.2.2 Per-process kernel profile-guided optimisation . . . . . . . 84

    4.2.3 Kernel optimisation and fast-paths . . . . . . . . . . . . . 84

    4.2.4 Kernel updates . . . . . . . . . . . . . . . . . . . . . . . . 85

    4.3 Design and implementation . . . . . . . . . . . . . . . . . . . . . 86

    4.3.1 User space API . . . . . . . . . . . . . . . . . . . . . . . . 86

    4.3.2 Linux kernel module . . . . . . . . . . . . . . . . . . . . . 87

    4.3.2.1 Module insertion . . . . . . . . . . . . . . . . . . 88

    4.3.2.2 Initialisation of a shadow kernel . . . . . . . . . 88

    4.3.2.3 Adding pages to the shadow kernel . . . . . . . . 89

    4.3.2.4 Switching shadow kernel . . . . . . . . . . . . . 89

    4.3.2.5 Interaction with other kernel modules . . . . . . 90

    4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    4.4.1 Creating a shadow kernel . . . . . . . . . . . . . . . . . . 91

    4.4.2 Switching shadow kernel . . . . . . . . . . . . . . . . . . . 93

    4.4.2.1 Switching time . . . . . . . . . . . . . . . . . . . 93

    4.4.2.2 Effects on caching . . . . . . . . . . . . . . . . . 95

  • 4.4.3 Kamprobes and Shadow Kernels . . . . . . . . . . . . . . 97

    4.4.4 Application to web workload . . . . . . . . . . . . . . . . 102

    4.4.5 Evaluation summary . . . . . . . . . . . . . . . . . . . . . 103

    4.5 Alternative approaches . . . . . . . . . . . . . . . . . . . . . . . . 103

    4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    4.6.1 Modifications required to kernel debuggers . . . . . . . . . 105

    4.6.2 Software guard extensions . . . . . . . . . . . . . . . . . . 105

    4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

    5 Soroban: Attributing latency in virtualised environments 107

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

    5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

    5.2.1 Performance monitoring . . . . . . . . . . . . . . . . . . . 110

    5.2.2 Virtualisation-aware timeouts . . . . . . . . . . . . . . . . 110

    5.2.3 Dynamic allocation . . . . . . . . . . . . . . . . . . . . . . 111

    5.2.4 QoS-based, fine-grained charging . . . . . . . . . . . . . . 111

    5.2.5 Diagnosing performance anomalies . . . . . . . . . . . . . 112

    5.3 Sources of virtualisation overhead . . . . . . . . . . . . . . . . . . 112

    5.4 Effect of virtualisation overhead on end-to-end latency . . . . . . 116

    5.5 Attributing latency . . . . . . . . . . . . . . . . . . . . . . . . . . 118

    5.5.1 Justification of Gaussian processes . . . . . . . . . . . . . 121

    5.5.2 Alternative approaches . . . . . . . . . . . . . . . . . . . . 122

    5.6 Choice of feature vector elements . . . . . . . . . . . . . . . . . . 123

    5.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

    5.7.1 Xen modifications . . . . . . . . . . . . . . . . . . . . . . 126

    5.7.1.1 Exposing scheduler data . . . . . . . . . . . . . . 126

    5.7.1.2 Sharing scheduler data between Xen and its vir-

    tual machines . . . . . . . . . . . . . . . . . . . . 127

    5.7.2 Linux kernel module . . . . . . . . . . . . . . . . . . . . . 127

    5.7.3 Application modifications . . . . . . . . . . . . . . . . . . 128

    5.7.3.1 Soroban API . . . . . . . . . . . . . . . . . . . . 128

    5.7.3.2 Using the Soroban API . . . . . . . . . . . . . . . 129

    5.7.4 Data processing . . . . . . . . . . . . . . . . . . . . . . . . 129

    5.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

    5.8.1 Validation of model . . . . . . . . . . . . . . . . . . . . . 130

  • 5.8.1.1 Mapping scheduling data to virtualisation over-

    head . . . . . . . . . . . . . . . . . . . . . . . . . 131

    5.8.1.2 Negative virtualisation overhead . . . . . . . . . 133

    5.8.2 Validating virtualisation overhead . . . . . . . . . . . . . . 137

    5.8.3 Detecting increased load from the cloud provider . . . . . . 140

    5.8.4 Performance overheads of Soroban . . . . . . . . . . . . . 141

    5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

    5.9.1 Increased programmer burden of program annotations . . 142

    5.9.2 Scope of performance isolation considered by Soroban . . 143

    5.9.3 Limitation to uptake . . . . . . . . . . . . . . . . . . . . . 143

    5.9.4 Improvements to machine learning . . . . . . . . . . . . . 143

    5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

    6 Conclusion 145

    6.1 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

    6.2 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

    6.3 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

    6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

    6.4.1 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . 148

    6.4.2 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . 149

    6.4.3 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

    6.4.4 Other performance measurement techniques that forgo hy-

    pervisor fidelity . . . . . . . . . . . . . . . . . . . . . . . . 150

    6.5 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

  • CHAPTER 1

    INTRODUCTION

    The recent emergence of cloud computing is largely dependent on the populari-

    sation of high-performance and secure x86-64 virtualisation. By using a hypervi-

    sor, cloud operators are able to multiplex their hardware, with high performance

    and strong data isolation, between multiple competing users. This multiplexing

    allows cloud providers to increase machine utilisation and increase service scal-

    ability. Moreover, the hypervisor eases system management with maintenance

    features such as snapshotting and live migration.

    Yet, despite the advantages of virtual machines they remain slower than phys-

    ical machines and have highly-variable performance [60]. Whilst efforts have

    improved both the raw performance and performance isolation of virtual ma-

    chines, the increased indirection and additional complexity in virtualising privi-

    leged instructions make it unlikely that we shall achieve parity of performance.

    Developers therefore need techniques to help them measure how much slower

    their applications execute in a virtual machine than they would have done on

    bare metal. Furthermore, they need to be able to diagnose and fix performance

    issues that occur in virtualised production systems.

    However, using current techniques it is difficult to measure the performance

    of software when it executes in virtual machines. Many of the methods used

    to measure the performance of software when executing on bare metal, such as

    raw access to performance counters, processor tracing, and visibility of hard-

    ware performance metrics are not directly accessible [18], expensive [105], or

    inaccurate [105, 71] when executing in a virtual machine. The combination of

    less predictable performance and unavailability of performance-debugging tech-

    niques makes it hard to measure the performance of an application executing in

    a virtual machine.

    One technique is to optimise software on bare metal, where access to more

    hardware features is available, and then to virtualise the software. However,

    15

  • this is a poor approach as virtualisation has different performance impacts on

    different operations.1

    Currently, the main virtualisation techniques used by hypervisors either have

    guests execute unmodified code, relying on hardware virtualisation extensions

    to emulate bare-metal hardware from the point of view of the guest, or exe-

    cute paravirtualised guests whereby the virtual machines are made aware that

    they are executing on a hypervisor and issue hypercalls, as opposed to execut-

    ing privileged instructions. But such paravirtualisation of mainstream operating

    systems only applies to the low-level hardware interfaces, typically restricted to

    the architecture-dependent (arch/) code. As such, performance measurement

    techniques that execute on a virtual machine exhibit hypervisor fidelity: They

    execute without consideration of the fact that they are executing in a virtual ma-

    chine. They are therefore unable to access the same set of counters that they can

    on physical machines and are unable to explain performance issues, such as CPU

    starvation of the entire operating system, that do not exist on physical machines.

    Slower and less-predictable performance of software executing in a virtual

    machine are two of the greatest disadvantages of executing software using a

    virtual machine, yet current techniques for measuring this performance do not

    consider the role of virtualisation in slow performance. In this dissertation I

    argue the benefits of forgoing hypervisor fidelity to measure performance. That

    is, given the importance of measuring the performance of virtual machines we

    should turn to forgoing fidelity, in the same way as we have previously forgone

    fidelity to ameliorate previous problems with virtualisation, such as slow perfor-

    mance and the difficulties in virtualising classical x86.

    I show that by forgoing hypervisor fidelity it is possible to build performance-

    analysis techniques that reduce the probe effect of measuring virtual machines

    and explain performance characteristics of software that one cannot measure

    without considering the role of the hypervisor in executing software.

    1.1 Defining forgoing hypervisor fidelity

    Hypervisor fidelity is a well-defined concept [115]. However, the concept of for-

    going hypervisor fidelity is less well defined. In this dissertation I define forgoing

    1 Indeed, I show in Chapter 4 and Chapter 5 that depending on the operation performed, virtualisation overheads can vary to the extent of changing the shape of a distribution.

    16

  • hypervisor fidelity as a property of software that is designed for execution on a

    virtual machine and makes use of the properties of the hypervisor.

    1.2 Limitations of hypervisor fidelity in performance

    measurement tools

    Hypervisors date back to early work by IBM in the 1960s, where they were

    initially used to multiplex access to a scarce, expensive mainframe. However,

    the current trend of using hypervisors to virtualise cloud infrastructure has its

    roots in the renaissance that followed fast and secure techniques to virtualise the

    x86-64 instruction set. The re-emergence of paravirtualisation, addition of hard-

    ware virtualisation extensions, and servers with plentiful memory and CPU capacity

    throughout the 2000s made it possible to execute many virtual machines on a

    single server to increase utilisation. This, combined with a consumer movement

    to performing computations and storing data on servers, made virtualisation at-

    tractive to industry as virtualisation is cheaper and more scalable than executing

    on dedicated machines.

    The rise of cloud computing in recent years has been impressive. Amazon

    EC2 alone has grown from nine million to twenty eight million public IP ad-

    dresses in the past two years [143]. This number is clearly an underestimate for

    the actual use of virtual machines as it doesn't include other cloud providers, or

    non-public IP addresses.

    However, the performance of virtual machines executing in the cloud is highly-

    variable [39, 49], with cloud providers now competing on the predictability of

    their services [9]. Despite this, the tools available to users to measure the per-

    formance of their virtual machines have not kept up with the growth in cloud

    computing. Given the difficulty in correctly virtualising all hardware counters

    and eliminating performance interference, I show how by forgoing hypervisor

    fidelity we can build tools that aid with measuring the performance of a virtual

    machine.

    17

  • 1.3 The case for forgoing hypervisor fidelity in perfor-

    mance measurement tools

    Forgoing hypervisor fidelity to ameliorate problems in the virtualisation domain

    has been repeatedly used in the past. I now explore previous times that we have

    forgone hypervisor fidelity to improve the utility of virtual machines and argue

    that contemporary problems mean that it is time to forgo hypervisor fidelity of

    performance measurement techniques.

    The concept of forgoing hypervisor fidelity is almost as old as virtualisation

    itself. The early literature relating to OS/360 and OS/370 considers the role

    of pure against impure virtual machines, whereby an impure virtual machine

    executes differently as it has been virtualised. The advantage of impure virtual

    machines was that they could execute faster than pure virtual machines. In the

    end, pure virtual machines became the dominant virtual machine type, although

    techniques such as paravirtualisation borrow from the ideas of impure virtual

    machines.

    More recently, forgoing hypervisor fidelity has been used to overcome classi-

    cal limitations of the x86 instruction set that meant it was not virtualisable in a

    way that provided both security and performance. By adopting paravirtualisa-

    tion to overcome the limitations of classical x86, Xen forgoes hypervisor fidelity

    since virtual machines execute with knowledge of the hypervisor and issue hy-

    percalls rather than executing non-virtualisable instructions.

    Even today, we forgo hypervisor fidelity to overcome performance problems

    with virtualisation. One problem that virtual machines face is the possibility of

    not being scheduled when they need to execute, for instance after packets have

    arrived for the virtual machine. In order for the hypervisor to more-favourably

    schedule the virtual machine when it has work to do, under Xen there are two

    hypercalls that allow guests to deschedule themselves: yield and block. When a

    guest is waiting for I/O or the network they can execute the block hypercall, pa-

    rameterised on the event that they are waiting for. The hypervisor then preempts

    the guest until the corresponding event is placed on the guest's event channel, at

    which point the hypervisor wakes the guest. The advantage in this case of the

    guest acknowledging the presence of the hypervisor is that by blocking when it

    cannot make progress the scheduling algorithm stops consuming credit from the

    18

  • domain. Therefore, when the guest is able to execute the scheduling algorithm

    will be more-favourable to the domain. Similarly, the yield hypercall allows

    guests to relinquish their slot on the CPU, without parameterisation, such that

    they will later be scheduled more favourably. Both the block and yield hypercalls

    improve the performance of the guest, through forgoing hypervisor fidelity.
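    To make this concrete, the sketch below shows how a paravirtualised Linux
    guest can issue these two hypercalls through Xen's sched_op interface. It is a
    minimal illustration, assuming the standard Xen guest headers, and is not code
    taken from this dissertation; in practice the block path is driven by the guest's
    idle loop and event-channel setup.

        #include <xen/interface/sched.h>   /* SCHEDOP_yield, SCHEDOP_block */
        #include <asm/xen/hypercall.h>     /* HYPERVISOR_sched_op() */

        /* Relinquish the remainder of this VCPU's slot so that the scheduler
         * treats the domain more favourably when it next has work to do. */
        static inline void guest_yield(void)
        {
                HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
        }

        /* Block this VCPU until an event arrives on one of the guest's event
         * channels, rather than burning credit while no progress can be made. */
        static inline void guest_block(void)
        {
                HYPERVISOR_sched_op(SCHEDOP_block, NULL);
        }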

    Even with the advent of hardware virtualisation that allows unmodified vir-

    tual machines to execute, we still forgo hypervisor fidelity in the drivers on vir-

    tual machines to improve performance. On hardware virtual machines (HVM)

    the emulation of connected devices (which a tool such as QEMU can provide)

    is slow; therefore, HVM guests that need more performance are often converted

    to PV on HVM guests, using virtualisation drivers that replace the emulated

    devices with a driver that directly issues hypercalls. This allows guests to use the

    hardware-assisted virtualisation interface when this is fastest, such as executing

    a system call, since the lack of rings one and two on x86-64 requires all pure-

    paravirtualised system calls to perform a context switch through the hypervisor,

    and use the paravirtualised interface when this is faster, such as avoiding hard-

    ware emulation. This is an example of the virtual machine forgoing hypervisor

    fidelity to improve the performance of a virtual machine.

    As we have seen, forgoing hypervisor fidelity is an oft-used technique for

    solving problems in the virtualisation domain, in particular for solving perfor-

    mance issues. A significant issue facing virtualisation today is that performance

    is variable and yet techniques for measuring the performance of virtual machines

    have lower utility than techniques for measuring the performance of physical ma-

    chines. I propose rethinking where we forgo hypervisor fidelity in a mainstream

    operating system, designed to execute in a contemporary cloud environment.

    In this dissertation I show that by building performance measurement tools

    that don't have strict hypervisor fidelity it is possible to mitigate many of the

    issues of measuring the performance of a virtual machine. Forgoing hypervisor

    fidelity should not be controversial given the trend of forgoing hypervisor fidelity

    to solve performance-related issues.

    In the remainder of this chapter I introduce three key methods by which

    forgoing hypervisor fidelity allows software to report better performance mea-

    surements when virtualised. Later, I present each contribution in detail.

    19

  • 1.4 Kamprobes

    Current kernel probing mechanisms are built without forgoing hypervisor fi-

    delity. That is, developers execute the same types of probes on virtual machines

    as they do on physical machines. However, these methods usually rely on set-

    ting software interrupts in an instruction stream. Whilst these generally execute

    well on physical hardware, I show in Chapter 3 that interrupts on a virtual ma-

    chine are 1.81 times more expensive than interrupts on hardware (§3.3.2), as the hypervisor has to execute.
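    For comparison, the sketch below shows the conventional interrupt-based
    approach on Linux: registering a Kprobe patches the probed instruction with a
    breakpoint, so every firing takes a trap that the hypervisor must help service.
    The probed symbol and handler are arbitrary illustrative choices, not code from
    this dissertation.

        #include <linux/kprobes.h>
        #include <linux/module.h>

        /* Pre-handler: runs in the breakpoint trap handler, before the probed
         * instruction is executed. */
        static int handler_pre(struct kprobe *p, struct pt_regs *regs)
        {
                pr_info("probe hit at %p\n", p->addr);
                return 0;
        }

        static struct kprobe kp = {
                .symbol_name = "do_sys_open",   /* arbitrary example symbol */
                .pre_handler = handler_pre,
        };

        static int __init probe_example_init(void)
        {
                /* register_kprobe() rewrites the probed instruction with a
                 * breakpoint; each probe firing therefore costs a software
                 * interrupt. */
                return register_kprobe(&kp);
        }

        static void __exit probe_example_exit(void)
        {
                unregister_kprobe(&kp);
        }

        module_init(probe_example_init);
        module_exit(probe_example_exit);
        MODULE_LICENSE("GPL");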

    Probes are a common technique for measuring the performance of computer

    software. By allowing developers to add additional code at a program's runtime,

    probes allow developers to execute code that measures wall-clock time, cycles,

    or other resources used by a piece of code without the burden of modifying the

    software's source code, recompiling and re-executing the software. However,

    a problem with probes is that when they fire they consume resources, thereby

    affecting the performance of the application that they try to measure.

    Whilst this probe effect impacts both physical machines and virtual machines,

    the overheads are 2.28 times higher on virtual machines than physical machines
    (§3.3). Moreover, virtualisation increases the standard deviation of the number of cycles

    required to fire a probe from 8 cycles to 869 cycles (§3.6.2). By having higher
    overheads, probing mechanisms on virtual machines ex-

    acerbate the probe effect. This makes it harder to identify the cause of poor

    performance of applications on virtual machines.

    Kamprobes is a technique for probing virtual machines that only uses unpriv-

    ileged instructions, such that the hypervisor is not involved in a probe firing and

    avoids other operations that are expensive in a virtual machine such as hold-

    ing locks. Kamprobes forgoes hypervisor fidelity by being designed to execute

    with maximum performance on a virtual machine. For instance, by only using

    non-privileged instructions, the design of Kamprobes forgoes hypervisor fidelity.

    There is only a modest difference between executing in a virtual machine and

    on a physical machine in the number of cycles (twelve cycles) and the variability

    (two cycles of standard deviation). Moreover, Kamprobes execute much faster

    than Kprobes (the current state-of-the-art in Linux kernel probing), with a Kam-

    probe taking 69 ± 16 cycles to execute, whereas a Kprobe takes 6980 ± 869 cycles

    20

    to execute (§3.6.2). Furthermore, whilst not an issue of virtualisation, when
    Kprobes determines which handler to execute it performs a lookup that scales

    with O(n) in the number of probes inserted. The technique that Kamprobes

    uses does not need to perform a lookup, and so runs in constant time (O(1)). Kam-

    probes can therefore be used in circumstances that require many probes, such

    as for a function boundary tracer, for which Kprobes is too slow.

    1.5 Shadow Kernels

    Whilst Kamprobes are a low-overhead technique for probing virtual machines, if

    they are used, even with empty probe handlers, on hot code paths the overhead

    of them repeatedly firing can significantly reduce performance. In principle this

    shouldn't be an issue because much of the time developers want to measure the

    performance of one particular process's interactions with the kernel in isolation.

    But there is no current way of setting kernel probes that only fire when one

    particular process executes.

    Shadow Kernels is a technique I developed by which specialisation, such as

    setting probes, can be applied to a kernel instruction stream on a fine-grained

    basis such that the specialisation applies to a subset of the processes or system

    calls executing on the system. Currently, specialising the operating system kernel

    makes changes to the kernel instruction stream that affect all processes executing

    on the system. This is because whenever the kernel instruction stream is modified

    the address space of every process is modified as each process maps the shared

    kernel into its own address space. The underlying issue is that modifications to

    the instruction stream of the kernel are a global operation, in that the shared

    instruction stream is executed by all processes. I therefore show that the effect

    of this is to reduce the performance of all processes executing on the system,

    regardless of whether their interactions with the kernel were the target of specialisation.

    Shadow Kernels requires co-operation of virtual machines with the hypervi-

    sor since the virtual machines execute hypercalls that cause the hypervisor to

    modify the physical-to-machine memory mappings such that the virtual mem-

    ory containing the kernel instruction stream maps to different machine-physical

    memory depending on the calling context.
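    As an illustration of the kind of remapping involved, a paravirtualised Linux
    guest can ask Xen to repoint the page-table entry behind a page of kernel text
    at a different machine frame. This is only a sketch, assuming that new_mfn
    already holds a specialised copy of the page; it is not the Shadow Kernels
    implementation, which Chapter 4 describes.

        #include <linux/mm.h>
        #include <asm/pgtable.h>
        #include <asm/xen/hypercall.h>   /* HYPERVISOR_update_va_mapping() */
        #include <xen/interface/xen.h>   /* UVMF_INVLPG */

        /* Map the page of kernel text at text_va onto machine frame new_mfn,
         * which is assumed to contain a modified copy of that page (for
         * example, one with probes inserted). */
        static int remap_kernel_text_page(unsigned long text_va,
                                          unsigned long new_mfn)
        {
                pte_t pte = __pte((new_mfn << PAGE_SHIFT) |
                                  pgprot_val(PAGE_KERNEL_EXEC));

                /* The hypercall makes Xen update the PTE backing this virtual
                 * address and flush the stale TLB entry for it. */
                return HYPERVISOR_update_va_mapping(text_va, pte, UVMF_INVLPG);
        }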

    Shadow Kernels is a technique that utilises the indirection of virtualised page

    21

  • tables such that multiple copies of the kernel instruction stream co-exist within

    a single domain. This allows processes that are not the target of instrumentation

    to execute their original kernel instruction stream, whilst applications whose

    interaction with the kernel is the target of specialisation execute a specialised

    instruction stream.

    Building Shadow Kernels without a hypervisor would be challenging: Oper-

    ating systems are designed with a memory layout such that the kernel resides at

    a fixed offset in physical memory. However, with Shadow Kernels there are mul-

    tiple copies of pages that include the kernel instruction stream, with the memory

    management unit changing which page virtual addresses resolve to. Therefore,

    there is no longer a fixed mapping between physical and virtual pages in the ker-

    nel instruction stream. Furthermore, the hypervisor-based approach makes it easy to

    port Shadow Kernels to other operating systems.

    1.6 Soroban

    A key issue with executing software in the cloud is that applications often exe-

    cute more slowly and sometimes with performance interference from other vir-

    tual machines [60]. For latency-sensitive applications, in particular, this virtuali-

    sation overhead prevents users from switching to virtual machines [142]. How-

    ever, current application monitoring systems are built with hypervisor fidelity, in

    that they report the same metrics if they execute on a physical machine or a vir-

    tual machine. As the performance of an application is affected by the hypervisor

    in a way that is hard to predict, it is currently difficult to measure how much of

    the latency of a program executing in the cloud is caused by the overheads of

    virtualisation and how much is due to other causes, such as a high load on the vir-

    tual machine. Soroban is a technique that forgoes hypervisor fidelity to measure

    how much of the latency of a request is due to the overheads of virtualisation.

    By forgoing hypervisor fidelity throughout the software stack, up to the appli-

    cation, Soroban reports the additional latency imposed on servicing individual

    requests in a request-response system. This allows developers to measure the

    additional overheads that their application experiences due to executing in a

    virtual machine, as opposed to executing on bare metal. By reporting the virtu-

    alisation overhead, developers can decide whether the additional overheads are

    22

  • worthwhile.

    Soroban uses a modified version of Xen that shares with each domain the

    activity performed on it by the scheduler, such as timestamps of when the virtual

    machine is scheduled in and out. Soroban then trains a Gaussian process on the

    relationship between these variables and the response time of a request-response

    system. The result of the learning phase is a model that, when given a feature

    vector of scheduling activity on a domain, reports the impact that these events

    have on the event response time.
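    For reference, the predictive equations of standard Gaussian-process regression,
    on which such a model rests, are shown below; given training feature vectors X
    with observed response times y, a covariance function k, and noise variance
    \sigma_n^2, the prediction at a new feature vector x_* is (this is textbook
    material rather than the exact formulation used in Chapter 5):

        \mu(x_*) = k(x_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} y

        \sigma^2(x_*) = k(x_*, x_*) - k(x_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} k(X, x_*)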

    I evaluate Soroban, showing that the technique can be applied to a web server

    to measure the increase in latency due to virtualisation in servicing requests. I

    demonstrate that as more virtual machines execute concurrently, Soroban re-

    ports an increase in the latency attributed to virtualisation, but when the web

    server executes requests slowly due to high load, Soroban does not increase its

    measure of virtualisation overhead.

    1.7 Scope of thesis

    In this dissertation, I primarily focus on Xen, executing paravirtualised GNU/Linux

    on x86-64 hardware. I now justify this choice.

    1.7.1 Xen hypervisor

    Xen is the hypervisor used by Amazon EC2 [139], which as of May 2015 is ten

    times larger than the combined size of all its competitors [85]. Given the clear

    dominance of Xen in the cloud, solutions to problems of measuring performance

    when virtualised with Xen have a high impact. However, the key contributions

    of my thesis can be ported to other hypervisors.

    1.7.2 GNU/Linux operating system

    As of December 2014, 75% of enterprises report using Linux as their primary

    cloud platform [52], with the market share of Linux virtual machines increasing.

    Most of the remainder is Windows virtual machines; however, the number of

    these is falling.

    23

  • 1.7.3 Paravirtualised guests

    Presently there are two main techniques for virtualising an operating system in

    the cloud. (i) Hardware extensions allow an unmodified operating system to ex-

    ecute on a hypervisor (HVM). This is a common way of virtualising proprietary

    operating systems, such as Microsoft Windows. (ii) Modifying the guest oper-

    ating system such that it is aware that it is executing in a virtualised environment

    and directly issues hypercalls rather than performing privileged instructions.

    The performance of paravirtualised guests is comparable with the perfor-

    mance of hardware virtual machines, with regular changes as to which one is

    the faster form of virtualisation.

    In this dissertation I use paravirtualised virtual machines as they have an ex-

    isting interface with the hypervisor, through which virtual machines can issue

    hypercalls. As this dissertation proposes forgoing hypervisor fidelity, in effect

    creating paravirtualised performance measurement techniques, it is more nat-

    ural to build these on paravirtual virtual machines. However, hardware virtual

    machines often have a paravirtualised interface through which drivers can oper-

    ate, so many of the ideas could be ported to hardware virtual machines.

    1.7.4 x86-64

    Whilst instruction sets other than x86-64 are virtualisable, Intel currently has

    a 98.5% market share in server processors (as measured by number of proces-

    sors) [77], with much of the remainder being taken by AMD x86-64 processors.

    As such, I do not consider other instruction sets.

    The contributions of Kamprobes in Chapter 3 are particularly tightly-coupled

    with the x86-64 instruction set. However, the fundamental idea of using unpriv-

    ileged instructions to build a probing system holds true across other instruction

    sets. Indeed, on a fixed-width instruction set, such as ARM, this technique is

    both easier to implement and can be used on more opcodes than on x86-64.

    Both Shadow Kernels and Soroban are less reliant on any particular instruc-

    tion set.

    24

  • 1.8 Overview

    In summary, the key contributions of this dissertation are:

    Kamprobes. Current probing techniques are built to execute on a physical ma-

    chine and as such rely on interrupts to obtain an execution context. How-

    ever, on a virtual machine interrupts are a privileged instruction, so are

    expensive. Kamprobes is a low-overhead probing technique for x86-64

    virtual machines that executes with near-native performance in a virtual

    machine.

    Shadow Kernels. By forgoing hypervisor fidelity, virtual machines can remap

    their text section, allowing them to specialise shared text re-

    gions, in particular the kernel. Whilst I focus on the use case of scoping

    kernel probes, the technique can be applied to other types of kernel text

    specialisation, such as profile-guided optimisation.

    Soroban. A key concern that prevents the uptake of virtualisation is the impact

    of the virtualisation overhead. I show that by building software that ac-

    knowledges the presence of the hypervisor in its own monitoring, it is pos-

    sible to measure the virtualisation overhead of fine-grained activities, such

    as serving an HTTP request.

    The remainder of this dissertation is structured as follows. I explore the back-

    ground for my thesis in Chapter 2, arguing that the requirement of hypervisor

    fidelity for performance measurement techniques is a relic of classical hypervisor

    use cases and can be forgone for contemporary operating systems. In Chapter 3

    I introduce Kamprobes, a probing technique for virtualised x86-64 operating

    systems. In Chapter 4 I propose Shadow Kernels as a solution for specialisation,

    such as scoping the firing of probes. In Chapter 5 I present Soroban, a technique

    for using machine learning to report, for each request-response, the additional

    latency added by executing on the hypervisor.

    25

  • 26

  • CHAPTER 2

    BACKGROUND

    In their 1974 paper Popek and Goldberg state the classical definition of a hyper-

    visor as having three properties: Fidelity, performance and safety [115].

    Fidelity. Fidelity represents the concept that a hypervisor should portray an accu-

    rate representation of the underlying hardware, such that software can exe-

    cute on the hypervisor without requiring modification, or being aware that

    it executes in a virtualised environment. As such, the results of software

    executing in a virtualised environment must be identical to those obtained

    when executing on physical hardware, barring any effects of different tim-

    ing whilst executing on virtualised hardware.

    Performance. The performance of a virtual machine must not be substantially

    slower than when executing on physical hardware. In particular, most in-

    structions that execute must run unmodified, without trap-and-emulation

    techniques (trap-and-emulation is the only virtualisation technique that

    Popek and Goldberg consider).

    Safety. Virtual machines must act independently, without the ability to interfere

    with other domains executing on the system. Particularly, virtual machines

    should not have direct access to shared hardware, with which they can

    modify the state of another virtual machine in a way that would not be

    expected of that machine executing on physical hardware.

    In this dissertation I propose performance-analysis techniques that are de-

    signed to complement virtualisation, by either using code that virtualises well or

    by using techniques that interact with the hypervisor. As such, this work breaks

    the traditional definition of a hypervisor in that it no longer offers fidelity. In this

    chapter, I consider related work to argue that the difficulty of measuring the per-

    formance of virtual machines is exacerbated by the requirement of fidelity and

    27

  • that this should be relaxed given the changing uses of hypervisors. Throughout

    the rest of this dissertation I use this argument to justify building

    performance-analysis techniques that are tightly-coupled with the hypervisor.

    2.1 Historical justification for hypervisor fidelity

    In this section I consider the historical justification for hypervisors, especially for

    hypervisor fidelity. I later show that the use cases of hypervisors have changed

    and as such we should reconsider the hypervisor's original design principles.

    The concept of hypervisor fidelity, whilst formalised in 1974 [115], dates

    back to the start of research into virtual machines by IBM. IBM built early hy-

    pervisors that allowed multiple users to concurrently execute on a rare and ex-

    pensive mainframe with the illusion of being the only user of the machine. That

    is, each user had the illusion of being the sole user of the machine's hardware,

    with their operating system being the only one executing. The key issues that

    early hypervisors attempt to fix are that OS/360 uses the now-common [75]

    architecture of a machine executing a single kernel that is shared with every

    process executing on the system: (i) Different users are unable to execute dif-

    ferent operating system versions. Due to the lack of availability of mainframes,

    users were unable to obtain another machine to execute their own operating

    system version. (ii) Users cannot develop new operating system features in isola-

    tion from other users. For instance, if a developer were to extend the operating

    system, but their code contains a bug, with OS/360 it is not possible to prevent

    this from affecting concurrent users. As traditional abstractions are lower-level

    than contemporary abstractions, it was commonplace for developers to regularly

    need to modify or extend their operating system.

    CP-40 is considered to be the first hypervisor, being released in 1967 and able

    to concurrently execute fourteen virtual machines. As the complexity of hard-

    ware increased through the 1970s the use of hypervisors became more practical

    and featured in the development of OS/360 and OS/370 [62, 127]. Behind all

    IBM work is the control program (CP), which allows concurrent execution of

    operating systems, each of which has the illusion of executing on physical hard-

    ware [56]. The original versions of CP allow an unmodified operating system to

    execute in a virtualised environment in which CP configures the hardware such

    28

    that whenever a virtual machine executes a privileged instruction the hardware

    induces a trap, which CP catches, decodes and emulates in a safe way. There

    were other early hypervisors, such as the FIGARO system, which was part of the

    Cambridge Multiple-Access System and had similar design goals [147]. As such,

    these early hypervisors do provide fidelity, in that the software that executes on

    them has the same side effects, ignoring timing effects, on both physical and

    virtual hardware.

    2.2 Contemporary uses for virtualisation

    Having shown the historical justification for hypervisor fidelity, I now argue that

    the use case for virtualisation differs from that of the 1960s and 1970s. As such,

    it is time to reassess the requirement of virtual machine fidelity, in particular to

    aid in helping developers measure the performance of their virtual machines.

    Rather than building performance tools that explain a subset of what can be

    viewed on a physical machine, due to limited access to performance counters,

    we should forgo hypervisor fidelity by building performance analysis techniques

    that are designed to execute on a virtual machine.

    Compared with when hypervisors were pioneered, hardware is now cheaper

    and more readily available; as such, the original requirements for virtualisation

    no longer hold: (i) In contemporary computing users have access to many ma-

    chines, as such they are usually able to execute an operating system of choice

    on a different computer. (ii) The influx of additional hardware also means that

    development of operating system features can be performed on dedicated devel-

    opment hardware. Indeed, executing production services on the same hardware

    that is used for operating system development, even when a hypervisor is used,

    would be unconventional in the current era. In comparison to when virtualisa-

    tion was pioneered, it is standard practice to have fleets of physical machines just

    testing changes to operating system source code. Moreover, higher-level abstrac-

    tions reduce the requirement of most development work to involve modifying

    the operating system.

    In the last ten years virtualisation has underpinned the move to cloud com-

    puting, which in turn has revolutionised computing [6]. A lower-bound indica-

    tor of the growth of cloud computing is that Amazon AWS alone has increased

    29

  • from nine million to twenty eight million EC2 public IP addresses in the past two

    years [143]. The key benefit of the hypervisor in these cloud computing environ-

    ments is allowing operators to provide virtual machines to their customers, so

    that multiple customers can share the same physical server without interference.

    In particular, hypervisors give a number of advantages to cloud providers:

    Higher machine utilisation. By co-hosting virtual machines on a physical server

    the utilisation of the physical server increases when compared with execut-

    ing each service on a dedicated physical machine. Whilst higher utilisation

    was a key factor in the early work on hypervisors, this was because the

    mainframes that they executed on were scarce and highly-contested. How-

    ever for cloud providers, servers are readily-available, but higher utilisa-

    tion decreases power consumption, cooling, maintenance and real-estate

    expenditure. In order to increase utilisation, hypervisors now offer fea-

    tures such as memory overcommitting through ballooning [144] and pre-

    allocation [94]. Although such higher utilisation has remained a benefit of

    using a hypervisor, the reasons for desiring higher utilisation have changed,

    as such the role of the hypervisor has changed. The downside to higher

    utilisation is that it risks starving virtual machines of resources, thereby

    reducing their performance. Operating system starvation is not a problem

    that exists when executing on bare metal; therefore, tools that do not forgo

    hypervisor fidelity cannot report this effect.

    Creating virtual machines is fast and cheap. Users can spawn a new, booted vir-

    tual machine in less than one second [84]. This is not possible without

    a hypervisor, since fast boot up is achieved by forking an already-booted

    virtual machine, such that the two have the same state. With physical

    machines, the closest alternatives are techniques such as PXE that aid in

    reducing the time between connecting a server and it being fully-booted.

    However, for most use cases the main time cost in running a new physical

    server is actually in finding server hosting and obtaining a physical server.

    With hypervisors, there is no need for most users to purchase physical host-

    ing and servers, as they can simply pay for a virtual machine from their

    cloud provider. Moreover, the economics of cloud computing often make

    it cheaper to execute in the cloud than building a data centre [136]. This

    clearly differs from the original use case of a hypervisor in which being

    30

  • able to rapidly spawn a new machine was not a desired feature.

    Scalability to near-infinite computing resource on demand. Usage patterns of Internet-

    connected applications are highly-variable [119]. In order to respond to

    spikes in demand they need to be elastic, in that they need to execute using

    more machines during spikes to maintain a quality of service. Hypervisors

    allow scaling up to 3 000 virtual machines in a

    32 host pool [61]. In cloud computing environments, where hypervisor

    pools are less common, the number of virtual machines

    that can execute is bounded by economic factors. As virtual machines are

    fast to spawn, users can build more scalable software that responds to

    changes in demand by creating more virtual machines. Such requirements

    were never present in the early forms of virtualisation, as they operated

    before the creation of the Internet, so contemporary issues such as the

    slashdot effect and viral trends did not exist. Furthermore, the original

    workloads that execute on a hypervisor were non-interactive batch jobs,

    therefore they had different performance requirements to contemporary

    clouds, where request-latency is a key metric.

    Live migration of virtual machines. Modern hypervisors can transparently mi-

    grate virtual machines between physical hosts [124] without downtime [34]

    and similarly migrate and load-balance [59] storage between repositories

    without downtime [94]. This allows system administrators to perform

    maintenance on physical machines without disrupting a service executing

    on the virtual machines, since they can first migrate the instance onto an-

    other host. As organisations rarely had more than one mainframe when

    hypervisors were initially designed, this was not a use case of the pioneer-

    ing work. The downside of live migration is that if the virtual machine

    is migrated onto a highly-loaded or less powerful host then it may exe-

    cute more slowly. However, this decrease in performance is caused by the cloud

    provider, so is hard to detect with existing techniques.

    High isolation compared with other virtualisation techniques. Kernel security vul-

    nerabilities only affect the domain in which the vulnerability is used. Be-

    tween 2011 and 2013 there were 147 such exploits for Linux [2]. Compared

    with other virtualisation techniques that share the same kernel, hypervi-

    31

  • sor exploits are more rare, with Xen having had just one privilege escala-

    tion vulnerability from paravirtualised guests [140]. Since the invention of

    hypervisors this requirement has increased: Attack vectors are now more

    readily exploited and there are more commercial requirements for isolation

    of services.

    Backup and restore. There are advantages to providing backup and restore from

    outside of a domain [152], since it is fast [37] and does not require operat-

    ing system co-operation to access locked files and cannot be disabled by

    malicious software. Backup and restore was not a concern for hypervisor

    design in the 1960s.

    Accountability. Accountable virtual machines allow users to audit the software

    executing on remote hosts by having the software execute on top of a

    hypervisor that performs tamper-evident logging [64]. Using virtualisation

    for accountability is a new use-case for hypervisors that they were not

    originally designed for.

    Emulating legacy software. Windows 7 and later versions contain a hypervisor

    to execute Windows XP. When the Windows instance is a virtual machine

    the emulator then executes using nested virtualisation [66]. Whilst nested

    virtual machines were considered in early work [147], this was mainly a

    point of academic enlightenment.

    Emulating advances in time. As hypervisors emulate wall-clock time to their guests,

    they can be used to discover how software will behave at a future point in

    time [35] or when executing under future, faster hardware [109]. Emulat-

    ing changes in time was not an original design goal of hypervisors.

    I have described a number of ways in which hypervisors are used as part of

    mainstream cloud-computing environments. In particular, I have shown how

    the use cases for the hypervisor in 2015 differ from those in the 1960s and

    1970s when the classical definition of the hypervisor was developed. Due to

    this change in use case, it is reasonable to argue that strict adherence to an out-

    dated definition of the hypervisor should be challenged. One of the recurring

themes is the shift from serving a batch-processing workload to serving a request-response system in which users need high scalability and low latency in serving requests. Concurrently, virtual machines now execute

    in a less predictable environment, with untrusted parties, malicious actors and

automated scheduling all affecting the performance of virtual machines in ways that early virtual machines did not experience. As such, the importance of measuring performance has increased, to the point that fidelity now has lower utility than the ability to measure the performance of virtual machines.

    2.3 Virtualisation performance problems

Despite its popularity, a particular problem with virtualisation is that virtual machines are slower and have more variable performance than physical machines, yet it is difficult to measure the performance of a virtual machine.

    As well as contention for shared resources [117] there are other sources of

    slow performance, which I now explore.

    2.3.1 Privileged instructions

Under virtualisation certain instructions become more expensive, such as those that trigger a vmexit, which increases their cost by a factor of between five and twenty-five [122]. Also, as AMD64 effectively provides only two usable protection levels, a paravirtualised guest's user space and kernel space execute in the same ring and the hypervisor has to mediate every system call. This makes system calls more expensive in virtual

    machines than on physical machines, although by how much varies depending

    on hardware [31].
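The cost of this mediation can be observed directly. The following is a minimal userspace sketch of such a measurement (my own illustration, not taken from the cited work); running it on a physical machine and inside a paravirtualised guest indicates the difference, although the absolute numbers depend on the hardware and hypervisor, as noted above.

/* Sketch: measure the mean latency of a cheap system call.
 * syscall(SYS_getpid) is used so that the C library cannot satisfy the
 * call from a cached value in user space. Compile with: cc -O2 sys_lat.c */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iters = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);              /* forces a kernel entry each time */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("mean system-call latency: %.1f ns\n", ns / iters);
    return 0;
}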

    2.3.2 I/O

    I/O on virtual machines involves a longer data path than on physical machines

    since the hypervisor has to map blocks from the virtual disks exposed to its

guests to physical blocks on storage that is often remote. I/O operations are a regular source of slow performance [57, 26, 100], being around 20% slower than on physical machines, depending on configuration. Furthermore, the hypervisor's batching of I/O re-

    quests can lead to extreme arrival patterns [22].

2.3.3 Networking

    Networking in virtual machines can be unpredictable [98]: When executing on

    a CPU-contended host compared with a CPU-uncontended host, throughput

    can decrease by up to 87% and round trip time can increase from 10 ms to

67 ms [129]. On Xen, two causes of this are the back end of the split-driver being starved of CPU because the driver domain is not scheduled, and the front end of the split-driver being starved because the scheduler in the virtual machine does not schedule the driver during its scheduling quantum.

The effect of poor networking performance is that end-users observe significant reductions in quality of service, in both throughput and delay [26].

    2.3.4 Increased contention

    When executing as a virtual machine there is higher contention, caused by two

    sources: Other virtual machines being scheduled and the hypervisor/domain zero

executing. The hypervisor increases contention because each switch into the hypervisor requires a vmexit, which saves the state of the virtual machine and restores the state of the next domain [3]. Other virtual machines also cause performance interference,

    especially for micro virtual machines, which execute on physical hosts with low

    priority to use the spare CPU cycles left by other virtual machines. Such mi-

cro virtual machines are serviced poorly, and to get maximum performance for the instance type, virtual machines need to inject delays to be scheduled

    favourably [146].

    2.3.5 Locking

    Locking has long been known to be problematic on virtual machines. When

designing operating systems, programmers often protect data structures with mutexes and assume that each mutex is held only briefly, since holding a mutex on a shared data structure for a long time is expensive [114].

    However, when executing in a virtual machine there is the possibility of a vCPU

    being preempted whilst it holds a mutex, preventing other threads from making

progress [40]. Another problem is lock scalability: unless locks are modified to perform better under a hypervisor, they scale poorly with the number of vCPUs [76].
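To make the lock-holder preemption problem concrete, the following is a minimal ticket-lock sketch in C11 (an illustration of the general problem, not Linux's or Xen's actual implementation). If the hypervisor deschedules the vCPU of the thread inside the critical section, every waiter spins uselessly; paravirtualised spinlocks mitigate this by yielding to the hypervisor after a bounded number of spins.

/* Ticket-lock sketch (C11 atomics); illustrates lock-holder preemption. */
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;   /* ticket given to the next arriving thread          */
    atomic_uint owner;  /* ticket currently admitted to the critical section */
};

static void ticket_lock(struct ticket_lock *l)
{
    unsigned ticket = atomic_fetch_add(&l->next, 1);

    /* If the vCPU running the current owner is descheduled by the
     * hypervisor, every waiter burns its entire scheduling quantum in this
     * loop: no thread in the guest can make progress until the owner's
     * vCPU runs again. */
    while (atomic_load(&l->owner) != ticket)
        ;                               /* spin */
}

static void ticket_unlock(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1);     /* admit the next ticket holder */
}

int main(void)
{
    struct ticket_lock l = {0, 0};

    ticket_lock(&l);
    /* ... critical section ... */
    ticket_unlock(&l);
    return 0;
}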

2.3.6 Unpredictable timing

    When executing inside a virtual machine, time becomes unpredictable as virtu-

    alised time sources are unreliable and behave poorly under live migration [19].

Also, operations that one expects to take constant time can take an unpredictable amount of time. For instance, techniques such as kernel same-page merging can help reduce the memory overhead of executing in virtual machines

    by sharing identical pages between virtual machines [101]. However, when a vir-

    tual machine modifies a shared page the hypervisor traps and creates a copy of

    the page specifically for that virtual machine to modify. This makes page access

    times unpredictable from within the virtual machine [135].
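The variability introduced by copy-on-write sharing can be observed with a small experiment. The sketch below is my own illustration: it assumes a Linux system with KSM enabled, marks a buffer of identical pages as mergeable, and then times the first write to each page; writes that have to break sharing show up as slow outliers. The mechanism is analogous whether the sharing is performed by the kernel or by the hypervisor.

/* Sketch: detect variable page-write latency caused by page sharing.
 * Assumes Linux with KSM enabled (CONFIG_KSM and ksmd running). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

#define NPAGES 4096
#define PAGE   4096

static long ns_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    size_t len = (size_t)NPAGES * PAGE;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0xAA, len);             /* identical pages: merge candidates  */
    madvise(buf, len, MADV_MERGEABLE);  /* opt the region in to KSM           */
    sleep(60);                          /* give ksmd time to merge the pages  */

    long worst = 0, total = 0;
    for (int i = 0; i < NPAGES; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        buf[(size_t)i * PAGE] = 0x55;   /* first write may break copy-on-write */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = ns_between(t0, t1);
        total += ns;
        if (ns > worst)
            worst = ns;
    }
    printf("mean %ld ns, worst %ld ns for the first write to each page\n",
           total / NPAGES, worst);
    munmap(buf, len);
    return 0;
}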

    2.3.7 Summary

    Despite many advances, virtual machines remain slower and less predictable

    than physical machines. As it is unlikely that these issues will be completely

    removed, it is important that users of virtual machines are able to measure the

    performance of their virtual machine.

    2.4 The changing state of hypervisor fidelity

    Given the performance overhead of executing in a virtualised environment and

    the difficulty in measuring this performance in a virtual machine, I propose that

    virtual machines should forgo hypervisor fidelity for performance measurement

    techniques. Rather than treating the hypervisor as a physical machine for every-

    thing except the lowest layers of the kernel, performance measurement tools

    should be designed to execute well in a virtual environment and should co-

    operate with the hypervisor to maximise visibility of performance. Whilst this

    does involve changing the accepted use of the interface between virtual machines

    and hypervisors, I now show that changes to this interface have previously been

    used to ameliorate performance problems in the virtualisation domain.

    2.4.1 Historical changes to hypervisor fidelity

Even in the earliest work on hypervisors, there was acceptance that pure virtualisation may not be practical. A concern with early versions of CP was that it performed

slowly, which is largely attributable to using trap-and-emulate to prevent virtual

    machines from executing privileged instructions and causing them to execute an

    emulated version. To address this, the evolution into OS/370 introduces the idea

    of a hypercall [150], in which the virtualised operating system sets up some state

    to communicate with CP and then uses the DIAGNOSE instruction to transition

    context into CP [36]. By introducing the concept of a hypercall, IBM acknowl-

edged that building operating systems that adhere strictly to the definition of fidelity is not necessary. Rather, in cases where full emulation of physical hard-

    ware has a high cost, it is better to forgo fidelity by making the virtual machine

aware that it is executing on a hypervisor so that it executes a hypercall rather than

    perform the expensive operation.

    I argue that we have the same issue today, whereby current techniques for

    measuring the performance of a virtual machine execute the same code on vir-

    tual machines as they do on physical machines. Therefore, performance measure-

    ment tools have lower utility on virtual machines than physical machines as they

    use code that virtualises poorly and cannot report the cost incurred due to virtual-

    isation. As such, we should reconsider whether applying the technique employed

by IBM in 1973 to solve the problem of the day (namely poor performance)

    can solve the contemporary issue of it being difficult to measure the performance

    of virtual machines. In particular, we should consider using paravirtualised per-

    formance measurement techniques.

The invention of the hypercall created a debate that continued throughout

    the 1970s [56] regarding pure vs impure virtual machines, in which a pure vir-

    tual machine is a guest that runs unmodified code, whereas an impure virtual

machine runs modified code. In particular, there was consideration of the position of

    the hypervisor interface, since the hypervisor can either simulate high-level ac-

    tions, such as reading a line, or can simulate the individual instructions involved

in performing the high-level action [16].

    2.4.2 Recent changes to hypervisor fidelity

    With the popularisation of (early versions of) x86, virtualisation became harder

    as the instruction set does not provide trap-and-emulate ability for privileged in-

structions such as SIDT, SGDT and SLDT [121]. Therefore, to virtualise traditional

    x86, one has to use binary translation, the process by which the instruction

stream is scanned and privileged instructions are rewritten with function calls to

    emulating functions. Performing full binary translation is a slow process [78], so

early x86-64 hypervisors were either slow or insecure [121]. Those that are slow

    fail the hypervisor definition as they do not provide the performance property.

Furthermore, as Popek and Goldberg's hypervisor definition is tightly coupled with trap-and-emulate techniques in its formalisation of fidelity, such that virtual machines cannot execute a modified instruction stream, binary rewriting is not considered classical virtualisation [1].

To resolve the issues of virtualisation on traditional x86, Barham et al. built Xen, a hypervisor that uses paravirtualisation to emulate x86 with performance, strong isolation and unreduced functionality [12]. In using paravirtualisation,

    Xen requires that operating systems be modified to issue hypercalls rather than to

execute with true fidelity when issuing privileged instructions. One contribution of Xen was to paravirtualise the memory management unit, in that guests' page

    tables are mapped read-only and the guest has to issue a hypercall to update

    them. This design allows virtual machines to directly map virtual addresses to

    the addresses of the memory on the physical server (machine physical frames),

    rather than have shadow page tables that give the illusion of executing in an

independent address space. In forgoing hypervisor fidelity to overcome the shortcomings of x86, Xen is much like my proposal of forgoing fidelity to overcome the shortcomings of performance measurement of virtual machines.
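To illustrate the shape of this paravirtualised interface, the sketch below shows how a Xen paravirtualised Linux guest might ask the hypervisor to change a page-table entry. It assumes the in-kernel hypercall wrappers from asm/xen/hypercall.h and Xen's public mmu_update interface; pv_set_pte is an illustrative helper of my own, not a function in Linux or Xen, and the real code path batches many updates rather than issuing one hypercall per entry.

/* Sketch: updating a PTE via a hypercall in a Xen paravirtualised guest.
 * Assumes a Linux kernel built with Xen PV support; simplified from the
 * real code path, which batches many updates into one hypercall. */
#include <linux/types.h>
#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>

/* Install new_pte at the page-table entry whose machine address is
 * pte_machine_addr. Returns 0 on success or a negative error code. */
static int pv_set_pte(uint64_t pte_machine_addr, uint64_t new_pte)
{
    struct mmu_update req = {
        /* The low bits of ptr select the command; MMU_NORMAL_PT_UPDATE
         * means "ptr is the machine address of a PTE to overwrite". */
        .ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE,
        .val = new_pte,
    };
    int done = 0;

    /* The guest cannot write the PTE directly (its page tables are mapped
     * read-only), so it asks the hypervisor to perform the update. */
    return HYPERVISOR_mmu_update(&req, 1, &done, DOMID_SELF);
}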

More recent advances in the x86-64 instruction set undeniably restore a degree

    of fidelity to the hypervisor by allowing unmodified virtual machines to execute

    in a hardware virtual machine (HVM) container [141]. HVM containers extend

the x86-64 architecture so as to provide a privileged mode (sometimes described as negative rings) [123] in which the hypervisor executes, and into which the processor transitions when a guest issues privileged instructions. Whilst this increase in fidelity does create some

    advantages, for instance operating systems can migrate between executing as a

    physical and virtual instance [83], I nevertheless still argue that this increase in

fidelity only came when hardware had advanced sufficiently (for instance with Intel VT-x) such that fast and secure x86 virtualisation was no longer problematic. Should future hardware allow virtual machines to measure their performance to the same degree as physical machines, then restoring fidelity to measuring the perfor-

    mance of virtual machines may be reasonable. There is already limited evidence

of hardware advances increasing the ability of a virtual machine to measure its

    performance [104].

    2.4.3 Current state of hypervisor fidelity

    Despite the increase in hardware virtualisation, I still argue that it is common-

    place for the software stacks that execute on the hypervisor to not exhibit strict

    fidelity. This is principally due to the process of re-hosting an application on in-

    frastructure as a service, during which developers are encouraged to make use of

    properties of the cloud, such as the scalability of virtual machines [103]. How-

ever, within virtual machines there are differences in the software stack when

    compared with physical machines. As such, forgoing hypervisor fidelity in per-

    formance measurement techniques is not a radical move.

    2.4.3.1 Installing guest additions

    All high-performance hypervisors that use hardware virtualisation techniques

    still provide extensions to improve the performance of their guests: XenServer

Guest Tools, VirtualBox Guest Additions, and VMware Tools are some exam-

    ples. These typically provide drivers that allow the guest operating system to

    communicate directly with the hypervisor so that full emulation of devices is not

    required. However, installing such extensions reduces the fidelity of the virtual

    machine, since by using different drivers, the virtual machine executes differently

    on physical and virtual hardware.

    2.4.3.2 Moving services into dedicated domains

    There is a growing trend to use virtual machine introspection to provide ser-

vices that would traditionally have been provided by processes or the operating system [24]. For example, Bitdefender performs malware detection from outside the guest, in a separate, privileged domain, which prevents malware from attacking the mal-

    ware detection program, as it is commonplace for viruses to attack antivirus

    mechanisms [88]. Furthermore, most commercial hypervisors now support vir-

    tual machine snapshotting, a feature typically performed by the filesystem. There

    are also proposals to move monitoring into a separate domain [82]. Given the

    trend of separating services out such that they execute outside of the original do-

main, I argue that hardware virtualisation does not achieve full fidelity, since if

    those operating systems were to execute on physical hardware, they would need

    reconfiguring such that they execute processes to perform all of these features.

    2.4.3.3 Lack of transparency of HVM containers

    Even when executing inside a hardware virtual machine container, which is sup-

    posed to provide fidelity, the interface with the hypervisor still differs from that

    provided by exclusive use of hardware. One demonstration of this difference in

    interface is malware that detects the presence of a hypervisor through irregular-

    ities in the availability of resources, such as CPU cycles, caches, and the TLB,

    and refuses to execute a payload [149]. Furthermore, the timing properties of

a virtual machine differ from those of a physical machine: there is virtualisation overhead, the time required to access hardware emulated by the hypervisor changes, hidden page faults are caused by accesses to hypervisor-protected pages, and virtualised instructions (such as cpuid) have different timing from non-virtualised instructions (such as nop) [54]. Given that the interface to the hypervisor is

    leaky, I argue that we should acknowledge this difference throughout the soft-

    ware stack, rather than maintaining fidelity.
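The cpuid/nop timing difference mentioned above is simple to demonstrate. The following userspace sketch is an illustration of the general technique, not the cited malware: it compares the cost of cpuid, which causes a VM exit under hardware virtualisation, with that of a nop, which never leaves the guest; a large ratio suggests that a hypervisor is present.

/* Sketch: time cpuid against nop with the timestamp counter (x86-64,
 * GCC/Clang). cpuid traps to the hypervisor in an HVM guest; nop does not. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() */

static inline void do_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
}

int main(void)
{
    const uint64_t iters = 100000;
    uint64_t t0, t1, t2;

    t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        do_cpuid();                    /* causes a VM exit under an HVM hypervisor */
    t1 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile("nop");       /* handled entirely within the guest */
    t2 = __rdtsc();

    printf("cpuid: %llu cycles/iteration, nop: %llu cycles/iteration\n",
           (unsigned long long)((t1 - t0) / iters),
           (unsigned long long)((t2 - t1) / iters));
    return 0;
}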

    2.4.3.4 Hypervisor/operating system semantic gap

    The performance of a virtual machine can be improved if the hypervisor is better

able to predict the virtual machine's actions. There are two main techniques for improving the prediction rates: monitoring the virtual machine with knowledge of its data structures so as to improve decisions and policies, which can increase the cache hit ratio of a virtual machine by up to 28% [73]; or moving

    functionality into the hypervisor from the guest [87]. The latter reduces fidelity

    and the former requires co-operation, therefore we observe deviation from the

    standard definition of a hypervisor.

    2.4.4 Summary

    I have now shown that since the advent of the hypervisor, forgoing hypervisor

fidelity has been a common way of solving problems in the realm of virtu-

    alisation. Even today, with hardware virtual machines, virtual machines do not

strictly provide fidelity. My demonstration that forgoing hypervisor fidelity has

successfully been used to solve past problems with virtualisation supports my

    thesis that the use of the interface should change so as to improve the utility of

    performance measurement tools.

2.5 Rethinking operating system design for hypervisors

There is considerable research literature that reconsiders, from the ground up, the role of the operating system when executing in the cloud, often forgoing fidelity to increase utility.

Library operating systems, such as OSv, recognise that in a typical cloud software stack there is a hypervisor, an operating system and a language runtime [80].

    Each of these performs abstraction and protection, at the cost of an increased

    footprint and performance overhead, such as a 22% impact on the throughput

    of lighttpd. Library operating systems replace everything that executes above

    the hypervisor with a single binary, so that the hypervisor performs abstraction

    and protection [80]. Similarly, Mirage is designed to execute on a hypervisor

    only, making use of the small hypervisor interface [91], thereby improving on

    Linux in terms of boot time, I/O throughput and memory footprint.

    SR-IOV increases fidelity by letting operating systems directly interact with

    the network interface card, with the hardware ensuring isolation [41]. However,

such hardware support can be used in unconventional ways: Dune is a hypervisor-like project that uses hardware virtualisation features to allow userspace direct access to safe

    hardware features, such as ring protection, page tables and the TLB [14]. Belay

    et al. achieve this by using hardware extensions built for virtualisation but have

    their lowest layer of software still expose an abstraction of a process, rather than

    hardware. Furthermore, Arrakis [113] and IX [15] use SR-IOV to separate the

control and data planes so as to increase the networking throughput of commodity

    hardware.

    The work that I present in this dissertation focuses on applying performance

    measurement techniques to mainstream operating systems in the cloud. As re-

    search operating systems are not yet mainstream I do not explicitly show the

    benefits that they would receive. However, the key techniques in all three of

my contributions could be applied to such operating systems, without causing

divergent behaviour between virtual and physical machines.

    2.6 Virtual machine performance measurement

    Having argued that the requirement for hypervisors to exhibit fidelity is overly-

    restrictive and that forgoing hypervisor fidelity has been previously used to solve

    problems in the virtualisation domain, I now explore work related to virtual

    machine performance.

    2.6.1 Kernel probing

    Probing has a rich history that goes back to the dawn of computers. The first

use of probing is believed to have been by Maurice Wilkes, who inserted sub-

    routines into code executing on the EDSAC. These sub-routines would print dis-

    tinctive symbols at intervals throughout a program so that the operator could

diagnose an error [55]. Later computers, starting with the UNIVAC M-460,

    included programs such as DEBUG that let operators specify addresses to insert

    additional code that could be used for debugging [47].

    Contemporary operating systems have a probing system to allow users to de-

    bug their software and measure its performance. Linux uses Kprobes [107] and

    Microsoft Windows uses Detours [69]. NetBSD [106], FreeBSD [96], and OS

X all use DTrace, which embeds a probing system within a wider instrumentation

    system. There has been further work on these systems to optimise them [68]

    as the benefits of fast probing have long been known [79]. However, with the

    exception of Windows Detours, these all use interrupt-based probing techniques.
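As an illustration of the interrupt-based style, the following is a minimal Linux kernel module using the Kprobes API; the probed symbol (do_sys_open here) is only an example and varies between kernel versions. Kprobes typically plants a breakpoint at the probed instruction, so firing a probe takes a trap, which is precisely the kind of operation that virtualises poorly.

/* Minimal Kprobes module sketch. The probed symbol name is an example
 * only; the right symbol depends on the kernel version. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>

static int pre_handler(struct kprobe *p, struct pt_regs *regs)
{
    /* Runs in the breakpoint (trap) handler each time the probe fires. */
    pr_info("probe fired at %s\n", p->symbol_name);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "do_sys_open",   /* example target, kernel-version dependent */
    .pre_handler = pre_handler,
};

static int __init probe_init(void)
{
    return register_kprobe(&kp);    /* plants a breakpoint at the target */
}

static void __exit probe_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");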

    Previous work has shown another technique for probing, based on jumps,

    which are often faster than executing interrupts [137, 138]. Windows Detours

    was the first of these jump-based probing systems that preserves the semantics

    of the target function as a callable subroutine [69]. However, whilst there is

    some benefit from using jump-based techniques on physical machines, I show

    that their utility when applied to virtual machines is much higher. This is due to

    interrupt-based techniques virtualising poorly.

    There has been some consideration of changing the nature of operating sys-

    tem probing in the virtualised environment by disaggregating probe handlers

into a separate domain [118]. However, this has not received popular uptake.

    2.6.2 Kernel specialisation

Kernel specialisation is not a new concept: Early work on the Synthesis kernel pioneered kernel specialisation by generating efficient kernel code that acts as fast-

    paths for applications [116]. The advantages of kernel specialisation are well

    known [23, 17]: Profile-guided optimisation of Linux improves the kernel per-

    formance by up to 10% [151] and exokernels [45] remove kernel abstractions

    so that applications interact with hardware through fewer layers of indirection,

    thereby reducing kernel overheads. For instance, Xok is an operating system

    with an exokernel whereby a specialised web server has over four times the

    throughput of a non-specialised web server [74]. Indeed, the benefits of spe-

    cialisation are a key feature of Barrelfish, an opearting system redesign to allow

    kernel specialisation such that cores run different kernels [125] and Dune for al-

    lowing applications access to privileged CPU features [14]. Another possible op-

    erating system redesign to allow kernel specialisation is using microkernels, since

    only a small set of features are then executed by an operating system mapped

    into every process, rather user space services can provide competing specialised

    implementations of features [86].

    In Chapter 4 I introduce Shadow Kernels, a technique that allows per-process

kernel specialisation by having applications acknowledge the presence of the

    hypervisor and execute code that causes the hypervisor to switch the underlying

memory of the domain's kernel. The key benefit of Shadow Kernels is to allow

multiple kernel instruction streams to execute on a single machine. Techniques for executing multiple kernels already exist; however, they all differ from Shadow Kernels. Executing processes inside virtual machines allows

    multiple kernels to execute on a single machine [36]. However, each kernel

    will still typically support multiple processes executing on it, whereas Shadow

    Kernels can target individual processes.

    The technique used in Shadow Kernels of modifying kernel instruction streams

    is well-established. For instance, KSplice modifies the kernel instruction stream

    to binary patch security updates into a kernel without rebooting the machine [7],

    but this is a global change that affects all processes, whereas Shadow Kernels

    can restrict that patch to an individual process. Furthermore, malware can use

memory management tricks to hide itself from detection by unmapping memory

    containing the rootkit [133]. Shadow Kernels differs in that rather than hid-

    ing malware it allows multiple kernel instruction streams to coexist. Similarly,

    Mondrix uses changes to the MMU to provide isolation between Linux kernel

    modules [148], albeit with a performance overhead of up to 15%.

    2.6.3 Performance interference

    A key concern with executing virtual machines in the cloud is performance in-

terference, whereby two or more virtual machines compete for resources. Hyper-

    visors are designed to have strong performance isolation guarantees, by having

    coarse-grained scheduling and no sharing of data structures between virtualisa-

tion domains [12]. In particular, many services in the cloud, as well as in other circumstances [48], are latency-sensitive in that they require low and predictable latency [32]. However, achieving predictable latency without performance isola-

    tion is hard. This lack of perfect performance isolation makes it difficult to virtu-

    alise some workloads [67]. Whilst executing in the cloud allows some detection

    of performance anomalies before deploying some services [134], this remains an

    unsolved problem in the general case.

    2.6.3.1 Measurement

Researchers have long studied methods of reducing performance interference in

    operating systems, in particular with the rise of latency-sensitive applications

such as video streaming [66]. With the advent of hypervisors, there has been fur-

    ther work in reducing performance interference, whilst increasing utilisation of

    hardware by using a custom scheduler that limits the resources consumed by vir-

    tual machines in their domain and in driver domains, such as domain zero [63].

    However, in current cloud deployments, virtual machine workloads can in-

terfere badly with each other. For instance, the IOPS available to a virtual machine can fluctuate wildly depending on the other virtual machines executing [60], and poor scheduling causes performance interference: colocating a random and a sequential workload reduces performance for the sequential workload [58].

    Some work improves on the performance guarantees in the cloud, for example

    with virtual datacentres that have guaranteed throughput. An implementation

    of a virtual datacentre is Pulsar, which modifies the hypervisors in the cloud to

use a leaky bucket per virtual machine on shared resources to guarantee perfor-

    mance [4].
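The leaky-bucket mechanism itself is simple; the sketch below is a generic illustration of the technique, not Pulsar's implementation. Each virtual machine's bucket drains at its guaranteed rate, and a request is admitted only if the bucket has room for its cost.

/* Generic leaky-bucket sketch for rate-limiting a shared resource. */
#include <stdbool.h>
#include <stdio.h>

struct bucket {
    double level;      /* currently queued work, in cost units          */
    double capacity;   /* burst allowance                               */
    double drain_rate; /* guaranteed throughput, cost units per second  */
    double last;       /* time of the previous decision, in seconds     */
};

static bool bucket_admit(struct bucket *b, double now, double cost)
{
    /* Drain the bucket for the time elapsed since the last decision. */
    double drained = (now - b->last) * b->drain_rate;
    b->level = b->level > drained ? b->level - drained : 0.0;
    b->last = now;

    if (b->level + cost > b->capacity)
        return false;              /* over the guaranteed rate: defer or queue */
    b->level += cost;
    return true;                   /* within the guarantee: admit the request  */
}

int main(void)
{
    /* 100 cost-units of burst capacity, guaranteed drain rate of 50 units/s */
    struct bucket vm = { 0.0, 100.0, 50.0, 0.0 };

    for (int i = 0; i < 8; i++) {
        double now = i * 0.5;      /* a 40-unit request every 0.5 s */
        printf("t=%.1fs admitted=%d\n", now, bucket_admit(&vm, now, 40.0));
    }
    return 0;
}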

    Whilst guaranteeing performance isolation is preferable, whenever the ma-

    chine is saturated by its virtual machines there is necessarily performance interfer-

    ence, in which case monitoring and reporting the performance is possible. There

    are many ways of measuring the performance of an operating system. Modern

    operating systems, such as Linux, have a wealth of tools to help measure operat-

ing system performance. For instance, Linux has ftrace, perf, SystemTap [43],

    KLogger [46] and numerous domain-specific tools. Another method, originally

    implemented on a modified Digital UNIX 4.0D kernel, reports the resource con-

    sumption of resource containers, rather than of processes and threads [11].

However, none of these methods distinguishes poor application performance from the overheads of virtualisation. That is, these tools are unable to

    report if the virtual machine is starved of resources. Not only do these tools

not inform users of virtualisation overhead, they are often unable to access the

    same set of hardware features as a physical machine to accurately report per-

    formance to domains.1 Xenoprof is currently the only attempt to provide Xen

    virtual machines with a way of measuring performance [99]. However, Xeno-

    prof is incompatible with recent versions of Xen. The technique that I present

    in Chapter 5 differs in that it requires developers to annotate their programs to

indicate the processing of requests (much like is required by X-trace [51]) but

    then reports the overheads of virtualisation, rather than the performance of the

    virtual machine, and gives these details on a per-request basis. Calculating this

    overhead requires applications to have information about how the virtual ma-

    chine in which they execute is scheduled. Having a hypervisor expose its inner

    state is similar to how Infokernels expose kernel internals across the interface

    with applications [8].

    2.6.3.2 Modelling

    There has been work performed by the modelling community that looks into

    performance interference between virtual machines. This work largely models

    which workloads interact badly with each other in order to build better virtual

    machine placement algorithms. This differs from the technique that I present

1 vPMU is an upcoming (as of 2015-09-17) feature for Xen and Linux.

in Chapter 5, which is a technique for measuring the

    performance of clouds as they execute. An example of modelling performance

    interference is hALT, which uses machine learning trained on a dataset from

Google [120] to model which workloads cause performance interference [28].

    Q-Clouds models CPU-bound virtual machines using a multiple-input multiple-

output model whereby online feedback from an application is used as an input to the model and the output is used to place virtual machines more

    effectively [102]. TRACON is similar to Q-Clouds, but focusses on I/O-intensive

    workloads [28]. Casale et al. produce models of virtual machine disk perfor-

mance, based on monitoring the hypervisor's batching of I/O requests and the

    arrival queue [22]. CloudScope improves on modelling the performance of vir-

    tual machine interference by doing away with the need for machine learning or

queuing-based models: it models virtual machine performance using Markov

    chains to achieve a low-error model that is not tightly coupled with an applica-

    tion [25].

    This work all differs from Soroban in that it is modelling the performance

    of an entire virtual machine. The virtual machine being modelled is typically

assumed to be in a steady state for a prolonged period of time (perhaps several

    minutes in length) and the model finds the best placement of virtual machines to

    minimise performance interference. However, Soroban is a measurement tech-

    nique that reports the additional latency incurred in servicing a single request in

a request-response system. That is, Soroban measures whether, during the servicing of a request, the virtual machine was scheduled out, and reports the corresponding cost.

    2.6.3.3 Summary

    I have shown that there is a field of work that considers how to instrument

    and measure the performance of