
  • Forgoing hypervisor fidelity for measuring virtual machine

    performance

    Oliver R. A. Chick

    Gonville and Caius College

    This dissertation is submitted for the degree of Doctor of Philosophy

    http://orcid.org/0000-0002-6889-8561

  • FORGOING HYPERVISOR FIDELITY FOR MEASURING VIRTUAL MACHINE PERFORMANCE

    OLIVER R. A. CHICK

    For the last ten years there has been rapid growth in cloud computing, which

    has largely been powered by virtual machines. Understanding the performance

    of a virtual machine is hard: There is limited access to hardware counters, tech-

    niques for probing have a higher probe effect than on physical machines, and per-

    formance is tightly coupled with the hypervisor's scheduling decisions. Yet, the

    need for measuring virtual machine performance is high as virtual machines are

    slower than physical machines and have highly-variable performance.

    Current performance-measurement techniques demand hypervisor fidelity:

    They execute the same instructions on a virtual machine and physical machine.

    Whilst fidelity has historically been considered an advantage as it allows the hy-

    pervisor to be transparent to virtual machines, the use case of hypervisors has

    changed from multiplexing access to a single mainframe across an institution to

    forming a building block of the cloud.

    In this dissertation I reconsider the argument for hypervisor fidelity and show

    the advantages of software that co-operates with the hypervisor. I focus on pro-

    ducing software that explains the performance of virtual machines by forgoing

    hypervisor fidelity. To this end, I develop three methods of exposing the hy-

    pervisor interface to performance measurement tools: (i) Kamprobes is a tech-

    nique for probing virtual machines that uses unprivileged instructions rather

    than interrupt-based techniques. I show that this brings the time required to

    fire a probe in a virtual machine to within twelve cycles of native performance.

    (ii) Shadow Kernels is a technique that uses the hypervisor's memory manage-

    ment unit so that an operating system kernel can have per-process specialisation,

    which can be used to selectively fire probes, with low overheads (8353 ± 54 cycles per page) and minimal operating system changes (340 LoC). (iii) Soroban uses

    machine learning on the hypervisor's scheduling activity to report the virtualisa-

    tion overhead in servicing requests and can distinguish between latency caused

    by high virtual machine load and latency caused by the hypervisor.

    Understanding the performance of a machine is particularly difficult when

    executing in the cloud due to the combination of the hypervisor and other virtual

  • machines. This dissertation shows that it is worthwhile forgoing hypervisor

    fidelity to improve the visibility of virtual machine performance.

  • DECLARATION

    This dissertation is my own work and contains nothing which is the outcome

    of work done in collaboration with others, except where specified in the text.

    This dissertation is not substantially the same as any that I have submitted for a

    degree or diploma or other qualification at any other university. This dissertation

    does not exceed the prescribed limit of 60 000 words.

    Oliver R. A. Chick

    November 30, 2015

    http://orcid.org/0000-0002-6889-8561

  • ACKNOWLEDGEMENTS

    This work was principally supported by the Engineering and Physical Sciences

    Research Council [grant number EP/K503009/1] and by internal funds from the

    University of Cambridge Computer Laboratory.

    I should like to pay personal thanks to Dr Andrew Rice and Dr Ripduman So-

    han for their countless hours of supervision and technical expertise, without

    which I would have been unable to conduct my research. Further thanks to

    Dr Ramsey M. Faragher for encouragement and help in wide-ranging areas.

    Special thanks to Lucian Carata and James Snee for their efforts in cod-

    ing reviews and being prudent collaborators, as well as Dr Jeunese A. Payne,

    Daniel R. Thomas, and Diana A. Vasile for proof reading this dissertation.

    My gratitude goes to Prof. Andy Hopper for his support for the Resourceful

    project.

    All members of the DTG, especially Daniel R. Thomas and other inhabitants

    of SN14, have provided me with both wonderful friendships and technical assis-

    tance, which has been invaluable throughout my Ph.D.

    Final thanks naturally go to my parents for their perpetual support.


  • CONTENTS

    1 Introduction 15

    1.1 Defining forgoing hypervisor fidelity . . . . . . . . . . . . . . . . 16

    1.2 Limitations of hypervisor fidelity in performance measurement tools 17

    1.3 The case for forgoing hypervisor fidelity in performance measure-

    ment tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

    1.4 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

    1.5 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    1.6 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    1.7 Scope of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    1.7.1 Xen hypervisor . . . . . . . . . . . . . . . . . . . . . . . . 23

    1.7.2 GNU/Linux operating system . . . . . . . . . . . . . . . . 23

    1.7.3 Paravirtualised guests . . . . . . . . . . . . . . . . . . . . . 24

    1.7.4 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    1.8 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    2 Background 27

    2.1 Historical justification for hypervisor fidelity . . . . . . . . . . . . 28

    2.2 Contemporary uses for virtualisation . . . . . . . . . . . . . . . . 29

    2.3 Virtualisation performance problems . . . . . . . . . . . . . . . . 33

    2.3.1 Privileged instructions . . . . . . . . . . . . . . . . . . . . 33

    2.3.2 I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

    2.3.3 Networking . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    2.3.4 Increased contention . . . . . . . . . . . . . . . . . . . . . 34

    2.3.5 Locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

    2.3.6 Unpredictable timing . . . . . . . . . . . . . . . . . . . . . 35

    2.3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    2.4 The changing state of hypervisor fidelity . . . . . . . . . . . . . . 35

    2.4.1 Historical changes to hypervisor fidelity . . . . . . . . . . 35

    2.4.2 Recent changes to hypervisor fidelity . . . . . . . . . . . . 36

    2.4.3 Current state of hypervisor fidelity . . . . . . . . . . . . . 38

  • 2.4.3.1 Installing guest additions . . . . . . . . . . . . . 38

    2.4.3.2 Moving services into dedicated domains . . . . . 38

    2.4.3.3 Lack of transparency of HVM containers . . . . 39

    2.4.3.4 Hypervisor/operating system semantic gap . . . . 39

    2.4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

    2.5 Rethinking operating system design for hypervisors . . . . . . . . 40

    2.6 Virtual machine performance measurement . . . . . . . . . . . . . 41

    2.6.1 Kernel probing . . . . . . . . . . . . . . . . . . . . . . . . 41

    2.6.2 Kernel specialisation . . . . . . . . . . . . . . . . . . . . . 42

    2.6.3 Performance interference . . . . . . . . . . . . . . . . . . . 43

    2.6.3.1 Measurement . . . . . . . . . . . . . . . . . . . . 43

    2.6.3.2 Modelling . . . . . . . . . . . . . . . . . . . . . . 44

    2.6.3.3 Summary . . . . . . . . . . . . . . . . . . . . . . 45

    2.7 Application to a broader context . . . . . . . . . . . . . . . . . . 46

    2.7.1 Containers . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    2.7.2 Microkernels . . . . . . . . . . . . . . . . . . . . . . . . . 47

    2.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

    3 Kamprobes: Probing designed for virtualised operating systems 49

    3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    3.2 Current probing techniques . . . . . . . . . . . . . . . . . . . . . 51

    3.2.1 Linux: Kprobes . . . . . . . . . . . . . . . . . . . . . . . . 51

    3.2.2 Windows: Detours . . . . . . . . . . . . . . . . . . . . . . 52

    3.2.3 FreeBSD, NetBSD, OS X: DTrace function boundary tracers 53

    3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.3 Experimental evidence against virtualising current probing tech-

    niques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

    3.3.1 Cost of virtualising Kprobes . . . . . . . . . . . . . . . . . 54

    3.3.2 Cost of virtualised interrupts . . . . . . . . . . . . . . . . . 57

    3.3.3 Other causes of slower performance when virtualised . . . 58

    3.4 Kamprobes design . . . . . . . . . . . . . . . . . . . . . . . . . . 59

    3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

    3.5.1 Kamprobes API . . . . . . . . . . . . . . . . . . . . . . . . 60

    3.5.2 Kernel module . . . . . . . . . . . . . . . . . . . . . . . . 61

    3.5.3 Changes to the x86-64 instruction stream . . . . . . . . . 61

  • 3.5.3.1 Inserting Kamprobes into an instruction stream . 61

    3.5.3.2 Kamprobe wrappers . . . . . . . . . . . . . . . . 62

    3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

    3.6.1 Inserting probes . . . . . . . . . . . . . . . . . . . . . . . . 69

    3.6.2 Firing probes . . . . . . . . . . . . . . . . . . . . . . . . . 71

    3.6.3 Kamprobes executing on bare metal . . . . . . . . . . . . . 74

    3.7 Evaluation summary . . . . . . . . . . . . . . . . . . . . . . . . . 75

    3.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    3.8.1 Backtraces . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

    3.8.2 FTrace compatibility . . . . . . . . . . . . . . . . . . . . . 76

    3.8.3 Instruction limitations . . . . . . . . . . . . . . . . . . . . 76

    3.8.4 Applicability to other instruction sets and ABIs . . . . . . 76

    3.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

    4 Shadow kernels: A general mechanism for kernel specialisation in exist-

    ing operating systems 79

    4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    4.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

    4.2.1 Shadow Kernels for probing . . . . . . . . . . . . . . . . . 82

    4.2.2 Per-process kernel profile-guided optimisation . . . . . . . 84

    4.2.3 Kernel optimisation and fast-paths . . . . . . . . . . . . . 84

    4.2.4 Kernel updates . . . . . . . . . . . . . . . . . . . . . . . . 85

    4.3 Design and implementation . . . . . . . . . . . . . . . . . . . . . 86

    4.3.1 User space API . . . . . . . . . . . . . . . . . . . . . . . . 86

    4.3.2 Linux kernel module . . . . . . . . . . . . . . . . . . . . . 87

    4.3.2.1 Module insertion . . . . . . . . . . . . . . . . . . 88

    4.3.2.2 Initialisation of a shadow kernel . . . . . . . . . 88

    4.3.2.3 Adding pages to the shadow kernel . . . . . . . . 89

    4.3.2.4 Switching shadow kernel . . . . . . . . . . . . . 89

    4.3.2.5 Interaction with other kernel modules . . . . . . 90

    4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

    4.4.1 Creating a shadow kernel . . . . . . . . . . . . . . . . . . 91

    4.4.2 Switching shadow kernel . . . . . . . . . . . . . . . . . . . 93

    4.4.2.1 Switching time . . . . . . . . . . . . . . . . . . . 93

    4.4.2.2 Effects on caching . . . . . . . . . . . . . . . . . 95

  • 4.4.3 Kamprobes and Shadow Kernels . . . . . . . . . . . . . . 97

    4.4.4 Application to web workload . . . . . . . . . . . . . . . . 102

    4.4.5 Evaluation summary . . . . . . . . . . . . . . . . . . . . . 103

    4.5 Alternative approaches . . . . . . . . . . . . . . . . . . . . . . . . 103

    4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

    4.6.1 Modifications required to kernel debuggers . . . . . . . . . 105

    4.6.2 Software guard extensions . . . . . . . . . . . . . . . . . . 105

    4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

    5 Soroban: Attributing latency in virtualised environments 107

    5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

    5.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

    5.2.1 Performance monitoring . . . . . . . . . . . . . . . . . . . 110

    5.2.2 Virtualisation-aware timeouts . . . . . . . . . . . . . . . . 110

    5.2.3 Dynamic allocation . . . . . . . . . . . . . . . . . . . . . . 111

    5.2.4 QoS-based, fine-grained charging . . . . . . . . . . . . . . 111

    5.2.5 Diagnosing performance anomalies . . . . . . . . . . . . . 112

    5.3 Sources of virtualisation overhead . . . . . . . . . . . . . . . . . . 112

    5.4 Effect of virtualisation overhead on end-to-end latency . . . . . . 116

    5.5 Attributing latency . . . . . . . . . . . . . . . . . . . . . . . . . . 118

    5.5.1 Justification of Gaussian processes . . . . . . . . . . . . . 121

    5.5.2 Alternative approaches . . . . . . . . . . . . . . . . . . . . 122

    5.6 Choice of feature vector elements . . . . . . . . . . . . . . . . . . 123

    5.7 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126

    5.7.1 Xen modifications . . . . . . . . . . . . . . . . . . . . . . 126

    5.7.1.1 Exposing scheduler data . . . . . . . . . . . . . . 126

    5.7.1.2 Sharing scheduler data between Xen and its vir-

    tual machines . . . . . . . . . . . . . . . . . . . . 127

    5.7.2 Linux kernel module . . . . . . . . . . . . . . . . . . . . . 127

    5.7.3 Application modifications . . . . . . . . . . . . . . . . . . 128

    5.7.3.1 Soroban API . . . . . . . . . . . . . . . . . . . . 128

    5.7.3.2 Using the Soroban API . . . . . . . . . . . . . . . 129

    5.7.4 Data processing . . . . . . . . . . . . . . . . . . . . . . . . 129

    5.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

    5.8.1 Validation of model . . . . . . . . . . . . . . . . . . . . . 130

  • 5.8.1.1 Mapping scheduling data to virtualisation over-

    head . . . . . . . . . . . . . . . . . . . . . . . . . 131

    5.8.1.2 Negative virtualisation overhead . . . . . . . . . 133

    5.8.2 Validating virtualisation overhead . . . . . . . . . . . . . . 137

    5.8.3 Detecting increased load from the cloud provider . . . . . . 140

    5.8.4 Performance overheads of Soroban . . . . . . . . . . . . . 141

    5.9 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

    5.9.1 Increased programmer burden of program annotations . . 142

    5.9.2 Scope of performance isolation considered by Soroban . . 143

    5.9.3 Limitation to uptake . . . . . . . . . . . . . . . . . . . . . 143

    5.9.4 Improvements to machine learning . . . . . . . . . . . . . 143

    5.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

    6 Conclusion 145

    6.1 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

    6.2 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

    6.3 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

    6.4 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

    6.4.1 Kamprobes . . . . . . . . . . . . . . . . . . . . . . . . . . 148

    6.4.2 Shadow Kernels . . . . . . . . . . . . . . . . . . . . . . . . 149

    6.4.3 Soroban . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

    6.4.4 Other performance measurement techniques that forgo hy-

    pervisor fidelity . . . . . . . . . . . . . . . . . . . . . . . . 150

    6.5 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

  • CHAPTER 1

    INTRODUCTION

    The recent emergence of cloud computing is largely dependent on the populari-

    sation of high-performance and secure x86-64 virtualisation. By using a hypervi-

    sor, cloud operators are able to multiplex their hardware, with high performance

    and strong data isolation, between multiple competing users. This multiplexing

    allows cloud providers to increase machine utilisation and increase service scal-

    ability. Moreover, the hypervisor eases system management with maintenance

    features such as snapshotting and live migration.

    Yet, despite the advantages of virtual machines they remain slower than phys-

    ical machines and have highly-variable performance [60]. Whilst efforts have

    improved both the raw performance and performance isolation of virtual ma-

    chines, the increased indirection and additional complexity in virtualising privi-

    leged instructions make it unlikely that we shall achieve parity of performance.

    Developers therefore need techniques to help them measure how much slower

    their applications execute in a virtual machine than they would have done on

    bare metal. Furthermore, they need to be able to diagnose and fix performance

    issues that occur in virtualised production systems.

    However, using current techniques it is difficult to measure the performance

    of software when it executes in virtual machines. Many of the methods used

    to measure the performance of software when executing on bare metal, such as

    raw access to performance counters, processor tracing, and visibility of hard-

    ware performance metrics are not directly accessible [18], expensive [105], or

    inaccurate [105, 71] when executing in a virtual machine. The combination of

    less predictable performance and unavailability of performance-debugging tech-

    niques makes it hard to measure the performance of an application executing in

    a virtual machine.

    One technique is to optimise software on bare metal, where access to more

    hardware features is available, and then to virtualise the software. However,

    15

  • this is a poor approach as virtualisation has different performance impacts on

    different operations.1

    Currently, the main virtualisation techniques used by hypervisors either have

    guests execute unmodified code, relying on hardware virtualisation extensions

    to emulate bare-metal hardware from the point of view of the guest, or exe-

    cute paravirtualised guests whereby the virtual machines are made aware that

    they are executing on a hypervisor and issue hypercalls, as opposed to execut-

    ing privileged instructions. But such paravirtualisation of mainstream operating

    systems only applies to the low-level hardware interfaces, typically restricted to

    the architecture-dependent (arch/) code. As such, performance measurement

    techniques that execute on a virtual machine exhibit hypervisor fidelity: They

    execute without consideration of the fact that they are executing in a virtual ma-

    chine. They are therefore unable to access the same set of counters that they can

    on physical machines and are unable to explain performance issues, such as CPU

    starvation of the entire operating system, that do not exist on physical machines.

    Slower and less-predictable performance of software executing in a virtual

    machine are two of the greatest disadvantages of executing software using a

    virtual machine, yet current techniques for measuring this performance do not

    consider the role of virtualisation in slow performance. In this dissertation I

    argue the benefits of forgoing hypervisor fidelity to measure performance. That

    is, given the importance of measuring the performance of virtual machines we

    should turn to forgoing fidelity, in the same way as we have previously forgone

    fidelity to ameliorate previous problems with virtualisation, such as slow perfor-

    mance and the difficulties in virtualising classical x86.

    I show that by forgoing hypervisor fidelity it is possible to build performance-

    analysis techniques that reduce the probe effect of measuring virtual machines

    and explain performance characteristics of software that one cannot measure

    without considering the role of the hypervisor in executing software.

    1.1 Defining forgoing hypervisor fidelity

    Hypervisor fidelity is a well-defined concept [115]. However, the concept of for-

    going hypervisor fidelity is less well defined. In this dissertation I define forgoing

    1 Indeed, I show in Chapter 4 and Chapter 5 that depending on the operation performed, virtualisation overheads can vary to the extent of changing the shape of a distribution.

    16

  • hypervisor fidelity as a property of software that is designed for execution on a

    virtual machine and makes use of the properties of the hypervisor.

    1.2 Limitations of hypervisor fidelity in performance

    measurement tools

    Hypervisors date back to early work by IBM in the 1960s, where they were

    initially used to multiplex access to a scarce, expensive mainframe. However,

    the current trend of using hypervisors to virtualise cloud infrastructure has its

    roots in the renaissance that followed fast and secure techniques to virtualise the

    x86-64 instruction set. The re-emergence of paravirtualisation, addition of hard-

    ware virtualisation extensions, and servers with plentiful memory and CPU capacity

    throughout the 2000s made it possible to execute many virtual machines on a

    single server to increase utilisation. This, combined with a consumer movement

    to performing computations and storing data on servers, made virtualisation at-

    tractive to industry as virtualisation is cheaper and more scalable than executing

    on dedicated machines.

    The rise of cloud computing in recent years has been impressive. Amazon

    EC2 alone has grown from nine million to twenty eight million public IP ad-

    dresses in the past two years [143]. This number is clearly an underestimate for

    the actual use of virtual machines as it doesn't include other cloud providers, or

    non-public IP addresses.

    However, the performance of virtual machines executing in the cloud is highly-

    variable [39, 49], with cloud providers now competing on the predictability of

    their services [9]. Despite this, the tools available to users to measure the per-

    formance of their virtual machines have not kept up with the growth in cloud

    computing. Given the difficulty in correctly virtualising all hardware counters

    and eliminating performance interference, I show how by forgoing hypervisor

    fidelity we can build tools that aid with measuring the performance of a virtual

    machine.

    17

  • 1.3 The case for forgoing hypervisor fidelity in perfor-

    mance measurement tools

    Forgoing hypervisor fidelity to ameliorate problems in the virtualisation domain

    has been repeatedly used in the past. I now explore previous times that we have

    forgone hypervisor fidelity to improve the utility of virtual machines and argue

    that contemporary problems mean that it is time to forgo hypervisor fidelity of

    performance measurement techniques.

    The concept of forgoing hypervisor fidelity is almost as old as virtualisation

    itself. The early literature relating to OS/360 and OS/370 considers the role

    of pure against impure virtual machines, whereby an impure virtual machine

    executes differently as it has been virtualised. The advantage of impure virtual

    machines was that they could execute faster than pure virtual machines. In the

    end, pure virtual machines became the dominant virtual machine type, although

    techniques such as paravirtualisation borrow from the ideas of impure virtual

    machines.

    More recently, forgoing hypervisor fidelity has been used to overcome classi-

    cal limitations of the x86 instruction set that meant it was not virtualisable in a

    way that provided both security and performance. By adopting paravirtualisa-

    tion to overcome the limitations of classical x86, Xen forgoes hypervisor fidelity

    since virtual machines execute with knowledge of the hypervisor and issue hy-

    percalls rather than executing non-virtualisable instructions.

    Even today, we forgo hypervisor fidelity to overcome performance problems

    with virtualisation. One problem that virtual machines face is the possibility of

    not being scheduled when they need to execute, for instance after packets have

    arrived for the virtual machine. In order for the hypervisor to more-favourably

    schedule the virtual machine when it has work to do, under Xen there are two

    hypercalls that allow guests to deschedule themselves: yield and block. When a

    guest is waiting for I/O or the network they can execute the block hypercall, pa-

    rameterised on the event that they are waiting for. The hypervisor then preempts

    the guest until the corresponding event is placed on the guest's event channel, at

    which point the hypervisor wakes the guest. The advantage in this case of the

    guest acknowledging the presence of the hypervisor is that by blocking when it

    cannot make progress the scheduling algorithm stops consuming credit from the

    18

  • domain. Therefore, when the guest is able to execute the scheduling algorithm

    will be more-favourable to the domain. Similarly, the yield hypercall allows

    guests to relinquish their slot on the CPU, without parameterisation, such that

    they will later be scheduled more favourably. Both the block and yield hypercalls

    improve the performance of the guest, through forgoing hypervisor fidelity.
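    To make this concrete, the sketch below shows how a paravirtualised Linux
    guest can issue these two hypercalls through Xen's sched_op interface. It is a
    minimal illustration, assuming the standard Xen guest headers, and is not code
    taken from this dissertation; in practice the block path is driven by the guest's
    idle loop and event-channel setup.

        #include <xen/interface/sched.h>   /* SCHEDOP_yield, SCHEDOP_block */
        #include <asm/xen/hypercall.h>     /* HYPERVISOR_sched_op() */

        /* Relinquish the remainder of this VCPU's slot so that the scheduler
         * treats the domain more favourably when it next has work to do. */
        static inline void guest_yield(void)
        {
                HYPERVISOR_sched_op(SCHEDOP_yield, NULL);
        }

        /* Block this VCPU until an event arrives on one of the guest's event
         * channels, rather than burning credit while no progress can be made. */
        static inline void guest_block(void)
        {
                HYPERVISOR_sched_op(SCHEDOP_block, NULL);
        }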

    Even with the advent of hardware virtualisation that allows unmodified vir-

    tual machines to execute, we still forgo hypervisor fidelity in the drivers on vir-

    tual machines to improve performance. On hardware virtual machines (HVM)

    the emulation of connected devices (which a tool such as QEMU can provide)

    is slow; therefore, HVM guests that need more performance are often converted

    to PV on HVM guests, using virtualisation drivers that replace the emulated

    devices with a driver that directly issues hypercalls. This allows guests to use the

    hardware-assisted virtualisation interface when this is fastest, such as executing

    a system call, since the lack of rings one and two on x86-64 requires all pure-

    paravirtualised system calls to perform a context switch through the hypervisor,

    and use the paravirtualised interface when this is faster, such as avoiding hard-

    ware emulation. This is an example of the virtual machine forgoing hypervisor

    fidelity to improve the performance of a virtual machine.

    As we have seen, forgoing hypervisor fidelity is an oft-used technique for

    solving problems in the virtualisation domain, in particular for solving perfor-

    mance issues. A significant issue facing virtualisation today is that performance

    is variable and yet techniques for measuring the performance of virtual machines

    have lower utility than techniques for measuring the performance of physical ma-

    chines. I propose rethinking where we forgo hypervisor fidelity in a mainstream

    operating system, designed to execute in a contemporary cloud environment.

    In this dissertation I show that by building performance measurement tools

    that don't have strict hypervisor fidelity it is possible to mitigate many of the

    issues of measuring the performance of a virtual machine. Forgoing hypervisor

    fidelity should not be controversial given the trend of forgoing hypervisor fidelity

    to solve performance-related issues.

    In the remainder of this chapter I introduce three key methods by which

    forgoing hypervisor fidelity allows software to report better performance mea-

    surements when virtualised. Later, I present each contribution in detail.

    19

  • 1.4 Kamprobes

    Current kernel probing mechanisms are built without forgoing hypervisor fi-

    delity. That is, developers execute the same types of probes on virtual machines

    as they do on physical machines. However, these methods usually rely on set-

    ting software interrupts in an instruction stream. Whilst these generally execute

    well on physical hardware, I show in Chapter 3 that interrupts on a virtual ma-

    chine are 1.81 times more expensive than interrupts on hardware (§3.3.2), as the hypervisor has to execute.
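    For comparison, the sketch below shows the conventional interrupt-based
    approach on Linux: registering a Kprobe patches the probed instruction with a
    breakpoint, so every firing takes a trap that the hypervisor must help service.
    The probed symbol and handler are arbitrary illustrative choices, not code from
    this dissertation.

        #include <linux/kprobes.h>
        #include <linux/module.h>

        /* Pre-handler: runs in the breakpoint trap handler, before the probed
         * instruction is executed. */
        static int handler_pre(struct kprobe *p, struct pt_regs *regs)
        {
                pr_info("probe hit at %p\n", p->addr);
                return 0;
        }

        static struct kprobe kp = {
                .symbol_name = "do_sys_open",   /* arbitrary example symbol */
                .pre_handler = handler_pre,
        };

        static int __init probe_example_init(void)
        {
                /* register_kprobe() rewrites the probed instruction with a
                 * breakpoint; each probe firing therefore costs a software
                 * interrupt. */
                return register_kprobe(&kp);
        }

        static void __exit probe_example_exit(void)
        {
                unregister_kprobe(&kp);
        }

        module_init(probe_example_init);
        module_exit(probe_example_exit);
        MODULE_LICENSE("GPL");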

    Probes are a common technique for measuring the performance of computer

    software. By allowing developers to add additional code at a program's runtime,

    probes allow developers to execute code that measures wall-clock time, cycles,

    or other resources used by a piece of code without the burden of modifying the

    software's source code, recompiling and re-executing the software. However,

    a problem with probes is that when they fire they consume resources, thereby

    affecting the performance of the application that they try to measure.

    Whilst this probe effect impacts both physical machines and virtual machines,

    the overheads are 2.28 times higher on virtual machines than physical machines
    (§3.3). Moreover, virtualisation increases the standard deviation of the number of cycles

    required to fire a probe from 8 cycles to 869 cycles (§3.6.2). By having higher
    overheads, probing mechanisms on virtual machines ex-

    acerbate the probe effect. This makes it harder to identify the cause of poor

    performance of applications on virtual machines.

    Kamprobes is a technique for probing virtual machines that only uses unpriv-

    ileged instructions, such that the hypervisor is not involved in a probe firing and

    avoids other operations that are expensive in a virtual machine such as hold-

    ing locks. Kamprobes forgoes hypervisor fidelity by being designed to execute

    with maximum performance on a virtual machine. For instance, by only using

    non-privileged instructions, the design of Kamprobes forgoes hypervisor fidelity.

    There is only a modest difference between executing in a virtual machine and

    on a physical machine in the number of cycles (twelve cycles) and the variability

    (two cycles of standard deviation). Moreover, Kamprobes execute much faster

    than Kprobes (the current state-of-the-art in Linux kernel probing), with a Kam-

    probe taking 69 ± 16 cycles to execute, whereas a Kprobe takes 6980 ± 869 cycles

    20

    to execute (§3.6.2). Furthermore, whilst not an issue of virtualisation, when
    Kprobes determines which handler to execute it performs a lookup that scales

    with O(n) in the number of probes inserted. The technique that Kamprobes

    uses does not need to perform a lookup, and so runs in constant time (O(1)). Kam-

    probes can therefore be used in circumstances that require many probes, such

    as for a function boundary tracer, for which Kprobes is too slow.

    1.5 Shadow Kernels

    Whilst Kamprobes are a low-overhead technique for probing virtual machines, if

    they are used, even with empty probe handlers, on hot code paths the overhead

    of them repeatedly firing can significantly reduce performance. In principle this

    shouldn't be an issue because much of the time developers want to measure the

    performance of one particular process's interactions with the kernel in isolation.

    But there is no current way of setting kernel probes that only fire when one

    particular process executes.

    Shadow Kernels is a technique I developed by which specialisation, such as

    setting probes, can be applied to a kernel instruction stream on a fine-grained

    basis such that the specialisation applies to a subset of the processes or system

    calls executing on the system. Currently, specialising the operating system kernel

    makes changes to the kernel instruction stream that affect all processes executing

    on the system. This is because whenever the kernel instruction stream is modified

    the address space of every process is modified as each process maps the shared

    kernel into its own address space. The underlying issue is that modifications to

    the instruction stream of the kernel are a global operation, in that the shared

    instruction stream is executed by all processes. I therefore show that the effect

    of this is to reduce the performance of all processes executing on the system,

    regardless of whether their interactions with the kernel were the target of specialisation.

    Shadow Kernels requires co-operation of virtual machines with the hypervi-

    sor since the virtual machines execute hypercalls that cause the hypervisor to

    modify the physical-to-machine memory mappings such that the virtual mem-

    ory containing the kernel instruction stream maps to different machine-physical

    memory depending on the calling context.
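    As an illustration of the kind of remapping involved, a paravirtualised Linux
    guest can ask Xen to repoint the page-table entry behind a page of kernel text
    at a different machine frame. This is only a sketch, assuming that new_mfn
    already holds a specialised copy of the page; it is not the Shadow Kernels
    implementation, which Chapter 4 describes.

        #include <linux/mm.h>
        #include <asm/pgtable.h>
        #include <asm/xen/hypercall.h>   /* HYPERVISOR_update_va_mapping() */
        #include <xen/interface/xen.h>   /* UVMF_INVLPG */

        /* Map the page of kernel text at text_va onto machine frame new_mfn,
         * which is assumed to contain a modified copy of that page (for
         * example, one with probes inserted). */
        static int remap_kernel_text_page(unsigned long text_va,
                                          unsigned long new_mfn)
        {
                pte_t pte = __pte((new_mfn << PAGE_SHIFT) |
                                  pgprot_val(PAGE_KERNEL_EXEC));

                /* The hypercall makes Xen update the PTE backing this virtual
                 * address and flush the stale TLB entry for it. */
                return HYPERVISOR_update_va_mapping(text_va, pte, UVMF_INVLPG);
        }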

    Shadow Kernels is a technique that utilises the indirection of virtualised page

    21

  • tables such that multiple copies of the kernel instruction stream co-exist within

    a single domain. This allows processes that are not the target of instrumentation

    to execute their original kernel instruction stream, whilst applications whose

    interaction with the kernel is the target of specialisation execute a specialised

    instruction stream.

    Building Shadow Kernels without a hypervisor would be challenging: Oper-

    ating systems are designed with a memory layout such that the kernel resides at

    a fixed offset in physical memory. However, with Shadow Kernels there are mul-

    tiple copies of pages that include the kernel instruction stream, with the memory

    management unit changing which page virtual addresses resolve to. Therefore,

    there is no longer a fixed mapping between physical and virtual pages in the ker-

    nel instruction stream. Furthermore, the hypervisor-based approach makes it easy to

    port Shadow Kernels to other operating systems.

    1.6 Soroban

    A key issue with executing software in the cloud is that applications often exe-

    cute more slowly and sometimes with performance interference from other vir-

    tual machines [60]. For latency-sensitive applications, in particular, this virtuali-

    sation overhead prevents users from switching to virtual machines [142]. How-

    ever, current application monitoring systems are built with hypervisor fidelity, in

    that they report the same metrics if they execute on a physical machine or a vir-

    tual machine. As the performance of an application is affected by the hypervisor

    in a way that is hard to predict, it is currently difficult to measure how much of

    the latency of a program executing in the cloud is caused by the overheads of

    virtualisation and how much is due to other causes, such as a high load on the vir-

    tual machine. Soroban is a technique that forgoes hypervisor fidelity to measure

    how much of the latency of a request is due to the overheads of virtualisation.

    By forgoing hypervisor fidelity throughout the software stack, up to the appli-

    cation, Soroban reports the additional latency imposed on servicing individual

    requests in a request-response system. This allows developers to measure the

    additional overheads that their application experiences due to executing in a

    virtual machine, as opposed to executing on bare metal. By reporting the virtu-

    alisation overhead, developers can decide whether the additional overheads are

    22

  • worthwhile.

    Soroban uses a modified version of Xen that shares with each domain the

    activity performed on it by the scheduler, such as timestamps of when the virtual

    machine is scheduled in and out. Soroban then trains a Gaussian process on the

    relationship between these variables and the response time of a request-response

    system. The result of the learning phase is a model that, when given a feature

    vector of scheduling activity on a domain, reports the impact that these events

    have on the event response time.
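    For reference, the predictive equations of standard Gaussian-process regression,
    on which such a model rests, are shown below; given training feature vectors X
    with observed response times y, a covariance function k, and noise variance
    \sigma_n^2, the prediction at a new feature vector x_* is (this is textbook
    material rather than the exact formulation used in Chapter 5):

        \mu(x_*) = k(x_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} y

        \sigma^2(x_*) = k(x_*, x_*) - k(x_*, X) \left[ K(X, X) + \sigma_n^2 I \right]^{-1} k(X, x_*)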

    I evaluate Soroban, showing that the technique can be applied to a web server

    to measure the increase in latency due to virtualisation in servicing requests. I

    demonstrate that as more virtual machines execute concurrently, Soroban re-

    ports an increase in the latency attributed to virtualisation, but when the web

    server executes requests slowly due to high load, Soroban does not increase its

    measure of virtualisation overhead.

    1.7 Scope of thesis

    In this dissertation, I primarily focus on Xen, executing paravirtualised GNU/Linux

    on x86-64 hardware. I now justify this choice.

    1.7.1 Xen hypervisor

    Xen is the hypervisor used by Amazon EC2 [139], which as of May 2015 is ten

    times larger than the combined size of all its competitors [85]. Given the clear

    dominance of Xen in the cloud, solutions to problems of measuring performance

    when virtualised with Xen have a high impact. However, the key contributions

    of my thesis can be ported to other hypervisors.

    1.7.2 GNU/Linux operating system

    As of December 2014, 75% of enterprises report using Linux as their primary

    cloud platform [52], with the market share of Linux virtual machines increasing.

    Most of the remainder is Windows virtual machines; however, the number of

    these is falling.

    23

  • 1.7.3 Paravirtualised guests

    Presently there are two main techniques for virtualising an operating system in

    the cloud. (i) Hardware extensions allow an unmodified operating system to ex-

    ecute on a hypervisor (HVM). This is a common way of virtualising proprietary

    operating systems, such as Microsoft Windows. (ii) Modifying the guest oper-

    ating system such that it is aware that it is executing in a virtualised environment

    and directly issues hypercalls rather than performing privileged instructions.

    The performance of paravirtualised guests is comparable with the perfor-

    mance of hardware virtual machines, with regular changes as to which one is

    the faster form of virtualisation.

    In this dissertation I use paravirtualised virtual machines as they have an ex-

    isting interface with the hypervisor, through which virtual machines can issue

    hypercalls. As this dissertation proposes forgoing hypervisor fidelity, in effect

    creating paravirtualised performance measurement techniques, it is more nat-

    ural to build these on paravirtual virtual machines. However, hardware virtual

    machines often have a paravirtualised interface through which drivers can oper-

    ate, so many of the ideas could be ported to hardware virtual machines.

    1.7.4 x86-64

    Whilst instruction sets other than x86-64 are virtualisable, Intel currently has

    a 98.5% market share in server processors (as measured by number of proces-

    sors) [77], with much of the remainder being taken by AMD x86-64 processors.

    As such, I do not consider other instruction sets.

    The contributions of Kamprobes in Chapter 3 are particularly tightly-coupled

    with the x86-64 instruction set. However, the fundamental idea of using unpriv-

    ileged instructions to build a probing system holds true across other instruction

    sets. Indeed, on a fixed-width instruction set, such as ARM, this technique is

    both easier to implement and can be used on more opcodes than on x86-64.

    Both Shadow Kernels and Soroban are less reliant on any particular instruc-

    tion set.

    24

  • 1.8 Overview

    In summary, the key contributions of this dissertation are:

    Kamprobes. Current probing techniques are built to execute on a physical ma-

    chine and as such rely on interrupts to obtain an execution context. How-

    ever, on a virtual machine interrupts are a privileged instruction, so are

    expensive. Kamprobes is a low-overhead probing technique for x86-64

    virtual machines that executes with near-native performance in a virtual

    machine.

    Shadow Kernels. By forgoing hypervisor fidelity, virtual machines can remap

    their text section, allowing them to specialise shared text re-

    gions, in particular the kernel. Whilst I focus on the use case of scoping

    kernel probes, the technique can be applied to other types of kernel text

    specialisation, such as profile-guided optimisation.

    Soroban. A key concern that prevents the uptake of virtualisation is the impact

    of the virtualisation overhead. I show that by building software that ac-

    knowledges the presence of the hypervisor in its own monitoring, it is pos-

    sible to measure the virtualisation overhead of fine-grained activities, such

    as serving an HTTP request.

    The remainder of this dissertation is structured as follows. I explore the back-

    ground for my thesis in Chapter 2, arguing that the requirement of hypervisor

    fidelity for performance measurement techniques is a relic of classical hypervisor

    use cases and can be forgone for contemporary operating systems. In Chapter 3

    I introduce Kamprobes, a probing technique for virtualised x86-64 operating

    systems. In Chapter 4 I propose Shadow Kernels as a solution for specialisation,

    such as scoping the firing of probes. In Chapter 5 I present Soroban, a technique

    for using machine learning to report, for each request-response, the additional

    latency added by executing on the hypervisor.

    25

  • 26

  • CHAPTER 2

    BACKGROUND

    In their 1974 paper Popek and Goldberg state the classical definition of a hyper-

    visor as having three properties: Fidelity, performance and safety [115].

    Fidelity. Fidelity represents the concept that a hypervisor should portray an accu-

    rate representation of the underlying hardware, such that software can exe-

    cute on the hypervisor without requiring modification, or being aware that

    it executes in a virtualised environment. As such, the results of software

    executing in a virtualised environment must be identical to those obtained

    when executing on physical hardware, barring any effects of different tim-

    ing whilst executing on virtualised hardware.

    Performance. The performance of a virtual machine must not be substantially

    slower than when executing on physical hardware. In particular, most in-

    structions that execute must run unmodified, without trap-and-emulation

    techniques (trap-and-emulation is the only virtualisation technique that

    Popek and Goldberg consider).

    Safety. Virtual machines must act independently, without the ability to interfere

    with other domains executing on the system. Particularly, virtual machines

    should not have direct access to shared hardware, with which they can

    modify the state of another virtual machine in a way that would not be

    expected of that machine executing on physical hardware.

    In this dissertation I propose performance-analysis techniques that are de-

    signed to complement virtualisation, by either using code that virtualises well or

    by using techniques that interact with the hypervisor. As such, this work breaks

    the traditional definition of a hypervisor in that it no longer offers fidelity. In this

    chapter, I consider related work to argue that the difficulty of measuring the per-

    formance of virtual machines is exacerbated by the requirement of fidelity and

    27

  • that this should be relaxed given the changing uses of hypervisors. Throughout

    the rest of this dissertation I use this argument to justify building

    performance-analysis techniques that are tightly-coupled with the hypervisor.

    2.1 Historical justification for hypervisor fidelity

    In this section I consider the historical justification for hypervisors, especially for

    hypervisor fidelity. I later show that the use cases of hypervisors have changed

    and as such we should reconsider the hypervisor's original design principles.

    The concept of hypervisor fidelity, whilst formalised in 1974 [115], dates

    back to the start of research into virtual machines by IBM. IBM built early hy-

    pervisors that allowed multiple users to concurrently execute on a rare and ex-

    pensive mainframe with the illusion of being the only user of the machine. That

    is, each user had the illusion of being the sole user of the machine's hardware,

    with their operating system being the only one executing. The key issues that

    early hypervisors attempt to fix are that OS/360 uses the now-common [75]

    architecture of a machine executing a single kernel that is shared with every

    process executing on the system: (i) Different users are unable to execute dif-

    ferent operating system versions. Due to the lack of availability of mainframes,

    users were unable to obtain another machine to execute their own operating

    system version. (ii) Users cannot develop new operating system features in isola-

    tion from other users. For instance, if a developer were to extend the operating

    system, but their code contains a bug, with OS/360 it is not possible to prevent

    this from affecting concurrent users. As traditional abstractions are lower-level

    than contemporary abstractions, it was commonplace for developers to regularly

    need to modify or extend their operating system.

    CP-40 is considered to be the first hypervisor, being released in 1967 and able

    to concurrently execute fourteen virtual machines. As the complexity of hard-

    ware increased through the 1970s the use of hypervisors became more practical

    and featured in the development of OS/360 and OS/370 [62, 127]. Behind all

    IBM work is the control program (CP), which allows concurrent execution of

    operating systems, each of which has the illusion of executing on physical hard-

    ware [56]. The original versions of CP allow an unmodified operating system to

    execute in a virtualised environment in which CP configures the hardware such

    28

    that whenever a virtual machine executes a privileged instruction the hardware

    induces a trap, which CP catches, decodes and emulates in a safe way. There

    were other early hypervisors, such as the FIGARO system, which was part of the

    Cambridge Multiple-Access System and had similar design goals [147]. As such,

    these early hypervisors do provide fidelity, in that the software that executes on

    them has the same side effects, ignoring timing effects, on both physical and

    virtual hardware.

    2.2 Contemporary uses for virtualisation

    Having shown the historical justification for hypervisor fidelity, I now argue that

    the use case for virtualisation differs from that of the 1960s and 1970s. As such,

    it is time to reassess the requirement of virtual machine fidelity, in particular to

    aid in helping developers measure the performance of their virtual machines.

    Rather than building performance tools that explain a subset of what can be

    viewed on a physical machine, due to limited access to performance counters,

    we should forgo hypervisor fidelity by building performance analysis techniques

    that are designed to execute on a virtual machine.

    Compared with when hypervisors were pioneered, hardware is now cheaper

    and more readily available; as such, the original requirements for virtualisation

    no longer hold: (i) In contemporary computing users have access to many ma-

    chines, as such they are usually able to execute an operating system of choice

    on a different computer. (ii) The influx of additional hardware also means that

    development of operating system features can be performed on dedicated devel-

    opment hardware. Indeed, executing production services on the same hardware

    that is used for operating system development, even when a hypervisor is used,

    would be unconventional in the current era. In comparison to when virtualisa-

    tion was pioneered, it is standard practice to have fleets of physical machines just

    testing changes to operating system source code. Moreover, higher-level abstrac-

    tions reduce the requirement of most development work to involve modifying

    the operating system.

    In the last ten years virtualisation has underpinned the move to cloud com-

    puting, which in turn has revolutionised computing [6]. A lower-bound indica-

    tor of the growth of cloud computing is that Amazon AWS alone has increased

    29

  • from nine million to twenty eight million EC2 public IP addresses in the past two

    years [143]. The key benefit of the hypervisor in these cloud computing environ-

    ments is allowing operators to provide virtual machines to their customers, so

    that multiple customers can share the same physical server without interference.

    In particular, hypervisors give a number of advantages to cloud providers:

    Higher machine utilisation. By co-hosting virtual machines on a physical server

    the utilisation of the physical server increases when compared with execut-

    ing each service on a dedicated physical machine. Whilst higher utilisation

    was a key factor in the early work on hypervisors, this was because the

    mainframes that they executed on were scarce and highly-contested. How-

    ever for cloud providers, servers are readily-available, but higher utilisa-

    tion decreases power consumption, cooling, maintenance and real-estate

    expenditure. In order to increase utilisation, hypervisors now offer fea-

    tures such as memory overcommitting through ballooning [144] and pre-

    allocation [94]. Although such higher utilisation has remained a benefit of

    using a hypervisor, the reasons for desiring higher utilisation have changed,

    as such the role of the hypervisor has changed. The downside to higher

    utilisation is that it risks starving virtual machines of resources, thereby

    reducing their performance. Operating system starvation is not a problem

    that exists when executing on bare metal; therefore, tools that do not forgo

    hypervisor fidelity cannot report this effect.

    Creating virtual machines is fast and cheap. Users can spawn a new, booted vir-

    tual machine in less than one second [84]. This is not possible without

    a hypervisor, since fast boot up is achieved by forking an already-booted

    virtual machine, such that the two have the same state. With physical

    machines, the closest alternatives are techniques such as PXE that aid in

    reducing the time between connecting a server and it being fully-booted.

    However, for most use cases the main time cost in running a new physical

    server is actually in finding server hosting and obtaining a physical server.

    With hypervisors, there is no need for most users to purchase physical host-

    ing and servers, as they can simply pay for a virtual machine from their

    cloud provider. Moreover, the economics of cloud computing often make

    it cheaper to execute in the cloud than building a data centre [136]. This

    clearly differs from the original use case of a hypervisor in which being

    30

  • able to rapidly spawn a new machine was not a desired feature.

    Scalability to near-infinite computing resource on demand. Usage patterns of Internet-

    connected applications are highly-variable [119]. In order to respond to

    spikes in demand they need to be elastic, in that they need to execute using

    more machines during spikes to maintain a quality of service. Hypervisors

    allow scaling up to 3 000 virtual machines in a

    32 host pool [61]. In cloud computing environments, where hypervisor

    pools are less common, the number of virtual machines

    that can execute is bounded by economic factors. As virtual machines are

    fast to spawn, users can build more scalable software that responds to

    changes in demand by creating more virtual machines. Such requirements

    were never present in the early forms of virtualisation, as they operated

    before the creation of the Internet, so contemporary issues such as the

    slashdot effect and viral trends did not exist. Furthermore, the original

    workloads that execute on a hypervisor were non-interactive batch jobs,

    therefore they had different performance requirements to contemporary

    clouds, where request-latency is a key metric.

    Live migration of virtual machines. Modern hypervisors can transparently mi-

    grate virtual machines between physical hosts [124] without downtime [34]

    and similarly migrate and load-balance [59] storage between repositories

    without downtime [94]. This allows system administrators to perform

    maintenance on physical machines without disrupting a service executing

    on the virtual machines, since they can first migrate the instance onto an-

    other host. As organisations rarely had more than one mainframe when

    hypervisors were initially designed, this was not a use case of the pioneer-

    ing work. The downside of live migration is that if the virtual machine

    is migrated onto a highly-loaded or less powerful host then it may exe-

    cute more slowly. However, this decrease in performance is caused by the cloud

    provider, so is hard to detect with existing techniques.

    High isolation compared with other virtualisation techniques. Kernel security vul-

    nerabilities only affect the domain in which the vulnerability is used. Be-

    tween 2011 and 2013 there were 147 such exploits for Linux [2]. Compared

    with other virtualisation techniques that share the same kernel, hypervi-

    31

  • sor exploits are more rare, with Xen having had just one privilege escala-

    tion vulnerability from paravirtualised guests [140]. Since the invention of

    hypervisors this requirement has increased: Attack vectors are now more

    readily exploited and there are more commercial requirements for isolation

    of services.

    Backup and restore. There are advantages to providing backup and restore from

    outside of a domain [152], since it is fast [37] and does not require operat-

    ing system co-operation to access locked files and cannot be disabled by

    malicious software. Backup and restore was not a concern for hypervisor

    design in the 1960s.

    Accountability. Accountable virtual machines allow users to audit the software

    executing on remote hosts by having the software execute on top of a

    hypervisor that performs tamper-evident logging [64]. Using virtualisation

    for accountability is a new use-case for hypervisors that they were not

    originally designed for.

    Emulating legacy software. Windows 7 and later versions contain a hypervisor

    to execute Windows XP. When the Windows instance is a virtual machine

    the emulator then executes using nested virtualisation [66]. Whilst nested

    virtual machines were considered in early work [147], this was mainly a

    point of academic enlightenment.

    Emulating advances in time. As hypervisors emulate wall-clock time to their guests,

    they can be used to discover how software will behave at a future point in

    time [35] or when executing under future, faster hardware [109]. Emulat-

    ing changes in time was not an original design goal of hypervisors.

    I have described a number of ways in which hypervisors are used as part of

    mainstream cloud-computing environments. In particular, I have shown how

    the use cases for the hypervisor in 2015 differ from those in the 1960s and

    1970s when the classical definition of the hypervisor was developed. Due to

    this change in use case, it is reasonable to argue that strict adherence to an out-

    dated definition of the hypervisor should be challenged. One of the recurring

themes is the shift from serving a batch-processing workload to serving a request-response system in which users need high scalability and low latency in serving requests. Concurrently, virtual machines now execute

    in a less predictable environment, with untrusted parties, malicious actors and

automated scheduling all affecting the performance of virtual machines in ways that early virtual machines did not experience. As such, the importance of measuring performance has increased, to the point that fidelity now has lower utility than the ability to measure the performance of virtual machines.

    2.3 Virtualisation performance problems

Despite its popularity, a particular problem with virtualisation is that virtual machines are slower and have more variable performance than physical machines, yet it is difficult to measure the performance of a virtual machine.

    As well as contention for shared resources [117] there are other sources of

    slow performance, which I now explore.

    2.3.1 Privileged instructions

Under virtualisation certain instructions become more expensive, such as those that trigger a vmexit, which increases their cost by a factor of between five and twenty-five [122]. Also, as AMD64 effectively provides only two usable protection levels, a paravirtualised guest's user space and kernel space execute in the same ring and the hypervisor has to mediate every system call. This makes system calls more expensive in virtual

    machines than on physical machines, although by how much varies depending

    on hardware [31].
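The cost of this mediation can be observed directly. The following is a minimal userspace sketch of such a measurement (my own illustration, not taken from the cited work); running it on a physical machine and inside a paravirtualised guest indicates the difference, although the absolute numbers depend on the hardware and hypervisor, as noted above.

/* Sketch: measure the mean latency of a cheap system call.
 * syscall(SYS_getpid) is used so that the C library cannot satisfy the
 * call from a cached value in user space. Compile with: cc -O2 sys_lat.c */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    const long iters = 1000000;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);              /* forces a kernel entry each time */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("mean system-call latency: %.1f ns\n", ns / iters);
    return 0;
}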

    2.3.2 I/O

    I/O on virtual machines involves a longer data path than on physical machines

    since the hypervisor has to map blocks from the virtual disks exposed to its

guests to physical blocks on storage that is often remote. I/O operations are a regular source of slow performance [57, 26, 100], being around 20% slower than on physical machines, depending on configuration. Furthermore, the hypervisor's batching of I/O re-

    quests can lead to extreme arrival patterns [22].

2.3.3 Networking

    Networking in virtual machines can be unpredictable [98]: When executing on

    a CPU-contended host compared with a CPU-uncontended host, throughput

    can decrease by up to 87% and round trip time can increase from 10 ms to

67 ms [129]. On Xen, two causes of this are the back end of the split-driver being starved of CPU because the driver domain is not scheduled, and the front end of the split-driver being starved because the scheduler in the virtual machine does not schedule the driver during its scheduling quantum.

The effect of poor networking performance is that end-users observe significant reductions in quality of service, in both throughput and delay [26].

    2.3.4 Increased contention

    When executing as a virtual machine there is higher contention, caused by two

    sources: Other virtual machines being scheduled and the hypervisor/domain zero

executing. The hypervisor increases contention because each switch into the hypervisor requires a vmexit, which saves the state of the virtual machine and restores the state of the next domain [3]. Other virtual machines also cause performance interference,

    especially for micro virtual machines, which execute on physical hosts with low

    priority to use the spare CPU cycles left by other virtual machines. Such mi-

cro virtual machines are serviced poorly, and to get maximum performance for the instance type, virtual machines need to inject delays to be scheduled

    favourably [146].

    2.3.5 Locking

    Locking has long been known to be problematic on virtual machines. When

designing operating systems, programmers often protect data structures with mutexes and assume that each mutex is held only briefly, since holding a mutex on a shared data structure for a long time is expensive [114].

    However, when executing in a virtual machine there is the possibility of a vCPU

    being preempted whilst it holds a mutex, preventing other threads from making

progress [40]. Another problem is lock scalability: unless locks are modified to perform better under a hypervisor, they scale poorly with the number of vCPUs [76].
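To make the lock-holder preemption problem concrete, the following is a minimal ticket-lock sketch in C11 (an illustration of the general problem, not Linux's or Xen's actual implementation). If the hypervisor deschedules the vCPU of the thread inside the critical section, every waiter spins uselessly; paravirtualised spinlocks mitigate this by yielding to the hypervisor after a bounded number of spins.

/* Ticket-lock sketch (C11 atomics); illustrates lock-holder preemption. */
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;   /* ticket given to the next arriving thread          */
    atomic_uint owner;  /* ticket currently admitted to the critical section */
};

static void ticket_lock(struct ticket_lock *l)
{
    unsigned ticket = atomic_fetch_add(&l->next, 1);

    /* If the vCPU running the current owner is descheduled by the
     * hypervisor, every waiter burns its entire scheduling quantum in this
     * loop: no thread in the guest can make progress until the owner's
     * vCPU runs again. */
    while (atomic_load(&l->owner) != ticket)
        ;                               /* spin */
}

static void ticket_unlock(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1);     /* admit the next ticket holder */
}

int main(void)
{
    struct ticket_lock l = {0, 0};

    ticket_lock(&l);
    /* ... critical section ... */
    ticket_unlock(&l);
    return 0;
}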

2.3.6 Unpredictable timing

    When executing inside a virtual machine, time becomes unpredictable as virtu-

    alised time sources are unreliable and behave poorly under live migration [19].

Also, operations that one expects to take constant time can take an unpredictable amount of time. For instance, techniques such as kernel same-page merging can help reduce the memory overhead of executing in virtual machines

    by sharing identical pages between virtual machines [101]. However, when a vir-

    tual machine modifies a shared page the hypervisor traps and creates a copy of

    the page specifically for that virtual machine to modify. This makes page access

    times unpredictable from within the virtual machine [135].
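The variability introduced by copy-on-write sharing can be observed with a small experiment. The sketch below is my own illustration: it assumes a Linux system with KSM enabled, marks a buffer of identical pages as mergeable, and then times the first write to each page; writes that have to break sharing show up as slow outliers. The mechanism is analogous whether the sharing is performed by the kernel or by the hypervisor.

/* Sketch: detect variable page-write latency caused by page sharing.
 * Assumes Linux with KSM enabled (CONFIG_KSM and ksmd running). */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

#define NPAGES 4096
#define PAGE   4096

static long ns_between(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    size_t len = (size_t)NPAGES * PAGE;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memset(buf, 0xAA, len);             /* identical pages: merge candidates  */
    madvise(buf, len, MADV_MERGEABLE);  /* opt the region in to KSM           */
    sleep(60);                          /* give ksmd time to merge the pages  */

    long worst = 0, total = 0;
    for (int i = 0; i < NPAGES; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        buf[(size_t)i * PAGE] = 0x55;   /* first write may break copy-on-write */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        long ns = ns_between(t0, t1);
        total += ns;
        if (ns > worst)
            worst = ns;
    }
    printf("mean %ld ns, worst %ld ns for the first write to each page\n",
           total / NPAGES, worst);
    munmap(buf, len);
    return 0;
}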

    2.3.7 Summary

    Despite many advances, virtual machines remain slower and less predictable

    than physical machines. As it is unlikely that these issues will be completely

    removed, it is important that users of virtual machines are able to measure the

    performance of their virtual machine.

    2.4 The changing state of hypervisor fidelity

    Given the performance overhead of executing in a virtualised environment and

    the difficulty in measuring this performance in a virtual machine, I propose that

    virtual machines should forgo hypervisor fidelity for performance measurement

    techniques. Rather than treating the hypervisor as a physical machine for every-

    thing except the lowest layers of the kernel, performance measurement tools

    should be designed to execute well in a virtual environment and should co-

    operate with the hypervisor to maximise visibility of performance. Whilst this

    does involve changing the accepted use of the interface between virtual machines

    and hypervisors, I now show that changes to this interface have previously been

    used to ameliorate performance problems in the virtualisation domain.

    2.4.1 Historical changes to hypervisor fidelity

Even in the earliest work on hypervisors, there was acceptance that pure virtualisation may not be practical. A concern with early versions of CP was that it performed

slowly, which is largely attributable to using trap-and-emulate to prevent virtual

    machines from executing privileged instructions and causing them to execute an

    emulated version. To address this, the evolution into OS/370 introduces the idea

    of a hypercall [150], in which the virtualised operating system sets up some state

    to communicate with CP and then uses the DIAGNOSE instruction to transition

    context into CP [36]. By introducing the concept of a hypercall, IBM acknowl-

edged that building operating systems that adhere strictly to the definition of fidelity is not necessary. Rather, in cases where full emulation of physical hard-

    ware has a high cost, it is better to forgo fidelity by making the virtual machine

aware that it is executing on a hypervisor so that it executes a hypercall rather than

    perform the expensive operation.

    I argue that we have the same issue today, whereby current techniques for

    measuring the performance of a virtual machine execute the same code on vir-

    tual machines as they do on physical machines. Therefore, performance measure-

    ment tools have lower utility on virtual machines than physical machines as they

    use code that virtualises poorly and cannot report the cost incurred due to virtual-

    isation. As such, we should reconsider whether applying the technique employed

by IBM in 1973 to solve the problem of the day (namely poor performance)

    can solve the contemporary issue of it being difficult to measure the performance

    of virtual machines. In particular, we should consider using paravirtualised per-

    formance measurement techniques.

The invention of the hypercall created a debate that continued throughout

    the 1970s [56] regarding pure vs impure virtual machines, in which a pure vir-

    tual machine is a guest that runs unmodified code, whereas an impure virtual

machine runs modified code. In particular, there was consideration of the position of

    the hypervisor interface, since the hypervisor can either simulate high-level ac-

    tions, such as reading a line, or can simulate the individual instructions involved

in performing the high-level action [16].

    2.4.2 Recent changes to hypervisor fidelity

    With the popularisation of (early versions of) x86, virtualisation became harder

    as the instruction set does not provide trap-and-emulate ability for privileged in-

structions such as SIDT, SGDT and SLDT [121]. Therefore, to virtualise traditional

    x86, one has to use binary translation, the process by which the instruction

stream is scanned and privileged instructions are rewritten with function calls to

    emulating functions. Performing full binary translation is a slow process [78], so

early x86-64 hypervisors were either slow or insecure [121]. Those that are slow

    fail the hypervisor definition as they do not provide the performance property.

Furthermore, as Popek and Goldberg's hypervisor definition is tightly coupled with trap-and-emulate techniques in its formalisation of fidelity, such that virtual machines cannot execute a modified instruction stream, binary rewriting is not considered classical virtualisation [1].

To resolve the issues of virtualisation on traditional x86, Barham et al. built Xen, a hypervisor that uses paravirtualisation to emulate x86 with performance, strong isolation and unreduced functionality [12]. In using paravirtualisation,

    Xen requires that operating systems be modified to issue hypercalls rather than to

execute with true fidelity when issuing privileged instructions. One contribution of Xen was to paravirtualise the memory management unit, in that guests' page

    tables are mapped read-only and the guest has to issue a hypercall to update

    them. This design allows virtual machines to directly map virtual addresses to

    the addresses of the memory on the physical server (machine physical frames),

    rather than have shadow page tables that give the illusion of executing in an

independent address space. In forgoing hypervisor fidelity to overcome the shortcomings of x86, Xen is much like my proposal of forgoing fidelity to overcome the shortcomings of performance measurement of virtual machines.
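To illustrate the shape of this paravirtualised interface, the sketch below shows how a Xen paravirtualised Linux guest might ask the hypervisor to change a page-table entry. It assumes the in-kernel hypercall wrappers from asm/xen/hypercall.h and Xen's public mmu_update interface; pv_set_pte is an illustrative helper of my own, not a function in Linux or Xen, and the real code path batches many updates rather than issuing one hypercall per entry.

/* Sketch: updating a PTE via a hypercall in a Xen paravirtualised guest.
 * Assumes a Linux kernel built with Xen PV support; simplified from the
 * real code path, which batches many updates into one hypercall. */
#include <linux/types.h>
#include <xen/interface/xen.h>
#include <asm/xen/hypercall.h>

/* Install new_pte at the page-table entry whose machine address is
 * pte_machine_addr. Returns 0 on success or a negative error code. */
static int pv_set_pte(uint64_t pte_machine_addr, uint64_t new_pte)
{
    struct mmu_update req = {
        /* The low bits of ptr select the command; MMU_NORMAL_PT_UPDATE
         * means "ptr is the machine address of a PTE to overwrite". */
        .ptr = pte_machine_addr | MMU_NORMAL_PT_UPDATE,
        .val = new_pte,
    };
    int done = 0;

    /* The guest cannot write the PTE directly (its page tables are mapped
     * read-only), so it asks the hypervisor to perform the update. */
    return HYPERVISOR_mmu_update(&req, 1, &done, DOMID_SELF);
}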

More recent advances in the x86-64 instruction set undeniably restore a degree

    of fidelity to the hypervisor by allowing unmodified virtual machines to execute

    in a hardware virtual machine (HVM) container [141]. HVM containers extend

the x86-64 architecture so as to provide a privileged mode (sometimes described as negative rings) [123] in which the hypervisor executes, and into which the processor transitions when a guest issues privileged instructions. Whilst this increase in fidelity does create some

    advantages, for instance operating systems can migrate between executing as a

    physical and virtual instance [83], I nevertheless still argue that this increase in

fidelity only came when hardware had advanced sufficiently (for instance with Intel VT-x) such that fast and secure x86 virtualisation was no longer problematic. Should future hardware allow virtual machines to measure their performance to the same degree as physical machines, then restoring fidelity to measuring the perfor-

    mance of virtual machines may be reasonable. There is already limited evidence

of hardware advances increasing the ability of a virtual machine to measure its

    performance [104].

    2.4.3 Current state of hypervisor fidelity

    Despite the increase in hardware virtualisation, I still argue that it is common-

    place for the software stacks that execute on the hypervisor to not exhibit strict

    fidelity. This is principally due to the process of re-hosting an application on in-

    frastructure as a service, during which developers are encouraged to make use of

    properties of the cloud, such as the scalability of virtual machines [103]. How-

ever, within virtual machines there are differences in the software stack when

    compared with physical machines. As such, forgoing hypervisor fidelity in per-

    formance measurement techniques is not a radical move.

    2.4.3.1 Installing guest additions

    All high-performance hypervisors that use hardware virtualisation techniques

    still provide extensions to improve the performance of their guests: XenServer

Guest Tools, VirtualBox Guest Additions, and VMware Tools are some exam-

    ples. These typically provide drivers that allow the guest operating system to

    communicate directly with the hypervisor so that full emulation of devices is not

    required. However, installing such extensions reduces the fidelity of the virtual

    machine, since by using different drivers, the virtual machine executes differently

    on physical and virtual hardware.

    2.4.3.2 Moving services into dedicated domains

    There is a growing trend to use virtual machine introspection to provide ser-

vices that would traditionally have been provided by processes or the operating system [24]. For example, Bitdefender performs malware detection from outside the guest, in a separate, privileged domain, which prevents malware from attacking the mal-

    ware detection program, as it is commonplace for viruses to attack antivirus

    mechanisms [88]. Furthermore, most commercial hypervisors now support vir-

    tual machine snapshotting, a feature typically performed by the filesystem. There

    are also proposals to move monitoring into a separate domain [82]. Given the

    trend of separating services out such that they execute outside of the original do-

main, I argue that hardware virtualisation does not achieve full fidelity, since if

    those operating systems were to execute on physical hardware, they would need

    reconfiguring such that they execute processes to perform all of these features.

    2.4.3.3 Lack of transparency of HVM containers

    Even when executing inside a hardware virtual machine container, which is sup-

    posed to provide fidelity, the interface with the hypervisor still differs from that

    provided by exclusive use of hardware. One demonstration of this difference in

    interface is malware that detects the presence of a hypervisor through irregular-

    ities in the availability of resources, such as CPU cycles, caches, and the TLB,

    and refuses to execute a payload [149]. Furthermore, the timing properties of

a virtual machine differ from those of a physical machine: there is virtualisation overhead, the time required to access hardware emulated by the hypervisor changes, hidden page faults are caused by accesses to hypervisor-protected pages, and virtualised instructions (such as cpuid) have different timing from non-virtualised instructions (such as nop) [54]. Given that the interface to the hypervisor is

    leaky, I argue that we should acknowledge this difference throughout the soft-

    ware stack, rather than maintaining fidelity.
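The cpuid/nop timing difference mentioned above is simple to demonstrate. The following userspace sketch is an illustration of the general technique, not the cited malware: it compares the cost of cpuid, which causes a VM exit under hardware virtualisation, with that of a nop, which never leaves the guest; a large ratio suggests that a hypervisor is present.

/* Sketch: time cpuid against nop with the timestamp counter (x86-64,
 * GCC/Clang). cpuid traps to the hypervisor in an HVM guest; nop does not. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc() */

static inline void do_cpuid(void)
{
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx));
}

int main(void)
{
    const uint64_t iters = 100000;
    uint64_t t0, t1, t2;

    t0 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        do_cpuid();                    /* causes a VM exit under an HVM hypervisor */
    t1 = __rdtsc();
    for (uint64_t i = 0; i < iters; i++)
        __asm__ volatile("nop");       /* handled entirely within the guest */
    t2 = __rdtsc();

    printf("cpuid: %llu cycles/iteration, nop: %llu cycles/iteration\n",
           (unsigned long long)((t1 - t0) / iters),
           (unsigned long long)((t2 - t1) / iters));
    return 0;
}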

    2.4.3.4 Hypervisor/operating system semantic gap

    The performance of a virtual machine can be improved if the hypervisor is better

able to predict the virtual machine's actions. There are two main techniques for improving the prediction rates: monitoring the virtual machine with knowledge of its data structures so as to improve decisions and policies, which can increase the cache hit ratio of a virtual machine by up to 28% [73]; or moving

    functionality into the hypervisor from the guest [87]. The latter reduces fidelity

    and the former requires co-operation, therefore we observe deviation from the

    standard definition of a hypervisor.

    2.4.4 Summary

    I have now shown that since the advent of the hypervisor, forgoing hypervisor

fidelity has been a common way of solving problems in the realm of virtu-

    alisation. Even today, with hardware virtual machines, virtual machines do not

strictly provide fidelity. My demonstration that forgoing hypervisor fidelity has

successfully been used to solve past problems with virtualisation supports my

    thesis that the use of the interface should change so as to improve the utility of

    performance measurement tools.

2.5 Rethinking operating system design for hypervisors

There is considerable research literature that reconsiders, from the ground up, the role of the operating system when executing in the cloud, often forgoing fidelity to increase utility.

Library operating systems, such as OSv, recognise that in a typical cloud software stack there is a hypervisor, an operating system and a language runtime [80].

    Each of these performs abstraction and protection, at the cost of an increased

    footprint and performance overhead, such as a 22% impact on the throughput

    of lighttpd. Library operating systems replace everything that executes above

    the hypervisor with a single binary, so that the hypervisor performs abstraction

    and protection [80]. Similarly, Mirage is designed to execute on a hypervisor

    only, making use of the small hypervisor interface [91], thereby improving on

    Linux in terms of boot time, I/O throughput and memory footprint.

    SR-IOV increases fidelity by letting operating systems directly interact with

    the network interface card, with the hardware ensuring isolation [41]. However,

such hardware support can be used in unconventional ways: Dune is a hypervisor-like project that uses hardware virtualisation features to allow userspace direct access to safe

    hardware features, such as ring protection, page tables and the TLB [14]. Belay

    et al. achieve this by using hardware extensions built for virtualisation but have

    their lowest layer of software still expose an abstraction of a process, rather than

    hardware. Furthermore, Arrakis [113] and IX [15] use SR-IOV to separate the

control and data planes so as to increase the networking throughput of commodity

    hardware.

    The work that I present in this dissertation focuses on applying performance

    measurement techniques to mainstream operating systems in the cloud. As re-

    search operating systems are not yet mainstream I do not explicitly show the

    benefits that they would receive. However, the key techniques in all three of

my contributions could be applied to such operating systems, without causing

divergent behaviour between virtual and physical machines.

    2.6 Virtual machine performance measurement

    Having argued that the requirement for hypervisors to exhibit fidelity is overly-

    restrictive and that forgoing hypervisor fidelity has been previously used to solve

    problems in the virtualisation domain, I now explore work related to virtual

    machine performance.

    2.6.1 Kernel probing

    Probing has a rich history that goes back to the dawn of computers. The first

use of probing is believed to have been by Maurice Wilkes, who inserted sub-

    routines into code executing on the EDSAC. These sub-routines would print dis-

    tinctive symbols at intervals throughout a program so that the operator could

diagnose an error [55]. Later computers, starting with the UNIVAC M-460,

    included programs such as DEBUG that let operators specify addresses to insert

    additional code that could be used for debugging [47].

    Contemporary operating systems have a probing system to allow users to de-

    bug their software and measure its performance. Linux uses Kprobes [107] and

    Microsoft Windows uses Detours [69]. NetBSD [106], FreeBSD [96], and OS

X all use DTrace, which embeds a probing system within a wider instrumentation

    system. There has been further work on these systems to optimise them [68]

    as the benefits of fast probing have long been known [79]. However, with the

    exception of Windows Detours, these all use interrupt-based probing techniques.
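As an illustration of the interrupt-based style, the following is a minimal Linux kernel module using the Kprobes API; the probed symbol (do_sys_open here) is only an example and varies between kernel versions. Kprobes typically plants a breakpoint at the probed instruction, so firing a probe takes a trap, which is precisely the kind of operation that virtualises poorly.

/* Minimal Kprobes module sketch. The probed symbol name is an example
 * only; the right symbol depends on the kernel version. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>

static int pre_handler(struct kprobe *p, struct pt_regs *regs)
{
    /* Runs in the breakpoint (trap) handler each time the probe fires. */
    pr_info("probe fired at %s\n", p->symbol_name);
    return 0;
}

static struct kprobe kp = {
    .symbol_name = "do_sys_open",   /* example target, kernel-version dependent */
    .pre_handler = pre_handler,
};

static int __init probe_init(void)
{
    return register_kprobe(&kp);    /* plants a breakpoint at the target */
}

static void __exit probe_exit(void)
{
    unregister_kprobe(&kp);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");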

    Previous work has shown another technique for probing, based on jumps,

    which are often faster than executing interrupts [137, 138]. Windows Detours

    was the first of these jump-based probing systems that preserves the semantics

    of the target function as a callable subroutine [69]. However, whilst there is

    some benefit from using jump-based techniques on physical machines, I show

    that their utility when applied to virtual machines is much higher. This is due to

    interrupt-based techniques virtualising poorly.

    There has been some consideration of changing the nature of operating sys-

    tem probing in the virtualised environment by disaggregating probe handlers

into a separate domain [118]. However, this has not received popular uptake.

    2.6.2 Kernel specialisation

Kernel specialisation is not a new concept: Early work on the Synthesis kernel pioneered kernel specialisation by generating efficient kernel code that acts as fast-

    paths for applications [116]. The advantages of kernel specialisation are well

    known [23, 17]: Profile-guided optimisation of Linux improves the kernel per-

    formance by up to 10% [151] and exokernels [45] remove kernel abstractions

    so that applications interact with hardware through fewer layers of indirection,

    thereby reducing kernel overheads. For instance, Xok is an operating system

    with an exokernel whereby a specialised web server has over four times the

    throughput of a non-specialised web server [74]. Indeed, the benefits of spe-

    cialisation are a key feature of Barrelfish, an opearting system redesign to allow

    kernel specialisation such that cores run different kernels [125] and Dune for al-

    lowing applications access to privileged CPU features [14]. Another possible op-

    erating system redesign to allow kernel specialisation is using microkernels, since

    only a small set of features are then executed by an operating system mapped

    into every process, rather user space services can provide competing specialised

    implementations of features [86].

    In Chapter 4 I introduce Shadow Kernels, a technique that allows per-process

kernel specialisation by having applications acknowledge the presence of the

    hypervisor and execute code that causes the hypervisor to switch the underlying

memory of the domain's kernel. The key benefit of Shadow Kernels is to allow

multiple kernel instruction streams to execute on a single machine. Techniques for executing multiple kernels already exist; however, they all differ from Shadow Kernels. Executing processes inside virtual machines allows

    multiple kernels to execute on a single machine [36]. However, each kernel

    will still typically support multiple processes executing on it, whereas Shadow

    Kernels can target individual processes.

    The technique used in Shadow Kernels of modifying kernel instruction streams

    is well-established. For instance, KSplice modifies the kernel instruction stream

    to binary patch security updates into a kernel without rebooting the machine [7],

    but this is a global change that affects all processes, whereas Shadow Kernels

    can restrict that patch to an individual process. Furthermore, malware can use

memory management tricks to hide itself from detection by unmapping memory

    containing the rootkit [133]. Shadow Kernels differs in that rather than hid-

    ing malware it allows multiple kernel instruction streams to coexist. Similarly,

    Mondrix uses changes to the MMU to provide isolation between Linux kernel

    modules [148], albeit with a performance overhead of up to 15%.

    2.6.3 Performance interference

    A key concern with executing virtual machines in the cloud is performance in-

terference, whereby two or more virtual machines compete for resources. Hyper-

    visors are designed to have strong performance isolation guarantees, by having

    coarse-grained scheduling and no sharing of data structures between virtualisa-

tion domains [12]. In particular, many services in the cloud, as well as in other circumstances [48], are latency-sensitive in that they require low and predictable latency [32]. However, achieving predictable latency without performance isola-

    tion is hard. This lack of perfect performance isolation makes it difficult to virtu-

    alise some workloads [67]. Whilst executing in the cloud allows some detection

    of performance anomalies before deploying some services [134], this remains an

    unsolved problem in the general case.

    2.6.3.1 Measurement

Researchers have long studied methods of reducing performance interference in

    operating systems, in particular with the rise of latency-sensitive applications

such as video streaming [66]. With the advent of hypervisors, there has been fur-

    ther work in reducing performance interference, whilst increasing utilisation of

    hardware by using a custom scheduler that limits the resources consumed by vir-

    tual machines in their domain and in driver domains, such as domain zero [63].

    However, in current cloud deployments, virtual machine workloads can in-

terfere badly with each other. For instance, the IOPS available to a virtual machine can fluctuate wildly depending on the other virtual machines executing [60], and poor scheduling causes performance interference: colocating a random and a sequential workload reduces performance for the sequential workload [58].

    Some work improves on the performance guarantees in the cloud, for example

    with virtual datacentres that have guaranteed throughput. An implementation

    of a virtual datacentre is Pulsar, which modifies the hypervisors in the cloud to

use a leaky bucket per virtual machine on shared resources to guarantee perfor-

    mance [4].
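The leaky-bucket mechanism itself is simple; the sketch below is a generic illustration of the technique, not Pulsar's implementation. Each virtual machine's bucket drains at its guaranteed rate, and a request is admitted only if the bucket has room for its cost.

/* Generic leaky-bucket sketch for rate-limiting a shared resource. */
#include <stdbool.h>
#include <stdio.h>

struct bucket {
    double level;      /* currently queued work, in cost units          */
    double capacity;   /* burst allowance                               */
    double drain_rate; /* guaranteed throughput, cost units per second  */
    double last;       /* time of the previous decision, in seconds     */
};

static bool bucket_admit(struct bucket *b, double now, double cost)
{
    /* Drain the bucket for the time elapsed since the last decision. */
    double drained = (now - b->last) * b->drain_rate;
    b->level = b->level > drained ? b->level - drained : 0.0;
    b->last = now;

    if (b->level + cost > b->capacity)
        return false;              /* over the guaranteed rate: defer or queue */
    b->level += cost;
    return true;                   /* within the guarantee: admit the request  */
}

int main(void)
{
    /* 100 cost-units of burst capacity, guaranteed drain rate of 50 units/s */
    struct bucket vm = { 0.0, 100.0, 50.0, 0.0 };

    for (int i = 0; i < 8; i++) {
        double now = i * 0.5;      /* a 40-unit request every 0.5 s */
        printf("t=%.1fs admitted=%d\n", now, bucket_admit(&vm, now, 40.0));
    }
    return 0;
}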

    Whilst guaranteeing performance isolation is preferable, whenever the ma-

    chine is saturated by its virtual machines there is necessarily performance interfer-

    ence, in which case monitoring and reporting the performance is possible. There

    are many ways of measuring the performance of an operating system. Modern

    operating systems, such as Linux, have a wealth of tools to help measure operat-

ing system performance. For instance, Linux has ftrace, perf, SystemTap [43],

    KLogger [46] and numerous domain-specific tools. Another method, originally

    implemented on a modified Digital UNIX 4.0D kernel, reports the resource con-

    sumption of resource containers, rather than of processes and threads [11].

However, none of these methods distinguishes poor application performance from the overheads of virtualisation. That is, these tools are unable to

    report if the virtual machine is starved of resources. Not only do these tools

not inform users of virtualisation overhead, they are often unable to access the

    same set of hardware features as a physical machine to accurately report per-

    formance to domains.1 Xenoprof is currently the only attempt to provide Xen

    virtual machines with a way of measuring performance [99]. However, Xeno-

    prof is incompatible with recent versions of Xen. The technique that I present

    in Chapter 5 differs in that it requires developers to annotate their programs to

indicate the processing of requests (much like is required by X-trace [51]) but

    then reports the overheads of virtualisation, rather than the performance of the

    virtual machine, and gives these details on a per-request basis. Calculating this

    overhead requires applications to have information about how the virtual ma-

    chine in which they execute is scheduled. Having a hypervisor expose its inner

    state is similar to how Infokernels expose kernel internals across the interface

    with applications [8].

    2.6.3.2 Modelling

    There has been work performed by the modelling community that looks into

    performance interference between virtual machines. This work largely models

    which workloads interact badly with each other in order to build better virtual

    machine placement algorithms. This differs from the technique that I present

1 vPMU is an upcoming (as of 2015-09-17) feature for Xen and Linux.

in Chapter 5, which is a technique for measuring the

    performance of clouds as they execute. An example of modelling performance

    interference is hALT, which uses machine learning trained on a dataset from

Google [120] to model which workloads cause performance interference [28].

    Q-Clouds models CPU-bound virtual machines using a multiple-input multiple-

output model whereby online feedback from an application is used as an input to the model and the output is used to place virtual machines more

    effectively [102]. TRACON is similar to Q-Clouds, but focusses on I/O-intensive

    workloads [28]. Casale et al. produce models of virtual machine disk perfor-

mance, based on monitoring the hypervisor's batching of I/O requests and the

    arrival queue [22]. CloudScope improves on modelling the performance of vir-

    tual machine interference by doing away with the need for machine learning or

queuing-based models: it models virtual machine performance using Markov

    chains to achieve a low-error model that is not tightly coupled with an applica-

    tion [25].

    This work all differs from Soroban in that it is modelling the performance

    of an entire virtual machine. The virtual machine being modelled is typically

assumed to be in a steady state for a prolonged period of time (perhaps several

    minutes in length) and the model finds the best placement of virtual machines to

    minimise performance interference. However, Soroban is a measurement tech-

    nique that reports the additional latency incurred in servicing a single request in

a request-response system. That is, Soroban measures whether, during the servicing of a request, the virtual machine was scheduled out, and reports the corresponding cost.

    2.6.3.3 Summary

    I have shown that there is a field of work that considers how to instrument

    and measure the performance of