
Transcript of slides on Xen 3.0

Page 1: slides on Xen 3.0

Xen 3.0 and the Art of Virtualization

Ian Pratt, Keir Fraser, Steven Hand, Christian Limpach, Andrew Warfield, Dan Magenheimer (HP), Jun Nakajima (Intel), Asit Mallick (Intel)

Computer Laboratory

Page 2: slides on Xen 3.0

Outline

Virtualization Overview
Xen Architecture
New Features in Xen 3.0
VM Relocation
Xen Roadmap

Page 3: slides on Xen 3.0

Virtualization Overview

Single OS image: Virtuozzo, Vservers, Zones. Groups user processes into resource containers; hard to get strong isolation.

Full virtualization: VMware, VirtualPC, QEMU. Runs multiple unmodified guest OSes; hard to efficiently virtualize x86.

Para-virtualization: UML, Xen. Runs multiple guest OSes ported to a special architecture; the Xen/x86 architecture is very close to normal x86.

Page 4: slides on Xen 3.0

Virtualization in the Enterprise


Consolidate under-utilized servers to reduce CapEx and OpEx

Avoid downtime with VM Relocation

Dynamically re-balance workload to guarantee application SLAs

Enforce security policy

Page 5: slides on Xen 3.0

Xen Today: Xen 2.0.6

Secure isolation between VMs
Resource control and QoS
Only the guest kernel needs to be ported; user-level apps and libraries run unmodified
Linux 2.4/2.6, NetBSD, FreeBSD, Plan9, Solaris
Execution performance close to native
Broad x86 hardware support
Live Relocation of VMs between Xen nodes

Page 6: slides on Xen 3.0

Para-Virtualization in Xen

Xen extensions to the x86 arch: like x86, but Xen is invoked for privileged ops. Avoids binary rewriting and minimizes the number of privilege transitions into Xen. Modifications are relatively simple and self-contained.

Modify the kernel to understand the virtualised environment:
Wall-clock time vs. virtual processor time: both types of alarm timer are desirable
Expose real resource availability: enables the OS to optimise its own behaviour

Page 7: slides on Xen 3.0

Xen 2.0 Architecture

[Diagram: Xen 2.0 architecture. The Xen Virtual Machine Monitor runs directly on the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE) and exposes a control interface, safe hardware interface, event channels, and a virtual CPU and virtual MMU. VM0 runs a XenLinux guest containing the device manager and control software, with native device drivers; VM1 runs a XenLinux guest with native device drivers and backend drivers; VM2 (XenLinux) and VM3 (XenBSD) run unmodified user software over front-end device drivers.]

Page 8: slides on Xen 3.0

Xen 3.0 Architecture

[Diagram: Xen 3.0 architecture, same structure as Xen 2.0: the Xen VMM on the hardware exposes a control interface, safe hardware interface, event channels, virtual CPU and virtual MMU; VM0 runs XenLinux with the device manager, control software and native drivers, VM1 runs XenLinux with native drivers and backends, VM2 runs XenLinux with front-end drivers. New in 3.0: VM3 is an unmodified guest OS (WinXP) running via VT-x, and the platform labels now include x86_32, x86_64, IA64, AGP, ACPI, PCI and SMP.]

Page 9: slides on Xen 3.0

x86_32

Xen reserves the top of the VA space
Segmentation protects Xen from the kernel
System call speed unchanged
Xen 3 now supports PAE for >4GB memory

[Diagram: 4GB virtual address space. Xen occupies the top of the 4GB space (supervisor, ring 0), the kernel sits beneath it above 3GB (supervisor, ring 1), and user space runs from 0GB to 3GB (user, ring 3).]

Page 10: slides on Xen 3.0

x86_64

Large VA space makes life a lot easier, but:
No segment-limit support
Need to use page-level protection to protect the hypervisor

[Diagram: 64-bit address space from 0 to 2^64. User space occupies the bottom canonical region up to 2^47 (marked U); the non-canonical range between 2^47 and 2^64-2^47 is reserved; the kernel (U) and Xen (S) occupy the top canonical region.]

Page 11: slides on Xen 3.0

x86_64

Run user-space and the kernel in ring 3 using different pagetables
Two PGDs (PML4s): one with user entries; one with user plus kernel entries
System calls require an additional syscall/sysret transition via Xen
Per-CPU trampoline to avoid needing GS in Xen

[Diagram: syscall/sysret path. User space (ring 3, U pages) enters Xen (ring 0, S pages), which bounces the call to the guest kernel (also ring 3, U pages) and back.]

Page 12: slides on Xen 3.0

Para-Virtualizing the MMU

Guest OSes allocate and manage their own PTs; a hypercall is used to change the PT base
Xen must validate PT updates before use; this allows incremental updates and avoids revalidation
Validation rules applied to each PTE:
1. Guest may only map pages it owns*
2. Pagetable pages may only be mapped RO
Xen traps PTE updates and emulates them, or 'unhooks' the PTE page for bulk updates (see the sketch below)
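To make the batching concrete, here is a minimal C sketch of how a paravirtualized guest kernel might queue PTE writes and hand them to the hypervisor in one hypercall. It is modeled loosely on Xen's mmu_update interface, but the request type, the hypervisor_mmu_update stub, and the batch size are illustrative assumptions, not the real guest code.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t ptr;   /* machine address of the PTE to update */
        uint64_t val;   /* new PTE value (machine frame + flags) */
    } mmu_update_req;

    /* Hypercall stub: assumed to be provided by the guest's Xen glue code. */
    extern int hypervisor_mmu_update(mmu_update_req *reqs, unsigned count,
                                     unsigned *done);

    #define BATCH 64
    static mmu_update_req batch[BATCH];
    static unsigned batched;

    void flush_pte_updates(void);

    /* Queue one PTE write instead of touching the (read-only) pagetable page. */
    void queue_pte_update(uint64_t pte_maddr, uint64_t new_val)
    {
        batch[batched].ptr = pte_maddr;
        batch[batched].val = new_val;
        if (++batched == BATCH)
            flush_pte_updates();
    }

    /* Hand the whole batch to Xen in a single hypercall. */
    void flush_pte_updates(void)
    {
        unsigned done = 0;

        if (batched == 0)
            return;
        if (hypervisor_mmu_update(batch, batched, &done) != 0) {
            /* Some entry violated a validation rule (frame not owned by the
             * guest, or it would make a pagetable page writable). */
        }
        batched = 0;
    }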

Page 13: slides on Xen 3.0

Writeable Page Tables: 1 – Write fault

[Diagram: the guest OS reads the virtual → machine pagetable directly; its first write to a pagetable page is trapped by the MMU as a page fault into the Xen VMM.]

Page 14: slides on Xen 3.0

Writeable Page Tables: 2 – Emulate?

[Diagram: on that first write, Xen decides whether to emulate the single update ("emulate? yes") against the virtual → machine pagetable; guest reads continue to go directly to it.]

Page 15: slides on Xen 3.0

Writeable Page Tables: 3 – Unhook

[Diagram: for bulk updates Xen unhooks the pagetable page from the active virtual → machine mapping (shown crossed out), so subsequent guest writes go straight to the page without trapping.]

Page 16: slides on Xen 3.0

Writeable Page Tables: 4 – First Use

[Diagram: when the guest first uses a mapping that goes through the still-unhooked page, the access raises a page fault into Xen.]

Page 17: slides on Xen 3.0

Writeable Page Tables: 5 – Re-hook

[Diagram: Xen validates the modified entries and re-hooks the page into the virtual → machine pagetable; guest reads and writes then proceed normally.]
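Pulling the five diagrams above together, the following C-style sketch shows the decision logic on a pagetable write fault and on first use of an unhooked page. All function and type names (pt_page, emulate_single_pte_write, unhook_from_pagetable, and so on) are hypothetical; they illustrate the state machine, not Xen's actual implementation.

    /* Hypothetical helpers standing in for the real VMM operations. */
    typedef struct {
        int hooked;     /* page is referenced from the active pagetable tree */
        int writable;   /* guest currently has a writable mapping of it */
    } pt_page;

    extern void emulate_single_pte_write(pt_page *pg);
    extern void unhook_from_pagetable(pt_page *pg);
    extern void validate_all_entries(pt_page *pg);
    extern void rehook_into_pagetable(pt_page *pg);

    /* Called when the guest faults writing a read-only pagetable page. */
    void on_pt_write_fault(pt_page *pg, int expect_bulk_update)
    {
        if (!expect_bulk_update) {
            emulate_single_pte_write(pg);   /* validate and apply just one PTE */
            return;
        }
        unhook_from_pagetable(pg);          /* detach so further writes need no traps */
        pg->hooked = 0;
        pg->writable = 1;
    }

    /* Called when the guest next uses a mapping that goes through an
     * unhooked page (this shows up as a page fault). */
    void on_first_use_fault(pt_page *pg)
    {
        validate_all_entries(pg);           /* re-check ownership / RO rules */
        rehook_into_pagetable(pg);
        pg->hooked = 1;
        pg->writable = 0;
    }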

Page 18: slides on Xen 3.0

MMU Micro-Benchmarks

[Bar chart: page fault (µs) and process fork (µs), normalized to native Linux (relative scale 0.0-1.1). lmbench results on Linux (L), Xen (X), VMware Workstation (V), and UML (U).]

Page 19: slides on Xen 3.0

SMP Guest Kernels

Xen extended to support multiple VCPUs
Virtual IPIs sent via Xen event channels
Currently up to 32 VCPUs supported
Simple hotplug/unplug of VCPUs, from within the VM or via the control tools
Optimize the one-active-VCPU case by binary patching spinlocks

Page 20: slides on Xen 3.0

SMP Guest Kernels

Takes great care to get good SMP performance while remaining secure; requires extra TLB synchronization IPIs
The paravirtualized approach enables several important benefits: it avoids many virtual IPIs, allows 'bad preemption' avoidance, and supports automatic hot plug/unplug of CPUs
SMP scheduling is a tricky problem: strict gang scheduling leads to wasted cycles

Page 21: slides on Xen 3.0

I/O Architecture

Xen IO-Spaces delegate to guest OSes protected access to specified h/w devices
Virtual PCI configuration space
Virtual interrupts
(Need IOMMU for full DMA protection)

Devices are virtualised and exported to other VMs via Device Channels, a safe asynchronous shared-memory transport (see the sketch below)
'Backend' drivers export to 'frontend' drivers
Net: use normal bridging, routing, iptables
Block: export any block device, e.g. sda4, loop0, vg3

(Infiniband / Smart NICs for direct guest IO)
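As a rough illustration of what a device channel looks like, the C sketch below models a single-producer/single-consumer request ring in a shared page, with an event-channel "kick" to wake the other side. The structures and helpers (notify_via_event_channel, handle_block_request) are assumptions for the example; Xen's real ring protocol and grant tables are more involved.

    #include <stdint.h>

    #define RING_SIZE 32                        /* must be a power of two */

    typedef struct {
        uint64_t id;                            /* request identifier */
        uint64_t sector;                        /* e.g. block request parameters */
        uint32_t nr_segments;
    } blk_request;

    typedef struct {
        volatile uint32_t req_prod;             /* written by the frontend */
        volatile uint32_t req_cons;             /* written by the backend */
        blk_request ring[RING_SIZE];
    } shared_ring;                              /* lives in a shared page */

    extern void notify_via_event_channel(int port);          /* hypothetical kick */
    extern void handle_block_request(const blk_request *req); /* hypothetical handler */

    /* Frontend: queue a request and notify the backend domain. */
    int frontend_submit(shared_ring *r, int evtchn, const blk_request *req)
    {
        uint32_t prod = r->req_prod;
        if (prod - r->req_cons == RING_SIZE)
            return -1;                          /* ring full */
        r->ring[prod & (RING_SIZE - 1)] = *req;
        __sync_synchronize();                   /* publish the entry before the index */
        r->req_prod = prod + 1;
        notify_via_event_channel(evtchn);
        return 0;
    }

    /* Backend: drain all outstanding requests. */
    void backend_poll(shared_ring *r)
    {
        uint32_t prod;
        while ((prod = r->req_prod) != r->req_cons) {
            __sync_synchronize();               /* read the index before the entries */
            while (r->req_cons != prod) {
                blk_request req = r->ring[r->req_cons & (RING_SIZE - 1)];
                handle_block_request(&req);
                r->req_cons++;
            }
        }
    }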

Page 22: slides on Xen 3.0

VT-x / (Pacifica)

Enable guest OSes to be run without para-virtualization modifications, e.g. legacy Linux, Windows XP/2003

CPU provides traps for certain privileged instrs

Shadow page tables used to provide MMU virtualization

Xen provides simple platform emulation: BIOS, Ethernet (ne2k), and IDE emulation

(Install paravirtualized drivers after booting for high-performance IO)

Page 23: slides on Xen 3.0

[Diagram: Xen 3.0 architecture with VT-x ("VMX") guests. Domain 0 runs Linux xen64 with the control panel (xm/xend), native device drivers, and backend virtual drivers; another Linux xen64 domain (Domain N) runs frontend virtual drivers. Unmodified 32-bit and 64-bit guest OSes run in VMX guest VMs on top of a guest BIOS and a virtual platform (I/O: PIT, APIC, PIC, IOAPIC; processor; memory), with frontend virtual drivers and device models providing emulated devices. VMX guests enter the hypervisor via VMExit and are re-entered via callback/hypercall; the hypervisor provides the control interface, hypercalls, event channels, and the scheduler.]

Page 24: slides on Xen 3.0

MMU Virtualization: Shadow-Mode

[Diagram: the guest OS reads and writes its own virtual → pseudo-physical pagetables; the VMM propagates updates into a shadow virtual → machine pagetable used by the hardware MMU, and reflects accessed & dirty bits back to the guest tables.]
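A minimal sketch of the propagation step is shown below: when the VMM traps a guest write to a guest PTE, it translates the pseudo-physical frame to a machine frame and writes the corresponding shadow entry. The names (shadow_pte_write, pseudo_phys_to_machine) and the flag layout are illustrative, not Xen's shadow code.

    #include <stdint.h>

    #define PTE_PRESENT 0x1ULL
    #define PFN_SHIFT   12

    /* Assumed p2m lookup: pseudo-physical frame number -> machine frame number. */
    extern uint64_t pseudo_phys_to_machine(uint64_t pfn);

    /* Called by the VMM when it traps a guest write to a guest PTE. */
    void shadow_pte_write(uint64_t *shadow_pte, uint64_t new_guest_pte)
    {
        if (!(new_guest_pte & PTE_PRESENT)) {
            *shadow_pte = 0;                          /* not present: clear the shadow */
            return;
        }
        uint64_t gpfn  = new_guest_pte >> PFN_SHIFT;            /* pseudo-physical frame */
        uint64_t mfn   = pseudo_phys_to_machine(gpfn);          /* real machine frame */
        uint64_t flags = new_guest_pte & ((1ULL << PFN_SHIFT) - 1);
        *shadow_pte = (mfn << PFN_SHIFT) | flags;               /* hardware-visible PTE */
    }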

Page 25: slides on Xen 3.0

VM Relocation : Motivation

VM relocation enables:
High-availability: machine maintenance
Load balancing: statistical multiplexing gain

[Diagram: a VM being relocated from one Xen host to another.]

Page 26: slides on Xen 3.0

Assumptions

Networked storage
NAS: NFS, CIFS
SAN: Fibre Channel
iSCSI, network block devices
drbd network RAID

Good connectivity
Common L2 network
L3 re-routeing

[Diagram: two Xen hosts attached to shared storage over the network.]

Page 27: slides on Xen 3.0

Challenges

VMs have lots of state in memory
Some VMs have soft real-time requirements, e.g. web servers, databases, game servers; they may be members of a cluster quorum, so down-time must be minimized
Performing relocation itself requires resources, which must be bounded and controlled

Page 28: slides on Xen 3.0

Relocation Strategy

Stage 0 (pre-migration): VM active on host A; destination host selected (block devices mirrored)
Stage 1 (reservation): initialize container on target host
Stage 2 (iterative pre-copy): copy dirty pages in successive rounds
Stage 3 (stop-and-copy): suspend VM on host A; redirect network traffic; synch remaining state
Stage 4 (commitment): activate on host B; VM state on host A released
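The stages above boil down to the iterative pre-copy loop sketched here in C. The helpers (get_dirty_bitmap, send_pages, suspend_vm, and so on) and the stopping thresholds are assumptions for illustration; the real implementation lives in Xen's migration tools.

    #include <stddef.h>

    typedef unsigned long pfn_t;

    extern size_t get_dirty_bitmap(pfn_t *dirty, size_t max);   /* from dirty logging */
    extern void   send_pages(const pfn_t *pfns, size_t n);      /* copy to host B */
    extern void   suspend_vm(void);
    extern void   send_cpu_and_device_state(void);
    extern void   activate_on_destination(void);

    #define MAX_ROUNDS         30
    #define SMALL_ENOUGH_PAGES 50   /* stop-and-copy once the residue is tiny */

    void migrate_vm(size_t total_pages, pfn_t *all_pfns, pfn_t *dirty)
    {
        /* Round 0: send every page while the VM keeps running. */
        send_pages(all_pfns, total_pages);

        /* Rounds 1..N: resend only the pages dirtied during the previous round. */
        for (int round = 1; round < MAX_ROUNDS; round++) {
            size_t n = get_dirty_bitmap(dirty, total_pages);
            if (n < SMALL_ENOUGH_PAGES)
                break;                      /* writable working set is small enough */
            send_pages(dirty, n);
        }

        /* Stage 3: brief stop-and-copy for whatever is still dirty. */
        suspend_vm();
        size_t n = get_dirty_bitmap(dirty, total_pages);
        send_pages(dirty, n);
        send_cpu_and_device_state();

        /* Stage 4: commit and activate on the destination. */
        activate_on_destination();
    }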

Page 29: slides on Xen 3.0

Pre-Copy Migration: Round 1

Page 30: slides on Xen 3.0

Pre-Copy Migration: Round 1

Page 31: slides on Xen 3.0

Pre-Copy Migration: Round 1

Page 32: slides on Xen 3.0

Pre-Copy Migration: Round 1

Page 33: slides on Xen 3.0

Pre-Copy Migration: Round 1

Page 34: slides on Xen 3.0

Pre-Copy Migration: Round 2

Page 35: slides on Xen 3.0

Pre-Copy Migration: Round 2

Page 36: slides on Xen 3.0

Pre-Copy Migration: Round 2

Page 37: slides on Xen 3.0

Pre-Copy Migration: Round 2

Page 38: slides on Xen 3.0

Pre-Copy Migration: Round 2

Page 39: slides on Xen 3.0

Pre-Copy Migration: Final

Page 40: slides on Xen 3.0

Writable Working Set

Pages that are dirtied must be re-sent
Super hot pages, e.g. process stacks; top of page free list
Buffer cache
Network receive / disk buffers
Dirtying rate determines VM down-time
Shorter iterations → less dirtying → …

Page 41: slides on Xen 3.0

Rate Limited Relocation

Dynamically adjust the resources committed to performing the page transfer
Dirty logging costs the VM ~2-3%; CPU and network usage are closely linked
E.g. first copy iteration at 100Mb/s, then increase based on the observed dirtying rate
Minimize the impact of relocation on the server while minimizing down-time (a sketch of such a policy follows)
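A minimal sketch of such a rate-adaptation policy, assuming a per-round bandwidth cap derived from the observed dirtying rate plus a fixed margin; the constants and function name are illustrative, not Xen's actual numbers.

    #define MIN_RATE_MBPS  100   /* first iteration runs at a gentle 100 Mb/s */
    #define MAX_RATE_MBPS 1000   /* never exceed what the link / SLA allows */
    #define RATE_MARGIN     50   /* send somewhat faster than pages are dirtied */

    /* Choose the bandwidth cap for the next copy round. Transferring slower
     * than the dirtying rate would never converge and would grow down-time. */
    int next_round_rate(int observed_dirty_rate_mbps)
    {
        int rate = observed_dirty_rate_mbps + RATE_MARGIN;
        if (rate < MIN_RATE_MBPS)
            rate = MIN_RATE_MBPS;
        if (rate > MAX_RATE_MBPS)
            rate = MAX_RATE_MBPS;
        return rate;
    }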

Page 42: slides on Xen 3.0

Web Server Relocation

Page 43: slides on Xen 3.0

Iterative Progress: SPECWeb


Page 44: slides on Xen 3.0

Iterative Progress: Quake3

Page 45: slides on Xen 3.0

Quake 3 Server relocation

Page 46: slides on Xen 3.0

Extensions

Cluster load balancing: pre-migration analysis phase; optimization over coarse timescales
Evacuating nodes for maintenance: move easy-to-migrate VMs first
Storage-system support for VM clusters: decentralized, data replication, copy-on-write
Wide-area relocation: IPSec tunnels and CoW network mirroring

Page 47: slides on Xen 3.0

Current 3.0 Status

Feature matrix across x86_32, x86_32p, x86_64, IA64, and Power for: Domain 0, Domain U, SMP guests, Save/Restore/Migrate, >4GB memory, VT, and driver domains. Entries recoverable from the slide: SMP guests are new; Save/Restore/Migrate is provided via the tools (~tools); >4GB memory is 16GB on x86_32p and 4TB on x86_64 (Power unknown); VT support is 64-on-64.

Page 48: slides on Xen 3.0

3.1 Roadmap

Improved full-virtualization support: Pacifica / VT-x abstraction
Enhanced control tools project
Performance tuning and optimization: less reliance on manual configuration
Infiniband / Smart NIC support
(NUMA, virtual framebuffer, etc.)

Page 49: slides on Xen 3.0

Research Roadmap

Whole-system debugging: lightweight checkpointing and replay; cluster/distributed system debugging
Software-implemented h/w fault tolerance: exploit deterministic replay
VM forking: lightweight service replication and isolation
Secure virtualization: multi-level secure Xen

Page 50: slides on Xen 3.0

Conclusions

Xen is a complete and robust GPL VMM
Outstanding performance and scalability
Excellent resource control and protection
Vibrant development community
Strong vendor support

http://xen.sf.net

Page 51: slides on Xen 3.0

Thanks!

The Xen project is hiring, in Cambridge (UK), Palo Alto, and New York

[email protected]

Computer Laboratory

Page 52: slides on Xen 3.0

Backup slides

Page 53: slides on Xen 3.0

Isolated Driver VMs

Run device drivers in separate domains

Detect failure, e.g. illegal access or timeout
Kill the domain and restart it, e.g. a 275ms outage from a failed Ethernet driver

[Chart: performance over time (0-40 s, y-axis 0-350), showing the brief ~275ms outage while the failed Ethernet driver domain is killed and restarted.]

Page 54: slides on Xen 3.0

Device Channel Interface

Page 55: slides on Xen 3.0

Scalability

Scalability is principally limited by application resource requirements: several tens of VMs on server-class machines
Balloon driver used to control domain memory usage by returning pages to Xen (see the sketch below); normal OS paging mechanisms can deflate quiescent domains to <4MB
Xen per-guest memory usage <32KB
Additional multiplexing overhead negligible
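As a rough sketch of how ballooning works, the C code below allocates pages inside the guest and returns the underlying frames to Xen, shrinking the domain's footprint. The hypercall stub and helpers (hypervisor_decrease_reservation, guest_alloc_page, page_to_machine_frame) are illustrative stand-ins, not the real driver.

    #include <stddef.h>

    typedef unsigned long pfn_t;

    extern void  *guest_alloc_page(void);                 /* guest page allocator */
    extern pfn_t  page_to_machine_frame(void *page);
    extern int    hypervisor_decrease_reservation(pfn_t *frames, size_t n);

    #define BALLOON_BATCH 256

    /* Inflate the balloon by nr_pages: the pages stay allocated inside the
     * guest (so the OS will not touch them) while the frames go back to Xen.
     * Returns the number of pages actually given back. */
    long balloon_inflate(size_t nr_pages)
    {
        pfn_t frames[BALLOON_BATCH];
        size_t done = 0;

        while (done < nr_pages) {
            size_t n = 0;
            while (n < BALLOON_BATCH && done + n < nr_pages) {
                void *page = guest_alloc_page();
                if (page == NULL)
                    break;                     /* guest is out of spare memory */
                frames[n++] = page_to_machine_frame(page);
            }
            if (n == 0)
                break;
            if (hypervisor_decrease_reservation(frames, n) != 0)
                break;                         /* Xen refused; stop inflating */
            done += n;
            if (n < BALLOON_BATCH)
                break;                         /* allocator ran dry mid-batch */
        }
        return (long)done;
    }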

Page 56: slides on Xen 3.0

System Performance

[Bar chart: SPEC INT2000 (score), Linux build time (s), OSDB-OLTP (tup/s), and SPEC WEB99 (score), normalized to native Linux (relative scale 0.0-1.1). Benchmark suite running on Linux (L), Xen (X), VMware Workstation (V), and UML (U).]

Page 57: slides on Xen 3.0

TCP results

[Bar chart: TCP bandwidth (Mbps) for Tx and Rx at MTU 1500 and MTU 500, normalized to native Linux (relative scale 0.0-1.1). TCP bandwidth on Linux (L), Xen (X), VMware Workstation (V), and UML (U).]

Page 58: slides on Xen 3.0

Scalability

[Bar chart: aggregate score (0-1000) for 2, 4, 8, and 16 simultaneous SPEC WEB99 instances on Linux (L) and Xen (X).]

Page 59: slides on Xen 3.0

Resource Differentiation

[Bar chart: 2, 4, 8, and 8(diff) simultaneous OSDB-IR and OSDB-OLTP instances on Xen; y-axis is aggregate throughput relative to one instance (0.0-2.0).]

Page 60: slides on Xen 3.0

Page 61: slides on Xen 3.0

Page 62: slides on Xen 3.0