slides on Xen 3.0
-
Upload
cameroon45 -
Category
Documents
-
view
621 -
download
2
Transcript of slides on Xen 3.0
Xen 3.0 and the Art of Virtualization
Ian PrattKeir Fraser, Steven Hand, Christian Limpach, Andrew Warfield, Dan Magenheimer (HP), Jun Nakajima (Intel), Asit Mallick (Intel)
Computer Laboratory
Outline
Virtualization Overview Xen Architecture New Features in Xen 3.0 VM Relocation Xen Roadmap
Virtualization Overview
Single OS image: Virtuozo, Vservers, Zones Group user processes into resource containers Hard to get strong isolation
Full virtualization: VMware, VirtualPC, QEMU Run multiple unmodified guest OSes Hard to efficiently virtualize x86
Para-virtualization: UML, Xen Run multiple guest OSes ported to special arch Arch Xen/x86 is very close to normal x86
Virtualization in the Enterprise
X
Consolidate under-utilized servers to reduce CapEx and OpEx
Avoid downtime with VM Relocation
Dynamically re-balance workload to guarantee application SLAs
X Enforce security policyX
Xen Today : Xen 2.0.6
Secure isolation between VMs Resource control and QoS Only guest kernel needs to be ported
User-level apps and libraries run unmodified Linux 2.4/2.6, NetBSD, FreeBSD, Plan9, Solaris
Execution performance close to native Broad x86 hardware support Live Relocation of VMs between Xen nodes
Para-Virtualization in Xen
Xen extensions to x86 arch Like x86, but Xen invoked for privileged ops Avoids binary rewriting Minimize number of privilege transitions into Xen Modifications relatively simple and self-
contained Modify kernel to understand virtualised env.
Wall-clock time vs. virtual processor time• Desire both types of alarm timer
Expose real resource availability• Enables OS to optimise its own behaviour
Xen 2.0 Architecture
Event Channel Virtual MMUVirtual CPU Control IF
Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE)
NativeDeviceDriver
GuestOS(XenLinux)
Device Manager & Control s/w
VM0
NativeDeviceDriver
GuestOS(XenLinux)
UnmodifiedUser
Software
VM1
Front-EndDevice Drivers
GuestOS(XenLinux)
UnmodifiedUser
Software
VM2
Front-EndDevice Drivers
GuestOS(XenBSD)
UnmodifiedUser
Software
VM3
Safe HW IF
Xen Virtual Machine Monitor
Back-End Back-End
Xen 3.0 Architecture
Event Channel Virtual MMUVirtual CPU Control IF
Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE)
NativeDeviceDriver
GuestOS(XenLinux)
Device Manager & Control s/w
VM0
NativeDeviceDriver
GuestOS(XenLinux)
UnmodifiedUser
Software
VM1
Front-EndDevice Drivers
GuestOS(XenLinux)
UnmodifiedUser
Software
VM2
Front-EndDevice Drivers
UnmodifiedGuestOS(WinXP))
UnmodifiedUser
Software
VM3
Safe HW IF
Xen Virtual Machine Monitor
Back-End Back-End
VT-x
x86_32x86_64
IA64
AGPACPIPCI
SMP
rin
g 3
x86_32
Xen reserves top of VA space
Segmentation protects Xen from kernel
System call speed unchanged
Xen 3 now supports PAE for >4GB mem
Kernel
User
4GB
3GB
0GB
Xen
S
S
U rin
g 1
rin
g 0
x86_64
Large VA space makes life a lot easier, but:
No segment limit support
Need to use page-level protection to protect hypervisor
Kernel
User
264
0
Xen
U
S
U
Reserved
247
264-247
x86_64
Run user-space and kernel in ring 3 using different pagetables Two PGD’s (PML4’s): one
with user entries; one with user plus kernel entries
System calls require an additional syscall/ret via Xen
Per-CPU trampoline to avoid needing GS in Xen
Kernel
User
Xen
U
S
U
syscall/sysret
r3
r0
r3
Para-Virtualizing the MMU
Guest OSes allocate and manage own PTs Hypercall to change PT base
Xen must validate PT updates before use Allows incremental updates, avoids
revalidation Validation rules applied to each PTE:
1. Guest may only map pages it owns*2. Pagetable pages may only be mapped RO
Xen traps PTE updates and emulates, or ‘unhooks’ PTE page for bulk updates
Writeable Page Tables : 1 – Write fault
MMU
Guest OS
Xen VMM
Hardware
page fault
first guestwrite
guest reads
Virtual → Machine
Writeable Page Tables : 2 – Emulate?
MMU
Guest OS
Xen VMM
Hardware
first guestwrite
guest reads
Virtual → Machine
emulate?
yes
Writeable Page Tables : 3 - Unhook
MMU
Guest OS
Xen VMM
Hardware
guest writes
guest reads
Virtual → MachineX
Writeable Page Tables : 4 - First Use
MMU
Guest OS
Xen VMM
Hardware
page fault
guest writes
guest reads
Virtual → MachineX
Writeable Page Tables : 5 – Re-hook
MMU
Guest OS
Xen VMM
Hardware
validate
guest writes
guest reads
Virtual → Machine
MMU Micro-Benchmarks
L X V U
Page fault (µs)
L X V U
Process fork (µs)
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
lmbench results on Linux (L), Xen (X), VMWare Workstation (V), and UML (U)
SMP Guest Kernels
Xen extended to support multiple VCPUs Virtual IPI’s sent via Xen event channels Currently up to 32 VCPUs supported
Simple hotplug/unplug of VCPUs From within VM or via control tools Optimize one active VCPU case by
binary patching spinlocks
SMP Guest Kernels
Takes great care to get good SMP performance while remaining secure Requires extra TLB syncronization IPIs
Paravirtualized approach enables several important benefits Avoids many virtual IPIs Allows ‘bad preemption’ avoidance Auto hot plug/unplug of CPUs
SMP scheduling is a tricky problem Strict gang scheduling leads to wasted cycles
I/O Architecture
Xen IO-Spaces delegate guest OSes protected access to specified h/w devices Virtual PCI configuration space Virtual interrupts (Need IOMMU for full DMA protection)
Devices are virtualised and exported to other VMs via Device Channels Safe asynchronous shared memory transport ‘Backend’ drivers export to ‘frontend’ drivers Net: use normal bridging, routing, iptables Block: export any blk dev e.g. sda4,loop0,vg3
(Infiniband / Smart NICs for direct guest IO)
VT-x / (Pacifica)
Enable Guest OSes to be run without para-virtualization modifications E.g. legacy Linux, Windows XP/2003
CPU provides traps for certain privileged instrs
Shadow page tables used to provide MMU virtualization
Xen provides simple platform emulation BIOS, Ethernet (ne2k), IDE emulation
(Install paravirtualized drivers after booting for high-performance IO)
NativeDevice Drivers
Co
ntro
l P
anel
(xm/xe
nd
)
Fro
nt en
d
Virtu
al Drivers
Linux xen64
Xen Hypervisor
Device
Mo
dels
Guest BIOS
Unmodified OS
Domain N
Linux xen64
Callback / Hypercall VMExit
Virtual Platform
0D
Guest VM (VMX)(32-bit)
Backen
dV
irtual d
river
Native Device Drivers
Domain 0
Event channel0P
1/3P
3P
I/O: PIT, APIC, PIC, IOAPICProcessor Memory
Control Interface Hypercalls Event Channel Scheduler
FE
V
irtual
Drivers
Guest BIOS
Unmodified OS
VMExit
Virtual Platform
Guest VM (VMX)(64-bit)
FE
V
irtual
Drivers
3D
MMU Virtualizion : Shadow-Mode
MMU
Accessed &dirty bits
Guest OS
VMM
Hardware
guest writes
guest reads Virtual → Pseudo-physical
Virtual → Machine
Updates
VM Relocation : Motivation
VM relocation enables: High-availability
• Machine maintenance
Load balancing• Statistical multiplexing gain
Xen
Xen
Assumptions
Networked storage NAS: NFS, CIFS SAN: Fibre Channel iSCSI, network block
dev drdb network RAID
Good connectivity common L2 network L3 re-routeing
Xen
Xen
Storage
Challenges
VMs have lots of state in memory Some VMs have soft real-time
requirements E.g. web servers, databases, game servers May be members of a cluster quorum Minimize down-time
Performing relocation requires resources Bound and control resources used
Stage 0: pre-migration
Stage 1: reservation
Stage 2: iterative pre-copy
Stage 3: stop-and-copy
Stage 4: commitment
Relocation Strategy
VM active on host ADestination host
selected(Block devices
mirrored)Initialize container on target host
Copy dirty pages in successive rounds
Suspend VM on host A
Redirect network traffic
Synch remaining state
Activate on host BVM state on host A
released
Pre-Copy Migration: Round 1
Pre-Copy Migration: Round 1
Pre-Copy Migration: Round 1
Pre-Copy Migration: Round 1
Pre-Copy Migration: Round 1
Pre-Copy Migration: Round 2
Pre-Copy Migration: Round 2
Pre-Copy Migration: Round 2
Pre-Copy Migration: Round 2
Pre-Copy Migration: Round 2
Pre-Copy Migration: Final
Writable Working Set
Pages that are dirtied must be re-sent Super hot pages
• e.g. process stacks; top of page free list
Buffer cache Network receive / disk buffers
Dirtying rate determines VM down-time Shorter iterations → less dirtying → …
Rate Limited Relocation
Dynamically adjust resources committed to performing page transfer Dirty logging costs VM ~2-3% CPU and network usage closely linked
E.g. first copy iteration at 100Mb/s, then increase based on observed dirtying rate Minimize impact of relocation on server
while minimizing down-time
Web Server Relocation
Iterative Progress: SPECWeb
52s
Iterative Progress: Quake3
Quake 3 Server relocation
Extensions
Cluster load balancing Pre-migration analysis phase Optimization over coarse timescales
Evacuating nodes for maintenance Move easy to migrate VMs first
Storage-system support for VM clusters Decentralized, data replication, copy-on-
write Wide-area relocation
IPSec tunnels and CoW network mirroring
Current 3.0 Status
x86_32 x86_32p x86_64 IA64 Power
Domain 0
Domain U
SMP Guests new!
Save/Restore/Migrate ~tools
>4GB memory 16GB 4TB ?
VT 64-on-64
Driver Domains
3.1 Roadmap
Improved full-virtualization support Pacifica / VT-x abstraction
Enhanced control tools project Performance tuning and optimization
Less reliance on manual configuration Infiniband / Smart NIC support (NUMA, Virtual framebuffer, etc)
Research Roadmap
Whole-system debugging Lightweight checkpointing and replay Cluster/dsitributed system debugging
Software implemented h/w fault tolerance Exploit deterministic replay
VM forking Lightweight service replication, isolation
Secure virtualization Multi-level secure Xen
Conclusions
Xen is a complete and robust GPL VMM Outstanding performance and scalability Excellent resource control and protection Vibrant development community Strong vendor support
http://xen.sf.net
Thanks!
The Xen project is hiring, both in Cambridge UK, Palo Alto and New York
Computer Laboratory
Backup slides
Isolated Driver VMs
Run device drivers in separate domains
Detect failure e.g. Illegal access Timeout
Kill domain, restart E.g. 275ms outage
from failed Ethernet driver
0
50
100
150
200
250
300
350
0 5 10 15 20 25 30 35 40time (s)
Device Channel Interface
Scalability
Scalability principally limited by Application resource requirements several 10’s of VMs on server-class machines
Balloon driver used to control domain memory usage by returning pages to Xen Normal OS paging mechanisms can deflate
quiescent domains to <4MB Xen per-guest memory usage <32KB
Additional multiplexing overhead negligible
System Performance
L X V U
SPEC INT2000 (score)
L X V U
Linux build time (s)
L X V U
OSDB-OLTP (tup/s)
L X V U
SPEC WEB99 (score)
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
Benchmark suite running on Linux (L), Xen (X), VMware Workstation (V), and UML (U)
TCP results
L X V UTx, MTU 1500 (Mbps)
L X V URx, MTU 1500 (Mbps)
L X V UTx, MTU 500 (Mbps)
L X V URx, MTU 500 (Mbps)
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1.1
TCP bandwidth on Linux (L), Xen (X), VMWare Workstation (V), and UML (U)
Scalability
L X
2L X
4L X
8L X
16
0
200
400
600
800
1000
Simultaneous SPEC WEB99 Instances on Linux (L) and Xen(X)
Resource Differentation
2 4 8 8(diff)OSDB-IR
2 4 8 8(diff)OSDB-OLTP
0.0
0.5
1.0
1.5
2.0
Simultaneous OSDB-IR and OSDB-OLTP Instances on Xen
Ag
gre
gate
th
rou
gh
pu
t re
lati
ve t
o o
ne in
stan
ce