1
Outline for Today
• Objective
  – More power aware memory
  – Virtual Machines
• Announcements
  – These slides will be up after class (sometime).
  – Midterm rules: open book, notes. Not open Google, IM, etc.
2
Memory System Power Consumption
• Laptop: memory is small percentage of total power budget
• Handheld: low power processor, memory is more important
[Pie charts: laptop power budget (9 Watt processor) vs. handheld power budget (1 Watt processor), each split into memory vs. other components]
4
Opportunity: Power Aware DRAM
• Multiple power states
  – Fast access, high power
  – Low power, slow access
• New take on memory hierarchy
• How to exploit opportunity?
Rambus RDRAM Power States (latency to return to active):
  Active      300 mW   (read/write transactions)
  Standby     180 mW   +6 ns
  Nap          30 mW   +60 ns
  Power Down    3 mW   +6000 ns
5
RDRAM Active Components
            Refresh   Clock   Row decoder   Col decoder
  Active       X        X          X             X
  Standby      X        X          X
  Nap          X        X
  Pwrdn        X
6
RDRAM as a Memory Hierarchy
• Each chip can be independently put into the appropriate power mode
• Number of chips at each “level” of the hierarchy can vary dynamically.
• Policy choices
  – initial page placement in an “appropriate” chip
  – dynamic movement of a page from one chip to another
  – transitioning of the power state of the chip containing a page
[Diagram: chips in Active and Nap states]
7
RAMBUS RDRAM Main Memory Design
• Single RDRAM chip provides high bandwidth per access
  – Novel signaling scheme transfers multiple bits on one wire
  – Many internal banks: many requests to one chip
• Energy implication: activate only one chip to perform an access at the same high bandwidth as a conventional design
[Diagram: CPU/$ connected to Chips 0–3, each holding part of a cache block; chips in Power Down / Standby / Active states]
8
Conventional Main Memory Design
• Multiple DRAM chips provide high bandwidth per access
  – Wide bus to processor
  – Few internal banks
• Energy implication: must activate all those chips to perform an access at high bandwidth
[Diagram: CPU/$ connected to Chips 0–3 over a wide bus, each chip holding part of a cache block; all chips Active]
10
Exploiting the Opportunity
Interaction between the power state model and access locality
• How to manage the power state transitions?
  – Memory controller policies
  – Quantify benefits of power states
• What role does software have?
  – Energy impact of allocation of data/text to memory
11
Power-Aware DRAM Main Memory Design
• Properties of PA-DRAM allow us to access and control each chip individually
• 2 dimensions to affect energy policy: HW controller / OS
• Energy strategy:
  – Cluster accesses to already powered-up chips
  – Interaction between power state transitions and data locality
[Diagram: CPU/$ with OS page mapping/allocation (software control) above per-chip controllers (hardware control); Chips 0 … n-1 in Active / Standby / Power Down states]
12
Power State Transitioning
[Timing diagram: a run of requests ends (completion of last request in run), followed by an idle gap. The chip transitions down over th->l at power ph->l, stays in the low-power state at plow for tbenefit, then transitions back up over tl->h at pl->h; otherwise it would have remained at phigh.]
Ideal case: assume we want no added latency. Transitioning down saves energy when

  (th->l + tl->h + tbenefit) * phigh > th->l * ph->l + tl->h * pl->h + tbenefit * plow
13
Benefit Boundary
Solving the ideal-case inequality for tbenefit:

  tbenefit > [ th->l * ph->l + tl->h * pl->h – (th->l + tl->h) * phigh ] / (phigh – plow)

with the gap decomposed as: gap = th->l + tl->h + tbenefit
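As a sketch, the boundary above can be evaluated numerically. The Active/Nap powers (300 mW / 30 mW) come from the earlier RDRAM slide; the transition powers and times (350 mW, 60 ns each way) are illustrative assumptions, not figures from the slides.

```python
# Break-even residence time in the low-power state, from the
# benefit-boundary formula on the slide. Variable names are ours.

def t_benefit_min(p_high, p_low, t_down, p_down, t_up, p_up):
    """Smallest tbenefit for which transitioning down beats staying
    in the high-power state (ideal, no-added-latency case)."""
    return (t_down * p_down + t_up * p_up
            - (t_down + t_up) * p_high) / (p_high - p_low)

# Active = 300 mW, Nap = 30 mW (earlier slide); transition figures
# (350 mW, 60 ns each way) are assumed for illustration.
t_min = t_benefit_min(p_high=300e-3, p_low=30e-3,
                      t_down=60e-9, p_down=350e-3,
                      t_up=60e-9, p_up=350e-3)
```

With these numbers the chip must stay napping for roughly 22 ns beyond the transitions before the transition pays off; with cheaper transitions the boundary drops toward zero.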
14
Power State Transitioning
On-demand case: transition down immediately after the last request in a run completes; this adds the latency of the transition back up (tl->h) to the first request after the gap.
[Timing diagram as before, with the wake-up transition on the critical path of the next request]
15
Power State Transitioning
Threshold based: wait for a threshold period after the last request before transitioning down; this delays the transition and avoids it entirely for gaps shorter than the threshold.
[Timing diagram as before, with the down-transition delayed by the threshold]
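The two policies can be compared on a synthetic trace. This is a sketch: the two-state model (active vs. nap), the RDRAM numbers reused from the earlier slide, and the gap lengths are all simplifying assumptions for illustration.

```python
# Toy energy comparison of on-demand vs. threshold-based transitioning
# over a list of idle-gap lengths (seconds). Two states only.

P_HIGH, P_LOW = 300e-3, 30e-3   # W: active vs. nap (earlier slide)
T_UP = 60e-9                    # s: resynchronization latency, nap -> active

def energy_ondemand(gaps):
    """Nap immediately after each run; charge the wake-up transition at
    P_HIGH and spend the rest of the gap at P_LOW."""
    return sum(T_UP * P_HIGH + max(g - T_UP, 0) * P_LOW for g in gaps)

def energy_threshold(gaps, thr):
    """Stay active for `thr` seconds after the run; only then nap."""
    e = 0.0
    for g in gaps:
        if g <= thr:
            e += g * P_HIGH                      # never transitioned
        else:
            e += (thr * P_HIGH + T_UP * P_HIGH
                  + max(g - thr - T_UP, 0) * P_LOW)
    return e

gaps = [50e-9, 200e-9, 5e-6]    # assumed idle gaps between request runs
```

A threshold of zero degenerates to the on-demand policy; a very large threshold degenerates to always-active, which costs more energy on this trace but never adds wake-up latency.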
16
Page Allocation Policies
Virtual to Physical Page Mapping
• Random Allocation – baseline policy
  – Pages spread across chips
• Sequential First-Touch Allocation
  – Consolidate pages into minimal number of chips
  – One shot
• Frequency-based Allocation
  – First-touch not always best
  – Allow (limited) movement after first-touch
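The first two policies can be sketched by counting how many chips end up holding pages (and so must stay powered). The chip count, pages per chip, and allocation sizes below are invented for illustration.

```python
# Toy model: which chips hold pages under random vs. sequential
# first-touch allocation. Fewer chips used -> more chips can nap.

import random

PAGES_PER_CHIP = 1024   # assumed chip capacity
N_CHIPS = 16            # assumed number of chips

def random_alloc(pages):
    """Baseline: spread pages across chips at random."""
    return {random.randrange(N_CHIPS) for _ in range(pages)}

def first_touch_alloc(pages):
    """Pack pages into the minimal number of chips, in touch order."""
    return {p // PAGES_PER_CHIP for p in range(pages)}

used_ft = first_touch_alloc(2000)    # 2000 pages fit in 2 chips
used_rand = random_alloc(2000)       # typically touches most chips
```

Frequency-based allocation would additionally migrate hot pages toward the low-numbered chips after first touch; that step is omitted here.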
17
Power-Aware Virtual Memory Based On Context Switches
Huang, Pillai, Shin, “Design and Implementation of Power-Aware Virtual Memory”, USENIX 03.
• Power state transitions under SW control (not HW controller)
• Treated explicitly as a memory hierarchy: a process’s active set of nodes is kept in a higher power state
• Size of the active node set is kept small by grouping a process’s pages together in nodes – its “energy footprint”
  – Page mapping viewed as a NUMA layer for implementation
  – Active set of pages of process i put on its preferred nodes
• At context switch time, hide the latency of transitioning
  – Transition the union of the active sets of the next-to-run and likely next-after-that processes from nap to standby (pre-charging)
  – Overlap transitions with other context switch overhead
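The context-switch step can be sketched as below. The process names, node sets, and 16-node total are hypothetical; in PAVM this logic runs inside the kernel's context-switch path so the transitions overlap switch overhead.

```python
# Sketch of PAVM's pre-charging decision at context switch time:
# bring the union of the next two processes' active node sets to
# standby; everything else can nap.

def nodes_to_standby(active_sets, next_proc, likely_after):
    """Union of active node sets of the next-to-run process and the
    likely next-after-that process."""
    return active_sets[next_proc] | active_sets[likely_after]

# Hypothetical per-process active node sets on a 16-node system.
active_sets = {"pA": {0, 1}, "pB": {1, 2}, "pC": {5}}

standby = nodes_to_standby(active_sets, "pB", "pC")
nap = set(range(16)) - standby
```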
19
Rambus RDRAM Power States (values used in the PAVM paper)
  Active      313 mW   (read/write transactions)
  Standby     225 mW
  Nap          11 mW
  Power Down    7 mW
Transition latencies shown on the slide: +3 ns, +20 ns, +225 ns, +22510 ns
20
Determining Active Nodes
• A node is active iff at least one page from the node is mapped into process i’s address space.
• Table maintained whenever a page is mapped or unmapped in the kernel.
• Alternatives rejected due to overhead:
  – Extra page faults
  – Page table scans
• Overhead is only one incr/decr per mapping/unmapping op
[Table: per-process counts of mapped pages in each node n0 … n15]
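That bookkeeping can be sketched in a few lines, assuming 16 nodes as on the evaluation platform; the class and method names are ours, not the paper's.

```python
# Per-process counters of mapped pages per node. A node is active
# for the process iff its counter is nonzero; maintenance is one
# increment/decrement per map/unmap operation.

N_NODES = 16  # matches the 16-node evaluation platform

class ProcNodeCounts:
    def __init__(self):
        self.count = [0] * N_NODES

    def map_page(self, node):
        self.count[node] += 1       # one incr per mapping op

    def unmap_page(self, node):
        self.count[node] -= 1       # one decr per unmapping op

    def active_nodes(self):
        return {n for n, c in enumerate(self.count) if c > 0}

p = ProcNodeCounts()
p.map_page(3); p.map_page(3); p.map_page(7)
p.unmap_page(3)
```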
21
Implementation Details
Problem: DLLs and files shared by multiple processes (buffer cache) become scattered all over memory with a straightforward assignment of incoming pages to a process’s active nodes – large energy footprints after all.
22
Implementation Details
Solutions:
• DLL Aggregation
  – Special-case DLLs by allocating them sequential first-touch in low-numbered nodes
• Migration
  – Kernel thread – kmigrated – running in background when system is idle (waking up every 3 s)
  – Scans pages used by each process, migrating if conditions met
    • Private page not on one of process i’s preferred nodes
    • Shared page outside the aggregated (DLL) nodes
23
24
Evaluation Methodology
• Linux implementation
• Measurements/counts taken of events; energy results calculated (not measured)
• Metric – energy used by memory (only)
• Workloads – 3 mixes: light (editing, browsing, MP3), poweruser (light + kernel compile), multimedia (playing MPEG movie)
• Platform – 16 nodes, 512 MB of RDRAM
• Not considered: DMA and kernel maintenance threads
25
Results
• Base – standby when not accessing
• On/Off – nap when system idle
• PAVM
26
Results
• PAVM
• PAVMr1 – DLL aggregation
• PAVMr2 – both DLL aggregation & migration
27
Results
28
Conclusions
• Multiprogramming environment
• Basic PAVM: saves 34–89% of the energy of 16-node RDRAM
• With optimizations: an additional 20–50%
• Works with other kinds of power-aware memory devices
Discussion: What about page replacement policies? Should they (or how could they) be power-aware?
Introduction to Virtual Machine Monitors
31
Traditional Multiprogrammed OS
• Multiple applications running with the abstraction of a dedicated machine provided by the OS
• Pass-through of non-privileged instructions
• ISA – instruction set architecture
• ABI – application binary interface
[Diagram: Application(s) on top of OS on top of HW; the ABI comprises syscalls plus non-privileged instructions; the ISA sits at the HW boundary]
33
©James Smith, U.Wisc
34
Virtualization Layer
©James Smith, U.Wisc
35
Variations on the Theme
©James Smith, U.Wisc
36
Virtual Machines
• History: invented by IBM in the 1960’s
• Fully protected and isolated copy of the physical machine, providing the abstraction of a dedicated machine
• Layer: Virtual Machine Monitor (VMM)
• Replicating machine for multiple OSs
• Security: isolation
©James Smith, U.Wisc
37
Virtual Machine Monitor
©J. Sugarman, USENIX01
38
Issues
• Hardware must be fully virtualizable – all sensitive (privileged) instructions must trap to the VMM
  – x86 is not fully virtualizable
• In the traditional model, all devices need drivers in the VMM
  – PCs have lots of possible devices – leverage the host OS for its drivers => hosted model
39
VMware Hosted Model
©J. Sugarman, USENIX01
40
Hosting Implications
• World switch – heavier weight than normal context switch
• VMM runs with full privileges (e.g., kernel mode)• I/O operations involve
– Interception by VMM– Switch to host world via Vmdriver– Issuing I/O operation to host OS via Vmapp
• Interrupts handled by host OS– VMM yields control to host OS– Reasserts interrupts to Guest OS
• Host OS does scheduling and can also pageout memory of a virtual machine
41
VMware Hosted Model
42
Memory Virtualization
Waldspurger OSDI 02
43
Reclaiming Memory
• Traditional: add transparent swap layer– Requires meta-level page replacement decisions– Best data to guide decisions known only by guest
OS– Guest and meta-level policies may clash– Example: “double paging” anomaly
• Alternative: implicit cooperation– Coax guest into doing page replacement– Avoid meta-level policy decisions
Waldspurger OSDI 02
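The "implicit cooperation" mechanism is ballooning, illustrated on the next slide: the VMM inflates a balloon driver inside the guest, and the guest's own replacement policy decides which pages to give up. The toy model below only tracks page counts; the class names and numbers are invented.

```python
# Toy model of ballooning: inflating the balloon shrinks the memory
# the guest believes it has, so the guest pages out by its own policy
# and the VMM can hand the freed machine pages to other VMs.

class GuestVM:
    def __init__(self, total_pages):
        self.total = total_pages
        self.balloon = 0                    # pages pinned by balloon driver

    def usable_pages(self):
        return self.total - self.balloon

class VMM:
    def reclaim(self, guest, pages):
        """Inflate the guest's balloon by `pages`."""
        guest.balloon += pages
        return pages                        # now reusable by other VMs

    def release(self, guest, pages):
        """Deflate the balloon, returning memory to the guest."""
        guest.balloon -= min(pages, guest.balloon)

vm = GuestVM(total_pages=1000)
VMM().reclaim(vm, 200)                      # guest now sees 800 pages
```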
44
Ballooning
Waldspurger OSDI 02
45
Content-based Sharing
• Motivation– Multiple VMs running same OS, apps– Collapse redundant copies of code, data, zeros
• Transparent page sharing– Map multiple PPNs to single MPN copy-on-write
• New twist: content-based sharing– General-purpose, no guest OS changes– Background activity saves memory over time
Waldspurger OSDI 02
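The hashing scheme on the following slides can be sketched as below. This is a toy model, not VMware's implementation: SHA-1 is assumed as the content hash, and a full byte-compare confirms each hash hit (the hash is only a hint).

```python
# Sketch of content-based page sharing: hash page contents; on a
# verified full-content match, map the new (vm, PPN) to the existing
# machine page (MPN) copy-on-write instead of allocating a new one.

import hashlib

PAGE = 4096

class PageSharer:
    def __init__(self):
        self.by_hash = {}      # digest -> (mpn, contents)
        self.mapping = {}      # (vm, ppn) -> (mpn, shared_cow)
        self.next_mpn = 0

    def insert(self, vm, ppn, contents: bytes):
        h = hashlib.sha1(contents).digest()
        hit = self.by_hash.get(h)
        if hit is not None and hit[1] == contents:
            # Verified match: share the machine page, mark it COW.
            self.mapping[(vm, ppn)] = (hit[0], True)
        else:
            self.by_hash[h] = (self.next_mpn, contents)
            self.mapping[(vm, ppn)] = (self.next_mpn, False)
            self.next_mpn += 1

sharer = PageSharer()
zero = bytes(PAGE)
sharer.insert("vm1", 10, zero)
sharer.insert("vm2", 99, zero)   # collapses onto the same MPN
```

Zero pages are the common case this catches immediately; a write to a shared page would break the COW mapping and allocate a private copy, which this sketch omits.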
46
Sharing via Hashing
Waldspurger OSDI 02
47
Sharing via Hashing
Waldspurger OSDI 02
48
Real-world Sharing
Waldspurger OSDI 02
49
Allocation Policies
Waldspurger OSDI 02
50
Allocation Policies
Waldspurger OSDI 02
51
Reclaiming Idle Memory 1st
Waldspurger OSDI 02
52
75% Idle Tax Rate
Waldspurger OSDI 02
64
Xen 2.0 Features
• Secure isolation between VMs
• Resource control and QoS
• Only guest kernel needs to be ported
  – All user-level apps and libraries run unmodified
  – Linux 2.4/2.6, NetBSD, FreeBSD, Plan9
• Execution performance is close to native
• Supports the same hardware as Linux x86
• Live relocation of VMs between Xen nodes
66
Structure
hypercalls
events
67
Para-Virtualization in Xen
• Arch xen_x86: like x86, but Xen hypercalls required for privileged operations
  – Avoids binary rewriting
  – Minimize number of privilege transitions into Xen
  – Modifications relatively simple and self-contained
• Modify kernel to understand virtualised env.
  – Wall-clock time vs. virtual processor time
    • Xen provides both types of alarm timer
  – Expose real resource availability
    • Enables OS to optimise behaviour
68
x86 CPU Virtualization
• Xen runs in ring 0 (most privileged)
• Ring 1/2 for guest OS, 3 for user-space
  – GPF if guest attempts to use privileged instr
• Xen lives in top 64MB of linear addr space
  – Segmentation used to protect Xen, as switching page tables is too slow on standard x86
• Hypercalls jump to Xen in ring 0
• Guest OS may install ‘fast trap’ handler
  – Direct user-space to guest OS system calls
• MMU virtualisation: shadow vs. direct-mode
69
Structure
[Diagram: privilege ring 0 (Xen), privilege ring 1 (guest OS), privilege ring 3 (applications)]
70
Para-Virtualizing the MMU
• Guest OSes allocate and manage own PTs
  – Hypercall to change PT base
• Xen must validate PT updates before use
  – Allows incremental updates, avoids revalidation
• Validation rules applied to each PTE:
  1. Guest may only map pages it owns*
  2. Pagetable pages may only be mapped RO
• Xen traps PTE updates and emulates, or ‘unhooks’ PTE page for bulk updates
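The two validation rules can be sketched as a predicate. The ownership and page-table sets and the domain names below are hypothetical stand-ins for Xen's frame-table metadata.

```python
# Sketch of Xen's per-PTE validation: a guest update is allowed only
# if it maps a frame the guest owns (rule 1), and frames that hold
# page tables may only be mapped read-only (rule 2).

def validate_pte(domain, frame, writable, owned_frames, pt_frames):
    if frame not in owned_frames[domain]:
        return False        # rule 1: may only map pages it owns
    if frame in pt_frames and writable:
        return False        # rule 2: pagetable pages mapped RO only
    return True

# Hypothetical frame metadata for one domain.
owned = {"dom1": {100, 101, 102}}
pt = {102}                  # frame 102 currently holds a page table
```

Passing every update through this check is what lets Xen give guests direct (non-shadowed) page tables safely; the ‘unhook’ path batches the same checks for bulk updates.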
71
I/O Architecture
• Xen IO-Spaces delegate guest OSes protected access to specified h/w devices– Virtual PCI configuration space– Virtual interrupts
• Devices are virtualised and exported to other VMs via Device Channels– Safe asynchronous shared memory transport– ‘Backend’ drivers export to ‘frontend’ drivers– Net: use normal bridging, routing, iptables– Block: export any blk dev e.g. sda4,loop0,vg3
72
Xen 2.0 Architecture
[Diagram: the Xen Virtual Machine Monitor (Control IF, Safe HW IF, Event Channel, Virtual CPU, Virtual MMU) runs on the hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE). VM0 runs the Device Manager & Control s/w on a XenLinux guest with native device drivers and back-end drivers; VM1 and VM2 run unmodified user software on XenLinux guests, and VM3 on a XenBSD guest, each using front-end device drivers.]