Ph.D. Comprehensive Examination
José A. Baiocchi Paredes
Department of Computer Science
University of Pittsburgh
Towards Virtualization of Embedded Systems with Scratchpad Memory
Overview
  [Figure: concept map. System Virtualization is linked by edges labeled "approaches" and "needs" to Trap-And-Emulate (Classic), Paravirtualization (OS Assisted), Full System Virtualization, Hardware Assisted Virtualization and Memory Resource Management.]
System Virtualization
  Allows multiple operating systems to share the same hardware
  Uses:
    Server consolidation
    Co-located hosting
    Distributed web services
    Application mobility
    Secure computing platforms
    Etc.
  [Figure: three virtual machines sharing one machine.]
  [Figure: Type I ("bare metal"). Guest OS 1 to 3, each with its user apps, run on the Virtual Machine Monitor, which runs directly on the hardware.]
  [Figure: Type II ("hosted"). Guest OS 1 to 3, each with its user apps, run on the Virtual Machine Monitor, which runs on a host OS on top of the hardware.]
Classical VMM
  Instruction behavior
    Sensitive instructions (S)
      control-sensitive: change the resource configuration
      behavior-sensitive: depend on the resource configuration
    Privileged instructions (P)
      trap in user mode
      don't trap in supervisor mode
  A VMM can be built if S ⊆ P
  Trap-and-emulate
    Deprivileging: guest OS in user mode, VMM in supervisor mode
    Impossible for x86!
  VMM properties: Efficiency, Resource Control, Equivalence
  [Figure: instructions classified as innocuous or sensitive and as nonprivileged or privileged, with the sets P and S marked.]
  [Figure: user applications and the guest OS run on the ISA exported by the VMM, which runs on the hardware ISA. A trap enters the VMM dispatcher, which uses the allocator and the emulation routines.]
  Popek & Goldberg, Formal Req. for Virt. 3rd Gen. Architectures, CACM'74
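The condition S ⊆ P can be checked mechanically once each instruction is labeled. The following is a minimal sketch under stated assumptions: the `Instr` record, the `trap_and_emulate_ok` helper and the two tiny instruction lists are hypothetical and only illustrate the idea. `popf` is used as a familiar example of an x86 instruction that is sensitive but does not trap in user mode.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    sensitive: bool   # control-sensitive or behavior-sensitive
    privileged: bool  # traps when executed in user mode

def trap_and_emulate_ok(isa):
    """Return (ok, offenders); ok is True when S is a subset of P."""
    sensitive = {i.name for i in isa if i.sensitive}
    privileged = {i.name for i in isa if i.privileged}
    offenders = sorted(sensitive - privileged)
    return (not offenders, offenders)

# Hypothetical, heavily simplified instruction sets.
classic_isa = [
    Instr("add", sensitive=False, privileged=False),
    Instr("load_psw", sensitive=True, privileged=True),
]
x86_like = [
    Instr("add", sensitive=False, privileged=False),
    Instr("lgdt", sensitive=True, privileged=True),
    Instr("popf", sensitive=True, privileged=False),  # sensitive, never traps
]

classic_result = trap_and_emulate_ok(classic_isa)
x86_result = trap_and_emulate_ok(x86_like)
```

Any instruction reported as an offender is one that a deprivileged guest OS could execute without the VMM noticing, which is why the x86-like set fails the test.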
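x86 Virtualization Challenges
  Protection (Segmentation)
    4 privilege levels (rings)
    Segment access by privilege level
    Deprivileging: 0/1/3, 0/3/3
  Sensitive structures
    On-chip: control registers, table registers, etc.
    Off-chip: segment descriptor tables, page tables, interrupt tables, etc.
    Shadow structures
    Tracing: write-protected primary structures
  Sensitive unprivileged instructions
  [Figure: privilege rings 3 to 0, with apps in ring 3 and the OS in ring 0.]
  [Figure: segmentation and paging. Segmentation maps a logical address to the linear address space through segment descriptors (with their DPL, checked against the CPL in %cs) held in the GDT and LDT, which are located through %gdtr and %ldtr. Paging maps the linear address space to pages of the physical address space through the page directory (%cr3), page tables and TLBs.]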
Paravirtualization
  Guest OS modifications (0/1/3)
    Paravirtualized x86 interface
    OS can't evolve independently!
  CPU
    Xen-validated exception handlers
    'Fast' handler for system calls
    Timer: real, virtual, wall-clock
  Memory
    Xen in top 64MB of address space
    Validated updates to segment descriptor tables and page tables
  I/O
    Buffer-descriptor rings
    HW interrupts replaced by events
  Domain0 runs control software
  VMM properties: Efficiency, Resource Control, Equivalence
  [Figure: Xen architecture. The Xen hypervisor runs on the x86 hardware and exports a virtual x86 CPU, virtual physical memory, a virtual network interface, a virtual block device and the Domain0 control interface. Domain0 runs the control plane software; paravirtualized guest OSes, each with XDD drivers, run user apps over an unchanged ABI.]
  Barham et al., Xen and the Art of Virtualization, SOSP'03
Hardware Assisted VMM
  x86 extensions
    1st gen: AMD-V™, Intel® VT-x
    Enable trap-and-emulate
  Guest OS runs in new guest mode, VMM in host mode
    4 privilege rings in both modes
  Host to guest: vmrun
    Virtual Machine Control Block (VMCB)
    Host state + guest state + control fields
  Guest to host: exit conditions
    Diagnostic fields to aid VMM
  VMM properties: Efficiency, Resource Control, Equivalence
  [Figure: user applications and the guest OS run on x86; the VMM (dispatcher, allocator, emulation routines) runs on the extended x86 hardware (x86+) and is entered on an exit, with state kept in the VMCB.]
  Adams & Agesen, HW & SW Techniques for x86 Virtualization, ASPLOS'06
Full System Virtualization
  Direct execution of ring 3 code
  Binary translation of ring 0 code
  Dynamic Binary Translator (DBT)
    Input: any x86 code (no ABI assumptions)
    Output: subset of x86 code, stored in Code Cache (CC), runs in ring 3
  Privileged instruction replacement
    Simple: in-CC sequences
    Complex: callout-and-emulate
  Adaptive BT
    Frequent traps replaced by callouts
    Reverted when trapping infrequent
  VMM properties: Efficiency, Resource Control, Equivalence
  [Figure: user applications and the guest OS run on x86; the VMM contains the DBT, the code cache (CC) and the emulation routines, and runs on x86 hardware.]
  Adams & Agesen, HW & SW Techniques for x86 Virtualization, ASPLOS'06
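Memory Resource Mgmt.
  Virtual physical memory
    physical addr. → machine addr.
    VM config.: min, max, shares
  Content-Based Page Sharing
    Reduce memory pressure
    Identical pages: copy-on-write
  Share-Based Allocation
    Min-funding revocation
    Idle memory tax
  Reclamation
    Ballooning forces guest OS to make paging decisions
    Fallback to demand paging
  Adjusted shares-per-page ratio:
    $$\rho = \frac{S}{P \cdot (f + k \cdot (1 - f))}$$
    where S is the number of shares, P the number of allocated pages, f the fraction of active pages and k the cost of an idle page.
  [Figure: VMware ESX runs on the hardware and maps the physical memory of each VM onto machine memory. In each VM, user apps see linear memory and the guest OS manages physical memory; a balloon occupies part of one guest's physical memory.]
  Waldspurger, Memory Res. Mgmt. in VMware ESX Server, OSDI'02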
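A small worked example may help to see how the idle memory tax changes which VM loses memory under min-funding revocation. This is a sketch under stated assumptions: the function names, the VM records and the tax rate of 0.75 are illustrative, and the idle page cost is taken as k = 1 / (1 - tax rate), following Waldspurger's formulation.

```python
def shares_per_page(shares, pages, active_fraction, tax_rate):
    """Adjusted shares-per-page ratio; idle pages cost k times more."""
    k = 1.0 / (1.0 - tax_rate)  # idle page cost
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)))

def pick_victim(vms, tax_rate):
    """Min-funding revocation: reclaim from the VM with the lowest ratio."""
    return min(
        vms,
        key=lambda vm: shares_per_page(
            vm["shares"], vm["pages"], vm["active"], tax_rate
        ),
    )["name"]

# Two hypothetical VMs with equal shares and equal allocations.
vms = [
    {"name": "busy", "shares": 100, "pages": 1000, "active": 0.9},
    {"name": "idle", "shares": 100, "pages": 1000, "active": 0.2},
]

victim = pick_victim(vms, tax_rate=0.75)
busy_ratio = shares_per_page(100, 1000, 0.9, 0.75)
idle_ratio = shares_per_page(100, 1000, 0.2, 0.75)
```

With the tax applied, the mostly idle VM ends up with the lower ratio and is chosen for reclamation even though both VMs hold the same number of shares.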
Overview
  [Figure: concept map extended with Dynamic Binary Translation, Dynamic Binary Optimization and Code Cache Management, next to System Virtualization and its nodes: Trap-And-Emulate (Classic), Paravirtualization (OS Assisted), Full System Virtualization, Hardware Assisted Virtualization and Memory Resource Management. Edge labels: needs, based on, approaches, enables.]
Dynamic Binary Translation
  Modify a running program's binary instructions before they execute on the host platform
  Uses:
    Emulation
    Virtualization
    Dynamic optimization
    Code security (shepherding)
    Dynamic instrumentation
    Software I-caching
    Etc.
  [Figure: example configurations placing a DBT between apps or guest OSes and the host OS or hardware.]
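Generic DBT operation
  [Figure: binary with blocks A to E and a call to, and return from, blocks G to J, next to the code cache. DBT loop: context save; is the new PC cached? If yes, context restore and run from the code cache. If no, start a new fragment, then fetch, decode and translate, moving to the next PC until the end of the fragment. The code cache holds fragment A with fragment exit stubs to B and to C.]
  Conditional branch: stop
  Scott et al., Retarget. & reconfig. SDT, CGO'03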
Generic DBT operation
  [Figure: same binary and DBT loop. The code cache now holds fragment A (exit stubs to B and to C), fragment C, D, G (exit stubs to H and to I), fragment H, J (indirect exit stub) and fragment E (exit stub to A).]
  Branch and link: emulate side effects and elide
  Unconditional branch: elide
  Indirect branch: indirect exit stub
  Scott et al., Retarget. & reconfig. SDT, CGO'03
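The translate-and-cache loop described on these slides can be sketched on a toy guest program. Everything here is an assumption made for illustration: the four-instruction guest ISA (`add`, `dec`, `bnz`, `jmp`, plus `halt`), the `translate`, `execute` and `run_dbt` helpers, and the representation of a fragment as a Python tuple. It follows the same rules as the slides: a conditional branch ends the fragment with exit stubs, and an unconditional branch is elided.

```python
# Toy guest program: pc -> instruction (hypothetical ISA).
PROGRAM = {
    0: ("add", 5),
    1: ("dec",),       # n -= 1
    2: ("bnz", 0),     # if n != 0 goto 0, else fall through
    3: ("jmp", 5),
    4: ("add", 100),   # never executed
    5: ("add", 1),
    6: ("halt",),
}

def translate(program, pc):
    """Build one fragment: (body, exit_stub)."""
    body, seen = [], {pc}
    while True:
        op = program[pc]
        if op[0] in ("add", "dec"):
            body.append(op)
            pc += 1
        elif op[0] == "jmp":              # unconditional branch: elide
            if op[1] in seen:             # guard against jump cycles
                return body, ("goto", op[1])
            seen.add(op[1])
            pc = op[1]
        elif op[0] == "bnz":              # conditional branch: stop
            return body, ("bnz", op[1], pc + 1)
        else:                             # halt
            return body, ("halt",)

def execute(fragment, state):
    """Run a fragment and return the next guest pc (None on halt)."""
    body, exit_stub = fragment
    for op in body:
        if op[0] == "add":
            state["acc"] += op[1]
        else:
            state["n"] -= 1
    if exit_stub[0] == "bnz":
        return exit_stub[1] if state["n"] != 0 else exit_stub[2]
    if exit_stub[0] == "goto":
        return exit_stub[1]
    return None

def run_dbt(program, state):
    code_cache, translations, pc = {}, 0, 0
    while pc is not None:
        if pc not in code_cache:          # cached? no: translate
            code_cache[pc] = translate(program, pc)
            translations += 1
        pc = execute(code_cache[pc], state)
    return state, code_cache, translations

final_state, code_cache, translations = run_dbt(PROGRAM, {"acc": 0, "n": 3})
```

The loop body is translated once and then reused from the code cache on every iteration. In this sketch every fragment exit returns to the dispatcher, which is the context-switch cost that fragment linking is meant to remove.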
Generic DBT operation
  Reducing context switches
    Fragment linking for direct targets
    Indirect branch target cache (IBTC) for indirects
  [Figure: same binary, DBT loop and code cache contents. The indirect exit of fragment H, J performs an IBTC lookup that maps the computed target to the translated target.]
  Kumar et al., Compile-time planning overhead reduc. SDT, IJPP'05
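An IBTC can be pictured as a small table from computed guest targets to translated targets, consulted before falling back to the translator. The sketch below is a toy model under stated assumptions: the `IBTC` class, its direct-mapped indexing by `guest_target % size`, and the addresses in `fragment_map` are all hypothetical. Each miss stands for one context switch into the translator.

```python
class IBTC:
    """Direct-mapped indirect branch target cache (toy model)."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size  # (guest_target, translated_target)
        self.misses = 0               # one miss = one context switch

    def lookup(self, guest_target, translator_lookup):
        index = guest_target % self.size
        entry = self.entries[index]
        if entry is not None and entry[0] == guest_target:
            return entry[1]           # hit: stay inside the code cache
        self.misses += 1              # miss: ask the translator
        translated = translator_lookup(guest_target)
        self.entries[index] = (guest_target, translated)
        return translated

# Hypothetical guest targets and code cache addresses.
fragment_map = {8: 0x1000, 9: 0x1040, 12: 0x1080}
ibtc = IBTC(size=4)
targets = [8, 8, 9, 8, 12, 8]
results = [ibtc.lookup(t, fragment_map.__getitem__) for t in targets]
```

Targets 8 and 12 share a table slot in this model, so the last lookup of 8 misses again after 12 replaced it; a larger table or a different index function would trade memory for fewer such conflicts.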
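Dynamic Optimization
  Trace selection
  [Figure: DBO flowchart. The interpreter interprets until a taken branch and looks up the branch target address (BTA). If it is cached, context restore and run from the code cache, with a context save on exit. If it is not cached and it is a start of trace, increment its counter. When the counter is hot, the trace selector interprets and generates code until a taken branch, repeating until the end of the trace and filling the hot trace buffer. Then optimize trace, form fragments and link fragments into the code cache. The binary has blocks A to E and a call to, and return from, blocks G to J.]
  Bala et al., Dynamo, PLDI'00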
Dynamic Optimization
  Trace formation: Most Recently Executed Tail (MRET)
  [Figure: DBO flowchart (interpreter, trace selector, hot trace buffer, optimize trace, form fragments, link fragments, code cache). The hot trace buffer holds the trace made of blocks A, C, D, E, G, H, J from the binary.]
  Bala et al., Dynamo, PLDI'00
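The MRET idea can be sketched as follows: count executions of start-of-trace candidates, and once a candidate becomes hot, record the blocks that execute next until the end-of-trace condition. This is a simplified sketch under stated assumptions: the `select_trace` helper is hypothetical, the set of backward branch edges is given as input instead of being derived from addresses, the threshold of 3 is a toy value, and the only end-of-trace conditions modeled are a backward taken branch and a length cap.

```python
def select_trace(stream, backward_edges, hot_threshold=3, max_len=16):
    """Return the most recently executed tail from the first hot head."""
    counters = {}
    recording = None
    prev = None
    for block in stream:
        backward = (prev, block) in backward_edges
        if recording is not None:
            # End of trace: backward taken branch or length cap.
            if backward or len(recording) >= max_len:
                return recording
            recording.append(block)
        elif backward:
            # Target of a backward taken branch: start-of-trace candidate.
            counters[block] = counters.get(block, 0) + 1
            if counters[block] >= hot_threshold:
                recording = [block]
        prev = block
    return recording

# Hypothetical execution of the slide's binary: D calls G, J returns to E,
# and E branches back to A. Only the loop edge E -> A counts as backward.
hot_path = ["A", "C", "D", "G", "H", "J", "E"]
cold_path = ["A", "B", "D", "G", "I", "J", "E"]
stream = hot_path + cold_path + hot_path + hot_path + hot_path
trace = select_trace(stream, backward_edges={("E", "A")})
```

No path profiling is needed in this scheme: only the head A is counted, and the trace is simply whatever executes right after A becomes hot.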
Dynamic Optimization
  Trace Optimization: IR, 2 passes (forward + backward)
    Branch fixup
    Redundancy elimination
    Compensation blocks
    Copy propagation
    Loop unrolling
    etc.
  [Figure: DBO flowchart (interpreter, trace selector, hot trace buffer, optimize trace, form fragments, link fragments, code cache). The trace with blocks A, C, D, G, H, J, E is shown before and after optimization.]
  Bala et al., Dynamo, PLDI'00
Dynamic Optimization
  Fragment formation and linking
  [Figure: DBO flowchart (interpreter, trace selector, hot trace buffer, optimize trace, form fragment, link fragments, code cache). The code cache holds one fragment with blocks A, C, D, G, H, J, E and exit stubs to B and to I, and a second fragment with blocks B, D, G, I, J, E and an exit stub to H.]
  Bala et al., Dynamo, PLDI'00
Code Cache Management
  Handle CC overflows
  Overhead sources
    miss rate
    eviction frequency
    unlinking cost
  Strategies: FLUSH, FIFO, Mid-grained, Generational
  [Figure: DBT with code cache manager. A PC is looked up in the map; on a hit, context restore; on a miss, region formation. If there is room in the CC, update the map and insert; if not, evict code first. The code cache is divided into cache units.]
  [Figure: generational scheme with a nursery cache, a probation cache (instrumented) and a persistent cache; labels FIFO, HOT and COLD.]
  Hazelwood & Smith, Managing Bounded Code Caches, TACO'03
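The trade-off between miss rate and eviction frequency can be seen on a small simulation of the two simplest strategies. This is a sketch under stated assumptions: the `simulate` function, the fragment names, the unit sizes and the request sequence are made up, every fragment is assumed to fit in an empty cache, and unlinking cost is not modeled.

```python
from collections import OrderedDict

def simulate(requests, sizes, capacity, policy):
    """Return (misses, eviction_events) for policy 'FLUSH' or 'FIFO'."""
    cache = OrderedDict()            # fragment -> size, insertion order
    used = misses = eviction_events = 0
    for frag in requests:
        if frag in cache:
            continue                 # hit
        misses += 1
        need = sizes[frag]
        if used + need > capacity:   # code cache overflow
            eviction_events += 1
            if policy == "FLUSH":
                cache.clear()        # evict everything at once
                used = 0
            else:                    # FIFO: evict oldest until it fits
                while used + need > capacity:
                    _, freed = cache.popitem(last=False)
                    used -= freed
        cache[frag] = need
        used += need
    return misses, eviction_events

sizes = {"A": 1, "B": 1, "C": 1, "D": 1}
requests = ["A", "B", "C", "D", "B", "C", "D", "B", "C", "A"]
fifo_result = simulate(requests, sizes, capacity=3, policy="FIFO")
flush_result = simulate(requests, sizes, capacity=3, policy="FLUSH")
```

On this sequence both policies are invoked the same number of times, but FLUSH discards fragments that are still in use and pays for it with extra misses. Mid-grained strategies sit between the two by evicting one cache unit at a time.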
Overview
  [Figure: complete concept map. Embedded Systems, Scratchpad Memory, Software-based Instruction Cache and Compiler-generated Overlays are added to Dynamic Binary Translation, Dynamic Binary Optimization, Code Cache Management, System Virtualization, Trap-And-Emulate (Classic), Paravirtualization (OS Assisted), Full System Virtualization, Hardware Assisted Virtualization and Memory Resource Management. Edge labels: needs, based on, approaches, have, enables.]
Scratchpad Memory (SPM)
  Software-controlled SRAM
  Replaces or complements caches
  Advantages:
    Fast
    Smaller than cache
    Energy-efficient
    Better timing predictability
  How to manage SPM?
    Static partitioning
    Software caching
    Overlays
  [Figure: three System-on-Chip (SoC) configurations, each with CPU, ROM, SPM and main memory (DRAM): SPM only; SPM plus I-L1; SPM plus I-L1 and D-L1.]
SW-Based I-Cache
  Binary rewriter
    Basic block formation: splitting & padding
  Runtime
    Almost a DBT!!! (offline region formation)
  [Figure: the binary rewriter takes the binary (blocks A to E and G to I, with a call and return) and produces blocks A, B, C1, C2, D, E, G, H, I, ... together with a destinations table listing the successors of each block.]
  [Figure: at run time, the rewritten blocks and the destinations table stay in memory, while the SPM holds the runtime (entry points EP1, EP2 and IndEP) and the copied blocks, here A and C1, which RUN from the SPM.]
  Miller & Agarwal, Software-based Instruction Caching, ASPLOS'06
Compiler-generated SPM Overlays
  Compiler introduces code to copy objects from memory to SPM and back at selected program points
  Questions:
    Which objects to promote/demote?
    At what (profitable) program points?
  Needs to know:
    Profile information
    SPM size
Concomitance + SMI
  Concomitance measures the temporal distance of block(s) execution
    Large self-concomitance → SPM
    Large concomitance (2 blocks) → can't overlay
  Program graph partitioning
    Nodes: blocks with large self-concomitance
    Partition into overlays
  Insert SMI in CFG edges
    Special instruction to copy code from memory to scratchpad
    Supported by SPM controller
    SMI opcode; operand: BBT addr
  [Figure: SPM controller, placed between the CPU and the instruction memories (I-MEM and I-SPM). It contains control logic, a memory controller and a Basic Block Table (BBT) whose entries hold the address in DRAM, the size and the address in SPM.]
  Janapsatya et al., Expl. Statistical Info. for Implem. Instr. SPM, TVLSI'06
Data-Program Rel. Graph (DPRG)
  For globals, stack variables and code (procedures)
  Program points based on control flow
  DPRG represents program regions and their time order
  Code inserted to promote/demote objects
  Usage information from profile
  Liveness analysis to eliminate unnecessary transfers
  Problems:
    Pointers
    Join nodes
    Gotos
  Udayakumaran et al., Dynamic Allocation for SPM, TECS'06
Optimal Scratchpad Overlay
  For globals, non-scalar locals and code traces
  Based on live ranges (profile for variables, static analysis for traces)
  Memory assignment: NP-complete, reduces to register allocation
  Solutions:
    Optimal: ILP formulation (16 sec.)
    Near optimal: heuristic
  Steps:
    1. Memory Object Determination
    2. Liveness Analysis
    3. Memory Assignment
    4. Onchip Address Assignment
    5. Code Generation
  Verma & Marwedel, Overlay Techniques for SPM, TVLSI'06
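The link between memory assignment and register allocation can be illustrated with a linear-scan style greedy pass over live ranges. This is not the ILP formulation or the heuristic from the cited work; it is a much simpler sketch under stated assumptions: every memory object has a single live range, all objects fit in one equally sized SPM slot, and the `assign_overlays` helper and the object names are hypothetical.

```python
def assign_overlays(objects, num_slots):
    """objects: list of (name, start, end) live ranges, end exclusive.

    Objects whose live ranges do not overlap may share (overlay) a slot.
    Returns name -> slot index, or None if the object stays in main memory.
    """
    slot_free_at = [0] * num_slots   # time at which each slot becomes free
    placement = {}
    for name, start, end in sorted(objects, key=lambda o: o[1]):
        for slot, free_at in enumerate(slot_free_at):
            if free_at <= start:     # previous occupant is dead
                placement[name] = slot
                slot_free_at[slot] = end
                break
        else:
            placement[name] = None   # no free slot: keep in main memory
    return placement

# Hypothetical memory objects (code traces and arrays) with live ranges.
objects = [
    ("trace1", 0, 10),
    ("arrayA", 2, 6),
    ("bufC", 3, 8),
    ("arrayB", 6, 12),
    ("trace2", 10, 20),
]
placement = assign_overlays(objects, num_slots=2)
```

Here `trace2` overlays `trace1` and `arrayB` overlays `arrayA`, while `bufC` overlaps with the occupants of both slots and stays in main memory. A real formulation would also weigh access counts against copy costs before deciding what to promote.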
Conclusions
  DBT-based virtualization transparently virtualizes general-purpose architectures (x86)
    Paravirtualization sacrifices OS-independence
    HW assisted not yet as efficient, increases HW cost
  Software i-caching manages SPM for code at runtime
    DBT can provide it (CC in SPM)
  Compiler-generated overlays already use profile information, but need to know SPM size
    DBO ideas (trace selection) can be adapted to exploit SPM for code
  DBT for embedded systems: exploit SPM and enable virtualization