Ph.D. Comprehensive Examination José A. Baiocchi Paredes Department of Computer Science University...

Ph.D. Comprehensive Examination

José A. Baiocchi Paredes
Department of Computer Science
University of Pittsburgh

Towards Virtualization of Embedded Systems with Scratchpad Memory

Overview

[Figure: concept map. System Virtualization has the approaches Trap-and-Emulate (Classic), Paravirtualization (OS Assisted), Full System Virtualization and Hardware Assisted Virtualization, and needs Memory Resource Management.]

System Virtualization
• Allows multiple operating systems to share the same hardware, each in its own virtual machine
• Uses: server consolidation, co-located hosting, distributed web services, application mobility, secure computing platforms, etc.

[Figure: two VMM organizations, each hosting Guest OS 1 to 3 with their user apps. Type I ("bare metal"): the Virtual Machine Monitor runs directly on the hardware. Type II ("hosted"): the VMM runs on a host OS, which runs on the hardware.]

Classical VMM
Popek & Goldberg, Formal Req. for Virt. 3rd Gen. Architectures, CACM’74

• Instruction behavior
  • Sensitive instructions (S)
    • control-sensitive: change the resource configuration
    • behavior-sensitive: depend on the resource configuration
  • Privileged instructions (P): trap in user mode, don’t trap in supervisor mode
• A VMM can be built if S ⊆ P
• Trap-and-emulate
  • Deprivileging: guest OS in user mode, VMM in supervisor mode
  • Impossible for x86!
• VMM properties: Efficiency, Resource Control, Equivalence
• VMM components: Dispatcher, Allocator, Emulation Routines

[Figure: instructions classified as innocuous or sensitive and as nonprivileged or privileged, with S contained in P. User applications and the guest OS run on the ISA exported by the VMM; a trap transfers control to the VMM, which runs on the hardware ISA.]
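The Popek & Goldberg condition can be checked mechanically once the instruction classes are known. The sketch below is only an illustration: the toy ISA and the function names are hypothetical, and the x86 set is a small hand-picked subset (`popf` and `sgdt` are well-known sensitive instructions that do not trap in user mode), not a complete classification.

```python
def is_classically_virtualizable(sensitive, privileged):
    """True when every sensitive instruction is also privileged (S is a subset of P)."""
    return set(sensitive) <= set(privileged)


def problem_instructions(sensitive, privileged):
    """Sensitive instructions that would NOT trap when the guest OS is deprivileged."""
    return sorted(set(sensitive) - set(privileged))


# Hypothetical toy ISA: every sensitive instruction traps in user mode.
toy_privileged = {"load_page_table", "set_timer", "halt"}
toy_sensitive = {"load_page_table", "set_timer"}

# Illustrative x86 subset (assumption: not a full classification).
x86_privileged = {"lgdt", "mov_to_cr3", "hlt"}
x86_sensitive = {"lgdt", "mov_to_cr3", "popf", "sgdt"}

print(is_classically_virtualizable(toy_sensitive, toy_privileged))
print(is_classically_virtualizable(x86_sensitive, x86_privileged))
print(problem_instructions(x86_sensitive, x86_privileged))
```

Any instruction reported by `problem_instructions` silently misbehaves under deprivileging, which is why classic trap-and-emulate fails on such an ISA.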

x86 Virtualization Challenges
• Protection (segmentation)
  • 4 privilege levels (rings)
  • Segment access by privilege level
  • Deprivileging: 0/1/3, 0/3/3
• Sensitive structures
  • On-chip: control registers, table registers, etc.
  • Off-chip: segment descriptor tables, page tables, interrupt tables, etc.
  • Shadow structures
  • Tracing: write-protected primary structures
• Sensitive unprivileged instructions

[Figure: privilege rings 3 to 0, with apps in ring 3 and the OS in ring 0. Segmentation: a logical address is checked against the segment descriptor (DPL) selected through %cs (CPL) in the GDT or LDT (located by %gdtr and %ldtr) and mapped into the linear address space. Paging: %cr3 points to the page directory, which leads to a page table and a page in the physical address space; translations are cached in TLBs.]

Paravirtualization
Barham et al., Xen and the Art of Virtualization, SOSP’03

• Guest OS modifications (0/1/3)
  • Paravirtualized x86 interface
  • OS can’t evolve independently!
• CPU
  • Xen-validated exception handlers
  • ‘Fast’ handler for system calls
  • Timer: real, virtual, wall-clock
• Memory
  • Xen in top 64MB of address space
  • Validated updates to segment descriptor tables and page tables
• I/O
  • Buffer-descriptor rings
  • HW interrupts replaced by events
• Domain0 runs control software
• VMM properties: Efficiency, Resource Control, Equivalence
• VMM components: Dispatcher, Allocator, Emulation Routines

[Figure: the Xen hypervisor runs on x86 hardware and exports the Domain0 control interface, a virtual x86 CPU, virtual physical memory, a virtual network interface and a virtual block device. Domain0 runs the control plane software; the paravirtualized guest OSes (with XDD) run user apps through an unchanged ABI.]

Hardware Assisted VMM
Adams & Agesen, HW & SW Techniques for x86 Virtualization, ASPLOS’06

• x86 extensions
  • 1st gen: AMD-V™, Intel® VT-x enable trap-and-emulate
• Guest OS runs in new guest mode, VMM in host mode
  • 4 privilege rings in both modes
• Host to guest: vmrun
  • Virtual Machine Control Block (VMCB)
  • Host state + guest state + control fields
• Guest to host: exit conditions
  • Diagnostic fields to aid VMM
• VMM properties: Efficiency, Resource Control, Equivalence

[Figure: user applications and the guest OS run on x86; the VMM runs on the extended hardware (x86+) and shares the VMCB with it; an exit transfers control from the guest to the VMM.]

Full System Virtualization
Adams & Agesen, HW & SW Techniques for x86 Virtualization, ASPLOS’06

• Direct execution of ring 3 code
• Binary translation of ring 0 code
• Dynamic Binary Translator (DBT)
  • Input: any x86 code, no ABI assumptions
  • Output: subset of x86 code, stored in Code Cache (CC), runs in ring 3
• Privileged instruction replacement
  • Simple: in-CC sequences
  • Complex: callout-and-emulate
• Adaptive BT
  • Frequent traps replaced by callouts
  • Reverted when trapping infrequent
• VMM properties: Efficiency, Resource Control, Equivalence

[Figure: user applications execute directly on x86 hardware; the guest OS is translated by the DBT inside the VMM into the code cache (CC), with emulation routines handling callouts.]
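The replacement rules for privileged instructions and the adaptive step can be sketched as a tiny translator. This is an assumption-laden illustration, not the translator described by Adams & Agesen: the class name, the threshold value and the emitted pseudo-assembly strings are hypothetical, and the reversion of callouts that stop trapping is left out.

```python
TRAP_THRESHOLD = 2  # hypothetical: traps tolerated before retranslating


class AdaptiveTranslator:
    def __init__(self):
        self.trap_counts = {}   # guest PC -> number of traps observed
        self.callouts = set()   # guest PCs retranslated as callouts

    def translate(self, pc, instr):
        """Return the code cache sequence emitted for one guest instruction."""
        if instr == "cli":
            # Simple case: in-CC sequence that updates the virtual CPU state.
            return ["and $~IF, vcpu.flags"]
        if instr == "mov_to_cr3":
            # Complex case: call out of the code cache and emulate.
            return ["callout emulate_mov_to_cr3"]
        if pc in self.callouts:
            # Adaptive case: this instruction trapped too often.
            return [f"callout emulate_at_{pc:#x}"]
        return [instr]          # innocuous: translated identically

    def on_trap(self, pc):
        """Record a trap; return True when the instruction should be retranslated."""
        self.trap_counts[pc] = self.trap_counts.get(pc, 0) + 1
        if self.trap_counts[pc] >= TRAP_THRESHOLD:
            self.callouts.add(pc)
            return True
        return False


t = AdaptiveTranslator()
print(t.translate(0x40, "mov [pte], eax"))   # identical at first
t.on_trap(0x40)
t.on_trap(0x40)
print(t.translate(0x40, "mov [pte], eax"))   # now a callout
```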

Memory Resource Mgmt.
Waldspurger, Memory Res. Mgmt. in VMware ESX Server, OSDI’02

• Virtual physical memory
  • physical addr. → machine addr.
  • VM config.: min, max, shares
• Content-based page sharing
  • Reduce memory pressure
  • Identical pages: copy-on-write
• Share-based allocation
  • Min-funding revocation
  • Idle memory tax: $\rho = \dfrac{S}{P \cdot (f + k \cdot (1 - f))}$ (S: shares, P: allocated pages, f: fraction of active pages, k: idle page cost)
• Reclamation
  • Ballooning forces guest OS to make paging decisions
  • Fallback to demand paging

[Figure: two VMs on VMware ESX over the hardware. In each VM, user apps and the guest OS use linear memory mapped to the VM's physical memory, which ESX maps to machine memory; a balloon inside a guest's physical memory reclaims pages.]
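As a worked example of the idle memory tax, the adjusted shares-per-page ratio from Waldspurger's paper is ρ = S / (P · (f + k · (1 − f))), with idle page cost k = 1 / (1 − τ) for a tax rate 0 ≤ τ < 1. The sketch below evaluates it for two made-up VMs; the function names and the numbers are illustrative assumptions, and picking the VM with the lowest ρ is a simplified reading of min-funding revocation.

```python
def shares_per_page(shares, pages, active_fraction, tax_rate):
    """Adjusted shares-per-page ratio: rho = S / (P * (f + k * (1 - f)))."""
    k = 1.0 / (1.0 - tax_rate)          # idle page cost
    return shares / (pages * (active_fraction + k * (1.0 - active_fraction)))


def reclamation_victim(vms, tax_rate):
    """Pick the VM with the lowest adjusted ratio (simplified min-funding revocation)."""
    return min(vms, key=lambda name: shares_per_page(*vms[name], tax_rate))


# Hypothetical VMs: (shares, allocated pages, fraction of active pages).
vms = {
    "busy_vm": (1000, 1000, 1.0),
    "idle_vm": (1000, 1000, 0.2),
}

for name, (s, p, f) in vms.items():
    print(name, round(shares_per_page(s, p, f, tax_rate=0.75), 4))
print(reclamation_victim(vms, tax_rate=0.75))
```

With equal shares and allocations, the VM holding mostly idle pages ends up with the lower ratio and is the one asked to give memory back.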

Overview

[Figure: concept map linking System Virtualization, Dynamic Binary Translation, Dynamic Binary Optimization and Code Cache Management, with edges labeled "needs", "based on", "approaches" and "enables".]

Dynamic Binary Translation
• Modifies a running program's binary instructions before they execute on the host platform
• Uses: emulation, virtualization, dynamic optimization, code security (shepherding), dynamic instrumentation, software I-caching, etc.

[Figure: alternative placements of the DBT layer in the software stack, among apps, guest OSes, the host OS and the hardware (HW).]

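Generic DBT operation
Scott et al., Retarget. & reconfig. SDT, CGO’03

• Conditional branch: stop

[Figure: the DBT loop. After a context save, the DBT checks whether the new PC is cached. If it is, the context is restored and execution continues in the code cache. If not, a new fragment is started and the DBT repeats fetch, decode, translate and next PC until the end of the fragment. The example binary has blocks A to E and, across a call and return, G to J. The code cache holds fragment A followed by the fragment exit stubs "to B" and "to C".]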

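Generic DBT operation
Scott et al., Retarget. & reconfig. SDT, CGO’03

• Branch and link: emulate side effects and elide
• Unconditional branch: elide
• Indirect branch: indirect exit stub

[Figure: the same DBT loop (context save, cached?, new fragment, fetch, decode, translate, next PC, end of fragment?, context restore). The code cache now holds A with exits "to B" and "to C"; C, D and G with exits "to H" and "to I"; H and J with an indirect exit stub ("ind"); and E with exit "to A".]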
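The translate-on-miss loop in the diagram can be sketched for a toy machine. Everything here is an illustrative assumption (the three-opcode ISA, the `PROGRAM` table, the function names); "translation" just copies decoded instructions into a fragment, and every fragment exit returns to the dispatcher, as in a DBT without fragment linking.

```python
# Hypothetical toy binary: guest PC -> (opcode, operand).
PROGRAM = {
    0: ("add", 5),     # acc += 5
    1: ("add", -1),    # acc -= 1
    2: ("jnz", 1),     # if acc != 0: goto 1
    3: ("halt", None),
}


def build_fragment(pc):
    """Fetch, decode and translate until a control transfer ends the fragment."""
    body = []
    while True:
        op, arg = PROGRAM[pc]
        body.append((pc, op, arg))
        pc += 1
        if op in ("jnz", "halt"):
            return body


def run(entry=0):
    code_cache = {}            # fragment map: guest PC -> fragment
    translations = 0
    acc, pc = 0, entry
    while pc is not None:
        if pc not in code_cache:               # Cached? No: build a new fragment
            code_cache[pc] = build_fragment(pc)
            translations += 1
        for ipc, op, arg in code_cache[pc]:    # "context restore": run cached code
            if op == "add":
                acc += arg
            elif op == "jnz":                  # fragment exit: back to the dispatcher
                pc = arg if acc != 0 else ipc + 1
            elif op == "halt":
                pc = None
    return acc, translations, sorted(code_cache)


print(run())   # (0, 3, [0, 1, 3])
```

The loop body at PC 1 executes several times but is translated only once; the later iterations hit in the code cache.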

Generic DBT operation
Kumar et al., Compile-time planning overhead reduc. SDT, IJPP’05

• Reducing context switches
  • Fragment linking for direct targets
  • Indirect branch target cache (IBTC) for indirects

[Figure: the example binary (blocks A to E and G to J) and the code cache as before, with the indirect exit replaced by an IBTC lookup that maps the computed target to its translated target.]
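An indirect branch target cache can be pictured as a small table from computed guest targets to translated targets, consulted inside the code cache so that a hit avoids a context switch to the translator. The sketch below assumes a direct-mapped table indexed by `guest_target % size` and a caller-supplied `translate` fallback; both choices, and all names, are hypothetical.

```python
class IBTC:
    def __init__(self, size=4):
        self.size = size
        self.entries = [None] * size   # slot -> (guest_target, translated_target)
        self.misses = 0

    def lookup(self, guest_target, translate):
        """Return the translated target, calling the translator only on a miss."""
        slot = guest_target % self.size
        entry = self.entries[slot]
        if entry is not None and entry[0] == guest_target:
            return entry[1]                      # hit: stay in the code cache
        self.misses += 1                         # miss: context switch to the DBT
        translated = translate(guest_target)
        self.entries[slot] = (guest_target, translated)
        return translated


def fake_translate(pc):
    """Hypothetical translator: pretend code cache addresses start at 0x1000."""
    return 0x1000 + pc


ibtc = IBTC(size=4)
for target in (8, 8, 12, 8):
    print(hex(ibtc.lookup(target, fake_translate)))
print(ibtc.misses)   # 3: targets 8 and 12 share a slot and evict each other
```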

Trace selection
Bala et al., Dynamo, PLDI’00

[Figure: Dynamo next to the generic DBT loop. The dynamic optimization interpreter interprets until a taken branch and, after a context save, looks up the branch target address (BTA) in the code cache; if it is cached, the context is restored and the cached code runs. Otherwise, if the target is a start of trace, its counter is incremented; once it is hot, the trace selector interprets and generates code until a taken branch, repeating until the end of trace and filling the hot trace buffer. The DBO then optimizes the trace, forms fragments and links them in the code cache. The example binary has blocks A to E and G to J.]

Trace formation: Most Recently Executed Tail (MRET)

[Figure: the same Dynamo flow. Once a start-of-trace becomes hot, the blocks executed next are recorded in the hot trace buffer; in the example it holds blocks A, C, D, E, G, H and J.]
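MRET selection can be sketched over a stream of executed block addresses. The sketch assumes the usual Dynamo-style conditions (the target of a backward taken branch is a start-of-trace candidate, and a backward taken branch ends a trace); the threshold value, the address-comparison test for "backward" and all names are simplifying assumptions.

```python
from collections import defaultdict

HOT_THRESHOLD = 3   # hypothetical value; the real threshold is a tuning parameter


def select_traces(block_stream, threshold=HOT_THRESHOLD):
    """Return {trace head: recorded blocks} for an executed-block address stream."""
    counters = defaultdict(int)   # start-of-trace candidate -> execution count
    traces = {}
    recording = None              # hot trace buffer while a trace is being formed
    prev = None
    for block in block_stream:
        backward = prev is not None and block <= prev
        if recording is not None:
            if backward:                       # end-of-trace condition
                traces[recording[0]] = recording
                recording = None
            else:
                recording.append(block)        # most recently executed tail
        if recording is None and backward and block not in traces:
            counters[block] += 1
            if counters[block] >= threshold:   # hot: start recording
                recording = [block]
        prev = block
    return traces


# Loop head 10 with two paths; the path run right after 10 turns hot is kept.
stream = [10, 30, 40] * 4 + [10, 20, 40, 10]
print(select_traces(stream))   # {10: [10, 30, 40]}
```

No path profile is kept: only the head is counted, and the tail is whatever executes next, which is the point of MRET.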

Trace Optimization: IR, 2 passes (forward + backward)
• Branch fixup
• Redundancy elimination
• Compensation blocks
• Copy propagation
• Loop unrolling
• etc.

[Figure: the same Dynamo flow. The trace in the hot trace buffer (blocks A, C, D, G, H, J, E) is shown before and after optimization.]
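Because a trace is straight-line code, optimizations such as copy propagation need only a single forward pass. The sketch below works on a hypothetical three-address IR of `(dest, op, sources)` tuples over register names; it illustrates the technique in general and is not Dynamo's IR or optimizer.

```python
def copy_propagate(trace):
    """Replace uses of copied registers by their original source, in one forward pass."""
    copies = {}   # register -> register it currently holds a copy of
    out = []
    for dest, op, srcs in trace:
        srcs = tuple(copies.get(s, s) for s in srcs)
        # dest is redefined: forget every copy relation that involves it.
        copies = {d: s for d, s in copies.items() if d != dest and s != dest}
        if op == "mov" and srcs[0] != dest:
            copies[dest] = srcs[0]
        out.append((dest, op, srcs))
    return out


trace = [
    ("r2", "mov", ("r1",)),        # r2 = r1
    ("r3", "add", ("r2", "r4")),   # r2 can be replaced by r1 here
    ("r1", "add", ("r1", "r5")),   # r1 changes: r2 is no longer a copy of it
    ("r6", "add", ("r2", "r3")),   # r2 must be kept
]
for instr in copy_propagate(trace):
    print(instr)
```

After the pass, a backward pass could remove the `mov` if `r2` turns out to be dead, which is one reason for pairing a forward and a backward pass.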

Fragment formation and linking

[Figure: the same Dynamo flow. The code cache holds the fragment A, C, D, G, H, J, E with exit stubs "to B" and "to I"; a second fragment B, D, G, I, J, E with exit stub "to H" is then formed and linked with the first.]

Code Cache Management
Hazelwood & Smith, Managing Bounded Code Caches, TACO’03

• Handle CC overflows
• Overhead sources: miss rate, eviction frequency, unlinking cost
• Strategies: FLUSH, FIFO, mid-grained, generational

[Figure: on a PC map lookup miss, the DBT performs region formation and calls the code cache manager, which evicts code if there is no room in the CC, then updates the map and inserts; on a hit the context is restored. The code cache is divided into cache units. Generational scheme: nursery cache, probation cache (instrumented) and persistent cache, with FIFO, HOT and COLD transitions.]
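The trade-off between eviction strategies can be explored with a small simulator. The sketch below compares only FLUSH and FIFO on a made-up request stream; it counts misses and evicted fragments, ignores unlinking cost, and assumes each fragment fits in the cache. Names and sizes are hypothetical.

```python
from collections import OrderedDict


def simulate(requests, sizes, capacity, policy):
    """Return (misses, evicted fragments) for policy 'FLUSH' or 'FIFO'."""
    cache = OrderedDict()            # fragment id -> size, in insertion order
    used = misses = evictions = 0
    for frag in requests:
        if frag in cache:
            continue                 # hit: nothing to do
        misses += 1
        size = sizes[frag]
        while used + size > capacity and cache:
            if policy == "FLUSH":    # drop the whole cache at once
                evictions += len(cache)
                cache.clear()
                used = 0
            else:                    # FIFO: drop the oldest fragment
                _, old_size = cache.popitem(last=False)
                used -= old_size
                evictions += 1
        cache[frag] = size
        used += size
    return misses, evictions


sizes = {"A": 2, "B": 2, "C": 2}
requests = ["A", "B", "C", "B", "A", "B"]
print(simulate(requests, sizes, capacity=4, policy="FIFO"))    # (5, 3)
print(simulate(requests, sizes, capacity=4, policy="FLUSH"))   # (6, 4)
```

On this stream FIFO keeps B alive across the first overflow and saves one retranslation; a fuller model would also charge FIFO for unlinking each evicted fragment.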

Overview

[Figure: concept map extended with Embedded Systems, Scratchpad Memory, Software-based Instruction Cache and Compiler-generated Overlays, next to System Virtualization, Dynamic Binary Translation, Dynamic Binary Optimization and Code Cache Management; edges are labeled "needs", "based on", "approaches", "have" and "enables".]

Scratchpad Memory (SPM)
• Software-controlled SRAM
• Replaces or complements caches
• Advantages: fast, smaller than cache, energy-efficient, better timing predictability
• How to manage SPM? Static partitioning, software caching, overlays

[Figure: three System-on-Chip (SoC) configurations, each with CPU, ROM, SPM and DRAM main memory: SPM only; SPM with an L1 instruction cache (I-L1); SPM with L1 instruction and data caches (I-L1, D-L1).]

SW-Based I-Cache
Miller & Agarwal, Software-based Instruction Caching, ASPLOS’06

• Binary rewriter: basic block formation by splitting & padding
• Almost a DBT!!! (offline region formation)

[Figure: the binary rewriter turns the binary (blocks A to E and G to I, with a call and return) into padded blocks, splitting C into C1 and C2, and stores them in memory together with a destinations table. At run time, the runtime in the SPM, with entry points EP1, EP2 and IndEP, loads blocks such as A and C1 into the SPM and runs them.]

Compiler-generated SPM Overlays
• Compiler introduces code to copy objects from memory to SPM and back at selected program points
• Questions:
  • Which objects to promote/demote?
  • At what (profitable) program points?
• Needs to know: profile information, SPM size

Concomitance + SMI
Janapsatya et al., Expl. Statistical Info. for Implem. Instr. SPM, TVLSI’06

• Concomitance measures temporal distance of block(s) execution
  • Large self-concomitance → SPM
  • Large concomitance (2 blocks) → can’t overlay
• Program graph partitioning
  • Nodes: blocks with large self-concomitance
  • Partition into overlays
  • Insert SMI in CFG edges
• SMI: special instruction to copy code from memory to scratchpad
  • Supported by SPM controller
  • SMI opcode, operand: BBT addr

[Figure: SPM controller placed between the CPU and the I-MEM and I-SPM. It contains control logic, a memory controller and a Basic Block Table (BBT) whose entries hold the address of DRAM, the size and the address of SPM.]

Data-Program Rel. Graph
Udayakumaran et al., Dynamic Allocation for SPM, TECS’06

• For globals, stack variables and code (procedures)
• Program points based on control flow
• DPRG represents program regions and their time order
• Code inserted to promote/demote objects
• Usage information from profile
• Liveness analysis to eliminate unnecessary transfers
• Problems: pointers, join nodes, gotos

Optimal Scratchpad Overlay
Verma & Marwedel, Overlay Techniques for SPM, TVLSI’06

• For globals, non-scalar locals and code traces
• Based on live ranges (profile for variables, static analysis for traces)
• Memory assignment: NP-complete, reduces to register allocation
• Solutions:
  • Optimal: ILP formulation (16 sec.)
  • Near optimal: heuristic
• Steps:
  1. Memory Object Determination
  2. Liveness Analysis
  3. Memory Assignment
  4. Onchip Address Assignment
  5. Code Generation
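To make the memory and address assignment steps concrete, the sketch below uses a simple greedy first-fit heuristic: objects are considered by benefit per byte, and two objects may share SPM addresses only if their live ranges do not overlap. This is an illustrative stand-in, not the ILP formulation or the heuristic of Verma & Marwedel; the object names, sizes, live ranges and benefits are made up.

```python
def overlaps(a, b):
    """True when half-open live ranges a and b intersect."""
    return a[0] < b[1] and b[0] < a[1]


def assign_overlays(objects, spm_size):
    """objects: name -> (size, (live_start, live_end), benefit). Returns name -> SPM offset."""
    placed = {}
    order = sorted(objects, key=lambda n: objects[n][2] / objects[n][0], reverse=True)
    for name in order:
        size, live, _ = objects[name]
        # SPM address intervals held by placed objects that are live at the same time.
        busy = sorted((placed[o], placed[o] + objects[o][0])
                      for o in placed if overlaps(live, objects[o][1]))
        offset = 0
        for lo, hi in busy:              # first fit among the gaps
            if offset + size <= lo:
                break
            offset = max(offset, hi)
        if offset + size <= spm_size:
            placed[name] = offset        # otherwise the object stays in main memory
    return placed


objects = {
    "loop_a": (6, (0, 10), 60),    # code trace hot in the first phase
    "loop_b": (6, (10, 20), 48),   # code trace hot in the second phase
    "table":  (2, (0, 20), 30),    # data used throughout
}
print(assign_overlays(objects, spm_size=8))
```

Here `loop_a` and `loop_b` receive the same offset because they are never live together, so an 8-byte SPM serves 14 bytes of objects.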

Conclusions
• DBT-based virtualization transparently virtualizes general-purpose architectures (x86)
  • Paravirtualization sacrifices OS-independence
  • HW-assisted virtualization is not yet as efficient and increases HW cost
• Software I-caching manages SPM for code at runtime
  • DBT can provide it (CC in SPM)
• Compiler-generated overlays already use profile information, but need to know SPM size
  • DBO ideas (trace selection) can be adapted to exploit SPM for code
• DBT for embedded systems: exploit SPM and enable virtualization
