HyVM – Hybrid Virtual Machines -...

HyVM – Hybrid Virtual MachinesMachinesAda GavrilovskaAda GavrilovskaKarsten Schwan

Sudha Yalamanchili

and many PhD students

Project ContextProject Context

Future HPC applications on heterogeneous platforms• off (e.g., new NSF Track II machine)-chip and next,

toward on-chip heterogeneous cores• rich memory hierarchies, e.g., NUMAy g

New Execution Models• diverse cores• diverse cores• custom OS kernels• I/O support

Accelerators and tool chains• CUDA, OpenCL, …p• opportunity for compiler-based optimization methods

Future ApplicationsMedia and image processing:• for dynamic web content‘S fi h lik ’

Science and gaming:• Fusion modeling• High perf I/O: GTC Pixie3D XGC• ‘Snapfish‐like’

imagingsuite

• High perf. I/O: GTC, Pixie3D, XGC• Libraries: LAPACK, BLAS,VSIPL, …

Financial and risk analysis:• Black Scholes• Risk Analysis

Data‐intensive and web:• Critical enterprise codes• Data mining• Risk Analysis

• Derivatives processing• Data mining• Sensor data processing

Heterogeneous Platforms• Asymmetries:

– Performance• Different clock speeds, duty cycles• Differing issue widths• Varying cache sizesVarying cache sizes

– NUMA memory– Toward ‘islands of cores’

Functional differences:– Diverse accelerators: heterogeneous:

• Cell, GPU, Communications, Encryption, …

– Shared ISA: asymmetric:• Missing SSE version• Missing SSE version• Missing floating point• Additional instructions for acceleration (Larrabee, …)

HyVM Project GoalUniform runtime model for heterogeneous platforms:

Hybrid Virtual MachinesU if it ft b d l tf t i– Uniformity: software-based platform extensions:

• Virtual Execution Unit (VEU): uniform runtime representation for program executables, targeting heterogeneous cores

• Heterogeneity-awareness: system-level management methods forHeterogeneity awareness: system level management methods for improved platform utilization (incl. cache and energy) and application performance (SLAs)

• Dynamic platform emulation: runtime CK compilation or re-writing for diverse accelerator targets (via LLVM)for diverse accelerator targets (via LLVM)

– High performance: diverse executables:• Commodity and custom VEU ‘containers’: Virtual Machines (VMs)

– processes/threads – commodity cores; Special Executionprocesses/threads commodity cores; Special Execution Environments (e.g., NVIDIA) - Computational Kernels (CKs) -accelerators

• Runtime and adaptive {CK} optimization for parallelismSt d d li t CK i d ti API (O CL• Standards-compliant CK programming and runtime APIs (OpenCL, CUDA)

• Compiler-based optimization techniques for {CK}

HyVM Project ElementsAttaining the uniform HyVM execution model

Leveraging virtualization technologies:– Leveraging virtualization technologies:• Virtual Execution Units (VEUs)

– Finer grain schedulable entities than VCPUs• Specialized execution environments (SEEs) for accelerators• Specialized execution environments (SEEs) for accelerators

– GViM for efficient GPU virtualization (Niraj Tolia, HP)– Cellule: Dilma Silva, Jimi Xenides, Hubertus Franke (IBM)

• Montage: dynamic resource management for sets of VEUs (SLA• Montage: dynamic resource management for sets of VEUs (SLA-awareness, runtime monitoring)

– Coordinated scheduling for accelerators – VMaCS– InTuneS - system abstractions for coordinated management– InTuneS - system abstractions for coordinated management– Cache-aware ‘region’ scheduling + correlation scheduling for

shared ISA VEUs – addressing NUMA and asymmetric platforms• Future hypervisors (using Xen): heterogeneity-awareFuture hypervisors (using Xen): heterogeneity aware

– Extended work: scalable hypervisor structures

HyVM Project ElementsHyVM Project Elements

– Leveraging evolving industry standards for accelerator APIs and interactions:

• Tool chain support for CUDA => LLVM/PTX translation, with future work on OpenCL – Ocelot

• Exploiting compiler methods to optimize execution regimes for sets of VEUs – Harmony

• Runtime APIs at CUDA level - GViM– Exploring new programming and execution models:

• Datalog to GPU compilation• Scheduling via on line performance models and dynamic• Scheduling via on-line performance models and dynamic

code generation for GPU vs. Host ISAs Ocelot/LLVM• New analysis tools for debugging and performance tuning via

emulation Ocelot++emulation Ocelot++

DataInCK1

DataIn

DataIn

CK1 CK2CK3

CK4

Data Data

DataOutCK5

DataOut

i l i

Execution Model(s) OS

Structures

Virtual Execution Unit (VEU)


Executable Code

State

Virtual Execution Unit

Virtual Execution Units (VEUs)

Structures

Executable Code

State Virtual Execution Unit (VEU)

Executable Code

State

(VEU)

Executable

State

OR Executable Code

State


Executable Code

State

Executable Code

Executable Code

GPUGPU CPUCPU sCPU (Shared ISA)sCPU (Shared ISA)Emulated GPUEmulated GPU

Homogeneous View

Heterogeneous View

Spectrum of Exported Heterogeneity

Simple GuestSimple Guest

Virtual Platform

Enhanced GuestEnhanced Guest

Vi t l Pl tfVCPUVCPUVCPUVCPU

Virtual Platform

Platform

VCPUVCPUVCPUVCPUVirtual Platform

C i /

sVCPUsVCPUaVCPUaVCPU

Virtual Machine Monitor + Management Domain

Platform interactions

Cooperative export/use of platform information

Virtual Machine Monitor + Management Domain


C llC ll CPUCPU h dh dCPUCPU

Physical Platform

CellCell CPUCPU sCPU (Shared ISA)sCPU (Shared ISA)CPUCPU

Guest RequestHypervisor Response

CPUCPU CPUCPU CPUCPU CPUCPUVirtual Platform Physical Platform

Virtual Platform Physical PlatformCPUCPU CPUCPU CPUCPU sCPUsCPUFault &

Migrate

Physical Platform

CPUCPU acc1acc1 CPUCPU acc1acc1Virtual Platform Physical Platform

CPUCPU 22 CPUCPU 11Virtual Platform Physical Platform

CPUCPU acc2acc2 CPUCPU acc1acc1Ocelot binary

translation

Virtual Machine/GuestVirtual Machine/GuestApplicationsApplications

OpenCL, CUDA, Harmony… kernel(s) for heterogeneous

ISA cores

General purpose or shared ISA programs

threads/processes

State

Activity Unit

Executable Code

StateActivity Unit

Executable Code

Activity Unit

Executable Code

StateActivity Unit

Executable Code

State

e

Communication

OS SchedulerHarmony, OpenCL Runtimes,

GViM - CUDA API Cellule

Guest OS

aVCPUaVCPU VCPUVCPU sVCPUsVCPU

GViM - CUDA API, Cellule

aVCPUaVCPU

Virtual Machine Monitor + Management Domain + SEEs

Ocelot Binary Translation

Admission Control and Monitoring

Federated + Correlation + Region Scheduling


Technical Elements – GViM and VMaCSVishakha Gupta, Niraj Tolia, othersp j

Extending Xen for efficient {CK} execution

Management Domain (Dom0)

Mgmt GPU

VMGPU Application

VMGPU Application

Mgmt Extension

GPU Backend

Traditional D i D i

GPU Frontend

LiGPU Driver

CUDA API

GPU Frontend

Li

CUDA API

Hypervisor (Xen)

Device Drivers Linux Linux

yp ( )

General purpose multicoresGeneral purpose multicores

aPCPU (NVIDIA GPU) Other Devices

Technical Elements – GViM + VMaCS

Polling Domain request queuesAccelerator ready queues

Resource Management

Logic

thread(s) from BackendLogic Backend

GViM:• VEUs on

Dom0

Mgmt Extension

GPU Backend

Traditional Device

VM

GPU FrontendGPU Driver

CUDA APIGPU Application

VM

GPU Frontend

CUDA APIGPU Application

• VEUs ongVCPUs

Hypervisor (Xen)

Device Drivers

G l ltiG l lti

LinuxGPU Driver LinuxVMAcS:• Management Extensions for General purpose multicoresGeneral purpose multicores

pCompute Acc (NVIDIA GPU) DevicesTraditional Devices

Extensions forCoordinatedScheduling

Technical Elements – GViMGPU Guest Performance Baseline Results

300

Total execution time (without host data assignment)

GPU Guest Performance – Baseline Results

250

e (m

sec)

150

200

xecution

Tim

e

Linux

Dom0

50

100

CUDA Ex Dom0

GuestVM

0

Bitonic (512) Matrix(48x48) MersenneTwister(12000) BlackScholes(30000)

Micro‐benchmarks (with dataset values)( )

Less than 15% slowdown in the worst caseFurther improvements made/underway

VMaCS Scheduling Methods - Examplesg p

• Round robin (RR)Round robin (RR)– Some guest domain chosen every period– Poller monitors its queues for the periodq p– Moves to the next domain

• XenoCredit– Guest credits assigned at boot time– Use these credits to calculate proportions – Poll domains for time proportional to their credits

BlackScholes: XenoCredit vs. Round Robin

• BlackScholes with 60000 options and 4096 iterationsH t ith 2 GPU d 4 t VM• Host with 2 GPUs and 4 guest VMs

Motivates new methods for resource management in gaccelerator-based systems – now being completed

Technical Elements - Region and Correlation SchedulingCorrelation Scheduling

Min Lee, Vishakha Gupta, Rob Knauerhase

• Targeting Asymmetric Multicore Platforms• Using Intel quad core and Nehalem server

architectures• Region scheduling – dealing with the

NUCA=>NUMA nature of future manycore systems + NUMAFix (Dulloor Rao)systems + NUMAFix (Dulloor Rao)

• Correlation scheduling – dealing with asymmetric codes with shared ISAasymmetric codes with shared ISA

Region Scheduling:C h A S h d li f CMPCache Aware Scheduling for CMPs

A B C DCurrent working set

Task A

Over utilized! Less utilized!

A B C D

Bad

Task B

Task CTask C

Task D

A C B D

Fully utilized Fully utilizedGood ☺

• IssuesW ki i i i i l ( i d h d)– Working set estimation – size, constituent elements (unique and shared)

• Approach– Runtime ‘region’ determination to capture these working set

Region Scheduling – Basic Idea

• Partition physical memory

P1R1(2)

R2(2)

P2

P3

P4

into regions• Schedule regions on

caches

P5

P6

P7

P8

R3(2)

R4(2)caches• Core can access only the

allowed regions

P9

P10

P11

P12

P13

( )

R5(5)

allowed regions– System-level enforcement

(using page tables)

P13

P14

P15

P16

P17

R6(4)

• Private/Shared regions– we focus on private regions ....

P17

P18

P19R7(2)

Physical page

..

PrivateRegion(Size)

SharedRegion(Size)

Correlation Scheduling: M i t A t i CMapping to Asymmetric Cores

Asymmetry-aware Hypervisor supports asymmetric cores and NUMA memory

VCPU Admission FlowchartNew VCPUNew VCPU

Any correlation

No

information?

Check availability of given PCPU

Yes

Overloaded? Add to the given PCPUYes

No

Occupied by relocatable VCPU(s)?

No

Shuffle VCPUs

Yes

Move to the next most correlated PCPU

No

All busy? Find the best feasible match

YesNo

Technical Elements – ‘Islands of Cores’P i k T bPriyanka Tembey

Dulloor RaoMukil Kesavan

Motivation: many-core platforms• Multiple execution and scheduling contexts: 'islands'Multiple execution and scheduling contexts: islands

• differ in capabilities, micro-architectures• have NUMA memory characteristicsy

• Inter-connection variants – current platforms• PCIe, shared caches, fast point-to-point Interconnect

(QPI/Hypertransport)• Inter-connects – future platforms

• complex cache structures (!)• complex cache structures (!)• PGAS memory with integrated memory

controllers/inter-connects?

Technical Elements – Islands of CoresTechnical Elements Islands of Cores

• InTuneS: Priyanka TembeyI t h d l i l d li ht i ht t h l– Inter scheduler-island lightweight event channel

– Generic yet expressive definition of inter-island message semantics– Gray-box scheduler coordination interface

• scheduler implementation-agnostic APIp g– Application/VM-scheduler interface for explicit SLA policy enforcement– Application/VM-monitoring for implicit observation-based policy

enforcement– Evaluation with complex scheduler (Linux) and hypervisor schedulerEvaluation with complex scheduler (Linux) and hypervisor scheduler

(HV in progress), initially using communication accelerator

• Scalable hypervisor structures: Mukil Kesavan and Dulloor RaoH i t t h ld t h `i l d f ’– Hypervisor structure should match `islands of cores’

– e.g., consider Xen `Dom0’ bundling of functionality⇒ offer solutions for isolation with bundled functionality⇒ Find ways to efficiently run many `small’ domains, specialized, each ⇒ d ays to e c e t y u a y s a do a s, spec a ed, eac

with different tasks

Technical Elements – Creating, Optimizing, and Re-targeting VEUs via Harmony

and Ocelot - Gregory Diamos, Andrew Kerr

Memory

Cap Model 3readInputs();

chunk

Inputs Outputs InputsOutputs

Memory computeInvariants();for all chunks{simulateChunk();

}kernel

Harmony Run-time

chunk

chunkTransparentscheduling, execution management of

}generateResults();

kernel

VEUs

LocalMemoryCache

ACC

DMA

FIFOLocalMemoryCache

LocalMemoryCache

ACC

DMA

FIFOACC

DMA

FIFOCPU CPU CPU

management of chunks

Bi tibl

Network (e.g., Hypertransport, QPI, PCIe)

Binary compatible VEUs across platform

Minimize/avoid re-tuning and porting applications as you add l taccelerators

Advanced optimizationsSpeculation, performance prediction, kernel fusion

Scalable Portable Execution – Harmony RuntimeRuntime

Future extensions of Harmony Current Single Node/MultiCurrent Single Node/Multi--GPU VersionGPU Version

yfunctionality will be to operate across multiple nodes. Intent:

1 Decision models for remote vs1.Decision models for remote vs. local execution2.Transparent deployment and execution of GPU kernels on remote GPUs3.Inclusion of performance optimizations, including:

i) speculative kernel executioni) speculative kernel execution,ii) data movement, and iii) memory footprint vs. concurrency tradeoffs

Problem Scaling – Risk Analysis ApplicationApplication

Measured execution

times

GPU interaction h d

Harmony distributes the work across 4 CPU cores, 2 slow GPU boards, and 1 fast GPU board

overhead dominates

Scaling of GPGPU Applications

Average applications can scale to 1015 cores, each with 180

element -wide SIMD units

Using the Ocelot CUDA Emulator

Ocelot: Dynamic Execution Infrastructure

NVIDIA Virtual ISA

Emerging Environment

Language Front End Language Front End

DatalogC Front End

StatusStatus: • In progress

Kernel IR

StatusStatus: • Moving Forward with Datalog FE retarget (LogicBlox Inc.)

p g

Run Time (Harmony)StatusStatus: • Single node/multi-GPU• being released

Ocelotg

StatusStatus: StatusStatus:CUDA JIT LLVM I/F

Emulator• Released

StatusStatus: • In progress (Fall 2009)

Supported ISAs (MIPS, SPARC, x86, etc.)

Future Directions• Why HyVMs: future applications:

– Media-rich web applications• online `photoshop’ (with HP)

d i t t i ti ( ith M t l )• dynamic stream customization (with Motorola)– Financial and HPC codes

• Benchmarks (e.g., Black Scholes – IBM, HPC – NVIDIA, Interactive - Intel)• Petascale codes (with DOE)

P t l I/O ( ith DOE)• Petascale I/O (with DOE)

• From platforms to clusters to large-scale data centers:– Distributed execution model for petascale machinesDistributed execution model for petascale machines– Other execution models: data-intensive (Hadoop, System S, …)

• Future platforms:– Power/performance tradeoffs and implications– Memory hierarchies and their effects (on- vs. off-chip)– Alternative memory models – e.g., PGAS– Massively parallel HyVMs – e g 1000s of lightweight domains– Massively parallel HyVMs – e.g., 1000s of lightweight domains– Tool chains: future standards (e.g., OpenCL), compiler research (other

faculty)

Broader ContextCERCS Research Center

• CERCS – Center for Experimental Research in Computer Systems (www cercs gatech edu)CERCS Center for Experimental Research in Computer Systems (www.cercs.gatech.edu)– NSF Center– Member companies include HP, IBM, Intel, and Atlanta companies like Logicblox and ICE, plus

several others– ORNL key partners: Steve Poole, Jeff Nichols, Barney Maccabe, Scott Klasky

• Opportunities for large-scale experimentation:– New award for large-scale hybrid cluster machine: NSF Track II award, Vetter, PI, Schwan and

Yalamanchili Co-PIs, first machine to be built Winter 2010• provides additional funding to explore ‘distributed execution environment’ on high end machine and for ‘distributed

performance emulation of multi-GPU machines’

• Project growth and impact through CERCS efforts and faculty:– Involving other faculty:

• Santosh Pande – Glimpses - finding accelerator functions in application codes (NSF)• Hyesoon Kim – Prospector – precise performance (and power) estimation for next generation accelerators (with additional

support from Intel – Larrabee – NVIDIA and NSF) – IBM ArchPIC - Valentina Salapura • Nate Clark – program analysis (Google)Nate Clark program analysis (Google)• Matt Wolf – I/O acceleration for petascale machines (DOE ORNL)

– Working with technology and application partners:• IBM: financial and HPC apps., focus on system-level esource management• NVIDIA: student fellowships – Harmony, Ocelot, equipment donations• Intel: exploring asymmetric architectures• HP: CUDA-based virtualization, coordinated scheduling, data center and web applications, g, pp• Logicblox: accelerating Datalog applications used in retail management and insurance risk estimation

– Additional research support:• NSF Track II award – Oct. 2009• NSF Modeling and Simulation Award (recommended) October 2009 <integrating Ocelot with QEMU>• NSF SHF program award – Sept. 2009• DOE and NSF: I/O for petascale machines, Jan. 2009• Steve Poole: Power/Performance tradeoffs• HP: coordinated scheduling and large-scale data center management, Aug. 2008• Intel: IA-compatible VMs for heterogeneous many-core systems, since 2007

– Linkage to other projects:• The Energy Stack

Publications and ResourcesProject web page: http://www.cercs.gatech.edu/projects/HyVM/Recent Publications:– Priyanka Tembey, Anish Bhatt, Dulloor Rao, Ada Gavrilovska, Karsten Schwan,

“Fl ibl Cl ifi ti H t M lti A li Pl tf ” ICCCN'08“Flexible Classification on Heterogeneous Multicore Appliance Platforms”, ICCCN'08, Aug. 2008

– A. Kerr, G. Diamos, and S. Yalamanchili, "A Characterization and Analysis of PTX Kernels," to appear in the 2009 IEEE International Symposium on Workload Characterization, October 2009.

– V. Gupta, J. Xenidis, P. Tembey, K. Schwan, A. Gavrilovska, "Cellule: Lighweight Execution Environment for Virtualized Accelerators", poster session, SuperComputing'09, Nov. 2009.

– G. Diamos and. S. Yalamanchili, "Kernel Level Speculation”, under submission.– V. Gupta, J. Xenidis, K. Schwan, "Cellule: Lighweight Execution Environment forV. Gupta, J. Xenidis, K. Schwan, Cellule: Lighweight Execution Environment for

Virtualized CellB.E.", under submission. – Min Lee and Karsten Schwan, "Region Scheduling: Virtual Machine Scheduling for

Next Generation Multicore Platforms", in preparation.– Vishakha Gupta, Rob Knauerhase, and Karsten Schwan, “Asymmetric Hypervisors”, in

preparationpreparation.– Priyanka Tembey, Ada Gavrilovska, and Karsten Schwan, “InTuneS: System

Abstractions for Coordinated Scheduling”, in preparation.

Software:– Ocelot, can be obtained at http://code.google.com/p/gpuocelot/, including

documentation.– Harmony programming model:

http://www.ece.gatech.edu/research/labs/casl/harmony/hpdc5hot-diamos.pdf

HyVM – Hybrid Virtual Machines -...

Documents

Transcript of HyVM – Hybrid Virtual Machines -...