ACA Term Paper

Cloud Computing: Shared Resources and their Management

December 12, 2011

Himanshu Gupta: 3036 | Rajat Rao: 7365
CSE 661


Contents

Abstract
Goal of Paper
Background
Introduction
Hypervisors
    Types of hypervisors
    Native v/s Hosted
Linux Hypervisors
    KVM
        High-level view of the KVM hypervisor
    Lguest
    Linux hypervisor benefits
Xen Hypervisor
    Architecture of Xen
        Modified Linux Kernel/Domain 0
        Domain U
        Working
Domain Control and Management
    Xend
    XM
    Libxenctrl
Domain0 to DomainU/Guest Domain Communication
Processors for Clouds
    Single-Chip Cloud Computer/Intel
        Top Level Architecture
        L2 Cache
        LMB (Local Memory Buffer)
        DDR3 Memory Controllers
        LUT
        MIU (Mesh Interface Unit)
        SCC Power Controller (VRC)
    AMD Opteron
Memory Cloud
Inside the Difference Engine
    Page Sharing
    Patching
    Delta Encoding
    Memory Compression
Future Work and Conclusion
Appendix and Presentation Slides
References


Abstract

This paper describes why we think that shared memory architecture and its management are of prime importance in a Cloud Computing environment. Cloud Computing refers both to the applications delivered as services over the web and to the hardware and software systems in the datacenters that provide those services. The services themselves have long been referred to as Software as a Service (SaaS); the hardware and software in the datacenter is what we will call a Cloud. Thus, Cloud Computing is the whole package of Utility Computing together with Software as a Service.

Most applications need a computational model, a storage model, and a communication model. The statistical multiplexing necessary to achieve elasticity and the illusion of unlimited capacity requires virtualization [13], so that each of these resources can be virtualized and the way it is multiplexed and shared hidden from the user.

Cloud computing provides a way of achieving seemingly boundless computation even when the required computational power grows or shrinks rapidly. Cloud services therefore rely on job distribution and resource pooling. The frameworks that divide the work and the resources are designed to support large-scale distribution in an ever-changing environment; to achieve parallelism, each node should have enough local memory to serve a request on its own rather than having to ask other nodes for resources. This demands vast amounts of memory and other resources, which must be managed efficiently to reduce waste.

Seemingly unlimited storage is often implemented with complex, multi-tiered distributed systems built on top of clusters of servers and arrays of disk drives. Robust and intelligent management, load balancing, and failure-recovery methods are needed to achieve high performance and availability of resources even though there is an abundance of failure sources, including software, hardware, power, and network connectivity issues.

Goal of Paper

This paper presents a detailed analysis and study of cloud computing, focusing on the hardware support needed for running cloud systems. We found that setting up any cloud system requires two important things. The first is the hardware: servers powerful enough to support the rising demand, whose requirements differ from those of a desktop or general-purpose CPU; some of the key players in the processor market have introduced breakthrough products for exactly this purpose. The second is the software layer that runs on this hardware to serve distributed computing requests quickly: the operating system needed for this purpose is called a hypervisor. Hypervisors come in different forms, and a detailed study of them is provided in this paper. The paper also discusses memory sharing among virtualized machines.

Note: This paper does not discuss the software/service side of cloud computing, such as PaaS/SaaS.

Background

Historically, there have been two ways of building a computer that can serve high computational needs.

1. The Blue Gene [2] approach: build a gigantic machine with thousands of CPUs.
2. The other approach, used by Google, is to take hundreds of thousands of small, cheap computers and join them in a "cluster" in such a way that they all work together as one big computer.

Supercomputers have many processors plugged into a single machine, sharing common memory and I/O, while clusters are made up of many smaller machines, each containing fewer processors and having its own local memory and I/O. Both kinds of systems are expensive and complex to maintain. Cloud computing offers an easier way to meet high computational requirements while keeping costs low.

Cloud computing is the natural evolution of the widespread adoption of virtualization and utility computing. Lower-level details are hidden from end users, who no longer need proficiency in, or control over, the technology infrastructure "in the cloud" that supports them [9].


Introduction

Technological advances over the years have produced a trend toward workstation-oriented computing in which applications live in cloud storage and the workstation is merely the point of access. The availability of powerful systems and server resources through these environments has greatly widened the scope of Cloud Computing. Current research aims at using these types of systems to solve large-scale problems. For the Cloud to process and solve large problems, resources must be available immediately so that requests can be handled at all times; resources therefore need to be shared and managed efficiently.

Cloud Computing uses virtualization as its backbone, and current industrial implementations of cloud computing are all based on virtualization technology. This section discusses the virtualization technology used for this purpose.

In hardware, shared memory means a large block of memory shared by different processors. Because all processing elements see the same data, it is difficult to scale such a system to very large sizes. The problem with shared memory systems is that processors need quick access to memory and will therefore cache it, which has its own complications:

• The connection between the processors and main memory is a bottleneck, since each processor has its own on-chip cache in front of the shared memory.

• Cache coherency problems can arise: whenever one on-chip cache is updated with data that other processors may use, the change must be propagated to them; otherwise each processor works with its own stale copy, different from the data that was just updated. Coherence protocols address this, and when they function properly they provide efficient access to shared information across processors. Under heavy, continuous, concurrent updating, however, the protocol traffic itself can become overloaded and turn into a performance bottleneck.

In a Cloud system, every aspect of the data center environment plays a role in performance: the speed and number of CPUs, the amount of shared memory available, the size and performance of the storage systems, and the speed and efficiency of the network connecting them. Within the data center, memory availability and the speed of the interconnect among systems remain the true performance bottlenecks, because the processing systems must interact with each other to serve each incoming request. Data centers everywhere face ongoing demands for higher performance and greater efficiency. Memory can fail for various reasons, so many new techniques aim to reduce the waste caused when the memory management system buckles under heavy transaction load in large-scale environments. Some of the techniques we touch on in this paper, such as memory virtualization and CPU virtualization, address this.

Cloud computing owes much of its functioning to virtualization technology [12], which is essentially a software layer over the hardware that creates a virtual environment (rather than an actual one) in which hardware and software systems are the key players. With memory virtualization, distributed and networked servers share a memory pool to overcome physical memory limitations, a common bottleneck in software performance. With this feature integrated into the network environment, applications can take advantage of a very large, elastic amount of memory to improve overall performance and system utilization, increase memory efficiency, and enable new use cases such as integrating new user applications into existing ones.

Shared memory implementations differ substantially from memory virtualization. Shared memory systems do not abstract away the memory resources, so the design and implementation must live within a single operating system instance rather than in the common clustered environment of commodity servers.

Distributed Shared Memory (DSM) is a form of memory architecture in which physically separate memories are addressed through one logically shared address space: the same address on two processors refers to the same memory location, yet there is no single centralized memory. Such a shared architecture may involve dividing memory into common regions distributed among the processing systems and main memory, or distributing all memory between server nodes. Efficiently using a proper coherence protocol, in accordance with a consistency model, to maintain memory coherence is therefore very important.

Memory virtualization takes advantage of a shared memory pool. Platforms built on these solutions eliminate memory segmentation across entire server clusters by creating a pool of shared memory. Such a platform combines memory from dedicated appliances with memory from existing servers to create a cache of many terabytes that can be seamlessly used and shared among the servers or data centers in a group. It provides a networked shared-memory resource big enough to accommodate billions of data sets, dramatically reducing information access time and improving processing performance, so that results are returned very quickly for each request the user makes to the server.

CPU virtualization involves one physical CPU acting as if it were two separate CPUs, which gives the impression of running two separate machines on a single physical computer. The most important reason for implementing this is to run two heterogeneous operating systems on one physical machine. The main objective of CPU virtualization is to make a single CPU behave the way two different machines would. In summary, the virtualization software is set up so that it, and only it, can talk directly to the underlying CPU; everything else that happens on the front end goes through this software, whose job is to take each request and communicate with the rest of the computer as if it were connected to two different CPUs.


Virtualization needs a layer over the hardware known as the hypervisor. The following section discusses hypervisors in detail.

Hypervisors

Hypervisors play a crucial role in virtualization and cloud computing. A hypervisor is a software or firmware component that virtualizes system resources [5]. It is responsible for CPU scheduling and memory partitioning among the various virtual machines running on the same hardware device. It not only abstracts the hardware for the virtual machines but also controls their execution as they share the common processing environment. It typically has no knowledge of networking, external storage devices, video, or the other common I/O functions found on a computing system. A hypervisor is sometimes called a virtual machine monitor, or VMM [5].

Types of hypervisors

• Native hypervisors: sit directly on the hardware platform and are most likely used to gain better performance for individual users.

• Embedded hypervisors: integrated into a processor on a separate chip. Using this type of hypervisor is how a service provider gains performance improvements.

• Hosted hypervisors: run as a distinct software layer above both the hardware and the host OS. This type of hypervisor is useful in both private and public clouds.

The figure below depicts the key difference between native and hosted hypervisors.

Native v/s Hosted

Cloud computing must completely separate physical resource management from virtual resource management. It should also provide the capability to mediate between applications and resources in real time. Additionally, the hypervisor should be capable of managing both the resources located locally within the same machine and any resources in other, physically separate servers connected by a network. Once the management of physical resources is separated from virtual resource management, the need for a mediation layer that arbitrates the allocation of resources between multiple applications and the shared, distributed physical resources becomes apparent.

So a hypervisor, irrespective of its type, is an application built on a layered architecture that abstracts the machine hardware and other low-level details from its guests; each guest sees a virtual machine instead of the real hardware. At a high level, the hypervisor needs a number of things to boot a guest OS:

• A disk.
• A kernel image to boot.
• A network device.
• A configuration (such as IP addresses and the quantity of memory to use).

The disk and network devices are generally backed by the hosting machine's physical disk and network device.

A simplified hypervisor architecture acts as the glue that permits a guest OS to execute concurrently with the host OS. This functionality requires a few specific elements:

• Interrupt handling: interrupts must be handled by the hypervisor, either to service real interrupts or to route interrupts for virtual devices to the guest operating system.
• System calls that bridge user-space applications with kernel functions.
• Input/output (I/O), which can be virtualized in the kernel or assisted by code in the guest operating system.
• A hypercall layer, commonly provided so that guests can make requests of the host operating system.
• A page mapper, which points the hardware at the pages belonging to a specific OS (guest or hypervisor).
• Trap and exception handling for faults that occur within guest operating systems.
• A high-level scheduler to transfer control between the guest operating systems and the hypervisor, and vice versa.

A conceptual sketch of how a dispatch loop ties these elements together is given below.
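The following structural sketch is purely illustrative and is not any particular hypervisor's code; the exit reasons, types, and helper functions (run_guest, route_interrupt, and so on) are hypothetical stand-ins for the elements just listed, declared here without implementations.

```c
#include <stdint.h>

/* Hypothetical reasons a CPU might report for leaving guest mode. */
enum exit_reason { EXIT_INTERRUPT, EXIT_HYPERCALL, EXIT_IO, EXIT_PAGE_FAULT, EXIT_TRAP };

struct vcpu {                                      /* hypothetical per-virtual-CPU state */
    uint64_t regs[16];
};

/* Hypothetical helpers standing in for the elements listed above. */
enum exit_reason run_guest(struct vcpu *v);        /* enter guest mode until it traps     */
void route_interrupt(struct vcpu *v);              /* real or virtual-device interrupt    */
void handle_hypercall(struct vcpu *v);             /* hypercall layer                     */
void emulate_io(struct vcpu *v);                   /* virtualized I/O                     */
void map_guest_page(struct vcpu *v);               /* page mapper / shadow page tables    */
void deliver_trap_to_guest(struct vcpu *v);        /* reflect an exception into the guest */
struct vcpu *schedule_next(void);                  /* high-level scheduler                */

void hypervisor_loop(void)
{
    for (;;) {
        struct vcpu *v = schedule_next();          /* pick a guest vCPU to run            */
        switch (run_guest(v)) {                    /* returns when the guest traps        */
        case EXIT_INTERRUPT:  route_interrupt(v);       break;
        case EXIT_HYPERCALL:  handle_hypercall(v);      break;
        case EXIT_IO:         emulate_io(v);            break;
        case EXIT_PAGE_FAULT: map_guest_page(v);        break;
        case EXIT_TRAP:       deliver_trap_to_guest(v); break;
        }
    }
}
```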


Simplified view of a Linux-based hypervisor

The market offers a wide range of hypervisors of the types discussed above. We picked some open-source hypervisors to study how they manage resources and guest operating systems. The hypervisors studied are analyzed below.

Linux Hypervisors

We studied two Linux-based hypervisors:

1. KVM: supports full virtualization.
2. Lguest: experimental; supports paravirtualization.

KVM

Some of its important features:

• It was the first hypervisor to become part of the mainline Linux kernel, initially for x86 hardware.
• It is implemented as a kernel module, which allows Linux to become a hypervisor simply by loading the module.
• It provides full virtualization on hardware platforms that provide hypervisor instruction support (e.g., Intel Virtualization Technology, AMD Virtualization).
• It also supports paravirtualized guests, including Linux and Windows.
• It supports symmetric multiprocessing for both hosts and guests, as well as enhanced features such as live migration, which allows a guest OS to move between physical servers.


This technology is implemented as two components:

1. A KVM loadable kernel module, which manages the virtualization hardware and exposes its capabilities through the /proc file system [5].
2. PC platform emulation, provided by a modified version of QEMU [5]. QEMU executes as a user-space process, coordinating with the kernel for guest operating system requests.

High-level view of the KVM hypervisor

When KVM boots a new operating system, that guest becomes a process of the host OS and can be managed and scheduled like any other process, but it runs in a "guest" mode that is distinct from the kernel and user modes. KVM uses the underlying hardware's virtualization support to provide native virtualization.

In KVM, every guest OS is mapped through the /dev/kvm device and maintains its own virtual address space, which is mapped onto the host kernel's physical address space. I/O requests are forwarded through the host kernel to the QEMU process running on the host.

KVM operates in the context of Linux as the host but supports a large number of guest operating systems, given underlying hardware virtualization support.
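To make the /dev/kvm interface concrete, the sketch below shows the minimal user-space flow for creating a VM and a vCPU through the kernel's KVM ioctl API and reacting to a guest exit. It is a simplified illustration only: error handling, register setup (KVM_SET_SREGS/KVM_SET_REGS), and loading real guest code are omitted, so the single KVM_RUN here would not execute anything useful.

```c
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);        /* talk to the KVM module      */
    int vm  = ioctl(kvm, KVM_CREATE_VM, 0);                /* one VM per file descriptor  */

    /* Give the guest 64 KB of "physical" memory backed by ordinary host pages. */
    void *mem = mmap(NULL, 0x10000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0,
        .memory_size = 0x10000, .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);              /* one virtual CPU             */
    int size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);      /* size of the shared run area */
    struct kvm_run *run = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu, 0);

    /* A real monitor (e.g. QEMU) would load guest code, set registers, and loop here. */
    ioctl(vcpu, KVM_RUN, 0);                               /* enter guest mode            */
    switch (run->exit_reason) {                            /* why did the guest exit?     */
    case KVM_EXIT_IO:   printf("guest port I/O needs emulation\n");         break;
    case KVM_EXIT_MMIO: printf("guest memory-mapped I/O needs emulation\n"); break;
    case KVM_EXIT_HLT:  printf("guest halted\n");                           break;
    default:            printf("exit reason %d\n", run->exit_reason);       break;
    }
    return 0;
}
```

A guest created this way shows up as an ordinary process of the host, which is exactly why it can be scheduled and managed like any other process, as noted above.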

Lguest

Previously known as lhype, the Lguest hypervisor offers lightweight paravirtualization for Lguest-enabled x86 Linux guests. This means that guest OSes are aware that they are virtualized, and this awareness enables performance improvements.


It also simplifies the overall code requirements, needing only a thin layer in the guest and in the host operating system. The guest operating system includes a thin layer of Lguest code that provides services such as detecting that the kernel being booted is virtualized and routing privileged operations to the host OS through hypercalls.

Breakdown of the Lguest approach to x86 paravirtualization

The kernel side is implemented as a loadable module known as lg.ko [5], which contains the guest-OS interface to the host kernel. The first element is the switcher, which supports context-switching guest OSes in for execution. The /proc file system code is also implemented in this module; it provides the user-space interfaces to the kernel and drivers, including hypercalls. There is also code for memory mapping through shadow page tables and for management of x86 segments [5].

Lguest has been in the mainline kernel since 2.6.23 and consists of nearly 5000 source lines of code.

Linux hypervisor benefits

Linux hypervisors have some noticeable benefits:

1. They benefit from the ongoing advancement of Linux and the quantity and quality of work that goes into it.
2. The platform can be used as an operating system in addition to a hypervisor; besides running multiple guest operating systems on a Linux hypervisor, you can run traditional applications at that level.
3. Standard protocols (TCP/IP) and other useful applications (web servers) are available alongside the guests.

Xen Hypervisor

Xen offers a powerful, efficient, and secure feature set for virtualization of x86, x86_64, IA64, ARM, and other CPU architectures, and it supports many guest OSes.

The Xen hypervisor is a layer of software running directly on the computer hardware, just like an OS, allowing the hardware to run multiple guest OSes at the same time. It supports a wide variety of guest OSes, including Linux, NetBSD, FreeBSD, Solaris, and Windows.

The Xen.org community develops and maintains the Xen hypervisor as a free solution licensed under the GNU General Public License [4].

A computer running the Xen hypervisor contains three components:

• The Xen hypervisor itself.
• Domain 0, the Privileged Domain (Dom0): a privileged guest running on the hypervisor, with direct hardware access, which manages the guest operating systems.
• Multiple Domain U, Unprivileged Domain Guests (DomU): unprivileged guests running on the hypervisor, with no direct access to hardware (e.g., memory, disk).


The Xen hypervisor acts as the interface for all hardware requests, such as CPU, I/O, and disk, on behalf of the guest operating systems. By separating the guests from the hardware, Xen can run multiple OSes independently and securely.

Dom0 is loaded by Xen at initial system start-up and can run any operating system except Windows. Only Dom0 has privileges to access the Xen hypervisor that are not granted to any DomU. A system administrator or suitably privileged application can therefore use Dom0 to manage all the guest OSes and the hypervisor itself.

DomUs, which run independently on the system, are hosted and maintained by Dom0. These guests run either a specially modified OS (paravirtualization) or an unmodified OS that leverages special virtualization hardware such as AMD-V and Intel VT, in which case the guest is known as a hardware virtual machine (HVM).

Some of the terms used by Xen are explained below:

• Paravirtualization: a virtualization technique in which the running OS is informed that it is executing on a hypervisor rather than on bare hardware. The OS must be modified to cope with running on a hypervisor instead of real hardware.

• Hardware Virtual Machine (HVM): an OS running in a virtualized environment unchanged and unaware that it is not running directly on the hardware. Special hardware support is required to allow this, hence the term HVM.

Note: Microsoft Windows requires an HVM guest environment.

Architecture of Xen

A Xen virtual environment consists of several items that work together to deliver the virtualization environment a customer is looking to deploy:

• Xen Hypervisor
• Domain 0 Guest
  o Domain Management and Control (Xen DM&C)
• Domain U Guest (DomU)
  o PV Guest
  o HVM Guest

The diagram below shows the basic organization of these components.

Modified Linux Kernel/Domain 0

Only one Domain 0 runs in an instance of Xen, and all Xen virtualization environments require Domain 0 to be running before any other virtual machine can be started. It has access to the physical I/O resources and can interact with the other virtual machines (Domain U PV and HVM guests) running on the system.

Two drivers are included in Domain 0 to support network and local disk requests from Domain U PV and HVM guests:

1. Network Backend Driver
2. Block Backend Driver

The Network Backend Driver communicates directly with the local networking hardware to process all virtual machine requests coming from the Domain U guests. The Block Backend Driver communicates with the local storage disk to read and write data based on Domain U requests.

Domain U

Domain U has no direct access to the physical hardware on the machine. All paravirtualized VMs executing on a Xen hypervisor are referred to as "Domain U PV Guests" and run modified Linux, FreeBSD, Solaris, or other UNIX operating systems. Fully virtualized machines running on a Xen hypervisor, the "Domain U HVM Guests," run standard Windows or any other unchanged operating system [3]. A Domain U HVM Guest is not aware that other VMs are present. A Domain U PV Guest contains two drivers for network and disk access: the PV Network Driver and the PV Block Driver.


Working

A Domain U HVM Guest does not have PV drivers inside the virtual machine; instead, a special daemon, qemu-dm, is started in Domain 0 for every HVM guest. It handles the Domain U HVM Guest's disk access and networking requests.

The Domain U HVM Guest must initialize as it would on an ordinary machine, so additional software, the Xen virtual firmware, is added to the guest to simulate the BIOS an operating system expects at startup.

Domain Control and Management

A set of Linux daemons is classified as domain control and management. These daemons all run in Domain 0, so that even if another domain dies, the user or system is aware of it and the whole system does not crash.


Xend

The Xend daemon is an application written in Python. It is considered the system manager of the Xen environment and uses the provided API to make requests of the Xen hypervisor.

XM

xm is a command-line tool that takes user input and forwards it to Xend via XML-RPC.

Libxenctrl

A library written in C that can talk to the Xen hypervisor via Domain 0.

Domain0 to DomainU/Guest Domain Communication

The Xen hypervisor itself is not written to handle network and disk requests [6]. To accomplish such operations, the guest domain must communicate with Domain 0 through the hypervisor. We studied an example case to understand this in detail; the description is provided below.


• The guest domain's driver receives a request to write to the local disk and writes the data, via the Xen hypervisor, into local memory that is shared with Domain 0.
• An event channel exists between Domain 0 and the guest domain that allows them to communicate via asynchronous inter-domain interrupts in the Xen hypervisor.
• Domain 0 receives an interrupt from the Xen hypervisor, causing its backend driver to read the appropriate blocks from the guest domain's shared memory.
• The data from shared memory is then written to the local hard disk at the specified location.

The sketch below illustrates this split-driver flow.
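The following conceptual sketch mirrors the steps above from the guest (front-end) side. It is not Xen source code: the helper names (grant_access_to_dom0, ring_put_request, event_channel_notify, ring_get_response) are hypothetical stand-ins for Xen's grant-table, shared-ring, and event-channel mechanisms, declared here without implementations.

```c
#include <stdint.h>

typedef uint32_t grant_ref_t;                             /* reference to a granted page  */
struct blk_request  { uint64_t sector; grant_ref_t gref; int write; };
struct blk_response { int status; };

/* Hypothetical stand-ins for the grant table, shared I/O ring, and event channel. */
grant_ref_t grant_access_to_dom0(void *page);             /* share one page with Dom0     */
void ring_put_request(const struct blk_request *req);     /* queue request in shared ring */
void event_channel_notify(void);                          /* virtual interrupt to Dom0    */
int  ring_get_response(struct blk_response *rsp);         /* poll Dom0's completion       */

/* Front-end (DomU PV block driver) side of a disk write, following the steps above. */
void domU_write_block(void *data_page, uint64_t sector)
{
    /* 1. Place the data in memory shared with Domain 0 by granting the page. */
    grant_ref_t gref = grant_access_to_dom0(data_page);

    /* 2. Describe the I/O in the shared ring and signal the event channel;
          Dom0's backend driver receives this as an inter-domain interrupt.   */
    struct blk_request req = { .sector = sector, .gref = gref, .write = 1 };
    ring_put_request(&req);
    event_channel_notify();

    /* 3-4. Dom0's Block Backend Driver reads the granted page and writes it to
            the physical disk, then posts a response on the ring.             */
    struct blk_response rsp;
    while (!ring_get_response(&rsp))
        ;   /* a real driver would block on the event channel rather than spin */
}
```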

Note: A newer feature of Xen is being designed to improve overall performance and lessen the load on Domain 0: the guest domain is given direct access to the hardware without communicating with Domain 0, as the figure caption below indicates.


Guest OS accessing hardware directly without communication with Domain0


Processors for Clouds

The selection of a processor is a critical task in cloud computing. When deciding on a processor for cloud use, the following points have to be considered:

• Good virtualization efficiency.
• Good power efficiency.
• Effective use of virtual memory.
• The ability to virtualize I/O.

The cloud platform differs considerably from a general-purpose computing platform. Here, good virtualization efficiency means that the processors should be more efficient at encryption and communication than at visualization. Power efficiency also plays a key role. Are multicore processors more efficient than uniprocessors? The answer depends on various factors, such as whether a single virtual machine or several virtual machines are mapped to a single physical server. The complexity in the first case is low, because the resources are not shared and are available whenever the VM needs them; in the shared case, the VMs may have to time-share some resources, such as network connections, I/O channels, graphics cards, or memory. In that case it is important to

a. provide quick access to non-shared resources, e.g., having memory local to a processor helps; and
b. minimize contention for shared resources, e.g., having dedicated data paths to a shared resource and a means to quickly change the context of that resource helps.

In other words, an architecture that comes as close as possible to logically partitioning a server into "logical" servers is well suited for virtualization. When resources are shared, security also becomes an important issue, so the adopted design should address it.

Single-Chip Cloud Computer/Intel

Designed and developed by Intel's Tera-scale Computing Research Program, this microprocessor contains 48 cores and mimics a cloud of computers integrated onto one silicon chip. It can dynamically configure frequency and voltage to vary power consumption from 125 W down to as low as 25 W. Some of its key features are:

a. It incorporates technologies intended to scale multi-core processors to 100 cores.
b. Advanced power management technologies.
c. Support for message passing.
d. Improved energy efficiency.
e. Improved core-to-core communication.

The name "Single-chip Cloud Computer" reflects the fact that the architecture resembles a scalable cluster of computers, such as those in a cloud, integrated into silicon.

The research chip features:

• 24 "tiles" with two IA cores per tile.
• A 24-router mesh network with 256 GB/s bisection bandwidth.
• 4 integrated DDR3 memory controllers.
• Hardware support for message passing.

In the SCC, each core can run a separate operating system and software stack, acting as a separate compute node that communicates with other nodes over a packet-based network. The most important feature of the SCC's network fabric is that it supports message-passing programming models that can scale to thousands of processors in cloud datacenters. Each core has two levels of cache, and there is no hardware cache coherence among cores; this lessens power consumption and is meant to motivate the investigation of datacenter-style shared-memory software models on-chip. Researchers at Intel have successfully demonstrated both message passing and software-based coherent shared memory on the SCC.

Lowering power consumption is also a focus of the chip. Software applications are given control to turn cores on and off or to change their performance levels, continuously adapting to use the minimum energy needed at a given moment. The chip can run all 48 cores at once over a range of 25 W to 125 W and can selectively vary the frequency and voltage of the mesh network as well as of sets of cores. Each tile (2 cores) can have its own frequency, and each grouping of four tiles (8 cores) can run at its own voltage.

Top Level Architecture

A Management Console is used to load programs into SCC memory. Memory is dynamically mapped into the address space of the forty-eight cores or into the memory space of the Management Console for program loading or debugging.

Input/output instructions on the SCC processor cores are likewise mapped to the system interface and, by default, to the Management Console interface.

The SCC chip uses 4 memory controllers at the mesh border. The default boot configuration of the memory configuration registers gives each SCC core access to a private memory region on one memory controller and shared access to the local memory buffers located in every tile. This shared memory is used to pass messages between cores so that coherency can be maintained.

Main memory can be remapped to share regions among one or more cores. Thus, shared memory may be on-die or off-chip, the latter accessed through the memory controllers.


L2 Cache

Each core has its own private 256 KB L2 cache and an associated controller. When a miss occurs, the cache controller sends the address to the Mesh Interface Unit (MIU) for decoding and retrieval. Each core can have only one outstanding memory request and will stall on a read miss until the data are returned. On a write miss, the processor continues operation until another miss of either type occurs; once the data arrive, it resumes normal operation. The memory and network system does support tiles with multiple outstanding requests.

LMB (Local Memory Buffer)

In addition to the traditional cache structures, a local memory buffer capable of fast read/write operations has been added to each tile. The 16 KB buffer provides the equivalent of 512 full cache lines of memory. Any of the cores or the system interface can read or write data in these twenty-four on-die buffers. The primary envisioned use for this message-passing buffer is, as the name suggests, message passing.


DDR3 Memory Controllers

The 4 memory controllers provide a total capacity of 64 GB of DDR3 memory, which physically resides on the SCC board. Every memory controller supports 2 unbuffered DIMMs per channel with 2 ranks per DIMM. The supported DRAM type is DDR3-800 x8 in 1 GB, 2 GB, or 4 GB sizes, giving up to 16 GB of capacity per channel. The DDR3 protocol includes calibration, automatic training, compensation, and periodic refresh of the DRAM. Memory accesses are processed in order, while accesses to different banks and ranks are interleaved to improve throughput.

LUT

Each core has a lookup table (LUT), a set of configuration registers in the Configuration Block that maps the core's physical addresses into the extended memory map of the system. Each lookup table contains 256 entries, one for each 16 MB segment of the core's 4 GB physical address space. Each entry can point to the local memory buffer, private memory, the system interface, configuration registers, or system memory.

Lookup tables can also be programmed by writes through the system interface from the Management Console. They are set during the bootstrap process to an initial configuration; after booting, the memory map can be dynamically modified by any core that has a mapping to the tables' location in the system address space.

When an L2 cache miss occurs, the mesh interface unit consults the lookup table to determine where the memory request should be sent. Although the lookup table can be programmed as the user sees fit, a default memory map for all system memory sizes has been developed and is generally used.

MIU (Mesh Interface Unit)

The Mesh Interface Unit (MIU) contains the following:

• Packetizer and de-packetizer
• Command interpretation and address decode/lookup
• Local configuration registers
• Link-level flow control and credit management
• Arbiter


The packetizer/de-packetizer translates data to and from the agents and the mesh. The command, data, and address buffers provide queuing. Specifically, the mesh interface unit takes a cache miss and decodes the address, using the lookup table to map from the core address to the system address. It then places the request into the appropriate queue:

• Router/DDR3 requests
• Message Passing Buffer accesses
• Local configuration register accesses

For traffic arriving from the router, the mesh interface unit routes the data to the correct local destination. The link-level flow control regulates the flow of data on the mesh using a credit-based protocol. The arbiter controls tile-element access to the mesh interface unit at any point in time using round-robin ordering.

The core reads and writes a 32-bit local address. On any cache miss, the upper eight bits of this address index the lookup table. The LUT returns 22 bits: a ten-bit system address extension, a three-bit subdestID, an eight-bit tile ID, and a bypass bit. The 46-bit system address is composed of the bypass bit + the 10-bit address extension + the 3-bit subdestID + the 8-bit tile ID + the 24 lower bits of the 32-bit core address.

The sub-destination ID (subdestID) defines the port through which the packet leaves the router: the tile ID gets the packet to the tile, and from there the subdestID can select a memory controller, the power controller, or the system interface (SIF). The bypass bit indicates local tile memory buffer access.

The lower 34 bits of the 46-bit system address are sent to the destination specified by the tile ID, which encodes the tile coordinates in a Y, X format (four bits each).

When the request reaches the appropriate tile, the mesh interface unit inspects the lower thirteen bits of the address to determine which operation to perform (for example, lookup table reads and writes). For the remaining operations, the operation specifics (read, write, etc.) are sent automatically as part of the command code by the originator of the request. Note that each operation reads or writes only a certain number of data bits.

The sketch below works through this address translation.
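As an illustration of the decomposition just described, the sketch below assembles a 46-bit system address from a 32-bit core address and a LUT entry. The field widths follow the text, but the struct layout, the placement of the routing fields in the upper bits, and the example values are our own assumptions rather than Intel's register format.

```c
#include <stdint.h>
#include <stdio.h>

/* One of the 256 LUT entries, with the field widths given in the text.
   (Illustrative only; the real register layout is not reproduced here.) */
struct lut_entry {
    unsigned bypass   : 1;   /* local message-passing buffer access             */
    unsigned addr_ext : 10;  /* system address extension                        */
    unsigned subdest  : 3;   /* router port: memory controller, VRC, SIF, ...   */
    unsigned tile_id  : 8;   /* destination tile, Y,X coordinates (4 bits each) */
};

/* Build the 46-bit system address from a 32-bit core address.  The routing
   fields (tile ID, subdestID, bypass) are placed in the upper bits here by
   assumption; the 10-bit extension plus the 24-bit in-segment offset form the
   lower 34 bits that are delivered to the destination. */
uint64_t core_to_system_addr(uint32_t core_addr, const struct lut_entry lut[256])
{
    const struct lut_entry *e = &lut[core_addr >> 24];      /* top 8 bits pick the entry */
    uint64_t sys = 0;
    sys |= (uint64_t)e->tile_id  << 38;                     /* where the packet goes     */
    sys |= (uint64_t)e->subdest  << 35;                     /* which port at that tile   */
    sys |= (uint64_t)e->bypass   << 34;                     /* local memory buffer?      */
    sys |= (uint64_t)e->addr_ext << 24;                     /* extends the 4 GB space    */
    sys |= core_addr & 0xFFFFFFu;                           /* offset inside 16 MB chunk */
    return sys;
}

int main(void)
{
    struct lut_entry lut[256] = {{0}};
    lut[0x12] = (struct lut_entry){ .bypass = 0, .addr_ext = 0x3, .subdest = 1, .tile_id = 0x2A };
    printf("system address: 0x%012llx\n",
           (unsigned long long)core_to_system_addr(0x12ABCDEFu, lut));
    return 0;
}
```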


SCC Power Controller (VRC)

The VRC has its own destination target in each core's memory map and thus its own entry in the lookup table. A core sends a write request to an address whose top eight bits match the VRC entry in the lookup table, which ensures that the data packet is sent to the VRC. A core or the system interface can write to this memory location, and the write is decoded as a command for the power controller. The command is routed to the VRC across the mesh and executed: the power controller accepts it, adjusts the voltage, and then sends an acknowledgment back to the tile so that it knows the command completed successfully.

When the SCC board is started up, the power controller must first be written to power on and reset the tiles. Once started, it receives additional requests to increase voltage for faster operation, or to power down a quadrant or lower its voltage for power efficiency. An abstract programming interface has been developed to control the voltage.

The voltage is altered by writing a seventeen-bit value to the power controller register. Bit 16 must be one; bits 15:11 are don't-cares; bits 10:8 select a voltage domain; and bits 7:0 hold the VID (voltage identifier), an 8-bit value that can specify 256 settings.

There are eight voltage domains, V0 through V7, also called voltage islands. Voltage domains V2 and V6 are reserved for the mesh and the system interface. The remaining six voltage domains (V0, V1, V3, V4, V5, and V7) each cover a two-by-two tile array. V0 is the voltage domain in the upper left, and the domain number increments moving to the right, skipping V2; the domains in the upper row are therefore V0, V1, and V3, and those in the lower row are V4, V5, and V7 (V6 is skipped).

A sketch of how such a command word might be assembled is given below.
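To make the register layout concrete, here is a small sketch that packs such a 17-bit command word. The field interpretation follows the description above (bit 16 set, domain select in bits 10:8, VID in bits 7:0); the helper and the example values are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Pack the 17-bit VRC command described above:
   bit 16      = 1 (must be set)
   bits 15:11  = don't care
   bits 10:8   = voltage domain select (V0..V7)
   bits 7:0    = VID, the requested voltage setting (0..255).
   This is an illustrative encoding based on the text, not Intel's official API. */
static uint32_t vrc_command(unsigned domain, unsigned vid)
{
    return (1u << 16) | ((domain & 0x7u) << 8) | (vid & 0xFFu);
}

int main(void)
{
    /* Example: request (hypothetically) VID 0x70 on voltage domain V5. */
    uint32_t cmd = vrc_command(5, 0x70);
    printf("VRC register write value: 0x%05x\n", cmd);
    /* A core would then store this value to the address that its LUT maps to
       the VRC so that the packet is routed to the power controller.          */
    return 0;
}
```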

AMD Opteron

The Opteron 6000 and 4000 series launched by AMD cater to enterprise and cloud computing. Case studies have shown that they can give businesses up to 84 per cent higher performance [10]. The Opteron 4200 series supports up to 8 cores to handle more virtual machines, while the 6200 series goes up to sixteen cores.


AMD Opteron processors and chipsets provide a powerful foundation for building installations that are energy efficient and easily manageable, with the overall performance required of them. AMD redesigned the core architecture to enhance the execution paths and help decrease power utilization. With the new architecture, featuring four-core through sixteen-core processors, cloud and web deployments that benefit from higher core density can handle increasing numbers of transactions, with up to 160% more real cores per server.

• Direct Connect Architecture 2.0 offers superior memory bandwidth, scalability, and I/O performance.
• AMD Turbo CORE technology provides faster clock speeds for improved performance of CPU-intensive workloads.
• Greater core density allows cloud deployments to scale out with fewer nodes, saving floor space and power without compromising scalability.

Virtualization features are built directly into the silicon of AMD Opteron processors. The important benefits of incorporating this technology are:

• Simplified server and client infrastructure.
• Minimized power and cooling costs.
• Streamlined deployments and upgrades.
• Maximized software investment.
• Improved system performance, manageability, and data security.
• Minimized datacenter space and overhead expenses.


Features and benefits (taken from the AMD specification sheet [11]):

• Virtualization extensions to the x86 instruction set: enable software to more efficiently create virtual machines, so that multiple operating systems and their applications can run simultaneously on the same computer.
• Tagged TLB: hardware features that facilitate efficient switching between virtual machines for better application responsiveness.
• Rapid Virtualization Indexing (RVI): helps accelerate the performance of many virtualized applications by enabling hardware-based virtual machine memory management.
• AMD-V™ Extended Migration: helps virtualization software perform live migrations of virtual machines between all available AMD Opteron processor generations.
• I/O Virtualization: enables direct device access by a virtual machine, bypassing the hypervisor, for improved application performance and improved isolation of virtual machines for increased integrity and security.


Memory Cloud

In Internet hosting and cloud-based data centers, the Virtual Machine Monitor (VMM) is an important platform for offering services. By distributing hardware resources among different VMs, the VMM reduces the management burden of the hosting centers, and effective management can exploit all of the available processors. However, main memory is the main obstacle to such large-scale consolidation, because it does not lend itself to multiplexing the way processors do. To remove this bottleneck, researchers and VM vendors have built various products; the most interesting one we found is the "Difference Engine," an extension to the Xen virtual machine monitor discussed above that supports both sub-page-level sharing using page patching and in-core memory compression. VMware ESX Server, one variant of this kind of product, implements content-based page sharing, which is most useful for homogeneous VMs; because the benefit depends on the nature of the OS and guest VMs, this category of product does not extend well to heterogeneous environments.

In any environment with multiple OSes and VMs, many pages are identical or nearly identical. Finding these identical and similar pages allows them to be combined and stored in much less memory: as a single copy in the case of identical pages, and as patches in the case of similar pages. Comparing the Difference Engine with VMware ESX, the Difference Engine fares much better because it performs well even in a distributed, heterogeneous environment. Such systems also use the memory freed by compression to support additional virtual machines alongside the required ones.

Inside the Difference Engine

The Difference Engine builds on the following principles:

• Page sharing
• Patching
• Delta encoding
• Memory compression


We describe the working and importance of each of these principles in the sections that follow.

Page Sharing

Both VMware ESX and the Difference Engine use this principle to find repeated pages in the system. The system tracks pages and maintains hashes of their contents; when a redundant page exists, a hash collision reveals the potential duplicate. A byte-by-byte comparison then ensures that the pages are indeed identical before they are shared, after which the virtual memory mappings are updated to point to a single shared copy. Whenever a write to a shared page occurs, it triggers a page fault that is caught by the VMM, which gives the faulting VM a private copy of the shared page and updates its virtual memory mapping appropriately. If no VM still refers to the shared page, the VMM reclaims the memory and returns it to the memory pool. Further refinements are built on this mechanism to improve performance and to manage the global locking involved. A sketch of the basic hash-then-compare step appears below.
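The sketch below illustrates the hash-then-verify step in ordinary user-space C, as a toy deduplicator over 4 KB buffers. It is not VMM code: the FNV-style hash, the fixed-size table, and the page size are arbitrary stand-ins, and the copy-on-write handling described above is only indicated in a comment.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define TABLE_SIZE 1024

/* Toy 64-bit FNV-1a hash of a page's contents. */
static uint64_t page_hash(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        h = (h ^ page[i]) * 1099511628211ull;
    return h;
}

/* Hash table of previously seen pages; a collision marks a sharing candidate. */
static const uint8_t *seen[TABLE_SIZE];

/* Returns a pointer to an identical, already-stored page (the "shared copy"),
   or NULL if this page is new.  A real VMM would additionally mark shared
   pages read-only and break sharing with copy-on-write on the first store. */
const uint8_t *find_or_insert(const uint8_t *page)
{
    size_t slot = page_hash(page) % TABLE_SIZE;
    if (seen[slot] && memcmp(seen[slot], page, PAGE_SIZE) == 0)
        return seen[slot];           /* verified byte-for-byte: share it         */
    seen[slot] = page;               /* new (or colliding) content: remember it  */
    return NULL;
}

int main(void)
{
    static uint8_t a[PAGE_SIZE], b[PAGE_SIZE];   /* two all-zero "guest" pages */
    find_or_insert(a);
    printf("b %s an existing page\n", find_or_insert(b) ? "matches" : "does not match");
    return 0;
}
```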

Patching

The ultimate aim of patching is to reduce the memory used to store redundant page content; with it we eliminate redundant near-copies of pages in memory. The Difference Engine does this by representing each similar page as a patch against a reference page. Sub-page sharing raises several issues, such as identifying which reference pages to use as candidates. The Difference Engine uses a parameterized scheme to identify similar pages based on hashes of several 64-byte portions of the page. Hash Similarity Detector (k, s) hashes the contents of (k · s) 64-byte blocks at randomly chosen locations on the page and then groups these hashes into k groups of s hashes each. Hash Similarity Detector (1, 2) merges the hashes from two locations on the page into a single index of length two, whereas Hash Similarity Detector (2, 1) indexes each page twice: once based on the contents of the first block, and again based on the contents of a second block. Pages that have at least one of the two blocks in common are chosen as candidates. The evaluation in the Difference Engine work plots, for different (k, s) and candidate-count settings, the total savings from patching after all identical pages have been shared, where the savings are computed net of the memory used to store the shared and patched/compressed pages.

For those workloads, Hash Similarity Detector (2, 1) with one candidate does surprisingly well. There is a large gain from hashing two different blocks of the page separately, but little additional gain from hashing more blocks; combining blocks does not help much, at least for these workloads, and storing many candidates in one hash bucket also yields little. Hence, the Difference Engine indexes a page by hashing 64-byte blocks at two randomly selected locations and using each hash value as a distinct index under which to record the page in the hash table. To find a similar page, the system computes the hashes at the two locations, searches those hash table entries, and chooses the better of the two pages found. The sketch below illustrates this two-index scheme.
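Below is a toy illustration of Hash Similarity Detector (2, 1)-style indexing with one candidate per bucket, written to mirror the description above. The block offsets, hash function, and table size are arbitrary choices for the example, not the Difference Engine's actual parameters.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define BLOCK     64
#define BUCKETS   4096

/* Two randomly chosen (here: fixed for reproducibility) block offsets per page. */
static const size_t offs[2] = { 7 * BLOCK, 41 * BLOCK };

static uint64_t block_hash(const uint8_t *p)          /* FNV-1a over one 64-byte block */
{
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < BLOCK; i++)
        h = (h ^ p[i]) * 1099511628211ull;
    return h;
}

/* One candidate page per bucket, as in the "(2, 1) with one candidate" setting. */
static const uint8_t *index_tbl[BUCKETS];

/* Record a page under two indices, one per hashed block. */
void index_page(const uint8_t *page)
{
    for (int i = 0; i < 2; i++)
        index_tbl[block_hash(page + offs[i]) % BUCKETS] = page;
}

/* Look up both indices and return any candidate found; a real implementation
   would then diff against the candidate and keep the smaller patch. */
const uint8_t *find_similar(const uint8_t *page)
{
    for (int i = 0; i < 2; i++) {
        const uint8_t *cand = index_tbl[block_hash(page + offs[i]) % BUCKETS];
        if (cand && cand != page)
            return cand;
    }
    return NULL;
}

int main(void)
{
    static uint8_t ref[PAGE_SIZE], similar[PAGE_SIZE];
    memset(similar, 0xAB, 100);                  /* differs from ref only in its head */
    index_page(ref);
    printf("candidate %s\n", find_similar(similar) ? "found" : "not found");
    return 0;
}
```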

Delta Encoding

This principle draws on research into finding similarity between files in large systems, in which a file's fingerprint is computed over fixed-size blocks at multiple offsets. Maximizing the number of matching fingerprints gives a strong indication of similarity between files. However, that approach has its limitations: it does not scale to a dynamically evolving virtual memory system, and it is ill-suited to finding the intersecting set among a large number of candidate pages. To address these issues, delta encoding is applied in a refined fashion to compactly encode similar pages, as the toy sketch below illustrates.
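As a toy illustration of delta encoding between a reference page and a similar page, the sketch below records only the differing bytes as (offset, value) pairs and reapplies them; the Difference Engine's real patch format is more sophisticated than this.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

struct delta { uint16_t off; uint8_t val; };   /* one differing byte */

/* Encode 'page' as differences against 'ref'; returns the number of entries. */
size_t delta_encode(const uint8_t *ref, const uint8_t *page, struct delta *out)
{
    size_t n = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        if (page[i] != ref[i])
            out[n++] = (struct delta){ (uint16_t)i, page[i] };
    return n;
}

/* Rebuild the page by applying the patch on top of the reference. */
void delta_apply(const uint8_t *ref, const struct delta *d, size_t n, uint8_t *out)
{
    memcpy(out, ref, PAGE_SIZE);
    for (size_t i = 0; i < n; i++)
        out[d[i].off] = d[i].val;
}

int main(void)
{
    static uint8_t ref[PAGE_SIZE], page[PAGE_SIZE], rebuilt[PAGE_SIZE];
    static struct delta patch[PAGE_SIZE];
    page[10] = 7; page[200] = 9;                       /* two-byte difference */
    size_t n = delta_encode(ref, page, patch);
    delta_apply(ref, patch, n, rebuilt);
    printf("patch entries: %zu, round-trip ok: %s\n",
           n, memcmp(page, rebuilt, PAGE_SIZE) == 0 ? "yes" : "no");
    return 0;
}
```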

Memory Compression

Memory compression saves memory in the system, so it is sometimes beneficial, although at times the performance overhead outweighs the memory savings. Pages that are not similar to any other page in memory are therefore compressed to decrease the memory footprint, which is worthwhile only if the compression ratio is sufficiently high. The Difference Engine supports multiple compression algorithms. It invalidates compressed pages in the VM and stores them in a heap area in machine memory. When a virtual machine accesses a compressed page, the Difference Engine decompresses it and returns it to the virtual machine, where it remains uncompressed until it is selected for compression again. A simple sketch of this compress-on-idle, decompress-on-access pattern follows.
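The following user-space sketch shows the basic compress/store/decompress cycle using zlib as a stand-in for whichever algorithms the Difference Engine actually employs; the "heap area" here is simply a malloc'd buffer.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>          /* link with -lz */

#define PAGE_SIZE 4096

int main(void)
{
    unsigned char page[PAGE_SIZE];                 /* a "guest" page to evict      */
    memset(page, 'x', sizeof page);                /* highly compressible content  */

    /* Compress the page into a heap-allocated buffer ("machine memory heap"). */
    uLongf clen = compressBound(PAGE_SIZE);
    unsigned char *heap_copy = malloc(clen);
    if (compress2(heap_copy, &clen, page, PAGE_SIZE, Z_BEST_SPEED) != Z_OK)
        return 1;
    printf("stored %lu bytes instead of %d\n", (unsigned long)clen, PAGE_SIZE);
    /* The original page frame could now be handed back to the free pool;
       the VMM would mark the guest mapping invalid so a later access faults. */

    /* On access, decompress the page back into a full frame and map it in. */
    unsigned char restored[PAGE_SIZE];
    uLongf dlen = PAGE_SIZE;
    if (uncompress(restored, &dlen, heap_copy, clen) != Z_OK)
        return 1;
    printf("restored page matches: %s\n",
           memcmp(restored, page, PAGE_SIZE) == 0 ? "yes" : "no");
    free(heap_copy);
    return 0;
}
```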


Future Work and Conclusion

We carried out a detailed study of how cloud computing works and of how shared resources are shared and managed. The processors used for cloud computing are specially designed to support virtualization, and the hypervisors play a critical role in managing shared resources such as I/O and memory. The hypervisor is a new battleground these days, and the trend is shifting from operating systems to hypervisors.

Hypervisors are designed both to support the guest OSes and to maintain access to the hardware. Enhancements are under way in this area so that a guest OS can access hardware directly and performance can be improved. The processors discussed, such as the SCC and the Opteron, provide native support for virtualization and hence are well suited to cloud computing.

More work is being done to enhance both of these critical components, hypervisors and processors. The Intel SCC is a remarkable experimental product in this space, incorporating technologies intended to scale to 100 cores on a single chip, and hypervisors are also being improved with efficiency in mind. Current discussion and trends support rapid development in this area.

Memory, the bottleneck in large-scale computation, is being tuned to handle billions of transactions. The concept of the memory cloud has thus been added to the arsenal of cloud computing to help support clustered server systems. The principles used by the memory cloud are robust enough to handle such scenarios and flexible enough to accommodate new design principles. Many VM vendors and researchers are working at full throttle to develop an efficient and failure-free cloud environment.

We also wanted to look at actual deployments of Xen, such as at Rackspace, to test and observe their resource sharing. The source code of Xen can be obtained for educational purposes, and Amazon EC2 provides 300+ free hours, but such testing would take a few weeks, so we were not able to carry out and analyze actual experiments.


Appendix and Presentation Slides


References

[1] Intel. Web. 12 Dec. 2011. <http://techresearch.intel.com/ProjectDetails.aspx?Id=1>.

[2] Wikipedia. "Blue Gene." Web. 12 Dec. 2011. <http://en.wikipedia.org/wiki/Blue_Gene>.

[3] Xen.org. "How Does Xen Work?" 2009. 1-10. Web. 12 Dec. 2011.

[4] Xen Wiki. Web. 12 Dec. 2011. <http://wiki.xen.org/>.

[5] Jones, Tim M. "Anatomy of a Linux Hypervisor." 2009. Web. 12 Dec. 2011. <http://www.ibm.com/developerworks/linux/library/l-hypervisor/index.html?ca=dgr-jw64Lnx-Hypervisor&S_TACT=105AGY46&S_CMP=grjw64>.

[6] Sarathy, Vijay, Purnendu Narayan, and Rao Mikkilineni. "Next Generation Cloud Computing Architecture." Los Altos. 1-6. Web. 12 Dec. 2011.

[7] Wikipedia. "Xen." Web. 12 Dec. 2011. <http://en.wikipedia.org/wiki/Xen>.

[8] Intel Labs. "SCC External Architecture Specification (EAS)." .94 ed. 2010. 1-44. Web. 12 Dec. 2011. <http://techresearch.intel.com/spaw2/uploads/files//SCC_EAS.pdf>.

[9] Wikipedia. "Cloud Computing." Web. 12 Dec. 2011. <http://en.wikipedia.org/wiki/Cloud_computing>.

[10] Lui, Spandas. Web. 12 Dec. 2011. <http://www.itworld.com/hardware/230919/amd-launches-opteron-processors-virtualisation-and-cloud-computing>.

[11] AMD. "AMD Virtualization (AMD-V™) Technology." Web. 12 Dec. 2011. <http://sites.amd.com/us/business/it-solutions/virtualization/Pages/virtualization.aspx#2>.

[12] vzxen. "What is Xen Hypervisor?" Web. 12 Dec. 2011. <http://vzxen.com/features>.

[13] Wikipedia. "Virtualization." Web. 12 Dec. 2011. <http://en.wikipedia.org/wiki/Virtualization>.