Unit-1 Systems modeling, Clustering and virtualization

Scalable Computing over the Internet:

Over the past 60 years, computing technology has undergone a series of platform and

environment changes. In this section, we assess evolutionary changes in machine architecture,

operating system platform, network connectivity, and application workload. Instead of using a

centralized computer to solve computational problems, a parallel and distributed computing

system uses multiple computers to solve large-scale problems over the Internet. Thus,

distributed computing becomes data-intensive and network-centric. This section identifies the

applications of modern computer systems that practice parallel and distributed computing.

These large-scale Internet applications have significantly enhanced the quality of life and

information services in society today.

The Platform Evolution

Computer technology has gone through five generations of development, with each generation

lasting from 10 to 20 years. Successive generations are overlapped in about 10 years. For


instance, from 1950 to 1970, a handful of mainframes, including the IBM 360 and CDC 6400,

were built to satisfy the demands of large businesses and government organizations. From 1960

to 1980, lower-cost mini-computers such as the DEC PDP 11 and VAX Series became popular

among small businesses and on college campuses.

From 1970 to 1990, we saw widespread use of personal computers built with VLSI microprocessors. From 1980 to 2000, massive numbers of portable computers and pervasive devices

appeared in both wired and wireless applications. Since 1990, the use of both HPC and HTC

systems hidden in clusters, grids, or Internet clouds has proliferated. These systems are

employed by both consumers and high-end web-scale computing and information services.

The general computing trend is to leverage shared web resources and massive amounts of data

over the Internet. On the HPC side, supercomputers (massively parallel processors or MPPs) are

gradually replaced by clusters of cooperative computers out of a desire to share computing

resources. The cluster is often a collection of homogeneous compute nodes that are physically

connected in close range to one another.

On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and

content delivery applications. A P2P system is built over many client machines. Peer machines

are globally distributed in nature. P2P, cloud computing, and web service platforms are more

focused on HTC applications than on HPC applications. Clustering and P2P technologies lead to

the development of computational grids or data grids.

Computing Paradigm Distinctions

Centralized computing: This is a computing paradigm by which all computer resources are centralized in one physical system. All resources (processors, memory, and storage) are fully shared and tightly coupled within one integrated OS. Many data centers and supercomputers are centralized systems, but they are used in parallel, distributed, and cloud computing applications.

Parallel computing: In parallel computing, all processors are either tightly coupled with centralized shared memory or loosely coupled with distributed memory. Some authors refer to this discipline as parallel processing. Interprocessor communication is accomplished through shared memory or via message passing. A computer system capable of parallel computing is commonly known as a parallel computer. Programs running on a parallel computer are called parallel programs. The process of writing parallel programs is often referred to as parallel programming.
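
To make the shared-memory flavor concrete, here is a minimal, hedged sketch in Python (the text names no specific language or toolkit; the multiprocessing module merely illustrates several cooperating processes updating one shared variable):

    # Minimal parallel-programming sketch: several processes cooperate on one
    # task through a shared-memory counter protected by a lock.
    from multiprocessing import Process, Value, Lock

    def worker(counter, lock, n):
        for _ in range(n):
            with lock:
                counter.value += 1   # interprocessor communication via shared memory

    if __name__ == "__main__":
        counter = Value("i", 0)      # an integer placed in shared memory
        lock = Lock()
        procs = [Process(target=worker, args=(counter, lock, 1000)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(counter.value)         # 4000: four workers acted as one parallel program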

Distributed computing: This is a field of computer science/engineering that studies distributed systems. A distributed system consists of multiple autonomous computers, each having its own private memory, communicating through a computer network. Information exchange in a distributed system is accomplished through message passing. A computer program that runs in a distributed system is known as a distributed program. The process of writing distributed programs is referred to as distributed programming.
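
As a hedged sketch of message passing between autonomous nodes (no framework is named in the text; the port number and message format below are arbitrary), two "nodes" here exchange information only through the network:

    # Minimal message-passing sketch: two autonomous "nodes" with private
    # memory exchange information only by sending messages over a TCP socket.
    import socket
    import threading
    import time

    def node_a(port=50007):
        with socket.socket() as srv:
            srv.bind(("localhost", port))
            srv.listen(1)
            conn, _ = srv.accept()
            with conn:
                request = conn.recv(1024).decode()        # message received
                conn.sendall(f"echo:{request}".encode())  # reply message sent

    def node_b(port=50007):
        with socket.socket() as cli:
            cli.connect(("localhost", port))
            cli.sendall(b"hello")
            print(cli.recv(1024).decode())                # prints "echo:hello"

    if __name__ == "__main__":
        threading.Thread(target=node_a, daemon=True).start()
        time.sleep(0.5)   # give node_a time to start listening
        node_b()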

Cloud computing: An Internet cloud of resources can be either a centralized or a

distributed computing system. The cloud applies parallel or distributed computing, or

both. Clouds can be built with physical or virtualized resources over large data centers

that are centralized or distributed. Some authors consider cloud computing to be a form

of utility computing or service computing.

Technologies for Network-Based Systems

This section focuses on viable approaches to building distributed operating systems for handling massive parallelism in a distributed environment.

Multicore CPUs and Multithreading Technologies

The growth of component and network technologies over the past 30 years is crucial to the development of HPC and HTC systems.

Advances in CPU Processors

Today, advanced CPUs or microprocessor chips assume a multicore architecture with dual,

quad, six, or more processing cores. These processors exploit parallelism at ILP and TLP levels.

However, the clock rate reached its limit on CMOS-based chips due to power limitations. This

limitation is attributed primarily to excessive heat generation with high frequency or high

voltages. The ILP is highly exploited in modern CPU processors which include multiple-issue

superscalar architecture, dynamic branch prediction, and speculative execution, among others.

In a typical multicore processor architecture, each core is essentially a processor with its own private cache (L1 cache).

Multiple cores are housed in the same chip with an L2 cache that is shared by all cores. In the

future, multiple CMPs could be built on the same CPU chip with even the L3 cache on the chip.

Many high-end processors are now equipped with multicore and multithreading capabilities.


Multicore CPU and Many-Core GPU Architectures Multicore CPUs may increase from the tens

of cores to hundreds or more in the future. But the CPU has reached its limit in terms of

exploiting massive DLP due to the aforementioned memory wall problem. This has triggered

the development of many-core GPUs with hundreds or more thin cores. The GPU also has been

applied in large clusters to build supercomputers in MPPs. In the future, multiprocessors can

house both fat CPU cores and thin GPU cores on the same chip.

Multithreading Technology Typical instruction scheduling patterns differ across micro-architectures. In a superscalar processor, only instructions from the same thread are executed in a given cycle. Fine-grain multithreading switches the execution of instructions from different threads every cycle. Coarse-grain multithreading executes many instructions from the same thread for quite a few cycles before switching to another thread. A multicore CMP executes instructions from different threads on separate cores. SMT allows simultaneous scheduling of instructions from different threads in the same cycle.

GPU (Graphics processing unit) A GPU is a graphics coprocessor or accelerator mounted on a

computer’s graphics card or video card. A GPU offloads the CPU from tedious graphics tasks in

video editing applications. A modern GPU chip can be built with hundreds of processing cores.

GPUs have a throughput architecture that exploits massive parallelism by executing many

concurrent threads slowly, instead of executing a single long thread in a conventional

microprocessor very quickly.


MEMORY, STORAGE, AND WIDE-AREA NETWORKING

Memory Technology Memory chips have experienced a 4x increase in capacity every three

years. The capacity increase of disk arrays will be even greater in the years to come. Faster

processor speed and larger memory capacity result in a wider gap between processors and

memory.

Disks and Storage Technology The rapid growth of flash memory and solid-state drives (SSDs)

also impacts the future of HPC and HTC systems. Eventually, power consumption, cooling, and

packaging will limit large system development. Power increases linearly with respect to clock

frequency and quadratically with respect to the voltage applied on chips.
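
As a first-order sketch of this relationship (the standard CMOS dynamic-power approximation, which the text does not state explicitly):

    P_{\mathrm{dynamic}} \approx C \, V^{2} f

where C is the effective switched capacitance, V the supply voltage, and f the clock frequency. Doubling f roughly doubles dynamic power, whereas doubling V roughly quadruples it, which is why voltage and frequency scaling run into the power and heat limits described above.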

Virtual Machines and Virtualization Middleware

Virtual machines (VMs) offer novel solutions to underutilized resources, application inflexibility,

software manageability, and security concerns in existing physical machines. To build large

clusters, grids, and clouds, we need to access large amounts of computing, storage, and

networking resources in a virtualized manner. We need to aggregate those resources, and

hopefully, offer a single system image. The VM is built with virtual resources managed by a

guest OS to run a specific application. Between the VMs and the host platform, one needs to

deploy a middleware layer called a virtual machine monitor (VMM). A VMM that runs in privileged mode is called a hypervisor.

VM Primitive Operations The VMM provides the VM abstraction to the guest OS. Low-level

VMM operations are indicated below:

First, the VMs can be multiplexed between hardware machines.


Second, a VM can be suspended and stored in stable storage.

Third, a suspended VM can be resumed or provisioned to a new hardware platform.

Finally, a VM can be migrated from one hardware platform to another.

The VM approach will significantly enhance the utilization of server resources. Multiple server

functions can be consolidated on the same hardware platform to achieve higher system

efficiency.

Virtual Infrastructures

A virtual infrastructure connects resources such as compute, storage, and networking to

distributed applications. It is a dynamic mapping of system resources to specific applications.

The result is decreased costs and increased efficiency and responsiveness.

System models for Distributed and Cloud Computing

Distributed and cloud computing systems are built over a large number of autonomous

computer nodes. These node machines are interconnected by SANs, LANs, or WANs in a

hierarchical manner. With today’s networking technology, a few LAN switches can easily

connect hundreds of machines as a working cluster. A WAN can connect many local clusters to

form a very large cluster of clusters. In this sense, one can build a massive system with millions

of computers connected to edge networks.

Clusters of Cooperative Computers

A computing cluster consists of interconnected stand-alone computers which work

cooperatively as a single integrated computing resource. In the past, clustered computer

systems have demonstrated impressive results in handling heavy workloads with large data

sets.

Cluster Architecture

The cluster nodes are interconnected by a network that can be as simple as a SAN (e.g., Myrinet) or a LAN (e.g., Ethernet). To build a larger

cluster with more nodes, the interconnection network can be built with multiple levels of

Gigabit Ethernet, Myrinet, or InfiniBand switches. Through hierarchical construction using a

SAN, LAN, or WAN, one can build scalable clusters with an increasing number of nodes. The

cluster is connected to the Internet via a virtual private network (VPN) gateway. The gateway IP

address locates the cluster. The system image of a computer is decided by the way the OS

manages the shared cluster resources. Most clusters have loosely coupled node computers. All

resources of a server node are managed by its own OS. Thus, most clusters have multiple

system images as a result of having many autonomous nodes under different OS control.


Single-System Image

Cluster designers desire a cluster operating system or some middleware to support SSI at

various levels, including the sharing of CPUs, memory, and I/O across all cluster nodes. An SSI is

an illusion created by software or hardware that presents a collection of resources as one

integrated, powerful resource. SSI makes the cluster appear like a single machine to the user. A

cluster with multiple system images is nothing but a collection of independent computers.

Hardware, Software, and Middleware Support

Clusters exploring massive parallelism are commonly known as MPPs. Almost all HPC clusters in

the Top 500 list are also MPPs. The building blocks are computer nodes (PCs, workstations,

servers, or SMP), special communication software such as PVM or MPI, and a network interface

card in each computer node. Most clusters run under the Linux OS. The computer nodes are

interconnected by a high-bandwidth network (such as Gigabit Ethernet, Myrinet, InfiniBand,

etc.).

Special cluster middleware support is needed to create SSI or high availability (HA). Both

sequential and parallel applications can run on the cluster, and special parallel environments

are needed to facilitate use of the cluster resources. For example, distributed memory has

multiple images. Users may want all distributed memory to be shared by all servers by

forming distributed shared memory (DSM). Many SSI features are expensive or difficult to

achieve at various cluster operational levels. Instead of achieving SSI, many clusters are loosely

coupled machines. Using virtualization, one can build many virtual clusters dynamically, upon

user demand.

Major Cluster Design Issues

Unfortunately, a cluster-wide OS for complete resource sharing is not available yet. Middleware

or OS extensions were developed at the user space to achieve SSI at selected functional levels.


Without this middleware, cluster nodes cannot work together effectively to achieve

cooperative computing. The software environments and applications must rely on the

middleware to achieve high performance. The cluster benefits come from scalable

performance, efficient message passing, high system availability, seamless fault tolerance, and

cluster-wide job management.

Software environments for distributed systems and clouds

This section introduces popular software environments for using distributed and cloud

computing systems.

Service-Oriented Architecture (SOA)

In grids/web services, Java, and CORBA, an entity is, respectively, a service, a Java object, and a

CORBA distributed object in a variety of languages. These architectures build on the traditional

seven Open Systems Interconnection (OSI) layers that provide the base networking

abstractions. On top of this we have a base software environment, which would be .NET or

Apache Axis for web services, the Java Virtual Machine for Java, and a broker network for

CORBA. On top of this base environment one would build a higher level environment reflecting

the special features of the distributed computing environment. This starts with entity interfaces

and inter-entity communication, which rebuild the top four OSI layers but at the entity and not

the bit level.

Layered Architecture for Web Services and Grids

The entity interfaces correspond to the Web Services Description Language (WSDL), Java

method, and CORBA interface definition language (IDL) specifications in these example

distributed systems. These interfaces are linked with customized, high-level communication

systems: SOAP, RMI, and IIOP in the three examples. These communication systems support

features including particular message patterns (such as Remote Procedure Call or RPC), fault

recovery, and specialized routing. Often, these communication systems are built on message-oriented middleware (enterprise bus) infrastructure such as WebSphere MQ or Java Message

Service (JMS) which provide rich functionality and support virtualization of routing, senders,

and recipients.

In the case of fault tolerance, the features in the Web Services Reliable Messaging (WSRM) framework mimic the OSI layer capability (as in TCP fault tolerance) modified to match the different abstractions (such as messages versus packets, virtualized addressing) at the entity levels. Security is a critical capability that either uses or reimplements the capabilities seen in concepts such as Internet Protocol Security (IPsec) and secure sockets in the OSI layers.


Web Services and Tools

Both web services and REST systems have very distinct approaches to building reliable

interoperable systems. In web services, one aims to fully specify all aspects of the service and

its environment. This specification is carried with communicated messages using Simple Object

Access Protocol (SOAP). The hosting environment then becomes a universal distributed

operating system with fully distributed capability carried by SOAP messages. This approach has

mixed success as it has been hard to agree on key parts of the protocol and even harder to

efficiently implement the protocol by software such as Apache Axis.

In the REST approach, one adopts simplicity as the universal principle and delegates most of the

difficult problems to application (implementation-specific) software. In a web services

language, REST has minimal information in the header, and the message body (that is opaque

to generic message processing) carries all the needed information. REST architectures are

clearly more appropriate for rapid technology environments. However, the ideas in web

services are important and probably will be required in mature systems at a different level in

the stack (as part of the application). Note that REST can use XML schemas but not those that

are part of SOAP; “XML over HTTP” is a popular design choice in this regard. Above the

communication and management layers, we have the ability to compose new entities or

distributed programs by integrating several entities together.
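
To illustrate the REST style sketched above (a hypothetical endpoint URL and XML format; only Python's standard library is assumed), a client simply issues an HTTP request and interprets the XML body itself, with no SOAP envelope:

    # Minimal "XML over HTTP" (REST-style) client sketch.
    # The URL and XML layout are hypothetical, not taken from the text.
    import urllib.request
    import xml.etree.ElementTree as ET

    def fetch_sensor_reading(url="http://example.org/sensors/42"):
        # REST keeps headers minimal; all application information travels
        # in the message body, which generic middleware treats as opaque.
        with urllib.request.urlopen(url) as response:
            body = response.read()
        root = ET.fromstring(body)           # e.g. <reading value="21.5" unit="C"/>
        return root.get("value"), root.get("unit")

    if __name__ == "__main__":
        print(fetch_sensor_reading())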

In CORBA and Java, the distributed entities are linked with RPCs, and the simplest way to build

composite applications is to view the entities as objects and use the traditional ways of linking

them together. For Java, this could be as simple as writing a Java program with method calls

replaced by Remote Method Invocation (RMI), while CORBA supports a similar model with a

syntax reflecting the C++ style of its entity (object) interfaces. Allowing the term “grid” to refer

to a single service or to represent a collection of services, here sensors represent entities that

output data (as messages), and grids and clouds represent collections of services that have

multiple message-based inputs and outputs.
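
To make the RPC-style composition concrete, here is a hedged sketch using Python's standard xmlrpc modules (the add method and port number are illustrative only); the client invokes the remote entity as if it were a local object, much as RMI or a CORBA stub would:

    # Minimal RPC sketch: a remote entity exposes a method, and a client
    # composes with it through what looks like an ordinary method call.
    from xmlrpc.server import SimpleXMLRPCServer
    from xmlrpc.client import ServerProxy
    import threading

    def start_service(port=8000):
        server = SimpleXMLRPCServer(("localhost", port), logRequests=False)
        server.register_function(lambda a, b: a + b, "add")  # the entity interface
        threading.Thread(target=server.serve_forever, daemon=True).start()

    if __name__ == "__main__":
        start_service()
        entity = ServerProxy("http://localhost:8000")   # client-side stub
        print(entity.add(2, 3))                         # remote call, local syntax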


The Evolution of SOA

Service-oriented architecture (SOA) has evolved over the years. SOA applies to building grids,

clouds, grids of clouds, clouds of grids, clouds of clouds (also known as interclouds), and

systems of systems in general. A large number of sensors provide data-collection services,

denoted in the figure as SS (sensor service). A sensor can be a ZigBee device, a Bluetooth

device, a WiFi access point, a personal computer, a GPS device, or a wireless phone, among other

things. Raw data is collected by sensor services. All the SS devices interact with large or small

computers, many forms of grids, databases, the compute cloud, the storage cloud, the filter

cloud, the discovery cloud, and so on. Filter services (fs in the figure) are used to eliminate

unwanted raw data, in order to respond to specific requests from the web, the grid, or web

services.

Most distributed systems require a web interface or portal. For raw data collected by a large

number of sensors to be transformed into useful information or knowledge, the data stream

may go through a sequence of compute, storage, filter, and discovery clouds. Finally, the inter-

service messages converge at the portal, which is accessed by all users. Two example portals,

OGFCE and HUBzero, are described in Section 5.3 using both web service (portlet) and Web 2.0

(gadget) technologies. Many distributed programming models are also built on top of these

basic constructs.


Grids versus Clouds

The boundary between grids and clouds has become blurred in recent years. For web services, workflow technologies are used to coordinate or orchestrate services with certain specifications used to define critical business process models such as two-phase transactions. In general, a grid system applies static resources, while a cloud emphasizes elastic resources. For some researchers, the differences between grids and clouds lie only in dynamic resource allocation based on virtualization and autonomic computing. One can build a grid out of multiple clouds. This type of grid can do a better job than a pure cloud, because it can explicitly support negotiated resource allocation. Thus one may end up building a system of systems: a cloud of clouds, a grid of clouds, a cloud of grids, or inter-clouds as a basic SOA architecture.

Performance, Security, and Energy Efficiency

Performance Metrics and Scalability Analysis

Performance metrics are needed to measure various distributed systems. In this section, we

will discuss various dimensions of scalability and performance laws. Then we will examine

system scalability against OS images and the limiting factors encountered.

Performance Metrics

We discussed CPU speed in MIPS and network bandwidth in Mbps in Section 1.3.1 to estimate processor and network performance. In a distributed system, performance is attributed to a large number of factors. System throughput is often measured in MIPS, Tflops (tera floating-point operations per second), or TPS (transactions per second). Other measures include job response time and network latency. An interconnection network that has low latency and high bandwidth is preferred. System overhead is often attributed to OS boot time, compile time, I/O data rate, and the runtime support system used. Other performance-related metrics include the QoS for Internet and web services; system availability and dependability; and security resilience for system defense against network attacks.

Dimensions of Scalability

Users want to have a distributed system that can achieve scalable performance. Any resource

upgrade in a system should be backward compatible with existing hardware and software

resources. Overdesign may not be cost-effective. System scaling can increase or decrease

resources depending on many practical factors. The following dimensions of scalability are

characterized in parallel and distributed systems.


Size scalability: This refers to achieving higher performance or more functionality by increasing the machine size. The word “size” refers to adding processors, cache, memory, storage, or I/O channels. The most obvious way to determine size scalability is to simply count the number of processors installed. Not all parallel computer or distributed architectures are equally size-scalable. For example, the IBM S2 was scaled up to 512 processors in 1997. But in 2008, the IBM BlueGene/L system scaled up to 65,000 processors.

Software scalability: This refers to upgrades in the OS or compilers, adding mathematical and

engineering libraries, porting new application software, and installing more user-friendly

programming environments. Some software upgrades may not work with large system

configurations. Testing and fine-tuning of new software on larger systems is a nontrivial job.

Application scalability: This refers to matching problem size scalability with machine

size scalability. Problem size affects the size of the data set or the workload increase. Instead of

increasing machine size, users can enlarge the problem size to enhance system efficiency or

cost-effectiveness.

Technology scalability: This refers to a system that can adapt to changes in building technologies, such as the component and networking technologies discussed in Section 3.1. When

scaling a system design with new technology one must consider three aspects: time, space,

and heterogeneity. (1) Time refers to generation scalability. When changing to new-generation

processors, one must consider the impact to the motherboard, power supply, packaging and

cooling, and so forth. Based on past experience, most systems upgrade their commodity

processors every three to five years. (2) Space is related to packaging and energy concerns.

Technology scalability demands harmony and portability among suppliers. (3) Heterogeneity

refers to the use of hardware components or software packages from different vendors.

Heterogeneity may limit the scalability.

Security Responsibilities

Three security requirements are often considered: confidentiality, integrity, and availability for

most Internet service providers and cloud users. In the order of SaaS, PaaS, and IaaS, the

providers gradually release the responsibility of security control to the cloud users. In

summary, the SaaS model relies on the cloud provider to perform all security functions. At the

other extreme, the IaaS model wants the users to assume almost all security functions, but to

leave availability in the hands of the providers. The PaaS model relies on the provider to

maintain data integrity and availability, but burdens the user with confidentiality and privacy

control.


Energy Efficiency

Primary performance goals in conventional parallel and distributed computing systems are high

performance and high throughput, considering some form of performance reliability (e.g., fault

tolerance and security). However, these systems recently encountered new challenging issues

including energy efficiency, and workload and resource outsourcing. These emerging issues are

crucial not only on their own, but also for the sustainability of large-scale computing systems in

general. This section reviews energy consumption issues in servers and HPC systems, an area

known as distributed power management (DPM).

Protection of data centers demands integrated solutions. Energy consumption in parallel and

distributed computing systems raises various monetary, environmental, and system

performance issues. For example, Earth Simulator and Petaflop are two systems with 12 and

100 megawatts of peak power, respectively. With an approximate price of $100 per megawatt-hour,

their energy costs during peak operation times are $1,200 and $10,000 per hour; this is beyond

the acceptable budget of many (potential) system operators. In addition to power cost, cooling

is another issue that must be addressed due to negative effects of high temperature on

electronic components. The rising temperature of a circuit not only derails the circuit from its

normal range, but also decreases the lifetime of its components.
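
The hourly figures follow directly from the quoted numbers: energy cost per hour = peak power × price, so 12 MW × $100/MWh = $1,200 per hour and 100 MW × $100/MWh = $10,000 per hour.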


UNIT II

Virtual Machines and Virtualization of Clusters and Data Centers

Implementation Levels of Virtualization

Virtualization is a computer architecture technology by which multiple virtual machines (VMs)

are multiplexed in the same hardware machine. The idea of VMs can be dated back to the

1960s. The purpose of a VM is to enhance resource sharing by many users and improve

computer performance in terms of resource utilization and application flexibility. Hardware

resources (CPU, memory, I/O devices, etc.) or software resources (operating system and

software libraries) can be virtualized in various functional layers. This virtualization technology

has been revitalized as the demand for distributed and cloud computing increased sharply in

recent years.

The idea is to separate the hardware from the software to yield better system efficiency. For

example, computer users gained access to much enlarged memory space when the concept

of virtual memory was introduced. Similarly, virtualization techniques can be applied to

enhance the use of compute engines, networks, and storage. In this chapter we will discuss

VMs and their applications for building distributed systems. According to a 2009 Gartner

Report, virtualization was the top strategic technology poised to change the computer industry.

With sufficient storage, any computer platform can be installed in another host computer, even

if they use processors with different instruction sets and run with distinct operating systems on

the same hardware.

Levels of Virtualization Implementation

A traditional computer runs with a host operating system specially tailored for its hardware

architecture. After virtualization, different user applications managed by their own operating

systems (guest OS) can run on the same hardware, independent of the host OS. This is often

done by adding additional software, called a virtualization layer. This virtualization layer is known as a hypervisor or virtual machine monitor (VMM). In each VM, applications run with their own guest OS over the virtualized CPU, memory, and I/O resources.

The main function of the software layer for virtualization is to virtualize the physical hardware

of a host machine into virtual resources to be used by the VMs, exclusively. This can be

implemented at various operational levels, as we will discuss shortly. The virtualization

software creates the abstraction of VMs by interposing a virtualization layer at various levels of


a computer system. Common virtualization layers include the instruction set architecture

(ISA) level, hardware level, operating system level, library support level, and application level.

Instruction Set Architecture Level

At the ISA level, virtualization is performed by emulating a given ISA by the ISA of the host

machine. For example, MIPS binary code can run on an x86-based host machine with the help

of ISA emulation. With this approach, it is possible to run a large amount of legacy binary code

written for various processors on any given new hardware host machine. Instruction set

emulation leads to virtual ISAs created on any hardware machine.

The basic emulation method is through code interpretation. An interpreter program interprets

the source instructions to target instructions one by one. One source instruction may require


tens or hundreds of native target instructions to perform its function. Obviously, this process is

relatively slow. For better performance, dynamic binary translation is desired. This approach

translates basic blocks of dynamic source instructions to target instructions. The basic blocks

can also be extended to program traces or super blocks to increase translation efficiency.

Instruction set emulation requires binary translation and optimization. A virtual instruction set

architecture (V-ISA) thus requires adding a processor-specific software translation layer to the

compiler.
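
As a toy sketch of code interpretation (the three-instruction "source ISA" and its handlers are invented purely for illustration), an interpreter decodes each source instruction and emulates it with one or more host-level operations:

    # Toy instruction-set interpreter: each source instruction is decoded and
    # emulated one by one, as in the basic emulation method described above.
    def run(program):
        regs = {"r0": 0, "r1": 0}
        pc = 0
        while pc < len(program):
            op, *args = program[pc]
            if op == "LOAD":            # LOAD reg, immediate
                regs[args[0]] = args[1]
            elif op == "ADD":           # ADD dst, src
                regs[args[0]] += regs[args[1]]
            elif op == "PRINT":         # PRINT reg
                print(regs[args[0]])
            else:
                raise ValueError(f"unknown instruction {op}")
            pc += 1                     # one source instruction per iteration
        return regs

    if __name__ == "__main__":
        run([("LOAD", "r0", 2), ("LOAD", "r1", 3), ("ADD", "r0", "r1"), ("PRINT", "r0")])

A dynamic binary translator would instead translate a whole basic block of such instructions into native code once and cache the result, which is why it usually outperforms instruction-by-instruction interpretation.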

Hardware Abstraction Level

Hardware-level virtualization is performed right on top of the bare hardware. On the one hand,

this approach generates a virtual hardware environment for a VM. On the other hand, the

process manages the underlying hardware through virtualization. The idea is to virtualize a

computer’s resources, such as its processors, memory, and I/O devices. The intention is to

upgrade the hardware utilization rate by multiple users concurrently. The idea was

implemented in the IBM VM/370 in the 1960s. More recently, the Xen hypervisor has been

applied to virtualize x86-based machines to run Linux or other guest OS applications.

Operating System Level

This refers to an abstraction layer between traditional OS and user applications. OS-level

virtualization creates isolated containers on a single physical server and the OS instances to utilize the hardware and software in data centers. The containers behave like real servers. OS-

level virtualization is commonly used in creating virtual hosting environments to allocate

hardware resources among a large number of mutually distrusting users. It is also used, to a

lesser extent, in consolidating server hardware by moving services on separate hosts into

containers or VMs on one server.

Library Support Level

Most applications use APIs exported by user-level libraries rather than using lengthy system

calls by the OS. Since most systems provide well-documented APIs, such an interface becomes

another candidate for virtualization. Virtualization with library interfaces is possible by

controlling the communication link between applications and the rest of a system through API

hooks. The software tool WINE has implemented this approach to support Windows

applications on top of UNIX hosts. Another example is the vCUDA which allows applications

executing within VMs to leverage GPU hardware acceleration.
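
A minimal sketch of the API-hook idea (pure illustration; this is not how WINE or vCUDA are actually implemented) replaces a library call with a wrapper that sits between the application and the real implementation:

    # Illustrative API hook: a library function is wrapped so a
    # "virtualization" layer intercepts calls from an unmodified application.
    import time

    real_time = time.time                  # keep a reference to the real API

    def hooked_time():
        # A real system could redirect this call to virtual or remote hardware;
        # here the hook just offsets the clock to show the interception.
        return real_time() - 3600.0

    time.time = hooked_time                  # install the hook
    print("application sees:", time.time())  # unmodified caller gets the hooked result
    time.time = real_time                    # restore the original API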


User-Application Level

Virtualization at the application level virtualizes an application as a VM. On a traditional OS, an

application often runs as a process. Therefore, application-level virtualization is also known

as process-level virtualization. The most popular approach is to deploy high-level language (HLL) VMs. In this scenario, the virtualization layer sits as an application program on top of the

operating system, and the layer exports an abstraction of a VM that can run programs written

and compiled to a particular abstract machine definition. Any program written in the HLL and

compiled for this VM will be able to run on it. The Microsoft .NET CLR and Java Virtual Machine

(JVM) are two good examples of this class of VM.

Other forms of application-level virtualization are known as application isolation, application

sandboxing, or application streaming. The process involves wrapping the application in a layer

that is isolated from the host OS and other applications. The result is an application that is

much easier to distribute and remove from user workstations. An example is the LANDesk

application virtualization platform which deploys software applications as self-contained,

executable files in an isolated environment without requiring installation, system modifications,

or elevated security privileges.

Virtualization Structures/Tools and Mechanisms

Before virtualization, the operating system manages the hardware. After virtualization, a

virtualization layer is inserted between the hardware and the OS. In such a case, the

virtualization layer is responsible for converting portions of the real hardware into virtual

hardware. Depending on the position of the virtualization layer, there are several classes of VM

architectures, namely

1. Hypervisor architecture.

2. Paravirtualization.

3. Host-based virtualization.

Hypervisor and Xen Architecture

Depending on the functionality, a hypervisor can assume a micro-kernel architecture or a

monolithic hypervisor architecture. A micro-kernel hypervisor includes only the basic and

unchanging functions (such as physical memory management and processor scheduling). The

device drivers and other changeable components are outside the hypervisor. A monolithic

hypervisor implements all the aforementioned functions, including those of the device drivers.

Therefore, the size of the hypervisor code of a micro-kernel hypervisor is smaller than that of a

monolithic hypervisor.


Xen Architecture

Xen is an open source hypervisor program developed by Cambridge University. Xen is a

microkernel hypervisor, which separates the policy from the mechanism. It implements all the mechanisms, leaving the policy to be handled by Domain 0. Xen does not

include any device drivers natively. It just provides a mechanism by which a guest OS can have

direct access to the physical devices.

Like other virtualization systems, many guest OSes can run on top of the hypervisor. The guest

OS (privileged guest OS), which has control ability, is called Domain 0, and the others are called

Domain U. Domain 0 is loaded first when Xen boots, without any file system drivers being available.

Domain 0 is designed to access hardware directly and manage devices.

Binary Translation with Full Virtualization

Depending on implementation technologies, hardware virtualization can be classified into two

categories: full virtualization and host-based virtualization.

Full Virtualization

With full virtualization, noncritical instructions run on the hardware directly while critical

instructions are discovered and replaced with traps into the VMM to be emulated by software.

Both the hypervisor and VMM approaches are considered full virtualization. Noncritical

instructions do not control hardware or threaten the security of the system, but critical

instructions do. Therefore, running noncritical instructions on hardware not only can promote

efficiency, but also can ensure system security.
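
A conceptual sketch of this scan-and-trap idea (instruction names and the VMM class are invented; real systems do this through binary translation or hardware support):

    # Conceptual full-virtualization sketch: noncritical instructions run
    # "directly", while critical ones trap into a VMM handler for emulation.
    CRITICAL = {"IO_OUT", "SET_PAGE_TABLE", "HALT"}

    class VMM:
        def emulate(self, instr, vm_state):
            # A real VMM would validate the request and apply it to the
            # virtual hardware; here we just record that it was emulated.
            vm_state.setdefault("emulated", []).append(instr)

    def execute(instruction_stream, vmm, vm_state):
        for instr in instruction_stream:
            if instr in CRITICAL:
                vmm.emulate(instr, vm_state)                        # trap to the VMM
            else:
                vm_state["direct"] = vm_state.get("direct", 0) + 1  # run on hardware

    if __name__ == "__main__":
        state = {}
        execute(["ADD", "LOAD", "IO_OUT", "MUL", "HALT"], VMM(), state)
        print(state)   # {'direct': 3, 'emulated': ['IO_OUT', 'HALT']}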

Host-Based Virtualization

An alternative VM architecture is to install a virtualization layer on top of the host OS. This host

OS is still responsible for managing the hardware. The guest OSes are installed and run on top

of the virtualization layer. Dedicated applications may run on the VMs. Certainly, some other


applications can also run with the host OS directly. This host-based architecture has some

distinct advantages, as enumerated next. First, the user can install this VM architecture without

modifying the host OS. Second, the host-based approach appeals to many host machine

configurations.

Para-Virtualization

Para-virtualization needs to modify the guest operating systems. A para-virtualized VM provides special APIs

requiring substantial OS modifications in user applications. Performance degradation is a critical

issue of a virtualized system. In a para-virtualized VM architecture, the guest OSes are para-virtualized. They are assisted by an intelligent compiler to replace the nonvirtualizable OS instructions with hypercalls. The traditional x86 processor offers

four instruction execution rings: Rings 0, 1, 2, and 3. The lower the ring number, the higher the

privilege of instruction being executed. The OS is responsible for managing the hardware and

the privileged instructions to execute at Ring 0, while user-level applications run at Ring 3.


Although para-virtualization reduces the overhead, it has incurred problems like compatibility

and portability, because it must support the unmodified OS as well. Second, the cost is high,

because they may require deep OS kernel modifications. Finally, the performance advantage of

para-virtualization varies greatly due to workload variations.

CPU, Memory, and Network Virtualization

A VMware virtual machine provides complete hardware virtualization. The guest operating

system and applications running on a virtual machine can never determine directly which

physical resources they are accessing (such as which physical CPU they are running on in a

multiprocessor system, or which physical memory is mapped to their pages).

CPU Virtualization

Each virtual machine appears to run on its own CPU (or a set of CPUs), fully isolated from other

virtual machines. Registers, the translation lookaside buffer, and other control structures are

maintained separately for each virtual machine. Most instructions are executed directly on the

physical CPU, allowing resource-intensive workloads to run at near-native speed. The

virtualization layer safely performs privileged instructions.

Memory Virtualization

A contiguous memory space is visible to each virtual machine. However, the allocated physical

memory might not be contiguous. Instead, noncontiguous physical pages are remapped and

presented to each virtual machine. With unusually memory-intensive loads, server memory

becomes overcommitted. In that case, some of the physical memory of a virtual machine might

be mapped to shared pages or to pages that are unmapped or swapped out.

ESX/ESXi performs this virtual memory management without the information that the guest

operating system has and without interfering with the guest operating system’s memory

management subsystem.
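
The remapping can be pictured as a simple translation table (a deliberately simplified sketch; real hypervisors use mechanisms such as shadow page tables or hardware-assisted nested paging):

    # Simplified view of memory virtualization: the guest sees contiguous
    # "physical" pages, but the hypervisor remaps them to scattered machine
    # pages, and some pages may be shared or swapped out. Numbers are invented.
    guest_to_machine = {0: 17, 1: 42, 2: 5}   # guest physical page -> machine page
    swapped_out = {3}                         # overcommitted page kept on disk

    def translate(guest_page):
        if guest_page in swapped_out:
            return "swapped out (hypervisor handles the fault)"
        return f"machine page {guest_to_machine[guest_page]}"

    for gp in range(4):
        print(f"guest page {gp} -> {translate(gp)}")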


Network Virtualization

The virtualization layer guarantees that each virtual machine is isolated from other virtual

machines. Virtual machines can communicate with each other only through networking

mechanisms similar to those used to connect separate physical machines.

The isolation allows administrators to build internal firewalls or other network isolation

environments that allow some virtual machines to connect to the outside, while others are

connected only through virtual networks to other virtual machines.

Virtual Clusters and Resource Management

When a traditional VM is initialized, the administrator needs to manually write configuration

information or specify the configuration sources. When more VMs join a network, an inefficient

configuration always causes problems with overloading or underutilization.

Amazon’s Elastic Compute Cloud (EC2) is a good example of a web service that provides elastic

computing power in a cloud. EC2 permits customers to create VMs and to manage user

accounts over the time of their use. Most virtualization platforms, including XenServer and

VMware ESX Server, support a bridging mode which allows all domains to appear on the

network as individual hosts. By using this mode, VMs can communicate with one another freely

through the virtual network interface card and configure the network automatically.
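
For example, creating a VM programmatically on EC2 might look like the following sketch (assuming the boto3 SDK, configured AWS credentials, and a placeholder image ID; none of these details come from the text):

    # Hedged sketch: launching one EC2 instance with boto3.
    # Requires the boto3 package and configured AWS credentials;
    # the AMI ID below is a placeholder, not a real image.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])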

Physical versus Virtual Clusters

Virtual clusters are built with VMs installed at distributed servers from one or more physical

clusters. The VMs in a virtual cluster are interconnected logically by a virtual network across

several physical networks. Each virtual cluster is formed with physical machines or VMs hosted by multiple physical clusters, and each virtual cluster has a distinct boundary.

The provisioning of VMs to a virtual cluster is done dynamically and yields the following interesting properties:

The virtual cluster nodes can be either physical or virtual machines. Multiple VMs

running with different OSes can be deployed on the same physical node.

A VM runs with a guest OS, which is often different from the host OS that manages the resources in the physical machine where the VM is implemented.

The purpose of using VMs is to consolidate multiple functionalities on the same server.

This will greatly enhance server utilization and application flexibility.


VMs can be colonized (replicated) in multiple servers for the purpose of promoting

distributed parallelism, fault tolerance, and disaster recovery.

The size (number of nodes) of a virtual cluster can grow or shrink dynamically, similar to

the way an overlay network varies in size in a peer-to-peer (P2P) network.

The failure of any physical nodes may disable some VMs installed on the failing nodes.

But the failure of VMs will not pull down the host system.

Since system virtualization has been widely used, it is necessary to effectively manage VMs

running on a mass of physical computing nodes (also called virtual clusters) and consequently

build a high-performance virtualized computing environment. This involves virtual cluster

deployment, monitoring and management over large-scale clusters, as well as resource

scheduling, load balancing, server consolidation, fault tolerance, and other techniques.

Virtualization For Data-Center Automation

Data-center automation means that huge volumes of hardware, software, and database resources in

these data centers can be allocated dynamically to millions of Internet users simultaneously,

with guaranteed QoS. The latest virtualization development highlights high availability (HA),

backup services, workload balancing, and further increases in client bases.

Server Consolidation in Data Centers

The heterogeneous workloads in the data center can be roughly divided into two categories:

chatty workloads and noninteractive workloads. Chatty workloads may burst at some point and

return to a silent state at some other point. For example, video services can be used by a lot of

people at night and by few people during the day. Noninteractive workloads do not require

people’s efforts to make progress after they are submitted. Server consolidation is an approach


to improve the low utility ratio of hardware resources by reducing the number of physical

servers. The use of VMs increases resource management complexity.

It enhances hardware utilization. Many underutilized servers are consolidated into

fewer servers to enhance resource utilization. Consolidation also facilitates backup

services and disaster recovery.

In a virtual environment, the images of the guest OSes and their applications are readily

cloned and reused.

Total cost of ownership is reduced.

It improves availability and business continuity.

Automation of data-center operations includes resource scheduling, architectural support,

power management, automatic or autonomic resource management, performance of analytical

models, and so on. In virtualized data centers, an efficient, on-demand, fine-grained scheduler

is one of the key factors to improve resource utilization. Dynamic CPU allocation is based on VM

utilization and application-level QoS metrics. One method considers both CPU and memory

flowing as well as automatically adjusting resource overhead based on varying workloads in

hosted services. Another scheme uses a two-level resource management system to handle the

complexity involved. A local controller at the VM level and a global controller at the server level

are designed.
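
A conceptual sketch of such a two-level scheme (class names and the proportional-scaling policy are invented for illustration): each local controller estimates its VM's demand, and a global controller allocates within the server's capacity:

    # Illustrative two-level resource manager: per-VM local controllers
    # request CPU, and a server-level global controller arbitrates.
    class LocalController:
        def __init__(self, vm_name, utilization):
            self.vm_name = vm_name
            self.utilization = utilization        # observed CPU demand in cores

        def requested_cpu(self):
            return self.utilization * 1.2         # ask for a little headroom

    class GlobalController:
        def __init__(self, capacity_cores):
            self.capacity = capacity_cores

        def allocate(self, local_controllers):
            requests = {lc.vm_name: lc.requested_cpu() for lc in local_controllers}
            total = sum(requests.values())
            scale = min(1.0, self.capacity / total)   # shrink proportionally if overcommitted
            return {vm: round(req * scale, 2) for vm, req in requests.items()}

    if __name__ == "__main__":
        vms = [LocalController("vm1", 0.9), LocalController("vm2", 2.5), LocalController("vm3", 1.0)]
        print(GlobalController(capacity_cores=4).allocate(vms))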

Virtual Storage Management

Virtual storage includes the storage managed by VMMs and guest OSes. Generally, the data

stored in this environment can be classified into two categories: VM images and application

data. The VM images are special to the virtual environment, while application data includes all

other data which is the same as the data in traditional OS environments. In data centers, there

are often thousands of VMs, which cause the VM images to become flooded. Parallax is a

distributed storage system customized for virtualization environments. Content Addressable

Storage (CAS) is a solution to reduce the total size of VM images, and therefore supports a large

set of VM-based systems in data centers.

Parallax designs a novel architecture in which storage features that have traditionally been

implemented directly on high-end storage arrays and switches are relocated into a federation

of storage VMs. These storage VMs share the same physical hosts as the VMs that they serve.

In the Parallax system architecture, for each physical machine,

Parallax customizes a special storage appliance VM. The storage appliance VM acts as a block

virtualization layer between individual VMs and the physical storage device. It provides a virtual

disk for each VM on the same physical machine.


UNIT III

Cloud Platform Architecture

Cloud Computing and Service Models

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to

a shared pool of configurable computing resources (e.g., networks, servers, storage,

applications, and services) that can be rapidly provisioned and released with minimal

management effort or service provider interaction. This cloud model is composed of five

essential characteristics, three service models, and four deployment models.

Software as a Service (SaaS)

The capability provided to the consumer is to use the provider’s applications running on a cloud

infrastructure. The applications are accessible from various client devices through either a thin

client interface, such as a web browser (e.g., web-based email), or a program interface. The

consumer does not manage or control the underlying cloud infrastructure including network,

servers, operating systems, storage, or even individual application capabilities, with the

possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS)

The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-

created or acquired applications created using programming languages, libraries, services, and

tools supported by the provider. The consumer does not manage or control the underlying

cloud infrastructure including network, servers, operating systems, or storage, but has control

over the deployed applications and possibly configuration settings for the application-hosting

environment.

Infrastructure as a Service (IaaS)

The capability provided to the consumer is to provision processing, storage, networks, and

other fundamental computing resources where the consumer is able to deploy and run

arbitrary software, which can include operating systems and applications. The consumer does

not manage or control the underlying cloud infrastructure but has control over operating

systems, storage, and deployed applications; and possibly limited control of select networking

components (e.g., host firewalls).


Architectural Design of Compute and Storage Clouds

Cloud Architecture for Distributed Computing

An Internet cloud is envisioned as a public cluster of servers provisioned on demand to perform

collective web services or distributed applications using the datacenter resources. The cloud

design objectives are first specified below. Then we present a basic cloud architecture design.

Cloud Platform Design Goals: Scalability, virtualization, efficiency, and reliability are four major

design goals of a cloud computing platform. Clouds support Web 2.0 applications. The cloud

management receives the user request and then finds the correct resources, and then calls the

provisioning services which invoke resources in the cloud. The cloud management software

needs to support both physical and virtual machines. Security in shared resources and shared

access of datacenters also poses another design challenge.

The platform needs to establish a very large-scale HPC infrastructure. The hardware and

software systems are combined together to make it easy and efficient to operate. The system

scalability can benefit from cluster architecture. If one service takes a lot of processing power

or storage capacity or network traffic, it is simple to add more servers and bandwidth. The

system reliability can benefit from this architecture. Data can be put into multiple locations. For

example, user email can be stored on three disks that are located in geographically separate data centers. In such a situation, even if one of the data centers crashes, the user data is still accessible. The scale of cloud architecture can be easily expanded by adding more servers

and enlarging the network connectivity accordingly.

Enabling Technologies for Clouds: The key driving forces behind cloud computing are the

ubiquity of broadband and wireless networking, falling storage costs, and progressive

improvements in Internet computing software. Cloud users are able to demand more capacity

at peak demand, reduce costs, experiment with new services, and remove unneeded capacity,

whereas service providers can increase the system utilization via multiplexing, virtualization,

and dynamic resource provisioning.

Technology Requirements and Benefits

Fast Platform Deployment: Fast, efficient, and flexible deployment of cloud resources to provide a dynamic computing environment to users.

Virtual Clusters on Demand: Virtualized clusters of VMs provisioned to satisfy user demand, and virtual clusters reconfigured as workloads change.

Multi-Tenant Techniques: SaaS distributes software to a large number of users for their simultaneous use and resource sharing if so desired.

Massive Data Processing: Internet search and web services often require massive data processing, especially to support personalized services.

Web-Scale Communication: Support for e-commerce, distance education, telemedicine, social networking, digital government, and digital entertainment, etc.

Distributed Storage: Large-scale storage of personal records and public archive information demands distributed storage over the clouds.

Licensing and Billing Services: License management and billing services greatly benefit all types of cloud services in utility computing.

These technologies play instrumental roles to make cloud computing a reality. Most of these

technologies are mature today to meet the increasing demand. In the hardware area, the rapid

progress in multi-core CPUs, memory chips, and disk arrays has made it possible to build faster

datacenters with huge storage space. Resource virtualization enables rapid cloud deployment

and fast disaster recovery. Service-oriented architecture (SOA) also plays a vital role. The progress in providing Software as a Service (SaaS), Web 2.0 standards, and Internet

performance have all contributed to the emergence of cloud services. Today’s clouds are

designed to serve a large number of tenants over massive volumes of data. The availability of large-scale, distributed storage systems lays the foundation of today’s datacenters. Of course,

cloud computing is greatly benefitted by the progress made in license management and

automatic billing techniques in recent years.

A Generic Cloud Architecture: The Internet cloud is envisioned as a massive cluster of servers.

These servers are provisioned on demand to perform collective web services or distributed

applications using datacenter resources. The cloud platform is formed dynamically by provisioning or de-provisioning servers, software, and database resources. Servers in the cloud can be

physical machines or virtual machines. User interfaces are applied to request services. The

provisioning tool carves out the systems from the cloud to deliver on the requested service.


In addition to building the server cluster, the cloud platform demands distributed storage and

accompanying services. The cloud computing resources are built in datacenters, which are

typically owned and operated by a third-party provider. Consumers do not need to know the

underlying technologies. In a cloud, software becomes a service. The cloud demands a high degree of trust in massive data retrieved from large datacenters. We need to build a framework

to process large scale data stored in the storage system. This demands a distributed file system

over the database system. Other cloud resources are added into a cloud platform including the

storage area networks, database systems, firewalls and security devices. Web service providers

offer special APIs that enable developers to exploit Internet clouds. Monitoring and metering

units are used to track the usage and performance of resources provisioned.

The software infrastructure of a cloud platform must handle all resource management and perform most of the maintenance automatically. The software must detect the status of each node as servers join and leave, and carry out the relevant tasks accordingly. Cloud computing providers, such as Google and Microsoft, have built a large number of datacenters all over the world. Each datacenter may have thousands of servers. The location of a datacenter is chosen to reduce power and cooling costs; thus, datacenters are often built near hydroelectric power stations. The builder of the cloud's physical platform is more concerned with the performance/price ratio and reliability than with sheer speed.

In general, private clouds are easier to manage and public clouds are easier to access. The trend in cloud development is that more and more clouds will be hybrid, because many cloud applications must go beyond the boundary of an intranet. One must learn how to create a private cloud and how to interact with public clouds on the open Internet. Security becomes a critical issue in safeguarding the operations of all cloud types.

Inter-cloud Resource Management

Cloud resource management and inter-cloud resource exchange schemes are reviewed in this

section.

Extended Cloud Computing Services

Figure shows six layers of cloud services, ranging from hardware, network, and collocation to

infrastructure, platform, and software applications. The cloud platform provides PaaS, which

sits on top of the IaaS infrastructure. The top layer offers SaaS.


The bottom three layers are more related to physical requirements. The bottommost layer

provides Hardware as a Service (HaaS). The next layer is for interconnecting all the hardware

components, and is simply called Network as a Service (NaaS). Virtual LANs fall within the scope

of NaaS. The next layer above NaaS offers Location as a Service (LaaS), which provides a

collocation service to house, power, and secure all the physical hardware and network

resources. The cloud infrastructure layer can be further subdivided as Data as a Service (DaaS)

and Communication as a Service (CaaS) in addition to compute and storage in IaaS.

The cloud players are divided into three classes: (1) cloud service providers and IT

administrators, (2) software developers or vendors, and (3) end users or business users. These

cloud players vary in their roles under the IaaS, PaaS, and SaaS models. From the software

vendors’ perspective, application performance on a given cloud platform is most important.

From the providers’ perspective, cloud infrastructure performance is the primary concern. From

the end users’ perspective, the quality of services, including security, is the most important.

Cloud Service Tasks and Trends

The top layer is for SaaS applications, mostly used for business applications. The approach is to

widen market coverage by investigating customer behaviors and revealing opportunities by

statistical analysis. SaaS tools also apply to distributed collaboration, and financial and human

resources management.

PaaS is provided by Google, Salesforce.com, and Facebook, among others. IaaS is provided by Amazon, Windows Azure, and Rackspace, among others. Collocation services require multiple cloud providers to work together to support supply chains in manufacturing. Network cloud services provide communications, such as those offered by AT&T, Qwest, and AboveNet. Details can be found in Chou's introductory book on business clouds.


Software Stack for Cloud Computing

The overall software stack structure of cloud computing software can be viewed as layers. Each

layer has its own purpose and provides the interface for the upper layers just as the traditional

software stack does. However, the lower layers are not completely transparent to the upper

layers.

The platform for running cloud computing services can be either physical servers or virtual

servers. By using VMs, the platform can be flexible, that is, the running services are not bound

to specific hardware platforms. This brings flexibility to cloud computing platforms. The

software layer on top of the platform is the layer for storing massive amounts of data. This layer

acts like the file system in a traditional single machine. Other layers running on top of the file

system are the layers for executing cloud computing applications. They include the database

storage system, programming for large-scale clusters, and data query language support. The

next layers are the components in the software stack.

Runtime Support Services

As in a cluster environment, there are also some runtime supporting services in the cloud

computing environment. Cluster monitoring is used to collect the runtime status of the entire

cluster. One of the most important facilities is the cluster job management system. The

scheduler queues the tasks submitted to the whole cluster and assigns the tasks to the

processing nodes according to node availability. The distributed scheduler for the cloud

application has special characteristics that can support cloud applications, such as scheduling

the programs written in MapReduce style. The runtime support system keeps the cloud cluster

working properly with high efficiency. Runtime support is the software needed in browser-initiated applications used by thousands of cloud customers. The SaaS model provides software applications as a service, rather than requiring users to purchase the software.

Cloud Security

While security and privacy concerns are similar across cloud services and traditional non-cloud

services, those concerns are amplified by the existence of external control over organizational

assets and the potential for mismanagement of those assets. Transitioning to public cloud

computing involves a transfer of responsibility and control to the cloud provider over

information as well as system components that were previously under the customer’s direct

control.

Despite this inherent loss of control, the cloud service customer still needs to take responsibility

for its use of cloud computing services in order to maintain situational awareness, weigh

alternatives, set priorities, and effect changes in security and privacy that are in the best


interest of the organization. The customer achieves this by ensuring that the contract with the

provider and its associated cloud service agreement has appropriate provisions for security and

privacy. In particular, the agreement must help maintain legal protections for the privacy of

data stored and processed on the provider's systems. The customer must also ensure

appropriate integration of cloud computing services with their own systems for managing

security and privacy.

There are a number of security risks associated with cloud computing that must be adequately

addressed:

Loss of governance: In a public cloud deployment, customers cede control to the cloud provider

over a number of issues that may affect security. Yet cloud service agreements may not offer a

commitment to resolve such issues on the part of the cloud provider, thus leaving gaps in

security defenses.

Responsibility ambiguity: Responsibility over aspects of security may be split between the

provider and the customer, with the potential for vital parts of the defenses to be left

unguarded if there is a failure to allocate responsibility clearly. This split is likely to vary

depending on the cloud computing model used (e.g., IaaS vs. SaaS).

Authentication and Authorization: The fact that sensitive cloud resources are accessed from

anywhere on the Internet heightens the need to establish with certainty the identity of a user --

especially if users now include employees, contractors, partners, and customers. Strong authentication and authorization therefore become a critical concern.

Isolation failure: Multi-tenancy and shared resources are defining characteristics of public

cloud computing. This risk category covers the failure of mechanisms separating the usage of

storage, memory, routing and even reputation between tenants (e.g. so-called guest-hopping

attacks).

Compliance and legal risks: The cloud customer’s investment in achieving certification (e.g., to

demonstrate compliance with industry standards or regulatory requirements) may be lost if the

cloud provider cannot provide evidence of their own compliance with the relevant

requirements, or does not permit audits by the cloud customer. The customer must check that

the cloud provider has appropriate certifications in place.

Handling of security incidents: The detection, reporting and subsequent management of

security breaches may be delegated to the cloud provider, but these incidents impact the

customer. Notification rules need to be negotiated in the cloud service agreement so that

customers are not caught unaware or informed with an unacceptable delay.


Management interface vulnerability: Interfaces used to manage public cloud resources (such as self-provisioning) are usually accessible through the Internet. Since they allow access to larger sets of resources than the interfaces of traditional hosting providers, they pose an increased risk, especially when combined with remote access and web browser vulnerabilities.

Application Protection: Traditionally, applications have been protected with defense-in-depth

security solutions based on a clear demarcation of physical and virtual resources, and on

trusted zones. With the delegation of infrastructure security responsibility to the cloud

provider, organizations need to rethink perimeter security at the network level, applying more

controls at the user, application and data level. The same level of user access control and

protection must be applied to workloads deployed in cloud services as to those running in

traditional data centers. This requires creating and managing workload-centric policies as well

as implementing centralized management across distributed workload instances.

Data protection: Here, the major concerns are exposure or release of sensitive data as well as

the loss or unavailability of data. It may be difficult for the cloud service customer (in the role of

data controller) to effectively check the data handling practices of the cloud provider. This

problem is exacerbated in cases of multiple transfers of data, (e.g., between federated cloud

services or where a cloud provider uses subcontractors).

Malicious behavior of insiders: Damage caused by the malicious actions of people working

within an organization can be substantial, given the access and authorizations they enjoy. This

is compounded in the cloud computing environment since such activity might occur within

either or both the customer organization and the provider organization.

Business failure of the provider: Such failures could render data and applications essential to

the customer's business unavailable over an extended period.

Service unavailability: This could be caused by hardware, software or communication network

failures.

Vendor lock-in: Dependency on proprietary services of a particular cloud service provider could

lead to the customer being tied to that provider. The lack of portability of applications and data

across providers poses a risk of data and service unavailability in case of a change in providers;

therefore it is an important if sometimes overlooked aspect of security. Lack of interoperability

of interfaces associated with cloud services similarly ties the customer to a particular provider

and can make it difficult to switch to another provider.

Insecure or incomplete data deletion: The termination of a contract with a provider may not

result in deletion of the customer’s data. Backup copies of data usually exist, and may be mixed

on the same media with other customers’ data, making it impossible to selectively erase. The


very advantage of multi-tenancy (the sharing of hardware resources) thus represents a higher

risk to the customer than dedicated hardware.

Visibility and Audit: Some enterprise users are creating a “shadow IT” by procuring cloud

services to build IT solutions without explicit organizational approval. Key challenges for the

security team are to know about all uses of cloud services within the organization (what

resources are being used, for what purpose, to what extent, and by whom), understand what

laws, regulations and policies may apply to such uses, and regularly assess the security aspects

of such uses.

Cloud computing does not only create new security risks: it also provides opportunities to

provision improved security services that are better than those many organizations implement

on their own. Cloud service providers could offer advanced security and privacy facilities that

leverage their scale and their skills at automating infrastructure management tasks. This is

potentially a boon to customers who have few skilled security personnel.

Trust Management:

Reputation-Guided Protection of Data Centers:

Trust is a personal opinion, which is very subjective and often biased. Trust can be transitive but

not necessarily symmetric between two parties. Reputation is a public opinion, which is more

objective and often relies on a large opinion aggregation process to evaluate. Reputation may

change or decay over time. Recent reputation should be given more preference than past

reputation.

Reputation Systems for Clouds

Redesigning the aforementioned reputation systems for protecting data centers offers new

opportunities for expanded applications beyond P2P networks. Data consistency is checked

across multiple databases. Copyright protection secures wide-area content distributions. To

separate user data from specific SaaS programs, providers take the most responsibility in

maintaining data integrity and consistency. Users can switch among different services using

their own data. Only the users have the keys to access the requested data.

The data objects must be uniquely named to ensure global consistency. To ensure data

consistency, unauthorized updates of data objects by other cloud users are prohibited. The

reputation system can be implemented with a trust overlay network. A hierarchy of P2P

reputation systems is suggested to protect cloud resources at the site level and data objects at

the file level. This demands both coarse-grained and fine-grained access control of shared

resources. These reputation systems keep track of security breaches at all levels.


The reputation system must be designed to benefit both cloud users and data centers. Data

objects used in cloud computing reside in multiple data centers over a SAN. In the past, most

reputation systems were designed for P2P social networking or for online shopping services.

These reputation systems can be converted to protect cloud platform resources or user

applications in the cloud. A centralized reputation system is easier to implement, but demands

more powerful and reliable server resources. Distributed reputation systems are more scalable

and reliable in terms of handling failures. The five security mechanisms presented earlier can be

greatly assisted by using a reputation system specifically designed for data centers.

However, it is possible to add social tools such as reputation systems to support safe cloning of

VMs. Snapshot control is based on the defined recovery point objective (RPO). Users demand new security mechanisms to

protect the cloud. For example, one can apply secured information logging, migrate over

secured virtual LANs, and apply ECC-based encryption for secure migration. Sandboxes provide

a safe execution platform for running programs. Further, sandboxes can provide a tightly

controlled set of resources for guest operating systems, which allows a security test bed to test

the application code from third-party vendors.

Service Oriented Architecture:

Service-oriented architecture (SOA) is an evolution of distributed computing based on the

request/reply design paradigm for synchronous and asynchronous applications. An application's

business logic or individual functions are modularized and presented as services for

consumer/client applications. What's key to these services is their loosely coupled nature; i.e.,

the service interface is independent of the implementation. Application developers or system

integrators can build applications by composing one or more services without knowing the

services' underlying implementations.

Service-oriented architectures have the following key characteristics:

SOA services have self-describing interfaces in platform-independent XML documents.

Web Services Description Language (WSDL) is the standard used to describe the

services.

SOA services communicate with messages formally defined via XML Schema (also called XSD). Communication among consumers and providers or services typically happens in heterogeneous environments, with little or no knowledge about the provider. Messages between services can be viewed as key business documents processed in an enterprise.


SOA services are maintained in the enterprise by a registry that acts as a directory listing. Applications can look up the services in the registry and invoke the service. Universal Description, Definition, and Integration (UDDI) is the standard used for service registry.

Each SOA service has a quality of service (QoS) associated with it. Some of the key QoS elements are security requirements, such as authentication and authorization, reliable messaging, and policies regarding who can invoke services.

Why SOA?

The reality in IT enterprises is that infrastructure is heterogeneous across operating systems, applications, system software, and application infrastructure. Some existing applications are used to run current business processes, so starting from scratch to build new infrastructure isn't an option. Enterprises should quickly respond to business changes with agility; leverage existing investments in applications and application infrastructure to address newer business requirements; support new channels of interactions with customers, partners, and suppliers; and feature an architecture that supports organic business. SOA with its loosely coupled nature allows enterprises to plug in new services or upgrade existing services in a granular fashion to address the new business requirements, provides the option to make the services consumable across different channels, and exposes the existing enterprise and legacy applications as services, thereby safeguarding existing IT infrastructure investments.

SOA Architecture:

Fig: Service Architecture

In the above Figure several service consumers can invoke services by sending messages. These

messages are typically transformed and routed by a service bus to an appropriate service

implementation. This service architecture can provide a business rules engine that allows

business rules to be incorporated in a service or across services. The service architecture also


provides a service management infrastructure that manages services and activities like auditing,

billing, and logging. In addition, the architecture offers enterprises the flexibility of agile business processes, better addresses regulatory requirements such as Sarbanes-Oxley (SOX), and allows individual services to be changed without affecting other services.

SOA infrastructure

To run and manage SOA applications, enterprises need an SOA infrastructure that is part of the

SOA platform. An SOA infrastructure must support all the relevant standards and required

runtime containers.

Fig: A typical SOA infrastructure.

WSDL is used to describe the service; UDDI, to register and look up the services; and SOAP, as a

transport layer to send messages between service consumer and service provider. While SOAP

is the default mechanism for Web services, alternative technologies accomplish other types of

bindings for a service. A consumer can search for a service in the UDDI registry, get the WSDL

for the service that has the description, and invoke the service using SOAP.
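To make this consume-a-service cycle concrete, here is a minimal, hedged Python sketch using the third-party zeep SOAP library. The WSDL URL and the GetQuote operation are hypothetical placeholders standing in for a service interface that would normally be discovered through a UDDI registry; they are not part of the text above.

# Minimal sketch of a loosely coupled SOA consumer using the "zeep" SOAP
# library. The WSDL URL and GetQuote operation are hypothetical placeholders.
from zeep import Client

# The WSDL fully describes the service interface, so the consumer needs no
# knowledge of the provider's implementation.
client = Client("http://example.com/stockquote?wsdl")

# Invoke the operation over SOAP; zeep builds the XML request from the
# WSDL-defined schema and parses the typed response.
result = client.service.GetQuote(symbol="ACME")
print(result)

Because the consumer depends only on the WSDL contract, the provider can change its implementation freely as long as the interface is preserved, which is the loose coupling the text emphasizes.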

Benefits of SOA

While the SOA concept is fundamentally not new, SOA differs from existing distributed

technologies in that most vendors accept it and have an application or platform suite that

enables SOA. SOA, with a ubiquitous set of standards, brings better reusability of existing assets

or investments in the enterprise and lets you create applications that can be built on top of new


and existing applications. SOA enables changes to applications while keeping clients or service

consumers isolated from evolutionary changes that happen in the service implementation. SOA

enables upgrading individual services or services consumers; it is not necessary to completely

rewrite an application or keep an existing system that no longer addresses the new business

requirements. Finally, SOA provides enterprises better flexibility in building applications and

business processes in an agile manner by leveraging existing application infrastructure to

compose new services.

Message Oriented Middleware

Message-oriented middleware (MOM) is software or hardware infrastructure supporting

sending and receiving messages between distributed systems. MOM allows application

modules to be distributed over heterogeneous platforms and reduces the complexity of

developing applications that span multiple operating systems and network protocols.

The middleware creates a distributed communications layer that insulates the application

developer from the details of the various operating systems and network interfaces. APIs that

extend across diverse platforms and networks are typically provided by MOM.

Middleware categories

Remote Procedure Call or RPC-based middleware

Object Request Broker or ORB-based middleware

Message Oriented Middleware or MOM-based middleware

All these models make it possible for one software component to affect the behavior of another

component over a network. They are different in that RPC- and ORB-based middleware create

systems of tightly coupled components, whereas MOM-based systems allow for a loose

coupling of components. In an RPC- or ORB-based system, when one procedure calls another, it

must wait for the called procedure to return before it can do anything else. In

these synchronous messaging models, the middleware functions partly as a super-linker,

locating the called procedure on a network and using network services to pass function or

method parameters to the procedure and then to return results.

Advantages

Central reasons for using a message-based communications protocol include its ability to store

(buffer), route, or transform messages while conveying them from senders to receivers.

Another advantage of provider-mediated messaging between clients is that, by adding an administrative interface, you can monitor and tune performance. Client applications


are thus effectively relieved of every problem except that of sending, receiving, and processing

messages. It is up to the code that implements the MOM system and up to the administrator to

resolve issues like interoperability, reliability, security, scalability, and performance.

Disadvantages

The primary disadvantage of many message-oriented middleware systems is that they require

an extra component in the architecture, the message transfer agent (message broker). As with

any system, adding another component can lead to reductions in performance and reliability,

and can also make the system as a whole more difficult and expensive to maintain.

In addition, many inter-application communications have an intrinsically synchronous aspect,

with the sender specifically wanting to wait for a reply to a message before continuing

(see real-time computing and near-real-time for extreme cases). Because message-

based communication inherently functions asynchronously, it may not fit well in such

situations. That said, most MOM systems have facilities to group a request and a response as a

single pseudo-synchronous transaction.
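The following is a small, self-contained Python sketch of these ideas: sender and receiver are decoupled by queues, and a correlation ID plus a reply queue group a request and its response into a pseudo-synchronous exchange. The queue names and message fields are illustrative only, not any particular MOM product's API.

# MOM-style loose coupling in miniature: the client and the service only
# share queues and never call each other directly. A correlation ID and a
# reply queue turn the asynchronous exchange into a pseudo-synchronous one.
import queue
import threading
import uuid

request_q = queue.Queue()   # stands in for a broker-managed destination
reply_q = queue.Queue()

def service_worker():
    """Consumes requests asynchronously and publishes correlated replies."""
    while True:
        msg = request_q.get()
        if msg is None:            # shutdown signal
            break
        reply = {"correlation_id": msg["correlation_id"],
                 "body": msg["body"].upper()}   # trivial "business logic"
        reply_q.put(reply)

threading.Thread(target=service_worker, daemon=True).start()

# Client side: send the request, then block on the reply queue until the
# matching response arrives.
corr_id = str(uuid.uuid4())
request_q.put({"correlation_id": corr_id, "body": "hello mom"})

response = reply_q.get(timeout=5)
assert response["correlation_id"] == corr_id
print(response["body"])            # -> HELLO MOM
request_q.put(None)                # stop the worker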


UNIT IV

Cloud Programming and Software Environments

Features of Cloud and Grid Platforms

Cloud Capabilities and Platform Features

Commercial cloud capabilities offer cost-effective utility computing with the elasticity to scale

up and down in power. However, as well as this key distinguishing feature, commercial clouds

offer a growing number of additional capabilities commonly termed “Platform as a Service”

(PaaS). For Azure, current platform features include Azure Table, queues, blobs, Database SQL,

and web and Worker roles. Amazon is often viewed as offering “just” Infrastructure as a Service

(IaaS), but it continues to add platform features including SimpleDB (similar to Azure Table),

queues, notification, monitoring, content delivery network, relational database, and

MapReduce (Hadoop). Google does not currently offer a broad-based cloud service, but the

Google App Engine (GAE) offers a powerful web application development environment.

Workflow

Workflow has spawned many projects in the United States and Europe. Pegasus, Taverna, and

Kepler are popular, but no choice has gained wide acceptance. There are commercial systems

such as Pipeline Pilot, AVS (dated), and the LIMS environments. A recent entry is Trident [2]

from Microsoft Research which is built on top of Windows Workflow Foundation. If Trident runs

on Azure or just any old Windows machine, it will run workflow proxy services on external

(Linux) environments. Workflow links multiple cloud and noncloud services in real applications

on demand.

Data Transport

The cost (in time and money) of data transport in (and to a lesser extent, out of) commercial

clouds is often discussed as a difficulty in using clouds. If commercial clouds become an

important component of the national cyberinfrastructure we can expect that high-bandwidth

links will be made available between clouds and TeraGrid. The special structure of cloud data

with blocks (in Azure blobs) and tables could allow high-performance parallel algorithms, but

initially, simple HTTP mechanisms are used to transport data [3–5] on academic

systems/TeraGrid and commercial clouds.


Security, Privacy, and Availability

The following techniques are related to security, privacy, and availability requirements for

developing a healthy and dependable cloud programming environment.

Use virtual clustering to achieve dynamic resource provisioning with minimum overhead cost.

Use stable and persistent data storage with fast queries for information retrieval.

Use special APIs for authenticating users and sending e-mail using commercial accounts.

Cloud resources are accessed with security protocols such as HTTPS and SSL.

Fine-grained access control is desired to protect data integrity and deter intruders or hackers.

Shared data sets are protected from malicious alteration, deletion, or copyright violations.

Features are included for availability enhancement and disaster recovery with live migration of VMs.

Use a reputation system to protect data centers. This system only authorizes trusted clients and

stops pirates.

Data Features and Databases

Program Library

Many efforts have been made to design a VM image library to manage images used in academic

and commercial clouds. The basic cloud environments described in this chapter also include

many management features allowing convenient deployment and configuring of images (i.e.,

they support IaaS).

Blobs and Drives

The basic storage concept in clouds is blobs for Azure and S3 for Amazon. These can be

organized (approximately, as in directories) by containers in Azure. In addition to a service

interface for blobs and S3, one can attach “directly” to compute instances as Azure drives and

the Elastic Block Store for Amazon. This concept is similar to shared file systems such as Lustre

used in TeraGrid. The cloud storage is intrinsically fault-tolerant while that on TeraGrid needs

backup storage. However, the architecture ideas are similar between clouds and TeraGrid; note also the Simple Cloud File Storage API.

DPFS

This covers the support of file systems such as Google File System (MapReduce), HDFS

(Hadoop), and Cosmos (Dryad) with compute-data affinity optimized for data processing. It

could be possible to link DPFS to basic blob and drive-based architecture, but it’s simpler to use

DPFS as an application-centric storage model with compute-data affinity and blobs and drives

as the repository-centric view.


In general, data transport will be needed to link these two data views. It seems important to

consider this carefully, as DPFS file systems are precisely designed for efficient execution of

data-intensive applications. However, the importance of DPFS for linkage with Amazon and

Azure is not clear, as these clouds do not currently offer fine-grained support for compute-data

affinity. We note here that Azure Affinity Groups are one interesting capability. We expect that

initially blobs, drives, tables, and queues will be the areas where academic systems will most

usefully provide a platform similar to Azure (and Amazon). Note the HDFS (Apache) and Sector

(UIC) projects in this area.

SQL and Relational Databases

Both Amazon and Azure clouds offer relational databases and it is straightforward for academic

systems to offer a similar capability unless there are issues of huge scale where, in fact,

approaches based on tables and/or MapReduce might be more appropriate [8]. As one early

user, we are developing on FutureGrid a new private cloud computing model for the

Observational Medical Outcomes Partnership (OMOP) for patient-related medical data; OMOP currently uses Oracle and SAS, and FutureGrid is adding Hadoop to scale the analysis to many different methods.

Note that databases can be used to illustrate two approaches to deploying capabilities.

Traditionally, one would add database software to that found on computer disks. This software

is executed, providing your database instance. However, on Azure and Amazon, the database is

installed on a separate VM independent from your job (worker roles in Azure). This implements

“SQL as a Service.” It may have some performance issues from the messaging interface, but the

“aaS” deployment clearly simplifies one’s system. For N platform features, one only needs N services, whereas the number of possible images with the alternative approach is a prohibitive 2^N (one image per subset of features).

Table and NOSQL Nonrelational Databases:

A substantial number of important developments have occurred regarding simplified database

structures—termed “NOSQL”—typically emphasizing distribution and scalability. These are

present in the three major clouds: BigTable in Google, SimpleDB in Amazon, and Azure Table

for Azure. Tables are clearly important in science as illustrated by the VOTable standard in

astronomy and the popularity of Excel. However, there does not appear to be substantial

experience in using tables outside clouds.

There are, of course, many important uses of nonrelational databases, especially in terms of

triple stores for metadata storage and access. Recently, there has been interest in building

scalable RDF triple stores based on MapReduce and tables or the Hadoop File System, with

early success reported on very large stores. The current cloud tables fall into two groups: Azure


Table and Amazon SimpleDB are quite similar and support lightweight storage for “document

stores,” while BigTable aims to manage large distributed data sets without size limitations.

All these tables are schema-free (each record can have different properties), although BigTable

has a schema for column (property) families. It seems likely that tables will grow in importance

for scientific computing, and academic systems could support this using two Apache projects:

Hbase for BigTable and CouchDB for a document store. Another possibility is the open source

SimpleDB implementation M/DB. The new Simple Cloud APIs for file storage, document

storage services, and simple queues could help in providing a common environment between

academic and commercial clouds.

Queuing Services

Both Amazon and Azure offer similar scalable, robust queuing services that are used to

communicate between the components of an application. The messages are short (less than 8

KB) and have a Representational State Transfer (REST) service interface with “deliver at least

once” semantics. They are controlled by timeouts that specify the length of time allowed for a client to process a message. One can build a similar approach (on the typically smaller and less challenging

academic environments), basing it on publish-subscribe systems such as ActiveMQ or

NaradaBrokering with which we have substantial experience.
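As a concrete illustration of this queue model, here is a hedged sketch using Amazon SQS through the boto3 library. The queue name is hypothetical, and configured AWS credentials plus a default region are assumed; the visibility timeout and at-least-once behavior mirror the description above.

# Sketch of Amazon SQS usage with boto3, illustrating the short-message,
# at-least-once, timeout-controlled model described above.
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.create_queue(QueueName="demo-work-queue")["QueueUrl"]

# Producer: messages are small text payloads, well under the size limit.
sqs.send_message(QueueUrl=queue_url, MessageBody="process item 42")

# Consumer: the visibility timeout hides the message from other consumers
# while this client processes it; if the message is not deleted in time,
# it reappears, hence "deliver at least once".
resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                           VisibilityTimeout=30, WaitTimeSeconds=5)
for msg in resp.get("Messages", []):
    print("got:", msg["Body"])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])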

Parallel & Distributed Programming Paradigms

Running a parallel program on a distributed computing system (parallel and distributed

programming) has several advantages for both users and distributed computing systems. From

the users’ perspective, it decreases application response time; from the distributed computing

systems standpoint, it increases throughput and resource utilization. Running a parallel

program on a distributed computing system, however, could be a very complicated process.

MapReduce, Hadoop, and Dryad are three of the most recently proposed parallel and

distributed programming models. They were developed for information retrieval applications

but have been shown to be applicable for a variety of important applications. Further, the loose

coupling of components in these paradigms makes them suitable for VM implementation and

leads to much better fault tolerance and scalability for some applications than traditional

parallel computing models such as MPI.

MapReduce, Twister, and Iterative MapReduce:

This software framework abstracts the data flow of running a parallel program on a distributed

computing system by providing users with two interfaces in the form of two functions: Map and

Reduce. Users can override these two functions to interact with and manipulate the data flow of running their programs. In a (key, value) pair, the "value" part is the actual data, and the "key" part is only used by the MapReduce controller to control the data flow.

Fig: MapReduce Framework
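To make the Map and Reduce interfaces concrete, here is a minimal in-memory word-count sketch. The tiny run_mapreduce driver is illustrative only; real frameworks (Hadoop, the Google implementation) distribute exactly these Map and Reduce calls across many workers and handle the shuffle over the network.

# Minimal in-memory sketch of the MapReduce data flow for word count.
from collections import defaultdict

def map_fn(key, value):
    """Map: key = document name, value = document text."""
    for word in value.split():
        yield (word, 1)                 # emit intermediate (key, value) pairs

def reduce_fn(key, values):
    """Reduce: key = word, values = all counts emitted for that word."""
    yield (key, sum(values))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Shuffle/group phase: collect intermediate values by key, which the
    # framework normally does by sorting and grouping on each reduce worker.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    results = []
    for k in sorted(groups):
        results.extend(reduce_fn(k, groups[k]))
    return results

docs = [("d1", "the cloud stores the data"), ("d2", "the data is big")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('big', 1), ('cloud', 1), ('data', 2), ('is', 1), ('stores', 1), ('the', 3)]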

MapReduce Actual Data and Control Flow

The main responsibility of the MapReduce framework is to efficiently run a user’s program on a

distributed computing system. Therefore, the MapReduce framework meticulously handles all

partitioning, mapping, synchronization, communication, and scheduling details of such data

flows.

Data partitioning The MapReduce library splits the input data (files), already stored in

GFS, into M pieces that also correspond to the number of map tasks.

Computation partitioning This is implicitly handled (in the MapReduce framework) by

obliging users to write their programs in the form of the Map and Reduce functions.

Therefore, the MapReduce library only generates copies of a user program (e.g., by a

fork system call) containing the Map and the Reduce functions, distributes them, and

starts them up on a number of available computation engines.

Determining the master and workers The MapReduce architecture is based on a

master-worker model. Therefore, one of the copies of the user program becomes the

master and the rest become workers. The master picks idle workers, and assigns the

map and reduce tasks to them. A map/reduce worker is typically a computation engine

such as a cluster node to run map/ reduce tasks by executing Map/Reduce functions.

Steps 4–7 describe the map workers.

Reading the input data (data distribution) Each map worker reads its corresponding

portion of the input data, namely the input data split, and sends it to its Map function.

Although a map worker may run more than one Map function, which means it has been


assigned more than one input data split, each worker is usually assigned one input split

only.

Map function Each Map function receives the input data split as a set of (key, value)

pairs to process and produce the intermediated (key, value) pairs.

Combiner function This is an optional local function within the map worker which

applies to intermediate (key, value) pairs. The user can invoke the Combiner function

inside the user program. The Combiner function runs the same code written by users for

the Reduce function as its functionality is identical to it. The Combiner function merges

the local data of each map worker before sending it over the network to effectively

reduce its communication costs. As mentioned in our discussion of logical data flow, the

MapReduce framework sorts and groups the data before it is processed by the Reduce

function. Similarly, the MapReduce framework will also sort and group the local data on

each map worker if the user invokes the Combiner function.

Partitioning function As mentioned in our discussion of the MapReduce data flow, the

intermediate (key, value) pairs with identical keys are grouped together because all

values inside each group should be processed by only one Reduce function to generate

the final result. However, in real implementations, since there are M map and R reduce

tasks, intermediate (key, value) pairs with the same key might be produced by different

map tasks, although they should be grouped and processed together by one Reduce

function only. A partitioning function (typically hash(key) mod R) maps each intermediate key to one of the R reduce tasks so that this grouping holds; see the sketch after these steps.

Synchronization MapReduce applies a simple synchronization policy to coordinate map

workers with reduce workers, in which the communication between them starts when

all map tasks finish.

Communication Reduce worker i, already notified of the location of region i of all map

workers, uses a remote procedure call to read the data from the respective region of all

map workers. Since all reduce workers read the data from all map workers, all-to-all

communication among all map and reduce workers, which incurs network congestion,

occurs in the network. This issue is one of the major bottlenecks in increasing the

performance of such systems. A data transfer module was proposed to schedule data

transfers independently.

Sorting and Grouping When the process of reading the input data is finalized by a

reduce worker, the data is initially buffered in the local disk of the reduce worker. Then

the reduce worker groups intermediate (key, value) pairs by sorting the data based on

their keys, followed by grouping all occurrences of identical keys. Note that the buffered

data is sorted and grouped because the number of unique keys produced by a map worker may be larger than the number of regions R, in which case more than one key exists in each region of a map worker.


Reduce function The reduce worker iterates over the grouped (key, value) pairs, and for

each unique key, it sends the key and corresponding values to the Reduce function.

Then this function processes its input data and stores the output results in

predetermined files in the user’s program.
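The partitioning step mentioned above can be sketched in a few lines. This mirrors the hash-based scheme used by default in Hadoop's HashPartitioner; the sample keys are illustrative.

# Hash-based partitioning: every intermediate key is mapped to one of the
# R reduce tasks, so all pairs with the same key end up in the same region.
def partition(key, R):
    return hash(key) % R

R = 4
for key in ["apple", "banana", "apple"]:
    print(key, "-> reduce task", partition(key, R))
# Within one run, "apple" always maps to the same reduce task.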

Hadoop Library from Apache

Hadoop is an open source implementation of MapReduce coded and released in Java (rather

than C) by Apache. The Hadoop implementation of MapReduce uses the Hadoop Distributed

File System (HDFS) as its underlying layer rather than GFS. The Hadoop core is divided into two

fundamental layers: the MapReduce engine and HDFS. The MapReduce engine is the

computation engine running on top of HDFS as its data storage manager. The following two

sections cover the details of these two fundamental layers.

HDFS: HDFS is a distributed file system inspired by GFS that organizes files and stores their data

on a distributed computing system.

HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the

master and a number of DataNodes as workers (slaves). To store a file in this architecture, HDFS

splits the file into fixed-size blocks (e.g., 64 MB) and stores them on workers (DataNodes). The

mapping of blocks to DataNodes is determined by the NameNode. The NameNode (master)

also manages the file system’s metadata and namespace. In such systems, the namespace is the

area maintaining the metadata, and metadata refers to all the information stored by a file

system that is needed for overall management of all files. For example, NameNode in the

metadata stores all information regarding the location of input splits/blocks in all DataNodes.

Each DataNode, usually one per node in a cluster, manages the storage attached to the node. Each

DataNode is responsible for storing and retrieving its file blocks.

HDFS Features: Distributed file systems have special requirements, such as performance,

scalability, concurrency control, fault tolerance, and security requirements [62], to operate

efficiently. However, because HDFS is not a general-purpose file system, as it only executes

specific types of applications, it does not need all the requirements of a general distributed file

system. For example, security has never been supported for HDFS systems. The following

discussion highlights two important characteristics of HDFS to distinguish it from other generic

distributed file systems.

HDFS Fault Tolerance: One of the main aspects of HDFS is its fault tolerance characteristic.

Since Hadoop is designed to be deployed on low-cost hardware by default, a hardware failure in

this system is considered to be common rather than an exception. Therefore, Hadoop considers

the following issues to fulfill reliability requirements of the file system.


Block replication To reliably store data in HDFS, file blocks are replicated in this system. In

other words, HDFS stores a file as a set of blocks and each block is replicated and distributed

across the whole cluster. The replication factor is set by the user and is three by default.

Replica placement The placement of replicas is another factor to fulfill the desired fault

tolerance in HDFS. Although storing replicas on different nodes (DataNodes) located in different

racks across the whole cluster provides more reliability, it is sometimes ignored as the cost of

communication between two nodes in different racks is relatively high in comparison with that

of different nodes located in the same rack. Therefore, sometimes HDFS compromises its

reliability to achieve lower communication costs. For example, for the default replication factor

of three, HDFS stores one replica in the same node the original data is stored, one replica on a

different node but in the same rack, and one replica on a different node in a different rack to

provide three copies of the data. (A small sketch of this placement rule follows these items.)

Heartbeat and Blockreport messages Heartbeats and Blockreports are periodic messages sent

to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat implies that the

DataNode is functioning properly, while each Blockreport contains a list of all blocks on a

DataNode. The NameNode receives such messages because it is the sole decision maker of all

replicas in the system.
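The following is a small, self-contained sketch of the replica placement rule described above for the default factor of three. The node and rack names are hypothetical, and the real NameNode also weighs load, free space, and topology distance when choosing targets.

# Sketch of the rack-aware placement rule described above: one replica on
# the writer's node, one on another node in the same rack, and one on a
# node in a different rack.
def place_replicas(writer_node, topology):
    """topology: dict mapping rack name -> list of DataNode names."""
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    replicas = [writer_node]                               # replica 1: local node
    same_rack = [n for n in topology[local_rack] if n != writer_node]
    replicas.append(same_rack[0])                          # replica 2: same rack, other node
    other_rack = next(r for r in topology if r != local_rack)
    replicas.append(topology[other_rack][0])               # replica 3: different rack
    return replicas

topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(place_replicas("dn1", topology))   # ['dn1', 'dn2', 'dn4']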

HDFS High-Throughput Access to Large Data Sets (Files): Because HDFS is primarily designed

for batch processing rather than interactive processing, data access throughput in HDFS is more

important than latency. Also, because applications run on HDFS typically have large data sets,

individual files are broken into large blocks (e.g., 64 MB) to allow HDFS to decrease the amount

of metadata storage required per file. This provides two advantages: The list of blocks per file

will shrink as the size of individual blocks increases, and by keeping large amounts of data

sequentially within a block, HDFS provides fast streaming reads of data.

HDFS Operation: The control flow of HDFS operations such as write and read can properly

highlight roles of the NameNode and DataNodes in the managing operations. In this section,

the control flow of the main operations of HDFS on files is further described to manifest the

interaction between the user, the NameNode, and the DataNodes in such systems.

Reading a file To read a file in HDFS, a user sends an “open” request to the NameNode to get

the location of file blocks. For each file block, the NameNode returns the address of a set of

DataNodes containing replica information for the requested file. The number of addresses

depends on the number of block replicas. Upon receiving such information, the user calls the

read function to connect to the closest DataNode containing the first block of the file. After the

first block is streamed from the respective DataNode to the user, the established connection is


terminated and the same process is repeated for all blocks of the requested file until the whole

file is streamed to the user. (A small simulation of this read control flow appears after these operations.)

Writing to a file To write a file in HDFS, a user sends a “create” request to the NameNode to

create a new file in the file system namespace. If the file does not exist, the NameNode notifies

the user and allows him to start writing data to the file by calling the write function. The first

block of the file is written to an internal queue termed the data queue while a data streamer

monitors its writing into a DataNode. Since each file block needs to be replicated by a

predefined factor, the data streamer first sends a request to the NameNode to get a list of

suitable DataNodes to store replicas of the first block.

The streamer then stores the block in the first allocated DataNode. Afterward, the block is

forwarded to the second DataNode by the first DataNode. The process continues until all

allocated DataNodes receive a replica of the first block from the previous DataNode. Once this

replication process is finalized, the same process starts for the second block and continues until

all blocks of the file are stored and replicated on the file system.
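To tie the read control flow together, here is a small, self-contained simulation of the protocol described above: the client asks the NameNode for the DataNode addresses of each block and then streams the blocks one at a time. The NameNode, DataNodes, block IDs, and file contents are all hypothetical.

# Simulation of the HDFS read path: "open" at the NameNode, then stream
# each block from one replica at a time.
class NameNode:
    def __init__(self, block_map):
        # block_map: filename -> ordered list of (block_id, [DataNode names])
        self.block_map = block_map

    def open(self, filename):
        return self.block_map[filename]

class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks                              # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

datanodes = {"dn1": DataNode({"blk_1": b"hello "}),
             "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"})}
namenode = NameNode({"/user/demo.txt": [("blk_1", ["dn1", "dn2"]),
                                        ("blk_2", ["dn2"])]})

def hdfs_read(namenode, datanodes, filename):
    data = b""
    for block_id, replicas in namenode.open(filename):    # "open" request
        chosen = replicas[0]                               # closest replica in reality
        data += datanodes[chosen].read_block(block_id)     # stream this block, then move on
    return data

print(hdfs_read(namenode, datanodes, "/user/demo.txt"))    # b'hello world'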

Architecture of MapReduce in Hadoop

The topmost layer of Hadoop is the MapReduce engine that manages the data flow and control

flow of MapReduce jobs over distributed computing systems. Similar to HDFS, the MapReduce

engine also has a master/slave architecture consisting of a single JobTracker as the master and

a number of TaskTrackers as the slaves (workers). The JobTracker manages the MapReduce job

over a cluster and is responsible for monitoring jobs and assigning tasks to TaskTrackers. The

TaskTracker manages the execution of the map and/or reduce tasks on a single computation

node in the cluster.

Each TaskTracker node has a number of simultaneous execution slots, each executing either a

map or a reduce task. Slots are defined as the number of simultaneous threads supported by

CPUs of the TaskTracker node. For example, a TaskTracker node with N CPUs, each supporting

M threads, has M * N simultaneous execution slots. It is worth noting that each data block is

processed by one map task running on a single slot. Therefore, there is a one-to-one

correspondence between map tasks in a TaskTracker and data blocks in the respective

DataNode.
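A quick illustration of the slot arithmetic above, with N and M chosen arbitrarily for the example:

# A TaskTracker with N CPUs, each supporting M threads, exposes M * N
# simultaneous execution slots; each map task (one per data block) uses one.
N_cpus, M_threads = 4, 2
slots = N_cpus * M_threads
print("execution slots:", slots)        # 8

data_blocks = ["blk_%d" % i for i in range(11)]
# With 8 slots, at most 8 map tasks run at once; the rest wait for a free slot.
waves = [data_blocks[i:i + slots] for i in range(0, len(data_blocks), slots)]
for w, tasks in enumerate(waves, 1):
    print("wave", w, "->", tasks)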


Figure: MapReduce in Hadoop

Programming Support of Google App Engine

Programming the Google App Engine

Several web resources (e.g., http://code.google.com/appengine/) and specific books and

articles (e.g., www.byteonic.com/2009/overview-of-java-support-in-google-app-engine/)

discuss how to program GAE.

There are several powerful constructs for storing and accessing data. The data store is a NOSQL

data management system for entities that can be, at most, 1 MB in size and are labeled by a set

of schema-less properties. Queries can retrieve entities of a given kind filtered and sorted by

the values of the properties. Java offers Java Data Object (JDO) and Java Persistence API (JPA)

interfaces implemented by the open source Data Nucleus Access platform, while Python has a

SQL-like query language called GQL. The data store is strongly consistent and uses optimistic

concurrency control.

Figure: Programming Environment for Google App Engine


An update of an entity occurs in a transaction that is retried a fixed number of times if other

processes are trying to update the same entity simultaneously. Your application can execute

multiple data store operations in a single transaction which either all succeed or all fail

together. The data store implements transactions across its distributed network using “entity

groups.” A transaction manipulates entities within a single group. Entities of the same group

are stored together for efficient execution of transactions. Your GAE application can assign

entities to groups when the entities are created. The performance of the data store can be

enhanced by in-memory caching using the memcache, which can also be used independently of

the data store.
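The following is a minimal, hedged sketch of these ideas, assuming the classic Python 2.7 App Engine runtime with its ndb and memcache libraries: a schema-less entity kind, a transactional update retried under optimistic concurrency control, and memcache placed in front of a query. The Greeting kind, its properties, and the cache key are illustrative only.

# Hedged sketch of the classic App Engine datastore (ndb) and memcache APIs.
from google.appengine.ext import ndb
from google.appengine.api import memcache

class Greeting(ndb.Model):                  # an entity kind with schema-less properties
    author = ndb.StringProperty()
    content = ndb.TextProperty()
    date = ndb.DateTimeProperty(auto_now_add=True)

@ndb.transactional
def update_content(key, new_content):
    # Runs as a transaction; the datastore retries it a limited number of
    # times if another process updates the same entity group concurrently.
    greeting = key.get()
    greeting.content = new_content
    greeting.put()

def latest_greetings():
    # Cache query results in memcache to reduce datastore reads.
    cached = memcache.get("latest")
    if cached is not None:
        return cached
    result = Greeting.query().order(-Greeting.date).fetch(10)
    memcache.set("latest", result, time=60)
    return result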

Recently, Google added the blobstore which is suitable for large files as its size limit is 2 GB.

There are several mechanisms for incorporating external resources. The Google SDC Secure

Data Connection can tunnel through the Internet and link your intranet to an external GAE

application. The URL Fetch operation provides the ability for applications to fetch resources and

communicate with other hosts over the Internet using HTTP and HTTPS requests. There is a

specialized mail mechanism to send e-mail from your GAE application.

Applications can access resources on the Internet, such as web services or other data, using

GAE’s URL fetch service. The URL fetch service retrieves web resources using the same

highspeed Google infrastructure that retrieves web pages for many other Google products.

There are dozens of Google “corporate” facilities including maps, sites, groups, calendar, docs,

and YouTube, among others. These support the Google Data API which can be used inside GAE.

An application can use Google Accounts for user authentication. Google Accounts handles user

account creation and sign-in, and a user that already has a Google account (such as a Gmail

account) can use that account with your app. GAE provides the ability to manipulate image data

using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. An

application can perform tasks outside of responding to web requests. Your application can

perform these tasks on a schedule that you configure, such as on a daily or hourly basis using

“cron jobs,” handled by the Cron service.

Alternatively, the application can perform tasks added to a queue by the application itself, such

as a background task created while handling a request. A GAE application is configured to

consume resources up to certain limits or quotas. With quotas, GAE ensures that your

application won’t exceed your budget, and that other applications running on GAE won’t

impact the performance of your app. In particular, GAE use is free up to certain quotas.

Google File System (GFS)

GFS was built primarily as the fundamental storage service for Google’s search engine. As the

size of the web data that was crawled and saved was quite substantial, Google needed a


distributed file system to redundantly store massive amounts of data on cheap and unreliable

computers. None of the traditional distributed file systems can provide such functions and hold

such large amounts of data. In addition, GFS was designed for Google applications, and Google

applications were built for GFS. In traditional file system design, such a philosophy is not

attractive, as there should be a clear interface between applications and the file system, such as

a POSIX interface.

There are several assumptions concerning GFS. One is related to the characteristic of the cloud

computing hardware infrastructure (i.e., the high component failure rate). As servers are

composed of inexpensive commodity components, it is the norm rather than the exception that

concurrent failures will occur all the time. Another concern is file size in GFS. GFS typically will

hold a large number of huge files, each 100 MB or larger, with files that are multiple GB in size

quite common. Thus, Google has chosen its file data block size to be 64 MB instead of the 4 KB

in typical traditional file systems. The I/O pattern in the Google application is also special. Files

are typically written once, and the write operations are often the appending data blocks to the

end of files. Multiple appending operations might be concurrent. There will be a lot of large

streaming reads and only a little random access. As for large streaming reads, highly sustained

throughput is much more important than low latency.

Thus, Google made some special decisions regarding the design of GFS. As noted earlier, a 64

MB block size was chosen. Reliability is achieved by using replications (i.e., each chunk or data

block of a file is replicated across more than three chunk servers). A single master coordinates

access as well as keeps the metadata. This decision simplified the design and management of

the whole cluster. Developers do not need to consider many difficult issues in distributed

systems, such as distributed consensus. There is no data cache in GFS as large streaming reads

and writes represent neither time nor space locality. GFS provides a similar, but not identical,

POSIX file system accessing interface. The distinct difference is that the application can even

see the physical location of file blocks. Such a scheme can improve the upper-layer applications.

The customized API can simplify the problem and focus on Google applications. The customized

API adds snapshot and record append operations to facilitate the building of Google

applications.


Figure: GFS Architecture

The above figure shows the GFS architecture. It is quite obvious that there is a single master in

the whole cluster. Other nodes act as the chunk servers for storing data, while the single master

stores the metadata. The file system namespace and locking facilities are managed by the

master. The master periodically communicates with the chunk servers to collect management

information as well as give instructions to the chunk servers to do work such as load balancing

or fail recovery.

The master has enough information to keep the whole cluster in a healthy state. With a single

master, many complicated distributed algorithms can be avoided and the design of the system

can be simplified. However, this design does have a potential weakness, as the single GFS

master could be the performance bottleneck and the single point of failure. To mitigate this,

Google uses a shadow master to replicate all the data on the master, and the design guarantees

that all the data operations are performed directly between the client and the chunk server.

The control messages are transferred between the master and the clients and they can be

cached for future use. With the current quality of commodity servers, the single master can

handle a cluster of more than 1,000 nodes.


The above figure shows the data mutation (write and append operations) in GFS. Data blocks must be created for all replicas, and the goal is to minimize involvement of the master. The mutation takes the following steps:

1. The client asks the master which chunk server holds the current lease for the chunk and the locations of the other replicas. If no one has a lease, the master grants one to a replica it chooses (not shown).

2. The master replies with the identity of the primary and the locations of the other (secondary) replicas. The client caches this data for future mutations. It needs to contact the master again only when the primary becomes unreachable or replies that it no longer holds a lease.

3. The client pushes the data to all the replicas. A client can do so in any order. Each chunk server will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, we can improve performance by scheduling the expensive data flow based on the network topology, regardless of which chunk server is the primary.

4. Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial order.

5. The primary forwards the write request to all secondary replicas. Each secondary replica applies mutations in the same serial number order assigned by the primary.

6. The secondaries all reply to the primary, indicating that they have completed the operation.

7. The primary replies to the client. Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas; the client request is considered to have failed, and the modified region is left in an inconsistent state. The client code handles such errors by retrying the failed mutation, making a few attempts at steps 3 through 7 before falling back to a retry from the beginning of the write.
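The control and data flow of these steps can be illustrated with a small, self-contained Python sketch. The classes and method names below (ChunkServer, Primary, push_data, and so on) are purely illustrative; GFS is not exposed as a Python API, and this is only a single-process simulation of the lease-based write path:

# Minimal, illustrative simulation of the GFS write path described above.
class ChunkServer:
    def __init__(self, name):
        self.name = name
        self.buffer = {}     # data pushed by clients, keyed by a data id (an LRU cache in real GFS)
        self.chunk = []      # mutations applied in serial order

    def push_data(self, data_id, data):
        self.buffer[data_id] = data                      # step 3: data flow, decoupled from control

    def apply(self, serial, data_id):
        self.chunk.append((serial, self.buffer.pop(data_id)))   # apply a mutation in serial order

class Primary(ChunkServer):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        serial = self.next_serial                        # step 4: the primary serializes mutations
        self.next_serial += 1
        self.apply(serial, data_id)
        for s in self.secondaries:                       # step 5: forward to all secondaries
            s.apply(serial, data_id)
        return "ok"                                      # steps 6-7: acknowledgments flow back

secondaries = [ChunkServer("cs2"), ChunkServer("cs3")]
primary = Primary("cs1", secondaries)
replicas = [primary] + secondaries

data_id, data = "d1", b"appended record"
for r in replicas:
    r.push_data(data_id, data)                           # step 3: client pushes data to every replica
print(primary.write(data_id))                            # step 4: client sends the write to the primary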

Programming On Amazon Aws and Microsoft Azure

Programming on Amazon EC2

Amazon was the first company to introduce VMs in application hosting. Customers can rent

VMs instead of physical machines to run their own applications. By using VMs, customers can

load any software of their choice. The elastic feature of such a service is that a customer can

create, launch, and terminate server instances as needed, paying by the hour for active servers.

Amazon provides several types of preinstalled VMs. Instances are created from Amazon Machine Images (AMIs), which are preconfigured with operating systems based on Linux or Windows and additional software.

Figure: Amazon EC2 Execution environment

Standard instances are well suited for most applications.


Micro instances provide a small amount of consistent CPU resources and allow you to

burst CPU capacity when additional cycles are available. They are well suited for lower

throughput applications and web sites that consume significant compute cycles

periodically.

High-memory instances offer large memory sizes for high-throughput applications,

including database and memory caching applications.

High-CPU instances have proportionally more CPU resources than memory (RAM) and

are well suited for compute-intensive applications.

Cluster compute instances provide proportionally high CPU resources with increased

network performance and are well suited for high-performance computing (HPC)

applications and other demanding network-bound applications. They use 10 Gigabit

Ethernet interconnections.

Instance capacity and cost are expressed in terms of EC2 Compute Units (ECUs), where one ECU provides the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor.
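The elastic create/launch/terminate lifecycle can be sketched with the boto3 Python SDK (an assumption; the text does not prescribe a client library). The AMI ID and instance type below are placeholders:

# Sketch of the EC2 instance lifecycle using boto3 (assumed installed and configured with credentials).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one instance from a preconfigured AMI (Amazon Machine Image).
resp = ec2.run_instances(ImageId="ami-xxxxxxxx", InstanceType="t2.micro",
                         MinCount=1, MaxCount=1)
instance_id = resp["Instances"][0]["InstanceId"]

# ... run the application, paying by the hour while the instance is active ...

# Terminate the instance when it is no longer needed.
ec2.terminate_instances(InstanceIds=[instance_id])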

Amazon Simple Storage Service (S3)

Amazon S3 provides a simple web services interface that can be used to store and retrieve any

amount of data, at any time, from anywhere on the web. S3 provides the object-oriented

storage service for users. Users can access their objects through Simple Object Access Protocol

(SOAP) with either browsers or other client programs that support SOAP. A related service, Amazon Simple Queue Service (SQS), is responsible for ensuring a reliable message service between two processes, even if the receiver process is not running.

The fundamental operation unit of S3 is called an object. Each object is stored in a bucket and

retrieved via a unique, developer-assigned key. In other words, the bucket is the container of

the object. Besides unique key attributes, the object has other attributes such as values,

metadata, and access control information. From the programmer’s perspective, the storage

provided by S3 can be viewed as a very coarse-grained key-value pair. Through the key-value

programming interface, users can write, read, and delete objects containing from 1 byte to 5

gigabytes of data each. There are two types of web service interface for the user to access the

data stored in Amazon clouds. One is a REST (web 2.0) interface, and the other is a SOAP

interface. Here are some key features of S3:

Redundant through geographic dispersion

Designed to provide 99.999999999 percent durability and 99.99 percent availability of

objects over a given year with cheaper reduced redundancy storage (RRS).


Authentication mechanisms to ensure that data is kept secure from unauthorized

access. Objects can be made private or public, and rights can be granted to specific

users.

Per-object URLs and ACLs (access control lists).

Default download protocol of HTTP. A BitTorrent protocol interface is provided to lower

costs for high-scale distribution.

Storage priced from $0.055 (for more than 5,000 TB stored) to $0.15 per GB per month, depending on the total amount stored.

First 1 GB per month of input or output free, and then $0.08 to $0.15 per GB for transfers outside an S3 region.

There is no data transfer charge for data transferred between Amazon EC2 and Amazon

S3 within the same region or for data transferred between the Amazon EC2 Northern

Virginia region and the Amazon S3 U.S. Standard region (as of October 6, 2010).
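A minimal sketch of the key-value style interface, again assuming the boto3 SDK; the bucket name and object key are placeholders:

# Write, read, and delete an object by key within a bucket.
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "reports/2010/summary.txt"

s3.put_object(Bucket=bucket, Key=key, Body=b"object data, from 1 byte up to 5 GB")  # write
obj = s3.get_object(Bucket=bucket, Key=key)                                         # read
print(obj["Body"].read())
s3.delete_object(Bucket=bucket, Key=key)                                             # delete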

Amazon Elastic Block Store (EBS) and SimpleDB

The Elastic Block Store (EBS) provides the volume block interface for saving and restoring the

virtual images of EC2 instances. Traditionally, an EC2 instance's local state is destroyed after use. The state of an EC2 instance can now be saved in the EBS system after the machine is shut down. Users can use EBS

to save persistent data and mount to the running instances of EC2. Note that S3 is “Storage as a

Service” with a messaging interface. EBS is analogous to a distributed file system accessed by

traditional OS disk access mechanisms. EBS allows you to create storage volumes from 1GB to 1TB that can be mounted by EC2 instances.

Multiple volumes can be mounted to the same instance. These storage volumes behave like

raw, unformatted block devices, with user-supplied device names and a block device interface.

You can create a file system on top of Amazon EBS volumes, or use them in any other way you

would use a block device (like a hard drive). Snapshots are provided so that the data can be

saved incrementally. This can improve performance when saving and restoring data. In terms of

pricing, Amazon provides a similar pay-per-use schema as EC2 and S3. Volume storage charges

are based on the amount of storage users allocate until it is released, and are priced at $0.10 per GB per month. EBS also charges $0.10 per 1 million I/O requests made to the storage (as of October 6, 2010). The equivalent of EBS has been offered in open source clouds such as Nimbus.
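A hedged sketch of the EBS volume lifecycle, again with boto3; the volume size, availability zone, instance ID, and device name are placeholders:

# Create a volume, attach it to a running EC2 instance as a raw block device, release it later.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

vol = ec2.create_volume(Size=100, AvailabilityZone="us-east-1a")      # 100 GB volume
ec2.attach_volume(VolumeId=vol["VolumeId"],
                  InstanceId="i-0123456789abcdef0",
                  Device="/dev/sdf")

# ... create a file system on the device inside the instance and store persistent data ...

# ec2.detach_volume(VolumeId=vol["VolumeId"])
# ec2.delete_volume(VolumeId=vol["VolumeId"])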

Amazon SimpleDB Service

SimpleDB provides a simplified data model based on the relational database data model.

Structured data from users must be organized into domains. Each domain can be considered a

table. The items are the rows in the table. A cell in the table is recognized as the value for a

specific attribute (column name) of the corresponding row. This is similar to a table in a


relational database. However, it is possible to assign multiple values to a single cell in the table.

This is not permitted in a traditional relational database, which must maintain data

consistency.

Many developers simply want to quickly store, access, and query the stored data. SimpleDB

removes the requirement to maintain database schemas with strong consistency. SimpleDB is

priced at $0.140 per Amazon SimpleDB Machine Hour consumed with the first 25 Amazon

SimpleDB Machine Hours consumed per month free (as of October 6, 2010). SimpleDB, like

Azure Table, could be called “LittleTable,” as they are aimed at managing small amounts of

information stored in a distributed table; one could say BigTable is aimed at basic big data,

whereas LittleTable is aimed at metadata. Amazon Dynamo is an early research system along

the lines of the production SimpleDB system.
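The domain/item/attribute model, including multi-valued cells, can be pictured with plain Python data structures (a conceptual illustration only, no AWS SDK involved):

# A domain plays the role of a table, items are rows, and a cell may hold multiple values.
domain_users = {
    "item-001": {
        "name":  ["Alice"],
        "email": ["alice@example.com", "a.smith@example.org"],   # multi-valued cell
    },
    "item-002": {
        "name":  ["Bob"],
        "city":  ["Seattle"],     # items need not share the same attributes (no fixed schema)
    },
}
print(domain_users["item-001"]["email"])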

Microsoft Azure Programming Support

When the system is running, services are monitored and one can access event logs,

trace/debug data, performance counters, IIS web server logs, crash dumps, and other log files.

This information can be saved in Azure storage. Note that there is no debugging capability for

running cloud applications, but debugging is done from a trace. One can divide the basic

features into storage and compute capabilities. The Azure application is linked to the Internet

through a customized compute VM called a web role supporting basic Microsoft web hosting.

Such configured VMs are often called appliances. The other important compute class is the

worker role reflecting the importance in cloud computing of a pool of compute resources that

are scheduled as needed. The roles support HTTP(S) and TCP. Roles offer the following

methods:

The OnStart() method, which is called by the Fabric on startup, allows you to perform initialization tasks; the role reports a Busy status to the load balancer until the method returns true.

The OnStop() method, which is called when the role is to be shut down, allows a graceful exit.

The Run() method, which contains the main logic.
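Real web and worker roles are .NET classes (they typically derive from the RoleEntryPoint base class); the Python sketch below is only a conceptual analogue of this three-method lifecycle, and all names are illustrative:

# Conceptual analogue of the role lifecycle; not the actual Azure API.
import time

class WorkerRoleSketch:
    def on_start(self):
        # Initialization; the role reports Busy to the load balancer until this returns True.
        print("initializing...")
        return True

    def run(self):
        # Main logic: typically an endless loop pulling work items (e.g., from a queue).
        while True:
            print("processing work item")
            time.sleep(10)

    def on_stop(self):
        # Called when the role is shut down; allows a graceful exit.
        print("flushing state and exiting")

role = WorkerRoleSketch()
if role.on_start():
    try:
        role.run()
    except KeyboardInterrupt:
        role.on_stop()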


Figure: Features of the Azure Cloud Platform

SQLAzure

All the storage modalities are accessed with REST interfaces except for the recently introduced

Drives, which are analogous to Amazon EBS (discussed earlier) and offer a file system

interface as a durable NTFS volume backed by blob storage. The REST interfaces are

automatically associated with URLs and all storage is replicated three times for fault tolerance

and is guaranteed to be consistent in access.

The basic storage system is built from blobs which are analogous to S3 for Amazon. Blobs are

arranged as a three-level hierarchy: Account → Containers → Page or Block Blobs. Containers

are analogous to directories in traditional file systems with the account acting as the root. The

block blob is used for streaming data; each such blob is made up of a sequence of blocks of

up to 4 MB each, while each block has a 64 byte ID. Block blobs can be up to 200 GB in size.

Page blobs are for random read/write access and consist of an array of pages with a maximum

blob size of 1 TB. One can associate metadata with blobs as <name, value> pairs with up to 8 KB

per blob.
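The Account → Containers → Blobs hierarchy can be sketched with the azure-storage-blob Python package (an assumption; the text does not name a client library). The connection string, container, and blob names are placeholders:

# Upload and read back a block blob inside a container under a storage account.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
container = service.get_container_client("logs")     # a container acts like a directory

container.upload_blob(name="2010/10/06/events.log", data=b"...streamed data...")
print(container.download_blob("2010/10/06/events.log").readall())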

Azure Tables

The Azure Table and Queue storage modes are aimed at much smaller data volumes. Queues

provide reliable message delivery and are naturally used to support work spooling between

web and worker roles. Queues consist of an unlimited number of messages which can be

retrieved and processed at least once with an 8KB limit on message size. Azure supports PUT,

GET, and DELETE message operations, as well as CREATE and DELETE for queues. Each account


can have any number of Azure tables which consist of rows called entities and columns called

properties.

There is no limit to the number of entities in a table and the technology is designed to scale

well to a large number of entities stored on distributed computers. All entities can have up to

255 general properties which are <name, type, value> triples. Two extra properties,

PartitionKey and RowKey, must be defined for each entity, but otherwise, there are no

constraints on the names of properties—this table is very flexible! RowKey is designed to give

each entity a unique label while PartitionKey is designed to be shared and entities with the

same PartitionKey are stored next to each other; a good use of PartitionKey can speed up

search performance. An entity can have, at most, 1MB storage; if you need large value sizes,

just store a link to a blob store in the Table property value. ADO.NET and LINQ support table

queries.
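A conceptual sketch of an Azure table entity in plain Python (no SDK), showing the mandatory PartitionKey and RowKey plus free-form properties and a link out to blob storage for large values:

entity = {
    "PartitionKey": "customer-42",      # shared key: entities with the same value are stored together
    "RowKey": "order-2010-10-06-001",   # unique label within the partition
    "Total": 99.50,
    "Status": "shipped",
    "InvoiceBlobUrl": "https://<account>.blob.core.windows.net/invoices/001.pdf",  # value > 1 MB lives in a blob
}
# The (PartitionKey, RowKey) pair uniquely identifies the entity and drives lookup performance.
print((entity["PartitionKey"], entity["RowKey"]))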

EMERGING CLOUD SOFTWARE ENVIRONMENTS

Open Source Eucalyptus and Nimbus

Eucalyptus is a product from Eucalyptus Systems (www.eucalyptus.com) that was developed

out of a research project at the University of California, Santa Barbara. Eucalyptus was initially

aimed at bringing the cloud computing paradigm to academic supercomputers and clusters.

Eucalyptus provides an AWS-compliant EC2-based web service interface for interacting with the

cloud service. Additionally, Eucalyptus provides services, such as the AWS-compliant Walrus,

and a user interface for managing users and images.

Eucalyptus Architecture

The Eucalyptus system is an open software environment. The architecture was presented in a

Eucalyptus white paper. The system supports cloud programmers in VM image management as

follows. Essentially, the system has been extended to support the development of both the

compute cloud and the storage cloud.

VM Image Management

Eucalyptus takes many design cues from Amazon's EC2, and its image management system is

no different. Eucalyptus stores images in Walrus, the storage system that is analogous to

the Amazon S3 service. As such, any user can bundle her own root file system, and upload and

then register this image and link it with a particular kernel and ramdisk image. This image is

uploaded into a user-defined bucket within Walrus, and can be retrieved anytime from any

availability zone. This allows users to create specialty virtual appliances

(http://en.wikipedia.org/wiki/Virtual_appliance) and deploy them within Eucalyptus with ease.


The Eucalyptus system is available in a commercial proprietary version, as well as the open

source version we just described.

Nimbus

Nimbus [81,82] is a set of open source tools that together provide an IaaS cloud computing

solution. It allows a client to lease remote resources by deploying VMs on those resources and

configuring them to represent the environment desired by the user. To this end, Nimbus

provides a special web interface known as Nimbus Web. Its aim is to provide administrative and

user functions in a friendly interface. Nimbus Web is centered around a Python Django web

application that is intended to be deployable completely separate from the Nimbus service.

Nimbus supports two resource management strategies. The first is the default “resource pool”

mode. In this mode, the service has direct control of a pool of VM manager nodes and it

assumes it can start VMs. The other supported mode is called “pilot.” Here, the service makes

requests to a cluster’s Local Resource Management System (LRMS) to get a VM manager

available to deploy VMs. Nimbus also provides an implementation of Amazon’s EC2 interface

that allows users to use clients developed for the real EC2 system against Nimbus-based clouds.



Unit-5

Cloud Resource Management and Scheduling

Policies and Mechanisms for Resource Management

Cloud resource management policies can be loosely grouped into five classes:

1. Admission control.

2. Capacity allocation.

3. Load balancing.

4. Energy optimization.

5. Quality-of-service (QoS) guarantees.

Load balancing and energy optimization can be done locally, but global load-balancing and

energy optimization policies encounter the same difficulties as the ones we have already

discussed. Load balancing and energy optimization are correlated and affect the cost of

providing the services. Indeed, it was predicted that by 2012 up to 40% of the budget for IT

enterprise infrastructure would be spent on energy.

The common meaning of the term load balancing is that of evenly distributing the load to a set

of servers. For example, consider the case of four identical servers, A, B, C, and D, whose relative loads are 80%, 60%, 40%, and 20%, respectively, of their capacity. As a result of perfect load balancing, all servers would end with the same load, 50% of each server's capacity. In

cloud computing a critical goal is minimizing the cost of providing the service and, in particular,

minimizing the energy consumption. This leads to a different meaning of the term load

balancing; instead of having the load evenly distributed among all servers, we want to

concentrate it and use the smallest number of servers while switching the others to standby

mode, a state in which a server uses less energy. In our example, the load from D will migrate to

A and the load from C will migrate to B; thus, A and B will be loaded at full capacity, whereas C

and D will be switched to standby mode. Quality of service is that aspect of resource

management that is probably the most difficult to address and, at the same time, possibly the

most critical to the future of cloud computing.
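The consolidation in this example (D's load migrates to A, C's to B) can be reproduced by a simple greedy pairing, sketched below; the 100% capacity bound and the load values are those of the example:

# Greedy consolidation: pair the most loaded server with the least loaded one and migrate the
# smaller load onto it; servers that become empty are switched to standby mode.
loads = {"A": 0.80, "B": 0.60, "C": 0.40, "D": 0.20}    # fraction of each server's capacity

ordered = sorted(loads, key=loads.get, reverse=True)     # ['A', 'B', 'C', 'D']
active, standby = dict(loads), []
i, j = 0, len(ordered) - 1
while i < j:
    hi, lo = ordered[i], ordered[j]
    if active[hi] + active[lo] <= 1.0:                   # consolidate lo onto hi
        active[hi] += active.pop(lo)
        standby.append(lo)
        j -= 1
    else:
        i += 1

print(active)    # {'A': 1.0, 'B': 1.0}
print(standby)   # ['D', 'C']

The result matches the example: A and B end up fully loaded, while C and D are switched to standby mode.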

Virtually all optimal – or near-optimal – mechanisms to address the five classes of policies don’t

scale up and typically target a single aspect of resource management, e.g., admission control, but ignore

energy conservation. Many require complex computations that cannot be done effectively in

the time available to respond. The performance models are very complex, analytical solutions


are intractable, and the monitoring systems used to gather state information for these models

can be too intrusive and unable to provide accurate data. Many techniques are concentrated

on system performance in terms of throughput and time in system, but they rarely include

energy tradeoffs or QoS guarantees. Some techniques are based on unrealistic assumptions; for

example, capacity allocation is viewed as an optimization problem, but under the assumption

that servers are protected from overload.

Allocation techniques in computer clouds must be based on a disciplined approach rather than

ad hoc methods. The four basic mechanisms for the implementation of resource management

policies are:

Control theory. Control theory uses feedback to guarantee system stability and predict transient behavior, but it can be used only to predict local rather than global behavior. Kalman

filters have been used for unrealistically simplified models.

Machine learning. A major advantage of machine learning techniques is that they do not need

a performance model of the system. This technique could be applied to coordination of several

autonomic system managers.

Utility-based. Utility-based approaches require a performance model and a mechanism to

correlate user-level performance with cost.

Market-oriented/economic mechanisms. Such mechanisms do not require a model of the

system, e.g., combinatorial auctions for bundles of resources.

Applications of control theory to task scheduling on a cloud

Control theory has been used to design adaptive resource management for many classes of

applications, including power management, task scheduling, QoS adaptation in Web servers,

and load balancing. The classical feedback control methods are used in all these cases to

regulate the key operating parameters of the system based on measurement of the system

output; the feedback control in these methods assumes a linear time-invariant system model

and a closed-loop controller. This controller is based on an open-loop system transfer function

that satisfies stability and sensitivity constraints.

The main components of a control system are the inputs, the control system components, and

the outputs. The inputs in such models are the offered workload and the policies for admission

control, the capacity allocation, the load balancing, the energy optimization, and the QoS

guarantees in the cloud. The system components are sensors used to estimate relevant

measures of performance and controllers that implement various policies; the output is the

resource allocations to the individual applications.


The controllers use the feedback provided by sensors to stabilize the system; stability is related

to the change of the output. If the change is too large, the system may become unstable. In our

context the system could experience thrashing: the amount of useful time dedicated to the

execution of applications becomes increasingly small and most of the system resources are

occupied by management functions.

There are three main sources of instability in any control system:

The delay in getting the system reaction after a control action.

The granularity of the control: a small change enacted by the controllers leads to very large changes in the output.

Oscillations, which occur when the changes of the input are too large and the control is

too weak, such that the changes of the input propagate directly to the output.

Two types of policies are used in autonomic systems: (i) threshold-based policies and (ii)

sequential decision policies based on Markovian decision models. In the first case, upper and

lower bounds on performance trigger adaptation through resource reallocation. Such policies

are simple and intuitive but require setting per-application thresholds.

When upper and lower thresholds are set, instability occurs if they are too close to one another, the variations of the workload are large, and the time required to adapt does not allow the system to stabilize. The actions consist of allocation/deallocation of one or more virtual machines; sometimes the allocation or deallocation of a single VM required by one of the thresholds may cause the other threshold to be crossed, and this represents another source of instability.

Feedback control based on dynamic thresholds

The elements involved in a control system are sensors, monitors, and actuators. The sensors measure the parameter(s) of interest, then transmit the measured values to a monitor, which

determines whether the system behavior must be changed, and, if so, it requests that the

actuators carry out the necessary actions. Often the parameter used for admission control

policy is the current system load; when a threshold, e.g., 80%, is reached, the cloud stops

accepting additional load.

In practice, the implementation of such a policy is challenging or outright infeasible. First, due

to the very large number of servers and to the fact that the load changes rapidly in time, the

estimation of the current system load is likely to be inaccurate. Second, the ratio of average to

maximal resource requirements of individual users specified in a service-level agreement is

typically very high. Once an agreement is in place, user demands must be satisfied; user

requests for additional resources within the SLA limits cannot be denied.


Thresholds. A threshold is the value of a parameter related to the state of a system that

triggers a change in the system behavior. Thresholds are used in control theory to keep critical

parameters of a system in a predefined range. The threshold could be static, defined once and

for all, or it could be dynamic. A dynamic threshold could be based on an average of

measurements carried out over a time interval, a so-called integral control. The dynamic

threshold could also be a function of the values of multiple parameters at a given time or a mix

of the two.

To maintain the system parameters in a given range, a high and a low threshold are often

defined. The two thresholds determine different actions; for example, a high threshold could

force the system to limit its activities and a low threshold could encourage additional activities.

Control granularity refers to the level of detail of the information used to control the system.

Fine control means that very detailed information about the parameters controlling the system

state is used, whereas coarse control means that the accuracy of these parameters is traded for

the efficiency of implementation.

Proportional Thresholding. These ideas can be applied to cloud computing, in particular to the IaaS delivery model, through a strategy for resource management called proportional thresholding. The questions addressed are:

Is it beneficial to have two types of controllers, (1) application controllers that

determine whether additional resources are needed and (2) cloud controllers that

arbitrate requests for resources and allocate the physical resources?

Is it feasible to consider fine control? Is coarse control more adequate in a cloud

computing environment?

Are dynamic thresholds based on time averages better than static ones?

Is it better to have a high and a low threshold, or is it sufficient to define only a high

threshold?

The essence of proportional thresholding is captured by the following algorithm (a short Python sketch follows the list):

Compute the integral value of the high and the low thresholds as averages of the

maximum and, respectively, the minimum of the processor utilization over the process

history.

Request additional VMs when the average value of the CPU utilization over the current

time slice exceeds the high threshold.

Release a VM when the average value of the CPU utilization over the current time slice

falls below the low threshold.
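The sketch promised above is given next; the utilization samples and the scale_up/scale_down callbacks are placeholders standing in for real monitoring data and VM allocation calls:

# One control step of proportional thresholding over CPU-utilization samples.
def proportional_threshold_step(history, current_slice, scale_up, scale_down):
    # Integral (dynamic) thresholds: averages of the per-slice maxima and minima seen so far.
    high = sum(max(s) for s in history) / len(history)
    low = sum(min(s) for s in history) / len(history)

    avg = sum(current_slice) / len(current_slice)    # average utilization over the current time slice
    if avg > high:
        scale_up()       # request an additional VM
    elif avg < low:
        scale_down()     # release a VM
    return low, high, avg

history = [[0.35, 0.60, 0.80], [0.30, 0.55, 0.75], [0.40, 0.65, 0.90]]
print(proportional_threshold_step(history, [0.85, 0.90, 0.95],
                                  lambda: print("request VM"),
                                  lambda: print("release VM")))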


The conclusions reached based on experiments with three VMs are as follows: (a) dynamic

thresholds perform better than static ones and (b) two thresholds are better than one. Though

conforming to our intuition, such results have to be justified by experiments in a realistic

environment. Moreover, convincing results cannot be based on empirical values for some of

the parameters required by integral control equations.

Coordination of specialized autonomic performance managers

Virtually all modern processors support dynamic voltage scaling (DVS) as a mechanism for

energy saving. Indeed, the energy dissipation scales quadratically with the supply voltage. The

power management controls the CPU frequency and, thus, the rate of instruction execution.

For some compute-intensive workloads the performance decreases linearly with the CPU clock

frequency, whereas for others the effect of lower clock frequency is less noticeable or

nonexistent. The clock frequency of individual blades/servers is controlled by a power manager,

typically implemented in the firmware; it adjusts the clock frequency several times a second.
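The quadratic dependence on supply voltage follows from the standard CMOS approximation for dynamic power, P ≈ C·V²·f (a textbook formula assumed here; C is the switched capacitance). A tiny calculation shows why lowering both voltage and frequency pays off:

def dynamic_power(c_eff, voltage, frequency):
    # P ~ C * V^2 * f (dynamic component only; leakage is ignored in this sketch)
    return c_eff * voltage ** 2 * frequency

base = dynamic_power(1.0, 1.2, 2.0e9)       # nominal operating point (arbitrary units)
scaled = dynamic_power(1.0, 0.9, 1.5e9)     # DVS lowers both the voltage and the clock frequency
print(f"power reduced to {scaled / base:.0%} of nominal")   # roughly 42%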

Scheduling algorithms in Cloud Computing:

Fair Queuing

Fair queuing is a family of scheduling algorithms used in some process and network schedulers.

The concept implies a separate data packet queue (or job queue) for each traffic flow (or for

each program process) as opposed to the traditional approach with one FIFO queue for all

packet flows (or for all process jobs). The purpose is to achieve fairness when a limited resource

is shared, for example, to prevent flows with large packets (or processes that generate small jobs) from achieving more throughput (or CPU time) than other flows (or processes).

Principle

The main idea of fair queuing is to use one queue per packet flow and service the queues in rotation,

such that each flow "obtains an equal fraction of the resources".

The advantage over conventional first in first out (FIFO) or static priority queuing is that a high-

data-rate flow, consisting of large or many data packets, cannot take more than its fair share of

the link capacity.

Fair queuing is used in routers, switches, and statistical multiplexers that forward packets from

a buffer. The buffer works as a queuing system, where the data packets are stored temporarily

until they are transmitted.

With a link data-rate of R, at any given time the N active data flows (the ones with non-empty

queues) are serviced each with an average data rate of R/N. In a short time interval the data


rate may fluctuate around this value, since the packets are delivered sequentially,

depending on the scheduling used.

Algorithm details

Modeling of actual finish time, while feasible, is computationally intensive. The model needs to

be substantially recomputed every time a packet is selected for transmission and every time a

new packet arrives into any queue.

To reduce computational load, the concept of virtual time is introduced. Finish time for each

packet is computed on this alternate monotonically increasing virtual timescale. While virtual

time does not accurately model the time packets complete their transmissions, it does

accurately model the order in which the transmissions must occur to meet the objectives of the

full-featured model. Using virtual time, it is unnecessary to recompute the finish time for previously queued packets. Although the finish time, in absolute terms, for existing packets is potentially affected by new arrivals, the finish time on the virtual timeline is unchanged; the virtual timeline warps with respect to real time to accommodate any new transmission.

The virtual finish time for a newly queued packet is given by the sum of the virtual start time and the packet's size. The virtual start time is the maximum of the previous virtual finish

time of the same queue and the current instant.

With the virtual finish times of all candidate packets (i.e., the packets at the head of all non-empty flow queues) computed, fair queuing compares them and selects the minimum one. The packet with the minimum virtual finish time is transmitted.
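A minimal Python sketch of this virtual-time computation follows; the flow names and packet sizes are arbitrary, and for simplicity all packets are assumed to arrive at virtual time 0:

# On arrival, a packet gets virtual finish = virtual start + size, where virtual start is the
# maximum of the queue's previous finish time and the current virtual time; the scheduler then
# always transmits the head-of-queue packet with the smallest virtual finish time.
def enqueue(queue, last_finish, now_virtual, size):
    start = max(last_finish, now_virtual)
    finish = start + size
    queue.append((finish, size))
    return finish                      # becomes the queue's new "previous finish"

queues = {"flow1": [], "flow2": [], "flow3": []}
last_finish = {f: 0.0 for f in queues}
now = 0.0                              # all packets arrive at virtual time 0 in this example

for flow, sizes in [("flow1", [1500, 1500]), ("flow2", [200, 200, 200]), ("flow3", [800])]:
    for size in sizes:
        last_finish[flow] = enqueue(queues[flow], last_finish[flow], now, size)

order = []
while any(queues.values()):
    flow = min((f for f in queues if queues[f]), key=lambda f: queues[f][0][0])
    order.append((flow, queues[flow].pop(0)[1]))

print(order)   # flow2's small packets go first, then flow3, then flow1's large packets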

Borrowed Virtual Time Scheduling

With BVT scheduling, thread execution time is monitored in terms of virtual time, and the runnable thread with the earliest effective virtual time (EVT) is dispatched. However, a latency-sensitive thread is allowed to warp back in virtual time to make it appear earlier and thereby gain dispatch preference. It then effectively borrows virtual time from its future CPU allocation and thus does not disrupt long-term CPU sharing; hence the name, borrowed virtual time scheduling. Each BVT thread maintains several state variables: its effective virtual time (EVT), its actual virtual time (AVT), its virtual time warp, and a flag that is set if warping is enabled. When a thread unblocks or the currently executing thread blocks, the scheduler runs the runnable thread with the minimum EVT.
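A minimal Python sketch of BVT dispatch follows; the thread names, the warp value, and the virtual-time charge per quantum are illustrative assumptions, not values from the text:

# EVT = AVT - warp (when warping is enabled); the runnable thread with the minimum EVT runs.
class BVTThread:
    def __init__(self, name, warp=0, warp_enabled=False):
        self.name = name
        self.avt = 0.0            # actual virtual time, advances as the thread consumes CPU
        self.warp = warp
        self.warp_enabled = warp_enabled

    def evt(self):
        return self.avt - (self.warp if self.warp_enabled else 0)

def dispatch(runnable):
    return min(runnable, key=lambda t: t.evt())

batch = BVTThread("batch")
interactive = BVTThread("interactive", warp=50, warp_enabled=True)   # latency-sensitive thread

threads = [batch, interactive]
for _ in range(3):
    t = dispatch(threads)
    print("run", t.name, "EVT =", t.evt())
    t.avt += 30               # charge one scheduling quantum of virtual time to the running thread

The warped thread is dispatched first because its EVT appears earlier, but every quantum it runs increases its AVT, so the batch thread eventually catches up and long-term CPU sharing is preserved.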

Resource management and dynamic application scaling

The demand for computing resources, such as CPU cycles, primary and secondary storage, and network bandwidth, depends heavily on the volume of data processed by an application. The


demand for resources can be a function of the time of day, can monotonically increase or decrease in time, or can experience predictable or unpredictable peaks. For example, a new Web service will experience a low request rate when the service is first introduced, and the load will increase exponentially if the service is successful. A service for income tax processing will experience a peak around the tax filing deadline, whereas access to a service provided by the Federal Emergency Management Agency (FEMA) will increase dramatically after a natural disaster.

The elasticity of a public cloud, the fact that it can supply an application with precisely the amount of resources it needs, and the fact that users pay only for the resources they consume are serious incentives to migrate to a public cloud. The question we address is how scaling can actually be implemented in a cloud when a very large number of applications exhibit this often unpredictable behavior. To make matters worse, in addition to an unpredictable external load, cloud resource management has to deal with resource reallocation due to server failures.

We distinguish two scaling modes: vertical and horizontal. Vertical scaling keeps the number of VMs of an application constant but increases the amount of resources allocated to each one of them. This can be done either by migrating the VMs to more powerful servers or by keeping the VMs on the same servers but increasing their share of the CPU time. The first alternative involves additional overhead; the VM is stopped, a snapshot of it is taken, the file is transported to a more powerful server, and, finally, the VM is restarted at the new site.

Horizontal scaling is the most common mode of scaling on a cloud; it is done by increasing the number of VMs as the load increases and reducing the number of VMs when the load decreases. Often, this leads to an increase in communication bandwidth consumed by the application. Load balancing among the running VMs is critical to this mode of operation. For a very large application, multiple load balancers may need to cooperate with one another. In some instances the load balancing is done by a front-end server that distributes incoming requests of a transaction-oriented system to back-end servers.


Unit-6

Storage Systems

Evolution of storage technology

The technological capacity to store information has grown over time at an accelerated pace

1986: 2.6EB; equivalent to less than one 730MB CD-ROM of data per computer user.

1993: 15.8EB; equivalent to four CD-ROMs per user.

2000: 54.5EB; equivalent to 12 CD-ROMs per user.

2007: 295EB; equivalent to almost 61 CD-ROMs per user.

These rapid technological advancements have changed the balance between initial investment in storage devices and system management costs. Now the cost of storage management is the dominant element of the total cost of a storage system. This effect favors the centralized storage strategy supported by a cloud; indeed, a centralized approach can automate some of the storage management functions, such as replication and backup, and thus reduce substantially the storage management cost.

While the density of storage devices has increased and the cost has decreased dramatically, the access time has improved only slightly. The performance of I/O subsystems has not kept pace with the performance of computing engines, and that affects multimedia, scientific and engineering, and other modern applications that process increasingly large volumes of data.

The storage systems face substantial pressure because the volume of data generated has increased exponentially during the past few decades; whereas in the 1980s and 1990s data was primarily generated by humans, nowadays machines generate data at an unprecedented rate. Mobile devices, such as smartphones and tablets, record static images as well as movies and have limited local storage capacity, so they transfer the data to cloud storage systems. Sensors, surveillance cameras, and digital medical imaging devices generate data at a high rate and dump it onto storage systems accessible via the Internet. Online digital libraries, ebooks, and digital media, along with reference data, add to the demand for massive amounts of storage.

As the volume of data increases, new methods and algorithms for data mining that require powerful computing systems have been developed. Only a concentration of resources could provide the CPU cycles along with the vast storage capacity necessary to perform such intensive computations and access the very large volume of data.

Although we emphasize the advantages of a concentration of resources, we have to be acutely aware that a cloud is a large-scale distributed system with a very large number of components that must work in concert. The management of such a large collection of systems poses significant challenges and requires novel approaches to systems design. Case in point: although the early distributed file systems used custom-designed reliable components, nowadays large-scale systems are built with off-the-shelf components. The emphasis of the design philosophy has shifted from performance at any cost to reliability at the lowest possible cost. This shift is


evident in the evolution of ideas, from the early distributed file systems of the 1980s, such as the Network File System (NFS) and the Andrew File System (AFS), to today's Google File System (GFS) and the Megastore.

Google File System

The Google File System (GFS) was developed in the late 1990s. It uses thousands of storage systems built from inexpensive commodity components to provide petabytes of storage to a large user community with diverse needs. It is not surprising that a main concern of the GFS designers was to ensure the reliability of a system exposed to hardware failures, system software errors, application errors, and, last but not least, human errors.

The system was designed after a careful analysis of the file characteristics and of the access models. Some of the most important aspects of this analysis reflected in the GFS design are:

Scalability and reliability are critical features of the system; they must be considered from the beginning rather than at some stage of the design.

The vast majority of files range in size from a few GB to hundreds of TB.

The most common operation is to append to an existing file; random write operations to a file are extremely infrequent.

Sequential read operations are the norm.

The users process the data in bulk and are less concerned with the response time.

The consistency model should be relaxed to simplify the system implementation, but without placing an additional burden on the application developers.

GFS files are collections of fixed-size segments called chunks; at the time of file creation each chunk is assigned a unique chunk handle. A chunk consists of 64 KB blocks, and each block has a 32-bit checksum. Chunks are stored on Linux file systems and are replicated on multiple sites; a user may change the number of replicas from the standard value of three to any desired value. The chunk size is 64 MB; this choice is motivated by the desire to optimize performance for large files and to reduce the amount of metadata maintained by the system.
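The per-block integrity scheme can be sketched as follows, assuming the layout just described: a chunk is split into 64 KB blocks, each protected by a 32-bit checksum (CRC32 is used here only as an example; the text does not name the checksum function):

import zlib

BLOCK_SIZE = 64 * 1024                      # 64 KB blocks inside a chunk

def block_checksums(chunk_bytes):
    return [zlib.crc32(chunk_bytes[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_bytes), BLOCK_SIZE)]

chunk = b"x" * (3 * BLOCK_SIZE + 100)       # a partially filled last block is allowed
sums = block_checksums(chunk)
print(len(sums), [f"{s:08x}" for s in sums])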

A large chunk size increases the likelihood that multiple operations will be directed to the same chunk; thus it reduces the number of requests to locate the chunk and, at the same time, allows the application to maintain a persistent network connection with the server where the chunk is located. Space fragmentation occurs infrequently because the chunk for a small file and the last chunk of a large file are only partially filled.

The locations of the chunks are stored only in the control structure of the master’s memory and are updated at system startup or when a new chunk server joins the cluster. This strategy allows the master to have up-to-date information about the location of the chunks.

System reliability is a major concern, and the operation log maintains a historical record of metadata changes, enabling the master to recover in case of a failure. As a result, such changes are atomic and are not made visible to the clients until they have been recorded on multiple replicas on persistent storage. To recover from a failure, the master replays the operation log.


To minimize the recovery time, the master periodically checkpoints its state and at recovery time replays only the log records after the last checkpoint.

Each chunk server is a commodity Linux system; it receives instructions from the master and responds with status information. To access a file, an application sends to the master the filename and the chunk index, the offset in the file for the read or write operation; the master responds with the chunk handle and the location of the chunk. Then the application communicates directly with the chunk server to carry out the desired file operation.

The consistency model is very effective and scalable. Operations, such as file creation, are atomic and are handled by the master. To ensure scalability, the master has minimal involvement in file mutations and operations such as write or append that occur frequently. In such cases the master grants a lease for a particular chunk to one of the chunk servers, called the primary; then, the primary creates a serial order for the updates of that chunk.

The client contacts the master, which assigns a lease to one of the chunk servers for a particular chunk if no lease for that chunk exists; then the master replies with the ID of the primary as well as secondary chunk servers holding replicas of the chunk. The client caches this information.

The client sends the data to all chunk servers holding replicas of the chunk; each one of the chunk servers stores the data in an internal LRU buffer and then sends an acknowledgment to the client.

The client sends a write request to the primary once it has received the acknowledgments from all chunk servers holding replicas of the chunk. The primary identifies mutations by consecutive sequence numbers.

The primary sends the write requests to all secondaries.

Each secondary applies the mutations in the order of the sequence numbers and then sends an acknowledgment to the primary.

Finally, after receiving the acknowledgments from all secondaries, the primary informs the client.

The system supports an efficient checkpointing procedure based on copy-on-write to construct system snapshots. A lazy garbage collection strategy is used to reclaim the space after a file deletion. In the first step the filename is changed to a hidden name and this operation is time stamped. The master periodically scans the namespace and removes the metadata for the files with a hidden name older than a few days; this mechanism gives a window of opportunity to a user who deleted files by mistake to recover the files with little effort.

Periodically, chunk servers exchange with the master the list of chunks stored on each one of them; the master supplies them with the identity of orphaned chunks whose metadata has been deleted, and such chunks are then deleted. Even when control messages are lost, a chunk server will carry out the housecleaning at the next heartbeat exchange with the master. Each chunk server maintains in core the checksums for the locally stored chunks to guarantee data integrity.


Apache Hadoop

A wide range of data-intensive applications such as marketing analytics, image processing, machine learning, and Web crawling use Apache Hadoop, an open-source, Java-based software system. Hadoop supports distributed applications handling extremely large volumes of data. Many members of the community contributed to the development and optimization of Hadoop and several related Apache projects such as Hive and HBase.

Hadoop is used by many organizations from industry, government, and research; the long list of Hadoop users includes major IT companies such as Apple, IBM, HP, Microsoft, Yahoo!, and Amazon; media companies such as The New York Times and Fox; social networks, including Twitter, Facebook, and LinkedIn; and government agencies, such as the U.S. Federal Reserve. In 2011, the Facebook Hadoop cluster had a capacity of 30 PB.

A Hadoop system has two components, a MapReduce engine and a database. The database could be the Hadoop File System (HDFS), Amazon S3, or CloudStore, an implementation of the Google File System discussed earlier. HDFS is a distributed file system written in Java; it is portable, but it cannot be directly mounted on an existing operating system. HDFS is not fully POSIX compliant, but it is highly performant.

The Hadoop engine on the master of a multinode cluster consists of a job tracker and a task tracker, whereas the engine on a slave has only a task tracker. The job tracker receives a MapReduce job from a client and dispatches the work to the task trackers running on the nodes of a cluster. To increase efficiency, the job tracker attempts to dispatch the tasks to available slaves closest to the place where the task data is stored. The task tracker supervises the execution of the work allocated to the node. Several scheduling algorithms have been implemented in Hadoop engines, including Facebook's fair scheduler and Yahoo!'s capacity scheduler; cloud scheduling algorithms were discussed in the previous unit.

HDFS replicates data on multiple nodes. The default is three replicas; a large dataset is distributed over many nodes. The namenode running on the master manages the data distribution and data replication and communicates with data nodes running on all cluster nodes; it shares with the job tracker information about data placement to minimize communication between the nodes on which data is located and the ones where it is needed. Although HDFS can be used for applications other than those based on the MapReduce model, its performance for such applications is not on par with the ones for which it was originally designed.
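The MapReduce programming model that the Hadoop engine executes can be illustrated conceptually in Python (real Hadoop jobs are usually written in Java; this single-process word count only mirrors the map, shuffle, and reduce phases):

from collections import defaultdict

def map_phase(document):
    # map: emit (word, 1) for every word in an input split
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # reduce: sum the counts for one key
    return word, sum(counts)

splits = ["the quick brown fox", "the lazy dog", "the quick dog"]
shuffle = defaultdict(list)
for split in splits:                        # map tasks, one per input split
    for word, one in map_phase(split):
        shuffle[word].append(one)           # shuffle: group intermediate pairs by key

result = dict(reduce_phase(w, c) for w, c in shuffle.items())   # reduce tasks
print(result)                               # e.g., {'the': 3, 'quick': 2, ...}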

BigTable

BigTable is based on a simple and flexible data model. It allows an application developer to exercise control over the data format and layout and reveals data locality information to the application clients. Any read or write row operation is atomic, even when it affects more than one column. The column keys identify column families, which are units of access control. The data in a column family is of the same type. Client applications written in C++ can add or delete values, search for a subset of data, and look up data in a row.


A row key is an arbitrary string of up to 64KB, and a row range is partitioned into tablets serving as units for load balancing. The time stamps used to index various versions of the data in a cell are 64-bit integers; their interpretation can be defined by the application, whereas the default is the time of an event in microseconds. A column key consists of a string defining the family name, a set of printable characters, and an arbitrary string as qualifier.
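The sparse, multidimensional map keyed by (row key, column family:qualifier, timestamp) can be pictured with nested Python dictionaries (a conceptual stand-in for the distributed implementation; the row and column names are illustrative):

table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    versions = table[row][column]
    return versions[max(versions)]           # timestamps index multiple versions of a cell

put("com.example.www", "contents:html", 1001, "<html>v1</html>")
put("com.example.www", "contents:html", 1002, "<html>v2</html>")
put("com.example.www", "anchor:cnn.com", 1002, "Example")

print(get_latest("com.example.www", "contents:html"))   # "<html>v2</html>"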

The organization of a BigTable shows a sparse, distributed, multidimensional map for an email application. The system consists of three major components: a library linked to application clients to access the system, a master server, and a large number of tablet servers. The master server controls the entire system, assigns tablets to tablet servers and balances the load among them, manages garbage collection, and handles table and column family creation and deletion.

Internally, the space management is ensured by a three-level hierarchy: the root tablet, the location of which is stored in a Chubby file, points to entries in the second element, the metadata tablet, which, in turn, points to user tablets, collections of locations of users’ tablets. An application client searches through this hierarchy to identify the location of its tablets and then caches the addresses for further use.

BigTable is used by a variety of applications, including Google Earth, Google Analytics, Google Finance, and Web crawlers. For example, Google Earth uses two tables, one for preprocessing and one for serving client data. The preprocessing table stores raw images; the table is stored on disk because it contains some 70 TB of data. Each row of data consists of a single image; adjacent geographic segments are stored in rows in close proximity to one another. The column family is very sparse; it contains a column for every raw image. The preprocessing stage relies heavily on MapReduce to clean and consolidate the data for the serving phase. The serving table stored on GFS is "only" 500 GB, and it is distributed across several hundred tablet servers, which maintain in-memory column families. This organization enables the serving phase of Google Earth to provide a fast response time to tens of thousands of queries per second.

Google Analytics provides aggregate statistics such as the number of visitors to a Web page per day. To use this service, Web servers embed JavaScript code into their Web pages to record information every time a page is visited. The data is collected in a raw-click BigTable of some 200 TB, with a row for each end-user session. A summary table of some 20 TB contains predefined summaries for a Web site.

Megastore

Megastore is scalable storage for online services. The system, distributed over several data centers, has a very large capacity, 1 PB in 2011, and it is highly available. Megastore is widely used internally at Google; it handles some 23 billion transactions daily: 3 billion write and 20 billion read transactions.
