Unit-1 Systems modeling, Clustering and virtualization
Scalable Computing over the Internet:
Over the past 60 years, computing technology has undergone a series of platform and
environment changes. In this section, we assess evolutionary changes in machine architecture,
operating system platform, network connectivity, and application workload. Instead of using a
centralized computer to solve computational problems, a parallel and distributed computing
system uses multiple computers to solve large-scale problems over the Internet. Thus,
distributed computing becomes data-intensive and network-centric. This section identifies the
applications of modern computer systems that practice parallel and distributed computing.
These large-scale Internet applications have significantly enhanced the quality of life and
information services in society today.
The Platform Evolution
Computer technology has gone through five generations of development, with each generation
lasting from 10 to 20 years. Successive generations overlapped by about 10 years. For
instance, from 1950 to 1970, a handful of mainframes, including the IBM 360 and CDC 6400,
were built to satisfy the demands of large businesses and government organizations. From 1960
to 1980, lower-cost mini-computers such as the DEC PDP 11 and VAX Series became popular
among small businesses and on college campuses.
From 1970 to 1990, we saw widespread use of personal computers built with VLSI
microprocessors. From 1980 to 2000, massive numbers of portable computers and pervasive devices
appeared in both wired and wireless applications. Since 1990, the use of both HPC and HTC
systems hidden in clusters, grids, or Internet clouds has proliferated. These systems are
employed by both consumers and high-end web-scale computing and information services.
The general computing trend is to leverage shared web resources and massive amounts of data
over the Internet. On the HPC side, supercomputers (massively parallel processors or MPPs) are
gradually being replaced by clusters of cooperative computers out of a desire to share computing
resources. The cluster is often a collection of homogeneous compute nodes that are physically
connected in close range to one another.
On the HTC side, peer-to-peer (P2P) networks are formed for distributed file sharing and
content delivery applications. A P2P system is built over many client machines. Peer machines
are globally distributed in nature. P2P, cloud computing, and web service platforms are more
focused on HTC applications than on HPC applications. Clustering and P2P technologies lead to
the development of computational grids or data grids.
Three New Computing Paradigms
Centralized computing This is a computing paradigm by which all computer resources
are centralized in one physical system. All resources (processors, memory, and storage)
are fully shared and tightly coupled within one integrated OS. Many data centers and
supercomputers are centralized systems, but they are used in parallel, distributed, and
cloud computing applications.
Parallel computing In parallel computing, all processors are either tightly coupled with
centralized shared memory or loosely coupled with distributed memory. Some authors
refer to this discipline as parallel processing. Interprocessor communication is
accomplished through shared memory or via message passing. A computer system
capable of parallel computing is commonly known as a parallel computer. Programs
running in a parallel computer are called parallel programs. The process of
writing parallel programs is often referred to as parallel programming.
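As a small illustration (not from the text), the idea of a parallel program can be sketched in Python: the work function `square` below is a hypothetical example, and its invocations run on several processes at once instead of one by one.

```python
from multiprocessing import Pool

def square(x):
    # The unit of work each processor performs independently.
    return x * x

if __name__ == "__main__":
    # Distribute the inputs across 4 worker processes, then combine results.
    with Pool(4) as pool:
        result = sum(pool.map(square, range(1000)))
    print(result)  # 332833500
```

The same program written serially would call `square` in a loop on one processor; the parallel version divides the iteration space among workers, which is the essence of parallel programming.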
Distributed computing This is a field of computer science/engineering that studies
distributed systems. A distributed system consists of multiple autonomous computers,
each having its own private memory, communicating through a computer network.
Information exchange in a distributed system is accomplished through message passing.
A computer program that runs in a distributed system is known as a distributed
program. The process of writing distributed programs is referred to as distributed
programming.
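Message passing between autonomous machines, as described above, can be sketched with plain sockets (an illustrative toy, not a full distributed system): two endpoints exchange information only through messages, never through shared memory.

```python
import socket
import threading

def server(listener):
    # One autonomous node: waits for a message and replies with another message.
    conn, _ = listener.accept()
    data = conn.recv(1024)
    conn.sendall(b"ack:" + data)
    conn.close()

# The two endpoints have private state and communicate only via the network.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))   # the OS picks a free port
listener.listen(1)
threading.Thread(target=server, args=(listener,)).start()

client = socket.socket()
client.connect(listener.getsockname())
client.sendall(b"hello")          # information exchange by message passing
reply = client.recv(1024)
client.close()
listener.close()
print(reply)  # b'ack:hello'
```

In a real distributed program the two endpoints would run on different machines; here both run in one process only to keep the sketch self-contained.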
Cloud computing An Internet cloud of resources can be either a centralized or a
distributed computing system. The cloud applies parallel or distributed computing, or
both. Clouds can be built with physical or virtualized resources over large data centers
that are centralized or distributed. Some authors consider cloud computing to be a form
of utility computing or service computing.
Technologies for Network-Based Systems
This section focuses on viable approaches to building distributed operating systems for handling
massive parallelism in a distributed environment.
Multicore CPUs and Multithreading Technologies
The growth of component and network technologies over the past 30 years is crucial to the
development of HPC and HTC systems.
Advances in CPU Processors
Today, advanced CPUs or microprocessor chips assume a multicore architecture with dual,
quad, six, or more processing cores. These processors exploit parallelism at ILP and TLP levels.
However, the clock rate reached its limit on CMOS-based chips due to power limitations. This
limitation is attributed primarily to excessive heat generation with high frequency or high
voltages. The ILP is highly exploited in modern CPU processors which include multiple-issue
superscalar architecture, dynamic branch prediction, and speculative execution, among others.
The figure shows the architecture of a typical multicore processor. Each core is essentially a
processor with its own private cache (L1 cache).
Multiple cores are housed in the same chip with an L2 cache that is shared by all cores. In the
future, multiple CMPs could be built on the same CPU chip with even the L3 cache on the chip.
Many high-end processors are equipped with multicore and multithreaded CPUs.
Multicore CPU and Many-Core GPU Architectures
Multicore CPUs may increase from the tens
of cores to hundreds or more in the future. But the CPU has reached its limit in terms of
exploiting massive DLP due to the aforementioned memory wall problem. This has triggered
the development of many-core GPUs with hundreds or more thin cores. The GPU also has been
applied in large clusters to build supercomputers in MPPs. In the future, multiprocessors can
house both fat CPU cores and thin GPU cores on the same chip.
Multithreading Technology
Typical instruction scheduling patterns are shown here. Only
instructions from the same thread are executed in a superscalar processor. Fine-grain
multithreading switches the execution of instructions from different threads per cycle. Coarse-
grain multithreading executes many instructions from the same thread for quite a few cycles
before switching to another thread. The multicore CMP executes instructions from different
threads completely. The SMT allows simultaneous scheduling of instructions from different
threads in the same cycle.
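Two of the scheduling styles above can be mimicked with a toy simulator (illustrative only; real hardware schedules at the instruction level, in silicon). Each thread is modeled as a list of instruction labels.

```python
def fine_grain(threads):
    """Issue one instruction from a different thread each cycle (round robin)."""
    schedule = []
    i = 0
    while any(threads):
        t = threads[i % len(threads)]
        if t:
            schedule.append(t.pop(0))
        i += 1
    return schedule

def coarse_grain(threads, burst=2):
    """Run several instructions from one thread before switching to another."""
    schedule = []
    while any(threads):
        for t in threads:
            for _ in range(burst):
                if t:
                    schedule.append(t.pop(0))
    return schedule

fine = fine_grain([["A1", "A2"], ["B1", "B2"]])      # ['A1', 'B1', 'A2', 'B2']
coarse = coarse_grain([["A1", "A2"], ["B1", "B2"]])  # ['A1', 'A2', 'B1', 'B2']
```

The fine-grain schedule interleaves threads every cycle, while the coarse-grain schedule stays with one thread for a burst of cycles, matching the distinction drawn in the text.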
GPU (Graphics Processing Unit)
A GPU is a graphics coprocessor or accelerator mounted on a
computer’s graphics card or video card. A GPU offloads the CPU from tedious graphics tasks in
video editing applications. A modern GPU chip can be built with hundreds of processing cores.
GPUs have a throughput architecture that exploits massive parallelism by executing many
concurrent threads slowly, instead of executing a single long thread in a conventional
microprocessor very quickly.
MEMORY, STORAGE, AND WIDE-AREA NETWORKING
Memory Technology
Memory chips have experienced a 4x increase in capacity every three
years. The capacity increase of disk arrays will be even greater in the years to come. Faster
processor speed and larger memory capacity result in a wider gap between processors and
memory.
Disks and Storage Technology
The rapid growth of flash memory and solid-state drives (SSDs)
also impacts the future of HPC and HTC systems. Eventually, power consumption, cooling, and
packaging will limit large system development. Power increases linearly with respect to clock
frequency and quadratically with respect to the voltage applied to chips.
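The power relationship stated above is the classic CMOS dynamic-power model, P = C · V² · f. A quick numeric check (the capacitance value is an arbitrary placeholder):

```python
def dynamic_power(capacitance, voltage, frequency):
    # CMOS dynamic-power model: P = C * V^2 * f
    return capacitance * voltage ** 2 * frequency

base = dynamic_power(1e-9, 1.0, 2e9)
# Doubling the clock frequency doubles power (linear in f) ...
assert dynamic_power(1e-9, 1.0, 4e9) == 2 * base
# ... while doubling the voltage quadruples it (quadratic in V).
assert dynamic_power(1e-9, 2.0, 2e9) == 4 * base
```

This is why raising voltage to sustain higher clock rates generates excessive heat, the limitation cited earlier for CPU clock rates.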
Virtual Machines and Virtualization Middleware
Virtual machines (VMs) offer novel solutions to underutilized resources, application inflexibility,
software manageability, and security concerns in existing physical machines. To build large
clusters, grids, and clouds, we need to access large amounts of computing, storage, and
networking resources in a virtualized manner. We need to aggregate those resources, and
hopefully, offer a single system image. The VM is built with virtual resources managed by a
guest OS to run a specific application. Between the VMs and the host platform, one needs to
deploy a middleware layer called a virtual machine monitor (VMM). The VMM, also called a
hypervisor, runs in privileged mode.
VM Primitive Operations
The VMM provides the VM abstraction to the guest OS. Low-level
VMM operations are indicated below:
First, the VMs can be multiplexed between hardware machines.
Second, a VM can be suspended and stored in stable storage.
Third, a suspended VM can be resumed or provisioned to a new hardware platform.
Finally, a VM can be migrated from one hardware platform to another.
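The four primitives above can be modeled with a toy class (a hypothetical sketch, not a real hypervisor API): migration is simply suspend, move, and resume on a new platform.

```python
class VM:
    """Toy model of the low-level VMM primitives listed above (hypothetical API)."""

    def __init__(self, host):
        self.host = host
        self.state = "running"

    def suspend(self):
        # Second primitive: the VM is suspended and stored in stable storage.
        self.state = "suspended"

    def resume(self):
        # Third primitive: a suspended VM is resumed or provisioned.
        self.state = "running"

    def migrate(self, new_host):
        # Fourth primitive: move the VM from one hardware platform to another.
        self.suspend()
        self.host = new_host
        self.resume()

vm = VM("host-A")
vm.migrate("host-B")   # vm now runs on host-B
```

Real VMMs also copy memory and device state during these steps; the sketch captures only the state transitions.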
The VM approach will significantly enhance the utilization of server resources. Multiple server
functions can be consolidated on the same hardware platform to achieve higher system
efficiency.
Virtual Infrastructures
Virtual infrastructure connects resources such as compute, storage, and networking to
distributed applications. It is a dynamic mapping of system resources to specific applications.
The result is decreased costs and increased efficiency and responsiveness.
System models for Distributed and Cloud Computing
Distributed and cloud computing systems are built over a large number of autonomous
computer nodes. These node machines are interconnected by SANs, LANs, or WANs in a
hierarchical manner. With today’s networking technology, a few LAN switches can easily
connect hundreds of machines as a working cluster. A WAN can connect many local clusters to
form a very large cluster of clusters. In this sense, one can build a massive system with millions
of computers connected to edge networks.
Clusters of Cooperative Computers
A computing cluster consists of interconnected stand-alone computers which work
cooperatively as a single integrated computing resource. In the past, clustered computer
systems have demonstrated impressive results in handling heavy workloads with large data
sets.
Cluster Architecture
The cluster nodes are interconnected by a network. This network can be as simple as a SAN (e.g., Myrinet) or a LAN (e.g., Ethernet). To build a larger
cluster with more nodes, the interconnection network can be built with multiple levels of
Gigabit Ethernet, Myrinet, or InfiniBand switches. Through hierarchical construction using a
SAN, LAN, or WAN, one can build scalable clusters with an increasing number of nodes. The
cluster is connected to the Internet via a virtual private network (VPN) gateway. The gateway IP
address locates the cluster. The system image of a computer is decided by the way the OS
manages the shared cluster resources. Most clusters have loosely coupled node computers. All
resources of a server node are managed by their own OS. Thus, most clusters have multiple
system images as a result of having many autonomous nodes under different OS control.
Single-System Image
Cluster designers desire a cluster operating system or some middle-ware to support SSI at
various levels, including the sharing of CPUs, memory, and I/O across all cluster nodes. An SSI is
an illusion created by software or hardware that presents a collection of resources as one
integrated, powerful resource. SSI makes the cluster appear like a single machine to the user. A
cluster with multiple system images is nothing but a collection of independent computers.
Hardware, Software, and Middleware Support
Clusters exploring massive parallelism are commonly known as MPPs. Almost all HPC clusters in
the Top 500 list are also MPPs. The building blocks are computer nodes (PCs, workstations,
servers, or SMP), special communication software such as PVM or MPI, and a network interface
card in each computer node. Most clusters run under the Linux OS. The computer nodes are
interconnected by a high-bandwidth network (such as Gigabit Ethernet, Myrinet, InfiniBand,
etc.).
Special cluster middleware support is needed to create SSI or high availability (HA). Both
sequential and parallel applications can run on the cluster, and special parallel environments
are needed to facilitate use of the cluster resources. For example, distributed memory has
multiple images. Users may want all distributed memory to be shared by all servers by
forming distributed shared memory (DSM). Many SSI features are expensive or difficult to
achieve at various cluster operational levels. Instead of achieving SSI, many clusters are loosely
coupled machines. Using virtualization, one can build many virtual clusters dynamically, upon
user demand.
Major Cluster Design Issues
Unfortunately, a cluster-wide OS for complete resource sharing is not available yet. Middleware
or OS extensions were developed in user space to achieve SSI at selected functional levels.
Without this middleware, cluster nodes cannot work together effectively to achieve
cooperative computing. The software environments and applications must rely on the
middleware to achieve high performance. The cluster benefits come from scalable
performance, efficient message passing, high system availability, seamless fault tolerance, and
cluster-wide job management.
Software environments for distributed systems and clouds
This section introduces popular software environments for using distributed and cloud
computing systems.
Service-Oriented Architecture (SOA)
In grids/web services, Java, and CORBA, an entity is, respectively, a service, a Java object, and a
CORBA distributed object in a variety of languages. These architectures build on the traditional
seven Open Systems Interconnection (OSI) layers that provide the base networking
abstractions. On top of this we have a base software environment, which would be .NET or
Apache Axis for web services, the Java Virtual Machine for Java, and a broker network for
CORBA. On top of this base environment one would build a higher level environment reflecting
the special features of the distributed computing environment. This starts with entity interfaces
and inter-entity communication, which rebuild the top four OSI layers but at the entity and not
the bit level.
Layered Architecture for Web Services and Grids
The entity interfaces correspond to the Web Services Description Language (WSDL), Java
method, and CORBA interface definition language (IDL) specifications in these example
distributed systems. These interfaces are linked with customized, high-level communication
systems: SOAP, RMI, and IIOP in the three examples. These communication systems support
features including particular message patterns (such as Remote Procedure Call or RPC), fault
recovery, and specialized routing. Often, these communication systems are built on message-
oriented middleware (enterprise bus) infrastructure such as WebSphere MQ or Java Message
Service (JMS) which provide rich functionality and support virtualization of routing, senders,
and recipients.
In the case of fault tolerance, the features in the Web Services Reliable Messaging
(WSRM) framework mimic the OSI layer capability (as in TCP fault tolerance) modified to match
the different abstractions (such as messages versus packets, virtualized addressing) at the
entity levels. Security is a critical capability that either uses or reimplements the capabilities
seen in concepts such as Internet Protocol Security (IPsec) and secure sockets in the OSI layers.
Web Services and Tools
Both web services and REST systems have very distinct approaches to building reliable
interoperable systems. In web services, one aims to fully specify all aspects of the service and
its environment. This specification is carried with communicated messages using Simple Object
Access Protocol (SOAP). The hosting environment then becomes a universal distributed
operating system with fully distributed capability carried by SOAP messages. This approach has
mixed success as it has been hard to agree on key parts of the protocol and even harder to
efficiently implement the protocol by software such as Apache Axis.
In the REST approach, one adopts simplicity as the universal principle and delegates most of the
difficult problems to application (implementation-specific) software. In a web services
language, REST has minimal information in the header, and the message body (that is opaque
to generic message processing) carries all the needed information. REST architectures are
clearly more appropriate for rapid technology environments. However, the ideas in web
services are important and probably will be required in mature systems at a different level in
the stack (as part of the application). Note that REST can use XML schemas but not those that
are part of SOAP; “XML over HTTP” is a popular design choice in this regard. Above the
communication and management layers, we have the ability to compose new entities or
distributed programs by integrating several entities together.
In CORBA and Java, the distributed entities are linked with RPCs, and the simplest way to build
composite applications is to view the entities as objects and use the traditional ways of linking
them together. For Java, this could be as simple as writing a Java program with method calls
replaced by Remote Method Invocation (RMI), while CORBA supports a similar model with a
syntax reflecting the C++ style of its entity (object) interfaces. Allowing the term “grid” to refer
to a single service or to represent a collection of services, here sensors represent entities that
output data (as messages), and grids and clouds represent collections of services that have
multiple message-based inputs and outputs.
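The RPC-style composition described above (RMI in Java, IIOP in CORBA) has a direct analog in Python's standard library: `xmlrpc` lets a method call on a local proxy travel as a message to a remote object. This is an illustrative stand-in, not RMI or CORBA themselves.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Expose a remote entity with one method, "add", on a free local port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.handle_request, daemon=True).start()

# The client composes with the entity as if calling a local object:
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)   # looks like a method call, runs remotely
server.server_close()
print(result)  # 5
```

The proxy object is the "entity interface" of the text: the call is marshaled into a message, sent over the communication system, and the result returned the same way.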
The Evolution of SOA
Service-oriented architecture (SOA) has evolved over the years. SOA applies to building grids,
clouds, grids of clouds, clouds of grids, clouds of clouds (also known as interclouds), and
systems of systems in general. A large number of sensors provide data-collection services,
denoted in the figure as SS (sensor service). A sensor can be a ZigBee device, a Bluetooth
device, a WiFi access point, a personal computer, a GPS device, or a wireless phone, among other
things. Raw data is collected by sensor services. All the SS devices interact with large or small
computers, many forms of grids, databases, the compute cloud, the storage cloud, the filter
cloud, the discovery cloud, and so on. Filter services ( fs in the figure) are used to eliminate
unwanted raw data, in order to respond to specific requests from the web, the grid, or web
services.
Most distributed systems require a web interface or portal. For raw data collected by a large
number of sensors to be transformed into useful information or knowledge, the data stream
may go through a sequence of compute, storage, filter, and discovery clouds. Finally, the inter-
service messages converge at the portal, which is accessed by all users. Two example portals,
OGFCE and HUBzero, are described in Section 5.3 using both web service (portlet) and Web 2.0
(gadget) technologies. Many distributed programming models are also built on top of these
basic constructs.
Grids versus Clouds
The boundary between grids and clouds is getting blurred in recent years. For web services,
workflow technologies are used to coordinate or orchestrate services with certain
specifications used to define critical business process models such as two-phase transactions. In
general, a grid system applies static resources, while a cloud emphasizes elastic resources. For
some researchers, the differences between grids and clouds are limited only in dynamic
resource allocation based on virtualization and autonomic computing. One can build a grid out
of multiple clouds. This type of grid can do a better job than a pure cloud, because it can
explicitly support negotiated resource allocation. Thus one may end up building a system
of systems: a cloud of clouds, a grid of clouds, a cloud of grids, or inter-clouds as a
basic SOA architecture.
Performance, Security, and Energy Efficiency
Performance Metrics and Scalability Analysis
Performance metrics are needed to measure various distributed systems. In this section, we
will discuss various dimensions of scalability and performance laws. Then we will examine
system scalability against OS images and the limiting factors encountered.
Performance Metrics
We discussed CPU speed in MIPS and network bandwidth in Mbps in Section 1.3.1 to estimate
processor and network performance. In a distributed system, performance is attributed to a
large number of factors. System throughput is often measured in MIPS, Tflops (tera floating-
point operations per second), or TPS (transactions per second). Other measures include job
response time and network latency. An interconnection network that has low latency and high
bandwidth is preferred. System overhead is often attributed to OS boot time, compile time, I/O
data rate, and the runtime support system used. Other performance-related metrics include
the QoS for Internet and web services; system availability and dependability; and security
resilience for system defense against network attacks.
Dimensions of Scalability
Users want to have a distributed system that can achieve scalable performance. Any resource
upgrade in a system should be backward compatible with existing hardware and software
resources. Overdesign may not be cost-effective. System scaling can increase or decrease
resources depending on many practical factors. The following dimensions of scalability are
characterized in parallel and distributed systems.
Size scalability This refers to achieving higher performance or more functionality by increasing
the machine size. The word “size” refers to adding processors, cache, memory, storage, or I/O
channels. The most obvious way to determine size scalability is to simply count the number of
processors installed. Not all parallel computer or distributed architectures are equally size-
scalable. For example, the IBM S2 was scaled up to 512 processors in 1997. But in 2008, the
IBM BlueGene/L system scaled up to 65,000 processors.
Software scalability This refers to upgrades in the OS or compilers, adding mathematical and
engineering libraries, porting new application software, and installing more user-friendly
programming environments. Some software upgrades may not work with large system
configurations. Testing and fine-tuning of new software on larger systems is a nontrivial job.
Application scalability This refers to matching problem size scalability with machine
size scalability. Problem size affects the size of the data set or the workload increase. Instead of
increasing machine size, users can enlarge the problem size to enhance system efficiency or
cost-effectiveness.
Technology scalability This refers to a system that can adapt to changes in building
technologies, such as the component and networking technologies discussed in Section 3.1. When
scaling a system design with new technology one must consider three aspects: time, space,
and heterogeneity. (1) Time refers to generation scalability. When changing to new-generation
processors, one must consider the impact to the motherboard, power supply, packaging and
cooling, and so forth. Based on past experience, most systems upgrade their commodity
processors every three to five years. (2) Space is related to packaging and energy concerns.
Technology scalability demands harmony and portability among suppliers. (3) Heterogeneity
refers to the use of hardware components or software packages from different vendors.
Heterogeneity may limit the scalability.
Security Responsibilities
Three security requirements are often considered: confidentiality, integrity, and availability for
most Internet service providers and cloud users. In the order of SaaS, PaaS, and IaaS, the
providers gradually release the responsibility of security control to the cloud users. In
summary, the SaaS model relies on the cloud provider to perform all security functions. At the
other extreme, the IaaS model wants the users to assume almost all security functions, but to
leave availability in the hands of the providers. The PaaS model relies on the provider to
maintain data integrity and availability, but burdens the user with confidentiality and privacy
control.
Energy Efficiency
Primary performance goals in conventional parallel and distributed computing systems are high
performance and high throughput, considering some form of performance reliability (e.g., fault
tolerance and security). However, these systems recently encountered new challenging issues
including energy efficiency, and workload and resource outsourcing. These emerging issues are
crucial not only on their own, but also for the sustainability of large-scale computing systems in
general. This section reviews energy consumption issues in servers and HPC systems, an area
known as distributed power management (DPM).
Protection of data centers demands integrated solutions. Energy consumption in parallel and
distributed computing systems raises various monetary, environmental, and system
performance issues. For example, Earth Simulator and Petaflop are two systems with 12 and
100 megawatts of peak power, respectively. With an approximate price of $100 per megawatt-hour,
their energy costs during peak operation times are $1,200 and $10,000 per hour; this is beyond
the acceptable budget of many (potential) system operators. In addition to power cost, cooling
is another issue that must be addressed due to negative effects of high temperature on
electronic components. The rising temperature of a circuit not only derails the circuit from its
normal range, but also decreases the lifetime of its components.
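The cost figures above follow from a one-line calculation, hourly cost = peak power (MW) × price ($/MWh):

```python
# Reproducing the section's estimate for the two example systems.
price_per_mwh = 100  # approximate price from the text, in $/MWh
peak_power_mw = {"Earth Simulator": 12, "Petaflop": 100}

hourly_cost = {name: mw * price_per_mwh for name, mw in peak_power_mw.items()}
# hourly_cost == {'Earth Simulator': 1200, 'Petaflop': 10000}
```

At $1,200 and $10,000 per hour of peak operation, annual energy bills alone reach into the millions, which is why DPM and cooling are treated as first-class design issues.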
UNIT II
Virtual Machines and Virtualization of Clusters and Data Centers
Implementation Levels of Virtualization
Virtualization is a computer architecture technology by which multiple virtual machines (VMs)
are multiplexed in the same hardware machine. The idea of VMs can be dated back to the
1960s. The purpose of a VM is to enhance resource sharing by many users and improve
computer performance in terms of resource utilization and application flexibility. Hardware
resources (CPU, memory, I/O devices, etc.) or software resources (operating system and
software libraries) can be virtualized in various functional layers. This virtualization technology
has been revitalized as the demand for distributed and cloud computing increased sharply in
recent years.
The idea is to separate the hardware from the software to yield better system efficiency. For
example, computer users gained access to much enlarged memory space when the concept
of virtual memory was introduced. Similarly, virtualization techniques can be applied to
enhance the use of compute engines, networks, and storage. In this chapter we will discuss
VMs and their applications for building distributed systems. According to a 2009 Gartner
Report, virtualization was the top strategic technology poised to change the computer industry.
With sufficient storage, any computer platform can be installed in another host computer, even
if the two use processors with different instruction sets and run distinct operating systems on
the same hardware.
Levels of Virtualization Implementation
A traditional computer runs with a host operating system specially tailored for its hardware
architecture. After virtualization, different user applications managed by their own operating
systems (guest OS) can run on the same hardware, independent of the host OS. This is often
done by adding additional software, called a virtualization layer. This virtualization layer is
known as hypervisor or virtual machine monitor (VMM). The VMs are shown in the upper
boxes, where applications run with their own guest OS over the virtualized CPU, memory, and
I/O resources.
The main function of the software layer for virtualization is to virtualize the physical hardware
of a host machine into virtual resources to be used by the VMs, exclusively. This can be
implemented at various operational levels, as we will discuss shortly. The virtualization
software creates the abstraction of VMs by interposing a virtualization layer at various levels of
a computer system. Common virtualization layers include the instruction set architecture
(ISA) level, hardware level, operating system level, library support level, and application level.
Instruction Set Architecture Level
At the ISA level, virtualization is performed by emulating a given ISA by the ISA of the host
machine. For example, MIPS binary code can run on an x86-based host machine with the help
of ISA emulation. With this approach, it is possible to run a large amount of legacy binary code
written for various processors on any given new hardware host machine. Instruction set
emulation leads to virtual ISAs created on any hardware machine.
The basic emulation method is through code interpretation. An interpreter program interprets
the source instructions to target instructions one by one. One source instruction may require
tens or hundreds of native target instructions to perform its function. Obviously, this process is
relatively slow. For better performance, dynamic binary translation is desired. This approach
translates basic blocks of dynamic source instructions to target instructions. The basic blocks
can also be extended to program traces or super blocks to increase translation efficiency.
Instruction set emulation requires binary translation and optimization. A virtual instruction set
architecture (V-ISA) thus requires adding a processor-specific software translation layer to the
compiler.
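The basic interpretation method can be sketched as a loop that decodes one source instruction at a time and performs its effect with several native operations. The three-operand mini-ISA below (`li`, `add`, `mul`) is invented for illustration; real binary translators work on machine code, not tuples.

```python
def interpret(program, registers):
    """Interpret source instructions one by one (the slow baseline method)."""
    for op, *args in program:
        if op == "li":        # load immediate: li rd, value
            registers[args[0]] = args[1]
        elif op == "add":     # add rd, rs1, rs2
            registers[args[0]] = registers[args[1]] + registers[args[2]]
        elif op == "mul":     # mul rd, rs1, rs2
            registers[args[0]] = registers[args[1]] * registers[args[2]]
    return registers

regs = interpret(
    [("li", "r1", 6), ("li", "r2", 7), ("mul", "r0", "r1", "r2")],
    {"r0": 0, "r1": 0, "r2": 0},
)
# regs["r0"] == 42
```

Each dispatch through the `if` chain stands in for the tens or hundreds of native instructions per source instruction mentioned above; dynamic binary translation avoids this per-instruction overhead by translating whole basic blocks once and reusing the result.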
Hardware Abstraction Level
Hardware-level virtualization is performed right on top of the bare hardware. On the one hand,
this approach generates a virtual hardware environment for a VM. On the other hand, the
process manages the underlying hardware through virtualization. The idea is to virtualize a
computer’s resources, such as its processors, memory, and I/O devices. The intention is to
upgrade the hardware utilization rate by multiple users concurrently. The idea was
implemented in the IBM VM/370 in the 1960s. More recently, the Xen hypervisor has been
applied to virtualize x86-based machines to run Linux or other guest OS applications.
Operating System Level
This refers to an abstraction layer between traditional OS and user applications. OS-level
virtualization creates isolated containers on a single physical server and the OS instances to
utilize the hardware and software in data centers. The containers behave like real servers. OS-
level virtualization is commonly used in creating virtual hosting environments to allocate
hardware resources among a large number of mutually distrusting users. It is also used, to a
lesser extent, in consolidating server hardware by moving services on separate hosts into
containers or VMs on one server.
Library Support Level
Most applications use APIs exported by user-level libraries rather than using lengthy system
calls by the OS. Since most systems provide well-documented APIs, such an interface becomes
another candidate for virtualization. Virtualization with library interfaces is possible by
controlling the communication link between applications and the rest of a system through API
hooks. The software tool WINE has implemented this approach to support Windows
applications on top of UNIX hosts. Another example is vCUDA, which allows applications
executing within VMs to leverage GPU hardware acceleration.
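As a concrete illustration of API hooking, the toy Python sketch below (all class and function names are invented) shows a wrapper layer intercepting an application's calls to a library-style GPU API, in the spirit of what WINE and vCUDA do at the Windows API and CUDA API boundaries:

```python
# Hypothetical sketch of library-level virtualization via API hooks:
# calls from an application to a "library" function are intercepted by a
# wrapper that can inspect, redirect, or emulate them.

class NativeGPULibrary:
    """Stands in for a real library API (e.g., a CUDA-like runtime)."""
    def launch_kernel(self, name):
        return f"ran {name} on local GPU"

class APIHook:
    """Virtualization layer: intercepts API calls and forwards them to a
    chosen backend (e.g., a GPU on the host rather than inside the VM)."""
    def __init__(self, backend):
        self._backend = backend
        self.intercepted = []          # record of hooked calls

    def launch_kernel(self, name):
        self.intercepted.append(name)  # hook point: inspect/redirect the call
        return self._backend.launch_kernel(name)

# The application is written against the library interface and is unaware
# that a hook has been slid underneath it.
def application(gpu_api):
    return gpu_api.launch_kernel("matmul")

hooked = APIHook(NativeGPULibrary())
result = application(hooked)
print(result)                # ran matmul on local GPU
print(hooked.intercepted)    # ['matmul']
```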
User-Application Level
Virtualization at the application level virtualizes an application as a VM. On a traditional OS, an
application often runs as a process. Therefore, application-level virtualization is also known
as process-level virtualization. The most popular approach is to deploy high-level language
(HLL) VMs. In this scenario, the virtualization layer sits as an application program on top of the
operating system, and the layer exports an abstraction of a VM that can run programs written
and compiled to a particular abstract machine definition. Any program written in the HLL and
compiled for this VM will be able to run on it. The Microsoft .NET CLR and Java Virtual Machine
(JVM) are two good examples of this class of VM.
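A minimal sketch of the HLL VM idea, with an invented bytecode format: programs are compiled to instructions for an abstract stack machine, and the same "bytecode" runs wherever the interpreter (the VM) is available, much as JVM or CLR bytecode does:

```python
# Toy stack-machine interpreter illustrating an HLL VM. The opcode names
# are invented; real VMs like the JVM define far richer instruction sets.

def run(bytecode):
    stack = []
    for op, *args in bytecode:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# "Compiled" form of the expression (2 + 3) * 4
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
print(run(program))  # 20
```

The point of the abstraction is portability: any host with this interpreter can run `program` unchanged, regardless of the underlying hardware.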
Other forms of application-level virtualization are known as application isolation, application
sandboxing, or application streaming. The process involves wrapping the application in a layer
that is isolated from the host OS and other applications. The result is an application that is
much easier to distribute and remove from user workstations. An example is the LANDesk
application virtualization platform which deploys software applications as self-contained,
executable files in an isolated environment without requiring installation, system modifications,
or elevated security privileges.
Virtualization Structures/Tools and Mechanisms
Before virtualization, the operating system manages the hardware. After virtualization, a
virtualization layer is inserted between the hardware and the OS. In such a case, the
virtualization layer is responsible for converting portions of the real hardware into virtual
hardware. Depending on the position of the virtualization layer, there are several classes of VM
architectures, namely
1. Hypervisor architecture.
2. Paravirtualization.
3. Host-based virtualization.
Hypervisor and Xen Architecture
Depending on the functionality, a hypervisor can assume a micro-kernel architecture or a
monolithic hypervisor architecture. A micro-kernel hypervisor includes only the basic and
unchanging functions (such as physical memory management and processor scheduling). The
device drivers and other changeable components are outside the hypervisor. A monolithic
hypervisor implements all the aforementioned functions, including those of the device drivers.
Therefore, the size of the hypervisor code of a micro-kernel hypervisor is smaller than that of a
monolithic hypervisor.
Xen Architecture
Xen is an open source hypervisor program developed at Cambridge University. Xen is a
microkernel hypervisor, which separates the policy from the mechanism. It implements all the
mechanisms, leaving the policy to be handled by Domain 0, as shown in Figure. Xen does not
include any device drivers natively. It just provides a mechanism by which a guest OS can have
direct access to the physical devices.
Like other virtualization systems, many guest OSes can run on top of the hypervisor. The
privileged guest OS, which has control ability, is called Domain 0, and the others are called
Domain U. Domain 0 is loaded first when Xen boots, before any file system drivers are
available. It is designed to access hardware directly and manage devices.
Binary Translation with Full Virtualization
Depending on implementation technologies, hardware virtualization can be classified into two
categories: full virtualization and host-based virtualization.
Full Virtualization
With full virtualization, noncritical instructions run on the hardware directly while critical
instructions are discovered and replaced with traps into the VMM to be emulated by software.
Both the hypervisor and VMM approaches are considered full virtualization. Noncritical
instructions do not control hardware or threaten the security of the system, but critical
instructions do. Therefore, running noncritical instructions directly on the hardware not only
promotes efficiency but also ensures system security.
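The trap-and-emulate behavior described above can be modeled in a few lines of Python (the instruction names and VMM interface are invented for illustration): noncritical instructions "execute directly," while critical ones trap into the VMM and are emulated against virtual, not physical, state:

```python
# Toy model of full virtualization with trap-and-emulate. Instruction
# names are illustrative, not a real ISA.

CRITICAL = {"WRITE_CR3", "HLT", "OUT"}   # instructions that touch hardware

class VMM:
    def __init__(self):
        self.virtual_state = {}
        self.traps = 0

    def emulate(self, instr, operand):
        self.traps += 1
        # emulate against the VM's virtual state, never real hardware
        self.virtual_state[instr] = operand
        return f"emulated {instr}"

def execute(stream, vmm):
    log = []
    for instr, operand in stream:
        if instr in CRITICAL:
            log.append(vmm.emulate(instr, operand))   # trap into the VMM
        else:
            log.append(f"direct {instr}")             # runs at native speed
    return log

vmm = VMM()
trace = execute([("ADD", 1), ("WRITE_CR3", 0x1000), ("MOV", 2)], vmm)
print(trace)      # ['direct ADD', 'emulated WRITE_CR3', 'direct MOV']
print(vmm.traps)  # 1
```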
Host-Based Virtualization
An alternative VM architecture is to install a virtualization layer on top of the host OS. This host
OS is still responsible for managing the hardware. The guest OSes are installed and run on top
of the virtualization layer. Dedicated applications may run on the VMs. Certainly, some other
applications can also run with the host OS directly. This host-based architecture has some
distinct advantages, as enumerated next. First, the user can install this VM architecture without
modifying the host OS. Second, the host-based approach appeals to many host machine
configurations.
Para-Virtualization
Para-virtualization needs to modify the guest operating systems. A para-virtualized VM provides special APIs
requiring substantial OS modifications in user applications. Performance degradation is a critical
issue of a virtualized system. Figure illustrates the concept of a para-virtualized VM
architecture. The guest operating systems are para-virtualized. They are assisted by an intelligent compiler to
replace the nonvirtualizable OS instructions with hypercalls. The traditional x86 processor offers
four instruction execution rings: Rings 0, 1, 2, and 3. The lower the ring number, the higher the
privilege of the instruction being executed. The OS is responsible for managing the hardware and
executing privileged instructions at Ring 0, while user-level applications run at Ring 3.
Although para-virtualization reduces the overhead, it incurs problems of compatibility
and portability, because it must support the unmodified OS as well. Second, the cost is high,
because para-virtualization may require deep OS kernel modifications. Finally, the performance advantage of
para-virtualization varies greatly due to workload variations.
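The compiler-assisted rewriting can be sketched as a simple source-to-source pass (instruction names invented for illustration): privileged, nonvirtualizable operations are replaced with explicit hypercalls before the guest ever runs, instead of being trapped at run time:

```python
# Sketch of the para-virtualization idea: a compile-time pass rewrites
# privileged instructions into hypercalls to the hypervisor.

PRIVILEGED = {"WRITE_CR3", "INVLPG"}

def paravirtualize(code):
    """The 'intelligent compiler' pass: privileged ops become hypercalls."""
    return [("HYPERCALL", instr, operand) if instr in PRIVILEGED
            else (instr, operand)
            for instr, operand in code]

guest_code = [("MOV", 1), ("WRITE_CR3", 0x2000), ("INVLPG", 0x3000)]
print(paravirtualize(guest_code))
# [('MOV', 1), ('HYPERCALL', 'WRITE_CR3', 8192), ('HYPERCALL', 'INVLPG', 12288)]
```

The trade-off discussed above is visible even in this sketch: the pass must be rerun (and the guest kernel modified) for every OS to be hosted, which is the source of the compatibility and portability problems.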
CPU, Memory, and Network Virtualization
A VMware virtual machine provides complete hardware virtualization. The guest operating
system and applications running on a virtual machine can never determine directly which
physical resources they are accessing (such as which physical CPU they are running on in a
multiprocessor system, or which physical memory is mapped to their pages).
CPU Virtualization
Each virtual machine appears to run on its own CPU (or a set of CPUs), fully isolated from other
virtual machines. Registers, the translation lookaside buffer, and other control structures are
maintained separately for each virtual machine. Most instructions are executed directly on the
physical CPU, allowing resource-intensive workloads to run at near-native speed. The
virtualization layer safely performs privileged instructions.
Memory Virtualization
A contiguous memory space is visible to each virtual machine. However, the allocated physical
memory might not be contiguous. Instead, noncontiguous physical pages are remapped and
presented to each virtual machine. With unusually memory-intensive workloads, server memory
becomes overcommitted. In that case, some of the physical memory of a virtual machine might
be mapped to shared pages or to pages that are unmapped or swapped out.
ESX/ESXi performs this virtual memory management without the information that the guest
operating system has and without interfering with the guest operating system’s memory
management subsystem.
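The remapping described above can be sketched as a page table kept by the hypervisor (a deliberately simplified model, not ESX/ESXi's actual mechanism): the guest sees contiguous "physical" pages 0, 1, 2, ..., while the backing machine pages are whatever happens to be free:

```python
# Simplified model of memory virtualization: a guest-physical -> machine
# page mapping maintained by the virtualization layer.

class MemoryVirtualizer:
    def __init__(self, free_machine_pages):
        self.free = list(free_machine_pages)
        self.pmap = {}   # guest-physical page -> machine page

    def map_guest(self, n_pages):
        # Contiguous from the guest's point of view (pages 0..n-1), but
        # backed by noncontiguous machine pages.
        for gp in range(n_pages):
            self.pmap[gp] = self.free.pop(0)
        return self.pmap

mv = MemoryVirtualizer(free_machine_pages=[7, 3, 42, 9])
print(mv.map_guest(3))  # {0: 7, 1: 3, 2: 42} -- contiguous guest view, scattered machine pages
```

Under memory overcommitment, some entries of such a map would point to shared or swapped-out pages rather than dedicated machine pages.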
Network Virtualization
The virtualization layer guarantees that each virtual machine is isolated from other virtual
machines. Virtual machines can communicate with each other only through networking
mechanisms similar to those used to connect separate physical machines.
The isolation allows administrators to build internal firewalls or other network isolation
environments that allow some virtual machines to connect to the outside, while others are
connected only through virtual networks to other virtual machines.
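A toy model of this isolation (the class names and switch interface are invented): a virtual switch delivers traffic only between VMs attached to the same virtual network, so VMs on different virtual networks cannot reach each other directly:

```python
# Sketch of network virtualization isolation: frames are delivered only
# within a virtual network, mimicking separate physical LANs.

class VirtualSwitch:
    def __init__(self):
        self.ports = {}   # vm name -> virtual network id

    def attach(self, vm, vnet):
        self.ports[vm] = vnet

    def deliver(self, src, dst):
        same_net = self.ports.get(src) == self.ports.get(dst)
        return "delivered" if same_net else "dropped"

vs = VirtualSwitch()
vs.attach("vm-a", vnet=1)
vs.attach("vm-b", vnet=1)
vs.attach("vm-c", vnet=2)   # isolated from virtual network 1
print(vs.deliver("vm-a", "vm-b"))  # delivered
print(vs.deliver("vm-a", "vm-c"))  # dropped
```

An administrator-built "internal firewall" amounts to controlling which virtual networks a VM is attached to, exactly as in this sketch.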
Virtual Clusters and Resource Management
When a traditional VM is initialized, the administrator needs to manually write configuration
information or specify the configuration sources. When more VMs join a network, an inefficient
configuration always causes problems with overloading or underutilization.
Amazon’s Elastic Compute Cloud (EC2) is a good example of a web service that provides elastic
computing power in a cloud. EC2 permits customers to create VMs and to manage user
accounts over the time of their use. Most virtualization platforms, including XenServer and
VMware ESX Server, support a bridging mode which allows all domains to appear on the
network as individual hosts. By using this mode, VMs can communicate with one another freely
through the virtual network interface card and configure the network automatically.
Physical versus Virtual Clusters
Virtual clusters are built with VMs installed at distributed servers from one or more physical
clusters. The VMs in a virtual cluster are interconnected logically by a virtual network across
several physical networks. Each virtual cluster is formed with physical machines or a VM hosted
by multiple physical clusters. The virtual cluster boundaries are shown as distinct boundaries.
The provisioning of VMs to a virtual cluster is done dynamically and has the following
interesting properties:
The virtual cluster nodes can be either physical or virtual machines. Multiple VMs
running with different OSes can be deployed on the same physical node.
A VM runs with a guest OS, which is often different from the host OS that manages the
resources of the physical machine on which the VM is implemented.
The purpose of using VMs is to consolidate multiple functionalities on the same server.
This will greatly enhance server utilization and application flexibility.
VMs can be colonized (replicated) in multiple servers for the purpose of promoting
distributed parallelism, fault tolerance, and disaster recovery.
The size (number of nodes) of a virtual cluster can grow or shrink dynamically, similar to
the way an overlay network varies in size in a peer-to-peer (P2P) network.
The failure of any physical nodes may disable some VMs installed on the failing nodes.
But the failure of VMs will not pull down the host system.
Since system virtualization has been widely used, it is necessary to effectively manage the VMs
running on a mass of physical computing nodes (together forming virtual clusters) and consequently
build a high-performance virtualized computing environment. This involves virtual cluster
deployment, monitoring and management over large-scale clusters, as well as resource
scheduling, load balancing, server consolidation, fault tolerance, and other techniques.
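The elasticity and failure properties in the list above can be sketched in a few lines (a simplified model, not any particular platform's API): the cluster grows as VMs are placed on hosts, and a physical host failure disables only the VMs on that host:

```python
# Toy virtual cluster: dynamic membership plus failure containment.

class VirtualCluster:
    def __init__(self):
        self.placement = {}   # vm -> physical host

    def add_vm(self, vm, host):
        self.placement[vm] = host

    def host_failed(self, host):
        """Remove only the VMs on the failed host; the rest survive."""
        self.placement = {vm: h for vm, h in self.placement.items()
                          if h != host}

    def size(self):
        return len(self.placement)

vc = VirtualCluster()
for vm, host in [("vm1", "hostA"), ("vm2", "hostA"), ("vm3", "hostB")]:
    vc.add_vm(vm, host)
vc.host_failed("hostA")
print(vc.size(), sorted(vc.placement))  # 1 ['vm3']
```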
Virtualization For Data-Center Automation
Data-center automation means that the huge volumes of hardware, software, and database resources in
these data centers can be allocated dynamically to millions of Internet users simultaneously,
with guaranteed QoS. The latest virtualization development highlights high availability (HA),
backup services, workload balancing, and further increases in client bases.
Server Consolidation in Data Centers
The heterogeneous workloads in the data center can be roughly divided into two categories:
chatty workloads and noninteractive workloads. Chatty workloads may burst at some point and
return to a silent state at some other point. For example, video services may be used by a lot of
people at night while few people use them during the day. Noninteractive workloads do not require
people’s efforts to make progress after they are submitted. Server consolidation is an approach
to improving the low utilization ratio of hardware resources by reducing the number of physical
servers. However, the use of VMs increases resource management complexity.
It enhances hardware utilization. Many underutilized servers are consolidated into
fewer servers to enhance resource utilization. Consolidation also facilitates backup
services and disaster recovery.
In a virtual environment, the images of the guest OSes and their applications are readily
cloned and reused.
Total cost of ownership is reduced.
Improves availability and business continuity.
Automation of data-center operations includes resource scheduling, architectural support,
power management, automatic or autonomic resource management, performance of analytical
models, and so on. In virtualized data centers, an efficient, on-demand, fine-grained scheduler
is one of the key factors to improve resource utilization. Dynamic CPU allocation is based on VM
utilization and application-level QoS metrics. One method considers both CPU and memory
flowing as well as automatically adjusting resource overhead based on varying workloads in
hosted services. Another scheme uses a two-level resource management system to handle the
complexity involved. A local controller at the VM level and a global controller at the server level
are designed.
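One plausible shape for the two-level scheme (the controller policies here are assumptions for illustration, not the cited design): a local controller per VM turns observed utilization into a CPU request, and a global controller at the server level grants the requests, scaling them down proportionally when the server is oversubscribed:

```python
# Sketch of two-level resource management: per-VM local controllers plus
# one server-level global controller. Policies are illustrative.

def local_controller(utilization, headroom=0.2):
    """Request = observed CPU utilization plus some headroom (assumed policy)."""
    return min(1.0, utilization + headroom)

def global_controller(requests, capacity):
    """Grant requests, scaling down proportionally if oversubscribed."""
    total = sum(requests.values())
    scale = min(1.0, capacity / total)
    return {vm: round(r * scale, 3) for vm, r in requests.items()}

utilizations = {"vm1": 0.5, "vm2": 0.9, "vm3": 0.3}
reqs = {vm: local_controller(u) for vm, u in utilizations.items()}
grants = global_controller(reqs, capacity=2.0)
print(grants)  # total grants fit within 2.0 CPUs, proportional to requests
```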
Virtual Storage Management
Virtual storage includes the storage managed by VMMs and guest OSes. Generally, the data
stored in this environment can be classified into two categories: VM images and application
data. The VM images are specific to the virtual environment, while application data includes all
other data, which is the same as the data in traditional OS environments. In data centers, there
are often thousands of VMs, which causes the number of VM images to become enormous. Parallax is a
distributed storage system customized for virtualization environments. Content Addressable
Storage (CAS) is a solution to reduce the total size of VM images, and therefore supports a large
set of VM-based systems in data centers.
Parallax introduces a novel architecture in which storage features that have traditionally been
implemented directly on high-end storage arrays and switches are relocated into a federation
of storage VMs. These storage VMs share the same physical hosts as the VMs that they serve.
Figure provides an overview of the Parallax system architecture. For each physical machine,
Parallax customizes a special storage appliance VM. The storage appliance VM acts as a block
virtualization layer between individual VMs and the physical storage device. It provides a virtual
disk for each VM on the same physical machine.
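The CAS idea can be sketched with content hashing (a simplified model, not Parallax's actual on-disk format): each image is stored as a recipe of block hashes, so blocks shared between images, such as a common kernel and libraries, are stored only once:

```python
# Sketch of Content Addressable Storage (CAS) deduplicating VM images:
# blocks are addressed by the hash of their content.

import hashlib

class CASStore:
    def __init__(self):
        self.blocks = {}     # content hash -> block data

    def put_image(self, image_blocks):
        """Store an image; return its recipe (list of block hashes)."""
        recipe = []
        for block in image_blocks:
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)   # duplicate blocks stored once
            recipe.append(h)
        return recipe

store = CASStore()
base_os = [b"kernel", b"libs", b"config-a"]
clone   = [b"kernel", b"libs", b"config-b"]   # shares two blocks with base
r1, r2 = store.put_image(base_os), store.put_image(clone)
print(len(store.blocks))  # 4 unique blocks stored instead of 6
```

With thousands of near-identical VM images, this kind of deduplication is what keeps the total image size manageable.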
UNIT III
Cloud Platform Architecture
Cloud Computing and service Models
Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to
a shared pool of configurable computing resources (e.g., networks, servers, storage,
applications, and services) that can be rapidly provisioned and released with minimal
management effort or service provider interaction. This cloud model is composed of five
essential characteristics, three service models, and four deployment models.
Software as a Service (SaaS)
The capability provided to the consumer is to use the provider’s applications running on a cloud
infrastructure. The applications are accessible from various client devices through either a thin
client interface, such as a web browser (e.g., web-based email), or a program interface. The
consumer does not manage or control the underlying cloud infrastructure including network,
servers, operating systems, storage, or even individual application capabilities, with the
possible exception of limited user-specific application configuration settings.
Platform as a Service (PaaS)
The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-
created or acquired applications created using programming languages, libraries, services, and
tools supported by the provider. The consumer does not manage or control the underlying
cloud infrastructure including network, servers, operating systems, or storage, but has control
over the deployed applications and possibly configuration settings for the application-hosting
environment.
Infrastructure as a Service (IaaS)
The capability provided to the consumer is to provision processing, storage, networks, and
other fundamental computing resources where the consumer is able to deploy and run
arbitrary software, which can include operating systems and applications. The consumer does
not manage or control the underlying cloud infrastructure but has control over operating
systems, storage, and deployed applications; and possibly limited control of select networking
components (e.g., host firewalls).
Architectural Design of Compute and Storage Clouds
Cloud Architecture for Distributed Computing
An Internet cloud is envisioned as a public cluster of servers provisioned on demand to perform
collective web services or distributed applications using the datacenter resources. The cloud
design objectives are first specified below. Then we present a basic cloud architecture design.
Cloud Platform Design Goals: Scalability, virtualization, efficiency, and reliability are four major
design goals of a cloud computing platform. Clouds support Web 2.0 applications. The cloud
management receives the user request and then finds the correct resources, and then calls the
provisioning services which invoke resources in the cloud. The cloud management software
needs to support both physical and virtual machines. Security in shared resources and shared
access of datacenters also poses another design challenge.
The platform needs to establish a very large-scale HPC infrastructure. The hardware and
software systems are combined together to make it easy and efficient to operate. The system
scalability can benefit from cluster architecture. If one service takes a lot of processing power
or storage capacity or network traffic, it is simple to add more servers and bandwidth. The
system reliability can also benefit from this architecture. Data can be put into multiple locations. For
example, user email can be put on three disks spread across geographically
separate data centers. In such a situation, even if one of the datacenters crashes, the user data is
still accessible. The scale of cloud architecture can be easily expanded by adding more servers
and enlarging the network connectivity accordingly.
Enabling Technologies for Clouds: The key driving forces behind cloud computing are the
ubiquity of broadband and wireless networking, falling storage costs, and progressive
improvements in Internet computing software. Cloud users are able to demand more capacity
at peak demand, reduce costs, experiment with new services, and remove unneeded capacity,
whereas service providers can increase the system utilization via multiplexing, virtualization,
and dynamic resource provisioning.
Technology Requirements and Benefits
Fast Platform Deployment: Fast, efficient, and flexible deployment of cloud resources to provide a dynamic computing environment to users.
Virtual Clusters on Demand: Virtualized clusters of VMs provisioned to satisfy user demand, and virtual clusters reconfigured as workloads change.
Multi-Tenant Techniques: SaaS distributes software to a large number of users for simultaneous use and resource sharing if so desired.
Massive Data Processing: Internet search and web services often require massive data processing, especially to support personalized services.
Web-Scale Communication: Support for e-commerce, distance education, telemedicine, social networking, digital government, digital entertainment, etc.
Distributed Storage: Large-scale storage of personal records and public archive information demands distributed storage over the clouds.
Licensing and Billing Services: License management and billing services greatly benefit all types of cloud services in utility computing.
These technologies play instrumental roles to make cloud computing a reality. Most of these
technologies are mature today to meet the increasing demand. In the hardware area, the rapid
progress in multi-core CPUs, memory chips, and disk arrays has made it possible to build faster
datacenters with huge storage space. Resource virtualization enables rapid cloud deployment
and fast disaster recovery. Service-oriented architecture (SOA) also plays a vital role. The
progress in providing Software as a Service (SaaS), Web 2.0 standards, and Internet
performance have all contributed to the emergence of cloud services. Today’s clouds are
designed to serve a large number of tenants over massive volumes of data. The availability of
large-scale, distributed storage systems lays the foundation of today’s datacenters. Of course,
cloud computing is greatly benefitted by the progress made in license management and
automatic billing techniques in recent years.
A Generic Cloud Architecture: The Internet cloud is envisioned as a massive cluster of servers.
These servers are provisioned on demand to perform collective web services or distributed
applications using datacenter resources. The cloud platform is formed dynamically by provisioning
or de-provisioning of servers, software, and database resources. Servers in the cloud can be
physical machines or virtual machines. User interfaces are applied to request services. The
provisioning tool carves out the systems from the cloud to deliver on the requested service.
In addition to building the server cluster, cloud platforms demand distributed storage and
accompanying services. The cloud computing resources are built in datacenters, which are
typically owned and operated by a third-party provider. Consumers do not need to know the
underlying technologies. In a cloud, software becomes a service. The cloud demands a high
degree of trust in the massive data retrieved from large datacenters. We need to build a framework
to process large scale data stored in the storage system. This demands a distributed file system
over the database system. Other cloud resources are added into a cloud platform including the
storage area networks, database systems, firewalls and security devices. Web service providers
offer special APIs that enable developers to exploit Internet clouds. Monitoring and metering
units are used to track the usage and performance of resources provisioned.
The software infrastructure of a cloud platform must handle all resource management and do
most of the maintenance automatically. The software must detect the status of each node, servers
joining and leaving, and perform the relevant tasks accordingly. Cloud computing providers, like Google and
Microsoft, have built a large number of datacenters all over the world. Each datacenter may
have thousands of servers. The location of the datacenter is chosen to reduce power and
cooling costs. Thus, datacenters are often built near hydroelectric power stations. The
cloud physical platform builder is more concerned with the performance/price ratio and
reliability than with sheer speed.
In general, the private clouds are easier to manage. Public clouds are easier to access. The
trends of cloud development are that more and more clouds will be hybrid. This is due to the
fact that many cloud applications must go beyond the boundary of an intranet. One must learn
how to create a private cloud and how to interact with public clouds on the open Internet.
Security becomes a critical issue in safeguarding the operations of all cloud types.
Inter-cloud Resource Management
Cloud resource management and inter-cloud resource exchange schemes are reviewed in this
section.
Extended Cloud Computing Services
Figure shows six layers of cloud services, ranging from hardware, network, and collocation to
infrastructure, platform, and software applications. The cloud platform provides PaaS, which
sits on top of the IaaS infrastructure. The top layer offers SaaS.
The bottom three layers are more related to physical requirements. The bottommost layer
provides Hardware as a Service (HaaS). The next layer is for interconnecting all the hardware
components, and is simply called Network as a Service (NaaS). Virtual LANs fall within the scope
of NaaS. The next layer above NaaS offers Location as a Service (LaaS), which provides a
collocation service to house, power, and secure all the physical hardware and network
resources. The cloud infrastructure layer can be further subdivided as Data as a Service (DaaS)
and Communication as a Service (CaaS) in addition to compute and storage in IaaS.
The cloud players are divided into three classes: (1) cloud service providers and IT
administrators, (2) software developers or vendors, and (3) end users or business users. These
cloud players vary in their roles under the IaaS, PaaS, and SaaS models. From the software
vendors’ perspective, application performance on a given cloud platform is most important.
From the providers’ perspective, cloud infrastructure performance is the primary concern. From
the end users’ perspective, the quality of services, including security, is the most important.
Cloud Service Tasks and Trends
The top layer is for SaaS applications, mostly used for business applications. The approach is to
widen market coverage by investigating customer behaviors and revealing opportunities by
statistical analysis. SaaS tools also apply to distributed collaboration, and financial and human
resources management.
PaaS is provided by Google, Salesforce.com, and Facebook, among others. IaaS is provided by
Amazon, Windows Azure, and Rackspace, among others. Collocation services require multiple
cloud providers to work together to support supply chains in manufacturing. Network cloud
services provide communications such as those by AT&T, Qwest, and AboveNet. Details can be
found in Chou’s introductory book on business clouds.
Software Stack for Cloud Computing
The overall software stack structure of cloud computing software can be viewed as layers. Each
layer has its own purpose and provides the interface for the upper layers just as the traditional
software stack does. However, the lower layers are not completely transparent to the upper
layers.
The platform for running cloud computing services can be either physical servers or virtual
servers. By using VMs, the platform can be flexible, that is, the running services are not bound
to specific hardware platforms. This brings flexibility to cloud computing platforms. The
software layer on top of the platform is the layer for storing massive amounts of data. This layer
acts like the file system in a traditional single machine. Other layers running on top of the file
system are the layers for executing cloud computing applications. They include the database
storage system, programming for large-scale clusters, and data query language support. The
next layers are the components in the software stack.
Runtime Support Services
As in a cluster environment, there are also some runtime supporting services in the cloud
computing environment. Cluster monitoring is used to collect the runtime status of the entire
cluster. One of the most important facilities is the cluster job management system. The
scheduler queues the tasks submitted to the whole cluster and assigns the tasks to the
processing nodes according to node availability. The distributed scheduler for the cloud
application has special characteristics that can support cloud applications, such as scheduling
the programs written in MapReduce style. The runtime support system keeps the cloud cluster
working properly with high efficiency. Runtime support is the software needed by browser-initiated
applications used by thousands of cloud customers. The SaaS model provides software
applications as a service, rather than letting users purchase the software.
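The job-management behavior described above can be sketched as a queue plus a least-loaded placement policy (an assumed policy for illustration; real cloud schedulers such as those for MapReduce-style jobs are far more elaborate):

```python
# Toy cluster job scheduler: queue submitted tasks, then assign each to
# the node with the lowest current load.

from collections import deque

class ClusterScheduler:
    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}
        self.queue = deque()
        self.assignment = {}

    def submit(self, task):
        self.queue.append(task)

    def dispatch(self):
        while self.queue:
            task = self.queue.popleft()
            node = min(self.load, key=self.load.get)  # least-loaded node
            self.load[node] += 1
            self.assignment[task] = node

sched = ClusterScheduler(["node1", "node2"])
for t in ["map-0", "map-1", "reduce-0"]:
    sched.submit(t)
sched.dispatch()
print(sched.load)  # {'node1': 2, 'node2': 1}
```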
Cloud Security
While security and privacy concerns are similar across cloud services and traditional non-cloud
services, those concerns are amplified by the existence of external control over organizational
assets and the potential for mismanagement of those assets. Transitioning to public cloud
computing involves a transfer of responsibility and control to the cloud provider over
information as well as system components that were previously under the customer’s direct
control.
Despite this inherent loss of control, the cloud service customer still needs to take responsibility
for its use of cloud computing services in order to maintain situational awareness, weigh
alternatives, set priorities, and effect changes in security and privacy that are in the best
interest of the organization. The customer achieves this by ensuring that the contract with the
provider and its associated cloud service agreement has appropriate provisions for security and
privacy. In particular, the agreement must help maintain legal protections for the privacy of
data stored and processed on the provider's systems. The customer must also ensure
appropriate integration of cloud computing services with their own systems for managing
security and privacy.
There are a number of security risks associated with cloud computing that must be adequately
addressed:
Loss of governance: In a public cloud deployment, customers cede control to the cloud provider
over a number of issues that may affect security. Yet cloud service agreements may not offer a
commitment to resolve such issues on the part of the cloud provider, thus leaving gaps in
security defenses.
Responsibility ambiguity: Responsibility over aspects of security may be split between the
provider and the customer, with the potential for vital parts of the defenses to be left
unguarded if there is a failure to allocate responsibility clearly. This split is likely to vary
depending on the cloud computing model used (e.g., IaaS vs. SaaS).
Authentication and Authorization: The fact that sensitive cloud resources are accessed from
anywhere on the Internet heightens the need to establish with certainty the identity of a user,
especially if users now include employees, contractors, partners, and customers. Strong
authentication and authorization becomes a critical concern.
Isolation failure: Multi-tenancy and shared resources are defining characteristics of public
cloud computing. This risk category covers the failure of mechanisms separating the usage of
storage, memory, routing and even reputation between tenants (e.g. so-called guest-hopping
attacks).
Compliance and legal risks: The cloud customer’s investment in achieving certification (e.g., to
demonstrate compliance with industry standards or regulatory requirements) may be lost if the
cloud provider cannot provide evidence of their own compliance with the relevant
requirements, or does not permit audits by the cloud customer. The customer must check that
the cloud provider has appropriate certifications in place.
Handling of security incidents: The detection, reporting and subsequent management of
security breaches may be delegated to the cloud provider, but these incidents impact the
customer. Notification rules need to be negotiated in the cloud service agreement so that
customers are not caught unaware or informed with an unacceptable delay.
Management interface vulnerability: Interfaces to manage public cloud resources (such as self-
provisioning) are usually accessible through the Internet. Since they allow access to larger sets
of resources than traditional hosting providers, they pose an increased risk, especially when
combined with remote access and web browser vulnerabilities.
Application Protection: Traditionally, applications have been protected with defense-in-depth
security solutions based on a clear demarcation of physical and virtual resources, and on
trusted zones. With the delegation of infrastructure security responsibility to the cloud
provider, organizations need to rethink perimeter security at the network level, applying more
controls at the user, application and data level. The same level of user access control and
protection must be applied to workloads deployed in cloud services as to those running in
traditional data centers. This requires creating and managing workload-centric policies as well
as implementing centralized management across distributed workload instances.
Data protection: Here, the major concerns are exposure or release of sensitive data as well as
the loss or unavailability of data. It may be difficult for the cloud service customer (in the role of
data controller) to effectively check the data handling practices of the cloud provider. This
problem is exacerbated in cases of multiple transfers of data (e.g., between federated cloud
services or where a cloud provider uses subcontractors).
Malicious behavior of insiders: Damage caused by the malicious actions of people working
within an organization can be substantial, given the access and authorizations they enjoy. This
is compounded in the cloud computing environment since such activity might occur within
either or both the customer organization and the provider organization.
Business failure of the provider: Such failures could render data and applications essential to
the customer's business unavailable over an extended period.
Service unavailability: This could be caused by hardware, software or communication network
failures.
Vendor lock-in: Dependency on proprietary services of a particular cloud service provider could
lead to the customer being tied to that provider. The lack of portability of applications and data
across providers poses a risk of data and service unavailability in case of a change in providers;
therefore it is an important if sometimes overlooked aspect of security. Lack of interoperability
of interfaces associated with cloud services similarly ties the customer to a particular provider
and can make it difficult to switch to another provider.
Insecure or incomplete data deletion: The termination of a contract with a provider may not
result in deletion of the customer’s data. Backup copies of data usually exist, and may be mixed
on the same media with other customers’ data, making it impossible to selectively erase. The
very advantage of multi-tenancy (the sharing of hardware resources) thus represents a higher
risk to the customer than dedicated hardware.
Visibility and Audit: Some enterprise users are creating a “shadow IT” by procuring cloud
services to build IT solutions without explicit organizational approval. Key challenges for the
security team are to know about all uses of cloud services within the organization (what
resources are being used, for what purpose, to what extent, and by whom), understand what
laws, regulations and policies may apply to such uses, and regularly assess the security aspects
of such uses.
Cloud computing does not only create new security risks: it also provides opportunities to
provision improved security services that are better than those many organizations implement
on their own. Cloud service providers could offer advanced security and privacy facilities that
leverage their scale and their skills at automating infrastructure management tasks. This is
potentially a boon to customers who have little skilled security personnel.
Trust Management:
Reputation-Guided Protection of Data Centers:
Trust is a personal opinion, which is very subjective and often biased. Trust can be transitive but
not necessarily symmetric between two parties. Reputation is a public opinion, which is more
objective and often relies on a large opinion aggregation process to evaluate. Reputation may
change or decay over time. Recent reputation should be given more weight than past
reputation.
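The idea that recent reputation should outweigh past reputation can be sketched as a time-decayed aggregation. The following Python function is purely illustrative (the function name, half-life parameter, and score scale are assumptions, not from the text):

```python
def aggregate_reputation(ratings, now, half_life=30.0):
    """Aggregate (timestamp, score) ratings, with scores in [0, 1].

    A rating's weight halves every `half_life` time units, so recent
    reputation counts for more than past reputation, and old opinions
    decay instead of dominating the aggregate forever.
    """
    num = den = 0.0
    for t, score in ratings:
        weight = 0.5 ** ((now - t) / half_life)   # exponential decay
        num += weight * score
        den += weight
    return num / den if den else 0.0

# An old good rating (t=0) versus a recent bad one (t=60):
rep = aggregate_reputation([(0, 0.9), (60, 0.2)], now=60)
```

With a 30-unit half-life, the 60-unit-old rating keeps only a quarter of its weight, so the recent bad score pulls the aggregate well below the plain average.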
Reputation Systems for Clouds
Redesigning the aforementioned reputation systems for protecting data centers offers new
opportunities for expanded applications beyond P2P networks. Data consistency is checked
across multiple databases. Copyright protection secures wide-area content distributions. To
separate user data from specific SaaS programs, providers take the most responsibility in
maintaining data integrity and consistency. Users can switch among different services using
their own data. Only the users have the keys to access the requested data.
The data objects must be uniquely named to ensure global consistency. To ensure data
consistency, unauthorized updates of data objects by other cloud users are prohibited. The
reputation system can be implemented with a trust overlay network. A hierarchy of P2P
reputation systems is suggested to protect cloud resources at the site level and data objects at
the file level. This demands both coarse-grained and fine-grained access control of shared
resources. These reputation systems keep track of security breaches at all levels.
The reputation system must be designed to benefit both cloud users and data centers. Data
objects used in cloud computing reside in multiple data centers over a SAN. In the past, most
reputation systems were designed for P2P social networking or for online shopping services.
These reputation systems can be converted to protect cloud platform resources or user
applications in the cloud. A centralized reputation system is easier to implement, but demands
more powerful and reliable server resources. Distributed reputation systems are more scalable
and reliable in terms of handling failures. The five security mechanisms presented earlier can be
greatly assisted by using a reputation system specifically designed for data centers.
However, it is possible to add social tools such as reputation systems to support safe cloning of
VMs. Snapshot control is based on the defined RPO. Users demand new security mechanisms to
protect the cloud. For example, one can apply secured information logging, migrate over
secured virtual LANs, and apply ECC-based encryption for secure migration. Sandboxes provide
a safe execution platform for running programs. Further, sandboxes can provide a tightly
controlled set of resources for guest operating systems, which allows a security test bed to test
the application code from third-party vendors.
Service Oriented Architecture:
Service-oriented architecture (SOA) is an evolution of distributed computing based on the
request/reply design paradigm for synchronous and asynchronous applications. An application's
business logic or individual functions are modularized and presented as services for
consumer/client applications. What's key to these services is their loosely coupled nature; i.e.,
the service interface is independent of the implementation. Application developers or system
integrators can build applications by composing one or more services without knowing the
services' underlying implementations.
Service-oriented architectures have the following key characteristics:
SOA services have self-describing interfaces in platform-independent XML documents.
Web Services Description Language (WSDL) is the standard used to describe the
services.
SOA services communicate with messages formally defined via XML Schema (also called XSD). Communication among consumers and providers or services typically happens in heterogeneous environments, with little or no knowledge about the provider. Messages between services can be viewed as key business documents processed in an enterprise.
SOA services are maintained in the enterprise by a registry that acts as a directory listing. Applications can look up the services in the registry and invoke the service. Universal Description, Discovery, and Integration (UDDI) is the standard used for service registry.
Each SOA service has a quality of service (QoS) associated with it. Some of the key QoS elements are security requirements, such as authentication and authorization, reliable messaging, and policies regarding who can invoke services.
Why SOA?
The reality in IT enterprises is that infrastructure is heterogeneous across operating systems, applications, system software, and application infrastructure. Some existing applications are used to run current business processes, so starting from scratch to build new infrastructure isn't an option. Enterprises should respond quickly to business changes with agility; leverage existing investments in applications and application infrastructure to address newer business requirements; support new channels of interaction with customers, partners, and suppliers; and feature an architecture that supports organic business.
SOA, with its loosely coupled nature, allows enterprises to plug in new services or upgrade existing services in a granular fashion to address new business requirements, provides the option to make the services consumable across different channels, and exposes existing enterprise and legacy applications as services, thereby safeguarding existing IT infrastructure investments.
SOA Architecture:
Fig: Service Architecture
In the above Figure several service consumers can invoke services by sending messages. These
messages are typically transformed and routed by a service bus to an appropriate service
implementation. This service architecture can provide a business rules engine that allows
business rules to be incorporated in a service or across services. The service architecture also
provides a service management infrastructure that manages services and activities like auditing,
billing, and logging. In addition, the architecture offers enterprises the flexibility to keep
business processes agile, to better address regulatory requirements such as Sarbanes-Oxley (SOX),
and to change individual services without affecting other services.
SOA infrastructure
To run and manage SOA applications, enterprises need an SOA infrastructure that is part of the
SOA platform. An SOA infrastructure must support all the relevant standards and required
runtime containers.
Fig: A typical SOA infrastructure.
WSDL is used to describe the service; UDDI, to register and look up the services; and SOAP, as a
transport layer to send messages between service consumer and service provider. While SOAP
is the default mechanism for Web services, alternative technologies accomplish other types of
bindings for a service. A consumer can search for a service in the UDDI registry, get the WSDL
for the service that has the description, and invoke the service using SOAP.
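The lookup-then-invoke pattern described above can be mimicked with a toy, in-process registry. This Python sketch is only an analogy for illustration (the `register`/`lookup` helpers and the "quote" service are invented names); a real deployment would use UDDI for the registry, WSDL for the description, and SOAP for invocation:

```python
# In-process stand-in for a UDDI-style directory: providers register a
# service under a name with a WSDL-like description and an endpoint;
# consumers look the service up and invoke it without knowing its
# implementation.
registry = {}

def register(name, description, endpoint):
    registry[name] = {"description": description, "endpoint": endpoint}

def lookup(name):
    return registry[name]

register("quote", "getQuote(symbol) -> price",
         lambda symbol: {"GOOG": 99.0}[symbol])

entry = lookup("quote")              # consumer discovers the service...
price = entry["endpoint"]("GOOG")    # ...then invokes it via its endpoint
```

The consumer never sees the implementation behind the endpoint, which is the loose coupling the SOA definition emphasizes.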
Benefits of SOA
While the SOA concept is fundamentally not new, SOA differs from existing distributed
technologies in that most vendors accept it and have an application or platform suite that
enables SOA. SOA, with a ubiquitous set of standards, brings better reusability of existing assets
or investments in the enterprise and lets you create applications that can be built on top of new
and existing applications. SOA enables changes to applications while keeping clients or service
consumers isolated from evolutionary changes that happen in the service implementation. SOA
enables upgrading individual services or service consumers; it is not necessary to completely
rewrite an application or keep an existing system that no longer addresses the new business
requirements. Finally, SOA provides enterprises better flexibility in building applications and
business processes in an agile manner by leveraging existing application infrastructure to
compose new services.
Message Oriented Middleware
Message-oriented middleware (MOM) is software or hardware infrastructure supporting
sending and receiving messages between distributed systems. MOM allows application
modules to be distributed over heterogeneous platforms and reduces the complexity of
developing applications that span multiple operating systems and network protocols.
The middleware creates a distributed communications layer that insulates the application
developer from the details of the various operating systems and network interfaces. APIs that
extend across diverse platforms and networks are typically provided by MOM.
Middleware categories
Remote Procedure Call or RPC-based middleware
Object Request Broker or ORB-based middleware
Message Oriented Middleware or MOM-based middleware
All these models make it possible for one software component to affect the behavior of another
component over a network. They are different in that RPC- and ORB-based middleware create
systems of tightly coupled components, whereas MOM-based systems allow for a loose
coupling of components. In an RPC- or ORB-based system, when one procedure calls another, it
must wait for the called procedure to return before it can do anything else. In
these synchronous messaging models, the middleware functions partly as a super-linker,
locating the called procedure on a network and using network services to pass function or
method parameters to the procedure and then to return results.
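The coupling difference can be sketched in a few lines of Python: an RPC-style call blocks the caller until the callee returns, while a MOM-style sender hands its message to a broker and continues. Here a plain in-process `queue.Queue` stands in for the middleware broker; this is a minimal sketch, not a real MOM product:

```python
import queue
import threading

# Synchronous (RPC/ORB style): the caller blocks until the callee returns.
def rpc_call(procedure, arg):
    return procedure(arg)            # nothing else happens meanwhile

# Message-oriented (MOM style): the sender enqueues and moves on; the
# middleware buffers the message until some receiver is ready for it.
broker = queue.Queue()               # in-process stand-in for the broker

def sender(payload):
    broker.put(payload)              # returns immediately: loose coupling

def receiver(out):
    out.append(broker.get())         # may run later, even on another node

results = []
sender("order-123")                  # sender is never blocked on receiver
worker = threading.Thread(target=receiver, args=(results,))
worker.start()
worker.join()
```

The sender and receiver never hold a direct reference to each other; only the broker connects them, which is what makes the components loosely coupled.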
Advantages
Central reasons for using a message-based communications protocol include its ability to store
(buffer), route, or transform messages while conveying them from senders to receivers.
Another advantage of provider-mediated messaging between clients is that, by adding an
administrative interface, you can monitor and tune performance. Client applications
are thus effectively relieved of every problem except that of sending, receiving, and processing
messages. It is up to the code that implements the MOM system and up to the administrator to
resolve issues like interoperability, reliability, security, scalability, and performance.
Disadvantages
The primary disadvantage of many message-oriented middleware systems is that they require
an extra component in the architecture, the message transfer agent (message broker). As with
any system, adding another component can lead to reductions in performance and reliability,
and can also make the system as a whole more difficult and expensive to maintain.
In addition, many inter-application communications have an intrinsically synchronous aspect,
with the sender specifically wanting to wait for a reply to a message before continuing
(see real-time computing and near-real-time for extreme cases). Because message-
based communication inherently functions asynchronously, it may not fit well in such
situations. That said, most MOM systems have facilities to group a request and a response as a
single pseudo-synchronous transaction.
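One common way to build such a pseudo-synchronous exchange is a correlation ID that pairs a reply with its request. The following in-process Python sketch illustrates the idea; the queues and the `service` function are stand-ins for a real broker and a remote provider:

```python
import queue
import uuid

request_q, reply_q = queue.Queue(), queue.Queue()

def service():
    # Provider: echo the request's correlation ID back with the reply.
    corr_id, body = request_q.get()
    reply_q.put((corr_id, body.upper()))

def pseudo_sync_call(body, timeout=1.0):
    corr_id = str(uuid.uuid4())      # pairs this request with its reply
    request_q.put((corr_id, body))
    service()                        # in a real MOM this runs elsewhere
    got_id, result = reply_q.get(timeout=timeout)
    assert got_id == corr_id         # only accept *our* reply
    return result

result = pseudo_sync_call("ping")    # request and reply look synchronous
```

Underneath, the transport is still asynchronous messaging; the correlation ID and the blocking `get` with a timeout are what give the caller a synchronous-looking interface.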
UNIT IV
Cloud Programming and Software Environments
Features of Cloud and Grid Platforms
Cloud Capabilities and Platform Features
Commercial cloud capabilities offer cost-effective utility computing with the elasticity to scale
up and down in power. In addition to this key distinguishing feature, commercial clouds
offer a growing number of additional capabilities commonly termed “Platform as a Service”
(PaaS). For Azure, current platform features include Azure Table, queues, blobs, Database SQL,
and web and Worker roles. Amazon is often viewed as offering “just” Infrastructure as a Service
(IaaS), but it continues to add platform features including SimpleDB (similar to Azure Table),
queues, notification, monitoring, content delivery network, relational database, and
MapReduce (Hadoop). Google does not currently offer a broad-based cloud service, but the
Google App Engine (GAE) offers a powerful web application development environment.
Workflow
Workflow has spawned many projects in the United States and Europe. Pegasus, Taverna, and
Kepler are popular, but no choice has gained wide acceptance. There are commercial systems
such as Pipeline Pilot, AVS (dated), and the LIMS environments. A recent entry is Trident [2]
from Microsoft Research, which is built on top of Windows Workflow Foundation. Whether Trident
runs on Azure or on an ordinary Windows machine, it will run workflow proxy services on external
(Linux) environments. Workflow links multiple cloud and noncloud services in real applications
on demand.
Data Transport
The cost (in time and money) of data transport in (and to a lesser extent, out of) commercial
clouds is often discussed as a difficulty in using clouds. If commercial clouds become an
important component of the national cyberinfrastructure we can expect that high-bandwidth
links will be made available between clouds and TeraGrid. The special structure of cloud data
with blocks (in Azure blobs) and tables could allow high-performance parallel algorithms, but
initially, simple HTTP mechanisms are used to transport data [3–5] on academic
systems/TeraGrid and commercial clouds.
Security, Privacy, and Availability
The following techniques are related to security, privacy, and availability requirements for
developing a healthy and dependable cloud programming environment.
Use virtual clustering to achieve dynamic resource provisioning with minimum overhead cost.
Use stable and persistent data storage with fast queries for information retrieval.
Use special APIs for authenticating users and sending e-mail using commercial accounts.
Cloud resources are accessed with security protocols such as HTTPS and SSL.
Fine-grained access control is desired to protect data integrity and deter intruders or hackers.
Shared data sets are protected from malicious alteration, deletion, or copyright violations.
Features are included for availability enhancement and disaster recovery with live migration of
VMs.
Use a reputation system to protect data centers. This system only authorizes trusted clients and
stops pirates.
Data Features and Databases
Program Library
Many efforts have been made to design a VM image library to manage images used in academic
and commercial clouds. The basic cloud environments described in this chapter also include
many management features allowing convenient deployment and configuring of images (i.e.,
they support IaaS).
Blobs and Drives
The basic storage concept in clouds is blobs for Azure and S3 for Amazon. These can be
organized (approximately, as in directories) by containers in Azure. In addition to a service
interface for blobs and S3, one can attach “directly” to compute instances as Azure drives and
the Elastic Block Store for Amazon. This concept is similar to shared file systems such as Lustre
used in TeraGrid. The cloud storage is intrinsically fault-tolerant, while that on TeraGrid needs
backup storage. However, the architecture ideas are similar between clouds and TeraGrid; note
also the Simple Cloud File Storage API in this area.
DPFS
This covers the support of file systems such as Google File System (MapReduce), HDFS
(Hadoop), and Cosmos (Dryad) with compute-data affinity optimized for data processing. It
could be possible to link DPFS to basic blob and drive-based architecture, but it’s simpler to use
DPFS as an application-centric storage model with compute-data affinity and blobs and drives
as the repository-centric view.
In general, data transport will be needed to link these two data views. It seems important to
consider this carefully, as DPFS file systems are precisely designed for efficient execution of
data-intensive applications. However, the importance of DPFS for linkage with Amazon and
Azure is not clear, as these clouds do not currently offer fine-grained support for compute-data
affinity. We note here that Azure Affinity Groups are one interesting capability. We expect that
initially blobs, drives, tables, and queues will be the areas where academic systems will most
usefully provide a platform similar to Azure (and Amazon). Note the HDFS (Apache) and Sector
(UIC) projects in this area.
SQL and Relational Databases
Both Amazon and Azure clouds offer relational databases and it is straightforward for academic
systems to offer a similar capability unless there are issues of huge scale where, in fact,
approaches based on tables and/or MapReduce might be more appropriate [8]. As one early
user, we are developing on FutureGrid a new private cloud computing model for the
Observational Medical Outcomes Partnership (OMOP) for patient-related medical data which
uses Oracle and SAS where FutureGrid is adding Hadoop for scaling to many different analysis
methods.
Note that databases can be used to illustrate two approaches to deploying capabilities.
Traditionally, one would add database software to that found on computer disks. This software
is executed, providing your database instance. However, on Azure and Amazon, the database is
installed on a separate VM independent from your job (worker roles in Azure). This implements
“SQL as a Service.” It may have some performance issues from the messaging interface, but the
“aaS” deployment clearly simplifies one’s system. For N platform features, one only needs N
services, whereas the number of possible images with alternative approaches is a prohibitive 2^N.
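The arithmetic behind this claim is simple: each subset of the N features could otherwise require its own prebuilt image. A two-line illustration (N = 10 is an arbitrary example):

```python
# Feature deployment arithmetic: with "aaS", each of N platform features
# is one shared service; with per-image installation, any subset of the
# N features might need its own prebuilt VM image.
N = 10
services_needed = N            # one service per feature
images_needed = 2 ** N         # one image per subset of features
```

Already at ten features the service approach needs 10 deployments versus 1,024 possible images.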
Table and NOSQL Nonrelational Databases:
A substantial number of important developments have occurred regarding simplified database
structures—termed “NOSQL”—typically emphasizing distribution and scalability. These are
present in the three major clouds: BigTable in Google, SimpleDB in Amazon, and Azure Table
for Azure. Tables are clearly important in science as illustrated by the VOTable standard in
astronomy and the popularity of Excel. However, there does not appear to be substantial
experience in using tables outside clouds.
There are, of course, many important uses of nonrelational databases, especially in terms of
triple stores for metadata storage and access. Recently, there has been interest in building
scalable RDF triple stores based on MapReduce and tables or the Hadoop File System, with
early success reported on very large stores. The current cloud tables fall into two groups: Azure
Table and Amazon SimpleDB are quite similar and support lightweight storage for “document
stores,” while BigTable aims to manage large distributed data sets without size limitations.
All these tables are schema-free (each record can have different properties), although BigTable
has a schema for column (property) families. It seems likely that tables will grow in importance
for scientific computing, and academic systems could support this using two Apache projects:
Hbase for BigTable and CouchDB for a document store. Another possibility is the open source
SimpleDB implementation M/DB. The new Simple Cloud APIs for file storage, document
storage services, and simple queues could help in providing a common environment between
academic and commercial clouds.
Queuing Services
Both Amazon and Azure offer similar scalable, robust queuing services that are used to
communicate between the components of an application. The messages are short (less than 8
KB) and have a Representational State Transfer (REST) service interface with “deliver at least
once” semantics. Delivery is controlled by timeouts that bound the length of time a client is
allowed to process a message. One can build a similar approach (on the typically smaller and less challenging
academic environments), basing it on publish-subscribe systems such as ActiveMQ or
NaradaBrokering with which we have substantial experience.
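A minimal sketch of “deliver at least once” semantics with a visibility timeout, in Python. The class and method names are invented for illustration; real services such as Amazon's and Azure's queues expose this behavior through a REST interface:

```python
import time

class VisibilityQueue:
    """At-least-once queue sketch: a received message is hidden, not
    deleted. If the client does not delete it before the visibility
    timeout expires, the message reappears and is delivered again."""

    def __init__(self, visibility_timeout=0.05):
        self.timeout = visibility_timeout
        self.messages = {}            # id -> (body, visible_after)
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def receive(self):
        now = time.monotonic()
        for mid, (body, visible_after) in self.messages.items():
            if now >= visible_after:
                self.messages[mid] = (body, now + self.timeout)  # hide it
                return mid, body
        return None                   # nothing currently visible

    def delete(self, mid):
        self.messages.pop(mid, None)  # client finished processing

q = VisibilityQueue()
q.send("task")
mid, body = q.receive()               # message becomes invisible...
assert q.receive() is None            # ...so nobody else gets it now
time.sleep(0.06)                      # client "crashes": timeout passes
mid2, body2 = q.receive()             # redelivered: at-least-once
q.delete(mid2)                        # explicit delete ends processing
```

Because a crashed client simply never deletes its message, the timeout guarantees the message is eventually processed, at the cost of possible duplicate deliveries.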
Parallel & Distributed Programming Paradigms
Running a parallel program on a distributed computing system (parallel and distributed
programming) has several advantages for both users and distributed computing systems. From
the users’ perspective, it decreases application response time; from the distributed computing
systems standpoint, it increases throughput and resource utilization. Running a parallel
program on a distributed computing system, however, could be a very complicated process.
MapReduce, Hadoop, and Dryad are three of the most recently proposed parallel and
distributed programming models. They were developed for information retrieval applications
but have been shown to be applicable for a variety of important applications. Further, the loose
coupling of components in these paradigms makes them suitable for VM implementation and
leads to much better fault tolerance and scalability for some applications than traditional
parallel computing models such as MPI.
MapReduce, Twister, and Iterative MapReduce:
This software framework abstracts the data flow of running a parallel program on a distributed
computing system by providing users with two interfaces in the form of two functions: Map and
Reduce. Users can override these two functions to interact with and manipulate the data flow
of running their programs. The “value” part of a (key, value) pair is the actual data, and the
“key” part is used only by the MapReduce controller to control the data flow.
Fig: MapReduce Framework
MapReduce Actual Data and Control Flow
The main responsibility of the MapReduce framework is to efficiently run a user’s program on a
distributed computing system. Therefore, the MapReduce framework meticulously handles all
partitioning, mapping, synchronization, communication, and scheduling details of such data
flows.
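Before walking through the steps, the overall data flow can be condensed into a toy, single-process word-count MapReduce in Python. This is purely illustrative; the real framework runs the same phases (map, sort/group, reduce) partitioned across many machines:

```python
from itertools import groupby

def map_fn(key, value):
    # key: document name (unused here); value: the document's text.
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    yield key, sum(values)

def mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply Map to every input split.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Shuffle: sort, then group all pairs that share a key.
    intermediate.sort(key=lambda kv: kv[0])
    grouped = ((k, [v for _, v in run])
               for k, run in groupby(intermediate, key=lambda kv: kv[0]))
    # Reduce phase: one Reduce call per unique key.
    return dict(kv for k, vs in grouped for kv in reduce_fn(k, vs))

counts = mapreduce([("d1", "a b a"), ("d2", "b c")], map_fn, reduce_fn)
```

The user supplies only `map_fn` and `reduce_fn`; everything else in `mapreduce` corresponds to work the framework performs automatically in the steps below.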
1. Data partitioning The MapReduce library splits the input data (files), already stored in
GFS, into M pieces that also correspond to the number of map tasks.
2. Computation partitioning This is implicitly handled (in the MapReduce framework) by
obliging users to write their programs in the form of the Map and Reduce functions.
Therefore, the MapReduce library only generates copies of a user program (e.g., by a
fork system call) containing the Map and the Reduce functions, distributes them, and
starts them up on a number of available computation engines.
3. Determining the master and workers The MapReduce architecture is based on a
master-worker model. Therefore, one of the copies of the user program becomes the
master and the rest become workers. The master picks idle workers, and assigns the
map and reduce tasks to them. A map/reduce worker is typically a computation engine
such as a cluster node that runs map/reduce tasks by executing Map/Reduce functions.
Steps 4–7 describe the map workers.
4. Reading the input data (data distribution) Each map worker reads its corresponding
portion of the input data, namely the input data split, and sends it to its Map function.
Although a map worker may run more than one Map function, which means it has been
assigned more than one input data split, each worker is usually assigned one input split
only.
5. Map function Each Map function receives the input data split as a set of (key, value)
pairs to process and produces the intermediate (key, value) pairs.
6. Combiner function This is an optional local function within the map worker which
applies to intermediate (key, value) pairs. The user can invoke the Combiner function
inside the user program. The Combiner function runs the same code written by users for
the Reduce function as its functionality is identical to it. The Combiner function merges
the local data of each map worker before sending it over the network to effectively
reduce its communication costs. As mentioned in our discussion of logical data flow, the
MapReduce framework sorts and groups the data before it is processed by the Reduce
function. Similarly, the MapReduce framework will also sort and group the local data on
each map worker if the user invokes the Combiner function.
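A sketch of what the Combiner buys, assuming the word-count style of (word, 1) pairs: it runs the same summation the Reduce function would, but locally on one map worker, so far fewer pairs cross the network:

```python
def combiner(pairs):
    # Runs the same summation a word-count Reduce would, but locally on
    # one map worker, so fewer pairs are shipped over the network.
    local = {}
    for key, value in pairs:
        local[key] = local.get(key, 0) + value
    return sorted(local.items())     # sorted and grouped, as the text notes

map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1)]  # 4 pairs locally
to_network = combiner(map_output)                      # shrinks to 2
```

Four intermediate pairs collapse into two before transmission, which is exactly the communication-cost reduction described above.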
7. Partitioning function As mentioned in our discussion of the MapReduce data flow, the
intermediate (key, value) pairs with identical keys are grouped together because all
values inside each group should be processed by only one Reduce function to generate
the final result. However, in real implementations, since there are M map and R reduce
tasks, intermediate (key, value) pairs with the same key might be produced by different
map tasks, although they should be grouped and processed together by one Reduce
function only.
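A common way to realize the partitioning function is to hash the key modulo the number R of reduce tasks; the stable checksum used in this sketch is an illustrative choice, and a real implementation's hash function may differ:

```python
import zlib

def partition(key, R):
    # Stable hash partitioning: the region depends only on the key, so
    # (key, value) pairs produced by *different* map tasks still end up
    # at the same Reduce task whenever their keys are identical.
    return zlib.crc32(key.encode()) % R

R = 4                                  # number of reduce tasks
region = partition("apple", R)         # always the same region for "apple"
```

Determinism is the whole point: no matter which of the M map tasks emits a pair for a given key, the pair lands in the same one of the R regions.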
8. Synchronization MapReduce applies a simple synchronization policy to coordinate map
workers with reduce workers, in which the communication between them starts when
all map tasks finish.
9. Communication Reduce worker i, already notified of the location of region i of all map
workers, uses a remote procedure call to read the data from the respective region of all
map workers. Since all reduce workers read the data from all map workers, all-to-all
communication among all map and reduce workers, which incurs network congestion,
occurs in the network. This issue is one of the major bottlenecks in increasing the
performance of such systems. A data transfer module was proposed to schedule data
transfers independently.
10. Sorting and Grouping When the process of reading the input data is finalized by a
reduce worker, the data is initially buffered in the local disk of the reduce worker. Then
the reduce worker groups the intermediate (key, value) pairs by sorting the data based on
their keys, followed by grouping all occurrences of identical keys. Note that the buffered
data must be sorted and grouped because the number of unique keys produced by a map
worker may be greater than R, so more than one key can exist in each region of a
map worker.
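The sort-then-group step can be sketched with Python's `itertools.groupby`, which, like the buffered data described here, requires its input to be sorted by key first:

```python
from itertools import groupby

# Pairs as buffered on a reduce worker, arriving in arbitrary order:
buffered = [("b", 2), ("a", 1), ("b", 3), ("a", 4)]
buffered.sort(key=lambda kv: kv[0])        # sort by key first...
grouped = {key: [v for _, v in run]        # ...then group adjacent runs
           for key, run in groupby(buffered, key=lambda kv: kv[0])}
```

Without the sort, `groupby` would emit the same key multiple times, which is why the framework sorts before grouping.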
11. Reduce function The reduce worker iterates over the grouped (key, value) pairs, and for
each unique key, it sends the key and corresponding values to the Reduce function.
Then this function processes its input data and stores the output results in
predetermined files in the user’s program.
Hadoop Library from Apache
Hadoop is an open source implementation of MapReduce coded and released in Java (rather
than C) by Apache. The Hadoop implementation of MapReduce uses the Hadoop Distributed
File System (HDFS) as its underlying layer rather than GFS. The Hadoop core is divided into two
fundamental layers: the MapReduce engine and HDFS. The MapReduce engine is the
computation engine running on top of HDFS as its data storage manager. The following two
sections cover the details of these two fundamental layers.
HDFS: HDFS is a distributed file system inspired by GFS that organizes files and stores their data
on a distributed computing system.
HDFS Architecture: HDFS has a master/slave architecture containing a single NameNode as the
master and a number of DataNodes as workers (slaves). To store a file in this architecture, HDFS
splits the file into fixed-size blocks (e.g., 64 MB) and stores them on workers (DataNodes). The
mapping of blocks to DataNodes is determined by the NameNode. The NameNode (master)
also manages the file system’s metadata and namespace. In such systems, the namespace is the
area maintaining the metadata, and metadata refers to all the information stored by a file
system that is needed for overall management of all files. For example, NameNode in the
metadata stores all information regarding the location of input splits/blocks in all DataNodes.
Each DataNode, usually one per node in a cluster, manages the storage attached to the node. Each
DataNode is responsible for storing and retrieving its file blocks.
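The block-splitting arithmetic can be made concrete with a short sketch, assuming the 64 MB block size mentioned above (the function and the `block_map` structure are illustrative, not HDFS APIs):

```python
BLOCK_SIZE = 64 * 2**20                 # the 64 MB block size from above

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    # A file becomes ceil(file_size / block_size) fixed-size blocks;
    # each entry is (offset, length); the last block may be shorter.
    n = (file_size + block_size - 1) // block_size
    return [(i * block_size, min(block_size, file_size - i * block_size))
            for i in range(n)]

blocks = split_into_blocks(150 * 2**20)      # a 150 MB file
# NameNode-style metadata: block index -> DataNodes holding replicas
# (filled in by the NameNode's placement policy).
block_map = {i: [] for i in range(len(blocks))}
```

A 150 MB file yields three blocks: two full 64 MB blocks and a final 22 MB block; the NameNode keeps only the per-block mapping, while the block contents live on the DataNodes.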
HDFS Features: Distributed file systems have special requirements, such as performance,
scalability, concurrency control, fault tolerance, and security requirements [62], to operate
efficiently. However, because HDFS is not a general-purpose file system, as it only executes
specific types of applications, it does not need all the requirements of a general distributed file
system. For example, security has never been supported for HDFS systems. The following
discussion highlights two important characteristics of HDFS to distinguish it from other generic
distributed file systems.
HDFS Fault Tolerance: One of the main aspects of HDFS is its fault tolerance characteristic.
Since Hadoop is designed to be deployed on low-cost hardware by default, a hardware failure in
this system is considered to be common rather than an exception. Therefore, Hadoop considers
the following issues to fulfill reliability requirements of the file system.
Block replication To reliably store data in HDFS, file blocks are replicated in this system. In
other words, HDFS stores a file as a set of blocks and each block is replicated and distributed
across the whole cluster. The replication factor is set by the user and is three by default.
Replica placement The placement of replicas is another factor to fulfill the desired fault
tolerance in HDFS. Although storing replicas on different nodes (DataNodes) located in different
racks across the whole cluster provides more reliability, it is sometimes ignored as the cost of
communication between two nodes in different racks is relatively high in comparison with that
of different nodes located in the same rack. Therefore, sometimes HDFS compromises its
reliability to achieve lower communication costs. For example, for the default replication factor
of three, HDFS stores one replica in the same node the original data is stored, one replica on a
different node but in the same rack, and one replica on a different node in a different rack to
provide three copies of the data.
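The default placement policy described above can be sketched as follows. This is a simplified model (real HDFS also weighs node load and available space), and the node and rack names are invented:

```python
def place_replicas(writer_node, nodes_by_rack):
    """Sketch of the default 3-replica policy: one copy on the writer's
    own node, one on another node in the same rack (cheap transfer),
    and one in a different rack (survives a whole-rack failure)."""
    local_rack = next(rack for rack, nodes in nodes_by_rack.items()
                      if writer_node in nodes)
    same_rack_node = next(n for n in nodes_by_rack[local_rack]
                          if n != writer_node)
    other_rack = next(rack for rack in nodes_by_rack if rack != local_rack)
    return [writer_node, same_rack_node, nodes_by_rack[other_rack][0]]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
replicas = place_replicas("n1", racks)
```

Two of the three copies stay in the writer's rack, capturing the trade-off in the text: most replication traffic is cheap intra-rack communication, while the single cross-rack copy preserves rack-failure tolerance.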
Heartbeat and Blockreport messages Heartbeats and Blockreports are periodic messages sent
to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat implies that the
DataNode is functioning properly, while each Blockreport contains a list of all blocks on a
DataNode. The NameNode receives such messages because it is the sole decision maker of all
replicas in the system.
HDFS High-Throughput Access to Large Data Sets (Files): Because HDFS is primarily designed
for batch processing rather than interactive processing, data access throughput in HDFS is more
important than latency. Also, because applications run on HDFS typically have large data sets,
individual files are broken into large blocks (e.g., 64 MB) to allow HDFS to decrease the amount
of metadata storage required per file. This provides two advantages: The list of blocks per file
will shrink as the size of individual blocks increases, and by keeping large amounts of data
sequentially within a block, HDFS provides fast streaming reads of data.
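The metadata saving can be made concrete with a back-of-the-envelope calculation; `num_blocks` is a hypothetical helper, and the block counts follow from simple ceiling division.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def num_blocks(file_size, block_size):
    # Ceiling division: a partial final block still needs a metadata entry.
    return -(-file_size // block_size)

# A 1 GB file: 4 KB blocks (typical local file system) vs. 64 MB HDFS blocks.
small_blocks = num_blocks(1 * GB, 4 * 1024)   # hundreds of thousands of entries
large_blocks = num_blocks(1 * GB, 64 * MB)    # a handful of entries
```

The per-file block list shrinks by four orders of magnitude, which is why the NameNode can keep all metadata in memory.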
HDFS Operation: The control flow of HDFS operations such as write and read can properly
highlight roles of the NameNode and DataNodes in the managing operations. In this section,
the control flow of the main operations of HDFS on files is further described to manifest the
interaction between the user, the NameNode, and the DataNodes in such systems.
Reading a file To read a file in HDFS, a user sends an “open” request to the NameNode to get
the location of file blocks. For each file block, the NameNode returns the address of a set of
DataNodes containing replica information for the requested file. The number of addresses
depends on the number of block replicas. Upon receiving such information, the user calls the
read function to connect to the closest DataNode containing the first block of the file. After the
first block is streamed from the respective DataNode to the user, the established connection is
terminated and the same process is repeated for all blocks of the requested file until the whole
file is streamed to the user.
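The read control flow above can be sketched as follows; all interfaces here (`get_block_locations`, `stream_block`, `pick_closest`) are hypothetical stand-ins for the client/NameNode/DataNode interaction, not the real HDFS API.

```python
def read_file(namenode, filename, pick_closest):
    """Sketch of the HDFS read path described above.

    namenode.get_block_locations(filename) -> list of (block_id, datanodes)
    pick_closest(datanodes) -> the replica the client should stream from
    """
    data = b""
    # One "open" request yields the locations of every block of the file.
    for block_id, datanodes in namenode.get_block_locations(filename):
        dn = pick_closest(datanodes)        # connect to the closest replica
        data += dn.stream_block(block_id)   # stream the block, drop the connection
    return data
```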
Writing to a file To write a file in HDFS, a user sends a “create” request to the NameNode to
create a new file in the file system namespace. If the file does not exist, the NameNode notifies
the user and allows him to start writing data to the file by calling the write function. The first
block of the file is written to an internal queue termed the data queue while a data streamer
monitors its writing into a DataNode. Since each file block needs to be replicated by a
predefined factor, the data streamer first sends a request to the NameNode to get a list of
suitable DataNodes to store replicas of the first block.
The streamer then stores the block in the first allocated DataNode. Afterward, the block is
forwarded to the second DataNode by the first DataNode. The process continues until all
allocated DataNodes receive a replica of the first block from the previous DataNode. Once this
replication process is finalized, the same process starts for the second block and continues until
all blocks of the file are stored and replicated on the file system.
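The chained replication above can be modeled with a few lines; `DataNode` and `pipeline_write` are toy names for illustration only.

```python
class DataNode:
    """Minimal stand-in for a DataNode: it just keeps received blocks."""
    def __init__(self):
        self.blocks = []
    def store(self, block):
        self.blocks.append(block)

def pipeline_write(block, datanodes):
    """Sketch of the pipeline above: the streamer hands the block to the
    first DataNode only; each node stores it, then forwards it downstream."""
    if not datanodes:
        return
    datanodes[0].store(block)               # store locally first
    pipeline_write(block, datanodes[1:])    # then forward to the next node
```

The client pays for only one transfer; the DataNodes themselves carry the replication traffic.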
Architecture of MapReduce in Hadoop
The topmost layer of Hadoop is the MapReduce engine that manages the data flow and control
flow of MapReduce jobs over distributed computing systems. Similar to HDFS, the MapReduce
engine also has a master/slave architecture consisting of a single JobTracker as the master and
a number of TaskTrackers as the slaves (workers). The JobTracker manages the MapReduce job
over a cluster and is responsible for monitoring jobs and assigning tasks to TaskTrackers. The
TaskTracker manages the execution of the map and/or reduce tasks on a single computation
node in the cluster.
Each TaskTracker node has a number of simultaneous execution slots, each executing either a
map or a reduce task. Slots are defined as the number of simultaneous threads supported by
CPUs of the TaskTracker node. For example, a TaskTracker node with N CPUs, each supporting
M threads, has M * N simultaneous execution slots. It is worth noting that each data block is
processed by one map task running on a single slot. Therefore, there is a one-to-one
correspondence between map tasks in a TaskTracker and data blocks in the respective
DataNode.
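The slot arithmetic above can be checked directly; `schedule_map_tasks` is a hypothetical helper, not part of Hadoop.

```python
def schedule_map_tasks(num_blocks, num_cpus, threads_per_cpu):
    """A TaskTracker with N CPUs of M threads each has M * N slots; with one
    map task per data block, blocks are processed in ceil(blocks/slots) waves."""
    slots = num_cpus * threads_per_cpu
    waves = -(-num_blocks // slots)   # ceiling division
    return slots, waves
```

For example, a node with 2 CPUs of 2 threads each processing 10 local blocks runs 4 map tasks at a time, in 3 waves.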
Figure: MapReduce in Hadoop
Programming Support of Google App Engine
Programming the Google App Engine
Several web resources (e.g., http://code.google.com/appengine/) and specific books and
articles (e.g., www.byteonic.com/2009/overview-of-java-support-in-google-app-engine/)
discuss how to program GAE.
There are several powerful constructs for storing and accessing data. The data store is a NoSQL
data management system for entities that can be, at most, 1 MB in size and are labeled by a set
of schema-less properties. Queries can retrieve entities of a given kind, filtered and sorted by
the values of the properties. Java offers Java Data Objects (JDO) and Java Persistence API (JPA)
interfaces implemented by the open source DataNucleus Access Platform, while Python has a
SQL-like query language called GQL. The data store is strongly consistent and uses optimistic
concurrency control.
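To illustrate the kind of filter-and-sort query GQL expresses, here is an in-memory sketch: the `Greeting` kind, its `author` and `date` properties, and the `query_greetings` helper are hypothetical names, and the list of dicts merely stands in for datastore entities.

```python
# Equivalent GQL (hypothetical kind and properties):
#   SELECT * FROM Greeting WHERE author = :1 ORDER BY date DESC
def query_greetings(entities, author):
    """Filter entities by one property value, then sort by another."""
    matches = [e for e in entities if e["author"] == author]
    return sorted(matches, key=lambda e: e["date"], reverse=True)
```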
Figure: Programming Environment for Google App Engine
An update of an entity occurs in a transaction that is retried a fixed number of times if other
processes are trying to update the same entity simultaneously. Your application can execute
multiple data store operations in a single transaction which either all succeed or all fail
together. The data store implements transactions across its distributed network using “entity
groups.” A transaction manipulates entities within a single group. Entities of the same group
are stored together for efficient execution of transactions. Your GAE application can assign
entities to groups when the entities are created. The performance of the data store can be
enhanced by in-memory caching using the memcache, which can also be used independently of
the data store.
Recently, Google added the blobstore which is suitable for large files as its size limit is 2 GB.
There are several mechanisms for incorporating external resources. The Google SDC Secure
Data Connection can tunnel through the Internet and link your intranet to an external GAE
application. The URL Fetch operation provides the ability for applications to fetch resources and
communicate with other hosts over the Internet using HTTP and HTTPS requests. There is a
specialized mail mechanism to send e-mail from your GAE application.
Applications can access resources on the Internet, such as web services or other data, using
GAE’s URL fetch service. The URL fetch service retrieves web resources using the same
highspeed Google infrastructure that retrieves web pages for many other Google products.
There are dozens of Google “corporate” facilities including maps, sites, groups, calendar, docs,
and YouTube, among others. These support the Google Data API which can be used inside GAE.
An application can use Google Accounts for user authentication. Google Accounts handles user
account creation and sign-in, and a user that already has a Google account (such as a Gmail
account) can use that account with your app. GAE provides the ability to manipulate image data
using a dedicated Images service which can resize, rotate, flip, crop, and enhance images. An
application can perform tasks outside of responding to web requests. Your application can
perform these tasks on a schedule that you configure, such as on a daily or hourly basis using
“cron jobs,” handled by the Cron service.
Alternatively, the application can perform tasks added to a queue by the application itself, such
as a background task created while handling a request. A GAE application is configured to
consume resources up to certain limits or quotas. With quotas, GAE ensures that your
application won’t exceed your budget, and that other applications running on GAE won’t
impact the performance of your app. In particular, GAE use is free up to certain quotas.
Google File System (GFS)
GFS was built primarily as the fundamental storage service for Google’s search engine. As the
size of the web data that was crawled and saved was quite substantial, Google needed a
distributed file system to redundantly store massive amounts of data on cheap and unreliable
computers. None of the traditional distributed file systems could provide such functions and
hold such large amounts of data. In addition, GFS was designed for Google applications, and Google
applications were built for GFS. In traditional file system design, such a philosophy is not
attractive, as there should be a clear interface between applications and the file system, such as
a POSIX interface.
There are several assumptions concerning GFS. One is related to the characteristic of the cloud
computing hardware infrastructure (i.e., the high component failure rate). As servers are
composed of inexpensive commodity components, it is the norm rather than the exception that
concurrent failures will occur all the time. Another concerns the file size in GFS. GFS typically will
hold a large number of huge files, each 100 MB or larger, with files that are multiple GB in size
quite common. Thus, Google has chosen its file data block size to be 64 MB instead of the 4 KB
in typical traditional file systems. The I/O pattern in the Google application is also special. Files
are typically written once, and the write operations are often the appending data blocks to the
end of files. Multiple appending operations might be concurrent. There will be a lot of large
streaming reads and only a little random access. As for large streaming reads, highly sustained
throughput is much more important than low latency.
Thus, Google made some special decisions regarding the design of GFS. As noted earlier, a 64
MB block size was chosen. Reliability is achieved by using replications (i.e., each chunk or data
block of a file is replicated across more than three chunk servers). A single master coordinates
access as well as keeps the metadata. This decision simplified the design and management of
the whole cluster. Developers do not need to consider many difficult issues in distributed
systems, such as distributed consensus. There is no data cache in GFS because large streaming
reads and writes exhibit neither temporal nor spatial locality. GFS provides a similar, but not
identical, POSIX file system access interface. The distinct difference is that the application can
even see the physical location of file blocks; such a scheme can benefit upper-layer applications.
The customized API simplifies the problem and lets developers focus on Google applications. The customized
API adds snapshot and record append operations to facilitate the building of Google
applications.
Figure: GFS Architecture
The above figure shows the GFS architecture. It is quite obvious that there is a single master in
the whole cluster. Other nodes act as the chunk servers for storing data, while the single master
stores the metadata. The file system namespace and locking facilities are managed by the
master. The master periodically communicates with the chunk servers to collect management
information as well as give instructions to the chunk servers to do work such as load balancing
or fail recovery.
The master has enough information to keep the whole cluster in a healthy state. With a single
master, many complicated distributed algorithms can be avoided and the design of the system
can be simplified. However, this design does have a potential weakness, as the single GFS
master could be the performance bottleneck and the single point of failure. To mitigate this,
Google uses a shadow master to replicate all the data on the master, and the design guarantees
that all the data operations are performed directly between the client and the chunk server.
The control messages are transferred between the master and the clients and they can be
cached for future use. With the current quality of commodity servers, the single master can
handle a cluster of more than 1,000 nodes.
The above figure shows the data mutation (write, append operations) in GFS. Data blocks must
be created for all replicas. The goal is to minimize involvement of the master. The mutation
takes the following steps:
1. The client asks the master which chunk server holds the current lease for the chunk and
the locations of the other replicas. If no one has a lease, the master grants one to a
replica it chooses (not shown).
2. The master replies with the identity of the primary and the locations of the other
(secondary) replicas. The client caches this data for future mutations. It needs to contact
the master again only when the primary becomes unreachable or replies that it no
longer holds a lease.
3. The client pushes the data to all the replicas. A client can do so in any order. Each chunk
server will store the data in an internal LRU buffer cache until the data is used or aged
out. By decoupling the data flow from the control flow, performance can be improved by
scheduling the expensive data flow based on the network topology, regardless of which
chunk server is the primary.
4. Once all the replicas have acknowledged receiving the data, the client sends a write
request to the primary. The request identifies the data pushed earlier to all the replicas.
The primary assigns consecutive serial numbers to all the mutations it receives, possibly
from multiple clients, which provides the necessary serialization. It applies the mutation
to its own local state in serial order.
5. The primary forwards the write request to all secondary replicas. Each secondary replica
applies mutations in the same serial number order assigned by the primary.
6. The secondaries all reply to the primary indicating that they have completed the
operation.
7. The primary replies to the client. Any errors encountered at any replicas are reported to
the client. In case of errors, the write may have succeeded at the primary and an
arbitrary subset of the secondary replicas; the client request is considered to have
failed, and the modified region is left in an inconsistent state. The client code handles
such errors by retrying the failed mutation, making a few attempts at steps 3 through 7
before falling back to a retry from the beginning of the write.
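The separation of data flow (step 3) from control flow (steps 4 and 5) can be sketched as follows; `Replica`, `gfs_write`, and `next_serial` are hypothetical names used only to illustrate the serialization by the primary.

```python
class Replica:
    """Toy chunk replica: data pushed in step 3 is staged in a buffer,
    then applied in serial order by the control flow (steps 4-5)."""
    def __init__(self):
        self.staged = {}   # stands in for the LRU buffer cache
        self.log = []      # mutations applied, in serial order

    def push(self, data_id, data):      # step 3: data flow only
        self.staged[data_id] = data

    def apply(self, serial, data_id):   # steps 4-5: control flow
        self.log.append((serial, self.staged.pop(data_id)))

def gfs_write(primary, secondaries, data_id, data, next_serial):
    for r in [primary] + secondaries:   # step 3: push data, in any order
        r.push(data_id, data)
    serial = next_serial()              # step 4: primary picks a serial number
    primary.apply(serial, data_id)
    for r in secondaries:               # step 5: same serial order everywhere
        r.apply(serial, data_id)
    return serial                       # step 7: reply to the client
```

Because every replica applies mutations in the serial order chosen by the primary, all replica logs end up identical even when many clients push data concurrently.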
Programming on Amazon AWS and Microsoft Azure
Programming on Amazon EC2
Amazon was the first company to introduce VMs in application hosting. Customers can rent
VMs instead of physical machines to run their own applications. By using VMs, customers can
load any software of their choice. The elastic feature of such a service is that a customer can
create, launch, and terminate server instances as needed, paying by the hour for active servers.
Amazon provides several types of preinstalled VMs. Instances are often called Amazon Machine
Images (AMIs) which are preconfigured with operating systems based on Linux or Windows,
and additional software.
Figure: Amazon EC2 Execution environment
Standard instances are well suited for most applications.
Micro instances provide a small number of consistent CPU resources and allow you to
burst CPU capacity when additional cycles are available. They are well suited for lower
throughput applications and web sites that consume significant compute cycles
periodically.
High-memory instances offer large memory sizes for high-throughput applications,
including database and memory caching applications.
High-CPU instances have proportionally more CPU resources than memory (RAM) and
are well suited for compute-intensive applications.
Cluster compute instances provide proportionally high CPU resources with increased
network performance and are well suited for high-performance computing (HPC)
applications and other demanding network-bound applications. They use 10 Gigabit
Ethernet interconnections.
The cost in the third column is expressed in terms of EC2 Compute Units (ECUs), where one
ECU provides the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor.
Amazon Simple Storage Service (S3)
Amazon S3 provides a simple web services interface that can be used to store and retrieve any
amount of data, at any time, from anywhere on the web. S3 provides the object-oriented
storage service for users. Users can access their objects through Simple Object Access Protocol
(SOAP) with either browsers or other client programs which support SOAP. A related service,
SQS, is responsible for ensuring a reliable message service between two processes, even if the
receiver processes are not running.
The fundamental operation unit of S3 is called an object. Each object is stored in a bucket and
retrieved via a unique, developer-assigned key. In other words, the bucket is the container of
the object. Besides unique key attributes, the object has other attributes such as values,
metadata, and access control information. From the programmer’s perspective, the storage
provided by S3 can be viewed as a very coarse-grained key-value store. Through the key-value
programming interface, users can write, read, and delete objects containing from 1 byte to 5
gigabytes of data each. There are two types of web service interface for the user to access the
data stored in Amazon clouds. One is a REST (web 2.0) interface, and the other is a SOAP
interface. Here are some key features of S3:
Redundant through geographic dispersion
Designed to provide 99.999999999 percent durability and 99.99 percent availability of
objects over a given year with cheaper reduced redundancy storage (RRS).
Authentication mechanisms to ensure that data is kept secure from unauthorized
access. Objects can be made private or public, and rights can be granted to specific
users.
Per-object URLs and ACLs (access control lists).
Default download protocol of HTTP. A BitTorrent protocol interface is provided to lower
costs for high-scale distribution.
$0.055 (for more than 5,000 TB) to $0.15 per GB per month for storage (depending on
total amount).
First 1 GB per month of input or output free, and then $0.08 to $0.15 per GB for transfers
outside an S3 region.
There is no data transfer charge for data transferred between Amazon EC2 and Amazon
S3 within the same region or for data transferred between the Amazon EC2 Northern
Virginia region and the Amazon S3 U.S. Standard region (as of October 6, 2010).
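The bucket/key/object model above can be captured in a short toy class; `Bucket` and its methods are an illustration of the S3 semantics described in the text, not the real S3 API.

```python
class Bucket:
    """Toy model of the S3 object model above: a bucket is a container
    mapping developer-assigned keys to objects (value plus metadata)."""
    MAX_OBJECT = 5 * 1024 ** 3   # the 5 GB per-object limit noted above

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def put(self, key, value, metadata=None):
        if len(value) > self.MAX_OBJECT:
            raise ValueError("object exceeds the 5 GB limit")
        self.objects[key] = {"value": value, "metadata": metadata or {}}

    def get(self, key):
        return self.objects[key]["value"]

    def delete(self, key):
        del self.objects[key]
```

In the real service these put/get/delete operations are carried over the REST or SOAP interfaces mentioned above.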
Amazon Elastic Block Store (EBS) and SimpleDB
The Elastic Block Store (EBS) provides the volume block interface for saving and restoring the
virtual images of EC2 instances. Traditional EC2 instances will be destroyed after use. The status
of EC2 can now be saved in the EBS system after the machine is shut down. Users can use EBS
to save persistent data and mount to the running instances of EC2. Note that S3 is “Storage as a
Service” with a messaging interface. EBS is analogous to a distributed file system accessed by
traditional OS disk access mechanisms. EBS allows you to create storage volumes from 1 GB to
1 TB that can be mounted by EC2 instances.
Multiple volumes can be mounted to the same instance. These storage volumes behave like
raw, unformatted block devices, with user-supplied device names and a block device interface.
You can create a file system on top of Amazon EBS volumes, or use them in any other way you
would use a block device (like a hard drive). Snapshots are provided so that the data can be
saved incrementally. This can improve performance when saving and restoring data. In terms of
pricing, Amazon provides a similar pay-per-use schema as EC2 and S3. Volume storage charges
are based on the amount of storage users allocate until it is released, and is priced at $0.10 per
GB/month. EBS also charges $0.10 per 1 million I/O requests made to the storage (as of
October 6, 2010). The equivalent of EBS has been offered in open source clouds such as Nimbus.
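The incremental snapshot idea mentioned above can be sketched as saving only the blocks that changed since the last snapshot; `incremental_snapshot` is a hypothetical helper, not the EBS mechanism itself.

```python
def incremental_snapshot(volume_blocks, last_snapshot):
    """Sketch of incremental snapshotting: only blocks that differ from
    the previous snapshot are saved again, keyed by block index."""
    return {i: blk for i, blk in volume_blocks.items()
            if last_snapshot.get(i) != blk}
```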
Amazon SimpleDB Service
SimpleDB provides a simplified data model based on the relational database data model.
Structured data from users must be organized into domains. Each domain can be considered a
table. The items are the rows in the table. A cell in the table is recognized as the value for a
specific attribute (column name) of the corresponding row. This is similar to a table in a
relational database. However, it is possible to assign multiple values to a single cell in the table.
This is not permitted in a traditional relational database which wants to maintain data
consistency.
Many developers simply want to quickly store, access, and query the stored data. SimpleDB
removes the requirement to maintain database schemas with strong consistency. SimpleDB is
priced at $0.140 per Amazon SimpleDB Machine Hour consumed with the first 25 Amazon
SimpleDB Machine Hours consumed per month free (as of October 6, 2010). SimpleDB, like
Azure Table, could be called “LittleTable,” as they are aimed at managing small amounts of
information stored in a distributed table; one could say BigTable is aimed at basic big data,
whereas LittleTable is aimed at metadata. Amazon Dynamo is an early research system along
the lines of the production SimpleDB system.
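The multi-valued cells that distinguish SimpleDB from a relational table can be modeled with nested dicts; the domain, items, and attribute names below are hypothetical examples.

```python
# Toy SimpleDB domain: items are rows; each attribute (column) of an item
# may hold several values in one "cell" - not allowed in a relational table.
domain = {
    "item1": {"color": ["red", "blue"],
              "size":  ["large"]},
}

def query(domain, attr, value):
    """Return item names where `value` appears among the attribute's values."""
    return [name for name, attrs in domain.items()
            if value in attrs.get(attr, [])]
```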
Microsoft Azure Programming Support
When the system is running, services are monitored and one can access event logs,
trace/debug data, performance counters, IIS web server logs, crash dumps, and other log files.
This information can be saved in Azure storage. Note that there is no debugging capability for
running cloud applications, but debugging is done from a trace. One can divide the basic
features into storage and compute capabilities. The Azure application is linked to the Internet
through a customized compute VM called a web role supporting basic Microsoft web hosting.
Such configured VMs are often called appliances. The other important compute class is the
worker role reflecting the importance in cloud computing of a pool of compute resources that
are scheduled as needed. The roles support HTTP(S) and TCP. Roles offer the following
methods:
The OnStart() method, which is called by the Fabric on startup and allows you to
perform initialization tasks. The role reports a Busy status to the load balancer until this
method returns true.
The OnStop() method, which is called when the role is to be shut down, and gives a
graceful exit.
The Run() method, which contains the main logic.
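The role lifecycle above can be sketched in Python for illustration (the real Azure role API is .NET; the class and state names here simply mirror the text).

```python
class WorkerRole:
    """Python sketch of the role lifecycle described above."""
    def __init__(self):
        self.state = "Busy"     # load balancer sees Busy until OnStart returns

    def OnStart(self):
        # ... perform initialization tasks here ...
        self.state = "Ready"    # role now accepts traffic
        return True

    def Run(self):
        return "main logic executed"

    def OnStop(self):
        self.state = "Stopped"  # graceful exit
```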
Figure: Features of the Azure Cloud Platform
SQLAzure
All the storage modalities are accessed with REST interfaces except for the recently introduced
Drives that are analogous to Amazon EBS discussed in Section 6.4.3, and offer a file system
interface as a durable NTFS volume backed by blob storage. The REST interfaces are
automatically associated with URLs and all storage is replicated three times for fault tolerance
and is guaranteed to be consistent in access.
The basic storage system is built from blobs which are analogous to S3 for Amazon. Blobs are
arranged as a three-level hierarchy: Account → Containers → Page or Block Blobs. Containers
are analogous to directories in traditional file systems with the account acting as the root. The
block blob is used for streaming data and each such blob is made up of a sequence of blocks of
up to 4 MB each, while each block has a 64-byte ID. Block blobs can be up to 200 GB in size.
Page blobs are for random read/write access and consist of an array of pages with a maximum
blob size of 1 TB. One can associate metadata with blobs as <name, value> pairs with up to 8 KB
per blob.
Azure Tables
The Azure Table and Queue storage modes are aimed at much smaller data volumes. Queues
provide reliable message delivery and are naturally used to support work spooling between
web and worker roles. Queues consist of an unlimited number of messages which can be
retrieved and processed at least once, with an 8 KB limit on message size. Azure supports PUT,
GET, and DELETE message operations, as well as CREATE and DELETE for queues. Each account
can have any number of Azure tables which consist of rows called entities and columns called
properties.
There is no limit to the number of entities in a table and the technology is designed to scale
well to a large number of entities stored on distributed computers. All entities can have up to
255 general properties which are <name, type, value> triples. Two extra properties,
PartitionKey and RowKey, must be defined for each entity, but otherwise, there are no
constraints on the names of properties—this table is very flexible! RowKey is designed to give
each entity a unique label while PartitionKey is designed to be shared and entities with the
same PartitionKey are stored next to each other; a good use of PartitionKey can speed up
search performance. An entity can have, at most, 1MB storage; if you need large value sizes,
just store a link to a blob store in the Table property value. ADO.NET and LINQ support table
queries.
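The PartitionKey/RowKey scheme above can be modeled with a two-level dict; `insert` and `lookup` are hypothetical helpers, and the entity fields are examples.

```python
# Toy Azure Table: every entity carries PartitionKey and RowKey plus
# arbitrary extra properties; entities sharing a PartitionKey live together.
def insert(table, entity):
    table.setdefault(entity["PartitionKey"], {})[entity["RowKey"]] = entity

def lookup(table, pk, rk):
    # Point query: PartitionKey narrows the search to one partition,
    # RowKey to one entity within it - the fast path described above.
    return table[pk][rk]
```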
EMERGING CLOUD SOFTWARE ENVIRONMENTS
Open Source Eucalyptus and Nimbus
Eucalyptus is a product from Eucalyptus Systems (www.eucalyptus.com) that was developed
out of a research project at the University of California, Santa Barbara. Eucalyptus was initially
aimed at bringing the cloud computing paradigm to academic supercomputers and clusters.
Eucalyptus provides an AWS-compliant EC2-based web service interface for interacting with the
cloud service. Additionally, Eucalyptus provides services, such as the AWS-compliant Walrus,
and a user interface for managing users and images.
Eucalyptus Architecture
The Eucalyptus system is an open software environment. The architecture was presented in a
Eucalyptus white paper. The system supports cloud programmers in VM image management as
follows. Essentially, the system has been extended to support the development of both the
computer cloud and storage cloud.
VM Image Management
Eucalyptus takes many design cues from Amazon’s EC2, and its image management system is
no different. Eucalyptus stores images in Walrus, the block storage system that is analogous to
the Amazon S3 service. As such, any user can bundle her own root file system, and upload and
then register this image and link it with a particular kernel and ramdisk image. This image is
uploaded into a user-defined bucket within Walrus, and can be retrieved anytime from any
availability zone. This allows users to create specialty virtual appliances
(http://en.wikipedia.org/wiki/Virtual_appliance) and deploy them within Eucalyptus with ease.
The Eucalyptus system is available in a commercial proprietary version, as well as the open
source version we just described.
Nimbus
Nimbus [81,82] is a set of open source tools that together provide an IaaS cloud computing
solution. It allows a client to lease remote resources by deploying VMs on those resources and
configuring them to represent the environment desired by the user. To this end, Nimbus
provides a special web interface known as Nimbus Web. Its aim is to provide administrative and
user functions in a friendly interface. Nimbus Web is centered around a Python Django web
application that is intended to be deployable completely separate from the Nimbus service.
Nimbus supports two resource management strategies. The first is the default “resource pool”
mode. In this mode, the service has direct control of a pool of VM manager nodes and it
assumes it can start VMs. The other supported mode is called “pilot.” Here, the service makes
requests to a cluster’s Local Resource Management System (LRMS) to get a VM manager
available to deploy VMs. Nimbus also provides an implementation of Amazon’s EC2 interface
that allows users to use clients developed for the real EC2 system against Nimbus-based clouds.
Unit-5
Cloud Resource Management and Scheduling
Policies and Mechanisms for Resource Management
Cloud resource management policies can be loosely grouped into five classes:
1. Admission control.
2. Capacity allocation.
3. Load balancing.
4. Energy optimization.
5. Quality-of-service (QoS) guarantees.
Load balancing and energy optimization can be done locally, but global load-balancing and
energy optimization policies encounter the same difficulties as the ones we have already
discussed. Load balancing and energy optimization are correlated and affect the cost of
providing the services. Indeed, it was predicted that by 2012 up to 40% of the budget for IT
enterprise infrastructure would be spent on energy.
The common meaning of the term load balancing is that of evenly distributing the load to a set
of servers. For example, consider the case of four identical servers, A, B, C, and D, whose
relative loads are 80%, 60%, 40%, and 20% of their capacity, respectively. As a result of perfect
load balancing, all servers would end with the same load, 50% of each server’s capacity. In
cloud computing a critical goal is minimizing the cost of providing the service and, in particular,
minimizing the energy consumption. This leads to a different meaning of the term load
balancing; instead of having the load evenly distributed among all servers, we want to
concentrate it and use the smallest number of servers while switching the others to standby
mode, a state in which a server uses less energy. In our example, the load from D will migrate to
A and the load from C will migrate to B; thus, A and B will be loaded at full capacity, whereas C
and D will be switched to standby mode. Quality of service is that aspect of resource
management that is probably the most difficult to address and, at the same time, possibly the
most critical to the future of cloud computing.
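The consolidation policy above can be sketched as a first-fit-decreasing bin-packing pass; `consolidate` is a hypothetical helper, and each bin represents one server kept active at up to 100% capacity.

```python
def consolidate(loads, capacity=100):
    """Pack per-server loads onto as few active servers as possible;
    servers whose load migrates away go to standby mode."""
    bins = []   # each bin: [used_capacity, names] = one active server
    for name in sorted(loads, key=loads.get, reverse=True):
        for b in bins:                      # first fit: reuse an active server
            if b[0] + loads[name] <= capacity:
                b[0] += loads[name]
                b[1].append(name)
                break
        else:                               # no room: activate another server
            bins.append([loads[name], [name]])
    return bins
```

Run on the example loads (A 80%, B 60%, C 40%, D 20%), this packs D with A and C with B, leaving only two servers active, matching the migration described in the text.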
Virtually all optimal – or near-optimal – mechanisms to address the five classes of policies don’t
scale up and typically target a single aspect of resource management, e.g., admission control, but ignore
energy conservation. Many require complex computations that cannot be done effectively in
the time available to respond. The performance models are very complex, analytical solutions
are intractable, and the monitoring systems used to gather state information for these models
can be too intrusive and unable to provide accurate data. Many techniques are concentrated
on system performance in terms of throughput and time in system, but they rarely include
energy tradeoffs or QoS guarantees. Some techniques are based on unrealistic assumptions; for
example, capacity allocation is viewed as an optimization problem, but under the assumption
that servers are protected from overload.
Allocation techniques in computer clouds must be based on a disciplined approach rather than
ad hoc methods. The four basic mechanisms for the implementation of resource management
policies are:
Control theory. Control theory uses feedback to guarantee system stability and predict
transient behavior, but it can be used only to predict local rather than global behavior. Kalman
filters have been used for unrealistically simplified models.
Machine learning. A major advantage of machine learning techniques is that they do not need
a performance model of the system. This technique could be applied to coordination of several
autonomic system managers.
Utility-based. Utility-based approaches require a performance model and a mechanism to
correlate user-level performance with cost.
Market-oriented/economic mechanisms. Such mechanisms do not require a model of the
system, e.g., combinatorial auctions for bundles of resources.
Applications of control theory to task scheduling on a cloud
Control theory has been used to design adaptive resource management for many classes of
applications, including power management, task scheduling, QoS adaptation in Web servers,
and load balancing. The classical feedback control methods are used in all these cases to
regulate the key operating parameters of the system based on measurement of the system
output; the feedback control in these methods assumes a linear time-invariant system model
and a closed-loop controller. This controller is based on an open-loop system transfer function
that satisfies stability and sensitivity constraints.
The main components of a control system are the inputs, the control system components, and
the outputs. The inputs in such models are the offered workload and the policies for admission
control, the capacity allocation, the load balancing, the energy optimization, and the QoS
guarantees in the cloud. The system components are sensors used to estimate relevant
measures of performance and controllers that implement various policies; the output is the
resource allocations to the individual applications.
The controllers use the feedback provided by sensors to stabilize the system; stability is related
to the change of the output. If the change is too large, the system may become unstable. In our
context the system could experience thrashing: the amount of useful time dedicated to the
execution of applications becomes increasingly small and most of the system resources are
occupied by management functions.
There are three main sources of instability in any control system:
The delay in getting the system reaction after a control action.
The granularity of the control: the fact that a small change enacted by the controllers
leads to very large changes in the output.
Oscillations, which occur when the changes of the input are too large and the control is
too weak, such that the changes of the input propagate directly to the output.
Two types of policies are used in autonomic systems: (i) threshold-based policies and (ii)
sequential decision policies based on Markovian decision models. In the first case, upper and
lower bounds on performance trigger adaptation through resource reallocation. Such policies
are simple and intuitive but require setting per-application thresholds.
If upper and lower thresholds are set, instability occurs when they are too close to one another,
the variations of the workload are large enough, and the time required to adapt does not
allow the system to stabilize. The actions consist of the allocation or deallocation of one or more
virtual machines; sometimes the allocation or deallocation of a single VM required by one of the
thresholds may cause the other threshold to be crossed, and this may represent another source of
instability.
Feedback control based on dynamic thresholds
The elements involved in a control system are sensors, monitors, and actuators. The sensors
measure the parameter(s) of interest and transmit the measured values to a monitor, which
determines whether the system behavior must be changed and, if so, requests that the
actuators carry out the necessary actions. Often the parameter used for an admission control
policy is the current system load; when a threshold, e.g., 80%, is reached, the cloud stops
accepting additional load.
In practice, the implementation of such a policy is challenging or outright infeasible. First, due
to the very large number of servers and to the fact that the load changes rapidly in time, the
estimation of the current system load is likely to be inaccurate. Second, the ratio of average to
maximal resource requirements of individual users specified in a service-level agreement is
typically very high. Once an agreement is in place, user demands must be satisfied; user
requests for additional resources within the SLA limits cannot be denied.
Thresholds. A threshold is the value of a parameter related to the state of a system that
triggers a change in the system behavior. Thresholds are used in control theory to keep critical
parameters of a system in a predefined range. The threshold could be static, defined once and
for all, or it could be dynamic. A dynamic threshold could be based on an average of
measurements carried out over a time interval, a so-called integral control. The dynamic
threshold could also be a function of the values of multiple parameters at a given time or a mix
of the two.
To maintain the system parameters in a given range, a high and a low threshold are often
defined. The two thresholds determine different actions; for example, a high threshold could
force the system to limit its activities and a low threshold could encourage additional activities.
Control granularity refers to the level of detail of the information used to control the system.
Fine control means that very detailed information about the parameters controlling the system
state is used, whereas coarse control means that the accuracy of these parameters is traded for
the efficiency of implementation.
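The notion of a dynamic threshold based on an average of measurements (integral control) can be sketched as follows. The window size and the margin above the average are illustrative parameters, not values from the text:

```python
# Hedged sketch of a dynamic threshold driven by integral control:
# the threshold tracks a time average of recent measurements instead
# of being fixed once and for all.
from collections import deque

class DynamicThreshold:
    def __init__(self, window=5, margin=10):
        self.history = deque(maxlen=window)  # sliding measurement window
        self.margin = margin                 # offset above the average

    def update(self, measurement):
        self.history.append(measurement)

    def value(self):
        # integral control: the threshold follows the time average
        return sum(self.history) / len(self.history) + self.margin

t = DynamicThreshold(window=3)
for load in (50, 60, 70):
    t.update(load)
print(t.value())   # (50 + 60 + 70)/3 + 10 = 70.0
```

A static threshold would simply return a constant here; the dynamic version adapts as the workload drifts.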
Proportional Thresholding. These ideas have been applied to cloud computing, in particular to
the IaaS delivery model, in a strategy for resource management called proportional thresholding.
The questions addressed are:
Is it beneficial to have two types of controllers, (1) application controllers that
determine whether additional resources are needed and (2) cloud controllers that
arbitrate requests for resources and allocate the physical resources?
Is it feasible to consider fine control? Is coarse control more adequate in a cloud
computing environment?
Are dynamic thresholds based on time averages better than static ones?
Is it better to have a high and a low threshold, or is it sufficient to define only a high
threshold?
The essence of proportional thresholding is captured by the following algorithm:
Compute the integral value of the high and the low thresholds as averages of the
maximum and, respectively, the minimum of the processor utilization over the process
history.
Request additional VMs when the average value of the CPU utilization over the current
time slice exceeds the high threshold.
Release a VM when the average value of the CPU utilization over the current time slice
falls below the low threshold.
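The three steps above can be sketched as a single decision function. The averaging scheme and return values are illustrative; a real controller would use tuned integral control equations:

```python
# Sketch of the proportional-thresholding loop described above.

def proportional_thresholding(utilization_history, current_slice):
    """Return 'allocate', 'release', or 'hold' for the current time slice.

    utilization_history: list of past per-slice CPU utilization samples
    current_slice: list of utilization samples in the current time slice
    """
    # integral thresholds: averages of the historical maxima and minima
    high = sum(max(s) for s in utilization_history) / len(utilization_history)
    low = sum(min(s) for s in utilization_history) / len(utilization_history)
    avg = sum(current_slice) / len(current_slice)
    if avg > high:
        return "allocate"   # request additional VMs
    if avg < low:
        return "release"    # release a VM
    return "hold"

history = [[30, 80], [40, 90], [20, 70]]   # per-slice samples; high=80, low=30
print(proportional_thresholding(history, [95, 85]))  # -> 'allocate'
print(proportional_thresholding(history, [10, 20]))  # -> 'release'
```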
The conclusions reached based on experiments with three VMs are as follows: (a) dynamic
thresholds perform better than static ones and (b) two thresholds are better than one. Though
conforming to our intuition, such results have to be justified by experiments in a realistic
environment. Moreover, convincing results cannot be based on empirical values for some of
the parameters required by integral control equations.
Coordination of specialized autonomic performance managers
Virtually all modern processors support dynamic voltage scaling (DVS) as a mechanism for
energy saving. Indeed, the energy dissipation scales quadratically with the supply voltage. The
power management controls the CPU frequency and, thus, the rate of instruction execution.
For some compute-intensive workloads the performance decreases linearly with the CPU clock
frequency, whereas for others the effect of lower clock frequency is less noticeable or
nonexistent. The clock frequency of individual blades/servers is controlled by a power manager,
typically implemented in the firmware; it adjusts the clock frequency several times a second.
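The quadratic dependence of energy dissipation on supply voltage can be made concrete with a back-of-the-envelope calculation. The dynamic-power formula P = C·V²·f is standard for CMOS; the assumption that feasible clock frequency scales linearly with voltage is a common simplification, so the exact 8x figure below is illustrative:

```python
# Why DVS saves energy: dynamic CMOS power is roughly P = C * V^2 * f,
# where C is the switched capacitance. If the feasible frequency scales
# linearly with V, halving the frequency lets the voltage halve too,
# cutting dynamic power by roughly a factor of 8.

def dynamic_power(c, v, f):
    return c * v**2 * f

full = dynamic_power(c=1.0, v=1.2, f=2.0e9)   # full speed
half = dynamic_power(c=1.0, v=0.6, f=1.0e9)   # half speed, half voltage
print(full / half)   # -> 8.0
```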
Scheduling algorithms in Cloud Computing:
Fair Queuing
Fair queuing is a family of scheduling algorithms used in some process and network schedulers.
The concept implies a separate data packet queue (or job queue) for each traffic flow (or for
each program process) as opposed to the traditional approach with one FIFO queue for all
packet flows (or for all process jobs). The purpose is to achieve fairness when a limited resource
is shared, for example to avoid that flows with large packets (or processes that generate small
jobs) achieve more throughput (or CPU time) than other flows (or processes).
Principle
The main idea of fair queuing is to use one queue per packet flow and to service the queues in
rotation, such that each flow "obtains an equal fraction of the resources".
The advantage over conventional first in first out (FIFO) or static priority queuing is that a high-
data-rate flow, consisting of large or many data packets, cannot take more than its fair share of
the link capacity.
Fair queuing is used in routers, switches, and statistical multiplexers that forward packets from
a buffer. The buffer works as a queuing system, where the data packets are stored temporarily
until they are transmitted.
With a link data-rate of R, at any given time the N active data flows (the ones with non-empty
queues) are serviced each with an average data rate of R/N. Over a short time interval the data
rate may fluctuate around this value, since the packets are delivered sequentially,
depending on the scheduling used.
Algorithm details
Modeling of actual finish time, while feasible, is computationally intensive. The model needs to
be substantially recomputed every time a packet is selected for transmission and every time a
new packet arrives into any queue.
To reduce computational load, the concept of virtual time is introduced. Finish time for each
packet is computed on this alternate monotonically increasing virtual timescale. While virtual
time does not accurately model the time packets complete their transmissions, it does
accurately model the order in which the transmissions must occur to meet the objectives of the
full-featured model. Using virtual time, it is unnecessary to recompute the finish time for
previously queued packets. Although the finish time, in absolute terms, for existing packets is
potentially affected by new arrivals, finish time on the virtual timeline is unchanged: the
virtual timeline warps with respect to real time to accommodate any new transmission.
The virtual finish time for a newly queued packet is given by the sum of the virtual start time
plus the packet's size. The virtual start time is the maximum between the previous virtual finish
time of the same queue and the current instant.
With the virtual finishing times of all candidate packets (i.e., the packets at the head of all non-
empty flow queues) computed, fair queuing compares the virtual finishing times and selects the
minimum. The packet with the minimum virtual finishing time is transmitted.
Borrowed Virtual Time Scheduling
With BVT scheduling, thread execution time is monitored in terms of virtual time, dispatching the runnable thread with the earliest effective virtual time (EVT). However, a latency-sensitive thread is allowed to warp back in virtual time to make it appear earlier and thereby gain dispatch preference. It then effectively borrows virtual time from its future CPU allocation and thus does not disrupt long-term CPU sharing. Hence the name, borrowed virtual time scheduling. This algorithm is described in detail in the rest of this section. Each BVT thread includes the following state variables: its effective virtual time (EVT), its actual virtual time (AVT), its virtual time warp, and a flag that is set if warping is enabled. When a thread unblocks or the currently executing thread blocks, the scheduler runs the thread with the minimum EVT of all the runnable threads.
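The dispatch rule above can be sketched as follows. The field names and the rule that EVT equals AVT minus the warp when warping is enabled follow the usual BVT formulation, but the class layout and values are illustrative:

```python
# Minimal sketch of borrowed virtual time (BVT) dispatch: a thread's
# effective virtual time is its actual virtual time minus its warp when
# warping is enabled; the runnable thread with the smallest EVT runs.

class BVTThread:
    def __init__(self, name, avt, warp=0, warp_enabled=False):
        self.name = name
        self.avt = avt                   # actual virtual time
        self.warp = warp                 # virtual time warp
        self.warp_enabled = warp_enabled # flag: warping allowed

    @property
    def evt(self):
        # warping back makes a latency-sensitive thread appear earlier
        return self.avt - (self.warp if self.warp_enabled else 0)

def dispatch(runnable):
    return min(runnable, key=lambda t: t.evt)

threads = [BVTThread("batch", avt=100),
           BVTThread("interactive", avt=120, warp=50, warp_enabled=True)]
print(dispatch(threads).name)   # -> 'interactive' (EVT 70 beats 100)
```

The interactive thread is dispatched first even though its actual virtual time is larger; it has borrowed virtual time from its future CPU allocation.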
Resource management and dynamic application scaling
The demand for computing resources, such as CPU cycles, primary and secondary storage, and network bandwidth, depends heavily on the volume of data processed by an application. The
demand for resources can be a function of the time of day, can monotonically increase or decrease in time, or can experience predictable or unpredictable peaks. For example, a new Web service will experience a low request rate when the service is first introduced, and the load will increase exponentially if the service is successful. A service for income tax processing will experience a peak around the tax filing deadline, whereas access to a service provided by the Federal Emergency Management Agency (FEMA) will increase dramatically after a natural disaster.
The elasticity of a public cloud, the fact that it can supply to an application precisely the amount of resources it needs, and the fact that users pay only for the resources they consume are serious incentives to migrate to a public cloud. The question we address is: How can scaling actually be implemented in a cloud when a very large number of applications exhibit this often unpredictable behavior? To make matters worse, in addition to an unpredictable external load, cloud resource management has to deal with resource reallocation due to server failures.
We distinguish two scaling modes: vertical and horizontal. Vertical scaling keeps the number of VMs of an application constant but increases the amount of resources allocated to each one of them. This can be done either by migrating the VMs to more powerful servers or by keeping the VMs on the same servers but increasing their share of the CPU time. The first alternative involves additional overhead; the VM is stopped, a snapshot of it is taken, the file is transported to a more powerful server, and, finally, the VM is restarted at the new site.
Horizontal scaling is the most common mode of scaling on a cloud; it is done by increasing the number of VMs as the load increases and reducing the number of VMs when the load decreases. Often, this leads to an increase in communication bandwidth consumed by the application. Load balancing among the running VMs is critical to this mode of operation. For a very large application, multiple load balancers may need to cooperate with one another. In some instances the load balancing is done by a front-end server that distributes incoming requests of a transaction-oriented system to back-end servers.
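A horizontal-scaling decision can be sketched as keeping the average per-VM load inside a target band. The thresholds and the one-VM-at-a-time step are assumptions for the example, not a prescription:

```python
# Illustrative sketch of a horizontal-scaling decision: grow or shrink
# the VM pool so the average per-VM load stays inside a target band.

def scale_decision(total_load, vm_count, target_low=40, target_high=80):
    """total_load: aggregate load in 'percent of one VM's capacity' units."""
    per_vm = total_load / vm_count
    if per_vm > target_high:
        return vm_count + 1      # add a VM
    if per_vm < target_low and vm_count > 1:
        return vm_count - 1      # remove a VM
    return vm_count              # stay put

print(scale_decision(total_load=360, vm_count=4))   # 90 per VM -> 5
print(scale_decision(total_load=120, vm_count=4))   # 30 per VM -> 3
```

A load balancer would then redistribute requests across the new VM pool, which is why load balancing is critical to this mode of operation.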
Unit-6
Storage Systems
Evolution of storage technology
The technological capacity to store information has grown over time at an accelerated pace
1986: 2.6 EB; equivalent to less than one 730 MB CD-ROM of data per computer user.
1993: 15.8 EB; equivalent to four CD-ROMs per user.
2000: 54.5 EB; equivalent to 12 CD-ROMs per user.
2007: 295 EB; equivalent to almost 61 CD-ROMs per user.
These rapid technological advancements have changed the balance between initial investment in storage devices and system management costs. Now the cost of storage management is the dominant element of the total cost of a storage system. This effect favors the centralized storage strategy supported by a cloud; indeed, a centralized approach can automate some of the storage management functions, such as replication and backup, and thus reduce substantially the storage management cost.
While the density of storage devices has increased and the cost has decreased dramatically, the access time has improved only slightly. The performance of I/O subsystems has not kept pace with the performance of computing engines, and that affects multimedia, scientific and engineering, and other modern applications that process increasingly large volumes of data.
The storage systems face substantial pressure because the volume of data generated has increased exponentially during the past few decades; whereas in the 1980s and 1990s data was primarily generated by humans, nowadays machines generate data at an unprecedented rate. Mobile devices, such as smartphones and tablets, record static images as well as movies and have limited local storage capacity, so they transfer the data to cloud storage systems. Sensors, surveillance cameras, and digital medical imaging devices generate data at a high rate and dump it onto storage systems accessible via the Internet. Online digital libraries, ebooks, and digital media, along with reference data, add to the demand for massive amounts of storage.
As the volume of data increases, new methods and algorithms for data mining that require powerful computing systems have been developed. Only a concentration of resources could provide the CPU cycles along with the vast storage capacity necessary to perform such intensive computations and access the very large volume of data.
Although we emphasize the advantages of a concentration of resources, we have to be acutely aware that a cloud is a large-scale distributed system with a very large number of components that must work in concert. The management of such a large collection of systems poses significant challenges and requires novel approaches to systems design. Case in point: Although the early distributed file systems used custom-designed reliable components, nowadays large-scale systems are built with off-the-shelf components. The emphasis of the design philosophy has shifted from performance at any cost to reliability at the lowest possible cost. This shift is
evident in the evolution of ideas, from the early distributed file systems of the 1980s, such as the Network File System (NFS) and the Andrew File System (AFS), to today's Google File System (GFS) and Megastore.
Google File System
The Google File System (GFS) was developed in the late 1990s. It uses thousands of storage systems built from inexpensive commodity components to provide petabytes of storage to a large user community with diverse needs. It is not surprising that a main concern of the GFS designers was to ensure the reliability of a system exposed to hardware failures, system software errors, application errors, and, last but not least, human errors.
The system was designed after a careful analysis of the file characteristics and of the access models. Some of the most important aspects of this analysis reflected in the GFS design are:
Scalability and reliability are critical features of the system; they must be considered from the beginning rather than at some stage of the design.
The vast majority of files range in size from a few GB to hundreds of TB.
The most common operation is to append to an existing file; random write operations to a file are extremely infrequent.
Sequential read operations are the norm.
The users process the data in bulk and are less concerned with the response time.
The consistency model should be relaxed to simplify the system implementation, but without placing an additional burden on the application developers.
GFS files are collections of fixed-size segments called chunks; at the time of file creation each chunk is assigned a unique chunk handle. A chunk consists of 64 KB blocks, and each block has a 32-bit checksum. Chunks are stored on Linux file systems and are replicated on multiple sites; a user may change the number of replicas from the standard value of three to any desired value. The chunk size is 64 MB; this choice is motivated by the desire to optimize performance for large files and to reduce the amount of metadata maintained by the system.
A large chunk size increases the likelihood that multiple operations will be directed to the same chunk; thus it reduces the number of requests to locate the chunk and, at the same time, allows the application to maintain a persistent network connection with the server where the chunk is located. Space fragmentation occurs infrequently because the chunk for a small file and the last chunk of a large file are only partially filled.
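The chunk arithmetic implied by these sizes can be sketched directly. The `locate` helper is hypothetical; in GFS the client sends the chunk index to the master, which returns the chunk handle and location:

```python
# Sketch of the chunk arithmetic described above: 64 MB chunks made of
# 64 KB checksummed blocks; a client turns a byte offset into a chunk
# index before asking the master where that chunk lives.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB per chunk
BLOCK_SIZE = 64 * 1024          # 64 KB per block, each with a 32-bit checksum

def locate(offset):
    """Map a file byte offset to (chunk index, block index within chunk)."""
    chunk_index = offset // CHUNK_SIZE
    block_index = (offset % CHUNK_SIZE) // BLOCK_SIZE
    return chunk_index, block_index

print(CHUNK_SIZE // BLOCK_SIZE)      # 1024 checksummed blocks per chunk
print(locate(200 * 1024 * 1024))     # byte 200 MB -> chunk 3, block 128
```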
The locations of the chunks are stored only in the control structure of the master’s memory and are updated at system startup or when a new chunk server joins the cluster. This strategy allows the master to have up-to-date information about the location of the chunks.
System reliability is a major concern, and the operation log maintains a historical record of metadata changes, enabling the master to recover in case of a failure. As a result, such changes are atomic and are not made visible to the clients until they have been recorded on multiple replicas on persistent storage. To recover from a failure, the master replays the operation log.
To minimize the recovery time, the master periodically checkpoints its state and at recovery time replays only the log records after the last checkpoint.
Each chunk server is a commodity Linux system; it receives instructions from the master and responds with status information. To access a file, an application sends to the master the filename and the chunk index, the offset in the file for the read or write operation; the master responds with the chunk handle and the location of the chunk. Then the application communicates directly with the chunk server to carry out the desired file operation.
The consistency model is very effective and scalable. Operations, such as file creation, are atomic and are handled by the master. To ensure scalability, the master has minimal involvement in file mutations and operations such as write or append that occur frequently. In such cases the master grants a lease for a particular chunk to one of the chunk servers, called the primary; then, the primary creates a serial order for the updates of that chunk.
The client contacts the master, which assigns a lease to one of the chunk servers for a particular chunk if no lease for that chunk exists; then the master replies with the ID of the primary as well as secondary chunk servers holding replicas of the chunk. The client caches this information.
The client sends the data to all chunk servers holding replicas of the chunk; each one of the chunk servers stores the data in an internal LRU buffer and then sends an acknowledgment to the client.
The client sends a write request to the primary once it has received the acknowledgments from all chunk servers holding replicas of the chunk. The primary identifies mutations by consecutive sequence numbers.
The primary sends the write requests to all secondaries.
Each secondary applies the mutations in the order of the sequence numbers and then sends an acknowledgment to the primary.
Finally, after receiving the acknowledgments from all secondaries, the primary informs the client.
The system supports an efficient checkpointing procedure based on copy-on-write to construct system snapshots. A lazy garbage collection strategy is used to reclaim the space after a file deletion. In the first step the filename is changed to a hidden name and this operation is time stamped. The master periodically scans the namespace and removes the metadata for the files with a hidden name older than a few days; this mechanism gives a window of opportunity to a user who deleted files by mistake to recover the files with little effort.
Periodically, chunk servers exchange with the master the list of chunks stored on each one of them; the master supplies them with the identity of orphaned chunks whose metadata has been deleted, and such chunks are then deleted. Even when control messages are lost, a chunk server will carry out the housecleaning at the next heartbeat exchange with the master. Each chunk server maintains in core the checksums for the locally stored chunks to guarantee data integrity.
Apache Hadoop
A wide range of data-intensive applications such as marketing analytics, image processing, machine learning, and Web crawling use Apache Hadoop, an open-source, Java-based software system. Hadoop supports distributed applications handling extremely large volumes of data. Many members of the community contributed to the development and optimization of Hadoop and several related Apache projects such as Hive and HBase.
Hadoop is used by many organizations from industry, government, and research; the long list of Hadoop users includes major IT companies such as Apple, IBM, HP, Microsoft, Yahoo!, and Amazon; media companies such as The New York Times and Fox; social networks, including Twitter, Facebook, and LinkedIn; and government agencies, such as the U.S. Federal Reserve. In 2011, the Facebook Hadoop cluster had a capacity of 30 PB.
A Hadoop system has two components, a MapReduce engine and a database (see Figure 8.8). The database could be the Hadoop File System (HDFS), Amazon S3, or CloudStore, an implementation of the Google File System discussed in Section 8.5. HDFS is a distributed file system written in Java; it is portable, but it cannot be directly mounted on an existing operating system. HDFS is not fully POSIX compliant, but it is highly performant.
The Hadoop engine on the master of a multinode cluster consists of a job tracker and a task tracker, whereas the engine on a slave has only a task tracker. The job tracker receives a MapReduce job from a client and dispatches the work to the task trackers running on the nodes of a cluster. To increase efficiency, the job tracker attempts to dispatch the tasks to available slaves closest to the place where it stored the task data. The task tracker supervises the execution of the work allocated to the node. Several scheduling algorithms have been implemented in Hadoop engines, including Facebook's fair scheduler and Yahoo!'s capacity scheduler; see Section 6.8 for a discussion of cloud scheduling algorithms.
HDFS replicates data on multiple nodes. The default is three replicas; a large dataset is distributed over many nodes. The namenode running on the master manages the data distribution and data replication and communicates with data nodes running on all cluster nodes; it shares with the job tracker information about data placement to minimize communication between the nodes on which data is located and the ones where it is needed. Although HDFS can be used for applications other than those based on the MapReduce model, its performance for such applications is not at par with the ones for which it was originally designed.
BigTable
The system is based on a simple and flexible data model. It allows an application developer to exercise control over the data format and layout and reveals data locality information to the application clients. Any read or write row operation is atomic, even when it affects more than one column. The column keys identify column families, which are units of access control. The data in a column family is of the same type. Client applications written in C++ can add or delete values, search for a subset of data, and look up data in a row.
A row key is an arbitrary string of up to 64 KB, and a row range is partitioned into tablets serving as units for load balancing. The time stamps used to index various versions of the data in a cell are 64-bit integers; their interpretation can be defined by the application, whereas the default is the time of an event in microseconds. A column key consists of a string defining the family name, a set of printable characters, and an arbitrary string as qualifier.
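The data model described above, a sparse map keyed by row key, column key ("family:qualifier"), and timestamp, can be sketched with nested dictionaries. The row and column names are illustrative, and a real BigTable stores this map across tablet servers rather than in memory:

```python
# Sketch of the BigTable cell map: (row key, column key, timestamp) -> value.
# A column key has the form "family:qualifier"; timestamps version each cell.

table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    versions = table[row][column]
    return versions[max(versions)]   # the most recent timestamp wins

put("com.example/index", "contents:html", 1001, "<html>v1</html>")
put("com.example/index", "contents:html", 1002, "<html>v2</html>")
print(get_latest("com.example/index", "contents:html"))  # -> '<html>v2</html>'
```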
The organization of a BigTable shows a sparse, distributed, multidimensional map for an email application. The system consists of three major components: a library linked to application clients to access the system, a master server, and a large number of tablet servers. The master server controls the entire system, assigns tablets to tablet servers and balances the load among them, manages garbage collection, and handles table and column family creation and deletion.
Internally, the space management is ensured by a three-level hierarchy: the root tablet, the location of which is stored in a Chubby file, points to entries in the second element, the metadata tablet, which, in turn, points to user tablets, collections of locations of users’ tablets. An application client searches through this hierarchy to identify the location of its tablets and then caches the addresses for further use.
BigTable is used by a variety of applications, including Google Earth, Google Analytics, Google Finance, and Web crawlers. For example, Google Earth uses two tables, one for preprocessing and one for serving client data. The preprocessing table stores raw images; the table is stored on disk because it contains some 70 TB of data. Each row of data consists of a single image; adjacent geographic segments are stored in rows in close proximity to one another. The column family is very sparse; it contains a column for every raw image. The preprocessing stage relies heavily on MapReduce to clean and consolidate the data for the serving phase. The serving table stored on GFS is "only" 500 GB, and it is distributed across several hundred tablet servers, which maintain in-memory column families. This organization enables the serving phase of Google Earth to provide a fast response time to tens of thousands of queries per second.
Google Analytics provides aggregate statistics such as the number of visitors to a Web page per day. To use this service, Web servers embed a JavaScript code into their Web pages to record information every time a page is visited. The data is collected in a raw-click BigTable of some 200 TB, with a row for each end-user session. A summary table of some 20 TB contains predefined summaries for a Web site.
Megastore
Megastore is scalable storage for online services. The system, distributed over several data centers, has a very large capacity, 1 PB in 2011, and it is highly available. Megastore is widely used internally at Google; it handles some 23 billion transactions daily: 3 billion write and 20 billion read transactions.