Hadoop MapReduce in Eucalyptus Private Cloud

Johan Nilsson
May 27, 2011
Bachelor's Thesis in Computing Science, 15 credits
Supervisor at CS-UmU: Daniel Henriksson
Examiner: Pedher Johansson
Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN



Abstract

This thesis investigates how a private cloud can be set up using the Eucalyptus Cloud system, along with its usability, requirements and limitations as an open-source cloud platform providing private cloud solutions. It also studies whether using the MapReduce framework, through Apache Hadoop's implementation, on top of the private Eucalyptus cloud can provide near-linear scalability in terms of runtime versus the number of virtual machines in the cluster.

Analysis has shown that Eucalyptus is lacking in a few usability areas when setting up the cloud infrastructure, in terms of private networking and DNS lookups, yet the API that Eucalyptus provides gives benefits when migrating from public clouds like Amazon. The MapReduce framework shows an initially near-linear relation, which declines as the number of virtual machines approaches the maximum of the cloud infrastructure.


Contents

1 Introduction
2 Problem Description
  2.1 Problem Statement
  2.2 Goals
  2.3 Related Work
3 Virtualized cloud environments and Hadoop MapReduce
  3.1 Virtualization
    3.1.1 Networking in virtual operating systems
  3.2 Cloud Computing
    3.2.1 Amazon's public cloud service
  3.3 Software study - Eucalyptus
    3.3.1 The different parts of Eucalyptus
    3.3.2 A quick look at the hypervisors in Eucalyptus
    3.3.3 The Metadata Service
    3.3.4 Networking modes
    3.3.5 Accessing the system
  3.4 Software study - Hadoop MapReduce & HDFS
    3.4.1 HDFS
    3.4.2 MapReduce
4 Accomplishment
  4.1 Preliminaries
  4.2 Setup, configuration and usage
    4.2.1 Setting up Eucalyptus
    4.2.2 Configuring a Hadoop image
    4.2.3 Running MapReduce on the cluster
    4.2.4 The MapReduce implementation
5 Results
  5.1 MapReduce performance times
6 Conclusions
  6.1 Restrictions and limitations
  6.2 Future work
7 Acknowledgements
References
A Scripts and code

List of Figures

3.1 A hypervisor can have multiple guest operating systems in it.
3.2 Different types of hypervisor-based server and machine virtualizations.
3.3 Simplified visualization of cloud computing.
3.4 Overview of the components in Eucalyptus on rack based servers.
3.5 Metadata request example in Eucalyptus.
3.6 The HDFS node structure.
3.7 The interaction between nodes when a file is read from HDFS.
3.8 The MapReduce phases in Hadoop MapReduce.
4.1 Eucalyptus network layout on the test servers.
4.2 The optimal physical layout compared to the test environment.
4.3 Network traffic in the test environment.
5.1 Runtimes on a 2.9 GB database.
5.2 Runtimes on a 4.0 GB database.
5.3 Runtimes on a 9.6 GB database.
5.4 Map task times on a 2.9 GB database.
5.5 Map task times on a 4.0 GB database.
5.6 Map task times on a 9.6 GB database.


Chapter 1

Introduction

By using a cloud service, a company, organization or even a private person can outsource management, maintenance and administration of large clusters of servers but still keep the benefits. While using a public cloud provider is sufficient for most tasks, concerns over bandwidth, storage, data protection or pricing might encourage companies to house a private cloud. The infrastructure to control and maintain the cloud can be proprietary, like Microsoft Hyper-V Cloud [17], VMware vCloud [21] and Citrix Open Cloud [4], but there are also a number of free and open-source solutions like Eucalyptus Cloud, OpenNebula [19] and CloudStack [5].

The cloud can provide the processing power, but the actual framework to take advantage of these distributed instances does not inherently come with the machines. Hadoop MapReduce claims to provide very high scalability and stability across a large cluster [7]. It is meant to run on dedicated servers, but there is nothing that prevents it from running on virtual machines.

This thesis is a study performed at Umeå University, Department of Computing Science, to provide familiarity with the cloud and its related technologies in general, focusing specifically on the Eucalyptus cloud infrastructure. It shows a means of setting up a private cloud, along with using the Hadoop MapReduce idiom/framework on top of the cloud, showing the benefits and requirements of running MapReduce on a Eucalyptus private cloud. As a proof of concept, a simple MapReduce test is implemented and run on the cloud to provide an analysis of the distributed computation of MapReduce.

The report begins with a software study of the systems used in the thesis, followed by a description of the configuration, setup and usage of Eucalyptus and Hadoop. Finally, the results from the analysis are presented along with a short conclusion.


Chapter 2

Problem Description

This thesis is two-fold. It will provide a relatively large software study of the Eucalyptus cloud and a general overview of some of the technologies it uses. It will also study what Hadoop MapReduce is and how it can be used in conjunction with Eucalyptus.

The first part of the thesis is to analyse how to set up a Eucalyptus private cloud in a small environment: what the requirements are to run and maintain it, and what problems and/or benefits its current implementation has. This is a documentation and implementation of one way to configure the infrastructure to deliver virtual machines on a small scale to a private user, company or organization.

The second part is to test how well Hadoop MapReduce performs in a virtual cluster. The machines used for the cluster will be virtual machines delivered through the Eucalyptus cloud that has been set up in the course of the thesis. A simple MapReduce application will be implemented to process a subset of Wikipedia's articles, and the time it takes to process this, as a function of the number of nodes the cluster runs on, will be measured. In a perfect environment the MapReduce framework can deliver near-linear performance [7], but that is without the extra overhead of running on small virtual machines.
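The MapReduce idiom that the test application builds on can be sketched in a few lines of plain Python. Word count is used here as a hypothetical stand-in job, not the thesis's actual article-processing implementation, but the map, shuffle/sort and reduce phases are the same ones Hadoop distributes across the cluster:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # Map: emit one (key, value) pair per word; the input key
    # (here the document id) plays the role of Hadoop's record key.
    for word in text.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all emitted values by key, as Hadoop does
    # between the map and reduce phases (across the network).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all values that share one key.
    return (key, sum(values))

# Tiny stand-in corpus; the thesis processes Wikipedia articles instead.
corpus = {1: "hadoop runs mapreduce", 2: "mapreduce runs on hadoop"}

pairs = [p for doc_id, text in corpus.items() for p in map_phase(doc_id, text)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # -> {'hadoop': 2, 'runs': 2, 'mapreduce': 2, 'on': 1}
```

In Hadoop the map and reduce functions run on different nodes and the shuffle moves data between them; the semantics, however, are exactly those of this single-process sketch.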

2.1 Problem Statement

By first setting up a small Eucalyptus cloud on a few local servers, the thesis can answer which problems and obstacles arise when preparing the open-source infrastructure. The main priority is setting up a cloud that can deliver virtual instances capable of running Hadoop MapReduce, to supply a base for the analysis of the framework.

Simplifying the launch of Hadoop MapReduce clusters inside the Eucalyptus cloud is the second priority, after setting up the infrastructure and testing the feasibility of MapReduce on virtual machines. This can include scripts, stand-alone programs or utilities beyond Eucalyptus and/or Hadoop.


2.2 Goals

The goal of this thesis is to do a software study and an analysis of the performance and usability of Hadoop MapReduce running on top of virtual machines inside a Eucalyptus cloud infrastructure. It will study means to set up, launch, maintain and remove virtual instances that can together form a MapReduce cluster. The following are the specific goals of this thesis:

– Demonstrate a way of setting up a private cloud infrastructure using the Eucalyptus Cloud system. This includes configuring subsystems that Eucalyptus uses, like hypervisors, controller libraries and networking systems.

– Create a virtual machine image containing Hadoop MapReduce that provides ease of use and minimal manual configuration at provisioning time.

– Provide a way to easily create and remove virtual instances inside the private cloud, adjusting the number of Hadoop worker nodes available in the cluster.

– Test the Hadoop MapReduce framework on a virtual cluster inside the private cloud. This shows what kind of performance increase a user gains when adding more virtual nodes to the cluster, and whether it is a near-linear increase.
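The near-linear criterion in the last goal can be made concrete as speedup and parallel efficiency. The sketch below uses invented runtimes purely for illustration (they are not measurements from this thesis) to show how an efficiency dropping below 1.0 signals the departure from linear scaling:

```python
def speedup(t1, tn):
    # Speedup of an n-node run relative to a single-node run.
    return t1 / tn

def efficiency(t1, tn, n):
    # Parallel efficiency: 1.0 means perfectly linear scaling.
    return speedup(t1, tn) / n

# Hypothetical runtimes (seconds) for 1, 2, 4 and 8 virtual nodes;
# these numbers are invented for illustration only.
runtimes = {1: 800.0, 2: 420.0, 4: 230.0, 8: 150.0}

t1 = runtimes[1]
for n, tn in runtimes.items():
    print(f"{n} nodes: speedup {speedup(t1, tn):.2f}, "
          f"efficiency {efficiency(t1, tn, n):.2f}")
```

With these made-up figures the efficiency falls from 1.00 to roughly 0.67 at eight nodes, which is the kind of declining curve the measurements in chapter 5 are meant to detect.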

2.3 Related Work

Apache Whirr is a collection of scripts that has sprung out as a project of its own. The purpose of Whirr is to simplify controlling virtual nodes inside a cloud like Amazon Web Services [10]. Whirr handles everything from launching and removing to maintaining instances that Hadoop can then utilize in a cluster.

Another similar controller program is Puppet [14] from Puppet Labs. This program fully controls instances and clusters inside an EC2-compatible cloud (AWS or Eucalyptus, for example). It uses a program outside the cloud infrastructure that can control whether to launch, edit or remove instances. Puppet also controls the Hadoop MapReduce cluster inside the virtual cluster. Mathias Gug, an Ubuntu developer, has tested how to deploy a virtual cluster inside an Ubuntu Enterprise Cloud using Puppet. The results can be found on his blog [13].

Hadoop's commercial and enterprise offspring, Cloudera [6], has released a distribution called CDH. The current version, version 3, contains a virtual machine with Hadoop MapReduce configured, along with Apache Whirr instructions. This is to simplify launching and configuring Hadoop MapReduce clusters inside a cloud. These releases also contain extra packages for enterprise clusters, such as Pig, Hive, Sqoop and HBase. CDH also uses Apache Whirr to simplify AWS deployment.


Chapter 3

Virtualized cloud environments and Hadoop MapReduce

This in-depth study focuses on explaining some key concepts regarding cloud computing, virtualization and clustering, along with how certain specific software solutions work based on these concepts. As some of the software is used in the practical implementation of the thesis, the in-depth study naturally focuses on how these work in a practical environment.

3.1 Virtualization

The term virtualization refers to creating a virtual environment instead of an actual physical one. This enables a physical system to run different logical solutions on it by virtually creating an environment that meets the demands of the solution. By virtually creating several different operating systems on one physical workstation, the administrator can create a cluster of computers that acts as if they were physical.

There are several different methods of virtualization. Network virtualization refers to creating virtual networks that can be used for segmenting, subnetworking or creating virtual private networks (VPNs), as a few examples. Desktop virtualization enables a user to access his local desktop from a remote location and is commonly used in large corporations or authorities to ensure security and accessibility. A more common form of virtualization, usually encountered by a home user, is application virtualization, which enables code to be compiled to machine instructions that run in a certain managed environment; examples of this include the Java VM and Microsoft's .NET framework. In cloud computing, server & machine virtualization is extensively used to virtually create new computers that can run a completely different operating system independent of the underlying system [26].

Without virtualization, situations would arise where machines use only a percentage of their maximum capacity. If the server instead has virtualization active and lets more operating systems run on the physical hardware, the hardware is used more effectively. This is why server and machine virtualization is of great benefit when creating a cloud environment: the cloud host can maximize utilization and distribute resources without having to buy a new physical server each time an instance is needed.

The system that keeps track of the machine virtualization is called a hypervisor. Hypervisors are mediators that translate calls from the virtualized OS to the hardware and act as a security guard. The guard prevents different virtual instances from accessing each other's memory or storage areas that are outside their virtual bounds. When a hypervisor creates a new virtual instance (a Guest OS), the hypervisor 'marks' memory, CPU and storage areas to be used by that instance [22]. The underlying hardware usually limits how many virtual instances can run on one physical machine.

Figure 3.1: A hypervisor can have multiple guest operating systems in it.

Depending on the type of hypervisor, it can either work directly with the hardware (called type 1 virtualization) or on top of an already installed OS (called type 2 virtualization). The type used varies with the hypervisor, the underlying OS and the installed hardware. These variations impose different requirements on each system; a hypervisor might work flawlessly on one hardware/OS setup but be inoperable in a slightly different variation [26]. See figure 3.2.

Figure 3.2: Different types of hypervisor-based server and machine virtualizations.


Hypervisor-based virtualization is the most commonly used variant [22], but several others exist. Kernel-based virtualization employs specialized OS kernels, where the kernel runs a separate version of itself along with a virtual machine on the physical hardware. In practice, one could say that the kernel acts as a hypervisor; it is usually a Linux kernel that uses this technique. Hardware virtualization does not rely on any software OS, but instead uses specialized hardware along with a special hypervisor to provide virtualization. The benefit of this is that the OS running inside the hypervisor does not have to be modified, which normal software hypervisor virtualization requires [22]. Technologies for providing hardware virtualization on the CPU (native virtualization) come from the CPU vendors, such as Intel VT-x or AMD-V.

The operating systems that run in the virtual environment are called machine images. These images can be put to sleep and then stored on the hard drive with their current installation, configuration and even running processes hibernated. When requested, the images can be restored to their running state to finish what they were doing before hibernation. This allows dynamic activation and deactivation of resources.

3.1.1 Networking in virtual operating systems

With operating systems running inside a hypervisor and not directly contacting the physical hardware, a problem arises when several instances want to communicate on the network. They do not actually have a physical Network Interface Card (NIC) connected to them, so the hypervisor has to ensure that the right instance receives the correct network packets.

The way the networking is handled depends on the hypervisor. There are four techniquesused to create virtual NICs [26]:

NAT networking
NAT (Network Address Translation) is the same type of technique used in common home routers. It translates an external IP address to an internal one, which enables multiple internal IPs. The packets are recognized by the port they are sent to and from. The hypervisor provides the NAT translation, and the VMs reside in a subnetwork with the hypervisor acting as the router.

Bridge networking
Bridging the networking basically connects the virtual NIC with the physical hardware NIC. The hypervisor sets up the bridge, and the virtual OS connects to it believing it to be a physical NIC. The benefit of this is that the virtual machine shows up on the local network just like any other physical machine.

Host-only
Host-only networking is the local variant of networking. The hypervisor disables networking to machines outside the VM's host, which defeats the purpose of a VM in a cloud environment. This is mostly used on local machines.

Hybrid
Hybrid networking is a combination or variation of the networking styles mentioned above. Such setups can connect to most of the other networking styles and can in some ways act as a bridge to a host-only VM.

Networking the virtual machines in a proper way is crucial when setting up a virtualized cloud. The virtual machines have to be able to connect to the cloud system's network to provide resources.

3.2 Cloud Computing

Cloud computing is a type of distributed computing that provides elastic, dynamic processing power and storage on demand. In essence, it gives the user computing power when the user needs it. The term cloud refers to the typical visual representation of the Internet in a diagram: a cloud. What cloud computing means is that there is a collection of computers that can give the customer/user the amount of computational power needed, without them having to worry about maintenance or hardware [20].

Typically a cloud is hosted on a server farm with a large number of clustered computers. These provide the hardware resources. The cloud provider (the organization that hosts the servers) offers an interface for users to pay for a certain amount of processing power, storage or computers in a business model. These resources can then be increased or decreased based on demand, so the user only needs to focus on their content, whereas the provider takes care of maintenance, security and networking.

Figure 3.3: Simplified visualization of cloud computing.

The servers in the server farm are usually virtualized, although they are not required to be in order to be included in a cloud. Virtualization is a cornerstone of cloud computing; it enables the provider to maximize the processing power of the raw hardware and gives the cloud elasticity, the ability for users to scale the instances required. It also helps provide two other key features of a cloud: multitenancy, the sharing of resources, and massive scalability, the ability to have huge numbers of processing systems and storage areas (tens of thousands of systems with terabytes or petabytes of data) [16].

There are three major types of services that can be provided from a cloud. These represent different levels of access for the user, ranging from control of just a few components to the operating system itself [16]:

Infrastructure as a Service (IaaS)
IaaS gives the user the most freedom and access to the systems. These can sometimes be on dedicated hardware (that is, not virtualized), where the users have to install whatever they want on the system themselves. The user is given access to the operating system, or the ability to create their own through images that they create (typically in a virtualized environment). This is used when the user wants the raw processing power of a lot of systems or needs a huge amount of storage.

Platform as a Service (PaaS)
PaaS does not give as much freedom as IaaS, but instead focuses on having key applications already installed on the systems delivered. These are used to provide the user with the systems needed in a quick and accessible way. The users can then modify the applications to their needs. An example of this would be a hosted website: the tools for hosting the website (along with extra systems like databases, a web server engine, etc.) are installed, and the user can create the page without having to think about networking or accessibility.

Software as a Service (SaaS)
SaaS is generally transparent to the user. It gives the user software whose processing happens in a cloud. The user can only interact with the software itself and is often unaware that it is being processed in a cloud. A quick example is Google Docs, where users can edit documents that are hosted and processed in the Google cloud.

The cloud providers in the business model often use web interfaces to enable users to increase or decrease the instances they use. They are then billed by the amount of space required or processing power used (depending on what type of service is bought) in a pay-as-you-go system. This is the IaaS type, which for example Amazon, Proofpoint and Rightscale can provide [16]. However, a cloud does not necessarily exist only on the Internet as a business model for delivering computational power or storage. A cloud can be public, which means that the machines delivered reside on the Internet; private, which means that the cluster is hosted locally; or hybrid, where the instances are local at the start but can use public cloud services on demand if the private cloud does not have sufficient power [20].

Cloud computing is also used in many other systems. Google uses cloud computing to provide the backbone of their large systems such as Gmail, Google App Engine and Google Sites. Google App Engine provides a PaaS for the sole purpose of creating and hosting web sites. Google Apps is a SaaS cloud that is a distributed system for handling office types of files; a clouded Microsoft Office [16].

3.2.1 Amazon’s public cloud service

Amazon was one of the first big companies to become a large cloud system provider [20]. Amazon is interesting as one of the first giants to provide an API to access data through their Amazon Web Services (AWS) and web pages. Since Eucalyptus uses the same API, albeit with a different and open-source implementation, a closer look at Amazon and their services is worthwhile. Amazon provides several different services [20], but in terms of this thesis some are of more interest:


Amazon Simple Storage Service (S3)
The Simple Storage Service is Amazon's way of providing vast amounts of storage space to the user. A user can pay for the amount of space needed, from just a few gigabytes to several petabytes. Fees also apply to the amount of data transferred to and from the storage. S3 uses 'buckets', which in layman's terms can be seen as folders to store data within. These buckets are stored somewhere inside the cloud and are replicated on several devices to provide redundancy. Using standard protocols such as HTTP, SOAP, REST and even BitTorrent to transfer the data, the Simple Storage Service provides ease of access [3] to the user.

Amazon Elastic Compute Cloud (EC2)
The Elastic Compute Cloud is a way to provide a dynamic/elastic amount of computational power. Amazon gives the user the ability to pay for nodes. These nodes are virtualized computers that can take an Amazon Machine Image (AMI) and run it in their virtualized environment (see section 3.1). EC2 aims at supplying large amounts of CPU and RAM to the user, but it is up to the user to write and execute the applications that use the resources [2]. These virtualized computers, nodes, are contained inside a security group, a virtual network consisting of all the EC2 nodes the user has paid for. During computation they can be linked together to provide a strong distributed computational base.

Amazon Elastic Block Store (EBS)
While S3 is focused on storage, it does not focus on speed and fast access to the data. When using the EC2 system, the data to be processed must be stored in a fast-to-access way to avoid downtime for the EC2 system. S3 does not provide that, so Amazon has created a way of attaching virtual, reliable and fast devices to EC2: the Elastic Block Store, EBS. EBS differs from S3 in that the volumes cannot be as small or as large as on S3 (1 GB - 1 TB on EBS compared to 1 B - 5 TB on S3 [1, 3]), but they have faster read-write times and are easier to attach to EC2 instances. One EBS volume can only be attached to one EC2 instance at a time, but one EC2 instance can have several EBS volumes attached to it. EBS also offers the ability to snapshot a volume and store the snapshot on a different storage medium, for example S3 [1].

As an example of using Amazon EC2, the New York Times used EC2 and S3 in conjunction to convert 4 TB of articles from 1851 - 1980 (around 11 million articles) stored as TIFF images to PDF format. By using 100 Amazon EC2 nodes, the NY Times converted the TIFF images to 1.5 TB of PDFs in less than 24 hours, a conversion that would have taken far longer on a single computer [18]. As a side note, the NY Times also used Apache Hadoop installed on their AMIs to process the data (see section 3.4).

3.3 Software study - Eucalyptus

Eucalyptus is a free, open-source cloud management system that uses the same API as AWS. This enables tools originally developed for Amazon to be used with Eucalyptus, with the added benefit of Eucalyptus being free and open-source. It provides the same functionality in terms of IaaS deployment and can be used as a private, hybrid or even public cloud system, given enough hardware. Instances running inside Eucalyptus run Eucalyptus Machine Images (EMIs, cleverly named after AMIs), which can either be created by the user or downloaded as pre-packaged versions. An EMI can contain either a Windows or a Linux (for example CentOS) operating system [8]. At the time of writing, Eucalyptus does not support Mac OS.

3.3.1 The different parts of Eucalyptus

Eucalyptus resides on the host operating system it is installed on. Since it uses libraries and hypervisors that are restricted to Linux, it cannot run on other operating systems like Microsoft Windows or Apple OS. When Eucalyptus starts, it contacts its different components to determine the layout and setup of the systems it controls. These components are configured using configuration files in each component. They all have different responsibilities and areas, together forming a complete system that can handle dynamic creation of virtualized instances, large storage environments and user access control.

Providing the same features as Amazon in terms of compute clouds and storage, the components inside Eucalyptus have different names but equal functionality and API [8]:

– Walrus
Walrus is the name of the storage container system, similar to Amazon S3. It stores data in buckets and has the same API to read and write data in a redundant system. Eucalyptus offers a way to limit access to and the size of the storage buckets through the same means as S3, by enforcing user credentials and size limits. Walrus is written in Java and is accessible through the same means as S3 (SOAP, REST or a web browser).

– Cloud Controller
The Cloud Controller (CLC) is the Eucalyptus implementation of the Elastic Compute Cloud (EC2) that Amazon provides. The CLC is responsible for starting, stopping and controlling instances in the system, as this is what provides the computational power (CPU & RAM) to the user. The CLC indirectly contacts the hypervisors through Cluster Controllers (CCs) and Node Controllers (NCs). The CLC is written in Java.

– Storage Controller
This is the equivalent of the EBS found in Amazon. The Storage Controller (SC) is responsible for providing fast, dynamic storage devices with low latency and variable storage size. It resides outside the virtual CLC-instances, but can communicate with them as external devices in a similar fashion to the EBS system. The SC is written in Java.

Beneath the Cluster Controller, on every physical machine, lies the Node Controller (NC). Written in C, this component is in direct contact with the hypervisor. The CC and SC talk with the NC to determine the availability, access and need of the hypervisors. The CC and SC run at cluster level, which means only one of each is needed per cluster. Usually the SC and CC are deployed on the head node of each cluster - that is, a defined machine marked as the 'leader' of the rest of the physical machines in the cluster - but if the cloud only consists of one large cluster, the CLC, SC, CC and Walrus can all reside on the head node, the front-end node.

Figure 3.4: Overview of the components in Eucalyptus on rack based servers.

All components communicate with each other over SOAP with WS-security [8]. To makeup the entire system Eucalyptus has more parts which were mentioned in brief earlier. TheCluster Controller, written in C, is responsible for an entire physical cluster of machinesto provide scheduling and network control of all the machines under the same physicalswitch/router. See figure 3.4. While the CLC is responsible for controlling most of theinstances and their requests (creating, deleting, setting EMIs etc) it talks with both the CCand SC on the cluster level. Walrus on the other hand is only responsible for storage actionsand thus only talks with the SC.

The front-end serves as the maintenance access point. If a user wants more instances or needs more storage allocated, the front-end has Walrus and the CLC ready to accept requests and propagate them to CCs and SCs. This gives the user transparency of the Eucalyptus cloud. The user cannot tell where and how the storage is created, only that they actually received more storage space by requesting it from the front-end.

3.3.2 A quick look at the hypervisors in Eucalyptus

To be able to create and destroy virtualized instances on demand, the Node Controller needs to talk with a hypervisor installed on the machine it is running on. Currently, Eucalyptus only supports the hypervisors Xen and KVM. To communicate with them, Eucalyptus utilizes the libvirt virtualization API and virsh.

The Xen hypervisor is a Type 1 hypervisor which utilizes paravirtualization to run operating systems on it. This requires that the guest OS's are modified to make calls to the hypervisor instead of the actual hardware [22]. Xen can also support hardware virtualization, but that requires specialized virtualization hardware. See Section 3.1. The first guest operating system that Xen virtualizes is called dom0 (basically the first domain) and is automatically booted into when starting the computer. When Xen runs a virtual machine, the drivers are run in user-space, which means that every OS runs inside the memory of a user instead of the kernel's memory space. Xen provides networking by bridging the NIC (see Section 3.1.1).

KVM, Kernel-based Virtual Machine, is a hypervisor built into the OS's kernel. This means that KVM uses calls far deeper into the OS architecture, which in turn provides greater speed. KVM is very small and built into the Linux kernel, but it cannot by itself provide CPU paravirtualization. To do that it uses the QEMU CPU emulator. QEMU is, in short, an emulator designed to simulate different CPU's through API calls. The usage of QEMU inside KVM means that KVM sometimes is referred to as qemu/KVM. When KVM runs inside the kernel-space, it uses calls through QEMU to interact with the user-space parts, like creating or destroying a virtual machine [22]. Like Xen, KVM also bridges the NIC to provide networking to the virtual machines.

When Eucalyptus creates EMIs to be used inside the cloud system, it requires images along with kernel and ramdisk pairs that work for the designated hypervisor. A ramdisk is not required in the beginning, since the ramdisk image defines the state of the RAM memory in the virtual machine [22], but if the image has been installed and is running when put to sleep, a ramdisk image should come with it (had there been none, all the RAM would be empty when the virtual machine resumed). Since the images might look different depending on which hypervisor created them, Xen and KVM cannot load each other's images. This raises an interesting point about Eucalyptus: if, in theory, there was an image and ramdisk/kernel pair that worked on both hypervisors, Eucalyptus could run physical machines that had either KVM or Xen installed on them and boot any Xen/KVM image without encountering any virtualization problems. With the current discrepancy, the machines in the cloud are forced to run a specific hypervisor so that the EMIs can be loaded across any Node Controller in the cloud.

3.3.3 The Metadata Service

Just like Amazon, Eucalyptus has a metadata service available for the virtual machines [8]. What the metadata service does is supply information to VMs about themselves. This is achieved by the VMs contacting the CLC with an HTTP request. The CLC checks from which VM the call is made and returns the requested information based on the VM which made the call. For example, if the CLC's IP is 169.254.169.254 then a VM could make a request like:

http://169.254.169.254:8773/latest/meta-data/<name of metadata tag>

http://169.254.169.254:8773/latest/user-data/<name of user-defined metadata tag>

The metadata tag can be anything from standard default ones like the kernel-id, security groups or the public hostname, to specific ones defined by the administrator. This is a method of obtaining setup information when new instances are created or destroyed. The metadata calls have the exact same call names and structure as AWS, so tools used inside the AWS system work with the Eucalyptus metadata.

Figure 3.5: Metadata request example in Eucalyptus.

3.3.4 Networking modes

With Eucalyptus installed on all the machines in the cluster(s), the different components call each other with SOAP commands over the physical NIC. However, when new instances are created they need to have their networking set up on the fly. Since the physical network might have other settings regarding how the NICs retrieve their IPs, Eucalyptus has different modes to give the virtual machines access to the network. The virtual machines communicate with each other using virtual subnets. These subnets must not in any way overlap the physical network used by the components of Eucalyptus (notice the difference between components like the CLC, Walrus, NC etc. and virtual machines). The CC has one connection towards the virtual subnet and another bridged to the physical network [8].
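The non-overlap constraint between the virtual and physical subnets can be checked mechanically. The following sketch is plain Java, not Eucalyptus code, and the CIDR values are hypothetical; it only illustrates how two address ranges are compared:

```java
// Hedged sketch: checking that a planned virtual subnet does not overlap
// the physical one, as Eucalyptus requires. CIDR values are hypothetical.
public class SubnetCheck {
    // Parse dotted-quad "a.b.c.d" into a 32-bit value.
    static long ip(String dotted) {
        long v = 0;
        for (String part : dotted.split("\\.")) v = (v << 8) | Long.parseLong(part);
        return v;
    }

    // Two CIDR blocks overlap iff their network addresses agree under the
    // shorter of the two prefixes.
    static boolean overlaps(String netA, int prefixA, String netB, int prefixB) {
        int prefix = Math.min(prefixA, prefixB);
        long mask = prefix == 0 ? 0 : (~0L << (32 - prefix)) & 0xFFFFFFFFL;
        return (ip(netA) & mask) == (ip(netB) & mask);
    }

    public static void main(String[] args) {
        // Physical net 192.168.1.0/24 vs virtual subnet 10.0.0.0/16: disjoint.
        System.out.println(overlaps("192.168.1.0", 24, "10.0.0.0", 16));
        // A virtual subnet 192.168.0.0/16 would clash with the physical net.
        System.out.println(overlaps("192.168.1.0", 24, "192.168.0.0", 16));
    }
}
```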

Networking modes inside the Eucalyptus cloud system differ in how much freedom and connectivity the instances have. Some modes add features to the VM networks:

– Elastic IPs is, in short, a way to supply the user with a range of IPs that the VMs can use. These can then be used as external, public IPs and are ideal if the user needs a persistent Web server, for example.

– Security groups is a way of giving the user of a group of instances control over what is allowed in terms of network traffic. For example, one security group can enforce that no ICMP calls are answered, or that no SSH connections can be made.

– VM isolation, if activated, prevents VMs from different security groups from contacting each other. When running a public cloud providing IaaS, this is almost a must-have.

The different modes give different benefits and drawbacks, and some are even required under certain circumstances. There are four different networking modes in Eucalyptus. In three modes, the front-end acts as the DHCP server distributing IPs to the virtual machines. The fourth mode requires an external DHCP server to distribute IPs to virtual machines [8]. In all networking modes the VMs can be connected to from an external source if given a public IP and the security group allows it. The different modes are the following:

SYSTEM
The only networking mode that requires an external DHCP server to serve new VM instances with IPs. This mode requires little configuration since it does not limit internal interaction. It does, however, provide no extra features like security groups, elastic IPs or VM isolation. This mode is best used when the cloud is private and there are few users that share the cloud.

STATIC
STATIC mode requires that the DHCP server on the network is either turned off or configured not to serve the specific IP range that the VMs use. The front-end has to act as the DHCP server towards the instances, but in a non-dynamic way, by adding pairs of MAC addresses and IPs for the VMs. Just like SYSTEM, it does not provide the benefits that you normally associate with a public cloud, like elastic IPs, VM isolation or security groups.

MANAGED
MANAGED mode gives the most freedom to the cloud. Advanced features like VM isolation, elastic IPs and security groups are available by creating virtual networks between the VMs.

MANAGED-NoVLAN
If the physical network relies on VLANs then normal MANAGED mode will not work (since several VLAN packets on top of each other will cause problems for the routing). In this mode most of the MANAGED mode features are still available, except VM isolation.

When setting up the Eucalyptus networking mode one has to consider what type of cloud it is and what kind of routing setup is used on the physical network.

3.3.5 Accessing the system

When users want to create, remove or edit their instances, they can either contact them directly through SSH (if they have public IPs) or control the instances using Eucalyptus' web interface. By logging in on the front-end with a username and password, the user or admin can configure settings of the system. Also, tools developed for AWS can be used for this, since Eucalyptus supports the same API calls [8].

Similarly, there is a tool called euca2ools for administration. It is a Command Line Interface tool that is used to manipulate the instances that a user has running. An admin using euca2ools has more access than an ordinary user. Euca2ools is almost mandatory when working with Eucalyptus.

3.4 Software study - Hadoop MapReduce & HDFS

Hadoop is an open source software package from the Apache Foundation which contains different systems aimed at file storage, analysis and processing of large amounts of data, ranging from only a few gigabytes to several hundreds or thousands of petabytes. Hadoop's software is all written in Java, but the different parts are separate projects by themselves, so bindings to other programming languages exist on a per-project basis. The three major subprojects of Hadoop are the following [9]:

HDFS
The Hadoop Distributed File System, HDFS, is a specialized filesystem to store large amounts of data across a distributed system of computers with very high throughput and multiple replication on a cluster. It provides reliability between the different physical machines to support a base for very fast computations on a large dataset.

MapReduce
MapReduce is a programming idiom for analyzing and processing extremely large datasets in a fast, scalable and distributed way. Originally conceived by Google as a way of handling the enormous amounts of data produced by their search bots [23], it has been adapted so that it can run on a cluster of normal commodity machines.

Common
The Hadoop Common subproject provides interfaces and components built in Java to support distributed filesystems and I/O. This is more of a library that has all the features that HDFS and MapReduce use to handle the distributed computation. It has the code for persistent data structures and Java RPC that HDFS needs to store clustered data [23].

While these are Hadoop's major subprojects, there are several others related to the Hadoop package. These are generally projects related to distributed systems which either use the major Hadoop subprojects or are related to them in some way:

Pig
Pig introduces a higher-level data-flow language and framework for doing parallel computation. It can work in conjunction with MapReduce and HDFS and has an SQL-like syntax.

Hive
Hive is a data warehouse infrastructure with a basic query language, Hive QL, which is based on SQL. Hive is designed to easily integrate and work together with the data storage of MapReduce jobs.

HBase
A distributed database designed to support large tables of data with a scalable infrastructure on top of normal commodity hardware. Its main usage is handling extremely large database tables, i.e. billions of rows by millions of columns.

Avro
By using Remote Procedure Calls (RPC), Avro provides a data serialization system to be used in distributed systems. Avro can be used when parts of a system need to communicate through the network.

Chukwa
Built on top of HDFS and MapReduce, Chukwa is a monitoring system for large distributed systems.

Mahout
A large machine learning library. It uses MapReduce to provide scalability when handling large datasets.

ZooKeeper
ZooKeeper is mainly a service for distributed systems control, monitoring and synchronization.

HDFS and MapReduce are intended to work on commodity hardware, which is the opposite of specialized high-end server hardware designed for computation-heavy processing. The idea is to be able to use the Hadoop software on a cluster of not-that-high-end computers and still get very good results in terms of throughput and reliability. An example of commodity hardware, taken from Hadoop - The Definitive Guide [24]:

Processor: 2 quad-core 2-2.5GHz CPUs

Memory: 16-24 GB ECC RAM

Harddrive: 4 x 1TB SATA disks

Network: Gigabit Ethernet

Since Hadoop is designed to use multiple large harddrives and multiple CPU cores, having more of them is almost always a benefit. ECC RAM stands for Error Correction Code RAM and is almost a must-have, since Hadoop uses a lot of memory in processing and clusters without it reportedly see a lot of checksum errors [24]. Using Hadoop on a large cluster of racked physical machines in a two-level network architecture is a common setup.

3.4.1 HDFS

The Hadoop Distributed File System is designed to be a filesystem that gives a fast access rate and reliability for very large datasets. HDFS is basically a Java program that communicates with other networked instances of HDFS through RPC to store blocks of data across a cluster. It is designed to work well with large file sizes (which can vary from hundreds of MBs to several PBs), but since it focuses on delivering a high amount of data between the physical machines it has a slower access rate and higher latency [23].


HDFS is split into three software parts. The NameNode is the "master" of the filesystem that keeps track of where and how the files are stored in the filesystem. The DataNode is the "slave" in the system and is controlled by the NameNode. There is also a Secondary NameNode which, contrary to what its name says, is not a replacement for the NameNode. The secondary NameNode is optional; why is explained later in this section.

When HDFS stores files in its filesystem it splits the data into blocks. These blocks of raw data are of configurable size (defined in the NameNode configuration), but the default size is 64 MB. Compare this to a normal disk block, which is 512 bytes [23]. When a datafile has been split up into blocks, the NameNode sends the blocks to the different DataNodes (other machines) where they are stored on disk. The same block can be sent to multiple DataNodes, which provides redundancy and higher throughput when another system requests access to the file.
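As a back-of-the-envelope illustration of the block splitting described above (plain Java, not HDFS code; only the 64 MB default from the text is assumed):

```java
// Hypothetical illustration of how a file is divided into fixed-size blocks.
public class BlockSplit {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // HDFS default: 64 MB

    // Number of blocks needed to store a file of the given size (ceiling division).
    static long numBlocks(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // A 1 GB file occupies exactly 16 blocks of 64 MB.
        System.out.println(numBlocks(1024L * 1024 * 1024));
        // A 100 MB file occupies 2 blocks (64 MB + a 36 MB remainder).
        System.out.println(numBlocks(100L * 1024 * 1024));
    }
}
```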

The NameNode is responsible for keeping track of the location of the file among the DataNodes, as well as the tree structure that the filesystem uses. The metadata about each file is also stored in the NameNode, like which original datafile it belongs to and its relation to other blocks. This data is stored on disk in the NameNode in the form of two files: the namespace image and the edit log. The exact block locations on the DataNodes are not stored in the namespace image; this is reconstructed on startup by communicating with the DataNodes and then only kept in memory [23].

Figure 3.6: The HDFS node structure.

Because the NameNode keeps track of the metadata of the files and the tree structure of the file system, it is also a single point of failure. If it breaks down, the whole HDFS filesystem will be invalid, since the DataNodes only store the data on disk without any knowledge of the structure. Even the secondary NameNode cannot work without the NameNode, since the secondary NameNode is only responsible for validating the namespace image of the NameNode. Due to the large amounts of data that file metadata can produce, the NameNode and secondary NameNode should be different machines (and separated from the DataNodes) on a large system [23].

However, as of Hadoop 0.21.0, work has begun to remove the Secondary NameNode and replace it with a Checkpoint Node and a Backup Node, which are meant to keep track of the NameNode and keep an up-to-date copy of it. This will work as a backup in case of a NameNode breakdown [11], lowering the risk of failure if the NameNode crashes.

By default, the NameNode replicates each block by a factor of three. That is, the NameNode tries to keep three copies of each block on different DataNodes at all times. This provides both redundancy and more throughput for the client that uses the filesystem. To provide better redundancy and throughput, HDFS is also rack-aware; that is, it wants to know which rack each node resides in and how "far", in terms of bandwidth, the nodes are from each other. That way the NameNode can keep more copies of blocks on one rack for faster throughput, but additional copies on other racks for better redundancy.
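A rack-aware placement choice in the spirit described above can be sketched as follows. The policy shown (first replica on the writing node, second on a node in a different rack, third on another node in that same rack) is the commonly described HDFS default; the cluster layout and node names are made up for illustration, and the real logic lives inside the NameNode:

```java
import java.util.*;

// Toy sketch of rack-aware replica placement (not HDFS code).
public class RackAwareSketch {
    // Hypothetical cluster layout: node name -> rack id.
    static Map<String, String> rackOf = new TreeMap<>(Map.of(
        "node1", "rack1", "node2", "rack1",
        "node3", "rack2", "node4", "rack2"));

    static List<String> placeReplicas(String writer) {
        List<String> targets = new ArrayList<>();
        targets.add(writer); // replica 1: the writing node itself
        // replica 2: some node on a different rack (for redundancy)
        String remoteRack = rackOf.values().stream()
            .filter(r -> !r.equals(rackOf.get(writer))).findFirst().get();
        String second = rackOf.keySet().stream()
            .filter(n -> rackOf.get(n).equals(remoteRack)).findFirst().get();
        targets.add(second);
        // replica 3: a different node on the same rack as replica 2 (for throughput)
        String third = rackOf.keySet().stream()
            .filter(n -> rackOf.get(n).equals(remoteRack) && !n.equals(second))
            .findFirst().get();
        targets.add(third);
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(placeReplicas("node1"));
    }
}
```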

Figure 3.7: The interaction between nodes when a file is read from HDFS.

DataNodes are more or less data dummies that take care of storing and sending file data to and from clients. When started, they have the NameNode's location defined as a URL in their configuration file. This is by default localhost, which needs to be changed as soon as there is more than one node in the cluster.

When a user wants to read a file, it uses a Hadoop HDFS client that contacts the NameNode. The NameNode then fetches the block locations and returns them to the client, forcing the client to do the reading and merging of blocks from the DataNodes. See Figure 3.7. Since HDFS requires a special client to interact with the filesystem, it is not as easy as mounting an NFS (Network File System) and reading from it in an operating system. However, there are bindings to HTTP and FTP available, and software like FUSE, Thrift or WebDAV can also work with HDFS [23]. Using FUSE on top of HDFS would mean that one can mount it as a normal Unix userspace drive.
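The read interaction just described can be mimicked with plain Java maps. This is a toy sketch, not the HDFS client API; the node names (dn1, dn2) and block ids (b0, b1) are invented:

```java
import java.util.*;

// Toy illustration of the HDFS read path: the client asks the NameNode
// only for block locations, then fetches and merges the blocks itself.
public class HdfsReadSketch {
    // "NameNode": filename -> ordered list of (datanode, blockId) locations.
    static Map<String, List<String[]>> nameNode = new HashMap<>();
    // "DataNodes": datanode -> (blockId -> raw block contents).
    static Map<String, Map<String, String>> dataNodes = new HashMap<>();

    static String read(String file) {
        StringBuilder merged = new StringBuilder();
        // 1. Ask the NameNode where the blocks live.
        for (String[] loc : nameNode.get(file)) {
            // 2. Fetch each block from the DataNode holding it.
            merged.append(dataNodes.get(loc[0]).get(loc[1]));
        }
        return merged.toString(); // 3. The client merges the blocks itself.
    }

    public static void main(String[] args) {
        dataNodes.put("dn1", new HashMap<>(Map.of("b0", "Hello, ")));
        dataNodes.put("dn2", new HashMap<>(Map.of("b1", "HDFS!")));
        nameNode.put("/f.txt",
            List.of(new String[]{"dn1", "b0"}, new String[]{"dn2", "b1"}));
        System.out.println(read("/f.txt"));
    }
}
```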

3.4.2 MapReduce

MapReduce is a programming idiom/model for processing extremely large datasets using distributed computing on a computer cluster. It was invented and patented by Google. The word MapReduce derives from two typical functions used within functional programming, the Map and Reduce functions [7]. Hadoop has taken this framework and implemented it, through a license from Google, to run on top of a cluster of computers that are not high-end, similar to HDFS. The purpose of Hadoop's MapReduce is to be able to utilize the combined resources of a large cluster of commodity hardware. MapReduce relies on a distributed file system, where HDFS is currently one of the few supported.

MapReduce phases

The MapReduce framework is split up into two major phases: the Map phase and the Reduce phase. The entire framework is built around key-value pairs, and the only thing that is communicated between the different parts of the framework is key-value pairs. The keys and values can be user-implemented, but they are required to be serializable since they are communicated across the network. Keys and values can range from simple primitive types to large data types. When implementing a MapReduce problem, the problem has to be splittable into n parts, where n is at least the number of Hadoop nodes in the cluster.

It is important to understand that while the different phases in a MapReduce job can be regarded as sequential, they are in fact working in parallel as much as possible. The shuffle and reduce phases can start working as soon as one map task has completed, and this is usually the case. Depending on the work slots available across the cluster, each job is divided as much as possible. The MapReduce framework is built around these components:

InputFormat
Reads file(s) on the DFS, tables from a DBMS, or whatever the programmer wants it to read. This phase takes an input of some sort and splits it into InputSplits.

InputSplits
An InputSplit is dependent on what the input data is. It is a subset of the data read, and one InputSplit is sent to each Map task.

MAP
The Map phase takes a key-value pair generated through the InputSplit. Each node runs one map task, and the map tasks run in parallel with each other. One Map task takes a key-value pair, processes it and generates another key-value pair.

Combine
The optional combine phase is a local task run directly after each map task on each node. It does a mini-reduce by combining the values of equal keys generated by the current map task.

Shuffle
When the nodes have completed their map tasks they enter the shuffle phase, where data is communicated between the nodes. Key-value pairs are passed between the nodes to append, sort and partition them. This is the only phase where the nodes communicate with each other.

Shuffle - Append
Appending the data during the shuffle phase is generally just putting all the data together. Shuffle append is automatically done by the framework.

Shuffle - Sort
The sort phase is when the keys are sorted, either in a default way or in a programmer-implemented way.

Shuffle - Partition
The partition phase is the last phase of the shuffle. It calculates how the combined data should be split out to the reducers. It can either be handled in a default way or be programmer-implemented. It should send an equal amount of data to each reducer for optimal performance.

REDUCE
Reduce is done by taking all the key-value pairs with the same key and performing some kind of reduction on the values. Each reducer takes a subset of all the key-value pairs, but will always have all the values for one key. For example, (Foo, Bar) and (Foo, Bear) will go to the same reducer as (Foo, [Bar, Bear]). If a reducer has one key, no other reducer will receive that key.

Output
Each reducer generates one output to a storage. The output can be controlled by subclassing OutputFormat. By default, the output is generated as part-files, one per reducer, named part-r-00000, part-r-00001 etc. This can be controlled through an implementation of OutputFormat.

Although each InputSplit is sent to one Map task, the programmer can tell the InputFormat (through a RecordReader) to read across the boundaries of the given split. This enables the InputFormat to read a subset of data without having to combine it from two or more maps. When one InputSplit has been read across its boundaries, the latter split will begin after where the former stopped. The size of the InputSplit given to a map task is most often dependent on the size of the data and the size of an HDFS block. Since Hadoop MapReduce is optimized for - and most often runs on - HDFS, the block size of HDFS usually dictates the size of the split if it is read from a file.

Figure 3.8: The MapReduce phases in Hadoop MapReduce.

The key-value pairs given to the mapper do not always require the keys to be meaningful. The values can be the only interesting part for the mapper, which outputs a different key and value after computation. The general flow of key-value pairs in the MapReduce framework is the following:

map(K1, V1) → list(K2, V2)
reduce(K2, list(V2)) → list(V3)

However, when implementing MapReduce, the framework takes care of generating the lists and combining the values to one key. The programmer only needs to focus on what one map or reduce task does, and the framework will apply it n times until the whole job is mapped. In terms of the Hadoop framework it can be regarded as:

framework in(data → (K1, V1))
map(K1, V1) → (K2, V2)
framework shuffle((K2, V2) → (K2, list(V2)))
reduce(K2, list(V2)) → (K3, V3)
framework out((K3, V3) → (K3, list(V3)))
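The flow above can be made concrete with a toy word count in plain Java. This sketch uses ordinary collections rather than the Hadoop API, but follows the same map, shuffle and reduce shape:

```java
import java.util.*;

// Toy word count illustrating the key-value flow (not the Hadoop API).
public class WordCountSketch {
    // map(K1, V1) -> list(K2, V2): emit (word, 1) for each word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+"))
            out.add(Map.entry(word, 1));
        return out;
    }

    // framework shuffle: group (and here also sort) all values per key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // reduce(K2, list(V2)) -> V3: sum the counts for one key.
    static int reduce(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : new String[]{"foo bar", "bar baz bar"})
            mapped.addAll(map(line)); // one map call per "input split"
        shuffle(mapped).forEach((word, counts) ->
            System.out.println(word + "\t" + reduce(counts)));
    }
}
```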

When starting a Hadoop MapReduce cluster, it requires a master that contains the HDFS NameNode and the JobTracker. The JobTracker is the scheduler and entry point of a MapReduce job. The JobTracker communicates with TaskTrackers that run on other nodes in the cluster. A node that only contains a TaskTracker and an HDFS DataNode is generally known as a slave. TaskTrackers periodically ping the JobTracker and check whether a free task to work on is ready or not. If the JobTracker has a task ready, it is sent to the TaskTracker, which performs it. Generally, on a large cluster the master is separate from the slaves, but on smaller clusters the master also runs a slave.
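The pull-based scheduling just described can be sketched as a small simulation. This is a toy, not Hadoop code; the task and tracker names are invented, and the round-robin "pinging" stands in for the periodic heartbeats:

```java
import java.util.*;

// Toy sketch of the JobTracker/TaskTracker pull model: trackers poll the
// JobTracker, which hands out tasks from a queue until the job is exhausted.
public class TrackerSketch {
    public static void main(String[] args) {
        Deque<String> jobTrackerQueue = new ArrayDeque<>(
            List.of("map-0", "map-1", "map-2", "reduce-0"));
        String[] taskTrackers = {"tt1", "tt2"};
        int round = 0;
        // Each "ping" either receives a free task or finds the queue empty.
        while (!jobTrackerQueue.isEmpty()) {
            String tracker = taskTrackers[round++ % taskTrackers.length];
            String task = jobTrackerQueue.poll(); // JobTracker assigns a task
            System.out.println(tracker + " runs " + task);
        }
    }
}
```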


Chapter 4

Accomplishment

This chapter describes how the configuration of Eucalyptus and Hadoop was done. It describes one way to set up a Eucalyptus cloud and one way to run Hadoop MapReduce in it. While it describes one way, it should not be regarded as the definitive way to do it. Eucalyptus can run several different networking modes on top of different OS's, which means that the following configurations are not the only solution.

4.1 Preliminaries

The hardware available for this thesis was nine rack servers running Debian 5, kernel version 2.6.26, connected through one gigabit switch with virtual network support. Of these, one was the designated DHCP server of the subnet, only serving specific MAC addresses a specific IP address and not giving out any IP to an unknown MAC address. This server also had a shared NFS /home for the other eight servers in the subnet. Due to the other eight servers being dependent on this server, it was ruled out of the Eucalyptus setup. Of the eight available, four supported the Xen hypervisor and four supported the KVM hypervisor. These are the hardware specifications of the servers:

CPU
AMD Opteron 246 Quad-core on test01-04
AMD Opteron 2346 HE Quad-core on test05-08

RAM
2 GB on test01-04
4 GB on test05-08

HDD
27 GB on /home
145 GB/server

Initially Eucalyptus version 1.5.1 was chosen for this thesis, but due to problematic configuration of the infrastructure a later version, 2.0.2, was used in the final testing. See section 4.2.1 for further explanation. For Hadoop MapReduce the latest version, 0.21.0, was chosen, due to the fact that a major API change was made in this release. 0.21.0 also has a large number of bug fixes compared to earlier versions. None of the servers had Hadoop or Eucalyptus previously installed on them.

4.2 Setup, configuration and usage

The following sections contain a description of how the Eucalyptus and Hadoop MapReduce software were set up. It is divided into three subsections: the first focuses on how to configure Eucalyptus on the available servers, the second on how to create an image with Hadoop suitable for the environment, and finally the MapReduce implementation along with how to get it running. Installing, configuring and running Eucalyptus requires a user with root access to the systems on which it is installed.

The configuration is based on a Debian host OS, as this was the system it was run on. This means that some packages or commands either do not exist or have a different command structure on other OS's like CentOS or RedHat Linux.

4.2.1 Setting up Eucalyptus

Compared to Xen, KVM works more "out of the box", since it is tightly integrated with the native Linux kernel. The choice was therefore to use KVM, to avoid any problems that might occur between the host OS and the hypervisor. This meant that four of the eight servers could not be used as servers inside the cloud infrastructure. Eucalyptus can probably be set up to use both Xen and KVM if it loads an image adapted to the correct hypervisor, but that is out of the scope of this thesis.

Installing Eucalyptus can be done using the package manager of the OS. In Debian, apt-get can be used once the Eucalyptus repository has been added to /etc/apt/sources.list. Depending on the version used, the repository location differs. To add Eucalyptus 2.0.2, edit the sources.list file and add the following line:

deb http://eucalyptussoftware.com/downloads/repo/eucalyptus/2.0.2/debian/ squeeze main

Calling apt-get install eucalyptus-nc eucalyptus-cc eucalyptus-cloud eucalyptus-sc will install all the parts of Eucalyptus on the server it is called on. Starting, stopping and rebooting Eucalyptus services is done through the init.d/eucalyptus-* scripts.

The physical network setup has one server acting as the front-end, running the cloud, cluster and storage controllers. The three other servers only run virtual machines and talk to the front-end. The FE does not run any instances at all, to avoid any issues with networking and resources. Public IPs can be "booked" in the CLC configuration file, and for the test environment the IP range *.*.*.193-225 was set as available to Eucalyptus. This needs to be communicated with the network admin, as the Eucalyptus software must assume that these IP addresses are free and no one else is using them.

Page 33: Hadoop MapReduce in Eucalyptus Private Cloud

4.2. Setup, configuration and usage 25

Figure 4.1: Eucalyptus network layout on the test servers, along with what services were running on each server.

In terms of networking, different modes have different benefits. However, due to the configuration of the subnet DHCP server, the SYSTEM mode is not an option. The STATIC configuration simplifies setting up new VMs, but it prevents benefits like VM isolation and especially the metadata service. So the choice is to use either the MANAGED or the MANAGED-NOVLAN mode. By setting up a virtual network between two of the servers one can verify whether the network is able to use virtual LANs. This is documented on the Eucalyptus network configuration page [8]. The network verified as VLAN clean, that is, it is able to run VLANs.

Most of the problems encountered when setting up a Eucalyptus infrastructure are related to the network. The networking inside the infrastructure consists of several layers, and with the error logs available (found in the /var/log/eucalyptus/*.log files) there is a lot of information to search through when looking for errors. The configuration file that Eucalyptus uses, /etc/eucalyptus/eucalyptus.conf, has a setting that sets the verbosity level of the logging. When installing, at least the INFO setting is recommended, while DEBUG can be set when there are errors whose source is hard to find.

When working with the Eucalyptus configuration file, it is important to note that the different subsystems (NC, CC, CLC, SC and Walrus) use the same configuration file but different parts of it. As an example, changing the VIRTIO_* settings on the Cloud Controller has no effect, as these are settings that only the Node Controller uses. This might cause confusion in the beginning of the configuration, but the file itself is by default very well documented.

When setting up the initial Eucalyptus version - Eucalyptus 1.5.1 - problems occurred with compatibility with newer libraries. Since the libraries on the system are relatively new compared to Eucalyptus 1.5.1 itself, it attempts to load libraries that have changed names and/or locations. The 1.5.1 version does, for example, attempt to load the Groovy library through an import call that points to a non-existent location. To remedy these library problems, 1.6.2 was selected as the next Eucalyptus version to try. This demanded a kernel and distribution upgrade from Debian 5 to 6.

Eucalyptus 1.6.2 has better support for KVM networking and calls to newer libraries, so it should work better than 1.5.1. 1.6.2 does not run without problems, though. Booting the NCs on the three node servers (test06-08) as well as the Cluster Controller on the front-end (test05) goes without problems, but the Cloud Controller silently dies a few seconds after launching. This is an error that cannot be found in the logs, since it is output directly to stderr from the daemon, which itself is redirected to /dev/null. By running the CLC directly from /usr/sbin/eucalyptus-cloud instead of /etc/init.d/eucalyptus-clc one can see that 1.6.2 still has dependency problems with newer Hibernate and Groovy libraries. This could be solved by downgrading the libraries to earlier versions, however this could cause compatibility issues with other software running on the servers.

To prevent compatibility issues the latest version, 2.0.2, was installed on the four servers used. This proved to be a working setup. All the Eucalyptus services run, but they are not connected to each other: the CLC is not aware of any SC, Walrus or CC running, and the CC is not aware of any nodes available. Since Eucalyptus is a web service running on Apache Web Server, one can verify that the services are running by calling

ps auxw | grep euca

to determine that the correct httpd daemons are running. This should show an httpd daemon running under the eucalyptus user.

There are two ways of connecting the different parts of the system: either the correct settings can be set in the configuration file at the CLC, which is then rebooted, or the euca_conf CLI program can be run. What euca_conf actually does is change the configuration file and reboot a specific part of the CLC. The cloud controller can then connect to the NCs through REST calls (which can be seen by reading the /var/log/eucalyptus/cc.log file at the front-end). This means that the Eucalyptus infrastructure is working in one sense, but the virtual machines themselves can still be erroneously configured.
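For reference, the registration of the components with euca_conf can be sketched as below; the host addresses and the cluster name are placeholders for the test servers:

```shell
# Run on the CLC (test05). euca_conf edits eucalyptus.conf and restarts
# the relevant subsystem behind the scenes.
euca_conf --register-walrus <front-end IP>
euca_conf --register-cluster <cluster name> <front-end IP>
euca_conf --register-sc <cluster name> <front-end IP>
euca_conf --register-nodes "<test06 IP> <test07 IP> <test08 IP>"
```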

Network configuration

Figure 4.2: The optimal physical layout (left) compared to the test environment (right).

Configuring the right network settings for the VMs that run inside the cloud is a somewhat trial-and-error procedure. Even though there is documentation on network setup, it does not explain why the setup should be done in the documented way. It also assumes that the cloud has a specific physical network setup. In a perfect environment the front-end should have two NICs, with one NIC connected to the external network and one to the internal. See Figure 4.2.

The front-end acts as a virtual switch in any case, which means that in a single-NIC layout the traffic passes through the physical layer to the front-end, where it is switched and then passed through the same network again, as shown in Figure 4.3. This means that on the NCs the private and public interface settings are in fact the same NIC, with a bridge specified so that virsh knows where to attach new VMs. On the front-end there are no virtual machines, but only a connection to the outside network (the NIC, eth0) and a connection to the internal network (here a bridge).

Figure 4.3: Network traffic in the test environment. Dotted lines indicate a connection on the same physical line.

While the documentation specifies that MANAGED should work even in the non-optimal layout the physical network uses, none of the different combinations of settings on the NC and FE could make the VMs properly connect to the "outside" network. A faulty network configuration can be spotted by checking the public IP address of a newly created instance through the euca2ools or another tool like Hybridfox, or by reading through the logs; if the public IP shows 0.0.0.0 it is generally an indication of a faulty network setting. To properly configure the network for the VMs, the front-end and node controllers need matching settings in their configuration files. Table 4.1 shows the variables shared between the NC and the front-end, with the values that provide a working network in the test environment.

Variable             Front-end value    NC value          Comment
VNET_MODE            MANAGED-NOVLAN     MANAGED-NOVLAN    Network environment. Same on NC and FE.
VNET_PUBINTERFACE    eth0               eth0              The public interface to use.
VNET_PRIVINTERFACE   br0                eth0              The private interface to the nodes.
VNET_BRIDGE          -                  br0               Bridge to connect the VMs to.

Table 4.1: Configuration variables and their corresponding values in the environment.

This results in an internal network communicating through bridges.

When a VM is booted, the VM's network interface (itself a virtual bridged NIC) attaches to the bridge specified in the configuration file. This requires that the node controllers and the front-end have created a bridge to the physical network.

When running Ubuntu or Debian as host OS, disabling the NetworkManager package is required, since it can - and most likely will - cause problems. NetworkManager can be scheduled to check the network against a certain configuration and reset it if it is configured in another way. This was identified at least twice as the root cause of networking issues during the course of this thesis.

Node Controller configuration

The Node controllers are responsible for communicating with the hypervisor on each machine. In Eucalyptus this is done by talking to virsh, a command-line virtualization tool for Linux. Virsh then takes care of telling KVM (in this case) which image/ramdisk/kernel to load, along with a few other settings. Most of these can be left untouched, but a few things need to be changed according to the Eucalyptus settings. Virsh is made to run stand-alone, and in order for Eucalyptus to run virsh, the eucalyptus user has to be allowed to run it. These settings are specified in the Eucalyptus administration handbook, see reference [8].

If the VMs fail to receive an IP address, one can specify in the virsh settings to use VNC, a remote desktop system. Using VNC can be regarded as a security issue and should only be used as a means to troubleshoot. VNC can be added to all the VMs launched on the NC where the changes are made. This requires an image where a username/password pair already exists, compared to the root access that is given through keypairs in the Eucalyptus default environment. VNC was used in the initial troubleshooting of the networking by editing the /usr/share/eucalyptus/gen_kvm_libvirt_xml file. In the <devices> element, adding this gives VNC accessibility:

<graphics type=’vnc’ port=’-1’ autoport=’yes’ keymap=’en-us’ listen=’0.0.0.0’/>

Node controllers running in MANAGED-NOVLAN mode require the NIC to be bridged. This can be done using the Linux package bridge-utils, which provides the brctl command. Once bridge-utils is installed, bridges can be added or removed by editing the network configuration file, /etc/network/interfaces, and restarting the networking daemon through the command /etc/init.d/networking restart. On the test system all NCs have the same networking configuration:

iface eth0 inet dhcp

auto lo

iface lo inet loopback

auto br0

iface br0 inet dhcp

bridge_ports eth0

This configuration means that eth0 is configured but not activated by default, whilst the loopback interface (127.0.0.1 or localhost) is always on. The bridge is attached to the eth0 interface, where it is automatically configured and brought up.

In the Eucalyptus configuration file most of the settings can be left at their default values, as the configuration file is mostly set to run "out-of-the-box". Changing the NC_PORT configuration can be useful when there are other services running on the server, but it requires changing the settings on the front-end too, as this is the port through which the different Eucalyptus parts talk to each other. The only settings the NC needs are VNET_MODE, VNET_PUBINTERFACE, VNET_PRIVINTERFACE, VNET_BRIDGE and HYPERVISOR (either "xen" or "kvm"; kvm was used in this thesis).
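Putting the NC column of Table 4.1 together with the HYPERVISOR setting, the node controller's eucalyptus.conf fragment can be sketched as:

```shell
# /etc/eucalyptus/eucalyptus.conf on each NC (test06-08)
HYPERVISOR="kvm"
VNET_MODE="MANAGED-NOVLAN"
VNET_PUBINTERFACE="eth0"    # public and private are the same physical NIC here
VNET_PRIVINTERFACE="eth0"
VNET_BRIDGE="br0"           # bridge that new VM interfaces attach to
```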

Front-end configuration

Front-end refers to the Cloud controller, which in this case also hosts the Walrus, Storage controller and Cluster controller. On larger clusters the SC, CC and Walrus should run on other servers to minimize load. Cluster controllers also specify what is referred to as Availability Zones in AWS terms.

Configuring the front-end is done by editing the eucalyptus.conf file. The changes required here are more extensive than on the NCs, as this file specifies the networking mode, subsystem locations and external software. Some of the changes can be made using the euca_conf script, but editing the configuration file directly is suggested to ensure that all the changes are exactly the ones needed. In the test environment the following settings were changed from the defaults (see Table 4.1 for the VNET modes of the front-end):

DISABLE_DNS
Enabling DNS allows the VMs to use Fully Qualified Domain Names. This requires that the network admin adds the front-end address to the public nameserver as the nameserver of Eucalyptus VMs. This is further described in the handbook [8]. Set to N in the test environment, as FQDNs are needed due to Hadoop requirements.

SCHEDPOLICY
The policy for how the front-end chooses which NC to place a new VM on. ROUNDROBIN is the default, which means that the front-end cycles through each NC when adding a new VM. This gives an even distribution of VMs. Set to ROUNDROBIN.

VNET_DHCPDAEMON
Specifies the location of the DHCP server daemon binary. The front-end runs this on the internal net to distribute IP addresses to the VMs. Set to /usr/sbin/dhcpd3.

VNET_DHCPUSER
Sets the user to run the DHCP server. Can be either dhcpd or root. Disabled in the test environment, which defaults to root.

VNET_SUBNET
Specifies which internal IP range the VMs should use inside the internal network. These addresses cannot be accessed from outside. The IP range should be one that does not exist on the current subnet. Set to 10.10.0.0 in the test environment.

VNET_NETMASK
In conjunction with the subnet setting, the netmask specifies which addresses out of the subnet can be given out. The test setting, a netmask of 255.255.0.0, enables VMs to receive internal addresses from 10.10.1.2 - 10.10.255.255. The address 10.10.1.1 belongs to the DHCP server in the internal network (basically the front-end).

VNET_DNS
Address of the external DNS. When a properly configured image boots, it fetches this address from the metadata service and adds it to its /etc/hosts file.

VNET_ADDRSPERNET
Determines how many addresses can be assigned per virtual subnet. Recommended is either 32 or 64 on large clouds. It is fairly irrelevant in the test environment, as there is not enough physical hardware to support that many VMs. Set to 32.

VNET_PUBLICIPS
Can be either a range or a comma-separated list of IP addresses. These are the addresses on the public external net that are free to use. Set to 130.239.48.193-130.239.48.225 in the test environment.
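Collecting the settings above, the front-end's eucalyptus.conf fragment used in the test environment can be sketched as follows; the external DNS address is left as a placeholder:

```shell
# /etc/eucalyptus/eucalyptus.conf on the front-end (test05)
DISABLE_DNS="N"
SCHEDPOLICY="ROUNDROBIN"
VNET_MODE="MANAGED-NOVLAN"
VNET_PUBINTERFACE="eth0"
VNET_PRIVINTERFACE="br0"
VNET_DHCPDAEMON="/usr/sbin/dhcpd3"
VNET_SUBNET="10.10.0.0"
VNET_NETMASK="255.255.0.0"
VNET_DNS="<external DNS IP>"
VNET_ADDRSPERNET="32"
VNET_PUBLICIPS="130.239.48.193-130.239.48.225"
```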

As the front-end takes care of distributing IP addresses to new VMs in all modes except SYSTEM, the front-end needs to have its internal DHCP server running. The DHCP daemon is an external binary that has to be set in the configuration file. This binary only runs when there are VMs running in the cloud, and it only listens to, and answers calls from, VMs. An issue with Eucalyptus 2.0.2 is that it is programmed to work against the dhcpd3 daemon, whereas the latest daemon, which replaces dhcpd3 in the Debian repo, is the isc-dhcp-server. The isc server has an incompatible API, which renders it unusable to Eucalyptus 2.0.2. This means that to run the Eucalyptus DHCP service, the system needs the older dhcpd3 server installed and not the isc server. This can be done by forcing apt-get to install the older version of dhcpd3.
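On the Debian release used here the dhcpd3 binary came from the dhcp3-server package; package naming is release-specific, so treat this as a sketch:

```shell
# Remove the incompatible isc server and install the older dhcpd3 package.
apt-get remove isc-dhcp-server
apt-get install dhcp3-server
# Sanity check: the binary path must match VNET_DHCPDAEMON.
ls -l /usr/sbin/dhcpd3
```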

Accessing and managing a Virtual Machine

To launch a virtual machine inside the Eucalyptus cloud the user requires keypairs, which are created and retrieved through the Eucalyptus web interface. The default user, the admin, has privileges to run, add, remove or edit any images inside the cloud, while an ordinary user cannot do as much. In the test environment the admin privileges were used so that images could easily be updated in the cloud.

The first thing required is to generate a keypair that is bound to the user (admin in this case). This keypair is used in all managing of both the Eucalyptus features and the instances. By logging in to https://<front-end>:8443/#login and choosing Credentials, one can download the required keypair files. Extracting them and running source eucarc in a CLI gives the current shell access to the euca2ools credentials, which enables CLI interaction with the Eucalyptus features. The keypair, a file ending with .private, is required to SSH into an instance.
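A typical session can be sketched as below; the archive name is illustrative, as it varies per user:

```shell
# Download the credentials ZIP from the web UI, then:
mkdir -p ~/.euca
unzip euca2-admin-x509.zip -d ~/.euca   # archive name varies per user
source ~/.euca/eucarc                   # exports EC2_URL, access keys, cert paths
euca-describe-availability-zones        # quick check that euca2ools reach the cloud
```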

On the default images created by Eucalyptus there are no users with a username/password to SSH directly into the instance. When the image boots, it instead contacts the metadata service through the generic address 169.254.169.254 in MANAGED modes. The IP address is automatically configured on the virtual switch on the front-end, so that calls to it from a VM land on the front-end. During boot the rc.local script thus calls the metadata service, downloads the keypair, and adds it to root's authorized SSH keys. In short, anyone with the .private key can log in to an instance using

ssh -i xxxxx.private root@<VM IP>

or

ssh -i xxxxx.private root@<VM FQDN>

The keypair can also be used in other tools. During the course of the thesis Hybridfox was used to manage instances. This is a graphical tool built into Firefox that simplifies viewing, launching and stopping running instances. Hybridfox was thoroughly employed in the testing phase, as it quickly shows which instances are running, their type and their current status.

Usage                          Protocol   From Port   To Port   CIDR
Ping                           ICMP       -1          -1        0.0.0.0/0 (all)
SSH                            TCP        22          22        0.0.0.0/0
DNS                            TCP        53          53        0.0.0.0/0
Hadoop HDFS metadata           TCP        8020        8020      0.0.0.0/0
Hadoop NameNode access         TCP        9000        9000      0.0.0.0/0
HDFS data transfer             TCP        50010       50010     0.0.0.0/0
HDFS block/metadata recovery   TCP        50020       50020     0.0.0.0/0
Hadoop WS JobTracker           TCP        50030       50030     0.0.0.0/0
Hadoop WS TaskTracker          TCP        50060       50060     0.0.0.0/0
Hadoop WS NameNode             TCP        50070       50070     0.0.0.0/0
Hadoop FS custom port          TCP        54110       54111     0.0.0.0/0

Table 4.2: Security group with Hadoop rules.

When running in any MANAGED mode, Eucalyptus supports a feature called security groups. Security groups are a software-firewall type of feature that restricts access to instances inside a specific group. Launching an instance requires the user to specify a security group to launch it within (or use the default group). The user can then open access on certain ports, from certain CIDR locations, for certain types of network traffic (UDP, TCP or ICMP), or a combination of them all. Security groups do not restrict traffic between instances inside the same group, only traffic to and from locations outside the group. For Hadoop, a security group was set up to allow access to the web services and file upload.

Table 4.2 details how the security group used in the test environment was defined. Some ports in the security group were opened only to verify access in a test with an external Hadoop NameNode.
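With euca2ools the rules in Table 4.2 can be expressed roughly as below; the group name hadoop and the image ID are assumptions, and only a few representative rules are shown:

```shell
euca-add-group -d "Hadoop cluster group" hadoop
euca-authorize -P icmp -t -1:-1 -s 0.0.0.0/0 hadoop   # ping
euca-authorize -P tcp -p 22 -s 0.0.0.0/0 hadoop       # SSH
euca-authorize -P tcp -p 50030 -s 0.0.0.0/0 hadoop    # JobTracker web UI
euca-authorize -P tcp -p 50070 -s 0.0.0.0/0 hadoop    # NameNode web UI
euca-run-instances -g hadoop -k mykey <image id>      # launch into the group
```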

4.2.2 Configuring a Hadoop image

Eucalyptus does not come with any images when installed, but several images with different operating systems that have been made to fit the Eucalyptus private cloud can be downloaded from the Eucalyptus homepage. The benefit of using these is that everything needed to run has already been prepared, including scripts, kernel and ramdisk pairs, and the OS on the image itself. The downside is that there is not much space for extra software on the images, and there are unneeded binaries cluttering them.

The idea behind the Hadoop image is to have a preconfigured image with the proper users, credentials, binaries and scripts already loaded. When the image is loaded as a VM, only minimal configuration has to be done on the Hadoop slaves and the master. There are many ways to solve this, including having one type of image for the master node and one for the slaves. However, in this thesis only a single image was used, to allow for homogeneity across the instances.

To simplify the image creation, the Eucalyptus Debian 5 x86_64 image was chosen as the default image to work with. This image comes with a kernel and ramdisk pair which can be uploaded immediately to Walrus and Eucalyptus using the euca2ools. As downloaded, the image is just an .img file, so to be able to edit it, the image has to be mounted on a physical OS. With a simple script run as root this can be done quickly. Assuming the image is named debian.5.hadoop.img, a few lines suffice to make it fully editable as a normal root filesystem:

mkdir mnt

mount -o loop debian.5.hadoop.img mnt

mount -o bind /proc mnt/proc

mount -o bind /sys mnt/sys

mount -o bind /dev mnt/dev

chroot mnt

A specialized script was created to help with mounting/unmounting the image, as well as a specialized script adapted to bundle and unbundle images. Bundling an image is the Eucalyptus action of splitting the image into smaller files, uploading them to Walrus and preparing them to be loaded. This allows the NCs to fetch and boot them on demand.
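The bundling steps themselves follow the standard euca2ools sequence; bucket and manifest names below are illustrative:

```shell
# Split, upload and register the image (run with admin credentials sourced).
euca-bundle-image -i debian.5.hadoop.img
euca-upload-bundle -b hadoop-image -m /tmp/debian.5.hadoop.img.manifest.xml
euca-register hadoop-image/debian.5.hadoop.img.manifest.xml
```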

VMs are always reset to their initial state when loaded. If there are changes to be made on boot, editing the /etc/rc.local file is one way to ensure scripts are run. This is how some automated on-boot settings are applied to the VM when it boots.

On the image, a user hadoop under the group hadoop was created. All the Hadoop files are intended to run under this user, so privileges have to be set accordingly. Hadoop 0.21.0 was downloaded and installed in /opt/hadoop/hadoop-0.21.0. However, this nearly fills up the image space, so removing all the source code found in the src/ folder is necessary. All of the Hadoop directories have read-write-execute privileges for the hadoop user.

As the image does not have any space to store files or host HDFS on, storage can be provided in several different ways. Similar to Amazon EBS, Eucalyptus can give dynamic storage to instances. Neither the block storage (SC) nor the Walrus was used by the images in this test, however. Instead the images use the ephemeral storage given to each VM on boot, which is basically a slice of the local hard drive of the NC the VM is loaded on. The size of the ephemeral storage is adjusted by the admin in the Eucalyptus web interface and can differ depending on what type of VM is loaded. Under KVM, the ephemeral storage is an unformatted hard drive found at /dev/sda2. The image has to mount and format this drive each time it boots, compared to Xen where it is already mounted and running an ext2 filesystem. On the image, in rc.local, a script is thus added that formats and mounts /dev/sda2.
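The rc.local addition can be sketched as follows; the mount point and filesystem type are choices made here for illustration:

```shell
# Format and mount the ephemeral disk on every boot; under KVM it shows
# up as an unformatted /dev/sda2.
if [ -b /dev/sda2 ]; then
    mkfs.ext3 -q /dev/sda2
    mkdir -p /mnt/ephemeral
    mount /dev/sda2 /mnt/ephemeral
fi
```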

The Hadoop settings are thus modified to use the ephemeral hard drive for data storage. By changing /opt/hadoop/hadoop-0.21.0/conf/core-site.xml to point at the ephemeral storage, all loaded instances will load from their own ephemeral storage.

Hadoop requires changes in core-site.xml, mapred-site.xml and hdfs-site.xml, as well as in the masters and slaves configuration files, each time a node is included in a virtual Hadoop cluster. These files have therefore been modified with placeholder variables that are replaced remotely through a custom-made script. This script is called through SSH and run with hadoop privileges. What the script does is basically replace each variable with the data that is specific to the cluster. As an example, hdfs-site.xml contains a variable that must point at the master of the cluster. As the master can differ for each cluster, it has a default placeholder of {MASTER}, which the script replaces with the IP of the master node.
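The substitution itself amounts to one sed call per placeholder. The sketch below demonstrates the idea on a throwaway copy of core-site.xml; the file path and the fs.default.name property are used purely for illustration:

```shell
# Create a demo config carrying the {MASTER} placeholder.
mkdir -p /tmp/hadoop-conf-demo
cat > /tmp/hadoop-conf-demo/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{MASTER}:9000</value>
  </property>
</configuration>
EOF

# In the real scripts the IP comes from hadoop-cc.conf; hardcoded here.
MASTER_IP="10.10.1.2"
sed -i "s/{MASTER}/${MASTER_IP}/g" /tmp/hadoop-conf-demo/core-site.xml
grep "hdfs://" /tmp/hadoop-conf-demo/core-site.xml
```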

A short list of changes that have to be made to the default image in order to have a working Hadoop image:

Edit the rc.local file
Add more boot features to the image. The Hadoop image retrieves its FQDN with public/internal IP and sets its hostname. It also mounts and formats the ephemeral storage. The keypair retrieval code also resides in rc.local and must be left untouched.

Create user hadoop
Create the group and user hadoop. A password is not necessary, as the user will be accessed either through the image root or using a remote keypair.

Save Hadoop
Store the Hadoop binaries at /opt/hadoop/hadoop-0.21.0 without the source code. These files need ownership and privileges set to hadoop.

Set $HADOOP_HOME
The shell variable $HADOOP_HOME needs to be set to the Hadoop home directory. It is set in /etc/profile to ensure that all users have it set.

Edit configuration files
Change the configuration files mentioned earlier to have default, replaceable placeholder values set.

Modify conf/hadoop-env.sh
Hadoop runs into issues when using IPv6, so hadoop-env.sh needs a line that forces Hadoop to use IPv4. Adding
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
forces Hadoop to use IPv4 from the first run.

An issue with Hadoop is that both the MapReduce and HDFS projects require the nodes to know both the Fully Qualified Domain Name (FQDN) and the IP address. Eucalyptus does not have an entirely working reverse-DNS system that gives the nodes the ability to determine their FQDN through a reverse DNS lookup. As such, Hadoop would not be able to run and would crash at startup with an UnknownHostException. To work around this, the instance can use the Eucalyptus metadata service to fetch its internal and external IP, as well as its FQDN. The rc.local therefore contains another script that fetches these, writes them to /etc/hostname, and forces the system to reload the hostname from that file. When these are set in the file, Hadoop does not try to do a reverse DNS lookup through the nameservers but instead automatically resolves the name from the file.
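The rc.local hostname fix can be sketched with the EC2-compatible metadata paths; the exact keys are assumptions borrowed from the Amazon metadata API that Eucalyptus mimics:

```shell
MD="http://169.254.169.254/latest/meta-data"
PUB_IP=$(curl -s "$MD/public-ipv4")
LOC_IP=$(curl -s "$MD/local-ipv4")
FQDN=$(curl -s "$MD/public-hostname")

echo "$FQDN" > /etc/hostname
hostname -F /etc/hostname              # reload the hostname from the file
echo "$LOC_IP $FQDN" >> /etc/hosts     # let Hadoop resolve the name locally
```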

4.2.3 Running MapReduce on the cluster

The Hadoop cluster consists entirely of virtual machines. The master node is one (generally the first booted) machine in the cluster, albeit with a slightly different configuration compared to the slaves. To set up a cluster, VMs are booted either through the euca2ools CLI or the Hybridfox Firefox plugin. Once the machines are booted and ready to run - i.e. they have mounted and formatted their ephemeral drives and received a FQDN - scripts that reside outside the cluster on a physical server perform the setup based on a configuration file. The configuration file, hadoop-cc.conf, contains the external IP of the master node, a list of IPs of all the slave nodes, and the replication level.
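A hadoop-cc.conf along these lines could look as follows; the variable names are hypothetical, and only the three pieces of information described above are stored:

```shell
# hadoop-cc.conf -- cluster description read by the outside scripts
MASTER="130.239.48.193"
SLAVES="130.239.48.194 130.239.48.195 130.239.48.196"
REPLICATION=2
```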

Script/file name     Location                             Purpose
hadoop-cc.conf       Outside the cluster                  Contains the configuration settings for launching a cluster.
hadoop-cc.conf       Image: $HADOOP_HOME/cc-scripts/      The local counterpart. Is replaced by the config file outside the cluster.
update-cluster       Outside the cluster                  Reads from the .conf file and updates the settings on all Hadoop nodes.
boot-cluster         Outside the cluster                  Boots the cluster. Requires that the cluster has been updated beforehand.
terminate-cluster    Outside the cluster                  Terminates a running cluster. Does NOT terminate VMs, only Hadoop services.
update-hadoop.sh     Image: $HADOOP_HOME/cc-scripts/      Updates the local Hadoop configuration based on a hadoop-cc.conf found locally.
set-master.sh        Image: $HADOOP_HOME/cc-scripts/      Changes the masters in the local image. Called by update-hadoop.sh.
set-slaves.sh        Image: $HADOOP_HOME/cc-scripts/      Changes the slaves locally. Called by update-hadoop.sh.
set-replication.sh   Image: $HADOOP_HOME/cc-scripts/      Changes the replication level locally. Called by update-hadoop.sh.
master               Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
slaves               Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
conf-site.xml        Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
hdfs-site.xml        Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.
mapred-site.xml      Image: $HADOOP_HOME/conf/default/    Default file to replace the configuration file with.

Table 4.3: Scripts and custom config files used to run the Hadoop cluster.

Table 4.3 lists the scripts used in configuring and running a Hadoop cluster. The procedure to boot a cluster and run a MapReduce job is the following:

1. Launch VMs through euca2ools or Hybridfox.

2. Change the master, slaves and the replication level in the hadoop-cc.conf file.

3. Run update-cluster.
update-cluster does several things. First it connects to the master node and generates an SSH key for the user hadoop. That key is then distributed to all the slaves, enabling passwordless SSH for user hadoop to all VMs. Second, it logs in on all the VMs and uses the metadata service to fetch the IP/FQDN pairs. Third, it creates an /etc/hosts file containing the IPs/FQDNs of all the VMs in the cluster. Lastly, it distributes the /etc/hosts file and the hadoop-cc.conf file to all nodes and tells them to run the update-hadoop.sh script.

4. Run boot-cluster.
This starts the Hadoop MapReduce and HDFS services as user hadoop. With the SSH keypairs already distributed for user hadoop, the master node takes care of starting the services on all the nodes.

5. Upload the files to HDFS.
Since the database files do not reside on the cluster by default, they are transferred to the master through SCP. With an SSH call to the master's HDFS client they can then be uploaded to HDFS.

6. Download the JAR file containing the MapReduce implementation.
Fetching the JAR file containing the job is a simple matter of placing it in $HADOOP_HOME.

7. Run the MapReduce job using bin/hadoop jar hadtest.jar count articles out.
This runs the JAR hadtest.jar with the arguments "count articles out", i.e. run the count job on the articles HDFS folder and put the results in the out HDFS folder.

8. Retrieve the result from the out folder.
The result will be a list based on the implementation.

9. Calculate the job time.
This is done by visiting Hadoop's internal web service for the JobTracker. At http://<Master External IP>:50030 the MapReduce framework provides feedback on time, number of tasks, etc.

The scripts that run outside the cluster require the presence of a keypair that gives root access to the instances, since they log in as root, switch to user hadoop on the VMs, and then perform the relevant actions.

There are a few shortcomings in the Eucalyptus DNS nameserver for internal VMs: the instances cannot resolve each other through the nameserver that Eucalyptus provides. To circumvent this, the update-cluster script builds a hosts file containing the IP/FQDN pairs of all the nodes in the cluster. The computed hosts file is distributed to all slaves, replacing the existing one. This lets Hadoop resolve the names through the local file.
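Assembling such a hosts file is straightforward; the sketch below builds one from "IP FQDN" pairs, with hostnames and file paths chosen for illustration:

```shell
# Assemble a hosts file from "IP FQDN" pairs; in the real script the
# pairs are collected from the running VMs via the metadata service.
HOSTS_OUT="/tmp/cluster-hosts-demo"
printf '127.0.0.1 localhost\n' > "$HOSTS_OUT"
while read -r ip fqdn; do
    printf '%s %s\n' "$ip" "$fqdn" >> "$HOSTS_OUT"
done <<'EOF'
10.10.1.2 vm-1.eucalyptus.internal
10.10.1.3 vm-2.eucalyptus.internal
EOF
cat "$HOSTS_OUT"
```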

What is also missing in Eucalyptus is proper communication between VMs addressed by their external IPs. For instance, assume a VM with 10.10.1.2 and 130.239.48.193 as its internal and external IP, and another VM with 10.10.1.3/130.239.48.194 as its IPs. The first VM can only contact the other VM through the internal IP, 10.10.1.3; contacting it through the external IP, 130.239.48.194, gives a timeout. The reason is the virtual switch inside the front-end. When booting a new instance, the front-end adjusts its Netfilter rules using the iptables binary, in the same way as it assigns itself the metadata address. The problem is that Eucalyptus does this automatically for the internal IPs, but the rules for the external IPs of the VMs are never adjusted.

The boot-cluster and terminate-cluster scripts manually add and remove iptables rules to solve this issue through:

sudo iptables -t nat -A POSTROUTING -s "<LOC_IP>" -j SNAT --to-source "<PUB_IP>"

and

sudo iptables -t nat -D POSTROUTING -s "<LOC_IP>" -j SNAT --to-source "<PUB_IP>"

Unfortunately, this requires that the scripts are run at the front-end with su privileges. Adding an SSH command would suffice for running them remotely, but they would still require iptables changes on the front-end.

4.2.4 The MapReduce implementation

To test the MapReduce framework on the infrastructure, an embarrassingly parallel problem was chosen that could easily use MapReduce to distribute the tasks. The problem chosen was to count all the links in a subset of Wikipedia articles, listing the article titles with the most links to them. This problem is well suited for MapReduce, as the database dumps from Wikipedia are large XML files.

The size of the files fits HDFS well, but with XML files there is a slight problem of tags spanning file splits. When splitting files, HDFS and MapReduce do not account for any specific split locations by default, so a specific InputSplit had to be implemented. Cloud9 [15] is a Hadoop MapReduce project with an implementation that suits the Wikipedia databases and MapReduce. However, its XML parsing is quite similar to the Apache Mahout [12] implementation, and both use the old pre-Hadoop 0.20 API. The implementation was therefore rewritten to fit the 0.21.0 API. It utilizes the MapReduce RecordReader's ability to read outside of the given split, so the InputFormat can read from a tag start to a tag end.

After the XML files have been parsed, a WikipediaArticle is passed to a Mapper. The mapper extracts the link titles noted in the XML string, outputting each link as a key and a 1 as a value. The Reducer then merges the links and sums the 1's into a total count. However, when these are output, they are sorted alphabetically by their keys and not by the number of links. This is solved by running a second, fast job (around 3 minutes) that reads the output file, passes it through a mapper that swaps the key-value pairs, and passes it through a reducer to print it out.

This implementation enables an adjustable amount of data to be used, as the whole Wikipedia database of current articles is a 27 GB XML file. Using only subset files during testing shows whether there is a difference in completion time between processing a large amount of data and a smaller amount. The subsets used were the article-only (no metadata, images, history or user discussions) subsets dumped at 2011-03-17 [25]. The subsets were 2.9 GB (first three XML files), 4 GB (first 4) and 9.6 GB (first 8 files).

To calculate the time taken to process the data, the following was done:

1. Start a specific amount of Hadoop VMs.

2. Configure them accordingly.

3. Upload the data and the JAR-file to the master VM.

4. Put the data onto HDFS.

5. Run the job 5 times, making sure to remove the output folder between runs.

6. Go to the JobTracker Web Service and analyse the time.

7. The total time of a job consists of the Count and Sort jobs added together.

8. Calculate an average of the 5 runs.

All nodes used one core and 512 MB RAM each, and the master also ran a slave in all configurations. When running with only one node in the cluster, the replication level was set to 1; in all other tests the replication was set to 2. The size of the ephemeral HDD varied depending on the size of the database and the number of nodes in the cluster. See Table 4.4, where the master also counts as a slave.

DB size   Nr. of nodes   Master HDD size   Slave HDD size   Replication level
2.9 GB    1              10 GB             -                1
2.9 GB    2              10 GB             5 GB             2
2.9 GB    4              10 GB             5 GB             2
2.9 GB    8              10 GB             5 GB             2
2.9 GB    12             10 GB             5 GB             2
4.0 GB    1              10 GB             -                1
4.0 GB    2              10 GB             5 GB             2
4.0 GB    4              10 GB             5 GB             2
4.0 GB    8              10 GB             5 GB             2
4.0 GB    12             10 GB             5 GB             2
9.6 GB    1              25 GB             -                1
9.6 GB    2              10 GB             10 GB            2
9.6 GB    4              10 GB             10 GB            2
9.6 GB    8              10 GB             10 GB            2
9.6 GB    12             10 GB             10 GB            2

Table 4.4: The ephemeral HDD size of the nodes in the running cluster.

To simplify deployment on the cluster, the JAR file was built to include every library on its build path. This produces a relatively large file compared to one without these libraries, but it removes the need to distribute the libraries separately at launch through the MapReduce framework itself.


Chapter 5

Results

This chapter details the results of running Hadoop MapReduce jobs inside a virtual cluster. The procedure for performing the tests is explained in an earlier chapter, see section 4.2.4. When running the example job and verifying the timings and map tasks run, the internal web interface of Hadoop was used to obtain the timings of the job and the individual map tasks.

VMs running on Eucalyptus never crashed during the test period once they were properly booted; a running machine kept running until it was terminated, either from inside the VM or externally. However, several problems were observed during the boot phase of a VM, which is where most of the issues occurred. A few observations related to running the Eucalyptus cloud:

– Booting many instances at the same time increases the crash risk of VMs. When Eucalyptus receives requests to boot several instances at the same time, there is a risk that a VM either kernel panics or fails to receive an IP or properly resolve its FQDN. If a large number of instances (more than three) are booting simultaneously, an instance is more likely to fail to boot. Requesting instances one by one minimizes the risk, but increases the time to completely boot all the instances in a cluster.

– Node Controllers do not remove cached images that have been removed. Removing an image from the cloud removes it from Walrus, but the Node Controllers retain a cached copy of it, even though it can no longer be booted by the infrastructure. This can clutter the HDDs of the NCs and sometimes requires manual removal to free up disk space.

– Networking to a new instance is never instantly available. Even though the iptables rules and bridging are properly set up, the instances are not instantly accessible. They respond to ping, but it usually takes 1-2 minutes before an instance can be accessed through SSH.

– The network to instances never runs at a steady speed. Transferring files through wget, curl or SCP never had a stable speed. It could go up to 20 MB/s, and in the next moment stall for 15 seconds. This was frequently observed, especially when communicating from outside the virtual cluster into a VM. Whether this affected the speed between Hadoop nodes in a MapReduce job is unknown.

Hadoop MapReduce itself is built around redundancy, so if a machine stopped working, the task would eventually be completed either on the same VM or on another in the cluster. Running the MapReduce jobs did not cause any severe VM crashes or similar, even though they could put a high load on the machines. Some things were observed during the runs, however:

– Near-linear increase in performance on few nodes. Increasing from 1 to 2 nodes gives a large increase in performance. Beyond that, the performance increase is not linear, but still present.

– A larger cluster means more repetition of the same task. The larger the cluster, the higher the likelihood that a map task is executed at least twice and then dropped. This also depends on the size of the database it runs on. As an example, a job of 157 tasks is executed exactly 157 times on a 1-node cluster, whereas on an 8-node cluster it would be executed on average 165 times, with 8 tasks dropped.

– Tasks get dropped more often on a large cluster. Related to the previous item, tests indicate that nodes time out more often the larger the cluster is. Whether this is related to the MapReduce framework or the internal networking between the nodes is unknown. It forces the task to be redone later or at another location.

An important note here is that "performance" is used in a simplified sense. This thesis focuses only on how fast the job finishes, which means that:

high performance = short job time

That is, high performance equals a short time to complete the job.

5.1 MapReduce performance times

Initially the testing was supposed to be made with a single database size, but once the initial tests on the relatively small 2.9 GB database of Wikipedia articles were done, the results indicated that 12 nodes would not increase performance compared to 8 nodes. The database was therefore increased to 4.0 GB at first, and then to 9.6 GB, to verify the results.

The increased size did not, however, display any significant change from the initial analysis: when the size of the cluster was put to the maximum of 12 virtual instances, the gain in performance was marginal. Figures 5.1, 5.2 and 5.3 show the performance curves based on the size of the database and the number of nodes in the cluster.


Figure 5.1: Runtimes on a 2.9 GB database.

Figure 5.2: Runtimes on a 4.0 GB database.

Figure 5.3: Runtimes on a 9.6 GB database.

As such, the only noticeable difference when nearing the maximum number of available virtual nodes is that the max/min times for completing the job draw closer to the average compared to the 8-node cluster. On a cluster with only a few nodes, the variation in how long it takes to complete the job was very high compared to an 8- or 12-node cluster.

The map times of the jobs did not change considerably based on the size of the database or the number of nodes in the cluster. The only noticeable difference was that the larger the database, the higher the maximum time of an individual map task. However, that could point to a problem regarding the network, where the JobTracker would be unable to contact the TaskTracker of an individual map task and thus drop it from the job, redoing it later or on another node. See Figures 5.4, 5.5 and 5.6.

Figure 5.4: Map task times on a 2.9 GB database.

Figure 5.5: Map task times on a 4.0 GB database.

Figure 5.6: Map task times on a 9.6 GB database.

This shows that a linear increase in performance based on the number of nodes is not achievable in the test environment set up here. The performance evens out when nearing the maximum number of nodes in the virtual cluster. Using more virtual nodes does, however, still give a performance increase.


Chapter 6

Conclusions

While Hadoop MapReduce is designed to be used on top of physical commodity hardware servers, testing has shown that running Hadoop MapReduce in a private cloud supplied by Eucalyptus is viable. Using virtual machines gives the user the ability to supply more machines when needed, as long as the physical upper limits of the underlying host machines are not reached. While setting up the cloud and Hadoop can prove problematic at first, it should not pose a problem to someone experienced in scripting, command line interfaces, networking and administration in a UNIX environment.

Using Hadoop in a virtual cluster provides the added benefit of reusing the hardware for something completely different when no MapReduce job is running. If the cloud contains several different images, it is quite viable to use a private cloud as a means to provide more computing power when needed, and to use the hardware for something else when it is not.

6.1 Restrictions and limitations

While this is a test of setting up and using MapReduce in a private Eucalyptus cloud, the number of physical servers severely restricts the size of the test. It does not reflect the size of a cloud that would most likely be used in a company or an organization, although the principles apply equally well. This thesis focuses more on what is required to build a Hadoop cluster inside a private cloud than on perfecting the means of creating and managing such a cluster, although a simple, streamlined and usable method is preferred.

The way of setting up the Hadoop nodes using the keypair, CLI scripts and configuration file is viable, but far from complete. It takes a while to boot images that use auto-mounting of ephemeral devices. API calls to Eucalyptus cannot always verify whether the VM has booted correctly, and since the networks are bridged, a certain amount of traffic is required to find the right node. This means a boot downtime that can range from 30 seconds to several minutes, after which there is still no guarantee that the VM has not kernel panicked or hit some other kind of error preventing it from booting.

With the current status of Eucalyptus in mind, the shortcomings regarding VMs connecting between their external IPs, the reverse DNS lookups and the incompatibilities with some binaries and libraries (dhcpd, hibernate and groovy) make it problematic to keep an up-to-date version of Eucalyptus. However, the API that Eucalyptus uses simplifies building applications and tools that interact with it. It is a major benefit that tools once used for AWS can just as easily be used in the private cloud infrastructure.

Eucalyptus also has two problems when booting a VM. The first is that if many instances are requested to boot at the same time, there is a risk that they fail to retrieve the proper IP, resolve the proper FQDN, or fetch the SSH key from the metadata service. While the first is most likely an issue related to the virtual switching and the bridging of the network, the FQDN and SSH problems reside in a metadata service that cannot be accessed or is not prepared. Slow networking in general, for example stalling SCP file transfers, was frequently observed on the type of network the test cloud uses. Setting up another kind of network, for example MANAGED mode only with proper switches, could possibly solve this.

The second large problem is that when booting several VMs at the same time on the same host machine, there is a risk of a kernel panic in the VM. Eucalyptus still regards the VM as running while it is not accessible at all. This is not an error in Eucalyptus itself; rather, an error occurred and virsh cannot properly tell Eucalyptus that the instance failed on boot.

6.2 Future work

The size of both the private Eucalyptus cloud and Hadoop's virtual cluster is very small compared to the scale they are designed for. Having a large cloud spread over many racks, and preferably over several geographically separated availability zones, would definitely increase the difficulty of the maintenance and setup process. However, adding more Node Controllers to the already existing cluster should not pose a considerable problem, as the majority of issues regarding the cloud network are already solved.

Hadoop should preferably run on larger VMs with more RAM and HDD available to them. It would be interesting to see whether the cluster shows the same decline in performance increase when it reaches 1,000 or more virtual machines. It would also be interesting to see whether the Hadoop cluster recognizes a VM residing on a different physical rack (but in the same virtual one) as a node inside the same rack. Rack awareness inside the Hadoop cluster relies on reverse DNS and FQDNs, which means that Eucalyptus first needs an upgrade in how it handles iptables, virtual switching and DNS lookups.

As the VMs run on three physical servers, a comparison between the performance of the virtual cluster and running Hadoop directly on top of the three servers would also be interesting. Dropping the cloud service, and with it the ability to switch VMs depending on need, would give fewer Hadoop nodes, but each with higher individual performance. Whether this gives greater performance than running more virtual nodes, and to what degree, would be a good comparison.

Page 53: Hadoop MapReduce in Eucalyptus Private Cloud

6.2. Future work 45

The way to launch, check and terminate Hadoop nodes could easily be made more streamlined, either as a complete stand-alone piece of software utilizing the open API of Eucalyptus, or by adding more scripts for use in a CLI environment. The fact that the Eucalyptus API is a web service allows many different approaches and ways of interacting with VMs.

Using the storage features found in Eucalyptus, like its storage controller and Walrus, to simplify the booting of a new image could be an interesting take on how to load the test data locally to the virtual cluster. Hadoop's MapReduce uses HDFS as file storage, but it can run directly on top of Amazon's S3. If Walrus honours the same API calls, MapReduce should be able to run on top of Walrus instead. However, as Walrus is accessed through the network, it can pose a bandwidth limitation (depending on the size of the database).
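As a rough illustration of this idea, the fragment below sketches what pointing Hadoop's S3-native filesystem at a Walrus bucket might look like. This is a hypothetical, untested sketch: the property names are Hadoop's standard S3 configuration keys, the bucket name and credential values are placeholders, and actually redirecting the S3 endpoint to the Walrus host would additionally require configuring the underlying S3 client library used by Hadoop.

```xml
<!-- Hypothetical core-site.xml fragment for running MapReduce against
     Walrus through Hadoop's s3n filesystem. Bucket name and credentials
     are placeholders; compatibility with Walrus is not verified here. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3n://wikipedia-bucket</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>EUCALYPTUS_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>EUCALYPTUS_SECRET_KEY</value>
  </property>
</configuration>
```

Whether such a setup outperforms copying the data into HDFS would depend on the bandwidth between the nodes and the Walrus host, as noted above.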


Chapter 7

Acknowledgements

I would like to thank my supervisor at CS UmU, Daniel Henriksson, and co-supervisor Lars Larsson for giving me this great opportunity to work with something so high-end and still usable. The help and feedback you have given me is duly appreciated.

I would also like to thank the system administrator at CS UmU, Tomas Ogren, for his patience and support when I repeatedly failed to set up and prepare the servers, solving problems that were out of the scope of this thesis.

A final thanks to Erik Elmroth, who managed to find several interesting distributed projects, this one included, despite his many work obligations and the lack of time.


References

[1] Amazon. Amazon elastic block storage (ebs). http://aws.amazon.com/ebs, 2011. Retrieved 2011-02-15.

[2] Amazon. Amazon elastic compute cloud (ec2). http://aws.amazon.com/ec2, 2011. Retrieved 2011-02-15.

[3] Amazon. Amazon simple storage service (s3). http://aws.amazon.com/s3, 2011. Retrieved 2011-02-15.

[4] Citrix. Citrix open cloud platform. http://www.citrix.com/English/ps2/products/subfeature.asp?contentID=2303748, 2011. Retrieved 2011-04-26.

[5] Cloud.com. The cloud os for the modern datacenter. http://cloud.com/, 2011. Retrieved 2011-04-26.

[6] Cloudera. Cloudera's distribution including apache hadoop. http://www.cloudera.com/hadoop/, 2011. Retrieved 2011-04-26.

[7] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI'04: Sixth Symposium on Operating System Design and Implementation. Google Inc., 2004.

[8] Eucalyptus Systems, Inc. Eucalyptus Administration Guide (2.0), 2010.

[9] Apache Software Foundation. Apache hadoop. http://hadoop.apache.org/, 2011. Retrieved 2011-02-17.

[10] Apache Software Foundation. Apache whirr. http://incubator.apache.org/whirr/, 2011. Retrieved 2011-04-26.

[11] Apache Software Foundation. Hdfs user guide. http://hadoop.apache.org/hdfs/docs/current/hdfs_user_guide.html, 2011. Retrieved 2011-04-06.

[12] Apache Software Foundation. Mahout. http://mahout.apache.org/, 2011. Retrieved 2011-03-10.

[13] M. Gug. Deploying a hadoop cluster on ec2/uec with puppet and ubuntu maverick. http://ubuntumathiaz.wordpress.com/2010/09/27/deploying-a-hadoop-cluster-on-ec2uec-with-puppet-and-ubuntu-maverick/, 2010. Retrieved 2011-03-10.

[14] Puppet Labs. Puppet powers it production. http://www.puppetlabs.com/puppet/introduction/, 2011. Retrieved 2011-04-26.

[15] J. Lin. Cloud9 - a mapreduce library for hadoop. http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/, 2011. Retrieved 2011-03-10.

[16] T. Mather, S. Kumaraswamy, and S. Latif. Cloud Security and Privacy. O'Reilly, 2009.

[17] Microsoft. Microsoft private cloud. http://www.microsoft.com/virtualization/en/us/private-cloud.aspx, 2011. Retrieved 2011-04-26.

[18] Derek Gottfrid (New York Times). Self-service, prorated supercomputing fun! http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/, 2007. Retrieved 2011-02-15.

[19] OpenNebula. Opennebula - the opensource toolkit for cloud computing. http://opennebula.org/, 2011. Retrieved 2011-04-26.

[20] A. T. Velte, T. J. Velte, and R. Elsenpeter. Cloud Computing - A Practical Approach. The McGraw-Hill Companies, 2010.

[21] VMware. Vmware vcloud. http://www.microsoft.com/virtualization/en/us/private-cloud.aspx, 2011. Retrieved 2011-04-26.

[22] W. von Hagen. Professional Xen Virtualization. John Wiley & Sons, 2008.

[23] T. White. Hadoop - The Definitive Guide. O'Reilly Media, 2nd edition, 2010.

[24] T. White. Hadoop - The Definitive Guide, page 260. O'Reilly Media, 2nd edition, 2010.

[25] Wikimedia. enwiki dump progress on 20110317. http://dumps.wikimedia.org/enwiki/20110317/, 2011. Retrieved 2011-03-18.

[26] C. Wolf and Erick Halter. Virtualization - From the Desktop to the Enterprise. Apress, Berkeley, CA, 2005.


Appendix A

Scripts and code

The scripts and source code can be downloaded at:

http://www8.cs.umu.se/~c07jnn/bsthesis/
