My 3rd Sem InPlant Training Report

AN EXPERIMENTAL APPROACH TO MULTIPLE SEQUENCE ALIGNMENT IN OPENVZ USING HADOOP CLUSTER

BAKTAVATCHALAM.G (08MW03)

MASTER OF ENGINEERING

Branch: SOFTWARE ENGINEERING

of Anna University

September 2009

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING PSG COLLEGE OF TECHNOLOGY

(Autonomous Institution)

COIMBATORE – 641 004

PSG COLLEGE OF TECHNOLOGY (Autonomous Institution)

COIMBATORE – 641 004

AN EXPERIMENTAL APPROACH TO MULTIPLE SEQUENCE ALIGNMENT

IN OPENVZ USING HADOOP CLUSTER

Bona fide record of work done by

BAKTAVATCHALAM.G (08MW03)

MASTER OF ENGINEERING

Branch: COMPUTER SCIENCE AND ENGINEERING

of Anna University, Coimbatore.

September 2009

……………………                    ……………………
Ms. M.Gowri Shankar            Dr. S.N.Sivanandam
Faculty Guide                  Head of the Department

Certified that the candidate was examined in the viva-voce examination held on ………………….

……………………                    ……………………
(Internal Examiner)            (External Examiner)

ACKNOWLEDGEMENT

We wish to express our sincere gratitude to our respected Principal Dr. R. Rudramoorthy for having given us the opportunity to undertake our project.

We also wish to express our sincere thanks to Dr. S. N. Sivanandam, Professor and Head of the Department of Computer Science and Engineering, for the encouragement and support that he extends towards our project work.

We extend our sincere thanks to our internal guide and faculty in charge Mr. M. Gowri Shankar, Lecturer, Department of Computer Science and Engineering, for his guidance and help rendered for the successful completion of our project.

SYNOPSIS

Multiple alignment of protein sequences is an essential tool in molecular biology. It helps to determine evolutionary linkage and to predict molecular structures. The factors to be considered while aligning multiple sequences are the speed and accuracy of alignment. Dynamic programming algorithms such as Needleman-Wunsch and Smith-Waterman produce accurate alignments, but they are computation intensive and are limited to a small number of short sequences.

In this project we propose a time-efficient approach to sequence alignment that produces quality alignments. The dynamic programming nature of the algorithm, coupled with the data and computational parallelism of Hadoop data grids, improves the accuracy and speed of sequence alignment. Further, due to the scalability of the Hadoop framework, the proposed multiple sequence alignment is also well suited to large-scale alignment problems.

The improved algorithm also overcomes the space limitations of the Needleman-Wunsch algorithm by dividing the sequences into blocks and processing the individual blocks in parallel. We further optimize the computation by computing alignment scores in parallel. The algorithm is also designed to support platform virtualization (OpenVZ).

CONTENTS

CHAPTER

Synopsis
List of Figures
List of Tables

1. INTRODUCTION
    1.1 Problem Definition
    1.2 Objective of the Project
    1.3 Significance of the Project
    1.4 Outline of the Project

2. SYSTEM STUDY
    2.1 Existing System
    2.2 Proposed System

3. SYSTEM ANALYSIS
    3.1 Requirement Analysis
    3.2 Feasibility Study

4. SYSTEM DESIGN
    4.1 Contextual Activity Diagram

5. SYSTEM IMPLEMENTATION
    5.1 Aligner Module
    5.2 FileUtil Module
    5.3 GAJobRunner Module

6. TESTING
    6.1 Unit Testing
    6.2 Integration Testing
    6.3 Sample Test Cases

7. SNAPSHOT
    7.1 Nodes
    7.2 Parallel Jobs

CONCLUSIONS
FUTURE ENHANCEMENTS
BIBLIOGRAPHY
APPENDIX

LIST OF TABLES

TABLE NO     NAME

Table 6.1    Sample Test Cases

LIST OF FIGURES

FIGURE NO    NAME

Fig. 2.1     System Architecture

CHAPTER 1

INTRODUCTION

This chapter provides a brief overview of the problem definition, the objectives and significance of the project, and an outline of the report.

1.1 PROBLEM DEFINITION

The problem is to design and implement a parallel approach to multiple sequence alignment (MSA) using Hadoop data clusters with virtualization. The approach overcomes the space limitations of the original Needleman-Wunsch algorithm by processing the sequence alignment in parallel. Parallel alignment score computation is proposed to improve computational efficiency.

1.2 OBJECTIVE OF THE PROJECT

Most users are interested in the parallel execution of MSA, and they want accurate alignment results from a balanced virtualized cluster. This project therefore provides a solution that increases efficiency and reduces the time taken for alignment through parallel alignment and virtualization.

1.3 SIGNIFICANCE OF THE PROJECT

With the enormous growth in bioinformatics data, there is a corresponding need for tools that enable fast and efficient alignment of sequences. Concurrent execution greatly reduces the time needed for large alignment jobs.

1.4 OUTLINE OF THE PROJECT

The rest of the report is structured as follows. Chapter 2 provides a detailed study of the existing system and the basic ideas of the proposed system. Chapter 3 discusses the requirements for the development of the system and analyzes its feasibility. Chapter 4 presents the overall design of the system. Chapter 5 discusses the implementation details. Chapter 6 explains the testing procedures conducted on the system. Chapter 7 contains snapshots of the system. The last section summarizes the project.

CHAPTER 2

SYSTEM STUDY

This chapter describes the existing system and gives a brief description of the proposed system.

2.1 EXISTING SYSTEM

The existing system is the Needleman-Wunsch algorithm. It supports only sequential execution when aligning very large DNA sequences, so there is no parallelism across sequence alignments. The algorithm gives the best accuracy for a pair of sequences, but it needs a very large amount of space to align them. It also does not support MSA and is not compatible with platform virtualization.

2.2 PROPOSED SYSTEM

In our system, we use OpenVZ to provide virtual machines (VMs), and each VM runs its own Hadoop DataNode and TaskTracker. Many isolated VMs can be created on a single kernel. Each VM has its own files, users, process tree, network, devices and IPC objects, and OpenVZ also supports dynamic resource management and checkpointing (state dumping). The input sequences are given to the modified MSA, which rearranges them into pairs of sequences. Each pair is then aligned in one VM of the Hadoop cluster. This process is repeated until the user-specified level is finished. Finally, the results are gathered from all VMs and combined, and the combined result is displayed to the user.
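
The step of rearranging the input sequences into pairs can be sketched as follows. This is only a minimal illustration in Java and not the project's code: the class name SequencePairing, the method pairUp and the rule of pairing adjacent sequences (an odd trailing sequence is left in a group of its own) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SequencePairing {
    // Groups the input into adjacent pairs (assumed pairing rule).
    // If the number of sequences is odd, the last group holds one sequence.
    public static List<List<String>> pairUp(List<String> sequences) {
        List<List<String>> pairs = new ArrayList<List<String>>();
        for (int i = 0; i < sequences.size(); i += 2) {
            int end = Math.min(i + 2, sequences.size());
            pairs.add(new ArrayList<String>(sequences.subList(i, end)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical input sequences, used only for illustration.
        List<String> input = Arrays.asList("AGACT", "CGAGA", "TTACG", "GACGT", "ACGTA");
        for (List<String> pair : pairUp(input)) {
            System.out.println(pair);
        }
    }
}
```

Each group produced this way would correspond to one unit of work handed to a VM; how the aligned results of one level are paired again at the next level is not specified here.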

CHAPTER 3

SYSTEM ANALYSIS

This chapter describes the hardware and software specifications for the development of the system and an analysis of the feasibility of the system.

3.1 REQUIREMENT ANALYSIS

3.1.1 Software Requirements

After experimenting with the available software and analyzing the pros and cons of each, the following were chosen:

• Operating System – Platform Independent
• Programming Language – Java 1.6+
• Front End – Java Swing
• Framework – Hadoop
• Virtualization Tool – OpenVZ

3.1.2 Hardware Requirements

The hardware requirements of the proposed system are as follows:

• Pentium III machine or above
• 256 MB RAM
• Hard disk with a capacity of 10 GB
• Network of computers with the above configuration for the cluster

3.2 FEASIBILITY ANALYSIS

Feasibility analysis is a step-by-step assessment of the system. The analysis showed that this project was feasible in all respects. Three kinds of feasibility factors were considered:

• Economic Feasibility
• Technical Feasibility
• Operational Feasibility

3.2.1 Economic Feasibility

The system is developed using only software that is already widely used, so there is no need to install new software. Hence, the cost incurred towards this project is negligible.

3.2.2 Technical Feasibility

3.2.2.1 Parallel MSA

The main aim of our project is to align the given sequences in parallel using MSA.

3.2.2.2 Virtualization

The next important task in our project is to configure OpenVZ so that platform virtualization can be used to increase concurrency.

3.2.3 Operational Feasibility

The functions to be performed by the system are all valid and free of conflicts. All functions and constraints specified in the requirements are completely operational. The requirements stated are realistically testable.

The requirements are adaptable to change without any large-scale effects on other system requirements. The system is capable of accommodating future requirements if they arise.

CHAPTER 4

SYSTEM DESIGN

This chapter describes the functional decomposition of the system and illustrates the movement of data between external entities, the processes and the data stores within the system, with the help of data flow diagrams.

4.1 CONTEXTUAL ACTIVITY DIAGRAM

CHAPTER 5

IMPLEMENTATION

This phase is divided into two parts: development and implementation. The individual system components are built during development, when programs are written and tried out by users. During implementation, the components built during development are put into operational use. In the development phase of our system, the following components were built:

• Aligner module

• FileUtil module

• GAJobRunner

5.1 Aligner Module

This module contains:

• A procedure to align two input sequences using the standard Needleman-Wunsch algorithm.
• A procedure to compute the score for two given sequences using a score matrix.

5.2 FileUtil Module

This module contains:

• A procedure to read file contents from HDFS.
• A procedure to write file contents to HDFS.

5.3 GAJobRunner Module

This module contains:

• The implementation of the Hadoop Map/Reduce procedures.
• The specification of the Hadoop JobConf, InputFormat, OutputFormat, key/value pair design and parallel job submission.
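
The pattern of submitting one job per sequence pair and gathering the results can be illustrated without a cluster. The sketch below is a plain-Java stand-in that uses a local thread pool in place of Hadoop Map/Reduce; it does not use the Hadoop API and is not the GAJobRunner code. The names ParallelPairJobs, runAll and matchCount, and the trivial per-pair work, are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelPairJobs {
    // Hypothetical per-pair work standing in for an alignment:
    // counts the positions at which the two sequences agree.
    public static int matchCount(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int matches = 0;
        for (int i = 0; i < n; i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return matches;
    }

    // Submits one task per pair, then collects the results in input order.
    public static List<Integer> runAll(List<String[]> pairs, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final String[] pair : pairs) {
                futures.add(pool.submit(new Callable<Integer>() {
                    public Integer call() {
                        return matchCount(pair[0], pair[1]);
                    }
                }));
            }
            List<Integer> results = new ArrayList<Integer>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<String[]>();
        pairs.add(new String[] {"AGCT", "AGGT"});
        pairs.add(new String[] {"AAAA", "TTTT"});
        System.out.println(runAll(pairs, 2)); // prints [3, 0]
    }
}
```

In the real system the per-pair task would be the alignment itself and the worker pool would be the set of TaskTrackers running inside the OpenVZ VMs.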

CHAPTER 6

TESTING

This chapter explains the testing procedures conducted on the system. Testing is the process of executing a program with the intent of finding errors. A successful test is one that uncovers an as-yet-undiscovered error. Testing cannot show the absence of defects; it can only show that software errors are present. It also checks that defined inputs produce actual results that agree with the required results. A good testing methodology should:

• Clearly define testing roles, responsibilities and procedures
• Establish a consistent testing process
• Streamline testing requirements
• Overcome the “requirements slow me down” mentality
• Take a common-sense process approach
• Use some elements of the existing process
• Not attempt to replace, rewrite or redefine the process
• Find defects early and give developers enough time for bug fixes
• Provide an independent perspective in testing

The testing strategies used in this project are:

• Unit Testing
• Integration Testing

6.1 UNIT TESTING

Unit testing is a strategy by which the individual components that make up the system are tested first, to ensure that the system works to the desired extent. It focuses the verification effort on the smallest unit of the software design, i.e. the module. The modules of the system are tested to see whether they perform their intended functions. Using the procedural design description, important control paths are tested to uncover errors within the boundary of the module. For example, the functions used when accepting a connection are unit tested within their respective modules. The unit test is normally a white-box test (a testing method in which the control structure of the procedural design is used to derive test cases).

6.1.1 Process Objectives

To test every unit of the software in isolation before integrating it with other units.

6.1.2 Definition of Unit

A unit is a module, as identified during the size estimation process, with a size estimate that does not exceed 1000 LOC. For GUI applications, each screen is a unit.

If the size estimate for a unit exceeds 1000 LOC and it is not feasible to break it into smaller, logically independent units that can be tested in isolation, the project lead, in concurrence with the SQA, can decide to define it as a unit.

6.1.3 Entry Criteria

The entry criteria for this process are the following:

• Unit completed
• Unit peer reviewed

6.1.4 Exit Criteria

The exit criteria for this process are the following:

• Unit test cases executed
• Any defects that are identified during unit testing and not fixed before the unit enters component testing are listed in the test report and verified
• 100% statement coverage

If a unit will be tested before its code review, this must be identified in the project plan. In such projects the developer will self-review (desk check) the code before unit testing.

Where exception handling for error conditions that are difficult to generate makes it impossible to achieve 100% statement coverage, the code should be formally reviewed against this additional criterion.

6.2 INTEGRATION TESTING

Integration testing is a systematic technique for constructing the program structure while conducting tests to uncover errors associated with interfacing. The individual modules of the system are combined and tested to check whether they work properly as a whole. The objective is to take unit-tested modules and build the program structure dictated by the design. Integration testing can be either incremental or non-incremental.

The objective of the integration testing process is to help engineers plan and execute the component and integration testing for their respective projects.

Integration testing should meet the following objectives:

• Performed by the product group/dev test team after feature complete
• Determines that all product components on a list of specific platforms function successfully together (the list is specified in the master test plan)
• Performed in a basic product/platform environment (the basic environment is specified in the master test plan)
• Tests the product functionality against the specification
• Tests functionality of fake languages with sample single-byte and double-byte languages
• Tests scaling to an acceptable minimum level as called out in the master test plan
• Tests performance and reliability to an acceptable level as called out in the master test plan
• Final integration tests are done after all components are integrated, with the build in production format

The tasks of the project have been integrated and the functioning of the entire system has been found to be satisfactory. The functionality of the entire system has been subjected to a series of tests, and all the modules have been found to interoperate properly.

Finally, integration testing was performed on the integrated system, which was found to work properly.

6.3 SAMPLE TEST CASES

Some of the sample test cases employed, along with their results, are described in the table below.

Table 6.1 Sample Test Cases

Test Description                                             Result
Is Hadoop stable when running more than one client job?     OK
Do the VMs return the results properly?                      OK
Does the MSA execute optimally?                              OK
Is the alignment computed accurately?                        OK
Does Hadoop run in OpenVZ without errors?                    OK

CHAPTER 7

SNAPSHOT

This chapter contains snapshots from our system.

7.1 Node Details

7.2 Parallel Jobs

CONCLUSION

The analysis, design and implementation of MSA in Hadoop with OpenVZ have been completed successfully. The user is able to align very large DNA sequences and to view and set the virtual environments of OpenVZ. This is very useful for aligning DNA sequences in a platform-virtualized environment. Because the alignments run concurrently, higher performance can be achieved.

FUTURE ENHANCEMENTS

Currently, Hadoop and OpenVZ must be installed and configured manually on all nodes, and the system does not have fine-grained scheduling of alignment jobs. Future enhancements are to build an auto-configure script that installs and configures Hadoop and OpenVZ automatically, and to design an efficient scheduling agent that executes the alignment jobs.

BIBLIOGRAPHY

• Saul B. Needleman and Christian D. Wunsch, “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins”, Journal of Molecular Biology, 1970, pp. 443-453.

• Wang-Sheng Juang and Shun-Feng Su, “Multiple Sequence Alignment Using Modified Dynamic Programming and Particle Swarm Optimization”, Journal of the Chinese Institute of Engineers, Vol. 31, No. 4, pp. 659-673 (2008).

• P. V. Lakshmi, Allam Appa Rao and G. R. Sridhar, “An Efficient Progressive Alignment Algorithm for Multiple Sequence Alignment”, International Journal of Computer Science and Network Security, Vol. 8, No. 10, October 2008.

• Jens Stoye, Vincent Moulton and Andreas W. M. Dress, “DCA: An Efficient Implementation of the Divide-and-Conquer Approach to Simultaneous Multiple Sequence Alignment”, Vol. 13, No. 6, 1997, pp. 625-626. Research Center for Interdisciplinary Studies on Structure Formation (FSPM), University of Bielefeld.

• G. Booch, Object Oriented Analysis and Design with Applications, Second Edition, Benjamin/Cummings, 1994, ISBN 0-8053-5340-2.

• Java Network Programming, Second Edition, O'Reilly & Associates, Inc.

• Herbert Schildt and Patrick Naughton, “Java 2: The Complete Reference”, Fourth Edition, Tata McGraw-Hill Publishing Company Limited, 2001.

Websites

http://en.wikipedia.org/
http://www.omg.org/docs/formal/00-03-01.pdf
http://www.uml-forum.com/FAQ.htm

APPENDIX

Needleman Algorithm 

The Needleman-Wunsch algorithm performs a global alignment of two sequences (called A and B here). It is commonly used in bioinformatics to align protein or nucleotide sequences. The algorithm was proposed in 1970 by Saul Needleman and Christian Wunsch in their paper “A general method applicable to the search for similarities in the amino acid sequence of two proteins”, J Mol Biol. 48(3):443-53. The Needleman-Wunsch algorithm is an example of dynamic programming and is guaranteed to find the alignment with the maximum score. It was the first application of dynamic programming to biological sequence comparison.

Scores for aligned characters are specified by a similarity matrix. Here, S(i,j) is the similarity of characters i and j. The algorithm uses a linear gap penalty, here called d. For example, if the similarity matrix was

     A    G    C    T
A   10   -1   -3   -4
G   -1    7   -5   -3
C   -3   -5    9    0
T   -4   -3    0    8

then the alignment:

AGACTAGTTAC
CGA---GACGT

with a gap penalty of -5, would have the following score...
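
The score of this alignment can be worked out column by column: each aligned pair of characters contributes its entry in the similarity matrix, and each column containing a gap contributes d. The following minimal sketch does this in Java. The class name AlignmentScore and the method score are hypothetical, and the code assumes the rows contain only the characters A, G, C, T and '-'.

```java
class AlignmentScore {
    private static final String ALPHABET = "AGCT";

    // Similarity matrix from the text; rows and columns are ordered A, G, C, T.
    private static final int[][] S = {
        {10, -1, -3, -4},
        {-1,  7, -5, -3},
        {-3, -5,  9,  0},
        {-4, -3,  0,  8}
    };

    // Sums S(x, y) over aligned columns and charges gapPenalty
    // for every column in which either row has a gap.
    public static int score(String rowA, String rowB, int gapPenalty) {
        if (rowA.length() != rowB.length()) {
            throw new IllegalArgumentException("aligned rows must have equal length");
        }
        int total = 0;
        for (int k = 0; k < rowA.length(); k++) {
            char x = rowA.charAt(k);
            char y = rowB.charAt(k);
            if (x == '-' || y == '-') {
                total += gapPenalty;
            } else {
                total += S[ALPHABET.indexOf(x)][ALPHABET.indexOf(y)];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // -3 + 7 + 10 + 3*(-5) + 7 - 4 + 0 - 1 + 0 = 1
        System.out.println(score("AGACTAGTTAC", "CGA---GACGT", -5)); // prints 1
    }
}
```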

To find the alignment with the highest score, a two-dimensional array (or matrix) is allocated. This matrix is often called the F matrix, and its (i,j)th entry is denoted F(i,j). There is one column for each character in sequence A, and one row for each character in sequence B. Thus, if we are aligning sequences of sizes n and m, the amount of memory used by the algorithm is O(nm). (However, there is a modified version of the algorithm which uses only O(m + n) space, at the cost of a higher running time. This modification is in fact a general technique which applies to many dynamic programming algorithms; this method was introduced in Hirschberg's algorithm for solving the longest common subsequence problem.)

As the algorithm progresses, F(i,j) is assigned the optimal score for the alignment of the first i characters in A and the first j characters in B. The principle of optimality is then applied as follows.

Basis:

F(0,j) = d * j
F(i,0) = d * i

Recursion, based on the principle of optimality:

F(i,j) = max(F(i-1,j-1) + S(A(i),B(j)), F(i,j-1) + d, F(i-1,j) + d)

The pseudo-code for the algorithm to compute the F matrix therefore looks like this (indexes of F start at 0; A(i) denotes the i-th character of A and B(j) the j-th character of B):

for i = 0 to length(A)
    F(i,0) ← d*i
for j = 0 to length(B)
    F(0,j) ← d*j
for i = 1 to length(A)
    for j = 1 to length(B)
    {
        Choice1 ← F(i-1,j-1) + S(A(i), B(j))
        Choice2 ← F(i-1, j) + d
        Choice3 ← F(i, j-1) + d
        F(i,j) ← max(Choice1, Choice2, Choice3)
    }
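
The computation of the F matrix translates almost directly into Java. The following is a minimal sketch and not the project's Aligner code: the class name FMatrix and the method fill are hypothetical, the similarity matrix is the example matrix from this appendix, and the sequences are assumed to contain only the characters A, G, C and T. Java strings are indexed from 0, so the i-th character of a sequence is read with charAt(i - 1).

```java
class FMatrix {
    private static final String ALPHABET = "AGCT";

    // Example similarity matrix; rows and columns are ordered A, G, C, T.
    private static final int[][] S = {
        {10, -1, -3, -4},
        {-1,  7, -5, -3},
        {-3, -5,  9,  0},
        {-4, -3,  0,  8}
    };

    // Returns the (length(a) + 1) x (length(b) + 1) matrix of optimal prefix scores.
    public static int[][] fill(String a, String b, int d) {
        int n = a.length();
        int m = b.length();
        int[][] f = new int[n + 1][m + 1];
        // Basis: aligning a prefix against nothing costs one gap per character.
        for (int i = 0; i <= n; i++) {
            f[i][0] = d * i;
        }
        for (int j = 0; j <= m; j++) {
            f[0][j] = d * j;
        }
        // Recursion: best of match/mismatch, gap in b, gap in a.
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sim = S[ALPHABET.indexOf(a.charAt(i - 1))][ALPHABET.indexOf(b.charAt(j - 1))];
                int choice1 = f[i - 1][j - 1] + sim;
                int choice2 = f[i - 1][j] + d;
                int choice3 = f[i][j - 1] + d;
                f[i][j] = Math.max(choice1, Math.max(choice2, choice3));
            }
        }
        return f;
    }

    public static void main(String[] args) {
        int[][] f = fill("AGCT", "AGT", -5);
        System.out.println(f[4][3]); // optimal global score, prints 20
    }
}
```

The whole matrix is kept because the traceback needs it; this is the O(nm) memory cost mentioned above.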

Once the F matrix is computed, the bottom right-hand corner of the matrix holds the maximum score over all alignments. To find an alignment that actually gives this score, start from the bottom right cell and compare its value with the three possible sources (Choice1, Choice2 and Choice3 above) to see which it came from. If it was Choice1, then A(i) and B(j) are aligned; if it was Choice2, then A(i) is aligned with a gap; and if it was Choice3, then B(j) is aligned with a gap.

AlignmentA ← ""
AlignmentB ← ""
i ← length(A)
j ← length(B)
while (i > 0 and j > 0)
{
    Score ← F(i,j)
    ScoreDiag ← F(i - 1, j - 1)
    ScoreUp ← F(i, j - 1)
    ScoreLeft ← F(i - 1, j)
    if (Score == ScoreDiag + S(A(i), B(j)))
    {
        AlignmentA ← A(i) + AlignmentA
        AlignmentB ← B(j) + AlignmentB
        i ← i - 1
        j ← j - 1
    }
    else if (Score == ScoreLeft + d)
    {
        AlignmentA ← A(i) + AlignmentA
        AlignmentB ← "-" + AlignmentB
        i ← i - 1
    }
    otherwise (Score == ScoreUp + d)
    {
        AlignmentA ← "-" + AlignmentA
        AlignmentB ← B(j) + AlignmentB
        j ← j - 1
    }
}
while (i > 0)
{
    AlignmentA ← A(i) + AlignmentA
    AlignmentB ← "-" + AlignmentB
    i ← i - 1
}
while (j > 0)
{
    AlignmentA ← "-" + AlignmentA
    AlignmentB ← B(j) + AlignmentB
    j ← j - 1
}
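
Combining the matrix computation with the traceback gives a complete pairwise aligner. The sketch below is one possible Java rendering under stated assumptions, not the code used in the project: the class name NeedlemanWunsch and the method align are hypothetical, the similarity matrix is the example matrix from this appendix, and ties are broken in the same order as the pseudo-code (diagonal, then left, then up).

```java
class NeedlemanWunsch {
    private static final String ALPHABET = "AGCT";

    // Example similarity matrix; rows and columns are ordered A, G, C, T.
    private static final int[][] S = {
        {10, -1, -3, -4},
        {-1,  7, -5, -3},
        {-3, -5,  9,  0},
        {-4, -3,  0,  8}
    };

    private static int sim(char x, char y) {
        return S[ALPHABET.indexOf(x)][ALPHABET.indexOf(y)];
    }

    // Returns {alignedA, alignedB} for a global alignment with linear gap penalty d.
    public static String[] align(String a, String b, int d) {
        int n = a.length();
        int m = b.length();
        int[][] f = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) f[i][0] = d * i;
        for (int j = 0; j <= m; j++) f[0][j] = d * j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = f[i - 1][j - 1] + sim(a.charAt(i - 1), b.charAt(j - 1));
                f[i][j] = Math.max(diag, Math.max(f[i - 1][j] + d, f[i][j - 1] + d));
            }
        }
        // Traceback from the last cell; rows are built in reverse and flipped at the end.
        StringBuilder outA = new StringBuilder();
        StringBuilder outB = new StringBuilder();
        int i = n;
        int j = m;
        while (i > 0 && j > 0) {
            if (f[i][j] == f[i - 1][j - 1] + sim(a.charAt(i - 1), b.charAt(j - 1))) {
                outA.append(a.charAt(i - 1));
                outB.append(b.charAt(j - 1));
                i--;
                j--;
            } else if (f[i][j] == f[i - 1][j] + d) {
                outA.append(a.charAt(i - 1));
                outB.append('-');
                i--;
            } else {
                outA.append('-');
                outB.append(b.charAt(j - 1));
                j--;
            }
        }
        while (i > 0) {
            outA.append(a.charAt(i - 1));
            outB.append('-');
            i--;
        }
        while (j > 0) {
            outA.append('-');
            outB.append(b.charAt(j - 1));
            j--;
        }
        return new String[] {outA.reverse().toString(), outB.reverse().toString()};
    }

    public static void main(String[] args) {
        String[] rows = align("AGCT", "AGT", -5);
        System.out.println(rows[0]); // prints AGCT
        System.out.println(rows[1]); // prints AG-T
    }
}
```

When several alignments share the maximum score, this traceback returns only one of them; which one depends on the tie-breaking order.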

Types of virtualization 

In the context of this report, virtualization is a system or a method of dividing computer resources into multiple isolated environments. It is possible to distinguish four types of such virtualization: emulation, paravirtualization, operating system-level virtualization, and multiserver (cluster) virtualization. Each virtualization type has its pros and cons that condition its appropriate applications.

Emulation makes it possible to run any non-modified operating system which supports the platform being emulated. Implementations in this category range from pure emulators (like Bochs) to solutions which let some code be executed natively on the CPU in order to increase performance. The main disadvantages of emulation are low performance and low density. Examples: VMware products, QEmu, Bochs, Parallels.

Paravirtualization is a technique to run multiple modified OSs on top of a thin layer called a hypervisor, or virtual machine monitor. Paravirtualization has better performance compared to emulation, but the disadvantage is that the “guest” OS needs to be modified. Examples: Xen, UML.

Operating system-level virtualization enables multiple isolated execution environments within a single operating system kernel. It has the best possible (i.e. close to native) performance and density, and features dynamic resource management. On the other hand, this technology does not allow different kernels from different OSs to run at the same time. Examples: FreeBSD Jail, Solaris Zones/Containers, Linux-VServer, OpenVZ and Virtuozzo.

OpenVZ kernel 

The OpenVZ kernel is a modified Linux kernel which adds the following functionality: virtualization and isolation of various subsystems, resource management, and checkpointing. Virtualization and isolation enable many virtual environments (VEs) within a single kernel. The resource management subsystem limits (and in some cases guarantees) resources such as CPU, RAM and disk space on a per-VE basis. Checkpointing is the process of “freezing” a VE and saving its complete state to a disk file, with the ability to “unfreeze” that state later. These components are described below.

Virtualization and isolation 

Each VE has its own set of resources provided by the operating system kernel. Inside the kernel, those resources are either virtualized or isolated. Each VE has its own set of objects, such as the ones described below.

Files – System libraries, applications, virtualized /proc and /sys, virtualized locks, etc.

Users and groups – Each VE has its own root user, as well as other users and groups.

Process tree – A VE sees only its own set of processes, starting from init. PIDs are virtualized, so that the init PID is 1, as it should be.

Network – A virtual network device allows the VE to have its own IP addresses, as well as its own set of netfilter (iptables) and routing rules.

Devices – Some devices are virtualized. In addition, if needed, any VE can be granted (exclusive) access to real devices such as network interfaces, serial ports and disk partitions.

IPC objects – Shared memory, semaphores and messages.

Resource management 

Resource management is of paramount importance for operating system-level virtualization solutions, because there is a finite set of resources within a single kernel that are shared among multiple Virtual Environments. All those resources need to be controlled in a way that lets many VEs co-exist on a single system without influencing each other. The OpenVZ resource management subsystem consists of three components:

1. Two-level disk quota – The OpenVZ server administrator can set up per-VE disk quotas in terms of disk space and number of inodes. This is the first level of disk quota. The second level of disk quota lets the VE administrator (VE root) use standard UNIX quota tools to set up per-user and per-group disk quotas.

2. “Fair” CPU scheduler – The OpenVZ CPU scheduler is also two-level. On the first level it decides which VE to give the time slice to, taking into account the VE’s CPU priority and limit settings. On the second level, the standard Linux scheduler decides which process in the VE to give the time slice to, using standard process priorities.

3. User Beancounters – This is a set of per-VE counters, limits and guarantees. There is a set of about 20 parameters, carefully chosen to cover all the aspects of VE operation, so that no single VE can abuse any resource which is limited for the whole computer and thus cause harm to other VEs. The resources accounted and controlled are mainly memory and various in-kernel objects such as IPC shared memory segments, network buffers, etc.

Checkpointing and live migration 

Checkpointing allows the “live” migration of a VE to another physical server. The VE is “frozen” and its complete state is saved to a disk file. This file can then be transferred to another machine and the VE can be “unfrozen” (restored) there. The whole process takes a few seconds, and from the client’s point of view it looks not like a downtime, but rather a delay in processing, since the established network connections are also migrated.

OpenVZ Utilities

1 vzctl

OpenVZ comes with the vzctl utility, which implements a high-level command-line interface to manage Virtual Environments. For example, creating and starting a new VE takes just two commands: vzctl create and vzctl start. The vzctl set command is used to change various VE parameters. Note that all the resources (for example, VE virtual memory size) can be changed during runtime. This is usually impossible with other virtualization technologies, like emulation or paravirtualization.

2 Templates and vzpkg

Templates are existing images used to create a new VE. A template is a set of packages, and a template cache is an archive (tarball) of a chrooted environment with those packages installed. During the vzctl create stage, this tarball is unpacked. Using the template cache technique, a new VE can be created in seconds, thus enabling fast deployment scenarios.

The vzpkg tools are a set of tools that facilitate template cache creation. They currently support rpm and yum-based repositories. For example, to create a template of the Fedora Core 5 distribution, one needs to specify a set of (yum) repositories which have FC5 packages, and a set of packages to be installed. In addition, pre-install and post-install scripts can be employed to further optimize or modify a template cache. All the above data (repositories, lists of packages, scripts, GPG keys, etc.) form the template metadata. With template metadata, a template cache can be created automatically by the vzpkgcache utility. It will download and install the listed packages into a temporary VE, and pack the result as a template cache. Template caches for non-RPM distributions can be created as well, although this is more of a manual process.