Branch replication scheme: A new model for data replication in large scale data grids



Future Generation Computer Systems 26 (2010) 12–20. doi:10.1016/j.future.2009.05.015



José M. Pérez, Félix García-Carballeira*, Jesús Carretero, Alejandro Calderón, Javier Fernández
Computer Architecture Group, Computer Science Department, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
* Corresponding author. Tel.: +34 916249060; fax: +34 916249129. E-mail: [email protected]. URL: http://arcos.inf.uc3m.es.

Article info

Article history: Received 7 May 2007; received in revised form 15 May 2009; accepted 18 May 2009; available online 27 May 2009.

Keywords: Data grids; Parallel I/O; Replication; High performance I/O

Abstract

Data replication is a practical and effective method to achieve efficient and fault-tolerant data access in grids. Traditionally, data replication schemes maintain an entire replica in each site where a file is replicated, providing a read-only model. These solutions require huge storage resources to store the whole set of replicas, and they avoid the consistency problem by not allowing efficient data modification. In this paper we propose a new replication method, called the Branch Replication Scheme (BRS), that provides three main advantages over traditional approaches: optimizing storage usage, by creating subreplicas; increasing data access performance, by applying parallel I/O techniques; and providing the possibility to modify the replicas, by maintaining consistency among updates in an efficient way. An analytical model of the replication scheme, naming system, and replica updating scheme is formally described in the paper. Using this model, operations such as reading, writing, or updating a replica are analyzed. Simulation results demonstrate the feasibility of BRS, as they show that the new replication algorithm increases data access performance compared with popular replication schemes such as hierarchical and server-directed replication, which are commonly used in current data grids.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

Data management is a major problem in grid environments. A data grid is composed of hundreds of geographically distributed computers and storage resources usually located under different administrative domains. The objective of a data grid is to facilitate sharing of data and resources [1,2]. The size of the data managed by data grids is continuously growing [3], and it has already reached Petabytes, as in the Atlas Project Datastore [4].

There are two basic data management services in a data grid: services for data transfer, and services for replica management. The main service for data management is the GridFTP protocol [5], an extension of FTP that provides efficient and secure data transfer and access to large files in grid environments. Data replication is a practical and effective method to achieve efficient and dependable data access in grids. Access to data grid files through several replicas is used to achieve high performance data access, fault tolerance, and load balancing [6]. Thus, most of the data management efforts in grids have been focused on data replication [7,3,8,6,9]. Most of the data replication schemes for data grid environments provide two main features: they maintain replicas of the whole original file in each target storage site, and they provide a read-only model.

These models have two major drawbacks. On the one hand, maintaining an entire replica in each site requires large storage and network resources, and it can be a cause of inappropriate usage of the data resources. On the other hand, the use of a read-only replica model is only suitable for data distribution and archiving centres, such as the CERN DataGrid [10], but it is not appropriate for collaborative environments where distributed clients can modify a replica of a data set.

In this paper we propose a new model of data replication, named the branch replication scheme (BRS). The main goals of BRS are: increasing scalability, performance, and fault tolerance; and demonstrating the feasibility of an efficient coherence model for write operations. In our model, each replica is composed of a different set of subreplicas organized using a hierarchical topology. We use parallel I/O techniques [11–13] in order to increase the scalability and performance of the system for both read and write operations.

The rest of the paper is organized as follows. Section 2 gives an overview of previous work on data grid replication. Section 3 describes the branch replication scheme and algorithms. Section 4 describes the analytical model used to evaluate the scheme proposed in this paper, and Section 5 presents the results of the simulation developed to evaluate its performance. Finally, we present the main conclusions of this work and some future research directions.


2. Related work

Data movement in grids is basically made possible by two mechanisms: GridFTP and RFT. GridFTP is an implementation of the FTP protocol for grids [5], and it is widely used for securely moving massive amounts of data among grids. It has several advantages: a high performance parallel streams implementation [14]; coordinated data transfer using multiple computer nodes at the source and destination; support for various security options, including the Grid Security Infrastructure (GSI); partial downloads of a large file; and automatic restart of failed transfers. The Reliable File Transfer (RFT) [15] service is a web service that provides interfaces for controlling and monitoring third party file transfers using GridFTP servers. RFT can detect a variety of failures and restart the file transfer from the point of failure.

Replication, a well-known technique from distributed systems, is the main mechanism used in grid environments for increasing data access performance. Replication reduces access latency and bandwidth consumption. Moreover, data replication can be used to provide fault-tolerant support and load balancing by maintaining copies of data at different locations. Data replication and replica management in grids have been studied in many works. Studies of the effect of data replication techniques in grids can be found in [7,16]. Performance and scalability aspects have been studied in [1,17,18].

Replica placement is a key performance factor in data grids. Static data replication strategies, following deterministic policies, are studied in [9]. Dynamic replication techniques, which allow the system to automatically manage replicas following changing system parameters and user access patterns, are proposed in [3,19–21]. There are some mixed approaches that use a static replica placement algorithm to optimize average response time and a dynamic replica placement algorithm to re-allocate replicas to new candidate sites if a performance metric degrades significantly [22]. P2P techniques [23] and parallel transmission [24] have also been used for replica placement in grids. A tree-based replica location scheme (TRLS) to decide the replica locations is proposed in [25]. The objective of TRLS is to minimize the sum of the storage cost and the communication cost of the replication, and the problem is solved using linear programming. A hybrid protocol using trees and grid replication is shown in [26]. DHTs and P2P approaches have also been studied for replication in grid systems. In [27], a DHT-based replication protocol that autonomously adjusts the number of replicas to deliver a configured data availability guarantee is presented.

One of the critical components in a grid replica management system, as defined by GT 4.0 Data Management [28], is the Replica Location Service (RLS) [29]. RLS maintains and provides access to mapping information from logical names for data items to target names. These target names may represent physical locations of data items, or an entry in the RLS may map to another level of logical naming for the data item, making it possible to implement a hierarchical tree. One example of this kind of system is that proposed by Chervenak [17], whose implementation is available as part of the Globus Toolkit. In this implementation, RLS is a distributed registry, meaning that it may consist of multiple servers at different sites, which increases scalability. The Resource Namespace Service (RNS) [30] is another service, specified by the Grid File System Working Group (GFS-WG) of the Global Grid Forum, that allows the construction of a uniform, widely distributed, global, hierarchical namespace, and allows the lookup of physical file names given the logical file name.

One important aspect in general data replication is replica consistency. If a replica can be modified by some application, the problem of maintaining consistency in data and metadata arises as a factor that can limit scalability [31]. RLS does not guarantee consistency among replicated data or the uniqueness of file names in the directory. It is intended to be used by higher-level grid services that provide these functionalities. Thus, many data grids do not solve the data replication problem, allowing only read-only data sets as the main data model in those grids [6]. However, if modifications are allowed, how to maintain the consistency of those files is a great challenge in large systems.

The study of replication algorithms that try to maintain consistency among replicas has been a very active research topic in the field of distributed databases, where consistency management algorithms have traditionally been classified into two approaches. On the one hand, there are strong consistency protocols [32]; these replication systems require the synchronization of a great number of replicas. On the other hand, there are weak consistency protocols [33]; these kinds of protocols provide high availability and better response time by allowing updates to be done in an asynchronous way. In data grids strong consistency protocols are widely used, and several solutions have been proposed using this approach. In [34], a traditional approach based on logs and change distribution is proposed. A more scalable one-direction tree-based approach is proposed in [35], where performance and data distribution speed are the major goals. A system providing replication and consistency is also shown in [36]. Most of the current techniques for updating data in grids use a propagation scheme, in which the master site propagates the update messages to all replica sites [31,37–39]. In [40] an update replication protocol called Update Propagation Grid (UPG) is proposed. This protocol updates the replicas by using a propagation technique based on nodes organized into a logical structure network in the form of a grid structure.

Another important aspect is replication granularity. The most extended model in data grids uses replication granularity at the file level, as shown in [8,41]. Most replication schemes rely on traditional sequential files and propose optimizations for the data movement. However, parallel file systems, such as GridExpand [13], can also be used to exploit the inherent parallelism of the underlying communication and storage hardware. By using parallel file systems adapted to the grid, GridFTP can be transparently employed to create several parallel client–server connections that enhance file replication performance. There are some other approaches to granularity, such as object level replication [18], which can be useful in object-oriented database management systems or for types of interactive data analysis that would be too inconvenient or costly to perform with tools that work on a file level only [42]. Usually a complete file is replicated. In [43], the problem of fragmented replicas is analyzed: a block mapping procedure is proposed to determine the distribution of blocks in every available server for later replica retrieval.

The following sections present a new model for data grid replication that is based on partial replicas stored in different storage nodes. The proposed model allows parallel I/O and provides a better usage of storage resources.

3. Branch Replication Scheme (BRS)

In this section a new replication scheme called the Branch Replication Scheme is presented. BRS is aimed at providing more scalability than serial solutions through the parallel replication of a file in several sites. Using this solution, a replica is divided into several subreplicas that may reside in different storage nodes. Fig. 1 shows the differences between a hierarchical replication scheme that uses entire replicas, such as the one proposed in [3], and the branch replication scheme proposed in this article.

Fig. 1. Differences between hierarchical replication (top) and branch replication (bottom).

A replica R (see Fig. 1) is defined as a set of disjoint subreplicas $R_i$ (fragments of files) that, together, contain all data stored in the root file or original file $R_R$. Formally, we can define a replica as follows:

$$R = \bigcup_{i=1}^{n} R_i \qquad (1)$$

where $\forall i, j \in \mathbb{N}, i \neq j : data(R_i) \cap data(R_j) = \emptyset \;\wedge\; data(R_R) \cap data(R) = data(R_R)$.

This replication method can be used in grid environments whose data storage resources have very different storage capacities. Moreover, this approach allows parallel access to replicas, and it requires less space per node to store the replicas. In whole file replication approaches, every time a replica of a K GB file is made, K GB are needed on the target node. To create n replicas using BRS we require more target nodes, which should not be a problem in large data grids, but each target node needs less space to store its subreplicas. For a root file of K GB and n replicas, using $2^n - 1$ target nodes, each target node stores on average:

$$\frac{\sum_{i=1}^{n} \left( \dfrac{K}{2^{\,i-1}} \right)}{2^{n} - 1}\ \text{GB}.$$
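To make the formula concrete, here is a small Python sketch (ours, not from the paper; it assumes the $2^n - 1$-node reading of the replication tree size) comparing the average per-node storage of BRS with whole-file replication:

def brs_avg_storage_gb(K: float, n: int) -> float:
    """Average GB per target node for a K GB root file and n replica levels:
    sums the per-level subreplica sizes K/2^(i-1) and divides by the
    2^n - 1 nodes of the replication tree, as in the formula above."""
    total = sum(K / 2 ** (i - 1) for i in range(1, n + 1))
    nodes = 2 ** n - 1
    return total / nodes

def whole_file_storage_gb(K: float) -> float:
    """Whole-file replication always needs the full K GB on every node."""
    return K

K, n = 1.0, 4  # a 1 GB file, four levels of replication
print(f"BRS:        {brs_avg_storage_gb(K, n):.3f} GB per node")
print(f"Whole-file: {whole_file_storage_gb(K):.3f} GB per node")

For a 1 GB file and four levels, this gives about 0.125 GB per node for BRS, against the full 1 GB per node for whole-file replication.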

In Fig. 2, a tree with three levels of replication is shown. The original file (root) is located at SITE 1. The second level of replication is formed by SITE 2 and SITE 3: the union of their data is an entire replica, and their intersection is the empty set. In the same way, the subreplica in SITE 2 is branch-replicated in SITE 4 and SITE 5. Following this scheme, a data replication tree, in which each level forms an entire replica, is created. Moreover, an entire replica can be obtained by joining different combinations of subreplicas, so that a replica may be composed of subreplicas with different depths in the replication tree. For example, we can get a replica by joining sites 2, 6, and 7, or by joining sites 3, 4, and 5. This way, using BRS avoids wasting too much space replicating the entire file, while a high fault tolerance level is still ensured. The replica defined by the tree leaves is named the terminal replica.

Fig. 2. Replication example using BRS.

The main features of BRS are the following:

• Root replica. In this algorithm, a single storage node holds the original file, named the root replica. This replica is always complete and it stores the original file. Initially, the root replica is chosen when the file is created.
• Parallel replication. To create a new replica, n target nodes have to be selected to store the subreplicas. The union of all the subreplicas will be the original replica. BRS logically splits the original replica into chunks and creates the subreplicas by copying the chunks in parallel to the target nodes using GridFTP. In this way, we can reduce the replication time as compared with the time needed to create a whole replica in a single storage node (a sketch of this chunk partitioning follows the list).
• Fine grain replication. If we want to make a replica of a specific part of the file, the associated subreplicas can be split again into several subsubreplicas, and so on, until a minimum chunk size is reached.
• Partial replication of popular file fragments. We can select only the fragments of data used most by clients and replicate them using popularity or geographic distribution criteria.
• Parallel data access. As subreplicas exist in different target nodes, parallel I/O can be used to enhance data access. Parallel I/O can be achieved using GridFTP over the Internet or a parallel file system within intranets.
• Better resource usage. BRS requires less space per storage node to support replica creation. Thus, even small storage devices can be used to replicate data, which allows the extension of the storage network.
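As an illustration of the parallel replication step, the following sketch (ours; the real transfers would be GridFTP partial file transfers, not shown here) partitions a file into the contiguous byte ranges copied in parallel to the n target nodes:

def chunk_ranges(file_size: int, n: int) -> list:
    """Split [0, file_size) into n contiguous byte ranges, one per target node."""
    base, extra = divmod(file_size, n)
    ranges, offset = [], 0
    for i in range(n):
        length = base + (1 if i < extra else 0)  # first `extra` chunks take one spare byte
        ranges.append((offset, offset + length))
        offset += length
    return ranges

# Each range would then be transferred concurrently, one stream per target node.
print(chunk_ranges(1000, 3))  # [(0, 334), (334, 667), (667, 1000)]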

A new replication scheme like BRS needs new ways of dealing with replica location and creation, and with the replica consistency problem. The proposed solutions are outlined below.

3.1. Replica Location Service (RLS)

Maintaining replicated data in grid environments requires a replica location service that provides a mechanism to register and discover replicas. Replica location systems in data grids have evolved from directories, such as LDAP [44], to peer-to-peer systems [45]. All those systems need to store some metadata and metrics information to locate replicas.

In order to provide a standard method for replica location we use the Resource Namespace Service (RNS) model. RNS is a specification of the Grid File System Working Group (GFS-WG) of the Global Grid Forum that allows building a uniform, global, hierarchical namespace using a web service [30]. It defines a three-tier naming architecture consisting of human interface names (hin), logical names (ln), and endpoint references (er), where endpoint references are URLs, file names, metadata, or other objects. There are two levels of indirection: human interface names to logical names, and logical names to endpoint references. This second level of indirection has the advantage of using a logical name to represent a logical reference; therefore, logical names may be referenced and resolved independently of the hierarchical namespace. This means that logical names may be used as globally unique logical resource identifiers, and they may be referenced directly by the RNS namespace as well as by other services (such as web services).

Fig. 3 shows the mapping in RNS of the BRS example shown in Fig. 2. As may be seen, there is a single hin per file, but as many ln as replicas of the file. Moreover, each logical name has as many er as files compose the replica (subreplicas). As shown in the figure, not only the directly created replicas exist: more replicas can be defined from the combinations of subreplicas along the branches of the tree. For example, Replica 4 is the result of combining a part of Replica 1 and a part of Replica 2.

To support BRS, we must link the following metadata information to each replica logical name (a minimal sketch of such a record follows the list):

• FR: Parent replica or subreplica (upper level). The root replica's parent is itself.
• CR: Set of children subreplicas. It includes the location of the files that support the subreplicas and the portion of data replicated in each of them. Terminal subreplicas' children are themselves.
• BR: Set of sibling subreplicas, usually at the same level, with a common upper level.
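A minimal sketch of this per-replica metadata (ours; the class and field names are hypothetical, not taken from any RNS implementation):

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ReplicaMeta:
    """Metadata linked to a replica or subreplica logical name."""
    logical_name: str
    byte_range: Tuple[int, int]                         # portion of the root file held here
    parent: str = ""                                    # FR: the root replica's parent is itself
    children: List[str] = field(default_factory=list)   # CR: terminal subreplicas list themselves
    siblings: List[str] = field(default_factory=list)   # BR: subreplicas sharing the same parent

    def is_root(self) -> bool:
        return self.parent == self.logical_name

    def is_terminal(self) -> bool:
        return self.children == [self.logical_name]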

We have modelled a prototype of this service that includes file metadata with the information described above, needed to support BRS. Given a human interface name, we can obtain via RNS the list of logical names corresponding to all replicas of the file. This list is passed to a Replica Optimization Service (ROS) to get the best replica for the client.

Four metrics are used in ROS to choose a replica:


Fig. 3. RNS three-tier naming architecture.

• Δ(x, y): Network distance between nodes x and y. Computed as the number of hops given by a traceroute command. To alleviate the cost, distance information can be cached when a replica is checked for the first time.
• Π(x): Storage performance of node x. Computed using the node computing power (in GFLOPS), the storage bandwidth (in MB/s), and the portion of storage devoted to the grid (non-null), as shown below:

$$\Pi(x) = computing\_power(x) \cdot disk\_bandwidth(x) \cdot available\_grid\_storage(x). \qquad (2)$$

• N(x, y): Network and protocol performance between nodes x and y. Computed using the network bandwidth (in Mbit/s) and the latency (in ms) between nodes x and y, as shown below:

$$N(x, y) = \frac{network\_bandwidth(x, y)}{network\_latency(x, y)}. \qquad (3)$$

• I: Replica usage index, computed as a ratio between the number of open requests for the replica and a defined time frame. The number of requests and the time frame can be configured to characterize the system behaviour.

The ROS includes an access performance function, $f_{ap}(x, y)$, that computes a performance value for a node y to be accessed from a client x. This function can be tailored, because it is defined as a weighted combination of three of the former metrics:

$$f_{ap}(x, y) = w_d \Delta(x, y) + w_s \Pi(y) + w_n N(x, y). \qquad (4)$$

Then, the criteria to select a replica are the following:

(1) Replicas with a usage index I greater than a defined threshold (Maximum Replica Usage, MRU) are discarded, to avoid overloading them further:

$$\forall i = 1 \ldots n,\ I_i \geq MRU \rightarrow discard(R_i). \qquad (5)$$

(2) From the set of eligible replicas, the one with the maximum ratio of performance to usage index is chosen:

$$\forall i = 1 \ldots n,\ (I_i \leq MRU) \wedge (P_i = \max(P_{1 \ldots n})) \rightarrow choose(R_i), \qquad (6)$$

where

$$P_i = \frac{\sum_{j=1}^{|R|} f_{ap}(cl, sr_j)}{I_i},$$

cl being a client node, and $sr_j$ the server that stores the $R_j$ subreplica. When a file is opened for writing, the RLS always returns the terminal replica. This process is detailed in Section 3.3.
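Eqs. (4)-(6) can be read as the following selection procedure (our sketch; the default weights, the dictionary representation of a replica, and the guard against a zero usage index are illustrative assumptions):

def fap(delta, pi, n, wd=1.0, ws=1.0, wn=1.0):
    """Eq. (4): weighted access performance of a server seen from a client."""
    return wd * delta + ws * pi + wn * n

def choose_replica(replicas, mru):
    """Eqs. (5)-(6): discard overloaded replicas, then maximize P_i.

    Each replica is a dict {'usage': I_i, 'servers': [(delta, pi, n), ...]},
    with one metric tuple per subreplica server.
    """
    eligible = [r for r in replicas if r['usage'] < mru]            # Eq. (5)
    if not eligible:
        return None
    def p(r):                                                       # P_i, guarding I_i = 0
        return sum(fap(*srv) for srv in r['servers']) / max(r['usage'], 1)
    return max(eligible, key=p)                                     # Eq. (6)

print(choose_replica([{'usage': 2, 'servers': [(1, 4.0, 2.0)]},
                      {'usage': 9, 'servers': [(1, 9.0, 9.0)]}], mru=8))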

3.2. Replica creation

In BRS the write process does not produce new replicas. New replicas are produced only under three circumstances:

(1) Client driven, because of a high probability of heavily writing or using a far replica or subreplica.
(2) System driven, under two circumstances:
    (a) Performance (P) very low for a replica or subreplica.
    (b) Usage index very high (I > MRU) for a replica or subreplica.

In any case, replicas are created as close as possible to the clients that request the data files. The root replica grows toward the clients in a branching way, striping replicas into several subreplicas. With this approach, the growth of the replica tree is driven by client needs; thus we can say that a replica is expanded toward the clients, or attracted to them. When a replica or subreplica is replicated, two sites must be selected for the replication process. When a client activates the replication process, the system selects two sites (S1 and S2) such that:

(1) Δ(client, S1) and Δ(client, S2) are minimum.
(2) Π(S1) and Π(S2) are maximum.
(3) N(client, S1) and N(client, S2) are maximum.

A possible reading of this selection in code follows the list.
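One way to combine the three conditions is a lexicographic ranking (our sketch; the paper does not state how the criteria are weighed against each other, so treating distance as the primary key is an assumption):

def select_sites(candidates, k=2):
    """candidates: (site, delta, pi, n) tuples with the Section 3.1 metrics
    already measured from the client. Returns the k best sites (S1, S2)."""
    ranked = sorted(candidates, key=lambda c: (c[1], -c[2], -c[3]))
    return [site for site, *_ in ranked[:k]]

print(select_sites([("A", 3, 5.0, 2.0), ("B", 3, 8.0, 1.5), ("C", 7, 9.0, 9.0)]))
# ['B', 'A']: fewest hops wins; higher storage performance breaks the tie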

Replication does not have to cover the entire replica; subreplicas can also be replicated following the former conditions. Assume that accesses to a file are not uniformly distributed and that, as a result, the storage node of subreplica $R_i$ is overloaded. BRS can replicate only this subreplica to unload the node. Thus, the expansion of the replication tree might not be symmetric, and different branches could have different depths. When new replicas are created, the replica location information is updated in the RNS.

Replica creation is highly parallelized because we have several levels of replicas and many subreplicas. That is, replica creation may be supported by a high number of independent subreplicas, so that we can execute parallel I/O writes on several subreplicas at the same time.

3.3. Replica consistency

In order to maintain consistency among updates by clients, we propose the following mechanism: clients can only modify the data located in the terminal replica, that is to say, in the leaf nodes of the replication tree. Thus, locating a replica reduces to locating the deepest subreplicas that hold the range of data requested by the application.


Data update is performed bottom-up, from the children replicas to the parent, recursively until the root replica is reached. Only updated blocks are propagated. Assume, for the example in Fig. 2, that block 3 of replica 2 (located in SITE 5) is written. The consistency algorithm sends block 3 to the replica's parent (SITE 2), which in turn sends block 2 to its parent (SITE 1). As the replica in SITE 1 is the root, the algorithm stops. Thus replica updating can be executed minimizing the number of steps (3) and the amount of information sent (only 1 block in this example). If entire replicas were used, the whole tree would have to be traversed (8 steps), and the amount of data transferred would be a minimum of 8 blocks, in the optimistic solution, and a maximum of 8·K GB, in the pessimistic solution.

Transitorily, we may have a consistency problem. It can be solved with a pessimistic approach by locking the logical resource names in the RNS tree, so that access is forbidden while replicas are being updated. However, we apply an optimistic solution based on the following considerations:

• Each child replica sends its modifications immediately to the parent replica as an update.
• The range of data modified is disjoint and it can be small.
• Techniques such as collective I/O can be used to send modifications from several clients to a parent replica through a child replica.

A sketch of the bottom-up propagation follows.
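The bottom-up update can be sketched as follows (ours; Node and send are hypothetical stand-ins for the RNS metadata and the transfer machinery):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    site: str
    parent: Optional["Node"] = None  # None marks the root here; the paper's
                                     # convention is that the root is its own parent

def send(src: str, dst: str, block: bytes) -> None:
    print(f"update: {src} -> {dst} ({len(block)} bytes)")

def propagate_update(node: Node, block: bytes) -> None:
    """Climb one branch from a leaf to the root, forwarding only the
    modified block; sibling subtrees are never contacted."""
    while node.parent is not None:
        send(node.site, node.parent.site, block)
        node = node.parent

# The example from Fig. 2: a block written at SITE 5 reaches SITE 2, then SITE 1.
root = Node("SITE 1")
s2 = Node("SITE 2", parent=root)
propagate_update(Node("SITE 5", parent=s2), b"block 3 contents")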

A problem may occur when a client tries to write to a subreplica which is not terminal, because that subreplica has been split into others. In this case, a write not allowed error is sent to the client. This may only happen when the client has opened the file in read-only mode, since, if a file is opened for writing or updating, the RLS service always returns the terminal replica. Thus, the client has to open the file for writing or updating and look for the replica that contains the data range it needs.

4. BRS modelling

This section shows the analytical model used to evaluate the access, creation, and update of a replica in BRS. Fig. 4 shows the basic model for data access: a client must cross two LANs and one WAN to access the data. Table 1 shows the definitions and notation used by the model. Disk parameters have been obtained from a commercial disk (Seagate Barracuda ST3160021A).

The time for reading one replica is defined as:

$$t_{read} = t_{request} + t_{access} + t_{reply} \qquad (7)$$

where

$$t_{request} = L_{LAN} + L_{WAN} + L_{LAN} \qquad (8)$$

$$t_{access} = \frac{S}{bs} \cdot r_d \qquad (9)$$

bs being the block size used in transfers and $r_d$ the time for reading a block from disk, and

$$t_{reply} = L_{LAN} + L_{WAN} + L_{LAN} + \max\left(\frac{S}{B_{LAN}}, \frac{S}{B_{WAN}}\right). \qquad (10)$$

The model developed only considers the time for accessing, creating, or updating a replica. It does not take into account the time for locating the replica.
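Eqs. (7)-(10) translate directly into code. The sketch below (ours) evaluates them with the Table 1 values; deriving $r_d$, the per-block disk read time, from the 40 MB/s transfer rate is our assumption about how the disk parameters enter the model:

def t_read_seq(S, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eqs. (7)-(10): time to read one whole replica through two LANs and a WAN."""
    t_request = l_lan + l_wan + l_lan                            # Eq. (8)
    t_access = (S / bs) * rd                                     # Eq. (9)
    t_reply = l_lan + l_wan + l_lan + max(S / b_lan, S / b_wan)  # Eq. (10)
    return t_request + t_access + t_reply                        # Eq. (7)

S = 1 * 1024**3                  # 1 GB file, in bytes
bs = 256 * 1024                  # 256 kB blocks
rd = bs / (40 * 1024**2)         # seconds per block at 40 MB/s (assumption)
print(t_read_seq(S, bs, rd,
                 l_lan=0.5e-3, l_wan=50e-3,   # Table 1 latencies
                 b_lan=1e9 / 8, b_wan=2e9))   # 1 Gb/s LAN, 2 GB/s WAN, in bytes/s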

4.1. Modelling a read operation in BRS

If we use n subreplicas, the time for reading the whole file in parallel is defined as:

$$t_{read} = 2 \cdot (L_{LAN} + L_{WAN} + L_{LAN}) + \frac{S/n}{bs} \cdot r_d + \max\left(\frac{S/n}{B_{LAN}/n}, \frac{S/n}{B_{WAN}/n}\right). \qquad (11)$$
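In code (our sketch, with the same parameter conventions as above), Eq. (11) becomes:

def t_read_parallel(S, n, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eq. (11): n subreplicas read in parallel; the n streams share the links."""
    latency = 2 * (l_lan + l_wan + l_lan)
    disk = ((S / n) / bs) * rd                                  # each node reads S/n bytes
    network = max((S / n) / (b_lan / n), (S / n) / (b_wan / n))
    return latency + disk + network

Note that, because the n streams share the links, the network term (S/n)/(B/n) collapses to S/B; the gain in Eq. (11) comes from distributing the disk accesses over the n storage nodes.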

Fig. 4. Network basic model.

Table 1. Definitions and notations for the data access model.

Variable   Description                     Value
S          File size                       1 GB
L_LAN      Local area network latency      0.5 ms
L_WAN      Wide area network latency       Variable
B_LAN      Local area network bandwidth    1 Gb/s
B_WAN      Wide area network bandwidth     2 GB/s
B_d        Disk transfer rate              40 MB/s
t_seek     Average disk seek time          8.5 ms
t_lat      Average disk latency time       4.16 ms

Fig. 5. Reading a file of 1 GB using different numbers of subreplicas.

In this case, each transfer shares the WAN and LAN usage.

Fig. 5 shows the time needed to read a file of 1 GB using different numbers of subreplicas and varying WAN latencies between 10 ms and 200 ms. We have used a block size of 256 kB. As can be seen, the usage of several subreplicas for reading a file reduces the access time.


Fig. 6. Creating a replica of a file of 1 GB varying the number of subreplicas.

4.2. Modelling a create operation in BRS

If we consider the time needed for creating a whole replica in a hierarchical replication scheme (HRS), which replicates complete files, the time can be modelled as:

$$t_{creation} = L_{LAN} + L_{WAN} + L_{LAN} + \frac{S}{bs} \cdot r_d + \max\left(\frac{S}{B_{LAN}}, \frac{S}{B_{WAN}}\right). \qquad (12)$$

If we use BRS, n subreplicas are fragmented into 2n subreplicas. In this case, subreplica creation can be performed in parallel, and the time for replica creation is modelled as:

$$t_{creation} = L_{LAN} + L_{WAN} + L_{LAN} + \frac{S/2n}{bs} \cdot r_d + \max\left(\frac{S/2n}{B_{LAN}/2n}, \frac{S/2n}{B_{WAN}/2n}\right). \qquad (13)$$
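Both creation models in code (our sketch, same conventions as before):

def t_creation_hrs(S, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eq. (12): one complete replica is written to a single storage node."""
    return (l_lan + l_wan + l_lan + (S / bs) * rd
            + max(S / b_lan, S / b_wan))

def t_creation_brs(S, n, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eq. (13): n subreplicas split into 2n fragments created in parallel."""
    frag = S / (2 * n)
    return (l_lan + l_wan + l_lan + (frag / bs) * rd
            + max(frag / (b_lan / (2 * n)), frag / (b_wan / (2 * n))))

As in Eq. (11), the shared links leave the network term at S/B; the parallel gain comes from the smaller per-node disk work.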

Fig. 6 shows the time to create one replica using a hierarchical model (HRS) and branch replication (BRS) with different numbers of subreplicas, for different WAN latencies.

4.3. Modelling an update operation in BRS

Finally, in this section we analyze the time for update operations. In this case, we compare three schemes: the branch replication model (BRS) described in this paper; a replication scheme of entire replicas in which all update operations are conducted on a master replica and the modifications are then propagated sequentially to all replicas (SDRS, server-directed replication scheme), similar to the schemes proposed in [31,37–39]; and a replication scheme of entire replicas in which all update operations are conducted on a master replica and the modifications are then propagated to all replicas using a hierarchical scheme (HRS). In HRS, the update propagation is carried out in parallel, in a hierarchical way. In SDRS, the master replica propagates the update messages sequentially, so all update messages are sent from the master replica one after another.

The update scheme in which the master replica propagates the updates to n replicas can be modelled as:

$$t_{update} = n \cdot \left( L_{LAN} + L_{WAN} + L_{LAN} + \frac{S}{bs} \cdot r_d + \max\left(\frac{S}{B_{LAN}}, \frac{S}{B_{WAN}}\right) \right). \qquad (14)$$

The update scheme in which the master replica propagates the updates in a hierarchical way to n replicas can be modelled as:

$$t_{update} = \sum_{i=1}^{levels} \left( L_{LAN} + L_{WAN} + L_{LAN} + \frac{S}{bs} \cdot r_d + \max\left(\frac{S}{B_{LAN}}, \frac{S}{B_{WAN}}\right) \right) \qquad (15)$$

where $levels = \lceil \log_2(n) \rceil$. In this case, we assume that the update process is carried out in parallel in each level. The model for the branch replication is defined as:

$$t_{update} = \sum_{i=1}^{levels} \left( L_{LAN} + L_{WAN} + L_{LAN} + \frac{S/2^{\,i-1}}{bs} \cdot r_d + \max\left(\frac{S/2^{\,i-1}}{B_{LAN}}, \frac{S/2^{\,i-1}}{B_{WAN}}\right) \right) \qquad (16)$$

where $levels = \lceil \log_2(n) \rceil$. We must remember that all client updates are submitted to the leaf nodes of the tree.

Fig. 7. Updating the replicas of 1 GB with several subreplica numbers.

Fig. 7 compares the time needed to modify a replica of 1 GB with different replica numbers and a WAN latency of 10 ms. In this figure, it can be seen that the performance is worst for the SDRS method, since this method does not include any parallelism in the update operation.
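The three update models, Eqs. (14)-(16), in code (our sketch):

import math

def t_update_sdrs(S, n, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eq. (14): the master sends the whole file to the n replicas sequentially."""
    step = (l_lan + l_wan + l_lan + (S / bs) * rd
            + max(S / b_lan, S / b_wan))
    return n * step

def t_update_hrs(S, n, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eq. (15): whole-file updates pushed down ceil(log2(n)) levels in parallel."""
    levels = math.ceil(math.log2(n))
    step = (l_lan + l_wan + l_lan + (S / bs) * rd
            + max(S / b_lan, S / b_wan))
    return levels * step

def t_update_brs(S, n, bs, rd, l_lan, l_wan, b_lan, b_wan):
    """Eq. (16): the propagated volume halves at every level of the branch."""
    levels = math.ceil(math.log2(n))
    total = 0.0
    for i in range(1, levels + 1):
        frag = S / 2 ** (i - 1)
        total += (l_lan + l_wan + l_lan + (frag / bs) * rd
                  + max(frag / b_lan, frag / b_wan))
    return total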

5. Evaluation

A discrete event simulator, built on top of Omnet++ [46], has been developed to evaluate the proposed branch replication scheme and to compare it with a hierarchical replication scheme. All simulations have been executed for both HRS and BRS in order to compare results. In this section we evaluate these two comparable methods and do not consider the SDRS method, because it sends all update messages sequentially.

A data grid including a set of 50 sites, each one comprising a number of processors and 10 storage nodes, has been defined. All nodes located in the same site use a Gigabit Ethernet LAN for communications. Communications among Internet sites are modelled using a WAN with an average latency of 50 ms and 1 Gb/s bandwidth. The simulated data grid comprises a total set of 5000 files. The file size is defined by using a uniform size distribution between 1 MB and 1 GB. The replication level for each file is randomly selected between 1 and 10 replicas.

The model defines 200 users spread evenly across the 50 sites. Each user reads or writes different files selected randomly. The replica for a read operation is selected using the algorithm described in Section 3.1. The replica for writing is always the terminal replica obtained directly from the RNS.

Fig. 8. Performance results for reading operations.

Fig. 9. Performance results for writing operations.

Fig. 8 shows the average bandwidth (MB/s) obtained for read operations when varying the file size (from 1 MB to 1 GB). Fig. 9 shows the same test case results for write operations. As may be seen, the performance of read operations increases for files with a size bigger than 2 MB, which is very usual in grid environments, where files tend to be large. For write operations, BRS outperforms HRS for all file sizes. The results obtained demonstrate that the new branch replication scheme can be used as a replication method in grid environments for increasing data access performance.

6. Conclusions and future work

In this paper, we have addressed the problem of scalability and replica modification in data grid environments by investigating the use of a new data replication method, called the branch replication scheme (BRS). In BRS, replicas are composed of a set of subreplicas organized using a hierarchical tree topology, so that the subreplicas of a replica do not overlap and the union of the set of subreplicas is an entire replica. This model is appropriate for applying parallel I/O techniques, and it provides an efficient way of updating data replicas, which yields high performance replication and updating of the replication tree. As part of the model, we have also proposed a naming scheme, based on the RNS standard, to link human names with physical endpoint references in a scalable and efficient way.

An analytical model of the replication scheme, naming system, and replica updating scheme has been formally described in the paper. Using this model, operations such as reading, writing, and updating a replica have been analyzed. The results show the feasibility of BRS. Based on this analytical model, we have developed a simulator, built on top of Omnet++, and have simulated and tested three replication schemes: hierarchical (HRS), server-directed (SDRS), and branch replication (BRS). The results of the simulation show that BRS always improves data access performance for files of different sizes, for both read and write operations.

The work presented here was aimed at demonstrating the feasibility of BRS for a data grid. Future work will implement a testbed that uses this replication method in an existing data grid.

Acknowledgments

This work has been partially funded by project TIN2007-63092 of the Spanish Ministry of Education and project CCG07-UC3M/TIC-3277 of the Madrid State Government.

References

[1] W. Allcock, J. Bester, J. Bresnahan, A. Chervenak, I. Foster, C. Kesselman, S. Meder, V. Nefedova, D. Quesnel, S. Tuecke, Secure, efficient data transport and replica management for high performance data intensive computing, in: IEEE Mass Storage Conference, 2001.
[2] I. Foster, The grid: A new infrastructure for 21st century science, Physics Today 54 (2) (2002).
[3] H. Lamehamedi, Z. Shentu, B. Szymanski, E. Deelman, Simulation of dynamic data replication strategies in data grids, in: Proc. 12th Heterogeneous Computing Workshop, HCW 2003, Nice, France, April 2003, IEEE Computer Science Press, Los Alamitos, CA, 2003.
[4] D. Deatrich, S. Liu, C. Payne, R. Tafirout, R. Walker, A. Wong, M. Vetterli, Managing Petabyte-scale storage for the ATLAS Tier-1 centre at TRIUMF, in: 22nd International Symposium on High Performance Computing Systems and Applications, HPCS 2008, 9–11 June 2008, pp. 167–171.
[5] J. Bresnahan, M. Link, G. Khanna, Z. Imani, R. Kettimuthu, I. Foster, Globus GridFTP: What's new in 2007, in: Proceedings of the First International Conference on Networks for Grid Applications, GridNet 2007, Lyon, France, 2007.
[6] A. Chervenak, et al., Giggle: A framework for constructing scalable replica location services, in: Proceedings of IEEE Supercomputing 2002.
[7] K. Ranganathan, I. Foster, Simulation studies of computation and data scheduling algorithms for Data Grids, Journal of Grid Computing 1 (1) (2003).
[8] P. Kunszt, E. Laure, H. Stockinger, K. Stockinger, File-based replica management, Future Generation Computer Systems 22 (1) (2005) 115–123.
[9] U. Cibej, B. Slivnik, B. Robic, The complexity of static data replication in data grids, Parallel Computing 31 (8–9) (2005) 900–912.
[10] DataGrid Project. The European DataGrid. http://eu-datagrid.web.cern.ch/eu-datagrid/.
[11] H. Jin, T. Cortes, R. Buyya (Eds.), High Performance Mass Storage and Parallel I/O: Technologies and Applications, IEEE Press and Wiley, 2002.
[12] J.M. Perez, F. Garcia, J. Carretero, A. Calderon, J. Fernandez, A parallel I/O middleware to integrate heterogeneous storage resources on grids, in: Grid Computing: First European Across Grids Conference, Santiago de Compostela, Spain, February 13–14, 2004, in: Lecture Notes in Computer Science, vol. 2970, 2004, pp. 124–131. Revised Papers.
[13] F. Garcia-Carballeira, J. Carretero, A. Calderon, J.D. Garcia, L.M. Sanchez, A global and parallel file system for grids, Future Generation Computer Systems 23 (1) (2007) 116–122.
[14] B. Radic, V. Kajic, E. Imamagic, Optimization of data transfer for grid using GridFTP, Journal of Computing and Information Technology - CIT 15 (15) (2007) 347–353.
[15] R.K. Madduri, C.S. Hood, W.E. Allcock, Reliable file transfer in Grid environments, in: Proceedings of the 27th Annual IEEE Conference on Local Computer Networks, LCN 2002, 2002, pp. 727–738.
[16] L. Guy, P. Kunszt, E. Laure, H. Stockinger, K. Stockinger, Replica management in Data Grids, Technical Report, GF5 Working Draft, 2002.
[17] A.L. Chervenak, N. Palavalli, S. Bharathi, C. Kesselman, R. Schwartzkopf, Performance and scalability of a replica location service, in: High Performance Distributed Computing Conference, HPDC-13, Honolulu, HI, June 2004.
[18] H. Stockinger, A. Samar, K. Holtman, B. Allcock, I. Foster, B. Tierney, File and object replication in Data Grids, in: Proceedings of the 10th IEEE International Symposium on High Performance Distributed Computing, 2001, pp. 76–87.
[19] R. Slota, D. Nikolow, L. Skital, J. Kitowski, Implementation of replication methods in the Grid environment, in: Advances in Grid Computing - EGC 2005, in: Lecture Notes in Computer Science, vol. 3470, 2005, pp. 474–484.
[20] R. Chang, J. Chang, Data replica consistency service for Data Grids, in: International Conference on Information Technology: New Generations, ITNG'06, 2006.
[21] P. Liu, J. Wu, Optimal replica placement strategy for hierarchical Data Grid systems, in: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, CCGRID 06, 2006, pp. 417–420.
[22] R. Rahman, K. Barker, R. Alhajj, Replica placement design with static optimality and dynamic maintainability, in: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid, CCGRID 06, 2006, pp. 16–19.
[23] Q. Rasool, et al., On P2P and hybrid approaches for replica placement in grid environment, Information Technology Journal 7 (4) (2008) 599–606.
[24] C. Wang, C. Yang, M. Chiang, A fair replica placement for parallel download on cluster grid, in: Network-Based Information Systems, in: Lecture Notes in Computer Science, vol. 4658, Springer, 2007, pp. 268–277.
[25] C.D. Nam, C. Youn, S. Jeong, E. Shim, E. Lee, E. Park, An efficient replication scheme for data grids, in: Proceedings of the 12th IEEE International Conference on Networks, ICON 2004, 2004, pp. 392–396.
[26] H. Youn, et al., An efficient hybrid replication protocol for highly available distributed system, in: Proceedings of the Communications and Computer Networks, CCN 2002, Cambridge, USA, Acta Press, 2002.
[27] P. Knezevic, A. Wombacher, T. Risse, DHT-based self-adapting replication protocol for achieving high data availability, in: Proceedings of the International Conference on Signal-Image Technology and Internet-Based Systems, SITIS, 2006.
[28] Globus Alliance. GT 4.0: The Globus Toolkit Data Management. http://www.globus.org/toolkit/data/, 2008.
[29] Globus Alliance. GT 4.0 Data Management: Replica Location Service (RLS). http://www.globus.org/toolkit/data/rls, 2008.
[30] M. Pereira, O. Tatebe, L. Luan, T. Anderson, J. Xu, Resource Namespace Service specification. http://www.globalgridforum.net, Nov. 2005.
[31] A. Domenici, F. Donno, G. Pucciani, H. Stockinger, K. Stockinger, Replica consistency in a Data Grid, Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 534 (1) (2004) 24–28.
[32] J. Gray, P. Helland, P. O'Neil, D. Shasha, The dangers of replication and a solution, in: Proceedings of ACM SIGMOD, 1996, pp. 173–182.
[33] R. Ladin, B. Liskov, L. Shrira, Lazy replication: Exploiting the semantics of distributed services, in: Proceedings of the 9th ACM Symposium on Principles of Distributed Computing, Quebec City, Canada, August 1990, pp. 43–57.
[34] A.M. Kermarrec, A. Rowstron, M. Shapiro, P. Druschel, The IceCube approach to the reconciliation of divergent replicas, in: Proceedings of the 20th Annual ACM Symposium on Principles of Distributed Computing, PODC 2001, August 2001.
[35] C. Yang, W. Tsai, T. Chen, C. Hsu, One-way file replica consistency model in Data Grids, in: Proceedings of the 2nd IEEE Asia-Pacific Service Computing Conference, 2007, pp. 364–373.
[36] C. Yang, C. Fu, C. Huang, C. Hsu, FRCS: A file replication and consistency service in Data Grids, in: Proceedings of MUE 2008, International Conference on Multimedia and Ubiquitous Engineering, 2008, pp. 444–447.
[37] J. No, et al., Data replication techniques for data intensive applications, International Conference on Computational Science 4 (2006) 1063–1070.
[38] R. Chang, H. Chang, Y. Wang, A dynamic weighted data replication strategy in data grids, in: Proceedings of AICCSA 2008, IEEE/ACS International Conference on Computer Systems and Applications, 2008, pp. 414–421.
[39] Y. Sun, Z. Xu, Grid replication coherence protocol, in: 18th International Parallel and Distributed Processing Symposium, IPDPS'04 - Workshop 13, 2004.
[40] A. Mamat, M. Radi, M.M. Deris, H. Ibrahim, Performance of update propagation techniques for Data Grid, in: Proceedings of the International Conference on Computer and Communication Engineering, Malaysia, May 13–15, 2008, pp. 332–335.
[41] A. Samar, H. Stockinger, Grid Data Management Pilot (GDMP): A tool for wide area replication, in: Proceedings of the IASTED International Conference on Applied Informatics, AI2001, 2001, pp. 89–93.
[42] K. Holtman, Object level physics data replication in the Grid, in: Proceedings of ACAT'2000, 2000, pp. 244–246.
[43] R. Chang, J. Chang, Y. Lin, Job scheduling and data replication on data grids, Future Generation Computer Systems 23 (7) (2007) 846–860.
[44] P. Eerola, et al., The NorduGrid architecture and tools, in: Computing in High Energy Physics and Nuclear Physics, CHEP03, La Jolla, USA, March 2003.
[45] M. Cai, A. Chervenak, M. Frank, A Peer-to-Peer replica location service based on distributed hash table, in: Proceedings of the Supercomputing 2004 Conference, November 2004.
[46] Omnet++ Discrete Event Simulation System. www.omnetpp.org.

José M. Pérez obtained his MS in Computer Science from the Universidad Politecnica de Madrid in 2001 and his Ph.D. in 2006 from the Universidad Carlos III de Madrid. He was an associate professor in the Department of Computer Science at the Carlos III University of Madrid, Spain. His research interests include high performance computing and parallel file systems.

Félix García-Carballeira received the MS degree in Computer Science in 1993 from the Universidad Politecnica de Madrid, and the Ph.D. degree in Computer Science in 1996 from the same university. From 1996 to 2000 he was an associate professor in the Department of Computer Architecture at the Universidad Politecnica de Madrid. He is currently a full professor in the Computer Science Department at the Universidad Carlos III de Madrid. His research interests include high performance computing and parallel file systems. He is the coauthor of 12 books and has published more than 80 articles in journals and conferences.

Jesús Carretero obtained his Computer Science degree and his Ph.D. from the Universidad Politecnica de Madrid. Since 1989, he has been teaching Operating Systems and Computer Architecture at several universities. During 1997 and 1998 he held a visiting scholar position at Northwestern University, in Chicago. He has been a full professor at the Universidad Carlos III de Madrid, Spain, since 2001. His research interest is focused on Parallel and Distributed Systems, especially data storage systems, Real-Time Systems, and Multimedia Techniques. He is the author of several educational books and has published papers in several major journals of this field, such as Parallel Computing and the Journal of Parallel and Distributed Computing.

Alejandro Calderón obtained his MS in Computer Science from the Universidad Politecnica de Madrid in 2000 and his Ph.D. in 2005 from the Universidad Carlos III de Madrid. He is an associate professor in the Department of Computer Science at the Carlos III University of Madrid, Spain. His research interests include high performance computing and parallel file systems. Alejandro has participated in the implementation of MiMPI, a multithreaded implementation of MPI, and of the Expand parallel file system.

Javier Fernández obtained his Computer Science degree from the Universidad Politecnica de Madrid in 2000 and his Ph.D. in 2005 from the Universidad Carlos III de Madrid. He has been an assistant professor at the Universidad Carlos III de Madrid since 2002, teaching Computer Architecture and Operating Systems. His research interest is focused on Parallel and Distributed Systems, especially data storage systems and Real-Time Systems.