
Int. J. Computational Science and Engineering, Vol. 4, No. 3, 2009 137

Deploying applications in multi-SAN SMP clusters

Albano Alves*

Escola Superior de Tecnologia e Gestão, Instituto Politécnico de Bragança, Campus de Santa Apolónia – Apartado 1038, 5301-854 Bragança, Portugal
E-mail: [email protected]
*Corresponding author

António Pina

Departamento de Informática, Universidade do Minho, Campus de Gualtar, 4710-057 Braga, Portugal
E-mail: [email protected]

José Exposto and José Rufino

Escola Superior de Tecnologia e Gestão, Instituto Politécnico de Bragança, Campus de Santa Apolónia – Apartado 1038, 5301-854 Bragança, Portugal
E-mail: [email protected]
E-mail: [email protected]

Abstract: The effective exploitation of multi-SAN SMP clusters and the use of generic clusters to support complex information systems require new approaches; multi-SAN SMP clusters introduce new levels of parallelism and traditional environments are mainly used to run scientific computations. In this paper we present a novel approach to the exploitation of clusters that allows integrating in a unique metaphor: the representation of physical resources, the modelling of applications and the mapping of applications into physical resources. The proposed abstractions favoured the development of an API that allows combining and benefiting from the shared memory, message passing and global memory paradigms.

Keywords: cluster computing; heterogeneity; resource management; application modelling; logical-physical mapping.

Reference to this paper should be made as follows: Alves, A., Pina, A., Exposto, J. and Rufino, J. (2009) ‘Deploying applications in multi-SAN SMP clusters’, Int. J. Computational Science and Engineering, Vol. 4, No. 3, pp.137–148.

Biographical notes: Albano Alves is a Professor Coordenador at the Instituto Politécnico de Bragança, Portugal. He received his PhD Degree from Universidade do Minho, Portugal, in 2004. Since 2000 he has been researching in cluster computing and distributed systems.

António Pina is a Professor Auxiliar at Departamento de Informática, Universidade do Minho, Portugal. He received his MSc and PhD Degrees in Informatics from Universidade do Minho, Portugal. Throughout his career he has conducted research and published in the areas of scalable parallel computing, parallel languages and environments and information retrieval.

José Exposto is a Professor Adjunto at the Instituto Politécnico de Bragança, Portugal. He is now concluding his PhD Thesis.

José Rufino is a Professor Adjunto at the Instituto Politécnico de Bragança, Portugal. He received his PhD Degree from Universidade do Minho, Portugal, in 2008. Since 2000 he has been researching in cluster computing and distributed systems.

Copyright © 2009 Inderscience Enterprises Ltd.


1 Introduction

Common parallel programming tools are intended to develop conventional applications capable of exploiting conventional parallel machines. Non-conventional architectures and emerging application domains will benefit from novel approaches.

1.1 Motivation

Clusters of Symmetric Multi-Processor (SMP) workstations interconnected by a high-performance System Area Network (SAN) technology are becoming an effective alternative for running high-demand applications. The assumed homogeneity of these systems has allowed the development of efficient platforms. However, to expand computing power, new nodes may be added to an initial cluster and novel SAN technologies may be considered to interconnect these nodes, thus creating a heterogeneous system that we name multi-SAN SMP cluster.

Recently, the hierarchical nature of SMP clusters has motivated the investigation of appropriate programming models (Baden and Fink, 2000; Gursoy and Cengiz, 1999). But to effectively exploit multi-SAN SMP clusters and support multiple cooperative applications, new approaches are still needed. In fact, the PVM- and MPI-based parallel environments used to exploit clusters and develop parallel programs assume that clusters are simple, flat collections of processors, thus ignoring the various levels of their intrinsic memory hierarchies.

PVM (Geist et al., 1994) and MPI (Snir et al., 1998) are both specifications for message-passing libraries that have been widely used to create extremely efficient parallel applications that run on many types of parallel computers. Message passing is a powerful and very general method of expressing parallelism that had a major influence on the wide use of parallel computers. Traditionally this paradigm is used mainly to program scientific and engineering algorithms that run as a stand-alone application, making use of the totality of the physical resources of a parallel computer. Nowadays, novel application domains like those that emerge from the increasing importance of the internet, web computing and distributed computing are putting strong demands on cluster computing.

Provided that novel programming models and runtime systems are developed, we may consider using clusters to support complex information systems, integrating multiple cooperative applications.

1.2 Our approach

Figure 1(a) presents a practical example of a multi-SAN SMP cluster mixing Myrinet and Gigabit. Multi-interface nodes are used to integrate sub-clusters (technological partitions).

To exploit such a cluster we developed RoCl (Alves et al., 2003), a communication library that combines GM (Myricom Inc., 2000) – a low-level communication library provided to interface Myrinet – and M-VIA – a modular implementation of the Virtual Interface Architecture (Compaq et al., 1997) that supports Gigabit. RoCl may be considered a communication-level Single System Image (SSI), since it provides full connectivity among application entities instantiated all over the cluster and also allows the registration and discovery of entities (see Figure 1(b)) through a basic cluster-oriented directory service, relying on UDP broadcast.

Figure 1 Exploitation of a multi-networked SMP cluster: (a) cluster representation and (b) RoCl and mεµ layers

Now we propose a new layer – mεµ – built on top of RoCl, intended to assist programmers in setting up cooperative applications and exploiting cluster resources. Our contribution may be seen as a new methodology comprising three stages:

• the representation of physical resources

• the modelling of application components

• the mapping of application components into physical resources.

This paper is structured as follows: the next section formalises our approach; Sections 3 and 4 detail important aspects of our modelling and programming tools; Sections 5–7 exemplify each one of the stages stated above; finally, in Section 8 we summarise our concluding remarks.

2 Resource oriented computing

The resource oriented computing paradigm is aimed at accomplishing an efficient and convenient way of modelling long-running complex applications (Pina et al., 2002). In RoCl the resource abstraction is used to model communication end-points. In mεµ it is used to simplify the description of the parallel computer and the modelling/development of the application, and it plays an important role in supporting application execution.

2.1 Resources in RoCl and mεµ

RoCl acts as a base layer interfacing the low-level communication subsystems generally available in clusters. It constitutes a basic single system image, by providing interconnectivity among communication entities. These entities, which we name resources, are application entities that may exchange messages regardless of the underlying communication subsystems. The RoCl dispatching mechanism is able to bridge messages from GM to M-VIA, for instance. A cluster-oriented directory service allows programmers to announce and locate application resources, thus turning RoCl into a convenient platform to drive multi-SAN clusters that integrate multiple sub-clusters (Myrinet and Gigabit sub-clusters, for instance) interconnected by multihomed nodes.

mεµ was implemented over RoCl and provides higher-level programming abstractions. Basically, it supports the specification of physical resources, the instantiation of logical resources according to the actual organisation of physical resources and high-level communication operations between logical resources.

Figure 2 depicts the main difference between the RoCl and mεµ abstraction models; while RoCl (Figure 2(a)) provides unstructured communication between resources, mεµ (Figure 2(b)) offers mechanisms to organise resources and exploit locality.

Figure 2 Resource interaction in: (a) RoCl and (b) mεµ

The overall platform is well suited to the development of applications aimed at exploiting multi-SAN clusters that integrate multiple communication technologies such as Myrinet and Gigabit. The implicit programming model has proved to be very useful to manage the distinct locality levels intrinsic to the target system architecture. At the application level, the approach smoothly accommodates the evolution from cluster to multi-cluster grid enabled computing.

2.2 Resource oriented communication

RoCl is an intermediate-level communication library that allows system programmers to easily develop higher-level programming environments. It uses existing low-level communication libraries to interface networking hardware, like Madeleine (Aumage et al., 2000).

RoCl defines three major entities: contexts, resources and buffers. Contexts are used to interface the low-level communication subsystems. Resources are used to model both communication and computation entities, whose existence is announced by registering them in a global distributed directory service. To minimise memory allocation and pinning operations, RoCl uses a buffer management system; messages are held in specific registered memory areas to allow zero-copy communication.

RoCl includes a fully distributed directory service where resources are registered along with their attributes; a directory server is started at each cluster node and all application resources are registered locally. A basic interserver protocol allows applications to locate remote resources; query requests are always addressed to a local server but, at any moment, servers may trigger a global search by broadcasting the request.

The typical sequence of operations in an application includes:

• initialising a RoCl context

• registering resources

• querying the directory to find remote resources

• requesting message buffers

• sending messages to resources previously found

• retrieving received messages from a local queue.
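The sequence above can be sketched with a toy in-process model. All names here (Directory, Context, register, query, send, receive) are illustrative assumptions used to mirror the described workflow; they are not the real RoCl API, and the transport, buffer management and zero-copy machinery are elided.

```python
# Toy in-process model of the RoCl operation sequence described above.
# Names and semantics are paraphrased assumptions, not the real library.

class Directory:
    """Stands in for the cluster-wide directory service."""
    def __init__(self):
        self.entries = {}          # resource name -> attributes

    def register(self, name, attrs):
        self.entries[name] = attrs

    def query(self, **attrs):
        return [n for n, a in self.entries.items()
                if all(a.get(k) == v for k, v in attrs.items())]

class Context:
    """Stands in for a RoCl context wrapping a communication subsystem."""
    def __init__(self, directory):
        self.directory = directory
        self.queues = {}           # resource name -> incoming message queue

    def register(self, name, **attrs):
        self.directory.register(name, attrs)
        self.queues[name] = []

    def send(self, dest, payload):
        self.queues[dest].append(payload)   # buffer handling elided

    def receive(self, name):
        return self.queues[name].pop(0)     # retrieve from local queue

# 1. initialise a context; 2. register a resource
d = Directory()
ctx = Context(d)
ctx.register("worker-0", kind="task", node="A1")
# 3. query the directory; 4-5. obtain a buffer and send
found = ctx.directory.query(kind="task")
ctx.send(found[0], b"hello")
# 6. retrieve the received message from the local queue
msg = ctx.receive("worker-0")
```

The single-process model collapses the interserver protocol into one dictionary; in the real system the query may trigger a broadcast among per-node directory servers.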

Resources are animated by application threads which share the communication facilities provided by contexts. On the other hand, RoCl supports the simultaneous exploitation of multiple communication technologies, being responsible for selecting the most appropriate communication medium to deliver messages to a specific destination, for aggregating technologies when the target resource of a message can be reached through distinct technologies and for routing messages at multihomed nodes that interconnect distinct sub-clusters. To provide the above facilities, RoCl uses a multithreaded dispatching mechanism and includes native support for multithreaded applications.


2.3 Unified modelling and exploitation

The mεµ programming methodology includes three phases:

• the definition and organisation of the concepts that model the parallel computer – physical resources

• the definition and organisation of the entities that represent applications – logical resources

• the mapping of the logical entities into the physical resources.

The programming interface is organised around six basic abstractions (see Figure 3) designed for modelling both logical and physical resources:

• domains – to group or confine a hierarchy of related entities

• operons – to delimit the running contexts of tasks

• tasks – to support fine-grain concurrency and communication

• mailboxes – to queue messages sent/retrieved by tasks

• memory blocks – to define segments of contiguous memory

• memory gathers – to create the illusion of global memory.

Figure 3 mεµ entities

These abstractions are arranged in a tree, reflecting the memory hierarchy of the cluster. Figure 4(a) shows how the physical resources of a parallel machine made of distinct technological partitions may be modelled by a hierarchy of domains. Figure 4(b) presents a high-level specification of a basic web application.

The mapping of logical into physical resources is achieved by merging the base hierarchies – physical resources and logical entities – into a single mεµ hierarchy.

3 Resource properties

Resources are named and globally identified entities to which we may attach lists of relevant properties. Properties improve the meaningfulness of each entity present in a mεµ hierarchy by pointing out the presence, or defining the value, of a particular characteristic. In a mεµ hierarchy, the properties of an entity are computed taking into account its specific location.

Figure 4 Resource modelling examples: (a) physical hierarchy and (b) logical hierarchy

3.1 Aliases

In addition to the regular ascendant-descendant relationships present in a tree, it is possible in a mεµ hierarchy to establish original-alias relationships. Thus, an entity may have one or more aliases and an alias may result from one or more original entities.

The creation of an alias for a given entity corresponds to the creation of another entity, of the same kind, at another point of the tree, and to the storage of the identifier of the first entity (the original) in the second entity (the alias). To create an alias for a given set of entities (all of the same kind) it is required to store the identifiers of all original entities in the alias. Figure 5 presents some examples of original-alias relationships, which are represented by dashed arrows.

Figure 5 Ascendants, descendants, originals and aliases

Original-alias relationships are set up by creating new entities that are inserted in the tree. The insertion of an alias must not misrepresent the regular ascendancy chains of a tree; thus, none of the originals of the alias can belong to the ascendancy chain of the entity where the alias is attached. Since the ascendancy chain of an entity may contain aliases (including the entity itself), the stated restriction must be applied to the aliasing-ascendancy chains, that is, all node chains that lead to the tree root, including alias entities.
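The validity check described above can be modelled as follows. The sketch is an illustrative assumption about data layout (parent links plus per-entity original links), not the mεµ implementation: it collects the aliasing-ascendancy chain of the attachment point and rejects any alias whose original already lies on it.

```python
# Sketch of the alias-insertion restriction: none of an alias's originals
# may appear on the aliasing-ascendancy chain of its attachment point.
# The Entity layout is an illustrative assumption.

class Entity:
    def __init__(self, name, parent=None, originals=()):
        self.name, self.parent = name, parent
        self.originals = list(originals)   # empty for non-alias entities

def aliasing_ascendancy(entity):
    """All entities on chains leading from `entity` to the root, following
    both parent links and, for aliases, original links."""
    seen, stack = set(), [entity]
    while stack:
        e = stack.pop()
        if e is None or id(e) in seen:
            continue
        seen.add(id(e))
        stack.append(e.parent)
        stack.extend(e.originals)
    return seen

def may_attach_alias(attach_point, originals):
    chain = aliasing_ascendancy(attach_point)
    return all(id(o) not in chain for o in originals)

root = Entity("cluster")
a = Entity("A", parent=root)
b = Entity("B", parent=root)
# An alias of B under A is legal; an alias of the root under A is not,
# because the root lies on A's ascendancy chain.
```

Note that the chain includes the attachment point itself, matching the text's remark that the entity's own chain may contain aliases.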

3.2 Obtaining properties

Considering the original-alias and ascendant-descendant relationships, the properties of an entity are obtained by collecting the properties directly attached to that entity (own properties) and the properties obtained through inheritance, synthesis or sharing. The inheritance mechanism ensures that the properties of an entity are passed along to its descendants, while synthesis corresponds to the reverse. Property sharing allows an entity to spread all its properties to its aliases.

Figure 6 shows the way properties are determined using a simple example. It is important to note that some properties may involve computation aside from union.

The ability to establish entity relationships and to individually attach properties to entities, along with the mechanisms of inheritance, synthesis and sharing, allows complex systems to be characterised efficiently.
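The three mechanisms can be made concrete with a small model. Everything below is an illustrative assumption (names, data layout, and the use of plain set union to combine properties; as noted above, some real properties require computation beyond union, e.g. summing CPU counts):

```python
# Illustrative model of property determination in a mεµ-like hierarchy:
# own properties, plus inheritance (from ascendants), synthesis (from
# descendants) and sharing (an alias collects its originals' properties).
# Combination is plain set union here, a deliberate simplification.

class Entity:
    def __init__(self, name, parent=None, originals=()):
        self.name, self.parent = name, parent
        self.originals = list(originals)
        self.children = []
        self.own = set()
        if parent:
            parent.children.append(self)

def inherited(e):
    return set() if e.parent is None else e.parent.own | inherited(e.parent)

def synthesised(e):
    out = set()
    for c in e.children:
        out |= c.own | synthesised(c)    # descendants' properties flow up
    return out

def properties(e):
    shared = set()
    for o in e.originals:                # sharing: originals -> alias
        shared |= properties(o)
    return e.own | inherited(e) | synthesised(e) | shared

cluster = Entity("cluster")
node = Entity("node", parent=cluster)
cluster.own = {"Myrinet"}
node.own = {"CPU=2"}
view = Entity("view", parent=cluster, originals=[node])  # an alias
```

Here the alias `view` ends up with the node's own property through sharing and the cluster's property through inheritance, mirroring Figure 6's combination of the three mechanisms.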

Figure 6 Determining entity properties

4 Programming paradigms

mεµ allows programmers to combine three well known parallel programming paradigms: shared memory, message passing and global memory.

Shared memory programming in mεµ is equivalent to traditional POSIX threads programming, since mεµ tasks are directly mapped into POSIX threads.

4.1 Message passing

The approach followed in mεµ extends the traditional message passing model by integrating the distinct types of resources used in the process of modelling an application. In fact, only tasks may generate messages, but the destination of a message may be a task, an operon, a domain or a mailbox.

When a message is addressed to a task, the communication mechanism provided by RoCl is adequate (see Figure 7(a)). If the message destination is an alias task, then the message must be forwarded to the original task, thus requiring additional functionality (see Figure 7(b)). If the original, in turn, is an alias, then the forwarding process will continue (see Figure 7(c)).

Figure 7 Message passing scenarios

When a message is sent to an operon, any descendant non-alias task may compete to get this message. The operon may be considered as a repository where the only message copy is stored until a task claims it. If the operon is an alias, the forwarding mechanism is activated.

The mailbox acts like the operon, storing the message copy when it receives a message. But in this case, more tasks can access the message; any task belonging to the subtree defined by the ascendant of the mailbox may claim the message.

When a message is addressed to a domain, a message copy is sent to each descendant capable of receiving messages (tasks, operons, mailboxes and domains) and, if the domain is an alias, each original will also receive a message copy (see Figure 7(e) and (f)). Any message addressed to a domain that represents a physical resource will be dropped (see Figure 7(d)).
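The delivery rules of this section can be summarised in a routing sketch. This is a model of the rules only, under invented names: the RoCl transport, the competition between tasks for an operon's single copy and the mailbox's subtree visibility are all elided.

```python
# Sketch of the mεµ delivery rules described above (routing only).

class Node:
    def __init__(self, kind, parent=None, original=None, physical=False):
        self.kind, self.original, self.physical = kind, original, physical
        self.children = []
        self.inbox = []
        if parent:
            parent.children.append(self)

def deliver(node, msg):
    if node.kind == "domain":
        if node.physical:
            return                        # Figure 7(d): dropped
        if node.original:                 # alias domain: each original
            deliver(node.original, msg)   # also receives a copy
        for child in node.children:
            deliver(child, msg)           # one copy per capable descendant
    elif node.original:                   # alias task/operon/mailbox:
        deliver(node.original, msg)       # forward along the alias chain
    else:
        node.inbox.append(msg)            # stored until claimed by a task

dom = Node("domain")
op = Node("operon", parent=dom)
task = Node("task", parent=op)
alias1 = Node("task", original=task)
alias2 = Node("task", original=alias1)    # alias of an alias: still forwarded
deliver(alias2, "m1")                     # reaches the ultimate original task
deliver(dom, "m2")                        # a single copy reaches the operon
```

The chained forwarding through `alias2` and `alias1` corresponds to the Figure 7(c) case, and the operon keeping the only copy corresponds to its repository role.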

4.2 Global memory

A fundamental issue in cluster computing is the memory hierarchy and, as a consequence, the exploitation of data locality is one of the keys to achieving high performance.

To build a global memory address space, mεµ includes two abstractions: memory blocks and gathers. The instantiation of resources based on these abstractions and the data transfer between local and global memory are accomplished by calling specific mεµ primitives.

A memory block is a segment of contiguous memory which may be asynchronously accessed by local or remote tasks. It may be read/written in portions or all at a time. As depicted in Figure 8, memory blocks (Bx) are created as descendants of operons. One or more memory blocks, belonging to the same or different hierarchies of operons, may be joined together to create the illusion of a larger amount of contiguous memory. The key to this facility resides in the definition of a memory gather, which is a resource that organises several variable-size distributed memory blocks together in one dimension.

A memory gather has as its descendants a sequence of memory block aliases. Each alias, automatically created by the runtime system, has only one original – a memory block that encloses a specific segment of memory. In Figure 8, G2 is a memory gather that combines several memory blocks by means of the corresponding aliases. The black portion of the small bars under memory block aliases represents the fraction of the contiguous memory segment that is imported to the alias. Note that the same memory block may be aggregated under multiple memory gathers. It is also possible for a memory gather to enclose other memory gathers along with simple blocks, as is the case of G1 in Figure 8.
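The one-dimensional composition performed by a gather can be sketched as an address-translation table. The class names and the `translate` helper are illustrative assumptions; the placement rules, however, follow the text (a fraction of a block may be imported, and fragments are appended unless an explicit gather offset is given).

```python
# Sketch of a memory gather as a one-dimensional composition of (fractions
# of) distributed memory blocks; translate() maps a global gather offset to
# the owning block and the offset inside it.

class Block:
    def __init__(self, name, size):
        self.name, self.size = name, size
        self.data = bytearray(size)

class Gather:
    def __init__(self):
        self.fragments = []   # (gather_offset, block, block_offset, length)
        self.size = 0

    def add(self, block, offset=None, frag_off=0, length=None):
        length = block.size - frag_off if length is None else length
        offset = self.size if offset is None else offset  # append by default
        self.fragments.append((offset, block, frag_off, length))
        self.size = max(self.size, offset + length)

    def translate(self, gaddr):
        for off, blk, boff, length in self.fragments:
            if off <= gaddr < off + length:
                return blk, boff + (gaddr - off)
        raise IndexError(gaddr)

b1, b2 = Block("B1", 64), Block("B2", 128)
g = Gather()
g.add(b1)                     # whole block at gather range [0, 64)
g.add(b2, frag_off=32)        # a 96-byte fraction at gather range [64, 160)
```

In the real system a get or put over a gather range would turn each touched fragment into a remote read or write; here `translate` only exposes the mapping.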

Figure 8 Block gathering example

Figure 9 shows a code fragment used to create the two global address spaces depicted in Figure 8. Memory blocks are created by specifying an ascendant, a name, a memory size and an optional list of properties. Memory gathers are created similarly, but obviously no size is given, since they may grow arbitrarily. When a block is added to a gather, it is possible to specify a fraction of the block (by default the whole block is considered) and an offset, relative to the gather, where the block fraction must be placed (by default blocks are appended to the end of the gather).

Global memory, represented by gathers, is accessed by explicitly calling mεµ primitives that transfer data from several remote distributed blocks to the application local address space. Figure 10 presents a basic example to explain how an application is able to access data from the global address space. First, it is necessary to obtain a pointer to a local chunk of memory which will be used as a cache for the remote data. Since applications may not need to access the whole global memory and individual nodes cannot hold a copy of all remote blocks, it is required to specify a particular global memory fraction. Next, the global memory may be read through a get operation. By default, the whole segment specified in the meu_newref primitive is read; however, programmers may decide to read a sub-fraction at a time. By using the address of the local memory chunk, in practice, global data is handled like regular memory obtained through malloc. Finally, to update remote blocks according to local computations, the put primitive must be used.

Figure 9 Instantiation of blocks and gathers

Figure 10 Global memory usage

Note that an invocation of get or put may produce multiple read/write operations targeting multiple remote memory blocks, spread among cluster nodes. Low-level RDMA facilities are used to improve performance.

5 Representing physical resources

The manipulation of physical resources requires their adequate representation and organisation. Following the intrinsic hierarchical nature of multi-SAN SMP clusters, a tree is used to lay out physical resources. Figure 11 shows a resource hierarchy to represent the cluster of Figure 1(a).

5.1 Basic organisation

Each node of a resource tree – a domain – confines a particular assortment of hardware, characterised by a list of properties. Higher-level domains introduce general resources, such as a common interconnection facility, while leaf domains embody the most specific hardware the runtime system can handle.

Properties are useful to evidence the presence of qualities or to establish values that clarify or quantify facilities. For instance, in Figure 11, the properties Myrinet and Gigabit divide cluster resources into two classes, while the properties GFS = . . . and CPU = . . . establish different ways of accessing a global file system and quantify the resource processor, respectively.


Figure 11 Cluster resources hierarchy

Because every node inherits properties from its ascendant, it is possible to assign a particular property to all nodes of a subtree by attaching that property to the subtree root node. Node A1 will thus collect the properties GFS = /ethfs, FastEthernet, GFS = myrfs, Myrinet, CPU = 2 and Mem = 512.

By expressing the resources required by an application through a list of properties, the programmer instructs the runtime system to traverse the resource tree and discover a domain whose accumulated properties conform to the requirements. Respecting Figure 11, the domain Node A1 fulfils the requirements (Myrinet) ∧ (CPU = 2), since it inherits the property Myrinet from its ascendant.

If the resources required by an application are spread among the domains of a subtree, the discovery strategy returns the root of that subtree. To combine the properties of all nodes of a subtree at its root, we use a synthesisation mechanism. Hence, Quad Xeon Sub-Cluster fulfils the requirements (Myrinet) ∧ (Gigabit) ∧ (CPU = 4 ∗ m).
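The discovery strategy can be sketched as a tree walk over accumulated properties. The sketch below is an illustrative assumption: properties are flat key/value pairs combined by dictionary update, whereas real synthesis may compute values (e.g. summing CPU counts over a subtree) rather than merely collecting them.

```python
# Illustrative sketch of requirement-driven discovery: return the most
# specific domain whose accumulated properties (own + inherited +
# synthesised) satisfy a requirements predicate.

class Domain:
    def __init__(self, name, parent=None, **props):
        self.name, self.parent, self.props = name, parent, dict(props)
        self.children = []
        if parent:
            parent.children.append(self)

def accumulated(d):
    out = {}
    def down(n):                  # synthesis: descendants' properties
        for c in n.children:
            down(c)
            out.update(c.props)
    down(d)
    chain, n = [], d
    while n:                      # inheritance: ascendants' properties
        chain.append(n)
        n = n.parent
    for n in reversed(chain):     # root first, own properties last
        out.update(n.props)
    return out

def discover(root, satisfies):
    if not satisfies(accumulated(root)):
        return None
    for c in root.children:       # prefer the most specific match
        hit = discover(c, satisfies)
        if hit:
            return hit
    return root

cluster = Domain("Cluster")
myri = Domain("Dual PIII", cluster, Myrinet=True)
node = Domain("Node A1", myri, CPU=2)
req = lambda p: p.get("Myrinet") and p.get("CPU") == 2
```

With the Figure 11 fragment above, the requirement (Myrinet) ∧ (CPU = 2) is satisfied at the cluster root through synthesis, but the walk descends to return Node A1, the most specific matching domain.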

5.2 Virtual views

The inheritance and the synthesisation mechanisms are not adequate when all the required resources cannot be collected by a single domain. Still respecting Figure 11, no domain fulfils the requirements (Myrinet) ∧ (CPU = 2 ∗ n + 4 ∗ m). A new domain, symbolising a different view, should therefore be created without compromising current views.

In Figure 11, the domain Myrinet Sub-cluster (dashed shape) is an alias whose originals (connected by dashed arrows) are the domains Dual PIII and Quad Xeon. This alias will therefore inherit the properties of the domain Cluster and will also share the properties of its originals, that is, will collect the properties attached to its originals as well as the properties previously inherited or synthesised by those originals.

By combining original/alias and ascendant/descendant relations we are able to represent complex hardware platforms and to provide programmers with the mechanisms to dynamically create virtual views according to application requirements. Other well known resource specification approaches, such as the Resource and Service Description (RSD) environment (Brune et al., 1999), do not provide such flexibility.

5.3 Resource specification

To assist the administrator in specifying the cluster hardware, we provide a very simple language whose syntax is presented in Figure 12. It allows for the textual representation of a tree where all nodes are resource domains. Each domain specification is described by an optional tag, a name, a reference to the ancestor and the full set of properties. In the case of an alias domain, the specification also includes a list of references to its originals. The specification of a domain uses tags to reference other domains in the tree.

Figure 12 Specification language syntax

By using this language, the cluster configuration depicted in Figure 11 may be specified as shown in Figure 13. mεµ provides a tool to parse the textual specification and make relevant information available to applications through the directory service.
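Since the concrete syntax of Figure 12 is not reproduced in this text, the reader for such a specification can only be sketched against an invented surface form. The semicolon-separated line format below ("tag; name; ancestor-tag; properties") is entirely our assumption; only the structure it captures (optional tag, name, ancestor reference, property set) comes from the description above.

```python
# Hypothetical reader for a textual resource specification. The real
# syntax is that of Figure 12; this line format is invented for
# illustration only.

def parse_spec(text):
    domains = {}                             # tag -> domain record
    for line in text.strip().splitlines():
        tag, name, parent, props = [f.strip() for f in line.split(";")]
        domains[tag] = {
            "name": name,
            "parent": domains.get(parent),   # None for the root ("-")
            "properties": [p for p in props.split() if p],
        }
    return domains

spec = """\
d0; Cluster; -; GFS=/ethfs
d1; Dual PIII Sub-Cluster; d0; Myrinet
d2; Node A1; d1; CPU=2 Mem=512
"""
tree = parse_spec(spec)
```

Resolving the ancestor by tag while reading relies on ancestors being declared first; a real parser would also accept originals lists for alias domains.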

Figure 13 Physical resource specification

5.4 Creating virtual views

The exercise of selecting a particular domain comprising the physical resources required for executing an application may not be easy. The programmer specifies a list of properties and instructs the system to find the physical domain that matches these properties, but when no single domain assembles all required properties, the system must be able to create an alias.

As an example, consider that the programmer wants to circumscribe some cluster nodes (referring to Figure 11) according to the following specification: each node must have two CPUs, nodes must be interconnected by Gigabit and a total of four CPUs must be available. mεµ should return the alias domain included in the figure, but if the alias is not present the system must create it.

The code fragment presented in Figure 14 uses a few mεµ primitives to create the desired aggregator domain. First, two lists of properties are created: the first one to specify the properties each entity must own and the second one to specify the properties owned by the collection of entities represented by the aggregator. The meu_newaggreator primitive uses these two lists to retrieve from the directory the resources that all together match both conditions, starting at the root domain. On success the primitive creates an alias domain, whose accumulated properties imported from its originals fulfil the two property lists.
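The aggregator search can be modelled as follows. This is an illustrative sketch only: the real primitive described above takes two property lists, whereas this model takes two predicates (one per entity, one over the accumulated collection), and it aggregates only a CPU count.

```python
# Sketch of the aggregator search: pick, under a root domain, a set of
# domains that each satisfy a per-entity condition and whose combined
# properties satisfy an aggregate condition, then represent the set by a
# single alias domain. Names and predicate form are assumptions.

class Domain:
    def __init__(self, name, parent=None, originals=(), **props):
        self.name, self.parent = name, parent
        self.originals, self.props = list(originals), dict(props)
        self.children = []
        if parent:
            parent.children.append(self)

def walk(d):
    yield d
    for c in d.children:
        yield from walk(c)

def new_aggregator(root, each_ok, all_ok):
    picked, total = [], {"CPU": 0}
    for d in walk(root):
        if each_ok(d.props):
            picked.append(d)
            total["CPU"] += d.props.get("CPU", 0)
            if all_ok(total):          # stop once the aggregate holds
                return Domain("aggregate", parent=root, originals=picked)
    return None                        # no suitable collection found

cluster = Domain("Cluster", Gigabit=True)
n1 = Domain("Node A1", cluster, CPU=2)
n2 = Domain("Node A2", cluster, CPU=2)
agg = new_aggregator(cluster,
                     each_ok=lambda p: p.get("CPU") == 2,
                     all_ok=lambda t: t["CPU"] >= 4)
```

The returned alias collects the two dual-CPU nodes as its originals, mirroring the "two CPUs each, four CPUs in total" example of the previous paragraph.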

Figure 14 Creation of an aggregator domain

6 Application modelling

Following the same approach that we used to represent and organise physical resources, application modelling comprises the definition of a hierarchy of nodes. Like physical domains (used to represent physical resources), application entities (domains, operons, tasks, mailboxes, memory blocks and gathers) are registered in the directory service, and programmers may instruct the runtime system to discover a particular entity in the hierarchy of an application component. In fact, application entities may be seen as logical resources that are available to any application component.

6.1 A modelling example

Figure 15 shows a modelling example concerning a simplified version of SIRe, a scalable information retrieval environment. This example is just intended to explain our approach; specific work on web information retrieval may be found, e.g., in Brin and Page (1998) and Cho and Garcia-Molina (2002).

Each Robot operon represents a robot replica, executing on a single machine, which uses multiple concurrent tasks to perform each of the crawling stages. At each stage, the various tasks compete for work among them. Stages are synchronised through global data structures in the context of an operon. In short, each robot replica exploits an SMP workstation through the shared memory paradigm.

Within the domain Crawling, the various robots cooperate by partitioning URLs. After the parse stage, the spread stage will thus deliver to each Robot operon its URLs. Therefore Download tasks will concurrently fetch messages within each operon. Because no partitioning guarantees, by itself, a perfect balancing of the operons, Download tasks may send surplus URLs to the mailbox Pending. This mailbox may be accessed by any idle Download task. That way, the cooperation among robots is achieved by message passing.

The indexing system, represented by the domain Indexing, is intended to maintain a matrix connecting relevant words and URLs. The large amount of memory required to store such a matrix dictates the use of several distributed memory fragments. Therefore, multiple Indexer operons are created, each to hold a memory block. Each indexer manages a collection of URLs stored in consecutive matrix rows, in the local memory block, thus avoiding references to remote blocks.

Finally, the querying system uses the disperse memory blocks as a single large global address space to discover the URLs of a given word. Memory blocks are chained through the creation of aliases under a memory block gather which is responsible for redirecting memory references and for providing a basic mutual exclusion access mechanism. Accessing the matrix through the gather Word/URL will then result in transparent remote reads throughout a matrix column. The querying system thus exploits multiple nodes through the global memory paradigm.

6.2 Cooperative applications

Clusters may be used to support complex information systems that integrate multiple cooperative applications. In such a scenario, any programmer should be allowed to implement an application (a system module) without knowing all the details of the other applications. Distinct applications may also execute under the control of different users. In mεµ, all the applications that form the system are represented by a single hierarchy. In fact, all applications running in a cluster at a specific moment will be represented by a single hierarchy of resources, even if they do not belong to the same application system, that is, regardless of their cooperation level.

Distinct applications may cooperate if they are able to identify entry points to access software components running in the cluster. These entry points, in mεµ, are those entities that allow message passing and remote memory access. Considering that all logical entities used to model each application are registered in the directory service, and that it is easy to discover entities according to their known (public) properties, it is only necessary to announce, using mechanisms external to mεµ, the way these entry points may be exploited (what kind of messages the application component is prepared to receive, which information is stored in a specific memory region, etc.).

Figure 15 Modelling example of the SIRe system

The programming paradigms used to accomplish the cooperation between applications are therefore message passing and global memory. To establish a cooperation relationship with another application, based on message passing, an application should request the discovery of a specific domain, operon, task or mailbox owned by that application. In the same manner, based on external memory block and memory gather identifiers, an application will be able to read and write data from other applications.
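The property-based discovery this cooperation relies on can be sketched as a simple query against the directory service. The registry contents and property names below are assumptions for illustration, not the actual mεµ directory schema.

```python
# Sketch of property-based discovery: entities register their public
# properties in a directory, and an application finds entry points of
# other applications by matching on those properties.
registry = [
    {"kind": "mailbox", "name": "Pending",  "app": "SIRe/Crawling"},
    {"kind": "operon",  "name": "Robot",    "app": "SIRe/Crawling"},
    {"kind": "gather",  "name": "Word/URL", "app": "SIRe/Querying"},
]

def discover(**props):
    """Return entities whose public properties match all given ones."""
    return [e for e in registry
            if all(e.get(k) == v for k, v in props.items())]

# Another application locates the crawling system's mailbox entry point:
entry = discover(kind="mailbox", app="SIRe/Crawling")
```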

It is possible to realise more sophisticated forms of cooperation by creating aliases whose originals belong to external application components. The creation of entities under domains, operons and memory gathers of other applications allows an application to mix its hierarchy with others. For instance, if an application creates an alias task below an external domain (belonging to another application), it will receive a copy of any message addressed to that domain and cooperation will be feasible.

7 Mapping resources

The last step of our methodology consists of merging the two separate hierarchies produced in the previous stages to yield a single hierarchy.

7.1 Laying out logical resources

Figure 16 presents one possible integration of the application depicted in Figure 15 into the physical resources depicted in Figure 11.

Operons, mailboxes and memory block gathers must be instantiated under original domains of the physical resources hierarchy. Tasks and memory blocks are created inside operons and so have no relevant role in hardware appropriation. In Figure 16, the application domain Crawling is fundamental to establish the physical resources used by the crawling sub-system, since the operons Robot are automatically spread among the cluster nodes placed under the originals of that alias domain.

To preserve the application hierarchy conceived by the programmer, the runtime system may create aliases for those entities instantiated under original physical resource domains. Therefore, two distinct views are always present: the programmer's view and the system view.

The task Parse in Figure 16, for instance, can be reached by two distinct paths: Cluster → Dual Athlon → Node Ck → Robot → Parse – the system view – and Cluster → SIRe → Crawling → Robot(Alias) → Parse – the programmer's view. No alias is created for the task Parse because the two views had already been integrated by the alias domain Robot; aliases allow jumping between views.
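The two-path situation can be modelled with a tiny tree in which an alias stores a link to its original. This is a sketch under assumptions (the Node class and path syntax are invented for the example); the point is only that both views resolve to the very same resource.

```python
# Sketch of view integration through aliases: following an alias link
# during path resolution makes the system view and the programmer's
# view reach the same task.
class Node:
    def __init__(self, name, original=None):
        self.name, self.original, self.children = name, original, {}
    def add(self, child):
        self.children[child.name] = child
        return child

def resolve(root, path):
    """Walk a '/'-separated path, jumping from alias to original."""
    node = root
    for part in path.split("/")[1:]:
        node = node.children[part]
        if node.original is not None:   # alias: jump between views
            node = node.original
    return node

cluster = Node("Cluster")
robot = cluster.add(Node("NodeCk")).add(Node("Robot"))
parse = robot.add(Node("Parse"))
crawling = cluster.add(Node("SIRe")).add(Node("Crawling"))
crawling.add(Node("Robot", original=robot))   # alias domain Robot

system_view = resolve(cluster, "Cluster/NodeCk/Robot/Parse")
programmer_view = resolve(cluster, "Cluster/SIRe/Crawling/Robot/Parse")
```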

Figure 16 Mapping logical hierarchy into physical


Programmer's skills are obviously fundamental to obtain an optimal fine-grain mapping. However, if the programmer instantiates application entities below the physical hierarchy root, the runtime system will guarantee that the application executes, but efficiency may decay.

7.2 Dynamic creation of resources

Logical resources are created at application start-up, when the runtime system automatically creates an initial operon and a task, and afterwards whenever tasks execute primitives with that specific purpose. To create a logical resource it is necessary to specify the identifier of the desired ascendant and the identifiers of all originals, in addition to the resource name and properties. To obtain the identifiers required to specify the ascendant and the originals, applications have to discover the target resources based on their known properties.
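The creation request just described can be captured by one function signature. The function name, the identifier scheme and the in-memory registry are assumptions for illustration; they are not the actual mεµ primitives.

```python
# Sketch of logical-resource creation: every resource is specified by
# its ascendant's identifier, the identifiers of all its originals
# (empty for a non-alias), a name and a property set.
import itertools

_ids = itertools.count(1)
resources = {}

def new_resource(ascendant_id, original_ids, name, properties):
    rid = next(_ids)
    resources[rid] = {"ascendant": ascendant_id,
                      "originals": list(original_ids),
                      "name": name,
                      "props": dict(properties)}
    return rid

root = new_resource(None, [], "Cluster", {"kind": "domain"})
operon = new_resource(root, [], "Robot", {"kind": "operon"})
alias = new_resource(root, [operon], "Robot", {"kind": "alias"})
```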

When applications request the creation of operons, mailboxes or memory block gathers, the runtime system is responsible for discovering a domain that represents a cluster node. In fact, programmers may specify a higher-level domain confining multiple domains that represent cluster nodes. The runtime system will thus traverse the corresponding sub-tree in order to select an adequate domain.
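The placement step can be sketched as a sub-tree search. The tree shape, the "is_node" flag and the least-loaded selection criterion are assumptions for the example; the paper does not state which selection policy the runtime actually uses.

```python
# Sketch of node selection: given a (possibly high-level) domain,
# traverse its sub-tree and pick an adequate domain that represents a
# cluster node -- here, the least-loaded one.
def select_node(domain):
    """Depth-first search for the least-loaded cluster-node domain."""
    best = None
    stack = [domain]
    while stack:
        d = stack.pop()
        if d.get("is_node"):
            if best is None or d["load"] < best["load"]:
                best = d
        stack.extend(d.get("children", []))
    return best

crawling = {"name": "Crawling", "children": [
    {"name": "NodeC1", "is_node": True, "load": 3},
    {"name": "NodeC2", "is_node": True, "load": 1},
]}
target = select_node(crawling)
```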

The code fragment in Figure 17 shows how the mεµ primitives may be used to launch the Crawling application.

Figure 17 Crawling application launching

After discovering the location for a specific logical resource, the runtime system instantiates that resource and registers it in the local directory server. The creation and registration of logical resources is completely distributed and asynchronous.

7.3 Programmer’s views

In what follows, by taking as an example the evolution of a dynamic Crawling application, we demonstrate the mεµ features that support application scaling. The general idea is to show how an initial view of the physical system resources may later be derived into distinct programmer's views shaped by different application requirements.

Assuming the crawling application is launched via the command line at Node C1, an operon ProgName and a task Main (see Figure 18(a)) are automatically created. As the programmer prefers their own application view, a new domain Robots is created to hold an alias operon. This alias (Robot) will be used by the application instead of the original operon (ProgName).

The next step, illustrated in Figure 18(b), corresponds to the creation of several parse and download tasks, instantiated by the task Main, as descendants of Robot. Since Robot is an alias, the runtime will follow the link to reach operon ProgName, where the tasks will effectively be instantiated.

Figure 18 Application scaling

Finally, the configuration depicted in Figure 18(c) shows how the application may be scaled up, in order to respond to higher demands. A new cluster node is exploited by creating another operon (via the meu_newoperon primitive) and an alias is created under the domain Robots, in order to maintain the application view. That way, the new application entities, along with the initial entities, may be accessed through a single resource. Simplifying, we might say that the domain Robots increased its power from one to two operons.
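The scale-up step can be sketched as follows. The dictionaries and the scale_up helper are illustrative stand-ins (the real creation would go through primitives such as meu_newoperon); the point is only that the domain keeps presenting one resource while gaining operons.

```python
# Sketch of application scaling: each new operon created on a freshly
# exploited node gets an alias under the domain Robots, so the
# application still addresses a single resource.
robots = {"name": "Robots", "aliases": []}

def scale_up(domain, operon_name, node):
    operon = {"name": operon_name, "node": node}  # stand-in for meu_newoperon
    domain["aliases"].append(operon)              # alias under Robots
    return operon

scale_up(robots, "Robot", "NodeC1")   # initial replica
scale_up(robots, "Robot", "NodeC2")   # respond to higher demand
```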

It is important to note that, besides the system view, the programmer may define multiple views of the application entities. The SuperRobots domain and its descendant alias, for instance, are used to confine a particular operon of the whole crawling application.


8 Discussion

The integration of various communication technologies has motivated the development of some intermediate-level communication libraries. Generally, these libraries do not introduce relevant abstractions when compared to low-level communication libraries. Basically, they offer node-to-node communication or, in some cases, port-to-port communication, and programmers have the responsibility to map application entities into communication end-points. RoCl introduces a new communication paradigm, supported by a low-level directory service, which makes it easy to add communication facilities to any computational entity.

Traditionally, the execution of high performance applications is supported by powerful SSIs that transparently manage cluster resources to guarantee high availability and to hide the low-level architecture, e.g., Gallard et al. (2002). Our approach is to rely on a basic communication-level SSI (RoCl) used to implement simple high-level abstractions that allow programmers to directly manage physical resources. mεµ provides high-level mechanisms for structuring applications and for mapping their components into physical resources.

To exploit the various locality levels that can be identified in a cluster, some authors have proposed programming models that combine message passing and shared memory (combining MPI and OpenMP, for instance). Some innovative approaches have also been presented, as is the case of KeLP (Fink and Baden, 1997). mεµ, because of its mechanisms to map logical into physical resources, is another approach to the same problem. However, mεµ considers another level of parallelism which is not addressed by common approaches: sub-clusters.

Message passing is the most used parallel programming paradigm, and both MPI and PVM are key references. When programmers experiment with new platforms, they expect to be able to use the most important concepts and mechanisms present in these two systems. mεµ includes important features available on traditional platforms, but it extends these features to offer more flexibility in the design and execution of applications.

The distributed shared memory paradigm is very convenient for application programming but leads to poor performance because of the overheads of the mechanisms used to guarantee consistency. However, new approaches that eliminate these overheads and expose the programmer to the memory hierarchy details are becoming an effective alternative (Nieplocha et al., 1996). mεµ uses a similar approach, but global memory abstractions are entirely integrated with the remaining abstractions.

When compared to a multi-SAN SMP cluster, a metacomputing system is necessarily a much more complex system. Investigation of resource management architectures has already been done in the context of metacomputing, e.g., Czajkowski et al. (1998). However, our approach is innovative in that it extends the resource concept to include both physical and logical resources and integrates, in a single abstraction layer:

• the representation of physical resources

• the modelling of applications

• the mapping of application components into physical resources.

The message passing paradigm still remains the main choice for grid programming, as attested by current ports like MPICH-G2 (Karonis et al., 2003) and other platforms to run unmodified PVM/MPI applications (Royo et al., 2000). The underlying resource oriented computation model of mεµ provides the necessary abstractions to accommodate the migration from cluster to multi-cluster grid enabled computing.

More recently, RoCl has been configured for execution in grid-like multi-cluster environments, which we name virtual clusters, conforming to the deployment model used by NPACI Rocks (Sacerdoti et al., 2004), a robust and scalable toolkit used to create large scale clusters. The approach allows individual users of different administrative domains to select a certain number of nodes and a front-end, among all the machines in the multi-cluster, to build and set up a customised virtual cluster. To achieve this functionality we use OpenVPN and PVFS (Figueiredo et al., 2001). OpenVPN allows communication between nodes of different private networks. PVFS introduces the concepts of shadow accounts and NFS proxies to support global file access across virtual cluster nodes.

Programmers need the power of grids to build and deploy applications that break free of single resource limitations to solve large-scale problems. Although much work has been done in the context of Globus (Czajkowski et al., 2001) to exploit grid resources, our approach uses a straightforward model but provides many of the operations identified in Allen et al. (2003) as key functionality for developing grid applications.

Acknowledgement

We would like to thank FCT/MCT, Portugal, for its support under contract POSI/CHS/41739/2001.

References

Allen, G., Goodale, T., Russell, M., Seidel, E. and Shalf, J. (2003) Grid Computing: Making the Global Infrastructure a Reality, Classifying and Enabling Grid Applications, John Wiley & Sons, Ltd., West Sussex, England.

Alves, A., Pina, A., Exposto, J. and Rufino, J. (2003) 'RoCL: a resource oriented communication library', Euro-Par 2003, pp.969–979.

Aumage, O., Bougé, L., Denis, A., Méhaut, J-F., Mercier, G., Namyst, R. and Prylli, L. (2000) 'Madeleine II: a portable and efficient communication library for high-performance cluster computing', CLUSTER'00, pp.78–87.

Baden, S.B. and Fink, S.J. (2000) 'A programming methodology for dual-tier multicomputers', IEEE Transactions on Software Engineering, Vol. 26, No. 3, pp.212–226.

Brin, S. and Page, L. (1998) 'The anatomy of a large-scale hypertextual web search engine', Computer Networks and ISDN Systems, Vol. 30, Nos. 1–7, pp.107–117.

Brune, M., Reinefeld, A. and Varnholt, J. (1999) 'A resource description environment for distributed computing systems', Proc. 8th International Symposium on High Performance Distributed Computing, Redondo Beach, CA, pp.279–286.

Cho, J. and Garcia-Molina, H. (2002) 'Parallel crawlers', Proc. 11th International World-Wide Web Conference, Hawaii, pp.124–135.

Compaq, Intel and Microsoft Corporations (1997) Virtual Interface Architecture Specification, Version 1.0, Technical Report, available at http://download.intel.com/design/servers/vi/VI_Arch_Specification10.pdf

Czajkowski, K., Fitzgerald, S., Foster, I. and Kesselman, C. (2001) 'Grid information services for distributed resource sharing', HPDC'01, pp.181–194.

Czajkowski, K., Foster, I., Karonis, N., Kesselman, C., Martin, S., Smith, W. and Tuecke, S. (1998) 'A resource management architecture for metacomputing systems', IPPS/SPDP'98, pp.62–82.

Figueiredo, R., Kapadia, N. and Fortes, J. (2001) 'The PUNCH virtual file system: seamless access to decentralised storage services in a computational grid', HPDC'01, San Francisco, CA, pp.334–344.

Fink, S.J. and Baden, S.B. (1997) 'Runtime support for multi-tier programming of block-structured applications on SMP clusters', International Scientific Computing in Object-Oriented Parallel Environments Conference (ISCOPE'97), Marina del Rey, pp.1–8.

Gallard, P., Morin, C. and Lottiaux, R. (2002) 'Dynamic resource management in a cluster for high-availability', Euro-Par 2002, pp.589–592.

Geist, A., Beguelin, A., Dongarra, J., Jiang, W., Manchek, R. and Sunderam, V. (1994) PVM: Parallel Virtual Machine. A Users' Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge, MA, USA.

Gursoy, A. and Cengiz, I. (1999) 'Mechanism for programming SMP clusters', PDPTA'99, Vol. IV, pp.1723–1729.

Karonis, N., Toonen, B. and Foster, I. (2003) 'MPICH-G2: a grid-enabled implementation of the message passing interface', Parallel and Distributed Computing, Vol. 63, No. 5, pp.551–563.

Myricom Inc. (2000) The GM Message Passing System, Reference Manual, available at http://www.myri.com/scs/GM/doc/gm.pdf

Nieplocha, J., Harrison, R. and Foster, I. (1996) Advances in High Performance Computing, Chapter Explicit Management of Memory Hierarchy, Kluwer Academic Publishers, Norwell, MA, USA.

Pina, A., Oliveira, V., Moreira, C. and Alves, A. (2002) 'pCoR: a prototype for resource oriented computing', HPC 2002, pp.251–262.

Royo, D., Kapadia, N. and Fortes, J. (2000) 'Running PVM applications in the PUNCH wide area network-computing environment', Proc. 4th International Meeting VECPAR 2000, Oporto, pp.90–95.

Sacerdoti, F.D., Chandra, S. and Bhatia, K. (2004) 'Grid systems deployment and management using Rocks', Proc. International Conference on Cluster Computing, CLUSTER'04, San Diego, CA, pp.337–345.

Snir, M., Otto, S., Huss-Lederman, S., Walker, D. and Dongarra, J. (1998) MPI – The Complete Reference, MIT Press, Cambridge, MA, USA.

Notes

1 The term ascendancy chain is used to represent the node chain from a node to the root.

2 n and m stand for the number of nodes of sub-clusters Dual PIII and Quad Xeon.