Testing of a Novel Distributed Shared Memory
Framework
Laurence Markey
A dissertation submitted to the University of Dublin,
in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science
1999
Declaration
I declare that the work described in this dissertation is, except where otherwise stated, entirely my own work and has not been submitted as an exercise for a degree at this or any other university.
Signed: _________________________
Laurence Markey
17 September 1999
Permission to lend and/or copy
I agree that Trinity College Library may lend or copy this dissertation upon request.
Signed: _________________________
Laurence Markey
17 September 1999
Acknowledgments
I would like to thank my supervisor, Brendan Tangney, for his
guidance, encouragement and advice throughout the work on this
dissertation.
Also Stefan Weber, who developed the DSD framework that was
the focus of this dissertation, and never lost his patience or good
humour during many, many consultations.
And finally Brian Coghlan for allowing me to use the CAG Cluster
for testing the programs implemented using the framework.
Summary
A novel object oriented framework for software Distributed Shared Memory (DSM) has been developed by the Distributed Systems Group in Trinity College, Dublin for programming parallel applications on a group of loosely coupled workstations. It offers the developer the ability to apply different rules for how accesses to shared data are seen by each process (consistency models), which can be implemented by different means of managing shared data (coherence protocols). Different combinations of models and protocols can be applied to individual objects within an application; in addition, the protocols that the framework uses can be customized by the developer.
By providing this range of facilities it is hoped that developers will be able to exploit application-specific semantics to achieve improvements in performance and speed-up characteristics while maintaining relative ease of programming.
The objective of this project is to test to what extent this DSM framework delivers the flexibility, customizability and programmability in design of parallel applications that it promises, and whether it can offer appropriate speed-up characteristics for different classes of problem.
Two applications with significantly different data sharing characteristics were designed and implemented as sample cases to test the framework: the Travelling Sales Person (TSP) problem, which has a low communication-to-computation ratio and for which workers can work independently of each other; and the LU Decomposition problem, which has a very high communication-to-computation ratio and an algorithm where workers are highly interdependent.
The thesis will show that the framework did achieve appropriate speed-up characteristics. It also allows significant flexibility and programmability: protocols can be easily set and changed for individual objects; the protocols themselves were customized to improve the performance of each of the applications.
Table of Contents
1 INTRODUCTION
2 BACKGROUND TO DISTRIBUTED SHARED MEMORY
2.1 INTRODUCTION
2.2 WHAT IS DSM
2.3 PARALLEL PROGRAMMING
2.3.1 MultiProcessing
2.3.2 Message Passing
2.3.3 DSM
2.4 DIFFERENT DSM SYSTEMS
2.4.1 Consistency models
2.4.2 Coherence Protocols
2.5 SUMMARY
3 FRAMEWORKS
3.1 INTRODUCTION
3.2 WHAT FRAMEWORKS ARE
3.2.1 Other types of reuse
3.2.2 Features of frameworks
3.2.3 Problems with using Frameworks
3.3 FRAMEWORKS APPLIED TO DSM
3.3.1 An existing framework
3.3.2 The Distributed Shared Data (DSD) Framework Developed in Trinity College
3.3.3 Structure of the Framework
3.4 HOW TO PARALLELISE A PROGRAM ON THE FRAMEWORK
3.5 SUMMARY
4 TESTING THE FRAMEWORK
4.1 INTRODUCTION
4.2 RELATIVE SPEED-UP
4.3 THE THESIS
4.4 THE TRAVELLING SALES-PERSON PROBLEM
4.4.1 The Algorithm
4.4.2 Parallelisation of the TSP
4.4.3 Predicted effect of the protocols
4.4.4 Results of speed-up tests
4.4.5 Analysis of Results
4.4.6 Explanation of Super Linear Speedup
4.4.7 Analysis of the use of the framework
4.4.8 Conclusions for TSP
4.5 LU DECOMPOSITION
4.5.1 The Algorithm
4.5.2 Parallelisation of LU Decomposition
4.5.3 Predicted Effect of the Protocols
4.5.4 Results of speed-up tests
4.5.5 Analysis of Results
4.5.6 Analysis of the use of the framework
4.5.7 Conclusions for LU Decomposition
4.6 SUMMARY
5 REVIEW OF THE DSD FRAMEWORK
5.1 INTRODUCTION
5.2 IS DSD A GENUINE FRAMEWORK?
5.3 WHAT KIND OF A FRAMEWORK IS IT?
5.4 HOW EFFECTIVE IS IT AT SUPPORTING DSM?
5.5 PROBLEMS
5.6 CONCLUSION
6 APPENDICES
6.1 TRAVELLING SALES PERSON 17 CITIES
6.2 TRAVELLING SALES PERSON 17 CITIES - ADDITIONAL DATA
6.3 TRAVELLING SALES PERSON 16 CITIES
6.4 LU DECOMPOSITION
7 BIBLIOGRAPHY
List of Tables and Illustrations
ILLUSTRATIONS
Fig. 3.1 Structure of the Framework
Fig. 3.2 Release Consistency
Fig. 3.3 Lazy Release Consistency
Fig. 3.4 Issuing of Work Items to Workers
Fig. 3.5 Structure of Applications that Can be Run in Parallel
Fig. 3.6 Changes Required to Run an Application on the Framework
Fig. 4.1 Comparison of Expected Relative Speed-up with Linear
Relative Speed-up
Fig. 4.2 Distances Between Cities in Asymmetric TSP Problem
Fig. 4.3 Outline of TSP Algorithm – Branch and Bound Tree Pruning
Fig. 4.4 Illustration of Branch and Bound Algorithm
Fig. 4.5 Calculation of Route Length
Fig. 4.6 Classes Changed in Order to Parallelise TSP
Fig. 4.7 Replication Protocols
Fig. 4.8 HomeBased Protocols
Fig. 4.9 Relative Speed-up Graph for 17 City TSP
Fig. 4.10 % Standard Deviation for 17 City TSP
Fig. 4.11 Relative Speedup Graph for 17 City TSP Additional Data
Fig. 4.12 Speedup Graph for 16 City TSP
Fig. 4.13 Simple Algorithm for LU Decomposition
Fig. 4.14 Better Algorithm for LU Decomposition
Fig. 4.15 Relative Speedup Graph for LU Decomposition
TABLES:
Table 1 17 City TSP Summary Table
Table 2 17 City TSP – Additional Data Summary Table
Table 3 600 x 600 LU Decomposition Summary Table
1 Introduction
Parallel programming has been developed in the attempt to improve the speed at which
programs can be executed, particularly those that demand a great deal of processing power.
It does so by having a number of processors or workstations working in co-operation on a
problem. It is contrasted with sequential programming, where only one processor is used
and all operations take place in a linear sequence. One of the measures of the performance
of a parallel programming system is its speed-up characteristic, i.e. how much faster a
program is processed as more processors are recruited to work on it.
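As a concrete illustration (my own, not taken from the dissertation's test data), relative speed-up can be computed as the single-processor execution time divided by the execution time on p processors; a result equal to p is linear speed-up.

```java
// Sketch of the relative speed-up metric described above.
// speedup = T(1) / T(p); a result equal to p is linear speed-up.
public class RelativeSpeedup {
    public static double speedup(double timeOnOneProcessor, double timeOnPProcessors) {
        return timeOnOneProcessor / timeOnPProcessors;
    }

    public static void main(String[] args) {
        // Hypothetical timings: 100 s on one workstation, 25 s on four.
        System.out.println(speedup(100.0, 25.0)); // 4.0, i.e. linear for p = 4
    }
}
```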
Distributed Shared Memory (DSM) is a technique that allows parallel programming of
applications on a network of computers. It relies on the “shared memory” abstraction
which enables computers that do not share memory to have shared access to data by a
means which appears similar to how they access data in their own memory. The shared
memory consists of items of shared data which are replicated at each workstation that needs
to use them, thereby eliminating the need to read these data across a relatively slow (in
terms of processor speeds) network.
However having multiple copies of items of data introduces the issue of how to maintain all
of these copies in a consistent state. This can be separated into two elements: the Coherence
Protocol, which determines what mechanisms are used to maintain the copies in a consistent
state, and the Consistency Model, which defines what constitutes a consistent state by
constraining the event-orderings that are allowable.
Different coherence protocols will use different means of passing updates between
workstations. The optimal coherence protocol will differ from application to application,
being determined by the different communications requirements of the applications.
Maintaining strict consistency where all reads at all nodes return the most recently written
value introduces extra delays because no processor can read a data item when any other
processor is updating a copy of that data item. These delays can be reduced by allowing a
more “relaxed” consistency model. Many such models have been proposed (e.g. Goodman,
1989; Dubois et al., 1988; Gharachorloo et al., 1990; Keleher et al., 1996a); the more
relaxed they are, the better the performance they offer. However application semantics often
demand that some of these are not acceptable for any data item in the application, or are not
acceptable for certain key data items.
In order to optimise the performance of parallel programs it would be desirable to separate
the means of applying coherence protocols and consistency models, so that an appropriate
combination could be selected for each application from a range of differing consistency
models and coherence protocols. In addition, further optimisation would be promised by a
system that could apply consistency on a per-object basis.
This is what has been attempted in a framework that has been developed in Trinity College
Dublin. The framework (Weber et al., 1998) consists of a set of Java classes that can be
selected and extended to provide combinations of consistency models and coherence
protocols on a per-object basis for parallel programming. It allows the flexibility of easily
choosing and changing the combination of coherence protocol and consistency model
applied to any data item. In addition it allows existing protocols to be
customised, or new ones defined in order to achieve optimal performance for specific
applications. It does this in an environment that is relatively easy to program.
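The class and method names below are invented for illustration; the dissertation does not reproduce the DSD API at this point. The sketch conveys only the idea the paragraph describes: each shared object carries its own, independently swappable, consistency model and coherence protocol.

```java
// Hypothetical sketch only: real DSD class and method names may differ.
// Each shared object is paired with its own consistency model and
// coherence protocol, so policies can be chosen per object.
interface ConsistencyModel { String name(); }
interface CoherenceProtocol { String name(); }

class SequentialConsistency implements ConsistencyModel {
    public String name() { return "sequential"; }
}
class WriteInvalidate implements CoherenceProtocol {
    public String name() { return "write-invalidate"; }
}

// A shared object carries its own model/protocol pair, which can be swapped.
class SharedObject<T> {
    T value;
    ConsistencyModel model;
    CoherenceProtocol protocol;
    SharedObject(T value, ConsistencyModel m, CoherenceProtocol p) {
        this.value = value; this.model = m; this.protocol = p;
    }
}

public class PerObjectPolicies {
    public static void main(String[] args) {
        // A matrix and a counter could each get a different combination.
        SharedObject<int[]> matrix = new SharedObject<>(
            new int[600 * 600], new SequentialConsistency(), new WriteInvalidate());
        System.out.println(matrix.model.name() + " / " + matrix.protocol.name());
    }
}
```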
Objective
The objective of this dissertation was to test how well the framework delivers on the
benefits that it promises. This was done by using the framework to implement two parallel
applications with different sharing semantics and then testing the performance of these
applications over the network in order to determine what type of speed-up was offered. In
the process of developing and testing these applications it was possible to analyse how well
the framework delivered on the flexibility, programmability and customisability that it
claims to offer.
Results
It was found that developing parallel applications for the framework was relatively easy,
particularly if the problem space could be partitioned into a list of work items before the
commencement of processing. There was a significant range of coherence protocols and
consistency models provided in advance. It was possible to customise these, or develop
new ones by creating new classes that extend the base CoherenceProtocol or
ConsistencyModel classes. The steps required to assign and change consistency models
and coherence protocols for individual variables were clear and easily programmable. The
speed-up characteristics obtained were very different for the two applications reflecting the
different sharing semantics and communication/computation ratios of the applications.
Selection of appropriate ConsistencyModels and CoherenceProtocols had a significant
effect on the levels of speed-up obtained.
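As a hedged sketch of the extension mechanism just described: the framework's base classes are named CoherenceProtocol and ConsistencyModel, but their internals are not shown in this chapter, so the abstract class and hook method below are invented to convey how a developer might customise a protocol by subclassing.

```java
// Illustrative sketch only: the DSD framework's real base classes are not
// reproduced here, so this abstract class and its hook are invented.
abstract class CoherenceProtocolBase {
    // Hook called when a local write must be made visible to other nodes.
    abstract String onWrite(String objectId);
}

// A customised protocol overrides only the hooks it wants to change,
// e.g. deferring propagation so updates can be batched.
class BatchingProtocol extends CoherenceProtocolBase {
    @Override
    String onWrite(String objectId) {
        return "queued update for " + objectId;
    }
}

public class CustomProtocolDemo {
    public static void main(String[] args) {
        CoherenceProtocolBase p = new BatchingProtocol();
        System.out.println(p.onWrite("routeBound"));
    }
}
```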
We therefore conclude that the framework delivers the capabilities that it promises.
2 Background to Distributed Shared Memory
2.1 Introduction
The purpose of this chapter is to introduce some of the concepts and approaches that form
the background against which this investigation was carried out. It introduces Distributed
Shared Memory (DSM), explains parallel programming and surveys some of the
approaches other than DSM that have been used for parallel programming. It then looks at
some of the different approaches that have been proposed based on DSM, the problems that
they encountered and how some of these can be overcome. It lays out a conceptual
framework that can be used for classifying and understanding DSM systems and for
optimising the performance of parallel programs run over DSM.
2.2 What is DSM
Distributed Shared Memory (DSM) is a technique that allows parallel programming of
applications on a network of computers. It relies on the “shared memory” abstraction
which enables computers that do not share memory to have shared access to data by a
means which appears similar to how they access data in their own memory.
The objective of this dissertation is to evaluate a novel framework that has been developed
for developing programs to run over DSM. In order to clarify exactly what benefits are
promised by this new framework-based approach to DSM it is worthwhile to survey the
different approaches that have been applied to parallel programming in the past.
2.3 Parallel Programming
Parallel programming is the name for the attempt to improve the speed at which a given
problem is processed by having a number of processors working in co-operation on the
problem. It is contrasted with sequential programming where only one processor is used
and all operations take place in a linear sequence.
Developing parallel programs involves many concerns that are not encountered while
developing sequential programs. The greatest of these are process communication and
synchronisation. Since a parallel program uses multiple sequential processes executing and
interacting in parallel to solve a problem, developers have to ensure that a consistent view
of the problem data is maintained across all these processes; this is called synchronisation.
However in the interests of efficiency this must be achieved without requiring excessive
communication between processes.
2.3.1 MultiProcessing
The initial approach used to implement parallel programming was multi-processing. It
required the construction of a computer with a number of CPUs that share the same bus and
that use that bus to access a shared memory. By having a single bus and a single shared
memory that contains only one copy of each data item and the semaphore that controls
access to that data, it provides a system whose structure implements data consistency and
synchronisation. Consistency is the requirement that all of the processors see a consistent
view of the data. This is ensured by the fact that there is only one set of data which all
processes have access to, thus no two processes can have a different view of the data.
Synchronisation is achieved because the system implements semaphores: it ensures that
when any one process is updating a data item, no other process has access to it.
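The semaphore discipline described above can be illustrated on a single machine with Java threads standing in for processors (an analogy only; the dissertation is describing hardware multiprocessors). A binary semaphore guarantees that while one thread updates the single shared copy, all others are excluded, so no updates are lost:

```java
import java.util.concurrent.Semaphore;

// Single-machine analogy for semaphore-controlled access to shared data:
// one copy of the data, and a binary semaphore so only one writer at a time.
public class SemaphoreDemo {
    public static int shared = 0;
    static final Semaphore mutex = new Semaphore(1);

    public static void increment(int times) throws InterruptedException {
        for (int i = 0; i < times; i++) {
            mutex.acquire();  // exclude all other writers
            shared++;         // update the single shared copy
            mutex.release();  // let the next writer in
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable worker = () -> {
            try { increment(10000); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        };
        Thread a = new Thread(worker);
        Thread b = new Thread(worker);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(shared); // 20000: no lost updates
    }
}
```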
However these bus-based multiprocessors suffer from a lack of scalability: they cannot be
used with more than a few dozen processors because the bus (which only one CPU can use
at a time) tends to become a bottleneck. This is because when any one processor accesses a
data item in the shared memory they prevent all other processors from accessing any data
item because only one processor can access the bus at any time. In addition, in its simplest
form, this architecture makes no distinctions between non-shared data and shared data and
between read-only data and read-write data. In each of these cases non-shared data and
local copies of read-only data could be stored in local memory if that were available
without compromising the consistency of the overall data set. This would allow
performance improvement because shared-bus access would not be required for access to
these data types and so frequency of demand for the shared-bus bottleneck would be
reduced.
Switched multiprocessors, such as DASH (Lenoski et al., 1992) can be made to scale, but
they are relatively expensive, complex and difficult to build and maintain. A switched
multiprocessor tries to overcome the limitation on the number of processors that can efficiently
share use of a single bus by having several clusters of processors, each cluster with its own
bus and the clusters connected by slower (relative to buses) intercluster links. When a
processor wants to access a data item located on the local memory it accesses just as it
would in the multiprocessor situation described above. If however the item’s address
indicates that it is located on the memory of one of the other clusters then the data item is
fetched across the intercluster link. This topology assumes that most accesses by
processors will be to their local memory and so the slow intercluster links will be relatively
infrequently used, thus having little effect on performance and allowing greater scale.
However, as mentioned already, these systems are expensive and complex to build and
maintain. Given that many locations had networks of workstations it became more
attractive to see if the existing processing power on these networks could be used to
perform parallel processing, rather than purchasing expensive hardware that performed a
specialised function. This approach of attempting to use a network of standalone computers
for parallel processing is called multicomputing.
A multicomputer consists of a number of separate standalone computers, each with their
own CPU and their own memory, which communicate via a network. This makes larger
scale feasible: because each machine accesses its own memory there is no bottleneck for
access to memory, they only use the network to propagate updates to data shared with other
machines. Large multi-computers are easier to build and cheaper because they use standard
workstation and network hardware. However they create a more difficult programming
environment, because the programmer now has to deal with issues of consistency and
synchronisation. In addition they have to cope with extra time required to communicate
across a network. There have been two main approaches to multi-computing: Message
passing and Distributed shared memory.
2.3.2 Message Passing
Parallel programming requires that the processes running the application must share data.
Message passing achieves this by passing between processes messages containing the data
to be shared (Coulouris et al., 1994). Because the actual processing can in principle be
performed on any type of computer, provided messages are passed between processes in
the agreed format, message passing can implement parallel programming over a
heterogeneous network.
Message passing requires extra work on the part of the programmer as compared to
multiprocessing. Variables to be passed between processes must be marshalled into
messages, messages transmitted and variables unmarshalled at the receiving side. If a
heterogeneous system is to be used then different marshalling will be required for each
different kind of computer being used on the system. In addition there is a requirement to
monitor that messages have been received by their intended recipient, and to manage issues
such as flow control, buffering and blocking.
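As one concrete way to picture the marshalling burden (standard Java serialization is used here purely for illustration; message-passing systems of the period defined their own wire formats), the sender flattens shared variables into a byte message and the receiver reconstructs them:

```java
import java.io.*;

// Marshalling/unmarshalling sketch using standard Java serialization:
// the sender flattens shared variables into a message (byte array),
// the receiver reconstructs them on the other side.
public class MarshallingDemo {
    public static byte[] marshal(Serializable data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(data);  // variables packed into the message
        }
        return bytes.toByteArray();
    }

    public static Object unmarshal(byte[] message)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(message))) {
            return in.readObject(); // variables reconstructed at the receiver
        }
    }

    public static void main(String[] args) throws Exception {
        int[] sharedRow = {3, 1, 4, 1, 5};
        byte[] message = marshal(sharedRow);         // sender side
        int[] received = (int[]) unmarshal(message); // receiver side
        System.out.println(java.util.Arrays.toString(received));
    }
}
```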
Because each process maintains its own copy of the data, message passing must implement
synchronisation to ensure that all processes are using a consistent set of data. However it
cannot do this using normal synchronisation constructs such as locks; instead
synchronisation must be implemented using special primitives. Message passing does not
offer the same flexibility in the selection and changing of policies as will be offered by
some of the other systems we will discuss.
2.3.3 DSM
Distributed Shared Memory (DSM) is a technique which allows parallel programming of
applications on a network of computers. It relies on the “shared memory” abstraction
which enables computers that do not share memory to have shared access to data by a
means which appears similar to how they access data in their own memory.
Each of the computers is a standalone computer, but using DSM they can all work in
parallel on a single application. Each computer works on the problem as it would work on
a standalone application, because each accesses shared data in a way similar to how it
accesses its own local data. It is up to the DSM system to ensure that each workstation
sees a view of the shared data that is consistent with that seen by all the other workstations.
Shared Memory
Unlike message passing DSM does not require the programmer to manage communication
between processes by passing data messages, instead it provides processes with a shared
address space. A distributed application consisting of processes running on several nodes
can access the data in the shared address space using operations like "read" and "write"
which, combined with synchronisation primitives, are applied to a specific address. Thus
each process uses the shared address space in a similar way to how they use their local
memory space with the addition of synchronisation management.
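A toy, single-machine stand-in for this shared address space might look as follows; there is no real distribution here, and the names are invented, but it shows the read/write interface the paragraph describes:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the shared address space abstraction: processes issue
// read/write against addresses and the "DSM layer" (here a single map, with
// no distribution at all) resolves them. All names are invented.
class ToySharedMemory {
    private final Map<Integer, Object> space = new HashMap<>();

    public synchronized void write(int address, Object value) {
        space.put(address, value); // a real DSM would also propagate the update
    }

    public synchronized Object read(int address) {
        return space.get(address); // a real DSM may fetch from a remote node
    }
}

public class SharedSpaceDemo {
    public static void main(String[] args) {
        ToySharedMemory dsm = new ToySharedMemory();
        dsm.write(0x10, "bound=2085");      // one process publishes a value
        System.out.println(dsm.read(0x10)); // another process reads it back
    }
}
```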
The shared address space consists of local memory on each machine in the system. When a
process accesses data in the shared address space, a mapping manager maps the shared
memory address to the physical memory of one of the nodes on the system. The process
can then read that data and make a local copy. Each computer can maintain local copies of
recently accessed data items. Changes made by one process to items in the shared address
space are propagated by the DSM system to the local memory of all other machines with
copies of that data. The DSM system ensures a consistent overall set of data; the
programmer need not worry about this.
Thus it appears to distinct processes on different nodes that they have concurrent access to a
shared set of data. And this is achieved while allowing most accesses to data to be to local
memory. This enables DSM to improve performance by serving accesses to shared
variables from local memory; the DSM system only needs to propagate data when a
change is made to a shared variable. In addition, communication between processes and
across node boundaries is transparent: it is handled by the DSM and no marshalling is
required.
However enforcement of consistency does not come without cost. If we are to ensure that
all processes sharing some data see changes to that shared data in the same order then we
will have to implement distributed locks. This will require several stages of communication
to establish locks on all copies of data, and update all copies of data; only when this has
been committed can any processes have access to this data. This introduces significant
delays, particularly when communicating across a network. However, as we shall see it is
possible to significantly increase performance if we relax the level of consistency being
enforced.
2.4 Different DSM Systems
Having outlined the concept of a virtual shared memory upon which DSM relies, we shall
now look at different DSM systems that have been developed and the different approaches
they have used to implementing the virtual shared memory. They can be broadly grouped
into two categories: page-based DSM and object-based DSM.
Page-Based DSM
The initial attempts at DSM were attempts to simulate the operation of multi-processors
using a distributed system. This was partly because one of the goals of these original DSM
systems was to be able to run existing multi-processor programs on larger DSM systems
with the minimum of modification. Multi-processors use a memory shared by all
processors in a cluster located on one bus, with one copy of each data item stored in the
memory (Tanenbaum, 1995). Each processor can access each address in the memory
directly. As we have seen one way of coping with the limitations on access to the shared
bus was by passing data along inter-cluster links.
The network hardware that was available for these early DSM systems was optimised for
passing pages across the network. So the early attempts at DSM married these two
approaches, they divided the address space into pages, each page being located in the local
memory of one machine in the system. When a processor tries to access an address in a
page that is not local the DSM software identifies the page, locates it and fetches it from the
machine it is currently stored on.
Thus the virtual memory consists of pages, stored in local memory of a machine on the
network, that migrate across the network as different processors need to access their
contents. This migration is required for read accesses as well as write accesses, because the
only way data can migrate across the network is if the page it is located on is transferred to
another machine.
Because it takes longer to access a page that is located on another machine, performance
can be improved by locating particular pages on the machines that use them most
frequently. However this approach is not adequate if there is more than one processor that
needs frequent access to a given page, because it can lead to a problem called thrashing or
alternatively the ping-pong effect.
Thrashing occurs when there are several processes that have frequent needs for access to
the same page; this will mean that ownership will be difficult to determine and the page will
be transferred back and forth across the network at a very high frequency. These processes
will spend much of their time waiting for the page to be returned to them and so their
performance will suffer.
The solution to thrashing is to allow several copies of the page to exist simultaneously on
a number of machines and to institute some means of ensuring that they all display a
consistent view of the data that they contain. This is the approach that was adopted in the
IVY system.
IVY (Integrated Shared Virtual Memory at Yale) was implemented on a network of Apollo workstations (Li and Hudak, 1989). The memory was divided into
pages, 1 Kbyte in size, which were transferred across the network. There could be more than one process on each workstation. Each process saw the address space as divided into
two areas:
1. A shared virtual memory address space which can be accessed by any process
2. A private space which can only be accessed by processes on the same workstation
Each workstation has a mapping manager that controls the mapping between the shared
virtual memory space and the local memory of the workstation. When a process raises a
page fault a check is made to see if the page is located locally at the workstation. If the
page is not local a remote memory request is made and the page is acquired from another
workstation.
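The page-fault path just described can be sketched in a few lines. This is a simplified illustration, not IVY's actual code; the names `MappingManager` and `Network` and the single-copy migration policy are assumptions for the sketch.

```python
# Sketch of IVY-style page-fault handling: a workstation's mapping manager
# checks whether the faulting page is local and, if not, issues a remote
# memory request that migrates the page from whichever node holds it.

PAGE_SIZE = 1024  # IVY used 1 Kbyte pages

class Network:
    """Stand-in for the remote-request mechanism between workstations."""
    def __init__(self):
        self.managers = []

    def fetch(self, page_no, requester):
        for m in self.managers:
            if m is not requester and page_no in m.local_pages:
                return m.local_pages.pop(page_no)  # the page migrates
        raise KeyError(f"page {page_no} not found anywhere")

class MappingManager:
    def __init__(self, network):
        self.local_pages = {}   # page number -> page contents
        self.network = network
        network.managers.append(self)

    def read(self, address):
        page_no, offset = divmod(address, PAGE_SIZE)
        if page_no not in self.local_pages:            # page fault
            self.local_pages[page_no] = self.network.fetch(page_no, self)
        return self.local_pages[page_no][offset]
```

Note that a fault on one workstation removes the page from its previous holder: this single-copy migration is exactly the behaviour that leads to thrashing when two nodes compete for one page.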
IVY follows multiple-readers/single-writer semantics. There can be multiple copies of a
page, each of which can be read separately, thus eliminating thrashing when several
processes want to read from the same page. All readers always see the latest value written,
which means that IVY enforces “strict” consistency (see below). This consistency model is
implemented using the “write-invalidation” protocol. This means that when a writer wants
to write to a page which is copied elsewhere, then all other copies of that page are
invalidated before the update is made. If a reader attempts to read from an invalidated page
the system will obtain a copy of the updated version of the page. In order to enforce strict
consistency where all readers see the latest update, it is necessary to complete the
invalidation of copies of an updated page before another update can commence. Therefore
there can only be one writer to a page at any time.
Write invalidation saves communication because copies of pages are only updated when a process attempts to read from them. Communication is reduced by updating only those copies that are subsequently used, and by updating them only when they are about to be read, which may be after several writes have been made.
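The write-invalidation behaviour can be sketched for a single page as follows. The class and method names are illustrative, not IVY's; the sketch tracks one page replicated at three nodes.

```python
class InvalidatedPage:
    """Sketch of write-invalidation: a write invalidates every other copy
    first; a later read of an invalid copy fetches the owner's version."""

    def __init__(self, nodes, initial=0):
        self.copies = {n: initial for n in nodes}   # node -> local value
        self.valid = {n: True for n in nodes}
        self.owner = nodes[0]

    def write(self, node, value):
        for n in self.valid:          # invalidate all other copies first
            self.valid[n] = (n == node)
        self.copies[node] = value
        self.owner = node             # only one writer at a time

    def read(self, node):
        if not self.valid[node]:      # fetch the updated copy on demand
            self.copies[node] = self.copies[self.owner]
            self.valid[node] = True
        return self.copies[node]
```

Repeated writes by the owner cost nothing at the other nodes; a copy is refreshed only when it is next read, which is the communication saving described above.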
IVY resolved the thrashing problem for reads, but not for writes. Each time a write is
performed all other copies of the page are invalidated, so subsequent readers or writers will
have to receive a copy of that page over the network.
Mirage uses a different approach to reducing thrashing (Fleisch & Popek, 1989). It adds a
parameter to the sharing protocol that sets the minimum time for which a page will be
available at a node. By ensuring that a page remains at that location for at least a certain
time it allows one process to complete a sequence of updates to a page before the page is
passed across the network. This parameter can be tuned dynamically to ensure that the
page does not stay at a node longer than is needed. Mirage also provides a “yield” primitive which a process calls when it has completed a sequence of updates, ensuring that the page does not stay fixed for the full period if there are no further accesses to be made.
However even if these measures do allow page-based systems to reduce thrashing, there is
another problem which they have more difficulty in coping with: false sharing.
Granularity and false sharing
Granularity describes the size of the minimum unit of shared memory. In page-based
systems it is the page size. The underlying protocols and hardware that are used to propagate updates influence the choice of granularity: from their point of view, efficiency is maximised by making the granularity a multiple of the transfer unit they use. Page-based systems are designed to optimise the passing of pages around the network, so the page size will be determined by the underlying hardware. This means that there is limited room for manoeuvre in the selection of granularity in page-based systems.
A larger page-size will take advantage of the locality of reference exhibited by processes.
A process that accesses one variable is very likely to access other variables located near to
it in memory. However this consideration is often overridden by the fact that a large page-
size increases the likelihood of false sharing. This occurs when a number of processes seek
to access unrelated data items that happen to be located on the same page. Thus there will
be delays caused by this contention even though the variables themselves are unrelated.
False sharing is particularly severe if a page contains a key variable that is accessed very often: processes will then experience considerable delays in accessing any other variables located on the same page.
This problem has been moderated somewhat for page-based systems by adopting a mode of
operation where key variables are stored alone on one page, so that accessing them will not
cause any difficulties in accessing other unrelated data items. However this is only a work-
around and an attempt to replicate what is achieved more effectively in shared-variable
DSM systems. In particular it wastes valuable memory space, and bus time; a whole page
has to be maintained and passed around in order to manage the state of only one variable.
Shared Variable DSM
Shared variable systems were proposed as a means of overcoming some of the problems
with page-sharing. By focussing the system on the management of variables rather than
pages they eliminate the problem of false sharing. Processes acquire individual variables, so accessing one variable cannot prevent others from accessing an unrelated variable that happens to be located beside it in memory.
In addition they ensure that only those variables that need to be shared are passed across the
network. Rather than having the same mechanism for all of memory, variables that need to
be shared are treated differently from the rest of memory.
Shared variable systems allow several processes to maintain replicas of individual variables in their memory. This solves the problem of thrashing, and ensures that individual processes have faster access to shared variables. However, it calls for a mechanism to ensure that all replicas of a variable are maintained in a consistent state.
This can be optimised by allowing different levels of consistency for different variables, achieved by annotating each variable to declare what policy applies to it.
Munin is a system that uses this approach (Bennett et al., 1990, Carter et al. 1991). It has a
fixed range of types that can be applied to shared data: write-once, write-many, private,
migratory etc. Different protocols are used for updating shared data that have been
assigned different types. Thrashing caused by competing writers can be avoided by
specifying the type as write-many.
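The idea of selecting a protocol from an annotation can be sketched as follows. The annotation names follow Munin's published sharing types, but the protocol descriptions are simplified caricatures, not Munin's actual implementations.

```python
# Maps a Munin-style sharing annotation to a (simplified) description of
# the update protocol that would service it.
PROTOCOLS = {
    "write-once": "replicate on demand; no updates needed after initialisation",
    "write-many": "buffer local writes and merge diffs at synchronisation",
    "migratory":  "keep a single copy and transfer ownership with the data",
    "private":    "no coherence actions at all",
}

class SharedItem:
    def __init__(self, name, annotation):
        if annotation not in PROTOCOLS:
            raise ValueError(f"unknown sharing annotation: {annotation}")
        self.name = name
        self.annotation = annotation

    def protocol(self):
        return PROTOCOLS[self.annotation]
```

The point of the scheme is visible even in this sketch: the programmer states the sharing behaviour once, per item, and the system chooses the coherence machinery.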
A shared variable system puts the responsibility on the programmer to annotate each data
item to indicate which variables are shared together and what type of policies apply. This
was not necessary in page-based systems where the system treated all data similarly. This
flexibility does offer improved performance; however, because of the need to annotate each data item, it is difficult to avoid costly mistakes. What was needed was some means of
simplifying the application of different sharing policies to data items. This is what object-
based DSM aims to do.
Object-based DSM
An object is a programmer-defined encapsulated data structure (Booch, 1994). It consists of
internal data and associated methods. The methods are procedures that operate on the object's state; they can be called by any program that has a reference to the object. Thus an
object is a data abstraction that comes supplied with methods to operate on it. One of the
key properties of an object is that it implements information hiding which means that direct
access to the internal state is not allowed. The only means by which the internal state can
be accessed or operated on is via the defined methods. Forcing all accesses to an object’s
data to go through the methods helps structure the program in a modular way and allows the
programmer to control the means by which the object can be changed. Objects offer
possibilities for optimisations that are not available with a shared memory composed only
of pages or of shared variables.
Another key property of objects is inheritance. This allows an object to inherit the data and
methods of another object. This can be used to simplify the management of how different
policies are applied to different classes of data items.
In an object-based distributed shared memory processes on several machines share an
abstract space filled with shared objects. The management of these objects is handled by
the DSM system. Any process can invoke any object’s methods once it has obtained a
reference to that object. The process and the object need not be located on the same machine; once the method is called it will run at the location where the object resides.
The issues of replication and managing updates and/or invalidations in order to maintain
consistency still have to be addressed in this system. However it does offer a number of advantages. In particular, because the access methods can be used to restrict how accesses are made, synchronisation can be built into the access mechanism itself. This reduces the possibility of error and makes programming different sharing policies for different data items much easier. Because synchronisation and
access are controlled at the level of the object the underlying implementation can be more
flexible. This has led to the separation of what we have so far called sharing policies into
two separate elements which can be defined separately for an object-based DSM system
(Weber et al., 1998). The Consistency Model describes the type of consistency that is being
enforced for a given variable and the Coherence Protocol describes the mechanisms that are
being used to implement that model.
And of course an object-oriented approach, by using inheritance, allows the development of
frameworks, which are sets of co-operating classes embodying abstract designs that can be
reused to provide a structure for developing applications within a certain domain
(frameworks are discussed fully in the next chapter). Using a successful framework should
simplify the development of individual applications within the framework’s domain.
These three concepts (consistency model, coherence protocol and framework) and the benefits that using them together can offer are the key ideas driving the design of the DSM
system that has been developed in Trinity College. The remainder of this chapter will be
devoted to explaining consistency models and coherence protocols; frameworks are
explained in the next chapter.
2.4.1 Consistency models
A consistency model is a system of rules that puts constraints on what orderings of events
are allowed in a DSM system (Weber et al., 1998). It is used to ensure that at all times
every process with access to a shared data set has a view of that data set that is coherent
with the view of all other processes. However for the views of all processes to be coherent
it is not necessary that they be the same. A shared memory is coherent if the value returned
by a read operation is always the value that the programmer expected. This means that a
shared memory can be coherent when different processes see different values of a given
variable at the same time provided that this is what the programmer had expected. By
“what the programmer expected” we mean in accordance with the consistency model that
the programmer has specified. As long as the programmer has allowed that different users
may see different values of a variable, then different values of the same variable coexisting
in the same system are coherent.
Intuitively we tend to think that only one consistency model can be adequate: all processes must have the same view at any one time (which can be implemented by requiring that all reads from data return the most recently written value).
This is based on the premise that if I read a value that does not incorporate an update that
has been made by another user, then my view of that value is inconsistent with the view of
the writer of that value. However this need not be the case: if, for example, the values being entered are only estimates of something that is unknown, then the fact that two processes have different estimates may not constitute a major problem.
The classic example that illustrates this point is that of newsgroups. There are many users
of a newsgroup and there are many messages posted. However it is not essential that new
messages go out to all users at exactly the same time and so are seen by all users in exactly
the same order.
Strict consistency may be abandoned because it often confers limited benefits in terms of
correctness, i.e. it does not produce a better result for many applications, while it always
exacts a heavy penalty on performance. DSM systems achieve gains in performance by
enforcing more “relaxed” consistency models that require less communication overhead.
These are achieved by allowing the consistency model to specify allowable variations in the
order of accesses. Some of the consistency models that have been proposed are outlined
below.
Strict Consistency
The most stringent consistency model is called strict consistency or alternatively sequential
consistency. It is defined as:
“Any read from a variable must return the value stored by the most recent write to that
variable.” (Lamport, 1979)
This model requires the total ordering of requests, so that all requests are seen in the same
order throughout the system. So the distributed system of copies of the variable appears as
if there were only one copy of the variable which all processes were accessing. This leads to a great reduction in efficiency because an update to a variable is only complete when it has been propagated to all processes and every process has confirmed receiving it. Thus this works against
the key advantage of DSM i.e. maintaining local copies of shared data, thus allowing local
memory accesses to that data. Now every update is not complete until several sequences of
messages have been passed back and forth across the network between the initiator and
every other copy of that variable on the system. Once one process starts an update, no other process can access the variable until that update is complete.
Note also that, because of the delays this necessarily imposes, processes are not guaranteed to be able to update or even read from a variable at will. Thus this intuitive consistency model is not merely an unattainable ideal: it could actively interfere with applications that rely on real-time data access.
Processor Consistency
Processor consistency is a relaxation of strict consistency; its author defined it as:
“Writes done by a single process are received by all other processes in the order in which
they were issued, but writes from different processes may be seen in a different order by
different processes.” (Goodman, 1989)
All processes see the writes of any one process in the order in which they were sent.
However there is no guarantee as to what order a sequence of writes by different processors
will be seen in at different processors. Writes by different processors that had a causal effect on each other may therefore be seen by some processors in an order that breaks this causal relation. On the other hand, writes by a single process can be queued, and so long as they are performed in the right order other processes need not break off processing in order to perform these writes. This allows considerable freedom to nodes on a system as
to when they need to perform updates, however even more relaxed consistency models have
been proposed.
All the models we have seen so far have been uniform models, i.e. they treat all memory
accesses as being equally important and apply the same restrictions to them all. However
once we understand the semantics of particular applications we will see that very often only
a small proportion of accesses to memory are critical. Hybrid models define classes of
accesses that are treated differently in an attempt to utilise this knowledge.
Release Consistency
Release consistency exploits the fact that programmers use synchronisation constructs in many applications; these are used to manage critical sections of the program execution or access to critical variables. Release consistency assumes that as long as these sections are synchronised, synchronisation will not be an issue for the rest of the program. This
system relies on the programmer’s ability to use these constructs correctly because
synchronisation is only applied where the programmer has asked for it. Synchronised
accesses can be declared using acquire and release accesses, which indicate the start and
end of critical sections of code. The guarantee that is offered is:
“All ordinary memory accesses issued prior to a release will take effect at all other
processes before the release completes – including those accesses issued prior to the
preceding acquire” (Gharachorloo et al., 1990)
Updates need not be forced until a critical section is being left, hence the title release
consistency. For this reason release consistency can be used to share several updates in one
message, or to share only the last of a series of updates.
Gharachorloo has shown that release consistency gives results equivalent to strict or
sequential consistency, provided the synchronisation accesses are used correctly. Thus
increased performance can be achieved without compromising application semantics
through optimising the use of synchronisation by only applying it where needed and
allowing processes to proceed without delay where it is not applied in the program.
Lazy Release Consistency
There have been a number of proposals for a relaxation of Release Consistency. These are
all responses to the observation that release consistency might be too “eager” in some
situations. It is eager in that it sends updates as soon as the updating process releases its
lock on a critical section. The “lazier” versions propose different rules which ensure that
updates can be delayed or reduced in scope. For instance Keleher describes a lazy release
consistency that propagates an update, not when a lock is released, but only when it is next
acquired; it only needs to send the update to the process acquiring the lock, and can include
the update in the response to the lock request (Keleher et al., 1996a).
The lazy release consistency that has been implemented in the DSM system to be studied
here only propagates an update when a subsequent read is attempted on the same portion of
data. This is particularly useful for applications where different processes are performing
independent operations which do not need to read the results obtained by other processes.
In this case updates need not be processed at all until the end of the application when only
the master needs to be updated with the results obtained by each process.
2.4.2 Coherence Protocols
The coherence protocol describes the mechanisms that are used to ensure that the virtual
shared memory in DSM is maintained in a coherent state. The consistency model defines
what orders of events (i.e. accesses to memory) are allowed. The coherence protocol is
responsible for ensuring that these accesses are performed correctly so that the consistency
model that has been specified is implemented.
Li and Hudak (1989) is the classic article that identifies the key issues that have to be
addressed in ensuring that memory coherence is maintained in a DSM system. Their
analysis focuses on the requirements of a page-based system, and does not separate
consistency model issues from coherence protocol issues. However we can identify in their article issues that are not addressed by the consistency models outlined above; these are the issues that the coherence protocol must manage. If coherence is to be achieved the
following must be implemented: managing the local copies of data items in the virtual
shared memory in a way that ensures the whole data set remains consistent; keeping track
of the ownership and location of copies of the data; and achieving the distribution of
updates to all nodes with copies of the updated data. We will look at each of these in turn.
Memory Management
Managing the local copy of the virtual shared memory space involves handling local
accesses (writes/reads to/from the local copy) and handling updates. Updates can be
achieved either by overwriting all copies of the data to be updated (write update protocol),
or by updating one copy and invalidating all others (write invalidate protocol). Invalidation
of a local copy will prevent the protocol from returning the value of that local copy as the
result of a read.
For a read operation memory management must return a value when supplied with the
location of the local copy. However this does not mean that the value of the local copy is
what will be returned. If the local copy has been invalidated because another copy of that
data has been updated then the value of the updated copy will be returned.
Tracking Ownership
Ownership refers to tracking where the most up-to-date copy of a data item is located. This
is a service that must be visible to all nodes in the system. In order to assist in the
management of the complete set of copies of a data item it must also keep track of all local
copies of each data item so that they can be located should they need to be either updated or
invalidated.
This can be achieved by using a broadcast message which instructs nodes to invalidate their
copy of a given item if they have one. However this depends on being confident that all
nodes will always receive such broadcast messages. A more robust alternative is to have an ownership manager which maintains a list of copies for each item and, by communicating with each node where a copy resides, receives confirmation that every copy has been invalidated.
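Such an ownership manager might be sketched as follows. Here `send_invalidate` stands in for the per-node messaging, and all names are illustrative assumptions.

```python
class OwnershipManager:
    """Tracks the owner and the copyset of each item, and confirms every
    invalidation node by node rather than trusting a broadcast."""

    def __init__(self):
        self.owner = {}      # item -> node with the up-to-date copy
        self.copyset = {}    # item -> set of nodes holding a copy

    def register_copy(self, item, node):
        self.copyset.setdefault(item, set()).add(node)

    def prepare_write(self, item, writer, send_invalidate):
        """Invalidate every other copy; returns the confirmed nodes."""
        confirmed = set()
        for node in sorted(self.copyset.get(item, set())):
            if node != writer:
                send_invalidate(node, item)   # point-to-point message
                confirmed.add(node)           # reply received
        self.copyset[item] = {writer}
        self.owner[item] = writer
        return confirmed
```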
Distribution Management
Distribution management involves the interface with the communication mechanism. It
must ensure that updates are sent out to whatever nodes are deemed to require the update.
In addition it must handle incoming updates from other nodes. As well as handling updates
it must also deal with requests for updates, i.e. it must be able to send them out to particular
nodes and be able to handle them when they come in from another node.
Once these three issues have been dealt with in an interlocking manner then the protocol
should be able to correctly perform updates to the virtual shared memory in the order
specified by the consistency model.
2.5 Summary
The purpose of this chapter was to introduce some of the concepts and approaches that form
the background against which this investigation was carried out. It explained parallel
programming and surveyed some of the approaches that have been used, including DSM. It
looked at some of the different approaches to DSM that have been proposed, the problems
that they encountered and how some can be overcome. It identified a conceptual
framework of consistency model and coherence protocol that can be used for understanding
the functionality provided by DSM systems and for optimising the performance of parallel
programs by managing these issues separately. In the next chapter frameworks are
introduced and their application to DSM is outlined.
3 Frameworks
3.1 Introduction
The purpose of this chapter is to introduce the concept of a framework, the characteristics
of frameworks, their uses and to indicate how they might be relevant to programming DSM
applications.
Then the DSO framework that has been developed in Trinity College is introduced. Its
links to the major DSM concepts identified in the last chapter are explained in terms of its
major elements, and the relationships between them. Finally the steps required to prepare
an application so that it can be run as a parallel application over the DSO framework are
explained.
3.2 What Frameworks Are
Frameworks are structures of classes that are used to assist in code reuse. However this does not mean that they are just another name for class libraries. The components in a class library can serve as standalone tools, each of which can be recruited as a solution to a particular problem without needing to call on any other element of the library. They are designed to provide their functionality in more or less any context. A class library is effectively a toolbox.
A framework on the other hand is a high-level plan that can be applied to create a family of
solutions to a range of problems posed by a particular application domain. It can do this
because it addresses key issues in the domain in a systematic and interrelated way. This is
explained as:
“A framework is a set of classes that embodies an abstract design for solutions to a family
of related problems … frameworks provide for reuse at the largest granularity.”
(Johnson & Foote, 1988)
This means that a framework is a set of cooperating classes that must be reused together in
order to provide a reusable outline design for solutions to a specific class of problems. It
achieves this by defining a set of abstract classes and defining their responsibilities and
collaborations in terms of key methods that they implement and interface methods through
which they interact. In order to apply the framework to a particular problem a programmer
must customise it to provide the particular functionality required for the application. This is
done by creating subclasses that extend the abstract classes of the framework, adding
particular functionality without compromising the inter-relating structure that has been laid
down in the framework. The programmer has the option of creating a suite of such
subclasses for each class in the framework, thus providing a modular system where
different policies and designs can be applied within the structure of the framework.
The objective in developing a framework is to capture a range of domain specific
knowledge in a structure that ensures that all of the key issues of this domain have been
addressed. In effect it is to achieve design reuse. By applying a framework developers
should be able to proceed more quickly to the details of the application they wish to create,
rather than first having to identify what general concepts apply in this domain and then
develop a structure which is capable of integrating them into a workable whole. Through using the workable structure that has been laid out, code reuse is also achieved.
The framework allows the programmer to work within a structure that has been tested for
many different applications and that has been constructed with the benefit of the
accumulated expertise of the framework’s developers.
3.2.1 Other types of reuse
In order to clarify exactly what kind of reuse is offered by frameworks, we will consider in
this section the similarities and differences between frameworks and some other types of
reuse, namely class libraries, design patterns and components.
Class libraries
As mentioned above class libraries are a common approach to software reuse that is less
domain-specific than frameworks (Johnson and Foote, 1988). Class libraries provide a
range of standalone tools to be applied wherever a particular functionality is required. In
general they can be applied in any context. Class libraries provide code reuse. Frameworks
provide reuse at a much higher level of abstraction. A framework is an outline for the
development of a certain class of applications, whereas a class library could be used in
many different types of application.
Design patterns
Design patterns are another approach to software re-use that is more general than class
libraries (Gamma et al., 1995). However patterns are not domain specific in the way that
frameworks are. Patterns are designs that can be reused in any number of domains. They do not offer code reuse; it is up to the developer to create code that implements the pattern.
Patterns are generally much smaller in scope than frameworks: a pattern will define a type
of interaction between a small number of elements. For instance the barrier that was added
to the consistency model for one of the test applications for the framework is an example of
a design pattern. It is a technique used to co-ordinate a number of processes. In contrast a
framework is an outline for a complete system.
Components
Components are self-contained instances of abstract data-types (Fayad & Schmidt, 1997).
They provide a defined interface that allows any application, or indeed another component,
that has a reference to them to be able to interact with them via this interface. Reuse of a
component is achieved by leveraging knowledge of its external interface, no understanding
of the implementation of this interface is required. In contrast reuse in a framework is a
matter of extending the super-classes that have been provided. This requires a detailed
understanding of the internal structure of these classes and of their interaction. In extending
the classes the programmer will often need to override methods of the super-classes without
compromising the functionality that they were intended to provide. Reuse of components
requires a much lower level of expertise.
Many computer environments that are designed for use by non-experts consist of a
collection of components that can be plugged together to provide more complicated
functionality. Each component in itself is only an instance of an abstract type, whereas a framework is an overarching design encompassing a set of classes. That said, as we shall see, mature frameworks can often look like systems of interchangeable
components.
3.2.2 Features of frameworks
To summarise what has gone before we can see that frameworks have the following
features:
Reusability
The developer is able to save time and improve the quality of applications by reusing the
structure that has been defined using the domain-specific knowledge of the builders of the
framework. Frameworks provide reuse of both design and code.
Modularity
By allowing extensions of the classes in the framework, while still ensuring that they
implement the key interfaces and methods, a developer can create a suite of modules from
which combinations can be selected to provide alternative policies and designs within the
application domain.
Extensibility
The framework is extensible in that the developer is free to extend the abstract classes to
achieve extra functionality that is not included in the framework. However this can only be
done if the developer understands the role of the classes that are to be extended and retains
this in the new subclasses.
Inversion of Control
The framework provides an overarching system. The extensions and additions by the
developer will be to methods that are called by the framework. Therefore the framework
tends to control the order of execution. Thus unlike class libraries a framework is a system
which executes the modifications the user supplies, not a set of modifications which the
user may apply to an application.
Types of Framework
Johnson and Foote (1988) identified two kinds of framework: whitebox and blackbox
frameworks. The key difference between the two is that in whitebox frameworks
application specific behaviour is added by subclassing from the classes of the framework.
This is very much the classic framework scenario as outlined above. These methods must
be designed and implemented in such a way as to retain the functionality intended by the
designer of the superclasses. This requires the application developer to have an
understanding of the framework’s implementation.
In blackbox frameworks, on the other hand, various components which extend each of the
main classes of the framework are already available. They have been supplied by the
creators of the framework, or perhaps by previous users who have extended the framework.
An application developer may then be able to create a particular application by selecting
from the components that have been made available. In order to create an application the
developer need only know the public interfaces of the components and the functionality that
they provide, there is no need to understand the details of their implementations or of the
detailed workings of the framework.
For this reason blackbox frameworks are easier to use and involve much less of a learning curve, because the user need only learn the particular functionality provided by the different components, not the mechanisms by which the different kinds of classes are inter-related to form a working framework.
Johnson and Foote point out that this demarcation is not strict; there is a continuum stretching from whitebox to blackbox frameworks. Indeed the tendency is for a framework to move from whitebox to blackbox as more subclasses are developed to provide particular functionality. If someone has already developed a given functionality for a whitebox framework then there is no obligation on a user to go and define their own subclass with similar functionality; instead they can simply reuse the functionality that has already been defined. As time goes on, and more such classes with different kinds of functionality are defined, using the framework will tend to become more a matter of selecting appropriate combinations than of subclassing to create new functionality.
The converse holds for a blackbox framework: if the developer finds that the components available do not provide the functionality needed then, provided the developer has access to the super-classes and understands their rationale, it is possible to develop subclasses that provide the functionality required.
It might be better to say that whitebox and blackbox refer to different ways in which a
framework can be used. Depending on a developer’s programming expertise and
understanding of the framework, it may be open to that developer to extend the framework
in a whitebox manner. If that expertise is lacking then the developer is restricted to using
the framework in a blackbox manner.
3.2.3 Problems with using Frameworks
Using frameworks for application development offers many advantages, such as code and design reuse and improved application quality. Very often frameworks are defined precisely because it has proved very difficult to design a robust application for a particular type of domain. However frameworks also carry their own disadvantages (Fayad and Schmidt, 1997), only some of which stem from the fact that they will most likely be dealing with a difficult application domain.
Development effort
The effort and domain knowledge required to successfully develop a framework for a given
application domain is greater than that required to develop an application in that domain.
This of course is no surprise. A framework is intended to be general and to cope with
variations that may not apply in a given application. Also the time invested in developing a
framework is intended to be an investment that will yield reduced development time and
better quality for a whole range of applications to be developed in the domain.
Learning curve
The learning curve involved in learning to use a particular framework will be quite high.
This is certainly the case if the framework is to be used as a whitebox framework.
Difficult to debug
Debugging applications created with a framework is often difficult. The framework controls the flow of execution, so it may be difficult to identify the cause of a given fault without an understanding of how the framework controls that execution.
3.3 Frameworks applied to DSM
Having looked at the characteristics and features of frameworks, and before explaining the DSD framework, it is worthwhile to consider another framework that has been proposed to aid the development of DSM parallel applications: DISOM.
3.3.1 An existing framework
DISOM (Castro et al., 1996) is an example of an object-oriented framework developed for
building DSM applications based on shared objects. Classes become shareable by
extending super-classes that have methods to pass updates to relevant nodes. The
framework provides one consistency model (entry consistency). Others can be added by
defining new classes with functionality that differs from the entry consistency class.
However DISOM does not offer a similar interface to the coherence protocol, so there is no capacity for the user to modify the coherence protocol.
DISOM allows consistency models to be applied to classes. Thus all instances of a class
will have the same consistency rules applied to them. However in many applications
although there will be many instances of a given type of data item, there may be
performance benefits to be gained by relaxing the consistency applied to many of them and
only applying more strict consistency to those that perform key roles in the algorithm.
This framework offers flexibility in only one (consistency models) of the two major areas where it can be applied in DSM; in addition it only allows consistency to be applied on a per-class basis, not on a per-object basis. DISOM provides only one pre-defined consistency model and leaves it up to the user to define any others that they might wish to use. It therefore makes no concessions towards black-box use, and demands that the user develop a full understanding of the structure of the DISOM framework before attempting to apply alternative consistency models.
Possible improvements
One would expect that a framework which overcame all these limitations would offer considerable benefits in terms of improved application performance, through selecting and customising both consistency models and coherence protocols that are appropriate, and through applying them on a per-object basis. In addition, if there is a selection of predefined consistency models and coherence protocols, and the mechanism for choosing and changing them is flexible, then the system becomes much more usable for non-expert users. Such a system should make parallel applications as programmable as the semantics of shared memory will allow.
This is precisely what has been attempted in the design of the DSM framework that has
been developed in Trinity College. Its structure and use are outlined in the rest of this chapter.
3.3.2 The Distributed Shared Data (DSD) Framework Developed in
Trinity College
As we have seen, a framework is a set of cooperating classes that must be reused together in order to provide a reusable outline design for solutions to a specific class of problems. The design which this framework attempts to provide in outline is based on the concepts extracted from our consideration of object-based DSM. It aims to allow the user to design DSM applications by assigning combinations of consistency models and coherence protocols to individual objects.
The rationale for the DSD framework is to provide a flexible structure for building DSO
systems from the most appropriate elements (Weber et al., 1998). In particular it should be
able to deliver the following benefits:
Provide a selection of predefined consistency models and coherence protocols
This allows non-expert users a range of policies without requiring them to extend the framework.
Allow Consistency Models and Coherence Protocols to be combined on a per-object basis
This will enable programmers to take advantage of application-specific semantics in a way that other DSM systems do not allow. Other systems do not allow individual instances of a class to have different protocols associated with them, so they cannot take advantage of the optimisations that can be achieved by treating each instance of a class type in the manner most appropriate to its role in the program.
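A minimal sketch of what per-object assignment might look like follows. The enum values and class names are mine, chosen only to illustrate the idea; they are not the framework's actual API.

```java
// Two instances of the same logical type can carry different sharing
// policies: a (model, protocol) pair is attached per object, not per class.
enum Model { RELEASE, LAZY_RELEASE }
enum Protocol { UPDATE, INVALIDATE }

class SharedObject {
    final String name;
    final Model model;
    final Protocol protocol;
    SharedObject(String name, Model m, Protocol p) {
        this.name = name; this.model = m; this.protocol = p;
    }
}

public class PerObjectDemo {
    public static void main(String[] args) {
        // Same kind of data item, different policies per instance.
        SharedObject bestRoute = new SharedObject("bestRoute", Model.RELEASE, Protocol.UPDATE);
        SharedObject scratch   = new SharedObject("scratch", Model.LAZY_RELEASE, Protocol.INVALIDATE);
        System.out.println(bestRoute.name + ": " + bestRoute.model + "/" + bestRoute.protocol);
        System.out.println(scratch.name + ": " + scratch.model + "/" + scratch.protocol);
    }
}
```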
Allow customisation of coherence protocols and consistency models
The framework allows expert users who understand the workings of the classes of the
framework to extend the base classes to create their own protocols and models in order to
achieve optimum performance for particular applications.
Flexibility
It allows easy choosing and changing of the protocols and models that are applied to
different variables.
Programmability
It is intended that the framework should make it relatively easy to modify an existing application in order to run it over the DSM framework. Beyond the difficulties inherent in parallel programming itself, it should add as little further programming difficulty as possible.
Relative speed-up
The objective of parallelising an application is to reduce the time taken to execute the
application by enabling more processing power to be applied to the problem. The relative
speed-up is the rate at which the time taken is reduced as more workers are added.
3.3.3 Structure of the Framework
The structure of the framework is as follows (Weber et al., 1998). There are three elements
in the DSM: Consistency Models, Coherence Protocols and Concurrency Control. The
consistency models and coherence protocols perform the roles that were identified in the
last chapter, defining acceptable event-orderings and implementing those event-orderings.
The concurrency control implements locks and barriers that are used by the consistency
model to control in what order events are seen.
Consistency Model
All consistency models are derived from a base class that defines the common interface for
consistency models: the operations read, write, lock and unlock. This ensures that they all
can perform the function of a consistency model in the framework. Lock and unlock have
corresponding methods in concurrency control. Read and write have corresponding
methods in the coherence protocol base class that perform the actual accesses. There are
subclasses of the base class for strict and relaxed consistencies. The classes for actual
implementations of particular consistency models can then be derived from either the strict
or the relaxed consistency model. As we will use Release Consistency and Lazy Release
Consistency, these two are explained in more detail below.
Release Consistency
Release consistency defines two primitives that bracket a critical piece of code: acquire and release. Updates are not forced out until the release is encountered; the hope is that several updates will have taken place by then, so that only one message is required to deliver them all.
Fig. 3.1 Structure of the Framework (elements: Consistency Model; Coherence Protocol, comprising DistMgt, OwnMgt and MemMgt; Communication Service; Concurrency Control)

This is implemented as shown in Fig. 3.2. A worker wishing to update shared data makes a lock request. The master replies when the lock has been established. The worker updates the master with the new value of the shared data and requests an unlock (i.e. release is called); this causes the master to issue an update to all relevant workers and then remove the lock.
Lazy Release Consistency
Lazy Release Consistency is similar, except that updates are not propagated to other workers until they attempt to access the variable. The hope is that many updates will have been made before an access, so that only the last update need be sent to the worker in question. This is implemented by having each worker make its updates locally. When another worker requests access, the master gets the update from the updating worker and passes it on to the requesting worker.
Fig. 3.2 Release Consistency (steps: 1. Lock Request; 2. Data Locked; 3. Update; 4. Unlock Request; 5. Update; 6. Unlock)

Fig. 3.3 Lazy Release Consistency (writes done locally; steps: 1. Request Access; 2. Update Request; 3. Update; 4. Update after request access)
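A back-of-the-envelope illustration of why batching updates at release time pays off (the numbers are illustrative only, not measurements of the framework):

```java
// With release consistency, k writes inside one lock/unlock pair cost a
// single batched update message per worker at release time, rather than one
// message per write per worker as under a naive write-through scheme.
public class MessageCount {
    public static void main(String[] args) {
        int writes = 5, workers = 4;
        int naive = writes * workers;  // push every write to every worker
        int release = 1 * workers;     // one batched update at release
        System.out.println("naive: " + naive + ", release: " + release);
    }
}
```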
Coherence Protocol
The coherence protocol describes the mechanisms that are used to ensure that the virtual
shared memory in DSM is maintained in a coherent state. It is responsible for ensuring that
accesses are performed correctly so that the consistency model that has been specified is
implemented. As outlined already, there are three elements to the coherence protocol: memory management, ownership tracking, and distribution management. They perform the roles described in the last chapter.
Concurrency control
This provides implementations of distributed locks and barriers for use by the consistency
model. When a lock is requested it locks all copies of a data item according to the
semantics of the particular type of lock requested. At present the framework implements a concurrent-read, exclusive-write (CREW) lock, an exclusive-read, exclusive-write (EREW) lock, and a barrier.
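The CREW access rules can be sketched using the JDK's built-in read/write lock. The framework's own locks are distributed, so this purely local version only illustrates the semantics: one writer at a time, but readers may proceed concurrently.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// CREW semantics: the write lock is exclusive, the read lock is shared.
public class CrewLockDemo {
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private static int shared = 0;

    public static void main(String[] args) throws InterruptedException {
        // Writer: must hold the exclusive write lock.
        Thread writer = new Thread(() -> {
            lock.writeLock().lock();
            try { shared = 42; } finally { lock.writeLock().unlock(); }
        });
        writer.start();
        writer.join();

        // Readers: may hold the read lock at the same time as each other.
        Runnable reader = () -> {
            lock.readLock().lock();
            try { System.out.println(shared); } finally { lock.readLock().unlock(); }
        };
        Thread r1 = new Thread(reader), r2 = new Thread(reader);
        r1.start(); r2.start();
        r1.join(); r2.join();
    }
}
```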
Master and Workers
In addition to the DSM per se, the framework also includes the master and worker classes. The master issues work to the workers and receives updates back. The workers perform the work on each work item they are sent. At present this is the only means of allocating work that has been implemented, but there is nothing in the DSM classes to prevent some other scheme, for example workers cooperating without a master.
When workers are being initialised at the start of the program there are three modes in
which they can be issued data:
Full Distribution: all data issued at initialisation
Empty Distribution: update at access time
(data sent out as worker tries to access it)
Empty Distribution: update at sharing time
(data sent out when share method called on that data)
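The three modes above could be captured as an enum; the identifiers below are mine, mirroring the text, and may not match the framework's actual names.

```java
// The three initialisation modes in which workers can be issued data.
enum DistributionMode {
    FULL,             // all data issued at initialisation
    EMPTY_ON_ACCESS,  // data sent out as a worker tries to access it
    EMPTY_ON_SHARE    // data sent out when share() is called on it
}

public class ModeDemo {
    static String describe(DistributionMode m) {
        switch (m) {
            case FULL:            return "all data issued at initialisation";
            case EMPTY_ON_ACCESS: return "update at access time";
            default:              return "update at sharing time";
        }
    }
    public static void main(String[] args) {
        for (DistributionMode m : DistributionMode.values())
            System.out.println(m + ": " + describe(m));
    }
}
```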
When workers are being initialised they may or may not be provided with full copies of all
the data they need to set up the problem, depending on the initialisation mode.
When a worker requests a work item it is supplied with a section of the problem data. This is inserted into its location in the data structure from which it was extracted (e.g. a matrix row is inserted into the appropriate row of the worker's copy of the matrix). The problem is solved by workers requesting and processing work items until all work items have been processed.
Not fault tolerant
A point worth bearing in mind in regard to the framework is that it has not been designed to
be fault tolerant. It was designed to investigate how well the consistency model / coherence
protocol structure would work.
Fig. 3.4 Issuing of work items to workers (the master holds the data objects and issues work items to several workers)
3.4 How to Parallelise a Program on the Framework
A simple manual or instruction sheet would make it feasible to parallelise a program without needing to understand exactly how the framework handles the different options being selected. Of course this would not eliminate the requirement for a general understanding of DSM and of the effect of the different protocols and consistency models, which is needed to ensure that appropriate options are selected when parallelising a program. Nor does it eliminate the need to decide at exactly what points in the execution of the program data needs to be propagated across the system.
For the application to be parallelisable at all it is necessary that the total block of work to be
done can be broken into a number of portions which can be performed in parallel by the
different workers, as illustrated in the diagram below. This is not a requirement of this particular system but of any parallel system; if the work cannot be broken up then it cannot be performed in parallel by a number of processors working simultaneously.
I have called these portions of work “items”.
The changes and additions that are normally needed to convert a standalone application into
a parallel application ready to be run on the framework are illustrated in the diagram below.
In certain cases more changes than these may be required, as we shall see later on, however
these changes are effective for the majority of applications.
Fig. 3.5 Structure of applications that can be run in parallel: Main() calls Class.work(), which for all items calls workParallel(item); workParallel(item) actually does the work.
1 Some lines of code that are the same for all applications to be parallelised must be added to the main method. These provide that method with a reference to the framework's Master, which is responsible for managing the workers, and identify the ClassType (see below), read either from a config file or from the command line.
2 The ClassType class must be defined. It indicates the set of consistency model and coherence protocol characteristics to be applied to the variables contained in the problem; alternative versions of ClassType can be created with different sharing characteristics. In addition it creates a WorkDescription that describes the initialisation data and the method and variables for the workParallel() that the workers are to use.
3 The ClassType class must extend the class defined for the standalone problem and must override its work() method so that, instead of calling the workParallel() method for each item to be processed, it calls the perform() method on the master, passing the number of the starting item and the number of items to process. This ensures that the method described in the WorkDescription is processed by the workers.
4 The share() method must be called on objects to be shared around the DSM before the start of processing.
Fig. 3.6 Changes required to run an application on the framework:

Main()
  ClassReference
  ClassDescriptor

ClassWorkHandler
  readInitData()
  handleItem()

ClassType (extends Class)
  work() {
    assigns CM & CP to each shared object
    createWorkDescription()
    master.perform(ClassWorkHandler, WorkDescription, …)
  }
  workParallel(item) {
    intend write (Reference)
  }
5 In order to ensure that updates to shared data in the workParallel method are propagated
throughout the DSM system it is necessary to declare an “intention” before updating the
variable and close the intention afterwards.
6 The WorkHandler class must also be defined. This contains three important methods.
• readInitData() reads the initialisation data coming from the master. It must be
defined to extract data in the order in which it was packed in the initialisation
parameter.
• handleItem() accepts a work item from the master and then gets the worker to
invoke on this item the method specified by the workDescription.
• readEndData() is performed when the master indicates that there is no more work to
be done. It does any tidying up that is needed at the end including unsharing shared
variables so that remaining updates will be propagated back to the master.
7 The ClassDescriptor and ClassReference classes are two relatively short classes. The
ClassDescriptor describes a range of shared data that is to be accessed and the type of
access intended. The ClassReference uses that description to perform a read or a write
to/from a range of data that has been described.
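Steps 2, 3 and 5 above might look roughly as follows for a matrix problem. The identifiers (ClassType, WorkDescription, perform, intentions) follow the names used in the text, but the signatures are guesses, so this sketch only prints what each call would do; it shows the shape of the changes rather than compiling against the real framework.

```java
// The standalone class: work() loops over items locally (step 3's superclass).
class Matrix {
    double[][] rows = new double[4][4];
    void work() { for (int i = 0; i < rows.length; i++) workParallel(i); }
    void workParallel(int item) { /* process row `item` locally */ }
}

// The ClassType (steps 2 and 3): extends the standalone class and overrides work().
class MatrixType extends Matrix {
    @Override void work() {
        // step 2: pair a consistency model and coherence protocol per object,
        // describe the parallel method, then hand the loop over to the master
        System.out.println("assign CM/CP to shared objects");
        System.out.println("createWorkDescription(\"workParallel\")");
        System.out.println("master.perform(workDescription, 0, " + rows.length + ")");
    }
    @Override void workParallel(int item) {
        // step 5: declare and close an intention so updates propagate
        System.out.println("intend write on row " + item);
        System.out.println("close intention on row " + item);
    }
}

public class ConvertDemo {
    public static void main(String[] args) {
        Matrix m = new MatrixType();
        m.work();  // the override replaces the local loop with master.perform()
    }
}
```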
Overview of changes required
The only changes that are complicated are those to the ClassType, and these can be clearly
defined in terms of the steps to be carried out for each variable to be shared across the DSM
system. If it is desired to change the sharing characteristics of any one variable, then only
four lines need to be changed to select and publish a different pairing of consistency model
and coherence protocol for that variable.
Different ClassTypes can be defined in advance with varying allocations of consistency
model and coherence protocol to variables. One of these can then be selected simply by
passing it as a parameter to the main program, or by including it in a .config file. This helps
to ease the process of testing applications for a range of different combinations of
consistency model and coherence protocol.
3.5 Summary
The purpose of this chapter was to introduce the concept of a framework, the characteristics
of frameworks, their uses and to indicate how they might be relevant to programming
parallel applications. Then the DSO framework that has been developed in Trinity College
was introduced. Its links to major DSM concepts identified in the last chapter are explained
in terms of its major elements and the relationships between them. Finally, the steps involved in using the framework to parallelise applications were explained.
4 Testing the Framework
4.1 Introduction
This chapter explains the approach that was used to test the framework and the results
obtained. It begins with an explanation of relative speed-up which is the most appropriate
measure of how well a parallel programming system provides gains in performance as more
nodes work in parallel to execute an application. Then it reiterates the claims put forward
for the framework, both in terms of performance and usability. These then provide the
focus for the tests that are performed. The two algorithms selected to test the framework
are then outlined, and in each case the results and conclusions obtained from the process of programming and testing the algorithm are laid out. Finally conclusions are drawn
as to how well the framework has delivered on the benefits it claims to offer.
4.2 Relative Speed-up
The objective of parallelising an application is to reduce the time taken to execute the
application by enabling more processing power to be applied to the problem. Thus the goal
in testing a parallel programming system such as the DSD framework would be to produce
measurements that indicate how well it meets this objective. The most appropriate
indicator of this is a relative speed-up curve, which shows how much faster a given
problem is processed as more processors work on its solution.
A speed-up curve is obtained by plotting the speed-up value obtained against the number of
workers as the number of workers is changed. The speed-up value for n workers is defined
as
Speed-up factor = the time for 1 worker / the time for n workers
Ideally the speed-up curve should be linear, maintaining a 1:1 ratio against the number of workers. However, the implementation of a parallel programming system (DSM or any other kind) involves overheads in managing shared data structures and in communication between workers, and these overheads increase as the number of workers increases. We would therefore expect the speed-up to fall away from linear as the number of workers increases, in a manner similar to that shown in Fig. 4.1.
Fig. 4.1 Comparison of Expected Relative Speed-up with Linear Relative Speed-up (speed-up plotted against number of workers, 1 to 16; the expected curve falls increasingly below the linear 1:1 line)
The extent to which the curve should fall away from linear would be affected by the nature
of the application. If an application required a great deal of information sharing between
workers then it would experience delays due to the communication time required to
perform this information sharing. In addition, if the application had certain key global
variables which only one worker could access at a time, then considerable latencies could be introduced, with other workers waiting idle, unable to access a variable that another worker is already using.
Amdahl’s Law
Amdahl’s law defines the maximum relative speed-up that can be gained by parallelising an
application, and explains why relative speed-up curves tend to fall away from linear speed-
up (Amdahl, 1967). It does this by recognizing that for any parallel application parts of the
application will run as a sequential process and only the remainder will run in parallel.
Performance improvements due to parallelisation will only apply to the portion of the
algorithm that is running in parallel. As more processors are added the time to perform the
parallel sections will reduce, but the time to perform the sequential sections will remain the
same and so this time will become a larger and larger proportion of the overall time and
will limit the relative speed-up that can be attained.
The factors that introduce serial sections into a parallel application are synchronization (which allows only one processor to access a particular synchronized variable at any one time) and communication delays.
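Amdahl's argument can be stated as a formula: with a serial fraction s of the work and n workers, the speed-up is 1 / (s + (1 - s)/n), which approaches 1/s no matter how many workers are added. For example:

```java
import java.util.Locale;

// Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), bounded above by 1/s.
public class Amdahl {
    static double speedup(double s, int n) {
        return 1.0 / (s + (1.0 - s) / n);
    }
    public static void main(String[] args) {
        double s = 0.1;  // 10% of the work is inherently sequential
        System.out.printf(Locale.ROOT, "%.2f%n", speedup(s, 4));   // 4 workers
        System.out.printf(Locale.ROOT, "%.2f%n", speedup(s, 16));  // 16 workers
        System.out.printf(Locale.ROOT, "%.2f%n", 1.0 / s);         // upper bound
    }
}
```

Even with only 10% sequential work, 16 workers yield a speed-up of 6.4 rather than 16, and no number of workers can exceed 10.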
Relative speed-up on a parallel programming system is influenced by how much
communication and data sharing is involved in the test application. Therefore it is better to
test it using applications with significantly different communication and data sharing
requirements in order to get results that reflect the capabilities of the system and not just the
characteristics of one application.
Gustafson’s Law
Gustafson showed how Amdahl's law could be circumvented in practice (Gustafson, 1988). He does not deny its theoretical validity; he simply points out that it does not measure what tends to occur in practice. When greater processing power becomes available for a problem, we do not normally treat it as an opportunity to reduce the time required; rather, we use it to increase the complexity and sophistication of our application.
He introduced the idea of scaled speed-up. Even if relative speed-up tends to fall off as
more processors are applied to a problem, good speed-ups can be obtained when this extra
processing power is applied by increasing the size of the problem. Thus by scaling the size
of the problem upwards as more processors are added we can still achieve worthwhile
speed-ups when we have large numbers of processors involved in an application.
4.3 The Thesis
The thesis of this dissertation is that the framework delivers on the benefits that it promises.
As we have seen these are
• Per-object combination of Consistency Models and Coherence Protocols
• Flexibility
• Customisation
• Programmability
• Speed-up
In order to test this thesis, two applications with differing communication and data sharing requirements have been implemented on the framework: the Travelling Sales Person and LU Decomposition. The Travelling Sales Person requires very little sharing of
data between workers and requires little communication between master and workers. LU
Decomposition in contrast requires a great deal of data sharing between workers and a great
deal of communication because the data to be shared is substantial. In the following
sections on each of these test applications I outline the test problem, the algorithm used to
solve it and the speed-up results that were obtained. In addition I will report on how well
the framework delivered on the other benefits that it promises to the user.
4.4 The Travelling Sales-Person Problem
The Travelling Sales-Person problem (TSP) (Taha, 1992) is much more easily described than solved. We are asked to imagine that a travelling sales-person located at a starting point has to make a tour of n cities in order to visit all his clients; his tour must visit each city exactly once and must finish at the city from which he originally departed, city 0. In the interests of efficiency we are asked to identify the shortest route that meets these requirements.
This is an instance of a cost minimisation problem: it stands for any problem where a number of items must be combined in a way that minimises the total cost, i.e. only the very best solution is considered adequate. For the Travelling Sales-Person the cost is measured in terms of time spent travelling, which is non-productive and which the sales-person would wish to minimise.
We are provided with a list of cities and distances between them. The distances are
supplied in the form of a matrix in which the elements in each row are the distances from
the city the row represents to each city in the problem, including the distance to itself on the
diagonal. There is one such row for every city. This matrix need not be symmetric, i.e. the
direct route from A to B, passing through no other cities on the way, need not be the same
length as that from B to A as shown in Fig. 4.2.
For all algorithms that have been proposed for solving this problem, the amount of processing required increases exponentially as the number of cities increases. This is because the number of possible tours through n cities is (n-1)!. For example, for 16 cities there are 1.3x10^12 possible routes, while for 17 cities the problem is 16 times larger, with 2.1x10^13 possible routes. This means that the TSP is an NP-complete problem (Karp, 1972); these are a class of problems for which no efficient solution has yet been found. The time taken to solve the problem increases exponentially for all known solutions.

Fig. 4.2 Distances between cities in an asymmetric TSP problem (the distance between City A and City B differs by direction: 10 miles one way, 15 miles the other)

NP-complete problems are contrasted with so-called tractable problems, which
can be solved in polynomial time (i.e. the time to solve increases as a polynomial function
of the problem size, which increases much more slowly than exponential).
At the moment there are no efficient algorithms for solving NP-complete problems. For
this reason they are often used as benchmark problems to test the performance of
computing systems. The TSP is one of the most regularly used benchmark NP-complete
problems.
There are many heuristic algorithms which get around the problem of NP-completeness by fairly quickly obtaining a good, but not necessarily optimal, solution. However our problem is an optimisation problem: we are required to find the very best solution, not merely a "good" one. Thus our algorithm must be one that can definitively exclude all routes other than the one returned as the shortest route. It must in effect account for every possible route in the problem space, although this does not mean that it has to test every single possible route.
4.4.1 The Algorithm
The algorithm which I designed and implemented for this test was chosen because it fits well with the DSM framework and should offer very good speed-up characteristics. It is a depth-first tree pruning using a branch-and-bound algorithm to eliminate non-optimal routes. I have added a number of TSP-specific optimisations to reduce the processing time; the details are as follows:
Initial sorting of partial routes
In order to optimise the order in which routes are tested, all possible four-city-long partial routes are generated and sorted into a list called the routeTable, with the shortest routes at the start of the list. Then we repeatedly extract the first (i.e. shortest remaining) route from the routeTable and test whether any continuation of that partial route will yield the shortest route (see Fig. 4.3 overleaf). The algorithm used to test this is a branch-and-bound tree-pruning algorithm (Taha, 1992).
Branch-and-bound tree pruning
The principle of this algorithm is to eliminate any branches from the tree of possible routes
which cannot lead on to the shortest route. This is determined by the branch and bound
method. The algorithm starts with a partial route and tests each possible branch on the
route, i.e. it considers each possible continuation of a route that could follow from a partial
route. It does this depth first, identifying a complete route and then attempting to find other
routes that are shorter than it, by working back from the complete route checking other
similar routes. At all times the algorithm retains a reference to the shortest route that has
been found so far, called the “bestRoute”. Each route generated during the tree traversal is
compared to it.
If a partial route is longer than the bestRoute then there is no need to test any of the
continuations of it because they will also be longer than the bestRoute and so cannot be
candidates for the shortest possible route that is being sought. Thus all routes that are
connected to this branch are eliminated (i.e. bound out). Through this means, the sooner a
short bestRoute can be identified then the more quickly large portions of the problem space
can be excluded without having to check each route individually. This produces significant
savings in terms of processing time.
If a complete route is shorter than the bestRoute then it becomes the new bestRoute. At the
end of the program when all possible routes have been considered the route contained in the
bestRoute will be the best possible route for the data set given in the problem.
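The depth-first traversal with bounding described above can be sketched as follows. This is an illustrative, single-node Java version (the city count, distance matrix and method names are my assumptions), not the framework's parallel implementation:

```java
// Minimal sketch of the depth-first branch-and-bound traversal described
// above. The distance matrix and method names are illustrative
// assumptions; the real program extends four-city partial routes taken
// from the routeTable.
public class BranchAndBoundSketch {

    static final double[][] DIST = {
        {0, 2, 9, 10},
        {2, 0, 6, 4},
        {9, 6, 0, 8},
        {10, 4, 8, 0}
    };
    static double bestLength;   // length of the bestRoute found so far

    static double solve() {
        bestLength = Double.MAX_VALUE;
        boolean[] visited = new boolean[DIST.length];
        visited[0] = true;                     // tours start at city0
        extend(0, 1, 0.0, visited);
        return bestLength;
    }

    // Extend the current partial route ending at 'city' with every
    // unvisited city, depth first; bound out the branch as soon as its
    // length reaches bestLength.
    static void extend(int city, int count, double len, boolean[] visited) {
        if (len >= bestLength) return;         // bound: prune this branch
        if (count == DIST.length) {            // complete route: close tour
            double total = len + DIST[city][0];
            if (total < bestLength) bestLength = total;  // new bestRoute
            return;
        }
        for (int next = 1; next < DIST.length; next++) {
            if (visited[next]) continue;
            visited[next] = true;
            extend(next, count + 1, len + DIST[city][next], visited);
            visited[next] = false;
        }
    }

    public static void main(String[] args) {
        System.out.println("Shortest tour length: " + solve());
    }
}
```

The earlier a short bestLength is set, the more branches the `len >= bestLength` test prunes, which is exactly the effect the optimisations below aim for.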
[Fig. 4.3 Outline of TSP algorithm – branch and bound tree pruning. The RouteTable contains a list of all possible four-city routes starting at city0, sorted with the shortest routes first. The shortest partial route is extracted from the RouteTable and all possible extensions of it are tested until they are longer than bestRoute; if a complete route is shorter than bestRoute then it becomes the new bestRoute.]
Fig. 4.4 illustrates how this works. Initially the bestRoute is set at 20. Routes are extended
until such time as they become longer than 20, at which time they are eliminated and the
next route is tried. When a complete route is less than the bestRoute value as is the case
with the route shown in bold, then that route becomes the new bestRoute and any routes
longer than it can be eliminated from now on.
Optimisations
A number of optimisations have been added that reduce the number of routes that have to
be tested.
- Partial routes are not considered if they are already longer than the best complete route
that has been found so far.
- Before testing the routes an initial guess at a short route is made using the "hillclimb"
method (Taha, 1992): starting at city0 we always choose the city closest to the last city to
be the next city on the route. This very quick algorithm allows us to produce a reasonably
short route.

[Fig. 4.4 Illustration of branch and bound algorithm: with bestRoute initially 20, partial routes are eliminated when they become longer than bestRoute; a complete route of length 18 becomes the new bestRoute, and all routes longer than 18 are henceforth bound out.]

This ensures that at the start of the algorithm we already have a pretty good
route as our best route, thus right from the start we will be able to exclude a large
proportion of the non-optimal partial routes: those which are longer than the route found by
the hillclimb method.
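A minimal sketch of this nearest-neighbour hillclimb, with an assumed distance matrix (the method name and matrix are illustrative, not taken from the program):

```java
// Sketch of the "hillclimb" initial guess: starting at city0, always
// choose the closest unvisited city as the next city on the route.
// The distance matrix is an illustrative assumption.
public class HillClimbSketch {

    static int[] hillClimb(double[][] dist) {
        int n = dist.length;
        int[] route = new int[n];
        boolean[] visited = new boolean[n];
        visited[0] = true;                    // always start at city0
        for (int i = 1; i < n; i++) {
            int last = route[i - 1], nearest = -1;
            for (int c = 0; c < n; c++)
                if (!visited[c]
                        && (nearest == -1 || dist[last][c] < dist[last][nearest]))
                    nearest = c;              // closest unvisited city so far
            route[i] = nearest;
            visited[nearest] = true;
        }
        return route;
    }

    public static void main(String[] args) {
        double[][] dist = {
            {0, 2, 9, 10},
            {2, 0, 6, 4},
            {9, 6, 0, 8},
            {10, 4, 8, 0}
        };
        System.out.println(java.util.Arrays.toString(hillClimb(dist)));
    }
}
```

The tour it produces is not guaranteed optimal, but its length serves as the initial bestRoute bound.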
(It should be noted that when we produce a random distribution of cities the hillclimb will
often find the optimal route without any need for the rest of the algorithm. The data sets
used for testing the algorithm therefore need to be quite extreme in order to ensure that the
problem exercises all elements of the algorithm. For this reason the data
used is that published by Pete Keleher (Keleher, 1996b) as sample data for parallelising the
TSP problem.)
- Route lengths are calculated to include the minimum possible distance from the end of the
route (if it is a partial route) back to the start. This ensures that long routes are eliminated
more quickly, because it accounts not only for the length of the route so far but also for the
minimum possible length to complete the route back to the starting point. Thus for the
four-city partial route shown in Fig. 4.5 the route length includes the minimum distance
(12) back to the start from the end of the route, because any possible completion of this
route can be no shorter than the resulting length of 32.
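This bounded length calculation can be sketched as follows. The matrix and helper name are illustrative assumptions, chosen so that the partial route of Fig. 4.5 (four edges of length 5 plus a minimum return distance of 12) yields 32:

```java
// Sketch of the bounded length calculation: a partial route's effective
// length includes the minimum possible distance from its last city back
// to city0, so hopeless routes are bound out sooner. The helper name and
// distance matrix are illustrative assumptions.
public class BoundedLengthSketch {

    // Effective length = length walked so far + minimum distance back
    // to the starting city (the direct distance is that minimum).
    static double boundedLength(int[] partial, double[][] dist) {
        double len = 0;
        for (int i = 1; i < partial.length; i++)
            len += dist[partial[i - 1]][partial[i]];
        return len + dist[partial[partial.length - 1]][0];
    }

    public static void main(String[] args) {
        double[][] dist = {
            {0, 5, 9, 11, 12},
            {5, 0, 5, 8, 9},
            {9, 5, 0, 5, 8},
            {11, 8, 5, 0, 5},
            {12, 9, 8, 5, 0}
        };
        // A partial route with four edges of length 5 has walked 20;
        // adding the minimum return distance (12) gives a bound of 32
        // that any completion must meet or exceed.
        System.out.println(boundedLength(new int[]{0, 1, 2, 3, 4}, dist));
    }
}
```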
- In order to ensure that the best candidate routes are tested first, partial routes four cities
long are sorted in order of their length. Each of these is tested in turn, starting with the
shortest. It is, of course, not necessarily the case that the shortest partial route will be the
start of what will be the shortest route; however working on the shortest partial routes first
should allow a short bestRoute to be identified more quickly, thereby reducing the
processing required on all subsequent routes. This optimisation is not intended to focus on
the best, but rather to screen out the worst.

[Fig 4.5 Calculation of route length: a partial route with four edges of length 5 totalling 20; the minimum distance (12) from the end of the route back to City0 is included when calculating the route length for incomplete routes, giving Route length = 32.]
- Each time a route is extracted from the routeTable it is tested to see if it is longer
than the current bestRoute; if it is longer it is discarded and the next route is extracted. This
ensures that workers do not waste time working on routes that cannot provide a new
bestRoute.
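The extraction check might look something like this sketch; the Route class, field names and queue representation are assumptions for illustration, not the framework's actual routeTable implementation:

```java
import java.util.PriorityQueue;

// Sketch of the routeTable's extraction check: each extracted partial
// route is compared against the current bestRoute length and discarded
// if it can no longer yield a new bestRoute. The Route class and field
// names are illustrative assumptions.
public class GetNextRouteSketch {

    public static class Route {
        public final double length;
        public Route(double length) { this.length = length; }
    }

    // Shortest partial routes come out first.
    public final PriorityQueue<Route> routeTable =
        new PriorityQueue<>((a, b) -> Double.compare(a.length, b.length));
    public volatile double bestLength = Double.MAX_VALUE;

    // Returns the next partial route still worth processing, or null
    // when nothing useful remains.
    public Route getNextRoute() {
        Route r;
        while ((r = routeTable.poll()) != null) {
            if (r.length < bestLength) return r;   // still a candidate
            // otherwise discard: no extension of it can beat bestRoute
        }
        return null;
    }

    public static void main(String[] args) {
        GetNextRouteSketch table = new GetNextRouteSketch();
        table.routeTable.add(new Route(18));
        table.routeTable.add(new Route(25));
        table.bestLength = 20;
        System.out.println(table.getNextRoute().length);  // 18 survives
        System.out.println(table.getNextRoute());         // 25 discarded: null
    }
}
```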
Features that should enhance relative speed-up
There are three aspects of this algorithm that should help to ensure a good relative speed-up
curve: independence of workers, sharing of bestRoute data, and a high computation-to-
communication ratio.
Independence of workers
In the case of the TSP algorithm the workers are independent of each other in that each
worker can successfully process the partial routes it receives from the master without
having any knowledge of the status of any of the other workers. The extension and testing
of the partial routes can be performed by comparing them with a local copy of the
bestRoute which may be longer than that held by some other workers. This is because the
master allows a worker to update its copy of bestRoute only if the worker’s copy of
bestRoute is shorter than its own copy. Thus eventually the master will be updated with the
best possible route.
The independence of the workers means that workers need not wait on each other. If one
worker is held up for some reason the others can continue regardless, without affecting the
result. This should help to ensure good speedup characteristics. If one worker has a partial
route that requires a lot of processing to eliminate routes then the others need not stand idle
while it finishes this route.
Sharing of bestRoute data
The more workers we have working in parallel processing candidate routes, the more
quickly a short route should be found. If this short route is propagated to all the other
workers, they will be able to exclude more routes than before because they will have been
provided with a shortest route that is shorter than the one that they have been able to find on
their own. At every stage of the algorithm the shorter our best route is the more quickly we
will be able to identify that candidate partial routes are wrong and thus exclude them and all
extensions of them.
Because the BestRoute does not need to be updated very often and is a very small item of
data, there is little communication delay involved in performing this sharing. By ensuring
that all workers share the best available version of bestRoute, the parallelised algorithm
should achieve close to linear speedup. Indeed, as we shall see, better than linear speed-up
is possible for some data sets.
High computation to communication ratio
Because the items to be passed around the network (i.e. routes) are very small, and because
the amount of computation required to process each one is relatively large (the TSP is
NP-complete), the relative delays due to communication should be low and so the relative
speed-up should be quite good.
4.4.2 Parallelisation of the TSP
The parallelisation of the TSP algorithm described above proved to be much more complex
than the relatively straightforward steps outlined in section 3.4 on parallelising applications.
This was because the means by which new items of work (in this case partial routes from
the routeTable) were obtained by the workers differed from the standard method.
The standard method assumes that the items are known before any work is done by the
workers, and so they can be arranged in an array and selected one by one by passing to each
worker the index of the item it is assigned and allowing it to extract the item from the array.
However in the case of the TSP this is not adequate because we do not want to extract every
one of the items that is originally added to the routeTable. If any of them are longer than
the current value of bestRoute then we want to discard those. This is done by calling a
method on the routeTable that performs the check. There is no way in which we can know
in advance which partial routes can be ignored by this means.
This required a redesign of the whole mechanism by which work items are issued to the
workers, which involved changes to the classes indicated in Fig. 4.6 below.
The key difference was that the perform() method had to be provided with a reference to the
routeTable and the name of the getNextRoute() method that was to be called on it, instead
of the index of the first item and the total number of items. This change caused a cascade
of changes through all the classes that feed work items to the worker. For instance the
WorkHandler class had to be redesigned to be provided with a reference to the routeTable
on the master during the initialisation stage. It had to call the getNextRoute() method on
the routeTable whenever it received a handleItem from the master.
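This pull-based mechanism can be sketched as below, using Java reflection to model passing the name of the method to be called on the work source. All class and method names other than getNextRoute() are illustrative assumptions, not the framework's API:

```java
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of the redesigned work-issuing mechanism: instead of array
// indices, the worker-side handler is given a reference to a work source
// and the name of the method to call on it for each new item. The class
// names and route strings are illustrative assumptions.
public class PullWorkSketch {

    public static class RouteTable {
        private final Queue<String> routes = new ArrayDeque<>();
        public RouteTable() { routes.add("0-1-4-2"); routes.add("0-1-3-4"); }
        public String getNextRoute() { return routes.poll(); } // null when empty
    }

    public static class WorkHandler {
        private final Object workSource;
        private final Method nextItem;

        public WorkHandler(Object workSource, String methodName) {
            this.workSource = workSource;
            try {
                this.nextItem = workSource.getClass().getMethod(methodName);
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException(e);
            }
        }

        // Called whenever the master sends a handleItem: pull the next
        // work item from the source rather than indexing into an array.
        public Object handleItem() {
            try {
                return nextItem.invoke(workSource);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        WorkHandler handler = new WorkHandler(new RouteTable(), "getNextRoute");
        System.out.println(handler.handleItem());  // first pending route
        System.out.println(handler.handleItem());
        System.out.println(handler.handleItem());  // null: no work left
    }
}
```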
[Fig 4.6 Classes changed in order to parallelise TSP. The figure shows Main(); ClassReference; ClassDescriptor; ClassWorkHandler with readInitData(), handleItem() and readEndData(); and ClassTypeN extends Class, whose Method() assigns a CM and CP to each shared object and calls master.perform(ClassWorkHandler, WorkDescription, null, getNextRoute), and whose methodParallel(item) calls intend(write, Reference). Also shown: Master, FeedAgent, RouteAgent and P2PMaster.]
Customisation of protocols for TSP
A new intention called an “Access” intention had to be added to the read and write
intentions. It describes an intention that first reads the value at the master and then decides
whether it should overwrite it. This addition was necessary in order to parallelise the
setRouteIfBest method which ensures that a worker only updates the bestRoute value of the
DSM system as a whole if its value is lower than that currently prevailing.
This change was very easily made. The new intention class was created by combining the
functionality contained in the two existing intentions: “read” and “write”. The read
functionality is performed first. Then the write may or may not be performed depending on
the result of the read.
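The read-then-conditional-write behaviour behind setRouteIfBest can be sketched as follows; the Master class here is a simplified single-JVM stand-in for the framework's master, not its actual implementation:

```java
// Sketch of the "Access" intention's read-then-conditional-write pattern
// behind setRouteIfBest: read the master's current bestRoute length first,
// then write only if the worker's candidate is shorter. This Master class
// is a simplified stand-in, not the framework's implementation.
public class SetRouteIfBestSketch {

    public static class Master {
        private double bestLength = Double.MAX_VALUE;

        // Read part, then conditional write part. Synchronised so that
        // only one update can win at a time.
        public synchronized boolean setRouteIfBest(double candidate) {
            if (candidate < bestLength) {   // read: compare with current value
                bestLength = candidate;     // write: only if strictly shorter
                return true;
            }
            return false;                   // longer candidates are rejected
        }

        public synchronized double bestLength() { return bestLength; }
    }

    public static void main(String[] args) {
        Master master = new Master();
        System.out.println(master.setRouteIfBest(20));  // true: first route
        System.out.println(master.setRouteIfBest(25));  // false: longer
        System.out.println(master.setRouteIfBest(18));  // true: new best
        System.out.println(master.bestLength());
    }
}
```

This is why the master eventually holds the best possible route even though workers may submit updates in any order.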
4.4.3 Predicted effect of the protocols
The TSP program was run with both Replication and HomeBased protocols. We would
expect significantly different performance for these different types of configurations for
reasons which are outlined below.
Replication
The replication protocols perform updates by means of messages that are broadcast to all
nodes that are part of the DSM system, as illustrated in Fig. 4.7 below.
[Fig. 4.7 Replication protocols. The Master holds the RouteTable and bestRoute; each Worker holds its own copy of bestRoute. Work items are issued to each worker on a "give me work" basis; updates are issued to the master and to the other workers via multicast.]
This means that when a worker publishes an intention that it wishes to send an update to the
master this update is also sent to the other workers. So all the workers see all updates to the
bestRoute as soon as they are propagated to the master. Thus with the replication protocols
the DSM system implements sharing of bestRoute data, ensuring that the speed-up
benefits resulting from this sharing are obtained by the parallel program. This sharing is
achieved without having to sacrifice or degrade any of the other optimisations that have
been built into the program. For this reason using the replication protocols should offer
close to linear speedup.
The TSP algorithm requires only small amounts of communication between the workers
and the master; therefore we would expect little traffic on the network, and so
Replication2, which uses UDP multicast (unreliable but lightweight), should be faster than
Replication3, where LRMP enforces greater reliability at the cost of some management
overhead.
HomeBased
The homebased protocols only allow communication between each worker and the master,
as illustrated here.
[Fig. 4.8 HomeBased protocols. The Master holds the RouteTable and bestRoute; each Worker holds its own copy of bestRoute. Work items are issued to each worker on a "give me work" basis; updates are issued from worker to master only.]
This means that when a worker issues a new lower bestRoute to the master this is not
shared with the other workers. Thus we would expect poorer speed-up for either of the two
Homebased protocols, because as more workers are added all except one will be operating
with a sub-optimal value of bestRoute. The proportion of workers working with a sub-
optimal bestRoute will increase as workers are added, so speed-up should tend to fall
further away from linear as more workers are added.
This could be overcome by adding an invalidation feature to the coherence protocol.
Whenever one worker updates the master’s copy of bestRoute, the other workers’ copies
of the same variable should be invalidated. This would mean that when they next attempted
to access their local copy the fact that the local copy was invalidated would cause the
protocol to be invoked to obtain the new better value from the master. However at the time
that the TSP was being tested this feature had not been added to the framework. The
performance of this version would still be slower than that of the replication protocols
because performing this invalidation across all copies of the data on the network would
require extra time, and would not eliminate the need to perform a separate transfer of data
to each worker when they actually want to read an invalidated data item.
This version of the HomeBased protocol would be more efficient than the Replication
protocols if updates were being made to the bestRoute more often than it was being read,
because then updates would only be propagated to the workers on the rare occasions when
they would attempt a read. However this does not apply for the TSP:
the bestRoute is only updated a small number of times (less than 15 for our 17-city TSP
data set) while the bestRoute is read many thousands of times during the application.
The TSP algorithm requires only small amounts of communication between the workers
and the master (because there are few updates and the data to be updated is very small),
therefore we would expect little traffic on the network, and so HomeBased1, which uses
UDP (unreliable but lightweight), should be faster than HomeBased2 which, using TCP,
enforces greater reliability at the cost of some management overhead.
Release Consistency & Lazy Release Consistency
Release consistency should offer better speed-up than lazy release consistency for reasons
related to when updates are published. Release consistency publishes an update as soon as
it is made. Lazy Release waits until each worker tries to access the shared data and then
publishes the update. This takes extra time and so is less efficient in an algorithm like the
TSP where the data are being read many more times than they are being updated. If the
data were updated more often than they were read Lazy Release would offer significantly
better performance.
4.4.4 Results of speed-up tests
The objective of parallelising an application is to reduce the time taken to perform its
function by enabling more processing power to be applied to the problem. Thus the
objective of testing a parallel programming system such as the DSM framework would be
to produce measurements that indicate how well it meets this objective.
As we have seen in section 4.2 the most appropriate indicator of how well a parallel
programming system delivers increases in processing power with the addition of more
workers is the relative speed-up curve which shows how much faster a given problem is
processed as more processors work on its solution.
The performance of the parallelised TSP program was tested by recording how long it took
to complete the search for the best possible route. This test was repeated ten times for a
given number of workers and the speed-up factor was calculated for each test. The average
and standard deviation of these ten speed-up results was calculated. A set of ten tests was
done for each of the following numbers of workers: 1, 3, 6, 9, 12 and 15.
This procedure was performed for the parallelised TSP program for each of the following
configurations of Consistency Model and Coherence Protocol:
Release Consistency, HomeBased Protocol 1
Release Consistency, HomeBased Protocol 2
Release Consistency, Replication Protocol 2
Release Consistency, Replication Protocol 3
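The statistics used in these tests can be sketched as follows. The speed-up factor for n workers is the one-worker time divided by the n-worker time; the 686 s and 213 s figures below are the Table 1 averages for Replication3 on 1 and 3 workers, while the individual run times are hypothetical:

```java
// Sketch of the speed-up statistics used in the tests: speed-up for n
// workers is T(1)/T(n), and each configuration's ten runs are summarised
// by their mean and standard deviation. The sample run times are
// hypothetical.
public class SpeedupStats {

    static double speedup(double timeOneWorker, double timeNWorkers) {
        return timeOneWorker / timeNWorkers;
    }

    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    // Population standard deviation of a set of run times.
    static double stdDev(double[] xs) {
        double m = mean(xs), sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / xs.length);
    }

    public static void main(String[] args) {
        // e.g. 686 s on 1 worker and 213 s on 3 workers gives a
        // speed-up factor of about 3.2 (Table 1, Replication3).
        System.out.printf("speed-up: %.1f%n", speedup(686, 213));
        double[] runs = {212, 214, 213, 212, 214};  // hypothetical run times
        System.out.printf("mean %.1f, sd %.2f%n", mean(runs), stdDev(runs));
    }
}
```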
The full results are tabulated in appendix 6.1 and summarized in table 1 below:
Table 1: 17 City TSP – Summary Table

Average Time (s)         Number of Workers
                     1      3      6      9      12     15
RC9 Replication3     686    213    114    81     67     52
RC9 Replication2     678    211    118    85     68     53
RC9 HomeBased1       694    303    191    135    111    99
RC9 HomeBased2       691    315    201    137    120    100

Speedup Factor           Number of Workers
                     1      3      6      9      12     15
RC9 Replication3     1.0    3.2    6.0    8.5    10.2   13.2
RC9 Replication2     1.0    3.2    5.7    7.9    10.0   12.7
RC9 HomeBased1       1.0    2.3    3.6    5.2    6.2    7.0
RC9 HomeBased2       1.0    2.2    3.4    5.0    5.8    6.9
Linear Speedup       1.0    3.0    6.0    9.0    12.0   15.0
When the speed-up factor is plotted against the number of workers the graph is as
illustrated overleaf. The straight line represents linear speed-up where for every doubling
of the number of nodes the processing time is cut in half. The other lines show the speed-
up achieved by the different configurations of the TSP program.
4.4.5 Analysis of Results
These results allow us to draw a number of conclusions about the speed-up performance of
the framework for the TSP algorithm.
Replication protocols achieve better speed-up than Homebased protocols
The two replication protocols achieve near-linear speed-up whereas the two homebased
protocols fall away quite sharply from the linear speed-up line. This is in line with what
was expected. The homebased protocols do not allow workers to communicate with each
other, therefore they do not allow the program to benefit from the optimization that is
available through sharing a new bestroute when one is found.
This is the strongest effect to be observed from the graph: the difference between
homebased and replication protocols is greater than that within these classes of protocols by
a ratio of over 20:1. Replication2 and Replication3 differ by 2-3% whereas Homebased1
and Homebased2 differ by over 60%.
Differences within protocols not significant
The differences between Replication2 and Replication3, and between HomeBased1 and
HomeBased2, are not significant.
For the homebased protocols this is because both of the homebased protocols show
significant levels of variation in the results: the average standard deviation for both of them
is over 4%, which is greater than the difference in their average times of only 2-3%.
For the replication protocols, although the standard deviation of Replication3 is very low,
starting at less than 0.5% and then falling to 0, the standard deviation of Replication2 is
much greater at an average of 3%, enough to cross the difference between the average
values.

[Fig. 4.9 Relative Speed-up Graph for 17 City TSP: speed-up factor (0 to 16) plotted against number of workers (1 to 15) for Linear Speedup, RC9 Replication3, RC9 Replication2, RC9 HomeBased1 and RC9 HomeBased2.]
Vulnerability to network traffic
The % standard deviations of the times to complete the TSP algorithm (shown in the graph
below) indicate that for all of the configurations apart from Replication3, adding workers
tends to increase the variability of the results; this is particularly clearly the case for the
Homebased1 protocol. However Replication3, which uses the reliable LRMP multicast
protocol, generally maintains a low % standard deviation as more workers are added.
Throughout the testing of the TSP program we suffered from external network traffic
leaking onto the network where the worker nodes were located (CAG cluster of Linux
workstations) due to the fact that this network was only partially isolated from the rest of
the college network. These results indicate that the LRMP protocol was most effective at
coping with this traffic.
Only very slight difference between protocols for one worker
The fact that there are only very small differences in the processing times for one worker
demonstrates that the differences between the protocols for more workers are due to issues
with communication and sharing of data, not due to the problem being processed at
different speeds. When only one worker is working on the problem there is only
communication with the master and there is no sharing of data with other workers. All of
the protocols include communication with the master, so they are all doing the same thing
for one worker; the slight differences that do exist may be due to slight differences in
the efficiency of the communication mechanisms being used by the different protocols. To
ensure a consistent environment for calculating speed-ups the master is always on a
separate node with no workers on that node. When there is only one worker, that
worker is on a separate node and has to communicate with the master across the network.

[Fig 4.10 % Standard Deviation for 17 City TSP: standard deviation (0 to 9%) plotted against number of workers (1, 3, 6, 9, 12, 15) for RC9 Replication3, RC9 Replication2, RC9 HomeBased1 and RC9 HomeBased2.]
Super-linear speed-up possible
The second really noticeable result to come from this data is that it is possible to achieve
better than linear speedup. This is a remarkable result because most literature expects some
fall-off from linear speed-up due to communication overhead.
On the data that has been presented above there is only one point (3 workers) on each of the
replication protocol speed-up graphs that shows super-linear speed-up. In order to
investigate this more closely tests were performed for each of these protocols on a greater
number of points. The results of these tests are indicated in Table 2 below and on the graph
overleaf:
Table 2: 17 City TSP – Additional Data

Average Time (s)         Number of Workers
                     1      2      3      4      5      6      8      9      12     15
Replication3 RC9     686    334    213    150    135    114    90     81     67     52
Replication2 RC9     678    333    211    158    129    118    91     85     68     53

Speedup Factor           Number of Workers
                     1      2      3      4      5      6      8      9      12     15
Replication3 RC9     1.0    2.1    3.2    4.6    5.0    6.0    7.6    8.5    10.2   13.2
Replication2 RC9     1.0    2.0    3.2    4.3    5.3    5.7    7.5    8.0    10.0   12.8
Linear Speedup       1.0    2.0    3.0    4.0    5.0    6.0    8.0    9.0    12.0   15.0
[Fig 4.11 Relative Speedup Graph for 17 City TSP – Additional Data: speed-up factor (0 to 16) plotted against number of workers (1 to 15) for Linear Speedup, Replication3 RC9 and Replication2 RC9.]
These extra points confirm clearly that the one super-linear point in the previous graph was
not an aberration, because now we have other points on either side of 3 workers that show
better than linear speedup, for both of the replication protocols. Super-linear speedup is
clearly possible for the TSP algorithm, but it is not guaranteed, because it is very dependent
on the nature of the particular data set being used.
That the super-linear speed-up is not a statistical aberration is demonstrated by the fact that
for up to six workers the standard deviation has a maximum of 0.5% for Replication3, yet
the speedup factor of 3.2 for 3 workers is 6.7% above linear, more than 13 times higher.
In the next section a number of ways in which super-linear speed-up is possible are
outlined, all of which centre around the timing of the finding of a new bestRoute. Finding a
new bestroute significantly reduces the time to process a route for all of the workers (when
a replication protocol is used). However the actual amount of speedup can vary
considerably depending on the circumstances in which the bestroute is found as we shall
see in the next section. Given that a new bestroute is found a small number of times during
the running of the program the particular circumstances in which one is found can have a
significant impact.
The fact that the levels of speed-up are data dependent is illustrated by the speed-up curve
obtained for a 16-city data set which came from the same source as the 17-city data set we
have been concentrating on (Keleher, 1996b). In this case, when we use Replication3 and
Release Consistency 9, very close to linear speed-up is obtained at the start of the graph, but
it never achieves the superlinear speed-up obtained for the 17-city data set. Thus for a
different data set we have also achieved very good speed-up, but not the superlinear speed-
up we obtained before. The results are tabulated in appendix 6.3 and the graph is shown
below.
[Fig 4.12 Speedup Graph for 16 City TSP: speed-up factor (0 to 16) plotted against number of workers (1 to 15) for Linear Speedup and Replication3 RC9.]
4.4.6 Explanation of Super Linear Speedup
In this section three ways are outlined in which it is possible to achieve super-linear speed-
up for a portion of the TSP problem space. Given that the items of work which include a
new bestRoute tend to take significantly longer than other items, any speedup obtained from
them will tend to have a disproportionate effect on the speedup obtained for the problem as
a whole.
In order to make the examples clear it is assumed that apart from exceptional routes all
routes have the same length, and that the effect of getting a bestRoute is to eliminate any
further processing on the current routes.
1. A single worker processes routes serially, whereas multiple workers process routes in
parallel. When a multi-worker system finds and shares a new bestRoute it could save
processing on routes that the single worker system would have processed before finding
that new bestroute.
2. The new bestRoute could have been discovered in a route that requires little processing
while there is another route that requires far more processing. If there is only 1 worker then
it might have to complete the processing of the slow route whereas with more than 1 worker
the benefit of the new bestRoute would apply for most of the processing of that route.
[Timing diagrams for cases 1 and 2: each compares 1 worker with 3 workers over time; in both cases the speedup factor is 2.5 / 0.5 = 5.]
3. The new bestRoute could be discovered very early in the processing of one route; in that
case the benefit of the new bestRoute would apply to the current route of each worker from
a very early stage, whereas with 1 worker many of these routes will have been completely
processed before the bestRoute is found.
4.4.7 Analysis of the use of the framework
Flexibility in consistency models and coherence protocols
The mechanisms for selecting consistency models and coherence protocols were found to
be very simple and flexible; this aspect of the framework works extremely well.
Alternative ClassType classes could be defined with different combinations of consistency
models and coherence protocols for different variables. Each of these could be applied
from the command-line. This made the process of changing the setup of the program quite
simple.
Programmability
Parallelising the TSP algorithm required a great deal more programming than was
anticipated. This was because the work items could not be pre-sorted into a list, therefore
the already-implemented mechanism for allocating work items (i.e. extracting them from an
array using array indices) could not be used. It was necessary instead to provide an
alternative mechanism in the framework which allows work items to be extracted by calling
a method on the routeTable.
The modifications to the TSP program itself were exactly as indicated in
section 3.4. The only changes that were required to the body of the TSP’s work method
were to call the share() method on the bestroute at the start of the program, and to declare an
intention before an attempt to update the bestroute and close this intention afterwards.

[Timing diagram belonging to case 3 above: 1 worker compared with 3 workers over time; speedup factor = 1.25 / 0.25 = 5.]
Customisations
A new “Access” intention was added in addition to the “read” and “write” intentions. As
described in section 4.4.2, this intention first reads the value at the master and then decides
whether it should overwrite it. It was necessary in order to parallelise the setRouteIfBest
method, which ensures that a worker only updates the bestRoute value of the DSM system
as a whole if its value is lower than that currently prevailing. The new intention class was
easily created by combining the functionality of the two existing intentions, performing the
read first and then the write depending on its result.
The Release Consistency model was customised to use CREW locks (concurrent read,
exclusive write). This was required so that all workers could check the bestRoute value,
but only one could update it at a time. The pre-defined class for the distributed CREW lock
contained in concurrency control eased the process of making this change. All that was
required was for the lock to be applied to the data being managed by the consistency model.
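The CREW discipline can be sketched with Java's standard read/write lock as a stand-in for the framework's distributed CREW lock; the class and method names here are illustrative:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of CREW (concurrent read, exclusive write) locking applied to
// bestRoute: any number of workers may read the value concurrently, but
// only one at a time may update it. Java's standard read/write lock is
// used here as a stand-in for the framework's distributed CREW lock.
public class CrewLockSketch {

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private double bestLength = Double.MAX_VALUE;

    // Many readers may hold the read lock at the same time.
    public double readBest() {
        lock.readLock().lock();
        try { return bestLength; } finally { lock.readLock().unlock(); }
    }

    // Writers are exclusive: reads and other writes wait.
    public boolean updateIfBetter(double candidate) {
        lock.writeLock().lock();
        try {
            if (candidate < bestLength) { bestLength = candidate; return true; }
            return false;
        } finally { lock.writeLock().unlock(); }
    }

    public static void main(String[] args) {
        CrewLockSketch best = new CrewLockSketch();
        best.updateIfBetter(20);
        best.updateIfBetter(25);              // rejected: not better
        System.out.println(best.readBest());
    }
}
```

Reads dominate writes in the TSP (thousands of reads versus fewer than 15 updates), which is exactly the workload a CREW discipline suits.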
4.4.8 Conclusions for TSP
The conclusions that we have reached from the testing of the parallel TSP program are:
1. Replication Protocols achieve very close to linear speed-up for this algorithm over the
range we tested of 1 to 15 workers.
2. Homebased protocols achieve much less speedup because a bestRoute found by one
worker is not shared with the other workers.
3. Better than linear speed-up is possible for the Replication protocols. However it is only
achieved for certain data sets. Its achievement depends on the fact that finding a
bestRoute reduces the problem size for all workers. The data-dependent factors that
influence its achievement are: the order in which work items are passed out to different
workers, the size of the work items other workers are using when a new bestRoute is
found, and how early in the processing of a work item the bestRoute is found.
4. Difficulties with network traffic restricted the amount of time available for testing.
Thus full tests were completed only on Release Consistency; no full tests were
completed on Lazy Release Consistency.
5. A number of customisations were introduced in order to optimise the Release
Consistency model for this application. These changes were eased by being able to
reuse existing code for the read and write intentions and for the CREW lock.
6. The flexibility in choosing and changing combinations of Consistency model and
coherence protocol enables a range of coherence protocols to be tested. Choosing a
replication protocol over a homebased protocol was the factor that had the biggest effect
on the speed-up curve. DISOM, the other DSM framework that we mentioned, would
have been unable to apply different coherence protocols and so would not have been as
appropriate for this application.
7. The framework was found to be flexible and customizable; however it was quite
   difficult to program this particular application because it does not allocate work items in
   the manner the framework was designed to accommodate.
8. The difficulties we did have in programming this application were due to the
mechanism by which work items were issued. Other than this it was relatively easy to
program. Indeed at this stage, with a deeper knowledge of the framework, we can see
that there were easier ways to solve this problem.
4.5 LU DECOMPOSITION
LU Decomposition is a matrix transformation problem. Given any square matrix it is
possible to transform it into two triangular matrices, one upper triangular and one lower
triangular. This transformation is useful for simplifying many matrix algebra problems, e.g.
solving a set of linear equations.
In fact because the lower-triangular matrix always has a value of 1 in each diagonal
element, it is possible to represent the two triangular matrices using one full square matrix.
Because we know the lower-triangular always has 1 on the diagonal we can enter the
diagonal elements from the upper-triangular matrix into the diagonal elements of the square
matrix. This is convenient because there is an algorithm that can compute this combined
lower and upper matrix (hence the description LU decomposition).
4.5.1 The Algorithm
The algorithm as normally performed requires two stages of computation to be performed
on a large part of the whole matrix for each diagonal element in the matrix. These are:
For each diagonal element in the matrix
1. Divide each element in the column below the current diagonal element by that diagonal
element.
2. For each element in the square below and to the right of the diagonal subtract from it the
product of the corresponding elements in the column and row on which the diagonal lies.
Fig. 4.13 Simple algorithm for LU Decomposition
This algorithm is best subdivided into subtasks for parallelisation by dividing the matrix
into columns and treating each column separately. However because it is more convenient
to manipulate rows than columns in Java, the matrix was inverted (i.e. transposed, with
rows and columns interchanged) before performing the calculations and at the end
reinverted to produce the correct result. This allows all
operations to be carried out on rows. From this point on the algorithm will be explained in
terms of how changes were made to the rows of the inverted matrix.
This algorithm could be programmed as:

for each diagonal element a(i)(i) {
    for j = i+1 to end of row {
        a(i)(j) = a(i)(j) / a(i)(i)
    }
    for l = i+1 to end of row {
        for m = i+1 to end of column {
            a(l)(m) = a(l)(m) - a(l)(i) * a(i)(m)
        }
    }
}
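The pseudocode above can be sketched concretely in Java as a minimal, in-place sequential version. The class and method names here are ours, not part of the DSD framework, and no pivoting is performed (a zero diagonal element would fail):

```java
// Minimal in-place sketch of the sequential row-oriented algorithm above.
// After the transposition described in the text, the result combines the
// two triangular factors in one square matrix. No pivoting is performed.
public class SequentialLU {
    static void decompose(double[][] a) {
        int n = a.length;
        for (int i = 0; i < n; i++) {
            // Divide each element to the right of the diagonal by the diagonal.
            for (int j = i + 1; j < n; j++) {
                a[i][j] = a[i][j] / a[i][i];
            }
            // Subtract the product of the corresponding row and column
            // elements from the square below and right of the diagonal.
            for (int l = i + 1; l < n; l++) {
                for (int m = i + 1; m < n; m++) {
                    a[l][m] = a[l][m] - a[l][i] * a[i][m];
                }
            }
        }
    }
}
```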
This algorithm could be parallelised by having the workers perform the calculations for a
given diagonal on one row. However this approach is very unsuitable, because the
number of calculations to be done on each row is n-d, where n is the number of elements in
the row and d is the current diagonal. Each row would have to be updated and returned to
the master once for every row above it. This means that the computation-to-communication
ratio is of the order of n computations to 1 communication, which is too low. The workers
in a parallel program using this algorithm would spend most of their time waiting for their
results to be returned to the master and for new work items to be delivered to them, not
actually working on the problem. The parallel program would be many times slower than a
standalone program, thus defeating the object of parallelising the application in the first
place.
In order to improve the parallelisability of the algorithm, rather than performing one
iteration for each row with each worker updating one row at a time, it would be better to get
each worker to perform all the calculations for one row. Each worker is given a row and
then performs all the transformations for that row that would have been performed for each
of the diagonal elements above that row in the matrix. Thus instead of each worker
performing one iteration of row transformation for one row, it performs all the
transformation iterations for the row it has been assigned.
Thus for each row a worker will perform the following calculations:

// For each previous diagonal, subtract the product of the corresponding
// row and column elements from each element to the right of that
// diagonal on this row.
for (k = 0; k < row; k++) {
    while (rowNotUpdated[k]) {
        // Forces the worker to wait until other workers have finished
        // updating the row it wants to read.
    }
    for (j = k + 1; j < n; j++) {
        A[row][j] = A[row][j] - A[row][k] * A[k][j];
    }
}
// Divide all to the right of the diagonal by the diagonal.
for (j = row + 1; j < n; j++) {
    A[row][j] = A[row][j] / A[row][row];
}
// Set flag so others will now be able to read this row.
rowNotUpdated[row] = false;
Fig 4.14 Better algorithm for LU Decomposition
Thus the number of multiplications or divisions to be done for each row i in an nxn matrix
is n-j-1 multiplications (each with a matching subtraction) for each previous row j, plus
n-i-1 divisions at the end, so the total for each row i is of the order of:

sum_{j=0}^{i} (n-j-1) = [((n-1)^2 + (n-1)) - ((n-i-2)^2 + (n-i-2))] / 2 = O(n^2/2)
This has greatly increased the number of computations per row and ensured that each row
only has to be updated once. This should enhance the parallelisability of the algorithm by
increasing the number of computations in one visit to a row and reducing communication by
only requiring one visit to each row. The combination of these two changes increases the
computation-to-communication ratio, which is a good indicator of parallelisability.
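The per-row operation count can be sanity-checked with a short calculation. The helper below is our own illustrative code, comparing a direct count of the multiplications for row i against the closed form:

```java
// Illustrative check of the per-row multiplication count for LU
// decomposition: a direct sum against the closed form.
public class LUOpCount {
    // Direct count: n-j-1 multiplications for each previous row j = 0..i.
    static long direct(int n, int i) {
        long ops = 0;
        for (int j = 0; j <= i; j++) {
            ops += n - j - 1;
        }
        return ops;
    }

    // Closed form: (((n-1)^2 + (n-1)) - ((n-i-2)^2 + (n-i-2))) / 2.
    static long closedForm(int n, int i) {
        long a = (long) (n - 1) * (n - 1) + (n - 1);
        long b = (long) (n - i - 2) * (n - i - 2) + (n - i - 2);
        return (a - b) / 2;
    }
}
```

For n = 600 both give 599 multiplications for the first row, rising towards n^2/2 for the last rows, which is the shape plotted in Fig 4.15.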
However the speed-up characteristics that this algorithm can offer are limited by two facts:
the interdependence of workers and the low computation-to-communication ratio.
Interdependence of workers
The workers are highly interdependent because, as the fig. 4.14 above shows, in order for a
worker to complete the transformations on any particular row it must be able to read the
transformed values of every one of the rows above it. Thus no worker can complete a row
until all the rows above it have been completed, and so it must wait until all workers have
completed all rows above it in the matrix. If one worker is slow then all the other workers
will be delayed; none of them will be able to complete a row below that worker's row until
that worker has completed its current row.
Low computation-to-communication ratio
Even with the redesign of the algorithm to improve the computation-to-communication
ratio, it is still much lower for LU decomposition than it is for other matrix applications that
have been successfully implemented over the framework, such as matrix multiplication
(Weber et al., 1998).
The matrix multiplication algorithm that was parallelised involved issuing rows of a matrix
A out to workers, where each worker would multiply its row by each of the columns of a
matrix B in order to calculate one row of the result matrix C. Thus each work item involved
issuing one row of 600 elements and receiving one row of 600 elements back as the result.
(Note that it has been found that an array of 600 elements is the largest that can be reliably
passed around the network).
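The matrix multiplication work item described above amounts to multiplying one row of A by every column of B; a minimal sketch, with our own naming:

```java
// One matrix-multiplication work item: a single row of A times every
// column of B produces one row of the result matrix C.
public class MatMulWorkItem {
    static double[] rowTimesMatrix(double[] aRow, double[][] B) {
        int cols = B[0].length;
        double[] cRow = new double[cols];
        for (int col = 0; col < cols; col++) {
            double sum = 0.0;
            for (int k = 0; k < aRow.length; k++) {
                sum += aRow[k] * B[k][col];
            }
            cRow[col] = sum;
        }
        return cRow;
    }
}
```

For a 600-element row this is 600 multiplications per result element, or 360,000 per work item, which is why its computation-to-communication ratio is so much more favourable than that of LU decomposition.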
Plotting the computations per row for a 600x600 matrix multiplication and for LU
decomposition for a range of matrix sizes results in the graph shown below:
Fig 4.15 LU Decomposition and Matrix Multiplication: Computations per Row for a 600x600 Matrix
(y-axis: computations per row, 0 to 400,000; x-axis: row number; series: LU Decomp. 600, Multiplication)
For a 600x600 matrix the number of LU computations per row is at maximum one half of
that for every row of the matrix multiplication algorithm. Indeed for the first 100 rows of
the matrix the computations per row only get up to one seventh of the computations for
each row of the matrix multiplication.
However the amount of communication per row is actually higher in the case of LU
Decomposition. For each row of the LU the results of the computation must be propagated
to all of the workers, and to the master. For matrix multiplication the result need only be
sent to the master, not the other workers. Indeed, if Lazy Release Consistency is used,
updates to the master can be delayed until the end for matrix multiplication, thereby
achieving close to linear speed-ups on this framework. This is not possible for LU
decomposition, which cannot allow updates to be delayed until all rows are
processed because all workers need to see the results for all previous rows.
Thus LU decomposition imposes a much higher communications burden on the system than
does matrix multiplication while requiring only a fraction of the computation involved. Its
computation-to-communication ratio is much lower than for matrix multiplication.
For these two reasons we would expect that LU decomposition will not produce speed-up
graphs that are anything like as close to linear as those that have been obtained for matrix
multiplication and for the travelling sales person.
Attempts to improve the speed-up characteristics
In an attempt to improve the speed-up characteristics by raising the computation-to-
communication ratio, the algorithm was changed so that each worker would process several
contiguous rows as a single item of work. This reduced the amount of communication by
sending a group of rows at less frequent intervals and by ensuring that towards the end of a
batch of rows the worker would be reading rows that it itself had processed, so there would
be no communication delay in receiving the updated status of those rows. However this
meant that the span of the matrix across which workers were currently working was much
larger and so the potential for latencies due to workers waiting for other workers to
complete their work was much greater. It was found that this change produced no
improvement in speed-up.
4.5.2 Parallelisation of LU Decomposition
The LU decomposition algorithm was much simpler to parallelise than was the TSP
algorithm. This is because the list of work items that are to be allocated to the workers (for
the LU this is each row of the matrix) is known at the start of the algorithm. Therefore they
can be referred to by their index number in this list and so the method of allocating work
that has already been implemented can be used. Thus only the standard classes described in
section 3.4 have to be created for LU decomposition.
Customisation of protocols required for LU Decomposition
It was found necessary to customize the Release Consistency model in order to get the
application to run properly over the framework. The consistency model, as implemented
for previous applications, had been designed so that as soon as a worker had received all the
initialisation data it needed to set up a problem it could immediately proceed to processing
work items. This eliminated any waiting latency that could be introduced by making each
worker wait until all workers had received the initialisation data before any worker could
start doing work.
In the case of the LU Decomposition the amount of computation to be done on the first few
rows is very low (599 divisions for the first row). This meant that when the first worker to
receive all its initial data started working on the first row it would actually finish processing
this row while some other workers still had not received all the initial data. Therefore when
it published this data, those workers were not ready to listen for that data.
This meant that those workers would never see that the first row had been updated. So they
could never process any item because in order to start processing an item they would have
to read the updated state of the first row, something which they had missed when it was sent
out. All other workers would then be stopped in turn, because the stopped worker would
never process its item, whose result all the other workers would need to access if they
were to process any item further down in the matrix. Thus the whole program would come
to a halt after only one or two items had been processed.
This problem was overcome by implementing a barrier which did not allow any worker to
start processing items until all had received the initialisation data. This was quite simple, as
a base class for handling barriers had been provided in the barrier server. This class was
simply extended to provide the specific functionality of getting all the workers to wait at the
barrier.
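The effect of that customisation can be illustrated with the standard Java CyclicBarrier, which we use here purely as a stand-in; the framework's own barrier-server base class is what was actually extended:

```java
import java.util.concurrent.CyclicBarrier;

// Stand-in illustration of the initialisation barrier: no worker starts
// processing work items until every worker has received its initial data.
public class InitBarrier {
    private final CyclicBarrier barrier;

    public InitBarrier(int workerCount) {
        this.barrier = new CyclicBarrier(workerCount);
    }

    // Each worker calls this once its initialisation data has arrived;
    // the call blocks until all workers have reached the barrier.
    public void awaitAllInitialised() throws Exception {
        barrier.await();
    }
}
```

Once every worker has passed the barrier, the first published rows can no longer be missed, because every worker is already listening for updates.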
4.5.3 Predicted Effect of the Protocols
HomeBased Protocols
The HomeBased protocols allow workers to perform updates by means of messages that are
sent only to the master, not the other workers that are part of the DSM system. This means
that each worker does not see the effect of updates made by other workers, because the
worker’s local copy of the data is never updated when another worker updates the master.
For this reason the HomeBased protocols cannot be used for the LU decomposition
algorithm, because this algorithm requires that each worker sees all of the updates made by
all other workers.
Replication Protocols
The replication protocols perform updates by means of messages that are broadcast to all
nodes that are part of the DSM system. Thus when one worker updates the master with
its transformations on one row then the same data is also multicast to all the other workers
as well. This sharing of updated data with other workers is essential for this algorithm
because workers need to read the results obtained by all other workers on previous rows in
order to complete the transformations for any given row.
We would expect that Replication3 which uses the reliable LRMP multicast protocol would
perform better than Replication2 which uses UDP multicast, because of the large volume of
data to be multicast (each worker has to multicast a row of 600 doubles at the end of
processing one row, and all of the other workers need to receive this data). The replication
protocols suit more closely integrated and interdependent algorithms, which is exactly what
this algorithm is.
4.5.4 Results of speed-up tests
As already noted, the objective of parallelising an application is to reduce the time taken to
perform its function by enabling more processing power to be applied to the problem. The
most appropriate indicator of how well a parallel programming system delivers increases in
processing power with the addition of more workers is the speed-up curve.
The performance of the parallelised LU Decomposition program was tested by recording
how long it took to complete the transformation of a 600x600 matrix into a matrix
combining its lower and upper triangular matrices. This test was repeated ten times for a
given number of workers. A set of ten tests was done for the following numbers of
workers: 1, 3, 6, 9, 12, 15.
This procedure was performed for the parallelised LU decomposition program for each of
the following configurations of Consistency Model and Coherence Protocol:
Release Consistency, Replication Protocol 2
Release Consistency, Replication Protocol 3
The full results are tabulated in appendix 6.4, and summarized in Table 3 below:
Table 3: 600x600 LU Decomposition Summary Table

Average Time (s)        Number of Workers
                        1     3     6     9     12    15
RC10 Replication3       75    40    38    37    37    37
RC10 Replication2       78    47    44    43    43    43

Average Speedup Factor  Number of Workers
                        1     3     6     9     12    15
RC10 Replication3       1     1.88  1.97  2.04  2.05  2.04
RC10 Replication2       1     1.66  1.77  1.79  1.80  1.80
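The speed-up factors in the table are simply the one-worker time divided by the n-worker time, e.g. 75/40 ≈ 1.88 for Replication3 with three workers. As a trivial illustration (our own helper, not framework code):

```java
// Relative speed-up: time with one worker divided by time with n workers.
public class Speedup {
    static double factor(double timeOneWorker, double timeNWorkers) {
        return timeOneWorker / timeNWorkers;
    }
}
```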
The graph of the speed-up factors obtained against the number of workers working on the
problem is shown in Fig. 4.16 below:

Fig. 4.16 Relative Speedup Graph for LU Decomposition
(y-axis: speed-up factor, 0 to 10; x-axis: number of workers; series: Linear Speedup, RC10 Replication3, RC10 Replication2)
4.5.5 Analysis of Results
These results allow us to draw a number of conclusions about the speed-up performance of
the framework for the LU Decomposition algorithm.
Very little speed-up
There is very little speed-up offered for this algorithm; indeed for more than three workers
there is almost no additional speed-up gained by adding more workers. Between one
worker and three workers the speed-up is 1.88 for Replication3 and 1.66 for Replication2.
The value of 1.88 represents a reasonable speed-up for three workers.
This low value is due to the issues that have already been identified for this application: a
very low computation-to-communication ratio (particularly for the first 100 rows of the
matrix) and the very high interdependency between the workers.
The difference between the speed-up for the replication protocols and linear speed-up is
many times greater than the differences between the two replication protocols as outlined
below. For as few as three workers it is six times greater. This indicates that the delays are
not primarily due to lost packets, but rather due to latencies introduced by waiting on other
workers, the frequency of communication and the volume of data to be transferred on each
occasion.
This is reinforced by the fact that the time between rows being processed changes little over
the course of the execution of the program, even though the amount of computation for later
rows is many times larger than that for the early rows.
Gustafson’s Law
By applying Gustafson’s scaled speedup (Gustafson, 1988) we should be able to ensure that
for larger problem sizes the program could operate within the portion of the speed-up
graph that does produce speed-up. However we were unable to test this due to limitations
on the size of arrays that could be passed reliably around the network.
Faster incorporation of updates
The speed-up performance of the framework would be considerably enhanced if the time
taken to incorporate new updates at the workers and at the master could be reduced. This
code is called hundreds of thousands of times during the application, so optimisations to it
should have a significant effect. However no work has been done on this at this time.
Reliable multicast faster than unreliable
The Replication3 protocol which uses the reliable LRMP multicasting is on average 15%
faster than Replication2 which uses UDP multicast. This difference is much greater than it
was for the TSP algorithm and reflects the greater amount of communication that is
required by this algorithm. By enforcing greater reliability, Replication3 reduces the delays
caused by packets that were not received.
4.5.6 Analysis of the use of the framework
Flexibility in consistency models and coherence protocols
The mechanisms for selecting consistency models and coherence protocols were found to
be very simple and flexible; this aspect of the framework works extremely well.
The availability of the replication protocols was necessary for this application. The
homebased protocols were unsuitable for the type of data sharing required in the program.
The option of changing coherence protocols was not available in the DSM framework
DISOM, so it would have been more restricted in dealing with this application.
Programmability
The programming of the LU Decomposition algorithm so that it could run in parallel over
the framework was quite straightforward. This is because the work items are known in
advance, i.e. each row of the matrix is one work item, and they can be accessed using array
indices, i.e. each row can be accessed using its row number. This was simplified by the
inversion of the matrix that was performed at the start of the program so that operations
were performed on rows instead of on columns. Therefore the already-implemented
mechanism for allocating work using array indices could be applied directly.
The modifications to be performed to the LU program itself were exactly as indicated in
section 3.4. The only changes that were required to the body of the LU's work method
were to call the share() method on the matrix at the start of the program, to declare an
intention before an attempt to update a row, and to close this intention afterwards.
Customization of protocols
As we have seen, in order to get the application to run even with the replication protocols it
was necessary to customize the Release Consistency model in order to ensure that all
workers had received the initialisation data before the results for the first rows were
published. This change was needed because the amount of processing on the first rows of
the LU decomposition was almost zero when compared to that required for the TSP and
other algorithms which had been parallelised in the past such as matrix multiplication.
In this case the capacity to customise the consistency model was essential; without it the
program would not have run at all. The addition of the barrier was the means of
stopping the program from locking up with all workers waiting for a worker that could not
proceed. Programming of the barrier was relatively easy because a barrier class had already
been defined in the concurrency control.
4.5.7 Conclusions for LU Decomposition
The conclusions that we have reached from the testing of the parallel LU Decomposition
program are:
1. The speed-up values obtained for this application were very low. This was due to the
very low computation-to-communication ratio.
2. Only the replication protocols were suitable for this application because it requires all
workers to see the updates of all other workers, which the homebased protocols cannot
provide as they are currently designed.
3. Reliable multicasting produced a 15% performance improvement over unreliable
multicasting.
4. The problem was quite easily programmed to run over the framework. No
modifications to the program other than those intended were required.
5. The framework was found to be flexible and customizable. The severe communication
and synchronisation demands of this application meant that it could never have been run
without the ability to customize the framework. For this application the option to use
the framework as a whitebox framework was essential to a successful implementation
of the parallel program.
4.6 Summary
This chapter explained the approach that was used to test the framework and the results
obtained. The two algorithms selected to test the framework were outlined and in each case
the results and conclusions actually obtained from the process of programming and testing the
algorithm are laid out. Finally conclusions are drawn as to how well the framework has
delivered on the benefits it claims to offer.
5 Review of the DSD Framework
5.1 Introduction
In this chapter an overall assessment is made of how well the framework has delivered on
the benefits that it promised for developing DSM applications for parallel programming.
First it looks at whether the DSD really is a framework, then at what kind of framework it
is, and finally at how well it delivers the benefits it claims to offer to developers of DSM
applications.
5.2 Is DSD a genuine framework?
In other words does it meet the definition that we have taken:
“A framework is a set of classes that embodies an abstract design for solutions to a family
of related problems … frameworks provide for reuse at the largest granularity”
(Johnson & Foote, 1988)
We can clearly accept that it meets this definition. The DSD is a set of classes that we have
used to implement DSM parallel applications with differing sharing and communication
characteristics. We have done this by using alternative classes derived from the base
classes for consistency models and coherence protocols. Thus the key classes that we have
used have embodied this abstract design involving a partition of the problem into two co-
operating elements (consistency model and coherence protocol) which can be implemented
in many different ways. The reuse of this design is reuse at a large granularity, because
what we are reusing is a structure that can be applied to all DSM applications, since it is
based on an analysis of the fundamental elements required to deliver DSM.
The characteristics that we identified frameworks as having were: reusability, modularity,
extensibility, and inversion of control. DSD has demonstrated all of these features in the
development and testing that we have performed. Its basic structure of consistency model
and coherence protocol is reused in every application developed using the framework. In
addition it has a suite of implemented consistency models and coherence protocols that are
available for re-use in parallel applications.
It is modular in that the user can easily replace one consistency model with another in an
application, and it is extensible in that the user can develop a new consistency model or
coherence protocol which can be applied to variables in exactly the same way as the pre-
defined ones. The framework delivers inversion of control because it is the framework that
assigns work items to the worker processes and that determines what view they have of the
shared data. Therefore we can at least conclude that DSD is a genuine framework.
5.3 What kind of a framework is it?
Two kinds of framework were identified: whitebox and blackbox. Whitebox frameworks
are used by extending the classes of the framework to create specific functionality. We
have seen this done to create new varieties of consistency model and coherence protocol.
However the framework can also be used in a blackbox mode because there is a good range
of protocols and models predefined and ready to be used and the mechanism for choosing
and changing them is relatively easy to use. Therefore we can conclude that the framework
combines both approaches, and this is because it is a reasonably mature whitebox
framework.
5.4 How effective is it at supporting DSM?
We identified a number of benefits that could be expected from the framework given how it
was defined.
Providing a selection of predefined consistency models and coherence protocols
We have found that the coherence protocols and consistency models that have been
supplied offer significantly different characteristics to the developer. Indeed for the LU
program some of them could not be used at all. This demonstrates the benefit of having a
range to select from. For the TSP program we saw that there were very significant
performance differences between the replication and homebased protocols, showing again
the benefits of being able to select a model and a protocol that are appropriate to the
application.
In addition the range of pre-defined protocols allows non-expert users a range of policies
without requiring them to gain the expertise to extend the framework in order to have
options in this area.
Per-object combination of consistency models and coherence protocols
This enables programmers to take advantage of application-specific semantics on particular
applications in a way which is not allowed by other DSM systems. This feature was not
tested in the applications that we used. Many different items were shared but there was no
need to have different sharing characteristics for different instances of the same type in
either of the test applications that we ran.
Customisation of coherence protocols and consistency models
This feature was particularly important in the LU application, where it enabled the
modification of the Release Consistency model to allow the program to run successfully.
Also in the case of the TSP optimisations were introduced that allowed each worker to
compare routes with the best route that had been found across the whole DSM network.
Flexibility
The mechanism for choosing and changing the combination of coherence protocol and
consistency model for different variables worked well and aided the testing process.
Programmability
For both applications it was found that the framework was relatively easy to program, with
the exception of the manner in which the work allocation mechanism had to be
reprogrammed for the TSP.
Speed-up
Impressive speed-up curves were obtained for the TSP. Speed-up was obtained for the LU
decomposition only up to three workers. Again the ability to combine coherence protocols
and consistency models and to customise them to the needs of the application were key to
obtaining improved speed-ups.
5.5 Problems
Learning curve
There was a very steep learning curve to be climbed at the start in terms of coming to
understand how the framework handles an application and passes various data to the
workers. However this may have been exaggerated in this case because the application that
was done first was the TSP which required going much deeper into the framework than did
the LU program. Perhaps if the order had been reversed the learning curve would have
been more easily negotiated.
Network interference
The framework was not fault tolerant in respect of high network traffic. However fault
tolerance was not one of the key issues that the framework was built to test. For this reason
we can say that this problem does not affect our assessment of how well the framework
meets its objectives.
5.6 Conclusion
The DSD framework was very successful at supporting DSM parallel applications. We
were able to implement two applications, one of them very demanding on the network. By
applying the flexibility that the framework offered we were able to run both as parallel
applications and to achieve very impressive speed-ups on one of them. According to its
authors “the rationale for the DSD framework is to provide a flexible structure for building
DSO systems from the most appropriate elements” (Weber et al., 1998). The conclusion of
this dissertation is that this rationale has been delivered to a very high extent.
6 Appendices
6.1 Travelling Sales Person 17 Cities
RC9 Replication3 Nodes
1 3 6 9 12 15
Reading 1 684 213 114 81 67 52
Reading 2 686 213 114 81 67 52
Reading 3 688 213 114 81 67 52
Reading 4 687 212 114 81 67 52
Reading 5 688 214 114 81 67 52
Reading 6 692 212 114 81 68 52
Reading 7 686 212 114 81 67 52
Reading 8 680 212 114 81 67 52
Reading 9 684 213 114 81 67 52
Reading 10 683 213 114 81 67 52
Average 686 213 114 81 67 52
Standard Deviation 3 1 0 0 0 0
RC9 Replication2 Nodes
1 3 6 9 12 15
Reading 1 680 211 122 85 72 55
Reading 2 674 210 113 99 64 52
Reading 3 681 210 128 80 68 54
Reading 4 674 212 113 84 68 52
Reading 5 679 210 113 80 70 52
Reading 6 675 210 120 85 67 55
Reading 7 687 211 113 86 67 52
Reading 8 683 214 122 81 68 52
Reading 9 674 210 113 89 66 55
Reading 10 674 212 127 84 66 53
Average 678.1 211 118.4 85.3 67.6 53.2
Standard Deviation 4 1 6 5 2 1
RC9 HomeBased1 Nodes
1 3 6 9 12 15
Reading 1 699 296 195 130 110 107
Reading 2 688 299 189 133 104 92
Reading 3 694 297 187 138 116 112
Reading 4 690 305 186 136 110 92
Reading 5 698 308 193 129 118 92
Reading 6 690 299 196 139 115 106
Reading 7 691 307 190 140 111 89
Reading 8 697 305 187 133 109 102
Reading 9 694 306 189 137 114 98
Reading 10 698 305 195 132 107 101
Average 694 303 191 135 111 99
Standard Deviation 4 4 4 4 4 8
RC9 HomeBased2 Nodes
1 3 6 9 12 15
Reading 1 691 324 215 146 110 103
Reading 2 694 304 184 132 111 101
Reading 3 683 315 200 132 122 94
Reading 4 686 321 188 130 116 99
Reading 5 695 323 217 138 118 103
Reading 6 692 310 204 136 122 104
Reading 7 690 307 187 144 117 102
Reading 8 685 322 210 143 129 95
Reading 9 688 308 193 133 131 103
Reading 10 690 313 208 140 119 94
Average 691 315 201 137 120 100
Standard Deviation 4 7 12 6 7 4
6.2 Travelling Sales Person 17 Cities - Additional Data
Replication3 RC9 Nodes
1 2 3 4 5 6 9 12 15
Reading 1 684 350 213 151 137 114 81 67 52
Reading 2 686 346 213 152 133 114 81 67 52
Reading 3 688 351 213 154 141 114 81 67 52
Reading 4 687 344 212 148 133 114 81 67 52
Reading 5 688 347 214 150 137 114 81 67 52
Reading 6 692 345 212 147 134 114 81 68 52
Reading 7 686 343 212 148 136 114 81 67 52
Reading 8 680 345 212 152 135 114 81 67 52
Reading 9 684 346 213 151 136 114 81 67 52
Reading 10 683 348 213 148 137 114 81 67 52
Average 686 347 213 150 136 114 81 67 52
Standard Deviation 3 3 1 2 2 0 0 0 0
Replication2 RC9 Nodes
1 2 3 4 5 6 9 12 15
Reading 1 674 336 211 158 129 122 85 72 55
Reading 2 679 337 210 159 128 113 99 64 52
Reading 3 675 335 210 159 129 128 80 68 54
Reading 4 687 332 212 157 130 113 84 68 52
Reading 5 683 333 210 158 129 113 80 70 52
Reading 6 674 331 210 159 129 120 85 67 55
Reading 7 674 332 211 158 129 113 86 67 52
Reading 8 674 334 214 160 128 122 81 68 52
Reading 9 682 333 211 158 128 113 89 66 55
Reading 10 677 338 212 157 130 127 84 66 53
Average 678 334 211 158 129 118 85 68 53
Standard Deviation 5 2 1 1 1 6 6 2 1
6.3 Travelling Sales Person 16 Cities
Replication3 RC9 Nodes
1 3 6 9 12 15
Reading 1 108 36 18 13 10 9
Reading 2 107 36 18 13 10 9
Reading 3 107 36 18 13 10 9
Reading 4 108 36 18 13 10 9
Reading 5 108 36 18 13 10 9
Reading 6 107 36 18 13 10 9
Reading 7 108 36 18 13 10 9
Reading 8 108 36 18 13 10 9
Reading 9 107 36 18 13 10 9
Reading 10 107 36 18 13 10 9
Average 108 36 18 13 10 9
Standard Deviation 1 0 0 0 0 0
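The speedup behaviour of these runs can be read directly off the Average row. As an illustration only (assuming the readings are elapsed execution times, and using the Average row of the 16-city table above; the variable names are not part of the framework), the following sketch computes speedup T1/Tp and parallel efficiency:

```python
# Illustrative sketch: speedup and efficiency from the Average row
# of the 16-city Travelling Sales Person table (Replication3, RC9).
# Assumes the readings are elapsed execution times.

nodes = [1, 3, 6, 9, 12, 15]
avg_time = [108, 36, 18, 13, 10, 9]  # Average row of the table above

t1 = avg_time[0]  # single-node time
for n, tp in zip(nodes, avg_time):
    speedup = t1 / tp          # how much faster than one node
    efficiency = speedup / n   # speedup per node
    print(f"{n:2d} nodes: speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")
```

At 15 nodes this gives a speedup of 12.0 and an efficiency of 0.80, consistent with the near-linear scaling visible in the table.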
6.4 LU Decomposition
Table 4
Replication3 RC10
Nodes 1 3 6 9 12 15
Reading 1 76 41 38 37 37 37
Reading 2 75 39 38 37 37 37
Reading 3 78 40 37 37 37 36
Reading 4 76 38 39 36 37 37
Reading 5 74 40 38 37 36 38
Reading 6 73 42 39 37 37 36
Reading 7 74 40 38 38 37 37
Reading 8 76 40 38 37 36 37
Reading 9 75 41 39 37 37 37
Reading 10 77 40 38 37 37 37
Average 75 40 38 37 37 37
Standard Deviation 2 1 1 0 0 1
Table 5
Replication2 RC10
Nodes 1 3 6 9 12 15
Reading 1 75 46 44 43 43 43
Reading 2 79 47 45 44 43 43
Reading 3 79 46 43 43 43 43
Reading 4 77 48 43 45 44 43
Reading 5 80 46 44 43 43 44
Reading 6 76 47 44 43 43 44
Reading 7 79 47 45 43 44 43
Reading 8 77 48 45 44 43 43
Reading 9 78 46 44 43 43 43
Reading 10 79 48 43 43 44 44
Average 78 47 44 43 43 43
Standard Deviation 2 1 1 1 0 0
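The Average and Standard Deviation rows in these tables can be reproduced with a short script. A minimal sketch (using Python's statistics module on the ten 1-node readings of Table 5 above, and assuming the quoted deviation is the sample standard deviation, rounded to the nearest integer):

```python
# Sketch: recomputing the Average and Standard Deviation rows.
# Data: the ten 1-node readings from Table 5 (Replication2, RC10).
# Assumes the quoted figures are the mean and the sample standard
# deviation, each rounded to the nearest integer.
from statistics import mean, stdev

readings = [75, 79, 79, 77, 80, 76, 79, 77, 78, 79]

avg = round(mean(readings))  # 78, matching the Average row
sd = round(stdev(readings))  # 2, matching the Standard Deviation row
print(avg, sd)
```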
7 Bibliography
Amdahl, G.M., (1967). Validity of the single-processor approach to achieving large scale
computing capabilities. In: AFIPS Conference Proceedings , vol. 30, April 1967, pp. 483—
485.
Bennett, J.K., Carter, J.B., Zwaenepoel, W., (1990). Adaptive software cache management
for distributed shared memory architectures. In: Proceedings of the 17th Annual
International Symposium on Computer Architecture (ISCA ’90), May 1990, pp. 125—135.
Booch, G., (1994). Object-Oriented Analysis and Design, with Applications. 2nd Ed.
Benjamin/Cummings.
Carter, J.B., Bennett, J.K., Zwaenepoel, W., (1991). Implementation and performance of
Munin. In: Proceedings of the 13th ACM Symposium on Operating Systems Principles
(SOSP-13), October 1991, pp. 152—164.
Castro, M., Guedes, P., Sequeira, M., Costa, M., (1996). Efficient and flexible object
sharing. In: Proceedings of the 1996 International Conference on Parallel Processing
(ICPP ’96), August 1996, vol. 1, pp. 128—137.
Coulouris, G., Dollimore, J., Kindberg, T., (1994). Distributed Systems: Concepts and
Design. 2nd Ed. Addison Wesley.
Dubois, M., Scheurich, C., Briggs, F.A., (1988). Synchronization, coherence, and event
ordering in multiprocessors. IEEE Computer, 21(2) February 1988, pp. 9—21.
Fayad, M., & Schmidt, D. (1997). Object-Oriented Application Frameworks.
Communications of the ACM (Guest Editorial), Special Issue on Object-Oriented
Application Frameworks, Vol. 40, No. 10, Oct. 1997.
Fleisch, B.D., & Popek, G.J., (1989). Mirage: A Coherent Distributed Shared Memory
Design. In: Proceedings of the 12th ACM Symposium on Operating Systems Principles
(SOSP-12), December 1989, pp. 211—223.
Gamma, E., Helm, R., Johnson, R., Vlissides, J., (1995). Design Patterns: Elements of
Reusable Object-Oriented Software. Addison Wesley, 1995.
Gharachorloo, K., Lenoski, D., Laudon, J., Gibbons, P., Gupta, A., Hennessy, J., (1990).
Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors.
In: Proceedings of the 17th Annual International Symposium on Computer Architecture,
May 1990, pp. 15—26.
Goodman, J.R., (1989). Cache consistency and sequential consistency. Technical Report
61, IEEE Scalable Coherent Interface Working Group, March 1989.
Gustafson, J.L., (1988). Reevaluating Amdahl’s Law. Communications of the ACM, vol.
31(5), May 1988, pp. 532—533.
Johnson, R., & Foote, B., (1988). Designing Reusable Classes. Journal of Object-Oriented
Programming, June/July 1988.
Karp, R.M., (1972). Reducibility Among Combinatorial Problems. In: Complexity of
Computer Computations, 1972 (Proc. Sympos. IBM Thomas J. Watson Res. Center,
Yorktown Heights, N.Y., Plenum Press, 1972), pp. 85—103.
Keleher, P., Cox, A.L., Dwarkadas, S., Zwaenepoel, W., (1996a). Treadmarks: Shared
Memory Computing on Networks of Workstations. IEEE Computer, vol. 29(2), February
1996, pp. 18—28.
Keleher, P., (1996b). The Relative Importance of Concurrent Writers and Weak
Consistency Models. In: Proceedings of the 16th International Conference on Distributed
Computing Systems (ICDCS-16), May 1996, pp. 91—98.
Lamport, L., (1979). How to make a multiprocessor computer that correctly executes
multiprocess programs. IEEE Transactions on Computers, vol. C-28(9), September 1979,
pp. 690—691.
Lenoski, D.E., Laudon, J., Gharachorloo, K., Weber, W.-D., Gupta, A., Hennessy, J.L.,
Horowitz, M., Lam, M.S., (1992). The Stanford DASH Multiprocessor. IEEE Computer,
vol. 25(3), March 1992, pp. 63—79.
Li, K., & Hudak, P., (1989). Memory Coherence in Shared Virtual Memory Systems. ACM
Transactions on Computer Systems, vol. 7(4), November 1989, pp. 321—359.
Lipton, R.J., & Sandberg, J.S., (1988). PRAM: A Scalable Shared Memory. Technical
Report CS-TR-180-88, Department of Computer Science, Princeton University, September
1988.
Taha, H.A., (1992). Operations Research. 5th Ed. Macmillan, 1992.
Tanenbaum, A.S., (1995). Distributed Operating Systems. Prentice Hall, 1995.
Weber, S., Nixon, P.A., Tangney, B., (1998). A flexible framework for consistency
management in object oriented distributed shared memory. Submitted for publication.