
Instructor's Guide to

Parallel Programming in C

with MPI and OpenMP

Michael J. Quinn

July ��� ����


Contents

Preface

1 Motivation and History

2 Parallel Architectures

3 Parallel Algorithm Design

4 Message-passing Programming

5 The Sieve of Eratosthenes

6 Floyd's Algorithm

7 Performance Analysis

8 Matrix-vector Multiplication

9 Document Classification

10 Monte Carlo Methods

11 Matrix Multiplication

12 Solving Linear Systems

13 Finite Difference Methods

14 Sorting

15 The Fast Fourier Transform

16 Combinatorial Search

17 Shared-memory Programming

18 Combining MPI and OpenMP


Preface

This booklet contains solutions to the "pencil and paper" exercises found in Parallel Programming in C with MPI and OpenMP. It does not contain solutions to the programming assignments, except for a few of the shortest ones. Likewise, it does not report results for the benchmarking exercises, since the times that students observe will vary greatly from system to system.

I apologize for any mistakes that have survived my proofreading. If you identify any errors in this solutions manual, or if you have any ideas for additional exercises, I would appreciate hearing from you.

Michael J. Quinn
Oregon State University
Department of Computer Science
Corvallis, OR 97331

quinn@cs.orst.edu


Chapter 1

Motivation and History

I like to do this exercise in class during the first lecture. It helps motivate many of the concepts discussed in the course, including speedup, contention, and scalability.

a) The best way to sort cards by hand can spark an entertaining debate. I believe it is faster to sort the cards initially by suit, then to use an insertion sort to order each suit in turn while holding it in your hands. Knuth disagrees; see The Art of Computer Programming, Volume 3: Sorting and Searching (Addison-Wesley) for another opinion.

b) Assuming no great variance in card-handling skills, the best way for p people to sort p decks of cards is to let each person sort one deck of cards. Assuming person i requires ti time units to sort a deck of cards, the time needed for p persons to sort p decks of cards is max{t1, t2, ..., tp}.

c) The time needed for p people to sort one deck of cards depends heavily upon the parallel algorithm used. Students often use the following algorithm when 2 <= p <= 4: each person is assigned responsibility for one or two suits. The unsorted deck is divided among the people. Each person keeps the cards associated with his or her suits and tosses the other cards to the appropriate partner. After working through the unsorted deck, each person sorts the suits he or she has been assigned.

One of my classes discovered an interesting parallel algorithm for six people that exhibits both pipelining and data parallelism. Persons 1 and 2 distribute the cards, while persons 3, 4, 5, and 6 are each responsible for sorting the cards in a particular suit.

Person 1 starts with the deck of cards. He or she hands roughly one-half of the cards to person 2. Persons 1 and 2 work through their cards, tossing each card to one of the other persons, depending upon the suit. Persons 3, 4, 5, and 6 take the cards tossed to them by persons 1 and 2 and use insertion sort to order the cards in their suit. When these people have finished sorting their cards, they hand them to person 1, who concatenates them, completing the sort.

It requires ⌈log2 n⌉ folds to put n holes in a piece of paper with a single stroke of the hole punch.

Proof of correctness (by induction on the number of holes): Let f(i) represent the number of folds needed to put i holes in a piece of paper. Base case: with no folds of the paper, the punch can put one hole in it. Hence f(1) = 0 = ⌈log2 1⌉. Induction: assume f(n) = ⌈log2 n⌉ for 1 <= n <= k. Prove true for 2k: we want to make 2k holes. If we fold the paper in half ⌈log2 k⌉ times, we can make k holes (induction hypothesis). Hence if we fold the paper in half once more we can make 2k holes. Suppose k = 2^j + r, where 0 < r <= 2^j. Then 2k = 2^(j+1) + 2r, where 0 < 2r <= 2^(j+1). Hence

f(2k) = ⌈log2 k⌉ + 1 = (j + 1) + 1 = j + 2 = ⌈log2 2k⌉

Proof of optimality (by induction on the number of folds): Let h(i) denote the maximum number of holes that can be made with i folds of the paper. Base case: with no folds of the paper, the punch can put at most one hole in it, so h(0) = 1 = 2^0. Induction: assume h(n) = 2^n for 0 <= n <= k. Prove true for k + 1: folding the paper once more allows us to double the number of holes. Hence h(k + 1) = 2h(k) = 2 · 2^k = 2^(k+1).

Since at most 2^i holes can be made with i folds of the paper, an algorithm that makes j holes with ⌈log2 j⌉ folds of the paper is optimal.

a) Hand the cards to the accountant at the leftmost desk in the front row. She keeps the cards destined for her column and passes the rest to the accountant on her right, who does the same, and so on, until every accountant in the front row has her column's share of the cards. Meanwhile, the front-row accountants pass cards to the accountants behind them. The number of accountants receiving cards depends upon how many cards each accountant keeps: with 1,000 cards, if each accountant keeps k cards, then about 1,000/k accountants, spread evenly among the columns, will receive cards.

b) After each accountant has computed the total of the cards she controls, the subtotals are accumulated by reversing the flow used to distribute the cards. The last accountant in each column that received cards passes her subtotal to the accountant in front of her. This accountant adds the subtotal she received to her own subtotal and passes the sum to the accountant in front of her. Eventually the subtotals reach the front row. The accountants accumulate subtotals right to left until the leftmost accountant in the front row has the grand total.

c) Increasing the number of accountants decreases the total computation time but increases the total communication time. The overall time needed is the sum of these two times. For a while, adding accountants reduces the overall time needed, but eventually the decrease in computation time is outweighed by the increase in communication time. At this point adding accountants increases, rather than decreases, overall execution time. The point at which this occurs depends upon your estimates of how long an addition takes and how long a communication takes. However, if 1,000 accountants are actually used, the computation must take longer than if 500 accountants are used, because there is nothing for the extra accountants to do other than waste time collecting single cards and then handing them back. (An accountant with only one card cannot do an addition.) For this reason the curve must slope upward between 500 and 1,000 accountants.

d) One accountant should require 10 times longer to add 10,000 numbers than 1,000 numbers. When adding 10,000 numbers, we should be able to accommodate many more accountants before overall execution time increases.

e) One thousand accountants cannot perform the task one thousand times faster than one accountant because they spend time communicating with each other. This represents extra work that the single accountant does not have to do.

We will arrange the desks of the accountants into a binary tree. One accountant is at the root. The desks of two other accountants are near the desk of the "root" accountant. Each of these accountants has two accountants near them, and so on. The root accountant is given 1,000 cards. She divides them in half, giving 500 to each of the two accountants near her. These accountants divide their portions in half, giving 250 cards to each of the accountants near them, and so on.

This communication pattern reduces the communication overhead, meaning that the overall execution time of the algorithm for a given number of accountants should be lower than the execution time of the algorithm described in the previous exercise.

In Chapter 3 we'll see how a binomial tree is even better than a binary tree, because it avoids the problem of inactive interior nodes while the computing gets done at the leaves.

It takes m time units for the first task to be processed. After that, a new task is completed every time unit. Hence the time needed to complete n tasks is m + n - 1. (For example, with m = 3 and n = 5, the last task finishes at time 3 + 5 - 1 = 7.)

Let the copier take a fixed setup time for the first page plus a smaller per-page time for each additional page. The time needed to copy k pages is then the setup time plus the per-page time multiplied by k - 1. Dividing k by this time, expressed in minutes, gives the throughput in pages per minute. We choose the smallest k for which this throughput meets the required rate; solving the resulting inequality for k gives the answer.

All of today's supercomputers are parallel computers. Today's supercomputers are capable of trillions of operations per second, and no single processor is that fast. Not every parallel computer is a supercomputer: a parallel computer with a few commodity CPUs is not capable of performing trillions of operations per second.

Performance increases exponentially. Since performance improves by a factor of 10 every 5 years, the annual improvement factor x satisfies x^5 = 10. Hence

5 ln x = ln 10  =>  ln x = (ln 10)/5 ≈ 0.4605  =>  x = e^0.4605 ≈ 1.585

We want to find y such that x^y = 2:

1.585^y = 2  =>  y ln 1.585 = ln 2  =>  y ≈ 1.5

So the performance of CPUs doubles about every 1.5 years (about every 18 months).
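The arithmetic above is easy to check with a few lines of C. This is only a sketch, and it assumes the factor-of-10-every-5-years growth rate used in the solution above.

   #include <math.h>
   #include <stdio.h>

   int main (void)
   {
      double x = pow (10.0, 1.0 / 5.0);   /* annual improvement factor */
      double y = log (2.0) / log (x);     /* years needed to double */

      printf ("annual factor %.3f, doubling time %.2f years (%.0f months)\n",
              x, y, 12.0 * y);
      return 0;
   }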

��� a� The ���CPU Cosmic Cube had a best case performance of ��� mega�ops� This translates into a single�CPU speed of � ��� ���� mega�ops� Caltech�s Supercomputing ��� cluster had �� CPUs executing at more than � giga�ops� This translates intoa speed of at least ���� mega�ops per CPU� The overall speedincrease is in the range ��������


b� We want to nd x� such that x��� � ����

x��� � �� lnx� � ln ���� x� � ���

We want to nd x� such that x��� � ����

x��� � �� lnx� � ln ���� x� � ����

The annual microprocessor performance improvement needed toaccount for the speed increase calculated in part a� was between� � and ����

��� The faster computer will be capable of performing � times as manyoperations in �� hours as the existing computer� Let n denote the sizeof the problem that can be solved in �� hours by the faster computer�

a� n�� � � � � n � � � � �

b� n log� n�� � log� � � � � � � n � �� �� �

c� n��� � �� � � � n� � �� � �� � n � �� �

d� n��� � �� � � � n� � �� � �� � n � ���� ���

Advantages of commodity clusters: The latest CPUs appear in commodity clusters before commercial parallel computers. Commodity clusters take advantage of the small profit margins on components, meaning the cost per megaflop is lower. Commodity clusters can be built entirely with freely available software. Commodity clusters have a low entry cost.

Advantages of commercial parallel computers: Commercial systems can have a higher-performance interprocessor communication system that is better balanced with the speed of the processors.

Students will come up with many acceptable task graphs, of course. Here are examples:


a) Building a house. Task graph nodes: dig foundation, pour foundation, frame house, install roofing, install siding, install windows, heating installation, plumbing installation, electrical installation, install wallboard, install flooring, install cabinets, finish carpentry, paint exterior, paint interior.

b) Cooking and serving a spaghetti dinner. Task graph nodes: prepare sauce, simmer sauce, cook noodles, mix sauce and noodles, make salad, set table, put food on table.

c) Cleaning an apartment. Task graph nodes: put away clutter, vacuum carpets, mop vinyl floors, dust, clean toilet/shower/sink, clean out refrigerator.

d) Detailing an automobile. Task graph nodes: wash exterior, polish exterior, wash windows, degrease and clean engine, clean wheels and tires, remove floor mats, clean floor mats, vacuum interior, replace floor mats, condition interior surfaces.

The upper set of three nodes labeled A is a source of data parallelism. The lower set of three nodes labeled A is another source of data parallelism. B and C are functionally parallel. C and D are functionally parallel. D and the lower group of three nodes labeled A are functionally parallel.

The advantage of preallocation is that it reduces the overhead associated with assigning documents to idle processors at run time. The advantage of putting documents on a list and letting processors remove documents as fast as they can process them is that it balances the work among the processors. We do not have to worry about one processor being done with its share of documents while another processor (perhaps with longer documents) still has many to process.

Before allocating any documents to processors we could add the lengths of all the documents and then use a heuristic to allocate documents to processors so that each processor is responsible for roughly the same total document length. We need to use a heuristic, since


optimally balancing the document lengths is an NP-hard problem. This approach avoids massive imbalances in workloads. It also avoids the run-time overhead associated with access to the shared list of unassigned documents.


Chapter 2

Parallel Architectures

Hypercube networks with 2, 4, and 8 nodes (figure: nodes labeled 0-1, 0-3, and 0-7, respectively). A hypercube network with n nodes is a subgraph of a hypercube network with 2n nodes.

A 0-dimensional hypercube can be labeled 1 way. A d-dimensional hypercube can be labeled 2^d · d! ways, for d >= 1. Explanation: we can give any of the 2^d nodes the label 0. Once this is done, we label the neighbors of node 0. We have d choices for the label 1, d - 1 choices for the label 2, d - 2 choices for the label 4, and so on. In other words, there are d! ways to label the nodes adjacent to node 0. Once this is done, all the other labels are forced. Hence the number of different labelings is 2^d · d!.

The number of nodes at distance i from node s in a d-dimensional hypercube is C(d, i) ("d choose i").

If node u is distance i from node v in a hypercube, exactly i bits in their addresses (node numbers) are different. Starting at u and flipping one of these bits at a time yields the addresses of the nodes on a path from u to v. Since these bits can be flipped in any order, there are i! different paths of length i from u to v.


If node u is distance i from node v in a hypercube, exactly i bits in their addresses (node numbers) are different. Starting at u and flipping one of these bits at a time yields the addresses of the nodes on a path from u to v. To generate paths that share no intermediate vertices, we order the bits that are different:

b1, b2, ..., bi

We generate vertices on the first path by flipping the bits in this order:

b1, b2, ..., bi

We generate vertices on the second path by flipping the bits in this order:

b2, b3, ..., bi, b1

We generate vertices on the third path by flipping the bits in this order:

b3, b4, ..., bi, b1, b2

Continuing to rotate the starting bit in this way generates i distinct paths from u to v that share no intermediate vertices.

A hypercube is a bipartite graph: nodes whose addresses contain an even number of 1s are connected only to nodes whose addresses contain an odd number of 1s. A bipartite graph cannot have a cycle of odd length.

Compute the bitwise exclusive-or of u and v. The number of 1 bits in the result is the length of the shortest path from u to v. Since u and v have log n bits, there are no more than log n bits set, and the shortest path has length at most log n. Number the bit positions from right to left, giving them indices 0, 1, ..., log n - 1. Suppose the exclusive-or has k bits set, at positions b1, b2, ..., bk. This algorithm prints the nodes on a shortest path from u to v:

   w := u
   print w
   for i := 1 to k
       w := w xor 2^(bi)    { flip bit bi }
       print w
   endfor

Page 16: Instructor-s Guide to Parallel Programming in c With Mpi and Openmp

��

Example: if the addresses of two nodes in a four-dimensional hypercube differ in exactly two bit positions, the algorithm flips one of the differing bits to reach an intermediate node and then flips the other, producing a shortest path of length 2.
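A C sketch of this procedure follows. It flips the differing bits from the low-order position to the high-order position (any order yields a shortest path); the function name and the bit order are illustrative choices, not from the text.

   #include <stdio.h>

   void print_shortest_path (int u, int v, int dims)
   {
      int w = u;
      int i;

      printf ("%d", w);
      for (i = 0; i < dims; i++)
         if ((u ^ v) & (1 << i)) {     /* bit i differs */
            w ^= (1 << i);             /* jump across dimension i */
            printf (" -> %d", w);
         }
      printf ("\n");
   }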

Shuffle-exchange networks with 2, 4, and 8 nodes (figure: nodes labeled 0-1, 0-3, and 0-7; dashed lines denote exchange links, solid arrows denote shuffle links). A shuffle-exchange network with n nodes is not a subgraph of a shuffle-exchange network with 2n nodes.

Nodes u and v are exactly 2k - 1 link traversals apart if one is node 0 and the other is node 2^k - 1. When this is the case, all k bits must be changed, which means there must be at least k traversals of exchange links. There must also be at least k - 1 traversals of shuffle links to bring each bit into the position where an exchange link can change it. The total number of link traversals is therefore 2k - 1.

In all other cases, u and v have at least one bit in common, and following no more than k - 1 shuffle links and k - 1 exchange links is sufficient to reach v from u.

Algorithm to route a message from u to v in an n-node shuffle-exchange network:

   for i := 1 to log2 n do
       b := u xor v               { exclusive or }
       if b is odd then           { low-order bits differ }
           u := u xor 1
           print "exchange"
       endif
       u := left_cyclic_rotate(u)
       v := left_cyclic_rotate(v)
       print "shuffle"
   endfor


Example: routing a message from node 5 to node 6 in a 16-node shuffle-exchange network (u = 0101 and v = 0110). The algorithm produces the following route: exchange, shuffle, shuffle, shuffle, exchange, shuffle.
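A C sketch of the routing loop, assuming n is a power of two and addresses have k = log2 n bits; left_rotate and the other names are illustrative. Calling route(5, 6, 4) prints the same sequence of moves as the example above.

   #include <stdio.h>

   static int left_rotate (int x, int k)   /* left cyclic rotation, k bits */
   {
      return ((x << 1) | (x >> (k - 1))) & ((1 << k) - 1);
   }

   void route (int u, int v, int k)
   {
      int i;

      for (i = 0; i < k; i++) {
         if ((u ^ v) & 1) {                /* low-order bits differ */
            u ^= 1;
            printf ("exchange\n");
         }
         u = left_rotate (u, k);
         v = left_rotate (v, k);
         printf ("shuffle\n");
      }
   }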

���� a� n��� log� n

b� log� n

c� n��� log� n

d� �

e� No

Suppose the address of the destination is b(k-1) b(k-2) ... b1 b0. Every switch has two edges leaving it: an "up" edge toward node 0 and a "down" edge away from node 0. The following algorithm indicates how to follow the edges to reach the destination node with the given address:

   for i := k - 1 downto 0 do
       if bi = 0 then
           print "up"
       else
           print "down"
       endif
   endfor

Data parallelism means applying the same operation across a data set. A processor array is only capable of performing one operation at a time, but every processing element may perform that operation on its own local operands. Hence processor arrays are a good fit for data-parallel programs.

���� Let i be the vector length� where � � i � � � The total number ofoperations to be performed is i� The time required to perform theseoperations is di��e � �� �second� Hence performance is

i

di��e � � � operations�second

For example� when vector length is �� performance is � million opera�tions per second� When vector length is � performance is � ����� �operations per second� or about ���� million operations per second�


���� a� The e�ciency is ��k

b� The e�ciency is Pki�IiPiPki� Ii

Large caches reduce the load on the memory bus, enabling the system to utilize more processors efficiently.

Even with large instruction and data caches, every processor still needs to access the memory bus occasionally. By the time the system contains a few dozen processors, the memory bus is typically saturated.

a) The directory should be distributed among the multiprocessor's local memories so that accessing the directory does not become a performance bottleneck.

b) If the contents of the directories were replicated, we would need to make sure the values were coherent, which is the very problem we are trying to solve by implementing the directory.

Continuation of the directory-based cache coherence protocol example. (Figure: the directory entry and each CPU's cached copy of X after each access, passing through the states S111, E010, S001, E001, and S011.)

Here is one set of examples:

SISD: Digital Equipment Corporation PDP-11

SIMD: MasPar MP-1

MISD: Intel iWarp

MIMD: Sequent Symmetry


Continuation of the systolic priority queue example.


Contemporary supercomputers must contain thousands of processors. Even distributed multiprocessors scale only to hundreds of processors. Hence supercomputers are invariably multicomputers. However, it is true that each node of a supercomputer is usually a multiprocessor. The typical contemporary supercomputer, then, is a collection of multiprocessors forming a multicomputer.


Chapter 3

Parallel Algorithm Design

Suppose two processors are available to find the sum of n values, which are initially stored on a single processor. If one processor does all the summing, the processor utilization is 50%, but there is no interprocessor communication. If two processors share the computation, the processor utilization is closer to 100%, but there is some interprocessor communication: one processor must send the other processor n/2 values and receive in return the sum of those values. The time spent in interprocessor communication prevents utilization from reaching 100%.

��� a� log � � ����� blog �c � �� dlog �e � �

b� log �� � ��� � blog ��c � �� dlog ��e � �

c� log �� � �� blog ��c � �� dlog ��e � �

d� log ��� � ����� blog ���c � �� dlog ���e � �

e� log ��� � ����� blog ���c � �� dlog ���e � �

a) Binomial tree with 16 nodes.

b) Binomial tree with 32 nodes.

a) Hypercube with 16 nodes (figure: nodes labeled 0 through 15).

b) Hypercube with 32 nodes (figure: nodes labeled 0 through 31).


Here are four binomial trees embedded in a four-dimensional hypercube. All four binomial trees have the same hypercube node as their root.

Three reductions, each performed with ⌈log n⌉ communication steps. (Figure: the three task/channel graphs, with each channel labeled by the communication step in which it is used.)

A C program that describes the communications performed by a task participating in a reduction:

   #include <stdio.h>
   #include <stdlib.h>

   int main (int argc, char *argv[])
   {
      int id;    /* Task ID */
      int p;     /* Number of tasks */
      int pow;   /* ID difference between sending/receiving tasks */

      if (argc != 3) {
         printf ("Command line: %s <p> <id>\n", argv[0]);
         exit (-1);
      }
      p = atoi(argv[1]);
      id = atoi(argv[2]);
      if ((id < 0) || (id >= p)) {
         printf ("Task id must be in range 0..p-1\n");
         exit (-1);
      }
      if (p == 1) {
         printf ("No messages sent or received\n");
         exit (0);
      }
      pow = 1;
      while (2*pow < p) pow *= 2;
      while (pow > 0) {
         if ((id < pow) && (id + pow < p))
            printf ("Message received from task %d\n", id + pow);
         else if (id >= pow) {
            printf ("Message sent to task %d\n", id - pow);
            break;
         }
         pow /= 2;
      }
      return 0;
   }

From the discussion in the chapter we know that we can reduce n values in time Θ(log n) using n tasks. The only question, then, is whether we can improve the execution time by using fewer tasks.

Suppose we use only k tasks. Each task reduces about n/k values, and then the tasks communicate to finish the reduction. The time complexity of this parallel algorithm would be Θ(n/k + log k). Let's find the value of k that minimizes this function. If f(k) = n/k + log k, then f'(k) = -n/k² + 1/(k ln 2). f'(k) = 0 when k = n ln 2, i.e., when k = Θ(n). Hence the complexity of the parallel reduction is Θ(n/n + log n) = Θ(log n).

Our broadcast algorithm will be based on a binomial tree communication pattern. Data values flow in the opposite direction from a reduction.

a) Here is a task/channel graph showing how the root task (shown in black) broadcasts to 7 other tasks in three steps. (Figure: channels labeled 1, 2, or 3 according to the step in which they are used.) In general, use of a binomial tree allows a root task to broadcast to p - 1 other tasks in ⌈log p⌉ communication steps.

b) We prove by induction that broadcast among p tasks (a root task broadcasting to p - 1 other tasks) requires at least ⌈log p⌉ communication steps. Basis: suppose p = 1. No communications are needed, and ⌈log 1⌉ = 0. Induction: suppose we can broadcast among p tasks in ⌈log p⌉ communication steps, for 1 <= p <= k. Then ⌈log k⌉ + 1 communication steps are needed to broadcast among 2k tasks, because ⌈log k⌉ communication steps allow k tasks to get the value, and 1 additional step allows each of these k tasks to communicate the value to one of the k tasks that still needs it. Since ⌈log k⌉ + 1 = ⌈log 2k⌉, the algorithm devised in part (a) is optimal.

The all-gather algorithm is able to route more values in the same time because its utilization of the available communication channels is higher. During each step of the all-gather algorithm every task sends and receives a message. In the scatter algorithm the tasks are engaged gradually. About half of the tasks never send a message, and these tasks only receive a message in the last step of the scatter.

We organize the tasks into a logical hypercube. The algorithm has log p iterations. In the first iteration each task in the lower half of the hypercube swaps p/2 elements with a partner task in the upper half of the hypercube. After this step all elements destined for tasks in the upper half of the hypercube are held by tasks in the upper half of the hypercube, and all elements destined for tasks in the lower half of the hypercube are held by tasks in the lower half of the hypercube. At this point the algorithm is applied recursively to the lower and upper halves of the hypercube (dividing the hypercube into quarters). After log p swap steps, each process controls the correct elements.

The following picture illustrates the algorithm for p = 8.

(Figure: the three exchange steps for p = 8, showing which block of elements each task holds after each exchange.)

During each step each process sends and receives p/2 elements. There are log p steps. Hence the complexity of the hypercube-based algorithm is Θ(p log p).

Initially we create a primitive task for each key. The algorithm iterates through two steps. In the first step the tasks associated with even-numbered keys send their keys to the tasks associated with the successor keys. The receiving tasks keep the larger value and return the smaller value. In the second step the tasks associated with odd-numbered keys send their keys to the tasks associated with the successor keys. The receiving tasks keep the larger value and return the smaller value. After n/2 iterations the list must be sorted. The time complexity of this algorithm is Θ(n) with n concurrent tasks.

(Figure: a linear array of primitive tasks, one per key a[0], a[1], ..., a[n-1], with channels between adjacent tasks.)

After agglomeration we have p tasks. The algorithm iterates through two steps. In the first step each task makes a pass through its subarray, from low index to high index, comparing adjacent keys. It exchanges keys it finds to be out of order. In the second step each task (except the last) passes its largest key to its successor task. The successor task compares the key it receives with its smallest key, returning the smaller value to the sending task. The computational complexity of each step is Θ(n/p). The communication complexity of each step is Θ(1). After at most n iterations the list is sorted. The overall complexity of the parallel algorithm is Θ(n²/p + n) with p processors.

Initially we associate a primitive task with each pixel. Every primitive task communicates with the tasks above it, below it, to its left, and to its right, if these tasks exist. Interior tasks have four neighbors, corner tasks have two neighbors, and edge tasks have three neighbors.

Each task associated with a 1 pixel at position (i, j) sets its component number to the unique value n·i + j. At each step of the algorithm each task sends its current component number to its neighbors and receives from its neighbors their component numbers. If the minimum of the component numbers received from the neighboring tasks is less than the task's current component number, then the task replaces its component number with this minimum value. In fewer than n² iterations the component numbers have stabilized.

A good agglomeration maximizes the computation/communication ratio by grouping primitive tasks into rectangles that are as square as possible.


We set up one "manager" task and a bunch of "worker" tasks. A pair of channels connects the manager with each worker. All tasks have a copy of the dictionary.

(Figure: the manager task connected by a pair of channels to each worker task.)

The manager partially fills out a crossword puzzle and then assigns it to a worker to complete. By assigning partially completed puzzles to the workers, the manager ensures no two workers are doing the same thing. A worker sends a message back to the manager either when it has successfully filled in a puzzle or when it has determined there is no way to solve the partially filled-in puzzle it received. When the manager receives a solved puzzle from a worker, it tells all the workers to stop.

Each primitive task is responsible for the coordinates of one of the houses. All primitive tasks have the coordinates of the train station. Each task computes the Euclidean distance from its house to the train station. Each task i contributes a pair (di, i) to a reduction in which the operation is minimum. The result of the reduction is the pair (dk, k), where dk is the minimum of all the submitted values di and k is the index of the task submitting dk. The communication pattern is a binomial tree, allowing the reduction to take place in logarithmic time. A sensible agglomeration assigns about n/p coordinate pairs to each of p tasks. Each task finds the closest house among the n/p houses it has been assigned, then participates in the reduction step with the other tasks.

The "before" and "after" task/channel graphs should resemble the reduction graphs shown in the text.

The text is divided into p segments, one per task. Each task is responsible for searching its portion of the text for the pattern. We need to be able to handle the case where the pattern is split between two tasks' sections. Suppose the pattern has k letters. Each of the first p - 1 tasks must pass the last k - 1 letters of its portion of the text to the subsequent task. Now all occurrences of the pattern can be detected.

After the tasks have completed their portions of the search, a gather communication brings together information about all occurrences of the pattern.

In the accompanying task/channel graph, the straight arrows represent channels used to copy the final k - 1 letters of each task's section to its successor task. The curved arrows represent channels used for the gather step.

If the pattern is likely to appear multiple times in the text, and if we are only interested in finding the first occurrence of the pattern in the text, then the allocation of contiguous blocks of text to the tasks is probably unacceptable. It makes it too likely that the first task will find the pattern, making the work of the other tasks superfluous. A better strategy is to divide the text into jp portions and allocate these portions in an interleaved fashion to the tasks. Each task ends up with j segments from different parts of the text:

0 1 2 3 0 1 2 3 0 1 2 3

As before, the final k - 1 letters in each segment must be copied into the successor segment, in case the pattern is split across two segments.

Communication channels enable one task to inform the other tasks as soon as it has found a match.

Associate a primitive task with each key. In the first step the tasks perform a max-reduction on the keys. In the second step the maximum key is broadcast to the tasks. In the third step each task compares its key with the maximum. In the fourth step the tasks perform another max-reduction. Every task submits a candidate key to the second max-reduction except the task whose key is equal to the maximum key.

We can reduce the communication complexity of the parallel algorithm without increasing the computation time on p processors by agglomerating groups of n/p primitive tasks into larger tasks.

This solution is similar to the previous solution, except that in the max-reduction step each task submits the pair (ki, i), where ki is the value of the key and i is its index in the list. The result of the max-reduction is the pair (mi, i), where mi is the largest key submitted and i is its list position. This pair is broadcast to all the primitive tasks. In the fourth step the tasks perform another max-reduction. Every task submits a candidate key except the task whose index is equal to the index received in the broadcast step.

We can reduce the communication complexity of the parallel algorithm without increasing the computation time on p processors by agglomerating groups of n/p primitive tasks into larger tasks.



Chapter 4

Message-passing Programming

a) Give the pieces of work the numbers 0, 1, 2, ..., n - 1. Process k gets pieces of work k, k + p, k + 2p, etc.

b) Piece of work j is handled by process j mod p.

c) ⌈n/p⌉

d) If n is an integer multiple of p, then all processes have the same amount of work. Otherwise, n mod p processes have more pieces of work, namely ⌈n/p⌉. These are processes 0, 1, ..., (n mod p) - 1.

e) ⌊n/p⌋

f) If n is an integer multiple of p, then all processes have the same amount of work. Otherwise, p - (n mod p) processes have fewer pieces of work, namely ⌊n/p⌋. These are processes n mod p, (n mod p) + 1, ..., p - 1. A short C sketch of these mappings appears below.
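The following is a minimal sketch of the cyclic (interleaved) mapping described above; the function names are illustrative, not from the text (the book's own code uses macros that play similar roles for the block mapping).

   /* Cyclic mapping of n pieces of work onto p processes */

   int cyclic_owner (int j, int p)          /* process that handles piece j */
   {
      return j % p;
   }

   int cyclic_count (int id, int p, int n)  /* pieces assigned to process id */
   {
      return n / p + ((id < n % p) ? 1 : 0);
   }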

��� a� ���

b� Over�ow� In C� get ��� as result�

c� ��

d� ��

e� ���

f�

g� �

h� �


Modified version of function check_circuit:

   int check_circuit (int id, int z) {
      int v[16];   /* Each element is a bit of z */
      int i;

      for (i = 0; i < 16; i++) v[i] = EXTRACT_BIT(z,i);
      if ((v[0] || v[1]) && (!v[1] || !v[3]) && (v[2] || v[3])
         && (!v[3] || !v[4]) && (v[4] || !v[5])
         && (v[5] || !v[6]) && (v[5] || v[6])
         && (v[6] || !v[15]) && (v[7] || !v[8])
         && (!v[7] || !v[13]) && (v[8] || v[9])
         && (v[8] || !v[9]) && (!v[9] || !v[10])
         && (v[9] || v[11]) && (v[10] || v[11])
         && (v[12] || v[13]) && (v[13] || !v[14])
         && (v[14] || v[15])) {
         printf ("%d) %d%d%d%d%d%d%d%d%d%d%d%d%d%d%d%d\n", id,
            v[0],v[1],v[2],v[3],v[4],v[5],v[6],v[7],
            v[8],v[9],v[10],v[11],v[12],v[13],v[14],v[15]);
         fflush (stdout);
         return 1;
      } else return 0;
   }

See the corresponding figure in the text.

a) The circuit can be represented by a logical expression using the same syntax as the C programming language. Parentheses are used as necessary to indicate the order in which operators should be applied.

b) Creating a context-free grammar for logical expressions is easy. We could write a top-down or a bottom-up parser to construct a syntax tree for the circuit.

c) The circuit would be represented by an expression tree. The leaves of the tree would be the inputs to the circuit. The interior nodes of the tree would be the logical gates. A not gate would have one child, while and gates and or gates would have two children.


Parallel variant of Kernighan and Ritchie's "hello, world" program:

   #include <mpi.h>
   #include <stdio.h>

   int main (int argc, char *argv[]) {
      int id;   /* Process rank */

      MPI_Init (&argc, &argv);
      MPI_Comm_rank (MPI_COMM_WORLD, &id);
      printf ("hello, world, from process %d\n", id);
      fflush (stdout);
      MPI_Finalize();
      return 0;
   }
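Assuming an MPI installation that provides the usual mpicc and mpirun wrapper commands, the program can be compiled and launched with, for example:

   mpicc -o hello hello.c
   mpirun -np 4 ./hello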


Chapter 5

The Sieve of Eratosthenes

a) The first p - 1 processes get ⌈n/p⌉ elements and the last process does not get any elements when n = k(p - 1), where k is an integer greater than 1 and less than p. Here are the cases where this happens for n <= 10: n = 4 and p = 3; n = 6 and p = 4; n = 9 and p = 4; n = 8 and p = 5; and n = 10 and p = 6. (A short program verifying part (a) appears below.)

b) In order for ⌊p/2⌋ processes to get no values, p must be odd and n must be equal to p + 1. Here are the cases where this happens for n <= 10: n = 4 and p = 3; n = 6 and p = 5; n = 8 and p = 7; and n = 10 and p = 9.
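The condition in part (a) can be checked by brute force. The sketch below assumes the decomposition described above, in which each of the first p - 1 processes receives a full block of ⌈n/p⌉ elements; it prints the (n, p) pairs with p <= n <= 10 for which the last process receives nothing.

   #include <stdio.h>

   int main (void)
   {
      int n, p;

      for (n = 1; n <= 10; n++)
         for (p = 2; p <= n; p++) {
            int c = (n + p - 1) / p;     /* ceil(n/p) */
            if (n == (p - 1) * c)        /* first p-1 full, last empty */
               printf ("n = %d, p = %d\n", n, p);
         }
      return 0;
   }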

We list the number of elements held by each process for each scheme. For example, (4, 3, 3) means that the first process has four elements and the second and third processes have three elements each.

a� Scheme �� �� �� �� ��� Scheme �� �� �� �� ��

b� Scheme �� �� �� �� �� �� ��� Scheme �� �� �� �� �� �� ��

c� Scheme �� �� �� �� �� ��� Scheme �� �� �� �� �� ��

d� Scheme �� �� �� �� ��� Scheme �� �� �� �� ��

e� Scheme �� �� �� �� �� �� ��� Scheme �� �� �� �� �� �� ��

f� Scheme �� �� �� �� �� �� �� ��� Scheme �� �� �� �� �� �� �� ��

This table shows the predicted execution time of the first parallel implementation of the Sieve of Eratosthenes for various numbers of processors.


Processors Time

� ����� � ������� ������ ���� � ������ ������ ������ ������ ������ ������� ���� �� ������� �� ��� ������� ������� �����

This table compares the predicted versus actual execution time of the second parallel implementation of the Sieve of Eratosthenes on a commodity cluster.

Processors | Predicted Time | Actual Time | Error (%)

� ������ ������ ����� ����� ��� � ����� ����� �� �� ����� ����� �� �� � ���� ��� � ����� ��� � ���� ���� ������ ����� �� �� ���� � ����� ����� �����

The average error in the predictions for ��� processors is � �����


This table compares the predicted versus actual execution time of the third parallel implementation of the Sieve of Eratosthenes on a commodity cluster.

Processors | Predicted Time | Actual Time | Error (%)

� ������ ������ � �� ���� ����� ����� ����� ����� ����� ����� ��� � ����� ����� ����� ����� �� �� ����� ��� � ����� ���� �� �� ���� ����� ����

The average error in the predictions for ��� processors is ������

The first disadvantage is a lack of scalability: since every process must set aside storage representing the integers from 2 through n, the maximum value of n cannot increase when p increases. The second disadvantage is that the OR-reduction step reduces extremely large arrays; communicating these arrays among the processors will require a significant amount of time. The third disadvantage is that the workloads among the processors are unbalanced. For example, in the case of three processors, the processor sieving with the smallest primes will take significantly longer to mark the multiples of its primes than the processor sieving with the largest primes.


Chapter 6

Floyd's Algorithm

Suppose n = kp + r, where k and r are nonnegative integers and 0 <= r < p. Note that this implies k = ⌊n/p⌋.

If r (the remainder) is 0, then all processes are assigned n/p = ⌈n/p⌉ elements. Hence the last process is responsible for ⌈n/p⌉ elements.

If r > 0, then the last process is responsible for elements ⌊(p - 1)n/p⌋ through n - 1.

⌊(p - 1)n/p⌋ = ⌊(p - 1)(kp + r)/p⌋ = ⌊kp + r - k - r/p⌋ = kp + r - k - 1 = n - k - 1

The total number of elements the last process is responsible for is

(n - 1) - (n - k - 1) + 1 = k + 1 = ⌈n/p⌉

Process 0 only needs to allocate ⌊n/p⌋ elements to store its portion of the file. Since it may need to input and pass along ⌈n/p⌉ elements to some of the processes, process 0 cannot use its file storage buffer as a temporary buffer holding the elements being passed along to the other processes. (Room for one extra element would have to be allocated.) In contrast, process p - 1 will eventually be responsible for ⌈n/p⌉ elements of the file. Since no process stores more elements, the space allocated for process p - 1's portion of the file is sufficient to be used as a temporary message buffer.

We see this played out in the figure in the text, in which process 0 is responsible for ⌊n/p⌋ elements, while process p - 1 is responsible for ⌈n/p⌉ elements.

We need to consider both I/O and communications within Floyd's algorithm itself. File input will be more complicated, assuming the distance matrix is stored in the file in row-major order. One process is responsible for accessing the file. It reads in one row of the matrix at a time and then scatters the row among all the processes. After n read-and-scatter steps, each process has its portion of the distance matrix.

During iteration k of Floyd's algorithm, the process controlling column k must broadcast this column to the other processes. No other communications are needed to perform the computation.

In order to print the matrix, one process should be responsible for all of the printing. Each row is gathered onto the one process that prints the row before moving on to print the next row.

Several changes are needed when switching from a rowwise block-striped decomposition to a rowwise interleaved striped decomposition.

First, the process that inputs the matrix from the file can only read one row at a time. In each of n iterations, the process reads row i and sends it to process i mod p (except when it keeps the rows it is responsible for). Hence this decomposition results in more messages being sent by the process responsible for reading the matrix from the file.

Second, in the performance of Floyd's algorithm, the computation of the process broadcasting row k is different. In the block decomposition the process controlling row k is ⌊(p(k + 1) - 1)/n⌋. In the interleaved decomposition the process controlling row k is k mod p.

Third, when the matrix is printed, processes send their rows to the printing process one at a time, instead of all at once. The printing process must print the first row held by process 0, the first row held by process 1, ..., the first row held by process p - 1, the second row held by process 0, etc. Hence this decomposition results in more messages being received by the process responsible for printing the matrix.

a) The processes are logically organized into a square mesh. During each iteration of the outer loop of the algorithm, a process needs access to a[i][k] and a[k][j] in order to update a[i][j]. Hence every process controlling a portion of row k must broadcast it to the other processes in the same column of the process mesh. Similarly, every process controlling a portion of column k must broadcast it to the other processes in the same row of the process mesh.

(Figure: the process mesh, highlighting the processes that hold row k and column k.)

Note that the √p broadcasts across rows of the process mesh may take place simultaneously. The √p broadcasts across columns of the process mesh may also take place simultaneously.

b) During each iteration there are two broadcast phases. The communication time is the same for both phases. A total of n/√p data items, each of length 4 bytes, are broadcast among a group of √p processes. The time needed for one broadcast is

⌈log √p⌉ (λ + 4n/(β√p))

Since there are two broadcast phases per iteration and n iterations, the total communication time is

2n ⌈log √p⌉ (λ + 4n/(β√p))

c) Since 2n⌈log √p⌉ = n⌈log p⌉, the number of messages sent by both algorithms is the same. However, the parallel version of Floyd's algorithm that assigns each process a square block of the elements of A has its message transmission time reduced by a factor of √p. Hence it is superior.

We need to compute

n²⌈n/p⌉χ + n⌈log p⌉(λ + 4n/β)

for the values of n, p, χ, λ, and β given in the exercise and in the chapter's benchmarking section. Substituting them yields the predicted execution time in seconds.
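A small helper that evaluates the execution-time model written above; this is only a sketch, with the machine parameters χ, λ, and β passed in by the caller rather than hard-coded, since their benchmark values appear in the chapter.

   #include <math.h>

   /* T = n^2 * ceil(n/p) * chi + n * ceil(log2 p) * (lambda + 4n/beta) */
   double floyd_time (int n, int p, double chi, double lambda, double beta)
   {
      double comp = (double) n * n * ceil ((double) n / p) * chi;
      double comm = n * ceil (log2 ((double) p)) * (lambda + 4.0 * n / beta);

      return comp + comm;
   }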

This table predicts the execution times (in seconds) when executing Floyd's algorithm on problems of two different sizes with various numbers of processors. The computer has the same values of χ, λ, and β as the system used for benchmarking in this chapter.

Processors n � � n � � � ���� � �� � ���� � ���� ���� ��� � �� � ��� � �� � ����� ��� ����� ��� � ��� ��� ���


Chapter 7

Performance Analysis

The speedup formula is

ψ(n, p) <= (σ(n) + φ(n)) / (σ(n) + φ(n)/p + κ(n, p))

We want to find p0 such that p >= p0 implies ψ(n, p) <= ψ(n, p0). From the formula we see this is the same as saying we want to find p0 such that p >= p0 implies φ(n)/p + κ(n, p) >= φ(n)/p0 + κ(n, p0). We are told that κ(n, p) = c log p. Let

f(p) = φ(n)/p + c log p

Then

f'(p) = -φ(n)/p² + c/(p ln 2)

Finding the point where the slope is 0:

f'(p) = 0  =>  p = φ(n) ln 2 / c

Let

p0 = ⌈φ(n) ln 2 / c⌉

Since f'(p) > 0 when p > p0, f(p) is increasing for p >= p0. Therefore there exists a p0 such that p >= p0 implies ψ(n, p) <= ψ(n, p0).

The definition of efficiency gives

ε(n, p) <= (σ(n) + φ(n)) / (pσ(n) + φ(n) + pκ(n, p))

Since pσ(n) and pκ(n, p) are both nondecreasing functions of p, the denominator is a nondecreasing function of p, meaning that ε(n, p) is a nonincreasing function of p. Hence p' > p implies ε(n, p') <= ε(n, p).


The sequential execution time is given in the problem. Speedup is the sequential execution time divided by the parallel execution time:

Processors Speedup

� �� � ����� ����� ��� � ����� �� �� ����� ����� ����� ������ ������ ������ ������ ������ ������ ����

By Amdahl's Law,

ψ <= 1 / (f + (1 - f)/p)

Substituting the given values of f and p yields the maximum achievable speedup.
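The Amdahl's Law bound used in these exercises is easy to tabulate with a one-line helper. This is a sketch, with f and p supplied by the caller.

   /* Amdahl's Law upper bound on speedup */
   double amdahl_speedup (double f, int p)
   {
      return 1.0 / (f + (1.0 - f) / p);
   }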

Using Amdahl's Law,

ψ <= 1 / (f + (1 - f)/p)

we solve for the smallest p at which the right-hand side reaches the desired speedup; at least that many processors are needed.

Starting with Amdahl's Law,

ψ <= 1 / (f + (1 - f)/p)

we take the limit as p → ∞ and find that

ψ <= 1/f

In order to achieve a speedup of S, no more than 1/S of a program's execution time can be spent in inherently sequential operations.


We begin with Amdahl's Law and solve for f:

ψ <= 1 / (f + (1 - f)/p)
f + (1 - f)/p <= 1/ψ
f(1 - 1/p) <= 1/ψ - 1/p
f <= (p/ψ - 1) / (p - 1)

Substituting the given speedup and number of processors bounds the fraction of the computations performed by the program that are inherently sequential.

Gustafson-Barsis's Law states that

ψ <= p + (1 - p)s

where s is the fraction of time the parallel program spends in inherently sequential operations. Substituting the given values of s and p yields the scaled speedup.

The scaled speedup in the second problem is computed from Gustafson-Barsis's Law in the same way, using its values of p and s.

��� This problem asks us to apply the Karp�Flatt metric�

I� The answer is B�� The experimentally determined serial fractionis �� for � � p � �� If it remains at ��� speedup on �� processorswill �� only � � higher than on � processors�

II� The answer is C�� The experimentally determined serial fractionis growing linearly with p� If it continues to increase at this rate�speedup on �� processors will be �� which is �� lower than on �processors�

III� The answer is A�� The experimentally determined serial fractionis holding steady at � �� If it remains at � �� speedup on ��processors will be ����� or ��� higher than on � processors�

IV� The answer is A�� The experimentally determined serial fractionis growing logarithmically with p� If it has the value � � whenp � ��� speedup will be ����� or ��� higher than on � processors�

V� The answer is B�� The experimentally determined serial fractionis growing very slowly� but it is quite large�


VI� The answer is A�� The experimentally determined serial fractionis small and growing slowly�

In Amdahl's Law, when you fix f and increase p, the problem size does not change. This is because the focus of Amdahl's Law is on the effect of applying more processors to a particular problem. Hence as p → ∞, the maximum speedup converges on 1/f.

In contrast, with Gustafson-Barsis's Law, when you fix s and increase p, the proportion of inherently sequential operations actually drops. With s fixed, one processor is busy the entire time while the other p - 1 processors are busy only during the parallel portion, so as p grows, the same sequential operations represent a smaller and smaller share of the total computation performed.

Because the proportion of inherently sequential operations drops when fixing s and increasing p, the scaled speedup predicted by Gustafson-Barsis's Law increases without bound.

Here are three reasons why it may be impossible to solve a problem within a specified time limit, even if you have access to all the processors you care to use.

First, the problem may not be amenable to a parallel solution.

Second, even if the bulk of the problem can be solved by a parallel algorithm, there may be a sequential component (such as file I/O) that limits the speedup achievable through parallelism.

Third, even if there is no significant sequential component, interprocessor communication overhead may limit the number of processors that can be applied before execution time increases, rather than decreases, when more processors are added.

Good news: if students can solve this problem, they can perform the isoefficiency analysis of every parallel algorithm in the book.

System (a): n = Cp and M(n) = n².
M(Cp)/p = C²p²/p = C²p

System (b): n = C√p log p and M(n) = n².
M(C√p log p)/p = C²p log² p / p = C² log² p

System (c): n = C√p and M(n) = n².
M(C√p)/p = C²p/p = C²

System (d): n = Cp log p and M(n) = n².
M(Cp log p)/p = C²p² log² p / p = C²p log² p

System (e): n = Cp and M(n) = n.
M(Cp)/p = Cp/p = C

System (f): n = p^C (1 < C < 2) and M(n) = n.
M(p^C)/p = p^C/p = p^(C-1) = O(p)

System (g): n = p^C (2 < C < 3) and M(n) = n.
M(p^C)/p = p^C/p = p^(C-1) = O(p²)

Ranked from most scalable to least scalable: (c), (e), (b), (f), (a), (d), (g).


Chapter 8

Matrix-vector Multiplication

The expected execution time of the first matrix-vector multiplication program on various numbers of processors, given the model and machine parameters described in the chapter, appears in the table below.

Processors Time �sec�

� � ��� � ���� � ���� � ���� � ���� � ���� � �

A C code segment that partitions the process grid into column communicators:

   MPI_Comm grid_comm;       /* Cartesian grid communicator */
   int      grid_coords[2];  /* This process's grid coordinates */
   MPI_Comm col_comm;        /* Communicator for this process's column */

   MPI_Comm_split (grid_comm, grid_coords[1], grid_coords[0], &col_comm);
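For comparison, a row communicator can be formed the same way by swapping the roles of the two grid coordinates. This is a sketch in the same style as the fragment above, not code from the text.

   MPI_Comm row_comm;        /* Communicator for this process's row */

   MPI_Comm_split (grid_comm, grid_coords[0], grid_coords[1], &row_comm);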


The answer shown below gives more detail than is necessary. Not only does it show how to use read_block_vector to open a file and distribute it among a column of processes in a grid communicator, it also shows how the variables used in this function call are declared and initialized.

   int main (int argc, char *argv[])
   {
      double   *b;              /* Where to store vector */
      MPI_Comm  col_comm;       /* Communicator shared by processes
                                   in same column */
      MPI_Comm  grid_comm;      /* Logical grid of procs */
      int       grid_coords[2]; /* Coords of process */
      int       grid_id;        /* Rank in grid comm */
      int       grid_size[2];   /* Number of procs per dim */
      int       p;              /* Number of processes */
      int       periodic[2];    /* Wraparound flag */
      int       nprime;         /* Size of vector */

      MPI_Init (&argc, &argv);
      MPI_Comm_size (MPI_COMM_WORLD, &p);
      grid_size[0] = grid_size[1] = 0;
      MPI_Dims_create (p, 2, grid_size);
      periodic[0] = periodic[1] = 0;
      MPI_Cart_create (MPI_COMM_WORLD, 2, grid_size, periodic,
         1, &grid_comm);
      MPI_Comm_rank (grid_comm, &grid_id);
      MPI_Cart_coords (grid_comm, grid_id, 2, grid_coords);
      MPI_Comm_split (grid_comm, grid_coords[1], grid_coords[0],
         &col_comm);
      if (grid_coords[1] == 0)
         nprime = read_block_vector ("Vector", (void **) &b,
            MPI_DOUBLE, col_comm);
      ...
   }

c) C code that redistributes vector b. It works for all values of p.

   /* Variable 'is_square' is true iff 'p' is a square number */

   if (is_square) {

      /* Proc at (i,0) sends its subvector to the proc at (0,i).
         The proc at (0,0) just does a copy. */

      if ((grid_coords[0] == 0) && (grid_coords[1] == 0)) {
         for (i = 0; i < recv_els; i++)
            btrans[i] = b[i];
      } else if ((grid_coords[0] > 0) && (grid_coords[1] == 0)) {
         send_els = BLOCK_SIZE(grid_coords[0],grid_size[0],n);
         coords[0] = 0;
         coords[1] = grid_coords[0];
         MPI_Cart_rank (grid_comm, coords, &dest);
         MPI_Send (b, send_els, mpitype, dest, 0, grid_comm);
      } else if ((grid_coords[1] > 0) && (grid_coords[0] == 0)) {
         coords[0] = grid_coords[1];
         coords[1] = 0;
         MPI_Cart_rank (grid_comm, coords, &src);
         MPI_Recv (btrans, recv_els, mpitype, src, 0,
            grid_comm, &status);
      }

   } else {

      /* The process at (0,0) gathers the vector elements from
         the procs in column 0 */

      if (grid_coords[1] == 0) {
         recv_cnt = (int *) malloc (grid_size[0] * sizeof(int));
         recv_disp = (int *) malloc (grid_size[0] * sizeof(int));
         for (i = 0; i < grid_size[0]; i++)
            recv_cnt[i] = BLOCK_SIZE(i, grid_size[0], n);
         recv_disp[0] = 0;
         for (i = 1; i < grid_size[0]; i++)
            recv_disp[i] = recv_disp[i-1] + recv_cnt[i-1];
         MPI_Gatherv (b, BLOCK_SIZE(grid_coords[0], grid_size[0], n),
            mpitype, tmp, recv_cnt, recv_disp, mpitype, 0, col_comm);
      }

      /* The process at (0,0) scatters the vector elements to the
         other processes in row 0 of the process grid */

      if (grid_coords[0] == 0) {
         if (grid_size[1] > 1) {
            send_cnt = (int *) malloc (grid_size[1] * sizeof(int));
            send_disp = (int *) malloc (grid_size[1] * sizeof(int));
            for (i = 0; i < grid_size[1]; i++)
               send_cnt[i] = BLOCK_SIZE(i, grid_size[1], n);
            send_disp[0] = 0;
            for (i = 1; i < grid_size[1]; i++)
               send_disp[i] = send_disp[i-1] + send_cnt[i-1];
            MPI_Scatterv (tmp, send_cnt, send_disp, mpitype,
               btrans, recv_els, mpitype, 0, row_comm);
         } else {
            for (i = 0; i < n; i++)
               btrans[i] = tmp[i];
         }
      }
   }

   /* The procs in row 0 broadcast their subvectors to the procs
      in the same column */

   MPI_Bcast (btrans, recv_els, mpitype, 0, col_comm);


Chapter 9

Document Classification

��� a) Efficiency is maximized when all workers are able to begin and end execution at the same time. This is the case whenever

   Σ_{i=0}^{k−1} t_i  =  (p − 1) r

and there exists a partitioning of the k tasks into p − 1 subsets of tasks such that the sum of the execution times of the tasks within each subset is r. The simplest example is the case when there are p − 1 tasks with identical execution times. Since the manager process is not performing any of the tasks, the maximum efficiency possible is (p − 1)/p.

b) Efficiency is minimized when there is only a single task. The minimum possible efficiency is 1/p.

��� Nonblocking sends and receives can reduce execution time by allowing a computer to overlap communications and computations. This is particularly helpful when a system has a dedicated communications coprocessor. Posting a receive before a message arrives allows the run-time system to save a copy operation by transferring the contents of the incoming message directly into the destination buffer rather than a temporary system buffer.
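A minimal sketch of this overlap (my own illustration, not code from the text; the function name, buffer names, and message size are assumptions):

   #include <mpi.h>

   /* Exchange n doubles with a partner while doing unrelated computation.
      Posting the receive first lets the runtime deliver the incoming
      message directly into inbuf instead of a temporary system buffer. */
   void exchange_and_compute (double *outbuf, double *inbuf, int n,
                              int partner, MPI_Comm comm)
   {
      MPI_Request req[2];

      MPI_Irecv (inbuf, n, MPI_DOUBLE, partner, 0, comm, &req[0]);
      MPI_Isend (outbuf, n, MPI_DOUBLE, partner, 0, comm, &req[1]);

      /* ... computation that touches neither inbuf nor outbuf ... */

      MPI_Waitall (2, req, MPI_STATUSES_IGNORE);
   }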

��� The first way: A worker's initial request for work is a message of length 0. The manager could distinguish this request from a message containing a document profile vector by looking at the length of the message.


The second way: The first message a manager receives from any worker is its request for work. The manager could keep track of how many messages it has received from each worker. When the manager receives a message from a worker, it could check this count. If the count of previously received messages is 0, the manager would know it was an initial request for work.

Both of these solutions are more cumbersome and obscure than using a special message tag. Hence tagging the message is preferable.

��� Typically a call to MPI_Wait is placed immediately before the buffer transmitted by MPI_Isend is reused. However, in this case there is no such buffer, since a message of length zero was sent.

��� If the manager set aside a profile vector buffer for each worker, it could initiate a nonblocking receive every time it sent a task to a worker. It could then replace the call to MPI_Recv with a call to MPI_Testsome

to see which document vectors had been received. This change would eliminate the copying of profile vectors from system buffers to user buffers.

The call to MPI_Send in which the manager sends a file name to a worker could also be changed to the nonblocking MPI_Isend. Since the file name will never be overwritten, there is no need for a corresponding call to MPI_Wait.
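A sketch of the MPI_Testsome idea (not the book's code; NUM_WORKERS, VECTOR_LEN, RESULT_TAG, and vector_buf are assumed names):

   MPI_Request recv_req[NUM_WORKERS];
   MPI_Status  statuses[NUM_WORKERS];
   int         indices[NUM_WORKERS];
   int         outcount, i, w;

   /* One outstanding receive per worker, posted when its task was sent */
   for (w = 0; w < NUM_WORKERS; w++)
      MPI_Irecv (vector_buf[w], VECTOR_LEN, MPI_UNSIGNED, w+1,
                 RESULT_TAG, MPI_COMM_WORLD, &recv_req[w]);

   /* Later: see which profile vectors have already arrived */
   MPI_Testsome (NUM_WORKERS, recv_req, &outcount, indices, statuses);
   for (i = 0; i < outcount; i++) {
      w = indices[i];   /* worker whose vector is now in vector_buf[w] */
      /* ... record the vector, send worker w+1 another file name,
             and repost the MPI_Irecv for that worker ... */
   }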

��� In a hash table that uses chaining to resolve collisions, the table is an array of pointers to singly linked lists. It is easy enough for one worker to broadcast this array to the other workers, because these pointers are stored in a contiguous group of memory locations. However, the linked lists must also be broadcast. The structures on the linked lists are allocated from the heap, and they are not guaranteed to occupy a contiguous block of memory. Even if they did, we cannot guarantee that the same memory locations are available on the other processors. In short, we cannot broadcast elements allocated from the heap. A way to solve this problem is to allocate a large, static array and do all dynamic memory allocations from this array. In other words, we create a user-managed heap. We can broadcast the hash table by broadcasting this array along with the array of pointers.
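One way such a user-managed heap might look (a sketch under my own naming and sizing assumptions, not code from the text); chain links are stored as integer offsets into the static array, so the broadcast data is position-independent:

   #include <mpi.h>

   #define TABLE_SIZE 1024
   #define HEAP_SIZE  65536
   #define KEY_LEN    32

   struct entry {
      char key[KEY_LEN];
      int  next;                  /* offset of next entry in chain, or -1 */
   };

   static int          table[TABLE_SIZE];   /* offset of first entry, or -1 */
   static struct entry heap[HEAP_SIZE];     /* user-managed heap */
   static int          heap_top = 0;        /* next unused slot */

   /* Broadcast the entire hash table from process root.
      Byte-wise broadcast of structs assumes a homogeneous cluster. */
   void broadcast_table (int root, MPI_Comm comm)
   {
      MPI_Bcast (table, TABLE_SIZE, MPI_INT, root, comm);
      MPI_Bcast (&heap_top, 1, MPI_INT, root, comm);
      MPI_Bcast (heap, heap_top * (int) sizeof(struct entry), MPI_BYTE,
                 root, comm);
   }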


Chapter 10

Monte Carlo Methods

� �� If you use a linear congruential random number generator, the ordered tuples (x_i, x_{i+1}, ...) in the corresponding multidimensional space form a lattice structure, limiting the precision of the answer that can be computed by this method.

� �� The principal risk associated with assigning each process the same linear congruential generator with different seed values is that the sequences generated by two or more processes may be correlated. In the worst case, the sequences of two or more processes will actually overlap.

� �� Here is a C function (with accompanying static variables) that uses the Box-Muller transformation to return a double-precision floating-point value from the normal distribution:

#include <math.h>
#include <stdlib.h>

int active = 0;            /* Set to 1 when g2 holds a sample */
double g2;                 /* Normal variable */
unsigned short xi[3];      /* Random number seed */

double box_muller (void)
{
   double f, g1, r, v1, v2;   /* See book's pseudocode */

   if (active) {
      active = 0;
      return g2;
   } else {
      do {
         v1 = 2.0 * erand48(xi) - 1.0;
         v2 = 2.0 * erand48(xi) - 1.0;
         r = v1 * v1 + v2 * v2;
      } while ((r >= 1.0) || (r == 0.0));
      f = sqrt(-2.0 * log(r)/r);
      g1 = f * v1;
      g2 = f * v2;
      active = 1;
      return g1;
   }
}
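A short usage example (my own addition, not from the text): seed the erand48 state, then draw a few samples.

   #include <stdio.h>

   int main (void)
   {
      int i;

      xi[0] = 1; xi[1] = 2; xi[2] = 3;   /* arbitrary seed values */
      for (i = 0; i < 5; i++)
         printf ("%f\n", box_muller());
      return 0;
   }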

� �� The volume is �������

� �� The volume is � � � �

� �� The volume is � ���

� �� This table is based on � million neutrons per width H�

H Re�ected ��� Absorbed ��� Transmitted ���

� ����� ����� ������ ����� ����� ����� ����� � ��� ����� ����� ���� ���� ����� ����� ���� ����� ����� � �� ����� ����� � �� ����� ����� � � ����� ����� � � ���� ���� �

� �� The temperature at the middle of the plate is ��� ��

� ��� Average number of stalls occupied by cars: ���. Probability of a car being turned away: ���.

� ��� Probability of waiting at each entrance�

North� �����

West� � ���

South� �����

East� �����


Average queue length at each entrance�

North� ���

West� ���

South� ���

East� ���


Chapter 11

Matrix Multiplication

���� a� C �

�� �� ��

�� ��� �

�A

b� A��B�� �A��B�� � �� � ���� � �� �A��B�� �A��B�� � �� � ��� � ���

A��B�� �A��B�� �

� ��

��

� ��

��

� ��

A��B�� �A��B�� �

� �� ��

��

� �����

��

� ����

���� For reasonable performance we want to ensure that each process's portions of A and B fit in primary memory. If both A and B are divided into pieces and distributed among the processors, then larger problems can be solved when more processors are available. If we replicate B across all processes, then the maximum problem size is limited by the amount of memory available on each processor: larger problems cannot be solved when more processors are available. Put another way, a parallel program based on this decomposition is not scalable.

���� a) Cannon's algorithm is a better match for the recursive sequential matrix multiplication algorithm because it divides the factor matrices into square (or nearly square) pieces.

b) When the block of B is too large to fit in cache, the recursive sequential matrix multiplication algorithm described in the book breaks the block into four pieces by dividing it horizontally and


vertically. We modify the algorithm so that it only breaks the block into two pieces. It considers the dimensions of the block and divides it in half, making either a vertical cut or a horizontal cut. It breaks the block along the dimension which is largest. For example, if the block has more columns than rows, it makes a vertical cut.
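A sketch of this modified recursion (my own illustration, not the book's code); the matrices are stored row-major with a common leading dimension lda, and THRESHOLD stands in for the block size that fits in cache:

   #define THRESHOLD 64

   /* Accumulate C += A*B, where A is m x k, B is k x n, C is m x n,
      all submatrices of arrays with row stride lda. When a block is
      too large, split it in half along its largest dimension. */
   static void matmul_rec (double *a, double *b, double *c, int lda,
                           int m, int k, int n)
   {
      int i, j, l;

      if (m <= THRESHOLD && k <= THRESHOLD && n <= THRESHOLD) {
         for (i = 0; i < m; i++)
            for (l = 0; l < k; l++)
               for (j = 0; j < n; j++)
                  c[i*lda + j] += a[i*lda + l] * b[l*lda + j];
      } else if (n >= m && n >= k) {       /* most columns: vertical cut */
         matmul_rec (a, b,       c,       lda, m, k, n/2);
         matmul_rec (a, b + n/2, c + n/2, lda, m, k, n - n/2);
      } else if (m >= k) {                 /* most rows: horizontal cut */
         matmul_rec (a,             b, c,             lda, m/2, k, n);
         matmul_rec (a + (m/2)*lda, b, c + (m/2)*lda, lda, m - m/2, k, n);
      } else {                             /* split the inner dimension */
         matmul_rec (a,       b,             c, lda, m, k/2, n);
         matmul_rec (a + k/2, b + (k/2)*lda, c, lda, m, k - k/2, n);
      }
   }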

���� a� Computation time per iteration is �n��p�� Communication timeper iteration is �� n��p��� Since p � ��� � � � nanoseconds�� � �� microseconds� and � � ��� million elements per second�communication time is less than computation time for all n �� ���

b� Computation time per iteration is �n��p���� Communicationtime per iteration is �� � n��p���� Since p � ��� � � � nanoseconds� � � �� microseconds� and � � ��� million ele�ments per second� communication time is less than computationtime for all n � ����


Chapter 12

Solving Linear Systems

���� Using back substitution to solve the diagonal system of equations at the end of the section:

�x� ��x� ��x� ��x� � ���x� ��x� ��x� �

�x� ��x� � ��x� � �

We can solve the last equation directly, since it has only a single unknown. After we have determined the value of the last unknown, we can simplify the other equations by removing its terms and adjusting their values of b.

�x� ��x� ��x� � ����x� ��x� � �

�x� � ��x� � �

Now the third equation has only a single unknown, and we can read off its value. Again, we use this information to simplify the two equations above it.

�x� ��x� � ����x� � ���

�x� � ��x� � �


We have simplified the second equation to contain only a single unknown, and dividing its right-hand side by the diagonal coefficient yields the value of that unknown. After subtracting this value times its coefficient in the first equation from the first equation's right-hand side, we have

�x� � �����x� � ���

�x� � ��x� � �

and it is easy to solve for the remaining unknown, which completes the solution vector x.

���� Forward substitution algorithm in pseudocode:

Forward Substitution:
   a[0..n−1][0..n−1]   {coefficient matrix}
   b[0..n−1]           {constant vector}
   x[0..n−1]           {solution vector}

   for i ← 0 to n−1 do
      x[i] ← b[i] / a[i,i]
      for j ← i+1 to n−1 do
         b[j] ← b[j] − x[i] × a[j,i]
         a[j,i] ← 0   {This line is optional}
      endfor
   endfor
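A direct C rendering of this pseudocode (a sketch assuming a dense row-major coefficient matrix; not the book's code):

   /* Solve the lower triangular system a*x = b by forward substitution.
      a is n x n in row-major order; b is modified as the algorithm runs. */
   void forward_substitution (double *a, double *b, double *x, int n)
   {
      int i, j;

      for (i = 0; i < n; i++) {
         x[i] = b[i] / a[i*n + i];
         for (j = i+1; j < n; j++) {
            b[j] -= x[i] * a[j*n + i];
            a[j*n + i] = 0.0;       /* optional, as in the pseudocode */
         }
      }
   }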

���� a) Assume the execution time of the sequential algorithm is χn³. With interleaved rows, the work is divided more or less evenly among the processors, so we can assume the computation time of the parallel program is χn³/p.

During each iteration of the parallel algorithm a tournament communication determines the pivot row. If this "all-reduce" is implemented as a reduction followed by a broadcast, the communication time is about 2⌈log p⌉λ per iteration. (The messages are so short that the communication time is dominated by the message latency.)


After the tournament one process broadcasts the pivot row to the other processes. As the algorithm progresses, the number of elements that must be broadcast continues to shrink. Over the course of the algorithm the average number of elements broadcast is about n/2. If β measures bandwidth in terms of elements per second, then the total communication time for the broadcast steps is n⌈log p⌉(λ + n/(2β)).

The overall estimated execution time of the parallel program is

   χn³/p + n⌈log p⌉(3λ + n/(2β))

b) Assume the execution time of the sequential algorithm is χn³. With interleaved columns, the work is divided more or less evenly among the processors, so we can assume the parallel program requires time χn³/p to perform these computations.

Also, a single process is responsible for determining the pivot element each iteration. Assuming it takes χ time units to check each pivot row, and given an average of n/2 rows to check per iteration, the time needed to determine the pivot row is about χn/2 per iteration, or χn²/2 over all n iterations.

After the pivot row is determined, one process broadcasts column i to the other processes. This column does not shrink as the algorithm progresses, because the unmarked rows are scattered throughout the matrix. If β measures bandwidth in terms of elements per second, then the total communication time for the column broadcasts is n⌈log p⌉(λ + n/β).

The overall estimated execution time of the parallel program is

   χn³/p + χn²/2 + n⌈log p⌉(λ + n/β)

���� During each iteration √p processes are involved in determining the pivot row. Each process finds its candidate row; this has time complexity Θ(n/√p). Then these processes participate in a tournament. The tournament has time complexity Θ(log √p) = Θ(log p). Up to √p processes control portions of the pivot row that must be broadcast. (The number of processes decreases as the algorithm progresses.) Each process controlling a portion of the pivot row broadcasts it to the other processes in the same column of the virtual process grid. This broadcast has message latency Θ(log √p) = Θ(log p) and message transmission time Θ((n/√p) log p). Combining both communication steps, we see that the checkerboard-oriented parallel Gaussian elimination algorithm has overall message latency Θ(n log p) and overall message transmission time Θ((n²/√p) log p).


Next we consider the computation time. As the algorithm progresses, we manipulate elements in fewer and fewer columns. Hence processes gradually drop out of the computation. The processes associated with the rightmost columns stay active until the end; they determine the computation time of the program. Each process is responsible for n²/p elements. If the process manipulates all of these elements for all n iterations, the time complexity of these computations is Θ(n³/p).

Now we can determine the isoefficiency of the algorithm. The total communication overhead across p processes is Θ(p · (n²/√p) log p) = Θ(n²√p log p). Hence

   n³ ≥ Cn²√p log p  ⇒  n ≥ C√p log p

For this algorithm M(n) = n². Hence the scalability function is

   M(C√p log p)/p = C²p log²p / p = C² log² p

This algorithm is much more scalable than the row-oriented and column-oriented algorithms developed in the chapter. It is not as scalable as the pipelined, row-oriented algorithm.

����� a) The first element of the file is an integer n indicating the number of rows in the matrix. The second element of the file is an integer w indicating the semibandwidth of the matrix. The file contains n(2w+1) additional elements, representing the 2w+1 elements in each of the n rows of the matrix. Note that the first w rows and the last w rows of the banded matrix do not have 2w+1 elements (see the corresponding figure in the text). The first w and the last w rows in the file are "padded" with 0s for these nonexistent elements. This ensures that the (w+1)st element of each row in the file is the element on the main diagonal of the matrix.


Chapter 13

Finite Difference Methods

���� Function u has only three unique partial derivatives of order 2, because u_xy = u_yx. In other words, taking the partial derivative with respect to x and then with respect to y yields the same result as taking the partial derivative with respect to y and then taking the partial derivative with respect to x.

���� a) When k = 1, each process sends four messages and receives four messages. All messages have length n/√p. We assume sends and receives proceed simultaneously. There are no redundant computations. Hence the communication cost per iteration is 4(λ + n/(√p β)).

When k > 1, each process must send eight messages and receive eight messages, because elements are needed from diagonally adjacent blocks. Choose some k > 1. The total number of elements each process sends (and the total number of elements each process receives) is

   4kn/√p + 4 Σ_{i=1}^{k−1} i  =  4kn/√p + 2k(k − 1)

The number of redundant computations is the same as the number of elements communicated for k − 1. Hence the number of redundant computations is

   4(k − 1)n/√p + 2(k − 1)(k − 2)

The average communication cost per iteration is the total communication time plus the redundant computation time, divided by k, or

   [ 8λ + (4kn/√p + 2k(k − 1))/β + χ(4(k − 1)n/√p + 2(k − 1)(k − 2)) ] / k

b) This chart plots communication cost per iteration as a function of k, using the values of n, p, λ, χ, and β given in the exercise.

[Chart: communication cost per iteration (vertical axis, roughly 0.00125 to 0.00155 seconds) as a function of k = 1, 2, ..., 10 (horizontal axis).]


Chapter 14

Sorting

���� Quicksort is not a stable sorting algorithm: the partitioning step can exchange records that have equal keys, changing their relative order.

���� Note: In the first printing of the text there are a couple of typos in this exercise. First, the value of n is misprinted. Second, answers should be computed only for p equal to powers of 2.

This is a difficult problem. I came up with these numbers by actually writing an MPI program implementing the first parallel quicksort algorithm described in the book, then performing a max-reduction to find the maximum partition size. The values in the table represent the average over several executions of the program.

a�

p Maximum List Size Max list size dn�pe� � � �� � ��� �� ����� ������ ����� ������ ������� ������ �����

b) Parallel execution time is the sum of the computation time and the communication time.

The computation time has two components: partition time and quicksort time. At each iteration of the algorithm each processor must partition its list of values into two pieces: the piece it is keeping and the piece it is passing. This time is about

mlog p� � � sec�


where m is the maximum list size. The quicksort time is

� � sec�m logm� ����� logm�

(see Baase and Van Gelder for the expected number of comparisons in sequential quicksort).

The communication time is the time spent exchanging elements. We assume sending and receiving happen concurrently. We estimate the total communication time by assuming the processor with the most elements concurrently sends half of these elements and receives half of these elements each iteration. The total communication time is

log p� � � sec � m���� byte��� � byte�sec�

p Time sec� Speedup� ���� �� � ���� ����� �� � ����� � �� ��� �� � �� ����

���� Note: In the first printing of the text there are a couple of typos in this exercise. First, the value of n is misprinted. Second, answers should be computed only for p equal to powers of 2.

This is a difficult problem. I came up with these numbers by actually writing an MPI program implementing the hyperquicksort algorithm described in the book, then performing a max-reduction to find the maximum partition size. The values in the table represent the average over several executions of the program.

a�

p Maximum List Size Max list size dn�pe� � � �� � � � �� �� �� ������ �� �� ������ �� ���� ����� �� ��

b) Parallel execution time is the sum of the computation time and the communication time. The execution time formulas are the same as in the answer to the previous exercise.


p Time sec� Speedup� ����� �� � � ��� ����� � � � ����� � ��� ������ � � ����

���� We are given values for n and for the machine parameters λ, χ, and β. In the first step each processor sorts n/p elements. The time required for this is

������n�p� logn�p�� �����n�p��

The time to gather the pivots is dominated by the message latency:

   ⌈log p⌉ λ

Next one process sorts the p² random samples. Using quicksort, the estimated time is

������p�� logp��� �����p�

Broadcasting p − 1 pivots to the other processes is dominated by the message latency:

   ⌈log p⌉ λ

Next comes the all-to-all communication step. We assume each process sends p − 1 messages and receives p − 1 messages. These happen concurrently. The total communication time for this step is

p� ���� np� ���p��

Finally, each process merges p sorted sublists. If each step requires log p comparisons, the time needed for this step is

   (χ log p)(n/p)

The total estimated execution time is

������n�p� logn�p� � �p� log p�� �����n�p� � p��

� � log p�� �n�p�� �

��dlog pe� p� ��� � np� ���p��


Processors    Time (sec)    Speedup

� ����� �� � ���� ����� ���� ����� ���� ��� � � �� ����� ���� �� �� ���� ����� ��� ����� ���� ����� ���� ������ � �� ������ ���� ������ ��� ������ ���� � � �� ��� � ���� ���� ����

���� The PSRS algorithm is less disrupted when given a sorted list.

In the first iteration of the hyperquicksort algorithm, the root process sorts its sublist, then broadcasts the median of its sublist to the other processes. All processes use this value to divide their lists into the lower "half" and the upper "half". Since the root process has the smallest n/p keys, its estimate of the median will be very poor. The root process will send half of its keys to its partner process in the upper half of the hypercube. The other processes in the lower half of the hypercube will send all of their keys to their partner processes in the upper half of the hypercube. This severe workload imbalance will significantly increase the execution time of the algorithm.

In contrast, in the PSRS algorithm each process contributes p samples from its sorted sublist. After sorting the contributed samples, the root process selects regularly spaced pivot values. After receiving the pivot values, each process will partition its sorted sublist into p disjoint pieces. The largest partition held by process i will be partition i. When each process i keeps its ith partition and sends the jth partition to process j, relatively few values will be communicated. Each process should have nearly the same number of elements for the final mergesort step.

���� We are going to perform a p-way merge of p sorted sublists. We build a heap from the first element of each sorted sublist. This takes time Θ(p). The root of the heap is the element with the smallest value. It is


the first element of the merged lists. We remove the heap element and replace it with the next member of that element's originating sublist. This can be done in constant time. We then perform a "heapify" operation to restore the heap property. This step takes Θ(log p) time.

Now the root of the heap is the second element of the merged lists. We repeat the previous delete-replace-heapify procedure.

When an originating sublist is exhausted, we insert a key with a very large value into the heap, ensuring it will never be chosen before an element from one of the other sublists.

We continue until we have deleted from the heap a number of elements equal to the original total length of all the sublists. If there were Θ(n/p) elements total in the p sublists, the total time required to perform the merge is Θ((n/p) log p).
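A sequential sketch of this p-way merge (my own illustration, not the book's code). Rather than inserting a sentinel key when a sublist is exhausted, the sketch simply shrinks the heap, which has the same effect:

   #include <stdlib.h>

   struct node {
      int value;      /* key at the front of a sublist */
      int list;       /* which sublist it came from */
      int next;       /* index of the next unused element in that sublist */
   };

   static void sift_down (struct node *heap, int size, int i)
   {
      for (;;) {
         int left = 2*i + 1, right = 2*i + 2, smallest = i;
         struct node tmp;
         if (left  < size && heap[left].value  < heap[smallest].value) smallest = left;
         if (right < size && heap[right].value < heap[smallest].value) smallest = right;
         if (smallest == i) break;
         tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
         i = smallest;
      }
   }

   /* Merge p sorted integer lists (lists[i] has lens[i] elements) into out */
   void p_way_merge (int **lists, int *lens, int p, int *out)
   {
      struct node *heap = (struct node *) malloc (p * sizeof(struct node));
      int i, size = 0, k = 0;

      /* Build the heap from the first element of each nonempty sublist */
      for (i = 0; i < p; i++)
         if (lens[i] > 0) {
            heap[size].value = lists[i][0];
            heap[size].list  = i;
            heap[size].next  = 1;
            size++;
         }
      for (i = size/2 - 1; i >= 0; i--)
         sift_down (heap, size, i);

      /* Repeatedly remove the smallest key and refill from its sublist */
      while (size > 0) {
         int src = heap[0].list;
         out[k++] = heap[0].value;
         if (heap[0].next < lens[src])
            heap[0].value = lists[src][heap[0].next++];
         else
            heap[0] = heap[--size];   /* sublist exhausted: shrink the heap */
         sift_down (heap, size, 0);
      }
      free (heap);
   }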

���� a) We organize the processes into a binomial tree. In the initial log p steps of the algorithm, each process with a list of elements divides its list in half and sends half of its list to a partner process. After log p steps, every process has n/p keys.

Next each process uses sequential mergesort to sort its sublist.

Reversing the communication pattern of the first step, during each of log p steps half the processes with a sorted sublist send their sublist to another process with a sorted sublist, and the receiving process merges the lists. After log p of these steps, a single process has the entire sorted list.

This figure illustrates parallel mergesort:


[Figure: parallel mergesort on a 16-letter list. The list is divided in half twice (Divide, Divide) until each of four processes holds four letters, each process sorts its sublist (Sort), and the sorted sublists are combined in two rounds of merging (Merge, Merge) until one process holds the fully sorted list.]

b) Distributing the unsorted subarrays takes log p steps. In the first step 1 process sends n/2 values, in the second step 2 processes concurrently send n/4 values, and so on. Hence the communication complexity of this step is Θ(log p + n). The time needed for each process to sort its section of n/p values is Θ((n/p) log(n/p)). The communication time needed for the merge step is Θ(log p + n). The computation time for the merge step is Θ(n). Hence the overall time complexity of parallel mergesort is Θ(log p + n + (n/p) log(n/p)).

c) The communication overhead of the algorithm is p times the communication complexity, or Θ(p(log p + n)). When n is large the message transmission time dominates the message latency, so we'll simplify the communication overhead to Θ(p log p). Hence the isoefficiency function is

   n log n ≥ Cp log p


When we double p we must slightly more than double n in order to maintain the same level of efficiency. Since p is much smaller than n, the algorithm is not perfectly scalable.

���� a) We divide the interval [0, k) into p buckets (subintervals). Each process is responsible for one of these buckets, and every process knows the interval handled by each process. In step one every process divides its n/p keys into p groups, one per bucket. In step two an all-to-all communication routes groups of keys to the correct processes. In step three each process sorts the keys in its bucket using quicksort. (A sketch of the binning step appears after this list.)

b) Step one has time complexity Θ(n/p). Since n is much larger than p, communication time is dominated by the time needed to transmit and receive the elements. We expect each processor to send about (n/p)(p − 1)/p elements to other processors and receive about the same number. Assuming the network has sufficient bandwidth to allow each processor to be sending and receiving a message concurrently, the time needed for step two is Θ(n/p). Step three has time complexity Θ((n/p) log(n/p)). The overall computational complexity of the parallel bucket sort is Θ((n/p) log(n/p)), and the communication complexity is Θ(n/p).

c) The communication overhead is p times the communication complexity, or Θ(n). If there are p buckets, the sequential algorithm requires time Θ(n) to put the values in the buckets and time Θ((n/p) log(n/p)) to sort the elements in each bucket. With p buckets, the complexity of the sequential algorithm is Θ(n log(n/p)).

Hence the scalability function is

   n log(n/p) ≥ Cn  ⇒  log(n/p) ≥ C  ⇒  n/p ≥ 2^C  ⇒  n ≥ 2^C p

The parallel bucket sort algorithm is not highly scalable.
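A sketch of step one (my own illustration, not the book's code): each process bins its local keys, drawn from [0, k), into p groups that an all-to-all exchange can then deliver to the owning processes.

   #include <stdlib.h>

   /* keys: this process's n_local keys, each in [0, k).
      Returns an array of p per-bucket groups; counts[b] gets group b's size. */
   int **bin_keys (const int *keys, int n_local, int k, int p, int *counts)
   {
      int i, b;
      int **groups = (int **) malloc (p * sizeof(int *));

      for (b = 0; b < p; b++) {
         counts[b] = 0;
         groups[b] = (int *) malloc (n_local * sizeof(int));  /* worst case */
      }
      for (i = 0; i < n_local; i++) {
         b = (int) ((long) keys[i] * p / k);   /* bucket owning this key */
         groups[b][counts[b]++] = keys[i];
      }
      return groups;
   }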


Chapter 15

The Fast Fourier Transform

���� a� � � �e�i

b� i � �i���� i

c� � � �i � �e���i

d� ��� �i � ����e����i

e� �� i � ����e����i

���� The principal 2nd root of unity is −1 = e^{πi}.

The principal 3rd root of unity is −1/2 + (√3/2)i = e^{(2π/3)i}.

The principal 4th root of unity is i = e^{(π/2)i}.

The principal 8th root of unity is √2/2 + (√2/2)i = e^{(π/4)i}.

���� a� ������b� ����� � �i������� �i�

c� �������� � ����i� � � �i�� ��� � ����i����� ��� � ����i� � ��i������� ����i�

���� a� ��� ����

b� �� �� �� ��

c) �� ����� ����� �� �� �� � ������ ���� ��������

Our first parallel quicksort algorithm and hyperquicksort exhibit a

butterfly communication pattern. A butterfly communication pattern


is also an appropriate way to perform an all-to-all exchange when the communication time is dominated by message latency. An all-reduce may be performed using a butterfly communication pattern.

���� The sequential versions of both hyperquicksort and the fast Fourier transform have time complexity Θ(n log n). Both parallel algorithms have Θ(log p) communication steps, each step transmitting Θ(n/p) values. Both parallel algorithms use a butterfly communication pattern that enables data starting on any processor to reach any other processor in no more than log p message-passing steps.


Chapter 16

Combinatorial Search

���� The decomposition step cannot be performed in parallel. The time needed for the decompositions alone is

   n + n/k + n/k² + · · · + n/k^{log_k n}

Since this sum is ≥ n and ≤ nk/(k − 1), a good lower bound on the complexity of the parallel algorithm is Ω(n).

���� There are two fundamentally different strategies for limiting the number of nodes in the state space tree. I do not know which one is more effective for this problem.

The first approach is to always fill in the word that would cause the most unassigned letters in the puzzle to be given a value. By assigning as many letters as possible at each node in the state space tree, this strategy aims to minimize the depth of the tree.

The second approach is to always fill in the word with the fewest unassigned letters. In case of a tie, we would choose the longest word with the fewest unassigned letters. This strategy aims to minimize the number of branches at each node in the state space tree.

The algorithm described in the book represents a middle point between the two strategies I have just described.

���� It is unlikely that a simple backtrack search of the state space tree for the 8-puzzle will yield a solution, because there is a good probability that before the search finds a solution node, it will encounter a node whose board position is identical to that of a previously-examined node. If this happens, it means that the search has run into a cycle.


Assuming that the legal moves from a given position are always examined in the same order, encountering a cycle guarantees that the search will never leave the cycle, and hence a solution will never be found.

Here are two ways to modify the backtrack search to guarantee it will eventually find a solution.

One way to avoid falling into an infinite cycle is to check for cycles every time a move is considered. If a move results in a board position already represented by one of the ancestor nodes in the state space tree, the move is not made.

Another way to avoid falling into an infinite cycle is to use iterative deepening, i.e., limiting the search to a certain depth. First we perform backtrack search to a fixed level d. If no solution is found, we perform a backtrack search to level d+1. We continue in this fashion until a solution is found.
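A sketch of the iterative-deepening driver (my own illustration; the State type and the move functions are assumed to come from the puzzle-specific code):

   typedef struct state State;           /* puzzle-specific board state */
   int  is_goal    (State *s);           /* assumed helper functions */
   int  num_moves  (State *s);
   void apply_move (State *s, int m);
   void undo_move  (State *s, int m);

   /* Returns 1 if a solution lies within the given depth limit */
   int depth_limited_search (State *s, int depth)
   {
      int m, found = 0;

      if (is_goal (s)) return 1;
      if (depth == 0) return 0;
      for (m = 0; !found && m < num_moves (s); m++) {
         apply_move (s, m);
         found = depth_limited_search (s, depth - 1);
         undo_move (s, m);
      }
      return found;
   }

   /* Deepen the limit until a solution is found; returns its depth */
   int iterative_deepening (State *start)
   {
      int d;

      for (d = 1; ; d++)
         if (depth_limited_search (start, d)) return d;
   }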

���� The lower bound associated with the right node of the root is ��because the eight tiles are out of place by a total of � positions usingthe Manhattan distance calculation�� and � move has already beentaken� ��� � �� Tile � is � position out of place� tile � is � positionsout of place� tile � is � positions out of place� and tile � is � positionout of place��

���� The number of nodes at each level of the state space tree grows by the number of legal moves available at that level. A breadth-first search examines all nodes in the first five levels of the state space tree. A breadth-first search also examines all nodes up to and including the solution node at the sixth level of the tree, including the nodes to the left of the solution node. The total number of nodes examined by a breadth-first search is the sum of these counts.

���� In case of a tie, we want to pick the node deeper in the state space tree, since its lower bound is based more on definite information (the number of constraints added so far) and less on an estimate (the lower bound function).

���� The advantage of the proposed best-first branch-and-bound algorithm is that it has very little communication overhead. The disadvantage is that the workloads among the processes are likely to be highly imbalanced. The proposed algorithm is unlikely to achieve higher performance than the algorithm presented in the text.


���� If a perfect evaluation function existed, the best move could be found simply by evaluating the position resulting from each legal move and choosing the position with the highest value.

���� Here is the result of a minimax evaluation of the game tree:

[Figure: minimax evaluation of the game tree. The leaves have values 3, 2, 6, 7, −2, 1, 1, and 5; the computed minimax value of each interior node is shown above it.]

���� The alpha-beta algorithm prunes the game tree of the referenced figure at the node labeled B because this is a min node, and the opponent is able to choose the move. Since the opponent is trying to minimize the value, we know that the ultimate value of the node is less than or equal to the first value returned beneath it. However, the parent of this node is a max node. Since the player already has a move leading to a position with a higher value, there is no way the player will make a move leading to a position whose value is less than or equal to the value returned beneath B, so there is no point in continuing the search. The search tree is pruned at this point.
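A sketch of the sequential alpha-beta procedure being described (my own illustration, not the book's code; the Node type and its accessors are assumed):

   typedef struct node Node;
   int   is_leaf      (const Node *n);      /* assumed game-specific helpers */
   int   leaf_value   (const Node *n);
   int   num_children (const Node *n);
   Node *child        (const Node *n, int i);

   /* Returns the minimax value of n; maximizing indicates whose turn it is */
   int alpha_beta (const Node *n, int alpha, int beta, int maximizing)
   {
      int i, v;

      if (is_leaf (n)) return leaf_value (n);

      if (maximizing) {
         for (i = 0; i < num_children (n); i++) {
            v = alpha_beta (child (n, i), alpha, beta, 0);
            if (v > alpha) alpha = v;
            if (alpha >= beta) break;   /* cutoff: opponent avoids this node */
         }
         return alpha;
      } else {
         for (i = 0; i < num_children (n); i++) {
            v = alpha_beta (child (n, i), alpha, beta, 1);
            if (v < beta) beta = v;
            if (alpha >= beta) break;   /* cutoff: player avoids this node */
         }
         return beta;
      }
   }

The initial call would be alpha_beta(root, INT_MIN, INT_MAX, 1).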

����� Here is the result of an alpha-beta search of the game tree:

[Figure: alpha-beta search of the same game tree. Only the leaves with values 3, 2, 6, −2, and 1 are examined; the remaining leaves are pruned.]

����� Extending the perfectly ordered game tree:


[Figure: the perfectly ordered game tree extended by one additional level; the leaf and interior node values (1s, 2s, and 3s) follow the same ordering pattern as in the tree shown in the text.]

����� When DTS search reaches a point where there is only a single processor allocated to a node, it executes the standard sequential alpha-beta search algorithm. Hence alpha-beta search is a special case of DTS: the case where p = 1.

����� Improving the pruning effectiveness of the underlying alpha-beta algorithm reduces the speedup achieved by the DTS algorithm on a particular application. Given a uniform game tree with branching factor b, if alpha-beta pruning reduces the effective branching factor to b^x, where 0.5 ≤ x ≤ 1, then the speedup achieved by DTS with p processors and breadth-first allocation will be reduced to O(p^x).


Chapter 17

Shared-memory Programming

���� The two OpenMP functions that have the closest analogs among the MPI functions are omp_get_num_threads and omp_get_thread_num. Function omp_get_num_threads is similar to MPI_Comm_size. Function omp_get_thread_num is similar to MPI_Comm_rank.

���� a) #pragma omp parallel for

b) This code segment is not suitable for parallel execution because the control expression of the for loop is not in canonical form. The number of iterations cannot be determined at loop entry time.

c) This loop can be executed in parallel assuming there are no side effects inside function foo. The correct pragma is

#pragma omp parallel for

d) This loop can be executed in parallel assuming there are no side effects inside function foo. The correct pragma is

#pragma omp parallel for

e) This loop is not suitable for parallel execution, because it contains a break statement.

f) #pragma omp parallel for reduction(+:dotp)

g) #pragma omp parallel for

h) This loop is not suitable for parallel execution if n is sufficiently large relative to k, because in that case there is a dependence


between the first iteration of the loop (which assigns a value to a[k]) and the (k+1)st iteration (which references that value).

���� Here is a C program segment that computes π using a private variable and the critical pragma:

double area, pi, subarea, x;
int i, n;

area = 0.0;
#pragma omp parallel private(x, subarea)
{
   subarea = 0.0;
   #pragma omp for nowait
   for (i = 0; i < n; i++) {
      x = (i + 0.5)/n;
      subarea += 4.0/(1.0 + x*x);
   }
   #pragma omp critical
   area += subarea;
}
pi = area / n;

���� Matrices in C are stored in row-major order. There is a good chance that the end of one matrix row and the beginning of the next matrix row are stored in the same cache block. Increasing the chunk size reduces the number of times one thread is responsible for one matrix row and another thread is responsible for the next matrix row. Hence increasing the chunk size reduces the number of times two different threads share the same cache block and increases the cache hit rate.
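For example, a larger chunk in a static schedule (an illustrative fragment, not from the text; CHUNK, N, and the matrix a are assumptions) keeps neighboring rows on the same thread:

   #define CHUNK 8

   /* Rows i .. i+CHUNK-1 go to the same thread, so a cache block spanning
      the end of one row and the start of the next is rarely shared. */
   #pragma omp parallel for private(j) schedule(static, CHUNK)
   for (i = 0; i < N; i++)
      for (j = 0; j < N; j++)
         a[i][j] = 0.0;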

���� Here is an example of a simple parallel for loop that would probably execute faster with an interleaved static scheduling than by giving each thread a single contiguous chunk of iterations:

for (i = 0; i < n; i++) {
   if (i < n/2) {
      a[i] = (sin(i) + cos(i)) * (sin(i) - cos(i));
   } else {
      a[i] = 0.0;
   }
}


���� A good example would be a loop with a small ratio of iterations to threads in which the variance in the execution times of each iteration is large and cannot be predicted at compile time.
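For instance (an illustrative fragment, not from the text; num_files, filename, and process_file are assumed names):

   /* Only a few more iterations than threads, and each file takes an
      unpredictable amount of time, so a dynamic schedule balances the load. */
   #pragma omp parallel for schedule(dynamic)
   for (i = 0; i < num_files; i++)
      process_file (filename[i]);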

���� Two threads can end up processing the same task if both threads are executing function get_next_task simultaneously and both threads execute the statement

answer = (*job_ptr)->task;

before either thread executes the following statement:

*job_ptr = (*job_ptr)->next;

That is why we add the critical pragma to ensure the function executes atomically.
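A sketch of the whole accessor with the critical pragma (my own reconstruction under assumed struct names, not necessarily the book's exact code):

   struct job {
      struct job *next;
      int         task;
   };

   int get_next_task (struct job **job_ptr)
   {
      int answer;

      #pragma omp critical
      {
         if (*job_ptr == NULL)
            answer = -1;                    /* no tasks left */
         else {
            answer   = (*job_ptr)->task;
            *job_ptr = (*job_ptr)->next;
         }
      }
      return answer;
   }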

���� Another OpenMP implementation of the Monte Carlo algorithm to compute π:

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[])
{
   int count;                /* Points inside unit circle */
   int i;
   int local_count;          /* This thread's subtotal */
   int samples;              /* Points to generate */
   unsigned short xi[3];     /* Random number seed */
   double x, y;              /* Coordinates of point */

   samples = atoi(argv[1]);
   omp_set_num_threads (atoi(argv[2]));
   count = 0;
   #pragma omp parallel private(xi,x,y,local_count)
   {
      local_count = 0;
      xi[0] = 1;                        /* arbitrary fixed seed components */
      xi[1] = 1;
      xi[2] = omp_get_thread_num();     /* distinct seed per thread */
      #pragma omp for nowait
      for (i = 0; i < samples; i++) {
         x = erand48(xi);
         y = erand48(xi);
         if (x*x + y*y <= 1.0) local_count++;
      }
      #pragma omp critical
      count += local_count;
   }
   printf ("Estimate of pi: %7.5f\n", 4.0*count/samples);
   return 0;
}

���� Code segment from Winograd's matrix multiplication algorithm, augmented with OpenMP directives:

#pragma omp parallel
{
   #pragma omp for private(j) nowait
   for (i = 0; i < m; i++) {
      rowterm[i] = 0.0;
      for (j = 0; j < p; j++)
         rowterm[i] += a[i][2*j] * a[i][2*j+1];
   }

   #pragma omp for private(j) nowait
   for (i = 0; i < q; i++) {
      colterm[i] = 0.0;
      for (j = 0; j < p; j++)
         colterm[i] += b[2*j][i] * b[2*j+1][i];
   }
}

Alternatively, we could have used the sections pragma to enable both for loops indexed by i to execute concurrently:

#pragma omp parallel private(i,j)   /* i and j must be private here */
{
   #pragma omp sections
   {
      #pragma omp section
      for (i = 0; i < m; i++) {
         rowterm[i] = 0.0;
         for (j = 0; j < p; j++)
            rowterm[i] += a[i][2*j] * a[i][2*j+1];
      }

      #pragma omp section
      for (i = 0; i < q; i++) {
         colterm[i] = 0.0;
         for (j = 0; j < p; j++)
            colterm[i] += b[2*j][i] * b[2*j+1][i];
      }
   }
}

Some compilers may allow both parallelizations. However, the Sun compiler does not allow a for pragma to be nested inside a sections pragma.


Chapter 18

Combining MPI and OpenMP

���� We consider the suitability of each of the program's functions for parallelization using threads.

main: The function contains no for loops. It has no large blocks of code that two or more threads could execute concurrently. It is not suitable for parallelization using threads.

manager: The only for loop in the function is the one that copies the document profile vector from buffer to vector[assigned[src]]. If the vector is long enough, this step could benefit from execution by concurrent threads. Before doing this, however, the for loop should be changed to a memcpy operation. There are no large blocks of code that two or more threads could execute concurrently.

worker: The function contains no for loops in canonical form. It has no large blocks of code that two or more threads could execute concurrently. It is not suitable for parallelization using threads.

build_2d_array: This function takes little time to execute.

get_names: This function searches a directory structure for names of documents to be profiled. The speed of this search could be improved by dividing subdirectories among threads.

write_profiles: This function writes profile vectors to a single file. It is not suitable for concurrent execution.

build_hash_table: This function initializes an array of pointers and takes little time to execute.


make_profile: This function builds a profile vector from the words in a document. We can imagine most of the program's execution time is spent inside this function. Assuming the profile vector for an entire document can be created from profile vectors of the document's pieces, this function is a good candidate for parallelization via threads.

read_dictionary: This function reads words from a single file. It is not a good candidate for parallelization.

From our analysis it appears the functions that are the best candidates for parallelization with OpenMP pragmas are make_profile and get_names.

���� The book describes three situations in which mixed MPI/OpenMP programs can outperform MPI-only programs.

The first situation is when the mixed MPI/OpenMP program has lower communication overhead than the MPI-only program. Monte Carlo methods typically have very little communication among the processes. Hence this situation probably does not hold.

The second situation is when it is practical to parallelize some portions of the computation via lighter-weight threads, but not via heavier-weight processes. In the case of a Monte Carlo program, the entire computation is completely parallelizable: every process is executing the simulation with a different stream of random numbers. Hence this situation probably does not hold.

The third situation is when some MPI processes are idle when others are busy, due to message-passing or other I/O operations. This is most likely not the case for Monte Carlo simulations.

The conclusion is that there are poor prospects for significantly improving the performance of the Monte Carlo methods by converting them into MPI/OpenMP programs.