
Parallelization of the Sieve of Eratosthenes

Mário Cordeiro

Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, PORTUGAL

[email protected]

Abstract. Algorithm parallelization plays an important role in modern multi-core architectures. The present paper presents the implementation and optimization of the Sieve of Eratosthenes, an algorithm to find prime numbers. A performance, efficiency and scalability analysis was made of distinct improvements over a serial version and two parallel versions. The parallelization of the algorithm was made employing two distinct technologies, OpenMP and MPI. The obtained results show that the parallel versions outperform the serial version, in most cases by a factor higher than 100.

Keywords: sieve of eratosthenes, parallel computing, OpenMP, MPI

1 Introduction

Prime calculation nowadays plays an important role in secure encryption. Public-key data encryption/decryption schemes such as the RSA algorithm [3] make use of prime numbers. Basically, the "public key" consists of the product of two large primes used to encrypt a message, and the "secret key" consists of the primes themselves, used to decrypt the message. Since the product of the two primes has only four factors (one, itself and the two primes), the time spent to discover those primes and find the "secret key" is proportional to the time needed to find the two primes. This means that the security of RSA is based on the computationally hard process of factoring large numbers. Using bigger prime numbers increases the security of the "public key" but also increases the computational cost needed to generate those prime numbers. That is one of the main reasons why finding prime numbers is so important in the computer world, and it shows how important it is nowadays to find bigger prime numbers efficiently.

Finding prime numbers was first described by the Greek mathematician Eratosthenes more than two thousand years ago [6]. His algorithm, called the Sieve of Eratosthenes [7], describes a procedure to find all prime numbers in a given range. The algorithm is based on a check table of integers used to sift out multiples of known prime numbers. This eliminates the need for redundant checks on numbers that cannot be prime.

The present paper evaluates implementations of the Sieve of Eratosthenes in both serial and parallel versions. The parallelization of the algorithm is implemented using two distinct technologies, OpenMP [5] and MPI [4].


Pseudocode 2.1 Sieve of Eratosthenes algorithm
1. create a list of unmarked natural numbers 2, 3, ..., N:
       primes[1] := false
       for i := 2 to N do
           primes[i] := true
       end
2. initialize the seed:
       k := 2
3. perform the sieve while k² ≤ N:
       while (k² ≤ N) do
           i := k²
           (a) mark all multiples of k between k² and N:
               while (i ≤ N) do
                   primes[i] := false
                   i := i + k
               end
           (b) find the smallest unmarked number greater than k:
               k := k + 1
               while (primes[k] == false) do
                   k := k + 1
               end
       end
4. unmarked numbers are primes

2 Sequential Sieve of Eratosthenes

The pseudo-algorithm used to calculate the prime numbers below N is shown in Pseudocode 2.1.

An example of the sieve used to find all prime numbers below 50 is shown in Figure 1. In Step 1, the list of all natural numbers from 2 to 50 is defined. The elimination of all multiples of 2, 3, 5 and 7 is made in steps 2 to 5. Since the square of the next unmarked number in the table (11 × 11) is greater than 50, all remaining unmarked numbers are primes.

The complexity of the sequential Sieve of Eratosthenes algorithm is O(n ln ln n); note that n itself is exponential in the number of digits of the upper bound.

All algorithms were implemented in C++; the program was developed and tested on Linux. For user convenience, command-line arguments are available to control the bound below which primes are searched for and the number of threads or processes to be used in the calculation.

2.1 Single processor implementation

The single-processor implementation is a serial version that executes the sieve using only one thread or process. In the present work, three distinct versions were evaluated:


Fig. 1: Steps for the calculation of prime numbers using the Sieve of Eratosthenes algorithm

2.1.1 Base algorithm This version of the algorithm was implemented by dividing each element of the array by k and checking whether the remainder of the division is zero (Code 1.8, line 78). If the remainder of the division is zero, it means that j is not a prime number and should be marked (Code 1.8, line 81). Pseudocode 2.2 shows the basic algorithm used to perform this operation.

Pseudocode 2.2 Checking prime numbers by division
for j := k² to N step 1 do
    if j mod k = 0 then
        it is not a prime: mark j
    fi
end

In Code 1.8, line 84, the smallest unmarked number greater than k is found. The algorithm is repeated while k² ≤ N (Code 1.8, line 87).
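For reference, a minimal C++ sketch of this division-based version might look as follows (the function name and structure are illustrative, not the paper's actual Code 1.8):

```cpp
#include <vector>

// Division-based sieve: for each seed k, every j in [k*k, N] is tested
// with the modulo operator. (Illustrative sketch, not the paper's Code 1.8.)
long count_primes_division(long n) {
    std::vector<bool> is_prime(n + 1, true);
    is_prime[0] = is_prime[1] = false;
    for (long k = 2; k * k <= n; ) {
        for (long j = k * k; j <= n; ++j)
            if (j % k == 0)          // remainder zero: j is a multiple of k
                is_prime[j] = false; // mark j as not prime
        do { ++k; } while (!is_prime[k]); // smallest unmarked number > k
    }
    long count = 0;
    for (long i = 2; i <= n; ++i)
        if (is_prime[i]) ++count;
    return count;
}
```

Calling `count_primes_division(50)` returns 15, matching the example of Figure 1.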


2.1.2 Optimization 1 This algorithm improvement, also known as fast marking, visits directly the multiples of k in the block (j, j + k, j + 2k, j + 3k, etc.) instead of checking the remainder of the division of j by k for every element. With this change only the multiples of k (2k, 3k, 4k, etc.) are computed and marked as not being primes; this avoids checking numbers that are already known not to be prime. Pseudocode 2.3 shows the fast marking algorithm.

Pseudocode 2.3 Fast marking
for j := k² to N step k do
    it is not a prime: mark j
end

The improvement of this algorithm consists of replacing the test done in Code 1.8, line 78 by the fast marking loop defined in Code 1.9, line 81.
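The change amounts to replacing the inner modulo loop of the previous sketch with a strided loop; again a hedged sketch with illustrative names, not the paper's Code 1.9:

```cpp
#include <vector>

// Fast marking: instead of testing every j with j % k, jump directly
// from multiple to multiple of k, so only composites are touched.
// (Illustrative sketch, not the paper's Code 1.9.)
long count_primes_fast_marking(long n) {
    std::vector<bool> is_prime(n + 1, true);
    is_prime[0] = is_prime[1] = false;
    for (long k = 2; k * k <= n; ) {
        for (long j = k * k; j <= n; j += k) // step k: multiples only
            is_prime[j] = false;
        do { ++k; } while (!is_prime[k]);    // next unmarked seed
    }
    long count = 0;
    for (long i = 2; i <= n; ++i)
        if (is_prime[i]) ++count;
    return count;
}
```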

2.1.3 Optimization 2 Based on the previous algorithm, another possible improvement is a reorganization of the order in which the loops are performed. The objective here is to allow searching for several seeds in the same data block. The range of numbers from 2 to N was divided into equal intervals and subsequently processed serially, block by block. The use of smaller blocks allows the processor to optimize memory accesses to the list of prime numbers and reduce cache misses. Code 1.10, line 57 defines the outer loop that performs the search for prime numbers within a single block.
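A possible sketch of the blocked organization, assuming the seed primes up to √N are first collected and then applied block by block (the block size and helper names are assumptions, not the paper's Code 1.10):

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// Cache-blocked sieve: the range [2, n] is split into fixed-size blocks;
// all seed primes are applied to one block before moving to the next,
// so each block stays resident in cache while it is being sieved.
long count_primes_blocked(long n, long block_size = 1 << 15) {
    // 1. small sieve up to sqrt(n) gives the seed primes
    long limit = static_cast<long>(std::sqrt(static_cast<double>(n)));
    std::vector<bool> small(limit + 1, true);
    std::vector<long> seeds;
    for (long k = 2; k <= limit; ++k)
        if (small[k]) {
            seeds.push_back(k);
            for (long j = k * k; j <= limit; j += k) small[j] = false;
        }
    // 2. sieve the rest of the range block by block
    long count = 0;
    std::vector<bool> block;
    for (long low = 2; low <= n; low += block_size) {
        long high = std::min(low + block_size - 1, n);
        block.assign(high - low + 1, true);
        for (long k : seeds) {
            // first multiple of k inside [low, high], at least k*k
            long start = std::max(k * k, ((low + k - 1) / k) * k);
            for (long j = start; j <= high; j += k)
                block[j - low] = false;
        }
        count += std::count(block.begin(), block.end(), true);
    }
    return count;
}
```

Starting each seed at max(k², first multiple ≥ low) keeps the seed primes themselves unmarked while still crossing out every composite in the block.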

3 Parallel Sieve of Eratosthenes

The parallelization of the Sieve of Eratosthenes is made by applying a domain decomposition, breaking the array into n − 1 elements and associating a primitive task with each of these elements. Each of those primitive tasks will mark as composite the elements in the array that are multiples of a particular prime (mark all multiples of k between k² and N). Two distinct data decompositions could be applied [8]:

3.1 Interleaved Data Decomposition

Performing an interleaved decomposition of the array elements, process 0 will be responsible for checking the natural numbers 2, 2 + p, 2 + 2p, etc., and process 1 will check the natural numbers 3, 3 + p, 3 + 2p, etc. The main advantage of the interleaved approach lies in the ease of finding which process controls a given index (easily computed by i mod p, where i is the index number and p the process


Pseudocode 3.1 Block Data Decomposition
r := n mod p
if r = 0 then
    all blocks have the same size n/p
else
    first r blocks have size ⌈n/p⌉
    remaining p − r blocks have size ⌊n/p⌋
fi

number). The main disadvantage of this method is that such a decomposition leads to significant load imbalance among processes. It also requires some sort of reduction or broadcast operations.

3.2 Block Data Decomposition Method

This method divides the array into p contiguous blocks of roughly equal size. Let n be the number of array elements; if n is a multiple of the number of processes p, the division is straightforward. It can be a problem, however, if n is not a multiple of p. Suppose n = 17 and p = 7: the division gives about 2.43 elements per process. If we give every process 2 elements, 3 elements are left over. If we give every process 3 elements, the array is not large enough. We cannot simply give each of the first p − 1 processes ⌈n/p⌉ elements and the last process the leftover, because there may not be any elements left for it. Allocating no elements to a process can complicate the logic of programs in which processes exchange values. It can also lead to a less efficient utilization of the communication network.

This approach solves the block allocation, but it must also be possible to retrieve which range of elements is controlled by a particular process, and which process controls a particular element.

3.2.1 Method #1 Suppose n is the number of elements, p is the number of processes and r = n mod p. The first element controlled by process i is given by:

i⌊n/p⌋ + min(i, r)    (1)

The last element controlled by process i is the element before the first element controlled by process i + 1:

(i + 1)⌊n/p⌋ + min(i + 1, r) − 1    (2)

The process controlling a particular array element j is:

max(⌊j/(⌊n/p⌋ + 1)⌋, ⌊(j − r)/⌊n/p⌋⌋)    (3)
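These formulas translate directly into code; a sketch (the helper names are ours) that reproduces the 17-element examples below. Note that the owner is the larger of the two candidate expressions: the first expression is correct inside the r larger blocks but undercounts past them, while the second is correct in the remaining blocks.

```cpp
// Method #1 block decomposition (formulas (1)-(3) above).
// n elements, p processes (p <= n), r = n mod p; the first r
// processes get one extra element. (Helper names are illustrative.)
long m1_first(long i, long p, long n) {
    long r = n % p;
    return i * (n / p) + (i < r ? i : r);
}
long m1_last(long i, long p, long n) {
    return m1_first(i + 1, p, n) - 1;
}
long m1_owner(long j, long p, long n) {
    long r = n % p, size = n / p;
    long a = j / (size + 1);  // valid within the first r (larger) blocks
    long b = (j - r) / size;  // valid within the remaining blocks
    return a > b ? a : b;
}
```

For n = 17 and p = 7 this gives the first 3 processes blocks of size 3 and the remaining 4 processes blocks of size 2.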

Figure 2 shows the block data decomposition, using method #1, of an array of 17 elements in configurations of 7, 5 and 3 processes.


Fig. 2: 17 elements divided among n processes using data decomposition method #1

3.2.2 Method #2 The second scheme does not concentrate all of the larger blocks among the smaller-numbered processes. Suppose n is the number of elements and p is the number of processes. The first element controlled by process i is:

⌊in/p⌋    (4)

The last element controlled by process i is the element before the first element controlled by process i + 1:

⌊(i + 1)n/p⌋ − 1    (5)

The process controlling a particular array element j is:

⌊(p(j + 1) − 1)/n⌋    (6)

Fig. 3: 17 elements divided among n processes using data decomposition method #2

Figure 3 shows the same block data decomposition described in Figure 2, but now using method #2.


3.2.3 Macros Comparing the block decompositions, we chose the second scheme because it needs fewer operations to compute the low and high indexes. In the first scheme the larger blocks are held by the lowest-numbered tasks; in the second scheme the larger blocks are distributed among the tasks.

C/C++ macros can be used in any of our parallel programs where a group of data items is distributed among a set of processors using block decomposition. Code 1.7 includes the definition of those macros.
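Code 1.7 is not reproduced here, but following formulas (4) to (6) the macros can be sketched as below (BLOCK_LOW, BLOCK_HIGH, BLOCK_SIZE and BLOCK_OWNER are the conventional names for these helpers [8]):

```cpp
// Method #2 block decomposition as C/C++ macros (formulas (4)-(6)).
// i = process id, p = number of processes, n = number of elements,
// j = element index. A sketch of what the paper's Code 1.7 defines.
#define BLOCK_LOW(i, p, n)   ((long)(i) * (n) / (p))
#define BLOCK_HIGH(i, p, n)  (BLOCK_LOW((i) + 1, (p), (n)) - 1)
#define BLOCK_SIZE(i, p, n)  (BLOCK_HIGH((i), (p), (n)) - BLOCK_LOW((i), (p), (n)) + 1)
#define BLOCK_OWNER(j, p, n) (((p) * ((long)(j) + 1) - 1) / (n))
```

For n = 17 and p = 7 these macros yield the block sizes 2, 2, 3, 2, 3, 2, 3 shown in Figure 3.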

3.3 OpenMP implementation

With the objective of increasing the performance of the algorithm by taking advantage of processor architectures with multiple CPU cores sharing the same global memory, the following OpenMP versions of the algorithm are presented:

3.3.1 Base algorithm This version is based on the optimized version described in section 2.1.3, adapted to run in parallel threads. In Code 1.11, line 69, num_threads threads are created using an omp parallel for section. Each of the separate threads runs in parallel (Code 1.11, line 70). Each of the data blocks used has its lower value and block size defined using the BLOCK_LOW() (Code 1.11, line 72) and BLOCK_SIZE() (Code 1.11, line 73) macros. Since only thread 0 calculates the smallest unmarked number greater than k, it was necessary to include two omp barrier directives (Code 1.11, lines 107 and 114) to synchronize k across all the running threads.

Finally, the total number of primes found by each of the threads must be summed. To perform this, an omp atomic section was defined (Code 1.11, line 128) to sum all the partial prime counts.
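A hedged sketch of this structure (illustrative, not the paper's Code 1.11): here the implicit barriers of the work-sharing constructs play the role of the explicit omp barrier directives, and the code falls back to a correct serial sieve if compiled without OpenMP support.

```cpp
#include <vector>

// OpenMP base version, sketched: the marking loop and the final count
// are shared among threads; one thread advances the seed k and the
// implicit barriers keep k consistent across the team.
long count_primes_omp(long n) {
    std::vector<char> composite(n + 1, 0);
    long k = 2, count = 0;
    #pragma omp parallel shared(k, count)
    {
        while (k * k <= n) {
            #pragma omp for            // threads share the marking of multiples
            for (long i = k * k; i <= n; i += k)
                composite[i] = 1;
            #pragma omp single         // one thread advances the seed k;
            {                          // the implicit barrier syncs it
                long next = k + 1;
                while (composite[next]) ++next;
                k = next;
            }
        }
        long local = 0;                // per-thread partial count
        #pragma omp for nowait
        for (long i = 2; i <= n; ++i)
            if (!composite[i]) ++local;
        #pragma omp atomic             // combine the partial counts
        count += local;
    }
    return count;
}
```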

3.3.2 Optimization 1 Taking advantage of the fact that there is only a single even prime number (the number 2), the previous algorithm was modified to eliminate all even numbers from the list and from the computation. This not only speeds up the process of finding prime numbers but also requires half the amount of space to store the list of prime numbers.

The main changes over the previous algorithm are related to index adaptation in order to deal only with odd numbers. The low value, high value and block size were adapted to deal only with odd numbers in the array (Code 1.12, lines 81 to 106). The first index maintained by each thread also needs to be adapted (Code 1.12, lines 126 to 148). Finally, the search for the smallest unmarked number greater than k, performed only by thread 0, should also skip all even numbers (Code 1.12, line 156).
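One conventional mapping for such an odd-only representation, shown here as a sketch (the exact index scheme in Code 1.12 may differ): element i of the array stands for the odd value 2i + 3, so the array holds 3, 5, 7, ... and needs roughly half the original space.

```cpp
// Odd-only storage: array element i represents the odd number 2*i + 3.
// (Assumed mapping for illustration; Code 1.12 may index differently.)
inline long odd_value_of(long i) { return 2 * i + 3; }
inline long odd_index_of(long v) { return (v - 3) / 2; } // v odd, v >= 3
```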

3.3.3 Optimization 2 A drawback of the two previous implementations is the need to perform thread synchronization when updating the value of k. In this optimization, the two OpenMP barrier directives were removed from the


code (Code 1.11, lines 107 and 114). These are synchronization directives that force all threads to reach a common point before any thread can continue. Since they were used to synchronize the value of k, the program was changed so that every thread keeps track of k on its own. This is achieved by pre-calculating every prime from 3 to √N (remember that even numbers are disregarded). Each thread is then given a copy of this list of primes, and k is changed by iterating through the list. This optimization allows each thread to move at its own pace, further reducing execution time.

To implement this, in Code 1.13, lines 68 to 90, each thread is given every prime from 3 to √N. The task of finding the smallest unmarked number greater than k, previously performed only by thread 0 while skipping all even numbers (Code 1.12, line 156), was removed. Finally, in Code 1.12, line 180, each thread keeps track of k on its own.

3.4 MPI implementation

The MPI versions of the three algorithms presented are adapted versions of the ones presented in section 3.3. The main objective is the use of multiple distributed-memory systems (MPI) without OpenMP. The three MPI versions presented also incorporate the optimizations described in sections 3.3.2 (elimination of all even numbers from the lists/computation) and 3.3.3 (each process moves at its own pace by pre-calculating the k values).

In terms of parallelization, the key aspects and challenges of the implementations using MPI lie in the switch from a multi-threaded environment to a message-passing interface running on several distributed processes.

3.4.1 Base algorithm This version is based on the optimized version described in section 3.3.1, adapted to run in a multiple-process / multiple-node environment using MPI instead of OpenMP.

The process identification and number of processes to be used by the BLOCK_LOW() (Code 1.14, line 44) and BLOCK_SIZE() (Code 1.14, line 45) macros are retrieved using MPI_Comm_rank() and MPI_Comm_size(), in Code 1.14, lines 44 and 45 respectively.

Each time the process with id 0 finds and updates the smallest unmarked number greater than k, it is required to broadcast that value to the other processes. Code 1.14, line 111 shows the MPI_Bcast() instruction used to perform that task.

Finally, to calculate the total number of primes found by all the processes, it is necessary to perform a reduction operation that sums the partial prime counts of each process. This reduction is made by the MPI_Reduce() call included in Code 1.14, line 126.
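A minimal working sketch of this communication pattern, under the usual assumption that process 0's block covers every seed up to √n (illustrative, not the paper's Code 1.14):

```cpp
#include <mpi.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// MPI base version, sketched: each process sieves its block of [2, n],
// process 0 advances the seed k and broadcasts it, and the partial
// counts are combined with a reduction. Assumes p <= sqrt(n) blocks
// leave all seeds inside process 0's block.
#define BLOCK_LOW(i, p, n)  ((long)(i) * (n) / (p))
#define BLOCK_HIGH(i, p, n) (BLOCK_LOW((i) + 1, (p), (n)) - 1)

int main(int argc, char *argv[]) {
    int id, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);   // process identification
    MPI_Comm_size(MPI_COMM_WORLD, &p);    // number of processes
    long n = (argc > 1) ? std::atol(argv[1]) : 1000;
    // this process owns the values low..high (block over 2..n)
    long low  = 2 + BLOCK_LOW(id, p, n - 1);
    long high = 2 + BLOCK_HIGH(id, p, n - 1);
    std::vector<char> marked(high - low + 1, 0);
    long k = 2;
    do {
        // first multiple of k in [low, high], at least k*k
        long first = (k * k >= low) ? k * k : ((low + k - 1) / k) * k;
        for (long i = first; i <= high; i += k)
            marked[i - low] = 1;
        if (id == 0) {                    // process 0 finds the next seed...
            do { ++k; } while (marked[k - low]);
        }
        // ...and broadcasts it to every other process
        MPI_Bcast(&k, 1, MPI_LONG, 0, MPI_COMM_WORLD);
    } while (k * k <= n);
    long local = 0, total = 0;
    for (long i = low; i <= high; ++i)
        if (!marked[i - low]) ++local;
    // sum the partial counts of all processes on process 0
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (id == 0)
        std::printf("%ld primes found between 2 and %ld\n", total, n);
    MPI_Finalize();
    return 0;
}
```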

3.4.2 Optimization 1 The changes needed to adapt the algorithm to eliminate all even numbers from the lists/computation are the same as those described in section 3.3.2. The main differences are the replacement of the


two omp barrier directives used to synchronize k (Code 1.11, lines 107 and 114) by an MPI_Bcast() instruction (Code 1.15, line 152), and of the omp atomic section used to sum all the partial prime counts (Code 1.11, line 128) by an MPI_Reduce() (Code 1.15, line 167).

3.4.3 Optimization 2 Since in this optimization each process keeps track of k on its own, the pre-calculation of every prime from 3 to √N (remember that even numbers are disregarded) must be done by each process. This avoids the need for the MPI_Bcast() included in Code 1.15, line 152. Adapting the algorithm according to the rules defined in sections 3.3.2 and 3.3.3 removes the dependency between processes when finding prime numbers. To sum all the partial prime counts found by each process it is still necessary to include an MPI_Reduce(), in Code 1.16, line 189.

4 Results

4.1 Computing platform configurations

To perform the evaluation of the nine algorithms defined in sections 2.1, 3.3 and 3.4, three distinct configurations were used, on machines with an Intel(R) Core(TM)2 Quad CPU Q9300 running at 2.50 GHz. Table 1 shows detailed information about the processor cache.

                      Cache Size
                 L1          L2        L3
Processor Q9300  4 × 32 KB   2 × 3 MB  -

Table 1: Processor cache size information [2]

Regarding the network interface, the configured hardware allowed the use of gigabit Ethernet.

4.1.1 Single computing node using only one core The sequential Sieve of Eratosthenes algorithms described in sections 2.1.1, 2.1.2 and 2.1.3 required only one thread to be executed.

4.1.2 Single computing node using up to 4 cores The OpenMP parallel Sieve of Eratosthenes algorithms described in sections 3.3.1, 3.3.2 and 3.3.3 were tested in four distinct configurations using 1 to 4 threads. The objective was to test the scalability of those algorithms on a multi-core platform. This configuration allowed distributing up to 1 thread per processor core in order to spread the load among all processor cores.


4.1.3 Up to 16 cores in 4 distributed computing nodes The MPI parallel Sieve of Eratosthenes algorithms described in sections 3.4.1, 3.4.2 and 3.4.3 were tested in distinct configurations using 1 to 16 processes. The objective was to test the scalability of those algorithms in a multi-core, multi-node configuration. This configuration allowed distributing up to 1 process per processor core and also distributing the load among the 4 available computing nodes. Code 1.1 shows the hostfile configuration for the cluster of 4 nodes.

192.168.33.151 slots=4
192.168.33.150 slots=4
192.168.33.144 slots=4
192.168.33.142 slots=4

Code 1.1: MPI hostfile configuration file

4.2 Test scenarios

4.2.1 Testing the single-processor implementation With the objective of testing the performance and scalability of the sequential Sieve of Eratosthenes algorithm, the three implementations were executed using distinct ranges of numbers. The maximum interval was defined as 2 to 2²⁵. The algorithms were tested against 16 ranges, with the maximum value of range n being n · 2²⁵/16, for n from 1 to 16. To perform this operation a shell script was created to execute a batch run over the 16 intervals (Code 1.2). The time and number of primes found were written to a results file using the command line listed in Code 1.3. The output file generated by the script is shown in Code 1.4.

for i in {1..16}
do
    arg=`expr $i \* 2097152`
    ./bin/sieve $arg
done

Code 1.2: Batch run for the sequential Sieve of Eratosthenes algorithm

./run.sh > output.txt

Code 1.3: Retrieve results for the batch

155612 primes found between 2 and 2097152
Time: 1.963 seconds
295948 primes found between 2 and 4194304
Time: 5.303 seconds
431503 primes found between 2 and 6291456
Time: 9.470 seconds
[...]
2063690 primes found between 2 and 33554432
Time: 104.669 seconds

Code 1.4: Example of the output with information on the interval, primes found and time spent by the algorithm in seconds

This procedure was repeated for each of the three sequential algorithms (Base algorithm, Optimization 1 and Optimization 2) using the computing platform configuration defined in section 4.1.1.


4.2.2 Testing the OpenMP implementation The scalability and performance evaluation of the OpenMP implementations was done using the computing platform configuration defined in section 4.1.2. The three distinct OpenMP implementations (Base algorithm, Optimization 1 and Optimization 2) were tested in configurations varying from 1 to 4 threads. Code 1.5 shows the shell script used to retrieve the results using 1 thread; the second argument of the program is the number of threads (1 in the given example). The measures (time and number of primes found) were retrieved using the same method defined in section 4.2.1.

for i in {1..16}
do
    arg=`expr $i \* 2097152`
    ./bin/sieve $arg 1
done

Code 1.5: Batch run for the OpenMP Sieve of Eratosthenes algorithm with one thread

4.2.3 Testing the MPI implementation The scalability and performance evaluation of the MPI implementations was done using the computing platform configuration defined in section 4.1.3. The three distinct MPI implementations (Base algorithm, Optimization 1 and Optimization 2) were tested in configurations varying from 1 to 16 processes using 4 computing nodes. Code 1.6 shows the shell script used to retrieve the results using 8 processes (argument -np 8). The measures (time and number of primes found) were retrieved using the same method defined in section 4.2.1.

for i in {1..16}
do
    arg=`expr $i \* 2097152`
    mpirun.openmpi -mca btl ^openib -np 8 ./bin/sieve $arg
done

Code 1.6: Batch run for the MPI Sieve of Eratosthenes algorithm with 8 processes

4.3 Algorithm Evaluation

To evaluate the performance of the algorithms, it was decided to use a measure based on the number of primes found per second for each algorithm implementation and configuration used. Since the number of primes in the same interval is equal for all the algorithms, an algorithm that finds more primes per second has higher performance.

4.3.1 Single-processor results The execution times and average number of primes found per second obtained are listed in Table 2. Each row of the table contains the run for the respective interval from 2 to n · 2²⁵/16. The values obtained for each of the sequential algorithms are shown in the table columns Base algorithm, Optimization 1 and Optimization 2.


Found     N          N/2²⁵   Base algorithm        Optimization 1        Optimization 2
primes                       time      primes/s    time      primes/s    time      primes/s
155612    2097152    1/16    1,963     79273       0,017     9153588     0,017     9153588
295948    4194304    2/16    5,303     55808       0,073     4054068     0,057     5192053
431503    6291456    3/16    9,470     45565       0,145     2975876     0,084     5136929
564164    8388608    4/16    14,337    39350       0,202     2792886     0,103     5477311
694717    10485760   5/16    19,640    35373       0,264     2631500     0,129     5385395
823750    12582912   6/16    25,605    32171       0,325     2534612     0,158     5213601
951352    14680064   7/16    31,954    29773       0,384     2477477     0,198     4804803
1077872   16777216   8/16    38,763    27807       0,497     2168755     0,244     4417504
1203570   18874368   9/16    45,900    26222       0,523     2301279     0,277     4345014
1328231   20971520   10/16   53,269    24934       0,583     2278268     0,308     4312435
1452314   23068672   11/16   61,222    23722       0,680     2135756     0,330     4400952
1575662   25165824   12/16   69,436    22692       0,757     2081454     0,380     4146476
1698417   27262976   13/16   77,592    21889       0,776     2188680     0,450     3774258
1820646   29360128   14/16   86,222    21116       0,880     2068915     0,428     4253843
1942385   31457280   15/16   95,300    20382       0,972     1998337     0,451     4306838
2063690   33554432   16/16   104,669   19716       1,018     2027199     0,480     4299352

Table 2: Execution times and average number of primes found per second for the single-processor implementation. Each row shows the measured values for one of the 16 equal ranges of numbers from 2 to 2²⁵

The plot in Figure 4 shows the comparative performance obtained by each of the algorithms, in number of primes found per second, when increasing the range interval of numbers.

4.3.2 OpenMP results Table 3 shows the results obtained for the OpenMP implementations. Each row represents the values obtained with the respective number of cores (1 to 4). The values obtained for each of the OpenMP algorithms are shown in the table columns Base algorithm, Optimization 1 and Optimization 2. The values obtained are relative to the range of numbers between 2 and 2²⁵.

Found     N          Cores   Base algorithm        Optimization 1        Optimization 2
primes                       time      primes/s    time      primes/s    time      primes/s
2063690   33554432   1       1,004     2055467     0,460     4486280     0,492     4194490
2063690   33554432   2       0,854     2416498     0,343     6016586     0,340     6069674
2063690   33554432   3       1,099     1877788     0,443     4658440     0,372     5547551
2063690   33554432   4       4,035     511447      3,730     553268      0,379     5445090

Table 3: Execution times and average number of primes found per second for the OpenMP implementation. Each row shows the measured values for a distinct core configuration on a single computing node


Fig. 4: Evolution of the performance of the single-processor implementations when changing the range of numbers (16 intervals from 2 to 2²⁵)

The plot in Figure 5 shows the comparative performance obtained by each of the OpenMP algorithms, in number of primes found per second, when increasing the number of cores (1 to 4).

Fig. 5: Evolution of the performance of the OpenMP implementations when changing the number of processor cores (1 to 4)

4.3.3 MPI results Table 4 shows the results obtained for the MPI implementations. Each row represents the values obtained with the respective number of cores (1 to 16). The values obtained for each of the MPI algorithms are shown in the table columns Base algorithm, Optimization 1 and Optimization 2. The values obtained are relative to the range of numbers between 2 and 2²⁵.


Found     N          Cores   Base algorithm        Optimization 1        Optimization 2
primes                       time      primes/s    time      primes/s    time      primes/s
2063690   33554432   1       0,888     2323974     0,450     4585973     0,481     4290412
2063690   33554432   2       0,793     2602382     0,359     5748435     0,442     4668977
2063690   33554432   3       0,769     2683601     0,368     5607848     0,342     6034175
2063690   33554432   4       0,769     2683601     0,349     5913146     0,296     6971919
2063690   33554432   5       0,668     3089355     0,308     6700286     0,217     9510083
2063690   33554432   6       0,478     4317341     0,254     8124756     0,111     18591784
2063690   33554432   7       0,390     5291510     0,138     14954261    0,095     21723032
2063690   33554432   8       0,547     3772740     0,107     19286804    0,067     30801313
2063690   33554432   9       0,280     7370318     0,085     24278682    0,043     47992744
2063690   33554432   10      0,310     6657061     0,064     32245125    0,050     41273760
2063690   33554432   11      0,241     8563025     0,100     20636880    0,034     60696706
2063690   33554432   12      0,295     6995556     0,086     23996372    0,046     44862783
2063690   33554432   13      0,169     12211178    0,051     40464471    0,048     42993500
2063690   33554432   14      0,142     14533021    0,115     17945113    0,027     76432889
2063690   33554432   15      0,173     11928838    0,063     32756952    0,029     71161655
2063690   33554432   16      0,100     20636890    0,060     34394800    0,052     39686308

Table 4: Execution times and average number of primes found per second for the MPI implementation. Each row shows the measured values for a distinct core configuration in a 4-computing-node configuration

Fig. 6: Evolution of the performance of the MPI implementations when changing the number of processor cores (1 to 16, over 4 distributed nodes)


The plot in Figure 6 compares the performance obtained by each of the MPI algorithms, in number of primes found per second, when increasing the number of cores (1 to 16).

4.3.4 Single processor vs OpenMP vs MPI in a single node Figure 7 plots the performance of all algorithm versions for the range of numbers between 2 and 2^25. The sequential algorithms use only one core, the OpenMP versions use 4 cores in a single computing node, and the MPI versions use 4 cores distributed among 4 distinct computing nodes (1 core per node).

Fig. 7: Evolution of the performance of the three implementations when changing the number of processor cores (1 to 4 cores)

4.4 Discussion

This section analyzes the data obtained in the single-processor results (Section 4.3.1), the OpenMP results (Section 4.3.2) and the MPI results (Section 4.3.3).

Starting with the results of the sequential algorithm implementations described in Sections 2.1.1, 2.1.2 and 2.1.3, Figure 4 shows that the best performance was obtained by the second optimization of the algorithm. The speedup factors plotted in Figure 8 show that Optimization 1 is between 65 and 115 times faster than the Base algorithm. The best speedup factors were obtained by Optimization 2, which was 218 times faster than the Base algorithm for the range of values between 2 and 2^25; the minimum speedup factor obtained for this algorithm was 93. Analyzing the trends of the curves (Figure 8a), it can be concluded that, as the range of values keeps increasing, the performance degradation of the Base algorithm will be more significant than that of the other two optimizations. By comparing the Optimization 1 and


Optimization 2 speedup factors in Figure 8b, the speedup factor of the blocked algorithm varies from 1 (equal performance) to 2 times faster when considering 16 blocks. The maximum speedup factor is already reached when using data blocks of 2,097,152 bytes (2^25/16, i.e. the range split into 16 blocks). This value is consistent with the size of the processor cache.
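That block size can be checked directly from the block decomposition. The sketch below is illustrative (the function name is ours, not the paper's); it mirrors the integer decomposition used by the BLOCK_LOW/BLOCK_SIZE macros of the listings:

```cpp
// Size in bytes of block `id` when a range of n one-byte sieve flags
// is split into `blocks` blocks, using the same integer decomposition
// as the paper's BLOCK_LOW/BLOCK_SIZE macros (sizes differ by at most 1).
int block_bytes(int n, int blocks, int id)
{
    long low  = (long)id * n / blocks;        // first index of this block
    long high = (long)(id + 1) * n / blocks;  // first index of the next block
    return (int)(high - low);
}
```

For n = 2^25 flags and 16 blocks, every block occupies exactly 2,097,152 bytes, matching the 2 MB cache size mentioned above.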


Fig. 8: Speedup factors of Optimization 1 and Optimization 2 relative to the Base algorithm (a), and of Optimization 2 relative to Optimization 1 (b)
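The speedup factors plotted in Figure 8 are plain ratios of measured execution times; as a minimal illustration (the helper name is ours, not the paper's):

```cpp
// Speedup of an optimized version relative to a reference version,
// computed from the two measured execution times (in seconds).
double speedup(double time_reference, double time_optimized)
{
    return time_reference / time_optimized;
}
```

Applied, for example, to the single-core row of Table 4, 0.888 s for the Base algorithm against 0.450 s for Optimization 1 gives a factor of about 1.97.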

Regarding the OpenMP implementations, Figure 5 shows that the best performance was obtained by Optimization 2 of the algorithm. The plot also shows that for small core configurations Optimization 1 has a performance comparable to Optimization 2, but for configurations with a higher number of cores (more than 2) both the Base algorithm and Optimization 1 show a visible degradation. The main cause is the concurrency cost of having several threads disputing the same portion of data (reading and writing the value of k). Since Optimization 2 removed the two omp barriers from the algorithm, its performance is not affected by the scaling to a multi-core environment. The speedup factor ranges from 2.5 for configurations with 1 or 2 cores to more than 10 in configurations with 4 cores.
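The idea behind removing those barriers can be sketched as follows (our illustration of the approach taken by Optimization 2 / Code 1.13, not the paper's exact code): each thread redundantly computes its own list of odd seed primes up to sqrt(N), so no thread ever has to read or write a shared k:

```cpp
#include <cmath>
#include <vector>

// Odd seed primes up to sqrt(range_max). This pre-sieve is cheap enough
// (O(sqrt(N)) memory and time) that every thread can compute its own copy
// instead of synchronizing on a shared k with two barriers per round.
std::vector<int> seed_primes(int range_max)
{
    int sqrtn = (int)std::ceil(std::sqrt((double)range_max));
    std::vector<char> marked(sqrtn + 1, 0);
    std::vector<int> primes;
    for (int k = 3; k <= sqrtn; k += 2)
    {
        if (marked[k]) continue;
        primes.push_back(k);
        for (int i = k * k; i <= sqrtn; i += 2 * k)  // odd multiples only
            marked[i] = 1;
    }
    return primes;
}
```

Each thread then drives its marking loop from its private copy of this list, trading a small amount of redundant work for the elimination of all synchronization.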

By analyzing Figure 6 it can be concluded that, using MPI, all three implementations scale well when increasing the number of cores. Optimization 2 again showed the best performance of the three implementations.

Finally, comparing the performance of the 9 algorithms (Figure 7) in configurations up to 4 cores, it can be concluded that OpenMP Optimization 2, MPI Optimization 1 and MPI Optimization 2 have similar performance, with OpenMP giving better results in lower core-count configurations. This may be explained by the communication overhead of MPI, which prevents it from outperforming OpenMP in such cases.


5 Conclusions

This paper introduced the algorithm for finding the list of prime numbers in a range from 2 to N using the Sieve of Eratosthenes. The first sections revealed some weaknesses of the algorithm regarding the scaling to larger ranges of numbers.

Parallelizing the algorithm proved to be a good strategy to scale it on multi-core architectures using OpenMP. The use of MPI was also addressed for multi-node computing environments. For both approaches, OpenMP and MPI, optimizations were compared regarding the elimination of even integers (all primes are odd except 2) and the removal of thread synchronization / broadcast operations by introducing redundant portions of code that can be performed by each thread or process.

The algorithms were tested in multiple computing configurations using a quad-core architecture and, for the MPI versions, 4 computing nodes. The results show that OpenMP can be a good solution on a multiple-core architecture, but the programmer should be aware of thread synchronization issues that may degrade performance. If the objective is to scale the algorithm to a multi-node architecture, MPI revealed good scaling capabilities over a multi-node configuration. Despite the overhead of inter-process communication, the MPI solution showed performance comparable to OpenMP even in a single computing node configuration.

Atkin and Bernstein [1] described an improved version of the sieve of Eratosthenes; the parallelization of that algorithm using OpenMP and MPI, and the respective benchmark against the presented implementations, can be pointed out as future work.


Bibliography

[1] Atkin, A.O.L., Bernstein, D.J.: Prime sieves using binary quadratic forms. Mathematics of Computation 73 (2004)

[2] CPU-World: Microprocessor news, benchmarks, information and pictures (2012), http://www.cpu-world.com/, [Online; accessed 04-April-2012]

[3] Kaliski, B.: Public-Key Cryptography Standards (PKCS) #1: RSA Cryptography Specifications Version 2.1, RFC 3447 (2003)

[4] Message Passing Interface Forum: MPI: A Message-Passing Interface Standard, Version 2.2. High Performance Computing Center Stuttgart (HLRS) (September 2009)

[5] OpenMP Architecture Review Board: OpenMP Application Program Interface Version 3.1 (2011), http://www.openmp.org/mp-documents/OpenMP3.1.pdf

[6] Wikipedia: Eratosthenes — Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/w/index.php?title=Eratosthenes&oldid=502464002, [Online; accessed 01-July-2012]

[7] Wikipedia: Sieve of Eratosthenes — Wikipedia, the free encyclopedia (2012), http://en.wikipedia.org/w/index.php?title=Sieve_of_Eratosthenes&oldid=499635979, [Online; accessed 01-July-2012]

[8] Wirian, D.J.: Parallel prime sieve: Finding prime numbers. Institute of Information and Mathematical Sciences, Massey University at Albany, Auckland, New Zealand 1 (2009)


Code Listings

#define BLOCK_LOW(id, p, n)  (((id) * (n)) / (p))

/**
 * BLOCK_HIGH
 * Returns the index immediately after the
 * end of a local array with regards to
 * block decomposition of a global array.
 *
 * param  (int) process rank
 * param  (int) total number of processes
 * param  (int) size of global array
 * return (int) offset after end of local array
 */
#define BLOCK_HIGH(id, p, n)  (BLOCK_LOW((id) + 1, (p), (n)))

/**
 * BLOCK_SIZE
 * Returns the size of a local array
 * with regards to block decomposition
 * of a global array.
 *
 * param  (int) process rank
 * param  (int) total number of processes
 * param  (int) size of global array
 * return (int) size of local array
 */
#define BLOCK_SIZE(id, p, n)  ((BLOCK_HIGH((id), (p), (n))) - (BLOCK_LOW((id), (p), (n))))

/**
 * BLOCK_OWNER
 * Returns the rank of the process that
 * handles a certain local array with
 * regards to block decomposition of a
 * global array.
 *
 * param  (int) index in global array
 * param  (int) total number of processes
 * param  (int) size of global array
 * return (int) rank of process that handles index
 */
#define BLOCK_OWNER(i, p, n)  ((((p) * ((i) + 1)) - 1) / (n))

Code 1.7: C/C++ macros used to distribute data items among a set of processors using block decomposition
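As a quick sanity check of these macros (illustrative values, not from the paper), distributing n = 10 items over p = 3 processes yields blocks of sizes 3, 3 and 4:

```cpp
#include <cassert>

#define BLOCK_LOW(id, p, n)   (((id) * (n)) / (p))
#define BLOCK_HIGH(id, p, n)  (BLOCK_LOW((id) + 1, (p), (n)))
#define BLOCK_SIZE(id, p, n)  ((BLOCK_HIGH((id), (p), (n))) - (BLOCK_LOW((id), (p), (n))))
#define BLOCK_OWNER(i, p, n)  ((((p) * ((i) + 1)) - 1) / (n))

// With n = 10 and p = 3: process 0 owns indices [0, 3), process 1 owns
// [3, 6), process 2 owns [6, 10); BLOCK_OWNER maps a global index back
// to the rank that holds it.
```

Note that BLOCK_HIGH here follows the Code 1.7 convention (the index immediately after the block), unlike the inclusive variant used in the macros of Codes 1.8 to 1.13.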

#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include <sys/time.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n) (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

void usage(void)
{
    std::cout << "sieve <max_number>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
}

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        std::cout << "Invalid number of arguments!" << std::endl;
        usage();
        return 0;
    }

    int range_max = atoi(argv[1]);

    if (range_max < 2)
    {
        std::cout << "<max_number> must be greater than or equal to 2." << std::endl;
        usage();
        return 0;
    }

    // Global k
    int k = 2;

    // Global count
    int count = 0;

    int low_value = 2;

    // block of data
    char *marked = (char *) malloc(range_max);

    if (marked == 0)
    {
        std::cout << "Cannot allocate enough memory." << std::endl;
        exit(1);
    }

    for (int i = 0; i < range_max; ++i)
    {
        marked[i] = 0;
    }

    int first_index = 0;
    do
    {
        if (k > low_value)
        {
            first_index = k - low_value + k;
        }
        else if (k * k > low_value)
        {
            first_index = k * k - low_value;
        }
        else if (low_value % k == 0)
        {
            first_index = 0;
        }
        else
        {
            first_index = k - (low_value % k);
        }

        for (int i = first_index; i < range_max; i++)
        {
            if (i % k == 0)
                marked[i] = 1;
        }

        while (marked[++k]);

    } while (k * k <= range_max);

    for (int i = 0; i < range_max; ++i)
    {
        if (marked[i] == 0)
        {
            ++count;
        }
    }

    free(marked); marked = 0;

    std::cout << count << " primes found between 2 and " << range_max << std::endl;

    return 0;
}

Code 1.8: Single process Sieve of Eratosthenes with division checking

#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include <sys/time.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n) (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

void usage(void)
{
    std::cout << "sieve <max_number>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
}

int main(int argc, char **argv)
{
    if (argc != 2)
    {
        std::cout << "Invalid number of arguments!" << std::endl;
        usage();
        return 0;
    }

    int range_max = atoi(argv[1]);

    if (range_max < 2)
    {
        std::cout << "<max_number> must be greater than or equal to 2." << std::endl;
        usage();
        return 0;
    }

    // Global k
    int k = 2;

    // Global index
    int prime_index = 0;

    // Global count
    int count = 0;

    int low_value = 2;

    // block of data
    char *marked = (char *) malloc(range_max);

    if (marked == 0)
    {
        std::cout << "Cannot allocate enough memory." << std::endl;
        exit(1);
    }

    for (int i = 0; i < range_max; ++i)
    {
        marked[i] = 0;
    }

    int first_index = 0;
    do
    {
        if (k > low_value)
        {
            first_index = k - low_value + k;
        }
        else if (k * k > low_value)
        {
            first_index = k * k - low_value;
        }
        else if (low_value % k == 0)
        {
            first_index = 0;
        }
        else
        {
            first_index = k - (low_value % k);
        }

        for (int i = first_index; i < range_max; i += k)
        {
            marked[i] = 1;
        }

        while (marked[++prime_index]);
        k = prime_index + 2;

    } while (k * k <= range_max);

    for (int i = 0; i < range_max; ++i)
    {
        if (marked[i] == 0)
        {
            ++count;
        }
    }

    free(marked); marked = 0;

    std::cout << count << " primes found between 2 and " << range_max << std::endl;

    return 0;
}

Code 1.9: Single process Sieve of Eratosthenes with fast marking

#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include <sys/time.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n) (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

void usage(void)
{
    std::cout << "sieve <max_number> <block_count>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
    std::cout << "<block_count> is the number of blocks to use." << std::endl;
}

int main(int argc, char **argv)
{
    if (argc != 3)
    {
        std::cout << "Invalid number of arguments!" << std::endl;
        usage();
        return 0;
    }

    int range_max = atoi(argv[1]);
    int num_blocks = atoi(argv[2]);

    if (range_max < 2)
    {
        std::cout << "<max_number> must be greater than or equal to 2." << std::endl;
        usage();
        return 0;
    }

    if (num_blocks < 1)
    {
        std::cout << "<block_count> between 1 and <max_number>" << std::endl;
        usage();
        return 0;
    }

    int temp = (range_max - 1) / num_blocks;
    if ((1 + temp) < (int) sqrt((double) range_max))
    {
        std::cout << "Too many blocks!" << std::endl;
        std::cout << "Block size should be greater than or equal to sqrt(n)." << std::endl;
        exit(1);
    }

    // Global count
    int count = 0;

    int thread_id = 0;
    for (thread_id = 0; thread_id < num_blocks; ++thread_id)
    {
        int k = 2;

        int prime_index = 0;

        int low_value = 2 + BLOCK_LOW(thread_id, num_blocks, range_max - 1);
        int block_size = BLOCK_SIZE(thread_id, num_blocks, range_max - 1);

        char *marked = (char *) malloc(block_size);

        if (marked == 0)
        {
            std::cout << "Thread " << thread_id << " cannot allocate enough memory." << std::endl;
            exit(1);
        }

        for (int i = 0; i < block_size; ++i) marked[i] = 0;

        int first_index = 0;
        do
        {
            if (k > low_value)
            {
                first_index = k - low_value + k;
            }
            else if (k * k > low_value)
            {
                first_index = k * k - low_value;
            }
            else
            {
                if (low_value % k == 0) first_index = 0;
                else first_index = k - (low_value % k);
            }

            for (int i = first_index; i < block_size; i += k)
            {
                marked[i] = 1;
            }

            while (marked[++prime_index]);
            k = prime_index + 2;

        } while (k * k <= range_max);

        int local_count = 0;
        for (int i = 0; i < block_size; ++i)
        {
            if (marked[i] == 0)
            {
                ++local_count;
            }
        }

        free(marked); marked = 0;

        count += local_count;
    }

    std::cout << count << " primes found between 2 and " << range_max << std::endl;

    return 0;
}

Code 1.10: Single process Blocked Sieve of Eratosthenes with fast marking

#include <omp.h>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include <sys/time.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n) (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

void usage(void)
{
    std::cout << "sieve <max_number> <thread_count>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
    std::cout << "<thread_count> is the number of threads to use." << std::endl;
}

int main(int argc, char **argv)
{
    if (argc != 3)
    {
        std::cout << "Invalid number of arguments!" << std::endl;
        usage();
        return 0;
    }

    int range_max = atoi(argv[1]);
    int num_threads = atoi(argv[2]);

    if (range_max < 2)
    {
        std::cout << "<max_number> must be greater than or equal to 2." << std::endl;
        usage();
        return 0;
    }

    if (num_threads < 1)
    {
        std::cout << "<thread_count> between 1 and <max_number>" << std::endl;
        usage();
        return 0;
    }

    if (num_threads > omp_get_max_threads())
    {
        num_threads = omp_get_max_threads();
    }

    int temp = (range_max - 1) / num_threads;
    if ((1 + temp) < (int) sqrt((double) range_max))
    {
        std::cout << "Too many threads!" << std::endl;
        std::cout << "Each thread's block size should be greater than or equal to sqrt(n)." << std::endl;
        exit(1);
    }

    // Global k
    int k = 2;

    int prime_index = 0;

    // Global count
    int count = 0;

    int thread_id = 0;
    omp_set_num_threads(num_threads);
    #pragma omp parallel for default(shared) private(thread_id)
    for (thread_id = 0; thread_id < num_threads; ++thread_id)
    {
        int low_value = 2 + BLOCK_LOW(thread_id, num_threads, range_max - 1);
        int block_size = BLOCK_SIZE(thread_id, num_threads, range_max - 1);

        char *marked = (char *) malloc(block_size);

        if (marked == 0)
        {
            std::cout << "Thread " << thread_id << " cannot allocate enough memory." << std::endl;
            exit(1);
        }

        for (int i = 0; i < block_size; ++i) marked[i] = 0;

        int first_index = 0;
        do
        {
            if (k > low_value)
            {
                first_index = k - low_value + k;
            }
            else if (k * k > low_value)
            {
                first_index = k * k - low_value;
            }
            else
            {
                if (low_value % k == 0) first_index = 0;
                else first_index = k - (low_value % k);
            }

            for (int i = first_index; i < block_size; i += k)
            {
                marked[i] = 1;
            }

            #pragma omp barrier
            if (thread_id == 0)
            {
                while (marked[++prime_index]);
                k = prime_index + 2;
            }

            #pragma omp barrier
        } while (k * k <= range_max);

        int local_count = 0;
        for (int i = 0; i < block_size; ++i)
        {
            if (marked[i] == 0)
            {
                ++local_count;
            }
        }

        free(marked); marked = 0;

        #pragma omp atomic
        count += local_count;
    }

    std::cout << count << " primes found between 2 and " << range_max << std::endl;

    return 0;
}

Code 1.11: OpenMP Sieve of Eratosthenes

#include <omp.h>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include <sys/time.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n) (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

void usage(void)
{
    std::cout << "sieve <range> <thread_count>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
    std::cout << "<thread_count> is the number of threads to use." << std::endl;
}

int main(int argc, char **argv)
{
    TimeUtils::ScopedTimer t;

    if (argc != 3)
    {
        std::cout << "Invalid number of arguments!" << std::endl;
        usage();
        return 0;
    }

    int range_max = atoi(argv[1]);
    int num_threads = atoi(argv[2]);

    if (range_max < 2)
    {
        std::cout << "<max_number> must be greater than or equal to 2." << std::endl;
        usage();
        return 0;
    }

    if (num_threads < 1)
    {
        std::cout << "<thread_count> between 1 and <max_number>" << std::endl;
        usage();
        return 0;
    }

    if (num_threads > omp_get_max_threads())
    {
        num_threads = omp_get_max_threads();
    }

    int temp = (range_max - 1) / num_threads;
    if ((1 + temp) < (int) sqrt((double) range_max))
    {
        std::cout << "Too many threads!" << std::endl;
        std::cout << "Each thread's block size should be greater than or equal to sqrt(n)." << std::endl;
        exit(1);
    }

    int k = 3;

    int prime_index = 0;

    int count = 1;

    int thread_id = 0;
    omp_set_num_threads(num_threads);
    #pragma omp parallel for default(shared) private(thread_id)
    for (thread_id = 0; thread_id < num_threads; ++thread_id)
    {
        int low_value = 2 + BLOCK_LOW(thread_id, num_threads, range_max - 1);
        int high_value = 2 + BLOCK_HIGH(thread_id, num_threads, range_max - 1);
        int block_size = BLOCK_SIZE(thread_id, num_threads, range_max - 1);

        if (low_value % 2 == 0)
        {
            if (high_value % 2 == 0)
            {
                block_size = (int) floor((double) block_size / 2.0);
                high_value--;
            }
            else
            {
                block_size = block_size / 2;
            }

            low_value++;
        }
        else
        {
            if (high_value % 2 == 0)
            {
                block_size = block_size / 2;
                high_value--;
            }
            else
            {
                block_size = (int) ceil((double) block_size / 2.0);
            }
        }

        char *marked = (char *) malloc(block_size);

        if (marked == 0)
        {
            std::cout << "Thread " << thread_id << " cannot allocate enough memory." << std::endl;
            exit(1);
        }

        for (int i = 0; i < block_size; ++i) marked[i] = 0;

        int first_index = 0;
        do
        {
            if (k >= low_value)
            {
                first_index = ((k - low_value) / 2) + k;
            }
            else if (k * k > low_value)
            {
                first_index = (k * k - low_value) / 2;
            }
            else
            {
                if (low_value % k == 0)
                {
                    first_index = 0;
                }
                else
                {
                    first_index = 1;
                    while ((low_value + (2 * first_index)) % k != 0)
                        ++first_index;
                }
            }

            for (int i = first_index; i < block_size; i += k)
            {
                marked[i] = 1;
            }

            #pragma omp barrier
            if (thread_id == 0)
            {
                while (marked[++prime_index]);
                k = 3 + (prime_index * 2);
            }

            #pragma omp barrier
        } while (k * k <= range_max);

        int local_count = 0;
        for (int i = 0; i < block_size; ++i)
        {
            if (marked[i] == 0)
            {
                ++local_count;
            }
        }

        free(marked); marked = 0;

        #pragma omp atomic
        count += local_count;
    }

    std::cout << count << " primes found between 2 and " << range_max << std::endl;

    return 0;
}

Code 1.12: OpenMP Sieve of Eratosthenes with all even numbers eliminated from the lists/computation

#include <omp.h>
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#include <sys/time.h>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW((id) + 1, p, n) - 1)
#define BLOCK_SIZE(id, p, n) (BLOCK_HIGH(id, p, n) - BLOCK_LOW(id, p, n) + 1)

void usage(void)
{
    std::cout << "sieve <range> <thread_count>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
    std::cout << "<thread_count> is the number of threads to use." << std::endl;
}

int main(int argc, char **argv)
{
    TimeUtils::ScopedTimer t;

    if (argc != 3)
    {
        std::cout << "Invalid number of arguments!" << std::endl;
        usage();
        return 0;
    }

    int range_max = atoi(argv[1]);
    int num_threads = atoi(argv[2]);

    if (range_max < 2)
    {
        std::cout << "<max_number> must be greater than or equal to 2." << std::endl;
        usage();
        return 0;
    }

    if (num_threads < 1)
    {
        std::cout << "<thread_count> between 1 and <max_number>" << std::endl;
        usage();
        return 0;
    }

    if (num_threads > omp_get_max_threads())
    {
        num_threads = omp_get_max_threads();
    }

    int temp = (range_max - 1) / num_threads;
    if ((1 + temp) < (int) sqrt((double) range_max))
    {
        std::cout << "Too many threads requested!" << std::endl;
        std::cout << "The first thread must have a block size >= sqrt(n)." << std::endl;
        exit(1);
    }

    int k = 3;

    int count = 1;

    int sqrtn = ceil(sqrt((double) range_max));

    char *pre_marked = (char *) malloc(sqrtn + 1);
    pre_marked[0] = 1;
    pre_marked[1] = 1;
    for (int i = 2; i <= sqrtn; ++i) pre_marked[i] = 0;
    int pre_k = 2;

    do
    {
        int base = pre_k * pre_k;
        for (int i = base; i <= sqrtn; i += pre_k) pre_marked[i] = 1;
        while (pre_marked[++pre_k]);
    } while (pre_k * pre_k <= sqrtn);

    std::vector<int> kset;
    for (int i = 3; i <= sqrtn; ++i)
    {
        if (pre_marked[i] == 0)
            kset.push_back(i);
    }

    free(pre_marked);

    if (kset.empty())
    {
        std::cout << "There is 1 prime less than or equal to 2." << std::endl;
        exit(0);
    }

    int thread_id = 0;
    int kindex = 0;
    omp_set_num_threads(num_threads);
    #pragma omp parallel for default(shared) private(thread_id, kindex, k)
    for (thread_id = 0; thread_id < num_threads; ++thread_id)
    {
        kindex = 0;
        k = kset[kindex];

        int low_value = 2 + BLOCK_LOW(thread_id, num_threads, range_max - 1);
        int high_value = 2 + BLOCK_HIGH(thread_id, num_threads, range_max - 1);
        int block_size = BLOCK_SIZE(thread_id, num_threads, range_max - 1);

        if (low_value % 2 == 0)
        {
            if (high_value % 2 == 0)
            {
                block_size = (int) floor((double) block_size / 2.0);
                high_value--;
            }
            else
            {
                block_size = block_size / 2;
            }

            low_value++;
        }
        else
        {
            if (high_value % 2 == 0)
            {
                block_size = block_size / 2;
                high_value--;
            }
            else
            {
                block_size = (int) ceil((double) block_size / 2.0);
            }
        }

        char *marked = (char *) malloc(block_size);

        if (marked == 0)
        {
            std::cout << "Thread " << thread_id << " cannot allocate enough memory." << std::endl;
            exit(1);
        }

        for (int i = 0; i < block_size; ++i) marked[i] = 0;

        int first_index = 0;
        do
        {
            if (k >= low_value)
            {
                first_index = ((k - low_value) / 2) + k;
            }
            else if (k * k > low_value)
            {
                first_index = (k * k - low_value) / 2;
            }
            else
            {
                if (low_value % k == 0)
                {
                    first_index = 0;
                }
                else
                {
                    first_index = 1;
                    while ((low_value + (2 * first_index)) % k != 0)
                        ++first_index;
                }
            }

            for (int i = first_index; i < block_size; i += k)
            {
                marked[i] = 1;
            }

            k = kset[++kindex];
        } while (k * k <= range_max && kindex < (int) kset.size());

        int local_count = 0;
        for (int i = 0; i < block_size; ++i)
        {
            if (marked[i] == 0)
            {
                ++local_count;
            }
        }

        free(marked); marked = 0;

        #pragma omp atomic
        count += local_count;
    }

    std::cout << count << " primes found between 2 and " << range_max << std::endl;

    return 0;
}

Code 1.13: OpenMP Sieve of Eratosthenes with each thread maintaining the seed list
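In the odd-only storage used above, index i of marked[] stands for the odd value low_value + 2*i (low_value having been forced odd first), so a stride of k in index space advances by 2k in value space, stepping through consecutive odd multiples of k. A minimal serial sketch of that mapping and of the first_index search (the helper names value_at and first_multiple_index are ours, not from the listing):

```cpp
#include <cassert>

// Index i in the odd-only block represents the value low_value + 2*i,
// where low_value has already been adjusted to be odd.
int value_at(int low_value, int i)
{
    return low_value + 2 * i;
}

// First block index holding a multiple of the odd prime k, mirroring the
// low_value % k cases of the listing's first_index search.
int first_multiple_index(int low_value, int k)
{
    if (low_value % k == 0) return 0;
    int i = 1;
    while ((low_value + 2 * i) % k != 0) ++i;
    return i;
}
```

For example, with low_value = 101 and k = 7 the first marked index is 2, which maps to the value 105 = 7 * 15; the marking loop then touches every k-th index from there on.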

#include <mpi.h>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW(((id) + 1), p, n) - 1)
#define BLOCK_SIZE(id, p, n) ((BLOCK_LOW(((id) + 1), p, n)) - (BLOCK_LOW(id, p, n)))

void usage(void)
{
    std::cout << "sieve <max_number>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
}

int main(int argc, char* argv[])
{
    double elapsed_time;

    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);
    elapsed_time = -MPI_Wtime();

    int process_id;
    MPI_Comm_rank(MPI_COMM_WORLD, &process_id);

    int num_processes;
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);

    if (argc != 2)
    {
        if (process_id == 0)
        {
            usage();
        }

        // every process must shut down MPI before exiting
        MPI_Finalize();
        exit(1);
    }

    int range_max = atoi(argv[1]);

    int low_value = 2 + BLOCK_LOW(process_id, num_processes, range_max - 1);
    int block_size = BLOCK_SIZE(process_id, num_processes, range_max - 1);

    int temp = (range_max - 1) / num_processes;

    if ((2 + temp) < (int)sqrt((double)range_max))
    {
        if (process_id == 0)
        {
            std::cout << "Too many processes!" << std::endl;
            std::cout << "Block size should be greater than or equal to sqrt(n)." << std::endl;
        }

        MPI_Finalize();
        exit(1);
    }

    char* marked = (char*)malloc(block_size);
    if (marked == NULL)
    {
        std::cout << "Process " << process_id << " cannot allocate enough memory." << std::endl;

        MPI_Finalize();
        exit(1);
    }

    for (int i = 0; i < block_size; i++)
    {
        marked[i] = 0;
    }

    int first_index = 0;

    int k = 2;

    int prime_index = 0;

    int count = 0;

    do
    {
        if (k * k > low_value)
        {
            first_index = k * k - low_value;
        }
        else
        {
            if (low_value % k == 0) first_index = 0;
            else first_index = k - (low_value % k);
        }

        for (int i = first_index; i < block_size; i += k)
        {
            marked[i] = 1;
        }

        if (process_id == 0)
        {
            while (marked[++prime_index]);
            k = prime_index + 2;
        }

        MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);

    } while (k * k <= range_max);

    int local_count = 0;
    for (int i = 0; i < block_size; ++i)
    {
        if (marked[i] == 0)
        {
            ++local_count;
        }
    }

    free(marked); marked = 0;

    MPI_Reduce(&local_count, &count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    elapsed_time += MPI_Wtime();

    if (process_id == 0)
    {
        std::cout << count << " primes found between 2 and " << range_max << std::endl;

        char st[100];
        sprintf(st, "Time: %3.3f seconds\n", elapsed_time);
        std::cout << st;
    }

    MPI_Finalize();
    return 0;
}

Code 1.14: MPI Sieve of Eratosthenes
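The BLOCK_LOW/BLOCK_HIGH/BLOCK_SIZE macros split n items over p processes so that block sizes differ by at most one and the blocks tile the range [0, n) with no gaps or overlaps. A quick serial check of that property (the helper name blocks_tile is ours):

```cpp
#include <cassert>

// Same block decomposition macros as the listing.
#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW(((id) + 1), p, n) - 1)
#define BLOCK_SIZE(id, p, n) ((BLOCK_LOW(((id) + 1), p, n)) - (BLOCK_LOW(id, p, n)))

// Returns true when the p blocks exactly tile [0, n): each block starts
// where the previous one ended and the sizes sum to n.
bool blocks_tile(int p, int n)
{
    int covered = 0;
    for (int id = 0; id < p; ++id)
    {
        if (BLOCK_LOW(id, p, n) != covered) return false;
        covered += BLOCK_SIZE(id, p, n);
    }
    return covered == n;
}
```

For instance, 10 items over 4 processes yield blocks of sizes 2, 3, 2, 3; the larger blocks land on the higher ranks, which is why the sieve checks that even the smallest block still reaches sqrt(n).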

#include <mpi.h>
#include <iostream>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW(((id) + 1), p, n) - 1)
#define BLOCK_SIZE(id, p, n) ((BLOCK_LOW(((id) + 1), p, n)) - (BLOCK_LOW(id, p, n)))

void usage(void)
{
    std::cout << "sieve <max_number>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
}

int main(int argc, char* argv[])
{
    double elapsed_time;

    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);
    elapsed_time = -MPI_Wtime();

    int process_id;
    MPI_Comm_rank(MPI_COMM_WORLD, &process_id);

    int num_processes;
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);

    if (argc != 2)
    {
        if (process_id == 0)
        {
            usage();
        }

        // every process must shut down MPI before exiting
        MPI_Finalize();
        exit(1);
    }

    int range_max = atoi(argv[1]);

    int low_value = 2 + BLOCK_LOW(process_id, num_processes, range_max - 1);
    int high_value = 2 + BLOCK_HIGH(process_id, num_processes, range_max - 1);
    int block_size = BLOCK_SIZE(process_id, num_processes, range_max - 1);

    // keep only the odd values of the block: adjust the bounds and halve the size
    if (low_value % 2 == 0)
    {
        if (high_value % 2 == 0)
        {
            block_size = (int)floor((double)block_size / 2.0);
            high_value--;
        }
        else
        {
            block_size = block_size / 2;
        }

        low_value++;
    }
    else
    {
        if (high_value % 2 == 0)
        {
            block_size = block_size / 2;
            high_value--;
        }
        else
        {
            block_size = (int)ceil((double)block_size / 2.0);
        }
    }

    int temp = (range_max - 1) / num_processes;

    if ((2 + temp) < (int)sqrt((double)range_max))
    {
        if (process_id == 0)
        {
            std::cout << "Too many processes!" << std::endl;
            std::cout << "Block size should be greater than or equal to sqrt(n)." << std::endl;
        }

        MPI_Finalize();
        exit(1);
    }

    char* marked = (char*)malloc(block_size);
    if (marked == NULL)
    {
        std::cout << "Process " << process_id << " cannot allocate enough memory." << std::endl;

        MPI_Finalize();
        exit(1);
    }

    for (int i = 0; i < block_size; i++)
    {
        marked[i] = 0;
    }

    int first_index = 0;

    int k = 3;

    int prime_index = 0;

    int count = 0;

    do
    {
        if (k >= low_value)
        {
            first_index = ((k - low_value) / 2) + k;
        }
        else if (k * k > low_value)
        {
            first_index = (k * k - low_value) / 2;
        }
        else
        {
            if (low_value % k == 0)
            {
                first_index = 0;
            }
            else
            {
                first_index = 1;
                while ((low_value + (2 * first_index)) % k != 0)
                    ++first_index;
            }
        }

        for (int i = first_index; i < block_size; i += k)
        {
            marked[i] = 1;
        }

        if (process_id == 0)
        {
            while (marked[++prime_index]);
            k = 3 + (prime_index * 2);
        }

        MPI_Bcast(&k, 1, MPI_INT, 0, MPI_COMM_WORLD);

    } while (k * k <= range_max);

    // count the prime 2 on the root, since the odd-only lists omit it
    int local_count = (process_id == 0) ? 1 : 0;
    for (int i = 0; i < block_size; ++i)
    {
        if (marked[i] == 0)
        {
            ++local_count;
        }
    }

    free(marked); marked = 0;

    MPI_Reduce(&local_count, &count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    elapsed_time += MPI_Wtime();

    if (process_id == 0)
    {
        std::cout << count << " primes found between 2 and " << range_max << std::endl;

        char st[100];
        sprintf(st, "Time: %3.3f seconds\n", elapsed_time);
        std::cout << st;
    }

    MPI_Finalize();
    return 0;
}

Code 1.15: MPI Sieve of Eratosthenes with all even numbers eliminated from the lists/computation
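The parity adjustment at the top of Code 1.15 shrinks each block to just its odd values: depending on whether low_value and high_value are even or odd, block_size is halved with floor, plain integer division, or ceil. A serial sketch of that case analysis, packaged as a standalone helper (odd_block_size is our name, not the listing's), makes it easy to confirm the result is exactly the number of odd values in [low, high]:

```cpp
#include <cassert>
#include <cmath>

// Number of odd values in [low, high], computed with the same parity
// case analysis the listing applies to low_value/high_value/block_size.
int odd_block_size(int low, int high)
{
    int size = high - low + 1;
    if (low % 2 == 0)
    {
        if (high % 2 == 0) size = (int)floor((double)size / 2.0);
        else               size = size / 2;
    }
    else
    {
        if (high % 2 == 0) size = size / 2;
        else               size = (int)ceil((double)size / 2.0);
    }
    return size;
}
```

For example, the block [2, 10] holds four odd values (3, 5, 7, 9), while [3, 11] holds five; all four parity combinations reduce to one of the listing's branches.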

#include <mpi.h>
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdio>
#include <cstdlib>

#define BLOCK_LOW(id, p, n)  ((id) * (n) / (p))
#define BLOCK_HIGH(id, p, n) (BLOCK_LOW(((id) + 1), p, n) - 1)
#define BLOCK_SIZE(id, p, n) ((BLOCK_LOW(((id) + 1), p, n)) - (BLOCK_LOW(id, p, n)))

void usage(void)
{
    std::cout << "sieve <max_number>" << std::endl;
    std::cout << "<max_number> range between 2 and N." << std::endl;
}

int main(int argc, char* argv[])
{
    double elapsed_time;

    MPI_Init(&argc, &argv);

    MPI_Barrier(MPI_COMM_WORLD);
    elapsed_time = -MPI_Wtime();

    int process_id;
    MPI_Comm_rank(MPI_COMM_WORLD, &process_id);

    int num_processes;
    MPI_Comm_size(MPI_COMM_WORLD, &num_processes);

    if (argc != 2)
    {
        if (process_id == 0)
        {
            usage();
        }

        // every process must shut down MPI before exiting
        MPI_Finalize();
        exit(1);
    }

    int range_max = atoi(argv[1]);

    int sqrtn = (int)ceil(sqrt((double)range_max));

    // every process builds its own seed list of primes up to sqrt(n),
    // so no broadcast of the sieving primes is needed later
    char* pre_marked = (char*)malloc(sqrtn + 1);
    pre_marked[0] = 1;
    pre_marked[1] = 1;
    for (int i = 2; i <= sqrtn; ++i) pre_marked[i] = 0;
    int pre_k = 2;

    do
    {
        int base = pre_k * pre_k;
        for (int i = base; i <= sqrtn; i += pre_k) pre_marked[i] = 1;
        while (pre_marked[++pre_k]);
    } while (pre_k * pre_k <= sqrtn);

    std::vector<int> k_set;
    for (int i = 3; i <= sqrtn; ++i)
    {
        if (pre_marked[i] == 0)
            k_set.push_back(i);
    }

    free(pre_marked);

    if (k_set.empty())
    {
        if (process_id == 0)
            std::cout << "There is 1 prime less than or equal to 2." << std::endl;

        MPI_Finalize();
        exit(0);
    }

    int low_value = 2 + BLOCK_LOW(process_id, num_processes, range_max - 1);
    int high_value = 2 + BLOCK_HIGH(process_id, num_processes, range_max - 1);
    int block_size = BLOCK_SIZE(process_id, num_processes, range_max - 1);

    // keep only the odd values of the block: adjust the bounds and halve the size
    if (low_value % 2 == 0)
    {
        if (high_value % 2 == 0)
        {
            block_size = (int)floor((double)block_size / 2.0);
            high_value--;
        }
        else
        {
            block_size = block_size / 2;
        }

        low_value++;
    }
    else
    {
        if (high_value % 2 == 0)
        {
            block_size = block_size / 2;
            high_value--;
        }
        else
        {
            block_size = (int)ceil((double)block_size / 2.0);
        }
    }

    int temp = (range_max - 1) / num_processes;

    if ((2 + temp) < (int)sqrt((double)range_max))
    {
        if (process_id == 0)
        {
            std::cout << "Too many processes!" << std::endl;
            std::cout << "Block size should be greater than or equal to sqrt(n)." << std::endl;
        }

        MPI_Finalize();
        exit(1);
    }

    char* marked = (char*)malloc(block_size);
    if (marked == NULL)
    {
        std::cout << "Process " << process_id << " cannot allocate enough memory." << std::endl;

        MPI_Finalize();
        exit(1);
    }

    for (int i = 0; i < block_size; i++)
    {
        marked[i] = 0;
    }

    int first_index = 0;

    int kindex = 0;

    int k = k_set[kindex];

    int count = 0;

    do
    {
        if (k >= low_value)
        {
            first_index = ((k - low_value) / 2) + k;
        }
        else if (k * k > low_value)
        {
            first_index = (k * k - low_value) / 2;
        }
        else
        {
            if (low_value % k == 0)
            {
                first_index = 0;
            }
            else
            {
                first_index = 1;
                while ((low_value + (2 * first_index)) % k != 0)
                    ++first_index;
            }
        }

        for (int i = first_index; i < block_size; i += k)
        {
            marked[i] = 1;
        }

        // advance to the next seed prime, guarding against running off k_set
        if (++kindex >= (int)k_set.size()) break;
        k = k_set[kindex];
    } while (k * k <= range_max);

    // count the prime 2 on the root, since the odd-only lists omit it
    int local_count = (process_id == 0) ? 1 : 0;
    for (int i = 0; i < block_size; ++i)
    {
        if (marked[i] == 0)
        {
            ++local_count;
        }
    }

    free(marked); marked = 0;

    MPI_Reduce(&local_count, &count, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    elapsed_time += MPI_Wtime();

    if (process_id == 0)
    {
        std::cout << count << " primes found between 2 and " << range_max << std::endl;

        char st[100];
        sprintf(st, "Time: %3.3f seconds\n", elapsed_time);
        std::cout << st;
    }

    MPI_Finalize();
    return 0;
}

Code 1.16: MPI Sieve of Eratosthenes with each process maintaining the seed list
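The point of Code 1.16 is that sieving the seed primes up to sqrt(n) is cheap enough to replicate on every process, which removes the per-prime MPI_Bcast of the earlier versions. A serial model of that two-stage strategy, with the whole odd-only range as a single block (the function name count_primes is ours; the MPI version distributes the second stage over blocks):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stage 1: sieve the odd seed primes up to sqrt(n) (pre_marked / k_set).
// Stage 2: use them to sieve the odd values in [3, n] and count primes.
int count_primes(int n)
{
    if (n < 2) return 0;
    int sqrtn = (int)std::ceil(std::sqrt((double)n));
    std::vector<char> pre_marked(sqrtn + 1, 0);
    std::vector<int> k_set;                   // odd seed primes <= sqrt(n)
    for (int i = 3; i <= sqrtn; i += 2)
        if (!pre_marked[i])
        {
            k_set.push_back(i);
            for (int j = i * i; j <= sqrtn; j += i) pre_marked[j] = 1;
        }
    // Odd-only block covering [3, n]: index i stands for the value 3 + 2*i.
    int block_size = (n - 1) / 2;             // number of odd values in [3, n]
    std::vector<char> marked(block_size, 0);
    for (int k : k_set)
        for (int i = (k * k - 3) / 2; i < block_size; i += k)
            marked[i] = 1;                    // stride k in index = 2k in value
    int count = 1;                            // the prime 2
    for (int i = 0; i < block_size; ++i)
        if (!marked[i]) ++count;
    return count;
}
```

With the broadcast gone, the only collective left in the MPI version is the final MPI_Reduce of the per-block counts, at the price of every process redoing the O(sqrt(n)) seed sieve.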