Proceedings of the Seventh International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'06), Taipei, Taiwan, December 4-7, 2006. 0-7695-2736-1/06 $20.00 © 2006 IEEE

A Parallel Algorithm by Sampling for the Knapsack Problem Based on MIMD Parallel Computers

LIU XIAO-LING 1, GAO SHOU-PING 1, GONG DE-LIANG 1, LI KEN-LI 2
1 Department of Computer Science, Xiangnan College, Chenzhou, 423000, China

2 School of Computer and Communication, Hunan University, Changsha, 410082, China [email protected]

Abstract

The knapsack problem is a famous NP-complete problem and is very important in research on cryptosystems and number theory. After a deep analysis of the parallel algorithms previously proposed for it, a new parallel algorithm by sampling, based on MIMD supercomputers, is proposed in this paper. Performance analysis and comparisons are then presented. Finally, experimental results for randomly generated knapsack instances on an IBM P690 supercomputer are given. The results show that the parallel efficiency can be over 60% when solving larger scale knapsack instances (n ≥ 40). This demonstrates that the proposed parallel algorithm for the knapsack problem is feasible and efficient on MIMD scalable supercomputers.

1. Introduction

The knapsack problem, namely the Subset-Sum problem, can be defined mathematically as: given n positive integers W = (w1, w2, …, wn) and a positive integer M, find x1, …, xn with xi = 0 or 1, i ∈ {1, 2, …, n}, such that w1x1 + w2x2 + … + wnxn = M. The knapsack problem is a famous NP-complete problem[1]. The original search space has 2^n possible values, so an exhaustive search would take O(2^n) time to find a solution in the worst case. Because of this exponential solution complexity, the problem is very important in research on cryptosystems and number theory[2]. Several sequential algorithms are efficient for the knapsack problem, but for instances whose dimension n is large in practice, they cannot find a solution in reasonable time.

With the advent of parallel processing technology, much effort has been made to reduce the computing time of the knapsack problem[3]. Following the two-list algorithm[5], Karnin[4] proposed a parallel two-list algorithm based on the CREW-SIMD model. In 1991 Ferreira[6] proposed a brilliant parallel algorithm whose space requirement is O(2^(n/2)). In 1994 Chang[7] presented another parallel algorithm that requires O(2^(n/4)) shared memory using O(2^(n/8)) processors. Thereafter, in 1997, based on Chang's parallel algorithm, Lou and Chang[4] parallelized the second stage of the two-list algorithm. The optimal parallel algorithm for the knapsack problem based on CREW-SIMD and the optimal parallel algorithm without memory conflicts based on EREW-SIMD were proposed in [8] and [9], respectively, in 2003.

All the parallel algorithms mentioned above are based on SIMD shared-memory parallel computers and are very important among parallel algorithms for the knapsack problem. But MIMD scalable supercomputers prevail currently; they include the Symmetric Multiprocessor (SMP), the Massively Parallel Processor (MPP) and the Cluster of Workstations (COW)[10].

Based on the two-list algorithm[4] and the parallel merging by fixed sampling (PMFS) sort algorithm for MPP[11], a new parallel algorithm by sampling for the knapsack problem, designed for MIMD supercomputers, is proposed in this paper. The algorithm needs P = O(2^m) processors, and each processor holds O(2^(n/2-m)) memory space, where 2 ≤ m ≤ n/2.

The paper is organized as follows. In Section 2, the two-list algorithm and the PMFS sort are presented in brief. Our proposed algorithm is given in detail in Section 3. Performance analysis and comparisons follow in Section 4. Finally, the experimental results and some concluding remarks are given in Section 5.

2. The basic algorithms

2.1. The two-list algorithm

Speaking briefly, the two-list algorithm can be divided into two stages: one is the generation stage, and the other is the search stage. The former is designed for generating two sorted lists, and the latter is constructed for obtaining the final solution from combinations of the two sorted lists. We briefly explain it as it was introduced in [4,5].

Algorithm 1: the two-list algorithm
Generation stage
1 Divide W into two equal parts: W1 = (w1, w2, …, w(n/2)) and W2 = (w(n/2+1), w(n/2+2), …, wn).
2 Form all 2^(n/2) possible subset sums of W1, sort them in increasing order, and store them as the list A = [a1, …, an'], where n' = 2^(n/2).
3 Form all 2^(n/2) possible subset sums of W2, sort them in decreasing order, and store them as the list B = [b1, b2, …, bn'], where n' = 2^(n/2).
Search stage
1 i = 1, j = 1.
2 If ai + bj = M then stop: a solution is found.
3 If ai + bj < M then i = i + 1; else j = j + 1.
4 If i > 2^(n/2) or j > 2^(n/2) then stop: there is no solution.
5 Go to Step 2.

To understand the two-list algorithm clearly, let us use the following example to illustrate it.

Example: W = (5, 4, 7, 9, 2, 8), M = 20.
The generation stage is as follows:
W1 = (5, 4, 7) and W2 = (9, 2, 8);
A = (0, 4, 5, 7, 9, 11, 12, 16);
B = (19, 17, 11, 10, 9, 8, 2, 0).
The search stage is as follows:
1 a1 + b1 = 19 < M ⇒ i = i + 1;
2 a2 + b1 = 23 > M ⇒ j = j + 1;
3 a2 + b2 = 21 > M ⇒ j = j + 1;
4 a2 + b3 = 15 < M ⇒ i = i + 1;
5 a3 + b3 = 16 < M ⇒ i = i + 1;
6 a4 + b3 = 18 < M ⇒ i = i + 1;
7 a5 + b3 = 20 = M ⇒ X = (1, 1, 0, 1, 1, 0).

In fact, the two-list algorithm can be divided into the following three main steps, among which steps a and b belong to the generation stage while step c belongs to the search stage:
a. generating all subset sums of A and B;
b. sorting A and B;
c. searching lists A and B.
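To make these three steps concrete, here is a minimal sequential C sketch of the two-list algorithm, run on the example instance above. It is illustrative only: the names (subset_sums, cmp_asc, and so on) are ours, not the paper's, and reconstruction of the solution vector X is omitted.

#include <stdio.h>
#include <stdlib.h>

static int cmp_asc(const void *x, const void *y) {
    long a = *(const long *)x, b = *(const long *)y;
    return (a > b) - (a < b);
}
static int cmp_desc(const void *x, const void *y) { return cmp_asc(y, x); }

/* Fill sums[0 .. 2^k - 1] with all subset sums of w[0 .. k-1]. */
static void subset_sums(const long *w, int k, long *sums) {
    long count = 1;
    sums[0] = 0;
    for (int i = 0; i < k; i++) {           /* list doubles per element */
        for (long s = 0; s < count; s++)
            sums[count + s] = sums[s] + w[i];
        count *= 2;
    }
}

int main(void) {
    long w[] = {5, 4, 7, 9, 2, 8};          /* example instance from the text */
    const long M = 20;
    int n = 6, half = n / 2;
    long size = 1L << half;
    long *A = malloc(size * sizeof(long)), *B = malloc(size * sizeof(long));

    subset_sums(w, half, A);                /* step a: sums of W1 = (5, 4, 7) */
    subset_sums(w + half, half, B);         /*         sums of W2 = (9, 2, 8) */
    qsort(A, size, sizeof(long), cmp_asc);  /* step b: A nondecreasing */
    qsort(B, size, sizeof(long), cmp_desc); /*         B nonincreasing */

    long i = 0, j = 0;                      /* step c: two-pointer search */
    while (i < size && j < size) {
        long s = A[i] + B[j];
        if (s == M) { printf("solution: %ld + %ld = %ld\n", A[i], B[j], M); break; }
        if (s < M) i++; else j++;
    }
    if (i == size || j == size) printf("no solution\n");
    free(A); free(B);
    return 0;
}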

2.2. The PMFS sort algorithm

The PMFS sort and parallel sorting by regular sampling (PSRS) are suitable for a diverse range of MIMD architectures. They have many advantages: good load-balancing properties, modest communication needs and good memory locality of reference[12]. The data generated by the PMFS sort are better distributed than those generated by PSRS[11], so the PMFS sort is best fitted for problems requiring a well-balanced data distribution.

The PMFS sort algorithm is extended from the PMFS merging algorithm; it[11] is given briefly in Algorithm 2.

Algorithm 2: the PMFS sort algorithm
Input: a data list A = {a1, a2, …, an} to be sorted, evenly distributed over P processors.
Output: the data list in order according to the serial numbers of the processors.
Begin
1 For all Pi do
    sort the local data using quicksort;
2 For i = 1 to log P do
    For all Pj do
      Result = j / 2^i;
      group the processors by equal Result;
      all Pj in each group call PMFS to merge;
End
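As a rough illustration of Algorithm 2's structure in the paper's own implementation language (C with MPI), the sketch below performs the local quicksort and the log2(P) grouped merge rounds. The PMFS group merge itself is not specified in this paper, so merge_group_stub is only a stand-in with the same input/output contract (gather, re-sort, scatter); the real PMFS merge of [11] exchanges fixed samples to avoid this all-to-one traffic. All function names here are ours.

#include <mpi.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Stand-in for the PMFS group merge of [11]: gather the group's blocks on
   the group root, re-sort, and scatter back. Correct but not scalable. */
static void merge_group_stub(long *local, int n_local, MPI_Comm group) {
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);
    long *all = NULL;
    if (grank == 0)
        all = malloc((size_t)gsize * n_local * sizeof(long));
    MPI_Gather(local, n_local, MPI_LONG, all, n_local, MPI_LONG, 0, group);
    if (grank == 0)
        qsort(all, (size_t)gsize * n_local, sizeof(long), cmp_long);
    MPI_Scatter(all, n_local, MPI_LONG, local, n_local, MPI_LONG, 0, group);
    if (grank == 0) free(all);
}

/* Algorithm 2's skeleton: local sort, then log2(P) merge rounds in which
   processors with equal Result = rank / 2^i form one merge group. */
void pmfs_sort(long *local, int n_local, MPI_Comm comm) {
    int rank, P;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &P);
    qsort(local, n_local, sizeof(long), cmp_long);  /* step 1 */
    for (int i = 1; (1 << i) <= P; i++) {           /* step 2 */
        int result = rank / (1 << i);
        MPI_Comm group;
        MPI_Comm_split(comm, result, rank, &group);
        merge_group_stub(local, n_local, group);
        MPI_Comm_free(&group);
    }
}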

3. The proposed parallel algorithm

For convenience of expression, we set N = n/2 and e = 2^(n/2-m+1); Pi means the i-th processor. We use O(2^m) processors and O(e) memory space in each processor for the n-dimensional knapsack problem. The logical memory includes an Element array and a Sum array. In each processor the Element array is set to contain N elements so as to reduce communication overhead; it is used to store W1 or W2. The Sum array is assigned to store part of the subset sums of the N elements.

In the following we consider the three steps of our algorithm one by one: the subset sum generation stage, the sort stage and the search stage.

3.1. The subset sum generation stage

The front P/2 processors are assigned to calculate all subset sums of W1; the others are used for all subset sums of W2. This stage requires two phases: filling Sum(0) of all processors, and calculating all subset sums of the remaining elements.

In the first phase, Sum(0) of P0 and P(P/2) are first set to 0. Then P0 and P(P/2) respectively compute the subset sums of the front r elements of W1 and W2, where 2^r = P/2 (i.e., r = m - 1), putting the corresponding results in a temporary array temp[P/2] on P0 and P(P/2) respectively. Finally the two processors respectively send the P/2 - 1 values of the temp array to Sum(0) of P1, …, P(P-1). In the second phase, each processor takes the remaining elements from Element[r] onward one by one, adds them to the data already in the Sum array, and stores the results in order in the Sum array. The detailed algorithm is given in Algorithm 3, and the process of P0 performing the generation stage is shown in Figure 1; that of P(P/2) performing the subset sum generation stage is similar to P0's.


Figure 1. The process of P0 performing the subset sum generation stage (figure not reproduced)

Algorithm 3: the subset sum generation stage
Pi: the i-th processor, i is the index of the processor;
Sum(i)(0): Sum(0) of processor Pi.
Begin
  If (P0 or P(P/2)) then  // calculating Sum(0) of the processors
    Sum(0) = 0; temp(0) = 0;
    For i = 0 to r - 1 do
      For (j = 2^i, s = 0; j <= 2^(i+1) - 1 && s <= 2^i - 1; j++, s++)
        temp[j] = Element[i] + temp[s];
    For (h = 1; h < P/2; h++)
      If (P0) then Sum(h)(0) = temp(h);
      Else (P(P/2)) Sum(P/2+h)(0) = temp(h);
  For all Pj do  // calculate subset sums of the local data
    For i = r to N - 1 do
      For k = 0 to 2^(i-r) - 1 do
        Sum(2^(i-r) + k) = Sum(k) + Element(i);

End

The first phase avoids deadlock in the course of sending and receiving data between processors; thus it ensures the stability of our algorithm. Since in the second phase all processors deal only with local data, no communication is needed and good utilization is achieved, so the communication cost is decreased.
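The second, communication-free phase of Algorithm 3 is just the classical doubling construction of subset sums applied to the local elements. A small C sketch, under the assumption that the first phase has already seeded Sum[0] (the function name and the plain-array signature are ours):

/* Second phase of Algorithm 3 on one processor: starting from the seed
   value already placed in Sum[0], fold in the remaining elements
   Element[r..N-1] by the rule Sum[2^(i-r) + k] = Sum[k] + Element[i]. */
void generate_local_sums(const long *Element, int r, int N, long *Sum) {
    long count = 1;                        /* Sum[0] holds the seed value */
    for (int i = r; i < N; i++) {
        for (long k = 0; k < count; k++)   /* count == 2^(i-r) here */
            Sum[count + k] = Sum[k] + Element[i];
        count *= 2;                        /* local list doubles per element */
    }
}

On exit each processor holds 2^(N-r) = 2^(n/2-m+1) = e partial sums, matching the Sum array size stated at the beginning of Section 3.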

3.2. The sort stage

After the generation stage, the subset sums of W1 and W2 are evenly stored in the Sum arrays of P0, …, P(P/2-1) and P(P/2), …, P(P-1) respectively. Each processor contains e data items.

Because of the advantages of the PMFS sort algorithm proposed in [11], we use it to sort the data in the Sum array of each processor: P0, …, P(P/2-1) sort A and P(P/2), …, P(P-1) sort B. After sorting, A, stored on P0, …, P(P/2-1), is in nondecreasing order according to the serial numbers of the processors, while B, distributed on P(P/2), …, P(P-1), is in nonincreasing order. Then all processors sort their local data by merge sort. This step is illustrated in Algorithm 4.

Algorithm 4: the sort stage
Pi: the i-th processor, i is the index of the processor.
Begin
  For all Pi do
    If (i >= 0 and i < P/2)
      1 Pi calls the PMFS sort to sort the e data in its Sum array in nondecreasing order according to the serial numbers of the processors;
      2 Pi sorts its local data by merge sort in nondecreasing order.
    Else
      3 Pi calls the PMFS sort to sort the e data in its Sum array in nonincreasing order according to the serial numbers of the processors;
      4 Pi sorts its local data by merge sort in nonincreasing order.
End

3.3. The search stage

After the two steps above, each Sum array on P0, …, P(P/2-1) is equal to the data block obtained by dividing A equally into P/2 blocks, containing e sorted elements in nondecreasing order. Similarly, each Sum array on P(P/2), …, P(P-1) is the block obtained by dividing B equally into P/2 blocks, also containing e sorted elements in nonincreasing order.

This stage is achieved through two steps: first, test which paired processors possibly include the solution; second, search for the solution within those paired processors. In the first step, a two-dimensional FlagFindSolution array is set up to store the index pairs of the paired processors. To reduce the searching time of the processors, Lemma 1 and Lemma 2 in [6] are introduced. The details are given in Algorithm 5. For convenience of expression, mark the Sum arrays of P0, …, P(P/2-1) as Asum[e] and those of P(P/2), …, P(P-1) as Bsum[e].

Algorithm 5: the search stage
last = e - 1; P: the number of processors, P = 2^m.
Begin
  For all Pi (0 <= i <= P/2 - 1) do
    For all Pj (P/2 <= j <= P - 1) do
      X = Asum[0] + Bsum[last]; Y = Asum[last] + Bsum[0];
      If (X = M or Y = M) then stop: a solution is found;
      Else if (X > M) then j++;
      Else if (Y < M) then i++;
      Else (X < M and Y > M)
        send (i, j) to FlagFindSolution[i][j] of P0;
  If FlagFindSolution is not null do
    For all paired Pi and Pj in FlagFindSolution do
      h = 0; k = 0;
      While (0 <= h <= last and 0 <= k <= last)
        If Asum[h] + Bsum[k] < M then h++;
        If Asum[h] + Bsum[k] > M then k++;
        If Asum[h] + Bsum[k] = M then stop: a solution is found;
  Else there is no solution!
End
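The two parts of Algorithm 5 are easy to state as plain C predicates operating on one processor pair's blocks. A hedged sketch follows; the function names are ours, and the inter-processor exchange of the Asum/Bsum blocks and the FlagFindSolution bookkeeping are left out.

#include <stdbool.h>

/* First step: screen a processor pair using only its extreme elements.
   Asum is nondecreasing and Bsum is nonincreasing, so
   Asum[0]+Bsum[last] is the smallest sum the pair can form and
   Asum[last]+Bsum[0] is the largest; the pair may contain a solution
   only when that range brackets M. */
bool pair_may_contain_solution(const long *Asum, const long *Bsum,
                               long e, long M) {
    long last = e - 1;
    long X = Asum[0] + Bsum[last];
    long Y = Asum[last] + Bsum[0];
    return X <= M && M <= Y;
}

/* Second step: two-pointer scan inside a flagged pair. */
bool search_pair(const long *Asum, const long *Bsum, long e, long M,
                 long *h_out, long *k_out) {
    long h = 0, k = 0, last = e - 1;
    while (h <= last && k <= last) {
        long s = Asum[h] + Bsum[k];  /* Bsum[k] shrinks as k grows */
        if (s == M) { *h_out = h; *k_out = k; return true; }
        if (s < M) h++; else k++;
    }
    return false;
}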

4. Performance analysis and comparisons

4.1. Performance analysis

The performance of our algorithm in terms of hardware and computation time is readily obtained. The memory space per processor is O(2^(n/2-m)). Let Tadd, Tcomm and Tcomp respectively denote the time to perform an addition, the communication time to transmit data, and the time to compare data.

For the subset sum generation stage, the first phase takes O(2^m) time to add 2^m subset sums and O(2^m) time to transmit 2^m data. In the second phase no communication is needed, but Tadd = O(2^(n/2-m+1)). Hence the running time of the generation stage is O(2^m + 2^(n/2-m)).

In the sort stage, based on the time complexity of the PMFS sort analyzed in [11], its PMFS step takes O(m × 2^(m-1) + (n/2 - m) × 2^(n/2-m)) time, of which Tcomm and Tcomp are respectively O((m - 1) × 2^(n/2-m) + 2^(m-1)) and O((m - 1) × 2^(n/2-m) + 2^(m-1)). In its second step, since all processors process their local data, there is no communication, and O(2^(n/2-m) × (n/2 - m)) time is spent on comparing data. Therefore the total time of the sort stage is O(m × 2^(m-1) + (n/2 - m) × 2^(n/2-m)).

In the first step of the search stage, every processor Pi (0 <= i <= P/2 - 1) must perform the test against every processor Pj (P/2 <= j <= P - 1), so O(2^m) communication, addition and comparison are needed. During its second step all paired processors search for the solution in one direction, so the worst case needs at most O(2^(n/2-m)) communication, addition and comparison. Hence the time of this stage is O(2^m + 2^(n/2-m)).

In summary, the total time complexity of our proposed algorithm is O(m × 2^m + (n/2 - m) × 2^(n/2-m)).
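Tallying the three stage costs derived above makes the summary line transparent (a worked restatement, not an additional result):

\begin{aligned}
T_{\text{total}} &= \underbrace{O\!\left(2^{m}+2^{n/2-m}\right)}_{\text{generation}}
 + \underbrace{O\!\left(m\cdot 2^{m-1}+(n/2-m)\cdot 2^{n/2-m}\right)}_{\text{sort}}
 + \underbrace{O\!\left(2^{m}+2^{n/2-m}\right)}_{\text{search}} \\
 &= O\!\left(m\cdot 2^{m}+(n/2-m)\cdot 2^{n/2-m}\right).
\end{aligned}

With m chosen well below n/2 the second term dominates, which is bounded by the O(n · 2^(n/2-m)) figure quoted in the comparison below.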

4.2. Performance comparisons

Following previous research, the performance comparison is described in terms of the time-processor tradeoff and the type of parallel computer. Karnin's parallel algorithm[4] takes O(2^(n/2)) time to solve the knapsack problem with O(2^(n/6)) processors. The processor count, time complexity and memory of Ferreira's parallel algorithm[6] are respectively O((2^(n/2))^(1-ε)), O(n(2^(n/2))^ε) and O(2^(n/2)). The performance of Chang's parallel algorithm[7] and Lou's[4] is in both cases T = O(n·2^(n/2)), P = O(2^(n/8)) and M = O(2^(n/4)). As for the optimal parallel algorithm of Li[8], its performance parameters are respectively O((2^(n/4))^(1-ε)), O(2^(n/4)(2^(n/4))^ε) and O(2^(n/2)). Our algorithm takes O(n·2^(n/2-m)) time with O(2^m) processors and O(2^(n/2)) memory. For clarity, the comparisons of the mentioned parallel algorithms for the knapsack problem are summarized in Table 1.

Table 1. Comparisons of the parallel algorithms for the knapsack problem

Algorithm    Processors            Time                    Memory      Method     Type of parallel computer
Karnin[4]    O(2^(n/6))            O(2^(n/2))              O(2^(n/6))  DG         SIMD shared memory
Ferreira[6]  O((2^(n/2))^(1-ε))    O(n(2^(n/2))^ε)         O(2^(n/2))  PG and PS  SIMD shared memory
Chang[7]     O(2^(n/8))            O(n·2^(n/2))            O(2^(n/4))  DG         SIMD shared memory
Lou[4]       O(2^(n/8))            O(n·2^(n/2))            O(2^(n/4))  DG and PS  SIMD shared memory
Li[8]        O((2^(n/4))^(1-ε))    O(2^(n/4)(2^(n/4))^ε)   O(2^(n/2))  PG and PS  SIMD shared memory
Ours         O(2^m)                O(n·2^(n/2-m))          O(2^(n/2))  PG and PS  MIMD shared or distributed memory

Notations: 0 ≤ ε ≤ 1; 2 ≤ m ≤ n/2; DG: dynamic generation; PG: parallel generation; PS: parallel search.

It is obvious that Li's optimal algorithm is based on SIMD shared memory and outperforms the other algorithms based on the SIMD shared-memory model. Although the time complexity of our algorithm is larger than that of Li's algorithm, ours is based on MIMD and can be adapted to solve the knapsack problem on parallel computers with MIMD shared or distributed memory.

5. Experimental results and conclusions

5.1. Experimental results

Our proposed algorithm has been implemented in C and MPI, using blocking communication, on an IBM P690 high-performance computer. The experimental results are shown in Table 2.

As shown in Table 2, as the number of processors increases, the running time for small-scale knapsack problems increases, because the communication cost increases; but for large-scale ones (n ≥ 40) the running time decreases sharply. So the appropriate number of processors should be decided by the problem scale. The proposed algorithm is best fitted for larger scale knapsack problems (n ≥ 40), for which the efficiency generally reaches 60%. From Table 2 it is concluded that our proposed algorithm is efficient on MIMD parallel computers.
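The speedup and efficiency columns of Table 2 below are related in the usual way (S = T_serial / T_parallel, E = S / P). As a worked check against one row, take n = 50 with P = 32:

S = \frac{12630.460131}{554.088727} \approx 22.79,
\qquad
E = \frac{S}{P} = \frac{22.79}{32} \approx 71.2\%,

which matches the 71.23% entry in the table up to rounding.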

Table 2. Experimental results on IBM P690 (unit: s)

Dimension (n)  Processors (P)  Parallel time (s)  Serial time (s)  Speedup  Efficiency (%)
20             4               0.054440           0.198898         3.654    91.35
20             8               0.383685           0.198898         0.518    6.48
20             16              1.622112           0.198898         0.123    0.77
20             32              5.511016           0.198898         0.03     0.11
30             4               0.249973           0.875155         3.501    87.53
30             8               0.930597           0.875155         0.940    11.76
30             16              1.930013           0.875155         0.453    2.83
30             32              9.069347           0.875155         0.096    0.30
40             4               30.186220          74.928604        2.482    62.06
40             8               12.636239          74.928604        5.930    74.12
40             16              7.049582           74.928604        10.629   66.43
40             32              8.022891           74.928604        9.339    29.19
50             4               6866.865104        12630.460131     1.840    45.98
50             8               1784.907939        12630.460131     7.076    88.45
50             16              905.959959         12630.460131     13.94    87.13
50             32              554.088727         12630.460131     22.79    71.23

5.2. Conclusions

In this paper a new parallel algorithm by sampling for the knapsack problem is presented, based on MIMD scalable supercomputers. In the sort stage of the proposed algorithm the PMFS sort is introduced to sort the lists, which helps reduce the communication cost and achieve good load balancing. Performance analysis and experimental results have shown that the parallel efficiency can be over 60% when solving larger scale knapsack instances (n ≥ 40), and that the algorithm has low communication cost and good load balancing. Furthermore, the proposed algorithm can be run on SMP, MPP and COW scalable parallel computers with shared or distributed memory. Although with a proper number of processors the communication cost is kept to a moderate proportion, it will grow as the number of processors increases, so further work is needed on this aspect.

6. References
[1] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, San Francisco: W.H. Freeman and Co., 1979.
[2] B. Chor and R.L. Rivest, A knapsack-type public key cryptosystem based on arithmetic in finite fields, IEEE Trans. Inform. Theory, 1988, 34(5): 901-909.
[3] C.A.A. Sanches, N.Y. Soma and H.H. Yanasse, Comments on parallel algorithms for the knapsack problem, Parallel Computing, 28(2002): 1501-1505.
[4] D.C. Lou and C.C. Chang, A parallel two-list algorithm for the knapsack problem, Parallel Computing, 1997, 22: 1985-1996.
[5] R. Schroeppel and A. Shamir, A T = O(2^(n/2)), S = O(2^(n/4)) algorithm for certain NP-complete problems, SIAM J. Comput., 1981, 10(3): 456-464.
[6] A.G. Ferreira, A parallel time/hardware tradeoff T·H = O(2^(n/2)) for the knapsack problem, IEEE Transactions on Computers, 1991, 40(2): 221-225.
[7] Ken-Li Li, Ren-Fa Li and Qing-Hua Li, Optimal parallel algorithm for the knapsack problem without memory conflicts, Journal of Computer Science and Technology, 2004, 19(6): 760-768.
[8] LI Qing-Hua, LI Ken-Li, JIANG Sheng-Yi and ZHANG Wei, An optimal parallel algorithm for the knapsack problem (in Chinese), Journal of Software, 2003, 14(5): 891-896.
[9] Kenli Li, Qinghua Li, Wang Hui and Shengyi Jiang, Optimal parallel algorithm for the knapsack problem without memory conflicts, Proceedings of the Fourth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'2003), 27-29 Aug. 2003, pp. 518-521.
[10] Chen Guo-Liang, Parallel Computing, Beijing: Higher Education Press, 2003 (in Chinese).
[11] DING Wei-Qun, JI Yong-Chang and CHEN Guo-Liang, A parallel merging algorithm based on MPP (in Chinese), Journal of Computer Research and Development, Jan. 1999, 36(1): 52-56.
[12] Xiao-Bo Li, Paul Lu, Jonathan Schaeffer, John Shillington, Pok Sze Wong and Hanmao Shi, On the versatility of parallel sorting by regular sampling, Parallel Computing, Oct. 1993, 19(10): 1079-1103.
