On the Allocation of Documents in Multiprocessor ...binds.cs.umass.edu › papers ›...

On the Allocation of Documents inMultiprocessor Information Retrieval Systems

Ophir FriederDept. of Computer ScienceGeorge Mason UniversityFairfax, VA 22030

Abstract. Information retrieval is the selection of

documents that are potentially relevant to a user’sinformation need. Given the vast volume of data stored inmodern information retrieval systems, searching thedocument database requires vast computational resources.To meet these computational demands, various researchershave developed parallel information retrieval systems. Asefficient exploitation of parallelism demands fast access tothe documents, data organization and placementsignificantly affect the total processing time. We describeand evaluate a data placement strategy for distributedmemory, distributed 1/0 multicomputers. Initially, aformal description of the Multiprocessor DocumentAllocation Problem (MDAP) and a proof that MDAP isNP Complete are presented. A document allocationalgorithm for MDAP based on Genetic Algorithms isdeveloped. This algorithm assumes that the documents areclustered using any one of the many clustering algorithms.We define a cost function for the derived allocation andevaluate the performance of our algorithm using thisfunction. As part of the experimental analysis, the effectsof varying the number of documents and their distributionacross the clusters as well the exploitation of variousdiffering architectural interconnection topologies arestudied. We also experiment with the several parameterscommon to Genetic Algorithms, e.g., the probability ofmutation and the population size.

1.0 IntroductionAn efficient multiprocessor information

retrieval system must maintain a low systemresponse time and require relatively little storageoverhead. As the volume of stored data continuesto increase daily, the multiprocessor engines mustlikewise scale to a large number of processors.This demand for system scalability necessitates a

distributed memory architecture as a large number

of processors is not currently possible in a shared-memory configuration. A distributed memorysystem, however, introduces the problem

Perrrkion to copy without fee all or part of this material is

granted provided that the copies are not made or distributed for

direct commercial advantage, the ACM copyright notice and the

title of the publication and its date appaar, and notice is given

that copying is by permission of the Association for Computing

Machinery. To copy otherwise, or to republish, requires a fee

and/or specific permission.

@ 1991 ACM 0-89791 -448 -1/91 /0009 /0230 . ..$1 .50

Hava Tova SiegelmannDept. of Computer ScienceRutgers UniversityNew Brunswick, NJ 08903

associated with the proper placement of data ontothe given architecture. We refer to this problem asthe Multiprocessor Document Allocation Problem(MDAP), a derivative of the Mapping Problemoriginally described by Bokhari [Bok8 1].

We assume a clustered document database. Aclustered approach is taken since an index fileorganization can introduce vast storage overhead(up to roughly 300% according to Haskin[Has8 1]) and a full-text or signature analysistechnique results in lengthy search times. In thiscontext, a proper solution to MDAP is anymapping of the documents onto the processorssuch that the average cluster diameter is kept to aminimum while still providing for an evendocument distribution across the nodes.

To achieve a significant reduction in the totalquery processing time using parallelism, theallocation of data among the processors should bedistributed as evenly as possible and theinterprocessor communication among the nodesshould be minimized. Achieving such anallocation is NP Complete. Thus, it is necessaryto use heuristics to obtain satisfactory mappings,which may indeed be suboptimal.

Genetic Algorithms [DeJ89, G0189, Gre85,Gre87, H0187, Rag87] approximate optimalsolutions to computationally intractable problems.We develop a genetic algorithm for MDAP andexamine the effects of varying the communicationcost matrix representing the interprocessorcommunication topology and the uniformity of thedistribution of documents to the clusters.

1.1 Mapping Problem ApproximationsAs the Mapping Problem and some of itsderivatives are NP complete, heuristic algorithmsare commonly employed to approximate the

optimal solutions. Some of these approaches[Bok81, B0188, Lee87] deal, in some manner,

This work was partially supported by grants from DCS,Inc. under contract number 5-35071 and the Center forInnovative Technology under contract number 5-34042.

9Qn

with the mapping of a communicating set ofprocesses onto an architecture with a fixedinterconnection topology. This problem is similarto MDAP in that both problems must map a set oftasks (items) onto a given architecture. However,the goals of the above efforts differ from MDAPas MDAP does not aim to maximize the amount ofconcurrent interprocess communication, butinstead, aims at reducing the total communicationsdiameter of its logical tasks (clusters).

Bokhari [Bok81] introduced a pairwise-exchange heuristic algorithm that accepted as inputtwo adjacency matrices representing graphs G (setof communicating processes) and G’ (targetarchitecture). Using the cardinality of the numberof communicating pairs that directlycommunicated with their neighbor as the objectivefunction, Bokhari developed and evaluated analgorithm that mapped graph G onto graph G’.Lee and Aggarwal [Lee87] extended that effort bydeveloping objective functions that moreaccurately quantified the communicationoverhead. Using a set of objective functions(parametric equations) that corresponded to thecost associated with the given mapping, Lee andAggarwal precisely measured the optimality of thederived mapping. The main limitation in theirapproach was that it only employed a fixed pathrouting scheme for the network traffic.

Bollinger and Midkiff [B0188] used a two-phase simulated annealing algorithm to map alogical system onto a physical architecture. Thefirst phase, process annealing, assigned theprocesses onto the physical nodes. Theconnection annealing phase mapped the logicalconnections onto the network data links so as tominimize communication link conflicts. Thiseffort improved upon Lee and Aggarwal [Lee87]in that it utilized the information concerning theactual routing rules.

Du and Maryanski [Du88] attacked a variationof the mapping problem. This variationconcerned the allocation of data in a dynamicallyreconfigurable environment. The allocationalgorithm used a set of “benefit” functions and agreedy search algorithm. The underlyingexecution architecture was based on a client/servermodel (a heterogeneous system). Although theirproblem more closely resembles MDAP, as theunderlying architectural model significantly differsfrom the MDAP execution environment, theirassumptions are not relevant to MDAP.

1.2 Related Multiprocessor SystemsDistributed-memory information retrieval

systems have been investigated as a mean of

providing short response times to users’ requests.Some of these systems include various efforts onthe Connection Machine [As090, Sta86, Sta89,Sta90], on the Distributed Array Processor (DAP)[Pog87, Pog88], on a network of Transputcrs[Cri90], and on hypercube systems [Sha89].Both the commentary on and the extensions of theConnection Machine efforts [As090, Sa188,Sta89, Sta90, Sto87] as well as the hypercube[Sha89], DAP [Pog87, Pog88] and Transputer[Cri90] efforts have all addressed the notion ofdata organization in the search and retrievalscheme employed.

Stone [Sto87] demonstrated analytically that,by indexing keywords, a uniprocessor systemwith comparable memory to that of theConnection Machine employed in the Stanfill andKahle effort [Sta86] can achieve similar userretrieval response times to those times reported in[Sta86]. Via keyword indexing, the volume ofdata that had to be searched was reduced,significantly reducing the total 1/0 processingtime. A parallel index-based retrieval effort onthe Connection Machine was later reported inStanfill, Thau, and Waltz [Sta89] and in Asokan,Ranka, and Frieder [As090]. Additional paralleltext retrieval search methods are described inSalton and Buckley [Sa188].

Various descriptions of efforts that focus onthe organization of data for the DAP system,appear in the literature. In Pogue and Willett[Pog87], an approach using text signatures isproposed and evaluated using three documentdatabases comprising of roughly 11000, 17000,and 27000 documents. Pogue, Rasmussen, andWillett [Pog88], describe several CIU steringtechniques using the DAP. Both reports clearlydemonstrate that if a proper document mappingonto the individual Processing Elements (PEs) isestablished, the DAP system readily achieves ahigh search rate. However, an improper mappingresults in a poor search rate stemming from theinability of the DAP system to access thedocuments.

Cringean, et. al., [Cri90] describe earlyefforts aimed at developing a processor-pc~olbased multicomputer system for informationretrieval. The physical testbed hardware consi:stsof an Inmos Transputer network. To reduce thevolume of data accessed and hence the total queryprocessing times, a two phase retrieval algorithmis proposed. The initial phase acts as a filter toretrieve all potentially relevant documents. 13yusing text signatures, the majority of the ncm-relevant documents are eliminated from furtherprocessing. This filtering of documents vastly

231

reduces the volume of data processed in thecompute intensive second phase. (Some non-relevant documents are selected as a consequenceof false-drops. False-drops are common to alltext signature analysis approaches.) In the secondphase, full text search is performed. The twophase algorithm is yet another example thatemphasizes the need for intelligence in theorganization and retrieval of documents.

Finally, Sharma [Sha89] describes ahypercube-based information retrieval effort. Theresults presented are based on the timingequations provided. To reduce the volume of dataread, Sharma relies on the fact that the documentsare initially partitioned into clusters, and onlydocuments that belong to “relevant” clusters areretrieved. As in our approach, Sharma does notaddress nor is dependent on any particularclustering technique. Sharma does require,however, that the cluster scheme employed yieldnon-hierachical clusters, whereas we do notimpose such a restriction. Thus, all the clusteringschemes, including the numerous schemesdescribed in Willett [Wi188] can also be employedin our scheme. Sharma partitions the clustersacross the individual nodes according to anarchitectural topology-independent, best-fitheuristic. No evaluation of the documentdistribution algorithm is provided.

We also rely on clustering, but use a geneticalgorithm approach that uses information aboutthe underlying architecture to map the documentsonto the nodes. As the actual dataset used in the[Sha89] evaluation is not described, it is notpossible to directly compare the results of ouralgorithm to the algorithm described in [Sha89].For tutorials on clustering and other informationretrieval related topics, the reader is referred to[Bra90, Sa183, Wi188].

The remainder of the paper is organized asfollows. A proof that MDAP is NP-Complete issketched in Section 2. Section 3 comprises of adescription of a proposed Genetic Algorithm forMDAP. Results from an experimental evaluationof our approach are illustrated in Section 4. Weconclude in Section 5.

2.0 MDAP is NP-CompleteAn instance of MDAP consists of a

homogeneous distributed memory architecturewith n nodes, Xi, O < i < n - 1, and partitions ofthe documents Di, O < i < d -1, called clusters, C.

Each cluster Ci, O < i < c -1, represents a list of

all the documents associated with it. Typically, rd~ c < d/constant [Sha89]. The distance between a

pair of nodes is defined as the cost of sending a

packet from node i to node j and is represented bythe internode communication cost matrix, Mij, o ~i,j

2. To show the NP-Completeness of MDAP, weuse the NP-Complete, Binary QuadraticAssignment Problem [BQAP] defined inGarey and Johnson [Gar79].

Binary Quadratic Assignment Problem [BQAP]:

Instance:Non-negative costs:

bij~ {0,1}, bij=bij; 1

ALG ORITHMLInitialization Phase;

1. Create a permutation matrix, Pij (O S i S

p-l: O

c 0ss0 ver Phasq;. While maintaining a copy of the lowest-

cost permutation, say P’t, randomly pairup the rows in P. If p is odd, ignore theunpaired row. Randomly generate twointeger values, i and j, such that Os i < j <d -1. For each pair of rows in P, say Aand B, position-wise exchange Ai, Ai+l,Ai+2, .. .. Aj-1, Aj, with Bi, ~i+l, Bi+2,.... .Bj-l, Bj, respectively within the twostrings. Replace the highest costpermutation with Pt. The replacement ofthe resulting highest cost permutation byP’t guarantees the survival of the “most-fit” parents. For example, if i = 3, j = 4,A = Pl, and B = P2, mapping string A tostring B exchanges the 2 and 5 and the 1and 3 in row B while mapping string B tostring A swaps the 5 and 2 and 3 and 1 inrow A. In this example, PO is theminimum-cost permutation, The resultingP is

I 0123450 102534

P.1 421530

2 305214

Mutat ion Phase:8. Mutate the permutations periodically to

prevent premature loss of importantnotions [G0189]. Randomly choose anumber from the interval [0, 1]. If thenumber falls outside the interval [0, q],where q is the probability of mutation,then terminate the mutation step.Otherwise, select a random number, t,that designates the number of mutationsthat occur in the given step. For each oftiterations, select three random integervalues i,j, k,suchthat OSi

,, “- “m, M+ “W, !61.I* MC,, ,,,- M., ,”,, M- k’,,. ,- Mesh, x,hi

+ M., ,,,- Mcd,4r4M

F,g 1A. All arch,!e.mm, wmh m mm dccunm! dmnb”zion F,g

These parameters include the size of thepopulation (number of permutations) and theprobability of mutation. Five population sizesranging from 10 to 50 permutations in incrementsof ten permutations were examined. The

population sizes investigated were kept small tocoincide with the size of the database modelled(64 documents). Future studies will involvemodels comprising of more realistically-sizeddatabases and larger population samples. Theeffects of both incorporating and not incorporatingthe mutation phase were also investigated.

Figures 1 through 5 illustrate results for a 64document database distributed over 16 nodesystems of varying interconnection topologies.The results for two different document to clusterpartitioning are presented. Both documentpartitions employ 8 clusters but the distribution ofdocuments to clusters is varied. That is, in thefirst distribution (figures 1(a), 2(a), 3(a), 4(a) and5(a)), all clusters contain an equal (8) number ofdocuments. In the second distribution, (figures1(b), 2(b), 3(b), 4(b) and 5(b)), 25 percent of theclusters contain 50 percent of the number ofdocuments. For notational convenience, wedescribe a document partitioning by a four-tuple

(D, C, x, y), where D is the number documents,C is the number of clusters, and x and y representthe x percentage of clusters containing y percentof the documents. Therefore, (64, 8, 25, 50)refers to the latter document distribution, while theeven document partitioning is represented by (64,8, x, x), for all values of x, O e x S 100.

Figures 1(a) and (b) present the results for allthe architectures considered. A point on anycurve represents an iteration in which a betterallocation was derived. As shown, the number ofpoints varies with the architecture considered. Allruns terminated at either a point in which the entirepopulation (document allocations), in this case 30,were identical or after 1000 iterations (prematuretermination), which ever came first. A point at1000 indicates that premature terminationoccurred. As expected, the higher the

1,,,.,,”.

lB. All archi!m.res with a (d4, 8, 25, 50) dkmbucion

,Cn

.

L-“-,,-Ma, ,,,,* !4,,, ,8

. - M.., ,,

70

69

m

:Wo ,00 ,0, ,00 .,, ,00

[,,,,,,,,.

o JO )00 150 200 MO 300 350 400

Fig. 2A W,,houc mum,,.” o“ m .“,” dumb”(mnFig 2B, W,thout mum”.. on a (d-l, 8, 25, 50) dhmbuuon

“~ 80,

L*,+-“w,, .

+ M., ,.,6M

-’ M4,2x8M75 * M.,41, M

63

G=~ ,,

3u

.,

15

1!

70

L

-- Hmr ,6!4- Mm,h, ,,, hi

* M.iT, x#M60 + M,*. .,.

50

40

30

20

1,,,.,,””

Fw3A.WILh rnu!at>onon m .“,. dsmbutmn 1,,,.,,..F,g. 3B. WiIII mutw.an m 9 (64, 8, 25, 50) d,smbu”on

m0 ,0, *OO 300 k, ,00 ,00 700 8,, 90, ,000, ,0,

I,,r,tim

.

.

.

,,b$,~

0 ,00 ,0, ,0” 4,, ,0. 600 7,, ,00 900,000

FIg 4A. Effms of mucum. witk a. eve. dktribumn F!& 4B. E((m, of rmm,,o” w,,h a (6+ 8, 25. 50) d,$r,but>on

communication diameter of the architecture, themore significant was the improvement in thederived allocation from the initial randomdocument distribution.

Figures 2(a) and (b) and 3(a) and (b) moreclearly illustrate the behavior of the proposedalgorithm in the case where no genetic mutationsare possible and when a 0.5 probability ofmutation exists, respectively. A 0.5 mutationimplies that with a probability of 0.5, a randomnumber of pairs ranging from 1 to 10, will beexchanged. That is on average, 2.75 pairs will beexchanged per iteration. Figures 4(a) and (b)illustrate the effects of mutation on the allocationsderived for a hypercube and a 4-by-4 meshsystems. As seen, and in all runs performed,mutation results in better allocations. The betterallocations result from the prevention of localminima interference.

236

,5

70

65

m

3,

m

4s

40

33

xl

.

.~oPop”,., ,., Sk

Fig. 5A Ef(wcs O(Fop.lmon me .sIng cmeven dimbu!m

8,m

,,

\

- Mcs! ,4x4

+’ Mc$h4x4M70

63

w

~ 55

:s4

:

Weare presently continuing to evaluate ourproposed algorithm. Future experimentation willconsist of models representing larger documentdatabases, documents with non-uniform accessfrequency, and correspondingly larger populationspaces. A detailed study of the effects of mutationon the derived allocations will be conducted.

We are also presently developing a hypercube-based multiprocessor information retrievalsystem. Once developed, we will evaluate theactual response time difference resulting fromvarious standard document allocation schemes,e.g., round-robin, hashed, etc., and thoseallocations derived by the algorithm described inthis paper.

References[As090] Asokan, N., S. Ranka, and O. Frieder,

“A Parallel Free-text Search System withIndexing, ” Proceedings of the IEEEInternational Conference on ParallelArchitectures and Databases, pp 519-521,March 1990.

[Bok81] Bokhari, S. H., “On the MappingProblem,” IEEE Transactions on Computers,30(3), pp 207-214, March 1981.

[B0188] Bollinger, S. W. and S. F. Midkiff,“Processor and Link Assignment inMulticomputers Using Simulated Annealing,”Proceedings of the 1988 InternationalConference on Parallel Processing, pp 1-7,August 1988.

[Bra90] Brajnik, G., G. Guida, and C. Tasso,“User Modeling in Expert Man - MachineInterfaces: A Case Study in IntelligentInformation Retrieval,” IEEE Transactions onSystems, Man, and Cybernetics, 20(1 ), pp166-185, January/February 1990.

[Cri90] Cringean, J. K., R. England, G. A.Manson, and P. Willett, “Parallel TextSearching in Serial Files Using a ProcessorFarm,” Proceedings of the 1990 ACM SIGIR,pp 413-428, September 1990.

[DeJ89] De Jong, K. A. and W. M. Spears,“Using Genetic Algorithms to Solve NP-Complete Problems,” Proceedings of the ThirdInternational Conference on GeneticAlgorithms, pp 124-132, June 1989.

~u88] Du, X. and F. J. Maryanski, “DataAllocation in a Dynamically ReconfigurableEnvironment,” Proceeding of the IEEE FourthInternational Conference on Data Engineering,pp 74-81, February 1988.

[Gar79] Garey, M. R. and D. S. Johnson,Computers and Intractability: A Guide to theTheorv of NP-Com~leteness< Freeman andCo., New York, 1979.

[G0189] Goldberg, D. E. Genetic Al~orithms inSearch O~timization and Machine Learning.Addison Wesley, New York, 1989.

[Gre85] Grefenstette, J., Proceeding of anInternational Conference on eneticAlgorithms and Their Atmlications, LawrenceErlbaum Associates Hinsdale, NJ, 1985.

[Gre87] Grefenstette, J., Genetic Ahzorithms andTheir Atmlications: Proceeding of the SecondInternational onference on GeneticAl~orithms and Their Aur)lications, LawrenceErlbaum Associates Hinsdale, NJ, 1987.

[Has81] Haskin, R. L., “Special-PurposeProcessors for Text Retrieval,” DatabaseEngineering 4, pp 16-29, September 1981.

[H0187] Holland, J. H., “Genetic Algorithmsand Classifier Systems: Foundations andFuture Directions,” Genetic Algorithms andTheir Applications: Proceedings of the SecondInternational Conference on GeneticAlgorithms and Their Applications, pp 82-89,June 1987.

[Lee87] Lee, S.-Y. and J. K. Aggarwal, “AMapping Strategy for Parallel Processing,”ZEEE Transactions on Computers, 36(4), pp433-442, April 87.

[Pog87] Pogue, C. A. and P. Willett, “Use ofText Signatures for Document Retrieval in aHighly Parallel Environment, ” ParallelComputing, Elsevier (North-Holland), vol. 4,pp 259-268, June 1987.

[Pog88] Pogue, C. A., E. M. Rasmussen, andP. Willett, “Searching and Clustering ofDatabases Using the ICL Distributed ArrayProcessor,” Parallel Computing, Elsevier(North-Holland), vol. 8, pp 399-407, October1988.

[Rag87] Raghavan, V. V. and B. Agarwal,“Optimal Determination of User-OrientedClusters: An Application for the ReproductivePlan, ” Genetic Algorithms and TheirApplications: Proceedings of the Second

International Conference on GeneticAlgorithms and Their Applications, pp 241-246, June 1987.

[Sa188] Salton, G. and C. Buckley, “ParallelText Search Methods,” Communications of theACM, 31(2), pp 202-215, February 1988.

rSa1831 Salton, G. and M. J. McGill,.Introduction to Modern Information Retrieval:McGraw Hill, New York, 1983.

238

[Sha89] Sharma, R., “A Generic Machine for

Parallel Information Retrieval,” InformationProcessing and Management, Pergamon Press,25(3), pp 223-235, 1989.

[Sie91] Siegelmann, H. and O. Frieder, “TheAllocation of Documents in MultiprocessorInformation Retrieval Systems: An Applicationof Genetic Algorithms, ” Proceedings of the1991 IEEE International Conference onSystems, Man, and Cybernetics,Charlottesville, Virginia, October 1991.

[Sta86] Stanfill, C. and B. Kahle, “ParallelFree-text Search on the Connection MachineSystem,” Communications of the ACM,29(12), pp 1229-1239, December 1986.

[Sta90] Stanfill, C., “Partitioned Posting Files:A Parallel Inverted File Structure forInformation Retrieval,” Proceedings of the1990 ACM SIGIR, pp 413-428, September1990.

[Sta89] Stanfill, C., R. Thau, and D. Waltz, “AParallel Indexed Algorithm for InformationRetrieval,” Proceedings of the 1989 ACMSIGIR, pp 88-97, June 1989.

[Sto87] Stone, H. S., “Parallel Querying ofLarge Databases: A Case Study,” IEEEComputer, 20 (10), pp 11-21, October 1987.

[Wi188] Willett, P., “Recent Trends in HierarchicDocument Clustering: A Critical Review,”Information Processing and Management,Pergamon Press, 24(5), pp 577-597, 1988.

239

On the Allocation of Documents in Multiprocessor ...binds.cs.umass.edu › papers ›...

Documents

Transcript of On the Allocation of Documents in Multiprocessor ...binds.cs.umass.edu › papers ›...