DTIC FILE COPY - Defense Technical Information · PDF fileUNCLASSIFIED DTIC FILE COPY...
Transcript of DTIC FILE COPY - Defense Technical Information · PDF fileUNCLASSIFIED DTIC FILE COPY...
UNCLASSIFIED DTIC FILE COPYSF-.URIT"
Form Approved
NTATION PAGE OMB No 0704.0188Exp Date Jun30, 1986
a. REPno° AD-A223 479 1b. RESTRICTIVE MARKINGS
2a. SECt , , 3. DISTRIBUTION/AVAILABILITY OF REPORT
Nnnp Approved for public 'release; distribution2b. DECLASSIFICATION/DOWNGRAII i S,'!I I, is unlimited.
None i4 PERFORMING ORGANIZATION JUMBE n S. MONITORING ORGANIZATION RdPTABER(S)
/$OSR.Tft 906a, NAME OF PERFORMING ORGANIZATION 6b. OFFICE SYMBOL 7a. NAME OF MONITORING ORGANIZATIONUniversity of Southern (If applicable) Department of the Air ForceCalifornia Air Force Office of Scientific Research
6c. ADDRESS (City, State, and ZIPCode) 7b. ADDRESS(City, State, and ZIPCode) "6V2\\.41Institute for Robotics & Intelligent Systems Bolling Air Force Base, D.C. 21, 2-6448Powell Hall Room 204Los Angeles, CA 90089-0273
Ba. NAME OF FUNDING/SPONSORING 8b. OFFICE SYMBOL 9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBERORGANIZATION Air Force Office (if applicable)
of Scientific Research Grant No. AFOSR-89-O0328c. ADDRESS (City, State, and ZIP Code) 10. SOURCE OF FUNDING NUMBERSBuilding 410 PROGRAM PROJECT TASK IWORK UNITBolling AFB, D.C. 20332-6448 ELEMENT NO. NO. NO, ACCESSION NO
S)56 A11. TITLE (Include Security Classification) -.
Parallel Architectures.'and Algorithms for Image UnderstandingFinal Thchnical Report
12. PERSONAL AUTHOR(S)R. Nevatia and V.K. Prasanna-Kumar
13a. TYPE OF REPORT 113b. TIME COVERED 14. DATE OF REPORT (Year, Month, Day) 15. PAGE COUNTFinal Technical Report FROM _I/I/88To 9/30/8g 1990, May 17 39
16. SUPPLEMENTARY NOTATION e
17. COSATI CODES 18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)FIELD GROUP SUB.GROUP/f" Parallel Architectures, Parallel Algorithms, Latin Squares,
ma e Understan ing, Array Access. Processor Time OptimaSolutions, Image Labelling, Low Level Vision.
19. ABSTRACT (Continue on reverse if ne essary and identify by block number)Several issues in using pa allel processing for image understanding are addressed. First,efficient schemes for para lel image array access are developed. New class of latin squarescalled perfect latin squar s are defined. Several construction methods are shown and someuseful properties of such s uares are identified. A generic parallel model.. of computationemploying electro optical d vices is developed. Parallel techniques for image computationsare d.eveloped on this model. Finally, processor time optimal solutions to several low andintermediate level image und rstanding tasks are designed on orthogonal access parallelarchitectures.
20. DISTRIBUTION/AVAILABILITY OF ABSTRACT 21 ABSTRACT SECURITY CLASSIFICATION
(n UNCLASSIFIED/UNLIMITED 0 SAME AS RPT. 0- DTIC USERS Unclassified22a. NAME OF RESPONSIBLE INDIVIDUAL 22b TELEPHONE (Include Area Code) 22c OFFICE SYMBOL
Abraham Waksman, Program Manager (202) 767-5025 IMDD FORM 1473, 84 MAR '83 APR edition may be used until exhausted SECURITY CLASSIFICATION OF THIS PAGE
All other editions are obsolete Uncl ass i fied
DISCLAIMER NOTICE
THIS DOCUMENT IS BEST
QUALITY AVAILABLE. THE COPY
FURNISHED TO DTIC CONTAINED
A SIGNIFICANT NUMBER OF
PAGES WHICH DO NOT
REPRODUCE LEGIBLY,
INSTITL 1IE IOR ROBOTICS ANI) IN FijI1,1GIN I 'YS I'MSI 1\*1 I%11 I (* 1 ,01 ," 11 IFTI: ( \1 I~ IN A t1
I'x vll I I 'all R moo 2i1)
1 )I - 4 1 i 377
OSRTR. , 0 06 75
May 1990
Dr. Abraham WaksmanProgram ManagerAFOSR/NMBuilding 410Boiling AFB, DC 20332-6448
Dear Dr. Waksman:
Please find enclosed six copies of our Final Technical Report for the period of November 1, 1988to September 30, 1989 on our AFOSR Grant #89-0032.
Sincerely,
,nciral Investigator
V.K. Prasanna-KumarCo-Principal Investigator
cc: Kathleen Wetherell, Contracting OfficerHarriet Vigoren, USC Contracts and Grants
encl: 6 copies of report
90 06 26 014
Parallel Architectures and Algorithms for ImageUnderstanding
AFOSR Grant # AFOSR-89-0032
Final Technical Report
R. Nevatia and V.K. Prasanna KuinarInstitute for Robotics and Intelligent Systems
University of Southern California
May 17, 1990
Accesioni For
NTIS CRA&iUric rAUiUard1,Ou1.ced 0
ByDistribtitio ij
-
O T I C
Avail jadljorDist soecial
1I
Summary of Results
This report describes the major accomplishments under the grant AFOSR-89-0032. The main focus of research was on two major areas: study of parallelarchitectures, and design of efficient parallel algorithms for image understand-ing. In the following, we summarize the research. The appendix includes copiesof publications supported under this grant.
1 Memory Systems for Image ProcessingI
Communication between memory and processors always has been a bottleneckin high performance computer systems. Low memory bandwidths can lead to asignificant performance degradation of these systems. Particularly, in real timeimage processing and vision applications, where the amount of data movementis enormous, very fast memory operations are needed. A new efficient memorysystem has been proposed for image processing and vision. In this system, LatinSquares are adopted as the skewing scheme. A new type of Latin squares, per-fect Latin squares with special properties are introduced for imagi-.processing.A simple construction method for perfect Latin squares is also sh6wn.'The re-sulting memory system provides access to various subsets of image data (rows,columns, diagonals, main subsouares etc.) without memory conflict. The mem-ory modules are fully utilized for most frequently used subsets of image data.The address generation can be performed in constant time. This memory systemis the first known system that achieves constant time access to rows, columns,diagonals and subarrays using minimum number of memory modules [5].
2 Electro-optical Architectures
Arrays of processors with optical interconnection networks for fine grain imageprocessing are studied. It has been shown that the unit time interconnectionmedium made possible with the use of free space optics results in efficient so-lutions to communication intensive image computations. Using these models,O(logN) pointer based algorithms for several tasks in low and medium levelvision are presented. We also presented a generic soldtion that can be usedto simulate PRAM algorithm for image computation on the proposed electro-optical arrays [4].
2
3 Orthogonal-access Parallel Organizations
We have investigated a class of orthogonal-access parallel organizations for ap-plications in image and vision analysis. These architectures consist of a mas-sive memory and reduced number of processors which access the shared mem-ory. The memory can be envisaged as an array of memory modules in thek-dimensional space, with each row of modules-along a certain dimension con-nected to one bus. Each processor has access to one bus along each dimension.It is shown that these organizations are communication-efficient and can provideprocessor-time optimal solutions to a wide class of image and vision problems.In the two dimensional case, tlhe basic organization has n processors and n x nmemory arr-', which can hold n x n image, and it provides 0(n) time'solution toseveral imat computations including: histogram, histogiam equalization, com-puting connected components, convexity problems, and pomputing distances,etc. [6].
3
References
[1] H.M. Alnuweiri, and V.K. Prasanna Kumar, Fast Image Labeling Using Lo-cal Operators on Mesh-Connected Computers, Proceedings of InternationalConference on Parallel Processing (ICPP), 1989.
[2] H.M. Alnuweiri, Communication efficient Parallel Architectures and Algo-rithms for Image Computations, Ph.D. thesis, August 1989, USC.
[3] M. Mary Eshaghian Parallel Computing Using Optical InterconnectionNetworks, Ph.D. thesis, Decelber 1988, USC.
[4] M. Mary Eshaghian and V.K. Prasanna Kumar, Fine Grain Image Com-putations on Electro-optical Arrays, Proceedings of IEEE conference onComputer Vision and Pattern Recognition (CVPR), 1989.
[5] Kichul Kim and V.K. Prasanna Kumar, Parallel Memory Systems for Im-age Processing, ProceeAings of IEEE conference on Computer Visitun andPattern Recognition ((U, ,rPR), 1989.
[6] V.K. Prasanna-Kumar, Parallel Image Computations on Reduced VLSI Ar-rays, Indo-US Workshop on high speed signal processing, 1989.
4
Appears in Proceedings of the IEEE ComputerVision and Pattern Recognition Conference,1989.
Parallel Memory Systems for Image Processing i
KIichul Kiut mid V. K. Irasamia Kuniar
)Dvartllwiet orf IBlectl'ieal EI*gilleerilig-,S'ysten.
i~s
University of Southerni ,aliforina
Los Aigeles, CA )0089-0781(2!3).7,13:.523(;
The effective )an(Iwidt h of t helse nie|||ory systems not onlyAbstract depceilds on the number of modules and the speed of oper-
ation but also depends on the occurreuice of mnemory con-A novel memory system is proposed for image process. flicts, which occur when more than one memory request ising. Latin squares, which are well known combinatorial given to the same memory module during a memory cy-objects for centuries, are used as the skew function of the cle. Hence, the memory system should provide conflict freememory system. We introduce a new Latin square with access to as many subsets Qf interest as possible.desired properties for image array access. The resulting Many rese.rchers Jiave addressed the issue of design-memory system provides access to various subsets of im- ing memory systems which provide efficient access to rows,age data(rows, columns, diagonals, main subsquares etc.) columns, diagohals or subarraysl,2,9,11,12,16,17,18,19].without memory conflict. The memory modules are fully Budnik and Kuck introduced linear skewing scheme andutilized for most frequently used subsets of image data. prime memory system which allows conflict free access toThe address generation can be performed in constant time. rows, columns, diagonals and N1/2 x N112 subarrays of anThis memory system is the first known memory system that N x N array, in which the number of memory modules is aachieves constant time access to rows, columns, diagonals prime number greater'than N12,91. However, their methodand subarrays using minimum iumber of memory modules. needs modulo operation ok.a prime number in address gen-
eration, which can not be performed in constant time withI Introduction a reasonable amount of c}jcuitry. Lawrieill generalized
the linear skewing schieme'and proposed an array storageWith recent advances in VLSI and parallel processing, large scheme with a simple address generation. His scheme pro-Wither opreavcesrs in VbS atparllelt proeig arge vides conflict free access to rows, columns, diagonals andnumber of processors can be iptercosnected to yield high N" 2 x NO/ subarrays of an N x N array using 2N memoryperformance systems for image processing and vision. ost modules. lie also introduced the omega network to perform
tations but also process enormous a ount of data. Low data routing aid data alignment. However, his scheme suf-
effective bandwidth of the memory system can lead to sig- fers from low utilization of nemory modules. Only half of
nificant performance (legradaio|n. Inm order to fully utilize the memory modules are used in a ncmnory fetch. Men-
the processing power available in parallel systems, special ory organizations with linear skewing schemes suffer either
attention should be given to memory organiizations as well low memory utilization or difficulty in address generation.Another way of achieving ellicieit access to various subsetsas mapping of image arrays, of interest is by using nonlinear skewing schiemcsll,3,12,161.
In image processing, an imnage can be relnleselited by a
two dimensional array of size ,I x N. It is well understood Lee[12] presented a nonlinear skewing sclenme(called scram-bled storage scheme) and a new network which allows con-that access to row vectors, cohuim vectors, diagomials amid fitfe cest os ounsur lcsails
subarrays are required heavily in iniage I)rocessing{10,17]. flict free access to rows, columns. square blocks an(! lis-
A memory system for inage piocessing should )rovi(le ef- tributed blocks. The scrambled storage scheme does notiieedl modulo operations for a(lrcss generation. However,ficent access to rows, coluums, (liagonmls amid subarrays. it can inot plrOVide elliciiut access to diagomialds.
Also, while implement iig partilioed VI.SI arrays and spe- i vcn t pgi a h ,t as i tiatnl im
cial purpose arrays, access to various ubatkrays is ineedled. lven thgh a hot or wok has been done in the con-
'lodlay's comiputters use immilt ' tumcmiory modules eit her text of array storage sclimies for ge,,eral purl)ose parallel
in interleaved meimorv forimi or in, IalM nIniillory foric. Coniputers, not inuh has been l done with re.pect to accesspatterns needed for image pro cessinmg and visioni. \oorhis
'This research was supliorid In pail tit ' : ms..,.,I el R-1iil- atid Morriil 171 pto(.(sed a linar skewing schelme and andation under grant tItI-S71iOs:$t antd iiit l, Al-(i1It widr griant address getirat ion circuit mr*v for image prcesiig ini whichAFOSR-89-032. (illy olle imemtory umillihle 6 i, u.sed I duting a inemiiory
654CH27524/89/0000/0654$01.00 (c 1989 IEEE
itltidlo mI't .ii til with .1ati'' ~'i'ij inot .1 power (of I oli't.,~ 611 111 ~ie avie for athit riry data; avcst'c,'n
li ilii't oll Ii I'. veat'55 I 0 ''ttjioltt I ofd tIt a. itiiry schemtie call lbe pride~tcd fotr act'ts to t the siilin.tswei pt ipom. to ose Lat iti stqitare; which are well knouwni cook of ititet s'it lehiltitg ill hligh effective liatiuwi(lt It.in tat 'i ia! iiijet' I ftr ct ent iirit'sj'l j. We intro ducte a tiew Ilkiittin 1ge gir ces',iotg. row verfors, t'iili t iVetoris. sit il
La;i militat(dltted pel~fect Lat itt sqiare) whtichl has p l -iat taysatid ufi-igtitals(uir dliagontals of suliari'ays) are hevavily,'rtjes itsefit for itt ag' 'ess i anit Vil i. We' uLIst show l*lm lit' h importain't' of row access is ohbviouts ciii si dtriiiI %it itje cotnstrutio mttnethod! for' perfect ILat in sqitat es. U.". thiat itai ilt. ip its anitd out puts *,re usually itt tow mtajor
ig peifet' La tin squtiarit as t he skewing schet ne, various sub. ord er( left toi right , loip to hottotm). '[here are also t t aitsets of ini age data a ir accessible withlouit memtiory con flict a Igori t Ii is whli ch nteed row accesses like one iitntsionialwhile inc tmtry tmiod~ule s are fully uttil ized for most intecrest- con voluition alIgori thmis. D~esirability3 for colutin access isitig subsets of imnage data. Furthermiore, it is possible to justifiedt 1'y thle commonly used 90 degree rotation of inmageshow that t'e address generationi call be donie in cotnstant and by adgorit lims which nteed column accesses like the twottime'. tdi tiut'siotual L7TT algorithms. Subarrays play ati itlpor.
lin the tiext sectioti. we tdescribe memory access require- tant role in two dimnsionial (emplate matching amid othermnetts for image processing. lit section 3, we introduce p~er. algorit huts. P'artit ionuing is all essenitial step ii imap~pingfect Latini squares. We also show a construction method for algorithiuis to fixed size VLI arraysjl31. Access to sub).perfect Lat in squares. lIn section 4, we describe how perfect arrays of imiage data is n~eeded in such implemetntationis.Latin squares call lbe used for storinig and retrieving imiage D~iagonials or (!iagotils of subarrays are also useful in mna.d!at a. Sectitin 5i conicludes the paper. t rix comnputat ion which is anl indispensable tool in image
processinlo)10.Frot the above argutnqts, it is obvious that a nmetmory
2 Preliminaries system for imuage processing should provide efficient, hope.
A meorymodle an e caraccried y te aoun of fully cotnflict free, access to row vectors, column vectors,A nimor moulecan e caraterzed y te aoun of subarrays, and diagonals for ipiage processing.
datat it can stcre and thc cycle time which is the fiii In% this paper, we restrictour attenution to the casesmnum timue betweent two consecutive read or write oper- where the tnumiber of mietorx modules is an even poweratiotis. Clearly, the cycle time determinies bandwidth of of t wo. rhe desirability of ha'iiug puniber of memory mod-the memory system. To ov'ercomne the linmitatiotn imposed uetobanvnpwrofwoiclrinersfaddress
by ccletim, inihile emor moule areuse It t0 getneration, utilization of address space and interconnectionday's computers. One way of using nmultiple memory mod- niet work.tiles is by o~rganiizinig them as p~arallel memory banks. itthis inemory system, multiple mnemory baniks are accessedat the same time, therefore, the effective memory band- 3 Latin Squares for M emory Ac-with is multiplied by the number of memory baniks. This cessscheme is commonly usedl in SIMD or MIMD computers.Anmot her way of using multiple memory modules is by orga- A\ JLfn tquotrc of ordler it is anl it x n square compiloseditizitig t hem as initerleaved memories which are commuonly ut itlIt symibols from 0 to nt - I such that no symibol appearsemuployedl in pipelitied vector computers. lin this memory mnore than onice in a row or in a column[4j. The rows aresystetn, memory access mnechanismu is div~ided into several nmbeilrt'd fronm 0 to t - 1, top to bottomn. The columins arestages and used itt a lpipelined mnattier. Cray X-Mll uses also numbercd front 0 to it - 1, left to right. The square Aa 32-way' interleaved tmemory system. Some memiory 53's' shown below is alt example of Latin square of order 'I.tets employ hot h parallel memory batiks atid ititerleavctlMlemnories. lit such a system, each parallel memory batik is0123cot tposcil of inmte rleaved memories. 1I 2 3 0
E'veni if tmult illt imemor' mnodutle's are use1 d to intc rease2301ie( eff'ec'tivye band with of thle tmetnory syst etm, mem' tory 3011
'otficlt t'ati greatly retdtce the effective batdidtth. 'or A dingiinsi Ihulini sqtwt'c of order ii is a Latini stquare stichatit extit 't te t's atit td, if all that a eci ien ts to be accessetd mt ' that not symtblol appears inore thant oitc' in any of its twositi'd itt thelit' s ti nttiiry mtodutle, tilie e' ft'ttivi' bauwtii itl I Ii i i diagotials. A sitmtple conistructio mu thod is kntownits s i tilt' a" m .ii on~t tly onte inint ory tittdutll' i rrespect ivt' of G- r di agotsial Latint squares of order im whtenti is aI power'
of m wo'l1 ( i'rgelv l wed a getn't'al it'l sto to .'ottstrnit t
655
diag a o , s v.a .of "-my t " 17 ..'1 .,l1are1 , I 3.lI it ii , n3 1.. I I..,, 1,, b' ..ri I,. fb,11hy m if. 3331 l.,tolinsilwn hihelow is an exaph. ih'f ,i .l I.at in -.,I llar o ICif l.r m/11111l .1 o ort iA Il I of oid( r it. thcn ther( t tists itI. 313103doub li .j, If.,'lhohi,.ml L.am mit, rt ( of o 1/ , r 33i..
3 " 1 Proof. 1: lom ilie diiily m.lf(t hlogt al Latin squat es .S:1 ' 2 'and B i, we c.1onl aI mlimic (i f .rl1'r in willi tit. f,.-I (1 :} ; J I(wilig rule:
Iwo Latin squ ares C I I ) of , .i dei Ii ,ire (i whoyona Itoeach other if I lite wt of or(d ered pairs V,, u ' . , i ,, . .. , < ij " 12 -CD {(c, ,d,.,), 0 -. t.j ".i I} Is equal ;o t lie .etof all possible ordered Ipairs i,) ), (I -. i,j it - 1. The where [i/li) rel)resetits tite largest ittleger not exceeding i/nsquares C and D shown below are (rthogonal to each or. and i mod n represents the nonnegalive remainder of i/n.ther. rie construction of C can be viewed as follows:
C = 1 2 0 2 1. Construct a -. tiare of order inn using tin Bi's. Let12 0B,. be the square I.whose position is i + 1 from top
A self.orthogonal Latin square is a Latin square orthogonal and j + I from left.
to its transpose. We define a doubly sclf.orlhoyonal Latin 2. Add is x a, to all i ncbers of ,,,.square as a Latin square orthogonal to its transpose and1to its antitranspose. Notice that a doubly self.orthogonal An exanl)le is shown below for the case m = it = 22.Latin square is also a diagonal Latin square. The square B 0 3shown above is also an example of doubly self.orthogonal 2 3Latin square. We define a subsquarc S,., of a Latin square 0of order n2 as an it x n square whose top left cell has the 3 2 1 0coordinate (ij). When i -- 0 mod 13 and j = 0 mod it, the 1 0 3 2
subsquare S.j. is called a mailn..subsquarc. Subsquare S0, 0 1 2 3 4 5 6 ,7 8 9 10 11 12 13 14 .
of the square B is [ 2 1 and main subsquare S.20 of the 2 3 0 1 6 7 4 '5 10 11 8 9 14 1512 13r 3 0 3 2 1 0 7 6 5 4 I1 10 9 8 15 14 13 12
1 3 2 5 4 7 , 9 8 11 10 13 12 15 143$ 20 11 12 13 14 15 0, 1 2 3 4 5 6 7square B is We define a pcrfcct Latin squarc of 10 II 8 9 1.4 15 12 13 2 3 0 1 6 7 4 5I 0 II 10 9 8 15 II 13 12 3 2 1 0 7 6 5 4
order n2 as a diagonal Latin square of order n2 such that 9 R It 10 13 12 15 1.1 1 0 3 2 5 ,4 7 6no symbol appears more titan once in any main subsquare. 12 13 14 15 8 9 10 I .1 5 6 7 0 1 2 3
14 15 12 13 10 11 8 9 6 7 4 5 2 3 0 1Hence, in a perfect Latin square, no symbol appears more I114 13 12 II t0 9 8 7 6 5 4 3 2 I 0than once in aniy row, in any column, in any diagonal or 13 12 15 14 9 8 10 It 5 4 7 6 I 0 3 2in any main subsquare. Tile square E shown below is a 4 5 6 7 0 I 2 3 12 13 14 15 8 9 10 11perfect Latin square of order 9. 6 7 4 5 2 3 0 1 14 15 12 13 8 9
7 6 5 4 3 2 1 0 15 14 13 12 9 80 3 6 1 1 7 2 "5 $ 5 4 7 6 1 0 3 2 1 3 1215 14 9 11 102 5 8 0 3 6 1 4 7 C1 4 7 2 5 8 0 3 6
7 From the construction, it is clear that no symbol appears
E 5 8 l 3 6 01.1 7 1 more than once in any row or in any column. Suppose4 7 1 5 8 2 3 6 0 C is not orthogonal to its transpose. rhere should exist6 0 3 7 1 1 8 "2 i,j,k atd 1, i I k or j .z? 1, such that c,, ck, an(8 2 5 6 0 3 7 1 , C,., =0c:. Sce 0_<n,,q< i-1, 0_p,q < -1, and7 1 .1 8 2 5 i 0 3 O b,.,<it- 1,0< r,s < l. fromc,., =c.1, we have
In the rest of this section, we presetit a construction "i,/,l,, n a },.ltn (Imethod for perfect Latin squares of order ,,2 for the casei 2 is, k (1.2,3 ... ). The it rested l(ader can refer " .. .... I n = b , ,,,t. ,dn (2)
to [81 for the construction of perfect I,atin mluares of order From c., - ClI. we haven2 where it is o d d( r it -2in', 1 ( (2.3. 4.... }.m is odd.
For the construction of perfect Ilatill sqiares. we start 'l~l,,I.f)/. n) ,l,,l (p)wil] the following h',luiia. , . ,,, . , (.1)
656
.1111~~~~~ ~ ~ ~ ~ ~ i.1 IitIIwasliII)ItIII I ha 4 I II It 8 I BI are t2lYI ? 3 1 N 6 7 9 11 1 2 1 3
Latlill squle . Elt It 1 9 141 1!, 12 13t 2 .1 0 1 1) 7 -1 I
No~w. we are ready fot tit( followinig lemlnni. I !. I13 10 1 t 1.9 6 7 4 5 2 31 0 1C 7 4 , 2 3 0I 1 11 1 . 12 13 10I 11 $ 9I
Leina ;.2 lPor till k k- (2, 3,4l,... lhcrr exists a dlublt 3 2. 1 II 14F C. 5 I1 It 11I) 9 8 15 -14 13 12
MVlfort1loqonail Latin square Wk of oridcr 24. F. 1t1 1: Is t1 10i 9~ S:i6t 2 1 0 ~CI I " 2 1 0 1S 111 13 12 11 1II 9 8
lProof: We Ilse indtlion tolOil k. i If J 2 -5- 4 4 I10 13-T iV2 It. 1Blases: F-r k =.2 and k -%3, tlite following squares are 9 8 It t0 13 11 15 141 1 0 3 2 5 4 7 6
dloub~ly self-ortliogonal latin squares of order 22 and 2 3. 1) 13 12 15 14 9 8 1 0 II1 5 -1 7 6 1 0 3 2il 5 * 7 6 1 0) 3 2 13 2 IS 1 9 8 11 10
2 3 0 1 7 6 5 4 Front tile const rucetion, it is clear that, in tiw squnare Pit, no0 1 2 3 5 4 7 6 symblol applears m~ore than once in any row, in any column
012 36 7 4 5 3 2 1 0 or in any main subsquare. Suppose P"k is not orthogonal
L2 230 1 ' 45 6 7 1 0 3 2 to its transp~ose. There should-be i,in and n, i 0 rnor 0t 6uc 74a 5~ it' and pj' =P",. Fromn the
2j 32 56 ~' construtmn, p,0 5 i~j :5 2" - 1 can be expressed as
7lo 6 5~ 4~ 6 0 0 1 2k =ic .)k~d I~=pk
1ypthesis: T1here exists a doubly self-orthogonial Latin we have 1 ,: '-1 rn '
Square D4 of order 2k. dk2lmo2 dk(5lInductionl Step: We construct a (doubly self-ortllogotial i/kjodh7lm/2'l.,n.d2&
Latin square D"" of order 2 '+2, using D 2 and Dk, Whose d md&tI& - dk,~&ri& (6)existence is proven in the previous lernina. a Frot p,* = we av
Notice that D 2 ill the above lemmia is also a pcrfcctkLatin square. Now, we presenlt the miain thleorem for thle dt.,I2 kj,imod2" d(6/2&p,.m.d2- (7)construction of perfect .tin squares. dirno2 k1 i 2 41 = (8
Theorem 3.1 For all ki E {1,2,3,.. .), thcrcexcists a per-. F)-S contradict the assumption that D' is a doubly self-fccd Latin square P"k of order 2 2k, which is (a1s0 a doubly orthogonal Latin square. Hence Plk is orthlogonlal to its.clf-orthoyonal Latin square. tranlspose. The samne argument can be applied to p2k and
Proof. Mwn k =1, 1" = D2 When k : 2, fromn the its anti transpose, Hence, p2k is a doubly self.-orthlogon al
doubly self-ortliogontal Latin square D A of order 2k, we con Latin square and 110 symbol appears more than once in any
s... dt a Latinl square C"k of order 2 - suchl tltat of its two mail (diagonals. Hence, plk is a perfect Latin
square of order 2 ". a
Notice that the conistruiction in litod for C2' is the samne 4 Parallel Access to Image A-rrayats the construction method appeared iti lelnlna 3.1. From [in this section, we sho0w in detail how to use perfect Latin(,2k, we construct perfect Latin square P"k of oretler 2 2' sq'uares in soigand rereigituageda. ealosw
sctht 2k *2k thle p~roperties of plerfect Latin squares built, fromt our con-lo, L2& Ytrmd2'+i/2Ily) strlictioll nlctilod which~ are useful for iniage (data storage.
Notic tht 2~ s htalel ro h)2k Uy row CXClIuIage Suppose we hlave two dimnsional inmage (1at 2kA, )operation, suchI I lhat the irst k 111(1st sigifiicanlt bits of i nrc whlere .I 11 and N are miultiples of 2 2k and we also have 2 kswappled with Ii le k- least signlificanlt bits of i, i.e., lite rows
are ilufled I ineS As ii ealiliewe bild perect illory Modules. Thieni, we canl use Lte perfect Latin
La.t in squlare of order 2'. Si ice lthe square A int Iciinna 3. 1 s~ae,2 stleseigshle 'ihneee eis Ilie doiil ly self-oil logoiial Lati n squIIare W), iii' tilnt ri x lI z. j)will be stored in tile IIinvilry modu1 le 1,& md.)C it, h-iuina :,.1I is exactly lte iilt'rilediate ina~trix C' we Figuie I shows ;it e'xamlple for the case Ul . 1'2,i N 16wa ial. We gvl a pe-rfect L a till sq Iire P' by I liv row vxci angt aniid k - 1, i.e., we arc uising -1 iiiei ory ill(11 ties to store
41i14-at iol I'll C( . imtage data of size 1*2 - 16.
657
which .siiit' Iint lIIIv itnt1IIIItIIIII 111-1.4 lof ntcintt0rv VI'll
0I 1 :t 2 0 1 :1 (1 II32( i sIw'wnii~n nii ~i ni'mih is ~~sci2 :1 1 2 :1 1 0 2 :1 1 0I 2 :
1 2 :1 1 0 2 3 1 21 :1 : I () 21 :t Lld illati 4.2 V\ tymthio! itppt ls a P n arft ti t Iti -t hina t i it, p
:1 2 0 1 31 2 0 1 3 2 0I 1 : '2 (1 fity crilosquintS,o 3 2 I I 2 II I 3 2 0I 1 3 2
2 3 1 0 2 31 1 11 2 1 1 (I 2 :3 1 () Leit ItIIM -1.2 is -I di rectI tcsulI of leitlI tit.1e~ lw~;tis everv
1 0t 2 31 1 0 2 31 1 0 2 : 1 0I 2 , sil ',(ttate Cal be partitione~d into tw a htt such thlat euach3 2 0 1 3t 2 0I I 3 2 0I 1 31 2 (1 1 (of tietni hclottg to I subsqttare onl ; strip (of with 2'.0 1 . 2 0I 1 3 2 0 1 31 2 0I 1 3 2 A nothler stubset of po,;sii)le in terest in imitage processing2 3 1 0I 2 :1 1 0I 2 :t 1 0I 2 1 1 (0 is tine sutbiset of eem ents whIose cootrdlinates are t-ie sal te1 0I 2 3 1 0I 2 3 1 0I 2 :1 1 0l 2 :1 wit Itin .each nuain stnbsquare. Th'lis su bset can be used[ ill31 2 0 I 31 2 0 1 3 2 0 1 1 -2 t0 1I algori thmis wic it ied a valute witini eachtn iti suibarray.
Let a1 sali position SI',j of a p~erfect Lat in square .A of
Figure 1: Mcniory assigntmient of 12 x 16 array tinto -1 mei ordler n2* be defined its follows,ory mtodul. SP., j I k Isi mod1 nt,1 Pj itmod t), 0 S i~j -1 .
Now, we have t h6 following lenitim which shows tio tnemiorycontflict occurs when a same ponsitiont is accessedi.
lin figure 1, it is easy to see that every I x 4I rows attd Lemma 4.3 No synibol appcnnrs ,,tre thman ontcr in anty
every 4 x 1 columns are accessible without mennotry con- .in position $P,flict. 2 x 2 mtaini subarrays are also accessible withoutmemory conflict. Data ott tlnc diagonals of -1 x '1 perfect 5 Conclusion*Latin squares are also accessible withot n enory conflict.These properties come directly from the definitioni of per- A new efficienut memory systcntn for image processitg atidfect Latini squares. However, figure I (Ices not fully demion- vision is described in thtis payer. Latin squares have beenstrate the properties of the perfect Latin squares built frotm adaptcd as thle skewing scIhen'4. 'Rerfect Lat in squares whichour construction method, because we are using, as the skew. tire Latin squares with special p~roperties uisefutl for imageing scheme, P2 whtich is not built front otir method. The processitng have ben introdutcedi. A sinmple constructionperfect Latin squares built from our construction met hod method for perfect Latitn squares is also shown. Uising per-(p'*k & > 2) have other properties which are very useful in fect Lautini squares as tine skewing scheme, inaity interestinigimage processing. It is easy to verify thme following property subsets(rows, columins, diagonals, mainu subarravs etc.) ofsatisfied by perfect Latin square of order 22Pr, k >: 2, built innuage data can be accessed withIout ituory confhlict.from theorem 3.1. lii this inenuory system, it is p)ossible to show t hat thle
Lemna4. Nosybo apcas noe tanorcc n sb. add~ress genterationi is v'ery easy for bztli muemory tmodu tleLem m S i hta t 4. =o sy b 0apa mo - ti an 0 nc i i n 2'. address antd local address wit hint muemuory mtoduiles. Mein-
squar S,, suc tha i 0mod 2 Om-j 0 tto~ 2k ry mtodlule address cant lbe geiierat ed in conist ant lttle withI
The above lemmna shows that not onlty Itle niit sl)- at simle logic circuitry andc local address generat in oessquresareaccssilewitoutineorycoflit bt aso v. not nieedl anty additionail hardware.
er usquares re tcesibe weih o i horzonalitritt o width'The efficient skewing sclenie along with lthle fast ;ddrecss
cry sar se a resi one ithou v ec lory corficotta. ti s o it generation circuitry nm akes it 1)0. ble tha~t rows. columtns.
Ik ar eacsy s e wthot mhe r o ntheict (ai s u r iagoinals, samne p~ositionts, and snbarrays caii be accessedIt i eas tosee hat her doe no exit a ati sqire , constanit time. This inenmiry systeimt achieves Ithis goal
such that no symbol appears more thn ontce in any sub. ingunni umeofeiry odlssquare, which means there is ito skewitng schtemte which uii niintniuie fimioymtdhs
gurnte cnlitfreacesto rows, coliumtns anid N1I2 X TRa I shows several1 kniownt itnemory sclmeittes alotig-NO1 subarrayswe emr ues~v ar Rie os i ith Ifile sc met ie proposed inn Ithis paperI. Compjarisonts arie
~ whn N uenoru'inodb.- areusedto tore hbased ott storinrg N x S dhitai. A\ i addreiss generttnt is
coiistlrct to lbe hard if it reqiuires iiotiuilo uoperatiituits of aintinl er wchich is inot, ;I power (If t wo,.
658
[ leslt' 1 ~of sig~,t ~ 1 .dlr~s II'.Y.10 ''1. I'SI Arity I'11 ~.I t ,. I.111.11 14-%'tiit' supported [ftiti j generat it'll 9~
- N diagonlal. linear hlard II. I). II. l,a.wrie. ".\tcv, and Alig titn.io of li)ta in :-,isuharray .,A "rtay I: v ,.o./ir." 1 "'ll.i.. ( o i , . ,,I .. (- 24,.row, coltut . I'P. I 1l;-I 1,5,. 197:'1,
fil N digottal, liiW44i easyI\iharraa -12j I). Ive. "Scra,,iled Storage fil I'irallel Mvi'ty Sv.-
'-'- --'v -- -1 row, columi linear I hardI i'thIl. Symp. Com, cr ..Irchit (clii, pp. 232-
sularray m2:). 19SS.
row, '131 .1. X. Navarro, .1. M. I laberia atd hi. Valeto, "Parti.121 I' column, nonlinear easy It ing: an essential step it) allappi g algortil s into
blocks I systolic array processors," IEEE (oiputcr, pp. 77-89,row, column, July 1987.
This N diagonal, nonlinear easypaper suharray '141 J. W. Park, "An Ellicient Memory Systemt for Image
same position_ n Processing," IEEE Trans. Conmput., vol. C-35, pp. 669-
Table 1: Comparison of memory systems (Blocks include 6.4, 1986.
square blocks and distributed blocks. Square blocks are [151 A. Rosenfeld and A. Kak. Digital Imnage iProces.king,same as main subsquares and distributed blocks are same Academic Press, 1976,.as same positions discussed in this paper.) i!6j II. 1). Shapiro, "Theoretical Linitations on the Effi-
cient Use of Parallel Memories," IEEE Trans. Cont.put.. vol. C-27. pp. .121-428, 1978.
References i171 D. C. V. Voorhis and T. It. Morrin, "Memory Systemsfor Image Proccssing,"'IEEE Trains. CompuL., vol. C-
III M. Balakrishnan, R. J'ain and C. S. Raghavendra, "On 27, pp. 113-125, 1978..Array Storage For Conflict-Free Memory Access For 1181 If. A. G. Wijshoff and J. V. Lecuwen, "The StructureParallel Processors," Proc. Intl. Con Parallel Pro. of Periodic Storage SchefAies for Parallel Memories,"cessing, vol. 1, pp. 103-107, 1988. IEEE Trans. Comput., vol. C-34, pp. 501-505, 1985.
12) P. Budnik and D. J. Kuck, "The Organization and Use 1191 If. A. G. Wijshoff and-J ". 'V. Leruwen, "On Lin-of Parallel Memories," IEEE Trans. Coinpul., vol. C- ear Skewing Schemes and d-Ordered Vectors," IEEE20, pp. 1566-1509, 1971. Trans. 'omput., vol. C-36, pp. 233-239, 1987.
131 A. Deb. "Conflict-free Access of Arrays-A counter Ex-ample,' Inf. Proc. Letters, vol. 10, No. 1, pp. 20, 1980.
141 J. Denes and A. D. Keedwell, Latin Squarcs and TheirApplications, Academic Press, New York, 1974.
151 E. Gergely, "A Simple Method for Constructing Dou-bly Diagonalized Latin Squares," J. CombinatorialTheory,, A 16, pp. 266-272, 1974.
161 A. lledayat, "A Complete Solution to the Existenceand Nonexistence of Knut Vik )esigns and OrthogonalKnut Vik Designs," J. Combinatorial Theory, A 22,pp. 331-337, 1977.
171 K. Ilwang and F. A. Briggs, Computcr Architecture4111( Parallel Prorcessing, McGraw-llill, 1984.
181 K. Kim and V. K. Irasanna Kumar, "Perfect LatinSquares and Parallel Array Access," Techn|ical ReportIRIS -.-239, Utiversity of Southern California, 1988.
{9( I). .1. Kuck, "II IAC IV Software a,,d ApplicationProgramming." IEEE "litrans. Coinjut., vol. C-17, pp.T5$-770, 1968.
659
7Y'-- -7
Appears in Proceedings of the' IEEEComputer Vision and Pattern RecognitionConference, 1989.
Fine Grain iImagc Computations oil Electro-opta, cal Arrays*
M. Mary Eshaghian and V. K. Prasanna-IKumar)epartment of Electrical Engineering-Systenis
University of Southern CaliforniaLos Angeles, CA 90089-0781
Abstract electronic interconnects is not possible unless one al-lows unbounded fan-in and or fan-out for each of the
In this paper, we consider a set of array proces- processors. In fact , a lower bound on the time tosors with optical interconnection networks for fine simulate one step of a PRAM on any bounded de-grain image processing. The unit time interconnec- gree network of N nodes using electrical intercon-tion medium made possible with the use of free space nects is fl(log N). Even though several networks haveoptics, results in efficient solutions to communication been designed to eflicienfly simulate PRAM (7], manyintensive image computations. Using these models, cost/performance issues limit their potential applica-we present a summary of our results in designing tion in high perTormance parallel systems.O(Iog N) pointer based algorithms 'or several tasks in Recently, there has been an emerging interest inlow and medium level vision. We also show a generic using optical communications in parallel processing.subroutine that can be used to simulate PRAM al- The replacement of tle electrical interconnects withgorithms for image computations on the proposed optical beams has a significant impact on the per-electro-optical arrays. . formance of parallel computing systems implemented
using VLSI technology [4 10]. This is due to the fol-lowing two important properties of free space optics.
1 Introduction First, free space opticalbimals can cross each otherwithout any interference. Also, the connections need
During past decade, several mesh based VLSI archi- not be fixed and can be redirected 13]. This impliestectures have been considered for fine grain image that using optical interconnects, one can design areacomputations [13, 12, 1, 14]. The undirlying inter- efficient bounded degree VLSI architectures that canconnection medium of these designs has a significant simulate a unit delay interconnection network.impact on the performance of the desired parallel al- In this paper, we consider a class of parallel archi-gorithms. For example, solving many image process- tectures which use free space uptical beams as meansing problems on a N x N array of processors takes of interprocessor communications. The organizationas much as O(N) time (121. This can be reduced to of these networks is based on i. proposed generic par-a logarithmic time complexity for a limited class of allel model of computation. This model called OMC,problems when organizations such as pyramids, mesh allows unit delay interconnects and can efficientlyof trees, or reconfigurable mneslhes is used [13, 14, 11]. simulate PRAM algorithms. We introduce possible
The Parallel Random Access Machine (PRAM) physical realizations of this model. Each of the pro-is an abstract shared memory model which has been posed designs reflects the capabilities and limitationsemployed in designing parallel algorithms for compu- of the device technology used for the redirection oftational geometry and computer vision, among oth- optical beams. Using these models, we present effi-ers. The major drawback of this model is in its un- cient pointer based algorithms for finding geometricrealistic unit time communication assumption. The properties of digitized images, and parallel implemen-unit time simulation of a N processor PRAM using tation of iterative methods used in high level vision
*This research was supported in part by the National Sci- processing. We will also present a simple O(log N)ence Foundation under grant Iltl.871083 and by AFOSR un- algorithm for simulating each step of an N proces-der grant AFOSR-g9-0032 sor PRAM on these models using O(N/ log N) pro-
(66CH2752-4/89/0000/0666$01.00 (cc 1989 IEEE
cessors. This can be used its a generic subroutine connected to it central control unit. The above defi-in translating PRAM alg6rithmis for image cornputa- nition is supplemented with a set of assumptions fortions. .I an accurate analysis in 161.
The rest of the paper is organized as follows. Using those assuniptions, the following spaceIn the next section, we discuss an optical model of time trade-off can be shown:computation, and present three physical implemen-tations of the model. In section 3, we study a generic AT = fQ(I)subroutine for simulating PRAM algorithms. A setof parallel algorithms for image computations using where A is the area occupied by the processing layer,electro-optical arrays is shown in section 4. and the time (T) required to solve a problem, is the
number of cycles necessary to exchange the minimumrequirqd information (I). When compared with three
2 Electro Optical Arrays dimensional VLSI model of computation the followingproposition can be stated:
In the first part of this section, we define an Optical
Model of Computation (OMC) which is an abstrac-tion of currently implementable optical and electro- three dimensional VLSI organization having N pro-
optical computers. Similar to the VLSI model of cessors with degree d, !in 'time T, and volume V cancomputation [18], this generic model can be used to be performed on OMC in volume v, and time t, whereunderstand the limits on computational efficiency in dT/N < t _ T, and Nd<v.using optical technology. In the second part of thissection, we present possible physical realizations of In the following section three different parallelthe optical interconnection network defined by OMc. architectures are presented as possible efficient upperEach of the proposed designs reflects the capabilities bounds for v.and limitations of the device technology used for theredirection of optical beams. 2.2 Physical realizations
2.1 A Generic Model In this section, we present' class of optical intercon-nection networks as a real ztiion of the OMC pre-0MG is defined as follows: sented in the previous section. Each of the proposed
Definition 1 An optical model of computation repre designs uses a different optical device technology forsents a network of N processors each associated with redirection of the optical beams to establish a newa memory module, and a deflecting unit capable of es- topology at any clock cycle, and represents an uppertablishing direct optical connection to another proces- bound on the volume requirement of OMC.sor. The interprocessor communication is performedsatisfying the following rules similar to [2]: 2.2.1 Optical Mesh Using Mirrors
1. At any time a processor can send at most onemessage. Its destination is another processor. In this design, there are N processors on the process-
2. The message will succeed in reaching the proces- ;ng layer of area N. Similarly, the deflecting layer hassor if it is the only message with that processor area N and holds N mirrors. These layers are alignedas its destination, at that time step. so that each of the mirrors is located directly above its
associated processor. Each processor has two lasers.3. All messages succeed or fail (and thus are dis- One of these is directed up towards the arithmetic
carded) in unit time. unit of the mirror and the other is directed towards
the mirror's surface. A connection phase would con-To insure that every processor knows when its sist of two cycles. In the first cycle, each processor
message succeeds we assume that the OMC is run sends the address of its desired destination processorin two phases. In the first phase, read/write mes- to the arithmetic unit of its associated mirror usingsages are sent, and in the second, values are returned its dedicated laser. The arithmetic unit of the mirrorto successful readers and acknowledgements are re- computes a rotation degree such that both the ori-turned to successful writers. We assume that the op- gin and destination processors have equal angle witheration mode is synchronous, and all processors are the line perpendicular to the surface of the mirror in
667
tie plane formed by the mirror, the source proces- tlhe destination processor. 'Thie ncousto-optic devicesor, and the destination processor. Once the angle is then redirects the incident be-ti from the source tocomputed, the mirror is rotated to point to the de- the destination processor. One of the advantages ofs~red destination. In the second cycle, connection is this architecture over the previous design is its orderestablished by the laser bean carrying the data from of micro-seconds reconfiguration tinue, which is domt-the source to the mirror and from the mirror being inated by the speed of sound waves. The other ad-reflected towards the destination. Since the connec- vantage is its broadcasting capability, which is due totion is done through a mechanical movement of the the .possibility of generating multiple waves throughmirror, with the current technology this leads to an a crystal at a given time.order of milli-seconds reconfiguration time. Thereforethis architecture is suitable for applications where theinterconnection topology does not have to be changedfrequently. 2.2.3 Electro Optical Crossbar
The space requirement of this architecture is0(N) under the following assumption. Each mirror is This design uses a hybrid reconfiguration techniqueattached to a simple electro-mechanical device which for interconnecting processors. There are N proces-takes one unit of space and can rotate to any posi- sors each located in a distinct row and column of thetion in one unit of time. With current technological N x N processing layer. For each processor, thereadvancements, the above assumption may not be the is a hologram mqdule having N units, such that themost accurate one but we do not see any technological ith unit has a grating plate with a frequency leadinglimitations preventing us from making our assump- to a deflection angle corresponding to the processortions. In fact, our assumptions are as valid as those in located at the grid point (i, i). In addition, each unitVLSI; the constant propagation '-1ly assumption re- has a simple controller, and a laser beam. To es-gardless of wire's length. Other a-..mptions can also tablish or reconfigure to a new connection pattern,be made based on the following arguments. Many each processor broadcasts.the address of the desiredmirrors have a reconfiguration delay proportional to destination processor to the controller of each of Ntheir rotation angle, O(N). More complex mirrors on units of its hologram m9dUle using an electrical bus.the other hand, can rotate faster for a larger angle The controller activates a-isbt (for conversion of the(unit time rotation delay ) but their size can grow electrical input to optical signal), if its ID matchesproportional to the number of angles they can realize the broadcast address of the destination processor.(0(N)). The connection is made when the laser beams are
passed through the predefined gratings. Therefore,2.2.2 Electra-optical Linear Array since the grating angles are predefined, the reconfig-
uration time of this design is bounded by the laserswitching time which is in the order of nano-seconds
In this organization, N processors are arranged to using Gallium Arsenide (GaAs) technology (9].form a one-dimensional processing layer and the cor-responding acousto-optics devices are similarly lo- This architecture is faster than the previous de-cated on a one-dimensional deflecting layer. The size signs and further it compares well with the clock cycleof each of the acousto-optic devices is proportional to of the current supercomputers. One of the advan-the size of the processing array, leading to an O(N 2 ) tages of this simple design is in its implementabilityarea deflection layer. Similar to the design using the in VLSI, using GaAs technology. Unlike the previousmirrors, every processor has two lasers, and each con- designs, this can be fabricated with very low cost andnection phase is made of two cycles. In the first cy- is highly suitable for applications where full connec-cle, each processor sends the address of its desired tivity is required. In such applications, the processordestination processor to the arithmetic unit of its as- layer area can be fully utilized by placing N opticalsociated acousto-optic unit using its dedicated laser beam receivers in each of the vacant areas to simul-bcam. The acousto-optic cell's arithmetic unit corn- taneously interconnect with all the other processors.putes the frequency of the wave to be applied to the This design can be easily adopted to implement a neu-crystal for redirection of the incnuing optical beam to ral network of processes with optical interco-anccts [6].
6N
3 Mapping and Simulation of message to a random,,ly chosen intermediate destinta-Parallel AlgOTithms tion, id, front there, to its true destination. On
OMC such a technique would lead to queues of size
An OMC with N processors can simulate, in real O0lg N). In 181, instead of choosing random paths
time, aun Ixclusive Read Exclusive Write (BREW) for nicssges to traverse, their algorithm repeatedly
PRAM flaving P processors and M memory loca- attempts to deliver a randontly choseI subset of the
tions, where N= maximum [P, M}. On the other messages. A by-product of their strategy is that their
hand, a P processor EREW PRAM can simulate in algorithm requires no intermediate buffering of ines-
real time the computations of a P processor OMC. sages, and hence works under the general situation
Thus, the interesting cases are when the memory is where each processor can send and receive polynomi-
"large". In this section, we present a simple efficient ally many messages. In [151, the only use of random-
algorithm for simulation of a N processor EREW ization is in selecting a hash function to distribute the
PRAM using an OMC with N/log N processors. This shared address space of the PRAM onto the nodes
algorithm can be used as a subroutine for simulating of the butterfly. (Note that the' direct application
and mapping PRAM algorithms for image processing of the techniques of (15) to our problem will lead to
onto OMC. 0(log2 N) running time, and O(log2 N) local mem-
The input is assumed to be of the following ory requirement for each processor.) Similarly, we
form. Every processor is responsible for routing the only use randomization in assuming a random distri-
O(logN) messages that reside in its local memory. bution for input data. 'his random distribution can
Each message has a destination tag attached to it. also be obtained in selecting a hash function to dis-
The destination address is made up of two compo- tribute the shared address space of the PRAM onto
nents z, y, where x denotes the address of the mem- an OMC with NI log N processors, each having log N
ory module, z <N I/log N , and y denotes its position local memory. The routing itself is deterministic as
within the module, y < log N. explained in the proof of Lemma 1, and has worst
In the first step of the algorithm, all elements in case running time of O(log2 N).each PE are sorted based upon the position to whichthey will write within the memory module, with du- Theorem 1 The probability that the simulation ofplicate positions being sent to a second queue. Next, one step of an N processorwERoJW PRAM using an
each of the elements is sent out one by one. After this, O(log Nl log N) time is given by e oe N for
the duplicates from the last iteration are brought for- some constant g independent of N.
ward and the process is repeated. This algorithm
works in O(log2 N) time by running through each In the above, as opposed to the original algo-O(log N) time iteration a total of log N times, to in- rithm, we check each of the iteration for the numbersure that all elements are transferred to the correct of elements that are left to be sent. If there are none,memory locations. This leads to: the algorithm exits else the program will run again,
Lemma 1 One step of an N processor EREW but will only iterate according to the number of ele-
PRAM can be simulated on an OMC with N/log N ments left to be sent. This leads to O(log N log log N)
processors in 0(hog2 N) time. Further, there ezists running time with a high probability. For more de-
an input sequence for which this is the best possible tails see [6].
bound.
The O(log N) stages of a slightly modified ver- 4 Fine Grain Image Comput-sion of the algorithm in the proof of Lemma 1 [6], ingleads to a randomized algorithm with a running timeof O(logN loglogN) with high probability, and a The algorithms described in the last section, can beworst case time complexity of O(log2 N). Our rout- used as a subroutine in mapping and simulating anying algorithn differs from others in the literature in PRAM algorithm for image processing onto the pro-the way randomization is used. Unlike the algorithms posed models. In this section, we present a summaryof [201 and (191, it does not randomize with respect of our results in designing algorithms directly on theseto paths taken by messages. For example, Valiant's models for fine grain image computing. For full de-classic scheme for routing on a hypercube sends each tails see [6].
669
V
4.1 Low and Medium Level Vision based approaches to scene labeling can be viewed asProcessing nit iterative inprovenent process. In such problems,
the underlying graph is usually sparse. BUt this spar-An early step in image processing is identifying figures sity is not regular. Effli.icnt parallel implententationsin the image. Figures correspond to connected l's in of such relaxation methods is possible with OMC.the image. An N x N digitized picture may contain Any iterative matrix structure can be realizedmore than one connected region of black pixels. The by .MC using devices such as holograms (or thoseproblem is to identify to which figure (label) each "1 described previously). Although the reconfigurationbelongs to. time for the holograms can be in the order of sec-
Lemma 2 Given a N x N 0/1 image, all figures can onds it only has to be set once during the prepro-
be labeled in O(iog N) time using an (N/log1 r/2 N x cessing phtse. The structure of the coefficient matrixelaeledn lN) i i rlis used, to define the holographic connections. The
Nir/log1 1 N) - OMC , interconnection pattern remains the same through-
Convexity plays an important role in image pro- out the computation. An optimal O(logm) time can
cessing and in vision; many other problems can be be achieved by this design.and the number of proces-solved once the convex hull of figures is obtained. sors depends only on the number of non-zero elementsIt has applications to normalizing pattern in image in the matrix [6]. This nfethod is attractive whenprocessing, obtaining triangulations of sets of points, many computations ari to be performed in which thetopological feature extraction, shape decompositionin pattern recognition, testing for linear separability, structure of the coefficient matrix is fixed. It is welletc. [13]. uited for implementation of many iterative methods
such as Gauss-Jordan, Gauss-Siedel and the Conju-Lemma 3 Given a N x N 0/1 image, the eztreme gate method [5].points of all the figures can be enumerated in O(log N) Consider the iterative methodtime using an (NI logt/2 N x N/log1 2 N)-OMC.
X+1 M Zk + gAnother interesting problem is to identify and to
compute the distance to a nearest figure to each figure where M is sparse and nonsingular. Let ni be thein a digitized image. In the following, we use the li number of nonzero elemenr.iq .the ith row of M, andmetric. However, it can be modified to operate for ibt i1,J2,...,j,, be the coluinns'corresponding to theseany l, metric. elements. Thus, the above equation can be rewritten
asLemma 4 Given a N x N 0/1 image, the nearestfigure to all figures can be computed in O(log N) time _k+= +gusing an (N/log1 / 2 N x N/log1 12 N)-OMC. X = E Mij, X), + g.
Template matching is another basic operation in Suppose there are n processors and each of the pro-image processing and computer vision [16, 17]. Tem-plate matching can be described as comparing a tem- cessors stores exactly one nonzero element in matrixplate (window) with all possible windows of the im- M. Then the following can be shown:N e. Each position of the image will store the resultf the window operation for which it is the top-left Lemma 5 The OMC can be used to solve each iter-
corner. Note that the computation can be done in ation of the iterative solution to any general sparseO(N x k2 ) time using a uniprocessor. Using N/log N linear systems with m variables and n nonzero ele-
processors, the above computation can be done in ments in O(logm) time.O(k 2 logN) time.
5 Conclusion4.2 Parallel Implementation of Itera-
tive methods for higher level vi- We studied a set of electro-optical arrays for effi-
sion processing cient image computations. Using these, we showedO(log N) algorithms for determining several proper-
Solutions to many problems in image understanding ties of images. Furthermore, we introduced a genericcan be posed in terms of iterative improvement to an subroutine that can be used to map and simulate theinitial configuration. For example, discrete relaxation existing PRAM algorithms on these models.
67(0
References [11] It. Miller, V. K. Prasanna- Kuniar, 1). Rcisis, mindQ. F. Stout. Data movement operations and
[1] II. Alnuweiri and V:,K. Prasanna-Kuntar. Opti- applications on reconfigurable VLSI arrays. lireal image computations on VLSI architectures Proc. of IEEE International Conference on Par.with reduced hardware. In Proc. of IEIE Work- allel Processing, August 1988.shop on Computer Architecture for Pattern andMachine Intelligence, 1987. [12] It. Miller and Q. F. Stout. Parallel geometric
algorithms for digitized pictures on mesh con-12] R. Anderson and G. L. Miller. Optimal paral- nected computers. In IEEE transactions on Pat-
lel algorithms for list ranking. Technical report, tern Analysis and Machine Intelligence, MarchDept. of Computer Science, University of South- 1985.ern California, 1987. [13] R: Miller and Q. F. Stout. Efficient parallel con-
[3] L. A. Bergman, W. H. Wu, A. R. Johnston, vex hull algorithms. In IEEE transactions on
R. Nixon, C. C. Esener, S. C. Guest, P. Yu, T. J. Computers, December 1988.
Drabik, M. Feldman, and S. H. Lee. Holographic [14] V. K. Prasanna-Kumar and M. M. Eshaghian.optical interconnects for VLSI. Optical Engineer. Parallel geometric algorithm for digitized pic-ing, October 1986. tures on mesh of trees. In Proc. of IEEE Interna.
[4] B. D. Clymer and J. W. Goodman. Optical clock tional Conference on Parallel Processing, 1986.
distribution to silicon chips. Optical Engineering, [15] A. G. Ranade. How to emulate shared memory.October 1986. In Proc. of Annual Symposium on Foundations
of Computer Sciernce, 1987.[5] D.J. Evans, Ed. Sparsity and its applications.
Cambridge University Press, Cambridge, Lon- [16] A. Rosenfeld and -A. Kak. Digital Picture Pro-don, 1985. cessing. Academic Press, 2nd edition, 1982. 2
Volumes.[6] M. M. Eshaghian. Parallel Computing with Op-
tical Interconnects. PhD thesis, Dept. of Coin- [17] L. J. Siegel, H. J. Si gel,;and A. Feather. Par-puter Engineering, University of Southern Cali- allel processing approaches to image correlation.fornia, 1988. IEEE Trans. on Computers, C-31, March 1982.
[7] T. Y. Feng. A survey of interconnection net- [18] C. D. Thompson. A Complezity Theory for
works. IEEE Computer magazine, December VLSI. PhD thesis, Dept. of Computer Science,
1981. Carnegie Mellon University, 1980.
[8] R. I. Greenberg and C. E. Leiserson. Random- [19] E. Upfal. Efficient schemes for parallel commu-
ized routing on fat-trees. In Proc. of the 26th nication. JACM, 31(3):507-517, July 1984.
Annual Symposium on Foundations of Computer [20] L. G. Valiant and G. J. Brebner. UniversalScience, pages 241-249, October 1985. schemes for parallel computations. In Proc.
[9] H. Y. H. Ito, N. Komagata and H. Inaba. New of ACM Symposium on Theory of Computing,
structure of laser diode and light emitting diode pages 263-277, 1981.
based on coaxial transverse junction. Technicalreport, Research Institute of Electrical Commu-nication, Tohoku University, Sendai 980, Japan,1988.
[10] M. K. Kilcoyne, S. Beccue, K. D. Pedrotti,R. Asatourian, and R. Anderson. Opto-electronic integrated circuits for high speed sig-nal processing. Optical Enginefring, October1986.
671
To appear in InternationalConference on, ,Pattern Recognition(ICPR),, June 1990.
Optiml Image Algorithms on anOrthogonally-Connected Memory Architecture
Hussein M. Alnuweiri and V. K.'Prasanna KurnarSAL-344, Department of EE-Systerns
University of Southern CaliforniaLos Angeles, CA 90089-0781
September 25, 1989
Abstract
We present processor-time optimal parallel algorithms for several iinge and vision prob-lems on a novel architecture which combines an orthogonally accessed memory with lineararray structure. The organization has p processors and a memory of size O(n 2 ) locations.The number of processors p can vary over the range [1, n3 /2 ] while providing optimal speedupfor several Droblems in image processing and vision. Such problems include labeling con-uecLed compuneuLs al conputLig the convex huall. diamcter, smallest enc'osing r.et:-t. .iPgand nearest neighbors of each region. Histograruning and computing the Hough Transformare also considered. Such problems arise in medium-level vision and require global oper-ations on image pixels. For these problems, it is shown that the proposed organization issuperior to the mesh and pyramid organizations.
1'rhis research was supported in part by the National Science Foundation tinder grant [I-8710836 andin part by AFOSIt under grant AFSOIL-89-0032.
Contents
1 Itroductiou 1
2 Architecture 22.1 A lteview of the IRMOT Organization .......................... 32.2 The Mesh-Connected lModules (MCM) Organization ................ 4
3 Main Results 6
4 Parallel Vision Algorithms on the MCM 74.1 Histogranming on the MCM ....... .......................... 84.2 Line Detection by the Hough Transform ......................... 104.3 Computing inage Connected Components ....... . . . ...... 114.4 Convexity of Multiple Figures .................. .......... .144.5 Computing All Closest Neighbors .............................. 17
5 Concluding Remarks 20
List of Figures
I Organization of an RMOT (Basic Module) with 4 PEs . . .. ....... 32 Augmenting an RMOT row with more PEs 4.................43 The Mesh-Connected Modules Organization ....................... 54 Concentrating data into a smaller submesh ..... .................. 135 Partitioning convex hull ....... ............................. 15A (!Inq t. ipivh hn co tutati . .. .. . . . .. .. . .. .. ... 18
List of Tables
1 Performance comparison of an MCM with pyramid and mesh computers . . 62 Problems that can be solved in O(m2 /q) time on an BM with q PEs . . .. 7
2
If. M. Alnuweiri
1 Introduction
Several parallel techniques for image conputation have been inipleinented on distributed
models of computation. In particular, mesh-connected computers have been widely used for
image processing since the nearest-neighbor interconnection among the processors preserves
the spatial relation among image pixels while maintaining a low hardware implementation
cost [7, 1I, 31]. Mesh-connected arrays can efficiently handle low level-image processing
applications in which only local operations are performed'on image pixels [6], or compu-
tationally intensive tasks (such as image template matching) for whichksuitable mappings
can result in optimal speedup on the array. However, a wide class o'f problems in vision
and image analysis involves the computation of geometric properties of image regions. Such
problems require global or dense data movement operations on the image. The time needed
to solve such problems on an nx n mesh is lower bounded by n (proportional to the diameter
of the mesh) even if the problem considered is not computationally intensive. For example,
a wide class of problems on n x n images can be solved sequentially in 0(n 2) time, but take
0(n) time on an n x n mesh (i.e. the speedup is not optimal). The mesh-diameter problem
can be alleviated by using broadcast buses [24], a reconfigurable bus'fl5t, or a hierarchy
of processors and buses such as in the Pyramid Computer (PC) [10, 19], and the Mesh of
Trees (MOT) organization [14, 22, 23]. However, even with such enhancements, the number
,. L PtU-.,oaULb U '.&d, A., A.. O & L .t *.i ..%J A, : ...... L
Furthermore, algorithms which require dense data movement requirements (such as image
rotation and sorting pixels) can not be performed efficiently on these organizations.
In this paper, we present a mesh-connected array with special memory-access and coin-
munication capabilities to provide processor-time optimal algorithms to several image prob-
lems which involve global computations on image pixels. The problems we consider include
histogramning, Hough Transform, component labeling, computing convexity and related
properties, and computing nearest neighbors. Although such problems are not comipu-
tationally intensive (except for computing the lHough Transform, they all can be solved
sequentially in 0(n 2 ) time for an n x n image), they incur a high communication overhead
when implemented on a parallel distributed-memory computer. However, for such prob-
lems, divide-and-conquer and data-reduction techniques can be used to reduce the number
of computations and the conmunication overhead.
L.
H. M. Alnuweiri
The proposed array consists of a memory array of size 0(n 2) locations (onto which an
i X n image can be mapped) and p Processing Bietnents (Pls), where p can vary over tile
range I to it2. The PEs are interconnected in a ne.sh fashion and have special access to the
menmory. The organization is called Mesh-Connected Modules (MCM) because it can be
looked upon as a mesh of basic-modules, where each basic-module is a parallel organization
in which the PEs have row and column access to a suharray of the memory. The MCM
supports global data operations efficiently and is thus, highly suited to the class of image
computations considered here. One major distinguishing feature of the MCM from other
mesh-based parallel organizations is that communication among processors is done via a
shared memory as well as by interprocessor links. Most of our algorithms take O(n2/p) time,
using p processors, where p is in the range [1, n3/2]. As compared to the performance of a
pyramid computer for such problems [19], the MCM performance is clearly superior. For
example, the problems of computing the connected components of an image and computing
all closest figures take O(n 1/2) time, while computing the convexity of multiple figures takes
O(n 1/2 logn) time, on a pyramid computer with 0(n 2 ) PEs. The MCM can solve all these
problems in O(nl/2) time using n3/2 PEs only.
The rest of the paper is organized as follows. Section 2 introduces the proposed MCMorganization; section 3 presents a summary of results and comparisons to other organiza-
tions; and section 4 presents processor-time optimal parallel algorithms for several image, I
Fi UUii.ILS.
2 Architecture
The proposed organization is based on a class of orthogonal memory-access architectures
which have been studied by several authors (1, 5, 12, 27, 29]. A VLSI-based multiprocessor
architecture with orthogonal access to a memory array was first proposed by Tseng, I[wang,
and Prasanna Kumar [29], to provide optimal speed up for matrix-based computations and
sorting. Ja'.Ja' and Owens [13], have proposed a somewhat similar array for computing
the Fast Fourier Transform and sorting; also, Scherson and Sen [27] have considered the
organization for sorting. Finally, Alnuweiri and Prasanna Kumar [1, 4, 5] have proposed
efficient data movement techniques for this organization which lead to processor-time opti-
mal solutions to a wide class of image and vision problems and graph theoretic problems.
2
I'
It. M. AInuweiri
In [I], the organization has been called a Reduced Mesh of Trees (RMOT) because it can
he viewed as esh of trees organization in which the base Pls are replaced by memory
mo(dules, and each row (column) tree of lEs is rqplace(l by one PE with a row (colunn)
bus.
2.1 A Review of the RMOT Organization,
An RMOT of size in consists of in PEs each having row and colunt access to an m X m
array of memory modules such that PEi can access the modules in the ith row and the ith
column of the array. The organization of an RMOT with 4 P4 s is shown in figure 1, where
the dashed lines represent data-links whose purpose will be exjilained in the next section.
0 Row/column switch 9-U
O3 Memory module . . . .. .
0 Processor
- Memory bus
....... - ----- . ..... ... Symbol
Figure 1: Organization of an RMOT (Basic Module) with 4 PEs
The PEs have basic arithmetic/logic capabilities and all their internal registers and data
paths are 0(log n)-bit wide. Also, each memory module consists of a number of O(log n)-bit
registers each of which is considered to be one memory location within the memory module.
The distinction between memory locations and memory modules should be noted. Each PE
can read or write one unit (logn bits) of data in a memory location in its row or in its
column in 0(1) time. This organization has the advantage that the processing area and the
storage area are identified as two distinct components of the design. Further architectural
3
I. M. Alnuwciri
and operational details can Ie found in [5].
2.2 The Mesh-Connected Modules (MCM) Organization
A comput.tion on tile ItMOT can be done by decomposing it into alternating row-phases
and column-phases, where in a row-phase (column-phase) the entire computation is per-
formed on data within the same row (column) of memory. In nontrivial applications, tile
time to r..rform a computation is dominated by the time taken-hy each PE to perform the
computation on its memory row and memory column. The proposed architecture enhances
the time performance of the RMOT by augmenting it with 'nore PEs. Each PE with its
row or column bus in the RMOT is replaced by a linear array of PEs, and the memory
modules within each row or column are partitioned equally among these PEs, as shown in
figure 2. Note that now we have two types of communication links, ihierprocessor links and
memory buses. To save on the number of processors, we let each row and column array of
RMOT 0 Memory Bus
Suc E r 0 A h Eh e 66 0
Memory Module
Processor Inter-processor
Struct U.
Memory module Mm s
Figure 2: Augmenting an RMOT row with more PEs
PEs share one PE. The resulting organization is shown in figure 3 (the distinction between
interprocessor links and memory buses is not shown in this figure). It is interesting to
note that tle proposed organization consists of smaller RMOT arrays (modules) which are
connected as a mesh. A formal definition of the architecture and sonic additional notation
are given below.
Definitions and Notation
The Mesh-Connected Modules (MCM) organization consists of p PEs and O(n 2) memory
locations. The PEs operate in a synchronous mode tinder the supervision of one control
4
:0 40r 0T a i I 1a a a , 40,
I. k. Aig IW'gt
0 Processor
TO- b- -- '0 a a a 1 10 a a r3 Memory Module
a ~~ a6 a a aa b ,
Figure 3: The Mesh-Connected Modules Organization
unit. This mode of operation is usually known as SIMD (Single-Instruction Multiple Data).
The MCM is organized as a k x k mesh-connected array of basic mdduIqs (BMs), where
1 < k < n. Each BM is an RMOT organization with q PEs (PEo, ... , PEq-i) where
1 < q < n/k; and a q x q array of memory modules each of size O((n/kq)2 ). In addition,
each PE; of a BM is connected to PE; in each adjacent BM. The memory module in the
ith row and'jth column of the array is denoted by MMii. Also, a memory row is dcncted
by RM, and a memory column is denoted by CM. It is assumed that O(logn) bits can be
moved across each bus or data link in a constant amount of time. The organization of a BM
was shown in figure 1, where the dashed lines represent the interprocessor links between
each PEs and its four possible neighbor PEs.
An MCM can be specified by the three parameters n, k and q, described above, where
1 < k < it and I < q K n/k. We use the notation MCM(n, k, q) to denote an MCM having
0(n 2 ) words of memory and organized as a k x k array of BMs, each having q PEs. The
total number of PEs in the MCM is p = kVq, and the total number of memory modules is
k 2q' (i.e. a total of 0(n ' ) memory locations). Let rn = n/k, then each memory module
has O-,r) memory locations, and the memory array of each 13M has a total of O(71)
memory locations. Several l)roblems can be solved )y recursively partitioning the MOM
into sul)tueshes of lBMs, solving the subprol)em in each submesh, then merging the results.
H. M. Alnuweiri
A j-partition of tl'e MCM, I < j < k, is a partition of the MOM into k2 /j 2 subnleshes, each
consisting of an array of j x j BMs. Each such subumesh is called a j-mesh and denoted by
Mj(i), where i is the index of the submesh in the.partition. Note that a j-mesh is also an
MCM(jtn,j,q), where m = n/k. Also, by definition, a BM s a 1-mesh.
3 Main Results
It was mentioned earlier that an MCM(n, k, q) can solve several image and vision problems
in 0(n 2 /p) time using p = k2 q PEs, I < p <n 3/2 . For a large numberof vision and image
problems it can be shown that the MCM is superior to the stahdard mesh computer [16]
and the pyramid computer [191. Table 1 compares the performance of an MCM(n, V'-, %/i)
having O(n 3 / 2) PEs with that of the pyramid computer and the standard mesh-connected
computer each having O(n 2 ) PEs. Definitions and algorithmic details for these problens
are given in section 4.
2MCC Pyramid MCGM( n n /i-)
Number of processors n2 O(n 2 ) n3/ 2
Labeling figures n n 1 /2 n1/2
Convex hull:1. Set of p.xe!s r log' n/ toglogn n' /.
2. Multiple figures n ni/2 logn nl / 2
Properties:Diameter n n 1/2 logn n 1/ 2
Smallest enclosing rectangle n n 1/ 2 logn rl/ 2
Distances:
1. All-point closest neighbor n logn n 1 / 2
2. Closest figure n n 1/2 n 1 / 2
Table 1: Performance comparison of an MCM with pyramid and mesh computers
As an enhancement to the standard mesh, Miller and Stout [18] considered a variable-
diameter mesh which consists of a vF x Fp array of PEs, each having a local memory of
size n 2 /p locations. They have shown that the pro)lems listed in table I can be solved in
O(n 2 /p) time on the variable-diameter inesh, where I < p < n4/3 . The MCM is superior
to this mesh since it provides O(n 2 /p) time solutions for I p < ?t3 / 2. In fact, it is easy
to see that the variable-dianieter mesh is a special instant of the MCM(n,k,q) obtained
H. M. Alnuweiri
by choosing q = t (thus p - k, and I < k < nt2/3). Thus, the MCM algorithins can be
iuplemented directly on the variable-diameter nesh.
4 Parallel Vision Algorithms on the MCM
This section presents optimal solutions to several image problems on the MCM. Our ap-
proach is based on combining the divide-and-conquer paradigm with the optimal parallel
techniques that can be implemented on the basic modules. The MCM techniques basically
consist of performing parallel merge on the results from the basic modules. Optimality
is achieved by carefully balancing the BM techniques and themerge operations as will be
shown later. Processor-time optimal parallel algorithms for solving several image and vision
problems on a BM have been prezented in [1, 5]. Table 2 sumnarizes these results. These
algorithms form the basic subtechniques for the MCM algorithms.
Radix-Sorting of m2 integersParallel Prefix Computations on m2 elementsProblems on an m x m Image:1. Histogram computations2. Histogram modification (equalization)3. Labeling connected componentsA Cnnvpy huhl of each fipt re
5. Diameter of each figure6. Smallest enclosing rectangle of each figure7. All-point closest neighbors8. Closest figures
Table 2: Problems that can be solved in O(m 2/q) time on a BM with q PEs
(Alnuweiri and Prasanna Kumar [1, 5])
Before we proceed further, it is necessary to introduce some additional notation which
will clarify the description of our algorithms. The image is initially partitioned into k2
disjoint squares each of size m x m pixels, where m = n/k. Each such square is called a
basic-square and stored in a distinct BM. Given an n x n image F, a j-partition of the F
is defined to be the partitioning of F into disjoint squares each consisting of j x j basic-
squares of the image. Each such square is called a j.square and denoted by QJ(i), where
7
H. Al. Alnuweiri
i = 0, 1,-.. ,(k/j)2 -1 is the index assigned to each square. Note that the size of a j-square
is jit x jm pixels. '[lhe squares of a j-partition of the image are indexed such that QJ(i)
is stored in M(i). For j < 2, each Qj(j) can be. partitioned into four equal quadrants,
denoted by Qj, Qj, Q', Qj. Also, note that Qo(i) denotes the basic-square in BM(i). The
boundary of a j-square consists of topmost and botton-most rows of pixels and the leftnost
and rightmost columns of pixels within the j-square. The boundary of an image square Q
is denoted by 6(Q). The pixels on the boundary of a j-square QJ(i) are stored in the PEs
of the boundary BMs of MJ(i).
4.1 Histograinming on the MCM
Histogranuning is a widely used operation in the early processing of images, and in im-
age enhancement applications. The histogrammfing problem can be defined as follows [26]:
Given an n x n image I having H gray-level values 0,1,... , H - 1, compute the number of
pixels having the gray-level value g for each g = 0, 1,..-, H - 1. Each gray'-level value can be
represented by a binary number gh- 1,gh-2, ",go, where h = [logH]. .T'he histogramming
problem for an m x m image can be solved in O(m 2/q) time on a BM withlq PEs by using
the parallel radix-sort technique described in [5] (see table 1). This result is processor-
time optimal since, sequentially, this problem requires fl(rn 2) time. Histogranmning can be
;erformed eIlicie-ntly on the M(.M 1y using data reduction and dtviae-ano-conquer Lech-
niques. Our technique consists of two stages: a sorting stage, and a key-counting stage.
Suppose that the given n x n image consists of If different gray-levels, the two stages can
be performed as follows:
Sorting Stage: Partition the MCM into j-neshes each having a total of at least
H memory locations, i.e. j 2n 2 > H (for simplicity, assume that II = j 2in 2 ). Each
memory location in such a j-mesh corresponds to a distinct key (gray-level value).
Within each j-inesh, sort the pixels by their gray-level values, then compute the sum
of the pixels in each gray-level, and finally, store the sum into the memory location
corresponding to that gray-level key. It is clear the time complexity of this stage is
dominated by sorting the pixels within each j-rnesh, which is O(jl-).
" Key-Counting Stage: This stage is performed using a divide-and-conquer tech-
nique. Since the initial size of each submesh is j X j B Ms, this stage takes logk- logj =
8
H. M. Alnuweiri
log(k/j) iterations. In iteration i, the size of each submesh is (j2i x j2') IMs. lBe-
cause the number of different keys is fixed and equal to Ii, After each iteration the
keys are divided equally among the ltMs of each t-mesh, t = j2'. Fulrthertnore, the
counts (sunt of elements) of the same key in different I-meshes, are stored in the samte
memory location within each I-mesh. Step i of this stage is performed as follows:
1. In this iteration, the sum of pixels for each key in a 21-mesh, t = j2i, is to becomputed by adding the four sums for that key in the four quadrants of the
21-mcsh. Since these-four sums are stored in the same memory location within
each quadrant, this operation can be performed in O(j2' + fm ) time by
rotating the data within each row and column of PEs in a 21-mesh.
2. Now we partition the keys among the rows of each 2t-mesh, so that each RM
will contain a smaller number of keys. This can be done in O(j2' + [ -. ,j)
time by a simple rotation of data in each row and column of memory within each
21-mesh.
The total time taken by the log(k/j) iterations is -
j2' + r-7 --i=O
which is bounded by O(k + 77. /). For processor-time optimality, we must have
_ - = H <irnq q
i.e., the memory size of each BM must be equal or greater than the maximum number of
keys (gray-levels). Thus, we have the following theorem:
Theorem 1 Given art image with 1I or less gray-levels, computing the histogram of theimage can be done in O(k + q . v/77 ) time on an MCM(n, k, q) with p = k2q processors.
fi If = O(m 2), where m = n/k, then the histogram can be computed on this MCM in
O(k "+- M2) = O(k + al) time, which is optimal.
Hlistograms are used in a large number of applications such as image enhancement. For
example, in a certain image some gray-levels may be heavily populated while other levels
H. M. Alnuweiri
may have a umuch lower number of pixels. To enhance the dynamic range of the image, a
redistribution of gray-level values may be desired. This can be done by a technique called
histogram modification [261, in which given a histogram IfM for an image F, it is required
to transform the gray-scale of the image so as to give the image a specified histogram II'"'.
The new histogram has the same number of gray-levels as the old histogram but has a
different number of pixels in each level. It is easy to show that image enhancement by this
method can be performed on an MCM(n, k, q) in 0(k+ V1-) time, where H is the number
of the gray levels. Again, if we choose I = m2 , then this time becomes 0(k + 2-) which is
optimal for 1 < p < n3/2 and I < k < n 2/ 3 .
4.2 Line Detection by the Hough Transform
The Hough Transform is a general technique for feature detection ih images. The method
involves accumulating evidence for parametrized objects from the image feature space [9].
The principle of line detection using the Hough Transform is based oh transforming each
pixel (i,j) in the image (also called the feature space) into a sinusoidalcurve in the param-
eter space. This transformation is defined by the normal representation of a line (passing
through (i, j)), which is given by
o = isinO + icosO. (1)
The computation of the Hough Transform proceeds by. associating an accumulator cell
A(s, t) with each point (p., Ot) (in the parameter space) which keeps a count of the numberof curves intersecting at that point. For computational purposes the ranges of p and 0 are
quantized into a number of levels. If the number of quantized levels of 0 (respectively, p) is
L (respectively, M), then we have ML accumulator cells each associated with a pair (p,, Ot)
of quantized values, where 1 < s < M and 1 < t < L.
Suppose that 01,... ,ALO, are the L given samples of 0, then the Hough Transform can
be computed using L histogramming steps. In Step i, each image pixel (i,j) increments
the count of A(p., 0t) by one, where p, = i sinOt + j cos O. Implementing this technique
directly on the MCM(n,k,q) leads to O(Lk + Ln 2/p) time. The term Ln 2 /p in the time
complexity is within the optimal bound; however, the term Lk is not for sufficiently large
values of k and L, e.g. L = 0(n), k = 0(nt /2 ). The method can be improved by modifying
the key-counting stage of our htstogranuning technique. The complete algorithm is rather
10
lengthy to describe here. We will consider a special case which shows how optititality can
be achieved on the MCM.
Consider computing the Itough Transform of an n x n image. Let L be the number ofvalues of 0 and M be the number of values of p, and let Ii = LM be the total number of
accumulator cells needed to compute the transform. Now let M = in' and L - k, where
as before m = n/k. Each BMi, 1 < i < k2 , has y;2 memory locations which can store
the M = m2 values of p for a particular Oi. In tle computation each BM computes the
Hough Transform for its image pixels for a certain 0. This can be done in O(in2/q) time
by histogramming. The m' values of p are then shifted to an adjacent 13M, so that values
of p for a new value of 0 can be computed. Using this technique each BM computes the
values of p for the L - k values of 0. Since each such computation takes O(m 2 /q) time,
the total time taken by each BM is O(k m2 /q). The data transfer time is also O(k 2m2/q).
Since H - LM - k~m2 , the computation time on an MCM(n,k,q) is O(H/q) = O(L )
using p = k2 q PEs, which is optimal since a sequential computation of the I[T would take
O(Ln2 ) time.
Theorem 2 Computing the fough Transform can be done in O(k + LL' time on anp.
MCM(n, k, q) with p = k~q processors, where L is the number of discrete values of 0 in
* " .. ,.... * rm""Tn"
4.3 Computing Image Connected Components
A black and white (binary) image may consist of several connected regions of black pixels.
A connected component (or a figure) is a maximally connected region of black pixels. The
image labeling problem is to identify a unique label with each figure in the image. The
labeling problem can be solved by using a bottom-up recursive procedure in which the
figures in a k x k square are labeled first by labeling the connected regions within eachk X A quadrant of the square, then by using the adjacency information along the commnon
boundaries of the quadrants to label the connected regions within the k x k square. Nassini
and Sahni have proposed this approach for meshes [201, and since then it was implemented
on several other parallel organizations such as the pyramid computer [191, MOT [231, and
hypercube computers [8]. The same technique can be implemented on the MCM as follows:
tt
H. M. Alnuweiri
Initially, the figures within each basic-square can be labeled in O(-!--) time by using a
niodule-algorithn. Next, a recursive merging algorithm is used to compute the labels of
the figures. The algorithml consists of logk iterations. In iteration i, I < i < logk - 1, tile
figures within each j-squarc, j = 2', are labeled by considering the pixels on the boundaries
of its four quadrts. The merge is done first by constructing an auxiliary graph Gj(h)
representing the connectivity among the figures in the four quadrants of each Qj(h), then
computing the connected components of each such graph. Each vertex in Gj(h) represents
a figure which intersects the boundary of a quadrant. The edges of Gj(h) are constructed
as follows: (v,, v,)cE if two figures labeled z and y in two adjacent quadrants have at least
two pixels which are 4-neighbors along the common boundary pf the tw'o quadrants. It is
clear that Gi has O(jm) vertices and O(jm) edges.
Lemma 1 Given the graph Gi constructed as above, and stored in a~j-mesh, the connected
components of this graph can be computed in O(j + V'. M . log(im)) time.
Proof: The O(jm) edges of the given graph G are stored in a column (or row) of boundary
BMs. Let these edges be stored in the leftmost column of BMs in the j-mesh. The connec-
tivity algorithm we use is basically an implementation of the parallel connectivity algorithm
in [281. For a graph with jm vertices, this algorithm requires O(log(jm)) iterations on a
CROW PRAM with U(jm) Pbs. In each step, each Pi performs a constant. time operation
on its local data. To implement this algorithm on the MCM, each read/write step on the
PRAM is simulated by a random access read/write operation [21] on the MCM. Once each
PE has the appropriate data in its RM, it performs the desired operations on this data.
To implement this algorithm on a j-mesh in which GC is stored, we first move the O(jm)
edges of the graph into a 2r i /l - mesh within the j-mesh, where as before j = 2'. This can
be done as follows: First partition the 2' BMs into 2f i/21 groups, each consisting of 21i /2j
BMs. Move the data within each such group into a distinct column of BMs, such that the
2[ i /21 groups are stored in 2 tV/2J adjacent columns of BMs. This can be in 0( 2[i/21 + m/q)
time by moving the data within each row of PEs. The data within each such column of
BMs is moved upward (within the j-mesh) so that all the edges are now in a 2Li/2 J x 2[i/21
array of BMs. If i is even then this array constitutes a v'j-mcsh. If i is odd then this array
constitutes the upper half of a 2 ri/21-mesh. Each PE within the smaller mesh has O(m/q)
edges only in its RM. This data movement is illustrated in figure 4, and it can be completed
12
fH. M. Alnuweirib
in 0(2' A. 211/21 ii/q) time.
am Memory module
000 I0000001311 I 0011 0Do000 Z~1~loo o OlOo o ilo oo :o'o o' ooo01 O o oooo mI I~ O11000010 'o 0 'IOuo o I
0000000 jg 0 0 C3gIIg~
'Coaooo o a 00 a a a4DIaaIU ooIooaDOO I Ioolalool lgml IoRIH oo ara 0 9oo
I3 90 o MOo lO 10 13M o30 013 H 1oooo1 1oQaO O D a TO all 3 o1 o 13 a 1/
I l 1 1 loo0 0 0ooo 1o GOO11D at i aoo oo oooo 0 oo1oo30 1o o 1 I o o 2 Io ago 0 o aoooo o oo.o.o E ooao 100o aonoolloooo l looooloao soo io 0ooooaooo oloaollaOOallaOO Ioooollo0OOl OUa o 0al 00I OOOlO aoo soon
ooo o o 0~ .~.ogoo0110oI L o o o1 sg 00 0__ o Og lOO a do110g .~l 00loooooooololWI 00 Io1ooo1o o ' oO o a oo 1olGua
1 9oo ooaoooooloOMD R 00 aom oo l a o OlOo O oo o 000 3lO 3 Dao oo0lGoooooooooo1oaO 00 ooolooooooIo o Olooo DOO o o3o 00o aoa
oaooooori ~ ~ ~ ~ ~ ~ 1 0 0 g go0l00~j ggl~gn0000o13oo 110 o0 .al b a al 00 o OO No o o 1o Gu l I alOo laam Oll aoo00
Move data along rows Concentrate data along columns Final position of data
Figure 4: Concentrating data into a smaller submesh
Within the 2/1-mesh, each random access read/write operation n 13 e performed in
O(2Fi/21lm/q) = O(V3'. (m/q)) time [21. Thus, the log(jmn)iterations of the connectivity
algorithm in [28] can be performed in O(\/3. (m/q) . (log im)) time. 0
After c:omputing the connected componetits of C3, the correct jabeis of tiie pixeis on
the boundiary of each j-square (and also on the boundary of its four quadrants) can be
computed. The procedure is then repeated for each (2j)-square, (4j).square, and so onuntil the entire ag has been processed. The time for iteration i of the algoitlun is
O(j + V-• log(ja)), where j = 2 , and the time for the initial module computation is
O(m2 /q). Thus, the time for all logk iterations is,
log km/q + [ can+ 2/n(m/q)(log2m))
i=1
which is bounded by O(k + m2/q) = O(k + n2 /p), where p = koq.
Since only the labels of the pixels on the boundary of each quadrant are updated in
eacl iteranti ome i as ll have old incorrect labels at the end of the comiputa-
tion. These pixels can be labeled with the correct labels by applying a top-down recursive
procedure similar to the above procedure but in the reverse direction. That is, use the
13
II. M. Alnuweiri
labels of tle pixels ott the )otln(lary of each j-squareto label Lite pixels On Lite boundaries
of its four quadrants. This is (lone starting front tle square Qk which consists of tle entire
it X n image, and then applying the technique until the size of each quadrant is In X In
pixels. Each such quadrant is local to one BM and its pixels can be labeled with the correctlabels in 0(m 2 /p) time using a niodule-algorithm. The following theorem summerizes this
result.
Theoremn 3 Given an n x n black and white image, all connected regions of black pixels
in the image can be labeled in 0(k + n2/p) time on an MCM(n, k, q) with p PEs, where1_p n3/2 and k<n 2/ 3 .
Besides being optimal, our result is superior to previous known results. For example,
using n3 /2 the MCM can provide 0(n'/2) time solution to the above problem. A pyramid
computer can provide the same solution time but using It2 PEs [19]. A 2MCC with n 2
PEs provides 0(n) time solution to this problem [20]. The hypercube and shuffle-exchange
networks can provide 0(log2 n) time solutions using n2 PEs [8], but they require a much
larger area if implemented in VLSI.
We now consider the problem of computing the convex hull of each figure (connected com-
ponent) in -an n x ?I image in which all figures have been labeled. A set S of pixels is said
to be convex if the smallest convex polygon containing them contains no other pixels. Such
polygon is called the convex hull of S, and it is denoted by CH(S). Tile MCM algorithm
is base' ie divide-and-conquer paradigm for merging disjoint convex hulls [25]. Two
conv, .ais are said to be disjoint if there exists-a vertical line in the plane such that all the
vertices of one hull lie on or to the left of this line, and all the vertices of the second hull lie
to the right of this line. By merging two convex hulls A, and A2, we mean computing the
convex hull CH(Ai,A 2 ) of At and A2. Merging two disjoint convex polygons At and A2
can be done by computing the two tangents common to At and A2, and by eliminating the
vertices which become internal to the resulting polygon. To compute these tangent lines,
it is useful to partition At (A2) into an upper hull and a lower hull. Let t and r be the two
vertices with, respectively, the smallest and largest i-coordinates among all the vertices of
14
I. M. Alnuweiri
co1nn1111 tangent - - --
UH2
- --' - -
0 Marked vortex(each is stored ina record with
the two edges Incident upon it)
Figure 5: Partitioning convex hull
A1 (A2). The upper hull consists of all the vertices of A, (A2) which lie on or above the
line tr, and the lower hull consists of all the vertices below this line.
Our merging technique is based on the following procedure: Suppose we have two convex
hulls (polygons), C11 and C112. We consider the merge procedure for the upper two
hulls of CI1 and CI12 (denoted UI1 and U1[2, respectively). The same procedure can
be applied to merge the lower hulls. Let ulti= (vI,v 2 ,... ,v.) and UH2= (ui,U2, .",u,),
where the vi's (ui's) are the vertices UlI (UI[2) given in the increasing order of their
z -coordinate (see figure 5). Mark j vertices V = (vi,, vi, - ,v Iii) from UI[i and j vertices
U = (Uitli2l"'IUi,) from UH2, where vi, = v1, vi, = V,, ui, = ul, and ui, = u,. The
marked vertices are also ordered by their z-coordinate. The sets V and U partition UI11
and Ulf2 into j - I segments, where a segment [vi., vi, I ([ui, lli,+ 1 ) consists of all the
vertices of UIl (UI[2) in between, and including, vi, and vi, , (uj, and ui,+,). Note that
the segments do not necessarily contain the saine number of vertices. Finally, we assume
each vertex in V (U) is stored in a record together with the two edges from U1l1 (U112)
incident upon it. The two hulls U11I and U112 can be merged in two stages as follows:
Procedure MERGE-IIUL,
* Stage 1: This stage performs the binary search procedure only on the vertices in V
15
H. M. Alnuwteiri
and U and the edges incident upon them. Each iteration of the algorthii consists
of choosing two vertices vi, and ujt, constructing the line joining thetu, and thet
testing if this line is tangent to hoth U1ll atid U112 [251. Thus, this stage terminates
in at most (logj) iterations. Upon terinination, either a common tangent is found
(and we are-done) or two segments of vertices, one from U1il and one from UI[2, are
identified. These two segments contain the two vertices through which the common
tangent passes.
Stage 2: This stage computes the coiUmon tangent by performing the binary search
procedure on vertices of the two segments identified in the previouts stage. This stage
terminates in at most (log t) iteration, where t is the tbtal number of vertices in the
two segments.
To merge pairs of convex hulls in two adjacent image squares' the above two stages
must be applied simultaneously to all pairs of convex hulls which touch along the common
boundary of the two squares. The following steps show how the above procedure can be
implemented on the MCM: .,
1. Initial module computation: We start by computing the convex hulls for all the
figures within each basic square. Since each basic square is local to one BM, this
computation can be completcd in O(m'/q) time using a B1M algorithm. Note t",at if
a figure extends over several basic squares, then this figure will have a convex hull in
each one of these basic squares. All the convex hulls of one figure must be merged in
order to obtain the final convex hull of the figure. For the merge phase, we need only
to consider convex hulls of figures which have at least one pixel on the boundary of a
basic square.
2. Recursive-merge: This phase consists of a recursive procedure which runs in logk
iterations. In iteration i, I < i < logk, we compute the convex hulls of the figures
within each j-square, QJ, where j = 2i. At the beginning of this iteration, the
convex hulls of the figures within each quadrant of Qj have been computed. So we
need to merge each group of (at most four) convex hulls of the same figure in the
four quadrants. It can be shown that all pairs of hulls can be merged in 2 log(jm)
iterations by using a binary-search procedure [25]. Our merge procedure is based on
implementing this binary search technique. The details of the implelentation are
16
a fH. M. Alnuweiri
rather tedious and are left out to the full version of this paper. rhe following theorem
suniuerizes our restilt.
Theorein 4 Given an n x n irnagc in which all figures have been labeled, the Convtex hull
of each figure can be computed, for all figures simultaneously, in O(n 2 /p) time on an
MGM(n, k, q) with p PEs, where I < p < it3 / 2 and k < n2/3 .
Using a similar computation, several other geometric properties can he computed for
each figure in the image. For example, computing the diameter and -a smallest enclosing
rectangle for each figure can be computed in 0(n 2 /p) time on an MCM(n, k, q) with p PEs,
where 1 < p <n 3/2 and k < 2/3 . Again for p = n/2 , the above problems can be solved in
O(n 1 / 2 ) time, which is superior to the O(n) time solution on a standard n x n mesh of PEs
[16], and the O(nh/2 logn) solution on a pyramid computer of size n 2 [19].
4.5 Computing All Closest Neighbors
We consider two problems related to computing distances among image regions. The first
problem we consider is the closest neig,'bor problem. In this problem we are given an n X n
oiacK ana white iroage, and we are required w, cuinipuL, ;or eacit pixcl p (CitbiL 1. u u
white) a closest black pixel. The computed black pixel will be called a closest neighbor of p.
As a measure of distance, we will use the Ll metric. In this metric the distance between two
pixels (i, j) and (u, v) (denoted by DIST[(ij),(u,v)] is given by Ii - ul + Ij - vi. A bottom-up
recursive approach can be used for this problem. To compute the closest neighbor for each
pixel in a j-square square Qj, we first solve the problem within each quadrant of Qj and
then merge the results from the quadrants. It can be shown that in the merge step only
the boundary pixels of the quadrants need to be considered.
The divide-and-conquer procedure proceeds as follows6 Initially, tile closest neighbors
are computed within each basic square. Since each basic square is local to one BM, this
can be done in O(in2/q) time using a module-algorithm. A bottom-up recursive algorithm
is then used. In the beginning of iteration i of this algorithm, I < i < logk, the boundary
pixels of each quadrant of a j-square, j = 2', have computed their closest neighbors within
that quadratit. Iteration i consists of computing the closest neighbors of tile boundary
17
11. M. Alnuweiri
o 01
u.V IV 0)
i i A u.v)
02 03
Figure 6: Closest neighbor computationsft.
pixels of each j-square, by merging the information along the joundaries of its quadrants.
Theorem 5 Given an n x n black and white image, the closest neighlbor of each pixel in the
image can be computed in O(k + n2/p) time on an MCM(n,k,q) with p PEs, 1 < p < 3/
and k < i2t3
Proof: Let Mi be the j-mesh containing the j-square, Qi. and let M(3/2)(t), be the (j/2)-
mesh (within Mj) which contains quadrant Q', 0 < t < 3, of Q3 (see figure 6). The major
stet in the meree procedure is the data transfer needed to enable each pixel on the boundary
of a quadrant to compute its nearest neighbors in the other three quadrants. Once a pixel
has this information, computing its nearest neighbors can be done in 0(l) time. Consider
the pixels on the boundary of quadrant Q1. Each PE in the rightmost column (topmost
row) of BMs in M(j/ 2)(2) reads the closest neighbors of the boundary pixels in its neighbor
PE in the leftmost column (bottom-most row) of BMs in M(3/2)(3) (M(3/ 2)(O)). Since each
such PE has O(rn/q) boundary pixels in its RM, this operation can be completed in O(m/q)
time. The pixel (u - 1, v + 1) in the south-western corner of Q' can be also moved into
the PE containing pixel (u, v) in the north-eastern corner of Q', in 0(i) time. Now the PE
containing (u, v) can compute the closest neighbors of (u, v) within quadrants Q', Q', and
Q'. These three closest neighbors are then broadcast to all boundary PEs in M(/ 2)(2).
This can be (lone in O(j + log q + re/q) time. Now each PE in the rightmost column and
topmost row of BMs in M(j/2)(2), has all thw imtorinatior necessary to compute tlhe corr'ct
distances for each pixel in its portion of tlH boundary. Since each boundary'1f has "7(m,.[r ,
such pixels, this can he done in O(m/q) time. A sidilar computation is perfor vied to n,.date
18
.f. M. Alnuweiri
the closest neighbors of the pixels on the boundary of Qj. Thus, iteration i of the tCrgealgorithn requires O(j + log q + m/q) time, where j = 2'. The tine taken by the entire
algorithln is
71 2 1. 109 (2 log q =o(k +M2q q q
Where the O(in 2 /q) term is the time taken by the initial module-algorithm. Since m = n/k
and p = k2q, the time taken by the algorithm is O(k+n 2/p) which is optimal for < p < n3/2
and k < n2/3. C]
A closely related problem is the problem of computing for each figure F, the label of the
nearest figure to it in the image. As before, define DIST(p, q) to be the distance between
two pixels, p and q according to L1 metric. Thm the label of the nearest figure to figure F,
is the label of the pixel q such that
DIST(p, q) = mn {DIST(p, q)IpEF,q /F}p~q
Corollary 1 Given an n x n image in which all figures have been labeled, a nearest fig-
ure to each figure can be identified, for all figures simultaneously, in O(n 2/p) time on an
JVAL~kAI, q/ WZLIt p FES, Wihtu 1 < p < ,I j, , un' .. _ ,71"
Proof: Let £(p) denote the label assigned to a pixel p, this problem can be solved in
two major steps. In the first step, we compute for each pixel p, a closest neighbor q
such that £(q) 0 £(p). This can be done by using a simple modification of the above
algorithm for computing closest neighbors. In the second step, the nearest figure to each
figure is computed. This can be done by using a slight variation of the component labeling
algorithm. In addition to keeping track of the minimum indexed pixel in th. component,
we also keep track of the label of the closest neighbor with the minimum distance from a
pixel in the component. From theorem 3 we know that this can be done in O(n2/p) time.
At the end of the computation, each pixel labeled F knows the label of the nearest figure
to F. 0
19
H. M. Alnuweiri
5 Concluding Remarks
We have presented several processor-timue optimal algorithms for digitized images on anesh-connected processor array having p PEs and 0(n 2 ) memory. The organization uses
novel conununication capabilities and can be implemented in current VL'SI technology.
The proposed organization has several features which distinguish it from other mesh-basedparallel computers. First, the PEs communicate through a shared memory as well as direct
interprocessor links. Second, the MCM is, in fact, a hybrid of a mesh-connected array
and the coitununication-efficient organization of the basic modules (BMs). This two-level
organization of the MCM implies two types of parallel algoritlums: basic-module algorithmsand communication-algorithms. Balancing the size of each BM and the number of BMs
in the MCM results in processor-time optimal algorithms over a wide range of processor
counts. The image problems considered here require global operations on image pixels, but
are not computationally intensive (they require O(n 2) operations for an n x n image). To
provide optimal solutions to such problems, several efficient data movement operations have
been developed for the MCM. All of these operations are efficient for applications in which
the amount of data in the memory array is reduced (i.e. less than 0 (n'J'); 'hich is the case
for several of the image problems that we have considered. The MCM organization has
been considered to proiide optimal speedups for several classes of problems such as sorting
20
f. M. Alnuweiri
References
[I] It. M. Alnuweiri and V. K. Prasanna Kumar, "Efficient Image Comlputations on VLSIArchitectures with Reduced Hardware", Pro.c. IEEl Workshop on Computer Archi-tecture for Pattern Analysis and Machine Intelligence, Seattle, WA, 1987.
[21 It. M. Alnuweiri and V. K. Prasanna Kumar, "Optimal VLSI Sorting with ReducedNumber of Processors", Tednical Report, IRIS 225, USC, 1987.
[31 It. M. Alnuweiri and V. K. Prasanna Kumar, "O ;tinial Geometric Algorithms on Fixed-Size Linear Arrays and Scan Line Arrays", Proc. IEEE Conference on Computer Visionand Pattern Recognition, 1988.
[4) It. M. Alnuweiri and V. K. Prasanna Kumar, "A Reduced Mesh of Trees Organizationfor Efficient Solutions to Graph Problems", Proc. the 22nd ,Annual Conference onInformation Science and Systems (CISS), Princeton University, March 1988.
[5] H. M. Alnuweiri and V. K. Prasanna Kumar, "Optimal image Computations on Re-duced VLSI Architectures", to appear in IEEE Transactions on Circuits and Systems,October, 1989.
[6] H. M. Alnuweiri and V. K. Prasanna Kumar, "Fast Image Labeling Using local Op-erators on Mesh-Connected Computers", Proc. International Co..ference on ParallelProcessing, 1989.
[71 K. E. Batcher, "Design of A Massively Parallel Processor", IEEE Transactions onComputers, c-29, September 1980.
[8] R. Cypher, J. L. C. Sanz, L. Snyder, "Hypercube and Shuffle-Exchange Algorithms forImage Component Labeling", Proc. IEEE Workshop on Pattern Analysis and MachineTnfall;a~nio . q ff'lp . WA IQR7
[9] R. 0. Duda and P. E. Hart, "Use of the Hough Transformation to Detect Lines andCurves in Pictures", Communications of the ACM, 15, 1, 1972.
[10] C. R. Dyer, "A VLSI Pyramid Machine for Hierarchical Parallel Image Processing",Proc. IEEE Conference on Pattern Recognition and Image Processing, 1981.
[1.1 C. R. Dyer and A. Rosenfeld, "Parallel Image Processing by Memory Augmented Cel-lular Automata", IEEE Transactions on Pattern Analysis and Machine Intelligence,1981.
[12] K. Hwang and POS. Tseng and D. Kim, "An Orthogonal Multiprocessor for ParallelScientific Computations", IEEE Transactions oi Computers, c-38, January 1989.
[13) J. Ja'Ja' and R. M. Owens, "An Architecture for a VLSI FFT Processor", INTEGRA-TION the VLSI Journal, 1, 1983.
[141 F. T. Leighton, "Parallel Computations on Meshes of Trees", Technical Report, MIT1982.
[151 It. Miller, V. K. Prasanna Kumar, D. Reisis, Q. F. Stout, "Meshes with reconfigurablebuses", Proc. MIT Conference on Advanced Research on V1,S1, Cambridge, MA, 1988.
21
11. M. Alnuweiri
1161 It. Miller and Q. F. Stout, "Geometric Algorithms for l)igitized Pictures on A Mesh.Connected Computer", [EEE Transactions on Pattern Analysis and Machine Intelli-qence, March 1985.
1171 It. Miller and Q. F?. Stout, "Varying Diametfr and Problem Size on Mesh-ConnectedComputers", Proc. International Conference on Parallel Processing, 1985.
[181 It. Miller and Q. F. Stout, "Varying Diameter and Problem Size on Mesh-ConnectedConipiters", Proc. International Conference on Parallel Processing, 1985.
[191 R. Miller and Q. F. Stout, "Data Movement Tedniques for the Pyramid Computer",SIAM Journal on Computing, 2, 1987.
[20] 1). Nassinti and S. Sahni, "Finding Connected Components and Connected Ones on aMesh-Connected Parallel Computer", SIAM Journal on Computirg, 9, 4, 1980.
[21] D. Nassini and S. Sahni, "Data Broadcasting in SIMD Computers", IEEE Transactionson Computers, c-30, 2, February 1981.
[22] D. Nath, S. N. Maheshwari, P. C. P. Blhatt, "Efficient VLSI Networks for ParallelProcessing Based on Orthogonal Trees", IEEE Transactions on Computers, c-32, 6,June 1983.
[23] V. K. Prasanna Kunar and M. Eshaghian, "Parallel Geometric Algorithms for Dig-itized Pictures on Mesh of Trees Organization", Proc. International Conference onParallel Processing, 1986.
[24] V. K. Prasanna Kumar and C. S. Raghavendra, "Array Processor with Multiple Broad-casting", Journal of Parallel and Distributed Computing, 4, 173-190 (1987).
(25] F. Prepaiata and S. J. Hong, "Convex Hulls of Finite Sets of Points in Two and Three, 4 .. . . ". a"... ..L. . A/Y-,f T4., --.. " I '7.
[26] A. Rosenfeld and A. C. Kak, Digital Picture Processing, Academic Press, New York,1982.
[27] I. D. Scherson and S. Sen, "Parallel Sorting in Two-Dimensional VLSI Models of Com-putation", [EEE Transactions on Computers, c-38, 2, 1989.
(28] Y. Shiloach and U. Vishkin, "An O(logn) Parallel Connectivity Algorithm", Journalof Algorithms, 3, 1982.
[29] P. S. Tseng, K. lIwang, V. K. Prasanna Kurnar, "A VLST Based Multiprocessor Ar-chitecture for Implementing Parallel Algorithms", Proc. International Conference onParallel Processing, 1985.
[30] J. Ullman, Computational Aspects of VLSI, Computer Science Press, 1984.
[31] S. II. Unger, "A Computer Oriented towards Spatial Problems", Proceedings of thefi1E, v. 46, pp. 1744-1754, 1958.
22