
SIAM J. COMPUT. Vol. 2, No. 4, December 1973

SET MERGING ALGORITHMS*

J. E. HOPCROFT† AND J. D. ULLMAN‡

Abstract. This paper considers the problem of merging sets formed from a total of n items in such a way that at any time, the name of a set containing a given item can be ascertained. Two algorithms using different data structures are discussed. The execution times of both algorithms are bounded by a constant times nG(n), where G(n) is a function whose asymptotic growth rate is less than that of any finite number of logarithms of n.

Key words. algorithm, algorithmic analysis, computational complexity, data structure, equivalence algorithm, merging, property grammar, set, spanning tree

1. Introduction. Let us consider the problem of efficiently merging sets according to an initially unknown sequence of instructions, while at the same time being able to determine the set containing a given element rapidly. This problem appears as the essential part of several less abstract problems. For example, in [1] the problem of "equivalencing" symbolic addresses by an assembler was considered. Initially, each name is in a set by itself, i.e., it is equivalent to no other name. An assembly language statement that sets name A equivalent to name B by implication makes C equivalent to D if A and C were equivalent and B and D were likewise equivalent. Thus, to make A and B equivalent, we must find the sets (equivalence classes) of which A and B are currently members and merge these sets, i.e., replace them by their union.

Another setting for this problem is the construction of spanning trees for an undirected graph [2]. Initially, each vertex is in a set (connected component) by itself. We find edges (n, m) by some strategy and determine the connected components containing n and m. If these differ, we add (n, m) to the tree being constructed and merge the components containing n and m, which now are connected by the tree being formed. If n and m are already in the same component, we throw away (n, m) and find a new edge.

A third application [3] is the implementation of property grammars [4], and many others suggest themselves when it is realized that the task we discuss here can be done in less than O(n log n) time.

By way of introduction, let us consider some of the more obvious data structures whereby objects could be kept in disjoint sets, these sets could be merged, and the name of the set containing a given object could be determined. One possibility is to represent each set by a tree. Each vertex of the tree would correspond to an object in the set. Each object would have a pointer to the vertex representing it, and each vertex would have a pointer to its father. If the vertex is a root, however, the pointer would be zero to indicate the absence of a father. The name of the set is attached to the root.

* Received by the editors August 10, 1972, and in revised form May 18, 1973. This research was supported by the National Science Foundation under Grant GJ-1052 and the Office of Naval Research under Contract N00014-67-A-0071-0021.

† Department of Computer Sciences, Cornell University, Ithaca, New York 14850.

‡ Department of Electrical Engineering, Princeton University, Princeton, New Jersey 08540.



Given the roots of two trees, one can replace the representation of two sets by a representation for their union by making the pointer at one root point to the other root and, if necessary, updating the name at the root of the combined tree. Thus two structures can be merged in a fixed number of steps, independent of the size of the sets. The name of the set containing a given object can be found, given the vertex corresponding to the object, by following pointers to the root.

However, by starting with n trees, each consisting of a single vertex, and successively merging the trees together until a single tree is obtained, it is possible to obtain a representation for a set of size n which consists of a chain of n vertices. Thus, in the worst case, it requires time proportional to the size of the set to determine which set an object is in.
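To make the degenerate case concrete, here is a minimal Python sketch of this naive representation; the class and method names are our own illustration, not part of the paper.

class NaiveForest:
    """Each object has a father pointer; the set name is kept at the root."""
    def __init__(self, n):
        self.father = [0] * (n + 1)                    # 0 marks a root (no father)
        self.name = {i: i for i in range(1, n + 1)}    # name attached to each root

    def _root(self, i):
        while self.father[i] != 0:                     # follow pointers to the root
            i = self.father[i]
        return i

    def find(self, i):
        return self.name[self._root(i)]                # cost = depth of i in its tree

    def merge(self, i, j):
        # Point one root at the other; constant work once the two roots are known.
        ri, rj = self._root(i), self._root(j)
        if ri != rj:
            self.father[ri] = rj
            del self.name[ri]                          # the union keeps j's name

# Merging 1 into 2, 2 into 3, ..., n-1 into n yields a chain of n vertices,
# so find(1) then walks n-1 pointers.
f = NaiveForest(5)
for k in range(1, 5):
    f.merge(k, k + 1)
print(f.find(1))    # prints 5 after traversing the whole chain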

For purposes of comparison, assume that initially there are n sets, each containing exactly one object, and that sets are merged in some order until all items are in one set. Interspersed with the mergers are n instructions to find the set containing a given object. Then the tree structure defined above has a total cost of n for the merging operations and a cost bounded by n for determining which set contains a given object (total cost n^2 for n lookups).

Methods based on maintaining balanced trees (see [5], e.g.) have a total cost of n log n for the merging operations and a cost bounded by log n for determining which set contains a given object (total cost n log n for n lookups).

A distinct approach is to use a linear array to indicate which set contains a given object. This strategy makes the task of determining which set contains a given object take a fixed number of steps. By renaming the smaller of the two sets in the merging process, the total cost of merging can be bounded by n log n.
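A sketch of this array scheme (again our own illustration, with merge taking two representative objects): keep the set name of every object in an array and, for each set, a list of its members; a merge relabels the members of the smaller set.

class ArrayOfNames:
    def __init__(self, n):
        self.set_of = list(range(n + 1))                    # set_of[i] = name of the set holding i
        self.members = {i: [i] for i in range(1, n + 1)}    # name -> list of members

    def find(self, i):
        return self.set_of[i]                               # a single array access

    def merge(self, i, j):
        a, b = self.set_of[i], self.set_of[j]
        if a == b:
            return
        if len(self.members[a]) > len(self.members[b]):
            a, b = b, a                                      # a is now the smaller set
        # Relabel the smaller set.  Each object is relabeled at most log2(n) times
        # over the whole computation, because the set containing it at least
        # doubles whenever it is relabeled; hence the n log n merging bound.
        for x in self.members[a]:
            self.set_of[x] = b
        self.members[b].extend(self.members.pop(a))

u = ArrayOfNames(4)
u.merge(1, 2)
u.merge(3, 4)
u.merge(2, 4)
print(u.find(1) == u.find(3))    # True: all four objects now share one set name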

A more sophisticated version of the linear array replaces the set names in the array by pointers to header elements. This method, based on the work of Stearns and Rosenkrantz [3], uses n log log ··· log(n) steps (with k nested logarithms) for the merging process and a fixed number of steps independent of n for determining which set contains a given element. Here k is a parameter of the method and can be any fixed integer.

In what follows, we shall make use of a very rapidly growing function and a very slowly growing function which we define here. Let F(n) be defined by

F(0) = 1,

F(i) = F(i-1) · 2^{F(i-1)} for i ≥ 1.

The first five values of F are 1, 2, 8, 2048 and 2^{2059}. The slowly growing function G(n) is defined for n ≥ 0 to be the least number k such that F(k) ≥ n. Both set merging algorithms presented here have asymptotic growth rates of at most O(nG(n)).
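A short Python sketch (ours, for illustration) of the two functions may help fix ideas; note that F grows so violently that F(5) cannot even be written down, so G(n) ≤ 4 already for every n up to F(4) ≈ 2^{2059}.

def F(i):
    """F(0) = 1, F(i) = F(i-1) * 2**F(i-1).  Only F(0)..F(4) are computable in
    practice: F(4) already has 2059 bits, and F(5) would have about 2**2059 bits."""
    v = 1
    for _ in range(i):
        v = v * 2**v
    return v

def G(n):
    """Least k with F(k) >= n.  Terminates after at most 4 steps for any
    n <= F(4); larger arguments would require evaluating F(5)."""
    k, v = 0, 1
    while v < n:
        k += 1
        v = v * 2**v                 # v advances from F(k-1) to F(k)
    return k

print([F(i) for i in range(4)])      # [1, 2, 8, 2048]
print(G(10**9))                      # 4, since F(3) = 2048 < 10**9 <= F(4)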

Initially, let S_i = {i}, 1 ≤ i ≤ n. The S_i's are set names. Given a sequence of two types of instructions,

(i) MERGE(i, j), (ii) FIND(i),

where i and j are distinct integers between 1 and n, we wish to execute the sequence

from left to right as follows. Each time an instruction of the form MERGE(i, j) occurs, replace sets S_i and S_j by the set S_i ∪ S_j and call the resulting set S_j. Each time an instruction of the form FIND(i) occurs, print the name of the set currently containing the integer i. It is assumed that the length of the sequence of instructions is bounded by a constant times n. Both set merging algorithms presented here have asymptotic growth rates bounded by nG(n). The first algorithm actually requires nG(n) steps for certain sequences. For the second algorithm, it is unknown whether nG(n) is in fact its worst case performance. Recently, Tarjan [6] has shown the algorithm to be worse than O(n).
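To pin down these conventions, here is a deliberately naive Python reference implementation (ours, for illustration only; its per-instruction cost can be linear, unlike the two algorithms that follow).

def run(n, instructions):
    """Executes a MERGE/FIND sequence literally: sets are kept as Python sets
    keyed by their names, with the objects 1..n starting in singleton sets."""
    sets = {i: {i} for i in range(1, n + 1)}
    output = []
    for op, *args in instructions:
        if op == "MERGE":                  # MERGE(i, j): S_i and S_j become S_i U S_j, called S_j
            i, j = args
            sets[j] |= sets.pop(i)
        else:                              # FIND(i): report the name of the set containing i
            (i,) = args
            output.append(next(name for name, s in sets.items() if i in s))
    return output

# After MERGE(1, 2) and MERGE(2, 3), object 1 lies in the set named 3.
print(run(3, [("MERGE", 1, 2), ("MERGE", 2, 3), ("FIND", 1)]))    # [3]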

2. The first set merging algorithm. The first set merging algorithm represents a set by a data structure which is a generalization of that used in [3] to implement property grammars. The basic structure is a tree similar to that shown in Fig. 1, where the elements of the set are stored at the leaves of the tree. Links are assumed to point in both directions.

FIG. 1. Data structure for set representation (a tree with levels 0 through 4; the elements of the set are stored at the leaves)

Each vertex at level i, i ≥ 1, has between one and 2^{F(i-1)} sons and thus at most F(i) descendants which are leaves. We define a complete vertex as follows:

(i) Any vertex at level 0 is complete.

(ii) A vertex at level i ≥ 1 is complete if and only if it has 2^{F(i-1)} sons, all of which are complete.

Otherwise, a vertex is incomplete.

The data structure is always maintained so that no vertex has more than one incomplete son. Furthermore, the incomplete son, if it exists, will always be leftmost, so it may be easily found. Attached to each vertex is the level number, the number of sons, and the number of descendant leaves. The name of the set is attached to the root.

The procedure COMBINE, given below, takes two data structures of the same level l and combines them to produce a single structure of level l. If there are too many leaves for a structure of level l, then COMBINE produces two structures of level l, one with a complete root and one with the remaining leaves. To simplify understanding of the algorithm, the updating of set names, level numbers, number of sons and the number of descendant leaves for each vertex has been omitted.

Procedure COMBINE(s_1, s_2).

Assume without loss of generality that s_2 has no more sons than s_1 (otherwise permute the arguments).


1. If both s_1 and s_2 have incomplete sons, say v_1 and v_2, then call COMBINE(v_1, v_2). On completion of this recursive call of COMBINE, the original data structure is modified as follows. If originally the total number of leaves on the subtrees with roots v_1 and v_2 did not exceed the number of leaves for a complete subtree, then the new subtree with root v_1 contains all leaves of the original two subtrees. The new subtree with root v_2 consists only of the vertex v_2.

If originally the total number of leaves exceeded the number for a complete subtree, then the new subtree with root v_1 is a complete subtree whose leaves are former leaves of the original subtrees with roots v_1 and v_2, and the new subtree with root v_2 is an incomplete subtree with the remaining leaves. If on completion of the recursive call of COMBINE, vertex v_2 has no sons, then delete vertex v_2 from the data structure.

2. Make sons of s_2 be sons of s_1, until either s_2 has no more sons or s_1 has 2^{F(l-1)} sons.

3. If s_2 still has sons, then interchange the lone incomplete vertex at level l - 1, if any, with the leftmost son of s_2. Otherwise, interchange the incomplete vertex with the leftmost son of s_1.

We now consider the first algorithm for the MERGE-FIND problem.

ALGORITHM 1.

1. Initially n vertices numbered 1 to n are created and treated as structures of level 0. Each vertex has information giving the name of its set, its number of descendant leaves (1), its level (0) and the number of its sons (0). A linear array of size n is created, such that the ith cell contains a pointer to vertex i.

2. The sequence of instructions of the forms MERGE(i, j) and FIND(i) is processed in order of occurrence.

(a) For an instruction of the form FIND(i), go to the ith cell of the array, then to the vertex pointed to by that cell, then proceed from the vertex to the root and print the name at the root.

(b) For an instruction of the form MERGE(i, j), if the roots for structures i and j are not of the same level, then add a chain of vertices above the root of lower level so that the roots of the two trees will have the same level. Let s_1 and s_2 be the two roots. Execute COMBINE(s_1, s_2). If after the execution of COMBINE s_2 has no sons, then discard s_2 and we are finished. Otherwise, create a new vertex whose left son is s_1 and whose right son is s_2.
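A partial Python sketch of the bookkeeping just described (our own illustration; the MERGE step and the COMBINE machinery are omitted, and the names Vertex, initialize and find are ours). It shows the record kept at each vertex, the initialization of step 1, and the FIND of step 2(a).

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Vertex:
    level: int = 0
    sons: List["Vertex"] = field(default_factory=list)   # leftmost son first; the incomplete son, if any, is leftmost
    leaves: int = 1                                       # number of descendant leaves
    father: Optional["Vertex"] = None                     # links point in both directions
    name: Optional[int] = None                            # set name, meaningful at the root

def initialize(n):
    """Step 1: n level-0 structures; the ith cell of the array points to vertex i."""
    return [None] + [Vertex(name=i) for i in range(1, n + 1)]

def find(cell, i):
    """Step 2(a): go to the ith cell, follow father links to the root, print its name."""
    v = cell[i]
    while v.father is not None:
        v = v.father
    print(v.name)
    return v.name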

We now show that Algorithm 1 requires time bounded by a constant times nG(n), provided that the length of the sequence of instructions is itself bounded by a constant times n. The first step is to observe that Algorithm 1 preserves the property that at most one son of any vertex is incomplete.

LEMMA 1. At any time during the simulation of a sequence of MERGE and FIND instructions by Algorithm 1, each vertex has at most one incomplete son. Furthermore, the incomplete son, if it exists, will always be leftmost.

Proof. The proof proceeds by induction on the number of MERGE instructions simulated. The basis, zero, is trivial, since each leaf is complete by definition.

For the inductive step, we observe by induction on the number of applications of COMBINE that applying COMBINE to two trees, each of which has the desired property, will result in either producing a single tree with the desired property plus a single vertex or producing a complete tree plus another tree with the desired property. Thus, no call of COMBINE can give a vertex two incomplete sons if no vertex had two such sons previously. Furthermore, if two trees are produced, neither consisting of a single vertex, then one must be complete.

As a consequence of the above observation, if a new vertex with sons s_1 and s_2 is created in step 2(b) of Algorithm 1, the subtree with root s_1 is complete, and the property holds for the new vertex.

LEMMA 2. If we begin with n objects, no vertex created during Algorithm 1 has level greater than G(n).

Proof. If there were such a vertex v, it could only be created in step 2(b). It would, by Lemma 1, have a complete descendant at level G(n). An easy induction on l shows that a complete vertex at level l has F(l) descendant leaves. Thus, v has at least F(G(n)) + 1 descendant leaves, which implies F(G(n)) + 1 ≤ n. The latter is false by definition of G.

THEOREM 1. If Algorithm 1 is executed on n objects and a sequence of at most m MERGE and FIND instructions, m ≥ n, then the time spent is O(mG(n)).

Proof. By Lemma 2, each FIND may be executed in O(G(n)) steps, for a total cost of O(mG(n)) for the FIND's. Since there can be at most n MERGE's, the cost of simulating the MERGE operations, exclusive of calls to COMBINE, is O(n). Moreover, by Lemma 2, each MERGE can result in at most G(n) recursive calls to COMBINE, as each call is with a pair of vertices at a lower level than previously. Thus, the cost of all calls to COMBINE is O(nG(n)) plus the cost of shifting vertices from one father to another in step 2 of COMBINE.

It remains only to show that the total number of shifts of vertices over all calls of COMBINE is bounded by a constant times nG(n). In executing an instruction MERGE(s_1, s_2), no more than G(n) incomplete vertices are shifted, at most one at each level. Thus we need only count shifts of complete vertices.

Consider step 2 of COMBINE. The new subtree with root s_2 is referred to as the CARRY, unless that subtree consists solely of the vertex s_2, in which case we say "there is no CARRY." The new subtree with root s_1 is referred to as the RESULT. The number of shifts of complete vertices is counted as follows. If a complete vertex is shifted, and there is no CARRY at this execution of COMBINE, charge the cost to the vertex shifted. If there is a CARRY, set the cost of each vertex in the CARRY to zero, and distribute the cost uniformly among the vertices in the RESULT.

Each time a vertex is moved, either its new father has at least twice as many sons as its old father, or there is a CARRY. Thus, a vertex at level i is moved at most F(i) times before a CARRY is produced. Once a CARRY is produced, the root of the RESULT is complete, and its sons are never moved again. Hence, a vertex at level i can accumulate a cost of at most 2F(i), that is, F(i) due to charges to itself and F(i) for its share of the costs previously charged to the sons of the root of the CARRY.

To compute the total cost of shifting complete vertices, note that a complete vertex at level i has F(i) descendant leaves. Redistribute the cost of each complete vertex uniformly among its descendant leaves. Since no leaf has more than G(n) ancestors, the cost charged to any leaf is bounded by 2G(n). Hence, the complete cost of moving vertices is O(nG(n)). Since m ≥ n is assumed, we have our result.


COROLLARY. If we may assume m ≤ an for some fixed constant a, then Algorithm 1 is O(nG(n)).

It should be observed that the set merging algorithm can be modified to handle a problem which is in some sense the inverse of set merging. Initially given a set consisting of the integers {1, 2, ..., n}, execute a sequence of two types of instructions:

(i) PARTITION(i), (ii) FIND(i),

where i is an integer between 1 and n. Each time an instruction of the form PARTITION(i) occurs, replace the set S containing i by two sets:

S_1 = {j | j ∈ S and j ≤ i},

S_2 = {j | j ∈ S and j > i}.

Each set is named by the largest integer in the set. Each time an instruction of the form FIND(i) occurs, print the name of the set currently containing the integer i. To handle this partitioning problem, use the same data structure as before, but start with a single tree with n leaves. The leaves in order from left to right correspond to the integers from 1 to n. To execute PARTITION(i), start at the leaf corresponding to the integer i and traverse a path to the root of the tree to partition the tree into two trees.

All vertices to the left of the path are placed on one tree, all vertices to the right of the path are placed on the other. Vertices on the path are replaced by two vertices, one for each subtree, unless i is the rightmost descendant leaf of the vertex, in which case the vertex is placed on the left subtree. Assume that vertices v and w are on the path and w is a son of v. The sons of v are partitioned as follows. Simultaneously, start counting the sons of v to the left of w, including w, starting with the leftmost son of v, and start counting the sons of v to the right of w. Cease counting as soon as one of the two counts is complete. (The reason for the simultaneous counting is to make the cost of counting proportional to the smaller of the two counts.) The sons of v can now be partitioned at a cost proportional to the smaller of the two sets by moving the smaller number of sons.
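The simultaneous counting can be sketched as follows (our own illustration, with a hypothetical helper name): scan the son list inward from both ends at once, so the work is proportional to the smaller of the two groups being split off.

def split_sons(sons, w_index):
    """Decide which of the two groups, sons[:w_index + 1] (up to and including w)
    or sons[w_index + 1:], is smaller, spending time proportional to the smaller
    one.  Counting from both ends at once never scans the larger side to its end."""
    lo, hi = 0, len(sons) - 1
    while True:
        if lo > w_index:                       # finished counting the left group first
            return "left", sons[:w_index + 1]
        lo += 1
        if hi <= w_index:                      # finished counting the right group first
            return "right", sons[w_index + 1:]
        hi -= 1

The caller then moves only the group returned, so the cost of the whole split stays proportional to the smaller of the two sets of sons.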

The analysis of the running time is similar to that of the set merging algorithm, and thus only a brief sketch is given. Note that a vertex can have at most two incomplete sons, only one of which can be moved in the execution of any PARTITION instruction. Thus at most G(n) incomplete vertices are moved in executing one PARTITION instruction.

To bound the cost of moving a complete vertex, note that each time a vertex is moved, its new father has at most half as many sons as the old father. Thus a vertex at level i can be moved at most F(i) times. Since a complete vertex at level i has F(i) leaves, distributing to its leaves the cost of all moves of a given vertex while it is complete gives at most a cost of one to each of its leaves. Since a leaf has at most G(n) ancestors, the cost of moving all complete vertices is bounded by nG(n).

3. The second set merging algorithm. We now consider a second algorithm to simulate a sequence of MERGE and FIND instructions.

This algorithm also uses a tree data structure to represent a set. But here, all vertices of the tree, rather than just the leaves, correspond to elements in the set. Moreover, tree links point only from son to father.


ALGORITHM 2.

1. Initially each set {i}, 1 ≤ i ≤ n, is represented by a tree consisting of a single vertex with the name of the set at the vertex.

2. To merge two sets, the corresponding trees are merged by making the root of the tree with fewer vertices a son of the root of the other tree. Ties can be broken arbitrarily. Attach the appropriate name to the remaining root.

3. To execute FIND(i), follow the path from vertex i to the root of its tree and print the name of the set. Make each vertex encountered on the path a son of the root. (This is where the algorithm differs substantially from balanced tree schemes!)
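A compact Python rendering of Algorithm 2 (our own sketch; the class name and bookkeeping arrays are illustrative). Step 2 is the union by size, step 3 is the FIND with compression of the traversed path; MERGE(i, j) follows the paper's convention that the union is called S_j and assumes i and j name distinct sets.

class MergeFind:
    def __init__(self, n):
        self.father = [0] * (n + 1)           # 0 means the vertex is a root
        self.count = [1] * (n + 1)            # number of vertices in the tree, kept at roots
        self.name_of_root = {i: i for i in range(1, n + 1)}
        self.root_of_name = {i: i for i in range(1, n + 1)}

    def merge(self, i, j):
        """Step 2: make the root of the smaller tree a son of the other root,
        then attach the name j to the surviving root."""
        ri, rj = self.root_of_name.pop(i), self.root_of_name[j]
        if self.count[ri] > self.count[rj]:
            ri, rj = rj, ri                    # ri is now the root of the smaller tree
        self.father[ri] = rj
        self.count[rj] += self.count[ri]
        del self.name_of_root[ri]
        self.name_of_root[rj] = j
        self.root_of_name[j] = rj

    def find(self, i):
        """Step 3: walk from vertex i to the root, print the set name, and make
        every vertex encountered on the path a son of the root."""
        path, v = [], i
        while self.father[v] != 0:
            path.append(v)
            v = self.father[v]
        for u in path:                         # path compression
            self.father[u] = v
        print(self.name_of_root[v])
        return self.name_of_root[v]

mf = MergeFind(3)
mf.merge(1, 2)     # S_2 = {1, 2}
mf.merge(2, 3)     # S_3 = {1, 2, 3}
mf.find(1)         # prints 3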

The above algorithm, except for the balancing feature of merging small trees into large, was suggested by Knuth [7] and is attributed by him to Tritter. The entire algorithm, including this feature, was implemented by McIlroy [2] and Morris in a spanning tree algorithm. The analysis of the algorithm without the balancing feature was completed by Fischer [8], who showed that O(n log n) is a lower bound, and Paterson [9], who showed it to be an upper bound. Our analysis shows that the algorithm with the balancing scheme is O(nG(n)) at worst. Thus it is substantially better than the one without the balancing.

We now introduce certain concepts needed to analyze Algorithm 2. Fix a sequence of MERGE and FIND instructions, and let v be a vertex. Define the rank of v, with respect to this sequence, denoted R(v), as follows. If in the execution of the sequence by Algorithm 2, v never receives a son, then R(v) = 0. If v receives sons v_1, v_2, ..., v_k at any time during the execution, then the rank of v is max{R(v_i)} + 1. It is not hard to prove that the rank of v is equal to the length of the longest path from v to a descendant leaf in the tree that would occur if the FIND instructions and their attendant movement of vertices were ignored in the execution of the sequence.
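The equivalent characterization just stated suggests a direct way to compute ranks for a given instruction sequence (a sketch under our own naming): replay only the MERGEs, with the same union-by-size rule, and read off the height of each vertex.

def ranks(n, instructions):
    """Ranks via the characterization above: FINDs never change set sizes, hence
    never change which tree is merged into which, so they can be ignored.  The
    rank of v is then its height (longest path down to a descendant leaf)."""
    count = [1] * (n + 1)
    height = [0] * (n + 1)
    root_of_name = {i: i for i in range(1, n + 1)}
    for op, *args in instructions:
        if op != "MERGE":
            continue
        i, j = args                                # set names, as in MERGE(i, j)
        ri, rj = root_of_name.pop(i), root_of_name[j]
        if count[ri] > count[rj]:                  # union by size, ties broken as in the sketch above
            ri, rj = rj, ri
        count[rj] += count[ri]
        height[rj] = max(height[rj], height[ri] + 1)
        root_of_name[j] = rj
    return {v: height[v] for v in range(1, n + 1)}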

LEMMA 3. If the rank of v with respect to the given sequence is r, then at some time during the execution of the sequence, v was the root of a tree with at least 2^r vertices.

Proof. The proof is by induction on the value of r. The case r = 0 is trivial. Assume the lemma to be true for all values up to r - 1, r ≥ 1. If v is of rank r, at some point v must become the father of some vertex u of rank r - 1. This event could not occur during a FIND, for if it did, then u would have previously been a descendant of v with some vertex w between them. But then w is of rank at least r and v of rank at least r + 1 by the definition of rank.

Thus, u must be the root of a tree T which is merged with a tree T' having root v. By the inductive hypothesis, T has at least 2^{r-1} vertices, since a root cannot lose descendants and a nonroot cannot gain descendants, and hereafter u will no longer be a root. By step 2 of Algorithm 2, T' has at least as many descendants as T. The resulting tree has at least 2^r vertices and has v as root.

LEMMA 4. Each time a vertex v gets a new father w during the execution of Algorithm 2, that father has a rank higher than any previous father u ≠ w of v.

Proof. If v, formerly a son of u, becomes a son of w, it must be during a FIND. Then w is an ancestor of u and hence of higher rank by definition.

LEMMA 5. For each vertex v and rank r, there is at most one vertex u of rank r which is ever an ancestor of v.

Proof. Suppose there were two such u's, say u_1 and u_2. Assume, without loss of generality, that v first becomes a descendant of u_1. Then u_2 is of higher rank than u_1 by Lemma 4 and the fact that at all times, paths up trees are of monotonically increasing rank. This contradicts the assumption that u_1 and u_2 were of rank r.


LEMMA 6. There are at most n/2^r vertices of rank r.

Proof. Each vertex of rank r at some point is the root of a tree with at least 2^r vertices by Lemma 3. No vertex could be the descendant of two vertices of rank r by Lemma 5. This implies that there are at most n/2^r vertices of rank r.

For j ≥ 1, define a_j, the jth rank group,¹ as follows:

a_j = {v | log^{j+1}(n) < R(v) ≤ log^{j}(n)}.

Note that the higher the rank group, the lower the ranks of the vertices it contains.

LEMMA 7. |a_j| ≤ 2n/log^{j}(n).

Proof. Since there are at most n/2^r vertices of rank r, we have

|a_j| ≤ sum over log^{j+1}(n) < r ≤ log^{j}(n) of n/2^r ≤ (n/2^{log^{j+1}(n)})(1 + 1/2 + 1/4 + ···) = (n/log^{j}(n))(1 + 1/2 + 1/4 + ···) < 2n/log^{j}(n),

since 2^{log^{j+1}(n)} = log^{j}(n).
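For concreteness, here is a small Python sketch (ours, with hypothetical helper names) of the iterated logarithm log^{j}(n) used here (base 2, with log^{0}(n) = n, as in the footnote below) and of the rank group into which a given rank falls.

import math

def iterated_log(n, j):
    """log^j(n): apply the base-2 logarithm j times to n (log^0(n) = n), taking
    the logarithm of a non-positive number to be minus infinity."""
    x = float(n)
    for _ in range(j):
        x = math.log2(x) if x > 0 else float("-inf")
    return x

def rank_group(n, r):
    """Smallest j >= 1 with r > log^{j+1}(n), i.e., the group a_j containing
    rank r (actual ranks satisfy r <= log n, as noted in the proof of Lemma 8)."""
    j = 1
    while r <= iterated_log(n, j + 1):
        j += 1
    return j

# With n = 2**16: ranks in (4, 16] fall in a_1, ranks in (2, 4] in a_2, etc.
print(rank_group(2**16, 10))   # 1
print(rank_group(2**16, 3))    # 2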

LEMMA 8. Each vertex is in some a_j for 1 ≤ j ≤ G(n) + 1.

Proof. By Lemma 6, no vertex has rank greater than log n, so j ≥ 1 may be assumed. Thus to prove j ≤ G(n) + 1, we need only show that log^{G(n)+2}(n) < 0. From the definition of G(n), n ≤ F(G(n)), and so

log^{G(n)+2}(n) ≤ log^{G(n)+2}(F(G(n))).

Thus it suffices to show that log^{i+2}(F(i)) < 0. We shall actually show that log^{i+2}(2F(i)) ≤ 0. The proof is by induction on i. For i = 0, the result is obvious. Assume the induction hypothesis true up to i - 1. Then log^{i+2}(2F(i)) ≤ log^{i+1}(1 + log F(i)) = log^{i+1}(1 + log F(i-1) + F(i-1)), which is less than or equal to log^{i+1}(2F(i-1)), since F(i-1) ≥ 1 for i ≥ 1 and since log x < x for integer x ≥ 1. Thus by the inductive hypothesis, log^{i+2}(2F(i)) ≤ 0. From this it follows that log^{i+2}(F(i)) < 0, and the proof is complete.

THEOREM 2. Given n ≥ 2 objects, the time necessary to execute any sequence of m ≥ n MERGE and FIND instructions on these objects by Algorithm 2 is O(mG(n)).

Proof. The cost of the MERGE instructions is clearly O(n). The cost of executing all FIND instructions is proportional to the number of instructions plus the sum, over all FIND's, of the number of vertices moved by that FIND. We now show that this sum is bounded by O(mG(n)).

Let v be a vertex in a_j. If before a move of v the father of v is in a lower rank group (smaller value of j), assign the cost to the FIND instruction. Otherwise, assign the cost to v itself. For an instruction FIND(i), consider the path from vertex i to the root of its tree. The ranks of vertices along the path to the root are monotonically increasing, and hence there can be on the path at most G(n) vertices whose fathers are in a lower rank group. Hence no FIND instruction is assigned a cost more than G(n).

By Lemma 7, there are at most 2n/log^{j}(n) vertices in a_j, and by Lemma 4, each vertex in a_j can be moved at most log^{j}(n) times before its new father is in a_{j-1} or a lower rank group. Thus, the total cost of moving vertices in a_j, not counting moves of a vertex whose father is in a lower rank group, is 2n. Since there are at most G(n) + 1 rank groups, the total cost exclusive of that charged to FIND's is O(nG(n)). Hence, the total cost of executing the sequence of MERGE and FIND instructions is O(mG(n)).

¹ log^{0}(n) is n and log^{j+1}(n) = log(log^{j}(n)) = log^{j}(log(n)). Base 2 logarithms are assumed throughout.


COROLLARY. If m ≤ an for a fixed constant a, then Algorithm 2 requires O(nG(n)) time.

4. An application. One application of the set merging algorithms is to process a sequence of instructions of the forms INSERT(i), 1 ≤ i ≤ n, and MIN. Start with a set S which is initially the empty set. Each time an instruction INSERT(i) is encountered, adjoin the integer i to the set S. Each time a MIN instruction is executed, delete the minimum element from the set S and print it. We assume that for each i, the instruction INSERT(i) appears at most once in the sequence of instructions, and at no time does the number of MIN instructions executed exceed the number of INSERT instructions executed. Note that as a special case, we could sort k integers from one to n by executing INSERT instructions for each integer, followed by k MIN instructions.

The algorithm which we shall give for this problem is off-line, in the sense that the entire sequence of instructions must be present before processing can begin. In contrast, Algorithms 1 and 2 are on-line, as they can execute instructions without knowing the subsequent instructions with which they will be presented.

Let I_1, I_2, ..., I_r be the sequence of INSERT and MIN instructions to be executed. Note that r ≤ 2n. Let l of the instructions be MIN. We shall set up a MERGE-FIND problem whose on-line solution will allow us to simulate the INSERT-MIN instructions. We create l objects M_i, 1 ≤ i ≤ l, where M_i "represents" the ith MIN instruction. We also create n objects X_i, 1 ≤ i ≤ n, where X_i "represents" the integer i. Suppose, in addition, that there are two arrays which, given i, enable us to find M_i or X_i in a single step. The following algorithm determines, for each i, that MIN instruction, if any, which causes i to be printed. Once we have that information, a list of the integers printed by I_1, I_2, ..., I_r is easily obtained in O(n) steps.

ALGORITHM 3.

1. Initially use MERGE instructions to create the following sets:

(i) S_i, 2 ≤ i ≤ l, consists of object M_{i-1} and all those X_j such that INSERT(j) appears between the (i-1)st and ith MIN instructions.

(ii) S_1 consists of all X_j such that INSERT(j) appears prior to the first MIN.

(iii) S_0 consists of M_l and all X_j's not placed by (i) or (ii).

2. for i ← 1 until n do

begin
    FIND(X_i);²
    let X_i be in S_j;
    if j ≠ 0 then
    begin
        TIME(i) ← j;
        FIND(M_j);
        let M_j be in S_k;
        MERGE(S_j, S_k)
    end
end

² We are taking the liberty of using X_i, M_i and S_i as arguments of FIND and MERGE, rather than integers, as these instructions were originally defined. It is easy to see that objects and set names could be indexed, so no confusion should arise.
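A Python sketch of the whole reduction (ours; run_insert_min and the tuple labels ("X", j), ("M", i) are illustrative, not from the paper). It builds the step 1 sets with plain set unions standing in for the MERGE instructions, runs the loop of step 2, and recovers the printed integers from TIME.

def run_insert_min(n, instructions):
    """Off-line simulation of INSERT/MIN via Algorithm 3.  instructions is a list
    such as [("INSERT", 3), ("MIN",), ...].  A naive dictionary-based MERGE/FIND
    is inlined for clarity; substituting Algorithm 1 or 2 gives the O(nG(n))
    bound established below."""
    mins = [pos for pos, ins in enumerate(instructions) if ins[0] == "MIN"]
    l = len(mins)

    # Step 1.  S_i (2 <= i <= l) holds M_{i-1} and every X_j whose INSERT(j) lies
    # between the (i-1)st and ith MIN; S_1 holds the X_j inserted before the
    # first MIN; S_0 holds M_l and every X_j not placed above.
    sets = {i: set() for i in range(l + 1)}
    for i in range(2, l + 1):
        sets[i].add(("M", i - 1))
    if l >= 1:
        sets[0].add(("M", l))
    insert_pos = {ins[1]: pos for pos, ins in enumerate(instructions) if ins[0] == "INSERT"}
    for j in range(1, n + 1):
        later = [k + 1 for k, p in enumerate(mins) if j in insert_pos and p > insert_pos[j]]
        sets[later[0] if later else 0].add(("X", j))

    def find(obj):                       # FIND: name of the set containing obj
        return next(name for name, s in sets.items() if obj in s)

    def merge(a, b):                     # MERGE(S_a, S_b): the union is called S_b
        sets[b] |= sets[a]
        sets[a] = set()

    # Step 2.
    TIME = {}
    for i in range(1, n + 1):
        j = find(("X", i))
        if j != 0:
            TIME[i] = j                  # i is printed by the jth MIN
            k = find(("M", j))
            merge(j, k)

    # Recover the printed order: under the stated assumptions every MIN prints
    # something, so each j in 1..l appears exactly once as a value of TIME.
    printed = {j: i for i, j in TIME.items()}
    return [printed[j] for j in range(1, l + 1)]

# INSERT(3), INSERT(1), MIN, INSERT(2), MIN  prints 1 and then 2.
print(run_insert_min(3, [("INSERT", 3), ("INSERT", 1), ("MIN",),
                         ("INSERT", 2), ("MIN",)]))    # [1, 2]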


The strategy behind Algorithm 3 is to let M_k after the ith iteration of step 2 lie in that set S_j such that the jth MIN instruction is the first one following the kth MIN having the property that none of the integers up to i is printed by the jth MIN. Likewise, X_m will be in S_j if and only if the jth MIN is the first MIN following INSERT(m) which does not print any integer up to i. We can formalize these ideas in the next lemma.

LEMMA 9. (a) After the ith iteration of step 2 in Algorithm 3, S_j, j ≠ 0, contains X_m (resp. M_k) if and only if j is the smallest number such that the jth MIN follows INSERT(m) (resp. the kth MIN), and none of 1, 2, ..., i is printed by the jth MIN.

(b) TIME(i) ← j by Algorithm 3 if and only if i is printed by the jth MIN.

Proof. We show (a) and (b) simultaneously by induction on i. The basis, i = 0, is true by step 1 of Algorithm 3. Part (b) of the induction holds, since if X_i is in S_j when the ith iteration begins, it must be that the jth MIN follows INSERT(i) but does not cause any integer smaller than i to be printed. However, by hypothesis, any MIN's between INSERT(i) and the jth MIN do print out smaller integers. Thus i is the smallest integer available when the jth MIN is executed.

For part (a) of the induction, we need only observe that the set in which an X_m or M_k belongs does not change at iteration i unless it was in S_j. Then, since the jth MIN prints i, Algorithm 3 correctly merges S_j with the set containing M_j, that is, the set of the next available MIN (or S_0 if no further MIN's are available).

THEOREM 3. A sequence of INSERT and MIN instructions on integers up to n can be simulated off-line in O(nG(n)) time.

Proof. We use Algorithm 3 to generate a sequence of MERGE and FIND instructions. At most 2n MERGE's are needed in step 1 and at most 2n FIND's and n MERGE's are needed in step 2, a total of at most 5n instructions. We can thus, by the corollary to either Theorem 1 or 2, simulate this sequence on-line in O(nG(n)) steps. In doing so, we shall obtain the array TIME(i). The order in which integers are printed by the MIN instructions can be obtained in a straightforward manner from TIME in O(n) steps.

Acknowledgment. We are indebted to M. J. Fischer for bringing to our attention an error in an earlier analysis of the second algorithm which attempted to show that the time bound was linear.

REFERENCES

[1] B. A. GALLER AND M. J. FISCHER, An improved equivalence algorithm, Comm. ACM, 7 (1964), pp. 301-303.

[2] M. D. MCILROY, Private communication, Sept. 1971.

[3] R. E. STEARNS AND D. J. ROSENKRANTZ, Table machine simulation, 10th SWAT, 1969, pp. 118-128.

[4] R. E. STEARNS AND P. M. LEWIS, Property grammars and table machines, Information and Control, 14 (1969), pp. 524-549.

[5] D. E. KNUTH, The Art of Computer Programming, vol. III, Addison-Wesley, Reading, Mass., 1973.

[6] R. E. TARJAN, On the efficiency of a good but not linear set union algorithm, Tech. Rep. 72-148, Dept. of Comp. Sci., Cornell University, Ithaca, N.Y., 1972.

[7] D. E. KNUTH, The Art of Computer Programming, vol. II, Addison-Wesley, Reading, Mass., 1969. See also L. Guibas and D. Plaisted, Some combinatorial research problems with a computer science flavor, Combinatorial Seminar, Stanford Univ., Stanford, Calif., 1972.

[8] M. J. FISCHER, Efficiency of equivalence algorithms, Complexity of Computer Computations, Miller et al., eds., Plenum Press, New York, 1972, pp. 153-167.

[9] M. PATERSON, unpublished report, University of Warwick, Coventry, Great Britain.
