
J Comb Optim, DOI 10.1007/s10878-013-9622-z

A greedy randomized adaptive search procedure with path relinking for the shortest superstring problem

Theodoros Gevezes · Leonidas Pitsoulis

© Springer Science+Business Media New York 2013

Abstract The shortest superstring problem (SSP) is an NP-hard combinatorial optimization problem which has attracted the interest of many researchers, due to its applications in computational molecular biology problems such as DNA sequencing, and in computer science problems such as string compression. In this paper a new heuristic algorithm for solving large scale instances of the SSP is presented, which outperforms the natural greedy algorithm in the majority of the tested instances. The proposed method is able to provide multiple near-optimum solutions and admits a natural parallel implementation. Extended computational experiments on a set of SSP instances with known optimum solutions indicate that the new method finds the optimum solution in most of the cases, and its average error relative to the optimum is close to zero.

Keywords Combinatorial optimization · DNA sequencing · Data compression · Heuristics · GRASP · Path relinking

1 Introduction

Given a set of strings over an alphabet, the shortest (common) superstring problem (SSP) consists in finding a shortest string that contains all the strings in the set. Let Σ be a finite alphabet of cardinality |Σ| ≥ 2, since for |Σ| = 1 the problem is trivial. Given two strings si and sj over Σ, the second is a substring of the first if si contains |sj| consecutive characters that match sj exactly, where |s| denotes the length of string s. Given a set S = {s1, s2, . . . , sn} of strings over Σ, s is a superstring of S if every si ∈ S is a substring of s. The SSP is the problem of finding a minimum-length superstring of S; such a string may not be unique.

T. Gevezes (B) · L. Pitsoulis
Department of Mathematical, Physical and Computational Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
e-mail: [email protected]

L. Pitsoulis
e-mail: [email protected]

The SSP has several important applications in various scientific domains. In computational molecular biology, DNA sequencing via fragment assembly can be formulated as an SSP (Armen and Stein 1995b; Bains and Smith 1988; Li 1990). In virology and immunology, the SSP models the compression of viral genomes (Ilie and Popescu 2006; Ilie et al. 2006); viruses compress their genome by overlapping genes to reduce space. In computer science, information technology and data transmission, the SSP can be used to achieve data compression (Gallant et al. 1980; Mayne and James 1975; Storer and Szymanski 1978, 1982). For example, the words of a text can be represented by the shortest superstring of the word set together with two labels for each word, one for its starting point and one for its ending point in the superstring, thereby compressing the text. In scheduling, SSP solutions can be used to coordinate machine starting times to solve the flow shop and open shop problems (Middendorf 1998), which have a mathematical structure quite similar to a variant of the SSP.

DNA is a double-stranded sequence of four types of nucleotides: adenine, cytosine, guanine and thymine, and can therefore be viewed as a string over the alphabet {a, c, g, t}. The sequencing assembly problem in molecular biology is to determine the string of a DNA molecule. Due to laboratory constraints, only parts of a few hundred nucleotides can be read reliably by current methods, while the length of the DNA molecule in many species is much longer. To recognize a long DNA sequence, many copies of the DNA molecule are made and cut into smaller overlapping pieces (fragments) that can be read at once. To reconstruct the initial DNA molecule, these fragments must be reassembled, a process where the use of a computer is necessary due to the huge amount of data. Intuitively, shortest superstrings of these fragments preserve important biological structures, and in practice they prove to be good models of the original DNA sequence. Thus, the SSP is considered an abstraction of the sequencing assembly problem. Similar problems arise in reconstructing RNA molecules or proteins from sequenced fragments.

The SSP is known to be an NP-hard problem (Garey and Johnson 1990) and therefore an efficient algorithm for solving it is unlikely to exist, unless P = NP. Furthermore, the SSP is a MAX SNP-hard problem (Blum et al. 1991), or equivalently APX-hard modulo PTAS-reduction (Ott 1999). This implies that it can be approximated in polynomial time within some constant factor since it is in MAX SNP, but there is no polynomial time approximation scheme (PTAS) for this problem, unless P = NP (Arora et al. 1998). So, there exists an ε > 0 such that it is NP-hard to solve the shortest superstring problem within an approximation factor of 1 + ε.

Due to the approximation complexity of the SSP, a plethora of approximation algorithms that use different variations of the greedy strategy have been developed for it (Armen and Stein 1995a, 1996; Blum et al. 1991; Breslauer et al. 1997; Czumaj et al. 1994; Kaplan and Shafrir 2005; Kosaraju et al. 1994; Li 1990; Sweedyk 1999; Teng and Yao 1993), placing emphasis on the achieved approximation ratio rather than on the quality of the obtained solution. The best known approximation algorithm finds a string whose length is at most 2.5 times the length of an optimum superstring (Sweedyk 1999).

On the other hand, only a few heuristics have been implemented for the SSP, namely (Goldberg and Lim 2001; López-Rodríguez and Mérida-Casermeiro 2009; Tarhio and Ukkonen 1988; Zaritsky and Sipper 2004). Notice that there exists a number of methods for solving simplified variants of the SSP, such as DNA sequencing methods which usually require human intervention, or algorithms that find superstrings for a subset of the given strings, eliminating some of them for efficiency purposes. A possible reason for the small number of heuristic algorithms for the SSP is the good performance of the natural greedy algorithm. The natural greedy algorithm repeatedly merges the pair of distinct strings in the set S with the maximum overlap until only one string remains in S, which is the output of the algorithm. Blum et al. proved that the natural greedy algorithm is a 4-approximation algorithm for the SSP (Blum et al. 1991). Kaplan and Shafrir reduced this approximation ratio to 3.5 (Kaplan and Shafrir 2005). However, its average performance is much better than its proven approximation ratio. Its good experimental performance may be explained by the proof that it is asymptotically optimal on random instances (Frieze and Szpankowski 1998; Yang and Zhang 1999). Moreover, for any given instance of the SSP, the average approximation ratio of the natural greedy algorithm over small random perturbations of the letters of the instance data is 1 + o(1) (Ma 2008). Computational results show that the superstring obtained by the natural greedy algorithm is only 1 % longer than a shortest superstring on average (Tarhio and Ukkonen 1988). The purpose of this paper is to present a heuristic that produces consistently better superstrings than the natural greedy algorithm in comparable computational time.

GRASP (Greedy randomized adaptive search procedure) is an iterative metaheuristic for combinatorial optimization, which is implemented as a multi-start procedure where each iteration is made up of a construction phase and a local search phase. The first phase constructs a randomized greedy solution, while the second phase starts at this solution and applies repeated improvement until a locally optimal solution is found. The best solution over all iterations is kept as the final result. GRASP has a strong intuitive appeal, a prominent empirical track record, and is easily implementable on parallel processors. Since the late 1980s GRASP has been applied to a wide range of operations research and industrial optimization problems. These include problems in scheduling, routing, logic, partitioning, location and layout, graph theory, assignment, manufacturing, transportation, telecommunications, automatic drawing, electrical power systems, and VLSI design. This paper extends this list by adding a problem in computational biology. A survey on GRASP can be found in (Pitsoulis and Resende 2002), while an annotated bibliography is given in (Festa and Resende 2002). Lately, as a common practice, heuristics like GRASP are combined with other methods that utilize the initial results in an attempt to find better solutions. Path relinking is one such method. It was first introduced in the context of tabu search (Glover and Laguna 1997) as an approach to integrate intensification and diversification strategies by exploring trajectories that connect high quality solutions. See (Glover et al. 2000) for a survey of the method. Path relinking in the context of GRASP was first introduced in (Laguna and Martí 1999) as a memory mechanism for utilizing information on previously found good solutions. Recent advances and applications of GRASP with path relinking are presented in (Resende and Ribeiro 2005).

In this paper an implementation of GRASP with path relinking for solving the SSP is presented. The proposed method:

– outperforms the natural greedy algorithm in the majority of the cases, even for large SSP instances,
– solves large scale SSP instances in the order of thousands of strings,
– provides multiple near-optimum solutions, which is of importance for DNA sequencing,
– lastly, admits a natural parallel implementation.

Additionally, a new integer programming formulation for the SSP is presented, and it is used to create a set of SSP instances with known optimal solutions. To the best of our knowledge, this benchmark set of instances is the first one that appears in the literature.

2 Preliminaries

2.1 Shortest superstrings

An instance of the SSP is specified by a set S = {s1, . . . , sn} of strings over a finite alphabet Σ. Without loss of generality, S is defined to be a set, since for a collection S in which some strings appear more than once, S has exactly the same set of superstrings as the set S′ = {s : s ∈ S}. Also, it is assumed that S is a substring free set, that is, no string si ∈ S is a substring of any other string sj ∈ S. This assumption can also be made without loss of generality, because for any set of strings there exists a unique substring free set of strings with the same set of superstrings.
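
As an illustration of this reduction, a minimal C sketch that keeps only the strings of S that are not substrings of any other string (the function name make_substring_free is ours, not part of the paper's code):

#include <stdlib.h>
#include <string.h>

/* Reduce S in place to a substring free set: drop every string that occurs
   inside another string of S; for equal strings, keep only the first copy.
   Returns the new number of strings.  Sketch under the stated assumptions. */
static int make_substring_free(char **S, int n)
{
    int *drop = calloc(n, sizeof *drop);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n && !drop[i]; j++)
            if (i != j && !drop[j] && strstr(S[j], S[i]) != NULL)
                drop[i] = (strcmp(S[i], S[j]) != 0) || (j < i);
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (!drop[i])
            S[kept++] = S[i];
    free(drop);
    return kept;
}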

Let si and sj be two strings over the alphabet Σ. The overlap string between si and sj, in this specific order, is the longest string v over Σ such that si = uv and sj = vw for some non-empty strings u, w over Σ. In other words, v is the longest string that is a proper suffix of si and a proper prefix of sj. The overlap string between si and sj is denoted by o(si, sj), while its length |o(si, sj)| is called the overlap. The merge string of si and sj is the concatenation of these two strings with the overlap appearing only once, that is uvw, and is denoted by m(si, sj). Note that |m(si, sj)| = |si| + |sj| − |o(si, sj)|.
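
For concreteness, a minimal C sketch of the two operations on plain null-terminated strings (overlap_len and merge_strings are our names, not the paper's):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* |o(a, b)|: length of the longest proper suffix of a that is also a proper prefix of b. */
static size_t overlap_len(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    if (la == 0 || lb == 0)
        return 0;
    size_t max = (la < lb ? la : lb) - 1;          /* proper: strictly shorter than both */
    for (size_t k = max; k > 0; k--)
        if (strncmp(a + la - k, b, k) == 0)
            return k;
    return 0;
}

/* m(a, b): a followed by b with the overlap written only once. */
static char *merge_strings(const char *a, const char *b)
{
    size_t k = overlap_len(a, b);
    size_t la = strlen(a), lb = strlen(b);
    char *m = malloc(la + lb - k + 1);
    memcpy(m, a, la);
    memcpy(m + la, b + k, lb - k + 1);             /* also copies the trailing '\0' */
    return m;
}

int main(void)
{
    char *m = merge_strings("gacta", "ctaag");
    printf("o = %zu, m = %s\n", overlap_len("gacta", "ctaag"), m);  /* o = 3, m = gactaag */
    free(m);
    return 0;
}

The example agrees with |m(si, sj)| = |si| + |sj| − |o(si, sj)| = 5 + 5 − 3 = 7.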

Let OPT(S) be a shortest superstring of a set S. The goal of any heuristic algorithm is to find a superstring of S whose length is as close to |OPT(S)| as possible. Let Πn be the set of all permutations of the finite set {1, . . . , n}; any solution of the SSP can then be represented as a permutation p ∈ Πn, indicating the order in which the strings from S must be merged to obtain the superstring. Given a superstring s of a string set S, a permutation that indicates the order in which the strings in S are merged is denoted by permutation(S, s). Given a permutation p ∈ Πn and a set S of strings, the resultant superstring is superstring(S, p) = m(s_{p(1)}, m(s_{p(2)}, . . . , m(s_{p(n−1)}, s_{p(n)}))) and its length is equal to

|superstring(S, p)| = \sum_{i=1}^{n} |s_i| - \sum_{i=1}^{n-1} |o(s_{p(i)}, s_{p(i+1)})|.    (1)

Therefore, the SSP can be formulated as

|OPT(S)| = \min_{p \in \Pi_n} \left( \sum_{i=1}^{n} |s_i| - \sum_{i=1}^{n-1} |o(s_{p(i)}, s_{p(i+1)})| \right).    (2)

Note that the minimum length superstring is achieved when the sum of the overlaps between consecutive strings, in the order defined by p, is maximized.
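
Equation (1) translates directly into code; a C sketch, reusing the hypothetical overlap_len helper from the previous snippet (perm holds the indices p(1), . . . , p(n) as 0-based values):

#include <string.h>

/* |superstring(S, p)| computed via Eq. (1). */
static size_t superstring_length(char **S, const int *perm, int n)
{
    size_t total = 0;
    for (int i = 0; i < n; i++)
        total += strlen(S[perm[i]]);                       /* sum of the string lengths */
    for (int i = 0; i + 1 < n; i++)
        total -= overlap_len(S[perm[i]], S[perm[i + 1]]);  /* minus consecutive overlaps */
    return total;
}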

2.2 Integer programming formulation for the SSP

An integer programming formulation for the SSP is presented, both to illustrate the problem and to be used for finding exact solutions for comparison purposes. Given a set S = {s1, s2, . . . , sn} of strings, the graph G = (V, E, w) with vertex set V = {s1, s2, . . . , sn}, arc set E = {(si, sj) : si, sj ∈ V, i ≠ j} and arc weights w((si, sj)) = |o(si, sj)| for (si, sj) ∈ E, is called the overlap graph of S. The optimum solution to an SSP instance is equivalent to the optimum solution to the longest Hamiltonian path (LHP) problem in the corresponding overlap graph, since the latter contains all the vertices (strings) ordered in a single path so that the sum of the pairwise overlaps between consecutive strings is maximized.

There is a simple relation between the LHP problem and the longest Hamiltonian cycle (LHC) problem. An LHP problem on a graph G is equivalent to an LHC problem on a new graph G′ obtained from G by adding a new node (say node s0) and new bidirectional arcs with zero weights that connect the new node with every node in the original graph. In particular, w′_{ij} = w_{ij} = |o(si, sj)| for all i, j ∈ {1, 2, . . . , n} and w′_{ij} = 0 for i = 0 or j = 0. An LHC in G′ contains an LHP in G, and thus an optimum solution to the associated SSP. In this way the integer programming formulation of the LHC problem given in (Miller et al. 1960) can be used for the SSP:

\max \quad \sum_{i=0}^{n} \sum_{j=0}^{n} w'_{ij} x_{ij}    (3)

\text{s.t.} \quad \sum_{i=0, i \neq j}^{n} x_{ij} = 1, \quad j = 0, 1, 2, \ldots, n    (4)

\sum_{j=0, j \neq i}^{n} x_{ij} = 1, \quad i = 0, 1, 2, \ldots, n    (5)

x_{ij} \in \{0, 1\}, \quad i, j = 0, 1, \ldots, n    (6)

u_i - u_j + 1 \le n(1 - x_{ij}), \quad 1 \le i \neq j \le n    (7)

u_i \ge 0, \quad i = 0, 1, 2, \ldots, n.    (8)


Constraints (4)–(6) (assignment constraints) indicate that X = (x_{ij}) must be a permutation matrix with no assignments on its main diagonal, and constraints (7)–(8) (arc constraints) require the solution to be a Hamiltonian cycle. The arc constraint (7) for a pair of nodes i, j ≠ 0 forces u_j ≥ u_i + 1 when x_{ij} = 1; therefore, if a feasible solution contained more than one cycle, at least one of these cycles would not contain node 0, and along this cycle the u_i values would lead to a contradiction. This integer programming formulation is used in Sect. 4 to construct a set of benchmark SSP instances with known optimum solutions.
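
As a small illustration of the transformation from the overlap graph G to G′, a C sketch that builds the (n + 1) × (n + 1) weight matrix w′ with the zero-weight node 0 added (build_lhc_weights is our name, and it relies on the hypothetical overlap_len helper sketched in Sect. 2.1):

#include <stdlib.h>

/* Weight matrix w' of G': w'[i][j] = |o(s_i, s_j)| for i, j >= 1 with i != j,
   and 0 whenever i = 0 or j = 0 (the added node).  Rows are zero-initialized. */
static size_t **build_lhc_weights(char **S, int n)
{
    size_t **w = malloc((n + 1) * sizeof *w);
    for (int i = 0; i <= n; i++) {
        w[i] = calloc(n + 1, sizeof **w);
        if (i == 0)
            continue;                              /* row of the added node stays zero */
        for (int j = 1; j <= n; j++)
            if (i != j)
                w[i][j] = overlap_len(S[i - 1], S[j - 1]);
    }
    return w;
}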

2.3 GRASP with path relinking

GRASP is an iterative procedure consisting of two phases in each iteration: a construction phase and a local search phase. Iterations are repeated until a stopping criterion, such as a maximum number of iterations or a target value of the objective function, is satisfied, and the best solution among all iterations is kept. GRASP can be viewed as a procedure that samples (greedily-biased) high quality points from the solution space, searching the neighbourhood of each point for a local optimum.

In the first phase, a greedy randomized solution is constructed one element at a time. At each step of the construction phase, a set of candidate elements to be added to the solution is generated. This set is called the restricted candidate list (RCL) and its size is specified by the parameter α of the GRASP algorithm, where α ∈ [0, 1]. If the number of all possible candidate elements is M, then the RCL consists of the αM best elements. One of these elements is selected at random and added to the current partial solution. After each selection, the RCL is updated to include only elements that lead to a feasible solution when added to the current partial solution. The procedure is repeated until a feasible solution is reached. Parameter α determines the greediness versus randomness level in the construction. If α = 0 then the construction phase reduces to a pure greedy procedure, while if α = 1 the solution is constructed randomly.

The second phase of each GRASP iteration is the local search. Given a neighbourhood structure definition, the neighbourhood of a solution is searched in an attempt to identify a better solution. If a better solution exists, the current solution is updated and the procedure is repeated until no further improvement is possible. In every iteration, the input solution to the local search procedure is the solution constructed in the first phase.

Algorithm 1 shows the pseudo-code of a generic GRASP.

Path relinking gives memory to GRASP by retaining previously found solutions and using them as guides to speed up convergence to good-quality solutions. It uses an elite set of high quality solutions found by GRASP. In every iteration of GRASP, a solution of the elite set is selected with respect to some probability distribution, to be combined with the current solution. If the combined solution is better than some of the elite solutions then the elite set is updated. Algorithm 2 shows the pseudo-code of a generic GRASP with path relinking.


Algorithm: grasp-g
input : problem instance
output: a solution to the problem instance

1. while none of the stopping criteria is satisfied do
2.     generate a greedy randomized adaptive solution
3.     find a local optimum in the neighbourhood of this solution
4.     update the best solution
5. end
6. return the best solution

Algorithm 1: Generic GRASP

Algorithm: grasp-pr-g
input : problem instance
output: a solution to the problem instance

1.  while none of the stopping criteria is satisfied do
2.      generate a greedy randomized adaptive solution
3.      find a local optimum in the neighbourhood of this solution
4.      if the elite set is not full then
5.          add the current solution to the elite set
6.      else
7.          select a solution of the elite set at random
8.          execute path relinking for the elite and the current solution
9.          update the elite set
10.     end
11. end
12. return the best solution of the elite set

Algorithm 2: Generic GRASP with path relinking

3 GRASP with path relinking for the SSP

An SSP instance of size n is specified by a substring free set S = {s1, . . . , sn} of strings, and its feasible solution space is the set of the n! different permutations of the strings in S.

3.1 Construction phase

During the GRASP construction phase, a complete permutation is constructed from a partial one. In each step, a set of remaining strings S′ is considered (initially S′ = S) and the number of ordered pairs of distinct strings in S′ is M = |S′|² − |S′|, where |S| denotes the cardinality of a set S. The overlaps between these pairs are computed and the ⌈αM⌉ pairs with the largest overlaps comprise the RCL. From the RCL, one pair of strings, say si and sj, is randomly selected and the set S′ is updated by adding the merge string m(si, sj) and removing each string s′ ∈ S′ that is a substring of m(si, sj), including si and sj, so that the set remains substring free.

At this point, the difference between the greedy algorithm and the construction phase of GRASP should be mentioned. The greedy algorithm also starts from a substring free set of strings and constructs a complete permutation from a partial one. In each step, greedy selects a string pair with the maximum overlap, and after the selection of the strings si and sj in some step, the set S′ is updated by adding the merge string m(si, sj) and removing only the strings si and sj. Algorithm 3 shows the pseudo-code of the greedy algorithm for the SSP.

Algorithm: greedy
input : S = {s1, . . . , sn}
output: a superstring of S

1. S′ = S
2. while |S′| > 1 do
3.     L = {(s′i, s′j) : s′i, s′j ∈ S′, i ≠ j}
4.     k = max{|o(s′i, s′j)| : (s′i, s′j) ∈ L}
5.     let (si, sj) ∈ L be a pair such that |o(si, sj)| = k
6.     S′ = (S′ − {si, sj}) ∪ {m(si, sj)}
7. end
8. let s be the only string in S′
9. return s

Algorithm 3: greedy for the SSP
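
For comparison with the construction phase, a compact C sketch of the natural greedy (our naming; it relies on the overlap_len and merge_strings helpers sketched in Sect. 2.1 and assumes S holds n heap-allocated, substring free strings):

#include <stdlib.h>

static char *greedy_superstring(char **S, int n)
{
    while (n > 1) {
        int bi = 0, bj = 1;
        size_t best = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                if (i == j)
                    continue;
                size_t ov = overlap_len(S[i], S[j]);
                if (ov >= best) { best = ov; bi = i; bj = j; }   /* pair with maximum overlap */
            }
        char *m = merge_strings(S[bi], S[bj]);
        free(S[bi]);
        free(S[bj]);
        S[bi] = m;                  /* the merge string replaces s_bi ... */
        S[bj] = S[n - 1];           /* ... and the last string fills s_bj's slot */
        n--;
    }
    return S[0];
}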

The following theorem shows that, in contrast to the construction phase of GRASP, in each step of the greedy algorithm it is not necessary to search S′ for substrings of m(si, sj).

Theorem 1 The set of the remaining strings in each step of the greedy algorithm remains substring free.

Proof Let S′_i be the set of remaining strings in step i, for i ∈ {1, 2, . . . , n − 1}, where n is the size of the problem. It is shown by induction that S′_i is substring free in all steps of greedy. For i = 1, S′_1 = S and therefore S′_1 is a substring free set. Suppose S′_i is substring free for some i ∈ {1, 2, . . . , n − 2}, and let (si, sj) be the string pair that greedy selects in this step, so that S′_{i+1} = (S′_i − {si, sj}) ∪ {m(si, sj)}. Assume that S′_{i+1} is not substring free. Then there is some sk ∈ S′_{i+1}, sk ≠ m(si, sj), that is a substring of m(si, sj) but not a substring of si or sj, since S′_i is substring free. Thus, the relative alignment of the strings si, sj and sk would be the one shown in Fig. 1. As shown in the figure, |o(si, sk)| > |o(si, sj)| and |o(sk, sj)| > |o(si, sj)|. This means that the greedy algorithm in the previous step did not select a string pair with the maximum overlap, which is a contradiction. Therefore, S′_{i+1} is substring free, and so the set S′_i is substring free for all i ∈ {1, 2, . . . , n − 1}.

On the other hand, Theorem 1 does not hold for the GRASP construction phase, since in each step a string pair with the maximum overlap is not necessarily selected. Assume that the set of remaining strings in some step is S′ and (si, sj) is the pair selected from the RCL; then the set of the remaining strings for the next step would be equal to (S′ − T) ∪ {m(si, sj)}, where T = {s : s ∈ S′, s is a substring of m(si, sj)}.


Fig. 1 The alignment of the strings si, sj, and sk such that sk is a substring of m(si, sj), but not a substring of si or sj

Obviously, si, sj ∈ T. In GRASP it is likely that there exist strings that are removed from S′ without taking part in any merge. After the update of S′ the next step is performed.

The procedure terminates when |S′| = 1. The total number of steps is at most n − 1, and it is exactly n − 1 when, in every construction step, T contains only the two selected strings. When the procedure terminates, the only string s in S′ is a superstring of all the n strings in S, even of those that do not participate in any merge. The permutation(S, s) ∈ Πn is the output of the construction phase. Algorithm 4 shows the pseudo-code of the construction phase of GRASP for the SSP.

Algorithm: constructionPhase
input : S = {s1, . . . , sn}, α
output: p ∈ Πn

1.  S′ = S
2.  while |S′| > 1 do
3.      L = {(s′i, s′j) : s′i, s′j ∈ S′, i ≠ j}
4.      foreach (s′i, s′j) ∈ L do compute the overlap |o(s′i, s′j)|
5.      end
6.      RCL = {the ⌈α|L|⌉ elements of L with the largest overlaps}
7.      select an element (si, sj) ∈ RCL at random
8.      T = {s : s ∈ S′, s is a substring of m(si, sj)}
9.      S′ = (S′ − T) ∪ {m(si, sj)}
10. end
11. let s be the only string in S′
12. return permutation(S, s)

Algorithm 4: Construction phase of GRASP for the SSP
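
A C sketch of the RCL selection in steps 6–7 of Algorithm 4, under the stated assumptions (pair_t, by_overlap_desc and pick_from_rcl are our names):

#include <stdlib.h>
#include <math.h>

typedef struct { int i, j; size_t ov; } pair_t;    /* an ordered pair and its overlap */

static int by_overlap_desc(const void *a, const void *b)
{
    const pair_t *pa = a, *pb = b;
    return (pa->ov < pb->ov) - (pa->ov > pb->ov);  /* sort by decreasing overlap */
}

/* Keep the ceil(alpha * M) pairs with the largest overlaps and pick one at random. */
static pair_t pick_from_rcl(pair_t *pairs, size_t M, double alpha)
{
    qsort(pairs, M, sizeof *pairs, by_overlap_desc);
    size_t rcl = (size_t)ceil(alpha * (double)M);
    if (rcl == 0)
        rcl = 1;                                   /* always keep at least the best pair */
    return pairs[(size_t)rand() % rcl];
}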

3.2 Local search phase

The second phase of each GRASP iteration is the local search. Among several neighbourhood structures that were experimentally tested, two of them produced the best results: the 2-exchange neighbourhood and the shift neighbourhood. The 2-exchange neighbourhood was superior to other k-exchange neighbourhoods, since it provided solutions of the same quality in less computational time. Moreover, GRASP produces better results when implemented with the 2-exchange neighbourhood than with the neighbourhood structure described in (López-Rodríguez and Mérida-Casermeiro 2009). The latter neighbourhood structure is used as the update configuration for a discrete neural network solving the SSP, and contains all reorderings of all possible three parts into which the current solution can be separated. The shift neighbourhood is chosen because it highlights the relative-ordering nature of the SSP, since in this problem the order between the strings is crucial, and not their global location in the structure of the superstring (Zaritsky and Sipper 2004). Also, it was the only neighbourhood among those tested that yielded further improvement on the solution provided by the 2-exchange.

The difference between two permutations p and p′ is defined as δ(p, p′) = {i : p(i) ≠ p′(i)}, and the distance between them as d(p, p′) = |δ(p, p′)|. The 2-exchange neighbourhood of an SSP solution p ∈ Πn is defined as

N2e(p) = {p′ : p′ ∈ Πn, d(p, p′) = 2}.    (9)

The 2-exchange neighbourhood of a solution p consists of all those solutions that differ from p in exactly two assignments.

Given a permutation p = [p(1), . . . , p(n)] ∈ Πn and a k ∈ {0, 1, . . . , n − 1}, the k-shift operation is defined as sh_k(p) = [p(k + 1), . . . , p(n), p(1), . . . , p(k)]. Note that sh_0(p) = p. The shift neighbourhood of an SSP solution p ∈ Πn is defined as

Nsh(p) = {p′ : p′ = sh_k(p), k ∈ {0, 1, . . . , n − 1}},    (10)

that is, all the cyclic shifts of p. Note that Nsh(p) = Nsh(p′) for all p′ ∈ Nsh(p).

For each solution p of the construction phase, the local search phase initially searches in N2e(p). If it finds a better solution p′, then it replaces p with p′ and repeats the procedure until no further improvement is possible, and therefore a local optimum of the objective function with respect to the 2-exchange neighbourhood is found. The shift local search is then applied to this 2-exchange local optimum and the final result is the output of the local search phase.
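
A compact C sketch of the distance d and of the two neighbourhood moves defined above (our naming):

#include <string.h>

/* d(p, q): the number of positions in which the two permutations differ. */
static int perm_distance(const int *p, const int *q, int n)
{
    int d = 0;
    for (int i = 0; i < n; i++)
        if (p[i] != q[i])
            d++;
    return d;
}

/* A 2-exchange move: swapping positions i and j yields a member of N2e(p). */
static void two_exchange(int *p, int i, int j)
{
    int t = p[i]; p[i] = p[j]; p[j] = t;
}

/* The k-shift sh_k(p) = [p(k+1), ..., p(n), p(1), ..., p(k)], written into out. */
static void k_shift(const int *p, int *out, int n, int k)
{
    memcpy(out, p + k, (size_t)(n - k) * sizeof *out);
    memcpy(out + (n - k), p, (size_t)k * sizeof *out);
}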

Algorithm 5 shows the pseudo-code of the local search phase. At lines 1–8 the 2-exchange local search is implemented, and at lines 9–13 the shift local search is implemented.

Algorithm: localSearch
input : S = {s1, . . . , sn}, p ∈ Πn
output: p ∈ Πn

1.  repeat
2.      foreach p′ ∈ N2e(p) do
3.          if |superstring(S, p′)| < |superstring(S, p)| then
4.              p = p′
5.              exit foreach
6.          end
7.      end
8.  until p ≠ p′
9.  foreach p′ ∈ Nsh(p) do
10.     if |superstring(S, p′)| < |superstring(S, p)| then
11.         p = p′
12.     end
13. end
14. return p

Algorithm 5: Local search phase of GRASP for the SSP


3.3 GRASP with path relinking for the SSP

The two previously described phases, construction and local search, are iteratively executed and, due to randomness, each iteration most likely leads to a different solution. These iterations continue until a maximum iteration number is reached or a specific length of the superstring is found. The permutation p, which corresponds to the best solution over all iterations, is kept and the superstring(S, p) is returned as the result of the GRASP algorithm.

Here is a summary of the parameters that are used in GRASP for the SSP. All of them have already been mentioned explicitly or implicitly. Parameter α specifies the size of the RCL and thereby determines the greediness versus randomness level in the construction phase. Parameters m and s constitute the termination criteria of the algorithm, where m specifies the maximum number of iterations that GRASP can execute and s specifies the desired superstring length. For some string set S, if s > 0 and GRASP finds a solution p with |superstring(S, p)| ≤ s, then GRASP stops and returns superstring(S, p). If s = 0, then this stopping criterion is ignored.

Algorithm 6 shows the pseudo-code of GRASP for the SSP. At line 1 the best solution is initialized, at line 5 the best solution is possibly updated, and at line 6 the stopping criterion of the target solution is checked.

Algorithm: grasp-ssp
input : S = {s1, . . . , sn}, α, m, s
output: a superstring of S

1. |s∗| = ∞
2. for iteration = 1 to m do
3.     p = constructionPhase(S, α)
4.     p = localSearch(S, p)
5.     if |superstring(S, p)| < |s∗| then s∗ = superstring(S, p)
6.     if |s∗| ≤ s then exit for
7. end
8. return s∗

Algorithm 6: GRASP for the SSP

Notice that for α = 0 the construction phase of grasp-ssp becomes the greedy algorithm.

Path relinking examines the space of solutions spanned by the GRASP iterations. Let p1 be the initial solution and p2 the guiding solution. Path relinking generates d(p1, p2) − 1 different feasible solutions that correspond to the 2-exchanges necessary to transform p1 into p2. For each of these solutions, the local search procedure is applied and the best overall solution is kept. The path relinking procedure for p1 and p2 is shown in Algorithm 7, where r∗ is the best solution found by the procedure.

To combine GRASP with path relinking, an elite set 𝒫 with maximum capacity P is used. Computational experiments were performed to determine the best value for P, the strategy for choosing guiding solutions from the elite set 𝒫, as well as the updating procedure.


Algorithm: pathRelinking
input : S = {s1, . . . , sn}, p1 ∈ Πn, p2 ∈ Πn
output: p ∈ Πn

1.  r∗ = p1
2.  foreach i ∈ δ(p1, p2) do
3.      let j be such that p1(j) = p2(i)
4.      t = p1(i)
5.      p1(i) = p1(j)
6.      p1(j) = t
7.      r = localSearch(S, p1)
8.      if |superstring(S, r)| < |superstring(S, r∗)| then r∗ = r
9.  end
10. return r∗

Algorithm 7: Path relinking for the SSP
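
A C sketch of the relinking walk of Algorithm 7 (our naming): positions of p1 are repaired one at a time toward the guiding solution p2, and the best visited permutation is recorded. The superstring_length helper is the Eq. (1) sketch from Sect. 2.1; the paper additionally runs the local search of Algorithm 5 on every intermediate point, which is omitted here.

#include <string.h>

static void path_relink(char **S, int n, int *p1, const int *p2,
                        int *best, size_t *best_len)
{
    for (int i = 0; i < n; i++) {
        if (p1[i] == p2[i])
            continue;                              /* position already agrees with the guide */
        int j = i + 1;
        while (p1[j] != p2[i])
            j++;                                   /* p2[i] lies at some position j > i in p1 */
        int t = p1[i]; p1[i] = p1[j]; p1[j] = t;   /* 2-exchange that moves p1 toward p2 */
        size_t len = superstring_length(S, p1, n);
        if (len < *best_len) {                     /* keep the best visited permutation */
            *best_len = len;
            memcpy(best, p1, (size_t)n * sizeof *best);
        }
    }
}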

Several values for P were tested and the smallest one among the best ones was chosen. For choosing the guiding solution, we examined pure randomness and a probability distribution that depends on the similarity of each solution in the elite set to the one produced by GRASP. The second proved to be the best experimentally. The choice of this strategy is reinforced by the empirical observation in (Oliveira et al. 2004; Resende and Werneck 2004) that if two solutions are quite similar, then applying path relinking to them is likely to be unsuccessful. For the update strategy, we examined pure randomness, the replacement of the least used solution, and the replacement of the most similar solution. The last of the aforementioned strategies was experimentally proven to be the best. According to these choices, the procedure is described below.

Initially the elite set 𝒫 is empty, and while 𝒫 is not full, solutions produced by GRASP are added to 𝒫 if they are not already in it. If 𝒫 is full, then the path relinking procedure is applied with the solution p of GRASP as the initial solution and a solution selected from 𝒫 as the guiding solution. Let D = Σ_{q∈𝒫} d(p, q). A solution q is selected from 𝒫 with probability d(p, q)/D. Let r be the solution obtained by the path relinking procedure. If r is not already in 𝒫 and is better than at least one solution in it, then r is added to 𝒫, replacing the solution most similar to r among those having worse cost than r. In this way, the size of the elite set is kept constant and its solution diversity is kept as high as possible. Algorithm 8 shows the pseudo-code of GRASP with path relinking for the SSP.
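
A small C sketch of the distance-proportional (roulette-wheel) choice of the guiding solution described above (our naming; dist[k] is assumed to hold d(p, q) for the k-th elite solution):

#include <stdlib.h>

/* Return the index of the selected elite solution, chosen with probability dist[k]/D. */
static int pick_guiding(const int *dist, int elite_size)
{
    long D = 0;
    for (int k = 0; k < elite_size; k++)
        D += dist[k];
    long r = (long)(((double)rand() / ((double)RAND_MAX + 1.0)) * (double)D);
    for (int k = 0; k < elite_size; k++) {
        if (r < dist[k])
            return k;
        r -= dist[k];
    }
    return elite_size - 1;                         /* guard against rounding at the edge */
}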

4 Computational results

greedy, grasp-ssp and grasp-pr-ssp were coded in the programming language C and compiled with gcc version 4.1.2. The execution of the following experiments was performed on an SGI Altix 450 shared memory architecture cluster running SUSE Linux Enterprise Server 10 (ia64) with 16 Intel Itanium II 1.67 GHz 64-bit processors and a total of 32 GB of shared memory.

For the computational experiments, random instances of the SSP were generated by duplicating the DNA sequencing procedure performed in biology laboratories.


Algorithm: grasp-pr-ssp
input : S = {s1, . . . , sn}, α, m, s, P
output: a superstring of S

1.  𝒫 = ∅
2.  for iteration = 1 to m do
3.      p = constructionPhase(S, α)
4.      p = localSearch(S, p)
5.      if |𝒫| < P then
6.          if p ∉ 𝒫 then 𝒫 = 𝒫 ∪ {p}
7.      else
8.          D = Σ_{q∈𝒫} d(p, q)
9.          select a q from 𝒫 with probability d(p, q)/D
10.         r = pathRelinking(S, p, q)
11.         if r ∉ 𝒫 and |superstring(S, r)| < max{|superstring(S, q)| : q ∈ 𝒫} then
12.             L = {q : q ∈ 𝒫, |superstring(S, q)| > |superstring(S, r)|}
13.             let q′ ∈ L be the solution most similar to r
14.             𝒫 = (𝒫 − {q′}) ∪ {r}
15.         end
16.     end
17.     let r∗ be the best solution in 𝒫
18.     s∗ = superstring(S, r∗)
19.     if |s∗| ≤ s then exit for
20. end
21. return s∗

Algorithm 8: GRASP with path relinking for the SSP

For each instance, a random string over the alphabet Σ = {a, c, g, t} is generated, representing the DNA sequence. A number of copies of this string is considered and each copy is cut at random positions so that the resulting fragment lengths lie in a specific range. In our experiments this length varies between 50 and 60, to reflect a common string length in the next-generation sequencing technologies used nowadays in biology laboratories (Lin et al. 2011; Morozova and Marra 2008). The string set S of an SSP instance is constructed from these fragments after removing each fragment that is a substring of another. The final number |S| of strings is the size of the instance and in all following experiments is denoted by n. A set of SSP instances, generated as described previously, which consists of k instances for each instance size n ∈ [2, m], is denoted by I(k, m). This configuration was selected to test the proposed algorithm severely, since greedy is expected to achieve near-optimum solutions for these instances. Furthermore, the experiments involve much larger instances than those used for other heuristics, in order to obtain a more reliable picture of the behaviour of the proposed algorithm. Our purpose is to design an algorithm that consistently produces better results than the current widely-used method, and not only on instances of a few tens of strings. The distinct computational experiments that were performed concern

– the determination of the parameter α,
– the contribution of path relinking to GRASP,
– the comparison between greedy and grasp-ssp,
– the analysis of the 2-exchange and shift neighbourhoods in the local search,
– the performance of the proposed algorithm.

Fig. 2 Best parameter α values per instance size. The marked points correspond to the computational results for the specification of the proper value, and the curve corresponds to the calculation function used in the experiments

The value of the parameter α is important for the quality of the grasp-ssp solutions. In order to specify the proper α value for each instance size n, a set I = I(25, 1156) of instances was generated and the most appropriate α values for each n were estimated. The set of candidate α values was {α_1, α_2, . . . , α_13}, where α_1 = 0.64 and α_i = α_{i−1}/2 for i = 2, 3, . . . , 13. For each instance S ∈ I, the solutions of grasp-ssp(S, α_i, 16, 0) were computed for i = 1, 2, . . . , 13. Let w^n_i be the number of instances of size n for which α_i found the best solution among all α values, and w^n = max{w^n_i : i = 1, 2, . . . , 13}. In Fig. 2 there is a mark at each point (n, α_i) for which w^n_i = w^n. Note that the vertical axis is in logarithmic scale. It can be observed that for small instances (i.e. n < 100) most α values lead to the best found solution. For larger instances (i.e. n > 100) the most appropriate α value starts from α_7 = 0.01 and steadily decreases. For example, α_6 = 0.02 is appropriate only for instances of size less than 100, while the most appropriate α value for instances of size 300 ≤ n ≤ 400 is α_9 = 0.0025. In Fig. 2 the curve given by the function

\alpha(n) = \begin{cases} 0.01 & \text{if } n \le 112 \\ (15 n^{1.3} + 7 n^{1.7}) / (2 n^{3}) & \text{if } n > 112 \end{cases}    (11)

fits the appropriate α values. In all subsequent computational experiments, the value of the parameter α is computed by this function.
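
For reference, Eq. (11) maps directly to a small C function (a sketch; the name alpha_of_n is ours):

#include <math.h>

/* Parameter alpha as a function of the instance size n, following Eq. (11). */
static double alpha_of_n(int n)
{
    if (n <= 112)
        return 0.01;
    return (15.0 * pow(n, 1.3) + 7.0 * pow(n, 1.7)) / (2.0 * pow(n, 3.0));
}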

In order to infer the contribution of path relinking to GRASP, a set I = I(25, 512) of instances was generated and solved by both grasp-ssp and grasp-pr-ssp. These instances are separated into two sets according to the solutions of the two algorithms. The first set contains the instances for which the two algorithms found different solutions and is used for comparing the solution quality, while the second set contains the instances for which the two algorithms found the same solution and is used for making time comparisons.

Fig. 3 The ratio between the solutions obtained by grasp-ssp and by grasp-pr-ssp for instances where the two algorithms give different results

For each instance S ∈ I of size n in the first set, the ratio

ρ1 = |grasp-ssp(S, α(n), 8, 0)| / |grasp-pr-ssp(S, α(n), 8, 0, 5)|

is computed and the results are presented in Fig. 3. The conclusion drawn from this figure is that the two algorithms are comparable and neither has an overall superiority over the other.

For the second set of instances, since both algorithms arrive at the same solution, it is interesting to investigate the number of iterations they require to find this solution. Figure 4 shows the average number of iterations to the best solution for both algorithms according to the instance size n, and the ratio between these average values for each n.

It seems that the average number of iterations for both algorithms is very low (less than 3) and slowly increases as n grows. It can be concluded that grasp-ssp and grasp-pr-ssp need the same number of iterations on the average to find their best solution. However, the time needed to complete an iteration is not the same for both algorithms. Figure 5 shows the CPU time requirement according to instance size n for grasp-ssp and grasp-pr-ssp to complete eight iterations.

It is obvious that grasp-ssp needs significantly less time than grasp-pr-ssp to complete the same number of iterations, as expected. In both cases, the dispersal of times as n grows is due to the local search phase. Based on the above, it can be concluded that the path relinking procedure does not offer any substantial improvement to grasp-ssp. In contrast to other cases of heuristics for optimization problems (Oliveira et al. 2004), in the case of GRASP for the SSP, path relinking adds nothing to the overall process, as concluded experimentally. This is one more indication that greed, represented in this case by GRASP, works very well for the SSP.


Fig. 4 The average number of iterations to the best solution for the grasp-ssp and grasp-pr-ssp algorithms for instances where the two algorithms give the same result, and the ratio between these averages

Fig. 5 CPU time requirement for grasp-ssp and grasp-pr-ssp to complete eight iterations

Therefore, GRASP without path relinking was used in the following experiments.

Fig. 6 The ratio between the solutions obtained by greedy and by grasp-ssp for instances where the two algorithms give different results

Besides the result in the probabilistic framework that greedy is asymptotically optimal for the SSP, computational results show that the superstring obtained by the greedy algorithm is only 1 % longer than a shortest superstring on the average (Tarhio and Ukkonen 1988). The length of the random strings in those experiments varied from 4 to 100, and the observation was made that greedy works even better with longer strings. These results motivated us to design a heuristic algorithm which performs better than greedy even for large SSP instances. In order to compare the greedy and grasp-ssp algorithms, a set I = I(25, 1024) of instances was generated and for each instance S ∈ I of size n the ratio

ρ2 = |greedy(S)| / |grasp-ssp(S, α(n), 16, 0)|

was computed. Figure 6 presents ρ2 according to the instance size n, in two separate plots for better readability, only for instances where the two algorithms found different solutions. It is obvious that grasp-ssp consistently produces better quality solutions than greedy for all values of n.

Due to the specific configuration used for the experiments, a randomly produced sequence containing n strings of lengths between 50 and 60 is expected to be 27.5 × n + 28 letters long. The ρ2 values are inversely proportional to the sequence length, and thus to the string length, and so they can be much larger if the string lengths in the experiments are smaller. The choice of string length is representative of the next-generation sequencing technologies. The special pattern shown in Fig. 6 is an immediate result of the fact that the size of the generated DNA sequences lies in a range of integer values for each instance size, due to the way the instances are produced, and of the fact that the differences between the solutions are discrete. The first curve above one means that the greedy solution differs from the grasp-ssp solution by 1, the second curve by 2, etc. The main observation obtained from Fig. 6 is that grasp-ssp steadily produces better solutions than greedy even for large instance sizes. The small values of the relative differences between the two solutions, as indicated by ρ2, do not reflect the difference in solution quality. This is due to the discrete nature of the problem, where for any two solutions that differ in a few letters, the composition of the corresponding superstrings may be disproportionally different, which implies that they are also biologically different.


Fig. 7 The average number of iterations that grasp-ssp needed to find its best solution

The average number of iterations that grasp-ssp needed to find its best solution for each instance size n is presented in Fig. 7. It can be observed that the average number of iterations increases as n grows; however, its value remains quite small even for large instance sizes.

The question of whether the superiority of grasp-ssp over greedy is owed only to the solution improvement realized by the local search phase was examined next. The previous experiment was repeated, except that the local search phase was applied to each greedy solution as well. The solution comparison was made in the same way as before and the ratio

ρ3 = (length of the greedy(S) solution after local search) / |grasp-ssp(S, α(n), 16, 0)|

was computed. The pattern of the resulting figure is the same as in Fig. 6, with differences in only 7 of the 2,204 instances (points) presented in it. This result shows that the local search phase does not improve the greedy solution in the vast majority of cases, and also that the superiority of grasp-ssp over greedy is owed both to the high quality sampling of points in the solution space by the construction phase and to the local search phase.

The next experiment concerns the components of grasp-ssp, that is, the construction phase, the 2-exchange local search, and the shift local search. A set I = I(25, 1024) of instances was generated and for each instance S ∈ I of size n, grasp-ssp(S, α(n), 1, 0) found a solution. The solution quality and the time requirement of each component were recorded separately in order to compare their results. Figure 8 shows the ratio

ρ4 = (length of the construction phase solution) / (length of the 2-exchange local search solution)

according to the instance size n. The relative improvement of the construction phase solution by the 2-exchange local search is constantly decreasing as the instance size n increases.

Fig. 8 The ratio between the solutions obtained by the construction phase and by the 2-exchange local search. It represents the improvement by the 2-exchange local search to the solution

Figure 9 shows the ratio

ρ5 = (length of the 2-exchange local search solution) / (length of the shift local search solution)

according to the instance size n. The shift local search improves the solution quality only in a few instances and the relative improvement is constantly decreasing with the instance size n. Although the relative improvement is not substantial, the shift local search has minimal time requirements, and this is the reason it remains part of our algorithm. Moreover, it was the only neighbourhood structure among those tested that yielded further improvement on the solution provided by the 2-exchange. Intuitively, this shows the relative-ordering nature of the problem.

Figure 10 shows the time requirement of the three grasp-ssp components for each instance size n. The construction phase takes more time than the two local searches, the 2-exchange local search has the greatest dispersion of values, and the shift local search has the smallest time requirement, since its CPU time is almost zero for all n.

Fig. 9 The ratio between the solutions obtained by the 2-exchange local search and by the shift local search. It represents the improvement by the shift local search to the solution

Fig. 10 Time requirement for the three components of grasp-ssp: the construction phase, the 2-exchange local search, and the shift local search

Since the dominant component of grasp-ssp with respect to the running time is the construction phase, and it operates similarly to greedy, the running time of an iteration of grasp-ssp is very close to the running time of greedy. This is achieved if the structure of a linked list is used to implement the choosing procedure during the construction phase. In both algorithms, the sorting of the pairwise overlaps is the most time-consuming process. Combining this result with the small number of iterations that grasp-ssp needs to converge, and the ability to easily parallelize its iterations on a number of processors that perform the sorting process only once, it can be deduced that the parallel version of grasp-ssp has a running time similar to that of the greedy algorithm.

Fig. 11 The ratio between the solutions obtained by grasp-ssp and the optimum solutions obtained by an exact solver

Fig. 12 Percentage of instances for which grasp-ssp found the optimum solution

In the final computational experiment, in order to compare the solutions of the grasp-ssp algorithm with the optimum ones, a set I = I(25, 256) of instances was generated. These instances were solved both by grasp-ssp and by an exact solver. The bound on the instance size n is due to the time requirements of the exact solver. Each instance was formulated as an integer program according to Eqs. (3)–(8), and solved to optimality by the Gurobi Optimizer version 4.6.1. The Gurobi Optimizer is a state-of-the-art solver for linear programming (LP), quadratic programming (QP), and mixed-integer programming (MIP), and includes high-performance implementations of the primal simplex method, the dual simplex method, and a parallel barrier solver. Figure 11 shows, for each instance S ∈ I of size n, the ratio

ρ6 = |grasp-ssp(S, α(n), 8, 0)| / |OPT(S)|.

Figure 12 shows the percentage of instances for which grasp-ssp found the optimum solution according to the instance size n.

Figure 13 shows the average additional length of the superstring obtained by grasp-ssp compared with the length of a shortest superstring. Its error rate remains quite low, about 2 × 0.14 %, even for large SSP instances.

As can be observed, the solutions of grasp-ssp are very close to the corresponding optimum ones; it finds the optimum solution in many cases, its relative performance improves as n grows, and its average error rate remains quite low.


Fig. 13 Average additional length of the superstring obtained by grasp-ssp compared with the length of a shortest superstring

Having the benchmark set of SSP instances with known optimum solutions, a more detailed quantitative comparison between grasp-ssp and greedy can be made, indicating the superiority of the former, in addition to the result of Fig. 6. Figure 14 has a dot for each instance S ∈ I where the two algorithms found different solutions; the horizontal axis represents the ratio ρ6, and the vertical axis the ratio

ρ7 = |greedy(S)| / |OPT(S)|

for these instances. In this way, the figure shows the improvement achieved by grasp-ssp over greedy in terms of how much it brings the solution closer to the optimum. In all instances except two, grasp-ssp improves the greedy solution, while in most of these improved instances it finds the optimum solution (where ρ6 = 1).

5 Conclusion

In this paper a GRASP algorithm for the SSP is implemented, which outperforms the natural greedy algorithm that is widely used in practice. The proposed algorithm provides multiple near-optimum solutions, which is of importance for DNA sequencing. Extended computational experiments with large scale SSP instances are presented, to our knowledge for the first time. The error rate of the proposed algorithm compared to the optimum remains quite low for all tested instances, and the average number of its iterations is quite small. A benchmark set of 6,375 SSP instances with known optimum solutions is constructed, with sizes that range from 2 up to 256 strings of lengths 50–60. The proposed algorithm has been implemented in the C programming language, and the source code can be obtained online at http://users.auth.gr/theogev.


Fig. 14 Improvement achieved by grasp-ssp over greedy according to the optimum solution

Acknowledgments This research has been co-financed by the European Union (European Social Fund - ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF) - Research Funding Program: Heraclitus II. Investing in knowledge society through the European Social Fund.

References

Armen C, Stein C (1995a) Improved length bounds for the shortest superstring problem. In: Akl S, Dehne F, Jorg-Rudiger S, Santoro N (eds) Algorithms and data structures, volume 955. Lecture notes in computer science. Springer, Berlin, pp 494–505
Armen C, Stein C (1995b) Short superstrings and the structure of overlapping strings. J Comput Biol 2(2):307–332
Armen C, Stein C (1996) A 2 2/3-approximation algorithm for the shortest superstring problem. In: Hirschberg D, Myers G (eds) Combinatorial pattern matching, volume 1075. Lecture notes in computer science. Springer, Berlin, pp 87–101
Arora S, Lund C, Motwani R, Sudan M, Szegedy M (1998) Proof verification and the hardness of approximation problems. J ACM 45:501–555
Bains W, Smith GC (1988) A novel method for nucleic acid sequence determination. J Theor Biol 135(3):303–307
Blum A, Jiang T, Li M, Tromp J, Yannakakis M (1991) Linear approximation of shortest superstrings. In: Proceedings of the twenty-third annual ACM symposium on theory of computing, STOC '91, New York, pp 328–336. ACM. ISBN 0-89791-397-3
Breslauer D, Jiang T, Jiang Z (1997) Rotations of periodic strings and short superstrings. J Algorithm 24:340–353
Czumaj A, Gąsieniec L, Piotrów M, Rytter W (1994) Parallel and sequential approximation of shortest superstrings. In: Schmidt E, Skyum S (eds) Algorithm theory SWAT '94, volume 824. Lecture notes in computer science. Springer, Berlin, pp 95–106
Festa P, Resende MGC (2002) GRASP: an annotated bibliography. In: Ribeiro CC, Hansen P (eds) Essays and surveys in metaheuristics. Kluwer Academic Publishers, Norwell, pp 325–367
Frieze A, Szpankowski W (1998) Greedy algorithms for the shortest common superstring that are asymptotically optimal. Algorithmica 21:21–36
Gallant J, Maier D, Storer JA (1980) On finding minimal length superstrings. J Comput Syst Sci 20(1):50–58
Garey MR, Johnson DS (1990) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman & Co., New York
Glover F, Laguna M, Martí R (2000) Fundamentals of scatter search and path relinking. Control Cybern 29(3):653–684
Glover F, Laguna M (1997) Tabu search. Kluwer Academic Publishers, Norwell
Goldberg MK, Lim DT (2001) A learning algorithm for the shortest superstring problem. In: Proceedings of the Atlantic symposium on computational biology and genome information and technology, Durham, NC, pp 171–175
Ilie L, Popescu C (2006) The shortest common superstring problem and viral genome compression. Fundam Inform 73:153–164
Ilie L, Tinta L, Popescu C, Hill KA (2006) Viral genome compression. In: Mao C, Yokomori T (eds) DNA computing, volume 4287. Lecture notes in computer science. Springer, Berlin
Kaplan H, Shafrir N (2005) The greedy algorithm for shortest superstrings. Inform Process Lett 93:13–17
Kosaraju SR, Park JK, Stein C (1994) Long tours and short superstrings. In: Proceedings of the 35th annual symposium on foundations of computer science, Washington, pp 166–177. IEEE Computer Society. ISBN 0-8186-6580-7
Laguna M, Martí R (1999) GRASP and path relinking for 2-layer straight line crossing minimization. INFORMS J Comput 11:44–52
Li M (1990) Towards a DNA sequencing theory (learning a string). In: Proceedings of the 31st annual IEEE symposium on foundations of computer science, 22–24 Oct 1990
Lin Y, Li J, Shen H, Zhang L, Papasian CJ, Deng H-W (2011) Comparative studies of de novo assembly tools for next-generation sequencing technologies. Bioinformatics 27(15):2031–2037
López-Rodríguez D, Mérida-Casermeiro E (2009) Shortest common superstring problem with discrete neural networks. In: Kolehmainen M, Toivanen P, Beliczynski B (eds) Adaptive and natural computing algorithms, volume 5495. Lecture notes in computer science. Springer, Berlin, pp 62–71
Ma B (2008) Why greed works for shortest common superstring problem. In: Ferragina P, Landau G (eds) Combinatorial pattern matching, volume 5029. Lecture notes in computer science. Springer, Berlin, pp 244–254. doi:10.1007/978-3-540-69068-9-23
Mayne A, James EB (1975) Information compression by factorising common strings. Comput J 18(2):157–160
Middendorf M (1998) Shortest common superstrings and scheduling with coordinated starting times. Theor Comput Sci 191(1–2):205–214
Miller CE, Tucker AW, Zemlin RA (1960) Integer programming formulation of traveling salesman problems. J ACM 7:326–329
Morozova O, Marra MA (2008) Applications of next-generation sequencing technologies in functional genomics. Genomics 92(5):255–264
Oliveira C, Pardalos P, Resende M (2004) GRASP with path-relinking for the quadratic assignment problem. In: Ribeiro C, Martins S (eds) Experimental and efficient algorithms, volume 3059. Lecture notes in computer science. Springer, Berlin, pp 356–368. ISBN 978-3-540-22067-1
Ott S (1999) Lower bounds for approximating shortest superstrings over an alphabet of size 2. In: Widmayer P, Neyer G, Eidenbenz S (eds) Graph-theoretic concepts in computer science, volume 1665. Lecture notes in computer science. Springer, Berlin, pp 55–64
Pitsoulis LS, Resende MGC (2002) Greedy randomized adaptive search procedures. In: Pardalos PM, Resende MGC (eds) Handbook of applied optimization. Oxford University Press, Oxford, pp 178–183
Resende MGC, Ribeiro CC (2005) GRASP with path-relinking: recent advances and applications. In: Ibaraki T, Nonobe K, Yagiura M (eds) Metaheuristics: progress as real problem solvers. Springer, Berlin, pp 29–63
Resende MGC, Werneck RF (2004) A hybrid heuristic for the p-median problem. J Heuristics 10:59–88
Storer JA, Szymanski TG (1978) The macro model for data compression (extended abstract). In: Proceedings of the tenth annual ACM symposium on theory of computing, STOC '78, New York, pp 30–39
Storer JA, Szymanski TG (1982) Data compression via textual substitution. J ACM 29:928–951
Sweedyk Z (1999) A 2 1/2-approximation algorithm for shortest superstring. SIAM J Comput 29:954–986
Tarhio J, Ukkonen E (1988) A greedy approximation algorithm for constructing shortest common superstrings. Theor Comput Sci 57(1):131–145
Teng S-H, Yao F (1993) Approximating shortest superstrings. In: Proceedings of the 1993 IEEE 34th annual foundations of computer science, pp 158–165
Yang E, Zhang Z (1999) The shortest common superstring problem: average case analysis for both exact and approximate matching. IEEE Trans Inf Theory 45(6):1867–1886
Zaritsky A, Sipper M (2004) The preservation of favored building blocks in the struggle for fitness: the puzzle algorithm. IEEE Trans Evol Comput 8(5):443–455
