Longest common subsequence: linear space, diff, and bit packing

Algorithms on Strings

Paweł Gawrychowski

May 21, 2013

Outline

Recap

Reconstructing the answer

How does diff work?

And now let’s do some theory

We define the edit distance $d(s, t)$ between a pair of strings $s$ and $t$ as the smallest number of operations needed to get $t$ from $s$, where each operation is one of the following:

removing a letter,
inserting a letter,
changing a letter.

This is known as the Levenshtein distance.

Other ways of defining the distance (or, other scoring functions) are possible. One of the most popular is the longest common subsequence (LCS). We define the longest common subsequence of a pair of strings $s$ and $t$ to be the longest string which is a subsequence of both $s$ and $t$.

Notice the difference between a subsequence and a substring. The longest common substring can be found in $O(|s| + |t|)$ time, as we will see in two weeks, but computing the longest common subsequence is more complicated.
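To make the distinction concrete, here is a small Python sketch. The helper name `is_subsequence` is ours, not from the lecture. A subsequence may skip letters, while a substring must be contiguous.

```python
def is_subsequence(u, s):
    """Return True if u can be obtained from s by deleting letters."""
    remaining = iter(s)
    # `ch in remaining` advances the iterator until it finds ch,
    # so the letters of u are matched from left to right.
    return all(ch in remaining for ch in u)


print(is_subsequence("ace", "abcde"))  # subsequence: letters may be skipped
print("ace" in "abcde")                # substring: must be contiguous
```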

So how do we compute the edit distance $d(s, t)$ efficiently?

\[
d(s[1..n], t[1..m]) = \min\left(
\begin{array}{l}
d(s[1..n-1], t[1..m]) + 1,\\
d(s[1..n], t[1..m-1]) + 1,\\
d(s[1..n-1], t[1..m-1]) + [s[n] \neq t[m]]
\end{array}
\right)
\]

Here $[s[n] \neq t[m]]$ is 1 if the last letters differ and 0 otherwise. Spend 2 minutes trying to figure out how to prove the above formula.

We compute a big table $T[i, j] = d(s[1..i], t[1..j])$ with two nested for loops. The time complexity is then just $O(nm)$. Unfortunately, the space complexity is $O(nm)$, too.
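A minimal Python sketch of the table computation, assuming 0-based Python strings (so `s[i - 1]` plays the role of $s[i]$). The function name `edit_distance_table` is illustrative.

```python
def edit_distance_table(s, t):
    """Return T with T[i][j] = d(s[1..i], t[1..j]) (1-based as in the text)."""
    n, m = len(s), len(t)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        T[i][0] = i          # remove all i letters
    for j in range(m + 1):
        T[0][j] = j          # insert all j letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = min(
                T[i - 1][j] + 1,                           # remove s[i]
                T[i][j - 1] + 1,                           # insert t[j]
                T[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # change (or keep)
            )
    return T


T = edit_distance_table("kitten", "sitting")
print(T[-1][-1])  # edit distance of the two full strings
```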

$10^{10}$ operations are probably fine, you just have to wait a few minutes. But $10^{10}$ bytes of memory, that is a different story.

We use a simple trick: to compute the $i$-th row of the table we don’t need rows $1, 2, \ldots, i - 2$. Hence we only store the current row and the previous one during our computation. Then the space complexity becomes $O(m)$, and the time complexity stays the same.
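The two-row trick can be sketched as follows. This is one possible Python rendering, with `prev` and `cur` as our names for the previous and current row.

```python
def edit_distance(s, t):
    """Edit distance using only two rows of the table, i.e. O(m) space."""
    prev = list(range(len(t) + 1))      # row 0: d("", t[1..j]) = j
    for i, si in enumerate(s, start=1):
        cur = [i]                       # d(s[1..i], "") = i
        for j, tj in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # remove s[i]
                cur[j - 1] + 1,            # insert t[j]
                prev[j - 1] + (si != tj),  # change (or keep)
            ))
        prev = cur                      # the older row is no longer needed
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```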

That is all nice when you are only interested in the distance itself. But what if you would like to actually compute the sequence of operations? Or, in the case of LCS, output the longest common subsequence?

Let us focus on the longest common subsequence. We will prove the following.

Hirschberg 1975
The longest common subsequence can be reconstructed in $O(nm)$ time and $O(n + m)$ space.

We start with visualizing the DP solution in a slightly different manner.

[Figure: the DP table drawn as a grid with columns labelled A B C B A A and rows labelled B A C B B A; the labels 0, 1, 1, 1, 1 and $d(3, 3)$ appear in the later builds of the slide.]


$d(i, j)$ is the weight of the cheapest path from $(0, 0)$ to $(i, j)$.

Main trick
Can we compute $d'(i, j)$, the weight of the cheapest path from $(i, j)$ to $(n, m)$, using a similar method?

Lemma
For any $x \in [0, n]$, the edit distance is $\min_{y \in [0, m]} \left( d(x, y) + d'(x, y) \right)$.

Proof: see the whiteboard.
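The lemma is easy to check numerically. The sketch below is not a proof: it assumes that $d'$ can be obtained by running the same DP on the reversed strings, and it uses full tables for clarity. All function names are ours.

```python
def prefix_distances(s, t):
    """F[i][j] = d(i, j): cheapest path from (0, 0) to (i, j)."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        F[i][0] = i
    for j in range(m + 1):
        F[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = min(F[i - 1][j] + 1, F[i][j - 1] + 1,
                          F[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return F


def suffix_distances(s, t):
    """B[i][j] = d'(i, j): cheapest path from (i, j) to (n, m)."""
    n, m = len(s), len(t)
    # Reversing both strings turns suffixes into prefixes.
    R = prefix_distances(s[::-1], t[::-1])
    return [[R[n - i][m - j] for j in range(m + 1)] for i in range(n + 1)]


def lemma_holds(s, t):
    """Check min over y of d(x, y) + d'(x, y) == d(s, t) for every row x."""
    n, m = len(s), len(t)
    F, B = prefix_distances(s, t), suffix_distances(s, t)
    return all(
        min(F[x][y] + B[x][y] for y in range(m + 1)) == F[n][m]
        for x in range(n + 1)
    )


print(lemma_holds("ABCBAA", "BACBBA"))
```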

A recursive method for reconstructing the cheapest path

choose any $x \in [0, n]$,
compute $d(x, y)$ and $d'(x, y)$ for all $y \in [0, m]$,
select $y$ minimizing the sum,
recursively reconstruct the path in the two smaller grids corresponding to $d(s[1..x], t[1..y])$ and $d(s[x+1..n], t[y+1..m])$.
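The steps above can be sketched for the LCS variant, where we maximize the number of matched letters instead of minimizing cost. This is a minimal Python sketch of the divide-and-conquer idea, not the lecture's own code. The names `lcs_last_row` and `hirschberg_lcs` are ours, and string slices are used for readability, so a strictly $O(n + m)$-space version would pass index ranges instead.

```python
def lcs_last_row(a, b):
    """Return row L with L[j] = LCS length of a and b[:j], in O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev


def hirschberg_lcs(a, b):
    """Reconstruct one longest common subsequence of a and b."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    x = len(a) // 2                                  # split a in the middle
    m = len(b)
    forward = lcs_last_row(a[:x], b)                 # a[:x] against prefixes of b
    backward = lcs_last_row(a[x:][::-1], b[::-1])    # a[x:] against suffixes of b
    # Pick the split point y of b that maximizes the combined LCS length.
    y = max(range(m + 1), key=lambda j: forward[j] + backward[m - j])
    return hirschberg_lcs(a[:x], b[:y]) + hirschberg_lcs(a[x:], b[y:])


print(hirschberg_lcs("ABCBAA", "BACBBA"))
```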

[Figure: the grid with the chosen row $x$ and the selected column $y$ marked.]


How to choose $x$?

\[
T(n, m) = T(x, y) + T(n - x, m - y) + O(nm)
\]

Now choose $x = \frac{n}{2}$.

\[
T(n, m) = T\left(\tfrac{n}{2}, y\right) + T\left(\tfrac{n}{2}, m - y\right) + O(nm)
\]

If you think of $nm$ as the “size” of the problem, you can see that the combined size of the two smaller subproblems is half as big as the original size. It follows that $T(n, m) = O(nm)$.

More formally: see the whiteboard.
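The whiteboard argument is not part of the slides. One standard way to make the halving argument precise is to sum the work level by level. The sketch below assumes the non-recursive work is at most $c \cdot nm$ for some constant $c$, and it ignores rounding.

```latex
\begin{align*}
\text{depth } k:\quad & \text{at most } 2^k \text{ subproblems, each with } n/2^k \text{ rows and widths } m_i \text{ summing to } m \\
\text{work at depth } k:\quad & \sum_{i} c \cdot \frac{n}{2^k} \cdot m_i = c \cdot \frac{nm}{2^k} \\
T(n, m) \;\le\; & \sum_{k \ge 0} c \cdot \frac{nm}{2^k} = 2c \cdot nm = O(nm)
\end{align*}
```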

So, the time consumption is $O(nm)$. What about space? We compute $d(x, y)$ and $d'(x, y)$ for all $y \in [0, m]$ using the row-by-row space-saving method, so the space complexity is just $O(n + m)$.

The same trick works for other scoring functions, and in fact for many dynamic programming solutions.

Even though we decreased the space consumption, computing the edit distance between two long strings is still infeasible. We will try to develop a method which works efficiently when the strings are not very different, i.e., when $d(s, t)$ is not large.

Myers 1986
Edit distance can be computed in time $O(nD)$, where $D = d(s, t)$.

Why?
Think about how you use diff. If $D$ is large, you probably don’t care about the edit distance anyway.

Consider a diagonal, i.e., all points $(i, j)$ such that $i - j = \delta$. What can we say about the corresponding $d(i, j)$?

Lemma
$d(i, j) \geq |i - j|$

Proof: see the whiteboard.

Assume that we know the value of $D$ (we don’t, but don’t worry about that for the time being). Do we need to compute all $d(i, j)$?

Observation
If the edit distance is $D$, it is enough to compute the values of $d(i, j)$ in the diagonal strip of “width” $2D$. This can be done in $O(nD)$ time.

Notice that we really need the values of $d(s[1..i], t[1..j])$ to be non-decreasing here, otherwise it might make sense to go far away from the strip and then come back.

But we don’t know the value of $D$!

Final trick
Say that we choose some value of $D$. If $d(s, t) \leq D$, the above method will return the correct answer. If $d(s, t) > D$, our result will exceed $D$, too. Hence we can verify whether our guess was correct.

Try $D = 1, 2, 3, \ldots$. What is the resulting complexity?

OK, so maybe try $D = 2^0, 2^1, 2^2, \ldots$. What is the resulting complexity now?
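The strip computation and the guess-and-verify loop can be sketched together. This is an illustrative Python version under our own naming (`banded_edit_distance`, `edit_distance_doubling`). Cells outside the strip are treated as "more than $D$", and every value is capped at $D + 1$.

```python
def banded_edit_distance(s, t, D):
    """Return d(s, t) if it is at most D, otherwise None.

    Only cells (i, j) with |i - j| <= D are computed, so the work is O(nD).
    """
    n, m = len(s), len(t)
    if abs(n - m) > D:
        return None                      # d(s, t) >= |n - m| > D
    INF = D + 1                          # stands for "more than D"
    prev = {j: j for j in range(min(m, D) + 1)}   # row 0 inside the strip
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - D), min(m, i + D) + 1):
            if j == 0:
                best = i
            else:
                best = min(
                    prev.get(j, INF) + 1,
                    cur.get(j - 1, INF) + 1,
                    prev.get(j - 1, INF) + (s[i - 1] != t[j - 1]),
                )
            cur[j] = min(best, INF)      # cap so values never grow past D + 1
        prev = cur
    return prev[m] if prev[m] <= D else None


def edit_distance_doubling(s, t):
    """Guess D = 1, 2, 4, ... until the banded computation succeeds."""
    D = 1
    while True:
        result = banded_edit_distance(s, t, D)
        if result is not None:
            return result
        D *= 2


print(edit_distance_doubling("kitten", "sitting"))
```

With the guesses $1, 2, 3, \ldots$ the total work is proportional to $n(1 + 2 + \cdots + D) = O(nD^2)$. With doubling it is $n(1 + 2 + 4 + \cdots + 2^k)$, where the last guess satisfies $2^k < 2D$, so the total is $O(nD)$.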

So we have an $O(nD)$ time algorithm. Is that the best we are capable of? No.

Edit distance can be computed in time $O(n + D^2)$.

To prove the above statement, we need one tool which will actually be proven during the next two lectures. For the time being, just believe that the following is possible.

Suffix array + constant time RMQ queries
Given a text $w[1..n]$, we can construct in linear time a structure of size $O(n)$ which allows us to answer any query of the form “what is the longest common prefix of $w[i..n]$ and $w[j..n]$?” in constant time.

Look at a single diagonal. What can we say about the consecutive values of $d(i, j)$ there?

Observation
$d(i + 1, j + 1) \in \{d(i, j), d(i, j) + 1\}$

Hence the values on a single diagonal are non-decreasing. As in the $O(nD)$ time algorithm, as soon as $d(i, j) > D$, we don’t care about the exact value, as it cannot possibly correspond to a path of total cost at most $D$.

On each diagonal $i - j = \delta$ we compute only those values that are at most $D$, i.e., we look at $(i, j)$ such that $|i - j| \leq D$ and $d(i, j) \leq D$.

Now this doesn’t really decrease the complexity. We need one more observation.

Succinct description of diagonals
To fully describe the values of $d(i, j)$ for all $(i, j)$ such that $i - j = \delta$ and $d(i, j) \leq D$ it is enough to compute for each $x = 0, 1, \ldots, D$ the last $(i, j)$ on the diagonal such that $d(i, j) = x$.

To make the description simpler, let $M_\delta(x)$ be the last $(i, j)$ on the diagonal $i - j = \delta$ such that $d(i, j) = x$.
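To see what $M_\delta(x)$ looks like on a concrete input, the brute-force sketch below reads it off the full table. It is only an illustration of the definition, not the fast method, and the names `diagonal_values` and `last_positions` are ours.

```python
def edit_distance_table(s, t):
    """T[i][j] = d(s[1..i], t[1..j]), computed by the quadratic DP."""
    n, m = len(s), len(t)
    T = [[max(i, j) if min(i, j) == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = min(T[i - 1][j] + 1, T[i][j - 1] + 1,
                          T[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return T


def diagonal_values(T, delta):
    """Cells on the diagonal i - j = delta, as (i, j, d(i, j)) triples."""
    n, m = len(T) - 1, len(T[0]) - 1
    return [(i, i - delta, T[i][i - delta])
            for i in range(max(0, delta), min(n, m + delta) + 1)]


def last_positions(T, delta):
    """M[x] = last (i, j) on the diagonal i - j = delta with d(i, j) = x."""
    M = {}
    for i, j, value in diagonal_values(T, delta):
        M[value] = (i, j)   # later cells overwrite earlier ones
    return M


T = edit_distance_table("ABCBAA", "BACBBA")
print([v for _, _, v in diagonal_values(T, 0)])  # values along the main diagonal
print(last_positions(T, 0))
```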

[Figure: a diagonal carrying the values 0, 0, 0, 1, 1, 2, 3, 3, 4, 4, with $M_\delta(0), \ldots, M_\delta(4)$ marking the last position of each value.]


Now the question is whether there is some clever way to compute all $O(D^2)$ values $M_\delta(x)$ in a reasonable time.

Yes!
All values of $M_\delta(x + 1)$ can be computed in $O(D)$ time given all values of $M_\delta(x)$.

The idea is that we look at $m_\delta(x + 1)$, which is the first $(i, j)$ on the diagonal such that $d(i, j) = x + 1$. Then it must originate from a neighboring diagonal, say $i' - j' = \delta + 1$, with $d(i', j') = x$! Hence we can compute all $m_\delta(x + 1)$ in $O(D)$ time. Then for each of them we look at how far the “free” edges (i.e., with cost 0) extend. This is exactly what we have the longest common prefix for!
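A hedged Python sketch of this diagonal-by-diagonal computation follows. Two assumptions are ours. First, the free edges are followed by comparing letters one by one, which stands in for the constant-time longest common prefix queries and costs $O(nD)$ in the worst case, not $O(n + D^2)$. Second, since changing a letter is an allowed operation, the sketch also considers a step along the same diagonal. The function name `edit_distance_diagonals` is illustrative.

```python
def edit_distance_diagonals(s, t):
    """Edit distance via the furthest points M_delta(x) on each diagonal."""
    n, m = len(s), len(t)

    def slide(i, delta):
        # Follow cost-0 edges along the diagonal i - j = delta as far as possible.
        while i < n and i - delta < m and s[i] == t[i - delta]:
            i += 1
        return i

    # last[delta] = row i of M_delta(x) for the current cost x.
    last = {0: slide(0, 0)}
    x = 0
    while last.get(n - m, -1) < n:          # (n, m) lies on the diagonal n - m
        x += 1
        nxt = {}
        for delta in range(max(-x, -m), min(x, n) + 1):
            start = -1
            if delta in last:
                start = max(start, last[delta] + 1)      # change a letter
            if delta - 1 in last:
                start = max(start, last[delta - 1] + 1)  # remove a letter of s
            if delta + 1 in last:
                start = max(start, last[delta + 1])      # insert a letter of t
            start = min(start, n, m + delta)             # stay inside the grid
            nxt[delta] = slide(start, delta)
        last = nxt
    return x


print(edit_distance_diagonals("kitten", "sitting"))
```

Each round handles at most $2x + 1$ diagonals, so apart from the letter comparisons the loop does $O(D^2)$ work in total.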

While $O(n + D^2)$ seems nice from a purely practical point of view, it can still be $\Theta(n^2)$ in the worst case. So, can we beat this complexity, at least in theory?

Masek and Paterson 1980
The longest common subsequence of two strings over an alphabet of constant size can be computed in time $O\left(\frac{n^2}{\log n}\right)$.

Masek and Paterson 1980
The longest common subsequence of two strings over any alphabet can be computed in time $O\left(\frac{n^2 \log\log n}{\log n}\right)$.
