Time-Space Trade-Offs for the Longest Common Substring Problem
Tatiana Starikovskaya (1) and Hjalte Wedel Vildhøj (2)
(1) Moscow State University, Department of Mechanics and Mathematics
(2) Technical University of Denmark, DTU Compute
CPM 2013, Bad Herrenalb, Germany, June 17, 2013
1 / 27
The Longest Common Substring Problem
Definition
Problem: Given strings T1, T2, ..., Tm of total length n and a parameter d with 2 ≤ d ≤ m, compute the longest substring that appears in at least d of the strings.
Example
T1 = a g g c t a g c t a c c t   (positions 1 to 13)
T2 = a c a c c t a c c c t a g
T3 = a c t a g t a a t g c a t
d = 3 ⇒ LCS = c t a g
d = 2 ⇒ LCS = c t a c c
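The example can be checked with a brute-force sketch. This is not the method of the talk, only a reference implementation of the problem definition; the function name `longest_common_substring` is an assumption made for illustration, and the running time is far from the bounds discussed later.

```python
def longest_common_substring(strings, d):
    """Return a longest substring occurring in at least d of the strings.

    Brute force: try candidate lengths from longest to shortest and
    return the first candidate that occurs in at least d strings.
    """
    for length in range(max(map(len, strings)), 0, -1):
        for s in strings:
            for i in range(len(s) - length + 1):
                candidate = s[i:i + length]
                if sum(candidate in t for t in strings) >= d:
                    return candidate
    return ""


T1 = "aggctagctacct"
T2 = "acacctaccctag"
T3 = "actagtaatgcat"
print(longest_common_substring([T1, T2, T3], 3))  # ctag
print(longest_common_substring([T1, T2, T3], 2))  # ctacc
```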
4 / 27
The Longest Common Substring Problem
A patented solution
5 / 27
A Textbook Solution
Build Generalized Suffix Tree
[Figure: generalized suffix tree of T1 and T2]
T1 = a g g c t a g c t a c c t
T2 = a c a c c t a c c c t a g
Space: Θ(n)
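A linear-space index of this kind can be imitated with sorted suffixes. The sketch below swaps the generalized suffix tree for a naively sorted list of suffixes of the concatenation, which keeps the idea (the answer is the longest prefix shared by two lexicographically adjacent suffixes that start in different strings) but not the linear construction time. The function name and the separator characters `\x01` and `\x02` are assumptions; the separators are assumed not to occur in the inputs.

```python
def lcs_two_strings_sorted_suffixes(t1, t2):
    """Longest common substring of t1 and t2 via adjacent sorted suffixes."""
    text = t1 + "\x01" + t2 + "\x02"   # distinct separators, assumed unused
    boundary = len(t1)                 # index of the first separator

    # Naive suffix sorting; a real implementation would build a suffix
    # tree or suffix array in linear time instead.
    order = sorted(range(len(text)), key=lambda i: text[i:])

    best = ""
    for a, b in zip(order, order[1:]):
        # Only compare suffixes that start in different input strings.
        if (a <= boundary) == (b <= boundary):
            continue
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best


print(lcs_two_strings_sorted_suffixes("aggctagctacct", "acacctaccctag"))  # ctacc
```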
8 / 27
Our Results
Question
Can the LCS problem be solved (deterministically) in O(n^{1−ε}) space and O(n^{1+ε}) time for 0 ≤ ε ≤ 1?
Our Answer
Yes if 0 ≤ ε ≤ 1/3. More precisely:
For two strings (d = m = 2), the problem can be solved in:
Time: O(n^{1+ε})    Space: O(n^{1−ε})    for any 0 < ε ≤ 1/3.
In the general case (2 ≤ d ≤ m), the problem can be solved in:
Time: O(n^{1+ε} log^2 n (d log^2 n + d^2))    Space: O(n^{1−ε})    for any 0 ≤ ε < 1/3.
9 / 27
A Solution for Two Strings
When the LCS is long
Idea: Preprocess a sparse sample of the n suffixes for LCP queries.
T = a g g c t a g c t a c c t $1 a c a c c t a c c c t a g $2
(positions 1 to 28; $1 is at position 14 and $2 at position 28)
[Figure: T with the sampled positions marked DCτ]
Difference Covers
A difference cover modulo τ is a set of integers DCτ ⊆ {0, 1, ..., τ − 1} such that for any distance d ∈ {0, 1, ..., τ − 1}, DCτ contains two elements separated by distance d modulo τ.
Ex: The set DCτ = {1, 2, 4} is a difference cover modulo 5.
d      0     1     2     3     4
i, j   1,1   2,1   1,4   4,1   1,2
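The definition of a difference cover can be checked mechanically. The following is a minimal sketch with hypothetical helper names (`is_difference_cover`, `sampled_positions`); it assumes, as the example arrays later in the talk suggest, that a position p of T is sampled when p mod τ lies in the cover.

```python
def is_difference_cover(cover, tau):
    """True if every distance 0..tau-1 is realised as (i - j) mod tau."""
    residues = {(i - j) % tau for i in cover for j in cover}
    return residues == set(range(tau))


def sampled_positions(n, cover, tau):
    """1-indexed positions p in 1..n with p mod tau in the cover."""
    return [p for p in range(1, n + 1) if p % tau in cover]


print(is_difference_cover({1, 2, 4}, 5))   # True
print(is_difference_cover({1, 2}, 5))      # False
print(sampled_positions(28, {1, 2, 4}, 5))
```

For the 28-character example text this samples 17 positions, three per block of five.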
11 / 27
A Solution for Two Strings
When the LCS is long
Idea: Preprocess a sparse sample of the n suffixes for LCP queries.
T = a g g c t a g c t a c c t $1 a c a c c t a c c c t a g $2
(positions 1 to 28; $1 is at position 14 and $2 at position 28)
RB(11) = (g c t a c)^R = c a t c g
- Number of sampled suffixes: O((n/τ) · |DCτ|) = O(n/√τ).
- The LCS is the LCP of two suffixes.
- If |LCS| ≥ τ, one of the first τ characters of the LCS is sampled in both strings.
- Hence the LCS corresponds to a pair (p1*, p2*) maximizing
  lcp(RB(p1), RB(p2)) + lcp(T[p1..], T[p2..]) − 1
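The maximization over sampled pairs can be written out directly as a quadratic baseline. This is a sketch under stated assumptions, not the algorithm of the talk: RB(p) is read off the example as the reversal of the block of at most τ characters ending at position p, the characters `#` and `$` stand in for $1 and $2, and the helper names are hypothetical. It only guarantees the correct answer when |LCS| ≥ τ.

```python
def common_prefix_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def best_sampled_pair(t1, t2, cover, tau):
    """Try every sampled pair (p1 in t1, p2 in t2); positions are 1-indexed."""
    text = t1 + "#" + t2 + "$"   # "#" and "$" play the roles of $1 and $2
    n1, n2 = len(t1), len(t2)

    def sample(lo, hi):
        return [p for p in range(lo, hi + 1) if p % tau in cover]

    def rb(p):
        # Reversed block of (at most) tau characters ending at position p.
        return text[max(0, p - tau):p][::-1]

    best_len, best_sub = 0, ""
    for p1 in sample(1, n1):
        for p2 in sample(n1 + 2, n1 + 1 + n2):
            back = common_prefix_len(rb(p1), rb(p2))
            fwd = common_prefix_len(text[p1 - 1:], text[p2 - 1:])
            value = back + fwd - 1   # position p is counted in both terms
            if value > best_len:
                best_len = value
                best_sub = text[p1 - back:p1 - 1 + fwd]
    return best_len, best_sub


print(best_sampled_pair("aggctagctacct", "acacctaccctag", {1, 2, 4}, 5))
# (5, 'ctacc'), found at the sampled pair (11, 22)
```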
16 / 27
A Solution for Two Strings
When the LCS is long
How to compute the pair (p1*, p2*) faster than O(n^2/τ)?
T = a g g c t a g c t a c c t $1 a c a c c t a c c c t a g $2
(positions 1 to 28; $1 is at position 14 and $2 at position 28)
SAτ   = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
LCPτ  = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]
SARτ  = [14, 1, 17, 21, 26, 6, 16, 22, 11, 19, 12, 24, 4, 2, 27, 7, 9]
LCPRτ = [0, 1, 1, 4, 3, 0, 2, 4, 1, 3, 2, 1, 0, 2, 4, 0]
Main observation: lcp(T[p1*..], T[p2*..]) ∈ [ℓmax − τ + 1; ℓmax], so we can ignore all pairs with lcp values smaller than ℓmax − τ + 1.
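The four arrays on this slide can be reproduced by sorting the sampled positions, once by suffix and once by reversed block. The sketch below uses naive sorting rather than an efficient sparse suffix array construction, and assumes that `#` and `$` stand in for $1 and $2 (both sort before the letters, with `#` first), that τ = 5 with cover {1, 2, 4}, and that RB(p) is the reversal of the block of at most τ characters ending at p. All names are hypothetical.

```python
def common_prefix_len(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def sparse_arrays(positions, key):
    """Sort positions by key(p); return the order and adjacent LCP values."""
    order = sorted(positions, key=key)
    lcps = [common_prefix_len(key(p), key(q)) for p, q in zip(order, order[1:])]
    return order, lcps


TEXT = "aggctagctacct" + "#" + "acacctaccctag" + "$"
TAU, COVER = 5, {1, 2, 4}
SAMPLE = [p for p in range(1, len(TEXT) + 1) if p % TAU in COVER]


def suffix(p):
    return TEXT[p - 1:]                      # T[p..], 1-indexed


def reversed_block(p):
    return TEXT[max(0, p - TAU):p][::-1]     # RB(p)


sa, lcp = sparse_arrays(SAMPLE, suffix)
sar, lcpr = sparse_arrays(SAMPLE, reversed_block)
print(sa)
print(lcp)
print(sar)
print(lcpr)
```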
17 / 27
A Solution for Two Strings
When the LCS is long
How to compute the pair (p1*, p2*) faster than O(n^2/τ)?
T = a g g c t a g c t a c c t $1 a c a c c t a c c c t a g $2
(positions 1 to 28; $1 is at position 14 and $2 at position 28)
SAτ  = [14, 21, 17, 26, 6, 1, 16, 22, 11, 12, 19, 24, 4, 27, 7, 2, 9]
LCPτ = [0, 3, 1, 2, 2, 0, 1, 2, 1, 2, 3, 4, 0, 1, 1, 0]
Analysis (sketch): O(τ) rounds, each using O(n/√τ) time and space:
Time: O(n√τ)    Space: O(n/√τ)
With τ = n^{2ε}: Time: O(n^{1+ε})    Space: O(n^{1−ε})    for 0 < ε ≤ 1/2.
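The substitution behind these bounds can be spelled out. Taking the slide's figures of O(τ) rounds at O(n/√τ) time each as given, and setting τ = n^{2ε}:

```latex
\begin{align*}
\text{Time:}\quad
  & O(\tau)\cdot O\!\left(\frac{n}{\sqrt{\tau}}\right)
    = O\!\left(n\sqrt{\tau}\right)
    = O\!\left(n\cdot n^{\varepsilon}\right)
    = O\!\left(n^{1+\varepsilon}\right),\\
\text{Space:}\quad
  & O\!\left(\frac{n}{\sqrt{\tau}}\right)
    = O\!\left(\frac{n}{n^{\varepsilon}}\right)
    = O\!\left(n^{1-\varepsilon}\right).
\end{align*}
```

The upper limit ε ≤ 1/2 is the same condition as τ = n^{2ε} ≤ n.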
21 / 27
A Solution for Two Strings
When the LCS is shorter than τ
[Figure: T1 divided into substrings of length 2τ; Si denotes one batch of them]
- The LCS is a substring of one of the strings of length 2τ.
- Build the generalized suffix tree for a batch Si of strings of total length O(n/√τ).
- Traverse the suffix tree with T2 in O(n) time to find the node of greatest string depth.
- Repeat for all O(√τ) batches.
Time: O(n√τ)    Space: O(n/√τ)
With τ = n^{2ε}: Time: O(n^{1+ε})    Space: O(n^{1−ε})    for 0 ≤ ε ≤ 1/3.
τ = O(n/√τ)
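The batching scheme can be sketched as follows. Two substitutions are made for brevity and should be read as assumptions: the windows of length 2τ are taken to start at every multiple of τ, so that any substring of T1 of length at most τ lies inside one window, and each batch is indexed with a naive suffix trie (nested dictionaries) that is matched against T2 position by position, rather than with a generalized suffix tree traversed in O(n) time. The trie does not meet the space bound of the slide; only the batch structure is illustrated. All names, including `batch_chars`, are hypothetical.

```python
def build_suffix_trie(strings):
    """Naive trie of all suffixes of the given strings (nested dicts)."""
    root = {}
    for s in strings:
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.setdefault(ch, {})
    return root


def match_batch(batch, t2):
    """Longest substring of t2 that occurs in some string of the batch."""
    trie = build_suffix_trie(batch)
    best = ""
    for i in range(len(t2)):
        node, k = trie, 0
        while i + k < len(t2) and t2[i + k] in node:
            node = node[t2[i + k]]
            k += 1
        if k > len(best):
            best = t2[i:i + k]
    return best


def short_lcs(t1, t2, tau, batch_chars):
    """LCS of t1 and t2, assuming its length is at most tau."""
    windows = [t1[i:i + 2 * tau] for i in range(0, len(t1), tau)]
    best, batch, size = "", [], 0
    for w in windows + [None]:          # None flushes the final batch
        if w is not None:
            batch.append(w)
            size += len(w)
        if batch and (w is None or size >= batch_chars):
            best = max(best, match_batch(batch, t2), key=len)
            batch, size = [], 0
    return best


print(short_lcs("aggctagctacct", "acacctaccctag", tau=6, batch_chars=12))  # ctacc
```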
23 / 27
A General Solution for m Strings
When the LCS is long
Challenge: The difference cover property only holds for pairs.
[Figure: T = T1 T2 T3 T4 with three occurrences of the LCS marked, a window of length τ, and two lists labeled 5 4 1 and 5 3 2]
Algorithm: Extract the maximum head until we have d − 1 distinct strings. Repeat everything for all n possible positions of the LCS.
Computing the next element in a list can be done in O(log n (log^2 n + d)). Extracting it costs O(√τ). At most O(d√τ) extractions.
Time: O(n√τ log^2 n (log^2 n + d))    Space: O(n/√τ)
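The extraction loop can be illustrated with a heap over list heads. This is only a sketch of the "extract the maximum head" step under loud assumptions: the slide does not say what the lists contain, so each list is assumed here to hold (value, string id) pairs in descending order of value, the string ids attached to the figure's values 5 4 1 and 5 3 2 are invented for the example, and the function name is hypothetical.

```python
import heapq


def value_after_distinct_strings(lists, needed):
    """Pop the largest head until `needed` distinct string ids were seen.

    lists: lists of (value, string_id) pairs, each sorted by descending value.
    Returns the value of the last extracted element, or None if the lists
    run out before `needed` distinct ids are seen.
    """
    # Heap entries: (negated value, list index, index inside that list).
    heap = [(-lst[0][0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    seen = set()
    while heap:
        _, i, k = heapq.heappop(heap)
        value, string_id = lists[i][k]
        seen.add(string_id)
        if len(seen) == needed:
            return value
        if k + 1 < len(lists[i]):
            heapq.heappush(heap, (-lists[i][k + 1][0], i, k + 1))
    return None


# Values follow the figure's labels; the string ids are made up.
lists = [
    [(5, "T2"), (4, "T2"), (1, "T3")],
    [(5, "T2"), (3, "T4"), (2, "T3")],
]
print(value_after_distinct_strings(lists, 2))  # 3
print(value_after_distinct_strings(lists, 3))  # 2
```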
26 / 27
Conclusion
Results
For two strings (d = m = 2), the LCS problem can be solved in:
Time: O(n^{1+ε})    Space: O(n^{1−ε})    for any 0 < ε ≤ 1/3.
In the general case (2 ≤ d ≤ m), the LCS problem can be solved in:
Time: O(n^{1+ε} log^2 n (d log^2 n + d^2))    Space: O(n^{1−ε})    for any 0 ≤ ε < 1/3.
Open Problems
Can the generalized solution be improved? Can the trade-off interval of our solutions be extended to 0 ≤ ε ≤ 1/2? Can the problem be solved in O(n^{1+ε}) time and O(n^{1−ε}) space for any 0 ≤ ε ≤ 1?
27 / 27