Efficient LZ78 factorization of grammar compressed text
Hideo Bannai, Shunsuke Inenaga, Masayuki Takeda
Kyushu University, Japan
SPIRE 2012 @ Cartagena, Colombia
Outline
• Background
  • LZ78 Factorization
  • Straight Line Programs (SLP)
• Algorithms
  • LZ78 factorization using suffix trees
  • SLP to LZ78
  • Improvements
Background
Compressed String Processing (CSP)
• Compress the string for storage … but don’t decompress all of it when using it!
• Processing the compressed representation directly can be faster than processing the uncompressed text, by exploiting regularities identified by the compression.
• Regard compression as generic preprocessing!
• Applications: pattern matching, edit distance, pattern mining, etc.
[Figure: a big string and its compressed representation, which is processed directly]
This work: LZ78 factorization of grammar compressed strings
LZ78 Factorization [Ziv & Lempel ’78]
The LZ78 factorization of a string S is a factorization S = f1 f2 ... fm, where fi is the longest prefix of fi ... fm such that fi = fj c for some 0 ≤ j < i and some character c (let f0 = ε).
Example: S = a l a b a r a l a l a b a r d a $
f1 = a    (0,a)
f2 = l    (0,l)
f3 = ab   (1,b)
f4 = ar   (1,r)
f5 = al   (1,l)
f6 = ala  (5,a)
f7 = b    (0,b)
f8 = ard  (4,d)
f9 = a$   (1,$)
[Figure: LZ78 trie of S, with root 0 and one node i for each factor fi]
The factorization can be computed in O(N log σ) time and O(m) space.
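The definition above can be turned into a short direct (uncompressed) factorizer. The following is a minimal sketch, not the authors' implementation: the function name `lz78_factorize` is made up for illustration, and the trie is stored in a Python dictionary (expected constant-time lookups) rather than the O(log σ) branching assumed by the slide's bound.

```python
def lz78_factorize(s):
    """Return the LZ78 factors of s as (j, c) pairs, meaning f_i = f_j + c."""
    trie = {}        # (parent node id, character) -> child node id
    factors = []
    node = 0         # node 0 is the root, i.e. f_0 = empty string
    for ch in s:
        child = trie.get((node, ch))
        if child is not None:
            node = child                     # keep extending the current match
        else:
            factors.append((node, ch))       # new factor f_i = f_node + ch
            trie[(node, ch)] = len(factors)  # the new trie node gets index i
            node = 0
    if node != 0:
        # The input ended inside the trie: the last factor repeats f_node.
        factors.append((node, ""))
    return factors


if __name__ == "__main__":
    for j, c in lz78_factorize("alabaralalabarda$"):
        print((j, c))
```

On the string S = aabaababaabab used in the later slides, this yields six factors: a, ab, aa, b, aba, abab.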
Straight Line Programs
• A straight line program (SLP) is a CFG in Chomsky normal form that derives a single string.
• SLPs can efficiently model the outputs of many compression algorithms: Re-Pair, SEQUITUR, LZ78, etc.
Example SLP (n = 7):
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
[Figure: derivation tree of X7, which derives S = a a b a a b a b a a b a b]
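To make the example concrete, here is a small sketch of one possible in-memory encoding of this SLP. The dictionary layout and the helper names `lengths` and `expand` are assumptions made for illustration; the only property relied on is that every rule refers to variables with smaller indices, as in the example.

```python
# Example SLP: index i -> a single character, or a pair (left, right) of variable indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """|X_i| for every variable, computed bottom-up without expanding anything."""
    size = {}
    for i in sorted(slp):                  # children have smaller indices
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def expand(slp, i):
    """Fully decompress X_i; time and space are proportional to |X_i|."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return expand(slp, rule[0]) + expand(slp, rule[1])


if __name__ == "__main__":
    print(lengths(SLP)[7], expand(SLP, 7))
```

Computing the lengths takes O(n) time, whereas `expand` is exactly the full decompression that compressed string processing tries to avoid.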
Problem: SLP to LZ78
Input: SLP
Output: LZ78 factorization (trie)
Example input:
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
[Figure: LZ78 trie of the derived string, with nodes 0 to 6]
Why “re-compress” a compressed representation?
• Convert the representation: some CSP algorithms require a specific compression format.
• Re-compress an SLP modified by ad-hoc edits: dynamic compressed texts.
• Compute the Normalized Compression Distance [Li et al. 2004]: clustering & classification w/o decompression, by computing CLZ78(x), CLZ78(y), CLZ78(xy) from SLPs of x and y.
[Figure: computer scientist: “Make Sleeping Files Walk in their Sleep!”]
Our Results
Algorithms to compute LZ78 from SLP

Algorithm                      Time               Space
Direct (uncompressed)          O(N log σ)         O(m)
Decompress + Direct            O(N log σ)         O(n + m)
SLP (partial decompressions)   O(nN½ + m log N)   O(nN½ + m)
SLP + Doubling                 O(nL + m log N)    O(nL + m)
SLP + Redundancy Reduction     O(Nα + m log N)    O(Nα + m)

N : length of uncompressed string S
σ : alphabet size
n : size of SLP representing S
L : length of longest LZ78 factor
m : # of LZ78 factors (O(N / log N) for constant σ)
Nα = N – α ≤ N, where α ≥ 0 is a quantity that represents the amount of redundancy in the string that is captured by the SLP
LZ78 Factorization using a Suffix Tree
Suffix Tree & LZ78
The LZ78 trie can be superimposed on the suffix tree.
S = a a b a a b a b a a b a b (positions 1 to 13)
[Figure: suffix tree of S (left) and LZ78 trie of S with nodes 0 to 6 (right)]
LZ78 Factorization on Suffix Tree
S = a a b a a b a b a a b a b (positions 1 to 13)
[Figure: suffix tree of S with the nodes of the LZ78 trie (0 to 6) marked]
Build the LZ78 trie on top of the suffix tree ST; nodes corresponding to the LZ78 trie are marked.
The next factor is a prefix of S[i:N]. For each factor:
• Find the node in ST corresponding to S[i:N].
• Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time by dynamic nearest marked ancestor (nma) queries [Westbrook ’92].
• Make a new node of the LZ78 trie on ST: O(1) time by a level ancestor (la) query on ST [Berkman & Vishkin ’94].
• Compute the next position: i ← i + |fi|.
LZ78 factorization in O(m) time, given a suffix tree preprocessed for nma & la queries.
SLP to LZ78
Our algorithm: SLP to LZ78
Key Observation
For any string of length N, the length of any LZ78 factor fi satisfies
$|f_i| \le c_N = \sqrt{2N + 1/4} - 1/2 = O(\sqrt{N})$
Main Idea
• We only need a suffix tree that contains all distinct substrings of S with length at most cN.
• Build a generalized suffix tree (GST) from a set of substrings of S that contain all distinct length-cN substrings of S.
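The slide states the bound on factor lengths without spelling out the reasoning. One standard way to see it: a factor of length L needs earlier factors of lengths 1, 2, ..., L – 1 (its proper prefixes), so N ≥ 1 + 2 + ... + L = L(L + 1)/2, and solving for L gives the bound. The sketch below evaluates the floor of that bound exactly; the function name is hypothetical.

```python
from math import isqrt


def max_factor_length_bound(n):
    """floor(sqrt(2n + 1/4) - 1/2), computed exactly with integer arithmetic.

    Uses the identity sqrt(2n + 1/4) - 1/2 = (sqrt(8n + 1) - 1) / 2.
    """
    return (isqrt(8 * n + 1) - 1) // 2


if __name__ == "__main__":
    print(max_factor_length_bound(13))
```

For the running example with N = 13 this gives 4, matching the value cN = 4 used in the example slides.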
Important Concept: Stabbing
Xi stabs an interval [u:v] of S when it is the shortest variable that derives the interval (any interval is stabbed by a unique variable).
X1 = a
X2 = b
X3 = X1 X2
X4 = X1 X3
X5 = X4 X3
X6 = X4 X5
X7 = X6 X5
e.g.: aaba at [9:12] is stabbed by X5
[Figure: derivation tree of X7 over S = a a b a a b a b a a b a b, positions 1 to 13]
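The stabbing variable of an interval can be found by walking down the derivation tree from the root until the interval straddles the boundary between the two children. The following is a sketch under assumptions: the dictionary encoding of the SLP and the function name `stabbing_variable` are illustrative, and this plain top-down walk takes time proportional to the height of the derivation tree rather than using a dedicated random access structure.

```python
# Example SLP: index i -> a single character, or a pair (left, right) of variable indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def lengths(slp):
    """|X_i| for every variable, computed bottom-up."""
    size = {}
    for i in sorted(slp):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def stabbing_variable(slp, root, u, v):
    """Return (i, start): X_i stabs the 1-based inclusive interval [u:v],
    and this occurrence of X_i derives the text starting at position start."""
    size = lengths(slp)
    i, start = root, 1
    while not isinstance(slp[i], str):
        left, right = slp[i]
        mid = start + size[left]      # first position derived by the right child
        if v < mid:
            i = left                  # interval lies entirely in the left child
        elif u >= mid:
            i, start = right, mid     # interval lies entirely in the right child
        else:
            return i, start           # interval crosses the boundary: X_i stabs it
    return i, start                   # a single character is stabbed by a terminal rule


if __name__ == "__main__":
    print(stabbing_variable(SLP, 7, 9, 12))
```

For the interval [9:12] of the example this returns variable 5 starting at position 9, in agreement with the slide.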
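Substrings stabbed by Xi
For Xi = Xl(i) Xr(i), all length-q substrings stabbed by Xi are contained in a string ti(q) of length at most 2(q – 1): the last q – 1 characters of Xl(i) followed by the first q – 1 characters of Xr(i).
[Figure: Xi split into Xl(i) and Xr(i); ti(q) spans q – 1 characters on each side of the boundary, and a length-q window slides over it]
Any length-q substring of S is stabbed by some unique variable Xi, and therefore is a substring of some ti(q).
{ ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n } will contain all distinct length-cN substrings of S.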
LZ78 Factorization from SLP
Algorithm:
1. Compute { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }.
2. Build the generalized suffix tree (GST) for the strings { ti(cN) : |Xi| ≥ cN, 1 ≤ i ≤ n }.
3. Run the LZ78 factorization algorithm using the GST.
O(ncN) time/space
Example
N = 13, cN = 4, n = 7
{ t5(4), t6(4), t7(4) } = { aabab, aabaab, babaab }
[Figure: derivation tree of X7 over S = a a b a a b a b a a b a b, positions 1 to 13]
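The strings ti(q) in this example can be computed bottom-up from the SLP by keeping only the first and last q – 1 characters of each variable, so nothing longer than 2(q – 1) characters is ever materialized. This is a sketch, assuming the same illustrative dictionary encoding of the SLP; the function name `boundary_strings` is hypothetical and q ≥ 2 is assumed.

```python
# Example SLP: index i -> a single character, or a pair (left, right) of variable indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def boundary_strings(slp, q):
    """Return {i: t_i(q)} for every variable with |X_i| >= q (assumes q >= 2)."""
    k = q - 1
    size, pre, suf, t = {}, {}, {}, {}
    for i in sorted(slp):                    # children have smaller indices
        rule = slp[i]
        if isinstance(rule, str):
            size[i], pre[i], suf[i] = 1, rule, rule
            continue
        left, right = rule
        size[i] = size[left] + size[right]
        pre[i] = (pre[left] + pre[right])[:k]    # first k characters of X_i
        suf[i] = (suf[left] + suf[right])[-k:]   # last k characters of X_i
        if size[i] >= q:
            # Window of at most 2k characters around the X_left / X_right boundary.
            t[i] = suf[left] + pre[right]
    return t


if __name__ == "__main__":
    print(boundary_strings(SLP, 4))
```

Each variable is handled with O(q) work, which is consistent with the O(ncN) bound quoted for the preprocessing when q = cN.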
GST & LZ78 Factors
The LZ78 trie superimposed on the GST of { t5(4), t6(4), t7(4) }
S = a a b a a b a b a a b a b (positions 1 to 13)
t5(4) t6(4) t7(4) = a a b a b | a a b a a b | b a b a a b (positions 1 to 17)
[Figure: GST of { t5(4), t6(4), t7(4) } (left) and LZ78 trie of S with nodes 0 to 6 (right)]
LZ78 Factorization on GST
cN = 4
[Figure: derivation tree of X7 over S, the concatenation t5(4) t6(4) t7(4), and the GST with the LZ78 trie nodes created so far marked; animation over successive factors]
The next factor is a prefix of S[i:N]. For each factor:
• Find the node in the GST corresponding to S[i:N]: O(log N) time w/ random access on SLP [Bille et al. 2011]
• Find the longest prefix of S[i:N] in the LZ78 trie: O(1) time w/ dynamic nma queries
• Make a new node for the LZ78 trie on the GST: O(1) time w/ dynamic nma queries
• Compute the next position: i ← i + |fi|
LZ78 factorization can be computed in O(m log N) time, given the GST preprocessed for nma & la queries and the SLP preprocessed for random access queries.
Summary of Basic Algorithm

Algorithm               Time               Space
Direct (uncompressed)   O(N log σ)         O(m)
Decompress + Direct     O(N log σ)         O(n + m)
SLP                     O(ncN + m log N)   O(ncN + m)

cN = O(N½)

Extreme Cases:
• If the string is compressible, n = O(log N), m = O(N½), so O(ncN + m log N) = O(N½ log N) = o(N).
• If the string is not compressible, n, m = O(N) and O(ncN + m log N) = O(N^1.5).
Can we do better than just reverting to decompress & process?
(1) Improving ncN term to nL ≤ ncN
O(ncN + m log N) time, O(ncN + m) space → O(nL + m log N) time, O(nL + m) space
• Let L denote the length of the longest LZ78 factor of S.
• We built the GST for distinct substrings of length at most cN, but actually we only need substrings of length at most L.
• However, L is not known beforehand…
Doubling Technique:
• Assume L = 2 and run the algorithm.
• If the LZ78 trie expands beyond the GST, set L ← 2×L, rebuild the GST and LZ78 trie, and continue.
• Total time complexity for rebuilds:
  $\sum_{i=1}^{\log L} O(n 2^i + m) = O(nL + m \log L)$
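The total rebuild cost is a geometric sum. As a worked sketch (assuming L ≥ 2, guesses 2, 4, 8, ..., and k = ⌈log₂ L⌉ rounds, where round i costs at most n·2^i + m up to a constant factor):

```latex
\begin{align*}
\sum_{i=1}^{k} \left( n 2^{i} + m \right)
  &= n \left( 2^{k+1} - 2 \right) + m k \\
  &< 4 n L + m \left( \log_2 L + 1 \right)
  && \text{since } k < \log_2 L + 1 \text{ implies } 2^{k} < 2L \\
  &= O(nL + m \log L).
\end{align*}
```

The last guess dominates the n-term, so repeated rebuilding costs only a constant factor more than building once with the correct L.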
(2) Improving ncN term to Nα ≤ N
O(ncN + m log N) time, O(ncN + m) space → O(Nα + m log N) time, O(Nα + m) space
We can replace the GST with the suffix tree of a trie for q = cN.
Lemma [Goto et al. CPM 2012]
Given an SLP for string S, the set of length-q substrings of S can be represented as paths in a reverse trie of size $N_\alpha = N - \alpha(q) \le N$, where
$\alpha(q) = \sum_{i : |X_i| \ge q} (\mathit{vOcc}(X_i) - 1)\,(|t_i(q)| - (q - 1)) \ge 0$
and vOcc(Xi) is the # of times Xi occurs in the derivation tree.
The trie can be computed in time linear in its size.
Lemma [Shibuya 2003]
The suffix tree of a reverse trie can be constructed in linear time.
Nα = O(ncN)
Example: Trie of size Nα for q = 4
[Figure: derivation tree of X7 over S = a a b a a b a b a a b a b, with the strings a a b a b, a a b, b a b aggregated into a trie]
Σ|ti(q)| : 17
Text size: 13
Trie size: 11
We can aggregate all ti(q) into a trie of size at most the text size.
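The numbers in this example can be checked by evaluating the formula for α(q) from the lemma directly on the SLP. The sketch below only computes α(q) and Nα = N – α(q); it does not build the trie itself. The dictionary encoding and the function name `trie_size_bound` are assumptions made for illustration, and every variable is assumed to be reachable from the root.

```python
# Example SLP: index i -> a single character, or a pair (left, right) of variable indices.
SLP = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (4, 3), 6: (4, 5), 7: (6, 5)}


def trie_size_bound(slp, root, q):
    """Return (alpha(q), N - alpha(q)) following the formula in the lemma."""
    k = q - 1
    size = {}
    for i in sorted(slp):                      # children have smaller indices
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]

    # vOcc(X_i): number of occurrences of X_i in the derivation tree of the root.
    vocc = {i: 0 for i in slp}
    vocc[root] = 1
    for i in sorted(slp, reverse=True):        # parents before children
        rule = slp[i]
        if not isinstance(rule, str):
            vocc[rule[0]] += vocc[i]
            vocc[rule[1]] += vocc[i]

    alpha = 0
    for i, rule in slp.items():
        if isinstance(rule, str) or size[i] < q:
            continue
        left, right = rule
        t_len = min(k, size[left]) + min(k, size[right])   # |t_i(q)|
        alpha += (vocc[i] - 1) * (t_len - k)
    return alpha, size[root] - alpha


if __name__ == "__main__":
    print(trie_size_bound(SLP, 7, 4))
```

Here only X5 occurs more than once among the variables of length at least 4, so α(4) = (2 – 1)(5 – 3) = 2 and Nα = 13 – 2 = 11, the trie size shown on the slide.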
Summary
• Showed an algorithm for SLP → LZ78 factorization
  • at least as fast as naïve decompress & process
  • better when the string is compressible

Algorithm                      Time               Space
Direct (uncompressed)          O(N log σ)         O(m)
Decompress + Direct            O(N log σ)         O(n + m)
SLP (partial decompressions)   O(nN½ + m log N)   O(nN½ + m)
SLP + Doubling                 O(nL + m log N)    O(nL + m)
SLP + Redundancy Reduction     O(Nα + m log N)    O(Nα + m)

N : length of uncompressed string S
σ : alphabet size
n : size of SLP representing S
L : length of longest LZ78 factor
m : # of LZ78 factors (O(N / log N) for constant σ)
Nα = N – α(cN) ≤ N