Pattern Matching Algorithms: An Overview
Shoshana Neuburger
The Graduate Center, CUNY
9/15/2009

Overview
• Pattern Matching in 1D
• Dictionary Matching
• Pattern Matching in 2D
• Indexing
  – Suffix Tree
  – Suffix Array
• Research Directions

What is Pattern Matching?
Given a pattern and text, find the pattern in the text.

What is Pattern Matching?
• Σ is an alphabet.
• Input:
  Text $T = t_1 t_2 \ldots t_n$
  Pattern $P = p_1 p_2 \ldots p_m$
  with $t_i, p_i \in \Sigma$.
• Output: All i such that $T[i+k] = P[k+1]$ for $0 \le k < m$.

Pattern Matching - Example
Input: P = cagc, Σ = {a,g,c,t}, T = acagcatcagcagctagcat
Output: {2,8,11}
Positions in T are numbered from 1; the occurrences that start at 8 and 11 overlap in one character.

Pattern Matching Algorithms
• Naïve Approach
  – Compare pattern to text at each location.
  – O(mn) time.
• More efficient algorithms utilize information from previous comparisons.
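
The naïve approach can be sketched in a few lines of Python. This is an illustration only, not code from the talk: the function name `naive_match` is an assumption, and positions are reported 1-based to agree with the example slide.

```python
# Illustrative sketch; the name naive_match is an assumption, not from the talk.
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every 1-based position i at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):            # try each alignment of the pattern
        k = 0
        while k < m and text[i + k] == pattern[k]:
            k += 1                        # up to m comparisons per alignment
        if k == m:
            hits.append(i + 1)            # convert 0-based start to 1-based
    return hits


print(naive_match("acagcatcagcagctagcat", "cagc"))  # [2, 8, 11]
```

Each alignment forgets everything learned at the previous one, which is where the O(mn) bound comes from.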

Pattern Matching Algorithms
• Linear time methods have two stages:
  1. Preprocess pattern in O(m) time and space.
  2. Scan text in O(n) time and space.
• Knuth, Morris, Pratt (1977): automata method
• Boyer, Moore (1977): can be sublinear

KMP Automaton
P = ababcb
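
The automaton figure for this slide did not survive extraction. As a stand-in, here is a minimal Python sketch of the usual failure-table formulation of KMP, applied to P = ababcb. The function names `failure_table` and `kmp_search` are assumptions made for this illustration, and the code is a sketch of the standard technique rather than the exact automaton drawn in the talk.

```python
# Illustrative sketch of KMP; function names are assumptions, not from the talk.
def failure_table(pattern: str) -> list[int]:
    """fail[q] = length of the longest proper border of pattern[:q + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[q] != pattern[k]:
            k = fail[k - 1]               # fall back to a shorter border
        if pattern[q] == pattern[k]:
            k += 1
        fail[q] = k
    return fail


def kmp_search(text: str, pattern: str) -> list[int]:
    """Return the 1-based start positions of all occurrences of pattern."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, state = [], 0                   # state = number of matched characters
    for pos, ch in enumerate(text):
        while state > 0 and ch != pattern[state]:
            state = fail[state - 1]       # failure transition
        if ch == pattern[state]:
            state += 1
        if state == len(pattern):
            hits.append(pos - len(pattern) + 2)
            state = fail[state - 1]       # keep going to find overlaps
    return hits


print(failure_table("ababcb"))            # [0, 0, 1, 2, 0, 0]
```

The text pointer never moves backwards, so the scan is O(n) after O(m) preprocessing.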

Dictionary Matching
• Σ is an alphabet.
• Input:
  Text $T = t_1 t_2 \ldots t_n$
  Dictionary of patterns $D = \{P_1, P_2, \ldots, P_k\}$
  All characters in patterns and text belong to Σ.
• Output: All i, j such that $T[i+l] = P_j[l+1]$ for $0 \le l < m_j$, $1 \le j \le k$, where $m_j = |P_j|$.

Dictionary Matching Algorithms
• Naïve Approach:
  – Use an efficient pattern matching algorithm for each pattern in the dictionary.
  – O(kn) time.
More efficient algorithms process text once.

AC Automaton
• Aho and Corasick extended the KMP automaton to dictionary matching.
• Preprocessing time: O(d), where $d = \sum_{j=1}^{k} |P_j|$.
• Matching time: O(n log |Σ| + k).
Independent of dictionary size!

AC Automaton
D = {ab, ba, bab, babb, bb}
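
The automaton drawing for this dictionary is missing from the transcript. The following Python sketch builds an Aho-Corasick automaton (trie, failure links, output lists) for it. The names `build_ac` and `ac_search` are assumptions for this illustration, and the transitions are stored in Python dictionaries, so lookups are expected constant time rather than the O(log |Σ|) branching cost quoted in the talk.

```python
from collections import deque

# Illustrative sketch; build_ac and ac_search are hypothetical names.
def build_ac(patterns):
    """Build goto, fail and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]     # state 0 is the root
    for idx, pat in enumerate(patterns):
        state = 0
        for ch in pat:                    # insert the pattern into the trie
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(idx)
    queue = deque(goto[0].values())       # depth-1 states fail to the root
    while queue:                          # breadth-first over the trie
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit patterns that are suffixes
            queue.append(t)
    return goto, fail, out


def ac_search(text, patterns):
    """Return sorted (1-based start, pattern) pairs for all occurrences."""
    goto, fail, out = build_ac(patterns)
    state, hits = 0, []
    for pos, ch in enumerate(text):       # the text is scanned exactly once
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for idx in out[state]:
            hits.append((pos - len(patterns[idx]) + 2, patterns[idx]))
    return sorted(hits)


D = ["ab", "ba", "bab", "babb", "bb"]
print(ac_search("babb", D))
```

On the text babb every pattern of D is reported: ba, bab and babb at position 1, ab at position 2 and bb at position 3.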

Dictionary Matching
• KMP automaton does not depend on alphabet size while AC automaton does – branching.
• Dori, Landau (2006): AC automaton is built in linear time for integer alphabets.
• Breslauer (1995) eliminates log factor in text scanning stage.

Periodicity
A crucial task in the preprocessing stage of most pattern matching algorithms: computing periodicity.
Many forms:
  – failure table
  – witnesses

Periodicity
• A periodic pattern can be superimposed on itself without mismatch before its midpoint.
• Why is periodicity useful?
  Can quickly eliminate many candidates for pattern occurrence.

Periodicity
Definition:
• S is periodic if $S = \rho' \rho^k$, $k \ge 2$, and $\rho'$ is a proper suffix of $\rho$.
• S is periodic if its longest prefix that is also a suffix is at least half |S|.
• The shortest period corresponds to the longest border.

Periodicity - Example
S = abcabcabcab, |S| = 11
• Longest border of S: b = abcabcab; |b| = 8, so S is periodic.
• Shortest period of S: $\rho$ = abc; $|\rho|$ = 3, so S is periodic.
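
The relation between the longest border and the shortest period can be checked mechanically. The Python sketch below computes the longest border with the KMP failure table and derives the shortest period length as |S| minus the border length. The function names are assumptions for this illustration, and `is_periodic` uses the border-based test (border at least half of |S|).

```python
# Illustrative sketch; function names are assumptions, not from the talk.
def longest_border(s: str) -> int:
    """Length of the longest proper prefix of s that is also a suffix of s."""
    if not s:
        return 0
    fail = [0] * len(s)
    k = 0
    for q in range(1, len(s)):
        while k > 0 and s[q] != s[k]:
            k = fail[k - 1]
        if s[q] == s[k]:
            k += 1
        fail[q] = k
    return fail[-1]


def shortest_period(s: str) -> int:
    """Length of the shortest period: |S| minus the longest border."""
    return len(s) - longest_border(s)


def is_periodic(s: str) -> bool:
    """Periodic in the sense used here: the border is at least half of |S|."""
    return len(s) > 0 and 2 * longest_border(s) >= len(s)


S = "abcabcabcab"
print(longest_border(S), shortest_period(S), is_periodic(S))  # 8 3 True
```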

Witnesses
Popular paradigm in pattern matching:
1. find consistent candidates
2. verify candidates
consistent candidates → verification is linear

Witnesses
• Vishkin introduced the duel to choose between two candidates by checking the value of a witness.
• Alphabet-independent method.

Witnesses
Preprocess pattern:
• Compute witness for each location of self-overlap.
• Size of witness table: $|\rho|$ if P is periodic, $m/2$ otherwise, where $\rho$ is the shortest period of P.

Witnesses
• WIT[i] = any k such that P[k] ≠ P[k-i+1].
• WIT[i] = 0, if there is no such k.
k is a witness against P overlapping itself when shifted to start at position i, that is, against i-1 being a period of P.
Example:
Position       1 2 3 4
Pattern        a a a b
Witness Table  0 4 4 4
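
The witness table of the example can be reproduced with a direct, quadratic-time Python sketch. This is only an illustration of the definition: the name `witness_table` is an assumption, the code fills in an entry for every position rather than the truncated table mentioned earlier, it picks the first witness it finds where the definition allows any, and efficient preprocessing is not attempted.

```python
# Illustrative sketch of the definition only; not an efficient construction.
def witness_table(p: str) -> list[int]:
    """Return WIT as a 1-based table (index 0 is unused padding).

    WIT[i] is the first k with P[k] != P[k - i + 1], or 0 if no such k exists.
    """
    m = len(p)
    wit = [0] * (m + 1)
    for i in range(1, m + 1):
        for k in range(i, m + 1):          # k - i + 1 stays within 1..m
            if p[k - 1] != p[k - i]:       # P[k] vs P[k-i+1] in 0-based terms
                wit[i] = k
                break
    return wit


print(witness_table("aaab")[1:])           # [0, 4, 4, 4]
```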

Witnesses
Let j>i. Candidates i and j are consistent if they are sufficiently far from each other OR WIT[j-i+1]=0 (with WIT indexed as on the previous slide, where WIT[1] corresponds to a shift of 0).

Duel
Scan text:
• If a pair of candidates is close and inconsistent, perform a duel to eliminate one (or both).
• Sufficient to identify pairwise consistent candidates: transitivity of consistent positions.
[Figure: pattern P = aaab aligned with text T at candidate positions i and j; the text character at the witness position (a or b?) decides the duel]
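
A single duel can be sketched as follows. To keep the arithmetic simple the sketch is 0-based and looks up the witness by the shift d = j - i directly instead of through a precomputed table. Both function names are assumptions for this illustration, and the witness is found by brute force.

```python
# Illustrative sketch; names and 0-based indexing are assumptions.
def witness_for_shift(p: str, d: int):
    """Return a 0-based k with p[k] != p[k - d], or None if none exists."""
    for k in range(d, len(p)):
        if p[k] != p[k - d]:
            return k
    return None


def duel(text: str, p: str, i: int, j: int) -> list[int]:
    """Duel between candidate starts i < j (0-based, j - i < len(p)).

    Returns the candidates that survive; at most one survives when the
    two candidates are inconsistent.
    """
    d = j - i
    k = witness_for_shift(p, d)
    if k is None:
        return [i, j]                      # consistent, nothing to decide
    c = text[i + k] if i + k < len(text) else None
    survivors = []
    if c == p[k]:                          # candidate i expects p[k] here
        survivors.append(i)
    if c == p[k - d]:                      # candidate j expects p[k - d] here
        survivors.append(j)
    return survivors


print(duel("aaaab", "aaab", 0, 1))         # [1]
```

One text character is inspected per duel, regardless of the alphabet, which is why the method is alphabet-independent.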

2D Pattern Matching
• Σ is an alphabet.
• Input:
  Text T[1…n, 1…n]
  Pattern P[1…m, 1…m]
  with $t_{ij}, p_{ij} \in \Sigma$.
• Output: All (i, j) such that $T[i+k, j+l] = P[k+1, l+1]$ for $0 \le k, l < m$.
[Figure: MRI]

2D Pattern Matching - Example
Input: Σ = {A,B}
Pattern:
A B A
A B A
A A B
Text:
A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B
Output: { (1,4),(2,2),(4, 3)}

Bird / Baker
• First linear-time 2D pattern matching algorithm.
• View each pattern row as a metacharacter to linearize problem.
• Convert 2D pattern matching to 1D.

Bird / Baker
Preprocess pattern:
• Name rows of pattern using AC automaton.
• Using names, pattern has 1D representation.
• Construct KMP automaton of pattern.
Identical rows receive identical names.

Bird / Baker
Scan text:
• Name positions of text that match a row of pattern, using AC automaton within each row.
• Run KMP on named columns of text.
Since the 1D names are unique, only one name can be given to a text location.

Bird / Baker - Example
Preprocess pattern:
• Name rows of pattern using AC automaton.
• Using names, pattern has 1D representation.
• Construct KMP automaton of pattern.
Pattern rows and their names:
A B A   1
A B A   1
A A B   2

Bird / Baker - Example
Scan text:
• Name positions of text that match a row of pattern, using AC automaton within each row.
• Run KMP on named columns of text.
Text:
A A B A B A A
B A B A B A B
A A B A A B B
B A A B A A A
A B A B A A A
B B A A B A B
B B B A B A B
Names (written at the position where a pattern row ends; 0 means no pattern row ends there):
0 0 2 1 0 1 0
0 0 0 1 0 1 0
0 0 2 1 0 2 0
0 0 0 2 1 0 0
0 0 1 0 1 0 0
0 0 0 0 2 1 0
0 0 0 0 0 1 0
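
The two stages can be sketched in Python on this example. One simplification is made and should be read as an assumption: instead of an AC automaton, text positions are named by looking up each length-m row slice in a dictionary, which costs an extra factor of m and so is not linear time. The KMP pass down the named columns follows the scheme described above. The function name `bird_baker_sketch` is hypothetical.

```python
# Illustrative sketch; dictionary lookup stands in for the AC automaton.
def bird_baker_sketch(text: list[str], pattern: list[str]) -> list[tuple[int, int]]:
    """Return 1-based (row, column) of the top-left corner of each occurrence."""
    m = len(pattern)
    names = {}                                   # identical rows, identical names
    for row in pattern:
        names.setdefault(row, len(names) + 1)
    pat1d = [names[row] for row in pattern]      # 1D representation of pattern

    fail = [0] * m                               # KMP failure table of pat1d
    k = 0
    for q in range(1, m):
        while k > 0 and pat1d[q] != pat1d[k]:
            k = fail[k - 1]
        if pat1d[q] == pat1d[k]:
            k += 1
        fail[q] = k

    rows, cols = len(text), len(text[0])
    named = [[0] * cols for _ in range(rows)]    # 0 = no pattern row ends here
    for r in range(rows):
        for c in range(m - 1, cols):
            named[r][c] = names.get(text[r][c - m + 1:c + 1], 0)

    hits = []
    for c in range(m - 1, cols):                 # KMP down each named column
        state = 0
        for r in range(rows):
            while state > 0 and named[r][c] != pat1d[state]:
                state = fail[state - 1]
            if named[r][c] == pat1d[state]:
                state += 1
            if state == m:
                hits.append((r - m + 2, c - m + 2))
                state = fail[state - 1]
    return sorted(hits)


P = ["ABA", "ABA", "AAB"]
T = ["AABABAA", "BABABAB", "AABAABB", "BAABAAA",
     "ABABAAA", "BBAABAB", "BBBABAB"]
print(bird_baker_sketch(T, P))                   # [(1, 4), (2, 2), (4, 3)]
```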

Bird / Baker
• Complexity of Bird / Baker algorithm: $O(n^2 \log |\Sigma|)$ time and space.
• Alphabet-dependent.
• Real-time since scans text characters once.
• Can be used for dictionary matching: replace KMP with AC automaton.

2D Witnesses
• Amir et al.: a 2D witness table can be used for linear time and space alphabet-independent 2D matching.
• The order of duels is significant.
• Duels are performed in 2 waves over text.

Indexing
• Index text
  – Suffix Tree
  – Suffix Array
• Find pattern in O(m) time
• Useful paradigm when text will be searched for several patterns.

Suffix Trie
T = banana$
Suffixes:
suf1 = banana$
suf2 = anana$
suf3 = nana$
suf4 = ana$
suf5 = na$
suf6 = a$
suf7 = $
• One leaf per suffix.
• An edge represents one character.
• Concatenation of edge-labels on the path from the root to leaf i spells the suffix that starts at position i.
[Figure: suffix trie of banana$ with one leaf for each of suf1 to suf7]
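
A suffix trie is easy to build directly, at the price of quadratic size. The Python sketch below stores the trie as nested dictionaries with one character per edge and records the 1-based suffix number at each leaf. The function names and the `"leaf"` key are assumptions for this illustration, and the text is assumed to end in a unique terminator such as `$` so that every suffix ends at a leaf.

```python
# Illustrative sketch; names and the "leaf" marker key are assumptions.
def build_suffix_trie(text: str) -> dict:
    """Nested-dict trie of all suffixes; one character per edge."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1              # suffix number, 1-based
    return root


def occurrences(trie: dict, pattern: str) -> list[int]:
    """Walk down the pattern, then collect every leaf below that node."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("banana$")
print(occurrences(trie, "ana"))               # [2, 4]
```

Walking down the pattern takes O(m) steps; the leaves below the final node are exactly the suffixes that begin with the pattern.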

Suffix Tree
T = banana$
• Compact representation of trie.
• A node with one child is merged with its parent.
• Up to n internal nodes.
• O(n) space by using indices to label edges.
[Figure: suffix tree of banana$ with leaves suf1 to suf7. The edge labels banana$, a, na, na$ and $ are stored as index pairs into T: [1,7], [2,2], [3,4], [5,7] and [7,7].]

Suffix Tree Construction
• Naïve Approach: $O(n^2)$ time
• Linear-time algorithms:

| Author | Date | Innovation | Scan Direction |
|---|---|---|---|
| Weiner | 1973 | First linear-time algorithm, alphabet-dependent suffix links | Right to left |
| McCreight | 1976 | Alphabet-independent suffix links, more efficient | Left to right |
| Ukkonen | 1995 | Online linear-time construction, represents current end | Left to right |
| Amir and Nor | 2008 | Real-time construction | Left to right |

Suffix Tree Construction
• Linear-time suffix tree construction algorithms rely on suffix links to facilitate traversal of tree.
• A suffix link is a pointer from a node labeled xS to a node labeled S; x is a character and S a possibly empty substring.
• Alphabet-dependent suffix links point from a node labeled S to a node labeled xS, for each character x.

Index of Patterns
• Can answer Lowest Common Ancestor (LCA) queries in constant time if the tree is preprocessed accordingly.
• In suffix tree, LCA corresponds to Longest Common Prefix (LCP) of strings represented by leaves.

Index of Patterns
To index several patterns:
• Concatenate patterns with unique characters separating them and build suffix tree.
  Problem: inserts meaningless suffixes that span several patterns.
OR
• Build generalized suffix tree: a single structure for suffixes of individual patterns.
  Can be constructed with Ukkonen’s algorithm.

Suffix Array
• The Suffix Array stores lexicographic order of suffixes.
• More space efficient than suffix tree.
• Can locate all occurrences of a substring by binary search.
• With Longest Common Prefix (LCP) array can perform even more efficient searches.
• LCP array stores longest common prefix between two adjacent suffixes in suffix array.

Suffix Array
Suffixes in text order:

| Index | Suffix |
|---|---|
| 1 | mississippi |
| 2 | ississippi |
| 3 | ssissippi |
| 4 | sissippi |
| 5 | issippi |
| 6 | ssippi |
| 7 | sippi |
| 8 | ippi |
| 9 | ppi |
| 10 | pi |
| 11 | i |

Sort suffixes alphabetically:

| Index | Suffix | LCP |
|---|---|---|
| 11 | i | 0 |
| 8 | ippi | 1 |
| 5 | issippi | 1 |
| 2 | ississippi | 4 |
| 1 | mississippi | 0 |
| 10 | pi | 0 |
| 9 | ppi | 1 |
| 7 | sippi | 0 |
| 4 | sissippi | 2 |
| 6 | ssippi | 1 |
| 3 | ssissippi | 3 |

Suffix array
T = mississippi

| Index | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Suffix | 11 | 8 | 5 | 2 | 1 | 10 | 9 | 7 | 4 | 6 | 3 |
| LCP | 0 | 1 | 1 | 4 | 0 | 0 | 1 | 0 | 2 | 1 | 3 |
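
The suffix array and LCP array for mississippi can be reproduced with a short Python sketch. It sorts the suffixes directly and compares neighbours character by character, so it is the naïve construction and not one of the efficient algorithms; the function names are assumptions for this illustration.

```python
# Illustrative sketch of the naive construction; names are assumptions.
def suffix_array(text: str) -> list[int]:
    """1-based start positions of the suffixes in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """lcp[r] = common prefix length of the suffixes ranked r - 1 and r."""
    lcp = [0] * len(sa)                    # lcp[0] has no predecessor
    for r in range(1, len(sa)):
        a, b = text[sa[r - 1] - 1:], text[sa[r] - 1:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[r] = k
    return lcp


T = "mississippi"
SA = suffix_array(T)
print(SA)                # [11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(lcp_array(T, SA))  # [0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3]
```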

Search in Suffix Array
O(m log n):
Idea: two binary searches
  - search for leftmost position of X
  - search for rightmost position of X
In between are all suffixes that begin with X
With LCP array: O(m + log n) search.
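
The two binary searches of the O(m log n) method can be sketched as follows. Each probe compares X with the first m characters of one suffix. The function name `sa_search` is an assumption for this illustration, the suffix array is built naïvely so the block is self-contained, and the LCP-based speed-up to O(m + log n) is not shown.

```python
# Illustrative sketch; sa_search is a hypothetical name.
def sa_search(text: str, sa: list[int], x: str) -> list[int]:
    """Return the sorted 1-based start positions of all occurrences of x."""
    m = len(x)

    def prefix(rank: int) -> str:          # first m characters of a suffix
        start = sa[rank] - 1
        return text[start:start + m]

    lo, hi = 0, len(sa)                    # leftmost rank with prefix >= x
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < x:
            lo = mid + 1
        else:
            hi = mid
    left = lo
    hi = len(sa)                           # first rank with prefix > x
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])             # everything in between begins with x


T = "mississippi"
SA = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])  # naive build
print(sa_search(T, SA, "ssi"))             # [3, 6]
```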

Suffix Array Construction
• Naïve Approach: $O(n^2)$ time
• Indirect Construction:
  – preorder traversal of suffix tree
  – LCA queries for LCP.
  Problem: does not achieve better space efficiency.

Suffix Array Construction
• Direct construction algorithms:

| Author | Date | Complexity | Innovation |
|---|---|---|---|
| Manber, Myers | 1993 | O(n log n) | Sort and search, KMR renaming |
| Kärkkäinen and Sanders | 2003 | O(n) | Linear-time |
| Ko and Aluru | 2003 | O(n) | Linear-time |
| Kim et al. | 2003 | O(n) | Linear-time |

• LCP array construction: range-minima queries.

Compressed Indices
Suffix Tree: O(n) words = O(n log n) bits
Compressed suffix tree
• Grossi and Vitter (2000)
  – O(n) space.
• Sadakane (2007)
  – O(n log |Σ|) space.
  – Supports all suffix tree operations efficiently.
  – Slowdown of only polylog(n).

Compressed Indices
Suffix array is an array of n indices, which is stored in: O(n) words = O(n log n) bits
Compressed Suffix Array (CSA)
Grossi and Vitter (2000)
• O(n log |Σ|) bits
• access time increased from O(1) to $O(\log^{\varepsilon} n)$
Sadakane (2003)
• Pattern matching as efficient as in uncompressed SA.
• O(n log $H_0$) bits
• Compressed self-index

Compressed Indices
FM-index
• Ferragina and Manzini (2005)
• Self-indexing data structure
• First compressed suffix array that respects the high-order empirical entropy
• Size relative to compressed text length.
• Improved by Navarro and Mäkinen (2007)

Dynamic Suffix Tree
• Choi and Lam (1997)
• Strings can be inserted or deleted efficiently.
• Update time proportional to the length of the string inserted/deleted.
• No edges labeled by a deleted string.
• Two-way pointer for each edge, which can be done in space linear in the size of the tree.

Dynamic Suffix Array
• Recent work by Salson et al.
• Can update suffix array after construction if text changes.
• More efficient than rebuilding suffix array.
• Open problems:
  – Worst case O(n log n).
  – No online algorithm yet.

Word-Based Index
• Text of size n contains k distinct words
• Index a subset of positions that correspond to word beginnings
• With O(n) working space can index entire text and discard unnecessary positions.
• Desired complexity
  – O(k) space.
  – will always need O(n) time.
Problem: missing suffix links.

Word-Based Suffix Tree
Construction Algorithms:

| Author | Date | Results |
|---|---|---|
| Kärkkäinen and Ukkonen | 1996 | O(n) time and O(n/j) space construction of sparse suffix tree (every jth suffix) |
| Andersson et al. | 1999 | Expected linear-time and k-space construction of word-based suffix tree for k words. |
| Inenaga and Takeda | 2006 | Online, O(n) time and k-space construction of word-based suffix tree for k words. |

Word-Based Suffix Array
Ferragina and Fischer (2007): word-based suffix array construction algorithm
• Time and space optimal construction.
• Computation of word-based LCP array in O(n) time and O(k) space.
• Alternative algorithm for construction of word-based suffix tree.
• Searching as efficient as ordinary suffix array.

Research Directions
Problems we are considering:
• Small space dictionary matching.
• Time-space optimal 2D compressed dictionary matching algorithm.
• Compressed parameterized matching.
• Self-indexing word-based data structure.
• Dynamic suffix array in O(n) construction time.

Small-Space
• Applications arise in which storage space is limited.
• Many innovative algorithms exist for single pattern matching using small additional space:
  – Galil and Seiferas (1981) developed first time-space optimal algorithm for pattern matching.
  – Rytter (2003) adapted the KMP algorithm to work in O(1) additional space, O(n) time.

Research Directions
• Fast dictionary matching algorithms exist for 1D and 2D. Achieve expected sublinear time.
• No deterministic dictionary matching method that works in linear time and small space.
• We believe that recent results in compressed self-indexing will facilitate the development of a solution to the small space dictionary matching problem.

Compressed Matching
• Data is compressed to save space.
• Lossless compression schemes can be reversed without loss of data.
• Pattern matching cannot be done in compressed text: pattern can span a compressed character.
• LZ78: data can be uncompressed in time and space proportional to the uncompressed data.
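
For readers unfamiliar with LZ78, here is a small Python sketch of the textbook scheme: the text is cut into phrases, each encoded as a pair (index of an earlier phrase, next character). The function names are assumptions for this illustration, and practical details such as bit-level encoding are left out. Decompression rebuilds each phrase from an earlier one plus one character, so its total work is proportional to the length of the output.

```python
# Illustrative sketch of textbook LZ78; function names are assumptions.
def lz78_compress(data: str) -> list[tuple[int, str]]:
    """Encode data as (phrase index, next character) pairs; index 0 is empty."""
    phrases = {"": 0}
    out, cur = [], ""
    for ch in data:
        if cur + ch in phrases:
            cur += ch                          # keep extending a known phrase
        else:
            out.append((phrases[cur], ch))     # emit new phrase = cur + ch
            phrases[cur + ch] = len(phrases)
            cur = ""
    if cur:                                    # input ended inside a known phrase
        out.append((phrases[cur[:-1]], cur[-1]))
    return out


def lz78_decompress(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the text; each phrase is an earlier phrase plus one character."""
    phrases = [""]
    for idx, ch in pairs:
        phrases.append(phrases[idx] + ch)
    return "".join(phrases[1:])


pairs = lz78_compress("aaaa")
print(pairs)                                   # [(0, 'a'), (1, 'a'), (0, 'a')]
print(lz78_decompress(pairs))                  # aaaa
```

A pattern occurrence in the original text may lie inside one phrase or straddle several, which is what makes matching on the compressed form difficult.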

Research Directions
• Amir et al. (2003) devised an algorithm for 2D LZ78 compressed matching.
• They define strongly inplace as a criterion for the algorithm: the extra space is proportional to the optimal compression of all strings of the given length.
• We are seeking a time-space optimal solution to 2D compressed dictionary matching.

Thank you!