Boyer-Moore
Charles Yan, 2007
Date posted: 22-Dec-2015
Exact Matching
Boyer-Moore (worst case: linear time; typical: sublinear time)
Aho-Corasick (a set of patterns)
Boyer-Moore
Idea 2: Bad character rule
R(x): the position of the right-most occurrence of x in P. R(x)=0 if x does not occur in P.
i: the position of the mismatch in P.
k: the corresponding position in T.
The bad character rule says P should be shifted right by max{1, i-R(T[k])}. That is, if the right-most occurrence of character T[k] in P is at position j (j<i), then P[j] should be below T[k] after the shift.
Example: R(t)=1, R(s)=5. The mismatch is at i=3 and k=5, with T[k]=t, so P is shifted right by max{1, 3-1}=2.
   12345678901234567
T: spbctbsabpqsctbpq
P:   tpabsab
P:     tpabsab
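The bad character rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names rightmost_positions and bad_character_shift are made up for this example, and positions are 1-based to match the slides.

```python
def rightmost_positions(pattern):
    """R(x): 1-based position of the right-most occurrence of x in pattern."""
    R = {}
    for pos, ch in enumerate(pattern, start=1):
        R[ch] = pos  # a later occurrence overwrites an earlier one
    return R


def bad_character_shift(R, i, text_char):
    """Shift for a mismatch at pattern position i (1-based) against text_char.

    Characters that do not occur in the pattern have R(x) = 0.
    """
    return max(1, i - R.get(text_char, 0))


R = rightmost_positions("tpabsab")
print(R["t"], R["s"])                  # 1 5
print(bad_character_shift(R, 3, "t"))  # 2
```

With P = tpabsat instead, R(t)=7 is to the right of the mismatch, so the same call returns only 1.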
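The idea of the bad character rule is to shift P by more than one character when possible.
But it has no effect if j>i. Unfortunately, it is often the case that j>i.
Example: the mismatch is at i=3 with T[k]=t, but the right-most t in P is at position 7>i, so P is shifted by only one position.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:    tpabsat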
Let x=T[k], the mismatched character in T.
Idea 3: The extended bad character rule says P should be shifted right so that the closest x to the left of position i in P is below T[k].
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:     tpabsat
To use the extended bad character rule we need, for each position i of P and each character x in the alphabet Σ, the position of the closest occurrence of x to the left of i.
Approach 1: A two-dimensional array of size n*|Σ|.
Space and time: expensive.
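To make the cost of Approach 1 concrete, here is a minimal Python sketch of such a table. The name closest_left_table and the explicit alphabet argument are assumptions made for this example. It stores one row per pattern position and one entry per alphabet character, which is where the n*|Σ| space goes.

```python
def closest_left_table(pattern, alphabet):
    """table[i][x]: largest position j < i (1-based) with P[j] = x, or 0 if none."""
    n = len(pattern)
    # Index 0 is unused; nothing lies to the left of position 1.
    table = [None, {x: 0 for x in alphabet}]
    for i in range(2, n + 1):
        row = dict(table[i - 1])      # copy the previous row: O(|alphabet|) per position
        row[pattern[i - 2]] = i - 1   # P[i-1] (1-based) is now to the left of i
        table.append(row)
    return table


table = closest_left_table("tpabsat", "abpst")
print(table[3]["t"])  # 1
print(table[7]["a"])  # 6
```

A text character outside the given alphabet has no entry, so a caller would look it up with table[i].get(x, 0).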
Approach 2: Scan P from right to left and, for each character x, maintain a list of the positions where x occurs (in decreasing order).
P: tpabsat
t: 7, 1
a: 6, 3
…
When P[i] mismatches T[k] (let x=T[k]), scan x's list, find the first number (call it j) that is less than i, and shift P to the right so that P[j] is below T[k].
If no such j is found, then shift P past T[k].
Space and time: linear.
   12345678901234567
T: spbctbsatpqsctbpq
P:   tpabsat
P:     tpabsat
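The position-list approach can be sketched as follows. This is an illustrative Python version, not the author's code: occurrence_lists and extended_bad_character_shift are hypothetical names, and positions are 1-based as on the slides.

```python
def occurrence_lists(pattern):
    """For each character, its 1-based positions in pattern, in decreasing order."""
    lists = {}
    for pos in range(len(pattern), 0, -1):  # scan P from right to left
        lists.setdefault(pattern[pos - 1], []).append(pos)
    return lists


def extended_bad_character_shift(lists, i, text_char):
    """Shift for a mismatch at pattern position i (1-based) against text_char."""
    for j in lists.get(text_char, []):
        if j < i:
            return i - j  # P[j] lands below T[k]
    return i              # no such j: move P completely past T[k]


lists = occurrence_lists("tpabsat")
print(lists["t"], lists["a"])                       # [7, 1] [6, 3]
print(extended_bad_character_shift(lists, 3, "t"))  # 2
```

The lists together hold exactly n positions, which matches the linear space bound stated above.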
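Idea 3: Strong good suffix rule
t is a suffix of P that matches a substring t of T.
x≠y, where x is the character of T and y is the character of P immediately to the left of t (the mismatch).
t’ is the right-most copy of t in P such that t’ is not a suffix of P and z≠y, where z is the character of P immediately to the left of t’.
[Figure: T contains x followed by t. Below it, P contains z followed by t’ and, further right, y followed by the suffix t.]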
The strong good suffix rule says: (1) if t’ exists, then shift P to the right so that t’ in P is below t in T.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:         qcabdabdab
[Figure: T with x followed by t. P before the shift, with z t’ and y t, and P after the shift, with t’ below t.]
The extended bad character rule focuses on characters. The strong good suffix rule focuses on substrings.
How do we get the information needed for the strong good suffix rule? That is, for a given t, how do we find t’?
L’(i): For each i, L’(i) is the largest position less than n such that the substring P[i,…,n] matches a suffix of P[1,…,L’(i)], with the additional requirement that the character preceding that suffix is not equal to the character P[i-1].
If there is no such position, L’(i)=0.
Let t=P[i,…,n]; then L’(i) is the right end-position of t’.
[Figure: T with x followed by t. P with z t’ ending at position L’(i) and y t occupying positions i,…,n.]
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
     1234567890
L’(9)=4, L’(10)=0, L’(8)=?, L’(7)=?, L’(6)=?
Thus, to use the strong good suffix rule, we need to find L’(i) for every i=1,…,n.
For pattern P, N_j is the length of the longest substring of P that ends at position j and is also a suffix of P.
[Figure: in P, the substring t’ ends at position j and is preceded by x; the suffix t is preceded by y. t=t’, N_j=|t’|=|t|, and x≠y.]
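N_j: the length of the longest substring of P that ends at position j and is also a suffix of P.
Z_i: the length of the longest substring of P that starts at position i and matches a prefix of P.
[Figure: for N_j, t’ ends at j and equals the suffix t, with x≠y the characters preceding them. For Z_i, t’ starts at i and equals the prefix t, with x≠y the characters following them.]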
N is the reverse of Z!
P: the pattern
P^r: the string obtained by reversing P
Then N_j(P) = Z_{n-j+1}(P^r).
      1 2 3 4 5 6 7 8 9 0
P:    q c a b d a b d a b
N_j:  0 0 0 2 0 0 5 0 0 0

      1 2 3 4 5 6 7 8 9 0
P^r:  b a d b a d b a c q
Z_i:  0 0 0 5 0 0 2 0 0 0
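The relation between N and Z can be checked with a short Python sketch. The slides do not spell out the Z algorithm, so z_values below is the standard linear-time Z-box computation, included here as an assumption about which variant is meant. Both function names are illustrative. Z_1 is left at 0 so that the output matches the table on the slide.

```python
def z_values(s):
    """z[i] (0-based, i >= 1): length of the longest substring of s that starts
    at i and matches a prefix of s. z[0] is left at 0, as in the slide's table."""
    n = len(s)
    z = [0] * n
    left, right = 0, 0  # s[left:right] is the right-most Z-box found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])  # reuse the value inside the Z-box
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                           # extend by explicit comparison
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def n_values(pattern):
    """N[j] for j = 1..n (index 0 unused), using N_j(P) = Z_{n-j+1}(P reversed)."""
    n = len(pattern)
    z = z_values(pattern[::-1])
    # 1-based index n-j+1 in the reversed string is 0-based index n-j.
    return [0] + [z[n - j] for j in range(1, n + 1)]


print(z_values("badbadbacq"))       # [0, 0, 0, 5, 0, 0, 2, 0, 0, 0]
print(n_values("qcabdabdab")[1:])   # [0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
```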
For pattern P, N_j (for j=1,…,n) can be calculated in O(n) time using the Z algorithm.
Why do we need to define N_j?
To use the strong good suffix rule, we need to find L’(i) for every i=1,…,n.
We can get L’(i) from N_j!
For position i, let t=P[i,…,n].
L’(i) is the largest position j less than n such that N_j=|t|.
[Figure: P with the suffix t=P[i,…,n] preceded by y, the copy t’ ending at position L’(i) preceded by z, and another copy t’’.]
        1 2 3 4 5 6 7 8 9 0
P:      q c a b d a b d a b
N_j:    0 0 0 2 0 0 5 0 0 0
L’(i):  0 0 0 0 0 7 0 0 4 0

        1 2 3 4 5 6 7 8 9 0
P^r:    b a d b a d b a c q
Z_i:    0 0 0 5 0 0 2 0 0 0
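How do we obtain L’(i) from N_j in linear time?
Input: Pattern P
Output: L’(i) for i=1,…,n
Algorithm
  Calculate N_j for j=1,…,n using the Z algorithm
  for i=1; i<=n; i++
    L’(i)=0;
  for j=1; j<n; j++
    if N_j>0   (N_j=0 would give i=n+1, which is outside P)
      i=n-N_j+1
      L’(i)=j;
[Figure: P with t’ ending at position j=L’(i), preceded by z, and the suffix t occupying positions i,…,n, preceded by y.]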
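A possible Python rendering of this algorithm is sketched below. To keep the block self-contained it takes the N values as input rather than recomputing them. The function name big_l_prime and the 1-based list layout (index 0 unused) are choices made for this example only.

```python
def big_l_prime(N):
    """Compute L'(i) for i = 1..n from N[1..n] (index 0 of both lists is unused)."""
    n = len(N) - 1
    Lp = [0] * (n + 1)
    for j in range(1, n):     # j < n
        if N[j] > 0:          # N[j] = 0 would give i = n + 1, outside P
            i = n - N[j] + 1
            Lp[i] = j         # j increases, so the largest j is the one kept
    return Lp


# N[1..10] for P = qcabdabdab, copied from the table on the slide.
N = [0, 0, 0, 0, 2, 0, 0, 5, 0, 0, 0]
print(big_l_prime(N)[1:])  # [0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
```

Each j is visited once, so the loop is linear in n.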
The strong good suffix rule says: (1) if t’ exists, then shift P to the right so that t’ in P is below t in T.
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:         qcabdabdab
i=9; L’(9)=4
[Figure: T with x followed by t. P before the shift, with z t’ ending at L’(i) and y t at positions i,…,n, and P after the shift, with t’ below t.]
The strong good suffix rule:
(1) If a mismatch occurs at position i-1 of P and L’(i)>0 (i.e., t’ exists), then we can shift P by n-L’(i) positions to the right.
(2) What if a mismatch occurs at position i-1 of P and L’(i)=0 (i.e., t’ does not exist)? We can shift P at least this far:
[Figure: T with x followed by t. P before and after the shift, with y followed by t at positions i,…,n.]
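Observation 1: If a string is both a prefix of P and a suffix of P, then…
[Figure: T with x followed by t. P before and after the shift, with y followed by t at positions i,…,n; a prefix of the shifted P overlaps the end of t.]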
Observation 2: If there is more than one such candidate, then shift P by the least amount.
[Figure: T with x followed by t. P, and two candidate shifted copies P1 and P2 whose prefixes overlap the end of t.]
Boyer-Moore
The strong good suffix rule: When a mismatch occurs at position i-1 of P
(1) If L’(i)>0 (i.e. t’ exists), then using the strong good suffix rule we can shift P by n-L’(i) positions to the right.
(2) Else if L’(i)=0 (i.e. t’ does not exists)? We can shift P past the left end of t by the least amount such a prefix of the shifted pattern matches a suffix of t.
x t
y t
y t
T
P
i nP
i n’
l’(i): the length of the largest suffix of P[i,…,n] that is also a prefix of P. If none exists, then l’(i)=0.
l’(i) is the length of the overlap between the unshifted and shifted patterns.
[Figure: T with x followed by t. P, and two candidate shifted copies P1 and P2; the overlap with the suffix starting at position i has length l’(i).]
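l’(i) equals the largest j≤|P[i,…,n]| such that N_j=j.
1. If N_j=j, then P[1,…,j] is a prefix of P that is also a suffix of P.
2. We want the largest such j.
[Figure: P with y followed by t starting at position i, and two candidate prefixes of P of lengths j1 and j2; l’(i) is the larger one that fits.]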
l’(i) equals the largest j≤|P[i,…,n]| such that N_j=j.
        1 2 3 4 5 6 7 8 9 0
P:      a b d a b a b d a b
N_j:    0 2 0 0 5 0 2 0 0 0
l’(i):  5 5 5 5 5 5 2 2 2 0
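The table above can be reproduced with one right-to-left pass over i, as in this Python sketch. The name small_l_prime is hypothetical, the lists are 1-based with index 0 unused, and N_n is taken as 0, as in the slide's table.

```python
def small_l_prime(N):
    """Compute l'(i) for i = 1..n from N[1..n] (index 0 of both lists is unused)."""
    n = len(N) - 1
    lp = [0] * (n + 1)
    best = 0                      # largest j seen so far with N[j] = j
    for i in range(n, 0, -1):
        j = n - i + 1             # |P[i..n]|, the only new candidate at this i
        if N[j] == j:
            best = j
        lp[i] = best
    return lp


# N[1..10] for P = abdababdab, copied from the table on the slide.
N = [0, 0, 2, 0, 0, 5, 0, 2, 0, 0, 0]
print(small_l_prime(N)[1:])  # [5, 5, 5, 5, 5, 5, 2, 2, 2, 0]
```

As i decreases, the bound |P[i,…,n]| grows by one, so only a single new j has to be tested at each step.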
The strong good suffix rule: when a mismatch occurs at position i-1 of P,
(1) If L’(i)>0 (i.e., t’ exists), then we can shift P by n-L’(i) positions to the right.
(2) Else, if L’(i)=0 (i.e., t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l’(i) positions to the right.
[Figure, case (1): T with x followed by t. P with z t’ ending at L’(i) and y t at positions i,…,n, before and after the shift.]
[Figure, case (2): T with x followed by t. P with y t at positions i,…,n, before and after the shift; the overlap has length l’(i).]
What if a match is found? Shift P by one position… but…
Shift P by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, shift P to the right by n-l’(2).
[Figure: T with an occurrence of P, and P shifted so that its prefix overlaps the end of that occurrence.]
The strong good suffix rule: when a mismatch occurs at position i-1 of P,
(1) If L’(i)>0 (i.e., t’ exists), then we can shift P by n-L’(i) positions to the right.
(2) Else, if L’(i)=0 (i.e., t’ does not exist), we can shift P past the left end of t by the least amount such that a prefix of the shifted pattern matches a suffix of t, that is, by n-l’(i) positions to the right.
(3) If a match is found, then shift P to the right by n-l’(2).
[Figure, case (1): T with x followed by t. P with z t’ ending at L’(i) and y t at positions i,…,n, before and after the shift.]
[Figure, case (2): T with x followed by t. P with y t at positions i,…,n, before and after the shift; the overlap has length l’(i).]
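The three cases can be collected into one small function, sketched here in Python. The name good_suffix_shift and its argument layout are assumptions for this example. The L’ and l’ tables are passed in as 1-based lists (index 0 unused). The slides do not cover a mismatch at position n itself, where t is empty, so the sketch falls back to a shift of one there.

```python
def good_suffix_shift(Lp, lp, mismatch_pos):
    """Shift given by the strong good suffix rule.

    Lp, lp: 1-based lists of L'(i) and l'(i) for i = 1..n (index 0 unused).
    mismatch_pos: position of the mismatch in P (the slides' i-1), or 0 if all of P matched.
    """
    n = len(Lp) - 1
    if mismatch_pos == 0:              # case (3): an occurrence was found
        return n - lp[2] if n >= 2 else 1
    if mismatch_pos == n:              # t is empty: assumed fallback, shift by one
        return 1
    i = mismatch_pos + 1               # t = P[i..n]
    if Lp[i] > 0:                      # case (1): t' exists
        return n - Lp[i]
    return n - lp[i]                   # case (2): use the prefix overlap


# Tables for P = qcabdabdab: L'(6)=7, L'(9)=4, and l'(i)=0 for every i.
Lp = [0, 0, 0, 0, 0, 0, 7, 0, 0, 4, 0]
lp = [0] * 11
print(good_suffix_shift(Lp, lp, 8))  # 6
print(good_suffix_shift(Lp, lp, 7))  # 10
```

The first call is the running example: the mismatch is at position 8, so i=9 and the shift is n-L’(9)=10-4=6.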
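The extended bad character rule vs. the strong good suffix rule
   123456789012345678
T: prstabstubabvqxrst
P:   qcabdabdab
P:    qcabdabdab        (extended bad character rule: shift by 1)
P:         qcabdabdab   (strong good suffix rule: shift by 6)

   123456789012345678
T: prstabstuqabvqxrst
P:   qcabdabdab
P:          qcabdabdab  (extended bad character rule: shift by 7)
P:         qcabdabdab   (strong good suffix rule: shift by 6)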
Shift P by the largest amount given by either of the rules. That results in the Boyer-Moore algorithm!
Input: Text T of length m and pattern P of length n
Output: The occurrences of P in T
Algorithm Boyer-Moore
  Compute L’(i), l’(i), and R(x)
  k=n;
  while (k≤m) do
    i=n
    h=k
    while i>0 and P[i]=T[h] do
      i--;
      h--;
    if i=0
      report an occurrence of P in T ending at position k;
      k=k+n-l’(2)
    else
      shift P (increase k) by the maximum amount determined by the extended bad character rule and the good suffix rule.
[Figure: T and P aligned with the right end of P at position k of T; the matched suffix t lies to the right of position i in P and position h in T.]
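Putting the pieces together, the whole algorithm might look like the following Python sketch. It is one possible reading of the pseudocode, not the author's implementation: all function names are illustrative, the Z computation is the standard Z-box method, a mismatch at position n is given a good suffix shift of one, and occurrences are reported by their 1-based end position k.

```python
def z_values(s):
    """z[i]: length of the longest substring starting at i that matches a prefix (z[0] = 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def preprocess(P):
    """Return the L'(i) table, the l'(i) table, and the per-character position lists."""
    n = len(P)
    z = z_values(P[::-1])
    N = [0] + [z[n - j] for j in range(1, n + 1)]  # N_j(P) = Z_{n-j+1}(P reversed)
    Lp = [0] * (n + 2)
    for j in range(1, n):
        if N[j] > 0:
            Lp[n - N[j] + 1] = j
    lp = [0] * (n + 2)
    best = 0
    for i in range(n, 0, -1):
        if N[n - i + 1] == n - i + 1:
            best = n - i + 1
        lp[i] = best
    occ = {}
    for pos in range(n, 0, -1):                    # decreasing positions per character
        occ.setdefault(P[pos - 1], []).append(pos)
    return Lp, lp, occ


def boyer_moore(P, T):
    """Return the 1-based end positions k of all occurrences of P in T."""
    n, m = len(P), len(T)
    if n == 0 or n > m:
        return []
    Lp, lp, occ = preprocess(P)
    hits = []
    k = n
    while k <= m:
        i, h = n, k
        while i > 0 and P[i - 1] == T[h - 1]:      # compare right to left
            i -= 1
            h -= 1
        if i == 0:
            hits.append(k)
            k += n - lp[2]
        else:
            bad = i                                # extended bad character rule
            for j in occ.get(T[h - 1], []):
                if j < i:
                    bad = i - j
                    break
            if i == n:                             # nothing matched, t is empty
                good = 1
            elif Lp[i + 1] > 0:
                good = n - Lp[i + 1]
            else:
                good = n - lp[i + 1]
            k += max(bad, good)
    return hits


print(boyer_moore("ab", "abcabab"))  # [2, 5, 7]
```

On the running example, boyer_moore("qcabdabdab", "prstabstubabvqxrst") returns an empty list, since P does not occur in that T.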