1
Reverse Colussi algorithm
Fastest pattern matching in strings, Colussi, L.
Journal of Algorithms, Vol. 16, No. 2, 1994, pp. 163-189
Advisor: Prof. R. C. T. Lee
Speaker: Y. K. Shie
2
The Reverse Colussi Algorithm solves the string matching problem. It is in the spirit of the original Colussi Algorithm.
3
The Main Points of the Reverse Colussi Algorithm
1. It changes the bad character rule from matching one character to matching a pair of characters.
2. It divides the positions of the pattern into special positions and non-special positions. Special positions allow a smaller shift.
3. It processes the special positions first.
4
Note that the Colussi Algorithm does not consider all of the positions where the prefix function takes the value -1.
This is possible because of the following fact: a position where the prefix function takes the value -1 allows the largest shift.
Thus the Colussi Algorithm examines all positions that allow a smaller shift, which is a safe action.
5
In the Reverse Colussi Algorithm, we define some positions of the pattern to be special and the others to be non-special.
Special positions allow a smaller shift than non-special positions.
Thus, in the Reverse Colussi Algorithm, we examine the special positions first.
We shall make this clear later.
6
Ti is the ith character in T (1≦i≦n). Pj is the jth character in P (1≦j≦m).
The bad character rule resembles Rule 2-1, the Character Matching Rule.
7
Rule 2-1: Character Matching Rule (A Special Version of Rule 2)
• For any character x in T, find the nearest x in P which is to the left of x in T.
[Figure: T above P; x is marked in T, and the nearest x to its left is marked in P.]
8
Implication of Rule 2-1
• Case 1: If there is an x in P to the left of x in T, move P so that the two x’s match.
[Figure: P moved so that its x lies under the x in T.]
9
• Case 2: If no such x exists in P, consider the partial window defined by x in T and the string to the left of it.
[Figure: T above P; the partial window (Partial W) covers x in T and the string to its left.]
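Rule 2-1 can be sketched in Python for the simplest situation, where x is the text character under the last position of P. This is only an illustration: the function name `last_char_shift` is hypothetical, positions of P are counted from 0, and for Case 2 the sketch assumes (as in Horspool's variant of the rule) that P is moved entirely past x.

```python
def last_char_shift(pattern: str, x: str) -> int:
    """Shift suggested by Rule 2-1 when x is the text character
    aligned with the last position of the pattern."""
    m = len(pattern)
    # Scan P[0..m-2] from right to left: the nearest x strictly to the left.
    for j in range(m - 2, -1, -1):
        if pattern[j] == x:
            return (m - 1) - j  # Case 1: move P so that the two x's match
    return m                    # Case 2 (assumed): no such x, move P past x


shifts = {c: last_char_shift("GCAGAGAG", c) for c in "ACGT"}
print(shifts)  # {'A': 1, 'C': 6, 'G': 2, 'T': 8}
```

For P = GCAGAGAG the shifts are 1, 6, 2 and 8 for A, C, G and T.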
10
rcBc table
Consider the following case, where the last character X of the window of T does not match the last character of P.
[Figure: T above P; X is the last character of the window of T.]
11
rcBc table
Suppose we successfully find an X in P, as shown below:
[Figure: T above P; X is marked at the end of the window of T and at an earlier position in P.]
12
rcBc table
Then we can move P as shown below:
[Figure: P moved so that its X lies under the X in T.]
13
rcBc table
Suppose the last character Y of the window of T does not match the last character of P, as shown below:
[Figure: T above P with the two X’s aligned; Y is the last character of the new window of T.]
14
rcBc table
Then we try to find a pair of X and Y in P such that, after we move P, this X and Y in P match the X and Y in T.
[Figure: T above P; a pair X, Y in P is to be aligned with X and Y in T.]
15
Thus, the Reverse Colussi Algorithm uses a very special version of Rule 2: it matches a pair of characters.
[Figure: T above P; the pair X, Y in P is aligned with X and Y in T.]
16
How do we find this pair of characters in P?
We use the rcBc Table.
17
rcBc table
Y is the last character of the window of T.
s is the length of the shift made in the last step.
k is a positive integer.
(In the formulas below, the positions of P are counted from 0, as in the examples that follow.)
Case 1: If we can find P_{m-k-1} = Y and P_{m-k-s-1} = P_{m-s-1}, we fill the minimal such k into rcBc[Y, s].
Case 2: If we can find P_{m-k-1} = Y and k > m-s-1, we fill the minimal such k into rcBc[Y, s].
Case 3: Otherwise, we fill m into rcBc[Y, s].
18
ex: s = 1:
rcBc[Y, s] (rows: present matched character of T (Y); columns: length of previous shift (s))
s  1 2 3 4 5 6 7 8
A  8
C
G
T
P: G C A G A G A G
X = A (X = P_{m-s-1}), Y = A
XY = AA does not exist in P, so rcBc[A, 1] = 8.
19
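ex: s = 2:
rcBc[Y, s] (rows: present matched character of T (Y); columns: length of previous shift (s))
s  1 2 3 4 5 6 7 8
A  8 5
C
G
T
P: G C A G A G A G
X = G, Y = A
We look for G?A, that is, X and Y two positions apart (? is any character). GCA exists at the left end of P, and its A lies 5 positions to the left of the last character of P, so rcBc[A, 2] = 5.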
20
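ex: s = 3:
rcBc[Y, s] (rows: present matched character of T (Y); columns: length of previous shift (s))
s  1 2 3 4 5 6 7 8
A  8 5 5
C
G
T
P: G C A G A G A G
X = A, Y = A
We look for A??A, that is, X and Y three positions apart. It does not occur in P, but the A of GCA at the left end of P qualifies, because the matching X would fall to the left of P (case 2). So rcBc[A, 3] = 5.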
21
ex: the complete rcBc table for P = G C A G A G A G
rcBc[Y, s] (rows: present matched character of T (Y); columns: length of previous shift (s))
s  1 2 3 4 5 6 7 8
A  8 5 5 3 3 3 1 1
C  8 6 6 6 6 6 6 6
G  2 2 2 4 4 2 2 2
T  8 8 8 8 8 8 8 8
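The three cases of the rcBc definition can be checked with a brute-force sketch in Python. This is only an illustration of the definition, not the efficient preprocessing of the original paper; the function name `build_rcbc` is hypothetical, and the positions of P are counted from 0.

```python
def build_rcbc(pattern: str, alphabet: str) -> dict:
    """Brute-force rcBc: rcbc[y][s] for each character y and previous shift s."""
    m = len(pattern)
    rcbc = {}
    for y in alphabet:
        row = [0] * (m + 1)          # index 0 is unused, s runs from 1 to m
        for s in range(1, m + 1):
            best = m                 # case 3: no suitable k, shift by m
            for k in range(1, m):    # try the smallest k first
                if pattern[m - k - 1] != y:
                    continue         # Y must sit at position m-k-1 of P
                # case 2: k > m-s-1, X would fall to the left of P
                # case 1: the character s places left of Y equals X = P[m-s-1]
                if k > m - s - 1 or pattern[m - k - s - 1] == pattern[m - s - 1]:
                    best = k
                    break
            row[s] = best
        rcbc[y] = row
    return rcbc


table = build_rcbc("GCAGAGAG", "ACGT")
for y in "ACGT":
    print(y, *table[y][1:])
```

For P = GCAGAGAG this prints the four rows of the table above.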
22
rcGs table
We build the rcGs table, which corresponds to the good suffix rules of the Boyer-Moore algorithm.
The good suffix rules resemble Rule 1 (the Suffix to Prefix Rule) and Rule 2 (the Substring Matching Rule).
23
Rule 1: The Suffix to Prefix Rule
• For a window to have any chance to match a pattern, there must, in some way, be a suffix of the window which is equal to a prefix of the pattern.
[Figure: T above P; a suffix of the window of T equals a prefix of P.]
24
Rule 2: The Substring Matching Rule
• For any substring u in T, find the nearest u in P which is to the left of it. If such a u exists in P, move P so that the two u’s match; otherwise, we may define a new partial window.
[Figure: two panels showing T above P; u is marked in T and in P, before and after P is moved so that the two u’s match.]
25
A repeating suffix of a string S is a suffix which appears somewhere else in S.
For instance, ABA is a repeating suffix of CABAGTABA. BA is also a repeating suffix.
26
Let x be the character to the left of a repeating suffix. A repeating suffix u of S is a maximal repeating suffix if xu does not appear elsewhere in S.
For instance, in S = CABAGTABA, ABA is a maximal repeating suffix because TABA does not appear anywhere else in S, while BA is not because ABA appears somewhere else in S.
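The two definitions can be written down directly as a small Python sketch. It follows the wording of the definitions literally; the function names are hypothetical, and "elsewhere" is taken to mean "at a start position other than that of the suffix itself".

```python
def occurrences(s: str, u: str) -> list:
    """Start positions of every (possibly overlapping) occurrence of u in s."""
    return [i for i in range(len(s) - len(u) + 1) if s.startswith(u, i)]


def is_repeating_suffix(s: str, u: str) -> bool:
    """u is a proper suffix of s that also appears somewhere else in s."""
    return 0 < len(u) < len(s) and s.endswith(u) and len(occurrences(s, u)) > 1


def is_maximal_repeating_suffix(s: str, u: str) -> bool:
    """u is a repeating suffix and xu appears nowhere else in s,
    where x is the character to the left of the suffix u."""
    if not is_repeating_suffix(s, u):
        return False
    xu = s[len(s) - len(u) - 1:]
    return len(occurrences(s, xu)) == 1


S = "CABAGTABA"
print(is_maximal_repeating_suffix(S, "ABA"))  # True: TABA appears only once
print(is_maximal_repeating_suffix(S, "BA"))   # False: ABA appears twice
```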
27
Given a pattern P, we call all positions immediately to the left of maximal repeating suffixes of P special positions. The Reverse Colussi Algorithm considers these special positions first.
For P = G C A G A G A G, we can see that the following suffixes are all maximal repeating suffixes:
G (corresponding substring: G)
AG (corresponding substring: CAG)
AGAG (corresponding substring: CAGAG)
28
For P = G C A G A G A G, the special positions are 3, 5 and 6:
i   0 1 2 3 4 5 6 7
Pi  G C A G A G A G
29
For each maximal repeating suffix u, let the last position of the corresponding substring be p. Then, if a mismatch occurs at the special position associated with u, we may move P by m-p-1 steps, where m is the length of P (Rule 2).
i   0 1 2 3 4 5 6 7
Pi  G C A G A G A G
Example: u = AGAG (positions 4 to 7), the substring associated with u is CAGAG (positions 1 to 5), and the special position is 3.
m = 8
p = 5
30
So we can move P by 8 - 5 - 1 = 2 steps, as shown below:
[Figure: T above P = G C A G A G A G (positions 0 to 7), before and after P is moved by 2 steps.]
The number of steps moved for each special position is stored in a table, called hmin.
31
i      0 1 2 3 4 5 6 7
Pi     G C A G A G A G
hmin       3
For the special position i = 3, the length of the move is 2 (8-5-1), so we record hmin[2] = 3.
32
i      0 1 2 3 4 5 6 7
Pi     G C A G A G A G
hmin       3   5
For the special position i = 5, the length of the move is 4 (8-3-1), so we record hmin[4] = 5.
33
i      0 1 2 3 4 5 6 7
Pi     G C A G A G A G
hmin       3   5     6
For the special position i = 6, the length of the move is 7 (8-0-1), so we record hmin[7] = 6.
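The special positions and their shifts can be found with a brute-force sketch in Python. It assumes the following reading of the slides: a position i (other than the last one) is special with shift k if everything to the right of i agrees with P shifted by k, while position i itself either disagrees or falls off the left end of P. The function name `special_positions` is hypothetical, and this is not the linear-time method of the original paper.

```python
def special_positions(pattern: str) -> dict:
    """Map each special position i to the smallest shift k associated with it."""
    m = len(pattern)
    special = {}
    for k in range(1, m + 1):
        i = m - 1
        # Walk left while P and P shifted by k still agree.
        while i - k >= 0 and pattern[i - k] == pattern[i]:
            i -= 1
        # i == m-1 means nothing to the right of i was matched,
        # so no repeating suffix is involved.
        if i != m - 1 and i not in special:
            special[i] = k
    return special


special = special_positions("GCAGAGAG")
hmin = {k: i for i, k in special.items()}
print(special)  # {3: 2, 5: 4, 6: 7}
print(hmin)     # {2: 3, 4: 5, 7: 6}
```

The result agrees with hmin[2] = 3, hmin[4] = 5 and hmin[7] = 6 above.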
34
• Note that for special positions, Rule 2 (substring matching rule) can be used.
• For non-special positions, Rule 1 (suffix to prefix rule) can be used.
35
The basic idea of the Reverse Colussi Algorithm is as follows:
1. We consider special positions first and non-special positions next.
2. We use Rule 2 (substring matching rule) when we consider special positions.
3. We use Rule 1 (suffix to prefix rule) when we consider non-special positions.
36
After we compare the special positions, we must compare the remaining positions, called non-special positions. We compare those non-special positions from left to right.
The number of steps moved for each non-special position is stored in a table, called rmin.
The value of rmin can be found by Rule 1 (the suffix to prefix rule).
37
If a suffix S of P that lies to the right of a non-special position i is equal to a prefix of P, then rmin(i) = m-|S|, where |S| is the length of the longest such S. If no such S exists, rmin(i) = m.
38
ex1:
i     0 1 2 3 4 5 6 7
Pi    G C A G A G A G
rmin  7 7 7   7
The suffix S = G is equal to a prefix of P and lies to the right of the non-special positions 0, 1, 2 and 4, so the rmin values of these positions are m-|S| = 8-1 = 7.
39
ex2:
i     0 1 2 3 4 5 6 7 8 9 10
Pi    G A G A G T G A G A G
rmin  6 6 6 6 6
special positions: 5, 7, 9
The suffix S = GAGAG is equal to a prefix of P and lies to the right of the non-special positions 0 to 4, so the rmin values of these positions are m-|S| = 11-5 = 6.
40
ex2:
i     0 1 2 3 4 5 6 7 8 9 10
Pi    G A G A G T G A G A G
rmin  6 6 6 6 6   8
special positions: 5, 7, 9
The shorter suffix S = GAG is also equal to a prefix of P and lies to the right of the non-special position 6, so the rmin value of this position is m-|S| = 11-3 = 8.
41
ex2:
i     0 1 2 3 4 5 6 7 8 9 10
Pi    G A G A G T G A G A G
rmin  6 6 6 6 6   8   10
special positions: 5, 7, 9
The still shorter suffix S = G is also equal to a prefix of P and lies to the right of the non-special position 8, so the rmin value of this position is m-|S| = 11-1 = 10.
42
ex3:
i     0  1  2  3  4  5  6  7  8  9  10 11
Pi    C  G  A  G  A  G  T  G  A  G  A  G
rmin  12 12 12 12 12 12    12    12
No suffix is equal to any prefix, so the rmin values of all non-special positions are m = 12.
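The rmin rule can be sketched as a brute-force Python function. The name `build_rmin` is hypothetical, and the sketch computes a value for every position of P, whereas the slides only list the values at non-special positions.

```python
def build_rmin(pattern: str) -> list:
    """rmin[i] = m - |S| for the longest suffix S of P that lies to the right
    of position i and equals a prefix of P; m if there is no such S."""
    m = len(pattern)
    rmin = []
    for i in range(m):
        shift = m                               # default: no suitable suffix S
        for r in range(i + 1, m):               # S = P[r:] lies to the right of i
            if pattern[r:] == pattern[:m - r]:  # S equals the prefix of length |S|
                shift = r                       # r = m - |S|, first hit is longest S
                break
        rmin.append(shift)
    return rmin


print(build_rmin("GCAGAGAG"))      # ex1
print(build_rmin("GAGAGTGAGAG"))   # ex2
print(build_rmin("CGAGAGTGAGAG"))  # ex3
```

Reading off the non-special positions gives 7 7 7 7 for ex1, 6 6 6 6 6 8 10 for ex2 and eight times 12 for ex3, as in the examples above.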
43
rcGs table
After we build those tables, we can use them to build the rcGs table.
ex: GCAGAGAG
i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]      3   5     6
rmin[i]  7 7 7   7
rcGs[i]  0
44
rcGs table
First, for each index i where hmin[i] is nonempty, we fill the index i (the length of the move for that special position) into the rcGs table.
i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]      3   5     6
rmin[i]  7 7 7   7
rcGs[i]  0 2 4 7
45
rcGs table
Second, we fill the nonempty rmin values into the rcGs table.
i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]      3   5     6
rmin[i]  7 7 7   7
rcGs[i]  0 2 4 7 7 7 7 7
46
rcGs table
If P exactly matches the window of T, we can move P by Rule 1. Therefore, we fill rcGs[8] = m-|S| = 8-1 = 7.
i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]      3   5     6
rmin[i]  7 7 7   7
rcGs[i]  0 2 4 7 7 7 7 7 7
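The three filling steps can be sketched in Python as follows. The function name `build_rcgs`, the comparison-order list `h` and the parameter `match_shift` are hypothetical names introduced for this illustration; hmin and rmin are passed in as the nonempty entries shown in the tables above.

```python
def build_rcgs(m: int, hmin: dict, rmin: dict, match_shift: int):
    """hmin: {shift k: special position}; rmin: {non-special position: shift}.
    Returns (h, rcgs), where h[i] is the pattern position examined at step i."""
    h = [None] * m            # h[0] is unused
    rcgs = [0] * (m + 1)      # rcgs[0] is unused and stays 0
    i = 1
    for k in sorted(hmin):    # step 1: special positions, by increasing shift
        h[i], rcgs[i] = hmin[k], k
        i += 1
    for pos in sorted(rmin):  # step 2: non-special positions, left to right
        h[i], rcgs[i] = pos, rmin[pos]
        i += 1
    rcgs[m] = match_shift     # step 3: shift after an exact match (Rule 1)
    return h, rcgs


h, rcgs = build_rcgs(8, {2: 3, 4: 5, 7: 6}, {0: 7, 1: 7, 2: 7, 4: 7}, 7)
print(h)     # [None, 3, 5, 6, 0, 1, 2, 4]
print(rcgs)  # [0, 2, 4, 7, 7, 7, 7, 7, 7]
```

The list h records the order in which the positions of P are examined after the last character: special positions 3, 5, 6 first, then non-special positions 0, 1, 2, 4.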
47
ex:
T = G C A T C G C A G A G A G T A T A C A G T A C G (positions 1 to 24)
P = G C A G A G A G
s = m = 8
rcBc[Y, s]:
s  1 2 3 4 5 6 7 8
A  8 5 5 3 3 3 1 1
C  8 6 6 6 6 6 6 6
G  2 2 2 4 4 2 2 2
T  8 8 8 8 8 8 8 8
i        0 1 2 3 4 5 6 7 8
rcGs[i]  0 2 4 7 7 7 7 7 7
T: G C A T C G C A G A G A G T A T A C A G T A C G
P: G C A G A G A G
48
ex:
P is aligned with T1..T8. The last character of the window, T8 = A, does not match the last character G of P (1 comparison).
Shift by 1 (rcBc[A][s], s = 8), and change s = 1
T: G C A T C G C A G A G A G T A T A C A G T A C G
P: G C A G A G A G
                 1
49
ex:
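P is aligned with T2..T9. T9 = G matches the last character of P (comparison 1), but at the first special position, 3, T5 = C does not match P3 = G (comparison 2).
Shift by 2 (rcGs[1]), and change s = 2
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:   G C A G A G A G
           2       1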
50
ex:
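P is aligned with T4..T11. T11 = G matches the last character of P (comparison 1), but at the first special position, 3, T7 = C does not match P3 = G (comparison 2).
Shift by 2 (rcGs[1]), and change s = 2
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:       G C A G A G A G
               2       1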
51
ex:
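P is aligned with T6..T13. All 8 comparisons succeed, so P occurs in T at position 6. The numbers under P give the order of the comparisons: the last character first, then the special positions 3, 5, 6, then the non-special positions 0, 1, 2, 4 from left to right.
Shift by 7 (rcGs[8]), and change s = 7
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:           G C A G A G A G
             5 6 7 2 8 3 4 1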
52
ex:
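P is aligned with T13..T20. T20 = G matches the last character of P (comparison 1), but at the first special position, 3, T16 = T does not match P3 = G (comparison 2).
Shift by 2 (rcGs[1]), and change s = 2
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:                         G C A G A G A G
                                 2       1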
53
ex:
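P is aligned with T15..T22. The last character of the window, T22 = A, does not match the last character G of P (1 comparison).
Shift by 5 (rcBc[A][s], s = 2), and change s = 5
T: G C A T C G C A G A G A G T A T A C A G T A C G
P:                             G C A G A G A G
                                             1
After this shift the window would extend beyond the end of T, so the search stops.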
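The whole example can be replayed with a short Python sketch of the searching phase. The tables are copied from the slides; the function name `rc_search` and the comparison-order list `H` (special positions 3, 5, 6, then non-special positions 0, 1, 2, 4) are assumptions made for this illustration. Positions in the code are counted from 0, so the match reported at 5 is T6 in the slides.

```python
def rc_search(pattern, text, rcbc, rcgs, h):
    """Return (match positions, list of shifts taken); positions are 0-based."""
    m, n = len(pattern), len(text)
    matches, shifts = [], []
    j, s = 0, m                        # j: window start, s: previous shift
    while j <= n - m:
        y = text[j + m - 1]            # last character of the window
        if pattern[m - 1] != y:
            s = rcbc[y][s]             # pair-of-characters rule
        else:
            i = 1                      # special positions first, then the rest
            while i < m and pattern[h[i]] == text[j + h[i]]:
                i += 1
            if i == m:
                matches.append(j)      # every position of P matched
            s = rcgs[i]
        shifts.append(s)
        j += s
    return matches, shifts


RCBC = {
    "A": [0, 8, 5, 5, 3, 3, 3, 1, 1],  # index 0 is unused
    "C": [0, 8, 6, 6, 6, 6, 6, 6, 6],
    "G": [0, 2, 2, 2, 4, 4, 2, 2, 2],
    "T": [0, 8, 8, 8, 8, 8, 8, 8, 8],
}
RCGS = [0, 2, 4, 7, 7, 7, 7, 7, 7]
H = [None, 3, 5, 6, 0, 1, 2, 4]

matches, shifts = rc_search("GCAGAGAG", "GCATCGCAGAGAGTATACAGTACG", RCBC, RCGS, H)
print(matches)  # [5]
print(shifts)   # [1, 2, 2, 7, 2, 5]
```

The shifts 1, 2, 2, 7, 2, 5 are the ones taken in the example above.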
54
Time complexity
• The preprocessing phase takes O(m^2) time and O(mσ) space, where σ is the size of the alphabet.
• The searching phase takes O(n) time.
• At most 2n text character comparisons are performed in the worst case.
56
Thank you~