1
The Zhu-Takaoka Algorithm
Advisor: Prof. R. C. T. Lee
Speaker: S. Y. Tang
On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177, 1987
R. F. ZHU, T. TAKAOKA
2
• The Zhu-Takaoka Algorithm is an algorithm which solves the string matching problem.
• String matching problem:
Input: a text string T of length n and a pattern string P of length m.
Output: all occurrences of P in T.
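As a point of reference for the problem statement, here is a minimal brute-force sketch in Python. It is not part of the Zhu-Takaoka algorithm; the function name `naive_search` is only an illustrative choice, and the strings are the ones used in the later examples.

```python
def naive_search(text: str, pattern: str) -> list[int]:
    """Return every start index j such that text[j : j + m] equals pattern."""
    n, m = len(text), len(pattern)
    # Try every possible alignment of the pattern against the text.
    return [j for j in range(n - m + 1) if text[j:j + m] == pattern]


print(naive_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))  # [5]
```

Every alignment is tested here; the algorithm presented in these slides tries to skip most of them.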
3
• The Zhu-Takaoka Algorithm is a variant of the Boyer-Moore Algorithm. It changes only the bad character rule of the Boyer-Moore Algorithm.
• Zhu and Takaoka replaced the bad character rule with a 2-substring rule. The good suffix rule is still used.
4
The 2-Substring Rule
• Consider text = ACTGCCTAAGTA and pattern = CTAAG.
• The last 2-substring of the first window is GC. No GC appears in P.

Index    0  1  2  3  4  5  6  7  8  9 10 11
Text     A  C  T  G  C  C  T  A  A  G  T  A
Pattern  C  T  A  A  G
                     C  T  A  A  G
                        C  T  A  A  G
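The example above can be traced with a short Python sketch. It is a simplification made for illustration: it shifts by the 2-substring rule alone (the full algorithm also consults the good suffix rule), it looks up the rightmost occurrence on the fly instead of using a precomputed table, and it assumes a pattern of length at least 2. The function names are hypothetical, and the text is the 12-character string shown in the diagram.

```python
def two_substring_shift(pattern: str, a: str, b: str) -> int:
    """Shift given by the 2-substring rule for the text pair ab."""
    m = len(pattern)
    # Rightmost occurrence of ab inside pattern[0..m-2].
    pos = pattern[:m - 1].rfind(a + b)
    if pos >= 0:
        return m - 2 - pos
    # ab does not occur: align pattern[0] with b if possible, else skip the window.
    return m - 1 if pattern[0] == b else m


def search_two_substring_only(text: str, pattern: str):
    """Return (alignments tried, match positions) using only the 2-substring rule."""
    n, m = len(text), len(pattern)
    if m < 2:
        raise ValueError("this sketch assumes a pattern of length >= 2")
    alignments, matches = [], []
    j = 0
    while j <= n - m:
        alignments.append(j)
        if text[j:j + m] == pattern:
            matches.append(j)
        # The last two text characters of the current window decide the shift.
        j += two_substring_shift(pattern, text[j + m - 2], text[j + m - 1])
    return alignments, matches


print(search_two_substring_only("ACTGCCTAAGTA", "CTAAG"))  # ([0, 4, 5], [5])
```

The pattern is tried at positions 0, 4 and 5 only: GC does not occur in the pattern but its second character matches P[0], giving a shift of 4, and AA then gives a shift of 1.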
6
• Example

Whenever a mismatch or a complete match occurs, we take the last 2-substring of the current window of T and look up its rightmost occurrence in P, if one exists. This is done by constructing a ztBc table.

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
                        G  C  A  G  A  G  A  G     (shift by 5)

ztBc (rows: a, columns: b)
       A  C  G  *
   A   8  8  2  8
   C   5  8  7  8
   G   1  6  7  8
   *   8  8  7  8

ztBc[C,A] = 5 means that the rightmost occurrence of CA in P[0..m-2] ends 5 positions before the right end of P. Thus we can shift by 5. ztBc[G,A] = 1 means that the rightmost occurrence of GA ends 1 position before the right end of P. If GA is the 2-substring to be matched, we shift by 1 step.
7
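The preprocessing phase of the algorithm consists in computing, for each pair of characters (a, b) with a, b ∈ Σ, the rightmost occurrence of ab in x[0..m-2]:

\[
\text{For } a, b \in \Sigma:\quad ztBc[a,b] = k \iff
\begin{cases}
k \le m-2 \text{ and } x[m-k-2\,..\,m-k-1] = ab \text{ and } ab \text{ does not occur in } x[m-k-1\,..\,m-2], \\
\text{or } k = m-1 \text{ and } x[0] = b \text{ and } ab \text{ does not occur in } x[0\,..\,m-2], \\
\text{or } k = m \text{ and } x[0] \ne b \text{ and } ab \text{ does not occur in } x[0\,..\,m-2].
\end{cases}
\]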
8
preprocessing phase
Example: consider text = ATTGCCTAAGTA and pattern = CTAAG.
The alphabet of the pattern is {A, C, G, T}. The sign "*" denotes any character of the text that never appears in the pattern.
First, we fill every entry of the table with the length m of the pattern (here m = 5).

ztBc (rows: a, columns: b)
       A  C  G  T  *
   A   5  5  5  5  5
   C   5  5  5  5  5
   G   5  5  5  5  5
   T   5  5  5  5  5
   *   5  5  5  5  5
9
preprocessing phase
Then, we suppose that the last 2-substring ab does not occur in P[0..m-2]. If P[0] = b, we set ztBc[a, b] = m-1 for all a. In the example, P[0] = C, so every entry of column C becomes 4.

Example:
T: ATTGCCTAAGTA
P: CTAAG

ztBc (rows: a, columns: b)
       A  C  G  T  *
   A   5  4  5  5  5
   C   5  4  5  5  5
   G   5  4  5  5  5
   T   5  4  5  5  5
   *   5  4  5  5  5
10
preprocessing phase
Finally, we set ztBc[a,b] = k if k ≤ m-2 and P[m-k-2..m-k-1] = ab and ab does not occur in P[m-k-1..m-2].

Example:
P: CTAAG
AA = P[2..3] gives k = 1, TA = P[1..2] gives k = 2, CT = P[0..1] gives k = 3.

ztBc (rows: a, columns: b)
       A  C  G  T  *
   A   1  4  5  5  5
   C   5  4  5  3  5
   G   5  4  5  5  5
   T   2  4  5  5  5
   *   5  4  5  5  5
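The three preprocessing steps can be sketched in Python as follows. This is an illustrative version, not the authors' code: the function name `build_ztbc`, the dictionary keyed by `(a, b)` pairs, and passing the alphabet as a string are all assumptions made here, and `"*"` stands for any character that does not appear in the pattern.

```python
def build_ztbc(pattern: str, alphabet: str) -> dict[tuple[str, str], int]:
    """Build the ztBc table for `pattern` in the three steps of the slides."""
    m = len(pattern)
    symbols = list(alphabet) + ["*"]  # "*" = any character not in the pattern

    # Step 1: fill every entry with the pattern length m.
    table = {(a, b): m for a in symbols for b in symbols}

    # Step 2: if b equals the first pattern character, the shift is m - 1.
    for a in symbols:
        table[(a, pattern[0])] = m - 1

    # Step 3: scan the pairs of pattern[0..m-2] from left to right, so that
    # the rightmost occurrence of each pair is the one that remains.
    for i in range(m - 2):
        table[(pattern[i], pattern[i + 1])] = m - 2 - i
    return table


ztbc = build_ztbc("CTAAG", "ACGT")
for a in "ACGT*":
    print(a, [ztbc[(a, b)] for b in "ACGT*"])
```

For the pattern CTAAG this prints the rows of the final table above. The nested loop in step 1 touches every pair of symbols, which is where the quadratic dependence on the alphabet size in the preprocessing cost comes from.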
11
If ztBc[a,b] = k
Case 1: k ≤ m-2 and x[m-k-2..m-k-1] = ab and ab does not occur in x[m-k-1..m-2].
• Example

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
                        G  C  A  G  A  G  A  G     (shift by 5)

ztBc (rows: a, columns: b)
       A  C  G  *
   A   8  8  2  8
   C   5  8  7  8
   G   1  6  7  8
   *   8  8  7  8

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G

• ztBc[C,A] = 5 with k ≤ m-2, because x[8-5-2..8-5-1] = ab (x[1..2] = CA) and "CA" does not occur in x[8-5-1..8-2] (x[2..6]).
12
If ztBc[a,b] = k
Case 2: k = m-1 and x[0] = b and ab does not occur in x[0..m-2].
• Example

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  G  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
                              G  C  A  G  A  G  A  G     (shift by 7)

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G

• ztBc[C,G] = 7 with k = m-1, because x[0] = b (G = G) and "CG" does not occur in x[0..8-2] (x[0..6]).
13
If ztBc[a,b] = k
Case 3: k = m and x[0] ≠ b and ab does not occur in x[0..m-2].
• Example

Text:    GCATCGCAGAGAGTATACAGTACG
Pattern: GCAGAGAG

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G

• ztBc[A,C] = 8 with k = m, because x[0] ≠ b (G ≠ C) and "AC" does not occur in x[0..8-2] (x[0..6]).
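The three cases can be checked mechanically with the following Python sketch, which tests the conditions literally as they are stated on the slides rather than using a precomputed table. The function name `ztbc_case` and its return convention `(k, case number)` are choices made for this illustration only.

```python
def ztbc_case(x: str, a: str, b: str) -> tuple[int, int]:
    """Return (k, case) such that ztBc[a, b] = k by Case 1, 2 or 3."""
    m = len(x)
    ab = a + b
    # Case 1: k <= m-2, x[m-k-2..m-k-1] = ab, ab not in x[m-k-1..m-2].
    for k in range(1, m - 1):
        if x[m - k - 2:m - k] == ab and ab not in x[m - k - 1:m - 1]:
            return k, 1
    # From here on, ab does not occur anywhere in x[0..m-2].
    if x[0] == b:
        return m - 1, 2  # Case 2: x[0] = b
    return m, 3          # Case 3: x[0] != b


pattern = "GCAGAGAG"
print(ztbc_case(pattern, "C", "A"))  # (5, 1)
print(ztbc_case(pattern, "C", "G"))  # (7, 2)
print(ztbc_case(pattern, "A", "C"))  # (8, 3)
```

The three calls reproduce the values used in the Case 1, Case 2 and Case 3 examples.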
14
• Full Example

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern  G  C  A  G  A  G  A  G
                        G  C  A  G  A  G  A  G     (shift by 5)

ztBc (rows: a, columns: b)
       A  C  G  *
   A   8  8  2  8
   C   5  8  7  8
   G   1  6  7  8
   *   8  8  7  8

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G
   bmGs  7  7  7  2  7  4  7  1

In this step, we shift by the ztBc function because ztBc[T6, T7] = ztBc[C,A] = 5 > bmGs[7] = 1. The pattern shifts 5 positions to the right by Case 1.
15
• Full Example

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern                 G  C  A  G  A  G  A  G
                                             G  C  A  G  A  G  A  G     (shift by 7)

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G
   bmGs  7  7  7  2  7  4  7  1

An exact match is found at position 5. In this step, we shift by the bmGs function because ztBc[A,G] = 2 < bmGs[0] = 7.
16
• Full Example

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern                                      G  C  A  G  A  G  A  G
                                                         G  C  A  G  A  G  A  G     (shift by 4)

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G
   bmGs  7  7  7  2  7  4  7  1

In this step, we shift by the bmGs function because ztBc[A,G] = 2 < bmGs[5] = 4.
17
• Full Example

Index    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text     G  C  A  T  C  G  C  A  G  A  G  A  G  T  A  T  A  C  A  G  T  A  C  G
Pattern                                                  G  C  A  G  A  G  A  G

   i     0  1  2  3  4  5  6  7
   x[i]  G  C  A  G  A  G  A  G
   bmGs  7  7  7  2  7  4  7  1

In this step, we can shift by either the ztBc function or the bmGs function because ztBc[C,G] = 7 = bmGs[6]. The pattern shifts 7 positions either way.
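The full example can be replayed with the Python sketch below. It is a hedged illustration, not the authors' implementation: the function names are hypothetical, the good suffix table is built with a simple quadratic suffix computation chosen for readability, the 2-substring shifts are kept in a dictionary with the two default cases computed on lookup, and the pattern is assumed to have length at least 2.

```python
def suffixes(x: str) -> list[int]:
    """suff[i] = length of the longest common suffix of x[0..i] and x."""
    m = len(x)
    suff = [0] * m
    for i in range(m):
        k = 0
        while k <= i and x[i - k] == x[m - 1 - k]:
            k += 1
        suff[i] = k
    return suff


def good_suffix_table(x: str) -> list[int]:
    """Boyer-Moore good suffix shifts (the bmGs row of the slides)."""
    m = len(x)
    suff = suffixes(x)
    bm_gs = [m] * m
    j = 0
    for i in range(m - 1, -1, -1):
        if suff[i] == i + 1:  # x[0..i] is also a suffix of x
            while j < m - 1 - i:
                if bm_gs[j] == m:
                    bm_gs[j] = m - 1 - i
                j += 1
    for i in range(m - 1):
        bm_gs[m - 1 - suff[i]] = m - 1 - i
    return bm_gs


def zt_search(text: str, pattern: str):
    """Return (match positions, [(window position, shift), ...])."""
    n, m = len(text), len(pattern)
    if m < 2:
        raise ValueError("this sketch assumes a pattern of length >= 2")
    # Rightmost occurrence of each pair in pattern[0..m-2] (later i overwrites).
    pairs = {pattern[i:i + 2]: m - 2 - i for i in range(m - 2)}
    bm_gs = good_suffix_table(pattern)
    matches, trace = [], []
    j = 0
    while j <= n - m:
        i = m - 1
        while i >= 0 and pattern[i] == text[j + i]:  # compare right to left
            i -= 1
        ab = text[j + m - 2:j + m]  # last 2-substring of the window
        zt = pairs.get(ab, m - 1 if ab[1] == pattern[0] else m)
        if i < 0:
            matches.append(j)
            shift = max(bm_gs[0], zt)
        else:
            shift = max(bm_gs[i], zt)
        trace.append((j, shift))
        j += shift
    return matches, trace


print(good_suffix_table("GCAGAGAG"))  # [7, 7, 7, 2, 7, 4, 7, 1]
print(zt_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
# ([5], [(0, 5), (5, 7), (12, 4), (16, 7)])
```

The trace lists the window position and the shift taken at each attempt: 5, 7, 4 and 7, the same shifts as in the four steps of the full example.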
18
Time complexity
• The preprocessing phase takes O(m + σ²) time and space (σ = the number of characters in the alphabet of the text).
• The searching phase takes O(m × n) time.
19
References
1. ZHU, R.F. and TAKAOKA, T., 1987, On improving the average case of the Boyer-Moore string matching algorithm, Journal of Information Processing 10(3):173-177.