1 Reverse Colussi algorithm Fastest pattern matching in strings, Colussi, L. Journal of Algorithms,...

1

Reverse Colussi algorithm

Fastest pattern matching in strings, Colussi, L.

Journal of Algorithms, Vol. 16 , No. 2, 1994, pp.163-189

Advisor: Prof. R. C. T. Lee

Speaker: Y. K. Shie

2

The Reverse Colussi Algorithm solves the string matching problem in the spirit of the original Colussi Algorithm.

3

The Main Points of the Reverse Colussi Algorithm

1. It changes the bad character rule from matching one character to matching a pair of characters.

2. It divides the positions of the pattern into special positions and non-special positions. Special positions allow a smaller number of steps to shift.

3. It processes the special positions first.

4

Note that the Colussi Algorithm does not consider all of the positions where the prefix function takes the value -1.

This is possible because a position where the prefix function takes the value -1 allows the largest number of steps to shift.

Thus the Colussi Algorithm examines the positions which allow a smaller number of steps to shift, which is a safe action.

5

In the Reverse Colussi Algorithm, we define some positions as special and the others as non-special.

Special positions allow a smaller number of steps to shift than non-special positions.

Thus, in the Reverse Colussi Algorithm, we examine the special positions first.

We shall make this clear later.

6

Ti is the ith character in T (1≦i≦n). Pj is the jth character in P (1≦j≦m).

The bad character rule is similar to Rule 2-1, the Character Matching Rule.

7

Rule 2-1: Character Matching Rule (A Special Version of Rule 2)

• For any character x in T, find the nearest x in P which is to the left of x in T.

[Figure: T above P, with x marked in T and the nearest x to its left marked in P.]

8

Implication of Rule 2-1

• Case 1: If there is an x in P to the left of x in T, move P so that the two x's match.

[Figure: T above P, with P moved so that its x is aligned with the x in T.]

9

• Case 2: If no such x exists in P, consider the partial window defined by x in T and the string to the left of it.

[Figure: T above P, with x marked in T and the partial window W to its left.]

10

rcBc table

Consider the following case, where the last character X of the window of T does not match the last character of P.

[Figure: T above P, with X marked as the last character of the window of T.]

11

rcBc table

Suppose we successfully find an X in P, as shown below:

[Figure: T above P, with X marked in T and an X marked in P.]

12

rcBc table

Then we can move P as shown below:

[Figure: T above P, with P moved so that the two X's are aligned.]

13

rcBc table

Suppose the last character Y of the window of T does not match the last character of P, as shown below:

[Figure: T above P, with the two X's aligned and Y marked as the last character of the window of T.]

14

rcBc table

Then we try to find a pair X and Y in P such that, after we move P, this X and Y in P match the X and Y in T.

[Figure: T above P, with X and Y marked in T and the pair X, Y marked in P.]

15

Thus, the Reverse Colussi Algorithm uses a very special version of Rule 2: matching a pair of characters.

[Figure: T above P, with X and Y marked in T and the pair X, Y marked in P.]

16

How do we find this pair of characters in P?

We use the rcBc Table.

17

rcBc table

Y is the last character of the window of T.

s is the length of the previous shift.

k is an integer with 1 ≦ k < m. Here P is indexed from 0, as in the position tables on the later slides.

case 1: If we can find $P_{m-k-1} = Y$ and $P_{m-k-s-1} = P_{m-s-1}$, we fill the minimal such k into rcBc[Y, s].

case 2: If we can find $P_{m-k-1} = Y$ and $k > m-s-1$, we fill the minimal such k into rcBc[Y, s].

case 3: Otherwise, we fill m into rcBc[Y, s].
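The three cases can be checked mechanically. The following is a minimal brute-force sketch in Python, not the preprocessing procedure of the paper. The function name `rc_bad_char_table` and the `alphabet` argument are assumptions made for illustration, and P is indexed from 0.

```python
def rc_bad_char_table(pattern, alphabet):
    """Brute-force rcBc built directly from cases 1 to 3 (0-indexed P).

    table[Y][s - 1] holds rcBc[Y, s] for s = 1..m.
    """
    m = len(pattern)
    table = {}
    for y in alphabet:
        row = []
        for s in range(1, m + 1):      # s: length of the previous shift
            shift = m                  # case 3: no suitable k exists
            for k in range(1, m):      # smallest k first
                if pattern[m - k - 1] != y:
                    continue
                # Case 2 (k > m-s-1) is tested first, so the index
                # m-k-s-1 used by case 1 is never negative.
                if k > m - s - 1 or pattern[m - k - s - 1] == pattern[m - s - 1]:
                    shift = k
                    break
            row.append(shift)
        table[y] = row
    return table


rcbc = rc_bad_char_table("GCAGAGAG", "ACGT")
for y in "ACGT":
    print(y, *rcbc[y])
```

For P = GCAGAGAG this reproduces the rows of the rcBc example table for A, C, G and T. Because a k that satisfies case 1 is always smaller than one that satisfies case 2, taking the first k that satisfies either case gives the same result as checking case 1 before case 2.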

18

ex: s = 1:

Rows: present matched character of T (Y). Columns: length of previous shift (s).

    1 2 3 4 5 6 7 8
A   8
C
G
T

P: G C A G A G A G

X = A

Y = A

XY = AA does not exist in P, so rcBc[A, 1] = 8.

19

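ex: s = 2:

Rows: present matched character of T (Y). Columns: length of previous shift (s).

    1 2 3 4 5 6 7 8
A   8 5
C
G
T

P: G C A G A G A G

X = G

Y = A

We look for a G followed two positions later by an A. Such a pair exists (G C A at the start of P), so rcBc[A, 2] = 5.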

20

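ex: s = 3:

Rows: present matched character of T (Y). Columns: length of previous shift (s).

    1 2 3 4 5 6 7 8
A   8 5 5
C
G
T

P: G C A G A G A G

X = A

Y = A

We look for an A followed three positions later by an A. The A in G C A at the start of P qualifies under case 2, because the matching X would fall to the left of P (k = 5 > m-s-1 = 4), so rcBc[A, 3] = 5.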

21

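ex: the complete rcBc table for P = G C A G A G A G

Rows: present matched character of T (Y). Columns: length of previous shift (s).

    1 2 3 4 5 6 7 8
A   8 5 5 3 3 3 1 1
C   8 6 6 6 6 6 6 6
G   2 2 2 4 4 2 2 2
T   8 8 8 8 8 8 8 8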

22

rcGs table

We build the rcGs table, which corresponds to the good suffix rules of the Boyer-Moore algorithm.

The good suffix rules are similar to Rule 1, the Suffix to Prefix Rule, and Rule 2, the Substring Matching Rule.

23

Rule 1: The Suffix to Prefix Rule

• For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

[Figure: T above P, with a suffix of the window aligned with a prefix of P.]

24

Rule 2: The Substring Matching Rule

• For any substring u in T, find the nearest u in P which is to the left of it. If such a u in P exists, move P so that the two u's match; otherwise, we may define a new partial window.

[Figure: T above P with the substring u marked in both, before and after P is moved.]

25

A repeating suffix of a string S is a suffix which appears somewhere else in S.

For instance, ABA is a repeating suffix of CABAGTABA. BA is also a repeating suffix.

26

Let x be the character to the left of a repeating suffix. A repeating suffix u of S is a maximal repeating suffix if xu does not appear elsewhere in S.

For instance, in CABAGTABA, ABA is a maximal repeating suffix because TABA does not appear anywhere else in S, while BA is not, because ABA appears somewhere else in S.
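The two definitions can be tested directly on the example string. The following is a minimal sketch in Python that follows the wording of the definitions literally; the function names are assumptions made for illustration.

```python
def is_repeating_suffix(s, u):
    """True if u is a suffix of s that also appears somewhere else in s."""
    if not u or not s.endswith(u):
        return False
    # The first occurrence differs from the final one only if u appears elsewhere.
    return s.find(u) != len(s) - len(u)


def is_maximal_repeating_suffix(s, u):
    """True if u is a repeating suffix and xu appears only at the end of s,
    where x is the character immediately to the left of the suffix u."""
    if not is_repeating_suffix(s, u):
        return False
    start = len(s) - len(u)     # start >= 1, since u appears elsewhere in s
    xu = s[start - 1:]
    return s.find(xu) == start - 1


print(is_repeating_suffix("CABAGTABA", "ABA"))          # ABA also occurs at index 1
print(is_maximal_repeating_suffix("CABAGTABA", "ABA"))  # TABA occurs only at the end
print(is_maximal_repeating_suffix("CABAGTABA", "BA"))   # ABA occurs elsewhere
```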

27

Given a pattern P, we call the positions immediately to the left of the maximal repeating suffixes of P special positions. The Reverse Colussi Algorithm considers these special positions first.

For P = G C A G A G A G, the following suffixes are all maximal repeating suffixes:

G (corresponding substring: G)

AG (corresponding substring: CAG)

AGAG (corresponding substring: CAGAG)

28

For P = G C A G A G A G, the special positions are 3, 5 and 6:

i   0 1 2 3 4 5 6 7
Pi  G C A G A G A G

29

For each maximal repeating suffix u, let p be the last position of the corresponding substring. Then, if a mismatch occurs at the special position associated with u, we may move P by m-p-1 steps, where m is the length of P (Rule 2).

i   0 1 2 3 4 5 6 7
Pi  G C A G A G A G

[Figure: u, the substring associated with u, and the special position marked on P.]

Example: u = AGAG, corresponding substring CAGAG (positions 1 to 5), special position 3.

m = 8

p = 5

30

So we can move P by 8 - 5 - 1 = 2 steps, as below:

[Figure: T above P = G C A G A G A G (positions 0 to 7), before and after P is moved.]

The number of steps moved for each special position is stored in a table, called hmin.

31

i     0 1 2 3 4 5 6 7
Pi    G C A G A G A G
hmin  - - 3 - - - - -

(- marks an empty entry.)

For the special position i = 3, the length of the move is 2 (8-5-1), so we record hmin[2] = 3.

32

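i     0 1 2 3 4 5 6 7
Pi    G C A G A G A G
hmin  - - 3 - 5 - - -

(- marks an empty entry.)

For the special position i = 5, the length of the move is 4 (8-3-1), so we record hmin[4] = 5.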


33

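i     0 1 2 3 4 5 6 7
Pi    G C A G A G A G
hmin  - - 3 - 5 - - 6

(- marks an empty entry.)

For the special position i = 6, the length of the move is 7 (8-0-1), so we record hmin[7] = 6.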
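The special positions and their moves can be found by sliding P against itself. The following is a minimal brute-force sketch in Python under one assumption about the definition: a suffix is paired with an earlier occurrence that is preceded by a different character or that starts at the beginning of P, as in the three examples G, CAG and CAGAG. The function name `special_positions` is hypothetical.

```python
def special_positions(pattern):
    """Map each special position to the length of its move.

    For every shift k, compare P with a copy of itself moved k steps to the
    right, scanning from the last character to the left. The scan stops at
    the first position i where the two copies differ or where the shifted
    copy has no character left. If at least one character matched, i is a
    special position; the smallest such k is kept for each i.
    """
    m = len(pattern)
    shift_for = {}
    for k in range(1, m + 1):
        i = m - 1
        while i - k >= 0 and pattern[i - k] == pattern[i]:
            i -= 1
        if i != m - 1 and i not in shift_for:
            shift_for[i] = k
    return shift_for


print(special_positions("GCAGAGAG"))  # {3: 2, 5: 4, 6: 7}
```

In the notation of the slides, the entry `3: 2` corresponds to hmin[2] = 3, `5: 4` to hmin[4] = 5 and `6: 7` to hmin[7] = 6.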


34

• Note that for special positions, Rule 2 (substring matching rule) can be used.

• For non-special positions, Rule 1 (suffix to prefix rule) can be used.

35

The basic idea of the Reverse Colussi Algorithm is as follows:

1. We consider special positions first and non-special positions next.

2. We use Rule 2 (substring matching rule) when we consider special positions.

3. We use Rule 1 (suffix to prefix rule) when we consider non-special positions.

36

After we compare the special positions, we must compare the remaining positions, called non-special positions. We compare the non-special positions from left to right.

The number of steps moved for each non-special position is stored in a table, called rmin.

The values of rmin can be found by Rule 1 (the suffix to prefix rule).

37

For a non-special position i, let S be the longest suffix of P that lies to the right of i and is equal to a prefix of P. Then rmin(i) = m-|S|, where |S| is the length of S. If no such S exists, rmin(i) = m.
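This definition can be evaluated by trying every suffix length. The following is a minimal brute-force sketch in Python; the function name `rmin_table` is an assumption, and the sketch fills in a value for every position, although only the entries at non-special positions are used.

```python
def rmin_table(pattern):
    """rmin[i] = m - |S|, where S is the longest suffix of P that lies
    strictly to the right of position i and equals a prefix of P
    (rmin[i] = m if no such suffix exists)."""
    m = len(pattern)
    rmin = []
    for i in range(m):
        best = 0
        # A suffix of the given length starts at m - length, which must be > i.
        for length in range(1, m - i):
            if pattern[m - length:] == pattern[:length]:
                best = length      # lengths increase, so the last hit is the longest
        rmin.append(m - best)
    return rmin


print(rmin_table("GCAGAGAG"))     # [7, 7, 7, 7, 7, 7, 7, 8]
print(rmin_table("GAGAGTGAGAG"))  # [6, 6, 6, 6, 6, 6, 8, 8, 10, 10, 11]
```

For GCAGAGAG the entries at positions 0, 1, 2 and 4 are all 7, and for GAGAGTGAGAG the entries at positions 0 to 4, 6 and 8 are 6, 8 and 10, which agrees with the rmin values shown in ex1 and ex2.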

38

ex1:

i     0 1 2 3 4 5 6 7
Pi    G C A G A G A G
rmin  7 7 7 - 7 - - -

(- marks an entry that is not filled.)

The suffix S = G is equal to a prefix and lies to the right of the non-special positions 0, 1, 2 and 4, so the rmin values of these positions are m-|S| (8-1).

39

ex2:

i     0 1 2 3 4 5 6 7 8 9 10
Pi    G A G A G T G A G A G
rmin  6 6 6 6 6 - - - - - -

(- marks a special position or an entry that is not yet filled.)

The suffix S = GAGAG is equal to a prefix and lies to the right of the non-special positions 0 to 4, so the rmin values of these positions are m-|S| (11-5).

40

ex2:

i     0 1 2 3 4 5 6 7 8 9 10
Pi    G A G A G T G A G A G
rmin  6 6 6 6 6 - 8 - - - -

(- marks a special position or an entry that is not yet filled.)

We find a shorter suffix S = GAG that is equal to a prefix and lies to the right of the non-special position 6, so the rmin value of this position is m-|S| (11-3).

41

ex2:

i     0 1 2 3 4 5 6 7 8 9 10
Pi    G A G A G T G A G A G
rmin  6 6 6 6 6 - 8 - 10 - -

(- marks a special position or an entry that is not filled.)

We find a still shorter suffix S = G that is equal to a prefix and lies to the right of the non-special position 8, so the rmin value of this position is m-|S| (11-1).

42

ex3:

i     0 1 2 3 4 5 6 7 8 9 10 11
Pi    C G A G A G T G A G A G

No suffix of P is equal to a prefix of P, so the rmin values of all non-special positions are m.

rmin (non-special positions only)  12 12 12 12 12 12 12 12

43

rcGs table

After we build those tables, we can use them to build the rcGs table.

ex: GCAGAGAG

i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]  - - 3 - 5 - - 6 -
rmin[i]  7 7 7 - 7 - - - -
rcGs[i]  0

(- marks an empty entry.)

44

rcGs table

First, we fill the indices i at which hmin[i] is nonempty (2, 4 and 7) into the rcGs table, in increasing order.

i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]  - - 3 - 5 - - 6 -
rmin[i]  7 7 7 - 7 - - - -
rcGs[i]  0 2 4 7

(- marks an empty entry.)

45

rcGs table

Second, we fill the nonempty rmin values into the rcGs table.

i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]  - - 3 - 5 - - 6 -
rmin[i]  7 7 7 - 7 - - - -
rcGs[i]  0 2 4 7 7 7 7 7

(- marks an empty entry.)

46

rcGs table

If P exactly matches the window of T, we can move P by Rule 1. Therefore, we fill rcGs[8] = m-|S| (8-1).

i        0 1 2 3 4 5 6 7 8
Pi       G C A G A G A G
hmin[i]  - - 3 - 5 - - 6 -
rmin[i]  7 7 7 - 7 - - - -
rcGs[i]  0 2 4 7 7 7 7 7 7

(- marks an empty entry.)
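The three filling steps can be sketched as follows. This is a minimal Python illustration, not the construction given in the paper. It assumes the special positions are supplied as a mapping from position to move (3 to 2, 5 to 4, 6 to 7 for GCAGAGAG) and rmin as a list indexed by position; the function name `build_rcgs` and the list `order`, which records the pattern position compared at each step, are hypothetical.

```python
def build_rcgs(m, special_shift, rmin):
    """Assemble the comparison order and the rcGs table.

    special_shift: {special position: length of move}
    rmin:          list of length m, rmin[i] for position i
    Returns (order, rcgs). order[i] is the pattern position compared at
    step i (order[0] is unused, the last character is compared before
    step 1) and rcgs[i] is the shift used when step i fails.
    """
    order, rcgs = [None], [0]
    # Step 1: special positions, in increasing order of their move.
    for pos, shift in sorted(special_shift.items(), key=lambda item: item[1]):
        order.append(pos)
        rcgs.append(shift)
    # Step 2: non-special positions from left to right, with their rmin values.
    for pos in range(m - 1):
        if pos not in special_shift:
            order.append(pos)
            rcgs.append(rmin[pos])
    # Step 3: shift after a complete match (Rule 1).
    rcgs.append(rmin[0])
    return order, rcgs


order, rcgs = build_rcgs(8, {3: 2, 5: 4, 6: 7}, [7, 7, 7, 7, 7, 7, 7, 8])
print(order)  # [None, 3, 5, 6, 0, 1, 2, 4]
print(rcgs)   # [0, 2, 4, 7, 7, 7, 7, 7, 7]
```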

47

ex:

T = G C A T C G C A G A G A G T A T A C A G T A C G (positions 1 to 24)

P = G C A G A G A G

s = m = 8

rcBc  1 2 3 4 5 6 7 8
A     8 5 5 3 3 3 1 1
C     8 6 6 6 6 6 6 6
G     2 2 2 4 4 2 2 2
T     8 8 8 8 8 8 8 8

i        0 1 2 3 4 5 6 7 8
rcGs[i]  0 2 4 7 7 7 7 7 7

48

ex:

P is aligned with positions 1 to 8 of T.

Comparison 1: the last character of P (G) against T position 8 (A): mismatch.

Shift by 1 (rcBc[A][s], s = 8), and set s = 1.

49

ex:

P is aligned with positions 2 to 9 of T.

Comparison 1: the last character of P (G) against T position 9 (G): match.

Comparison 2: special position 3 of P (G) against T position 5 (C): mismatch.

Shift by 2 (rcGs[1]), and set s = 2.

50

ex:

P is aligned with positions 4 to 11 of T.

Comparison 1: the last character of P (G) against T position 11 (G): match.

Comparison 2: special position 3 of P (G) against T position 7 (C): mismatch.

Shift by 2 (rcGs[1]), and set s = 2.

51

ex:

P is aligned with positions 6 to 13 of T.

Comparisons 1 to 8: all characters match, compared in the order of P positions 7, 3, 5, 6, 0, 1, 2, 4. P occurs in T at positions 6 to 13.

Shift by 7 (rcGs[8]), and set s = 7.

52

ex:

P is aligned with positions 13 to 20 of T.

Comparison 1: the last character of P (G) against T position 20 (G): match.

Comparison 2: special position 3 of P (G) against T position 16 (T): mismatch.

Shift by 2 (rcGs[1]), and set s = 2.

53

ex:

P is aligned with positions 15 to 22 of T.

Comparison 1: the last character of P (G) against T position 22 (A): mismatch.

Shift by 5 (rcBc[A][s], s = 2), and set s = 5. P now extends past the end of T, so the search ends.
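The walk-through above can be replayed in code. The following is a minimal sketch of the searching phase in Python, assuming the rcBc and rcGs tables of the example are given as data. The function name `rc_search` and the list `order` (the comparison order 3, 5, 6 for the special positions, then 0, 1, 2, 4 for the non-special ones) are hypothetical, and positions are counted from 0 rather than from 1 as on the slides.

```python
def rc_search(text, pattern, rcbc, rcgs, order):
    """Return the 0-based start positions of pattern in text.

    rcbc[Y][s - 1] is rcBc[Y, s]; rcgs and order are indexed from 1.
    Every character of text is assumed to be a key of rcbc.
    """
    n, m = len(text), len(pattern)
    matches = []
    j, s = 0, m                      # j: window start, s: previous shift
    while j <= n - m:
        # Compare the last character first; on a mismatch, shift by rcBc.
        while j <= n - m and pattern[m - 1] != text[j + m - 1]:
            s = rcbc[text[j + m - 1]][s - 1]
            j += s
        if j > n - m:
            break
        # Special positions first, then non-special positions.
        i = 1
        while i < m and pattern[order[i]] == text[j + order[i]]:
            i += 1
        if i >= m:
            matches.append(j)
        s = rcgs[i]
        j += s
    return matches


RCBC = {
    "A": [8, 5, 5, 3, 3, 3, 1, 1],
    "C": [8, 6, 6, 6, 6, 6, 6, 6],
    "G": [2, 2, 2, 4, 4, 2, 2, 2],
    "T": [8, 8, 8, 8, 8, 8, 8, 8],
}
RCGS = [0, 2, 4, 7, 7, 7, 7, 7, 7]
ORDER = [None, 3, 5, 6, 0, 1, 2, 4]

found = rc_search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG", RCBC, RCGS, ORDER)
print(found)  # [5]
```

The result 5 is the 0-based start of the occurrence, which is position 6 in the 1-based numbering of T used in the example.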

54

Time complexity

• Preprocessing phase in $O(m^2)$ time complexity and $O(m\sigma)$ space complexity.

• Searching phase in $O(n)$ time complexity.

• $2n$ text character comparisons in the worst case.

56

Thank you~
