Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999. G. Navarro and R. Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: H. M. Chen.

Page 1

Very fast and simple approximate string matching

Information Processing Letters, 72:65-70, 1999.

G. Navarro and R. Baeza-Yates

Advisor: Prof. R. C. T. Lee

Speaker: H. M. Chen

Page 2

• Our approximate string matching problem is defined as follows: Given a pattern string P of length m, a text string T of length n, and a maximum number k of allowed errors, find all positions in T where a substring matches P with edit distance at most k.
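As a point of reference for the problem statement, here is a minimal Python sketch of the classical column-wise dynamic programming search, in the style of Sellers [9]. It is not the filtering algorithm of this talk. The function name `approx_search` and the convention of reporting 1-based end positions are assumptions made for illustration.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based positions j such that some substring of text
    ending at j has edit distance <= k to pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]   # value of col[i-1] from the previous column
        col[0] = 0           # an occurrence may start at any text position
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

This baseline costs O(mn) time, which is the cost that the filtering approach tries to avoid paying over the whole text.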

Page 3

This paper is based upon the following lemma, presented by the same authors in [A Hybrid indexing method for approximate string matching].

Lemma: Let T and P be two strings. Let P be divided into j pieces p1, p2, …, pj. If ed(T,P) ≤ k, then there exists at least one pi and a substring S in T such that ed(S,pi) ≤ $\lfloor k/j \rfloor$.
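A short pigeonhole argument shows why the lemma holds. The notation e_i is introduced here for illustration only: it denotes the number of edit operations that an optimal alignment of P with T spends on piece pi, so the e_i sum to ed(T,P) ≤ k, and the part of T aligned with pi is a substring S with ed(S,pi) ≤ e_i. If every e_i were at least $\lfloor k/j \rfloor + 1$, we would get a contradiction:

```latex
\[
\sum_{i=1}^{j} e_i \;\ge\; j\bigl(\lfloor k/j \rfloor + 1\bigr) \;>\; j \cdot \frac{k}{j} \;=\; k .
\]
```

Hence at least one piece satisfies e_i ≤ $\lfloor k/j \rfloor$.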

Page 4

If we let j = k+1, then $\lfloor k/j \rfloor = 0$. In this case, if ed(T,P) ≤ k, then at least one pi occurs in T exactly.

If, in a certain window, we find an exact match of some pi, we use the dynamic programming approach to determine whether there exists an approximate match of P allowing k errors in this window.
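The choice j = k+1 can be sketched in a few lines of Python. The helper name `split_pattern` and the policy of cutting P into pieces of (almost) equal length are assumptions; the slides only show examples where the pieces have equal length.

```python
def split_pattern(pattern: str, j: int) -> list[str]:
    """Split pattern into j contiguous pieces of (almost) equal length."""
    m = len(pattern)
    if not 1 <= j <= m:
        raise ValueError("need 1 <= j <= len(pattern)")
    # Piece i covers pattern[i*m//j : (i+1)*m//j].
    return [pattern[i * m // j:(i + 1) * m // j] for i in range(j)]


k = 2
pieces = split_pattern("ATCCTC", k + 1)  # j = k + 1 pieces
errors_per_piece = k // (k + 1)          # floor(k / j), which is 0 here
```

With j = k+1 the integer division `k // (k + 1)` is always 0, which is why at least one piece must occur exactly.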

Page 5

If, in a window, we cannot find any exact matching of pi inside the window, we ignore the window. That is, we do not have to check whether there is an approximate matching inside the window.

Page 6

Question: How large can the window be?

Answer: The largest window size which is allowed to produce an approximate matching with edit distance smaller than or equal to k is m+2k, where m is the length of the pattern.

This can be explained in the following slide.

Page 7

Consider the following case. Suppose P exactly matches a substring S in T. We may extend S by k characters to the right and k characters to the left. This forms a window of size m+2k. Any substring that contains S and adds at most k characters in total is an approximate matching of P with edit distance less than or equal to k. Since an occurrence with at most k errors has length at most m+k, no such occurrence containing S can reach outside this window.

[Figure: P aligned with a substring S of length m in T, with k extra characters of T on each side of S.]
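The window computation can be sketched as follows. The function name `verification_window`, the 0-based half-open indexing, and the `piece_offset` parameter (where the matched piece starts inside P) are assumptions made for illustration.

```python
def verification_window(piece_start: int, piece_offset: int,
                        m: int, k: int, n: int) -> tuple[int, int]:
    """Return the 0-based half-open range [lo, hi) of the text to verify.

    piece_start  : index in the text where the piece was found
    piece_offset : index in the pattern where that piece begins
    m, k, n      : pattern length, allowed errors, text length
    """
    start = piece_start - piece_offset  # where P would begin with no errors
    lo = max(0, start - k)              # allow k extra characters on the left
    hi = min(n, start + m + k)          # and k extra characters on the right
    return lo, hi
```

Away from the ends of the text the range has length m+2k; near the ends it is clipped to the text.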

Page 8

• Let us consider the case where we limit the number of errors to at most k. Then we split the pattern P into k+1 pieces. Since each piece is rather small, there is a high probability that it appears exactly in T. Thus, when the pieces are small, as in this case, we cannot eliminate many substrings.

Page 9

• Our idea is as follows:

After determining the exact occurrences of the small pieces, we go on to determine the occurrences of larger pieces of P in T.

k = 3

AAABBBCCCDDD

AAABBB CCCDDD (j = 2, $\lfloor k/j \rfloor = \lfloor 3/2 \rfloor = 1$)

AAA BBB CCC DDD (j = 4, $\lfloor k/j \rfloor = \lfloor 3/4 \rfloor = 0$)
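The levels of this example can be generated with a small Python sketch. It assumes that j doubles from one level to the next, as in the k = 3 example, and that pieces are cut at (almost) equal lengths; the name `piece_hierarchy` is hypothetical.

```python
def piece_hierarchy(pattern: str, k: int) -> list[tuple[int, int, list[str]]]:
    """Return (j, floor(k/j), pieces) for j = 1, 2, 4, ... until the
    number of errors allowed per piece drops to 0."""
    m = len(pattern)
    levels = []
    j = 1
    while True:
        if j > m:
            raise ValueError("pattern too short for this many pieces")
        pieces = [pattern[i * m // j:(i + 1) * m // j] for i in range(j)]
        levels.append((j, k // j, pieces))
        if k // j == 0:   # pieces at this level must match exactly
            break
        j *= 2
    return levels


levels = piece_hierarchy("AAABBBCCCDDD", 3)
```

The last level holds the pieces that are searched exactly; the levels above it are the larger pieces that are checked afterwards with 1 and then 3 errors.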

Page 10

bc table

• The only thing we need to do is construct a table for each piece of P, as follows. Let x be a character in the alphabet. If x occurs in the piece, we record the position of the last x, counted from the right end of the piece. If x does not occur in the piece, we record m+1, where m here is the length of the piece.

Page 11

Suppose we have P = ATCCTC with k = 2.

We divide P into three pieces: p1 = AT, p2 = CC and p3 = TC.

To search for exact matches, we actually perform an exhaustive search. Let us assume that we search for AT, and let X be the character of T immediately to the right of the current window.

Note that there are three cases:

Case 1: X = A. We move AT 2 steps.

Case 2: X = T. We move AT 1 step.

Case 3: X ≠ A and X ≠ T. We move AT 3 steps.

[Figure: AT aligned under a window of T; X is the next character of T after the window.]

Page 12

Let us assume that we search for CC.

Case 1 : X = C. We move CC 1 step.

Case 2 : X ≠ C. We move CC 3 steps.

[Figure: CC aligned under a window of T; X is the next character of T after the window.]

Page 13

Let us assume that we search for TC.

Case 1 : X = T. We move TC 2 steps.

Case 2 : X = C. We move TC 1 step.

Case 3 : X≠T and X≠C, we move TC 3 steps.

[Figure: TC aligned under a window of T; X is the next character of T after the window.]

Page 14

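Based upon the three discussions above, we take the minimum value for each character over the three bc tables and obtain the following shift table (* stands for any other character):

bc tables

p1 = AT: A = 2, T = 1, * = 3

p2 = CC: C = 1, * = 3

p3 = TC: T = 2, C = 1, * = 3

shift table

A = 2, T = 1, C = 1, * = 3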
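The construction of the bc tables and the combined shift table can be sketched in Python as follows. The function names `bc_table` and `shift_table` are hypothetical, and the sketch assumes that all pieces have the same length, as in the examples of this talk.

```python
def bc_table(piece: str) -> dict[str, int]:
    """Map each character of the piece to the distance of its last
    occurrence from the right end of the piece (1-based)."""
    m = len(piece)
    # Later occurrences overwrite earlier ones, so the last one wins.
    return {c: m - i for i, c in enumerate(piece)}


def shift_table(pieces: list[str]) -> tuple[dict[str, int], int]:
    """Return (table, default): the minimum bc value per character over
    all pieces, and the shift used for characters in no piece (the '*')."""
    default = min(len(p) for p in pieces) + 1
    table: dict[str, int] = {}
    for p in pieces:
        for c, s in bc_table(p).items():
            table[c] = min(table.get(c, default), s)
    return table, default


table, default = shift_table(["AT", "CC", "TC"])
```

Taking the minimum keeps the shift safe: the window never jumps past a position where some piece could still match.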

Page 15

T = TCCAAGTTATAGCTC

p1 = AT, p2 = CC , p3 = TC

First step: We open a window of length 2 and compare it with AT, CC and TC. We find that it exactly matches p3. We then shift the window by the shift table value of the character right after the window (here C, so 1 position).

[T C] C A A G T T A T A G C T C

Second step: The window CC exactly matches p2. The next character is A, so we shift the window 2 positions.

T [C C] A A G T T A T A G C T C

Third step: AA matches none of p1, p2 and p3. The next character is G, so we shift the window 3 positions and continue to compare.

T C C [A A] G T T A T A G C T C

shift table

A = 2, T = 1, C = 1, * = 3

Page 16

T = TCCAAGTTATAGCTC P = ATCCTC

Using this shift table, we find AT occurring at position 9 in T, CC occurring at position 2 and TC occurring at positions 1 and 14. Table d contains all text positions of P’s pieces.

Table d

AT: 9

CC: 2

TC: 1, 14

shift table

A = 2, T = 1, C = 1, * = 3
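The scan that fills Table d can be sketched in Python as below. It follows the steps of the example: compare the window with every piece, then shift by the table value of the character right after the window. The function name `find_pieces` is hypothetical, positions are reported 1-based as on the slides, and all pieces are assumed to have the same length.

```python
def find_pieces(text: str, pieces: list[str]) -> dict[str, list[int]]:
    """Return Table d: for each piece, the 1-based text positions where
    it occurs exactly."""
    w = len(pieces[0])       # window length (all pieces assumed equal)
    default = w + 1          # shift for characters that occur in no piece
    shift: dict[str, int] = {}
    for p in pieces:
        for i, c in enumerate(p):
            # Distance from the right end; the minimum over all pieces.
            shift[c] = min(shift.get(c, default), w - i)

    table_d: dict[str, list[int]] = {p: [] for p in pieces}
    n = len(text)
    pos = 0
    while pos + w <= n:
        window = text[pos:pos + w]
        if window in table_d:
            table_d[window].append(pos + 1)
        if pos + w >= n:     # no character after the window
            break
        pos += shift.get(text[pos + w], default)
    return table_d


table_d = find_pieces("TCCAAGTTATAGCTC", ["AT", "CC", "TC"])
```

On the text and pieces of this slide the sketch visits the windows TC, CC, AA, TT, AT, AG, GC, CT, TC and fills Table d with the positions listed above.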

Page 17

T = CAABCAAABDAACB

P = ABCACABCDDCA k = 3

ABCACABCDDCA

ABCACA BCDDCA

ABC ACA BCD DCA

Page 18

T = CAABCAAABDAACB

P = ABCACABCDDCA k = 3

Table d

ABC: 3

ACA: NULL

BCD: NULL

DCA: NULL

shift table

A = 1, B = 2, C = 1, D = 1, * = 4

Page 19

T = CAABCAAABDAACB

P = ABCACABCDDCA

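1. Found ABC in T at position 3.

Search for “ABCACA” with k = 1 in the window around this occurrence. The pattern length m is now six, so the window length is m + 2k = 8.

C [A A B C A A A B] D A A C B

Found! (ABCAAA differs from ABCACA in one character.)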

Page 20

T = CAABCAAABDAACB

P = ABCACABCDDCA

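2. Search for ABCACABCDDCA with k = 3 in the corresponding window of T. Here m + 2k = 18 exceeds the length of T, so the window is all of T.

C A A B C A A A B D A A C B

No substring of T matches ABCACABCDDCA with at most 3 errors, so we stop comparing.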
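The two verification steps of this example can be replayed with a small Python sketch. The helper names `min_errors` and `window` are hypothetical, and the 0-based slicing with windows clipped to the text is an assumption; the dynamic programming recurrence is the standard one for approximate search.

```python
def min_errors(pattern: str, text: str) -> int:
    """Smallest edit distance between pattern and any substring of text."""
    m = len(pattern)
    col = list(range(m + 1))   # column of the DP matrix
    best = col[m]              # the empty substring costs m deletions
    for c in text:
        prev_diag, col[0] = col[0], 0   # a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # (mis)match
                         old + 1,                             # extra text char
                         col[i - 1] + 1)                      # missing char
            prev_diag = old
        best = min(best, col[m])
    return best


T = "CAABCAAABDAACB"

def window(start: int, m: int, k: int) -> str:
    """Text around a candidate starting at 0-based index start."""
    return T[max(0, start - k):start + m + k]

# ABC was found at position 3 (0-based index 2) and is a prefix of both
# larger pieces, so the candidate start is index 2 in both steps.
step1 = min_errors("ABCACA", window(2, 6, 1)) <= 1
step2 = min_errors("ABCACABCDDCA", window(2, 12, 3)) <= 3
```

In this sketch `step1` succeeds and `step2` fails, in agreement with the two slides.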

Page 21

Time complexity

• The search cost is O(kn/m) = O(αn), where α = k/m is the error level.


Page 24

Thank You