1
Very fast and simple approximate string matching
Information Processing Letters, 72:65-70, 1999.
G. Navarro and R. Baeza-Yates
Advisor: Prof. R. C. T. Lee
Speaker: H. M. Chen
2
• Our approximate string matching problem is defined as follows: Given a pattern string P of length m, a text string T of length n, and a maximum number k of errors allowed, find all text positions where a substring of T matches P with edit distance at most k.
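The problem statement can be made concrete with a minimal sketch of the classical column-wise dynamic programming approach. This is a baseline for illustration, not the filtering algorithm of the paper; the function name `approx_match_ends` is an assumption, and positions are reported as 1-based end positions.

```python
def approx_match_ends(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions in text where pattern matches with <= k errors."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text, start=1):
        prev_diag = col[0]   # col[i-1] of the previous column
        col[0] = 0           # a match may start anywhere in the text
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new_val = min(prev_diag + cost,  # match or substitution
                          col[i] + 1,        # extra character in the text
                          col[i - 1] + 1)    # pattern character missing in the text
            prev_diag = col[i]
            col[i] = new_val
        if col[m] <= k:
            ends.append(pos)
    return ends


exact = approx_match_ends("abc", "xxabcxx", 0)
one_error = approx_match_ends("abc", "xxabcxx", 1)
```

This costs O(mn) time, which is what the filtering idea of the following slides tries to avoid for most of the text.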
3
This paper is based upon the following lemma presented by the same authors in [A Hybrid indexing method for approximate string matching].
Lemma: Let T and P be two strings. Let P be divided into j pieces p1, p2, …, pj. If ed(T,P) ≤ k, then there exists at least one pi and a substring S in T such that ed(S,pi) ≤ ⌊k/j⌋.
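The slides state the lemma without proof. One way to see why it holds is a counting argument, sketched here under the assumption that each edit operation of an optimal alignment is charged to exactly one piece; the symbol e_i is introduced only for this sketch.

```latex
% e_i = number of edit operations charged to piece p_i in an optimal
% alignment of P with T, so that
\sum_{i=1}^{j} e_i \;\le\; \mathrm{ed}(T,P) \;\le\; k .
% Suppose every piece needed more than \lfloor k/j \rfloor errors. Then
e_i \;\ge\; \lfloor k/j \rfloor + 1 \;>\; k/j \quad \text{for all } i
\quad\Longrightarrow\quad
\sum_{i=1}^{j} e_i \;>\; j \cdot \frac{k}{j} \;=\; k ,
% which is a contradiction. Hence e_i \le \lfloor k/j \rfloor for some i,
% and the part of T aligned with that p_i is the substring S.
```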
4
If we let j = k+1, then ⌊k/j⌋ = 0. In this case, if ed(T,P) ≤ k, then at least one pi occurs in T exactly.
If, in a certain window, we find an exact match of a pi inside the window, we use the dynamic programming approach to determine whether there exists an approximate match of P allowing k errors in this window.
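A small sketch of the j = k+1 case. The helper `split_pattern` and the choice of near-equal piece lengths are assumptions made for illustration; the slides do not say how the pieces are sized when m is not a multiple of j.

```python
def split_pattern(pattern: str, j: int) -> list[str]:
    """Split pattern into j consecutive pieces whose lengths differ by at most 1."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for i in range(j):
        end = start + base + (1 if i < extra else 0)
        pieces.append(pattern[start:end])
        start = end
    return pieces


k = 2
pieces = split_pattern("ATCCTC", k + 1)
# "ATCGTC" differs from the pattern by one substitution (<= k errors),
# so at least one of the k+1 pieces must appear in it unchanged.
intact = [p for p in pieces if p in "ATCGTC"]
```

Here `pieces` is AT, CC, TC, and the single substitution destroys only CC, so AT and TC still occur exactly.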
5
If, in a window, we cannot find an exact match of any pi, we ignore the window. That is, we do not have to check whether there is an approximate match inside the window.
6
Question: How large can the window be?
Answer: The largest window that can contain an approximate match with edit distance smaller than or equal to k has size m+2k, where m is the length of the pattern.
This can be explained in the following slide.
7
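Consider the following case. Suppose P exactly matches a substring S in T. We may extend S by k characters to the right and by k characters to the left. This forms a window of size m+2k. Any substring obtained by extending S by at most k characters in total, to the right and to the left, is an approximate match of P with edit distance less than or equal to k, and all such substrings lie inside this window.
[Figure: the text T with the substring S of length m that matches P, extended by k characters on each side]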
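The filter-then-verify idea of the preceding slides can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the pieces are found with plain `str.find` rather than the shift table, and the window is placed by assuming the piece sits at its original offset inside P and then widening by k on each side.

```python
def min_errors(pattern: str, window: str) -> int:
    """Smallest edit distance between pattern and any substring of window."""
    m = len(pattern)
    col = list(range(m + 1))
    best = col[m]
    for ch in window:
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            prev_diag, col[i] = col[i], min(prev_diag + cost, col[i] + 1, col[i - 1] + 1)
        best = min(best, col[m])
    return best


def filter_search(pattern: str, text: str, k: int) -> list[tuple[int, int]]:
    """Return 0-based (lo, hi) windows of length <= m + 2k that contain a match."""
    m, j = len(pattern), k + 1
    base, extra = divmod(m, j)
    windows, offset = set(), 0
    for i in range(j):
        piece = pattern[offset:offset + base + (1 if i < extra else 0)]
        start = text.find(piece)
        while start != -1:
            lo = max(0, start - offset - k)               # k extra characters on the left
            hi = min(len(text), start - offset + m + k)   # k extra characters on the right
            windows.add((lo, hi))
            start = text.find(piece, start + 1)
        offset += len(piece)
    # Only windows that contain an exact piece are verified by dynamic programming.
    return [w for w in sorted(windows) if min_errors(pattern, text[w[0]:w[1]]) <= k]


hits = filter_search("abcdef", "xxabcxefxx", 1)
```

In this toy input only the piece abc occurs exactly, so a single window of length 8 = m + 2k is verified; a text without any piece is never passed to the dynamic programming step.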
8
• Let us consider the case where we limit the number of errors to at most k. Then we split the pattern P into k+1 pieces. Since each piece is rather small, there is a high probability that it appears exactly in T. Thus, when the pieces are small, as in this case, we cannot eliminate many substrings.
9
• Our idea is as follows:
After determining the exact occurrences of the small pieces, we go on to determine the occurrences of the larger pieces of P in T.
k = 3
AAABBBCCCDDD
AAABBB  CCCDDD           j = 2, ⌊k/j⌋ = ⌊3/2⌋ = 1
AAA  BBB  CCC  DDD       j = 4, ⌊k/j⌋ = ⌊3/4⌋ = 0
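The levels of this example can be generated with a short sketch. The doubling of j from one level to the next and the name `error_hierarchy` are assumptions read off the figure (j = 1, 2, 4 for k = 3); the sketch also assumes j never exceeds the pattern length.

```python
def split_pattern(pattern: str, j: int) -> list[str]:
    """Split pattern into j consecutive pieces whose lengths differ by at most 1."""
    base, extra = divmod(len(pattern), j)
    pieces, start = [], 0
    for i in range(j):
        end = start + base + (1 if i < extra else 0)
        pieces.append(pattern[start:end])
        start = end
    return pieces


def error_hierarchy(pattern: str, k: int) -> list[tuple[int, int, list[str]]]:
    """Return (j, floor(k/j), pieces) per level, doubling j until the budget is 0."""
    levels, j = [], 1
    while True:
        budget = k // j
        levels.append((j, budget, split_pattern(pattern, j)))
        if budget == 0:   # these pieces are searched exactly
            break
        j *= 2
    return levels


levels = error_hierarchy("AAABBBCCCDDD", 3)
```

The last level is searched exactly; each level above it is verified with its own error budget only where the level below reported a hit.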
10
bc table
• The only thing we need to do is to construct a table for each piece of P, as follows. Let x be a character in the alphabet. If x occurs in the piece, we record the position of its last occurrence, counted from the right end of the piece (the rightmost character has position 1). If x does not occur in the piece, we record m+1, where m here is the length of the piece.
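A minimal sketch of this table for a single piece. The function name and the choice to return the default value m+1 separately, instead of storing one entry per alphabet character, are assumptions made for brevity.

```python
def piece_shift_table(piece: str) -> tuple[dict[str, int], int]:
    """Return (table, default): table[x] is the position of the last x in the
    piece counted from the right end; default (m + 1) applies to any other x."""
    m = len(piece)
    table = {}
    for idx, ch in enumerate(piece):
        table[ch] = m - idx   # a later occurrence overwrites an earlier one
    return table, m + 1


table_at = piece_shift_table("AT")
table_cc = piece_shift_table("CC")
```

For AT this gives A = 2, T = 1 and 3 for every other character; for CC the second C overwrites the first, leaving C = 1.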
11
Suppose we have P = ATCCTC with k = 2.
We divide P into three pieces : p1 = AT, p2 = CC and p3 = TC.
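To search for exact matches, we actually perform an exhaustive search. Let us assume that we search for AT.
Let X be the text character immediately to the right of the current window. Note that there are three cases:
Case 1 : X = A. We move AT 2 steps.
Case 2 : X = T. We move AT 1 step.
Case 3 : X≠A and X≠T. We move AT 3 steps.
[Figure: the text T with the piece AT aligned below it and the character X just after the window]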
12
Let us assume that we search for CC.
Case 1 : X = C. We move CC 1 step.
Case 2 : X ≠ C. We move CC 3 steps.
[Figure: the text T with the piece CC aligned below it and the character X just after the window]
13
Let us assume that we search for TC.
Case 1 : X = T. We move TC 2 steps.
Case 2 : X = C. We move TC 1 step.
Case 3 : X≠T and X≠C, we move TC 3 steps.
[Figure: the text T with the piece TC aligned below it and the character X just after the window]
14
Based upon the three discussions above, we choose the minimum value for each character and obtain the following shift table:

bc tables
p1 = AT:   A  T  *
           2  1  3
p2 = CC:   C  *
           1  3
p3 = TC:   T  C  *
           2  1  3

shift table
A  T  C  *
2  1  1  3
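Taking the minimum over the per-piece tables can be sketched as below. The function name is hypothetical, and the default entry (the * column) is taken as the shortest piece length plus one, which equals 3 here because all pieces have length 2.

```python
def combined_shift_table(pieces: list[str]) -> tuple[dict[str, int], int]:
    """Merge the per-piece tables by keeping the smallest shift per character."""
    shifts = {}
    for piece in pieces:
        m = len(piece)
        for idx, ch in enumerate(piece):
            d = m - idx                        # distance from the right end
            shifts[ch] = min(shifts.get(ch, d), d)
    default = min(len(p) for p in pieces) + 1  # the '*' entry
    return shifts, default


shift_table = combined_shift_table(["AT", "CC", "TC"])
```

Using the minimum keeps the search safe: the window never jumps over a position where any of the pieces could still occur.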
15
T = TCCAAGTTATAGCTC
p1 = AT, p2 = CC , p3 = TC
First step: We open a window of length 2 at the start of T and compare it with AT, CC and TC. The window TC is an exact match of p3. We then shift the window by the shift table value of the character just after the window.
Second step:
The window CC is an exact match of p2. We then shift the window 2 positions.
Third step:
We cannot find AA among p1, p2 and p3. We then shift the window 3 positions and continue to compare.

shift table
A  T  C  *
2  1  1  3

[T C] C A A G T T A T A G C T C
T [C C] A A G T T A T A G C T C
T C C [A A] G T T A T A G C T C
16
T = TCCAAGTTATAGCTC P = ATCCTC
Using this shift table, we obtain the following: AT occurs at position 9 in T, CC occurs at position 2 in T, and TC occurs at positions 1 and 14 in T. Table d contains all text positions of P's pieces.

Table d
AT   9
CC   2
TC   1, 14

shift table
A  T  C  *
2  1  1  3
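The scan that fills Table d can be sketched as follows, assuming all pieces have the same length and using a Python dict lookup to compare the window against the pieces. The function name `find_pieces` is hypothetical; positions are 1-based to agree with the slides.

```python
def find_pieces(text: str, pieces: list[str]) -> dict[str, list[int]]:
    """Slide a window over text and record the 1-based positions of each piece."""
    w = len(pieces[0])                       # assumption: equal-length pieces
    shifts = {}
    for piece in pieces:
        for idx, ch in enumerate(piece):
            shifts[ch] = min(shifts.get(ch, w - idx), w - idx)
    table_d = {piece: [] for piece in pieces}
    pos = 0
    while pos + w <= len(text):
        window = text[pos:pos + w]
        if window in table_d:
            table_d[window].append(pos + 1)
        if pos + w == len(text):             # no character after the window
            break
        # Shift by the table value of the character just after the window.
        pos += shifts.get(text[pos + w], w + 1)
    return table_d


table_d = find_pieces("TCCAAGTTATAGCTC", ["AT", "CC", "TC"])
```

On this text the window visits the starting positions 1, 2, 4, 7, 9, 11, 12, 13 and 14, so a few alignments are skipped even with pieces this short.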
17
T = CAABCAAABDAACB
P = ABCACABCDDCA k = 3
ABCACABCDDCA
ABCACA BCDDCA
ABC ACA BCD DCA
18
T = CAABCAAABDAACB
P = ABCACABCDDCA k = 3
Table d
ABC   3
ACA   NULL
BCD   NULL
DCA   NULL

shift table
A  B  C  D  *
1  2  1  1  4
19
T = CAABCAAABDAACB
P = ABCACABCDDCA
1. Found ABC in T.
Search for "ABCACA" in the window around this occurrence with k=1. The length m is now six, so the window length is m+2k = 8.
Found!
C A A B C A A A B D A A C B
20
T = CAABCAAABDAACB
P = ABCACABCDDCA
2. Search for ABCACABCDDCA in T with k=3.
We cannot find ABCACABCDDCA in T with k=3.
Stop comparing.
C A A B C A A A B D A A C B
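The two verification steps of this example can be replayed with a short sketch. The helper `min_errors` and the placement of the window (starting k characters before the ABC hit) are assumptions; the slides only give the window length.

```python
def min_errors(pattern: str, window: str) -> int:
    """Smallest edit distance between pattern and any substring of window."""
    m = len(pattern)
    col = list(range(m + 1))
    best = col[m]
    for ch in window:
        prev_diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            prev_diag, col[i] = col[i], min(prev_diag + cost, col[i] + 1, col[i - 1] + 1)
        best = min(best, col[m])
    return best


T = "CAABCAAABDAACB"
hit = T.find("ABC")                 # 0-based index 2, i.e. position 3 in the slides

# Step 1: verify the parent piece ABCACA with k = 1 in a window of length 6 + 2 = 8.
parent, k_parent = "ABCACA", 1
lo = max(0, hit - k_parent)
window = T[lo:lo + len(parent) + 2 * k_parent]
parent_ok = min_errors(parent, window) <= k_parent

# Step 2: only if step 1 succeeds, verify the whole pattern with k = 3.
# The window of length 12 + 6 = 18 exceeds |T| = 14, so all of T is checked.
whole_ok = parent_ok and min_errors("ABCACABCDDCA", T) <= 3
```

In step 1 the window is AABCAAAB, which contains ABCAAA at one substitution from ABCACA, so the parent piece is confirmed; step 2 then fails, in agreement with the slide.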
21
Time complexity
• The search cost is O(kn/m) = O(αn), where α = k/m is the error level.
24
Thank You