1 Fast text searching: allowing errors Sun Wu and Udi Manber, Communications of the ACM, Vol. 35,...
1
Fast text searching: allowing errors
Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp. 83-91
Advisor: Prof. R. C. T. Lee
Reporter: Z. H. Pan
2
We are given a text T(1,n), a pattern P(1,m) and an error bound k.
Our approximate string matching problem is defined as follows:
Find all locations i of T such that the following condition is satisfied: There exists a suffix A of T(1, i) such that d(A,P)≦k, where d(x,y) is the edit distance between x and y.
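The definition can be checked directly with a brute-force sketch. This is not the method of the paper, only a literal reading of the problem statement; the names edit_distance and match_ends are hypothetical, and the empty suffix is assumed to count as a suffix.

```python
def edit_distance(x, y):
    """Edit distance d(x, y) with insertion, deletion and substitution."""
    prev = list(range(len(y) + 1))            # distances from the empty prefix of x
    for a in x:
        cur = [prev[0] + 1]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,              # drop a from x
                           cur[j - 1] + 1,           # add b to x
                           prev[j - 1] + (a != b)))  # match or substitute
        prev = cur
    return prev[-1]


def match_ends(text, pattern, k):
    """All 1-based i such that some suffix A of T(1, i) has d(A, P) <= k."""
    ends = []
    for i in range(1, len(text) + 1):
        prefix = text[:i]
        # Try every suffix of T(1, i), including the empty one (an assumption).
        if any(edit_distance(prefix[s:], pattern) <= k for s in range(i + 1)):
            ends.append(i)
    return ends


print(match_ends("deaabeg", "aabac", 2))   # locations i that satisfy the condition
```

The sketch recomputes an edit distance for every suffix, so it is only meant for checking small examples by hand.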
3
Example:
T=deaabeg,
P=aabac and k=2.
For i=5.
T(1, 5)= deaab.
We note that there exists a suffix A=aab of T(1, 5) such that d(A,P)=d(aab,aabac)=2.
4
Example:
T=deaabeg, P=aab and k=2.
Consider i=5.
T(1,5)=deaab.
We have A=aab of T(1,5) and d(A,P)=d(aab,aab)=0. Thus we have found a substring aab in T such that d(aab,P)=0.
Consider i=6.
T(1,6)=deaabe.
We have A=aabe of T(1,6) and d(A,P)=d(aabe,aab)=1. Again, we have found a substring aabe in T such that d(aabe,P)=1.
5
Our approach is based upon the following observation:
Let S be a substring of T. Write S = S1 S2 and P = P1 P2, where S2 is a suffix of S and P2 is a suffix of P.
If d(S2, P2) = 0 and d(S1, P1) ≦ k, we have d(S, P) ≦ k.
(Figure: T containing the substring S = S1 S2, aligned with P = P1 P2.)
6
Example:
A=addcd, B=abcd and k=2. We may decompose A and B as follows: A=add+cd and B=ab+cd. Here d(cd,cd)=0 and d(add,ab)=2≦k.
Thus d(A,B)≦2; in fact d(A,B)=2.
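The observation can be tried out on small strings with the following sketch. The helper names (edit_distance, common_suffix_len, bound_via_observation) are hypothetical; the code strips the longest common suffix, which plays the role of S2 = P2, and measures the distance between the remaining parts.

```python
from functools import lru_cache


def edit_distance(x, y):
    """Edit distance d(x, y), written as a memoised recursion."""
    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) is the distance between x[:i] and y[:j].
        if i == 0 or j == 0:
            return i + j
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (x[i - 1] != y[j - 1]))
    return d(len(x), len(y))


def common_suffix_len(a, b):
    """Length of the longest common suffix of a and b."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def bound_via_observation(s, p):
    """Upper bound on d(s, p): d(S1, P1) after removing the common suffix."""
    n = common_suffix_len(s, p)
    s1, p1 = s[:len(s) - n], p[:len(p) - n]
    return edit_distance(s1, p1)


a, b = "addcd", "abcd"
print(bound_via_observation(a, b), edit_distance(a, b))
```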
7
A Recursive Operation for the Dynamic Programming Approach
Consider T(1,i) and P(1, j).
Case 1: T(i)=P(j). Let B=P(1, j-1), a prefix of P. We check whether there is a suffix A of T(1, i-1) such that d(A,B)≦k.
(Figure: T with the suffix A ending at position i-1, followed by T(i); P with the prefix B=P(1, j-1), followed by P(j).)
8
Case 2: T(i)≠P(j). We consider three cases:
2.1 Let B=P(1, j). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to an insertion: T(i) is an extra character in the text.
(Figure: T with the suffix A ending at position i-1 and the inserted character T(i); P with B=P(1, j).)
9
Case 2: T(i)≠P(j). We consider three cases:
2.2 Let B=P(1, j-1). There is a suffix A of T(1, i) such that d(A,B)≦k-1. This corresponds to a deletion: P(j) is missing from the text.
(Figure: T with the suffix A ending at position i; P with B=P(1, j-1), followed by P(j).)
10
Case 2: T(i)≠P(j). We consider three cases:
2.3 Let B=P(1, j-1). There is a suffix A of T(1, i-1) such that d(A,B)≦k-1. This corresponds to a substitution: T(i) replaces P(j).
(Figure: T with the suffix A ending at position i-1, followed by T(i); P with B=P(1, j-1), followed by P(j).)
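The four cases can be written as one dynamic-programming table of distances. The sketch below is a standard formulation, not necessarily the one used in the paper; suffix_distances and match_ends are hypothetical names, and the boundary values D(0, j)=j and D(i, 0)=0 are assumptions that let a match start anywhere in T.

```python
def suffix_distances(text, pattern):
    """D[i][j] = smallest d(A, P(1, j)) over all suffixes A of T(1, i)."""
    n, m = len(text), len(pattern)
    D = [[0] * (m + 1) for _ in range(n + 1)]   # D[i][0] = 0: empty pattern prefix
    for j in range(m + 1):
        D[0][j] = j                             # empty text: j pattern characters missing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if text[i - 1] == pattern[j - 1]:
                D[i][j] = D[i - 1][j - 1]               # Case 1
            else:
                D[i][j] = 1 + min(D[i - 1][j],          # Case 2.1, insertion
                                  D[i][j - 1],          # Case 2.2, deletion
                                  D[i - 1][j - 1])      # Case 2.3, substitution
    return D


def match_ends(text, pattern, k):
    """All 1-based i with D[i][m] <= k."""
    D = suffix_distances(text, pattern)
    m = len(pattern)
    return [i for i in range(1, len(text) + 1) if D[i][m] <= k]


print(match_ends("deaabeg", "aabac", 2))
```

Thresholding D[i][j] at k gives a 0/1 value for every pair (i, j), which is the kind of table used on the following slides.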
11
To solve our approximate string matching problem, we start with a table, called Rk[n, m]. Let S=T(1, i).
Rk(i,j)
=1 if there exists a suffix A of S such that d(A, P(1, j))≦k,
=0 otherwise,
where 1≦i≦n and 1≦j≦m.
Example:
T:aabaacaabacab, P:aabac and k=1.
Consider i=9, j=4.
S=T(1, 9)=aabaacaab
P(1, 4)=aaba
A=aab
d(A,P(1, 4))=d(aab,aaba)=1
∴ R1(9, 4)=1
R1[13,5] (row i is labeled with ti, column j with pj):
 i  ti | a a b a c
 1  a  | 1 1 0 0 0
 2  a  | 1 1 1 0 0
 3  b  | 1 1 1 1 0
 4  a  | 1 1 1 1 1
 5  a  | 1 1 1 1 1
 6  c  | 1 1 1 0 1
 7  a  | 1 1 0 1 0
 8  a  | 1 1 1 0 0
 9  b  | 1 1 1 1 0
10  a  | 1 1 1 1 1
11  c  | 1 1 0 1 1
12  a  | 1 1 0 0 1
13  b  | 1 1 1 0 0
12
Example:
T:aabaacaabacab, P:aabac and k=1.
Consider i=13 and j=5.
S=T(1, 13)=aabaacaabacab
P(1, 5)=aabac
There doesn’t exist any suffix A of S such that d(A,P(1, 5))≦1.
∴ R1(13,5)=0
R1[13,5] (row i is labeled with ti, column j with pj):
 i  ti | a a b a c
 1  a  | 1 1 0 0 0
 2  a  | 1 1 1 0 0
 3  b  | 1 1 1 1 0
 4  a  | 1 1 1 1 1
 5  a  | 1 1 1 1 1
 6  c  | 1 1 1 0 1
 7  a  | 1 1 0 1 0
 8  a  | 1 1 1 0 0
 9  b  | 1 1 1 1 0
10  a  | 1 1 1 1 1
11  c  | 1 1 0 1 1
12  a  | 1 1 0 0 1
13  b  | 1 1 1 0 0
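The table can be rebuilt straight from the definition of Rk(i,j), which is a convenient way to check single entries such as R1(9,4) and R1(13,5). This is a brute-force sketch with hypothetical names (edit_distance, r_table), and it assumes the empty string counts as a suffix of S.

```python
def edit_distance(x, y):
    """Edit distance d(x, y) with insertion, deletion and substitution."""
    prev = list(range(len(y) + 1))
    for a in x:
        cur = [prev[0] + 1]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def r_table(text, pattern, k):
    """Rows i = 1..n of Rk as bit strings; character j of a row is Rk(i, j)."""
    rows = []
    for i in range(1, len(text) + 1):
        s = text[:i]                                     # S = T(1, i)
        bits = ""
        for j in range(1, len(pattern) + 1):
            # Rk(i, j) = 1 if some suffix A of S has d(A, P(1, j)) <= k.
            hit = any(edit_distance(s[a:], pattern[:j]) <= k
                      for a in range(i + 1))
            bits += "1" if hit else "0"
        rows.append(bits)
    return rows


r1 = r_table("aabaacaabacab", "aabac", 1)
print(r1[9 - 1][4 - 1], r1[13 - 1][5 - 1])   # R1(9, 4) and R1(13, 5)
```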
13
Question: How can we find Rk(i, j)?
Answer: Dynamic Programming.
There are three types of operation in edit distance:
(1) Insertion
(2) Deletion
(3) Substitution
We consider them separately and combine the results later.
14
Let RIk(i,j), RDk(i,j) and RSk(i,j) denote the Rk(i,j) related to insertion, deletion and substitution respectively.
And let RIk[i,j], RDk[i,j] and RSk[i,j] denote the corresponding tables for insertion, deletion and substitution respectively.
15
Consider RIk(i,j) first.
RIk(i,j)
=1 if ti≠pj and Rk-1(i-1,j)=1
or ti= pj and Rk(i-1,j-1)=1,
=0 otherwise.
(Figure: alignment of T and P illustrating an insertion; positions i-1 and i of T and position j of P are marked.)
16
RIk(i,j)
=1 if ti≠pj and Rk-1(i-1,j)=1 or ti= pj and Rk(i-1,j-1)=1
=0 otherwise
Example: Text = aabaacaabacab. Pattern = aabac. k=1.
R0[13,5] and RI1[13,5] (row i is labeled with ti, column j with pj):
 i  ti | R0        | RI1
       | a a b a c | a a b a c
 1  a  | 1 0 0 0 0 | 1 0 0 0 0
 2  a  | 1 1 0 0 0 | 1 1 0 0 0
 3  b  | 0 0 1 0 0 | 1 1 1 0 0
 4  a  | 1 0 0 1 0 | 1 1 1 1 0
 5  a  | 1 1 0 0 0 | 1 1 0 1 0
 6  c  | 0 0 0 0 0 | 1 1 0 0 1
 7  a  | 1 0 0 0 0 | 1 1 0 0 0
 8  a  | 1 1 0 0 0 | 1 1 0 0 0
 9  b  | 0 0 1 0 0 | 1 1 1 0 0
10  a  | 1 0 0 1 0 | 1 1 1 1 0
11  c  | 0 0 0 0 1 | 1 0 0 1 1
12  a  | 1 0 0 0 0 | 1 1 0 0 1
13  b  | 0 0 0 0 0 | 1 0 1 0 0
(1) When i=13 and j=3.
t13=p3=‘b’, R1(12,2)=1
∴ RI1(13,3)=1
(2) When i=6 and j=4.
t6=‘c’≠p4=‘a’, R0(5,4)=0
∴ RI1(6,4)=0
(3) When i=11 and j=4.
t11=‘c’≠p4=‘a’, R0(10,4)=1
∴ RI1(11,4)=1
17
Consider RDk(i,j).
RDk(i,j)
=1 if ti≠pj and Rk-1(i,j-1)=1
or ti= pj and Rk(i-1,j-1)=1,
=0 otherwise.
(Figure: alignment of T and P illustrating a deletion; position i of T and positions j-1 and j of P are marked.)
18
RDk(i,j)
=1 if ti≠pj and Rk-1(i,j-1)=1 or ti= pj and Rk(i-1,j-1)=1
=0 otherwise
Example: Text = aabaacaabacab. Pattern = aabac. k=1.
R0[13,5] and RD1[13,5] (row i is labeled with ti, column j with pj):
 i  ti | R0        | RD1
       | a a b a c | a a b a c
 1  a  | 1 0 0 0 0 | 1 1 0 0 0
 2  a  | 1 1 0 0 0 | 1 1 1 0 0
 3  b  | 0 0 1 0 0 | 1 0 1 1 0
 4  a  | 1 0 0 1 0 | 1 1 0 1 1
 5  a  | 1 1 0 0 0 | 1 1 1 0 0
 6  c  | 0 0 0 0 0 | 1 0 0 0 0
 7  a  | 1 0 0 0 0 | 1 1 0 0 0
 8  a  | 1 1 0 0 0 | 1 1 1 0 0
 9  b  | 0 0 1 0 0 | 1 0 1 1 0
10  a  | 1 0 0 1 0 | 1 1 0 1 1
11  c  | 0 0 0 0 1 | 1 0 0 0 1
12  a  | 1 0 0 0 0 | 1 1 0 0 0
13  b  | 0 0 0 0 0 | 1 0 1 0 0
(1) When i=13 and j=3.
t13=p3=‘b’, R1(12,2)=1
∴ RD1(13,3)=1
(2) When i=6 and j=4.
t6=‘c’≠p4=‘a’, R0(6,3)=0
∴ RD1(6,4)=0
(3) When i=3 and j=4.
t3=‘b’≠p4=‘a’, R0(3,3)=1
∴ RD1(3,4)=1
19
Consider RSk(i,j).
RSk(i,j)
=1 if ti≠pj and Rk-1(i-1,j-1)=1
or ti= pj and Rk(i-1,j-1)=1
=0 otherwise
(Figure: alignments of T and P illustrating a substitution; positions i-1 and i of T and positions j-1 and j of P are marked.)
20
RSk(i,j)
=1 if ti≠pj and Rk-1(i-1,j-1)=1 or ti= pj and Rk(i-1,j-1)=1
=0 otherwise
Example: Text = aabaacaabacab. Pattern = aabac. k=1.
R0[13,5] and RS1[13,5] (row i is labeled with ti, column j with pj):
 i  ti | R0        | RS1
       | a a b a c | a a b a c
 1  a  | 1 0 0 0 0 | 1 0 0 0 0
 2  a  | 1 1 0 0 0 | 1 1 0 0 0
 3  b  | 0 0 1 0 0 | 1 1 1 0 0
 4  a  | 1 0 0 1 0 | 1 1 0 1 0
 5  a  | 1 1 0 0 0 | 1 1 0 0 1
 6  c  | 0 0 0 0 0 | 1 1 1 0 0
 7  a  | 1 0 0 0 0 | 1 1 0 1 0
 8  a  | 1 1 0 0 0 | 1 1 0 0 0
 9  b  | 0 0 1 0 0 | 1 1 1 0 0
10  a  | 1 0 0 1 0 | 1 1 0 1 0
11  c  | 0 0 0 0 1 | 1 1 0 0 1
12  a  | 1 0 0 0 0 | 1 1 0 0 0
13  b  | 0 0 0 0 0 | 1 1 1 0 0
(1) When i=13 and j=3.
t13=p3=‘b’, R1(12,2)=1
∴ RS1(13,3)=1
(2) When i=6 and j=4.
t6=‘c’≠p4=‘a’, R0(5,3)=0
∴ RS1(6,4)=0
(3) When i=5 and j=5.
t5=‘a’≠p5=‘c’, R0(4,4)=1
∴ RS1(5,5)=1
21
After every RIk(i,j), RDk(i,j) and RSk(i,j) have been found, we immediately determine Rk(i,j) by
Rk(i,j)= RIk(i,j) or RDk(i,j) or RSk(i,j).
Example: Text = aabaacaabacab. Pattern = aabac. k=1.
R1[13,5] (row i is labeled with ti, column j with pj):
 i  ti | a a b a c
 1  a  | 1 1 0 0 0
 2  a  | 1 1 1 0 0
 3  b  | 1 1 1 1 0
 4  a  | 1 1 1 1 1
 5  a  | 1 1 1 1 1
 6  c  | 1 1 1 0 1
 7  a  | 1 1 0 1 0
 8  a  | 1 1 1 0 0
 9  b  | 1 1 1 1 0
10  a  | 1 1 1 1 1
11  c  | 1 1 0 1 1
12  a  | 1 1 0 0 1
13  b  | 1 1 1 0 0
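The combination step can be sketched as follows. The code reads the three formulas literally, so the equal-character clause looks up the combined table Rk(i-1,j-1), and it checks only the combined tables R0 and R1. The boundary values are not stated on the slides and are assumptions here: Rk(i,0)=1, Rk(0,j)=1 only for j≦k, and an all-zero table below R0. The names r_tables and bit_rows are hypothetical.

```python
def r_tables(text, pattern, k):
    """Return [R0, ..., Rk]; each table has an extra boundary row 0 and column 0."""
    n, m = len(text), len(pattern)
    prev = [[0] * (m + 1) for _ in range(n + 1)]   # assumed all-zero level below R0
    tables = []
    for d in range(k + 1):
        cur = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            cur[i][0] = 1                          # assumed: empty pattern prefix matches
        for j in range(1, m + 1):
            cur[0][j] = 1 if j <= d else 0         # assumed: empty text, j <= d errors
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                same = text[i - 1] == pattern[j - 1]
                keep = same and cur[i - 1][j - 1] == 1                  # ti = pj clause
                ri = keep or (not same and prev[i - 1][j] == 1)        # insertion
                rd = keep or (not same and prev[i][j - 1] == 1)        # deletion
                rs = keep or (not same and prev[i - 1][j - 1] == 1)    # substitution
                cur[i][j] = int(ri or rd or rs)
        tables.append(cur)
        prev = cur
    return tables


def bit_rows(table):
    """Rows i = 1..n as bit strings over columns j = 1..m."""
    return ["".join(str(b) for b in row[1:]) for row in table[1:]]


r0, r1 = r_tables("aabaacaabacab", "aabac", 1)
for i, bits in enumerate(bit_rows(r1), start=1):
    print(i, bits)
```

Under these assumptions the last column of R1 marks the text positions where an occurrence of P with at most one error ends.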
22
Thank you!