String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
String Matching with k Mismatches
Landau – Vishkin 1986
Galil – Giancarlo 1986
Abrahamson 1987
Amir - Lewenstein - Porat 2000
Exact String Matching
Input: T = t1 . . . tn
P = p1 … pm
Output: All locations i of T where P appears
Example:
P = A B C A A B
T = A B A B C A A B C A A B C A A B A A…
Answer: {3, 7, 11, …}
Exact String Matching
Problem: Matching not exact in applications of:
• Computational Biology
• Musicology
• Text Editing
• Meteorology
• etc.
Need other definitions of string matching!
Approximate String Matching
Idea: Find all text locations where distance from pattern is sufficiently small.
Distance metric: Hamming distance
Let S = s1s2…sm
R = r1r2…rm
Ham(S,R) = the number of locations j where sj ≠ rj
Example: S = ABCABC, R = ABBAAC
Ham(S,R) = 2
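The definition translates directly into code. The sketch below is only an illustration; the function name `hamming` is an assumption of this example, not something from the slides.

```python
def hamming(s, r):
    """Number of positions j where s[j] != r[j]; both strings must have equal length."""
    if len(s) != len(r):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for a, b in zip(s, r) if a != b)


print(hamming("ABCABC", "ABBAAC"))  # 2 (positions 3 and 5 differ)
```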
String Matching with Mismatches
Input: T = t1 . . . tn
P = p1 … pm
Output: For each location i in T, Ham(P, Ti), where Ti = ti ti+1 … ti+m-1
Example:
P = A B B A A C
T = A B C A A B C A C…
Ham(P,T1) = 2
Ham(P,T2) = 4
Ham(P,T3) = 6
Ham(P,T4) = 2
Answer: 2, 4, 6, 2, …
String Matching with k Mismatches
Input: T = t1 . . . tn, P = p1 … pm
Output: Every i in T s.t. Ham(P, ti ti+1 … ti+m-1) ≤ k
Example: k = 2
P = A B B A A C
T = A B C A A B C A C…
Hamming distances: 2, 4, 6, 2, …
Answer: Y, N, N, Y, …
Naïve Algorithm (for counting mismatches or k-mismatches problem)
Running Time: O(nm), where n = |T|, m = |P|
- Go to each location i of the text and compute the Hamming distance of P and Ti = ti ti+1 … ti+m-1
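A minimal sketch of the naïve algorithm for both problem variants. The function names are assumptions of this example, and the text is truncated to the nine characters shown in the slides (the original T continues past the "…").

```python
def mismatch_counts(text, pattern):
    """Hamming distance between the pattern and every length-m window of the text."""
    n, m = len(text), len(pattern)
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def k_mismatch_naive(text, pattern, k):
    """1-based locations i with Ham(P, T_i) <= k; a window is abandoned after k+1 mismatches."""
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(i + 1)
    return hits


T = "ABCAABCAC"   # only the prefix of T shown on the slides
P = "ABBAAC"
print(mismatch_counts(T, P))      # [2, 4, 6, 2]
print(k_mismatch_naive(T, P, 2))  # [1, 4]
```

Both loops are O(nm) in the worst case; the early exit in `k_mismatch_naive` helps in practice but does not change the bound.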
The Kangaroo Method (for k-mismatches)
Landau – Vishkin 1986
Galil – Giancarlo 1986
Trie
• A tree representing a set of strings.
[Figure: trie for the set below, one character per edge]
{ aeef ad bbfe bbfg c }
Trie (Cont)
• Assume no string is a prefix of another
Each string corresponds to a leaf.
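A small sketch of a trie for the example set, using nested dictionaries. The representation and the names `build_trie` and `leaves` are assumptions of this example. Because no string in the set is a prefix of another, the root-to-leaf paths spell exactly the inserted strings.

```python
def build_trie(words):
    """Nested-dict trie: each key is one character, each value is the child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def leaves(node, prefix=""):
    """Yield the string spelled by every root-to-leaf path, in sorted order."""
    if not node:
        yield prefix
        return
    for ch in sorted(node):
        yield from leaves(node[ch], prefix + ch)


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(sorted(trie))        # ['a', 'b', 'c']
print(list(leaves(trie)))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```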
Compressed Trie
• Compress unary nodes, label edges by strings
[Figure: the trie next to its compressed form, with edge labels a, bbf, c, eef, d, e, g]
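The compression step can be sketched on a nested-dict trie. The representation and the names `build_trie` and `compress` are assumptions of this example, which also relies on the set being prefix-free, so that a single-child node never marks the end of a string.

```python
def build_trie(words):
    """Nested-dict trie with one character per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node):
    """Return a copy in which every chain of single-child nodes becomes one string-labeled edge."""
    out = {}
    for ch, child in node.items():
        label = ch
        while len(child) == 1:          # unary node: absorb it into the edge label
            (next_ch, grandchild), = child.items()
            label += next_ch
            child = grandchild
        out[label] = compress(child)
    return out


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(compress(trie))
# {'a': {'eef': {}, 'd': {}}, 'bbf': {'e': {}, 'g': {}}, 'c': {}}
```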
Suffix tree
Suffix tree of string s: a compressed trie of all suffixes of s
Prefix-free: add a special character, say $, at the end of s
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed trie of all suffixes of s = abab$:
{ $ b$ ab$ bab$ abab$ }
[Figure: suffix tree of abab$, with edge labels ab, b, ab$, $]
Suffix Tree properties
- Succinct in space: O(n).
- Can be built in O(n) time (McCreight, Weiner, Ukkonen, Farach-Colton).
[Figure: suffix tree with leaves numbered 1 to 5]
Exact string matching
s = abab$
[Figure: suffix tree of abab$ with leaves numbered 1 to 5]
Given a pattern P = ab we traverse the tree according to the pattern.
The leaves below the point where the traversal ends correspond to the locations of appearance (here: 1 and 3).
Prepare tree: O(n) time
Find matches: O(m + occ) time, where occ = # of matches
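The traversal can be illustrated with an uncompressed suffix trie. This is a deliberate simplification: it takes O(n²) time and space to build, not the O(n) of a real suffix tree, and the names `build_suffix_trie` and `occurrences` and the `"leaf"` marker key are assumptions of this example.

```python
def build_suffix_trie(s):
    """Uncompressed trie of all suffixes of s + '$'.

    Each suffix ends in its own leaf, which stores the 1-based start
    position under the (multi-character, hence collision-free) key 'leaf'.
    """
    s += "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def occurrences(root, pattern):
    """Walk down along the pattern, then collect every leaf below that point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


tree = build_suffix_trie("abab")
print(occurrences(tree, "ab"))  # [1, 3]
```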
Lowest common ancestors
A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it
Why?
The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes
s = abbaab$
[Figure: suffix tree of abbaab$ with leaves numbered 1 to 7]
Example: the suffixes aab$ (leaf 4) and abbaab$ (leaf 1) have LCP a.
LCA/LCP properties
Preprocessing time: O(n)
Query time: O(1)
Harel & Tarjan 1984, Schieber & Vishkin 1988, Berkman & Vishkin 1993
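To make the "preprocess once, then answer each LCP query in O(1)" interface concrete, here is a sketch that replaces the suffix tree and LCA machinery with a plain dynamic-programming table. This stand-in costs O(n²) time and space to build, so it does not achieve the O(n) preprocessing above; the name `lcp_table` is an assumption of this example.

```python
def lcp_table(s):
    """lcp[i][j] = length of the longest common prefix of s[i:] and s[j:] (0-based)."""
    n = len(s)
    lcp = [[0] * (n + 1) for _ in range(n + 1)]
    # Fill from the end: if the first characters agree, extend the LCP of the next suffixes.
    for i in range(n - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if s[i] == s[j]:
                lcp[i][j] = lcp[i + 1][j + 1] + 1
    return lcp


s = "abbaab$"
lcp = lcp_table(s)
print(lcp[3][0])  # suffixes 4 and 1 (1-based), aab$ vs abbaab$: 1
print(lcp[0][4])  # suffixes 1 and 5 (1-based), abbaab$ vs ab$: 2
```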
The Kangaroo Method (for k-mismatches)
- Create suffix tree for: s = P#T
- Check P at each location i of T by kangarooing
- Do up to k+1 LCP queries for every text location
Example:
P = A B A B A A B A C A B
T = A B B A C A B A B A B C A B B C A B C A …
The Kangaroo Method (for k-mismatches)
Preprocess:
Build suffix tree of both P and T - O(n+m) time
LCA preprocessing - O(n+m) time
Check P at given text location:
Kangaroo jump till next mismatch, at most k+1 jumps of O(1) each - O(k) time
Overall time: O(nk)
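A sketch of the kangaroo jumps themselves. The suffix tree of P#T with LCA preprocessing is replaced here by a simple O(nm) dynamic-programming LCP table, so the sketch shows the jumping logic but not the O(nk) total running time. The function name and the table layout are assumptions of this example.

```python
def kangaroo_k_mismatch(text, pattern, k):
    """1-based locations i where the pattern matches the text with at most k mismatches."""
    n, m = len(text), len(pattern)
    # Stand-in LCP oracle: lcp[j][i] = longest common prefix of pattern[j:] and text[i:].
    # A suffix tree of P#T with LCA preprocessing would answer the same queries in O(1)
    # after linear-time preprocessing.
    lcp = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(m - 1, -1, -1):
        for i in range(n - 1, -1, -1):
            if pattern[j] == text[i]:
                lcp[j][i] = lcp[j + 1][i + 1] + 1

    hits = []
    for i in range(n - m + 1):
        j = 0                       # current offset inside the pattern
        mismatches = 0
        while True:
            j += lcp[j][i + j]      # kangaroo jump over the whole matching run
            if j >= m:
                break               # reached the end of the pattern
            mismatches += 1         # pattern[j] != text[i + j]
            if mismatches > k:
                break               # too many mismatches at this location
            j += 1                  # step over the mismatching character
        if mismatches <= k:
            hits.append(i + 1)
    return hits


print(kangaroo_k_mismatch("ABCAABCAC", "ABBAAC", 2))  # [1, 4]
```

Each text location costs at most k+1 table lookups, which is where the O(k) per-location bound comes from.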