Two Different Approximate String Matching Problems and Their Algorithms
Filter Algorithms for Approximate String Matching
Stefan Burkhardt
Outline
Motivation
Filter Algorithms
Gapped q-grams
Experimental Analysis
Motivation
Computational Biology: EST Clustering, Assembly, Genome comparison (e.g. Human/Mouse)
Information Retrieval: Phonebooks, Dictionaries, Search Engines
Many more...
Why ?
Approximate String Matching
Edit and Hamming Distance
Problems and Motivation
The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
Find all substrings y from S with d(P, y) ≤ k
Example: P = GAT, S = ACTGATAACGTTAGCCATGG
d(x,y) = Hamming Distance:
The k-mismatches problem
d(x,y) = Edit Distance:
The k-differences problem
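The two problems can be illustrated with a minimal sketch on the example P = GAT and S = ACTGATAACGTTAGCCATGG. The function names `hamming_matches` and `edit_match_ends` are hypothetical, and the edit-distance part uses the standard column-wise dynamic program with a free start in S, which is an assumption and not necessarily what the slides had in mind.

```python
def hamming_matches(p, s, k):
    """k-mismatches: start positions i where s[i:i+len(p)] differs from p in at most k places."""
    m = len(p)
    return [i for i in range(len(s) - m + 1)
            if sum(a != b for a, b in zip(p, s[i:i + m])) <= k]


def edit_match_ends(p, s, k):
    """k-differences: end positions j (exclusive) of substrings of s within edit distance k of p."""
    m = len(p)
    col = list(range(m + 1))            # column for the empty text prefix
    ends = []
    for j, c in enumerate(s, start=1):
        prev_diag, col[0] = col[0], 0   # a match may start anywhere in s
        for i in range(1, m + 1):
            cur = min(col[i] + 1,                   # extra character in s
                      col[i - 1] + 1,               # missing character of p
                      prev_diag + (p[i - 1] != c))  # match or substitution
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j)
    return ends


S = "ACTGATAACGTTAGCCATGG"
print(hamming_matches("GAT", S, 0))   # [3]
print(hamming_matches("GAT", S, 1))   # [3, 9, 15]
print(edit_match_ends("GAT", S, 0))   # [6]
```

Every k-mismatches occurrence is also a k-differences occurrence, because the edit distance never exceeds the Hamming distance.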
How?
BLAST
The q-gram Lemma and QUASAR
Filter Algorithms
Input: P, S
Filter Algorithm (Filtration Phase, apply Filter Criterion) -> Potential Matches
Exact Algorithm (Verification Phase, examine Potential Matches) -> False Matches / True Matches
BLAST (Altschul, Karlin, et al.):
Sequential scan of S locates all matching q-grams with P
Iterative extension with cutoff to find good matches
Problem for high similarity: sequential scan quite time consuming, single q-grams unspecific
Indexed Filter Algorithm
Preprocess: S -> Index
Indexed Filter Algorithm (P, Index) -> Potential Matches
Exact Algorithm (Verification Phase, examine Potential Matches) -> False Matches / True Matches
Pro: potentially faster evaluation of the filter criterion
Con: preprocessing time, extra space required, only good for some filter criteria
QUASAR (Burkhardt, Rivals et al. 99):
Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)
Index Structure: Lookup table (Jokinen, Ukkonen 91)
with suffix array (Manber, Myers 90)
Match Detection: overlapping rectangles in DP-Matrix
|P| = 8, q = 3, total number of q-grams: |P| - q + 1 = 6
P = TCGATTAC has the q-grams TCG, CGA, GAT, ATT, TTA, TAC
Each error can "destroy" up to q matching q-grams, so k errors lose at most kq q-grams
Example: in TCGAATAC (one error) the q-grams GAT, ATT and TTA no longer match
The q-gram Lemma
(Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least
t = |P| - q + 1 - (kq)
substrings of length q (q-grams).
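The lemma can be checked on the slide example |P| = 8, q = 3, k = 1. The sketch below is illustrative only; the helper names `qgrams`, `lemma_threshold` and `shared_qgrams` are hypothetical, and shared q-grams are counted with multiplicity.

```python
from collections import Counter


def qgrams(x, q):
    """All overlapping substrings of length q of x."""
    return [x[i:i + q] for i in range(len(x) - q + 1)]


def lemma_threshold(m, q, k):
    """t = |P| - q + 1 - kq from the q-gram lemma."""
    return m - q + 1 - k * q


def shared_qgrams(p, y, q):
    """Number of q-grams common to p and y, counted with multiplicity."""
    return sum((Counter(qgrams(p, q)) & Counter(qgrams(y, q))).values())


p, y = "TCGATTAC", "TCGAATAC"            # y has one substitution
print(lemma_threshold(len(p), 3, 1))     # 3
print(shared_qgrams(p, y, 3))            # 3 (TCG, CGA, TAC)
```

Here the single error destroys exactly q = 3 q-grams, so the bound of the lemma is met with equality.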
Match Detection (Jokinen, Ukkonen 91) :
overlapping rectangles of width 2|P| in DP-Matrix
rectangle with at least t hits => potential match
[Figure: DP matrix of P against S with overlapping rectangles containing 3, 2 and 1 hits; t = 3]
QUASAR (Burkhardt, Rivals et al. 1999) :
wider rectangles efficient in practice (2048 for QUASAR)
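A rough sketch of counting hits in overlapping blocks of S follows. It is a simplification under stated assumptions: a plain dictionary stands in for the lookup table and suffix array, blocks of the given width overlap by half, and the function name `potential_blocks` and its parameters are hypothetical. It is not QUASAR's actual implementation.

```python
from collections import Counter, defaultdict


def potential_blocks(p, s, q, k, width=None):
    """Return (start, end) ranges of s holding at least t q-gram hits with p."""
    m = len(p)
    t = m - q + 1 - k * q                 # threshold from the q-gram lemma
    width = width or 2 * m                # block width, 2|P| by default
    step = width // 2                     # consecutive blocks overlap by half

    index = defaultdict(list)             # q-gram -> positions in s
    for i in range(len(s) - q + 1):
        index[s[i:i + q]].append(i)

    hits = Counter()
    for j in range(m - q + 1):
        for i in index.get(p[j:j + q], ()):
            b = i // step
            hits[b] += 1                  # block starting at b * step
            if b > 0:
                hits[b - 1] += 1          # overlapping block to the left

    return sorted((b * step, min(len(s), b * step + width))
                  for b, c in hits.items() if c >= t)


s = "A" * 20 + "TCGATTAC" + "G" * 20
print(potential_blocks("TCGATTAC", s, q=3, k=1))   # [(8, 24), (16, 32)]
```

Each reported range would then be handed to the verification phase; wider blocks mean fewer counters to maintain but more text to verify per potential match.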
QUASAR (Burkhardt, Rivals et al. 1999) :
BLAST for the verification of the potential matches
wider Rectangles as Match Regions
Index is a combination of Lookup Table and Suffix Array
used for EST-Clustering at the DKFZ in Heidelberg
searches for EST-Clustering about 30 times faster than BLAST
Gapped q-grams
A new (old?) idea
Hamming Distance
Finding good shapes
General idea:
use gapped q-grams
call the arrangement of gaps the shape
gapped 3-shape: ##.# (# = match, . = don't care)
TCGATTAC yields the gapped q-grams TC.A, CG.T, GA.T, AT.A, TT.C
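Extracting gapped q-grams for a shape can be sketched in a few lines. The function name `gapped_qgrams` is hypothetical; the shape is written as a string of `#` (match) and `.` (don't care) as on the slide.

```python
def gapped_qgrams(text, shape):
    """Gapped q-grams of text for a shape such as '##.#'; don't-care positions become '.'."""
    span = len(shape)
    return ["".join(c if mark == "#" else "."
                    for c, mark in zip(text[i:i + span], shape))
            for i in range(len(text) - span + 1)]


print(gapped_qgrams("TCGATTAC", "##.#"))
# ['TC.A', 'CG.T', 'GA.T', 'AT.A', 'TT.C']
```

A contiguous shape such as `###` gives the ordinary q-grams, so the same function covers both cases.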
Previous work...
Califano, Rigoutsos (1993); Pevzner, Waterman (1995); Lehtinen, Sutinen, Tarhio (1996)
limited attention paid to choice of shapes
no exact threshold for the general case given
Recently...
Buhler (2001): Multiple Shapes
Ma, Tromp, Li (2002): Pattern Hunter
threshold t = 1
The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors
(O = matching position, X = error)
classic 3-shape ###, k = 3: t = 0, no filter!
OOXOOXOOXOO: OOX OXO XOO OOX OXO XOO OOX OXO XOO
gapped 3-shape ##.#, k = 3: t = 1
OOOXXOOXOOO: OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
gapped shapes can have higher(!) thresholds t than ungapped shapes
no simple formula for t; we used a DP-based approach to compute t
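For small patterns the threshold can also be found by exhaustive search over all placements of k errors. The sketch below uses this brute force in place of the DP-based approach mentioned above, which the slides do not spell out; the function name `threshold` is hypothetical and the search is only practical for small |P| and k.

```python
from itertools import combinations


def threshold(shape, m, k):
    """Minimum number of intact q-grams of `shape` over all placements of k errors in m positions."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    starts = range(m - len(shape) + 1)
    best = len(starts)
    for errors in combinations(range(m), k):
        bad = set(errors)
        # a q-gram survives if none of its match positions is an error
        surviving = sum(all(s + o not in bad for o in offsets) for s in starts)
        best = min(best, surviving)
    return best


print(threshold("###", 11, 3))    # 0
print(threshold("##.#", 11, 3))   # 1
print(threshold("###", 13, 3))    # 2
print(threshold("##.#", 13, 3))   # 2
```

The first two values reproduce the slide example for a pattern of length 11 and k = 3, where the gapped shape keeps one q-gram and the contiguous shape keeps none.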
Finding good shapes
[Figure: number of potential matches (verification time) against number of q-gram hits (filtration time), with q running from low to high; the number of q-gram hits is labeled |S| · 1/|Σ|^q; a trade-off line, marked "?", separates good filters from bad filters.]
Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: two overlapping hits of ##.# cover at least 5 characters, two overlapping hits of ### only 4. A random match requires 5 matching characters instead of only 4 for the ungapped q-gram. This makes random matches less likely.
We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
CGACGATTGAT
   ##.#
    ##.#
   -----
ACTCGATTAGA
For t = 2 and the shape ##.# the minimum coverage is 5.
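The minimum coverage can be sketched as a search over placements of t distinct shape occurrences, taking the smallest number of covered positions. The function name `minimum_coverage` is hypothetical, and limiting the offsets to a window of t times the shape length is an assumption made to keep the search finite.

```python
from itertools import combinations


def minimum_coverage(shape, t):
    """Smallest number of positions covered by t distinct occurrences of `shape`."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)
    best = None
    # fix the first occurrence at 0 and try all later start positions in the window
    for rest in combinations(range(1, t * span), t - 1):
        covered = {s + o for s in (0,) + rest for o in offsets}
        if best is None or len(covered) < best:
            best = len(covered)
    return best


print(minimum_coverage("##.#", 2))   # 5
print(minimum_coverage("###", 2))    # 4
```

For t = 1 the result is simply the number of match positions in the shape, i.e. q.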
Finding good shapes
[Figure: number of potential matches against number of q-gram hits, with q and minimum coverage c_m each running from low to high; axis annotations |S| · 1/|Σ|^q and |S| · 1/|Σ|^(c_m); a trade-off line separates good filters from bad filters.]
compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6
[Figure: number of shapes (0 to 600) with a given minimum coverage (8 to 22) for k = 5, q = 8; series for t = 1 to t = 5; markers: median, contiguous, best.]
Experimental Analysis
A few different Filters
Speed and Filtration Efficiency
The Heuristic Zone
[Figure: minimum coverage (8 to 24) against q (6 to 12) for gapped (Hamming) and contiguous shapes; matches (2^-8 to 2^16) against hits (2^12 to 2^22); k = 5, |P| = 50, |S| = 50 Mbps.]
From Hits to Matches
Describing Filter Properties
Filters usually have 3 "recognition zones" depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
[Figure: recognition rate (0% to 100%) against number of errors (0 to |P|), with k and |P| - c_m marked on the error axis.]
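One way to read the two marks on the error axis is sketched below. It rests on an assumption that the slides do not state explicitly: under Hamming distance, up to k errors fall in the guarantee zone, and more than |P| - c_m errors leave fewer than c_m matching characters, so t hits are impossible. The function name `recognition_zone` is hypothetical.

```python
def recognition_zone(errors, k, m, c_m):
    """Classify an error count for a filter with guarantee k, pattern length m and minimum coverage c_m."""
    if errors <= k:
        return "guarantee"    # all approximate matches are found
    if errors > m - c_m:
        return "negative"     # fewer than c_m matching characters remain
    return "heuristic"        # some matches are found, some are missed


print(recognition_zone(3, k=5, m=50, c_m=12))    # guarantee
print(recognition_zone(20, k=5, m=50, c_m=12))   # heuristic
print(recognition_zone(40, k=5, m=50, c_m=12))   # negative
```

The values k = 5, m = 50 and c_m = 12 in the usage lines are illustrative only and not taken from a specific filter in the slides.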
Heuristic Zone
Problem: Behaviour in the Heuristic Zone is hard to predict
A simple idea: Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the Heuristic Zone
|P| = 50, 1000 samples for each error level
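The sampling steps above can be sketched for the simplest case: substitution errors only (Hamming distance) and a contiguous q-gram filter with the threshold from the q-gram lemma. The function name `recognition_rate`, the DNA alphabet and the fixed seed are assumptions for illustration, not the setup used for the experiments.

```python
import random


def recognition_rate(p, errors, q, k, samples=1000, alphabet="ACGT", seed=0):
    """Percentage of samples with `errors` random substitutions that still reach t q-gram hits."""
    rng = random.Random(seed)
    m = len(p)
    t = m - q + 1 - k * q                      # filter threshold for k errors
    recognized = 0
    for _ in range(samples):
        y = list(p)
        for pos in rng.sample(range(m), errors):
            # substitute a different character at each chosen position
            y[pos] = rng.choice([c for c in alphabet if c != p[pos]])
        y = "".join(y)
        hits = sum(p[i:i + q] == y[i:i + q] for i in range(m - q + 1))
        recognized += hits >= t
    return 100.0 * recognized / samples


rng = random.Random(1)
p = "".join(rng.choice("ACGT") for _ in range(50))
for i in (0, 3, 6, 10, 50):
    print(i, recognition_rate(p, i, q=11, k=3))
```

With up to k errors the rate is 100% by the q-gram lemma, and with every position substituted it is 0%; the values in between depend on the random samples and describe the Heuristic Zone.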
[Figure: recognition rate (0% to 100%) against number of errors (0 to 30), |P| = 50, 1000 samples for each error level. Legend: contiguous (k=3, q=11; k=4, q=9); gapped, edit (k=5, q=10; k=4, q=11; k=3, q=11); BLAST; a further pair of curves labeled k=3, q=11 and k=4, q=10. A final panel zooms in on recognition rates from 50% to 100% and 0 to 15 errors.]
Conclusion - Future Work
Our Work:
Significant sensitivity improvement over existing filters
Required modifications easy to implement
Methods for describing filter properties
Future Work:
Combination of "orthogonal" shapes into one filter
Use of word neighborhoods
Database of filter properties for good shapes