String searching problems
String searching: The problem - given the following data:
a SOURCE (or TEXT) string $s = s_1 s_2 s_3 \ldots s_n$, and
a PATTERN string $p = p_1 p_2 p_3 \ldots p_m$.
Ask: does p occur as a substring of s and, if so, where? Can seek the first occurrence, or all occurrences, etc.
For example, s could be the string “abracadabra” and p the string “cad”.
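As a quick illustration of the question being asked, the example above can be checked with Java's built-in `String.indexOf`. This is only a sketch of the problem statement, not of any algorithm in these slides; the class name `ProblemExample` is made up for illustration. Note that Java indices are 0-based, whereas the definition above numbers characters from 1.

```java
class ProblemExample {
    // Returns the 0-based index of the first occurrence of p in s, or -1 if there is none.
    static int firstOccurrence(String s, String p) {
        return s.indexOf(p);
    }

    public static void main(String[] args) {
        // "cad" starts at 0-based index 4, i.e. at the 5th character of "abracadabra".
        System.out.println(firstOccurrence("abracadabra", "cad"));
    }
}
```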
Applications: Used in text editors, file operations e.g. grep, web search engines etc.
Question: The problem is a special case of more general ‘pattern matching’ problems - to what extent do string search algorithms generalise, or pattern matching algorithms specialise?
String searching: Basic algorithm
A naïve algorithm:
Look for a match starting at the beginning of the source text s.
Compare characters of pattern p with those of s until either
1. find two characters that differ – this means that no match is present with the current starting point
2. reach end of p – this means we’ve found a match
3. reach end of s – this means that no match is present.
If (1) then slide the pattern along one character and start the search again at the beginning of the pattern.
Repeat until reach end of p (success) or end of s (failure).
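The steps above can be sketched in Java as follows. This is a minimal illustration rather than the course's own code; the names `NaiveSearch` and `find` are assumptions, and indices are 0-based.

```java
class NaiveSearch {
    // Returns the 0-based index of the first occurrence of p in s, or -1 if none.
    static int find(String s, String p) {
        int n = s.length();
        int m = p.length();
        // Try every starting point at which the whole pattern still fits in the text.
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            // Compare pattern and text characters until they differ or the pattern ends.
            while (j < m && s.charAt(start + j) == p.charAt(j)) {
                j++;
            }
            if (j == m) {
                return start;   // reached end of p: match found
            }
            // Otherwise slide the pattern along one character and start again.
        }
        return -1;              // ran out of text: no match
    }
}
```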
Naive algorithm - example and complexity
Naïve string searching algorithm
If the pattern has length m and the source text has length n, in the worst case we need
$(\text{length of pattern}) \times (\text{number of possible start points}) = m \times (n - m + 1)$
character comparisons (i.e. equality tests).
For example:
    Pattern: aaaab
    Text:    aaaaaaaaaaaaab
takes $5 \times (14 - 5 + 1) = 50$ comparisons.
Since, in general, n is much larger than m, the algorithm is $O(m \times n)$.
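The worst-case count can be checked by instrumenting the naïve search to count equality tests. The sketch below is illustrative only (`NaiveCount` and `comparisons` are made-up names); it counts every character comparison made up to and including the first match.

```java
class NaiveCount {
    // Counts character comparisons made by the naive search until the
    // first match is found (or the text is exhausted).
    static int comparisons(String s, String p) {
        int n = s.length();
        int m = p.length();
        int count = 0;
        for (int start = 0; start + m <= n; start++) {
            int j = 0;
            while (j < m) {
                count++;                                  // one equality test
                if (s.charAt(start + j) != p.charAt(j)) {
                    break;                                // mismatch: try next start
                }
                j++;
            }
            if (j == m) {
                return count;                             // match found
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "aaaaa" + "aaaaa" + "aaab";         // 13 a's followed by b
        System.out.println(comparisons(text, "aaaab"));   // 10 start points, 5 tests each
    }
}
```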
Efficient exact string matching
The Knuth-Morris-Pratt algorithm (KMP)
The naïve algorithm will always find a match if one exists but often does much more work than is necessary.
             0123456
    Pattern: abcabcd
    Text:    abcabcabcd....
             0123456789
Match fails at position 6.
The naïve algorithm shifts the pattern along by one and starts checking again at the start of the pattern. It throws away knowledge of the source text gained by checking and knowing that we have matched so far. We know that the text matches the first 6 characters of the pattern, so, from the pattern alone, we know what the text is and that moving by 1 will not match.
Knuth-Morris-Pratt: Reference
————–
This is the basis for the Knuth-Morris-Pratt algorithm:
Donald E. Knuth, James H. Morris and Vaughan R. Pratt, Fast Pattern Matching in Strings, SIAM Journal on Computing, 6(2):323–350, 1977.
————–
Knuth-Morris-Pratt: Development I
In fact, shifting the pattern by 2 cannot match either, but shifting by 3 may possibly match:
                0123456
    Pattern:    abcabcd
    Text:    abcabcabcd.....
             0123456789
This works because, at the point of the match failure, the string ‘abc’
is both a prefix of the pattern (the first three characters of abcabcd)
and a proper suffix of the pattern up to the mismatch (the last three characters of abcabc).
(By ‘proper’ we mean it is not the whole substring.)
Knuth-Morris-Pratt: Development II
In general, for each k with 0 < k < pattern length, define fail(k) to be
the largest r < k such that $p_0 \ldots p_{r-1}$ matches $p_{k-r} \ldots p_{k-1}$.
If we define fail(0) to be −1, we have (for the above example):
    i         0   1   2   3   4   5   6
    p_i       a   b   c   a   b   c   d
    fail(i)  -1   0   0   0   1   2   3
If a match failure occurs at pattern character k, we know that the previous fail(k) text characters already match the first fail(k) characters of the pattern, so checking can resume at pattern character fail(k).
Note that matched characters in the text are visited only once because we restart the checking at the fail point in the text. The algorithm has complexity O(m + n). Why?
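The table above can be reproduced directly from the definition of fail(k). The sketch below is a deliberately simple brute-force version for checking small examples, not the efficient pre-processing used by KMP; the names `FailTable` and `fail` are assumptions made for illustration.

```java
class FailTable {
    // fail[k] = largest r < k such that p_0..p_{r-1} equals p_{k-r}..p_{k-1},
    // with fail[0] defined as -1.
    static int[] fail(String p) {
        int m = p.length();
        int[] fail = new int[m];
        if (m == 0) {
            return fail;
        }
        fail[0] = -1;
        for (int k = 1; k < m; k++) {
            // Try the longest candidate first; r = 0 always succeeds (empty match).
            for (int r = k - 1; r >= 0; r--) {
                if (p.regionMatches(0, p, k - r, r)) {
                    fail[k] = r;
                    break;
                }
            }
        }
        return fail;
    }

    public static void main(String[] args) {
        // For "abcabcd" this reproduces the table: -1 0 0 0 1 2 3
        System.out.println(java.util.Arrays.toString(fail("abcabcd")));
    }
}
```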
Comment
Note: In this example, a failure at say character index 5 – the second ‘c’ – means that although the first ‘ab’ aligns with the second ‘ab’, we know that this alignment too must fail as the next text character is not a ‘c’. This improvement is not usually incorporated into the algorithm.
Knuth-Morris-Pratt: Search program
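This program finds the first occurrence of the pattern in the text:
    private int[] failure;    // failure table, filled in by computeFailure()
    private int matchPoint;   // start index of the match, set when match() succeeds

    // Assumes String fields text and pattern, a non-empty pattern,
    // and that computeFailure() has already been called.
    public boolean match() {
        int j = 0;  // number of pattern characters matched so far
        if (text.length() == 0) return false;
        for (int i = 0; i < text.length(); i++) {
            // On a mismatch, fall back to the next shorter prefix that still matches.
            while (j > 0 && pattern.charAt(j) != text.charAt(i)) {
                j = failure[j - 1];
            }
            if (pattern.charAt(j) == text.charAt(i)) { j++; }
            if (j == pattern.length()) {
                matchPoint = i - pattern.length() + 1;
                return true;
            }
        }
        return false;
    }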
Knuth-Morris-Pratt: Pattern pre-processing program
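This program computes the failure function using a boot-strapping process, where the pattern is matched against itself. Here failure[i] is the length of the longest proper prefix of the pattern that is also a suffix of its first i + 1 characters, so fail(k) = failure[k − 1] for k > 0.
    private void computeFailure() {
        failure = new int[pattern.length()];  // one entry per pattern character; failure[0] stays 0
        int j = 0;  // length of the prefix currently matched against the pattern itself
        for (int i = 1; i < pattern.length(); i++) {
            while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) {
                j = failure[j - 1];
            }
            if (pattern.charAt(j) == pattern.charAt(i)) {
                j++;
            }
            failure[i] = j;
        }
    }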
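To try the search and pre-processing programs together, they can be combined into one self-contained class. The sketch below follows the same logic as the slides' code but passes the strings as parameters instead of using fields; the class name `KmpSearch`, the method name `indexOf` and the treatment of an empty pattern (match at index 0) are assumptions made for illustration.

```java
class KmpSearch {
    // failure[i] = length of the longest proper prefix of the pattern
    // that is also a suffix of pattern[0..i].
    static int[] computeFailure(String pattern) {
        int[] failure = new int[pattern.length()];
        int j = 0;
        for (int i = 1; i < pattern.length(); i++) {
            while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) {
                j = failure[j - 1];
            }
            if (pattern.charAt(j) == pattern.charAt(i)) {
                j++;
            }
            failure[i] = j;
        }
        return failure;
    }

    // Returns the 0-based start of the first occurrence of pattern in text, or -1.
    static int indexOf(String text, String pattern) {
        if (pattern.isEmpty()) {
            return 0;   // assumption: the empty pattern matches at the start
        }
        int[] failure = computeFailure(pattern);
        int j = 0;      // number of pattern characters matched so far
        for (int i = 0; i < text.length(); i++) {
            while (j > 0 && pattern.charAt(j) != text.charAt(i)) {
                j = failure[j - 1];
            }
            if (pattern.charAt(j) == text.charAt(i)) {
                j++;
            }
            if (j == pattern.length()) {
                return i - pattern.length() + 1;
            }
        }
        return -1;
    }
}
```

For the earlier example, `KmpSearch.indexOf("abcabcabcd", "abcabcd")` finds the match after a shift of 3, and the text index never moves backwards.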
Knuth-Morris-Pratt: Conclusions
KMP is fast exactly where the naïve algorithm is slow – that is, when the pattern and text contain repeated patterns of characters.
KMP is particularly good when the alphabet is small, for example bit patterns.
There is another algorithm which outperforms them both . . .
Another efficient exact string search
The Boyer-Moore Algorithm
This algorithm (R.M. Boyer and J.S. Moore 1977) uses a change of approach together with two techniques to improve the amount by which the pattern is shifted whenever a match fails.
Change of approach:
Try to match the pattern from right to left, rather than left to right.
Still move the pattern across the source text from left to right.
Boyer-Moore: Example
For example, suppose the pattern is ‘wish’ and the source text is ‘dish of fruit’:
    dish of fruit
    |
    wish
We would successfully match ‘h’, ‘s’ and ‘i’ before failing on ‘d’ and ‘w’.
At first sight this doesn’t look like a great idea!
Boyer-Moore: Development I
Idea 1.
If a match fails, move the pattern along so that, if possible, a match is made with the source text character we are looking at.
This is made clearer with an example. Consider the source text ‘here is a piece of text which we wish to search’ and the pattern ‘wish’:
    here is a piece of text which we wish to search
       |
    wish
The ‘h’ fails to match against ‘e’, and no ‘e’ appears in the pattern, so no match can contain the ‘e’.
So we can move the pattern along past the ‘e’, i.e. 4 places (the width of the pattern).
We are now at the position:
    here is a piece of text which we wish to search
           |
        wish
Again the match fails, this time against the space character. Since there is no space in the pattern, we can shift along 4 again:
    here is a piece of text which we wish to search
               |
            wish
The match fails again, but this time against an ‘i’, so move the pattern along so that the ‘i’ in the source text matches the rightmost ‘i’ in the pattern. (Why rightmost?)
We are now at position:
    here is a piece of text which we wish to search
                 |
              wish
There is no ‘c’ in the pattern, so shift 4. A couple more moves like this give us:
    here is a piece of text which we wish to search
                             |
                          wish
The ‘h’ matches, so try the next character down the pattern (‘s’). This fails, so move along to match the ‘w’ in the text against the rightmost ‘w’ of the pattern:
    here is a piece of text which we wish to search
                               |
                            wish
etc etc...
The complete search is shown below:
    here is a piece of text which we wish to search
       |   |   | |   |   |  || |   |   ||
    wish   |   | |   |   |  || |   |   ||
        wish   | |   |   |  || |   |   ||
            wish |   |   |  || |   |   ||
              wish   |   |  || |   |   ||
                  wish   |  || |   |   ||
                      wish  || |   |   ||
                          wish |   |   ||
                            wish   |   ||
                                wish   ||
                                    wish|
                                     wish
The match start occurs at the 34th character, but we have only had to make 15 comparisons. The previous methods would have made at least 37 attempts at matching.
The B-M approach means that many text characters are not looked at at all. In general, the longer the pattern, the fewer the comparisons.
Given the text character on which the match fails, we need to know how far we can shift the pattern:
If the character isn’t in the pattern, then we can shift the width of the pattern.
If the character ch is in the pattern, then we can shift so that the rightmost occurrence of ch in the pattern matches the occurrence of ch in the text.
Set up an array containing these values, for each character in the character set being used.
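A minimal sketch of a search that uses only this first idea is given below. It is not the full Boyer-Moore algorithm. The names `BadCharSketch`, `charJump` and `search` are made up for illustration, and the table covers every Java `char` value for simplicity. One assumption goes beyond the text: when the table would move the pattern backwards (the rightmost occurrence lies to the right of the mismatch), the sketch shifts the pattern by one place instead.

```java
class BadCharSketch {
    // charJump[c] = distance from the rightmost occurrence of c in p to the
    // end of p, or the pattern length if c does not occur in p.
    static int[] charJump(String p) {
        int m = p.length();
        int[] jump = new int[Character.MAX_VALUE + 1];
        java.util.Arrays.fill(jump, m);
        for (int k = 0; k < m; k++) {
            jump[p.charAt(k)] = m - 1 - k;   // later (rightmost) occurrences overwrite earlier ones
        }
        return jump;
    }

    // Returns {index of first match or -1, number of character comparisons made}.
    static int[] search(String text, String p) {
        int m = p.length();
        if (m == 0) {
            return new int[] {0, 0};         // assumption: empty pattern matches at 0
        }
        int[] jump = charJump(p);
        int comparisons = 0;
        int j = m - 1;                       // text position being examined
        int k = m - 1;                       // pattern position being examined
        while (j < text.length()) {
            comparisons++;
            if (text.charAt(j) == p.charAt(k)) {
                if (k == 0) {
                    return new int[] {j, comparisons};
                }
                j--;                         // continue right to left
                k--;
            } else {
                // Move the text position so the next scan starts at the new
                // right-hand end of the pattern; m - k guarantees forward progress.
                j += Math.max(jump[text.charAt(j)], m - k);
                k = m - 1;
            }
        }
        return new int[] {-1, comparisons};
    }
}
```

On the example text with pattern ‘wish’, this sketch finds the match at 0-based index 33 (the 34th character) after 15 comparisons.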
Boyer-Moore: Development II
Idea 2: The second technique used by B-M is essentially the fail array used in KMP, adapted to the right-to-left pattern search.
Suppose we have a pattern batsandcats:
    .......dats.......
           |
    batsandcats
Mismatch occurs on text character ‘d’. The first heuristic would slide the pattern along until the next ‘d’ in the pattern matched the text, i.e. 1 character:
    .......dats.......
           |
     batsandcats
But we know that the characters to the right of the current position are ‘ats’.
If the string ‘ats’ does not occur again in the pattern, we can slide the pattern past it; otherwise we slide the pattern along so that ‘ats’ in the text matches the next occurrence of ‘ats’ in the pattern:
    .......dats.......
           |
           batsandcats
Boyer-Moore: Definition
In general, define MatchJump[k] to be the amount to increment the text position to begin the next pattern scan after a mismatch at character k of the pattern.
If m is the length of the pattern p, then, for k < m, let r be the largest index such that
$p_r \ldots p_{r+m-k-1}$ matches $p_{k+1} \ldots p_m$
and $p_{r-1} \neq p_k$.
Define $\text{MatchJump}[k] = m - r + 1$.
If we can’t match the whole suffix $p_{k+1} \ldots p_m$, look for the largest q such that the suffix of length q matches a prefix of the pattern, then take
$\text{MatchJump}[k] = m - k + m - q$.
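The definition can be checked on small patterns with a brute-force sketch like the one below. It is written for clarity, not speed, and is not how Boyer-Moore implementations normally build the table. The names `MatchJumpSketch` and `matchJump` are made up for illustration; k is 1-based as in the definition, and the sketch assumes that when r = 1 (no preceding character exists) the condition on the preceding character counts as satisfied.

```java
class MatchJumpSketch {
    // Brute-force MatchJump[k] for 1 <= k < m, with k 1-based as in the definition.
    static int matchJump(String p, int k) {
        int m = p.length();
        int len = m - k;                     // length of the matched suffix p_{k+1}..p_m
        String suffix = p.substring(k);      // 0-based substring holding p_{k+1}..p_m

        // Case 1: largest r <= k such that p_r..p_{r+len-1} equals the suffix
        // and the character before it differs from p_k.
        for (int r = k; r >= 1; r--) {
            boolean sameText = p.regionMatches(r - 1, suffix, 0, len);
            boolean precedingDiffers = (r == 1) || p.charAt(r - 2) != p.charAt(k - 1);
            if (sameText && precedingDiffers) {
                return m - r + 1;
            }
        }

        // Case 2: largest q < len such that the last q characters of the
        // pattern equal its first q characters (q = 0 always works).
        for (int q = len - 1; q >= 0; q--) {
            if (p.startsWith(p.substring(m - q))) {
                return m - k + m - q;
            }
        }
        return m - k + m;                    // not reached for k < m; kept for the compiler
    }
}
```

For the pattern batsandcats with the mismatch at the ‘c’ (k = 8), the earlier ‘ats’ starts at r = 2, giving MatchJump[8] = 11 − 2 + 1 = 10: the text position moves 3 places back to the right-hand end of the pattern plus 7 places for the shift.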
The Boyer-Moore string searching algorithm
The Boyer-Moore algorithm uses both these approaches and moves the pattern along as much as it can: computing both shifts and taking the maximum.
This combination produces a dramatic improvement over the naïve algorithm and KMP, particularly for long patterns and text with a large alphabet.
Applications: The Boyer-Moore algorithm is the algorithm of choice for searching in some text editors (e.g. in the emacs editor).
Hashing techniques for exact and approximate string searching
A quite different approach to string searching uses hashing techniques. See accompanying paper ‘Implementation of substring test’ by M.C. Harrison (C.A.C.M. 14:12. 1971).
See also course unit website for additional material, good algorithm sites, papers and recommended books.