Data Structures and Algorithms - Vilniaus universitetas (algis/dsax/Data-structures-6.pdf)
Data Structures and
Algorithms
String Matching
www.mif.vu.lt/~algis
String Matching
• Basic idea:
  • Given a pattern string P, of length M
  • Given a text string, A, of length N
  • Do all characters in P match a substring of the characters in A, starting from some index i?
• Brute Force (Naïve) Algorithm:

int brutesearch(char *p, char *a) {
    int i, j, M = strlen(p), N = strlen(a);
    for (i = 0, j = 0; j < M && i < N; i++, j++)
        if (a[i] != p[j]) { i -= j; j = -1; }
    if (j == M) return i - M; else return i;
}
String Matching
String Matching Algorithms
String Searching
The context of the problem is to find out whether one string (called the "pattern") is contained in another string. This problem corresponds to a part of a more general one, called "pattern recognition". The strings considered are sequences of symbols, and symbols are defined by an alphabet. The size and other features of the alphabet are important factors in the design of string-processing algorithms.
Brute-force algorithm: The obvious method for pattern matching is just to check, for each possible position in the text at which the pattern could match, whether it does in fact match. The following program searches in this way for the first occurrence of a pattern string p in a text string a:
or
function brutesearch: integer;
var i, j: integer;
begin
  i := 1; j := 1;
  repeat
    if a[i] = p[j] then
      begin i := i+1; j := j+1 end
    else
      begin i := i - j + 2; j := 1 end;
  until (j > M) or (i > N);
  if j > M then brutesearch := i - M else brutesearch := i
end;
Property: brute-force string searching can require about M N character comparisons.
Knuth-Morris-Pratt Algorithm
The basic idea behind the algorithm is this:
String Matching
• Performance of the Naïve algorithm?
• Normal case?
  • Perhaps a few character matches occur prior to each mismatch
  • O(N + M) = O(N) when N >> M
• Worst-case situation and run-time?

A = XXXXXXXXXXXXXXXXXXXXXXXXXXY
P = XXXXY

• P must be completely compared each time we move one index down A
• M (N – M + 1) = O(NM) comparisons when N >> M
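To see the M (N – M + 1) behaviour concretely, here is the brute-force routine instrumented to count character comparisons; on the all-X text above, every alignment compares the full pattern. This is a sketch for illustration only; the name `brutesearch_count` and the counter parameter are our additions, not part of the slides.

```c
#include <string.h>

/* Brute-force search instrumented to count character comparisons.
   Returns the index of the first match (or N if there is none) and
   stores the number of a[i]/p[j] comparisons in *cmps. */
int brutesearch_count(const char *p, const char *a, long *cmps) {
    int i, j, M = strlen(p), N = strlen(a);
    *cmps = 0;
    for (i = 0, j = 0; j < M && i < N; i++, j++) {
        (*cmps)++;                     /* one comparison per loop step */
        if (a[i] != p[j]) { i -= j; j = -1; }
    }
    return (j == M) ? i - M : i;
}
```

On a text of 26 X's followed by one Y, searching for "XXXXY" performs exactly M (N – M + 1) = 5 · 23 = 115 comparisons before matching at the last possible position.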
String Matching
• Improvements?
• Two ideas
• Improve the worst case performance
• Good theoretically, but in reality the worst case does not occur very often for ASCII strings
• Perhaps for binary strings it may be more important
• Improve the normal case performance
• This will be very helpful, especially for searches in long files
KMP algorithm
• KMP (Knuth-Morris-Pratt)
• Improves the worst case, but not the normal case
• Idea is to prevent the index from ever going "backward" in the text string
• This will guarantee O(N) run-time in the worst case
• How is it done?
• Pattern is preprocessed to look for "sub" patterns
• As a result of the preprocessing that is done, we can create a "next" array that is used to determine the next character in the pattern to examine
KMP algorithm
• The KMP algorithm is one of the most popular pattern-matching algorithms.
• KMP stands for Knuth-Morris-Pratt (Donald Knuth, Vaughan Pratt, James H. Morris); it was published in 1977.
• The KMP algorithm was the first linear-time-complexity algorithm for string matching.
• The algorithm compares character by character from left to right.
• But whenever a mismatch occurs, it uses a preprocessed table called the "Prefix Table" to skip character comparisons while matching (the LPS table - "Longest Proper Prefix which is also a Suffix").
KMP algorithm
Steps for creating the LPS table (Prefix Table):
• Step 1 - Define a one-dimensional array with the size equal to the length of the pattern (LPS[size]).
• Step 2 - Define variables i & j. Set i = 0, j = 1 and LPS[0] = 0.
• Step 3 - Compare the characters at Pattern[i] and Pattern[j].
• Step 4 - If they match, then set LPS[j] = i+1 and increment both i & j by one. Go to Step 3.
• Step 5 - If they do not match, then check the value of variable 'i'. If it is '0', then set LPS[j] = 0 and increment 'j' by one; if it is not '0', then set i = LPS[i-1]. Go to Step 3.
• Step 6 - Repeat the above steps until all the values of LPS[] are filled.
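The steps above can be sketched directly in C. This is a hedged sketch, not the course's reference code; the function name `build_lps` is our own, and the caller is assumed to supply an `lps` array as large as the pattern.

```c
#include <string.h>

/* Build the LPS (prefix) table for pat, following Steps 1-6 above.
   lps must have room for strlen(pat) entries. */
void build_lps(const char *pat, int *lps) {
    int M = strlen(pat);
    int i = 0, j = 1;                  /* Step 2 */
    lps[0] = 0;
    while (j < M) {                    /* Step 6: until LPS[] is filled */
        if (pat[i] == pat[j]) {        /* Step 4: characters match */
            lps[j] = i + 1;
            i++; j++;
        } else if (i == 0) {           /* Step 5, i == 0 */
            lps[j] = 0;
            j++;
        } else {                       /* Step 5, i != 0: fall back */
            i = lps[i - 1];
        }
    }
}
```

For the pattern "ABABCABAB", for instance, this produces the table {0, 0, 1, 2, 0, 1, 2, 3, 4}.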
How to use the LPS Table
We use the LPS table to decide how many characters can be skipped in comparison when a mismatch has occurred.
When a mismatch occurs, check the LPS value of the character preceding the mismatched character in the pattern. If it is '0', then start comparing the first character of the pattern with the character following the mismatched character in the text. If it is not '0', then start comparing the pattern character whose index equals that LPS value against the mismatched character in the text.

How the KMP Algorithm Works
Let us see a working example of the KMP algorithm to find a pattern in a text...
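The full search described above can be sketched as follows. This is a hedged sketch, not the course's reference implementation: the function name `kmp_search`, the inlined table-building loop, and the MAXPAT limit are our assumptions.

```c
#include <string.h>

#define MAXPAT 256   /* assumed limit on pattern length for this sketch */

/* KMP search: returns the index of the first occurrence of pat in txt,
   or -1 if there is none. The text index i never moves backward. */
int kmp_search(const char *txt, const char *pat) {
    int N = strlen(txt), M = strlen(pat);
    int lps[MAXPAT];
    int a = 0, b = 1, i = 0, j = 0;
    if (M == 0 || M > MAXPAT) return -1;
    lps[0] = 0;                          /* build the LPS table */
    while (b < M) {
        if (pat[a] == pat[b]) { a++; lps[b] = a; b++; }
        else if (a == 0)      { lps[b] = 0; b++; }
        else                  { a = lps[a - 1]; }
    }
    while (i < N) {                      /* scan the text left to right */
        if (txt[i] == pat[j]) {
            i++; j++;
            if (j == M) return i - M;    /* full match found */
        } else if (j == 0) {
            i++;                         /* no prefix to fall back on */
        } else {
            j = lps[j - 1];              /* skip comparisons via LPS */
        }
    }
    return -1;
}
```

Note that on a mismatch with j > 0 only the pattern index moves, which is exactly what guarantees the linear worst case.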
Boyer Moore algorithm
• What if we took yet another approach?
• Look at the pattern from right to left instead of left to right
• Now, if we mismatch a character early, we have the potential to skip many characters with only one comparison
• Consider the following example:

A = ABCDVABCDWABCDXABCDYABCDZ
P = ABCDE

• If we first compare E and V, we learn two things:
  1) V does not match E
  2) V does not appear anywhere in the pattern
• How does that help us?
Boyer Moore algorithm
A right-to-left version of the next table for the pattern 10110101 is shown in the figure below: in this case next[j] is the number of character positions by which the pattern can be shifted to the right, given that a mismatch in a right-to-left scan occurred on the jth character from the right in the pattern.
This is found as before, by sliding a copy of the pattern over the last j-1 characters of itself from left to right, starting with the next-to-last character of the copy lined up with the last character of the pattern, and stopping when all overlapping characters match.
This leads directly to a program which is quite similar to the above implementation of the Knuth-Morris-Pratt method. We won't explore this in more detail, because there is a quite different way to skip over characters when scanning right to left. By using "backup" more often, we can sometimes expect better results from scanning the pattern right to left. For the pattern above, if we scan from the right, we can expect these restart positions:
Restart positions for Boyer-Moore search
Boyer-Moore string search using the mismatched-character heuristic:

function mischarsearch: integer;
var i, j: integer;
begin
  i := M; j := M;
  initskip;  { fills the skip table }
  repeat
    if a[i] = p[j] then
      begin i := i-1; j := j-1 end
    else
      begin
        if M-j+1 > skip[index(a[i])] then i := i + M - j + 1
        else i := i + skip[index(a[i])];
        j := M
      end;
  until (j < 1) or (i > N);
  mischarsearch := i+1
end;
Boyer Moore algorithm
• We can now skip the pattern over M positions, after only one comparison
• Continuing in the same fashion gives us a very good search time
• Show on board
• Assuming our search progresses as shown, how many comparisons are required? About N/M
• Will our search progress as shown?
• Not always, but when searching text with a relatively large alphabet, we often encounter characters that do not appear in the pattern
• This algorithm allows us to take advantage of this fact
Boyer Moore algorithm
• Details
• The technique we just saw is the mismatched character (MC) heuristic
• It is one of two heuristics that make up the Boyer Moore algorithm
• The second heuristic is similar to that of KMP, but processing from right to left
• Does MC always work so nicely?
• No – it depends on the text and pattern
• Since we are processing right to left, there are some characters in the text that we don't even look at
•We need to make sure we don't "miss" a potential match
Boyer Moore algorithm
• Consider the following:

A = XYXYXXYXYYXYXYZXYXYXXYXYYXYXYX
P = XYXYZ

• Discuss on board
• Now the mismatched character DOES appear in the pattern
• When "sliding" the pattern to the right, we must make sure not to go farther than where the mismatched character in A is first seen (from the right) in P
• In the first comparison above, X does not match Z, but it does match an X two positions down (from the right)
• We must be sure not to slide the pattern any further than this amount
Boyer Moore
• How do we do it?
• Preprocess the pattern to create a skip array
• Array indexed on ALL characters in the alphabet
• Each value indicates how many positions we can skip given a mismatch on that character in the text:

for all i: skip[i] = M
for (int j = 0; j < M; j++)
    skip[index(p[j])] = M - j - 1;

• Idea is that initially all chars in the alphabet can give the maximum skip
• Skip lessens as characters are found further to the right in the pattern
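The skip-table idea above can be sketched as a standalone mismatched-character search. This is a hedged sketch of the MC heuristic on its own (without the second Boyer-Moore heuristic); the function name `mc_search` and the 256-entry byte alphabet are our assumptions.

```c
#include <string.h>

#define ALPHA 256   /* assume 8-bit characters index the skip table */

/* Mismatched-character (MC) heuristic alone: build the skip table as
   in the loop above, then scan the pattern right to left over the text.
   Returns the index of the first match, or -1. */
int mc_search(const char *a, const char *p) {
    int N = strlen(a), M = strlen(p);
    int skip[ALPHA], i, j, k, s;
    if (M == 0 || M > N) return -1;
    for (i = 0; i < ALPHA; i++) skip[i] = M;        /* max skip by default */
    for (j = 0; j < M; j++)
        skip[(unsigned char)p[j]] = M - j - 1;      /* rightmost occurrence wins */
    i = M - 1;                       /* text position under p[M-1] */
    while (i < N) {
        k = i; j = M - 1;
        while (j >= 0 && a[k] == p[j]) { k--; j--; }   /* right-to-left scan */
        if (j < 0) return k + 1;                       /* full match */
        s = skip[(unsigned char)a[i]];
        i += (s > 0) ? s : 1;        /* always advance at least one position */
    }
    return -1;
}
```

On the example text above (no E anywhere), each alignment fails after one comparison and the window jumps M positions, giving roughly N/M comparisons in total.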
Boyer Moore algorithm
• Can MC ever be poor?
• Yes
• Discuss how and look at example
• By itself the runtime could be Theta(NM) – same as worst case for brute force algorithm
• This is why the BM algorithm has two heuristics
• The second heuristic guarantees that the run-time will never be worse than linear
• Look at comparison table
• Discuss
Rabin Karp algorithm
• Let's take a different approach:
• We just discussed hashing as a way of efficiently accessing data
• Can we also use it for string matching?
• Consider the hash function we discussed for strings:

s[0]*B^(n-1) + s[1]*B^(n-2) + … + s[n-2]*B^1 + s[n-1]

• where B is some integer (31 in the JDK)
• Recall that we said that if B == the number of characters in the character set, the result would be unique for all strings
• Thus, if the integer values match, so do the strings
Rabin Karp algorithm
• Ex: if B = 32
  • h("CAT") = 67*32^2 + 65*32^1 + 84 = 70772
• To search for "CAT" we can thus "hash" all 3-char substrings of our text and test the values for equality
• Let's modify this somewhat to make it more useful / appropriate
1) We need to keep the integer values of some reasonable size
– Ex: No larger than an int or long value
2) We need to be able to incrementally update a value so that we can progress down a text string looking for a match
Rabin Karp algorithm
• Both of these are taken care of in the Rabin-Karp algorithm
1) The hash values are calculated "mod" a large integer, to guarantee that we won't get overflow
2) Due to properties of modulo arithmetic, characters can be "removed" from the beginning of a string almost as easily as they can be "added" to the end
• Idea is that with each mismatch we "remove" the leftmost character from the hash value and add the next character from the text to the hash value
• Show on board
• Let's look at the code
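A hedged C sketch of the idea, including the collision check discussed on the next slide, might look as follows; the function name `rk_search`, the base B = 256, and the particular prime modulus Q are assumptions of this sketch, not values fixed by the slides.

```c
#include <string.h>

#define B 256            /* base: size of the character set (assumed) */
#define Q 33554393LL     /* a large prime modulus (an assumed choice) */

/* Rabin-Karp: hash the pattern and a rolling window of the text; on a
   hash match, verify character by character to rule out collisions.
   Returns the index of the first verified match, or -1. */
int rk_search(const char *a, const char *p) {
    int N = strlen(a), M = strlen(p), i;
    long long hp = 0, ha = 0, bM = 1;   /* bM = B^(M-1) mod Q */
    if (M == 0 || M > N) return -1;
    for (i = 0; i < M - 1; i++) bM = (bM * B) % Q;
    for (i = 0; i < M; i++) {           /* initial hashes */
        hp = (hp * B + (unsigned char)p[i]) % Q;
        ha = (ha * B + (unsigned char)a[i]) % Q;
    }
    for (i = 0; ; i++) {                /* ha covers a[i..i+M-1] */
        if (hp == ha && strncmp(a + i, p, M) == 0)
            return i;                   /* verified match */
        if (i + M >= N) break;
        /* remove a[i] from the left, append a[i+M] on the right */
        ha = (ha + Q - (unsigned char)a[i] * bM % Q) % Q;
        ha = (ha * B + (unsigned char)a[i + M]) % Q;
    }
    return -1;
}
```

Each text character enters the hash once and leaves it once, which is where the average-case O(N) bound comes from.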
Rabin Karp algorithm
• The algorithm as presented in the text is not quite correct – what is missing?
• It does not handle collisions
• It assumes that if the hash values match, the strings match – this may not be the case
• Although with such a large "table size" a collision is not likely, it is possible
• How do we fix this?
• If the hash values match, we then compare the character values
  • If they match, we have found the pattern
  • If they do not match, we have a collision and we must continue the search
Rabin Karp algorithm
• Runtime?
• Assuming no or few collisions, we must look at each character in the text at most two times
  • Once to add it to the hash and once to remove it
• As long as our arithmetic can be done in constant time (which it can, as long as we are using fixed-length integers), our overall runtime should be O(N) in the average case
• Note: In the worst case, the run-time is O(MN), just like the naïve algorithm
  • However, this case is highly unlikely
  • Why? Discuss
• However, we still haven't really improved on the "normal case" runtime
Boyer-Moore-Horspool's algorithm
• It is possible in some cases to search text of length n in fewer than n comparisons!
• Horspool's algorithm is a relatively simple technique that achieves this distinction for many (but not all) input patterns.
• The idea is to perform the comparison from right to left instead of left to right.
• When characters do not match, the search jumps ahead in the text by the value indicated in the Bad Match Table.
• The Bad Match Table indicates how far the search should move from the current position to the next.
Horspool's algorithm
As an example, we will find "abcd" in the string "eovadabcdftoy".
The first step is to calculate the value of each letter of the substring to create the Bad Match Table, using the formula:
• Value = length of substring – index of the letter in the substring – 1
Note that the value of the last letter, and of any letter that is not in the substring, will be the length of the substring.
Finally, the value should be assigned to each letter in the Bad Match Table. After calculating the values, the table will look like:

Letter:  a  b  c  d  (any other)
Value:   3  2  1  4  4
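The table construction described above can be sketched in C. This is a hedged sketch; the function name `bad_match_table` and the 256-entry byte alphabet are our assumptions.

```c
#include <string.h>

#define ALPHA 256   /* assume 8-bit characters index the table */

/* Fill the Bad Match Table for sub, as described above:
   value = length - index - 1 for each letter except the last;
   the last letter and all absent letters get the full length. */
void bad_match_table(const char *sub, int table[ALPHA]) {
    int M = strlen(sub), i;
    for (i = 0; i < ALPHA; i++) table[i] = M;   /* default: full length */
    for (i = 0; i < M - 1; i++)                 /* the last letter is skipped */
        table[(unsigned char)sub[i]] = M - i - 1;
}
```

For the substring "abcd" this gives a → 3, b → 2, c → 1, d → 4, and 4 for every other letter.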
Horspool's algorithm
• Now compare the substring and the string.
• Start from the index of the last letter of the substring, in this case the letter "d".
• If the letter matches, then compare with the preceding letter, "c" in this case.
• If it doesn't match, check its value in the Bad Match Table.
• Then skip the number of spaces that the table value indicates.
• Repeat these steps until all the letters match.
• Here's an example:
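The comparison steps above can be sketched as a complete Horspool search. This is a hedged sketch, not a reference implementation; `horspool_search` is our name, and the 256-entry byte alphabet is an assumption.

```c
#include <string.h>

#define ALPHA 256

/* Horspool search using the Bad Match Table: compare right to left;
   on a mismatch, shift by the table value of the text character that
   is aligned with the last position of the substring.
   Returns the index of the first match, or -1. */
int horspool_search(const char *text, const char *sub) {
    int N = strlen(text), M = strlen(sub), i, j;
    int table[ALPHA];
    if (M == 0 || M > N) return -1;
    for (i = 0; i < ALPHA; i++) table[i] = M;   /* build the Bad Match Table */
    for (i = 0; i < M - 1; i++)
        table[(unsigned char)sub[i]] = M - i - 1;
    i = M - 1;                        /* text index under the last letter */
    while (i < N) {
        for (j = 0; j < M && text[i - j] == sub[M - 1 - j]; j++)
            ;                         /* match right to left */
        if (j == M) return i - M + 1;
        i += table[(unsigned char)text[i]];     /* skip per the table */
    }
    return -1;
}
```

Running it on the example above finds "abcd" at index 5 of "eovadabcdftoy" after only a handful of comparisons.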
Horspool Implementation
Performance
• The Boyer-Moore-Horspool algorithm's average execution time is linear in the size of the string being searched.
• It can have a lower execution time factor than many other search algorithms.
• For one, it does not need to check all characters of the string. It skips over some of them with help of the Bad Match table.
• The algorithm gets faster as the substring being searched for becomes longer.
• This is because with each unsuccessful attempt to find a match between the substring and the string, the algorithm uses the Bad Match table to rule out positions where the substring cannot match.
Horspool complexity
Complexity:
• In the worst case, the performance of the Boyer-Moore-Horspool algorithm is O(mn), where m is the length of the substring and n is the length of the string.
• The average time is O(n).
• In the best case, the performance is sub-linear and is, in fact, identical to the original Boyer-Moore implementation.
• Boyer-Moore-Horspool is quicker, and its inner loop is simpler, than Boyer-Moore.
Horspool conclusion
Boyer-Moore-Horspool is a fast, simple algorithm that optimizes substring searches.
It has the following uses:
• Search bars
• Auto-correctors
• String analyzers
• Big Data
• Text labeling
Boyer-Moore-Horspool is probably the best practical algorithm for string searches.