Dept of Computer Science, University of Bristol. COMS21101. Chapter 5.2 Slide 1

Chapter 5.2

String Searching - Part 2

Boyer-Moore Algorithm

Rabin-Karp Algorithm

The Boyer-Moore String Algorithm

This method can give substantially faster searches where the language contains a large number of symbols, e.g. normal text (a 128- or 256-character alphabet) rather than binary strings.

The BM method incorporates two main ideas:

start matching at the right of the pattern so as to find the rightmost mismatch

use information about the possible alphabet of the text, as well as the characters in the pattern

Example

Search for LEAN in CARPETS NEED CLEANING REGULARLY

CARPETS NEED CLEANING REGULARLY
LEAN

N and P mismatch. Furthermore, P does not occur anywhere in the string LEAN. Hence move the pattern all the way past P and compare with N again.

CARPETS NEED CLEANING REGULARLY
    LEAN
        LEAN
            LEAN
              LEAN

N and E mismatch, but E occurs in LEAN, so we move the E of LEAN to this position.

Boyer-Moore preprocessing

In order to implement the above idea, consider the characters in the alphabet which makes up the text: C0,C1,…,Ck (k+1 characters in the alphabet).

Initialise an array skip such that:

for each Cj in the pattern string, skip[j] is set to the distance of the rightmost occurrence of Cj from the right-hand end of the pattern

skip[j] = M otherwise, where M is the length of the pattern.
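The preprocessing step can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: the function name build_skip and the ALPHABET string are assumptions, and the table is keyed by the character itself rather than by its index in the alphabet.

```python
def build_skip(pattern, alphabet):
    """Return a table mapping each alphabet character to its skip value."""
    m = len(pattern)
    # Default for characters that do not occur in the pattern: the pattern length M.
    skip = {c: m for c in alphabet}
    # Scanning left to right means later (more rightward) occurrences overwrite
    # earlier ones, so each entry ends up as the distance of the rightmost
    # occurrence from the right-hand end of the pattern.
    for pos, c in enumerate(pattern):
        skip[c] = m - 1 - pos
    return skip


ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # <blank>, A, ..., Z (assumed alphabet)
skip = build_skip("LEAN", ALPHABET)
print(skip["L"], skip["E"], skip["A"], skip["N"], skip["P"])  # 3 2 1 0 4
```

The table needs one entry per alphabet symbol, so the preprocessing cost depends on the alphabet size as well as the pattern length.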

The skip array - example

Suppose pattern is LEAN and alphabet is <blank>, A,B,…,Z (C0,C1,…,C26).

skip[12] = 3 (L)
skip[5] = 2 (E)
skip[1] = 1 (A)
skip[14] = 0 (N)
skip[X] = 4 (otherwise)

skip[C] is the number of characters to move the pattern to the right after a mismatch with the text character whose index is C.

Using the skip array

Try to match the pattern from right to left.

Suppose a mismatch occurs between the text character Cn (with index n) and the (M-j)th position of the pattern, i.e. after j characters have matched.

Get the value of skip[n].

If j > skip[n], then shift the pattern by 1 (since we have already passed the rightmost occurrence of Cn in the pattern).

Else shift the pattern skip[n]-j positions, to try to align Cn in the text with the rightmost occurrence of Cn in the pattern.
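As a small sketch of the shift rule, taking j to be the number of characters already matched when the mismatch occurs, the decision can be written as a single function. The name pattern_shift and its parameters are hypothetical. The case j = skip[n] cannot arise from a genuine mismatch, so it is treated here as a shift of 1 purely for safety.

```python
def pattern_shift(matched, skip_value):
    """Positions to move the pattern right after a mismatch.

    matched    -- number of pattern characters matched (from the right)
    skip_value -- skip entry of the mismatched text character
    """
    if matched >= skip_value:
        # The rightmost occurrence of the text character lies in the part of
        # the pattern already passed, so only a shift of 1 is safe.
        return 1
    # Otherwise line the text character up with its rightmost occurrence.
    return skip_value - matched


print(pattern_shift(0, 4))  # 4: character not in a 4-character pattern
print(pattern_shift(0, 2))  # 2: e.g. E against the last character of LEAN
print(pattern_shift(3, 0))  # 1: occurrence already passed
```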

Example - shifting using skip

Pattern X X A X X X Z Z Z Z

M = 10 (length of pattern)

skip[1] = 7 (distance of rightmost A from right)

mismatch at position 10-4 = 6 (after 4 characters have matched)

Y Y Y Y Y A Z Z Z Z Z Z Z Z Z    text
X X A X X X Z Z Z Z              mismatched pattern
      X X A X X X Z Z Z Z        shift 3 positions

Shift pattern by 7-4 = 3 positions

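Boyer-Moore Algorithm (1)

    int boyermoore1(String P, String T) {
        int i, j, t, M = P.length(), N = T.length();
        initskip(P);                      // initialise skip array
        i = M-1; j = M-1;
        while (j >= 0) {                  // compare right to left, including P[0]
            while (T[i] != P[j]) {
                t = skip[index(T[i])];
                if ((M-j) > t) { i = i+M-j; }   // rightmost occurrence passed: shift by 1
                else { i = i+t; }               // align T[i] with its rightmost occurrence
                if (i >= N) return N;     // no match
                j = M-1;
            }
            i--; j--;
        }
        return i+1;                       // successful match: start of pattern in T
    }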
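For readers who want something they can run, here is a Python sketch of the same mismatched-character search. It is an illustration under stated assumptions rather than the course code: the skip table is a dictionary keyed by character (absent characters default to M), a guard for patterns longer than the text is added, and the first pattern character is compared explicitly before a match is reported.

```python
def boyermoore1(pattern, text):
    """Return the index of the first match of pattern in text, or len(text)."""
    m, n = len(pattern), len(text)
    if m > n:
        return n
    # skip[c] = distance of the rightmost c from the right-hand end of the pattern
    skip = {c: m - 1 - k for k, c in enumerate(pattern)}
    i = j = m - 1                      # i indexes the text, j the pattern
    while j >= 0:
        if text[i] == pattern[j]:
            i -= 1
            j -= 1
        else:
            t = skip.get(text[i], m)   # m for characters not in the pattern
            i += (m - j) if (m - j) > t else t
            if i >= n:
                return n               # ran off the end of the text: no match
            j = m - 1                  # restart at the right of the pattern
    return i + 1


print(boyermoore1("LEAN", "CARPETS NEED CLEANING REGULARLY"))  # 14
```

On the earlier example the pattern is tried at positions 0, 4, 8, 12 and 14, so only a handful of text characters are examined before the match is found.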

Refinement to B-M Algorithm

We can apply the KMP algorithm "right-to-left".

Sometimes this gives a larger skip value than the skip index used above.

E.g. Pattern BBAAA

skip[1] = 0 (skip value for A)
skip[2] = 3 (skip value for B)

AAAAAAA
BBAAA    mismatch on A in text

boyermoore1 algorithm shifts only one position.

However, it's clear that AAA does not occur anywhere to the left of positions 3,4,5.

Boyer-Moore Refinement (2)

Build KMP next array from right to left.

j = position of mismatch (from right)

next[j] = no. of positions to shift pattern to right

Pattern BBAAA:

j    next[j]
2    1
3    1
4    5
5    5

Using the next array, a mismatch on B results in a shift of 5 positions
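The next values for BBAAA can be checked with a brute-force Python sketch. This is not the linear-time KMP-style construction the slides refer to: it simply tries each shift in turn and accepts the first one under which the already-matched suffix still agrees with the pattern characters moved beneath it. The function name next_shift is hypothetical, and the mismatched character itself is deliberately ignored.

```python
def next_shift(pattern, j):
    """Shift for a mismatch at the j-th character from the right (j >= 1)."""
    m = len(pattern)
    matched = j - 1                    # characters matched before the mismatch
    for shift in range(1, m):
        # After shifting by `shift`, pattern[k - shift] sits where pattern[k]
        # was; positions that fall off the left end impose no constraint.
        if all(pattern[k - shift] == pattern[k]
               for k in range(m - matched, m) if k - shift >= 0):
            return shift
    return m                           # no overlap possible: move past entirely


print({j: next_shift("BBAAA", j) for j in range(2, 6)})  # {2: 1, 3: 1, 4: 5, 5: 5}
```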

Refined Boyer-Moore Algorithm

Initialise both the skip and the next arrays (right-to-left).

Whenever a mismatch occurs, get the skip value for the mismatched character and the next value for the position of the mismatch.

Shift the pattern right by whichever gives the greater value.
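A compact Python sketch of the combined rule is given below. It is a hedged illustration only: the names refined_search and next_shift are assumptions, the next values are found by brute force on demand instead of being precomputed into an array, and the search tracks the start position of the pattern rather than the i/j pointers used in boyermoore1.

```python
def refined_search(pattern, text):
    """Return the index of the first match of pattern in text, or len(text)."""
    m, n = len(pattern), len(text)
    skip = {c: m - 1 - k for k, c in enumerate(pattern)}

    def next_shift(matched):
        # Smallest shift keeping the matched suffix consistent (brute force).
        for shift in range(1, m):
            if all(pattern[k - shift] == pattern[k]
                   for k in range(m - matched, m) if k - shift >= 0):
                return shift
        return m

    start = 0                                   # current pattern position
    while start + m <= n:
        matched = 0                             # compare right to left
        while (matched < m and
               pattern[m - 1 - matched] == text[start + m - 1 - matched]):
            matched += 1
        if matched == m:
            return start
        bad = text[start + m - 1 - matched]     # mismatched text character
        skip_shift = skip.get(bad, m) - matched # may be zero or negative
        start += max(skip_shift, next_shift(matched))  # next_shift is >= 1
    return n


print(refined_search("BBAAA", "AAAAAAA"))                          # 7 (not found)
print(refined_search("LEAN", "CARPETS NEED CLEANING REGULARLY"))   # 14
```

For BBAAA against AAAAAAA the first mismatch already produces a shift of 5, so the search stops after one alignment instead of sliding along one position at a time.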

Rabin-Karp String Matching

Consider a text and pattern consisting of characters represented by b bits each, e.g. 7-bit ASCII characters.

We can regard a sequence of characters as a (large) binary number (as with keys when using hash tables).

Idea - compute a hash value for an M-character pattern and compare it successively with the hash values of each successive sequence of M characters in the text.

Rabin-Karp matching - basic idea

Example. Consider the string CARPETS NEED CLEANING and the search string LEAN.

Then we compare h(LEAN) first against h(CARP), then against h(ARPE), h(RPET), h(PETS), and so on.

Clearly h(LEAN) need be computed only once.

The key to efficient comparison is to compute the successive hash values efficiently.

We can exploit the fact that successive keys overlap, e.g. ARPE and RPET share 3 characters.

Rabin-Karp - computing hash values

Let us use $h(K) = K \bmod P$ as our hash function as before, where $P$ is a large prime number.

Let $d$ = max number of characters (e.g. $d = 2^b$).

Suppose $K = C_1,\ldots,C_n$, where $C_1,\ldots,C_n$ is a sequence of characters in the text, and $h(K) = X$.

It can be shown that

$$h(C_2,\ldots,C_{n+1}) = h\big((X - C_1 \cdot d^{n-1}) \cdot d + C_{n+1}\big),$$

since $C_2,\ldots,C_{n+1}$ can be rewritten as $(C_1,\ldots,C_n - C_1 \cdot d^{n-1}) \cdot d + C_{n+1}$.

E.g. ($d = 10$): $45678 = (34567 - 3 \cdot 10^4) \cdot 10 + 8$.

Then use some properties such as $h(X+Y) = h(h(X)+Y)$ and $h(X \cdot Y) = h(h(X) \cdot Y)$.

Hence $h(45678) = h\big((h(34567) - 3 \cdot 10^4) \cdot 10 + 8\big)$.

Thus, successive values for h are efficiently computed, since we can reuse the previous hash value to compute the next one.
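The decimal example can be checked numerically with a few lines of Python. The prime 97 is an arbitrary small value chosen only to keep the numbers readable; it is an assumption of this sketch, not a value from the slides.

```python
P = 97          # hypothetical small prime, for illustration only
d = 10          # decimal digits play the role of characters


def h(k):
    return k % P


x = h(34567)                            # hash of the old window
rolled = h((x - 3 * d**4) * d + 8)      # drop the leading 3, append the 8
print(rolled, h(45678))                 # 88 88
```

The intermediate value inside the brackets is negative here. Python's % still returns a result in the range 0 to P-1, but in languages where the remainder can be negative, a multiple of the prime is usually added before taking the remainder.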

Rabin-Karp Algorithm

    int rabinkarp(String P, String T) {
        int q = 33554393;                 // a large prime
        int d = 32;                       // size of alphabet
        int i, dM = 1, h1 = 0, h2 = 0;
        int M = P.length(), N = T.length();
        for (i = 1; i < M; i++) { dM = (d*dM) mod q; }   // dM = d^(M-1) mod q
        for (i = 0; i < M; i++) {
            h1 = (h1*d + val(P[i])) mod q;               // hash P
            h2 = (h2*d + val(T[i])) mod q;               // hash first M characters of T
        }
        for (i = 0; h1 != h2; i++) {
            if (i >= N-M) return N;                      // not found
            h2 = (h2 + d*q - val(T[i])*dM) mod q;        // remove leading character
            h2 = (h2*d + val(T[i+M])) mod q;             // append next character
        }
        return i;
    }
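A runnable Python sketch of the same search follows. It is an illustration under assumptions, not the course code: ord() stands in for val(), d = 256 is used instead of 32 so that any 8-bit character code fits, and a direct string comparison is made whenever the hashes agree so that a hash collision cannot be reported as a match.

```python
def rabin_karp(pattern, text, q=33554393, d=256):
    """Return the index of the first match of pattern in text, or len(text)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return n
    dM = pow(d, m - 1, q)                   # d^(M-1) mod q
    h1 = h2 = 0
    for k in range(m):
        h1 = (h1 * d + ord(pattern[k])) % q     # hash of the pattern
        h2 = (h2 * d + ord(text[k])) % q        # hash of the first window
    for i in range(n - m + 1):
        # Equal hashes are only a candidate: confirm on the strings themselves.
        if h1 == h2 and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Roll the window: remove text[i], append text[i + m].
            h2 = ((h2 - ord(text[i]) * dM) * d + ord(text[i + m])) % q
    return n


print(rabin_karp("LEAN", "CARPETS NEED CLEANING"))  # 14
```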

Rabin-Karp - analysis

In the above algorithm, val(P[i]) is the number corresponding to the character P[i].

h1 is the hash value of the pattern.

h2 takes the hash value of successive sequences of M characters in the text.

Strictly, if h1=h2, we might not have a match, since a hash collision could occur. We still need to make a final comparison on the strings themselves.

We can use a very large prime since we do not actually have to store the hash table; this makes collisions extremely unlikely.

Average number of comparisons = N+M