Linear Time Algorithms for Exact Matching Book: Algorithms on strings, trees and sequences by Dan Gusfield Presented by: Amir Anter and Vladimir Zoubritsky

Page 1: Book: Algorithms on strings, trees and sequences by Dan Gusfield Presented by: Amir Anter and Vladimir Zoubritsky.

Linear Time Algorithms for Exact Matching

Book: Algorithms on strings, trees and sequences by Dan Gusfield

Presented by: Amir Anter and Vladimir Zoubritsky

Page 2:

Given a string P called pattern and a longer string T called the text, the exact matching problem is to find all occurrences, if any, of pattern P in text T.

Exact Matching Problem

Page 3:

P = aa and T = abaabaaa. P occurs in T 3 times, starting at locations 3, 6 and 7 (brackets mark each occurrence).
◦ Location 3: ab[aa]baaa
◦ Location 6: abaab[aa]a
◦ Location 7: abaaba[aa]

Note that occurrences may overlap, as at locations 6 and 7.

Exact Matching Problem - Example

Page 4:

Grep command in Unix:
◦ grep apple fruitlist.txt

Internet browsers – Find option.

Biology - Searching for a string in a DNA database.

Articles, online books.

Usage cases and motivation

Page 5:

Google books – example

Page 6:

1. Align the left end of P with the left end of T.
2. Compare the characters of P and T left to right until:
2.1 A mismatch is found, or
2.2 P ends: an occurrence of P is reported.
3. Shift P one place to the right.
4. If P's right end is farther than T's right end: finish.
5. Else go to 2.

Naive Algorithm
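The steps of the naive algorithm can be sketched in Python as follows. This is only an illustrative sketch: the function name `naive_match` and the choice to return 1-based locations (to agree with the slides) are assumptions, not part of the presentation.

```python
def naive_match(pattern: str, text: str) -> list[int]:
    """Return the 1-based start locations of every occurrence of pattern in text."""
    n, m = len(pattern), len(text)
    occurrences = []
    # Each value of `shift` is one alignment of P against T (steps 1 and 3).
    # The range is empty once P's right end would pass T's right end (step 4).
    for shift in range(m - n + 1):
        j = 0
        # Step 2: compare left to right until a mismatch or until P ends.
        while j < n and text[shift + j] == pattern[j]:
            j += 1
        if j == n:  # step 2.2: P ended without a mismatch
            occurrences.append(shift + 1)
    return occurrences


print(naive_match("aa", "abaabaaa"))  # [3, 6, 7]
```

Overlapping occurrences are found because the pattern is always shifted by exactly one place, even after a match.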

Page 7:

Step 1: align P at position 1 of T.
abaabaaa
aa

Step 1.1: compare T[1] = a with P[1] = a: match.

Step 1.2: compare T[2] = b with P[2] = a: mismatch.

Example: T=abaabaaa P=aa

Page 8:

Step 2: shift P to position 2 of T.
abaabaaa
 aa

Step 2.1: compare T[2] = b with P[1] = a: mismatch.

Example: T=abaabaaa P=aa

Page 9:

Step 3: shift P to position 3 of T.
abaabaaa
  aa

Step 3.1: compare T[3] = a with P[1] = a: match.

Step 3.2: compare T[4] = a with P[2] = a: match.

Report match at location 3

Example: T=abaabaaa P=aa

Page 10:

Step 4: shift P to position 4 of T.
abaabaaa
   aa

Step 4.1: compare T[4] = a with P[1] = a: match.

Step 4.2: compare T[5] = b with P[2] = a: mismatch.

Example: T=abaabaaa P=aa

Page 11:

Step 5: shift P to position 5 of T.
abaabaaa
    aa

Step 5.1: compare T[5] = b with P[1] = a: mismatch.

Example: T=abaabaaa P=aa

Page 12:

Step 6: shift P to position 6 of T.
abaabaaa
     aa

Step 6.1: compare T[6] = a with P[1] = a: match.

Step 6.2: compare T[7] = a with P[2] = a: match.

Report match at location 6

Example: T=abaabaaa P=aa

Page 13:

Step 7: shift P to position 7 of T.
abaabaaa
      aa

Step 7.1: compare T[7] = a with P[1] = a: match.

Step 7.2: compare T[8] = a with P[2] = a: match.

Report match at location 7

Example: T=abaabaaa P=aa

Page 14:

Step 8: shift P to position 8 of T. P's right end is now past T's right end.
abaabaaa
       aa

End

Example: T=abaabaaa P=aa

Page 15:

Let P's length be n.
Let T's length be m.
Number of character comparisons in the worst case is O(nm).
No additional storage is needed.
A 30-character string search in GenBank (DNA DB) took more than 4 hours.
We will show a linear-time algorithm, which improves this time to 10 minutes.

Naive Algorithm - Complexity
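The O(nm) bound can be checked by counting comparisons directly. The sketch below is an illustration only; `count_comparisons` is a hypothetical helper, and the all-`a` inputs are an assumed worst case in which every alignment is compared to the end of P.

```python
def count_comparisons(pattern: str, text: str) -> int:
    """Count the character comparisons the naive algorithm performs."""
    n, m = len(pattern), len(text)
    comparisons = 0
    for shift in range(m - n + 1):
        for j in range(n):
            comparisons += 1
            if text[shift + j] != pattern[j]:
                break  # a mismatch ends this alignment
    return comparisons


n, m = 30, 1000
worst = count_comparisons("a" * n, "a" * m)
# Every one of the m - n + 1 alignments costs n comparisons.
print(worst == n * (m - n + 1))  # True
```

On the walkthrough example (T=abaabaaa, P=aa) the same function counts 12 comparisons, one per numbered sub-step 1.1 through 7.2.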

Page 16:

Given a string S and a position i > 1, let Z_i(S) be the length of the longest substring of S that starts at i and matches a prefix of S.

Equivalently: Z_i(S) is the length of the longest prefix of S[i..|S|] that matches a prefix of S.

Z function

Page 17:

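Example: S=aabcaabxaaz

Z_5(S) = 3: S[5..7] = aab matches S[1..3]; S[8] = x differs from S[4] = c.
Z_6(S) = 1: S[6] = a matches S[1]; S[7] = b differs from S[2] = a.
Z_7(S) = 0: S[7] = b differs from S[1] = a.
Z_8(S) = 0: S[8] = x differs from S[1] = a.
Z_9(S) = 2: S[9..10] = aa matches S[1..2]; S[11] = z differs from S[3] = b.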
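The Z values in this example can be reproduced straight from the definition. The sketch below is deliberately the slow, quadratic version; the name `z_by_definition` and the dictionary keyed by 1-based position are assumptions made for illustration.

```python
def z_by_definition(s: str) -> dict[int, int]:
    """Map each 1-based position i > 1 to Z_i(S), computed from the definition."""
    z = {}
    for i in range(2, len(s) + 1):
        length = 0
        # Extend while the character at 1-based position i + length
        # equals the character at 1-based position 1 + length.
        while i - 1 + length < len(s) and s[i - 1 + length] == s[length]:
            length += 1
        z[i] = length
    return z


z = z_by_definition("aabcaabxaaz")
print([z[i] for i in (5, 6, 7, 8, 9)])  # [3, 1, 0, 0, 2]
```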

Page 18:

P – pattern of length n.
T – text of length m.

Let S = P$T, where $ does not appear in P or in T.
S's length is n + m + 1 = O(m).

Let's assume we have computed Z_i(S) for 2 ≤ i ≤ n + m + 1 at a preprocessing stage.

Claim: Any value of i > n+1 such that Z_i(S) = n identifies an occurrence of P in T starting at position i-(n+1) of T.

Claim: If P occurs in T starting at position j of T, then Z_(n+1)+j(S) = n.

Do we really need $? (Except for USD)

Using Z function to solve the exact matching problem
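The two claims translate into a short matcher. The sketch below is one possible rendering, not the presenters' code: `z_array` is a compact 0-based Z computation, `z_match` is a hypothetical name, and the separator is passed as a parameter because it must not occur in P or T.

```python
def z_array(s: str) -> list[int]:
    """0-based Z values; z[0] stays 0 because Z_1 is not defined."""
    n = len(s)
    z = [0] * n
    left = right = 0  # the current Z-box is s[left:right]
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n and s[z[k]] == s[k + z[k]]:
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z


def z_match(pattern: str, text: str, sep: str = "$") -> list[int]:
    """Return 1-based start positions of pattern in text using S = P$T."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    if sep in pattern or sep in text:
        raise ValueError("separator must not appear in pattern or text")
    n = len(pattern)
    z = z_array(pattern + sep + text)
    # 0-based index i of S is 1-based position i + 1, so the claim
    # "position i - (n + 1) of T" becomes i - n here.
    return [i - n for i in range(n + 1, len(z)) if z[i] == n]


print(z_match("aa", "abaabaaa"))  # [3, 6, 7]
```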

Page 19:

For any position i > 1 where Z_i > 0, the Z-box at i is defined as the interval starting at i and ending at position i + Z_i - 1.

Position:  1  2  3  4  5  6  7  8  9 10 11
S:         a  a  b  c  a  a  b  x  a  a  z

Z box

Page 20:

r_i - The right-most end of any Z-box that begins at or before position i.
S[l_i..r_i] - A substring that is some Z-box ending at r_i.
l_i - The left end of that Z-box.

Z box

Page 21:

Z box

Position:  1  2  3  4  5  6  7  8  9 10 11
S:         a  a  b  c  a  a  b  x  a  a  z

r_5 = 7, l_5 = 5
r_6 = 7, l_6 = 5
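The values of r_i and l_i for this string can be tabulated with a few lines of code. This is a sketch under assumptions: `z_boxes` is a hypothetical helper, Z values are computed naively from the definition, and when several Z-boxes share the same right end the earliest one is kept.

```python
def z_boxes(s: str) -> dict[int, tuple[int, int]]:
    """Map each 1-based position i > 1 to the pair (l_i, r_i)."""
    boxes = {}
    l = r = 0  # no Z-box seen yet
    for i in range(2, len(s) + 1):
        z_i = 0
        while i - 1 + z_i < len(s) and s[i - 1 + z_i] == s[z_i]:
            z_i += 1
        # The Z-box at i is [i, i + z_i - 1]; keep it if it reaches further right.
        if z_i > 0 and i + z_i - 1 > r:
            l, r = i, i + z_i - 1
        boxes[i] = (l, r)
    return boxes


boxes = z_boxes("aabcaabxaaz")
print(boxes[5], boxes[6])  # (5, 7) (5, 7)
```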

Page 22:

Our task is to compute Z values in linear time.

Let's find Z_2 by comparing, left to right, the characters of S[2..|S|] and S[1..|S|] until a mismatch is found. Z_2 is the length of the matching string.

The Z algorithm

If Z_2 = 0: r = r_2 = 0 and l = l_2 = 0.
If Z_2 > 0: r = r_2 = Z_2 + 1 and l = l_2 = 2.

Page 23:

Let's assume we have all Z values up to k-1. The idea is to use the already computed Z values to compute Z_k.

The Z algorithm

Example: Z_2 ... Z_120 are known, k = 121, r_120 = 130, l_120 = 100.

Page 24:

The Z algorithm

Figure: Z_2 ... Z_120 are known; k = 121, r_120 = 130, l_120 = 100. The Z-box S[100..130] matches the prefix S[1..31].

Page 25:

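The Z algorithm

Figure: k = 121, r_120 = 130, l_120 = 100. Position k = 121 inside the Z-box corresponds to position k' = k - l + 1 = 22 inside the prefix S[1..31].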

Page 26:

The Z algorithm

Figure: k = 121, r_120 = 130, l_120 = 100, k' = 22.

Z_22 = 3

Page 27:

The Z algorithm

Figure: k = 121, r_120 = 130, l_120 = 100, k' = 22.

Z_22 = 3 ⇒ Z_121 = 3

Page 28:

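The Z algorithm

Figure: k = 121, r_120 = 130, l_120 = 100, k' = 22.

Why does Z_22 = 3 ⇒ Z_121 = 3?

Let's assume Z_121 ≥ 4. Then S[121..124] = S[1..4]. Since S[121..130] = S[22..31], also S[22..25] = S[1..4], so Z_22 ≥ 4, which contradicts Z_22 = 3.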

Page 29:

The Z algorithm

Given Z_i for all 1 < i ≤ k-1 and the current values of r and l, compute Z_k and update r and l:

1. If k > r, then find Z_k by comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found.
Set Z_k to be the length of the match.
If Z_k > 0:
r = k + Z_k - 1
l = k

2. If k ≤ r:
Position k is contained in a Z-box.
k' = k - l + 1
β = S[k..r]

2.a If Z_k' < |β|: Z_k = Z_k'; r and l remain unchanged.

2.b If Z_k' ≥ |β|, then Z_k ≥ |β|.
Compare the characters starting at position r+1 of S to the characters starting at position |β|+1 of S until a mismatch. Say the mismatch occurs at character q ≥ r+1.
Z_k = q - k
r = q - 1
l = k
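The three cases can be written out almost line for line. The sketch below is an illustration rather than the presenters' JavaScript example: it keeps 1-based positions so that the names k, k', l, r and q read like the slide, and the function name `z_algorithm` and the helper `ch` are assumptions.

```python
def z_algorithm(s: str) -> list[int]:
    """Return z where z[k] is Z_k(S) for 1-based k >= 2 (z[0] and z[1] are unused)."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0

    def ch(i: int) -> str:
        """Character at 1-based position i of S."""
        return s[i - 1]

    for k in range(2, n + 1):
        if k > r:
            # Case 1: compare explicitly from position k against position 1.
            length = 0
            while k + length <= n and ch(k + length) == ch(1 + length):
                length += 1
            z[k] = length
            if length > 0:
                l, r = k, k + length - 1
        else:
            k_prime = k - l + 1
            beta = r - k + 1  # |beta|, where beta = S[k..r]
            if z[k_prime] < beta:
                # Case 2.a: the answer is copied; l and r are unchanged.
                z[k] = z[k_prime]
            else:
                # Case 2.b: S[k..r] already matches S[1..|beta|]; compare
                # position r + 1 onward against position |beta| + 1 onward.
                q = r + 1
                while q <= n and ch(q) == ch(q - k + 1):
                    q += 1
                z[k] = q - k
                l, r = k, q - 1
    return z


z = z_algorithm("aabcaabxaaz")
print([z[i] for i in (5, 6, 7, 8, 9)])  # [3, 1, 0, 0, 2]
```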

Page 30:

Example - JavaScript

The Z algorithm

Page 31:

The Z algorithm - Correctness

Page 32:

Case 1: k > r

The Z algorithm - Correctness

Figure: k = 121, r = 118.

Page 33:

Case 2.a

The Z algorithm - Correctness

Figure: k = 121, r = 130, l = 100, k' = 22; Z_k' < |β|, where β = S[k..r].

Page 34:

Case 2.b

The Z algorithm - Correctness

Figure: k = 121, r = 130, l = 100, k' = 22; Z_k' ≥ |β|, where β = S[k..r].

Page 35:

Do we really need $? By definition, Z_i(S) is the length of the longest prefix of S[i..|S|] that matches a prefix of S.

If P's length is n, Z_i(S) ≥ n (for a position i inside T) indicates an occurrence of P in T, in case S=P$T and also in case S=PT.

The answer is no, in terms of correctness.

The Z algorithm – Correctness
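A small experiment makes the point concrete: dropping the separator still finds every occurrence, provided the test is Z_i(S) ≥ n rather than Z_i(S) = n. The sketch below uses a brute-force Z computation for brevity; both function names are hypothetical.

```python
def z_brute(s: str) -> list[int]:
    """0-based Z values computed directly from the definition (quadratic time)."""
    z = [0] * len(s)
    for i in range(1, len(s)):
        while i + z[i] < len(s) and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


def match_without_separator(pattern: str, text: str) -> list[int]:
    """Return 1-based start positions of pattern in text using S = PT."""
    n = len(pattern)
    if n == 0:
        raise ValueError("pattern must be non-empty")
    z = z_brute(pattern + text)
    # T starts at 0-based index n of S; a Z value of at least n there
    # means the whole pattern matches.
    return [i - n + 1 for i in range(n, len(z)) if z[i] >= n]


print(match_without_separator("aa", "abaabaaa"))  # [3, 6, 7]
```

In this run some Z values exceed n (for example the one at location 6 of T is 3), which is exactly what the separator would have prevented.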

Page 36:

So why use $?

Using $ ensures a limit of n for the values of Z_i. In the algorithm we use some Z_k' and S[k'..Z_l] to compute the current Z_k.

Z_i ≤ n ⇒ Z_l ≤ n
k' ≤ Z_l ⇒ k' ≤ n

We need only O(|P|) additional space. O(|T|) is not acceptable.

The Z algorithm – Space complexity
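One way to realise the O(|P|) space bound is to store Z values only for the pattern part of S and never build S itself. The sketch below is an assumption-laden illustration: `z_match_small_space` and `same` are hypothetical names, positions are 0-based internally, and the separator is modelled as a position that matches nothing instead of a real character.

```python
def z_match_small_space(pattern: str, text: str) -> list[int]:
    """Return 1-based occurrences of pattern in text, storing only |P| Z values."""
    n, m = len(pattern), len(text)
    if n == 0:
        raise ValueError("pattern must be non-empty")

    def same(a: int, b: int) -> bool:
        """Compare 0-based positions a and b of the virtual string S = P$T."""
        if a == n or b == n:  # the separator position matches nothing
            return False
        ca = pattern[a] if a < n else text[a - n - 1]
        cb = pattern[b] if b < n else text[b - n - 1]
        return ca == cb

    total = n + 1 + m
    z = [0] * n           # Z values for the pattern part only
    left = right = 0      # the current Z-box is S[left:right]
    occurrences = []
    for k in range(1, total):
        length = 0
        if k < right:
            # k - left < right - left <= n, so this lookup stays inside z.
            length = min(right - k, z[k - left])
        while k + length < total and same(length, k + length):
            length += 1
        if k + length > right:
            left, right = k, k + length
        if k < n:
            z[k] = length
        elif length == n:
            occurrences.append(k - n)  # 1-based position in T
    return occurrences


print(z_match_small_space("aa", "abaabaaa"))  # [3, 6, 7]
```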

Page 37:

|S| iterations.

Number of comparisons:
◦ Each mismatch ends an iteration. Max total of |S| mismatches for the entire algorithm.
◦ Each match increments the value of r by at least 1.
◦ r ≤ |S| ⇒ at most |S| matching comparisons for the entire algorithm.

The Z algorithm – Time complexity

Page 38:

The Z algorithm – Time complexity

1. If k > r, then find Z_k by comparing the characters starting at position k to the characters starting at position 1 of S, until a mismatch is found.
Set Z_k to be the length of the match.
If Z_k > 0:
r = k + Z_k - 1
l = k

Page 39:

The Z algorithm – Time complexity

2. If k ≤ r:
Position k is contained in a Z-box.
k' = k - l + 1
β = S[k..r]

2.b If Z_k' ≥ |β|, then Z_k ≥ |β|.
Compare the characters starting at position r+1 of S to the characters starting at position |β|+1 of S until a mismatch. Say the mismatch occurs at character q ≥ r+1.
Z_k = q - k
r = q - 1
l = k

Page 40:

The Z algorithm – Time complexity

O(|S|) = O(n + m)
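The counting argument can be checked by instrumenting a Z computation. The sketch below is illustrative only: `z_with_comparison_count` is a hypothetical name, and it uses a compact 0-based variant of the algorithm in which every iteration performs at most one mismatching comparison and every matching comparison moves the right end of the Z-box forward.

```python
def z_with_comparison_count(s: str) -> tuple[list[int], int]:
    """Return the 0-based Z values of s and the number of character comparisons made."""
    n = len(s)
    z = [0] * n
    left = right = 0  # the current Z-box is s[left:right]
    comparisons = 0
    for k in range(1, n):
        if k < right:
            z[k] = min(right - k, z[k - left])
        while k + z[k] < n:
            comparisons += 1
            if s[z[k]] != s[k + z[k]]:
                break  # a mismatch ends the iteration
            z[k] += 1
        if k + z[k] > right:
            left, right = k, k + z[k]
    return z, comparisons


for sample in ("aabcaabxaaz", "a" * 1000, "ab" * 500):
    _, count = z_with_comparison_count(sample)
    print(len(sample), count, count <= 2 * len(sample))
```

Even on the all-`a` string, where the naive approach is quadratic, the count stays within 2|S|.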

Page 41:

Knuth-Morris-Pratt (KMP)

Aho–Corasick string matching algorithm
◦ Is a generalization of KMP.
◦ Matches a set of patterns in linear time.

Boyer-Moore
◦ Typically runs in sublinear time.
◦ It is used in practice for exact matching.
◦ Worst case linear.

Why continue?

Page 42:

Thank You!