
String Matching with k Mismatches by Using Kangaroo Method

Efficient string matching with k mismatches, Landau, G.M., and Vishkin, U., Theoret. Comput. Sci. 43, 1986, pp. 239-249

Speaker: C. C. Lin

Adviser: R. C. T. Lee


Problem definition:

Input: A text T of length n, a pattern P of length m and a mismatch threshold k.

Output: All substrings of T of length m that match P with at most k mismatches.

T = A G C T G C D C A C G I A B ...

P = A G C C

Number of mismatches when P is aligned at positions 1 to 4 of T: 1, 4, 3, 2.

If k = 2, P matches at positions 1 and 4 (with 1 and 2 mismatches).
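As a baseline for the problem definition above, here is a minimal brute-force sketch in Python. It is not the Kangaroo method: it compares every character and takes O(nm) time. The function name `naive_k_mismatch` and the 0-based positions are assumptions made for illustration.

```python
def naive_k_mismatch(text: str, pattern: str, k: int) -> list[int]:
    """Return 0-based start positions where pattern matches with <= k mismatches."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[start + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:      # threshold exceeded: stop early
                    break
        if mismatches <= k:
            hits.append(start)
    return hits


# The slide's example, with the visible part of T only.
print(naive_k_mismatch("AGCTGCDCACGIAB", "AGCC", 2))  # [0, 3]
```

Positions 0 and 3 here correspond to alignments 1 and 4 on the slide.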


The concept of the Kangaroo method can be explained by the following figure.

Assume that it is known beforehand that t_1 t_2 … t_a = p_1 p_2 … p_a and that t_{a+1} is not equal to p_{a+1}.

Then we do not have to compare t_1 t_2 … t_{a+1} with p_1 p_2 … p_{a+1} character by character. We count one mismatch and jump directly to matching the suffixes beginning at t_{a+2} and p_{a+2}.

Text:    t_1 t_2 … t_a t_{a+1} t_{a+2} t_{a+3} … t_k …

Pattern: p_1 p_2 … p_a p_{a+1} p_{a+2} p_{a+3} … p_k …

(mismatch at position a+1)


T = ABCCABDADBDETADBAADFDAAEERDXTDADCT…

P = ETBDBCCDFDC

The Kangaroo method proceeds as follows.

start

k=0 (here k counts the mismatches found so far)


k=1


k=2


k=3


k=4


We continue the above process. Whenever a substring of T is known to match a substring of P exactly, we skip this substring. The process stops when k+1 mismatches have been found.

Input: T=ABAABBCCDD, P=ACDCB and k=2.

T=ABAABBCCDD

P=ACDCB

After 3 mismatches (k+1), we stop and discard “ABAAB”; then we start to compare “BAABB” with “ACDCB”.
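The stopping rule above can be sketched for a single window as follows. This is a minimal illustration, assuming a helper `common_prefix_len` (a hypothetical name) that scans characters naively; in the method described on these slides that value would come from a constant-time lowest-common-ancestor query instead.

```python
def common_prefix_len(a: str, i: int, b: str, j: int) -> int:
    """Length of the longest common prefix of a[i:] and b[j:].

    Assumption: a naive scan stands in for the O(1) query used by the
    real method.
    """
    length = 0
    while (i + length < len(a) and j + length < len(b)
           and a[i + length] == b[j + length]):
        length += 1
    return length


def within_k_mismatches(window: str, pattern: str, k: int) -> bool:
    """Hop over matching runs; give up once k + 1 mismatches are seen."""
    assert len(window) == len(pattern)
    pos, mismatches = 0, 0
    while True:
        pos += common_prefix_len(window, pos, pattern, pos)  # kangaroo jump
        if pos >= len(pattern):
            return True              # end reached within the budget
        mismatches += 1              # window[pos] != pattern[pos]
        if mismatches > k:
            return False             # k + 1 mismatches: discard the window
        pos += 1                     # step over the mismatching character


print(within_k_mismatches("ABAAB", "ACDCB", 2))  # False
print(within_k_mismatches("ABAAB", "ACDCB", 3))  # True
```

The first call mirrors the slide: the third mismatch exceeds k=2, so “ABAAB” is discarded.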


Before we introduce the Kangaroo algorithm, we first introduce the suffix tree and the lowest common ancestor of two nodes.

The properties of the suffix tree and of the lowest common ancestor of two nodes will be used in the Kangaroo algorithm.


S = ABCDEADDBE

A suffix tree of a string of length n can be constructed in O(n) time (Weiner, 1973; McCreight, 1976; Ukkonen, 1995).

[Figure: suffix tree of S = ABCDEADDBE$, with the leaves labeled 1 to 10 by suffix start position.]


After O(n) preprocessing of the tree, the lowest common ancestor of two leaf nodes can be found in O(1) time (Harel and Tarjan, 1984).

[Figure: the suffix tree of S = ABCDEADDBE$ from the previous slide.]


The Kangaroo method constructs a suffix tree for the text T and the pattern P. Let X denote the leaf node corresponding to the suffix of T that starts at the text location being verified, and let Y denote the leaf corresponding to the pattern.

The Kangaroo method finds the lowest common ancestor of X and Y to verify a text location with k mismatches in O(k) time.

The next slides illustrate the Kangaroo method.


Two suffix strings:

ANBECF$

ANCEC$

[Figure: the two suffixes share the path labeled AN from the root and then branch into BECF$ and CEC$.]

From this we know that they have the same prefix “AN” followed by a mismatch between “B” and “C”.

We now have to find whether there are any mismatches between ECF and EC.

mismatches=1


The remaining suffix strings are:

ECF$

EC$

[Figure: the two suffixes share the path labeled EC and then branch into F$ and $.]

They have the same prefix “EC”, and because we then reach $, the verification is finished.

mismatches=1

Thus the number of mismatches between “ANBECF” and “ANCEC” is 1.
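The two-step comparison of ANBECF$ and ANCEC$ can be replayed with a short trace. This is only a sketch: the function name `kangaroo_trace` and its output format are assumptions for illustration, and the inner loop scans characters where the slides would read the shared prefix off the lowest common ancestor.

```python
def kangaroo_trace(a: str, b: str) -> list[tuple[str, str]]:
    """Compare two '$'-terminated suffixes; return (shared run, event) pairs.

    event is "x/y" for a mismatch between characters x and y, or "end"
    once either string reaches its terminator '$'.
    """
    i = j = 0
    steps = []
    while True:
        start = i
        # Naive stand-in for the lowest-common-ancestor query.
        while a[i] != "$" and b[j] != "$" and a[i] == b[j]:
            i += 1
            j += 1
        shared = a[start:i]
        if a[i] == "$" or b[j] == "$":
            steps.append((shared, "end"))
            return steps
        steps.append((shared, f"{a[i]}/{b[j]}"))
        i += 1                        # skip the mismatching characters
        j += 1


steps = kangaroo_trace("ANBECF$", "ANCEC$")
print(steps)                                     # [('AN', 'B/C'), ('EC', 'end')]
print(sum(event != "end" for _, event in steps)) # 1
```

Both inputs are assumed to end with '$', as on the slides; the trace stops as soon as either terminator is reached.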


By finding the lowest common ancestor of a text suffix and a pattern suffix in the suffix tree, we avoid comparing all characters one by one.

This is useful when the text and the pattern share many equal characters, because those characters are skipped rather than compared.

Finding the lowest common ancestor of two suffixes amounts to finding the next mismatch between the two strings.
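The query "how long is the common prefix of two suffixes" can also be answered in O(1) without a suffix tree. The sketch below swaps in a different, well-known structure: a suffix array with an LCP array (Kasai's algorithm) and a sparse table for range minima. It is not the suffix tree and lowest-common-ancestor construction from the slides, and the suffix array is built by plain sorting, so preprocessing here is not linear. The name `build_lce` and the separator characters `#` and `$` are assumptions for illustration.

```python
def build_lce(s: str):
    """Return lce(i, j): length of the longest common prefix of s[i:] and s[j:]."""
    n = len(s)
    # Suffix array by plain sorting: simple, but not linear time.
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # Kasai's algorithm: lcp[r] = common prefix of suffixes sa[r-1] and sa[r].
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    # Sparse table: table[l][r] = min(lcp[r : r + 2**l]).
    table = [lcp]
    span = 1
    while 2 * span <= n:
        prev = table[-1]
        table.append([min(prev[r], prev[r + span])
                      for r in range(n - 2 * span + 1)])
        span *= 2

    def lce(i: int, j: int) -> int:
        if i == j:
            return n - i
        lo, hi = sorted((rank[i], rank[j]))
        lo += 1                       # need the minimum of lcp[lo+1 .. hi]
        level = (hi - lo + 1).bit_length() - 1
        return min(table[level][lo], table[level][hi - (1 << level) + 1])

    return lce


T, P = "ABCCBDCDBC", "ABCD"
lce = build_lce(T + "#" + P + "$")    # separators must not occur in T or P
offset = len(T) + 1                   # index of P[0] in the joined string
print(lce(0, offset + 0))             # 3
print(lce(6, offset + 2))             # 2
```

Each `lce` call does a constant number of table lookups, which plays the role of one lowest-common-ancestor query.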


Input: T=ABCCBDCDBC, P=ABCD and k=2.

The suffix tree of T and P is:

[Figure: suffix tree containing all suffixes of T = ABCCBDCDBC$ and P = ABCD$.]


The lowest common ancestor of “ABCD” and “ABCCBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1 ≤ k=2: return “ABCC”.


The lowest common ancestor of “ABCD” and “BCCBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1.


The lowest common ancestor of “BCD” and “CCBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2.


The lowest common ancestor of “CD” and “CBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=3 > k=2: discard “BCCB”.


The lowest common ancestor of “ABCD” and “CCBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1.


The lowest common ancestor of “BCD” and “CBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2.


The lowest common ancestor of “CD” and “BDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=3 > k=2: discard “CCBD”.


The lowest common ancestor of “ABCD” and “CBDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1.


The lowest common ancestor of “BCD” and “BDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2.


The lowest common ancestor of “D” and “CDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=3 > k=2: discard “CBDC”.


The lowest common ancestor of “ABCD” and “BDCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1.


The lowest common ancestor of “BCD” and “DCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2.


The lowest common ancestor of “CD” and “CDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2 ≤ k=2: return “BDCD”.


The lowest common ancestor of “ABCD” and “DCDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1.


The lowest common ancestor of “BCD” and “CDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2.


The lowest common ancestor of “CD” and “DBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=3 > k=2: discard “DCDB”.


The lowest common ancestor of “ABCD” and “CDBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=1.


The lowest common ancestor of “BCD” and “DBC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=2.


The lowest common ancestor of “CD” and “BC”.

[Figure: the suffix tree of T and P.]

T=ABCCBDCDBC, P=ABCD, mismatches=3 > k=2: discard “CDBC”.


Input: T=ABCCBDCDBC, P=ABCD and k=2.

Output: “ABCC” and “BDCD”.
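The whole walk-through can be replayed with a short Python sketch that reports, for every alignment, how many mismatches were seen and how many common-prefix queries were needed. The name `kangaroo_search` and the report format are assumptions for illustration; the inner `lce` scans characters, standing in for the one lowest-common-ancestor query per step that the slides use.

```python
def kangaroo_search(text: str, pattern: str, k: int):
    """Scan every alignment; report (window, mismatches, queries, verdict)."""
    n, m = len(text), len(pattern)

    def lce(i: int, j: int) -> int:
        # Naive stand-in for one lowest-common-ancestor query.
        length = 0
        while (i + length < n and j + length < m
               and text[i + length] == pattern[j + length]):
            length += 1
        return length

    report = []
    for start in range(n - m + 1):
        j = mismatches = queries = 0
        while j < m and mismatches <= k:
            queries += 1
            j += lce(start + j, j)    # jump over the matching run
            if j < m:                 # stopped on a mismatch inside the window
                mismatches += 1
                j += 1
        verdict = "return" if mismatches <= k else "discard"
        report.append((text[start:start + m], mismatches, queries, verdict))
    return report


report = kangaroo_search("ABCCBDCDBC", "ABCD", 2)
print([w for w, _, _, v in report if v == "return"])  # ['ABCC', 'BDCD']
print([q for _, _, q, _ in report])                   # [1, 3, 3, 3, 3, 3, 3]
```

No alignment needs more than k+1 = 3 queries, which matches the one or three slides spent on each alignment above.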


To use the Kangaroo method, we construct a suffix tree for the text T of length n and the pattern P of length m in O(n+m) time.

With the Kangaroo method, we take O(1) time to find one mismatch. We stop as soon as more than k mismatches have been found. Therefore, verifying one text location takes O(k) time.


Thus, the time complexity of finding all locations of the text T that match the pattern P with at most k mismatches is O(nk).
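The bound can be written out as a short sum. This is a sketch of the accounting, assuming m ≤ n and k ≥ 1, with at most k+1 constant-time queries per text location:

```latex
\underbrace{O(n+m)}_{\text{suffix tree and LCA preprocessing}}
\;+\;
\underbrace{(n-m+1)}_{\text{text locations}}
\cdot
\underbrace{O(k+1)}_{\text{queries per location}}
\;=\; O(n+m+nk) \;=\; O(nk).
```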


References

For Construction of Suffix Trees:

[M76] McCreight, E.M., A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23 (1976): 262-272.

[U95] Ukkonen, E., On-line Construction of Suffix Trees, Algorithmica 14 (1995): 249-260.

For Finding Lowest Common Ancestor:

[HT84] Harel, D. and Tarjan, R.E., Fast Algorithms for Finding Nearest Common Ancestors, SIAM Journal on Computing 13 (1984): 338-355.


For String Matching with k Mismatches:

[LV86] Landau, G.M., and Vishkin, U., Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986): 239-249.


Thank you