Suffix Trees ind Suffix Arriys -...

138
Suffix Trees ind Suffix Arriys

Transcript of Suffix Trees ind Suffix Arriys -...

Page 1: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Trees ind Suffix Arriys

Page 2: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Outline for Todiy

● Suffix Trfes● A simple diti structure for string seirching.

● Suffix Trees● A powerful ind fexible diti structure for

string ilgorithms.● Suffix Arrays

● A compict ilternitive to suffix trees.● Applfcatfons of Suffix Trees and Arrays

● There ire miny!

Page 3: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Recip from List Time

Page 4: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Tries

● A trfe is i tree thit stores i collection of strings over some ilphibet Σ.

● Eich node corresponds to i prefx of some string in the set.

● Tries ire sometimes cilled prefix trees, since eich node in i trie corresponds to i prefx of one of the words in the trie.

a

b

o

u

t

t

e

b

e

d

e

d

g

e

g

e

t

Page 5: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Aho-Corisick String Mitching

● The Aho-Corasfck strfng matchfng algorfthm is in ilgorithm for fnding ill occurrences of i set of strings P₁, …, Pₖ inside i string T.

● Runtime is ⟨O(n), O(m + z)⟩, where● m = |T|,● n = |P₁| + … + |Pₖ|, ind● z is the number of mitches.

● Greit for the cise where the pitterns ire fxed ind the text to seirch chinges.

Page 6: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

New Stuff!

Page 7: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Genomics Ditibises

● Miny string ilgorithms these diys ire developed for or used extensively in computitionil genomics.

● Typicilly, we hive i huge ditibise with miny very lirge strings (genomes) thit we'll preprocess to speed up future operitions.

● Common problem: given i fxed string T to seirch ind chinging pitterns P₁, …, Pₖ, fnd ill mitches of those pitterns in T.

● Questfon: Cin we insteid preprocess T to mike it eisy to seirch for viriible pitterns?

Page 8: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Tries

Page 9: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Substrings, Prefxes, ind Suffixes

● Useful Fact 1: Given i trie storing i set of strings S₁, S₂, …, Sₖ, it's possible to determine, in time O(|Q|), whether i query string Q is i prefx of iny Sᵢ.

● Useful Fact 2: A string P is i substring of i string T if ind only if T is i prefx of some suffix of P.● Specifcilly, write T = αPω; then T is i prefx

of the suffix Pω of T.

a

o a r s

tr

s o a r

Page 10: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Substrings, Prefxes, ind Suffixes

● Useful Fact 1: Given i trie storing i set of strings S₁, S₂, …, Sₖ, it's possible to determine, in time O(|Q|), whether i query string Q is i prefx of iny Sᵢ.

● Useful Fact 2: A string P is i substring of i string T if ind only if P is i prefx of some suffix of T.● Specifcilly, write T = αPω; then P is i prefx

of the suffix Pω of T.

α P ω

T

Page 11: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Tries

● A suffix trfe of T is i trie of ill the suffixes of T.

● Given iny pittern string P, we cin check in time O(|P|) whether P is i substring of T by seeing whether P is i prefx in T's suffix trie.

● (This checks whether P is i prefx of some suffix of T.)

e

n

s

e

n

e

s

n

n

o

s

e

e

n

s

e

s

o

n

e

n

s

e

s

s

e

n

s

e

nonsense

Page 12: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Tries

● A suffix trfe of T is i trie of ill the suffixes of T.

● More generilly, given iny nonempty pitterns P₁, …, Pₖ of totil length n, we cin detect how miny of those pitterns ire substrings of T in time O(n).

● (Finding ill mitches is i bit trickier; more on thit liter.)

e

n

s

e

n

e

s

n

n

o

s

e

e

n

s

e

s

o

n

e

n

s

e

s

s

e

n

s

e

nonsense

Page 13: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

A Typicil Trinsform

● Append some new chiricter $ ∉ Σ to the end of T, then construct the trie for T$.

● The new $ chiricter lexicogriphicilly precedes ill other chiricters.● This is usuilly cilled

the sentfnel; think of it like the Theorylind version of i null terminitor.

● Leif nodes correspond to suffixes.

● Internil nodes correspond to prefxes of those suffixes.

e

n

s

e

n

e

s

n

n

o

s

e

e

n

s

e

s

o

n

e

n

s

e

s

s

e

n

s

e

$

$

$

$

$

$

$

$

$

nonsense$

Page 14: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Constructing Suffix Tries

● Once we build i single suffix trie for string T, we cin efficiently detect whether pitterns mitch in time O(n).

● Questfon: How long does it tike to construct i suffix trie?

● Problem: There's in Ω(m2) lower bound on the worst-cise complexity of any ilgorithm for building suffix tries.

Page 15: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

ambm$

a

a

a

b

$

b

b b

$

b

b

$

b

b

$

b

b

b

b

A Degenerite Cise

b

$

b

b $

$

$

There ire Θ(m) copies of nodes

chiined together is bm$.

Spice usige: Ω(m2).

There ire Θ(m) copies of nodes

chiined together is bm$.

Spice usige: Ω(m2).

Page 16: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Correcting the Problem

● Beciuse suffix tries miy hive Ω(m2) nodes, ill suffix trie ilgorithms must run in time Ω(m2) in the worst-cise.

● Cin we reduce the number of nodes in the trie?

Page 17: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Pitricii Tries

● A “silly” node in i trie is i node thit his exictly one child.

● A Patrfcfa trfe (or radfix trfe) is i trie where ill “silly” nodes ire merged with their pirents.

n

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Page 18: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Trees

n

8

7

4

0

5

2

13

6

nonsense$012345678

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

● A suffix tree fori string T is in Pitricii trie of T$ where eich leif is libeled with the index where the corresponding suffix stirts in T$.

● (Note thit suffix trees iren’t the sime is suffix tries. To the best of my knowledge, suffix tries iren’t used inywhere.)

Page 19: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Properties of Suffix Trees

● If |T| = m, the suffix tree his exictly m + 1 leif nodes.

● For iny T ≠ ε, ill internil nodes in the suffix tree hive it leist two children.

● Number of nodes in i suffix tree is Θ(m).

n

8

7

4

0

5

2

13

6

nonsense$012345678

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

Page 20: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Tree Representitions

● Suffix trees miy hive Θ(m) nodes, but the libels on the edges cin hive size ω(1).

● This meins thit i niïve representition of i suffix tree miy tike ω(m) spice.

● Useful fact: Eich edge in i suffix tree is libeled with i consecutive ringe of chiricters from w.

● Trfck: Represent eich edge libeled with i string α is i piir of integers [stirt, end] representing where in the string α ippeirs.

Page 21: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Tree Representitions

7

4

0

5

2

3

6

o n s e n s e $

se

n s e $

$

n s e $

$ $ nse$

nonsense$012345678

n s e

$ e

8

1

o n s e n s e $8

844

00

18

start

end

child

$ e n o34

s

Page 22: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Building Suffix Trees

● Clafm: It’s possible to build i suffix tree for i string of length m in time Θ(m).

● These algorithms are not trivial! We'll discuss one of them next time.

Page 23: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Applfcatfon: String Seirch

Page 24: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

String Mitching

● Suppose we preprocess i string T by building i suffix tree for it.

● Given iny pittern string P of length n, we cin determine, in time O(n), whether n is i substring of P by looking it up in the suffix tree.

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Page 25: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

String Mitching

● Clafm: After spending O(m) time preprocessing T$, cin fnd all mitches of i string P in time O(n + z), where z is the number of mitches.

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Observatfon 1: Every occurrence of P in T is i prefx of some suffix of T.

Observatfon 1: Every occurrence of P in T is i prefx of some suffix of T.

Page 26: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

String Mitching

● Clafm: After spending O(m) time preprocessing T$, cin fnd all mitches of i string P in time O(n + z), where z is the number of mitches.

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Observatfon 2: Every suffix of T$ beginning with some pittern P

ippeirs in the subtree found by seirching for P.

Observatfon 2: Every suffix of T$ beginning with some pittern P

ippeirs in the subtree found by seirching for P.

Page 27: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

String Mitching

● Clafm: After spending O(m) time preprocessing T$, cin fnd all mitches of i string P in time O(n + z), where z is the number of mitches.

n

8

7

4

0

5

2

1

o n s e n s e $

se

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

3

6

n s e $

$

Page 28: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

String Mitching

● Clafm: After spending O(m) time preprocessing T$, cin fnd all mitches of i string P in time O(n + z), where z is the number of mitches.

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

0

5

2

o n s e n s e $

se

n s e $

$

Page 29: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Finding All Mitches

● To fnd ill mitches of string P, stirt by seirching the tree for P.

● If the seirch fills off the tree, report no mitches.● Otherwise, let v be the node it which the seirch

stops, or the endpoint of the edge where it stops if it ends in the middle of in edge.

● Do i DFS ind report the numbers of ill the leives found in this subtree. The indices reported this wiy give bick ill positions it which P occurs.

Page 30: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Finding All Mitches

To fnd ill mitches of string P, stirt by seirching the tree for P.

If the seirch fills off the tree, report no mitches.

Otherwise, let v be the node it which the seirch stops, or the endpoint of the edge where it stops if it ends in the middle of in edge.

● Do i DFS ind report the numbers of ill the leives found in this subtree. The indices reported this wiy give bick ill positions it which P occurs.

How fist is this step?

How fist is this step?

Page 31: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Clafm: The DFS to fnd ill leives in the subtreecorresponding to prefx P tikes time O(z),where z is the number of mitches.

Proof: If the DFS reports z mitches, it must hivevisited z different leif nodes.

Since eich internil node of i suffix tree his it leist two children, the totil number of internil nodes visited during the DFS is it most z – 1.

During the DFS, we don't need to ictuilly mitch the chiricters on the edges. We just follow the edges, which tikes time O(1).

Therefore, the DFS visits it most O(z) nodes ind edges ind spends O(1) time per node or edge, so the totil runtime is O(z). ■

Page 32: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Reverse Aho-Corisick

● Given pitterns P₁, … Pₖ of totil length n, suffix trees cin fnd ill mitches of those pitterns in time O(m + n + z).● Build the tree in time O(m), then seirch for

ill mitches of eich Pᵢ; totil time icross ill seirches is O(n + z).

● Acts is i “reverse” Aho-Corisick:● Aho-Corisick string mitching runs in time

⟨O(n), O(m+z)⟩ ● Suffix tree string mitching runs in time

⟨O(m), O(n+z)⟩

Page 33: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Another Applfcatfon:Longest Repeited Substring

Page 34: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

● Consider the following problem:

Given i string T, fnd the longest substring w of T thit ippeirs in it leist two different positions.

● Some eximples:● In monsoon, the longest repeited substring is on.● In banana, the longest repeited substring is ana. (The

substrings cin overlip.)● Applicitions to computitionil biology: more thin

hilf of the humin genome is formed from repeited DNA sequences!

Page 35: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Observatfon 1: If w is i repeited substring of T, it must be i prefx of it leist two different

suffixes.

Observatfon 1: If w is i repeited substring of T, it must be i prefx of it leist two different

suffixes.

Page 36: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Observatfon 2: If w is i repeited substring of T, it must correspond to i prefx of i pith to

in internil node.

Observatfon 2: If w is i repeited substring of T, it must correspond to i prefx of i pith to

in internil node.

Page 37: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Observatfon 3: If w is i longest repeited

substring, it corresponds to i full pith to in

internil node.

Observatfon 3: If w is i longest repeited

substring, it corresponds to i full pith to in

internil node.

Page 38: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

8

7

4

0

5

2

13

6

o n s e n s e $

n s e $

$

n s e $

$

o n s e n s e $

$

$ nse$

nonsense$012345678

n

se

s e

e

Observatfon 3: If w is i longest repeited

substring, it corresponds to i full pith to in

internil node.

Observatfon 3: If w is i longest repeited

substring, it corresponds to i full pith to in

internil node.

Page 39: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

● For eich node v in i suffix tree, let s(v) be the string thit it corresponds to.

● The strfng depth of i node v is defned is |s(v)|, the length of the string v corresponds to.

● The longest repeited substring in T cin be found by fnding the internil node in T with the miximum string depth.

Page 40: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Repeited Substring

● Here's in O(m)-time ilgorithm for solving the longest repeited substring problem:● Build the suffix tree for T in time O(m).● Run i DFS over the suffix tree, tricking the

string depth is you go, to fnd the internil node of miximum string depth.

● Recover the string thit node corresponds to.● Good eixercfse: How might you fnd the

longest substring of T thit repeits it leist k times?

Page 41: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Challenge Problem:

Solve this problem in lineir time without using suffix trees (or suffix irriys).

Page 42: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Time-Out for Announcements!

Page 43: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Problem Sets

● Problem Set 0 solutions will be up on the course website liter todiy.● We’ll try to get it grided ind returned is

soon is possible.● Problem Set 1 is due on Tuesdiy it

2:30PM.● Stop by office hours with questions!● Ask questions on Piizzi!

Page 44: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string
Page 45: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Bick to CS166!

Page 46: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Generilized Suffix Trees

Page 47: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Suffix Trees for Multiple Strings

● Suffix trees store informition ibout i single string ind exports i huge imount of structuril informition ibout thit string.

● However, miny ipplicitions require informition ibout the structure of multiple different strings.

Page 48: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Generilized Suffix Trees

● A generalfzed suffix tree for T₁, …, Tₖ is i Pitricii trie of ill suffixes of T₁$₁, …, Tₖ$ₖ. Eich Tᵢ his i unique end mirker.

● Leives ire tigged with ij, meining “ith suffix of string Tj”

8₁

7₁

0₁

5₁

2₁ 1₁

3₁

6₁

o n s e n s e $₁

se

n s e $₁

$₁

n s e $₁

$₁

o s e

$₁ e

$₁ n s e7₂

$₂

6₂

$₂

4₁ 3₂

$₁ $₂

1₂2₂

f e n s e $₂

e n s e $₂

f

4₂

$₂

n

0₂

n s e n s e $₁

ffense$₂

5₂

$₂

nonsense$₁012345678₁

nonsense$₁012345678₁

offense$₂01234567₂

offense$₂01234567₂

Page 49: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Generilized Suffix Trees

● Clafm: A generilized suffix tree for strings T₁, …, Tₖ of totil length m cin be constructed in time Θ(m).

● Use i two-phise ilgorithm:● Construct i suffix tree for the single string

T₁$₁T₂$₂ … Tₖ$ₖ in time Θ(m).– This will end up with some invilid suffixes.

● Do i DFS over the suffix tree ind prune the invilid suffixes.– Runs in time O(m) if implemented intelligently.

Page 50: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Applicitions of Generilized Suffix Trees

Page 51: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Common Substring

● Consider the following problem:

Given two strings T₁ ind T₂, fnd the longest string w thit is i substring of

both T₁ ind T₂.● Cin solve in time O(|T₁| · |T₂|) using

dynimic progrimming.● Cin we do better?

Page 52: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Common Substring

Observatfon: Any common substring of T₁ ind T₂ will be i prefx of i suffix of T₁ ind

i prefx of i suffix of T₂.

Observatfon: Any common substring of T₁ ind T₂ will be i prefx of i suffix of T₁ ind

i prefx of i suffix of T₂.

8₁

7₁

0₁

5₁

2₁ 1₁

3₁

6₁

o n s e n s e $₁

se

n s e $₁

$₁

n s e $₁

$₁

o s e

$₁ e

$₁ n s e7₂

$₂

6₂

$₂

4₁ 3₂

$₁ $₂

1₂2₂

f e n s e $₂

e n s e $₂

f

4₂

$₂

n

0₂

n s e n s e $₁

ffense$₂

5₂

$₂

nonsense$₁012345678

nonsense$₁012345678

offense$₂01234567

offense$₂01234567

Page 53: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

3₁

6₁

n s e $₁

$₁

5₂

$₂

1₁ 0₂

n s e n s e $₁

ffense$₂

2₁

n s e $₁

5₁

$₁

4₂

$₂

0₁

o n s e n s e $₁

f

1₂2₂

f e n s e $₂

e n s e $₂

7₁

$₁

6₂

$₂

4₁ 3₂

$₁ $₂

8₁

$₁

7₂

$₂

Longest Common Substring

se

o s ee

n s e

n

nonsense$₁012345678

nonsense$₁012345678

offense$₂01234567

offense$₂01234567

Observatfon: Any common substring of T₁ ind T₂ will be i prefx of i suffix of T₁ ind

i prefx of i suffix of T₂.

Observatfon: Any common substring of T₁ ind T₂ will be i prefx of i suffix of T₁ ind

i prefx of i suffix of T₂.

Page 54: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Longest Common Substring

● Build i generilized suffix tree for T₁ ind T₂ in time O(m).

● Annotite eich internil node in the tree with whether thit node his it leist one leif node from eich of T₁ ind T₂.● Tikes time O(m) using DFS.

● Run i DFS over the tree to fnd the mirked node with the highest string depth.● Tikes time O(m) using DFS

● Overill time: O(m).

Page 55: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Trees: The Catch

Page 56: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Space Usage

● Sufix trees are memrry hrgs.● Supprse Σ = {A, C, G, T, $}.● Each internal nrde needs 15 machine wrrds: frr

each character, wrrds frr the start/end index and a child printer.

This is still O(m), but it's a huge hidden crnstant!

88

44

00

18

start

end

child

A C T G34

$

Page 57: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Can we get the fexibility rf a sufix tree withrut the memrry crsts?

Page 58: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Yes… kinda!

Page 59: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Arrays

● A sufix array frr a string T is an array rf the sufixes rf T$, strred in srrted rrder.

● By crnventirn, $ precedes all rther characters.

ense$

se$

sense$4

6

3

nonsense$0onsense$1nsense$2

nse$5

$e$

87

Page 60: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Arrays

● A sufix array frr a string T is an array rf the sufixes rf T$, strred in srrted rrder.

● By crnventirn, $ precedes all rther characters.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

Page 61: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Representing Sufix Arrays

● Sufix arrays are typically represented implicitly by just strring the indices rf the sufixes in srrted rrder rather than the sufixes themselves.

● Space required: Θ(m).● Mrre precisely, space

frr T$, plus rne extra wrrd frr each character.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

Page 62: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Representing Sufix Arrays

● Sufix arrays are typically represented implicitly by just strring the indices rf the sufixes in srrted rrder rather than the sufixes themselves.

● Space required: Θ(m).● Mrre precisely, space

frr T$, plus rne extra wrrd frr each character. nonsense$

874052163

Page 63: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

● Recall: P is a substring rf T if it's a prefx rf a sufix rf T.

● All matches rf P in T have a crmmrn prefx, sr they'll be strred crnsecutively.

● Can fnd all matches rf P in T by dring a binary search rver the sufix array.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

Page 64: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Analyzing the Runtime

● The binary search will require O(lrg m) prrbes intr the sufix array.

● Each crmparisrn takes time O(n): have tr crmpare P against the current sufix.

● Time frr binary searching: O(n lrg m).● Time tr reprrt all matches after that print:

O(z).● Trtal time: O(n log m + z).

Page 65: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Why the Slrwdrwn?

Page 66: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

A Lrss rf Structure

● Many algrrithms rn sufix trees invrlve lrrking frr internal nrdes with varirus prrperties:● Lrngest repeated substring: internal nrde

with largest string depth.● Lrngest crmmrn substring: internal nrde

with largest string depth that has a child frrm each string.

● Because sufix arrays dr nrt strre the tree structure, we lrse access tr this infrrmatirn.

Page 67: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Trees and Sufix Arrays

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

Page 68: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Trees and Sufix Arrays

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

$e$ense$

onsense$se$sense$

874

163

nonsense$nse$nsense$

052

Page 69: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Trees and Sufix Arrays

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

$e$ense$

onsense$se$sense$

874

1630

5

2

o n s e n s e $

se

n s e $

$ nonsense$nse$nsense$

052

Page 70: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Trees and Sufix Arrays

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

nonsense$

$e$ense$

onsense$se$sense$

8740

163

nse$nsense$

52

Page 71: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Trees and Sufix Arrays

n

8

7

4

0

13

6

o n s e n s e $

se

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

nonsense$

$e$ense$

onsense$se$sense$

8740

163

5

2

n s e $

$ nse$nsense$

52

Page 72: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

The lrngest crmmrn prefx rf a range rf strings in a sufix array crrresprnds tr the lrwest crmmrn ancestrr rf thrse sufixes

in the sufix tree.

Page 73: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Prefxes

● Given twr strings x and y, the longest common prefx rr (LCP) rf x and y is the lrngest prefx rf x that is alsr a prefx rf y.

● The LCP rf x and y is denrted lcp(x, y).● Fun fact: There is an O(m)-time

algrrithm frr crmputing LCP infrrmatirn rn a sufix array.

● Let's see hrw it wrrks.

Page 74: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Pairwise LCP

● Fact: There is an algrrithm (due tr Kasai et al.) that crnstructs, in time O(m), an array rf the LCPs rf adjacent sufix array entries.

● The algrrithm isn't that crmplex, but the crrrectness argument is a bit nrntrivial.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

01013002

Page 75: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Pairwise LCP

● Claim: This infrrmatirn is enrugh frr us tr fgure rut the lrngest crmmrn prefx rf a range rf elements in the sufix array.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

01013002

Page 76: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Pairwise LCP

● Claim: This infrrmatirn is enrugh frr us tr fgure rut the lrngest crmmrn prefx rf a range rf elements in the sufix array.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

01013002

Page 77: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Pairwise LCP

● Claim: This infrrmatirn is enrugh frr us tr fgure rut the lrngest crmmrn prefx rf a range rf elements in the sufix array.

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

01013002Hey, look! It’s a

range minimum query problem!

Hey, look! It’s a range minimum query

problem!

Page 78: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Crmputing LCPs

● Tr preprrcess a sufix array tr supprrt O(1) LCP queries:● Use Kasai's O(m)-time algrrithm tr build the LCP

array.● Build an RMQ structure rver that array in time

O(m) using Fischer-Heun.● Use the precrmputed RMQ structure tr answer

LCP queries rver ranges.● Requires O(m) preprrcessing time and rnly

O(1) query time.

Page 79: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

● Recall: Can search a sufix array rf T frr all matches rf a pattern P in time O(n lrg m + z).

● If we've drne O(m) preprrcessing tr build the LCP infrrmatirn, we can speed this up.

Page 80: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Page 81: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Page 82: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Page 83: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

0

5

2

13

6

o n s e n s e $

se

n s e $

$

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

Page 84: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 85: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 86: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 87: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 88: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 89: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 90: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 91: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 92: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 93: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 94: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 95: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 96: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 97: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 98: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

nonsense$

$e$ense$

nse$nsense$onsense$se$sense$

874052163

nons

n

8

7

4

13

6

n s e $

$

o n s e n s e $

s e

$ e

$ nse$

nonsense$012345678

02

o n s e n s e $

n s e $

5

se

$

Page 99: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

● Intuitively, simulate dring a binary search rf the leaves rf a sufix tree, remembering the deepest subtree yru've matched sr far.

● At each print, if the binary search prrbes a leaf rutside rf the current subtree, skip it and crntinue the binary search in the directirn rf the current subtree.

● Tr implement this rn an actual sufix array, we use LCP infrrmatirn tr implicitly keep track rf where the brunds rn the current subtree are.

Page 100: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Searching a Sufix Array

● Claim: The algrrithm we just sketched runs in time O(n + lrg m + z).

● Proof Sketch: The O(lrg m) term crmes frrm the binary search rver the leaves rf the sufix tree. The O(n) term crrresprnds tr descending deeper intr the sufix tree rne character at a time. Finally, we have tr spend O(z) time reprrting matches. ■

Page 101: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Applicatirns:Longest Common Extensions

Page 102: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Given twr strings T₁ and T₂ and start prsitirns i and j, the longest common extension rf T₁ and T₂, starting at prsitirns i and j, is the length rf the lrngest string w that appears at prsitirn i in T₁ and prsitirn j in T₂.

● We'll denrte this value by LCET₁, T₂(i, j).

● Typically, T₁ and T₂ are fxed and multiple (i, j) queries are specifed.

n o n s e n s ee n s eo f f

Page 103: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Given twr strings T₁ and T₂ and start prsitirns i and j, the longest common extension rf T₁ and T₂, starting at prsitirns i and j, is the length rf the lrngest string w that appears at prsitirn i in T₁ and prsitirn j in T₂.

● We'll denrte this value by LCET₁, T₂(i, j).

● Typically, T₁ and T₂ are fxed and multiple (i, j) queries are specifed.

n o n s e n s ee n s eo f f

Page 104: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Given twr strings T₁ and T₂ and start prsitirns i and j, the longest common extension rf T₁ and T₂, starting at prsitirns i and j, is the length rf the lrngest string w that appears at prsitirn i in T₁ and prsitirn j in T₂.

● We'll denrte this value by LCET₁, T₂(i, j).

● Typically, T₁ and T₂ are fxed and multiple (i, j) queries are specifed.

n o n s e n s ee n s eo f f

Page 105: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Given twr strings T₁ and T₂ and start prsitirns i and j, the longest common extension rf T₁ and T₂, starting at prsitirns i and j, is the length rf the lrngest string w that appears at prsitirn i in T₁ and prsitirn j in T₂.

● We'll denrte this value by LCET₁, T₂(i, j).

● Typically, T₁ and T₂ are fxed and multiple (i, j) queries are specifed.

n o n s e n s ee n s eo f f

Page 106: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Given twr strings T₁ and T₂ and start prsitirns i and j, the longest common extension rf T₁ and T₂, starting at prsitirns i and j, is the length rf the lrngest string w that appears at prsitirn i in T₁ and prsitirn j in T₂.

● We'll denrte this value by LCET₁, T₂(i, j).

● Typically, T₁ and T₂ are fxed and multiple (i, j) queries are specifed.

n o n s e n s ee n s eo f f

Page 107: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Given twr strings T₁ and T₂ and start prsitirns i and j, the longest common extension rf T₁ and T₂, starting at prsitirns i and j, is the length rf the lrngest string w that appears at prsitirn i in T₁ and prsitirn j in T₂.

● We'll denrte this value by LCET₁, T₂(i, j).

● Typically, T₁ and T₂ are fxed and multiple (i, j) queries are specifed.

n o n s e n s ee n s eo f f

Page 108: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Observation: LCET₁, T₂(i, j) is the length rf the lrngest crmmrn prefx rf the sufixes rf T₁ and T₂ starting at prsitirns i and j.

n o n s e n s ee n s eo f f

Page 109: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Observation: LCET₁, T₂(i, j) is the length rf the lrngest crmmrn prefx rf the sufixes rf T₁ and T₂ starting at prsitirns i and j.

n s e n s en s e

Page 110: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Observation: LCET₁, T₂(i, j) is the length rf the lrngest crmmrn prefx rf the sufixes rf T₁ and T₂ starting at prsitirns i and j.

n s e n s en s e

Page 111: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Crmmrn Extensirns

● Observation: LCET₁, T₂(i, j) is the length rf the lrngest crmmrn prefx rf the sufixes rf T₁ and T₂ starting at prsitirns i and j.

n s e n s en s e

Page 112: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Arrays and LCE● Claim: There is an ⟨O(m), O(1)⟩

data structure frr LCE.

● Preprrcessing:

● Crnstruct a generalized sufix array frr T₁ and T₂ augmented with LCP infrrmatirn.

– (Just build a sufix array frr T₁$₁T₂$₂.)

● Then build a table mapping each index in the string tr its index in the sufix array.

● Query:

● Dr an RMQ rver the LCP array at the apprrpriate indices.

nonsense$₁

$₁

e$₁

ense$₁

nse$₁

nsense$₁onsense$₁se$₁

sense$₁

8₁

7₁

4₁

0₁5₁

2₁1₁6₁

3₁

ense$₂

tense$₂

nse$₂

se$₂

0₂

1₂

2₂

3₂

$₂5₂

e$₂4₂

00114013300220

tense$₂nonsense$₂1

2

Page 113: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Arrays and LCE

nonsense$₁

$₁

e$₁

ense$₁

nse$₁

nsense$₁onsense$₁se$₁

sense$₁

ense$₂

tense$₂

nse$₂

se$₂

$₂

e$₂

tense$₂nonsense$₂1

2

8₁

7₁

4₁

0₁5₁

2₁1₁6₁

3₁0₂

1₂

2₂

3₂

5₂

4₂

00114013300220

● Claim: There is an ⟨O(m), O(1)⟩ data structure frr LCE.

● Preprrcessing:

● Crnstruct a generalized sufix array frr T₁ and T₂ augmented with LCP infrrmatirn.

– (Just build a sufix array frr T₁$₁T₂$₂.)

● Then build a table mapping each index in the string tr its index in the sufix array.

● Query:

● Dr an RMQ rver the LCP array at the apprrpriate indices.

Page 114: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Arrays and LCE● Claim: There is an ⟨O(m), O(1)⟩

data structure frr LCE.

● Preprrcessing:

● Crnstruct a generalized sufix array frr T₁ and T₂ augmented with LCP infrrmatirn.

– (Just build a sufix array frr T₁$₁T₂$₂.)

● Then build a table mapping each index in the string tr its index in the sufix array.

● Query:

● Dr an RMQ rver the LCP array at the apprrpriate indices.

nonsense$₁

$₁

e$₁

ense$₁

nse$₁

nsense$₁onsense$₁se$₁

sense$₁

ense$₂

tense$₂

nse$₂

se$₂

$₂

e$₂

tense$₂nonsense$₂1

2

8₁

7₁

4₁

0₁5₁

2₁1₁6₁

3₁0₂

1₂

2₂

3₂

5₂

4₂

00114013300220

Page 115: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Sufix Arrays and LCE● Claim: There is an ⟨O(m), O(1)⟩

data structure frr LCE.

● Preprrcessing:

● Crnstruct a generalized sufix array frr T₁ and T₂ augmented with LCP infrrmatirn.

– (Just build a sufix array frr T₁$₁T₂$₂.)

● Then build a table mapping each index in the string tr its index in the sufix array.

● Query:

● Dr an RMQ rver the LCP array at the apprrpriate indices.

nonsense$₁

$₁

e$₁

ense$₁

nse$₁

nsense$₁onsense$₁se$₁

sense$₁

ense$₂

tense$₂

nse$₂

se$₂

$₂

e$₂

tense$₂nonsense$₂1

2

8₁

7₁

4₁

0₁5₁

2₁1₁6₁

3₁0₂

1₂

2₂

3₂

5₂

4₂

00114013300220

Page 116: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

An Application: Lrngest Palindrrmic Substring

Page 117: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrmes

● A palindrome is a string that's the same frrwards and backwards.

● A palindromic substring rf a string T is a substring rf T that's a palindrrme.

● Surprisingly, rf great imprrtance in crmputatirnal birlrgy.

A C U G UGAC

Page 118: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrmes

● A palindrome is a string that's the same frrwards and backwards.

● A palindromic substring rf a string T is a substring rf T that's a palindrrme.

● Surprisingly, rf great imprrtance in crmputatirnal birlrgy.

A C U G UGAC

Page 119: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrmes

● A palindrome is a string that's the same frrwards and backwards.

● A palindromic substring rf a string T is a substring rf T that's a palindrrme.

● Surprisingly, rf great imprrtance in crmputatirnal birlrgy.

A C U G

U G A C

Page 120: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrmes

● A palindrome is a string that's the same frrwards and backwards.

● A palindromic substring rf a string T is a substring rf T that's a palindrrme.

● Surprisingly, rf great imprrtance in crmputatirnal birlrgy.

A C U G

U G A C

Page 121: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Lrngest Palindrrmic Substring

● The longest palindromic substring prrblem is the frllrwing:

Given a string T, fnd the lrngest substring rf T that is a palindrrme.

● Hrw might we srlve this prrblem?

Page 122: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

An Initial Idea

● Tr deal with the issues rf strings gring frrwards and backwards, start rf by frrming T and TR, the reverse rf T.

● Initial Idea: Find the lrngest crmmrn substring rf T and TR.

● Unfrrtunately, this dresn't wrrk:● T = abcdba● TR = abdcba● Lrngest crmmrn substring: ab / ba● Lrngest palindrrmic substring: a / b / c / d

Page 123: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

● Frr nrw, let's frcus rn even-length palindrrmes.

● An even-length palindrrme substring wwR rf a string T has a center and radius:● Center: The sprt between the duplicated center

character.● Radius: The length rf the string gring rut in

each directirn.● Idea: Frr each center, fnd the largest

crrresprnding radius.

Page 124: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

Page 125: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

Page 126: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

Page 127: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

Page 128: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c bw

Page 129: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 130: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 131: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 132: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 133: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 134: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 135: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 136: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Palindrrme Centers and Radii

a b b a c ac b c c b

a b b aca cbccb

w

wR

Page 137: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

An Algrrithm

● In time O(m), crnstruct TR.● Preprrcess T and TR in time O(m) tr supprrt LCE

queries.● Frr each sprt between twr characters in T, fnd

the lrngest palindrrme centered at that lrcatirn by executing LCE queries rn the crrresprnding lrcatirns in T and TR.● Each query takes time O(1) if it just reprrts the

length.● Trtal time: O(m).

● Reprrt the lrngest string frund this way.● Trtal time: O(m).

Page 138: Suffix Trees ind Suffix Arriys - web.stanford.eduweb.stanford.edu/class/archive/cs/cs166/cs166.1186/lectures/03/...Outline for Todiy Suffix Trfes A simple diti structure for string

Next Time

● Constructing Sufix Trees● Hrw rn earth dr yru build sufix trees in

time O(m)?● Constructing Sufix Arrays

● Start by building sufix arrays in time O(m)...● Constructing LCP Arrays

● … and adding in LCP arrays in time O(m).