Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight...

33
Building Suffix Trees in O(m) time • Weiner had first linear time algorithm in 1973 • McCreight developed a more space efficient algorithm in 1976 • Ukkonen developed a simpler to understand variant in 1995 – This is what we will focus on
  • date post

    22-Dec-2015
  • Category

    Documents

  • view

    218
  • download

    1

Transcript of Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight...

Page 1: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Building Suffix Trees in O(m) time

• Weiner had first linear time algorithm in 1973

• McCreight developed a more space efficient algorithm in 1976

• Ukkonen developed a simpler to understand variant in 1995– This is what we will focus on

Page 2: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Implicit Suffix Trees

• An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by– removing $ from all edge labels– removing any edges that now have no label– removing any node that does not still have at least two

children • Some suffixes may no longer be leaves

• An implicit suffix tree for prefix S[1..i] of S is similarly defined based on the suffix tree for S[1..i]$

• Ii will denote the implicit suffix tree for S[1..i]

Page 3: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Example• Implicit tree for xabxa from tree for xabxa$• {xabxa$, abxa$, bxa$, xa$, a$, $}

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

Page 4: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Remove $• Remove $• {xabxa$, abxa$, bxa$, xa$, a$, $}

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

Page 5: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Remove $ after• Remove $• {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

6

5

4

Page 6: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Remove unlabeled edges• Remove unlabeled edges• {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

6

5

4

Page 7: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Remove unlabeled edges• Remove unlabeled edges• {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

Page 8: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Remove interior nodes• Remove internal nodes with only one child• {xabxa, abxa, bxa, xa, a}

1

bb

bx

x

xx

a

aa

aa

23

Page 9: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Final implicit tree• Remove internal nodes with only one child• {xabxa, abxa, bxa, xa, a}

1

b b

bx

x

xxa a a

aa

23

Page 10: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Basic Structure of Algorithm

• Initialization– I1 has one edge labeled S(1)

• For i = 1 to m-1 (build Ii+1)– For j = 1 to i+1

• Find location of string S[j..i] in tree• “Extend” to incorporate character S(i+1)

• Expand final implicit tree to make full suffix tree

Page 11: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Order of operations visualization

• S = xabxacdefghixabcab$• i+1 = 1

– x• i+1 = 2

– extend x to xa– a

• i+1 = 3– extend xa to xab– extend a to ab– b

• …

Page 12: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Extension Rules

• Case 1: S[j..i] ends at a leaf– Add character S(i+1) to end of label on leaf edge

• Case 2: Not a leaf, but no path from end of S[j..i] location continues with S(i+1)– Split a new leaf edge for character S(i+1)

– May need to create an internal node if S[j..i] ends in the middle of an edge

• Case 3: S[j..i+1] is already in the tree– No update

Page 13: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Visualization

• Implicit tree for axabxb from tree for axabx

1

bb

bx x

x

xa a

a

2

3x

bx4b

b

b

b

Rule 1: at a leaf node

b

Rule 3: already in treeRule 2: add a leaf edge (and an interior node)

b

5

Page 14: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Observations

• Once S[j..i] is located in the tree, extending to accommodate S(i+1) is constant time

• Making Ukkonen’s algorithm O(m2)– Finding the S[j..i] locations in the suffix trees

quickly when explicit computation is needed

Page 15: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Edge Label Representation

• Potential Problem– Size of edge labels may be (m2) – Example

• S = abcdefghijklmnopqrstuvwxyz– Total length is j<m+1 j = m(m+1)/2

– Similar problem can happen when the length of the string is arbitrarily larger than the alphabet size

• Solution– Label edges with pair of indices indicating beginning

and end of positions of the substring in S

Page 16: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Full edge label illustration

• String S = xabxa$

1

bb

bx

x

xx

a

aa

aa$

$

$

$

$$

23

6

5

4

Page 17: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Compact edge label illustration

• String S = xabxa$

1

(1,2) [or (4,5)?]

(3,6)

(2,2)(6,6)

23

65

4

(3,6)(6,6)

(6,6) (3,6)

Page 18: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Modified Extension Rules

• Rule 2: new leaf edge– label new leaf edge (i+1, i+1)

• Rule 1: leaf edge extension– label had to be (p,i) before extension

• given rule 2 above and an induction argument

– now will be (p, i+1)

• Rule 3: still nothing needs to be done

Page 19: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Suffix Links

• Consider the two strings and x• Suppose some internal node v of the tree is

labeled with x and another node s(v) in the tree is labeled with

• Then the edge (v,s(v)) is a suffix link• Do all internal nodes (the root is not

considered an internal node) have suffix links?

Page 20: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Suffix Link Lemma

• If a new internal node v with path-label x is added to the current tree in extension j of some phase i+1, then– the path labeled already exists at an internal

node of the tree or– the internal node labeled will be created in

the extension of j+1 or– string is empty and s(v) is the root

Page 21: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Proof of Suffix Link Lemma

• A new internal node is created only by extension rule 2• This means there are two distinct suffixes of S[1..i+1]

that start with x– xS(i+1) and xc where c is not S(i+1)

• This means there are two distinct suffixes of S[1..i+1] that start with – S(i+1) and c where c is not S(i+1)

• Thus, if is not empty, will label an internal node once extension j+1 is processed which is the extension of

Page 22: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Using suffix links to speed up location of S[j..i]

• S[1..i] must end at a leaf since it is the longest string in implicit tree Ii

– Keep a pointer to this leaf in all cases and extend according to rule 1

• Locating S[j+1..i] from S[j..i] which is at node w– If w is an internal node, set v to w– Otherwise, set v = parent(w)– If v is the root, you must traverse from root to find S[j+1..i]– If not, go to s(v) and begin search for remaining portion of

S[j..i] from there• Remaining portion is the label of the edge we traversed up• (see figure 6.5 on page 100)

Page 23: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Skip/count Trick

• Problem: Moving down from s(v) naively takes time proportional to the number of characters compared

• Solution– At each node, only compare the first character in an edge

label to the next character to be checked– Then, use the number of characters on that edge to update

search in constant time– Running time is now proportional to the number of nodes

in the path searched rather than the number of characters• See Figure 6.6 on page 102

Page 24: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

O(m2) argument

• node-depth of v: number of nodes on path from root to node v

• Lemma: For any suffix link (v, s(v)) traversed in Ukkonen’s algorithm, at that moment, nd(v) <= nd(s(v))+1– If x is an ancestral internal node of v where

is not empty, then it has a suffix link to a node with path-label

• See Figure 6.7 on page 103

Page 25: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

O(m2) argument

• Lemma: Any phase takes O(m) time with skip/count trick

• Proof– Decrements to node depth at most 2m

• i+1 <= m extensions per phase• Walking up decreases node depth at most 1• Suffix link traversal decreases node depth at most 1

– At most 3m downward edge traversal• Max node depth is m• None are negative• Each downward traversal increases depth by at least 1

Page 26: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Observation

• Making Ukkonen’s algorithm O(m)– Implicit computations of many extensions– Need to take argument for a single phase and

extend to multiple phases

Page 27: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Rule 3

• Suppose suffix extension rule 3 applies to S[j..i+1]. – This means S[j..i+1] already appears in the implicit suffix

tree as a prefix of a larger suffix (and is thus a substring of S[1..i])

• Then it applies to S[k..i+1] for k > j.– Clearly, S[k..i+1] must also be a substring of S[1..i] and

thus must be in the tree

• Thus, stop a phase once the first application of rule 3 occurs– All future rule 3 extensions are done implicitly, not

explicitly

Page 28: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Implicit expansion of leaf nodes

• Once a leaf, always a leaf– It will always be extended using rule 1

• Implicit expansion of leaf nodes– When a leaf edge is created in phase i+1, instead of

labeling it with (p, i+1), label it with (p,e)– e is a global index that is set to i+1 once in each phase– In later phases, we will not need to explicitly extend this

leaf but rather can implicitly extend it by incrementing e once in its global location

– How can we easily identify leaf nodes to avoid explicitly expanding them in later phases?

Page 29: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Avoiding leaf nodes

• For phase i, let last(i) denote the last extension of phase i that is not by rule 3

• Observation: – All the suffixes S[j..i] for 1 <= j <= last(i) end at a leaf

node– All the extensions for 1 <= j <= last(i) in phase i+1 and

higher are rule 1 extensions by previous slide– Therefore, last(i+1) >= last(i)

• Trick– In phase i+1, only explicitly compute extensions for

last(i)+1 up till first rule 3 extension is found

Page 30: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Single phase algorithm

• Phase i+1– Increment e to i+1 (implicitly extending all existing

leaves)– Explicitly compute successive extensions starting at

last(i)+1 and continuing until a rule 3 extension or no more extensions needed

• Exact location is known given next step in previous phase

– Set last(i+1) appropriately (last rule 1 or 2 extension)

• Observation– Phase i and i+1 share at most 1 explicit extension– In all phases, there will be at most 2m extensions

Page 31: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Visualization of explicit extensions

• S = xabxacdefghixabcab$• Phase 1: extend x (rule 1)• Phase 2: extend a (rule 2)• Phase 3: extend b (rule 2)• Phase 4: extend x (rule 3)• Phase 5: extend x (rule 3)• Phase 6: extend x (rule 2), extend a (rule 2) extend c

(rule 2)• Phase 7: extend d (rule 2)• …

Page 32: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

O(m) argument

• Lemma: All phases take O(m) time • Proof

– Similar to previous node-depth argument– Key new observation

• From one phase to the next, we either start at root or we work with same node we ended with last time so the node depth is identical

• With 2m total extensions at most, we get O(m) upward and downward traversals of links

Page 33: Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.

Finishing up

• Convert final implicit suffix tree to a true suffix tree

• Add $ using just another phase of execution– Now all suffixes will be leaves

• Replace e in every leaf edge with m– Just requires a traversal of tree which is O(m)

time