On-line Construction of Suffix Trees

Chairman: Prof. R.C.T. Lee
Speaker: C. S. Wu (吳展碩)
June 10, 2004
Dept. of CSIE, National Chi Nan University

Source

E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-260, 1995.

Outline

Introduction
Suffix tries and suffix trees
Constructing suffix tries
    Quadratic time
On-line construction of suffix trees
    Linear time

Notations

$T = t_1 t_2 \cdots t_n$ is a string over an alphabet $\Sigma$.
$T^i$ denotes the prefix $t_1 \cdots t_i$ of $T$ for $0 \le i \le n$.
Example: for $T = abcde$, $T^3 = abc$.
$T_i$ denotes the suffix $t_i \cdots t_n$ of $T$, where $1 \le i \le n + 1$.
Example: for $T = abcde$, $T_3 = cde$.

Notations (cont.)

$T_{n+1} = \varepsilon$ is the empty suffix.
The set of all suffixes of $T$ is denoted $\sigma(T)$.
Example: for $T = abcde$, $\sigma(T) = \{abcde, bcde, cde, de, e, \varepsilon\}$.
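The notation maps directly onto string slicing. The following is a minimal sketch in Python; the helper names `prefix`, `suffix` and `sigma` are illustrative assumptions and do not appear in the slides.

```python
def prefix(T, i):
    """T^i = t_1 ... t_i, for 0 <= i <= n (1-based index i)."""
    return T[:i]

def suffix(T, i):
    """T_i = t_i ... t_n, for 1 <= i <= n + 1; T_{n+1} is the empty suffix."""
    return T[i - 1:]

def sigma(T):
    """All suffixes of T, longest first, ending with the empty suffix."""
    return [suffix(T, i) for i in range(1, len(T) + 2)]

print(prefix("abcde", 3))  # abc
print(suffix("abcde", 3))  # cde
print(sigma("abcde"))
```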

Suffix Tries & Suffix Trees

[Figure: a suffix trie (left) and the corresponding suffix tree (right); states are labelled with the substrings they represent: ab, ababc, abc, b, babc, bc, c.]

Suffix Tries

The suffix trie of $T$ is a trie representing $\sigma(T)$. We define such a trie as an augmented deterministic finite-state automaton:

$STrie(T) = (Q \cup \{\perp\}, root, F, g, f)$

Suffix Tries (cont.)

$Q$ is the set of states of $STrie(T)$, in one-to-one correspondence with the substrings of $T$; $\bar{x}$ is the state that corresponds to a substring $x$.
$\perp$ is an auxiliary state.
$root$ is the initial state and corresponds to the empty string $\varepsilon$.
$F$ is the set of final states and corresponds to $\sigma(T)$.

Suffix Tries (cont.)

$g$ is the transition function: $g(\bar{x}, a) = \bar{y}$ for all $\bar{x}, \bar{y}$ in $Q$ such that $y = xa$, where $a \in \Sigma$.
$f$ is the suffix function: let $\bar{x} \ne root$. Then $x = ay$ for some $a \in \Sigma$, and we set $f(\bar{x}) = \bar{y}$. In addition, $f(root) = \perp$.
We call $f(r)$ the suffix link of state $r$.

Suffix Tries (cont.)

[Figure: $STrie(T)$ for $T = abcabd$ with its suffix links. Note: only the last layer of suffix links is shown explicitly.]

$STrie(T) = (Q \cup \{\perp\}, root, F, g, f)$

Boundary Path

We call the path that starts from the deepest state $\overline{t_1 \cdots t_{i-1}}$ and ends at $\perp$ the boundary path.
The boundary path consists of the last layer of suffix links.

Constructing Suffix Tries

Observation: $\sigma(T^i) = \sigma(T^{i-1})\, t_i \cup \{\varepsilon\}$

Example: $\sigma(T^{i-1})$ = abcd, bcd, cd, d, $\varepsilon$ and $\sigma(T^i)$ = abcde, bcde, cde, de, e, $\varepsilon$.
The states corresponding to $\sigma(T^{i-1})$ form the boundary path.
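The observation can be checked mechanically on small strings. The sketch below is only an illustration; `suffix_set` and `extend` are hypothetical helper names, and suffixes are computed by brute force rather than with a trie.

```python
def suffix_set(s):
    """sigma(s): every suffix of s, including the empty suffix."""
    return {s[i:] for i in range(len(s) + 1)}

def extend(prev_suffixes, ch):
    """sigma(T^{i-1}) t_i together with the empty suffix."""
    return {x + ch for x in prev_suffixes} | {""}

T = "abcde"
for i in range(1, len(T) + 1):
    # T[:i] is T^i and T[i - 1] is t_i (Python strings are 0-based).
    assert suffix_set(T[:i]) == extend(suffix_set(T[:i - 1]), T[i - 1])
```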

Constructing Suffix Tries (cont.)

Algorithm 1.

    r ← top;
    while g(r, t_i) is undefined do
        create new state r' and new transition g(r, t_i) = r';
        if r ≠ top then create new suffix link f(oldr') = r';
        oldr' ← r';
        r ← f(r);
    create new suffix link f(oldr') = g(r, t_i);
    top ← g(top, t_i).
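Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the speaker's implementation: the class `State` and the functions `build_suffix_trie`, `count_states` and `trie_contains` are names invented here, and the auxiliary state $\perp$ is modelled by special-casing it in `g` so that it has a transition to root on every character.

```python
class State:
    def __init__(self):
        self.next = {}    # transition function g: character -> State
        self.link = None  # suffix function f (suffix link)

def build_suffix_trie(text):
    """Build STrie(text) on-line, one character at a time (Algorithm 1)."""
    bottom, root = State(), State()
    root.link = bottom

    def g(state, ch):
        # Assumption: bottom has a ch-transition to root for every ch.
        return root if state is bottom else state.next.get(ch)

    top = root  # deepest state, i.e. the whole prefix read so far
    for ch in text:
        r, old_new = top, None
        while g(r, ch) is None:        # walk down the boundary path
            new = State()
            r.next[ch] = new
            if old_new is not None:    # same as "r != top" in the pseudocode
                old_new.link = new
            old_new = new
            r = r.link
        if old_new is not None:
            old_new.link = g(r, ch)
        top = g(top, ch)
    return root

def count_states(root):
    """Number of states reachable from root (bottom is not counted)."""
    stack, count = [root], 0
    while stack:
        state = stack.pop()
        count += 1
        stack.extend(state.next.values())
    return count

def trie_contains(root, pattern):
    """True if pattern is a substring of the text the trie was built from."""
    state = root
    for ch in pattern:
        state = state.next.get(ch)
        if state is None:
            return False
    return True
```

For example, `trie_contains(build_suffix_trie("abcabd"), "cab")` follows three transitions from the root and succeeds, while `"ac"` fails at the second character.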

Constructing Suffix Tries (cont.)

[Figures: step-by-step construction of the suffix trie for $T = abcabd$, one slide per prefix (a, ab, abc, abca, abcab, abcabd), showing the positions of top and r at each step. The boundary path is colored orange.]

Constructing Suffix Tries (cont.)

Theorem 1
The suffix trie $STrie(T)$ can be constructed in time proportional to the size of $STrie(T)$, which in the worst case is $O(|T|^2)$.

Note: The number of nodes in $STrie(T)$ is the number of distinct substrings of $T$. A string $T$ of length $n$ has $O(n^2)$ substrings. Thus the size of $STrie(T)$ is $O(n^2)$.
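To see that the quadratic bound is actually reached, one can count distinct substrings directly. The sketch below is a brute-force illustration (the function name `suffix_trie_size` is an assumption); for $T = a^n b^n$ the distinct substrings are exactly the strings $a^p b^q$ with $0 \le p, q \le n$, so the count is $(n+1)^2$ while $|T| = 2n$.

```python
def suffix_trie_size(T):
    """Number of states of STrie(T), not counting the auxiliary state:
    one per distinct substring of T, including the empty string (root)."""
    n = len(T)
    return len({T[i:j] for i in range(n + 1) for j in range(i, n + 1)})

for n in (2, 4, 8, 16):
    T = "a" * n + "b" * n
    # The size grows as (n + 1)^2, i.e. quadratically in len(T) = 2n.
    print(len(T), suffix_trie_size(T))
```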

Suffix Trees

The suffix tree $STree(T)$ represents $STrie(T)$ in space linear in the length $|T|$.
It represents only a subset $Q' \cup \{\perp\}$ of the states of $STrie(T)$.
$Q'$ consists of all branching states and all leaves of $STrie(T)$.
The states in $Q' \cup \{\perp\}$ are called the explicit states.
The other states of $STrie(T)$ are called implicit states as states of $STree(T)$; they are not explicitly present in $STree(T)$.

Suffix Trees (cont.)

[Figure: the suffix trie and the suffix tree of the same string side by side, with the implicit states and the explicit states marked.]

Suffix Trees (cont.)

The string $w = t_k \cdots t_p$ between two explicit states $s$ and $r$ is represented in $STree(T)$ as a generalized transition $g'(s, w) = r$.
To save space, the string $w = t_k \cdots t_p$ is actually represented as a pair $(k, p)$ of pointers to $T$.
A transition $g'(s, (k, p)) = r$ is called an $a$-transition if $t_k = a$.
Each $s$ can have at most one $a$-transition for each $a \in \Sigma$.
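A pointer pair can be decoded back into its label with one slice. The sketch below is illustrative only; `label` is a hypothetical helper, and the pairs used are the ones that label the transitions of the suffix tree of $T = ababc$ in these slides.

```python
T = "ababc"

def label(k, p, text=T):
    """Decode the pointer pair (k, p) into the string t_k ... t_p (1-based, inclusive)."""
    return text[k - 1:p]

print(label(1, 2))  # ab
print(label(2, 2))  # b
print(label(3, 5))  # abc
print(label(5, 5))  # c
```

Each transition therefore costs two integers regardless of the length of its label, which is what keeps the tree linear in $|T|$.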

Suffix Trees (cont.)

Suffix function: defined only for branching states $\bar{x} \ne root$ as $f'(\bar{x}) = \bar{y}$, where $\bar{y}$ is a branching state such that $x = ay$ for some $a \in \Sigma$; in addition, $f'(root) = \perp$.
If $\bar{x}$ is a branching state, then $f'(\bar{x})$ is also a branching state. These suffix links are explicitly represented.

The suffix tree of $T$ is denoted as

$STree(T) = (Q' \cup \{\perp\}, root, g', f')$

Size of Suffix Trees

[Figure: $STree(T)$ for $T = ababc$, with transitions labelled by pointer pairs: (1,2) for ab, (2,2) for b, (3,5) for abc and (5,5) for c. The a-, b- and c-transitions are distinguished in the legend.]

Size of Suffix Trees (cont.)

The size of $STree(T)$ is linear in $|T|$.
$Q'$ has at most $|T|$ leaves, and therefore $Q'$ contains at most $|T| - 1$ branching states.
There can be at most $2|T| - 2$ transitions between the states in $Q'$.

Reference to a State

We refer to a state $r$ of a suffix tree by a reference pair $(s, w)$.
$s$ is some explicit state that is an ancestor of $r$.
$w$ is the string spelled out by the transitions from $s$ to $r$ in the corresponding suffix trie.
A reference pair is canonical if $s$ is the closest ancestor of $r$.
The pair $(s, \varepsilon)$ is represented as $(s, (p + 1, p))$.

States on the Boundary Path

Let $s_1 = \overline{t_1 \cdots t_{i-1}}, s_2, s_3, \ldots, s_i = root, s_{i+1} = \perp$ be the states of $STrie(T^{i-1})$ on the boundary path.
Let $j$ be the smallest index such that $s_j$ is not a leaf.
Let $j'$ be the smallest index such that $s_{j'}$ has a $t_i$-transition.
We call state $s_j$ the active point and $s_{j'}$ the end point of $STrie(T^{i-1})$.

States on the Boundary Path

Lemma 1
Algorithm 1 adds to $STrie(T^{i-1})$ a $t_i$-transition for each of the states $s_h$, $1 \le h < j'$.
For $1 \le h < j$, the new transition expands an old branch of the trie that ends at leaf $s_h$.
For $j \le h < j'$, the new transition initiates a new branch from $s_h$.
Algorithm 1 does not create any other transitions.

States on the Boundary Path

Algorithm 1 inserts two different groups of $t_i$-transitions into $STrie(T^{i-1})$:
First group: the states on the boundary path before the active point $s_j$ get a transition.
Second group: the states from the active point $s_j$ to the end point $s_{j'}$, the end point excluded, get a new transition.

States on the Boundary Path

[Figure: $STrie(T^{i-1})$ for $T^{i-1} = abcab$ and $t_i = d$, marking the active point, the end point, the last layer of suffix links (the boundary path), and the states of the first and second groups.]

[Figure: $STrie(T^i)$ after adding $t_i = d$ to $T^{i-1} = abcab$, with the first and second groups marked. The new transitions and new nodes are colored green.]

Adding Transitions to $STree(T^{i-1})$

The first group requires no explicit change to $STree(T^{i-1})$.
A transition of $STree(T^{i-1})$ leading to a leaf is called an open transition.
Such a transition is of the form $g'(s, (k, i-1)) = r$.
Instead, open transitions are represented as $g'(s, (k, \infty)) = r$, where $\infty$ indicates that the transition is 'open to grow'.

Open Transitions

[Figure: $STree(T^{i-1})$ for $T^{i-1} = abcab$ and $t_i = d$, with transitions ab (1,2) and b (2,2), open transitions (3,$\infty$) spelling cab, and the active point, end point, first group and second group marked.]

[Figure: $STree(T^i)$ after adding $t_i = d$; the open transitions (3,$\infty$) now spell cabd without being modified. The new transitions and new nodes are colored green.]

Adding Transitions to $STree(T^{i-1})$ (cont.)

Create new branches for the second group.
The states $s_h$ of the second group may be explicit or implicit.
They are found along the boundary path using reference pairs and suffix links.
Let $(s, w)$ be the canonical reference pair for $s_h$, $j \le h < j'$. Then $(s, w) = (s, (k, i-1))$ for some $k \le i$.
If $(s, (k, i-1))$ already refers to the end point $s_{j'}$, we are done. Otherwise a new branch has to be created.
If $(s, (k, i-1))$ refers to an implicit state, a new explicit state is created by splitting the transition. Then a $t_i$-transition is created.

On-Line Construction of Suffix Trees (cont.)

Lemma 2
Let $(s, (k, i-1))$ be a reference pair of the end point $s_{j'}$ of $STree(T^{i-1})$. Then $(s, (k, i))$ is a reference pair of the active point of $STree(T^i)$.

Proof.
$s_j$ is the active point of $STree(T^{i-1})$ if and only if $s_j$ corresponds to the longest suffix of $T^{i-1}$ that occurs at least twice in $T^{i-1}$.
$s_{j'}$ is the end point of $STree(T^{i-1})$ if and only if $s_{j'}$ corresponds to the longest suffix $t_{j'} \cdots t_{i-1}$ of $T^{i-1}$ such that $t_{j'} \cdots t_{i-1} t_i$ is a substring of $T^{i-1}$.
If $s_{j'}$ is the end point of $STree(T^{i-1})$, then $t_{j'} \cdots t_{i-1} t_i$ is the longest suffix of $T^i$ that occurs at least twice in $T^i$; that is, state $g(s_{j'}, t_i)$ is the active point of $STree(T^i)$.
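The two characterizations used in the proof can be evaluated by brute force, which gives a way to check Lemma 2 on small inputs. This is a sketch under assumptions: `occurrences`, `active_point` and `end_point` are hypothetical names, states are identified with the strings they represent, the empty string stands for root, and `None` stands for the auxiliary state $\perp$ (reached when $t_i$ does not occur in $T^{i-1}$ at all).

```python
def occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    return sum(text.startswith(pattern, i)
               for i in range(len(text) - len(pattern) + 1))

def active_point(prefix):
    """Longest suffix of prefix occurring at least twice; '' (root) otherwise."""
    for j in range(len(prefix) + 1):
        suffix = prefix[j:]
        if suffix == "" or occurrences(prefix, suffix) >= 2:
            return suffix

def end_point(prefix, ch):
    """Longest suffix s of prefix such that s + ch is a substring of prefix.
    Returns None for the auxiliary state when even ch alone does not occur."""
    for j in range(len(prefix) + 1):
        suffix = prefix[j:]
        if suffix + ch in prefix:
            return suffix
    return None

T = "abcabd"
for i in range(1, len(T) + 1):
    e = end_point(T[:i - 1], T[i - 1])
    # Lemma 2: following the t_i-transition from the end point gives the
    # next active point; from the auxiliary state that transition leads to root.
    expected = "" if e is None else e + T[i - 1]
    assert active_point(T[:i]) == expected
```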

Constructing Suffix Trees

[Figures: step-by-step construction of $STree(T)$ for $T = abcabd$, for $i = 0$ to $6$ (prefixes a, ab, abc, abca, abcab, abcabd), showing the values of $s$, $k$ and $i$ and the active point and end point at each step. Leaves hang from open transitions of the form $(k, \infty)$; in the final tree the transitions include (1,2), (2,2), (3,$\infty$) and (6,$\infty$).]

On-Line Construction of Suffix Trees

Algorithm 2. Construction of $STree(T)$ for string $T = t_1 t_2 \cdots \#$ in alphabet $\Sigma = \{t_{-1}, \ldots, t_{-m}\}$; # is the end marker.

    create states root and ⊥;
    for j ← 1, ..., m do
        create transition g'(⊥, (-j, -j)) = root;
    create suffix link f'(root) = ⊥;
    s ← root; k ← 1; i ← 0;
    while t_{i+1} ≠ # do
        i ← i + 1;
        (s, k) ← update(s, (k, i));
        (s, k) ← canonize(s, (k, i)).

On-Line Construction of Suffix Trees (cont.)

    procedure update(s, (k, i)):
        (s, (k, i - 1)) is the canonical reference pair for the active point;
        oldr ← root; (end-point, r) ← test-and-split(s, (k, i - 1), t_i);
        while not (end-point) do
            create new transition g'(r, (i, ∞)) = r' where r' is a new state;
            if oldr ≠ root then create new suffix link f'(oldr) = r;
            oldr ← r;
            (s, k) ← canonize(f'(s), (k, i - 1));
            (end-point, r) ← test-and-split(s, (k, i - 1), t_i);
        if oldr ≠ root then create new suffix link f'(oldr) = s;
        return (s, k).

On-Line Construction of Suffix Trees (cont.)

    procedure test-and-split(s, (k, p), t):
        if k ≤ p then
            let g'(s, (k', p')) = s' be the t_k-transition from s;
            if t = t_{k'+p-k+1} then return (true, s)
            else
                replace the t_k-transition above by transitions
                    g'(s, (k', k' + p - k)) = r and g'(r, (k' + p - k + 1, p')) = s'
                where r is a new state;
                return (false, r)
        else
            if there is no t-transition from s then return (false, s)
            else return (true, s).

On-Line Construction of Suffix Trees (cont.)

    procedure canonize(s, (k, p)):
        if p < k then return (s, k)
        else
            find the t_k-transition g'(s, (k', p')) = s' from s;
            while p' - k' ≤ p - k do
                k ← k + p' - k' + 1;
                s ← s';
                if k ≤ p then
                    find the t_k-transition g'(s, (k', p')) = s' from s;
            return (s, k).
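Algorithm 2 and its three procedures can be put together in a compact Python sketch. Several choices here are assumptions for illustration and are not in the slides: the names `Node`, `build_suffix_tree` and `contains`, storing each transition as a tuple `(k, p, child)` keyed by its first character, using `math.inf` for the open end $\infty$, padding the text with a dummy character so that indices are 1-based, and modelling $\perp$ by special-casing instead of creating one transition per alphabet symbol. The loop runs over the whole input rather than stopping at an end marker.

```python
import math

class Node:
    def __init__(self):
        self.edges = {}   # first character -> (k, p, child)
        self.link = None  # suffix link f'

def build_suffix_tree(text):
    t = "\0" + text       # dummy at index 0 so that t[1] is t_1
    root, bottom = Node(), Node()
    root.link = bottom

    def transition(s, idx):
        # Assumption: bottom has a one-character transition to root for any character.
        return (idx, idx, root) if s is bottom else s.edges[t[idx]]

    def canonize(s, k, p):
        if p < k:
            return s, k
        k2, p2, s2 = transition(s, k)
        while p2 - k2 <= p - k:
            k += p2 - k2 + 1
            s = s2
            if k <= p:
                k2, p2, s2 = transition(s, k)
        return s, k

    def test_and_split(s, k, p, ch):
        if k <= p:
            k2, p2, s2 = s.edges[t[k]]
            if ch == t[k2 + p - k + 1]:
                return True, s
            r = Node()    # split the transition at the implicit state
            s.edges[t[k2]] = (k2, k2 + p - k, r)
            r.edges[t[k2 + p - k + 1]] = (k2 + p - k + 1, p2, s2)
            return False, r
        if s is bottom:
            return True, s
        return ch in s.edges, s

    def update(s, k, i):
        oldr = root
        end_point, r = test_and_split(s, k, i - 1, t[i])
        while not end_point:
            r.edges[t[i]] = (i, math.inf, Node())  # open transition
            if oldr is not root:
                oldr.link = r
            oldr = r
            s, k = canonize(s.link, k, i - 1)
            end_point, r = test_and_split(s, k, i - 1, t[i])
        if oldr is not root:
            oldr.link = s
        return s, k

    s, k = root, 1
    for i in range(1, len(t)):
        s, k = update(s, k, i)
        s, k = canonize(s, k, i)
    return root, t

def contains(root, t, pattern):
    """True if pattern is a substring of the indexed text."""
    node, pos, n = root, 0, len(t) - 1
    while pos < len(pattern):
        if pattern[pos] not in node.edges:
            return False
        k, p, child = node.edges[pattern[pos]]
        edge_label = t[k:min(p, n) + 1]
        chunk = pattern[pos:pos + len(edge_label)]
        if not edge_label.startswith(chunk):
            return False
        pos += len(chunk)
        node = child
    return True

root, t = build_suffix_tree("abcabd")
print(contains(root, t, "cab"))  # True
print(contains(root, t, "ac"))   # False
```

For `"abcabd"` the root ends up with the transitions (1,2) for ab, (2,2) for b, and open transitions starting at 3 and 6, in line with the construction figures.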

Time Complexity

Theorem 2
Algorithm 2 constructs the suffix tree $STree(T)$ for a string $T = t_1 \cdots t_n$ on-line in time $O(n)$.

Proof.
The procedure update is called $n$ times. It takes time proportional to the total number of the visited states.

Time Complexity Analysis

[Figure: the prefixes a, ab, abc, abca, abcab, abcabd drawn as a diagram of height $n$ and width $n + 1$.]

Time Complexity Analysis

[Figure: the boundary path, with the active point $s_j$ and the end point $s_{j'}$ marked.]

Let $r_{i-1}$ be the string corresponding to the active point.
The string corresponding to the end point is $(r_i)^{i-1}$, that is, $r_i$ without its last character (Lemma 2). Note: $r_i = (r_i)^{i-1} t_i$.
So the number of the visited states in loop $i$ is

$$\mathrm{length}(r_{i-1}) - (\mathrm{length}(r_i) - 1) + 1 = \mathrm{length}(r_{i-1}) - \mathrm{length}(r_i) + 2.$$

The total number of the visited states is

$$\sum_{i=1}^{n} \left( \mathrm{length}(r_{i-1}) - \mathrm{length}(r_i) + 2 \right) = \mathrm{length}(r_0) - \mathrm{length}(r_n) + 2n \le 2n,$$

since $r_0 = \varepsilon$ has length 0.

Conclusion

A suffix tree can be constructed in linear time by employing
suffix links,
open transitions for leaf nodes,
implicit nodes, which rely on active points and end points.

Suffix trees have many applications:
string searching,
finding repeated substrings.
Many applications appear in Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, Cambridge, 1997.

Any Questions?

Thank You