Lecture 14 Full Text - courses.cs.ut.ee · • Dan Gusfield:Algorithms on Strings, Trees, and...
11/27/16
Text Algorithms (6EAP)
Full text indexing
Jaak Vilo, 2016 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
Problem
• Given P and S, find all exact or approximate occurrences of P in S
• You are allowed to preprocess S (and P, of course)
• Goal: to speed up the searches
E.g. Dictionary problem
• Does P belong to a dictionary D = {d1, …, dn}?
– Build a binary search tree of D
– B-Tree of D
– Hashing
– Sorting + binary search
• Build a keyword trie: search in O(|P|)
– Assuming the alphabet has up to a constant size c
– See the Aho-Corasick algorithm, trie construction
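Sorted array and binary search
[Figure: the dictionary stored as a sorted array, positions 1 to 13]
 1 global
 2 happy
 3 he
 4 head
 5 header
 6 hers
 7 his
 8 index
 9 info
10 informal
11 search
12 show
13 stop
Search time: O(|P| log n)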
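The sorted-array lookup can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function name `in_dictionary` is an assumption, and the word list is the one from the slide. Each of the O(log n) probes compares up to |P| characters, which gives the O(|P| log n) bound.

```python
from bisect import bisect_left

# The 13 dictionary words from the slide, already in sorted order.
WORDS = ["global", "happy", "he", "head", "header", "hers", "his",
         "index", "info", "informal", "search", "show", "stop"]

def in_dictionary(sorted_words, pattern):
    """Return True if pattern is one of the words (hypothetical helper)."""
    # bisect_left performs the binary search: O(log n) string comparisons.
    i = bisect_left(sorted_words, pattern)
    return i < len(sorted_words) and sorted_words[i] == pattern

print(in_dictionary(WORDS, "hers"))  # a word of the dictionary
print(in_dictionary(WORDS, "her"))   # only a prefix of a word
```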
Trie for D = {he, hers, his, she}
[Figure: keyword trie for D with nodes numbered 0 to 9 and arcs labeled by single characters (h, e, r, s, i)]
Search time: O(|P|)
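A keyword trie for the same dictionary can be sketched with nested dictionaries. This is an illustrative sketch only: the names `build_trie` and `trie_contains` are assumptions, and the end-of-word marker "$" is assumed not to occur in the alphabet. The lookup follows one arc per pattern character, hence O(|P|).

```python
END = "$"  # assumed end-of-word marker, not part of the alphabet

def build_trie(words):
    """Build a keyword trie as nested dicts: node[char] -> child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True  # mark that a dictionary word ends here
    return root

def trie_contains(root, pattern):
    """Follow one arc per character of the pattern: O(|P|) steps."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

trie = build_trie(["he", "hers", "his", "she"])
print(trie_contains(trie, "hers"), trie_contains(trie, "her"))
```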
S != set of words
• S is of length n
• How to index?
• Index from every position of the text
• A prefix of every possible suffix is important
[Figure: Trie(babaab), the trie of all suffixes of babaab]
Suffix tree
• Definition: A compact representation of a trie corresponding to the suffixes of a given string, where all nodes with one child are merged with their parents.
• Definition (suffix tree). A suffix tree T for a string S (with n = |S|) is a rooted, labeled tree with a leaf for each non-empty suffix of S. Furthermore, a suffix tree satisfies the following properties:
– Each internal node, other than the root, has at least two children;
– Each edge leaving a particular node is labeled with a non-empty substring of S of which the first symbol is unique among all first symbols of the edge labels of the edges leaving this particular node;
– For any leaf in the tree, the concatenation of the edge labels on the path from the root to this leaf exactly spells out a non-empty suffix of S.
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198.
Literature on suffix trees
• http://en.wikipedia.org/wiki/Suffix_tree
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. (pages 89-208)
• E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14:249-60, 1995. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.751
• Ching-Fung Cheung, Jeffrey Xu Yu, Hongjun Lu. "Constructing Suffix Tree for Gigabyte Sequences with Megabyte Memory," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, pp. 90-105, January 2005. http://www2.computer.org/portal/web/csdl/doi/10.1109/TKDE.2005.3
• CPM articles archive: http://www.cs.ucr.edu/~stelo/cpm/
• Mark Nelson. Fast String Searching With Suffix Trees. Dr. Dobb's Journal, August 1996. http://www.dogma.net/markn/articles/suffixt/suffixt.htm
• http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
The suffix tree Tree(T) of T
• the data structure suffix tree, Tree(T), is a compacted trie that represents all the suffixes of string T
• linear size: |Tree(T)| = O(|T|)
• can be constructed in linear time O(|T|)
• has myriad virtues (A. Apostolico)
• is well-known: 366 000 Google hits
E. Ukkonen: http://www.cs.helsinki.fi/u/ukkonen/Erice2005.ppt
Suffix tree and suffix array techniques for pattern analysis in strings
Esko Ukkonen, Univ Helsinki
Erice School, 30 Oct 2005
Partly based on:
High-throughput genome-scale sequence analysis and mapping using compressed data structures
Veli Mäkinen, Department of Computer Science, University of Helsinki
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
Analysis of a string of symbols
• T = hattivatti ('text')
• P = att ('pattern')
• Find the occurrences of P in T: hattivatti
• Pattern synthesis: #(t) = 4, #(atti) = 2, #(t****t) = 2
ISMB 2009 Tutorial, Veli Mäkinen: "...analysis and mapping..."
Solution: backtracking with suffix tree
...ACACATTATCACAGGCATCGGCATTAGCGATCGAGTCG.....
Pattern finding & synthesis problems
• T = t1 t2 … tn, P = p1 p2 … pm, strings of symbols in a finite alphabet
• Indexing problem: preprocess T (build an index structure) such that the occurrences of different patterns P can be found fast
– static text, any given pattern P
• Pattern synthesis problem: learn from T new patterns that occur surprisingly often
• What is a pattern? Exact substring, approximate substring, with generalized symbols, with gaps, …
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
Suffix trie and suffix tree
[Figure: Trie(abaab), the trie of all suffixes of abaab, next to Tree(abaab), the same trie with unary paths compacted into single edges labeled a, ab, baab and baab]
Trie(T) can be large
• |Trie(T)| = O(|T|^2)
• bad example: T = a^n b^n
• Trie(T) can be seen as a DFA: language accepted = the suffixes of T
• minimize the DFA => directed acyclic word graph ('DAWG')
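The quadratic blow-up on the bad example can be checked directly. The sketch below (the function name `suffix_trie_size` is an assumption) counts trie nodes by brute force, using the fact that every node of Trie(T) corresponds to one distinct substring of T, plus the root. For T = a^n b^n the distinct substrings are a^i, b^j and a^i b^j, which gives (n+1)^2 nodes.

```python
def suffix_trie_size(text):
    """Number of nodes of the suffix trie: distinct substrings plus the root."""
    substrings = {text[i:j]
                  for i in range(len(text))
                  for j in range(i + 1, len(text) + 1)}
    return len(substrings) + 1

# Compare the brute-force count with the closed form (n+1)^2.
for n in range(1, 6):
    print(n, suffix_trie_size("a" * n + "b" * n), (n + 1) ** 2)
```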
Tree(T) is of linear size
• only the internal branching nodes and the leaves are represented explicitly
• edges are labeled by substrings of T
• v = node(α) if the path from the root to v spells α
• one-to-one correspondence of leaves and suffixes
• |T| leaves, hence < |T| internal nodes
• |Tree(T)| = O(|T| + size(edge labels))
Tree(hattivatti)
Suffixes of hattivatti (text positions 1 to 10):
 1 hattivatti
 2 attivatti
 3 ttivatti
 4 tivatti
 5 ivatti
 6 vatti
 7 atti
 8 tti
 9 ti
10 i
[Figure: Tree(hattivatti), the suffix tree with one leaf per suffix and edge labels such as hattivatti, atti, t, ti, i and vatti]
Substring labels of edges are represented as pairs of pointers into the text, e.g. vatti = [6,10], atti = [2,5], ti = [4,5], t = [3,3].
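Tree(T) is a full text index
[Figure: the path for P in Tree(T) ends above a subtree whose leaves (e.g. 31 and 8) are the occurrences of P]
P occurs in T at locations 8, 31, …
P occurs in T ⇔ P is a prefix of some suffix of T ⇔ a path for P exists in Tree(T)
All occurrences of P can be found in time O(|P| + #occ)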
Find att from Tree(hattivatti)
[Figure: following the path a, t, t from the root of Tree(hattivatti) ends inside the edge labeled atti; the leaves below it are 2 and 7, so att occurs in hattivatti at positions 2 and 7]
Linear time construction of Tree(T)
[Figure: the suffixes of hattivatti and the resulting Tree(hattivatti)]
• Weiner (1973), 'algorithm of the year'
• McCreight (1976)
• 'on-line' algorithm (Ukkonen 1992)
On-line construction of Trie(T)
• T = t1 t2 … tn $
• Pi = t1 t2 … ti, the i:th prefix of T
• on-line idea: update Trie(Pi) to Trie(Pi+1)
• => very simple construction
Trie(abaab)
[Figure sequence: Trie(a), Trie(ab), Trie(aba), Trie(abaa) and finally Trie(abaab), each obtained from the previous one by adding the next symbol]
• a chain of links connects the end points of the current suffixes
• Trie(abaa): add next symbol = b
• from here on the b-arc already exists
• result: Trie(abaab)
What happens in Trie(Pi) => Trie(Pi+1)?
[Figure: the trie before and after adding the symbol ai; new nodes and new suffix links are created along the chain of suffix links]
• from here on the ai-arc exists already => stop updating here
• time: O(size of Trie(T))
• suffix links: slink(node(aα)) = node(α)
On-line procedure for suffix trie
Create Trie(t1): nodes root and v, an arc son(root, t1) = v, and suffix links slink(v) := root and slink(root) := root
for i := 2 to n do begin
    vi-1 := leaf of Trie(t1…ti-1) for string t1…ti-1 (i.e., the deepest leaf)
    v := vi-1; v´ := 0
    while node v has no outgoing arc for ti do begin
        Create a new node v´´ and an arc son(v, ti) = v´´
        if v´ ≠ 0 then slink(v´) := v´´
        v := slink(v); v´ := v´´
    end
    for the node v´´ such that v´´ = son(v, ti) do
        if v´´ = v´ then slink(v´) := root else slink(v´) := v´´
end
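The on-line suffix trie procedure can be sketched in Python as follows. This is an illustrative variant, not the lecture's code: the class and function names are assumptions, and instead of slink(root) = root the root has no suffix link (None), so that falling off the chain at the root is detected explicitly. The construction takes time proportional to the size of the trie, so it is quadratic in the worst case.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # symbol -> TrieNode
        self.slink = None   # suffix link; the root keeps None in this sketch

def build_suffix_trie_online(text):
    """Update Trie(P_i) to Trie(P_i+1) one symbol at a time."""
    root = TrieNode()
    deepest = root  # node spelling the whole prefix read so far
    for symbol in text:
        v = deepest
        previous_new = None
        # Walk the suffix-link chain until a node already has the symbol-arc.
        while v is not None and symbol not in v.children:
            new_node = TrieNode()
            v.children[symbol] = new_node
            if previous_new is not None:
                previous_new.slink = new_node
            previous_new = new_node
            v = v.slink
        # The last new node links to the existing child, or to the root.
        if previous_new is not None:
            previous_new.slink = root if v is None else v.children[symbol]
        deepest = deepest.children[symbol]
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())

def contains(root, pattern):
    """True if pattern is a substring of the indexed text."""
    node = root
    for symbol in pattern:
        if symbol not in node.children:
            return False
        node = node.children[symbol]
    return True

trie = build_suffix_trie_online("abaab")
print(count_nodes(trie), contains(trie, "baa"), contains(trie, "bb"))
```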
Suffix trees on-line
• 'compacted version' of the on-line trie construction: simulate the construction on the linear size tree instead of the trie => time O(|T|)
• all trie nodes are conceptually still needed => implicit and real nodes
Implicit and real nodes
• Pair (v,α) is an implicit node in Tree(T) if v is a node of Tree(T) and α is a (proper) prefix of the label of some arc from v. If α is the empty string then (v,α) is a 'real' node (= v).
• Let v = node(α´) in Tree(T). Then implicit node (v,α) represents node(α´α) of Trie(T)
Implicit node
[Figure: the path α´ leads from the root to node v; the implicit node (v,α) lies α further along an arc leaving v]
Suffix links and open arcs
[Figure: node v with path label aα has a suffix link slink(v) to the node with path label α; w is a leaf]
label [i,*] instead of [i,j] if w is a leaf and j is the scanned position of T
Big picture
[Figure: the suffix link path through the tree during construction]
• suffix link path traversed: total work O(n)
• new arcs and nodes created: total work O(size(Tree(T)))
On-line procedure for suffix tree
Input: string T = t1 t2 … tn $
Output: Tree(T)
Notation: son(v,α) = w iff there is an arc from v to w with label α
          son(v,ε) = v
Function Canonize(v, α):
    while son(v, α´) ≠ 0 where α = α´ α´´, |α´| > 0 do
        v := son(v, α´); α := α´´
    return (v, α)
Suffix-tree on-line: main procedure
Create Tree(t1); slink(root) := root
(v, α) := (root, ε)   /* (v, α) is the start node */
for i := 2 to n+1 do
    v´ := 0
    while there is no arc from v with label prefix αti do
        if α ≠ ε then   /* divide the arc w = son(v, αη) into two */
            son(v, α) := v´´; son(v´´, ti) := v´´´; son(v´´, η) := w
        else
            son(v, ti) := v´´´; v´´ := v
        if v´ ≠ 0 then slink(v´) := v´´
        v´ := v´´; v := slink(v); (v, α) := Canonize(v, α)
    if v´ ≠ 0 then slink(v´) := v
    (v, α) := Canonize(v, αti)   /* (v, α) = start node of the next round */
http://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english
The actual time and space
• |Tree(T)| is about 20|T| in practice
• brute-force construction is O(|T| log |T|) for random strings, as the average depth of internal nodes is O(log |T|)
• the difference between linear and brute-force constructions is not necessarily large (Giegerich & Kurtz)
• truncated suffix trees: a k symbols long prefix of each suffix is represented (Na et al. 2003)
• alphabet independent linear time (Farach 1997)
[Figure residue: example strings abc and abcabxabcd]
Applications of Suffix Trees
• Dan Gusfield: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Hardcover, 534 pages, 1st edition (January 15, 1997). Cambridge Univ Pr (Short); ISBN: 0521585198. - book
• APL1: Exact string matching. Search for P from text S. Solution 1: build STree(S); one achieves the same O(n+m) as Knuth-Morris-Pratt, for example!
• Search from the suffix tree is O(|P|)
• APL2: Exact set matching. Search for a set of patterns P
Back to backtracking
[Figure: suffix tree of CATACT with leaves 4, 2, 1, 5, 3, 6; the backtracking search follows every branch that can still match the pattern within the allowed number of mismatches]
ACA, 1 mismatch
The same idea can be used for many other forms of approximate search, like Smith-Waterman, position-restricted scoring matrices, regular expression search, etc.
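The backtracking idea can be sketched on an uncompacted suffix trie, which is simpler to write down than a suffix tree but uses quadratic space. All names below are assumptions made for illustration; positions are 0-based. The search descends into every child and spends one unit of the mismatch budget whenever the arc symbol differs from the pattern symbol.

```python
class Node:
    def __init__(self):
        self.children = {}  # symbol -> Node
        self.starts = []    # start positions of suffixes passing through here

def build_suffix_trie(text):
    """Naive O(n^2) suffix trie; a stand-in for the suffix tree of the slide."""
    root = Node()
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            node = node.children.setdefault(symbol, Node())
            node.starts.append(start)
    return root

def search_with_mismatches(root, pattern, k):
    """Return 0-based start positions where pattern matches with <= k mismatches."""
    hits = []

    def walk(node, depth, budget):
        if depth == len(pattern):
            hits.extend(node.starts)
            return
        for symbol, child in node.children.items():
            cost = 0 if symbol == pattern[depth] else 1
            if cost <= budget:  # otherwise prune this branch (backtrack)
                walk(child, depth + 1, budget - cost)

    walk(root, 0, k)
    return sorted(hits)

trie = build_suffix_trie("CATACT")
print(search_with_mismatches(trie, "ACA", 1))
```

With the text CATACT and pattern ACA, one mismatch admits the substrings ATA and ACT.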
Applications of Suffix Trees
• APL3: Substring problem for a database of patterns. Given a set of strings S = S1, ..., Sn (a database), find all Si that have P as a substring
• A generalized suffix tree contains all suffixes of all Si
• Query in time O(|P|), and can identify the LONGEST common prefix of P in all Si
Applications of Suffix Trees
• APL4: Longest common substring of two strings
• Find the longest common substring of S and T.
• Overall there are potentially O(n^2) such substrings, if n is the length of the shorter of S and T
• Donald Knuth once (1970) conjectured that a linear-time algorithm is impossible.
• Solution: construct the STree(S+T) and find the node deepest in the tree that has suffixes from both S and T in its subtree leaves.
• Ex: S = superiorcalifornialives and T = sealiver both have the substring alive.
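The same idea can be illustrated without building a suffix tree. The sketch below swaps in a simpler technique: it sorts the suffixes of S, a separator and T, and inspects neighbouring suffixes that come from different strings, which mirrors "deepest node with leaves from both S and T". It is a naive sketch, far from linear time; the function name and the separator character are assumptions.

```python
def longest_common_substring(s, t, sep="\0"):
    """Naive sketch: sorted suffixes of s + sep + t (sep assumed absent from s and t)."""
    text = s + sep + t
    n = len(s)
    order = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for a, b in zip(order, order[1:]):
        # Only neighbouring suffixes from different strings are of interest.
        if a == n or b == n or (a < n) == (b < n):
            continue
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k] and text[a + k] != sep):
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best

print(longest_common_substring("superiorcalifornialives", "sealiver"))
```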
Simple analysis task: LCSS
• Let LCSS(A,B) denote the longest common substring of two sequences A and B. E.g.:
– LCSS(AGATCTATCT, CGCCTCTATG) = TCTAT.
• A good solution is to build the suffix tree for the shorter sequence and make a descending suffix walk with the other sequence.
Suffix link
[Figure: a suffix link leads from the node with path label aX to the node with path label X]
Descending suffix walk
[Figure: suffix tree of A, with the deepest node v reached during the walk marked]
• Read B left-to-right, always going down in the tree when possible.
• If the next symbol of B does not match any edge label at the current position, take the suffix link and try again. (The suffix link from the root to itself emits a symbol.)
• The node v encountered with the largest string depth is the solution.
Applications of Suffix Trees
• APL5: Recognizing DNA contamination. Related to DNA sequencing: search for the longest strings (longer than a threshold) that are present in the DB of sequences of other genomes.
• APL6: Common substrings of more than two strings. Generalization of APL4; can be done in time linear in the total length of all strings
Another common tool: Generalized suffix tree
ACCTTA....ACCT#CACATT..CAT#TGTCGT...GTA#TCACCACC...C$
[Figure: the node reached by the path A, C, C carries node info: subtree size 47813871, sequence count 87]
Generalized suffix tree application
...ACC..#...ACC...#...ACC...ACC..ACC..#..ACC..ACC...#...ACC...#...
...#....#...#...#...ACC...#...#...#...#...#...#..#..ACC..ACC...#......#...
[Figure: the node reached by the path A, C, C carries node info: subtree size 4398, blue sequences 12/15, red sequences 2/62]
Case study continued
[Figure: genome; regions with ChIP-seq matches; suffix tree of genome; a node on the path TAC..........T with counts 5 blue, 1 red; motif?]
Applications of Suffix Trees
• APL7: Building a directed graph for exact matching: suffix graph, i.e. the directed acyclic word graph (DAWG), the smallest finite state automaton recognizing all suffixes of a string S. This automaton can recognize membership, but not tell which suffix was matched.
• Construction: merge isomorphic subtrees.
• Subtrees are isomorphic in the suffix tree when there exists a suffix link path between them and the subtrees have an equal number of leaves.
Applications of Suffix Trees
• APL8: A reverse role for suffix trees, and major space reduction. Index the pattern, not the text...
• Matching statistics.
• APL10: All-pairs suffix-prefix matching. For all pairs Si, Sj, find the longest matching suffix-prefix pair. Used in shortest common superstring generation (e.g. DNA sequence assembly), EST alignment, etc.
Applications of Suffix Trees
• APL11: Finding all maximal repetitive structures in linear time
• APL12: Circular string linearization, e.g. for circular chemical molecules in the database one wants to linearize them in a canonical way...
• APL13: Suffix arrays - more space reduction; will touch that separately
Applications of Suffix Trees
• APL14: Suffix trees in genome-scale projects
• APL15: A Boyer-Moore approach to exact set matching
• APL16: Ziv-Lempel data compression
• APL17: Minimum length encoding of DNA
Applications of Suffix Trees
• Additional applications. Mostly exercises...
• Extra feature: CONSTANT time lowest common ancestor retrieval (LCA)
A data structure that allows finding the lowest common ancestor in constant time (this corresponds to the longest common prefix!) can be built in linear time.
• APL: Longest common extension: a bridge to inexact matching
• APL: Finding all maximal palindromes in linear time
A palindrome reads the same to the left and to the right from its central position. E.g.: kirik, saippuakivikauppias.
• Build the suffix tree of S and inverted S (aabcbad => aabcbad#dabcbaa); using the LCA, one can ask for the longest common prefix of any position pair (i, 2i-1) in constant time.
• The whole problem can be solved in O(n).
Applications of Suffix Trees
• APL: Exact matching with wildcards
• APL: The k-mismatch problem
• Approximate palindromes and repeats
• Faster methods for tandem repeats
• A linear-time solution to the multiple common substring problem
• And many-many more...
Properties of suffix tree
• A suffix tree has n leaves and at most n-1 internal nodes, where n is the total length of all sequences indexed.
• Each node requires a constant number of integers (pointers to first child, sibling, parent, text range of incoming edge, statistics counters, etc.).
• Can be constructed in linear time.
Properties of suffix tree... in practice
• Huge overhead due to pointer structure:
– A standard implementation of a suffix tree for the human genome requires over 200 GB memory!
– A careful implementation (using log n -bit fields for each value and array layout for the tree) still requires over 40 GB.
– The human genome itself takes less than 1 GB using 2 bits per bp.
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
Suffixes - sorted
• Sort all suffixes. This allows binary search!
Suffixes of hattivatti in text order: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i, ε
Suffixes in sorted order: ε, atti, attivatti, hattivatti, i, ivatti, ti, tivatti, tti, ttivatti, vatti
Suffix array: example
• suffix array = lexicographic order of the suffixes
rank  start  suffix
  1    11    ε
  2     7    atti
  3     2    attivatti
  4     1    hattivatti
  5    10    i
  6     5    ivatti
  7     9    ti
  8     4    tivatti
  9     8    tti
 10     3    ttivatti
 11     6    vatti
Suffix array: 11 7 2 1 10 5 9 4 8 3 6
Suffix array construction: sort!
• suffix array = lexicographic order of the suffixes
[Figure: the suffixes of hattivatti at text positions 1 to 11 are sorted; the sorted sequence of start positions is the suffix array]
Suffix array: 11 7 2 1 10 5 9 4 8 3 6
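Construction by plain sorting can be sketched in a couple of lines. This is a naive illustration (each comparison may cost O(|T|), so it is not one of the linear time constructions discussed later); the function name is an assumption, and positions are 1-based with the empty suffix at position n+1, as on the slide.

```python
def suffix_array(text):
    """1-based suffix array including the empty suffix at position n+1."""
    n = len(text)
    # Sort start positions by the suffix that begins there.
    return sorted(range(1, n + 2), key=lambda i: text[i - 1:])

print(suffix_array("hattivatti"))
```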
Suffix array
• suffix array SA(T) = an array giving the lexicographic order of the suffixes of T
• space requirement: 5|T|
• practitioners like suffix arrays (simplicity, space efficiency)
• theoreticians like suffix trees (explicit structure)
Reducing space: suffix array
[Figure: suffix tree of the text CATACT (positions 1 to 6) with leaves 4, 2, 1, 5, 3, 6; the sequence of leaves is kept as the suffix array, and edge labels are stored as text ranges, e.g. A = [2,2], T = [3,3], ACT = [4,6], CT = [5,6], T = [6,6], TACT = [3,6], ATACT = [2,6]]
Suffix array
• Many algorithms on suffix trees can be simulated using a suffix array...
– ...and a couple of additional arrays...
– ...forming a so-called enhanced suffix array...
– ...leading to a space requirement similar to that of a careful implementation of a suffix tree
• Not a satisfactory solution to the space issue.
Pattern search from suffix array
[Figure: binary search for att over the sorted suffixes of hattivatti; suffix array 11 7 2 1 10 5 9 4 8 3 6]
The suffixes that start with att (atti and attivatti) are adjacent in the sorted order, so binary search finds them.
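The binary search can be sketched as two bisections that locate the block of suffixes starting with the pattern. This is an illustrative sketch with assumed names; positions are 0-based here, and the suffix array is built by naive sorting to keep the example self-contained. The search itself takes O(|P| log |T|) character comparisons.

```python
def build_suffix_array(text):
    """Naive 0-based suffix array of the non-empty suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return sorted 0-based start positions of pattern in text."""
    m = len(pattern)

    def prefix(rank):
        # First m symbols of the suffix with the given rank.
        return text[sa[rank]:sa[rank] + m]

    lo, hi = 0, len(sa)
    while lo < hi:                      # first rank with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                      # first rank with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "hattivatti"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "att"))
```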
What we learn today?
• We learn that it is possible to replace suffix trees with compressed suffix trees that take 8.8 GB for the human genome.
• We learn that backtracking can be done using compressed suffix arrays requiring only 2.1 GB for the human genome.
• We learn that discovering interesting motif seeds from the human genome takes 40 hours and requires 9.3 GB space.
Recent suffix array constructions
• Manber & Myers (1990): O(|T| log |T|)
• linear time via suffix tree
• January / June 2003: direct linear time construction of suffix array
- Kim, Sim, Park, Park (CPM03)
- Kärkkäinen & Sanders (ICALP03)
- Ko & Aluru (CPM03)
Kärkkäinen-Sanders algorithm
1. Construct the suffix array of the suffixes starting at positions i mod 3 ≠ 0. This is done by reduction to the suffix array construction of a string of two thirds the length, which is solved recursively.
2. Construct the suffix array of the remaining suffixes using the result of the first step.
3. Merge the two suffix arrays into one.
Notation
• string T = T[0,n) = t_0 t_1 … t_{n-1}
• suffix S_i = T[i,n) = t_i t_{i+1} … t_{n-1}
• for C ⊂ [0,n]: S_C = {S_i | i in C}
• suffix array SA[0,n] of T is a permutation of [0,n] satisfying S_{SA[0]} < S_{SA[1]} < … < S_{SA[n]}
Running example
• T[0,n) = yabbadabbado, followed by padding 0 0 …
  position: 0 1 2 3 4 5 6 7 8 9 10 11
  symbol:   y a b b a d a b b a d  o
• SA = (12,1,6,4,9,3,8,2,7,5,10,11,0)
Step 0: Construct a sample
• for k = 0,1,2: B_k = {i ∈ [0,n] | i mod 3 = k}
• C = B_1 ∪ B_2: sample positions
• S_C: sample suffixes
• Example: B_1 = {1,4,7,10}, B_2 = {2,5,8,11}, C = {1,4,7,10,2,5,8,11}
Step 1: Sort sample suffixes
• for k = 1,2, construct
R_k = [t_k t_{k+1} t_{k+2}] [t_{k+3} t_{k+4} t_{k+5}] … [t_{max B_k} t_{max B_k+1} t_{max B_k+2}]
R = R_1 ^ R_2, the concatenation of R_1 and R_2
• Suffixes of R correspond to S_C: suffix [t_i t_{i+1} t_{i+2}] … corresponds to S_i; the correspondence is order preserving.
• Sort the suffixes of R: radix sort the characters and rename with ranks to obtain R´. If all characters are different, their order directly gives the order of suffixes. Otherwise, sort the suffixes of R´ using Kärkkäinen-Sanders. Note: |R´| = 2n/3.
Step 1 (cont.)
• once the sample suffixes are sorted, assign a rank to each: rank(S_i) = the rank of S_i in S_C; rank(S_{n+1}) = rank(S_{n+2}) = 0
• Example: R = [abb][ada][bba][do0][bba][dab][bad][o00]
R´ = (1,2,4,6,4,5,3,7)
SA_R´ = (8,0,1,6,4,2,5,3,7)
i:          0  1  2  3  4  5  6  7  8  9  10  11  12  13  14
rank(S_i):  -  1  4  -  2  6  -  5  3  -   7   8   -   0   0
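Steps 0 and 1 on the running example can be reproduced with a short sketch. It covers only the sampling and the renaming of triples, not the recursion; the function name is an assumption, plain sorting stands in for the radix sort, and the padding symbol "0" is assumed to be smaller than every text symbol (true for lowercase letters in ASCII). A complete implementation needs extra care at the end of the string for some lengths, which this sketch does not attempt.

```python
def sample_and_rename(text, pad="0"):
    """Return sample positions C, the triples of R, and the renamed string R'."""
    n = len(text)
    padded = text + pad * 3
    b1 = list(range(1, n, 3))   # positions i with i mod 3 = 1
    b2 = list(range(2, n, 3))   # positions i with i mod 3 = 2
    c = b1 + b2
    triples = [padded[i:i + 3] for i in c]
    # Rename each distinct triple by its rank among the sorted distinct triples.
    names = {t: r for r, t in enumerate(sorted(set(triples)), start=1)}
    return c, triples, [names[t] for t in triples]

c, triples, renamed = sample_and_rename("yabbadabbado")
print(c)
print(triples)
print(renamed)
```

Because the triple bba occurs twice, the names are not all distinct, which is exactly the case in which the algorithm recurses on R´.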
Step 2: Sort nonsample suffixes
• for each non-sample S_i ∈ S_{B_0} (note that rank(S_{i+1}) is always defined for i ∈ B_0):
S_i ≤ S_j ↔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
• radix sort the pairs (t_i, rank(S_{i+1})).
• Example: S_12 < S_6 < S_9 < S_3 < S_0 because (0,0) < (a,5) < (a,7) < (b,2) < (y,1)
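Step 2 on the running example amounts to sorting five pairs. In the sketch below the rank table is copied from the example of Step 1, an ordinary sort stands in for the radix sort, and the names and the padding symbol "0" are assumptions.

```python
TEXT = "yabbadabbado"
# rank(S_i) for the sample suffixes, as given in the Step 1 example.
RANK = {1: 1, 2: 4, 4: 2, 5: 6, 7: 5, 8: 3, 10: 7, 11: 8, 13: 0, 14: 0}

def sort_nonsample(text, rank, pad="0"):
    """Order the suffixes S_i, i mod 3 = 0, by the pair (t_i, rank(S_{i+1}))."""
    n = len(text)
    padded = text + pad          # t_n is the padding symbol
    b0 = range(0, n + 1, 3)      # positions 0, 3, 6, ..., up to n
    return sorted(b0, key=lambda i: (padded[i], rank[i + 1]))

print(sort_nonsample(TEXT, RANK))
```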
Step 3: Merge
• merge the two sorted sets of suffixes using a standard comparison-based merging:
• to compare S_i ∈ S_C with S_j ∈ S_{B_0}, distinguish two cases:
– i ∈ B_1: S_i ≤ S_j ↔ (t_i, rank(S_{i+1})) ≤ (t_j, rank(S_{j+1}))
– i ∈ B_2: S_i ≤ S_j ↔ (t_i, t_{i+1}, rank(S_{i+2})) ≤ (t_j, t_{j+1}, rank(S_{j+2}))
• note that the ranks are defined in all cases!
• Example: S_1 < S_6 as (a,4) < (a,5), and S_3 < S_8 as (b,a,6) < (b,a,7)
Running time O(n)
• excluding the recursive call, everything can be done in linear time
• the recursion is on a string of length 2n/3
• thus the time is given by the recurrence T(n) = T(2n/3) + O(n)
• hence T(n) = O(n)
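The step from the recurrence to the linear bound is a geometric series. Writing the O(n) term as at most cn for some constant c (an assumption made only to name the constant) and unrolling:

```latex
\begin{align*}
T(n) &\le cn + T\!\left(\tfrac{2}{3}n\right)
      \le cn + c\,\tfrac{2}{3}n + T\!\left(\left(\tfrac{2}{3}\right)^{2} n\right)
      \le \dots \\
     &\le cn \sum_{k \ge 0} \left(\tfrac{2}{3}\right)^{k} + O(1)
      = cn \cdot \frac{1}{1 - \tfrac{2}{3}} + O(1)
      = 3cn + O(1) = O(n).
\end{align*}
```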
Implementation
• about 50 lines of C++
• code available e.g. via Juha Kärkkäinen's home page
LCP table
• Longest Common Prefix of successive elements of the suffix array:
• LCP[i] = length of the longest common prefix of suffixes S_{SA[i]} and S_{SA[i+1]}
• build the inverse array SA^-1 from SA in linear time
• then the LCP table from SA^-1 in linear time (Kasai et al, CPM 2001)
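The linear-time LCP computation can be sketched as follows. This is a sketch in the spirit of the Kasai et al. method, written for 0-based positions and non-empty suffixes; the names are assumptions, and the suffix array itself is built by naive sorting to keep the example self-contained. The key observation is that when moving from suffix i to suffix i+1 the common prefix length drops by at most one, so the total number of character comparisons is linear.

```python
def build_suffix_array(text):
    """Naive 0-based suffix array of the non-empty suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_table(text, sa):
    """lcp[r] = longest common prefix length of suffixes sa[r] and sa[r+1]."""
    n = len(text)
    inverse = [0] * n                 # inverse[i] = rank of suffix i (SA^-1)
    for r, start in enumerate(sa):
        inverse[start] = r
    lcp = [0] * max(n - 1, 0)
    h = 0
    for i in range(n):                # process suffixes in text order
        r = inverse[i]
        if r + 1 < n:
            j = sa[r + 1]             # successor of suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1                # the next suffix shares at least h-1
        else:
            h = 0
    return lcp

text = "hattivatti"
sa = build_suffix_array(text)
print(lcp_table(text, sa))
```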
• Oxford English Dictionary http://www.oed.com/
• Example - Word of the Day, Fourth
http://biit.cs.ut.ee/~vilo/edu/2005-06/Text_Algorithms/L7_SuffixTrees/wotd_fourth.html
http://www.oed.com/cgi/display/wotd
• PAT index - by Gaston Gonnet (he is also one of the creators of the Maple software and later one of the developers of a molecular biology software package)
• The PAT index is essentially a suffix array. To save space, it is indexed only from the first character of every word
• XML-tagging (or SGML, at that time!) is also indexed
• To mark certain fields of XML, bit vectors were used.
• Main concern - improve the speed of search on the CD - minimize random accesses.
• For a slow medium even 15-20 accesses is too slow...
• G.H. Gonnet, R.A. Baeza-Yates, and T. Snider, Lexicographical indices for text: Inverted files vs. PAT trees, Technical Report OED-91-01, Centre for the New OED, University of Waterloo, 1991.
Suffix tree vs suffix array
• suffix tree ⇔ suffix array + LCP table
1. Suffix tree
2. Suffix array
3. Some applications
4. Finding motifs
Substring motifs of string T
• string T = t_1 … t_n in alphabet A.
• Problem: what are the frequently occurring (ungapped) substrings of T? Longest substring that occurs at least q times?
• Thm: Suffix tree Tree(T) gives complete occurrence counts of all substring motifs of T in O(n) time (although T may have O(n^2) substrings!)
Counting the substring motifs
• internal nodes of Tree(T) ↔ repeating substrings of T
• number of leaves of the subtree of a node for string P = number of occurrences of P in T
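The question "longest substring that occurs at least q times" can be illustrated with sorted suffixes instead of a suffix tree. This swaps in a simpler, naive technique: q suffixes that share a prefix are adjacent in sorted order, so it suffices to compare the first and the last suffix of every window of q consecutive suffixes. The function name is an assumption.

```python
def longest_substring_with_q_occurrences(text, q):
    """Naive sketch: windows of q consecutive sorted suffixes."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for r in range(len(suffixes) - q + 1):
        first, last = suffixes[r], suffixes[r + q - 1]
        k = 0
        while k < min(len(first), len(last)) and first[k] == last[k]:
            k += 1
        # The common prefix of first and last is shared by the whole window.
        if k > len(best):
            best = first[:k]
    return best

print(longest_substring_with_q_occurrences("hattivatti", 2))
print(longest_substring_with_q_occurrences("hattivatti", 4))
```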
Substring motifs of hattivatti
[Figure: Tree(hattivatti) with the number of leaves of the subtree written at each internal node: four nodes with count 2 and one node (t) with count 4]
Counts for the O(n) maximal motifs shown
Finding repeats in DNA
• human chromosome 3
• the first 48 999 930 bases
• 31 min cpu time (8 processors, 4 GB)
• Human genome: 3 × 10^9 bases
• Tree(HumanGenome) feasible
Longest repeat?
Occurrences at: 28395980, 28401554r
Length: 2559
ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctctt
ttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg 102
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125
In the reversed complement at: 17858493, 41463059, 42431718, 42580925
Using suffix trees: plagiarism
• find the longest common substring of strings X and Y
• build Tree(X$Y) and find the deepest node which has a leaf pointing to X and another pointing to Y
Using suffix trees: approximate matching
• edit distance: insertions, deletions, changes
• STOCKHOLM vs TUKHOLMA
String distance/similarity functions
STOCKHOLM vs TUKHOLMA
STOCKHOLM_
_TU_KHOLMA
=> 2 deletions, 1 insertion, 1 change
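Dynamic programming
\[ d_{i,j} = \min\bigl(\, d_{i-1,j-1} \text{ if } a_i = b_j \text{ else } \infty,\;\; d_{i-1,j} + 1,\;\; d_{i,j-1} + 1 \,\bigr) \]
= distance between the i-prefix of A and the j-prefix of B (substitution excluded)
[Figure: m × n table d with A along one side and B along the other; cell d_{i,j} is computed from d_{i-1,j-1}, from d_{i-1,j} (+1) and from d_{i,j-1} (+1); d_{m,n} is the final distance]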
A\B      s  t  o  c  k  h  o  l  m
      0  1  2  3  4  5  6  7  8  9
t     1  2  1  2  3  4  5  6  7  8
u     2  3  2  3  4  5  6  7  8  9
k     3  4  3  4  5  4  5  6  7  8
h     4  5  4  5  6  5  4  5  6  7
o     5  6  5  4  5  6  5  4  5  6
l     6  7  6  5  6  7  6  5  4  5
m     7  8  7  6  7  8  7  6  5  4
a     8  9  8  7  8  9  8  7  6  5
\[ d_{i,j} = \min\bigl(\, d_{i-1,j-1} \text{ if } a_i = b_j \text{ else } \infty,\;\; d_{i-1,j} + 1,\;\; d_{i,j-1} + 1 \,\bigr) \]
d_ID(A,B) is the bottom-right entry of the table (here 5); the optimal alignment is found by trace-back
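The table can be recomputed with a direct implementation of the recurrence. This is a minimal sketch of the insertion/deletion distance (substitution excluded, as on the slide); the function name is an assumption.

```python
def indel_distance(a, b):
    """Distance with insertions and deletions only, O(mn) time and space."""
    inf = float("inf")
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                   # delete the i-prefix of a
    for j in range(n + 1):
        d[0][j] = j                   # insert the j-prefix of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = d[i - 1][j - 1] if a[i - 1] == b[j - 1] else inf
            d[i][j] = min(diagonal, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n]

print(indel_distance("tukholma", "stockholm"))
```

The result agrees with the alignment shown earlier: 2 deletions, 1 insertion and 1 change, where a change costs one deletion plus one insertion.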
Search problem
• find approximate occurrences of pattern P in text T: substrings P' of T such that d(P,P') is small
• dynamic programming with a small modification: O(mn)
• lots of (practical) improvement tricks
[Figure: pattern P aligned against a substring P' of text T]
Index for approximate searching?
• dynamic programming: P x Tree(T) with backtracking
[Figure: pattern P matched against the paths of Tree(T)]
Burrows-Wheeler Transformation
• BWT for text compression and indexing
Burrows-Wheeler
• See FAQ http://www.faqs.org/faqs/compression-faq/part2/section-9.html
• The method described in the original paper is really a composite of three different algorithms:
– the block sorting main engine (a lossless, very slightly expansive preprocessor),
– the move-to-front coder (a byte-for-byte simple, fast, locally adaptive noncompressive coder) and
– a simple statistical compressor (first order Huffman is mentioned as a candidate) eventually doing the compression.
• Of these three methods only the first two are discussed here as they are what constitutes the heart of the algorithm. These two algorithms combined form a completely reversible (lossless) transformation that - with typical input - skews the first order symbol distributions to make the data more compressible with simple methods. Intuitively speaking, the method transforms slack in the higher order probabilities of the input block (thus making them more even, whitening them) to slack in the lower order statistics. This effect is what is seen in the histogram of the resulting symbol data.
• Please, read the article by Mark Nelson:
• Data Compression with the Burrows-Wheeler Transform, Mark Nelson, Dr. Dobb's Journal, September 1996. http://marknelson.us/1996/09/01/bwt/
BWT
CODE:
t: hat acts like this:<13><10><1
t: hat buffer to the constructor
t: hat corrupted the heap, or wo
W: hat goes up must come down<13
t: hat happens, it isn't likely
w: hat if you want to dynamicall
t: hat indicates an error.<13><1
t: hat it removes arguments from
t: hat looks like this:<13><10><
t: hat looks something like this
t: hat looks something like this
t: hat once I detect the mangled
Example
• Decode: errktreteoe.e
• Hint: . is the last character, alphabetically first…
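The forward and inverse transforms can be sketched as follows, so that the exercise can be checked by running the code. This is a naive illustration with assumed function names: the forward transform sorts all rotations explicitly (quadratic space), and the inverse relies on a stable sort of the last column, assuming the end marker occurs exactly once, at the end of the text.

```python
def bwt(text, end):
    """Last column of the sorted rotations; text must end with the unique marker."""
    assert text.endswith(end) and text.count(end) == 1
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last, end):
    """Rebuild the text by walking the last-to-first mapping backwards."""
    # Stable sort: the k-th occurrence of a symbol in the last column
    # corresponds to its k-th occurrence in the first column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for first_row, last_row in enumerate(order):
        lf[last_row] = first_row
    row = last.index(end)        # the row holding the original text
    symbols = []
    for _ in range(len(last)):
        symbols.append(last[row])
        row = lf[row]
    return "".join(reversed(symbols))

# Decoding the string from the slide, with "." as the end marker.
print(inverse_bwt("errktreteoe.e", "."))
```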