Burrows-Wheeler transform and BWT-index
Succinct and compressed indexes
• a succinct index takes space, in bits, proportional to that of the text itself
• previous indexes are not succinct: they take O(n) computer words, i.e. O(n·log(n)) bits
• a compressed index takes space, in bits, proportional to that of the compressed text
• a self-index does not require storing the text
Burrows-Wheeler transform
T = acatacagatg$
Sorted rotations of T:
$acatacagatg
acagatg$acat
acatacagatg$
agatg$acatac
atacagatg$ac
atg$acatacag
cagatg$acata
catacagatg$a
g$acatacagat
gatg$acataca
tacagatg$aca
tg$acatacaga
SA (suffix array of T = acatacagatg$), listed in the order of the sorted rotations:
 i       1  2  3  4  5  6  7  8  9  10  11  12
 SA[i]   12 5  1  7  3  9  6  2  11 8   4   10
Burrows-Wheeler transform
T = acatacagatg$
Sorted rotations of T, with the first column F = T[SA[i]] and the last column L = BWT[i] set apart:
 i    F   (middle)     L
 1    $   acatacagat   g
 2    a   cagatg$aca   t
 3    a   catacagatg   $
 4    a   gatg$acata   c
 5    a   tacagatg$a   c
 6    a   tg$acataca   g
 7    c   agatg$acat   a
 8    c   atacagatg$   a
 9    g   $acatacaga   t
 10   g   atg$acatac   a
 11   t   acagatg$ac   a
 12   t   g$acatacag   a
• BWT[i] = T[SA[i]-1] if SA[i] ≠ 1, otherwise BWT[i] = $
• BWT was defined for the purpose of compression: the BWT usually compresses better than the input text
• BWT is reversible!
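The definition above can be checked on the running example with a few lines of Python. This is only a sketch: the suffix array is built by naive sorting, which is fine for a toy text but not how real indexes are constructed, and the names `suffix_array` and `bwt` are illustrative. Positions are 1-based, as on the slides.

```python
def suffix_array(text):
    """1-based suffix array, built by naive sorting (toy sizes only)."""
    # '$' sorts before the letters in ASCII, matching the order used here.
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def bwt(text):
    """BWT[i] = T[SA[i]-1] if SA[i] != 1, otherwise '$' (1-based)."""
    sa = suffix_array(text)
    # text[p - 2] is T[p-1] in 1-based terms
    return "".join(text[p - 2] if p != 1 else "$" for p in sa)


T = "acatacagatg$"
SA = suffix_array(T)
BWT = bwt(T)
print(SA)   # [12, 5, 1, 7, 3, 9, 6, 2, 11, 8, 4, 10]
print(BWT)  # gt$ccgaataaa
```

Because T ends with a unique, smallest character $, sorting suffixes gives the same order as sorting rotations.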
• Obs 1: the first column (F) is easy to reconstruct; it can be represented by an array C[x] = Σ_{y<x} occ(y,T), defined for each letter x, where occ(y,T) is the number of occurrences of y in T
• Ex: C = [0,1,6,8,10] for the letters $, a, c, g, t
To invert the BWT, only the columns F and L are needed:
 i   1  2  3  4  5  6  7  8  9  10  11  12
 F   $  a  a  a  a  a  c  c  g  g   t   t
 L   g  t  $  c  c  g  a  a  t  a   a   a
T is rebuilt from right to left. Its last character is known: T = …$. Row 1 starts with $ and ends with g, so g precedes $ in T: T = …g$.
• Obs 2: for identical chars, their relative order in F and L is the same: if a letter x occurs in L at rows i < j, the corresponding occurrences of x in F appear in the same order
• LF[i] = C[BWT[i]] + rank[BWT[i],i]
• Ex: LF[1] = C[g] + rank[g,1] = 8+1 = 9
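The array C and the LF formula can be sketched directly in Python. The helper names `c_array`, `rank` and `lf` are illustrative, rows are 1-based as on the slides, and `rank` is a naive scan standing in for a real rank structure.

```python
from collections import Counter


def c_array(bwt):
    """C[x] = number of characters in the text strictly smaller than x."""
    counts = Counter(bwt)
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c


def rank(bwt, x, i):
    """Occurrences of x in BWT[1..i] (1-based, inclusive); naive scan."""
    return bwt[:i].count(x)


def lf(bwt, c, i):
    """LF[i] = C[BWT[i]] + rank[BWT[i], i], with 1-based rows."""
    x = bwt[i - 1]
    return c[x] + rank(bwt, x, i)


BWT = "gt$ccgaataaa"              # BWT of T = acatacagatg$
C = c_array(BWT)
print([C[x] for x in "$acgt"])    # [0, 1, 6, 8, 10]
print(lf(BWT, C, 1))              # 9
```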
Following LF from row to row and reading the letter in L at each step extends T one letter at a time, from right to left:
T = …tg$
T = …atg$
T = …gatg$
and so on, until the whole text T = acatacagatg$ is recovered.
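The inversion procedure can be sketched as follows. This is a minimal illustration, assuming that $ is unique and smaller than every other letter; the function name `inverse_bwt` is hypothetical.

```python
def inverse_bwt(bwt):
    """Rebuild T from its BWT by following LF from the row starting with '$'."""
    # C[x]: number of characters strictly smaller than x
    c, total = {}, 0
    for ch in sorted(set(bwt)):
        c[ch] = total
        total += bwt.count(ch)
    # Precompute LF for every row (1-based) in one left-to-right pass
    seen, lf = {}, [0] * (len(bwt) + 1)
    for i, ch in enumerate(bwt, start=1):
        seen[ch] = seen.get(ch, 0) + 1      # rank[ch, i], inclusive
        lf[i] = c[ch] + seen[ch]
    # Row 1 starts with '$'; its last letter precedes '$' in T
    out, i = ["$"], 1
    for _ in range(len(bwt) - 1):
        out.append(bwt[i - 1])
        i = lf[i]
    return "".join(reversed(out))


print(inverse_bwt("gt$ccgaataaa"))  # acatacagatg$
```

The letters are collected in the order g, t, a, g, ..., which matches the partial results T = …g$, …tg$, …atg$, …gatg$ above.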
LF function
LF[i] = C[BWT[i]] + rank[BWT[i],i]
• LF[i] yields the index (in SA) of the suffix immediately preceding (in T) the i-th suffix (in SA). Formally, SA[LF[i]] = SA[i]-1 (for SA[i] ≠ 1).
T = acatacagatg$
 i       1  2  3  4  5  6  7  8  9  10  11  12
 SA[i]   12 5  1  7  3  9  6  2  11 8   4   10
Ex: LF[7] = C[a] + rank[a,7] = 1+1 = 2, and indeed SA[LF[7]] = SA[2] = 5 = SA[7]-1 = 6-1
rank function
LF[i] = C[BWT[i]] + rank[BWT[i],i]
 i                                     1  2  3  4  5  6  7  8  9  10  11  12
 BWT[i]                                g  t  $  c  c  g  a  a  t  a   a   a
 occurrences of BWT[i] before row i    0  0  0  0  1  1  0  1  1  2   3   4
The values shown count occurrences strictly before row i. rank[BWT[i],i] as used in the example LF[1] = 8+1 also counts row i itself, so it is one larger.
How about general queries rank[a,i] for any letter a and any position i?
rank/select functions
• given a string T, efficiently answer queries rank(a,i): the number of a’s in T[1..i]
• the rank function (on bit vectors) turns out to be a fundamental building block for succinct data structures [Jacobson 89]
• on bit vectors, rank can be supported in O(1) time using o(n) additional bits of memory
• complementary function select(a,j): output the position of the j-th occurrence of a in T. select can also be supported in O(1) time
• [Jacobson 89]: using rank/select to represent binary trees so as to support navigation
Implementing rank on bitmaps
• consider a bitmap B of size n
• tabulate rank within all blocks of size (log n)/2: there are 2^((log n)/2) = √n different blocks and (log n)/2 possible queries, with each result taking (log log n) bits. Overall space: O(√n·(log n)·(log log n)) = o(n)
• idea: compute “cumulative rank” at block borders
• this takes 2n/(log n) · (log n) = 2n bits: too much! Trick: introduce two levels of blocks
• split B into n/(log² n) superblocks of size (log² n); compute cumulative rank at superblock borders. This takes n/(log² n) · (log n) = n/(log n) = o(n) bits
• split each superblock into blocks of size (log n)/2; compute cumulative rank inside the superblock; each result takes O(log log n) bits. Therefore we only need 2n/(log n) · O(log log n) = o(n) bits
DONE!
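The two-level layout can be sketched in Python as below. It is a simplified illustration, not a succinct implementation: counters are ordinary Python integers, and the last step scans the small block instead of using the precomputed table over all possible blocks. The class name `RankBitmap` and its attributes are hypothetical.

```python
import math


class RankBitmap:
    """rank1(i) = number of 1s in B[1..i], using two levels of counters."""

    def __init__(self, bits):
        self.bits = list(bits)
        log_n = max(2, int(math.log2(max(len(self.bits), 4))))
        self.block = log_n // 2                 # block size ~ (log n)/2
        self.super = self.block * 2 * log_n     # superblock size ~ log^2 n
        self.super_rank = []   # 1s before each superblock (absolute count)
        self.block_rank = []   # 1s before each block, within its superblock
        total = in_super = 0
        for start in range(0, len(self.bits), self.block):
            if start % self.super == 0:         # a new superblock begins
                self.super_rank.append(total)
                in_super = 0
            self.block_rank.append(in_super)
            ones = sum(self.bits[start:start + self.block])
            total += ones
            in_super += ones

    def rank1(self, i):
        i = min(i, len(self.bits))
        if i <= 0:
            return 0
        b = (i - 1) // self.block               # block holding position i
        s = (b * self.block) // self.super      # superblock holding that block
        # Simplification: scan the block instead of a table lookup.
        inside = sum(self.bits[b * self.block:i])
        return self.super_rank[s] + self.block_rank[b] + inside


B = RankBitmap([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0])
print(B.rank1(7))  # 4
```

A query adds three numbers: the count at the superblock border, the count at the block border inside that superblock, and the count inside the block.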
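rank for large alphabets: wavelet trees
[Figure: wavelet tree over the alphabet {$, a, b, c, d, r}. The root splits it into {$, a, b} (bit 0) and {c, d, r} (bit 1); {$, a, b} splits into {$, a} and {b}; {c, d, r} splits into {c, d} and {r}; the last level separates $ from a and c from d. Each node stores one bit per character routed to it.]
• Space: n·log(σ) bits
• rank(a,11) = ? The code of a is 001, so follow the bits 0, 0, 1 down the tree:
  rank(0,11,Sε) = 5
  rank(0,5,S0) = 3
  rank(1,3,S00) = 2
  hence rank(a,11) = 2
• rank is computed in O(log(σ)) time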
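The descent through the tree can be sketched in Python. This is a toy version under stated assumptions: each node splits its sorted alphabet in half, bitmaps are plain lists, and bitmap rank is a naive sum rather than an O(1) structure. The class name `WaveletTree` is illustrative, and the example string is the BWT of the running example, not the string of the figure.

```python
class WaveletTree:
    """Wavelet tree with plain Python lists as bitmaps (naive bitmap rank)."""

    def __init__(self, text, alphabet=None):
        self.alphabet = sorted(set(text)) if alphabet is None else alphabet
        self.left = self.right = None
        if len(self.alphabet) > 1:
            mid = (len(self.alphabet) + 1) // 2
            low = set(self.alphabet[:mid])
            # bit 0: character belongs to the left half, bit 1: right half
            self.bits = [0 if ch in low else 1 for ch in text]
            self.left = WaveletTree([ch for ch in text if ch in low],
                                    self.alphabet[:mid])
            self.right = WaveletTree([ch for ch in text if ch not in low],
                                     self.alphabet[mid:])

    def rank(self, x, i):
        """Number of occurrences of x among the first i characters."""
        if len(self.alphabet) <= 1:              # leaf: all characters equal
            return i if self.alphabet == [x] else 0
        mid = (len(self.alphabet) + 1) // 2
        ones = sum(self.bits[:i])                # rank(1, i) on this node
        if x in self.alphabet[:mid]:
            return self.left.rank(x, i - ones)   # continue with rank(0, i)
        return self.right.rank(x, ones)


wt = WaveletTree("gt$ccgaataaa")
print(wt.rank("a", 11))  # 4
```

Each level performs one bitmap rank, so a query touches about log(σ) nodes.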
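String matching with BWT
T = acatacagatg$
P = taca
The pattern is matched from right to left on the table of sorted rotations (columns F and L as before).
• S: current string (a suffix of the pattern)
• [e,f]: current interval, i.e. the rows whose rotations start with S
• x: the next letter of the pattern, to the left of S
• compute the new interval for xS:
  e := C[x] + rank[x,e-1] + 1
  f := C[x] + rank[x,f]
  where rank[x,i] counts the occurrences of x in BWT[1..i]. The search starts with [e,f] = [1,n] and fails as soon as e > f.
• For P = taca: [1,12] → a: [2,6] → ca: [7,8] → aca: [2,3] → taca: [11,11]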
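The interval update can be sketched as a short backward-search routine. It assumes an inclusive, 1-based rank (occurrences of x in BWT[1..i]), which is why the lower bound uses e-1; the naive `count` stands in for a real rank structure, and the function name `backward_search` is illustrative.

```python
def backward_search(bwt, pattern):
    """Return the 1-based SA interval (e, f) of pattern, or None if absent."""
    # C[x]: number of characters strictly smaller than x
    c, total = {}, 0
    for ch in sorted(set(bwt)):
        c[ch] = total
        total += bwt.count(ch)

    def rank(x, i):
        # occurrences of x in BWT[1..i]; naive scan for illustration
        return bwt[:i].count(x)

    e, f = 1, len(bwt)
    for x in reversed(pattern):          # process the pattern right to left
        if x not in c:
            return None
        e = c[x] + rank(x, e - 1) + 1
        f = c[x] + rank(x, f)
        if e > f:                        # empty interval: no occurrence
            return None
    return e, f


print(backward_search("gt$ccgaataaa", "taca"))  # (11, 11)
```

The number of occurrences is f - e + 1, obtained after |P| interval updates without looking at T or SA.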
What position is it?
The search for P = taca ends on row 11, and SA[11] = 4: P occurs at position 4 of T in the 1-based numbering used for SA (position 3 when counting from 0).
BWT-index (FM-index)
T = acatacagatg$
• Solution: store only a fraction of the values of SA
• Storing one value out of every log(n) takes O((n/log(n))·log(n)) = O(n) bits
• Search time becomes O(|P| + occ·log(n)) [Ferragina, Manzini 00]
FM-index includes:
• BWT
• a selection of SA values
• auxiliary structures: array C, rank, position marking, …
Full SA and the sampled SA of the example (only the even text positions are kept):
 i            1  2  3  4  5  6  7  8  9  10  11  12
 SA[i]        12 5  1  7  3  9  6  2  11 8   4   10
 sampled SA   12 -  -  -  -  -  6  2  -  8   4   10
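Recovering a missing SA value from the sample can be sketched as follows: walk backwards in T with LF until a stored row is reached, then add the number of steps. The names `locate` and `SAMPLES` are illustrative, the sample is the one of the example (even text positions), and the wrap-around handles a walk that passes the start of T.

```python
def locate(bwt, sampled_sa, row):
    """Text position (1-based) of the suffix at `row`, using sampled SA values.

    sampled_sa maps row -> SA[row] for the stored rows only.
    """
    n = len(bwt)
    # C[x]: number of characters strictly smaller than x
    c, total = {}, 0
    for ch in sorted(set(bwt)):
        c[ch] = total
        total += bwt.count(ch)
    steps = 0
    while row not in sampled_sa:
        x = bwt[row - 1]
        row = c[x] + bwt[:row].count(x)   # LF step: go to the preceding suffix
        steps += 1
    # wrap around if the walk went past the first position of T
    return (sampled_sa[row] + steps - 1) % n + 1


BWT = "gt$ccgaataaa"
SAMPLES = {1: 12, 7: 6, 8: 2, 10: 8, 11: 4, 12: 10}
print(locate(BWT, SAMPLES, 11))  # 4
print(locate(BWT, SAMPLES, 2))   # 5
```

With one value stored out of every log(n) text positions, each occurrence costs at most about log(n) LF steps, which is where the occ·log(n) term comes from.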
BWT-index: practical issues
• BWT-index can be implemented using ~3 bits/char (!!)
• BWT-index is now widely used in practical bioinformatics software: BWA, bowtie, SOAP2 (mapping), SGA (assembly)
• Variant: bi-directional BWT-index
• other succinct data structures exist (including the compact suffix array) and continue to appear
• construction may require much more space than the resulting structure
• external memory algorithms are important
• algorithms specialized to multi-core or GPU processor architectures
• dynamic indexes