Compressed suffix arrays and suffix trees with applications to text indexing and string matching.
Fine Tuning the Enhanced Suffix Arrays
Ayat A.Dawood 1
Fine Tuning the Enhanced Suffix Arrays
Ayat A. Dawood
CIS, Nile University
Joint work with: Mohamed AbouelHoda
Table of Contents
  Suffix array
  The enhanced suffix array
  Our accomplishment:
    Minimal Perfect Hashing Function
    The exact pattern matching problem
    Improving the bucket table representation
Suffix array
An array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$.
e.g., S$ = acaaacatat$ (n = 10; $ is treated as larger than every other character)
  i   Suftab  S(Suftab[i])
  0   2       aaacatat$
  1   3       aacatat$
  2   0       acaaacatat$
  3   4       acatat$
  4   6       atat$
  5   8       at$
  6   1       caaacatat$
  7   5       catat$
  8   7       tat$
  9   9       t$
  10  10      $
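The table above can be reproduced with a naive construction. The following is a minimal Python sketch, not the authors' implementation: it simply sorts the suffixes. The function name `suffix_array` and the trick of mapping `$` to the largest code point (so that `$` sorts last, as in the table) are assumptions made for illustration.

```python
def suffix_array(s):
    """Return the suffix array of s + '$' by sorting the suffixes directly."""
    text = s + "$"

    # Assumption: '$' ranks after every other character, as in the table
    # (atat$ precedes at$, and the suffix $ comes last).
    def sort_key(i):
        return text[i:].replace("$", "\U0010FFFF")

    return sorted(range(len(text)), key=sort_key)


suftab = suffix_array("acaaacatat")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```

Sorting whole suffixes is far slower than a dedicated construction algorithm, so this only serves to check small examples.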
Enhanced suffix array
Basically, it is the suffix array enhanced with a set of additional tables.
Using those tables, suffix-tree-style algorithms can be run on the array with matching time complexity.
lcptab[i] stores the length of the longest common prefix of the suffixes starting at suftab[i-1] and suftab[i] (lcptab[0] = 0).
  i   Suftab  lcptable  S(Suftab[i])
  0   2       0         aaacatat$
  1   3       2         aacatat$
  2   0       1         acaaacatat$
  3   4       3         acatat$
  4   6       1         atat$
  5   8       2         at$
  6   1       0         caaacatat$
  7   5       2         catat$
  8   7       0         tat$
  9   9       1         t$
  10  10      0         $
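The lcp column can be checked directly from its definition. The sketch below is an illustrative Python version that compares neighbouring suffixes character by character; the helper names `lcp_table` and `common_prefix` are assumptions, and a production implementation would use a linear-time method instead.

```python
def lcp_table(text, suftab):
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is set to 0."""

    def common_prefix(a, b):
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        return k

    return [0] + [common_prefix(suftab[i - 1], suftab[i])
                  for i in range(1, len(suftab))]


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # suffix array from the slide
lcptab = lcp_table(text, suftab)
print(lcptab)  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```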
Enhanced suffix array: l-interval
L-interval: an interval [i..j] of the suffix array whose suffixes all share the same prefix of length l, written l-[i..j].
lcp-interval tree for the example (shared prefix in parentheses):
0-[0..10]
  1-[0..5] (a)
    2-[0..1] (aa)
    3-[2..3] (aca)
    2-[4..5] (at)
  2-[6..7] (ca)
  1-[8..9] (t)
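The intervals in the tree can be enumerated from the lcp-table alone with a stack. The following Python sketch shows one common way to do this; the function name `lcp_intervals` and the trailing sentinel value -1 are assumptions for illustration, not taken from the slides.

```python
def lcp_intervals(lcptab):
    """Enumerate all l-intervals as (l, left, right) tuples, children first."""
    n = len(lcptab)
    ext = lcptab + [-1]          # sentinel: forces every open interval to close
    stack = [(0, 0)]             # (lcp value, left boundary) of open intervals
    found = []
    for i in range(1, n + 1):
        left = i - 1
        # Close every open interval whose lcp value exceeds the current one.
        while stack and ext[i] < stack[-1][0]:
            l, left = stack.pop()
            found.append((l, left, i - 1))
        # Open a new, deeper interval starting at the last closed boundary.
        if stack and ext[i] > stack[-1][0]:
            stack.append((ext[i], left))
    return found


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]  # lcp-table from the slide
intervals = lcp_intervals(lcptab)
for l, i, j in intervals:
    print(f"{l}-[{i}..{j}]")
```

For the example this lists the same seven intervals as the tree, from the leaves' parents up to the root 0-[0..10].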
Our accomplishment
Improvement (fine tuning):
  Alphabet-independent exact pattern matching.
  Improving the bucket table representation.
  Improving access to the lcp-table.
Improvements are achieved using minimal perfect hashing techniques.
Minimal perfect hashing (MPHF)
An MPHF stores n static keys from a universe U in O(n) space with O(1) access time. [Botelho et al.]
A lookup table requires O(|U|) space to achieve constant access time.
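To make the idea concrete, here is a small Python sketch of a minimal perfect hash built with a simple hash-and-displace scheme. It is not the algorithm of Botelho et al.; the function names (`h`, `build_mphf`, `lookup`) and the use of salted BLAKE2b as the hash family are assumptions chosen so that the example is deterministic and self-contained.

```python
import hashlib


def h(key, seed, n):
    """Seeded hash of a string key into range(n) (salted BLAKE2b)."""
    digest = hashlib.blake2b(key.encode(), digest_size=8,
                             salt=seed.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % n


def build_mphf(keys):
    """Return one seed per bucket so that lookup() maps the n distinct
    keys one-to-one onto 0..n-1."""
    assert len(set(keys)) == len(keys), "keys must be distinct"
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h(k, 0, n)].append(k)
    seeds = [0] * n
    used = [False] * n
    # Place the largest buckets first; they are the hardest to fit.
    for b in sorted(range(n), key=lambda b: -len(buckets[b])):
        if not buckets[b]:
            continue
        seed = 1
        while True:
            slots = [h(k, seed, n) for k in buckets[b]]
            if len(set(slots)) == len(slots) and not any(used[s] for s in slots):
                break
            seed += 1
        seeds[b] = seed
        for s in slots:
            used[s] = True
    return seeds


def lookup(seeds, key):
    """O(1) evaluation: one hash to pick the bucket seed, one to pick the slot."""
    n = len(seeds)
    return h(key, seeds[h(key, 0, n)], n)


keys = ["aa", "ac", "at", "ca", "ta"]
seeds = build_mphf(keys)
print(sorted(lookup(seeds, k) for k in keys))  # [0, 1, 2, 3, 4]
```

Only the n seeds are stored, not the keys themselves, so a key outside the original set still hashes to some slot; callers that need membership checks have to verify the result separately.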
Exact pattern matching problem
e.g., pattern = aca
lcp-interval tree (shared prefix in parentheses):
0-[0..10]
  1-[0..5] (a)
    2-[0..1] (aa)
    3-[2..3] (aca)
    2-[4..5] (at)
  2-[6..7] (ca)
  1-[8..9] (t)
Exact pattern matching problem
For a text of length n and a pattern of length m:
  Using the naive method, it takes O(nm).
  Using the enhanced suffix arrays, it can be achieved in O(|Σ|m). [AbouElHoda et al.]
  Other modifications to the enhanced suffix arrays allow it to be done in O(m log |Σ|). [Kim et al.], [Fischer et al.]
Exact pattern matching problem
Our work: using a minimal perfect hashing technique, it can be achieved in O(m), removing the alphabet factor.
[Figure: the lcp-interval tree below, annotated with "MPHF table" labels]
0-[0..10]
  1-[0..5] (a)
    2-[0..1] (aa)
    3-[2..3] (aca)
    2-[4..5] (at)
  2-[6..7] (ca)
  1-[8..9] (t)
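The effect of replacing the per-node scan over the alphabet by a hashed lookup can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' data structure: an ordinary `dict` stands in for the MPHF table, it is keyed by (left, right, next character) of the parent interval, and the names `build_index` and `find` are hypothetical. The slides do not say how the real tables are keyed.

```python
def build_index(s):
    text = s + "$"
    # Assumption: '$' sorts after every other character, as in the slides.
    suftab = sorted(range(len(text)),
                    key=lambda i: text[i:].replace("$", "\U0010FFFF"))

    def common_prefix(a, b):
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        return k

    lcptab = [0] + [common_prefix(suftab[i - 1], suftab[i])
                    for i in range(1, len(text))]
    # Stand-in for the MPHF table:
    # (parent left, parent right, char) -> (child left, child right, child lcp)
    children = {}

    def split(l, i, j):
        # Child intervals of l-[i..j] start where the lcp value equals l.
        bounds = [i] + [k for k in range(i + 1, j + 1) if lcptab[k] == l] + [j + 1]
        for a, nxt in zip(bounds, bounds[1:]):
            b = nxt - 1
            child_l = min(lcptab[a + 1:b + 1]) if a < b else None  # None: single suffix
            children[(i, j, text[suftab[a] + l])] = (a, b, child_l)
            if a < b:
                split(child_l, a, b)

    split(0, 0, len(text) - 1)
    return text, suftab, children


def find(index, pattern):
    """Return the sorted start positions of pattern in the text."""
    text, suftab, children = index
    i, j, d, m = 0, len(text) - 1, 0, len(pattern)
    while d < m:
        child = children.get((i, j, pattern[d]))   # one O(1) lookup per tree edge
        if child is None:
            return []
        i, j, l = child
        end = m if l is None else min(l, m)
        start = suftab[i]
        # Verify the characters skipped along the edge against the text.
        if text[start + d:start + end] != pattern[d:end]:
            return []
        d = end
    return sorted(suftab[i:j + 1])


index = build_index("acaaacatat")
print(find(index, "aca"))  # [0, 4]
print(find(index, "at"))   # [6, 8]
print(find(index, "tag"))  # []
```

Each pattern character is either looked up once or compared once, so the search cost depends on m but not on the alphabet size, which is the point of the slide.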
Improving the bucket table representation
Bucket table: a table used to jump directly to a certain lcp-interval in the suffix array.
Bucket table for the example (2-character prefixes over {a, c, g, t}; the value is the first suffix-array index of the bucket):
  aa  0
  ac  2
  at  4
  ca  6
  ta  8
  ag, ct, cc, cg, tc, tg, tt, ga, gt, gc, gg: empty
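A plain lookup-table version of the bucket table can be sketched in a few lines of Python. The function name `bucket_table` and the use of `None` for empty buckets are assumptions for illustration; the point is that every possible prefix gets an entry, whether or not it occurs in the text.

```python
from itertools import product


def bucket_table(text, suftab, d, alphabet):
    """Map every length-d string over the alphabet to the first
    suffix-array index whose suffix starts with it (None if absent)."""
    table = {"".join(p): None for p in product(alphabet, repeat=d)}
    for i, pos in enumerate(suftab):
        prefix = text[pos:pos + d]
        if prefix in table and table[prefix] is None:
            table[prefix] = i
    return table


text = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]  # suffix array from the slide
table = bucket_table(text, suftab, d=2, alphabet="acgt")
filled = {p: i for p, i in table.items() if i is not None}
print(len(table))  # 16
print(filled)      # {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}
```

Here 11 of the 16 entries are empty, which is the waste that grows quickly with longer prefixes and larger alphabets.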
Improving the bucket table representation (cont.)
Problem: the space consumption of the lookup table, which has |Σ|^d entries for prefix length d, is prohibitive for large d and Σ.
Solution: use minimal perfect hashing techniques to store the lookup table.
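The growth of the lookup table is easy to check numerically. The short Python sketch below counts its entries as |Σ|^d; the function name `lookup_table_entries` is hypothetical, and d = 12 with alphabet sizes 4 and 5 are used as example values.

```python
def lookup_table_entries(alphabet_size, d):
    """Number of entries in a lookup table indexed by all length-d prefixes."""
    return alphabet_size ** d


for sigma in (4, 5):
    print(sigma, lookup_table_entries(sigma, 12))
# 4 16777216
# 5 244140625
```

Adding a single extra symbol to a 4-letter alphabet multiplies the table size by roughly 14.6 at d = 12, while the number of prefixes that actually occur in a text is bounded by the text length.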
Improving the bucket table representation (cont.)
Results: for the bacterial E. coli genome (size = 5400 bp) and d = 12

  Alphabet size    No. of keys  Lookup table size (bits)  MPHF size (bits)  Reduction vs. lookup table
  4 (A,T,C,G)      3474814      16777216                  7231956.638       46%
  5 (A,T,C,G,N*)   8451811      244140625                 17590331.64       93%

  * N stands for an undefined nucleotide or dummy character.
Conclusion
  Exact pattern matching problem.
  Improving the bucket table representation.
  Improving access to the lcp-table.
Questions?
Improving access to the lcp-table
To reduce space, each lcp-table entry is stored in 1 byte.
If a common prefix is longer than 255, its length is stored in another (extra) table.
The extra table is accessed sequentially or by binary search.
Our enhancement: use an MPHF to store the extra table so that it can be accessed in constant time.
[Figure: lcp-table and extra lcp-table]
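The two-table layout can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the byte value 255 is used as the overflow marker, so values of 255 and above go to the extra table, a `dict` keyed by position stands in for the MPHF-indexed extra table, and the names `compress_lcp` and `lcp_at` are hypothetical.

```python
def compress_lcp(lcptab):
    """Split an lcp-table into a 1-byte-per-entry array plus an extra table."""
    small = bytearray(min(v, 255) for v in lcptab)     # 255 marks "look in extra"
    extra = {i: v for i, v in enumerate(lcptab) if v >= 255}
    return small, extra


def lcp_at(small, extra, i):
    """Constant-time access: the hashed extra table is consulted only
    when the 1-byte entry holds the overflow marker."""
    v = small[i]
    return extra[i] if v == 255 else v


lcptab = [0, 2, 257, 3, 300, 255]          # made-up values for illustration
small, extra = compress_lcp(lcptab)
print(list(small))                          # [0, 2, 255, 3, 255, 255]
print([lcp_at(small, extra, i) for i in range(len(lcptab))])
# [0, 2, 257, 3, 300, 255]
```

With a sorted extra table, each overflow access would need a binary search; keying the extra table by position through a hash removes that logarithmic factor.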