Fine Tuning the Enhanced Suffix Arrays

Ayat A. Dawood, CIS, Nile University
Joint work with: Mohamed AbouelHoda


Table of Contents

- Suffix array
- The enhanced suffix array
- Our accomplishment:
  - Minimal perfect hashing function
  - The exact pattern matching problem
  - Improving the bucket table representation


Suffix array

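An array of integers in the range 0 to n, specifying the lexicographic order of the n+1 suffixes of S$, where n is the length of S. The sentinel $ is treated as larger than every other character.

e.g., S = acaaacatat (n = 10), S$ = acaaacatat$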

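| i | suftab[i] | S(suftab[i]) |
|---|---|---|
| 0 | 2 | aaacatat$ |
| 1 | 3 | aacatat$ |
| 2 | 0 | acaaacatat$ |
| 3 | 4 | acatat$ |
| 4 | 6 | atat$ |
| 5 | 8 | at$ |
| 6 | 1 | caaacatat$ |
| 7 | 5 | catat$ |
| 8 | 7 | tat$ |
| 9 | 9 | t$ |
| 10 | 10 | $ |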
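The suffix array of the example can be reproduced with a short sketch. This is a naive construction by direct sorting, meant only to illustrate the definition; it is not the construction used in the presented work. The function name `suffix_array` is an assumption of this sketch, as is the requirement that the text itself does not contain `$`. The sentinel is ranked above every other character, which is what the example table implies (at$ follows atat$).

```python
def suffix_array(text: str) -> list[int]:
    """Return suftab for text + '$', with '$' sorting after every other character."""
    s = text + "$"

    def sort_key(i: int) -> list[int]:
        # Map '$' above the largest Unicode code point so it compares as the largest symbol.
        return [0x110000 if c == "$" else ord(c) for c in s[i:]]

    # Naive approach: sort suffix start positions by the suffixes themselves.
    return sorted(range(len(s)), key=sort_key)


suftab = suffix_array("acaaacatat")
print(suftab)  # [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
```

Sorting whole suffixes is quadratic or worse in the text length, so this is only suitable for small examples.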


Enhanced suffix array

Basically, it is the suffix array enhanced with a set of additional tables.

Using those tables, suffix-tree algorithms can be run on the suffix array with the same time complexity and less space.

lcptab[i] stores the length of the longest common prefix of the suffixes at suftab[i] and suftab[i-1]; lcptab[0] = 0.

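| i | suftab[i] | lcptab[i] | S(suftab[i]) |
|---|---|---|---|
| 0 | 2 | 0 | aaacatat$ |
| 1 | 3 | 2 | aacatat$ |
| 2 | 0 | 1 | acaaacatat$ |
| 3 | 4 | 3 | acatat$ |
| 4 | 6 | 1 | atat$ |
| 5 | 8 | 2 | at$ |
| 6 | 1 | 0 | caaacatat$ |
| 7 | 5 | 2 | catat$ |
| 8 | 7 | 0 | tat$ |
| 9 | 9 | 1 | t$ |
| 10 | 10 | 0 | $ |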
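The lcp-table of the example can be checked with a direct character-by-character comparison of neighbouring suffixes. This is a minimal sketch of the definition, assuming the suffix array is already available; the function name `lcp_table` is illustrative and linear-time constructions exist that are not shown here.

```python
def lcp_table(s: str, suftab: list[int]) -> list[int]:
    """lcptab[i] = length of the longest common prefix of suffixes suftab[i-1] and suftab[i]."""

    def lcp(a: int, b: int) -> int:
        k = 0
        while a + k < len(s) and b + k < len(s) and s[a + k] == s[b + k]:
            k += 1
        return k

    # lcptab[0] has no predecessor and is set to 0.
    return [0] + [lcp(suftab[i - 1], suftab[i]) for i in range(1, len(suftab))]


s = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
print(lcp_table(s, suftab))  # [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
```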


Enhanced suffix array: l-interval

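l-interval (lcp-interval): an interval [i..j] of the suffix array whose suffixes share a common prefix of length l, written l-[i..j]. For example, 1-[0..5] contains all suffixes starting with a.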


lcp-interval tree (edges labeled with the next character):

0-[0..10]
  a: 1-[0..5]
    a: 2-[0..1]
    c: 3-[2..3]
    t: 2-[4..5]
  c: 2-[6..7]
  t: 1-[8..9]
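The l-intervals of the example can be enumerated from the lcp-table alone. The sketch below uses a bottom-up scan with a stack, one common way to do this; the function name `lcp_intervals` and the sentinel value -1 used to close the root interval are choices made for this illustration.

```python
def lcp_intervals(lcptab: list[int]) -> list[tuple[int, int, int]]:
    """Return all lcp-intervals as (l, left, right), children before parents."""
    n = len(lcptab)
    result = []
    stack = [(0, 0)]  # open intervals as (lcp value, left boundary)
    for i in range(1, n + 1):
        cur = lcptab[i] if i < n else -1  # -1 closes every open interval at the end
        lb = i - 1
        while stack and cur < stack[-1][0]:
            l, lb = stack.pop()
            result.append((l, lb, i - 1))  # interval ends just before position i
        if i < n and cur > stack[-1][0]:
            stack.append((cur, lb))  # a deeper interval opens at lb
    return result


lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
for l, i, j in lcp_intervals(lcptab):
    print(f"{l}-[{i}..{j}]")
```

For the example lcp-table this lists 2-[0..1], 3-[2..3], 2-[4..5], 1-[0..5], 2-[6..7], 1-[8..9] and finally the root 0-[0..10].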


Our accomplishment

Improvement (Fine Tuning):

- Alphabet-independent exact pattern matching.
- Improving the bucket table representation.
- Improving access to the lcp-table.

Improvements are achieved using minimal perfect hashing techniques.


Minimal perfect hash function (MPHF)

Stores n static keys from a universe U in O(n) space with O(1) access time [Botelho et al.].

A direct lookup table, by contrast, requires O(|U|) space to achieve constant access time.
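To make the idea concrete, here is a small sketch of a minimal perfect hash built with a simple "hash and displace" scheme. It is not the construction of Botelho et al. and makes no claim about their space bounds; the function names, the use of BLAKE2b as the seeded hash, and the list-of-integers representation are all assumptions of this illustration. Keys must be distinct and the key set non-empty.

```python
import hashlib


def _hash(key: str, seed: int) -> int:
    """Seeded 64-bit hash; BLAKE2b is an arbitrary choice for this sketch."""
    digest = hashlib.blake2b(
        key.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def build_mphf(keys: list[str]) -> list[int]:
    """Return a displacement table mapping the n distinct keys onto 0..n-1."""
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[_hash(k, 0) % n].append(k)
    displ = [0] * n
    used = [False] * n
    order = sorted(range(n), key=lambda b: len(buckets[b]), reverse=True)
    for b in order:
        if len(buckets[b]) < 2:
            break
        d = 1  # try seeds until the whole bucket lands on distinct free slots
        while True:
            slots = [_hash(k, d) % n for k in buckets[b]]
            if len(set(slots)) == len(slots) and not any(used[s] for s in slots):
                break
            d += 1
        displ[b] = d
        for s in slots:
            used[s] = True
    free = [s for s in range(n) if not used[s]]
    for b in order:
        if len(buckets[b]) == 1:
            displ[b] = -free.pop() - 1  # negative entry encodes a direct slot
    return displ


def mphf_lookup(displ: list[int], key: str) -> int:
    d = displ[_hash(key, 0) % len(displ)]
    return -d - 1 if d < 0 else _hash(key, d) % len(displ)


keys = ["aa", "ac", "at", "ca", "ta"]
displ = build_mphf(keys)
print(sorted(mphf_lookup(displ, k) for k in keys))  # [0, 1, 2, 3, 4]
```

A lookup costs at most two hash evaluations and one table access. Note that a key outside the original set is still mapped to some slot, so an application has to verify the stored entry if such queries can occur.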


Exact pattern matching problem


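e.g., pattern = aca: starting from 0-[0..10], follow a to 1-[0..5], then c to 3-[2..3]. The suffixes in 3-[2..3] share the prefix aca, so the pattern occurs at positions suftab[2] = 0 and suftab[3] = 4.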


For a text of length n and a pattern of length m:

- Using the naive method, it takes O(nm).
- Using the enhanced suffix array, it can be achieved in O(|Σ|m) [Abouelhoda et al.].
- Other modifications to the enhanced suffix array allow it to be done in O(m log |Σ|) [Kim et al.], [Fischer et al.].


Our work: using a minimal perfect hashing technique, exact pattern matching can be achieved in O(m), removing the alphabet factor.

[Figure: the lcp-interval tree 0-[0..10] with MPHF tables attached to its nodes.]
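The top-down search can be sketched as follows. Each lcp-interval gets a table that maps the next character to the child interval, so a step down the tree costs one constant-time lookup instead of a scan over the alphabet. In this sketch a plain Python dict stands in for the MPHF table, and the helper names `build_child_tables` and `find` are hypothetical; the actual data layout of the presented work is not given in the slides.

```python
def build_child_tables(s, suftab, lcptab):
    """Map each lcp-interval (i, j) to (l, {next character: child interval})."""
    tables = {}

    def visit(i, j):
        if i == j:
            return  # singleton intervals have no children
        l = min(lcptab[i + 1 : j + 1])
        # Positions inside the interval where lcptab equals l separate the children.
        bounds = [i] + [k for k in range(i + 1, j + 1) if lcptab[k] == l] + [j + 1]
        kids = {s[suftab[lb] + l]: (lb, rb - 1) for lb, rb in zip(bounds, bounds[1:])}
        tables[(i, j)] = (l, kids)
        for lb, rb in kids.values():
            visit(lb, rb)

    visit(0, len(s) - 1)
    return tables


def find(pattern, s, suftab, tables):
    """Return the sorted start positions of pattern in s."""
    i, j, matched = 0, len(s) - 1, 0
    while True:
        # For a singleton, the "common prefix" is the whole remaining suffix.
        l, kids = tables.get((i, j), (len(s) - suftab[i], {}))
        upto = min(l, len(pattern))
        start = suftab[i]
        if s[start + matched : start + upto] != pattern[matched:upto]:
            return []
        if upto == len(pattern):
            return sorted(suftab[i : j + 1])
        if pattern[l] not in kids:
            return []
        i, j = kids[pattern[l]]  # constant-time step to the child interval
        matched = l + 1


s = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
lcptab = [0, 2, 1, 3, 1, 2, 0, 2, 0, 1, 0]
tables = build_child_tables(s, suftab, lcptab)
print(find("aca", s, suftab, tables))  # [0, 4]
```

Every pattern character is compared at most once and every child lookup is constant time, which is where the O(m) bound comes from once the dict is replaced by a structure with guaranteed O(1) access.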


Improving the bucket table representation


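Bucket table (d = 2, empty entries shown as -):

| Prefix | First index in suftab |
|---|---|
| aa | 0 |
| ac | 2 |
| at | 4 |
| ag | - |
| ca | 6 |
| ct | - |
| cc | - |
| cg | - |
| ta | 8 |
| tc | - |
| tg | - |
| tt | - |
| ga | - |
| gt | - |
| gc | - |
| gg | - |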

Bucket table: a lookup table, indexed by every possible prefix of length d, that is used to jump directly to the corresponding lcp-interval in the suffix array.
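The bucket table of the example can be rebuilt with a few lines. This is a sketch under the assumption that each entry holds the index of the first suffix starting with the given prefix, as the example table suggests; the function name `bucket_table` and the use of `None` for empty entries are illustrative.

```python
from itertools import product


def bucket_table(s: str, suftab: list[int], alphabet: str, d: int) -> dict:
    """Map every length-d string over alphabet to the first suftab index with that prefix."""
    table = {"".join(p): None for p in product(alphabet, repeat=d)}
    for i, pos in enumerate(suftab):
        prefix = s[pos : pos + d]
        # Prefixes containing '$' or shorter than d are not in the table.
        if prefix in table and table[prefix] is None:
            table[prefix] = i
    return table


s = "acaaacatat$"
suftab = [2, 3, 0, 4, 6, 8, 1, 5, 7, 9, 10]
table = bucket_table(s, suftab, "acgt", 2)
print({k: v for k, v in table.items() if v is not None})
# {'aa': 0, 'ac': 2, 'at': 4, 'ca': 6, 'ta': 8}
```

Only 5 of the 16 entries are occupied in this small example, which illustrates why a full table becomes wasteful as d grows.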


Improving the bucket table representation (cont'd)

Problem: the space consumption of the lookup table is prohibitive for large d and Σ, since it needs |Σ|^d entries.

Solution: use minimal perfect hashing techniques to store the lookup table.
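The growth of the full lookup table is easy to quantify: it has one entry per possible prefix of length d. A quick calculation for d = 12, with the helper name `lookup_table_entries` chosen only for this sketch:

```python
def lookup_table_entries(alphabet_size: int, d: int) -> int:
    """Number of entries in a full bucket table: one per length-d string."""
    return alphabet_size ** d


for sigma in (4, 5):
    print(sigma, lookup_table_entries(sigma, 12))
# 4 16777216
# 5 244140625
```

Adding a single character to the alphabet multiplies the table size by roughly 14.6 at d = 12, while the number of prefixes that actually occur in a text is bounded by its length.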


Results: for the bacterial E. coli genome (size = 5400 bp) and d = 12:

| Alphabet size | No. of keys | Lookup table size in bits | MPHF size in bits | Reduction compared to lookup table |
|---|---|---|---|---|
| 4 (A, T, C, G) | 3474814 | 16777216 | 7231956.638 | 46% |
| 5 (A, T, C, G, N*) | 8451811 | 244140625 | 17590331.64 | 93% |

*N stands for an undefined nucleotide or a dummy character.


Conclusion

- Exact pattern matching problem.
- Improving the bucket table representation.
- Improving access to the lcp-table.


Questions???


Improving access to the lcp-table

To reduce space, each lcp-table entry is stored in 1 byte. If a common prefix is longer than 255, its length is stored in another (extra) table.

The extra table is accessed sequentially or using binary search.

Our enhancement: use an MPHF to store the extra table, so that it can be accessed in constant time.

[Figure: an lcp-table holding small values (0, 2, 3, 2, 0) next to an extra lcp-table holding the large values (257, 279, 300, 260).]
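The two-level lcp-table can be sketched as below. The class name `CompactLcp` is hypothetical, a Python dict stands in for the MPHF-backed extra table, and the sketch assumes that the byte value 255 is reserved as the overflow marker, so values of 255 and above go to the extra table. The sample values are taken from the figure, placed at arbitrary positions.

```python
class CompactLcp:
    """One byte per lcp value; large values live in a separate extra table."""

    MARKER = 255  # assumption: 255 in the byte array means "look in the extra table"

    def __init__(self, lcp_values: list[int]):
        self.small = bytes(min(v, self.MARKER) for v in lcp_values)
        # Stand-in for the MPHF: index in the lcp-table -> full lcp value.
        self.extra = {i: v for i, v in enumerate(lcp_values) if v >= self.MARKER}

    def __getitem__(self, i: int) -> int:
        v = self.small[i]
        return v if v < self.MARKER else self.extra[i]


lcp = CompactLcp([0, 2, 257, 3, 279, 2, 300, 0, 260])
print(lcp[1], lcp[2], lcp[6])  # 2 257 300
print(len(lcp.small), len(lcp.extra))  # 9 4
```

With a sorted extra table the large values would need a binary search per access; keying the extra table by position through a constant-time hash removes that logarithmic factor.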