Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer...

Optimizing multi-pattern searches for compressed suffix arrays

Kalle Karhu, Department of Computer Science and Engineering, Aalto University, School of Science, Finland

Outline of the presentation

• Problem settings

• Methods used in the work

– Preprocessing of text

– Preprocessing of patterns

– Executing the search

• Data

• Runs

• Results

• Concluding remarks

• Future work.

Problem setting

• We would like to search for text patterns, or queries, in text databases

• Multiple sets of a large number of long patterns

– Here we handle a single set of 1000 patterns, each 1000 nucleotides long

• Multiple instances of preprocessed text

– A compressed suffix array (CSA) can be used purely as an index alongside a plain copy of the text, or stored alone, since it is a self-index

• In these experiments, both patterns and text were DNA

Problem setting

• As a single pattern set can be searched for in multiple texts and vice versa, preprocessing time does not limit the usefulness of a method

– Time taken by preprocessing is amortized over a large number of searches

• Because of this, it makes sense to store the patterns in preprocessed form

• This leads to searching for a preprocessed set of patterns in preprocessed text.

Methods – Preprocessing the text

• A compressed suffix array (CSA) was constructed from the text, using the package available on the Pizza & Chili website (P. Ferragina and G. Navarro)

• Two main parameters exist for the CSA:

– Samplerate: the interval between two indices of the suffix array stored explicitly. Default value 16 was used

– Samplepsi: the interval between two indices of the psi function stored explicitly. Default value 128 was used.
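The role of the sample rate can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Pizza & Chili implementation: the function names are hypothetical, suffix array entries are sampled by text position (every multiple of the rate, plus the last position), and psi is kept as a full array, whereas the actual structure stores psi explicitly only at every Samplepsi-th index.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; fine for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_psi(sa):
    # psi[r] is the rank of the suffix starting one position after suffix sa[r].
    n = len(sa)
    inverse = [0] * n
    for rank, pos in enumerate(sa):
        inverse[pos] = rank
    return [inverse[(pos + 1) % n] for pos in sa]


def sample_suffix_array(sa, rate):
    # Assumption: keep entries whose text position is a multiple of `rate`,
    # plus the last text position so that a lookup never wraps around.
    last = len(sa) - 1
    return {rank: pos for rank, pos in enumerate(sa)
            if pos % rate == 0 or pos == last}


def lookup(rank, psi, samples):
    # Follow psi until a sampled entry is reached, then subtract the steps taken.
    steps = 0
    while rank not in samples:
        rank = psi[rank]
        steps += 1
    return samples[rank] - steps


text = "ACGTACGTTAGC$"  # "$" is a sentinel that sorts before A, C, G, T
sa = build_suffix_array(text)
psi = build_psi(sa)
samples = sample_suffix_array(sa, rate=4)
recovered = [lookup(rank, psi, samples) for rank in range(len(sa))]
```

A larger rate keeps fewer explicit entries but lengthens the walk along psi for each located position, which is the space-time tradeoff the parameter controls.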

Methods – Preprocessing patterns

• Using a compression tool called Re-Pair, a collection of subpatterns of the patterns was retrieved

• The principal idea is to find a set of subpatterns that occur in a large number of patterns but are rare in the text

• Assuming that the letters in the text are independent and identically distributed, long patterns occur rarely

• Conveniently, Re-Pair produces phrases, which are long substrings of its input that occur more than once

• These phrases were simply scored by the number of times they occur in the pattern set

• Additionally, the length of the subpattern was required to exceed a set threshold

– Done to limit the expected number of occurrences this subpattern would have in the text.
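The scoring and the length threshold described above might be sketched as follows. This is a minimal illustration under several assumptions: the candidate phrases are taken as given (Re-Pair itself is not implemented here), each pattern is counted once per phrase, a phrase passes when its length is at least `min_length`, and all function names are hypothetical. The helper `expected_occurrences` applies the independent and identically distributed assumption with a uniform alphabet.

```python
def expected_occurrences(text_length, phrase_length, alphabet_size=4):
    # Under a uniform i.i.d. model, each of the possible start positions
    # matches a fixed phrase with probability alphabet_size ** -phrase_length.
    positions = max(text_length - phrase_length + 1, 0)
    return positions / alphabet_size ** phrase_length


def score_phrases(phrases, patterns, min_length):
    # Score each sufficiently long phrase by the number of patterns containing it.
    scores = {}
    for phrase in set(phrases):
        if len(phrase) < min_length:
            continue
        count = sum(1 for pattern in patterns if phrase in pattern)
        if count > 0:
            scores[phrase] = count
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


patterns = ["ACGTACGT", "TTACGTAA", "GGGGCCCC"]
phrases = ["ACGT", "CG", "GGCC", "TTTT"]
ranked = score_phrases(phrases, patterns, min_length=3)
```

For a 50 MB text and a phrase of 25 nucleotides, `expected_occurrences(50_000_000, 25)` is far below one, which is why long phrases are expected to be rare in the text under this model.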

Methods – doing the search

• Search for the preprocessed subpatterns in the CSA using locate

– $O(m \log n + occ \cdot \log^{\varepsilon} n)$, $0 < \varepsilon < 1$ for space-time tradeoff

• Extend the initial matches of these subpatterns to check if they are an exact match, using character by character comparison

– This is done for each pattern that includes the subpattern

• Stop this after a set number of patterns are handled using this approach

• Finish the search using the locate function for the remaining full patterns.
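The steps above can be sketched as follows. This is an illustration only: `locate` here is a plain string scan standing in for the CSA locate function, the ranked subpattern list is assumed to come from the pattern preprocessing, and the names `multi_pattern_search` and `max_handled` are hypothetical. In a real setting, the mapping from subpatterns to the patterns containing them would be stored during preprocessing rather than recomputed.

```python
def locate(text, query):
    # Stand-in for the CSA locate: all start positions of query in text.
    positions = []
    start = text.find(query)
    while start != -1:
        positions.append(start)
        start = text.find(query, start + 1)
    return positions


def multi_pattern_search(text, patterns, ranked_subpatterns, max_handled):
    results = {}
    for sub in ranked_subpatterns:
        # Stop the subpattern phase once enough patterns have been handled.
        if len(results) >= max_handled:
            break
        pending = [i for i, p in enumerate(patterns)
                   if i not in results and sub in p]
        if not pending:
            continue
        hits = locate(text, sub)  # one index query shared by all pending patterns
        for i in pending:
            pattern = patterns[i]
            offset = pattern.find(sub)
            matches = []
            for hit in hits:
                start = hit - offset
                # Extend the subpattern hit and verify the full pattern.
                if start >= 0 and text.startswith(pattern, start):
                    matches.append(start)
            results[i] = matches
    # Remaining patterns are searched in full with locate.
    for i, pattern in enumerate(patterns):
        if i not in results:
            results[i] = locate(text, pattern)
    return results


text = "GATTACAGATTACATTGACA"
patterns = ["GATTACA", "TTACAT", "TGACA"]
found = multi_pattern_search(text, patterns, ["TTAC"], max_handled=2)
```

Every occurrence of a pattern implies an occurrence of its subpattern at a fixed offset, so verifying each subpattern hit finds all occurrences of the pattern. In this sketch the limit is checked between subpatterns, so slightly more than `max_handled` patterns may be handled in the first phase.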

Data

• The 50 MB DNA text was retrieved from the Pizza & Chili website

• 1000 patterns of length 1000 nucleotides were generated from this text at random

– That is, substrings of the text were retrieved from random locations

• It later became apparent that each of the patterns occurs only once in the text, which would not necessarily always be the case.
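Drawing patterns from random locations and checking how often each one occurs could look roughly like this. The function names, the seed, and the toy text are assumptions for illustration; the original experiment used the 50 MB DNA text and its own tooling.

```python
import random


def sample_patterns(text, count, length, seed=0):
    # Pick `count` substrings of the given length from random start positions.
    rng = random.Random(seed)
    starts = [rng.randrange(len(text) - length + 1) for _ in range(count)]
    return [text[s:s + length] for s in starts]


def count_occurrences(text, pattern):
    # Count all occurrences, including overlapping ones.
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


toy_rng = random.Random(1)
toy_text = "".join(toy_rng.choice("ACGT") for _ in range(2000))
toy_patterns = sample_patterns(toy_text, count=10, length=50)
occurrences = [count_occurrences(toy_text, p) for p in toy_patterns]
```

Each sampled pattern occurs at least once by construction; a count above one would indicate a repeated region in the text.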

Runs

• The patterns were searched for in the text index as described in the methods section

• Five different thresholds were used for the minimum length of the subpattern: 25, 28, 30, 33 and 35

• Additionally, for each of these thresholds, the number of patterns handled by locating subpatterns was controlled by finishing this phase after 100, 300 or 500 patterns were handled

– However, as the subpatterns did not always occur in the full allowed number of patterns, this number of patterns handled by locating subpatterns and extending was lower in some runs

• The time taken by these runs was compared with the time taken to search for all of the patterns using the CSA locate function alone.
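The parameter combinations described above form a small grid of 15 runs. A sketch of enumerating and timing them follows; the constant and function names are hypothetical, and the timing helper is a generic wall-clock measurement rather than the measurement setup used in the original experiments.

```python
from itertools import product
import time

MIN_SUBPATTERN_LENGTHS = (25, 28, 30, 33, 35)
PATTERN_LIMITS = (100, 300, 500)


def run_grid():
    # Every (minimum subpattern length, pattern limit) combination.
    return list(product(MIN_SUBPATTERN_LENGTHS, PATTERN_LIMITS))


def time_call(func, *args):
    # Return the result of func(*args) together with the elapsed seconds.
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


grid = run_grid()
```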

Multi-pattern search on CSA

Results, set of 1000 patterns

[Figure: total search time for 1000 patterns (s), y-axis from 1.4 to 2, plotted against the number of patterns handled by searching subpatterns, x-axis from 75 to 325. One series per minimum subpattern length (msl = 25, 28, 30, 33, 35) and one for traditional CSA (Trad. CSA).]

• Msl = 30 → 14.0 % decrease in run-times.

Multi-pattern search on CSA

Results, level of individual patterns

• Msl=35 → time per pattern was 71.6 % less than with traditional CSA.

[Figure: average search time per pattern (ms), y-axis from 0 to 4, plotted against the number of patterns handled by searching subpatterns, x-axis from 75 to 325. One series per minimum subpattern length (msl = 25, 28, 30, 33, 35) and one for traditional CSA (Trad. CSA).]

Results

• With the new method, searching for the subpatterns generally took around 85 % of the time and checking for exact matches the remaining 15 %

• The memory consumption is not notably different

– Phrases and their pattern-related information have to be saved, but in practice this consumes a lot less memory than saving the CSA

• Total preprocessing time for the set of patterns was roughly 0.8 s.

Concluding remarks

• As minimum subpattern length is increased, the average time taken per pattern decreases

• Interestingly, the average time taken per pattern also decreases when more patterns are handled by the proposed method

– Suggests that subpatterns occurring extremely commonly in the set of patterns are not the best choices

• A more sophisticated method for choosing subpatterns that occur in multiple patterns of the set would be helpful

– The proposed method would work on independent and identically distributed text, but DNA definitely does not have these properties.

Future Work

• Consider k-mer distributions of the subpatterns and compare them to the k-mer distribution of the text

– If the k-mer distribution of the text is unknown, sampling or other methods could be used

– Hopefully this would lead to better estimates of the probability that a subpattern occurs in the text

• More work to be done in the sorting of the subpatterns

• This approach could be implemented for searches using other index structures as well

– Anything where time taken by locate functionality strongly correlates with the length of the query should work well.
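One possible way to turn k-mer counts into an occurrence estimate is a Markov-chain approximation, sketched below. This is an assumption about how the idea could be realised, not a method from the presentation: the function names are hypothetical, the k-mer counts are taken from the full text rather than from a sample, and k is assumed to be at least 2.

```python
from collections import Counter


def kmer_counts(text, k):
    # Count every length-k substring of the text.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def estimated_occurrences(text, subpattern, k):
    # Order-(k-1) Markov estimate: start from the observed count of the first
    # k-mer and multiply by the conditional frequency of each following symbol.
    if k < 2 or len(subpattern) < k:
        raise ValueError("need k >= 2 and a subpattern of at least k symbols")
    kmers = kmer_counts(text, k)
    contexts = kmer_counts(text, k - 1)
    estimate = float(kmers[subpattern[:k]])
    for i in range(1, len(subpattern) - k + 1):
        kmer = subpattern[i:i + k]
        if kmers[kmer] == 0:
            return 0.0  # a k-mer never seen in the text rules the subpattern out
        estimate *= kmers[kmer] / contexts[kmer[:-1]]
    return estimate


toy_estimate = estimated_occurrences("ACACACAC", "ACAC", k=2)
```

Subpatterns with a low estimate and a high score in the pattern set would then be the preferred candidates, replacing the uniform i.i.d. assumption with statistics of the actual text.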
