Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented by Raghav & Balaji


Indexing Large Biological Sequences

Introduction
Indexing strategies
Suffix trees
New construction algorithm
Query
Experiment and results
Conclusion

Introduction
What is DNA?
– Four bases: A, C, G, T (A pairs with T, C pairs with G)
– Length is measured in base pairs; Gbp = giga base pairs
– Mammalian genomes: 3 Gbp
What is the challenge in indexing DNA?
– Large size and no definite pattern
Searching DNA sequences
– Sequential scanning and filtering approach (BLAST, FASTA)

Introduction
The rise in the volume of data and in the demand for searches by researchers has accelerated the need for better, index-based searching.
New sequences will be revealed as improved sequencing techniques are developed.

Determining DNA sequences is useful in studying fundamental biological processes, as well as in forensic research.

Indexing Strategies Considered
Inverted files: not suitable, since DNA cannot be broken into words.
B-tree: not suitable, for the same reason.
Q-grams: cannot deliver matches that have low similarity to the query.
Most of these techniques are infeasible.

Indexing Strategies Considered
Suffix trees
– Ideal choice for this type of indexing.
– Suffix trees on disk could only be built for small sequences (the “memory bottleneck”).
Suffix tree storage optimization
– Reduces the RAM required to around 13 bytes per character indexed.
– Not tested on disk.

Indexing Strategies Considered
An approach to searching DNA sequences using an adaptation of the suffix tree:
– Build suffix trees on disk for arbitrarily large sequences.
– New query processing strategies.
Alternative data structures: q-grams, suffix arrays, String B-trees…

Suffix Trees
A suffix tree is a compressed digital trie.
It is a rooted directed tree with m leaves, where m is the length of S (the database string).
For any leaf i, the concatenation of the edge labels on the path from the root to leaf i spells out exactly the suffix of S that starts at position i.
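
The leaf property can be checked on a small example. This is a minimal Python sketch, not the paper's implementation: it builds an uncompressed suffix trie (one character per edge) rather than a compressed suffix tree, and it assumes a unique terminator `$` is appended so that every suffix ends at a leaf. The names `build_suffix_trie` and `leaves` are illustrative.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie (one character per edge)."""
    if not s.endswith("$") or s.count("$") != 1:
        raise ValueError("expected a single '$' terminator at the end (assumption)")
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based start position of this suffix
    return root


def leaves(node, path=""):
    """Yield (start position, path label) for every leaf below node."""
    for key, child in node.items():
        if key == "leaf":
            yield child, path
        else:
            yield from leaves(child, path + key)


S = "mississippi$"
spelled = dict(leaves(build_suffix_trie(S)))
# Leaf i should spell out the suffix of S starting at position i (1-based).
print(all(spelled[i] == S[i - 1:] for i in range(1, len(S) + 1)))
```

Compressing every chain of single-child nodes into one edge would turn this trie into the suffix tree described on the slide.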

Suffix Trees

Suffix tree is a compressed digital (suffix) trie

Suffix tree building
Suffixes of mississippi:
1 mississippi
2 ississippi
3 ssissippi
4 sissippi
5 issippi
6 ssippi
7 sippi
8 ippi
9 ppi
10 pi
11 i
[Figure: tree built from the suffixes of mississippi]

Resulting suffix tree
[Figure: suffix tree for mississippi, with leaves numbered 1 to 11 by suffix start position]

Suffix Trees
Suffix links
– A necessary implementation trick for achieving linear time and space bounds while building the tree.
– A suffix link is a pointer from an internal node xS to another internal node S, where x is an arbitrary character and S is a possibly empty substring.
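
The definition can be illustrated by brute force on a small string. This sketch is not Ukkonen's algorithm and is far from linear time. It lists the path labels of the internal nodes of the suffix tree, taken here to be the substrings followed by at least two distinct characters, and maps each label xS to S. The function names and the `$` terminator are assumptions made for the example.

```python
from collections import defaultdict


def internal_node_labels(s):
    """Brute-force path labels of the internal (branching) nodes of the suffix tree of s.

    A substring labels an internal node when it is followed by at least two
    distinct characters somewhere in s. For illustration on short strings only.
    """
    followers = defaultdict(set)
    for i in range(len(s)):
        for j in range(i, len(s)):
            followers[s[i:j]].add(s[j])  # substring s[i:j] is followed by s[j]
    return {label for label, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each non-root internal node label xS to its suffix link target S."""
    nodes = internal_node_labels(s)
    links = {label: label[1:] for label in nodes if label}
    # Every target must itself be an internal node (the root has the empty label).
    assert all(target in nodes for target in links.values())
    return links


links = suffix_links("mississippi$")
for source in sorted(links, key=lambda label: (len(label), label)):
    print(repr(source), "->", repr(links[source]))
```

For mississippi the chain issi -> ssi -> si -> i -> root shows how a builder can jump from one insertion point to the next without restarting at the root.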

Suffix Tree Construction
Ukkonen’s method
– Uses suffix links
– Complexity: O(n)

Suffix Trees

General applications of Suffix trees

Find all occurrences of q as a substring of S

Longest substring common to a set T of strings

Find the longest palindrome in S
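
The first application, finding all occurrences of q in S, can be sketched as follows. As an assumption for brevity, the example uses an uncompressed suffix trie built with nested dictionaries instead of a true suffix tree; the lookup logic is the same: walk down along q, then report every leaf below the node reached. `build_suffix_trie` and `find_occurrences` are illustrative names.

```python
def build_suffix_trie(s):
    """Nested-dict suffix trie; the key "leaf" stores the 1-based suffix start."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_occurrences(trie, q):
    """Return the sorted 1-based positions at which q occurs in the indexed string."""
    node = trie
    for ch in q:  # walk down the path labelled q
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:  # every leaf below this node is one occurrence
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("mississippi$")
print(find_occurrences(trie, "ssi"))
print(find_occurrences(trie, "issi"))
print(find_occurrences(trie, "x"))
```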

Suffix Trees
Analysis of the suffix-link-based algorithm
– The tree is built incrementally, checkpointing it after each portion has been attempted.
– Two distinct traversal patterns exist, and both are used during construction.
– Construction time is very long.
– These effects combine to limit the size of a tree that can be constructed and stored on disk to what fits in the available main memory.

Suffix Trees
Using the suffix-link-based algorithm, it was observed that checkpointing trees indexing more than 21 Mbp was not possible with 1.8 GB of main memory.
Reason: object header size increases.

New Construction Algorithm
Difficulties of traditional suffix tree construction:
– Memory bottleneck
– Necessity of random access
New concept:
– Abandon the use of suffix links.
– Perform multiple passes over the sequence, constructing the suffix tree for a subrange of suffixes at each pass.

New Construction Algorithm
Removing suffix links means that the construction of a new partition does not modify previously checkpointed partitions of the tree.
Using multiple passes means that it is not necessary to access or update previously checkpointed partitions.
That is, the data structures for completed partitions can be evicted from main memory and will not be faulted back in during the rest of the tree’s construction.

New Construction Algorithm
Partition concept:
– Build multiple suffix trees, each of which fits in memory (suffixes starting with AC, AT or AG fall into different partitions).
– Partitions are based on the prefix of each suffix.
– Use a sliding window of length l.
– Form a string s1 of window length l.
– Scan the sequence and count the number of occurrences of s1.
– Use a bin-packing technique to pack the (s1, #occurrences) pairs.
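
The partitioning steps above can be sketched in Python. The slides do not say which bin-packing technique is used, so first-fit decreasing is an assumption here, as are the function names and the `capacity` parameter (standing in for the number of suffixes whose tree fits in memory). The sketch stops at selecting the suffixes for each pass and does not build the trees.

```python
from collections import Counter


def count_prefixes(seq, l):
    """Slide a window of length l over seq and count each window string."""
    return Counter(seq[i:i + l] for i in range(len(seq) - l + 1))


def pack_prefixes(counts, capacity):
    """First-fit decreasing: group prefixes so each group's total count <= capacity.

    A prefix whose own count exceeds capacity still gets a group to itself.
    """
    bins = []  # each entry is [total count, list of prefixes]
    for prefix, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for b in bins:
            if b[0] + n <= capacity:
                b[0] += n
                b[1].append(prefix)
                break
        else:
            bins.append([n, [prefix]])
    return [set(b[1]) for b in bins]


def suffixes_for_partition(seq, prefixes, l):
    """One pass over seq: 0-based starts of the suffixes whose first l characters are in prefixes.

    Suffixes shorter than l are ignored in this sketch.
    """
    return [i for i in range(len(seq) - l + 1) if seq[i:i + l] in prefixes]


seq = "ACGTACGGACAT"
counts = count_prefixes(seq, 2)
partitions = pack_prefixes(counts, capacity=4)
for part in partitions:
    print(sorted(part), suffixes_for_partition(seq, part, 2))
```

Each partition is processed in its own pass, so the tree for one partition never needs the suffixes of another.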

New Construction Algorithm

Partition technology:
– Assumption: the tree is uniformly populated.
– Prefix code (Pi): suffixes that are indexed during the jth pass of the sequence have j*r <= Pi < (j+1)*r.
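
One way to read the prefix-code condition is as an integer range test. In this sketch the prefix code is taken to be the base-4 value of the first l characters of a suffix (A=0, C=1, G=2, T=3) and r is taken to be the number of codes handled per pass. Both readings, along with the function names, are assumptions, since the slides do not define the encoding or r.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding of the DNA alphabet


def prefix_code(seq, i, l):
    """Base-4 value of the l characters of seq starting at 0-based position i."""
    if i + l > len(seq):
        raise ValueError("suffix is shorter than the prefix length")
    code = 0
    for ch in seq[i:i + l]:
        code = code * 4 + CODE[ch]
    return code


def pass_number(code, r):
    """Pass j such that j*r <= code < (j+1)*r."""
    return code // r


seq = "GATTACA"
l, r = 3, 16  # 4**3 = 64 possible codes, 16 per pass, so 4 passes
for i in range(len(seq) - l + 1):
    p = prefix_code(seq, i, l)
    print(seq[i:i + l], p, pass_number(p, r))
```

With equal-width ranges the passes are balanced only if the tree is uniformly populated, which is the assumption stated on the slide.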

New Construction Algorithm

The actual algorithm [Pseudo code]

New Construction Algorithm
Tree creation for ANA$
1 root
2 ANA$
3 NA$
4 A$
5 $
[Figure: suffix tree for ANA$ with nodes numbered 1 to 5]

New Construction Algorithm
Node layout
Original tree node (Ukkonen): left index, right index, suffix number, sib, suffix link, child
Modified node: left index, child, sib
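
The two node layouts can be written down as records to make the size difference concrete. This is a sketch under assumptions: the field names follow the slide, but the class names, the use of Python dataclasses, and the `children` helper are illustrative, and the slides do not say how the dropped fields are recovered in the modified node.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class UkkonenStyleNode:
    """Six fields, as listed for the original tree node."""
    left_index: int
    right_index: int
    suffix_number: int
    child: Optional["UkkonenStyleNode"] = None
    sib: Optional["UkkonenStyleNode"] = None
    suffix_link: Optional["UkkonenStyleNode"] = None


@dataclass
class ModifiedNode:
    """Three fields, as listed for the modified node (no suffix link)."""
    left_index: int
    child: Optional["ModifiedNode"] = None
    sib: Optional["ModifiedNode"] = None


def children(node) -> Iterator:
    """Walk the first-child / next-sibling chain of either node type."""
    current = node.child
    while current is not None:
        yield current
        current = current.sib


root = ModifiedNode(0, child=ModifiedNode(1, sib=ModifiedNode(2)))
print([c.left_index for c in children(root)])
```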

Query
Only exact pattern matching is considered.
One query involves one partial traversal of the tree.
Complexity of suffix tree search: O(k+m), where k is the query length and m is the number of matches in the index.
Queries of length q bring back a 1/(a^q) fraction of the whole tree, where a is the size of the active alphabet, i.e. 4 (A, C, G, T).
New query strategies:
– Short query: serial scan of the sequence
– Longer query: use the index structure
– Threshold: 10 to 12 letters
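
The 1/(a^q) fraction gives a rough feel for why short queries are expensive on the index. The sketch below assumes characters are uniformly and independently distributed, which real DNA is not, and picks 11 as a hypothetical threshold inside the 10 to 12 range quoted above; `expected_matches` and `choose_strategy` are illustrative names.

```python
def expected_matches(n, q, a=4):
    """Expected number of matches for a length-q query over n indexed characters,
    assuming a uniform, independent alphabet of size a (an idealisation)."""
    return n / a ** q


def choose_strategy(q, threshold=11):
    """Short queries use a serial scan, longer ones use the index.

    The threshold value is a hypothetical choice within the 10 to 12 range.
    """
    return "serial scan" if q < threshold else "index"


n = 263_000_000  # size of the human DNA data set used in the experiments
for q in (8, 10, 12, 14):
    print(q, round(expected_matches(n, q), 1), choose_strategy(q))
```

Under this model a 10-letter query over 263 Mbp is expected to return roughly 250 matches, and each extra letter divides that by 4.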

Experiment and Results
Development and experiment platform:
– Software: PJama, Java 1.3 and Solaris 7
– Hardware: Enterprise 450 with 2 GB RAM
Test data:
– 6 single chromosomes of the worm C. elegans (20.5 Mbp max. length)
– Human chromosomes 21, 22 and 1 (280 Mbp)
Alphabet: A, C, G, T, $, *

Experiment and Results
Trees with suffix links (using 20.5 Mbp of DNA):
– Construction in memory: 7 mins
– Construction on disk: 34 hours
Trees without suffix links (263 Mbp of DNA):
– 19 hours

Experiment Results
Exact string matching using 263 Mbp of human DNA
Queries sent in batches using warm storage

Experiment Results

Cold Storage

Experiment Results

Further Work
Improvements to the tree representation and incremental construction algorithm.
Investigation of the interaction between approximate matching algorithms and disk-based suffix trees.

Investigation of alternative persistent storage solutions.

Integration of the algorithms with biological research tools and usability studies.

Conclusion
The paper presents an approach to searching DNA sequences using an adaptation of the suffix tree data structure.
The approach allows suffix trees to be built on disk for arbitrarily large sequences.
It opens up the prospect of building suffix trees in parallel, and the simplicity of the approach can make suffix trees more popular.