Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering Vol. 14, No....

Hugh E. Williams and Justin Zobel

IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February 2002

Presented by Jitimon Keinduangjun

Agenda

1. Introduction

2. What are the problems?

3. What are other people doing?

4. Indexed Genomic Retrieval with CAFÉ

5. Experimental Results

6. Conclusion


1. Introduction

• Biological sequence databases contain many DNA and protein sequences.

• DNA (deoxyribonucleic acid) is the primary genetic material in all living organisms

– A molecule composed of two complementary nucleotide strands connected by base pairs; each base pairs with exactly one other base:

adenine (A) pairs with thymine (T)

guanine (G) pairs with cytosine (C)
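
The pairing rule above can be written as a lookup table. This is a minimal sketch; the function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the slides.

```python
# Each base pairs with exactly one partner: A with T, G with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # partner base at each position
print(reverse_complement("ATGC"))  # the opposite strand, read 5' to 3'
```

Because the two strands run in opposite directions, the reverse complement is usually the form compared against a database.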

1. Introduction (1)

• A DNA sequence is written over

– a 4-letter alphabet: A, G, C, T

– 1 extra letter: N, for unknown bases

• DNA sequence database

> gi|1786692|gb|AE000155|ECAE000155 Escherichia coli, tesA, ybbA genes from bases 510705 to 522297 (section 45 of 400) of the complete genome
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTTCTACGAAATAGACTAGAAATAGTCTAGTCTACG
> gb|L02373|ECORHSCA Escherichia coli Rhs core genes, complete cds
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTT

: :

The character '>' separates the sequences; the line it starts carries the identifying information for the sequence that follows.
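
A database in this layout can be split into records by watching for the '>' character. The following is a minimal sketch, assuming each description occupies one line and the sequence may span several lines; the function name `parse_fasta` is illustrative, and the sample record is a shortened prefix of the second entry shown above.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from '>'-delimited text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


sample = """> gb|L02373|ECORHSCA Escherichia coli Rhs core genes, complete cds
TAGAATAGATGAGAATTAGTCTG
TTCTACGAAATAGACGAG
"""
for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```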

2. What are the problems?

2.1 Databases and query sequences contain low-quality sequences, so search techniques must tolerate errors to return accurate results

2.2 All techniques require long computation times

2.1 Low quality DNA sequences

• Substitutions, insertions, deletions

– Exact matching is not very effective

– Similarity search is required

• Algorithms find segment pairs whose scores can be improved by allowing insertions and deletions

Query:   3    LTRYCA--GFTSLLKCNDADTIYDG 28
              ||||||  ||  |||||||||  ||
Subject: 3325 LTRYCAPAGFXALLKCNDADT--DG 3350
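
To make the three kinds of difference concrete, the alignment above can be walked column by column. This is a small sketch; `alignment_stats` is a name chosen here, and the two strings are the query and subject rows with the layout spacing removed.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identical, substituted, and gapped columns of an alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    stats = {"identities": 0, "mismatches": 0, "gaps": 0}
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            stats["gaps"] += 1        # insertion or deletion
        elif q == s:
            stats["identities"] += 1  # a '|' in the match line
        else:
            stats["mismatches"] += 1  # substitution
    return stats


stats = alignment_stats("LTRYCA--GFTSLLKCNDADTIYDG",
                        "LTRYCAPAGFXALLKCNDADT--DG")
print(stats)
```

An exact-match search would miss this pair, although most of its columns are identical.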

2.2 Long computation time required

• Databases are very large and varied

• A database contains many different sequences of variable length, so database search requires local similarity

3. What are other people doing?

3.1 SSEARCH Algorithm

– Uses dynamic programming (DP)

• Very slow, very sensitive

3.2 BLAST Algorithm

– BLAST 1.4 (old version): ungapped alignment

• Fast, sensitive

– BLAST 2.0 (new version): gapped alignment

• High speed, less sensitive

3.3 FASTA Algorithm

– Uses DP-based techniques: gapped alignment

• Slow, more sensitive

3.1 SSEARCH Algorithm

Edit distance and dynamic programming

• Assume that the two given sequences are A and B

– n and m are the lengths of sequence A and sequence B, respectively

– $s(a_i, b_j)$: similarity score between the aligned symbols $a_i$ and $b_j$

– Identical aligned pairs score 1 and non-identical pairs score 0

– Distance matrix D: $D_{i,0} = D_{0,j} = 0$ for $i = 0,1,\ldots,n$ and $j = 0,1,\ldots,m$

$$D_{i,j} = \max\{\, D_{i-1,j},\; D_{i,j-1},\; D_{i-1,j-1} + s(a_i, b_j) \,\}$$

– Time complexity is O(n*m)

3.1 SSEARCH Algorithm (1)

• Example: Pairwise alignment via DP

– Sequence a : ACGACA

– Sequence b : AGCAC

Rows follow sequence a, columns follow sequence b:

      -  A  G  C  A  C
  -   0  0  0  0  0  0
  A   0  1  1  1  1  1
  C   0  1  1  2  2  2
  G   0  1  2  2  2  2
  A   0  1  2  2  3  3
  C   0  1  2  3  3  4
  A   0  1  2  3  4  4

Three possible alignments:

(1) a: ACGACA-
    b: A-G-CAC

(2) a: ACG-ACA
    b: A-GCAC-

(3) a: A-CGACA
    b: AGC-AC-

[Figure: a DP cell and its neighbors. d(i,j) is computed from d(i-1,j-1) (match) and from d(i-1,j) and d(i,j-1) (insert or delete).]
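
The recurrence and the example table can be reproduced in a few lines. This is a minimal sketch of the scoring scheme stated on the slides (1 for identical pairs, 0 otherwise, zero-initialized first row and column), not the SSEARCH implementation; `similarity_matrix` is a name chosen here.

```python
from typing import List


def similarity_matrix(a: str, b: str) -> List[List[int]]:
    """Fill D with D[i][j] = max(D[i-1][j], D[i][j-1], D[i-1][j-1] + s)."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at zero.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0
            D[i][j] = max(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1] + s)
    return D


D = similarity_matrix("ACGACA", "AGCAC")
for row in D:
    print(" ".join(str(cell) for cell in row))
print("score:", D[-1][-1])
```

The two nested loops visit every cell once, which is where the O(n*m) cost comes from. Each of the three alignments listed above has four identical columns, matching the bottom-right cell.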

3.2 BLAST Algorithm for DNA

• Sequence A : Length N and Sequence B : Length M

• Similarity scores for DNA:

– Match = 5, Mismatch = -4 (WU-BLAST)

– Match = 1, Mismatch = -3 (NCBI)

[Figure: BLAST search between a sequence of length N and a sequence of length M. A list of words (W = 12) is generated and stored in a keyword tree; the sequence is scanned for exact matches (hits); each hit is then extended.]

Note: Extension consumes more than 90% of the total processing time.
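
The scan-then-extend idea can be sketched as follows. This is a simplified illustration, not BLAST's implementation: a Python dictionary stands in for the keyword tree, only rightward ungapped extension is shown, and the `x_drop` cutoff and all function names are assumptions made for the example. The scores are the NCBI values quoted above, and the word length is 4 rather than 12 to keep the example readable.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MATCH, MISMATCH = 1, -3  # NCBI scores from the slide


def word_table(query: str, w: int) -> Dict[str, List[int]]:
    """Map every length-w word of the query to its start offsets."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table


def find_hits(query: str, subject: str, w: int) -> List[Tuple[int, int]]:
    """Scan the subject for exact word matches; return (query, subject) offsets."""
    table = word_table(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_right(query: str, subject: str, i: int, j: int,
                 w: int, x_drop: int) -> Tuple[int, int]:
    """Extend a hit to the right without gaps; return (best score, length)."""
    score = best = w * MATCH
    length = best_len = w
    qi, sj = i + w, j + w
    # Stop once the score has fallen more than x_drop below the best seen.
    while qi < len(query) and sj < len(subject) and best - score <= x_drop:
        score += MATCH if query[qi] == subject[sj] else MISMATCH
        length += 1
        qi += 1
        sj += 1
        if score > best:
            best, best_len = score, length
    return best, best_len


query, subject = "GATTACAGG", "CCGATTACATT"
hits = find_hits(query, subject, w=4)
print(hits)
print(extend_right(query, subject, *hits[0], w=4, x_drop=5))
```

The scan is a table lookup per position, while extension compares bases one at a time around every hit, which is consistent with extension dominating the running time.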

3.3 FASTA Algorithm for DNA

• Sequence A : Length N and Sequence B : Length M

[Figure: FASTA search between a sequence of length N and a sequence of length M. A list of words (W = 12) is generated and stored in a keyword tree; the sequence is scanned for exact matches (hits); the subsequences around the hits are then aligned.]

4. Indexed Genomic Retrieval with CAFÉ

4.1 Indexing with CAFÉ

4.2 Coarse Searching with CAFÉ (Filtering)

4.3 Fine Searching with CAFÉ, using the same method as FASTA

4.1 Indexing with CAFÉ

• An inverted index consists of two components:

– A search structure

– Posting lists

• Example of an inverted index

ACCC 12,(3:144,154,962), 38,(2:47,1045)

The pattern occurs

– 3 times in the 12th sequence, at offsets 144, 154, and 962

– 2 times in the 38th sequence, at offsets 47 and 1045

• These indexes are compressed to reduce space; the compression is described in detail elsewhere.
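
A toy version of such an index can be built with nested dictionaries. This is a sketch only: the dictionary plays the role of the search structure, the sequences are invented for the example, offsets are assumed to start at 0, and no compression is applied. The names `build_index` and `format_postings` are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(sequences: List[str], k: int) -> Dict[str, Dict[int, List[int]]]:
    """Map each length-k pattern to {sequence number: [offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for seq_no, seq in enumerate(sequences, start=1):
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]][seq_no].append(offset)
    return index


def format_postings(postings: Dict[int, List[int]]) -> str:
    """Render a posting list in the 'seq,(count:offsets)' layout used above."""
    parts = []
    for seq_no in sorted(postings):
        offsets = postings[seq_no]
        offsets_text = ",".join(str(o) for o in offsets)
        parts.append(f"{seq_no},({len(offsets)}:{offsets_text})")
    return ", ".join(parts)


seqs = ["TTACCCGACCC", "ACCCT"]  # made-up sequences for illustration
index = build_index(seqs, k=4)
print("ACCC", format_postings(index["ACCC"]))
```

Looking up a pattern then costs one dictionary access, independent of the number of sequences in the collection.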

4.2 Coarse Searching with CAFÉ

• A novel ranking technique using the index structure

Score for ranking: COMBINED = COVERAGE - k*(LENGTH - COVERAGE)

Example: COVERAGE = 9, LENGTH = 9
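
The score is simple arithmetic, sketched below. The slides do not give a value for k, so k = 0.5 is an assumption for illustration, and the candidates other than the COVERAGE = 9, LENGTH = 9 case are hypothetical.

```python
from typing import List, Tuple


def combined_score(coverage: int, length: int, k: float) -> float:
    """COMBINED = COVERAGE - k * (LENGTH - COVERAGE)."""
    return coverage - k * (length - coverage)


def rank(candidates: List[Tuple[str, int, int]], k: float) -> List[str]:
    """Order (name, coverage, length) candidates by descending COMBINED."""
    ordered = sorted(candidates,
                     key=lambda c: combined_score(c[1], c[2], k),
                     reverse=True)
    return [name for name, _, _ in ordered]


K = 0.5  # assumed value; not stated on the slides
print(combined_score(9, 9, K))  # the slide's example: nothing uncovered
# Hypothetical candidates: (name, COVERAGE, LENGTH)
print(rank([("s2", 6, 9), ("s1", 9, 9), ("s3", 8, 12)], K))
```

When COVERAGE equals LENGTH the penalty term vanishes, so the slide's example scores 9 for any k.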



Example: Ranking by CAFÉ

Homologous -chain hemoglobin

Human - Chimpanzee

Human - Rat

Human - Potato

5. Experimental Results

5.1 Test Data

5.2 Space

5.3 Retrieval Effectiveness

5.4 Speed

5.1 Test Data

• PIR database: used to assess the accuracy of the search system.

• GenBank database: used to assess speed and index space requirements.

5.2 Space

• Uncompressed index size: ~9.7 times the collection size

• Compressed index size (CAFÉ index): ~2.2 times the collection size

• Retrieving uncompressed nucleotide data reduces the speed of the CAFÉ system

5.3 Retrieval Effectiveness

5.4 Speed

6. Conclusion

• The CAFÉ system affords much faster query evaluation than exhaustive searching.

• Better accuracy than the most widely used search tool, BLAST 2.

• Café indices are smaller than the annotated source databases and the indices of previous indexed systems.
