Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering Vol. 14, No. 1, January/February 2002 Presented by Jitimon Keinduangjun


Hugh E. Williams and Justin Zobel

IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 1, January/February 2002

Presented by Jitimon Keinduangjun


Agenda

1. Introduction

2. What are the problems?

3. What are other people doing?

4. Indexed Genomic Retrieval with CAFÉ

5. Experimental Results

6. Conclusion



1. Introduction

• Biological sequence databases contain many DNA and protein sequences.

• DNA (deoxyribonucleic acid) is the primary genetic material in all living organisms
– A molecule composed of two complementary nucleotide strands connected by base pairs, in which each base pairs with only one other base:
  adenine (A) pairs with thymine (T)
  guanine (G) pairs with cytosine (C)
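The pairing rule can be written as a lookup table. The following is a minimal sketch, not part of the original slides; the function names are hypothetical, and passing N through unchanged is an assumption.

```python
# Base-pairing rule from the slide: A<->T, G<->C.
# Assumption: the unknown base N is passed through unchanged.
_PAIRS = str.maketrans("ACGTN", "TGCAN")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(_PAIRS)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```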


1. Introduction (1)

• A DNA sequence consists of
– 4 letters: A, G, C, T
– 1 extra letter: N, for unknown bases

• DNA sequence database

>gi|1786692|gb|AE000155|ECAE000155 Escherichia coli, tesA, ybbA genes from bases 510705 to 522297 (section 45 of 400) of the complete genome
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTTCTACGAAATAGACTAGAAATAGTCTAGTCTACG
>gb|L02373|ECORHSCA Escherichia coli Rhs core genes, complete cds
TAGAATAGATGAGAATTAGTCTGTTCTACGAAATAGACGAGAATTAGTCTAGTCTAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAAATAGACTAGAAATAGTCTAGTCTACGAAATAGACTAGAAATAGCCTAGTTCTGTT

: :

The character '>' starts each record: it separates one sequence from the next and introduces the line that describes the sequence.
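The record layout above (a '>' description line followed by sequence lines) can be read with a few lines of code. This is a minimal sketch under the assumption that the input is plain text in that layout; `parse_fasta` is a hypothetical helper, not something defined in the paper.

```python
def parse_fasta(text):
    """Split '>'-delimited text into (description, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = ">seq1 first record\nACGT\nacgn\n>seq2\nTTTT\n"
for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```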


2. What are the problems?

2.1 Databases and query sequences contain low-quality sequences, so search techniques must tolerate errors to return accurate results

2.2 All techniques require long computation times


2.1 Low quality DNA sequences

• Sequences contain substitutions, insertions, and deletions

– Exact matching is not effective

– Similarity search is required

• Search algorithms find matching segment pairs, and the scores of these pairs can be improved by allowing insertions and deletions (gaps)

Query:      3 LTRYCA--GFTSLLKCNDADTIYDG 28
              ||||||  ||  |||||||||  ||
Subject: 3325 LTRYCAPAGFXALLKCNDADT--DG 3350


2.2 Long computation time required

• Databases are very large and highly varied

• A database contains many different sequences of variable length, so database search requires local similarity


3. What are other people doing?

3.1 SSEARCH Algorithm
– Uses dynamic programming (DP)
  • Very slow, very sensitive

3.2 BLAST Algorithm
– BLAST 1.4 (old version): ungapped alignment
  • Fast, sensitive
– BLAST 2.0 (new version): gapped alignment
  • High speed, less sensitive

3.3 FASTA Algorithm
– Uses DP-based techniques: gapped alignment
  • Slow, more sensitive


3.1 SSEARCH Algorithm

Edit distance and Dynamic Programming

• Assume that the two given sequences are A and B
– n and m are the lengths of sequence A and sequence B, respectively
– s(a_i, b_j): similarity score between the aligned symbols a_i and b_j
– Identical aligned pairs have a score of 1 and non-identical pairs have a score of 0
– Distance matrix D: D_{i,0} = D_{0,j} = 0 for i = 0,1,…,n and j = 0,1,…,m
– Recurrence:

$$D_{i,j} = \max \{\, D_{i-1,j},\ D_{i,j-1},\ D_{i-1,j-1} + s(a_i, b_j) \,\}$$

– Time complexity is O(n*m)
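The recurrence can be filled in row by row. The sketch below is illustrative only: it implements the simplified 1/0 scoring stated on the slide, not the full SSEARCH program, and the function name is hypothetical. The inputs are the two sequences from the pairwise alignment example.

```python
def similarity_matrix(a, b):
    """Fill D with D[i][j] = max(D[i-1][j], D[i][j-1], D[i-1][j-1] + s(a_i, b_j))."""
    n, m = len(a), len(b)
    # Row 0 and column 0 stay at 0 (the boundary condition).
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0  # match = 1, mismatch = 0
            D[i][j] = max(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1] + s)
    return D


D = similarity_matrix("ACGACA", "AGCAC")
for row in D:
    print(row)
print("score:", D[-1][-1])  # score: 4
```

The two nested loops visit each of the n*m cells once, which is where the O(n*m) time bound comes from.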


3.1 SSEARCH Algorithm (1)

• Example: Pairwise alignment via DP
– Sequence a: ACGACA
– Sequence b: AGCAC

Matrix (rows: sequence a; columns: sequence b):

      -  A  G  C  A  C
   -  0  0  0  0  0  0
   A  0  1  1  1  1  1
   C  0  1  1  2  2  2
   G  0  1  2  2  2  2
   A  0  1  2  2  3  3
   C  0  1  2  3  3  4
   A  0  1  2  3  4  4

Three possible alignments:

(1) a: ACGACA-
    b: A-G-CAC

(2) a: ACG-ACA
    b: A-GCAC-

(3) a: A-CGACA
    b: AGC-AC-

Figure legend: cell d(i,j) depends on its neighbours d(i-1,j-1), d(i-1,j), and d(i,j-1). The diagonal move is a match; the other two moves are an insert and a delete.
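An alignment is recovered by walking back from the bottom-right cell to the top-left cell. The sketch below is one possible traceback, assuming the 1/0 scoring of the example. The function name and the tie-breaking order (diagonal first, then up, then left) are assumptions, and a different order yields a different alignment with the same score.

```python
def align(a, b):
    """Return (aligned_a, aligned_b, score) for the 1/0 scoring scheme."""
    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 1 if a[i - 1] == b[j - 1] else 0
            D[i][j] = max(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1] + s)

    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and D[i][j] == D[i - 1][j - 1] + 1:
            # Diagonal move: the two symbols match.
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j]:
            # Move up: a symbol of a is aligned with a gap.
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            # Move left: a symbol of b is aligned with a gap.
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), D[n][m]


aligned_a, aligned_b, score = align("ACGACA", "AGCAC")
print(aligned_a)
print(aligned_b)
print(score)
```

With this tie-breaking order the sketch reproduces alignment (3) from the example.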


3.2 BLAST Algorithm for DNA

• Sequence A: length N; Sequence B: length M

• Similarity scores for DNA:
– Match = 5, Mismatch = -4 (WU-BLAST)
– Match = 1, Mismatch = -3 (NCBI)

[Figure: generating a keyword tree from the list of words (W = 12), scanning for exact matches (hits), and extending each hit; the axes are N and M.]

Note: Extension consumes > 90% of all processing time.
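The scan-then-extend idea can be sketched as follows. This is a simplified illustration, not the BLAST implementation: a dictionary stands in for the keyword tree, the extension is ungapped, the X-drop cutoff `xdrop` and both function names are assumptions, and a small word length is used so that the example stays short. The default scores are the NCBI values quoted above.

```python
def find_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in words.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_hit(query, subject, i, j, w, match=1, mismatch=-3, xdrop=5):
    """Ungapped extension of one hit; returns (score, query_start, query_end)."""
    def pair_score(x, y):
        return match if x == y else mismatch

    best = w * match
    # Extend to the right until the score falls more than xdrop below the best.
    score, right, k = best, 0, 0
    while i + w + k < len(query) and j + w + k < len(subject):
        score += pair_score(query[i + w + k], subject[j + w + k])
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > xdrop:
            break
    # Extend to the left in the same way.
    score, left, k = best, 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0:
        score += pair_score(query[i - k - 1], subject[j - k - 1])
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > xdrop:
            break
    return best, i - left, i + w + right


query, subject = "AAACGTACGTTT", "GGACGTACGTCC"
hits = find_hits(query, subject, 4)
score, start, end = max(extend_hit(query, subject, i, j, 4) for i, j in hits)
print(score, query[start:end])  # 8 ACGTACGT
```

Every hit triggers its own extension loop, which is consistent with the note that extension dominates processing time.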


3.3 FASTA Algorithm for DNA

• Sequence A: length N; Sequence B: length M

[Figure: generating a keyword tree from the list of words (W = 12), scanning for exact matches (hits), and aligning subsequences; the axes are N and M.]


4. Indexed Genomic Retrieval with CAFÉ

4.1 Indexing with CAFÉ

4.2 Coarse Searching with CAFÉ (Filtering)

4.3 Fine Searching with CAFÉ, using the same method as FASTA


4.1 Indexing with CAFÉ

• Inverted indexes consist of two components:
– A search structure
– Posting lists

• Example of an inverted index

ACCC 12,(3:144,154,962), 38,(2:47,1045)

The pattern ACCC occurs
– 3 times in the 12th sequence, at offsets 144, 154, and 962
– 2 times in the 38th sequence, at offsets 47 and 1045

• These indexes are compressed to save space; the compression scheme is described in detail elsewhere.
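A small, uncompressed version of such an index can be built with a dictionary as the search structure. This is a sketch under stated assumptions, not the CAFÉ implementation: the function names, the 1-based sequence numbers, the 0-based offsets, and the overlapping fixed-length patterns are all choices made for illustration, and compression is omitted.

```python
from collections import defaultdict


def build_index(sequences, word_len):
    """Map each pattern to {sequence_number: [offsets]}."""
    index = defaultdict(dict)
    for seq_no, seq in enumerate(sequences, start=1):
        for offset in range(len(seq) - word_len + 1):
            pattern = seq[offset:offset + word_len]
            index[pattern].setdefault(seq_no, []).append(offset)
    return dict(index)


def format_postings(postings):
    """Render a posting list in the 'seq,(count:offsets)' layout of the example."""
    return ", ".join(
        f"{seq_no},({len(offsets)}:{','.join(map(str, offsets))})"
        for seq_no, offsets in sorted(postings.items())
    )


index = build_index(["ACCCACCC", "GGACCC"], 4)
print("ACCC", format_postings(index["ACCC"]))  # ACCC 1,(2:0,4), 2,(1:2)
```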


4.2 Coarse Searching with CAFÉ

• A novel ranking technique using the index structure

Score for ranking: COMBINED = COVERAGE - k*(LENGTH - COVERAGE)

Examples:
– COVERAGE = 9, LENGTH = 9
– COVERAGE = 21, LENGTH = 55
– COVERAGE = 6, LENGTH = 55
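The three examples can be ranked by evaluating the formula directly. The slide does not give a value for k, so the value k = 0.25 below is purely hypothetical and chosen only to make the arithmetic easy to follow; the function name is also an assumption.

```python
def combined(coverage, length, k):
    """COMBINED = COVERAGE - k * (LENGTH - COVERAGE)."""
    return coverage - k * (length - coverage)


K = 0.25  # hypothetical penalty weight; not taken from the slides
examples = [(9, 9), (21, 55), (6, 55)]  # (COVERAGE, LENGTH) pairs from the slide

ranked = sorted(examples, key=lambda cl: combined(cl[0], cl[1], K), reverse=True)
for coverage, length in ranked:
    print(coverage, length, combined(coverage, length, K))
```

With this k the scores are 12.5, 9, and -6.25, so the (21, 55) example ranks first and the (6, 55) example ranks last. A larger k penalises long, sparsely covered regions more heavily.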


Example: Ranking by CAFÉ

Homologous -chain hemoglobin

Human - Chimpanzee

Human - Rat

Human - Potato


5. Experimental Results

5.1 Test Data

5.2 Space

5.3 Retrieval Effectiveness

5.4 Speed


5.1 Test Data

• PIR database: used to assess the accuracy of the search system.

• GenBank database: used to assess speed and index space requirements.


5.2 Space

• Uncompressed index size: ~9.7 times the collection size

• Compressed index size (CAFÉ index): ~2.2 times the collection size

• Retrieving uncompressed nucleotide data reduces the speed of the CAFÉ system


5.3 Retrieval Effectiveness


5.4 Speed


6. Conclusion

• The CAFÉ system provides much faster query evaluation than exhaustive searching.

• It is more accurate than the most widely used search tool, BLAST 2.

• CAFÉ indexes are smaller than the annotated source databases and the indexes of previous indexed systems.
