Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...

26
Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun Park Jun gIm Won {mkseo, sanghyun, jiwon}@cs.yonsei.ac.kr Department of Computer Science, Yonsei University

Transcript of Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the...

Page 1: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Clustered Segment Indexing for Pattern Searching on the Secondary Structure

of Protein Sequences

Minkoo Seo Sanghyun Park JungIm Won

{mkseo, sanghyun, jiwon}@cs.yonsei.ac.kr

Department of Computer Science, Yonsei University

Page 2: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 3: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Our target

Introduction• Similarities in Protein Structure

– Functional properties of proteins usually depend on structures of the proteins.

– There are many proteins which are structurally similar, but their sequences are not similar at the level of amino acids.

Amino Acids (AA)

Loop RegionsSecondary Structure Element (SSE):

Helices and Sheets

Polymer Chain

Protein

Polymer Chain Polymer Chain…

<Protein Structure>

Page 4: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Introduction (cont’)

• Secondary Structure of Protein Sequences– Linear sequence of amino acids folds into three dimensional stru

ctures.

– There are three basic types of folds:• h : Alpha Helices

• e : Beta Sheets

• l : Turns of Loops

– Normally, these types occur in groups.• For example, ‘lhhheeeelll’ is more likely to occur than ‘helhehhll.’

Primary :GQISDSIEEKRGFFSTKR..Secondary:HLLLLLLLLLLHHHEEEE..Probability:855577763445449476..

Page 5: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 6: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Related Works• BLAST and FASTA

– These exhaustive search techniques are often adapted to run on specialized high-end hardware or parallelized.

– They do not solve the fundamental problem of needing to compare a query to each sequence in the collection.

Page 7: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Related Works (cont’)

• Segment Based Indexing

L. Hammel and J. M. Patel, "Searching on the Secondary Structure of Protein Sequence," In Proc. VLDB Conference, 2002

– Indexing scheme is based on the idea of the segmentation of secondary protein sequences.

– Index selectivity is not uniform.

– Gap can be defined only as segments.

Page 8: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 9: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Segment

ID Type LengthStart Loc.

1 H 1 0

1 L 10 1

1 H 3 11

1 E 4 14

Segmentation Process

H LLLLLLLLLL HHH EEEE

H,1 L,10 H,3 E,4

0 1234567890 123 4567

Type, Len:

Loc :

Protein ID :1Primary :GQISDSIEEKRGFFSTKR..Secondary :HLLLLLLLLLLHHHEEEE..Probability:855577763445449476..

Sample Protein Sequence

<Segments>

Page 10: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Segment (cont’)

• Selectivity of Type+Length– Index selectivity is not uniform.

0

50

100

150

200

1 11 21 31 41 51 61

Segment Length

Nu

mb

er

of

se

gm

en

ts i

nT

ho

us

an

ds

l

h

e

Page 11: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Cluster and Look Ahead

IDType

StrLength

StartLoc.

LookAhead

1 h 1 0 lhe

1 l 10 1 he

1 h 3 11 e

1 e 4 14

ID Type LengthStart Loc.

1 h 1 0

1 l 10 1

1 h 3 11

1 e 4 14

<Cluster table><Segments>

IDType

StrLength

StartLoc.

LookAhead

1 hl 11 0 he

1 lh 13 1 e

1 he 7 11

IDType

StrLength

StartLoc.

LookAhead

1 hlhe 18 0

k=0

k=1

k=2

• 2k segments are combined and put into Cluster table. By doing this, we index several characters instead of exactly one character.

• ‘Look Ahead’ fields contain ‘Type’ fields of next n segments.

• Maximum value of k can be limited according to query length.

Page 12: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 13: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Index Selectivity

Cluster table is searched byTypeStr+TypeLen+LookAhead.

Segment table is searched byType+TypeLen.

Cluster indices are 30~3,000 times more selective than the segment index, though the exact length information is lost.

02468

1012141618

(ID

, LO

C)

in T

ho

us

an

ds

0.000

0.020

0.040

0.060

0.080

0.100

0.120

0.140Number of (ID, LOC) Found

Selectivity

Number of Proteins = 320,000

Page 14: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

LookAhead Selectivity

0

2

4

6

8

10

12

14

16

K0 K1 K2 K3 K4

(ID

, LO

C)

in T

ho

usa

nd

s

Num of (ID, LOC) Found w/o LookAhead

Num of (ID, LOC) Found w/ LookAhead

LookAhead improves selectivity 2~34 times.

Number of Proteins = 320,000

Page 15: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 16: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Exact Match

<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>

Query

Compute k

2)4,5logmin()4,||logmin( 22 Qk

<h 1 1><l 10 10><h 3 3><e 4 4><l 3 3>

Clustering

① k=2 ② k=2

Sort Merge• ID must be equal.• Loc difference must be 1(=<h 1 1>).

Post Processing

Validate result using actual sequences.

Searching

① TypeStr=hlhe, TypeLen=1+10+3+4, LookAhead=l② TypeStr=lhel, TypeLen=10+3+4+3

Page 17: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Range Match

<Range Match>

<e 50 60><l 3 3><h 5 6><e 6 8>

Clustering

TypeStr=elhe, TypeLen=64(50+3+5+6)~ 77(60+3+6+8)

Searching

Range of TypeLen is increased too much.

<Selective Clustering Range Match (SCRM)>

<e 50 60><l 3 3><h 5 6><e 6 8>

Clustering

Comparing Histograms

<e 50 60><l 3 3><h 5 6><e 6 8>①

② ③ ④

Search the cluster which seems to returnthe smallest number of rows.

① k=2, TypeLen=64~77 ② k=1, TypeLen=53~63 ③ k=1, TypeLen=8~9 ④ k=1, TypeLen=11~14

Page 18: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 19: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Environment• Oracle Database 8.1.7 on Linux

– To store cluster table and cluster indices.

• JDK 1.4.2_04 on Windows XP– To run searching program.

• Protein Data Used– 80,000 amino acid sequences of PDB(Protein Data Bank) were

downloaded.– Secondary structure of protein sequences were predicted using

PREDATOR.– 320,000 sequences were populated by duplicating 80,000 seque

nces 4 times.

Page 20: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Exact Match

0

10

20

30

40

3 5 10 15 20

|Q|

Tim

e i

n S

ec

.

Ours

ISS

SSS

M ISS(2)

0

5

3 5 10 15 20

|Q|

Tim

e in

Sec

.

Ours

MISS(2)

Exact Match Exact Match

Our method is 3~20 times faster than MISS(2) and 8~49 times faster than SSS and ISS.

Page 21: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Range Match

0

5

10

15

3 5 10 15 20

|Q|

Tim

e i

n S

ec

. SCRM

M ISS(2)

0

5

10

15

20

3 5 10 15 20

|Q|

Tim

e i

n S

ec

.

SCRM

M ISS(2)

0

5

10

15

20

3 5 10 15 20

|Q|

Tim

e in

Sec

.

SCRM

MISS(2)

Range Match, First Range Match, Middle

Range Match, End

SCRM is 1.2~6 times faster than MISS(2).

Page 22: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Scalability

0

5

10

15

80,000 160,000 320,000Number of proteins

Tim

e in

Sec

.

Ours

MISS(2)

05

101520253035404550

80,000 160,000 320,000Number of proteins

Tim

e in

Sec

.

Ours(SCRM)

MISS(2)

Exact Match, |Q|=5 Range Match (Middle), |Q|=5

Our methods are 2~4 times faster than MISS(2).

Page 23: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Index Size

Index size increases linearly.

Maximum K value can be set according to expected length of queries.In most cases, maximum K value of 2 will suffice.

0

500

1,000

1,500

2,000

2,500

3,000

0 1 2 3 4

Maximum K value

Siz

e in

MB

Cluster Index

Segment Index

Number of proteins = 320,000

Page 24: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Outline• Introduction• Related Works• Basic Concept

– Segment– Cluster and Look Ahead

• Selectivity Analysis• Algorithm Description

– Exact Match– Range Match

• Experiments• Conclusion and Future Work

Page 25: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Data and Knowledge Engineering Laboratory

Conclusion and Future Work

• We have proposed the concept of clustered indexing, which indexes fixed number of characters, and look ahead, which enhances index selectivity.

• Our experiments show that proposed techniques are more efficient than the previous algorithms.

• In the future, we would like to study and propose approximate searching algorithms for 3-D structures of proteins.

Page 26: Data and Knowledge Engineering Laboratory Clustered Segment Indexing for Pattern Searching on the Secondary Structure of Protein Sequences Minkoo Seo Sanghyun.

Q & A