Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences

Thomas Schmidt
Jens Stoye

CPM 2004, Istanbul

Gene Order and Function in Bacteria:

Observations:
- Gene order in bacterial genomes is weakly conserved
- Some genes tend to cluster together even in unrelated species
- Functional association of genes inside a cluster

Are there more clusters?

Task:
• Establish a model and search for gene clusters

Formalization of Gene Clusters:

Genomes: permutations π_1, π_2, …, π_k
Genes: numbers 1, …, n
Gene cluster: common interval (subset of numbers occurring contiguously in all permutations)

Example:
π_1: 1 2 3 4 5 6 7 8
π_2: 8 7 6 4 5 2 1 3
π_3: 3 1 2 5 8 7 6 4
π_4: 6 7 4 2 1 3 8 5
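The definition of a common interval can be checked directly on the four example permutations. The following is a minimal brute-force sketch, not one of the algorithms cited in the talk: the function name `common_intervals` is hypothetical, the running time is far from optimal, and only intervals of size at least 2 are reported.

```python
def common_intervals(perms):
    """Brute-force common intervals (size >= 2) of permutations of 1..n.

    A set of genes is a common interval if its members occupy
    contiguous positions in every permutation.
    """
    n = len(perms[0])
    # position[t][g] = index of gene g in permutation t
    position = [{g: p for p, g in enumerate(perm)} for perm in perms]
    reference = perms[0]
    found = set()
    for i in range(n):
        for j in range(i + 1, n):
            genes = reference[i:j + 1]
            # Contiguous in a permutation <=> position span equals size - 1
            if all(
                max(pos[g] for g in genes) - min(pos[g] for g in genes) == j - i
                for pos in position
            ):
                found.add(frozenset(genes))
    return found


perms = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [8, 7, 6, 4, 5, 2, 1, 3],
    [3, 1, 2, 5, 8, 7, 6, 4],
    [6, 7, 4, 2, 1, 3, 8, 5],
]
clusters = common_intervals(perms)
for cluster in sorted(clusters, key=lambda c: (len(c), sorted(c))):
    print(sorted(cluster))
```

Every candidate is taken from the first permutation, which is sufficient because a common interval must be contiguous there as well.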

Algorithms:
- Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n + |output|) time.
- Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.

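Modeling multiple copies of a gene (paralogs):

Problem:
- Gene duplication results in multiple copies of a gene inside a genome
- Difficult to assign the correct gene pair

[Figure: reference genome π_1 = 1 2 3 4 5 6 7 8 compared with genomes π_2 and π_3; question marks indicate gene copies whose assignment is ambiguous]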

Solution:
- Do not distinguish between paralogous gene copies
- Each paralogous copy of a gene gets the same number

Consequence:
- Genomes are modeled as sequences instead of permutations

Example:
S_1: 1 2 3 4 5 6 7 8
S_2: 3 1 2 4 8 7 6 1 2
S_3: 8 7 6 7 5 4 2 1 3

Overview:

• Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
• Formal Model
• Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
• Results

Formal Model:

Given: String S over a finite alphabet Σ

Notation: S[i] = the i-th character of S
          S[i,j] = substring of S starting at index i and ending at index j

Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].

Example:
index: 1 2 3 4 5 6 7 8
S:     3 1 2 3 1 5 2 6

CS(S[2,5]) = {1,2,3}

Given: Subset C ⊆ Σ

Definition: (i,j) is a CS-location of C in S if and only if CS(S[i,j]) = C. A CS-location (i,j) is
- left-maximal if S[i-1] ∉ C
- right-maximal if S[j+1] ∉ C
- maximal if it is both left- and right-maximal

Example:
index: 1 2 3 4 5 6 7 8
S:     3 1 2 3 1 5 2 6

The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal, since S[2] = 1 ∈ C.
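The definitions of character set, CS-location and maximality translate almost literally into code. The sketch below is illustrative only: the function names are hypothetical, indices are 1-based and inclusive as on the slides, and treating a location that touches the start or end of S as maximal on that side is an assumption not stated on the slide.

```python
def character_set(s, i, j):
    """CS(S[i,j]) for 1-based, inclusive indices i <= j."""
    return set(s[i - 1:j])


def is_cs_location(s, c, i, j):
    """True if (i, j) is a CS-location of the set c in s."""
    return character_set(s, i, j) == set(c)


def is_left_maximal(s, c, i):
    """True if S[i-1] is not in c (assumption: also true when i = 1)."""
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, c, j):
    """True if S[j+1] is not in c (assumption: also true when j = |S|)."""
    return j == len(s) or s[j] not in c


S = [3, 1, 2, 3, 1, 5, 2, 6]
C = {1, 2, 3}
print(character_set(S, 2, 5))        # character set of S[2,5]
print(is_cs_location(S, C, 3, 5))    # (3,5) is a CS-location of C
print(is_left_maximal(S, C, 3))      # but S[2] = 1 is in C
print(is_right_maximal(S, C, 5))     # S[6] = 5 is not in C
```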

Given: Collection of k strings S* = (S_1, ..., S_k) over alphabet Σ

Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each S_l, 1 ≤ l ≤ k.

Example:
index: 1 2 3 4 5 6 7 8 9
S_1:   3 2 1 3 1 5 1 6
S_2:   4 3 5 5 5 1 4 2 2
S_3:   7 5 1 5 3 6 5

Common CS-factor: {1,3,5} => S_1: (3,7), S_2: (2,6), S_3: (2,5)
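The example can be reproduced with a short scan. This is a sketch under the assumption that a maximal CS-location of C is exactly a maximal run of characters from C whose character set equals C; `maximal_cs_locations` is a hypothetical helper name, and positions are 1-based and inclusive.

```python
def maximal_cs_locations(s, c):
    """All maximal CS-locations (1-based, inclusive) of the set c in s."""
    c = set(c)
    locations = []
    i, n = 0, len(s)
    while i < n:
        if s[i] not in c:
            i += 1
            continue
        # Extend the run of characters from c as far as possible
        j = i
        while j + 1 < n and s[j + 1] in c:
            j += 1
        # The run is a CS-location only if it contains every character of c
        if set(s[i:j + 1]) == c:
            locations.append((i + 1, j + 1))
        i = j + 1
    return locations


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
C = {1, 3, 5}

for name, s in (("S_1", S1), ("S_2", S2), ("S_3", S3)):
    print(name, maximal_cs_locations(s, C))

# C is a common CS-factor if it has a CS-location in every string
is_common = all(maximal_cs_locations(s, C) for s in (S1, S2, S3))
```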

Problem Formulation:

A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.

Given a collection of k strings S*:

Problem 1: Find all common CS-factors in S*.
Problem 2: For each common CS-factor, find all its maximal CS-locations in each of the strings.

Algorithm "Connecting Intervals" (CI)

• Algorithm CI solves Problem 1 and Problem 2 for two sequences
• Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n
• Output: Pairs of CS-locations of all common CS-factors
• Time & Space complexity: O(n²)

Preprocessing

Compute two tables for S_1 = (3,1,2,3,1,5,2,6):

POS[c] holds all positions where character c occurs in S_1.

POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8

NUM(i,j) counts the number of different characters in S_1[i,j].

NUM(i,j) (rows i, columns j):
i\j  1 2 3 4 5 6 7 8
1    1 2 3 3 3 4 4 5
2      1 2 3 3 4 4 5
3        1 2 3 4 4 5
4          1 2 3 4 5
5            1 2 3 4
6              1 2 3
7                1 2
8                  1
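Both tables can be built in O(n²) time with one pass per starting position. The sketch below is one possible way to do it, not necessarily the authors' implementation: `preprocess` is a hypothetical name, POS is stored as a dictionary of position lists (a character without occurrences simply has no entry), and NUM is stored as a dictionary keyed by (i, j) with 1-based indices.

```python
def preprocess(s1):
    """Build POS and NUM for s1 using 1-based positions."""
    pos = {}
    for p, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(p)

    n = len(s1)
    num = {}
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s1[j - 1])
            num[(i, j)] = len(seen)   # distinct characters in S_1[i,j]
    return pos, num


pos, num = preprocess([3, 1, 2, 3, 1, 5, 2, 6])
print(pos)
print([num[(1, j)] for j in range(1, 9)])   # first row of NUM
```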

Algorithm CI

Algorithm: While reading S_2, mark in S_1 the observed character and track maximal intervals of marked characters.

index: 1 2 3 4 5 6 7 8 9
S_2:   4 3 5 5 5 1 4 2 2
S_1:   3 1 2 3 1 5 2 6

[Figure: pointers i and j delimit the current substring S_2[i,j]; the tables POS and NUM are those computed in the preprocessing step]

Output: ((2,2)-(1,1)) ((2,2)-(4,4)) ((2,6)-(4,6))

(Each pair is written as (location in S_2)-(location in S_1).)

(i,j) = (2,7) is not left-maximal, because S_2[1] = 4 also occurs in S_2[2,7].

Time Complexity

for i = 1,...,|S_2| do
    j = i
    while j ≤ |S_2| and (i,j) is left-maximal do
        if c = S_2[j] is seen for the first time then
            for each entry in POS[c] do
                mark and track
            end for
        end if
        j = j + 1
    end while
end for

Algorithm CI finds all common CS-factors of S_1 and S_2 in O(n²) time.

POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8

S_2: 4 3 5 5 5 1 4 2 2
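The loop structure above can be turned into a runnable sketch. Several details here are assumptions rather than the authors' implementation: "mark and track" is realised with two endpoint arrays `lo` and `hi` that merge neighbouring marked positions in constant time, a matching interval is reported only once the S_2 location has been extended to its right-maximal end, and the name `connecting_intervals` is hypothetical. All positions are 1-based and inclusive.

```python
def connecting_intervals(s1, s2):
    """Return pairs ((i, j), (l, r)): maximal CS-locations in s2 and s1."""
    n1, n2 = len(s1), len(s2)
    pos = {}
    for p, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(p)
    # num[a][b] = number of distinct characters in S_1[a,b]
    num = [[0] * (n1 + 1) for _ in range(n1 + 1)]
    for a in range(1, n1 + 1):
        seen = set()
        for b in range(a, n1 + 1):
            seen.add(s1[b - 1])
            num[a][b] = len(seen)

    result = []
    for i in range(1, n2 + 1):
        marked = [False] * (n1 + 2)    # sentinels at 0 and n1 + 1
        lo = [0] * (n1 + 2)            # lo[r]: left end of interval ending at r
        hi = [0] * (n1 + 2)            # hi[l]: right end of interval starting at l
        seen = set()
        pending = []                   # S_1 intervals matching the current set
        end = i - 1                    # right end of the current S_2 location
        for j in range(i, n2 + 1):
            c = s2[j - 1]
            if i > 1 and c == s2[i - 2]:
                break                  # (i, j) is no longer left-maximal
            if c not in seen:
                # Character set is about to grow: (i, end) is right-maximal
                result.extend(((i, end), iv) for iv in pending)
                seen.add(c)
                starts = set()
                for p in pos.get(c, []):
                    l = lo[p - 1] if marked[p - 1] else p
                    r = hi[p + 1] if marked[p + 1] else p
                    marked[p] = True
                    lo[r], hi[l] = l, r
                    starts.add(l)
                # Keep intervals that are still left ends and contain all of seen
                pending = [(l, hi[l]) for l in sorted(starts)
                           if not marked[l - 1] and num[l][hi[l]] == len(seen)]
            end = j
        result.extend(((i, end), iv) for iv in pending)
    return result


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
pairs = connecting_intervals(S1, S2)
for pair in pairs:
    print(pair)
```

Each start position i marks every position of S_1 at most once, so the work per i is O(n) and the total is O(n²), matching the bound stated above; the quadratic space comes from the NUM table.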

Multiple Genomes

Goal: Find all common CS-factors of a collection S* = (S_1, S_2, ..., S_k)

Algorithm:
1. Apply Algorithm CI to all pairs (S_1, S_l), 2 ≤ l ≤ k
2. Output only the common CS-factors detected in all pairs

Time complexity: O(kn²)
Space complexity: O(kn²) with redundant output, O(n²) otherwise

Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*
Time complexity: O(k(1+k-k')n²)
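The effect of the "at least k' of k" extension can be illustrated with a brute-force sketch. It deliberately does not use Algorithm CI or the pairwise scheme above and does not achieve the stated time bounds: it enumerates the character sets of all substrings of each string and counts in how many strings each set occurs. The names `character_sets`, `common_cs_factors` and the parameter `k_min` (standing in for k') are hypothetical.

```python
from collections import Counter


def character_sets(s):
    """All character sets CS(S[i,j]) of substrings of s, as frozensets."""
    sets = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            sets.add(frozenset(seen))
    return sets


def common_cs_factors(strings, k_min=None):
    """Character sets with a CS-location in at least k_min strings.

    With k_min omitted, a set must occur in every string.
    """
    if k_min is None:
        k_min = len(strings)
    counts = Counter()
    for s in strings:
        counts.update(character_sets(s))   # each set counted once per string
    return {c for c, n in counts.items() if n >= k_min}


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]

in_all = common_cs_factors([S1, S2, S3])
in_two = common_cs_factors([S1, S2, S3], k_min=2)
print(frozenset({1, 3, 5}) in in_all)
print(frozenset({2}) in in_all, frozenset({2}) in in_two)
```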

Saving Space

• Due to the storage of the table NUM, Algorithm CI requires quadratic space.
• An algorithm presented by Didier, WABI 2003, detects all common CS-factors of two sequences in O(n² log n) time and linear space.
• In a modified version that replaces a binary search by a constant-time Range Maximum Query, the time complexity can be reduced to O(n²) while the space remains linear.

Results on real data

• Data set:
  - 43 bacterial genome sequences from NCBI
  - All classified in the "Clusters of Orthologous Groups of Proteins" database (COG)
  - Genes are identified by their COG number
  - Computation time: approx. 5-10 minutes on a standard PC

Results on real data (k' = 2)

[Figure panels: all 43 genomes, cluster size ≥ 2 and cluster size ≥ 3; without closely related genomes (k = 32), cluster size ≥ 2 and cluster size ≥ 3]

Teşekkür ederim! (Thank you!)