Motif Discovery in Protein Sequences using Messy De Bruijn Graph

Mehmet Dalkilic and Rupali Patwardhan


Page 1: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Motif Discovery in Protein Sequences using Messy De Bruijn Graph

Mehmet Dalkilic and Rupali Patwardhan

Page 2: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Goal

The goal of this project is to develop an algorithm that can take advantage of the properties of De Bruijn graphs for discovering motifs in protein sequences.

Page 3: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Outline of Presentation

Motivation and Background
Approach
Implementation
Applications
Future Work

Page 4: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Motivation

Most of the popular motif discovery algorithms being used right now depend on statistical significance to find the motif.

This project explores computational and graph theoretic ways of doing the same thing without using statistical significance.

Such an approach could drastically reduce the time required to search for motifs.

Page 5: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

What is a De Bruijn Graph?

A De Bruijn graph is a graph whose nodes are sequences of symbols from some alphabet and whose edges indicate the sequences which might overlap.

The parameters are the node length (n) and the overlap (k).

So if n=4 and k=3, an edge ACAT → CATS represents the sequence 'ACATS'.
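As a minimal sketch of how an edge maps back to a sequence, the helper below merges two overlapping nodes. The function name `edge_sequence` is hypothetical and does not come from the original implementation; it assumes the overlap is at least 1.

```python
def edge_sequence(source: str, target: str, overlap: int) -> str:
    """Merge two overlapping nodes into the sequence their edge represents."""
    # An edge is only valid if the last `overlap` symbols of the source
    # equal the first `overlap` symbols of the target.
    if source[-overlap:] != target[:overlap]:
        raise ValueError(f"{source!r} and {target!r} do not overlap by {overlap}")
    # Append only the part of the target that the source does not cover.
    return source + target[overlap:]


print(edge_sequence("ACAT", "CATS", overlap=3))  # ACATS
```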

Page 6: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Example

If we have a sequence ABCDEFG, and we take nodelength=4 and overlap=3, we can represent this same sequence by the following De Bruijn graph.

Page 7: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

[Figure: De Bruijn graph for the sequence ABCDEFG]

ABCD → BCDE → CDEF → DEFG

Node Length = 4

Overlap = 3
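The decomposition in this example can be sketched in a few lines of Python. The function name `de_bruijn_edges` is an assumption made for illustration; the slides do not show how the original program builds its graph.

```python
def de_bruijn_edges(sequence: str, node_length: int, overlap: int):
    """Return the nodes and edges that represent `sequence`."""
    if not 0 < overlap < node_length:
        raise ValueError("overlap must be between 1 and node_length - 1")
    # Consecutive nodes start (node_length - overlap) symbols apart.
    step = node_length - overlap
    nodes = [
        sequence[i:i + node_length]
        for i in range(0, len(sequence) - node_length + 1, step)
    ]
    # Each pair of consecutive nodes forms one directed edge.
    edges = list(zip(nodes, nodes[1:]))
    return nodes, edges


nodes, edges = de_bruijn_edges("ABCDEFG", node_length=4, overlap=3)
print(nodes)  # ['ABCD', 'BCDE', 'CDEF', 'DEFG']
print(edges)  # [('ABCD', 'BCDE'), ('BCDE', 'CDEF'), ('CDEF', 'DEFG')]
```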

Page 8: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Applying this to Identify Repeating Sub-sequences

If we have a bunch of sequences, we can go on adding corresponding nodes and edges to our De Bruijn graph.

If any sub-sequence is repeated, the corresponding edge will already be present in that graph.

So we just increment the weight of that edge.

Eventually the edges corresponding to highly repeated sequences will have higher weights.

Now we can find the motif by simply following the graph along these edges with weights above a specified threshold.
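The weighting and path-following steps above can be sketched as follows. This is an illustration under stated assumptions, not the original program: the names `edge_weights` and `heavy_paths` are hypothetical, an edge counts as heavy when its weight is at or above the threshold, and the walk greedily follows the heaviest outgoing edge from each node. The input uses the two sequences from the example on the next slide.

```python
from collections import Counter


def edge_weights(sequences, node_length=4, overlap=3):
    """Count how often each De Bruijn edge occurs across all sequences."""
    weights = Counter()
    step = node_length - overlap
    for seq in sequences:
        nodes = [seq[i:i + node_length]
                 for i in range(0, len(seq) - node_length + 1, step)]
        # A repeated edge is already in the counter, so its weight grows.
        weights.update(zip(nodes, nodes[1:]))
    return weights


def heavy_paths(weights, threshold, overlap=3):
    """Follow edges whose weight is >= threshold and spell out each path."""
    heavy = {edge for edge, w in weights.items() if w >= threshold}
    successors = {}
    for u, v in heavy:
        successors.setdefault(u, []).append(v)
    targets = {v for _, v in heavy}
    # A path starts at a node with no heavy incoming edge
    # (assumption: pure cycles of heavy edges are ignored).
    starts = sorted(u for u in successors if u not in targets)
    motifs = []
    for start in starts:
        motif, node, seen = start, start, {start}
        while node in successors:
            nxt = max(successors[node], key=lambda v: weights[(node, v)])
            if nxt in seen:  # stop rather than loop forever
                break
            motif += nxt[overlap:]
            seen.add(nxt)
            node = nxt
        motifs.append(motif)
    return motifs


weights = edge_weights(["PAKARCDEKD", "ARCDEKHKH"])
print(heavy_paths(weights, threshold=2))  # ['ARCDEK']
```

Because the walk is greedy, a node with several heavy outgoing edges contributes only one branch; a fuller implementation would need to enumerate the alternatives.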

Page 9: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Example

Sequence 1: PAKARCDEKD

Sequence 2: ARCDEKHKH

Constructing the De Bruijn Graph for these sequences …

Page 10: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

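[Figure: weighted De Bruijn graph for PAKARCDEKD and ARCDEKHKH, with the shared sub-sequence ARCDEK highlighted in both sequences]

PAKA →(1) AKAR →(1) KARC →(1) ARCD →(2) RCDE →(2) CDEK →(1) DEKD

CDEK →(1) DEKH →(1) EKHK →(1) KHKH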

Page 11: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Making them Messy

In the context of protein sequences, some amino acid residues can be substituted without affecting the function of the protein.

So a sequence could be considered 'similar' to an edge even though it is not exactly the same.

Similarity is determined in the context of a standard scoring matrix, such as BLOSUM62.

In that case, we increment the weights of all edges that represent sequences that are 'similar' to the one in question.

Page 12: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Example

Consider the same 2 sequences as before, but with K replaced by R in one of them.

PAKARCDERD

ARCDEKHKH

As per BLOSUM62, K and R have a positive substitution score.
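A rough sketch of the messy update on these two sequences is given below. Several things here are assumptions and not taken from the original program: the names `is_similar` and `messy_edge_weights`, the rule that two edge sequences are similar when every aligned residue pair is identical or has a positive score, and the fixed partial increment of 0.75 (chosen only to echo the 1.75 weight in the figure that follows; the slides do not say how that value is computed). `POSITIVE_PAIRS` is a small hand-copied excerpt of BLOSUM62, where K/R and D/E both score 2; a real implementation would load the full matrix.

```python
# Excerpt of positive off-diagonal BLOSUM62 scores (assumption: every
# pair not listed here is treated as non-positive).
POSITIVE_PAIRS = {frozenset("KR"): 2, frozenset("DE"): 2}


def is_similar(a: str, b: str) -> bool:
    """True if a and b differ, but only by positively scoring substitutions."""
    if len(a) != len(b) or a == b:
        return False
    return all(x == y or frozenset((x, y)) in POSITIVE_PAIRS
               for x, y in zip(a, b))


def messy_edge_weights(sequences, node_length=4, overlap=3, partial=0.75):
    """Edge weights where similar edges also receive a partial increment."""
    weights = {}
    step = node_length - overlap
    for seq in sequences:
        nodes = [seq[i:i + node_length]
                 for i in range(0, len(seq) - node_length + 1, step)]
        for u, v in zip(nodes, nodes[1:]):
            window = u + v[overlap:]  # sequence spelled by this edge
            # Reward every existing edge that is similar to the new one.
            for a, b in weights:
                if is_similar(a + b[overlap:], window):
                    weights[(a, b)] += partial
            # The exact edge itself is added or incremented by 1.
            weights[(u, v)] = weights.get((u, v), 0) + 1
    return weights


w = messy_edge_weights(["PAKARCDERD", "ARCDEKHKH"])
print(w[("ARCD", "RCDE")])  # 2
print(w[("RCDE", "CDER")])  # 1.75
print(w[("RCDE", "CDEK")])  # 1
```

Note that this update is order-dependent: an edge only receives partial credit from similar edges added after it, which is one of several design choices the slides leave open.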

Page 13: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

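[Figure: messy De Bruijn graph for PAKARCDERD and ARCDEKHKH. Nodes: PAKA, AKAR, KARC, ARCD, RCDE, CDER, CDEK, DERD, DEKH, EKHK, KHKH. One edge has weight 2, one edge has weight 1.75, and the remaining edges have weight 1.]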

Page 14: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Another Example

> Sequence 1
DMLKLCDKADDKMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDFKLCDKADDARTYVH
> Sequence 3
GTYYYCPGHKLCDEADDFFHVDDTE
> Sequence 4
LKLCDKANDYRPYYPITDPLMMNHI
> Sequence 5
GTYKPGHKLCDEADDFFHENDTEKYC
> Sequence 6
KLCDKADDYRPYYPITDPLGATAKHI

Page 15: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Another Example

The same sequences, with the region highlighted on the slide shown in brackets:

> Sequence 1
DML[KLCDKADD]KMNDRLDDYLKLDD
> Sequence 2
EAKDKFDFKDF[KLCDKADD]ARTYVH
> Sequence 3
GTYYYCPGH[KLCDEADD]FFHVDDTE
> Sequence 4
L[KLCDKAND]YRPYYPITDPLMMNHI
> Sequence 5
GTYKPGH[KLCDEADD]FFHENDTEKYC
> Sequence 6
[KLCDKADD]YRPYYPITDPLGATAKHI

Page 16: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Sample output …

http://biokdd.informatics.indiana.edu/rpatward/L519/project/ex1.html

http://biokdd.informatics.indiana.edu/rpatward/L519/project/ttt.gif

Page 17: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Results

When 41 sequences belonging to the PS00021 family were given as input, the best motif output was YCRNPD.

The Prosite regular expression for this family is [FY]-C-R-N-P-[DNR].

http://biokdd.informatics.indiana.edu/rpatward/L519/project/PS00021_op.html
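To check a discovered motif against a Prosite-style pattern such as the one quoted above, the pattern can be translated into an ordinary regular expression. The converter below is a sketch that handles only part of the Prosite syntax (literal residues, `[...]` sets, `{...}` exclusions, `x` wildcards, and `(n)` or `(n,m)` repeats); the function name `prosite_to_regex` is hypothetical.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat suffix such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        parts.append(core + (f"{{{repeat}}}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[FY]-C-R-N-P-[DNR]")
print(regex)                                # [FY]CRNP[DNR]
print(bool(re.fullmatch(regex, "YCRNPD")))  # True
```

The reported motif YCRNPD is therefore one of the strings accepted by the pattern quoted on this slide.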

Page 18: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Possible Applications

To predict if a given protein sequence is likely to belong to a particular protein family or not.

To construct regular expressions for protein families.

To fine-tune the results of clustering algorithms, by helping to decide whether to merge two clusters or not.

To do preprocessing to improve the performance of other motif discovery algorithms.

Page 19: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Limitation of this Approach

The motif should have at least 3 contiguous amino acid residues.

So the program runs into trouble if the motif consists of alternate residues. For example, something like AxAxCxDxAxGxC (x could be any residue).

The problem is due to the need for overlaps, which is inherent to the nature of De Bruijn graphs.
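The limitation can be illustrated with a small experiment. The three sequences below are made-up instances of the pattern AxAxCxDxAxGxC with different residues in the x positions; the helper `edge_weights` is a hypothetical exact-match edge counter, not the original program. Because no window of five consecutive residues is shared, no edge is ever seen twice.

```python
from collections import Counter


def edge_weights(sequences, node_length=4, overlap=3):
    """Count how often each De Bruijn edge occurs across all sequences."""
    weights = Counter()
    step = node_length - overlap
    for seq in sequences:
        nodes = [seq[i:i + node_length]
                 for i in range(0, len(seq) - node_length + 1, step)]
        weights.update(zip(nodes, nodes[1:]))
    return weights


# Illustrative instances of AxAxCxDxAxGxC (not real protein data).
gapped = ["AKAMCLDPAQGSC", "ARANCTDWAYGHC", "AEAFCIDVALGKC"]
weights = edge_weights(gapped)

# Every edge occurs once, so no threshold above 1 can recover the motif.
print(max(weights.values()))  # 1
```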

Page 20: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Future Work

We would like to integrate a machine-learning aspect to dynamically change the node length and other parameters to find the optimal motif.

We also want to try to extend this approach to do clustering itself.

Page 21: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Link to the Implementation

http://biokdd.informatics.indiana.edu/rpatward/L519/project.html

Page 22: Motif Discovery in Protein Sequences using Messy  De Bruijn Graph

Acknowledgement

I would like to thank Dr. Mehmet Dalkilic for his ideas and support.