Bio process

SURYABHAN SINGH RAWAT

Transcript of Bio process

Page 1: Bio process

SURYABHAN SINGH RAWAT

Page 2: Bio process

Protein Classification

A comparison of function inference techniques

Page 3: Bio process

Why do we need automated classification?

• Sequencing a genome is only the first step.

• Between 35-50% of the proteins in sequenced genomes have no assigned functionality.

• Direct observation of function is costly, time consuming, and difficult.

Page 4: Bio process

Protein Domains

The tertiary structure of many proteins is built from several domains.

Often each domain has a separate function to perform for the protein, such as:

• binding a small ligand (e.g., a peptide in the molecule shown here)

• spanning the plasma membrane (transmembrane proteins)

• containing the catalytic site (enzymes)

• DNA-binding (in transcription factors)

• providing a surface to bind specifically to another protein

In some (but not all) cases, each domain in a protein is encoded by a separate exon in the gene encoding that protein.

Page 5: Bio process

Inference through sequence similarity

ProtoMap: Automatic Classification of Protein Sequences, a Hierarchy of Protein Families, and Local Maps of the Protein Space (1999)

Page 6: Bio process

Final Goal

Page 7: Bio process

Observations

• Sometimes you don’t know where the domains are.

• It is generally accepted that two sequences with over 30% identity are likely to have the same fold.

• Homologous proteins have similar functions.

• Homology is a transitive relationship.

Page 8: Bio process

Departures

• Authors do not attempt to define protein domains or motifs.

• Not dependent on predefined groups or classifications.

• Chart the space of all proteins in SWISSPROT, as opposed to individual families.

• Produce a global organization of sequences.

Page 9: Bio process

Algorithm Overview

• We construct a weighted graph where the nodes are protein sequences and the edges are similarity scores.

• Cluster the network considering only those edges above some threshold.

• Decrease similarity threshold and repeat.
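
The three steps above can be sketched in a few lines. This is only a minimal illustration, not the ProtoMap procedure itself: it swaps in plain connected-components (single-linkage) merging via union-find, and it assumes similarity is given as an E-value, so "stronger than the threshold" means "E-value at or below the cutoff". The function name `cluster_at_thresholds` and the toy edge list are hypothetical.

```python
from collections import defaultdict

def cluster_at_thresholds(edges, thresholds):
    """Naive iterative clustering sketch (assumption: single linkage).

    edges: list of (seq_a, seq_b, e_value); lower E-value = stronger similarity.
    thresholds: E-value cutoffs ordered from strictest to most permissive.
    Returns a list of (threshold, clusters), one entry per level.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for a, b, _ in edges:  # register every sequence as its own cluster
        find(a)
        find(b)

    ordered = sorted(edges, key=lambda e: e[2])  # strongest edges first
    levels, i = [], 0
    for t in thresholds:
        # Admit every edge whose E-value passes the current (relaxed) cutoff.
        while i < len(ordered) and ordered[i][2] <= t:
            union(ordered[i][0], ordered[i][1])
            i += 1
        groups = defaultdict(set)
        for node in list(parent):
            groups[find(node)].add(node)
        levels.append((t, sorted(sorted(g) for g in groups.values())))
    return levels

# Toy data (made up for illustration only).
edges = [("A", "B", 1e-120), ("B", "C", 1e-60),
         ("D", "E", 1e-98), ("C", "D", 1e-3)]
levels = cluster_at_thresholds(edges, [1e-100, 1e-50, 1e-1])
for t, clusters in levels:
    print(t, clusters)
```

Because clusters only ever merge as the cutoff is relaxed, the levels form a hierarchy: each cluster at a permissive level is a union of clusters from the stricter level before it.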

Page 10: Bio process

Measuring Sequence Similarity

• Expectation value (E-value) used. This is the normalized probability of the similarity occurring at random.

• Lower value implies logarithmically stronger similarity.

$S' = \dfrac{\lambda S - \ln K}{\ln 2}$

$E = \dfrac{N}{2^{S'}}$
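
As a worked illustration, the sketch below assumes the standard Karlin-Altschul normalization, S' = (λS − ln K) / ln 2 and E = N / 2^S', where S is the raw alignment score, λ and K are statistical parameters of the scoring system, and N is the size of the search space. The function names and the numeric values in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math

def bit_score(raw_score, lam, k):
    """Normalized (bit) score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

def e_value(bits, search_space):
    """Expected number of chance hits: E = N / 2**S'."""
    return search_space / 2.0 ** bits

# Illustrative parameters only (assumption): lam = ln 2 and K = 1 make the
# bit score equal to the raw score, so the numbers are easy to check by hand.
s_prime = bit_score(10, math.log(2), 1.0)
print(s_prime, e_value(s_prime, 1024))
```

Each additional bit of score halves the E-value, which is why a lower E-value corresponds to a logarithmically stronger similarity.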

Page 11: Bio process

BLOSUM62 Scoring Matrix

Page 12: Bio process

Finding Homologies

• Very difficult to distinguish a clear threshold between homology and chance similarity.

• Authors chose E = 0.1, 0.1, and 0.001 for SW (Smith-Waterman), FASTA, and BLAST, respectively.

• Spent a lot of time empirically determining these thresholds.

Page 13: Bio process

Clustering

• Clustering is done iteratively.

• Start with a threshold of E < 10^-100

• Cluster and increase threshold by a factor of 10^5

• Sublinear threshold prevents the collapse of sequence space
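
The threshold schedule described on this slide can be written out explicitly. The sketch below starts at 10^-100 and relaxes the cutoff by a factor of 10^5 per round; the stopping point (10^0) and the helper name `threshold_exponents` are assumptions, since the slide does not say where the iteration ends.

```python
def threshold_exponents(start=-100, stop=0, step=5):
    """Base-10 exponents of the E-value cutoffs, strictest first.

    start and step follow the slide; stop=0 is an assumed end point.
    """
    return list(range(start, stop + 1, step))

exponents = threshold_exponents()
thresholds = [10.0 ** e for e in exponents]

print(len(thresholds), "levels, from 1e%d to 1e%d" % (exponents[0], exponents[-1]))
```

Working with the exponents rather than the raw values keeps the schedule exact and avoids accumulating floating-point error from repeated multiplication.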

Page 14: Bio process

ProtoMap: Results

Produces well-defined groups which correlate strongly to protein families in PROSITE and Pfam.

Page 15: Bio process

Results: Immunoglobulin Superfamily

Page 16: Bio process

ProtoMap: Limitations

• Analysis performs poorly on families dominated by short/local domains (PH, EGF, ER_TARGET, C2, SH2, SH3, etc.)

• High-scoring, low-complexity segments can lead to nonhomogeneous clusters.

• “Hard” clustering vs. “Soft” clustering

• Has difficulty classifying multidomain proteins.

Page 17: Bio process

ProtoMap: Future Directions

• 3D structure/fold

• Biological function

• Domain content

• Cellular location

• Tissue specificity

• Source organism

• Metabolic pathways

Page 18: Bio process

Inference through protein interaction networks

Functional Classification of Proteins for the Prediction of Cellular Function from a Protein-Protein Interaction Network (2003)

Page 19: Bio process

PRODISTIN

• Very similar to ProtoMap, only the data used to produce the graph is a list of binary protein-protein interactions instead of sequence similarity scores

• Sequence similarity not a dominating factor in PRODISTIN clusters
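
To make the idea of building a graph from binary interactions concrete, the sketch below turns a list of interacting pairs into a pairwise distance between proteins based on shared interaction partners. The specific measure used here, the Czekanowski-Dice distance, is an assumption brought in for illustration (the slides do not name the distance), and the function names and toy interaction list are hypothetical.

```python
def neighbourhoods(interactions):
    """Map each protein to the set of its partners plus itself."""
    nbrs = {}
    for a, b in interactions:
        nbrs.setdefault(a, {a}).add(b)
        nbrs.setdefault(b, {b}).add(a)
    return nbrs

def czekanowski_dice(nbrs, i, j):
    """Distance in [0, 1]: 0 = identical neighbourhoods, 1 = nothing shared."""
    ni, nj = nbrs[i], nbrs[j]
    return len(ni ^ nj) / (len(ni | nj) + len(ni & nj))

# Toy interaction list (made up): P1 and P2 share both partners, P3 shares none.
interactions = [("P1", "X"), ("P1", "Y"), ("P2", "X"), ("P2", "Y"), ("P3", "Z")]
nbrs = neighbourhoods(interactions)
d12 = czekanowski_dice(nbrs, "P1", "P2")
d13 = czekanowski_dice(nbrs, "P1", "P3")
print(d12, d13)
```

Proteins with many common partners end up close together even when their sequences are unrelated, which is consistent with sequence similarity not dominating the resulting clusters.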

Page 20: Bio process
Page 21: Bio process
Page 22: Bio process

PRODISTIN Results

Page 23: Bio process

Problems with PRODISTIN

• Paucity of protein-protein interaction data (average # of connections = 2.6)

• Either very robust or very indiscriminate

Page 24: Bio process

Problems: Multidomain and Nonlocal Proteins

• protein kinases

• hydrolases

• ubiquitin…

PRODISTIN: Present problems in clustering by biochemical function

ProtoMap: Can create undesired connections among unrelated groups

Page 25: Bio process

Scale-Free Networks

$P(\text{linking to node } i) \sim \dfrac{k_i}{\sum_j k_j}$

• Node connection probability follows a power law distribution

• Maximum degree of separation grows as O(lg n)

• Highly robust under noise, except at hubs and superhubs.
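
A small simulation can show how preferential attachment, where a new node links to node i with probability proportional to its degree k_i, produces a few hubs and many poorly connected nodes. This is a minimal sketch under simplifying assumptions: each new node adds exactly one edge, and the function name, network size, and seed are arbitrary choices.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Grow a network one node (and one edge) at a time."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in this list once per edge it touches, so a uniform
    # draw from it picks node i with probability k_i / sum_j k_j.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

edges = preferential_attachment(2000)
degree = Counter(node for edge in edges for node in edge)

print("largest degrees:", [k for _, k in degree.most_common(5)])
print("nodes with degree 1:", sum(1 for k in degree.values() if k == 1))
```

Removing a random node from such a network almost always removes a low-degree node, whereas removing one of the top few hubs disconnects many nodes at once, which is the robustness pattern noted in the last bullet.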

Page 26: Bio process

The Internet

Page 27: Bio process

Social Networks

Page 28: Bio process

Metabolic Networks

http://biocomplexity.indiana.edu/research/bionet/

• The E. coli metabolic network is scale-free.

• Actually, the metabolic networks of all organisms in all three domains of life appear to be scale-free (43 examined)

• The network diameter of all 43 metabolic networks is the same, irrespective of the number of proteins involved.

• Is this counter-intuitive? Yes.

Page 29: Bio process

Protein Domain Networks

http://mbe.oupjournals.org/cgi/content/full/18/9/1694

• Protein Domains – Nature’s take on writing modular code

• Reconciles apparent paradox of a fixed network diameter across species – despite vast differences in complexity (some human proteins have 130 domains)

• Occurrence of specific protein domains in multidomain proteins is scale-free.

Page 30: Bio process

Protein Domain Graphs

• Prosite domains have a distribution following the power-law function $f(x) = a(b + x)^{-c}$, with c = 0.89. There are few highly connected domains and many rarely connected ones.

• ProDom and Pfam domains follow the power function $P(k) \sim k^{-\gamma}$, with

γ = 2.5 for ProDom

γ = 1.7 for Pfam
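
One rough way to estimate such an exponent from a degree distribution is a straight-line fit in log-log space, since P(k) ~ k^-γ becomes log P = -γ log k + const. The sketch below is only an illustration of that idea, not the method used in the cited work; ordinary least squares on log-log data is known to be a crude estimator for real, noisy distributions. The function name and the synthetic data are hypothetical.

```python
import math

def fit_power_law_exponent(ks, counts):
    """Least-squares slope of log(count) vs log(k), returned as gamma."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx  # slope is -gamma

# Synthetic, noise-free data generated with gamma = 2.5 (for checking only).
ks = list(range(1, 51))
counts = [1000.0 * k ** -2.5 for k in ks]
gamma = fit_power_law_exponent(ks, counts)
print(round(gamma, 3))
```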

Page 31: Bio process

Hub Domains in Signaling Pathways

Page 32: Bio process

Conclusions

• The accuracy of both ProtoMap and PRODISTIN is limited because they make the tacit assumption of a random network topology.

• Protein-Protein interaction networks have scale-free topology, foiling PRODISTIN

• Protein Domain networks have scale-free topology, foiling ProtoMap

• Any protein classification algorithm that performs better than ProtoMap is probably going to have to address this issue.