Predicting Protein Function Annotation using Protein-Protein Interaction Networks
By Tamar Eldad
Advisor: Dr. Yanay Ofran
89-385 Computational Biology - Projects Workshop
Bar-Ilan University, the Mina and Everard Goodman Faculty of Life Sciences
Protein Function Prediction
Exponential increase in the number of proteins being identified by sequence genomics projects
Impossible to perform a functional assay for every uncharacterized gene
Turn to sophisticated computational methods for assistance in annotating the huge volume of sequence and structure data being produced
Homology-based annotation transfer, sequence patterns, structure similarity, structure patterns, genomic context, microarray data
What is Function?
Biological function has more than one aspect
Sub-cellular to whole-organism context
Physiological aspect
Phenotype
The need for a well-defined vocabulary
Protein Sequence:
Protein Structure:
The Gene Ontology
The Gene Ontology project is a major bioinformatics initiative with the aim of standardizing the representation of gene and gene product attributes across species and databases.
The project provides a controlled vocabulary of terms for describing gene product characteristics and gene product annotation data.
The Gene Ontology
Cellular component, Molecular function, Biological process
DAG (1…N parent nodes), from general to specific terms
A term is assigned to a gene product
The Gene Ontology
A New Approach
Classical Biology – collect a set of features for each protein
Systems Biology – study protein function in the context of a network
Assemblies represent more than the sum of their parts
Protein Interactions
Data on thousands of interactions in humans and most model species have become available
mass spectrometry
genome-wide chromatin immunoprecipitation
yeast two-hybrid assays
combinatorial reverse genetic screens
rapid literature mining techniques
PPI Networks
Data are represented as networks, with nodes representing proteins and edges representing the detected PPIs.
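A minimal sketch of this representation, assuming interactions arrive as pairs of protein identifiers. The function name `build_ppi_network` and the identifiers P1 to P4 are hypothetical placeholders, not real data.

```python
from collections import defaultdict


def build_ppi_network(interactions):
    """Build an undirected graph: protein -> set of interacting proteins."""
    network = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # this sketch ignores self-interactions
        network[a].add(b)
        network[b].add(a)
    return dict(network)


# Hypothetical toy interactions (placeholder identifiers, not real proteins)
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P3", "P4")]
network = build_ppi_network(interactions)
print(sorted(network["P3"]))  # ['P1', 'P2', 'P4']
```

Storing each edge in both directions keeps neighbor lookups cheap, which is what sub-graph searches on the network mostly need.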
Existing Methods
Alignment – aligning sequence-matching proteins between species and checking whether their interactions also align can reveal pathways conserved between species
Integration – data from different types of networks (i.e. protein, genetic, and transcriptional interaction networks) are integrated to get a better picture of the whole biological system
Querying – finding sub-networks similar to known functional units (by comparing the interactions and the proteins themselves); these are likely to be functional units too
New Method
Conserved network motifs between two species provide evidence for functional similarity of the individual proteins that make up these motifs
[Figure: HUMAN and YEAST network motifs; matched protein pairs labeled 2e-10, 8e-13, 1e-09, 5e-15]
New Method
What do we need?
1. list of proteins in human cell
2. list of proteins in yeast cell
3. interactions in each cell
4. sequence similarity grades
5. known GO annotations
6. function distance calculation
Protein Lists - UniProt DB
Interaction Databases
HPRD – Human Protein Reference Database
DIP – Database of Interacting Proteins
MIPS – Munich Information Center for Protein Sequences
IntAct – molecular interaction database
A reliable interaction meets at least one of these conditions:
1. It was observed in at least 2 different experiments, OR
2. It was reported in 3 different articles
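The reliability rule can be sketched as a simple filter. The record layout (`InteractionRecord` and its fields) is hypothetical, and reading the second condition as "at least 3 articles" is an assumption; the slide only says "3 different articles".

```python
from dataclasses import dataclass

MIN_EXPERIMENTS = 2  # from the slide: observed in at least 2 different experiments
MIN_ARTICLES = 3     # assumption: "3 different articles" read as "at least 3"


@dataclass(frozen=True)
class InteractionRecord:
    """Hypothetical record layout for one interaction from a database."""
    protein_a: str
    protein_b: str
    experiments: frozenset  # distinct experiment identifiers
    articles: frozenset     # distinct article identifiers


def is_reliable(record):
    """An interaction is kept if either condition holds."""
    return (len(record.experiments) >= MIN_EXPERIMENTS
            or len(record.articles) >= MIN_ARTICLES)


# Placeholder records, not real data
records = [
    InteractionRecord("P1", "P2", frozenset({"exp1", "exp2"}), frozenset({"art1"})),
    InteractionRecord("P2", "P3", frozenset({"exp1"}), frozenset({"art1", "art2", "art3"})),
    InteractionRecord("P3", "P4", frozenset({"exp1"}), frozenset({"art1"})),
]
reliable = [r for r in records if is_reliable(r)]
print([(r.protein_a, r.protein_b) for r in reliable])  # [('P1', 'P2'), ('P2', 'P3')]
```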
Sequence Similarity Grades
BLAST - bl2seq
         HUMAN
YEAST    1      2      3      4
1        -      0.008  3e-18  X
2        10     -      0.02   3.6
GO Annotations – UniProt DB
Evidence Codes
Function Distance Calculation
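The slides do not spell out how the function distance is computed. One common choice, shown here purely as an assumption, is the number of edges on the shortest path between two terms in the ontology graph, averaged over all pairs of annotations. The toy ontology and all names below are hypothetical; they are not real GO terms or the author's code.

```python
from collections import deque

# Hypothetical toy ontology, NOT real GO terms: child term -> list of parents.
# It is a DAG, so a term may have more than one parent.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "protein binding": ["binding"],
    "dna binding": ["binding"],
    "kinase": ["catalysis", "protein binding"],
}


def build_neighbors(parents):
    """Turn child -> parents into an undirected neighbor map."""
    neighbors = {}
    for child, term_parents in parents.items():
        neighbors.setdefault(child, set())
        for parent in term_parents:
            neighbors[child].add(parent)
            neighbors.setdefault(parent, set()).add(child)
    return neighbors


def term_distance(neighbors, a, b):
    """Edges on the shortest path between two terms (BFS); None if unconnected."""
    queue = deque([(a, 0)])
    seen = {a}
    while queue:
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbors[term]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def average_distance(neighbors, terms_a, terms_b):
    """Mean distance over all term pairs; assumes every pair is connected."""
    distances = [term_distance(neighbors, a, b) for a in terms_a for b in terms_b]
    return sum(distances) / len(distances)


NEIGHBORS = build_neighbors(PARENTS)
print(term_distance(NEIGHBORS, "protein binding", "dna binding"))  # 2
print(average_distance(NEIGHBORS, ["kinase"], ["dna binding", "catalysis"]))  # 2.0
```

Under this definition identical terms have distance 0, and terms from different branches are further apart the deeper their shared ancestor sits above them.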
Implementation
1. Prepare the similarity matrix for the cutoff E-value
2. Find all components of size N – 1 (DFS)
3. Compare the sub-graphs found using the similarity matrix
4. Add an N-th non-similar component to each pair of matching graphs
5. Get the GO function annotation of the N-th components
6. Calculate the average distance between the functions of the N-th components
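Steps 1 and 2 of the implementation could be sketched as follows. This is an interpretation, not the author's code: "components of size N – 1" is read as connected sets of N – 1 proteins, the search is a depth-first recursive extension, and all identifiers and function names are hypothetical. The E-values reuse the numbers from the example matrix, and the cutoff is the 5e-05 quoted on the Results slide.

```python
def similar_pairs(evalues, cutoff):
    """Step 1: keep the cross-species pairs whose E-value is under the cutoff."""
    return {pair for pair, evalue in evalues.items() if evalue < cutoff}


def connected_subgraphs(network, size):
    """Step 2: every connected set of `size` proteins, found depth-first."""
    found = set()

    def extend(current, frontier):
        if len(current) == size:
            found.add(frozenset(current))
            return
        for node in frontier:
            new_current = current | {node}
            # Neighbors of the grown set that are not already in it
            new_frontier = (frontier | network[node]) - new_current
            extend(new_current, new_frontier)

    for start in network:
        extend({start}, set(network[start]))
    return found


# Placeholder identifiers; E-values copied from the example matrix
evalues = {
    ("human_1", "yeast_2"): 0.008,
    ("human_1", "yeast_3"): 3e-18,
    ("human_2", "yeast_1"): 10,
    ("human_2", "yeast_3"): 0.02,
    ("human_2", "yeast_4"): 3.6,
}
similar = similar_pairs(evalues, cutoff=5e-05)
print(similar)  # {('human_1', 'yeast_3')}

# Hypothetical toy network: protein -> set of interacting proteins
network = {
    "P1": {"P2", "P3"},
    "P2": {"P1", "P3"},
    "P3": {"P1", "P2", "P4"},
    "P4": {"P3"},
}
subgraphs = connected_subgraphs(network, size=3)
print(sorted(sorted(s) for s in subgraphs))
```

The recursion revisits the same set through different orderings and relies on the result set to deduplicate, which is acceptable for small N but grows quickly with graph size.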
Quality Assurance
1. Compare to random-pair annotation
   No sequence similarity
2. Compare to sequence-similar annotation
   BLAST
   Only proteins under the cut-off value
   Human genes only
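The random-pair comparison can be sketched as below, assuming some function distance is available as a Python callable. The function names are hypothetical, and the integer "terms" with an absolute-difference distance are only a stand-in so the example runs on its own.

```python
import random


def mean_pair_distance(pairs, distance):
    """Average function distance over a list of (term_a, term_b) pairs."""
    values = [distance(a, b) for a, b in pairs]
    return sum(values) / len(values)


def random_pair_baseline(human_terms, yeast_terms, distance, n_pairs, seed=0):
    """Average distance between annotations of randomly paired proteins."""
    rng = random.Random(seed)  # seeded so the baseline is reproducible
    pairs = [(rng.choice(human_terms), rng.choice(yeast_terms))
             for _ in range(n_pairs)]
    return mean_pair_distance(pairs, distance)


def toy_distance(a, b):
    """Stand-in distance on integers; a real run would use a GO-based distance."""
    return abs(a - b)


# Placeholder data: pairs proposed by the method vs. all available annotations
predicted_pairs = [(1, 4), (2, 2)]
predicted_mean = mean_pair_distance(predicted_pairs, toy_distance)
baseline_mean = random_pair_baseline([1, 2, 3], [2, 4, 9], toy_distance, n_pairs=1000)
print(predicted_mean)  # 1.5
print(predicted_mean < baseline_mean)
```

A prediction method is only informative if its mean distance is clearly below this baseline; the same comparison can then be repeated against pairs chosen by sequence similarity alone.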
Detailed Results
graph1                new comp  go func     graph2                 new comp  go func     term type          dist  Eval average
,4814,4256,591,1584,  Q12495    GO:0005515  ,4253,1335,2447,2353,  Q9UHD2    GO:0005515  MolecularFunction  4     0.079
,4814,4256,591,1584,  Q12495    GO:0030528  ,4253,1335,2447,2353,  Q9UHD2    GO:0030528  MolecularFunction  3     0.079
,4814,4256,591,1584,  Q12495    GO:0006334  ,4253,1335,2447,2353,  Q9UHD2    GO:0006334  BiologicalProcess  0     0.079
,4814,4256,591,1584,  Q12495    GO:0005515  ,4253,1335,2447,2353,  O15111    GO:0005515  MolecularFunction  1     0.079
,4814,4256,591,1584,  Q12495    GO:0005515  ,4253,1335,2447,2353,  O15111    GO:0005515  MolecularFunction  12    0.079
,4819,2,236,234,      P16649    GO:0016584  ,4354,2303,2890,3693,  P55060    GO:0016584  BiologicalProcess  1     0.062
,4819,2,236,234,      P16649    GO:0016565  ,4354,2303,2890,3693,  Q96KB5    GO:0016565  MolecularFunction  1     0.062
,4819,2,236,234,      P16649    GO:0016584  ,4354,2303,2890,3693,  Q15699    GO:0016584  BiologicalProcess  8     0.062
,4819,2,236,234,      P16649    GO:0016584  ,4354,2303,2890,3693,  Q15699    GO:0016584  BiologicalProcess  5     0.062
,4867,2966,168,1224,  P13393    GO:0000120  ,4387,1383,1452,2289,  P63279    GO:0000120  CellularComponent  4     0.041
,4867,2966,168,1224,  P13393    GO:0000120  ,4387,1383,1452,2289,  P63279    GO:0000120  CellularComponent  3     0.041
,4867,2966,168,1224,  P13393    GO:0000126  ,4387,1383,1452,2289,  P63279    GO:0000126  CellularComponent  7     0.041
Results
E-value 5e-05
Play with Parameters
• Change graph size
• Lower e-value
• Start with a larger number of connected components
• Use only graphs with higher connectivity
• Non-similar proteins can be any protein in the graph
• Different network topology
• Limit number of paired proteins
Results
Conclusions
Most results are random
Significant improvement only for Biological Process prediction
Still far behind homology-based transfer
Summary
Functional annotation is one of the greatest challenges in the post-genomic era
PPI data for functional annotation is a new approach for advancing this field
The method tried was unsuccessful
Other ideas:
Find a more specific search pattern
Start from the best results – what distinguishes them?
Thanks
Advisor – Dr. Yanay Ofran
Guys at the lab – Rotem, Vered, Sivan
Roi Adadi & Omer Erel
[Figure panels: Alignment, Querying, Integration]
Similarity Matrix (HUMAN vs. YEAST), E-value = 0.0005
    1      2      3      4
1   -      0.008  3e-18  X
2   10     -      0.02   3.6
Each E-value becomes TRUE or FALSE according to the cutoff
Neighboring matrix (HUMAN CELL INTERACTIONS)
    1      2      3      4
1   -      TRUE   FALSE  TRUE
2   TRUE   -      FALSE  FALSE