Comparison of Networks Across Species CS374 Presentation October 26, 2006 Chuan Sheng Foo.

Comparison of Networks Across Species

CS374 Presentation October 26, 2006

Chuan Sheng Foo

In the beginning there was DNA…

Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides, NC. The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. NAR 34, D332-334

…then came protein interactions

[Figures: PPI networks of Arabidopsis, E. coli, and yeast]

Comparative Genomics to Comparative Interactomics

Evolutionary conservation implies functional relevance

Sequence conservation implies functional conservation

Network conservation implies functional conservation too!

What new insights might we gain from network comparisons? (Why should we care?)

Network comparisons allow us to:

Identify conserved functional modules

Query for a module, à la BLAST

Predict functions of a module

Predict protein functions

Validate protein interactions

Predict protein interactions

Only possible with network comparisons

Possible with existing techniques, but improved with network comparisons

What is a Protein Interaction Network?

Proteins are nodes

Interactions are edges

Edges may have weights

Yeast PPI network

H. Jeong et al. Lethality and centrality in protein networks. Nature 411, 41 (2001)

The Network Alignment Problem

Given k different protein interaction networks belonging to different species, we wish to find conserved sub-networks within these networks

Conserved in terms of protein sequence similarity (node similarity) and interaction similarity (network topology similarity)

Example Network Alignment

Sharan and Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, pp. 427-433, 2006

General Framework For Network Alignment Algorithms

Sharan and Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, pp. 427-433, 2006

Network construction

Scoring function

Alignment algorithm

Covered in lecture on network integration

Two Algorithms Discussed Today

NetworkBLAST: Sharan et al. Conserved patterns of protein interaction in multiple species. PNAS, 102(6):1974-1979, 2005.

Græmlin: Flannick et al. Græmlin: General and robust alignment of multiple large interaction networks. Genome Res 16: 1169-1181, 2006.

Overview of NetworkBLAST

Sharan et al. Conserved patterns of protein interaction in multiple species. PNAS, 102(6):1974-1979, 2005.

Estimation of Interaction Probabilities

In the preprocessing step, edges in the network are given a reliability score using a logistic regression model based on three features:

1. Number of times an interaction was observed

2. Pearson correlation coefficient between expression profiles

3. Proteins’ small world clustering coefficient
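As a rough illustration of how a logistic model turns these three features into a reliability score, here is a minimal Python sketch. The function name `edge_reliability`, the weights, and the bias are all made-up placeholders, not the fitted values from Sharan et al.

```python
import math

def edge_reliability(num_observations, expr_correlation, clustering_coeff,
                     weights=(0.8, 1.5, 1.0), bias=-2.0):
    """Map the three edge features to a probability in (0, 1).

    The weights and bias are illustrative assumptions; a real model would
    fit them on positive and negative training interactions.
    """
    w_obs, w_corr, w_clust = weights
    z = (bias
         + w_obs * num_observations
         + w_corr * expr_correlation
         + w_clust * clustering_coeff)
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function

# An interaction seen three times, with correlated expression profiles
strong = edge_reliability(3, 0.7, 0.5)
# An interaction seen once, with uncorrelated expression profiles
weak = edge_reliability(1, -0.1, 0.0)
```

With these placeholder parameters the well-supported edge gets a higher reliability than the weakly supported one, which is the behaviour the preprocessing step relies on.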

Network Alignment Graphs

Construct a Network Alignment Graph to represent the alignment

Nodes contain groups of sequence similar proteins from the k organisms

Edges represent conserved interactions. An edge between two nodes is present if any one of the following holds:

1. One pair of proteins directly interacts and all other pairs are at distance at most 2

2. All protein pairs are at distance exactly 2

3. At least max(2, k – 1) protein pairs directly interact

Tries to account for interaction deletions
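The edge rule can be sketched as a small predicate. This sketch assumes the three conditions are alternatives (any one is enough) and that we already know, for each of the k species, the shortest-path distance between the two proteins that species contributes; the function name `conserved_edge` is hypothetical.

```python
def conserved_edge(distances):
    """Decide whether two alignment-graph nodes are joined by an edge.

    distances: one entry per species, giving the shortest-path length
    between that species' two proteins (1 means a direct interaction;
    use float("inf") for proteins that are not connected).
    """
    k = len(distances)
    direct = sum(1 for d in distances if d == 1)
    # 1. One pair interacts directly, all others are at distance <= 2
    cond1 = direct >= 1 and all(d <= 2 for d in distances)
    # 2. Every pair is at distance exactly 2
    cond2 = all(d == 2 for d in distances)
    # 3. At least max(2, k - 1) pairs interact directly
    cond3 = direct >= max(2, k - 1)
    return cond1 or cond2 or cond3
```

For example, `conserved_edge([1, 2, 2])` holds under the first condition, while `conserved_edge([1, 3, 3])` fails all three. Allowing distance 2 is what lets the graph tolerate a missing (deleted or unobserved) interaction in some species.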

Example Network Alignment Graph

[Figure: individual PPI networks of Species X, Species Y, and Species Z, with nodes a, b, c; a’, b’, c’; and a’’, b’’, c’’, and the network alignment graph built from them]

Scoring Function

Sharan et al. devise a scoring scheme based on a likelihood model for the fit of a single sub-network to the given structure

High scoring subgraphs correspond to structured sub-networks (cliques or pathways)

Only network topology is scored; node similarity is not

Log Likelihood Ratio Model

Compares the likelihood that a subgraph occurs under a conserved network model with the likelihood that it occurs in a randomly constructed network

Randomly constructed network preserves degree distribution for nodes

$$\log \frac{\Pr(\text{Subgraph occurs} \mid \text{Conserved Network})}{\Pr(\text{Subgraph occurs} \mid \text{Random Network})}$$

Likelihood Ratio Scoring of a Protein Complex in a Single Species

U : a subset of vertices (proteins) in the PPI graph

O_U : collection of all observations on vertex pairs in U

O_uv : interaction between proteins u, v observed

M_s : conserved network model

M_n : random network (null) model

T_uv : proteins u, v interact

F_uv : proteins u, v do not interact

β : probability that proteins u, v interact in conserved model

p_uv : probability that edge u, v exists in a random model

Probability of complex being observed in a conserved network model

Probability of subgraph being observed in a random network model

Likelihood Ratio Scoring of a Protein Complex in a Single Species

Hence, log likelihood for a complex occurring in a single species is given by
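The slide formulas are not preserved in this transcript. A reconstruction from the symbols defined earlier (O_U, O_uv, M_s, M_n, T_uv, F_uv, β, p_uv), assuming vertex pairs are treated as independent, would read as follows; treat it as a sketch to be checked against Sharan et al. rather than a verbatim copy of the slide.

```latex
\Pr(O_U \mid M_s) = \prod_{(u,v) \in U \times U}
  \left[ \beta \Pr(O_{uv} \mid T_{uv}) + (1-\beta) \Pr(O_{uv} \mid F_{uv}) \right]

\Pr(O_U \mid M_n) = \prod_{(u,v) \in U \times U}
  \left[ p_{uv} \Pr(O_{uv} \mid T_{uv}) + (1-p_{uv}) \Pr(O_{uv} \mid F_{uv}) \right]

L(U) = \log \frac{\Pr(O_U \mid M_s)}{\Pr(O_U \mid M_n)}
     = \sum_{(u,v) \in U \times U} \log
       \frac{\beta \Pr(O_{uv} \mid T_{uv}) + (1-\beta) \Pr(O_{uv} \mid F_{uv})}
            {p_{uv} \Pr(O_{uv} \mid T_{uv}) + (1-p_{uv}) \Pr(O_{uv} \mid F_{uv})}
```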

For multiple complexes across different species, it is the sum of the log likelihoods

L(A, B, C) = L(A) + L(B) + L(C)
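A minimal Python sketch of this scoring, assuming each vertex pair contributes the log of a ratio of two mixtures: the conserved model mixes "interact" and "do not interact" with weight β, the null model with weight p_uv. The function names and the toy numbers are hypothetical.

```python
import math

def complex_log_likelihood_ratio(pair_obs, beta, p_random):
    """Score one candidate complex in one species.

    pair_obs: {(u, v): (Pr(O_uv | T_uv), Pr(O_uv | F_uv))}
    p_random: {(u, v): p_uv}, edge probability under the random model
    """
    total = 0.0
    for pair, (pr_if_true, pr_if_false) in pair_obs.items():
        p_uv = p_random[pair]
        conserved = beta * pr_if_true + (1 - beta) * pr_if_false
        null = p_uv * pr_if_true + (1 - p_uv) * pr_if_false
        total += math.log(conserved / null)
    return total

def alignment_score(per_species_scores):
    """Sum per-species log likelihood ratios, as in L(A) + L(B) + L(C)."""
    return sum(per_species_scores)

pairs = [("a", "b"), ("a", "c"), ("b", "c")]
p_random = {pair: 0.1 for pair in pairs}
# Observations that strongly support every interaction (a dense triangle)
dense = {pair: (0.9, 0.1) for pair in pairs}
# Observations that argue against every interaction
sparse = {pair: (0.1, 0.9) for pair in pairs}
dense_score = complex_log_likelihood_ratio(dense, beta=0.9, p_random=p_random)
sparse_score = complex_log_likelihood_ratio(sparse, beta=0.9, p_random=p_random)
```

Under these toy inputs the well-supported triangle scores above zero and the unsupported one below zero, so positive scores flag subgraphs that look more like a conserved complex than like chance.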

Example of Complex Scoring

[Figure: conserved complex A in the network alignment graph, shown alongside the individual species’ PPI networks. Complex A corresponds to complex X1 in Species X, complex Y1 in Species Y, and complex Z1 in Species Z.]

L(A) = L(X1) + L(Y1) + L(Z1)

Alignment algorithm

Problem of identifying conserved sub-networks reduces to finding high scoring subgraphs

NP-complete problem

Heuristic solution: greedy extension of high scoring seeds (Does this sound familiar? BLAST?)

Common to both papers discussed

Alignment algorithm

1. Find seeds for each node v in the alignment graph

a. Find high scoring paths of 4 nodes by exhaustive search

b. Greedily add 3 other nodes one by one, that maximally increase the score of the seed

Alignment algorithm

2. Iteratively add or remove nodes to increase the overall score of the subgraph

Original seeds are preserved

Limit size of discovered subgraphs to 15 nodes

Record up to 4 highest scoring subgraphs discovered around each node
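The refinement step can be sketched as a greedy local search. This is only an illustration: `refine_seed`, the toy graph, and `density_score` are hypothetical stand-ins (the real score is the likelihood ratio), and the sketch keeps a single best subgraph rather than recording up to 4 per node.

```python
def refine_seed(seed, neighbors, score, max_size=15):
    """Greedily add or remove single nodes while the score improves.

    Seed nodes are never removed, and the subgraph never grows beyond
    max_size nodes.
    """
    current = set(seed)
    best = score(current)
    while True:
        candidates = []
        if len(current) < max_size:
            frontier = {n for v in current for n in neighbors[v]} - current
            candidates += [current | {n} for n in frontier]
        candidates += [current - {v} for v in current if v not in seed]
        if not candidates:
            return current, best
        top = max(candidates, key=score)
        if score(top) <= best:  # no single move improves the score
            return current, best
        current, best = top, score(top)

# Toy weighted graph: a tight triangle a-b-c with a weak tail c-d-e
edges = {("a", "b"): 2.0, ("b", "c"): 2.0, ("a", "c"): 2.0,
         ("c", "d"): 0.2, ("d", "e"): 0.1}
neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

def density_score(nodes):
    """Toy score: internal edge weight minus a cost of 1.0 per node."""
    inside = sum(w for (u, v), w in edges.items() if u in nodes and v in nodes)
    return inside - 1.0 * len(nodes)

best_nodes, best_score = refine_seed({"a", "b"}, neighbors, density_score)
```

Starting from the seed {a, b}, the search adds c to complete the triangle and then stops, because adding the weakly connected d would lower the score.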

Alignment algorithm

3. Filter subgraphs with a high degree of overlap

Iteratively pick the highest scoring remaining subgraph and remove all remaining subgraphs that overlap it heavily
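A minimal sketch of this filtering step, assuming overlap is measured as the shared fraction of the smaller subgraph and using an arbitrary cutoff of 0.8; both the overlap measure and the cutoff are assumptions for illustration, not values taken from the slides.

```python
def filter_overlapping(subgraphs, max_overlap=0.8):
    """Keep high scoring subgraphs, dropping ones that overlap a kept one.

    subgraphs: list of (score, set_of_nodes) with non-empty node sets.
    """
    kept = []
    # Visit subgraphs from highest to lowest score
    for score, nodes in sorted(subgraphs, key=lambda s: s[0], reverse=True):
        overlaps_kept = any(
            len(nodes & other) / min(len(nodes), len(other)) > max_overlap
            for _, other in kept
        )
        if not overlaps_kept:
            kept.append((score, nodes))
    return kept

candidates = [
    (4.0, {1, 2, 3, 4, 5, 6}),  # contains the top subgraph entirely
    (5.0, {1, 2, 3, 4, 5}),
    (3.0, {7, 8, 9}),           # disjoint from the others
]
result = filter_overlapping(candidates)
```

Here the subgraph scoring 4.0 is dropped because it fully covers the higher scoring one, while the disjoint subgraph survives.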

Results

Conserved network regions within yeast (orange ovals), fly (green rectangles) and worm (blue hexagons) PPI networks.

Results

Prediction of protein function

• ‘Guilt by association’

• If a conserved cluster or path is significantly enriched in a functional annotation, its remaining proteins are predicted to share that annotation

Prediction of protein interactions

Predictions based on 2 strategies:

• Evidence that proteins with similar sequences interact

• Co-occurrence of proteins in the same conserved cluster or path

• Experimental verification of predicted yeast interactions using Y2H yielded a 40-62% success rate

Overview of Græmlin

Fast, scalable network alignment

Scales linearly in number of networks compared (NetworkBLAST scales exponentially)

Supports efficient querying of modules

Speed-sensitivity control via user-defined parameter (not supported in NetworkBLAST)

Input to the Algorithm

Weighted protein interaction graphs

Weights represent probability that proteins interact

Constructed via network integration algorithm covered in a later lecture

A phylogenetic tree relating the species in the desired alignment

Used for progressive alignment

Definition of an alignment

A set of subgraphs chosen from the interaction networks of different species, together with a mapping between aligned proteins

Aligned proteins form equivalence classes

Each class was derived from a common ancestral protein

Can contain multiple proteins from the same species

[Figure: equivalence class containing a, a’, a’’, b’’, showing paralogs]

Scoring Function

Log likelihood ratio model based on:

Alignment model M: modules are subject to evolutionary constraint

Random model R: modules are not subject to any constraints

Scores equivalence classes and alignment edges separately

Log Likelihood Ratio Model (Recap)

Compares the likelihood that a module occurs if it is subject to evolutionary constraint with the likelihood that it occurs in a randomly constructed network

Randomly constructed network preserves degree distribution for nodes

$$\log \frac{\Pr(\text{Module occurs} \mid \text{Alignment Model } M)}{\Pr(\text{Module occurs} \mid \text{Random Model } R)}$$

Scoring Equivalence Classes

Reconstruct most parsimonious ancestral history of an equivalence class using Dynamic Programming based on five types of evolutionary events

Alignment model and random model give probabilities for each of these events, combined to give a log likelihood score

Scoring Alignment Edges

Alignment scores should reflect both network conservation and high connectivity – difficult to strike a balance

Introduction of a novel scoring approach

Edge Scoring Matrix – indexed by labels

Algorithm assigns a label to each equivalence class, scores according to distribution function in cells referenced by labels

Scoring: ESM

Alignment Algorithm: d-Clusters for Seed Generation

A d-cluster consists of d proteins close together in a network

“Close” means edge weights are high, so interaction is highly likely

Intuition is that high scoring alignments will have high scoring d-clusters

Alignment Algorithm: d-Clusters for Seed Generation

Identify pairs of d-clusters that score higher than a threshold T

Score is defined by greedily matching nodes from each d-cluster to obtain a high score

Uses these pairs as seeds

Allows for speed-sensitivity tradeoff
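The seed selection can be sketched as follows, assuming the d-clusters have already been built and that a pairwise protein similarity table is available. The names `greedy_match_score` and `find_seeds`, the similarity values, and the threshold are hypothetical; the actual Græmlin score is more involved.

```python
def greedy_match_score(cluster_a, cluster_b, similarity):
    """Greedily pair off the most similar proteins from two d-clusters."""
    pairs = sorted(
        ((similarity.get((a, b), 0.0), a, b)
         for a in cluster_a for b in cluster_b),
        reverse=True,
    )
    used_a, used_b, total = set(), set(), 0.0
    for sim, a, b in pairs:
        # Each protein may be matched at most once
        if sim > 0 and a not in used_a and b not in used_b:
            used_a.add(a)
            used_b.add(b)
            total += sim
    return total

def find_seeds(clusters_x, clusters_y, similarity, threshold):
    """Return every pair of d-clusters scoring above the threshold T."""
    return [(cx, cy) for cx in clusters_x for cy in clusters_y
            if greedy_match_score(cx, cy, similarity) > threshold]

similarity = {("a", "a2"): 3.0, ("b", "b2"): 2.0,
              ("a", "b2"): 1.0, ("c", "a2"): 0.5}
clusters_x = [("a", "b"), ("c", "d")]
clusters_y = [("a2", "b2")]
seeds = find_seeds(clusters_x, clusters_y, similarity, threshold=2.0)
```

Raising `threshold` yields fewer seeds and a faster but less sensitive search, which is the speed-sensitivity tradeoff mentioned in the slide.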

Alignment Algorithm: Generating An Initial Alignment From The Seed

Determine highest scoring pair of nodes (one from each d-cluster) when aligned

Align these nodes and place these nodes, as well as their neighbors, into a frontier

[Figure: candidate node pairs with scores 3.0, 1.5, and 5.0]

Alignment Algorithm: Greedy Seed Extension Phase

Examine all pairs of nodes in the frontier for the pair that maximally increases the score when added to the alignment

Stops when no pair can further increase the score

Equivalence classes are removed if doing so further increases the score

[Figure: the frontier surrounding the current alignment]
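A simplified pairwise sketch of the extension phase follows. It assumes each equivalence class is just one (x, y) protein pair, takes the scoring function as a parameter, and omits the removal step; `extend_alignment` and the toy score are hypothetical.

```python
def extend_alignment(start_pair, nbrs_x, nbrs_y, score):
    """Grow a pairwise alignment greedily from one aligned pair (x, y)."""
    aligned = [start_pair]
    while True:
        used_x = {x for x, _ in aligned}
        used_y = {y for _, y in aligned}
        # Frontier: unaligned neighbors of already aligned nodes
        frontier_x = {n for x in used_x for n in nbrs_x[x]} - used_x
        frontier_y = {n for y in used_y for n in nbrs_y[y]} - used_y
        current = score(aligned)
        best_gain, best_pair = 0.0, None
        for x in sorted(frontier_x):
            for y in sorted(frontier_y):
                gain = score(aligned + [(x, y)]) - current
                if gain > best_gain:
                    best_gain, best_pair = gain, (x, y)
        if best_pair is None:  # no pair increases the score
            return aligned
        aligned.append(best_pair)

# Two toy path networks: a-b-c in one species, p-q-r in the other
nbrs_x = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
nbrs_y = {"p": {"q"}, "q": {"p", "r"}, "r": {"q"}}
sim = {("a", "p"): 3.0, ("b", "q"): 2.0, ("c", "r"): 1.5, ("b", "r"): 0.5}

def toy_score(aligned):
    """Toy score: summed similarity, with a penalty for unrelated pairs."""
    return sum(sim.get(pair, -1.0) for pair in aligned)

alignment = extend_alignment(("a", "p"), nbrs_x, nbrs_y, toy_score)
```

On this toy input the alignment grows one frontier pair at a time along both paths and stops once the frontiers are exhausted.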

Alignment Algorithm: Multiple Alignment

Progressive alignment technique using the phylogenetic tree

Successively aligns closest pair of networks

Places each aligned network at the parent node of the two aligned species

Linear scaling in number of species
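The progressive strategy can be sketched as a bottom-up walk over the phylogenetic tree. In this sketch the tree is a nested tuple, networks are sets of edges between already-matched protein labels, and the pairwise aligner simply keeps the shared edges; all three are simplifying assumptions, and `progressive_align` is a hypothetical name.

```python
def progressive_align(tree, networks, align_pair):
    """Align networks bottom-up along a binary phylogenetic tree.

    tree: a species name (leaf) or a 2-tuple of subtrees.
    Returns the alignment placed at the root of the tree.
    """
    if isinstance(tree, str):
        return networks[tree]
    left, right = tree
    # Align the two children and place the result at their parent node
    return align_pair(progressive_align(left, networks, align_pair),
                      progressive_align(right, networks, align_pair))

calls = []

def shared_edges(net_a, net_b):
    """Toy pairwise aligner: keep only edges present in both networks."""
    calls.append((len(net_a), len(net_b)))
    return net_a & net_b

networks = {
    "yeast": {("A", "B"), ("B", "C")},
    "worm": {("A", "B"), ("B", "C"), ("C", "D")},
    "fly": {("A", "B")},
}
tree = (("yeast", "worm"), "fly")
root = progressive_align(tree, networks, shared_edges)
```

A binary tree with n leaves has n - 1 internal nodes, so only n - 1 pairwise alignments are needed; here three species take two calls, which is where the linear scaling comes from.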

Performance Comparison: Speed-sensitivity / Linear Scaling

Performance Comparison: Multiple Alignment

Performance Comparison: Module Querying

Results

Functional module identification using network alignment

Functional module for transformation?

Results

Functional annotation using network alignment

Pairwise alignment

Multiple alignment of 9 networks

Conserved DNA replication module

Results

Multiple alignment of 10 networks showing possible cell division module

Functional annotation using network alignment

The Future of Network Comparison

Græmlin

Græmlin?

Sharan and Ideker. Modeling cellular machinery through biological network comparison. Nature Biotechnology 24, pp. 427-433, 2006

That’s all folks!

Thank you!

Questions?

Performance Comparison: Sensitivity

Scoring Sequence Mutations

Weighted sum of pairs scoring