2. Lecture WS 2006/07Bioinformatics III1 V2 – network topologies - Random graphs: classical field...

2. Lecture WS 2006/07

Bioinformatics III 1

V2 – network topologies

- Random graphs: classical field in graph theory.

Well studied analytically and numerically.

Literature (heavy): Bela Bollobas, Modern Graph Theory; Random Graphs

- Scale-free networks: quite new.

Properties were mostly studied numerically and heuristically (sofar).

- Evolution of domain linkage networks.

- Classification of network topologies.



Definitions

A graph G is an ordered pair of disjoint sets (V,E) such that E is a subset

of the set V(2) of unordered pairs of V.

V and E are assumed always finite.

The set V is the set of vertices; E is the set of edges.

A weighted graph has a real valued weight assigned to each edge.

A subgraph of a graph G is a graph whose vertex and edge sets are subsets of

those of G.

Given an undirected graph, two vertices u and v are called connected if there

exists a path from u to v. Otherwise they are called disconnected. The graph is

called connected graph if every pair of vertices in the graph is connected.

The Giant component is a network theory term referring to a connected subgraph

that contains a majority of the entire graph's vertices.



Definitions

A path in a graph is a sequence of vertices such that from each of its vertices

there is an edge to the successor vertex.

The first vertex is called the start vertex and the last vertex is called the end

vertex.

Both of them are called end or terminal vertices of the path.

The other vertices in the path are internal vertices.

Two paths are independent (alternatively, internally vertex-disjoint) if they do

not have any internal vertex in common.



Shortest path problem

The shortest path problem is the problem of finding a path between two

vertices such that the sum of the weights of its constituent edges is minimized.

More formally, given a weighted graph (V,E), and two elements n, n' V,

find a path P from n to n' so that

is minimal among all paths connecting n to n' .

The all-pairs shortest path problem is a similar problem, in which we have to

find such paths for every two vertices n to n' .

Pp

pw



Dijkstra’s algorithm

Dijkstra's algorithm solves the shortest path problem for a directed graph with non-

negative edge weights.

Input:

weighted directed graph G = (V,E) and a source vertex s in G.

Each edge of the graph is an ordered pair of vertices (u,v) representing a

connection from vertex u to vertex v.

The weight w(u,v) is the non-negative cost of moving from vertex u to vertex v.

The cost of an edge can be thought of as (a generalization of) the distance

between those two vertices. The cost of a path between two vertices is the sum of

costs of the edges in that path.

For a given pair of vertices s and t in V, the algorithm finds the path from s to t

with lowest cost (i.e. the shortest path).

It can also be used for finding costs of shortest paths from a single vertex s to all

other vertices in the graph.



Description of the algorithm

The algorithm works by keeping for each vertex v the cost d[v] of the shortest path

found so far between s and v.

Initially, this value is 0 for the source vertex s (d[s]=0), and infinity for all other

vertices, representing the fact that we do not know any path leading to those

vertices (d[v]=∞ v in V, except s).

When the algorithm finishes, d[v] will be the cost of the shortest path from s to v --

or infinity, if no such path exists.

The basic operation of Dijkstra's algorithm is edge relaxation: if there is an edge

from u to v, then the shortest known path from s to u (d[u]) can be extended to a

path from s to v by adding edge (u,v) at the end.

This path will have length d[u]+w(u,v). If this is less than the current d[v], we can

replace the current value of d[v] with the new value.



Description of the algorithm

Edge relaxation is applied until all values d[v] represent the cost of the shortest

path from s to v.

The algorithm is organized so that each edge (u,v) is relaxed only once,

when d[u] has reached its final value.

The algorithm maintains two sets of vertices S and Q.

Set S contains all vertices for which we know that the value d[v] is already the cost

of the shortest path and set Q contains all other vertices.

Set S starts empty, and in each step one vertex is moved from Q to S.

This vertex is chosen as the vertex with lowest value of d[u].

When a vertex u is moved to S, the algorithm relaxes every outgoing edge (u,v).



PseudocodeIn the following algorithm, u := Extract-Min(Q) searches for the vertex u in the vertex set Q

that has the smallest d[u] value. That vertex is removed from the set Q and returned to the

user. Q := update(Q) updates the weight field of the current vertex in the vertex set Q.

1 function Dijkstra(G, w, s)

2 for each vertex v in V[G] // Initialization

3 do d[v] := infinity

4 previous[v] := undefined

5 d[s] := 0

6 S := empty set

7 Q := set of all vertices

8 while Q is not an empty set

9 do u := Extract-Min(Q)

10 S := S U {u}

11 for each edge (u,v) outgoing from u

12 do if d[v] > d[u] + w(u,v) // Relax (u,v)

13 then d[v] := d[u] + w(u,v)

14 previous[v] := u

15 Q := Update(Q) ....enddo ... end ... endo ... end

If we are only interested in a shortest path between vertices s and t, we can terminate the

search at line 9 if u = t.



Pseudocode

Now we can read the shortest path from s to t by iteration:

1 S := empty sequence

2 u := t

3 while defined u

4 do insert u to the beginning of S

5 u := previous[u]

Now sequence S is the list of vertices on the shortest path from s to t.



Running time

The simplest implementation of the Dijkstra's algorithm stores the n vertices of

set Q in an ordinary linked list or array, and operation Extract-Min(Q) is simply a

linear search through all vertices in Q. In this case, the running time is O(n2).

For sparse graphs, that is, graphs with much less than n2 edges, Dijkstra's

algorithm can be implemented more efficiently.

Description of Dijkstra‘s algorithm taken from www.wikipedia.org



n nodes (vertices) joined by edges that have been chosen and placed between

pairs of nodes uniformly at random.

Gn,p : each possible edge in the graph on n nodes is present with probability p

and absent with probability 1 – p.

Average number of edges in Gn,p :

Each edge connects two vertices

average degree of a vertex:

Erdös-Renyi model of a random graph

p

nn

2

1

nppnn

pnn

n

11



Erdös and Renyi studied how the expected topology of a random graph with n

nodes changes as a function of the number of edges m.

When m is small, the graph is likely fragmented into many small connected

components having vertex sets of size at most O(log n).

As m increases the components grow at first by linking to isolated nodes, and

later by fusing with other components.

A transition happens at m = n/2, when many clusters cross-link spontaneously

to form a unique largest component called the giant component.

Its vertex set size is much larger than the vertex set sizes of any other

components.

It contains O(n) nodes, while the second largest component contains O(log n)

nodes.

In statistical physics, this phenomenon is called percolation.

Erdös-Renyi model: components



The shortest path length between any pairs of nodes in the giant component

grows like log n.

Therefore, these graphs are called „small worlds“.

The properties of random graphs have been studied very extensively.

Literature: B. Bollobas. Random Graphs. Academic, London, 1985, 2004

However, random graphs are no adequate models for real-world networks

because

(1) real networks appear to have a power-law degree distribution,

(while random graphs have Poisson distribution) and

(2) real networks show strong clustering while the clustering coefficient of a

random graph is C = p, independent of whether two vertices have a

common neighbor.

Erdös-Renyi model: shortest path length



Aim: allow a power-law degree distribution in a graph while leaving all other

aspects as in the random graph model.

Given a degree sequence (e.g. power-law distribution) one can generate a

random graph by assigning to a vertex i a degree ki from the given degree

sequence. Then choose pairs of vertices uniformly at random to make edges so

that the assigned degrees remain preserved.

When all degrees have been used up to make edges, the resulting graph is a

random member of the set of graphs with the desired degree distribution.

Problem: method does not allow to specify clustering coefficient.

On the other hand, this property makes it possible to exactly determine many

properties of these graphs in the limit of large n.

E.g. almost all random graphs with a fixed degree distribution and no nodes of

degree smaller than 2 have a unique giant component.

Generalized Random Graphs



Barabasi’s construction algorithm for scale-free model

Input (n0, m, t) where

n0 is the initial number of vertices,

m (m n0) is the number of added edges every time one new vertex is added to

the graph, and

t is the number of iterations.

Algorithm

a) Start with n0 isolated nodes.

b) Every time we add one new node v, m edges will be linked to the existing

nodes from v with a preferential attachment probability

where ki is the number of links (degree) of the i-th node.

Eventually, the graph will have (n0 + t) nodes and (mt) edges.

Problem of „pure“ mathematicians with this algorithm: how to start from n0 = 0?

1

1

N

ii

i

iiv

k

kk



Properties of Barabasi-Albert scale-free model

P(k) k- with = 3.

Real networks often show 2.1 – 2.4

Observation: if either growth or preferential attachment is eliminated,

the resulting network does not exhibit scale-free properties.

The average path length in the BA-model is proportional to ln n/ln ln n

which is shorter than in random graphs

scale-free networks are ultrasmall worlds.

Observation: non-trivial correlations = clustering between the degrees of

connected nodes.

Numerical result for BA-model C n-0.75. No analytical predictions of C sofar.



Properties of scale-free models

Scale-free networks are resistant to random failures („robustness“) because a

few high-degree hubs dominate their topology;

a deliberate node that fails probably has a small degree, and thus not severly

affects the rest of the network.

However, scale-free networks are quite vulnerable to attacks on the hubs.

See example of last lecture about

lethality of gene deletions in yeast.

These properties have been confirmed

numerically and analytically by studying

the average path length and the size

of the giant component.



Properties of Barabasi-Albert scale-free model

BA-model is a minimal model that captures the mechanisms responsible for the

power-law degree distribution observed in real networks.

A discrepany is the fixed exponent of the predicted power-law distribution ( = 3).

Does the BA-model describe the „true“ biological evolution of networks?

Recent efforts:- study variants with cleaner mathematical properties (Bollobas, LCD-model)- include effects of adding or re-wiring edges,

allow nodes to age so that they can no longer accept new edges

or vary forms of preferential attachment.

These models also predict exponential and truncated power-law degree

distribution in some parameter regimes.



2 Scale-free behavior in protein domain networks

‚Domains‘ are fundamental units of protein structure.

Most proteins only contain one single domain.

Some sequences appear as multidomain proteins.

On average, they have 2-3 domains, but can have up to 130 domains!

Most new sequences show homologies to parts of known protein sequences most proteins may have descended from relatively few ancestral types.

Sequences of large proteins often seem to have evolved by

joining preexisting domains in new combinations, „domain shuffling“:

domain duplication or domain insertion.

Wuchty Mol. Biol. Evol. 18, 1694 (2001)



Protein domain database SMART

http://smart.embl-heidelberg.de/

contains (in 2001)

153 signalling domains

176 nuclear domains, e.g. HLH domains

225 extracellular domains

115 „other“ domains




Protein Domain databases

Prosite (http://expasy.proteome.org.au/prosite/) contains 1400 biologically

significant motifs and profiles.

Pfam (http://www.sanger.ac.uk/Software/Pfam/index.shtml) : collection of multiple-

sequence alignments of protein families and profile HMMs. Curated documentation

on 2500 families.

ProDom (http://www.toulouse.inra.fr/prodom.html) : contains all 160.000 protein

domain families that can be automatically generated from SwissProt and TrEMBL

databases.

Here, only consider families with 10 members 6000 ProDom families.

InterPro Proteome Analysis of 41 nonredundant proteomes of genomes of

archaea, bacteria, and eukaryotes (http://www.ebi.ac.uk/proteome) yields domains

which appear along with other domains in a protein sequence domains are

vertices + co-appearance in a protein sequence means an edge.


http://expasy.proteome.org.au/prosite/

http://www.sanger.ac.uk/Software/Pfam/index.shtml

http://www.toulouse.inra.fr/prodom.html



Protein Domain databases

Prosite (

http://expasy.proteome.org.au/prosite/)

contains 1360 biologically significant

motifs and profiles.

Wuchty Mol. Biol. Evol. 18, 1694 (2001)number of links to other domains

P(n

um

ber

of

lin

ks t

o o

ther

do

mai

ns)

http://expasy.proteome.org.au/prosite/



Which ones are highly connected domains?

The majority of highly connected InterPro domains appear in signalling pathways.

List of the 10 best linked domains in various species.


evolutionary trend toward compartementalization of the cell and multicellularity

demands a higher degree of organization.

From left to right:

Number of links increases. Number of signalling domains (PH, SH3), their ligands

(proline-rich extensions), and receptors (GPCR/RHODOPSIN) increases.



Evolutionary Aspects

BA-model of scale-free networks is constructed by preferential attachment of newly

added vertices to already well connected ones.

Fell and Wagner (2000) argued that vertices with many connections in metabolic

network were metabolites originating very early in the course of evolution where

they shaped a core metabolism.

Analogously, highly connected domains could have also originated very early.

Is this true?


No. Majority of highly connected domains in Methanococcus and in E.coli are

concerned with maintanced of metabolism.

None of the highly connected domains of higher organisms is found here.

On the other hand, helicase C has roughly similar degrees of connection in all

organisms.



Conclusions

Expansion of protein families in multcellular vertebrates coincides with higher

connectivity of the respective domains.

Extensive shuffling of domains to increase combinatorial diversity might provide

protein sets which are sufficient to preserve cellular procedures without dramatically

expanding the absolute size of the protein complement.

greater proteome complexity of higher eukaryotes is not simply a consequence of

the genome size, but must also be a consequence of innovations in domain

arrangements.

highly linked domains represent functional centers in various different cellular

aspects.

They could be treated as „evolutionary hubs“ which help to organize the domain

space.




Network growth mechanism

How can we know what is the „true“ growth mechanism of real biological networks?

Question 1: Is it important to know this? Yes.

Question 2: What measure do we use to distinguish networks produced by different

growth mechanisms?

Look at the fine structure (motifs) of biological networks.



Analysis of Drosophila melanogaster protein interaction network

Data set: protein-protein interaction map for Drosophila by Giot et al.

Problem: data set is subject to numerous false positives.

Giot et al. assign a confidence score p [0,1] to each interaction measuring

how likely the interaction occurs in vivo.

What threshold p* should be used?

Measure size of the components for all possible values of p*.

Observe: for p*= 0.65, the two largest components are connected

use this value as threshold.

Edges in the graph correspond to interactions for which p > p*.

Remove self-interactions and isolated vertices

3359 (4625) nodes with 2795 (4683) edges for p*= 0.65 (0.5)

Middendorf et al., DOI: q-bio.QM/0408010, arXiv, 2004/08/15



Network evolution models considered

Duplication-mutation-complementation (DMC) algorithm:

based on model that proposes that most of the duplicate genes observed today have been

preserved by functional complementation. If either the gene or its copy loses one of its

functions (edges), the other becomes essential in assuring the organisms‘s survival.

Algorithm:

duplication step is followed by mutations that preserve functional complementarity.

At every time step choose a node v at random.

A twin vertex vtwin is introduced copying all of v‘s edges.

For each edge of v, delete with probability qdel either the original edge or its corresponding

edge of vtwin.

Cojoin twins themselves with independent probability qcon representing an interaction of a

protein with its own copy.

No edges are created by mutations DMC algorithm assumes that the probability of

creating new advantageous functions by random mutations is negligible.




Network evolution models considered

Variant of DMC: Duplication-random mutations (DMR) algorithm:

Possible interactions between twins are neglected.

Instead, edges between vtwin and the neighbors of v can be removed with probability qdel and

new edges can be created at random between vtwin and any other vertices with probability

qnew/N, where N is the current total number of vertices.

DMR emphasizes the creation of new advantageous functions by mutation.

Other models:

- linear preferential attachment (LPA) (Barabasi)

- random static networks (Erdös-Renyi) (RDS)

- random growing networks (RDG – growing graphs where new edges are created randomly

between existing nodes)

- aging vertex networks (AGV – growing graphs modeling citation networks, where the

probability for new edges decreases with the age of the vertex)

- small-world network (SMV – interpolation between regular ring lattices and randomly

connected graphs).




Training set

Create 1000 graphs as training data for each of the seven different models.

Every graph is generated with the same number of edges and nodes as measured

in Drosophila.

Quantify topology of a network by counting all possible subgraphs up to a given

cut-off, which could be the number of nodes, number of edges, or the length of a

given walk.

Here: count all subgraphs that can be constructed by a walk of length=8 (148 non-

isomorphic subgraphs) or length=7 (130 non-isomorphic subgraphs).

Use these counts as input features for classifier.

Note that the average shortest path between two nodes of the Drosophila

network‘s giant component is 11.6 (9.4) for p*=0.65 (0.5).

Walks of length=8 can traverse large parts of the network.




Visualization of subgraphs

A qualitative and more intuitive

way of interpreting the

classification result is

visualizing the subgraph

profiles.

Subgraphs associated with

Figures 3 and 1.

A representatie subset of 50

subgraphs out of 148 is

shown.




Learning algorithm: Alternating Decision Tree


Rectangles: decision nodes.

A given network‘s subgraph counts

determine paths in the tree dictated

by inequalities specified by the

decision nodes.

For each class, the Alternative

Decision Tree outputs a real-valued

prediction score, which is the

sum of all weights over all paths.

The class with the heighest score

wins.



Performance on training set

Can the Decision Tree separate the graphs generated by the different growth

mechanisms?

The confusion matrix shows truth and prediction for the test sets.

5 out of 7 have nearly perfect prediction accuracy.

AGV is constructed as an interpolation between LPA and a ring lattice

the AGV, LPA and SMW mechanisms are equivalent in specific parameter

regimes and show a non-negligible overlap.




Task: discriminate different growth mechanisms

Ten graphs of two different mechanisms exhibit similar average geodesic lengths

and almost identical degree distribution and clustering coefficients.

(a) cumulative degree distribution p(k > k0), average clustering coefficient <C> and

average geodesic length <L>, all quantities averaged over a set of 10 graphs.

global topology descriptors cannot separate between growth mechanisms

(b) Prediction score for all ten graphs and all five cross-validated ADTs.

The two sets of graphs can now be perfectly separated by the classifier.




Learning algorithm: Alternating Decision Tree


Figure shows the first few descision

nodes (out of 120) of a resulting ADT.

The prediction scores reveal that a

high count of 3-cycles suggests a

DMC network.

DMC mechanism indeed

facilitates creation of many

3-cycles by allowing 2 copies

to attach to eachother, thus

creating 3-cycles with their common

neighbors.

A low count in 3-cycles but a high

count in 8-edge linear chains is a

good precictor for LPA and DMR

networks.



Prediction for Drosophila melanogaster network

Use this classifier (ADT) with good prediction accuracy now to determine the

network mechanism that best reproduces the Drosophila network (or any network

of the same size).

Prediction scores for the Drosophial protein network for different confidence

threshold p* and different cut-offs in subgraph size.

Drosophial is consistently classified as a DMC network, with an especially strong

prediction for a confidence threshould of p*=0.65 and independently of the cut-off

in subgraph size.




Subgraph profilesThe average subgraph count of the

training data for every mechanism is

shown for the 50 representative

subgraphs S1-S50.

Black lines indicate that this model is

closest to Drosophila based on the

absolute difference between the

subgraph counts.

For 60% of the subgraphs (S1-S30),

the counts for Drosophila

are closest to the DMC model.

All of these subgraphs contain one or

more cycles, including highly

connected subgraphs (S1) and long

linear chains ending in cycles (S16,

S18, S22, S23, S25).


The DMC algorithm is the only

mechanism that produces such

cycles with a high occurrence.



Robustness against noise

Edges in Drosophila network are randomly

replaced and the network is classified.

Plotted are prediction scores for each of the 7

classes as more and more edges are replaced.

Every point is an average over 200 independent

random replacements.

For high noise level (beyond 80%), the

network is classified as an Erdös-Renyi (RDS)

graph. For low noise (< 30%), the confidence

in the classification as a DMC network is even

higher than in the classification as an RDS

network for high noise. The prediction score

y(c) for class c is related to the estimated

probability p(c) for the tested network to be in

class c by


cy

cy

e

ecp

2

2

1



Conclusions

Very nice (!) method that allows to infer growth mechanisms for real networks.

Method is robust against noise and data subsampling,

no prior assumption about network features/topology required.

Learning algorithm does not assume any relationships between features

(e.g. orthogonality). Therefore the input space can be augmented with various

features in addition to subgraph counts.

The protein interaction network of Drosophila is confidently classified as DMC

network. However, further growth mechanisms need to be explored in future.

Input from evolutionary biology is needed.

Here, we mostly concentrated on the technique of characterizing the resulting

network topologies.


2. Lecture WS 2006/07Bioinformatics III1 V2 – network topologies - Random graphs: classical field...

Documents

Transcript of 2. Lecture WS 2006/07Bioinformatics III1 V2 – network topologies - Random graphs: classical field...