Optimal Network Alignment with Graphlet Degree Vectors


Tijana Milenković (Department of Computing, Imperial College London; Department of Computer Science, University of California)

Weng Leong Ng (Department of Computer Science, University of California)

Wayne Hayes (Department of Computer Science, University of California; Department of Mathematics, Imperial College London)

Nataša Pržulj (Department of Computing, Imperial College London)

Cancer Informatics 2010

Presented by: Lila Shnaiderman


Motivation
• Recent advances in experimental techniques:
  – yeast two-hybrid assays
  – mass spectrometry of purified complexes
  – genome-wide chromatin immunoprecipitation
  – etc.
• As a result, increasing amounts of biological network data are becoming available.
• Comparative analyses of biological networks have as large an impact as comparative genomics on our understanding of:
  – biology
  – evolution
  – disease
• Meaningful network comparison across species has therefore become one of the foremost problems in evolutionary and systems biology.

Background
• Subgraph isomorphism problem:
  – Does one graph exist as an exact subgraph of another graph?
  – NP-complete.
  – So exact network comparison is computationally infeasible for large networks.
• Network alignment:
  – The most common network comparison method.
  – A more general problem: find the best way to “fit” one graph into another, even when it is not an exact subgraph.
• Unclear:
  – how to guide the alignment process
  – how to measure the “goodness” of an inexact fit
  – So heuristic strategies must be sought.


Background – alignment types
• Local alignment:
  – Used by the majority of existing methods.
  – Matches a small subnetwork from one network to one or more subnetworks in another network.
  – Can be ambiguous: a node may be paired differently in different local alignments.
• Global alignment:
  – Measures the overall similarity between two networks.
  – Aligns every node in the smaller network to exactly one node in the larger network.
  – Most existing methods incorporate some a priori information external to network topology, such as protein sequence similarity in protein-protein interaction (PPI) networks.
• Best known global network alignment algorithm based solely on network topology:
  – GRAph ALigner (GRAAL): uses a heuristic search strategy to quickly find approximate alignments.

Current solution: H-GRAAL
• Hungarian-algorithm-based GRAAL.
• More computationally expensive than GRAAL.
• Guaranteed to find optimal alignments relative to any fixed, deterministic cost function.
• Relies solely and explicitly on a strong and direct measure of network topological similarity.
• Applicable to any type of network.
• Allows knowledge to be transferred between aligned networks.

• Graphlet: a small connected induced subgraph of a larger network.


Graphlet degree vectors (1)

[Figure: the 2-, 3-, and 4-node graphlets G0 to G8, with their automorphism orbits numbered 0 to 14.]

Graphlet degree vectors (2)
• Graphlet degree vector (GDV) of node V: for each orbit, counts the number of graphlets that the node touches at that orbit (for all graphlets on 2 to 5 nodes).

[Figure sequence: the graphlets that node V touches at each orbit in an example network.]

Orbit    0  1  2  3  4  5  6  7  8  9  10  11  12  13  14
GDV(V)   4  2  5  1  0  4  0  2  1  0   0   2   0   0   0

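What is the degree of node V (according to the vector)? Orbit 0 counts the edges that V touches, so the degree is 4.

There are 73 different orbits across all 2- to 5-node graphlets.

The resulting 73-coordinate vector is the signature of node V.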
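
As a rough illustration of how the first few signature coordinates could be counted, the sketch below computes orbits 0 to 3 (the edge, the end and the middle of a 3-node path, and the triangle) for a single node. The function name `orbits_0_to_3`, the adjacency-set representation, and the toy network are assumptions made for this example. The toy network is not the one drawn on the slides; it was only chosen so that its first four counts agree with the first four entries of GDV(V) shown above.

```python
from itertools import combinations

def orbits_0_to_3(adj, v):
    """Count how often node v touches orbits 0-3 (graphlets on 2 and 3 nodes).

    adj maps each node to the set of its neighbours (undirected, no self-loops).
    """
    nbrs = adj[v]
    # Orbit 0: v is an endpoint of an edge, i.e. the ordinary degree.
    o0 = len(nbrs)
    # Orbit 3: v is in a triangle, i.e. two of its neighbours are adjacent.
    o3 = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Orbit 2: v is the middle of an induced 3-node path
    # (a pair of neighbours that are not adjacent to each other).
    o2 = o0 * (o0 - 1) // 2 - o3
    # Orbit 1: v is an end of an induced 3-node path v - a - w,
    # where w is a neighbour of a but not of v.
    o1 = sum(1 for a in nbrs for w in adj[a] if w != v and w not in nbrs)
    return [o0, o1, o2, o3]

# Hypothetical toy network (not the one from the slides).
edges = [("v", "a"), ("v", "b"), ("v", "c"), ("v", "d"),
         ("a", "b"), ("c", "e"), ("d", "f")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

print(orbits_0_to_3(adj, "v"))  # [4, 2, 5, 1]
```

Counting the remaining orbits (4 to 72) follows the same idea but needs enumeration of 4- and 5-node induced subgraphs, which this sketch does not attempt.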

Degree Vector - Signature
• Many real-world networks have a small-world nature.
• So the graphlet degree vector is an effective measure:
  – It looks at the network up to distance 4 around a node.
  – It captures a large portion of the network topology.
• Thus, comparing two signatures is a highly constraining measure of local topological similarity between nodes.


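Signature similarity
• For a node u ∈ G, $u_i$ is the ith coordinate of its signature vector, i.e., the number of times u touches orbit i.
• Distance between the ith orbits of nodes u and v:
  $D_i(u,v) = w_i \times \frac{|\log(u_i + 1) - \log(v_i + 1)|}{\log(\max\{u_i, v_i\} + 2)}$
  – $w_i$ is the weight of orbit i.
    • Accounts for dependencies between orbits.
    • Higher weights are given to orbits that are not affected by many other orbits.
• Questions:
  – Why log?
  – Why “+1”?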

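Distance and Similarity
• Total distance, where $D_i(u,v)$ is the distance between the ith orbits of u and v and $w_i$ is the weight of orbit i:
  $D(u,v) = \frac{\sum_{i=0}^{72} D_i(u,v)}{\sum_{i=0}^{72} w_i}$
  – D(u,v) is in [0, 1).
  – 0 means that the signatures of u and v are identical.
• Similarity: S(u,v) = 1 − D(u,v)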
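
The distance and similarity can be sketched in a few lines. This is a minimal sketch that assumes the per-orbit distance is the weighted, log-scaled difference w_i × |log(u_i + 1) − log(v_i + 1)| / log(max(u_i, v_i) + 2), normalised by the sum of the weights. The function names are hypothetical, and the uniform weights in the example are a placeholder, since the real orbit weights are not given on the slides. The second signature is invented for illustration.

```python
import math

def signature_distance(sig_u, sig_v, weights):
    """Weighted, normalised distance between two signatures; lies in [0, 1)."""
    total = 0.0
    for ui, vi, wi in zip(sig_u, sig_v, weights):
        # log damps large count differences; "+1" keeps log defined for a
        # count of 0; "+2" keeps the denominator strictly positive.
        diff = abs(math.log(ui + 1) - math.log(vi + 1))
        total += wi * diff / math.log(max(ui, vi) + 2)
    return total / sum(weights)

def signature_similarity(sig_u, sig_v, weights):
    """S(u,v) = 1 - D(u,v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)

# First 15 coordinates of GDV(V) from the slides, and a made-up second node.
sig_v = [4, 2, 5, 1, 0, 4, 0, 2, 1, 0, 0, 2, 0, 0, 0]
sig_u = [3, 2, 3, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0]
uniform = [1.0] * len(sig_v)  # placeholder weights (assumption)

print(signature_distance(sig_v, sig_v, uniform))   # 0.0
print(signature_similarity(sig_u, sig_v, uniform))
```

Passing the weights as an argument keeps the sketch independent of how the orbit dependencies are actually tabulated.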


H-GRAAL algorithm - definitions
• G1 and G2 are networks with |V(G1)| < |V(G2)|.
• Alignment of G1 to G2:
  – A set of ordered pairs (u,v), with u ∈ V(G1) and v ∈ V(G2).
  – No two ordered pairs share the same G1-node or the same G2-node.
  – Each pair is called an aligned pair.
• Maximum alignment:
  – Every G1-node is in some aligned pair.
  – From now on, “alignment” means “maximum alignment”.


H-GRAAL algorithm
• H-GRAAL: Hungarian-algorithm-based GRAph ALigner.
• Produces an alignment of minimum total cost between two networks:
  – Total cost: the sum of the costs of all aligned pairs.
  – Aligned-pair cost: based on signature similarity.
• The cost of aligning u and v:
  – Favors alignment of the densest parts of the networks.
  – Is reduced as the degrees of both nodes increase: higher-degree nodes with similar signatures provide a tighter constraint.
  – α ∈ [0, 1] weights the contribution of the signature similarity between u and v to the cost function.
  – 1 − α weights the contribution of the node degrees.


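Alignment Cost
• $C(u,v) = 2 - \big((1-\alpha) \times T(u,v) + \alpha \times S(u,v)\big)$, where $T(u,v) = \frac{\deg(u) + \deg(v)}{\mathrm{max\_deg}(G_1) + \mathrm{max\_deg}(G_2)}$.
• Cost = 0: a pair of topologically identical nodes u and v.
• Cost close to 2: a pair of topologically very different nodes.
• Any problem with this formula?
• T(u,v) is very low for most nodes:
  – There is only a small number of hubs (highly linked nodes),
  – so max_deg(G1) and max_deg(G2) are much larger than deg(u) and deg(v) for most nodes.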
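
To make the role of α and of the degree term concrete, here is a small sketch of the pair cost. It assumes the cost has the form 2 − ((1 − α) × T(u,v) + α × S(u,v)), with T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2)), which matches the quantities named on the slide. The function name, argument names, and all numbers are hypothetical.

```python
def alignment_cost(deg_u, deg_v, max_deg_g1, max_deg_g2, similarity, alpha):
    """Cost of aligning u with v under the assumed formula.

    similarity is the signature similarity S(u,v) in (0, 1];
    alpha in [0, 1] weights S(u,v), and 1 - alpha weights the degree term.
    """
    t = (deg_u + deg_v) / (max_deg_g1 + max_deg_g2)  # degree term T(u,v)
    return 2 - ((1 - alpha) * t + alpha * similarity)

# Hypothetical values: two low-degree nodes in networks whose hubs have
# degree 10 and 20, so T(u,v) = 6 / 30 is small, as the slide points out.
low = alignment_cost(2, 4, 10, 20, similarity=0.8, alpha=0.5)
high = alignment_cost(9, 18, 10, 20, similarity=0.8, alpha=0.5)
print(low, high)  # the higher-degree pair is cheaper at equal similarity
```

Because hubs are rare, T(u,v) is close to 0 for most pairs, so in practice the cost is driven mainly by the α × S(u,v) term.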

Hungarian Algorithm
• Solves the assignment problem in polynomial time:
  – Build a bipartite graph whose two parts are V(G1) and V(G2).
  – Label each edge (u,v) from V(G1) to V(G2) with the node alignment cost.
  – Find a minimum-cost matching that covers every node of V(G1).
• More than one optimal alignment is possible:
  – The particular alignment found depends heavily on the implementation details of the underlying Hungarian algorithm,
  – for example, the order in which the nodes are presented to the algorithm.
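
The assignment step can be sketched with a compact textbook version of the Hungarian method (the shortest-augmenting-path variant with row and column potentials). This is not H-GRAAL's own implementation; the function name `hungarian` and the cost matrix are assumptions for illustration. Rows stand for G1-nodes, columns for G2-nodes, and the matrix may have more columns than rows.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column.

    cost is a list of rows with len(cost) <= len(cost[0]).
    Returns (match, total) where match[i] is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j, 0 if free
    way = [0] * (m + 1)   # previous column on the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift potentials so that one more column becomes reachable.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augmenting path found
                break
        while j0 != 0:       # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match, sum(cost[i][match[i]] for i in range(n))

# Hypothetical pair costs: 2 nodes in G1 (rows), 3 nodes in G2 (columns).
costs = [[1.2, 1.9, 1.5],
         [1.1, 1.3, 1.8]]
print(hungarian(costs)[0])  # [0, 1]
```

When several assignments share the minimum cost, which one is returned depends on the row and column order, which is the implementation dependence the slide mentions.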


Finding a Few Optimal Alignments
• We would like to learn about all possible optimal matchings.
• Make H-GRAAL produce more alignments, starting from an optimal alignment A0:
  – “Remove” (u,v): raise the alignment cost of a node pair (u,v) in A0 to +∞.
  – Run H-GRAAL again.
• If the alignment found has a higher cost than A0, “remove” a different edge instead.
• If no alignment with the optimal cost is found after trying to “remove” every edge of A0, no other optimal alignments exist.
• This process is too expensive:
  – O(|V(G1)|^3 × |E(G1)|).
  – A faster variant exists, O(|V(G1)|^2 × |E(G1)|), based on the dynamic Hungarian algorithm.
  – My remark: still very slow (can take months…).
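
The "remove an edge and re-run" idea can be tried on a toy cost matrix. In the sketch below, a brute-force solver stands in for H-GRAAL so that the block stays short and self-contained; it is only usable on tiny inputs. The names `solve` and `other_optimal_alignments`, and the integer cost matrix, are assumptions for illustration.

```python
import math
from itertools import permutations

def solve(cost):
    """Brute-force stand-in for the Hungarian step (tiny inputs only)."""
    n, m = len(cost), len(cost[0])
    def total(cols):
        return sum(cost[i][c] for i, c in enumerate(cols))
    best = min(permutations(range(m), n), key=total)
    return list(best), total(best)

def other_optimal_alignments(cost):
    """Block each pair of A0 in turn and re-solve, as described above."""
    a0, opt = solve(cost)
    found = []
    for u, v in enumerate(a0):
        blocked = [row[:] for row in cost]
        blocked[u][v] = math.inf           # "remove" the pair (u, v)
        alignment, c = solve(blocked)
        if c == opt and alignment not in found:
            found.append(alignment)        # another optimal alignment
    return a0, found

# Hypothetical integer costs: rows 0 and 1 are interchangeable.
cost = [[1, 1, 5],
        [1, 1, 5],
        [5, 5, 1]]
a0, others = other_optimal_alignments(cost)
print(a0, others)  # [0, 1, 2] [[1, 0, 2]]
```

Each blocked pair triggers a full re-solve, which is why the slide flags the cost of this process even with the faster dynamic variant.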


Few Optimal Alignments algorithm
• Optimizing aligned pair:
  – An aligned pair that appears in at least one optimal alignment.
• The set of optimizing pairs:
  – Can be computed in at worst O(n^4) time.
  – Can be easily parallelized.
• My remark: too slow…

Few Optimal Alignments - Analysis
• Significance of an aligned pair (u,v):
  – Judged by the number of optimizing pairs that contain u.
  – If (u,v) is the only optimizing pair for u, every optimal alignment contains (u,v), i.e., (u,v) is highly significant.
• Core alignment:
  – The set of all such unique optimizing pairs.
  – A large core alignment means a stable alignment.
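
Optimizing pairs and the core alignment can be illustrated by brute force on a tiny example. This sketch enumerates every alignment, which is exponential and is not the O(n^4) method mentioned earlier; it only shows what the two sets contain. The function names, the tolerance parameter, and the cost matrix are assumptions.

```python
from itertools import permutations

def optimal_alignments(cost, tol=1e-9):
    """Return (optimal cost, list of all alignments reaching it)."""
    n, m = len(cost), len(cost[0])
    best, best_maps = float("inf"), []
    for cols in permutations(range(m), n):
        total = sum(cost[i][c] for i, c in enumerate(cols))
        if total < best - tol:
            best, best_maps = total, [cols]
        elif abs(total - best) <= tol:
            best_maps.append(cols)
    return best, best_maps

def optimizing_pairs(cost):
    """Pairs (u, v) that appear in at least one optimal alignment."""
    _, maps = optimal_alignments(cost)
    return {(u, v) for cols in maps for u, v in enumerate(cols)}

def core_alignment(cost):
    """Pairs (u, v) where v is the only optimizing partner of u."""
    partners = {}
    for u, v in optimizing_pairs(cost):
        partners.setdefault(u, []).append(v)
    return {(u, vs[0]) for u, vs in partners.items() if len(vs) == 1}

# Hypothetical costs: rows 0 and 1 can swap partners, row 2 cannot.
cost = [[1, 1, 5],
        [1, 1, 5],
        [5, 5, 1]]
print(sorted(optimizing_pairs(cost)))  # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)]
print(sorted(core_alignment(cost)))    # [(2, 2)]
```

Here only node 2 has a single optimizing partner, so the core alignment contains one pair out of three, which would indicate a rather unstable alignment.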

Measures of alignment quality (1)
• Edge correctness (EC):
  – The percentage of edges in one graph that are aligned to edges in the other graph.
• The following measures require knowing the “true alignment”:
• Node correctness (NC):
  – The percentage of nodes in one network that are correctly aligned to nodes in the other network.
• Interaction correctness (IC):
  – The percentage of interactions that are aligned correctly.
• IC is stricter than EC:
  – EC does not require that the alignment partners are the correct ones.
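
EC and NC are simple ratios, as the following sketch shows. The function names, the dictionary representation of an alignment, and the toy networks are assumptions. The "true" alignment used for NC is invented for the example.

```python
def edge_correctness(edges_g1, edges_g2, alignment):
    """Fraction of G1 edges whose aligned endpoints form an edge in G2."""
    g2 = {frozenset(e) for e in edges_g2}
    hits = sum(1 for a, b in edges_g1
               if frozenset((alignment[a], alignment[b])) in g2)
    return hits / len(edges_g1)

def node_correctness(alignment, true_alignment):
    """Fraction of G1 nodes mapped to their known correct partner."""
    hits = sum(1 for u, v in alignment.items() if true_alignment.get(u) == v)
    return hits / len(alignment)

# Hypothetical toy data.
edges_g1 = [("a", "b"), ("b", "c"), ("a", "c")]
edges_g2 = [(1, 2), (2, 3), (3, 4)]
alignment = {"a": 1, "b": 2, "c": 3}
truth = {"a": 1, "b": 3, "c": 2}

print(edge_correctness(edges_g1, edges_g2, alignment))  # 2 of 3 edges conserved
print(node_correctness(alignment, truth))               # 1 of 3 nodes correct
```

The example also shows why EC alone can mislead: two of three edges are conserved even though only one node is aligned to its correct partner.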

Measures of alignment quality (2)
• Usually the “true alignment” is not known:
  – So only EC can be measured.
  – Two alignments can have similar ECs even though one is “good” and the other is “bad”, so EC alone is not enough.
• To uncover regions of similar topology:
  – The aligned edges must cluster together and form large and dense connected subgraphs.
• Common connected subgraph (CCS):
  – A connected subgraph that appears in both networks.
• A good alignment has:
  – large and dense CCSs
  – a large EC
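
One way to extract the largest CCS from an alignment is to keep only the conserved edges and then take the largest connected component. The sketch below does this with a plain depth-first search; the function name `largest_ccs`, the choice to rank components by edge count, and the toy data are assumptions.

```python
from collections import defaultdict

def largest_ccs(edges_g1, edges_g2, alignment):
    """Largest connected component formed by edges conserved under the alignment.

    Returns (nodes, edges) expressed in G1 node names.
    """
    g2 = {frozenset(e) for e in edges_g2}
    conserved = [(a, b) for a, b in edges_g1
                 if frozenset((alignment[a], alignment[b])) in g2]
    adj = defaultdict(set)
    for a, b in conserved:
        adj[a].add(b)
        adj[b].add(a)
    best_nodes, best_edges, seen = set(), [], set()
    for start in list(adj):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:                      # depth-first search
            x = stack.pop()
            for y in adj[x]:
                if y not in comp:
                    comp.add(y)
                    stack.append(y)
        seen |= comp
        comp_edges = [(a, b) for a, b in conserved if a in comp]
        if len(comp_edges) > len(best_edges):
            best_nodes, best_edges = comp, comp_edges
    return best_nodes, best_edges

# Hypothetical toy data.
edges_g1 = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
edges_g2 = [(1, 2), (2, 3), (5, 6), (3, 9)]
alignment = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}

nodes, edges = largest_ccs(edges_g1, edges_g2, alignment)
print(sorted(nodes), edges)  # ['a', 'b', 'c'] [('a', 'b'), ('b', 'c')]
```

Three of the four G1 edges are conserved here, but they split into two components, so the largest CCS has only two edges.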


Statistical Significance
• Random alignment of real-world networks:
  – The probability of obtaining a given or better EC at random.
• Null model of random alignment:
  – Random mapping g: E1 → V2 × V2.
  – n1 = |V1|, n2 = |V2|, m1 = |E1|, and m2 = |E2|.
  – p = n2(n2 − 1)/2: the number of node pairs in G2.
  – EC = x%: the edge correctness of the given alignment.
  – k = [m1 × x]: the number of edges from G1 that are aligned to edges in G2.
• P: the probability of successfully aligning k or more edges by chance (the tail of the hypergeometric distribution):
  $P = 1 - \sum_{i=0}^{k-1} \frac{\binom{m_2}{i}\binom{p - m_2}{m_1 - i}}{\binom{p}{m_1}}$
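
The tail probability can be evaluated directly with exact integer binomials. This sketch assumes the null model described above: each of the m1 edges of G1 lands on a distinct node pair of G2 chosen at random, and m2 of the p node pairs are edges. It sums the upper tail from k instead of computing one minus the lower tail, which gives the same value but avoids cancellation when P is tiny. The function name and the numbers are hypothetical.

```python
from math import comb

def ec_p_value(n2, m1, m2, k):
    """P(at least k of the m1 G1-edges hit one of the m2 G2-edges by chance)."""
    p = n2 * (n2 - 1) // 2                      # node pairs in G2
    tail = sum(comb(m2, i) * comb(p - m2, m1 - i)
               for i in range(k, min(m1, m2) + 1))
    return tail / comb(p, m1)

# Hypothetical toy numbers: G2 has 5 nodes and 4 edges, G1 has 3 edges,
# and the alignment conserves k = 2 of them.
print(ec_p_value(n2=5, m1=3, m2=4, k=2))
```

For real PPI networks the binomials become very large integers, which Python handles exactly before the final division.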

More Statistical Significance Metrics
• H-GRAAL’s alignment of random model networks:
  – Checks the significance of an alignment by comparison with alignments of random networks:
    • align two PPI networks,
    • align them with random networks,
    • compare the results.
• Biological validation:
  – Find the number of aligned protein pairs that share a Gene Ontology (GO) term.
  – Compute its statistical significance.
• Significance of functional enrichments:
  – Align metabolic networks of different species.
  – Generate phylogenetic trees based on H-GRAAL’s ECs.
  – Compute their statistical significance.


Results (1)

• H-GRAAL always produces better alignments than GRAAL for all values of α.

• Using only degrees (α = 0) gives poor results.
  – So graphlet-based signatures are far more valuable than a measure based on degree alone.

Results (2)

• The largest common connected subgraph in the alignment of the yeast and human PPI networks:
  – Consists of 1,290 interactions amongst 317 proteins.
  – This network appears, in its entirety, in the PPI networks of both species.


Results (3)

• Statistics of H-GRAAL’s core yeast-human alignment for α = 0.5.

• The percentage of yeast proteins, out of 2,390 of them, that participate in n “optimizing pairs”.

• Shows the quality of H-GRAAL!

Results (4)

• Comparison of the phylogenetic trees for protists and fungi.
• H-GRAAL’s and GRAAL’s trees are slightly different from the sequence-based one.
• Sequence-based trees are built from:
  – multiple alignments of gene sequences
  – whole-genome alignments

Results (5)
• Multiple alignments have a few problems:
  – They can be misleading due to gene rearrangements, inversions, transpositions, and translocations (at the substring level).
  – Different species might have an unequal number of genes or genomes of vastly different lengths.
• Whole-genome alignments can be misleading:
  – Noncontiguous copies of a gene or non-decisive gene order.
  – The trees are built incrementally from smaller pieces that are “patched” together probabilistically, so probabilistic errors are expected.
• H-GRAAL’s and GRAAL’s trees have none of these problems, but:
  – There are noise problems.
  – PPI networks are incomplete.
• There is no reason to believe that either the sequence-based tree or GRAAL’s tree should a priori be considered the correct one.


Conclusions
• Presented the H-GRAAL algorithm for global alignment between networks.
• Presented different statistics to evaluate the quality of an alignment.
• Experimented with different PPI networks, and not only PPI networks.
• Showed that H-GRAAL is the best known global alignment algorithm.
• H-GRAAL can have a huge influence on research into biological networks!


Thank you for your attention!
