Counting small patterns in...

Counting small patterns in networks

A dissertation presentedby

Toma Hoevar

toThe Faculty of Computer and Information Science

in partial fulfilment of the requirements for the degree ofDoctor of Philosophy

in the subject ofComputer and Information Science

Ljubljana,

APPROVAL

I hereby declare that this submission is my own work and that, to the best of myknowledge and belief, it contains no material previously published or written by anotherperson nor material which to a substantial extent has been accepted for the award of anyother degree or diploma of the university or other institute of higher learning, except where

due acknowledgement has been made in the text. Toma Hoevar

January

The submission has been approved by

dr. Borut RobiProfessor of Computer and Information Science

examiner

dr. Gaper FijavProfessor of Mathematics

examiner

dr. Yuval ShavittProfessor of Electrical Engineering

external examinerTel Aviv University

dr. Janez DemarProfessor of Computer and Information Science

advisor

PREVIOUS PUBLICATION

I hereby declare that the research reported herein was previously published/submittedfor publication in peer reviewed journals or publicly presented at the following occa-sions:

[] T. Hoevar and J. Demar. A combinatorial approach to graphlet counting.Bioinformatics, ():, . doi: ./bioinformatics/btt

A pre-copyedited, author-produced version of an article accepted for publication inBioinformatics following peer review is included. The version of record Toma Hoevar,Janez Demar; A combinatorial approach to graphlet counting. Bioinformatics ; ():- is available online athttps://academic.oup.com/bioinformatics/article/30/4/559/205331/A-combinatorial-approach-to-graphlet-counting.

[] T. Hoevar and J. Demar. Computation of Graphlet Orbits for Nodes and Edges inSparse Graphs. Journal of Statistical Software, ():, .doi: ./jss.v.i

[] T. Hoevar and J. Demar. Combinatorial algorithm for counting small induced graphsand orbits. PLOS ONE, ():, . doi: ./journal.pone.

I certify that I have obtained a written permission from the copyright owner(s) toinclude the above published material(s) in my thesis. I certify that the above materialdescribes work completed during my registration as graduate student at the Universityof Ljubljana.
http://dx.doi.org/10.1093/bioinformatics/btt717https://academic.oup.com/bioinformatics/article/30/4/559/205331/A-combinatorial-approach-to-graphlet-countinghttps://academic.oup.com/bioinformatics/article/30/4/559/205331/A-combinatorial-approach-to-graphlet-countinghttp://dx.doi.org/10.18637/jss.v071.i10http://dx.doi.org/10.1371/journal.pone.0171428

POVZETEK

Univerza v LjubljaniFakulteta za raunalnitvo in informatiko

Toma Hoevartetje majhnih vzorcev v omrejih

Omreja pogosto uporabljamo za vizualizacijo in analizo relacij med pari entitet, kijih predstavimo z mnoico vozli, povezave med njimi pa predstavljajo relacije. Enaizmed relacij v bioinformatiki, ki jo pogosto modeliramo z omreji, so interakcije medpari proteinov. Nedavne tudije v zvezi z lokalno strukturo takih omreij so uporabljalemajhne povezane vzorce s ali vozlii, ki jim reemo tudi grafki. Vozlia grafkovse obiajno delijo v orbite glede na njihovo vlogo oz. simetrije. Kolikokrat nekovozlie v omreju nastopa v vsaki izmed orbit, predstavlja neke vrste podpis lokalnestrukture v okolici vozlia. Z zanaanjem na predpostavko, da je lokalna strukturavozlia povezana z njegovo funkcijo v omreju, je raziskovalcem uspelo z uporabografkov napovedati nove funkcije proteinov.

Glavna ovira pristopov na osnovi grafkov je obiajno v asu, ki ga zahteva tetjegrafkov. Ta omejitev je vedno bolj izrazita zaradi vedno veje koliine razpololjivihpodatkov. Disertacija se posvea izboljavi obstojeih metod za tetje grafkov. Te na-mre delujejo na osnovi enostavnega izrpnega natevanja vseh grafkov v omreju.

V disertaciji predstavljen algoritem Orca preteje grafke, ne da bi jih natel, kot toponejo ostale metode. Izkoria povezave med frekvencami orbit za pripravo sistemaenab, ki ga sestavi izredno uinkovito. Orca za tetje grafkov velikosti nateje zgoljgrafke velikosti 1. Tako dosee pohitritev, ki je sorazmerna najveji stopnji vozlia vomreju. V praksi to pomeni, da preteje grafke v vejih omrejih proteinskih interakcij do -krat hitreje.

Algoritem Orca je bil v osnovi razvit za tetje grafkov in orbit vozli velikosti in. Pristop smo uspeno prilagodili tudi tetju orbit povezav z enakimi prihranki gledeasa izvajanja. Reitev je mono posploiti za tetje poljubno velikih grafkov. V tanamen smo identificirali potrebne pogoje in dokazali, da jih je mogoe izpolniti tudiv primeru tetja vejih grafkov.

i

ii Povzetek T. Hoevar

Disertacija se posvea tudi problemu generiranja nakljunih omreij s predpisanoporazdelitvijo grafkov. Ta problem predstavlja motivacijo za prilagoditev algoritmaOrca za uporabo v dinaminih oz. spreminjajoih omrejih, kjer lahko nastajajo no-ve povezave ali pa obstojee propadajo. Spremembe so lahko posledica postopka zageneriranje nakljunega omreja ali pa so del procesa, ki ga omreje modelira. Gene-rirana omreja se zelo pribliajo eleni porazdelitvi grafkov. Poleg tevila grafkov pa sosi podobna tudi po drugih merah lokalne strukture omreij.

Razviti algoritem je pomembno orodje za analizo omreij z grafki in predstavljapomemben korak k analizi vejih in gostejih omreij. Kot najhitreja metoda te-tja grafkov je tudi osnova nadaljnjega raziskovanja uinkovitih metod tetja vzorcev vomrejih.

Doktorska disertacija temelji na treh objavljenih znanstvenih lankih, ki skupaj spoglavjem, ki vsebuje e neobjavljeno delo, tvorijo jedro disertacije.

Kljune besede: grafki, orbite, omreje, graf, podgraf, vzorec, tetje

ABSTRACT

University of LjubljanaFaculty of Computer and Information Science

Toma HoevarCounting small patterns in networks

Networks are an often employed tool that can help us visualize and analyze binaryrelationships by representing the entities as a set of nodes and the relations betweenthem as edges in the network. One type of relations in the field of bioinformatics that isoften modeled by networks are interactions between pairs of proteins. Recent studieshave focused on analyzing the local structure of such networks by observing smallconnected patterns consisting of or nodes, which are also known as graphlets. Thenodes of graphlets are further divided into orbits by their roles or symmetries. Thenumber of times a node from the network participates in each orbit forms a signature ofthe nodes local network topology. Working under the assumption that the nodes localtopology is correlated with its function in the network, researchers have successfullyused graphlets to predict new protein functions.

The bottleneck of graphlet-based approaches is usually in the time required to countthem. This restriction is becoming even more pronounced with a growing amountof available data. This dissertation focuses on improving existing graphlet countingtechniques that are based on simple exhaustive enumeration.

We present the algorithm Orca that counts graphlets and their orbits instead ofenumerating them. It exploits relations between orbit counts to construct a system ofequations that can be set up efficiently. Orca achieves this by enumerating ( 1)-node graphlets to count -node graphlets, effectively obtaining a speed-up by a factorproportional to the maximum degree of a node in the network. In practical terms,it counts graphlets in larger protein-protein interaction networks about - timesfaster.

Orca was designed for counting graphlets with and nodes. However, we adaptthe approach to counting edge-orbits in addition to the original node-orbits with thesame gains in run time. We also show that this approach can be generalized to graphlets

iii

iv Abstract T. Hoevar

of arbitrary size by identifying the necessary conditions and proving that these condi-tions can be fulfilled even for larger graphlets.

Finally, we consider the problem of generating random graphs with prescribed graph-let distributions. This motivated the adaptation of Orca for dynamic or changing net-works, where edges can be added or removed. These changes can be a consequence ofthe procedure for generating a random graph or can be inherent in the network andthe process it models. The generated graphs closely match the desired graphlet countsand as a consequence approximate other structural measures as well.

The developed algorithm is a valuable tool for graphlet-based network analysis and asignificant stepping stone towards analyzing larger and denser networks. As the fastestgraphlet counting method it also presents a basis for further development of efficientpattern counting methods in graphs.

This doctoral dissertation is based on three published papers that together with achapter containing some unpublished work form the core of the dissertation.

Key words: graphlets, orbits, network, graph, subgraph, pattern, counting

ACKNOWLEDGEMENTS

I would like to thank my parents who introduced me to programming and computer scienceand my advisor for guiding me through the unpredictable waters of research and publish-ing. I am also especially grateful to all my colleagues who were involved in algorithmicdiscussionsrelated or unrelated to the topic of this dissertation but contributing to thefinal result nevertheless.

Toma Hoevar, Ljubljana, January .

v

CONTENTS

Povzetek i

Abstract iii

Acknowledgements v

Introduction . Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . Areas of application . . . . . . . . . . . . . . . . . . . . . . . . . . Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Theoretical background . . . . . . . . . . . . . . . . . . . . . . . . Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Scientific contributions . . . . . . . . . . . . . . . . . . . . . . .

Overview . A combinatorial approach to graphlet counting . . . . . . . . . . . . Computation of graphlet orbits for nodes and edges in sparse graphs . Combinatorial algorithm for counting small induced graphs and orbits

A combinatorial approach to graphlet counting . Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . .. Related work . . . . . . . . . . . . . . . . . . . . . . . .

. Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. Orbits in four-node graphlets . . . . . . . . . . . . . . . .

vii

viii Contents T. Hoevar

.. Counting complete graphlets . . . . . . . . . . . . . . . . .. Orbits on five-node graphlets . . . . . . . . . . . . . . . .

. Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Supplementary . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.. Results on random networks . . . . . . . . . . . . . . . . .. Log-scale graphs . . . . . . . . . . . . . . . . . . . . . . .

Computation of graphlet orbits for nodes and edges in sparse graphs . Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Combinatorial approach to orbit counting . . . . . . . . . . . . . .

.. Node orbits . . . . . . . . . . . . . . . . . . . . . . . . . .. Edge orbits . . . . . . . . . . . . . . . . . . . . . . . . . .. System of equations . . . . . . . . . . . . . . . . . . . . . .. Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .

. The orca package . . . . . . . . . . . . . . . . . . . . . . . . . . . .. Functions . . . . . . . . . . . . . . . . . . . . . . . . . . .. Usage example on the Schools Wikipedia network . . . . .

. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . .

Combinatorial algorithm for counting small induced graphs and orbits . Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . .. Related work . . . . . . . . . . . . . . . . . . . . . . . . .. Outline of the proposed algorithm . . . . . . . . . . . . . .. Original contributions . . . . . . . . . . . . . . . . . . . .

. Relations between orbit counts . . . . . . . . . . . . . . . . . . . . .. Derivation of general relations between orbit counts . . . . .. Additional constraints on selection of y . . . . . . . . . . . .. Equation for a cycle on nodes . . . . . . . . . . . . . . . .. System of equations . . . . . . . . . . . . . . . . . . . . .

ix

.. Extension to edge orbits . . . . . . . . . . . . . . . . . . . . Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.. Time- and space-complexity . . . . . . . . . . . . . . . . . . Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Graphlet counting in dynamic graphs . Generating random graphs . . . . . . . . . . . . . . . . . . . . . . . Dynamic Orca . . . . . . . . . . . . . . . . . . . . . . . . . . . .

.. Overview of Orca . . . . . . . . . . . . . . . . . . . . . . .. Dynamic operations . . . . . . . . . . . . . . . . . . . . . .. Maintaining graphlet counts . . . . . . . . . . . . . . . .

. Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Conclusion . Influence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

A Razirjeni povzetek A. Uvod . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Prispevki k znanosti . . . . . . . . . . . . . . . . . . . . . . . . . A. Orca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Veji grafki . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Sprotno tetje in nakljuna omreja . . . . . . . . . . . . . . . . . A. Zakljuek . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

B Graphlet equations B. Equations for node-orbit counts in -graphlets . . . . . . . . . . . B. Equations for edge-orbit counts in -graphlets . . . . . . . . . . . . B. Equations for node-orbit counts in -graphlets . . . . . . . . . . . B. Equations for edge-orbit counts in -graphlets . . . . . . . . . . . .

Bibliography

Introduction

Introduction T. Hoevar

. Problem definition

Networks naturally arise in many different disciplines and are modeled as graphs.Small patterns in these graphs can uncover the basic building blocks of networks andhelp us understand the structure and properties of these networks. However, theirdiscovery and counting remain computationally intensive tasks. We can estimate thepattern frequencies on random samples, yet in some applications we would prefer toknow their exact frequencies. This motivates the research of efficient methods for exactpattern counting in sparse graphs.

We will refer to these small connected patterns that occur as induced subgraphsin the network as graphlets []. For an even finer description, the nodes of graphletscan be further divided into their automorphism orbits [] or roles of individual nodes.Figure . illustrates all -node and -node graphlets along with their orbits. Thegraphlet counting problem consists of computing the orbit distribution for every nodein the graphhow many times does a node in the graph participate in each of theorbits? This gives us a topological signature of a nodes local neighbourhood.

Problem definition Given a simple graph , we would like to compute everynodes frequency distribution of orbits from graphlets with at most nodes that appearas induced subgraphs in. Figure . illustrates the problem on a simple toy network.

Graphlet and orbit distributions can be viewed as generalizations of a node degree,which corresponds to the first non-trivial graphlet consisting of two connected nodes.We can extend the notion of orbits to edges and observe edge-orbits [] as definedby edge-automorphisms of the patterns. The number of common neighbours of twoadjacent nodes is equivalent to the number of triangles spanning over a given edge.This suggests that an edge-orbit distribution provides a generalization of the numberof common neighbours, which is an important feature in network analysis.

The graphlet counting problem differs from motif detection in networks [, ].However, the term motif is sometimes also used in the same way as graphlet or pat-tern and is not limited to the statistically too frequent patterns. We are interested infrequencies of all patterns because the most infrequent ones might also be the mostimportant in a given area of application. For a similar reason we also focus on ex-act computation although there have been several attempts at approximating graphletdistributions [].

. Areas of application

Network analysis with graphlet distributions has been so far used mainly in bioin-formatics but is in no way limited to this area []. Protein-protein interaction (PPI)networks have been the main subject of graphlet analysis. They have been used to findrandom network models [] that fit real PPI networks, cluster proteins and predict newprotein properties [] and predict new genes related to cancer [] or aging [].

Graphlets can also assist in other analytic methods, such as global network align-ment. GRAAL [] is an algorithm for aligning arbitrary networks based solely ontheir topology, which employs a local topology similarity measure based on graphletdegree vectors. The technique was used to show the large amount of shared networktopology between yeast and human PPI networks, which can be used to predict bio-logical functions of aligned proteins or reconstruct phylogenetic trees. H-GRAAL []aligns networks by reducing the problem to a weighted bipartite matching that can besolved with Hungarian algorithm []. Finally, MI-GRAAL [] integrates multiplesources of node similarity information, including the graphlet degree vectors.

. Motivation

There is a number of graphlet counting tools, which are used in bioinformatics. FAN-MOD [] is a network motif detection tool based on sampling random subgraphsand comparing their counts with those from random network models. Whelan []developed GraphletCounter, which works as a Cytoscape [] plugin and mergesgraphlet analysis with visual inspection of the network. GraphCrunch [, ] alsoprovides methods for further analysis and comparison of computed graphlet distribu-tions. Rapid Graphlet Enumerator (RAGE) [] is one example of a faster graphletcounting algorithm. Instead of counting the induced subgraphs directly, it recon-structs them from counts of non-induced subgraphs. Unlike FANMOD and Graph-Crunch, RAGE works only for up to four-node graphlets.

All previously mentioned methods rely on an exhaustive enumeration of graphlets.Despite limiting the analysis to patterns on at most five nodes, this already presentsa significant problem on larger PPI networks. For example, an estimated run timeof GraphCrunch for counting graphlets of size five in human PPI network from theBioGRID database [] is several months. One possibility is to exploit hardware op-timizations [] which reduce the time by a constant factor, but we will need new


algorithmic approaches with such fast growth of available data and size of networks.

. Theoretical background

A classic result by Itai and Rodeh [] refers to counting triangles in a graph. Raisingthe graphs adjacency matrix to the third power gives the number of paths of length between pairs of nodes. Elements , give the number of paths of length thatstart and finish in the node , which equals twice the number of triangles that include. The total number of triangles is then

,. Note that the same triangle

is counted twice for each of its three nodes. The time complexity of this procedureequals that of multiplying two matrices. A natural extension of this result deals withcounting larger cliques of size in a graph. More precisely, can this be accomplishedfaster than with an exhaustive enumeration which would require () time in densegraphs? Neetil and Poljak [] showed that clique detection problem can indeed besolved faster by reducing the original problem to detection of triangles in a graph with(/) nodes. Since we can detect triangles faster than in () with fast matrixmultiplication algorithms [, ], we can also detect cliques of size faster than().

A different approach exploits the relations between the numbers of occurrences ofinduced subgraphs in a graph. Kloks et al. [] presented a system of equations thatallows computing the number of occurrences of all six possible induced four-nodesubgraphs if we know the count of any of them. The time complexity of setting upthe system equals the time complexity of multiplying two square matrices of size .Kowaluk et al. [] generalized the result by Kloks to counting subgraph patterns ofarbitrary size. Their solution depends on the size of the independent set in the patterngraph and fast matrix multiplication techniques.

It is interesting to note that the problem of counting all induced patterns is equiv-alent to counting all non-induced patterns. These counts are connected through asimple system of equations. However, counting a single pattern is easier in some casesof non-induced patterns. All induced patterns seem to be equally difficult to count[], while non-induced patterns with large independent sets seem to present an easierproblem. For example, counting a star graph with nodes can be achieved simply bysumming deg() over all nodes .

How distinctive are graphlets as a signature for topological structure of networks?A closely related problem is the Reconstruction conjecture [], which states that any

graph on at least three nodes is uniquely determined by the multiset of its vertex-deleted subgraphs. The conjecture has been verified for graphs on up to nodes byMcKay []. However, the frequency of small connected patterns (graphlets) does notuniquely define the graph itself. Fig. . shows the smallest (in terms of the numberof nodes and edges) example of two such graphs. The graphs are not isomorphic: thefirst one is planar while the second one contains , (a complete bipartite graph withthree nodes on each side), which is a known non-planar graph. However, they containthe same number of all -node graphlets. For example, they both contain edges(graphlet ), no triangles (graphlet ), star graphs (graphlet ), etc.

Figure .An example of two noniso-morphic graphs with equal-node graphlet counts.

. Methodology

Existing methods consist of a simple exhaustive enumeration or rely on fast matrixmultiplication techniques. The former are becoming too slow on larger networks andthe latter are unsuitable on sparse networks that typically arise in practical applications.Our research was focused on developing a new method that outperforms exhaustiveenumeration methods and is suitable for sparse networks.

We focus on developing a counting algorithm instead of trying to optimize enumer-ation approaches. No matter how optimized such enumeration approach is to avoidlisting the same item several times, it still has to list each of them at least once. Acounting approach can improve upon such methods by observing some patterns andby considering entire batches of items in a single operation, i.e. adding more than to the accumulated result in one operation.

A trivial example is that of counting even numbers in the range from 1 to . Iteratingover all numbers in the range and checking whether a number is odd or even is clearlya waste of time. The pattern is obvious: every other number is even, therefore there


are /2 of them. Things get slightly more complicated if were interested in numbersdivisible by or . There are /2 multiples of and /3 multiples of . But weover-counted those that are divisible by and , i.e. multiples of . Subtracting /6gives the correct answer. A generalization of this is the inclusion-exclusion principle.

Lets move on to counting problems in graphs. Suppose we want to count thenumber of triangles. One approach is to consider every edge in the graph and try tocount in how many triangles it occurs. We will obviously over-count the triangles.However, the good news is that we will over-count them exactly three times. Everytriangle will be considered once for each of its three edge. How can we efficientlycount triangles spanning over a given edge in the graph without actually consideringall triangles? A short answer is: not in a simple way. This is exactly the number ofcommon neighbours of edges endpoints. By adding the degrees of the endpoints wewould consider all common neighbours but also the non-common ones. We wouldobtain the number of triangles and the number of paths of length that contain thisedge.

Constructing an expression to count exactly that one pattern, which we set out tocount, turns out to be rather difficult. Instead we change our approach to constructingan entire system of such equations. As long as the number of these equations is largeenough and they are independent, we can solve the system to obtain the numbers wewant. Moreover, the system of equations must be efficient to set up.

. Scientific contributions

This section summarizes the principal scientific contributions. With each contributionI list the dissertation sections where the topic is discussed and the references to thepublications of which I was the first author. The listed references were published ininternationally renowned scientific journals and were thus internationally reviewedand discussed. All article were published in journals from the top half of the ScienceCitation Index (SCI) by their Impact Factor in at least one field of research.

The following scientific contributions are presented in this dissertation:

An orbit counting algorithm for four-node and five-node graphlets

I developed an Orbit counting algorithm (Orca) for counting graphlets and or-bits. The algorithm is capable of counting node- and edge-orbits of four- andfive-node graphlets. Its substantial advantage over existing graphlet counting

methods comes from a combinatorial approach to counting instead of enumer-ating.

The contribution is covered by chapters and that contain reformated versionsof journal papers [] and [].

A general algorithm for counting node and edge orbits

I generalized the approach for generating equations required by the Orca al-gorithm and proved that such equations can be derived for arbitrarily largegraphlets and are therefore not limited to sizes four and five.

The paper [] addressing this topic is included in chapter .

A dynamic orbit counting algorithm for changing networks

I adapted Orca for the dynamic setting with interleaved additions and removalsof edges and graphlet count queries on individual nodes. This dynamic algo-rithm can be used to generate random networks with desired graphlet counts.

Chapter contains our so far unpublished research results.

Overview

Overview T. Hoevar

This chapter gives an overview of the published journal papers that comprise this dis-sertation. The papers have been reformatted to fit the dissertation template and containsome minor corrections that were suggested by the committee members.

Recent development of high-throughput experimental techniques for detecting pro-tein interactions [, ] led to a rapid increase in the amount of available data andestablishment of large protein interaction databases [, ]. Consequently, there isalso an increasing need for more efficient algorithms for processing this data, which isusually presented in the form of networks. One of the concepts used in connectionwith PPI networks are graphlets. Counts of these small patterns have been used as asignature of local network topology to predict protein functions [] and as a statisticfor many other applications. The first paper presents our algorithm Orca that addressesthis problem.

Graphlet analysis has been recently extended from the concept of node-orbits toedge-orbits in an attempt to overcome the problem of node membership in severalfunctional groups simultaneously. By clustering edges instead of nodes (accordingto their graphlet signatures) the authors of [] were able to identify new pathogen-interacting proteins and hence new drug target candidates. The second paper ad-dresses the problem of counting edge-orbits by employing a similar approach as theone used in our initial algorithm Orca. We also present a graphlet counting libraryfor programming language R. In contrast with the first paper, this one focuses on theimplementation details that lead to observed speed-ups.

After improving node- and edge-orbit counting algorithms for - and -node graph-lets we turn to the problem of computing orbits counts of larger graphlets. Morespecifically, can the approach of our Orca algorithm be generalized to graphlets ofarbitrary size? This problem is of more theoretical interest because the number ofgraphlets grows rapidly with their size (Table .). -dimensional node descriptionvectors would be quite sparse and not too informative for analyzing networks. Besides,-node graphlets would cover most of the network anyway, contradicting the idea ofgraphlet frequencies as a measure of local topology.

Some of our yet unpublished work on adapting the algorithm to a dynamic settingof the problem and its use for generating graphs with desired graphlet frequencies ispresented in Chapter . Some networks exhibit dynamic properties where new connec-tions are being added and old ones removed. Because each connection change affectsonly its local neighbourhood, we can efficiently update algorithms internal structures.

Table .Number of -node graphlets. See also http://oeis.org/A001349.

graphlets

We developed a dynamic version of Orca for counting -node graphlets. The algo-rithm supports additions and removals of edges that can be interleaved with graphletcounting queries for a specific node. The addition and removal operations efficientlyupdate the internal data structures. Queries can then use these values to set up thesystem of equations and finally solve it to obtain the actual orbit counts.

The dynamic algorithm was developed with a specific application in mindhowto generate a random graph that closely approximates given graphlet frequencies. Un-fortunately, the developed dynamic version of Orca was not fast enough for our ap-proach to generating random graphs and was optimized even further for maintainingthe graphlet counts in networks after every modification (addition or removal of anedge).

. A combinatorial approach to graphlet counting

In this paper we present our graphlet counting algorithm Orca. We focus on thepractical use of the algorithm in bioinformatics for counting graphlets in PPI networks.

The algorithm exploits relations between orbit counts for efficient computation.Systems of equations by Kloks [] and Kowaluk [] rely on fast matrix multiplicationtechniques and are therefore not feasible on sparse networks. Our results show thatwe can systematically design a different system of equations for four-node and five-node graphlets with the purpose of efficient graphlet counting in sparse graphs. Oursystem of equations also relates orbit counts as opposed to only graphlet counts. Theproposed system of equations is obtained through observing possible extensions ofsmaller patterns with another node. To make use of these relations we have to selectand efficiently enumerate a single orbit while the other orbit counts can be computedfrom the previously derived relations. For this, we focus on cliques because there arevery few of them in sparse networks and we can efficiently enumerate them with someknown enumeration algorithms [, ].

We evaluated our algorithm by comparing it with existing methods on PPI net-
http://oeis.org/A001349

Overview T. Hoevar

works. The algorithm exhibits a speed-up proportional to the maximum degree of anode in the graph, which coincides with the theoretical estimate of its time complex-ity. It is - times faster on the larger PPI networks and as much as timesfaster on the largest human PPI network that we used in our experiments.

A reformated version of the paper [] is inserted as Chapter .

. Computation of graphlet orbits for nodes and edges in sparsegraphs

This paper presents a package orca for programming language R that is capable ofefficiently counting node-orbits and edge-orbits of -node and -node graphlets. Notethat we could reduce the problem of counting edge-orbits to counting node-orbits inline graphs. However, the obtained line graphs would be much larger in terms of thenumber of nodes in the network as well as in the size of the patterns.

We derive edge-orbit relations using a similar technique as for deriving node-orbitrelations. For every edge-orbit of a graphlet we are able to select a special node suchthat we can list all patterns without this special node and for each of them count thenumber of missing special nodes that exist in the host graph. In contrast to countingnode-orbits, this special node should not coincide with the endpoints of the edge.

We provide a system of equations that can be employed in a graphlet countingalgorithm and show how the obtained system translates to the implementation of thealgorithm. The expected speed of counting edge-orbits is the same as that for countingnode-orbits. The paper concludes with a demonstration of the use of the developedpackage. The demo finds related concepts or entities in the Wikipedia for Schoolsnetwork.


. Combinatorial algorithm for counting small induced graphsand orbits

After a successful development of Orca for -node and -node graphlets we generalizedour algorithm to counting orbits of graphlets of arbitrary size. Melckenbeeck et al. []attempted to do the same but overlooked a crucial requirement that the system shouldbe efficient to build from a given network in order to offer the promised speed-up.Their systems of equations, while correct, does not lead to an efficient algorithm.

We identified the necessary conditions for systematically constructing an efficientsystem of equations relating the orbit counts. Next, we proved that we can indeedconstruct equations satisfying these conditions for graphlets of arbitrary size. The paperalso contains an analysis of algorithms expected time complexity on random Erdos-Renyi graphs.


A combinatorial approach tographlet counting

A combinatorial approach to graphlet counting T. Hoevar

Bioinformatics

A combinatorial approach to graphlet counting

Toma Hoevar and Janez Demar

Faculty of Computer and Information Science, University of Ljubljana,SI- Ljubljana, Slovenia

. Abstract

Motivation: Small induced subgraphs called graphlets are emerging as a possible toolfor exploration of global and local structure of networks, and for analysis of roles ofindividual nodes. One of the obstacles to their wider use is the computational com-plexity of algorithms for their discovery and counting.Results: We propose a new, combinatorial method for counting graphlets and orbitsignatures of network nodes. The algorithm builds a system of equations that connectcounts of orbits from graphlets with up to nodes, which allows to compute all orbitcounts by enumerating just a single one. This reduces its practical time complexity insparse graphs by an order of magnitude as compared to the existing, pure enumeration-based algorithms.Availability: Source code is available freely athttp://www.biolab.si/supp/orca/orca.htmlContact: [email protected]

. Introduction

Following the advent of high-throughput methods more than a decade ago, analysisof complex network data has assumed the central role among computational methods
http://www.biolab.si/supp/orca/[email protected]

in bioinformatics. The huge size of such networks on one hand, and the computa-tional intractability of the related methods on the other, have spawned a number ofinnovative analytic approaches.

G9

15

16

17

G11

22

23

G10

18

20

21

19

G13

27

28

30

29

G12

25

26

24

G14

32

33

31

G15

34

G19

45

47

48

46

G16

35

38

37

36

G17

39

42

40

41

G18

43

44

G20

49

50

G21

51

52

53

G22

55

54

G23

56

58

57

G24

5960

61

G25

63

62

64

G27

68

69

G26

66

6765

G29

72

G28

70

71

Gra

ph

lets

on

fiv

e n

od

es

G0

0

Gra

ph

let

on

two

no

des

G1

1

2

G2

3

Gra

ph

lets

on

thre

e n

od

es

G6

10

11

9

G7

12

13

G8

14

G3

4

5

G4

6

7

G5

8

Gra

ph

lets

on

fou

r n

od

es

Figure .Graphlets with nodesand automorphism orbits.Notation follows []. Col-ors are chosen arbitrarily;nodes of the same colorbelong to the same orbitwithin that graphlet, e.g.both black nodes in belong to orbit .

Przulj et al. [] described an approach focused on small induced subgraphs calledgraphlets. Due to combinatorial explosion, such analysis is usually limited to the graphlets with nodes (Fig. .). The number of appearances of graphlets in thenetwork provides a description of the networks structural properties. On a local level,counting how many times a particular node participates in each kind of graphlet in-duced in the network gives a topological signature of the nodes neighbourhood rep-resented as a -dimensional vector.

For a finer description, the nodes of every graphlet are partitioned into orbits [],which are equivalence classes under the action of its automorphism group. Two nodesbelong to the same orbit if they map to each other in some isomorphic projectionof the graphlet onto itself. Nodes of graphlets on points are grouped into orbits shown by numbers and node colors in Figure .. For instance, the five nodesfrom belong to three different orbits, marked with different colors and numbers;the black (as well as the gray) nodes have symmetric positions in the graphlet and thusbelong to the same orbit ( for the black, for the gray), and the white node belongsto the orbit . By counting the number of times a node of a graph appears in eachorbit, the node can be described by a -dimensional vector of orbit counts, whichreflects its position with respect to the local structure and gives insight into its role in


the network.Existing methods for counting the graphlets and orbits are based on direct enumer-

ation: in order to count them, they need to find all their embeddings in the network.We propose a new method, Orbit Counting Algorithm (Orca), which reduces thetime complexity by an order of magnitude by computing the orbit counts using therelations between them and directly enumerating only smaller graphlets.

.. Motivation

Graphlets are used for different kinds of analyses in bioinformatics. Milenkovi andPrzulj [] designed a method for comparing node neighbourhoods based on graphletsand demonstrated that clusters of nodes in protein-protein interaction (PPI) networks,obtained with their graphlet-based distance measure, share common protein proper-ties. They showed how to use this approach to predict functions of proteins and theirmemberships in protein complexes, subcellular compartments and tissue expressions.Milenkovic et al. [] studied the relation between cancer genes and their networktopology. They examined several clustering methods based on a graphlet similaritymeasure and found a difference between the PPI network structure around the cancerand non-cancer genes. Around of the predicted cancer gene candidates have in-deed been validated in the literature. Similarly, cost functions for network alignmentthat are based on graphlet degree vectors show superior results in comparison withother state-of-the-art methods. In particular, Milenkovi et al. [] showed how align-ment between the PPI networks of S. cerevisiae, D. melanogaster and C. eleganswith thehuman PPI network can be used for identification of genes related to aging, which aredifficult to observe directly for humans due to our long lifespans. Milenkovi et al. []also applied graphlets to estimate nodes topological centrality. Their graphlet degreecentrality measure is based on graphlet degree vectors and captures density and com-plexity of a nodes extended neighbourhood. They showed that the genes participatingin key biological processes also reside in complex and dense parts of networks.

Hayes et al. [] argue that in order to understand the biological networks, we needto find the mathematical models describing their structure, even though this may notbe of direct predictive use. Przulj et al. [] used graphlet distributions to show thatgeometric graphs match the structure of protein-protein interaction (PPI) networksbetter than Erds-Rnyi and scale-free graph models. Using a number of large PPInetworks, Hayes et al. [] further showed that while the network structure may be

unstable in regions with low edge density, high density regions are suitable for networkcomparison using graphlet degree distributions.

Graphlets can also assist in other analytic methods, such as global network align-ment. GRAAL [] is an algorithm for aligning arbitrary networks based solely ontheir topology, which employs a local topology similarity measure based on graphletdegree vectors. The technique was used to show the large amount of shared networktopology between yeast and human PPI networks, which can be used to predict bi-ological functions of aligned proteins or reconstruct phylogenetic trees. H-GRAAL[] aligns networks by reducing the problem to a weighted bipartite matching thatcan be solved with Hungarian algorithm. Finally, MI-GRAAL [] integrates multiplesources of node similarity information, including the graphlet degree vectors.

Solava et al. [] extended the use of graphlets by defining the orbits for graphlet edgesand demonstrated their use with a new clustering method that is not limited to locallysimilar edges and allows some overlap between clusters. As a practical result, theypredicted new pathogen-interacting proteins from clusters in the human PPI networkthat represent drug target candidates.

Graphlet analysis is therefore a useful tool for bioinformatics and with the increaseof available data there is also a growing need for fast graphlet counting tools.

.. Related work

We will denote the explored graph as = (, ). Let = || and = || be thenumber of nodes and edges, and let denote the maximal node degree. Let ()denote the set of nodes adjacent to node . In numbering the graphlets and orbits, wefollow Przulj []; we refer to the -th graphlet and -th orbit by and, respectively.

Counting subgraphs is a computationally intensive task. Common approaches tospeed it up include sampling [], exploiting pattern symmetries [] or using re-configurable hardware accelerators based on FPGA chips [].

The method described in this paper is related to the approach developed by Klokset al. [], who constructed a system of equations that allows computing the numberof occurrences of all six induced four-node subgraphs by knowing the count of anyof them. The time complexity of setting up the system equals the time complexity ofmultiplying two square matrices of size . We extend this approach to counting howmany times each node participates in each orbit. Our method also works on five-nodegraphlets and scales better on sparse graphs. Kowaluk et al. [] generalized the result


by Kloks et al. to count subgraph patterns of arbitrary size.There are several programs for graphlet counting and motif detection that are used

in bioinformatics. FANMOD [] is a network motif detection tool based on sam-pling random subgraphs and comparing their counts with those from random networkmodels. Besides implementing a novel sampling algorithm [] it also provides a fullenumeration procedure for graphlets on nodes. Whelan and Snmez [] de-veloped GraphletCounter, which works as a Cytoscape plugin and merges graphletanalysis with visual inspection of the network.

GraphCrunch [] is a tool for large network analysis. It includes a function forcomputing orbit signatures of every graph node for graphlets of up to five nodes usingan enumeration procedure with correction for over-counting some of the graphlets.A well-organized enumeration method imposes constraints that eliminate the needfor isomorphism testing except for distinguishing between a few different graphlets;this is further accelerated by comparing the number of edges and individual nodedegrees. GraphCrunch has been extended with a new method for topological networkalignment and with comparison of the networks with some additional mathematicalmodels []. The graphlet counting procedure in the new version remained essentiallythe same.

Rapid Graphlet Enumerator (RAGE) [] takes a different approach to countingfour-node graphlets. Instead of counting the induced subgraphs directly, it recon-structs them from counts of non-induced subgraphs. For computing the latter, ituses specifically crafted methods for each of the possible subgraphs ( to inFig. .). The time complexity of counting non-induced cycles and complete graphsis(+), while counting other subgraphs requires(). Another bound, whichis also more suitable for comparison with our method, is( ) = ( ). UnlikeFANMOD and GraphCrunch, RAGE works only for up to four-node graphlets.

. Methods

Let represent a certain node of interest in graph . Our task is to compute thenumber of times, , that appears in each orbit across all graphlets induced in. We will present an approach based on a system of linear equations that relate theorbit counts . The rank of the system is smaller than the number of orbits by one,so we can compute all values of from directly enumerating only a single one. Thealgorithm allows to compute the orbits for all points in a graph in time that is smaller

than the existing, direct enumeration approaches by an order of magnitude.We will first show how to construct a system of equations for four-node graphlets.

As for the single orbit that must be enumerated, we chose, which represents nodesof the complete graph, (or ); we show an efficient way to enumerate it. Theapproach used for four-node graphlets is less suitable for larger graphlets, so we presenta different technique for five-node graphlets.

.. Orbits in four-node graphlets

Right-hand sides of equations we are about to construct contain terms that are com-puted from the graph . Let (, ) = |()()| denote the number of commonneighbours of nodes and . Let (, ) denote the number of paths on three nodesthat start at node , continue with and end with some node , which is not connectedto . We can compute (, ) as (, ) = () 1 (, ).

If some node participates in a -node graphlet , it also participates in some( 1)-node graphlet . This can be seen by removing one of the graphlets nodesthat are the farthest away from . The subgraph induced by the remaining nodes isconnected (otherwise a whole component of the resulting graph would be farther from), so it is isomorphic to some ( 1)-node graphlet .

We will use this observation in reverse: every four-node graphlet can be constructedby adding a node to some three-node graphlet(s). To find the relations between countsof orbits in four-node graphlets for a certain node , we enumerate all three-nodegraphlets touching the node and count their possible extensions with the fourth node.

An example is shown in Figure .. Nodes , and induce graphlet , a pathon three nodes; we will observe its extensions to four-node graphlets with the fourthnode, , connected to and (dashed lines). The number of such nodes is (, ).In our example, there are (, ) = 3 such nodes, which we marked by , and (Fig. .(a)). The edge (, ) might exist in the graph (as in the case of ,the dotted line) or not (as for and ). With no edge, nodes , , and forma paw () with in orbit (Fig. .(b)). With an edge between and , theyform a diamond () with in orbit (Fig. .(c)). Since all (, ) nodes in() () must participate either in or , which puts in or , thisgives + = (, ) for the particular triplet , and .

We sum this over all possible three-node paths starting at . Summation must ac-count for symmetries: each graphlet appearing in the graph is counted twice with


Figure .Relation between orbits and . Solid linesare edges in the three-nodegraphlet being extended.Dashed lines exist bydefinition: (or ) arethe common neighbours of and . Dotted lines areoptional edges that makethe resulting four-nodegraphlet on , , and isomorphic to or .

x

y

z

w1

w2

w3

c(y, z) = 3

(a)

x

y

z

w10

10

11

9

G6

(b)

x

y

z

w

12

12

1313

G7

(c)

roles of and reversed, and is counted twice with reversed roles of and .Accounting for this, we get

2 + 2 = , ,()[{,,}]

(, ),

where denotes graph isomorphism (e.g., [{, , }], a subgraph on nodes , , , isisomorphic to , a path with three nodes).

For a different example, we will relate orbits and . We will extend a pathon nodes , and with another path that starts with nodes and ; we denotedthe number of such paths by (, ) (Fig. .(a)). Depending on whether the newnode is adjacent to , the extended graphlet is either a claw (, Fig. .(b)) or a paw(, Fig. .(c)). After accounting for symmetries and subtracting since (, ) alsocovers the case when = , we get

2 + 2 = , ,()[{,,}]

, 1 .

There are only two three-node graphlets and relatively few possible extensions. In-vestigating all possibilities in a similar manner yields linearly independent equationswith variables that correspond to counts of orbits in four-node graphlets (see theSupplementary).

Right-hand sides depend on the graph and need to be computed for each point. To accelerate their computation, we precompute values of (, ) and (, ). In all

x

y

z

w1

w2

w3

p(x, y) - 1 = 4

w4

(a)

6

G4

x

y

z

w

6

76

(b)

G6

x

y

zw

9

11

1010

(c)

Figure .Relation between orbits and . Edges are markedlike in Figure ..

equations, except for the last one, (, ) is computed on pairs of nodes (, ) that areconnected; in (, ), they are connected by the definition of . It therefore sufficesto precompute (, ) and (, ) only for all pairs of adjacent nodes and , whichrequires() space. The last equation, in which the new node closes a cycle, is treatedseparately. Nodes and are not adjacent but we can pre-compute the number ofpaths of length that start at node and end at node . This requires () spacefor each point; since we compute orbits for one point at a time, this memory can berecycled. Altogether, all lookups in the sums on the right-hand sides can be done inconstant time by sacrificing the memory of size(+) for precomputed values (, )and (, ).

The total time complexity for computing all orbits for all nodes is(+), where() is the time needed to enumerate complete graphlets on four nodes. Below, wedescribe an algorithm that does this in(), yet the actual importance of this termdepends on the structure and density of the graph.

.. Counting complete graphlets

For every node we still have to determine the count of one of the orbits. Sincegraphs are usually sparse, a good candidate is the rare orbit , which represents thenodes of the complete graphlet on four nodes . Because of very few occurrences ofthis graphlet and its symmetricity, we can efficiently restrict the enumeration.

A straightforward way to count the complete graphlets of size that touch a givennode is to start with that node and in every step add a neighbour of the last addednode , while checking that the new node is also connected to all nodes before ,


Figure .Enumerating by addingone neighbour at a timeor by checking pairs ofneighbours. Dashed edgesare added by iteratingthrough neighbours, anddotted edges are checked inthe last step.

x1

x2

x3

x4

(a)

x1

x2

x3

x4

(b)

59

x1

x2

x3

x

y

G24

(a)

65

x1

x2

x3

x

y

G26

(b)

68

x1

x2

x3

x

y

G27

(c)

70

x1

x2

x3

x

y

G28

(d)

Figure .Computing orbit count ; figures show graphletsfor different edges between and other nodes, and theorbits of .

orbits in which the node of interest, , appears if we add edges between and othernodes in the graphlet.

Let be the node of interest, let be the node whose edges we observe and let , and be the other three nodes in that graphlet.

Figure . illustrates counting of appearances of in , which belongs to (Fig. .(a)). We will focus on the node marked by , which is connected to the nodesmarked by and . Note that removing reduces into a diamond, , with in orbit .

Now assume that we are computing orbits for a certain node and discover someinduced subgraph with in . We assign labels , and to theremaining nodes as shown in the figure. Altogether, the graph contains (, )common neighbours of and (similar to nodes marked with in Fig..(a)).While all these nodes are by definition of (, ) connected to and , someare also connected to or , or both. Figure . shows all four possibilities, whichgive graphlets , , and with in orbits , , and , respectively.Therefore, + + + = (, ) 1, where denote orbits of withrespect to .

For the relation between , , and for the entire graph, we sum this overall possible induced with in . After considering the symmetries that causecounting the same graphlet multiple times with different assignments of , , and, we get

+ 4 + 2 + 6 = , ,


Condition < (under some arbitrary ordering of nodes) is needed to consider eachgraphlet just once. The other two conditions put in. The second term in thesum, (, ), accounts for the case in which the roles of and are exchanged.

Using a similar construction for other orbits, except for , gives linear equa-tions for orbits (see the Supplementary). Like for four-node graphlets, we directlyenumerate the orbit , which belongs to the complete graphlet. Equations are lin-early independent due to the way in which they were constructed: each equation isset up with one orbit in mind (e.g. in the above example) and the other orbitsthat appear in the equation belong to graphlets with a larger number of edges (the ad-ditional edges between and the other nodes, like the dotted edges in Fig. .(b-d)).Additional nice consequence besides independence is that the system is easy to solvesince orbit counts can be computed from those belonging to graphlets with more edgestowards those with less.

When constructing the equations, we choose that allows for efficient computationof the right-hand sides: we will ensure that the right-hand sides contain only the nodedegrees and the numbers of common neighbours of pairs and of connected triplets((, ), (, , )). This will allow us to precompute and store the values of (, ) and(, , ) for all pairs and connected triplets in before computing the orbit countsfor individual nodes.

First, we choose the node so that the remaining nodes constitute a four-nodegraphlet, i.e. removing does not break the graphlet into disconnected components,which would require enumeration of disconnected subgraphs. Second, the node hasto have at most three edges to avoid the need to compute the number of commonneighbours of four points, (, , , ). Besides, when has three neighbours, theyneed to induce a connected subgraph.

A node that fulfils these criteria exists for all orbits except . Precomputingthe values (, , ) for all connected triplets takes ( ) time, and storing them ina hash table takes ( ) space. Computation of the right hand-sides also requiresenumerating all the four-nodes graphlets, which again has a complexity of ( ).

The total time required to compute all orbit counts for all is then(+)with ( ) space, where () is the time required to enumerate all complete -node graphlets (). The algorithm thus has the same upper bound complexity as theexisting algorithms,( ). Experiments however show that the bound is not tight:the contribution of the () is negligible over the range of sensible graph densities,

Table .Statistics of benchmark real-world networks.

max.network nodes edges degreeS. cerevisiae E. coli D. melanogaster Human Internet Autonomous Systems

and the actual running times are smaller by an order of magnitude.We could use the same technique to construct systems of equations for larger graph-

lets. Note however that we reduced the running times by imposing some conditionsto the selection of the node . We have not researched whether such nodes also existfor larger graphlets; although theoretically interesting, this may be of little practicaluse in the context of bioinformatics.

. Results and Discussion

We compared the speed of Orca with RAGE, GraphCrunch and FANMOD. We ranall experiments on a modest desktop computer (Intel Core , . GHz). We have notexperimented with parallel execution; all four algorithms allow for trivial distributionof work on multiple cores, so the benefits of parallelization should be the same for all.

We compared the performance of methods on the three largest species-specific PPInetworks from the July update of the Database of Interacting Proteins [] andthe human PPI network from the BioGRID [] .. release. The sizes of individ-ual datasets are presented in Table ..

All algorithms except the significantly slower FANMOD counted orbits for four-node graphlets in the smaller graphs in a few seconds (Table .). Five-node graphletspresent a more difficult task: running GraphCrunch on the S. cerevisiae PPI networktook more than minutes (as compared to . seconds for four-node graphlets). FAN-MOD was almost times slower while Orca finished the same task times faster,in . seconds. RAGE is limited to four-node graphlets. We got similar results for theother two networks.

In the larger Human network, Orca counted the four-node graphlets and


Table .Comparison of algorithms on real-world networks. We aborted the algorithms that took more than a day.

four-node graphletsnetwork FANMOD GraphCrunch RAGE OrcaS. cerevisiae s . s . s < . sE. coli s . s . s < . sD. melanogaster s . s . s < . sHuman / min . min . sInternet Autonomous Systems min min . min . s

five-node graphletsnetwork FANMOD GraphCrunch OrcaS. cerevisiae min . min . sE. coli min . min . sD. melanogaster min . min . sHuman / / minInternet Autonomous Systems / / min

times faster than Rage and GraphCrunch, respectively; we aborted FANMOD after hours. Orca was also the only algorithm capable of counting five-node graphlets ina human PPI network in less than a day.

For comparison with RAGE, we included a test network of Internet AutonomousSystems that was used as the benchmark for RAGE []. FANMOD required over hours, GraphCrunch finished in minutes, RAGE in minutes and Orca in .seconds. Orca finished the computation for five-node graphlets in minutes, whilethe other two algorithms were stopped after hours.

The time that Orca needs for counting orbits in five-node graphlets are compa-rable to those that GraphCrunch needs for four-node graphlets. This is consistentwith the way the two algorithms are constructed: GraphCrunch enumerates four-nodegraphlets to count them, while Orca enumerates them to count five-node graphlets.As expected, the time needed for enumeration of complete five-node graphlets is neg-ligible at these network densities.

http://www.netdimes.org/PublicData/csv/ASEdges_.csv.gz

edges [thousands]

time

[sec

onds

]

Erds-Rnyi

edges [thousands]

Geometric

edges [thousands]

Barabsi-Albert

GraphCrunchRAGEOrca

Figure .Comparison of timesneeded for counting orbitsin four-node graphlets inrandom networks. Graphsare cut off at one minute;results of experiments inwhich the methods wereallowed to run for up toone hour are available inthe supplementary.

edges [thousands]

time

[sec

onds

]

Erds-Rnyi

edges [thousands]

Geometric

edges [thousands]

Barabsi-Albert

GraphCrunchOrca

Figure .Comparison of timesneeded for counting orbitsin five-node graphlets inrandom networks

For more insight into time complexities of the compared algorithms, we tested themon synthetic data using three different random network models Erds-Rnyi, geo-metric and Barabsi-Albert random graphs. Erds-Rnyi graphs are constructed byrandomly connecting pairs of nodes. We generated geometric graphs by randomlyplacing nodes in a -dimensional unit cube and connecting the closest pairs; geo-metric graphs show largest resemblance to protein interaction networks []. Barabsi-Albert preferential attachment model generates scale-free networks that exhibit hubsand individual highly connected nodes.

We explored the performance of GraphCrunch, RAGE and Orca at different net-work densities. FANMOD was not included as it consistently finished previous tests


far behind GraphCrunch. All graphs had nodes; for each method, we increasedthe graph density until the method needed more than a minute to complete the test.The corresponding graphs were relatively dense, containing up to of all possibleedges for test with four-node graphlets and around for five-node graphlets.

RAGE counted the four-node graphlets slightly faster than GraphCrunch but theywere both significantly outperformed by Orca (Fig. . and Tables .. in the sup-plementary). We observed similar results when counting five-node graphlets (Fig. .).Orca achieved the highest gain in comparison with other methods on Barabsi-Albertmodels, in which hubs present a large obstacle for GraphCrunch and RAGE. Thismakes Orca more suitable for real-world networks, which often display the small-worldproperty and contain hubs.

. Conclusion

Graphlet-based network analysis is useful for various tasks in bioinformatics, such asalignment of PPI networks and prediction of protein functions based on topologicalsimilarities. Past studies used these approaches to, for instance, identify genes relatedto cancer [] and to aging [].

We presented a new algorithm for counting graphlet orbits that is based on derivedrelations between orbit counts. To count the orbits for -node graphlets, it enumerates( 1)-node graphlets and a single -node graphlet. Empirical results confirm thatthis decreases the time complexity by an order of magnitude in comparison with otherknown methods. In practical terms, the algorithm counts orbits in large PPI networks- times faster than other state-of-the-art algorithms.

. Supplementary

.. Results on random networks

Counting -node graphlets

Table .Results of counting -node graphlets in random ER networks [s]. Missing values indicate running times over hour.

edges [thousands]

GraphCrunch . . . . . . /RAGE . . . . . . / /Orca . . . . . . . . . .

Table .Results of counting -node graphlets in random GEO networks [s]. Missing values indicate running times over hour.

edges [thousands]

GraphCrunch . . . . . /RAGE . . . . . / /Orca . . . . . . . . .

Table .Results of counting -node graphlets in random BA networks [s]. Missing values indicate running times over hour.

edges [thousands]

GraphCrunch . . . . . /RAGE . . . . . / /Orca . . . . . . . . .


Counting -node graphlets

Table .Results of counting -node graphlets in random ER networks [s].

edges [thousands]

GraphCrunch . . . . . . Orca . . . . . . . . . . .

Table .Results of counting -node graphlets in random GEO networks [s].

edges [thousands]

GraphCrunch . . . . . . Orca . . . . . . . . . .

Table .Results of counting -node graphlets in random BA networks [s].

edges [thousands]

GraphCrunch . . . . Orca . . . . . . .

.. Log-scale graphs

.

edges [thousands]

time

[sec

onds

]

Erds-Rnyi

.

edges [thousands]

Geometric

.

edges [thousands]

Barabsi-Albert

GraphCrunchRAGEOrca

Figure .Log-scale comparison oftimes needed for count-ing orbits of four-nodegraphlets in random net-works.

.

edges [thousands]

time

[sec

onds

]

Erds-Rnyi

.

edges [thousands]

Geometric

.

edges [thousands]

Barabsi-Albert

GraphCrunchOrca

Figure .Log-scale comparison oftimes needed for count-ing orbits of five-nodegraphlets in random net-works.

Computation of graphlet orbitsfor nodes and edges in sparse

graphs

Computation of graphlet orbits for nodes and edges in sparse graphs T. Hoevar

Journal of Statistical Software

Computation of Graphlet Orbits for Nodes and Edges inSparse Graphs

Toma Hoevar, Janez Demar

University of Ljubljana

. Abstract

Graphlet analysis is a useful tool for describing local network topology around indi-vidual nodes or edges. A node or an edge can be described by a vector containing thecounts of different kinds of graphlets (small induced subgraphs) in which it appears, orthe roles (orbits) it has within these graphlets. We implemented an R package withfunctions for fast computation of such counts on sparse graphs. Instead of enumerat-ing all induced graphlets, our algorithm is based on the derived relations between thecounts, which decreases the time complexity by an order of magnitude in comparisonwith past approaches.Keywords: network analysis, graphlets, data mining, bioinformatics

. Introduction

Analysis of networks plays a prominent role in many areas of science and business, fromgenetic and protein networks in bioinformatics to social networks in mining user data.Describing the roles of individual nodes and edges, clustering them, and predictingtheir future development requires observing their locally defined properties. One ofthe methods used particularly in bioinformatics is based on counting graphlets andgraphlet orbits.

G0

0

G1

1

2

G2

3

G9

15

16

17

G11

22

23

G10

18

20

21

19

G13

27

28

3029

G12

25

26

24

G14

32

33

31

G15

34

G19

45

47

48

46

G16

35

38

37

36

G17

39

42

40

41

G18

43

44

G20

49

50

G21

51

52

53

G22

55

54

G23

57

58

56

G24

5960

61

G25

63

6264

G27

68

69

G26

66

6765

G29

72

G28

70

71

G6

10

11

9

G7

12

13

G8

14

G3

4

5

G4

6

7

G5

8

(a) Node orbits.

G1

0

G2

1

G6

78

6

G7

109

G8

G9

1213

G11

17

G10

1416

15

G13

2122

2423

G12

20

19

18

G14

2627

25

G15

28

G19

41

G16

29

3031

G17

32

34

35

33

49

51

50

38

40

39

G18

36

37

G20

42

G21

43

4645

44

G22

4847

G23 G24

60

52

53

55

G25

57

62

56

58

G27

6364

G26

61

59

G29

67

G28

6566

G3

23

G4

4

G5

5

11

54

(b) Edge orbits.

Figure .Graphlets with nodeswith enumeration of or-bits. Node colors and lineshapes, which are chosenarbitrarily, correspond toorbits within each graphlet.Node orbits are enumer-ated as in []; edge orbitnumbers are enumeratedby increasing orbits of thecorresponding node pairs.

Graphlets are small connected simple graphs []. There are different graphlets withtwo to four nodes and graphlets with up to five nodes. In graphlet-based networkanalysis, we examine induced graphlets within the network: for each node, we count


the number of times the node is contained in an induced graphlet of each kind, whichgives a - or -dimensional vector description of the local topology surrounding theobserved node. One of the vector components represents, for instance, the number oftimes the node is included in an induced star on five nodes.

Furthermore, we can group nodes of each graphlet into orbits [] with respect tothe graphlet automorphisms (Figure .(a)). Orbits define the roles of the nodeswithin the graphlet. For instance, in a star on five nodes (), one node representsthe center and the remaining four nodes are the leaves; the nodes of the star thus formtwo different orbits (numbered and , respectively). Instead of counting only thenumber of appearances of induced stars that touch an observed node in the network,we can count how many times the node represents the center of such star (i.e., the nodeis connected to four nodes that are not connected to each other) and how many timesit has the role of a leaf (it is connected to a node that is connected to another threenodes that are disconnected from each other and from the observed node). This givesa finer description of the nodes vicinity with a -dimensional vector for four-nodegraphlets and a -dimensional vector for five node graphlets.

Similar can be done for edges []: there are edge orbits for graphlets with nodes, which allow for a characterization of an edge with a -dimensional vector(Figure .(b)).

Figure . gives an illustration for a small network. Figure .(a) shows the network,and figures .(b), .(c) and .(d) show all four-node subgraphs that include node; node appears in orbits , and . Table .(e) shows all orbit counts for node, including those belonging to two- and tree-node subgraphs. Table .(f ) showsthe orbit counts for all nodes and orbits. The vector (row) corresponding to node isquite different from others (e.g., counts for orbits , , ), which indicates its specialplace in the graph. Signatures of and , on the other hand, are the same since thetwo nodes map to each other in an automorphism of the graph.

The straightforward computation of orbit counts by enumeration takes ()time, where is the number of nodes (typically thousands or tens of thousands), is the maximal node degree (usually up to one hundred), and is the graphlet size( or ). We have recently presented a combinatorial approach for counting orbits ofnodes [] in time that is, for practical purposes, proportional to . Using this

In general, two nodes will have the same signature for node graphlets if their local neighborhood of upto edges is the same.

CB

A

D

FE

(a) Example network.

C D

FE

CB

FE

C

A

FE

(b) Subgraphs with node in orbit .

C

A

D

E

CB D

E

(c) Subgraphs with node in orbit .

CB

A

E

CB

A

D

(d) Subgraphs with node in orbit .

Orbit Count The count of orbit equals the degree of the node The terminal node of a path on three nodes ({, , }) The middle node of a path on three nodes

({, ,}, {, , }, {, ,}, {, , }, {, , }) Triangle ({, , }) The inner node of a path on four nodes (Fig. b) The central node of a -edge star (Fig. c) The central node of the , lollipop graph (Fig. d)

(e) Non-zero orbit counts for node .

Orbit

(f ) Orbit counts for all nodes.Figure .Illustration of orbit countsfor a simple network.


technique, the common-size networks from proteomics can be analyzed in a reasonabletime of a few hours on a common desktop computer. In this paper, we provide the firstcomplete description of the algorithm, including its novel extension to counting edgeorbits (Section ), and then document the corresponding R package together with twousage examples (Section ).

The notation used throughout the paper is summarized in Table ..

Table .Notation used in the paper.

= (, ) the observed graph with nodes and edges number of nodes in the graph number of edges maximal node degree graphlet size node orbit

() (or ) the number of times that node appears in orbit ;we use to reduce the clutter where possible

edge orbit () (or ) the number of times that edge appears in orbit

(, , ) the set of common neighbours of nodes , , , (, , ) the number of common neighbours of nodes , , ,

[{, , }] a subgraph of on nodes , , , symbol denotes graph isomorphism

. Combinatorial approach to orbit counting

Let = (, ) be a simple graph with vertices () and edges (). We assumethat the graph is sparse ( = ()). We will denote graphlets as and node orbitsas . We follow the enumeration by Przulj [] (see Figure .(a)), in which the orbitnumbers are assigned somewhat arbitrarily but with the constraint that the indices oforbits belonging to the graphlets with fewer edges are smaller than those belongingto the graphlets with more edges. We will use to denote edge orbits, which weenumerate as shown in Figure .(b). Here we decided to ignore the pre-existing

enumeration by Solava et al. [] and define a more consistent one in which the edgeorbits are ordered by the orbits of the corresponding nodes.

The task is to count the number of times a node appears in each orbit , or thenumber of times an edge appears in orbit . We will denote the two numbers by() and (); where possible, we will omit or and write only and . Thealgorithm computes the counts for all graph nodes (or edges). Computation for just afew nodes can be done faster using a brute force approach (exhaustive enumeration).

Past approaches such as that in GraphCrunch [] and RAGE (Rapid graphletenumerator) [] are based on exhaustive enumeration of induced subgraphs. Theirtheoretical and empirical complexity of enumerating graphlets of size is (),where is the maximal node degree in the graph. Our approach builds on the workof Kloks et al. [], who constructed a system of equations for counting induced sub-graphs with four-nodes, and Kowaluk et al. [], who generalized it for larger sub-graphs. We use a similar principle to count orbits; besides, our approach scales betterfor sparse graphs. In comparison with enumeration-based algorithms, the combina-torial approach decreases the practical time complexity by the factor of by directlyenumerating only the graphlets of size 1 and using them to compute the countsfor graphlets of size .

.. Node orbits

We shall demonstrate the basic idea with an example.Let be a node in the graph . () represents the number of times appears in

orbit , that is, the number of ways in which can be embedded in so that is in orbit . To reduce the clutter, we shall omit and denote this by . Counts, and are defined similarly. We will show that the following relation holdsfor any :

+ 3 + 2 + 2 = ,, [{,,,}]


Figure .Derivation of the relationbetween , , and . Solid lines belong tographlet , dashed linesrepresent the requirededges (as described inSection ..) and dottedlines represent additionaledges whose presence orabsence determines theorbit of the node . Graylines represent edges toother nodes of .

x

u

v

9

t

(a)

x

u

v

w

t

45

(b)

56

w

x

u

v t

(c)

62

w

xu

vt

(d)

65

w

x

u v

t

(e)

left-hand side of the equation is a linear combination of orbit counts that we wish tocompute and the right-hand side is a statistics that is easy to obtain.

Equation . can be constructed as follows. Let the subgraph on some nodes ,, and be isomorphic to with in (Figure .(a)). Now we observe thepossible extensions of [{, , , }] with a node that is attached to and .For each such (, ), the subgraph on , , , , is isomorphic

to , if is not connected to and (Fig. .(b)), or

to , if is connected to , but not to (Fig. .(c)), or

to , if is connected to , but not to (Fig. .(d)), or

to , if is connected to and (Fig. .(e)).

This puts in orbits , , or , respectively. Therefore,

+ + + = |(, )| 1 = (, ) 1, (.)

where represent orbit counts considering only these particular nodes (and annota-tions) , , , and all common neighbours of and , (, ). The term 1 is neededsince one of the members of (, ) is also .

Equation . relates the total orbit counts , , and for a fixed node .We construct it by summing the right-hand side of (.), (, ) 1, over all triplets

This description is intended to present the reasoning behind the relations, while the actual constructionwas slightly different in order to obtain a useful system of equations. Details are given in Section ...

{, , } such that [{, , , }] and , () (to put in withinthis subgraph) and with < under some arbitrary ordering of nodes, that is,

,, [{,,,}]


Figure .Derivation of the relationbetween , and . Solid lines belong tographlet , dashed linesrepresent the requirededges (since is definedto span over the edge(, )) and dotted linesrepresent additional edgeswhose presence or absencedetermines the orbit ofthe edge (, ). Gray linesrepresent edges to othernodes of .

5

x

u

y

v

(a)

43

x

u

y

vw

(b)

w

56

x

u

y

vw

56

x

u

y

v

(c) with (, ) and (, )

w

63

x

u

y

v

(d)

The sum runs over all induced subgraphs in that put node in orbit in .Nodes and are its neighbours and is the node opposite of in . We obtainthe same graphlet if we attach a new node either to or to , and there are () 2and () 2 such possibilities. There are also three optional edges (to , and or ),which decide the orbit of in the extended graphlet; the resulting orbit can be ,, , , , , or .

.. Edge orbits

Relations for edge orbits are derived in the same way. For instance, let (, ) representan edge of a square (graphlet ). Let us label the remaining two nodes with ()\{} and ()\{} (Figure .(a)). Extending this pattern with a node thatspans over the edge (, ) leads to three possible graphlets and hence three differentedge orbits of the edge (, ) (Figure . (b-d)).

Orbit arises when is adjacent to either or , while the other two orbits, and can only arise in one way. Hence the relation between the orbits is

+ 2 + = , [{,,,}]

(,) () ()

(, ). (.)

.. System of equations

We constructed the equations similar to those above to relate each orbit with orbitsfrom graphlets with a larger number of edges. Complete lists of equations are providedin the appendices.

For instance, Equation . was constructed specifically to relate with higherorbits. The actual construction of equation goes in the opposite direction from thatpresented in the introductory example. We started with the graphlet , to whichthe orbit belongs and picked one of the nodes (labelled in the above case). Weassumed that is adjacent to and , and examined the graphlets in which may bealso adjacent to and/or . This ensures that the equation that is set up with inmind relates with orbits with higher indices since these graphlets have more edgesthan . As a consequence, the resulting system of equations is triangular, and thusindependent and also easy to solve by going backwards from the higher orbits (startingwith or , which belong to complete graphs) towards lower orbits.

We impose several constraints on selection of . Node can not coincide with ,or with or when computing orbits of edge (, ). We further require that removalof does not break the remaining nodes into disconnected subgraphs. Node musthave at most 2 neighbours; when it does have 2 neighbours, they must beconnected. This allows for more time- and space-efficient computations of orbits, asdescribed in Section ... Existence of such nodes for each orbit of four- and five-node graphlets can be proven by exhaustive search, with exception of , which ishandled as a special case.

All equations have the following general form:

+ + ... + = []

()

(() + () + + () + ) , (.)

where cond() is a set of conditions that constrain the embedding of[] into andassign labels to nodes. For instance, in Equation ., condition , () assigns thelabels and to the nodes in orbit and < ensures that the same quadruplet ofnodes is not counted twice.

The sum runs over subgraphs [] isomorphic to some graphlet on 1 nodes,that is, over some three-node graphlet when computing the orbits in four-node graph-lets, or over some four-node graphlet when computing orbits in five-node graphlets.The subgraph must include , and the conditions in the sum put into some fixedorbit. Additional conditions may impose ordering on the remaining nodes of thegraphlet to decrease the number of symmetries.

The terms in the sum are the number of common neighbours of some subsets of


nodes in the subgraph ( ). The number of such terms is between and .The size of is also between , that is, the terms refer to node degrees and to thenumber of common neighbours of pairs and triplets of nodes. The criteria for thechoice of node , which are described above, ensure that these terms can be efficientlycomputed using some precomputed data as described in the following section.

The left-hand side is a fixed linear combination of orbits to which the node evolvesafter extending with another node connected to one of subsets . The coefficientsreflect the symmetries in the graphlets with regard to node assignments.

We prepared a system of equations that relate the node orbits of four-nodegraphlets, and a system of equations that relate the node orbits of the five-nodegraphlets. Likewise, we have constructed equations that relate the edge orbitsfor four-node graphlets and equations for edge orbits on five-node graphlets.By selecting different nodes , we have empirically verified that it is impossible toconstruct a full-rank system using our approach and the constraints we put on .

Due to the ranks deficiency, one of the orbits must be enumerated directly. Themost suitable candidates are the orbits belonging to complete graphlets ( and for nodes, and and for edges). First, this allows for a straightforward com-putation of the orbits since the system is triangular so that lower orbits are computedfrom the higher. Second, since we assume that the graphs are sparse, we can efficientlycompute these orbits by using an enumeration method similar to the Bron-Kerboschmaximal clique enumeration algorithm [].

.. Algorithm

The algorithm consists of precomputation of some data, followed by computation oforbit counts for each node or edge.

. Precomputation:

Count the complete graphlets touched by each node or edge.

Count the common neighbours of every pair and of every three verticesthat form a connected graph.

. For each node or edge:

Compute the right-hand sides by enumeration of 1 node graphs usingthe precomputed data above.

Solve the system of linear equations.

Our implementation of the algorithm represents the graph with adjacency and in-cidence lists, which are appropriate for sparse graphs. If the graph has less than nodes, we also construct an adjacency matrix. The matrix, which uses bit per edgeand takes at most around MB, allows us to check for existence of edges betweenany given pair of nodes in constant time. Without it, the time complexity of thelookup for an edge between two nodes is proportional to the logarithm of the numberof neighbours of the node with smaller degree.

In the following, we will describe each step in more detail.

. Precomputation:

For each node, count the number of complete graphlets in which the nodeor edge participates. We build cliques of size from cliques of size 1 bymaintaining a set of candidate nodes that are adjacent to all nodes in thesmaller clique. This procedure is similar to the Bron-Kerbosch algorithmwith the difference that we are not interested in maximal cliques but in allcliques of a given size.

Although the theoretical upper bound of the time complexity of this stepis (), where is the maximal node degree, the actual contributionof this step to the total running time is negligible since complete subgraphs(cliques) in sparse networks are rare.

Compute and store the number of common neighbours for each pair ofadjacent vertices. This takes () time and () space. For computingthe orbits in five-node graphlet, we also compute the number of paths oflength between each pair of nodes for which such a path exists, and thenumber of common neighbours for all triplets of connected nodes. Thistakes () time and () space.

. For each node or edge:

Compute the right-hand sides of the system of the linear equations. Itsgeneral form is shown in Equation .. For four-node graphlets, the sumsrun over three-node paths or triangles in which the node appears. For five-node graphlets, they run over four-node graphlets that the node touches.


Right-hand sides

Counting small patterns in...

Documents

Transcript of Counting small patterns in...