
A General Framework for Estimating Graphlet Statistics via Random Walk

Xiaowei Chen1, Yongkun Li2, Pinghui Wang3, John C.S. Lui1
1The Chinese University of Hong Kong

2University of Science and Technology of China
3Xi'an Jiaotong University

1{xwchen, cslui}@cse.cuhk.edu.hk, [email protected], [email protected]

ABSTRACT
Graphlets are small induced subgraphs and have been frequently applied to characterize the local and global structure of graphs across various domains, e.g., online social networks (OSNs) and biological networks. Discovering and computing graphlet statistics are highly challenging. First, the massive size of real-world graphs makes the exact computation of graphlets extremely expensive. Secondly, the graph topology may not be readily available, so one has to resort to web crawling using the available application programming interfaces (APIs). In this work, we propose a general and novel framework to estimate graphlet statistics of "any size". Our framework is based on collecting samples through consecutive steps of a random walk. We derive an analytic bound on the needed sample size (via the Chernoff-Hoeffding technique) to guarantee the convergence of our unbiased estimator. To further improve the accuracy, we introduce two novel optimization techniques to reduce the lower bound on the sample size. Experimental evaluations demonstrate that our methods outperform the state-of-the-art method by up to two orders of magnitude in terms of both accuracy and time cost.

1. INTRODUCTION
Graphlets are defined as frequently recurring subgraph patterns in real-world networks [21, 27]. Unlike global properties such as the degree distribution, the frequencies of graphlets provide important statistics to characterize the local topological structure of networks. Decomposing networks into small k-node graphlets has been a fundamental approach to characterizing the local structure of many real-world complex networks. Graphlets also have numerous applications ranging from biology to network science. Applications in biology include protein detection [19], biological network comparison [26] and identifying disease genes [20]. In network science, researchers have applied graphlets to web spam detection [4], anomaly detection [3], social network structure analysis [31] and friendship recommendation. For the last of these, note that there are two "types" of 3-node graphlets: (1) a 3-node line subgraph or (2) a triangular subgraph. If 3-node line subgraphs occur with much higher frequency than triangular subgraphs in an OSN, then we know that there are more opportunities for us to make friendship recommendations.

Research Problem. In most applications, relative frequencies among various graphlets are sufficient. One example is the friendship recommendation in OSNs we have just mentioned. Another example is building graphlet kernels for large graph comparison [29]. In this work, we focus on the discovery and computation of relative graphlet frequencies. More specifically, we propose an efficient sampling method to compute the percentage of a specific k-node graphlet type among all k-node graphlets in a given graph. The percentage of a particular k-node graphlet type is called the "graphlet concentration" or "graphlet statistics".

Challenges. One naive approach to compute the graphlet concentration is exhaustive counting. However, there exists a large number of graphlets even in a moderately sized graph. For example, the Facebook graph [1] in our datasets, with 817K edges, has 9 × 10^9 4-node graphlets and 2 × 10^12 5-node graphlets. Due to this combinatorial explosion, how to count graphlets efficiently is a long-standing research problem. Techniques such as leveraging the parallelism provided by multi-core architectures [2], exploiting combinatorial relationships between graphlets [12], and employing distributed systems [30] have been applied to speed up graphlet counting. However, exhaustive counting algorithms are not scalable because they need to explore each graphlet at least once. Even with those highly tuned algorithms, exhaustive counting of graphlets has a prohibitive computation cost for real-world large graphs. An alternative approach is to adopt "sampling algorithms" to achieve a significant speedup with acceptable error. Several sampling-based methods have been proposed to address the challenge of graphlet counting [5, 7, 13, 28, 32–34].

Another challenge is the restricted access to the complete graph data. For example, most OSN service providers are unwilling to share their complete data for public use. The underlying network may only be accessible by calling application programming interfaces (APIs), which support functions such as retrieving the list of a user's friends. Graph sampling through crawling is widely used in this scenario to estimate graph properties such as the degree distribution [9, 15, 17], clustering coefficient [11] and graph size [14]. In particular, random walk-based sampling methods have been popular due to their simplicity of implementation. In this work, we assume that graph data have to be accessed externally, either through external disks or by calling APIs provided by the operators of OSNs.


The aim of this work is to design and implement an efficient random walk-based method to estimate graphlet statistics. Note that estimating graphlet statistics is a more complicated task than estimating other graph properties such as the degree distribution. For the degree distribution, one can randomly walk on the graph to collect node samples and then remove the bias. However, for graphlet statistics, one needs to consider the various local structures: a single node sample tells us nothing about the local structure. To estimate graphlet statistics with a random walk, one needs to map the random walk to a Markov chain, carefully defining the state space, the transition matrix and the random variable of the corresponding Markov chain, so as to ensure that the state space contains all the k-node graphlets and that the corresponding Markov chain has a unique stationary distribution.
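The re-weighting idea for degree distribution mentioned above can be made concrete with a short sketch. The code below is our own minimal illustration (the graph and function names are hypothetical, not from the paper): a simple random walk visits node v with stationary probability proportional to dv, so weighting each visited sample by 1/dv removes the degree bias.

```python
import random

def srw_mean_degree(adj, start, n_steps, seed=0):
    """Estimate the mean degree from a simple random walk.

    The SRW visits v with stationary probability d_v / (2|E|), so each
    sample is re-weighted by 1/d_v; the ratio of the re-weighted sums is
    an asymptotically unbiased estimate of the mean degree.
    """
    rng = random.Random(seed)
    v = start
    num = den = 0.0
    for _ in range(n_steps):
        w = 1.0 / len(adj[v])   # importance weight 1/d_v
        num += w * len(adj[v])  # re-weighted degree sample
        den += w
        v = rng.choice(adj[v])  # one SRW transition
    return num / den

# Toy connected graph with degrees 3, 2, 2, 1 -> true mean degree 2.0.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
est = srw_mean_degree(adj, 0, 20000)
```

With enough steps, `est` approaches the true mean degree even though high-degree nodes are visited more often.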

1.1 Related Work and Existing Problems
There are a handful of works on estimating graphlet statistics using random walks [5, 10, 33]. The method in [10] uses sampled ego networks to estimate graphlet frequencies, but it cannot be applied to all graphlet types. In [5] and [33], the authors adopt the concept of a subgraph relationship graph, where each node is a connected induced subgraph of the original graph, and the random walk is performed on these subgraph relationship graphs. Bhuiyan et al. [5] proposed GUISE, which is based on a Metropolis-Hastings random walk on a subgraph relationship graph whose nodes are all the 3-, 4-, and 5-node graphlets. Their method aims to estimate 3-, 4-, and 5-node graphlet statistics simultaneously. However, GUISE suffers from rejection of information-rich samples. In [33], the authors proposed three random walk-based methods: subgraph random walk (SRW), pairwise subgraph random walk (PSRW), and mix subgraph sampling (MSS). We denote the subgraph relationship graph in [33] as G(d), where d means that each node in G(d) is a d-node connected induced subgraph of the original graph. For k-node graphlets, SRW runs a random walk on G(k), while PSRW runs a random walk on G(k−1). MSS is an extension of PSRW that estimates (k−1)-, k-, and (k+1)-node graphlets jointly. Simulation results show that PSRW significantly outperforms SRW. To the best of our knowledge, PSRW is the state-of-the-art random walk-based method for estimating graphlet statistics.

One drawback of PSRW is its inefficiency in choosing neighbors during the random walk on G(k−1) when k ≥ 4. For example, to estimate 4-node graphlets, PSRW fixes the random walk on G(3), whose nodes are all 3-node graphlets in the original graph. Populating neighbors in G(3) is about an order of magnitude slower than choosing random neighbors of nodes in G(2). If one could estimate 4-node graphlets with a random walk on G(2), the time cost would be reduced dramatically. Furthermore, because PSRW is more accurate than SRW, we have reason to believe that a random walk-based method on G(d) with smaller d has the potential to achieve higher accuracy. Faster random walks and more accurate estimation motivate us to propose a more efficient sampling method with a random walk on G(d) to estimate k-node graphlets. Different from PSRW, we seek d smaller than k − 1.

1.2 Our contributions
Novel framework. In this paper, we propose a novel framework to estimate graphlet statistics. Our framework is provably correct and makes no assumptions on the graph. The main idea of our framework is to collect samples through a random walk on G(d) to estimate k-node graphlet statistics, where d can be any positive integer less than k; PSRW is just the special case where d = k − 1. We can construct the subgraph relationship graph G(d) on the fly, so there is no requirement to know the complete topology of the original graph in advance. In fact, one can view d as a parameter of our framework. As mentioned in [33], it is non-trivial to analyze and remove the sampling bias of a random walk on G(d) when d is less than k − 1, and the analysis method in PSRW cannot be applied to the situation where d < k − 1. Our work is not just a simple extension of PSRW; rather, we propose a new and general framework which subsumes PSRW as a special case. When choosing an appropriate parameter d, our methods significantly outperform the state-of-the-art method.
Efficient optimization techniques. We also introduce two novel optimization techniques to further improve the accuracy of our framework. The first one, corresponding state sampling (CSS), modifies the re-weight coefficient and improves the efficiency of our estimator. The second technique integrates the non-backtracking random walk into our framework. Simulation results show that our optimization techniques improve the estimation.
Provable guarantees. We give a detailed theoretical analysis of our unbiased estimator. Specifically, we derive an analytic Chernoff-Hoeffding bound on the sample size. This theoretical bound guarantees the convergence of our methods and provides insight into the factors which affect the performance of graphlet statistics estimation.
Extensive experimental evaluation. To further verify our framework, we conduct extensive experiments on real-world networks. We focus on 3-, 4-, and 5-node graphlets. In Section 6, we demonstrate that our framework with an appropriately chosen parameter d is faster and more accurate than previous state-of-the-art methods.
For 3-node graphlets, our methods with random walk on G outperform PSRW by up to 7 times in accuracy. For 4- and 5-node graphlets, our methods with random walk on G(2) outperform PSRW by up to 100 times in accuracy. In summary:
• We propose a general Markov Chain Monte Carlo (MCMC) framework to estimate graphlet statistics.
• We derive an unbiased estimator for the framework and develop a Chernoff-Hoeffding bound on the sample size.
• We introduce two novel optimization techniques to improve our estimator. The simulation results show that these techniques improve the estimation efficiency significantly.
• We conduct an extensive experimental evaluation to support our theoretical arguments.
The remainder of this paper is organized as follows. Section 2 provides preliminaries. Section 3 explains our framework in detail. Section 4 presents our two optimization techniques. Section 5 gives some implementation details. Section 6 reports our experimental results, and finally, the conclusion is given in Section 7.

2. PRELIMINARY
In this section, we first define some notations and concepts used throughout the paper. Then we review some useful results on finite Markov chains to serve as the mathematical foundation of our random walk-based methods.


2.1 Notations and Definitions
Networks can be modeled as a graph G = (V, E), where V is the set of nodes and E is the set of edges. For a node v ∈ V, dv is the degree of node v, i.e., the number of neighbors of node v. A graph with no self-loops or multiple edges is called a simple graph. In this work, we consider simple, connected and undirected graphs.
Induced subgraph: An induced subgraph is a subgraph Gk = (Vk, Ek) which has k nodes in V together with all edges whose endpoints are both in Vk. Formally, we have Vk ⊂ V, |Vk| = k and Ek = {(u, v) : u, v ∈ Vk, (u, v) ∈ E}.
Subgraph relationship graph: In [5, 33], the authors use the concept of a subgraph relationship graph. Here we adopt the definition in [33] and define the d-node subgraph relationship graph G(d) as follows. Let H(d) denote the set of all d-node connected induced subgraphs of G. For si, sj ∈ H(d), there is an edge between si and sj if and only if they share d − 1 common nodes in G. We use R(d) to denote the set of edges among all subgraphs in H(d). Then we define G(d) = (H(d), R(d)). Specially, we define G(1) = G, H(1) = V, R(1) = E when d = 1. If the original graph G is connected, then G(d) is also connected [33, Theorem 3.1]. Figure 1 shows an example of G(2) and G(3) for a 4-node graph G. The 2-node induced subgraphs of G are the node pairs {(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)}. Let H(2) denote all 2-node induced subgraphs of G; then the node set of G(2) is H(2). Note that there is an edge between node pairs (1, 2) and (2, 3) in G(2) because they share node 2 in G.

In general, constructing G(d) is prohibitive due to its intensive computation cost. However, we will show that for our random walk-based methods, there is no need to construct G(d) in advance because we can generate the neighborhood subgraphs of s ∈ H(d) on the fly.
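The on-the-fly neighbor generation can be sketched as follows. This is our own illustration (for d ≥ 2; the function names are hypothetical): neighbors of a state s in G(d) are obtained by dropping one node of s and adding one node adjacent to the remaining d − 1 nodes, keeping only candidates that induce a connected subgraph.

```python
def is_connected(nodes, adj):
    """Check that the subgraph induced by `nodes` is connected (DFS)."""
    nodes = set(nodes)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(w for w in adj[u] if w in nodes)
    return seen == nodes

def neighbor_states(state, adj):
    """Neighbors of a d-node state of G(d): drop one node, add one node
    adjacent to the rest; keep a candidate only if it induces a
    connected subgraph (it then shares exactly d-1 nodes with `state`)."""
    out = set()
    for drop in state:
        kept = set(state) - {drop}
        candidates = {w for u in kept for w in adj[u]} - set(state)
        for add in candidates:
            cand = frozenset(kept | {add})
            if is_connected(cand, adj):
                out.add(cand)
    return out

# Figure 1's graph: in G(2), state (1, 2) has 3 neighbors and (1, 3) has 4.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
deg_12 = len(neighbor_states(frozenset({1, 2}), adj))
deg_13 = len(neighbor_states(frozenset({1, 3}), adj))
```

Summing the degrees of all five states of G(2) gives 16, i.e., 2|R(2)| = 16, matching the 8 edges of G(2) in Figure 1.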

[Figure omitted: the original graph G (nodes 1-4, edges e1-e5), its 2-node subgraph relationship graph G(2), and its 3-node subgraph relationship graph G(3).]
Figure 1: Original graph G and its 2- and 3-node subgraph relationship graphs G(2) and G(3).

Isomorphic: Two graphs G = (V, E) and G′ = (V′, E′) are isomorphic if there exists a bijection ϕ : V → V′ with (vi, vj) ∈ E ⇔ (ϕ(vi), ϕ(vj)) ∈ E′ for all vi, vj ∈ V. Such a bijection is called an isomorphism, and we write isomorphic G and G′ as G ≃ G′.

Definition 1. Graphlets are formally defined as small, non-isomorphic, connected induced subgraphs of a graph G.
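For the small sizes involved here, the isomorphism test in Definition 1 can be done by brute force over bijections. The following is a minimal sketch of ours (not an algorithm from the paper):

```python
from itertools import combinations, permutations

def isomorphic(nodes1, edges1, nodes2, edges2):
    """Brute-force isomorphism test: search for a bijection phi with
    (u, v) in E1 iff (phi(u), phi(v)) in E2. Practical only for the
    small node counts graphlets have."""
    if len(nodes1) != len(nodes2) or len(edges1) != len(edges2):
        return False
    E1 = {frozenset(e) for e in edges1}
    E2 = {frozenset(e) for e in edges2}
    for perm in permutations(nodes2):
        phi = dict(zip(nodes1, perm))
        if all((frozenset({u, v}) in E1) == (frozenset({phi[u], phi[v]}) in E2)
               for u, v in combinations(nodes1, 2)):
            return True
    return False

# A relabeled 3-node line is still a line; a line is not a triangle.
line_vs_line = isomorphic([1, 2, 3], [(1, 2), (2, 3)],
                          ["a", "b", "c"], [("a", "c"), ("c", "b")])
line_vs_tri = isomorphic([1, 2, 3], [(1, 2), (2, 3)],
                         [1, 2, 3], [(1, 2), (2, 3), (1, 3)])
```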

Figure 2 shows all 3- and 4-node graphlets. For 3-node graphlets, we have two different types, i.e., the line and the triangle. There are 6 different 4-node graphlets. The second row of Table 3 shows the 21 different 5-node graphlets. The number of distinct graphlets grows with the number of vertices in the graphlets; for example, there are 112 different 6-node graphlets and 853 different 7-node graphlets. Due to this combinatorial complexity, the computation of graphlets is usually restricted to 3, 4, 5 nodes [2, 5, 12, 13, 32–34].

[Figure omitted: all 3- and 4-node graphlets, namely g31 (line), g32 (triangle), g41 (4-path), g42 (3-star), g43 (cycle), g44 (tailed-triangle), g45 (chordal-cycle) and g46 (clique).]
Figure 2: All 3- and 4-node graphlets.

Problem definition: Given an undirected connected graph G and a family of k-node graphlets Gk = {gk1, gk2, · · · , gkm}, where gki is the ith type of k-node graphlet, let Cki denote the count of appearances of graphlet gki (i.e., the number of induced subgraphs that are isomorphic to graphlet gki). Our goal is to estimate the concentration of each gki ∈ Gk in G, where the concentration of gki is defined as

    cki ≜ Cki / (Ck1 + Ck2 + · · · + Ckm).    (1)

There are many applications if we know the concentration of gki. For example, for 3-node graphlets, where g31 is a line and g32 is a triangle (refer to Figure 2), if c31 is relatively high in a given online social network graph, then there is a high possibility for us to make friendship recommendations.
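On a small graph, Equation (1) can be checked against an exhaustive baseline. The sketch below (our own illustration, not the paper's algorithm) classifies every connected 3-node induced subgraph of Figure 1's graph, which contains two lines ({1,2,4}, {2,3,4}) and two triangles ({1,2,3}, {1,3,4}), so c31 = c32 = 0.5.

```python
from itertools import combinations

def graphlet3_concentrations(nodes, edges):
    """Exhaustive baseline for Equation (1) with k = 3: classify every
    3-node induced subgraph with >= 2 edges as a line (2 edges) or a
    triangle (3 edges) and return (c31, c32)."""
    E = {frozenset(e) for e in edges}
    line = triangle = 0
    for trio in combinations(nodes, 3):
        m = sum(frozenset(pair) in E for pair in combinations(trio, 2))
        if m == 2:       # 2 edges on 3 nodes always form a path
            line += 1
        elif m == 3:
            triangle += 1
        # m <= 1: not connected, so not a graphlet occurrence
    total = line + triangle
    return line / total, triangle / total

# Figure 1's graph.
c31, c32 = graphlet3_concentrations([1, 2, 3, 4],
                                    [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)])
```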

2.2 Random Walk and Markov Chain
A simple random walk (SRW) over graph G is defined as follows: starting from an initial node in G, we move to one of the neighbors of the current node, chosen uniformly at random, and repeat this process until a certain stopping criterion is met. The random walk on graph G can be viewed as a finite and time-reversible Markov chain with state space V. More specifically, let {Xt} be the Markov chain representing the nodes visited by the random walk on G; the transition probability matrix P of this Markov chain is defined as

    P(i, j) = 1/di  if (i, j) ∈ E,
    P(i, j) = 0     otherwise.

The SRW has a unique stationary distribution π, where π(v) = dv/(2|E|) for v ∈ V [18, 24]. Our framework extends the random walk to G or G(k) for any k. For simplicity, we write the simple random walk on G(k) as SRW(k). The theory of random walks is rooted in the theory of finite Markov chains. Below, we review some basic Markov chain theory to serve as the mathematical foundation for our random walk-based estimation methods.
Strong Law of Large Numbers. Consider a Markov chain with finite state space S and stationary distribution π, and a function f : S → R. The expectation of f with


respect to π is given by

    μ = Eπ(f) ≜ Σ_{X∈S} f(X) π(X).

The subscript π means that the expectation is calculated assuming X ∼ π. Let

    μn = (1/n) Σ_{s=1}^{n} f(Xs)

denote the sample average of f(X) over a run of the Markov chain. The Strong Law of Large Numbers (SLLN) ensures that the sample mean converges almost surely to its expected value with respect to the stationary distribution.

Theorem 1. [8, 15] For a finite and irreducible Markov chain with stationary distribution π, we have

    μn → Eπ(f) almost surely (a.s.)

as n → ∞, regardless of the initial distribution of the chain.
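The closed-form stationary distribution π(v) = dv/(2|E|), and the fact that the chain forgets its initial distribution, can be checked numerically. The sketch below (our own code, using Figure 1's graph) power-iterates the SRW transition matrix P(i, j) = 1/di:

```python
def srw_stationary(adj, iters=500):
    """Push a distribution through P(i, j) = 1/d_i repeatedly, starting
    from an arbitrary initial distribution; the result approximates pi."""
    nodes = sorted(adj)
    dist = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            share = dist[u] / len(adj[u])  # mass split evenly over neighbors
            for w in adj[u]:
                nxt[w] += share
        dist = nxt
    return dist

# Figure 1's graph (nodes 1-4, edges e1-e5).
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
pi = srw_stationary(adj)
closed_form = {v: len(adj[v]) / sum(len(a) for a in adj.values()) for v in adj}
```

The iterated distribution matches the closed form dv/(2|E|) to high precision; this graph contains a triangle, so the chain is aperiodic and the iteration converges.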

The SLLN is the fundamental basis for most random walk-based graph sampling methods, or more formally, Markov Chain Monte Carlo (MCMC) samplers, e.g., [5, 11, 17]. The SLLN guarantees the asymptotic unbiasedness of estimators based on any finite and irreducible Markov chain.
Central Limit Theorem. The estimation error μn − μ converges to zero as n → ∞ according to the SLLN. However, for a practical MCMC method, we also want to know how many samples are needed to achieve a certain accuracy. Before introducing the Central Limit Theorem (CLT), which characterizes the distribution of the estimation error as n → ∞, we define the variance of the estimator μn as follows.

    σn² = n Var(μn)
        = (1/n) Σ_{i=1}^{n} Var(f(Xi)) + (1/n) Σ_{i=1}^{n} Σ_{j≠i} Cov(f(Xi), f(Xj)).    (2)

Here we assume the Markov chain {Xi} is in the stationary regime. σn² consists of two parts, i.e., the variances of the f(Xi) and the covariances between f(Xi) and f(Xj). Since each Xi is drawn from the stationary distribution, we define γ0 = Var(f(Xi)) and γk = Cov(f(Xi), f(Xi+k)), where γ0 and γk are the same for all i. Thus we have

    σ² = γ0 + 2 Σ_{k=1}^{∞} γk.

The following theorem describes the relationship between the estimation error and the asymptotic variance.

Theorem 2. [8, 15] For a finite and irreducible Markov chain with stationary distribution π, the central limit theorem asserts that

    √n (μn − μ) →D N(0, σ²) as n → ∞

for any function f with Eπ(f²) < ∞, and it holds for any initial distribution. Here σ² is defined via Equation (2).

That is, the estimation error μn − μ, scaled by √n, converges in distribution to a normal distribution with zero mean and variance σ². For sufficiently large n, μn ± 1.96σ/√n is an approximate 95% confidence interval for the unknown value μ. Here, we can see that the asymptotic variance σ² is an important criterion for measuring the efficiency of MCMC samplers. Hence, given two estimators of μ, we define the one with the smaller σ² as the more efficient estimator.

2.3 Mixing Time of Markov Chain
The mixing time of a Markov chain is the number of steps it takes for a random walk to approach its stationary distribution. The variation distance is a commonly used metric to quantify the distance between two distributions, and it can be used to define the mixing time.

Definition 2. The variation distance between two distributions d1 and d2 on a countable state space S is given by

    |d1 − d2|1 ≜ (1/2) Σ_{x∈S} |d1(x) − d2(x)|.

We adopt the definition of mixing time in [6, 22, 23]:

Definition 3. The mixing time τ(ε) (parameterized by ε) of a Markov chain is defined as

    τ(ε) = max_{Xi∈S} min{t : |π − P^t π(i)|1 < ε},

where π is the stationary distribution, π(i) is the initial distribution when starting from state Xi ∈ S, and P^t is the transition matrix after t steps.
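Definitions 2 and 3 translate directly into code. The sketch below (our own illustration on Figure 1's graph) computes the variation distance and the worst-case number of steps needed to get within ε of π:

```python
def tv_distance(d1, d2):
    """Variation distance |d1 - d2|_1 = (1/2) * sum |d1(x) - d2(x)|."""
    return 0.5 * sum(abs(d1[x] - d2[x]) for x in d1)

def mixing_time(adj, eps, max_steps=10_000):
    """Smallest t such that, from the worst-case start state, the t-step
    SRW distribution is within eps of pi in variation distance."""
    nodes = sorted(adj)
    two_E = sum(len(a) for a in adj.values())
    pi = {v: len(adj[v]) / two_E for v in nodes}
    worst = 0
    for start in nodes:
        dist = {v: 1.0 if v == start else 0.0 for v in nodes}
        for t in range(max_steps + 1):
            if tv_distance(pi, dist) < eps:
                break
            nxt = {v: 0.0 for v in nodes}
            for u in nodes:
                share = dist[u] / len(adj[u])
                for w in adj[u]:
                    nxt[w] += share
            dist = nxt
        worst = max(worst, t)
    return worst

# Figure 1's graph mixes quickly; tau shrinks as eps grows.
adj = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
tau = mixing_time(adj, 0.01)
```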

3. GENERAL FRAMEWORK
In this section, we introduce our random walk-based general framework for graphlet statistics estimation. Our framework leverages "consecutive steps" of a random walk. We derive an unbiased estimator for graphlet concentration with re-weighting techniques for the SRW. With this unbiased estimator, we can estimate, in general, any k-node graphlet concentration; we will illustrate how to estimate the k = 3, 4, 5-node graphlet concentrations in particular. Notations are summarized in Table 1.

Table 1: Summary of Notations

G         G = (V, E), the underlying undirected graph
G(d)      G(d) = (H(d), R(d)), the d-node subgraph relationship graph
gki       the ith type of k-node graphlet
Cki       count of graphlet gki in graph G
cki       concentration of graphlet gki in G
dv        degree of node v in G
dX        degree of state X, where X is also a node of G(d)
M(l)      state space of the expanded Markov chain
X(l)      a state in the state space M(l)
V(X(l))   set of nodes contained in state X(l)
s(X(l))   subgraph induced by the node set V(X(l))
π         stationary distribution of the random walk
πM        stationary distribution of the expanded Markov chain
αki       number of states X(l) in M(l) s.t. s(X(l)) ≃ gki
C(s)      set of states X(l) in M(l) s.t. s(X(l)) = s
w(X(l))   normalized weight, equal to 1/(2|R(d)| πM(X(l)))

3.1 Expanded Markov Chain
We are interested in finding the concentration of gki for k ≥ 3 (since it is not difficult for k = 1 or 2). The core idea is to collect k-node graphlet samples through l = k − d + 1 consecutive steps of the random walk on G(d),


where d ∈ {1, . . . , k − 1}. To illustrate, consider the graph in Figure 1. We want to explore possible graphlets g4i, and we have decided to perform the random walk on G(2) (i.e., we select d = 2). Assume we make l = 3 transitions over the following states: (1, 2) → (1, 3) → (3, 4); then we can derive a 4-node graphlet sample from the node set {1, 2, 3, 4} which we have visited during these three steps of the walk. In this case, it corresponds to g45 in Figure 2. Before giving more details on our framework, we first introduce the concept of the expanded Markov chain.

For the random walk on G(d) = (H(d), R(d)), we define an expanded Markov chain which remembers the last l steps of the random walk. Every l consecutive steps are considered as a state X(l) = (X1, · · · , Xl) of the expanded Markov chain. Here we use the superscript "l" to denote the length of the random walk block. Each time our random walk on G(d) makes a transition, the expanded Markov chain transits to the next state. Assume the expanded Markov chain is currently at state X(l) = (X1, · · · , Xl); this means the random walker is at Xl. If the walker jumps to Xl+1, i.e., one of the neighbors of Xl, then the expanded Markov chain transits to the state (X2, · · · , Xl+1). It is easy to verify that this new chain is also Markovian. Let N = H(d) denote the state space of the random walk and M(l) denote the state space of the expanded Markov chain. The state space M(l) consists of all possible consecutive l steps of the random walk. More formally, M(l) = {(X1, · · · , Xl) : Xi ∈ N, 1 ≤ i ≤ l, s.t. (Xi, Xi+1) ∈ R(d) ∀ 1 ≤ i ≤ l − 1} ⊆ N × · · · × N. For example, if we perform the random walk on G, any (u, v) and (v, u) where euv ∈ E are states in M(2). Let πM denote the stationary distribution of the expanded Markov chain and π denote the stationary distribution of the random walk. dX is the number of neighbors of state X in G(d). The following theorem states that πM is unique and that its value can be computed from π.

Theorem 3. The stationary distribution πM exists and is unique. For any X(l) = (X1, · · · , Xl) ∈ M(l), we have

    πM(X(l)) = dXl / (2|R(d)|)                              if l = 1,
    πM(X(l)) = 1 / (2|R(d)|)                                if l = 2,
    πM(X(l)) = (1 / (2|R(d)|)) · (1/dX2) · · · (1/dXl−1)    if l > 2.    (3)

The proof of Theorem 3 is in Appendix A. Still using the graph in Figure 1 as an example, if we walk on G(2) and visit three states X1 = (1, 2), X2 = (1, 3), X3 = (3, 4), then the corresponding state X(3)1 in M(3) is (X1, X2, X3). The number of edges of graph G(2) is 8, and the degrees of X1, X2, X3 are 3, 4, 3 respectively. Then we can compute the stationary probability of X(3)1 as 1/16 · 1/4 = 1/64. Now assume we walk from X3 = (3, 4) to the next node X4 = (1, 4) in G(2). The expanded Markov chain then transits to state X(3)2 = (X2, X3, X4). Note that X(3)2 contains the node set {1, 3, 4}, which induces a triangle in G. In general, the state space M(l) contains all k′-node graphlets where d + 1 ≤ k′ ≤ k.
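Theorem 3 can be checked numerically on the running example. The sketch below (our own code, not from the paper) builds G(2) for Figure 1's graph, enumerates M(3) as all 3-step walks, assigns each state its probability from Equation (3), and confirms that the probabilities sum to 1 and that ((1,2), (1,3), (3,4)) receives 1/64.

```python
# Build G(2) for Figure 1's graph: states are edges of G, adjacent iff
# they share exactly one endpoint.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
H2 = [frozenset(e) for e in edges]
nbrs = {s: [t for t in H2 if t != s and len(s & t) == 1] for s in H2}
R2 = sum(len(v) for v in nbrs.values()) // 2   # |R(2)| = 8

# Enumerate M(3) (all 3-step walks on G(2)) and apply Equation (3):
# pi_M(X1, X2, X3) = 1 / (2|R(2)|) * 1 / d_{X2}.
pi_M = {}
for x1 in H2:
    for x2 in nbrs[x1]:
        for x3 in nbrs[x2]:
            pi_M[(x1, x2, x3)] = 1.0 / (2 * R2) / len(nbrs[x2])

total = sum(pi_M.values())   # a valid stationary distribution sums to 1
example = pi_M[(frozenset({1, 2}), frozenset({1, 3}), frozenset({3, 4}))]
```

The example state gets 1/16 · 1/4 = 1/64, matching the hand computation above.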

For a state X(l) ∈ M(l), V(X(l)) represents the set of nodes contained in state X(l), and s(X(l)) is the connected induced subgraph whose node set is V(X(l)). We also define X(l) as a corresponding state for the subgraph s(X(l)). A key observation is that a subgraph s may have several corresponding states in M(l). For example, if we perform the random walk on G and choose l = 3, a triangle {u, v, w} in G has 6 corresponding states (u, v, w), (u, w, v), (v, u, w), (w, v, u), (v, w, u), (w, u, v) in M(3). To describe this idea formally, we define the state corresponding coefficient αki.

Definition 4. For any connected induced subgraph s ≃ gki, we define the set of corresponding states for s as C(s) = {X(l)i : s(X(l)i) = s, X(l)i ∈ M(l)}, and the state corresponding coefficient αki as |C(s)|. For a state X(l) ∈ M(l), if s(X(l)) ≃ gki, we define the function α(X(l)) = αki.

The coefficient αki is vital to the design of the unbiased estimator, where it is used as a re-weight coefficient. If states in M(l) are uniformly sampled, then the probability of getting a sample of graphlet gki is αki Cki / |M(l)|. The physical interpretation of αki is that each subgraph isomorphic to gki is replicated αki times in the state space; the larger αki is, the higher the chance of getting a sample of gki. Note that the value of the coefficient αki depends only on the graphlet type and the random walk type. Hence we can compute αki in advance. For example, if we perform the random walk on G and l = 3, we have α32 = 6, i.e., any triangle in G has 6 corresponding states in M(3).

3.2 Computation of Coefficient

In this subsection, we discuss in detail how to compute α^k_i in advance and give examples for graphlets of k = 3, 4, 5 nodes. The coefficient α^k_i is the key to our estimation framework: one can view it as part of the "re-weight coefficient" in the estimator introduced in the next subsection. For the 3-node graphlets under SRW(1), we can compute α^3_i by counting the number of ways to traverse each graphlet. For example, there are six ways to traverse the triangle and two ways to traverse the line, hence α^3_1 = 2 and α^3_2 = 6. Algorithm 1 gives a general method to compute α^k_i for SRW(d) when 1 ≤ d ≤ k − 1. In Line 8 of Algorithm 1, S^(l)_{x_i} denotes the x_i-th element of the list S^(l). A special case of Algorithm 1 is d = k − 1. In this case, we just need to enumerate the (k − 1)-node induced subgraphs of g^k_i, which has time complexity O(k), and the coefficient α^k_i equals (|S| − 1) · |S|, where S is the set of (k − 1)-node connected induced subgraphs of g^k_i. Table 2 lists the coefficients α^k_i for k = 3, 4. Notice that α^4_2 = 0 if we choose SRW(1), i.e., if we randomly walk on the original graph G, we have no chance to sample g^4_2 under this estimation framework; in this case, we can only estimate the relative concentration of all 4-node graphlets except g^4_2. Table 3 lists the coefficients under different random walk types for all 5-node graphlets.

Algorithm 1 Pseudo code of computing α^k_i
Input: node set and edge list of graphlet g^k_i, SRW(d)
Output: α^k_i for graphlet g^k_i
1: set S ← all d-node connected induced subgraphs of g^k_i
2: random walk block length l ← k − d + 1
3: counter α^k_i ← 0
4: for all S^(l) ← combination of l elements from S do
5:   if size of node set ∪_{s∈S^(l)} V(s) less than k then
6:     continue
7:   for all (x_1, · · · , x_l) ← permutation of 1, · · · , l do
8:     if S^(l)_{x_i} and S^(l)_{x_{i+1}} share d − 1 nodes, ∀1 ≤ i ≤ l − 1 then
9:       α^k_i ← α^k_i + 1
10: return α^k_i
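For the SRW(1) case (d = 1), Algorithm 1 degenerates to the "ways to traverse" count: node sequences that cover the whole graphlet with consecutive nodes adjacent. A hedged brute-force sketch (names are ours; feasible only for small graphlets):

```python
from itertools import permutations

def alpha_srw1(nodes, edges):
    """alpha^k_i for SRW(1): the number of sequences visiting every node
    of the graphlet with consecutive nodes adjacent (a Hamiltonian-path
    count, matching the 'ways to traverse' view in Section 3.2)."""
    adj = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    count = 0
    for perm in permutations(nodes):
        if all((perm[i], perm[i + 1]) in adj for i in range(len(perm) - 1)):
            count += 1
    return count

triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
line     = ([1, 2, 3], [(1, 2), (2, 3)])
print(alpha_srw1(*triangle), alpha_srw1(*line))  # 6 2
```

The results agree with Table 2: α/2 = 3 for the triangle and 1 for the line under SRW(1); the 3-star gives 0 (no Hamiltonian path) and the 4-clique gives 24 (α/2 = 12).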


Table 2: Coefficient α^k_i for k = 3, 4 node graphlets

Graphlet    g^3_1  g^3_2  g^4_1  g^4_2  g^4_3  g^4_4  g^4_5  g^4_6
α^k_i / 2
SRW(1)      1      3      1      0      4      2      6      12
SRW(2)      1      3      1      3      4      5      12     24
SRW(3)      1/2    1/2    1      3      6      3      6      6

3.3 Unbiased Estimator

Now we derive an unbiased estimator of the graphlet concentration. Define an indicator function h^k_i for state X^(l):

    h^k_i(X^(l)) = 1{s(X^(l)) ≅ g^k_i}.      (4)

If X^(l) is drawn from the uniform distribution u, we have

    E_u[h^k_i] = α^k_i C^k_i / |M^(l)|.      (5)

However, state X^(l) is drawn from the distribution π_M. When sampling from π_M, we have

    E_u[h^k_i] = E_{π_M}[ h^k_i · u(X^(l)) / π_M(X^(l)) ].

Note that this is a special case of importance sampling [25, Chapter 9]. The adjustment factor u(X^(l))/π_M(X^(l)) is defined as the weight function w(X^(l)) : M^(l) → R such that

    w(X^(l)) = u(X^(l)) / π_M(X^(l)) = 1/|M^(l)| · 1/π_M(X^(l)),

where π_M(X^(l)) is defined in Equation (3). To remove the bias, we use the following technique. Suppose there are n samples {X^(l)_s}_{s=1}^n from the expanded Markov chain. According to the SLLN in Section 2.2, we have

    μ_n(w h^k_i) = (1/n) Σ_{s=1}^n w(X^(l)_s) h^k_i(X^(l)_s) → E_{π_M}[w h^k_i] = E_u[h^k_i].      (6)

Combining Equations (5) and (6), the estimator for C^k_i is

    (1/n) Σ_{s=1}^n (1/α^k_i) · h^k_i(X^(l)_s) / π_M(X^(l)_s) → C^k_i.      (7)

Since we want to compute the concentration, we do not need to know the exact value of |R(d)|. Define π̄_M(X^(l)) = 2|R(d)| π_M(X^(l)) and w̄(X^(l)) = 1/π̄_M(X^(l)). For each g^k_i, we only need to compute

    Ĉ^k_i ≜ (1/n) Σ_{s=1}^n (1/α^k_i) w̄(X^(l)_s) h^k_i(X^(l)_s) → C̄^k_i ≜ C^k_i / (2|R(d)|)  a.s.      (8)

Define C^k = Σ_{j=1}^m C^k_j and Ĉ^k = Σ_{j=1}^m Ĉ^k_j. The concentration of g^k_i is c^k_i = C^k_i / C^k. Based on Equation (8), the estimator for the graphlet concentration is ĉ^k_i = Ĉ^k_i / Ĉ^k. Note that we can easily reconstruct the counts of graphlets if we know |R(d)|. For a random walk on G, we have |R(1)| = |E|. For SRW(2), |R(2)| = (1/2) Σ_{e_uv} (d_u + d_v − 2), and a single pass over the graph data is enough to compute this value. All the values in Equation (8) can be obtained from local information during the random walk. Algorithm 2 depicts the process of k-node graphlet estimation using a simple random walk on graph G^(d). We give an example using the graph in Figure 1. If we walk on G through the path 1 → 2 → 3, we gather the adjacency information of the node set {1, 2, 3} and can determine that {1, 2, 3} induces a triangle. For this triangle, we have α^3_2 = 6 and w̄(X^(3)) = d_2 = 2, so we add (1/α^3_2) w̄(X^(3)) = 1/3 to the counter Ĉ^3_2. Next we walk from node 3 to node 4, and the state transits to (2, 3, 4). The node set {2, 3, 4} induces a line (g^3_1). For this line, we have α^3_1 = 2 and w̄(X^(3)) = 3, so we add 3/2 to the counter Ĉ^3_1.
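The single-pass computation of |R(2)| can be sketched as follows (the function name is ours). The example edge list below is a 4-node graph with the same degrees as the running example (d_1 = 3, d_2 = 2, d_3 = 3, d_4 = 2), for which |R(2)| = 8 as stated earlier:

```python
from collections import Counter

def size_R2(edges):
    """|R(2)| = (1/2) * sum over edges e_uv of (d_u + d_v - 2),
    computable in a single pass over the edge list."""
    deg = Counter()
    for u, v in edges:            # first pass: node degrees
        deg[u] += 1
        deg[v] += 1
    return sum(deg[u] + deg[v] - 2 for u, v in edges) // 2

# A graph consistent with the running example: |R(2)| should be 8.
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
print(size_R2(edges))  # 8
```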

Algorithm 2 Unbiased Estimation of Graphlet Statistics
Input: sample budget n, SRW(d), graphlet size k
Output: estimation of c^k_i for each graphlet g^k_i
1: random walk block length l ← k − d + 1
2: counter Ĉ^k_i ← 0 for i ∈ {1, · · · , m}
3: walk l steps to get the initial state X^(l) = (X_1, · · · , X_l)
4: random walk step t ← 0
5: while t < n do
6:   i ← graphlet type id of subgraph s(X^(l))
7:   Ĉ^k_i ← Ĉ^k_i + (1/α^k_i) w̄(X^(l))
8:   X_{l+t+1} ← uniformly chosen neighbor of X_{l+t}
9:   X^(l) ← (X_{t+2}, · · · , X_{l+t+1})
10:  t ← t + 1
11: Ĉ^k ← Σ_{j=1}^m Ĉ^k_j
12: ĉ^k_i ← Ĉ^k_i / Ĉ^k for all i ∈ {1, · · · , m}
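A hedged sketch of Algorithm 2 for the special case k = 3 with SRW(1), so l = 3, α^3_1 = 2, α^3_2 = 6, and w̄(X^(3)) is the degree of the middle node. The function name and adjacency-dict representation are ours, and states with repeated nodes ("invalid" samples, see Section 4.2) are simply skipped:

```python
import random
from collections import defaultdict

def estimate_3node_concentration(adj, n, seed=0):
    """Sketch of Algorithm 2 for k = 3, SRW(1): a simple random walk on G
    with block length l = 3. alpha^3_1 = 2 (line), alpha^3_2 = 6 (triangle);
    the re-weight coefficient is w(X^(3)) = degree of the middle node."""
    rng = random.Random(seed)
    walk = [rng.choice(sorted(adj))]
    while len(walk) < 3:                        # initial l-step block
        walk.append(rng.choice(adj[walk[-1]]))
    alpha = {"line": 2, "triangle": 6}
    C_hat = defaultdict(float)
    for _ in range(n):
        a, b, c = walk[-3:]
        if len({a, b, c}) == 3:                 # skip invalid samples
            kind = "triangle" if c in adj[a] else "line"
            C_hat[kind] += len(adj[b]) / alpha[kind]
        walk.append(rng.choice(adj[walk[-1]]))  # one more SRW step
    total = sum(C_hat.values())
    return {g: v / total for g, v in C_hat.items()} if total else {}

# Toy check: on the complete graph K4 every valid sample is a triangle.
k4 = {u: [v for v in range(4) if v != u] for u in range(4)}
print(estimate_3node_concentration(k4, 1000))  # {'triangle': 1.0}
```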

The next important question is: what is the smallest sample size (or number of random walk steps) needed to achieve high accuracy in our estimator? In the following, we show the relationship between accuracy and sample size using the Chernoff-Hoeffding bound for Markov chains. Let W denote max_{X^(l) ∈ M^(l)} w̄(X^(l)) and α_min = min_j α^k_j. The following theorem states the relationship between the estimation accuracy and the sample size.

Theorem 4. For any 0 < ε ≤ 1/8 and 0 < δ < 1, there exists a constant ξ such that

    Pr[(1 − ε) c^k_i ≤ ĉ^k_i ≤ (1 + ε) c^k_i] > 1 − δ      (9)

when the sample size n ≥ ξ (W/Λ) (τ(ε)/ε²) log(‖ϕ‖_{π_M} / δ). Here Λ = min{α^k_i C^k_i, α_min C^k}, τ(ε) is the mixing time of the original random walk, ϕ is the initial distribution, and ‖ϕ‖_{π_M} = Σ_{X^(l) ∈ M^(l)} ϕ²(X^(l)) / π_M(X^(l)).

Remark: The proof of Theorem 4 is in Appendix B. From Theorem 4, the needed sample size is linear in the mixing time τ(ε), which implies that our framework works better for graphs with fast mixing time. Furthermore, some graphlet types are relatively rare in the graph; for these rare types, a larger sample size is needed to guarantee the same accuracy. If α^k_i is higher for a rare graphlet g^k_i, the needed sample size is smaller. In practice, we choose the starting point deterministically. The π_M norm of the initial distribution ϕ depends on how we choose the starting node. If we choose a starting point X_0 in G^(d), Lemma 8 in Appendix B asserts that ‖ϕ‖_{π_M} = 1/π(X_0) = 2|R(d)|/d_{X_0}, which means that choosing a start node with higher degree in G^(d) is a better choice.

Estimate Joint Concentration. For some applications, we need to estimate the joint graphlet concentration. The problem of jointly estimating the (k−1), k, (k+1)-node graphlet concentrations has been studied in [5]. In this work, we focus on


Table 3: Coefficient α^5_i for 5-node graphlets (graphlet shapes omitted)

ID        1  2  3   4  5  6   7   8   9   10  11  12  13  14  15  16  17  18  19  20   21
α^5_i / 2
SRW(1)    1  0  0   1  2  0   5   2   2   4   4   6   7   6   6   10  14  18  24  36   60
SRW(2)    1  2  12  5  4  16  5   6   24  24  12  18  15  54  36  42  34  82  76  144  240
SRW(3)    1  5  24  8  5  24  5   16  30  24  16  63  26  63  30  43  63  63  90  90   90
SRW(4)    1  3  6   3  3  6   10  12  12  12  12  10  10  10  12  10  10  10  10  10   10

the k-node graphlet concentration. However, our methods can easily be extended to estimate (k−1), k, (k+1)-node graphlets simultaneously. More generally, we can estimate in one shot all k′-node graphlets (k′ ≤ k) while estimating the k-node graphlet concentration, without any extra computation cost. According to Equation (8), the joint concentration of all graphlets with α^{k′}_i ≠ 0 can be estimated under the same random walk. For example, we can estimate the joint concentration of all 3, 4, 5-node graphlets with a random walk on G^(2). The only difference from k-node graphlet estimation is that we need to trace random walk blocks of different lengths at each step. In Section 5, we will show that this joint estimation incurs no extra computation cost.

4. IMPROVED ESTIMATION OF GRAPHLET STATISTICS

We now introduce two novel optimization techniques to improve the lower bound on the sample size in Theorem 4. We first introduce a technique that modifies the re-weight coefficient w̄(X^(l)_s)/α^k_i in Equation (8). We then apply the non-backtracking random walk to our sampling framework to reduce the size of the state space. With these two optimization techniques, we derive a more efficient estimator of graphlet statistics.

4.1 Corresponding State Sampling (CSS)

In Section 3.1, we defined the corresponding states of a subgraph s. The key observation is that for X^(l)_i, X^(l)_j ∈ C(s), the weights w̄(X^(l)_i) and w̄(X^(l)_j) may differ when the length l of a random walk block is larger than 2. To illustrate, consider a triangle {u, v, w}. We choose a random walk on G and need to walk l = 3 steps to collect samples of 3-node graphlets. Both X^(3)_1 = (u, v, w) and X^(3)_2 = (v, u, w) are in the corresponding state set of the triangle {u, v, w}, and w̄(X^(3)_1) = d_v while w̄(X^(3)_2) = d_u. If nodes u and v have different degrees, then w̄(X^(3)_1) ≠ w̄(X^(3)_2). Based on this observation¹, we define a new weight function w_S for state X^(l).

Definition 5. For a state X^(l) and the subgraph s = s(X^(l)), we define a new weight function w_S for state X^(l):

    w_S(X^(l)) = 1 / ( Σ_{X^(l)_j ∈ C(s)} 1/w̄(X^(l)_j) ).

In essence, w_S(X^(l)) is similar to the "harmonic mean" of the weights w̄(X^(l)_j), X^(l)_j ∈ C(s), where s is the induced subgraph of the node set V(X^(l)). More importantly, we can observe all |C(s)| corresponding states through only one of them. If these states have different weights, then using the harmonic-mean-like weight function is likely to improve efficiency, i.e., to reduce the variance of the estimator in Equation (8).

¹Here we require l > 2 since for l = 2 the new re-weight coefficient equals the original one.

Table 4: Re-weight coefficient w_S(X^(l)) for all 3, 4-node graphlets

Graphlet  SRW(d)   α^k_i / 2   2 · w_S(X^(l))
g^3_1     SRW(1)   1           d_2
g^3_2              3           1/(1/d_1 + 1/d_2 + 1/d_3)
g^4_1     SRW(2)   1           d_{e_2}
g^4_2              3           1/(Σ_{j=1}^3 1/d_{e_j})
g^4_3              4           1/(Σ_{j=1}^4 1/d_{e_j})
g^4_4              5           1/(2/d_{e_2} + 2/d_{e_3} + 1/d_{e_4})
g^4_5              12          1/(2 Σ_{j=1}^5 1/d_{e_j} + 2/d_{e_5})
g^4_6              24          1/(4 Σ_{j=1}^6 1/d_{e_j})

Illustration: Take all 3, 4-node graphlets in Figure 2 as examples; we choose the random walk SRW(1) for 3-node graphlets and SRW(2) for 4-node graphlets. Table 4 summarizes the re-weight coefficient for the different 3, 4-node graphlet types, with node and edge labels as defined in Figure 2. Note that an edge e_uv of graph G is a node of G^(2), and its degree in G^(2) is d_u + d_v − 2, where d_u and d_v are the degrees of nodes u and v in G. In Table 4, the first column is the graphlet type of s(X^(l)), the second column is the random walk type, the third column is the state corresponding coefficient, and the fourth column is the new re-weight coefficient of state X^(l). To further illustrate the re-weight coefficient, consider the triangle (g^3_2). If we randomly walk on G and visit nodes 1, 2, 3 sequentially, the state we are visiting is X^(3) = (1, 2, 3). Assuming {1, 2, 3} induces a triangle (isomorphic to g^3_2), the corresponding states of the triangle {1, 2, 3} are (1, 2, 3), (3, 2, 1), (1, 3, 2), (2, 3, 1), (2, 1, 3) and (3, 1, 2). The re-weight coefficient w_S for X^(3) = (1, 2, 3) is 1/(2/d_1 + 2/d_2 + 2/d_3), while the old re-weight coefficient (1/α^k_i) w̄ for X^(3) is d_2/6. Observe that the coefficient w_S makes better use of the degree information of all nodes in the subgraph.

In the following, we prove that if we substitute w̄(X^(l))/α^k_i with w_S(X^(l)) in Equation (8), we still obtain an unbiased estimator of C̄^k_i.

Lemma 5. For a specific subgraph s ≅ g^k_i, we have

    E_{π_M}[(1/α^k_i) w̄(X^(l)) 1{V(X^(l)) = V(s)}] = E_{π_M}[w_S(X^(l)) 1{V(X^(l)) = V(s)}].

Proof. Using the fact that w̄(X^(l)) = 1/(2|R(d)| π_M(X^(l))), we have w_S(X^(l)) = 1/(2|R(d)| Σ_{X^(l)_j ∈ C(s)} π_M(X^(l)_j)). Hence both sides of the equation equal 1/(2|R(d)|).

Remark: For each subgraph s, the probability that s is sampled equals Σ_{X^(l)_j ∈ C(s)} π_M(X^(l)_j). Lemma 5 indicates that for the state X^(l), we use the sampling probability of the subgraph s it induces instead of the stationary probability of X^(l) itself. It is easy to verify that the function h^k_i(X^(l)) in Equation (4) is a linear combination of functions of the form 1{V(X^(l)) = V(s)}. Using the linearity of expectation and Lemma 5, we have

    E_{π_M}[ Σ_{s=1}^n (w̄(X^(l)_s)/α^k_i) h^k_i(X^(l)_s) ] = E_{π_M}[ Σ_{s=1}^n w_S(X^(l)_s) h^k_i(X^(l)_s) ].

Hence, we can rewrite the estimator in Equation (8) as

    (1/n) Σ_{s=1}^n w_S(X^(l)_s) h^k_i(X^(l)_s) → C̄^k_i  a.s.      (10)

It is easy to verify that max w_S(X^(l)) ≤ max w̄(X^(l))/α(X^(l)), so we can derive a smaller bound on the sample size for the new estimator in Equation (10). For a fixed sample budget, the efficiency of a sampling scheme is determined by the asymptotic variance in Equation (2). It is not clear whether Var_{π_M}[Σ_{s=1}^n w_S(X^(l)_s) h^k_i(X^(l)_s)] ≤ Var_{π_M}[Σ_{s=1}^n (w̄(X^(l)_s)/α^k_i) h^k_i(X^(l)_s)], since the covariance between the random variables is difficult to analyze. However, we can always analyze the variance of independent random variables: Lemma 6 indicates that if all states were independent, the CSS sampling scheme would be more efficient than the basic one in Section 3.

Lemma 6. The variance of the function w_S(X^(l)) h^k_i(X^(l)) is smaller than that of the function (1/α^k_i) w̄(X^(l)) h^k_i(X^(l)) under the same stationary distribution π_M. More specifically, we have

    Var_{π_M}[w_S(X^(l)) h^k_i(X^(l))] ≤ Var_{π_M}[(1/α^k_i) w̄(X^(l)) h^k_i(X^(l))].

See Appendix C for the details of the proof. Algorithm 3 shows the pseudo code to compute the re-weight coefficient w_S(X^(l)) (l > 2) given the node set and edge list of the subgraph s induced by state X^(l). The computation process is similar to that of Algorithm 1.

Algorithm 3 Pseudo code of computing w_S(X^(l))
Input: state X^(l), graphlet size k, SRW(d)
Output: re-weight coefficient w_S(X^(l))
1: random walk block length l ← k − d + 1
2: s ← s(X^(l))
3: set S ← all d-node connected induced subgraphs of s
4: acc ← 0
5: for all S^(l) ← combination of l elements from S do
6:   if size of node set ∪_{s′∈S^(l)} V(s′) less than k then
7:     continue
8:   for all (x_1, · · · , x_l) ← permutation of 1, · · · , l do
9:     if S^(l)_{x_i} and S^(l)_{x_{i+1}} share d − 1 nodes, ∀1 ≤ i ≤ l − 1 then
10:      corresponding state X′^(l) ← (S^(l)_{x_1}, · · · , S^(l)_{x_l})
11:      acc ← acc + 1/w̄(X′^(l))
12: return w_S(X^(l)) ← 1/acc

4.2 Non-backtracking Random Walk

Many techniques have been proposed to improve the efficiency of random walk-based algorithms, for example, the non-backtracking random walk [15], random walks leveraging walk history [36], the rejection-controlled Metropolis-Hastings random walk [17], and random walks with jumps [35]. In this subsection, we integrate the non-backtracking random walk (NB-SRW) into our estimation framework as an example of how such techniques can be combined with our framework.

The basic idea of NB-SRW is to avoid backtracking to the previously visited node. Due to the dependency on the previously visited node, the random walk itself is not a Markov chain on the state space H(d). However, we can define an augmented state space Ω = {(i, j) : i, j ∈ H(d), s.t. (i, j) ∈ R(d)} ⊆ H(d) × H(d). The transition matrix P′ ≜ {P′(e_ij, e_lk)}_{e_ij, e_lk ∈ Ω} of the NB-SRW is defined as follows:

    P′(e_ij, e_jk) =
      1/(d_j − 1),  if i ≠ k and d_j ≥ 2,
      0,            if i = k and d_j ≥ 2,
      1,            if i = k and d_j = 1.

All other elements of P′ are zero. Let π′ be the stationary distribution of the NB-SRW. A useful fact is that NB-SRW preserves the stationary distribution of the original random walk, i.e., π′(i) = d_i/(2|R(d)|) and π′(e_ij) = 1/(2|R(d)|). To apply NB-SRW, we just need to replace the previously used simple random walk with NB-SRW. The estimators in Equations (8) and (10) can still be used, except that we need to change the re-weight coefficient w̄(X^(l)). Define the nominal degree of X_i ∈ H(d) as d′_{X_i} = max{d_{X_i} − 1, 1}. For any X^(l) = (X_1, · · · , X_l), the re-weight coefficient w̄(X^(l)) is modified accordingly:

    w̄(X^(l)) =
      1/d_{X_l},                         if l = 1,
      1,                                 if l = 2,
      d′_{X_2} × · · · × d′_{X_{l−1}},   if l > 2.

Applying NB-SRW helps us eliminate some "invalid" states from the state space. For example, suppose we want to estimate the 3-node graphlet concentration using SRW(1). We need to walk 3 steps on graph G to collect a sample containing 3 nodes, but it is possible to visit only 2 distinct nodes in 3 steps. We call such samples invalid samples. Figure 3 shows an example of a valid sample and an invalid sample. Invalid samples do not contribute to the estimation. With NB-SRW, such invalid samples are less likely, so NB-SRW helps to improve the estimation efficiency of our framework.
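A minimal sketch of one NB-SRW transition following P′ above (the function name and adjacency-dict representation are ours):

```python
import random

def nb_step(adj, prev, cur, rng=random):
    """One non-backtracking random walk step: from `cur`, pick a uniform
    neighbor other than `prev` (probability 1/(d_cur - 1) each);
    backtracking is allowed only at a degree-1 node."""
    nbrs = adj[cur]
    if len(nbrs) == 1:            # dead end: the only move is back to prev
        return nbrs[0]
    return rng.choice([v for v in nbrs if v != prev])

# On the path 1-2-3, a walker arriving at 2 from 1 must continue to 3.
path_adj = {1: [2], 2: [1, 3], 3: [2]}
print(nb_step(path_adj, 1, 2))  # 3
```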

Figure 3: Example of a "valid" sample (u → v → w) and an "invalid" sample (u → v → u).

5. IMPLEMENTATION DETAILS

In this section, we discuss the implementation details. We first explain how to obtain the neighborhood subgraphs of the currently visited state (subgraph) s ∈ H(d). According to Algorithm 2, we also need to identify the type of the k-node graphlet at each step. A naive implementation needs k(k − 1)/2 binary searches to determine the adjacency lists of these k nodes and hence the graphlet type; below we show that a careful implementation needs only k − 1 binary searches. Furthermore, we demonstrate that when estimating k-node graphlet statistics, the concentration of all graphlets of size less than k can be estimated simultaneously without any extra computation cost.

5.1 Populate Neighborhood of Graphlet

Populating the neighbors of a node in G or G^(2) has low computation cost. Take nodes in G^(2) as an example. For a node e_uv in G^(2) (e_uv is an edge of graph G), to obtain a random neighbor of e_uv we can replace u with a node in N(v)\{u} or replace v with a node in N(u)\{v}, where N(v) denotes the set of neighbors of node v ∈ V. From this example, we see that a uniformly chosen neighbor of a node in G^(2) can be obtained in constant time. When d > 2, we need d merge operations over the adjacency lists of the nodes in the currently visited subgraph: to populate the neighbors of s in G^(d), we replace one node v_i ∈ V(s) with a node v_j ∈ ∪_{v ∈ V(s)\{v_i}} N(v) and ensure the connectivity of the new subgraph induced by the node set {v_j} ∪ V(s)\{v_i}. The time complexity of selecting a random neighbor of a node in G^(d) is O(d² |E|/|V|) when d is larger than 2.
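A sketch of drawing a uniform neighbor of a node e_uv of G^(2) by the replacement rule above (names are ours; for clarity we materialize the candidate lists instead of indexing in constant time):

```python
import random

def random_neighbor_G2(adj, edge, rng=random):
    """Uniform neighbor of node e_uv in G^(2): replace u by a node of
    N(v)\\{u}, or v by a node of N(u)\\{v}; there are d_u + d_v - 2
    such neighbors in total."""
    u, v = edge
    du = len(adj[u])
    i = rng.randrange(du + len(adj[v]) - 2)
    if i < du - 1:                 # keep u, replace v by another neighbor of u
        return (u, [x for x in adj[u] if x != v][i])
    return (v, [x for x in adj[v] if x != u][i - (du - 1)])

# On a triangle, edge (1,2) has exactly two neighbors in G^(2): (1,3) and (2,3).
triangle_adj = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(random_neighbor_G2(triangle_adj, (1, 2)))
```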

5.2 Identify k-node Graphlet Types

At each step of the random walk, we use a set V_k = {v_1, · · · , v_k} ⊆ V to record the k nodes visited during the last l steps of the random walk. We now show that k − 1 binary search operations are sufficient to determine the graphlet type of the subgraph G_k = (V_k, E_k). We use a k × k bit matrix to represent the adjacency relationships between the nodes in V_k. Every time we proceed to the next node in G^(d), we visit exactly one new node v ∈ V with v ∉ V_k; note that v is a node of the original graph G. Once we visit the new node v, one old node in V_k no longer belongs to the set of nodes visited during the last l steps. We replace this old node with v to update V_k and remove the old node's information from the bit matrix. By performing k − 1 binary searches on the neighborhood of node v, we can determine the adjacency relationships between v and the other nodes in the updated set V_k, i.e., we can update the bit matrix for the currently visited k nodes. With the bit matrix, we can then determine the graphlet type. Note that for large k, determining the graphlet type is difficult even with the adjacency matrix. However, for k = 3, 4, 5-node graphlets, we can determine the graphlet type more easily via the degree signatures of the different graphlets [5]: for the subgraph G_k = (V_k, E_k), we just compute the degree of each node, derive the degree signature, and compare it with the degree signatures of the different graphlet types.
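As a sketch of degree-signature identification for connected 4-node graphlets (the graphlet names below are ours, not the paper's g^4_i ids; for k = 4 the sorted degree sequence already distinguishes all six types):

```python
def graphlet_type_4(nodes, edges):
    """Identify a connected 4-node graphlet by its sorted degree signature,
    which is sufficient for k = 4 as noted in the text."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    sig = tuple(sorted(deg.values()))
    return {
        (1, 1, 1, 3): "3-star",
        (1, 1, 2, 2): "path",
        (1, 2, 2, 3): "tailed triangle",
        (2, 2, 2, 2): "cycle",
        (2, 2, 3, 3): "chordal cycle",
        (3, 3, 3, 3): "clique",
    }[sig]

print(graphlet_type_4([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 2)]))  # tailed triangle
```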

5.3 Estimate Concentration Simultaneously

Now we show that if we estimate k-node graphlets with SRW(d), then all graphlets with fewer than k nodes can be estimated at the same time without extra computation cost. When we estimate k-node graphlets, we collect k nodes through l = k − d + 1 steps of the random walk on G^(d). For k′-node graphlets (k′ < k), we collect k′ nodes through l′ = k′ − d + 1 steps, and these l′ steps are included in the l steps. Since we already identify the type of the node-induced subgraph s obtained from the previous l steps, it is trivial to determine the type of the k′-node graphlet obtained from the l′ steps using the adjacency lists of the nodes in s. Figure 4 shows such an example for estimating 3, 4, 5-node graphlets simultaneously.

Figure 4: Estimating 3, 4, 5-node graphlets simultaneously with a random walk on G (illustration omitted). Assume the subgraph induced from the 5 steps is g^5_5; then the subgraph induced from nodes 2, 3, 4, 5 must be g^4_4, and the subgraph induced from nodes 3, 4, 5 is a triangle.

6. EXPERIMENTAL EVALUATION

We present the simulation results for 3, 4, 5-node graphlets. Our framework runs a random walk on G^(d), where d is a parameter of our algorithm. We evaluate the performance of the random walk on different G^(d) and provide empirical recommendations on choosing d for graphlets of different sizes. We implement our algorithms in C++ and run the experiments on a Linux machine with an Intel 3.70GHz CPU. The simulation results show that we achieve significantly higher accuracy than the state-of-the-art methods.

6.1 Dataset and Error Metric

We use publicly available real-world networks to evaluate our algorithms. For directed graphs, we remove the direction of the edges and make the graph undirected. We retain only the largest connected component (LCC) of each network and discard the rest of the nodes and edges. Detailed information about the LCCs of the networks is reported in Table 5. The ground truth of the 3, 4-node graphlet concentrations is computed with the algorithm proposed by Ahmed et al. [2]. For 5-node graphlets, we count the graphlets using the algorithm proposed by Hočevar [12]. We calculate the true concentration of 5-node graphlets only for the four smaller datasets due to the extremely high computation cost. From Table 5, we see that the 3, 4, 5-node cliques take a relatively low percentage in all the datasets.

Table 5: Datasets

Graph           |V|    |E|    c^3_2 (10^−2)  c^4_6 (10^−3)  c^5_21 (10^−5)
BrightKite [1]  57K    213K   3.98           1.447          4.661
Epinion [16]    76K    406K   2.29           0.225          0.147
Slashdot [16]   77K    469K   0.82           0.092          0.115
Facebook [1]    63K    817K   5.46           1.419          2.511
Gowalla [16]    197K   950K   0.80           0.008          -
WikiTalk [16]   2.4M   4.7M   0.07           0.0003         -
Youtube [1]     3.2M   9.4M   0.05           0.00007        -
Wikipedia [1]   1.9M   36.5M  0.10           0.00009        -

To measure the estimation accuracy, we use the normalized root mean square error (NRMSE):

    NRMSE(ĉ^k_i) = sqrt( E[(ĉ^k_i − c^k_i)²] ) / c^k_i,

where ĉ^k_i is the estimate computed from n samples and c^k_i is the ground-truth value. Note that when c^k_i is extremely small, a large NRMSE is acceptable, since it is genuinely difficult to obtain samples of graphlets with small c^k_i. In all simulations, we start from a node chosen at random. Even though we do not start in the stationary regime, the distribution approaches the stationary distribution after a "burn-in" period.
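The NRMSE computation itself is straightforward; a sketch (the function name is ours), with the expectation taken empirically over independent runs:

```python
import math

def nrmse(estimates, truth):
    """NRMSE of a set of independent estimates against the ground truth:
    sqrt(E[(c_hat - c)^2]) / c, with the expectation taken empirically."""
    mse = sum((c - truth) ** 2 for c in estimates) / len(estimates)
    return math.sqrt(mse) / truth

print(nrmse([0.9, 1.1], 1.0))  # 0.1 -> 10% normalized error
```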

We also compare our methods with the state-of-the-art method, pairwise subgraph random walk (PSRW) [33]. Note that our framework takes one parameter d, the number of nodes in the subgraphs that form G^(d); PSRW is simply the special case of our framework with d = k − 1. Later, we will demonstrate that SRW(k − 1) usually has lower accuracy than a random walk on G^(d) with d < k − 1.

The names of the methods are given as follows. SRWd means the method is based on a random walk on G^(d). If the method also integrates the corresponding state sampling optimization, we add CSS to the method name; if it also uses the non-backtracking random walk, we append NB. For example, SRW1CSSNB means we run the random walk on G and use both CSS and the non-backtracking random walk as optimization techniques.

6.2 Estimate 3-node Graphlets Concentration

In this subsection, we present the simulation results for 3-node graphlets. Note that there are only two types of 3-node graphlets, the line and the triangle. The triangle concentration is strongly related to network properties such as the global clustering coefficient, so we focus on measuring the accuracy of the triangle concentration; in fact, the triangle concentration is almost the same as the global clustering coefficient. In [11], Hardiman proposed a method similar to SRW(1) under our framework to estimate the clustering coefficient, derived in a totally different way. For a fair comparison, we take SRW(1) (the method in [11]) and SRW(2) (PSRW) as the prior state-of-the-art methods for triangle estimation. Figure 5 shows the results of estimating the triangle concentration on six representative graphs. In the experiments, we average the NRMSE estimates over 10000 simulations. We order the subfigures by decreasing triangle concentration, and the y-axis of each subfigure is in log scale.

Figure 5: Estimation of triangle concentration v.s. the percentage of mined nodes (plots omitted). "SRW1" and "SRW2" are the prior state-of-the-art methods. Our proposed methods "SRW1CSS" and "SRW1CSSNB" perform best on all the datasets.

The results show that the error of each method decreases quickly as the percentage of mined nodes increases (the number of nodes visited during the random walk over the total number of nodes in the network), which means our framework is effective for 3-node graphlets. For all datasets except "Epinion", "Slashdot", and "Gowalla", the NRMSE is below 0.1 with only 2% of the nodes mined. Since the true value is relatively small, an error below 0.1 can be considered acceptable. The non-backtracking random walk does not improve the accuracy much, but the CSS optimization significantly improves the estimation efficiency: for the "Wikipedia" dataset, the optimized method SRW1CSS improves the estimation accuracy about 4 times over SRW1. For all datasets, our proposed methods SRW1CSS and SRW1CSSNB perform best. Moreover, the NRMSE gap between these two optimized methods and the others widens, indicating that their error decreases faster. We conclude that for 3-node graphlets, the non-backtracking random walk on G with corresponding state sampling is the most efficient method. Our method SRW1CSSNB outperforms the prior art SRW1 [11] by up to 4 times (on "Wikipedia") and SRW2 [33] (PSRW) by up to 7 times (on "WikiTalk").

6.3 Estimate 4-node Graphlets Concentration

In this subsection, we show the simulation results for 4-node graphlets. There are six different 4-node graphlets. We plot the true concentration values for 4 graphs in Figure 7. From this figure, we can see that the 3-star is the dominant 4-node graphlet, while the clique appears least frequently. For rare graphlets, we need more samples to achieve a given accuracy. We can expect better performance on graphs with a "flatter" concentration distribution, i.e., where the differences between the concentrations of different graphlets are smaller; for example, the graph "Epinion" has a flatter concentration distribution than the other graphs in Figure 7. Due to space limits, we only present simulation results for four graphs in the dataset.

Weighted Concentration. From Equation (7), we know that (1/n) Σ_{s=1}^n h^k_i(X^(l)_s)/π_M(X^(l)_s) → α^k_i C^k_i. To further understand the performance of our methods, we define the weighted concentration of graphlet g^k_i as α^k_i C^k_i / (Σ_{j=1}^m α^k_j C^k_j). Our problem is thus equivalent to estimating the weighted concentration using the simple random walk on G^(d). We plot the weighted concentration in Figure 6a, marking the weighted concentrations computed with the coefficients α^k of SRW(2) and SRW(3) as SRW2 and SRW3 respectively. Compared with the original concentration, the weighted one lifts the percentage of the relatively rare graphlet types, i.e., g^4_3, g^4_5,

3 ( ), g45

( ), and g46 ( ). For all the real-world graphs in Figure 6a,

the weighted concentration of SRW2 is much larger than that ofSRW3 for graphlets g4

5 , g46 , while for graphlet g4

3 , the weightedconcentration of SRW3 is slightly higher than that of SRW2. Forexample, the weighted concentration for g4

6 wiht SRW2 is about 8times higher than the original one while SRW3 only increases theconcentration 2 times higher. In another words, SRW2 increasesthe probability about 8 times that we get a sample of g4

6 comparedwith uniform sampling of graphlets while SRW3 only increases theprobability about 2 times. From Theorem 4 we know that if wefix the random walk type, more samples are needed to achievespecific accuracy for graphlets with smaller αk

i Cki . Hence the er-

ror of the estimation for rare graphlets is the major error source.If we can estimate the concentration of rare graphlets with higheraccuracy, the overall performance of the methods is less likely de-generated. With above discussion, we can expect better overallperformance from SRW2.Estimation Accuracy. Figure 6b and 6c show the simulationresults for 4-node graphlets. To estimate NRMSE, each simula-tion is run independently for 1000 times. We compare the perfor-mance between three methods, i.e., SRW2, SRW2CSS and SRW3.Here SRW3 is the method PSRW. We do not include SRW1 herebecause α4

2 = 0 with SRW1. From Figure 6b, we observe that theaveraged NRMSE of SRW2 is much smaller than SRW3. For allthe datasets in the figure, the NRMSE of SRW2 and SRW2CSS isat least 3 times less than that of SRW3. SRW2CSS outperformsall other methods in term of averaged NRMSE. Note that thesmallest NRMSE error for dataset “WikiTalk” is greater than 30even after 10000 random walk steps2. This is due to the extremelyhigh percentage of 3-star in “WikiTalk”. All other graphlet typesare at least two orders of magnitude less than 3-star. In Fig-ure 6c, we present the simulation results of NRMSE for eachgraphlet type. We fix the sample size at 10000. From this figurewe can see that SRW2 and SRW2CSS perform better than SRW3for almost all graphlet types. Furthermore, the CSS optimiza-tion technique improves the concentration estimates for all thegraphlet types. For each dataset, the smaller the concentrationvalue, the higher the estimation error, which is consistent withour theoretical analysis. In conclusion, our method SRW2CSS

²Here we use only 1000 simulations since SRW3 is slow.



(a) Comparison of the original concentration values with the weighted concentration values under the random walks SRW(2) and SRW(3). (plots omitted)

(b) NRMSE (averaged over all 4-node graphlets) of the concentration estimates vs. the number of random walk steps. SRW3 is the prior state-of-the-art method. Our proposed methods SRW2 and SRW2CSS perform best for all the datasets. (plots omitted)

(c) NRMSE (for each 4-node graphlet type) of the concentration estimates with 10000 random walk steps, for datasets "Epinion", "Slashdot", "Gowalla", and "WikiTalk". SRW3 is the prior state-of-the-art method. Our proposed methods SRW2 and SRW2CSS perform better for almost all the 4-node graphlet types. (plots omitted)

Figure 6: Simulation results for 4-node graphlets.

outperforms the prior state-of-the-art method SRW3 (PSRW) by up to two orders of magnitude in accuracy.

Figure 7: Real concentration values of 4-node graphlets for graphs "Epinion", "Slashdot", "Gowalla", and "WikiTalk". (plot omitted)

6.4 Estimating 5-node Graphlets Concentration

We present the simulation results for 5-node graphlets in this section. We plot the ground-truth concentrations of 5-node graphlets in Figure 8. Except for graph "BrightKite", the clique is the rarest type across the different graphs. Table 6 shows the average running time of performing 10000 random walk steps for the different methods; here we do not consider the API response delay. The time cost of the random walk consists of picking a random neighbor and identifying the graphlet types. From this table we know that the random walk on G(d) is faster when d is smaller: the random walk on G(2) is much faster than those on G(3) and G(4). We report the accuracy of the different methods in

Figure 8: Real concentration values of 5-node graphlets for graphs "BrightKite", "Facebook", "Epinion", and "Slashdot". (plot omitted)

Figure 9. All the simulations are run 1000 times. Because the random walk on G(4) (the prior state-of-the-art method) is too slow, we compare SRW4 with the other methods in our framework only for graphs "BrightKite" and "Epinion". Since SRW3CSS has a high computation cost, we do not report its simulation results. The NRMSE of methods SRW3 and SRW4 does not show a monotonically decreasing trend; this is due to the poor performance of these two methods in estimating rare graphlets. SRW2CSS outperforms SRW4 (PSRW) by about 6 times for graph "BrightKite" and about 17 times for graph "Epinion" in accuracy. We conclude that SRW2CSS performs better than SRW3 and SRW4 (PSRW) in both accuracy and time cost.

Experimental summary: Our framework is efficient for estimating k-node graphlet concentrations with a random walk on G(d), where 0 < d < k. For 3-node graphlets, SRW1CSSNB performs best for all the datasets in Table 5 and outperforms PSRW by up to 7 times (for graph "WikiTalk"). For 4- and 5-node graphlets, SRW2CSS performs better than PSRW both in time cost and



Figure 9: NRMSE (averaged over all 5-node graphlets) of the concentration estimates vs. the number of random walk steps, panels (a)-(d). SRW4 is the prior state-of-the-art method. Our method SRW2CSS performs best for all the datasets. (plots omitted)

Table 6: Runtimes of performing 10K random walk steps for different random walk methods.

Graph           SRW2     SRW2CSS   SRW3      SRW4
BrightKite [1]  9.7 ms   55.1 ms   135.5 ms  10.3 s
Epinion [16]    10.3 ms  34.3 ms   270.0 ms  25.7 s
Slashdot [16]   9.8 ms   25.3 ms   269.4 ms  23.7 s
Facebook [1]    10.9 ms  57.4 ms   107.1 ms  9.9 s

accuracy. SRW2CSS outperforms the prior art by up to two orders of magnitude (for graph "Gowalla") in accuracy. Due to space limits, we only present the results for part of the graphs in our datasets. Even though the performance of the different methods depends on the specific structure of the graphs, we consistently observe better performance from SRW1CSS for 3-node graphlets and from SRW2CSS for 4- and 5-node graphlet estimation.
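The sampling primitive underlying all of the methods above is a random walk whose consecutive states form the subgraph samples. The sketch below illustrates only that windowing step on a toy graph; the graphlet classification and the α-reweighting of our framework are omitted, and the graph itself is hypothetical:

```python
import random

def random_walk_samples(adj, start, n_steps, l):
    """Simple random walk on a graph; every window of l consecutive
    visited nodes is one sample X^(l) of the expanded Markov chain.
    Graphlet classification and reweighting are omitted here."""
    walk = [start]
    for _ in range(n_steps - 1):
        walk.append(random.choice(adj[walk[-1]]))
    # Sliding windows of l consecutive states.
    return [tuple(walk[i:i + l]) for i in range(len(walk) - l + 1)]

# Toy graph: a triangle with a pendant node.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
samples = random_walk_samples(adj, start=0, n_steps=1000, l=3)
print(len(samples))  # 998 windows of 3 consecutive nodes
```

Each window is connected by construction, since every consecutive pair of states is an edge traversed by the walk.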

7. CONCLUSION

In this paper, we propose a novel random walk-based framework, parameterized by a single parameter, to estimate graphlet concentrations. Our framework is general and can be applied to k-node graphlets for any k. We derive an analytic bound on the number of random walk steps required for convergence, and we introduce two optimization techniques to further improve the efficiency of the methods in the framework. Our experiments on many real-world networks show that, with an appropriate parameter, the methods in our framework outperform the prior method PSRW significantly in both accuracy and time cost.

8. REFERENCES

[1] KONECT Datasets: The Koblenz network collection. http://konect.uni-koblenz.de/networks/, May 2015.

[2] N. K. Ahmed, J. Neville, R. A. Rossi, and N. Duffield. Efficient graphlet counting for large networks. In ICDM, pages 1–10, 2015.

[3] L. Akoglu, H. Tong, and D. Koutra. Graph-based anomaly detection and description: A survey. Data Mining and Knowledge Discovery (DAMI), 2014.

[4] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD, pages 16–24, 2008.

[5] M. Bhuiyan, M. Rahman, and M. Al Hasan. GUISE: Uniform sampling of graphlets for large graph analysis. In ICDM, pages 91–100, 2012.

[6] K.-M. Chung, H. Lam, Z. Liu, and M. Mitzenmacher. Chernoff-Hoeffding bounds for Markov chains: Generalized and simplified. In STACS, pages 124–135, 2012.

[7] E. R. Elenberg, K. Shanmugam, M. Borokhovich, and A. G. Dimakis. Beyond triangles: A distributed framework for estimating 3-profiles of large graphs. In KDD, pages 229–238, 2015.

[8] C. J. Geyer. Markov chain Monte Carlo lecture notes. 1998.

[9] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A case study of unbiased sampling of OSNs. In INFOCOM, pages 1–9, 2010.

[10] M. Gjoka, E. Smith, and C. T. Butts. Estimating subgraph frequencies with or without attributes from egocentrically sampled data. CoRR, abs/1510.08119, 2015.

[11] S. J. Hardiman and L. Katzir. Estimating clustering coefficients and size of social networks via random walk. In WWW, pages 539–550, 2013.

[12] T. Hočevar and J. Demšar. A combinatorial approach to graphlet counting. Bioinformatics, pages 559–565, 2014.

[13] M. Jha, C. Seshadhri, and A. Pinar. Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In WWW, pages 495–505, 2015.

[14] L. Katzir, E. Liberty, and O. Somekh. Estimating sizes of social networks via biased sampling. In WWW, pages 597–606, 2011.

[15] C.-H. Lee, X. Xu, and D. Y. Eun. Beyond random walk and Metropolis-Hastings samplers: Why you should not backtrack for unbiased graph sampling. In SIGMETRICS, pages 319–330, 2012.

[16] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[17] R.-H. Li, J. Yu, L. Qin, R. Mao, and T. Jin. On random walk based graph sampling. In ICDE, pages 927–938, 2015.

[18] L. Lovász. Random walks on graphs: A survey. In Combinatorics, Paul Erdős is Eighty, pages 353–398. János Bolyai Mathematical Society, 1996.

[19] T. Milenković and N. Pržulj. Uncovering biological network function via graphlet degree signatures. Cancer Informatics, 6:257–273, 2008.

[20] T. Milenković, V. Memišević, A. K. Ganesan, and N. Pržulj. Systems-level cancer gene identification from protein interaction network topology applied to melanogenesis-related functional genomics data. Journal of The Royal Society Interface, 7:423–437, 2010.

[21] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.

[22] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[23] A. Mohaisen, A. Yun, and Y. Kim. Measuring the mixing time of social graphs. In IMC, pages 383–389, 2010.

[24] O. Häggström. Finite Markov Chains and Algorithmic Applications, 2000.

[25] A. B. Owen. Monte Carlo theory, methods and examples. 2013.

[26] N. Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, pages 853–854, 2010.



[27] N. Pržulj, D. G. Corneil, and I. Jurisica. Modeling interactome: Scale-free or geometric? Bioinformatics, 20:3508–3515, 2004.

[28] M. Rahman, M. Bhuiyan, and M. A. Hasan. GRAFT: An approximate graphlet counting algorithm for large graph analysis. In CIKM, pages 1467–1471, 2012.

[29] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics, pages 488–495, 2009.

[30] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In WWW, pages 607–614, 2011.

[31] J. Ugander, L. Backstrom, and J. Kleinberg. Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In WWW, pages 1307–1318, 2013.

[32] P. Wang, J. Lui, and D. Towsley. Minfer: Inferring motif statistics from sampled edges. arXiv:1502.06671, 2015.

[33] P. Wang, J. C. S. Lui, B. Ribeiro, D. Towsley, J. Zhao, and X. Guan. Efficiently estimating motif statistics of large networks. TKDE, pages 8:1–8:27, 2014.

[34] P. Wang, J. Tao, J. Zhao, and X. Guan. MOSS: A scalable tool for efficiently sampling and counting 4- and 5-node graphlets. CoRR, abs/1509.08089, 2015.

[35] X. Xu, C.-H. Lee, and D. Y. Eun. A general framework of hybrid graph sampling for complex network analysis. In INFOCOM, pages 2795–2803, 2014.

[36] Z. Zhou, N. Zhang, and G. Das. Leveraging history for faster sampling of online social networks. PVLDB, pages 1034–1045, 2015.

APPENDIX

A. PROOF OF THEOREM 3

Theorem 3. The stationary distribution $\pi_M$ exists and is unique. For any $X^{(l)} = (X_1, \cdots, X_l) \in M^{(l)}$, we have

$$\pi_M(X^{(l)}) = \begin{cases} \dfrac{d_{X_1}}{2|R^{(d)}|} & \text{if } l = 1,\\[4pt] \dfrac{1}{2|R^{(d)}|} & \text{if } l = 2,\\[4pt] \dfrac{1}{2|R^{(d)}|} \cdot \dfrac{1}{d_{X_2}} \cdots \dfrac{1}{d_{X_{l-1}}} & \text{if } l > 2. \end{cases} \qquad (3)$$

Proof. We first prove that there exists a unique stationary distribution for the expanded Markov chain. For any two states $X_i^{(l)} = (X_{i_1}, \cdots, X_{i_l})$ and $X_j^{(l)} = (X_{j_1}, \cdots, X_{j_l})$ in $M^{(l)}$, $X_{i_l}$ can reach $X_{j_1}$ in finitely many steps of the random walk on $G^{(d)}$. This implies that $X_i^{(l)}$ can reach $X_j^{(l)}$ in finitely many steps in the expanded Markov chain. Similarly, $X_j^{(l)}$ can reach $X_i^{(l)}$ in finitely many steps. We conclude that the expanded Markov chain is irreducible. Since any irreducible Markov chain has a unique stationary distribution [24, Theorem 5.3], there exists a unique stationary distribution for the expanded Markov chain.

To derive the closed form of $\pi_M$, we write the entry of the transition matrix $P_M$ of the expanded Markov chain for states $X_i^{(l)}$ and $X_j^{(l)}$ when $l > 1$ as follows:

$$P_M(X_i^{(l)}, X_j^{(l)}) = \begin{cases} \dfrac{1}{d_{X_{i_l}}} & \text{if } (X_{i_l}, X_{j_l}) \in R^{(d)} \text{ and } X_{i_q} = X_{j_{q-1}},\ 1 < q \le l,\\[4pt] 0 & \text{otherwise.} \end{cases}$$

For $\pi_M$ in Equation (3) one can verify that

$$\pi_M = \pi_M \cdot P_M \quad \text{and} \quad \sum_{X^{(l)} \in M^{(l)}} \pi_M(X^{(l)}) = 1.$$

According to Theorems 5.1 and 5.3 in [24, Chapter 5], $\pi_M$ in Equation (3) is the unique stationary distribution.
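The fixed-point property $\pi_M = \pi_M \cdot P_M$ is easy to check numerically for $l = 2$: the states are ordered edges $(u, v)$, the chain moves $(u, v) \to (v, w)$ with probability $1/d_v$, and Equation (3) predicts the uniform value $1/(2|R^{(d)}|)$. A small sketch on a hypothetical 4-node graph (non-bipartite, so the walk converges):

```python
# Toy undirected graph standing in for G^(d): a triangle with a pendant node.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
edges = [(u, v) for u in adj for v in adj[u]]  # ordered edges: 2|R^(d)| = 8 states

# For l = 2 the expanded chain's states are ordered edges (u, v),
# and it moves (u, v) -> (v, w) with probability 1/d_v.
def step(dist):
    new = {s: 0.0 for s in edges}
    for (u, v), p in dist.items():
        for w in adj[v]:
            new[(v, w)] += p / len(adj[v])
    return new

# Power iteration from an arbitrary start state.
dist = {s: (1.0 if s == (0, 1) else 0.0) for s in edges}
for _ in range(200):
    dist = step(dist)

# Equation (3) predicts pi_M(X^(2)) = 1/(2|R^(d)|) for every state.
expected = 1.0 / len(edges)
print(all(abs(p - expected) < 1e-9 for p in dist.values()))  # → True
```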

B. CHERNOFF BOUND FOR ESTIMATOR

Theorem 7 ([6, Theorem 3]). For a finite and ergodic Markov chain with state space $N$ and stationary distribution $\pi$, let $\tau = \tau(\epsilon)$ denote its $\epsilon$-mixing time for $0 < \epsilon \le 1/8$. Let $(X_1, \cdots, X_n)$ denote an $n$-step random walk starting from an initial distribution $\varphi$ on $N$. Define $\|\varphi\|_{\pi} = \sum_{i=1}^{|N|} \varphi_i^2 / \pi_i$. Denote the expectation of a function $f : N \to [0, 1]$ by $\mathbb{E}_{\pi}[f(X)] = \mu$, and define the total weight of the walk $(X_1, \cdots, X_n)$ by $Z \triangleq \sum_{s=1}^{n} f(X_s)$. There exists a constant $c$, independent of $\mu$, $\delta$, and $\epsilon$, such that for $0 < \delta < 1$,

$$\Pr\left[\left|\frac{Z}{n} - \mu\right| > \epsilon \mu\right] \le c\, \|\varphi\|_{\pi}\, e^{-\epsilon^2 \mu n / (72\tau)}.$$
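To get a feel for the magnitudes in Theorem 7, one can solve the tail bound for the number of steps $n$ that pushes it below a target $\delta$. All numbers below are hypothetical, including the constant $c$, which the theorem only asserts to exist:

```python
import math

def chernoff_bound(c, phi_norm, eps, mu, n, tau):
    """Right-hand side of the tail bound in Theorem 7:
    Pr[|Z/n - mu| > eps*mu] <= c * ||phi||_pi * exp(-eps^2 * mu * n / (72 * tau))."""
    return c * phi_norm * math.exp(-eps ** 2 * mu * n / (72.0 * tau))

def steps_needed(c, phi_norm, eps, mu, tau, delta):
    """Smallest n driving the bound down to delta (illustrative constants)."""
    return math.ceil(72.0 * tau / (eps ** 2 * mu) * math.log(c * phi_norm / delta))

# Hypothetical numbers: mixing time 50, mu = 0.1, 10% relative error, delta = 0.01.
n = steps_needed(c=1.0, phi_norm=4.0, eps=0.1, mu=0.1, tau=50, delta=0.01)
print(n, chernoff_bound(1.0, 4.0, 0.1, 0.1, n, 50) <= 0.01)  # n is about 2.2e7
```

Note the $1/(\epsilon^2 \mu)$ dependence: halving the target relative error quadruples the walk length, and rare graphlets (small $\mu$) dominate the cost, exactly as argued in Section 6.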

Lemma 8. For a random walk with stationary distribution $\pi$ and an expanded Markov chain defined on this random walk with stationary distribution $\pi_M$, if we start from a fixed node $X_0$ in $G^{(d)}$, then $\|\varphi\|_{\pi_M} = 1/\pi(X_0)$.

Proof. We start from $X_0$ and walk for $l$ steps to collect the first sample $X_0^{(l)} \in M^{(l)}$. Then the first element of $X_0^{(l)}$ must be $X_0$, and $\Pr[X_0^{(l)} = (X_0, \cdots, X_{l-1})] = \frac{1}{d_{X_0}} \cdots \frac{1}{d_{X_{l-2}}}$. We have

$$\|\varphi\|_{\pi_M} = \sum_{X_0^{(l)}} \left(\frac{1}{d_{X_0}} \cdots \frac{1}{d_{X_{l-2}}}\right)^2 \Big/ \left(\frac{1}{2|R^{(d)}|} \cdot \frac{1}{d_{X_1}} \cdots \frac{1}{d_{X_{l-2}}}\right) = \frac{1}{\pi(X_0)}.$$

Lemma 9. For $0 < \epsilon < 1/8$ and $0 < \delta < 1$, there exists a constant $\xi$ such that if $n \ge \xi \left(\frac{W}{\alpha_i^k C_i^k}\right) \frac{\tau'(\epsilon)}{\epsilon^2} \log\frac{\|\varphi\|_{\pi_M}}{\delta}$, we have

$$\Pr\left[\left|\hat{C}_i^k - C_i^k\right| / C_i^k > \frac{\epsilon}{3}\right] < \delta/2,$$

where $W$ is defined as $\max_{X^{(l)} \in M^{(l)}} w(X^{(l)})$, $\varphi$ is the initial distribution, and $\tau'(\epsilon)$ is the mixing time of the expanded Markov chain.

Proof. Let $f_i = \frac{w(X^{(l)}) h_i^k(X^{(l)})}{W}$, so that all values of $f_i$ are in $[0, 1]$. The expectation of $f_i$ is $\mu_i = \mathbb{E}_{\pi_M}[f_i] = \frac{\alpha_i^k C_i^k}{W}$. From Theorem 7, we have

$$\Pr\left[\left|\hat{C}_i^k / C_i^k - 1\right| > \frac{\epsilon}{3}\right] < c\, \|\varphi\|_{\pi_M}\, e^{-\epsilon^2 \mu_i n / (9 \cdot 72 \tau')}.$$

Solving $\frac{\delta}{2} = c\, \|\varphi\|_{\pi_M}\, e^{-\epsilon^2 \mu_i n / (9 \cdot 72 \tau')}$ for $n$, we obtain $n \ge \xi \left(\frac{W}{\alpha_i^k C_i^k}\right) \frac{\tau'(\epsilon)}{\epsilon^2} \log\frac{\|\varphi\|_{\pi_M}}{\delta}$.

Lemma 10. For $0 < \epsilon < 1/8$ and $0 < \delta < 1$, there exists a constant $\xi$ such that if $n \ge \xi \left(\frac{W}{\alpha_{\min} C^k}\right) \frac{\tau'(\epsilon)}{\epsilon^2} \log\frac{\|\varphi\|_{\pi_M}}{\delta}$, we have

$$\Pr\left[\left|\hat{C}^k - C^k\right| / C^k > \frac{\epsilon}{3}\right] < \delta/2,$$

where $W$ is defined as $\max_{X^{(l)} \in M^{(l)}} w(X^{(l)})$ and $\alpha_{\min} = \min_j \alpha_j^k$. Here $\tau'(\epsilon)$ is the mixing time of the expanded Markov chain starting from the initial distribution $\varphi$.

Proof. We define $f = \frac{w(X^{(l)})/\alpha(X^{(l)})}{W/\alpha_{\min}}$ if $\alpha(X^{(l)}) \ne 0$; otherwise we define $f = 0$. A one-line calculation shows that $\mu = \mathbb{E}_{\pi_M}[f] = \frac{\alpha_{\min} C^k}{W}$. Following the same argument as in Lemma 9, we can derive the bound on $n$.



Theorem 4. For any $0 < \epsilon \le 1/8$ and $0 < \delta < 1$, there exists a constant $\xi$ such that

$$\Pr[(1 - \epsilon) c_i^k \le \hat{c}_i^k \le (1 + \epsilon) c_i^k] > 1 - \delta \qquad (9)$$

when the sample size $n \ge \xi \left(\frac{W}{\Lambda}\right) \frac{\tau(\epsilon)}{\epsilon^2} \log\frac{\|\varphi\|_{\pi_M}}{\delta}$. Here $\Lambda = \min\{\alpha_i^k C_i^k, \alpha_{\min} C^k\}$, $\tau(\epsilon)$ is the mixing time of the original random walk, $\varphi$ is the initial distribution, and $\|\varphi\|_{\pi_M} = \sum_{X^{(l)} \in M^{(l)}} \frac{\varphi^2(X^{(l)})}{\pi_M(X^{(l)})}$.

Proof. We first find the number of steps $n_i$ that guarantees that $\hat{C}_i^k$ and $\hat{C}^k$ are within $\epsilon/3$ relative deviation from their expectations with probability greater than $1 - \delta/2$; see Lemmas 9 and 10 for details. Then, using the fact that

$$(1 - \epsilon) c_i^k \le \frac{(1 - \epsilon/3) C_i^k}{(1 + \epsilon/3) C^k} \le \frac{\hat{C}_i^k}{\hat{C}^k} \le \frac{(1 + \epsilon/3) C_i^k}{(1 - \epsilon/3) C^k} \le (1 + \epsilon) c_i^k, \qquad (11)$$

we can prove that with $n_i \ge \xi \left(\frac{W}{\Lambda}\right) \frac{\tau'(\epsilon)}{\epsilon^2} \log\frac{\|\varphi\|_{\pi_M}}{\delta}$, the estimator $\hat{c}_i^k$ is within $\epsilon$ relative deviation from its expectation with probability greater than $1 - \delta$. Here $\tau'(\epsilon)$ is the mixing time of the expanded Markov chain.

Second, we prove that $\tau'(\epsilon) = \tau(\epsilon)$. To see this, denote the initial distribution of the random walk by $\pi_0$; after $l - 1$ steps, we obtain the initial distribution of the expanded Markov chain. At time $t$, denote the distribution of the random walk by $\pi_t$ and the distribution of the expanded Markov chain by $\tilde{\pi}_{t-l+1}$. At time $t + l - 1$, we obtain the distribution $\tilde{\pi}_t$: for any state $X^{(l)} = (X_1, \cdots, X_l)$, we have $\tilde{\pi}_t(X^{(l)}) = \pi_t(X_1) \cdot \frac{1}{d_{X_1}} \cdots \frac{1}{d_{X_{l-1}}}$. Hence the variation distances satisfy $|\pi - \pi_t|_1 = |\pi_M - \tilde{\pi}_t|_1$, and thus $\tau(\epsilon) = \tau'(\epsilon)$. This ends the proof.

C. PROOF OF LEMMA 6

Lemma 6. The variance of the function $w_S(X^{(l)}) h_i^k(X^{(l)})$ is smaller than that of the function $\frac{1}{\alpha_i^k} w(X^{(l)}) h_i^k(X^{(l)})$ under the same stationary distribution $\pi_M$. More specifically, we have

$$\mathrm{Var}_{\pi_M}[w_S(X^{(l)}) h_i^k(X^{(l)})] \le \mathrm{Var}_{\pi_M}\left[\frac{1}{\alpha_i^k} w(X^{(l)}) h_i^k(X^{(l)})\right].$$

Proof. The variance of a random variable $X$ can be expanded as $\mathbb{E}[X^2] - (\mathbb{E}[X])^2$. Since the expectations of the two functions are equal, we only need to prove that

$$\mathbb{E}_{\pi_M}[(w_S(X^{(l)}) h_i^k(X^{(l)}))^2] \le \mathbb{E}_{\pi_M}\left[\left(\frac{1}{\alpha_i^k} w(X^{(l)}) h_i^k(X^{(l)})\right)^2\right].$$

Note that the left-hand side of the inequality can be rewritten as

$$\frac{1}{(2|R^{(d)}|)^2} \sum_{s \simeq g_i^k} \frac{1}{\sum_{X_j^{(l)} \in C(s)} \pi_M(X_j^{(l)})},$$

while the right-hand side can be represented as

$$\frac{1}{(2|R^{(d)}|)^2} \sum_{s \simeq g_i^k} \frac{1}{(\alpha_i^k)^2} \sum_{X_j^{(l)} \in C(s)} \frac{1}{\pi_M(X_j^{(l)})}.$$

Using the fact that the harmonic mean is not greater than the arithmetic mean, we have, for each $s$,

$$\frac{1}{\sum_{X_j^{(l)} \in C(s)} \pi_M(X_j^{(l)})} \le \frac{1}{(\alpha_i^k)^2} \sum_{X_j^{(l)} \in C(s)} \frac{1}{\pi_M(X_j^{(l)})}.$$

This ends the proof.
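The harmonic-mean/arithmetic-mean step above is easy to sanity-check numerically; the sketch below draws random positive values standing in for the $\pi_M(X_j^{(l)})$ (purely illustrative):

```python
import random

def amhm_holds(pi):
    """Check 1/sum(pi) <= (1/alpha^2) * sum(1/pi) for alpha = len(pi):
    the key AM-HM inequality in the proof of Lemma 6."""
    alpha = len(pi)
    return 1.0 / sum(pi) <= sum(1.0 / p for p in pi) / alpha ** 2 + 1e-12

random.seed(1)
trials = [[random.uniform(1e-6, 1.0) for _ in range(random.randint(1, 10))]
          for _ in range(1000)]
print(all(amhm_holds(pi) for pi in trials))  # → True
```

Equality holds exactly when all the $\pi_M(X_j^{(l)})$ in $C(s)$ are equal, which is why the weighted estimator never increases the variance.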
