

Estimating Diffusion Network Structures: Recovery Conditions, Sample Complexity & Soft-thresholding Algorithm

Hadi Daneshmand 1 (hadi.daneshmand@tue.mpg.de), Manuel Gomez-Rodriguez 1 (manuelgr@tue.mpg.de), Le Song 2 (lsong@cc.gatech.edu), Bernhard Schölkopf 1 (bs@tue.mpg.de)
1 MPI for Intelligent Systems and 2 Georgia Institute of Technology

Abstract

Information spreads across social and technological networks, but often the network structures are hidden from us and we only observe the traces left by the diffusion processes, called cascades. Can we recover the hidden network structures from these observed cascades? What kind of cascades and how many cascades do we need? Are there some network structures which are more difficult than others to recover? Can we design efficient inference algorithms with provable guarantees?

Despite the increasing availability of cascade data and methods for inferring networks from these data, a thorough theoretical understanding of the above questions remains largely unexplored in the literature. In this paper, we investigate the network structure inference problem for a general family of continuous-time diffusion models using an ℓ1-regularized likelihood maximization framework. We show that, as long as the cascade sampling process satisfies a natural incoherence condition, our framework can recover the correct network structure with high probability if we observe O(d^3 log N) cascades, where d is the maximum number of parents of a node and N is the total number of nodes. Moreover, we develop a simple and efficient soft-thresholding inference algorithm, which we use to illustrate the consequences of our theoretical results, and show that our framework outperforms other alternatives in practice.

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

1. Introduction

Diffusion of information, behaviors, diseases, or, more generally, contagions can be naturally modeled as a stochastic process that occurs over the edges of an underlying network (Rogers, 1995). In this scenario, we often observe the temporal traces that the diffusion generates, called cascades, but the edges of the network that gave rise to the diffusion remain unobservable (Adar & Adamic, 2005). For example, blogs or media sites often publish a new piece of information without explicitly citing their sources. Marketers may note when a social media user decides to adopt a new behavior but cannot tell which neighbor in the social network influenced them to do so. Epidemiologists observe when a person gets sick but usually cannot tell who infected her. In all these cases, given a set of cascades and a diffusion model, the network inference problem consists of inferring the edges (and model parameters) of the unobserved underlying network (Gomez-Rodriguez, 2013).

The network inference problem has attracted significant attention in recent years (Saito et al., 2009; Gomez-Rodriguez et al., 2010; 2011; Snowsill et al., 2011; Du et al., 2012a), since it is essential to reconstruct and predict the paths over which information can spread, and to maximize sales of a product or stop infections. Most previous work has focused on developing network inference algorithms and evaluating their performance experimentally on different synthetic and real networks; a rigorous theoretical analysis of the problem has been missing. However, such analysis is of outstanding interest since it would enable us to answer many fundamental open questions. For example, which conditions are sufficient to guarantee that we can recover a network given a large number of cascades? If these conditions are satisfied, how many cascades are sufficient to infer the network with high probability? Until recently, there has been a paucity of work in this direction (Netrapalli & Sanghavi, 2012; Abrahao et al., 2013), and the existing studies provide only partial views of the problem. None of them is able to identify the recovery


condition relating to the interaction between the network structure and the cascade sampling process, which we will make precise in our paper.

Overview of results. We consider the network inference problem under the continuous-time diffusion model recently introduced by Gomez-Rodriguez et al. (2011). We identify a natural incoherence condition for such a model which depends on the network structure, the diffusion parameters, and the sampling process of the cascades. This condition captures the intuition that we can recover the network structure if the co-occurrence of a node and its non-parent nodes is small in the cascades. Furthermore, we show that, if this condition holds for the population case, we can recover the network structure using an ℓ1-regularized maximum likelihood estimator and O(d^3 log N) cascades, and the probability of success approaches 1 at a rate exponential in the number of cascades. Importantly, if this condition also holds for the finite sample case, then the guarantee can be improved to O(d^2 log N) cascades. Beyond these theoretical results, we also propose a new, efficient, and simple proximal gradient algorithm to solve the ℓ1-regularized maximum likelihood estimation. The algorithm is especially well-suited for our problem since it is highly scalable and naturally finds sparse estimators, as desired, by using soft-thresholding. Using this algorithm, we perform various experiments illustrating the consequences of our theoretical results and demonstrating that it typically outperforms other state-of-the-art algorithms.

Related work. Netrapalli & Sanghavi (2012) propose a maximum likelihood network inference method for a variation of the discrete-time independent cascade model (Kempe et al., 2003) and show that, for general networks satisfying a correlation decay, the estimator recovers the network structure given O(d^2 log N) cascades, with the probability of success approaching 1 at a rate exponential in the number of cascades. The rate they obtain is on a par with our results. However, their discrete diffusion model is less realistic in practice, and the correlation decay condition is rather restrictive: essentially, on average each node can only infect one single node per cascade. Instead, we use a general continuous-time diffusion model (Gomez-Rodriguez et al., 2011), which has been extensively validated on real diffusion data and extended in various ways by different authors (Wang et al., 2012; Du et al., 2012a;b).

Abrahao et al. (2013) propose a simple network inference method, First-Edge, for a slightly different continuous-time independent cascade model (Gomez-Rodriguez et al., 2010), and show that, for general networks, if the cascade sources are chosen uniformly at random, the algorithm needs O(Nd log N) cascades to recover the network structure, and the probability of success approaches 1 only at a rate polynomial in the number of cascades. Additionally, they study trees and bounded-degree networks and show that, if the cascade sources are chosen uniformly at random, the error decreases polynomially as long as O(log N) and Ω(d^9 log^2 d log N) cascades are recorded, respectively. In our work, we show that, for general networks satisfying a natural incoherence condition, our method outperforms the First-Edge algorithm and the algorithm for bounded-degree networks in terms of rate and sample complexity.

Figure 1. The diffusion network structure (left) is unknown and we only observe cascades, which are N-dimensional vectors recording the times when nodes get infected by contagions that spread (right). Cascade 1 is (t_a, t_b, t_c, ∞, ∞, ∞), where t_a < t_c < t_b, and cascade 2 is (∞, t_b, ∞, t_d, t_e, t_f), where t_b < t_d < t_e < t_f. Each cascade contains a source node (dark red), drawn from a source distribution P(s), as well as infected (light red) and uninfected (white) nodes, and it provides information on black and dark gray edges but not on light gray edges.

Gripon & Rabbat (2013) propose a network inference method for unordered cascades, in which nodes that are infected together in the same cascade are connected by a path containing exactly the nodes in the trace, and give necessary and sufficient conditions for network inference. However, they consider a restrictive, unrealistic scenario in which cascades are all three nodes long.

2. Continuous-Time Diffusion Model

In this section, we revisit the continuous-time generative model for cascade data introduced by Gomez-Rodriguez et al. (2011). The model associates each edge j → i with a transmission function, f(t_i | t_j; α_ji) = f(t_i − t_j; α_ji), a density over time parameterized by α_ji. This is in contrast to previous discrete-time models, which associate each edge with a fixed infection probability (Kempe et al., 2003). Moreover, it also differs from discrete-time models in the sense that events in a cascade are not generated iteratively in rounds; instead, event timings are sampled directly from the transmission functions in the continuous-time model.

2.1. Cascade generative process

Given a directed contact network, G = (V, E) with N nodes, the process begins with an infected source node, s, initially adopting a certain contagion at time zero, which we draw from a source distribution P(s). The contagion is transmitted from the source along her outgoing edges to her direct neighbors. Each transmission through an edge j → i entails a random transmission time, τ = t_i − t_j, drawn from an associated transmission function f(τ; α_ji). We assume transmission times are independent, possibly distributed differently across edges, and, in some cases, can be arbitrarily large, τ → ∞. Then, the infected neighbors transmit the contagion to their respective neighbors, and the process continues. We assume that an infected node remains infected for the entire diffusion process. Thus, if a node i is infected by multiple neighbors, only the neighbor that first infects node i will be the true parent. Figure 1 illustrates the process.

Table 1. Functions.

Function      | Infected node (t_i < T)                                | Uninfected node (t_i > T)
g_i(t; α)     | log h(t; α) + Σ_{j: t_j < t_i} y(t_i | t_j; α_j)       | Σ_{j: t_j < T} y(T | t_j; α_j)
[∇y(t; α)]_k  | −y′(t_i | t_k; α_k)                                    | −y′(T | t_k; α_k)
[D(t; α)]_kk  | −y″(t_i | t_k; α_k) − h(t; α)^{−1} H″(t_i | t_k; α_k)  | −y″(T | t_k; α_k)

Observations from the model are recorded as a set C^n of cascades {t^1, . . . , t^n}. Each cascade t^c is an N-dimensional vector t^c := (t^c_1, . . . , t^c_N) recording when nodes are infected, t^c_k ∈ [0, T^c] ∪ {∞}. The symbol ∞ labels nodes that are not infected during the observation window [0, T^c]; it does not imply they are never infected. The 'clock' is reset to 0 at the start of each cascade. We assume T^c = T for all cascades; the results generalize trivially.
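As a concrete illustration, the generative process above can be sketched in a few lines of Python. The exponential transmission functions and the dictionary-of-rates representation `alpha[j][i]` are our own illustrative choices, not part of the paper's specification:

```python
import random

INF = float("inf")  # marks nodes not infected within [0, T]

def simulate_cascade(alpha, N, source, T, rng=random):
    """Draw one cascade from the continuous-time model of Sec. 2.1,
    assuming exponential transmission functions f(tau; a) = a*exp(-a*tau).
    alpha maps each node j to a dict {i: a_ji} of outgoing rates."""
    t = [INF] * N
    t[source] = 0.0
    done = set()
    while True:
        # process the earliest infected-but-unhandled node next, so its
        # infection time is final when it attempts transmissions
        cand = [v for v in range(N) if t[v] <= T and v not in done]
        if not cand:
            break
        j = min(cand, key=lambda v: t[v])
        done.add(j)
        for i, a in alpha.get(j, {}).items():
            ti = t[j] + rng.expovariate(a)  # tau = t_i - t_j ~ Exp(a_ji)
            if ti < t[i]:  # the earliest transmission defines the true parent
                t[i] = ti
    return [ti if ti <= T else INF for ti in t]
```

On a chain 0 → 1 → 2 with source 0, the returned infection times are monotone along the path, mirroring cascade 1 in Figure 1; nodes the contagion does not reach within [0, T] keep the label ∞.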

2.2. Likelihood of a cascade

Gomez-Rodriguez et al. (2011) showed that the likelihood of a cascade t under the continuous-time independent cascade model is

f(t; A) = ∏_{t_i ≤ T} ∏_{t_m > T} S(T | t_i; α_im) × ∏_{k: t_k < t_i} S(t_i | t_k; α_ki) Σ_{j: t_j < t_i} H(t_i | t_j; α_ji),   (1)

where A = {α_ji} denotes the collection of parameters, S(t_i | t_j; α_ji) = 1 − ∫_{t_j}^{t_i} f(t | t_j; α_ji) dt is the survival function, and H(t_i | t_j; α_ji) = f(t_i | t_j; α_ji) / S(t_i | t_j; α_ji) is the hazard function. The survival terms in the first line account for the probability that uninfected nodes survive to all infected nodes in the cascade up to T, and the survival and hazard terms in the second line account for the likelihood of the infected nodes. Then, assuming cascades are sampled independently, the likelihood of a set of cascades is the product of the likelihoods of the individual cascades given by Eq. 1. For notational simplicity, we define y(t_i | t_k; α_ki) := log S(t_i | t_k; α_ki), and h(t; α_i) := Σ_{k: t_k < t_i} H(t_i | t_k; α_ki) if t_i ≤ T and 0 otherwise.
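To make the notation concrete, here is a hedged sketch of the per-node term g_i from Table 1 in the exponential case, where y(t_i | t_k; a) = −a (t_i − t_k) and H(t_i | t_k; a) = a; the function name and dictionary representation are ours:

```python
import math

INF = float("inf")

def g_i(i, t, alpha_i, T):
    """Per-node log-likelihood term g_i(t; alpha_i) of Table 1 under
    exponential transmission functions.
    t       : infection-time vector for one cascade (INF = uninfected)
    alpha_i : dict mapping candidate parent j to the rate a_ji >= 0"""
    if t[i] <= T:
        # infected node: log-hazard of the infection event plus the
        # log-survival w.r.t. every earlier-infected candidate parent
        parents = [j for j in alpha_i if t[j] < t[i]]
        log_hazard = math.log(sum(alpha_i[j] for j in parents))
        log_survival = -sum(alpha_i[j] * (t[i] - t[j]) for j in parents)
        return log_hazard + log_survival
    # uninfected node: survives all infected candidate parents up to T
    return -sum(alpha_i[j] * (T - t[j]) for j in alpha_i if t[j] < T)
```

Summing −g_i over cascades and dividing by n yields the per-node objective ℓ^n(α_i) used in the next section.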

3. Network Inference Problem

Consider an instance of the continuous-time diffusion model defined above with a contact network G* = (V*, E*) and associated parameters {α*_ji}. We denote the set of parents of node i as N⁻(i) = {j ∈ V* : α*_ji > 0}, with cardinality d_i = |N⁻(i)|, and the minimum positive transmission rate as α*_min,i = min_{j: α*_ji > 0} α*_ji. Let C^n be a set of n cascades sampled from the model, where the source s ∈ V* of each cascade is drawn from a source distribution P(s). Then, the network inference problem consists of finding the directed edges and the associated parameters using only the temporal information from the set of cascades C^n.

This problem has been cast as a maximum likelihood estimation problem (Gomez-Rodriguez et al., 2011):

minimize_A  −(1/n) Σ_{c ∈ C^n} log f(t^c; A)
subject to  α_ji ≥ 0, i, j = 1, . . . , N, i ≠ j,   (2)

where the inferred edges in the network correspond to those pairs of nodes with non-zero parameters, i.e., α_ji > 0.

In fact, the problem in Eq. 2 decouples into a set of independent smaller subproblems, one per node, where we infer the parents of each node and the parameters associated with these incoming edges. Without loss of generality, for a particular node i, we solve the problem

minimize_{α_i}  ℓ^n(α_i)
subject to  α_ji ≥ 0, j = 1, . . . , N, i ≠ j,   (3)

where α_i := {α_ji | j = 1, . . . , N, i ≠ j} are the relevant variables, and ℓ^n(α_i) = −(1/n) Σ_{c ∈ C^n} g_i(t^c; α_i) corresponds to the terms in Eq. 2 involving α_i (see also Table 1 for the definition of g_i( · ; α_i)). In this subproblem, we only need to consider a super-neighborhood V_i = R_i ∪ U_i of i, with cardinality p_i = |V_i| ≤ N, where R_i is the set of upstream nodes from which i is reachable, and U_i is the set of nodes which are reachable from at least one node j ∈ R_i. Here, we consider a node i to be reachable from a node j if and only if there is a directed path from j to i. We can skip all nodes in V \ V_i from our analysis because they will never be infected in a cascade before i, and thus the maximum likelihood estimates of the associated transmission rates will always be zero (and correct).

Below, we show that, as n → ∞, the solution, α̂_i, of the problem in Eq. 3 is a consistent estimator of the true parameter α*_i. However, it is not clear whether it is possible to recover the true network structure with this approach given a finite amount of cascades and, if so, how many cascades are needed. We will show that by adding an ℓ1-regularizer to the objective function and solving instead the following optimization problem

minimize_{α_i}  ℓ^n(α_i) + λ_n ‖α_i‖_1
subject to  α_ji ≥ 0, j = 1, . . . , N, i ≠ j,   (4)

we can provide finite sample guarantees for recovering the network structure (and parameters). Our analysis also shows that by selecting an appropriate value for the regularization parameter λ_n, the solution of Eq. 4 successfully recovers the network structure with probability approaching 1 exponentially fast in n.
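The soft-thresholding update behind the algorithm alluded to in the overview can be sketched as follows. This is a generic proximal-gradient step for Eq. 4 (a gradient step on the negative log-likelihood followed by the prox of λ_n‖·‖_1 restricted to α ≥ 0), written by us as an illustration rather than the authors' exact implementation:

```python
def soft_threshold_step(alpha, grad, step, lam):
    """One proximal-gradient update for Eq. 4. On the nonnegative
    orthant, the prox of lam*||.||_1 is nonnegative soft-thresholding:
    alpha_j <- max(0, alpha_j - step*(grad_j + lam))."""
    return [max(0.0, a - step * (g + lam)) for a, g in zip(alpha, grad)]

def prox_gradient(alpha0, grad_fn, step, lam, iters=500):
    """Minimize l^n(alpha) + lam*||alpha||_1 s.t. alpha >= 0 with a
    fixed step size; grad_fn(alpha) returns the gradient of l^n."""
    alpha = list(alpha0)
    for _ in range(iters):
        alpha = soft_threshold_step(alpha, grad_fn(alpha), step, lam)
    return alpha
```

Sparsity arises automatically: any coordinate whose gradient pull is weaker than λ_n is clipped to exactly zero, which is how the inferred edge set is read off the estimator.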

In the remainder of the paper, we will focus on estimating the parent nodes of a particular node i. For simplicity, we will use α = α_i, α_j = α_ji, N⁻ = N⁻(i), R = R_i, U = U_i, d = d_i, p = p_i, and α*_min = α*_min,i.

4. Consistency

Can we recover the hidden network structures from the observed cascades? The answer is yes. We will show this by proving that the estimator provided by Eq. 3 is consistent: as the number of cascades goes to infinity, we can always recover the true network structure.

More specifically, Gomez-Rodriguez et al. (2011) showed that the network inference problem defined in Eq. 3 is convex in α if the survival functions are log-concave and the hazard functions are concave in α. Under these conditions, the Hessian matrix, Q^n = ∇²ℓ^n(α), can be expressed as the sum of a nonnegative diagonal matrix D^n and the outer product of a matrix X^n(α) with itself, i.e.,

Q^n = D^n(α) + (1/n) X^n(α) [X^n(α)]^⊤.   (5)

Here the diagonal matrix D^n(α) = (1/n) Σ_c D(t^c; α) is a sum over a set of diagonal matrices D(t^c; α), one for each cascade c (see Table 1 for the definition of its entries), and X^n(α) is the Hazard matrix

X^n(α) = [ X(t^1; α) | X(t^2; α) | . . . | X(t^n; α) ],   (6)

with each column X(t^c; α) := h(t^c; α)^{−1} ∇_α h(t^c; α). Intuitively, the Hessian matrix captures the co-occurrence information of nodes in cascades. Then, we can prove:

Theorem 1  If the source probability P(s) is strictly positive for all s ∈ R, then the maximum likelihood estimator α̂ given by the solution of Eq. 3 is consistent.

Proof  We check the three criteria for consistency: continuity, compactness, and identification of the objective function (Newey & McFadden, 1994). Continuity is obvious. For compactness, since the log-likelihood L → −∞ for both α_ij → 0 and α_ij → ∞ for all i, j, we lose nothing by imposing upper and lower bounds, thus restricting to a compact subset. For the identification condition, α ≠ α* ⇒ ℓ^n(α) ≠ ℓ^n(α*), we use Lemmas 9 and 10 (refer to Appendices A and B), which establish that X^n(α) has full row rank as n → ∞, and hence Q^n is positive definite.

5. Recovery Conditions

In this section, we will find a set of sufficient conditions on the diffusion model and the cascade sampling process under which we can recover the network structure from finite samples. These results allow us to address two questions:

• Are there some network structures which are more difficult than others to recover?

• What kind of cascades are needed for network structure recovery?

The answers to these questions are intertwined. The difficulty of finite-sample recovery depends crucially on an incoherence condition which is a function of the network structure, the parameters of the diffusion model, and the cascade sampling process. Intuitively, the sources of the cascades in a diffusion network have to be chosen in such a way that nodes without a parent-child relation co-occur less often than nodes with such a relation. Many commonly used diffusion models and network structures can be naturally made to satisfy this condition.

More specifically, we first place two conditions on the Hessian of the population log-likelihood, E_c[ℓ^n(α)] = −E_c[g(t^c; α)], where the expectation is taken over the distribution P(s) of the source nodes and the density f(t^c | s) of the cascades t^c given a source node s. We will further denote the Hessian of E_c[ℓ^n(α)] evaluated at the true model parameter α* as Q*. Then, we place two conditions on the Lipschitz continuity of X(t^c; α), and on the boundedness of X(t^c; α*) and ∇g(t^c; α*) at the true model parameter α*. For simplicity, we will denote the subset of indexes associated with node i's true parents as S, and its complement as S^c. Then, we use Q*_SS to denote the sub-matrix of Q* indexed by S, and α*_S the set of parameters indexed by S.

Condition 1 (Dependency condition): There exist constants C_min > 0 and C_max > 0 such that Λ_min(Q*_SS) ≥ C_min and Λ_max(Q*_SS) ≤ C_max, where Λ_min(·) and Λ_max(·) return the smallest and largest eigenvalues of their argument, respectively. This assumption ensures that two connected nodes co-occur reasonably frequently in the cascades but are not deterministically related.

Condition 2 (Incoherence condition): There exists ε ∈ (0, 1] such that |||Q*_{S^c S} (Q*_SS)^{−1}|||_∞ ≤ 1 − ε, where |||A|||_∞ = max_j Σ_k |A_jk|. This assumption captures the intuition that node i and any of its neighbors should get infected together in a cascade more often than node i and any of its non-neighbors.
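Given any estimate of the Hessian, Condition 2 can be checked numerically. The sketch below is our own: a plain Gauss-Jordan inverse plus the max-absolute-row-sum norm |||·|||_∞, with hypothetical function names:

```python
def invert(M):
    """Gauss-Jordan inverse of a small nonsingular square matrix."""
    n = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]   # partial pivoting
        p = A[col][col]
        A[col] = [x / p for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0.0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
    return [row[n:] for row in A]

def incoherence_slack(Q, S):
    """Return eps such that |||Q_{S^c,S} (Q_{S,S})^{-1}|||_inf = 1 - eps.
    A positive return value means Condition 2 holds for this Q."""
    Sc = [k for k in range(len(Q)) if k not in S]
    inv = invert([[Q[a][b] for b in S] for a in S])
    norm = 0.0
    for a in Sc:
        row = [Q[a][b] for b in S]
        rowsum = sum(abs(sum(row[k] * inv[k][j] for k in range(len(S))))
                     for j in range(len(S)))
        norm = max(norm, rowsum)
    return 1.0 - norm
```

For example, on a 3-node Hessian where nodes 0 and 1 are the parents and node 2 co-occurs only weakly with them, the slack is strictly positive, so Condition 2 holds.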

Condition 3 (Lipschitz continuity): For any feasible cascade t^c, the Hazard vector X(t^c; α) is Lipschitz continuous in the domain {α : α_S ≥ α*_min / 2},

‖X(t^c; β) − X(t^c; α)‖_2 ≤ k_1 ‖β − α‖_2,

where k_1 is some positive constant. As a consequence, the spectral norm of the difference, n^{−1/2} (X^n(β) − X^n(α)), is also bounded (refer to Appendix C), i.e.,

|||n^{−1/2} (X^n(β) − X^n(α))|||_2 ≤ k_1 ‖β − α‖_2.   (7)

Furthermore, for any feasible cascade t^c, D(t^c; α)_jj is Lipschitz continuous for all j ∈ V,

|D(t^c; β)_jj − D(t^c; α)_jj| ≤ k_2 ‖β − α‖_2,

where k_2 is some positive constant.

Condition 4 (Boundedness): For any feasible cascade t^c, the absolute value of each entry in the gradient of its log-likelihood and in the Hazard vector, as evaluated at the true model parameter α*, is bounded:

‖∇g(t^c; α*)‖_∞ ≤ k_3,  ‖X(t^c; α*)‖_∞ ≤ k_4,

where k_3 and k_4 are positive constants. Then the absolute value of each entry in the Hessian matrix Q* is also bounded, |||Q*|||_∞ ≤ k_5.

Remarks for condition 1  As stated in Theorem 1, as long as the source probability P(s) is strictly positive for all s ∈ R, the maximum likelihood formulation is strictly convex, and thus there exists C_min > 0 such that Λ_min(Q*) ≥ C_min. Moreover, condition 4 implies that there exists C_max > 0 such that Λ_max(Q*) ≤ C_max.

Remarks for condition 2  The incoherence condition depends, in a non-trivial way, on the network structure, diffusion parameters, observation window, and source node distribution. Here, we give some intuition by studying three small canonical examples.

First, consider the chain graph in Fig. 2(a) and assume that we would like to find the incoming edges to node 3 when T → ∞. Then, it is easy to show that the incoherence condition is satisfied if (P_0 + P_1)/(P_0 + P_1 + P_2) < 1 − ε and P_0/(P_0 + P_1 + P_2) < 1 − ε, where P_i denotes the probability of node i being the source of a cascade. Thus, for example, if the source of each cascade is chosen uniformly at random, the inequalities are satisfied. Here, the incoherence condition depends on the source node distribution.

Second, consider the directed tree in Fig. 2(b) and assume that we would like to find the incoming edges to node 0 when T → ∞. Then, it can be shown that the incoherence condition is satisfied as long as (1) P_1 > 0, (2) (P_2 > 0) or (P_5 > 0 and P_6 > 0), and (3) P_3 > 0. As in the chain, the condition depends on the source node distribution.

Finally, consider the star graph in Fig. 2(c), with exponential edge transmission functions, and assume that we would like to find the incoming edges to a leaf node i when T < ∞. Then, as long as the root node has a nonzero probability P_0 > 0 of being the source of a cascade, it can be shown that the incoherence condition reduces to the inequalities

(1 − α_0j/(α_0i + α_0j)) e^{−(α_0i + α_0j) T} + α_0j/(α_0i + α_0j) < 1 − ε (1 + e^{−α_0i T}),  j = 1, . . . , p : j ≠ i,

which always hold for some ε > 0. If T → ∞, then the condition holds whenever ε < α_0i/(α_0i + max_{j: j ≠ i} α_0j). Here, the larger the ratio max_{j: j ≠ i} α_0j / α_0i is, the smaller the maximum value of ε for which the incoherence condition holds. To summarize, as long as P_0 > 0, there is always some ε > 0 for which the condition holds, and this ε value depends on the time window and the parameters α_0j.

Figure 2. Example networks: (a) chain, (b) tree, (c) star.

Remarks for conditions 3 and 4  Well-known pairwise transmission likelihoods such as exponential, Rayleigh, or power-law, used in previous work (Gomez-Rodriguez et al., 2011), satisfy conditions 3 and 4.

6. Sample Complexity

How many cascades do we need to recover the network structure? We will answer this question by providing a sample complexity analysis of the optimization in Eq. 4. Given the conditions spelled out in Section 5, we can show that the number of cascades needs to grow polynomially in the number of true parents of a node, and depends only logarithmically on the size of the network. This is a positive result, since the network size can be very large (millions or billions of nodes), but the number of parents of a node is usually small compared to the network size. More specifically, for each individual node, we have the following result:

Theorem 2  Consider an instance of the continuous-time diffusion model with parameters α*_ji and associated edges E* such that the model satisfies conditions 1-4, and let C^n be a set of n cascades drawn from the model. Suppose that the regularization parameter λ_n is selected to satisfy

λ_n ≥ 8 k_3 ((2 − ε)/ε) √(log p / n).   (8)

Then, there exist positive constants L and K, independent of (n, p, d), such that if

n > L d^3 log p,   (9)

then the following properties hold with probability at least 1 − 2 exp(−K λ_n^2 n):

1. For each node i ∈ V, the ℓ1-regularized network inference problem defined in Eq. 4 has a unique solution, and so uniquely specifies a set of incoming edges of node i.

2. For each node i ∈ V, the estimated set of incoming edges does not include any false edges and includes all true edges.

Furthermore, suppose that the finite sample Hessian matrix Q^n satisfies conditions 1 and 2. Then there exist positive constants L and K, independent of (n, p, d), such that the sample complexity can be improved to n > L d^2 log p, with the other statements remaining the same.

Remarks. The above sample complexity is proved for each node separately, for recovering its parents. Using a union bound, we can provide the sample complexity for recovering the entire network structure by joining these parent-child relations together. The resulting sample complexity and the choice of regularization parameters will remain largely the same, except that the dependency on d will change from d to d_max (the largest number of parents of a node), and the dependency on p will change from log p to 2 log N (N being the number of nodes in the network).
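The scaling in Eq. 8 is easy to instantiate. The helper below (our own naming) computes the smallest admissible regularization parameter and makes the 1/√n decay explicit:

```python
import math

def lambda_n(n, p, eps, k3):
    """Smallest regularization parameter satisfying Eq. 8:
    lambda_n = 8 * k3 * (2 - eps)/eps * sqrt(log(p)/n)."""
    return 8.0 * k3 * (2.0 - eps) / eps * math.sqrt(math.log(p) / n)
```

Note that with this choice, lambda_n shrinks as 1/sqrt(n) while lambda_n^2 * n stays proportional to log p, so the exponent in the success probability 1 − 2 exp(−K λ_n^2 n) does not degrade as more cascades are collected.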

6.1. Outline of Analysis

The proof of Theorem 2 uses a technique called the primal-dual witness method, previously used in the proof of sparsistency of Lasso (Wainwright, 2009) and high-dimensional Ising model selection (Ravikumar et al., 2010). To the best of our knowledge, the present work is the first that uses this technique in the context of diffusion network inference. First, we show that the optimal solutions to Eq. 4 have a shared sparsity pattern, and under a further condition, the solution is unique (proven in Appendix D):

Lemma 3  Suppose that there exists an optimal primal-dual solution (α̂, μ̂) to Eq. 4 with an associated subgradient vector ẑ such that ‖ẑ_{S^c}‖_∞ < 1. Then, any optimal primal solution α must have α_{S^c} = 0. Moreover, if the Hessian sub-matrix Q^n_SS is strictly positive definite, then α̂ is the unique optimal solution.

Next, we will construct a primal-dual vector (α̂, μ̂) along with an associated subgradient vector ẑ. Furthermore, we will show that, under the assumptions on (n, p, d) stated in Theorem 2, our constructed solution satisfies the KKT optimality conditions of Eq. 4, and the primal vector has the same sparsity pattern as the true parameter α*, i.e.,

α̂_j > 0, ∀j : α*_j > 0,   (10)
α̂_j = 0, ∀j : α*_j = 0.   (11)

Then, based on Lemma 3, we can deduce that the optimal solution to Eq. 4 correctly recovers the sparsity pattern of α*, and thus the incoming edges to node i.

More specifically, we start by realizing that a primal-dual optimal solution $(\alpha, \mu)$ to Eq. 4 must satisfy the generalized Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004):

$0 \in \nabla\ell^n(\alpha) + \lambda_n z - \mu$,  (12)
$\mu_j \alpha_j = 0$,  (13)
$\mu_j \geq 0$,  (14)
$z_j = 1, \ \forall \alpha_j > 0$,  (15)
$|z_j| \leq 1, \ \forall \alpha_j = 0$,  (16)

where $\ell^n(\alpha) = -\frac{1}{n}\sum_{c \in C^n} \log g(t^c; \alpha)$ and z denotes the subgradient of the $\ell_1$-norm.
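As a sanity check, conditions (12)-(16) can be verified numerically on a small instance. The sketch below is an assumption-laden illustration: a quadratic loss stands in for $\ell^n$, since the paper's likelihood depends on the chosen transmission model; it solves a toy nonnegative $\ell_1$-regularized problem by proximal gradient and checks the resulting stationarity conditions.

```python
import numpy as np

# Toy nonnegative l1-regularized problem; the quadratic loss is a
# stand-in for l^n (illustration only, not the paper's likelihood).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, 0.5, 0.0])
lam, step = 0.1, 0.1
grad = lambda a: X.T @ (X @ a - y) / len(y)

a = np.zeros(4)
for _ in range(5000):  # proximal gradient update: (a - step*grad - lam*step)_+
    a = np.maximum(a - step * grad(a) - lam * step, 0.0)

g = grad(a)
active = a > 1e-8
# Condition (15): for alpha_j > 0, z_j = 1 and mu_j = 0, so grad_j = -lam.
assert np.allclose(g[active], -lam, atol=1e-3)
# Condition (16): for alpha_j = 0, grad_j = -lam*z_j + mu_j >= -lam.
assert (g[~active] >= -lam - 1e-6).all()
```

At the fixed point of the iteration, the active coordinates satisfy stationarity with $z_j = 1$ exactly, which is what the assertions confirm.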

Suppose the true set of parents of node i is S. We construct the primal-dual vector $(\alpha, \mu)$ and the associated subgradient vector z in the following way:

1. We set $\alpha_S$ as the solution to the partial regularized maximum likelihood problem

$\alpha_S = \arg\min_{(\alpha_S, 0),\ \alpha_S \geq 0} \left\{\ell^n(\alpha) + \lambda_n \|\alpha_S\|_1\right\}$.  (17)

Then, we set $\mu_S \geq 0$ as the dual solution associated to the primal solution $\alpha_S$.

2. We set $\alpha_{S^c} = 0$, so that condition (11) holds, and $\mu_{S^c} = \mu^*_{S^c} \geq 0$, where $\mu^*$ is the optimal dual solution to the following problem:

minimize$_\alpha$ $\mathbb{E}_c[\ell^n(\alpha)]$ subject to $\alpha_j \geq 0$, $j = 1, \dots, N$, $j \neq i$.  (18)

Thus, our construction satisfies condition (14).

3. We obtain $z_{S^c}$ from (12) by substituting in the constructed $\alpha$, $\mu$ and $z_S$.

Then, we only need to prove that, under the stated scalings of (n, p, d), with high probability, the remaining KKT conditions (10), (13), (15) and (16) hold.

For simplicity of exposition, we first assume that the dependency and incoherence conditions hold for the finite sample Hessian matrix $Q^n$. Later we will lift this restriction and only place these conditions on the population Hessian matrix $Q^*$. The following lemma shows that our constructed solution satisfies condition (10):

Lemma 4 Under condition 3, if the regularization parameter is selected to satisfy

$\sqrt{d}\,\lambda_n \leq \frac{C^2_{\min}}{6\,(k_2 + 2 k_1 \sqrt{C_{\max}})}$,

and $\|\nabla_S \ell^n(\alpha^*)\|_\infty \leq \frac{\lambda_n}{4}$, then

$\|\alpha_S - \alpha^*_S\|_2 \leq 3\sqrt{d}\,\lambda_n / C_{\min} \leq \alpha^*_{\min}/2$,

as long as $\alpha^*_{\min} \geq 6\sqrt{d}\,\lambda_n / C_{\min}$.

Based on this lemma, we can then further show that the KKT conditions (13) and (15) also hold for the constructed solution. This can be trivially deduced from conditions (10) and (11) and our construction steps 1 and 2. Note that it also implies that $\mu_S = \mu^*_S = 0$, and hence $\mu = \mu^*$.

Proving condition (16) is more challenging. We first provide more details on how to construct $z_{S^c}$, mentioned in step 3. We start by using a Taylor expansion of Eq. 12,

$Q^n(\alpha - \alpha^*) = -\nabla\ell^n(\alpha^*) - \lambda_n z + \mu - R^n$,  (19)

where $R^n$ is a remainder term with j-th entry

$R^n_j = \left[\nabla^2\ell^n(\alpha^{(j)}) - \nabla^2\ell^n(\alpha^*)\right]^T_j (\alpha - \alpha^*)$,

and $\alpha^{(j)} = \theta_j \alpha + (1 - \theta_j)\alpha^*$ with $\theta_j \in [0, 1]$, according to the mean value theorem. Rewriting Eq. 19 using block matrices,

$\begin{pmatrix} Q^n_{SS} & Q^n_{SS^c} \\ Q^n_{S^cS} & Q^n_{S^cS^c} \end{pmatrix} \begin{pmatrix} \alpha_S - \alpha^*_S \\ \alpha_{S^c} - \alpha^*_{S^c} \end{pmatrix} = -\begin{pmatrix} \nabla_S \ell^n(\alpha^*) \\ \nabla_{S^c} \ell^n(\alpha^*) \end{pmatrix} - \lambda_n \begin{pmatrix} z_S \\ z_{S^c} \end{pmatrix} + \begin{pmatrix} \mu_S \\ \mu_{S^c} \end{pmatrix} - \begin{pmatrix} R^n_S \\ R^n_{S^c} \end{pmatrix}$


and, after some algebraic manipulation, we have

$\lambda_n z_{S^c} = -\nabla_{S^c}\ell^n(\alpha^*) + \mu_{S^c} - R^n_{S^c} - Q^n_{S^cS}(Q^n_{SS})^{-1}\left(-\nabla_S\ell^n(\alpha^*) - \lambda_n z_S + \mu_S - R^n_S\right)$.

Next, we upper bound $\|z_{S^c}\|_\infty$ using the triangle inequality:

$\|z_{S^c}\|_\infty \leq \lambda_n^{-1}\|\mu^*_{S^c} - \nabla_{S^c}\ell^n(\alpha^*)\|_\infty + \lambda_n^{-1}\|R^n_{S^c}\|_\infty + \|Q^n_{S^cS}(Q^n_{SS})^{-1}\|_\infty \left[1 + \lambda_n^{-1}\|R^n_S\|_\infty + \lambda_n^{-1}\|\mu^*_S - \nabla_S\ell^n(\alpha^*)\|_\infty\right]$,

and we want to prove that this upper bound is smaller than 1. This can be done with the help of the following two lemmas (proven in Appendices F and G):

Lemma 5 Given $\varepsilon \in (0, 1]$ from the incoherence condition, we have

$P\left(\frac{2-\varepsilon}{\lambda_n}\,\|\nabla\ell^n(\alpha^*) - \mu^*\|_\infty \geq \frac{\varepsilon}{4}\right) \leq 2p \exp\left(-\frac{n\,\lambda_n^2\,\varepsilon^2}{32\,k_3^2\,(2-\varepsilon)^2}\right)$,

which converges to zero at rate $\exp(-c\,\lambda_n^2\, n)$ as long as $\lambda_n \geq \frac{8 k_3 (2-\varepsilon)}{\varepsilon}\sqrt{\frac{\log p}{n}}$.

Lemma 6 Given $\varepsilon \in (0, 1]$ from the incoherence condition, if conditions 3 and 4 hold, and $\lambda_n$ is selected to satisfy

$\lambda_n\, d \leq \frac{C^2_{\min}\,\varepsilon}{36\,K\,(2-\varepsilon)}$,

where $K = k_1 + k_4 k_1 + k_1^2 + k_1\sqrt{C_{\max}}$, and $\|\nabla_S\ell^n(\alpha^*)\|_\infty \leq \frac{\lambda_n}{4}$, then $\frac{\|R^n\|_\infty}{\lambda_n} \leq \frac{\varepsilon}{4(2-\varepsilon)}$, as long as $\alpha^*_{\min} \geq 6\sqrt{d}\,\lambda_n / C_{\min}$.

Now, applying both lemmas and the incoherence condition on the finite sample Hessian matrix $Q^n$, we have

$\|z_{S^c}\|_\infty \leq (1 - \varepsilon) + \lambda_n^{-1}(2 - \varepsilon)\|R^n\|_\infty + \lambda_n^{-1}(2 - \varepsilon)\|\mu^* - \nabla\ell^n(\alpha^*)\|_\infty \leq (1 - \varepsilon) + 0.25\,\varepsilon + 0.25\,\varepsilon = 1 - 0.5\,\varepsilon$,

and thus condition (16) holds.

A possible choice of the regularization parameter $\lambda_n$ and cascade set size n such that the conditions of Lemmas 4-6 are satisfied is $\lambda_n = 8 k_3 (2-\varepsilon)\,\varepsilon^{-1}\sqrt{n^{-1}\log p}$ and $n > 288^2\, k_3^2\, (2-\varepsilon)^4\, C^{-4}_{\min}\, \varepsilon^{-4}\, d^2 \log p + \left(48\, k_3\, (2-\varepsilon)\, C^{-1}_{\min}\, (\alpha^*_{\min})^{-1} \varepsilon^{-1}\right)^2 d \log p$.
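To make these scalings concrete, the following sketch plugs values into the prescribed choices of $\lambda_n$ and n. The constants $k_3$, $\varepsilon$, $C_{\min}$, $\alpha^*_{\min}$ are problem-dependent quantities; the numbers below are illustrative assumptions, not values from the paper.

```python
import math

# Illustrative (assumed) constants; in the paper these depend on the
# transmission model and the true network.
k3, eps, C_min, alpha_min = 1.0, 0.5, 0.5, 0.2
p, d = 128, 3

def lambda_n(n):
    # lambda_n = 8 k3 (2 - eps) / eps * sqrt(log(p) / n)
    return 8 * k3 * (2 - eps) / eps * math.sqrt(math.log(p) / n)

def cascade_lower_bound():
    # n > 288^2 k3^2 (2-eps)^4 C_min^-4 eps^-4 d^2 log p
    #       + (48 k3 (2-eps) / (C_min * alpha_min * eps))^2 d log p
    t1 = 288**2 * k3**2 * (2 - eps)**4 * C_min**-4 * eps**-4 * d**2 * math.log(p)
    t2 = (48 * k3 * (2 - eps) / (C_min * alpha_min * eps))**2 * d * math.log(p)
    return t1 + t2

# The regularizer shrinks as the cascade set grows.
print(lambda_n(10**4), cascade_lower_bound())
```

Note the $O(d^2 \log p)$ leading term: doubling the in-degree d roughly quadruples the required number of cascades, while the dependence on p is only logarithmic.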

Last, we lift the dependency and incoherence conditions imposed on the finite sample Hessian matrix $Q^n$. We show that if we only impose these conditions on the corresponding population matrix $Q^*$, then they will also hold for $Q^n$ with high probability (proven in Appendices H and I).

Lemma 7 If condition 1 holds for $Q^*$, then, for any $\delta > 0$,

$P\left(\Lambda_{\min}(Q^n_{SS}) \leq C_{\min} - \delta\right) \leq 2 d B_1 \exp\left(-A_1 \delta^2\, n / d^2\right)$,
$P\left(\Lambda_{\max}(Q^n_{SS}) \geq C_{\max} + \delta\right) \leq 2 d B_2 \exp\left(-A_2 \delta^2\, n / d^2\right)$,

where $A_1$, $A_2$, $B_1$ and $B_2$ are constants independent of (n, p, d).

Algorithm 1 $\ell_1$-regularized network inference
Require: $C^n$, $\lambda_n$, K, L
  for all $i \in V$ do
    k = 0
    while k < K do
      $\alpha_i^{k+1} = \left(\alpha_i^k - L\,\nabla_{\alpha_i} \ell^n(\alpha_i^k) - \lambda_n L\right)_+$
      k = k + 1
    end while
    $\alpha_i = \alpha_i^{K}$
  end for
  return $\{\alpha_i\}_{i \in V}$

Lemma 8 If $\|Q^*_{S^cS}(Q^*_{SS})^{-1}\|_\infty \leq 1 - \varepsilon$, then

$P\left(\|Q^n_{S^cS}(Q^n_{SS})^{-1}\|_\infty \geq 1 - \varepsilon/2\right) \leq p \exp\left(-K\, n / d^3\right)$,

where K is a constant independent of (n, p, d).

Note that in this case the cascade set size needs to increase to $n > L d^3 \log p$, where L is a sufficiently large positive constant independent of (n, p, d), for the error probabilities in these last two lemmas to converge to zero.
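The incoherence quantity controlled by Lemma 8 is easy to compute for any given Hessian estimate. A minimal sketch, with a synthetic positive definite matrix standing in for $Q^n$ (which in the paper is the Hessian of the log-likelihood):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d = 8, 3
A = rng.normal(size=(p, p))
Q = A @ A.T / p + np.eye(p)   # synthetic symmetric positive definite "Hessian"
S = np.arange(d)              # assumed parent set of node i
Sc = np.arange(d, p)          # remaining candidate parents

# ||Q_{S^c S} (Q_{SS})^{-1}||_inf: maximum absolute row sum.
M = Q[np.ix_(Sc, S)] @ np.linalg.inv(Q[np.ix_(S, S)])
incoherence = np.abs(M).sum(axis=1).max()
print(incoherence)  # the incoherence condition asks this to be <= 1 - eps
```

In practice one would evaluate this on the population or estimated Hessian at $\alpha^*$; the sketch only shows the matrix-norm computation itself.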

7. Efficient soft-thresholding algorithm

Can we design efficient algorithms to solve Eq. 4 for network recovery? Here, we design a proximal gradient algorithm, a class of methods well suited for solving non-smooth, constrained, large-scale or high-dimensional convex optimization problems (Parikh & Boyd, 2013). Moreover, such algorithms are easy to understand, derive, and implement. We first rewrite Eq. 4 as an unconstrained optimization problem:

minimize$_\alpha$ $\ell^n(\alpha) + g(\alpha)$,

where the non-smooth convex function $g(\alpha) = \lambda_n \|\alpha\|_1$ if $\alpha \geq 0$ and $+\infty$ otherwise. Here, the general recipe from Parikh & Boyd (2013) for designing proximal gradient algorithms can be applied directly.

Algorithm 1 summarizes the resulting algorithm. In each iteration of the algorithm, we need to compute $\nabla\ell^n$ (Table 1) and the proximal operator $\mathrm{prox}_{L_k g}(v)$, where $L_k$ is a step size that we can set to a constant value L or find using a simple line search (Beck & Teboulle, 2009). Using Moreau's decomposition and the conjugate function $g^*$, it is easy to show that the proximal operator for our particular function $g(\cdot)$ is a soft-thresholding operator, $(v - \lambda_n L_k)_+$, which leads to a sparse optimal solution $\alpha$, as desired.
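A minimal implementation of the resulting soft-thresholding iteration might look as follows. Note that the gradient of the actual log-likelihood $\ell^n$ depends on the transmission model, so a smooth quadratic loss is used below purely as a stand-in:

```python
import numpy as np

def prox_g(v, t):
    # prox of g(a) = lam*||a||_1 + indicator(a >= 0):
    # one-sided soft-thresholding, (v - t)_+.
    return np.maximum(v - t, 0.0)

def l1_network_inference(grad, alpha0, lam, step, iters):
    """Proximal gradient loop of Algorithm 1 for a single node.

    grad is the gradient of the smooth part; step plays the role of L
    (a constant step size; a line search would also work)."""
    alpha = alpha0.copy()
    for _ in range(iters):
        alpha = prox_g(alpha - step * grad(alpha), lam * step)
    return alpha

# Stand-in smooth loss 0.5/n * ||X a - y||^2 (illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0])
grad = lambda a: X.T @ (X @ a - y) / len(y)
alpha_hat = l1_network_inference(grad, np.zeros(5), lam=0.05,
                                 step=0.1, iters=1000)
print(alpha_hat)  # nonnegative and sparse, as expected
```

The nonnegativity constraint and the $\ell_1$ penalty are handled jointly by the single one-sided soft-thresholding step, so no projection or subgradient bookkeeping is needed.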

8. Experiments

In this section, we first illustrate some consequences of Th. 2 by applying our algorithm to several types of networks, parameters (n, p, d), and regularization parameter $\lambda_n$. Then, we compare our algorithm to two different state-of-the-art algorithms: NETRATE (Gomez-Rodriguez et al., 2011) and First-Edge (Abrahao et al., 2013).

[Figure 3. Success probability vs. # of cascades: (a) different super-neighborhood sizes p = 6, 26, 40, against the scaling β; (b) different networks and transmission models (Forest-Fire Exp/Pow, Hierarchical Exp/Pow), against the regularization scaling K.]

Experimental Setup. We focus on synthetic networks that mimic the structure of real-world diffusion networks, in particular, social networks. We consider two models of directed real-world social networks: the Forest Fire model (Barabasi & Albert, 1999) and the Kronecker Graph model (Leskovec et al., 2010), and use simple pairwise transmission models such as exponential, power-law or Rayleigh. We use networks with 128 nodes and, for each edge, we draw its associated transmission rate from a uniform distribution U(0.5, 1.5). We proceed as follows: we generate a network $G^*$ and transmission rates $A^*$, simulate a set of cascades and, for each cascade, record the node infection times. Then, given the infection times, we infer a network $\hat{G}$. Finally, when we illustrate the consequences of Th. 2, we evaluate the accuracy of the inferred neighborhood of a node $\hat{N}^-(i)$ using the probability of success $P(\hat{E} = E^*)$, estimated by running our method on 100 independent cascade sets. When we compare our algorithm to NETRATE and First-Edge, we use the $F_1$ score, which is defined as 2PR/(P + R), where precision (P) is the fraction of edges in the inferred network $\hat{G}$ present in the true network $G^*$, and recall (R) is the fraction of edges of the true network $G^*$ present in the inferred network $\hat{G}$.

Parameters (n, p, d). According to Th. 2, the number of cascades necessary to successfully infer the incoming edges of a node increases polynomially with the node's neighborhood size $d_i$ and logarithmically with the super-neighborhood size $p_i$. Here, we infer the incoming links of nodes of a hierarchical Kronecker network with the same in-degree ($d_i = 3$) but different super-neighborhood set sizes $p_i$, under different scalings β of the number of cascades $n = 10\,\beta\, d \log p$, and choose the regularization parameter $\lambda_n$ as a constant factor of $\sqrt{\log(p)/n}$, as suggested by Th. 2. We used an exponential transmission model and T = 5. Fig. 3(a) summarizes the results, where, for each node, we used cascades which contained at least one node in the super-neighborhood of the node under study. As predicted by Th. 2, very different p values lead to curves that line up with each other quite well.

Regularization parameter $\lambda_n$. Our main result indicates that the regularization parameter $\lambda_n$ should be a constant factor of $\sqrt{\log(p)/n}$. Fig. 3(b) shows the success probability of our algorithm against different scalings K of the regularization parameter $\lambda_n = K\sqrt{\log(p)/n}$ for different types of networks, using 150 cascades and T = 5. We find that for sufficiently large $\lambda_n$, the success probability flattens, as expected from Th. 2. It flattens at values smaller than one because we used a fixed number of cascades n, which may not satisfy the conditions of Th. 2.

[Figure 4. F1-score vs. # of cascades: (a) Kronecker hierarchical network, power-law transmission model; (b) Forest Fire network, exponential transmission model; comparing NetRate, our method, and First-Edge.]

Comparison with NETRATE and First-Edge. Fig. 4 compares the accuracy of our algorithm, NETRATE, and First-Edge against the number of cascades for a hierarchical Kronecker network with a power-law transmission model and a Forest Fire network with an exponential transmission model, with an observation window T = 10. Our method outperforms both competing methods, and its advantage with respect to First-Edge is especially striking.

9. Conclusions

Our work contributes towards establishing a theoretical foundation for the network inference problem. Specifically, we proposed an $\ell_1$-regularized maximum likelihood inference method for a well-known continuous-time diffusion model and an efficient proximal gradient implementation, and then showed that, for general networks satisfying a natural incoherence condition, our method achieves an exponentially decreasing error with respect to the number of cascades as long as $O(d^3 \log N)$ cascades are recorded.

Our work also opens many interesting avenues for future work. For example, given a fixed number of cascades, it would be useful to provide confidence intervals on the inferred edges. Further, given a network with arbitrary pairwise likelihoods, it is an open question whether there always exists at least one source distribution and time window value such that the incoherence condition is satisfied, and, if so, whether there is an efficient way of finding this distribution. Finally, our work assumes all activations occur due to network diffusion and are recorded. It would be interesting to allow for missing observations, as well as activations due to exogenous factors.

Acknowledgement

This research was supported in part by NSF/NIH BIGDATA 1R01GM108341-01, NSF IIS1116886, and a Raytheon faculty fellowship to L. Song.


References

Abrahao, B., Chierichetti, F., Kleinberg, R., and Panconesi, A. Trace complexity of network inference. In KDD, 2013.

Adar, E. and Adamic, L. A. Tracking Information Epidemics in Blogspace. In Web Intelligence, pp. 207-214, 2005.

Barabasi, A.-L. and Albert, R. Emergence of Scaling in Random Networks. Science, 286:509-512, 1999.

Beck, A. and Teboulle, M. Gradient-based algorithms with applications to signal recovery. Convex Optimization in Signal Processing and Communications, 2009.

Boyd, S. P. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Du, N., Song, L., Smola, A., and Yuan, M. Learning Networks of Heterogeneous Influence. In NIPS, 2012a.

Du, N., Song, L., Woo, H., and Zha, H. Uncover Topic-Sensitive Information Diffusion Networks. In AISTATS, 2012b.

Gomez-Rodriguez, M., Leskovec, J., and Krause, A. Inferring Networks of Diffusion and Influence. In KDD, 2010.

Gomez-Rodriguez, M., Balduzzi, D., and Scholkopf, B. Uncovering the Temporal Dynamics of Diffusion Networks. In ICML, 2011.

Gomez-Rodriguez, M. Ph.D. Thesis. Stanford University & MPI for Intelligent Systems, 2013.

Gripon, V. and Rabbat, M. Reconstructing a graph from path traces. arXiv:1301.6916, 2013.

Kempe, D., Kleinberg, J. M., and Tardos, E. Maximizing the Spread of Influence Through a Social Network. In KDD, 2003.

Leskovec, J., Chakrabarti, D., Kleinberg, J., Faloutsos, C., and Ghahramani, Z. Kronecker Graphs: An Approach to Modeling Networks. JMLR, 2010.

Mangasarian, O. L. A simple characterization of solution sets of convex programs. Operations Research Letters, 7(1):21-26, 1988.

Netrapalli, P. and Sanghavi, S. Finding the Graph of Epidemic Cascades. In ACM SIGMETRICS, 2012.

Newey, W. K. and McFadden, D. L. Large Sample Estimation and Hypothesis Testing. In Handbook of Econometrics, volume 4, pp. 2111-2245, 1994.

Parikh, N. and Boyd, S. Proximal algorithms. Foundations and Trends in Optimization, 2013.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3):1287-1319, 2010.

Rogers, E. M. Diffusion of Innovations. Free Press, New York, fourth edition, 1995.

Saito, K., Kimura, M., Ohara, K., and Motoda, H. Learning continuous-time information diffusion model for social behavioral data analysis. Advances in Machine Learning, pp. 322-337, 2009.

Snowsill, T., Fyson, N., De Bie, T., and Cristianini, N. Refining Causality: Who Copied From Whom? In KDD, 2011.

Wainwright, M. J. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.

Wang, L., Ermon, S., and Hopcroft, J. Feature-enhanced probabilistic models for diffusion network inference. In ECML PKDD, 2012.