arXiv:1104.4605v1 [stat.ML] 24 Apr 2011
Compressive Network Analysis
Xiaoye Jiang1, Yuan Yao2, Han Liu3, Leonidas Guibas1
1Stanford University; 2Peking University;3Johns Hopkins University
May 28, 2018
Abstract
Modern data acquisition routinely produces massive amounts of network data.
Though many methods and models have been proposed to analyze such data, research
on network data is largely disconnected from the classical theory of statistical
learning and signal processing. In this paper, we present a new framework for modeling
network data, which connects two seemingly different areas: network data analysis
and compressed sensing. From a nonparametric perspective, we model an observed
network using a large dictionary. In particular, we consider the network clique detection
problem and show connections between our formulation and a new algebraic tool,
namely Radon basis pursuit in homogeneous spaces. Such a connection allows us to
identify rigorous recovery conditions for clique detection problems. Though this paper
is mainly conceptual, we also develop practical approximation algorithms for solving
empirical problems and demonstrate their usefulness on real-world datasets.
Keywords: network data analysis, compressive sensing, Radon basis pursuit,
restricted isometry property, clique detection.
I. Introduction
In the past decade, research on network data has increased dramatically. Examples
include scientific studies involving web data or hypertext documents connected via
hyperlinks, social networks or user profiles connected via friend links, co-authorship and
citation networks connected by collaboration or citation relationships, gene or protein
networks connected by regulatory relationships, and much more. Such data appear frequently
in modern application domains and have led to numerous high-impact applications. For
instance, detecting anomalies in ad-hoc information networks is vital for corporate and
government security; exploring hidden community structures helps us to better conduct
online advertising and marketing; inferring large-scale gene regulatory networks is crucial
for new drug design and disease control. Due to the increasing importance of network data,
principled analytical and modeling tools are crucially needed.
Towards this goal, researchers from the network modeling community have proposed many
models to explore and predict the network data. These models roughly fall into two cate-
gories: static and dynamic models. For the static model, there is only one single snapshot
of the network being observed. In contrast, dynamic models can be applied to analyze
datasets that contain many snapshots of the network indexed by different time points. Ex-
amples of the static network models include the Erdos-Renyi-Gilbert random graph model
(Erdos and Renyi, 1959, 1960), the p1 (Holland and Leinhardt, 1981), p2 (Duijn et al., 2004)
and more general exponential random graph (or p∗) model (Wasserman and Pattison, 1996),
latent space model (Hoff et al., 2001), block model (Lorrain and White, 1971), stochas-
tic blockmodel (Wasserman and Anderson, 1987), and mixed membership stochastic block-
model (Airoldi et al., 2008). Examples of the dynamic network models include the preferen-
tial attachment model (Barabasi and Albert, 1999), the small-world model (Watts and Strogatz,
1998), duplication-attachment model (Kleinberg et al., 1999; Kumar et al., 2000), continu-
ous time Markov model (Snijders, 2005), and dynamic latent space model (Sarkar and Moore,
2005). A comprehensive review of these models is provided in Goldenberg et al. (2010).
Though many methods and models have been proposed, research on network data analysis
is largely disconnected from the classical theory of statistical learning and signal processing.
The main reason is that, unlike the usual scientific data for which independent measurements
can be repeatedly collected, network data are in general collected in one single realization
and the nodes within the network are highly relational due to the existence of many link-
ages. Such a disconnection prevents us from directly exploiting the state-of-the-art statistical
learning methods and theory to analyze network data. To bridge this gap, we present a novel
framework to model network data. Our framework assumes that the observed network has a
sparse representation with respect to some dictionary (or basis space). Once the dictionary
is given, we formulate the network modeling problem into a compressed sensing problem.
Compressed sensing, also known as compressive sensing and compressive sampling, is a tech-
nique for finding sparse solutions to underdetermined linear systems. In statistical machine
learning, it is related to reconstructing a signal which has a sparse representation in a large
dictionary. The field of compressed sensing has existed for decades, but recently it has ex-
ploded due to the important contributions of Candes and Tao (2005, 2007); Candes (2008);
Tsaig and Donoho (2006). By viewing the observed network adjacency matrix as the output
of an underlying function evaluated on a discrete domain of network nodes, we can formulate
the network modeling problem into a compressed sensing problem.
Specifically, we consider the network clique detection problem within this novel framework.
By considering a generative model in which the observed adjacency matrix is assumed
to have a sparse representation in a large dictionary where each basis corresponds to a
clique, we connect our framework with a new algebraic tool, namely Radon basis pursuit
in homogeneous spaces. Our problem can be regarded as an extension of the work
in Jagabathula and Shah (2008) which studies sparse recovery of functions on permutation
groups, while we reconstruct functions on k-sets (cliques), often called the homogeneous
space associated with a permutation group in the literature (Diaconis, 1988). It turns out
that the discrete Radon basis becomes the natural choice instead of the Fourier basis con-
sidered in Jagabathula and Shah (2008). This leaves us a new challenge on addressing the
noiseless exact recovery and stable recovery with noise. Unfortunately, the greedy algorithm
for exact recovery in Jagabathula and Shah (2008) cannot be applied to noisy settings, and
in general the Radon basis does not satisfy the Restricted Isometry Property (RIP) (Candes,
2008) which is crucial for the universal recovery. In this paper, we develop new theories and
algorithms which guarantee exact, sparse, and stable recovery under the choice of Radon
basis. These theories have deep roots in Basis Pursuit (Chen et al., 1999) and its extensions
with uniformly bounded noise. Though this paper is mainly conceptual, showing the
connection between network modeling and compressed sensing, we also provide some rigorous
theoretical analysis and practical algorithms on the clique recovery problem to illustrate the
usefulness of our framework.
The main content of this paper can be summarized as follows. Section 2 presents the general
framework on compressive network analysis. In Sections 3, 4, and 5, we consider the clique
detection problem under the compressive network analysis framework. A polynomial time
approximation algorithm is provided in Section 6 for the clique detection problem. We also
demonstrate successful application examples in Section 7. Section 8 concludes the paper.
II. Main Idea
In this section we present the general framework of compressive network analysis with a
nonparametric view. We start with an introduction of notations: let $u = (u_1, \ldots, u_d)^T \in \mathbb{R}^d$
be a vector and $I(\cdot)$ be the indicator function. We denote
$$\|u\|_0 \equiv \sum_{j=1}^{d} I(u_j \neq 0), \qquad \|u\|_2 \equiv \sqrt{\sum_{j=1}^{d} u_j^2}, \qquad \|u\|_\infty \equiv \max_j |u_j|. \tag{2.1}$$
We also denote by $\langle \cdot, \cdot \rangle$ the Euclidean inner product and
$\mathrm{sign}(u) = (\mathrm{sign}(u_1), \ldots, \mathrm{sign}(u_d))^T$, where
$$\mathrm{sign}(u_j) = \begin{cases} +1 & u_j > 0 \\ 0 & u_j = 0 \\ -1 & u_j < 0 \end{cases} \tag{2.2}$$
We represent a network as a graph $G = (V, E)$, where $V = \{1, \ldots, n\}$ is the set of nodes and
$E \subset V \times V$ is the set of edges. Let $B \in \mathbb{R}^{n \times n}$ be the adjacency matrix of the observed network,
with $B_{ij}$ representing a quantity associated with nodes $i$ and $j$. Without loss of generality, we
assume that $B$ is symmetric, $B = B^T$, and $\mathrm{diag}(B) = 0$. With these assumptions, to model
$B$ we only need to model its upper triangle. For notational simplicity, we squeeze $B$ into a
vector $b \in \mathbb{R}^M$ where $M = n(n-1)/2$ is the number of upper-triangle elements in $B$. Let
$f(V) \in \mathbb{R}^M$ be an unknown vector-valued function defined on $V$. We assume a generative
model of the observed adjacency matrix $B$ (or equivalently, $b$):
$$b = f(V) + z, \tag{2.3}$$
where $z \in \mathbb{R}^M$ is a noise vector. We can view $f(V)$ as evaluating a possibly
infinite-dimensional function $f$ on a discrete set $V$; thus the model (2.3) is intrinsically nonparametric
and can model any static network.
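As a minimal sketch of the vectorization step above, the following Python snippet flattens the upper triangle of a symmetric adjacency matrix into a vector of length $M = n(n-1)/2$. The function name `vectorize_upper` and the lexicographic ordering of the pairs $(i, j)$ with $i < j$ are illustrative assumptions; the text does not fix a particular ordering.

```python
from itertools import combinations

def vectorize_upper(B):
    """Flatten the strict upper triangle of a symmetric matrix B (list of lists).

    Pairs (i, j) with i < j are visited in lexicographic order; this ordering
    is an assumption made for illustration only.
    """
    n = len(B)
    for i in range(n):
        if B[i][i] != 0:
            raise ValueError("diag(B) must be zero")
        for j in range(i + 1, n):
            if B[i][j] != B[j][i]:
                raise ValueError("B must be symmetric")
    return [B[i][j] for i, j in combinations(range(n), 2)]

# A small hypothetical weighted network on 4 nodes.
B = [
    [0, 2, 1, 0],
    [2, 0, 3, 0],
    [1, 3, 0, 5],
    [0, 0, 5, 0],
]
b = vectorize_upper(B)
print(b)  # entries for pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
```

Any fixed ordering of the pairs would serve equally well, as long as the rows of the dictionary introduced below use the same ordering.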
Without further regularity conditions or constraints, there is no hope for us to reliably
estimate $f$. In our framework, we assume that $f$ has a sparse representation with respect to
an $M$ by $N$ dictionary $A = [\phi_1(V), \ldots, \phi_N(V)]$ where each $\phi_j(V) \in \mathbb{R}^M$ is a basis function,
i.e., there exists a subset $S \subset \{1, \ldots, N\}$ with cardinality $|S| \ll N$, such that
$$f(V) = \sum_{q \in S} x_q \phi_q(V). \tag{2.4}$$
In the sequel, we denote by $A_{pq}$ the element on the $p$-th row and $q$-th column of $A$. Here $p$
indexes a pair of different nodes and $q$ indexes a basis $\phi_q(V)$. To estimate $f$, we only need
to reconstruct $x = (x_1, \ldots, x_N)^T$. Given the dictionary $A$, we can estimate $f$ by solving the
following program:
$$(P_0) \quad \min \|x\|_0 \quad \text{s.t.} \quad \|b - Ax\|_z \le \delta, \tag{2.5}$$
where $\|\cdot\|_z$ is a vector norm constructed using the knowledge of $z$. The problem in (2.5) is
non-convex. In the sparse learning literature, a convex relaxation of (2.5) can be written as
$$(P_1) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|b - Ax\|_z \le \delta. \tag{2.6}$$
One thing to note is that the dictionary A can be either constructed based on the domain
knowledge, or it can be learned from empirical data. For simplicity, we always assume A is
pre-given in this paper. In the following sections, we use the clique detection problem as a
case study to illustrate the usefulness of this framework.
III. Clique Detection
In network data analysis, the problem of identifying communities or cliques1 based on partial
information arises frequently in many applications, including identity management (Guibas,
2008), statistical ranking (Diaconis, 1988; Jagabathula and Shah, 2008), and social net-
works (Leskovec et al., 2010). In these applications we are typically given a network with its
nodes representing players, items, or characters, and edge weights summarizing the observed
pairwise interactions. The basic problem is to determine communities or cliques within the
network by observing the frequencies of low order interactions, since in reality such low order
interactions are often governed by a considerably smaller number of high order communities
or cliques. Therefore the clique detection problem can be formulated as compressed sensing
of cliques in large networks. To solve this problem, one has to answer two questions: (i)
what is the suitable representation basis, and (ii) what is the reconstruction method? Before
rigorously formulating the problem, we provide three motivating examples as a glimpse of
typical situations which can be addressed within the framework in this paper.
Example 1 (Tracking Team Identities) We consider the scenario of multiple targets
moving in an environment monitored by sensors. We assume every moving target has an
identity and they each belong to some teams or groups. However, we can only obtain
partial interaction information due to the measurement structure. For example, in a
grey-scale video of a basketball game (where it may be hard to tell the two teams apart),
sensors may observe ball passes or collaborative offensive/defensive interactions between
teammates. The observations are partial because players mostly exhibit low order
interactions to the sensors in basketball games. It is difficult to observe a single event
which involves all team members. Our objective is to infer membership information (which
team the players belong to) from such partially observed interactions.
Example 2 (Inferring High Order Partial Rankings) The problem of clique identifi-
cation also arises in ranking problems. Consider a collection of items which are to be ranked
by a set of users. Each user can propose the set of his or her j most favorite items (say top
3 items) but without specifying a relative preference within this set. We then wish to infer
what are the top k > j most favorite items (say top 5 items). This problem requires us to
infer high order partial rankings from low order observations.
Example 3 (Detecting Communities in Social Networks) Detecting communities in
social networks is of extraordinary importance. It can be used to understand the organization
or collaboration structure of a social network. However, we do not have direct mechanisms
to sense social communities. Instead, we have partial, low order interaction information. For
example, we observe pairwise or triple-wise co-appearance among people who hang out for
some leisure activities together. We hope to detect those social communities in the network
from such partially observed data.
1 A clique means a complete subgraph of the network.
In these examples we are typically given a network with some nodes representing play-
ers, items, or characters, and edge weights summarizing the observed pairwise interactions.
Triple-wise and other low order information can be further exploited if we consider complete
sub-graphs or cliques in the networks. The basic problem is to determine common interest
groups or cliques within the network by observing the frequency of low order interactions,
since in reality such low order interactions are often governed by a considerably smaller number
of high order communities. In this sense we shall formulate our problem as compressed
sensing of cliques in networks.
The problem we are going to address has a close relationship with community detection in
social networks. Community structures are ubiquitous in social networks. However, there is
no consistent definition of a “community”. In the majority of research studies, community
detection is based on partitions of nodes in a network. Among these works, the most famous
one is based on the modularity (Newman, 2006) of a partition of the nodes in a group. A
shortcoming in partition-based methods is that they do not allow overlapping communities,
which occur frequently in practice. Recently there has been growing interest in studying
overlapping community structures (Lancichinetti and Fortunato, 2009). The relevance of
cliques to overlapping communities was probably first addressed in the clique percolation
method (Palla et al., 2005). In that work, communities were modeled as maximal connected
components of cliques in a graph where two k-cliques are said to be connected if they share
k − 1 nodes. In this paper, we pursue a compressive representation of signals or functions
on networks based on clique information, which in turn sheds light on multiple aspects of
community structure.
In this paper, we use the same definition as in Palla et al. (2005) but are more interested
in identifying cliques. We pursue an alternative approach on exploring networks based on
clique information which potentially sheds light on multiple aspects of community structures.
Roughly speaking, we assume that there is a frequency function defined on complete low
order subsets. For example, in some social networks edge weights are bivariate functions
defined on pairs of nodes reflecting strength of pairwise interactions. We also assume that
there is another latent frequency function defined on complete high order subsets which we
hope to infer. Intuitively, the interaction frequency of a particular low order subset should
be the sum of frequencies of high order subsets which it belongs to. Hence we consider
a generative mechanism in which there exists a linear mapping from frequencies on high
order subsets (usually sparsely distributed) to low order subsets. One typically can collect
data on low order subsets while the task is to find those few dominant high order subsets.
This problem naturally fits into the general compressive network analysis framework we
introduced in the previous section. Below we demonstrate that the Radon basis will be
an appropriate representation for our purpose which allows the sparse recovery by a simple
linear programming reconstruction approach.
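The linear mapping described above, in which the frequency of a low order subset is the sum of the frequencies of the high order subsets containing it, can be sketched as follows. The function name `low_order_frequencies` and the example weights are hypothetical and serve only to illustrate the generative mechanism.

```python
from itertools import combinations
from collections import defaultdict

def low_order_frequencies(clique_weights, j=2):
    """Sum each clique's weight over all of its j-subsets.

    clique_weights maps a frozenset of nodes (a high order subset) to its
    latent frequency; the result maps each j-subset (as a sorted tuple) to
    the total frequency of the cliques that contain it.
    """
    freq = defaultdict(float)
    for clique, weight in clique_weights.items():
        for subset in combinations(sorted(clique), j):
            freq[subset] += weight
    return dict(freq)

# Two hypothetical overlapping communities; they share the pair (1, 2).
clique_weights = {
    frozenset({0, 1, 2}): 2.0,
    frozenset({1, 2, 3}): 1.0,
}
pair_freq = low_order_frequencies(clique_weights, j=2)
for pair in sorted(pair_freq):
    print(pair, pair_freq[pair])
```

The shared pair (1, 2) accumulates the weights of both communities, while pairs that lie in no community (such as (0, 3)) never appear in the output.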
IV. Radon Basis Pursuit
A. Mathematical Formulation
Under the general framework in (2.3), we formulate the clique detection problem into a
compressed sensing problem named Radon Basis Pursuit. For this, we construct a dictionary
A so that each column of A corresponds to one clique. The intuition of such a construction
is that we assume there are several hidden cliques within the network, which are perhaps
of different sizes and may have overlaps. Every clique has a certain weight. The observed
adjacency matrix B (or equivalently, its vectorized version b) is a linear combination of many
clique bases contaminated by a noise vector z.
For simplicity, we first restrict ourselves to the case that all the cliques are of the same size
$k < n$. The case with mixed sizes will be discussed later. Let $C_1, C_2, \ldots, C_N$ be all the
cliques of size $k$ and each $C_j \subset V$. We have $N = \binom{n}{k}$. For each $q \in \{1, \ldots, N\}$, we construct
the dictionary $A$ as follows:
$$A_{pq} = \begin{cases} 1 & \text{if the $p$-th pair of nodes both lie in } C_q \\ 0 & \text{otherwise.} \end{cases}$$
The matrix A constructed above is related to discrete Radon transforms. In fact, up to a
constant and column scaling, the transpose matrix A∗ is called the discrete Radon transform
for two suitably defined homogeneous spaces (Diaconis, 1988). Our usage here is to exploit
the transpose matrix of the Radon transform to construct an over-complete dictionary, so
that the observed output b has a sparse representation with respect to it. A more technical
discussion of Radon transforms is beyond the scope of this paper.
The above formulation can be generalized to the case where $b$ is a vector of length $\binom{n}{j}$
($j \ge 2$) with the $p$-th entry in $b$ characterizing a quantity associated with a $j$-set (a set with
cardinality $j$). The dictionary $A$ will then be a binary matrix $R^{j,k}$ with entries indicating
whether a $j$-set is a subset of a $k$-clique (a clique with $k$ nodes), i.e.,
$$R^{j,k}_{pq} = \begin{cases} 1 & \text{if the $p$-th $j$-set of nodes all lie in the $k$-clique } C_q \\ 0 & \text{otherwise.} \end{cases}$$
Therefore, the case where $b$ is a vector of length $\binom{n}{2}$ corresponds to the special case where
$A = R^{2,k}$. Our algorithms and theory hold for general $R^{j,k}$ with $j < k$.
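To make the construction concrete, here is a minimal sketch that enumerates all $j$-sets and $k$-sets and fills in the binary matrix $R^{j,k}$ entry by entry. The function name `radon_dictionary` and the lexicographic ordering of rows and columns are assumptions made for illustration; explicit enumeration is only feasible for small $n$.

```python
from itertools import combinations
from math import comb

def radon_dictionary(n, j, k):
    """Return (rows, cols, R) for the binary matrix R^{j,k} on n nodes.

    rows are all j-sets, cols are all k-sets (both as sorted tuples), and
    R[p][q] is 1 when the p-th j-set is contained in the q-th k-set.
    """
    if not (0 < j < k <= n):
        raise ValueError("need 0 < j < k <= n")
    rows = list(combinations(range(n), j))
    cols = list(combinations(range(n), k))
    col_sets = [set(c) for c in cols]
    R = [[1 if col.issuperset(r) else 0 for col in col_sets] for r in rows]
    return rows, cols, R

# Pairs versus 5-sets on 10 nodes.
rows, cols, R = radon_dictionary(n=10, j=2, k=5)
print(len(R) == comb(10, 2), len(R[0]) == comb(10, 5))

# Every column should contain exactly C(k, j) ones.
col_sums = {sum(R[p][q] for p in range(len(rows))) for q in range(len(cols))}
print(col_sums)
```

With $n = 10$, $j = 2$, $k = 5$ the matrix has one row per pair of nodes and one column per candidate 5-clique.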
Now we provide two concrete reconstruction programs for the clique identification problem:
$$(P_1) \quad \min \|x\|_1 \quad \text{s.t.} \quad b = Ax$$
$$(P_{1,\delta}) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|Ax - b\|_\infty \le \delta.$$
$P_1$ is known as Basis Pursuit (Chen et al., 1999), where we consider the ideal case that the
noise level is zero. For robust reconstruction against noise, we consider the relaxed program
$P_{1,\delta}$. The program $P_{1,\delta}$ differs from the Dantzig selector (Candes and Tao, 2007), which
uses a constraint of the form $\|A^*(Ax - b)\|_\infty \le \delta$. The reason for our choice of $P_{1,\delta}$ lies
in the fact that a more natural noise model for network data is bounded noise rather than
Gaussian noise. Moreover, our linear programming formulation of $P_{1,\delta}$ enables practical
computation for large scale problems.
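One standard way to write $P_{1,\delta}$ as a linear program, given here as an illustrative sketch rather than as the authors' exact implementation, introduces an auxiliary vector $\xi \in \mathbb{R}^N$ that bounds $|x|$ componentwise:

```latex
\begin{aligned}
\min_{x,\,\xi} \quad & \mathbf{1}^T \xi \\
\text{s.t.} \quad & -\xi \le x \le \xi, \\
& -\delta \mathbf{1} \le Ax - b \le \delta \mathbf{1}.
\end{aligned}
```

At an optimum $\xi_q = |x_q|$ for every $q$, so the objective equals $\|x\|_1$, and the two-sided constraint on $Ax - b$ is exactly $\|Ax - b\|_\infty \le \delta$. Setting $\delta = 0$ recovers $P_1$.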
B. Intuition
Let G = (V,E) be the network we are trying to model. The set of vertices V represents
individual identities such as people in the social network. Each edge in E is associated with
some weights which represent interaction frequency information.
We assume that there are several common interest groups or communities within the net-
work, represented by cliques (or complete sub-graphs) within graph G, which are perhaps of
different sizes and may have overlaps. Every community has certain interaction frequency
which can be viewed as a function on cliques. However, we only receive partial measure-
ments consisting of low order interaction frequency on subsets in a clique. For example, in
the simplest case we may only observe pairwise interactions represented by edge weights.
Our problem is to reconstruct the function on cliques from such partially observed data.
A graphical illustration of this idea is provided in Figure 1, in which we see an observed
network can be written as a linear combination of several overlapped cliques.
One application scenario is to identify two basketball teams from pairwise interactions among
players. Suppose we have $x_0$, which is a signal on all 5-sets of a 10-player set. We assume
it is sparsely concentrated on two 5-sets which correspond to the two teams with nonzero
weights. Assume we have observations $b$ of pairwise interactions, $b = Ax_0 + z$, where $z$ is
uniform random noise defined on $[-\epsilon, \epsilon]$. We solve $P_{1,\delta}$, with $\delta = \epsilon$, which is a linear program
over $x \in \mathbb{R}^{\binom{10}{5}} = \mathbb{R}^{252}$ with parameters $A \in \mathbb{R}^{\binom{10}{2} \times \binom{10}{5}} = \mathbb{R}^{45 \times 252}$ and $b \in \mathbb{R}^{45}$.
Figure 1: An illustrative example of the main idea.
C. Connection with Radon Basis
Let $V_j$ denote the set of all $j$-sets of $V = \{1, \cdots, n\}$ and $M^j$ be the set of real-valued
functions on $V_j$. The observed interaction frequencies $b$ on all $j$-sets can be viewed as a
function in $M^j$. We build a matrix $R^{j,k} : M^k \to M^j$ ($j < k$) as a mapping from functions
on all $k$-sets of $V$ to functions on all $j$-sets of $V$. In this setup, each row represents a $j$-set
and each column represents a $k$-set. The entries of $R^{j,k}$ are either 0 or 1, indicating whether
the $j$-set is a subset of the $k$-set. Note that every column of $R^{j,k}$ has $\binom{k}{j}$ ones. Lacking a
priori information, we assume that every $j$-set of a particular $k$-set has equal interaction
probability, and hence choose the same constant 1 for each column. We further normalize $R^{j,k}$,
keeping the same symbol, so that the $\ell_2$ norm of each column is 1. To summarize, we have
$$R^{j,k}(\sigma, \tau) = \begin{cases} \dfrac{1}{\sqrt{\binom{k}{j}}}, & \text{if } \sigma \subset \tau; \\ 0, & \text{otherwise,} \end{cases}$$
where $\sigma$ is a $j$-set and $\tau$ is a $k$-set. As we will see, this construction leads to a canonical basis
associated with the discrete Radon transform. The size of the matrix $R^{j,k}$ clearly depends on
the total number of items $n = |V|$. We omit $n$ as its meaning will be clear from the context.
The matrix $R^{j,k}$ constructed above is related to discrete Radon transforms on the homogeneous
space $M^k$. In fact, up to a constant, the adjoint operator $(R^{j,k})^*$ is called the discrete Radon
transform from the homogeneous space $M^j$ to $M^k$ in Diaconis (1988). Here all the $k$-sets form
a homogeneous space. The collection of all row vectors of $R^{j,k}$ is called the $j$-th Radon
basis for $M^k$. Our usage here is to exploit the transpose matrix of the Radon transform to
construct an over-complete dictionary for $M^j$, so that the observation $b$ can be represented
by a possibly sparse function $x \in M^k$ ($k \ge j$).
The Radon basis was proposed as an efficient way to study partially ranked data in Diaconis
(1988), where it was shown that by looking at low order Radon coefficients of a function on
$M^k$, we usually get useful and interpretable information. The approach here adds a reversal
of this perspective, i.e., the reconstruction of sparse high order functions from low order
Radon coefficients. We will discuss this in the sequel with a connection to compressive
sensing (Chen et al., 1999; Candes and Tao, 2005).
V. Mathematical Theory
One advantage of our new framework on compressive network analysis is that it enables
rigorous theoretical analysis of the corresponding convex programs.
A. Failure of Universal Recovery
Recently it was shown by Candes and Tao (2005) and Candes (2008) that $P_1$ has a unique
sparse solution $x_0$ if the matrix $A$ satisfies the Restricted Isometry Property (RIP), i.e.,
for every subset of columns $T \subset \{1, \ldots, N\}$ with $|T| \le s$, there exists a certain universal
constant $\delta_s \in [0, \sqrt{2} - 1)$ such that
$$(1 - \delta_s)\|x\|_2^2 \le \|A_T x\|_2^2 \le (1 + \delta_s)\|x\|_2^2, \quad \forall x \in \mathbb{R}^{|T|},$$
where $A_T$ is the sub-matrix of $A$ with columns indexed by $T$. Then exact recovery holds
for all $s$-sparse signals $x_0$ (i.e., $x_0$ has at most $s$ non-zero components), hence the name
universal recovery.
Unfortunately, in our construction of the basis matrix A, RIP is not satisfied except for very
small s. The following theorem illustrates the failure of universal recovery in our case.
Theorem 5.1. Let $n > k + j + 1$ and $A = R^{j,k}$ with $j < k$. Unless $s < \binom{k+j+1}{k}$, there does
not exist a $\delta_s < 1$ such that the inequalities
$$(1 - \delta_s)\|x\|_2^2 \le \|A_T x\|_2^2 \le (1 + \delta_s)\|x\|_2^2, \quad \forall x \in \mathbb{R}^{|T|}$$
hold universally for every $T \subset \{1, \ldots, N\}$ with $|T| \le s$, where $N = \binom{n}{k}$.
Note that $\binom{k+j+1}{k}$ does not depend on the network size $n$, which is problematic: we
can only recover a constant number of cliques no matter how large the network is. The
main reason for such a negative result is that the RIP tries to guarantee exact recovery for
arbitrary signals with a sparse representation in $A$. For many applications, such a condition
is too strong to be realistic. Instead of studying such “universal” conditions, in this paper we
seek conditions that secure exact recovery of a collection of sparse signals $x_0$ whose sparsity
pattern satisfies certain conditions more appropriate to our setting. Such conditions could
be more natural in reality; as will be shown in the sequel, they simply require bounded
overlaps between cliques.
Remark 5.2. Recall that the matrix $A$ has altogether $N = \binom{n}{k}$ columns. Each column in
fact corresponds to a $k$-clique. Therefore, we could also use a $k$-clique to index a column
of $A$. In this sense, let $T = \{i_1, \ldots, i_k\} \subset \{1, \ldots, N\}$ be a subset of size $k$. An equivalent
notation is to represent $T$ as a class of sets: $T = \{\tau_1, \ldots, \tau_k\}$ where each $\tau_i \subset \{1, \ldots, n\}$ and
$|\tau_i| = k$.
Proof. We can extract a set of columns $T = \{\tau : \tau \subset \{1, 2, \cdots, k+j+1\} \text{ and } |\tau| = k\}$
($\tau$ is interpreted as a $k$-set) and form the submatrix $A_T$. Recall that $A$ has altogether $\binom{n}{j}$
rows. Since $n > k + j + 1$ and the number of nonzero rows of $A_T$ is exactly $\binom{k+j+1}{j}$,
there must exist rows in $A_T$ which contain only zeros.
By discarding the zero rows, it is easy to see that the rank of $A_T$ is at most $\binom{k+j+1}{j}$, which is
less than the number of columns. To see the latter, we exploit the fact that $j < k$, and therefore
$$\binom{k+j+1}{j} < \binom{k+j+1}{k}, \tag{5.1}$$
from which we see that the number of nonzero rows of $A_T$ is smaller than the number of
columns.
Thus, the columns of $A_T$ must be linearly dependent. In other words, there exists a nonzero
vector $h \in \mathbb{R}^N$ with $\mathrm{supp}(h) \subset T$ such that $Ah = 0$. When $s \ge \binom{k+j+1}{k}$, since
$|\mathrm{supp}(h)| \le |T| \le s$, $h$ is an $s$-sparse vector in the null space of $A$, which violates the lower
bound in the inequalities for any $\delta_s < 1$. Hence we cannot expect universal sparse recovery for
all $s$-sparse signals. □
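The counting argument in this proof can be checked numerically on a small instance. The sketch below takes $j = 2$, $k = 3$, builds the nonzero rows of $A_T$ for the ground set of size $k + j + 1 = 6$, and computes the rank exactly. The helper `rank` is a hypothetical utility written for this illustration, and the unnormalized 0/1 matrix is used since column scaling does not affect linear dependence.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def rank(matrix):
    """Rank of a rational matrix via Gaussian elimination with exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rank_count = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Find a pivot row at or below the current rank position.
        pivot = next((r for r in range(rank_count, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank_count], rows[pivot] = rows[pivot], rows[rank_count]
        for r in range(rank_count + 1, len(rows)):
            if rows[r][col] != 0:
                factor = rows[r][col] / rows[rank_count][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank_count])]
        rank_count += 1
    return rank_count

j, k = 2, 3
m = k + j + 1                                # size of the ground set used in the proof
j_sets = list(combinations(range(m), j))     # nonzero rows of A_T
k_sets = list(combinations(range(m), k))     # columns in T
A_T = [[1 if set(s) <= set(t) else 0 for t in k_sets] for s in j_sets]

r = rank(A_T)
print(comb(m, j) < comb(m, k))               # inequality (5.1) for this instance
print(len(j_sets), len(k_sets), r)
```

Because the rank cannot exceed the number of nonzero rows, which is smaller than the number of columns, the columns indexed by $T$ are linearly dependent regardless of how large $n$ is.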
B. Exact Recovery Conditions
Here we present our exact recovery conditions for $x_0$ from the observed data $b$ by solving
the linear program $P_1$. Suppose $A$ is an $M$-by-$N$ matrix and $x_0$ is a sparse signal. Let
$T = \mathrm{supp}(x_0)$, $T^c$ be the complement of $T$, and $A_T$ (or $A_{T^c}$) be the submatrix of $A$ where we only
extract the column set $T$ (or $T^c$, respectively). The following proposition from Candes and Tao
(2005) characterizes the conditions under which $P_1$ has a unique solution. To make this paper
self-contained, we also include the proof in this section.
Proposition 5.3. (Candes and Tao, 2005) Let $x_0 = (x_{01}, \ldots, x_{0N})^T$. Assume that $A_T^* A_T$
is invertible and there exists a vector $w \in \mathbb{R}^M$ such that
1. $\langle A_j, w \rangle = \mathrm{sign}(x_{0j})$, $\forall j \in T$;
2. $|\langle A_j, w \rangle| < 1$, $\forall j \in T^c$.
Then $x_0$ is the unique solution of $P_1$.
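Proof. The necessity of the two conditions comes from the KKT conditions of $P_1$. Consider
an equivalent form of $P_1$,
$$\begin{aligned} \min \quad & \mathbf{1}^T \xi \\ \text{subject to} \quad & Ax - b = 0, \\ & -\xi \le x \le \xi, \\ & \xi \ge 0, \end{aligned}$$
whose Lagrangian is
$$L(x, \xi; \gamma, \lambda, \mu) = \mathbf{1}^T \xi + \gamma^T (Ax - b) - \lambda_+^T (\xi - x) - \lambda_-^T (\xi + x) - \mu^T \xi.$$
Here $\gamma \in \mathbb{R}^M$, $\lambda_+ = (\lambda_+(1), \ldots, \lambda_+(N))^T \in \mathbb{R}^N_+$, $\lambda_- = (\lambda_-(1), \ldots, \lambda_-(N))^T \in \mathbb{R}^N_+$, and $\mu \in \mathbb{R}^N_+$
are the Lagrange multipliers.
Then the KKT conditions give
1. $A^* \gamma + (\lambda_+ - \lambda_-) = 0$,
2. $\mathbf{1} - (\lambda_+ + \lambda_-) - \mu = 0$,
with $\lambda_+, \lambda_-, \mu \ge 0$ and $\lambda_+(j)\lambda_-(j) = 0$ for all $j$.
Clearly $T = \mathrm{supp}(x_0) = \{j : \xi_j > 0\}$. Let $w = -\gamma$. By the Strictly Complementary Theorem
for linear programming in Ye (1997), there exist $\mu$ and $\xi$ such that $1 > \mu_j > 0$ for all $j \in T^c$
with $\xi_j = 0$, and $\mu_j = 0$ for all $j \in T$ with $\xi_j > 0$. Thus, the first equation leads to
$$\langle w, A_j \rangle = \lambda_+(j) - \lambda_-(j) = \mathrm{sign}(x_{0j}), \quad j \in T;$$
the second equation leads to
$$|\langle w, A_j \rangle| = |\lambda_+(j) - \lambda_-(j)| = 1 - \mu_j < 1, \quad j \in T^c.$$
Therefore, the two conditions are necessary for $x_0$ to be the unique solution of $P_1$.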
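To prove that these two conditions are sufficient to guarantee that $x_0$ is the unique minimizer
of $P_1$, we need to show that any minimizer $y_0$ of the problem $P_1$ must be equal to $x_0$. Since $x_0$
obeys the constraint $Ax_0 = b$, we must have
$$\|y_0\|_1 \le \|x_0\|_1.$$
Now take a $w$ obeying the two conditions. We then compute
$$\begin{aligned}
\|y_0\|_1 &= \sum_{j \in T} |x_{0j} + (y_{0j} - x_{0j})| + \sum_{j \notin T} |y_{0j}| \\
&\ge \sum_{j \in T} \mathrm{sign}(x_{0j})\,(x_{0j} + (y_{0j} - x_{0j})) + \sum_{j \notin T} y_{0j} \langle w, A_j \rangle \\
&= \sum_{j \in T} |x_{0j}| + \sum_{j \in T} (y_{0j} - x_{0j}) \langle w, A_j \rangle + \sum_{j \notin T} y_{0j} \langle w, A_j \rangle \\
&= \sum_{j \in T} |x_{0j}| + \Big\langle w, \sum_{j \in T \cup T^c} y_{0j} A_j - \sum_{j \in T} x_{0j} A_j \Big\rangle \\
&= \|x_0\|_1 + \langle w, b - b \rangle = \|x_0\|_1.
\end{aligned}$$
Thus, the inequalities in the above computation must in fact be equalities. Since $|\langle w, A_j \rangle|$ is
strictly less than 1 for all $j \notin T$, this in particular forces $y_{0j} = 0$ for all $j \notin T$. Thus
$$\sum_{j \in T} (y_{0j} - x_{0j}) A_j = b - b = 0.$$
Since all columns in $A_T$ are independent, we must have $y_{0j} = x_{0j}$ for all $j \in T$. Thus $x_0 = y_0$.
This concludes the proof of the proposition. □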
The above proposition gives a necessary and sufficient condition under which, in the noise-free
setting, $P_1$ exactly recovers the sparse signal $x_0$. The necessity and sufficiency come from
the KKT conditions in convex optimization theory (Candes and Tao, 2005). However, this
condition is difficult to check due to the presence of $w$. If we further assume that $w$ lies in
the column span of $A_T$, the condition in Proposition 5.3 reduces to the following condition.
Irrepresentable Condition (IRR) The matrix $A$ satisfies the IRR condition with respect
to $T = \mathrm{supp}(x_0)$ if $A_T^* A_T$ is invertible and
$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1,$$
or, equivalently,
$$\|(A_T^* A_T)^{-1} A_T^* A_{T^c}\|_1 < 1,$$
where $\|\cdot\|_\infty$ stands for the matrix sup-norm, i.e., $\|A\|_\infty := \max_i \sum_j |A_{ij}|$, and
$\|A\|_1 := \max_j \sum_i |A_{ij}|$.
Proposition 5.4. By restricting $w$ to lie in the image of $A_T$, the conditions in Proposition
5.3 reduce to the IRR condition.
Proof. Since $w$ lies in the image of $A_T$, we can write $w = A_T v$. To make sure that the
first condition in Proposition 5.3 holds, we must have $v = (A_T^* A_T)^{-1} \mathrm{sign}(x_0)$, so
$$w = A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_0).$$
Now the second condition in Proposition 5.3 can be equivalently written as
$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1,$$
which is exactly the IRR condition. □
Intuitively, the IRR condition requires that, for the true sparse signal $x_0$, the relevant
bases $A_T$ are not highly correlated with the irrelevant bases $A_{T^c}$. Note that this condition only
depends on $A$ and $x_0$, which makes it easier to check. The assumption that $w$ lies in the column
span of $A_T$ is mild; it is actually a necessary condition for $x_0$ to be reconstructed by the
Lasso (Tibshirani, 1996) or the Dantzig selector (Candes and Tao, 2007), even under
Gaussian-like noise assumptions (Zhao and Yu, 2006; Yuan and Lin, 2007).
C. Detecting Cliques of Equal Size
In this subsection, we present sufficient conditions for IRR which can be easily verified. We
consider the case that $A = R^{j,k}$ with $j < k$. Given data $b$ about all $j$-sets, we want to infer
important $k$-cliques. Suppose $x_0$ is a sparse signal on all $k$-cliques. We have the following
theorem, which is a direct result of Lemma 5.6.
Theorem 5.5. Let $T = \mathrm{supp}(x_0)$. If we enforce the overlaps among $k$-cliques in $T$ to be
no larger than $r$, then $r \le j - 2$ guarantees the IRR condition.
Lemma 5.6. Let $T = \mathrm{supp}(x_0)$ and $j \ge 2$. Suppose that for any $\sigma_1, \sigma_2 \in T$, the two cliques
corresponding to $\sigma_1$ and $\sigma_2$ have overlaps no larger than $r$. Then
1. If $r \le j - 2$, then $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1$;
2. If $r = j - 1$, then $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le 1$, where equality holds in certain examples;
3. If $r = j$, there are examples such that $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty > 1$.
One thing to note is that Theorem 5.5 is only an easy-to-verify condition based on the worst-
case analysis, which is sufficient but not necessary. In fact, what really matters is the IRR
condition. It uses a simple characterization of allowed clique overlaps which guarantees the
IRR Condition. Specifically, clique overlaps no larger than j−2 is sufficient to guarantee the
exact sparse recovery by P1, while larger overlaps may violate the IRR Condition. Since this
theorem is based on a worst-case analysis, in real applications, one may encounter examples
which have overlaps larger than j − 2 while P1 still works.
In summary, IRR is sufficient and almost necessary to guarantee exact recovery. Theorem 5.5 gives the intuition behind IRR: overlaps among cliques must be small enough, a condition that is easier to check. In the next subsection, we show that IRR is also sufficient to guarantee stable recovery under noise.
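Proof. To prove Lemma 5.6, given any $\tau \in T^c$, we define
$$\mu_\tau \equiv \sum_{\sigma \in T} \frac{\binom{|\tau \cap \sigma|}{j}}{\binom{k}{j}}.$$
The intuition behind this definition is that
$$\sup_{\tau \in T^c} \mu_\tau = \|A_{T^c}^* A_T\|_\infty. \tag{5.2}$$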
As we will see in the following proofs, we essentially try to bound $\mu_\tau$ for $\tau \in T^c$.
Before we present the detailed technical proof, we first introduce the high-level idea: our main purpose is to bound $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty$. Each entry of the matrix $A_T^* A_T$ is indexed by two $k$-sets, and the value of this entry represents how many $j$-sets are contained in the intersection of these two $k$-sets. Under the condition that $r \le j - 1$, two distinct $k$-sets in $T$ share no $j$-set, so the matrix $A_T^* A_T$ is an identity. Therefore, bounding $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty$ is equivalent to bounding $\|A_{T^c}^* A_T\|_\infty$, which is exactly $\sup_{\tau \in T^c} \mu_\tau$.
Proof of the case under Condition 1
Under Condition 1, any $\sigma_1, \sigma_2 \in T$ satisfy $|\sigma_1 \cap \sigma_2| \le j - 2$, hence any two columns in $T$ are orthogonal. This implies that $A_T^* A_T$ is an identity matrix.
Now given $\tau \in T^c$, we will prove $\mu_\tau < 1$ under Condition 1. If this is true, then
$$\sup_{\tau \in T^c} \mu_\tau = \|A_{T^c}^* A_T\|_\infty = \|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1.$$
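Let $T = \{\sigma_1, \sigma_2, \cdots, \sigma_{|T|}\}$ where the $\sigma_i$ ($1 \le i \le |T|$) are $k$-sets. We need to prove
$$\mu_\tau = \sum_{i=1}^{|T|} \frac{\binom{|\tau \cap \sigma_i|}{j}}{\binom{k}{j}} < 1$$
for all $\tau \in T^c$.
Let $M_i = \{\rho : |\rho| = j, \rho \subset \tau \cap \sigma_i\}$, so $M_i$ is the collection of $j$-sets of $\tau \cap \sigma_i$ (if $|\tau \cap \sigma_i| < j$, then $M_i$ is simply the empty set). Obviously, we have $|M_i| = \binom{|\tau \cap \sigma_i|}{j}$. So
$$\sum_{i=1}^{|T|} \binom{|\tau \cap \sigma_i|}{j} = \sum_{i=1}^{|T|} |M_i|.$$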
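Now we note that for any $1 \le i < l \le |T|$, we have $M_i \cap M_l = \emptyset$. Otherwise, suppose $\rho \in M_1 \cap M_2$; then $\rho$ is a $j$-set belonging to both $M_1$ and $M_2$. Hence $\rho \subset \tau \cap \sigma_1$ and $\rho \subset \tau \cap \sigma_2$, which implies that
$$|\sigma_1 \cap \sigma_2| \ge |(\tau \cap \sigma_1) \cap (\tau \cap \sigma_2)| \ge |\rho| \ge j.$$
This contradicts the condition that the $\sigma_i$ ($1 \le i \le |T|$) have overlaps at most $j - 2$. So the $M_i$ must be pairwise disjoint. Hence
$$\sum_{i=1}^{|T|} \binom{|\tau \cap \sigma_i|}{j} = \sum_{i=1}^{|T|} |M_i| = \left|\bigcup_{i=1}^{|T|} M_i\right|.$$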
For any $1 \le i \le |T|$, every $\rho \in M_i$ is a $j$-set of $\tau \cap \sigma_i$, and hence a $j$-set of $\tau$. The set $\tau$ is of size $k$. So if we let $M_0 = \{\rho : |\rho| = j, \rho \subset \tau\}$, the collection of all $j$-sets of $\tau$, then we have $\bigcup_{i=1}^{|T|} M_i \subset M_0$. So $\left|\bigcup_{i=1}^{|T|} M_i\right| \le |M_0| = \binom{k}{j}$.
So far, we have proved $\mu_\tau \le 1$. The above proof of $\mu_\tau \le 1$ for any $\tau \in T^c$ remains valid under Condition 2. Next, we prove that if any $\sigma_i, \sigma_l \in T$ satisfy $|\sigma_i \cap \sigma_l| \le j - 2$, then equality cannot hold.
Without loss of generality, we assume $|\sigma_1 \cap \tau| \ge j$; otherwise, if none of the $\sigma_i$ satisfies $|\sigma_i \cap \tau| \ge j$, then $\mu_\tau = 0$, which finishes the proof. To show that equality does not hold, we only need to find one $j$-set in $M_0$ that does not belong to $\bigcup_i M_i$.
In this case, we can let $\tau = \{1, 2, \cdots, k\}$ and $\sigma_1 = \{1, 2, \cdots, s, k+1, k+2, \cdots, 2k-s\}$ where $j \le s \le k - 1$ ($s \le k - 1$ because otherwise $\sigma_1 = \tau$, which contradicts the fact that $\sigma_1 \in T$ and $\tau \in T^c$). Now we show that $\rho_0 = \{1, 2, \cdots, j-1, s+1\}$ is not a member of $\bigcup_{i=1}^{|T|} M_i$. Clearly $\rho_0$ is not a member of $M_1$ because $s + 1 \notin \sigma_1$. It remains to show that $\rho_0$ is not a member of any $M_i$ ($2 \le i \le |T|$). If this were not true, say $\rho_0 \in M_2$, then $\rho_0 \subset (\tau \cap \sigma_2) \subset \sigma_2$, so $\{1, 2, \cdots, j-1\} \subset \sigma_1 \cap \sigma_2$, which contradicts the condition that $|\sigma_1 \cap \sigma_2| \le j - 2$. Since $\rho_0$ clearly lies in $M_0$, this means $\bigcup_{i=1}^{|T|} M_i$ is a proper subset of $M_0$. So $\left|\bigcup_{i=1}^{|T|} M_i\right| < \binom{k}{j}$, which means $\mu_\tau < 1$.
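The counting argument for Condition 1 can be checked by brute force on a small instance. The sketch below is only an illustration: the toy sizes ($j = 3$, $k = 4$, $n = 7$), the node labels, and the helper names `mu` and `sup_mu` are assumptions introduced here, and it uses the inner-product formula $\binom{|\tau \cap \sigma|}{j} / \binom{k}{j}$ from the definition of $\mu_\tau$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def mu(tau, support, j, k):
    """mu_tau = sum over sigma in T of C(|tau & sigma|, j) / C(k, j)."""
    return sum(Fraction(comb(len(tau & sigma), j), comb(k, j)) for sigma in support)


def sup_mu(n, support, j, k):
    """Largest mu_tau over all k-subsets tau of {0, ..., n-1} outside the support."""
    candidates = (frozenset(c) for c in combinations(range(n), k))
    return max(mu(tau, support, j, k) for tau in candidates if tau not in support)


# Toy instance (illustrative only): two 4-cliques sharing a single node,
# so the overlap is r = 1 = j - 2 and Condition 1 applies.
j, k, n = 3, 4, 7
T = {frozenset({0, 1, 2, 3}), frozenset({3, 4, 5, 6})}
worst = sup_mu(n, T, j, k)
print(worst)  # strictly below 1, as Condition 1 predicts
```

Exact rational arithmetic is used so that the comparison with 1 is not affected by rounding.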
Proof of the case under Condition 2
Under Condition 2, the argument is almost the same as under Condition 1: $A_T^* A_T$ is an identity matrix and $\mu_\tau \le 1$. However, one cannot show $\mu_\tau < 1$ in this case. We have the following example where, if $n$ is large enough, $\mu_\tau$ is exactly equal to one.
Let $\tau = \{1, 2, \cdots, k\} \in T^c$. Denote all the $j$-sets of $\tau$ by $\rho_1, \rho_2, \cdots, \rho_{\binom{k}{j}}$. When $n$ is large enough, we choose $\binom{k}{j}$ disjoint $(k-j)$-sets of $\{k+1, k+2, \cdots, n\}$, denoted by $\omega_1, \omega_2, \cdots, \omega_{\binom{k}{j}}$. Let $T = \{\sigma_1, \sigma_2, \cdots, \sigma_{|T|}\}$, where $\sigma_i = \rho_i \cup \omega_i$. Hence $|T| = \binom{k}{j}$ and the $\sigma_i$ satisfy $|\sigma_i \cap \sigma_l| \le j - 1$ for $i \ne l$. But
$$\sum_{i=1}^{|T|} \frac{\binom{|\tau \cap \sigma_i|}{j}}{\binom{k}{j}} = \sum_{i=1}^{|T|} \frac{1}{\binom{k}{j}} = 1.$$
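Proof of the case under Condition 3
Under Condition 3, we can construct examples where
$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty > 1.$$
Let $\rho_1, \rho_2, \cdots, \rho_{\binom{k}{j}}$ be all the $j$-sets of $\{1, 2, \cdots, k\}$. For large enough $n$, it is possible to choose $\binom{k}{j} + 1$ disjoint $(k-j)$-sets of $\{k+1, k+2, \cdots, n\}$, say $\omega_0, \omega_1, \omega_2, \cdots, \omega_{\binom{k}{j}}$. Let $\sigma_i = \rho_i \cup \omega_i$ for $1 \le i \le \binom{k}{j}$ and $\sigma_0 = \rho_1 \cup \omega_0$. Define $T = \{\sigma_0, \sigma_1, \sigma_2, \cdots, \sigma_{\binom{k}{j}}\}$, which is of size $|T| = \binom{k}{j} + 1$.
In this case, $|\sigma_i \cap \sigma_l| \le j - 1$ for any $1 \le i < l \le \binom{k}{j}$, $|\sigma_0 \cap \sigma_1| = j$, and $|\sigma_0 \cap \sigma_i| \le j - 1$ for any $2 \le i \le \binom{k}{j}$. Then $A_T^* A_T$ is a $\left(\binom{k}{j} + 1\right) \times \left(\binom{k}{j} + 1\right)$ matrix, shown below, with rows and columns corresponding to $\{\sigma_0, \sigma_1, \cdots, \sigma_{\binom{k}{j}}\}$:
$$A_T^* A_T = \begin{pmatrix}
1 & \epsilon & 0 & 0 & \cdots & 0 \\
\epsilon & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}.$$
Here $\epsilon = 1/\binom{k}{j}$. The inverse of the matrix is
$$(A_T^* A_T)^{-1} = \begin{pmatrix}
\frac{1}{1-\epsilon^2} & -\frac{\epsilon}{1-\epsilon^2} & 0 & 0 & \cdots & 0 \\
-\frac{\epsilon}{1-\epsilon^2} & \frac{1}{1-\epsilon^2} & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1
\end{pmatrix}.$$
Consider $\tau = \{1, 2, \cdots, k\} \in T^c$. The row of $A_{T^c}^* A_T$ corresponding to $\tau$ is a vector of length $|T| = \binom{k}{j} + 1$ with each entry equal to $\epsilon = 1/\binom{k}{j}$. So the row of $A_{T^c}^* A_T (A_T^* A_T)^{-1}$ corresponding to $\tau$ is the vector of length $\binom{k}{j} + 1$
$$\left[\frac{\epsilon}{1+\epsilon}, \frac{\epsilon}{1+\epsilon}, \epsilon, \epsilon, \cdots, \epsilon\right].$$
This vector has row sum
$$\frac{2\epsilon}{1+\epsilon} + \left(\binom{k}{j} - 1\right)\epsilon = \frac{2\epsilon}{1+\epsilon} + \left(\frac{1}{\epsilon} - 1\right)\epsilon = \frac{1 + 2\epsilon - \epsilon^2}{1+\epsilon} > \frac{1 + 2\epsilon - \epsilon}{1+\epsilon} = 1.$$
Hence in this example $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty > 1$. □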
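The counterexample for Condition 3 can be reproduced exactly for the smallest case $j = 2$, $k = 3$ (so $\epsilon = 1/3$), using $n = 7$ nodes. This is a minimal sketch under stated assumptions: the helpers `gram` and `solve` are hypothetical names introduced here, and inner products between columns are computed as $\binom{|\tau \cap \sigma|}{j}/\binom{k}{j}$, consistent with the definition of $\mu_\tau$.

```python
from fractions import Fraction
from math import comb


def gram(a, b, j, k):
    """Inner product of the unit-norm columns indexed by the k-sets a and b."""
    return Fraction(comb(len(a & b), j), comb(k, j))


def solve(matrix, rhs):
    """Solve matrix @ y = rhs exactly by Gauss-Jordan elimination over Fractions."""
    size = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [entry / pivot_value for entry in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


# sigma_0 and sigma_1 share the 2-set {1, 2}; all other overlaps are at most 1.
j, k = 2, 3
T = [frozenset(s) for s in ({1, 2, 4}, {1, 2, 5}, {1, 3, 6}, {2, 3, 7})]
tau = frozenset({1, 2, 3})

G = [[gram(a, b, j, k) for b in T] for a in T]  # A_T^* A_T
c = [gram(tau, sigma, j, k) for sigma in T]     # row of A_{T^c}^* A_T for tau
row = solve(G, c)                               # G is symmetric, so this is c G^{-1}
row_sum = sum(abs(x) for x in row)
print(row, row_sum)
```

The row sum agrees with the closed form $(1 + 2\epsilon - \epsilon^2)/(1 + \epsilon)$ evaluated at $\epsilon = 1/3$, which exceeds one.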
In the following, we construct explicit conditions which allow large overlaps while the IRR still holds, as long as such heavy overlaps do not occur too often among the cliques in $T$. The existence of a partition of $T$ in the next theorem is a reasonable assumption in network settings where network hierarchies exist. In social networks, it has been observed by Girvan and Newman (2002) that communities themselves also join together to form meta-communities. The assumptions in the next theorem characterize such a scenario: we allow relatively larger overlaps between communities from the same meta-community, and relatively smaller overlaps between communities from different meta-communities.
Theorem 5.7. Assume $(k + 1)/2 \le j < k$. Let $T = \mathrm{supp}(x_0)$. Suppose there exists a partition $T = T_1 \cup T_2 \cup \cdots \cup T_m$ with each $T_i$ satisfying $|T_i| \le K$, such that
• for any $\sigma_i, \sigma_l$ belonging to the same partition, $|\sigma_i \cap \sigma_l| \le r$;
• for any $\sigma_i, \sigma_l$ belonging to different partitions, $|\sigma_i \cap \sigma_l| \le 2j - k - 1$.
If $K$ satisfies
$$(K - 1)\binom{r}{j} \Big/ \binom{k}{j} < 1/4, \qquad \left(\binom{k-1}{j} + (K - 1)\binom{(k+r)/2}{j}\right) \Big/ \binom{k}{j} \le 3/4,$$
then IRR holds.
Proof. We will show that the following two inequalities hold:
$$\|A_T^* A_T - I\|_\infty \le (K - 1)\binom{r}{j} \Big/ \binom{k}{j}, \qquad \|A_{T^c}^* A_T\|_\infty \le \left(\binom{k-1}{j} + (K - 1)\binom{(k+r)/2}{j}\right) \Big/ \binom{k}{j}.$$
We first bound the sup-norm of $A_T^* A_T - I$. Note that when $\sigma_i$ and $\sigma_l$ belong to different partitions of $T$, the corresponding entry of $A_T^* A_T$ is zero, because their overlap is no larger than $2j - k - 1$, which is strictly smaller than $j$. So $A_T^* A_T$ is a block diagonal matrix with block sizes $|T_1|, |T_2|, \cdots, |T_m|$, and each diagonal entry of $A_T^* A_T$ is one.
Thus, for any $\sigma \in T$, only cliques from the same partition as $\sigma$ may have overlaps with $\sigma$ of size at least $j$. Thus, the row sum of $A_T^* A_T - I$ can be bounded by $(K - 1)\binom{r}{j} / \binom{k}{j}$. So the first inequality is now established.
To prove the second inequality, we observe that for a fixed $\tau \in T^c$, $|\tau \cap \sigma_i| \ge j$ and $|\tau \cap \sigma_l| \ge j$ cannot hold at the same time for any $\sigma_i$ and $\sigma_l$ belonging to different partitions. Otherwise, we would have
$$|\tau| \ge |\tau \cap (\sigma_i \cup \sigma_l)| = |\tau \cap \sigma_i| + |\tau \cap \sigma_l| - |\tau \cap \sigma_i \cap \sigma_l| \ge j + j - (2j - k - 1) = k + 1.$$
Thus, all $\sigma$'s whose intersections with a fixed $\tau$ have size no less than $j$ must lie in the same partition of $T$.
For the same reason, we can show that for a fixed $\tau \in T^c$, $|\tau \cap \sigma_i| \ge (k + r + 1)/2$ and $|\tau \cap \sigma_l| \ge (k + r + 1)/2$ cannot hold at the same time for $\sigma_i$ and $\sigma_l$ belonging to the same partition of $T$. Otherwise, we would have
$$|\tau| \ge |\tau \cap (\sigma_i \cup \sigma_l)| = |\tau \cap \sigma_i| + |\tau \cap \sigma_l| - |\tau \cap \sigma_i \cap \sigma_l| \ge (k + r + 1)/2 + (k + r + 1)/2 - r = k + 1.$$
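Thus, within the single partition that matters for $\tau$, at most one clique meets $\tau$ in more than $(k + r)/2$ elements, and it meets $\tau$ in at most $k - 1$ elements since $\tau \notin T$; each of the remaining at most $K - 1$ cliques meets $\tau$ in at most $(k + r)/2$ elements. Hence the maximum row sum of $A_{T^c}^* A_T$ is bounded from above by
$$\left(\binom{k-1}{j} + (K - 1)\binom{(k+r)/2}{j}\right) \Big/ \binom{k}{j}.$$
Now if $K$ further satisfies
$$(K - 1)\binom{r}{j} \Big/ \binom{k}{j} < 1/4, \qquad \left(\binom{k-1}{j} + (K - 1)\binom{(k+r)/2}{j}\right) \Big/ \binom{k}{j} \le 3/4,$$
then we have
$$\|A_T^* A_T - I\|_\infty < 1/4, \qquad \|A_{T^c}^* A_T\|_\infty \le 3/4.$$
Thus,
$$\begin{aligned}
\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty &\le \|A_{T^c}^* A_T\|_\infty \|(A_T^* A_T)^{-1}\|_\infty \\
&\le \|A_{T^c}^* A_T\|_\infty \left(1 + \sum_{i=1}^{\infty} \|A_T^* A_T - I\|_\infty^i\right) \\
&< \frac{3}{4}\left(1 + \sum_{i=1}^{\infty} (1/4)^i\right) = 1.
\end{aligned}$$
So IRR holds under our conditions. □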
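The bound on the inverse used in the last chain is the Neumann series estimate. As a short worked version, write $E = A_T^* A_T - I$ (a shorthand introduced here only for this remark). Since $\|E\|_\infty < 1/4 < 1$,
$$(A_T^* A_T)^{-1} = (I + E)^{-1} = \sum_{i=0}^{\infty} (-E)^i, \qquad \|(A_T^* A_T)^{-1}\|_\infty \le \sum_{i=0}^{\infty} \|E\|_\infty^i = \frac{1}{1 - \|E\|_\infty} < \frac{1}{1 - 1/4} = \frac{4}{3},$$
so that the product with $\|A_{T^c}^* A_T\|_\infty \le 3/4$ is strictly less than $\frac{3}{4} \cdot \frac{4}{3} = 1$.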
The basis matrix $A = R_{j,k}$ has $\binom{n}{k}$ bases, which is not polynomial with respect to $k$. As we will see in later sections, a practical implementation of Radon basis pursuit for the clique detection problem works on a subset of bases among all $\binom{n}{k}$ bases. In that case, we are actually solving $P_1$ and $P_{1,\delta}$ with a basis matrix $\tilde{A}$, which is only a submatrix of $A$ with a subset of column bases extracted. We have the following theorem regarding this scenario.
Theorem 5.8. Denote the set of all cliques for columns in $\tilde{A}$ by $S$, where $\tilde{A}$ is a submatrix of $A$. Assume any two $k$-cliques in $\tilde{A}$ have intersections at most $r$, i.e., $\forall \sigma_i, \sigma_l \in T \cup T^c$, $|\sigma_i \cap \sigma_l| \le r$, where $T = \mathrm{supp}(x_0) \subset S$, and $T^c$ is the complement of $T$ with respect to $S$. Then IRR holds if
$$r \le \left(\frac{1}{|T|(1 + \sqrt{|T|})}\right)^{1/j} k. \tag{5.3}$$
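Proof. Note that
$$\|\tilde{A}_{T^c}^* \tilde{A}_T (\tilde{A}_T^* \tilde{A}_T)^{-1}\|_\infty \le \|\tilde{A}_{T^c}^* \tilde{A}_T\|_\infty \|(\tilde{A}_T^* \tilde{A}_T)^{-1}\|_\infty \le \|\tilde{A}_{T^c}^* \tilde{A}_T\|_\infty \cdot \sqrt{|T|}\, \|(\tilde{A}_T^* \tilde{A}_T)^{-1}\|_2.$$
So it suffices to show
$$\|\tilde{A}_{T^c}^* \tilde{A}_T\|_\infty \cdot \sqrt{|T|}\, \|(\tilde{A}_T^* \tilde{A}_T)^{-1}\|_2 < 1$$
under condition (5.3).
Firstly,
$$\|\tilde{A}_{T^c}^* \tilde{A}_T\|_\infty = \max_{\tau \in T^c} \sum_{\sigma \in T} \frac{\binom{|\tau \cap \sigma|}{j}}{\binom{k}{j}} \le \frac{|T|\binom{r}{j}}{\binom{k}{j}}, \quad \text{since } |\tau \cap \sigma| \le r.$$
At least we need
$$|T|\binom{r}{j} \Big/ \binom{k}{j} < 1. \tag{5.4}$$
Secondly, let $K = \tilde{A}_T^* \tilde{A}_T$. Then
$$K_{ii} = 1,$$
and since $\forall \sigma_i, \sigma_l \in T$, $|\sigma_i \cap \sigma_l| \le r$, we have
$$K_{il} \le \frac{\binom{r}{j}}{\binom{k}{j}}.$$
Under condition (5.4), $K$ is diagonally dominant, i.e.,
$$K_{ii} > \sum_{l \ne i} |K_{il}|.$$
Then by the Gershgorin Circle Theorem,
$$\lambda_{\min} \ge 1 - \sum_{l \ne i} |K_{il}| \ge 1 - (|T| - 1)\binom{r}{j} \Big/ \binom{k}{j} \ge 1 - |T|\binom{r}{j} \Big/ \binom{k}{j}.$$
Therefore it suffices to have
$$\frac{|T|\binom{r}{j}}{\binom{k}{j}} \cdot \frac{\sqrt{|T|}}{1 - |T|\binom{r}{j} / \binom{k}{j}} < 1,$$
which gives
$$\binom{r}{j} < \frac{1}{|T|(1 + \sqrt{|T|})}\binom{k}{j}.$$
Since $\binom{r}{j} / \binom{k}{j} = \prod_{i=0}^{j-1} \frac{r - i}{k - i} \le (r/k)^j$ for $r \le k$, to satisfy this it suffices to assume
$$r < \left(\frac{1}{|T|(1 + \sqrt{|T|})}\right)^{1/j} k.$$
□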
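To get a feel for how restrictive condition (5.3) is, the following sketch evaluates its right-hand side and the intermediate binomial requirement from the proof. The function names and the sample sizes ($k = 20$, $j = 2$, $|T| = 4$) are assumptions chosen purely for illustration.

```python
from math import comb, floor, sqrt


def overlap_bound(k, j, support_size):
    """Right-hand side of (5.3): (1 / (|T| (1 + sqrt|T|)))^(1/j) * k."""
    return (1.0 / (support_size * (1.0 + sqrt(support_size)))) ** (1.0 / j) * k


def binomial_condition(r, k, j, support_size):
    """Intermediate requirement C(r, j) < C(k, j) / (|T| (1 + sqrt|T|))."""
    return comb(r, j) * support_size * (1.0 + sqrt(support_size)) < comb(k, j)


# Hypothetical sizes, chosen only for illustration.
k, j, support_size = 20, 2, 4
bound = overlap_bound(k, j, support_size)
r_max = floor(bound)
print(bound, r_max, binomial_condition(r_max, k, j, support_size))

# The passage from the binomial requirement to (5.3) relies on
# C(r, j) / C(k, j) <= (r / k)^j for r <= k, checked here with integers.
assert all(comb(r, j) * k**j <= comb(k, j) * r**j for r in range(k + 1))
```

Because (5.3) is derived through a chain of sufficient conditions, overlaps slightly above `r_max` may still satisfy the binomial requirement.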
D. Stable Recovery Theorems
In applications, one always encounters examples with noise, for which exact sparse recovery is impossible. In this setting, $P_{1,\delta}$ is a good replacement for $P_1$ as a robust reconstruction program. Here we present a stable recovery theorem for $P_{1,\delta}$ with bounded noise.
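Theorem 5.9. Under the general framework (2.3), we assume that $\|z\|_\infty \le \epsilon$, $|T| = s$, and the IRR
$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le \alpha < 1/s.$$
Then the following error bound holds for any solution $x_\delta$ of $P_{1,\delta}$:
$$\|x_\delta - x_0\|_1 \le \frac{2s(\epsilon + \delta)}{1 - \alpha s} \|A_T (A_T^* A_T)^{-1}\|_1. \tag{5.5}$$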
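Proof. Let $h = x_\delta - x_0$. Note that $\|A x_\delta - b\|_\infty \le \delta$ and $z = A x_0 - b$ with $\|z\|_\infty \le \epsilon$. Then
$$\|Ah\|_\infty = \|A x_\delta - A x_0\|_\infty = \|A x_\delta - b + b - A x_0\|_\infty \le \|A x_\delta - b\|_\infty + \|z\|_\infty \le \delta + \epsilon. \tag{5.6}$$
We denote by $x_\delta|_T$ the restriction of $x_\delta$ to the support $T$, i.e., all the entries of $x_\delta$ corresponding to $T^c$ are set to zero. From the optimization problem $P_{1,\delta}$, we know that $\|x_0\|_1 \ge \|x_\delta\|_1$, so
$$\|h_T\|_1 = \|x_0 - x_\delta|_T\|_1 \ge \|x_0\|_1 - \|x_\delta|_T\|_1 \ge \|x_\delta\|_1 - \|x_\delta|_T\|_1 = \|x_\delta|_{T^c}\|_1 = \|h_{T^c}\|_1. \tag{5.7}$$
Therefore,
$$\begin{aligned}
|\langle Ah, A_T (A_T^* A_T)^{-1} h_T \rangle| &= |\langle A_T h_T, A_T (A_T^* A_T)^{-1} h_T \rangle + \langle A_{T^c} h_{T^c}, A_T (A_T^* A_T)^{-1} h_T \rangle| \\
&\ge \|h_T\|_2^2 - |\langle h_{T^c}, A_{T^c}^* A_T (A_T^* A_T)^{-1} h_T \rangle| \\
&\ge \|h_T\|_2^2 - \|h_{T^c}\|_1 \|A_{T^c}^* A_T (A_T^* A_T)^{-1} h_T\|_\infty \\
&\ge \frac{1}{s}\|h_T\|_1^2 - \alpha \|h_{T^c}\|_1 \|h_T\|_\infty \\
&\ge \frac{1}{s}\|h_T\|_1^2 - \alpha \|h_{T^c}\|_1 \|h_T\|_1 \\
&\ge \left(\frac{1}{s} - \alpha\right)\|h_T\|_1^2,
\end{aligned}$$
where the last step is due to $\|h_T\|_1 \ge \|h_{T^c}\|_1$ in inequality (5.7). On the other hand,
$$|\langle Ah, A_T (A_T^* A_T)^{-1} h_T \rangle| \le \|Ah\|_\infty \|A_T (A_T^* A_T)^{-1} h_T\|_1 \le (\delta + \epsilon) \|A_T (A_T^* A_T)^{-1}\|_1 \|h_T\|_1$$
using (5.6). Combining these two inequalities yields
$$\|h_T\|_1 \le \frac{s(\delta + \epsilon)}{1 - \alpha s} \|A_T (A_T^* A_T)^{-1}\|_1.$$
Since $\|h\|_1 = \|h_T\|_1 + \|h_{T^c}\|_1 \le 2\|h_T\|_1$ by (5.7), the bound (5.5) follows, as desired. □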
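In the special case where $k = j + 1$, we have:
Corollary 5.10. Let $k = j + 1$, $|T| = s$, and suppose that for any $\sigma_1, \sigma_2 \in T$, the two cliques corresponding to $\sigma_1$ and $\sigma_2$ have overlaps no larger than $r$, with $r \le j - 2$ as in Theorem 5.5. Then we have $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le 1/(j + 1)$, and thus the following error bound for a solution $x_\delta$ of $P_{1,\delta}$ holds:
$$\|x_\delta - x_0\|_1 \le \frac{2s(\epsilon + \delta)}{1 - \frac{s}{j+1}} \sqrt{j + 1}, \qquad s < j + 1.$$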
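Proof. This corollary follows from Lemma 5.6 and Theorem 5.9. Note that when the overlap condition of Theorem 5.5 holds, $A_T^* A_T = I$ and $\|A_T\|_1 \le \sqrt{\binom{k}{j}} = \sqrt{j + 1}$.
Now it suffices to establish the fact that in this special case, we have
$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le \frac{1}{j + 1} < 1.$$
Since any $\sigma_1, \sigma_2 \in T$ satisfy $|\sigma_1 \cap \sigma_2| \le j - 2$, $A_T^* A_T$ is an identity matrix. So $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty = \|A_{T^c}^* A_T\|_\infty$. Now assume $\tau \in T^c$ and let $S_\tau = \{\sigma : |\sigma \cap \tau| \ge j, \sigma \in T\}$; then $|S_\tau| \le 1$. Otherwise, suppose $|S_\tau| \ge 2$ with $\{\sigma_1, \sigma_2\} \subset S_\tau$; then we have
$$|\tau| \ge |\tau \cap (\sigma_1 \cup \sigma_2)| = |\tau \cap \sigma_1| + |\tau \cap \sigma_2| - |\tau \cap \sigma_1 \cap \sigma_2| \ge j + j - (j - 2) = j + 2,$$
which contradicts the fact that $\tau$ is a $(j + 1)$-set. So there exists at most one $\sigma_0 \in T$ such that $|\tau \cap \sigma_0| \ge j$. Let $v_\tau$ be the row vector of $A_{T^c}^* A_T$ whose row index corresponds to $\tau$. It has at most one nonzero entry, so
$$\|v_\tau\|_\infty \le \frac{\binom{j}{j}}{\binom{j+1}{j}} = \frac{1}{j + 1} < 1. \quad □$$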
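The error bounds of Theorem 5.9 and Corollary 5.10 are simple enough to evaluate directly. The sketch below is only illustrative: the function names and the sample numbers are assumptions, and `op_norm_1` stands for whatever bound on $\|A_T (A_T^* A_T)^{-1}\|_1$ is available.

```python
from math import sqrt


def l1_error_bound(s, eps, delta, alpha, op_norm_1):
    """Right-hand side of (5.5): 2 s (eps + delta) / (1 - alpha s) * op_norm_1."""
    if alpha * s >= 1:
        raise ValueError("the bound needs alpha * s < 1")
    return 2 * s * (eps + delta) / (1 - alpha * s) * op_norm_1


def corollary_bound(s, j, eps, delta):
    """Special case k = j + 1: alpha = 1 / (j + 1) and norm bound sqrt(j + 1)."""
    return l1_error_bound(s, eps, delta, 1.0 / (j + 1), sqrt(j + 1))


# Hypothetical numbers: two triangles (j = 2, k = 3) with noise level 0.01.
print(corollary_bound(s=2, j=2, eps=0.01, delta=0.01))
```

The bound degrades quickly as $s$ approaches $j + 1$, since the denominator $1 - s/(j+1)$ tends to zero.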
E. Identifying Cliques with Mixed Sizes
In general settings, we need to identify high order cliques of mixed sizes, i.e., cliques of
sizes k1, k2, · · · , kl (k1 < k2 < · · · < kl), based on the observed data b on all j-sets. One
way to construct the basis matrix A is by concatenating Rj,k with different k’s satisfying
k > j. We can then solve P1 and P1,δ for exact recovery and stable recovery with this newly
concatenated basis matrix A. We have the following theorem:
Theorem 5.11. Suppose $x_0$ is a sparse signal on cliques of sizes $k_1, k_2, \cdots, k_\ell$ ($j \le k_1 < k_2 < \cdots < k_\ell \le k$) and $b = A x_0$. Let $T = \mathrm{supp}(x_0)$.
1. If the cliques in $T$ have no overlaps, then they can be identified by $P_1$.
2. Moreover, if the data $b = A x_0 + z$ is contaminated by the noise $z$, $P_{1,\delta}$ provides an estimate of $x_0$ for which the inequality in (5.5) still holds.
Proof. We prove that if any $\sigma_1, \sigma_2 \in T$ satisfy $|\sigma_1 \cap \sigma_2| = 0$, then solving $P_1$ exactly identifies $x_0$.
For simplicity, given any $\tau \in T^c$, we define
$$\mu_\tau = \sum_{\sigma \in T} \frac{1}{\sqrt{\binom{|\tau|}{j}\binom{|\sigma|}{j}}} \binom{|\tau \cap \sigma|}{j}.$$
Note that $\sigma_1$ and $\sigma_2$ having empty intersection implies that $A_T^* A_T = I$; moreover, given $\tau \in T^c$, the sets in the collection $\{\tau \cap \sigma \mid \sigma \in T\}$ are disjoint. If there is only one $\sigma_0$ satisfying $|\tau \cap \sigma_0| \ge j$, then
$$\mu_\tau = \frac{1}{\sqrt{\binom{|\tau|}{j}\binom{|\sigma_0|}{j}}} \binom{|\tau \cap \sigma_0|}{j} < 1,$$
because it is the inner product of the two columns of $A$ corresponding to $\tau$ and $\sigma_0$, and no two columns of $A$ are identical.
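Now suppose there are at least two $\sigma$'s satisfying $|\tau \cap \sigma| \ge j$. Then we have
$$\mu_\tau = \sum_{\sigma \in T} \frac{1}{\sqrt{\binom{|\tau|}{j}\binom{|\sigma|}{j}}} \binom{|\tau \cap \sigma|}{j} \le \sum_{\sigma \in T, |\tau \cap \sigma| \ge j} \frac{1}{\sqrt{\binom{|\tau|}{j}\binom{|\tau \cap \sigma|}{j}}} \binom{|\tau \cap \sigma|}{j} = \sum_{\sigma \in T, |\tau \cap \sigma| \ge j} \frac{\sqrt{\binom{|\tau \cap \sigma|}{j}}}{\sqrt{\binom{|\tau|}{j}}}.$$
Since the sets in the collection $\{\tau \cap \sigma \mid \sigma \in T\}$ are disjoint, if we can prove
$$\sqrt{\binom{|\tau \cap \sigma_1|}{j}} + \sqrt{\binom{|\tau \cap \sigma_2|}{j}} < \sqrt{\binom{|\tau \cap (\sigma_1 \cup \sigma_2)|}{j}},$$
then we know that
$$\mu_\tau \le \sum_{\sigma \in T, |\tau \cap \sigma| \ge j} \frac{\sqrt{\binom{|\tau \cap \sigma|}{j}}}{\sqrt{\binom{|\tau|}{j}}} < \sqrt{\binom{|\tau \cap (\cup_{\sigma \in T, |\tau \cap \sigma| \ge j}\, \sigma)|}{j}} \Big/ \sqrt{\binom{|\tau|}{j}} \le 1.$$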
Now we only need to prove the following inequality: suppose $j \ge 2$; given $n_1 \ge j$ and $n_2 \ge j$, we need to prove
$$\sqrt{\binom{n_1}{j}} + \sqrt{\binom{n_2}{j}} < \sqrt{\binom{n_1 + n_2}{j}}.$$
The case $j = 2$ can be verified directly, while for $j \ge 3$ we square both sides, so that we only need to prove $\binom{n_1}{j} + \binom{n_2}{j} + 2\sqrt{\binom{n_1}{j}\binom{n_2}{j}} < \binom{n_1 + n_2}{j}$. Since
$$\binom{n_1 + n_2}{j} = \sum_{s=0}^{j} \binom{n_1}{j - s}\binom{n_2}{s},$$
we only need to prove $2\sqrt{\binom{n_1}{j}\binom{n_2}{j}} < n_2\binom{n_1}{j-1} + n_1\binom{n_2}{j-1}$. Since $n_2\binom{n_1}{j-1} + n_1\binom{n_2}{j-1} \ge 2\sqrt{n_1 n_2 \binom{n_1}{j-1}\binom{n_2}{j-1}}$, we only need to verify $n_1\binom{n_1}{j-1} > \binom{n_1}{j}$, which can be easily verified by writing out both sides explicitly. □
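The key inequality in the proof, $\sqrt{\binom{n_1}{j}} + \sqrt{\binom{n_2}{j}} < \sqrt{\binom{n_1 + n_2}{j}}$, can be spot-checked over a range of small values without floating point. The sketch below is a sanity check rather than a proof; the function name and the tested ranges are assumptions made for illustration.

```python
from math import comb


def strict_superadditivity_holds(n1, n2, j):
    """Check sqrt(C(n1,j)) + sqrt(C(n2,j)) < sqrt(C(n1+n2,j)) with integers only."""
    a, b, total = comb(n1, j), comb(n2, j), comb(n1 + n2, j)
    gap = total - a - b
    # Squaring twice: for gap > 0, the inequality is equivalent to 4ab < gap^2.
    return gap > 0 and 4 * a * b < gap * gap


checked = [
    (n1, n2, j)
    for j in range(2, 6)
    for n1 in range(j, 15)
    for n2 in range(j, 15)
]
print(all(strict_superadditivity_holds(*case) for case in checked))
```

For $j = 1$ the inequality fails, which is consistent with the requirement $j \ge 2$ in the proof.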
The above theorem provides us a sufficient condition to guarantee exact sparse recovery with
concatenated bases and the stable recovery theory is also established.
VI. A Polynomial Time Approximation Algorithm
In practical applications, we have pairwise interaction data in a network with n nodes and we
wish to infer high order cliques up to size k. Directly constructing A by concatenating Radon
basis matrices Rj,j, Rj,j+1 . . . , Rj,k and solving P1,δ would incur exponential complexity since
A has exponentially many columns with respect to k. This would be intractable for inferring
high order cliques in large networks. In this section, we describe a polynomial time (with
respect to both n and k) approximation algorithm for solving P1,δ. Recall that the primal
and dual programs P1,δ and D1,δ are:
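$$(P_{1,\delta}) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|Ax - b\|_\infty \le \delta$$
$$(D_{1,\delta}) \quad \max\ -\delta\|\gamma\|_1 - b^*\gamma \quad \text{s.t.} \quad \|A^*\gamma\|_\infty \le 1.$$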
Proposition 6.1. The problem (D1,δ) is the dual of (P1,δ).
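Proof. Consider an alternative form of $P_{1,\delta}$,
$$\begin{aligned}
\min\ & \mathbf{1}^T \xi \\
\text{subject to}\ & -\delta \cdot \mathbf{1} \le Ax - b \le \delta \cdot \mathbf{1}, \\
& -\xi \le x \le \xi, \\
& \xi \ge 0,
\end{aligned}$$
whose Lagrangian is
$$L(x, \xi; \gamma, \lambda, \mu) = \mathbf{1}^T \xi - \gamma_+^T(\delta \cdot \mathbf{1} - Ax + b) - \gamma_-^T(Ax - b + \delta \cdot \mathbf{1}) - \lambda_+^T(\xi - x) - \lambda_-^T(\xi + x) - \mu^T \xi.$$
Here, if we assume $A$ is a matrix of size $M$ by $N$, then $\gamma_+ = (\gamma_+(1), \ldots, \gamma_+(M)) \in \mathbb{R}^M_+$, $\gamma_- = (\gamma_-(1), \ldots, \gamma_-(M)) \in \mathbb{R}^M_+$, $\lambda_+ = (\lambda_+(1), \ldots, \lambda_+(N))^T \in \mathbb{R}^N_+$, $\lambda_- = (\lambda_-(1), \ldots, \lambda_-(N))^T \in \mathbb{R}^N_+$, and $\mu \in \mathbb{R}^N_+$ are the Lagrange multipliers.
Then the KKT condition gives
1. $A^*(\gamma_+ - \gamma_-) + (\lambda_+ - \lambda_-) = 0$,
2. $\mathbf{1} - (\lambda_+ + \lambda_-) - \mu = 0$,
with $\gamma, \lambda, \mu \ge 0$ and $\gamma_+(\tau)\gamma_-(\tau) = \lambda_+(\tau)\lambda_-(\tau) = 0$ for all $\tau$.
Now we can see that the dual function of $\mathbf{1}^T \xi$ is
$$-\delta(\gamma_+^T + \gamma_-^T) \cdot \mathbf{1} - (\gamma_+^T - \gamma_-^T) b,$$
which is $-\delta\|\gamma\|_1 - b^*\gamma$, while the constraint on $\gamma$ is $\|A^*\gamma\|_\infty \le 1$. □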
The key of our algorithm is that we use a polynomial number of variables and constraints to
approximate both programs, yielding an approximate solution for P1,δ. More precisely, we
apply a sequential primal-dual interior point method to solve the relaxed programs:
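$$(P_{1,\delta,T}) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|A_T x - b\|_\infty \le \delta$$
$$(D_{1,\delta,T}) \quad \max\ -\delta\|\gamma\|_1 - b^*\gamma \quad \text{s.t.} \quad \|A_T^*\gamma\|_\infty \le 1.$$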
Here AT is a submatrix of A where we extract a subset of columns T . We approximate the
solution to the original programs by solving the above relaxed programs where we only use
polynomially many columns indexed by T . In particular, we want to find an interior point
γ for D1,δ,T which is also feasible for D1,δ. With this γ available, we can use duality gaps to
check convergence because the current dual objective provides a lower bound for D1,δ and
any interior point for P1,δ,T provides an upper bound for P1,δ.
Let $A_i$ be the $i$-th column of $A$. We need to sequentially update the column set $T$. When we have a solution $\gamma$ (called the approximate analytic center) for the relaxed program $D_{1,\delta,T}$, we look for a new column $A_i$ ($i \in T^c$) whose constraint is violated by $\gamma$, i.e., $|A_i^*\gamma| > 1$. By incorporating $A_i$ into $T$, the feasible region of $D_{1,\delta,T}$ shrinks to better approximate that of $D_{1,\delta}$. When the current solution $\gamma$ has no violated constraint, i.e., $\gamma$ is feasible for $D_{1,\delta}$, we use interior point methods to find a series of interior points which converge to the solution of $D_{1,\delta,T}$. However, we may obtain a new interior point $\gamma$ which is not feasible for $D_{1,\delta}$. We then go back and add violated constraints. A formal description is provided in Algorithm 1.
Algorithm 1 Cutting Plane Method for Solving $P_{1,\delta}$
  Initialize $A = I$, $x = b$, $\gamma = (1, 1, \cdots, 1)^t$.
  while TRUE do
    if $\exists\, |A_i^*\gamma| > 1$ where $i \in T^c$ then
      $T \leftarrow T \cup \{i\}$, formulate new $D_{1,\delta,T}$ and $P_{1,\delta,T}$.
      Find new interior points $\gamma$ and $x$ for $D_{1,\delta,T}$ and $P_{1,\delta,T}$ respectively.
    else if the duality gap is small then
      Get the dual solution $x$ and stop.
    else
      Find a new interior point $\gamma$ for $D_{1,\delta,T}$, which optimizes the dual objective.
    end if
  end while
In Algorithm 1, the first IF statement involves finding a violated dual constraint for the current relaxed program. In the special case where $\gamma$ are dual variables associated with edges, the problem becomes the maximum edge weight clique problem, which is known to be NP-hard. To solve this problem we use a simple greedy heuristic (Lueker, 1978), which iteratively adds new nodes so as to maximize the sum of edge weights; it runs in $O(nk^2)$ time and returns a 0.94-approximate solution in the average case. Note that, if $\gamma$ is feasible for the dual relaxation problem with no additional violated constraints, then $0.94\gamma$ must be feasible for $D_{1,\delta}$, whose objective is discounted by 0.94. Thus, we will terminate with a 0.94-approximate solution.
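The greedy search for a violated constraint can be sketched as follows for the edge case $j = 2$. This is a minimal illustration and not the exact procedure of Lueker (1978): seeding with the heaviest edge, the names `greedy_heavy_clique` and `is_violated`, and the unit-norm column scaling $1/\sqrt{\binom{k}{2}}$ (inferred from the inner-product formulas in Section V) are all assumptions. The sketch only looks for large positive sums.

```python
from itertools import combinations
from math import comb, sqrt


def greedy_heavy_clique(weights, k):
    """Greedily pick k nodes with a large total pairwise weight.

    `weights` is a symmetric n x n list of lists; the diagonal is ignored.
    """
    n = len(weights)
    # Assumed seeding rule: start from the single heaviest edge.
    seed = max(combinations(range(n), 2), key=lambda e: weights[e[0]][e[1]])
    chosen = list(seed)
    while len(chosen) < k:
        rest = [v for v in range(n) if v not in chosen]
        # Add the node contributing the most weight to the current set.
        best = max(rest, key=lambda v: sum(weights[v][u] for u in chosen))
        chosen.append(best)
    total = sum(weights[u][v] for u, v in combinations(chosen, 2))
    return sorted(chosen), total


def is_violated(total, k):
    """Dual constraint check |A_i^* gamma| > 1 for a unit-norm k-clique column."""
    return abs(total) / sqrt(comb(k, 2)) > 1


# Toy dual weights: a planted triangle {0, 1, 2} among five nodes.
W = [[0.1] * 5 for _ in range(5)]
for u, v in combinations(range(3), 2):
    W[u][v] = W[v][u] = 1
nodes, total = greedy_heavy_clique(W, 3)
print(nodes, total, is_violated(total, 3))
```

The growth loop costs $O(nk)$ per added node, hence $O(nk^2)$ overall; the heaviest-edge seed adds an $O(n^2)$ scan in this sketch.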
Let $\eta$ be the threshold to check the duality gap. Algorithm 1 can also be understood as the column generation method (Dantzig and Wolfe, 1960), since adding a new inequality constraint in the dual program adds a variable to the primal program and thus adds a column to the basis matrix. For more details of the algorithm, see Mitchell (2003) and Ye (1997). Theoretically, if one is able to find a violated constraint in constant time and uses interior point methods to locate approximate centers of the primal-dual feasible regions, then Algorithm 1 has computational complexity $O(M/\eta^2)$, where $M$ is the number of dual variables (Mitchell, 2003; Ye, 1997). In our case, $M \asymp n^2$ and finding a violated constraint has complexity $O(nk^2)$, thus Algorithm 1 has complexity $O(n^3 k^2/\eta^2)$.
Finally, we note that other iterative algorithms, e.g., Bregman iterations, which have guaranteed convergence rates (Cai et al., 2009), can be used to find solutions of linear program relaxations in our algorithms. We also note that, in practice, we never need to explicitly construct the matrix $A$ because there are many combinatorial structures within the basis matrix to exploit. For example, inner products between the bases can be evaluated efficiently by directly comparing two sets.
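As an illustration of evaluating inner products by comparing sets, the sketch below uses the formula $\binom{|\tau \cap \sigma|}{j} / \sqrt{\binom{|\tau|}{j}\binom{|\sigma|}{j}}$ that appears in the proof of Theorem 5.11 and cross-checks it against explicitly built columns. The function names are hypothetical, and the unit-norm column scaling is an assumption inferred from that formula.

```python
from itertools import combinations
from math import comb, sqrt


def column_inner_product(tau, sigma, j):
    """Inner product of two unit-norm Radon basis columns, from set sizes alone."""
    shared = comb(len(tau & sigma), j)
    return shared / sqrt(comb(len(tau), j) * comb(len(sigma), j))


def explicit_column(clique, n, j):
    """Dense column over all j-subsets of {0..n-1}; used only as a cross-check."""
    scale = 1.0 / sqrt(comb(len(clique), j))
    return [scale if set(rho) <= clique else 0.0 for rho in combinations(range(n), j)]


tau, sigma = frozenset({0, 1, 2, 3}), frozenset({1, 2, 3, 4, 5})
fast = column_inner_product(tau, sigma, j=2)
dense = sum(a * b for a, b in zip(explicit_column(tau, 6, 2), explicit_column(sigma, 6, 2)))
print(fast, dense)
```

The set-based version needs only one intersection and three binomial coefficients, whereas the dense columns have $\binom{n}{j}$ entries each.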
VII. Application Examples
In this section, we provide four application examples to illustrate the effectiveness of the proposed framework. As we will see, our clique-based model can deal with overlaps between cliques, which gives us more information about community structure than purely clustering-based methods and the state-of-the-art clique percolation method.
In these examples, we use the clique volume and conductance, which arguably are the simplest evaluation criteria of clustering quality, to evaluate different algorithms. The clique volume is the sum of edge weights inside the clique, while the clique conductance is the ratio between the total weight of edges leaving the clique and the clique volume (Leskovec et al., 2010). More precisely, let $B_{uv}$ be the element on the $u$-th row and $v$-th column of the adjacency matrix $B$. The conductance $\phi(S)$ of a set of nodes $S$ is defined as
$$\phi(S) = \frac{\sum_{\{(u,v) : u \in S, v \notin S\}} B_{uv}}{\min(\mathrm{Vol}(S), \mathrm{Vol}(V \setminus S))}$$
and the volume is $\mathrm{Vol}(S) = \sum_{\{u,v \in S\}} B_{uv}$.
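The two evaluation criteria can be computed directly from a weighted adjacency matrix. The sketch below assumes that the volume counts each unordered pair inside $S$ once (matching "the sum of edge weights inside the clique"); the function names and the toy graph are illustrative assumptions.

```python
def volume(B, S):
    """Vol(S): total weight of edges with both endpoints in S, each pair counted once."""
    nodes = sorted(S)
    return sum(B[u][v] for i, u in enumerate(nodes) for v in nodes[i + 1:])


def conductance(B, S):
    """phi(S): weight leaving S divided by min(Vol(S), Vol(V \\ S))."""
    inside = set(S)
    outside = set(range(len(B))) - inside
    cut = sum(B[u][v] for u in inside for v in outside)
    return cut / min(volume(B, inside), volume(B, outside))


# Toy graph: two triangles with intra-edge weight 2, joined by one edge of weight 1.
B = [[0] * 6 for _ in range(6)]
for group in ({0, 1, 2}, {3, 4, 5}):
    for u in group:
        for v in group:
            if u != v:
                B[u][v] = 2
B[2][3] = B[3][2] = 1

print(volume(B, {0, 1, 2}), conductance(B, {0, 1, 2}))
```

Note that `conductance` divides by zero when either side has no internal edges; a practical implementation would need to decide how to treat that case.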
A. Basketball Team Detection
Detecting two basketball teams from pairwise interactions among players is an ideal scenario since the two teams do not overlap. Suppose $x_0$ is the true signal indicating the two teams among all 5-sets of the 10-player set, i.e., it is sparsely concentrated on the two 5-sets which correspond to the two teams, with magnitudes both equal to one. Assume we have observations $b$ of pairwise interactions, i.e., $b = A x_0 + z$, where $z$ is bounded random noise uniformly distributed in $[-\epsilon, \epsilon]$. We solve $P_{1,\delta}$ with $\delta = \epsilon$, which is a linear programming search over $x \in \mathbb{R}^{\binom{10}{5}} = \mathbb{R}^{252}$ with a parameter matrix $A \in \mathbb{R}^{\binom{10}{2} \times \binom{10}{5}} = \mathbb{R}^{45 \times 252}$ and $b \in \mathbb{R}^{45}$.
The results are shown in Figure 2. In Figure 2-(a), we see that the two basketball teams are perfectly detected, as expected. Since the two 5-sets corresponding to the two teams have no overlap, they satisfy the Irrepresentable Condition (IRR). In Figure 2-(b), we try to detect the two teams under different noise levels $\epsilon \in [0, 1]$. The two basketball teams can be detected under fairly large noise levels. This example can also be dealt with using spectral clustering techniques, where we normalize the pairwise interaction data to get the transition matrix, followed by spectral clustering on eigenspaces. We observed that both our method and spectral clustering work very well under noise levels less than 0.8 (i.e., $|\epsilon| < 0.8$).
Figure 2: Detecting Basketball Teams with Noise. (a) Two teams in a virtual basketball game, with intra-team interaction 1 and cross-team interaction noise no more than $\epsilon$; (b) under a large noise level $\epsilon < 0.9$, the two teams are identifiable. For each noise level, we run 100 simulations, whose error-bar plot of weights on cliques is shown. [Panel (b): values on 5-cliques versus noise level $\epsilon$; legend: Team 1, Team 2, Alternative 5-Subsets.]
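The claim that the two disjoint teams satisfy IRR can be checked by enumeration, without solving the linear program. In the sketch below the labelling of players 0-4 and 5-9 as the two teams is a hypothetical choice, and column inner products are taken to be $\binom{|\tau \cap \sigma|}{2} / \binom{5}{2}$ as in Section V.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

j, k, players = 2, 5, range(10)
teams = [frozenset(range(5)), frozenset(range(5, 10))]  # hypothetical team labels

five_sets = [frozenset(c) for c in combinations(players, k)]
pairs = list(combinations(players, j))
print(len(pairs), len(five_sets))  # dimensions of A: 45 rows, 252 columns

# The teams share no pair of players, so A_T^* A_T = I and the IRR quantity
# is the largest mu_tau over the 250 alternative 5-sets.
worst = max(
    sum(Fraction(comb(len(tau & team), j), comb(k, j)) for team in teams)
    for tau in five_sets
    if tau not in teams
)
print(worst)
```

The worst alternative 5-set takes four players from one team and one from the other, and its value stays below one, so IRR holds for this configuration.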
B. The Social Network of Les Miserables
We consider the social network of 33 characters in Victor Hugo’s novel Les Miserables (Knuth,
1993). We represent this social network using a weighted graph (Figure 3-(a)). The edge
weights are the co-appearance frequencies of the two corresponding characters. Table 1
illustrates several social communities formed by relationships including friendships, street
gangs, kinships, etc. The underlying social community, regarded as the ground truth for the
data, is summarized in Figure 3-(a) where several social communities arise. Figure 3-(b)
shows the spectral clustering result in which the first three red cuts are reasonable while the
next three blue cuts destroyed a lot of community structures within the network.
Figure 3: Decomposition of the Les Miserables social network. (a) Social network of characters in Les Miserables, with communities labelled Street Gang, Friendship, Student Union, Kinship, and Dramatic Conflict; (b) spectral clustering result; (c) the identified 3-cliques; (d) the identified 4-cliques.
We compare our method with the clique percolation method: 23 and 19 cliques were identified, respectively, and our approach identifies more meaningful cliques; see Figure 3 and Table 1, where we verified the ground truth from the novel. For example, our method can correctly identify two separate cliques {4, 15, 22} and {20, 21, 22}, while the clique percolation method treats {4, 15, 20, 21, 22} as a single clique. The interaction frequencies among those characters, however, show that there are relatively few cross-community interactions, thus those two 3-cliques should be separated. Figure 3-(c) and 3-(d) depict important 3-cliques and 4-cliques identified by our algorithm. The sparsity patterns of those cliques satisfy the irrepresentable condition, where overlaps between them are generally not large. However, they do not necessarily satisfy the condition in Lemma 5.6, which is based on a worst-case analysis. In Figure 4, we also compare both methods in terms of clique conductances and volumes and see that the cliques identified by Radon basis pursuit have slightly lower conductances and larger volumes, which demonstrates advantages of our approach.
Figure 4: Les Miserables social network: box plots of clique conductances and volumes, by clique size (3 to 9), for the clique percolation method and our approach. (a) Radon Basis Pursuit, clique conductances; (b) Clique Percolation, clique conductances; (c) Radon Basis Pursuit, clique volumes; (d) Clique Percolation, clique volumes. Cliques identified by our approach have smaller conductances and larger volumes.
Table 1: Social Networks of Les Miserables
Cliques            Names of Characters                              Relationships       Perco.  Radon
{1, 2, 3}          {Myriel, Mlle Baptistine, Mme Magloire}          Friendship          N       N
{4, 13, 14}        {Valjean, Mme Thenardier, Thenardier}            Dramatic Conflicts  N       Y
{4, 15, 22}        {Valjean, Cosette, Marius}                       Dramatic Conflicts  N       Y
{20, 21, 22}       {Gillenormand, Mlle Gillenormand, Marius}        Kinship             N       Y
{5, 6, 7, 8}       {Tholomyes, Listolier, Fameuil, Blacheville}     Friendship          Y       Y
{9, 10, 11, 12}    {Favourite, Dahlia, Zephine, Fantine}            Friendship          Y       Y
{14, 31, 32, 33}   {Thenardier, Gueulemer, Babet, Claquesous}       Street Gang         N       Y
In summary, our method obtains richer social structure information than the competing techniques. We also obtain social communities with overlaps, which is impossible for clustering methods. We note that some simple schemes will not work well. For example, one may think of scoring each large clique by the mean score of the small cliques it contains. In this example, since two or three key characters appear very frequently, the top high order cliques would then always contain them. In fact, among the top ten 3-cliques, seven contain node 4 and six contain node 15, which does not give good results.
C. Coauthorships in Network Science
We also studied a medium-size coauthorship network with a total of 1,589 scientists who come from a broad variety of fields. Part of this network is shown in Figure 5-(a). Our approach and the clique percolation method identify 136 and 166 cliques, respectively. We also compare the two methods in terms of clique conductances and volumes. From Figure 6-(a),(b), we see that the cliques identified by Radon basis pursuit have smaller conductances than those of the clique percolation method, with comparable clique volumes. Our approach scales well: in this example, it identifies the cliques up to size 9 in 564 seconds. This application example shows that our approach can be used to identify cliques in social networks with hundreds or even thousands of nodes.
Finally, we note that clustering techniques, e.g., spectral clustering, combined with our algo-
rithm can provide a more refined analysis of the network. We can look at the persistence of
identified cliques in the binary tree decomposition of bipartite spectral clustering of the net-
work in a bottom-up way. Cliques which persist through more levels will give us meaningful
community structural information.
In figure 5-(b), a small fraction of the binary tree decomposition of bipartite spectral clus-
tering is depicted, where child nodes are spectral bipartition of the parent node. We can
32
A
B
DC
(a) (b)
Figure 5: (a) Coauthorships in Network Science, only a part of the network is shown; (b)
Important cliques identified within clusters behave in a persistent way. Clustering node B is
exactly the blue part in (a)
detect cliques within the child nodes. Once cliques within clusters C,D are identified, we
then backtrack to the parent nodes B and A to see if the identified cliques still persist.
We can identify 3 cliques (c1={Kumar, Raghavan, Rajagopalan, Tomkins}, c2={Kumar. S,
Raghavan, Rajagopalan}, c3={Raghavan, Rajagopalan, Tomkins, Kumar. S}) within C and
3 cliques (d1={Flake. G, Lawrence. S, Giles. C, Coetzee. F}, d2={Flake. G, Lawrence.
S, Giles. C, Pennock. D, Glover. E}, d3={Flake. G, Lawrence. S, Giles. C}) within
D which persist to parents B and A. We can identify papers whose authors are exactly
those cliques. Using only clustering will not get this result because those cliques have heavy
overlaps between them.
In Figure 5-(b), for simplicity, we only show two persistent cliques: c1={Kumar, Raghavan,
Rajagopalan, Tomkins} and d1={Flake. G, Lawrence. S, Giles. C, Coetzee. F}, which are
the most important cliques (having the largest weights in the solution of the linear program) in
clusters C and D respectively. These two cliques are also the two most important cliques in
cluster B, and if we backtrack them further to cluster A, they are still ranked
first and third in terms of weights among all cliques identifiable in A.
[Figure 6 panels: (a) Radon Basis Pursuit, Clique Conductances; (b) Clique Percolation, Clique Conductances; (c) Radon Basis Pursuit, Clique Volumes; (d) Clique Percolation, Clique Volumes. Horizontal axes: clique sizes (3 to 9); vertical axes: φ (conductance) in (a) and (b), clique volume in (c) and (d).]
Figure 6: Coauthorship Network: Box plot of clique conductances and volumes for clique
percolation method and our approach. Cliques identified by our approach have smaller
conductances and larger volumes.
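The two quantities compared in Figure 6 can be computed as in the following minimal sketch. It assumes an unweighted graph and the standard textbook definitions: volume as the sum of degrees, and conductance as the cut size divided by the smaller of the two volumes. The normalization used for Figure 6 may differ. The function names and the `adj` dictionary format are illustrative assumptions.

```python
def volume(nodes, adj):
    """Sum of degrees of `nodes`; `adj` maps each node to its neighbor set."""
    return sum(len(adj[v]) for v in nodes)

def conductance(nodes, adj):
    """Cut size between `nodes` and the rest, over the smaller side's volume."""
    inside = set(nodes)
    # Count edges with exactly one endpoint inside the node set.
    cut = sum(1 for u in inside for v in adj[u] if v not in inside)
    vol_in = volume(inside, adj)
    vol_out = volume(adj, adj) - vol_in
    denom = min(vol_in, vol_out)
    return cut / denom if denom > 0 else float("inf")

# Usage: two triangles joined by the edge (2, 3).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(volume({0, 1, 2}, adj), conductance({0, 1, 2}, adj))
```

A clique with low conductance is well separated from the rest of the network, while a large volume indicates that its members carry many connections overall.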
D. Inferring high order ranking
The Jester dataset (Goldberg et al., 2001) contains ratings of 100 jokes given by about
24,000 users. The ratings are real-valued, ranging from −10.00 to +10.00. We extract the top 20
jokes from the entire dataset according to mean scores. Among those 20 jokes, we record each
user's top 5 jokes as a vote for that 5-set and view the resulting vote counts as the ground
truth. Figure 7-(a) shows that one top 5-set, {27, 29, 35, 36, 50}, receives overwhelmingly
more votes than the others. Now suppose we only know the top 3 counts of the jokes and ask
whether we can identify the most popular 5-joke group. By solving P1,δ along the whole
regularization path obtained by varying δ, we are able to detect this subset (Figure 7-(b)) in a robust way.
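The data preparation for this experiment can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes each user's vote is the set of their 5 highest-rated jokes, and that the "top 3 counts" are the 3-subset marginals of those votes, so each 3-subset counts the users whose top 5 contains it. The section does not spell out the exact construction. The function names and the tie-breaking rule (lower joke ID first) are hypothetical, and solving P1,δ itself is not shown.

```python
from collections import Counter
from itertools import combinations

def top_k_set(ratings, k):
    """Return the k highest-rated joke IDs of one user as a frozenset.

    `ratings` maps joke ID to score; ties are broken by lower ID (assumption).
    """
    ranked = sorted(ratings, key=lambda j: (-ratings[j], j))
    return frozenset(ranked[:k])

def top5_votes(all_ratings):
    """Count how many users vote for each 5-set (the ground truth)."""
    return Counter(top_k_set(r, 5) for r in all_ratings)

def top3_counts(votes5):
    """Low-order data: each 3-subset sums the votes of 5-sets containing it."""
    counts = Counter()
    for five, n in votes5.items():
        for three in combinations(sorted(five), 3):
            counts[frozenset(three)] += n
    return counts

# Usage with three toy users rating jokes 1..6.
users = [
    {1: 5.0, 2: 4.0, 3: 3.0, 4: 2.0, 5: 1.0, 6: 0.0},
    {1: 5.0, 2: 4.0, 3: 3.0, 4: 2.0, 5: 1.0, 6: 0.0},
    {1: 0.0, 2: 5.0, 3: 4.0, 4: 3.0, 5: 2.0, 6: 1.0},
]
votes = top5_votes(users)
marginals = top3_counts(votes)
print(votes.most_common(1), marginals[frozenset({2, 3, 4})])
```

The recovery problem then asks for a sparse vector of 5-set weights whose 3-subset marginals match `marginals` up to a tolerance δ.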
[Figure 7 panels: (a) Distribution of votes on top 5 subsets (horizontal axis: sorted top 5 subsets; vertical axis: number of votes); (b) Solution Path of P1,δ on Inferring Top 5 Jokes (horizontal axis: δ; vertical axis: magnitude of top xσ).]
Figure 7: (a) One top 5-set of jokes (in red), with IDs {27, 29, 35, 36, 50}, receives significantly more votes than the others; (b) Regularization path, where the top curve (red) selects this top group over δ ∈ [50, 130]. Note
that the second curve from the top (green) also identifies the fourth 5-set in a persistent way.
VIII. Conclusions
In this work, we present a novel approach that connects two seemingly different areas: network
data analysis and compressive sensing. By adopting a new algebraic tool, Radon basis
pursuit in homogeneous spaces, we formulate the network clique detection problem as
a compressed sensing problem. This formulation allows us to establish rigorous
recovery conditions for network clique detection. Rather than providing another
heuristic method, we aim to contribute at the foundational level to network data analysis.
We hope that our work builds a bridge between the research communities of network
modeling and compressive sensing, so that results and tools from one area can be
ported to the other.
To illustrate the usefulness of this new framework, we present a novel approach to identify
overlapping communities as cliques in social networks, based on compressed sensing with a
new algebraic method, i.e., Radon basis pursuit in homogeneous spaces associated with
permutation groups. Our approach starts from a general problem of compressive representation
of low order interactive information from high order cliques, which first arose in
identity management and statistical ranking. Applied to social networks, this
approach studies bivariate functions defined on pairs of nodes and looks for compressive
representations of such functions based on clique information in networks. It turns out that
the sparse representation under the Radon basis may disclose community structures, typically
overlapping, in social networks. We have shown that noiseless exact recovery and stable
recovery with uniformly bounded noise hold under some natural conditions. Though this paper
is mainly methodological and theoretical, we also develop a polynomial-time approximation
algorithm for solving empirical problems and demonstrate the usefulness of the proposed
approach on real-world networks.
IX. Acknowledgments
Xiaoye Jiang and Leonidas Guibas wish to acknowledge the support of ARO grants W911NF-
10-1-0037 and W911NF-07-2-0027, as well as NSF grant CCF 1011228 and a gift from the
Google Corporation. Y. Yao acknowledges support from the National Basic Research
Program of China (973 Program 2011CB809105), NSFC (61071157), Microsoft Research Asia,
and a professorship in the Hundred Talents Program at Peking University. The authors also
thank Zongming Ma, Minyu Peng, Michael Saunders, Yinyu Ye for very helpful discussions
and comments. Han Liu is thankful for a faculty supporting package from Johns Hopkins
University.
References
Airoldi, E. M., Blei, D. M., Fienberg, S. E. and Xing, E. P. (2008). Mixed
membership stochastic blockmodels. Journal of Machine Learning Research 9 1981–2014.
Barabasi, A. L. and Albert, R. (1999). Emergence of scaling in random networks.
Science 286 509–512.
Cai, J., Osher, S. and Shen, Z. (2009). Linearized bregman iterations for compressed
sensing. Mathematics of Computation 78(267) 1515–1536.
Candes, E. J. (2008). The restricted isometry property and its implications for compressed
sensing. Comptes Rendus de l’Academie des Sciences, Paris, Serie I 346 589–592.
Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions
on Information Theory 51 4203–4215.
Candes, E. J. and Tao, T. (2007). The dantzig selector: statistical estimation when p
is much larger than n. Annals of Statistics 35(6) 2313–2351.
Chen, S., Donoho, D. L. and Saunders, M. A. (1999). Atomic decomposition by basis
pursuit. SIAM Journal on Scientific Computing 20 33–61.
Dantzig, G. and Wolfe, P. (1960). Decomposition principle for linear programs. Op-
erations Research 8 101–111.
Diaconis, P. (1988). Group Representations in Probability and Statistics. Institute of
Mathematical Statistics.
Duijn, M. A. J. V., Snijders, T. A. B. and Zijlstra, B. J. H. (2004). p2: a random
effects model with covariates for directed graphs. Statistica Neerlandica 59 234–254.
Erdos, P. and Renyi, A. (1959). On random graphs, i. Publicationes Mathematicae 6
290–297.
Erdos, P. and Renyi, A. (1960). On the evolution of random graphs. Publications of the
Mathematical Institute of the Hungarian Academy of Sciences 5 17–61.
Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological
networks. Proceedings of the National Academy of Sciences of the United States of America
99 7821–7826.
Goldberg, K., Roeder, T., Gupta, D. and Perkins, C. (2001). Eigentaste: A
constant time collaborative filtering algorithm. Information Retrieval 4(2) 133–151.
Goldenberg, A., Zheng, A. X., Fienberg, S. E. and Airoldi, E. M. (2010). A
survey of statistical network models. Foundations and Trends in Machine Learning 2.
Guibas, L. J. (2008). The identity management problem— a short survey. In International
Conference on Information Fusion.
Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2001). Latent space
approaches to social network analysis. Journal of the American Statistical Association 97
1090–1098.
Holland, P. W. and Leinhardt, S. (1981). An exponential family of probability dis-
tributions for directed graphs. Journal of the American Statistical Association 76 33–50.
Jagabathula, S. and Shah, D. (2008). Inferring rankings under constrained sensing. In
Neural Information Processing Systems (NIPS).
Kleinberg, J. M., Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins,
A. (1999). The web as a graph: measurements, models, and methods. In International
Computing and Combinatorics Conference.
Knuth, D. E. (1993). The Stanford GraphBase: A Platform for Combinatorial Computing.
Addison-Wesley.
Kumar, R., Raghavan, P., Rajagopalan, S., Sivakumar, D., Tomkins, A. and
Upfal, E. (2000). Stochastic models for the Web graph. In Proceedings of the 41st Annual
Symposium on Foundations of Computer Science.
Lancichinetti, A. and Fortunato, S. (2009). Benchmarks for testing community de-
tection algorithms on directed and weighted graphs with overlapping communities. Physical
Review E 80(1) 16118.
Leskovec, J., Lang, K. and Mahoney, M. (2010). Empirical comparison of algorithms
for network community detection. In ACM WWW International Conference on World
Wide Web (WWW).
Lorrain, F. and White, H. (1971). Structural equivalence of individuals in social net-
works. Journal of Mathematical Sociology 1 49–80.
Lueker, G. S. (1978). Maximization problems on graphs with edge weights chosen from
a normal distribution. In ACM Symposium on Theory of Computing.
Mitchell, J. E. (2003). Polynomial interior point cutting plane methods. Optimization
Methods and Software 18 2003.
Newman, M. E. J. (2006). Modularity and community structure in networks. Proceedings
of National Academy of Sciences 103(23) 8577–8582.
Palla, G., Derenyi, I., Farkas, I. and Vicsek, T. (2005). Uncovering the overlapping
community structure of complex networks in nature and society. Nature 435(7043) 814.
Sarkar, P. and Moore, A. (2005). Dynamic social network analysis using latent space
models. SIGKDD Explorations: Special Edition on Link Mining .
Snijders, T. A. B. (2005). Models for longitudinal network data. In Models and Methods
in Social Network Analysis. University Press.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, Series B 58(1) 267–288.
Tsaig, Y. and Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on
Information Theory 52 1289–1306.
Wasserman, S. and Anderson, C. (1987). Stochastic a posterior blockmodels: Con-
struction and assessment. Social Networks 9 1–36.
Wasserman, S. and Pattison, P. (1996). Logit models and logistic regressions for social
networks: I. an introduction to markov graphs and p∗. Psychometrika 61 401–425.
Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks.
Nature 393 409–10.
Ye, Y. (1997). Interior Point Algorithms: Theory and Analysis. Wiley.
Yuan, M. and Lin, Y. (2007). On the nonnegative garrote estimator. Journal of the
Royal Statistical Society. Series B 69(2) 143–161.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine
Learning Research 7 2541–2563.