Modèles de graphes aléatoires à structure cachée pour l ...

157
HAL Id: tel-00623088 https://tel.archives-ouvertes.fr/tel-00623088 Submitted on 13 Sep 2011 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Modèles de graphes aléatoires à structure cachée pour l’analyse des réseaux Pierre Latouche To cite this version: Pierre Latouche. Modèles de graphes aléatoires à structure cachée pour l’analyse des réseaux. Math- ématiques [math]. Université d’Evry-Val d’Essonne, 2010. Français. tel-00623088

Transcript of Modèles de graphes aléatoires à structure cachée pour l ...

Page 1: Modèles de graphes aléatoires à structure cachée pour l ...

HAL Id: tel-00623088https://tel.archives-ouvertes.fr/tel-00623088

Submitted on 13 Sep 2011

HAL is a multi-disciplinary open accessarchive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come fromteaching and research institutions in France orabroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, estdestinée au dépôt et à la diffusion de documentsscientifiques de niveau recherche, publiés ou non,émanant des établissements d’enseignement et derecherche français ou étrangers, des laboratoirespublics ou privés.

Modèles de graphes aléatoires à structure cachée pourl’analyse des réseaux

Pierre Latouche

To cite this version:Pierre Latouche. Modèles de graphes aléatoires à structure cachée pour l’analyse des réseaux. Math-ématiques [math]. Université d’Evry-Val d’Essonne, 2010. Français. �tel-00623088�

Page 2: Modèles de graphes aléatoires à structure cachée pour l ...

Université d’Évry Val d’Essonne

Laboratoire Statistique et Génome

Thèseprésentée en première version en vu d’obtenir le grade de Docteur,

spécialité “Mathématiques Appliquées”

par

Pierre Latouche

Modèles de graphes aléatoires àstructure cachée pour l’analyse

des réseaux

Thèse soutenue le la date de soutenance devant le jury composé de :

M. Christophe Biernacki Université de Lille (Rapporteur)M. Jean-Philippe Vert Mines, Fontainebleau & Institut Curie, Paris (Rapporteur)M. Geoff McLachlan Université du Queensland, Australie (Jury)M. Stéphane Robin AgroParisTech, Paris (Jury)M. Christophe Ambroise Université d’Évry Val D’Essonne (Directeur)M. Etienne Birmelé Université d’Évry Val d’Essonne (Co-Directeur)

Page 3: Modèles de graphes aléatoires à structure cachée pour l ...
Page 4: Modèles de graphes aléatoires à structure cachée pour l ...

À toute ma famille, affectueusement

Page 5: Modèles de graphes aléatoires à structure cachée pour l ...
Page 6: Modèles de graphes aléatoires à structure cachée pour l ...

Remerciements

Je tiens à remercier Christophe Ambroise et Etienne Birmelé pour avoirencadré ce travail de thèse et pour la liberté qu’ils m’ont donnée. Je

remercie également Julien Chiquet, Marie-Agnès Dillies, Gilles Grasseau,Catherine Matias, et Bernard Prum. Chacun à votre manière, vous m’avezdonné confiance en moi et aidé à avancer. Enfin, un grand remerciement àtous les membres du laboratoire Statistique et Génome. Merci pour tout.

P. Latouche, Evry, le November 30, 2010.

v

Page 7: Modèles de graphes aléatoires à structure cachée pour l ...

Contents

Contents vi

List of Figures viii

Préface 1

Abstract 5

1 Context 7

1.1 Mixture models and EM . . . . . . . . . . . . . . . . . . . . 9

1.1.1 Kmeans . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.1.2 Gaussian mixture models . . . . . . . . . . . . . . . . . 12

1.1.3 The EM algorithm . . . . . . . . . . . . . . . . . . . . . 14

1.1.4 Model selection . . . . . . . . . . . . . . . . . . . . . . . 19

1.2 Variational Inference . . . . . . . . . . . . . . . . . . . . . 25

1.2.1 Variational EM . . . . . . . . . . . . . . . . . . . . . . . 25

1.2.2 Variational Bayes EM . . . . . . . . . . . . . . . . . . . . 26

1.3 Graph clustering . . . . . . . . . . . . . . . . . . . . . . . . 29

1.3.1 Graph theory and real networks . . . . . . . . . . . . . . 30

1.3.2 Community structure . . . . . . . . . . . . . . . . . . . 35

1.3.3 Heterogeneous structure . . . . . . . . . . . . . . . . . . 41

1.4 Phase transition in stochastic block models . . . . . . 43

1.4.1 The phase transition . . . . . . . . . . . . . . . . . . . . 44

1.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . 45

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2 Variational Bayesian inference and complexity con-trol for stochastic block models 49

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

2.2 A mixture model for graphs . . . . . . . . . . . . . . . . . 52

2.2.1 Model and notations . . . . . . . . . . . . . . . . . . . . 52

2.2.2 A Bayesian Stochastic Block Model . . . . . . . . . . . . 53

2.3 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

2.3.1 Variational EM . . . . . . . . . . . . . . . . . . . . . . . 54

2.3.2 Variational Bayes EM . . . . . . . . . . . . . . . . . . . . 55

2.4 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . 56

2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

2.5.1 Comparison of the criteria . . . . . . . . . . . . . . . . . 57

2.5.2 The metabolic network of Escherichia coli . . . . . . . . . 61

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

vi

Page 8: Modèles de graphes aléatoires à structure cachée pour l ...

3 Overlapping stochastic block models 65

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.2 The stochastic block model . . . . . . . . . . . . . . . . . 68

3.3 The overlapping stochastic block model . . . . . . . . . 69

3.3.1 Modeling sparsity . . . . . . . . . . . . . . . . . . . . . 70

3.3.2 Modeling outliers . . . . . . . . . . . . . . . . . . . . . . 71

3.4 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3.4.1 Correspondence with (non overlapping) stochastic blockmodels . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

3.4.2 Permutations and inversions . . . . . . . . . . . . . . . . 73

3.4.3 Identifiability . . . . . . . . . . . . . . . . . . . . . . . . 75

3.5 Statistical inference . . . . . . . . . . . . . . . . . . . . . . 76

3.5.1 The q-transformation . . . . . . . . . . . . . . . . . . . . 77

3.5.2 The ξ-transformation . . . . . . . . . . . . . . . . . . . . 78

3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3.6.1 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . 81

3.6.2 French political blogosphere . . . . . . . . . . . . . . . . 86

3.6.3 Saccharomyces cerevisiae transcription network . . . . . 91

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

4 Model selection in overlapping stochastic block mod-els 93

4.1 A Bayesian Overlapping Stochastic Block Model . . . . 95

4.2 Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.2.1 The q-transformation . . . . . . . . . . . . . . . . . . . . 96

4.2.2 The ξ-transformation . . . . . . . . . . . . . . . . . . . . 97

4.2.3 Variational Bayes EM . . . . . . . . . . . . . . . . . . . . 97

4.2.4 Optimization of ξ . . . . . . . . . . . . . . . . . . . . . . 99

4.3 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.4.1 Simulated data . . . . . . . . . . . . . . . . . . . . . . . 101

4.4.2 Saccharomyces cerevisiae transcription network . . . . . 104

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

Conclusion 107

A Mixture models 109

A.1 Factorization of the integrated complete-data likeli-hood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

A.2 Exact expression of log p(Z) . . . . . . . . . . . . . . . . . . 109

A.3 Asymptotic approximation of log p(Z) using Stirling

formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

B SBM 113

B.1 Optimization of q(Zi) . . . . . . . . . . . . . . . . . . . . . . 113

B.2 Optimization of q(α) . . . . . . . . . . . . . . . . . . . . . . 114

B.3 Optimization of q(Π) . . . . . . . . . . . . . . . . . . . . . . 114

B.4 Lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

C OSBM 119

C.1 First lower bound . . . . . . . . . . . . . . . . . . . . . . . 119

vii

Page 9: Modèles de graphes aléatoires à structure cachée pour l ...

C.2 Second lower bound . . . . . . . . . . . . . . . . . . . . . . 120

C.3 Optimization of ξij . . . . . . . . . . . . . . . . . . . . . . . 121

C.4 Optimization of αq . . . . . . . . . . . . . . . . . . . . . . . 122

C.5 Optimization of W . . . . . . . . . . . . . . . . . . . . . . . 122

D Bayesian OSBM 125

D.1 Lower Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

D.2 Optimization of q(α) . . . . . . . . . . . . . . . . . . . . . . 126

D.3 Optimization of q(W) . . . . . . . . . . . . . . . . . . . . . . 127

D.4 Optimization of q(Ziq) . . . . . . . . . . . . . . . . . . . . . 128

D.5 Optimization of ξ . . . . . . . . . . . . . . . . . . . . . . . . 131

D.6 Lower bound . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Bibliography 135

List of Figures

1.1 Subset of the yeast transcriptional regulatory network (Miloet al. 2002). Nodes of the directed network correspond togenes, and two genes are linked if one gene encodes a tran-scriptional factor that directly regulates the other gene. . . . 32

1.2 The metabolic network of bacteria Escherichia coli (Lacroixet al. 2006). Nodes of the undirected network correspond tobiochemical reactions, and two reactions are connected if acompound produced by the first one is a part of the secondone (or vice-versa). . . . . . . . . . . . . . . . . . . . . . . . . 33

1.3 Subset of the french political blogosphere network. The dataconsists of a single day snapshot of political blogs automati-cally extracted on 14th october 2006 and manually classifiedby the “Observatoire Présidentielle project” (Zanghi et al.2008). Nodes correspond to hostnames and there is an edgebetween two nodes if there is a known hyperlink from onehostname to another (or vice-versa). . . . . . . . . . . . . . . 34

1.4 Example of an undirected affiliation network with 50 ver-tices. The network is made of three communities repre-sented in red, blue, and green. Vertices connect mainly tovertices of the same community. . . . . . . . . . . . . . . . . 35

viii

Page 10: Modèles de graphes aléatoires à structure cachée pour l ...

1.5 Dendrogram of a network with 50 vertices for the commu-nity detection algorithm with edge betweenness. It shouldbe read from top to bottom. The algorithm starts with a sin-gle community which contains all the vertices. Edges withthe highest edge betweenness are then removed iterativelysplitting the network into several communities. After con-vergence, each vertex, represented by a leaf of the tree, is asole member of one of the 50 communities. . . . . . . . . . . 37

1.6 Directed network of social relations between 18 monks in anisolated American monastery (Sampson 1969, White et al.1976). Sampson collected sociometric information using in-terviews, experiments, and observations. This network fo-cus on the relation of “liking”. A monk is said to have asocial relation of “like” to another monk if he ranked thatmonk in the top three monks for positive affection in any ofthe three interviews given. The positions of the vertices inthe two data dimensional latent space have been calculatedusing the Bayesian approach for LPCM. The position of thethree class centers found are indicated as well as circles withradius equal to the square root of the class variances esti-mated. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

1.7 Example of an undirected network with 20 vertices. Theconnection probabilities between the two classes in red andgreen are higher than the intra class probabilities. Verticesconnect mainly to vertices of a different class. . . . . . . . . 42

1.8 Several graphs with 5000 vertices are generated using SBMwith various connectivity matrices. The x axis representsthe values of maxq |λq| while the proportions N∗ of ver-tices in the biggest connected component are given in they axis. The critical point of phase transition occurs whenmaxq |λq| ≥ 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.1 Directed acyclic graph representing the Bayesian view ofSBM. Nodes represent random variables, which are shadedwhen they are observed and edges represent conditional de-pendencies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

2.2 Boxplot representation (over 60 experiments) of ILvb forQ ∈ {1, . . . , 40}. The maximum is reached at QILvb = 22. . . 62

2.3 Dot plot representation of the metabolic network after clas-sification of the vertices into QVB = 22 classes. The x-axisand y-axis correspond to the list of vertices in the network,from 1 to 605. Edges between pairs of vertices are repre-sented by shaded dots. . . . . . . . . . . . . . . . . . . . . . . 63

3.1 Example of a directed graph with three overlapping clusters. 70

3.2 Directed acyclic graph representing the frequentist view ofthe overlapping stochastic block model. Nodes representrandom variables, which are shaded when they are ob-served and edges represent conditional dependencies. . . . 71

ix

Page 11: Modèles de graphes aléatoires à structure cachée pour l ...

3.3 Example of a network with community structures. Overlapsare represented in black and outliers in gray. . . . . . . . . . 82

3.4 Example of a network with community structures and stars.Overlaps are represented in black and outliers in gray. . . . 83

3.5 L2 distance d(P, P) over the 100 samples of networks withcommunity structures, for CFinder and OSBM. Measureshow well the underlying cluster assignment structure hasbeen retrieved. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3.6 L2 distance d(P, P) over the 100 samples of networks withcommunity structures and stars, for CFinder and OSBM.Measures how well the underlying cluster assignment struc-ture has been retrieved. . . . . . . . . . . . . . . . . . . . . . 85

3.7 Classification of the blogs into Q = 4 clusters using OSBM.The entry (i, j) of the matrix describes the number of blogsassociated to the j-th political party (column) and classifiedinto cluster i (row). Each entry distinguishes blogs whichbelong to a unique cluster from overlaps (single member-ship blogs + overlaps). The last row corresponds to thenull component. . . . . . . . . . . . . . . . . . . . . . . . . . . 88

3.8 Classification of the blogs into Q = 5 clusters using MMSB.The entry (i, j) of the matrix describes the number of blogsassociated to the j-th political party (column) and classifiedinto cluster i (row). Each entry distinguishes blogs whichbelong to a unique cluster from overlaps (single member-ship blogs + overlaps). Cluster 5 corresponds to the classof outliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

3.9 Classification of the blogs into Q = 5 clusters using SBM.The entry (i, j) of the matrix describes the number of blogsassociated to the j-th political party (column) and classifiedinto cluster i (row). Cluster 5 corresponds to the class ofoutliers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

4.1 Figure produced by the R “OSBM” package. Example of anetwork generated using OSBM, with λ = 6, ǫ = 1, W∗ =−5.5, and Q = 5 classes. Overlaps are represented usingpies and outliers are in white. . . . . . . . . . . . . . . . . . . 103

4.2 The ILosbm criterion for Q ∈ {2, . . . , 8}. The maximum isreached at QILosbm = 6. . . . . . . . . . . . . . . . . . . . . . . 104

x

Page 12: Modèles de graphes aléatoires à structure cachée pour l ...

Préface

Les réseaux sont très largement utilisés dans de nombreux domainesscientifiques afin de représenter les intéractions entre objets d’intérêt.

Ainsi, en Biologie, les réseaux de régulation s’appliquent à décrire les mé-canismes de régulation des gènes, à partir de facteurs de transcription,tandis que les réseaux métaboliques permettent de représenter des voiesde réactions biochimiques. En sciences sociales, ils sont couramment util-isés pour réprésenter les intéractions entre individus.

Dans le cadre de cette thèse, nous nous intéressons à des méthodesd’apprentissage non supervisé dont l’objectif est de classer les nœuds d’unréseau en fonction de leurs connexions. Il existe une vaste littérature seréférant à ce sujet et un nombre important d’algorithmes ont été proposésdepuis les premiers travaux de Moreno en 1934. Il apparaît dès lors queles approches existantes peuvent être regroupées dans trois familles dif-férentes. Tout d’abord, un certain nombre de méthodes se concentrentsur la recherche de communautés où les nœuds sont classés de manièreà ce que deux nœuds aient une plus forte tendance à se connecter s’ilsappartiennent à la même classe. Ces techniques sont parmi les plus util-isées en particulier pour analyser les réseaux de type Internet. Par ailleurs,d’autres approches recherchent des structures topologiques différentes où,au contraire, deux nœuds ont une plus forte tendance à intéragir s’ils sontdans des classes distinctes. Elles sont particulièrement adaptées à l’étudedes réseaux de type biparpite. Enfin, quelques méthodes s’intéressent à larecherche de structures hétérogènes où les nœuds peuvent avoir des pro-files de connexion très différents. En particulier, ces techniques peuventêtre employées pour retrouver à la fois des commautés et des structuresbipartites dans les réseaux. Le point de départ de cette thèse est le mod-èle à blocs stochastiques, Stochastic Block Model (SBM) en anglais, quiappartient à cette dernière famille d’approches et qui est également trèslargement utilisé.

SBM est un modèle de mélange qui est le résultat de travaux en sci-ences sociales. Il fait l’hypothèse que les nœuds d’un réseau sont répartisdans des classes et décrit la probabilité d’apparition d’un arc entre deuxnœuds uniquement en fonction des classes auquelles ils appartiennent.Aucune hypothèse n’ait faite a priori en ce qui concerne ces probabilitésde connexion de sorte que SBM puisse prendre en compte des structurestopologiques très variées. En particulier, le modèle permet de caractériserla présence de hubs, c’est à dire de nœuds ayant un nombre de liens élevéspar rapport aux autres nœuds d’un réseau. Pour finir, il généralise laplupart des modèles existants pour la classification de nœuds dans lesréseaux.

L’estimation des paramètres de SBM a déjà fait l’objet de nombreuses

1

Page 13: Modèles de graphes aléatoires à structure cachée pour l ...

2 Préface

études. Cependant, à notre connaissance un seul critère de sélection demodèles a été développé pour estimer le nombre de composantes dumélange (Daudin et al. 2008). Malheureusement, il a été montré que cecritère était trop conservateur dans le cas de réseaux de petites tailles.

Par ailleurs, il apparaît que SBM ainsi que la plupart des modèlesexistants pour la classification dans les réseaux sont limités puisqu’ils par-titionnent les nœuds dans des classes disjointes. Or, de nombreux objetsd’étude dans le cadre d’applications réelles sont connus pour appartenirà plusieurs groupes en même temps. Par exemple, en Biologie, des pro-téines appelées moonlighting proteins en anglais ont plusieurs fonctionsdans les cellules. De la même manière, lorsqu’un ensemble d’individusest étudié, on s’attend à ce qu’un nombre conséquent d’individus appar-tiennent à plusieurs groupes ou communautés.

Dans cette thèse, notre contribution est la suivante:• Nous proposons un nouvel algorithme de classification pour SBM

ainsi qu’un nouveau critère de sélection de modèle. Ces travaux sontimplémentés dans le package R “mixer” qui est disponible sur leCRAN: http://cran.r-project.org/web/packages/mixer.

• Nous introduisons un nouveau modèle de graphe aléatoire que nousappelons modèle à blocs stochastiques chevauchants, OverlappingStochastic Block Model (OSBM) en anglais. Il autorise les nœudsd’un réseau à appartenir à plusieurs groupes simultanément et peutprendre en compte des topologies de connexion très différentes. Laclassification et l’estimation des paramètres de OSBM sont réaliséesà partir d’un algorithme basé sur des techniques variationnelles.

• Nous présentons une nouvelle approche d’inférence pour OSBMainsi qu’un nouveau critère de sélection de modèle qui permetd’estimer le nombre de classes, éventuellement chevauchantes, dansles réseaux. Ces travaux sont implémentés dans le package R“OSBM” qui sera très prochainement disponible sur le CRAN.

Le premier chapitre s’attache à présenter les principaux concepts statis-tiques et les méthodes existantes sur lesquels se fondent nos travaux. Nousintroduisons notamment les modèles de mélange ainsi que les techniquesd’inférence de type EM et variationnel. Plusieurs critères de sélection demodèle sont également discutés. Enfin, il est fait état des méthodes lesplus connues pour la classification des nœuds dans les réseaux.

Dans le deuxième chapitre, le modèle à blocs stochastiques est décritdans un cadre Bayésien et une nouvelle procédure d’inférence est pro-posée. Elle offre la possibilité d’approcher la loi a posteriori des paramètreset des variables latentes. Cette approche permet également d’obtenir uncritère non asymptotique de sélection de modèle, basé sur une approxi-mation de la vraissemblance marginale.

Le troisième chapitre introduit le modèle à blocs stochastiqueschevauchants. Nous montrons que le modèle est identifiable dans desclasses d’équivalence. De plus, un algorithme basé sur des techniquesvariationnelles locales et globales est proposé. Il permet le clustering denœuds dans des classes chevauchantes et l’estimation des paramètres dumodèle.

Page 14: Modèles de graphes aléatoires à structure cachée pour l ...

Préface 3

Enfin, le quatrième chapitre considère à nouveau un cadre Bayésien. Deslois a priori conjuguées sont utilisées pour caractériser les paramètres dumodèle à blocs stochastiques chevauchants. Une procédure d’inférencepermet alors d’approcher la loi a posteriori des paramètres et des variableslatentes. Elle donne naissance à un critère de sélection de modèle basé surune nouvelle approximation de la vraissemblance marginale.

Cette thèse a fait l’objet de deux papiers et d’un chapitre de livre:Latouche et al. (2009; 2010a;b).

Page 15: Modèles de graphes aléatoires à structure cachée pour l ...
Page 16: Modèles de graphes aléatoires à structure cachée pour l ...

Abstract

Networks are used in many scientific fields to represent the interactionsbetween objects of interest. For instance, in Biology, regulatory net-

works describe the regulation of genes with transcriptional factors whilemetabolic networks focus on representing pathways of biochemical reac-tions. In social sciences, networks are commonly used to represent theinteractions between actors.

In this thesis, we consider unsupervised methods which aim at clus-tering the vertices of a network depending on their connection profiles.There has been a wealth of literature on the topic which goes back to theearlier work of Moreno in 1934. It appears that available techniques canbe grouped into three significant categories. First, some models look forcommunity structure, where vertices are partitioned into classes such thatvertices of a class are mostly connected to vertices of the same class. Theyare particularly suitable for the analysis of affiliation networks. Othermodels look for disassortative mixing in which vertices mostly connect tovertices of different classes. They are commonly used to analyze bipar-tite networks. Finally, a few procedures look for heterogeneous structurewhere vertices can have different types of connection profiles. In particu-lar, they can be used to uncover both community structure and disassorta-tive mixing. The starting point of this thesis is the Stochastic Block Model(SBM) which belongs to this later category of approaches.

SBM is a mixture model for graphs which was originally developedin social sciences. It assumes that the vertices of a network are spreadinto different classes such that the probability of an edge between twovertices only depends on the classes they belong to. No assumption ismade on these probabilities of connection such that SBM can take verydifferent topological structures into account. In particular, the model cancharacterize the presence of hubs which make networks locally dense.Moreover and to some extent, it generalizes many of the existing graphclustering techniques.

The clustering of vertices as well as the estimation of SBM parametershave been subject to previous work and numerous inference strategieshave been proposed. However, SBM still suffers from a lack of criteria toestimate the number of components in the mixture. To our knowledge,only one model based criterion has been derived for SBM in the literature.Unfortunately, it tends to be too conservative in the case of small networks.

Besides, almost all graph clustering models, such as SBM, partitionthe vertices into disjoint clusters. However, recent studies have shownthat most existing networks contained overlapping clusters. For instance,many proteins, so-called moonlighting proteins, are known to have severalfunctions in the cells, and actors might belong to several groups of inter-

5

Page 17: Modèles de graphes aléatoires à structure cachée pour l ...

6 Abstract

ests. Thus, a graph clustering method should be able to uncover overlap-ping clusters.

In this thesis, our contributions are the following:• We propose a new graph clustering algorithm for SBM as well as a

new model selection criterion. The R package “mixer” implement-ing this work is available at http://cran.r-project.org/web/packages/mixer.

• We introduce a new random graph model, so-called, OverlappingStochastic Block Model (OSBM). It allows the vertices of a networkto belong to multiple classes and can take very different topologi-cal structures into account. A variational algorithm, based on globaland local variational techniques, is considered for clustering and es-timation purposes.

• We present a new inference procedure for OSBM as well as a modelselection criterion which can estimate the number of overlappingclusters in networks. A R package “OSBM” implementing this workwill be soon available on the CRAN.

The first chapter introduces the main concepts and existing work thisthesis builds on. In particular, we review mixture models as well as infer-ence techniques such as the EM algorithm and the variational EM algo-rithm. We also focus on some model selection criteria to estimate the num-ber of classes from the data. Finally, we describe some of the most widelyused graph clustering methods for network analysis and focus mainly onmodel based approaches.

The second chapter illustrates how SBM can be described in a Bayesianframework. A new inference procedure is proposed to approximate thefull posterior distribution over the model parameters and latent variables,given the observed data. This approach leads to a new model selectioncriterion based on a non asymptotic approximation of the marginal likeli-hood.

The third chapter presents OSBM. We show that the model is identifi-able within classes of equivalence. Moreover, an algorithm is proposed,based on global and variational techniques. It allows the model param-eters to be estimated and the vertices to be classified into overlappingclusters.

Finally, in chapter 4, conjugate prior distributions are proposed for theOSBM model parameters. Then, we show how an inference procedurecan be used to obtain an approximation of the full posterior distributionover the model parameters and latent variables. This framework leads toa model selection criterion based on new approximation of the marginallikelihood.

During this thesis, two papers and a book chapter were published:Latouche et al. (2009; 2010a;b).

Page 18: Modèles de graphes aléatoires à structure cachée pour l ...

1Context

Contents1.1 Mixture models and EM . . . . . . . . . . . . . . . . . . . . . . 9

1.1.1 Kmeans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.1.2 Gaussian mixture models . . . . . . . . . . . . . . . . . . . 12

1.1.3 The EM algorithm . . . . . . . . . . . . . . . . . . . . . . . 14

1.1.4 Model selection . . . . . . . . . . . . . . . . . . . . . . . . . 19

1.2 Variational Inference . . . . . . . . . . . . . . . . . . . . . . . 25

1.2.1 Variational EM . . . . . . . . . . . . . . . . . . . . . . . . . 25

1.2.2 Variational Bayes EM . . . . . . . . . . . . . . . . . . . . . . 26

1.3 Graph clustering . . . . . . . . . . . . . . . . . . . . . . . . . . 29

1.3.1 Graph theory and real networks . . . . . . . . . . . . . . . 30

1.3.2 Community structure . . . . . . . . . . . . . . . . . . . . . 35

1.3.3 Heterogeneous structure . . . . . . . . . . . . . . . . . . . . 41

1.4 Phase transition in stochastic block models . . . . . . . 43

1.4.1 The phase transition . . . . . . . . . . . . . . . . . . . . . . 44

1.4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

This preliminary chapter introduces the main concepts and existingwork this thesis builds on. In Section 1.1, we consider a simple ex-

ample, i.e. the Gaussian mixture model, to illustrate the use of the Expec-tation Maximization (EM) algorithm for estimation and clustering pur-poses in mixture models. We also focus on some model selection criteriato estimate the number of classes from the data. We then give in Section1.2 a variational treatment of EM and describe a more general frameworkcalled variational Bayes EM which can lead to an approximation of thefull posterior distribution over the model parameters and latent variables.In Section 1.3, we present some of the most widely used graph cluster-ing methods for network analysis and focus mainly on model based ap-proaches. Finally, we bring some insights into the stochastic block modelwhich is considered in Chapter 2 and extended in Chapters 3 and 4 toallow overlapping clusters in networks.

7

Page 19: Modèles de graphes aléatoires à structure cachée pour l ...
Page 20: Modèles de graphes aléatoires à structure cachée pour l ...

1.1. Mixture models and EM 9

1.1 Mixture models and EM

In this section, we introduce some algorithms which are commonly usedto classify the points of a data set. Existing techniques can be distin-guished depending on the structure of the classification they produce.Here we consider methods which look for a partition P of the data only.Approaches looking for hierarchies or fuzzy partitions will be briefly men-tioned in the rest of the thesis while others producing overlapping clusterswill be described in more details in Chapters 3 and 4.

For a few decades, there has been a wealth of literature on mixturemodels and the associated Expectation Maximization (EM) algorithm touncover the classes of a partition P in a data set. Mixture models firstassume that the observations are spread into Q hidden classes such thatobservation xi is drawn from distribution fq if its belongs to class q. If theposterior distribution over the latent variables takes an analytical form,the EM algorithm can then be applied to estimate the mixture model pa-rameters and classify the observations. In the following, we consider asimple example, i.e. the Gaussian mixture model, for illustration pur-poses. However, we emphasize that EM has a much broader applicability.In particular, it can be used for every distributions fq of the exponentialfamily (McLachlan and Peel 2000).

We start by introducing the kmeans and kmedoids algorithms. Afterhaving described EM, we show how it is related to kmeans in the caseof Gaussian mixture models. Finally, we describe some model selectioncriteria to estimate the number Q of classes from the data in a mixturecontext.

1.1.1 Kmeans

Let us consider a data set E = {x1, . . . , xN} of N observations. Each vectorxi is in R

d and therefore the data set can be represented by a matrix X inR

N×d where the ith row equals x⊺i . The goal is to cluster the observationsinto Q disjoint clusters {P1, . . . , PQ} of a partition P such that Pq

⋂Pl =

∅, ∀q 6= l and⋃Q

q=1 Pq = E . We will see in Section 1.1.4 how Q can beestimated from the data but for now, Q is held fixed. A common approachconsists in looking for clusters where data points of the same cluster have asmaller inter-point distance than points of different clusters. To formalizethis notion, we introduce a d-dimensional vector µq, called prototype, foreach of the Q clusters. The kmeans algorithm aims at minimizing the sumof squares of the distances between each point and its closest prototype.Formally, this involves minimizing an objective function, called distortionfunction:

J =N

∑i=1

Q

∑q=1

Ziq|| xi− µq ||2, (1.1)

where Ziq equals 1 if xi is assigned to cluster q and 0 otherwise. In the fol-lowing, we will denote Zi the corresponding Q-dimensional binary vectorof cluster assignments. It satisfies Ziq ∈ {0, 1}, ∀(i, q) and ∑

Qq=1 Ziq = 1.

From the N binary vectors, we also build a N ×Q binary matrix Z whoseith row is equal to Z⊺

i . In the following, we will show that minimizing

Page 21: Modèles de graphes aléatoires à structure cachée pour l ...

10 Chapter 1. Context

(1.1) is equivalent to looking for a partition P with the smaller intra classinertia denoted Iintra and given by:

Iintra =Q

∑q=1

∑i∈Pq

|| xi−xPq ||2,

where xPq is the center of mass of Pq:

xPq =1

card(Pq)∑

i∈Pq

xi,

and card(Pq) is the cardinality of Pq. It is known that the total inertia Itotalcan be decomposed into the sum of the intra class inertia Iintra and theinter class inertia Iinter:

Itotal = Iintra + Iinter,

where

Iinter =Q

∑q=1

card(Pq)||xPq − x||2,

and x is the center of mass of the entire data set. Since Itotal does notdepend on the partition P, minimizing Iintra is equivalent to maximizingIinter. Thus, kmeans looks for clusters as separated as possible.

The algorithm starts with a set of initial prototypes usually sampledfrom a Gaussian distribution. It then optimizes (1.1) with respect to Z

and the set {µ1, . . . , µQ} in turn. First, the µqs are kept fixed while J isminimized with respect to Z. Second, J is minimized with respect to theµqs keeping Z fixed. This two stage optimization procedure is repeateduntil the absolute distance between two successive values of J is smallerthan a threshold eps. In Section 1.1.3, we will see how these two stepsare related to the E and M steps of the Expectation Maximization (EM)algorithm in the case of Gaussian mixture models.

Since J is a linear function of the Ziqs, the optimization of J with respectto Z leads to a closed form solution. Indeed, the terms indexed by i are allindependent, and therefore, optimizing J involves optimizing each of theobjective functions:

Ji =Q

∑q=1

Ziq|| xi− µq ||2, ∀i ∈ {1, . . . , Q}.

The function Ji is minimized when Ziq equals 1 for whichever value of qgives the minimum value of || xi− µq ||

2. In other words, Ziq equals 1 ifq = argminl || xi− µl ||

2 and 0 otherwise.Finally, in order to optimize J with respect µq, we set the gradient of

(1.1) to zero:

▽µqJ = 2

N

∑i=1

Ziq(xi− µq) = 0.

This implies:

µq =∑

Ni=1 Ziq xi

∑Ni=1 Ziq

. (1.2)

Page 22: Modèles de graphes aléatoires à structure cachée pour l ...

1.1. Mixture models and EM 11

Thus, the prototype µq is defined as the mean of all the points xi assignedto cluster q.

Kmeans (Algorithm 1) was studied by MacQueen (1967). Because eachstep is guaranteed to reduce the objective function, convergence of thealgorithm is assured. However, it may converge to local rather than globalminimum.

Algorithm 1: The kmeans algorithm.

// INITIALIZATION

Initialize µ0q , ∀q

µnewq ← µ0

q , ∀q

// OPTIMIZATION

repeat

µoldq ← µnew

q , ∀q// Step 1

for i ∈ 1 : N do

Find q = argminl || xi− µoldl ||

2

Ziq ← 1Zil ← 0, ∀l 6= q

end

// Step 2

µnewq ←

∑Ni=1 Ziq xi

∑Ni=1 Ziq

, ∀q

until J converges

Numerous methods have been proposed to speed up the kmeans al-gorithm. Some of them precompute a tree such that nearby points are inthe same sub-tree (Rmasubramanian and Paliwal 1990). Others limit thenumber of distances || xi− µq ||

2 to compute using the triangle inequality(Hodgson 1998, Moore 2000, Elkan 2003).

Because the kmeans algorithm is based on the Euclidean distance, itis very sensitive to outliers. Moreover, depending on the type of dataconsidered, the Euclidean distance might not be an appropriate choice asa measure of dissimilarity between data points and prototypes. Therefore,it can be generalized by introducing any dissimilarity measure ds(·, ·). Thegoal is now to optimize the objective function J∗ rather than J, where:

J∗ =N

∑i=1

Q

∑q=1

Ziqds(xi, µq).

This gives rise to the kmedoids algorithm. As kmeans, the kmedoids algo-rithm relies on a two stage optimization procedure. First, each data pointxi is assigned to the prototype µq for which the corresponding dissimilar-ity measure ds(xi, µq) is minimized. Second, each cluster prototype is setequal to one of the data points assigned to that cluster.

A drawback of the kmeans and the kmedoids algorithms is that, ateach iteration, each data point is assigned to one and only one cluster.In practice, while some points are much closer to one of the mean vec-tors µq, some others often lie midway between cluster centers. In this

Page 23: Modèles de graphes aléatoires à structure cachée pour l ...

12 Chapter 1. Context

case, the hard assignment of points to the nearest cluster may not be themost appropriate choice. In Section 1.1.3, we will see how a probabilis-tic framework can lead to soft assignments reflecting the uncertainty towhich cluster each data point belongs.

1.1.2 Gaussian mixture models

We denote by κ a set of Q mean vectors µq and covariance matrices Σq.Moreover, let us consider a vector α of class proportions which satisfiesαq > 0, ∀q and ∑

Qq=1 αq = 1. A Gaussian mixture distribution is then given

by:

p(xi | α, κ) =Q

∑q=1

αqN (xi; µq, Σq), (1.3)

where

N (xi; µq, Σq) =1

(2π)d/2|Σq |1/2 exp(

−12(xi− µq)

⊺Σ−1q (xi− µq)

)

.

As shown in the following, a mixture distribution can be seen as the resultof a marginalization over a latent variable.

First, let us assume that a binary vector Zi is sampled from a multino-mial distribution:

p(Zi | α) =M (Zi; 1, α = (α1, . . . , αQ))

=Q

∏q=1

αZiqq .

(1.4)

As in the previous section, Zi sees all its components set to zero exceptone such that Ziq = 1 if observation i belongs to class q. The vector xi isthen sampled from a Gaussian distribution:

p(xi |Ziq = 1, µq, Σq) = N (xi; µq, Σq) (1.5)

with mean vector µq and covariance matrix Σq. The full conditional distri-bution is given by:

p(xi |Zi, κ) =Q

∏q=1

p(xi |Ziq = 1, µq, Σq)Ziq

=Q

∏q=1N (xi; µq, Σq)

Ziq .

(1.6)

Therefore, the marginalization of p(xi | α, κ) over all possible vectors Zileads to:

p(xi | α, κ) = ∑Zi

p(xi, Zi | α, κ)

= ∑Zi

p(xi |Zi, κ)p(Zi | α)

= ∑Zi

Q

∏q=1

(

αqN (xi; µq, Σq))Ziq

.

(1.7)

Note that (1.7) is equivalent to (1.3). However, we now have an explicitformulation of the Gaussian mixture model in terms of the latent vectorZi which will play a key role in Section 1.1.3.

Page 24: Modèles de graphes aléatoires à structure cachée pour l ...

1.1. Mixture models and EM 13

Maximum likelihood

We are given a data set X of N observations which are assumed to be inde-pendent and drawn from a Gaussian mixture distribution with Q classes.In a frequentist setting, the goal is to maximize the observed-data log-likelihood log p(X | α, κ) with respect to κ and α:

log p(X | α, κ) =N

∑i=1

log p(xi | α, κ)

=N

∑i=1

log

(Q

∑q=1

αqN (xi; µq, Σq)

)

.

(1.8)

One approach consists in relying directly on gradient based algorithmsfor the optimization procedure (Fletcher 1987, Nocedal and Wright 1999).Although this approach is feasible, it is rarely used in practice and the EMalgorithm described in Section 1.1.3 is usually preferred.

Bayesian approach

In the Bayesian framework, some distributions, called priors, are intro-duced to characterize a priori information about the model parameters.For instance, in the regression context, a Gaussian prior distribution overthe weight vector β is often used for regularization in order to avoid over-fitting (Wahba 1975, Berger 1985, Gull 1989, MacKay 1992, Bernardo andSmith 1994, Gelman et al. 2004). While a Gaussian prior leads to a L2-norm regularization (ridge regression) (Hoerl and Kennard 1970, Hastieet al. 2001), a Laplace prior gives rise to a L1-norm regularization (Lasso)(Tibshirani 1996, Park and Casella 2008).

In the case of Gaussian mixture model, the Bayesian framework canbe used to deal with the singularities of the observed-data log-likelihood(1.8). For simplicity, consider that the covariance matrices of the differentcomponents are such that Σq = σ2

q I where I denotes the identity matrix.If one component has its mean vector equal to one of the points xi in thedata set, i.e. µq = xi, then:

N (xi; µq, Σq) = N (xi; xi, Σq)

=1

(2π)d/2σdq

.(1.9)

If we now consider the limit σq → 0, then (1.9) goes to infinity and sodoes (1.8). Therefore, optimizing the observed-data log-likelihood is not awell defined problem since singularities can arise whenever a componentof the Gaussian mixture model collapses to a single point. To tackle thisissue, a common strategy consists in introducing conjugate priors1 for themodel parameters. Thus, a Dirichlet distribution is chosen for the vectorα:

p(α) = Dir(α; n0 = (n01, . . . , n0

Q)),

1In a Bayesian framework, given a data set X as well as a model parameter θ, if aposterior distribution p(θ |X) is in the same family of distributions as the prior p(θ), thenthe prior and posterior are said to be conjugate

Page 25: Modèles de graphes aléatoires à structure cachée pour l ...

14 Chapter 1. Context

where n0q = 1/2, ∀q. This Dirichlet distribution corresponds to a non-

informative Jeffreys prior distribution which is known to be proper (Jef-freys 1946). It is also possible to consider a uniform distribution on theQ − 1 dimensional simplex by fixing n0

q = 1, ∀q. As for κ, independentGaussian-Wishart priors can be used:

p(κ) =Q

∏q=1N (µq; m0, β0 ∆

−1q )W(∆q; W0, λ0),

where ∆q denotes the precision matrix of the qth component, that is theinverse matrix of Σq. Considering a quadratic loss function (see Dang1998), the goal is now to maximize the log-posterior distribution of themodel parameters, given the observed data set X. Using Bayes rule, wehave:

p(α, κ |X) ∝ p(X | α, κ)p(α)p(κ).

Therefore:

log p(α, κ |X) = log p(X | α, κ) + log p(α) + log p(κ) + const. (1.10)

Note that the first term on the right hand side of (1.10) is the observed-datalog-likelihood. As in the frequentist setting, it is possible to rely directlyon gradient based algorithms for the optimization procedure. However,as mentioned above, the standard approach in the machine learning andstatistical community, to obtain point estimates of mixture model param-eters, is the EM algorithm (see Section 1.1.3). The estimates are calledMaximum A Posteriori (MAP) estimates because they maximize the pos-terior distribution rather than the observed-data likelihood. As shownin Bishop (2006), the singularities are absent in this Bayesian framework,simply by introducing the priors and relying on the MAP estimates.

In this section, we made a first step towards a full Bayesian treatmentof the data and motivated the use of prior distributions to deal with sin-gularities. However, the Bayesian framework brings much more powerfulfeatures as illustrated in Chapters 2 and 4. For instance, we will see howan estimation of the full posterior distribution over the model parameterscan be performed using variational techniques. This will naturally leadto new model selection criteria to estimate the number Q of classes innetworks.

1.1.3 The EM algorithm

The Expectation Maximization (EM) algorithm (Dempster et al. 1977,McLachlan and Krishnan 1997) was originally developed in order to findmaximum likelihood solutions for models with missing data, althoughit can also be applied to find MAP estimates. The two correspondingfunctions to be maximized will be denoted QML(·, ·) and QMAP(·, ·) re-spectively. EM has a broad applicability and will be first described in ageneral setting. We shall then go back to our example in this chapter, i.e.the Gaussian mixture model. Note that EM can be used in the case of con-tinuous latent variables but in the following, we concentrate on discretemixture models only.

Page 26: Modèles de graphes aléatoires à structure cachée pour l ...

1.1. Mixture models and EM 15

General EM

Let us consider a data set X of N observations and Z the correspondingmatrix of class assignments introduced in Section 1.1.1 . The ith row of X

and Z represent the vectors x⊺i and Z⊺

i respectively. Moreover, we assumethat the observations are drawn from a mixture distribution with param-eter θ, i.e. θ = (κ, α) in the Gaussian mixture model. If the observationsare independent, then:

log p(X | θ) =N

∑i=1

log p(xi | θ).

Following (1.7), the marginalization of each distribution p(xi | θ) over allpossible vectors Zi leads to:

log p(X | θ) =N

∑i=1

log

(

∑Zi

p(xi, Zi | θ)

)

.

A more general expression, which stands also in the non independentcase, can be obtained by directly marginalizing over all possible matricesZ:

log p(X | θ) = log

(

∑Z

p(X, Z | θ)

)

, (1.11)

where p(X, Z | θ) is called complete-data likelihood. In practice, given adata set X, the matrix Z is unknown and has to be inferred. The stateof knowledge of Z is given only by the posterior distribution p(Z |X, θ).Therefore, in order to estimate θ, the EM algorithm considers the maxi-mization of the expected value of the complete-data log-likelihood, underthe posterior distribution.

If we denote θold the current value of the model parameters, then dur-ing the E step, the algorithm computes p(Z |X, θold) and the expectationQML(θ, θold):

QML(θ, θold) = ∑Z

p(Z |X, θold) log p(X, Z | θ). (1.12)

During the M step, (1.12) is then maximized with respect to θ. This leadsto a new estimate θnew of the model parameters. Thus, the algorithm startswith an initial value θ0. The E and M steps are then repeated until theabsolute difference between two successive values of θ are smaller thana threshold eps. For now, the use of the expectation may seem arbitrary,but we will show in Section 1.2.1 how EM (Algorithm 2) is guaranteed tomaximize the observed-data log-likelihood (1.11).

As mentioned in Section 1.1.2, the EM algorithm can also be used tofind MAP estimates when the goal is to maximize log p(θ |X):

log p(θ |X) = log

(

∑Z

p(θ, Z |X)

)

.

Page 27: Modèles de graphes aléatoires à structure cachée pour l ...

16 Chapter 1. Context

Algorithm 2: General EM algorithm for mixture models. Q(·, ·) ei-ther denotes QML(·, ·) or QMAP(·, ·).

// INITIALIZATION

Initialize θ0θnew ← θ0

// OPTIMIZATION

repeat

θold ← θnew

// E-step

Compute p(Z |X, θold)Compute Q(θ, θold)// M-step

Find θnew = argmaxθQ(θ, θold)until θ converges

During the E step, the algorithm computes p(Z |X, θold) and the expecta-tion QMAP(θ, θold):

QMAP(θ, θold) = ∑Z

p(Z |X, θold) log p(θ, Z |X)

= ∑Z

p(Z |X, θold) logp(X, Z | θ)p(θ)

p(X)

= ∑Z

p(Z |X, θold) log p(X, Z | θ) + ∑Z

p(Z |X, θold) log p(θ)

−∑Z

p(Z |X, θold) log p(X)

= ∑Z

p(Z |X, θold) log p(X, Z | θ) + log p(θ) + const.

(1.13)Again, QMAP(θ, θold) is maximized with respect to θ during the M step.Note that the first term on the right hand side of (1.13) corresponds exactlyto (1.12).

As pointed out by Bishop (2006), among many others, the EM algo-rithm can converge to local rather than global maxima.

EM for Gaussian mixtures

In order to perform maximum likelihood inference in the case of Gaus-sian mixture models, we need an expression for the complete-data log-likelihood. The observations are assumed to be independent and there-fore:

log p(X, Z | α, κ) = log p(X |Z, κ) + log p(Z | α)

=N

∑i=1

(log p(xi |Zi, κ) + log p(Zi | α)) .

Page 28: Modèles de graphes aléatoires à structure cachée pour l ...

1.1. Mixture models and EM 17

Using (1.4) and (1.6) leads to:

log p(X, Z | α, κ) =N

∑i=1

Q

∑q=1

Ziq

(

logN (xi; µq, Σq) + log αq

)

.

Moreover, from Bayes rule and (1.3), the posterior distribution over all thelatent variables takes a factorized form:

p(Z |X, α, κ) =p(X, Z | α, κ)

p(X | α, κ)

=∏

Ni=1 p(xi, Zi | α, κ)

∏Ni=1 p(xi | α, κ)

=∏

Ni=1 ∏

Qq=1

(

αqN (xi; µq, Σq))Ziq

∏Ni=1

(

∑Ql=1 αlN (xi; µl , Σl)

)

=∏

Ni=1 ∏

Qq=1

(

αqN (xi; µq, Σq))Ziq

∏Ni=1 ∏

Qq=1

(

∑Ql=1 αlN (xi; µl , Σl)

)Ziq

=N

∏i=1

Q

∏q=1

(αqN (xi; µq, Σq)

∑Ql=1 αlN (xi; µl , Σl)

)Ziq

=N

∏i=1M (Zi; 1, τi = (τi1, . . . , τiQ)) ,

where

τiq =αqN (xi; µq, Σq)

∑Ql=1 αlN (xi; µl , Σl)

. (1.14)

Each variable τiq denotes the posterior probability, sometimes called re-sponsibility, of observation i to belong to class q. Note that for each obser-vation, both the prior (1.4) and the posterior are multinomial distributions.During the M step, QML(θ, θold) is then maximized with respect to all themodel parameters:

QML(θ, θold) = ∑Z

p(Z |X, αold, κold) log p(X, Z | α, κ)

=N

∑i=1

Q

∑q=1

τiq

(

logN (xi; µq, Σq) + log αq

)

.(1.15)

Setting the gradient of (1.15), with respect to µq, to zero, leads to:

▽µqQML(θ, θold) =

N

∑i=1

τiq(xi− µq) = 0. (1.16)

Therefore

µq =∑

Ni=1 τiq xi

∑Ni=1 τiq

. (1.17)

Equation (1.17) must be compared with (1.2). Indeed, contrary to thekmeans algorithm, µq is now a weighted mean of all the points in the data

Page 29: Modèles de graphes aléatoires à structure cachée pour l ...

18 Chapter 1. Context

set, the weighting factors being the posterior probabilities of the points tobelong to class q. The gradient with respect to Σq also takes a simple form:

▽ΣqQML(θ, θold) =N

∑i=1

τiq

(Σq

2−

12(xi− µq)(xi− µq)

)

= 0.

Thus

Σq =∑

Qi=1 τiq(xi− µq)(xi− µq)

∑Ni=1 τiq

.

Again, each data point is weighted by τiq, the posterior probability thatclass q was responsible for generating xi. Finally, caution must be takenwhen maximizing QML(θ, θold) with respect to αq. Indeed, the optimiza-tion is subject to the constraint ∑

Qq=1 αq = 1. This is achieved using a

Lagrange multiplier and maximizing:

QML(θ, θold) + λ(Q

∑q=1

αq − 1). (1.18)

Setting the gradient of (1.18), with respect to αq, to zero, gives:

∑Ni=1 τiq

αq+ λ = 0. (1.19)

Then, after multiplying (1.19) by αq as well as summing over q, we findλ = −N and therefore:

αq =∑

Ni=1 τiq

N.

The algorithm is initialized for instance using a few iterations of thekmeans algorithm (see Section 1.1.1), the E and M step are then repeateduntil convergence. As mentioned already in a general setting, EM (Algo-rithm 3) is not guaranteed to converge to a global maximum.

Relation to kmeans

As mentioned in Section 1.1.1, the two stages of the kmeans algorithm arerelated to the E and M steps of the EM algorithm in the case of Gaussianmixture models. While kmeans relies on hard assignments, we showed inSection 1.1.3 that EM performs soft assignments of points in a data set.In fact, the kmeans algorithm can be seen as a particular limit of EM forGaussian mixtures (Bishop 2006). Thus, consider that each component ofthe mixture has a covariance matrix given by Σq = σ2 I, where I denotesthe identity matrix:

p(xi | µq, σ2) = N (xi; µq, σ2)

=1

(2πσ2)d/2 exp(

−1

2σ2 || xi− µq ||2)

.

The variance parameter σ2 is shared by all the components and held fixedfor now. If the EM algorithm is applied for maximum likelihood estima-tion of this Gaussian mixture model, then the posterior probabilities τiqs

Page 30: Modèles de graphes aléatoires à structure cachée pour l ...

1.1. Mixture models and EM 19

Algorithm 3: The EM algorithm for maximum likelihood estimationin Gaussian mixture models.

// INITIALIZATION

Initialize α0q , µ0

q , Σ0q , ∀q

αnewq ← α0

q , ∀qµnew

q ← µ0q , ∀q

Σnewq ← Σ

0q , ∀q

// OPTIMIZATION

repeat

αoldq ← αnew

q , ∀qµold

q ← µnewq , ∀q

Σoldq ← Σ

newq , ∀q

// E-step

τiq ←αold

q N (xi ; µoldq ,Σold

q )

∑Ql=1 αold

l N (xi ; µoldl ,Σold

l ), ∀i, q

// M-step

αnewq ←

∑Ni=1 τiq

N , ∀q

µnewq ←

∑Ni=1 τiq xi

∑Ni=1 τiq

, ∀q

Σnewq ←

∑Ni=1(xi − µnew

q )(xi − µnewq )⊺

∑Ni=1 τiq

, ∀q

until (αq, µq, Σq), ∀q converges

given by (1.14) become:

τiq =αq exp

(

− 12σ2 || xi− µq ||

2)

∑Ql=1 αl exp

(− 1

2σ2 || xi− µl ||2) . (1.20)

If we now consider the limit σ2 → 0 in (1.20), the smallest term || xi− µl ||2

in the denominator will go to zero most slowly. Therefore, the probabili-ties τiq for observation xi will all go to zero except for the correspondingτil which will go to unity. Thus, when the variance parameter of the com-ponents tends to zero, the EM algorithm leads to a hard assignment of thedata as in kmeans, and each observation is assigned to the class having thenearest mean vector. The expectation QML(θ, θold) in (1.15) then becomes:

limσ2→0QML(θ, θold) = lim

σ2→0−

12σ2

N

∑i=1

Q

∑q=1

τiq|| xi− µq ||2 + const. (1.21)

Since the observations are hard assigned to the classes, we can denoteτiq = Ziq as in kmeans and therefore maximizing (1.21) is equivalent tominimizing the kmeans objective function (1.1).

1.1.4 Model selection

For a fixed number Q of classes, we have seen how the EM algorithmcould be used for both the estimation of mixture model parameters and

Page 31: Modèles de graphes aléatoires à structure cachée pour l ...

20 Chapter 1. Context

the classification of observations in a data set. We now describe some ofthe existing model selection criteria which aim at estimating Q from thedata. For an extensive discussion on model selection in mixture models,we refer to McLachlan and Peel (2000). In this section, θ either denotesthe maximum likelihood (θML) or the MAP estimate (θMAP) as describedin Section 1.1.3.

Given a set of values of Q, the goal is to select Q∗ such that a givencriterion is maximized. Because all the criteria described in this sectionrely on θ, the mixture model parameters have to be estimated for eachvalue of Q in the set.

Akaike’s information criterion

Let us consider an observed data set X and a fixed number Q of classes.We also denote by Y any data set of the same size of X. If the observa-tions are assumed to be drawn from a mixture model with parameter θ,we have seen that the EM algorithm can be used to maximize log p(X | θ)or log p(θ |X). This leads to an estimate θ of θ. In practice, the value of Qconsidered might not be the most appropriate to model the data. More-over, the EM algorithm can converge to local rather than global maximum.In any case, the distribution p(Y |θ) can only be seen as an approximationof the true distribution p(Y) which generated X.

Model selection can then be approached in terms of the Kullback-Leibler (KL) divergence between p(Y) and p(Y |θ):

KL(

p(·)||p(· |θ(X)

))= −

p(Y) log

{

p(Y |θ(X)

)

p(Y)

}

d Y

=∫

p(Y) log p(Y)d Y−∫

p(Y) log p(Y |θ(X)

)d Y,

(1.22)where we have denoted θ = θ(X) to emphasize that θ is estimated fromX. In information theory, (1.22) is seen as the information which is lost byapproximating the true model p(Y) with a fitted model p(Y |θ). The goalis then to select the model for which the corresponding information lossis minimum. As the first term on the right hand side of (1.22) does notdepend on the fitted model, only the second term is relevant. It can beexpressed as:

η(X) =∫

p(Y) log p(Y |θ(X)

)d Y

= EY[log p(Y |θ(X)

)],

(1.23)

where the expectation is taken according to p(Y). Unfortunately, becausep(Y) is unknown, (1.23) cannot be computed analytically. However, if wetake the expectation of η(X) over every possible data set X, we obtain:

EX[η(X)] = EX,Y[log p(Y |θ(X)

)], (1.24)

which can be estimated (McLachlan and Peel 2000). In practice, only asingle data set X is given and Akaike (1973; 1974) showed that (1.24) isasymptotically equal to:

log p(X |θ)− K, (1.25)


where K is the total number of parameters in the model. The approximation (1.25) corresponds to Akaike's Information Criterion (AIC).

The AIC criterion has been widely used to assess the order of a mixture model (Bozdogan and Sclove 1984, Sclove 1987). However, many authors observed that AIC is order inconsistent and tends to overfit models (Koehler and Murphee 1988, Soromenho 1993, Celeux and Soromenho 1996). In other words, AIC tends to overestimate the number of components in the mixture context.

Bayesian information criterion

The Bayesian Information Criterion (BIC) relies on an asymptotic approximation of the marginal log-likelihood, also called integrated observed-data log-likelihood, given by:

\[
\log p(X) = \log\left\{\int p(X, \theta)\,d\theta\right\} = \log\left\{\int p(X\,|\,\theta)\,p(\theta)\,d\theta\right\}, \tag{1.26}
\]

where p(θ) is a prior distribution over the mixture model parameters. Since (1.26) involves integrating over all possible values of θ, it is generally not tractable. To approximate the integral, the integrand is expanded using a second order Taylor series about the point θ = θ̂:

\[
\log p(X, \theta) \approx \log p(X, \hat{\theta}) + \nabla_{\theta=\hat{\theta}} \log p(X, \theta)^{\intercal}(\theta - \hat{\theta}) - \frac{1}{2}(\theta - \hat{\theta})^{\intercal} H\, (\theta - \hat{\theta}),
\]

where H is the negative Hessian matrix of log p(X, θ) evaluated at θ̂. Note that this approximation is relevant if the integrand is highly concentrated around θ̂. If we set θ̂ = θMAP, we have:

\[
\begin{aligned}
\nabla_{\theta=\theta_{\mathrm{MAP}}} \log p(X, \theta) &= \nabla_{\theta=\theta_{\mathrm{MAP}}} \log\big(p(\theta\,|\,X)\,p(X)\big)\\
&= \nabla_{\theta=\theta_{\mathrm{MAP}}} \log p(\theta\,|\,X) + \nabla_{\theta=\theta_{\mathrm{MAP}}} \log p(X)\\
&= \nabla_{\theta=\theta_{\mathrm{MAP}}} \log p(\theta\,|\,X)\\
&= 0,
\end{aligned}
\]

since θMAP maximizes log p(θ | X) and log p(X) does not depend on θ. Therefore:

\[
\begin{aligned}
\log p(X, \theta) &\approx \log p(X, \hat{\theta}) - \frac{1}{2}(\theta-\hat{\theta})^{\intercal} H\, (\theta-\hat{\theta})\\
&\approx \log p(X\,|\,\hat{\theta}) + \log p(\hat{\theta}) - \frac{1}{2}(\theta-\hat{\theta})^{\intercal} H\, (\theta-\hat{\theta}).
\end{aligned} \tag{1.27}
\]

Using (1.27) in (1.26) leads to:

\[
\begin{aligned}
\log p(X) &\approx \log\left\{\int p(X\,|\,\hat{\theta})\,p(\hat{\theta}) \exp\left(-\frac{1}{2}(\theta-\hat{\theta})^{\intercal} H\, (\theta-\hat{\theta})\right) d\theta\right\}\\
&\approx \log p(X\,|\,\hat{\theta}) + \log p(\hat{\theta}) + \log \int \underbrace{\exp\left(-\frac{1}{2}(\theta-\hat{\theta})^{\intercal} H\, (\theta-\hat{\theta})\right)}_{\text{Gaussian functional form}} d\theta.
\end{aligned} \tag{1.28}
\]


The functional form of the integrand in (1.28) corresponds to a Gaussian distribution with mean vector θ̂ and covariance matrix H⁻¹. Thus, the integral takes a simple form:

\[
\int \exp\left(-\frac{1}{2}(\theta-\hat{\theta})^{\intercal} H\, (\theta-\hat{\theta})\right) d\theta = (2\pi)^{d/2}\,|H|^{-\frac{1}{2}},
\]

and

\[
\log p(X) \approx \log p(X\,|\,\hat{\theta}) + \log p(\hat{\theta}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log |H|. \tag{1.29}
\]

This approximation is known as Laplace's method for integrals. The second part of the right hand side of (1.29) penalizes the model complexity and is sometimes called the Occam factor. As shown in Kass and Raftery (1995) and Raftery (1995), for large samples, θML ≈ θMAP and H ≈ J, where J is the expected Fisher information matrix of the observed data. This leads to:

\[
\log p(X) \approx \log p(X\,|\,\hat{\theta}) + \log p(\hat{\theta}) + \frac{d}{2}\log(2\pi) - \frac{1}{2}\log |J|. \tag{1.30}
\]

This strong approximation assumes that the prior is very flat such that its effect can be ignored (Ripley 1996). Finally, the BIC criterion, sometimes called the Schwarz criterion (Schwarz 1978), is obtained by ignoring the terms in O(1) in (1.30) and noting that:

\[
|J| = O(N^{K}),
\]

to give

\[
\log p(X) \approx \log p(X\,|\,\hat{\theta}) - \frac{K}{2}\log N.
\]

Leroux (1992) showed that, asymptotically, BIC does not underestimate the true number of classes. Moreover, simulation studies as well as experiments on real data have been carried out to assess the performances of BIC and they have reported encouraging results (Roeder and Wasserman 1997, Campbell et al. 1997, Dasgupta and Raftery 1998).

For log N > 2, that is N ≥ 8, it can be easily verified that BIC penalizes the model complexity more heavily than AIC. Therefore, it reduces the tendency of AIC to fit too many components. On the other hand, Celeux and Soromenho (1996) showed that BIC fits too few components when the sample size is limited and the model for the component densities is valid. Conversely, if the model for the component densities is not valid, then Biernacki et al. (2000) found that BIC tends to fit too many components.

Finally, as mentioned by McLachlan and Peel (2000), we emphasize that BIC can not only be used to estimate Q, but can also help decide which model to adopt for the component densities (Biernacki et al. 1999).
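As a brief illustration, the following minimal sketch compares AIC and BIC on purely hypothetical maximized log-likelihood values; the functions simply implement (1.25) and the BIC approximation above, both criteria being maximized.

```r
## A minimal sketch (hypothetical values) comparing AIC and BIC.
## logLik_hat stands for the maximized log-likelihood log p(X | theta_hat),
## K for the number of free parameters and N for the sample size.
aic <- function(logLik_hat, K)    logLik_hat - K
bic <- function(logLik_hat, K, N) logLik_hat - K / 2 * log(N)

## Hypothetical maximized log-likelihoods for Q = 1, ..., 4 components of a
## univariate Gaussian mixture (K = 3Q - 1 free parameters).
logLik_hat <- c(-620.4, -530.2, -528.9, -528.1)
Q <- 1:4
K <- 3 * Q - 1
N <- 200

data.frame(Q, AIC = aic(logLik_hat, K), BIC = bic(logLik_hat, K, N))
## Both criteria are to be maximized; BIC penalizes complexity more heavily
## than AIC as soon as log(N) > 2.
```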

Classification likelihood criterion

The complete-data log-likelihood, also called classification log-likelihood, is given by:

\[
\log p(X, Z\,|\,\theta) = \log p(X\,|\,\theta) + \log p(Z\,|\,X, \theta). \tag{1.31}
\]


For many mixture models (see for instance Section 1.1.3), the posterior distribution over the latent variables can be factorized and computed analytically. The functional form of the prior is conserved and we obtain a product of multinomial distributions:

\[
p(Z\,|\,X, \theta) = \prod_{i=1}^{N} \mathcal{M}(Z_i; 1, \tau_i) = \prod_{i=1}^{N} \prod_{q=1}^{Q} \tau_{iq}^{Z_{iq}}, \tag{1.32}
\]

where τiq is the posterior probability that observation i belongs to class q. As mentioned by Hathaway (1986), using (1.32) in (1.31) leads to:

\[
\log p(X, Z\,|\,\theta) = \log p(X\,|\,\theta) + \sum_{i=1}^{N} \sum_{q=1}^{Q} Z_{iq} \log \tau_{iq}. \tag{1.33}
\]

If we now set Z = τ and θ = θ̂ in (1.33), we obtain:

\[
\log p(X, \tau\,|\,\hat{\theta}) = \log p(X\,|\,\hat{\theta}) - E_N(\tau), \tag{1.34}
\]

where

\[
E_N(\tau) = -\sum_{i=1}^{N} \sum_{q=1}^{Q} \tau_{iq} \log \tau_{iq},
\]

is the entropy of the fuzzy classification matrix. If the mixture components are well separated, it will be close to zero. Otherwise, it will have a large value. Biernacki and Govaert (1997) originally proposed (1.34) as a model selection criterion for Gaussian mixture models, although it has a broader applicability. Indeed, it only requires that the posterior distribution over the latent variables takes the factorized form (1.32). This criterion is referred to as the Classification Likelihood Criterion (CLC). Biernacki and Govaert (1997) suggested using the EM algorithm to estimate both τ and θ from the data. According to Biernacki et al. (1999), CLC works well when the class proportions are restricted to being equal. On the other hand, if they are different, because CLC does not penalize the number K of mixture model parameters, it tends to overestimate the correct number of classes.

Integrated classification likelihood criterion

The Integrated Classification Likelihood (ICL) criterion was introduced by Biernacki et al. (2000). It relies on an asymptotic approximation of the integrated complete-data log-likelihood log p(X, Z). So far, to keep the notations uncluttered, we have denoted θ the set of all the mixture model parameters. However, to sketch the development of ICL, we now need to go back to the notations used in Section 1.1.2. Thus, κ denotes all the parameters which describe the component densities (i.e. the mean vectors and covariance matrices in Gaussian mixture models) and α the class proportions. If the prior p(θ) is factorized such that:

\[
p(\theta) = p(\kappa)\,p(\alpha),
\]


then, using (Appendix A.1), log p(X, Z) is given by:

\[
\log p(X, Z) = \log p(X\,|\,Z) + \log p(Z). \tag{1.35}
\]

A BIC-like approximation (see Section 1.1.4) applied on the first term of the right hand side of (1.35) leads to:

\[
\log p(X\,|\,Z) \approx \log p(X\,|\,Z, \hat{\kappa}) - \frac{K_1}{2}\log N, \tag{1.36}
\]

where K1 is the number of parameters in κ and κ̂ is a maximum likelihood or MAP estimate of κ. Using a Jeffreys non informative prior distribution for the class proportions α, Biernacki et al. (2000) obtained an analytical expression for log p(Z):

\[
\log p(Z) = \log \Gamma\Big(\frac{Q}{2}\Big) + \sum_{q=1}^{Q} \log \Gamma\Big(\frac{1}{2} + n_q\Big) - Q \log \Gamma\Big(\frac{1}{2}\Big) - \log \Gamma\Big(N + \frac{Q}{2}\Big), \tag{1.37}
\]

where nq = ∑_{i=1}^{N} Ziq, ∀q, and Γ(·) is the Gamma function. For more details, see (Appendix A.2). Assuming that N and the nq s take large values (namely that the αq s are far from 0), Biernacki et al. (2000) relied on the Stirling formula (Appendix A.3) to obtain an approximation of (1.37):

\[
\log p(Z) \approx \log p(Z\,|\,\hat{\alpha}) - \frac{Q-1}{2}\log N, \tag{1.38}
\]

where α̂ = arg maxα log p(Z | α). Finally, using (1.36) and (1.38) in (1.35) leads to:

\[
\begin{aligned}
\log p(X, Z) &\approx \log p(X\,|\,Z, \hat{\kappa}) - \frac{K_1}{2}\log N + \log p(Z\,|\,\hat{\alpha}) - \frac{Q-1}{2}\log N\\
&= \log p(X, Z\,|\,\hat{\theta}) - \frac{K_1 + Q - 1}{2}\log N.
\end{aligned}
\]

Therefore, if we fix Z = τ as in Section 1.1.4, an (asymptotic) ICL criterion is given by:

\[
\begin{aligned}
\mathrm{ICL} &= \log p(X, \tau\,|\,\hat{\theta}) - \frac{K_1 + Q - 1}{2}\log N\\
&= \log p(X\,|\,\hat{\theta}) - E_N(\tau) - \frac{K_1 + Q - 1}{2}\log N\\
&= \mathrm{CLC} - \frac{K_1 + Q - 1}{2}\log N\\
&= \mathrm{CLC} - \frac{K}{2}\log N,
\end{aligned} \tag{1.39}
\]

since K, the total number of unknown parameters in θ, is given by K = K1 + Q − 1. The ICL criterion was proposed in order to favor models with well separated components (as in CLC) while penalizing the model complexity by the number of unknown mixture model parameters (as in BIC).
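The following minimal sketch assembles E_N(τ), CLC and ICL from the expressions above; the maximized log-likelihood and the posterior probability matrix are hypothetical toy values.

```r
## A minimal sketch (assumed quantities) of the CLC and ICL criteria.
## tau is the N x Q matrix of estimated posterior probabilities, logLik_hat
## the maximized observed-data log-likelihood, K1 the number of parameters
## describing the component densities.
entropy_fuzzy <- function(tau) -sum(tau * log(tau + 1e-12))   # E_N(tau)

clc <- function(logLik_hat, tau) logLik_hat - entropy_fuzzy(tau)
icl <- function(logLik_hat, tau, K1, Q, N) {
  clc(logLik_hat, tau) - (K1 + Q - 1) / 2 * log(N)
}

## Toy illustration with two well separated classes (entropy close to zero)
tau <- rbind(c(0.99, 0.01), c(0.02, 0.98), c(0.97, 0.03))
icl(logLik_hat = -150.3, tau = tau, K1 = 4, Q = 2, N = nrow(tau))
```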


1.2 Variational Inference

In this section, we introduce some variational techniques which lie at the core of the main inference strategies for networks developed in this thesis. We first describe the variational EM algorithm, which generalizes EM. We then turn to a more general framework, called variational Bayes EM, which is used in Chapters 2 and 4 to obtain an approximation of the full posterior distribution over the model parameters and latent variables. This will naturally lead to two new model selection criteria to estimate the number of classes in networks.

1.2.1 Variational EM

As mentioned in Section 1.1.3, the EM algorithm can be used to find maximum likelihood estimates for models with latent variables. Here, we give a variational treatment of EM which shows that the algorithm is guaranteed to increase the observed-data log-likelihood at each iteration (Csiszar and Tusnady 1984, Hathaway 1986, Neal and Hinton 1998).

Let us consider a data set X and the corresponding classification matrix Z. We aim at maximizing the observed-data log-likelihood:

\[
\log p(X\,|\,\theta) = \log\left(\sum_{Z} p(X, Z\,|\,\theta)\right). \tag{1.40}
\]

Note that (1.40) involves a summation over every possible matrix Z because we consider discrete mixture models. However, this analysis goes through unchanged if the latent variables are continuous, simply by replacing the summation with an integration. For any distribution q(Z) over the latent variables, the following decomposition holds:

\[
\log p(X\,|\,\theta) = \mathcal{L}_{ML}(q; \theta) + \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X, \theta)\big),
\]

where

\[
\mathcal{L}_{ML}(q; \theta) = \sum_{Z} q(Z) \log\left\{\frac{p(X, Z\,|\,\theta)}{q(Z)}\right\}, \tag{1.41}
\]

and

\[
\mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X, \theta)\big) = -\sum_{Z} q(Z) \log\left\{\frac{p(Z\,|\,X, \theta)}{q(Z)}\right\}. \tag{1.42}
\]

Note that LML(q; θ) in (1.41) is a functional of the distribution q(Z) as well as a function of the parameter θ. In (1.42), KL denotes the Kullback-Leibler divergence between q(Z) and p(Z | X, θ). It satisfies

\[
\mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X, \theta)\big) \geq 0,
\]

with equality if, and only if, q(Z) = p(Z | X, θ). Therefore, LML is a lower bound of log p(X | θ):

\[
\log p(X\,|\,\theta) \geq \mathcal{L}_{ML}(q; \theta).
\]

Suppose that the current value of the parameters is θold. During the E step, LML(q; θold) is maximized with respect to q(Z). The solution to this


optimization step occurs when the KL divergence vanishes, that is, when q(Z) = p(Z | X, θold). The observed-data log-likelihood is then equal to its lower bound, given by:

\[
\begin{aligned}
\mathcal{L}_{ML}(q; \theta) &= \sum_{Z} p(Z\,|\,X, \theta^{\mathrm{old}}) \log p(X, Z\,|\,\theta) - \sum_{Z} p(Z\,|\,X, \theta^{\mathrm{old}}) \log p(Z\,|\,X, \theta^{\mathrm{old}})\\
&= \sum_{Z} p(Z\,|\,X, \theta^{\mathrm{old}}) \log p(X, Z\,|\,\theta) + \mathrm{const},
\end{aligned} \tag{1.43}
\]

where all the terms that do not depend on θ have been absorbed into the constant. Note that the first term on the right hand side of (1.43) is the expectation QML(θ, θold) of the complete-data log-likelihood (see Section 1.1.3). During the M step, q(Z) is held fixed while LML(q; θ) is maximized with respect to θ to obtain a new estimate θnew. This causes the lower bound to increase, which in turn causes the corresponding observed-data log-likelihood to increase.

As already mentioned, the EM algorithm can also be used to find MAP estimates when the goal is to maximize log p(θ | X). The corresponding variational decomposition is given by:

\[
\log p(\theta\,|\,X) = \mathcal{L}_{MAP}(q; \theta) + \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X, \theta)\big),
\]

where

\[
\mathcal{L}_{MAP}(q; \theta) = \sum_{Z} q(Z) \log\left\{\frac{p(\theta, Z\,|\,X)}{q(Z)}\right\},
\]

and KL is again the Kullback-Leibler divergence between q(Z) and p(Z | X, θ).

So far, we have assumed that the posterior distribution p(Z | X, θ) could be computed analytically. However, for some mixture models it is not tractable and therefore variational approximations are required. This gives rise to the Variational EM (VEM) algorithm (Algorithm 4). During the variational E step, the lower bound L(q; θold) is maximized with respect to q(Z), where L either denotes LML or LMAP. This maximization induces a minimization of the KL divergence between q(Z) and p(Z | X, θold). To obtain a tractable algorithm, it is often assumed that q(Z) can be factorized such that:

\[
q(Z) = \prod_{i=1}^{N} q(Z_i).
\]

The solution to this optimization procedure is an approximation q(Z) of p(Z | X, θold). During the variational M step, this approximation is used to compute the lower bound L(q; θ), which is then maximized with respect to θ. These two steps are repeated until convergence.

Algorithm 4: Variational EM algorithm for mixture models. L either denotes LML or LMAP.

    // INITIALIZATION
    Initialize θ0
    θnew ← θ0
    // OPTIMIZATION
    repeat
        θold ← θnew
        // Variational E-step
        Find q(Z) = argmax_{q(Z)} L(q; θold)
        // Variational M-step
        Find θnew = argmax_θ L(q; θ)
    until θ converges

1.2.2 Variational Bayes EM

In the previous sections, we have seen how EM strategies could be used to obtain point estimates of mixture model parameters. Moreover, we motivated the use of a Bayesian treatment of the data in order to deal with the singularities of Gaussian mixture models which arise in the maximum likelihood setting. However, the Bayesian framework brings much more powerful features. Indeed, rather than looking for point estimates of the model parameters, we are now going to see how an approximation of the full posterior distribution over the model parameters and latent variables can be obtained. Such an approximation is particularly relevant when studying the variability of the MAP estimates. It also gives rise to new model selection criteria.

In a Bayesian framework, all the model parameters are regarded as random variables drawn from a prior distribution p(θ). If we denote X an observed data set and Z the corresponding classification matrix, the goal is to estimate p(Z, θ | X). The marginal log-likelihood, also called integrated observed-data log-likelihood, is given by:

\[
\log p(X) = \log\left\{\int p(X, \theta)\,d\theta\right\} = \log\left\{\int \sum_{Z} p(X, Z, \theta)\,d\theta\right\}. \tag{1.44}
\]

For any distribution q(Z, θ), the following decomposition holds:

\[
\log p(X) = \mathcal{L}(q) + \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X)\big),
\]

where

\[
\mathcal{L}(q) = \sum_{Z} \int q(Z, \theta) \log\left\{\frac{p(X, Z, \theta)}{q(Z, \theta)}\right\} d\theta, \tag{1.45}
\]

and

\[
\mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X)\big) = -\sum_{Z} \int q(Z, \theta) \log\left\{\frac{p(Z, \theta\,|\,X)}{q(Z, \theta)}\right\} d\theta.
\]

The functional L defined in (1.45) is a lower bound of the marginal log-likelihood, that is log p(X) ≥ L(q), with equality iff q(Z, θ) = p(Z, θ | X). However, we shall suppose that working with the true posterior distribution is intractable. To obtain an approximation of the posterior, we restrict our search to a family of factorized distributions. In practice, it is often assumed that:

\[
q(Z, \theta) = q(Z)\,q(\theta) = q(\theta) \prod_{i=1}^{N} q(Z_i). \tag{1.46}
\]

More generally, the hidden variables (Z, θ) can be classified into P disjoint groups {g1, . . . , gP} of a partition G such that:

\[
q(Z, \theta) = q(G) = \prod_{i=1}^{P} q_i(g_i). \tag{1.47}
\]

In the following, we denote G\j all the groups of G except gj. Using (1.47) in (1.45), the lower bound becomes:

\[
\begin{aligned}
\mathcal{L}(q) &= \int q(G) \log\frac{p(X, G)}{q(G)}\, dG\\
&= \int \prod_i q_i(g_i) \Big(\log p(X, G) - \sum_i \log q_i(g_i)\Big)\, dG\\
&= \int \prod_i q_i(g_i)\, \log p(X, G)\, dG - \int \prod_i q_i(g_i) \sum_i \log q_i(g_i)\, dG\\
&= \int q_j(g_j) \Big(\int \prod_{i\neq j} q_i(g_i)\, \log p(X, G)\, dG_{\setminus j}\Big)\, dg_j - \int \prod_i q_i(g_i)\, \log q_j(g_j)\, dG - \int \prod_i q_i(g_i) \sum_{i\neq j} \log q_i(g_i)\, dG\\
&= \int q_j(g_j)\, \mathbb{E}_{G_{\setminus j}}[\log p(X, G)]\, dg_j - \int q_j(g_j) \log q_j(g_j)\, dg_j - \sum_{i\neq j} \int q_i(g_i) \log q_i(g_i)\, dg_i\\
&= \int q_j(g_j)\, \big(\log p(X, g_j) - \mathrm{const}\big)\, dg_j - \int q_j(g_j) \log q_j(g_j)\, dg_j + \sum_{i\neq j} \mathcal{H}[g_i]\\
&= -\mathrm{KL}\big(q_j(\cdot)\,\|\,p(X, \cdot)\big) + \sum_{i\neq j} \mathcal{H}[g_i] - \mathrm{const},
\end{aligned} \tag{1.48}
\]

where

\[
\log p(X, g_j) = \mathbb{E}_{G_{\setminus j}}[\log p(X, G)] + \mathrm{const}, \qquad \mathbb{E}_{G_{\setminus j}}[\log p(X, G)] = \int \prod_{i\neq j} q_i(g_i)\, \log p(X, G)\, dG_{\setminus j},
\]

and H[gi] = −∫ qi(gi) log qi(gi) dgi is the entropy of qi(·). To simplify the notations, we have used some integrations in formulating the lower bound. In fact, we emphasize that the latent variables we consider are discrete and therefore, depending on the groups of G, some integrations should be replaced by summations as required. If the factors in (1.47) are assumed to be fixed except the distribution qj(gj), then according to (1.48), maximizing L(q) with respect to qj(gj) is equivalent to minimizing the Kullback-Leibler divergence between qj(gj) and p(X, gj). Therefore, the optimal approximation for the j-th factor is:

\[
\log q_j(g_j) = \log p(X, g_j) = \mathbb{E}_{G_{\setminus j}}[\log p(X, G)] + \mathrm{const},
\]

where the constant can be determined by normalizing qj(gj). Thus, we have obtained a set of P coupled equations. Indeed, each of the factors is optimized by computing an expectation with respect to all the other factors. As a consequence, the factors are first initialized and then, by cycling through the equations and replacing each factor with its new approximation, we obtain a consistent solution for the variational optimization procedure. Convergence is guaranteed because the lower bound is convex with respect to each factor.

Algorithm 5: Variational Bayes EM algorithm for mixture models.

    // INITIALIZATION
    Initialize q_i^0(gi), ∀i
    q_i^new(gi) ← q_i^0(gi), ∀i
    // OPTIMIZATION
    repeat
        q_i^old(gi) ← q_i^new(gi), ∀i
        q_i^new(gi) ← exp( ∫ ∏_{j≠i} q_j^old(gj) log p(X, G) dG\i ), ∀i
        Normalize q_i^new(gi), ∀i
    until L(q) converges

For the factorization (1.46), some optimization equations involve the distributions q(Zi) over the latent variables while others focus on the factor q(θ). These two kinds of updates are related to the E and M steps of EM strategies. Therefore, the optimization algorithm described in this section is usually referred to as the Variational Bayes EM (VBEM) algorithm (Algorithm 5). It can be seen that VEM is a limiting case of VBEM in which the distribution over the model parameters q(θ) is collapsed to a point estimate at the mode of the distribution (Hofman and Wiggins 2008).

1.3 Graph clustering

In the previous sections, we introduced mixture models and described numerous EM strategies for estimation and clustering purposes. We now concentrate on the classification of vertices in networks depending on their connection profiles. There has been a wealth of literature on the topic, which goes back to the early work of Moreno (1934). As shown in Newman and Leicht (2007), it appears that available methods can be grouped into three significant categories. First, some models look for community structure, also called assortative mixing (Newman 2003, Danon et al. 2005), where vertices are partitioned into classes such that vertices of a class are mostly connected to vertices of the same class. They are particularly suitable for the analysis of affiliation networks. Other models look for disassortative mixing, in which vertices mostly connect to vertices of different classes. They are commonly used to analyze bipartite networks (Estrada and Rodriguez-Velazquez 2005). These models are not considered in this thesis. Finally, a few procedures look for heterogeneous structure where vertices can have different types of connection profiles. In particular, they can be used to uncover both community structure and disassortative mixing.

In this section, we first review some basic definitions of graph theory and give some examples of real networks. We then describe some of the most widely used graph clustering methods. Note that many model free approaches exist (Fortunato 2010). However, except for the algorithmic approach presented in Section 1.3.2, we concentrate in the following on methods which rely on statistical models only, as in Goldenberg et al. (2010).

1.3.1 Graph theory and real networks

Networks are used in many scientific fields to represent the interactions between objects of interest. For instance, in Biology, regulatory networks can describe the regulation of genes with transcriptional factors (Milo et al. 2002), while metabolic networks focus on representing pathways of biochemical reactions (Lacroix et al. 2006). Besides, the binding procedures of proteins are often described as protein-protein interaction networks (Albert and Barabási 2002, Barabási and Oltvai 2004). In social sciences, networks are widely used to represent relational ties between actors (Snijders and Nowicki 1997, Nowicki and Snijders 2001, Palla et al. 2007). Other examples of networks are powergrids (Watts and Strogatz 1998) and the Internet (Zanghi et al. 2008).

A network is commonly represented by a graph G = (V, E) where V is a set of N vertices and E is a set of edges between pairs of vertices. The graph is said to be directed (Figure 1.1) if the pairs (u, v) in E are ordered. Conversely, unordered pairs form an undirected graph (Figures 1.2 and 1.3). Note that the edges can be weighted by a function w : E → F for any set F. However, in this thesis, we will concentrate only on binary graphs, that is F = {0, 1}. The size of G is then given through the edge count m = |E|. The graph is said to be dense if m is close to the maximal number M of edges, whereas a low value of m leads to a sparse graph. To characterize the density of G, a criterion δ(G) is often used. It is defined as the ratio of the number m of existing edges over the number M of potential edges:

\[
\delta(G) = \frac{m}{M}.
\]

For a directed graph, M = N², while M = N(N + 1)/2 otherwise. If G does not contain any self loop, that is an edge from a vertex to itself, then M = N(N − 1) for a directed graph and M = N(N − 1)/2 otherwise.

The neighbourhood NG(u) of vertex u is defined as the set of all the vertices connected to u. Its degree d(u) is equal to its number of incident edges. Finally, a path from a vertex u to a vertex v is a sequence of edges in E starting at vertex v0 = u and ending at vertex vk+1 = v:

\[
\{u, v_1\}, \{v_1, v_2\}, \ldots, \{v_k, v\}.
\]
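As an illustration of these definitions, the following minimal sketch builds the adjacency matrix of a small assumed undirected graph and computes its density δ(G) and the vertex degrees.

```r
## A minimal sketch (assumed toy edge list) computing delta(G) and the degrees
## of an undirected binary graph without self loops.
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 1), c(1, 3))
N <- 5                                      # vertex 5 is isolated

X <- matrix(0L, N, N)                       # adjacency matrix
X[edges] <- 1L
X <- X + t(X)                               # symmetrize (each edge listed once)

m <- sum(X) / 2                             # number of existing edges
M <- N * (N - 1) / 2                        # potential edges (undirected, no self loop)
delta   <- m / M                            # density delta(G)
degrees <- rowSums(X)                       # degree d(u) of each vertex
delta; degrees
```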


If there exists at least one path between every pair of vertices, then the graph is said to be connected. For instance, the graph in Figure 1.1 is connected, contrary to the graphs in Figures 1.2 and 1.3 which have some isolated vertices.

So far, we have denoted X a data matrix whose ith row represents observation xi. Because we considered N observations in R^d, X was in R^{N×d}. The matrix X is now an adjacency matrix which describes the presence or absence of an edge in a graph. As mentioned already, we focus on binary graphs and therefore X is in {0, 1}^{N×N}. Thus, if there exists an edge from vertex i to vertex j then Xij equals 1, and 0 otherwise. At this point, it is crucial to emphasize that N, which denotes the number of vertices, is no longer the number of observations in the data. Indeed, when considering graphs, the information about the data distribution comes from the presence or absence of edges. Therefore, the total number of observations is in O(N²). As we shall see shortly, for many graph clustering models, the edges are not independent and so approximation techniques are required for estimation and classification purposes.

Properties of real networks

Very interestingly, most real networks have been shown to share some properties (Albert et al. 1999, Broder et al. 2000, Dorogovtsev et al. 2000, Amaral et al. 2000, Strogatz 2001) that we briefly recall in the following.

• Sparsity: The number of edges is linear in the number of vertices.

• Existence of a giant component: A connected subgraph contains a majority of the vertices.

• Heterogeneity: A few vertices have a lot of connections while most of the vertices have very few links. The degrees of the vertices are sometimes characterized using a scale free distribution (for instance see Barabasi and Albert 1999).

• Preferential attachment: New vertices can associate to any vertices, but “prefer” to associate to vertices which already have many connections.

• Small world: The shortest path from one vertex to another is generally rather small.


Figure 1.1 – Subset of the yeast transcriptional regulatory network (Milo et al. 2002). Nodes of the directed network correspond to genes, and two genes are linked if one gene encodes a transcriptional factor that directly regulates the other gene.


Figure 1.2 – The metabolic network of the bacterium Escherichia coli (Lacroix et al. 2006). Nodes of the undirected network correspond to biochemical reactions, and two reactions are connected if a compound produced by the first one is a part of the second one (or vice-versa).


Figure 1.3 – Subset of the French political blogosphere network. The data consists of a single day snapshot of political blogs automatically extracted on 14th October 2006 and manually classified by the “Observatoire Présidentielle project” (Zanghi et al. 2008). Nodes correspond to hostnames and there is an edge between two nodes if there is a known hyperlink from one hostname to another (or vice-versa).


1.3.2 Community structure

Many graph clustering methods aim at detecting community structure, also called assortative mixing, meaning the appearance of densely connected groups of vertices, with only sparser connections between groups (Figure 1.4). Most of them rely on the modularity score of Newman and Girvan (2004). However, we point out the recent work of Bickel and Chen (2009) who showed that these algorithms are (asymptotically) biased and that using modularity scores could lead to the discovery of an incorrect community structure, even for large graphs.

Figure 1.4 – Example of an undirected affiliation network with 50 vertices. The network is made of three communities represented in red, blue, and green. Vertices connect mainly to vertices of the same community.

Modularity score

Girvan and Newman (2002) and Newman and Girvan (2004) proposed several intuitive community detection algorithms which involve iterative removal of edges from the network to split it into communities. Edges to be removed are identified using one of a number of possible betweenness measures. All of them are based on the same idea. If two communities are joined by only a few inter community edges, then all paths from vertices in one community to vertices in the other must pass along one of those few edges. Therefore, given a suitable set of paths, we expect the number of paths that go along an edge to be largest for inter community edges.

First, they introduced the edge betweenness, which is a generalization to edges of the (vertex) betweenness measure of Freeman (1977). The edge betweenness of an edge is defined as the number of shortest paths between all pairs of vertices in the network that run along that edge. Second, they considered the random walk betweenness. The expected number of times a random walk between a particular pair of vertices will pass down a particular edge is calculated. This expected value is then summed over all pairs of vertices to obtain the random walk betweenness of the edge. As shown in Newman and Girvan (2004), other scores can obviously be considered to obtain algorithms that may be more appropriate for some applications. However, it appears that the choice of measure does not strongly influence the result of the algorithms. On the other hand, the recalculation step after each edge removal is crucial (see Algorithm 6).

All these algorithms produce a dendrogram (Figure 1.5) which represents an entirely nested hierarchy of possible community divisions for the network. In order to select one of these divisions, Newman and Girvan (2004) proposed a modularity criterion. Consider a particular division with Q communities and let us denote eql the fraction of all edges in the network that link vertices in community q to vertices in community l. Moreover, consider the fraction aq = ∑_{l=1}^{Q} eql of edges that connect to vertices of community q. The modularity criterion is then given by:

\[
\mathrm{mod} = \sum_{q=1}^{Q} \big(e_{qq} - a_q^2\big). \tag{1.49}
\]

The criterion is computed for all the divisions, and a division is chosen such that the modularity is maximized.
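As an illustration, the following minimal sketch computes the modularity (1.49) of a given division directly from an adjacency matrix; the graph and the community labels are a toy example.

```r
## A minimal sketch (toy data) computing the modularity (1.49) of a division
## from a symmetric adjacency matrix X and a vector of community labels.
modularity_score <- function(X, labels) {
  Q <- length(unique(labels))
  m <- sum(X) / 2                               # number of edges (undirected)
  e <- matrix(0, Q, Q)                          # e[q, l]: fraction of edges between q and l
  for (q in 1:Q) for (l in 1:Q) {
    e[q, l] <- sum(X[labels == q, labels == l]) / (2 * m)
  }
  a <- rowSums(e)                               # fraction of edge ends attached to community q
  sum(diag(e) - a^2)
}

## Two cliques of size 3 joined by a single edge: a strong community division
X <- matrix(0L, 6, 6)
X[1, 2] <- X[1, 3] <- X[2, 3] <- X[4, 5] <- X[4, 6] <- X[5, 6] <- X[3, 4] <- 1L
X <- X + t(X)
modularity_score(X, labels = c(1, 1, 1, 2, 2, 2))
```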

A limiting factor of these community detection algorithms is their poor scaling with the number m of edges and the number N of vertices in the network. For instance, calculating the shortest paths between a particular pair of vertices can be done in O(m) (Ahuja et al. 1993, Cormen et al. 2001). Because there are O(N²) vertex pairs, the computational cost to compute all the edge betweenness scores is in O(mN²). This complexity was improved independently by Newman (2001) and Brandes (2001), finding all betweennesses in O(mN). Since this calculation has to be repeated for the removal of each edge, the entire algorithm runs in worst-case time O(m²N). In other words, for dense networks, where m is in O(N²), it runs in O(N⁵) while it scales in O(N³) for sparse networks, where m is linear in N.

Algorithm 6: Example of a community structure detection algorithm with a betweenness score.

    repeat
        Calculate betweenness scores for all edges
        Remove the edge with the highest score
    until no edges remain



Figure 1.5 – Dendrogram of a network with 50 vertices for the community detection algorithm with edge betweenness. It should be read from top to bottom. The algorithm starts with a single community which contains all the vertices. Edges with the highest edge betweenness are then removed iteratively, splitting the network into several communities. After convergence, each vertex, represented by a leaf of the tree, is a sole member of one of the 50 communities.


Rather than building the complete dendrogram (with edge removals) and then choosing the optimal division using the modularity criterion, Newman (2004) suggested focusing directly on the optimization of the modularity. Thus, he proposed an algorithm which falls in the general category of agglomerative hierarchical clustering methods (Everitt 1974, Scott 2000). Starting with a configuration in which each vertex is the sole member of one of N communities, the communities are iteratively joined together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in mod (1.49). Again, this leads to a dendrogram for which the best cut is chosen by looking for the maximal value of the modularity. The computational cost of the entire algorithm is in O((m + N)N), or O(N³) for dense networks and O(N²) for sparse networks. It was shown to be capable of handling a collaboration network with 50000 vertices in Newman (2004).

All the algorithms described in this section are implemented in the R package “igraph”, which is available on the CRAN: http://cran.r-project.org/web/packages/igraph
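A possible usage sketch with the igraph package is given below; the function names correspond to recent igraph releases (older versions expose edge.betweenness.community and fastgreedy.community instead), and the random graph used here is only a placeholder.

```r
## A brief, assumed usage sketch of the community detection algorithms
## implemented in the igraph package.
library(igraph)

set.seed(42)
g <- sample_gnp(50, p = 0.08)              # toy random graph, placeholder data

cb <- cluster_edge_betweenness(g)          # divisive algorithm of Girvan and Newman
cf <- cluster_fast_greedy(g)               # greedy modularity optimization (Newman 2004)

membership(cb)                             # community of each vertex
modularity(cf)                             # modularity of the retained division
```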

Latent position cluster model

An alternative approach for community detection in networks is the Latent Position Cluster Model (LPCM) of Handcock et al. (2007). Consider a N × N binary adjacency matrix X such that Xij equals 1 if there is an edge from vertex i to vertex j, and 0 otherwise. Moreover, let us denote by Y some covariate information, where Yij describes some observed characteristics of the pair (i, j) of vertices. This might represent for instance the traffic information of users from blog i to blog j in a blogosphere network (see Figure 1.3). Several characteristics can possibly be observed for each pair of vertices and therefore Yij can be vector valued. Note that a few other random graph models have been proposed in the literature to take covariates into account (see for instance Zanghi et al. 2010, Mariadassou et al. 2010). They will not be considered in this thesis, where the goal is to cluster the vertices by using the network topology only. Here, we describe LPCM in a general setting, as in Handcock et al. (2007), and emphasize that the algorithm can also be used if Y is not available, simply by removing the terms in Yij in the following expressions.

LPCM assumes that the network does not contain any self loop, while both directed and undirected relations can be analyzed. It is assumed that each vertex, usually called actor in social sciences, has an unobserved position in a d dimensional Euclidean latent space, as in Hoff et al. (2002). Given the latent positions and the covariate information, the edges are assumed to be drawn from a Bernoulli distribution:

\[
X_{ij}\,|\,Z_i, Z_j, Y_{ij} \sim \mathcal{B}\big(g(a_{Z_i, Z_j, Y_{ij}})\big).
\]

The function g(x) = 1/(1 + e^{−x}) is the logistic sigmoid function. Moreover, a_{Zi, Zj, Yij} is given by:

\[
a_{Z_i, Z_j, Y_{ij}} = Y_{ij}^{\intercal}\beta_0 - \beta_1 |Z_i - Z_j|, \tag{1.50}
\]

where β0 has the same dimensionality as Yij and β1 is a scalar. Both β0 and β1 are unknown parameters to be estimated. To represent clustering, the positions are assumed to be drawn from a finite mixture of Q multivariate normal distributions (see Section 1.1.2), each one representing a different class of vertices. Each multivariate distribution has its own mean vector as well as a spherical covariance matrix:

\[
Z_i \sim \sum_{q=1}^{Q} \alpha_q\, \mathcal{N}(\mu_q, \sigma_q^2 I),
\]

where α denotes a vector of class proportions which satisfies αq > 0, ∀q, and ∑_{q=1}^{Q} αq = 1. Finally, according to LPCM, the latent positions Z1, . . . , ZN are iid and, given this latent structure, all the edges are supposed to be independent. Consider now the second term on the right hand side of (1.50). By construction, if β1 is positive, we expect the distance |Zi − Zj| to be smaller if vertices i and j are in the same class. In other words, the probability g(a_{Zi, Zj, Yij}) of an edge between i and j is supposed to be higher for vertices sharing the same class. Note that this corresponds exactly to the definition of a community.
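The following minimal sketch evaluates the LPCM edge probability (1.50) for hypothetical parameter values, taking the latent-space distance to be Euclidean; all the numbers are assumptions chosen for illustration only.

```r
## A minimal sketch (hypothetical values) of the LPCM edge probability: the
## log-odds of an edge decrease with the latent distance between i and j.
sigmoid <- function(x) 1 / (1 + exp(-x))

edge_prob <- function(z_i, z_j, y_ij, beta0, beta1) {
  a <- sum(y_ij * beta0) - beta1 * sqrt(sum((z_i - z_j)^2))
  sigmoid(a)
}

beta0 <- 0.5; beta1 <- 2                  # assumed regression parameters
y_ij  <- 1                                # single assumed covariate for the pair (i, j)
edge_prob(c(0, 0), c(0.2, 0.1), y_ij, beta0, beta1)  # nearby vertices: high probability
edge_prob(c(0, 0), c(3.0, 2.5), y_ij, beta0, beta1)  # distant vertices: low probability
```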

Handcock et al. (2007) proposed a two-stage maximum likelihood approach and a Bayesian algorithm, as well as a BIC criterion to estimate the number of latent classes. The two-stage maximum likelihood approach first maps the vertices in the latent space and then uses a mixture model to cluster the resulting positions. In practice, this procedure converges more quickly but loses some information by not estimating the positions and the cluster model at the same time. Conversely, the Bayesian algorithm (see Figure 1.6), based on Markov Chain Monte Carlo, estimates both the latent positions and the mixture model parameters simultaneously. It gives better results but is time consuming. Both the maximum likelihood and the Bayesian approach are limited in the sense that they can handle networks with a few hundred vertices only. They are implemented in the R package “latentnet” (Krivitsky and Handcock 2009), which is available on the CRAN: http://cran.r-project.org/web/packages/latentnet



Figure 1.6 – Directed network of social relations between 18 monks in an isolated American monastery (Sampson 1969, White et al. 1976). Sampson collected sociometric information using interviews, experiments, and observations. This network focuses on the relation of “liking”. A monk is said to have a social relation of “like” to another monk if he ranked that monk in the top three monks for positive affection in any of the three interviews given. The positions of the vertices in the two dimensional latent space have been calculated using the Bayesian approach for LPCM. The positions of the three class centers found are indicated, as well as circles with radius equal to the square root of the class variances estimated.


1.3.3 Heterogeneous structure

So far, we have seen some algorithms to uncover communities. We now present some other approaches which can look for heterogeneous structure in networks, where vertices can have different types of connection profiles.

Hofman and Wiggins

Let us consider a binary adjacency matrix X representing a network G. The model of Hofman and Wiggins (2008) associates to each vertex of the network a latent variable Zi drawn from a multinomial distribution:

\[
Z_i \sim \mathcal{M}\big(1, \alpha = (\alpha_1, \ldots, \alpha_Q)\big). \tag{1.51}
\]

As in other standard mixture models (see Section 1.1.2), the vector Zi has all its components set to zero except one, such that Ziq equals 1 if vertex i belongs to class q. The edges are then assumed to be drawn from a Bernoulli distribution:

\[
X_{ij} \sim \mathcal{B}(\lambda),
\]

if vertices i and j are in the same class, that is Zi = Zj, and

\[
X_{ij} \sim \mathcal{B}(\epsilon),
\]

otherwise. Thus, the model can take both community structure (λ > ε) (Figure 1.4) and disassortative mixing (λ < ε) (Figure 1.7) into account. As in the previous section, given the latent variables Z1, . . . , ZN, all the edges are supposed to be independent. In order to estimate the posterior distribution p(Z, α, λ, ε | X) over the latent variables and model parameters, Hofman and Wiggins (2008) used a variational Bayes EM algorithm (see Section 1.2.2) with a factorized distribution:

\[
q(Z, \alpha, \lambda, \epsilon) = q(\alpha)\,q(\lambda)\,q(\epsilon)\prod_{i=1}^{N} q(Z_i).
\]

Moreover, they proposed a model selection criterion to estimate the number of latent classes in networks. It relies on a variational approximation of the marginal log-likelihood log p(X) and has shown promising results. This criterion is investigated experimentally in Chapter 2.

The computational cost of the variational Bayes algorithm is O(N²Q), such that the algorithm can deal with networks with thousands of vertices. It is implemented in a Matlab package “VBMOD” available at: http://vbmod.sourceforge.net
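The sketch below simulates an undirected graph from this affiliation model with assumed values of α, λ and ε; it is not the VBMOD implementation, only an illustration of the generative process.

```r
## A minimal sketch (assumed parameter values) simulating an undirected graph
## from the affiliation model: probability lambda within a class and epsilon
## between classes.
set.seed(3)
N <- 50; Q <- 3
alpha  <- c(0.4, 0.3, 0.3)
lambda <- 0.25; epsilon <- 0.02            # lambda > epsilon: community structure

z <- sample(1:Q, N, replace = TRUE, prob = alpha)   # latent class of each vertex
X <- matrix(0L, N, N)
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    p <- if (z[i] == z[j]) lambda else epsilon
    X[i, j] <- X[j, i] <- rbinom(1, 1, p)
  }
}
mean(X[upper.tri(X)])                       # empirical density of the graph
```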


Figure 1.7 – Example of an undirected network with 20 vertices. The connection probabilities between the two classes in red and green are higher than the intra class probabilities. Vertices connect mainly to vertices of a different class.


Stochastic block models

Originally developed in social sciences, the Stochastic Block Model (SBM) is a probabilistic generalization (Fienberg and Wasserman 1981, Holland et al. 1983) of the method described in White et al. (1976). Given a network, it assumes that each vertex belongs to a hidden class among Q classes, and uses a matrix Π to describe the intra and inter connection probabilities (Frank and Harary 1982). No assumption is made on the form of the connectivity matrix, such that very different structures can be taken into account. In particular, SBM can characterize the presence of hubs which make networks locally dense (Daudin et al. 2008). Moreover, and to some extent, it generalizes many of the existing graph clustering techniques, as shown in Newman and Leicht (2007). For instance, the model of Hofman and Wiggins (2008) can be seen as a constrained SBM where the diagonal of Π is set to λ and all the other elements to ε.

Formally, SBM considers a latent variable Zi, drawn from a multinomial distribution (1.51), for each vertex in the network. Thus, each vertex belongs to a single class, and that class is q if Ziq equals 1. The edges are then assumed to be drawn from a Bernoulli distribution:

\[
X_{ij}\,|\,Z_{iq} Z_{jl} = 1 \sim \mathcal{B}(\pi_{ql}),
\]

where Π is a Q × Q matrix of connection probabilities. Again, given all the latent variables, the edges are supposed to be independent. Note that SBM was originally described in a more general setting (Nowicki and Snijders 2001), allowing any discrete relational data. However, as explained in Section 1.3.1, we concentrate in the following on binary edges only.
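As an illustration of the generative process, the following minimal sketch simulates an undirected graph without self loops from SBM with an assumed connectivity matrix Π, and then compares the empirical connection frequencies between classes with Π.

```r
## A minimal sketch (assumed parameters) simulating an undirected graph,
## without self loops, from SBM with an arbitrary connectivity matrix Pi.
set.seed(7)
N <- 60; Q <- 2
alpha <- c(0.6, 0.4)
Pi <- matrix(c(0.20, 0.03,
               0.03, 0.15), Q, Q)          # intra and inter class probabilities

z <- sample(1:Q, N, replace = TRUE, prob = alpha)
X <- matrix(0L, N, N)
for (i in 1:(N - 1)) {
  for (j in (i + 1):N) {
    X[i, j] <- X[j, i] <- rbinom(1, 1, Pi[z[i], z[j]])
  }
}

## Empirical connection frequencies between classes, to compare with Pi
freq <- matrix(NA, Q, Q)
for (q in 1:Q) for (l in 1:Q) {
  block <- X[z == q, z == l, drop = FALSE]
  if (q == l) diag(block) <- NA             # ignore self pairs within a class
  freq[q, l] <- mean(block, na.rm = TRUE)
}
round(freq, 2)
```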

SBM is related to the infinite block model of Kemp et al. (2004), although the number Q of classes is fixed. Moreover, contrary to the mixed membership stochastic block model of Airoldi et al. (2008), which captures partial membership and allows each vertex to have a distribution over a set of classes, SBM assumes that each vertex of a network belongs to a single class. The identifiability of the parameters in SBM was studied by Allman et al. (2009; 2010), who showed that the model is generically identifiable up to a permutation of the classes. In other words, except on a set of parameters of null Lebesgue measure, two parameters imply the same random graph model if and only if they differ only by the ordering of the classes. Many inference strategies as well as a model selection criterion have been proposed for SBM. They will be described in Chapter 2.

SBM is the starting point of this thesis. A new Bayesian inference procedure is proposed for this model in Chapter 2. It is then extended to allow overlapping clusters in Chapters 3 and 4.

1.4 Phase transition in stochastic block models

Contrary to all the previous sections of this chapter, where we described existing work, we now present some new properties that we found, which bring some insights into the SBM model. The goal is to show that, for a specific choice of connectivity matrix, the model can be seen as an instance of the inhomogeneous random graph model. We then use the results of Bollobás et al. (2005) to characterize the critical point of phase transition in SBM, where a giant component appears. This work was done in collaboration with C. Matias.

Let us start by considering a SBM model with Q classes and a kernel κ(x, y) on a finite metric space S = {1, . . . , Q}, where a kernel is defined here as a symmetric non-negative function. We denote α = (α1, . . . , αQ) the vector of class proportions and introduce a probability measure µ on S such that µ(q) = αq, ∀q ∈ S. Moreover, let us define a Q × Q matrix Π of connection probabilities which depend on the number N of vertices:

\[
\pi_{ql} = \frac{\kappa(q, l)}{N}, \quad \forall (q, l) \in S \times S.
\]

If an undirected graph without self loops is considered, the expected number of edges e(N, κ) is given by:

\[
\begin{aligned}
\mathbb{E}[e(N, \kappa)] &= \mathbb{E}\Big[\sum_{i<j} X_{ij}\Big] = \sum_{i<j} \mathbb{E}[X_{ij}] = \sum_{i<j} p(X_{ij} = 1\,|\,\alpha, \Pi)\\
&= \sum_{i<j} \sum_{Z_i} \sum_{Z_j} p(X_{ij} = 1\,|\,Z_i, Z_j, \Pi)\, p(Z_i\,|\,\alpha)\, p(Z_j\,|\,\alpha)\\
&= \sum_{i<j} \sum_{q,l} \alpha_q \alpha_l \pi_{ql} = \frac{N(N-1)}{2} \sum_{q,l} \alpha_q \alpha_l \frac{\kappa(q,l)}{N} = \frac{N-1}{2} \sum_{q,l} \alpha_q \alpha_l\, \kappa(q,l).
\end{aligned} \tag{1.52}
\]

Thus, E[e(N, κ)] is linear in N. Obviously, this is also the case if the network contains self loops and/or it is directed. Following Bollobás et al. (2005), the model considered is an inhomogeneous random graph model if the kernel κ is graphical, that is:

\[
\sum_{q=1}^{Q}\sum_{l=1}^{Q} \alpha_q \alpha_l\, |\kappa(q, l)| < \infty,
\]

and

\[
\lim_{N\to\infty} \frac{1}{N}\, \mathbb{E}[e(N, \kappa)] = \frac{1}{2}\sum_{q,l=1}^{Q} \alpha_q \alpha_l\, \kappa(q, l).
\]

The kernel is bounded and, using (1.52), these two properties are verified.

1.4.1 The phase transition

In order to study the phase transition, also called percolation transition, we use the operator introduced by Bollobás:

\[
(T_\kappa v)(q) = \sum_{l=1}^{Q} \kappa(q, l)\, v(l)\, \alpha_l, \quad \forall q. \tag{1.53}
\]



Note that both the input v and the output Tκ v of the operator are Q dimensional vectors. Consider now the norms given by:

\[
\| u \|_2 = \left(\sum_{q=1}^{Q} \alpha_q u_q^2\right)^{\frac{1}{2}},
\]

for any Q dimensional vector u, and

\[
\|T_\kappa\| = \sup\big\{\|T_\kappa v\|_2 : v_q \geq 0\ \forall q,\ \| v \|_2 \leq 1\big\} < \infty.
\]

The kernel κ is said to be subcritical if ||Tκ|| < 1, critical if ||Tκ|| = 1, and supercritical if ||Tκ|| > 1. If ||Tκ|| < 1, the size of the biggest connected component is C1(G) = O(log N), whereas C1(G) = O(N) if ||Tκ|| > 1. In our case, we aim at characterizing the connectivity matrix Π for which ||Tκ|| = 1. Using matrix notations, (1.53) can be written:

\[
T_\kappa v = K\, \mathrm{diag}(\alpha_1, \ldots, \alpha_Q)\, v,
\]

where K is a Q × Q matrix such that (K)ql = κ(q, l) and diag(α1, . . . , αQ) is a diagonal matrix with diagonal equal to the vector α. If the vector v is now decomposed on the basis {e1, . . . , eQ} of eigenvectors of K diag(α1, . . . , αQ), we obtain:

\[
v = \sum_{q=1}^{Q} v_q e_q,
\]

and

\[
T_\kappa v = K\, \mathrm{diag}(\alpha_1, \ldots, \alpha_Q) \sum_{q=1}^{Q} v_q e_q = \sum_{q=1}^{Q} v_q \lambda_q e_q,
\]

where {λ1, . . . , λQ} are the eigenvalues corresponding to {e1, . . . , eQ}. Thus

\[
\|T_\kappa v\|_2^2 = \sum_{q=1}^{Q} \alpha_q v_q^2 \lambda_q^2 \leq \Big(\max_q \lambda_q^2\Big) \sum_{q=1}^{Q} \alpha_q v_q^2 = \Big(\max_q \lambda_q^2\Big)\, \| v \|_2^2.
\]

Subject to the constraint || v ||2 ≤ 1, the maximum maxq |λq| of ||Tκ v ||2 is reached when || v ||2 = 1. Therefore, ||Tκ|| = maxq |λq|. In other words, given the matrices K and diag(α1, . . . , αQ), a giant component appears in the graph, that is C1 = O(N), if maxq |λq| ≥ 1.
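In practice, ||Tκ|| can thus be obtained numerically as the largest absolute eigenvalue of K diag(α1, . . . , αQ); the following minimal sketch does so for assumed values of K and α.

```r
## A minimal sketch (assumed values) computing ||T_kappa|| = max_q |lambda_q|
## from the kernel matrix K and the class proportions alpha.
alpha <- c(0.6, 0.4)
a <- 8; b <- 1                                  # assumed kernel values
K <- matrix(c(a, b,
              b, a), 2, 2)

lambda <- eigen(K %*% diag(alpha))$values
max(abs(lambda))                                # > 1: a giant component is expected
```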

1.4.2 Experiments

In this section, some experiments are carried out to verify the properties found previously. A SBM model is considered with a connectivity matrix Π such that πql = κ(q, l)/N and

\[
K = \begin{pmatrix}
a & b & \cdots & b\\
b & a & \ddots & \vdots\\
\vdots & \ddots & \ddots & b\\
b & \cdots & b & a
\end{pmatrix},
\]

to limit the number of free parameters.

We start by fixing the number of classes Q = 2, the corresponding proportions {α1 = 0.6, α2 = 0.4}, and the number of vertices N = 5000. We then generate some graphs by changing the values of a and b. The parameter b is set to a/8 and a varies from 0.5 to 50. For each pair (a, b), both the proportion N∗ of vertices in the biggest connected component and the highest eigenvalue maxq |λq| of K diag(α1, . . . , αQ) are computed. In order to obtain smoother results, each experiment is repeated 30 times and the resulting proportions N∗ are averaged. The results are presented in Figure 1.8.


Figure 1.8 – Several graphs with 5000 vertices are generated using SBM with various connectivity matrices. The x axis represents the values of maxq |λq| while the proportions N∗ of vertices in the biggest connected component are given on the y axis. The critical point of phase transition occurs when maxq |λq| ≥ 1.

We verify that the phase transition occurs when the highest eigenvalue maxq |λq| is equal to 1. Indeed, when maxq |λq| < 1, the proportion N∗ of vertices in the biggest connected component is close to zero, whereas it converges to 1 when maxq |λq| ≥ 1.
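A reduced version of this experiment can be sketched as follows (smaller N and assumed values of a and b, so the numbers differ from Figure 1.8); it simulates one graph from SBM with πql = κ(q, l)/N and compares the proportion of vertices in the largest connected component with maxq |λq|.

```r
## A minimal sketch (reduced size for speed, assumed values) of the experiment.
library(igraph)

set.seed(11)
N <- 2000; alpha <- c(0.6, 0.4)
a <- 4; b <- a / 8
K  <- matrix(c(a, b, b, a), 2, 2)
Pi <- K / N                                     # connection probabilities pi_ql = kappa(q, l) / N

z <- sample(1:2, N, replace = TRUE, prob = alpha)
P <- Pi[z, z]                                   # N x N matrix of edge probabilities
X <- matrix(rbinom(N * N, 1, P), N, N)
X[lower.tri(X, diag = TRUE)] <- 0
X <- X + t(X)                                   # undirected graph without self loops

g <- graph_from_adjacency_matrix(X, mode = "undirected")
N_star <- max(components(g)$csize) / N          # proportion in the biggest component
c(max_lambda = max(abs(eigen(K %*% diag(alpha))$values)), N_star = N_star)
```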


Conclusion

In this chapter, we reviewed mixture models as well as inference techniques such as the EM algorithm and the variational EM algorithm. We also focused on some model selection criteria to estimate the number of components from the data. Finally, we described some of the most widely used graph clustering algorithms, concentrating mainly on model based approaches.


2 Variational Bayesian inference and complexity control for stochastic block models

Contents

2.1 Introduction
2.2 A mixture model for graphs
    2.2.1 Model and notations
    2.2.2 A Bayesian Stochastic Block Model
2.3 Estimation
    2.3.1 Variational EM
    2.3.2 Variational Bayes EM
2.4 Model selection
2.5 Experiments
    2.5.1 Comparison of the criteria
    2.5.2 The metabolic network of Escherichia coli
Conclusion

The clustering of vertices as well as the estimation of the Stochastic Block Model (SBM) parameters have been subject to previous work, and numerous inference strategies such as variational Expectation Maximization (EM) and classification EM have been proposed. However, SBM still suffers from a lack of criteria to estimate the number of components in the mixture. To our knowledge, only one model based criterion, ICL, has been derived for SBM in the literature. It relies on an asymptotic approximation of the Integrated Complete-data Likelihood and recent studies have shown that it tends to be too conservative in the case of small networks. To tackle this issue, we propose a new criterion that we call ILvb, based on a non asymptotic approximation of the marginal likelihood. We describe how the criterion can be computed through a variational Bayes EM algorithm.


2.1 Introduction

Many methods have been proposed in the literature to jointly estimate SBM model parameters and cluster the vertices of a network. They all face the same difficulty. Indeed, contrary to many mixture models, the conditional distribution of all the latent variables Z and model parameters, given the observed data X, cannot be factorized due to conditional dependency (for more details, see Daudin et al. 2008). Therefore, optimization techniques such as the EM algorithm cannot be used directly. In the case of SBM, Nowicki and Snijders (2001) proposed a Bayesian probabilistic approach. They introduced some prior Dirichlet distributions for the model parameters and used Gibbs sampling to approximate the posterior distribution over the model parameters and the posterior predictive distribution. Their algorithm is implemented in the software BLOCKS, which is part of the package StoCNET (Boer et al. 2006). It gives accurate a posteriori estimates but cannot handle networks with more than 200 vertices. Daudin et al. (2008) proposed a frequentist variational EM approach for SBM which can handle much larger networks. Online strategies have also been developed (Zanghi et al. 2008).

While many inference strategies have been proposed for estimation and clustering purposes, SBM still suffers from a lack of criteria to estimate the number of classes in networks. Indeed, many criteria, such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) (Burnham and Anderson 2004), are based on the likelihood p(X | α, Π) of the observed data X, which is intractable here. To tackle this issue, Mariadassou et al. (2010) and Daudin et al. (2008) used a criterion, so-called ICL, based on an asymptotic approximation of the integrated complete-data likelihood. This criterion relies on the joint distribution p(X, Z | α, Π) rather than p(X | α, Π) and can be easily computed, even in the case of SBM. ICL was originally proposed by Biernacki et al. (2000) for model selection in Gaussian mixture models, and is known to be particularly suitable for cluster analysis since it favors well separated clusters. However, because it relies on an asymptotic approximation, Biernacki et al. (2010) showed, in the case of mixtures of multivariate multinomial distributions, that it may fail to detect interesting structures present in the data for small sample sizes. Mariadassou et al. (2010) obtained similar results when analyzing networks generated using SBM. They found that this asymptotic criterion tends to underestimate the number of classes when dealing with small networks. We emphasize that, to our knowledge, ICL is currently the only model based criterion developed for SBM.

Our main concern in this chapter is to propose a new criterion for SBM, based on the marginal likelihood p(X), also called integrated observed-data likelihood. The marginal likelihood is known to focus on the density estimation point of view and is expected to provide a consistent estimation of the distribution of the data. For a more detailed overview of the differences between the integrated complete-data likelihood and the integrated observed-data likelihood, we refer to Biernacki et al. (2010). In the case of SBM, the marginal likelihood is not tractable and we describe in this chapter how a non asymptotic approximation can be obtained through a variational Bayes EM algorithm.


In Section 2.2, we describe SBM and we introduce some non informative conjugate prior distributions for the model parameters. The variational Bayes EM algorithm is then presented in Section 2.3. We show in Section 2.4 how it naturally leads to a new model selection criterion that we call ILvb. Finally, in Section 2.5, we carry out some experiments using simulated data sets and the metabolic network of Escherichia coli, to assess ILvb. The R package “mixer” implementing this work is available from the following web site: http://cran.r-project.org/web/packages/mixer

2.2 A mixture model for graphs

The data we model consists of a N × N binary matrix X, with entries Xij describing the presence or absence of an edge from vertex i to vertex j. Both directed and undirected relations can be analyzed but, in the following, we focus on undirected relations. Therefore X is symmetric.

2.2.1 Model and notations

As mentioned in Section 1.3.3, the Stochastic Block Model (SBM) introduced by Nowicki and Snijders (2001) associates to each vertex of a network a latent variable Zi drawn from a multinomial distribution, such that Ziq = 1 if vertex i belongs to class q:

\[
Z_i \sim \mathcal{M}\big(1, \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_Q)\big).
\]

The vector α denotes the vector of class proportions. The edges are then drawn from a Bernoulli distribution:

\[
X_{ij}\,|\,\{Z_{iq} Z_{jl} = 1\} \sim \mathcal{B}(\pi_{ql}),
\]

where Π is a Q × Q matrix of connection probabilities. According to this model, the latent variables Z1, . . . , ZN are iid and, given this latent structure, all the edges are supposed to be independent.


Thus, when considering an undirected graph without self loops, this leads to:

\[
p(Z\,|\,\alpha) = \prod_{i=1}^{N} \mathcal{M}(Z_i; 1, \alpha) = \prod_{i=1}^{N} \prod_{q=1}^{Q} \alpha_q^{Z_{iq}},
\]

and

\[
\begin{aligned}
p(X\,|\,Z, \Pi) &= \prod_{i<j} p(X_{ij}\,|\,Z_i, Z_j, \Pi) = \prod_{i<j} \prod_{q,l} \mathcal{B}(X_{ij}\,|\,\pi_{ql})^{Z_{iq} Z_{jl}}\\
&= \prod_{i<j} \prod_{q,l} \Big(\pi_{ql}^{X_{ij}} (1 - \pi_{ql})^{1 - X_{ij}}\Big)^{Z_{iq} Z_{jl}}.
\end{aligned}
\]

In the case of a directed graph, the products over i < j must be replaced by products over i ≠ j. The edges Xii must also be taken into account if the graph contains self loops.
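As an illustration, the following minimal sketch evaluates the complete-data log-likelihood log p(X, Z | α, Π) implied by the two expressions above for an undirected graph without self loops; the inputs are assumed to be available, for instance from a simulation such as the one in Section 1.3.3.

```r
## A minimal sketch (assumed inputs) of the SBM complete-data log-likelihood
## for an undirected graph without self loops. Z is given as an integer
## vector z of class labels.
sbm_complete_loglik <- function(X, z, alpha, Pi) {
  N  <- nrow(X)
  ll <- sum(log(alpha[z]))                       # log p(Z | alpha)
  for (i in 1:(N - 1)) {
    for (j in (i + 1):N) {
      p  <- Pi[z[i], z[j]]
      ll <- ll + X[i, j] * log(p) + (1 - X[i, j]) * log(1 - p)
    }
  }
  ll
}

## Usage, assuming (X, z, alpha, Pi) from the SBM simulation sketch:
## sbm_complete_loglik(X, z, alpha, Pi)
```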

2.2.2 A Bayesian Stochastic Block Model

SBM can be described in a full Bayesian framework, where it can be considered as a generalisation of the affiliation model proposed by Hofman and Wiggins (2008). Indeed, the Bayesian model of Hofman and Wiggins (2008) considers a simple structure where vertices of the same class connect with probability λ and with probability ε otherwise (see Section 1.3.3). Therefore, it can be seen as a constrained SBM where the diagonal of Π is set to λ and all the other elements to ε.

To extend the SBM frequentist model, we first specify some non informative conjugate priors for the model parameters. Since p(Zi | α) is a multinomial distribution, we consider a Dirichlet distribution for the mixing coefficients:

\[
p\big(\alpha\,|\,n^0 = \{n_1^0, \ldots, n_Q^0\}\big) = \mathrm{Dir}(\alpha;\, n^0),
\]

where n0q = 1/2, ∀q. This Dirichlet distribution corresponds to a non-informative Jeffreys prior distribution, which is known to be proper (Jeffreys 1946). It is also possible to consider a uniform distribution on the Q − 1 dimensional simplex by fixing n0q = 1, ∀q. Since p(Xij | Zi, Zj, Π) is a Bernoulli distribution, we use independent Beta priors to model the connectivity matrix Π:

\[
p\big(\Pi\,|\,\eta^0 = (\eta_{ql}^0),\ \zeta^0 = (\zeta_{ql}^0)\big) = \prod_{q \leq l} \mathrm{Beta}(\pi_{ql};\, \eta_{ql}^0, \zeta_{ql}^0),
\]

with η0ql = ζ0ql = 1/2, ∀q, l. This corresponds to a product of non-informative Jeffreys prior distributions. Note that, if the graph is directed, the products over q ≤ l must be replaced by products over q, l, since Π is no longer symmetric.

Thus, the model parameters are now seen as random variables (see Figure 2.1) whose distributions depend on the hyperparameters n0, η0, and ζ0. In the following, since these hyperparameters are fixed and in order to keep the notations simple, they will not be shown explicitly in the conditional distributions.
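The following minimal sketch draws one set of model parameters from these Jeffreys priors, a Dirichlet with parameters 1/2 for α and a Beta(1/2, 1/2) for each πql; the small helper for the Dirichlet draw is an assumed utility, not part of any package used in this work.

```r
## A minimal sketch drawing SBM parameters from the Jeffreys priors above.
set.seed(5)
Q <- 3
rdirichlet1 <- function(n0) { g <- rgamma(length(n0), shape = n0); g / sum(g) }

alpha <- rdirichlet1(rep(1/2, Q))                       # alpha ~ Dir(1/2, ..., 1/2)
Pi <- matrix(0, Q, Q)
Pi[upper.tri(Pi, diag = TRUE)] <- rbeta(Q * (Q + 1) / 2, 1/2, 1/2)
Pi[lower.tri(Pi)] <- t(Pi)[lower.tri(Pi)]               # symmetric connectivity matrix
alpha; Pi
```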


Figure 2.1 – Directed acyclic graph representing the Bayesian view of SBM. Nodes represent random variables, which are shaded when they are observed, and edges represent conditional dependencies.

2.3 Estimation

In this section, we first describe the variational EM algorithm used by Daudin et al. (2008) to jointly estimate SBM model parameters and cluster the vertices of a network. We then propose a new variational Bayes EM algorithm for SBM which approximates the full posterior distribution of the model parameters and latent variables, given the observed data X. This procedure relies on a lower bound which will later be used, in Section 2.4, as a non-asymptotic approximation of the marginal log-likelihood log p(X).

2.3.1 Variational EM

The likelihood p(X | α, Π) of the observed data X can be obtained through the marginalization p(X | α, Π) = ∑_Z p(X, Z | α, Π). This summation involves Q^N terms and quickly becomes intractable. To tackle such a problem, the well-known EM algorithm (see Section 1.1.3) has been applied with success to a large variety of mixture models. As shown in Section 1.2.1, this two-stage estimation approach (Hathaway 1986, Neal and Hinton 1998) can be described in a variational inference framework. Unfortunately, EM relies on the distribution p(Z | X, α, Π), which is not tractable in the case of SBM, and therefore variational approximations are required.

Thus, given a distribution q(Z) over the latent variables, the log-likelihood of the observed data is decomposed into two terms:

$$\log p(X \mid \alpha, \Pi) = \mathcal{L}_{ML}(q; \alpha, \Pi) + \mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X, \alpha, \Pi)\big), \qquad (2.1)$$

where

$$\mathcal{L}_{ML}(q; \alpha, \Pi) = \sum_{Z} q(Z) \log\left\{ \frac{p(X, Z \mid \alpha, \Pi)}{q(Z)} \right\}, \qquad (2.2)$$

and

$$\mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X, \alpha, \Pi)\big) = -\sum_{Z} q(Z) \log\left\{ \frac{p(Z \mid X, \alpha, \Pi)}{q(Z)} \right\}. \qquad (2.3)$$

In (2.1) and (2.3), KL denotes the Kullback-Leibler divergence between the distribution q(Z) and the distribution p(Z | X, α, Π). Since the left-hand side of (2.1) does not depend on q(Z), minimizing (2.3) with respect to q(Z) is equivalent to maximizing the lower bound (2.2) of (2.1) with respect to q(Z). To obtain a tractable algorithm, Daudin et al. (2008) assumed that the distribution q(Z) can be factorized such that:

$$q(Z) = \prod_{i=1}^{N} q(Z_i) = \prod_{i=1}^{N} \mathcal{M}(Z_i; 1, \tau_i),$$

where τiq is a variational parameter denoting the probability of node i to belong to class q. This gives rise to a so-called variational EM procedure. During the variational E-step, the model parameters are fixed and, by maximizing (2.2) with respect to q(Z), the algorithm looks for an approximation of the conditional distribution of the latent variables. Conversely, during the variational M-step, the approximation q(Z) is fixed and the lower bound is maximized with respect to the model parameters. This procedure is repeated until convergence and was proposed by Daudin et al. (2008) for SBM.
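For the reader's convenience, and up to minor notational differences, the fixed-point updates reached by this variational EM algorithm, as derived by Daudin et al. (2008), take the following form; they are recalled here only as a reminder of the shape of the solution, not as a new result:

$$\hat{\tau}_{iq} \;\propto\; \alpha_q \prod_{j \neq i} \prod_{l=1}^{Q} \Big( \pi_{ql}^{X_{ij}} (1 - \pi_{ql})^{1 - X_{ij}} \Big)^{\tau_{jl}}, \qquad
\hat{\alpha}_q = \frac{1}{N} \sum_{i=1}^{N} \tau_{iq}, \qquad
\hat{\pi}_{ql} = \frac{\sum_{i \neq j} \tau_{iq} \tau_{jl} X_{ij}}{\sum_{i \neq j} \tau_{iq} \tau_{jl}},$$

where each vector τi is normalized so that ∑_q τiq = 1.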

2.3.2 Variational Bayes EM

In the context of mixture models, the conditional distribution p(Z | X, α, Π) can generally be computed, and therefore Bayesian inference strategies focus on estimating the posterior distribution p(α, Π | X). The distribution p(Z, α, Π | X) is then simply obtained as a byproduct. However, when considering SBM, the distribution p(Z | X, α, Π) is intractable and so we propose to approximate the full distribution p(Z, α, Π | X). We follow the work of Attias (1999), Corduneanu and Bishop (2001), Svensén and Bishop (2004) on Bayesian mixture modelling and Bayesian model selection. Thus, the marginal log-likelihood, also called integrated observed-data log-likelihood, can be decomposed into two terms:

$$\log p(X) = \mathcal{L}\big(q(\cdot)\big) + \mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X)\big), \qquad (2.4)$$

where

$$\mathcal{L}(q) = \sum_{Z} \int\!\!\int q(Z, \alpha, \Pi) \log\left\{ \frac{p(X, Z, \alpha, \Pi)}{q(Z, \alpha, \Pi)} \right\} \mathrm{d}\alpha \, \mathrm{d}\Pi, \qquad (2.5)$$

and

$$\mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X)\big) = -\sum_{Z} \int\!\!\int q(Z, \alpha, \Pi) \log\left\{ \frac{p(Z, \alpha, \Pi \mid X)}{q(Z, \alpha, \Pi)} \right\} \mathrm{d}\alpha \, \mathrm{d}\Pi. \qquad (2.6)$$

Again, as for the variational EM approach (Section 2.3.1), minimizing (2.6) with respect to q(Z, α, Π) is equivalent to maximizing the lower bound (2.5) of (2.4) with respect to q(Z, α, Π). However, we now have a full variational optimization problem since the model parameters are random variables and we are looking for an approximation q(Z, α, Π) of p(Z, α, Π | X). To obtain a tractable algorithm, we assume that the distribution q(Z, α, Π) can be factorized such that:

$$q(Z, \alpha, \Pi) = q(\alpha)\, q(\Pi)\, q(Z) = q(\alpha)\, q(\Pi) \prod_{i=1}^{N} q(Z_i).$$

In the following, we use a variational Bayes EM algorithm (see Section 1.2.2). We call variational Bayes E-step the optimization of each distribution q(Zi), and variational Bayes M-step the approximations of the remaining distributions q(α) and q(Π). All the optimization equations, the lower bound, as well as the proofs are given in the appendix.

We first initialize a matrix τ^old with a hierarchical algorithm based on the classical Ward distance. The distance between vertices which is considered is simply the (squared) Euclidean distance $d(i, j) = \sum_{k=1}^{N} (X_{ik} - X_{jk})^2$, which takes the number of discordances between i and j into account. Given a number of classes Q, each vertex is assigned (hard assignment) to its nearest group. Second, the algorithm uses (B.4) and (B.6) to estimate the variational distributions over the model parameters α as well as Π. Finally, the variational distribution over the latent variables is estimated using (B.1). The algorithm cycles through the E and M steps until the absolute distance between two successive values of the lower bound (B.8) is smaller than a threshold eps. In the experiment section, we set eps = 1e−6. In practice, smaller values slow the convergence of the algorithm and do not lead to better estimates.
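As an illustration of this initialization step only, the following sketch builds the hard assignment matrix τ^old via Ward hierarchical clustering; the scipy-based implementation and the function name init_tau are assumptions made here for convenience, and the subsequent updates (B.1), (B.4), (B.6) from the appendix are not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def init_tau(X, Q):
    """Hard initialization of tau: Ward hierarchical clustering on the pairwise
    distances d(i, j) = sum_k (X_ik - X_jk)^2 between rows of the adjacency
    matrix, followed by a hard assignment of each vertex to its cluster."""
    N = X.shape[0]
    d = pdist(X, metric="sqeuclidean")  # number of discordances between i and j
    labels = fcluster(linkage(d, method="ward"), t=Q, criterion="maxclust")
    tau = np.zeros((N, Q))
    tau[np.arange(N), labels - 1] = 1.0  # tau_iq = 1 if vertex i is assigned to class q
    return tau
```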

The computational costs of the frequentist approach of Daudin et al. (2008) and our variational Bayes algorithm are both equal to O(Q²N²). Analyzing a sparse network takes about a second for N = 200 nodes and about a minute for N = 1000.

2.4 Model selection

So far, we have seen that the variational Bayes EM algorithm leads to an approximation of the posterior distribution of all the model parameters and latent variables, given the observed data. However, the problem of estimating the number Q of classes in the mixture has not been addressed yet. Given a set of values of Q, we aim at selecting Q* which maximizes the marginal log-likelihood log p(X | Q), also called the integrated observed-data log-likelihood. The marginal likelihood is known to focus on the density estimation view and is expected to provide a consistent estimation of the distribution of the data (Biernacki et al. 2010). Unfortunately, this quantity is not tractable since, for each value of Q, it involves integrating over all the model parameters and latent variables:

$$\log p(X \mid Q) = \log\left\{ \sum_{Z} \int\!\!\int p(X, Z, \alpha, \Pi \mid Q) \, \mathrm{d}\alpha \, \mathrm{d}\Pi \right\}.$$

To tackle this issue, we propose to replace the marginal log-likelihood with its variational Bayes approximation. Thus, given a value of Q, the algorithm introduced in Section 2.3.2 is used to maximize the lower bound (2.5) with respect to q(·). We recall that this maximization implies a minimization of the KL divergence (2.6) between q(·) and the unknown posterior distribution. After convergence of the algorithm, according to (2.4), if the KL divergence is small, then the lower bound L(q(·)) approximates the marginal log-likelihood. Obviously, this assumption cannot be verified in practice since (2.6) cannot be computed analytically. Moreover, we emphasize that there is no solid reason to believe that the KL divergence is close to zero and does not depend on the model complexity. Nevertheless, in order to obtain a tractable model selection criterion, we rely on this approximation. After convergence of the algorithm, the lower bound takes a simple form and leads to a new criterion for SBM that we call ILvb:

$$\mathrm{ILvb} = \log\left\{ \frac{\Gamma\big(\sum_{q=1}^{Q} n^0_q\big) \prod_{q=1}^{Q} \Gamma(n_q)}{\Gamma\big(\sum_{q=1}^{Q} n_q\big) \prod_{q=1}^{Q} \Gamma(n^0_q)} \right\} + \sum_{q \leq l} \log\left\{ \frac{\Gamma(\eta^0_{ql} + \zeta^0_{ql})\, \Gamma(\eta_{ql})\, \Gamma(\zeta_{ql})}{\Gamma(\eta_{ql} + \zeta_{ql})\, \Gamma(\eta^0_{ql})\, \Gamma(\zeta^0_{ql})} \right\} - \sum_{i=1}^{N} \sum_{q=1}^{Q} \tau_{iq} \log \tau_{iq},$$

where τiq is the estimated probability of vertex i to belong to class q, and (n_q)_q, (η_ql)_ql, (ζ_ql)_ql are parameters given in the appendix. The gamma function is denoted by Γ(·). Contrary to the criterion proposed by Daudin et al. (2008), ILvb does not rely on an asymptotic approximation, sometimes called a BIC-like approximation. In practice, given a network, the variational Bayes EM algorithm is run for the different values of Q considered and Q* is chosen such that ILvb is maximized.
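As an illustration, the criterion above can be evaluated directly from the quantities returned by the variational Bayes EM algorithm; the sketch below is a transcription of the formula assuming numpy/scipy, with the pseudo-counts stored as full Q × Q matrices (only the upper triangle is used), a storage convention chosen here for convenience.

```python
import numpy as np
from scipy.special import gammaln

def ilvb(tau, n, eta, zeta, n0, eta0, zeta0):
    """Evaluate the ILvb criterion after convergence (undirected case, q <= l).

    tau              : N x Q matrix of posterior class probabilities
    n                : length-Q vector of Dirichlet pseudo-counts
    eta, zeta        : Q x Q matrices of Beta pseudo-counts (upper triangle used)
    n0, eta0, zeta0  : the corresponding prior hyperparameters
    """
    Q = len(n)
    # Dirichlet term.
    crit = (gammaln(np.sum(n0)) + np.sum(gammaln(n))
            - gammaln(np.sum(n)) - np.sum(gammaln(n0)))
    # Beta terms, one per pair of classes q <= l.
    iu = np.triu_indices(Q)
    crit += np.sum(gammaln(eta0[iu] + zeta0[iu]) + gammaln(eta[iu]) + gammaln(zeta[iu])
                   - gammaln(eta[iu] + zeta[iu]) - gammaln(eta0[iu]) - gammaln(zeta0[iu]))
    # Entropy of the variational distribution over the latent variables.
    crit -= np.sum(tau * np.log(np.clip(tau, 1e-12, 1.0)))
    return crit
```

In practice one would run the algorithm for each candidate value of Q and keep the value maximizing this quantity.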

2.5 Experiments

We present some results of the experiments we carried out to assess the criterion we proposed in Section 2.4. Throughout our experiments, we chose to compare our approach to the work of Daudin et al. (2008) and Hofman and Wiggins (2008). Indeed, contrary to many other model-based techniques, the corresponding algorithms can analyze networks with hundreds of nodes in a reasonable amount of time (a few minutes on a dual core). We recall that Daudin et al. (2008) proposed a frequentist maximum likelihood approach (see Section 2.3.1) for SBM as well as an ICL criterion. On the other hand, Hofman and Wiggins (2008) presented a model for community structure detection and a Bayesian criterion that we will denote VBMOD. Thus, by using both synthetic data and the metabolic network of the bacterium Escherichia coli, our aim is twofold. First, we illustrate the overall capacity of SBM to retrieve interesting structures in a large variety of networks. Second, we concentrate on comparing the two criteria ICL and ILvb developed for SBM.

2.5.1 Comparison of the criteria

In these experiments, we consider two types of networks. In Section 2.5.1, we generate affiliation networks, made of community structures, using the generative model of Hofman and Wiggins (2008). Therefore, vertices of the same class connect with probability λ and with probability ε otherwise. This corresponds to a constrained SBM where the diagonal of the connectivity matrix is set to λ and all the other elements to ε:

$$\Pi = \begin{pmatrix} \lambda & \epsilon & \cdots & \epsilon \\ \epsilon & \lambda & & \vdots \\ \vdots & & \ddots & \epsilon \\ \epsilon & \cdots & \epsilon & \lambda \end{pmatrix}.$$

In Section 2.5.1, we then draw networks with more complex topologies, made of both community structures and a class of hubs. The corresponding model is given by the connectivity matrix:

$$\Pi = \begin{pmatrix} \lambda & \epsilon & \cdots & \epsilon & \lambda \\ \epsilon & \lambda & & \vdots & \vdots \\ \vdots & & \ddots & \epsilon & \vdots \\ \epsilon & \cdots & \epsilon & \lambda & \lambda \\ \lambda & \cdots & \cdots & \lambda & \lambda \end{pmatrix},$$

where hubs connect with probability λ to any vertex in the network.

Following Mariadassou et al. (2010), who showed that ICL tends to underestimate the number of classes in the case of small graphs, we consider networks with only N = 50 vertices to analyze the robustness of our criterion. We set (λ = 0.9, ε = 0.1) and, for each value of QTrue in the set {3, . . . , 7}, we generate 100 networks with classes mixed in the same proportions α1 = · · · = αQTrue = 1/QTrue.
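For illustration, the two connectivity matrices above can be constructed programmatically; the sketch below is a possible construction, in which treating the last class as the hub class is an assumption matching the pattern displayed above.

```python
import numpy as np

def affiliation_matrix(Q, lam=0.9, eps=0.1):
    # Community structure: lambda on the diagonal, epsilon elsewhere.
    return eps * np.ones((Q, Q)) + (lam - eps) * np.eye(Q)

def affiliation_with_hubs(Q, lam=0.9, eps=0.1):
    # Q - 1 communities plus one class of hubs (last row and column set to lambda).
    Pi = affiliation_matrix(Q, lam, eps)
    Pi[-1, :] = lam
    Pi[:, -1] = lam
    return Pi
```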

In order to estimate the number of classes in the latent structures, we applied the methods of Hofman and Wiggins (2008), Daudin et al. (2008), and our algorithm (Section 2.3.2) on each network, for various numbers of classes Q ∈ {1, . . . , 7}. Note that we choose n^0_q = 1/2, ∀q ∈ {1, . . . , Q} for the Dirichlet prior and η^0_{ql} = ζ^0_{ql} = 1/2, ∀(q, l) ∈ {1, . . . , Q}² for the Beta priors. We recall that such distributions correspond to non-informative prior distributions. Like any optimization technique, the clustering methods we consider depend on the initialization. Thus, for each simulated network and each number of classes Q, we use five different initializations of τ. Finally, we select the best learnt models, for which the corresponding criteria VBMOD, ICL, or ILvb were maximized.

Before comparing ICL and ILvb, it is crucial to recall that these two criteria were not conceived for the same purpose. ICL approximates the integrated complete-data likelihood and is known to focus on the cluster analysis view, since it favors well-separated clusters. It realizes a compromise between the estimation of the data density and the evidence of data partitioning. Conversely, ILvb approximates the marginal likelihood, which is known to focus on density estimation only. In the following experiments, since networks are generated using SBM, and because we evaluate the criteria through their capacity to retrieve the true number of classes, ILvb is expected to lead to better results. However, in other situations (which are not considered in this chapter), where the focus would be on the clustering of vertices, ICL might be of interest.


Affiliation networks

In Table 2.1, we observe that VBMOD outperforms both ICL and ILvb. For instance, when QTrue = 5, VBMOD correctly estimates the number of classes of the 100 generated networks, while ICL and ILvb have respectively a percentage of accuracy of 77 and 99. These differences increase when QTrue = 6 and QTrue = 7. Indeed, the higher QTrue is, the fewer vertices the classes contain, and therefore the more difficult it is to retrieve and distinguish the community structures. Thus, when QTrue = 7, each class only contains on average N/QTrue ≈ 7.1 vertices. VBMOD appears to be a very stable criterion for community structure detection. It has a percentage of accuracy of 84, while ICL never estimates the true number of classes.

All the affiliation networks were generated using the model of Hofman and Wiggins (2008), which explains the results of VBMOD presented above. Indeed, the corresponding model for community structure detection only estimates the parameters λ and ε, whereas the frequentist and Bayesian approaches for SBM look for a full Q × Q matrix Π of connection probabilities. They are capable of handling networks with complex topologies, as shown in the following section, but they might miss some structures if the number of vertices is too limited.

We observe that ILvb leads to better estimates of the true number of classes in networks than ICL. Thus, when QTrue = 5 and QTrue = 6, ILvb correctly estimates the number of classes of 99 and 73 networks, while ICL has respectively a percentage of accuracy of 77 and 12.


(a) QTrue \ QVBMOD

            2      3      4      5      6      7
   3        0    100      0      0      0      0
   4        0      0    100      0      0      0
   5        0      0      0    100      0      0
   6        0      0      0      0     97      3
   7        0      0      0      2     14     84

(b) QTrue \ QICL

            2      3      4      5      6      7
   3        0    100      0      0      0      0
   4        0      0    100      0      0      0
   5        0      0     23     77      0      0
   6        0      1     28     59     12      0
   7        0      8     49     42      1      0

(c) QTrue \ QILvb

            2      3      4      5      6      7
   3        0    100      0      0      0      0
   4        0      0    100      0      0      0
   5        0      0      0     99      1      0
   6        0      0      4     23     73      0
   7        0      2     14     44     27     13

Table 2.1 – Confusion matrices for VBMOD, ICL and ILvb. λ = 0.9, ε = 0.1 and QTrue ∈ {3, . . . , 7}. Affiliation networks.

Networks with community structures and hubs

Table 2.2 displays the results of the experiments on networks exhibiting community structures and hubs. The presence of hubs is a central property of so-called real networks (Albert and Barabási 2002).

This slightly more complex and more realistic situation heavily perturbs the estimation of VBMOD. Most of the time, VBMOD fails to detect the class of hubs and therefore underestimates the number of classes. For example, when QTrue = 3 or QTrue = 4, VBMOD always misses a class. When the number of true classes grows over four, VBMOD's behaviour becomes more variable but keeps the same strong tendency to underestimate.

In this context, ICL and ILvb behave more consistently than VBMOD. When QTrue is less than or equal to four, both strategies are comparable. But when the number of true classes increases, the performance of ICL dramatically deteriorates, whereas ILvb remains more stable.

In the context of small graphs, when the focus is on the estimation of the data density, ILvb clearly provides a more reliable estimation of the number of classes than ICL. It also shows better performances than VBMOD when networks are made of classes with more complex topologies than communities.


(a) QTrue \ QVBMOD

            2      3      4      5      6      7
   3       95      0      3      0      0      2
   4        1     95      4      0      0      0
   5        0      0     94      6      0      0
   6        0      0      1     83     16      0
   7        0      0      2     15     78      5

(b) QTrue \ QICL

            2      3      4      5      6      7
   3        0    100      0      0      0      0
   4        0      0    100      0      0      0
   5        0      0     12     88      0      0
   6        0      0     19     59     22      0
   7        0      3     29     56     12      0

(c) QTrue \ QILvb

            2      3      4      5      6      7
   3        0    100      0      0      0      0
   4        0      0    100      0      0      0
   5        0      0      2     98      0      0
   6        0      0      1     29     70      0
   7        0      0      3     34     45     18

Table 2.2 – Confusion matrices for VBMOD, ICL and ILvb. λ = 0.9, ε = 0.1 and QTrue ∈ {3, . . . , 7}. Affiliation networks and a class of hubs.

2.5.2 The metabolic network of Escherichia coli

We apply the methodology described in this chapter to the metabolic network of the bacterium Escherichia coli (Lacroix et al. 2006), which was analyzed by Daudin et al. (2008) using SBM. In this network, there are 605 vertices, which represent chemical reactions, and a total number of 1782 edges. Two reactions are connected if a compound produced by the first one is a part of the second one (or vice-versa). As in the previous section, we consider non-informative priors: we fixed n^0_q = 1/2, ∀q ∈ {1, . . . , Q} for the Dirichlet prior and η^0_{ql} = ζ^0_{ql} = 1/2, ∀(q, l) ∈ {1, . . . , Q}² for the Beta priors.

Thus, for Q ∈ {1, . . . , 40}, we apply the methods of Hofman and Wiggins (2008) as well as our approach on this network. We compute the corresponding criteria and we repeat such a procedure 60 times, for different initializations of τ. Indeed, to speed up the initialization, we first run a k-means algorithm with 40 classes and random initial centers. We then use the corresponding partitions as inputs of the hierarchical algorithm described in Section 2.3.2. The results for ILvb are presented as boxplots in Figure 2.2. The criterion finds its maximum for QILvb = 22 classes, while Daudin et al. (2008) found QICL = 21. Thus, for this particular large data set, both ILvb and ICL lead to almost the same estimates of the number of latent classes.

We also compared the learnt partitions in the Bayesian and in the frequentist approach. Figure 2.3 is a dot plot representation of the metabolic network after having applied the Bayesian algorithm for QVB = 22.


Figure 2.2 – Boxplot representation (over 60 experiments) of ILvb for Q ∈ {1, . . . , 40}. The maximum is reached at QILvb = 22.


Figure 2.3 – Dot plot representation of the metabolic network after classification of the vertices into QVB = 22 classes. The x-axis and y-axis correspond to the list of vertices in the network, from 1 to 605. Edges between pairs of vertices are represented by shaded dots.


Each vertex i is classified into the class for which τiq is maximal (Maximum A Posteriori estimate). We observed very similar patterns in the frequentist approach. Among the classes, eight of them are cliques (πqq = 1) and six have a within-class connection probability greater than 0.5. As shown by Daudin et al. (2008), these cliques or pseudo-cliques gather reactions involving a same compound. Thus, chorismate, pyruvate, L-aspartate, L-glutamate, D-glyceraldehyde-3-phosphate and ATP are all responsible for cliques. Moreover, as observed in Daudin et al. (2008), since the connection probability between classes 1 and 17 is 1, they correspond to a single clique which is associated to pyruvate. However, that clique is split into two sub-cliques because of their different connectivities with reactions of classes 7 and 10. The approach of Hofman and Wiggins (2008) cannot retrieve such complex topologies, as shown in Section 2.5.1, and many classes such as classes 1 and 17 were merged. We found QVBMOD = 14.

Conclusion

In this chapter, we showed how the Stochastic Block Model (SBM) could be described in a full Bayesian framework. We introduced some non-informative conjugate priors over the model parameters and we described a variational Bayes EM algorithm which approximates the posterior distribution of all the latent variables and model parameters, given the observed data. Using this framework, we derived a non-asymptotic model selection criterion, called ILvb, which approximates the marginal likelihood. By considering networks generated using SBM, we showed that ILvb focuses on the estimation of the data density and provides a relevant estimation of the number of latent classes. We also illustrated the capacity of SBM to retrieve interesting structures in a large variety of networks.


3 Overlapping stochastic block models

Contents
3.1 Introduction
3.2 The stochastic block model
3.3 The overlapping stochastic block model
    3.3.1 Modeling sparsity
    3.3.2 Modeling outliers
3.4 Identifiability
    3.4.1 Correspondence with (non overlapping) stochastic block models
    3.4.2 Permutations and inversions
    3.4.3 Identifiability
3.5 Statistical inference
    3.5.1 The q-transformation
    3.5.2 The ξ-transformation
3.6 Experiments
    3.6.1 Simulations
    3.6.2 French political blogosphere
    3.6.3 Saccharomyces cerevisiae transcription network
Conclusion

Given a network, almost all graph clustering algorithms partition the vertices into disjoint clusters, according to their connection profile. However, recent studies have shown that these techniques were too restrictive and that most of the existing networks contained overlapping clusters. To tackle this issue, we present in this chapter the Overlapping Stochastic Block Model. Our approach allows the vertices to belong to multiple clusters and, to some extent, generalizes the well known Stochastic Block Model (Nowicki and Snijders 2001). We show that the model is generically identifiable within classes of equivalence and we propose an approximate inference procedure, based on global and local variational techniques. Using toy data sets as well as the French Political Blogosphere network and the transcriptional network of Saccharomyces cerevisiae, we compare our work with other approaches.


3.1 Introduction

A drawback of existing graph clustering techniques is that they all partition the vertices into disjoint clusters, while lots of objects in real world applications typically belong to multiple groups or communities. For instance, many proteins, so-called moonlighting proteins, are known to have several functions in the cells (Jeffery 1999), and actors might belong to several groups of interest (Palla et al. 2005). Thus, a graph clustering method should be able to uncover overlapping clusters. This issue has received growing attention in the last few years, starting with an algorithmic approach based on small complete sub-graphs developed by Palla et al. (2005) and implemented in the software CFinder (Palla et al. 2006). They defined a k-clique community as a union of all k-cliques (complete sub-graphs of size k) that can be reached from each other through a series of adjacent k-cliques, two k-cliques being adjacent if they share k − 1 vertices. Given a network, their algorithm first locates all cliques and then identifies the communities using a clique-clique overlap matrix (Everett and Borgatti 1998). By construction, the resulting communities can overlap. In order to select the optimal value of k, the authors suggested a global criterion which looks for a community structure as highly connected as possible. Small values of k lead to a giant community which smears the details of a network by merging small communities. Conversely, when k increases, the communities tend to become smaller, more disintegrated, but also more cohesive. Therefore, they proposed a heuristic which consists in running their algorithm for various values of k and then selecting the lowest value such that no giant community appears.

More recent work (Airoldi et al. 2008) proposed the Mixed Membership Stochastic Block model (MMSB), which has been used with success to analyze networks in many applications (Airoldi et al. 2007; 2006). They used variational techniques to estimate the model parameters and proposed a criterion to select the number of classes. As detailed in Heller et al. (2008), mixed membership models, such as Latent Dirichlet Allocation (Blei et al. 2003), are flexible models which can capture partial membership (Griffiths and Ghahramani 2005, Heller and Ghahramani 2007), in the form of attribute-specific mixtures. In MMSB, a mixing weight vector πi is drawn from a Dirichlet distribution for each vertex in the network, πiq being the probability of vertex i to belong to class q. The edge probability from vertex i to vertex j is then given by $p_{ij} = Z_{i \to j}^{\intercal} B\, Z_{i \leftarrow j}$, where B is a Q × Q matrix of connection probabilities similar to the Π matrix in SBM. The vector Zi→j is sampled from a multinomial distribution M(1, πi) and describes the class membership of vertex i in its relation towards vertex j. By symmetry, the vector Zi←j is drawn from a multinomial distribution M(1, πj) and represents the class membership of vertex j in its relation towards vertex i. Thus, depending on its relations with other vertices, each vertex can belong to different classes and therefore MMSB can be viewed as allowing overlapping clusters. However, the limit of MMSB is that it does not produce edges which are themselves influenced by the fact that some vertices belong to multiple clusters. Indeed, for every pair (i, j) of vertices, only a single draw of Zi→j and Zi←j determines the probability pij of an edge; all the other class memberships of vertices i and j towards other vertices in the network do not play a part. In this chapter, we present a complementary approach which tackles this issue.

In Fu and Banerjee (2008), Fu and Banerjee model overlapping clusters on Q components by characterizing each individual i by a latent {0, 1} vector Zi of length Q drawn from independent Bernoulli distributions. The i-th row of the data matrix then only depends on Zi. In the underlying clustering structure, i belongs to the components corresponding to a 1 in Zi. Nevertheless, the proposed model needs Q parameters for each individual and supposes independence between rows and columns of the data matrix, which is not the case when looking for network structures.

In this chapter, we propose a new model for generating networks, depending on (Q + 1)² + Q parameters, where Q is the number of components in the mixture. A latent {0, 1}-vector of length Q is assigned to each vertex, drawn from products of Bernoulli distributions whose parameters are not vertex-dependent. Each vertex may then belong to several components, allowing overlapping clusters, and each edge probability depends only on the components of its endpoints.

In Section 3.2, we recall the two constraints that the stochastic block model satisfies. In Section 3.3, we present the overlapping stochastic block model and we show in Section 3.4 that the model is identifiable within classes of equivalence. In Section 3.5, we propose an EM-like algorithm to infer the parameters of the model. Finally, in Section 3.6, we compare our work with other approaches using simulated data and two real networks. We show the efficiency of our model to detect overlapping clusters in networks.

3.2 The stochastic block model

In this chapter, we consider a directed binary random graph G represented by an N × N binary adjacency matrix X. Each entry Xij describes the presence or absence of an edge from vertex i to vertex j. We assume that G does not have any self loop, and therefore the variables Xii will not be taken into account. The Stochastic Block Model (SBM) associates to each vertex of a network a latent variable Zi drawn from a multinomial distribution:

$$Z_i \sim \mathcal{M}\big(1, \alpha = (\alpha_1, \alpha_2, \ldots, \alpha_Q)\big),$$

where α denotes the vector of class proportions. As in other standard mixture models, the vector Zi has all its components set to zero except one, such that Ziq = 1 if vertex i belongs to class q. The model then verifies:

$$\sum_{q=1}^{Q} Z_{iq} = 1, \quad \forall i \in \{1, \ldots, N\}, \qquad (3.1)$$

and

$$\sum_{q=1}^{Q} \alpha_q = 1. \qquad (3.2)$$


3.3 The overlapping stochastic block model

In order to allow each vertex to belong to multiple classes, we relax the constraints (3.1) and (3.2). Thus, for each vertex i of the network, we introduce a latent vector Zi of Q independent Boolean variables Ziq ∈ {0, 1}, drawn from a multivariate Bernoulli distribution:

$$Z_i \sim \prod_{q=1}^{Q} \mathcal{B}(Z_{iq}; \alpha_q) = \prod_{q=1}^{Q} \alpha_q^{Z_{iq}} (1 - \alpha_q)^{1 - Z_{iq}}. \qquad (3.3)$$

We point out that Zi can also have all its components set to zero, which is a useful feature in practice, as described in Sections 3.3.2 and 3.6. The edge probabilities are then given by:

$$X_{ij} \mid Z_i, Z_j \sim \mathcal{B}\big(X_{ij}; g(a_{Z_i, Z_j})\big) = e^{X_{ij}\, a_{Z_i, Z_j}}\, g(-a_{Z_i, Z_j}),$$

where

$$a_{Z_i, Z_j} = Z_i^{\intercal} W Z_j + Z_i^{\intercal} U + V^{\intercal} Z_j + W^{*}, \qquad (3.4)$$

and g(x) = (1 + e^{−x})^{−1} is the logistic sigmoid function. W is a Q × Q real matrix, whereas U and V are Q-dimensional real vectors. The first term in the right-hand side of (3.4) describes the interactions between the vertices i and j. If i belongs only to class q and j only to class l, then only one interaction term remains (Z_i^⊺ W Z_j = W_{ql}). However, as illustrated in Table 3.1, the model can take more complex interactions into account if one or both of these two vertices belong to multiple classes (Figure 3.1). Note that the second term in (3.4) does not depend on Zj. It models the overall capacity of vertex i to connect to other vertices. By symmetry, the third term represents the global tendency of vertex j to receive an edge. These two parameters U and V are related to the sender/receiver effects δi and γj in the Latent Cluster Random Effects Model (LCREM) of Krivitsky et al. (2009). However, contrary to LCREM, δi = Z_i^⊺ U and γj = V^⊺ Zj depend on the classes. In other words, two different vertices sharing the same classes will have exactly the same sender/receiver effects, which is not the case in LCREM. Finally, we use the scalar W* as a bias, to model sparsity.

Zi \ Zj   (0, 0)            (1, 0)                          (0, 1)                          (1, 1)
(0, 0)    W*                V1 + W*                         V2 + W*                         V1 + V2 + W*
(1, 0)    U1 + W*           W11 + U1 + V1 + W*              W12 + U1 + V2 + W*              W11 + W12 + U1 + V1 + V2 + W*
(0, 1)    U2 + W*           W21 + U2 + V1 + W*              W22 + U2 + V2 + W*              W21 + W22 + U2 + V1 + V2 + W*
(1, 1)    U1 + U2 + W*      W11 + W21 + U1 + U2 + V1 + W*   W12 + W22 + U1 + U2 + V2 + W*   W11 + W12 + W21 + W22 + U1 + U2 + V1 + V2 + W*

Table 3.1 – The values of aZi,Zj as functions of Zi (rows) and Zj (columns) for an overlapping stochastic block model with Q = 2.

If we associate to each latent variable Zi a vector $\tilde{Z}_i = (Z_i, 1)^{\intercal}$, then (3.4) can be written:

$$a_{Z_i, Z_j} = \tilde{Z}_i^{\intercal}\, \tilde{W}\, \tilde{Z}_j, \qquad (3.5)$$

where

$$\tilde{W} = \begin{pmatrix} W & U \\ V^{\intercal} & W^{*} \end{pmatrix}.$$

The components $\tilde{Z}_{i(Q+1)}$ can be seen as random variables drawn from a Bernoulli distribution with probability αQ+1 = 1. Thus, one way to think about the model is to consider that all the vertices in the graph belong to a (Q + 1)-th cluster which is overlapped by all the other clusters. In the following, we will use (3.5) to simplify the notations.

Figure 3.1 – Example of a directed graph with three overlapping clusters.

Finally, given the latent structure Z = {Z1, . . . , ZN}, all the edges are supposed to be independent. Thus, when considering directed graphs without self-loops, the Overlapping Stochastic Block Model (OSBM) is defined through the following distributions:

$$p(Z \mid \alpha) = \prod_{i=1}^{N} \prod_{q=1}^{Q} \alpha_q^{Z_{iq}} (1 - \alpha_q)^{1 - Z_{iq}}, \qquad (3.6)$$

and

$$p(X \mid Z, W) = \prod_{i \neq j}^{N} e^{X_{ij}\, a_{Z_i, Z_j}}\, g(-a_{Z_i, Z_j}).$$

The graphical model of OSBM is given in Figure 3.2.
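As a complement to this formal definition, the following sketch draws a directed graph from the OSBM generative model; it is an illustrative implementation written for this chapter's notation (the function names and the use of numpy are assumptions, not code from the thesis).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_osbm(N, alpha, W, U, V, W_star, rng=None):
    """Sample a directed OSBM graph without self loops.

    alpha  : length-Q vector, alpha_q = P(Z_iq = 1)
    W      : Q x Q real matrix of interaction terms
    U, V   : length-Q real vectors (sender / receiver effects)
    W_star : scalar bias controlling sparsity
    """
    rng = np.random.default_rng(rng)
    Q = len(alpha)
    # Each vertex gets a {0,1}-vector drawn from independent Bernoulli draws,
    # so it may belong to several classes, or to none (outlier).
    Z = (rng.random((N, Q)) < alpha).astype(float)
    # a_ij = Zi' W Zj + Zi' U + V' Zj + W*, as in (3.4).
    A = Z @ W @ Z.T + (Z @ U)[:, None] + (Z @ V)[None, :] + W_star
    X = (rng.random((N, N)) < sigmoid(A)).astype(int)
    np.fill_diagonal(X, 0)  # no self loops
    return Z, X
```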

3.3.1 Modeling sparsity

As explained in Airoldi et al. (2008), real networks are often sparse, in the sense that the corresponding adjacency matrices contain mainly zeros, and it is crucial to distinguish the two sources of non-interaction. Sparsity might be the result of the rarity of interactions in general, but it might also indicate that some class (intra or inter) connection probabilities are close to zero. For instance, social networks (see Section 3.6.2) are often made of communities where vertices are mostly connected to vertices of the same community.


Figure 3.2 – Directed acyclic graph representing the frequentist view of the overlapping stochastic block model. Nodes represent random variables, which are shaded when they are observed, and edges represent conditional dependencies.

This corresponds to classes with high intra connection probabilities and low inter connection probabilities. In (3.4), we can notice that W* appears in aZi,Zj for every pair of vertices. Therefore, W* is a convenient parameter to model the two sources of sparsity. Indeed, low values of W* result from the rarity of interactions in general, whereas high values signify that sparsity comes from the classes (parameters in W, U and V).

3.3.2 Modeling outliers

When applied to real networks, graph clustering methods often lead to giant classes of vertices having low output and input degrees (Daudin et al. 2008, Latouche et al. 2009). These classes are usually discarded and the analysis of networks focuses on more highly structured classes to extract useful information. The product of Bernoulli distributions (3.6) provides a natural way to encode these "outliers". Indeed, rather than using giant classes, OSBM uses the null component such that Zi = 0 if vertex i is an outlier and should not be classified in any class.

3.4 Identifiability

Before looking for an optimization procedure to estimate the model parameters, given a sample of observations (a network), it is crucial to verify whether OSBM is identifiable. A theorem of Allman et al. (2009) lies at the core of the results presented in this section.

If we denote by F(Θ) = {P_θ, θ ∈ Θ} a family of models we are interested in, the classical definition of identifiability requires that, for any two different values θ ≠ θ′, the corresponding probability distributions P_θ and P_θ′ are different.

3.4.1 Correspondence with (non overlapping) stochastic block models

Let Θ_OSBM be the parameter space of the family of OSBMs with Q classes:

$$\Theta_{OSBM} = \big\{ (\alpha, W) \in [0, 1]^Q \times \mathbb{R}^{(Q+1)^2} \big\}.$$

Each θ in Θ_OSBM corresponds to a random graph model which is defined by the distribution p(X | α, W). The aim of this section is to characterize whether there exists any relation between two different parameters θ and θ′ in Θ_OSBM, leading to the same random graph model.

We consider the (non overlapping) Stochastic Block Model (SBM) introduced by Nowicki and Snijders (2001). The model is defined by a set of classes C, a vector of class proportions γ = {γ_C}_{C ∈ C} verifying ∑_{C ∈ C} γ_C = 1, and a matrix of connection probabilities Π = (Π_{C,D})_{(C,D) ∈ C²}. Note that there are an infinite number of ways to represent and encode the classes. For simplicity, a common choice is to set C = {1, . . . , Q} and possibly $\mathcal{C} = \{ C \in \{0, 1\}^Q, \sum_{q=1}^{Q} C_q = 1 \}$, for a model with Q classes. The random graphs are drawn as follows. First, the class of each vertex is sampled from a multinomial distribution with parameters (1, γ). Thus, each vertex i belongs only to one class, and that class is C with probability γ_C. Second, the edges are drawn independently from each other from Bernoulli distributions, the probability of an edge (i, j) being Π_{C,D} if i belongs to class C and j to class D.

Let Θ_SBM be the parameter space of the family of SBMs with 2^Q classes:

$$\Theta_{SBM} = \Big\{ (\gamma, \Pi) \in [0, 1]^{2^Q} \times [0, 1]^{2^{2Q}}, \; \sum_{C \in \mathcal{C}} \gamma_C = 1 \Big\}.$$

Considering that each possible value of the vectors Zi in an OSBM with Q classes encodes a class in a SBM with 2^Q classes (i.e. C = {0, 1}^Q) yields a natural function:

$$\phi : \Theta_{OSBM} \to \Theta_{SBM}, \qquad (\alpha, W) \mapsto (\gamma, \Pi),$$

where

$$\gamma_C = \prod_{q=1}^{Q} \alpha_q^{C_q} (1 - \alpha_q)^{1 - C_q}, \quad \forall C \in \{0, 1\}^Q,$$

and

$$\Pi_{C,D} = g\big(C^{\intercal} W D + C^{\intercal} U + V^{\intercal} D + W^{*}\big), \quad \forall (C, D) \in \{0, 1\}^Q \times \{0, 1\}^Q.$$

Let G_N denote the set of probability measures on the graphs of N vertices. The OSBM of parameter θ in Θ_OSBM and the SBM of parameter φ(θ) in Θ_SBM clearly induce the same measure µ in G_N. Thus, denoting by ψ(γ, Π) the probability measure in G_N induced by the SBM of parameter (γ, Π), the problem of identifiability is to characterize the relations between parameters θ ∈ Θ_OSBM and θ′ ∈ Θ_OSBM such that ψ(φ(θ)) = ψ(φ(θ′)):


$$\Theta_{OSBM} \;\longrightarrow\; \Theta_{SBM} \;\longrightarrow\; \mathcal{G}_N, \qquad \theta = (\alpha, W) \;\overset{\phi}{\longmapsto}\; (\gamma, \Pi) \;\overset{\psi}{\longmapsto}\; \mu.$$

The identifiability of SBM was studied by Allman et al. (2009), who showed that the model is generically identifiable up to a permutation of the classes. In other words, except in a set of parameters which has a null Lebesgue measure, two parameters imply the same random graph model if and only if they differ only by the ordering of the classes. Therefore, the main theorem of Allman et al. (2009) implies the following result:

Theorem 3.1. There exists a set Θ^bad_SBM ⊂ Θ_SBM of null Lebesgue measure such that, for every (γ, Π) and (γ′, Π′) not in Θ^bad_SBM, ψ(γ, Π) = ψ(γ′, Π′) if and only if there exists a function P_ν such that (γ′, Π′) = P_ν((γ, Π)), where:

• ν is a permutation on {0, 1}^Q,
• γ′_C = γ_{ν(C)}, ∀C ∈ {0, 1}^Q,
• Π′_{C,D} = Π_{ν(C),ν(D)}, ∀(C, D) ∈ {0, 1}^Q × {0, 1}^Q.

Thus, studying the generic identifiability of the OSBM is equivalent to characterizing the parameters of Θ_OSBM verifying φ(θ′) = P_ν(φ(θ)) for some permutation ν on {0, 1}^Q.

3.4.2 Permutations and inversions

As in the case of the SBM, reordering the Q classes of the OSBM and making the corresponding modification in α and W does not change the generative random graph model. Indeed, let σ be a permutation on {1, . . . , Q} and let P_σ denote the function corresponding to the permutation σ of the classes. Then, (α′, W′) = P_σ(α, W) is defined by:

$$\alpha'_q = \alpha_{\sigma(q)}, \quad \forall q \in \{1, \ldots, Q\},$$

and

$$W'_{q,l} = W_{\sigma(q),\sigma(l)}, \quad \forall (q, l) \in \{1, \ldots, Q+1\}^2,$$

with the convention σ(Q + 1) = Q + 1. Now, let ν be the permutation of {0, 1}^Q defined by:

$$\nu(C) = (C_{\sigma(1)}, \ldots, C_{\sigma(Q)}), \quad \forall C \in \{0, 1\}^Q.$$

It is then straightforward to see that, for every parameter θ in Θ_OSBM and every permutation σ, φ(P_σ(θ)) = P_ν(φ(θ)), where P_ν is defined in Theorem 3.1.

There is another family of operations in Θ_OSBM which does not change the generative random graph model, which we call inversions. They correspond to exchanging the labels 0 and 1 on some of the coordinates of the Zi's. To give an intuition, consider a parameter θ = (α, W) in Θ_OSBM. Let us generate graphs under the probability measure in G_N induced by θ and consider only the first coordinate of the Zi's. If we denote by "cluster 1" the vertices whose Zi's have a 1 as first coordinate, the graph sampling procedure consists in sampling the set "cluster 1" and then drawing the edges conditionally on that information. Note that it would be equivalent to sample the vertices which are not in "cluster 1" and to draw the edges conditionally on that information. Thus there exists an equivalent reparametrization where the 1's in the first coordinate correspond to the vertices which are not in "cluster 1". This is the parameter θ′ obtained from θ by an inversion of the first coordinate.

Let A be any vector of {0, 1}^Q. We define the A-inversion I_A as follows:

$$I_A : \Theta_{OSBM} \to \Theta_{OSBM}, \qquad (\alpha, W) \mapsto (\alpha', W'),$$

where

$$\alpha'_j = \begin{cases} 1 - \alpha_j & \text{if } A_j = 1 \\ \alpha_j & \text{otherwise} \end{cases}, \quad \forall j \in \{1, \ldots, Q\},$$

and

$$W' = M_A^{\intercal}\, W\, M_A.$$

The matrix M_A is defined by:

$$M_A = \begin{pmatrix} I - 2\,\mathrm{diag}(A) & A \\ 0 \cdots 0 & 1 \end{pmatrix},$$

with diag(A) being the Q × Q diagonal matrix whose diagonal is the vector A.

Proposition 3.1. For every A ∈ {0, 1}^Q, let ν be the permutation of {0, 1}^Q defined by:

$$\forall C \in \{0, 1\}^Q, \quad \nu(C)_i = \begin{cases} 1 - C_i & \text{if } A_i = 1 \\ C_i & \text{otherwise.} \end{cases}$$

Then, for every θ in Θ_OSBM:

$$\phi(I_A(\theta)) = P_\nu(\phi(\theta)),$$

where P_ν is defined in Theorem 3.1.

Proof. Consider θ ∈ Θ_OSBM and A ∈ {0, 1}^Q, and define (γ, Π) = φ(θ) and (γ′, Π′) = φ(I_A(θ)). It is straightforward to verify that:

$$\gamma'_C = \gamma_{\nu(C)}, \quad \forall C \in \{0, 1\}^Q.$$

Moreover, since $M_A \binom{C}{1} = \binom{\nu(C)}{1}$, it follows that:

$$\Pi'_{C,D} = g\left( \begin{pmatrix} C^{\intercal} & 1 \end{pmatrix} M_A^{\intercal}\, W\, M_A \begin{pmatrix} D \\ 1 \end{pmatrix} \right) = g\left( \begin{pmatrix} \nu(C)^{\intercal} & 1 \end{pmatrix} W \begin{pmatrix} \nu(D) \\ 1 \end{pmatrix} \right) = \Pi_{\nu(C),\nu(D)}.$$

Therefore, φ(I_A(θ)) = P_ν(φ(θ)).


3.4.3 Identifiability

Let us define the following equivalence relation:

$$\theta \sim \theta' \quad \text{if} \quad \exists\, \sigma, A \;\mid\; \theta' = I_A(P_\sigma(\theta)).$$

To be convinced that it is an equivalence relation, note that:

$$I_A \circ P_\sigma = P_\sigma \circ I_{\sigma^{-1}(A)}.$$

Consider the set of equivalence classes for the relation ∼. It follows that:

• Two parameters in the same equivalence class induce the same measure in G_N,
• Each equivalence class contains a parameter θ = (α, W) such that α1 ≤ α2 ≤ . . . ≤ αQ ≤ 1/2. Moreover, if the αq's are all distinct and strictly lower than 1/2, there is a unique such parameter in the equivalence class.

We are now able to state our main theorem about identifiability, namely that the model is generically identifiable up to the equivalence relation ∼:

Theorem 3.2. For every α ∈ ]0, 1[^Q, let β ∈ R^Q be the vector defined by β_k = −log(α_k / (1 − α_k)), for every k. Define Θ^bad_OSBM as the set of parameters (α, W) such that one of the following conditions holds:

• there exists 1 ≤ k ≤ Q such that α_k = 0 or α_k = 1 or α_k = 1/2,
• there exist 1 ≤ k < l ≤ Q such that α_k = α_l,
• there exist C ≠ D in {0, 1}^Q such that ∑_k β_k C_k = ∑_k β_k D_k,
• φ(α, W) ∈ Θ^bad_SBM, the set of null measure given by Theorem 3.1.

Then Θ^bad_OSBM has a null Lebesgue measure on Θ_OSBM and:

$$\forall\, (\theta, \theta') \in (\Theta_{OSBM} \setminus \Theta^{bad}_{OSBM})^2, \quad \phi(\theta) = \phi(\theta') \;\Leftrightarrow\; \theta \sim \theta'.$$

Proof. Θ^bad_OSBM is the union of a finite number of hyperplanes or spaces which are isomorphic to hyperplanes. Therefore, µ(Θ^bad_OSBM) = 0.

Let θ = (α, W), θ′ = (α′, W′), φ(θ) = (γ, Π), and φ(θ′) = (γ′, Π′). As φ is constant on each equivalence class and as θ and θ′ are not in Θ^bad_OSBM, we can assume that 0 < α1 < . . . < αQ < 1/2 and 0 < α′1 < . . . < α′Q < 1/2. Proving the theorem is then equivalent to proving that θ = θ′.

As φ(θ) = φ(θ′), Theorem 3.1 ensures that there exists a permutation ν : {0, 1}^Q → {0, 1}^Q such that:

$$\begin{cases} \gamma'_C = \gamma_{\nu(C)} & \forall C, \\ \Pi'_{C,D} = \Pi_{\nu(C),\nu(D)} & \forall C, D. \end{cases}$$


Then, in particular:

$$\left\{ \prod_k \alpha_k^{C_k} (1 - \alpha_k)^{1 - C_k}, \; C \in \{0, 1\}^Q \right\} = \left\{ \prod_k (\alpha'_k)^{C_k} (1 - \alpha'_k)^{1 - C_k}, \; C \in \{0, 1\}^Q \right\}. \qquad (3.7)$$

The minima of those two sets as well as the second lowest values are equal, that is:

$$\prod_k \alpha_k = \prod_k \alpha'_k \quad \text{and} \quad \Big( \prod_{k \leq Q-1} \alpha_k \Big)(1 - \alpha_Q) = \Big( \prod_{k \leq Q-1} \alpha'_k \Big)(1 - \alpha'_Q).$$

Dividing those equations term by term yields $\frac{\alpha_Q}{1 - \alpha_Q} = \frac{\alpha'_Q}{1 - \alpha'_Q}$ and finally α_Q = α′_Q. Dividing all terms by $\alpha_Q^{C_Q} (1 - \alpha_Q)^{1 - C_Q}$ in (3.7), it follows by induction that:

$$\alpha = \alpha'. \qquad (3.8)$$

Now, for any C ∈ {0, 1}^Q, the fact that γ′_C = γ_{ν(C)} can be written as:

$$\prod_k \alpha_k^{C_k} (1 - \alpha_k)^{1 - C_k} = \prod_k \alpha_k^{\nu(C)_k} (1 - \alpha_k)^{1 - \nu(C)_k},$$

$$\sum_k C_k \log\Big( \frac{\alpha_k}{1 - \alpha_k} \Big) + \sum_k \log(1 - \alpha_k) = \sum_k \nu(C)_k \log\Big( \frac{\alpha_k}{1 - \alpha_k} \Big) + \sum_k \log(1 - \alpha_k),$$

$$\sum_k \beta_k C_k = \sum_k \beta_k \nu(C)_k.$$

Since θ ∉ Θ^bad_OSBM, this implies that ν(C) = C. As this is true for every C, ν is in fact the identity function.

Therefore, for every C, D, Π_{C,D} = Π′_{C,D}, and since g is one-to-one:

$$\sum_{q,l} W_{ql} C_q D_l + \sum_q U_q C_q + \sum_l V_l D_l + W^{*} = \sum_{q,l} W'_{ql} C_q D_l + \sum_q U'_q C_q + \sum_l V'_l D_l + W'^{*}.$$

Applying it for C = D = 0 implies W* = W′*. Applying it for D = 0 and C = δq, where δq is the vector having a 1 on the q-th coordinate and 0's elsewhere, yields U_q + W* = U′_q + W′* and thus U_q = U′_q. By symmetry, C = 0 and D = δl implies V_l = V′_l. Finally, C = δq and D = δl gives W_{ql} = W′_{ql}. Thus:

$$W = W'. \qquad (3.9)$$

By Equations 3.8 and 3.9, we have θ = θ′.

3.5 Statistical inference

Given a network, our aim in this section is to estimate the OSBM parameters.

The log-likelihood of the observed data set is defined through the marginalization p(X | α, W) = ∑_Z p(X, Z | α, W). This summation involves 2^{NQ} terms and quickly becomes intractable. To tackle this issue, the Expectation-Maximization (EM) algorithm has been applied to many mixture models. However, the E-step requires the calculation of the posterior distribution p(Z | X, α, W), which cannot be factorized in the case of networks (see Daudin et al. 2008, for more details). In order to obtain a tractable procedure, we present some approximations based on global and local variational techniques.

3.5.1 The q-transformation

Given a distribution q(Z), the log-likelihood of the observed data set can be decomposed using the Kullback-Leibler divergence KL(· || ·):

$$\log p(X \mid \alpha, W) = \mathcal{L}_{ML}(q; \alpha, W) + \mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X, \alpha, W)\big), \qquad (3.10)$$

where

$$\mathcal{L}_{ML}(q; \alpha, W) = \sum_{Z} q(Z) \log\left\{ \frac{p(X, Z \mid \alpha, W)}{q(Z)} \right\}, \qquad (3.11)$$

and

$$\mathrm{KL}\big(q(\cdot) \,\|\, p(\cdot \mid X, \alpha, W)\big) = -\sum_{Z} q(Z) \log\left\{ \frac{p(Z \mid X, \alpha, W)}{q(Z)} \right\}. \qquad (3.12)$$

The maximum log p(X | α, W) of the lower bound L_ML (3.11) is reached when q(Z) = p(Z | X, α, W). Thus, if the posterior distribution p(Z | X, α, W) were tractable, the optimizations of L_ML and log p(X | α, W) with respect to α and W would be equivalent. However, in the case of networks, p(Z | X, α, W) cannot be calculated and L_ML cannot be optimized over the entire space of q(Z) distributions. Thus, we restrict our search to the class of distributions which satisfy:

$$q(Z) = \prod_{i=1}^{N} q(Z_i), \qquad (3.13)$$

with

$$q(Z_i) = \prod_{q=1}^{Q} \mathcal{B}(Z_{iq}; \tau_{iq}) = \prod_{q=1}^{Q} \tau_{iq}^{Z_{iq}} (1 - \tau_{iq})^{1 - Z_{iq}}.$$

Each τiq is a variational parameter which corresponds to the (approximate) posterior probability that node i belongs to class q. As for the vector α, the vectors τi = {τi1, . . . , τiQ} are not constrained to lie in the Q − 1 dimensional simplex.

Proposition 3.2 (Proof in Appendix C.1). The lower bound of the observed-data log-likelihood is given by:

$$\begin{aligned}
\mathcal{L}_{ML}(q; \alpha, W) = {} & \sum_{i \neq j}^{N} \Big\{ X_{ij}\, \tau_i^{\intercal} W \tau_j + \mathbb{E}_{Z_i, Z_j}\big[\log g(-a_{Z_i, Z_j})\big] \Big\} \\
& + \sum_{i=1}^{N} \sum_{q=1}^{Q} \Big\{ \tau_{iq} \log \alpha_q + (1 - \tau_{iq}) \log(1 - \alpha_q) \Big\} \\
& - \sum_{i=1}^{N} \sum_{q=1}^{Q} \Big\{ \tau_{iq} \log \tau_{iq} + (1 - \tau_{iq}) \log(1 - \tau_{iq}) \Big\}.
\end{aligned} \qquad (3.14)$$

Unfortunately, since the logistic sigmoid function is non-linear, E_{Zi,Zj}[log g(−a_{Zi,Zj})] in (3.14) cannot be computed analytically. Thus, we need a second level of approximation to optimize the lower bound of the observed data set.

3.5.2 The ξ-transformation

Proposition 3.3 (Proof in Appendix C.2). Given a variational parameter ξij, E_{Zi,Zj}[log g(−a_{Zi,Zj})] satisfies:

$$\mathbb{E}_{Z_i, Z_j}\big[\log g(-a_{Z_i, Z_j})\big] \;\geq\; \log g(\xi_{ij}) - \frac{\tau_i^{\intercal} W \tau_j + \xi_{ij}}{2} - \lambda(\xi_{ij}) \Big( \mathbb{E}_{Z_i, Z_j}\big[(Z_i^{\intercal} W Z_j)^2\big] - \xi_{ij}^2 \Big). \qquad (3.15)$$

Eventually, a lower bound of the first lower bound can be computed:

$$\log p(X \mid \alpha, W) \;\geq\; \mathcal{L}_{ML}(q; \alpha, W) \;\geq\; \mathcal{L}_{ML}(q; \alpha, W, \xi), \qquad (3.16)$$

where

$$\begin{aligned}
\mathcal{L}_{ML}(q; \alpha, W, \xi) = {} & \sum_{i \neq j}^{N} \Big\{ \big(X_{ij} - \tfrac{1}{2}\big)\, \tau_i^{\intercal} W \tau_j + \log g(\xi_{ij}) - \frac{\xi_{ij}}{2} \\
& \qquad - \lambda(\xi_{ij}) \Big( \mathrm{Tr}\big( W^{\intercal} E_i W \Sigma_j \big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j - \xi_{ij}^2 \Big) \Big\} \\
& + \sum_{i=1}^{N} \sum_{q=1}^{Q} \Big\{ \tau_{iq} \log \alpha_q + (1 - \tau_{iq}) \log(1 - \alpha_q) \Big\} \\
& - \sum_{i=1}^{N} \sum_{q=1}^{Q} \Big\{ \tau_{iq} \log \tau_{iq} + (1 - \tau_{iq}) \log(1 - \tau_{iq}) \Big\}.
\end{aligned}$$

The resulting variational EM algorithm (see Algorithm 7) alternately computes the posterior probabilities τi and the parameters α and W maximizing

$$\max_{\xi} \mathcal{L}_{ML}(q; \alpha, W, \xi).$$

The computational cost of the algorithm is equal to O(N²Q⁴). For comparison, the computational cost of the methods proposed by Daudin et al. (2008) and Latouche et al. (2009) for (non-overlapping) SBM is equal to O(N²Q²). Analyzing a sparse network with 100 nodes takes about ten seconds on a dual core, and about a minute for dense networks.


Algorithm 7: Overlapping stochastic block model for directed graphs without self loops.

// INITIALIZATION
Initialize τ with an Ascendant Hierarchical Classification algorithm
Sample W from a zero-mean σ² spherical Gaussian distribution

// OPTIMIZATION
repeat
    // ξ-transformation
    ξij ← ( Tr(W⊺ Ei W Σj) + τj⊺ W⊺ Ei W τj )^{1/2},  ∀ i ≠ j
    // M-step
    αq ← (1/N) ∑_{i=1}^{N} τiq,  ∀ q
    Optimize LML(q; α, W, ξ) with respect to W, with a gradient-based
        optimization algorithm (e.g. the quasi-Newton method of Broyden et al. 1970)
    // E-step
    repeat
        for i = 1 : N do
            Optimize LML(q; α, W, ξ) with respect to τi, with a box-constrained
                (τiq ∈ [0, 1]) gradient-based optimization algorithm
                (e.g. the method of Byrd et al. 1995)
        end
    until τ converges
until LML(q; α, W, ξ) converges


For all the experiments we present in the following section, we set σ² = 0.5 and we used the Ascendant Hierarchical Classification algorithm implemented in the R package "mixer", which is available at: http://cran.r-project.org/web/packages/mixer.

3.6 Experiments

We present some results of the experiments we carried out to assess OSBM. Throughout our experiments, we compared our approach to SBM (the non-overlapping version of OSBM), the Mixed Membership Stochastic Block model (MMSB) of Airoldi et al. (2008), and the work of Palla et al. (2005), implemented in the software (Version 2.0.1) CFinder (Palla et al. 2006).

In order to perform inference in SBM, we used the variational Bayes algorithm of Latouche et al. (2009) which approximates the posterior distribution over the latent variables and model parameters, given the edges. We computed the Maximum A Posteriori (MAP) estimates and obtained the class membership vectors Zi. We recall that SBM assumes that each vertex belongs to a single class, and therefore each vector Zi has all its components set to zero except one, such that Ziq = 1 if vertex i is classified into class q. For OSBM, we relied on the variational approximate inference procedure described in Section 3.5 and computed the MAP estimates. Contrary to SBM, each vertex can belong to multiple clusters and therefore the vectors Zi can have multiple components set to one. As described in Section 3.1, MMSB can also be viewed as allowing overlapping clusters. For more details, we refer to Airoldi et al. (2008). In order to estimate the MMSB mixing weight vectors πi, we used the collapsed Gibbs sampling approach implemented in the R package lda (Chang 2010). We then converted each vector πi into a binary membership vector Zi using a threshold t. Thus, for πiq ≥ t, we set Ziq = 1 and Ziq = 0 otherwise. In all the experiments we carried out, we defined t = 1/Q and we found that for higher values MMSB tended to behave like SBM. Finally, we considered CFinder, which is a widely used algorithmic approach to uncover overlapping communities. As described in Section 3.1, CFinder looks for k-clique communities where each k-clique community is a union of all k-cliques (complete sub-graphs of size k) that can be reached from each other through a series of adjacent k-cliques. The algorithm first locates all cliques and then identifies the communities and overlaps between communities using a clique-clique overlap matrix (Everett and Borgatti 1998). Vertices that do not belong to any k-clique are seen as outliers and not classified.
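The conversion of the MMSB mixing weights into binary membership vectors described above is straightforward; a possible implementation is sketched below (the function name is hypothetical).

```python
import numpy as np

def binarize_memberships(pi, t=None):
    """Convert MMSB mixing weight vectors (rows of pi, N x Q) into binary
    membership vectors using the threshold t (t = 1/Q by default)."""
    Q = pi.shape[1]
    t = 1.0 / Q if t is None else t
    return (pi >= t).astype(int)
```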

Contrary to OSBM (and CFinder), SBM and MMSB cannot deal with outliers. Therefore, to obtain fair comparisons between the approaches, when OSBM was run with Q classes, SBM and MMSB were run with Q + 1 classes and we identified the class of outliers. In practice, this can easily be done since this class contains most of the vertices of the network having low output and input degrees.


3.6.1 Simulations

In this set of experiments, we generated two types of networks using the OSBM generative model. In Section 3.6.1, we sampled networks with community structures (Figure 3.3), where vertices of a community are mostly connected to vertices of the same community. To limit the number of free parameters, we considered the Q × Q real matrix W:

$$W = \begin{pmatrix} \lambda & -\epsilon & \cdots & -\epsilon \\ -\epsilon & \lambda & & \vdots \\ \vdots & & \ddots & -\epsilon \\ -\epsilon & \cdots & -\epsilon & \lambda \end{pmatrix}.$$

In Section 3.6.1, we generated networks with more complex topologies, using the matrix W:

$$W = \begin{pmatrix}
\lambda & \lambda & -\epsilon & \cdots & \cdots & -\epsilon \\
-\epsilon & -\lambda & -\epsilon & \cdots & \cdots & -\epsilon \\
\vdots & -\epsilon & \lambda & \lambda & -\epsilon & \vdots \\
\vdots & \vdots & -\epsilon & -\lambda & -\epsilon & \vdots \\
\vdots & \vdots & \vdots & \vdots & \lambda & \lambda \\
-\epsilon & \cdots & \cdots & \cdots & -\epsilon & -\lambda
\end{pmatrix}.$$

In these networks, if class i is a community and has therefore a high intra connection probability, then its vertices also highly connect to vertices of class i + 1, which itself has a low intra connection probability. Such star patterns (Figure 3.4) often appear in transcription networks, as shown in Section 3.6.3, and protein-protein interaction networks.

For these two sets of experiments, we used the Q-dimensional real vectors U and V:

$$U = V = (\epsilon, \ldots, \epsilon),$$

and we set Q = 4, λ = 4, ε = 1, W* = −5.5. Moreover, for the vector α of class probabilities, we set αq = 0.25, ∀q ∈ {1, . . . , Q}. We generated 100 networks with N = 100 vertices and, for each of these networks, we clustered the vertices using CFinder, SBM, MMSB, and OSBM. Finally, we used a criterion similar to the one proposed by Heller and Ghahramani (2007), Heller et al. (2008) to compare the true Z and the estimated Ẑ clustering matrices. Thus, for each network and each method, we computed the L2 distance d(P, P̂), where P = Z Z⊺ and P̂ = Ẑ Ẑ⊺. These two N × N matrices are invariant to column permutations of Z and Ẑ and count the number of shared clusters between each pair of vertices of a network. Therefore, d(P, P̂) is a good measure to determine how well the underlying cluster assignment structure has been discovered. Since CFinder depends on a parameter k (size of the cliques), for each simulated network we ran the software for various values of k and selected the k for which the L2 distance was minimized. Note that this choice of k tends to overestimate the performances of CFinder compared to the other approaches. Indeed, in practice, when analyzing a real network, k needs to be estimated (see Section 3.6.2) while P is unknown. OSBM was run with Q classes, whereas SBM and MMSB were run with Q + 1 classes. For both SBM and MMSB, and each generated network, after having identified the class of outliers, we set the latent vectors of the corresponding vertices to zero (null component). The L2 distance d(P, P̂) was then computed exactly as described previously.
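The comparison criterion can be computed in a few lines; the sketch below assumes Z and Ẑ are given as binary membership matrices (with rows already set to zero for outliers) and takes the L2 distance to be the Frobenius norm of the difference, a convention chosen here since the thesis does not spell it out.

```python
import numpy as np

def clustering_distance(Z_true, Z_hat):
    """L2 distance between P = Z Z^T and P_hat = Z_hat Z_hat^T, which count the
    number of shared clusters between each pair of vertices and are invariant
    to column permutations of the membership matrices."""
    P = Z_true @ Z_true.T
    P_hat = Z_hat @ Z_hat.T
    return np.linalg.norm(P - P_hat)
```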

Figure 3.3 – Example of a network with community structures. Overlaps are represented in black and outliers in gray.

Networks with community structures

The results that we obtained are presented in Table 3.2 and in Figure 3.5.We can observe that CFinder, MMSB, and OSBM lead to very accurateestimates Z of the true clustering matrix Z. For most networks, they re-trieve the clusters and overlaps perfectly although CFinder and MMSBappear to be slightly biased. Indeed, while the median of the L2 distanced(P, P) over the 100 samples is null for OSBM, it is equal to 22 for CFinderand 27.5 for MMSB. Since CFinder is an algorithmic approach, and not aprobabilistic model, it does not classify a vertex vi if it does not belong toany k-cliques of a k-clique community. Conversely, OSBM is more flexibleand can take the random nature of the network into account. Indeed, theedges are assumed to be drawn randomly, and, given each pair of vertices,OSBM deciphers whether or not they are likely to belong to the same class,depending on their connection profiles. Therefore, OSBM can predict that


Figure 3.4 – Example of a network with community structures and stars. Overlaps are represented in black and outliers in gray.


vi belongs to a class q although it does not belong to any k-clique. Overall, we found that MMSB retrieves the clusters well but often misclassifies some of the overlaps. Thus, if a given vertex belongs to several clusters, it tends to be classified by MMSB into only one of them. Nevertheless, the results clearly illustrate that MMSB improves over SBM, which cannot retrieve any of the overlapping clusters. It should also be noted that CFinder has fewer outliers (Figure 3.5) than MMSB and OSBM and appears to be slightly more stable when looking for overlapping community structures in networks.


Figure 3.5 – L2 distance d(P, P̂) over the 100 samples of networks with community structures, for CFinder, SBM, MMSB, and OSBM. It measures how well the underlying cluster assignment structure has been retrieved.

           Mean     Median   Min   Max
CFinder    43.53    22       0     203
SBM        116.46   103.3    0     321
MMSB       53.76    27.5     0     293
OSBM       41.83    0        0     258

Table 3.2 – Comparison of CFinder, SBM, MMSB, and OSBM in terms of the L2 distance d(P, P̂) over the 100 samples of networks with community structures.


Networks with community structures and stars

In this set of experiments, we considered networks with more complex topologies. As shown in Table 3.3 and in Figure 3.6, the results of CFinder dramatically degrade while those of OSBM remain more stable. Indeed, the median of the L2 distances d(P, P̂) over the 100 samples is equal to 43 for OSBM, while it is equal to 354.5 for CFinder. This can be easily explained since CFinder only looks for community structures of adjacent k-cliques and cannot retrieve classes with low intra-connection probabilities. Conversely, OSBM uses a Q × Q real matrix W and two real vectors U and V of size Q to model the intra- and inter-connection probabilities. No assumption is made on this matrix and these vectors, such that OSBM can take heterogeneous and complex topologies into account. As for CFinder, the results of MMSB degrade, although they remain better than those of SBM. As in the previous section, MMSB retrieves the clusters well but misclassifies the overlaps more frequently when considering networks with community structures and stars.


Figure 3.6 – L2 distance d(P, P̂) over the 100 samples of networks with community structures and stars, for CFinder, SBM, MMSB, and OSBM. It measures how well the underlying cluster assignment structure has been retrieved.


           Mean     Median   Min     Max
CFinder    362.07   354.5    181     567
SBM        134.68   118.87   15.14   352.09
MMSB       119.01   98.5     0       367
OSBM       77       43       0       328

Table 3.3 – Comparison of CFinder, SBM, MMSB, and OSBM in terms of the L2 distance d(P, P̂) over the 100 samples of networks with community structures and stars.

3.6.2 French political blogosphere

We consider the French political blogosphere network and we focus on a subset of 196 vertices connected by 2864 edges. The data consist of a single-day snapshot of political blogs automatically extracted on October 14th, 2006 and manually classified by the “Observatoire Présidentielle” project (Zanghi et al. 2008). Nodes correspond to hostnames and there is an edge between two nodes if there is a known hyperlink from one hostname to the other. The four main political parties present in the data set are the UMP (French “republican” party), the UDF (“moderate” party), the liberal party (supporters of economic liberalism), and the PS (French “democrat” party). Therefore, we applied our algorithm with Q = 4 clusters and we obtained the results presented in Figure 3.7 and in Table 3.4.

First, we notice that the clusters we found are highly homogeneous and correspond to the well-known political parties. Thus, cluster 1 contains 35 blogs among which 33 are associated with the UMP, while cluster 2 contains 39 blogs among which 30 are related to the UDF. Similarly, it follows that cluster 3 corresponds to the liberal party and cluster 4 to the PS. We found nine overlaps. Thus, three blogs associated with the UMP belong to both clusters 1 (UMP) and 2 (UDF). This is a result we expected since these two political parties are known to have some relational ties. Moreover, a blog associated with the UDF belongs to both clusters 1 (UMP) and 4 (PS), while another UDF blog belongs to clusters 2 (UDF) and 4 (PS). This can be easily understood since the UDF is a moderate party. Therefore, it is not surprising to find UDF blogs with links to the two biggest political parties in France, representing the left and right wings. Very interestingly, among the nine overlaps we found, four correspond to blogs of political analysts. Thus, one blog overlaps clusters 1 (UMP) and 4 (PS). Another one overlaps clusters 2 (UDF), 3 (liberal party), and 4 (PS). Finally, the two last blogs of political analysts overlap clusters 2 (UDF) and 4 (PS).

We ran CFinder and we used the criterion proposed by Palla et al. (2005) to select k (see Section 3.1). Thus, we ran the software for various values of k and we found k = 7. Lower values lead to giant components which smear the details of the network. Conversely, for higher values, the communities start disintegrating. Using this value of k, we uncovered 11 clusters which correspond to sub-clusters of the clusters found using OSBM. For instance, cluster 3 (liberal party) was split into two clusters, whereas cluster 4 (PS) was split into three. Indeed, while OSBM predicted that the connection profiles of these sub-clusters were very similar and that they should therefore be merged, CFinder could not uncover any k-clique community, that is, a union of fully connected sub-graphs of size k, containing these


sub-clusters. Note that using CFinder, we retrieved the overlaps uncovered by our algorithm. CFinder did not classify 95 blogs.

We also clustered the blogs of the network using MMSB and SBM. As previously, for both models, we used Q + 1 clusters and we identified the class of outliers. The results of MMSB are presented in Figure 3.8. Overall, we can notice that MMSB leads to clusters similar to those of OSBM, although cluster 4 is less homogeneous in MMSB than in OSBM. We found eight overlaps using MMSB and we emphasize that five of them correspond exactly to the ones found with our approach. Thus, the model retrieved two among the three UMP blogs overlapping clusters 1 (UMP) and 2 (UDF). Moreover, MMSB uncovered the UDF blog overlapping clusters 1 (UMP) and 4 (PS), as well as the blog of a political analyst overlapping clusters 2 (UDF), 3 (liberal party), and 4 (PS). It also retrieved the blog of a political analyst overlapping clusters 1 (UMP) and 4 (PS). Finally, the results of SBM are presented in Figure 3.9. Again, the clusters found by this approach are very similar to the ones uncovered by OSBM. However, because SBM does not allow each vertex to belong to multiple clusters, it misses a lot of information in the network. In particular, while some of the blogs of political analysts are viewed as overlaps by OSBM, because of their relational ties with the different political parties, they are all classified into a single cluster by SBM.

  3.89    0.17    0.54   -0.70  |  -0.70
  0.17    2.47   -0.40   -0.84  |   0.40
  0.55   -0.40    4.43   -0.85  |  -0.38
 -0.70   -0.84   -0.85    1.66  |   0.87
 -----------------------------------------
 -0.70    0.40   -0.38    0.87  |  -3.60

Table 3.4 – The estimated W matrix for the classification of the blogs into Q = 4 clusters using OSBM. The 4 × 4 matrix on the top left-hand side represents the W matrix, while the vectors on the top right-hand side and bottom left-hand side represent the vectors U and V⊺ respectively. The remaining term corresponds to the bias. The diagonal of W indicates that blogs have a heavy tendency to connect to blogs of the same class. Blogs of cluster 1 (UMP) also have a positive tendency to connect to blogs of clusters 2 (UDF) and 3 (liberal party). Conversely, blogs of cluster 4 (PS), representing the left wing, are more isolated in the network.


             UMP      UDF      liberal   PS     analysts   others
cluster 1    30 + 3   0 + 1    0         0      0 + 1      0
cluster 2    2 + 3    29 + 1   0         0      1 + 3      0
cluster 3    0        0        24        0      1 + 1      0
cluster 4    0        0 + 2    0         40     0 + 4      1
outliers     5        1        1         17     5          30

Figure 3.7 – Classification of the blogs into Q = 4 clusters using OSBM. The entry (i, j) of the matrix describes the number of blogs associated to the j-th political party (column) and classified into cluster i (row). Each entry distinguishes blogs which belong to a unique cluster from overlaps (single-membership blogs + overlaps). The last row corresponds to the null component.


             UMP      UDF      liberal   PS       analysts   others
cluster 1    27 + 2   0 + 2    0         0        1 + 1      0
cluster 2    2 + 2    29 + 1   0         0 + 1    3 + 2      0
cluster 3    0        0        25        0        1 + 2      0
cluster 4    0        0 + 1    0         30 + 1   0 + 2      1
cluster 5    9        1        0         26       3          30

Figure 3.8 – Classification of the blogs into Q = 5 clusters using MMSB. The entry (i, j) of the matrix describes the number of blogs associated to the j-th political party (column) and classified into cluster i (row). Each entry distinguishes blogs which belong to a unique cluster from overlaps (single-membership blogs + overlaps). Cluster 5 corresponds to the class of outliers.


             UMP   UDF   liberal   PS    analysts   others
cluster 1    37    0     1         0     0          2
cluster 2    1     31    0         0     1          0
cluster 3    0     0     24        0     1          0
cluster 4    0     0     0         26    0          0
cluster 5    2     1     0         31    9          29

Figure 3.9 – Classification of the blogs into Q = 5 clusters using SBM. The entry (i, j) of the matrix describes the number of blogs associated to the j-th political party (column) and classified into cluster i (row). Cluster 5 corresponds to the class of outliers.


3.6.3 Saccharomyces cerevisiae transcription network

We consider the yeast transcriptional regulatory network described in Milo et al. (2002) and we focus on a subset of 197 vertices connected by 303 edges. Nodes of the network correspond to operons, and two operons are linked if one operon encodes a transcription factor that directly regulates the other operon. The network is made of three regulation patterns, each one of them having its own regulators and regulated operons. Therefore, using Q = 6 clusters, we applied our algorithm and we obtained the results in Table 3.5.

First, we notice that clusters 1, 3, and 5 contain only two operons each. These operons correspond to hubs which regulate, respectively, the nodes of clusters 2, 4, and 6, all having a very low intra-connection probability. To analyze our results, we used GOToolBox (Martin et al. 2004) on each cluster. This software aims at identifying statistically over-represented terms of the Gene Ontology (GO) in a gene data set. We found that the clusters correspond to well-known biological functions. Thus, the nodes of cluster 2 are regulated by STE12 and TEC1, which are both involved in the response to glucose limitation, nitrogen limitation, and abundant fermentable carbon sources. Similarly, MSN4 and MSN2 regulate the nodes of cluster 4 in response to different stresses such as freezing, hydrostatic pressure, and heat acclimation. Finally, the nodes of cluster 6 are regulated by YAP1 and SKN7 in the presence of oxygen stimulus. Our algorithm was able to uncover two overlapping clusters (operons marked with an asterisk in Table 3.5). Interestingly, contrary to the other operons of clusters 2, 4, and 6, which are all regulated by operons of a single cluster (cluster 1, 3, or 5), these overlaps correspond to co-regulated operons. Thus, SSA4 and TKL2 belong to clusters 2 and 4 since they are co-regulated by (STE12, TEC1) and (MSN4, MSN2). Moreover, HSP78, CTT1, and PGM2 belong to clusters 4 and 6 since they are co-regulated by (MSN4, MSN2) and (YAP1, SKN7). It should also be noted that OSBM did not classify 112 operons, which all have very low output and input degrees.

Because the network is sparse, we obtained very poor results with CFinder. Indeed, the network contains only one 3-clique and no k-clique for k > 3. Therefore, for k = 2, all the operons were classified into a single cluster and no biological information could be retrieved. For k = 3, only three operons were classified into a single class and, for k > 3, no operon was classified.

As previously, we ran MMSB and SBM with Q + 1 clusters and we identified the class of outliers. Both approaches retrieved the six clusters found by OSBM. However, we emphasize that, contrary to the political blogosphere network, MMSB did not uncover any overlap in the yeast transcriptional regulatory network.

As in Section 3.6.1, these results clearly illustrate the capacity of OSBM to retrieve overlapping clusters in networks with complex topological structures. In particular, in situations where networks are not made of community structures, while the results of CFinder dramatically degrade or cannot even be interpreted, OSBM seems particularly promising.


cluster   size   operons
1         2      STE12  TEC1
2         33     YBR070C  MID2  YEL033W  SRD1  TSL1  RTS2  PRM5  YNL051W  PST1
                 YJL142C  SSA4*  YGR149W  SPO12  YNL159C  SFP1  YHR156C  YPS1
                 YPL114W  HTB2  MPT5  SRL1  DHH1  TKL2*  PGU1  YHL021C  RTA1
                 WSC2  GAT4  YJL017W  TOS11  YLR414C  BNI5  YDL222C
3         2      MSN4  MSN2
4         32     CPH1  TKL2*  HSP12  SPS100  MDJ1  GRX1  SSA3  ALD2  GDH3
                 GRE3  HOR2  ALD3  SOD2  ARA1  HSP42  YNL077W  HSP78*  GLK1
                 DOG2  HXK1  RAS2  CTT1*  HSP26  TPS1  TTR1  HSP104  GLO1
                 SSA4*  PNC1  MTC2  YGR086C  PGM2*
5         2      YAP1  SKN7
6         19     YMR318C  CTT1*  TSA1  CYS3  ZWF1  HSP82  TRX2  GRE2  SOD1
                 AHP1  YNL134C  HSP78*  CCP1  TAL1  DAK1  YDR453C  TRR1
                 LYS20  PGM2*

Table 3.5 – Classification of the operons into Q = 6 clusters. Operons marked with an asterisk belong to multiple clusters.

Conclusion

In this chapter, we proposed a new random graph model, the Overlapping Stochastic Block Model, which can be used to retrieve overlapping clusters in networks. We used global and local variational techniques to obtain a tractable lower bound of the observed log-likelihood and we defined an EM-like procedure which optimizes the model parameters in turn. We showed that the model is identifiable within classes of equivalence and we illustrated the efficiency of our approach compared to other methods, using simulated data and real networks. Since no assumption is made on the matrix W and the vectors U and V used to characterize the connection probabilities, the model can take very different topological structures into account and seems particularly promising for the analysis of networks. In the experiment section, we set the number Q of classes using a priori information we had about the networks. However, we believe it is crucial to develop a model selection criterion to estimate the number of classes automatically from the topology.


4  Model selection in overlapping stochastic block models

Contents
4.1  A Bayesian Overlapping Stochastic Block Model . . . 95
4.2  Estimation . . . 96
     4.2.1  The q-transformation . . . 96
     4.2.2  The ξ-transformation . . . 97
     4.2.3  Variational Bayes EM . . . 97
     4.2.4  Optimization of ξ . . . 99
4.3  Model Selection . . . 100
4.4  Experiment . . . 101
     4.4.1  Simulated data . . . 101
     4.4.2  Saccharomyces cerevisiae transcription network . . . 104
Conclusion . . . 104

In the previous chapter, we introduced the Overlapping Stochastic Block Model (OSBM), a new random graph model which allows the vertices of a network to belong to multiple classes. A variational EM algorithm was considered for estimation as well as clustering purposes, and the model was shown to be able to take very different topological structures into account. However, no model selection criterion is yet available to estimate the number of components from the data. To tackle this issue, we consider a Bayesian framework as in Chapter 2. We then propose a criterion, which we call ILosbm, based on a non-asymptotic approximation of the marginal log-likelihood. We describe how ILosbm can be computed through a variational Bayes EM algorithm. Finally, experiments are carried out to assess the criterion using simulated data and the transcriptional regulatory network of Saccharomyces cerevisiae.


4.1 A Bayesian Overlapping Stochastic Block Model

The data we model consist of an N × N binary matrix X with entries Xij describing the presence or absence of an edge from vertex i to vertex j. Both directed and undirected relations can be analyzed but, in the following, we concentrate on directed relations. Moreover, we assume that the graph we consider does not contain any self-loop. Therefore, the variables Xii will not be taken into account.

As shown in the previous chapter, the Overlapping Stochastic Block Model (OSBM) associates to each vertex of a network a latent variable Zi drawn from a multivariate Bernoulli distribution:
$$Z_i \sim \prod_{q=1}^{Q} \alpha_q^{Z_{iq}}\,(1-\alpha_q)^{1-Z_{iq}},$$
where Q denotes the number of classes considered. The edges are then assumed to be drawn from a Bernoulli distribution:
$$X_{ij}\,|\,Z_i, Z_j \sim \mathcal{B}\big(X_{ij};\, g(a_{Z_i,Z_j})\big) = e^{X_{ij}\,a_{Z_i,Z_j}}\, g(-a_{Z_i,Z_j}),$$
where
$$a_{Z_i,Z_j} = Z_i^{\intercal}\, W\, Z_j + Z_i^{\intercal}\, U + V^{\intercal}\, Z_j + W^{*},$$
and $g(x) = (1+e^{-x})^{-1}$ is the logistic sigmoid function. According to OSBM, the latent variables $Z_1, \dots, Z_N$ are iid and, given this latent structure, all the edges are assumed to be independent.

Thus, when considering a directed graph without self-loops, we recall that:
$$p(Z\,|\,\alpha) = \prod_{i=1}^{N}\prod_{q=1}^{Q} \alpha_q^{Z_{iq}}\,(1-\alpha_q)^{1-Z_{iq}},$$
and
$$p(X\,|\,Z, W) = \prod_{i\neq j}^{N} p(X_{ij}\,|\,Z_i, Z_j, W) = \prod_{i\neq j}^{N} e^{X_{ij}\,a_{Z_i,Z_j}}\, g(-a_{Z_i,Z_j}). \qquad (4.1)$$
Finally, to keep the notations uncluttered, we will use the notations of the previous chapter, that is $\widetilde{Z}_i = (Z_i, 1)^{\intercal}, \forall i$, and
$$\widetilde{W} = \begin{pmatrix} W & U \\ V^{\intercal} & W^{*} \end{pmatrix}.$$
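As a concrete illustration of this generative process, the following R sketch simulates a small directed network from OSBM. The particular values of α, W, U, V, and W* are arbitrary choices for the example, not parameters used in the thesis.

set.seed(1)
N <- 30; Q <- 2
alpha <- c(0.4, 0.4)                          # class probabilities
W     <- matrix(c(3, -1, -1, 3), Q, Q)        # intra/inter connection weights
U     <- rep(1, Q); V <- rep(1, Q); Wstar <- -5

g <- function(x) 1 / (1 + exp(-x))            # logistic sigmoid

# Draw the latent membership vectors Z_i from independent Bernoulli variables.
Z <- matrix(rbinom(N * Q, 1, rep(alpha, each = N)), N, Q)

# Draw the adjacency matrix: X_ij ~ Bernoulli(g(a_{Z_i,Z_j})), no self-loops.
X <- matrix(0L, N, N)
for (i in 1:N) for (j in 1:N) if (i != j) {
  a <- drop(Z[i, ] %*% W %*% Z[j, ] + Z[i, ] %*% U + V %*% Z[j, ] + Wstar)
  X[i, j] <- rbinom(1, 1, g(a))
}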

As for the (non-overlapping) Stochastic Block Model (SBM), OSBM can be described in a full Bayesian framework by introducing conjugate prior distributions for the model parameters. Since $p(Z_i\,|\,\alpha)$ is a multivariate Bernoulli distribution, we consider independent Beta distributions for the class probabilities:
$$p(\alpha) = \prod_{q=1}^{Q} \mathrm{Beta}(\alpha_q;\, \eta_q^{0}, \zeta_q^{0}),$$
where $\eta_q^{0} = \zeta_q^{0} = 1/2, \forall q$. As mentioned already, this corresponds to a product of non-informative Jeffreys prior distributions. A uniform distribution can also be chosen simply by fixing $\eta_q^{0} = \zeta_q^{0} = 1, \forall q$.

In order to model the $(Q+1)\times(Q+1)$ real matrix W, we consider the vec operator, which stacks the columns of a matrix into a vector. Thus, if A is a $2\times 2$ matrix such that
$$A = \begin{pmatrix} A_{11} & A_{12}\\ A_{21} & A_{22}\end{pmatrix},$$
then
$$A^{\mathrm{vec}} = \begin{pmatrix} A_{11}\\ A_{21}\\ A_{12}\\ A_{22}\end{pmatrix}.$$
Following the work of Jaakkola and Jordan (2000) on Bayesian logistic regression, where a Gaussian distribution is used for the weight vector β, we model the vector $W^{\mathrm{vec}}$ using a multivariate Gaussian prior distribution with mean vector $W^{\mathrm{vec}}_{0}$ and covariance matrix $S_0$:
$$p(W^{\mathrm{vec}}) = \mathcal{N}(W^{\mathrm{vec}};\, W^{\mathrm{vec}}_{0}, S_0).$$
The inference procedure introduced in this chapter is described in a general setting, allowing any covariance matrix $S_0$. In practice, we have used $S_0 = \sigma^{2} I$ in all the experiments that we carried out, where I denotes the identity matrix.

4.2 Estimation

In this section, we propose a Variational Bayes EM (VBEM) algorithm, based on global and local variational techniques, which leads to an approximation of the full posterior distribution over the model parameters and latent variables, given the observed data X. As for the VBEM algorithm described in Chapter 2, this procedure relies on a lower bound which will later be used as a non-asymptotic approximation of the marginal log-likelihood log p(X).

4.2.1 The q-transformation

The posterior distribution $p(Z, \alpha, W\,|\,X)$ is intractable and so we propose to rely on variational approximations. Following the VBEM algorithm we introduced in Chapter 2, the marginal log-likelihood can be decomposed into two terms:
$$\log p(X) = \mathcal{L}(q) + \mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X)\big),$$
where
$$\mathcal{L}(q) = \sum_{Z} \int\!\!\int q(Z, \alpha, W) \log\left\{\frac{p(X, Z, \alpha, W)}{q(Z, \alpha, W)}\right\}\, d\alpha\, dW, \qquad (4.2)$$


and
$$\mathrm{KL}\big(q(\cdot)\,\|\,p(\cdot\,|\,X)\big) = -\sum_{Z} \int\!\!\int q(Z, \alpha, W) \log\left\{\frac{p(Z, \alpha, W\,|\,X)}{q(Z, \alpha, W)}\right\}\, d\alpha\, dW. \qquad (4.3)$$
$\mathcal{L}$ is a lower bound of $\log p(X)$ and $\mathrm{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence between the distribution $q(Z, \alpha, W)$ and the distribution $p(Z, \alpha, W\,|\,X)$. To obtain a tractable algorithm, we assume that $q(Z, \alpha, W)$ can be factorized such that:

$$q(Z, \alpha, W) = q(\alpha)\,q(W)\,q(Z) = q(\alpha)\,q(W)\left(\prod_{i=1}^{N}\prod_{q=1}^{Q} q(Z_{iq})\right).$$
This factorization should be compared with (2.3.2). Indeed, contrary to SBM, where we assumed that $q(Z) = \prod_{i=1}^{N} q(Z_i)$, we now consider a factorization over all the vertices and classes, that is $q(Z) = \prod_{i=1}^{N}\prod_{q=1}^{Q} q(Z_{iq})$. Only with this latter assumption did we obtain analytical expressions, as shown in Section 4.2.3.

At this point, the lower bound is still intractable because of the logistic function in the distribution $p(X\,|\,Z, W)$ (see (4.1) and (4.2)). Therefore, a second level of approximation is required.

4.2.2 The ξ-transformation

As noted in Proposition 4.1, a tractable lower bound can be obtained using the work of Jaakkola and Jordan (2000).

Proposition 4.1 (Proof in Appendix D.1) Given any N × N positive real matrix ξ, a lower bound of the first lower bound is given by:
$$\log p(X) \geq \mathcal{L}(q) \geq \mathcal{L}(q;\, \xi),$$
where
$$\mathcal{L}(q;\, \xi) = \sum_{Z} \int\!\!\int q(Z, \alpha, W) \log\left(\frac{h(Z, W, \xi)\, p(Z\,|\,\alpha)\, p(\alpha)\, p(W)}{q(Z, \alpha, W)}\right) d\alpha\, dW,$$
and
$$\log h(Z, W, \xi) = \sum_{i\neq j}^{N}\left\{\Big(X_{ij}-\frac{1}{2}\Big)\, a_{Z_i,Z_j} - \frac{\xi_{ij}}{2} + \log g(\xi_{ij}) - \lambda(\xi_{ij})\big(a_{Z_i,Z_j}^{2} - \xi_{ij}^{2}\big)\right\}.$$
For now, ξ is held fixed, but we will see in Section 4.2.4 how it can be estimated from the data.
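The quadratic bound of Jaakkola and Jordan (2000) on the log-logistic function underlies h(Z, W, ξ); its definition, including λ(ξ) = (g(ξ) − 1/2)/(2ξ), is recalled in Appendix D.1. The short R sketch below evaluates the bound and checks numerically that it never exceeds log g(x); the grid of test points and the value of ξ are arbitrary.

g      <- function(x) 1 / (1 + exp(-x))             # logistic sigmoid
lambda <- function(xi) (g(xi) - 0.5) / (2 * xi)      # = tanh(xi / 2) / (4 * xi)

# Jaakkola-Jordan lower bound on log g(x), tight at x = +/- xi.
log_g_bound <- function(x, xi) {
  log(g(xi)) + (x - xi) / 2 - lambda(xi) * (x^2 - xi^2)
}

x  <- seq(-6, 6, by = 0.5)
xi <- 2
all(log_g_bound(x, xi) <= log(g(x)) + 1e-12)   # TRUE: the bound holds everywhere
log_g_bound(xi, xi) - log(g(xi))               # 0: the bound is exact at x = xi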

4.2.3 Variational Bayes EM

In order to approximate the posterior distribution p(Z, α, W |X) with a distribution q(Z, α, W), a VBEM algorithm is applied to the lower bound L(q; ξ). For a general description of the VBEM algorithm, we refer to Section 1.2.2. The algorithm starts with a matrix τ initialized using a hierarchical classification algorithm, as in Chapter 3. Given a matrix ξ, the


Propositions 4.2, 4.3, and 4.4 are then used to maximize the lower bound L(q; ξ) with respect to q(Z, α, W). We call variational Bayes E-step the optimization of the distributions q(Ziq) and variational Bayes M-step the optimization of the remaining factors.

Proposition 4.2 (Proof in Appendix D.2) VBEM leads to a distribution q(α) which takes the same functional form as the prior p(α):
$$q(\alpha) = \prod_{q=1}^{Q} \mathrm{Beta}(\alpha_q;\, \eta_q^{N}, \zeta_q^{N}),$$
where
$$\eta_q^{N} = \eta_q^{0} + \sum_{i=1}^{N} \tau_{iq}, \qquad \zeta_q^{N} = \zeta_q^{0} + N - \sum_{i=1}^{N} \tau_{iq}.$$

Proposition 4.3 (Proof in Appendix D.3) VBEM leads to a distribution q(W) which takes the same functional form as the prior p(W):
$$q(W^{\mathrm{vec}}) = \mathcal{N}(W^{\mathrm{vec}};\, W^{\mathrm{vec}}_{N}, S_N),$$
with
$$S_N^{-1} = S_0^{-1} + 2\sum_{i\neq j}^{N} \lambda(\xi_{ij})\,(E_j \otimes E_i),$$
and
$$W^{\mathrm{vec}}_{N} = S_N\left\{S_0^{-1} W^{\mathrm{vec}}_{0} + \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\,\tau_j \otimes \tau_i\right\}.$$
The symbol ⊗ denotes the Kronecker product. Moreover, each $(Q+1)\times(Q+1)$ probability matrix $E_i$ satisfies:
$$E_i = \mathrm{E}_{Z_i}[Z_i Z_i^{\intercal}] = \begin{pmatrix}
\tau_{i1} & \tau_{i1}\tau_{i2} & \cdots & \tau_{i1}\tau_{iQ} & \tau_{i1}\\
\tau_{i2}\tau_{i1} & \tau_{i2} & \cdots & \tau_{i2}\tau_{iQ} & \tau_{i2}\\
\vdots & & \ddots & & \vdots\\
\tau_{iQ}\tau_{i1} & \tau_{iQ}\tau_{i2} & \cdots & \tau_{iQ} & \tau_{iQ}\\
\tau_{i1} & \tau_{i2} & \cdots & \tau_{iQ} & 1
\end{pmatrix}.$$

Proposition 4.4 (Proof in Appendix D.4) VBEM leads to a distribution q(Ziq) with the same functional form as the prior:
$$q(Z_{iq}) = \mathcal{B}(Z_{iq};\, \tau_{iq}),$$
where
$$\begin{aligned}
\tau_{iq} = g\bigg\{ &\psi(\eta_q^{N}) - \psi(\zeta_q^{N}) + \sum_{j\neq i}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\,\tau_j^{\intercal}(W_N^{\intercal})_{\cdot q} + \sum_{j\neq i}^{N}\Big(X_{ji}-\frac{1}{2}\Big)\,\tau_j^{\intercal}(W_N)_{\cdot q}\\
&- \mathrm{Tr}\bigg(\Big(\Sigma'_{qq} + 2\sum_{l\neq q}^{Q+1}\tau_{il}\,\Sigma'_{ql}\Big)\Big(\sum_{j\neq i}^{N}\lambda(\xi_{ij})\,E_j\Big) + \Big(\Sigma_{qq} + 2\sum_{l\neq q}^{Q+1}\tau_{il}\,\Sigma_{ql}\Big)\Big(\sum_{j\neq i}^{N}\lambda(\xi_{ji})\,E_j\Big)\bigg)\bigg\},
\end{aligned}$$
where $\psi(\cdot)$ is the digamma function, $\Sigma_{ql} = \mathrm{E}_{W_{\cdot q}, W_{\cdot l}}[W_{\cdot q}\,W_{\cdot l}^{\intercal}]$, and $\Sigma'_{ql} = \mathrm{E}_{W_{q\cdot}, W_{l\cdot}}[W_{q\cdot}^{\intercal}\,W_{l\cdot}]$.


4.2.4 Optimization of ξ

So far, we have seen how a VBEM algorithm could be used to obtain an approximation of the posterior distribution p(Z, α, W |X) for a given matrix ξ. However, we have not yet addressed how ξ could be estimated from the data. We follow the work of Bishop and Svensén (2003) on Bayesian hierarchical mixtures of experts. Thus, given a distribution q(Z, α, W), the lower bound L(q; ξ) is maximized with respect to each variable ξij in order to obtain the tightest lower bound L(q; ξ) of L(q). As shown in Proposition 4.5 and Appendix D.5, this optimization leads to estimates $\hat{\xi}_{ij}$ of ξij.

Proposition 4.5 (Proof in Appendix D.5) An estimate $\hat{\xi}_{ij}$ of $\xi_{ij}$ is given by:
$$\hat{\xi}_{ij} = \sqrt{\mathrm{Tr}\Big(\big(S_N + W^{\mathrm{vec}}_{N}(W^{\mathrm{vec}}_{N})^{\intercal}\big)\,(E_j \otimes E_i)\Big)}.$$
This gives rise to a three-step optimization algorithm. Given a matrix ξ, the variational Bayes E and M steps are used to approximate the posterior distribution over the model parameters and latent variables. The distribution q(Z, α, W) is then held fixed while the lower bound L(q; ξ) is maximized with respect to ξ. These three stages are repeated until convergence of the lower bound (see Algorithm 8).

For all the experiments that we carried out, we set S0 = σ²I with σ² = 1 and ξij = 0.001, ∀i ≠ j. As for the VEM algorithm proposed in Chapter 3, the computational cost of the algorithm is O(N²Q⁴). Note that an R package called “OSBM” will soon be available on CRAN. For now, the code is available upon request.


Algorithm 8: Variational Bayes inference for the overlapping stochastic block model when applied to a directed graph without self-loops.

// INITIALIZATION
Initialize τ with an Ascendant Hierarchical Classification algorithm
Initialize ξ_ij, ∀ i ≠ j
// OPTIMIZATION
repeat
    E_i ← E_{Z_i}[Z_i Z_i⊺], ∀ i
    // M-step
    η_q^N ← η_q^0 + Σ_{i=1}^N τ_iq, ∀ q
    ζ_q^N ← ζ_q^0 + N − Σ_{i=1}^N τ_iq, ∀ q
    S_N^{−1} ← S_0^{−1} + 2 Σ_{i≠j}^N λ(ξ_ij) (E_j ⊗ E_i)
    W_N^vec ← S_N { S_0^{−1} W_0^vec + Σ_{i≠j}^N (X_ij − 1/2) τ_j ⊗ τ_i }
    // Optimization of ξ
    ξ_ij ← sqrt( Tr( (S_N + W_N^vec (W_N^vec)⊺)(E_j ⊗ E_i) ) ), ∀ i ≠ j
    // E-step
    repeat
        Compute τ_iq, ∀ (i, q), using Proposition 4.4
    until τ converges
until L(q; ξ) converges

4.3 Model Selection

Given a set of values of Q, we aim at selecting Q* which maximizes the marginal log-likelihood log p(X |Q), also called the integrated observed-data log-likelihood. Unfortunately, this quantity is not tractable since, for each value of Q, it involves integrating over all possible model parameters and latent variables:
$$\log p(X\,|\,Q) = \log\left\{\sum_{Z}\int\!\!\int p(X, Z, \alpha, W\,|\,Q)\, d\alpha\, dW\right\}.$$

As in Chapter 2, we propose to replace the marginal log-likelihood with its variational approximation. Thus, for each value of Q considered, Algorithm 8 is applied in order to maximize L(q; ξ) with respect to q(·) and ξ. After convergence, the lower bound is then used as an estimate of log p(X |Q). Obviously, this approximation cannot be verified analytically because neither L(q) in (4.2) nor the Kullback-Leibler divergence in (4.3) is tractable. Nevertheless, we rely on this approximation to propose a tractable model selection criterion. After convergence of the algorithm, the lower bound takes a simple form and leads to the first criterion for


OSBM, which we call ILosbm (Proof in Appendix D.6):
$$\begin{aligned}
\mathrm{IL}_{\mathrm{osbm}} = & \sum_{i\neq j}^{N}\left\{\log g(\xi_{ij}) - \frac{\xi_{ij}}{2} + \lambda(\xi_{ij})\,\xi_{ij}^{2}\right\} + \sum_{q=1}^{Q}\log\left\{\frac{\Gamma(\eta_q^{0}+\zeta_q^{0})\,\Gamma(\eta_q^{N})\,\Gamma(\zeta_q^{N})}{\Gamma(\eta_q^{0})\,\Gamma(\zeta_q^{0})\,\Gamma(\eta_q^{N}+\zeta_q^{N})}\right\}\\
& - \frac{1}{2}\log\frac{|S_0|}{|S_N|} - \frac{1}{2}(W^{\mathrm{vec}}_{0})^{\intercal} S_0^{-1} W^{\mathrm{vec}}_{0} + \frac{1}{2}(W^{\mathrm{vec}}_{N})^{\intercal} S_N^{-1} W^{\mathrm{vec}}_{N}\\
& - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\}.
\end{aligned}$$
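As an illustration, the following R sketch evaluates this criterion from the quantities returned by a converged run of Algorithm 8. The argument names (xi, eta0, etaN, S0, SN, w0, wN, tau) are hypothetical and simply mirror the symbols above; they do not describe an existing interface of the OSBM package.

# Evaluate the ILosbm criterion from converged variational quantities.
# xi: N x N matrix (diagonal ignored); eta0, zeta0, etaN, zetaN: length-Q vectors;
# S0, SN: (Q+1)^2 x (Q+1)^2 covariance matrices; w0, wN: vectorized mean matrices;
# tau: N x Q matrix of posterior membership probabilities (assumed in (0, 1)).
il_osbm <- function(xi, eta0, zeta0, etaN, zetaN, S0, SN, w0, wN, tau) {
  g      <- function(x) 1 / (1 + exp(-x))
  lambda <- function(x) (g(x) - 0.5) / (2 * x)
  off    <- row(xi) != col(xi)                     # i != j terms only

  term_xi   <- sum(log(g(xi[off])) - xi[off] / 2 + lambda(xi[off]) * xi[off]^2)
  term_beta <- sum(lgamma(eta0 + zeta0) + lgamma(etaN) + lgamma(zetaN) -
                   lgamma(eta0) - lgamma(zeta0) - lgamma(etaN + zetaN))
  logdet    <- function(S) as.numeric(determinant(S, logarithm = TRUE)$modulus)
  term_W    <- -0.5 * (logdet(S0) - logdet(SN)) -
                0.5 * drop(t(w0) %*% solve(S0, w0)) +
                0.5 * drop(t(wN) %*% solve(SN, wN))
  term_ent  <- -sum(tau * log(tau) + (1 - tau) * log(1 - tau))

  term_xi + term_beta + term_W + term_ent
}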

4.4 Experiment

We consider simulated data and the transcriptional regulatory network of yeast in order to assess the Bayesian inference procedure introduced in Section 4.2.4 as well as the model selection criterion ILosbm.

4.4.1 Simulated data

OSBM is used in this set of experiments to generate networks with community structure, where vertices of a community are mostly connected to vertices of the same community. The networks contain overlaps and, therefore, if we rely on a criterion for the (non-overlapping) SBM such as ILvb (see Chapter 2), the number of classes is highly overestimated. Indeed, such a criterion uses many extra components to characterize the overlaps between classes. As in Section 3.6.1, to limit the number of free parameters, we consider the Q × Q real matrix W:
$$W = \begin{pmatrix}
\lambda & -\epsilon & \cdots & -\epsilon\\
-\epsilon & \lambda & \ddots & \vdots\\
\vdots & \ddots & \ddots & -\epsilon\\
-\epsilon & \cdots & -\epsilon & \lambda
\end{pmatrix},$$
and the Q-dimensional real vectors U and V:
$$U = V = (\epsilon, \dots, \epsilon),$$

such that λ = 6, ε = 1, and W* = −5.5. For each value QTrue in the set {3, . . . , 7}, we generate 100 networks (see an example in Figure 4.1) with N = 100 vertices and classes mixed in the same proportions, that is α1 = · · · = αQTrue = 1/QTrue. The VBEM algorithm is then applied to each network for various numbers of classes Q ∈ {2, . . . , 8}. Note that we choose η0q = ζ0q = 1/2, ∀q, and S0 = σ²I with σ² = 1. Like any optimization method, the overlapping clustering algorithm we propose depends on the initialization. Thus, for each simulated network and each number of classes Q, we consider 20 initializations of τ. Finally, we select the best learnt model, for which the criterion ILosbm is maximized.
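In practice, this selection procedure amounts to a double loop over Q and over random initializations. The sketch below outlines it in R, where osbm_vbem() stands for a hypothetical function implementing Algorithm 8 and returning the converged lower bound; the interface of the forthcoming OSBM package is not specified here.

# Hypothetical model selection loop: for each Q, keep the best of several
# random initializations of tau, then select the Q maximizing ILosbm.
select_Q <- function(X, Q_range = 2:8, n_init = 20) {
  best_bound <- rep(-Inf, length(Q_range))
  for (k in seq_along(Q_range)) {
    for (init in seq_len(n_init)) {
      fit <- osbm_vbem(X, Q = Q_range[k])   # assumed to return list(bound = ...)
      best_bound[k] <- max(best_bound[k], fit$bound)
    }
  }
  list(Q_hat = Q_range[which.max(best_bound)], bounds = best_bound)
}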

The results presented in Table 4.1 clearly illustrate that ILosbm is a relevant criterion to estimate the number of overlapping classes in networks. For instance, when QTrue = 3 or QTrue = 4, ILosbm correctly estimates the


              QILosbm
QTrue     2    3    4    5    6    7    8
3         0   99    1    0    0    0    0
4         0    0   99    1    0    0    0
5         0    0    0   93    5    2    0
6         0    0    0    7   64   22    7
7         0    0    0    0   16   47   37

Table 4.1 – Confusion matrix for ILosbm. λ = 6, ε = 1, W* = −5.5, QTrue ∈ {3, . . . , 7} and QILosbm ∈ {2, . . . , 8}.

number of overlapping classes for 99 of the 100 generated networks. As Q increases, the performances of the criterion remain quite stable.

We now consider networks generated by changing the value of λ to λ = 4, which causes the intra-class connection probabilities to decrease. The results are presented in Table 4.2. It appears that ILosbm has a tendency to overestimate the number of components. Indeed, we found that the more difficult it is to distinguish the communities, the more the criterion tends to use extra classes to model the overlaps between classes.

              QILosbm
QTrue     2    3    4    5    6    7    8
3         0   99    1    0    0    0    0
4         0    0   85    9    5    0    1
5         0    0    4   53   26    9    8
6         0    0    0   18   34   27   21
7         0    0    0    4   18   30   48

Table 4.2 – Confusion matrix for ILosbm. λ = 4, ε = 1, W* = −5.5, QTrue ∈ {3, . . . , 7} and QILosbm ∈ {2, . . . , 8}.


Figure 4.1 – Figure produced by the R “OSBM” package. Example of a network generated using OSBM, with λ = 6, ε = 1, W* = −5.5, and Q = 5 classes. Overlaps are represented using pies and outliers are in white.


4.4.2 Saccharomyces cerevisiae transcription network

Let us consider the yeast transcriptional regulatory network described in Chapter 3 (Section 3.6.3). The network is made of 197 vertices connected by 303 edges. We apply the VBEM algorithm for various numbers of classes Q ∈ {2, . . . , 8}. Moreover, for each value of Q, we consider 10 initializations of the matrix τ and we select the best learnt model. The results of ILosbm are presented in Figure 4.2. The criterion reaches its maximum for QILosbm = 6, as expected. It retrieves the three regulation patterns (see Section 3.6.3), where each regulation pattern is made of a group of regulators and a group of regulated operons.


Figure 4.2 – The ILosbm criterion for Q ∈ {2, . . . , 8}. The maximum is reached at QILosbm = 6.

Conclusion

In this chapter, we first introduced conjugate prior distributions for the parameters of the overlapping stochastic block model. We then proposed a variational Bayes EM algorithm, based on global and local variational techniques. The algorithm can be used to approximate the posterior distribution over the model parameters and latent variables, given the observed data. In this framework, we derived a model selection criterion, called ILosbm, which is based on a non-asymptotic approximation of the marginal log-likelihood. Using simulated data and a real network, we


showed that ILosbm provides a relevant estimation of the number of overlapping clusters. In future work, we will assess the Bayesian confidence intervals of the posterior distributions found by the variational Bayes EM algorithm. Moreover, we are also interested in developing a method in order to be able to learn from a network whether the parameters U, V, W, and W* are zero or not.


Conclusion

As more and more network-structured data sets become available, the statistical analysis of graphs has become commonplace. In this thesis, we considered unsupervised methods which aim at clustering vertices depending on their connection profiles. There is a long history of research on the topic, which goes back to the early work of Moreno in 1934, and many graph clustering algorithms have been proposed. We presented model-based techniques and focused mainly on the Stochastic Block Model (SBM).

SBM is a mixture model which assumes that the vertices of a network are spread into different classes, so that the probability of an edge between two vertices only depends on the classes they belong to. No assumption is made on these probabilities, such that very different topological structures can be taken into account. In particular, the model can characterize the presence of hubs which make networks locally dense. SBM was shown to generalize many of the existing graph clustering algorithms and we therefore considered it as the starting point of the thesis.

The clustering of vertices as well as the estimation of SBM parameters had been the subject of previous work and numerous inference strategies had been proposed. However, the model still suffered from a lack of criteria to estimate the number of components in the mixture. Only one model-based criterion had been derived for SBM in the literature. However, it tended to be too conservative in the case of small networks. In order to tackle this issue, we first illustrated how SBM could be described in a Bayesian framework. A new inference procedure was then proposed to estimate the posterior distribution over the model parameters and latent variables, given the observed data. In this framework, we derived a new model selection criterion, so-called ILvb, based on a non-asymptotic approximation of the marginal likelihood. By considering networks generated using SBM as well as a real network, we showed that ILvb focuses on the estimation of the data density and provides a relevant estimation of the number of latent classes.

Besides, almost all graph clustering models, such as SBM, partition the vertices into disjoint clusters. However, most real networks are known to be made of overlapping clusters. For instance, many proteins, so-called moonlighting proteins, are known to have several functions in the cells, and actors might belong to several groups of interest. Therefore, we introduced a new random graph model, the Overlapping Stochastic Block Model (OSBM). It allows the vertices to belong to multiple classes and can take very different topological structures into account. We proposed a variational EM algorithm, based on global and local variational techniques, to cluster the vertices of a network and estimate the model parameters. Using simulated data, the French political blogosphere network,


as well as the yeast transcriptional regulatory network, we showed that this procedure outperformed existing graph clustering algorithms.

Finally, we illustrated how OSBM could be described in a Bayesian framework by introducing conjugate prior distributions for the model parameters. Then, we proposed an inference procedure to approximate the posterior distribution over the model parameters and latent variables, given the observed data. In this framework, we obtained the first model selection criterion for OSBM, which we called ILosbm. It is based on a non-asymptotic approximation of the marginal likelihood and has shown encouraging results.

Perspectives

In future work, we will investigate Markov chain Monte Carlo techniques as alternative approaches to estimate the posterior distribution of OSBM. These methods have already been considered for the (non-overlapping) SBM model and have shown some interesting features. In particular, they appear to be less dependent on the initialization of the τ matrix than variational approaches. Other experiments on more challenging networks are also of interest in order to assess the ILosbm criterion. Similarly, we believe it is crucial to investigate the Bayesian confidence intervals of the posterior distributions found by the variational Bayes EM algorithm. For now, we have only used the algorithm for model selection purposes and the posterior distributions should be better exploited. As mentioned already, it would be particularly relevant to develop a procedure to learn from a network whether the parameters U, V, W, and W* are zero or not. Finally, recent studies have extended existing random graph models in order to take covariate information about the edges or vertices into account. Others have focused on modelling the dynamics of networks, that is, the appearance of edges through time. We will investigate possible extensions of OSBM in order to take these features into account.


A  Mixture models

A.1 Factorization of the integrated complete-data likelihood

If the prior over the model parameters can be factorized, such that p(θ) = p(α)p(κ), then:

$$p(X, Z) = p(X\,|\,Z)\,p(Z).$$
Proof:
$$\begin{aligned}
p(X, Z) &= \int p(X, Z\,|\,\theta)\,p(\theta)\,d\theta\\
&= \int p(X\,|\,Z, \theta)\,p(Z\,|\,\theta)\,p(\theta)\,d\theta\\
&= \int p(X\,|\,Z, \kappa)\,p(Z\,|\,\alpha)\,p(\alpha)\,p(\kappa)\,d\alpha\,d\kappa\\
&= \left(\int p(X\,|\,Z, \kappa)\,p(\kappa)\,d\kappa\right)\left(\int p(Z\,|\,\alpha)\,p(\alpha)\,d\alpha\right)\\
&= p(X\,|\,Z)\,p(Z).
\end{aligned}$$

A.2 Exact expression of log p(Z)

Using a Jeffreys non-informative prior distribution for the class proportions α, Biernacki et al. (2000) showed that log p(Z) is given by:
$$\log p(Z) = \log\Gamma\Big(\frac{Q}{2}\Big) + \sum_{q=1}^{Q}\log\Gamma\Big(\frac{1}{2}+n_q\Big) - Q\log\Gamma\Big(\frac{1}{2}\Big) - \log\Gamma\Big(N+\frac{Q}{2}\Big), \qquad (\mathrm{A.1})$$
where $n_q = \sum_{i=1}^{N} Z_{iq}, \forall q$.

Proof: Let us first consider a Dirichlet prior distribution for the class proportions:
$$p(\alpha) = \mathrm{Dir}(\alpha;\, n^{0}) = D(n^{0})\prod_{q=1}^{Q}\alpha_q^{n_q^{0}-1},$$
with
$$D(n^{0}) = \frac{\Gamma\big(\sum_{q=1}^{Q} n_q^{0}\big)}{\prod_{q=1}^{Q}\Gamma(n_q^{0})}.$$


It leads to:
$$\begin{aligned}
\log p(Z) &= \log\left(\int p(Z\,|\,\alpha)\,p(\alpha)\,d\alpha\right)\\
&= \log\left(\int \prod_{i=1}^{N}\prod_{q=1}^{Q}\alpha_q^{Z_{iq}}\, D(n^{0})\prod_{q=1}^{Q}\alpha_q^{n_q^{0}-1}\,d\alpha\right)\\
&= \log\left(D(n^{0})\int \prod_{q=1}^{Q}\alpha_q^{\sum_{i=1}^{N} Z_{iq}}\prod_{q=1}^{Q}\alpha_q^{n_q^{0}-1}\,d\alpha\right)\\
&= \log\left(D(n^{0})\int \prod_{q=1}^{Q}\alpha_q^{n_q^{0}+n_q-1}\,d\alpha\right).
\end{aligned}$$

Therefore
$$\begin{aligned}
\log p(Z) &= \log\left(D(n^{0})\int\prod_{q=1}^{Q}\alpha_q^{n_q^{\mathrm{new}}-1}\,d\alpha\right)\\
&= \log\left(\frac{D(n^{0})}{D(n^{\mathrm{new}})}\int D(n^{\mathrm{new}})\prod_{q=1}^{Q}\alpha_q^{n_q^{\mathrm{new}}-1}\,d\alpha\right)\\
&= \log\frac{D(n^{0})}{D(n^{\mathrm{new}})}\underbrace{\int \mathrm{Dir}(\alpha;\, n^{\mathrm{new}})\,d\alpha}_{1}\\
&= \log\frac{D(n^{0})}{D(n^{\mathrm{new}})}\\
&= \log\Gamma\Big(\sum_{q=1}^{Q} n_q^{0}\Big) + \sum_{q=1}^{Q}\log\Gamma(n_q^{0}+n_q) - \sum_{q=1}^{Q}\log\Gamma(n_q^{0}) - \log\Gamma\Big(\sum_{q=1}^{Q}(n_q^{0}+n_q)\Big),
\end{aligned} \qquad (\mathrm{A.2})$$
where $n_q = \sum_{i=1}^{N} Z_{iq}, \forall q$. In (A.2), $n^{\mathrm{new}}$ denotes the Q-dimensional vector such that $n_q^{\mathrm{new}} = n_q^{0} + n_q, \forall q$. If we now consider a Jeffreys non-informative prior distribution (Robert 1994), which corresponds to a Dirichlet distribution with $n_q^{0} = 1/2, \forall q$, we obtain (A.1).

A.3 Asymptotic approximation of log p(Z) using the Stirling formula

Assuming that N and the nq's have large values (namely, the αq are far from 0), Biernacki et al. (2000) relied on the Stirling formula to obtain an approximation of log p(Z):
$$\log p(Z) \approx \max_{\alpha}\log p(Z\,|\,\alpha) - \frac{Q-1}{2}\log N = \sum_{q=1}^{Q} n_q\log\frac{n_q}{N} - \frac{Q-1}{2}\log N,$$
where $n_q^{0} = 1/2, \forall q$.


Proof: For large values of s, the Stirling formula approximates the Gamma function:
$$\Gamma(s+1) \approx (2\pi)^{\frac{1}{2}}\, s^{s+\frac{1}{2}}\exp(-s). \qquad (\mathrm{A.3})$$
Thus, using (A.3) in (A.1) and removing the terms in O(1) (assuming Q = O(N)) leads to:
$$\begin{aligned}
\log p(Z) &\approx \sum_{q=1}^{Q}\log\Big((2\pi)^{\frac{1}{2}}\big(n_q-\tfrac{1}{2}\big)^{n_q}\exp\big(-n_q+\tfrac{1}{2}\big)\Big)\\
&\quad - \log\Big((2\pi)^{\frac{1}{2}}\big(N+\tfrac{Q}{2}-1\big)^{N+\frac{Q}{2}-\frac{1}{2}}\exp\big(-N-\tfrac{Q}{2}+\tfrac{1}{2}\big)\Big)\\
&\approx \sum_{q=1}^{Q}\big(n_q\log n_q - n_q\big) - N\log N - \frac{Q-1}{2}\log N + N\\
&= \sum_{q=1}^{Q} n_q\log n_q - N\log N - \frac{Q-1}{2}\log N\\
&= \sum_{q=1}^{Q} n_q\log\frac{n_q}{N} - \frac{Q-1}{2}\log N\\
&= \max_{\alpha}\log p(Z\,|\,\alpha) - \frac{Q-1}{2}\log N.
\end{aligned}$$
Indeed, it is straightforward to see that the maximum of $\log p(Z\,|\,\alpha)$ is reached when $\alpha_q = \frac{n_q}{N}, \forall q$. Therefore:
$$\max_{\alpha}\log p(Z\,|\,\alpha) = \sum_{i=1}^{N}\sum_{q=1}^{Q} Z_{iq}\log\frac{n_q}{N} = \sum_{q=1}^{Q} n_q\log\frac{n_q}{N}.$$


B  SBM

B.1 Optimization of q(Zi)

The optimal approximation at vertex i is:
$$q(Z_i) = \mathcal{M}\big(Z_i;\, 1, \tau_i = \{\tau_{i1}, \dots, \tau_{iQ}\}\big), \qquad (\mathrm{B.1})$$
where $\tau_{iq}$ is the probability (responsibility) of node i to belong to class q. It satisfies the relation:
$$\tau_{iq} \propto e^{\psi(n_q) - \psi(\sum_{l=1}^{Q} n_l)}\prod_{j\neq i}^{N}\prod_{l=1}^{Q} e^{\tau_{jl}\big(\psi(\zeta_{ql}) - \psi(\eta_{ql}+\zeta_{ql}) + X_{ij}(\psi(\eta_{ql}) - \psi(\zeta_{ql}))\big)}, \qquad (\mathrm{B.2})$$
where $\psi(\cdot)$ is the digamma function. In order to optimize the distribution q(Z), we rely on a fixed-point algorithm. Thus, given a matrix $\tau^{\mathrm{old}}$, the algorithm builds a new matrix $\tau^{\mathrm{new}}$ where each row satisfies (B.2). After normalization, it then uses $\tau^{\mathrm{new}}$ to build a new matrix, and so on. The algorithm stops when $\sum_{i=1}^{N}\sum_{q=1}^{Q}|\tau_{iq}^{\mathrm{old}} - \tau_{iq}^{\mathrm{new}}| < \mathrm{eps}$. In the experiment section, we set eps = 1e−6.
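The convergence test of this fixed-point scheme can be written generically. The R sketch below iterates an arbitrary update function until the total absolute change in τ falls below eps; update_tau stands in for the normalized update (B.2), which is not implemented here.

# Generic fixed-point iteration on the N x Q matrix tau, stopping when the
# total absolute change falls below eps (as in the experiments, eps = 1e-6).
fixed_point_tau <- function(tau, update_tau, eps = 1e-6, max_iter = 100) {
  for (iter in seq_len(max_iter)) {
    tau_new <- update_tau(tau)            # one pass of (B.2), then row normalization
    if (sum(abs(tau - tau_new)) < eps) return(tau_new)
    tau <- tau_new
  }
  tau
}

# Toy usage with a dummy update that simply renormalizes the rows of tau.
tau0 <- matrix(runif(5 * 3), 5, 3)
fixed_point_tau(tau0, function(tau) tau / rowSums(tau))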

Proof: According to variational Bayes, the optimal distribution q(Zi)is given by:

log q(Zi) = EZ\i ,α,Π[log p(X, Z, α, Π)] + const

= EZ\i ,Π[log p(X |Z, π)] + E

Z\i ,α[log p(Z |α)] + const

= EZ\i ,Π[∑

i′<j∑q,l

Zi′qZjl

(

Xi′ j log πql + (1− Xi′ j) log(1− πql))

]

+ EZ\i ,α[

N

∑i′=1

Q

∑q=1

Zi′q log αq] + const

=Q

∑q=1

Ziq

(

Eαq [log αq] +N

∑j 6=i

Q

∑l=1

τjl

(

Xij(Eπql [log πql ]− Eπql [log(1− πql)]

)

+ Eπql [log(1− πql)]))

+ const

=Q

∑q=1

Ziq

(

ψ(nq)− ψ(N

∑l=1

nl) +N

∑j 6=i

Q

∑l=1

τjl

(

Xij(ψ(ηql)− ψ(ζql)

)

+ ψ(ζql)− ψ(ηql + ζql)))

+ const,

(B.3)


where Z\i denotes the class memberships of all nodes except node i. We have used E_y[log y] = ψ(a) − ψ(a + b) when y ∼ Beta(y; a, b). Moreover, to simplify the calculations, the terms that do not depend on Zi have been absorbed into the constant. Taking the exponential of (B.3) and after normalization, we obtain the multinomial distribution (B.1).

B.2 Optimization of q(α)

The optimization of the lower bound with respect to q(α) produces a distribution with the same functional form as the prior p(α):
$$q(\alpha) = \mathrm{Dir}(\alpha;\, n), \qquad (\mathrm{B.4})$$
where
$$n_q = n_q^{0} + \sum_{i=1}^{N}\tau_{iq}.$$

Proof: According to variational Bayes, the optimal distribution q(α) is given by:
$$\begin{aligned}
\log q(\alpha) &= \mathrm{E}_{Z, \Pi}[\log p(X, Z, \alpha, \Pi)] + \mathrm{const}\\
&= \mathrm{E}_{Z}[\log p(Z\,|\,\alpha)] + \log p(\alpha) + \mathrm{const}\\
&= \sum_{i=1}^{N}\sum_{q=1}^{Q}\tau_{iq}\log\alpha_q + \sum_{q=1}^{Q}(n_q^{0}-1)\log\alpha_q + \mathrm{const}\\
&= \sum_{q=1}^{Q}\Big(n_q^{0} - 1 + \sum_{i=1}^{N}\tau_{iq}\Big)\log\alpha_q + \mathrm{const}.
\end{aligned} \qquad (\mathrm{B.5})$$
Taking the exponential of (B.5) and after normalization, we obtain the Dirichlet distribution (B.4).

B.3 Optimization of q(Π)

Again, the functional form of the prior p(Π) is conserved through the variational optimization:
$$q(\Pi) = \prod_{q\leq l}^{Q}\mathrm{Beta}(\pi_{ql};\, \eta_{ql}, \zeta_{ql}). \qquad (\mathrm{B.6})$$
For q ≠ l, the hyperparameter ηql is given by:
$$\eta_{ql} = \eta_{ql}^{0} + \sum_{i\neq j}^{N} X_{ij}\,\tau_{iq}\tau_{jl},$$
and, ∀q,
$$\eta_{qq} = \eta_{qq}^{0} + \sum_{i<j}^{N} X_{ij}\,\tau_{iq}\tau_{jq}.$$
Moreover, for q ≠ l, the hyperparameter ζql is given by:
$$\zeta_{ql} = \zeta_{ql}^{0} + \sum_{i\neq j}^{N}(1-X_{ij})\,\tau_{iq}\tau_{jl},$$


and, ∀q,
$$\zeta_{qq} = \zeta_{qq}^{0} + \sum_{i<j}^{N}(1-X_{ij})\,\tau_{iq}\tau_{jq}.$$

Proof: According to variational Bayes, the optimal distribution q(Π) is given by:
$$\begin{aligned}
\log q(\Pi) &= \mathrm{E}_{Z, \alpha}[\log p(X, Z, \alpha, \Pi)] + \mathrm{const}\\
&= \mathrm{E}_{Z}[\log p(X\,|\,Z, \Pi)] + \log p(\Pi) + \mathrm{const}\\
&= \sum_{i<j}^{N}\sum_{q,l}^{Q}\tau_{iq}\tau_{jl}\big(X_{ij}\log\pi_{ql} + (1-X_{ij})\log(1-\pi_{ql})\big)\\
&\quad + \sum_{q\leq l}^{Q}\big((\eta_{ql}^{0}-1)\log\pi_{ql} + (\zeta_{ql}^{0}-1)\log(1-\pi_{ql})\big) + \mathrm{const}\\
&= \sum_{q<l}^{Q}\sum_{i\neq j}^{N}\tau_{iq}\tau_{jl}\big(X_{ij}\log\pi_{ql} + (1-X_{ij})\log(1-\pi_{ql})\big)\\
&\quad + \sum_{q=1}^{Q}\sum_{i<j}^{N}\tau_{iq}\tau_{jq}\big(X_{ij}\log\pi_{qq} + (1-X_{ij})\log(1-\pi_{qq})\big)\\
&\quad + \sum_{q\leq l}^{Q}\big((\eta_{ql}^{0}-1)\log\pi_{ql} + (\zeta_{ql}^{0}-1)\log(1-\pi_{ql})\big) + \mathrm{const}\\
&= \sum_{q<l}^{Q}\bigg(\Big(\eta_{ql}^{0} - 1 + \sum_{i\neq j}^{N}\tau_{iq}\tau_{jl}X_{ij}\Big)\log\pi_{ql} + \Big(\zeta_{ql}^{0} - 1 + \sum_{i\neq j}^{N}\tau_{iq}\tau_{jl}(1-X_{ij})\Big)\log(1-\pi_{ql})\bigg)\\
&\quad + \sum_{q=1}^{Q}\bigg(\Big(\eta_{qq}^{0} - 1 + \sum_{i<j}^{N}\tau_{iq}\tau_{jq}X_{ij}\Big)\log\pi_{qq} + \Big(\zeta_{qq}^{0} - 1 + \sum_{i<j}^{N}\tau_{iq}\tau_{jq}(1-X_{ij})\Big)\log(1-\pi_{qq})\bigg).
\end{aligned} \qquad (\mathrm{B.7})$$
Taking the exponential of (B.7) and after normalization, we obtain the product of Beta distributions (B.6).

B.4 Lower bound

The lower bound takes a simple form after the variational Bayes M-step. Indeed, it only depends on the posterior probabilities τiq as well as the normalizing constants of the Dirichlet and Beta distributions:
$$\mathcal{L}(q) = \log\left\{\frac{\Gamma\big(\sum_{q=1}^{Q} n_q^{0}\big)\prod_{q=1}^{Q}\Gamma(n_q)}{\Gamma\big(\sum_{q=1}^{Q} n_q\big)\prod_{q=1}^{Q}\Gamma(n_q^{0})}\right\} + \sum_{q\leq l}^{Q}\log\left\{\frac{\Gamma(\eta_{ql}^{0}+\zeta_{ql}^{0})\,\Gamma(\eta_{ql})\,\Gamma(\zeta_{ql})}{\Gamma(\eta_{ql}+\zeta_{ql})\,\Gamma(\eta_{ql}^{0})\,\Gamma(\zeta_{ql}^{0})}\right\} - \sum_{i=1}^{N}\sum_{q=1}^{Q}\tau_{iq}\log\tau_{iq}.$$


Proof: The lower bound is given by:
$$\begin{aligned}
\mathcal{L}(q) &= \sum_{Z}\int\!\!\int q(Z, \alpha, \Pi)\log\left\{\frac{p(X, Z, \alpha, \Pi)}{q(Z, \alpha, \Pi)}\right\}d\alpha\,d\Pi\\
&= \mathrm{E}_{Z,\Pi}[\log p(X\,|\,Z,\Pi)] + \mathrm{E}_{Z,\alpha}[\log p(Z\,|\,\alpha)] + \mathrm{E}_{\alpha}[\log p(\alpha)] + \mathrm{E}_{\Pi}[\log p(\Pi)]\\
&\quad - \sum_{i=1}^{N}\mathrm{E}_{Z_i}[\log q(Z_i)] - \mathrm{E}_{\alpha}[\log q(\alpha)] - \mathrm{E}_{\Pi}[\log q(\Pi)].
\end{aligned}$$
Expanding each expectation, using $\mathrm{E}[\log\alpha_q] = \psi(n_q) - \psi(\sum_{l}n_l)$, $\mathrm{E}[\log\pi_{ql}] = \psi(\eta_{ql}) - \psi(\eta_{ql}+\zeta_{ql})$ and $\mathrm{E}[\log(1-\pi_{ql})] = \psi(\zeta_{ql}) - \psi(\eta_{ql}+\zeta_{ql})$, and regrouping the terms, we obtain:
$$\begin{aligned}
\mathcal{L}(q) &= \sum_{q<l}^{Q}\bigg(\Big(\eta_{ql}^{0} - \eta_{ql} + \sum_{i\neq j}^{N}\tau_{iq}\tau_{jl}X_{ij}\Big)\big(\psi(\eta_{ql})-\psi(\eta_{ql}+\zeta_{ql})\big)\\
&\qquad\quad + \Big(\zeta_{ql}^{0} - \zeta_{ql} + \sum_{i\neq j}^{N}\tau_{iq}\tau_{jl}(1-X_{ij})\Big)\big(\psi(\zeta_{ql})-\psi(\eta_{ql}+\zeta_{ql})\big)\bigg)\\
&\quad + \sum_{q=1}^{Q}\bigg(\Big(\eta_{qq}^{0} - \eta_{qq} + \sum_{i<j}^{N}\tau_{iq}\tau_{jq}X_{ij}\Big)\big(\psi(\eta_{qq})-\psi(\eta_{qq}+\zeta_{qq})\big)\\
&\qquad\quad + \Big(\zeta_{qq}^{0} - \zeta_{qq} + \sum_{i<j}^{N}\tau_{iq}\tau_{jq}(1-X_{ij})\Big)\big(\psi(\zeta_{qq})-\psi(\eta_{qq}+\zeta_{qq})\big)\bigg)\\
&\quad + \sum_{q=1}^{Q}\Big(n_q^{0} - n_q + \sum_{i=1}^{N}\tau_{iq}\Big)\Big(\psi(n_q)-\psi\Big(\sum_{l=1}^{Q}n_l\Big)\Big)\\
&\quad + \log\left\{\frac{\Gamma\big(\sum_{q=1}^{Q}n_q^{0}\big)\prod_{q=1}^{Q}\Gamma(n_q)}{\Gamma\big(\sum_{q=1}^{Q}n_q\big)\prod_{q=1}^{Q}\Gamma(n_q^{0})}\right\} + \sum_{q\leq l}^{Q}\log\left\{\frac{\Gamma(\eta_{ql}^{0}+\zeta_{ql}^{0})\,\Gamma(\eta_{ql})\,\Gamma(\zeta_{ql})}{\Gamma(\eta_{ql}+\zeta_{ql})\,\Gamma(\eta_{ql}^{0})\,\Gamma(\zeta_{ql}^{0})}\right\}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\tau_{iq}\log\tau_{iq}.
\end{aligned} \qquad (\mathrm{B.8})$$


After the variational Bayes M-step, most of the terms in the lower bound vanish since:

• ∀q : n_q = n_q^0 + Σ_{i=1}^N τ_iq,
• ∀q ≠ l : η_ql = η_ql^0 + Σ_{i≠j}^N X_ij τ_iq τ_jl,
• ∀q : η_qq = η_qq^0 + Σ_{i<j}^N X_ij τ_iq τ_jq,
• ∀q ≠ l : ζ_ql = ζ_ql^0 + Σ_{i≠j}^N (1 − X_ij) τ_iq τ_jl,
• ∀q : ζ_qq = ζ_qq^0 + Σ_{i<j}^N (1 − X_ij) τ_iq τ_jq.

Only the terms depending on the probabilities τiq and the normalizing constants of the Dirichlet and Beta distributions remain.


C  OSBM

C.1 First lower bound

The lower bound defined in (3.11) can be written:
$$\begin{aligned}
\mathcal{L}_{ML}(q;\, \alpha, W) &= \sum_{i\neq j}^{N}\Big\{X_{ij}\,\tau_i^{\intercal} W \tau_j + \mathrm{E}_{Z_i,Z_j}[\log g(-a_{ij})]\Big\}\\
&\quad + \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\alpha_q + (1-\tau_{iq})\log(1-\alpha_q)\big\}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\}.
\end{aligned}$$

Proof: The lower bound can be decomposed into:
$$\begin{aligned}
\mathcal{L}_{ML}(q;\, \alpha, W) &= \sum_{Z} q(Z)\log p(X, Z\,|\,\alpha, W) - \sum_{Z} q(Z)\log q(Z)\\
&= \mathrm{E}_{Z}[\log p(X, Z\,|\,\alpha, W)] - \mathrm{E}_{Z}[\log q(Z)]\\
&= \mathrm{E}_{Z}[\log p(X\,|\,Z, W)] + \mathrm{E}_{Z}[\log p(Z\,|\,\alpha)] - \mathrm{E}_{Z}[\log q(Z)],
\end{aligned} \qquad (\mathrm{C.1})$$
where the expectations are taken with respect to the distribution q(Z) and the last term of (C.1) is an entropy term. Using (3.13), we obtain:
$$\begin{aligned}
\mathcal{L}_{ML}(q;\, \alpha, W) &= \sum_{i\neq j}^{N}\Big\{X_{ij}\,\mathrm{E}_{Z_i,Z_j}[a_{ij}] + \mathrm{E}_{Z_i,Z_j}[\log g(-a_{ij})]\Big\}\\
&\quad + \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\mathrm{E}_{Z_{iq}}[Z_{iq}]\log\alpha_q + (1-\mathrm{E}_{Z_{iq}}[Z_{iq}])\log(1-\alpha_q)\big\}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\mathrm{E}_{Z_{iq}}[Z_{iq}]\log\tau_{iq} + (1-\mathrm{E}_{Z_{iq}}[Z_{iq}])\log(1-\tau_{iq})\big\}\\
&= \sum_{i\neq j}^{N}\Big\{X_{ij}\,\tau_i^{\intercal} W \tau_j + \mathrm{E}_{Z_i,Z_j}[\log g(-a_{ij})]\Big\}\\
&\quad + \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\alpha_q + (1-\tau_{iq})\log(1-\alpha_q)\big\}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\}.
\end{aligned} \qquad (\mathrm{C.2})$$


C.2 Second lower bound

Using local variational approximations, a tractable lower bound can be obtained:
$$\begin{aligned}
\mathcal{L}_{ML}(q;\, \alpha, W, \xi) &= \sum_{i\neq j}^{N}\bigg\{\Big(X_{ij}-\frac{1}{2}\Big)\tau_i^{\intercal} W \tau_j + \log g(\xi_{ij}) - \frac{\xi_{ij}}{2}\\
&\qquad\quad - \lambda(\xi_{ij})\Big(\mathrm{Tr}\big(W^{\intercal} E_i W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j - \xi_{ij}^{2}\Big)\bigg\}\\
&\quad + \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\alpha_q + (1-\tau_{iq})\log(1-\alpha_q)\big\}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\},
\end{aligned}$$
where $E_i = \mathrm{E}_{Z_i}[Z_i Z_i^{\intercal}] = \Sigma_i + \tau_i\tau_i^{\intercal}$ and:
$$\Sigma_i = \begin{pmatrix}\mathrm{var}(Z_i) & 0\\ 0 & 0\end{pmatrix}, \quad \forall i.$$

Proof: As noticed in Section 3.5, the first lower bound is a function of the expectations $\mathrm{E}_{Z_i,Z_j}[\log g(-a_{ij})]$, which are intractable. In order to compute a second, tractable lower bound, we consider the bound $\log g(x, \xi)$ on the log-logistic function:
$$\log g(x) \geq \log g(x, \xi) = \log g(\xi) + \frac{x-\xi}{2} - \lambda(\xi)(x^{2}-\xi^{2}), \quad \forall x, \xi \in \mathbb{R}, \qquad (\mathrm{C.3})$$
where $\lambda(\xi) = \frac{1}{4\xi}\tanh\big(\frac{\xi}{2}\big) = \frac{1}{2\xi}\big\{g(\xi) - \frac{1}{2}\big\}$ and ξ is a variational parameter. This bound was first introduced by Jaakkola and Jordan (2000), in the framework of Bayesian logistic regression, to obtain a tractable approximation of the marginal likelihood. It is based on a symmetrization of the log-logistic function and a Taylor expansion in the variable x². It leads to:
$$\log g(-a_{ij}) = \log g(-Z_i^{\intercal} W Z_j) \geq \log g(-Z_i^{\intercal} W Z_j, \xi_{ij}),$$
where
$$\log g(-Z_i^{\intercal} W Z_j, \xi_{ij}) = \log g(\xi_{ij}) - \frac{Z_i^{\intercal} W Z_j + \xi_{ij}}{2} - \lambda(\xi_{ij})\Big((Z_i^{\intercal} W Z_j)^{2} - \xi_{ij}^{2}\Big).$$
Therefore, we have:
$$\begin{aligned}
\mathrm{E}_{Z_i,Z_j}[\log g(-a_{ij})] &= \sum_{Z_i,Z_j\in\{0,1\}^{Q}}\log g(-a_{ij})\,q(Z_i)\,q(Z_j)\\
&\geq \log g(\xi_{ij}) - \frac{\tau_i^{\intercal} W \tau_j + \xi_{ij}}{2} - \lambda(\xi_{ij})\Big(\mathrm{E}_{Z_i,Z_j}[(Z_i^{\intercal} W Z_j)^{2}] - \xi_{ij}^{2}\Big).
\end{aligned}$$
The expectation terms are now tractable:
$$\begin{aligned}
\mathrm{E}_{Z_i,Z_j}[(Z_i^{\intercal} W Z_j)^{2}] &= \mathrm{E}_{Z_i,Z_j}[Z_j^{\intercal} W^{\intercal} Z_i Z_i^{\intercal} W Z_j]\\
&= \mathrm{E}_{Z_j}[Z_j^{\intercal} W^{\intercal}(\Sigma_i + \tau_i\tau_i^{\intercal}) W Z_j]\\
&= \mathrm{Tr}\big(W^{\intercal}(\Sigma_i + \tau_i\tau_i^{\intercal}) W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal}(\Sigma_i + \tau_i\tau_i^{\intercal}) W \tau_j,
\end{aligned}$$
where
$$\Sigma_i = \begin{pmatrix}\mathrm{var}(Z_i) & 0\\ 0 & 0\end{pmatrix}, \quad \forall i.$$
We have used the property that, for any matrix A,
$$\mathrm{E}[Z_j^{\intercal} A Z_j] = \mathrm{Tr}\big(A\,\mathrm{var}(Z_j)\big) + \mathrm{E}[Z_j]^{\intercal} A\,\mathrm{E}[Z_j].$$
In the following, and in order to simplify the notations, we denote $E_i = \mathrm{E}_{Z_i}[Z_i Z_i^{\intercal}] = \Sigma_i + \tau_i\tau_i^{\intercal}$. Thus:
$$\mathrm{E}_{Z_i,Z_j}[(Z_i^{\intercal} W Z_j)^{2}] = \mathrm{Tr}\big(W^{\intercal} E_i W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j.$$
We eventually get the expression of the tractable second lower bound $\mathcal{L}_{ML}(q;\, \alpha, W, \xi)$ given above, with
$$\log p(X\,|\,\alpha, W) \geq \mathcal{L}_{ML}(q;\, \alpha, W) \geq \mathcal{L}_{ML}(q;\, \alpha, W, \xi).$$

C.3 Optimization of ξij

An estimate of ξij is given by:
$$\hat{\xi}_{ij} = \sqrt{\mathrm{Tr}\big(W^{\intercal} E_i W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j}.$$


Proof: The partial derivative of the lower bound with respect to ξij is given by:
$$\begin{aligned}
\frac{\partial\mathcal{L}_{ML}}{\partial\xi_{ij}}(q;\, \alpha, W, \xi) &= g(-\xi_{ij}) - \frac{1}{2} - \lambda'(\xi_{ij})\Big(\mathrm{Tr}\big(W^{\intercal} E_i W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j - \xi_{ij}^{2}\Big) + 2\xi_{ij}\lambda(\xi_{ij})\\
&= -\lambda'(\xi_{ij})\Big(\mathrm{Tr}\big(W^{\intercal} E_i W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j - \xi_{ij}^{2}\Big),
\end{aligned} \qquad (\mathrm{C.4})$$
where we have used the properties $(\log g)'(\xi_{ij}) = g(-\xi_{ij})$ and $g(\xi_{ij}) + g(-\xi_{ij}) = 1$. Since each bound $\log g(-a_{ij}, \xi_{ij})$ is an even function with respect to ξij, we can consider only positive values of ξij without loss of generality. Therefore, we have $\lambda'(\xi_{ij}) \neq 0$ since $\lambda(\xi_{ij})$ is a strictly decreasing function on this domain. Finally, if we set the derivative (C.4) of the lower bound to zero, we obtain:
$$\hat{\xi}_{ij}^{2} = \mathrm{Tr}\big(W^{\intercal} E_i W \Sigma_j\big) + \tau_j^{\intercal} W^{\intercal} E_i W \tau_j.$$

C.4 Optimization of αq

An estimate of αq is given by:
$$\hat{\alpha}_q = \frac{\sum_{i=1}^{N}\tau_{iq}}{N}.$$
Proof: If we set the partial derivative of the lower bound with respect to αq to zero, we obtain:
$$\frac{\partial\mathcal{L}_{ML}}{\partial\alpha_q}(q;\, \alpha, W, \xi) = \sum_{i=1}^{N}\left\{\frac{\tau_{iq}}{\alpha_q} - \frac{1-\tau_{iq}}{1-\alpha_q}\right\} = 0.$$
Thus,
$$(1-\alpha_q)\sum_{i=1}^{N}\tau_{iq} = \alpha_q\sum_{i=1}^{N}(1-\tau_{iq}).$$
This leads to
$$\sum_{i=1}^{N}\tau_{iq} = \alpha_q N, \quad \text{and} \quad \hat{\alpha}_q = \frac{\sum_{i=1}^{N}\tau_{iq}}{N}.$$

C.5 Optimization of W

An estimate of vec(W) is given by:
$$\mathrm{vec}(W) = \left\{2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\big(E_j\otimes E_i\big)\right\}^{-1}\left\{\sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\big(\tau_j\otimes\tau_i\big)\right\},$$


where vec denotes the operator which stacks the columns of a matrix into a vector.

Proof: The gradient of the lower bound with respect to the matrix W is given by:
$$\nabla_{W}\mathcal{L}_{ML}(q;\, \alpha, W, \xi) = \sum_{i\neq j}^{N}\Big\{\Big(X_{ij}-\frac{1}{2}\Big)\tau_i\tau_j^{\intercal} - 2\lambda(\xi_{ij})\big(E_i W \Sigma_j + E_i W \tau_j\tau_j^{\intercal}\big)\Big\},$$
since, for all symmetric matrices B and C,
$$\nabla_{W}\,\mathrm{Tr}(W^{\intercal} B W C) = B W C + B^{\intercal} W C^{\intercal} = 2\, B W C,$$
and, for any vector b,
$$\nabla_{W}\, b^{\intercal} W^{\intercal} B W b = B^{\intercal} W b\, b^{\intercal} + B W b\, b^{\intercal} = 2\, B W b\, b^{\intercal}.$$
Finally, we obtain:
$$\nabla_{W}\mathcal{L}_{ML}(q;\, \alpha, W, \xi) = \sum_{i\neq j}^{N}\Big\{\Big(X_{ij}-\frac{1}{2}\Big)\tau_i\tau_j^{\intercal} - 2\lambda(\xi_{ij})\, E_i W E_j\Big\}.$$
Therefore, the matrix W which maximizes the lower bound satisfies:
$$2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\, E_i W E_j = \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\tau_i\tau_j^{\intercal}.$$
This implies:
$$\mathrm{vec}\Big\{2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\, E_i W E_j\Big\} = \mathrm{vec}\Big\{\sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\tau_i\tau_j^{\intercal}\Big\},$$
and
$$2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\,\mathrm{vec}\big(E_i W E_j\big) = \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\mathrm{vec}\big(\tau_i\tau_j^{\intercal}\big). \qquad (\mathrm{C.5})$$
From (C.5), we obtain:
$$2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\big(E_j\otimes E_i\big)\mathrm{vec}(W) = \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\big(\tau_j\otimes\tau_i\big),$$
since Ej is a symmetric matrix and, for any matrices B and C,
$$\mathrm{vec}\big(B W C\big) = \big(C^{\intercal}\otimes B\big)\mathrm{vec}(W).$$
Moreover, for any vectors b and c,
$$\mathrm{vec}\big(c\, b^{\intercal}\big) = b\otimes c.$$
Therefore, an estimate of vec(W) is given by:
$$\mathrm{vec}(W) = \left\{2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\big(E_j\otimes E_i\big)\right\}^{-1}\left\{\sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\big(\tau_j\otimes\tau_i\big)\right\}.$$


D  Bayesian OSBM

D.1 Lower Bound

Given an N × N positive real matrix ξ, a lower bound of the first lower bound can be computed:
$$\log p(X) \geq \mathcal{L}(q) \geq \mathcal{L}(q;\, \xi),$$
where
$$\mathcal{L}(q;\, \xi) = \sum_{Z} \int\!\!\int q(Z, \alpha, W) \log\left(\frac{h(Z, W, \xi)\, p(Z\,|\,\alpha)\, p(\alpha)\, p(W)}{q(Z, \alpha, W)}\right) d\alpha\, dW,$$
and
$$\log h(Z, W, \xi) = \sum_{i\neq j}^{N}\left\{\Big(X_{ij}-\frac{1}{2}\Big)\, a_{Z_i,Z_j} - \frac{\xi_{ij}}{2} + \log g(\xi_{ij}) - \lambda(\xi_{ij})\big(a_{Z_i,Z_j}^{2} - \xi_{ij}^{2}\big)\right\}.$$

Proof: Let us start by showing that:
$$\log p(X\,|\,Z, W) \geq \log h(Z, W, \xi),$$
where ξ is an N × N positive real matrix. We use the bound on the log-logistic function introduced by Jaakkola and Jordan (2000):
$$\log g(x) \geq \log g(\xi) + \frac{x-\xi}{2} - \lambda(\xi)(x^{2}-\xi^{2}), \quad \forall (x, \xi) \in \mathbb{R}\times\mathbb{R}^{+}, \qquad (\mathrm{D.1})$$
where $\lambda(\xi) = \big(g(\xi)-1/2\big)/(2\xi)$. Note that (D.1) is an even function of ξ and, therefore, we can consider only positive values of ξ without loss of generality. Since
$$\log p(X_{ij}\,|\,Z_i, Z_j, W) = X_{ij}\, a_{Z_i,Z_j} + \log g(-a_{Z_i,Z_j}),$$
then
$$\begin{aligned}
\log p(X_{ij}\,|\,Z_i, Z_j, W) &\geq X_{ij}\, a_{Z_i,Z_j} + \log g(\xi_{ij}) - \frac{a_{Z_i,Z_j} + \xi_{ij}}{2} - \lambda(\xi_{ij})\big(a_{Z_i,Z_j}^{2} - \xi_{ij}^{2}\big)\\
&= \Big(X_{ij}-\frac{1}{2}\Big)\, a_{Z_i,Z_j} - \frac{\xi_{ij}}{2} + \log g(\xi_{ij}) - \lambda(\xi_{ij})\big(a_{Z_i,Z_j}^{2} - \xi_{ij}^{2}\big).
\end{aligned} \qquad (\mathrm{D.2})$$
Following (4.1):
$$\log p(X\,|\,Z, W) = \sum_{i\neq j}^{N}\log p(X_{ij}\,|\,Z_i, Z_j, W).$$


Page 137: Modèles de graphes aléatoires à structure cachée pour l ...

126 Appendix D. Bayesian OSBM

Thereforelog p(X |Z, W) ≥ log h(Z, W, ξ).

We recall that the lower bound L(q) is given by:

L(q) = ∑Z

∫ ∫

q(Z, α, W) log{

p(X, Z, α, W)

q(Z, α, W)

}

= ∑Z

∫ ∫

q(Z, α, W) log p(X |Z, W) + ∑Z

∫ ∫

q(Z, α, W) log(

p(Z | α)p(α)p(W))

− q(Z, α, W) log q(Z, α, W)

≥∑Z

∫ ∫

q(Z, α, W) log h(Z, W, ξ) + ∑Z

∫ ∫

q(Z, α, W) log(

p(Z | α)p(α)p(W))

− q(Z, α, W) log q(Z, α, W)

= ∑Z

∫ ∫

q(Z, α, W) log(h(Z, W, ξ)p(Z | α)p(α)p(W)

q(Z, α, W)

)d α d W

= L(q; ξ).

Finallylog p(X) ≥ L(q) ≥ L(q; ξ).
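As a purely illustrative numerical check (not from the thesis), the Jaakkola and Jordan (2000) bound (D.1) can be verified in NumPy; the grid of x values and the choices of ξ below are arbitrary:

    import numpy as np

    def log_g(x):
        """log of the logistic function g(x) = 1 / (1 + exp(-x))."""
        return -np.log1p(np.exp(-x))

    def lam(xi):
        return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

    def jj_bound(x, xi):
        """Right-hand side of (D.1)."""
        return log_g(xi) + 0.5 * (x - xi) - lam(xi) * (x ** 2 - xi ** 2)

    x = np.linspace(-5.0, 5.0, 11)
    for xi in (0.5, 1.0, 3.0):
        assert np.all(log_g(x) >= jj_bound(x, xi) - 1e-12)   # bound holds for every x
        assert np.isclose(log_g(xi), jj_bound(xi, xi))        # and is tight at x = xi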

D.2 Optimization of q(α)

The optimization of the lower bound with respect to q(α) produces a distribution with the same functional form as the prior p(α):
\[
q(\alpha) = \prod_{q=1}^{Q}\mathrm{Beta}(\alpha_q;\,\eta_q^{N}, \zeta_q^{N}),
\]
where
\[
\eta_q^{N} = \eta_q^{0} + \sum_{i=1}^{N}\tau_{iq},
\]
and
\[
\zeta_q^{N} = \zeta_q^{0} + N - \sum_{i=1}^{N}\tau_{iq}.
\]

Proof: According to variational Bayes, the optimal distribution q(α) is given by:
\[
\begin{aligned}
\log q(\alpha) &= \mathrm{E}_{Z,W}\big[\log\big(h(Z,W,\xi)\,p(Z\,|\,\alpha)\,p(\alpha)\,p(W)\big)\big] + \mathrm{const}\\
&= \mathrm{E}_{Z}[\log p(Z\,|\,\alpha)] + \log p(\alpha) + \mathrm{const}\\
&= \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\alpha_q + (1-\tau_{iq})\log(1-\alpha_q)\big\}
+ \sum_{q=1}^{Q}\big\{(\eta_q^{0}-1)\log\alpha_q + (\zeta_q^{0}-1)\log(1-\alpha_q)\big\} + \mathrm{const}\\
&= \sum_{q=1}^{Q}\left\{\Big(\eta_q^{0} + \sum_{i=1}^{N}\tau_{iq} - 1\Big)\log\alpha_q
+ \Big(\zeta_q^{0} + N - \sum_{i=1}^{N}\tau_{iq} - 1\Big)\log(1-\alpha_q)\right\} + \mathrm{const}.
\end{aligned}
\tag{D.3}
\]
The functional form of (D.3) corresponds to the logarithm of a product of Beta distributions.
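A minimal NumPy sketch of this update (illustrative only; the prior hyperparameters and the matrix tau below are hypothetical):

    import numpy as np

    eta0, zeta0 = np.ones(2), np.ones(2)     # uniform Beta(1, 1) priors on alpha_q
    tau = np.array([[0.9, 0.1],
                    [0.8, 0.3],
                    [0.2, 0.7],
                    [0.1, 0.9]])
    N = tau.shape[0]

    # Posterior parameters of q(alpha_q) = Beta(eta_q^N, zeta_q^N).
    etaN = eta0 + tau.sum(axis=0)
    zetaN = zeta0 + N - tau.sum(axis=0)
    print(etaN, zetaN)                       # [3. 3.] [3. 3.]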


D.3 Optimization of q(W)

The optimization of the lower bound with respect to q(W) produces a distribution with the same functional form as the prior p(W):
\[
q(W^{\mathrm{vec}}) = \mathcal{N}(W^{\mathrm{vec}};\,W_N^{\mathrm{vec}}, S_N),
\]
with
\[
S_N^{-1} = S_0^{-1} + 2\sum_{i\neq j}^{N}\lambda(\xi_{ij})\,(E_j\otimes E_i),
\]
and
\[
W_N^{\mathrm{vec}} = S_N\left\{S_0^{-1}W_0^{\mathrm{vec}}
+ \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\,\tau_j\otimes\tau_i\right\}.
\]
Each (Q + 1) × (Q + 1) probability matrix $E_i$ satisfies:
\[
E_i = \mathrm{E}_{Z_i}[Z_iZ_i^\intercal] =
\begin{pmatrix}
\tau_{i1} & \tau_{i1}\tau_{i2} & \dots & \tau_{i1}\tau_{iQ} & \tau_{i1}\\
\tau_{i2}\tau_{i1} & \tau_{i2} & \dots & \tau_{i2}\tau_{iQ} & \tau_{i2}\\
\vdots & & & & \vdots\\
\tau_{iQ}\tau_{i1} & \tau_{iQ}\tau_{i2} & \dots & \tau_{iQ} & \tau_{iQ}\\
\tau_{i1} & \tau_{i2} & \dots & \tau_{iQ} & 1
\end{pmatrix}.
\]

Proof: According to variational Bayes, the optimal distribution q(W) is given by:
\[
\begin{aligned}
\log q(W^{\mathrm{vec}}) &= \mathrm{E}_{Z,\alpha}\big[\log\big(h(Z,W,\xi)\,p(Z\,|\,\alpha)\,p(\alpha)\,p(W)\big)\big] + \mathrm{const}\\
&= \mathrm{E}_{Z}[\log h(Z,W,\xi)] + \log p(W^{\mathrm{vec}}) + \mathrm{const}\\
&= \sum_{i\neq j}^{N}\left\{\Big(X_{ij}-\frac{1}{2}\Big)\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}]
- \lambda(\xi_{ij})\,\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}^{2}]\right\}\\
&\quad + (W^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(W^{\mathrm{vec}})^\intercal S_0^{-1}W^{\mathrm{vec}} + \mathrm{const}.
\end{aligned}
\tag{D.4}
\]
$\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}]$ is given by:
\[
\begin{aligned}
\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}] &= \mathrm{E}_{Z_i,Z_j}[Z_i^\intercal W Z_j]\\
&= \tau_i^\intercal W\tau_j\\
&= (\tau_j\otimes\tau_i)^\intercal W^{\mathrm{vec}}\\
&= (W^{\mathrm{vec}})^\intercal(\tau_j\otimes\tau_i).
\end{aligned}
\tag{D.5}
\]
$\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}^{2}]$ is given by:
\[
\begin{aligned}
\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}^{2}] &= \mathrm{E}_{Z_i,Z_j}[(Z_i^\intercal W Z_j)^{2}]\\
&= \mathrm{E}_{Z_i,Z_j}\big[\big((Z_j\otimes Z_i)^\intercal W^{\mathrm{vec}}\big)^{2}\big]\\
&= \mathrm{E}_{Z_i,Z_j}\big[(Z_j\otimes Z_i)^\intercal W^{\mathrm{vec}}\,(Z_j\otimes Z_i)^\intercal W^{\mathrm{vec}}\big]\\
&= \mathrm{E}_{Z_i,Z_j}\big[(W^{\mathrm{vec}})^\intercal(Z_j\otimes Z_i)(Z_j\otimes Z_i)^\intercal W^{\mathrm{vec}}\big]\\
&= \mathrm{E}_{Z_i,Z_j}\big[(W^{\mathrm{vec}})^\intercal\big((Z_jZ_j^\intercal)\otimes(Z_iZ_i^\intercal)\big)W^{\mathrm{vec}}\big]\\
&= (W^{\mathrm{vec}})^\intercal(E_j\otimes E_i)\,W^{\mathrm{vec}}.
\end{aligned}
\tag{D.6}
\]
Using (D.5) and (D.6) in (D.4), we obtain:
\[
\begin{aligned}
\log q(W^{\mathrm{vec}}) &= (W^{\mathrm{vec}})^\intercal\left\{S_0^{-1}W_0^{\mathrm{vec}}
+ \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)(\tau_j\otimes\tau_i)\right\}\\
&\quad - (W^{\mathrm{vec}})^\intercal\left\{\frac{1}{2}S_0^{-1}
+ \sum_{i\neq j}^{N}\lambda(\xi_{ij})\,(E_j\otimes E_i)\right\}W^{\mathrm{vec}} + \mathrm{const}.
\end{aligned}
\tag{D.7}
\]
The functional form of (D.7) corresponds to the logarithm of a Gaussian distribution with mean $W_N^{\mathrm{vec}}$ and covariance matrix $S_N$.
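The following sketch (not part of the thesis) assembles $S_N^{-1}$ and $W_N^{\mathrm{vec}}$ as above; the input names S0_inv, W0_vec, X, tau, and xi are assumptions standing for the prior specification and the other variational quantities:

    import numpy as np

    def E_mat(tau_i):
        """E_i = E[Z_i Z_i^T]: outer product with the diagonal replaced by tau_i
        (the last component of tau_i is assumed to be 1)."""
        E = np.outer(tau_i, tau_i)
        np.fill_diagonal(E, tau_i)
        return E

    def lam(xi):
        return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

    def update_qW(X, tau, xi, S0_inv, W0_vec):
        """Posterior precision S_N^{-1} and mean W_N^vec of Section D.3 (sketch)."""
        N, D = tau.shape                      # D = Q + 1
        E = [E_mat(tau[i]) for i in range(N)]
        SN_inv = S0_inv.copy()
        rhs = S0_inv @ W0_vec
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                SN_inv += 2.0 * lam(xi[i, j]) * np.kron(E[j], E[i])
                rhs += (X[i, j] - 0.5) * np.kron(tau[j], tau[i])
        SN = np.linalg.inv(SN_inv)
        WN_vec = SN @ rhs
        return SN, WN_vec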

D.4 Optimization of q(Ziq)

The optimization of the lower bound with respect to q(Z_iq) produces a distribution with the same functional form as the prior p(Z_iq | α):
\[
q(Z_{iq}) = \mathcal{B}(Z_{iq};\,\tau_{iq}),
\]
where
\[
\begin{aligned}
\tau_{iq} = g\Bigg\{&\psi(\eta_q^{N})-\psi(\zeta_q^{N})
+ \sum_{j\neq i}^{N}\Big(X_{ij}-\frac{1}{2}\Big)\tau_j^\intercal(W_N^\intercal)_{\cdot q}
+ \sum_{j\neq i}^{N}\Big(X_{ji}-\frac{1}{2}\Big)\tau_j^\intercal(W_N)_{\cdot q}\\
&- \mathrm{Tr}\Bigg(\Big(\Sigma'_{qq} + 2\sum_{l\neq q}^{Q+1}\tau_{il}\Sigma'_{ql}\Big)
\Big(\sum_{j\neq i}^{N}\lambda(\xi_{ij})\,E_j\Big)
+ \Big(\Sigma_{qq} + 2\sum_{l\neq q}^{Q+1}\tau_{il}\Sigma_{ql}\Big)
\Big(\sum_{j\neq i}^{N}\lambda(\xi_{ji})\,E_j\Big)\Bigg)\Bigg\},
\end{aligned}
\]
and $\Sigma_{ql} = \mathrm{E}_{W_{\cdot q},W_{\cdot l}}[W_{\cdot q}W_{\cdot l}^\intercal]$,
$\Sigma'_{ql} = \mathrm{E}_{W_{q\cdot},W_{l\cdot}}[W_{q\cdot}^\intercal W_{l\cdot}]$.

Proof: According to variational Bayes, the optimal distribution q(Z_bc) is given by:
\[
\log q(Z_{bc}) = \mathrm{E}_{Z^{\backslash bc},\alpha,W}\big[\log\big(h(Z,W,\xi)\,p(Z\,|\,\alpha)\,p(\alpha)\,p(W)\big)\big] + \mathrm{const},
\]
where $Z^{\backslash bc}$ is the set of all class memberships except $Z_{bc}$. Thus
\[
\log q(Z_{bc}) = \mathrm{E}_{Z^{\backslash bc},W}[\log h(Z,W,\xi)]
+ \mathrm{E}_{Z^{\backslash bc},\alpha}[\log p(Z\,|\,\alpha)] + \mathrm{const}.
\]
$\mathrm{E}_{Z^{\backslash bc},\alpha}[\log p(Z\,|\,\alpha)]$ is given by:
\[
\begin{aligned}
\mathrm{E}_{Z^{\backslash bc},\alpha}[\log p(Z\,|\,\alpha)]
&= Z_{bc}\,\mathrm{E}_{\alpha_c}[\log\alpha_c] + (1-Z_{bc})\,\mathrm{E}_{\alpha_c}[\log(1-\alpha_c)] + \mathrm{const}\\
&= Z_{bc}\big(\psi(\eta_c^{N})-\psi(\eta_c^{N}+\zeta_c^{N})\big)
+ (1-Z_{bc})\big(\psi(\zeta_c^{N})-\psi(\eta_c^{N}+\zeta_c^{N})\big) + \mathrm{const}\\
&= Z_{bc}\big(\psi(\eta_c^{N})-\psi(\zeta_c^{N})\big) + \mathrm{const},
\end{aligned}
\]
where ψ(·) is the digamma function (the logarithmic derivative of the gamma function Γ(·), which appears in the normalizing constants of the Beta distributions).
\[
\begin{aligned}
\mathrm{E}_{Z^{\backslash bc},W}[\log h(Z,W,\xi)]
&= \sum_{i\neq j}^{N}\left\{\Big(X_{ij}-\frac{1}{2}\Big)\mathrm{E}_{Z^{\backslash bc},W}[a_{Z_i,Z_j}]
- \lambda(\xi_{ij})\,\mathrm{E}_{Z^{\backslash bc},W}[a_{Z_i,Z_j}^{2}]\right\} + \mathrm{const}\\
&= \sum_{j\neq b}^{N}\left\{\Big(X_{bj}-\frac{1}{2}\Big)\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_b,Z_j}]
- \lambda(\xi_{bj})\,\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_b,Z_j}^{2}]\right\}\\
&\quad + \sum_{i\neq b}^{N}\left\{\Big(X_{ib}-\frac{1}{2}\Big)\mathrm{E}_{Z^{\backslash bc},Z_i,W}[a_{Z_i,Z_b}]
- \lambda(\xi_{ib})\,\mathrm{E}_{Z^{\backslash bc},Z_i,W}[a_{Z_i,Z_b}^{2}]\right\} + \mathrm{const}\\
&= \sum_{j\neq b}^{N}\left\{\Big(X_{bj}-\frac{1}{2}\Big)\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_b,Z_j}]
+ \Big(X_{jb}-\frac{1}{2}\Big)\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_j,Z_b}]\right.\\
&\qquad\quad \left.- \lambda(\xi_{bj})\,\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_b,Z_j}^{2}]
- \lambda(\xi_{jb})\,\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_j,Z_b}^{2}]\right\} + \mathrm{const}.
\end{aligned}
\]
\[
\begin{aligned}
\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_b,Z_j}]
&= \mathrm{E}_{Z^{\backslash bc},Z_j,W}\Big[\sum_{q,l}^{Q+1}Z_{bq}W_{ql}Z_{jl}\Big]\\
&= Z_{bc}\sum_{l=1}^{Q+1}\mathrm{E}_{W_{cl}}[W_{cl}]\,\tau_{jl} + \mathrm{const}\\
&= Z_{bc}\,\tau_j^\intercal(W_N^\intercal)_{\cdot c} + \mathrm{const}.
\end{aligned}
\]
\[
\begin{aligned}
\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_j,Z_b}]
&= \mathrm{E}_{Z^{\backslash bc},Z_j,W}\Big[\sum_{q,l}^{Q+1}Z_{jq}W_{ql}Z_{bl}\Big]\\
&= Z_{bc}\sum_{l=1}^{Q+1}\mathrm{E}_{W_{lc}}[W_{lc}]\,\tau_{jl} + \mathrm{const}\\
&= Z_{bc}\,\tau_j^\intercal(W_N)_{\cdot c} + \mathrm{const}.
\end{aligned}
\]
\[
\begin{aligned}
\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_j,Z_b}^{2}]
&= \mathrm{E}_{Z^{\backslash bc},Z_j,W}\Big[\Big(\sum_{q,l}^{Q+1}Z_{jq}W_{ql}Z_{bl}\Big)\Big(\sum_{q,l}^{Q+1}Z_{jq}W_{ql}Z_{bl}\Big)\Big]\\
&= \mathrm{E}_{Z^{\backslash bc},Z_j,W}\Big[\sum_{q,q',l,l'}^{Q+1}Z_{bl}Z_{bl'}Z_{jq}W_{ql}W_{q'l'}Z_{jq'}\Big]\\
&= \mathrm{E}_{Z^{\backslash bc},Z_j,W}\Big[Z_{bc}\sum_{q,q'}^{Q+1}Z_{jq}W_{qc}W_{q'c}Z_{jq'}
+ 2Z_{bc}\sum_{q,q',l\neq c}^{Q+1}Z_{bl}Z_{jq}W_{qc}W_{q'l}Z_{jq'}\Big] + \mathrm{const}\\
&= Z_{bc}\left\{\mathrm{E}_{Z_j,W_{\cdot c}}[W_{\cdot c}^\intercal Z_jZ_j^\intercal W_{\cdot c}]
+ 2\sum_{l\neq c}^{Q+1}\tau_{bl}\,\mathrm{E}_{Z_j,W_{\cdot c},W_{\cdot l}}[W_{\cdot c}^\intercal Z_jZ_j^\intercal W_{\cdot l}]\right\} + \mathrm{const}\\
&= Z_{bc}\left\{\mathrm{E}_{W_{\cdot c}}[W_{\cdot c}^\intercal E_jW_{\cdot c}]
+ 2\sum_{l\neq c}^{Q+1}\tau_{bl}\,\mathrm{E}_{W_{\cdot c},W_{\cdot l}}[W_{\cdot c}^\intercal E_jW_{\cdot l}]\right\} + \mathrm{const}\\
&= Z_{bc}\left\{\mathrm{E}_{W_{\cdot c}}[(W_{\cdot c}\otimes W_{\cdot c})^\intercal]\,E_j^{\mathrm{vec}}
+ 2\sum_{l\neq c}^{Q+1}\tau_{bl}\,\mathrm{E}_{W_{\cdot c},W_{\cdot l}}[(W_{\cdot l}\otimes W_{\cdot c})^\intercal]\,E_j^{\mathrm{vec}}\right\} + \mathrm{const}\\
&= Z_{bc}\left\{\mathrm{E}_{W_{\cdot c}}\big[\big((W_{\cdot c}W_{\cdot c}^\intercal)^{\mathrm{vec}}\big)^\intercal\big]\,E_j^{\mathrm{vec}}
+ 2\sum_{l\neq c}^{Q+1}\tau_{bl}\,\mathrm{E}_{W_{\cdot c},W_{\cdot l}}\big[\big((W_{\cdot c}W_{\cdot l}^\intercal)^{\mathrm{vec}}\big)^\intercal\big]\,E_j^{\mathrm{vec}}\right\} + \mathrm{const}\\
&= Z_{bc}\left\{(\Sigma_{cc}^{\mathrm{vec}})^\intercal E_j^{\mathrm{vec}}
+ 2\sum_{l\neq c}^{Q+1}\tau_{bl}\,(\Sigma_{cl}^{\mathrm{vec}})^\intercal E_j^{\mathrm{vec}}\right\} + \mathrm{const}\\
&= Z_{bc}\,\mathrm{Tr}\left(\Big(\Sigma_{cc} + 2\sum_{l\neq c}^{Q+1}\tau_{bl}\Sigma_{cl}\Big)E_j\right) + \mathrm{const},
\end{aligned}
\]
where $\Sigma_{ql} = \mathrm{E}_{W_{\cdot q},W_{\cdot l}}[W_{\cdot q}W_{\cdot l}^\intercal]$. Similarly, we have:
\[
\mathrm{E}_{Z^{\backslash bc},Z_j,W}[a_{Z_b,Z_j}^{2}]
= Z_{bc}\,\mathrm{Tr}\left(\Big(\Sigma'_{cc} + 2\sum_{l\neq c}^{Q+1}\tau_{bl}\Sigma'_{cl}\Big)E_j\right) + \mathrm{const},
\]
where $\Sigma'_{ql} = \mathrm{E}_{W_{q\cdot},W_{l\cdot}}[W_{q\cdot}^\intercal W_{l\cdot}]$. Finally, we obtain:
\[
\begin{aligned}
\log q(Z_{bc}) = Z_{bc}\Bigg\{&\psi(\eta_c^{N})-\psi(\zeta_c^{N})
+ \sum_{j\neq b}^{N}\Big(X_{bj}-\frac{1}{2}\Big)\tau_j^\intercal(W_N^\intercal)_{\cdot c}
+ \sum_{j\neq b}^{N}\Big(X_{jb}-\frac{1}{2}\Big)\tau_j^\intercal(W_N)_{\cdot c}\\
&- \mathrm{Tr}\Bigg(\Big(\Sigma'_{cc} + 2\sum_{l\neq c}^{Q+1}\tau_{bl}\Sigma'_{cl}\Big)
\Big(\sum_{j\neq b}^{N}\lambda(\xi_{bj})\,E_j\Big)
+ \Big(\Sigma_{cc} + 2\sum_{l\neq c}^{Q+1}\tau_{bl}\Sigma_{cl}\Big)
\Big(\sum_{j\neq b}^{N}\lambda(\xi_{jb})\,E_j\Big)\Bigg)\Bigg\} + \mathrm{const}.
\end{aligned}
\tag{D.8}
\]
The functional form of (D.8) corresponds to the logarithm of a Bernoulli distribution with parameter $\tau_{bc}$. Indeed:
\[
\begin{aligned}
\log\mathcal{B}(Z_{bc};\,\tau_{bc}) &= Z_{bc}\log\tau_{bc} + (1-Z_{bc})\log(1-\tau_{bc})\\
&= Z_{bc}\log\Big(\frac{\tau_{bc}}{1-\tau_{bc}}\Big) + \mathrm{const}.
\end{aligned}
\]
If we denote $p = \log\big(\tau_{bc}/(1-\tau_{bc})\big)$, then $\tau_{bc} = g(p)$.
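The last step can be checked numerically. In this illustrative sketch (not from the thesis), p stands for the bracketed quantity in (D.8) once it has been evaluated for a given pair (b, c); its value below is hypothetical:

    import numpy as np

    def g(x):
        return 1.0 / (1.0 + np.exp(-x))

    p = 1.3                       # hypothetical value of the natural parameter
    tau_bc = g(p)                 # q(Z_bc = 1)

    # Normalizing exp(Z * p) over Z in {0, 1} gives the same probability.
    probs = np.exp(np.array([0.0, 1.0]) * p)
    probs /= probs.sum()
    assert np.isclose(probs[1], tau_bc)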


D.5 Optimization of ξ

Setting the partial derivative of the lower bound with respect to ξij to zero leads to an estimate $\hat{\xi}_{ij}$ of $\xi_{ij}$:
\[
\hat{\xi}_{ij}^{2} = \mathrm{Tr}\Big(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)(E_j\otimes E_i)\Big).
\]

Proof: The partial derivative of the lower bound with respect to ξij is given by:
\[
\frac{\partial\mathcal{L}}{\partial\xi_{ij}}(q;\,\xi)
= -\frac{1}{2} + g(-\xi_{ij})
- \lambda'(\xi_{ij})\big(\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}^{2}] - \xi_{ij}^{2}\big)
+ 2\xi_{ij}\lambda(\xi_{ij}).
\]
According to (D.6),
\[
\mathrm{E}_{Z_i,Z_j}[a_{Z_i,Z_j}^{2}] = (W^{\mathrm{vec}})^\intercal(E_j\otimes E_i)\,W^{\mathrm{vec}},
\]
therefore
\[
\begin{aligned}
\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}^{2}]
&= \mathrm{E}_{W}\big[(W^{\mathrm{vec}})^\intercal(E_j\otimes E_i)\,W^{\mathrm{vec}}\big]\\
&= \mathrm{E}_{W}\big[\mathrm{Tr}\big(W^{\mathrm{vec}}(W^{\mathrm{vec}})^\intercal(E_j\otimes E_i)\big)\big]\\
&= \mathrm{Tr}\big(\mathrm{E}_{W}[W^{\mathrm{vec}}(W^{\mathrm{vec}})^\intercal]\,(E_j\otimes E_i)\big)\\
&= \mathrm{Tr}\big(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)(E_j\otimes E_i)\big).
\end{aligned}
\tag{D.9}
\]
Moreover, $(\log g)'(\xi_{ij}) = g(-\xi_{ij})$ and $g(\xi_{ij}) + g(-\xi_{ij}) = 1$, so that $2\xi_{ij}\lambda(\xi_{ij}) = g(\xi_{ij}) - 1/2$ and the first three terms of the derivative cancel. We obtain:
\[
\frac{\partial\mathcal{L}}{\partial\xi_{ij}}(q;\,\xi)
= -\lambda'(\xi_{ij})\left\{\mathrm{Tr}\big(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)(E_j\otimes E_i)\big) - \xi_{ij}^{2}\right\}.
\]
Finally, λ(ξij) is a strictly decreasing function for positive values of ξij. Thus λ'(ξij) ≠ 0, and setting the partial derivative to zero leads to:
\[
\hat{\xi}_{ij}^{2} = \mathrm{Tr}\big(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)(E_j\otimes E_i)\big).
\]
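An illustrative sketch of this update (not part of the thesis); E_list is assumed to contain the matrices $E_i$ of Section D.3, and SN and WN_vec to come from the update of q(W):

    import numpy as np

    def xi_update(SN, WN_vec, E_list):
        """xi_ij = sqrt( Tr( (S_N + W_N^vec (W_N^vec)^T) (E_j kron E_i) ) ), i != j."""
        M = SN + np.outer(WN_vec, WN_vec)      # second moment of W^vec under q(W)
        N = len(E_list)
        xi = np.zeros((N, N))
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                xi[i, j] = np.sqrt(np.trace(M @ np.kron(E_list[j], E_list[i])))
        return xi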

D.6 Lower bound

After the variational Bayes M-step, most of the terms in the lower bound vanish:
\[
\begin{aligned}
\mathcal{L}(q;\,\xi) &= \sum_{i\neq j}^{N}\left\{\log g(\xi_{ij}) - \frac{\xi_{ij}}{2} + \lambda(\xi_{ij})\,\xi_{ij}^{2}\right\}
+ \sum_{q=1}^{Q}\log\left\{\frac{\Gamma(\eta_q^{0}+\zeta_q^{0})\,\Gamma(\eta_q^{N})\,\Gamma(\zeta_q^{N})}
{\Gamma(\eta_q^{0})\,\Gamma(\zeta_q^{0})\,\Gamma(\eta_q^{N}+\zeta_q^{N})}\right\}\\
&\quad - \frac{1}{2}\log\frac{|S_0|}{|S_N|}
- \frac{1}{2}(W_0^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
+ \frac{1}{2}(W_N^{\mathrm{vec}})^\intercal S_N^{-1}W_N^{\mathrm{vec}}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\}.
\end{aligned}
\tag{D.10}
\]


Proof:
\[
\begin{aligned}
\mathcal{L}(q;\,\xi) &= \sum_{Z}\int\!\!\int q(Z,\alpha,W)\,
\log\left(\frac{h(Z,W,\xi)\,p(Z\,|\,\alpha)\,p(\alpha)\,p(W)}{q(Z,\alpha,W)}\right)\mathrm{d}\alpha\,\mathrm{d}W\\
&= \mathrm{E}_{Z,W}[\log h(Z,W,\xi)] + \mathrm{E}_{Z,\alpha}[\log p(Z\,|\,\alpha)]
+ \mathrm{E}_{\alpha}[\log p(\alpha)] + \mathrm{E}_{W}[\log p(W)]\\
&\quad - \mathrm{E}_{Z}[\log q(Z)] - \mathrm{E}_{\alpha}[\log q(\alpha)] - \mathrm{E}_{W}[\log q(W)]\\
&= \sum_{i\neq j}^{N}\left\{\Big(X_{ij}-\frac{1}{2}\Big)\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}]
- \frac{\xi_{ij}}{2} + \log g(\xi_{ij})
- \lambda(\xi_{ij})\big(\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}^{2}] - \xi_{ij}^{2}\big)\right\}\\
&\quad + \sum_{i=1}^{N}\sum_{q=1}^{Q}\left\{\tau_{iq}\big(\psi(\eta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)
+ (1-\tau_{iq})\big(\psi(\zeta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)\right\}\\
&\quad + \sum_{q=1}^{Q}\left\{\log\left(\frac{\Gamma(\eta_q^{0}+\zeta_q^{0})}{\Gamma(\eta_q^{0})\Gamma(\zeta_q^{0})}\right)
+ (\eta_q^{0}-1)\big(\psi(\eta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)
+ (\zeta_q^{0}-1)\big(\psi(\zeta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)\right\}\\
&\quad + \mathrm{E}_{W}[\log p(W)]
- \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\}\\
&\quad - \sum_{q=1}^{Q}\left\{\log\left(\frac{\Gamma(\eta_q^{N}+\zeta_q^{N})}{\Gamma(\eta_q^{N})\Gamma(\zeta_q^{N})}\right)
+ (\eta_q^{N}-1)\big(\psi(\eta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)
+ (\zeta_q^{N}-1)\big(\psi(\zeta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)\right\}\\
&\quad - \mathrm{E}_{W}[\log q(W)].
\end{aligned}
\tag{D.11}
\]
$\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}]$ is given by:
\[
\begin{aligned}
\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}] &= \mathrm{E}_{Z_i,Z_j,W}[Z_i^\intercal W Z_j]\\
&= \mathrm{E}_{W}[\tau_i^\intercal W\tau_j]\\
&= \mathrm{E}_{W}[(\tau_j\otimes\tau_i)^\intercal W^{\mathrm{vec}}]\\
&= \mathrm{E}_{W}[(W^{\mathrm{vec}})^\intercal(\tau_j\otimes\tau_i)]\\
&= (W_N^{\mathrm{vec}})^\intercal(\tau_j\otimes\tau_i).
\end{aligned}
\tag{D.12}
\]
$\mathrm{E}_{Z_i,Z_j,W}[a_{Z_i,Z_j}^{2}]$ is given by (D.9). $\mathrm{E}_{W}[\log p(W)]$ is given by:
\[
\begin{aligned}
\mathrm{E}_{W}[\log p(W)] &= \mathrm{E}_{W}\Big[-\frac{1}{2}(W^{\mathrm{vec}}-W_0^{\mathrm{vec}})^\intercal S_0^{-1}(W^{\mathrm{vec}}-W_0^{\mathrm{vec}})\Big]
- \frac{1}{2}(Q+1)^{2}\log(2\pi) - \frac{1}{2}\log|S_0|\\
&= -\frac{1}{2}\mathrm{E}_{W}[(W^{\mathrm{vec}})^\intercal S_0^{-1}W^{\mathrm{vec}}]
+ \mathrm{E}_{W}[(W^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}]
- \frac{1}{2}(W_0^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(Q+1)^{2}\log(2\pi) - \frac{1}{2}\log|S_0|\\
&= -\frac{1}{2}\mathrm{E}_{W}[(W^{\mathrm{vec}})^\intercal S_0^{-1}W^{\mathrm{vec}}]
+ (W_N^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(W_0^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(Q+1)^{2}\log(2\pi) - \frac{1}{2}\log|S_0|\\
&= -\frac{1}{2}\mathrm{Tr}\big(\mathrm{E}_{W}[W^{\mathrm{vec}}(W^{\mathrm{vec}})^\intercal]\,S_0^{-1}\big)
+ (W_N^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(W_0^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(Q+1)^{2}\log(2\pi) - \frac{1}{2}\log|S_0|\\
&= -\frac{1}{2}\mathrm{Tr}\big(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)S_0^{-1}\big)
+ (W_N^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(W_0^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
- \frac{1}{2}(Q+1)^{2}\log(2\pi) - \frac{1}{2}\log|S_0|.
\end{aligned}
\tag{D.13}
\]
Similarly, we have:
\[
\begin{aligned}
\mathrm{E}_{W}[\log q(W)] &= -\frac{1}{2}\mathrm{Tr}\big(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)S_N^{-1}\big)
+ (W_N^{\mathrm{vec}})^\intercal S_N^{-1}W_N^{\mathrm{vec}}
- \frac{1}{2}(W_N^{\mathrm{vec}})^\intercal S_N^{-1}W_N^{\mathrm{vec}}\\
&\quad - \frac{1}{2}(Q+1)^{2}\log(2\pi) - \frac{1}{2}\log|S_N|.
\end{aligned}
\tag{D.14}
\]
After rearranging the terms in (D.11) and using (D.9), (D.12), (D.13), as well as (D.14), we obtain:
\[
\begin{aligned}
\mathcal{L}(q;\,\xi) &= \sum_{i\neq j}^{N}\left\{\log g(\xi_{ij}) - \frac{\xi_{ij}}{2} + \lambda(\xi_{ij})\,\xi_{ij}^{2}\right\}
+ \sum_{q=1}^{Q}\log\left\{\frac{\Gamma(\eta_q^{0}+\zeta_q^{0})\,\Gamma(\eta_q^{N})\,\Gamma(\zeta_q^{N})}
{\Gamma(\eta_q^{0})\,\Gamma(\zeta_q^{0})\,\Gamma(\eta_q^{N}+\zeta_q^{N})}\right\}\\
&\quad + \sum_{q=1}^{Q}\left\{\Big(\eta_q^{0}+\sum_{i=1}^{N}\tau_{iq}-\eta_q^{N}\Big)
\big(\psi(\eta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)
+ \Big(\zeta_q^{0}+N-\sum_{i=1}^{N}\tau_{iq}-\zeta_q^{N}\Big)
\big(\psi(\zeta_q^{N})-\psi(\eta_q^{N}+\zeta_q^{N})\big)\right\}\\
&\quad - \frac{1}{2}\mathrm{Tr}\left(\big(S_N + W_N^{\mathrm{vec}}(W_N^{\mathrm{vec}})^\intercal\big)
\Big(S_0^{-1} + 2\sum_{i\neq j}^{N}\lambda(\xi_{ij})(E_j\otimes E_i) - S_N^{-1}\Big)\right)\\
&\quad + (W_N^{\mathrm{vec}})^\intercal\left(S_0^{-1}W_0^{\mathrm{vec}}
+ \sum_{i\neq j}^{N}\Big(X_{ij}-\frac{1}{2}\Big)(\tau_j\otimes\tau_i) - S_N^{-1}W_N^{\mathrm{vec}}\right)\\
&\quad - \frac{1}{2}\log\frac{|S_0|}{|S_N|}
- \frac{1}{2}(W_0^{\mathrm{vec}})^\intercal S_0^{-1}W_0^{\mathrm{vec}}
+ \frac{1}{2}(W_N^{\mathrm{vec}})^\intercal S_N^{-1}W_N^{\mathrm{vec}}\\
&\quad - \sum_{i=1}^{N}\sum_{q=1}^{Q}\big\{\tau_{iq}\log\tau_{iq} + (1-\tau_{iq})\log(1-\tau_{iq})\big\}.
\end{aligned}
\tag{D.15}
\]
After the variational Bayes M-step, $\eta_q^{N}$, $\zeta_q^{N}$, $S_N$, and $W_N^{\mathrm{vec}}$ satisfy the update equations of Sections D.2 and D.3, so the digamma terms as well as the trace and inner-product corrections in (D.15) vanish, which gives (D.10).
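For illustration only (not part of the thesis), the criterion (D.10) can be evaluated with NumPy and SciPy once all variational quantities are available; every argument name below is an assumption standing for the corresponding quantity of Sections D.2 to D.5:

    import numpy as np
    from scipy.special import gammaln

    def log_g(x):
        return -np.log1p(np.exp(-x))

    def lam(xi):
        return (1.0 / (1.0 + np.exp(-xi)) - 0.5) / (2.0 * xi)

    def lower_bound(xi, eta0, zeta0, etaN, zetaN, S0, SN, W0_vec, WN_vec, tau):
        """Evaluate (D.10) after the variational Bayes M-step (illustrative sketch)."""
        mask = ~np.eye(xi.shape[0], dtype=bool)            # keep only the i != j terms
        x = xi[mask]
        term_xi = np.sum(log_g(x) - x / 2.0 + lam(x) * x ** 2)
        term_beta = np.sum(gammaln(eta0 + zeta0) + gammaln(etaN) + gammaln(zetaN)
                           - gammaln(eta0) - gammaln(zeta0) - gammaln(etaN + zetaN))
        _, logdet0 = np.linalg.slogdet(S0)
        _, logdetN = np.linalg.slogdet(SN)
        term_W = (-0.5 * (logdet0 - logdetN)
                  - 0.5 * W0_vec @ np.linalg.solve(S0, W0_vec)
                  + 0.5 * WN_vec @ np.linalg.solve(SN, WN_vec))
        t = np.clip(tau, 1e-12, 1.0 - 1e-12)               # avoid log(0)
        term_entropy = -np.sum(t * np.log(t) + (1.0 - t) * np.log(1.0 - t))
        return term_xi + term_beta + term_W + term_entropy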


Bibliography

R.K. Ahuja, T.L. Magnanti, and J.B. Orlin. Network flows: theory, algorithms, and applications. Prentice Hall, Upper Saddle River, New Jersey, 1993. (Cited in page 36.)

E. Airoldi, D. Blei, E. Xing, and S. Fienberg. Mixed membership stochastic block models for relational data with application to protein-protein interactions. In Proceedings of the International Biometrics Society Annual Meeting, 2006. (Cited in page 67.)

E. Airoldi, D. Blei, S. Fienberg, and E. Xing. Mixed membership analysis of high-throughput interaction studies: relational data. ArXiv e-prints, 2007. (Cited in page 67.)

E.M. Airoldi, D.M. Blei, S.E. Fienberg, and E.P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008. (Cited in pages 43, 67, 70, and 80.)

H. Akaike. Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory, pages 267–281, 1973. (Cited in page 20.)

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974. (Cited in page 20.)

R. Albert and A.L. Barabási. Statistical mechanics of complex networks. Modern Physics, 74:47–97, 2002. (Cited in pages 30 and 60.)

R. Albert, H. Jeong, and A.L. Barabasi. Diameter of the world-wide web. Nature, 401:130–131, 1999. (Cited in page 31.)

E.S. Allman, C. Matias, and J.A. Rhodes. Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics, 37(6A):3099–3132, 2009. URL http://www.imstat.org/aos/. (Cited in pages 43, 71, and 73.)

E.S. Allman, C. Matias, and J.A. Rhodes. Parameter identifiability in a class of random graph mixture models. ArXiv e-prints, 2010. (Cited in page 43.)

L.A.N. Amaral, A. Scala, M. Barthélémy, and H.E. Stanley. Classes of small-world networks. In Proceedings of the National Academy of Sciences, volume 97, pages 11149–11152, 2000. (Cited in page 31.)

H. Attias. Inferring parameters and structure of latent variable models by variational Bayes. In K.B. Laskey and H. Prade, editors, Uncertainty in Artificial Intelligence: proceedings of the fifth conference, pages 21–30. Morgan Kaufmann, 1999. (Cited in page 55.)


A.L. Barabasi and R. Albert. Emergence of scaling in random networks. Science, 286:509–512, 1999. (Cited in page 31.)

A.L. Barabási and Z.N. Oltvai. Network biology: understanding the cell's functional organization. Nature Rev. Genet, 5:101–113, 2004. (Cited in page 30.)

J.O. Berger. Statistical decision theory and Bayesian analysis. Springer, 1985. (Cited in page 13.)

J.M. Bernardo and A.F.M. Smith. Bayesian theory. Wiley, 1994. (Cited in page 13.)

P.J. Bickel and A. Chen. A nonparametric view of network models and Newman–Girvan and other modularities. In Proceedings of the National Academy of Sciences, volume 106, pages 21068–21073, 2009. (Cited in page 35.)

C. Biernacki and G. Govaert. Using the classification likelihood to choose the number of clusters. Computing Science and Statistics, 29:451–457, 1997. (Cited in page 23.)

C. Biernacki, G. Celeux, and G. Govaert. An improvement on the NEC criterion for assessing the number of clusters in a mixture model. Pattern Recognition Letters, 20:267–272, 1999. (Cited in pages 22 and 23.)

C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Machine Intel., 7:719–725, 2000. (Cited in pages 22, 23, 24, 51, 109, and 110.)

C. Biernacki, G. Celeux, and G. Govaert. Exact and Monte Carlo calculations of integrated likelihoods for the latent class model. Journal of Statistical Planning and Inference, 140:2991–3002, 2010. (Cited in pages 51 and 56.)

C.M. Bishop. Pattern recognition and machine learning. Springer-Verlag, 2006. (Cited in pages 14, 16, and 18.)

C.M. Bishop and M. Svensén. Bayesian hierarchical mixtures of experts. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence, pages 57–64. U. Kjaerulff and C. Meek, 2003. (Cited in page 99.)

D. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003. (Cited in page 67.)

P. Boer, M. Huisman, T.A.B. Snijders, C.E.G. Steglich, L.H.Y. Wichers, and E.P.H. Zeggelink. StOCNET: an open software system for the advanced statistical analysis of social networks. Groningen: ProGAMMA/ICS, 2006. Version 1.7. (Cited in page 51.)

B. Bollobás, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Structures and Algorithms, 2005. (Cited in page 44.)


H. Bozdogan and S.L. Sclove. Multi-sample cluster analysis using Akaike's information criterion. Annals of the Institute of Statistical Mathematics, 36:163–180, 1984. (Cited in page 21.)

U. Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25:163–177, 2001. (Cited in page 36.)

A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33:309–320, 2000. (Cited in page 31.)

C.G. Broyden, R. Fletcher, D. Goldfarb, and D.F. Shanno. BFGS method. Journal of the Institute of Mathematics and Its Applications, 6:76–90, 1970. (Cited in page 79.)

K.P. Burnham and D.R. Anderson. Model selection and multi-model inference: a practical information-theoretic approach. Springer-Verlag, 2004. (Cited in page 51.)

R.H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. Journal on Scientific and Statistical Computing, 16:1190–1208, 1995. (Cited in page 79.)

J.G. Campbell, C. Fraley, F. Murtagh, and A.E. Raftery. Linear flaw detection in woven textiles using model based clustering. Pattern Recognition Letters, 18:1539–1548, 1997. (Cited in page 22.)

G. Celeux and G. Soromenho. An entropy criterion for assessing the number of clusters in a mixture model. Classification Journal, 13:195–212, 1996. (Cited in pages 21 and 22.)

J. Chang. The lda package, 2010. Version 1.2. (Cited in page 80.)

A. Corduneanu and C.M. Bishop. Variational Bayesian model selection for mixture distributions. In T. Richardson and T. Jaakkola, editors, Artificial Intelligence and Statistics: proceedings of the eighth conference, pages 27–34. Morgan Kaufmann, 2001. (Cited in page 55.)

T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to algorithms. MIT Press, Cambridge, 2001. (Cited in page 36.)

I. Csiszar and G. Tusnady. Information geometry and alternating minimization procedures. Statistics and Decisions, 1(1):205–237, 1984. (Cited in page 25.)

V.M. Dang. Classification de données spatiales : modèles probabilistes et critères de partitionnement. PhD thesis, Université de Technologie de Compiègne, 1998. (Cited in page 14.)

L. Danon, A. Diaz-Guilera, J. Duch, and A. Arenas. Comparing community structure identification. J Stat Mech, 2005. (Cited in page 29.)

A. Dasgupta and A.E. Raftery. Detecting features in spatial point processes with clutter via model based clustering. Journal of the American Statistical Association, 93:294–302, 1998. (Cited in page 22.)


J. Daudin, F. Picard, and S. Robin. A mixture model for random graphs. Statistics and Computing, 18:1–36, 2008. (Cited in pages 2, 43, 51, 54, 55, 56, 57, 58, 61, 64, 71, 77, and 78.)

A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1–38, 1977. (Cited in page 14.)

S.N. Dorogovtsev, J.F.F. Mendes, and A.N. Samukhin. Structure of growing networks with preferential linking. Physical Review Letter, 85:4633–4636, 2000. (Cited in page 31.)

C. Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twelfth International Conference on Machine Learning, pages 147–153, 2003. (Cited in page 11.)

E. Estrada and J.A. Rodriguez-Velazquez. Spectral measures of bipartivity in complex networks. Physical Review E, 72:046105, 2005. (Cited in page 30.)

M.G. Everett and S.P. Borgatti. Analyzing clique overlap. Connections, 21:49–61, 1998. (Cited in pages 67 and 80.)

B. Everitt. Cluster analysis. Wiley, 1974. (Cited in page 38.)

S.E. Fienberg and S. Wasserman. Categorical data analysis of single sociometric relations. Sociological Methodology, 12:156–192, 1981. (Cited in page 43.)

R. Fletcher. Practical methods of optimization. Wiley, 1987. (Cited in page 13.)

S. Fortunato. Community detection in graphs. Physics Reports, 3-5:75–174, 2010. (Cited in page 30.)

O. Frank and F. Harary. Cluster inference by using transitivity indices in empirical graphs. Journal of the American Statistical Association, 77:835–840, 1982. (Cited in page 43.)

L. Freeman. A set of measures of centrality based upon betweenness. Sociometry, 40:35–41, 1977. (Cited in page 36.)

Q. Fu and A. Banerjee. Multiplicative mixture models for overlapping clustering. In Proceedings of the IEEE International Conference on Data Mining, pages 791–796, 2008. (Cited in page 68.)

A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin. Bayesian data analysis. Chapman and Hall, 2004. (Cited in page 13.)

M. Girvan and M.E.J. Newman. Community structure in social and biological networks. In Proceedings of the National Academy of Sciences, volume 99, pages 7821–7826, 2002. (Cited in page 35.)

A. Goldenberg, A.X. Zheng, S.E. Fienberg, and E.M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010. (Cited in page 30.)


T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. In Neural Information Processing Systems, volume 18, pages 475–482, 2005. (Cited in page 67.)

S.F. Gull. Developments in maximum entropy data analysis. In Maximum Entropy and Bayesian Methods, pages 53–71. Kluwer, 1989. (Cited in page 13.)

M.S. Handcock, A.E. Raftery, and J.M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society, 170:1–22, 2007. (Cited in pages 38 and 39.)

T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning. Springer, 2001. (Cited in page 13.)

R.J. Hathaway. Another interpretation of the EM algorithm for mixture distributions. Statistics & Probability Letters, 4:53–56, 1986. (Cited in pages 23, 25, and 54.)

K. Heller and Z. Ghahramani. A nonparametric Bayesian approach to modeling overlapping clusters. In Proceedings of the 11th International Conference on AI and Statistics, 2007. (Cited in pages 67 and 81.)

K. Heller, S. Williamson, and Z. Ghahramani. Statistical models for partial membership. In Proceedings of the 25th International Conference on Machine Learning (ICML), pages 392–399, 2008. (Cited in pages 67 and 81.)

M.E. Hodgson. Reducing computational requirements of the minimum-distance classifier. Remote Sensing of Environments, 25:117–128, 1998. (Cited in page 11.)

A.E. Hoerl and R. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970. (Cited in page 13.)

P.D. Hoff, A.E. Raftery, and M.S. Handcock. Latent space approaches to social network analysis. Journal of the Royal Statistical Society, 97:1090–1098, 2002. (Cited in page 38.)

J.M. Hofman and C.H. Wiggins. A Bayesian approach to network modularity. Physical Review Letters, 100:258701, 2008. (Cited in pages 29, 41, 43, 53, 57, 58, 59, 61, and 64.)

P. Holland, K.B. Laskey, and S. Leinhardt. Stochastic blockmodels: some first steps. Social Networks, 5:109–137, 1983. (Cited in page 43.)

T.S. Jaakkola and M.I. Jordan. Bayesian parameter estimation via variational methods. Statistics and Computing, 10:25–37, 2000. (Cited in pages 96, 97, 120, and 125.)

C.J. Jeffery. Moonlighting proteins. Trends in Biochemical Sciences, 24:8–11, 1999. (Cited in page 67.)

H. Jeffreys. An invariant form for the prior probability in estimation problems. In Proceedings of the Royal Society of London. Series A, volume 186, pages 453–461, 1946. (Cited in pages 14 and 53.)


R. Kass and A.E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773–795, 1995. (Cited in page 22.)

C. Kemp, T.L. Griffiths, and J.B. Tenenbaum. Discovering latent classes in relational data. Technical report, MIT, 2004. (Cited in page 43.)

A.B. Koehler and E.H. Murphee. A comparison of the Akaike and Schwarz criteria for selecting model order. Applied Statistics, 37:187–195, 1988. (Cited in page 21.)

P.N. Krivitsky and M.S. Handcock. The latentnet package. Statnet project, 2009. Version 2.1-1. (Cited in page 39.)

P.N. Krivitsky, M.S. Handcock, A.E. Raftery, and P.D. Hoff. Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models. Social Networks, 31:204–213, 2009. (Cited in page 69.)

V. Lacroix, C.G. Fernandes, and M.-F. Sagot. Motif search in graphs: application to metabolic networks. Transactions in Computational Biology and Bioinformatics, 3:360–368, 2006. (Cited in pages viii, 30, 33, and 61.)

P. Latouche, E. Birmelé, and C. Ambroise. Bayesian methods for graph clustering, pages 229–239. Springer, 2009. (Cited in pages 3, 6, 71, 78, and 80.)

P. Latouche, E. Birmelé, and C. Ambroise. Overlapping stochastic block model with application to the French political blogosphere. Annals of Applied Statistics, 2010a. To appear. (Cited in pages 3 and 6.)

P. Latouche, E. Birmelé, and C. Ambroise. Variational Bayes inference and complexity control for stochastic block models. Statistical Modelling, 2010b. To appear. (Cited in pages 3 and 6.)

B.G. Leroux. Consistent estimation of a mixing distribution. Annals of Statistics, 20:1350–1360, 1992. (Cited in page 22.)

D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992. (Cited in page 13.)

J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, 1967. (Cited in page 11.)

M. Mariadassou, S. Robin, and C. Vacher. Uncovering latent structure in valued graphs: a variational approach. Annals of Applied Statistics, 4(2), 2010. (Cited in pages 38, 51, and 58.)

D. Martin, C. Brun, E. Remy, P. Mouren, D. Thieffry, and B. Jacq. GOToolBox: functional analysis of gene datasets based on gene ontology. Genome Biology, 5(12), 2004. (Cited in page 91.)

G. McLachlan and T. Krishnan. The EM algorithm and extensions. New York: John Wiley, 1997. (Cited in page 14.)


G. McLachlan and D. Peel. Finite mixture models. New York: John Wiley, 2000. (Cited in pages 9, 20, and 22.)

R. Milo, S. Shen-Orr, S. Itzkovitz, D. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298:824–827, 2002. (Cited in pages viii, 30, 32, and 91.)

A.W. Moore. The anchors hierarchy: using the triangle inequality to survive high dimensional data. In Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence, pages 397–405, 2000. (Cited in page 11.)

J.L. Moreno. Who shall survive?: a new approach to the problem of human interrelations. Nervous and Mental Disease Publishing, Washington DC, 1934. (Cited in page 29.)

R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models, Dordrecht, 1998. Kluwer. (Cited in pages 25 and 54.)

M. Newman and E. Leicht. Mixture models and exploratory analysis in networks. In Proceedings of the National Academy of Sciences, volume 104, pages 9564–9569, 2007. (Cited in pages 29 and 43.)

M.E.J. Newman. Scientific collaboration networks: II. Shortest paths, weighted networks, and centrality. Physical Review E, 64:016132, 2001. (Cited in page 36.)

M.E.J. Newman. Mixing patterns in networks. Physical Review E, 67:026126, 2003. (Cited in page 29.)

M.E.J. Newman. Fast algorithm for detecting community structure in networks. Physical Review Letter, 69, 2004. (Cited in page 38.)

M.E.J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004. (Cited in pages 35 and 36.)

J. Nocedal and S.J. Wright. Numerical optimization. Springer, 1999. (Cited in page 13.)

K. Nowicki and T.A.B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96:1077–1087, 2001. (Cited in pages 30, 43, 51, 52, 65, and 72.)

G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814–818, 2005. (Cited in pages 67, 80, and 86.)

G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. CFinder, the community cluster finding program, 2006. Version 2.0.1. (Cited in pages 67 and 80.)

G. Palla, A.L. Barabási, and T. Vicsek. Quantifying social group evolution. Nature, 446:664–667, 2007. (Cited in page 30.)

T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. (Cited in page 13.)


A.E. Raftery. Bayesian model selection in social research (with discussion by Andrew Gelman, Donald B. Rubin, Robert M. Hauser, and a rejoinder), 1995. (Cited in page 22.)

B.D. Ripley. Pattern recognition and neural networks. Cambridge: Cambridge University Press, 1996. (Cited in page 22.)

V. Ramasubramanian and K.K. Paliwal. A generalized optimization of the k-d tree for fast nearest-neighbour search. In Proceedings of the IEEE Region 10 International Conference, pages 565–568, 1990. (Cited in page 11.)

C.P. Robert. The Bayesian choice: a decision-theoretic motivation. Springer-Verlag, 1994. (Cited in page 110.)

K. Roeder and L. Wasserman. Practical density estimation using mixture of normals. Journal of the American Statistical Association, 92:894–902, 1997. (Cited in page 22.)

S.F. Sampson. Crisis in a cloister. PhD thesis, Cornell University, 1969. (Cited in pages ix and 40.)

G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978. (Cited in page 22.)

S.L. Sclove. Application of model selection criteria to some problems in multivariate analysis. Psychometrika, 52:333–343, 1987. (Cited in page 21.)

J. Scott. Social network analysis: a handbook. Sage publications, 2000. (Cited in page 38.)

T.A.B. Snijders and K. Nowicki. Estimation and prediction for stochastic block-structures for graphs with latent block structure. Journal of Classification, 14:75–100, 1997. (Cited in page 30.)

G. Soromenho. Comparing approaches for testing the number of components in a finite mixture model. Computational Statistics, 9:65–78, 1993. (Cited in page 21.)

S.H. Strogatz. Exploring complex networks. Nature, 410:268–276, 2001. (Cited in page 31.)

M. Svensén and C.M. Bishop. Robust Bayesian mixture modelling. Neurocomputing, 64:235–252, 2004. (Cited in page 55.)

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, B, 58:267–288, 1996. (Cited in page 13.)

G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Numerical Mathematics, pages 383–393, 1975. (Cited in page 13.)

D.J. Watts and S.H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998. (Cited in page 30.)


H.C. White, S.A. Boorman, and R.L. Breiger. Social structure from multiple networks. I. Blockmodels of roles and positions. American Journal of Sociology, 81:730–780, 1976. (Cited in pages ix, 40, and 43.)

H. Zanghi, C. Ambroise, and V. Miele. Fast online graph clustering via Erdös–Rényi mixture. Pattern Recognition, 41(12):3592–3599, 2008. (Cited in pages viii, 30, 34, 51, and 86.)

H. Zanghi, S. Volant, and C. Ambroise. Clustering based on random graph model embedding vertex features. Pattern Recognition Letters, 31(9):830–836, 2010. (Cited in page 38.)

This document was prepared with the GNU Emacs text editor and the LaTeX 2ε typesetting system.


Titre : Modèles de graphes aléatoires à structure cachée pour l'analyse des réseaux
