
Optimizing and Understanding Network Structure for Diffusion

Yao Zhang

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy in

Computer Science and Application

B. Aditya Prakash, Chair
Bert Huang
Ravi Kumar
Naren Ramakrishnan
Anil Vullikanti

September 21, 2017
Blacksburg, Virginia

Keywords: Data Mining, Network, Diffusion
Copyright 2017, Yao Zhang


Optimizing and Understanding Network Structure for Diffusion

Yao Zhang

(ABSTRACT)

Given a population contact network and electronic medical records of patients, how to distribute vaccines to individuals to effectively control a flu epidemic? Similarly, given the Twitter following network and tweets, how to choose the best communities/groups to stop rumors from spreading? How to find the best accounts that bridge celebrities and ordinary users? These questions are related to diffusion (aka propagation) phenomena. Diffusion can be treated as a behavior of spreading contagions (like viruses, ideas, memes, etc.) on some underlying network. It is omnipresent in areas such as social media, public health, and cyber security. Examples include diseases like flu spreading on person-to-person contact networks, memes disseminating by online adoption over online friendship networks, and malware propagating among computer networks. When a contagion spreads, network structure (like nodes/edges/groups, etc.) plays a major role in determining the outcome. For instance, a rumor, if propagated by celebrities, can go viral. Similarly, an epidemic can die out quickly, if vulnerable demographic groups are successfully targeted for vaccination.

Hence in this thesis, we aim to optimize and understand network structure better in light of diffusion. We optimize graph topologies by removing nodes/edges to stop rumors/viruses from spreading, and gain a deeper understanding of a network in terms of diffusion by exploring how nodes group together for similar roles of dissemination. We develop several novel graph mining algorithms, at different levels of granularity (node/edge level to group/community level), from model-driven and data-driven perspectives, focusing on topics like immunization on networks, graph summarization, and community detection. In contrast to previous work, we are the first to systematically develop more realistic, implementable and data-based graph algorithms to control contagions. In addition, our thesis is also the first work to use diffusion to effectively summarize graphs and understand communities/groups of networks in a general way.

1. Model-driven. Diffusion processes are usually described using mathematical models, e.g., the Independent Cascade (IC) model in social media, and the Susceptible-Infectious-Recovered (SIR) model in epidemiology. Given such models, we propose to optimize network structure for controlling propagation (the immunization problem) in several practical and implementable settings, taking into account the presence of infections, the uncertain nature of the data and the group structure of the population. We develop efficient algorithms for different interventions, such as vaccination (node removal) and quarantining (edge removal). In addition, we study the graph coarsening problem for both static and temporal networks to obtain a better understanding of relations among nodes when a contagion is propagating. We seek to get a much smaller representation of a large network, while preserving its diffusive properties.


2. Data-driven. Model-driven approaches can provide ideal results if underlying diffusion models are given. However, in many situations, diffusion processes are very complicated, and it is challenging or even impossible to pick the most suitable model to describe them. In addition, rapid technological development has provided an abundance of data such as tweets and electronic medical records. Hence, in the second part of the thesis, we explore data-driven approaches for diffusion in networks, which can directly work on propagation data by relaxing modeling assumptions of diffusion. To be specific, we first develop data-driven immunization strategies to stop rumors or allocate vaccines by optimizing network topologies, using large-scale national-level diagnostic patient data with billions of flu records. Second, we propose a novel community detection problem to discover "bridge" and "celebrity" communities from social media data, and design case studies to understand roles of nodes/communities using diffusion.

Our work has many applications in multiple areas such as epidemiology, sociology and computer science. For example, our work on efficient immunization algorithms, such as data-driven immunization, can help the CDC better allocate vaccines to control flu epidemics in major cities. Similarly, in social media, our work on understanding network structure using diffusion can lead to better community discovery, such as finding media accounts that can boost tweet promotions on Twitter.


Optimizing and Understanding Network Structure for Diffusion

Yao Zhang

(GENERAL AUDIENCE ABSTRACT)

In public health, how to distribute vaccines to effectively control an epidemic like flu over a population? In social media, how to identify the different roles of users who participate in the spread of content through social networks? These questions and many others are related to diffusion (aka propagation) phenomena in networks (aka graphs). Networks, as natural structures to model relations between objects, arise in many areas, such as online social networks, population contact networks, and the Internet. Diffusion can be treated as a behavior of spreading contagions (like viruses, ideas, memes, etc.) on some underlying network. It is also prevalent: e.g., diseases like flu spreading on person-to-person contact networks, memes disseminating by online adoption over online friendship networks, and malware propagating among computer networks. When a contagion spreads, network structure (like nodes/edges/groups, etc.) plays a major role in determining the outcome. For instance, a rumor, if propagated by celebrities, can go viral. Similarly, an epidemic can die out quickly, if vulnerable demographic groups are successfully targeted for vaccination.

This thesis targets a general audience and provides a comprehensive study on how to optimize and understand network structure better in light of diffusion. We optimize graph topologies by removing nodes/edges to stop rumors/viruses from spreading, and gain a deeper understanding of a network in terms of diffusion by exploring how nodes group together for similar roles of dissemination. In contrast to previous work, we are the first to systematically develop more realistic, implementable and data-based graph algorithms to control contagions. In addition, our thesis is also the first work to use diffusion to effectively summarize graphs and understand communities/groups of networks in a general way. Our work has many applications in multiple areas such as epidemiology, sociology and computer science. For example, our work on efficient immunization algorithms, such as data-driven immunization, can help experts better allocate vaccines to control flu epidemics. Similarly, in social media, our work on understanding network structure using diffusion can lead to better community discovery, such as finding media accounts that can boost tweet promotions on Twitter.


Acknowledgments

First, I would like to thank my advisor B. Aditya Prakash. This thesis would not have been done without his endless support, advice, and encouragement over the five years I spent at Virginia Tech. I appreciate all the time he has devoted to discussing research problems, editing my papers, and giving suggestions on my future career.

Second, I am grateful to have a wonderful committee: Bert Huang, Naren Ramakrishnan, and Anil Vullikanti at Virginia Tech, and Ravi Kumar at Google Research. I appreciate their time in giving me invaluable advice and feedback to form and improve my thesis.

I could not have finished my thesis without all my collaborators: Abhijin Adiga, Bijaya Adhikari, Aditya Bharadwaj, Steve Jan, Chanhyun Kang, Manish Purohit, Laura Pullum, Arvind Ramanathan, Sudip Saha, V.S. Subrahmanian, and Anil Vullikanti. I particularly would like to thank Anil for his kind help with immunization problems. Without his thoughtful ideas and insightful discussions, that part of my work would not have been possible.

I am also thankful to all my peers and friends during my Ph.D. studies. I thank all my labmates for inspiring discussions, interesting research meetings, and helpful feedback on presentations, including Bijaya Adhikari, Liangzhe Chen, Sorour E. Amiri, Steve Jan, Elaheh Raisi, Shashidhar Sundareisan, Xinfeng Xu, and Ben Wang. In addition, I was fortunate to have amazing internships during the summers. I am grateful to have had the opportunity to work closely with talented researchers in industry, including Haifeng Chen, Zhengzhang Chen and Kai Zhang at NEC Lab, and Changwei Hu, Yifan Hu and Meizhu Liu at Yahoo! Research.

Finally, I would like to thank my family, especially my beloved parents, for their endless love, support, sacrifice and encouragement over the years, without which I would never have gotten to where I am today.


Contents

1 Introduction
  1.1 Thesis Overview, Statement and Structure
  1.2 Summary of the Work
    1.2.1 Part I: Model-driven
    1.2.2 Part II: Data-driven
  1.3 Contributions and Impact
    1.3.1 Contributions
    1.3.2 Impact
  1.4 Outline of the Thesis

2 Survey
  2.1 Diffusion for Epidemiology
  2.2 Information Diffusion in Social Media
  2.3 Network Optimization
  2.4 Graph Summarization
  2.5 Community Detection and Role Discovery

I Model-Driven Perspective

3 Data-Aware Vaccination
  3.1 Preliminaries
    3.1.1 Contact Network
    3.1.2 Virus Propagation Models
  3.2 Problem Formulation
    3.2.1 The "Data" in Data-Aware Vaccine Allocation
    3.2.2 Problem Definition
  3.3 Complexity of the DAV problem
    3.3.1 Hardness result
    3.3.2 Approximability
  3.4 Our Proposed Methods
    3.4.1 Simplification—Merging infected nodes
    3.4.2 DAVA-TREE—Optimal solution when the merged graph is a tree
    3.4.3 dava—An effective algorithm on arbitrary graphs under IC model
    3.4.4 dava-prune—A faster algorithm with the same result as dava
    3.4.5 dava-fast—An even faster heuristic
    3.4.6 Discussion of proposed methods
  3.5 Extending to the SIR model
  3.6 Experiments
    3.6.1 Experimental Setup
    3.6.2 Experimental Results
  3.7 Conclusion

4 Uncertain Data-Aware Vaccination
  4.1 Preliminaries
  4.2 Problem Formulation
    4.2.1 Uncertainty model
    4.2.2 Problem Definition
  4.3 Proposed Methods
    4.3.1 The Sample-Cascade Algorithm
    4.3.2 Expect-Max: a faster algorithm
    4.3.3 Extending to SIR model
  4.4 Experiments
    4.4.1 Experimental Setup
    4.4.2 Results
  4.5 Conclusion

5 Group Immunization
  5.1 Our Problem Formulations
    5.1.1 Problem Definition under LT model
    5.1.2 Problem Definition for spectral radius
  5.2 Proposed Methods
    5.2.1 Edge Deletion under LT model
    5.2.2 Node Deletion under LT model
    5.2.3 Edge Deletion for Spectral Radius
    5.2.4 Node Deletion for Spectral Radius
  5.3 Experiments
    5.3.1 Experimental Setup
    5.3.2 Results
  5.4 Conclusion

6 Graph Coarsening
  6.1 Preliminaries
  6.2 Problem Formulation
  6.3 Our Solution
    6.3.1 Score Estimation
    6.3.2 Complete Algorithm
  6.4 Sample Application: Influence Maximization
  6.5 Experiments
    6.5.1 Performance for the GCP problem
    6.5.2 Application 1: Influence Maximization
    6.5.3 Application 2: Diffusion Characterization
  6.6 Conclusion

7 Temporal Graph Coarsening
  7.1 Preliminaries
  7.2 Our Problem Formulation
    7.2.1 Formulation framework
    7.2.2 Q1: Propagation-based property
    7.2.3 Q2: Merge Definitions
    7.2.4 Problem Definition
  7.3 Our Proposed Method
    7.3.1 Main idea
    7.3.2 Step 1: An Alternate Static View
    7.3.3 Step 2: A Well Conditioned Network
    7.3.4 NetCondense
  7.4 Experiments
    7.4.1 Experimental Setup
    7.4.2 Performance of NetCondense: Effectiveness
    7.4.3 Application 1: Temporal Influence Maximization
    7.4.4 Application 2: Event Detection
    7.4.5 Application 3: Understanding/Exploring Networks
    7.4.6 Scalability and Parallelizability
  7.5 Conclusion

II Data-Driven Perspective

8 Data-Driven Immunization
  8.1 Preliminaries
  8.2 Problem Formulations
  8.3 Proposed Method
    8.3.1 Generating Cascades from SocialContact
    8.3.2 Data-Driven Immunization
  8.4 Experiments
    8.4.1 Experimental Setup
    8.4.2 Results
    8.4.3 Case Studies
  8.5 Conclusion

9 Detecting Media and Kernel Community
  9.1 Problem Formulation
    9.1.1 Preliminaries
    9.1.2 Media nodes
    9.1.3 Kernel Communities
    9.1.4 Ordinary Nodes
    9.1.5 Relative structure
    9.1.6 The MeiKeCom task
  9.2 Our Methods
    9.2.1 Finding Media Nodes
    9.2.2 Finding kernel communities
  9.3 Experiments
    9.3.1 Experimental Setup
    9.3.2 Evaluation of media nodes
    9.3.3 Evaluation of kernel communities
  9.4 Conclusion

10 Conclusions and Future Work
  10.1 Conclusions
  10.2 Future Work

Bibliography


List of Figures

1.1 (a) and (b) Effectiveness for dava: SIR model on Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms vs. budget k. Higher is better. Our method dava-fast significantly outperforms other baselines by saving up to 10 times more healthy nodes. (c) An example of an uncertain environment.

1.2 (a) and (b): Eigendrop ratio vs. number of groups. (c) and (d): Footprint ratio vs. number of groups. Lower is better. Our algorithms (LP for spectral radius and GreedyLT for the LT model) consistently outperform other baseline algorithms as the number of groups changes as well as the size of groups changes.

1.3 (a) and (b): Illustration of our coarsening process. (c): Application of influence maximization. Running time vs. k: CoarseNet gets increasing orders-of-magnitude speed-up over Pmia.

1.4 (a): Illustration of our coarsening process for a temporal network. (b) and (c): Effectiveness of NetCondense on DBLP for coarsening edges (αN) and timestamps (αT) respectively. RX: the ratio of the change in the first eigenvalue of the system matrix. Higher is better. NetCondense maintains the first eigenvalue even if we reduce 90% of the graph.

1.5 Case Studies for Houston per location. Heatmap of (a): Total population; (b): Patients in eHCR; (c): Number of vaccines actually taken in eHCR; (d): Vaccine allocations from ImmuModel; (e): Vaccine allocations from ImmuConGreedy. Our approach ImmuConGreedy considers both network and patient information, and is able to find highly vulnerable areas like Texas Medical Center (Zipcode 77030).

1.6 MeiKe detects better structure: (a) an example Twitter network; (b) communities detected by Newman's algorithm; (c) ordinary communities (green), media nodes (black), and kernel communities (red) detected by MeiKe.

3.1 Graph used in the reduction from the Minimum k-union Problem.

3.2 Counter-Example.

3.3 An example of a minimum spanning tree. For p=1 and k=1, the optimal solution is node 2 in the MST. However, in the original graph, the optimal solution is node 4.

3.4 An example of a dominator tree. For p=1 and k=1, the optimal solution is node 4. For p=0.5 and k=1, the optimal solution is node 1.

3.5 Effectiveness of dava-prune: IC model with p = {0.1, 0.5, 0.9} on various Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. budget k. dava-prune outputs the same results as dava.

3.6 Effectiveness for DAV (Comparison with baselines): IC model with p = 1 on various Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms by up to 10 times in magnitude, and dava-prune saves more nodes than dava-fast. Best seen in color.

3.7 Effectiveness for DAV (Comparison with baselines): IC model with p = 0.6 on various Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms by up to 10 times in magnitude, and dava-prune saves more nodes than dava-fast. Best seen in color.

3.8 Effectiveness for DAV (Comparison with baselines): IC model with p = {0.1, 0.5, 0.9} on various Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms, and dava-prune saves more nodes than dava-fast. Best seen in color.

3.9 Effectiveness for DAV (Comparison with baselines): SIR model on Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms, and dava-prune saves up to 16k more nodes than our second best algorithm dava-fast. Best seen in color.

3.10 Effectiveness for DAV (Comparison with baselines over the size of I0): SIR model on Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. size of I0. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms. Best seen in color.

3.11 Effectiveness for DAV w.r.t. distribution of I0: SIR model on Real Datasets. I0 is chosen uniformly at random from the population with age 60 or above. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e. σ′_{G,I′}(S)) vs. size of I0. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms. Best seen in color.

3.12 Running time (sec.). (a) Running time vs. budget k; (b) Running time vs. graph size (k = 200).

4.1 (a): Quality of Sample-Cas. Comparison between Sample-Cas and Optimal on KARATE over different distributions (r = (#healthy nodes saved by Sample-Cas) / (#healthy nodes saved by Optimal), and α = 0.5). (b) and (c): Comparison between Expect-Dom and Expect-Eig. Ratio R = (#healthy nodes saved by Expect-Dom) / (#healthy nodes saved by Expect-Eig) vs. α. Expect-Dom performs better than Expect-Eig when R > 1, otherwise Expect-Eig is better.

4.2 Effectiveness (α = 0.5, UNIFORM). Expected number of healthy nodes after distributing vaccines vs. budget k. Higher is better. (a), (b), (d), (e): IC model; (c), (f): SIR model. Sample-Cas and Expect-Max outperform other baseline algorithms.

4.3 Effectiveness (α = 0.5). Expected number of healthy nodes after distributing vaccines vs. budget k. Higher is better. (a), (b): IC model; (c): SIR model. Sample-Cas and Expect-Max outperform other baseline algorithms.

5.1 Effectiveness for the LT model on various Real Datasets (edge deletion). Graph susceptibility ratio ((footprint when vaccines are given) / (footprint without giving vaccines)) vs. number of vaccines. Lower is better. Greedy-LT consistently outperforms other baseline algorithms.

5.2 Effectiveness for the LT model on various Real Datasets (node deletion). Graph susceptibility ratio ((footprint when vaccines are given) / (footprint without giving vaccines)) vs. number of vaccines. Lower is better. Greedy-LT consistently outperforms other baseline algorithms.

5.3 Effectiveness for the change of the first eigenvalue on various Real Datasets (edge deletion). Eigendrop ratio (λ′_G/λ_G) vs. number of vaccines (λ′_G is the expected eigenvalue after allocating vaccines). Lower is better. sdp, GroupGreedyWalk, and lp consistently outperform other baseline algorithms.

5.4 Effectiveness for the change of the first eigenvalue on various Real Datasets (node deletion). Eigendrop ratio (λ′_G/λ_G) vs. number of vaccines (λ′_G is the expected eigenvalue after allocating vaccines). Lower is better. qp consistently outperforms other baseline algorithms.

5.5 (a) and (b): Eigendrop ratio vs. number of groups. (c) and (d): Graph susceptibility ratio vs. number of groups. Lower is better. Our algorithms consistently outperform other baseline algorithms as the number of groups changes as well as the size of groups changes.

5.6 Vaccine Distributions for PORTLAND and MIAMI (Budget=10000). Number of vaccines vs. Age. (Age range '0-9': 1; '10-19': 2; '20-29': 3; '30-39': 4; '40-49': 5; '50-59': 6; '60-69': 7; '70-79': 8; '80-89': 9; '90-': 10.)

6.1 Why reweight?

6.2 Reweighting of edges after merging node pairs.

6.3 Effectiveness of coarseNet for GCP. λ vs. α for coarseNet and random. coarseNet maintains λ values.

6.4 Scalability of coarseNet for GCP. (a,b,c) Linear w.r.t. α. (d) Near-linear w.r.t. size of graph.

6.5 Effectiveness of cspin. Ratio of influence spread between cspin and pmia for (a) different datasets; (b) varying α. (c) Running time vs. α.

6.6 Scalability of cspin. (a,b) vs. k; (c) vs. size of graph. cspin gets increasing orders-of-magnitude speed-up over pmia.

6.7 Distribution of # groups entered by movie traces.

7.1 Condensing a Temporal Network.

7.2 (a) Example of merge operation on a single edge (a, b) when time-pair {i, j} is merged to form super-time k. (b) Example of node-pair {a, b} being merged in a single time i to form super-node c.

7.3 (a) G, and (b) corresponding FG.

7.4 RX = λ^cond_X / λ_X vs. αN (top row, αN = 0.5) and vs. αT (bottom row, αT = 0.5).

7.5 Plot of RS = λ^NetCondense_S / λ^GreedySys_S.

7.6 Condensed WorkPlace (αN = 0.6, αT = 0.5).

7.7 Condensed School (αN = 0.5 and αT = 0.5).

7.8 (a) Near-linear scalability w.r.t. size; (b) Near-linear speed-up w.r.t. number of cores for parallelized implementation.

8.1 Overview of our approach. We first generate a set of cascades, then allocate vaccines to different locations.

8.2 Counter-Example.

8.3 Effectiveness of ImmuConGreedy on the whole R. Infection ratio r vs. vaccine budget m. Infection ratio r = Σ_{Mi∈M} σ_{G,Mi}(x) / Σ_{Mi∈M} σ_{G,Mi}(0). Lower is better. ImmuConGreedy consistently outperforms other baselines over all datasets.

8.4 Effectiveness of ImmuConGreedy for the testing data. Infection ratio r vs. vaccine budget m. Lower is better. ImmuConGreedy consistently outperforms other baselines for both MIAMI and Houston.

8.5 Robustness of ImmuConGreedy as data size varies. Ratio of saved nodes RS vs. percentage of used log data p%. RS = S_Data / S_Model. S_Data (S_Model): the number of nodes we can save when vaccines are allocated according to ImmuConGreedy (ImmuModel). Percentage of used log data p: [N(t0), . . . , p%·N(tmax)]. Higher: ImmuConGreedy is closer to ImmuModel.

8.6 Scalability. (a) total running time of MappingGeneration and ImmuConGreedy vs. vaccine budget m; (b) total running time of MappingGeneration and ImmuConGreedy vs. number of cascade samples k.

8.7 Case Studies for Houston and MIAMI per location. Houston: (a), (b), (c), (d) and (e); MIAMI: (f), (g), (h), (i) and (j). Heatmap of (a) and (f): Total population; (b) and (g): Patients in eHCR; (c) and (h): Number of vaccines actually taken in eHCR; (d) and (i): Vaccine allocations from ImmuModel; (e) and (j): Vaccine allocations from ImmuConGreedy.

9.1 Our method detects more intuitive structure: (a) an example Twitter retweet network; (b) communities detected by Newman's algorithm; (c) ordinary communities (green), media nodes (black), and kernel communities (red) detected by NetCondense.

9.2 Structure of K, M, and O.

9.3 Left: the graph G′ with edge weights to represent local effects of diffusion for G; Right: the resulting merged graph with new weights when node a and node b in G′ are merged into a new node c.


List of Tables

1.1 Structure of the thesis.

3.1 Terms and Symbols.

3.2 Datasets.

3.3 Running time (sec.) of dava, dava-prune, dava-fast and Netshield when k = 200. Runs terminated when running time t > 24 hours (shown by '-'). We did not show the running time of Random, Degree, and PageRank because they are fast heuristics.

4.1 Terms and Symbols.

4.2 Uncertainty models for initial infections used in this chapter.

4.3 Datasets.

4.4 Running times (sec.) when k = 100 and l = 200 (α = 0.5). Runs terminated when running time t > 24 hours (shown by '-').

5.1 Terms and Symbols.

5.2 Datasets.

6.1 Symbols.

6.2 Datasets: Basic Statistics.

6.3 Insensitivity of cspin to random pullback choices: Expected influence spread does not vary much.

7.1 Summary of symbols and descriptions.

7.2 Datasets Information.

7.3 Performance of CondInf (CI) with ForwardInfluence (FI) and Greedy-OT (GO) as base methods. σm and Tm are the footprint and running time for method m respectively. '-' means the method did not finish.

7.4 Additional Datasets for EDP.

7.5 Performance of CondED. F1 stands for F1-Score. Speed-up is the ratio of the time to run SnapNETS on G to the time to run SnapNETS on G_cond.

8.1 Terms and Symbols.

8.2 Network Datasets.

8.3 MappingGeneration. ᾱ_M: average of α_M over all M ∈ M; α*: optimal value of α_M; N = Σ_{t=t1}^{tmax} |N(t)|₁.

9.1 Terms and Symbols.

9.2 Datasets Information.

9.3 Quality of media nodes compared to the ground-truth.

9.4 Quality of NetCondense for MeiKeCom-media.

9.5 Quality (F1-score) of kernel communities compared to other competitors on Coauthor. DP: Distributed and Parallel Computing; GV: Graphics and Vision; NC: Networks and Communications.


Chapter 1

Introduction

Networks (aka graphs) arise in many areas, as they are natural structures to model relations between objects. Examples include online social networks, web-linkage graphs, protein-interaction networks, communication networks, autonomous-system graphs, and many more. Hence, network analysis has become a popular approach to study a variety of interactions in the real world, in areas like online media, biology, sociology, cyber security and so on. For example, analyzing online social networks has become vital for viral marketing [142] and crowdsourcing [68]; studying topologies of web-linkage graphs is critical for search engines [124] and recommendation systems [79]; uncovering protein-interaction networks provides insights into molecular processes [171]; building autonomous-system graphs helps in developing next-generation network protocols [45].

Diffusion (aka propagation), as a phenomenon of spreading contagions (like viruses, ideas, memes, etc.) on an underlying network, is also prevalent: from memes/opinions disseminating by so-called "word-of-mouth" effects on friendship networks, to malware propagating through computer networks, to diseases like flu spreading over person-to-person contact networks. The abundance of diffusion phenomena has posed various challenging and fascinating research problems, like how people spread information in online communities, why memes can suddenly gain widespread popularity, and many others. Since networks play a fundamental role in the spread of diseases, opinions and information, they are usually crucial to solving these problems. For example, identifying influential individuals in a friendship network (like Facebook) is an efficient method for viral marketing [72, 88, 142], and vaccinating vulnerable people in a contact network can effectively stop an epidemic [19, 166]. Due to the importance of networks for studying diffusion, the past few years have seen many novel graph mining tasks proposed for diffusion analysis, including diffusion modeling [44, 112, 127, 133], influence maximization [22, 72, 142], controlling contagions [19, 74, 165, 166], and so forth.

Diffusion study in networks comes with several challenges. The first one is realism. Most of the previous work has some "strict" assumptions. For example, to control diffusion in networks, previous studies typically assume some specific structure of the network (like scale-free networks [17, 95]), or have no prior information of infections (so-called "pre-emptive" intervention [7, 25]). These assumptions are almost never true in practice. The second challenge is scalability. With the availability of large propagation data and large-scale networks, some algorithms in past work with high computational cost (like the greedy algorithm for influence maximization [22, 72]) might not scale up to real data. Hence, we need to develop fast and scalable algorithms, or propose innovative frameworks to adapt existing approaches to big data. Finally, few previous studies leverage diffusion to gain insights on networks, such as how nodes participate in dissemination. With the huge volumes of propagation data, it is unclear how to use such data to understand network structure more effectively.

1.1 Thesis Overview, Statement and Structure

Overview and Thesis Statement. To tackle the above challenges, in this thesis, we develop efficient, practical and implementable graph mining algorithms to (1) optimize network structure to control the outcome of contagion propagation, and (2) understand node interactions in light of diffusion. We conduct our study on mining massive diffusion and graph data for real-world applications in a variety of areas such as public health, epidemiology, and social media. Our work leverages a broad range of techniques in big data analytics, graph theory, machine learning, optimization and epidemiology. Our work mainly targets two research threads:

• Controlling rumors/viruses from spreading via optimizing graph topologies. We seek to answer questions like: given a who-contacts-whom network, how to allocate vaccines to effectively contain influenza pandemics when an infection is already in progress? Given an online friendship network, how to choose the best communities or individuals to stop rumors from spreading? How to utilize plentiful flu records and contact networks in major cities for influenza intervention? We propose efficient and rigorous approximation algorithms to optimize network structure (e.g., removing nodes/edges) to control diffusion (immunization). In contrast to the previous literature, our proposed methods are more efficient, practical and implementable: they tackle graphs with millions of nodes and propagation records with billions of items, significantly outperform other state-of-the-art approaches, and take into account the presence of infections, the uncertain nature of the data, group structure of the population, and rich surveillance data.

• Gaining a deeper understanding of a network by studying diffusion processes on it. We are interested in questions like: can we quickly "zoom out" of a large-scale network to help speed up algorithms for applications like viral marketing? How to identify different roles of users who participate in propagations? We leverage rich diffusion information to better understand network structure. The problems we study include network summarization w.r.t. influence for both static and temporal graphs, and community and role discovery from diffusion. This work can facilitate several graph mining tasks such as influence maximization, community detection, and link prediction.

In summary, we explore both model-driven and data-driven approaches with different levels of network granularity, ranging from the individual node/edge level to the group/community level. For the model-driven approach, we apply classical mathematical models (such as SIR and IC) to describe diffusion; for the data-driven approach, we take advantage of the huge volume of cascade data (like tweets and electronic medical records) for our study.

Our thesis statement is:

Novel network optimization algorithms help in better controlling diffusion, while diffusion can also support a better understanding of network structure.

Thesis Structure. Following the thesis statement, we organize the thesis into two parts: model-driven and data-driven. The outline is shown in Table 1.1. We first introduce the Model-Driven Part (Part I), consisting of Chapters 3, 4, 5, 6 and 7, and then present the Data-Driven Part (Part II), consisting of Chapters 8 and 9.

Table 1.1: Structure of the thesis.

Model-Driven
• Network Optimization for Diffusion: Data-Aware Vaccination (Chapter 3), Uncertain Data-Aware Vaccination (Chapter 4), Group Immunization (Chapter 5)
• Understanding Network Structure using Diffusion: Graph Coarsening (Chapter 6), Temporal Graph Coarsening (Chapter 7)

Data-Driven
• Network Optimization for Diffusion: Data-Driven Immunization (Chapter 8)
• Understanding Network Structure using Diffusion: Detecting Media and Kernel Community (Chapter 9)

The following sections will briefly summarize each chapter (Section 1.2), and present our contributions (Section 1.3).

1.2 Summary of the Work

1.2.1 Part I: Model-driven

Diffusion processes are usually described by mathematical models such as the Susceptible-Infectious-Recovered (SIR) model in epidemiology [5], and the Independent Cascade (IC) model and the Linear Threshold (LT) model in social media [72]. Given these popular models, we study how to manipulate and understand network structure w.r.t. diffusion. To be specific, we study the immunization problem (controlling diffusion) as well as the graph summarization problem (zooming out of networks) under various settings for several diffusion models. In contrast to previous work, our proposed methods are more practical and realistic. For instance, we tackle the immunization problem in the presence of already-infected nodes, as well as in the group setting. In terms of the graph summarization problem, we study how to obtain a smaller representation of a large network under both static and temporal settings, while maintaining the diffusion property for certain propagation models.
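To make the IC model concrete, the following minimal Python sketch (our illustration, not code from the thesis) estimates the expected "footprint" (total number of nodes ever infected) by Monte Carlo simulation: each newly infected node gets a single chance to infect each still-healthy neighbor with probability p. The adjacency-dict format, parameter names, and toy graph are assumptions made for this example.

    import random

    def ic_footprint(adj, seeds, p=0.5, trials=2000):
        """Estimate the expected IC-model footprint from a set of seed nodes."""
        total = 0
        for _ in range(trials):
            infected = set(seeds)
            frontier = list(seeds)
            while frontier:
                nxt = []
                for u in frontier:                  # each new infection spreads once
                    for v in adj.get(u, ()):
                        if v not in infected and random.random() < p:
                            infected.add(v)
                            nxt.append(v)
                frontier = nxt
            total += len(infected)
        return total / trials

    # toy graph: a path 0-1-2 attached to a triangle 2-3-4
    adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}
    print(ic_footprint(adj, seeds={0}))

The SIR model adds a recovery step, but the same simulation skeleton applies; only the per-round state updates change.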

Network Optimization for Diffusion

Chapter 3: Data-Aware Vaccination. The main question we answer in this chapter is: given a graph, like a social/computer network or the blogosphere, in which an infection (or meme or virus) has been spreading for some time, how to select the k best nodes for immunization/quarantining immediately? Most previous work in immunization tries to control an epidemic before it has started; in practice, however, this is almost never possible. Instead, in this chapter, we study how to immunize healthy nodes when an infection is already in progress. Efficient algorithms for such a problem can help public-health experts make more informed choices of immunization. We formulate the Data-Aware Vaccination problem, and prove it is NP-hard. We then propose several effective subquadratic-time heuristics: dava, dava-prune and dava-fast. We demonstrate the scalability and effectiveness of our algorithms through extensive experiments on multiple real networks including epidemiology datasets, which show substantial gains of up to 10 times more healthy nodes at the end. Figure 1.1(a) and (b) illustrate our results on two large population contact networks.
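For intuition only, the sketch below shows a naive simulation-based greedy baseline for this setting (our own construction, not the chapter's dava algorithms, which avoid exactly this cost): given already-infected nodes, repeatedly vaccinate the healthy node whose removal minimizes the simulated number of eventual infections under the IC model. All function names and the toy graph are assumptions.

    import random

    def expected_infections(adj, infected0, removed, p=0.5, trials=300):
        """Monte Carlo estimate of the final number of infected nodes."""
        total = 0
        for _ in range(trials):
            inf = set(infected0)
            frontier = list(inf)
            while frontier:
                nxt = []
                for u in frontier:
                    for v in adj.get(u, ()):
                        if v not in inf and v not in removed and random.random() < p:
                            inf.add(v)
                            nxt.append(v)
                frontier = nxt
            total += len(inf)
        return total / trials

    def greedy_vaccinate(adj, infected0, k, p=0.5):
        """Pick k healthy nodes to remove, one at a time, by simulation."""
        removed = set()
        healthy = set(adj) - set(infected0)
        for _ in range(k):
            best = min(healthy - removed,
                       key=lambda v: expected_infections(adj, infected0,
                                                         removed | {v}, p))
            removed.add(best)
        return removed

    adj = {0: [1], 1: [0, 2], 2: [1, 3, 4], 3: [2, 4], 4: [2, 3]}
    print(greedy_vaccinate(adj, infected0={0}, k=1))   # likely {1}: cuts off the path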

Figure 1.1: (a) PORTLAND and (b) MIAMI: Effectiveness of dava: SIR model on Real Datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms vs. budget k. Higher is better. Our method dava-fast significantly outperforms other baselines by saving up to 10 times more healthy nodes. (c) An example of an uncertain environment (the surveillance pyramid).

Chapter 4: Uncertain Data-Aware Vaccination. Given a noisy or sampled snapshot of a network, in which an infection has been spreading for some time, what are the best nodes to immunize (vaccinate)? Typically, surveillance data on who is infected is limited or sampled. For example, the surveillance pyramid for monitoring flu cases (Figure 1.1(c)) shows sources of uncertainty in public health: each level of it has a certain probability of missing some truly infected people. Hence, it is important to account for this uncertainty while allocating resources. In the previous chapter, we defined the Data-Aware Vaccination problem. In this chapter, we extend it to an uncertain environment, where we have information consisting of confirmed cases as well as a probability distribution over unknown cases. We formulate the Uncertain Data-Aware Immunization problem, and design multiple efficient algorithms that naturally take the uncertainty into account, while providing robust solutions. Experimental results on large epidemiological and social networks show the efficiency and scalability of our methods.
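The core sampling idea can be sketched as follows (an illustration under our own assumptions, not the chapter's Sample-Cas or Expect-Max algorithms): draw many possible "true" initial infection sets from the per-node infection probabilities, and score a candidate vaccination set by its average performance over those draws.

    import random

    def sample_infected(prob):
        """Draw one possible 'true' infection set; prob maps node -> P(infected)."""
        return {v for v, q in prob.items() if random.random() < q}

    def expected_saved(score_fn, prob, vaccinated, samples=500):
        """Average a scoring function over sampled infection scenarios."""
        return sum(score_fn(sample_infected(prob), vaccinated)
                   for _ in range(samples)) / samples

    # toy score ignoring spread: nodes neither infected nor used as vaccines
    nodes = {0, 1, 2, 3}
    prob = {0: 0.9, 1: 0.3, 2: 0.1, 3: 0.0}    # near-confirmed vs. unlikely cases
    score = lambda infected, vac: len(nodes - infected - vac)
    print(expected_saved(score, prob, vaccinated={1}))

In the real problem, score_fn would itself be a diffusion simulation like the IC footprint sketched earlier, which is what makes careful sampling and faster surrogates necessary.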

Chapter 5: Group Immunization. Chapters 3 and 4 study the immunization problem in the data-aware environment. However, it is sometimes hard to ensure that specific individuals actually take the vaccine. Instead, immunization at group scale (like schools and communities) is usually more practical due to constraints in implementation and compliance. In this chapter, we seek to answer: given a network with groups, such as a contact network grouped by ages, which are the best groups to immunize to control the epidemic? Equivalently, how to best choose communities in social networks like Facebook to stop rumors from spreading? We formulate the problem of controlling propagation at group scale, called the Group Immunization problem, for multiple natural settings (for both threshold and cascade-based contagion models, under both node-level and edge-level interventions), and develop multiple efficient algorithms, including provably approximate solutions. We demonstrate that our algorithms significantly outperform other heuristics, and adapt to the group structure. Figure 1.2 shows the results of our algorithms against other competitors under different settings.
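The spectral-radius variant can be illustrated in a few lines of numpy (a toy exhaustive comparison of ours, not the chapter's LP/QP/SDP methods): vaccinating a whole group deletes its rows and columns from the adjacency matrix, and a group is better if it yields a smaller eigendrop ratio λ′_G/λ_G, since the first eigenvalue is a well-known proxy for the epidemic threshold. The group definitions and toy matrix are assumptions.

    import numpy as np

    def spectral_radius(A):
        return max(abs(np.linalg.eigvals(A))) if A.size else 0.0

    def eigendrop_ratio(A, group):
        """lambda' / lambda after removing an entire group; lower is better."""
        keep = [i for i in range(A.shape[0]) if i not in set(group)]
        return spectral_radius(A[np.ix_(keep, keep)]) / spectral_radius(A)

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    groups = {"school": [0, 1], "office": [2, 3]}
    best = min(groups, key=lambda g: eigendrop_ratio(A, groups[g]))
    print(best, eigendrop_ratio(A, groups[best]))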

Figure 1.2: (a) PORTLAND (Node, Budget=20,000) and (b) YouTube (Edge, Budget=5000): Eigendrop ratio vs. number of groups (spectral radius). (c) OregonAS (Edge, Budget=1000) and (d) YouTube (Node, Budget=1000): Footprint ratio vs. number of groups (LT model). Lower is better. Our algorithms (LP for spectral radius and GreedyLT for the LT model) consistently outperform other baseline algorithms as the number of groups changes as well as the size of groups changes.


Understanding Network Structure using Diffusion

Chapter 6: Graph Coarsening. Is there a smaller equivalent representation of a graph that preserves its propagation characteristics? Can we group nodes together based on their influence properties? To understand the relationships of nodes in a large network, it helps to look at a smaller "copy" of it. In addition, it is sometimes impossible to deploy a graph mining algorithm on a large network due to high computational costs; can we instead run the algorithm on a small summary of the large graph? To answer these questions, in this chapter we formulate a novel Graph Coarsening problem to find a succinct representation of any graph while preserving key characteristics for diffusion processes on that graph, and develop a near-linear-time algorithm, CoarseNet, which enables us to reduce the graph by 90% in some cases without much loss of information. Figure 1.3(a) and (b) show an example of coarsening one edge of a graph. We also show that our method can help in applications like influence maximization and detecting patterns of propagation at the level of automatically created groups on real cascade data (Figure 1.3(c)).
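The flavor of the problem can be seen in a small numpy sketch: merge a node pair into a super-node and check how far the first eigenvalue (which governs diffusion behavior under many models) moves. The naive weight averaging below is a placeholder assumption of ours; deriving a principled reweighting and a near-linear merge ordering is precisely what CoarseNet contributes.

    import numpy as np

    def merge_pair(A, a, b):
        """Collapse node b into node a with naive weight averaging (placeholder)."""
        M = A.astype(float).copy()
        M[a, :] = (M[a, :] + M[b, :]) / 2.0
        M[:, a] = (M[:, a] + M[:, b]) / 2.0
        M[a, a] = 0.0                               # no self-loop on the super-node
        keep = [i for i in range(A.shape[0]) if i != b]
        return M[np.ix_(keep, keep)]

    spectral_radius = lambda M: max(abs(np.linalg.eigvals(M)))

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    B = merge_pair(A, 2, 3)
    print(spectral_radius(B) / spectral_radius(A))  # near 1.0 = diffusion preserved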

Figure 1.3: (a) Original Graph and (b) Coarsened Graph: illustration of our coarsening process. (c) Flickr (varying sizes): application to influence maximization. Running time vs. k: CoarseNet gets increasing orders-of-magnitude speed-up over Pmia.

Chapter 7: Temporal Graph Coarsening. Modern networks are very large and also evolve with time. As their size grows, the complexity of performing network analysis grows as well. Hence, getting a smaller representation of a temporal network with similar properties will help in various graph mining tasks. In this chapter, we extend the graph coarsening problem to the temporal-graph setting. We aim to get a smaller diffusion-equivalent representation of a set of time-evolving networks (see Figure 1.4(a)). In particular, we first formulate a well-founded and general temporal-network summarization problem based on the so-called system matrix of the network, and then propose NetCondense, a scalable and effective algorithm with subquadratic running time and linear space complexity. Our extensive experiments show that we can reduce the size of large real temporal networks (from multiple domains such as social, co-authorship and email) significantly without much loss of diffusion information (see Figure 1.4(b) and (c)). In addition, we leverage NetCondense for several tasks to validate its wide applicability, including influence maximization and event detection on temporal networks.

Figure 1.4: (a) Illustration of our coarsening process for a temporal network. (b) (αN = 0.5) and (c) (αT = 0.5): effectiveness of NetCondense on DBLP for coarsening edges (αN) and timestamps (αT) respectively. RX: the ratio of the change in the first eigenvalue of the system matrix. Higher is better. NetCondense maintains the first eigenvalue even if we reduce 90% of the graph.

1.2.2 Part II: Data-driven

This part of the thesis is devoted to optimizing network structure for diffusion and gaining a deeper understanding of networks in light of diffusion processes from a data-driven perspective. Recent developments in real-time monitoring and data storage technologies have provided us with rich propagation data. For example, in public health, influenza activity can be tracked by medical surveillance in the form of electronic healthcare records (eHCR); in social media, with the advent of the Internet, mobile platforms, and web services, diffusion data like meme cascades and opinion sharing can be easily recorded. The huge volume of data allows us to directly investigate the mechanism of diffusion in terms of network structure by relaxing restrictive propagation models, or even without assuming any models. In this part, we apply data-driven methodologies to study challenging problems like developing network optimization algorithms for vaccine allocation policies, and grouping nodes based on their roles during diffusion. In contrast to model-driven approaches, this part focuses on learning knowledge directly from data, leading to a better understanding of network structure and even more realistic immunization policies obtained by optimizing networks. To be specific, we study the following problems at the group/community level: developing efficient data-driven immunization algorithms, and detecting media and kernel communities w.r.t. diffusion.

Network Optimization for Diffusion

Chapter 8: Data-Driven Immunization. Given a contact network and coarse-grained diagnostic information like electronic Healthcare Reimbursement Claims (eHCR) data, can we develop efficient intervention policies to control an epidemic? Most existing studies on immunization assume prior epidemiological models. In practice, disease spread is usually complicated, so assuming an underlying model may bias results away from the true spreading patterns, leading to possibly inaccurate interventions. In this chapter, we take both a propagation log and contact networks into account for controlling propagation. We formulate the novel and challenging Data-Driven Immunization problem without assuming classical epidemiological models. We first propose an efficient sampling approach, SampleGreedy, to align surveillance data with contact networks, then develop an efficient algorithm, ImmuConGreedy, with a provable approximation guarantee for immunization. Extensive experiments on multiple datasets show the effectiveness and scalability of our methods. Finally, we also conduct case studies on nationwide real medical surveillance data, and find interesting patterns (see Figure 1.5).
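To convey the data-driven flavor (this toy coverage heuristic is ours, not ImmuConGreedy): instead of assuming a diffusion model, score candidate locations directly on the observed cascades, represented here simply as the sets of locations each flu cascade touched, and greedily spend the vaccine budget where it covers the most cascade activity. The cascade representation and zip codes below are hypothetical.

    def greedy_allocate(cascades, budget):
        """Greedy max-coverage over observed cascades (list of location sets)."""
        chosen, remaining = [], [set(c) for c in cascades]
        for _ in range(budget):
            candidates = set().union(*remaining) if remaining else set()
            if not candidates:
                break
            best = max(candidates, key=lambda loc: sum(loc in c for c in remaining))
            chosen.append(best)
            remaining = [c - {best} for c in remaining if c - {best}]
        return chosen

    # toy cascades over zip codes (hypothetical data)
    cascades = [{"77030", "77002"}, {"77030", "77005"}, {"77030"}, {"77005", "77002"}]
    print(greedy_allocate(cascades, budget=2))      # e.g. ['77030', ...]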

Figure 1.5: Case studies for Houston per location. Heatmaps of (a) total population; (b) patients in eHCR; (c) number of vaccines actually taken in eHCR; (d) vaccine allocations from ImmuModel; (e) vaccine allocations from ImmuConGreedy. Our approach ImmuConGreedy considers both network and patient information, and is able to find highly vulnerable areas like the Texas Medical Center (zipcode 77030).

Understanding Network Structure using Diffusion

Chapter 9: Detecting Media and Kernel Community. How to find communities of nodes based on their diffusive characteristics? Answering this question can help us better understand the different roles of nodes during diffusion. In this chapter, we investigate two important types of nodes for diffusion: nodes that are influential (“kernel nodes”), and nodes that serve as “bridges” to boost the diffusion (“media nodes”). We give an intuitive and novel optimization-based problem, Detecting Media and Kernel Community, for this task, which aims to discover media nodes as well as community structures of kernel nodes. We develop an efficient algorithm, MeiKe, which first obtains media nodes via a new successive-summarization-based approach, and then finds kernel nodes including their community structures. Experimental results show that MeiKe finds high-quality media and kernel communities (outperforming non-trivial baselines by 40% in F1-score). In addition, our case studies demonstrate the applicability of MeiKe on real datasets (see Figure 1.6).


Figure 1.6: MeiKe detects better structure: (a) an example Twitter network; (b) communities detected by Newman’s algorithm; (c) ordinary communities (green), media nodes (black), and kernel communities (red) detected by MeiKe.

1.3 Contributions and Impact

We give the main contributions and impact of the thesis next. To the best of our knowledge, it is the first work that systematically develops more realistic, implementable and data-based graph algorithms to control contagions. In addition, we are the first to use diffusion to effectively summarize graphs and understand communities/groups of networks in a general way.

1.3.1 Contributions

Model-Driven

• Vaccination under Data-Aware Environments. We are the first to study the immunization problem under data-aware environments, in the presence of already infected nodes as well as the uncertain nature of infection surveillance data. Our methods make it more realistic for public health experts to make real-time decisions for immunization given limited information and resources. In contrast, most past work on controlling propagation has neither developed strategies for vaccination after the start of an epidemic, nor considered uncertain surveillance data.

• Group Immunization. We are the first to propose group-level immunization intervention problems in both cascade-style models and threshold-based models. We consider arbitrarily specified groups, and interventions that involve both edge and node removal, modeling quarantining and vaccination, respectively. Compared to state-of-the-art individual-based immunization strategies, our methods can help experts in epidemiology and social media make more practical decisions for controlling diffusion, such as epidemic contagions and rumor contagions.

• Graph Summarization for Diffusion. We are the first to study the problem of summarizing large graphs in terms of diffusion, for both static and temporal graphs. Our near-linear summarization algorithm for static graphs and sub-quadratic algorithm for temporal graphs reduce the size of graphs by up to 90% without much loss of key diffusion information. Furthermore, they can be used in a number of interesting applications, such as influence maximization, where we obtain high-quality solutions orders of magnitude faster than state-of-the-art algorithms.

Data-Driven

• Data-Driven Immunization. We are the first to study the immunization problem from a data-driven perspective for influenza-like illnesses. We present efficient algorithms with provably approximate solutions to allocate vaccines to locations, which can apply to various areas like public health. In addition, we are the first to leverage nation-wide real surveillance data to augment realistic interaction networks for immunization problems directly in a data-driven manner.

• Node/Community Role Discovery for Diffusion. We are the first to study roles of nodes/communities under a diffusion setting. We develop an efficient and practical algorithm to identify media nodes automatically and uncover kernel community structures. We are able to find meaningful media nodes as well as kernel communities, which can be used for applications like influence maximization.

1.3.2 Impact

• Our model-driven approaches for controlling diffusion and understanding network structure have been cited more than 60 times since 2014, and have been followed by researchers at universities such as CMU, UMD, and Michigan.

• Our work on immunization and graph summarization has been included in multiple graduate courses, such as Data Mining Large Networks and Time-Series (VT CS6604) and Data Analytics (VT CS5526).

• Our group immunization work has been used in influenza modeling and simulation studies at the Biocomplexity Institute at Virginia Tech.

• Our results on graph coarsening have been used for other data mining tasks like influence maximization, event detection and community detection.


• Our data-driven immunization algorithm is in use at the Oak Ridge National Laboratory for the influenza intervention study on massive eHCR data with billions of patient records in major US metropolitan areas.

1.4 Outline of the Thesis

The rest of the thesis is organized as follows. We first survey the related work in Chapter 2, and then present our work on the model-driven study in Part I (Chapters 3–7), followed by the data-driven study in Part II (Chapters 8–9). Finally, we conclude the thesis in Chapter 10.


Chapter 2

Survey

In this chapter, we survey the related work, including epidemiology, information diffusion, network optimization, graph summarization, and community detection and role discovery.

2.1 Diffusion for Epidemiology

The early canonical textbooks and surveys discussing modeling epidemic spread include [5, 9, 27, 51, 64, 103, 140]. Fundamental epidemiological models like SIS [9, 64] and SIR [5] were proposed to understand the complex dynamics of epidemics: the total population is divided into different states (such as Susceptible, Infected and Recovered), and states change via transition rates like the infection rate and the recovery rate. None of this early work considers the underlying network structure. Pastor-Satorras and Vespignani [127] first used epidemiological models to study the diffusion of computer viruses in scale-free networks, while Moore and Newman [112] studied epidemics in small-world networks. Other studies that apply epidemiological models to social networks include [50, 58].

Epidemic thresholds, the minimum virulence of a virus that results in an epidemic, have been extensively studied along with epidemiological models. Much work has gone into finding epidemic thresholds for a variety of networks [5, 73, 106, 128, 174]. Recent results [44, 133] show that the spectral radius, the largest eigenvalue of the adjacency matrix of a network, is connected to the reproduction number in epidemiology, and determines the phase transition (‘epidemic threshold’) between epidemic/non-epidemic regimes in several epidemiological models like SIR and SIS. Some of our studies for immunization aim to minimize the spectral radius.
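As a concrete illustration of this connection (a toy Python sketch, not part of the thesis; the SIS threshold condition s = λ1 · β/δ < 1 follows the cited results [44, 133]):

import networkx as nx
import numpy as np

# Toy network; lambda_1 is the spectral radius of its adjacency matrix.
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
lambda_1 = np.linalg.eigvalsh(A).max()  # A is symmetric for undirected graphs

# Under the SIS model with infection rate beta and curing rate delta,
# the effective strength is s = lambda_1 * beta / delta; the epidemic
# dies out when s < 1, which is why immunization can target lambda_1.
beta, delta = 0.05, 0.5
print(f"lambda_1 = {lambda_1:.3f}, s = {lambda_1 * beta / delta:.3f}")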

To summarize, we study the immunization problem (controlling an epidemic) under a variety of epidemiological models (like SIR/SIS) in more realistic and practical settings, with granularity ranging from the individual level to the group level.


2.2 Information Diffusion in Social Media

Information diffusion can in general be described via cascade-style models and threshold-based models.

Cascade-style models assume information is propagated over edges of the network. Many cascade-style models have been proposed in the literature, such as the discrete-time independent cascade model [72], and more recently the continuous-time independent cascade model [30, 31]. Most recent work aims to learn the mechanism of how information is propagated, or patterns of online content diffusion on social media like Twitter, Facebook and news article websites. Leskovec et al. study the diffusion of news cycles, while Galuba et al. [43] study the diffusion of URLs retweeted on Twitter. Matsubara et al. [99] give an SI-based model to extract patterns of diffusion in the popularity of online content on Twitter, blog posts and news media articles. There is other related literature studying applications of information propagation processes on networks, including information cascades [14, 53], blog propagations [58, 84, 91], and viral marketing [88, 142].

A different type of diffusion model is the threshold model [20, 56], which models diffusion as a threshold behavior. The idea is that a node becomes infected when it reaches some threshold, usually related to the number of infected neighbors in the network. For example, under the linear threshold (LT) model [72], a node is infected if the sum of the weights of its infected neighbors is greater than a randomly chosen threshold. There is also a lot of research interest in threshold-model-related problems, including learning adoption rates (thresholds) on Wikipedia [26] and Twitter [65, 144], user content sharing patterns [10], and social-group-based adoption on Facebook [167].
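To make the LT dynamics concrete, here is a minimal Python sketch of a single LT run (illustrative only, not from the thesis; thresholds are drawn uniformly at random as in [72], and unweighted edges default to the common 1/degree normalization):

import random
import networkx as nx

def linear_threshold_run(G, seeds, rng=random.Random(0)):
    # Each node draws a uniform random threshold; a healthy node becomes
    # infected once the total weight from infected neighbors reaches it.
    theta = {v: rng.random() for v in G}
    infected = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in G:
            if v in infected:
                continue
            w = sum(G[u][v].get("weight", 1.0 / G.degree(v))
                    for u in G.neighbors(v) if u in infected)
            if w >= theta[v]:
                infected.add(v)
                changed = True
    return infected

G = nx.erdos_renyi_graph(50, 0.1, seed=1)
print(len(linear_threshold_run(G, seeds={0})))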

We study the network optimization problem in light of diffusion under both cascade-style and threshold-based models. For cascade-style models, we try to minimize the spectral radius, while for threshold-based models we want to minimize the number of infections at the end of diffusion.

2.3 Network Optimization

Immunization Algorithms. The immunization problem aims to develop optimal strategies for vaccine allocation. Most existing work in the literature studies the problem using a “pre-emptive” approach, i.e., distributing vaccines before the start of the epidemic. Cohen et al. [25] propose the acquaintance immunization policy for both the SIS and SIR models (pick a random person, and immunize one of its neighbors at random). In addition, the immunization problem has been well studied on power-law graphs [17, 95]. Hayashi et al. [62] develop efficient strategies against e-mail virus attacks on power-law computer networks under the so-called SHIR model (Susceptible, Hidden, Infectious, Recovered). A game of inoculation, given a cost and loss-risk and under random starting points, has been studied in [7]. In a similar vein, an O(log n) approximation for minimizing ‘societal’ cost under a deterministic propagation model was given in [21]. Kimura et al. [75] first study the problem of blocking links (edge removal) in a network. Recently, two formulations of the problem of blocking a contagion through edge removals under the model of discrete dynamical systems were given [11, 83]. Many studies also compare the performance of a limited number of pre-determined sequences of interventions (like school closure, or antivirals for treatment) within simulation models [32, 36, 60]. Yaesoubi and Cohen [176] focus on optimal dynamic policies under a simplified flu-like model and a homogeneous population (i.e., every node is connected to every other node). Finally, the work most related to our problem includes minimizing the largest eigenvalue of the graph at the individual level for node removal [166] and edge removal [165], and minimizing the expected number of infected nodes under the LT model for edge removal [74].

To summarize, none of the above studies looks into the immunization problem under the data-aware environment at both the individual and group scales. Furthermore, they also fail to study the immunization problem directly from data, without assuming any models.

Techniques for immunization algorithms on graphs. Graph-based immunization algorithms leverage a variety of techniques, including matrix perturbation [165, 166], matrix spectra [93, 131], submodularity [72, 74, 89, 157], graph walks [148, 184], and SDP and QP [185]. In particular, matrix spectrum and perturbation techniques are widely used to minimize the epidemic threshold. Prakash et al. [131] study the connection between the spectral radius and the epidemic threshold, and several subsequent studies seek to optimize the spectral radius by removing nodes/edges using matrix perturbation theory [165, 166, 185]. Recently, graph-walk based approaches like [148, 184], as well as SDP and QP [185], have also been used for minimizing the spectral radius. On the other hand, to minimize the epidemic size, submodular optimization is the common technique for individual immunization [74] and group immunization over the integer lattice [157, 185]. In addition, submodularity maximization has been applied to many other graph mining tasks like influence maximization [72, 89].
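For intuition, here is a brute-force Python sketch of greedy spectral-radius minimization (illustrative only; the perturbation-based methods cited above approximate the eigendrop rather than recomputing the spectrum for every candidate):

import numpy as np
import networkx as nx

def spectral_radius(G):
    # Largest eigenvalue of the symmetric adjacency matrix.
    return np.linalg.eigvalsh(nx.to_numpy_array(G)).max()

def greedy_eigendrop_removal(G, k):
    # Repeatedly delete the node whose removal reduces lambda_1 the most.
    G = G.copy()
    removed = []
    for _ in range(k):
        best = min(G.nodes,
                   key=lambda v: spectral_radius(nx.restricted_view(G, [v], [])))
        removed.append(best)
        G.remove_node(best)
    return removed

print(greedy_eigendrop_removal(nx.karate_club_graph(), k=3))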

Other Optimization Problems. There are other network optimization problems in the literature, including PageRank [124], HITS [76], betweenness [41, 121], and coreness scores [111]. Kempe et al. [72] propose the influence maximization problem under the IC and LT models, which aims to influence the maximum number of nodes at the end of diffusion. In contrast, our immunization problems try to remove nodes/edges to minimize the expected number of infections. Other related work includes outbreak detection [82, 89] and finding the starting nodes of an epidemic [86, 135, 147, 159], both of which aim to select a subset of important nodes on graphs for diffusion-related problems.

There exists another category of network optimization problems that leverage off-the-shelf machine learning techniques. The key step is to embed a graph into a low-dimensional feature space (the graph embedding problem). Early work on graph embedding includes Laplacian Eigenmaps [13], IsoMap [162], locally linear embedding [146], and spectral techniques [8, 24, 173]. However, these methods are slow and do not scale to large networks. Recently, with the emergence of high-performance computing (HPC) [66, 67], deep learning has become a promising direction for several tasks. Leveraging deep learning techniques, several novel algorithms were proposed to learn feature representations of nodes [57, 130, 160, 170]. Perozzi et al. [130] proposed DeepWalk, which extends the skip-gram model [108] to networks and learns feature representations based on contexts generated by random walks. Grover et al. proposed a more general method, Node2Vec [57], which generalizes random walks to generate various contexts. SDNE [170] and LINE [160] learn feature representations of nodes while preserving first- and second-order proximity. Node2Vec was shown to outperform DeepWalk and LINE in link prediction. In addition to embedding nodes, graph/subgraph-based embedding problems have been studied recently [114, 143, 177]. Riesen and Bunke propose to learn vector representations of graphs based on edit distance to a set of pre-defined prototype graphs [143]. Yanardag et al. [177] and Narayanan et al. [114] learn vector representations of subgraphs using Word2Vec [108] by generating a “corpus” of subgraphs, where each subgraph is treated as a word. None of the above works embeds subgraphs such as cascades in diffusion.

2.4 Graph Summarization

The graph summarization problem seeks to find a compact representation of a large graph by leveraging global and local graph properties. This includes neighborhood-based summarization for web graphs [138]; lexicographic-based compression for web [16] and social networks [23]; query-based compression [97]; BFS-based graph summarization [6]; local similarity exploitation for unweighted graphs [115] as well as weighted graphs [164]; attribute-based graph summarization [163, 182]; node aggregation from node activities [137]; subgraph extraction [81]; and temporal-pattern extraction [153]. In contrast to our graph coarsening problem, none of these studies summarizes graphs in terms of diffusion.

Graph summarization is also related to several graph sparsification algorithms, whose goal is to either reduce storage and manipulation costs, or simplify structure. Elkin and Peleg [33] studied the problem of finding a sparse subgraph, called a “spanner”, that maintains the pairwise distances between all nodes within a multiplicative or additive factor. Fung et al. [42] study the cut-sparsifier problem, which asks for a sparse weighted subgraph. Graph sparsification is also used for influence analysis: Mathioudakis et al. [98] propose an algorithm to find the sparse backbone of an influence network. While graph sparsification removes edges (so the nodes stay the same), our graph coarsening problem contracts node pairs to reduce graphs.

2.5 Community Detection and Role Discovery

There has been a lot of work on community detection. Although there is no clear consensus, communities are viewed as sets of subgraphs with dense internal connections and sparse external connections [29, 39, 119, 120]. There are several graph partitioning algorithms for community detection, including hierarchical clustering [118, 120, 150], spectral partitioning [37, 59] and betweenness-based methods [49]. In addition, modularity-based methods are popular for detecting communities on graphs [15, 119], where the modularity metric of communities is leveraged. Recent work has also tried to find overlapping communities [3, 90, 109, 125, 178] or learn influence models at the group scale [105]. However, none of these methods looks into combining the diffusive roles of nodes with finding communities.

Our work on finding media and kernel communities is also related to the topic of role discovery. Role discovery, which tries to find nodes that perform similar functions in networks, has been studied before. McCallum et al. [101] first approach this problem using a topic-model-based method. Recent studies, like Gilpin et al. [47] and Han et al. [61], apply NMF and probabilistic generative models to this problem. The work most related to ours includes [63, 92, 172, 179]. Wang et al. [172] use the notion of “kernels” to define nodes with important roles and aim to group them together. Henderson et al. [63] use features to extract different roles of nodes, including bridge nodes that connect so-called ‘main-stream’ nodes (typically celebrities). Lou et al. [92] and Yang et al. [179] detect structural hole spanners, which bridge homogeneous communities. However, none of the existing works takes the diffusive properties of bridge nodes into account the way we do.


Part I

Model-Driven Perspective


Overview of Part I

Diffusion processes are usually described by mathematical models, such as the Susceptible-Infectious-Recovered (SIR) model [5] in epidemiology, and the Independent Cascade (IC) model [72] in social media. In this part, we seek to investigate how to optimize and understand network structure w.r.t. diffusion given underlying models. In the next five chapters, we study the immunization problem as well as the graph summarization problem.

• Immunization. Given a graph where a contagion (like influenza) has been spreading for some time, how to pick the best people to vaccinate to control the contagion? Similarly, how to best choose communities in social networks to stop rumors from spreading? These questions are related to immunization problems (developing interventions to control contagions), which naturally arise in many areas, such as epidemiology and social media. We tackle the immunization problem by studying how to optimize network structure at different granularities, from the individual level to the group level. In contrast to past work on immunization, our proposed methods are more efficient, practical and implementable: they take into account the presence of infections, the uncertain nature of the data, and the group structure of the population. In addition, our methods significantly outperform other state-of-the-art approaches; e.g., dava outperforms pre-emptive immunization algorithms by up to 10 times in both magnitude and running time.

• Graph Summarization. Is there a smaller equivalent representation of a network that preserves its propagation characteristics? This is an important problem driven by diffusion processes in various fields, leading to a better understanding of network structure, with many applications like viral marketing and diffusion characterization. In addition, networks in practice usually evolve over time. As their size grows, the complexity of performing network analysis grows as well. Getting a smaller representation of a temporal network with similar properties will also help in various graph mining tasks. Hence, in this part, we study a novel graph summarization problem, called the graph coarsening problem, for both static and temporal graphs, which tries to obtain a succinct representation of any graph as a summary, while preserving key diffusion characteristics for various propagation models. Our proposed method coarseNet for static graphs can coarsen graphs up to 90% without much loss of diffusion information. In addition, our method NetCondense for temporal graphs achieves a 48-times speed-up over state-of-the-art algorithms for influence maximization on dynamic networks.

This part is organized as follows: Chapter 3 and Chapter 4 study the data-aware and uncertain-data-aware problems respectively (node level). The work in Chapter 3 is published in SDM 2014 [186] and TKDD 2015 [188], while the work in Chapter 4 is published in CIKM 2014 [187]. Chapter 5 studies the group immunization problem, which is published in ICDM 2015 [185] and TKDE 2016 [184]. Chapters 6 and 7 discuss the graph coarsening problem for static and temporal graphs respectively; they have been published in KDD 2014 [136] and SDM 2017 [1].


Chapter 3

Data-Aware Vaccination

Given a contact network and some already infected people, which healthy persons should be immediately given vaccines to best control the epidemic? Similarly, given the follower-followee network and some rumors spreading on it, which accounts should Twitter decide to suspend/delete/warn to stop the misinformation as fast as possible? Propagation-style processes on graphs/networks are important tools to model situations of interest in real life, in areas such as epidemiology, cyber security and social systems. For instance, infectious diseases spreading over contact networks, malware propagating over computer networks, and spam/rumors spreading on Twitter are all propagation-style processes. Hence, controlling and stopping such malicious propagation is a natural and significant problem with a large number of applications.

In this chapter, we focus on the problem of how to select the best nodes on a network to receive vaccines, when the disease has already spread over some parts of the network. Intuitively speaking, we want to study how to best build a “wall” in the network against an already spreading contagion. We assume the vaccines completely “immunize” nodes, i.e., they are removed from the network. Most previous work on immunization is only interested in designing algorithms and policies for so-called pre-emptive immunization, which tries to find the best baseline strategies for controlling an epidemic before it has started spreading, assuming that the epidemic can start anywhere on the network at any random point. However, in practice this assumption is almost never realistic: e.g., children and the elderly are more susceptible to the flu, and people working with animals are more likely to get infected (say, in case of avian flu), and hence a flu epidemic tends to start from them. Therefore, though such pre-emptive policies give us good baseline strategies, they may not be ideal for responsive decision-making once an epidemic has already infected some people. In this chapter we study a novel “data-aware” immunization setting, which takes into account the current infections at the time of vaccine allocation. Efficient solutions for such a Data-Aware Vaccination problem can help policy makers make better decisions about how to best distribute vaccines. For example, one way our algorithms can eventually enable “real-time” vaccine allocation is by re-running them every hour or every day. The Data-Aware Vaccination problem can naturally be applied to cyber security and social media as well: which computers should install anti-virus software first when malware attacks have already spread; which user accounts should be blocked on Twitter to best stop spam/rumors from spreading?

Our main contributions in this chapter include:

1. Problem Formulation and Hardness Results: We formulate the Data-Aware Vaccination (DAV) problem on arbitrary graphs as a combinatorial optimization problem, and prove it is NP-hard and also hard to approximate within an absolute error. To the best of our knowledge, we are the first to address the Data-Aware Vaccination problem on arbitrary graphs (please see Related Work for more details).

2. Effective Algorithms: We first propose dava-tree, an optimal algorithm for merged trees, and then extend it to general graphs (the dava algorithm). Furthermore, using careful pruning techniques, we present a faster algorithm, dava-prune, which gives the same result as dava. In addition, we provide a third heuristic, dava-fast, which is the fastest algorithm, with some loss of performance.

3. Experimental Evaluation: We present extensive experiments against several popular immunization algorithms on multiple real networks (including large epidemiological social-contact graphs), and demonstrate the efficacy and scalability of our algorithms. Our algorithms outperform other competitors by up to 10 times in both magnitude and running time. In addition, we compare dava-prune and dava-fast: dava-prune saves up to 16,000 more nodes than dava-fast on large networks, while dava-fast is the fastest algorithm.

This work has been published in SDM 2014 [186] and TKDD 2015 [188]. We first give preliminaries, then present our problem formally, followed by our proposed method dava. Finally, we present experiments showing the effectiveness of our methods.

3.1 Preliminaries

We give preliminaries in this section. Table 3.1 lists the main symbols used in the chapter.

3.1.1 Contact Network

There exists an underlying contact network G(V,E) (between people/computers/blogs, etc.) on which the contagion (disease/virus/meme, etc.) can spread; we assume the graph is weighted and undirected. Each vertex v ∈ V represents an individual of the network, and edges represent interactions between these individuals. In this chapter, the edge weight pu,v represents the propagation probability of the contagion from u to v.

Table 3.1: Terms and Symbols

Symbol        Definition and Description
DAV           Data-Aware Vaccination problem
SIR           Susceptible-Infected-Recovered model
IC            Independent Cascade model
G(V,E)        graph G with node set V and edge set E
I0            set of infected nodes
pu,v          propagation probability from node u to v
δ             curing probability for the SIR model
k             the budget (i.e., number of nodes to give vaccines to)
S             set of nodes selected for vaccination
σG,I0(S)      expected number of infected nodes at the end (footprint)
σ′G,I0(S)     expected number of healthy nodes at the end
γS(j)         expected benefit to σ′G,I0(·) of adding j to S

3.1.2 Virus Propagation Models

We use two widely used discrete-time virus propagation models to describe how the virus spreads on the network: the Susceptible-Infected-Recovered (SIR) model, and the Independent Cascade (IC) model (a special case of the SIR model).

Susceptible-Infected-Recovered Model. SIR, a fundamental model which has been extensively used in epidemiology [5, 64, 152], models mumps- or chickenpox-like infections. In the SIR model, each node in the network is in one of three states: susceptible (healthy), infected, or recovered. In each time-step, each infected node u tries to infect each of its healthy neighbors v independently with probability pu,v (the weight on edge {u, v}). Healthy neighbors who are successfully infected become infected from the next time-step. In addition, at each time-step, each infected node becomes recovered with curing probability δ (from the next step onward). Once recovered, a node cannot participate in the epidemic further. The process begins when some initial nodes are infected, and ends when no infected nodes remain.

Independent Cascade Model. IC is a well-known model [22, 55, 72] used to describe viral marketing and related meme processes. In contrast to SIR, in the IC model each infected node has exactly one chance to infect (‘activate’) its neighbors independently with the propagation probability (in effect, the curing probability δ = 1 here). The IC process proceeds as follows. It is a discrete-time model in which a node in the graph can be either infected or healthy. When a node u first becomes infectious at time-step t, it is given a single chance to infect each of its currently healthy neighbors w, succeeding with probability pu,w (the weight of the edge {u,w} in the contact graph G). If u succeeds, then w becomes infectious at time-step t + 1. If multiple neighbors of w first become infectious at time-step t, their activation attempts are sequenced in an arbitrary order, but all performed at time-step t. Whether or not u succeeds, it cannot make any further attempts to infect w in subsequent rounds. The process terminates when no additional node becomes infectious.

We first describe our algorithms under the IC model. Extending them to the general SIR model is reasonably straightforward, as we explain in Section 3.5.

3.2 Problem Formulation

3.2.1 The “Data” in Data-Aware Vaccine Allocation

In this chapter, we are interested in effectively allocating vaccines on the network given a set of initially infected nodes I0; we call this Data-Aware Vaccine Allocation. Note that the “initially” infected node set I0 does not necessarily refer to the infected nodes at the beginning of the epidemic. Instead, it refers to the set of infected nodes as observed whenever we decide to allocate the vaccines. Loosely speaking, we call I0 the “starting point” or “initially” infected node set for ease of description, but it does not mean the starting point of the epidemic. The size of I0 is determined by how severe the epidemic is when we observe it and decide to intervene. For instance, for a deadly disease that went undetected for a long time, I0 will be relatively large. Generally speaking, our goal is to design a robust allocation strategy that can handle any size of I0.

We assume I0 is known, and can be drawn from any distribution. In practice, there are several ways to obtain I0. In particular, in social media such as Twitter, the accounts that post a certain rumor can be sampled from the Twitter API [113]. In epidemiology, we can get infectious cases from hospital reports or CDC documents [104]. In addition, realistic large-scale micro-simulations can be used to generate infection spreads [35].

Note that in this chapter, for the observability of the infected node set I0, we assume a simple scenario: as long as individuals have become symptomatic and been diagnosed as infected (hospitals report the diagnosed cases), we assume they are infected. In social media, observing I0 is even more straightforward.


3.2.2 Problem Definition

Now we are ready to formulate our problem formally. We are given a fixed set of nodes I0, which contains all infected nodes at a certain time of the epidemic process, as discussed above. We assume that if a vaccine is given to a healthy node v, then v cannot be infected by its neighbors at any time, which means v is effectively removed from the graph. We are given a graph G(V,E), the set I0 and a budget of k vaccines. We want to find the ‘best’ set S of healthy nodes that should be given vaccines at the beginning. As the propagation model is stochastic, let σG,I0(S) denote the expected total number of infected nodes at the end of the process (the ‘footprint’), given that the nodes in I0 were infected at the start and S was the vaccinated set. The best set S is the one which minimizes σG,I0(S).

Formally,

Problem 1: Data-Aware Vaccination problem DAV (G, I0, P, δ, k).

Given: A graph G(V,E) with node set V and edge set E, the infected node set I0, the SIR model with propagation probability pi,j ∈ P on each edge {i, j} and curing probability δ, and an integer (budget) k.

Find: A set S of k nodes from V − I0 to vaccinate, minimizing σG,I0(S), i.e.,

S∗ = argmin_S σG,I0(S)  subject to |S| = k.    (3.1)

Comment 1. Define σ′(·) to be the expected number of healthy nodes at the end, i.e., σ′G,I0(S) = |V| − σG,I0(S). Given the same set S and an integer k, minimizing σG,I0(S) is clearly equivalent to maximizing σ′G,I0(S) (as nodes can only be either infected or healthy at the end). In this chapter, for ease of description, we adopt this alternate form (maximizing σ′G,I0(S)).

Comment 2. Clearly the problem is trivial when k ≥ |N(I0) − I0|, where N(I0) is the set of immediate neighbors of I0 in the graph G (we can just vaccinate all of these nodes, and the disease will stop). In reality, this is never the case, as vaccines are expensive and networks are huge. For example, for k = 10 in our experiments, we found that |N(I0) − I0| ranged from 10 to 250 times our budget. Hence, we implicitly assume that k < |N(I0) − I0|.
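Since σG,I0(S) has no closed form in general, it is natural to estimate it by averaging repeated simulations. A small sketch, reusing the hypothetical simulate_ic helper sketched in Section 3.1.2 above:

import random

def estimate_footprint(G, I0, S, runs=1000, seed=0):
    # Monte-Carlo estimate of sigma_{G,I0}(S): the average footprint over
    # many IC runs in which the nodes of S are vaccinated (removed).
    rng = random.Random(seed)
    total = sum(len(simulate_ic(G, I0, S, rng)) for _ in range(runs))
    return total / runs

# sigma'_{G,I0}(S) = |V| - sigma_{G,I0}(S), so the expected number of healthy
# nodes is G.number_of_nodes() - estimate_footprint(G, I0, S).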

3.3 Complexity of the DAV problem

We discuss the complexity of the Data-Aware Vaccination problem in this section. In summary, we first prove that on general graphs the DAV problem is NP-hard, and then show that it is also hard to approximate.


3.3.1 Hardness result

Unfortunately, our problem is NP-hard. We reduce from the Minimum k-Union (MinKU) set problem (in which we want to minimize the size of the union of k subsets), which was proven to be hard [169].

Consider the corresponding decision version of DAV:

Problem 2: Data-Aware Vaccination (Decision Version) DAV(G, I0, P, δ, k, τ):

Given: A graph G(V,E), the infected node set I0, the SIR model with propagation probability pi,j ∈ P and curing probability δ, an integer (budget) k ≥ 0, and τ ≥ 0.

Find: Is there a set S of k nodes of G to vaccinate such that σ′G,I0(S) ≥ τ?

Theorem 3.1. DAV (Decision Version) is NP-hard.

Proof. We prove that DAV is NP-hard via a reduction from the NP-complete Minimum k-Union (MinKU) problem [169].

Minimum k-Union Problem (Decision Version). Given a collection of n subsets C = {S1, S2, ..., Sn} over a finite set of elements U = {u1, u2, ..., um}, an integer 0 ≤ k < n, and τ ≥ 0, are there k subsets Sj1, ..., Sjk in C such that |Sj1 ∪ ... ∪ Sjk| ≤ τ?

Note that [169] only proves that MinKU is NP-hard. But given a candidate solution for MinKU (Decision Version), it is easy to check whether |Sj1 ∪ ... ∪ Sjk| ≤ τ in polynomial time, so MinKU ∈ NP as well, and hence MinKU (Decision Version) is NP-complete.

Given an instance of MinKU, we construct a graph G with uniform propagation probability p = 1 and δ = 1 (essentially the IC model, a special case of SIR). Let each subset Si be a node in the graph. Also let there be a node j in the graph for each element uj. Then a sole infected node I connects to all the nodes in {S1, S2, ..., Sn}, and each Si in turn connects to its elements j. So G has 1 + m + n nodes. Clearly this construction takes polynomial time. Figure 3.1 shows an example where S1 = {1}, S2 = {1, 2}, S3 = {1, 2}, S4 = {m}, S5 = {m}, and Sn = ∅.

Based on the graph G, we create an instance IN ≡ (G, I, P, δ, n − k, n − k + m − τ) of DAV. We need to show that (1) if the instance of the MinKU problem has a solution, then the instance IN also has a solution; and (2) if the MinKU instance does not have a solution, then the instance IN does not have a solution either.

(1) This is true. If we can select k subsets Sj1, ..., Sjk for the MinKU problem, then we can vaccinate the other n − k nodes from the collection C in our DAV problem. In this case, σG,I(S) = k + |Sj1 ∪ ... ∪ Sjk| + 1 ≤ τ + k + 1, so σ′G,I(S) = |V| − σG,I(S) = m + n + 1 − σG,I(S) ≥ m + n − k − τ.

(2) This is also true. Suppose the instance IN for DAV has a solution; we will show that the MinKU instance then also has a solution, which is a contradiction. We can assume that the solution set is S0, so we have σ′G,I(S0) ≥ m + n − k − τ.


Figure 3.1: Graph used in the reduction from the Minimum k-Union problem.


First, note that because of the construction of the graph G, if an assignment contains nodes from {1, 2, ..., m}, we can always swap them with nodes from {S1, S2, ..., Sn} to get an alternate assignment that is at least as good as the original. If we choose a node t from {1, 2, ..., m}, then the value of σ′G,I(S) (the number of nodes we block) only increases by one: it is impossible to block any node Si without choosing it (all the nodes S1, S2, ..., Sn are directly connected to I), and t has connections only with nodes Si, so t cannot block any other nodes. Moreover, if we choose a substitute node from {S1, S2, ..., Sn}, it is possible that some additional nodes in {1, 2, ..., m} are blocked. Hence, swapping t with a node in {S1, S2, ..., Sn} results in a solution that is at least as good.

Hence we can assume without loss of generality that the optimal solution S∗ contains only nodes from {S1, ..., Sn}. Then we have σ′G,I(S∗) ≥ σ′G,I(S0) ≥ m + n − k − τ, so σG,I(S∗) = n + m + 1 − σ′G,I(S∗) ≤ τ + k + 1. In this case S∗ contains n − k nodes; if we choose the other k nodes Sj1, ..., Sjk from C − S∗ as the solution for the MinKU problem, then we have |Sj1 ∪ ... ∪ Sjk| = σG,I(S∗) − 1 − k ≤ τ. This means the MinKU instance has a solution, which is a contradiction.

Combining (1) and (2) proves the hardness.

3.3.2 Approximability

Typically, related optimization problems on graphs have a submodular structure, lending themselves to near-optimal greedy solutions. Unfortunately, our function is not submodular.

Remark 3.1. σ′G,I0(S) in DAV is not a submodular function.

Proof. See Figure 3.2. A submodular function f has the property that if A ⊆ B, then adding an element j to both sets satisfies f(A ∪ {j}) − f(A) ≥ f(B ∪ {j}) − f(B). Suppose I is infected, A = ∅ and B = {X}. Then σ′G,I0(A ∪ {Y}) − σ′G,I0(A) = 5, while σ′G,I0(B ∪ {Y}) − σ′G,I0(B) = 8 − 2 = 6. So σ′G,I0(S) is not a submodular function.


Figure 3.2: Counter-example for the submodularity of σ′G,I0(·).

Vinterbo [169] gave a greedy algorithm which can approximate MinKU and its equivalent problem, the Maximum k-Intersection problem (MaxKI) (where we want to maximize the intersection of k given subsets), within a constant factor if the cardinalities of all the subsets are bounded by a constant. However, in our case the ‘subsets’ can be very large, and thus the approximation result is not useful. The algorithm we develop is related to Vinterbo’s setting in the sense that it is also greedy, though we use different efficient techniques for our particular setting. For the general case, Xavier [175] proved that the MaxKI problem cannot be approximated within a factor of 1/N^ε, where N is the number of subsets and ε > 0, under the assumption NP ⊄ BPTIME(2^{N^ε}). Recently, Shieh et al. [154] proved that MaxKI is inapproximable within an absolute error, with a smaller inapproximability gap and under the weaker P ≠ NP assumption. Using the results in [154], our problem is unfortunately also inapproximable within an absolute error of (1/2)m^{1−2ε} + O(m^{1−3ε}), where m = |V| − |N(I0) − I0| − |I0|, i.e., m is the number of nodes excluding the infected nodes and their neighbors.

Theorem 3.2. Given any constant 0 < ε < 1/3, there exists an mε such that the Data-Aware Vaccination problem with m > mε cannot be approximated in polynomial time within an absolute error of (1/2)m^{1−2ε} + (3/8)m^{1−3ε} − 1, unless P = NP.

Proof. The Minimum k-Union problem (MinKU) is equivalent to the Maximum k-Intersection problem (MaxKI) (where we want to maximize the intersection of k given subsets) [169]. Shieh [154] proved that the MaxKI problem¹ with universe size m ≥ mε cannot be approximated in polynomial time within an absolute error of (1/2)m^{1−2ε} + (3/8)m^{1−3ε} − 1 unless P = NP. Using the reduction in the NP-hardness proof, we can prove that the DAV problem is hard to approximate if MaxKI cannot be approximated in polynomial time within an absolute error.

¹Interestingly, the related inverse problem of minimizing the intersection, equivalent to the max-coverage problem, has a 1 − 1/e approximation.


3.4 Our Proposed Methods

Given the hardness results in the previous section, we present effective heuristics next. In this section we describe our methods assuming the IC model in the DAV problem, and then extend them to handle the general SIR case.

Roadmap of this section: We first simplify the DAV problem by merging infected nodes. Then we propose an optimal solution, called DAVA-TREE, on trees. Based on DAVA-TREE, we give an effective algorithm, dava, for arbitrary graphs. However, dava is not scalable to large networks, hence we provide a faster algorithm, dava-prune, which returns the same result as dava. Finally, we propose a much faster heuristic, dava-fast.

3.4.1 Simplification—Merging infected nodes

To simplify our problem, as the hardness reduction from the previous section suggests, we merge all the infected nodes into a single ‘super infected’ node, obtaining an equivalent problem with only a single infected node. Intuitively, this is because it does not matter how the infected nodes are connected among themselves; all that matters for our problem is how they are connected to healthy nodes. If a healthy node has multiple infected neighbors, it gets a new edge probability which is the logical-OR of the individual probabilities. For example, if a healthy node c has two infected neighbors a and b with edge probabilities pa and pb, the new edge probability between I′ and c is 1 − (1 − pa)(1 − pb) = pa + pb − papb.

Remark 3.2. Given an instance of the DAV problem (G, I0, P, k) under the IC model, Algorithm 3.1 outputs an equivalent problem instance (G′, I′, P′, k), where I′ is the sole infected node in the new graph G′.

Proof. We prove it by induction. If a healthy node c connects to two infected nodes a and b, then according to the IC model the propagation probability will be 1 − (1 − pa)(1 − pb) = pa + (1 − pa)pb (equal to the probability of getting infected by at least one of a or b). Hence Algorithm 3.1 (Line 9) outputs the same propagation probability for c. When c has l + 1 infected neighbors, suppose pI′c is the propagation probability after merging l of them; when merging a new infected node d, the propagation probability is 1 − (1 − pI′c)(1 − pd) = pI′c + (1 − pI′c)pd, which is again the same as the output of Algorithm 3.1.

In Algorithm 3.1, Lines 3–13 copy edges from the previous infected node set I0 to the new infected node I′. Lines 5–10 show how to assign the new propagation probability pI′j. If the infected node set I0 has EI0 incident edges in total, Algorithm 3.1 takes O(|I0| + EI0) time.


Algorithm 3.1 MERGE

Require: Input graph G, infected node set I0, probability set P
1: G′ = G
2: Add node I′ to G′
3: for each node i in I0 do
4:   if there exists an edge eij between i and j then
5:     if there is no edge eI′j then
6:       Add edge eI′j into G′
7:       pI′j ← pij
8:     else
9:       pI′j ← pI′j + (1 − pI′j)pij
10:    end if
11:    Remove eij from G′
12:  end if
13: end for
14: Remove all nodes in I0 from G′
15: return graph G′ and the infected node I′
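For concreteness, a runnable Python version of MERGE follows (a sketch under the assumption of a networkx graph whose edge attribute 'weight' stores pij; the super-node label I_prime is ours, not the thesis's):

import networkx as nx

def merge_infected(G, I0, default_p=0.5):
    # Algorithm 3.1: collapse all infected nodes into one super node I'.
    # A healthy node adjacent to several infected nodes receives the
    # logical-OR of the individual edge probabilities.
    Gp = G.copy()
    Ip = "I_prime"
    Gp.add_node(Ip)
    for i in I0:
        for j in list(Gp.neighbors(i)):
            if j in I0:
                continue  # edges among infected nodes are irrelevant
            p_ij = Gp[i][j].get("weight", default_p)
            if Gp.has_edge(Ip, j):
                p = Gp[Ip][j]["weight"]
                Gp[Ip][j]["weight"] = p + (1.0 - p) * p_ij
            else:
                Gp.add_edge(Ip, j, weight=p_ij)
    Gp.remove_nodes_from(I0)
    return Gp, Ip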

3.4.2 DAVA-TREE—Optimal solution when the merged graph is a tree

Let us call the graph we get after merging (i.e., after Algorithm 3.1) the m-graph. The second important observation is that, as we show next, if the m-graph of our instance is a tree, then we can get an optimal polynomial-time algorithm under the IC model, for any edge propagation probability pi,j ∈ [0, 1]. We call the algorithm DAVA-TREE (Data-Aware Vaccine Allocation on a tree).

Before we describe our algorithm, define the quantity γS(j): the ‘benefit’ of node j to the optimization goal when the nodes in S have already been removed. It is essentially the expected number of nodes we save by removing j, given that we have already removed the nodes in S:

γS(j) = σ′G,I0(S ∪ {j}) − σ′G,I0(S)    (3.2)

Let γ(j) = γ∅(j). Algorithm 3.2 proceeds by computing γ(j) efficiently for every neighbor of I′ (in a simple tree traversal) and then taking the neighbors of the infected node I′ with the top-k γ(·) values. Lemma 3.1 proves that this gives the optimal solution. In short, as there is only one path between any two nodes in the tree, the optimal solution must be a subset of the immediate neighbors of I′ with the top-k values of γ(j).

Lemma 3.1 (Correctness of DAVA-TREE). If the m-graph G(V,E) is a tree, then Algorithm 3.2 gives an optimal solution to the DAV problem under the IC model.


Algorithm 3.2 DAVA-TREE

Require: Tree T, infected node I′, budget k, probabilities pij
1: Set I′ as the root
2: for each neighbor j of I′ do
3:   γ(j) ← pI′j × calPartial(j)
4: end for
5: S = nodes with top-k values of γ(j)
6: return S

Function calPartial(node n)
  benefit ← 1
  if n is not a leaf then
    for each child i of n do
      benefit ← benefit + calPartial(i) × pni
    end for
  end if
  return benefit
EndFunction

Proof. We show the following: in the optimal set, the chosen nodes must be neighbors of the infected node I′; the benefit of each such node is independent of the rest of the set S; and finally, that we correctly calculate γ(j).

First, the nodes we select must be neighbors of the infected node I′. Suppose a solution set M = S ∪ {v} (for some S) selects a node v that is not a neighbor of I′. Since G(V,E) is a tree, there is only one path p from I′ to v. If S contains no other node on the path p, then instead of v we can choose the neighbor of I′ on the path p, say node u; the footprint will be smaller if we select u instead of v. Note that v can get infected only if u gets infected. Hence, σ′G,I′(S ∪ {u}) − σ′G,I′(S ∪ {v}) > 0, as u can save at least one more node from being infected than v. On the other hand, if S contains a node on p other than v, then we know that σ′G,I′(S ∪ {v}) = σ′G,I′(S), so we can choose some other neighbor of I′ instead of v and get a better value. Hence, in either case, we can swap v with a neighbor of I′ and get a solution at least as good as M.

Second, we must select the k neighbors of I′ with the top values of γ(j). For an S that contains only neighbors of I′, note that σ′G,I′(S) = |V| − σG,I′(S) = |V| − (σG,I′(∅) − Σi∈S γ(i)) = σ′G,I′(∅) + Σi∈S γ(i). In other words, the benefit of each node is independent of S. This is because the subtree of each node j ∈ S is disconnected from the subtree of any other node i ∈ S: hence node i ∈ S cannot impact any other node j ∈ S. Thus we can just choose the top-k values of γ to maximize σ′G,I′(S).


Figure 3.3: An example of a minimum spanning tree: (a) original graph; (b) minimum spanning tree. For p = 1 and k = 1, the optimal solution in the MST is node 2; however, in the original graph, the optimal solution is node 4.

Finally, note that Line 3 of DAVA-TREE calculates γ(j) correctly. Observe that γ(j) = σ′G,I′({j}) − σ′G,I′(∅) = σG,I′(∅) − σG,I′({j}). As each subtree is independent, this quantity equals the probability that j is infected times the expected number of sick nodes in node j’s subtree when j is infected. It is easy to see that the function calPartial computes exactly the latter value, and so we get the correct value of γ(j).

Lemma 3.2 (Running Time of DAVA-TREE). Algorithm 3.2 (DAVA-TREE) costs O(|V| + |E| + k log |V|) time in the worst case.

Proof. Calculating the expected benefits needs a complete tree traversal, which takes O(|V| + |E|) time. If I′ has m neighbors, selecting the top k of them takes O(k log m) time using a heap.

3.4.3 dava—An effective algorithm for arbitrary graphs under the IC model

What if the m-graph is not a tree? We next give an effective heuristic for the case when the m-graph is an arbitrary network. After the merge algorithm, a connected graph is guaranteed to have only one infected node I′. Intuitively, we need to capture (a) the ‘closeness’ of nodes to the infection (represented by I′) and, at the same time, (b) the importance of nodes in ‘saving’ other nodes. Thus good solutions are composed of nodes which are close to I′ and also prevent the infection of many others.

We can still use the DAVA-TREE algorithm by generating a tree from the m-graph, rooted at I′. Which tree should we use? Spanning trees like the Minimum Spanning Tree (MST) have typically been used in related problems. The problem with the MST is that potential solution nodes (for the original graph) can reside at greater depths in the MST, whereas, as we saw in the previous section, the DAVA-TREE algorithm only chooses nodes which are neighbors of I′. See Figure 3.3 for an example. Let p = 1 on all edges, and budget k = 1. Then the MST rooted at I′ will not have node 4 as a neighbor of I′, but the optimal solution in this case is exactly node 4. On the other hand, if p = 0.5 and k = 1, then it is easy to verify that node 1 becomes the optimal solution.

Dominator tree. We propose to use the dominator tree of the graph and then run the DAVA-TREE algorithm on it. As we explain later, the dominator tree avoids this problem by precisely capturing the ‘closeness’ property required of the solutions.

In graph theory, given a source node I′, a node v dominates another node u if every path from I′ to u contains v. Node v is the immediate dominator of u, denoted by v = idom(u), if v dominates u and every other dominator of u dominates v. We can build a dominator tree rooted at I′ by adding an edge between nodes v and u whenever v = idom(u). Dominator trees have been used extensively in studying control-flow graphs, and building them is a very well-studied topic, with near-linear time algorithms available [18, 87]. Figure 3.4(b) shows the dominator tree for the graph in Figure 3.4(a). Note that the edges in the dominator tree may not in fact exist in the original graph (in contrast to, say, the MST).
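In practice, dominator trees are easy to obtain; for instance, networkx exposes immediate dominators directly. A sketch of building the (unweighted) dominator tree of the m-graph rooted at I′:

import networkx as nx

def dominator_tree(G, I_prime):
    # Compute immediate dominators on the directed view of the undirected
    # m-graph, then connect idom(u) -> u to form the dominator tree.
    idom = nx.immediate_dominators(G.to_directed(), I_prime)
    T = nx.DiGraph()
    T.add_node(I_prime)
    for u, v in idom.items():
        if u != v:            # the root is its own dominator
            T.add_edge(v, u)  # v = idom(u)
    return T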

Note that we have not yet specified how to weight the edges of this dominator tree. Even the simple unweighted version of the dominator tree has structural properties which are very useful for the DAV problem. As we show in the next lemma, the optimal solution for the original graph can only be a subset of the neighbors of I′ in the unweighted dominator tree.

Lemma 3.3. For the DAV problem, the optimal solution consists only of children of the root I′ in the unweighted dominator tree of the m-graph G.

Proof. Suppose the optimal solution set S∗ contains a node v which is not a neighbor of I′ in the dominator tree. Then there must exist a node u that is a child of I′ and dominates v in the original graph (as the domination relationship is transitive). This means that all paths from I′ to v have to pass through u; in other words, node v does not get infected if u is not infected. The rest of the proof is similar to the argument used in Lemma 3.1. In sum, we can swap v with a neighbor of I′ and get a solution which is at least as good.

Note that by building the dominator tree we can reduce the search space substantially without losing any information; we demonstrate this in our experiments as well: the number of neighbors of I′ in the dominator tree is typically a small fraction of the total number of nodes in the original graph. Further, we can prove that if p = 1, running DAVA-TREE on the dominator tree of the m-graph G returns the best first node.

Lemma 3.4. For the special case when the budget k = 1 and the propagation probability p = 1, running DAVA-TREE on the dominator tree T of the m-graph G, weighted with pu,v as above, gives the optimal solution.


Figure 3.4: An example of a dominator tree: (a) original graph; (b) dominator tree. For p = 1 and k = 1, the optimal solution is node 4. For p = 0.5 and k = 1, the optimal solution is node 1.

Proof. According to Lemma 3.3, the node v we select should be a child of the root in the dominator tree. Also, when k = 1, by definition the best node to pick is the one with the maximum value of γ(·). The only nodes v can prevent from infection are those nodes u for which v lies on every path from I′ to u (as removing v cuts off all possible paths of infection), i.e., exactly the nodes in the subtree of v in T. The nodes that are not in the subtree of v in T are still infected when v is removed, since each of them has at least one path from I′ in the original graph. So the value of γ(·) for v in the original graph is the number of nodes for which v lies on every path from I′ to them, which means the value of γ(·) in the original graph equals the value of γ(·) in T. Hence the best solution for the dominator tree is also the best solution for the original graph.

Weighting the dominator tree. DAVA-TREE assumes that the edges of the tree carry propagation probabilities. We want to preserve this information (coming from the original graph) in the dominator tree, to make the ‘benefit’ computation accurate. Hence we weight each edge {u, v} in the dominator tree by pu,v, the total probability that node u can infect v in the original graph (note that u and v may not be neighbors in the original graph).

Lemma 3.4 and the preceding discussion suggest a natural greedy heuristic: find the best single node i using DAVA-TREE in the dominator tree (weighted with pu,v as defined before); then remove node i from the graph; recompute the dominator tree; and repeat until the budget k is exhausted. We call this algorithm BASIC.

Speeding up BASIC. Unfortunately, computing pu,v for a given pair of nodes u and v in the IC model is #P-complete [22]; it is essentially the canonical s-t connectivity problem in random graphs. We can use Monte-Carlo sampling to estimate it through simulations, but even that is too slow. Hence we propose to approximate it by the maximum propagation path probability between nodes u and v, which is intuitively the probability of the most likely path through which an infection can spread from node u to v in the original graph.

In the original graph, let p_{path_i}(u, v) denote the propagation probability from u to v along path i. We define the maximum propagation path probability p_{u,v} as the maximum value of p_{path_i}(u, v) over all paths i, and use it as the approximate propagation probability for the edge {u, v} in the dominator tree. Maximum path probabilities have been used before in the context of the influence maximization problem [22], but there they must be computed between all pairs of nodes; in our problem, we only need them for edges in the dominator tree. In fact, it is easy to see that in a dominator tree rooted at I′, if v = idom(u), then p_{v,u} = p_{I′,u} / p_{I′,v}. This means we only need to calculate the maximum propagation path probability from the root I′ to all other nodes, which is analogous to a shortest-path computation in graph theory.

Hence a faster algorithm than BASIC is to assign the probabilities p_{v,u} in the dominator tree this way. We call the complete algorithm dava for arbitrary graphs. Pseudocode is given in Algorithm 3.3.

Algorithm 3.3 dava Algorithm for Arbitrary Networks

Require: Graph G, P, budget k, infected set I_0
1: S = ∅
2: G′ = Run MERGE on G and I_0
3: repeat
4:   T = Build the dominator tree from G′ and assign probabilities p_{v,u}
5:   v = Run DAVA-TREE on T with budget = 1
6:   S = S ∪ {v}
7:   Remove node v from G′
8: until |S| = k
9: return S

Lemma 3.5. (Running time of dava) Algorithm 3.3 takes O(k(|E| + |V| log |V|)) worst-case time.

Proof. Building a dominator tree costs O(|V| + |E|) time [18]. If the original graph has a uniform propagation probability, we can obtain p_{I′,v} through Breadth-First Search (BFS), which takes O(|V| + |E|) time. Otherwise we can use Dijkstra's algorithm to get p_{I′,v}, which takes O(|E| + |V| log |V|) time. Thus, assigning probabilities costs O(|E| + |V| log |V|) worst-case time. Removing node v also takes linear time. Repeating this for k iterations gives the stated bound.

3.4.4 dava-prune—A faster algorithm with the same result as dava

dava works fine on small graphs, but can be slow on large graphs, as it re-builds the dominator tree multiple times. Hence we propose a faster algorithm, dava-prune, which speeds up dava using pruning techniques while returning the same result as dava. (Note: we also present another, costlier pruning algorithm, which is still faster than dava but slower than dava-prune; we describe it in the appendix as it may be of independent interest.)

Building a dominator tree costs linear time, while weighting a dominator tree using maximum propagation path probabilities takes O(|E| + |V| log |V|) time. Hence, the bottleneck for dava is the reweighting process; if we can reduce the reweighting time, we can speed up dava. The following lemmas (Lemma 3.6, Lemma 3.7 and Lemma 3.8) give us guidance for reducing the time complexity of reweighting. Basically, the idea is that we do not need to recalculate weights for some nodes, and we can reduce the complexity of recalculating the necessary weights as well.

To recap, we have the original graph G, the m-graph G′, and the infected node I′ in G′. Let us denote by v the node selected from G′ using dava, by G_new the new graph after removing v from G′, by T_old the old dominator tree before removing v, and by T_new the new (re-built) dominator tree.

First, the following two lemmas show that we do not need to reweight nodes that are not in the first layer of the old dominator tree T_old.

Lemma 3.6. After removing the selected node v from the m-graph G′ using dava, nodes that are not in the first layer of the old dominator tree T_old have the same parent nodes in the new (re-built) dominator tree T_new.

Proof. First, we prove that for any node i that is neither in the first layer of the old dominator tree T_old nor in the subtree of v in T_old, i's neighbors in G_new do not change after removing v, i.e., i's neighbors in G′ cannot be in the subtree of v in T_old. Here we use a non-trivial property of dominator trees from [87]: in the graph, the dominators of a node are determined by the intersection of its neighbors' dominators. Since i's dominators are the intersection of its neighbors' dominators, after removing v, i's dominators in T_new cannot increase. Furthermore, i must have the same dominators in T_new: if i had fewer dominators in T_new, at least one neighbor j of i in G′ would have to be in the subtree of v in T_old. Since i is not in the subtree of v in T_old, there exists at least one path from I′ to i in G′ that does not include v; so there also exists at least one path from I′ to j that does not include v. Hence, j is not in the subtree of v in T_old, which contradicts our assumption. Therefore, none of i's neighbors is in the subtree of v in T_old. Similarly, for any node i that is not in the first layer of T_old, i's neighbors must be in the same subtree as i.

Now we prove the lemma by induction. First, consider a node u in the second layer of T_old. As we proved above, u's neighbors cannot be in the subtree of v in T_old, and they are in the same subtree of T_old as u; hence the intersection of the dominators of u's neighbors must contain u's parent in T_old. Therefore, u's parent remains the same in the new dominator tree T_new. Second, suppose that all nodes in layers L of T_old with 2 < L ≤ N have the same parents in T_new, and consider a node u in layer N + 1 of T_old. If u's neighbors are in a higher layer, their parents are the same in T_new, which means they have the same immediate dominators in T_new; and for neighbors in a lower layer, their dominators include the dominators of u. Hence u's parent is the same node in T_new.

Therefore, nodes that are not in the first layer of T_old have the same parent nodes in the new dominator tree T_new.

Lemma 3.7. After removing the selected node v from G′ using dava, nodes that are not in the first layer of the old dominator tree T_old have the same edge-weights to their parents in the new (re-built) dominator tree T_new.

Proof. Suppose i is a node not in the first layer of T_old, and j is its parent. According to Lemma 3.6, j is still i's parent in the new dominator tree T_new. p_{j,i} is the probability of the maximum propagation path from j to i in G′. If the maximum propagation path from j to i does not contain nodes in the subtree of v in T_old, clearly p_{j,i} remains the same in T_new.

Now assume there exists a node n that is in the subtree of v in T_old and also on the maximum propagation path from j to i. Since j is not in the subtree of v in T_old, there must exist at least one path from I′ to j in G′ that does not contain v. Since n is on the maximum propagation path from j to i, there must be a path from I′ to n which goes through j but does not contain v. Hence, n cannot be dominated by v, which contradicts the assumption that n is in the subtree of v in T_old. Therefore, the maximum propagation path from j to i does not contain nodes in the subtree of v in T_old, which means p_{j,i} stays the same in T_new.

Though all the edge-weights of nodes that are not in the first layer of T_old remain the same in T_new, we cannot simply ignore these nodes, because they might lie on the maximum propagation path from I′ to nodes in the first layer of T_new. Luckily, the next lemma shows that we do not need to consider nodes that are not in the first layer of T_old when calculating the weights of nodes in the first layer of T_new (the maximum propagation paths from I′ to nodes in the first layer of T_new).

Lemma 3.8. After removing the selected node v from G′ using dava, in the new (re-built) dominator tree T_new, the maximum propagation path from I′ to any node in the first layer of the old dominator tree T_old only contains nodes in the first layer of T_old.

Proof. First, for a node j in the first layer of T_old, either j has only one neighbor, I′, in G′, or there exist at least two distinct paths from I′ to j. Here, distinct paths share no common nodes (other than I′ and j); if some node other than I′ and j lay on every path from I′ to j, j could not be in the first layer of T_old.

If j has only one neighbor I′ in G′, then clearly p_{I′,j} remains the same in the new dominator tree T_new. Now consider the case where there exist at least two distinct paths from I′ to j, and suppose there exists a node n that is not in the first layer of T_old but belongs to a path from I′ to j. Since j is not a direct neighbor of I′ in G′, there must exist at least two distinct paths from I′ to j in G′. Hence for node n there also exist at least two distinct paths from I′ to n in G′, which means n has only one dominator, namely I′. So n is in the first layer of T_old, which contradicts the assumption that n is not in the first layer of T_old. Hence, every path from I′ to a node j in the first layer of T_old can only contain nodes that are in the first layer of T_old.

Therefore, after removing v, the maximum propagation path from I′ to j in T_new only contains nodes in the first layer of T_old.

Lemma 3.6 shows that if a node u is not in the first layer of T_old, u's parent in the new dominator tree T_new remains the same; furthermore, by Lemma 3.7, the maximum propagation probability from u's parent to u in T_new does not change either. In addition, Lemma 3.8 shows that to calculate the new edge-weights for nodes in the first layer of T_new, we do not need to run Dijkstra's algorithm on the whole graph G′: using the fact that maximum propagation paths from I′ to a node j in the first layer of T_old only contain neighbors of I′ in T_old, we can get the new edge-weight p_{I′,j} by running Dijkstra's algorithm on the subgraph containing only I′ and the nodes in the first layer of T_old.

Hence we propose a faster heuristic than dava, which we call dava-prune. Pseudocode is given in Algorithm 3.4. dava-prune rebuilds the dominator tree after selecting a node, but only recomputes the edge-weights of the nodes in a subgraph of G′ (see Lines 8-10). The next lemma shows that dava-prune returns the same result as dava.

Algorithm 3.4 dava-prune Algorithm

Require: Graph G, P, budget k, infected set I_0
1: S = ∅
2: G′ = Run MERGE on G and I_0
3: T = Build the dominator tree from G′ and assign probabilities p_{v,u}
4: repeat
5:   v = Run DAVA-TREE on T with budget = 1
6:   S = S ∪ {v}
7:   Remove node v from G′
8:   T = Build the dominator tree from G′
9:   G′′ = Remove the nodes that are not in the first layer of T from G′
10:  Run Dijkstra's algorithm on G′′ to get the new weights of T
11: until |S| = k
12: return S

Lemma 3.9. (Correctness of dava-prune) dava-prune returns the same result as dava.

Proof. Lemma 3.7 shows that for nodes that are not in the first layer of the old dominator tree T_old, all of their edge-weights remain the same, while Lemma 3.8 indicates that for a node u in the first layer of T_old, the maximum propagation path from I′ to u only contains nodes in the first layer of T_old. Hence, if we run Dijkstra's algorithm on G′′ (a subgraph containing only I′ and the nodes in the first layer of T), we get the same maximum propagation probability from I′ to u as in the new (re-built) dominator tree T_new. Therefore, dava-prune gives the same results as dava.

Lemma 3.10. (Running time of dava-prune) Algorithm 3.4 takes O(|V| log |V| + k(|E| + F log F)) worst-case time, where F is the number of nodes in the first layer of the dominator tree T.

Proof. Building a dominator tree costs O(|V| + |E|) time [18], and weighting the dominator tree for the first time costs O(|E| + |V| log |V|). After that, each reweighting of the dominator tree only takes O(F log F) time. Hence, the worst-case time complexity of dava-prune is O(|V| log |V| + k(|E| + F log F)).

3.4.5 dava-fast—An even faster heuristic

Even with pruning techniques, dava-prune may still be slow on large graphs, as its cost depends on the number of nodes in the first layer. Hence we propose an even faster heuristic, dava-fast, which runs DAVA-TREE on the dominator tree with the full budget k, instead of running it with k = 1 after each step. Essentially we pick the neighbor nodes (in the dominator tree) of I′ with the top-k γ(·) values. Pseudocode is given in Algorithm 3.5. dava-fast performs well in our experiments, with some loss of quality but at a fraction of the running time of dava.

Algorithm 3.5 dava-fast Algorithm

Require: Graph G, P, budget k, infected set I_0
1: S = ∅
2: G′ = Run MERGE on G and I_0
3: T = Build the dominator tree from G′ and assign probabilities p_{v,u}
4: S = Run DAVA-TREE on T with budget = k
5: return S

Lemma 3.11. (Running time of dava-fast) Algorithm 3.5 takes O(|E| + |V| log |V|) worst-case time.

Proof. Building a dominator tree costs O(|V| + |E|) time [18], and assigning probabilities costs O(|E| + |V| log |V|) worst-case time. Running DAVA-TREE costs O(|V| + |E| + k log |V|) time.


3.4.6 Discussion of proposed methods

We next discuss the performance of dava, dava-prune and dava-fast in terms of the structure of the graph.

dava. As we showed in Lemma 3.1, if the graph is a tree, dava gives the optimal solution. In addition, the total benefit of the set S (the number of nodes we can save after removing S) as computed from the dominator tree, denoted σ′_Dom(S), is related to the true benefit σ′_{G,I′}(S) in the original graph.

Lemma 3.12. σ′_Dom(S) ≤ σ′_{G,I′}(S).

Proof. For any node u in the set S, after removing u from the graph G, every node v in the subtree of u in the dominator tree is isolated from the super node I′ in G, since every path from I′ to v must go through u. So the benefit we get is at least the expected footprint of the diffusion from I′ restricted to the nodes in the subtrees of the nodes u ∈ S, which is exactly σ′_Dom(S). Hence, σ′_Dom(S) ≤ σ′_{G,I′}(S).

Lemma 3.12 shows the connection between the dominator tree and the original graph in terms of the nodes we save. Furthermore, it also demonstrates how our dava algorithm approximates the DAV problem: it maximizes a lower bound on the number of nodes we actually save in the original graph.

However, if there are too many nodes in the first layer of the dominator tree, the bound in Lemma 3.12 becomes loose, and dava will not perform well, because in this scenario we lose too much information from the original graph. Nevertheless, we found in the experiments (Section 3.6.2) that in practice the number of nodes in the first layer of the dominator tree is not large (only a fraction of the total number of healthy nodes in the graph). Hence, dava performs very well on real networks.

dava-prune. We showed that dava-prune gives the same result as dava (Lemma 3.9). In addition, it is faster than dava because we can reweight the dominator tree on a smaller graph. However, if the number of nodes in the first layer of the dominator tree is large, dava-prune can be as slow as dava. As we discuss later, in practice the first layer of the dominator tree is a fraction of the size of the graph, so dava-prune is much faster than dava.

Since we cannot run dava on large networks, dava-prune can serve as a baseline for measuring the performance of faster heuristics.

Additionally, as we discuss below, the difference in performance between dava-prune and dava-fast depends on the structure of the network. If the dominator tree changes dramatically between steps, dava-prune provides much better performance than dava-fast, regardless of network size. Our experiments show that dava-prune can save 16,000 more nodes (for budget k = 2000) on large networks like PORTLAND. Furthermore, on smaller graphs like STANFORD, it saves almost 10% more nodes than dava-fast. Hence, even though dava-prune is not as fast as dava-fast, it is a viable option for applications where effectiveness is critical: to achieve high performance, one can afford to sacrifice some running time.

dava-fast. As is clear from the pseudocode, if the dominator tree does not change after we select certain nodes, dava-fast gives the same result as dava and dava-prune. Suppose we remove node u in G′ (the merged graph). The dominator tree changes only if there exists a node v such that, after removing u, every path from I′ to v in G′ must go through some particular node. In practice, the number of such nodes v is not large; hence, the performance of dava-fast is competitive. Experimental results also demonstrate that dava-fast performs well: though it is not as good as dava and dava-prune, it outperforms the other baseline algorithms.

3.5 Extending to the SIR model

In this section, we explain how to extend our solution to the SIR case. Recall that in the SIR model, as opposed to the IC model, a node u tries to infect its neighbor v multiple times. Suppose Z_u is the random variable denoting the number of time-steps u stays infected until recovery (this is also the number of time-steps during which u tries to infect v). The probability that v gets infected by u is B_{uv} = 1 − (1 − p_{u,v})^{Z_u}. Note that Z_u has a geometric distribution Pr(Z_u = z) = (1 − δ)^{z−1} δ, with expectation E[Z_u] = 1/δ (δ is the curing probability). If we force u to be infected for only one time-step before recovering (as in the IC model), then β_{u,v}, the equivalent probability that u infects v successfully, can be approximated by the expectation of B_{uv}, E[B_{uv}].

Lemma 3.13. The expectation of B_{uv} (the probability that node v gets infected by a neighbor node u) in the SIR model is given by E[B_{uv}] = p_{u,v} / (1 − (1 − δ)(1 − p_{u,v})).

Proof. First, E[B_{uv}] = 1 − E[(1 − p_{u,v})^{Z_u}] = 1 − E[e^{Z_u ln(1 − p_{u,v})}]. Let us denote t = ln(1 − p_{u,v}); then E[B_{uv}] = 1 − E[e^{t Z_u}].

Since Z_u has a geometric distribution with Pr(Z_u = z) = (1 − δ)^{z−1} δ, and t ≤ 0 < −ln(1 − δ), using its moment-generating function E[e^{t Z_u}] = δe^t / (1 − (1 − δ)e^t) [52], we get:

E[B_{uv}] = 1 − δe^t / (1 − (1 − δ)e^t) = 1 − δ(1 − p_{u,v}) / (1 − (1 − δ)(1 − p_{u,v})) = p_{u,v} / (1 − (1 − δ)(1 − p_{u,v})).


According to Lemma 3.13, we can directly apply our algorithms to the SIR case by using an equivalent IC model with β_{u,v} as the propagation probability, where β_{u,v} is approximated by p_{u,v} / (1 − (1 − δ)(1 − p_{u,v})).

Recall that the IC model is a special case of the SIR model in which each infected node u has only one chance to infect its neighbor v with probability p_{u,v}, which essentially means δ = 1. So as a sanity check, if we apply Lemma 3.13 to the IC model, β_{u,v} is exactly p_{u,v}, as expected (i.e., we recover the original weights).
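This conversion is a one-line computation; the following is a small illustrative helper (the function name is ours, not the dissertation's):

def sir_to_ic_weight(p, delta):
    """Equivalent IC propagation probability for an SIR edge (Lemma 3.13):
    beta = p / (1 - (1 - delta) * (1 - p))."""
    return p / (1 - (1 - delta) * (1 - p))

# Sanity check: delta = 1 recovers the IC weight exactly.
assert sir_to_ic_weight(0.3, 1.0) == 0.3
print(sir_to_ic_weight(0.3, 0.6))  # ~0.417: repeated attempts raise the effective weight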

3.6 Experiments

In this section, we give an experimental evaluation of our algorithms. We describe our setup next. We conducted the experiments on a machine with four Xeon E7-4850 CPUs and 512GB of 1066MHz main memory, and implemented the algorithms in Python.

We seek to answer the following questions via our experiments:

1. How do dava, dava-prune and dava-fast perform compared with the baselines?

2. How effective is dava-prune compared to dava?

3. What is the scalability of our algorithms?

4. Does I0 affect the performance of our algorithms?

3.6.1 Experimental Setup

Datasets. We run our experiments on multiple real datasets. In addition to picking datasets of various sizes, we also chose them from different domains where the DAV problem is especially applicable. Table 3.2 shows the datasets we used.

Table 3.2: Datasets

Dataset     Model  #Vertices    #Edges
OregonAS    IC     633          2172
STANFORD    IC     8929         53829
GNUTELLA    IC     10876        39994
BRIGHTKITE  IC     58228        214078
PORTLAND    SIR    0.5 million  1.6 million
MIAMI       SIR    0.6 million  2.1 million


OregonAS: The Oregon AS router graph is a network graph collected from the Oregon router views. It contains 2172 links among 633 AS peers.2 The contagion here can be thought of as malware and computer-network viruses, which we want to control by shutting off or patching relevant routers.

STANFORD: The Stanford CS hyperlink network from 2001, in which a web page links to another page.3 We made the graph undirected and chose the largest connected component, which contains 8929 nodes and 53829 links. Contagions here can be false information spreading through the web, which we want to prevent by posting true information at strategic web pages.

GNUTELLA: A peer-to-peer network: a snapshot of the Gnutella P2P file-sharing network from August 2002. It contains 39994 links among 10876 peers. Similar to OregonAS, we can control the spread of malware and harmful files by patching some important peers.

BRIGHTKITE: A friendship network from the SNAP dataset collection,4 from the former location-based social networking service provider Brightkite,5 which consists of 58228 nodes and 214078 edges. As friends regularly frequent the same places, such location-based networks can be useful for public health.

PORTLAND: We experimented on this social-contact graph based on detailed microscopic simulations [116], versions of which have been used in national smallpox modeling studies [35]. The dataset tracks all the different types of activities of about 0.5 million people (nodes) in the city of Portland, Oregon. The resulting graph has about 1.6 million edges (interactions), weighted with the contact times (in seconds) between people.

MIAMI: Another social-contact graph based on detailed microscopic simulations [116], with about 0.6 million people and 2.1 million edges.

Settings. For the IC model, we use three illustrative settings: (a) uniform probability p = 0.6; (b) uniform probability p = 1; and (c) p_{u,v} uniformly randomly chosen from {0.1, 0.5, 0.9} (following the literature [22]). For the SIR model, since our social-contact graphs have contact times between people [116], we use the normalized contact time as the attack probability p_{u,v}, and set a uniform recovery rate δ = 0.6.

2 http://topology.eecs.umich.edu/data.html
3 http://www.cise.ufl.edu/research/sparse/matrices/Gleich/
4 http://snap.stanford.edu/data/index.html
5 http://www.brightkite.com


Note that the performance of our algorithms does not change when δ varies, since changing δ only rescales the edge weights in the network (see Section 3.5).

For most experiments, we randomly choose 100 nodes as the infected set I_0 for the IC model, and 300 nodes for the SIR model (as those graphs are larger). With this setting, we aim to model the infection at the time when early intervention happens. We also empirically study the effect of the distribution and size of I_0 on our algorithms (see Sections 3.6.2.3 and 3.6.2.4 for more details).

We choose roughly 1% of the whole population as the budget k for large networks in ourexperiments. To get the expected number of healthy nodes after immunization, we run theprocesses (either IC or SIR) 1000 times and take the average.

Baseline Algorithms. We compare our algorithms dava, dava-prune and dava-fast against various other competitors to better judge their performance. Except for Random, all baselines use weighted edges.

1. Random: In this method, we choose to give the vaccines to k uniformly randomlychosen healthy nodes.

2. Degree: In this method, we choose to give the vaccines to the top-k healthy nodesaccording to their weighted degree. This is similar to the popular acquaintance immu-nization method [25].

3. PageRank: Here, we choose to give the vaccines to the top-k healthy nodes with the highest PageRank. PageRank is a popular node-importance measure which has been widely used in social media for many tasks [124]. We use a restart probability of 0.15.

4. Per-PRank: In this method, we choose the top-k healthy nodes with the highest personalized PageRank with respect to the given infected nodes [70]. Intuitively, Per-PRank takes into consideration how close the nodes are to the infected node set I_0. We use a restart probability of 0.15.

5. Netshield: This is a state-of-the-art pre-emptive immunization algorithm [166], whichaims to minimize the epidemic threshold of the graph. We take the top-k healthy nodesaccording to the algorithm (i.e. we get the ranking from this algorithm, and ignorenodes which are already infected in our problem).

Implementation Note. Lengauer and Tarjan [87] proposed two algorithms for constructing a dominator tree: a complicated near-linear-time algorithm which takes O(mα(m,n)) time, where m is the number of edges, n is the number of nodes, and α(m,n) is a functional inverse of Ackermann's function; and a simpler one which takes O(m log n) time. Buchsbaum et al. [18] presented an exact linear-time dominator algorithm, but it ran slower on real flowgraphs than the algorithms in [87]. We hence use the simpler O(m log n) algorithm in our implementation (common libraries also typically include this version).


3.6.2 Experimental Results

We describe the results of our experiments next. First, we show that dava and dava-prune give the same results. Second, we demonstrate that dava, dava-prune and dava-fast get up to 10 times better solutions compared to the baselines (resulting in thousands more healthy nodes), and dava-prune saves up to 16k nodes (roughly 3% of the graph) more than dava-fast on large networks. In addition, we demonstrate that the size of I_0 does not change the accuracy of our algorithms. Finally, we also show the scalability of our algorithms on large datasets.

We note here that the number of nodes in the first layer of the dominator trees of our graphs was a fraction (ranging from 0.3 to 0.7) of the total number of healthy nodes in the graph. As discussed before, because of Lemma 3.3, this reduces the solution space substantially.

3.6.2.1 Effectiveness of dava-prune

[Figure 3.5 plots: (a) OregonAS, (b) GNUTELLA, (c) STANFORD, (d) BRIGHTKITE. x-axis: budget of vaccines (k); y-axis: expected number of healthy nodes; curves: dava and dava-prune.]

Figure 3.5: Effectiveness of dava-prune: IC model with p ∈ {0.1, 0.5, 0.9} on various real datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. budget k. dava-prune outputs the same results as dava.

We compare dava-prune with dava to demonstrate that dava-prune outputs exactly the same result as dava. Since dava is not scalable to PORTLAND and MIAMI, we do not show plots for them. We also show how well dava-prune performs when compared with dava-fast.

Figure 3.5 shows the results of dava-prune and dava with p ∈ {0.1, 0.5, 0.9} for OregonAS, GNUTELLA, STANFORD and BRIGHTKITE: dava-prune always gives the same results as dava for all networks as we vary k, which is consistent with our theoretical result (see Lemma 3.9).

3.6.2.2 Comparison with baselines

[Figure 3.6 plots: (a) OregonAS, (b) GNUTELLA, (c) STANFORD, (d) BRIGHTKITE ("performance (p=1)"). x-axis: budget of vaccines (k); y-axis: expected number of healthy nodes; curves: dava-prune, dava-fast, Netshield, Degree, Random, PageRank, Per-PRank.]

Figure 3.6: Effectiveness for DAV (comparison with baselines): IC model with p = 1 on various real datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms by up to 10 times in magnitude, and dava-prune saves more nodes than dava-fast. Best seen in color.

Since dava and dava-prune output the same result as we discussed before, in this section,we only show the performance of dava-prune for ease of exposition.

Summary of results. dava-prune and dava-fast consistently outperform the other baseline algorithms for all networks under all settings. Comparing dava-prune and dava-fast, we found that, as expected, dava-prune has better performance: it saves up to 16k nodes (roughly 3% of the graph) more than dava-fast on large networks. Hence, if performance is the top priority, dava-prune is the best choice. If running time is equally crucial, dava-fast, with a much faster running time (which does not depend on k), is the better choice.


[Figure 3.7 plots: (a) OregonAS, (b) GNUTELLA, (c) STANFORD, (d) BRIGHTKITE ("performance (p=0.6)"). Axes and curves as in Figure 3.6.]

Figure 3.7: Effectiveness for DAV (comparison with baselines): IC model with p = 0.6 on various real datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms by up to 10 times in magnitude, and dava-prune saves more nodes than dava-fast. Best seen in color.

Having said that, the running time of dava-prune is also competitive (see Section 3.6.2.5): it is itself much faster than dava.

IC model. Figures 3.6, 3.7 and 3.8 show our experimental results for p = 1, p = 0.6 and p ∈ {0.1, 0.5, 0.9}. In all networks, dava-prune and dava-fast consistently outperform the baseline algorithms, and dava-prune gives the best results for all networks.

As OregonAS has only ~600 nodes, we varied k up to 50 (roughly 9% of the graph). OregonAS has a jellyfish-type structure, hence for lower k most algorithms work well by targeting the nodes in the core; but for larger k, the periphery needs to be targeted, and here our algorithms provide the best solution. For bigger networks like GNUTELLA and STANFORD, with tens of thousands of nodes, the difference in performance of dava-prune and dava-fast from the other algorithms is clearer. Also, the gains from our algorithms decreased as the value of p decreased. This is expected: the weaker the disease spreads, the lower the benefit of vaccinating, i.e., the lower the savings gained from carefully selecting important nodes (as removing a node will not change the expected infections much).


[Figure 3.8 plots: (a) OregonAS, (b) GNUTELLA, (c) STANFORD, (d) BRIGHTKITE ("performance"). Axes and curves as in Figure 3.6.]

Figure 3.8: Effectiveness for DAV (comparison with baselines): IC model with p ∈ {0.1, 0.5, 0.9} on various real datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms, and dava-prune saves more nodes than dava-fast. Best seen in color.

Per-PRank and PageRank perform well on STANFORD (a web graph, with many hubs which these baselines exploit). Netshield performed well only on BRIGHTKITE, demonstrating that the type of solution changes drastically once we take the data of who is infected into account. In all these cases, dava-prune and dava-fast performed the best: both saved up to 10 times more nodes than the baseline algorithms on STANFORD.

SIR model. We obtained similar results under the SIR model on PORTLAND and MIAMI (Figure 3.9): dava-prune and dava-fast outperform the other baselines. We notice that the larger k becomes, the more dava-prune and dava-fast outperform the other algorithms: they both save more than 75k more nodes (roughly 15% of the graph) than Degree when k = 2000 (Degree is similar to the well-known acquaintance immunization method used in practice [25])!

Finally, in both the IC and SIR experiments, though our faster heuristic dava-fast performed very well and obtained competitive solutions, it is not as good as dava-prune: dava-prune gives the best results for all networks, e.g., it saves about 400 more nodes than dava-fast on STANFORD when k = 150.


[Figure 3.9 plots: (a) PORTLAND, (b) MIAMI ("performance"). x-axis: budget of vaccines (k); y-axis: expected number of healthy nodes (×10^5); curves as in Figure 3.6.]

Figure 3.9: Effectiveness for DAV (comparison with baselines): SIR model on real datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms, and dava-prune saves up to 16k more nodes than our second-best algorithm dava-fast. Best seen in color.

The difference is clearer on PORTLAND and MIAMI: dava-prune saves 16k more nodes (roughly 3% of the graph) than dava-fast when k = 2000. Hence, if we care mainly about performance, dava-prune is the best choice, and its running time is also competitive (see Section 3.6.2.5). Though dava-fast is not as good as dava-prune, it has its own advantage: it is a much faster algorithm.

3.6.2.3 Quality w.r.t. size of I0

We also show the performance of our algorithms w.r.t. the size of I_0. We present the results for PORTLAND and MIAMI in Figure 3.10; for the other networks with the IC model we obtained similar results. First, we observe, as expected, that as the size of I_0 increases, the expected number of healthy nodes at the end decreases. Second, on both PORTLAND and MIAMI, dava-prune and dava-fast consistently outperform the other baselines as the size of I_0 changes. Hence, our algorithms are robust w.r.t. the size of I_0, i.e., the gains of our algorithms are consistently higher than those of the other baselines.

3.6.2.4 Quality w.r.t. distribution of I0

We now show the performance of our algorithms w.r.t. a specific distribution of I_0 as well. Figure 3.11 shows the results when I_0 is chosen uniformly at random from people aged over 60 (commonly treated as the vulnerable population) on PORTLAND and MIAMI. As shown in Figure 3.11, dava-prune and dava-fast again consistently outperform the other baselines. Notice that the plots in Figure 3.11 are almost the same as the results in Figure 3.9, where I_0 is chosen from the whole population. Therefore, our algorithms are robust w.r.t. the distribution of I_0 too, i.e., the gains of our algorithms are consistent across different distributions of I_0.


[Figure 3.10 plots: (a) PORTLAND, (b) MIAMI. x-axis: size of infected set (I_0); y-axis: expected number of healthy nodes (×10^4); curves as in Figure 3.6.]

Figure 3.10: Effectiveness for DAV (comparison with baselines over the size of I_0): SIR model on real datasets. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. size of I_0. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms. Best seen in color.

3.6.2.5 Scalability

Table 3.3: Running time (sec.) of dava, dava-prune, dava-fast and Netshield when k = 200. Runs were terminated when the running time exceeded 24 hours (shown by '-'). We do not show the running times of Random, Degree, PageRank and Per-PRank because they are fast heuristics.

            dava-fast  dava-prune  dava     Netshield
OregonAS    0.89       17.5        23.3     4.9
STANFORD    14.2       1365.2      2920.4   74.1
GNUTELLA    20.1       2245.4      4700.5   79.4
BRIGHTKITE  109.3      8921.3      19444.1  246.8
PORTLAND    778.4      66424.4     -        8211.6
MIAMI       1034.2     81638.5     -        11233.9

Although our algorithms are polynomial-time, we show some running-time results to demonstrate scalability. Table 3.3 shows the running times of dava, dava-prune, dava-fast and Netshield on different datasets for budget k = 200. We do not show the running times of Random, Degree, PageRank and Per-PRank because they are fast heuristics that finish in the order of seconds; e.g., all four heuristics finish within 60 seconds on the largest network, MIAMI. dava-fast is much faster than both dava and dava-prune: it took only 20 seconds to select 200 nodes in GNUTELLA, while dava took more than an hour. For the large-scale datasets PORTLAND and MIAMI, dava-fast took less than 15 minutes to select 200 nodes, while dava could not finish in the allotted time. Further, dava-fast is up to 10 times faster than Netshield; this is because Netshield has O(nk^2) complexity (while our algorithms are linear in the budget). dava-prune is faster than dava as well: it took only about half the time dava takes. Further, for PORTLAND and MIAMI, dava-prune can finish in the allotted time.


[Figure 3.11 plots: (a) PORTLAND, (b) MIAMI ("performance"). x-axis: budget of vaccines (k); y-axis: expected number of healthy nodes (×10^5); curves as in Figure 3.6.]

Figure 3.11: Effectiveness for DAV w.r.t. distribution of I_0: SIR model on real datasets. I_0 is chosen uniformly at random from the population aged 60 or above. Expected number of healthy nodes after distributing vaccines according to different algorithms (i.e., σ′_{G,I′}(S)) vs. budget k. Higher is better. Note that dava-prune and dava-fast consistently outperform the other algorithms. Best seen in color.


Despite being slower than dava-fast on small graphs, dava-prune's performance improvement is worth its increased running time: e.g., on STANFORD, it saves 350 more nodes than dava-fast (more than 10% as many nodes as dava-fast saves, and 4% of the nodes w.r.t. the graph size), while having a competitive running time (within 25 minutes), which matters especially for applications where performance is the top priority. On the other hand, for large graphs such as PORTLAND and MIAMI, dava-fast has its advantage: it is about 80 times faster than dava-prune, while the difference in nodes saved is marginal (about 250 nodes for k = 200). However, for a larger budget k, as we saw in Section 3.6.2.2, the difference in performance between dava-prune and dava-fast may increase; in that case, dava-prune might be a good option. Ultimately, the choice between dava-prune and dava-fast depends on the application: if we do not need a fast algorithm, dava-prune is clearly the better choice, while for time-sensitive applications, e.g., when we need to quickly update our allocations, dava-fast is more suitable.

We also show the running times of our algorithms w.r.t. k (see Figure 3.12(a)). We present the result for GNUTELLA; for the other networks we obtained similar results. We observe that as k increases, the running time increases linearly for both dava-prune and dava, consistent with their theoretical complexities, while the running time of dava-fast is constant, as it does not depend on k (see Lemma 3.11).

Finally, we also experimented with different graphs to show the running times of our algorithms w.r.t. the size of the graph (see Figure 3.12(b)) when k = 200. We use the BTER model [151] to generate graphs of various sizes; BTER is a well-known graph model which accurately captures the properties of real-world networks. The graphs we generated preserve the community structure and degree distribution of GNUTELLA. We observe that as


[Figure 3.12 plots: (a) GNUTELLA: running time (sec.) vs. budget k; (b) BTER: running time (sec.) vs. graph size; curves: dava-prune, dava-fast, dava.]

Figure 3.12: Running time (sec.). (a) Running time vs. budget k; (b) running time vs. graph size (k = 200).

the size of the graph increases, the running time increases almost linearly for dava, dava-prune and dava-fast, again as expected from their running-time complexities.

3.7 Conclusion

This chapter addresses the problem of immunizing healthy nodes in the presence of already infected nodes, given a graph like a social/computer network or the blogosphere. The potential applications are broad: from distributing vaccines to control an epidemic, to stopping rumors already present in social media. Our main contributions are: we formulated the Data-Aware Vaccination problem, and proved that it is NP-hard and also hard to approximate within an absolute error. We then gave an optimal algorithm DAVA-TREE for m-trees, and presented three polynomial-time heuristics dava, dava-prune and dava-fast for general graphs, with varying degrees of performance. We demonstrated the effectiveness and efficiency of our algorithms through extensive experiments on multiple datasets, including large epidemiological social-contact networks, computer networks and social-media networks, under both the well-known IC and SIR models. Our algorithms outperform other competitors by up to 10 times in magnitude. Among our methods, dava-prune saves up to 16k more nodes than dava-fast, but dava-fast is the fastest. We also demonstrated the scalability of dava-prune and dava-fast on large-scale networks.


Chapter 4

Uncertain Data-Aware Vaccination

In the previous chapter, we studied the Data-Aware Vaccination problem, which takes prior infection information into account for controlling diffusion. In this chapter, we extend it to an uncertain environment, where we assume the infection information follows a probability distribution. This chapter is motivated by the fact that, in reality, contagions usually spread in uncertain environments, and the sources of such uncertainty are many. For example, in public health, due to the so-called multi-layered surveillance pyramid [48,96,156], at each layer the number of detected infections is a fraction of the infections in the layer below it; hence the total detected infections at the top of the pyramid are a fraction of the actual infections in the population at the bottom. Another example is the likelihood ratios used in diagnostic testing [28]: for each person who gets a negative test outcome, there is some probability that the test was a false negative. In social media, as outsiders we rarely get access to the complete cascade: researchers usually have access to only a uniform sample of cases (e.g., via the Twitter API), and on Facebook most users keep their activity and profiles private. Moreover, if only because of the extreme velocity of social-media data, one has to resort to using only a sample of the data. This implies that we have to make do with an uncertain snapshot.

In this chapter, we study the problem of how to best distribute vaccines to nodes in large networks in the presence of uncertain prior information. Our goal is not to fill in the missing information; instead, we want to make robust decisions in the presence of uncertainty. Our contributions in this chapter include:

1. Problem Formulation: We formulate the Uncertain Data-Aware Vaccination problem,which takes into account multiple natural uncertainty models arising from social mediaand epidemiology.

2. Efficient Algorithms: As the problem is NP-hard and hard to approximate within an absolute error, we develop multiple polynomial-time algorithms of varying efficiency, namely (a) Sample-Cas, based on the sample average approximation; and (b) Expect-Max, a faster hybrid algorithm which leverages the so-called expected graph and two complementary approaches to estimate benefits.

3. Extensive Experiments: We demonstrate the effectiveness and scalability of our algorithms on multiple real datasets, including large epidemiological and social networks, over different uncertainty distributions and initial conditions. Our algorithms outperform several other competitor algorithms, obtaining substantial gains in both the number of nodes saved and running time.

This work has been published in CIKM 2014 [187]. We first give preliminaries, then describe the uncertainty models and introduce our problem UDAV formally in Section 4.2. We next propose two methods, Sample-Cas and Expect-Max, for UDAV, and finally validate our methods in the experiments section.

4.1 Preliminaries

Table 4.1 lists the main symbols used in this chapter. There is an underlying contact network G on which the contagion (disease/virus/meme, etc.) can spread. We assume that our network is weighted and undirected, but all our methods naturally generalize to directed graphs.

We use two widely used propagation models to describe how the virus spreads on the network: the Independent Cascade (IC) model and the Susceptible-Infected-Recovered (SIR) model. SIR is a well-known epidemiological model for mumps-like infections [5,64]. A node in this model can be healthy (susceptible), infectious or recovered. When a node u becomes infected at timestamp t, it tries to infect each of its healthy direct neighbors v with propagation probability β_{u,v}; if u succeeds, v becomes infectious at timestamp t + 1. At the end of each timestamp t, each infected node u recovers with curing probability ρ at the next timestamp t + 1; once recovered, u can never be infected again. The process stops when no additional node becomes infectious. The IC model [72], a special case of SIR, has been extensively studied in social media to model viral marketing. Unlike SIR, a node u in IC has only a single chance to infect its healthy neighbors (hence the curing probability ρ = 1 here).
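For concreteness, here is a minimal Monte-Carlo sketch of the IC process (our own illustration: G is an adjacency structure and beta a hypothetical dict of directed edge probabilities):

import random

def ic_footprint(G, seeds, beta, trials=1000):
    """Monte-Carlo estimate of the expected footprint under the IC model:
    each newly infected node u gets a single chance to infect each healthy
    neighbor v with probability beta[(u, v)]."""
    total = 0
    for _ in range(trials):
        infected = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in G[u]:
                    if v not in infected and random.random() < beta[(u, v)]:
                        infected.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(infected)
    return total / trials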

4.2 Problem Formulation

4.2.1 Uncertainty model

In this chapter, we are concerned with the scenario where we know the underlying contact network, but we do not know the exact current infection state of the network. One source of uncertainty is public-health surveillance [48, 69, 94, 96, 122, 156]. Generally there are three


Table 4.1: Terms and Symbols

Symbol      Definition and Description
UDAV        Uncertain Data-Aware Vaccination problem
IC          Independent Cascade model
SIR         Susceptible-Infected-Recovered model
footprint   number of infected nodes at the end
benefit     number of nodes saved
G(V,E)      graph G with node set V and edge set E
U           uncertainty model
β_{i,j}     propagation probability from node i to j (weight on edges)
p_i         probability that i is infected at the start
k           the budget (i.e., the number of vaccines available)
S           set of nodes to give vaccines to
E_S(F)      the expected footprint after vaccinating S
δ_Z(S)      given graph Z, the expected benefit of vaccinating S in Z
l           number of samples
α           percentage of nodes that have p_i > 0 in U

types of surveillance: population-based, health-provider-based and lab-based. Although different types of surveillance may have different probabilities of missing a truly infected person, we can simply use a set of probabilities P (over the nodes) to model such uncertainty. Another example is the likelihood ratios used in diagnostic testing: each person has a probability p that her test was a false negative. In Twitter, each relevant 'infected' tweet can be modeled as having some probability of being missed (because of uniform samples [113]).

Table 4.2 summarizes the common probability distribution models U we use in this chapter to model the uncertainty in observed infections. Each gives the probability that a node i not observed as infected is truly infected. We focus on fully factorizable distributions (over nodes) for simplicity.1 Hence, if G_j denotes a particular configuration of infections in the network (i.e., a 'possible world'), then Pr(G ≡ G_j) = ∏_{a∈I} p_a · ∏_{b∈H} (1 − p_b), where I and H are the sets of infected and healthy nodes in G_j, and the probabilities p_i for any node i come from U.

1 Extending our results to more general forms, e.g., distributions factorizing over groups of nodes being infected, is interesting future work.
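As an illustration of such a factorizable model, a possible world can be sampled node by node; a minimal sketch (function and variable names are ours):

import random

def sample_possible_world(nodes, observed_infected, p):
    """Sample one 'possible world' from a factorizable uncertainty model U:
    each node i not observed as infected is truly infected independently
    with probability p[i] (the GENERAL model of Table 4.2)."""
    infected = set(observed_infected)
    for i in nodes:
        if i not in infected and random.random() < p.get(i, 0.0):
            infected.add(i)
    return infected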


Table 4.2: Uncertainty models for initial infections used in this chapter.

Name          Distribution  Description
UNIFORM       p_i = p       All nodes have an identical probability of being infected. Can be thought of as the sampling rate in the case of the Twitter API [113].
SURVEILLANCE  p_i ∈ P       Each node takes a probability from P (a finite set of probabilities like {0.1, 0.5, 0.9}). See the surveillance pyramid [96,122,156].
PROP-DEG      p_i ∝ d_i     The probability of being infected is proportional to a node's degree, i.e., people with a larger number of connections have higher probabilities of being infected.
GENERAL       p_i           Each node has its own infection probability.

4.2.2 Problem Definition

Now we are ready to state our problem formally. We assume that a contagion can in principle travel from any node to any other node, i.e., the graph is connected (strongly connected if directed). We are given a fixed set I_0 of infected nodes, and an uncertainty model U as above. We are also given a budget of k vaccines. Giving a vaccine to a node renders it immune to the virus, so it cannot get infected (effectively removing it from the network). Our goal is to find the 'best' set S of nodes to vaccinate to minimize the spread of the contagion, measured by the so-called 'footprint', the number of infected nodes at the end. A subtle point is that vaccination is meaningful only for healthy nodes. Hence when we select a node set S, not all the nodes in S can be vaccinated (removed) in all possible sampled graphs: if a node i is infected in a possible world G_j, then it cannot be vaccinated in G_j, and it does not give us any benefit there.

More specifically, suppose S is the set of nodes selected initially for vaccination and G_i is a particular realization ('world') sampled from U; only infected or healthy nodes exist in G_i. Denote by S_i ⊆ S the subset of nodes in S that are healthy in G_i; these are the nodes which will be vaccinated in G_i. Denote by σ_{G_i}(S_i) the expected number of infected nodes after running the epidemiological model (e.g., IC) on G_i, starting from the infected nodes in G_i but after removing the nodes in S_i. Let F be the random variable denoting the number of infected nodes after choosing set S under U. Then E_S(F) = ∑_{G_i∼U} Pr(G_i) σ_{G_i}(S_i), and we are trying to find the best set S to minimize E_S(F). Formally:

Problem 1: Uncertain Data-Aware Vaccination Problem: UDAV(G, U, I_0, k).

Given: A graph G(V,E) with node set V and edge set E, the uncertainty model U, the infected node set I_0, the propagation probability β_{i,j} on each edge {i,j}, and an integer (budget) k.

Find: A set of nodes S* = argmin_S E_S(F) s.t. |S| = k.


Note that since vaccination is applied only to healthy nodes in a possible world, this formulation naturally generalizes the corresponding deterministic version of the problem (the data-aware vaccination problem DAV [188]).

Complexity. UDAV is NP-hard and cannot be approximated within an absolute error, since its deterministic counterpart DAV is itself NP-hard and cannot be approximated within an absolute error [188].

4.3 Proposed Methods

Overview. In this section, we first present a sampling algorithm, Sample-Cas, for UDAV, which is a stochastic algorithm under the SAA framework. However, Sample-Cas is not scalable to large networks. Hence, we propose two faster algorithms, Expect-Dom and Expect-Eig, which are based on the expected graph and on measuring the benefits of vaccinations. After analyzing the performance of Expect-Dom and Expect-Eig, we show that these two algorithms are complementary w.r.t. the support of the uncertainty model, and hence we present a hybrid algorithm called Expect-Max with sub-quadratic running time.

We assume the GENERAL model everywhere in this section (the rest of the models in Table 4.2 are just special cases of GENERAL). Further, we describe the algorithms assuming the IC model first (Sections 4.3.1 and 4.3.2); later, we discuss how to extend them to the SIR model (Section 4.3.3).

4.3.1 The Sample-Cascade Algorithm

Main Idea. Since UDAV is a stochastic optimization problem, we apply the SAA (Sample Average Approximation) framework [77] to solve it. The idea is to reduce the stochastic optimization problem to its deterministic version by sampling the uncertainty distribution to generate a finite number of deterministic instances. Unfortunately, as mentioned in the previous section, even the deterministic version of UDAV is NP-hard. Hence we leverage the solution in [188], which utilizes a tree structure called the dominator tree, and then find a suitable submodular structure to solve UDAV approximately.

Details. Let δ_{G_i}(S_i) be the expected benefit of vaccinating the healthy node set S_i in a graph G_i, i.e.:

δ_{G_i}(S_i) = σ_{G_i}(∅) − σ_{G_i}(S_i)    (4.1)


So

ES(F ) =∑Gi

Pr(Gi)σGi(Si)

=∑Gi

Pr(Gi)(σGi(∅)− δGi(Si)) (4.2)

Since∑

GiPr(Gi)σGi(∅) is constant, UDAV (Problem 1) can be rewritten as:

S∗ = argmaxS

∑Gi

Pr(Gi)δGi(Si) s.t. |S| = k.

So we need to compute δGi(Si) for each Gi, which is essentially the deterministic problem ongraph Gi. Hence we re-purpose the solution from [188]: first merge all the infected nodes in Gi

into a super node I0 (I0 is infected). If a healthy node has multiple infected neighbors, I0 willconnect to the node with the probability that is the logical-OR of the individual probabilities(so if a node u has two infected neighbors x and y, βI0,u = 1− (1− βx,u)(1− βy,u)). Secondly,build a dominator tree DomGi on this merged graph, and properly weight it. Briefly, given asource node I0, a node v dominates another node u if every path from I0 to u contains v.Node v is the immediate dominator of u, denoted by v = idom(u), if v dominates u and everyother dominator of u dominates v. We can build a dominator tree rooted at I0 by adding anedge between the nodes u and v if v = idom(u) (totally in near-linear time [18, 87]). Finallywe approximate δGi(Si) as δDomGi (Si) (i.e. the benefit after removing nodes in Si in DomGi).

In fact, we can further prove that the real benefit of removing $S_i$ from graph $G_i$ is lower-bounded by $\delta_{\mathrm{Dom}_{G_i}}(S_i)$.

Lemma 4.1. (Lower Bound of $\delta_{G_i}(S_i)$) The number of nodes we can save from $G_i$ is no less than the number of nodes we save from its dominator tree, that is, $\delta_{\mathrm{Dom}_{G_i}}(S_i) \le \delta_{G_i}(S_i)$ (where the inequality is saturated when the merged graph is a tree).

Proof. For any node $u$ in $S_i$, the benefit we can get on $G_i$ includes at least all nodes under the subtree of $u$ in the dominator tree of $G_i$ (because, once $u$ is removed, there is no path from $I_0$ to those nodes). Hence, $\delta_{\mathrm{Dom}_{G_i}}(S_i) \le \delta_{G_i}(S_i)$.

Let $Q(S) = \sum_{G_i} \Pr(G_i)\,\delta_{\mathrm{Dom}_{G_i}}(S_i)$; then using Lemma 4.1, we get $Q(S) \le \sum_{G_i} \Pr(G_i)\,\delta_{G_i}(S_i)$. The gap between $Q(S)$ and $\sum_{G_i} \Pr(G_i)\,\delta_{G_i}(S_i)$ depends on the structure of the graph (if the merged graph is a tree, there is no gap). Hence, Lemma 4.1 suggests that we can use $Q(S)$ to approximate $\sum_{G_i} \Pr(G_i)\,\delta_{G_i}(S)$. We therefore formulate Problem 2 next, to approximate UDAV (Problem 1).

Problem 2: Given: $G(V,E)$, $U$, $I_0$, and $k$.

Find: A set of nodes $S^* = \arg\max_S Q(S)$ s.t. $|S| = k$.

Interestingly, Q(S) is a submodular function, while δGi(S) is not submodular [188].


Lemma 4.2. (Submodularity of Q(S)) Q(S) is a submodular function.

Proof. First we need to prove that, given a set $S$, $\delta_{\mathrm{Dom}_{G_i}}(S)$ is a submodular function of $S$. To prove it, we need to show that if $U \subseteq W$, then for an added node $n$, $\delta_{\mathrm{Dom}_{G_i}}(U \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(U) \ge \delta_{\mathrm{Dom}_{G_i}}(W \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(W)$.

Let $U_i \subseteq U$ and $W_i \subseteq W$ be the respective sets of healthy nodes. Since $\delta_{\mathrm{Dom}_{G_i}}(W) = \delta_{\mathrm{Dom}_{G_i}}(W_i)$ (and similarly for $U$) and $U_i \subseteq W_i$, we need to prove that $\delta_{\mathrm{Dom}_{G_i}}(U_i \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(U_i) \ge \delta_{\mathrm{Dom}_{G_i}}(W_i \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(W_i)$.

Case 1. If node $n$ is an infected node, adding $n$ yields zero benefit for both $U_i$ and $W_i$.

Now let us assume $n$ is healthy. There are three possible cases. Case 2. If node $n$ is under the subtree of a node in set $U_i$, then after adding node $n$, $\delta_{\mathrm{Dom}_{G_i}}(U_i \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(U_i) = 0$ and $\delta_{\mathrm{Dom}_{G_i}}(W_i \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(W_i) = 0$.

Case 3. If node $n$ is not under the subtree of $U_i$, but is under the subtree of $W_i$, then $\delta_{\mathrm{Dom}_{G_i}}(U_i \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(U_i) > 0$, but $\delta_{\mathrm{Dom}_{G_i}}(W_i \cup \{n\}) - \delta_{\mathrm{Dom}_{G_i}}(W_i) = 0$.

Case 4. If node $n$ is not under the subtree of $W_i$, the marginal gain for both $U_i$ and $W_i$ is the same, namely $\delta_{\mathrm{Dom}_{G_i}}(\{n\})$.

Hence, $\delta_{\mathrm{Dom}_{G_i}}(S)$ is a submodular function of $S$. Since a non-negative linear combination of submodular functions is still submodular, $Q(S)$ is a submodular function.

We now apply the SAA framework: sample $l$ graphs $G_1, G_2, \ldots, G_l$ from $U$ and define $Q_l(S) = \frac{1}{l}\sum_{i=1}^{l} \delta_{\mathrm{Dom}_{G_i}}(S_i)$. As $Q_l(S)$ is a submodular function, we apply the greedy algorithm [117] to obtain a $(1 - 1/e)$-approximation for Problem 2 (under the $l$ samples). We call this algorithm Sample-Cas (Algorithm 4.1). Note that we can speed up Algorithm 4.1 using the CELF optimization [89].

Algorithm 4.1 The Sample-Cas algorithm

Require: Input G, U, I_0, k and l
1: Sample G_1, . . . , G_l from U and G
2: Merge infected nodes into I_0^i for each G_i
3: Build dominator trees Dom_{G_1}, . . . , Dom_{G_l} rooted at I_0^i for each G_i
4: S = ∅
5: for i ← 1 to k do
6:   a* = arg max_a (1/l) Σ_i δ_{Dom_{G_i}}({a})
7:   Remove a* from each of Dom_{G_1}, . . . , Dom_{G_l}
8:   S = S ∪ {a*}
9: end for
10: return S
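A compact, unoptimized Python sketch of the greedy loop above (no CELF speed-up, unit tree weights, and subtree sizes recomputed each round for clarity; all names are illustrative):

```python
import networkx as nx

def subtree_sizes(T):
    """r(u): number of nodes in u's subtree of a dominator tree T
    (an nx.DiGraph whose edges point from idom(u) to u)."""
    sizes = {}
    for u in reversed(list(nx.topological_sort(T))):
        sizes[u] = 1 + sum(sizes[c] for c in T.successors(u))
    return sizes

def sample_cas(dom_trees, healthy, k):
    """Greedily pick k nodes maximizing the average dominator-tree
    benefit over the l sampled trees (Lines 5-9 of Algorithm 4.1)."""
    S = set()
    for _ in range(k):
        gains = dict.fromkeys(healthy - S, 0)
        for T in dom_trees:
            sizes = subtree_sizes(T)
            for a in gains:
                gains[a] += sizes.get(a, 0)     # delta_Dom({a})
        a_star = max(gains, key=gains.get)
        S.add(a_star)
        for T in dom_trees:                     # remove a* and its subtree
            if T.has_node(a_star):
                T.remove_nodes_from(nx.descendants(T, a_star) | {a_star})
    return S
```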


Lemma 4.3. (Running Time of Sample-Cas) The time complexity of Algorithm 4.1 is $O(l(k|V| + k|E| + |V|\log|V|))$.

Proof. Sampling the $G_i$ needs $O(l|V|)$ time, and building the $l$ dominator trees and weighting them needs $O(l|V|\log|V|)$ time. Selecting a node $a$ needs $O(l(|E| + |V|))$ time. Hence, overall, the time complexity is $O(l(k|V| + k|E| + |V|\log|V|))$.

How many samples? The next lemma estimates the number of samples $l$ needed so that $Q_l(S)$ is a good estimate of $Q(S)$.

Lemma 4.4. (Number of samples) For any $\varepsilon > 0$, to estimate $Q(S)$ within absolute error $\varepsilon$ with probability $\gamma = 1 - 2\exp\!\left(-\frac{2l\varepsilon^2}{\Delta^2}\right)$, we need $l \ge \frac{\Delta^2}{2\varepsilon^2}\ln\frac{2}{1-\gamma}$, where $\Delta$ is the upper bound for $\delta_{\mathrm{Dom}_{G_i}}(S)$ in the dominator tree.

Proof. It follows from the well-known Hoeffding inequality [110].

As $\Delta$ can be $O(|V|)$, Lemma 4.4 shows that we need worst-case $O(|V|^2)$ samples to get accurate estimates. Hence Sample-Cas does not scale to large networks.
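As a quick worked example of Lemma 4.4 (a sketch with illustrative constants, not thesis code):

```python
import math

def num_samples(delta, eps, gamma):
    """l >= Delta^2/(2*eps^2) * ln(2/(1-gamma)) from Lemma 4.4."""
    return math.ceil(delta**2 / (2 * eps**2) * math.log(2 / (1 - gamma)))

# With Delta = |V| = 1000, eps = 10 and gamma = 0.95 we already need
# num_samples(1000, 10, 0.95) ~ 18445 samples, illustrating the
# worst-case O(|V|^2) sample complexity that makes Sample-Cas slow.
```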

4.3.2 Expect-Max: a faster algorithm

Since Sample-Cas is not scalable to large networks, we next develop a faster algorithm, Expect-Max. We give the main idea, and then describe the details in the subsequent subsections.

Main Idea. We first formulate an equivalent problem which uses the concept of a so-called 'expected graph' $G_E$. Based on that, we propose two different methods, Expect-Dom and Expect-Eig, for measuring the expected benefits of vaccinations. We show that these two methods are in fact complementary, and hence propose Expect-Max, which is sub-quadratic in running time (in nodes/edges).

4.3.2.1 Expected Graph: An Equivalent Formulation

Here, we give an equivalent formulation of Problem 1 based on the concept of an 'expected graph'.

Definition 1 (Expected Graph): The expected graph $G_E$ is constructed as follows: start with $G$; add a 'super node' $I_0$; connect $I_0$ to any node $i$ where $p_i > 0$ with the edge weight $\beta_{I_0,i} = p_i$ ($p_i \in U$, the uncertainty model); and then mark all nodes except $I_0$ as healthy nodes.
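A minimal sketch of Definition 1 (assuming networkx; the attribute names 'beta' and 'healthy' are illustrative choices of this sketch):

```python
import networkx as nx

def expected_graph(G, p):
    """Build G_E: add supernode 'I0', connect it to each node i with
    p_i > 0 using edge weight beta_{I0,i} = p_i, mark others healthy."""
    GE = G.copy()
    GE.add_node('I0')
    for i, pi in p.items():
        if pi > 0:
            GE.add_edge('I0', i, beta=pi)
    nx.set_node_attributes(GE, {v: v != 'I0' for v in GE}, 'healthy')
    return GE
```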


As we show next, this construction transforms the uncertainty model from nodes to edges without losing any information. Hence, we can focus on a single graph $G_E$ instead of sampling graphs (the main reason why Sample-Cas was slow).

More specifically, we show in Lemma 4.5 an equivalent formulation of Problem 1 based on the expected graph $G_E$, under GENERAL and for budget $k = 1$. The main idea is that, crucially, as GENERAL is factorizable (i.e., for a particular configuration $G_j$, $\Pr(G \equiv G_j) = \prod_{a \in I} p_a \prod_{b \in H} (1 - p_b)$; see Section 4.2 for details), after running the first step of the diffusion model on the expected graph, we get the same configurations as when sampling from the uncertainty model in Problem 1. A subtle point is that Lemma 4.5 also takes into account the fact that nodes cannot be vaccinated in all 'possible worlds' (wherever they are already infected), by correcting the estimate obtained from $G_E$ by an appropriate factor.

Lemma 4.5. (Equivalent formulation of UDAV when $k = 1$) When the budget $k = 1$, for the UDAV problem, the best node $a^* = \arg\min_a E_{\{a\}}(F)$ can be equivalently written as $a^* = \arg\max_a (1 - p_a)\,\delta_{G_E}(\{a\})$.

Proof. We first prove $\delta_{G_E}(a) = \sum_{G_i} \Pr(G_i)\,\delta_{G_i}(a)$ using Definition 1 and the factorizability of GENERAL. Based on this, we can prove that $E_{\{a\}}(F) = \sum_{G_i} \Pr(G_i)\,\sigma_{G_i}(\emptyset) - (1 - p_a)\,\delta_{G_E}(\{a\})$; hence minimizing $E_{\{a\}}(F)$ is equivalent to maximizing $(1 - p_a)\,\delta_{G_E}(\{a\})$.

Lemma 4.5 shows that when the budget $k = 1$, we can get an equivalent formulation of Problem 1 based on the expected graph. Furthermore, note that UDAV is a stochastic problem, while Lemma 4.5 is based on calculating 'benefits' $\delta_{G_E}(\{a\})$ on a deterministic graph $G_E$. Next we propose two heuristics to estimate $\delta_{G_E}(\{a\})$ on $G_E$, which are complementary methods based on $\alpha$, the support of the uncertainty model (see Section 4.3.2.4 for more details).

4.3.2.2 The Expect-Dom Algorithm

One of the ways we can estimate the benefits is by using our Lemma 4.1 on $G_E$. The main idea is that we estimate $\delta_{G_E}(\{a\})$ by its lower bound $\delta_{\mathrm{Dom}_{G_E}}(\{a\})$ via the dominator tree on the expected graph. Motivated by the equivalent formulation of Problem 1 (Lemma 4.5), we propose to select, at each step, a node with the maximum value of $(1 - p_a)\,\delta_{\mathrm{Dom}_{G_E}}(\{a\})$ after building the dominator tree $\mathrm{Dom}_{G_E}$ of $G_E$. We call this algorithm Expect-Dom (Algorithm 4.2).

Lemma 4.6. (Running Time of Expect-Dom) The time complexity of Algorithm 4.2 is $O(k(|V| + |E|) + |V|\log|V|)$.

Proof. Creating the expected graph $G_E$ costs $O(|V|)$ time, and building the dominator tree and weighting it needs $O(|V|\log|V|)$ time. Updating the dominator tree costs $O(|V| + |E|)$ time. Hence, the time complexity of Expect-Dom is $O(k(|V| + |E|) + |V|\log|V|)$.


Algorithm 4.2 The Expect-Dom algorithm

Require: Input G, U, I_0 and k
1: Construct G_E
2: S = ∅
3: Build a dominator tree Dom_{G_E} on G_E
4: for i ← 1 to k do
5:   a* = arg max_a (1 − p_a) δ_{Dom_{G_E}}({a})
6:   S = S ∪ {a*}
7:   Remove a* from G_E
8: end for
9: return S

4.3.2.3 The Expect-Eig Algorithm

Another approach we propose is to estimate $\delta_{G_E}(\{a\})$ via the change in the largest eigenvalue of $G_E$, $\Delta\lambda_1(a)$, after removing node $a$. The largest eigenvalue of the adjacency matrix of a graph is related to the so-called 'epidemic threshold' of the graph under several epidemic models [132, 133]. If the largest eigenvalue is very small, a virus gets extinguished quickly. Next we explain why $\Delta\lambda_1(a)$ is crucial to the benefits. In addition, we show how to estimate $\delta_{G_E}(\{a\})$ using the greedy algorithm in [166] as well.

Justification of $\Delta\lambda_1(a)$. Let $\lambda_i$/$u_i$ be the $i$-th largest eigenvalue/eigenvector of $G_E$, and $f_t$ be the vector of probabilities of each node being infected at time $t$. The next lemma shows that the expected number of newly infected nodes is upper-bounded by a function of $\lambda_1$. Hence, reducing $\lambda_1$ (maximizing $\Delta\lambda_1(a)$) by removing node $a$ can effectively minimize the expected number of newly infected nodes, and eventually minimize $E_{\{a\}}(F)$ (the expected number of infected nodes at the end). According to Equation 4.2 (in Section 4.3.1), minimizing $E_{\{a\}}(F)$ is equivalent to maximizing the benefit $\delta_{G_E}(\{a\})$. Hence we can estimate $\delta_{G_E}(\{a\})$ using $\Delta\lambda_1(a)$.

Lemma 4.7. The expected number of newly infected nodes at timestep $t+1$ is upper-bounded by $h = e'\left(\sum_{j=1}^{|V|} \lambda_j^t u_j u_j'\right) f_1$. Furthermore, $h \le \lambda_1^t\, e'\left(\sum_{j=1}^{|V|} u_j u_j'\right) f_1$, where $e = (1, \ldots, 1)'$ and $f_1 = (p_1, \ldots, p_n)'$ (the initial infection probabilities of the nodes, which essentially come from the uncertainty model).

Proof. First, following the steps of Lemma 1 in [132], we get that the expected number of newly infected nodes at timestep $t+1$ is upper-bounded by $e'\left(\sum_{j=1}^{|V|} \lambda_j^t u_j u_j'\right) f_1$. Second, since $\lambda_1$ is real and positive (by the Perron-Frobenius theorem), we get $h \le \lambda_1^t\, e'\left(\sum_{j=1}^{|V|} u_j u_j'\right) f_1$.

The Expect-Eig Algorithm. Motivated by Lemma 4.5 (the equivalent formulation of Problem 1) and Lemma 4.7, we can greedily select a node with the maximum value of $(1 - p_a)\,\Delta\lambda_1(a)$ at each step, using $\Delta\lambda_1(a)$ as an estimate of the benefit of removing a node. We call this algorithm Expect-Eig (Algorithm 4.3).

Comment. [166] gives a fast greedy algorithm for this task, by approximating $\Delta\lambda_1(a) \approx 2\lambda_1 u_a^2$ (based on first-order matrix perturbation theory). Here we use it in Algorithm 4.3 (Line 5).

Lemma 4.8. (Running Time of Expect-Eig) The time complexity of Algorithm 4.3 is $O(k(|V| + |E|))$.

Proof. Calculating $u_1$ costs $O(|E|)$ time using the power method. Hence, Algorithm 4.3 takes $O(k(|V| + |E|))$ time.

Algorithm 4.3 The Expect-Eig algorithm

Require: Input G, U, I_0 and k
1: Construct G_E
2: Get λ_1 and u_1 = (u_1, . . . , u_n)' from G_E
3: S = ∅
4: for i ← 1 to k do
5:   ∆λ_1(a) = 2 λ_1 u_a^2 for each node a
6:   a* = arg max_a (1 − p_a) ∆λ_1(a)
7:   S = S ∪ {a*}
8:   Remove a* from G_E and update λ_1 and u_1
9: end for
10: return S
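A sketch of the core of Algorithm 4.3 in Python, assuming an unweighted expected graph held in networkx (version ≥ 2.7 for `to_scipy_sparse_array`) and scipy; the node-probability dictionary `p` and node name `'I0'` are assumptions of this sketch:

```python
import numpy as np
import networkx as nx
from scipy.sparse.linalg import eigsh

def expect_eig(GE, p, k):
    """Repeatedly score each healthy node a by (1 - p_a) * 2*lambda_1*u_a^2
    (the first-order perturbation estimate of the eigendrop) and remove
    the best-scoring node; lambda_1 and u_1 are recomputed after removal."""
    GE = GE.copy()
    S = []
    for _ in range(k):
        nodes = list(GE.nodes())
        A = nx.to_scipy_sparse_array(GE, nodelist=nodes, dtype=float)
        lam, vec = eigsh(A, k=1, which='LA')      # leading eigenpair
        u = np.abs(vec[:, 0])                     # Perron vector is nonnegative
        score = {a: (1 - p.get(a, 0.0)) * 2 * lam[0] * u[i] ** 2
                 for i, a in enumerate(nodes) if a != 'I0'}
        a_star = max(score, key=score.get)
        S.append(a_star)
        GE.remove_node(a_star)
    return S
```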

4.3.2.4 The Hybrid Algorithm: Expect-Max

Although Expect-Dom and Expect-Eig are both fast algorithms compared to Sample-Cas, they may not work well all the time. Next we discuss how uncertainty models affect their performance, and present a hybrid algorithm combining both of them.

Discussion about Expect-Dom. Denote by $\alpha$ the support of the uncertainty model (the percentage of nodes that are possibly infected). When $\alpha = 0$, the UDAV problem becomes exactly the DAV problem [188] (the deterministic case of UDAV), and Expect-Dom reduces to the algorithm in [188], which was shown to perform well. However, consider the opposite case $\alpha = 1$. In this case, $I_0$ connects to all remaining nodes, hence the dominator tree of $G_E$ becomes a star. For any node $a$, $\delta_{\mathrm{Dom}_{G_E}}(a)$ then only depends on the propagation probability from $I_0$ to $a$ (i.e., $p_a$). We cannot utilize any other information from the original graph, hence we would choose nodes essentially randomly. This also gives us the intuition that as $\alpha$ increases, the performance of Expect-Dom becomes worse (a fact we demonstrate in experiments as well).


Discussion about Expect-Eig. As we discussed in Section 4.3.2.3, the expected number of newly infected nodes at timestep $t+1$ is upper-bounded by $h = e'\left(\sum_{j=1}^{|V|} \lambda_j^t u_j u_j'\right) f_1 \le \lambda_1^t\, e'\left(\sum_{j=1}^{|V|} u_j u_j'\right) f_1 = h_1$ ($f_1$ essentially comes from the uncertainty model). We first demonstrate that this inequality ($h \le h_1$) saturates when $f_1$ is parallel to $u_1$. Then, we show how to maximize our chance of achieving this, which leads us to the discussion of the performance of Expect-Eig in terms of $\alpha$.

Lemma 4.9. ($h$-$h_1$ Gap) As the inner product of $u_1$ and $f_1$ increases, $h_1 - h$ decreases. When $f_1$ is parallel to $u_1$, $h_1 = h$.

Proof. As $u_1' f_1$ increases, $f_1$ becomes more parallel to $u_1$, and $u_j' f_1$ ($j \ne 1$) becomes smaller (because $u_i$ and $u_j$ are orthogonal). Hence $h_1 - h$ decreases. And when $u_j' f_1 = 0$ for all $j \ne 1$, $h_1 = h$.

This shows that the closer the uncertainty model is to $u_1$, the better a bound $h_1$ is of $h$; as a result, we expect $\Delta\lambda_1(a)$ to become a better estimate, and hence Expect-Eig to perform better. How is this related to $\alpha$? The following analysis gives a preliminary justification. A priori we do not know the graph, hence we do not know $u_1$; so we can reasonably assume it is picked uniformly at random from an $n$-dimensional space. Let us denote by $x$ the random variable of the first eigenvector. To make $f_1$ more parallel to $x$, we need to maximize the expectation of $f_1' x$ (i.e., $E_x[f_1' x]$). It is not hard to see that as we increase $\alpha$, $E_x[f_1' x]$ increases.

Lemma 4.10. (Expected gap) When $\alpha$ increases, $E_x[x' f_1]$ increases as well.

Proof. $E_x[x' f_1] = f_1' E_x[x]$, and all elements in $E_x[x]$ are non-negative. As $\alpha$ increases, more elements in $f_1$ become non-zero, hence $E_x[x' f_1]$ increases as well.

Lemma 4.10 suggests that when $\alpha$ increases, we expect $f_1$ and $u_1$ to become more parallel, and so the gap to decrease, as a result of which $\Delta\lambda_1(a)$ becomes a better estimate. Thus even this preliminary analysis immediately suggests that as $\alpha$ becomes larger, Expect-Eig should perform better. Again, we demonstrate this through experiments as well.

The Expect-Max Algorithm. The above discussion suggests a complementary picture: when $\alpha$ is low, we expect Expect-Dom to be better, and when $\alpha$ is high, we expect Expect-Eig to be better. Unfortunately, we do not know exactly when which algorithm is better: this likely depends not only on $\alpha$ but also on the graph and the distribution. However, we can still leverage this insight to propose a hybrid algorithm called Expect-Max, which maintains the scalability and quality of Expect-Dom and Expect-Eig. Expect-Max chooses either Expect-Dom or Expect-Eig based on their performance, that is,
$$S_{\text{Expect-Max}} = \arg\min_{S \in \{S_{\text{Expect-Dom}},\, S_{\text{Expect-Eig}}\}} E_S(F)$$


Comment. $S$ is the output of either Expect-Dom or Expect-Eig, and $E_S(F)$ can be obtained via simulation of the IC model (not sampling from the uncertainty model). Also note that Expect-Max is not a greedy algorithm that picks one node from either Expect-Dom or Expect-Eig in each step. Instead, it chooses $S$ just once after running Expect-Dom and Expect-Eig. Hence the time complexity of Expect-Max is $O(k(|V| + |E|) + |V|\log|V| + T)$, where $T$ is the time to run the IC model (which should be sub-quadratic in edges).
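The simulation step can be a plain Monte Carlo estimate of $E_S(F)$; a sketch (the edge attribute 'beta' and the seeding convention are assumptions of this sketch, with vaccinated nodes already removed from G):

```python
import random

def ic_footprint(G, p, n_sims=1000):
    """Estimate E_S(F): seed each node v independently with its remaining
    infection probability p_v, run a one-shot IC cascade, and average
    the final number of infected nodes over n_sims runs."""
    total = 0
    for _ in range(n_sims):
        infected = {v for v in G if random.random() < p.get(v, 0.0)}
        frontier = set(infected)
        while frontier:
            new = set()
            for u in frontier:
                for v in G.neighbors(u):
                    if v not in infected and random.random() < G[u][v]['beta']:
                        new.add(v)
            infected |= new
            frontier = new
        total += len(infected)
    return total / n_sims
```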

4.3.3 Extending to SIR model

Note that in SIR, the footprint is the total number of recovered nodes at the end (in contrast to the IC model). Nevertheless, leveraging the method in [188], we can directly extend our algorithms to the SIR model by converting the SIR model to the IC model with the propagation probability $1 - (1 - \beta_{i,j})^{1/\rho}$. This does not change any of our algorithms/results.
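The conversion is a one-liner; for example (a sketch of the formula above):

```python
def sir_to_ic(beta, rho):
    """Map an SIR edge probability beta, with uniform curing probability
    rho, to the equivalent IC propagation probability."""
    return 1 - (1 - beta) ** (1.0 / rho)

# e.g., beta = 0.3 and rho = 0.6 give 1 - 0.7**(1/0.6) ~ 0.448
```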

4.4 Experiments

We present a detailed experimental evaluation in this section.

4.4.1 Experimental Setup

We briefly describe our set-up next. We implemented the algorithms in Python, and conducted the experiments on a machine with four Xeon E7-4850 CPUs and 512GB of 1066MHz main memory.

Datasets. We ran our experiments on multiple datasets using both IC and SIR. Table 4.3 summarizes the datasets, which were chosen for their size as well as their applicability to the UDAV problem (from social media to epidemiology).

1. KARATE is a social network of friendships among 34 members of a karate club at a US university in the 1970s [180].

2. OregonAS² is the Oregon AS router graph collected from the Oregon router views. The contagion here can be thought of as malware and computer-network viruses, which we want to control by shutting off or patching relevant routers.

3. STANFORD³ is the Stanford CS hyperlink network, in which a web page links to another page. Contagions here can be false information spreading through the webspace, and we want to prevent their spread by posting true information at strategic web pages.

²http://topology.eecs.umich.edu/data.html
³http://www.cise.ufl.edu/research/sparse/matrices/Gleich/


4. GNUTELLA⁴ is a peer-to-peer network showing a snapshot of the Gnutella P2P file-sharing network from August 2002. Similar to OregonAS, we can control the spread of malware and harmful files by patching some important peers.

5. BRIGHTKITE⁴ is a friendship network from the location-based social networking service provider Brightkite. As friends regularly frequent the same places, such location-based networks can be useful for public health.

6. PORTLAND and MIAMI are social-contact graphs based on detailed microscopic simulations of large US cities. Edge weights here represent the expected contact time between people. Versions of these have been used in national smallpox and influenza modeling studies using the SIR model [35].

Table 4.3: Datasets

Dataset      Nodes (|V|)   Edges (|E|)   Model
KARATE       34            156           IC
OregonAS     633           2,172         IC
STANFORD     8,929         53,829        IC
GNUTELLA     10,876        39,994        IC
BRIGHTKITE   59,228        0.2 million   IC
PORTLAND     0.5 million   1.6 million   SIR
MIAMI        0.6 million   2.1 million   SIR

[Figure: three panels; (a) ratio r vs. budget of vaccines k on KARATE; (b) and (c) ratio R vs. α on STANFORD and BRIGHTKITE.]

(a) Sample-Cas vs. Optimal  (b) STANFORD  (c) BRIGHTKITE

Figure 4.1: (a) Quality of Sample-Cas: comparison between Sample-Cas and Optimal on KARATE over different distributions ($r = \frac{\#\text{healthy nodes saved by Sample-Cas}}{\#\text{healthy nodes saved by Optimal}}$ and $\alpha = 0.5$). (b) and (c) Comparison between Expect-Dom and Expect-Eig: ratio $R = \frac{\#\text{healthy nodes saved by Expect-Dom}}{\#\text{healthy nodes saved by Expect-Eig}}$ vs. $\alpha$. Expect-Dom performs better than Expect-Eig when $R > 1$; otherwise Expect-Eig is better.

Uncertainty models. We used three types of uncertainty models: (a) UNIFORM: $p = 0.6$; (b) SURVEILLANCE: $p_i$ uniformly randomly chosen from $\{0.1, 0.5\}$ for each node $i$ (following different levels of the surveillance pyramid, e.g., 10% of the total population is infected and is in the hospital, and only ~33% of infected people go to a hospital, which together imply 50% of the total population is infected and does not go to a hospital); (c) PROP-DEG: $p_i = d_i/d_{\max}$ for each node $i$ ($d_{\max}$ is the maximum degree of the graph $G$).

⁴GNUTELLA and BRIGHTKITE are from http://snap.stanford.edu/data/index.html
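For concreteness, the three uncertainty models can be drawn as follows (a sketch assuming a networkx graph G; parameter values as in the text):

```python
import random

def draw_uncertainty(G, kind):
    """Return {node: p_i} for the UNIFORM, SURVEILLANCE and PROP-DEG models."""
    if kind == 'UNIFORM':
        return {v: 0.6 for v in G}
    if kind == 'SURVEILLANCE':
        return {v: random.choice([0.1, 0.5]) for v in G}
    if kind == 'PROP-DEG':
        dmax = max(d for _, d in G.degree())
        return {v: G.degree(v) / dmax for v in G}
    raise ValueError(kind)
```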

Parameters. For the IC model, $\beta_{u,v}$ is uniformly randomly chosen from $\{0.1, 0.5, 0.9\}$ [22]. For the SIR model, we use the normalized contact time as the propagation probability $\beta_{u,v}$, and set a uniform curing probability $\rho = 0.6$. We uniformly randomly pick 5% of the nodes as infected nodes. For Sample-Cas, we set the number of samples $l = 200$. For robustness, each data point we show is the mean of 1000 runs of the diffusion/epidemiological model.

Baselines. We compare our algorithms against various intuitive and non-trivial competitors to better judge their performance. Recall that $I_0$ is the infected node set ($I_0 = \{u \mid u \in V, p_u = 1\}$). Let us denote $W = V - I_0$ (so $W$ is the set of nodes that are not certainly infected at the start).

1. Optimal: a brute-force algorithm that tries all combinations of possible solutions. As it is very slow, we only run it on a very small graph (KARATE).
2. Random: uniformly randomly select k nodes from W.
3. Degree: choose the top-k nodes from W according to their weighted degree.
4. PageRank: pick the top-k healthy nodes from W with the highest PageRank. We use a restart probability of 0.15.
5. Per-PRank: we first merge all infected nodes into one supernode as the preferred node, and then choose the top-k nodes from W with the highest personalized PageRank with respect to the supernode [70]. We use a restart probability of 0.15.
6. Dava-fast: a fast immunization algorithm [188], which aims to control the epidemic in the presence of already infected nodes (without uncertainty in the data). We apply Dava-fast as if every node from W in G were healthy, and take the top-k nodes from W according to the algorithm.

4.4.2 Results

In short, we demonstrate that Sample-Cas and Expect-Max outperform the other baselines on all datasets. Sample-Cas provides very accurate results, but does not scale to large networks, while Expect-Max is fast, scalable and effective. We also show the behaviors of Expect-Dom and Expect-Eig as $\alpha$ varies.

4.4.2.1 Accuracy of Sample-Cas

First of all, we compare Sample-Cas with Optimal on KARATE to demonstrate its accuracy (because Optimal is too slow, we chose KARATE so that we can run Optimal completely).


As Figure 4.1(a) shows, for all uncertainty models, Sample-Cas saves at least 90% of the nodes saved by Optimal, no matter how $k$ changes. We also found that, as expected, Sample-Cas's performance improves as the number of samples increases (not shown here).

4.4.2.2 Justification of Expect-Max

We compared Expect-Dom with Expect-Eig as $\alpha$ changes on multiple datasets under the three uncertainty models (see Figure 4.1(b) and (c)). For all networks, as expected from our discussion in Section 4.3.2.4, as $\alpha$ increases, Expect-Eig clearly becomes better while Expect-Dom becomes worse. In addition, there does exist a 'cross-over point' for each network where the algorithms switch in performance ($R = 1$ in Figure 4.1(b) and (c)). However, this cross-over point differs across networks and distributions, which is the reason why we propose the Expect-Max algorithm (as we do not know exactly when we should use either Expect-Dom or Expect-Eig as $\alpha$ changes).

[Figure: six panels; expected number of healthy nodes vs. budget of vaccines k, comparing Sample-Cas, Expect-Max, Dava-fast, Degree, Random, PageRank, and Per-PRank.]

(a) OregonAS  (b) GNUTELLA  (c) PORTLAND
(d) STANFORD  (e) BRIGHTKITE  (f) MIAMI

Figure 4.2: Effectiveness (α = 0.5, UNIFORM). Expected number of healthy nodes after distributing vaccines vs. budget k. Higher is better. (a), (b), (d), (e): IC model; (c), (f): SIR model. Sample-Cas and Expect-Max outperform other baseline algorithms.


[Figure: three panels; expected number of healthy nodes vs. budget of vaccines k.]

(a) SURVEILLANCE (GNUTELLA)  (b) PROP-DEG (GNUTELLA)  (c) SURVEILLANCE (PORTLAND)

Figure 4.3: Effectiveness (α = 0.5). Expected number of healthy nodes after distributing vaccines vs. budget k. Higher is better. (a), (b): IC model; (c): SIR model. Sample-Cas and Expect-Max outperform other baseline algorithms.

4.4.2.3 Effectiveness of Sample-Cas and Expect-Max

Figure 4.2(a), (b), (d) and (e) show experimental results under the IC model for UNIFORM. In all networks, Sample-Cas and Expect-Max consistently outperform the other competitors. OregonAS contains only about 600 nodes, hence we varied $k$ up to 50. Due to the jellyfish-type structure of OregonAS, for lower $k$, most algorithms perform well by targeting the nodes in the core. However, for larger $k$, Sample-Cas provides the best solution, while Expect-Max outperforms the other competitors as well, getting solutions almost as good as Sample-Cas. For GNUTELLA, STANFORD and BRIGHTKITE (much larger than OregonAS), the difference of Sample-Cas and Expect-Max from the other algorithms is clearer: they save up to 2.5 times as many nodes as the other algorithms, yet Expect-Max took a fraction of the running time of Sample-Cas (see Table 4.4). Note that although Dava-fast uses information about the infected nodes, it does not perform well (especially on STANFORD) because it fails to take into account the uncertainty model.

We observe similar results under the SIR model on PORTLAND and MIAMI for UNIFORM (see Figure 4.2(c) and (f)). Since PORTLAND and MIAMI have more than 0.5 million nodes, Sample-Cas did not finish even in a day, and we do not show it on the plots. We notice that the larger $k$ becomes, the more Expect-Max outperforms the other competitors. When $k = 2000$, the difference of Expect-Max from the other algorithms is clearest: it saves more than 10,000 nodes over the second-best algorithm, Dava-fast.

For SURVEILLANCE and PROP-DEG, the results are the same: Sample-Cas and Expect-Max always outperform the other algorithms (see Figure 4.3). We do not show the plots for the other datasets and other values of $\alpha$ due to lack of space, but the results are similar: Sample-Cas and Expect-Max provide the best solutions.


Table 4.4: Running times (sec.) when k = 100 and l = 200 (α = 0.5). Runs were terminated when the running time t > 24 hours (shown by '-').

Dataset      Sample-Cas   Expect-Max
OregonAS     241.6        3.1
STANFORD     3401.7       45.2
GNUTELLA     4221.1       59.8
BRIGHTKITE   19072.0      371.5
PORTLAND     -            6930.2
MIAMI        -            9231.4

4.4.2.4 Scalability

Although both Sample-Cas and Expect-Max are polynomial time (in particular, Expect-Max is subquadratic in nodes and edges), we show some running-time results to evaluate scalability. Table 4.4 shows the running times of our algorithms under UNIFORM: Expect-Max is much faster than Sample-Cas. Expect-Max takes only 45 seconds on STANFORD while Sample-Cas takes about an hour. The larger the network, the larger Expect-Max's speed-up over Sample-Cas. On BRIGHTKITE with 60K nodes, Expect-Max is more than 50 times faster than Sample-Cas. Furthermore, on the largest network, MIAMI, Expect-Max takes about 2.5 hours to select 100 nodes while Sample-Cas did not finish even in one day. Hence, Expect-Max is scalable to large networks.

4.5 Conclusion

This chapter addresses the problem of distributing vaccines given uncertain data over large networks, with applications to cascade-like processes on networks in several areas. The main contributions are:

(a) Problem Formulation: Motivated by multiple natural uncertainty models from social media and epidemiology, we first formulated the Uncertain Data-Aware Vaccination (UDAV) problem.

(b) Efficient Algorithms: Due to its computational complexity, we presented two main novel algorithms: Sample-Cas and Expect-Max. Sample-Cas is an accurate stochastic algorithm under the SAA framework, while Expect-Max is a fast hybrid algorithm with sub-quadratic time complexity, which utilizes the expected graph and two complementary methods to estimate benefits.

(c) Extensive Experiments: Experimental results demonstrate that our algorithms outperform several other baseline algorithms on multiple diverse real datasets (from social, cyber, and epidemiological domains) over multiple different uncertainty models.

Future work can include extending our results to other models in epidemiology such as SIS (where nodes can get infected multiple times), and generalizing Expect-Max to any uncertainty distribution (not just factorizable distributions).


Chapter 5

Group Immunization

Infectious diseases account for a large fraction of deaths worldwide. The main public-health response to containing epidemic outbreaks is vaccination and social distancing, e.g., [35, 104]. These interventions have resource constraints (e.g., the limited supply of vaccines and the high cost of social distancing), and therefore designing optimal control strategies is an active area of research in public-health policy planning, e.g., [25, 104, 132, 145, 155, 166]. In Chapters 3 and 4, we studied the immunization problem at the node level under several realistic settings.

However, optimal strategies based on node-level characteristics, such as degree or spectral properties [132, 166], cannot be easily turned into implementable policies, because such targeted immunization of specific individuals raises significant social and moral issues. As a result, vaccination policies, such as those specified by the CDC, are at the level of groups (e.g., based on demographics), and almost all the efforts in epidemiology are focused on developing group-level strategies, even though this may lead to sub-optimal solutions compared to individual-level policies. For instance, Medlock et al. [104] develop an optimal vaccine allocation for different age groups. However, all prior work on optimal group-level immunization has focused on differential-equation-based models, and the problem has not been studied on network models of epidemic spread. Implementing such interventions is challenging because people "comply" with them based on their individual utility. We model such limited compliance by random allocation within each group, which motivates this chapter. We study both vaccination problems (which can be modeled by node removal from the network) and social distancing (which can be modeled by edge removal).

Similar diffusion processes arise in other domains such as social media, e.g., the spread of spam rumors on Facebook, Twitter, LiveJournal or Friendster. These are also commonly modeled by models such as the Linear Threshold (LT) model [72]. Analogous to the public-health case, we can control such processes by 'immunization' via blocking users or preventing some interactions. Past work has studied individual-level immunization algorithms for the LT model [74]. However, it is more realistic to issue a warning bulletin on group pages, so that members within groups get the warning and stop disseminating rumors. Similarly, Twitter can warn a group of accounts to control the spread of malicious tweets. The same holds true for user groups in Friendster and LiveJournal.

In this chapter, we present a unified approach to studying strategies for controlling the spread of diffusion processes through group-level interventions, capturing both uncertainty and lack of control at high resolution within groups. The main contributions of this chapter are:

1. Problem Formulation: We develop group-level intervention problems in both the LT model and the SIS/SIR models, for which we consider a spectral-radius-based formulation. We consider arbitrarily specified groups, and interventions that involve both edge and node removal, modeling quarantining and vaccination, respectively. The interventions specify the number $x_i$ of nodes/edges that can be removed within each group $C_i$; however, these are chosen randomly within the group. These problems generalize the node-level problems and have not been studied before.

2. Effective Algorithms: We develop efficient theoretical and practical algorithms for the four problem classes we consider, including provable approximation algorithms (sdp and GroupGreedyWalk). We find that diverse kinds of techniques are needed for these problems: submodular function maximization on an integer lattice, semidefinite programming, quadratic programming, and the link between closed walks and spectral radius. Our algorithms also leverage prior techniques for analyzing contagion processes, e.g., [74, 131, 132, 148], but require non-trivial extensions.

3. Experimental Evaluation: We present extensive experiments on multiple real datasets, including epidemiological and social networks, and demonstrate that our algorithms outperform other competitors on node and edge deletion at group scale, for controlling infection as well as for spectral radius minimization.

This work has been published in ICDM 2015 [185] and TKDE 2016 [184]. Next, we first formulate our problems for two different settings, and then propose several efficient algorithms with different scalability and provable guarantees. Finally, we demonstrate the effectiveness of our methods via experiments and case studies.

5.1 Our Problem Formulations

Table 5.1 lists the main symbols we use throughout this chapter. Here we assume our graph $G(V,E)$ is directed and weighted. We refer to both node- and edge-level interventions as immunization.

Groups in a Graph. For a graph $G(V,E)$, we assume that the edge (node) set is partitioned into groups $C = \{C_1, \ldots, C_n\}$ for the edge (node) immunization problems. For a node partition, $C$ can consist of groups based on communities, locations, demographics, etc.


Table 5.1: Terms and Symbols

Symbol          Definition and Description
G(V,E)          graph G with node set V and edge set E
C               set containing groups
A               set of initial infected nodes
n               the number of groups in the graph
m               budget (the number of vaccines)
p_uv            weight on edge e(u, v)
g(v)            group index of node v, i.e., g(v) = i if v ∈ C_i
g(u, v)         group index of edge (u, v), i.e., g(u, v) = i if (u, v) ∈ C_i
x               vaccine allocation vector (x_1, · · · , x_n) for edges/nodes
σ_{C,A}(x)      the expected number of infected nodes at the end when x is allocated to edges
σ'_{C,A}(x)     the expected number of infected nodes at the end when x is allocated to nodes
e_k             vector with e_k = 1 and e_i = 0 for i ≠ k
M_E(x)          E[M(x)]
∆_E(x)          maximum expected degree of G(x)
λ_E(x)          expected spectral radius of M(x)
λ(M_E(x))       spectral radius of the expected matrix M_E(x)
λ_E^min         minimum expected spectral radius over all M(x), i.e., min_x λ_E(x)
x^min           the allocation vector which minimizes λ(M_E(x)) over all x, i.e., arg min_x λ(M_E(x))
s               number of samples in GroupGreedyWalk

Edge groups can be induced from node groups. For example, for an edge $e = (u,v)$, if $u$ and $v$ belong to the same group $C_t$, then $e \in C_t$; otherwise $e$ belongs to the group $C_{ij} = \{e_{uv} \mid u \in C_i, v \in C_j\}$. The edge groups we define ensure that every edge $e$ has a group even if the endpoints of $e$ belong to different node groups. Note that we assume there are no overlaps among groups.

Allocating Vaccines to Groups. We define $x = (x_1, \ldots, x_n)$ as the vaccine allocation vector, i.e., if we give $x_i$ vaccines to group $C_i$, then $x_i$ edges (nodes) will be uniformly randomly removed from $C_i$, which means those edges/nodes will not be involved in the diffusion process. The objective of our immunization problem is to find an allocation that controls the diffusion process most effectively.
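Operationally, an allocation x is realized by uniform random removal within each group; a minimal sketch (groups represented as lists of node or edge identifiers):

```python
import random

def realize_allocation(groups, x):
    """Sample one 'possible world' of a group-level allocation: remove
    x_i items uniformly at random from each group C_i."""
    removed = set()
    for C_i, x_i in zip(groups, x):
        removed.update(random.sample(C_i, x_i))
    return removed
```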

Similar to individual-based immunization, which tries to find key edges/nodes that can 'cut' a graph into many disjoint components, our goal is to find groups with a high ratio of edges/nodes that can cut a graph into disjoint pieces. In addition, a good solution also depends on the objective function. For example, for node immunization, a good solution may contain groups with more nodes that connect to different other groups.

Main Idea of our Problem Definitions. In the next two subsections we give two different sets of problems, which cover a wide range of contagion-like processes, both threshold-based and cascade-style. In addition, all our problems have been carefully formulated to be seamless generalizations of the corresponding individual-level problems.

5.1.1 Problem Definition under LT model

Our first set of problems is based on the LT model, a well-known model for social media and complex propagations [72] suited for representing 'threshold' behaviors for activation. As mentioned in the introduction, the vaccination problem here can help to control processes like spam and rumors on Twitter and Facebook [74]. Under the LT model, our goal is to minimize the expected number of infected nodes at the end of the diffusion by removing edges/nodes within groups.

In the LT model, a node $v$ can be influenced by each neighbor $u$ according to a weight $p_{uv}$, where $\sum_{e(u,v) \in E} p_{uv} \le 1$. The diffusion process proceeds as follows: at the start, every node $u$ uniformly at random chooses a threshold $\theta_u$ from the range $[0,1]$, which represents the weighted fraction of $u$'s neighbors that must be active in order to activate $u$; an inactive node $u$ becomes active at time $t+1$ if $\sum_{w \in N_u^t} p_{wu} \ge \theta_u$, where $N_u^t$ is the set of active neighbors of $u$ at time $t$; all active nodes stay active. The process stops when no additional node becomes active. Each group may have some seeds (initial infected nodes). The seeds spread information/viruses according to the LT model.
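A single run of this process is easy to simulate; a sketch (assuming a directed networkx graph whose edges carry the influence weights as an attribute 'p'):

```python
import random

def lt_spread(G, seeds):
    """One LT diffusion: draw each threshold uniformly from [0, 1], then
    activate any inactive node whose active in-neighbor weight reaches
    its threshold, until no node changes."""
    theta = {v: random.random() for v in G}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in G:
            if v in active:
                continue
            w = sum(G[u][v]['p'] for u in G.predecessors(v) if u in active)
            if w >= theta[v]:
                active.add(v)
                changed = True
    return active
```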

For edge deletion under the LT model, let $\sigma_{C,A}(x)$ ($\mathbb{Z}^n \to \mathbb{R}$) denote the expected number of infected nodes in $G$ (the footprint of $G$), given seed set $A$ and vaccine allocation vector $x$ for the group set $C$.

Now we are ready to define the edge version of the problem under the LT model.

Problem 1: Group Immunization under LT model (edge version):

Given: Graph $G(V,E)$, a partition of the edge set $C = \{C_1, \ldots, C_n\}$, seed set $A$ and $m$ vaccines (budget). Let $x$ be the edge vaccine allocation vector.

Find: The optimum allocation $x_{opt}$ which maximizes $f(x) = \sigma_{C,A}(0) - \sigma_{C,A}(x)$ s.t. $|x| \le m$.

Next, we define the node version of this problem. Let $\sigma'_{C,A}(x)$ denote the footprint of $G$. It is the same as $\sigma_{C,A}(x)$ except that the allocation vector $x$ corresponds to node vaccination.

Problem 2: Group Immunization under LT model (node version):

Given: Graph $G(V,E)$, a partition of the vertex set $C = \{C_1, \ldots, C_n\}$, seed set $A$ and $m$ vaccines (budget). Let $x$ be the node vaccine allocation vector.

Find: The optimum allocation $x_{opt}$ which maximizes $f'(x) = \sigma'_{C,A}(0) - \sigma'_{C,A}(x)$ s.t. $|x| \le m$.


Hardness of our problems: Problems 1 and 2 are NP-hard, as their special cases, the individual-level immunizations (when each edge/node is its own group), are NP-hard themselves [74, 186].

5.1.2 Problem Definition for spectral radius

Our second set of problems is based on the spectral radius formulation [165, 166], following the fundamental SIR and SIS models, which contain the popular IC model [72] as a special case.

Spectral radius, denoted by $\lambda$, refers to the largest eigenvalue of the adjacency matrix of a graph $G$. Recent results [131] have shown that $\lambda$ is connected to the reproduction number in epidemiology, and determines the phase transition ('epidemic threshold' $\tau$) between epidemic/non-epidemic regimes in a very large range of cascade-style models [131], including SIR ('mumps-like', which generalizes the IC model), SIS ('flu-like'), SEIS (with incubation period) and so on. As shown in [131], $\tau \propto \frac{1}{\lambda}$, and if $\lambda < 1$, a virus cannot spread over all of $G$ quickly, irrespective of initial conditions. In other words, if we can minimize $\lambda$ as much as possible, an epidemic will be quickly extinguished.

Tong et al. [165, 166] proposed effective node-based and edge-based individual immunization methods to minimize $\lambda$. Following their methodology, in this chapter we aim to maximize the drop in the spectral radius of $G$, $\Delta\lambda$, when vaccines are allocated to groups. As in Problems 1 and 2, when $x_i$ vaccines are given to group $C_i$, we uniformly remove $x_i$ nodes/edges at random. Hence, we want to find the optimal allocation $x$ such that the expectation of $\Delta\lambda$, $E[\Delta\lambda](x)$, is maximum. Note that we do not define the problems here based on the 'footprint' (as in the previous section for LT) for primarily two reasons: (a) these versions naturally generalize the corresponding individual-level immunization problems studied in past literature [165, 166]; and (b) due to the epidemic-threshold results, using the spectral radius allows us to immediately formulate a general problem for multiple cascade-style models (like SIR/SIS/IC), each with differences in their exact spreading process which we can ignore. Formally, our problems are:

Problem 3: Group Immunization for spectral radius (edge version)

Given: Graph $G(V,E)$, a partition of the edge set $C = \{C_1, \ldots, C_n\}$, and $m$ vaccines (budget). Let $x$ be the edge vaccine allocation vector, and let $E[\Delta\lambda](x)$ denote the expected drop in the spectral radius after the immunization.

Find: The optimum allocation $x_{opt}$ which maximizes $E[\Delta\lambda]$, i.e., $x_{opt} = \arg\max_x E[\Delta\lambda](x)$ s.t. $|x| \le m$.

Problem 4: Group Immunization for spectral radius (node version)

Given: Graph $G(V,E)$, a partition of the node set $C = \{C_1, \ldots, C_n\}$, and $m$ vaccines (budget). Let $x$ be the node vaccine allocation vector, and let $E[\Delta\lambda](x)$ denote the expected drop in the spectral radius after the immunization.

Find: The optimum allocation $x_{opt}$ which maximizes $E[\Delta\lambda]$, i.e., $x_{opt} = \arg\max_x E[\Delta\lambda](x)$ s.t. $|x| \le m$.

Hardness of our problems: Problems 3 and 4 are NP-hard too, since their special cases, the individual-level immunizations, are NP-hard [165, 166].

5.2 Proposed Methods

We first discuss our algorithms for the Group Immunization problem under the LT model (Sections 5.2.1 and 5.2.2 for Problems 1 and 2), and then the spectral radius versions (Sections 5.2.3 and 5.2.4 for Problems 3 and 4).

5.2.1 Edge Deletion under LT model

Recall that the function $f(x)$ we are trying to optimize in Problem 1 is defined over an integer lattice, and is not a simple set function. Hence, we cannot simply apply the greedy algorithm proposed by Khalil et al. [74], as it finds a set of edges for individual-based immunization. Instead, our approach is to carefully identify a submodularity-like condition that is satisfied by our function $f(x)$, for which a greedy algorithm gives good performance. Let $e_k$ be the vector with 1 at the $k$th index, and $0$ be the all-zeros vector. We consider the following three properties.

(P1) f(x) ≥ 0 and f(0) = 0.

(P2) (Non-decreasing) f(x) ≤ f(x + ek) for any k.

(P3) (Diminishing returns) For any $x' \ge x$ and $k$, we have $f(x + e_k) - f(x) \ge f(x' + e_k) - f(x')$.

The notion of submodularity of set functions has been extended to functions over integer lattices; see, e.g., [157], who show that a greedy algorithm gives a constant-factor approximation to submodular lattice functions with budget constraints. We note that in the context of functions defined on an integer lattice, unlike in the case of set functions, submodularity need not be equivalent to the diminishing-returns property. Besides, there are multiple non-equivalent definitions of the diminishing-returns property, as observed in [157]. We show below in Theorem 5.1 that a greedy algorithm gives a $(1 - 1/e)$-factor approximation to a function $f(x)$ satisfying the properties (P1), (P2) and (P3) above. It is not clear whether the analysis of [157] implies a similar bound for the kind of functions $f(x)$ we need to consider here.

Lemma 5.1. Suppose $y = (y_1, \ldots, y_n)^T$ where $\sum_j y_j = m$; then $f(x + y) - f(x) \le \sum_j y_j\,(f(x + e_j) - f(x))$.


Algorithm 5.1 Greedy algorithm

Require: f, budget m
1: x = 0
2: for j = 1 to m do
3:   i = arg max_{k=1,...,n} f(x + e_k) − f(x)
4:   x = x + e_i
5: end for
6: return x

Proof. Note that $y$ can be obtained recursively from a sequence $e^{(1)}, \ldots, e^{(m)}$ ($e^{(i)} \in \{e_1, \ldots, e_n\}$) such that $y = y^{(m)} = y^{(m-1)} + e^{(m)}$, $y^{(i)} = y^{(i-1)} + e^{(i)}$ ($i \le m$) and $y^{(0)} = 0$. Obviously, $\sum_{i=1}^{m} e^{(i)} = \sum_j y_j e_j = y$. Then,
\begin{align*}
f(x + y) - f(x) &= \sum_{i=1}^{m} f(x + y^{(i)}) - f(x + y^{(i-1)}) \\
&= \sum_{i=1}^{m} f(x + y^{(i-1)} + e^{(i)}) - f(x + y^{(i-1)}) \\
&\le \sum_{i=1}^{m} f(x + e^{(i)}) - f(x) \quad \text{(diminishing returns)} \\
&= \sum_{j=1}^{n} y_j \left(f(x + e_j) - f(x)\right)
\end{align*}

Theorem 5.1. Suppose $f(x)$, $x \in \mathbb{Z}^n$, satisfies the properties (P1), (P2) and (P3) above. Then, Algorithm 5.1 gives a $(1 - 1/e)$-approximate solution to the problem of maximizing $f(x)$ subject to $\sum_i x_i \le m$.

Proof. Suppose $x$ is the solution from the greedy algorithm, and $x^*$ is the optimal solution. Hence, we have $\sum_j x_j = \sum_j x^*_j = m$. Since $\sigma_C(0)$ is constant, the greedy algorithm is equivalent to:
$$C^* = \arg\max_{C_i} f(x + e_i) - f(x).$$
Let us define $x^{(i)}$ as the solution obtained from the $i$-th iteration of the greedy algorithm, so $x = x^{(m)}$. And $x^*$ can be represented as $\sum_j x^*_j e_j$. We have


\begin{align*}
f(x^*) &\le f(x^* + x^{(i)}) \\
&= f(x^{(i)}) + \left(f(x^* + x^{(i)}) - f(x^{(i)})\right) \\
&\le f(x^{(i)}) + \sum_j x^*_j \left(f(x^{(i)} + e_j) - f(x^{(i)})\right) \quad \text{(Lemma 5.1)} \\
&\le f(x^{(i)}) + \sum_j x^*_j \left(f(x^{(i+1)}) - f(x^{(i)})\right) \quad \text{(greedy algorithm)} \\
&= f(x^{(i)}) + m\left(f(x^{(i+1)}) - f(x^{(i)})\right)
\end{align*}

Hence, $f(x^{(i+1)}) \ge \left(1 - \frac{1}{m}\right) f(x^{(i)}) + \frac{1}{m} f(x^*)$. Recursively, we get $f(x^{(i)}) \ge \left(1 - \left(1 - \frac{1}{m}\right)^i\right) f(x^*)$. Therefore, $f(x) = f(x^{(m)}) \ge \left(1 - \left(1 - \frac{1}{m}\right)^m\right) f(x^*) \ge (1 - 1/e)\, f(x^*)$.

Now, we will show that the objective function $f(x) = \sigma_{C,A}(0) - \sigma_{C,A}(x)$ for the edge deletion problem under the LT model satisfies the properties stated in Theorem 5.1. In the ensuing discussion, we assume without loss of generality that there is only one seed node. This is because, if there are multiple seed nodes, we can merge all of them into a single 'super' node (say $s$) in the following manner: for every vertex $v \in V \setminus A$, set $p_{sv} = \sum_{u \in N(v) \cap A} p_{uv}$, where $N(v)$ is the set of neighbors of $v$. We note that after this modification the edges between $v$ and its susceptible neighbors are unchanged, and at time 0, $\sum_{w \in N_v} p_{wv} = \sum_{w \in N(v) \cap A} p_{wv} = p_{sv}$. Hence, $\sigma_{C,A}(x) = \sigma_{C,s}(x)$. Henceforth, we will assume that there is only one seed node, and drop the subscript $A$ from $\sigma_{C,A}(x)$, denoting it by $\sigma_C(x)$.

Lemma 5.2. The function $f(x) = \sigma_C(0) - \sigma_C(x)$ satisfies the properties (P1), (P2) and (P3) above.

Proof. Property (P1) is trivially true because, when $x = 0$, by definition $f(0) = 0$, and since vaccination does not increase the number of infections, $\sigma_C(x) \le \sigma_C(0)$. For the rest of the proof, since $\sigma_C(0)$ is a constant, we only need to analyze $\sigma_C(x)$. Note that for any $x' \ge x$, we can find a sequence of vectors $(z_1, z_2, \ldots, z_l)$ for some $l$ such that $x = z_1$, $x' = z_l$ and $z_i = z_{i-1} + e_{k_{i-1}}$ for some index $k_{i-1}$. Therefore, it is enough to prove that Properties (P2) and (P3) hold for $x' = x + e_j$ for some index $j$. Also, we can assume that $x_j < |C_j|$ for $j = 1, \ldots, n$, for if this is not true for some $j$, it implies that all the edges in $C_j$ will be vaccinated, and therefore we can simply remove $C_j$ from the analysis and reduce the budget by $x_j$.

Let $\mathcal{R}(x) \subseteq 2^E$ be the collection of sets $R$ satisfying $|R \cap C_i| = x_i$. Following the equivalence between influence in the LT model and the directed percolation process [72], we have $\sigma_C(x) = \sum_{\mathcal{G}} \Pr[\mathcal{G}] \sum_{R \in \mathcal{R}(x)} \Pr[R]\, \gamma_C(\mathcal{G}, R)$, where the first sum is over all possible live-edge subgraphs $\mathcal{G}$ of $G$ in the percolation process, $\Pr[R]$ is the probability that the set $R$ is removed, and $\gamma_C(\mathcal{G}, R)$ is the expected number of infected nodes in $\mathcal{G}$ at the end of the LT process after the set $R$ is removed. This can be rewritten as $\sigma_C(x) = \sum_{\mathcal{G}} \Pr[\mathcal{G}]\, \sigma_C(\mathcal{G}, x)$, where $\Pr[\mathcal{G}]$ is the probability of sampling $\mathcal{G}$, and $\sigma_C(\mathcal{G}, x) = \sum_{R \in \mathcal{R}(x)} \Pr[R]\, \gamma_C(\mathcal{G}, R)$. Henceforth, we will abbreviate $\gamma_C(\mathcal{G}, R)$ as $\gamma(R)$.

We will show that $\sigma_C(\mathcal{G}, x)$ is non-increasing, i.e., $\sigma_C(\mathcal{G}, x) \ge \sigma_C(\mathcal{G}, x')$ where $x' = x + e_j$, thereby showing that $f(x)$ satisfies Property (P2). Since the number of nodes reachable from the seed node with $R$ removed is at least as many as with $R \cup \{e\}$ removed, for any $e \in C_j \setminus R$, we have $\gamma(R) \ge \gamma(R \cup \{e\})$. Therefore,
\begin{align*}
\sigma_C(\mathcal{G}, x') &= \sum_{R' \in \mathcal{R}(x')} \Pr[R']\, \gamma(R') \\
&= \sum_{R \in \mathcal{R}(x)} \sum_{e \in C_j \setminus R} \frac{1}{|C_j| - x_j} \Pr[R]\, \gamma(R \cup \{e\}) \\
&\le \sum_{R \in \mathcal{R}(x)} \sum_{e \in C_j \setminus R} \frac{1}{|C_j| - x_j} \Pr[R]\, \gamma(R) \\
&= \sum_{R \in \mathcal{R}(x)} \Pr[R]\, \gamma(R) = \sigma_C(\mathcal{G}, x)
\end{align*}

Finally, we will show that $\sigma_C(\mathcal{G}, x + e_k) - \sigma_C(\mathcal{G}, x) \le \sigma_C(\mathcal{G}, x' + e_k) - \sigma_C(\mathcal{G}, x')$. From the above discussion, this will imply that $f(x)$ satisfies Property (P3). Suppose $x' = x + e_j$; we have two cases to consider: (1) $e_k = e_j$; (2) $e_k \ne e_j$.

For $1 \le i \le n$, let $c_i = |C_i|$ and let $x_i$ denote the $i$th element of $x$.

First, we consider case (1) ($e_k = e_j$). For $R \in \mathcal{R}(x)$, $\Pr[R] = \prod_i \frac{1}{\binom{c_i}{x_i}} = \rho\, \frac{1}{\binom{c_k}{x_k}}$, where $\rho = \prod_{i \ne k} \frac{1}{\binom{c_i}{x_i}}$.

\begin{align*}
&\sigma_C(\mathcal{G}, x) - \sigma_C(\mathcal{G}, x + e_k) \\
&= \sum_{R \in \mathcal{R}(x)} \rho\, \frac{1}{\binom{c_k}{x_k}}\, \gamma(R) - \sum_{R' \in \mathcal{R}(x')} \rho\, \frac{1}{\binom{c_k}{x_k+1}}\, \gamma(R') \\
&= \rho \sum_{R \in \mathcal{R}(x)} \left[ \frac{1}{\binom{c_k}{x_k}}\, \gamma(R) - \frac{1}{x_k + 1} \sum_{e \in C_k \setminus R} \frac{1}{\binom{c_k}{x_k+1}}\, \gamma(R \cup \{e\}) \right].
\end{align*}

The factor $\frac{1}{x_k+1}$ is due to the fact that $R \cup \{e\}$ comes up in $(x_k + 1)$ combinations involving $R$ and $e$. This simplifies to

$$\sigma_C(\mathcal{G}, x) - \sigma_C(\mathcal{G}, x + e_k) = \rho\, \frac{x_k!\,(c_k - x_k - 1)!}{c_k!} \sum_{R \in \mathcal{R}(x)} \sum_{e \in C_k \setminus R} \left[\gamma(R) - \gamma(R \cup \{e\})\right]. \quad (5.1)$$


Similarly, we have
\begin{align*}
&\sigma_C(\mathcal{G}, x') - \sigma_C(\mathcal{G}, x' + e_k) \\
&= \rho\, \frac{(x_k + 1)!\,(c_k - x_k - 2)!}{c_k!} \sum_{R' \in \mathcal{R}(x')} \sum_{e \in C_k \setminus R'} \left[\gamma(R') - \gamma(R' \cup \{e\})\right] \\
&= \rho\, \frac{(x_k + 1)!\,(c_k - x_k - 2)!}{c_k!} \sum_{R \in \mathcal{R}(x)} \frac{1}{x_k + 1} \sum_{e' \in C_k \setminus R} \sum_{e \in C_k \setminus (R \cup \{e'\})} \left[\gamma(R \cup \{e'\}) - \gamma(R \cup \{e, e'\})\right].
\end{align*}

From [74, proof of Theorem 6], $\gamma(R) - \gamma(R \cup \{e\}) \ge \gamma(R \cup \{e'\}) - \gamma(R \cup \{e, e'\})$ (supermodularity). Therefore, $(c_k - x_k - 1) \sum_e \left[\gamma(R) - \gamma(R \cup \{e\})\right] \ge \sum_{e'} \sum_e \left[\gamma(R \cup \{e'\}) - \gamma(R \cup \{e, e'\})\right]$. Hence proved.

Now, we consider case (2). Let $\Pr[R] = \rho'\, \frac{1}{\binom{c_k}{x_k}}\, \frac{1}{\binom{c_j}{x_j}}$, where $\rho' = \prod_{i \ne j,k} \frac{1}{\binom{c_i}{x_i}}$. We can get $\sigma_C(\mathcal{G}, x) - \sigma_C(\mathcal{G}, x + e_k)$ from Eqn. (5.1). And

σC(G,x)− σC(G,x + ek) from eqn. (5.1). And

σC(G,x′)− σC(G,x′ + ek)

=ρ′xk!(ck − xk − 1)!( cj

(xj+1)

)ck!

∑R′∈R(x′)

∑e∈Ck\R′

[γ(R′)− γ(R′ ∪ {e})]

=ρ′(xj + 1)!(cj − xj − 1)!

cj !

xk!(ck − xk − 1)!

ck!

∑R

1

xj + 1∑ej∈Cj\R

∑e∈Ck\(R∪{ej})

[γ(R ∪ ej)− γ(R′ ∪ {e, ej})] .

Again from [74], $(c_j - x_j - 1) \sum_e \left[\gamma(R) - \gamma(R \cup \{e\})\right] \ge \sum_{e_j} \sum_e \left[\gamma(R \cup \{e_j\}) - \gamma(R \cup \{e_j, e\})\right]$. Hence proved.

Algorithm 5.1 provides a simple greedy algorithm. Different from the greedy algorithm in [74], to implement Algorithm 5.1 we need to handle the random nature of vaccine allocations when estimating $\sigma_C(x)$. To do this, we apply the Sample Average Approximation (SAA) framework. Let $L \subset \mathcal{R}(x)$ denote a sample set from the set of all possible allocations, so that $\sigma_C(x) \approx \hat{\sigma}_C(x) = \frac{1}{|L|} \sum_{R \in L} \gamma_C(R)$. Kempe et al. [72] show that $\gamma_C(R)$ can be estimated by sampling from the set of live-edge graphs. A live-edge graph $T$ is generated as follows: for each node $v \in V$, independently select at most one of its incoming edges, edge $(u,v)$ with probability $p_{uv}$, and with probability $1 - \sum_{u:(u,v) \in E} p_{uv}$ no edge is selected. Let this sample set be denoted by $M$. This approach takes $O(|M||L|(|E| + |V|))$ time to estimate $\sigma_C(x)$, and $O(mn|M||L|(|E| + |V|))$ for the full greedy algorithm, which is not practical for large networks. However, we can speed up this naive greedy algorithm.
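Sampling one live-edge graph is straightforward; a sketch (a directed networkx graph with the edge-weight attribute 'p' as above):

```python
import random

def sample_live_edge_graph(G):
    """For each node v, keep at most one incoming edge: edge (u, v) with
    probability p_uv, and no edge with probability 1 - sum_u p_uv.
    Restricted to nodes reachable from the seed, the result is a tree."""
    parent = {}
    for v in G:
        r, acc = random.random(), 0.0
        for u in G.predecessors(v):
            acc += G[u][v]['p']
            if r < acc:
                parent[v] = u
                break
    return parent
```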

Speed-up of the Greedy Algorithm: Greedy-LT. Since a live-edge graph sampled from $M$ is a tree, we can denote it as $T_X^s$, where $s$ is the root, and $r(u, T_X^s) = |\{v \mid v \in \text{subtree}(u)\}|$, i.e., the number of nodes under the subtree of $u$ in $T_X^s$. Greedy-LT is summarized in Algorithm 5.2. It first merges all seeds into a 'supernode' $s$ and samples $|M|$ live-edge graphs, and then computes $r(u, T_X^s)$ in parallel for all nodes in all the live-edge graphs (Lines 1-3). After that we greedily select the $m$ vaccines (Lines 4-10): we initially set the allocation vector $x = 0$, and in each iteration, for each group $C_i$, we calculate the marginal loss $\Delta_{C_i,s}(x + e_i) = \sum_{e(u,v) \in T_X^s} r(s, T_X^s) - r(s, T_X^s \setminus e)$, i.e., we randomly pick one edge from each group for each live-edge graph, then sum the marginal losses over the $T_X^s$ as $C_i$'s marginal loss. Note that $r(s, T_X^s) - r(s, T_X^s \setminus e) = r(v, T_X^s) + 1$, where node $v$ is the endpoint of $e$ [74]. We pick the group $C^*$ with the maximum marginal loss. Finally, we remove the edge that has been picked, and update $r(u, T_X^s)$ in parallel (Lines 11-13). There are two cases when updating $T_X^s$ if $e(u,v) \in T_X^s$: (1) $v$'s children can be removed because they are no longer reachable from $s$; (2) for any ancestor $a$ of $v$, $r(a, T_X^s \setminus e) = r(a, T_X^s) - r(v, T_X^s) - 1$, which can be done in constant time. Following Theorem 5.1, Greedy-LT is a $(1 - 1/e - \varepsilon)$-approximation algorithm, where $\varepsilon$ is the approximation factor for estimating $\sigma_C(x)$.

Algorithm 5.2 Greedy-LT

Require: Graph G, group set C, seed set A, and budget m
1: Merge seed set A to I
2: Sample live-edge graphs M = {T_{X_1}^I, . . . , T_{X_{|M|}}^I}
3: For each T_X^I, calculate r(u, T_X^I) for all nodes (in parallel)
4: Set x = 0
5: for j = 1 to m do
6:   for each T_X^I and C_i do
7:     pick an edge e_X^{C_i} at random for C_i and T_X^I
8:   end for
9:   C* = arg max_{C_i} Σ_{e_X^{C_i} ∈ T_X^I} (r(I, T_X^I) − r(I, T_X^I \ e_X^{C_i}))
10:  x_{C*} = x_{C*} + 1
11:  for each T_X^I do
12:    if e_X^{C*}(u, v) ∈ T_X^I, remove edge e_X^{C*} and update r(n, T_X^I) for each node n (in parallel)
13:  end for
14: end for
15: return x

Running Time of Greedy-LT. Calculating all $r(u, T_X^I)$ costs $O(|M||V|)$ time, since we can traverse each $T_X^I$ once to get all values of $r(u, T_X^I)$. Greedily choosing the $m$ vaccine allocations needs $O(mn|M||V|)$ time. Hence, the serial version of Greedy-LT costs $O(mn|M||V|)$. Note that in practice, we can speed it up by computing and updating $r(u, T_X^I)$ in parallel. In addition, since $T_X^I$ is a tree, the increasing-difference property still holds, hence we can accelerate Greedy-LT by "lazy evaluation" [89] as well.


5.2.2 Node Deletion under LT model

Our algorithm for the node version of the Group Immunization problem is also the greedy Algorithm 5.1, as in the edge version in Section 5.2.1. Without loss of generality, we again assume that all seed nodes in $A$ are merged, and drop the subscript $A$ from $\sigma'_{C,A}(x)$, denoting it by $\sigma'_C(x)$. Our analysis relies on proving that the function $f'(x) = \sigma'_C(0) - \sigma'_C(x)$ in Problem 2 satisfies the properties (P1), (P2) and (P3) from Section 5.2.1, as discussed below.

Lemma 5.3. The function $f'(x) = \sigma'_C(0) - \sigma'_C(x)$ satisfies the properties (P1), (P2) and (P3).

Proof. Our proof follows along the same lines as the proof of Lemma 5.2. As a first step, we need to prove that $\gamma'_C(R)$ is monotone non-increasing and supermodular, where $\gamma'_C(R)$ is the expected number of infected nodes at the end of the LT process after $R$ is removed. We show that this follows from the corresponding properties of the function $\gamma_C(\cdot)$, as used in Lemma 5.2.

First, for any node set $S \subset V$, we define the edge set $E_S = \{e(i,j) \mid i \in S \vee j \in S\}$. Let $S, T \subset V$ with $S \subseteq T$. Then $E_S \subseteq E_T$, so that $\gamma'_C(S) = \gamma_C(E_S) \ge \gamma_C(E_T) = \gamma'_C(T)$. This proves $\gamma'_C(\cdot)$ is monotonically non-increasing.

Next, we show that $\gamma'_C(R)$ is supermodular. We first prove below that $E_{S \cap T} \subseteq E_S \cap E_T$ and $E_{S \cup T} \subseteq E_S \cup E_T$. For the former, observe that for any edge $(i,j) \in E_{S \cap T}$, we must have either $i \in S \cap T$ or $j \in S \cap T$. From the definition of $E_S$ and $E_T$, it follows that $(i,j) \in E_S \cap E_T$, so that $E_{S \cap T} \subseteq E_S \cap E_T$. Similarly, if $(i,j) \in E_{S \cup T}$, we have either $i \in S \cup T$ or $j \in S \cup T$. This implies that $(i,j) \in E_S \cup E_T$.

Hence, $\gamma_C(E_S \cap E_T) \le \gamma_C(E_{S \cap T})$ and $\gamma_C(E_S \cup E_T) \le \gamma_C(E_{S \cup T})$. According to the supermodularity of $\gamma_C(\cdot)$, $\gamma_C(E_S) + \gamma_C(E_T) \le \gamma_C(E_S \cup E_T) + \gamma_C(E_S \cap E_T)$; therefore, $\gamma'_C(S) + \gamma'_C(T) = \gamma_C(E_S) + \gamma_C(E_T) \le \gamma_C(E_S \cup E_T) + \gamma_C(E_S \cap E_T) \le \gamma_C(E_{S \cup T}) + \gamma_C(E_{S \cap T}) = \gamma'_C(S \cup T) + \gamma'_C(S \cap T)$, which proves $\gamma'_C(\cdot)$ is supermodular.

Following the equivalence between the LT model and the directed percolation process [72], we have $\sigma'_C(x) = \sum_{\mathcal{G}} \Pr[\mathcal{G}] \sum_{R \in \mathcal{R}(x)} \Pr[R]\, \gamma'_C(\mathcal{G}, R)$, where the first sum is over all possible live-edge subgraphs $\mathcal{G}$ of $G$ in the percolation process, and $\gamma'_C(\mathcal{G}, R)$ is the expected number of infected nodes in $\mathcal{G}$ at the end of the LT process after the set $R$ of nodes is removed. The proof then follows as in Lemma 5.2, using the supermodularity of $\gamma'_C(\cdot)$.

Lemma 5.3 suggests that Theorem 5.1 holds for the node version as well: the GREEDY algorithm provides a (1 − 1/e)-approximate solution. We extend Greedy-LT (Algorithm 5.2) to the node version: instead of randomly picking edges (Line 7), we randomly pick nodes to calculate the marginal loss (Line 9), and remove the corresponding nodes (Line 12). The observation is that the marginal loss of removing node v in C can again be calculated in constant time, i.e., r(I, T^I_X) − r(I, T^I_X \ v) = r(v, T^I_X) + 1. Hence, the updating process is the same as in the edge version of Greedy-LT.


5.2.3 Edge Deletion for Spectral Radius

We propose three algorithms for Problem 3 (edge immunization based on spectral radius) with different trade-offs between quality and running time: the first one, sdp, is a constant-factor approximation algorithm that minimizes the actual eigendrop; the second, GroupGreedyWalk, is a bicriteria approximation algorithm based on hitting closed walks; the third, lp, is a Linear Programming (LP) based method which uses an estimation of the eigendrop.

SDP is a constant-factor approximation algorithm, which gives us good results, but it is very slow, with a O(|V|^4 polylog(|V|)) time complexity. Hence, we develop GroupGreedyWalk, a bicriteria approximation algorithm based on hitting closed walks [148]. Though GroupGreedyWalk loses a little quality compared to sdp, it is faster, with a O(sm^2|V|^3) time complexity (where s is the number of samples, described later). However, it may still not be scalable to very large networks with millions of nodes. Therefore, we propose lp, a linear programming based heuristic whose time complexity depends only on the number of groups, not the graph size. In practice, the number of groups in a graph is typically much smaller than the number of nodes; hence lp is much faster than sdp and GroupGreedyWalk. Experimental results demonstrate that it is scalable to networks with millions of nodes, and provides competitive empirical performance (see Section 5.3).

Note that even though sdp and GroupGreedyWalk may not be scalable to very large networks, both have proven performance guarantees. In addition, they are not merely of theoretical interest: they can be used as baselines to assess the performance of faster heuristics on smaller networks.

Next, we introduce the sdp algorithm (Section 5.2.3.1), the GroupGreedyWalk algorithm (Section 5.2.3.2), and the lp heuristic (Section 5.2.3.3), respectively.

5.2.3.1 sdp: a constant factor approximation algorithm

Let G(V, E) be a graph whose edge set is partitioned into n groups C_1, · · · , C_n. Let x be the edge allocation vector. For an edge (u, v), let g(u, v) denote the index of the group to which (u, v) belongs. Let G(x) be the random graph obtained by removing each edge in C_i with probability p_i = x_i/c_i, where c_i = |C_i|. Let M(x) be its adjacency matrix and λ_E(x) = E[λ(M(x))] be the expected spectral radius:

(M(x))_{uv} = 1 with probability 1 − p_{g(u,v)} if (u, v) ∈ E(G), and 0 otherwise.        (5.2)

Let M_E(x) = E[M(x)] be the expectation of the adjacency matrix of G(x):

(M_E(x))_{uv} = 1 − p_{g(u,v)} if (u, v) ∈ E(G), and 0 otherwise.        (5.3)


The problem is to find the optimal allocation, i.e., the x for which λ_E(x) is minimized. We will denote this value by λ_E^min := min_x λ_E(x).

Remark 5.1. In the SDP formulation, for ease of analysis, we replace the hard budget constraint by an expected budget constraint, i.e., the expected size of the vaccine allocation vector x is m. This is not a problem since, in reality, the budget is sufficiently high (≫ log n). Hence, with high probability, the number of vaccines in the solution will be very close to the expected budget. Given this small difference, we can force the number of vaccines to be within the budget constraint, with very little effect on the performance.

The sdp formulation: finding the allocation x with minimum λ(M_E(x)). Note that (M_E(x))_{uv} = 1 − p_{g(u,v)} if (u, v) ∈ E(G). We use a simple SDP to find the allocation which minimizes λ(M_E(x)) and meets the budget constraint m:

minimize   t
subject to 0 ≤ p_i ≤ 1, for i = 1, . . . , n,
           Σ_i p_i c_i ≤ m,
           tI − M_E(x) ⪰ 0.        (5.4)

Let xmin denote the allocation vector corresponding to the solution of the SDP.
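For concreteness, SDP (5.4) can be handed to an off-the-shelf convex solver. Below is a minimal sketch using the cvxpy library; it assumes an undirected graph given as a symmetric 0/1 adjacency matrix A, and the inputs group_masks (one symmetric 0/1 edge-indicator matrix per group) and group_sizes are our own illustrative names, not code from this thesis.

```python
import cvxpy as cp
import numpy as np

def sdp_allocation(A, group_masks, group_sizes, m):
    """Solve SDP (5.4): minimize t subject to tI - M_E(x) being PSD and an
    expected budget of m vaccines. p_i = x_i / c_i is the per-group removal
    probability, so the expected edge weight on group C_i is 1 - p_i."""
    n = len(group_sizes)
    c = np.asarray(group_sizes, dtype=float)
    p = cp.Variable(n)
    t = cp.Variable()
    # Expected adjacency matrix: entry (u, v) equals 1 - p_{g(u,v)} on edges.
    ME = A - sum(p[i] * group_masks[i] for i in range(n))
    S = t * np.eye(A.shape[0]) - ME
    constraints = [p >= 0, p <= 1, c @ p <= m,
                   (S + S.T) / 2 >> 0]   # symmetrize for the PSD constraint
    cp.Problem(cp.Minimize(t), constraints).solve()
    return p.value * c                   # expected vaccines per group: x_i = p_i c_i
```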

Analysis: Relating λ_E^min to λ(M_E(x_min)). One can use the following result by Lu and Peng [93] to bound λ_E(x) with respect to λ(M_E(x)).

Theorem 5.2 ([93]). Consider an edge-independent random graph H. Let M(H) denote its adjacency matrix and M_E(H) = E[M(H)]. Let Δ_E(H) denote the maximum expected degree. If Δ_E(H) ≫ log^4 |V|, then, almost surely,

|λ_i(M(H)) − λ_i(M_E(H))| ≤ (2 + o(1)) √Δ_E(H),   for i = 1, . . . , |V|.

Recall that x_min is the output of SDP (5.4), and it corresponds to the allocation vector which minimizes λ(M_E(x)) over all x. Let Δ_E(x_min) denote the maximum expected degree of G(x_min). The following lemma proves that the sdp formulation gives us an approximation algorithm with additive error O(√Δ_E(x_min)).

Lemma 5.4. If x_min is such that Δ_E(x_min) ≫ log^4 |V|, then λ_E^min ≤ λ(M_E(x_min)) + (2 + o(1)) √Δ_E(x_min) + 1.

Proof. Let z = λ(M_E(x_min)) + (2 + o(1)) √Δ_E(x_min). Applying Theorem 5.2 to G(x_min), λ(M(x_min)) ≤ z almost surely. In fact, for Δ_E(x_min) ≫ log^4 |V|, it can be shown that Pr(λ(M(x_min)) ≥ z) ≤ 1/|V| (see [93, proof of Theorem 6]). Noting that λ(M(x_min)) ≤ λ(M),

λ_E(x_min) = E[λ(M(x_min))]
           ≤ Pr(λ(M(x_min)) ≤ z) · z + Pr(λ(M(x_min)) ≥ z) · λ(M)
           ≤ 1 · z + (1/|V|) · λ(M) < z + 1.

By definition, λ_E^min ≤ λ_E(x_min). Therefore, λ_E^min ≤ λ_E(x_min) ≤ z + 1. Hence, proved.

Running time. The SDP step (Eq. (5.4)) dominates the running time of this algorithm, which is O(|V|^4 polylog(|V|)).

5.2.3.2 GroupGreedyWalk: a bicriteria approximation algorithm

As shown above, sdp, with a O(|V|^4 polylog(|V|)) time complexity, is too slow for large networks. In this section, we leverage the technique of hitting closed walks [148] for the Group Immunization problem, and propose a bicriteria approximation algorithm called GroupGreedyWalk.

Saha et al. [148] studied the problem of reducing the spectral radius below a given threshold by removing the smallest number of edges, and developed a greedy approximation algorithm for it. Different from their work, our goal is to distribute a given budget of vaccines to groups so as to make the spectral radius as small as possible. We can adapt their greedy algorithm to group immunization by choosing groups with the maximum marginal gain of hitting closed walks. However, it is not immediately clear that this works, as we need to consider all "possible worlds" for group immunization.

In a graph G, a closed walk is a sequence of nodes starting and ending at the same node, with each two consecutive nodes adjacent to each other. A closed k-walk is a closed walk of length k. Let walks(e, G, k) denote the number of closed k-walks in G containing e = (i, j). We say that an edge set E′ hits a walk w if w contains an edge from E′. Recall that G(x) is a random graph obtained by removing a random subset of x_i edges in C_i, where C_1, . . . , C_n is a partition of the edge set E. Let W(G, k) be the set of all walks of length k in the graph G. Let n_k(G, e) denote the number of walks of length k in G that pass through edge e; similarly, let n_k(G, S) denote the number of walks of length k in G that pass through edges in the set S. Let n_k(G) = n_k(G, E) = |W(G, k)| denote the number of walks of length k in G. Here we focus on walks of a fixed length k = Θ(log |V|). Note that for G(x), n_k(G(x)) is a random variable.

Algorithm 5.3 gives the pseudocode of our GroupGreedyWalk algorithm. We assume that CountWalks(G, x) returns the expected number of walks surviving in G(x).


Algorithm 5.3 GroupGreedyWalk (G, m)

Require: Graph G, group set C, and budget m
1: x = 0
2: for j = 1 to m do
3:   i = arg max_{k=1,...,n} (CountWalks(x) − CountWalks(x + e_k))
4:   x = x + e_i
5: end for
6: return x

The idea of GroupGreedyWalk is that, in each iteration, we allocate one vaccine to the group C_i with the maximum marginal gain in CountWalks(G, x).

Algorithm 5.3 follows the framework of the individual-level GreedyWalk algorithm [148]. Instead of picking edges, it chooses groups to maximize the marginal gain of eigendrop. However, the main challenge here is to show that GroupGreedyWalk is a provable approximation algorithm. Let x^opt(m) be the optimum solution corresponding to budget m, and let T = λ_1(G(x^opt(m))) (the spectral radius after the vaccine allocation of the optimum solution). We can prove the following theorem:

Theorem 5.3. Let x^opt(m) be the optimum solution corresponding to a budget of m edges removed. Let x^g be the allocation returned by GroupGreedyWalk(G, c_1 m log^2 |V|), for a constant c_1. Then we have λ_1(G(x^g)) ≤ c′T for a constant c′, where λ_1(G(x^g)) is the spectral radius after allocating vaccines based on x^g.

Remark 5.2. Theorem 5.3 shows that GroupGreedyWalk is a (c_1 log^2 |V|, c′)-bicriteria approximation algorithm. Different from the analysis of traditional approximation algorithms, in order to bound the result of GroupGreedyWalk w.r.t. the optimal solution, we need a larger budget c_1 m log^2 |V|. Typically, log^2 |V| is much smaller than the budget m. And when the budget m is very large, the marginal gain of eigendrop for the larger budget c_1 m log^2 |V| will tend to be very close to the marginal gain of eigendrop for the budget m. Hence, adding such a small factor to the budget m will have little effect on the performance.

We will use Lemmas 5.5 and 5.6 to prove this theorem. Intuitively, Lemma 5.5 shows that the expected spectral radius is upper-bounded by T (up to a constant) if the walk length is k = O(log |V|), while Lemma 5.6 shows that the expected number of walks of length k can be upper-bounded in terms of T as well.

Lemma 5.5. If E[n_k(G(x))] = O(|V| 2^k T^k) for k = O(log |V|), then E[λ_1(G(x))] ≤ c_3 T for a constant c_3.

Proof. Let G(x) be a distribution over graphs G_1, . . . , G_S. Since k is even, we have λ_1(G_j)^k ≤ Σ_i λ_i(G_j)^k ≤ k · n_k(G_j). Because E[n_k(G(x))] = Σ_{G_j} Pr[G_j] n_k(G_j) = O(|V| 2^k T^k), we have

E[λ_1(G(x))^k] = Σ_{G_j} Pr[G_j] λ_1(G_j)^k ≤ Σ_{G_j} Pr[G_j] k · n_k(G_j) = O(|V| 2^k k T^k).


Since the function h(x) = x^{1/k} is concave, using Jensen's inequality we have h(E[X]) ≥ E[h(X)] for a random variable X. This implies E[λ_1(G(x))] ≤ (|V| 2^k k T^k)^{1/k} ≤ c_3 T.

Lemma 5.6. Let x^opt(m) be the optimum allocation such that T = E[λ_1(G(x^opt(m)))]. Let y be defined as

y_i = x^opt_i if x^opt_i ≤ m_i/2, and y_i = m_i otherwise,

where m_i is the number of edges in group C_i. Then, we have E[n_k(G(y))] ≤ |V| 2^k T^k.

Proof. Let G_avg(y) denote the expected graph in which each edge e ∈ C_i has weight w_y(e) = 1 − y_i/m_i. Our proof is in two parts: (1) we show that λ_1(G_avg(y)) ≤ T, and then (2) we show that E[n_k(G(y))] can be upper-bounded in terms of the weights of walks in G_avg(y). In the discussion below, we will use A(G) to denote the adjacency matrix of graph G, and A(G)_{ij} to denote the entry (i, j) of the matrix.

For the first part of the proof, let z denote the first eigenvector of G_avg(y), i.e., z^T A(G_avg(y)) z = λ_1(G_avg(y)). Let G_1, . . . , G_S denote the graphs over which G(y) forms a distribution. We have

E[z^T A(G(y)) z] = Σ_{G′∈G(y)} Pr[G′] z^T A(G′) z
                = Σ_{G′∈G(y)} Pr[G′] Σ_{uv} A(G′)_{uv} z_u z_v
                = Σ_{uv} z_u z_v Σ_{G′∈G(y)} Pr[G′] A(G′)_{uv}
                = Σ_{uv} z_u z_v E[A(G(y))_{uv}].

From the definition of y, there might be some indices i such that y_i = m_i. Without loss of generality, we assume that there exists r′ ≤ r such that for all i ≤ r′ we have y_i < m_i, and for all i > r′, y_i = m_i. If there exist no indices i with y_i = m_i, we have r′ = r. If y_i = m_i, all edges in C_i are removed in G(y), so E[A(G(y))_{uv}] = 0 for all (u, v) ∈ C_i. Therefore,

E[z^T A(G(y)) z] = Σ_{i=1}^{r′} Σ_{(u,v)∈C_i} z_u z_v · (m_i − 1 choose y_i)/(m_i choose y_i)
                = Σ_{i=1}^{r′} Σ_{(u,v)∈C_i} z_u z_v · (m_i − y_i)/m_i
                = Σ_{i=1}^{r′} Σ_{(u,v)∈C_i} z_u z_v · w_y(u, v)
                = λ_1(G_avg(y)),


where the first equality follows because E[A(G(y))_{uv}] is the probability that edge (u, v) is not removed, which is (m_i − 1 choose y_i)/(m_i choose y_i) = (m_i − y_i)/m_i if (u, v) ∈ C_i.

Next, for the second part of the proof, we consider E[n_k(G(y))] = Σ_{w∈W(G,k)} Pr[w survives in G(y)].

Let S(w) be the set of edges in walk w, with n(e, w) denoting the number of times edge e is traversed by w. Also, let S_i = S(w) ∩ C_i, let s_i = |S_i|, and let ŝ_i = Σ_{e∈S_i} n(e, w). If there exists i such that s_i > 0 and s_i + y_i > m_i, it must be the case that Pr[w survives in G(y)] = 0. We therefore focus on walks w such that for each i with s_i > 0, s_i + y_i ≤ m_i. Then,

Pr[w survives in G(y)]
  = Π_i Pr[all edges in S_i survive in G(y)]   (removals in different groups are independent)
  = Π_i (m_i − s_i choose y_i)/(m_i choose y_i)
  ≤ Π_i ((m_i − y_i)/m_i)^{s_i}
  ≤ 2^k Π_i ((m_i − y_i)/m_i)^{ŝ_i} = 2^k Π_{e∈S(w)} w_y(e)^{n(e,w)},        (5.5)

where the last equality follows from the definition of w_y(e); the second inequality follows because (m_i − y_i)/m_i ≥ 1/2 and Σ_i ŝ_i = k; and the first inequality above follows because

(m_i − s_i choose y_i)/(m_i choose y_i) = [(m_i − y_i)(m_i − y_i − 1) · · · (m_i − y_i − s_i + 1)] / [m_i (m_i − 1) · · · (m_i − s_i + 1)] ≤ ((m_i − y_i)/m_i)^{s_i},

since (m_i − y_i − j)/(m_i − j) ≤ (m_i − y_i)/m_i.

The above discussion implies that for all walks w ∈ W(G, k), Pr[w survives in G(y)] ≤ 2^k Π_{e∈S(w)} w_y(e)^{n(e,w)}. This implies that

E[n_k(G(y))] = Σ_{w∈W(G,k)} Pr[w survives in G(y)]
            ≤ Σ_{w∈W(G,k)} 2^k Π_{e∈S(w)} w_y(e)^{n(e,w)}
            ≤ 2^k trace(A(G_avg(y))^k)
            = 2^k Σ_{j=1}^{|V|} λ_j(G_avg(y))^k
            ≤ 2^k |V| λ_1(G_avg(y))^k ≤ 2^k |V| T^k.


Now, we prove Theorem 5.3:

Proof of Theorem 5.3. Let g(x) denote the expected number of walks in W(G, k) hit by the edges that are removed in G(x). Then g(x) has the diminishing returns property, i.e., for x ≤ x′, we have g(x + e_i) − g(x) ≥ g(x′ + e_i) − g(x′). The proof of diminishing returns follows the proof of Lemma 5.2.

We will compare g(x^g) to g(y), where y is defined as

y_i = x^opt_i if x^opt_i ≤ m_i/2, and y_i = m_i otherwise,

where m_i is the number of edges in group C_i. Note that Σ_i y_i ≤ 2 Σ_i x^opt_i ≤ 2m.

Let x(i) denote the allocation vector after the ith iteration of GroupGreedyWalk. Since g(x) has the diminishing returns property, it follows, as in the proof of Theorem 5.1, that g(x(i)) ≥ (1 − (1 − 1/2m)^i) g(y). Therefore, for i = O(m log^2 |V|), we have

1 − (1 − 1/2m)^{O(m log^2 |V|)} ≥ 1 − (1/e)^{log^2 |V|} ≥ 1 − 1/|V|^{log |V|} ≥ 1 − 1/N,

where N is the total number of walks of length k in the original graph G.

From Lemma 5.6, we have E[n_k(G(y))] ≤ |V|^c T^k for a constant c (since 2^k = poly(|V|) for k = Θ(log |V|)). This implies g(y) ≥ N − |V|^c T^k. Therefore, g(x^g) ≥ (1 − 1/N)(N − |V|^c T^k) ≥ N − 1 − |V|^c T^k. This implies that n_k(G(x^g)) ≤ O(|V|^c T^k). From Lemma 5.5, it follows that λ_1(G(x^g)) ≤ c′T.

Implementation notes. Given the adjacency matrix A of G, the number of walks of length k − 1 from u to v is given by (A^{k−1})_{uv}; this also corresponds to the number of closed k-walks hit by the edge (v, u). We implement the algorithm as follows. In each iteration, we randomly sample a set of edges of G according to x. For each sample, we compute the expected decrease in the number of walks from the removal of one edge in group i (for computing the effect of allocation vector x + e_i) as follows: we construct G(x), compute A′ = A(G(x))^{k−1}, and take the average over all entries of A′ corresponding to edges (u, v) of group i. We perform this for each sample (the number of samples is s) and take the average over all samples. Finally, we choose the i which gives the maximum average and update x by adding e_i to it.
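A minimal sketch of this sampling scheme in Python (dense numpy matrices; the function name and the inputs groups and x are our own illustrative choices, not the experimental code):

```python
import numpy as np

def expected_walk_drop(A, x, groups, k, n_samples=20, rng=None):
    """For each group, estimate the expected decrease in the number of
    k-walks if one more edge of that group were removed, as described
    above: sample G(x), compute A(G(x))^(k-1), and average the entries
    corresponding to the group's surviving edges."""
    rng = rng or np.random.default_rng()
    gains = np.zeros(len(groups))
    for _ in range(n_samples):
        B = A.copy()
        for i, edges in enumerate(groups):            # sample G(x)
            for j in rng.choice(len(edges), size=int(x[i]), replace=False):
                u, v = edges[j]
                B[u, v] = 0
        P = np.linalg.matrix_power(B, k - 1)          # walk counts
        for i, edges in enumerate(groups):
            vals = [P[v, u] for (u, v) in edges if B[u, v] > 0]
            if vals:
                gains[i] += np.mean(vals)
    return gains / n_samples

# One greedy step of GroupGreedyWalk:
# i = int(np.argmax(expected_walk_drop(A, x, groups, k))); x[i] += 1
```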

Running time. For budget m, A^{m−1} can be computed in O(m|V|^3) time. For each sample of x, we compute A(G(x))^{m−1}. Note that computing the effect of removing e_i for each sample takes only O(|V|^2) time. Therefore, for a sample size of s, the algorithm overall takes O(sm^2|V|^3) time. If m = O(log |V|), the time complexity is O(s|V|^3 log^2 |V|).


5.2.3.3 lp: a fast heuristic

GroupGreedyWalk is a good approximation algorithm like sdp; however, it may not be scalable to very large networks with millions of nodes. In this section, we propose a much faster heuristic based on estimating the eigendrop.

The eigendrop when removing the edges in a set E_T can be approximated by φ(T) = Σ_{(i,j)∈E_T} M_{ij} u_i u_j, where Mu = λu and u = (u_1, . . . , u_i, . . . ) [165]. Given the allocation vector x, the expected drop in spectral radius is then given by

E[Δλ] ≈ φ(x) = Σ_{(i,j)∈E} M_{ij} u_i u_j Pr((i, j) is removed) = Σ_{a∈C} Σ_{(i,j)∈C_a} M_{ij} u_i u_j x_a.        (5.6)

If we define α_a = Σ_{(i,j)∈C_a} M_{ij} u_i u_j, then φ(x) = Σ_a α_a x_a. We want to maximize φ(x) subject to the budget constraints. This can be formulated as the linear program given below:

maximize   Σ_a α_a x_a
subject to Σ_a x_a |C_a| ≤ m,
           0 ≤ x_a ≤ 1.        (5.7)

Running time. The LP takes O(n^4) time, where n is the number of groups. Note that this is not a function of the graph size. Typically, the number of groups is small; hence this algorithm is very fast.
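As an illustration, LP (5.7) is small enough to hand to any LP solver. A minimal sketch using scipy.optimize.linprog (the helper name lp_allocation and its inputs are ours, not the thesis code); α_a is computed from the first eigenpair as defined above:

```python
import numpy as np
from scipy.optimize import linprog

def lp_allocation(alpha, group_sizes, m):
    """Solve LP (5.7): maximize sum_a alpha_a * x_a subject to
    sum_a x_a * |C_a| <= m and 0 <= x_a <= 1."""
    n = len(alpha)
    res = linprog(c=-np.asarray(alpha, dtype=float),   # linprog minimizes
                  A_ub=np.asarray(group_sizes, dtype=float).reshape(1, n),
                  b_ub=np.array([float(m)]),
                  bounds=[(0.0, 1.0)] * n,
                  method="highs")
    return res.x   # x_a: fraction of edges to remove in group C_a
```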

5.2.4 Node Deletion for Spectral Radius

Here, we propose an algorithm for solving Problem 4: the group node immunization problem with respect to eigendrop. It is based on the approximate eigendrop method which was discussed in Section 5.2.3. The eigendrop when removing the nodes in a set S can be approximated as follows [166]:

Δλ ≈ φ(S) = Σ_{j∈S} 2λ u_j^2 − Σ_{i,j∈S} M_{ij} u_i u_j,        (5.8)

where Mu = λu and u = (u_1, . . . , u_i, . . . ). Recall that C is the set of groups and x = (x_1, . . . , x_i, . . .) is the allocation vector, where x_i is the fraction of nodes vaccinated in group C_i. For the group vaccination problem, the expected eigendrop can be approximated by


applying (5.8) as follows:

E[Δλ] ≈ φ(x) = Σ_{j∈V} 2λ u_j^2 Pr(j is vaccinated) − Σ_{i,j∈V} M_{ij} u_i u_j Pr(i and j are vaccinated).        (5.9)

Let g(v) denote the index of the group to which v belongs, i.e., if v ∈ C_i, then g(v) = i. The probability that j is vaccinated is x_{g(j)}, and the probability that both i and j are vaccinated is

Pr(i and j are vaccinated) = x_{g(i)} x_{g(j)} if g(i) ≠ g(j), and x_{g(i)}^2 · |C_{g(i)}|/(|C_{g(i)}| − 1) otherwise.        (5.10)

Applying the above to (5.9),

φ(x) = Σ_a Σ_{j∈C_a} 2λ u_j^2 x_a − Σ_a Σ_{i,j∈C_a} M_{ij} u_i u_j x_a^2 · |C_a|/(|C_a| − 1) − Σ_{a≠b} Σ_{i∈C_a, j∈C_b} M_{ij} u_i u_j x_a x_b.

Observing that M_{ij}, u_i and λ are constants, and defining α_a = Σ_{j∈C_a} 2λ u_j^2, β_a = (|C_a|/(|C_a| − 1)) Σ_{i,j∈C_a} M_{ij} u_i u_j, and Γ_ab = Σ_{i∈C_a, j∈C_b} M_{ij} u_i u_j, we get

φ(x) = Σ_a α_a x_a − Σ_a β_a x_a^2 − Σ_{a≠b} Γ_ab x_a x_b.

Our aim is to find the x which maximizes φ(x). This can be formulated as a quadratic program:

minimize   Σ_a β_a x_a^2 + Σ_{a≠b} Γ_ab x_a x_b − Σ_a α_a x_a = (1/2) x^T Q x + c^T x
subject to Σ_a x_a |C_a| ≤ B,
           0 ≤ x_a ≤ 1,        (5.11)

where Q_aa = 2β_a, Q_ab = 2Γ_ab for a ≠ b, and c_a = −α_a. If Q is not positive semi-definite, the problem is NP-hard [149]. In that case, we use a low-rank matrix Q̃ formed from the eigenvectors of Q corresponding to its non-negative eigenvalues. The QP on Q̃ can be solved in polynomial time using the ellipsoid method [149].
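A minimal sketch of this projection-plus-QP step using numpy and cvxpy (the function and variable names are ours; the convex relaxation via the non-negative eigen-part Q̃ follows the description above):

```python
import numpy as np
import cvxpy as cp

def qp_allocation(alpha, beta, Gamma, group_sizes, budget):
    """Sketch of QP (5.11). Q_aa = 2*beta_a, Q_ab = 2*Gamma_ab (a != b),
    c_a = -alpha_a. If Q is indefinite, keep only the eigen-directions of
    Q with non-negative eigenvalues (the low-rank Q-tilde described above)."""
    n = len(alpha)
    Q = 2.0 * np.asarray(Gamma, dtype=float)
    np.fill_diagonal(Q, 2.0 * np.asarray(beta, dtype=float))
    w, U = np.linalg.eigh((Q + Q.T) / 2.0)            # ensure symmetry
    keep = w >= 0
    Q_tilde = (U[:, keep] * w[keep]) @ U[:, keep].T   # PSD by construction
    x = cp.Variable(n)
    objective = 0.5 * cp.quad_form(x, Q_tilde) - np.asarray(alpha, float) @ x
    constraints = [x >= 0, x <= 1,
                   np.asarray(group_sizes, dtype=float) @ x <= budget]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return x.value   # x_a: fraction of nodes vaccinated in group C_a
```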

Lemma 5.7. |φ(x_Q) − φ(x_{Q̃})| ≤ (n/2) · ‖Q − Q̃‖_F, where n is the number of groups in the graph.


Proof. Let x_Q and x_{Q̃} be the best allocation vectors corresponding to Q and Q̃, respectively. Let φ̃(x) = (1/2) x^T Q̃ x + c^T x. Then,

φ(x_{Q̃}) − φ(x_Q) = (−(1/2) x_{Q̃}^T Q x_{Q̃} − c^T x_{Q̃}) − (−(1/2) x_Q^T Q x_Q − c^T x_Q)
                  ≤ (−(1/2) x_{Q̃}^T Q x_{Q̃} − c^T x_{Q̃}) − (−(1/2) x_{Q̃}^T Q̃ x_{Q̃} − c^T x_{Q̃})
                  = (1/2) x_{Q̃}^T (Q̃ − Q) x_{Q̃}
                  ≤ (n/2) y^T (Q̃ − Q) y,

where y is some unit vector. The last inequality follows from the fact that x_{Q̃}^T x_{Q̃} ≤ n. Finally, we note that |y^T (Q − Q̃) y| ≤ ‖Q − Q̃‖_F. Hence, proved.

Running time. The QP takes O(n^4) time. Again, note that n is the number of groups; hence it is fast when the number of groups is small.

5.3 Experiments

We present a detailed experimental evaluation now.

5.3.1 Experimental Setup

We implemented the algorithms in Python, and conducted the experiments on a machine with 4 Xeon E7-4850 CPUs and 512GB of 1066MHz main memory.

Table 5.2: Datasets

Dataset   | Num. of nodes | Num. of edges | Num. of groups
SBM       | 1500          | 5000          | 20
Protein   | 2361          | 7182          | 13
OregonAS  | 10670         | 22002         | 100
YouTube   | 50K           | 450K          | 5000
PORTLAND  | 0.5 million   | 1.6 million   | 91
MIAMI     | 0.6 million   | 2.1 million   | 91


Datasets. Table 5.2 briefly summarizes the datasets. We run our experiments on multiple datasets, chosen for their size as well as for the different domains where the Group Immunization problem is especially applicable. Note that all our datasets are networks, not diffusion traces. If diffusion traces are available, one can use state-of-the-art algorithms such as [55] to learn edge weights first, and then apply our algorithms.

1). SBM (Stochastic Block Model) [102] is a well-known model for generating synthetic graphs with groups. We generate small networks from the Stochastic Block Model to test the effectiveness of all our methods.

2). Protein1 is a protein-protein interaction network in budding yeast. There are 13 classes of proteins, which are naturally treated as groups. It is a biological network, where our immunization algorithms could potentially be applied to block protein interactions.

3). OregonAS2 is the Oregon AS router graph collected from the Oregon router views, with groups based on router connectivity. We use Louvain [15], a fast community detection algorithm, to specify the groups. It is a computer network where our algorithms can be used to stop malware outbreaks.

4). YouTube3 is a friendship network in which users can form groups. We create an induced graph by selecting nodes that are in the top 5000 communities. It is a social media network where we can apply our algorithms to control rumor spread.

5). PORTLAND and MIAMI are social-contact graphs based on detailed microscopic simulations of large US cities, which have been used in national smallpox and influenza modeling studies using the SIR model [35]. We divided people into groups by age, ranging from 0 to 90 (hence 91 groups in both networks). They are both contact networks where our algorithms can be adopted to minimize virus propagation.

Settings. For the LT model, we uniformly at random choose 1% of the nodes as the infected nodes (seeds) at the start. And we use the same method as in [74] to generate the probabilities on the edges: for a node v, we assign each of its incoming edges (u, v) a raw probability p̂_uv uniformly at random, and then uniformly at random give v a probability w_v representing the event that v's incoming edges fail to activate it. The normalized weight is then p_uv = p̂_uv/(Σ_u p̂_uv + w_v). We construct 500 live-edge graphs in our algorithms for the LT model. For robustness, each data point we show is the mean of 1000 runs of randomly sampling removed edges/nodes from groups. In the edge deletion version, edge communities are induced from node communities, i.e., for an edge e = (u, v), if both u and v belong to a group C_t, then e ∈ C_t; otherwise it belongs to the group C_ij = {e_uv | u ∈ C_i, v ∈ C_j}.
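A small sketch of this edge-weight generation in Python (dictionary-based; the input in_neighbors and the helper name are illustrative):

```python
import random

def lt_edge_weights(in_neighbors):
    """Generate normalized LT edge weights as described above: draw a raw
    weight for each incoming edge of v and a 'failure' mass w_v, then
    normalize so the incoming weights of v sum to less than 1.
    in_neighbors[v] is the list of nodes u with an edge (u, v)."""
    weights = {}
    for v, nbrs in in_neighbors.items():
        raw = {u: random.random() for u in nbrs}   # raw p-hat_{uv}
        wv = random.random()                       # failure probability mass
        total = sum(raw.values()) + wv
        for u in nbrs:
            weights[(u, v)] = raw[u] / total       # normalized p_{uv}
    return weights
```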

Baselines. As we are not aware of any direct competitors tackling our group immunization problems, we construct three baselines for both node and edge deletion to better judge their performance.

1 http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
2 http://snap.stanford.edu/data/oregon1.html
3 http://snap.stanford.edu/data/com-Youtube.html


Analogous versions of these baselines have been regularly used in state-of-the-art individual immunization studies [148, 165, 166].

(1). Random: uniformly at random assign vaccines to groups, for both node deletion and edge deletion.

(2). Degree: for node deletion, we calculate the average degree d_{C_i} of each group C_i, and independently assign vaccines to C_i with probability d_{C_i}/Σ_{C_k∈C} d_{C_k}; for edge deletion, we first calculate the product degree d_e [107] of each edge e = (u, v), i.e., d_e = d_u · d_v, then, similarly to node deletion, we calculate the average product degree d_{C_i} of C_i, and assign vaccines to C_i with probability d_{C_i}/Σ_{C_k∈C} d_{C_k}.

(3). Eigen: eigenvalue centrality has been widely used in the immunization literature [165, 166], even as a baseline for the LT model [74]. Let u be the eigenvector corresponding to the first eigenvalue of the graph. The eigenscore of node a is u_a, while the eigenscore of edge e(a, b) is |u_a u_b| [165]. For both node and edge deletion, we calculate the average eigenscore u_{C_i} of each group C_i, and independently assign vaccines to C_i with probability u_{C_i}/Σ_{C_k∈C} u_{C_k}.

Indeed, the reason we formulate the group immunization problems in this chapter is that it is typically not feasible to force targeted individuals to be vaccinated in practice (as discussed before in the introduction).

5.3.2 Results

In short, we demonstrate that our methods outperform the baselines on all datasets. We also show how the behaviors of our methods change as the groups vary. Finally, we conduct a case study to analyze the vaccine allocations at group scale.

Note that we have given the time complexity of each algorithm. Some of our algorithms, e.g., the one based on sdp, are fairly time intensive, though they run in polynomial time. However, it is important to keep in mind that these algorithms are expected to be run before an epidemic outbreak, where solution quality is much more critical than run time.

5.3.2.1 Performance

Figure 5.1 shows the experimental results under the LT model for group edge deletion, while Figure 5.2 shows the results for node deletion. In all networks, Greedy-LT consistently outperforms the other competitors. Since we have the same budgets for both edge and node deletion, node removal should clearly perform better than edge deletion, as node deletion removes more edges; our results confirm this. As shown in Figure 5.1, Greedy-LT performs very well for edge deletion compared with the other competitors, e.g., in YouTube, Greedy-LT reduces about 25% of the infection when 500 edges are removed, while for Random, Degree and Eigen, the infection remains almost the same even after removing 500 edges. For node


deletion (Figure 5.2), Greedy-LT performs even better: it reduces more than 30% of the infection given the maximum budgets.

(a) SBM (b) OregonAS (c) YouTube (d) PORTLAND

Figure 5.1: Effectiveness for the LT model on various real datasets (edge deletion). Graph susceptibility ratio (footprint when vaccines are given / footprint without giving vaccines) vs. number of vaccines. Lower is better. Greedy-LT consistently outperforms the other baseline algorithms.

(a) SBM (b) OregonAS (c) YouTube (d) PORTLAND

Figure 5.2: Effectiveness for the LT model on various real datasets (node deletion). Graph susceptibility ratio (footprint when vaccines are given / footprint without giving vaccines) vs. number of vaccines. Lower is better. Greedy-LT consistently outperforms the other baseline algorithms.

Figure 5.3 shows the experimental results of the edge version of group immunization for the spectral radius, while Figure 5.4 shows the results for node deletion. In all networks, sdp, GroupGreedyWalk, lp and qp consistently outperform the other competitors. sdp gives the best results for Protein; however, it is not scalable to large networks with more than thousands of nodes. GroupGreedyWalk gives the second best performance, and it works for graphs with about 10K nodes. For very large networks like YouTube and PORTLAND with millions of nodes, the approximation algorithms sdp and GroupGreedyWalk cannot finish within the allocated time. lp for edge deletion and qp for node deletion perform very well on large networks. For edge deletion (Figure 5.3), Random, Degree and Eigen cannot decrease the first eigenvalue by more than 10% in YouTube when 5k vaccines are given to groups, while lp reduces the eigenvalue by more than 20%. For node deletion (Figure 5.4), qp achieves more than twice the eigenvalue reduction of the other competitors. When comparing node and edge deletion, we reach the same conclusion as for Figures 5.1 and 5.2: given the same number of vaccines, node removal yields a larger decrease of the spectral radius.


(a) Protein (b) OregonAS (c) YouTube (d) PORTLAND

Figure 5.3: Effectiveness for the change of the first eigenvalue on various real datasets (edge deletion). Eigendrop ratio (λ′_G/λ_G) vs. number of vaccines, where λ′_G is the expected eigenvalue after allocating vaccines. Lower is better. sdp, GroupGreedyWalk, and lp consistently outperform the other baseline algorithms.

(a) Protein (b) OregonAS (c) YouTube (d) PORTLAND

Figure 5.4: Effectiveness for the change of the first eigenvalue on various real datasets (node deletion). Eigendrop ratio (λ′_G/λ_G) vs. number of vaccines, where λ′_G is the expected eigenvalue after allocating vaccines. Lower is better. qp consistently outperforms the other baseline algorithms.

5.3.2.2 Varying Groups

We would like to see the effect of changing the granularity of vaccine allocation. We varied the number of groups on PORTLAND, YouTube and OregonAS. For PORTLAND, age ranges from 0 to 90, hence there are initially 91 groups; we decrease the number of groups by randomly merging two adjacent age groups. For OregonAS, we use the community detection algorithm Louvain [15] to find different numbers of groups. For YouTube, we randomly merge ground-truth communities to obtain fewer, larger groups.

Figure 5.5(a) and (b) show the performance of qp and lp as the number of groups changes. First, both of them outperform the other baselines for PORTLAND and YouTube. Second, as the number of groups increases, the spectral radius decreases more for all algorithms (except Random), due to the fact that the randomness in allocating vaccines decreases. The extreme case is when there is only one group: then qp, Degree and Eigen allocate vaccines uniformly at random over the whole graph, which is exactly the same as Random. On the contrary, when the number of groups equals the number of nodes, group immunization becomes individual immunization, which is effective but much more expensive. Figure 5.5(c) and (d) show the performance of Greedy-LT as the number of groups varies. Similar to qp


and lp, it consistently outperforms the other baselines. The performance improvement is even more evident: when the number of groups increases from 1 to 200, Greedy-LT reduces almost 90% of the infection.

(a) PORTLAND (Node, Budget=20,000) (b) YouTube (Edge, Budget=5000) (c) OregonAS (Edge, Budget=1000) (d) YouTube (Node, Budget=1000)

Figure 5.5: (a) and (b): Eigendrop ratio vs. number of groups (spectral radius). (c) and (d): Graph susceptibility ratio vs. number of groups (LT model). Lower is better. Our algorithms consistently outperform the other baseline algorithms as the number of groups and the size of groups change.

5.3.2.3 Case Study

We now study the group vaccination problem on realistic social contact networks, PORTLAND and MIAMI, using age-based groups; as discussed earlier, age-based directives are commonly used by public health agencies. Figure 5.6 shows the number of vaccines assigned to different age groups, for a total of 10,000 vaccines, using the qp algorithm. We find that the groups with ages 70−79 and 60−69 get the maximum allocation for the PORTLAND and MIAMI networks, respectively. This contrasts with CDC recommendations and with the strategy proposed by Medlock et al. [104], which might be because those results do not use the detailed network structure. We believe this is an interesting result which merits further study.

5.4 Conclusion

This chapter addresses the problems of controlling epidemics by means of interventions that can be implemented at a group level. We formulate the Group Immunization problem in the LT model as well as the SIS/SIR models (considering spectral radius minimization) for both edge-level and node-level interventions. We develop algorithms with rigorous performance guarantees and good empirical performance for all these problem classes. Our algorithms require a diverse class of techniques, including submodular function maximization, linear programming, quadratic programming, semidefinite programming, and hitting closed walks. Finally, we evaluate them on real networks of diverse scales.


(a) PORTLAND (b) MIAMI

Figure 5.6: Vaccine distributions for PORTLAND and MIAMI (Budget=10,000) under Random, Degree, Eigen, and qp. Number of vaccines vs. age group. (Age range '0-9': 1; '10-19': 2; '20-29': 3; '30-39': 4; '40-49': 5; '50-59': 6; '60-69': 7; '70-79': 8; '80-89': 9; '90-': 10.)

We demonstrate that our algorithms significantly outperform other heuristics, and adapt to the group structure. Some of our algorithms, e.g., the one based on sdp, are fairly time intensive, though they run in polynomial time. However, it is important to keep in mind that these algorithms are expected to be run before an epidemic outbreak, where solution quality is much more critical than run time.


Chapter 6

Graph Coarsening

The unprecedented popularity of online social networking websites, such as Facebook, Google+, Flickr, and YouTube, has made it possible to analyze real social networks. Word-of-mouth marketing and viral marketing strategies have evolved to take advantage of this network structure by utilizing network effects. Similarly, understanding large-scale epidemiological datasets is important for designing effective propagation models and containment policies for public health. The sheer size of today's large social networks makes it challenging to perform sophisticated network analysis.

Given a propagation graph, possibly learnt from cascade analysis, is it possible to get a smaller, nearly diffusion-equivalent representation of it? Getting a smaller equivalent graph will help multiple algorithmic and data mining tasks like influence maximization, immunization, understanding cascade data and data compression. In this chapter, we study a novel graph coarsening problem with the aim of approximating a large social network by a much smaller graph that approximately preserves the network structure. Our primary goal is to find a compact representation of a large graph such that diffusion and propagation processes on the large graph can be studied by analyzing the smaller representation. Intuitively, most of the edges in a real network are relatively unimportant; hence we propose characterizing and "contracting" precisely such edges in a graph to obtain a coarse representation.

The main contributions of this chapter are:

(a) Problem Formulation: We carefully formulate a novel Graph Coarsening Problem (GCP) to find a succinct representation of a given social network so that the diffusion characteristics of the network are mostly preserved.

(b) Efficient Algorithms: We develop coarseNet, an efficient (near-linear time) and effective algorithm for GCP, using careful approximations. We show that due to our novel scoring technique, the coarsened graph retains most of the diffusive properties of the original network.

(c) Extensive Experiments: We show that coarseNet is able to coarsen graphs up to


90% without much loss of key information. We also demonstrate the usefulness of our approach via a number of interesting applications. A major application we consider in this work is influence maximization in the Independent Cascade model. We propose a framework, cspin, that involves coarsening the graph and then solving influence maximization on the smaller graph to obtain high-quality solutions. As the coarsened graph is much smaller than the original graph, the influence maximization algorithm runs orders of magnitude faster on the coarsened graph. Further, using real cascade data from Flixster, we show how GCP can potentially help in understanding propagation data and constructing non-network surrogates for finding nodes with similar influence.

This work has been published in KDD 2014 [136]. Next, we first give preliminaries, then formally formulate our problem and propose a near-linear-time algorithm to coarsen graphs. We validate our method via experiments, and finally conduct case studies on influence maximization and diffusion characterization.

6.1 Preliminaries

Table 6.1 gives some of the notation.

Table 6.1: Symbols

Symbol         | Definition and Description
A, B, . . .    | matrices (bold upper case)
~a, ~b, . . .  | column vectors
a_j or a(j)    | jth element of vector a
n              | number of vertices in the graphs
m              | number of edges in the graphs
α              | the reduction factor
λ_G            | first eigenvalue (in absolute value) of the adjacency matrix of graph G
~u_G, ~v_G     | right and left first eigenvectors (for λ_G) of the adjacency matrix of G
IC Model       | the Independent Cascade Model
GCP            | Graph Coarsening Problem
coarseNet      | our algorithm for GCP

A social network is a directed, weighted graph G = (V, E, w). Usually each vertex v ∈ V represents an individual of the network, and edges represent influence relationships between these individuals. The Independent Cascade (IC) model is a popular diffusion model used to model the way influence propagates along the edges of a social network. In this setting, a vertex v ∈ V is called active if it has been influenced and inactive otherwise. Once an


inactive vertex becomes active, it always stays active, i.e., we focus only on progressive models. Given a seed set S ⊂ V of initially active vertices, the Independent Cascade model proceeds in discrete time steps as follows. At time step t, let S_t denote the set of vertices activated at time t. Every vertex u ∈ S_t is given a single chance to activate each currently inactive neighbor v, with probability of success w(u, v), independently of all other interactions. If u succeeds, then v becomes active at time t + 1. This diffusion process continues until no more activations are possible. The influence spread of seed set S, denoted by σ(S), is the expected number of activated vertices at the end of the process.
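For reference, σ(S) can be estimated by direct Monte-Carlo simulation of this process. A minimal sketch in Python (dict-of-dicts representation graph[u] = {v: w(u, v)}; illustrative only):

```python
import random

def ic_spread(graph, seeds, n_sims=1000):
    """Monte-Carlo estimate of sigma(S) under the Independent Cascade model."""
    total = 0
    for _ in range(n_sims):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:                      # discrete time steps
            nxt = []
            for u in frontier:
                for v, w in graph[u].items():
                    # u gets a single chance to activate each inactive neighbor
                    if v not in active and random.random() < w:
                        active.add(v)        # v becomes active at step t + 1
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / n_sims
```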

6.2 Problem Formulation

Motivated by the fact that in any real network most edges and vertices are not important (due to the heavily skewed degree distributions), we propose a graph coarsening problem which involves pruning away precisely such edges (and vertices). We aim to coarsen the graph to obtain a much smaller representation which retains the diffusive properties. We coarsen a graph by successively merging adjacent node pairs. We attempt to quickly find "good" edges which have little effect on the network's diffusive properties. At first glance, this seems impossible, as the diffusive properties of a graph are highly dependent on the connectivity of the vertices and the edge weights. Further, determining which node pairs to merge and analyzing the effect of merging two nodes on diffusion are non-trivial. Informally, we study the following problem in this chapter:

Definition 6.1 (Informal Problem).
Input: Weighted graph G = (V, E, w) and a target fraction 0 < α < 1.
Goal: Coarsen G by repeatedly merging adjacent node pairs to obtain a weighted graph H = (V′, E′, w′) such that

• |V′| = (1 − α)|V|;
• graph H approximates graph G with respect to its diffusive properties.

Role of Eigenvalues. In order to address the informal problem described above, we need a tractable way to characterize the diffusive properties of a network. Recent work [133] shows that for almost any propagation model (including the IC model), important diffusion characteristics (in particular the so-called epidemic threshold) of a graph (after removing self loops) are captured by the spectrum of the graph, specifically by the first eigenvalue of the adjacency matrix. Thus it is natural to believe that if the first eigenvalue of the coarsened graph H (of its adjacency matrix) is close to that of the original graph G, then H indeed approximates G well. Although the work of [133] deals with undirected graphs, their findings are also applicable to strongly connected directed graphs.

Merging node pairs. To explicitly formulate the problem in Definition 6.1, we also need to define what happens when a node pair is merged (i.e., an edge is contracted) in a weighted


graph. More precisely, after merging neighboring vertices a and b to form a new node c, we need to determine the new edge weights of all incoming and outgoing edges of c. In order to maintain the diffusive properties of the network, we need to reweight the new edges appropriately.

Figure 6.1: Why reweight? A 5-vertex chain x–e–d–b–a with all edge weights 0.5 (top), and the 4-vertex chain x–e–d–c obtained by merging the node pair {a, b} (bottom); the weight of the new edge {d, c} must be chosen.

To see why this is crucial, consider Figure 6.1. Assume that the IC model is being run. Suppose we need to pick the two best seeds (i.e., the two nodes with the maximum influence spread as defined in the previous section) from the top 5-vertex chain. Further assume that the graph is undirected and each edge has the same weight β = 0.5. Clearly, vertices b and e are the best. If we merge vertices {a, b}, we get the bottom 4-vertex chain. To still match the original solution, we correspondingly want {c, e} to be the best seed-set in the new chain; but if edge {d, c} keeps the same weight, either of the pairs {e, c} and {x, d} is a best seed set in the 4-vertex chain. This motivates the need to reweight suitably, so that the new coarsened graph still retains the original characteristics.

The main insight is that if we select c as a seed, we are in effect intending to choose only one of vertices a and b to be seeded (influenced), which suggests that the likelihood of d being influenced from c is either 0.5 or 0.25 (corresponding to when a or b is chosen, respectively). Hence the weight of edge (c, d) should be modified to reflect this fact.

We propose the following solution: suppose e = (a, b) is contracted and a and b are merged to form the "supervertex" c. We reweight the edges adjacent to a and b while coarsening, so that the new edges represent the average of the transmission probabilities via a or b. So in our example of Figure 6.1, edge {c, d} would have weight 0.375 (the average of 0.5 and 0.25). Further, we can verify that in this case {e, c} will be the best seed-set, as desired.

Extending the same principle, Figure 6.2 shows the general situation for any candidate node pair (a, b) and what a merge-and-reweight (= contract) operation looks like. More formally, our contract operation is as follows:

Definition 6.2 (Merging node pairs). Let Nb^i(v) (respectively Nb^o(v)) denote the set of in-neighbors (resp. out-neighbors) of a vertex v. Let v^i_u = w(u, v) and v^o_u = w(v, u) denote the weights of the corresponding edges. If the node pair (a, b) is now contracted to a new vertex


Figure 6.2: Reweighting of edges after merging the node pair (a, b) into supervertex c, with β1 = w(a, b) and β2 = w(b, a); neighbors of only a or only b receive scaled weights, and common neighbors receive averaged weights.

c, and w(a, b) = β1 and w(b, a) = β2, then the new edges are weighted as:

c^i_t = (1 + β1) a^i_t / 2                          ∀t ∈ Nb^i(a) \ Nb^i(b),
c^i_t = (1 + β2) b^i_t / 2                          ∀t ∈ Nb^i(b) \ Nb^i(a),
c^i_t = [(1 + β1) a^i_t + (1 + β2) b^i_t] / 4       ∀t ∈ Nb^i(a) ∩ Nb^i(b);        (6.1)

c^o_t = (1 + β2) a^o_t / 2                          ∀t ∈ Nb^o(a) \ Nb^o(b),
c^o_t = (1 + β1) b^o_t / 2                          ∀t ∈ Nb^o(b) \ Nb^o(a),
c^o_t = [(1 + β2) a^o_t + (1 + β1) b^o_t] / 4       ∀t ∈ Nb^o(a) ∩ Nb^o(b).        (6.2)

Graph Coarsening Problem. We are now ready to state our problem formally. Motivated by the connections between the diffusive and spectral properties of a graph, we define the following Graph Coarsening Problem: find the set of node pairs which, when merged (according to Definition 6.2), lead to the least change in the first eigenvalue. Further, since a vertex cannot influence itself, we assume without loss of generality that the graph G has no self loops.

Definition 6.3 (Graph Coarsening Problem).
Input: Directed, strongly connected, weighted graph G = (V, E, w) without self loops, and a target fraction 0 < α < 1.
Output: E* = arg min_{E′⊂E, |E′|=α|V|} |λ_G − λ_{G′}|, where G′ is obtained from G by merging all node pairs in E′.


A related problem is Edge Immunization [165], which asks for a set of edges whose removal leads to the greatest drop in the first eigenvalue. In contrast, GCP seeks a set of edges whose contraction (Definition 6.2) leads to the least change in the first eigenvalue. The Edge Immunization problem is known to be NP-hard [165].

6.3 Our Solution

As obvious algorithms for GCP are clearly exponential, we propose a greedy heuristic that repeatedly merges a node pair which minimizes the change in the first eigenvalue. Let G_{−(a,b)} denote the graph G after merging nodes a and b (incorporating the re-weighting strategy), and let λ_G denote the first eigenvalue of the adjacency matrix of G. We define the score of a node pair (a, b) as follows:

Definition 6.4 (Score). Given a weighted graph G = (V, E, w) and an adjacent node pair (a, b), score(a, b) is defined by

score(a, b) = |λ_{G_{−(a,b)}} − λ_G| = Δλ_{(a,b)}.

Intuitively, if score(a, b) ≈ 0, it implies that edges (a, b) and (b, a) do not play a significant role in the diffusion through the graph and can thus be contracted.

Naïve Algorithm: The above intuition suggests the following naïve algorithm for selecting node pairs to merge. At each stage, calculate the change in the eigenvalue due to merging each adjacent node pair, choose the node pair leading to the least change, merge the chosen nodes, and repeat until the graph is small enough. An implementation of this, even using the Lanczos algorithm for eigenvalue computation on sparse graphs, is too expensive, taking O(m^2) time. Can we compute (maybe approximately) the scores of each node pair faster?

Main Idea: We use a matrix perturbation argument to derive an expression for the change in eigenvalue due to merging two adjacent nodes. Using further information about the specific perturbations occurring when merging two adjacent nodes, we show that the change in the eigenvalue can be approximated well in constant time. Thus, we obtain a linear-time (O(m)) scheme to estimate the score of every pair of adjacent nodes.

6.3.1 Score Estimation

Let a and b denote the two neighboring vertices that we are trying to score. We assume that the first eigenvalue λ_G of the graph and the corresponding right and left eigenvectors ~u, ~v are precomputed. Further, since the graph G is strongly connected, by the Perron-Frobenius theorem, the first eigenvalue λ_G and the eigenvectors ~u and ~v are all real and have positive


components. When it is clear from the context, we drop the subscripts G and (a, b); in the proofs that follow, λ = λ_G and Δλ = Δλ_{(a,b)}, as there is no ambiguity. Let A denote the adjacency matrix of the graph. Further, as ~u denotes the eigenvector of A, let u_a = u(a) denote the component of ~u corresponding to vertex a. Merging nodes changes the dimensions of the adjacency matrix A, which we handle by viewing the merge of nodes a, b as adding b's neighbors to a and isolating node b.

Approximation 6.1 provides an equation for the change in the eigenvalue via a matrix perturbation argument. Propositions 6.1 and 6.2 show how our reweighting strategy helps us approximate score(a, b) in constant time.

Approximation 6.1. The change in eigenvalue Δλ can be approximated by

Δλ = (~v^T ΔA ~u + ~v^T ΔA Δ~u) / (~v^T ~u + ~v^T Δ~u),

where ΔA denotes an infinitesimally small change in the adjacency matrix A and Δ~u denotes the corresponding change in the eigenvector ~u.

Justification. By the definition of an eigenvalue and eigenvector of a matrix, we have

A~u = λ~u        (6.3)
~v^T A = λ~v^T        (6.4)

Perturbing all values of (6.3) infinitesimally, we get

(A + ΔA)(~u + Δ~u) ≈ (λ + Δλ)(~u + Δ~u)        (6.5)
A Δ~u + ΔA ~u + ΔA Δ~u ≈ λ Δ~u + Δλ ~u + Δλ Δ~u        (6.6)

Premultiplying by ~v^T and using (6.3) and (6.4),

Δλ (~v^T ~u + ~v^T Δ~u) ≈ ~v^T ΔA ~u + ~v^T ΔA Δ~u        (6.7)
Δλ ≈ (~v^T ΔA ~u + ~v^T ΔA Δ~u) / (~v^T ~u + ~v^T Δ~u)        (6.8)

Using expression (6.8) along with prior knowledge about the perturbations to the adjacency matrix A and the eigenvector ~u, we obtain an expression for computing the score of the node pair.

Proposition 6.1 (Score Estimate). Under Approximation 6.1, the score of a node pair, score(a, b), can be approximated as

Δλ_{(a,b)} = [−λ(u_a v_a + u_b v_b) + v_a ~u^T ~c^o + β2 u_a v_b + β1 u_b v_a] / [~v^T ~u − (u_a v_a + u_b v_b)]

(ignoring second-order terms).


Proof. Approximation 6.1 provided an expression for Δλ in terms of the change in the adjacency matrix and the eigenvector. Now ΔA, i.e., the change in the adjacency matrix, can be considered as occurring in three stages, namely (i) deletion of a, (ii) deletion of b, (iii) insertion of c. Assume that c is inserted in place of a. Thus we obtain

ΔA = −(~a^i ~e_a^T + ~e_a ~a^{oT}) − (~b^i ~e_b^T + ~e_b ~b^{oT}) + (~c^i ~e_a^T + ~e_a ~c^{oT}),        (6.9)

where ~e_v denotes a vector with a 1 in the vth row and 0 elsewhere. Further, as we modify only two rows and columns of the matrix, this change ΔA is very small.

Also, the deletion of vertices a and b causes the ath and bth components of ~u and ~v to become zero. Δ~u, i.e., the change in the eigenvector ~u, can thus be considered as setting u_a and u_b to zero, followed by small changes to the other components and to u_a due to the addition of c. Thus we obtain

Δ~u = −u_a ~e_a − u_b ~e_b + ~δ.        (6.10)

Although Δ~u cannot be considered small, we assume that the changes ~δ after setting the u_a and u_b components to zero are very small.

Substituting for ΔA, we get

~v^T ΔA ~u = ~v^T (−(~a^i ~e_a^T + ~e_a ~a^{oT}) − (~b^i ~e_b^T + ~e_b ~b^{oT}) + (~c^i ~e_a^T + ~e_a ~c^{oT})) ~u.        (6.11)

Since ~v^T ~e_a = v_a and ~e_a^T ~u = u_a (and similarly for b),

~v^T ΔA ~u = −u_a ~v^T ~a^i − v_a ~a^{oT} ~u − u_b ~v^T ~b^i − v_b ~b^{oT} ~u + u_a ~v^T ~c^i + v_a ~c^{oT} ~u.        (6.13)

But ~v^T ~a^i = λ v_a and ~a^{oT} ~u = λ u_a (and similarly for b), so

~v^T ΔA ~u = −2λ(u_a v_a + u_b v_b) + u_a ~v^T ~c^i + v_a ~c^{oT} ~u.        (6.15)

Now, using (6.9) and (6.10), consider

~v^T ΔA Δ~u = ~v^T ΔA (−u_a ~e_a − u_b ~e_b + ~δ).        (6.16)

Since ΔA and ~δ are both very small, we ignore the second-order term ~v^T ΔA ~δ:

~v^T ΔA Δ~u = ~v^T ΔA (−u_a ~e_a − u_b ~e_b)
            = ~v^T (−(~a^i ~e_a^T + ~e_a ~a^{oT}) − (~b^i ~e_b^T + ~e_b ~b^{oT}) + (~c^i ~e_a^T + ~e_a ~c^{oT})) (−u_a ~e_a − u_b ~e_b).        (6.17)


Since self loops do not affect diffusion in any way, we can assume without loss of generality that G has no self loops. Further, simplifying using the definition of eigenvalues, we get

~v^T ΔA Δ~u = λ(u_a v_a + u_b v_b) + β2 u_a v_b + β1 u_b v_a − u_a ~v^T ~c^i.        (6.21)

Ignoring small terms, we also have

~v^T Δ~u = ~v^T (−u_a ~e_a − u_b ~e_b + ~δ) = −(u_a v_a + u_b v_b).        (6.22)

Substituting (6.15), (6.21) and (6.22) into Approximation 6.1, we get

Δλ = [−λ(u_a v_a + u_b v_b) + v_a ~u^T ~c^o + β2 u_a v_b + β1 u_b v_a] / [~v^T ~u − (u_a v_a + u_b v_b)].        (6.23)

Note that every term in this expression is a simple product of scalars, except for the ~u^T ~c^o term. We now show that even ~u^T ~c^o can in fact be expressed in terms of scalars and can thus be computed in constant time.

Proposition 6.2. Using the re-weighting scheme as defined in Definition 6.2, if c denotes the new vertex created by merging nodes {a, b} and ~c^o denotes the out-adjacency vector of c, then

~u^T ~c^o = [(1 + β2)/2] (λ u_a − β1 u_b) + [(1 + β1)/2] (λ u_b − β2 u_a),

where β1 is the weight of the edge (a, b) and β2 is the weight of the edge (b, a).

Proof. Let X = Nb^o(a) \ Nb^o(b), Y = Nb^o(b) \ Nb^o(a), and Z = Nb^o(a) ∩ Nb^o(b). Since c is adjacent only to neighbors of a and b, we have

~u^T ~c^o = Σ_{t∈X} u_t c^o_t + Σ_{t∈Y} u_t c^o_t + Σ_{t∈Z} u_t c^o_t + u_c W,        (6.24)

where W is the weight of a self loop added at c. Note that a self loop does not affect diffusion in any way (as a node cannot influence itself); we use the self loop only in the analysis, so as to compute the scores efficiently.

As per our reweighting scheme (see Definition 6.2),

~u^T ~c^o = Σ_{t∈X} [(1 + β2)/2] u_t a^o_t + Σ_{t∈Y} [(1 + β1)/2] u_t b^o_t + Σ_{t∈Z} ([(1 + β2)/4] a^o_t + [(1 + β1)/4] b^o_t) u_t + u_c W.        (6.25)


But, by the definition of eigenvalues, we know that

λ u_a = Σ_{t∈V} u_t a^o_t = Σ_{t∈X} u_t a^o_t + Σ_{t∈Z} u_t a^o_t + u_b β1,        (6.27)

so that

Σ_{t∈X} u_t a^o_t = λ u_a − Σ_{t∈Z} u_t a^o_t − β1 u_b = λ u_a − a^o(Z) − β1 u_b,        (6.29)

where a^o(Z) = Σ_{t∈Z} u_t a^o_t. Similarly, we get

Σ_{t∈Y} u_t b^o_t = λ u_b − b^o(Z) − β2 u_a.        (6.30)

Substituting Equations (6.29) and (6.30) into (6.25),

~u^T ~c^o = [(1 + β2)/2] (λ u_a − a^o(Z) − β1 u_b) + [(1 + β1)/2] (λ u_b − b^o(Z) − β2 u_a) + [(1 + β2)/4] a^o(Z) + [(1 + β1)/4] b^o(Z) + u_c W.        (6.31)

We now choose W = ([(1 + β2)/4] a^o(Z) + [(1 + β1)/4] b^o(Z)) / u_c, so that the a^o(Z) and b^o(Z) terms cancel and we get

~u^T ~c^o = [(1 + β2)/2] (λ u_a − β1 u_b) + [(1 + β1)/2] (λ u_b − β2 u_a).        (6.34)

Corollary 6.1. Given the first eigenvalue λ and the corresponding eigenvectors ~u, ~v, the score of a node pair, score(a, b), can be approximated in constant time.

Proof. Substituting for ~u^T ~c^o in Proposition 6.1 using Proposition 6.2, we obtain an expression for score(a, b) that is composed entirely of scalar terms. Thus we can estimate the score in constant time.
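Combining Propositions 6.1 and 6.2, the score reduces to a handful of scalar operations. A minimal Python sketch of this computation (the helper name is ours; λ, the eigenvectors ~u, ~v, and ~v^T~u are assumed precomputed):

```python
def score(lam, u, v, vTu, a, b, beta1, beta2):
    """Constant-time estimate of score(a, b) = |Delta lambda| for the node
    pair (a, b). lam: first eigenvalue; u, v: right/left first eigenvectors
    (indexable by node); vTu: precomputed inner product v . u."""
    ua, ub, va, vb = u[a], u[b], v[a], v[b]
    # u^T c_o from Proposition 6.2
    uTco = (1 + beta2) / 2.0 * (lam * ua - beta1 * ub) \
         + (1 + beta1) / 2.0 * (lam * ub - beta2 * ua)
    # Delta lambda from Proposition 6.1 (second-order terms ignored)
    num = -lam * (ua * va + ub * vb) + va * uTco + beta2 * ua * vb + beta1 * ub * va
    den = vTu - (ua * va + ub * vb)
    return abs(num / den)
```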

6.3.2 Complete Algorithm

Using the approximation described in the previous section, we assign a score to every pair of adjacent nodes of the graph. We then sort these node pairs in ascending order of the absolute


Algorithm 6.1 Coarsening Algorithm - coarseNet (G, α)

Require: A directed, weighted graph G = (V, E, w), a reduction factor α
Ensure: Coarsened graph G^α_coarse = (V′, E′, w′)
1: i = 0
2: n = |V|
3: G′ = G
4: for each adjacent pair of nodes a, b ∈ V do
5:   compute score(a, b) using Section 6.3.1
6: end for
7: π ← ordering of node pairs in increasing order of score
8: while i ≤ αn do
9:   (a, b) = π(i)
10:  G′ ← Contract_{G′}(a, b)
11:  i++
12: end while
13: return G^α_coarse = G′

Intuitively, we would like to merge a node pair if it has minimal score. Given an upper bound α, the graph is then coarsened by contracting αn node pairs one by one in this order, ignoring any pairs that have already been merged. We give the pseudo-code of our algorithm coarseNet in Algorithm 6.1.
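A minimal Python sketch of this greedy contraction loop is given below (our illustration; score stands for the constant-time estimator of Section 6.3.1, and networkx's contracted_nodes stands in for the Contract step; a faithful implementation would also reweight edges per Definition 6.2).

import networkx as nx

def coarse_net(G, alpha, score):
    """Sketch of Algorithm 6.1: contract alpha*n node pairs in
    increasing order of |score(a, b)|, skipping pairs already
    absorbed by an earlier merge."""
    n = G.number_of_nodes()
    pi = sorted(G.edges(), key=lambda e: abs(score(*e)))
    merged = 0
    for a, b in pi:
        if merged > alpha * n:
            break
        if G.has_node(a) and G.has_node(b):   # both endpoints still alive
            G = nx.contracted_nodes(G, a, b, self_loops=False)
            merged += 1
    return G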

Lemma 6.1 (Running Time). The worst-case time complexity of our algorithm is $O(m\ln m + \alpha n\,n_\theta)$, where $n_\theta$ denotes the maximum degree of any vertex at any time in the coarsening process.

Proof. Computing the first eigenvalue and eigenvector of the adjacency matrix of the graph takes O(m) time (for example, using Lanczos iteration, assuming that the spectral gap is large). As shown in Section 6.3.1, each node pair can be assigned a score in constant time. In order to score all m adjacent pairs of nodes of the graph, we require linear, i.e., O(m), time. The scored node pairs are sorted in O(m ln m) time. Merging two nodes (a, b) requires $O(\deg(a) + \deg(b)) = O(n_\theta)$ time. Since we merge at most αn pairs of nodes, the merging itself has time complexity $O(\alpha n\,n_\theta)$.

Therefore, our worst-case time complexity is $O(m\ln m + \alpha n\,n_\theta)$.

6.4 Sample Application: Influence Maximization

The eigenvalue-based coarsening method described above aims to obtain a small network that approximates the diffusive properties of the original large network. As an example application, we now show how to apply our graph coarsening framework to the well-studied influence maximization problem. Recall that given a diffusion model (the IC model in our case) and a social network, the influence maximization problem is to find a small seed set of k nodes such that the expected number of influenced nodes is maximized.

Since we have designed our coarsening strategy such that nodes and edges important for diffusion remain untouched, we expect that solving influence maximization on the coarsened graph is a good proxy for solving it on the much larger original network. The major challenge in this process is to determine how to map the solutions obtained from the coarsened graph back onto the vertices of the original network. But due to the carefully designed coarsening strategy, which tries to keep important candidate vertices unmerged, we observe that a simple random pull-back scheme works well in practice.

More formally, we propose the following multi-stage approach to solve influence maximization:

1. Coarsen the social network graph G by using Algorithm 6.1 to obtain a much smaller graph Gcoarse. Let µ : V → Vcoarse denote the mapping from vertices of the original graph to those of the coarsened graph.

2. Solve the influence maximization problem on Gcoarse to get k vertices s1, . . . , sk in the coarsened graph that optimize the desired objective function. We can use any off-the-shelf algorithm for influence maximization in this step. Since Gcoarse is much smaller than G, traditional algorithms for influence maximization can provide high-quality solutions in little time.

3. Pull back the solutions onto the vertices of the original graph. Given a seed si in Gcoarse, we need to select a vertex v ∈ µ−1(si) from G as a seed. Multiple strategies can be considered here, such as $v = \arg\max_{u\in\mu^{-1}(s_i)} \sigma(u)$, where σ(u) is the expected influence from seeding u. However, thanks to our careful coarsening framework, we show that the simple strategy of selecting a seed uniformly at random from µ−1(si) for every seed si performs very well in practice.

Algorithm 6.2 describes our strategy to solve influence maximization problems. Note that a similar strategy can be applied to study other problems based on diffusion in networks.

Algorithm 6.2 cspin: Influence Maximization Framework

Require: A weighted graph G = (V, E, w), the number of seeds k, a reduction factor α
Ensure: A seed set S of k seeds
1: Gαcoarse, µ ← coarseNet(G, α) (see Algorithm 6.1)
2: s′1, s′2, . . . , s′k ← InfluenceMaximization(Gαcoarse, k)
3: for i = 1, . . . , k do
4:     si ← random sample from µ−1(s′i)
5: end for
6: return S = {s1, s2, . . . , sk}
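As a sketch, the whole coarsen-solve-project pipeline is only a few lines of Python (our illustration; coarse_net is assumed to also return the node map µ, and inf_max is any off-the-shelf influence maximization routine such as pmia):

import random
from collections import defaultdict

def cspin(G, k, alpha, coarse_net, inf_max):
    """Sketch of Algorithm 6.2: coarsen, solve, randomly pull back."""
    G_coarse, mu = coarse_net(G, alpha)        # mu: V -> V_coarse
    superseeds = inf_max(G_coarse, k)
    inv = defaultdict(list)                    # mu^{-1}: super-node -> group
    for vertex, supernode in mu.items():
        inv[supernode].append(vertex)
    return [random.choice(inv[s]) for s in superseeds]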


6.5 Experiments

We performed several experiments to show the effectiveness of the coarseNet algorithm and also of the GCP framework for cascade analysis.

Table 6.2: Datasets: Basic Statistics

Dataset        #Vertices   #Edges       Mean Degree
Flickr small   500,038     5,002,845    20.01
Flickr medium  1,000,001   14,506,356   29.01
Flickr large   2,022,530   21,050,542   20.82
DBLP           511,163     1,871,070    7.32
Amazon         334,863     1,851,744    11.06
Brightkite     58,228      214,078      7.35
Portland       1,588,212   31,204,286   39.29
Flixster       55,918      559,863      20.02

Datasets: All experiments were conducted on an Intel Xeon machine (2.40 GHz) with 24GB of main memory. We used a diverse selection of datasets from different domains to test our algorithm and framework (see Table 6.2). These datasets were chosen for their size as well as their applicability to the diffusion problem. coarseNet was tested on data from Flickr, DBLP, Amazon, Brightkite, and the Portland epidemiology data. In the Flickr data, vertices are users and links represent friendships. In the DBLP data, vertices represent authors and edges represent co-authorship links. Brightkite is a friendship network from the former location-based social networking service provider Brightkite. In the Amazon dataset, vertices are products and an edge represents that the two products are often purchased together. The Portland dataset is a social contact graph with vertices representing people and edges representing interactions; it represents a synthetic population of the city of Portland, Oregon, and has been used in nation-wide smallpox studies [35]. Finally, we also used a real cascade dataset, Flixster¹, where cascades of movie ratings happen over a social network.

6.5.1 Performance for the GCP problem

We want to measure the performance of the coarseNet algorithm on the GCP problem. In short, we can coarsen up to 70% of node-pairs using coarseNet and still retain almost the same eigenvalue.

¹ http://www.cs.ubc.ca/~jamalim/datasets/


6.5.1.1 Effectiveness

Figure 6.3: Effectiveness of coarseNet for GCP on (a) Amazon, (b) DBLP, and (c) Brightkite: first eigenvalue λ vs. reduction factor α for coarseNet and random. coarseNet maintains λ values.

As a baseline we used random, a random node-pair coarsening algorithm (randomly choose a node-pair and contract it), used in some community detection techniques. Figure 6.3 shows the values of λ as the reduction factor α increases when we run coarseNet and random on three datasets (we set a weight of 0.02 for this experiment). We observed that in all datasets, as the reduction factor α increases, the values of λ barely change for coarseNet, showing that the diffusive properties are maintained even with almost 70% contraction, while random destroys the eigenvalue very quickly with increasing α. This shows that (a) large graphs can in fact be coarsened to large percentages while maintaining diffusion; and (b) coarseNet effectively solves the GCP problem. As we show later, we apply the GCP problem and coarseNet to a detailed sample application of influence maximization.

6.5.1.2 Scalability

Figure 6.4: Scalability of coarseNet for GCP. (a) Amazon, (b) DBLP, and (c) Brightkite: running time (in seconds) vs. reduction factor α, with linear fits Y = 1400X − 190 (R² = 0.9898), Y = 1800X − 500 (R² = 0.9505), and Y = 580X − 140 (R² = 0.9950) respectively. (d) Flickr (varying sizes): running time vs. graph size (number of vertices), with near-linear fit Y = 0.0015X + 120 (R² = 0.9530).

Figure 6.4 shows the running times of coarseNet w.r.t. α and n. To analyze the runtime of coarseNet with respect to graph size (n), we extracted 6 connected components (with 500K to 1M vertices, in steps of 100K) of the Flickr large dataset. As expected from Lemma 6.1, we observe that in all datasets, as the reduction factor α increases, the running time increases linearly (the figures also show the linear fits, with R² values), and the running time scales near-linearly as the size of the graph increases. This demonstrates that coarseNet is scalable to large datasets.

6.5.2 Application 1: Influence Maximization

Here we demonstrate in detail a concrete application of our GCP problem and coarseNet algorithm to diffusion-related problems. We use the well-known Influence Maximization problem. The idea, as discussed before, is to use the Coarsen-Solve-Project cspin framework (see Section 6.4). In short, we find that we obtain a 300× speed-up on large networks while maintaining the quality of solutions.

Propagation probabilities: Since accurate propagation probabilities for these networks are not available, we generate propagation probabilities according to two models, following the literature.

• Uniform: Each edge is assigned a low propagation probability of 0.02. In most real social networks, the propagation probabilities are known to be low.
• Trivalency: We also test on the trivalency model studied in [22] (as sketched below). For every edge we choose a probability uniformly at random from the set {0.1, 0.01, 0.001}, which corresponds to the edge having high, medium, and low influence respectively.
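For instance, a sketch of the trivalency assignment (assuming a networkx graph whose edge weights act as propagation probabilities):

import random

def assign_trivalency(G):
    """Assign each edge a propagation probability drawn uniformly at
    random from {0.1, 0.01, 0.001} (high / medium / low influence)."""
    for a, b in G.edges():
        G[a][b]["weight"] = random.choice([0.1, 0.01, 0.001])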

Algorithms and setup: We can use any off-the-shelf algorithm to solve the influence maximization problem on the smaller coarsened network. Here, we choose to use the fast and popular pmia [22] algorithm. We then compared the influence spreads and running times of the cspin framework with the plain pmia algorithm to demonstrate the gains from using GCP.

6.5.2.1 Effectiveness

Quality of solution (influence spread). In all experiments, the influence spread generated by our cspin approach is within 10% of the influence spread generated by pmia. In some cases, we even perform slightly better than pmia. Figure 6.5(a) shows the expected spread obtained by selecting k = 1000 seeds on five datasets. For these experiments, the percentage of edges to be merged is set at 90% and we use the uniform propagation model.

Quality w.r.t. α. We find that we can merge up to 95% of the edges while still retaining influence spread. As more edges are merged, the coarsened graph is smaller; so the superseeds in Gαcoarse can be found faster, and thus we expect our running time to decrease. We ran tests on the Flickr medium dataset for 1000 seeds and varied α from 80% to 95%. Figure 6.5(b) shows that the ratio of the expected influence spread obtained by cspin to that obtained by pmia is almost 1 for varying α.


Figure 6.5: Effectiveness of cspin. Ratio of influence spread between cspin and pmia for (a) different datasets and (b) varying α on Flickr medium. (c) Running time vs. α on Flickr medium.

Quality of solution: effect of unbiased random pullback.

Table 6.3: Insensitivity of cspin to random pullback choices: expected influence spread does not vary much.

#Trials   Maximum Spread   Minimum Spread   Coefficient of variation (σ/µ)
100       58996.6          58984.8          5.061 × 10⁻⁵

coarseNet groups nodes which have similar diffusive effects; hence choosing any one of the nodes uniformly at random inside a group will lead to similar spreads (hence we do the random pullback in cspin). Note we do not claim that these groups are link-based communities, only that their diffusive effects are similar. To demonstrate this, we performed 100 trials of the random pullback phase for the Flickr small graph. For these trials, 1000 superseeds were found by coarsening 90% of the edges. In each trial, we used these same superseeds to find the 1000 seeds independently and uniformly at random. Table 6.3 shows that the coefficient of variation of the expected spread is only 5.061 × 10⁻⁵.

6.5.2.2 Scalability

Scalability w.r.t. the number of seeds (k). As the budget k increases, we see dramatic performance benefits of cspin over pmia. We ran experiments on Flickr small and Portland, setting α = 90% and varying k from 0.01% to 1% of |V|. Figure 6.6(a,b) shows the total running times (including the coarsening). Due to lack of space we show only the results for the trivalency model (the uniform case was similar). In all datasets, as k increases, the running time of cspin increases very slowly. Note that we get orders-of-magnitude speed-ups: e.g., on Flickr, pmia takes more than 10 days to find 200+ seeds, while cspin runs in 2 minutes.

Scalability w.r.t. α. The running time also drops with increased coarsening, as seen in Figure 6.5(c).


Figure 6.6: Scalability of cspin. Running time vs. k on (a) Flickr small (trivalency model) and (b) Portland (trivalency model); (c) running time vs. size of graph on Flickr (varying sizes). cspin gets increasing orders-of-magnitude speed-up over pmia.

Scalability w.r.t. n. We ran cspin on the components of increasing size of Flickr large with k = 1000 and α = 90%. Figure 6.6(c) plots the running times: cspin consistently obtains a speed-up of around 250× over pmia.

6.5.3 Application 2: Diffusion Characterization

We now briefly describe how the GCP problem can help in understanding cascade datasets in an exploratory setting.

Methodology: We used the Flixster dataset, where users can share ratings of movies with friends. There is a log-file which stores the rating actions of each user, and a cascade is assumed to happen when a person rates a movie soon after one of her friends rated the same movie. We use the methodology of [55] to learn the influence probabilities of an IC model over the edges of the friendship network from the traces. We then coarsen the resulting directed graph using coarseNet to α = 50%, and study the formed groups (supernodes). Note that this is in contrast to approaches where the group information is supplied by a graph-partitioning algorithm (like METIS) and a group-based IC model is then learnt. The base network had 55,918 nodes and 559,863 edges. The trace-log contained about 7 million actions over 48,000 movies. After removing groups with only one node, we get 1891 groups, with mean group size 16.6, the largest group having 22,061 nodes (roughly 40% of all nodes).

Distribution of movies over groups: Figure 6.7 shows the histogram of the number of groups reached by the movie propagations (we assume that a movie reaches a group if at least 10% of its nodes rated that movie). We show only the first 100 points of the distribution. We observe that a very large fraction of movies propagate in a small number of groups. Interestingly, we observe a multi-modal distribution, suggesting movies have multiple scales of spread.

Groups through the lens of surrogates: An important point to note is that our groups may not be link-based communities: we just ensure that nodes in a group have the same diffusive properties. We validated this observation in the previous section (Table 6.3).


Figure 6.7: Distribution of the number of groups entered by movie traces (number of movies vs. number of groups).

Hence a natural question is whether the groups found in Flixster have any other natural structure (e.g., demographics): if they do, we can get a non-network external surrogate for similar diffusive characteristics. Fortunately, the Flixster dataset does contain a couple of auxiliary features for its users (like ID, Last Login, Age). We calculated the Mean Absolute Error (MAE) of 'Age' inside each group, and compared it with the MAE across groups. We found that the average MAE inside a group is very small (within 2 years) compared to an MAE of almost 8 outside, which implies that ages are concentrated within groups and can act as surrogates for diffusive characteristics.

6.6 Conclusion

We propose influence-based coarsening as a fundamental operation in the analysis of diffusive processes in large networks. Based on the connections between influence spread and spectral properties of the graph, we propose a novel Graph Coarsening Problem and provide an effective and efficient heuristic called coarseNet. By carefully reweighting the edges after each coarsening step, coarseNet attempts to find a succinct representation of the original network which preserves important diffusive properties. We then describe the cspin framework to solve influence maximization problems on large networks using our coarsening strategy. Experimental results show that cspin indeed outperforms traditional approaches by providing high-quality solutions in a fraction of the time. Finally, we show that our coarseNet framework can also be used for examining cascade datasets in an exploratory setting. We observe that in our case study the nodes merged together form meaningful communities in the sense of having similar diffusive properties, for which external demographic information can serve as a surrogate.


Chapter 7

Temporal Graph Coarsening

In the previous chapter, we studied the Graph Coarsening Problem, which seeks to find a smaller representation of a graph while its diffusion properties are preserved. In this chapter, we extend it to the temporal graph setting, where we assume graphs change over time. As we mentioned in the previous chapter, it is difficult to analyze today's networks due to their large size. It is even more challenging considering that such networks evolve over time. Indeed, typical mining algorithms on dynamic networks are very slow. While summarization of static networks has been studied in the past literature, surprisingly, getting a smaller representation of a temporal network has not received much attention. Since temporal networks are orders of magnitude larger than static networks, their succinct representation is important from a data compression viewpoint too. Hence, in this chapter, we study the problem of 'condensing' a temporal network to get one smaller in size which is nearly 'equivalent' with regards to propagation (see Figure 7.1 for an example). Such a condensed network can be very helpful in downstream data mining tasks, such as 'sense-making', influence maximization, event detection, immunization, and so on. Our contributions in this chapter are:

• Problem formulation: Using the spectral characterization of propagation processes, we formulate a novel and general Temporal Network Condensation problem.
• Efficient Algorithm: We design careful transformations and reductions to develop an effective, near-linear time algorithm, NetCondense, which is also easily parallelizable. It merges unimportant node- and time-pairs to quickly shrink the network without much loss of information.
• Extensive Experiments: Finally, we conduct multiple experiments over large, diverse real datasets to show the correctness, scalability, and utility of our algorithm and of condensation in several tasks; e.g., we show speed-ups of 48 times in influence maximization and 3.8 times in event detection over dynamic networks.

This work has been published in SDM 2017 [1]. Next we first give preliminaries and formally formulate our problem. We then present our proposed method, discuss empirical results, and conclude this chapter.


Figure 7.1: Condensing a Temporal Network


7.1 Preliminaries

We give some preliminaries next. The notations used and their descriptions are summarized in Table 7.1.

Temporal Networks: We focus on the analysis of dynamic graphs as a series of individual snapshots. In this chapter, we consider directed, weighted graphs G = (V, E, W), where V is the set of nodes, E is the set of edges, and W is the set of associated edge-weights w(a, b) ∈ [0, 1]. A temporal network G is a sequence of T graphs, i.e., G = {G1, G2, . . . , GT}, such that the graph at time-stamp i is Gi = (V, Ei, Wi). WLOG, we assume every Gi in G has the same node-set V (as otherwise, if we have Gi with different Vi, we can just define $V = \cup_{i=1}^{T} V_i$). We also assume that, in principle, there is a path for any node to send information to any other node in G (ignoring time), as otherwise we can simply decompose. Our ideas can, however, be easily generalized to other types of dynamic graphs.

Propagation models: We primarily base our discussion on two fundamental discrete-time propagation/diffusion models: the SI [5] and IC models [72]. The SI model is a basic epidemiological model where each node can be in either the 'Susceptible' or the 'Infected' state. In a static graph, at each time-step, a node infected/active with the virus/contagion can infect each of its 'susceptible' (healthy) neighbors independently with probability w(a, b). Once the node is infected, it stays infected. SI is a special case of the general 'flu-like' SIS model, as the 'curing rate' δ (of recovering from the infected state) in SI is 0, while in SIS δ ∈ [0, 1). In the popular IC (Independent Cascade) model, nodes get exactly one chance to infect their healthy neighbors with probability w(a, b); it is a special case of the general 'mumps-like' SIR model, where nodes in the 'Removed' state do not get re-infected, with δ = 1.

We consider generalizations of these models to temporal networks [134], where an infected node can only infect its susceptible 'current' neighbors (as given by G). Note that models on static graphs are special cases of those on temporal networks (with all Gi ∈ G identical).
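To ground these definitions, a minimal stochastic simulation of the temporal SI model might look as follows (our sketch; snapshots is the list G1, ..., GT of weighted networkx DiGraphs sharing the node-set V):

import random

def simulate_temporal_si(snapshots, seeds):
    """One run of the SI model on a temporal network: at time i, every
    infected node infects each susceptible out-neighbor in G_i
    independently with probability w_i(a, b); infections persist."""
    infected = set(seeds)
    for G in snapshots:
        newly = set()
        for a in infected:
            for b in G.successors(a):
                if b not in infected and random.random() < G[a][b]["weight"]:
                    newly.add(b)
        infected |= newly
    return infected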


Table 7.1: Summary of symbols and descriptions

Symbol      Description
G           Temporal network
Gcond       Condensed temporal network
Gi, Ai      ith graph of G and its adjacency matrix
wi(a, b)    Edge-weight between nodes a and b in time-stamp i
αN          Target fraction for nodes
αT          Target fraction for time-stamps
T           # of time-stamps in the temporal network
FG          Flattened network of G
XG          Average flattened network of G
SG          The system matrix of G
FG; XG      The adjacency matrices of FG and XG
λS          Largest eigenvalue of SG
λF; λX      Largest eigenvalues of FG and XG
A           Matrix (bold capital letter)
u, v        Column vectors (bold small letters)

7.2 Our Problem Formulation

Real temporal networks are usually gigantic in size. However, their skewed nature (in terms of various distributions like degree, triangles, etc.) implies the existence of many nodes/edges which are not important in propagation. Similarly, as changes are typically gradual, most adjacent time-stamps are not drastically different. There may also be time-periods with sparse connectivity which will not contribute much to propagation. Overall, these observations intuitively imply that it should be possible to get a smaller 'condensed' representation of G while preserving its diffusive characteristics, which is our task.

It is natural to condense G using only local 'merge' operations on node-pairs and time-pairs of G, such that each application of an operation maintains the propagation property and shrinks G. This will also ensure that successive applications of these operations 'summarize' G in a multi-step hierarchical fashion.

More specifically, merging a node-pair {a, b} will merge nodes a and b into a new super-node, say c, in all Gi in G. Merging a time-pair {i, j} will merge graphs Gi and Gj to create a new super-time, k, and an associated graph Gk. However, allowing merge operations on every possible node-pair and time-pair results in a loss of interpretability of the result. For example, it is meaningless to merge two nodes that belong to completely different communities, or to merge times which are five time-stamps apart. Therefore, we have to limit the merge operations in a natural and well-defined way. This also ensures that the resulting summary is useful for downstream applications. We allow a single node-merge only on node pairs {a, b} such that {a, b} ∈ Ei for at least one Gi, i.e., {a, b} is an edge in the unweighted 'union graph' UG(V, Eu = ∪iEi). Similarly, we restrict a single time-merge to adjacent time-stamps only. Note that we can still apply multiple successive merges to merge multiple node-pairs/time-pairs. Our general problem is:

Informal Problem 7.1. Given a temporal network G = {G1, G2, . . . , GT} with Gi = (V, Ei, Wi) and target fractions αN ∈ (0, 1] and αT ∈ (0, 1], find a condensed temporal network Gcond = {G′1, G′2, . . . , G′T′} with G′i = (V′, E′i, W′i) by repeatedly applying "local" merge operations on node-pairs and time-pairs such that (a) |V′| = (1 − αN)|V|; (b) T′ = (1 − αT)T; and (c) Gcond approximates G w.r.t. propagation-based properties.

7.2.1 Formulation framework

Formalizing Informal Problem 7.1 is challenging, as we need to tackle the following two research questions: (Q1) characterize and quantify the propagation-based property of a temporal network G; (Q2) define the "local" merge operations.

In general, Q1 is difficult, as the characterization should be scalable and concise. For Q2, the merges are local operations, so intuitively they should be defined such that any local diffusive changes caused by them are minimal. Using Q1 and Q2, we can formulate Informal Problem 7.1 as an optimization problem where the search space is all possible temporal networks of the desired size that can be constructed via some sequence of repeated merges from G.

7.2.2 Q1: Propagation-based property

One possible naive answer is to run some diffusion model on G and Gcond and see if the propagation is similar; but this is too expensive. Therefore, we want to find a tractable, concise metric that can characterize and quantify propagation on a temporal network.

A major metric of interest in propagation on networks is the epidemic threshold, which indicates whether the virus/contagion will quickly spread throughout the network (and cause an 'epidemic') or not, regardless of the initial conditions. Past work [44,133] has studied epidemic thresholds for various epidemic models on static graphs. Recently, [134] showed that in the context of temporal networks and the SIS model, the threshold depends on the largest eigenvalue λ of the so-called system matrix of G: an epidemic will not happen in G if λ < 1. The result in [134] was only for undirected graphs; however, it can be easily extended to weighted directed G with a strongly connected union graph UG (which just implies that in principle any node can infect any other node via a path, ignoring time; otherwise we can just examine each connected component separately).


Definition 7.1. System Matrix: For the SI model, the system matrix $S_G$ of a temporal network G = {G1, G2, ..., GT} is defined as $S_G = \prod_{i=1}^{T}(I + A_i)$,

where $A_i$ is the weighted adjacency matrix of Gi. For the SI model, the rate of infection is governed by λS, the largest eigenvalue of SG. Preserving λS while condensing G to Gcond will imply that the rate at which the virus spreads in G and in Gcond will be preserved too. Therefore λS is a well-motivated and meaningful metric to preserve during condensation.
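The system matrix can be computed directly, as in the dense-numpy sketch below (ours), though, as Section 7.3 discusses, the product densifies quickly and is feasible only for small networks:

import numpy as np

def system_matrix(adj_mats):
    """S_G = prod_{i=1}^{T} (I + A_i), per Definition 7.1."""
    n = adj_mats[0].shape[0]
    S = np.eye(n)
    for A in adj_mats:           # A_i: weighted adjacency matrix of G_i
        S = S @ (np.eye(n) + A)  # each multiplication densifies S further
    return S

def lambda_S(adj_mats):
    """Largest eigenvalue (in magnitude) of the system matrix."""
    return max(abs(np.linalg.eigvals(system_matrix(adj_mats))))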

7.2.3 Q2: Merge Definitions

We define two operators: µ(G, i, j) merges a time-pair {i, j} in G into a super-time k in Gcond, while ζ(G, a, b) merges the node-pair {a, b} in all Gi ∈ G and results in a super-node c in Gcond.

As stated earlier, we want to condense G by successive applications of µ and ζ. We also want them to preserve local changes in diffusion in the locality of the merge operands. At the node level, the level where local merge operations are performed, the diffusion process is best characterized by the probability of infection. Hence, working from first principles, we design these operations to maintain the probabilities of infection before and after the merges in the 'locality of change', without worrying about the system matrix. For µ(G, i, j), the 'locality of change' is Gi, Gj and the new Gk; whereas for ζ(G, a, b), the 'locality of change' is the neighborhood of {a, b} in all Gi ∈ G.

Figure 7.2: (a) Time merge of a single edge: example of the merge operation on a single edge (a, b) when time-pair {i, j} is merged to form super-time k. (b) Node merge in a single time: example of node-pair {a, b} being merged in a single time i to form super-node c.

Time-pair Merge: Consider a merge µ(G, i, j) between consecutive times i and j. Consider any edge (a, b) in Gi and Gj (note that if (a, b) ∉ Ei, then wi(a, b) = 0) and assume that node a is infected and node b is susceptible in Gi (illustrated in Figure 7.2 (a)). Now, node a can infect node b in i via an edge in Gi, or in j via an edge in Gj. We want to maintain the local effects of propagation via the merged time-stamp Gk. Hence we need to readjust the edge-weights in Gk such that they capture the probability of a infecting b in G (in i and j).

Lemma 7.1. (Infection via i & j) Let Pr(a → b|Gi, Gj) be the probability that a infects b in G in either time i or j, if it is infected in Gi. Then Pr(a → b|Gi, Gj) = wi(a, b) + wj(a, b), up to a first-order approximation.


Proof. In the SI model, for node a to infect node b in the time-pair {i, j}, the infection occurs either in Gi or in Gj. Therefore,

$\Pr(a\to b|G_i, G_j) = w_i(a,b) + (1 - w_i(a,b))\,w_j(a,b) = w_i(a,b) + w_j(a,b) - w_i(a,b)\,w_j(a,b)$

Ignoring the lower-order term, we get $\Pr(a\to b|G_i, G_j) = w_i(a,b) + w_j(a,b)$.
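For instance, if $w_i(a,b) = w_j(a,b) = 0.1$, the exact probability is $0.1 + 0.9 \times 0.1 = 0.19$, while the approximation gives $0.2$; the approximation is thus tight exactly when edge-weights are small.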

Lemma 7.1 suggests that the condensed time-stamp k, after merging the time-pair {i, j}, should have $A_k = A_i + A_j$. However, consider a G such that all Gi in G are the same. This is effectively a static network; hence the time-merges should give the network Gi rather than T·Gi. This discrepancy arises because for any single time-merge, as we reduce 'T' from 2 to 1, to maintain the final spread of the model we have to increase the infectivity along each edge by a factor of 2 (intuitively, speeding up the model [64]). Hence, the condensed network at time k should instead be $A_k = \frac{A_i + A_j}{2}$, while for the SI model the rate of infection should be doubled for time k in the system matrix. Motivated by these considerations, we define a time-stamp merge as follows:

Definition 7.2. Time-Pair Merge µ(G, i, j). The merge operator µ(G, i, j) returns a new time-stamp k with weighted adjacency matrix $A_k = \frac{A_i + A_j}{2}$.

Node-pair Merge: Similarly, in ζ(G, a, b) we need to adjust the weights of the edges to maintain the local effects of diffusion between a and b and their neighbors. Note that when we merge two nodes, we need to merge them in all Gi ∈ G.

Consider any time i. Suppose we merge {a, b} in Gi to form the super-node c in G′i (note that G′i ∈ Gcond). Consider a node x such that {a, x} and {b, x} are edges in Gi (illustrated in Figure 7.2 (b)). When c is infected in G′i, it is intuitive to assume that either node a or node b is infected in Gi, uniformly at random. Hence we need to update the edge-weight from c to x in G′i such that the new edge-weight reflects the probability that either node a or b infects x in Gi.

Lemma 7.2. (Probability of infecting out-neighbors) If either node a or node b is infected in Gi and they are merged to form a super-node c, then the first-order approximation of the probability of node c infecting its out-neighbors is given by:


$\Pr(c\to z|G_i) = \begin{cases} \frac{w_i(a,z)}{2} & \forall z \in Nb^o_i(a)\setminus Nb^o_i(b) \\[4pt] \frac{w_i(b,z)}{2} & \forall z \in Nb^o_i(b)\setminus Nb^o_i(a) \\[4pt] \frac{w_i(a,z)+w_i(b,z)}{4} & \forall z \in Nb^o_i(a)\cap Nb^o_i(b) \end{cases}$

where $Nb^o_i(v)$ is the set of out-neighbors of node v in time-stamp i. We can write down the corresponding probability Pr(z → c|Gi) (for getting infected by in-neighbors) similarly.

Proof. Recall that $Nb^o_i(v)$ is the set of out-neighbors of node v at time-stamp i. When super-node c is infected in G′i ∈ Gcond (the summary network), either node a or node b is infected in the underlying original network, i.e., in Gi ∈ G, uniformly at random. Hence, for a node $z \in Nb^o_i(a)\setminus Nb^o_i(b)$, the probability of node c infecting z is

$\Pr(c\to z|G_i) = \dfrac{\Pr(a\to z|G_i) + \Pr(b\to a|G_{i-1})\Pr(a\to z|G_i)}{2}$

That is, if a is infected, it infects z at time i directly; but for b to infect z at time i, b has to infect a at time i − 1, and then a infects z at time i. Rewriting the probabilities in terms of edge-weights,

$\Pr(c\to z|G_i) = \dfrac{w_i(a,z) + w_{i-1}(b,a)\,w_i(a,z)}{2}$

Ignoring the lower-order term, we get

$\Pr(c\to z|G_i) = \dfrac{w_i(a,z)}{2}$

The other cases follow similarly.

Motivated by Lemma 7.2, we define node-pair merge as:

Definition 7.3. Node-Pair Merge ζ(G, a, b). The merge operator ζ(G, a, b) merges a and b to form a new super-node c in all Gi ∈ G, s.t. wi(c, z) = Pr(c → z|Gi) and wi(z, c) = Pr(z → c|Gi).


7.2.4 Problem Definition

We can now formally define our problem.

Problem 7.1. (Temporal Network Condensation Problem (TNC)) Given a temporal network G = {G1, G2, . . . , GT} with strongly connected UG, αN ∈ (0, 1] and αT ∈ (0, 1], find a condensed temporal network Gcond = {G′1, G′2, . . . , G′T′} with G′i = (V′, E′i, W′i) by repeated applications of µ(G, ·, ·) and ζ(G, ·, ·), such that |V′| = (1 − αN)|V|; T′ = (1 − αT)T; and Gcond minimizes $|\lambda_S - \lambda^{cond}_S|$.

Problem 7.1 naturally contains the graph coarsening problem for a static network from the previous chapter (which aims to preserve the largest eigenvalue of the adjacency matrix) as a special case: when G = {G}. GCP itself is a challenging problem, as it is related to immunization problems. Hence, Problem 7.1 is intuitively even more challenging.

7.3 Our Proposed Method

The naive algorithm is combinatorial. Even a greedy method that computes the next best merge operands at each step takes $O(\alpha_N V^6)$ time, even without time-pair merges. In fact, even computing SG is inherently non-trivial due to the matrix multiplications. It does not scale well for large temporal networks, because SG gets denser as the number of time-stamps in G increases. Moreover, since SG is a dense matrix of size |V| by |V|, it does not even fit in main memory for large networks. Even if there were an algorithm for Problem 7.1 that could bypass computing SG, λS would still have to be computed to measure success. Therefore, even just measuring success for Problem 7.1, as is, seems hard.

7.3.1 Main idea

To solve these numerical and computational issues, our idea is to find an alternate representation of G such that the new representation has the same diffusive properties and avoids the issues of SG. We then develop an efficient sub-quadratic algorithm.

Our main idea is to look for a static network that is similar to G with respect to propagation. We do this in two steps. First we show how to construct a static flattened network FG, and show that it has similar diffusive properties as G. We also show that the eigenvalues of SG and of the adjacency matrix FG of FG are precisely related. Due to this, computing the eigenvalues of FG is difficult too. Then, in the second step, we derive a network from FG whose largest eigenvalue is easier to compute and related to the largest eigenvalue of FG. Using it, we propose a new related problem, and solve it efficiently.


Figure 7.3: (a) A temporal network G, and (b) the corresponding flattened network FG.

7.3.2 Step 1: An Alternate Static View

Our approach for getting a static version is to expand G and create layers of nodes, such that edges in G are captured by edges between the nodes in adjacent layers (see Figure 7.3). We call this the 'flattened network' FG.

Definition 7.4. Flattened network. FG for G is defined as follows:

• Layers: FG consists of layers 1, ..., T corresponding to the T time-stamps in G.
• Nodes: Each layer i has |V| nodes (so FG has T|V| nodes overall). Node a in the temporal network G at time i is represented as ai in layer i of FG.
• Edges: At each layer i, each node ai has a directed edge to a(i+1) mod T in layer (i + 1) mod T with edge-weight 1. And for each time-stamp Gi in the temporal network G, if there is a directed edge (a, b), then in FG we add a directed edge from node ai to node b(i+1) mod T with weight wi(a, b).

For the relationship between G and FG, consider the SI model running on G (Figure 7.3 (a)). Say node a is infected in G1, which also means node a1 is infected in FG (Figure 7.3 (b)). Assume a infects b in G1. So at the beginning of G2, a and b are infected. Correspondingly, in FG node a1 infects nodes a2 and b2. Now in G2, no further infection occurs, so the same nodes a and b are infected in G3. However, in FG infection occurs between layers 2 and 3, which means a2 infects a3 and b2 infects b3. Propagation in FG is different from that in G, as each 'time-stamped' node gets exactly one chance to infect others. Note that the propagation model on FG we just described is the popular IC model. Hence, running the SI model on G should be "equivalent" to running the IC model on FG in some sense.

We formalize this next. Assume we have the SI model on G and the IC model on FG starting from the same node-set of size I(0). Let $I^{G}_{SI}(t)$ be the expected number of infected nodes in G at the end of time t. Similarly, let $I^{F_G}_{IC}(T)$ be the expected number of infected nodes under the IC model in FG until the end of time T. Note that $I^{F_G}_{IC}(0) = I^{G}_{SI}(0) = I(0)$. Then:


Lemma 7.3. (Equivalence of propagation in G and FG) We have $\sum_{t=1}^{T} I^{G}_{SI}(t) = I^{F_G}_{IC}(T)$.

Proof. To prove the lemma, we first show that

$\sum_{t=0}^{T-1} I^{G}_{SI}(t) = I^{F_G}_{IC}(T-1)$ (7.1)

We prove this by induction over the time-stamps t = 0, 1, ..., T − 1.

Base case: At t = 0, since the seed sets are the same, the infections in both models are the same. Hence, $I^{G}_{SI}(0) = I^{F_G}_{IC}(0)$.

Inductive step: Let the inductive hypothesis be that for time-stamp 0 < k < T − 1, $\sum_{t=0}^{k} I^{G}_{SI}(t) = I^{F_G}_{IC}(k)$.

Let $\delta^{G}_{SI}(k+1)$ be the number of new infections in the SI model in G at time k + 1. Hence, the total number of infected nodes at time k + 1 is $I^{G}_{SI}(k) + \delta^{G}_{SI}(k+1)$. Similarly, let $\delta^{F_G}_{IC}(k+1)$ be the number of newly infected nodes in FG at time k + 1. Since $\delta^{G}_{SI}(k+1)$ new nodes get infected in the SI model in G, the same nodes in layer k + 2 get infected in FG. Moreover, all the nodes that were infected in layer k + 1 at time k in FG infect the corresponding nodes in the next layer. Hence,

$\delta^{F_G}_{IC}(k+1) = \delta^{G}_{SI}(k+1) + I^{G}_{SI}(k)$

Now we have

$\sum_{t=0}^{k+1} I^{G}_{SI}(t) = \sum_{t=0}^{k} I^{G}_{SI}(t) + I^{G}_{SI}(k) + \delta^{G}_{SI}(k+1)$

By the inductive hypothesis, we get

$\sum_{t=0}^{k+1} I^{G}_{SI}(t) = I^{F_G}_{IC}(k) + I^{G}_{SI}(k) + \delta^{G}_{SI}(k+1) = I^{F_G}_{IC}(k) + \delta^{F_G}_{IC}(k+1) = I^{F_G}_{IC}(k+1)$

Now, at time T, infection in G occurs in time-stamp T, but infection in FG occurs between layers T and 1. Recall that nodes are seeded in layer 1 of FG for the IC model, and hence they cannot get infected again. Therefore, the difference between the cumulative sum of infections of SI and the total infections of IC is $I^{F_G}_{IC}(0)$. Hence,

$\sum_{t=0}^{T} I^{G}_{SI}(t) = I^{F_G}_{IC}(T) + I^{F_G}_{IC}(0)$

Since $I^{F_G}_{IC}(0) = I^{G}_{SI}(0)$, we get $\sum_{t=1}^{T} I^{G}_{SI}(t) = I^{F_G}_{IC}(T)$.

That is, the cumulative expected infections for the SI model on G are the same as the infections after time T for the IC model on FG. This suggests that the largest eigenvalues of SG and FG are closely related. In fact, we can prove the stronger statement that the spectra of FG and SG are closely related (Lemma 7.4).

Lemma 7.4. (Eigen-equivalence of SG and FG) We have $(\lambda_F)^T = \lambda_S$. Furthermore, λ is an eigenvalue of FG iff $\lambda^T$ is an eigenvalue of SG.

Proof. $A_i$ is the weighted adjacency matrix of $G_i \in G$; it has size $|V| \times |V|$. Any vector $x_i$ is of size $|V| \times 1$. By the definition of FG,

$F_G = \begin{pmatrix} 0 & I + A_1 & 0 & \cdots & 0 \\ 0 & 0 & I + A_2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & I + A_{T-1} \\ I + A_T & 0 & 0 & \cdots & 0 \end{pmatrix}$

Now, solving for an eigenvalue of FG,

$F_G \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{T-1} \\ x_T \end{pmatrix} = \lambda \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{T-1} \\ x_T \end{pmatrix}$

From the above, we get

$(I + A_1)x_2 = \lambda x_1,\quad (I + A_2)x_3 = \lambda x_2,\quad \ldots,\quad (I + A_T)x_1 = \lambda x_T$

Multiplying the equations together,

$\left[\prod_{i=1}^{T}(I + A_i)\right] x_1 = \lambda^T \cdot x_1$

Finally,

$S_G \cdot x_1 = \lambda^T \cdot x_1$

From the above, we get that $\lambda^T$ is an eigenvalue of SG if λ is an eigenvalue of FG. The same argument in reverse proves the converse.

Now, since UG is strongly connected, we have $|\lambda_F| \ge |\lambda|$ for any λ that is an eigenvalue of FG. We also have that if $|x| > |y|$ then $|x^k| > |y^k|$ for any k > 1. Therefore there is no λ such that $|\lambda^T| > |\lambda_F^T|$. So $\lambda_F^T$ has to be the principal eigenvalue of SG.

Lemma 7.4 implies that preserving λS in G is equivalent to preserving λF in FG. Therefore, Problem 7.1 can be re-written in terms of λF (of a static network) instead of λS (of a temporal one).
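Lemma 7.4 is also easy to check numerically on a small example, reusing the system_matrix and flattened_matrix sketches above (and assuming a strongly connected union graph):

import numpy as np

def check_lemma_7_4(adj_mats):
    """Numerically verify (lambda_F)^T == lambda_S."""
    T = len(adj_mats)
    F = flattened_matrix(adj_mats).toarray()
    lam_F = max(abs(np.linalg.eigvals(F)))   # real and positive here
    return np.isclose(lam_F ** T, lambda_S(adj_mats))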

7.3.3 Step 2: A Well Conditioned Network

However, λF is problematic too. The difficulty in computing λF arises because FG is ill-conditioned, so modern packages take many iterations and the result may be imprecise. Intuitively, it is easy to understand why computing λF is difficult: if it were not, computing λS itself would have been easy (just compute λF and raise it to the T-th power).

So we create a new static network that has a close relation with FG and whose adjacency matrix is well-conditioned. To this end, we look at the average flattened network XG, whose adjacency matrix is defined as $X_G = \frac{F_G + F_G'}{2}$, where $F_G'$ is the transpose of FG. It is easy to see that the traces of XG and FG are equal, which means that the sums of the eigenvalues of XG and FG are equal. Moreover, we have the following:

Lemma 7.5. (Eigenvalue relationship of FG and XG) The largest eigenvalue of FG, λF, and the largest eigenvalue of XG, λX, are related as λF ≤ λX.

Proof. First, according to the definition, $X_G = \frac{F_G + F_G'}{2}$. Let λ(FG) be the spectrum of FG and λ(XG) be the spectrum of XG. Let λX be the largest eigenvalue of XG, and let Re(c) denote the real part of c. Now, λ(FG) and λ(XG) are related by the majorization relation [181], i.e., Re(λ(FG)) ≺ λ(XG), which implies that any eigenvalue λ ∈ λ(FG) satisfies Re(λ) ≤ λX. Since the union graph UG is strongly connected, FG is strongly connected. Hence, by the Perron-Frobenius theorem, the largest eigenvalue of FG, λF, is real and positive. Therefore, λF ≤ λX.


Note that if λX < 1, then λF < 1. Moreover, if λF < 1, then λS < 1. Hence if there is no epidemic in XG, then there is no epidemic in FG as well, which implies that the rate of spread in G is low. Hence, XG is a good proxy static network for FG and G, and λX is a well-motivated quantity to preserve. Also, we need only weak connectedness of UG for λX (and the corresponding eigenvectors) to be real and positive (by the Perron-Frobenius theorem). Furthermore, XG is free of the problems faced by FG and SG: it is well-conditioned and its eigenvalue can be efficiently computed.
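Computing λX is then a standard, well-behaved sparse symmetric eigenproblem (a sketch, continuing from the flattened_matrix helper above):

from scipy.sparse.linalg import eigsh

def lambda_X(adj_mats):
    """Largest eigenvalue and eigenvector of X_G = (F_G + F_G') / 2."""
    F = flattened_matrix(adj_mats)
    X = ((F + F.T) / 2).tocsr()              # symmetric, well-conditioned
    vals, vecs = eigsh(X, k=1, which="LA")   # symmetric Lanczos solver
    return vals[0], vecs[:, 0]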

New problem: Considering all of the above, we re-formulate Problem 7.1 in terms of λX. Since G and XG are closely related networks, the merge definitions on XG can be easily extended from those on G.

Note that edges in one time-stamp of G are represented between two layers in XG, and edges in two consecutive time-stamps in G are represented in three consecutive layers in XG. Hence, merging a time-pair in G corresponds to merging three layers of XG.

A notable difference between µ(G, ·, ·) and µ(XG, ·, ·) arises due to the difference in propagation models: we have the SI model on G, whereas we have the IC model on XG. Since a node gets only a single chance to infect its neighbors in the IC model, the infectivity does not need re-scaling in XG. Despite this difference, the merge definitions on G and XG remain identical.

Let us assume we are merging time-stamps i and j in G. For this, we need to look at the edges between layers i and j, and between j and k, where k is the layer following j. Now, merging time-stamps i and j in G corresponds to merging layers i and j in XG and updating the out-links and in-links of the new layer. Let wi,j(a, b) be the edge-weight between a node a in layer i and a node b in layer j.

Definition 7.5. Time-Pair Merge µ(XG, i, j). The merge operator µ(XG, i, j) results in a new layer m, such that the edge-weight between any node a in layer m and b in layer k, wm,k(a, b), is defined as

$w_{m,k}(a,b) = \dfrac{w_{i,j}(a,b) + w_{j,k}(a,b)}{2}$

Note that for h = (i − 1) mod T, wh,i and wh,m are equal, since time-stamp h in G does not change. And as XG is symmetric, wm,k and wk,m are equal. Similarly, we extend the node-pair merge definition to XG as follows. As in G, we merge node-pairs in all layers of XG.

Definition 7.6. Node-Pair Merge ζ(XG, a, b). Let Nbo(v) denote the set of out-neighbors of a node v, and let wi,j(a, b) be the edge-weight from any node a in layer i to any node b in layer j. Then the merge operator ζ(XG, a, b) merges the node pair a, b to form a super-node c, whose edges to out-neighbors are weighted as


$w_{i,j}(c,z) = \begin{cases} \frac{w_{i,j}(a,z)}{2} & \forall z \in Nb^o(a)\setminus Nb^o(b) \\[4pt] \frac{w_{i,j}(b,z)}{2} & \forall z \in Nb^o(b)\setminus Nb^o(a) \\[4pt] \frac{w_{i,j}(a,z)+w_{i,j}(b,z)}{4} & \forall z \in Nb^o(a)\cap Nb^o(b) \end{cases}$

Finally, our problem can be re-formulated as follows.

Problem 7.2. Given G with weakly connected UG over V, αN and αT, find Gcond by repeated application of µ(XG, ·, ·) and ζ(XG, ·, ·) such that |V′| = (1 − αN)|V|; T′ = (1 − αT)T; and Gcond minimizes $|\lambda_X - \lambda^{cond}_X|$.

7.3.4 NetCondense

In this section, we propose a fast greedy algorithm for Problem 7.2 called NetCondense, which takes only sub-quadratic time in the size of the input. Again, the obvious approach is combinatorial. Consider a greedy approach using the ∆-Score.

Definition 7.7. ∆-Score. $\Delta_{X_G}(a, b) = |\lambda_X - \lambda^{cond}_X|$, where $\lambda^{cond}_X$ is the largest eigenvalue of the new XG after merging a and b (a node- or time-pair).

The greedy approach will successively choose, at each step, the merge operands which have the lowest ∆-Score. Doing this naively leads to quartic time (due to repeated re-computations of λX for all possible time/node-pairs). Recall that we limit time-merges to adjacent time-pairs, and node-merges to node-pairs with an edge in at least one Gi ∈ G, i.e., an edge in the union of all Gi ∈ G, called the union graph UG. Now, computing the ∆-Score naively even just for all edges (a, b) ∈ UG is still expensive, as it requires computing the eigenvalue of XG for each node-pair. Hence we instead estimate the ∆-Score for node/time-pairs using Matrix Perturbation Theory [158]. Let v be the eigenvector of XG corresponding to λX, let v(ai) be the 'eigenscore' of node ai in XG, and let XG(ai, bi) be the entry of XG in row ai and column bi. Now we have the following lemmas.

Lemma 7.6. (∆-Score for time-pair) Let Vi = nodes in layer i of XG. Now, for merge µ(XG, i, j) to form k,

$\Delta_{X_G}(i,j) = \dfrac{-\lambda_X\left(\sum_{i\in V_i,V_j} v(i)^2\right) + \sum_{k\in V_k} v(i)\,k_o^T v + Y}{v^T v - \sum_{i\in V_i,V_j} v(i)^2}$

up to a first-order approximation, where $\eta(i,j) = v(i)v(j)$, $Y = \sum_{i\in V_i, j\in V_j} (2\cdot\eta(i,j))\,X_G(i,j)$, and $k_o^T v = \frac{1}{2}(\lambda_X v(i) + \lambda_X v(j) + v(i) + v(j))$.


Proof. For convenience, we write λX as λ and XG as X. Similarly, we write v(xi) as vi when i is clear from the context.

By matrix perturbation theory, we have

$\Delta\lambda = \dfrac{v^T\Delta X\,v + v^T\Delta X\,\Delta v}{v^T v + v^T\Delta v}$ (7.2)

When we merge a time-pair in X, we essentially merge the blocks corresponding to time-stamps i and j in X to create new blocks corresponding to time-stamp k. Since we want to maintain the size of the matrix as the algorithm proceeds, we place layer k in layer i's place and set the rows and columns of X corresponding to layer j to zero. Therefore, the change in X can be written as

$\Delta X = \sum_{i\in V_i,V_j} -(i_i e_i^T + e_i i_o^T) + \sum_{k\in V_k} (k_i e_k^T + e_k k_o^T)$ (7.3)

where $e_a$ is a column vector with 1 at position a and 0 elsewhere, and $k_i$ and $k_o^T$ are the k-th column and row vectors of X respectively. Similarly, the change in the right eigenvector can be written as

$\Delta v = \sum_{i\in V_i,V_j} -(v_i e_i) + \delta$ (7.4)

As δ is very small, we can ignore it. We note that $v^T e_i = v_i$, $v^T i_i = \lambda v_i$ and $i_o^T v = \lambda v_i$. Now we can compute Eqn. (7.2) as follows:

$v^T\Delta X\,v = \sum_{i\in V_i,V_j} -(v_i v^T i_i + v_i i_o^T v) + \sum_{k\in V_k} (v_i v^T k_i + v_i k_o^T v)$ (7.5)

Further simplifying,

$v^T\Delta X\,v = \sum_{i\in V_i,V_j} -(2\lambda v_i^2) + \sum_{k\in V_k} (v_i v^T k_i + v_i k_o^T v)$ (7.6)

And we have

$v^T\Delta v = v^T \sum_{i\in V_i,V_j} (-v_i e_i) = -\sum_{i\in V_i,V_j} v_i^2$ (7.7)

Similarly,

$v^T\Delta X\,\Delta v = v^T\Big[\sum_{i\in V_i,V_j} -(i_i e_i^T + e_i i_o^T) + \sum_{k\in V_k} (k_i e_k^T + e_k k_o^T)\Big]\Big[\sum_{i\in V_i,V_j} -(v_i e_i)\Big]$ (7.8)

Here we note that there are edges between two layers of X only if they are adjacent; moreover, the edges go in both directions between the layers. So,

$v^T\Delta X\,\Delta v = \sum_{i\in V_i,V_j} \lambda v_i v_j + \sum_{i\in V_i, j\in V_j} (v_i v_j + v_j v_i)X(i,j) - \sum_{k\in V_k} v_i v^T k_i - \sum_{k\in V_k, j\in V_j} v_i k_o^T v_j e_j$ (7.9)

Since self loops have no impact on diffusion, we can set $\sum_{k\in V_k, j\in V_j} v_i k_o^T v_j e_j = 0$, giving

$v^T\Delta X\,\Delta v = \sum_{i\in V_i,V_j} \lambda v_i v_j + \sum_{i\in V_i, j\in V_j} 2 v_i v_j X(i,j) - \sum_{k\in V_k} v_i v^T k_i$ (7.10)

Putting it all together, we have

$\Delta\lambda = \dfrac{-\lambda\sum_{i\in V_i,V_j} v_i^2 + \sum_{i\in V_i, j\in V_j} 2 v_i v_j X(i,j) + \sum_{k\in V_k} v_i k_o^T v}{v^T v - \sum_{i\in V_i,V_j} v_i^2}$ (7.11)

Note that in a time-pair merge we merge the same node in two different layers of X, corresponding to the two time-stamps in G. Let i be a node in layer ti and j the same node in layer tj, and suppose we merge them to get the new node k in tk. Notice that i and j cannot have common neighbors. Let Nbo(v) be the set of out-neighbors of node v; for brevity, let I = Nbo(i) and J = Nbo(j). As per the re-weighting in Definition 7.5, we have

$k_o^T v = \sum_{y\in I} v_y k^o_y + \sum_{z\in J} v_z k^o_z = \sum_{y\in I} v_y\,\tfrac{1}{2}\,i^o_y + \sum_{z\in J} v_z\,\tfrac{1}{2}\,j^o_z$ (7.12)

where $k^o_y$ denotes the entry of $k_o$ corresponding to node y. Now, let w be the edge-weight between i and j. Since $\lambda v_i = \sum_{y\in I} v_y i^o_y - v_j w$, we have $\sum_{y\in I} v_y i^o_y = \lambda v_i + v_j w$. Similarly, $\sum_{z\in J} v_z j^o_z = \lambda v_j + v_i w$. By construction, we have w = 1 in X. Hence,

$k_o^T v = \tfrac{1}{2}(\lambda_X v_i + \lambda_X v_j + v_i + v_j)$ (7.13)

Lemma 7.7. (∆-Score for node-pair) Let Va = {a1, a2, . . . , aT} ⊂ XG be the nodes corresponding to node a in G. For merge ζ(XG, a, b) to form c,

$\Delta_{X_G}(a,b) = \dfrac{-\lambda_X\left(\sum_{a\in V_a,V_b} v(a)^2\right) + \sum_{c\in V_c} v(a)\,c_o^T v + Y}{v^T v - \sum_{a\in V_a,V_b} v(a)^2}$

up to a first-order approximation, where $\eta(a,b) = v(a)v(b)$, $Y = \sum_{a\in V_a, b\in V_b} (2\eta(a,b))\,X_G(a,b)$, and $c_o^T v = \frac{1}{2}\lambda_X(v(a) + v(b))$.

Proof. Following steps similar to those in the proof of Lemma 7.6, we obtain

$\Delta\lambda = \dfrac{-\lambda\sum_{a\in V_a,V_b} v_a^2 + \sum_{a\in V_a, b\in V_b} 2 v_a v_b X(a,b) + \sum_{c\in V_c} v_a\,c_o^T v}{v^T v - \sum_{a\in V_a,V_b} v_a^2}$ (7.14)

In a node-pair merge, we merge node a with node b within every layer of XG. Consider the merged node c in a given layer, and for brevity let I = Nbo(a) and J = Nbo(b) in that layer. As per the re-weighting in Definition 7.6,

$c_o^T v = \sum_{y\in I} v_y\,\tfrac{1}{2}\,a^o_y + \sum_{z\in J} v_z\,\tfrac{1}{2}\,b^o_z$ (7.15)

Now, let w be the edge-weight between a and b within the layer. As in Lemma 7.6, $\sum_{y\in I} v_y a^o_y = \lambda v_a + v_b w$ and $\sum_{z\in J} v_z b^o_z = \lambda v_b + v_a w$. Since a and b lie in the same layer, and X has edges only between adjacent layers, we have w = 0. Hence,

$c_o^T v = \tfrac{1}{2}\,\lambda_X(v(a) + v(b))$ (7.16)


Lemma 7.8. NetCondense has a sub-quadratic time complexity of $O(TE_u + E\log E + \alpha_N\theta T V + \alpha_T E)$, where θ is the maximum degree in any Gi ∈ G, and a linear space complexity of O(E + TV).

Proof. Lines 1 and 2 of NetCondense take O(E) time. Calculating the largest eigenvalue and the corresponding eigenvector of XG using the Lanczos algorithm takes O(E) time [165]. Lines 3 and 4 take O(TV + E) time. It takes O(T) time to calculate the score for each node pair; therefore, Lines 5 and 6 take O(TEu). Line 7 takes O(T log T), and Line 8 has a worst-case time complexity of O(E log E). Now, for Lines 10 to 12, merging a node-pair requires us to look at the neighbors of the nodes being merged at each time-stamp. Hence, it takes O(Tθ) time, and we require αNV merges. Therefore, the time complexity of all node-merges is O(αNθTV). Similarly, it takes O(αTE) time for all time-merges.

Therefore, the time complexity of NetCondense is O(TEu + E log E + αNθTV + αTE). Note that the complexity is sub-quadratic, as Eu ≤ V² (in our large real datasets, we found Eu ≪ V²).

For NetCondense, we need O(E) space to store XG and Gcond. We also need O(Eu) and O(T) space to store the scores for node-pairs and time-pairs respectively. To store the eigenvector of XG, we require O(TV) space. Therefore, the total space complexity of NetCondense is linear: O(E + TV).

Algorithm 7.1 NetCondense

Require: Temporal graph G, 0 < αN < 1, 0 < αT < 1
Ensure: Temporal graph Gcond(V′, E′, T′)
1: Obtain XG using Definition 7.4
2: for every adjacent time-pair {i, j} do
3:     Calculate ∆XG(i, j) using Lemma 7.6
4: end for
5: for every node-pair {a, b} in UG do
6:     Calculate ∆XG(a, b) using Lemma 7.7
7: end for
8: Sort the lists of ∆-Score for time-pairs and node-pairs
9: Gcond = G
10: while |V′| > αN · |V| or T′ > αT · T do
11:     (x, y) ← node-pair or time-pair with the lowest ∆-Score
12:     Gcond ← µ(Gcond, x, y) or ζ(Gcond, x, y)
13: end while
14: return Gcond

Parallelizability: We can easily parallelize NetCondense: once the eigenvector of XG is computed, the ∆-Scores for node-pairs and time-pairs (the loops in Lines 3 and 5 of Algorithm 7.1) can be computed independently of each other in parallel. Similarly, the µ and ζ operators (in Line 11) are also parallelizable.

7.4 Experiments

7.4.1 Experimental Setup

We briefly describe our set-up next. All experiments are conducted on a machine with four Xeon E7-4850 CPUs and 512GB of 1066MHz RAM.

Datasets. We run NetCondense on a variety of real datasets (Table 7.2) of varying sizes from different domains, such as social interactions (WorkPlace, School, Chess), co-authorship (Arxiv, DBLP) and communication (Enron, Wikipedia, WikiTalk). They include weighted networks, both directed and undirected. Edge-weights are normalized to the range [0, 1].

WorkPlace and School are contact networks publicly available from SocioPatterns¹. In both datasets, edges indicate that two people were in proximity in the given time-stamp. Weights represent the total time of interaction in each day.

Enron is a publicly available dataset². It contains edges between core employees of the corporation aggregated over 44 weeks. Weights in Enron represent the count of emails.

Chess is a network between chess players. The edge-weight represents the number of games played in the time-stamp.

Arxiv is a co-authorship network of scientific papers from arXiv's High Energy Physics Phenomenology section. We aggregate this network yearly, where the weights are the number of co-authored papers in the given year.

ProsperLoan is a loan network among users of Prosper.com. We aggregate the loan interactions among users to define weights.

Wikipedia is an edit network among users of English Wikipedia. The edges represent that two users edited the same page, and the weights are the count of such events.

WikiTalk is a communication network among users of Spanish Wikipedia. An edge between nodes represents that the users communicated with each other in the given time-stamp. The weight in this dataset is the aggregated count of communications.

DBLP is a co-authorship network from the DBLP bibliography, where two authors have an edge between them if they have co-authored a paper in the given year. We define the weights as the number of co-authored papers in the given year.

¹ http://www.sociopatterns.org/
² https://www.cs.cmu.edu/enron/


Table 7.2: Datasets Information.

Dataset      Weight       |V|    |E|    T
WorkPlace    Contact Hrs  92     1.5K   12 Days
School       Contact Hrs  182    4.2K   9 Days
Enron        # Emails     184    8.4K   44 Months
Chess        # Games      7.3K   62.4K  9 Years
Arxiv        # Papers     28K    3.8M   9 Years
ProsperLoan  # Loans      89K    3.3M   7 Years
Wikipedia    # Pages      118K   2.1M   10 Years
WikiTalk     # Messages   497K   2.7M   12 Years
DBLP         # Papers     1.3M   18M    25 Years

Baselines. Though there are no direct competitors, we adapt multiple methods to use asbaselines.

Random: Uniformly randomly choose node-pairs and time-stamps to merge.

Tensor: Here we pick merge operands based on the centrality given by tensor decomposition. G can also be seen as a tensor of size |V| × |V| × T. So we run the PARAFAC decomposition [78] on G and choose the largest component to get three vectors x, y, and z of sizes |V|, |V|, and T, respectively. We compute the pairwise centrality for node-pair {a, b} as x(a) · y(b) and for time-pair {i, j} as z(i) · z(j), and choose the top-K least central ones.
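For concreteness, here is a small sketch of this baseline on a toy dense tensor using the tensorly library; tensorly is only our choice for illustration, as the chapter assumes some PARAFAC implementation [78] but does not name one.

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

rng = np.random.default_rng(0)
V, T = 50, 6
G_tensor = tl.tensor(rng.random((V, V, T)))   # toy |V| x |V| x T edge-weight tensor

# rank=1 keeps only the dominant component; the chapter runs a higher-rank
# PARAFAC and keeps the largest component.
weights, (x, y, z) = parafac(G_tensor, rank=1)
x, y, z = np.ravel(x), np.ravel(y), np.ravel(z)

# Centrality of node-pair {a, b} is x(a)*y(b); of time-pair {i, j}, z(i)*z(j).
node_pair_score = {(a, b): x[a] * y[b] for a in range(V) for b in range(a + 1, V)}
K = 10
least_central = sorted(node_pair_score, key=node_pair_score.get)[:K]
print(least_central)   # the baseline merges these least central pairs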

CNTemp: We run CoarseNet on the union graph UG and replicate the resulting summary across all time-stamps to create Gcond.

7.4.2 Performance of NetCondense: Effectiveness

We ran all the algorithms to get Gcond for different values of αN and αT, and measure RX = λX^cond/λX to judge performance for Problem 7.2. See Figure 7.4. NetCondense is able to preserve λX excellently (up to 80%, even when the number of time-stamps and nodes are reduced by 50% and 70%, respectively). On the other hand, the baselines perform much worse and quickly degrade λX. Note that Tensor does not even finish within 7 days on DBLP for larger αN. Random and Tensor perform poorly even though they use the same merge definitions, showcasing the importance of choosing the right merges. In the case of Tensor, unexpectedly, it tends to merge unimportant nodes with all nodes in their neighborhood, even the "important" ones; so it is unable to preserve λX. Finally, CNTemp performs badly as it does not use the full temporal nature of G.

Figure 7.4: RX = λX^cond/λX vs. αN (top row, αT = 0.5) and vs. αT (bottom row, αN = 0.5), for (a) Arxiv, (b) ProsperLoan, (c) WikiTalk, and (d) DBLP.

We also compare our performance on Problem 7.3 against an algorithm specifically designed for it. We use the simple greedy algorithm GreedySys for Problem 7.3 (as brute-force is too expensive): it greedily picks the top node/time merges by actually re-computing λS. We can run GreedySys only on small networks due to the SG issues we mentioned before. See Figure 7.5 (λS^M is the λS^cond obtained from method M). NetCondense does almost as well as GreedySys, due to our careful transformations and reductions.

Figure 7.5: Plot of RS = λS^NetCondense / λS^GreedySys, for (a) WorkPlace and (b) School.

7.4.3 Application 1: Temporal Influence Maximization

In this section, we show how to apply our method to the well-known Influence Maximization problem on a temporal network (TempInfMax) [2]. Given a propagation model, TempInfMax aims to find a seed-set S ⊆ V at time 0 which maximizes the 'footprint' (expected number of infected nodes) at time T. Solving it directly on a large G can be very slow. Here we propose to use the much smaller Gcond as an approximation of G, as it maintains the propagation-based properties well.


Specifically, we propose CondInf (Algorithm 7.2) to solve the TempInfMax problem on temporal networks. The idea is to get Gcond from NetCondense, solve TempInfMax on Gcond, and map the results back to G. Thanks to our well-designed merging scheme, which merges nodes with similar diffusive properties together, a simple random mapping is enough. To be specific, let ζ⁻¹(v) be the operator that maps a node v from Gcond back to G. If v is a super-node, then ζ⁻¹(v) returns a node sampled uniformly at random from v.

Algorithm 7.2 CondInf
Require: Temporal graph G, 0 < αN < 1, 0 < αT < 1
Ensure: Seed set S of top k seeds
1: S = ∅
2: Gcond ← NetCondense(G, αN, αT)
3: k′1, k′2, ..., k′s ← run base TempInfMax on Gcond
4: for every k′i do
5:   ki ← ζ⁻¹(k′i); S ← S ∪ {ki}
6: end for
7: return S
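A minimal Python sketch of the mapping step follows; the dictionary supernode_members, recording for each super-node of Gcond the original nodes it contains, is a hypothetical representation assumed here for illustration.

import random

def zeta_inverse(v, supernode_members):
    # zeta^{-1}: a super-node maps to one of its members chosen uniformly at
    # random; an unmerged node maps to itself.
    members = supernode_members.get(v, [v])
    return random.choice(members)

def cond_inf_mapping(cond_seeds, supernode_members):
    # Map each seed found on Gcond back to a concrete node of G (Lines 4-6).
    return {zeta_inverse(v, supernode_members) for v in cond_seeds}

# Example: super-node "s1" merged nodes 3, 7, and 9; node 42 was never merged.
print(cond_inf_mapping(["s1", 42], {"s1": [3, 7, 9]}))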

We use two different base TempInfMax methods: ForwardInfluence [2] for the SI model and Greedy-OT [46] for the PersistentIC model. As our approach is general (our lemmas and proofs can be easily extended to other models) and our actual output is model-independent, we expect CondInf to perform well for both base methods. We use footprint and running time as our two measurements. To calculate the footprint, we infect the nodes in the seed set S on G at time 0, and then run the appropriate propagation model till time T to compute the expected number of infected nodes. We set αT = 0.5 and αN = 0.5 for all datasets for ForwardInfluence. For Greedy-OT (which is very slow), we set αT = 0.5, with αN = 0.5 for School, Enron, and Chess, and αN = 0.97 for Arxiv and Wikipedia. We show results for ForwardInfluence and Greedy-OT in Table 7.3; Greedy-OT did not even finish for datasets larger than Enron. As we can see, our method performs almost as well as the base method on G, while being significantly faster (up to 48 times), showcasing its usefulness.

Table 7.3: Performance of CondInf (CI) with ForwardInfluence (FI) and Greedy-OT (GO) as base methods. σm and Tm are the footprint and running time for method m, respectively. '-' means the method did not finish.

Dataset      σFI      σCI      TFI     TCI
School       130      121      14s     3s
Enron        110      107      18s     3s
Chess        1293     1257     36m     45s
Arxiv        23768    23572    3.7d    7.5h
Wikipedia    -        26335    -       7.1h

Dataset      σGO      σCI      TGO     TCI
School       135      128      15m     1.8m
Enron        119      114      9.8m    24s
Chess        -        2267     -       8.6m
Arxiv        -        357      -       2.2h
Wikipedia    -        4591     -       3.2h

7.4.4 Application 2: Event Detection

Event detection [141, 4] is an important problem in temporal networks: it seeks to identify time points at which there is a significant change in the temporal network. As the snapshots of a temporal network G evolve, with new nodes and edges appearing and existing ones disappearing, it is important to ask whether a snapshot of G at a given time differs significantly from earlier snapshots. Such time points can signify intrusions, anomalies, failures, etc., depending on the domain of the network. Formally, the event detection problem is defined as follows:

Problem 7.3. (Event Detection (EDP)) Given a temporal network G = {G1, G2, ..., GT}, find a list R of time-stamps t, such that 1 < t ≤ T and Gt−1 differs significantly from Gt.

As yet another application of NetCondense, in this section we show that the summary Gcond of a temporal network G returned by NetCondense can be leveraged to speed up the event detection task: one can actually solve event detection on Gcond instead of G and still obtain high-quality results. Since NetCondense groups only homogeneous nodes together and preserves the important characteristics of the original network G, we hypothesize that running SnapNETS on G and on Gcond should produce similar results. Moreover, due to the smaller size of Gcond, running SnapNETS on Gcond is faster than running it on the much larger G. In our method, given a temporal network G and a node reduction factor αN (we set the time reduction factor αT to 0), we obtain Gcond and solve the EDP on Gcond. Specifically, we propose CondED (Algorithm 7.3) to solve the event detection problem.

Algorithm 7.3 CondED
Require: Temporal graph G, 0 < αN < 1
Ensure: List R of time-stamps
1: Gcond ← NetCondense(G, αN, 0)
2: R ← run base event detection on Gcond
3: return R

For EDP we use SnapNETS [4] as the base method. In addition to some of the datasets previously used, we run CondED on other datasets that have previously been used for event detection [4]. These datasets are described below in detail; a summary is in Table 7.4.

Table 7.4: Additional Datasets for EDP.

Dataset         |V|     |E|     T
Co-Occurence    202     2.8K    31 Days
AS Oregon-PA    633     1.08K   60 Units
AS Oregon-MIX   1899    3261    70 Units
IranElection    126K    5.5M    30 Days
Higgs           456K    14.8M   7 Days

Table 7.5: Performance of CondED. F1 stands for F1-score. Speed-up is the ratio of the time to run SnapNETS on G to the time to run SnapNETS on Gcond.

                αN = 0.3          αN = 0.7
Dataset         F1     Speed-Up   F1     Speed-Up
Co-Occurence    1      1.23       0.18   1.24
School          1      1.05       1      1.56
AS Oregon-PA    1      1.43       1      2.08
AS Oregon-MIX   1      1.22       1      2.83
IranElection    1      1.27       1      3.19
Higgs           0.66   1.17       1      3.79
Arxiv           1      1.09       1      2.27

AS Oregon-PA and AS Oregon-MIX are Autonomous Systems peering information networks collected from the Oregon router views³. IranElection and Higgs are Twitter networks, where the nodes are Twitter users and edges indicate follower-followee relationships. More details on these datasets are given in [4].

Co-Occurence is a word co-occurrence network extracted from historical newspapers published in January 1890, obtained from the Library of Congress⁴. Nodes in the network are keywords, and an edge between two keywords indicates that they co-appear in a sentence in a newspaper published on a particular day. The edge-weights indicate the frequency with which two words co-appear.

To evaluate the performance of CondED, we compare the list of time-stamps Rcond obtained by CondED with the list of time-stamps R obtained by SnapNETS on the original network G. We treat the time-stamps discovered by SnapNETS as the ground truth and, following the methodology in [4], compute the F1-score. We repeat the experiment with αN = 0.3 and αN = 0.7. The results are summarized in Table 7.5.
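For illustration, here is a small sketch of this comparison, treating time-stamps as matched only when they agree exactly; the matching rule in the methodology of [4] may be more refined (e.g., allowing small time offsets).

def f1_score(detected, ground_truth):
    # Precision and recall over exact time-stamp matches.
    detected, ground_truth = set(detected), set(ground_truth)
    true_pos = len(detected & ground_truth)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(detected)
    recall = true_pos / len(ground_truth)
    return 2 * precision * recall / (precision + recall)

# Rcond from CondED on Gcond vs. R from SnapNETS on G (ground truth).
print(f1_score(detected=[4, 12, 20], ground_truth=[4, 12, 21]))  # ~0.67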

As shown in Table 7.5, CondED achieves very high F1-scores for most datasets, even when αN = 0.7. This suggests that the list of time-stamps Rcond returned by CondED matches the result from SnapNETS very closely. Moreover, running the base method on Gcond is up to 3.5 times faster than running it on G.

³http://www.topology.eecs.umich.edu/data.html
⁴http://chroniclingamerica.loc.gov

However, for the Co-Occurence dataset, the F1-score for αN = 0.7 is a mere 0.18, despite an F1-score of 1 for αN = 0.3. Note that Co-Occurence is one of the smallest datasets we have; a very high αN seems to deteriorate the structure of the network, which suggests that a different value of αN is suitable for different networks in the event detection task.

7.4.5 Application 3: Understanding/Exploring Networks

Figure 7.6: Condensed WorkPlace (αN = 0.6, αT = 0.5).

Figure 7.7: Condensed School (αN = 0.5 and αT = 0.5).

We can also use NetCondense for 'sense-making' of temporal datasets: it ensures that important nodes and times remain unmerged, while super-nodes and super-times form coherent, interpretable groups of nodes and time-stamps. This is not the case for the baselines: e.g., Tensor merges important nodes, giving us heterogeneous super-nodes that lack interpretability.

WorkPlace: It is a social-contact network between employees of a company with five departments, where weights are normalized contact times. It has been used for vaccination studies [34]. In Gcond (see Figure 7.6), we find a super-node composed mostly of nodes from the SRH (orange) and DSE (pink) departments, which were on floor 1 of the building, while the rest were on floor 2. We also noticed that the proportion of contact time between individuals of SRH and DSE was high in all time-stamps. In the same super-node, surprisingly, we find a node from the DMCT (green) department on floor 2 who has high contact with DSE nodes. It turns out s/he was labeled as a "wanderer" in [34].

Unmerged nodes in Gcond had high degree in all time-stamps. For example, we found that nodes 80, 150, 751, and 255 (colored black) remained unmerged even for αN = 0.9, suggesting their importance. In fact, all of these nodes were classified as "Linkers", whose temporal stability is crucial for epidemic spread [34]: they are important nodes in every time-stamp. The visualization of Gcond emphasizes that Linkers connect consistently to nodes from multiple departments, which is not obvious in the original network. We also examined the super-times, and found that the days in and around the weekend (where there is little activity) were merged together.

School: It is a social-contact network between high school students from five different sections over several days [40]. We condensed the School dataset with αN = 0.5 and αT = 0.5. Students from sections "MP*1" (green) and "MP*2" (pink) were Maths majors, those from "PC" (blue) and "PC*" (orange) were Physics majors, and students from "PSI" (dark green) were Engineering majors [40]. In Gcond (see Figure 7.7), we find a super-node containing nodes from the MP*1 and MP*2 sections (enlarged pink node) and another super-node (enlarged orange node) with nodes from the remaining three sections, PC, PC*, and PSI. This grouping is supported by [40], which mentions that the five classes can broadly be divided into the two categories (MP*1, MP*2) and (PC, PC*, PSI). We also see that the unmerged nodes in each major connect densely to each other every day. This dense inter-connection of "important" nodes is obscured by unimportant nodes and edges in the original network; the visualization of the condensed network emphasizes these important structures and how they remain stable over time. We also noted that small super-nodes of size 2 or 3 were composed solely of male students, which can be attributed to the gender homophily exhibited only by male students in the dataset [40]. Finally, weekends were merged together with surrounding days in School, while other days remained intact.

Enron: It is the email communication network of employees of the Enron Corporation. In Gcond (αN = 0.8, αT = 0.5), we find that the unmerged nodes are important figures such as G. Whalley (President), K. Lay (CEO), and J. Skilling (CEO). Other unmerged nodes include Vice-Presidents and Managing Directors. We also found a star with Chief of Staff S. Kean at the center and important officials such as Whalley, Lay, and J. Shankman (President) for six consecutive time-stamps, and a clique of various Vice-Presidents, L. Blair (Director), and S. Horton (President) for seven time-stamps in the same period. These structures appear only in consecutive time-stamps leading up to when Enron declared bankruptcy. Their sudden emergence, stability for six/seven time-stamps, and sudden disappearance correctly suggest that a major event occurred during that time. We also note that time-stamps in 2001 were never merged, indicative of important and suspicious behavior.

To investigate the nature of nodes that get merged early, we look at the super-nodes at an early stage of NetCondense. We find two super-nodes with one Vice-President in each (note that most other nodes are still unmerged at this stage); both were composed mostly of Managers, Traders, and Employees who reported directly or indirectly to the respective Vice-Presidents. Regarding the unmerged time-stamps: even though the news broke out in October 2001, analysts were suspicious of practices in the Enron Corporation from early 2001⁵. Therefore, the fact that time-stamps in 2001 were never merged shows that the events in early 2001 were important, indicative of suspicious activities in the Enron Corporation.

DBLP: It is the co-authorship network from the DBLP-CS bibliography. This is an especially large dataset, hence exploration without any condensation is hard. In Gcond (αN = 0.7, αT = 0.5), we found that the unmerged nodes were very well-known researchers such as Philip S. Yu, Christos Faloutsos, Rakesh Aggarwal, and so on, whereas the super-nodes grouped researchers with only a few publications, who collaborated very little with the rest of the network and have edges among themselves in few time-stamps. In super-nodes, we found researchers from the same institutions or countries (for example, separate super-nodes composed solely of scientists from Swedish and from Japanese universities) or from the same or closely related research fields (such as Data Mining and Information Visualization); in larger super-nodes, researchers were grouped together based on countries and broader fields. We also find a giant super-node of size 395,000. An interesting observation is that famous researchers connect only weakly to the giant super-node: for example, Rakesh Aggarwal connects to it in only two time-stamps with almost zero edge-weight, whereas lesser-known researchers connect to it with higher edge-weights. This suggests that researchers exhibit homophily in co-authorship patterns, i.e., famous researchers collaborate with other famous researchers, and non-famous researchers with other non-famous researchers. The few super-nodes that famous researchers connect to at different time-stamps also show how their research interests evolved with time. For example, we find that Philip S. Yu connected to a super-node of IBM researchers who primarily worked on databases in early 1994, and to a super-node of data mining researchers in 2000.

7.4.6 Scalability and Parallelizability

Figure 7.8(a) shows the runtime of NetCondense on components of increasing size from Arxiv. NetCondense has subquadratic time complexity; in practice, it is near-linear w.r.t. the input size. Figure 7.8(b) shows the near-linear run-time speed-up of parallel NetCondense vs. the number of cores on Wikipedia.

Figure 7.8: (a) Near-linear scalability w.r.t. size; (b) near-linear speed-up w.r.t. the number of cores for the parallelized implementation.

⁵http://www.webcitation.org/5tZ26rnac

7.5 Conclusion

In this chapter, we proposed the novel and general Temporal Network Condensation problem using the fundamental so-called 'system matrix', and presented an effective, near-linear, and parallelizable algorithm, NetCondense. Using a variety of large datasets, we leveraged it to dramatically speed up influence maximization and event detection algorithms on temporal networks. We also leveraged the summary network given by NetCondense to visualize, explore, and understand multiple networks. As also shown by our experiments, our method is model-agnostic and has wide applicability, thanks to our carefully chosen metrics, which can easily be generalized to other propagation models such as SIS, SIR, and so on.


Part II

Data-Driven Perspective


Overview of Part II

In the previous part, we studied the immunization and summarization problems for diffusion under several practical settings, assuming underlying diffusion models. In this part, we seek to optimize and understand network structure in light of diffusion processes from a data-driven perspective. With the latest developments in surveillance and storage techniques, it is possible to obtain a huge volume of propagation data. For instance, in the public health area, we can extract influenza activity from medical surveillance in the form of electronic healthcare records (eHCR); in social media, we can get diffusion data (like meme cascades, opinion propagation, etc.) from the Internet, mobile platforms, and web services. This rich propagation data makes it possible to study diffusion problems in networks while relaxing restrictive propagation models, or even without assuming any models. Hence, in the next chapters, we apply data-driven methodologies to study challenging problems, including network optimization for data-driven immunization and propagation-based community detection.

• Data-Driven Immunization. Given a huge volume of data, can we develop efficient intervention policies to control an epidemic directly, without restrictive assumptions? Past work has focused on developing interventions assuming prior epidemiological models. However, in practice disease spread is usually complicated, hence assuming an underlying model may deviate from the true spreading patterns and lead to possibly inaccurate interventions. We develop efficient immunization algorithms with theoretical guarantees by leveraging both patient information and contact networks. The proposed methods outperform several baselines, reducing up to 45% of the infections with a limited budget. Furthermore, we conduct vaccine case studies in major US cities using billions of flu records. To the best of our knowledge, it is the first immunization study on large-scale realistic datasets.

• Community Detection. Most previous work on community detection fails to discover communities in terms of propagation processes. To better understand the different roles of nodes during diffusion, we study the Detecting Media and Kernel Community problem, which tries to uncover two important types of nodes during diffusion, namely the influential nodes ("kernel nodes") and the nodes that serve as "bridges" to boost the diffusion ("media nodes"), as well as their community structure. Our approach provides high-quality solutions, outperforming non-trivial baselines by 40% in F1-score.

This part is organized as follows: Chapter 8 studies the data-driven immunization problem, and Chapter 9 covers the problem of detecting media and kernel communities. The work in Chapter 8 was published in ICDM 2017 [189], while the work in Chapter 9 was published in SDM 2017 [183].


Chapter 8

Data-Driven Immunization

As we discussed in the previous chapters (Chapters 3, 4, and 5), most work on designing immunization algorithms has focused on developing innovative strategies which assume knowledge of the underlying disease model, or which make assumptions of very fine-grained, individual-level surveillance data. Vaccination and social distancing are among the principal strategies for controlling the spread of infectious diseases [60, 104].

Recent trends have led to the increasing availability of electronic claims data, and also to capabilities for developing very realistic urban population contact networks. This motivates the following problem: given a contact network and a coarse-grained propagation log, like electronic Health Reimbursement Claims (eHCR), can we learn an efficient and realistic intervention policy to control propagation (such as a flu outbreak)? Further, can we do it directly, without assuming any epidemiological model? Influenza viruses change constantly; hence, designing interventions optimized for specific epidemic model parameters is likely to be suboptimal [129].

The diagnostic propagation log data provides us with a good sense of how diseases spread, while contact networks tell us how people interact with each other. We take both into account and study the data-driven immunization problem. Some of the major challenges include: (i) the scale of these datasets (eHCR consists of billions of records, and contact networks have millions of nodes); and (ii) eHCR data is anonymized, and available only at a zip-code level. The main contributions of this chapter are:

(a) Problem Formulation. We formulate the Data-Driven Immunization problem given a contact network and a propagation log. We first sample the most likely "social contact" cascades from the propagation log onto the contact network, then pose the immunization problem at a location level and show it is NP-hard.

(b) Effective Algorithms. We present efficient algorithms to obtain the most likely samples, and then provide a contribution-based greedy algorithm, ImmuConGreedy, with provably approximate solutions for allocating vaccines to locations.


(c) Experimental Evaluation. We present extensive experiments against several competitors, including graph-based and model-based baselines, and demonstrate that our algorithms outperform the baselines, reducing up to 45% of the infections with a limited budget. Furthermore, we conduct case studies on nation-wide real medical surveillance data with billions of records to show the effectiveness of our methods. To the best of our knowledge, we are the first to study realistic immunization policies on such large-scale datasets.

This work has been published in ICDM 2017 [189]. Next, we first give preliminaries, then formally formulate our problem and give efficient and provable algorithms. We evaluate our method via extensive experiments and case studies on the massive eHCR data in two major US metropolitan areas.

8.1 Preliminaries

In this section, we give a brief introduction to the propagation data (eHCR) and the contact networks we use.

Propagation Data (eHCR). The propagation data for this study was primarily based on IMS Health claims data, electronic Healthcare Reimbursement Claims (eHCR), which consists of over a billion claims for the period of April 1st, 2009 to March 31st, 2010. The claims data consists of reimbursement claims recorded electronically by health care practitioners, received from all parts of the US, including urban and rural areas. The dataset, its features, and its overall coverage/completeness are described in detail in [123, 139]; for this study, we used daily flu reports, based on ICD-9 codes 486XX and 488XX, and individual locations (zip-codes) recorded in the claims. Prior to our study, we obtained internal Institutional Review Board approval for analyzing the dataset.

Activity-Based Populations. We use city-scale activity-based populations as contact networks (see [12, 35] for more details). These models are constructed by a "first-principles" approach, and integrate over a dozen public and commercial datasets, including census, land use, activity surveys, and transportation networks. The models include detailed demographic attributes at an individual and household level, along with normative activities. They have been used in a number of studies on epidemic spread and public health policy planning, including response strategies for smallpox attacks [35] and the national strategy for pandemic flu [60].

8.2 Problem Formulations

Table 8.1 lists the main notation used throughout this chapter.

Table 8.1: Terms and Symbols

Symbol                 Definition and Description
G(V, E)                graph G with node set V and edge set E
R                      propagation log
N                      infection matrix for the propagation log R
N(Lℓ, ti)              the number of patients at time ti in Lℓ
t0                     the earliest timestep, t0 = 0
n                      number of locations
L = {L1, ..., Ln}      set of locations
m                      number of vaccines
x                      vaccine allocation vector [x1, ..., xn]′
k                      number of samples in M
M                      set of sampled cascades {M1, ..., Mk}
M                      a sampled cascade
SIM                    the starting infected node set in M
σG,M(x)                the expected number of nodes SIM can reach when x is given
ρG,M(x)                σG,M(0) − σG,M(x)
αM,ℓ                   the number of nodes that have at least one parent in M at location Lℓ
Sℓ                     the initial starting node set at location Lℓ, where |Sℓ| = N(Lℓ, t0)

We use G(V, E) to denote an undirected, unweighted graph, and L = {L1, ..., Ln} to denote a set of locations. Vi ⊆ V denotes the set of nodes at location Li; we assume there are no overlapping nodes between locations. Large medical surveillance data, like eHCR, is usually anonymized due to privacy issues. Hence, in this chapter, we assume only the numbers of infections are given. Formally, the propagation log R is an infection matrix N of size (tmax + 1) × n, where t0 and tmax are the earliest and last timesteps. Each element N(Lℓ, t) represents the number of patients in R at location Lℓ at time t. Each row vector N(t) = [N(L1, t), ..., N(Ln, t)] gives the numbers of infections at time t, and each column vector NLℓ = [N(Lℓ, t0), ..., N(Lℓ, tmax)]ᵀ gives the numbers of infections at location Lℓ over time.
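As a concrete illustration, here is a minimal Python sketch of assembling N from anonymized records, assuming each record is a (zip-code, timestep) pair; this layout is a simplification for exposition, not the actual eHCR schema.

import numpy as np

def build_infection_matrix(records, locations, t_max):
    # N has (t_max + 1) rows (timesteps t0..tmax) and n columns (locations).
    loc_index = {loc: j for j, loc in enumerate(locations)}
    N = np.zeros((t_max + 1, len(locations)), dtype=int)
    for zip_code, t in records:
        N[t, loc_index[zip_code]] += 1   # one patient at this location and time
    return N

records = [("22030", 0), ("22030", 1), ("24060", 1), ("24060", 1)]
N = build_infection_matrix(records, locations=["22030", "24060"], t_max=2)
print(N)   # row t, column location: counts of patients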

Interactions and Surveillance. A contact network G models people's interactions with others, and is a powerful tool for controlling epidemics. For example, Prakash et al. [131] showed that the first eigenvalue of the adjacency matrix of G is related to the epidemic threshold: an epidemic will be quickly extinguished given a small epidemic threshold. Several effective algorithms have been proposed to minimize the first eigenvalue in order to control epidemics [165, 166, 185]. However, all of them assume an underlying epidemiological model like Susceptible-Infected-Recovered (SIR) [5]. In addition, they are strictly graph-based methods that do not look into the rich medical surveillance data. Though graph-based methods can provide us with good baseline strategies, they do not take into account the particular patterns of a given virus. On the other hand, disease propagation data R, like eHCR, can give us a coarse-grained picture of infections; however, it carries very little information on how an epidemic spreads via person-to-person contacts. Hence, we believe the disease propagation data R, along with a contact network G, can help us develop better and more implementable interventions to control an epidemic. For example, we can use the surveillance data of the past flu season to allocate vaccines for the current flu season.

Figure 8.1: Overview of our approach. We first generate a set of cascades, then allocate vaccines to different locations.

Map R to nodes in G. The main challenge in integrating R and G is that R (like eHCR) is anonymized in practice; hence, we cannot associate each record in R with a node in G. In this chapter, we tackle this challenge by mapping infections from R to nodes in G at the location level. The idea is that at each location Lℓ and time ti, we pick N(Lℓ, ti) nodes in G as infected nodes. Note that we can have multiple choices when mapping R to G. For example, in Figure 8.1, N(L2, t0) = 1; hence, we can pick either A or B as the infected node at t0. We denote these choices as M, where M is a set of cascades. We define a cascade M as follows:

Definition 8.1. (Cascade). A cascade M is a directed acyclic graph (DAG) induced by R and G. Each node u ∈ VM is associated with a location Lℓ and a timestep ti, where u ∈ Vℓ and u is infected at ti (denoted as t(u) = ti). For nodes u and v in M, if eu,v ∈ E and t(u) = t(v) − 1, there is a directed edge from u to v in M, denoted e(u, v) ∈ EM.

We could select the N(Lℓ, ti) infected nodes in G uniformly at random for each M. However, this is not realistic, as infection distributions are not uniform. For example, if a node u has an infected neighbor, u can be infected by that node; in contrast, if u does not have any infected neighbor, it is unlikely to be infected. Hence, we propose to map R to G according to the SocialContact approach.

SocialContact. We say an infected node u gets infected by "social contact" in G if u has a direct neighbor that was infected earlier than u; otherwise, we say the node was infected by external forces. In reality, infectious diseases (like flu, mumps, etc.) usually spread via person-to-person contact. Hence, for a mapped cascade M, we want to maximize the number of nodes infected via SocialContact. Formally, we define αM = |{u | ∃v, e(v, u) ∈ EM}|, i.e., αM is the number of nodes that have at least one parent in M. Maximizing the number of nodes infected by SocialContact is then equivalent to maximizing αM. Figure 8.1 shows two cascades with the best αM = 4: only the node that starts the infection does not have a parent. To get k cascades with SocialContact in M, we formulate the Mapping Problem:

Problem 8.1. (Mapping Problem). Given a contact network G, a propagation log R, and a number of cascades k, find M∗ = {M∗1, ..., M∗k}, where each node u in each M is associated with a location Lℓ and a time ti:

M∗ = arg max_M Σ_{Mi∈M} αMi,   s.t. |M| = k.   (8.1)

Remark 8.1. Since we do not specify any epidemiological model (like SIR) for Problem 8.1, it is difficult to define a probability distribution over M. Hence, the sample average approximation approach is not applicable to this problem.
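To make Definition 8.1 and the objective αM concrete, here is a small sketch that induces a cascade DAG from one particular choice of infected nodes (each mapped node tagged with its infection time) and counts αM; networkx is used purely for illustration.

import networkx as nx

def induce_cascade(G, infection_time):
    # Definition 8.1: a directed edge u -> v whenever {u, v} is an edge of G,
    # both endpoints are mapped (infected), and t(u) = t(v) - 1.
    M = nx.DiGraph()
    M.add_nodes_from(infection_time)
    for u, v in G.edges():
        if u in infection_time and v in infection_time:
            if infection_time[u] == infection_time[v] - 1:
                M.add_edge(u, v)
            elif infection_time[v] == infection_time[u] - 1:
                M.add_edge(v, u)
    return M

def alpha(M):
    # alpha_M: mapped nodes with at least one parent in the cascade.
    return sum(1 for v in M if M.in_degree(v) > 0)

G = nx.path_graph(4)                        # contact network 0-1-2-3
M = induce_cascade(G, {0: 0, 1: 1, 2: 2})   # one choice of mapping R onto G
print(alpha(M))                             # 2: only the seed lacks a parent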

Data-Driven Immunization. Once we generate M, we want to study how to best allocate vaccines to minimize the infections shown in R. Recently, Zhang et al. [185] proposed a model-based group immunization problem, in which vaccines are allocated uniformly at random to nodes within groups; this mimics the real-life distribution of vaccines by public-health authorities. We leverage their within-group allocation approach. Let us define x = [x1, ..., xn]′ as a vaccine allocation vector, where xi is the number of vaccines given to location Li. If we give xi vaccines to location Li, xi nodes will be removed uniformly at random from Vi. The objective is to find an allocation that "breaks" the cascades most effectively. We define SIM as the set of starting 'seed' infected nodes in M, i.e., SIM = {u ∈ VM | t(u) = t0}, and σG,M(x) as the expected number of nodes SIM can reach after x is allocated to locations in M. Hence, we want to minimize σG,M(x) to limit the expected infections over any cascade M ∈ M. For example, in Figure 8.1, once 2 vaccines are given to L1 and L2, we minimize the number of nodes that B can reach in the two cascades. Formally, our data-driven immunization problem given M (from Problem 8.1) is:

Problem 8.2. (Data-Driven Immunization). Given a contact network G, a set of cascades M, and a budget m, find a vaccine allocation vector x∗:

x∗ = arg min_x (1/|M|) Σ_{Mi∈M} σG,Mi(x),   s.t. ||x||1 = m.   (8.2)

Hardness. Both Problem 8.1 and Problem 8.2 are NP-hard, as they can be reduced from the Max-K-Set Union problem [71] and the DAV problem [186], respectively.


8.3 Proposed Method

In this section, we first develop an efficient algorithm, MappingGeneration, to solve Problem 8.1, and then propose a contribution-based algorithm, ImmuConGreedy, for Problem 8.2.

8.3.1 Generating Cascades from SocialContact

Main Idea: To tackle Problem 8.1, we first focus on the special case where k = 1 (finding a single cascade M), then extend it to multiple cascades. The challenge here is that even when k = 1, Problem 8.1 is still NP-hard. Our main idea is to first generate SIM (the seed set), and then generate M from SIM. In principle, this can be done by checking SIM's i-hop neighbors. Clearly, SIM's quality directly affects M's quality; however, it is still hard to find SIM and generate M from it. Instead, we identify a necessary condition for the optimal M, and propose a provable approximation algorithm to find an SIM that satisfies the condition. We make the algorithm faster by leveraging the Approximate Neighborhood Function (ANF) technique. We then generate the corresponding cascade M from SIM, and propose a fast algorithm, MappingGeneration, to extend it to k cascades for Problem 8.1.

Finding SIM. To find a high-quality SIM, we first examine what the optimal M looks like. As mentioned before, the optimal M has the maximum value of αM. Let us define α∗M as the maximum of αM (αM ≤ α∗M). Then we have the following lemma:

Lemma 8.1. α∗M = Σ_{t=t1..tmax} ||N(t)||1, i.e., the number of infections after the earliest time t0.

Proof. When we map R to G, the optimal case for a cascade M is that every node u with t(u) > t0 has at least one parent in M, and the only nodes without any parents are the ones infected at the earliest time t0. Hence, α∗M is the number of nodes that are infected after t0.

Now we know the maximum value of αM. However, it is hard to find an SIM yielding the optimal M, as the next lemma shows.

Lemma 8.2. Finding a set SIM for a cascade M with αM = α∗M is NP-hard.

Proof. We reduce from the Max-K-Set Union problem, which tries to pick K sets that cover at least ρ elements. Consider an instance of the Max-K-Set Union problem with n sets (n > K), A1, ..., An, and m elements ej. We construct the following instance: we create n nodes with a directed edge to each set Ai respectively, and each set Ai has a directed edge to every ej ∈ Ai. For the log R, we set N(L0, t0) = K, N(L0, t1) = K, and N(L0, t2) = ρ; then we want to find an M with αM = K + ρ. If we can find such an M, the Max-K-Set Union problem must be solvable.


According to Lemma 8.2, it is intractable to examine the whole graph to get SIM for large networks (like Houston, with 59 million edges; see Section 8.4). Hence, we will instead look at each location independently to find SIM, and aggregate the results to generate M.

Let us define αM,ℓ as the number of nodes that have at least one parent in M at location Lℓ. Similarly to αM, we have αM,ℓ ≤ α∗M,ℓ, where α∗M,ℓ = Σ_{i=1..tmax} N(Lℓ, ti). α∗M,ℓ is the number of patients after t0 at location Lℓ in R, and it is the optimal value of αM,ℓ. Since we want to find a set of starting nodes, we define Sℓ as a node set at location Lℓ, i.e., Sℓ = {v | v ∈ S and v ∈ Vℓ}, where |Sℓ| = N(Lℓ, t0). For each location Lℓ, we want to find a set Sℓ as the starting infected node set, such that Sℓ yields a cascade M that maximizes αM,ℓ. Our idea is to find an Sℓ that satisfies a necessary condition for the best αM,ℓ. We denote CF(Sℓ, ti) = |{u | u ∈ Vℓ, ∃v ∈ Sℓ, dist(v, u) ≤ i}|, i.e., the number of nodes in Lℓ that Sℓ can reach within distance i (i hops) in G. Similarly, we denote CN(Lℓ, ti) = Σ_{k=0..i} N(Lℓ, tk), the cumulative number of infections in Lℓ in R until time ti. The next lemma shows that, for each location Lℓ, when αM,ℓ = α∗M,ℓ, the constraint in Eqn. 8.3 must be satisfied.

Lemma 8.3. (Necessary Condition) Given a cascade M generated from Sℓ, if αM,ℓ = α∗M,ℓ, then for any timestep ti ∈ [0, tmax] and all locations Lℓ, we have

CF(Sℓ, ti) ≥ CN(Lℓ, ti).   (8.3)

Proof. If αM,ℓ = α∗M,ℓ, every node that is infected after t0 has a parent. Any node u that is infected at ti must then be within i hops of Sℓ, which means the number of nodes within i hops of Sℓ is at least the cumulative number of nodes infected up to ti, i.e., CF(Sℓ, ti) ≥ CN(Lℓ, ti).

Lemma 8.3 demonstrates a necessary condition (Eqn. 8.3) for the maximum αM,ℓ. Hence, we seek to develop an efficient algorithm that can produce accurate results for this necessary condition. Our idea is to construct a new objective function whose minimization attains the necessary condition for the best M at location Lℓ. To do so, we propose the following problem to find SIM:

Problem 8.3. Given graph G and infection matrix N, we want to find S∗ = {S∗1, ..., S∗n}, where |S∗ℓ| = N(Lℓ, t0) for every location Lℓ, such that

S∗ℓ = arg min_{Sℓ} θ(Sℓ)   for every location Lℓ,

where

θ(Sℓ) = Σ_{i=0..tmax} 1[CF(Sℓ, ti) < CN(Lℓ, ti)] · (CN(Lℓ, ti) − CF(Sℓ, ti)).

Here 1[CF(Sℓ, ti) < CN(Lℓ, ti)] is an indicator function: it is 1 if CF(Sℓ, ti) < CN(Lℓ, ti), and 0 otherwise.


Justification of Problem 8.3. Recall that α∗M,ℓ is the optimal value for αM,ℓ, and θ(Sℓ) is non-negative. We have the following lemma:

Lemma 8.4. If αM,ℓ is optimal, then θ(Sℓ) = 0.

Proof. When αM,ℓ is optimal, αM,ℓ = α∗M,ℓ. Let βSℓ be the number of nodes without any parents in the cascade generated from Sℓ at location Lℓ. Maximizing αM,ℓ for Problem 8.1 is equivalent to minimizing βSℓ at location Lℓ. Let β∗Sℓ be the minimum possible number of nodes without any parents in the sample at location Lℓ; clearly, β∗Sℓ = CN(Lℓ, t0) = |Sℓ|. For each timestep ti, if CF(Sℓ, ti) < CN(Lℓ, ti), then CN(Lℓ, ti) − CF(Sℓ, ti) is the number of nodes that cannot be mapped to the cascade generated by Sℓ at timestep ti; hence, θ(Sℓ) is the total number of nodes that cannot be mapped to the cascade generated by Sℓ. If there exists any ti with CF(Sℓ, ti) < CN(Lℓ, ti), we can always generate a cascade by mapping all CF(Sℓ, ti) reachable nodes into the cascade, and then mapping the other θ(Sℓ) nodes into the cascade uniformly at random. This way, the number of nodes without any parents satisfies βSℓ ≤ β∗Sℓ + θ(Sℓ), as the θ(Sℓ) extra nodes can have connections among themselves. Since βSℓ + αM,ℓ = Σ_{ti} N(Lℓ, ti), we get αM,ℓ ≥ α∗M,ℓ − θ(Sℓ). Hence, α∗M,ℓ − θ(Sℓ) ≤ αM,ℓ ≤ α∗M,ℓ, and when θ(Sℓ) = 0, αM,ℓ = α∗M,ℓ.

Lemma 8.4 shows that if we minimize θ(Sℓ), we are able to attain the necessary condition for the best M at location Lℓ. Therefore, we propose Problem 8.3 to get SIM.

Hardness. Problem 8.3 is NP-hard, as it can be reduced from the set cover problem [71].

Solving Problem 8.3. Let us define g(Sℓ) = [Σ_{i=0..tmax} CN(Lℓ, ti)] − θ(Sℓ). Since Σ_{i=0..tmax} CN(Lℓ, ti) is constant, minimizing θ(Sℓ) is equivalent to maximizing g(Sℓ).

Lemma 8.5. g(Sℓ) has the following properties: g(∅) = 0; it is monotonically increasing; and it is submodular.

Proof. First, it is clear that g(∅) = 0.

Second, to prove that g(S) is monotonically increasing, we need to prove that θ(S) is a monotonically decreasing function. To do so, we first show that CFi(Sℓ) is monotone non-decreasing and submodular for any i and Lℓ (in the rest of this proof we write CFi(Sℓ) for CF(Sℓ, ti) and CNi for CN(Lℓ, ti)). Let fi(Sℓ) be the number of nodes in Lℓ that Sℓ can reach within i hops; then fi(Sℓ) ≤ fi(Sk) when Sℓ ⊆ Sk. Moreover, given Sℓ ⊆ Sk and a node u, fi(Sℓ ∪ {u}) − fi(Sℓ) is the marginal gain of a set union; since the coverage function in the set-union problem is submodular [71], fi(Sℓ) is also submodular. Since fi(Sℓ) is monotone non-decreasing and submodular, the cumulative function CFi(Sℓ) is also non-decreasing and submodular.

Now let Xi = 1[CFi(A ∪ B) < CNi] · (CNi − CFi(A ∪ B)) and Yi = 1[CFi(A) < CNi] · (CNi − CFi(A)). For any sets A and B,

θ(A ∪ B) − θ(A) = Σ_{i=1..T} (Xi − Yi).   (8.4)

For any i, consider the following two cases:

(1) If 1[CFi(A) < CNi] = 0, it means CFi(A) ≥ CNi; then CFi(A ∪ B) ≥ CNi, hence 1[CFi(A ∪ B) < CNi] = 0, and Xi − Yi = 0.

(2) If 1[CFi(A) < CNi] = 1, we have two sub-cases:

(2a) If 1[CFi(A ∪ B) < CNi] = 0, then Xi − Yi = −Yi = −(CNi − CFi(A)) < 0.

(2b) If 1[CFi(A ∪ B) < CNi] = 1, then Xi − Yi = (CNi − CFi(A ∪ B)) − (CNi − CFi(A)) = CFi(A) − CFi(A ∪ B) ≤ 0 (by the monotonicity of CFi shown above).

Putting these together, we have θ(A ∪ B) ≤ θ(A). Hence, θ(S) is monotonically decreasing, and therefore g(S) is monotonically increasing.

Third, to prove that g(S) is submodular, for any location ℓ we need to show that, given S ⊆ T, g(S ∪ {a}) − g(S) ≥ g(T ∪ {a}) − g(T); equivalently, θ is supermodular: θ(S ∪ {a}) − θ(S) ≤ θ(T ∪ {a}) − θ(T). Let us write

δ(S, a, i) = 1[CFi(S ∪ {a}) < CNi] · (CNi − CFi(S ∪ {a})) − 1[CFi(S) < CNi] · (CNi − CFi(S)),

and define δ(T, a, i) analogously. Then θ(S ∪ {a}) − θ(S) = Σ_i δ(S, a, i) and θ(T ∪ {a}) − θ(T) = Σ_i δ(T, a, i), so it suffices to show δ(S, a, i) ≤ δ(T, a, i) for every i. Consider the following cases:

(1) If 1[CFi(S) < CNi] = 0, then CFi(S) ≥ CNi, and by the monotonicity of CFi we also have 1[CFi(S ∪ {a}) < CNi] = 1[CFi(T) < CNi] = 1[CFi(T ∪ {a}) < CNi] = 0. Hence δ(S, a, i) = δ(T, a, i) = 0.

(2) If 1[CFi(S) < CNi] = 1, we have the following cases:

(2a) If 1[CFi(T) < CNi] = 0, then 1[CFi(T ∪ {a}) < CNi] = 0, so δ(T, a, i) = 0. Consider the value of 1[CFi(S ∪ {a}) < CNi]:

If 1[CFi(S ∪ {a}) < CNi] = 0, then δ(S, a, i) = −(CNi − CFi(S)) < 0 = δ(T, a, i).

If 1[CFi(S ∪ {a}) < CNi] = 1, then δ(S, a, i) = CFi(S) − CFi(S ∪ {a}) ≤ 0 = δ(T, a, i) (by the monotonicity of CFi).

(2b) If 1[CFi(T) < CNi] = 1, consider the value of 1[CFi(S ∪ {a}) < CNi]:

If 1[CFi(S ∪ {a}) < CNi] = 0, then 1[CFi(T ∪ {a}) < CNi] = 0 as well, and δ(S, a, i) = −(CNi − CFi(S)) ≤ −(CNi − CFi(T)) = δ(T, a, i) (by the monotonicity of CFi).

If 1[CFi(S ∪ {a}) < CNi] = 1, then for 1[CFi(T ∪ {a}) < CNi]:

if 1[CFi(T ∪ {a}) < CNi] = 1, then δ(S, a, i) = CFi(S) − CFi(S ∪ {a}) ≤ CFi(T) − CFi(T ∪ {a}) = δ(T, a, i) (by the submodularity of CFi);

otherwise 1[CFi(T ∪ {a}) < CNi] = 0; then, since CFi(T ∪ {a}) ≥ CNi, we get δ(S, a, i) = CFi(S) − CFi(S ∪ {a}) ≤ CFi(T) − CFi(T ∪ {a}) ≤ CFi(T) − CNi = δ(T, a, i).

Putting all cases together, δ(S, a, i) ≤ δ(T, a, i) for every i, so θ(S ∪ {a}) − θ(S) ≤ θ(T ∪ {a}) − θ(T), and hence g(S ∪ {a}) − g(S) ≥ g(T ∪ {a}) − g(T).

g(S) is therefore a submodular function.

Lemma 8.5 suggests a natural greedy algorithm to solve Problem 8.3, which we call SampleNaiveGreedy. Each time, it picks the node u∗ = arg max_{u∈Vℓ} [g(Sℓ ∪ {u}) − g(Sℓ)], until N(Lℓ, t0) nodes have been selected into Sℓ. We do this for all locations to get SIM.

Lemma 8.6. For each location Lℓ, SampleNaiveGreedy gives a (1 − 1/e)-approximate solution for maximizing g(Sℓ).

Proof. Minimizing θ(Sℓ) is equivalent to maximizing g(Sℓ) = [Σ_{i=0..tmax} CN(Lℓ, ti)] − θ(Sℓ), as Σ_{i=0..tmax} CN(Lℓ, ti) is constant. By Lemma 8.5, g(Sℓ) satisfies: (1) g(∅) = 0; (2) it is monotonically increasing; (3) it is submodular. Hence, the greedy algorithm for maximizing g(Sℓ) gives a (1 − 1/e)-approximate solution [117].

SampleNaiveGreedy iteratively selects the node with the maximum marginal gain of g(Sℓ). It takes O(|V|(|V| + |E|)) time if we run a BFS to compute each CF(Sℓ, ti) in each iteration. The time complexity to pick all ||N(t0)||1 nodes of SIM is O(||N(t0)||1 |V|(|V| + |E|)), which is not scalable to large networks. Hence, we need a faster algorithm.

Speeding up SampleNaiveGreedy. In SampleNaiveGreedy, each time we recompute CF(Sℓ ∪ {u}, ti) for all i, which takes O(|E| + |V|) time. We can speed up this computation by leveraging the ANF (Approximate Neighborhood Function) algorithm [126], which uses a classical probabilistic counting algorithm, the Flajolet-Martin algorithm [38], to approximate the sizes of unions of node sets using bit strings. Here, we refer to the bit string that approximates CF(Sℓ, ti) as F(Sℓ, i). To estimate CF(Sℓ ∪ {u}, ti), we first do a bitwise-OR operation, F(Sℓ ∪ {u}, i) = [F(Sℓ, i) OR F({u}, i)], and then convert it to CF(Sℓ ∪ {u}, ti). Following the ANF algorithm, CF(·, ti) = φ(F(·, i)) = 2^b / 0.77351, where b is the average position of the leftmost zero bit of the bit strings. Since a bitwise-OR operation takes constant time, we reduce the running time of computing CF(Sℓ ∪ {u}, ti) for all timesteps i from O(|E| + |V|) to O(tmax).
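Here is a toy sketch of this bit-string machinery: a simplified, single-hash Flajolet-Martin counter (ANF [126] averages many such bit strings to reduce variance, which this sketch omits, so its estimate is rough).

import hashlib

def fm_bitstring(items):
    # Single-hash Flajolet-Martin sketch: item x sets bit r, where r is the
    # number of trailing zeros of hash(x), so bit r is set w.p. 2^-(r+1).
    bits = 0
    for x in items:
        h = int(hashlib.sha1(str(x).encode()).hexdigest(), 16)
        if h == 0:
            continue
        r = (h & -h).bit_length() - 1   # index of the lowest set bit
        bits |= 1 << r
    return bits

def fm_count(bits):
    # phi: 2^b / 0.77351, where b is the position of the lowest zero bit.
    b = 0
    while bits & (1 << b):
        b += 1
    return (2 ** b) / 0.77351

A = fm_bitstring(range(600))            # one neighborhood's bit string
B = fm_bitstring(range(400, 1000))      # another neighborhood's bit string
print(fm_count(A | B))                  # union size in O(1): rough estimate of 1000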

We propose SampleGreedy (Algorithm 8.1), a modified greedy algorithm with bitwise-OR operations for Problem 8.3. It first computes F({u}, i) for all nodes at location Lℓ over all timesteps using ANF [126] (Line 2), and then follows SampleNaiveGreedy, using bitwise-OR operations to speed up the computation of CF(Sℓ ∪ {u}, ti) (Lines 7-8).

Lemma 8.7. SampleGreedy takes O((|V| ||N(t0)||1 + |E|) tmax) time.


Algorithm 8.1 SampleGreedy
Require: Graph G, and propagation log matrix N
1: for each location Lℓ do
2:   Get F({u}, i) for all timesteps i and all u ∈ Vℓ using ANF [126]
3:   y = N(Lℓ, t0)
4:   Sℓ = ∅, and F(Sℓ, i) = 0 for all timesteps i
5:   for j = 1 to y do
6:     for each node u ∈ Vℓ − Sℓ do
7:       F(Sℓ ∪ {u}, i) = F(Sℓ, i) OR F({u}, i) for all ti
8:       CF(Sℓ ∪ {u}, ti) = φ(F(Sℓ ∪ {u}, i)) for all ti
9:     end for
10:    u∗ = arg max_{u∈Vℓ−Sℓ} g(Sℓ ∪ {u}) − g(Sℓ)
11:    Sℓ = Sℓ ∪ {u∗}
12:  end for
13: end for
14: return SIM = {S1, ..., Sn}

Proof. Computing all F({u}, i) over all locations takes O((|V| + |E|) tmax) time, according to the ANF algorithm. Since a bitwise-OR operation takes constant time, Lines 7-8 take O(tmax) time per candidate node, and the inner loop (Lines 6-9) takes O(|V| tmax) time. Since ||N(t0)||1 is the total number of starting infected nodes, it takes O(||N(t0)||1 |V| tmax) time to pick all ||N(t0)||1 nodes. Hence, the overall running time is O((|V| ||N(t0)||1 + |E|) tmax).

Generating cascades from SIM. Once we obtain SIM from Algorithm 8.1, we can generate M from SIM. Let us define Dℓi = {u | u ∈ Vℓ, ∃v ∈ SIM, dist(v, u) = i}, i.e., the set of nodes in location Lℓ that SIM can reach at distance exactly i. We propose the CascadeGeneration algorithm (Algorithm 8.2) to generate M. We first add SIM to the cascade M, and compute Dℓi for all times ti and locations Lℓ by running a BFS starting from SIM (Line 2). Then we select nodes into M by running another BFS from SIM as well: at each distance i from SIM, for each location Lℓ we pick N(Lℓ, ti) nodes uniformly at random into M, and add the corresponding edges (Lines 4-17). Note that we do this by permuting the set Dℓi. The N(Lℓ, ti) nodes are selected as follows: (1) if |CandidateQueueℓ| ≥ N(Lℓ, ti) (the constraint in Eqn. 8.3 holds), we uniformly at random pick N(Lℓ, ti) nodes from CandidateQueueℓ into M (Lines 8-9); (2) otherwise, we add all nodes in CandidateQueueℓ to M and record the number of nodes left over (Lines 11-12), finally picking the remaining nodes uniformly at random from Vℓ into M (Line 18).

Lemma 8.8. CascadeGeneration takes O(|V| + |E|) time.

Proof. Running the BFS takes O(|V| + |E|) time (Line 2). For each timestep ti at each location Lℓ, we check the nodes in Dℓi; hence, overall we just need to traverse the nodes once, which takes linear time (Lines 4-17). Hence, CascadeGeneration takes O(|V| + |E|) time.


Algorithm 8.2 CascadeGeneration
Require: Graph G, propagation log matrix N, and node set SIM
1: Add all nodes in SIM to the cascade M
2: Compute Dℓi for all times ti (by running BFS from SIM)
3: PreSet = SIM, NumLeftNode = 0
4: for i = 1 to tmax do
5:   for each location Lℓ do
6:     Dℓi = Permutate(Dℓi)
7:     Add Dℓi to the end of CandidateQueueℓ
8:     if |CandidateQueueℓ| ≥ N(Lℓ, ti) then
9:       CurSet = pop N(Lℓ, ti) nodes from the top of CandidateQueueℓ
10:    else
11:      CurSet = pop all nodes in CandidateQueueℓ
12:      NumLeftNode += (N(Lℓ, ti) − |CandidateQueueℓ|)
13:    end if
14:    Add CurSet to M, and edges from PreSet to CurSet if e(u, v) ∈ G for any u ∈ PreSet and v ∈ CurSet
15:  end for
16:  PreSet = CurSet
17: end for
18: Uniformly at random pick NumLeftNode nodes from Vℓ into M
19: return M
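For concreteness, here is a small sketch of the node-selection core of this procedure for a single location; the DAG edges are then added exactly as in Definition 8.1, networkx is used purely for illustration, and the leftover-demand handling (Line 18) is only noted in a comment.

import random
from collections import deque
import networkx as nx

def bfs_layers(G, seeds):
    # D_i: nodes first reached at distance exactly i from the seed set.
    dist = {s: 0 for s in seeds}
    q = deque(seeds)
    layers = {0: list(seeds)}
    while q:
        u = q.popleft()
        for v in G.neighbors(u):
            if v not in dist:
                dist[v] = dist[u] + 1
                layers.setdefault(dist[v], []).append(v)
                q.append(v)
    return layers

def select_cascade_nodes(G, seeds, counts):
    # counts[i] is N(L, t_i) for one location; counts[0] is covered by seeds.
    layers = bfs_layers(G, seeds)
    chosen, queue = {0: list(seeds)}, []
    for i in range(1, len(counts)):
        layer = layers.get(i, [])
        queue += random.sample(layer, len(layer))    # Line 6: permute D_i
        chosen[i], queue = queue[:counts[i]], queue[counts[i]:]
        # Any leftover demand (counts[i] > |queue|) would be filled uniformly
        # at random from the location's remaining nodes, as in Line 18.
    return chosen

G = nx.path_graph(6)
print(select_cascade_nodes(G, seeds=[0], counts=[1, 1, 1]))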

Extending CascadeGeneration to k cascades. We can simply extend Algorithm 8.2 to k cascades. Note that CascadeGeneration permutes the nodes in Dℓi (Line 6); hence, different permutations generate different cascades. If the constraint in Eqn. 8.3 holds at time ti, we add N(Lℓ, ti) nodes uniformly at random into M from the Σ_{j=1..i} |Dℓj| − Σ_{j=1..i−1} N(Lℓ, tj) candidate nodes. If the constraint does not hold, we pick the extra nodes uniformly at random from V − VM into M.

Remark 8.2. The above random process can generate O(Π_{Lℓ∈L} Π_i |Dℓi|) cascades.

Remark 8.2 shows that we have a large number of possible cascades. In case we need more, we can generate extra cascades by ranking the results of SampleGreedy: instead of picking only the best Sℓ, we pick the top sets (Algorithm 8.1, Lines 10-11). In practice, as shown in our experiments, we do not need to do this, as we have enough cascades. In addition, our cascades have high quality: the average value of αM is almost the same as the optimal solution (Table 8.3).

MappingGeneration. Combining the above results, we propose the MappingGenerationalgorithm (Algorithm 8.3) to solve Problem 8.1.

Claim 8.1. The time complexity of MappingGeneration (Algorithm 8.3) is O((|V| ||N(t0)||1 + |E|) tmax + k(|V| + |E|)), where k is the number of runs of CascadeGeneration needed to get k cascades.


Algorithm 8.3 MappingGeneration
Require: Graph G, propagation log R
1: Generate the propagation log matrix N
2: Run SampleGreedy(G, N) (Algorithm 8.1) to get SIM
3: Run CascadeGeneration(G, N, SIM) (Algorithm 8.2) until k unique cascades are found for M
4: return M

8.3.2 Data-Driven Immunization

Main Idea: In this section, we solve the Data-Driven Immunization problem (Problem 8.2), assuming the samples are available. We first show that ρG,Mi(x) in Problem 8.2 is neither submodular nor supermodular. We then propose to optimize an alternative credit-based objective function, which is an upper bound of ρG,Mi(x) (Problem 8.4). We show that this function is non-negative, increasing, and has the diminishing-return property. Based on these properties, we propose a greedy algorithm which gives a (1 − 1/e)-approximate solution.

Let us define ρG,Mi(x) = σG,Mi(0) − σG,Mi(x). ρG,Mi(x) can be thought of as the number of nodes we can save if x is allocated. Since σG,Mi(0) is constant, we can rewrite Problem 8.2 as

x∗ = arg max_x (1/|M|) Σ_{Mi∈M} ρG,Mi(x).

Since we allocate x uniformly at random within locations, ρG,Mi(x) can be written as ρG,Mi(x) = Σ_S Pr(S) · rG,Mi(S), where S is a node set sampled from the random process of distributing x (|S| = ||x||1), and rG,Mi(S) is the number of nodes saved in Mi (i.e., no longer reachable from SIMi) after removing S. Recall that Problem 8.2 is NP-hard; hence, maximizing (1/|M|) Σ_{Mi∈M} ρG,Mi(x) is also NP-hard.

Figure 8.2: Counter-Example

Note that ρG,Mi(x) is defined over an integer lattice, and is not a simple set function. If a function h(x) has the diminishing-return property over an integer lattice, then for any x′ ≥ x and any k, we have h(x + ek) − h(x) ≥ h(x′ + ek) − h(x′), where ek is the vector with 1 at the k-th index and 0 elsewhere. According to [185], there exists a near-optimal greedy algorithm to maximize such an h(x). Unfortunately, ρG,Mi(x) does not have the diminishing-return property.

Remark 8.3. ρG,Mi(x) does not have the diminishing-return property. Figure 8.2 shows a counter-example, where all nodes are in different locations. Suppose x = 0 and x′ = e1; then x ≤ x′. However, ρG,Mi(x + e2) − ρG,Mi(x) = 5, while ρG,Mi(x′ + e2) − ρG,Mi(x′) = 8 − 2 = 6.

Instead, we develop a contribution-based approach. The idea is that if we remove a node u in Mi, the number of nodes u can save is related to u's children: each child of u can contribute to the savings of removing u. First, let us denote by INMi(S) the set of S's parents in Mi, i.e., INMi(S) = {u | e(u, v) ∈ Mi, v ∈ S}, and by OUTMi(S) the set of S's children in Mi. We define the contribution CG,Mi(S) recursively:

CG,Mi(S) = |S| + Σ_{v∈OUTMi(S)} (|INMi({v}) ∩ S| / |INMi({v})|) · CG,Mi({v}).

Here |INMi({v}) ∩ S| / |INMi({v})| is the fraction of savings that v contributes to S. The intuition is that, since we do not have any propagation model, it is reasonable to assume the infected node v could have been infected by any of its parents equally; hence, v contributes its savings equally to each of its parents. Now we define the contribution function over the integer lattice:

ζG,Mi(x) = Σ_S Pr(S) · CG,Mi(S),   (8.5)

where S is a node set sampled from the random process of distributing x (|S| = ||x||1). Lemma 8.9 shows that ζG,Mi(x) is an upper bound of ρG,Mi(x), the expected number of nodes saved.

Lemma 8.9. Given a cascade Mi, ρG,Mi(x) ≤ ζG,Mi(x).

Proof. Since ζG,Mi(x) = Σ_S Pr(S) · CG,Mi(S) and ρG,Mi(x) = Σ_S Pr(S) · rG,Mi(S), we need to show that rG,Mi(S) ≤ CG,Mi(S). rG,Mi(S) is the number of nodes S can save in Mi. We can show that, for any node u that S can save, the credit u gives to S must be 1. This is because, if we can save u, every path from SIM to u has been broken when S is removed; the removed nodes on these paths are exactly the nodes through which u's credit is propagated toward S, so all of u's credit is contributed to CG,Mi(S). Hence, CG,Mi(S) is at least equal to rG,Mi(S). On the other hand, nodes that S cannot save may also contribute to the credit CG,Mi(S). Hence, CG,Mi(S) ≥ rG,Mi(S), which leads to ρG,Mi(x) ≤ ζG,Mi(x).

We use ζG,Mi(x) to estimate ρG,Mi(x). Hence, we formally define the following problem in place of Problem 8.2.

Problem 8.4. Given a contact network G(V, E), a set of cascades M, and a budget m, find a vaccine allocation vector x∗:

x∗ = arg max_x (1/|M|) Σ_{Mi∈M} ζG,Mi(x),   s.t. ||x||1 = m.   (8.6)

Lemma 8.10. ζG,Mi(x) has the following properties:

(P1) ζG,Mi(x) ≥ 0 and ζG,Mi(0) = 0.

(P2) (Non-decreasing) ζG,Mi(x) ≤ ζG,Mi(x + ei) for all i.

(P3) (Diminishing return) For any x′ ≥ x, we have ζG,Mi(x + ei) − ζG,Mi(x) ≥ ζG,Mi(x′ + ei) − ζG,Mi(x′).

Proof. This follows the proof in Lemma 5.2.

Given the properties of ζG,Mi(x) in Lemma 8.10, we propose a greedy algorithm, ImmuNaiveGreedy, for Problem 8.4: each time, we give one vaccine to the location Lℓ∗ such that

ℓ∗ = arg max_{Lℓ} Σ_{Mi∈M} [ζG,Mi(x + eℓ) − ζG,Mi(x)],

until m vaccines are allocated.

Lemma 8.11. ImmuNaiveGreedy gives a (1 − 1/e)-approximate solution to Problem 8.4.

Proof. By Theorem 5.1, since ζ_{G,M_i}(x) satisfies P1, P2 and P3, ImmuNaiveGreedy gives a (1 − 1/e)-approximate solution.

In ImmuNaiveGreedy, since we distribute vaccines uniformly at random, we can apply the Sample Average Approximation (SAA) framework, i.e., ζ_{G,M_i}(x) ≈ (1/|S|) ∑_{S ∈ S} C_{G,M_i}(S), where S is a set of samples taken from the vaccine allocation process. This approach takes O(|S|(|V| + |E|)) time to estimate ζ_{G,M_i}(x), and we need to look into all |M| cascades to pick the best location L_{ℓ*} in each iteration. We have |L| locations and m vaccines. Hence, the total time complexity of ImmuNaiveGreedy is O(m|L||M||S|(|V| + |E|)), which is not practical for large networks. However, we can speed up this naive greedy algorithm.

Speeding up ImmuNaiveGreedy. We propose a faster algorithm, ImmuConGreedy (Contribution-based Greedy Immunization), in Algorithm 8.4, which takes only O(m|M|(|V| + |E|)) time. The idea is that we can compute the contribution function efficiently when the budget is m = 1, i.e., all values of ζ_{G,M_i}(e_ℓ) in M_i can be obtained in O(|V| + |E|) time. This is because ζ_{G,M_i}(e_ℓ) = ∑_{u ∈ V_ℓ} (1/|L_ℓ|) C_{G,M_i}({u}), and we can get C_{G,M_i}({u}) for all u ∈ V by traversing M_i once. For simplicity, let d_in(v) = |IN_{M_i}({v})|. We have C_{G,M_i}({u}) = 1 + ∑_{v ∈ OUT_{M_i}({u})} (1/d_in(v)) C_{G,M_i}({v}). If u does not have any children (OUT_{M_i}({u}) = ∅), C_{G,M_i}({u}) = 1. Since M_i is a DAG, we can iteratively obtain C_{G,M_i}({u}) for all u ∈ V in the reverse order of a topological sort, which takes O(|V| + |E|) time.

In Algorithm 8.4, we compute the contribution function C_{G,M_i}({u}) for all M_i (Line 4), which takes O(|M|(|V| + |E|)) time. Then we obtain ∑_{M_i ∈ M} ζ_{G,M_i}(e_ℓ) for each location L_ℓ by summing up the contributions of the nodes u ∈ V_ℓ (Line 5), which takes O(|M||V|) time. Once we allocate one vaccine to the best location L_{ℓ*}, we update each M_i by removing one node in L_{ℓ*} uniformly at random (Line 7). This way, in the next iteration we can again compute just ∑_{M_i ∈ M} ζ_{G,M_i}(e_ℓ) instead of ∑_{M_i ∈ M} ζ_{G,M_i}(x + e_ℓ).

Algorithm 8.4 ImmuConGreedy

Require: graph G(V,E), propagation log R, and budget m
1: M = MappingGeneration(G, R) {Section 8.3.1}
2: x = 0
3: for j = 1 to m do
4:   ∀M_i ∈ M: compute C_{G,M_i}({u}) for each node u
5:   ∀ location L_ℓ ∈ L: compute ∑_{M_i ∈ M} ζ_{G,M_i}(e_ℓ)
6:   ℓ* = arg max_{L_ℓ} ∑_{M_i ∈ M} ζ_{G,M_i}(e_ℓ)
7:   ∀M_i ∈ M: update M_i by removing one node at location L_{ℓ*} uniformly at random
8:   x = x + e_{ℓ*}
9: end for
10: return x

Lemma 8.12. ImmuConGreedy takes O(m|M|(|V| + |E|)) time.

Proof. First, for simplicity, let d_in(v) = |IN_{M_i}({v})|. Since C_{G,M_i}({u}) = 1 + ∑_{v ∈ OUT_{M_i}({u})} (1/d_in(v)) C_{G,M_i}({v}), if u does not have any children (OUT_{M_i}({u}) = ∅), clearly C_{G,M_i}({u}) = 1. Since M_i is a DAG, we can iteratively obtain C_{G,M_i}({u}) in the reverse order of a topological sort, which takes O(|V| + |E|) time. Hence, Line 4 takes O(|M|(|V| + |E|)) time.

Second, ζ_{G,M_i}(e_ℓ) = ∑_{v ∈ V_ℓ} Pr({v}) C_{G,M_i}({v}). Since we give vaccines to locations uniformly at random, Pr({v}) = 1/|L_ℓ|. Hence, ζ_{G,M_i}(e_ℓ) = ∑_{v ∈ V_ℓ} (1/|L_ℓ|) C_{G,M_i}({v}), and Line 5 takes O(|M||V|) time.

Third, Lines 6 and 7 take O(|L|) and O(|M|) time, respectively. Hence, the overall running time is O(m|M|(|V| + |E|)).

8.4 Experiments

We conducted all experiments on a machine with four Xeon E7-4850 CPUs and 512GB of 1066MHz main memory.


8.4.1 Experimental Setup

Networks. We conduct experiments on multiple datasets (Table 8.2).

Table 8.2: Network Datasets

Dataset     Nodes        Edges       Locations
WorkPlace   92           757         5
HighSchool  182          2221        5
SBM         1000         5000        20
MIAMI       2.2 million  50 million  74
Houston     2.7 million  59 million  98

(1) Stochastic Block Model (SBM) [102] is a well-known graph model for generating synthetic graphs with groups.

(2) WorkPlace and HighSchool are social contact networks^1. Nodes in HighSchool are students from 5 different sections, and edges indicate that two students were in the vicinity of each other. Nodes in WorkPlace are employees of a company with 5 departments, and edges indicate that two people were in proximity of each other. We treat each section/department as a location.

(3) MIAMI and Houston are million-node social-contact graphs from city-scale activity-based synthetic populations, as described in Section 8.1. We divided people by their residential zipcodes.

Propagation logs. We have the billion-record eHCR data (described in Section 8.1) as the propagation log R for MIAMI and Houston; MIAMI and Houston have 118K and 132K patients, respectively. For SBM, HighSchool, and WorkPlace, we run the well-known SIR model (infection rate 0.4, recovery rate 0.6) to generate the propagation log R: we first pick 5% of the nodes at each location uniformly at random as seeds at t_0, then run an SIR simulation to obtain the other infected nodes.
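A minimal sketch of this log-generation step is shown below, assuming an unweighted adjacency dict adj and the location partition used above (both hypothetical input formats); each entry of the returned log lists the nodes newly infected at that timestep.

import random

def sir_propagation_log(adj, locations, beta=0.4, delta=0.6, seed_frac=0.05):
    # adj[u]: list of u's neighbors; locations: {loc_id: [nodes]}.
    infected = set()
    for nodes in locations.values():          # 5% random seeds per location
        infected |= set(random.sample(nodes, max(1, int(seed_frac * len(nodes)))))
    recovered, log = set(), [sorted(infected)]
    while infected:
        new = {v for u in infected for v in adj.get(u, [])
               if v not in infected and v not in recovered
               and random.random() < beta}
        recovered |= {u for u in infected if random.random() < delta}
        infected = (infected | new) - recovered
        if new:
            log.append(sorted(new))           # log[t] = N(t)
    return log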

Settings. We set the number of samples to |M| = 1000 for MappingGeneration, and the number of bitmasks to 32 for computing F(·) in SampleGreedy (similar to the ANF algorithm [126]).

Baselines. As we are not aware of any direct competitor tackling our problem, we use several baselines to better judge our performance. These baselines have been used regularly in immunization studies; however, none of them takes into account both the propagation log and the contact network.

(1) Random: assign vaccines to locations uniformly at random.

1 http://www.sociopatterns.org


(2) PropPopulation: a data-based approach that assigns vaccines to locations in proportion to their populations.

(3) PropInfection: a data-based approach that assigns vaccines in proportion to the total number of infections in each location.

(4) Degree: a graph-based approach that calculates the average degree d_{L_i} of each location L_i, and independently assigns vaccines to L_i with probability d_{L_i} / ∑_{L_k ∈ L} d_{L_k}.

(5) ImmuModel: a model-based approach that applies the model-driven group immunization algorithm (the QP version) of [185]. ImmuModel aims to minimize the spectral radius of the contact graph. The spectral radius is the largest eigenvalue of the graph's adjacency matrix, which has been proven to be the epidemic threshold of the graph [131]. We set edge weights to 0.24 according to [123].

8.4.2 Results

In short, we demonstrate that our immunization algorithm ImmuConGreedy outperforms the baselines on all datasets. We also show that our approach is robust as the size of the propagation log R varies. In addition, we show that our sampling algorithm SampleGreedy provides accurate results for generating cascade samples. Finally, we study the scalability of our approach.

Effectiveness of ImmuConGreedy. Figure 8.3 shows the results of minimizing the spread on the cascades for the whole log R. On all datasets, ImmuConGreedy consistently outperforms the others. WorkPlace and HighSchool have fewer than 200 nodes, hence we varied m only up to 10; yet even with this small budget, ImmuConGreedy reduces 45% of the infection, about 10% better than the second best, ImmuModel. For MIAMI and Houston, with up to 2.7 million nodes, ImmuConGreedy reduces about 50% of the infection on the cascades generated by SocialContact with only 50K vaccines. The model-based ImmuModel and the data-based PropInfection perform better than Random and Degree, as they take into account either the epidemic threshold of the contact graph or the eHCR data. However, ImmuConGreedy easily outperforms them, as it leverages both the contact network and the eHCR data.

We also study how to leverage the rich log data to plan future vaccine interventions. To do so, we split the eHCR data into a training part and a testing part: we obtain the vaccine allocations from the training part (the fall flu regime, Aug 2009 - Oct 2009) and apply the allocations to the testing part (the winter flu regime, Nov 2009 - Feb 2010) to examine how effective ImmuConGreedy is. Figure 8.4 shows the infection reductions on the cascades generated from the testing data. ImmuConGreedy consistently outperforms the others on both MIAMI and Houston: it reduces about 25% of the infection with only 5K vaccines, compared to baselines like ImmuModel and PropInfection.


We use simulations of the SIR model to evaluate the performance of ImmuConGreedy on the activity-based urban social contact networks (described in Section 8.1). These were first calibrated to match the outbreak size in the eHCR data for these cities. We then choose a random subset of individuals in each zipcode to be vaccinated, based on the allocation from ImmuConGreedy. We find that the reduction in the number of infections is quite substantial in many cases. For instance, for Miami, with a budget of 50K vaccines, the ImmuConGreedy allocation leads to more than a 50% reduction compared to a random allocation.

Figure 8.3: Effectiveness of ImmuConGreedy on the whole R for (a) WorkPlace, (b) HighSchool, (c) MIAMI and (d) Houston. Infection ratio r vs. vaccine budget m, where r = ∑_{M_i ∈ M} σ_{G,M_i}(x) / ∑_{M_i ∈ M} σ_{G,M_i}(0); lower is better. ImmuConGreedy consistently outperforms all baselines over all datasets.

Robustness of ImmuConGreedy. Next, we study how sensitive ImmuConGreedy is as the size of the propagation log R varies. To do so, we first generate a synthetic propagation log R from the SIR model, then manually vary the size of R used as input. Finally, we compare ImmuConGreedy to the model-based approach ImmuModel. For each dataset, we generate R by running an SIR simulation (with infection rate 0.4 and recovery rate 0.6 for WorkPlace, HighSchool and SBM, and infection rate 0.24 with 7 timesteps to recovery for MIAMI, according to [123]). Once R is generated, we vary its size by extracting a prefix [N(t_0), ..., N(t)] as input (p% of R). For example, if t_max = 20 and p = 50, we use [N(t_0), ..., N(t_10)] as the propagation log. Since all configurations come from the SIR model, we expect the model-based ImmuModel to do better than ImmuConGreedy; however, as p increases and more data is used, ImmuConGreedy should approach ImmuModel. Figure 8.5 shows the results: as expected, on all datasets ImmuConGreedy improves as p increases. Interestingly, for smaller datasets like WorkPlace, HighSchool and SBM, even with only 25% of the data we achieve up to 85% of ImmuModel's performance. For large networks like MIAMI, more data is needed; however, when all the data is used, ImmuConGreedy achieves 90% of the savings of ImmuModel.

Effectiveness of MappingGeneration. We also study the performance of MappingGeneration by comparing ᾱ_M to the optimal value α* (Problem 8.1), which we obtain using a brute-force algorithm. See Table 8.3: ᾱ_M, the average value of α_M over all sampled cascades, is almost the same as α* for all datasets. For example, in SBM, ᾱ_M is 107.9, a difference of only 1.1 from α*. In addition, we found that α* is exactly the number of nodes that are infected after the first timestep t_0, which suggests that in the best mapping for SocialContact, only the nodes infected at the earliest time are not attributed to social contact.


Figure 8.4: Effectiveness of ImmuConGreedy on the testing data for (a) MIAMI and (b) Houston. Infection ratio r vs. vaccine budget m; lower is better. ImmuConGreedy consistently outperforms all baselines on both MIAMI and Houston.

Figure 8.5: Robustness of ImmuConGreedy as the data size varies. Ratio of saved nodes RS vs. percentage of used log data p%, where RS = S_Data / S_Model, and S_Data (S_Model) is the number of nodes we can save when vaccines are allocated according to ImmuConGreedy (ImmuModel). Percentage of used log data p: [N(t_0), ..., N(t_{p% · t_max})]. Higher means ImmuConGreedy is closer to ImmuModel.


Scalability. Figure 8.6 shows the running time of MappingGeneration and ImmuConGreedy w.r.t. the vaccine budget m and the number of cascades k on SBM. For Figure 8.6(a) we set k = 100, while for Figure 8.6(b) we set m = 20. We observe that the running time scales linearly as m and k increase (the figures also show the linear fits with R² values).


Table 8.3: MappingGeneration. ᾱ_M: average of α_M over all M ∈ M; α*: optimal value of α_M; N = ∑_{t=t_1}^{t_max} |N(t)|_1.

Dataset     ᾱ_M    α*     N
WorkPlace   79.2   83.0   83
HighSchool  165.2  170.0  170
SBM         107.9  109.0  109

Figure 8.6: Scalability on SBM. (a) Total running time of MappingGeneration and ImmuConGreedy vs. vaccine budget m; (b) total running time of MappingGeneration and ImmuConGreedy vs. number of cascade samples k.

Consistent with the time complexity bounds for our algorithms in Section 8.3, large datasets require fairly extensive time; for example, MIAMI takes about 2 days to allocate 5K vaccines. This is still reasonable: importantly, we expect to run immunization algorithms for infectious epidemics, where solution quality is much more critical than the fastest running time.

8.4.3 Case Studies

We conduct case studies to analyze the vaccine allocations per zipcode for both Houston and MIAMI. Figure 8.7 shows the total population, the total number of patients in the eHCR data, the total number of vaccines taken in the eHCR data^2, the total number of vaccines from ImmuModel, and the total number of vaccines from ImmuConGreedy, respectively.

Figures 8.7(a)-(e) show the case study for Houston. First, the areas with zipcodes 77030 and 77024 in Figure 8.7(b) have the largest numbers of patients, and the vaccine allocations from both eHCR (Figure 8.7(c)) and ImmuConGreedy (Figure 8.7(e)) also prefer these areas. Second, the vaccines taken in the eHCR data do not follow the total population (Figure 8.7(a)), but roughly follow the distribution of patients in eHCR. This may suggest that the immunization strategy in practice is to give vaccines based on the proportion of reported patients.

2 We extract vaccine reports based on ICD-9 code V04.81; these are the vaccine allocations actually administered in practice.


Figure 8.7: Case studies for Houston ((a)-(e)) and MIAMI ((f)-(j)), per location. Heatmaps of (a) and (f): total population; (b) and (g): patients in eHCR; (c) and (h): number of vaccines actually taken in eHCR; (d) and (i): vaccine allocations from ImmuModel; (e) and (j): vaccine allocations from ImmuConGreedy.

Third, ImmuModel distributes 38% of the vaccines to three areas (77002, 77008 and 77056), which are at the center of the Houston metropolitan area (downtown and uptown) and have a large number of interactions in the contact network. However, neither the data-based nor the model-based approaches perform well (see Figure 8.3). Our method, ImmuConGreedy, gives 43% of the vaccines to areas 77030, 77024 and 77002. The first two areas have the highest infection counts in eHCR, while the last one is essential for minimizing the epidemic threshold, as ImmuModel suggests. Hence, ImmuConGreedy considers both eHCR and the contact network. It is interesting that the Texas Medical Center (one of the largest medical centers in the world) is in 77030, and Houston's downtown is in 77002. Hence, ImmuConGreedy targets regions with a high risk of influenza outbreaks.

Figures 8.7(f)-(j) show the case study for MIAMI. First, the vaccines taken in eHCR (Figure 8.7(h)) again follow the distribution of patients (Figure 8.7(g)). Second, ImmuModel distributes 31% of the vaccines to a single area with zipcode 33165 (Figure 8.7(i)). We believe this area, with its large number of households, is critical for minimizing the spectral radius of the contact network in MIAMI. However, neither the data-based nor the model-based approaches perform well in MIAMI either (as shown in Figure 8.3). Interestingly, as shown in Figure 8.7(j), our approach, ImmuConGreedy, gives most of the vaccines (29% and 18%) to the areas with the largest numbers of patients (33140 and 33176, respectively). We observe that, unlike in Houston, in MIAMI ImmuConGreedy tends to favor the data-based allocation.


However, the areas adjacent to 33165, which ImmuModel targets, also get higher vaccine allocations than the others; this means that ImmuConGreedy also takes the information in the contact network into account. In fact, the areas ImmuConGreedy targets indeed have a high risk of influenza outbreaks: they are either tourist attractions (33140) or residential areas (33176). For example, 33140 belongs to Miami Beach, a famous place with a large transient population.

8.5 Conclusion

This chapter addresses the novel problem of controlling epidemics in the presence of coarse-grained health surveillance data and population contact networks. We formulate the Data-Driven Immunization problem, which first aims to align the propagation log with the contact network, and then allocates vaccines to minimize the spread in the data. We develop an efficient approach, MappingGeneration, to obtain high-quality cascades, and then give an approximation algorithm, ImmuConGreedy, with provable solutions for immunization on the sampled cascades. We demonstrate the effectiveness of our method through extensive experiments on multiple datasets, including nation-wide real electronic Health Reimbursement Claims data. Finally, case studies in the Miami and Houston metropolitan regions show that our allocation strategies take both the network and the surveillance data into account to effectively distribute vaccines.


Chapter 9

Detecting Media and Kernel Community

Given a large graph G, possibly learnt from cascade analysis, can we find communities of bridges and influential nodes? Diffusion over networks is an important phenomenon with many applications in areas such as public health, social media, and cyber security. The problem of community detection (i.e., finding cohesive groups of nodes) has been extensively studied in many fields, and many algorithms have been proposed. The typical assumption is that communities have denser internal connectivity and sparser external connectivity (so-called 'cavemen' communities) [119]. Such notions have been relaxed and extended to handle overlapping structures too [178]. While very useful for understanding network topology in general, they may not be ideal for discovering how information propagates when networks are actually being used for diffusion. Other lines of recent work try to learn influence models at community scale, using groups supplied by graph-partitioning algorithms like METIS [105], or extract the structure of high-degree/celebrity nodes [172]. Instead, in this chapter, we explore community detection by factoring in the different roles of nodes during diffusion in a general way, without restrictive assumptions on the process.

Based on just the diffusive properties of the network, we want to discover the nodes which are critical for diffusion (the 'media nodes'/'bridges') and understand how they connect to celebrities/'kernels' and other ordinary nodes. Media nodes bridging celebrities and ordinary nodes may not necessarily have a large number of connections, making them harder to extract, and traditional community detection algorithms usually cannot uncover this tri-partite structure. Finding this structure can help in downstream tasks like viral marketing, link prediction, immunization, and so on. We demonstrate an example in Figure 9.1: the left figure is a Twitter retweet network with two communities, technology and entertainment. Each community has three types of users: celebrities, media, and other nodes. The middle figure is the result of community detection obtained from Newman's classical modularity-based algorithm [119]. The right figure is the result obtained from our algorithm NetCondense.



Newman’s algorithm uncovers communities that are horizontal, which groups all three typesof nodes together. However, our algorithm identifies media nodes, and discovers verticalkernel communities which group celebrities with common interest. Our contributions include:

Figure 9.1: Our method detects more intuitive structure: (a) an example Twitter retweet network; (b) communities detected by Newman's algorithm; (c) ordinary communities (green), media nodes (black), and kernel communities (red) detected by NetCondense.

• Problem Formulation: We design a novel task, MeiKeCom, to find communities of nodes using their diffusive properties. MeiKeCom is an intuitive and principled optimization-based formulation. To the best of our knowledge, we are the first to study such a task under a diffusion setting.

• Effective Algorithms: We develop NetCondense, an efficient and practical algorithm to identify media nodes and kernel communities. We use a variety of techniques, including graph summaries.

• Extensive Experiments: We run extensive experiments and conduct case studies on large real-world networks to demonstrate the effectiveness of our algorithm. It finds high-quality groups, outperforming several non-trivial baselines.

This work has been published in SDM 2017 [183]. Next, we formally formulate the problem, and then give an efficient two-stage algorithm to solve it. Finally, we demonstrate the effectiveness of our algorithm via extensive experiments and case studies.

9.1 Problem Formulation

Table 9.1 lists the main symbols we use in this chapter.


Table 9.1: Terms and Symbols

Symbol            Definition and Description
G(V,E,W)          graph with node set V, edge set E and weight set W
w_ij              edge weight in G (prob. that i infects j)
K; l              set of kernel communities, with |K| = l
K_i; k_i          the i-th kernel community, with k_i nodes
K                 set of all kernel nodes (= ∪_{i=1}^{l} K_i)
M; m              set of media nodes, with |M| = m
σ(S)              downstream effect of a set S
ρ(S)              upstream effect of a set S
φ(S)              full-stream effect of a set S
φ_b(a)            local effect of edge (a,b) on φ(a)
sim_M(i,j)        Jaccard similarity of nodes i and j w.r.t. M
1(u,v,E)          indicator of whether (u,v) or (v,u) ∈ E
u                 the right eigenvector of λ_G
G′(V′,E′,W′)      graph encoding the local effects in G
w′_ij             edge weight in G′
λ_G (λ′_G)        the largest eigenvalue of the adjacency matrix of G (G′)

9.1.1 Preliminaries

We assume that our network G(V,E,W) is a weighted directed graph, where V is the set of vertices, E is the set of edges, and W = {w_ij | (i,j) ∈ E} is the set of edge weights (WLOG, we assume w_ij ∈ (0,1]). w_ij measures the "strength" of the interaction from i to j, e.g., retweets. Now consider a diffusion process on G, such as the spread of a meme over a blog network or of a topic over a citation network; in fact, G may have been learnt from observing such a process itself. For ease of description, we assume that the diffusion follows the well-known Independent Cascade (IC) model [72]. However, our method can be naturally generalized to a wide variety of cascade-style models like SIR, SIS and others [5], as it leverages the local effects of diffusion. In the IC model, each infected node i gets only one chance to infect each healthy neighbor j, independently with probability w_ij.

9.1.2 Media nodes

In a network G, different nodes may have different impacts on a diffusion process. For example, there exists a small fraction of nodes that are influential ('celebrities'), e.g., users like Barack Obama who get retweeted many times on Twitter. Several previous works have studied the


problem of finding celebrities and their structures [72, 172]. In addition to celebrities, another type of node also plays an important role in information diffusion. These nodes need not be as well connected as celebrities; however, they are willing to receive information from various types of nodes, and also willing to create/push information to other nodes [85]. We notice that they can be treated as "bridges" between celebrities and the rest of the network (including other celebrities and normal users) that boost the diffusion. For example, in the Twitter retweet network (Figure 9.1), CNN and TEDchris have many connections to celebrities (elonmusk, spacex and TeslaMotors), and they are also followed by the rest of the network. Once a celebrity posts a tweet, the tweet can quickly reach these nodes (upstream) and then effectively propagate to other nodes via them (downstream). In other words, structurally speaking, the main observation is that these nodes have both a high upstream effect (of getting influenced) and a high downstream effect (of influencing other nodes) during information diffusion. We call these nodes "media nodes".

Comparison to Role Discovery. The concept of bridge nodes has been studied in previous work on role discovery [63, 92, 179] (see details in the related work). However, all of the above studies assume that bridge nodes structurally connect homogeneous nodes/communities (such as celebrities). In contrast, our description of media nodes is from the viewpoint of information diffusion: (a) they are easily influenced by celebrities while also tending to influence many other nodes; and (b) they bridge heterogeneous nodes (celebrities and other nodes) as well as homogeneous nodes (celebrities and celebrities).

So media nodes have the following properties:

Property PM1: Upstream effect of diffusion. The upstream effect of a node set S on diffusion is its capability of getting influenced by other nodes, i.e., the probability of the nodes in S getting infected in general.

Property PM2: Downstream effect of diffusion. The downstream effect of S on diffusion is its capability of influencing other nodes, i.e., how many nodes S can infect when used as a seed set.

Media node set M. A media node set M should have both a high upstream and a high downstream effect. We first define the upstream and downstream effects of diffusion formally. We define the upstream effect of a node set S, ρ(S), as the expected number of infected nodes in S over all possible seed sets chosen uniformly at random, i.e., ρ(S) = ∑_{A⊆V} Pr(A) ρ_A(S), where A ranges over all possible choices of the seed set and Pr(A) is the probability of each choice (A is chosen uniformly from the subsets of V, so Pr(A) = 1/2^{|V|}). ρ_A(S) is the expected number of active nodes in the set S at the end of the diffusion process under the IC model, given seed set A. As ρ_A(S) measures how many nodes in S get influenced when A is the seed set, intuitively ρ(S) measures how likely the nodes in S are to be influenced in general. Hence, the higher ρ(S) is, the higher the overall upstream effect of S.

We define the downstream effect as σ(S). Following the definition in [72], σ(S) is the expected number of active nodes in the entire network at the end of the diffusion process, with S as


seeds. It measures how much influence S can spread over the network (the higher σ(S) is, the higher the downstream effect of S).

Given σ(S) and ρ(S), we define φ(S), the full-stream diffusion effect of S, as φ(S) = σ(S)ρ(S). Intuitively, φ(S) captures the expected "total usefulness" of a node set during diffusion over all possible spreading cascades. A node with a large expected influencing capacity may not necessarily have a large "usefulness", as its ability to get influenced by others may still be small. Formally, a media node set is:

Definition 9.1. (ε-m media node set) Given ε ∈ R+ and m ∈ N, a node set M ⊆ V is an ε-m media node set iff φ(M) > ε and |M| = m.
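Since σ(·), ρ(·) and hence φ(·) are expectations over IC runs, they can be estimated by plain Monte-Carlo simulation. Below is a minimal Python sketch assuming the graph is given as an adjacency list adj[u] = [(v, w_uv), ...] (a hypothetical format); for ρ(S), each node enters the random seed set independently with probability 1/2, matching Pr(A) = 1/2^{|V|}.

import random

def _ic_run(adj, seeds):
    # one Independent Cascade simulation; adj[u] = [(v, w_uv), ...]
    active, frontier = set(seeds), list(seeds)
    while frontier:
        u = frontier.pop()
        for v, w in adj.get(u, []):
            if v not in active and random.random() < w:
                active.add(v)
                frontier.append(v)
    return active

def full_stream_effect(adj, nodes, S, trials=1000):
    # sigma(S): expected spread when S is the seed set.
    sigma = sum(len(_ic_run(adj, S)) for _ in range(trials)) / trials
    # rho(S): expected number of infected nodes inside S when the seed
    # set is a uniformly random subset of V (each node a seed w.p. 1/2).
    rho = sum(len(_ic_run(adj, [u for u in nodes if random.random() < 0.5])
                  & set(S)) for _ in range(trials)) / trials
    return sigma * rho                        # phi(S) = sigma(S) * rho(S)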

9.1.3 Kernel Communities

Given a media node set M, a natural follow-up question is: which nodes have a high influence on media nodes? As mentioned above, there exists a small fraction of influential nodes ('celebrities'); in this chapter, we call them kernel nodes. We observe that kernel nodes typically have high out-degree. For example, in the Twitter network, kernel nodes like Obama have millions of people retweeting them but retweet very few people themselves.

Subsequently, we are interested in communities of kernels, as we want to study groups of nodes that behave similarly. Community structure allows us to uncover the underlying interactions between nodes [49]. First, it is natural to assume that 'kernel communities' are structurally densely connected [49, 119, 172], as kernel nodes tend to have high degree. Second, we also want the nodes in each kernel community to connect to similar media nodes. This can help us understand which groups of nodes have influence patterns similar to each other via media nodes. We observe that kernel nodes connecting to similar media nodes are related; for example, in Figure 9.1, the three related accounts elonmusk, spacex, and TeslaMotors all connect to the media nodes CNN and TEDchris. In sum, kernel communities should have:

Property PK1: Connectivity among themselves.

Property PK2: Similarity w.r.t. media nodes.

Kernel community set K. First, we use sim_M(u,v) to denote how similar the connections of u and v to a media node set M are. Let N_M(i) = {j | j ∈ M, (i,j) ∈ E or (j,i) ∈ E}. Since N_M(i) contains all nodes in M that connect to i, we use the Jaccard similarity between N_M(u) and N_M(v) to represent the similarity of u and v w.r.t. M, namely sim_M(u,v) = |N_M(u) ∩ N_M(v)| / |N_M(u) ∪ N_M(v)|. Let K = {K_1, ..., K_l} denote a set of kernel communities, where each K_i ⊆ V is a kernel community; let K = ∪_{i=1}^{l} K_i be the set of all kernel nodes, 1(u,v,E) an indicator function representing whether (u,v) or (v,u) ∈ E, and w_{uv} the maximum weight between u and v. Now we are ready to give the formal definition of a kernel community:

Definition 9.2. (Kernel community) A set of kernel communities K = {K_1, ..., K_l} satisfies: K_i ⊆ V \ M, K_i ≠ K_j, |K_i| = k_i for all i, and for any nodes u ∈ K_i and v ∉ K_i,

∑_{a ∈ K_i} 1(u,a,E) w_{ua} sim_M(u,a) ≥ ∑_{a ∈ K_i} 1(v,a,E) w_{va} sim_M(v,a).

The intuition is that, for any node u ∈ K_i, the cumulative similarity-plus-connectivity between u and all nodes in K_i should be stronger than that between any node v ∉ K_i and the nodes of K_i. The term 1(u,a,E) w_{ua} comes from PK1, while sim_M(u,a) comes from PK2. Note that two communities can connect to similar media nodes even though they may not be well connected to each other.
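The similarity term sim_M(u,v) used in this definition is straightforward to compute. A minimal sketch, assuming the edge set is given as a set of directed pairs (a hypothetical format):

def sim_M(edges, M, u, v):
    # edges: set of directed pairs (i, j); M: media node set.
    def N_M(i):
        return {j for j in M if (i, j) in edges or (j, i) in edges}
    Nu, Nv = N_M(u), N_M(v)
    return len(Nu & Nv) / len(Nu | Nv) if (Nu or Nv) else 0.0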

9.1.4 Ordinary Nodes

We call the nodes apart from the kernel nodes and media nodes ordinary nodes. They typically have more connections to kernel nodes (due to the high degrees of kernel nodes). Hence, we associate ordinary communities with their corresponding kernel communities. Formally, for K_i, its corresponding ordinary community O_i is obtained by counting the links from each node u ∈ V \ (K ∪ M) to K_i: if node u has the highest number of links to kernel K_i, then u ∈ O_i. Note that, for simplicity, we assume there is no overlap between ordinary communities; if a node u has the same number of links to multiple kernels, we pick one uniformly at random as its associated kernel. For example, in Figure 9.1, jwage has the most connections to elonmusk, spacex and TeslaMotors, so it belongs to that kernel's ordinary community.
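A minimal sketch of this assignment rule, with neighbors[u] denoting u's neighbors in G (a hypothetical input format):

import random
from collections import Counter

def assign_ordinary(ordinary_nodes, kernels, neighbors):
    # kernels: list of kernel-node sets K_1..K_l; neighbors[u]: u's neighbors.
    communities = {i: set() for i in range(len(kernels))}
    for u in ordinary_nodes:
        links = Counter(i for i, K in enumerate(kernels)
                        for v in neighbors[u] if v in K)
        if links:
            top = max(links.values())         # ties broken uniformly at random
            winner = random.choice([i for i, c in links.items() if c == top])
            communities[winner].add(u)
    return communities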

9.1.5 Relative structure

Figure 9.2: Structure of K, M, and O.

Given our definitions above, the structure we want to uncover is shown in Figure 9.2. First, according to PK1, K_1 has more edges within itself than to K_2 in Figure 9.2, and an ordinary community has more connections to one kernel community than to the others (e.g., O_1 mainly connects to K_1). In contrast, due to PM1 and PM2, media nodes can bridge different kernel communities (homogeneous nodes, in the sense that they are all kernels), as well as kernel and ordinary communities (heterogeneous nodes).


9.1.6 The MeiKeCom task

Under the IC model, we now define our task, Detecting Media and Kernel Community (MeiKeCom), as two separate problems:

Problem 9.1. MeiKeCom-media

Given: Graph G(V,E,W), number of media nodes m.

Find: the set M* = arg max_M φ(M), s.t. |M| = m.

Problem 9.2. MeiKeCom-kernel

Given: Graph G(V,E,W), media node set M, and l kernel communities in which each kernel community K_i has k_i nodes.

Find: the kernel community set K = {K_1, ..., K_l},

K* = arg max_K ∑_{K_i ∈ K} ∑_{u,v ∈ K_i} 1(u,v,E) w_{uv} sim_M(u,v)

s.t. ∀i,j: |K_i| = k_i, K_i ≠ K_j, and K_i ⊆ V \ M.

These problems naturally follow from our definitions. We allow our kernel communities to have overlaps for flexibility (in practice, such overlaps are small, less than 5% of the community size). By design, there is no overlap between media and kernel nodes.

Remark 1 [Generality]: Though we assume the diffusion process follows the IC model, MeiKeCom can easily be defined for other infection models (only the definition of φ(M) changes).

Complexity. We have the following propositions:

Proposition 9.1. In MeiKeCom-media, finding an optimal set M for σ(M) is NP-hard, and for ρ(M) it is #P-hard. Also, φ(·) is neither submodular nor supermodular.

Proposition 9.2. MeiKeCom-kernel is NP-hard.

In Proposition 9.1, the #P-hardness can be proved by a reduction from the counting problem of s-t connectedness in a directed graph, which is #P-complete [22]; the NP-hard part is well known [72]; and the lack of submodularity/supermodularity can be shown by two counter-examples. Proposition 9.2 can be proved by a reduction from the well-known Maximum Clique problem [71]. Hence, MeiKeCom is very challenging.

9.2 Our Methods

In this section, we propose a novel multi-stage algorithm, NetCondense, to solve the MeiKeCom task. NetCondense consists of two parts: it first finds M using a merge-based algorithm, and then detects the kernel communities. We mainly focus on Problem 9.1, and give an iterative pairwise relaxation heuristic for Problem 9.2. Once we find M and K, the ordinary communities can be found directly, as described in Sec. 9.1.4.

9.2.1 Finding Media Nodes

We need to optimize φ(M) = σ(M)ρ(M) for MeiKeCom-media. σ(M) can possibly be optimized using influence maximization algorithms [72]. The metric ρ(M) intuitively relates to immunization problems such as [166], where the goal is to remove a set of nodes to maximize the number of nodes saved. However, neither of them alone exactly solves MeiKeCom-media (as also shown in Table 9.4 in our experiments). A naïve approach is Greedy, an algorithm that successively adds the node with the maximum marginal gain of φ(M) to M. However, Greedy involves running Monte-Carlo simulations and costs O(m|V|I(|V| + |E|)) time, where I is the number of simulations; this is infeasible for large networks. Hence, we need a faster algorithm.

Main Idea. We propose a novel merge-based approach instead. The idea is to merge unimportant edges successively while maintaining the overall full-stream effect, such that the nodes that remain unmerged (the 'singleton' nodes) are the ones with the highest φ(·). To find such unimportant edges, we first look at the local contribution of each edge (a,b) to φ(a)^1. We then merge the node pairs that have the smallest impact on the overall full-stream effect, and keep merging node pairs until only m singleton nodes are left; these nodes are the media nodes. This approach raises three important questions: (Q1) How do we quantify the 'local effect' of edge (a,b) on φ(a)? (Q2) How does the local effect change when an edge is merged? (Q3) Which edges should be merged so that the overall change in the full-stream effect is smallest?

Q1: Local effect. We define the local effect of edge (a,b) on φ(a) (denoted by φ_b(a)) as the probability of b getting infected directly through a. Formally, φ_b(a) = ρ(a) w_{ab}. Recall that φ(a) = ρ(a)σ(a). Since σ(a) can be treated as the sum of the probabilities of each node getting infected (σ(a) = ∑_{i∈V} Pr(i gets infected | a is infected)), φ_b(a) can be treated as the direct contribution of edge (a,b) towards φ(a). To compute φ_b(a), the key question is how to obtain ρ(a).

The next proposition shows that ρ(a) is related to u = [u_1, ..., u_{|V|}]^T, the right eigenvector corresponding to the largest eigenvalue λ_G of the adjacency matrix of G. It can be proved by extending Lemma 6 in [168] to a set of cascade-style models, including IC, SIR and SIS, on G.

Proposition 9.3. If λ_G > 1, then for a node a in G, ρ(a) ∝ u_a.

To ensure u_a ∈ R+, G needs to be strongly connected [168]. If not, we can simply extract the giant strongly connected component (GCC) of G and operate on the GCC. This is justified because in real networks most of the nodes lie in the GCC [80]. Moreover, nodes outside the GCC are unlikely to be media nodes, as they usually do not have a high full-stream effect (at least one of σ(·) or ρ(·) is small). Note that this step is only needed for finding media nodes and is not required afterwards. To summarize, using Proposition 9.3, the local effect of edge (a,b) on φ(a) is proportional to u_a w_{ab}, i.e., φ_b(a) ∝ u_a w_{ab}. For convenience, we construct a new graph G′(V′, E′, W′) to represent the local effects φ_b(a), where V′ = V, E′ = E, and w′_{ab} = u_a w_{ab} (as shown in Figure 9.3 (Left)).

1 φ(a) = φ({a}) (similarly for σ(a) and ρ(a)).
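Constructing G′ thus only requires the leading right eigenvector of G's adjacency matrix. A minimal sketch using power iteration on a dense NumPy matrix (for large sparse graphs one would use a sparse eigensolver such as scipy.sparse.linalg.eigs instead):

import numpy as np

def local_effect_weights(A, iters=100):
    # A[a, b] = w_ab on the giant strongly connected component of G.
    u = np.ones(A.shape[0])
    for _ in range(iters):                    # power iteration for the right
        u = A @ u                             # (Perron) eigenvector of lambda_G
        u /= np.linalg.norm(u)
    return A * u[:, None]                     # w'_ab = u_a * w_ab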

Q2: Local effect after merging. The next natural question is: starting from G′, if edge (a,b) is merged to form a new node c, what should the new local effects of the edges from c to its neighbors be? It is intuitive to assume that if c is infected in G′, we really intend to infect only one of the nodes a or b (chosen uniformly at random). Hence, consider a node x that has an edge from a in G′ (see Figure 9.3 (Left)). If we merge a and b to form node c, then after merging, w′_{cx} = [½(u_a + u_b)] · [½ w_{ax}(1 + w_{ba})]. The first term comes from ρ(c) (which is either ρ(a) or ρ(b)); the second term comes from a or b spreading the influence to x (the probability from b to x is w_{ba}w_{ax}, and from a to x it is w_{ax}). Figure 9.3 shows the other cases (such as when nodes s and t connect to both a and b). In summary, the merging process is:

Figure 9.3: Left: the graph G′ with edge weights representing the local effects of diffusion in G; Right: the resulting merged graph with new weights when nodes a and b in G′ are merged into a new node c.

Definition 9.3. (φ-merge) Let N_i(v) (N_o(v)) denote the set of in-neighbors (out-neighbors) of a node v. If the node pair (a,b) is merged into a new node c in G′, then the local effects of the edges between c and its neighbors are:

w′_{nc} = u_n (1 + w_{ab}) w_{na} / 2,  ∀n ∈ N_i(a)\N_i(b),
w′_{nc} = u_n (1 + w_{ba}) w_{nb} / 2,  ∀n ∈ N_i(b)\N_i(a),
w′_{nc} = u_n [(1 + w_{ab}) w_{na} + (1 + w_{ba}) w_{nb}] / 4,  ∀n ∈ N_i(a) ∩ N_i(b);

w′_{cn} = y_{a,b} (1 + w_{ba}) w_{an} / 4,  ∀n ∈ N_o(a)\N_o(b),
w′_{cn} = y_{a,b} (1 + w_{ab}) w_{bn} / 4,  ∀n ∈ N_o(b)\N_o(a),
w′_{cn} = y_{a,b} [(1 + w_{ba}) w_{an} + (1 + w_{ab}) w_{bn}] / 4,  ∀n ∈ N_o(a) ∩ N_o(b),

where y_{a,b} = (u_a + u_b)/2.
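A direct transcription of these merge rules into Python, assuming dict-of-dict weights W (missing edges treated as 0) and in-/out-neighbor sets Ni/No (hypothetical input formats):

def phi_merge(a, b, W, u, Ni, No):
    # W[x][y]: weight w_xy in G (missing edge = 0); u: eigenvector of G;
    # Ni[x] / No[x]: in-/out-neighbor sets. Returns the new local effects
    # w'_nc and w'_cn of Definition 9.3 for the merged node c = {a, b}.
    wab, wba = W[a].get(b, 0.0), W[b].get(a, 0.0)
    y = (u[a] + u[b]) / 2.0
    w_in, w_out = {}, {}
    for n in (Ni[a] | Ni[b]) - {a, b}:
        if n not in Ni[b]:
            w_in[n] = u[n] * (1 + wab) * W[n][a] / 2
        elif n not in Ni[a]:
            w_in[n] = u[n] * (1 + wba) * W[n][b] / 2
        else:
            w_in[n] = u[n] * ((1 + wab) * W[n][a] + (1 + wba) * W[n][b]) / 4
    for n in (No[a] | No[b]) - {a, b}:
        if n not in No[b]:
            w_out[n] = y * (1 + wba) * W[a][n] / 4
        elif n not in No[a]:
            w_out[n] = y * (1 + wab) * W[b][n] / 4
        else:
            w_out[n] = y * ((1 + wba) * W[a][n] + (1 + wab) * W[b][n]) / 4
    return w_in, w_out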


Q3: Selecting node pairs to merge. Definition 9.3 shows how the local effects change when edges are merged. Now let us investigate which node pairs we should merge such that the change in the overall full-stream effect is as small as possible. Note that a small value of w′_{ab} does not mean that the full-stream effect of edge (a,b) on the whole graph is small. To quantify the overall full-stream effect, intuitively, when edges are merged our goal is to maintain the diffusive property of the whole graph G′. Prakash et al. [133] demonstrate that the diffusive property of a graph is captured by the largest eigenvalue of its adjacency matrix for a wide range of cascade-style propagation models, including the IC model; we adapt this methodology in this chapter^2. Earlier in this thesis, we proposed a diffusion-based coarsening algorithm to get a smaller representation of a graph while maintaining its largest eigenvalue. Differently, our idea here is to obtain the media nodes (singleton nodes) instead of a smaller graph, and we also need to maintain the local effects of incident edges on φ(·), resulting in different merge definitions.

To maintain the largest eigenvalue of the adjacency matrix of G′ (denoted by λ′_G), our goal is to merge the edges which have the least impact on it. We measure the impact of merging each edge on the overall diffusion as I(a,b) = |λ′_{G−(a,b)} − λ′_G|, where λ′_{G−(a,b)} is the largest eigenvalue of the graph G′_{−(a,b)} that results from merging (a,b) in G′ following Definition 9.3. Let h and g be the left and right eigenvectors corresponding to λ′_G. Now, using matrix perturbation theory, I(a,b) can be approximated as:

largest eigenvalue of the graph G′−(a,b) which is the result of merging (a, b) on G′. G′−(a,b) isobtained following Definition 9.3. Let us define h and g as the left and right eigenvectorscorresponding to λ′G. Now, using matrix perturbation theory, I(a, b) can be approximated as:

Proposition 9.4. As a first-order approximation, the impact of an edge (a, b) is

I(a, b) =−λ (gaha + gbhb) + haθ + w′bagahb + w′abgbha

hTg − (gaha + gbhb),

where θ = ua+ub

2 [ 1+wba

2 (λ′Ggaua− wabgb) + 1+wab

2 (λ′Ggbub− wbaga)].

Algorithm. From Proposition 9.4, we can compute I(a,b) for each edge (a,b) in O(1) time. Hence, to get the media node set M, we keep merging the edges with the smallest impacts until only m singleton nodes are left. Algorithm 9.1 shows the pseudocode. We first compute the eigenvector and construct G′ for the local effects (Lines 2-3). Then we obtain I(a,b) for each edge (a,b) (Lines 4-5), and finally merge the edges with the smallest impacts until m singleton (unmerged) nodes are left (Lines 7-9). Note that Algorithm 9.1 is monotone: the set of media nodes selected for a larger m is a superset of the media nodes chosen for a smaller m, which is desirable.

Proposition 9.5. The time complexity of Algorithm 9.1 is O(|E| log |E| + D(|V| − m)).

9.2.2 Finding kernel communities

According to Proposition 9.2, MeiKeCom-kernel is an NP-hard problem. In this section, we leverage the idea in [172] (Algorithm 2): we convert Problem 9.2 into an optimization problem that finds an 'assignment' vector z_v for each node v, and solve it iteratively.

2 This also allows our method to be generalized to other diffusion models.


Algorithm 9.1 Finding Media Nodes

Require: graph G, number of media nodes m
1: i = 0, n = |V|, S = ∅
2: Compute the right eigenvector u corresponding to λ_G
3: Get G′ by updating each edge weight to w′_{ab} = u_a w_{ab} in G
4: for each edge (a,b) in G′ do
5:   compute I(a,b) according to Proposition 9.4
6: π = ordering of the pairs in increasing order of I
7: while the number of singleton nodes > m do
8:   i = i + 1
9:   (a,b) = π(i), G′ = G′_{−(a,b)}
10: M = singleton nodes
11: return M

Note that we need to plug in the media nodes here, as any two nodes u and v in the same kernel community have a high value of sim_M(u,v). Specifically, for each node v ∈ V \ M, we define a weight vector z_v = [z_{v1}, ..., z_{vl}]^T to represent its relative importance to each kernel community; the higher z_{vi} is, the more connections v has to kernel community K_i. Given z_u and z_v, it is natural to use their inner product z_u^T z_v to measure the similarity of their connections to the kernel communities. And, as defined in Section 9.1, the similarity between u and v w.r.t. M is quantified by sim_M(u,v). Let z = [z_1, ..., z_i, ...] with i ∈ V \ M. We have the following optimization problem:

z* = arg max_z ∑_{(u,v) ∈ E\E_M} z_u^T z_v · w_{uv} · sim_M(u,v)

s.t. ∑_{v ∈ V\M} z_{vi} = k_i, ∀i ∈ {1, ..., l};
     ∑_{1≤i≤l} z_{vi} ≤ 1, ∀v ∈ V\M;
     z_{vi} ≥ 0, ∀v ∈ V\M, ∀i ∈ {1, ..., l},   (9.1)

where E_M is the set of edges incident on M, i.e., E_M = {(u,v) ∈ E | u ∈ M or v ∈ M}. This can be solved iteratively and efficiently, with time complexity O(lγ²) per iteration, where γ is the number of nodes connected to M.
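The objective itself is easy to evaluate for a candidate assignment. A minimal sketch, with z[v] a length-l NumPy vector, edges the list E \ E_M, and sim(u, v) computing sim_M(u, v) (hypothetical input formats):

def kernel_objective(z, edges, w, sim):
    # Objective of Eq. (9.1): z[v] is a length-l weight vector,
    # edges is E \ E_M, w[(u, v)] the edge weight, sim(u, v) = sim_M(u, v).
    return sum(float(z[u] @ z[v]) * w[(u, v)] * sim(u, v) for u, v in edges)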


9.3 Experiments

9.3.1 Experimental Setup

We briefly describe our set-up. All experiments are conducted on a machine with four Xeon E7-4850 CPUs and 512GB of 1066MHz main memory.

Datasets. We use multiple datasets (Table 9.2). We expect to find media nodes such as: media websites/accounts in MemeTracker and Twitter, survey papers in Citation, and people who cover multiple areas/departments in Coauthor, Google+ and Enron. We learn the edge weights of MemeTracker from blog cascades [54], and normalize the number of emails as edge weights for Enron. For the others, we set all edge weights to the same value, w_ij = 0.02.

To evaluate our method, we also use ground-truth media nodes and kernel communities for Coauthor and MemeTracker, which we briefly describe next. For Coauthor, we pick authors who are PC members of conferences in more than two areas as media nodes, as they are important in different areas such as AI, DB and Networks (as shown in Fig. 9.2). After that, we directly use the other PC members in each area as the ground truth for kernel communities [172]. Similarly, for MemeTracker, we pick high-web-traffic websites which cover more than two topics (like sports and entertainment) as media nodes. For kernel communities, we pick the websites in each area that have spread the most memes in the original cascades [54] as the ground truth.

Table 9.2: Datasets Information.

Dataset           Domain        #Nodes  #Edges
Enron [101]       Emails        156     2,061
MemeTracker [54]  Cascades      851     5,000
Citation [161]    Citation      8,046   18,322
Google+ [100]     Social Media  107K    14M
Twitter [172]     Social Media  456K    8M
Coauthor [172]    Coauthors     0.8M    2M

Parameters. We choose m to be roughly 1-2% of the graph size and set l = 5. This matches the media node set sizes and the number of kernel communities we found in the datasets with ground truth. For ease of evaluation, we conservatively set all k_i to 100 for Coauthor, Twitter and Google+, and to 10 for Enron, MemeTracker and Citation (due to their smaller sizes).

Baselines. To measure the diffusive properties of media nodes, we compare NetCondense with Greedy (mentioned in Sec. 9.2.1), pmia [22] and Netshield [166]. To test the hypothesis that media nodes are not just the nodes connecting/overlapping communities, we use HIS and MaxD [92] (which find structural holes connecting homogeneous communities) and BigClam [178] and Clique [125] (overlapping community detection) as baselines. To measure the performance of kernel communities, we compare NetCondense with several


community detection algorithms: Louvain [15]; d-Louvain and p-Louvain (Louvain applied to the high-degree/high-PageRank (top 20%) nodes); Newman [49]; Weba [172] (celebrity-based); and BigClam and Clique.

9.3.2 Evaluation of media nodes

We measure our performance along a variety of aspects.

Comparison with ground-truth. We use Precision, Recall, and F1-score to compare against the baselines. As shown in Table 9.3, NetCondense performs the best on Coauthor (we obtained the same result for MemeTracker), achieving up to a 40% improvement over all baselines in F1-score. Note that BigClam does not return any overlapping communities, and hence no media nodes, for Coauthor. From the results, it is clear that media nodes are neither simply the structural holes that HIS and MaxD optimize for, nor just the overlaps among communities that BigClam and Clique can find. Similarly, pmia and Netshield do not perform well. All these results are expected, as NetCondense returns nodes with a high full-stream diffusion effect.

Table 9.3: Quality of media nodes compared to the ground-truth.

Coauthor      Precision  Recall  F1-score
NetCondense   0.231      0.520   0.320
pmia          0.176      0.301   0.222
Netshield     0.149      0.195   0.169
HIS           0.194      0.412   0.263
MaxD          0.173      0.372   0.237
BigClam       0.000      0.000   0.000
Clique        0.044      0.366   0.078

Performance of NetCondense for MeiKeCom-media. As mentioned before, media nodes should have a high full-stream diffusion effect. To validate this, we compare NetCondense against pmia and Netshield. Note that Greedy is not scalable to large networks; we could only run it on Enron and MemeTracker, where NetCondense recovers at least 85% of the nodes chosen by Greedy while being significantly faster. Table 9.4 shows the results (all values averaged over 1000 simulations) of NetCondense against pmia and Netshield on Citation and Google+. For both networks, pmia has the highest σ(M) value, as it optimizes the downstream effect; Netshield does best on ρ(M), as immunization algorithms relate to the upstream effect. NetCondense gives the best results for φ(M), which shows that our algorithm effectively solves MeiKeCom-media. In addition, we find that media nodes are diverse: they are barely connected among themselves, yet well connected to the rest of the network. This makes sense, as we want them to diffuse information to the whole network. For example, in Coauthor there are almost zero edges among media nodes. Furthermore, they connect to multiple kernel communities; for example, in Coauthor, Carlos Guestrin, as a media node, connects to multiple kernel communities.


Table 9.4: Quality of NetCondense for MeiKeCom-media.

Citation      σ(M)    ρ(M)   φ(M)
NetCondense   1744.7  20.4   35591.9
pmia          1974.7  4.8    9382.6
Netshield     1087.2  28.9   31420.1

Google+       σ(M)    ρ(M)   φ(M)
NetCondense   7842.4  611.6  4.8 × 10^6
pmia          8723.5  672.3  3.8 × 10^6
Netshield     6612.1  672.3  4.4 × 10^6

Case studies of media nodes. We conduct case studies to show NetCondense can findmeaningful nodes.

Coauthor: Authors discovered as media nodes using NetCondense, such as Carlos Guestrin and Leonidas J. Guibas, are typically researchers who have published papers in multiple areas. For example, Carlos Guestrin has published papers in areas such as AI, DB and Networks; hence, these authors act as classic bridge nodes. In addition, a media node does not necessarily have high degree. For example, Wei-Ying Ma, found as a media node, has only six collaborations in our dataset, yet still connects multiple domains; he is a well-known researcher who works in areas like AI, DB and Viz. This highlights that NetCondense is able to detect high-quality media nodes even when they have low degrees.

Citation: Papers identified as media nodes point to areas where results could be improved, survey existing methods, or ask important open questions. For example, "Data management projects at Google" by Cafarella et al. (2008) provides an overview of a subset of ongoing projects at Google, like Map-Reduce and GFS. Since Map-Reduce and GFS are important projects, they receive more citations than the paper itself: though the paper has a relatively low citation count of 35, the papers which cite it have higher citation counts. Other media nodes, such as "Magic sets and other strange ways to implement logic programs" by Bancilhon et al. (1986), ask open questions; eleven of the papers that cite this paper and try to solve those open questions are nodes in kernel communities.

MemeTracker: Media nodes found by NetCondense include mainstream websites such as guardian.co.uk, huffingtonpost.com, and washingtonpost.com. They are all general news-media websites that cover multiple topics such as politics, sports, technology, etc.

Twitter: We find accounts affiliated with media organizations, such as NBC, CBSTopNews and bbcamerica, as media nodes. We also find Ryan Penagos's account AgentM as a media node: since he is the VP and Executive Editor of Marvel's Digital Media Group, he acts as a bridge node between the entertainment kernel (mostly consisting of celebrities) and the finance kernel. In contrast, baselines like Clique, HIS and MaxD find many unimportant non-news-media accounts.

Enron: Media nodes found by NetCondense are the main executives, like K. Lay (CEO) and J. Shankman (COO), as they routinely communicate with different departments by email. Interestingly, J. Hernandez, an administrator, is also a media node; we believe this is because she has many communications across different departments. The other baselines cannot find her.


Table 9.5: Quality (F1-score) of kernel communities compared to other competitors on Coauthor. DP: Distributed and Parallel Computing; GV: Graphics and Vision; NC: Networks and Communications.

Method        AI     DB     DP     GV     NC     Avg.
NetCondense   0.613  0.532  0.791  0.392  0.644  0.594
Louvain       0.362  0.070  0.578  0.333  0.164  0.301
d-Louvain     0.465  0.168  0.755  0.155  0.237  0.356
p-Louvain     0.418  0.243  0.762  0.110  0.305  0.368
Newman        0.002  0.014  0.118  0.015  0.003  0.030
BigClam       0.054  0.004  0.032  0.004  0.005  0.019
Clique        0.106  0.029  0.521  0.405  0.039  0.220
Weba          0.601  0.521  0.761  0.431  0.632  0.589


9.3.3 Evaluation of kernel communities

We conduct multiple experiments for kernels as well.

Comparison with ground-truth. We compute the F1-score and Jaccard similarity to evaluate the performance of NetCondense. Table 9.5 shows the F1-scores for Coauthor. In short, NetCondense gets the best results overall: it obtains up to 6 times better solutions than the baselines, including Weba (celebrity-based), Newman and Louvain (traditional community detection), and BigClam and Clique (overlapping community detection).

Centrality and connectivity. We found that the centrality of kernel nodes is much higher than that of other nodes: on all networks, kernel nodes have up to 17.2 times higher average degree, eigenscore, and PageRank than the nodes of the whole graph. As expected, each kernel community has very dense intra-community connections. For example, in MemeTracker the average number of intra-community connections for NetCondense is 45.7, which is larger than for d-Louvain and p-Louvain (31.3 and 26.3, respectively).

Case studies of kernel communities. We found that each K_i usually covers only one area/topic. In Citation, each kernel usually has its own specific topics: for example, one kernel comprises papers on parallel processing and databases, while another has papers on query estimation and optimization. In Twitter, we find that the sports kernel contains athletes like Serena Williams and Dwight Howard, while the entertainment kernel has celebrities such as Mariah Carey and Taylor Swift.

Kernel’s corresponding ordinary community Ordinary communities are consistent with


their corresponding kernel communities in terms of diffusion. To verify this, we first pick 50 nodes uniformly at random in each K_i as seeds, then run the IC model over G to get the final infections. In every dataset, at least 75% of the infected nodes belong to the kernel's corresponding ordinary community.

Comparison between kernel communities and media nodes. Each kernel community obtained from NetCondense shares similar properties, like research area, news topic, etc. Media nodes, on the other hand, are diverse and connect to multiple kernel communities. In addition, unlike kernel nodes, media nodes do not necessarily have high centrality; recall that Wei-Ying Ma, a media node in Coauthor, has a relatively low degree. And, as shown in Table 9.4, media nodes have a high full-stream effect of diffusion, which is not a required property for kernel nodes.

9.4 Conclusion

We studied the novel task of discovering communities of nodes by leveraging their diffusion roles. We gave an intuitive and principled optimization-based formulation, MeiKeCom, based on finding media nodes, kernel communities and ordinary communities, showed that it is computationally challenging, and then gave an effective and practical multi-step algorithm, NetCondense, for it. NetCondense first finds media nodes via a novel merge-based algorithm, and then computes the kernel communities via a relaxation. Extensive experiments on multiple real datasets show that NetCondense outperforms the baselines in both media node discovery and kernel community detection, and that it also finds meaningful groups for insights. There are several fruitful avenues for future work, like extending our results to temporal networks.


Chapter 10

Conclusions and Future Work

10.1 Conclusions

In this thesis, we focus on better optimizing and understanding network structure in terms of diffusion. The tasks we study cover topics like immunization, graph summarization and community detection. In contrast to previous work, this thesis is the first to (1) develop more realistic, implementable and data-based graph algorithms to control contagions, and (2) use diffusion to effectively understand communities of networks and summarize graphs in a general way. To sum up, we develop several efficient and effective graph mining algorithms to (1) control contagions from spreading by removing nodes/edges, and (2) gain a deeper understanding of a network for diffusion by exploring how nodes group together by similar roles in dissemination. Our algorithms take into account different levels of granularity, from the node/edge level to the group/community level, and both model-driven and data-driven views:

• Model-Driven. We have proposed efficient immunization algorithms to control propagation under diffusion models in various practical and implementable settings, including the data-aware environment, the uncertain data-aware environment, and group/community intervention. In addition, we have presented a fast coarsening algorithm to summarize both static and temporal graphs while preserving diffusion properties. We show that our methods work for a variety of popular models in public health, social media, cyber security, and so on. Experimental results on large-scale data demonstrate that our methods are effective and scalable compared to other baselines. Furthermore, we have conducted case studies to show that our immunization algorithms can offer reliable guidance for vaccine allocations, and that our summarization approach can significantly speed up large-scale graph mining tasks like influence maximization.

• Data-Driven. We have studied two challenging problems from a data-driven perspective: data-driven immunization and diffusion-based community detection. In contrast to model-driven approaches, this part has worked directly on large-scale graph and propagation data by relaxing modeling assumptions of diffusion. We show that (1) our data-driven immunization algorithm, using large-scale national-level diagnostic patient data, can effectively provide finer-granularity solutions for vaccine allocations at the zipcode level; and (2) our novel community detection algorithm can efficiently find the different roles of nodes and communities participating in the diffusion, and can discover interesting patterns in real social networks.

10.2 Future Work

This thesis can be extended in several directions. One of our long-term research goals is to develop rigorous foundations to advance our understanding of network structure and the dynamical processes over it, and ultimately to leverage the learned knowledge to facilitate human interventions (like immunization) in multiple fields such as social science, healthcare, online communities, and cyber security. This thesis has made several contributions toward developing better network optimization algorithms to control diffusion and toward gaining a better understanding of networks when a contagion is propagating. Several challenges remain to be explored; we describe some of them next:

• Using Surveillance Data for Social Contact Networks. Currently, we test our results on large-scale simulated networks (like MIAMI and Houston). These are good datasets, which have been widely used in diffusion analysis and have provided convincing results. For example, MIAMI and Houston are constructed by taking into account several reliable sources, such as census, land use, activity surveys, and transportation networks [35, 60]. However, with the availability of rich surveillance data, it is possible to generate large-scale real contact networks to test our methods. The main challenge here is to construct a large, reliable contact network by integrating multiple heterogeneous human activity sources, like flu reports and diagnosis information from electronic health records, and check-in information from social media.

• Embedding large networks from propagation data. Network embedding, an emerging tool for graph feature engineering, together with deep learning techniques, has led to vast improvements in several graph mining tasks. Compared to traditional graph mining algorithms, it has the flexibility to leverage any off-the-shelf machine learning method for tasks like community detection, node classification, and others, leading to promising results. On the other hand, as shown in our thesis, studying diffusion processes can uncover interesting network structure. Hence, it is a challenging and exciting problem to consider embedding problems together with rich surveillance propagation data like tweets and electronic health records. We would like to explore research questions like: How to embed propagation data and unify it with network features? How to discover propagation patterns from embedding results? How to adaptively learn features of a graph given massive propagation streams?

• Building Comprehensive Intervention Systems. With the current progress of this thesis, we would like to further build an intervention system that integrates current surveillance systems, deploys the immunization policies developed in this thesis, and provides reliable services. It is a challenging task, from developing algorithms to building real systems. In fact, few studies have focused on building robust and comprehensive intervention systems to control diffusion. Our vision of such a system has two key features: time-sensitivity and scalability. First, it can provide real-time responses to huge surveillance data at any time, and is adaptive to quick changes like a burst of influenza in epidemiology or a misinformation outbreak in social media. Second, it can scale to big data via real-time analytics and online approaches. For this, we are particularly interested in exploiting the capabilities of big data frameworks (e.g., Hadoop, Spark, Solr) that best fit the intervention system.

• More Robust Analysis. Some of our studies, like the uncertain data-aware vaccination (Chapter 4) and the data-driven immunization (Chapter 8), are stochastic optimization problems, in which we optimize “expected” cases. In fact, even some popular diffusion models (like IC and SIR) are randomized models, in which we typically consider “expected” cases. However, in some scenarios we would like to conduct a more robust analysis. For example, sometimes we are interested in studying “rare conditions”: the CDC may want to know the worst case of infection spread. Furthermore, since surveillance data can be noisy and incomplete, a more robust analysis of such data can further improve the performance of the algorithms proposed in this thesis. Hence, we believe making our network analysis for diffusion more robust is a promising direction.
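As a rough sketch of what such a robustness check could look like (reusing run_ic from the earlier sketch; the 95th percentile here is only an illustrative proxy for “rare conditions”):

    def spread_stats(G, seeds, p=0.1, runs=1000):
        # Monte Carlo over IC runs: the mean captures the usual "expected"
        # case, while a high quantile hints at rare, worst-case-like outcomes.
        sizes = sorted(len(run_ic(G, seeds, p)) for _ in range(runs))
        mean_size = sum(sizes) / runs
        tail_size = sizes[int(0.95 * runs)]  # 95th-percentile spread
        return mean_size, tail_size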


Bibliography

[1] Bijaya Adhikari, Yao Zhang, Aditya Bharadwaj, and B. Aditya Prakash. Condensing temporal networks using propagation. In Proceedings of the SIAM Data Mining Conference, SDM ’17, 2017.

[2] Charu C Aggarwal, Shuyang Lin, and S Yu Philip. On influential node discovery in dynamic social networks. In SDM, 2012.

[3] Yong-Yeol Ahn, James P Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, 2010.

[4] Sorour E Amiri, Liangzhe Chen, and B. Aditya Prakash. Snapnets: Automatic segmentation of network sequences with node labels. 2017.

[5] Roy M. Anderson and Robert M. May. Infectious Diseases of Humans. Oxford University Press, 1991.

[6] Alberto Apostolico and Guido Drovandi. Graph compression by bfs. Algorithms, 2(3):1031–1044, 2009.

[7] James Aspnes, Kevin Chang, and Aleksandr Yampolskiy. Inoculation strategies for victims of viruses and the sum-of-squares partition problem. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, SODA ’05, pages 43–52, 2005.

[8] Francis R Bach and Michael I Jordan. Learning spectral clustering. In NIPS, volume 16, 2003.

[9] Norman Bailey. The Mathematical Theory of Infectious Diseases and its Applications. Griffin, London, 1975.

[10] Eytan Bakshy, Brian Karrer, and Lada A Adamic. Social influence and the diffusion of user-created content. In Proceedings of the 10th ACM conference on Electronic commerce, pages 325–334. ACM, 2009.


[11] Chris Barrett, Harry B Hunt, Madhav V Marathe, SS Ravi, Daniel J Rosenkrantz, Richard E Stearns, and Mayur Thakur. Predecessor existence problems for finite discrete dynamical systems. Theoretical Computer Science, 386(1-2):3–37, 2007.

[12] Christopher L Barrett, Richard J Beckman, Maleq Khan, V. S. Anil Kumar, Madhav V Marathe, Paula E Stretz, Tridib Dutta, and Bryan Lewis. Generation and analysis of large synthetic social contact networks. In Winter Simulation Conference, pages 1003–1014. Winter Simulation Conference, 2009.

[13] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.

[14] Sushil Bikhchandani, David Hirshleifer, and Ivo Welch. A theory of fads, fashion, custom, and cultural change in informational cascades. Journal of Political Economy, 100(5):992–1026, October 1992.

[15] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[16] Paolo Boldi and Sebastiano Vigna. The webgraph framework i: compression techniques. In Proceedings of the 13th international conference on World Wide Web, pages 595–602. ACM, 2004.

[17] Linda Briesemeister, Patric Lincoln, and Philip Porras. Epidemic profiles and defense of scale-free networks. WORM 2003, Oct. 27 2003.

[18] Adam L. Buchsbaum, Haim Kaplan, Anne Rogers, and Jeffery R. Westbrook. A new, simpler linear-time dominators algorithm. ACM Trans. Program. Lang. Syst., 20(6):1265–1296, November 1998.

[19] Duncan S. Callaway, Mark E. J. Newman, Steven H. Strogatz, and Duncan J. Watts. Network robustness and fragility: Percolation on random graphs. Physical Review Letters, 85(25):5468–5471, 2000.

[20] Damon Centola. The spread of behavior in an online social network experiment. Science, 329(5996):1194–1197, 2010.

[21] Po-An Chen, Mary David, and David Kempe. Better vaccination strategies for better people. In Proceedings of the 11th ACM conference on Electronic commerce, EC ’10, pages 179–188, New York, NY, USA, 2010. ACM.

[22] Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1029–1038. ACM, 2010.


[23] Flavio Chierichetti, Ravi Kumar, Silvio Lattanzi, Michael Mitzenmacher, Alessandro Panconesi, and Prabhakar Raghavan. On compressing social networks. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 219–228. ACM, 2009.

[24] Fan RK Chung. Spectral graph theory, volume 92. American Mathematical Soc., 1997.

[25] Reuven Cohen, Shlomo Havlin, and Daniel ben Avraham. Efficient immunization strategies for computer networks and populations. Physical Review Letters, 91(24), December 2003.

[26] Dan Cosley, Daniel P Huttenlocher, Jon M Kleinberg, Xiangyang Lan, and Siddharth Suri. Sequential influence models in social networks. ICWSM, 10:26, 2010.

[27] Daryl J Daley and David G Kendall. Epidemics and rumours. 1964.

[28] Jonathan J Deeks and Douglas G Altman. Statistics notes: Diagnostic tests 4: likelihood ratios. BMJ: British Medical Journal, 329(7458):168, 2004.

[29] Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Weighted graph cuts without eigenvectors: a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell., 29(11):1944–1957, 2007.

[30] Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. Influence function learning in information diffusion networks. In ICML, pages 2016–2024, 2014.

[31] Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable influence estimation in continuous-time diffusion networks. In Advances in neural information processing systems, pages 3147–3155, 2013.

[32] J Dushoff, JB Plotkin, C Viboud, L Simonsen, M Miller, M Loeb, and DJ Earn. Vaccinating to protect a vulnerable subpopulation. PLoS Med, 4(5):e174, 2007.

[33] Michael Elkin and David Peleg. Approximating k-spanner problems for k > 2. Theoretical Computer Science, 337(1):249–277, 2005.

[34] Mathieu Genois et al. Data on face-to-face contacts in an office building suggest a low-cost vaccination strategy based on community linkers. Network Science, 3(03), 2015.

[35] Stephen Eubank, Hasan Guclu, V. S. Anil Kumar, Madhav V. Marathe, Aravind Srinivasan, Zoltan Toroczkai, and Nan Wang. Modelling disease outbreaks in realistic urban social networks. Nature, 429(6988):180–184, May 2004.

[36] NM Ferguson, DA Cummings, C Fraser, JC Cajka, PC Cooley, and DS Burke. Strategies for mitigating an influenza pandemic. Nature, 442(7101):448–452, 2006.


[37] Miroslav Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298–305, 1973.

[38] Philippe Flajolet and G Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.

[39] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3):75–174, 2010.

[40] Julie Fournet and Alain Barrat. Contact patterns among high school students. PLoS ONE, 9(9):e107878, 2014.

[41] L. C. Freeman. A set of measures of centrality based on betweenness. Sociometry, pages 35–41, 1977.

[42] Wai Shing Fung, Ramesh Hariharan, Nicholas JA Harvey, and Debmalya Panigrahi. A general framework for graph sparsification. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 71–80. ACM, 2011.

[43] Wojciech Galuba, Karl Aberer, Dipanjan Chakraborty, Zoran Despotovic, and Wolfgang Kellerer. Outtweeting the twitterers - predicting information cascades in microblogs. WOSN, 10:3–11, 2010.

[44] Ayalvadi Ganesh, Laurent Massoulie, and Don Towsley. The effect of network topology on the spread of epidemics. In IEEE INFOCOM, Los Alamitos, CA, 2005. IEEE Computer Society Press.

[45] Lixin Gao. On inferring autonomous system relationships in the internet. IEEE/ACM Transactions on Networking (ToN), 9(6):733–745, 2001.

[46] Nathalie TH Gayraud, Evaggelia Pitoura, and Panayiotis Tsaparas. Diffusion maximization in evolving social networks. In COSN, 2015.

[47] Sean Gilpin, Tina Eliassi-Rad, and Ian Davidson. Guided learning for role discovery (glrd): framework, algorithms, and applications. In KDD, pages 113–121. ACM, 2013.

[48] Jeremy Ginsberg, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014, 2009.

[49] Michelle Girvan and Mark EJ Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[50] Michaela Goetz, Jure Leskovec, Mary McGlohon, and Christos Faloutsos. Modeling blog dynamics. In ICWSM. Citeseer, 2009.

[51] William Goffman and VA Newill. Generalization of epidemic theory. Nature, 204(4955):225–228, 1964.


[52] Samuel Goldberg. Probability: an introduction. Courier Dover Publications, 1986.

[53] Jacob Goldenberg, Barak Libai, and Eitan Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 2001.

[54] Manuel Gomez Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. In KDD. ACM, 2010.

[55] Amit Goyal, Francesco Bonchi, and Laks VS Lakshmanan. Learning influence probabilities in social networks. In Proceedings of the third ACM international conference on Web search and data mining, pages 241–250. ACM, 2010.

[56] Mark Granovetter. Threshold models of collective behavior. American Journal of Sociology, 83(6):1420–1443, 1978.

[57] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

[58] Daniel Gruhl, Ramanathan Guha, David Liben-Nowell, and Andrew Tomkins. Information diffusion through blogspace. In Proceedings of the 13th international conference on World Wide Web, pages 491–501. ACM, 2004.

[59] Lars Hagen and Andrew B Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074–1085, 1992.

[60] M. Elizabeth Halloran, Neil M. Ferguson, Stephen Eubank, Ira M. Longini, Derek A. T. Cummings, Bryan Lewis, Shufu Xu, Christophe Fraser, Anil Vullikanti, Timothy C. Germann, Diane Wagener, Richard Beckman, Kai Kadau, Chris Barrett, Catherine A. Macken, Donald S. Burke, and Philip Cooley. Modeling targeted layered containment of an influenza pandemic in the United States. In Proceedings of the National Academy of Sciences (PNAS), pages 4639–4644, March 10 2008.

[61] Yu Han and Jie Tang. Probabilistic community and role model for social networks. In KDD, pages 407–416. ACM, 2015.

[62] Yukio Hayashi, Masato Minoura, and Jun Matsukubo. Recoverable prevalence in growing scale-free networks and the effective immunization. arXiv:cond-mat/0305549v2, Aug. 6 2003.

[63] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. Rolx: structural role extraction & mining in large graphs. In KDD, pages 1231–1239, 2012.

[64] H. W. Hethcote. The mathematics of infectious diseases. SIAM Review, 42, 2000.


[65] Nathan Oken Hodas and Kristina Lerman. How visibility and divided attention constrain social contagion. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 249–257. IEEE, 2012.

[66] Kaixi Hou, Weifeng Liu, Hao Wang, and Wu-chun Feng. Fast Segmented Sort on GPUs. In Proceedings of the 2017 International Conference on Supercomputing, ICS ’17. ACM, 2017.

[67] Kaixi Hou, Hao Wang, and Wu-chun Feng. Gpu-unicache: Automatic code generation of spatial blocking for stencils on gpus. In Proceedings of the ACM Conference on Computing Frontiers, CF ’17. ACM, 2017.

[68] Jeff Howe. The rise of crowdsourcing. Wired magazine, 14(6):1–4, 2006.

[69] Crump JA, Youssef FG, Luby SP, Wasfy MO, Rangel JM, Taalat M, Oun SA, and Mahoney FJ. Estimating the incidence of typhoid fever and other febrile illnesses in developing countries. Emerging Infectious Diseases, 9(5):539–544, 2003.

[70] Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th International Conference on World Wide Web, WWW ’03, pages 271–279, New York, NY, USA, 2003. ACM.

[71] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.

[72] David Kempe, Jon Kleinberg, and Eva Tardos. Maximizing the spread of influence through a social network. In Conference of the ACM Special Interest Group on Knowledge Discovery and Data Mining, New York, NY, 2003. ACM Press.

[73] J. O. Kephart and S. R. White. Measuring and modeling computer virus prevalence. IEEE Computer Society Symposium on Research in Security and Privacy, 1993.

[74] Elias Boutros Khalil, Bistra Dilkina, and Le Song. Scalable diffusion-aware optimization of network topology. In KDD 2014, pages 1226–1235. ACM, 2014.

[75] Masahiro Kimura, Kazumi Saito, and Hiroshi Motoda. Minimizing the spread of contamination by blocking links in a network. In Proceedings of the 23rd national conference on Artificial intelligence, AAAI’08, pages 1175–1180. AAAI Press, 2008.

[76] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. In ACM-SIAM Symposium on Discrete Algorithms, 1998.

[77] Anton J. Kleywegt, Alexander Shapiro, and Tito Homem-de-Mello. The sample average approximation method for stochastic discrete optimization. SIAM J. on Optimization, 12(2):479–502, February 2002.


[78] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[79] Ioannis Konstas, Vassilios Stathopoulos, and Joemon M Jose. On social networks and collaborative recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 195–202. ACM, 2009.

[80] Gueorgi Kossinets and Duncan J Watts. Empirical analysis of an evolving social network. Science, 311(5757), 2006.

[81] Danai Koutra, U. Kang, Jilles Vreeken, and Christos Faloutsos. Vog: summarizing and understanding large graphs. In Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 2014.

[82] Andreas Krause, Jure Leskovec, Carlos Guestrin, Jeanne VanBriesen, and Christos Faloutsos. Efficient sensor placement optimization for securing large water distribution networks. Journal of Water Resources Planning and Management, 134(6):516–526, 2008.

[83] Chris J. Kuhlman, Gaurav Tuli, Samarth Swarup, Madhav V. Marathe, and S. S. Ravi. Blocking simple and complex contagion by edge removal. In ICDM, pages 399–408, 2013.

[84] Ravi Kumar, Jasmine Novak, Prabhakar Raghavan, and Andrew Tomkins. On the bursty evolution of blogspace. In WWW ’03: Proceedings of the 12th international conference on World Wide Web, pages 568–576, New York, NY, USA, 2003. ACM Press.

[85] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is twitter, a social network or a news media? In WWW, pages 591–600. ACM, 2010.

[86] T. Lappas, E. Terzi, D. Gunopoulos, and H. Mannila. Finding effectors in social networks. SIGKDD, 2010.

[87] Thomas Lengauer and Robert Endre Tarjan. A fast algorithm for finding dominators in a flowgraph. ACM Trans. Program. Lang. Syst., 1(1):121–141, January 1979.

[88] Jure Leskovec, Lada A. Adamic, and Bernardo A. Huberman. The dynamics of viral marketing. In EC ’06: Proceedings of the 7th ACM conference on Electronic commerce, pages 228–237, New York, NY, USA, 2006. ACM Press.

[89] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie S. Glance. Cost-effective outbreak detection in networks. In KDD, pages 420–429, 2007.


[90] Jure Leskovec, Kevin J Lang, and Michael Mahoney. Empirical comparison of algorithms for network community detection. In WWW. ACM, 2010.

[91] Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Cascading behavior in large blog graphs: Patterns and a model. In Society of Applied and Industrial Mathematics: Data Mining, 2007.

[92] Tiancheng Lou and Jie Tang. Mining structural hole spanners through information diffusion in social networks. In WWW, pages 825–836, 2013.

[93] Linyuan Lu and Xing Peng. Spectra of edge-independent random graphs. The Electronic Journal of Combinatorics, 20(4):P27, 2013.

[94] Braks M, Van Der Giessen J, Kretzschmar M, van Pelt W, Scholte EJ, Reusken C, Zeller H, van Bortel W, and Sprong H. Towards an integrated approach in surveillance of vector-borne diseases in Europe. Parasit Vectors, 4(192), 2011.

[95] Nilly Madar, Tomer Kalisky, Reuven Cohen, Daniel ben Avraham, and Shlomo Havlin. Immunization and epidemic dynamics in complex networks. Eur. Phys. J. B, 38(2):269–276, 2004.

[96] Madhav Marathe and Anil Kumar S. Vullikanti. Computational epidemiology. Commun. ACM, 56(7):88–96, July 2013.

[97] Hossein Maserrat and Jian Pei. Neighbor query friendly compression of social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 533–542. ACM, 2010.

[98] Michael Mathioudakis, Francesco Bonchi, Carlos Castillo, Aristides Gionis, and Antti Ukkonen. Sparsification of influence networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 529–537. ACM, 2011.

[99] Yasuko Matsubara, Yasushi Sakurai, B. Aditya Prakash, Lei Li, and Christos Faloutsos. Rise and fall patterns of information diffusion: model and implications. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’12, pages 6–14, 2012.

[100] Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In NIPS, 2012.

[101] Andrew McCallum, Xuerui Wang, and Andres Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. Journal of Artificial Intelligence Research, pages 249–272, 2007.


[102] Aaron F. McDaid, Brendan Murphy, Nial Friel, and Neil Hurley. Clustering in networks with the collapsed stochastic block model. arXiv preprint arXiv:1203.3083, 2012.

[103] A G McKendrick. Applications of mathematics to medical problems. In Proceedings of Edin. Math. Society, volume 44, pages 98–130, 1925.

[104] J. Medlock and A. P. Galvani. Optimizing influenza vaccine distribution. Science, 325, 2009.

[105] Yasir Mehmood, Nicola Barbieri, Francesco Bonchi, and Antti Ukkonen. Csi: Community-level social influence analysis. In Machine Learning and Knowledge Discovery in Databases, volume 8189 of Lecture Notes in Computer Science, pages 48–63. Springer Berlin Heidelberg, 2013.

[106] Lauren Ancel Meyers, M.E.J. Newman, and Babak Pourbohloul. Predicting epidemics on directed contact networks. Journal of Theoretical Biology, 240(3):400–418, 2006.

[107] P. Van Mieghem, D. Stevanovic, F. Fernando Kuipers, Cong Li, Ruud van de Bovenkamp, Daijie Liu, and Huijuan Wang. Decreasing the spectral radius of a graph by link removals. IEEE Transactions on Networking, 2011.

[108] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.

[109] Nina Mishra, Robert Schreiber, Isabelle Stanton, and Robert E Tarjan. Finding strongly knit clusters in social networks. Internet Mathematics, 5(1-2):155–174, 2008.

[110] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, New York, NY, USA, 2005.

[111] James Moody and Douglas R. White. Social cohesion and embeddedness: A hierarchical conception of social groups. American Sociological Review, pages 1–25, 2003.

[112] Cristopher Moore and Mark EJ Newman. Epidemics and percolation in small-world networks. Physical Review E, 61(5):5678, 2000.

[113] Fred Morstatter, Jurgen Pfeffer, Huan Liu, and Kathleen M Carley. Is the sample good enough? Comparing data from twitter’s streaming api with twitter’s firehose. In ICWSM, 2013.

[114] Annamalai Narayanan, Mahinthan Chandramohan, Lihui Chen, Yang Liu, and Santhoshkumar Saminathan. subgraph2vec: Learning distributed representations of rooted sub-graphs from large graphs. arXiv preprint arXiv:1606.08928, 2016.


[115] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph summarization with bounded error. In SIGMOD08, pages 419–432. ACM, 2008.

[116] NDSSL. Synthetic Data Products for Societal Infrastructures and Protopopulations: Data Set 2.0. NDSSL-TR-07-003, 2007.

[117] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical Programming, 14(1):265–294, 1978.

[118] M. E. J. Newman. Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E, page 036104, 2006.

[119] Mark EJ Newman. Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23):8577–8582, 2006.

[120] Mark EJ Newman. Communities, modules and large-scale structure in networks. Nature Physics, 8(1):25–31, 2012.

[121] M.E.J. Newman. A measure of betweenness centrality based on random walks. Social Networks, 27:39–54, 2005.

[122] Hiroshi Nishiura, Gerardo Chowell, and Carlos Castillo-Chavez. Did modeling overestimate the transmission potential of pandemic (h1n1-2009)? Sample size estimation for post-epidemic seroepidemiological studies. PLoS ONE, 6(3):e17908, 03 2011.

[123] Ozgur Ozmen, Laura L. Pullum, Arvind Ramanathan, and James J. Nutaro. Augmenting epidemiological models with point-of-care diagnostics data. PLOS ONE, 11(4):1–13, 04 2016.

[124] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998. Paper SIDL-WP-1999-0120 (version of 11/11/1999).

[125] Gergely Palla, Imre Derenyi, Illes Farkas, and Tamas Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043):814–818, 2005.

[126] Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. Anf: A fast and scalable tool for data mining in massive graphs. KDD ’02, pages 81–90, New York, NY, USA, 2002. ACM.

[127] Romualdo Pastor-Satorras and Alessandro Vespignani. Epidemic spreading in scale-free networks. Physical Review Letters, 86(14):3200, 2001.

[128] Romualdo Pastor-Satorras and Alessandro Vespignani. Epidemic dynamics in finite size scale-free networks. Physical Review E, 65:035108, 2002.


[129] Lorenzo Pellis, Frank Ball, Shweta Bansal, Ken Eames, Thomas House, Valerie Isham, and Pieter Trapman. Eight challenges for network epidemic models. Epidemics, pages 58–62, 2015.

[130] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.

[131] B. A. Prakash, D. Chakrabarti, M. Faloutsos, N. Valler, and C. Faloutsos. Threshold conditions for arbitrary cascade models on arbitrary networks. Knowledge and Information Systems, 2012.

[132] B. Aditya Prakash, Lada A. Adamic, Theodore J. Iwashyna, Hanghang Tong, and Christos Faloutsos. Fractional immunization in networks. In Proc. of SDM, pages 659–667, 2013.

[133] B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, and Christos Faloutsos. Threshold conditions for arbitrary cascade models on arbitrary networks. In ICDM, 2011.

[134] B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos, and Christos Faloutsos. Virus propagation on time-varying networks: Theory and immunization algorithms. In ECML/PKDD10, 2010.

[135] B. Aditya Prakash, Jilles Vreeken, and Christos Faloutsos. Spotting culprits in epidemics: How many and which ones? In ICDM, 2012.

[136] Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang, and VS Subrahmanian. Fast influence-based coarsening for large networks. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1296–1305. ACM, 2014.

[137] Qiang Qu, Siyuan Liu, Christian S Jensen, Feida Zhu, and Christos Faloutsos. Interestingness-driven diffusion process summarization in dynamic networks. In ECML/PKDD, 2014.

[138] Sriram Raghavan and Hector Garcia-Molina. Representing web graphs. In Proceedings of the 19th International Conference on Data Engineering, pages 405–416. IEEE, 2003.

[139] Arvind Ramanathan, Laura L Pullum, Tanner C Hobson, Chad A Steed, Shannon P Quinn, Chakra S Chennubhotla, and Silvia Valkova. Orbit: Oak Ridge biosurveillance toolkit for public health dynamics. BMC Bioinformatics, 16(17):S4, 2015.

[140] Anatol Rapoport. Spread of information through a population with socio-structural bias: I. Assumption of transitivity. The Bulletin of Mathematical Biophysics, 15(4):523–533, 1953.


[141] Shebuti Rayana and Leman Akoglu. Less is more: Building selective anomaly ensembles with application to event detection in temporal graphs. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 622–630. SIAM, 2015.

[142] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In SIGKDD, 2002.

[143] Kaspar Riesen and Horst Bunke. Graph classification and clustering based on vector space embedding. World Scientific Publishing Co., Inc., 2010.

[144] Daniel M Romero, Brendan Meeder, and Jon Kleinberg. Differences in the mechanics of information diffusion across topics: idioms, political hashtags, and complex contagion on twitter. In Proceedings of the 20th international conference on World wide web, pages 695–704. ACM, 2011.

[145] D. Z. Roth and B. Henr. Social distancing as a pandemic influenza prevention measure, 2011.

[146] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

[147] Polina Rozenshtein, Aristides Gionis, B. Aditya Prakash, and Jilles Vreeken. Reconstructing an epidemic over time. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1835–1844. ACM, 2016.

[148] Sudip Saha, Abhijin Adiga, B. Aditya Prakash, and Anil Kumar S Vullikanti. Approximation algorithms for reducing the spectral radius to control epidemic spread. In Proceedings of the 2015 SIAM International Conference on Data Mining, pages 568–576. SIAM, 2015.

[149] Sartaj Sahni. Computationally related problems. SIAM Journal on Computing, 3(4):262–279, 1974.

[150] John Scott. Social network analysis: A handbook. SAGE Publications, 2012.

[151] C Seshadhri, Tamara G Kolda, and Ali Pinar. Community structure and scale-free collections of Erdos-Renyi graphs. Physical Review E, 85(5):056109, 2012.

[152] Devavrat Shah and Tauhid Zaman. Rumors in a network: Who’s the culprit? IEEE Transactions on Information Theory, 57(8):5163–5181, 2011.

[153] Neil Shah, Danai Koutra, Tianmin Zou, Brian Gallagher, and Christos Faloutsos. Timecrunch: Interpretable dynamic graph summarization. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1055–1064. ACM, 2015.


[154] Min-Zheng Shieh, Shi-Chun Tsai, and Ming-Chuan Yang. On the inapproximability of maximum intersection problems. Inf. Process. Lett., 112(19):723–727, October 2012.

[155] Eunha Shim. Optimal strategies of social distancing and vaccination against seasonal influenza. Mathematical Biosciences and Engineering, 10(5), 2013.

[156] Gavin JD Smith, Dhanasekaran Vijaykrishna, Justin Bahl, Samantha J Lycett, Michael Worobey, Oliver G Pybus, Siu Kit Ma, Chung Lam Cheung, Jayna Raghwani, Samir Bhatt, et al. Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature, 459(7250):1122–1125, 2009.

[157] Tasuku Soma, Naonori Kakimura, Kazuhiro Inaba, and Ken-ichi Kawarabayashi. Optimal budget allocation: Theoretical guarantee and efficient algorithm. In International Conference on Machine Learning, pages 351–359, 2014.

[158] G. W. Stewart and Ji-Guang Sun. Matrix Perturbation Theory. Academic Press, 1990.

[159] Shashidhar Sundareisan, Jilles Vreeken, and B. Aditya Prakash. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics, pages 415–423.

[160] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. ACM, 2015.

[161] Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. Arnetminer: Extraction and mining of academic social networks. In KDD’08, 2008.

[162] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

[163] Yuanyuan Tian, Richard A Hankins, and Jignesh M Patel. Efficient aggregation for graph summarization. In SIGMOD08, pages 567–580. ACM, 2008.

[164] Hannu Toivonen, Fang Zhou, Aleksi Hartikainen, and Atte Hinkka. Compression of weighted graphs. In KDD, 2011.

[165] Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis Faloutsos, and Christos Faloutsos. Gelling, and melting, large graphs by edge manipulation. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 245–254. ACM, 2012.

[166] Hanghang Tong, B. Aditya Prakash, Charalampos E. Tsourakakis, Tina Eliassi-Rad, Christos Faloutsos, and Duen Horng Chau. On the vulnerability of large graphs. In ICDM, 2010.


[167] Johan Ugander, Lars Backstrom, Cameron Marlow, and Jon Kleinberg. Structural diversity in social contagion. Proceedings of the National Academy of Sciences, 109(16):5962–5966, 2012.

[168] Piet Van Mieghem, Jasmina Omic, and Robert Kooij. Virus spread in networks. ToN, 17(1):1–14, 2009.

[169] Staal A. Vinterbo. Privacy: A machine learning view. IEEE Trans. on Knowl. and Data Eng., 16(8):939–948, August 2004.

[170] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1225–1234. ACM, 2016.

[171] Jianlong Wang, Sridhar Rao, Jianlin Chu, Xiaohua Shen, Dana N Levasseur, Thorold W Theunissen, and Stuart H Orkin. A protein interaction network for pluripotency of embryonic stem cells. Nature, 444(7117):364–368, 2006.

[172] Liaoruo Wang, Tiancheng Lou, Jie Tang, and John E Hopcroft. Detecting community kernels in large social networks. In ICDM, pages 784–793. IEEE, 2011.

[173] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. In AAAI, pages 203–209, 2017.

[174] Yang Wang, Deepayan Chakrabarti, Chenxi Wang, and Christos Faloutsos. Epidemic spreading in real networks: An eigenvalue viewpoint. In Symposium on Reliable Distributed Systems, pages 25–34, Los Alamitos, CA, 2003. IEEE Computer Society Press.

[175] Eduardo C. Xavier. A note on a maximum k-subset intersection problem. Inf. Process. Lett., 112(12):471–472, June 2012.

[176] Reza Yaesoubi and Ted Cohen. Dynamic health policies for controlling the spread of emerging infections: Influenza as an example. PLoS ONE, 6(9):e24043, 2011.

[177] Pinar Yanardag and SVN Vishwanathan. Deep graph kernels. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM, 2015.

[178] Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In WSDM. ACM, 2013.

[179] Yang Yang, Jie Tang, C Leung, Yizhou Sun, Qicong Chen, Juanzi Li, and Qiang Yang. Rain: Social role-aware information diffusion. In AAAI, 2015.

[180] W.W. Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33:452–473, 1977.


[181] Fuzhen Zhang. Matrix theory: basic results and techniques. Springer Science and Business Media, 2011.

[182] Ning Zhang, Yuanyuan Tian, and Jignesh M Patel. Discovery-driven graph summarization. In ICDE, 2010.

[183] Yao Zhang, Bijaya Adhikari, Steve Jan, and B. Aditya Prakash. Meike: Influence-based communities in networks. In Proceedings of the SIAM Data Mining Conference, SDM ’17, 2017.

[184] Yao Zhang, Abhijin Adiga, Sudip Saha, Anil Vullikanti, and B. Aditya Prakash. Near-optimal algorithms for controlling propagation at group scale on networks. IEEE Transactions on Knowledge and Data Engineering, 28(12):3339–3352, 2016.

[185] Yao Zhang, Abhijin Adiga, Anil Vullikanti, and B. Aditya Prakash. Controlling propagation at group scale on networks. In Data Mining (ICDM), 2015 IEEE International Conference on, pages 619–628. IEEE, 2015.

[186] Yao Zhang and B. Aditya Prakash. Dava: Distributing vaccines over networks under prior information. In Proceedings of the SIAM Data Mining Conference, SDM ’14, 2014.

[187] Yao Zhang and B. Aditya Prakash. Scalable vaccine distribution in large graphs given uncertain data. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1719–1728. ACM, 2014.

[188] Yao Zhang and B. Aditya Prakash. Data-aware vaccine allocation over large networks. ACM Transactions on Knowledge Discovery from Data (TKDD), 10(2):20, 2015.

[189] Yao Zhang, Arvind Ramanathan, Anil Vullikanti, Laura Pullum, and B. Aditya Prakash. Data driven immunization. In ICDM ’17, 2017.