
Performance Analysis and Optimization of Virtualized Cloud-RAN Systems

by

Hazem M. Soliman

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2017 by Hazem M. Soliman

Abstract

Performance Analysis and Optimization of Virtualized Cloud-RAN Systems

Hazem M. Soliman

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2017

Cloud radio access networks (C-RAN) are a promising solution to the ossification of wireless systems. C-RANs provide a platform for rapid innovation and deployment of new wireless technologies. However, they also present a set of challenges not encountered in traditional systems. The goal of this thesis is to identify, study and provide solutions for those challenges. The challenges studied in this thesis fall into two broad categories: the first concerns multiplexing several network slices on the same physical infrastructure; the second stems from the cloud computing concept itself and how it affects the wireless systems architecture.

For the first part, we start at the PHY layer and focus on the question of how multiple network slices can be accommodated on the same infrastructure. We conduct a performance analysis of the alternative multiplexing and scheduling schemes that can be used for slicing and interference coordination. Next, we show how the effects of statistical multiplexing can be integrated into PHY-layer performance indicators, and provide an algorithm for admission control combined with resource slicing using both FDMA and SDMA.

For the cloud computing challenges, we start by looking at how the cloud computing model, combined with the demands of wireless networks, raises the need for efficient distributed scheduling schemes. We provide a completely distributed solution that achieves up to 92% efficiency and discuss how the nature of the scheduler affects performance.

One of the main goals of C-RAN is providing more energy-efficient systems through dynamic resource scaling. We investigate this problem from both the radio access side and the cloud computing side. For the radio access, we propose an optimization and control framework for the activation, association and clustering of remote radio heads (RRHs). The problem is solved using the successive geometric programming approach for signomial optimization. For the cloud computing part, we propose a predictive control framework for anomaly-aware scaling of computing resources. Our proposed scheme is based on the Gaussian process model and provides 95% prediction accuracy and 90% anomaly detection accuracy.

Contents

I Introduction

1 Introduction and Motivation
1.1 From Network to Wireless Virtualization
1.1.1 Challenges of Wireless Virtualization
1.2 NFV, SDN and VN
1.3 Architecture
1.4 Architecture Advantages
1.5 Deployment Challenges
1.6 Research Problems
1.7 Thesis Structure and Contributions
1.8 NFV, SDN and VN within the Context of Wireless Virtualization
1.8.1 NFV in Wireless
1.8.2 SDN in Wireless
1.8.3 VN in Wireless
1.9 Deployment Scenario

2 Background and Literature Review
2.1 A First Look at Wireless Virtualization
2.2 Literature Review
2.2.1 WiMAX Virtualization
2.2.2 vBTS
2.2.3 NVS
2.2.4 CellSlice
2.2.5 LTE eNB Virtualization
2.2.6 SDR and Virtualization
2.2.7 OpenRF
2.2.8 R-Cloud
2.2.9 Resource Abstraction and Dynamic Resource Allocation

II Network Slicing and Infrastructure Sharing

3 PHY-Layer Admission Control and Network Slicing
3.1 Context
3.2 Introduction
3.3 Related Work
3.4 System Model
3.4.1 Motivating Example
3.4.2 Problem Formulation
3.5 Admission Control and Resource Slicing Algorithm
3.5.1 Spectrum Allocation
3.5.2 Admission Control through the Maximum Independent Set
3.5.3 SDMA
3.6 QoS Analysis
3.6.1 Post-Nulling Normalization
3.6.2 Stochastic Number of Users
3.7 Simulation Results
3.8 Conclusion

4 Multi-Operator Scheduling in Cloud-RANs
4.1 Context
4.2 Introduction
4.3 Related Work
4.4 System Model
4.5 Scheduling Algorithms for VOs
4.5.1 Case 1
4.5.2 Case 2
4.5.3 Applications of Case 1 and 2
4.5.4 Intuition Behind Case 1 and Case 2
4.6 General Heuristic
4.6.1 Intuition
4.6.2 Operation
4.6.3 Proof of Convergence
4.6.4 Neuro-Optimization
4.7 Simulation Results
4.8 Conclusion

III Cloud Computing Challenges

5 Fully Distributed Scheduling in Cloud-RAN Systems
5.1 Context
5.2 Introduction
5.3 Related Work
5.4 System Model
5.4.1 System Architecture
5.4.2 Model
5.5 Distributed Scheduling
5.5.1 Maximum Throughput Rayleigh Channels
5.5.2 General Schedulers and Distributions
5.5.3 Simulation Results
5.5.4 Relation Between Fairness and Predictability
5.6 Conclusion

6 Joint RRH Activation and Clustering in Cloud-RANs
6.1 Context
6.2 Introduction
6.3 Related Work
6.4 System Model
6.4.1 System Description
6.4.2 Problem Formulation
6.4.3 Interference Coordination Model
6.4.4 Interference Graph
6.5 Joint Activation and Clustering Algorithm
6.5.1 Set Cover
6.5.2 Greedy Improvement
6.6 Simulation Results
6.7 Conclusion

7 Long-term Activation, Clustering and Association in Cloud-RAN
7.1 Context
7.2 Introduction
7.3 Related Work
7.4 System Model
7.4.1 System Model
7.4.2 Problem Formulation
7.5 Successive Geometric Optimization
7.5.1 Signomial Geometric Programming
7.6 Successive Geometric Optimization for Activation, Clustering and Association
7.7 Simulation Results
7.8 Conclusion

8 Graph-based Diagnosis in Software-Defined Infrastructure
8.1 Context
8.2 Introduction
8.3 Related Work
8.3.1 Anomaly Detection in Static Graphs
8.3.2 Anomaly Detection in Dynamic Graphs
8.3.3 Graph Centrality Measures
8.4 System Architecture
8.5 Graph Diagnosis Module Description
8.5.1 Application Graphs
8.5.2 System Profiles
8.5.3 Forensics
8.6 Exploratory Analysis
8.6.1 Identifying Master Nodes
8.6.2 Assortativity
8.6.3 Physical Connectivity
8.7 Proof of Concept
8.7.1 Webserver - Database workload pattern
8.7.2 Bandwidth throttling
8.7.3 DoS attack on a webserver
8.7.4 Spark Job failure
8.8 Evaluation
8.9 Conclusion

9 Auto-Scaling and Anomaly Detection in Software-Defined Infrastructure
9.1 Context
9.2 Introduction
9.3 Related Work
9.4 System Model
9.4.1 Cost Measure and Quality of Service
9.5 Experimental Setup
9.6 Experimental Results
9.6.1 Prediction
9.6.2 Anomaly Detection
9.7 Conclusion

IV Conclusion

10 Conclusion and Future Work
10.1 Contribution
10.1.1 PHY-Layer Admission Control and Network Slicing
10.1.2 Multi-Operator Scheduling in Cloud-RANs
10.1.3 Fully Distributed Scheduling in Cloud-RAN Systems
10.1.4 Joint RRH Activation and Clustering in Cloud-RANs
10.1.5 Long-term Activation, Clustering and Association in Cloud-RANs
10.1.6 Graph-based Diagnosis in Software-Defined Infrastructure
10.1.7 Auto-Scaling and Anomaly Detection in Software-Defined Infrastructure
10.2 Future Work

Bibliography

List of Figures

1.1 Cloud-RAN Architecture
1.2 SAVI Deployment Scenario
2.1 vBTS Architecture
2.2 Simplified WiMAX Architecture
2.3 NVS Architecture
2.4 CellSlice Architecture
2.5 OpenRadio
2.6 OpenRF Architecture
2.7 OpenRF Table
3.1 Cloud-RAN Architecture - Admission Control and Slicing
3.2 Interval Graph and Conflict Graph for the outcome of step 3.5.1
3.3 Simulation and fitting of the received signal power
3.4 Number of Selected Slices versus different QoS values ε for different values of total number of slices
3.5 Number of Selected Slices per Frequency Resource versus different QoS values ε for different values of total number of slices
3.6 Number of Selected Slices per Frequency Resource versus different QoS values ε for different values of average number of users
3.7 Number of Selected Slices versus different QoS values ε for different values of average number of users
3.8 Comparison of the Markov bound and the simulated probability term defined in (3.11)
3.9 Simulation of the probability term defined in (3.11)
4.1 Cloud-RAN Architecture - Admission Control and Slicing
4.2 Example of Tree
4.3 Conflict graph for the example in Fig. 4.2
4.4 Conflict graph case 2
4.5 Requests in case 1
4.6 Requests in case 2
4.7 Requests in the general case
4.8 Binary Tree Unit
4.9 Interval Graph Unit and the corresponding intervals
4.10 General Graph Unit
4.11 Conflict graph for the general case
4.12 Performance of the proposed algorithms for case 1
4.13 Performance of the general algorithm for case 2
4.14 Percentage performance loss for case 1
4.15 Percentage performance loss for case 2
5.1 Cloud-RAN Architecture - Distributed Scheduling
5.2 Expected SNR Comparison versus γ
5.3 Expected SNR Comparison versus Number of Users
5.4 Distributed Decision Flow Chart for General channels and schedulers
5.5 Prediction Errors for Maximum Throughput Scheduling
5.6 Comparison of Expected SINR for Maximum Throughput Scheduling
5.7 Prediction Errors for Proportional Fairness Scheduling
5.8 Comparison of Expected SINR for Proportional Fairness Scheduling
5.9 Prediction Errors versus β
6.1 Cloud-RAN Architecture - Admission Control and Slicing
6.2 The average number of users per active RRH
6.3 Change of average QoS as the number of users is varied
6.4 Change of average QoS as the number of active RRHs changes
6.5 Overall QoS as the number of users per RRH is varied
6.6 Average number of users as the number of users per RRH is varied
6.7 Average number of users as the number of users per RRH is varied
7.1 Cloud-RAN Architecture - Activation, Clustering and Association
7.2 Average Activation and Clustering Probabilities versus Average Traffic Load
7.3 Average Activation and Clustering Probabilities versus Average Traffic Load, β = 0
7.4 Average Activation and Clustering Probabilities versus Inter-RRH Distance
7.5 Average Activation and Clustering Probabilities versus Inter-RRH Distance, β = 0
7.6 Average Activation and Clustering Probabilities versus QoS Factor
7.7 Average Activation and Clustering Probabilities versus QoS Factor, β = 0
7.8 Average Activation Probability Error versus Average Traffic Prediction Error
7.9 Average Clustering Probability Error versus Average Traffic Prediction Error
8.1 Cloud-RAN Architecture - Anomaly Detection and Scaling
8.2 Graph-Based Diagnosis in Software-Defined Infrastructure System Architecture
8.3 Graphs of Different Applications
8.4 Maximum Betweenness Centrality for Different Applications
8.5 Mean Betweenness Centrality for Different Applications
8.6 Assortativity of Different Applications
8.7 Physical Connectivity of VMs
8.8 Webserver - Database workload diagram
8.9 Webserver - Database testing phase
8.10 Bandwidth throttling testing phase
8.11 DoS attack testing phase
8.12 Spark Job failure testing phase
9.1 Cloud-RAN Architecture - Auto-scaling and anomaly detection
9.2 SAVI testbed Architecture
9.3 Example of CPU utilization Prediction for a Web application
9.4 Prediction Accuracy for a Web application
9.5 Example of CPU utilization Prediction for a BigData application (Master)
9.6 Example of CPU utilization Prediction for a BigData application (Worker)
9.7 Prediction Accuracy for a BigData application
9.8 Example of CPU utilization Prediction in anomalous scenarios
9.9 Anomaly Detection Accuracy for a Web application

Part I

Introduction


Chapter 1

Introduction and Motivation

In today's networks, the networking protocols are tied to the fixed hardware of the physical infrastructure. This inflexibility has made it very difficult to provide truly differentiated services [116]. In line with the approach taken in information technology (IT) and computing virtualization, it has become apparent that decoupling the networking infrastructure from its functionalities must be a key design principle for future networks. This decoupling is captured by the term network virtualization (NV) [8], which is proving to be a popular approach in both industry and academia. For example, virtualization is now one of the fundamental features in next-generation networking projects such as the Global Environment for Network Innovation (GENI) and Smart Applications on Virtual Infrastructure (SAVI).

NV replaces dedicated networking equipment with industry-standard software running on high-performance servers, switches and storage. These are located in data centers, which can, depending on their constraints regarding proximity to the users, be built near renewable energy resources to reduce their carbon footprint. Network function virtualization should be applicable to any data-plane and control-plane processing in fixed as well as mobile networks. Hence, NV transforms future networks into a highly flexible and programmable environment open to continuous innovation.

Another crucial element of next-generation networks is connectivity and mobility through wireless access [107]. Wireless networks are an increasingly significant part of the networking architecture, given the continued growth of wireless rates, coverage and reliability, as well as the increased demand for connectivity and mobility from users. On the other hand, the continued emergence of new wireless standards has resulted in a chaotic environment where different standards handle the same functions, e.g. mobility, separately and in a different manner. This can result in less efficient resource utilization and a significant loss in performance. For next-generation networks, interoperability and coexistence between the different standards are essential. These fundamental elements (flexible networking services, wireless connectivity, and interoperability between wireless standards) have led to the emergence of wireless virtualization as a key element in any future network architecture.

Not only does a virtualized wireless network provide solutions to the issues of current networks, it also opens the networking industry to new business models. The programmability and virtualization of the infrastructure open the way to a shared networking infrastructure, significantly reducing the capital expenditure (CAPEX) of each provider and promising better quality-of-service (QoS) and quality-of-experience (QoE) for the end users [156]. Future networks are envisioned to accommodate two kinds of players: the infrastructure owner (IO) and the virtual operator (VO). The infrastructure owner owns the physical infrastructure as well as the spectrum access rights. It holds service-level agreements (SLAs) with the VOs in order to make its resources available to them; the VOs can then deliver services to their end users.

We can summarize the technical motivations for wireless virtualization as follows:

• Encouraging openness and more innovation in services and applications;

• Reducing equipment cost and power consumption by leveraging cloud computing capabilities;

• Enabling more efficient spectrum utilization through sharing and dynamic spectrum access.

The business and social motivations for wireless virtualization are:

• Separating the infrastructure operator from the system operator, which helps reduce the required manpower;

• Sharing the infrastructure, which helps reduce the high costs of hardware and physical construction and also opens the market to small companies;

• Minimizing the time needed for a new operator to enter the market and to move innovations into the practical domain;

• Bringing diversity of services to the end users.

1.1 From Network to Wireless Virtualization

In order to better understand wireless virtualization, we first need to rigorously define what is meant by network virtualization. In computer science, virtualization refers to the abstraction of computing resources and their provision to the user with the illusion of a dedicated physical resource. The same concept has been extended to the field of computer networks [149], [146]. Several definitions exist for network virtualization. The concept of virtual networking is related to that of virtual private networks, which date back at least to the 1990s. For example, an enterprise-centric definition for network virtualization is given by Cisco [34]:

"The term network virtualization refers to the creation of logical isolated network partitions overlaid on top of a common enterprise physical network infrastructure."

Another way to look at network virtualization is from the perspective of resource abstraction and its different levels [144]:

"The term network virtualization describes the ability to refer to network resources logically rather than having to refer to specific physical network devices, configurations, or collections of related machines. There are different levels of network virtualization, ranging from single-machine, network-device virtualization that enables multiple virtual machines to share a single physical-network resource, to enterprise-level concepts such as virtual private networks and enterprise-core and edge-routing techniques for creating subnetworks and segmenting existing networks."

It is also important to distinguish between the notion of network virtualization and that of virtual private networks (VPNs), which is the focus of the next definition [41]:

"Network virtualization is an approach whereby several network instances can co-exist on a common physical network infrastructure. The type of network virtualization needed is not to be confused with current technologies such as Virtual Private Networks (VPNs), which merely provide traffic isolation: full administrative control as well as potentially full customization of the virtual networks (VNets) is also required to realize the vision of using network virtualization as the basis for a Future Internet."

A key part of network virtualization is to handle the heterogeneity of the resources and be able to aggregate them together [67]:

"Network virtualization is the technology that enables the creation of logically isolated network partitions over shared physical network infrastructures so that multiple heterogeneous virtual networks can simultaneously coexist over the shared infrastructures. Also, network virtualization allows the aggregation of multiple resources and makes the aggregated resources appear as a single resource."

Finally, the authors in [146] have tried to combine all these definitions and arrived at a broad definition covering all aspects of network virtualization:

"Network virtualization is any form of partitioning or combining a set of network resources, and presenting (abstracting) it to users such that each user, through its set of the partitioned or combined resources, has a unique, separate view of the network. Resources can be fundamental (nodes, links) or derived (topologies), and can be virtualized recursively. Node and link virtualization involve resource partition/combination/abstraction; and topology virtualization involves new address (another fundamental resource we have identified) spaces."

In summary, the central element of virtualization is an abstraction that is detailed enough to guarantee the required functionality, yet concise and reusable in that it hides the details of the implementation. This allows high-level users to build on the virtualized and sufficiently isolated view of a network, while also allowing the network's provider to change the underlying implementation transparently to the high-level user. In essence, virtualization as we see it is about achieving balance across three different axes:

• An abstraction that is sufficiently detailed while also concise.


• A sufficient isolation level between the different operators without sacrificing too much of

the network utilization.

• Transparency to the high-level users while allowing the underlying implementation to change.

In this regard, Cloud-RAN has emerged as a promising architecture for 5G networks lever-

aging the concepts of wireless virtualization [4]. The main design principle of the cloud-RAN

architecture is the separation between the base-band processing and the RF-band transmis-

sion. This ensures flexible deployment, fast upgrade capabilities and efficient abstraction of the

network resources. The main goal of this thesis is to address the challenges of the design and

deployment of the cloud-RAN architecture.

1.1.1 Challenges of Wireless Virtualization

Equipped with our definition of network virtualization, we now look into how this definition applies to the wireless network, and at the new challenges that arise in comparison to the wired case. These challenges include:

• Abstraction: In the context of information theory, a time-varying channel typically has higher capacity than a time-invariant one, due to its additional temporal degrees of

freedom [141]. The same can be said for frequency and space degrees of freedom as

well. Efficiently utilizing these degrees of freedom is a main factor in designing wireless

systems, and requires coordination between the PHY-layer information and the MAC-

layer decision making, through the use of adaptive scheduling, scrambling and coding

for example. This need for cross-layer decision making and a tight control of the PHY-

layer resources challenges the flow-level abstraction used in wired networks, where the

PHY-layer is fairly agnostic and independent of the higher layers.

• Transparency: The nature of the PHY-layer technology being used affects the applications that can utilize the network. For example, low-power applications might prefer CDMA-based multiplexing, while data-intensive applications would prefer OFDMA multiplexing.


The dependence of the application on the PHY-layer technology makes it more chal-

lenging to achieve transparency between the view given to the users and the underlying

implementation.

• Isolation: Isolation is even harder to achieve in the wireless network due to the shared

nature of the channel. Moreover, statistical aggregation in the form of long coding se-

quences or large frequency bands is essential for achieving high transmission rates. This

creates a trade-off between achieving good isolation through a strict division of resources, and

risking low utilization due to the loss of statistical multiplexing gains associated with

the shared resources. Moreover, over-provisioning is hard to apply to the wireless spectrum, which is also the most important resource in the wireless network.

• Variability and unpredictability: wireless nodes can differ greatly from each other

due to the nature of wireless signal propagation [106]. More specifically, wireless propaga-

tion is very node specific, hard to control and has a significant impact on the performance.

• Scarcity of the resource: one of the reasons for the success of the cloud computing business model based on virtual computing is the statistical multiplexing gain and the agility of deploying resources on demand. This is not the case in wireless, since spectrum is scarce and usually congested.

• Non-generic Hardware: computing resources consist of generic hardware (HW), making

the virtualization process easier through software (SW). However, in the wireless network,

the computing demands of the PHY layer are so high that they can only be met through task-specific optimized HW, such as the FFT engine for WiMAX and LTE. If the PHY baseband processing is implemented in SW, speed becomes an issue; if HW is used instead, as is currently the case, virtualization becomes hard. Moreover, the

RF front-end will always be done in HW.

• Stochastic Nature: the performance of a wireless network is inherently stochastic due to the variability of the wireless channel.

• Overheads and Retransmissions: due to the difficult propagation conditions in wireless networks, packet retransmissions are more frequent than in wired networks.


1.2 NFV, SDN and VN

Three important concepts are always mentioned when discussing network virtualization. These

are network function virtualization (NFV) [8], software-defined networking (SDN) [89], and

virtual networking (VN). In this section, we provide one way to distinguish between the three that is particularly useful in the context of wireless virtualization. These terms are neither isolated nor orthogonal to each other. However, each of them looks at the problem from a certain

perspective. We distinguish between these terms as follows:

• NFV: the high cost of specialized hardware devices has motivated the concept of func-

tion virtualization. Similar to the computing resources, NFV is about decoupling the

networking protocols from the underlying hardware and migrating them into standard

computing resources. This is a well-established approach within the IT community, and

is the main reason behind the success of cloud computing. The difference, however, lies in how successful this migration can be. Due to the high computational cost of some network

functions, especially within the wireless domain, it is quite challenging to implement the

networking functions fully on standard computing resources.

• SDN: the concept of SDN has been motivated by the difficulty of managing enterprise networks and the slow and costly process of administering them. The idea behind SDN is to separate the data plane from the control plane, giving the network administrator the capability to program the flow of packets in the network using software APIs. OpenFlow is the most popular SDN standard [90]. On a more general level, a network being completely SW-defined means that it can be upgraded simply by upgrading the SW, and this is where SDN meets NFV.

• VN: Virtual networking is the ability to multiplex multiple tenants on the same infrastructure, with guaranteed isolation between them and the perception of a dedicated network for each.

The success of virtual networking is directly influenced by the SDN capabilities of the un-

derlying network, as this simplifies the process of constructing, isolating, managing and

de-constructing such virtual networks. We can also see that creating a virtual network

does not necessarily need SDN or NFV, though these may be the easiest and most flexible


way to do so.

1.3 Architecture

At this point we would like to lay out the system architecture used throughout the thesis. The

architecture adopts the cloud-RAN concept [4][53], where most of the processing functionalities are moved to the cloud to be executed on general-purpose processing units. The cloud

is then connected through optical fibers to a set of remote radio heads (RRH) for radio trans-

mission. This architecture exposes a set of challenges that we will try to address in the later

chapters. In Fig. 1.1 we show the proposed system architecture. The architecture is composed

of a set of components as follows:

• Remote Radio Heads (RRHs): This is the access component of the network and

is responsible for the final transmission of radio signals to the users. The RRHs are

connected through a high-speed network to the cloud computing cluster. This connection

network is known as the fronthaul network. The I/Q signals are prepared inside the

cloud and forwarded for final transmission through the RRHs. In comparison with the

traditional base stations, RRHs are smaller and less expensive. Hence they can be deployed

more densely to provide better coverage for the end users. The second advantage is that

the RRHs are relatively agnostic to the PHY-layer technology being used, hence upgrading

the communication protocol can be done without having to upgrade the physical access

network, providing significant CAPEX savings.

• Base-band Processes: these comprise the main execution units inside the cloud com-

puting cluster, and can be divided into two classes:

– User Process: the user process handles all the processing, both uplink and down-

link, for a single user. It implements the typical PHY-layer pipeline including source

and channel coding, scrambling and modulation. The user process handles some

of the heaviest computation in the network, and is therefore optimized through an

aggressive use of lookup tables (LUTs). Each network slice has its own set of user

processes. One or more user processes can be running on a virtual machine at a


time depending on the amount of computation needed and the capabilities of the

VM itself. The user process can be migrated to a different machine if the underlying

computing resources are insufficient. Hence, the concept of the user process is crucial

for realizing the cloud distributed computation model in the cloud-RAN architec-

ture. Unlike typical cloud computing applications, where the virtual machine is the main computing unit in the system, the low latency required in wireless applications

raises the need for smaller computing units, represented here as the user and cell

processes.

– Cell Process: the cell process handles the processing that cannot be done for each user individually, but instead needs data from all the users within a specific cell/cluster. This includes, for example, the MAC-layer scheduling and the inverse fast Fourier transform (IFFT) as well as the FFT operations. The cell process receives

the output of the user processes in the form of I/Q signals, and is responsible for

the final preparation of the signal sent to the access network through the fronthaul

connections. Being a cell-wide process, the computation requirements of the cell process depend directly on the number of users being served; hence, it is most demanding when traffic is at its peak. Even during low traffic, however, the cell process remains computationally intensive, as the scheduler and IFFT blocks are expensive in their own right. Similar to the user process, the cell

process might need to be migrated or have its VM upscaled as the computation

demand increases. However, there is another significant challenge in designing the

cell process due to the extensive traffic between the cell process and all the user

processes within the cell.

• Network Slice Controller: this includes all the control plane and higher-layers deci-

sions made by a specific slice. In essence, this corresponds to the core network within

the current network architectures, plus all the higher-layers operations. It also includes

the interface for communication with the infrastructure controller. The slice controller

communicates with the infrastructure controller about the admission control process, the

resource provisioning and the coordination between the different slices.


• Infrastructure Controller: this is responsible for all the control decisions regarding the

infrastructure itself, and the interaction between the network slices. It can be seen as the

generalization of the FlowVisors [120] used in wired network virtualization to the wireless

case. The infrastructure controller is responsible for the initial admission control and

slicing decisions, as well as provisioning this slicing during the normal network operation,

through scheduling and interference coordination for example. The infrastructure con-

troller is also responsible for administering the computing part of the network. Through

communication with the various processes, it can evaluate their computing needs and

carry out subsequent decisions for resource scaling or process migration.
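To make the user process concrete, the following sketch chains scrambling and LUT-based modulation the way a single user process would. This is a minimal illustration under our own assumptions: the scrambling sequence, the QPSK table and all function names are hypothetical, not the thesis implementation.

```python
# Minimal sketch of a user process: scramble input bits, then map them to
# complex I/Q symbols through a modulation lookup table (LUT).
# All names and parameters here are illustrative assumptions.

QPSK_LUT = {  # 2 input bits -> complex symbol (Gray-mapped, unit energy)
    (0, 0): complex(+1, +1) / (2 ** 0.5),
    (0, 1): complex(-1, +1) / (2 ** 0.5),
    (1, 1): complex(-1, -1) / (2 ** 0.5),
    (1, 0): complex(+1, -1) / (2 ** 0.5),
}

def scramble(bits, seq):
    """XOR the payload bits with a (repeating) scrambling sequence."""
    return [b ^ seq[i % len(seq)] for i, b in enumerate(bits)]

def modulate(bits, lut=QPSK_LUT):
    """Map pairs of bits to complex I/Q symbols via the LUT."""
    assert len(bits) % 2 == 0
    return [lut[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def user_process(payload_bits, scramble_seq=(1, 0, 1, 1)):
    """PHY pipeline of one user process: scrambling -> modulation.
    The resulting I/Q symbols would be forwarded to the cell process."""
    return modulate(scramble(payload_bits, scramble_seq))

iq = user_process([0, 1, 1, 0, 0, 0, 1, 1])
print(len(iq))  # 8 bits -> 4 QPSK symbols
```

A real user process would add source and channel coding and use much larger LUTs, but the structure is the same: each stage is a table-driven transformation of the bit stream.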

1.4 Architecture Advantages

Having laid out the architecture, we now discuss how it addresses the challenges of deploying cloud-RAN in practice. Within the cloud-RAN architecture, the cloud is responsible for handling all the

base-band processing required for transmission. Moving the base-band processing to the cloud

is challenging due to two seemingly conflicting goals of cloud computing and wireless systems: elasticity and latency. On one hand, a key concept within cloud computing is that of

elasticity, i.e. computing processes are virtualized and migrated between the physical servers

to optimize some criteria such as energy efficiency or utilization. On the other hand, migrating

virtual machines takes a few seconds to finish, which is three orders of magnitude more than

the millisecond latency required in modern wireless systems.

The Bell Labs architecture for C-RAN has addressed such a problem by introducing the

concepts of a user process and a cell process [53]. The user process handles all the processing

for a single user. User processes communicate with the cell process, which is responsible for

the cell-wide processing such as the scheduling and the last stages of the PHY-layer pipeline.

By using a software process as the main processing unit instead of a virtual machine, the migration issue is solved: all that is needed now is to instantiate the process with the same parameters on a different machine.

Another dimension of the problem is that building wireless systems on general purpose

CPUs is challenging due to the low-latency required in such systems. However, the key insight


[Figure 1.1 depicts the proposed architecture: cloud computing resources host the user processes (coding, scrambling and modulation, with lookup tables for faster processing and a switch abstraction) and the cell processes (IFFT and scheduler/precoder, which need not be collocated). I/Q signals travel over the fronthaul network to the remote radio heads serving the end-users. The network slice controller (slice scheduler, slice precoder, slice communication protocol) exchanges scheduling requests/grants, CSI/null-space information and beamforming vectors with the infrastructure controller, which handles admission control, network slicing, interference coordination, resource provisioning, RRH activation, and computing-resource scaling and migration over the cloud network fabric.]

Figure 1.1: Cloud-RAN Architecture


for implementing such systems is realizing that, while wireless processing is very computationally

intensive, it has relatively low memory requirements. The standard way to solve this problem in

high-level programming languages is by leveraging lookup tables (LUTs). LUTs trade memory

for computation speed. This approach has been successfully applied in the SORA platform

[136], which currently supports both WiFi and LTE.

Interestingly, the benefits of lookup tables are not just about computation speed. OpenFlow

has become the de facto interface for controlling networking switches and routers in wired

environments [90]. One of the functionalities of OpenFlow is to decide how the mapping is

done between the input and output ports of a switch. The OpenFlow controller will fill up a

switching table, thereby deciding the action to be taken for each flow. Our main observation here

is that LUTs, besides being the building blocks for wireless NFV, are also the key enablers for

SDN in wireless. LUTs can be seen as a switch abstraction, where the input bit combinations

correspond to the input ports, while the desired output bits are the output ports. While the

strict latency requirements in wireless mean that we cannot wait for the controller to respond, there are several ways to provide programmability in wireless networks:

• Repopulate the table: this is the basic, though slow, action where the content of the LUT

itself can be updated on demand. While this is very flexible, it might require some system downtime to update the tables.

• Offset-based mapping: consider the modulation table where the input binary bits are to be

mapped to the output complex symbol. In order to make the modulation programmable,

we can populate different parts of the LUT with different modulation schemes. An offset

is then programmed through an OpenFlow-like protocol which controls which part of the

table is used.
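The offset-based mapping can be sketched as follows, assuming a single flat table that stores BPSK entries in one region and QPSK entries in another; the table layout, offsets and class names are our own illustrative assumptions, not part of the thesis design.

```python
# One flat modulation LUT holding two schemes; a programmable offset
# (set by an OpenFlow-like controller) selects which scheme is active.
# The table layout and the offsets are illustrative assumptions.

S = 2 ** -0.5
MOD_LUT = [
    # offset 0, 1 bit/symbol: BPSK (index = input bit)
    complex(+1, 0), complex(-1, 0),
    # offset 2, 2 bits/symbol: QPSK (index = 2*b0 + b1)
    complex(+S, +S), complex(-S, +S), complex(+S, -S), complex(-S, -S),
]

SCHEMES = {"bpsk": (0, 1), "qpsk": (2, 2)}  # name -> (offset, bits/symbol)

class ProgrammableModulator:
    def __init__(self, scheme="bpsk"):
        self.set_scheme(scheme)

    def set_scheme(self, scheme):
        # In deployment this would be triggered by a control-plane message.
        self.offset, self.k = SCHEMES[scheme]

    def modulate(self, bits):
        out = []
        for i in range(0, len(bits), self.k):
            index = 0
            for b in bits[i:i + self.k]:
                index = (index << 1) | b
            out.append(MOD_LUT[self.offset + index])
        return out

m = ProgrammableModulator("bpsk")
print(len(m.modulate([0, 1, 1, 0])))  # 4 BPSK symbols
m.set_scheme("qpsk")                  # reprogram via the offset alone
print(len(m.modulate([0, 1, 1, 0])))  # 2 QPSK symbols
```

Switching schemes touches only the offset, so an OpenFlow-like control message can reprogram the modulator without rebuilding the table.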

The final piece needed to realize wireless virtualization is being able to provide isolated service to different

network slices. The scheduler module within the MAC layer already provides us with the tools

for that. The problem of supporting multiple slices on the same wireless resource is a direct

extension of the well-studied wireless resource allocation and multiplexing problem. However, new approaches such as hierarchical and distributed scheduling introduce new flavors


into the problem.

1.5 Deployment Challenges

The deployment challenges can be summarized around two design principles present in the architecture as follows:

• Cloud Computation Model:

– Distributed Processing: unlike the current systems where all the processing is

done centrally in the base station, the cloud computing model poses new challenges

due to the distributed nature of its resources, such as the virtual machines. The

split of the base-band processing into a user and a cell process is key here to leverage

the distributed computing model. However, a new challenge arises: how

to address the extensive communication traffic needed between these two types of

processes.

– Elastic Resources, Scaling and Clustering: the other major feature of the

cloud computing model is the resource elasticity and dynamic scaling of the assigned

resources based on the demand/traffic volume. First, there is the question of scaling

the computing resources according to the traffic pattern. This is one advantage of

the per-user processing approach used in the architecture, as it enables low-latency

scaling necessary for the wireless applications.

Second, a related question can be posed for the access network, in terms of the RRH

activation. Networks are typically dimensioned for peak demand. When demand is off-peak, the cloud-RAN model calls for saving the extra resources and exploiting the statistical multiplexing gains. Achieving resource

elasticity in the access network without affecting the quality of the service received

by the users is the key challenge here.

• Network Slicing and Infrastructure Sharing:

– Admission Control: One of the first decisions to be made by the infrastructure

controller is whether a new slice should be admitted in the network. This decision


must take into account the available resources, the requested QoS as well as the QoS

of the slices already admitted. The infrastructure controller must ensure that the

QoS of the already admitted slices will not affected by the new slice. At the same

time, the infrastructure controller must ensure that it can provide the new slice

with its target QoS. This is particularly challenging in in the wireless domain due

to interference, the time-variable channel and the random movement and arrivals of

the network’s users.

– Slicing Dimension: Jointly with the admission control decision, the infrastructure

controller needs to decide which resources are assigned to this new slice, and which

dimensions (space, frequency, time) are used to slice the network. This decision re-

quires quantifying the performance difference between each slicing technique in terms

of the overall network utilization and the provided QoS.

– Resource Provisioning: once a slice has been admitted, the infrastructure con-

troller needs to provision its resources. The goal is, on the one hand, to maintain the slice's QoS, and on the other, to guarantee a sufficient degree of isolation between it and the other slices. This isolation is needed to protect the QoS of the other slices as

well. This process is done through scheduling and precoding as means of interference

coordination between the different slices.

1.6 Research Problems

The research problems we study in the thesis correspond directly to the deployment challenges

identified above. In particular:

• Network Slicing and Infrastructure Sharing:

– Admission Control: Several elements have to be defined in order to answer the admission control question. These include a performance metric for the slice, i.e. QoS,

a multiplexing scheme and a coordination policy for resource provisioning. In wire-

less networks, QoS is directly related to the signal-to-noise ratio (SNR) and the

bandwidth. QoS is also a function of the multiplexing scheme used. For example, if


SDMA is used, then the QoS depends on the number of spatial degrees of freedom,

which in turn depend on the number of antennas given to a slice and its number of

users. Moreover, QoS is directly related to the number of RRHs and number of users

per RRH, which determine the portion of bandwidth each user can get. In summary,

a comprehensive QoS metric that takes into account both the PHY-layer aspects

(multiplexing scheme, SNR) and MAC-layer aspects (number of resource blocks per

user) is needed in order to arrive at an efficient admission control policy.

– Slicing Dimension: Several multiplexing schemes can be used to share the radio

spectrum between the slices, such as FDMA, SDMA and TDMA. Each scheme has

its own trade-offs: FDMA provides a good degree of isolation, while SDMA provides

more utilization efficiency. Moreover, one of the primary motivations for cloud-RAN

is leveraging the statistical multiplexing gains between the network slices to preserve

resources. A crucial ingredient in this case is the inclusion of the stochastic nature

of the number of active users. This randomness is key to modeling the statistical

multiplexing gains achieved by SDMA. To address the slicing problem, we need to

quantify the difference between the different schemes under study in terms of QoS,

isolation and statistical multiplexing.

– Resource Provisioning: Resources need to be provisioned by the infrastructure

controller in order to preserve the QoS performance for each slice. If spatial mul-

tiplexing is allowed between the slices, then an interference coordination policy has

to be imposed by the infrastructure owner to avoid excessive leakage or interference

between the slices. One example is the interference nulling policy. In this policy,

the infrastructure owner provides each slice with the null space upon which it must

project its signals to avoid interfering with the other slices.

Scheduling is another form of interference coordination focused on the frequency-time

resource blocks. Since the spectrum resources are now shared across different slices,

the typical MAC-layer scheduler is expanded into a two-stage hierarchical scheduler.

The first stage is where the slice schedules its own users, while the second stage

is where the scheduling of the slices themselves is undertaken by the infrastructure


controller. However, a key question here is how such a scheduler can be designed

in a way that balances the flexibility given to the slice with the overall utilization

achievable by the infrastructure controller.

• Cloud Computation Model:

– Distributed Processing: The MAC-layer scheduler is a performance bottleneck in

the current systems. Migrating the system to the cloud will only exacerbate the problem, as there is now the additional overhead of the communication between the

user process and the cell process. Distributed scheduling is an interesting approach

in this case, as it lowers, or even eliminates, the excessive communication between

the user process and the cell process. A natural question to ask in this case is how

to design an effective distributed scheduler for the cloud-RAN, and how efficient such a distributed scheduler can be compared with the centralized one.

– Elastic Resources, Scaling and Clustering: The relatively low cost of RRHs

compared with traditional base stations enables building a denser wireless network.

This density leads to better coverage, but at the cost of increased energy usage

and interference. However, not all RRHs need to be active at the same time. An

important problem is how to select a subset of RRHs to be active at any point in

time such that overall network performance is not affected. Clustering is key in this

case, as interference is either exploited or eliminated to provide satisfactory signal

levels for the affected users.

For the computing resources, wireless base-band processing load is a function of the

channel state, i.e. better channel conditions can support higher rates leading to

more extensive processing [136]. Hence, a good forecast model for the channel can be

used to also predict the needed computation power. This prediction can be provided

to the infrastructure controller which can then pro-actively make the scaling and

migration decisions for the cloud computing resources.
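The two-stage hierarchical scheduler outlined under Resource Provisioning can be sketched as follows. This is a hypothetical minimal model, not the design studied later in the thesis: each slice first nominates its best user per resource block, and the infrastructure controller then grants each block to the slice with the highest nominated weight; the rate values are illustrative.

```python
# Two-stage hierarchical scheduling sketch (illustrative assumptions only).
# rates[slice][user][rb] = achievable rate of that user on that resource block.

def slice_stage(rates_for_slice):
    """Stage 1: each slice picks, per resource block, its best user and
    reports the corresponding weight to the infrastructure controller."""
    n_rbs = len(next(iter(rates_for_slice.values())))
    nominations = {}
    for rb in range(n_rbs):
        user = max(rates_for_slice, key=lambda u: rates_for_slice[u][rb])
        nominations[rb] = (user, rates_for_slice[user][rb])
    return nominations

def infrastructure_stage(all_nominations):
    """Stage 2: the infrastructure controller grants each resource block to
    exactly one slice -- the one with the highest nominated weight."""
    grants = {}
    for slice_id, noms in all_nominations.items():
        for rb, (user, weight) in noms.items():
            if rb not in grants or weight > grants[rb][2]:
                grants[rb] = (slice_id, user, weight)
    return grants

rates = {
    "slice_a": {"u1": [3.0, 1.0], "u2": [2.0, 2.5]},
    "slice_b": {"u3": [2.5, 4.0]},
}
noms = {s: slice_stage(r) for s, r in rates.items()}
grants = infrastructure_stage(noms)
print(grants)  # rb 0 -> slice_a/u1, rb 1 -> slice_b/u3
```

The split keeps the slices' user-level decisions private while the infrastructure controller only arbitrates between slices, which is the balance of flexibility and utilization discussed above.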


1.7 Thesis Structure and Contributions

We have discussed some of the research problems in our architecture. Next we discuss how we

have addressed them in the thesis.

• Network Slicing and Infrastructure Sharing:

– Admission Control and Slicing: In Chapter 3, we study the admission control

and slicing problem. First, we provide a performance analysis comparing FDMA and SDMA. The random number of active users is integrated into the model to account for statistical multiplexing. Second, a QoS metric is derived based on the null-space projection technique. Third, a three-step algorithm is proposed for the joint admission control and slicing decisions. Simulation results study the trade-off between the QoS and the degree of multiplexing and, correspondingly, the utilization.

This work has been published in [127].

– Resource Provisioning: In Chapter 4 we study hierarchical scheduling as a form

of resource provisioning between the slices. The main assumption here is that the

infrastructure controller decision is limited to be a Yes/No decision to give the slice

the maximum flexibility. The problem is found to be an instance of the maximum-weight

independent set (MWIS). First, we investigate two special cases that have polynomial

time optimum solutions. These cases correspond to the single carrier orthogonal

frequency division multiple access (SC-OFDMA) and time division multiple access

(TDMA). Then we investigate the intuition behind the optimality of these cases, and study

how we can extend this intuition by proposing a heuristic for the general case that

works well for the two special cases (98.5% and 94% respectively). This work has

been published in [126].

• Cloud Computation Model:

– Distributed Processing: In Chapter 5 we study the distributed scheduling prob-

lem. The approach is to completely remove the central scheduler and teach each

individual user process to come up with the decision on its own. We provide an

analytical performance analysis for the achievable rate in the case of the Rayleigh


fading and maximum throughput scheduling. For this case, we find that the dis-

tributed scheduler can achieve 92% of the performance of the centralized one. Then,

we study more general scenarios employing machine-learning classification techniques

such as support vector machines (SVM) and decision trees. Here, we find that dis-

tributed scheduling is able to provide up to 89% of the performance of the centralized

one. We also uncover an interesting trade-off between the fairness of the scheduler

and its predictability, and study this trade-off for a general mean-variance scheduler.

This work has been published in [125].

– Elastic Resources, Scaling and Clustering: In Chapter 6 we study the problem

of joint activation and clustering in cloud-RANs. In this case, our objective function

is a combination of the number of active RRHs (representing the energy), the number

of users per active RRH and the SINR for these users (representing the QoS), and

the size of the cluster (representing the clustering penalty). Our main constraint is a

coverage constraint, where each user has to be covered by at least one RRH. We

propose a two step algorithm to handle the problem. The first step is a set-cover

problem where the minimum number of RRHs is activated to guarantee coverage.

The second step is a greedy improvement by activating more RRHs or clustering

active ones to improve performance. This work has been published in [128].

This framework is then extended in Chapter 7 in several directions. First we include

the user-RRH association as another variable in our model. Second, we expand the

problem to a long-term optimization where the queuing dynamics are integrated into

the model. The resulting problem is an example of signomial optimization, which is

then solved efficiently using successive geometric approximation. Finally, we study

how this framework can be extended into a stochastic control framework by operating

on the traffic forecast. We measure the sensitivity of our decisions with respect to

the traffic forecast error, and find it to be 9% for the activation decision and 18%

for the clustering decision.

In Chapters 8 and 9 we study the scaling of computing resources jointly with anomaly detection. Chapter 8 focuses on finding a good set of features for identifying anomalies.¹ The main framework is then studied in Chapter 9, where we address the

joint problem of computing resource scaling and anomaly detection. This is modeled

as a stochastic optimization problem. The proposed solution policy is based on a

Gaussian process model where the probability of exceeding a utilization threshold is

our scaling indicator, and the deviation between the prediction and the measurement

is our anomaly detector. We measure a prediction accuracy of 95% and an anomaly

detection accuracy of over 90%. Part of this work has been published in [145].
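At a high level, the two-step activation algorithm of Chapter 6 can be sketched as follows. This is a deliberately simplified illustration: the coverage sets and the toy utility (which rewards users covered by up to two active RRHs, a crude stand-in for clustering gains, minus a per-RRH energy cost) are our own assumptions, not the thesis model.

```python
# Illustrative two-step RRH activation: (1) greedy set cover for coverage,
# (2) greedy improvement by activating extra RRHs while utility improves.
# The coverage model and utility function are toy assumptions.

def greedy_set_cover(users, coverage):
    """Step 1: activate a small set of RRHs so every user is covered."""
    uncovered, active = set(users), []
    while uncovered:
        # Pick the RRH covering the most still-uncovered users.
        best = max(coverage, key=lambda r: len(coverage[r] & uncovered))
        active.append(best)
        uncovered -= coverage[best]
    return active

def utility(active, coverage, energy_cost=2.0):
    """Toy objective: each user earns credit for up to two covering RRHs
    (a stand-in for clustering/SINR gain), minus a per-RRH energy cost."""
    covered = set().union(*(coverage[r] for r in active))
    credit = sum(min(2, sum(1 for r in active if u in coverage[r]))
                 for u in covered)
    return 3.0 * credit - energy_cost * len(active)

def activate_rrhs(users, coverage):
    active = greedy_set_cover(users, coverage)
    # Step 2: greedily activate more RRHs only while the utility improves.
    for rrh in coverage:
        if rrh not in active:
            if utility(active + [rrh], coverage) > utility(active, coverage):
                active.append(rrh)
    return active

users = {"u1", "u2", "u3", "u4"}
coverage = {"r1": {"u1", "u2"}, "r2": {"u2", "u3"}, "r3": {"u3", "u4"}}
print(sorted(activate_rrhs(users, coverage)))
```

In this toy instance, step one activates r1 and r3 to cover all users, and step two additionally activates r2 because the double coverage of the middle users outweighs its energy cost.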

1.8 NFV, SDN and VN within the Context of Wireless Virtualization

Applying the concepts of NFV, SDN and VN to wireless networks necessitates the specification

of the aspects of architecture, design and implementation of wireless virtualization. In this section, we summarize our previous discussion by revisiting the motivations behind these concepts.

We see that NFV is about avoiding the use of specialized HW, SDN is about programmability

and VN is about sharing the resources between different slices.

1.8.1 NFV in Wireless

Traditionally, wireless systems have been implemented using FPGAs or ASICs to accommodate

the high computational requirements of the wireless PHY and MAC layers. However, the

continuous advancements in CPUs and the powerful capabilities of data centers have made it

possible to build such systems using only general purpose CPUs. While wireless protocols have

high computational needs, their memory needs are relatively low. This asymmetry makes lookup tables (LUTs) fundamental to any implementation of wireless systems on CPUs: LUTs trade memory for computation, and have been successfully used to implement WiFi on

general purpose CPUs in SORA [136].
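As an illustration of this memory-for-computation trade, the sketch below precomputes a 16-QAM constellation mapping once, so that modulation becomes a single table lookup per 4-bit group. The Gray-coded level ordering is a standard textbook choice, not the specific tables used in SORA:

```python
# Precompute the 16-QAM constellation once (memory), so modulation
# needs no per-symbol arithmetic.
# Gray-ordered levels: 00 -> -3, 01 -> -1, 11 -> +1, 10 -> +3.
LEVELS = (-3.0, -1.0, 3.0, 1.0)  # indexed by the raw 2-bit value

QAM16_LUT = {}
for bits in range(16):
    i_bits, q_bits = bits >> 2, bits & 0b11
    QAM16_LUT[bits] = complex(LEVELS[i_bits], LEVELS[q_bits])

def modulate(nibbles):
    """Map a sequence of 4-bit values to 16-QAM symbols via the LUT."""
    return [QAM16_LUT[n] for n in nibbles]

print(modulate([0b0000, 0b1111]))  # [(-3-3j), (1+1j)]
```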

¹ Chapter 8 is joint work with Joseph Wahba, a former MSc student in the research group.


1.8.2 SDN in Wireless

In wired SDN, the OpenFlow controller populates a switching table that decides the action

to be taken for each flow. Our main observation is that LUTs are the key enablers for SDN

in wireless as they are for NFV. LUTs can be seen as a switch abstraction, where the input bit combinations correspond to the input ports, and the desired output bits correspond to the

output ports.

1.8.3 VN in Wireless

For the radio access network, the problem of slicing the spectrum between a set of virtual

networks is equivalent to the well-known multiplexing problem. While the standard approaches such as FDMA, TDMA, CDMA and SDMA are still applicable, new approaches such as hierarchical and distributed scheduling introduce new flavors into the problem.

1.9 Deployment Scenario

Having defined the architecture, we can now look in more detail at how such an architecture can be deployed on the SAVI testbed [70]. SAVI is a testbed for software-defined

infrastructure (SDI) with an integrated control and management framework for heterogeneous computing and networking resources. These heterogeneous resources are jointly managed through

the SDI manager, which supersedes the infrastructure controller discussed in the cloud-RAN

architecture above. The SDI manager contains a set of modules, each responsible for a subset

of the resources. For example, the Nova module from OpenStack is responsible for computing

resources, and OpenFlow-style controllers are responsible for the networking resources. In Fig.

1.2 we provide an example of the deployment scenario.

The first step towards realizing a virtualized wireless system is the virtualization of the base

band network functions in software. Two examples of the efforts in this area are OpenAirIn-

terface [101] and SORA [136]. OpenAirInterface provides an open source implementation for

the main functionalities in an LTE/OFDMA system. Since OpenAirInterface is developed in

C++, it can be easily deployed on the SAVI virtual machines. The same holds for SORA.

These base-band processors comprise the bulk of the user processes and the associated slice


controller.

Figure 1.2: SAVI Deployment Scenario. The SDI manager (Cloud-RAN module and OpenStack Nova module, with admission, scheduling and scaling modules) exchanges QoS requests, user-count estimates, LUT updates, CSI reports, scheduling weights and migration/scaling commands with the slice controller, the user processes (precoder design, distributed scheduling) and the cell processes (hierarchical scheduling).

The second step is to interface the wireless network, i.e. OpenAirInterface, with the SDI

manager. This connection is crucial to realize the architecture and solutions discussed above.

From the communications and networking perspective, the SDI manager should be able

to control and update the PHY-layer pipeline through the LUT entries. LUTs provide an

abstraction of the PHY-layer processing similar to the flow processing used in OpenFlow, hence

similar control mechanisms can be developed to control the PHY-layer operation. Besides

choosing the appropriate modulation and coding schemes, the SDI manager can update the

precoding LUTs in the PHY-layer pipeline in accordance with the null space projection policy

to ensure elimination of inter-slice interference.

From the computing perspective, the main observation is that the computing needs of a

wireless system are related to the CSI [136]: better channel conditions support higher transmission rates and consequently greater processing needs. By providing the CSI to the cloud controller, and with appropriate forecast mechanisms, the VM scaling and migration decisions


can be made and executed proactively in an efficient manner.
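A minimal sketch of such CSI-driven proactive sizing is given below. The cycles-per-Mbps cost and per-core cycle capacity are hypothetical constants chosen only for illustration, not measured values from the thesis:

```python
import math

# Hypothetical cost model: cycles of baseband processing per Mbps served,
# and usable cycles per second of one vCPU.
CYCLES_PER_MBPS = 5.0e6
CORE_CAPACITY = 2.4e9

def forecast_rate_mbps(snr_db, bandwidth_mhz):
    """Shannon-style rate estimate from the forecast CSI (SNR)."""
    snr = 10 ** (snr_db / 10.0)
    return bandwidth_mhz * math.log2(1.0 + snr)

def vcpus_needed(snr_db, bandwidth_mhz):
    """Proactively size the VM for the load implied by the CSI forecast."""
    load_cycles = forecast_rate_mbps(snr_db, bandwidth_mhz) * CYCLES_PER_MBPS
    return max(1, math.ceil(load_cycles / CORE_CAPACITY))

# A better forecast channel means a higher rate, hence more vCPUs.
print(vcpus_needed(20, 20))   # 1
print(vcpus_needed(30, 100))  # 3
```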

Chapter 2

Background and Literature Review

2.1 A First Look at Wireless Virtualization

Wireless virtualization, in the broadest sense, can be considered as a multiple-access problem

between the virtual operators who share the same infrastructure and access the same part of

the spectrum. This view has been taken in the GENI document on wireless virtualization [106],

which discusses the basic multiple access techniques as approaches to wireless virtualization.

These multiple access techniques are

• Frequency Division Multiple Access (FDMA).

• Time Division Multiple Access (TDMA).

• Code Division Multiple Access (CDMA).

• Space Division Multiple Access (SDMA).

• Any combination of the above techniques.

There are of course trade-offs in taking each approach. FDMA can result in low utilization of

the scarce spectrum resource and can be infeasible when the spectrum band is crowded. TDMA

alleviates this utilization problem but suffers from context-switching delay which can be in the

order of milliseconds. The SDMA approach taken in the ORBIT testbed [115] is not feasible

in practical commercial networks. CDMA, while not having the limitations above, is known to

be interference-limited as in current cellular architectures.


2.2 Literature Review

This section covers the efforts by the research community into building virtualized wireless

networks. These efforts can be classified into the following categories:

• WiMAX virtualization.

• LTE virtualization.

• Resource Abstraction and dynamic resource allocation.

2.2.1 WiMAX Virtualization

Many of the key papers within the framework of wireless virtualization were published by the

team at Rutgers University in joint effort with NEC Labs. These efforts were targeted at the

WiMAX system, and had the advantage of performing real tests within their ORBIT testbed

[115]. Their work has resulted in the following virtualization architectures:

1. vBTS: virtualization through emulation of base stations.

2. NVS: virtualization through MAC-layer enhancements.

3. CellSlice: virtualization through feedback control.

2.2.2 vBTS

vBTS stands for virtual base transceiver system. It is part of the ORBIT testbed and was

proposed as a virtualization architecture for WiMAX by Rutgers University and NEC Labs

in [24]. The motivation behind this architecture is that the physical base station is owned

by the infrastructure owner who may not be willing to expose his proprietary HW to the

virtual operator. The architecture tries to balance the closed nature of the base station HW with

the programmability, observability and repeatability needed by the virtual operator. Their

approach is to give each virtual operator an emulated base station, and use a traffic shaper to

guarantee isolation between the virtual operators in the physical transmission. Hence, vBTS is

essentially a software-based virtualization solution located at the service gateway level.


Figure 2.1: vBTS Architecture. Two virtual base stations (vBTS 1 and vBTS 2) are emulated on top of the physical BTS, with isolation enforced between the users of VN1 and VN2.

Figure 2.2: Simplified WiMAX Architecture. Base stations BTS1-BTS3 connect through the ASN and CSN gateways and a local IP network to the Internet and content providers.


The basic WiMAX architecture is shown in Fig. 2.2. The main components of the architecture are the base transceiver system (BTS), the access service network gateway (ASN), and

the connectivity service network gateway (CSN). The ASN gateway is the connection between

the BTS and the access core network. The vBTS architecture emulates different base stations

for the different virtual operators in VMs running within a data center. One major advantage

of vBTS is that it gives each slice complete control over its MAC, enabling the system to support a different MAC protocol for each slice. The data center is connected through the

ASN gateway to the BTS. Hence, it falls to the ASN gateway to guarantee the isolation between the traffic belonging to different vBTSs. This is done through the Slice Isolation

Engine (SIE), which is an amendment to the standard ASN gateway.

The SIE is implemented through a virtual network traffic shaper (VNTS) mechanism proposed in [23]. This is a dynamic traffic shaping technique aimed at balancing utilization and

isolation. The mechanism is divided into the VNTS engine and the VNTS controller. The

VNTS controller interacts with the physical base station through the simple network management protocol (SNMP) in order to get information about the conditions of the base station

and prevent overflowing it with packets beyond its transmission capacity. Once aware of the

physical base station transmission capacity and of the weights given to each virtual operator, it enforces the traffic shaping through the VNTS engine according to the weight given to each

slice without exceeding the capacity of the base station.
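A minimal sketch of weight-based shaping under a capacity constraint follows. The static per-slice cap is a simplification for illustration, not the actual VNTS algorithm:

```python
def shape_rates(capacity_mbps, weights, demands_mbps):
    """Weighted traffic shaping: each slice is capped at its weighted
    share of the base station capacity. Unused share is not reallocated
    here (a static cap, the simplest possible shaper)."""
    total_w = sum(weights.values())
    return {s: min(demands_mbps[s], capacity_mbps * w / total_w)
            for s, w in weights.items()}

# A 30 Mbps base station shared 2:1 between two virtual operators:
# vo1's 25 Mbps demand is clipped to its 20 Mbps share, vo2 is unaffected.
caps = shape_rates(30.0, {"vo1": 2, "vo2": 1}, {"vo1": 25.0, "vo2": 5.0})
print(caps)  # {'vo1': 20.0, 'vo2': 5.0}
```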

While vBTS provides a simple solution to the virtualization problem, it has its own drawbacks. First, it can only isolate traffic in the downlink, as it has no control over the uplink. Second, the SIE can only provide coarse rather than strict isolation between the slices. Third, the existence of two scheduling modules, one at the physical base station and one at the SIE, affects the utilization of the system since they are not fully coordinated.

2.2.3 NVS

The Network Virtualization Substrate (NVS) [76] moves beyond the high-level architecture

of vBTS and integrates virtualization into the physical base station itself. In doing so, NVS

provides more customization to the virtual operators, achieves better utilization of the system

and guarantees strict isolation between the slices, effectively overcoming the shortcomings of


Figure 2.3: NVS Architecture. A classifier maps downlink and uplink flows to slices 1-3, whose two-level downlink and uplink schedulers feed the frame scheduler.

the vBTS architecture. NVS also has control over both the uplink and downlink unlike vBTS.

The design principles of NVS are summarized as follows:

1. Isolation: between slices.

2. Customization: for each individual slice.

3. Utilization: efficient resource usage.

Even though NVS is designed for WiMAX, it can be easily extended to other orthogonal frequency-division multiple-access (OFDMA) based systems such as Long Term Evolution (LTE)

and LTE-Advanced.

Virtualization Level

WiMAX defines the notion of a service flow between the base station and a user device. A

service flow is a unidirectional flow (either uplink or downlink) of packets with a particular set

of QoS parameters. A user’s end-to-end connections are mapped to one or more of its service

flows. The setup of service flows is left as a policy specification to the network operators. For

efficient resource allocation, the base station includes a collection of schedulers. A downlink flow

scheduler determines the sequence of packets to be transmitted in the downlink direction based


on flow priorities and other QoS parameters. Similarly, an uplink flow scheduler determines

uplink slot allocation based on the bandwidth requests from clients, channel quality, and QoS.

These schedulers then invoke the frame scheduler that maps the packets and uplink resource

allocations to specific slots in each MAC frame. Virtualization can be done at different levels.

Lower levels such as subchannel or HW achieve better efficiency and utilization, but are more

complex. Virtualization at higher levels is easier but leads to less efficient isolation. NVS

virtualizes at the flow level. The provisioning of resources, or what may be called the contract

between the VO and the IO, is done in two ways here:

1. Resource based: a fixed amount of resources will always be assigned, independent of

channel conditions for example.

2. Bandwidth based: aggregate throughput will be guaranteed all the time, no matter what

the channel conditions are.
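The difference between the two contract types can be sketched as follows; the bits-per-slot efficiency parameter is an illustrative stand-in for the channel-dependent modulation and coding scheme:

```python
def slots_for_slice(contract, amount, efficiency_bits_per_slot):
    """Resource-based: a fixed slot count regardless of channel quality.
    Bandwidth-based: slots scale inversely with channel quality so the
    promised throughput (amount = bits per frame) is always delivered."""
    if contract == "resource":
        return amount  # amount = slots per frame, channel-independent
    if contract == "bandwidth":
        return -(-amount // efficiency_bits_per_slot)  # ceiling division
    raise ValueError(contract)

# The same 1200-bit/frame guarantee costs more slots on a worse channel,
# while a resource-based contract always costs the same.
print(slots_for_slice("bandwidth", 1200, 150))  # 8 (good channel)
print(slots_for_slice("bandwidth", 1200, 50))   # 24 (bad channel)
print(slots_for_slice("resource", 10, 50))      # 10 (always)
```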

Scheduling

A two-level scheduling is proposed in NVS:

1. Slice scheduling: an optimal scheduling algorithm is discussed to schedule the WiMAX

MAC layer frames.

2. Flow scheduling: NVS gives the virtual operator several options to choose from which

determine the order in which different flows will be scheduled within the same slice.

Implementation

NVS was implemented for WiMAX using a PicoChip WiMAX base station, combined with a

WiMAX profile gateway and a set of WiMAX USB clients. To implement NVS, the authors

needed to update the MAC layer of the WiMAX base station to account for the proposed hierarchical scheduling. This required adding around 500 lines of C code to the existing MAC

protocol.

NVS is a very interesting virtualization approach. It provides strict isolation between the

slices, high utilization of the spectrum resource and a sense of configurability to the VOs. However, it falls short of taking into account the inter-cell interaction of cellular systems, which


Figure 2.4: CellSlice Architecture. Base stations BS1-BS4 connect through CellSlice-equipped gateways 1 and 2 to the core network, serving virtual operators VO1-VO3.

is becoming the main limitation of such systems [48]. Extending such TDMA-like scheduling

of slices to a set of interfering base stations is not straightforward, nor is extending NVS itself to the case where the base station is equipped with MIMO capabilities.

2.2.4 CellSlice

CellSlice is a gateway-level solution that achieves slicing without modifying the base stations'

MAC schedulers [77]. Unlike NVS, CellSlice does not try to introduce any changes in the

physical base station. Hence, in order to provide an isolation level comparable to that of NVS

it employs traffic shaping mechanisms to achieve isolation between slices at the gateway level

without affecting the built-in MAC schedulers. The authors propose a simple traffic-shaping algorithm that indirectly constrains the base station scheduler. The assumptions needed for a system

employing CellSlice to actually work are as follows:

1. Sensing: the base stations send periodic feedback information to the CellSlice engine

containing information about the total available resources as well as the utilization per

user represented as the average per flow Modulation and Coding Scheme (MCS).

2. Actuating: a single shaping parameter is exchanged between the CellSlice engine and the

base station, which controls the maximum sustained rate per flow. The base station needs to


take this parameter into consideration when performing its MAC scheduling.

The operation of CellSlice is simple: whenever a base station indicates that it is under-utilized, CellSlice incrementally increases the maximum sustainable rate for all users. This continues until the base station indicates over-utilization, after which CellSlice must reset the

maximum sustainable rate for each flow according to its service level agreement with the VO

owning such a flow.
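This feedback loop resembles additive-increase control with a reset on overload. A sketch under that interpretation follows; the step size and reset rule are illustrative, not CellSlice's exact algorithm:

```python
def update_rate_caps(caps, sla_caps, utilization, step_mbps=1.0):
    """One CellSlice-style control step. While the base station reports
    spare capacity, additively raise every flow's maximum sustained rate;
    on overload, reset each cap to its SLA share."""
    if utilization < 1.0:          # under-utilized: probe upward
        return {f: c + step_mbps for f, c in caps.items()}
    return dict(sla_caps)          # over-utilized: fall back to the SLA

caps = {"flow1": 5.0, "flow2": 3.0}
sla = {"flow1": 4.0, "flow2": 4.0}
caps = update_rate_caps(caps, sla, utilization=0.7)
print(caps)  # {'flow1': 6.0, 'flow2': 4.0}
caps = update_rate_caps(caps, sla, utilization=1.0)
print(caps)  # {'flow1': 4.0, 'flow2': 4.0}
```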

While CellSlice offers a very simple solution for wireless virtualization that introduces no

changes to the physical base station, its performance is limited by the traffic characteristics of

the flows. Specifically, the more fluctuating the traffic is, the harder it becomes to control it

through the CellSlice feedback loop. The drawback is more pronounced in the uplink, where the CellSlice engine is not aware of the state of the flows until the base station sends its periodic feedback

messages. In order to guarantee strict isolation between the slices, the base station needs to send

its feedback messages more frequently, introducing a trade-off between isolation and utilization.

2.2.5 LTE eNB Virtualization

As part of the 4WARD project in the EU, the research group at the University of Bremen considered

virtualization of the LTE eNB [159]. Their virtualization framework of the LTE eNB consists

of two main stages:

1. Virtualization of the physical HW of the eNB,

2. Virtualization of the air interface controlled by the eNB.

The authors focused on the virtualization of the air interface, considering the physical node virtualization to be a task similar to any other computing node virtualization. An entity called the "hypervisor" is responsible for scheduling the physical resources between the virtual instances running on top of it.

The hypervisor is also responsible for scheduling the air interface resources. Since we are

dealing with LTE, the smallest unit of air interface that can be allocated is the Physical Resource

Block (PRB); the task of the hypervisor is to schedule access to the PRBs between the different

virtual operators.


Framework

The architecture is similar to the NVS architecture proposed for WiMAX. One difference is the

assumption that each virtual operator can have its own virtual eNB, e.g. a software-based eNB running on a VM. Another difference is that their scheduling differs slightly from

NVS in that it schedules the users belonging to the different virtual operators directly instead

of scheduling the slices.

Methodology

The hypervisor collects information from the virtual operators about their users, such as channel conditions. Then, depending on each virtual operator's type of contract, the PRBs are allocated.

Types of Contracts

The authors considered four types of contracts, i.e. SLAs, between the IO and the VOs:

• Fixed Guarantee: fixed BW will be allocated all the time.

• Dynamic Guarantees: a guaranteed maximum BW will be allocated if requested; otherwise only the actually needed BW is allocated.

• Best Effort with min guarantees: minimum guaranteed BW will be allocated all the time,

and extra BW may be allocated in a best-effort manner.

• Best Effort with no guarantees: BW will be allocated in a best-effort manner only.
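A greedy sketch of how a hypervisor might honor these four contract classes when allocating PRBs; the pass ordering and the equal best-effort split are illustrative simplifications, not the authors' algorithm:

```python
def allocate_prbs(total_prbs, contracts):
    """Greedy PRB allocation by contract class (a simplified sketch):
    hard guarantees first, then dynamic requests, then best effort
    splits whatever remains."""
    alloc = {vo: 0 for vo in contracts}
    # Pass 1: fixed guarantees and the floors of min-guarantee contracts.
    for vo, (kind, amount) in contracts.items():
        if kind in ("fixed", "min_guarantee"):
            alloc[vo] = min(amount, total_prbs - sum(alloc.values()))
    # Pass 2: dynamic guarantees, up to the requested amount.
    for vo, (kind, amount) in contracts.items():
        if kind == "dynamic":
            alloc[vo] = min(amount, total_prbs - sum(alloc.values()))
    # Pass 3: best effort shares the remaining PRBs equally.
    best_effort = [vo for vo, (kind, _) in contracts.items()
                   if kind == "best_effort"]
    if best_effort:
        free = total_prbs - sum(alloc.values())
        for vo in best_effort:
            alloc[vo] += free // len(best_effort)
    return alloc

print(allocate_prbs(100, {"vo1": ("fixed", 30),
                          "vo2": ("dynamic", 40),
                          "vo3": ("best_effort", 0)}))
# {'vo1': 30, 'vo2': 40, 'vo3': 30}
```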

The authors then proposed a scheduling algorithm for allocating the PRBs among virtual

operators taking into account the types of contracts used. This framework follows a mixed

TDMA/FDMA approach, which is limited by the switching times of TDMA, and the maximum

number of subcarriers (FDMA).

2.2.6 SDR and Virtualization

Another line of work has concerned itself with virtualization architectures that work in a way

very similar to the software-defined radio (SDR) architectures. These include OpenRadio and

OpenRF.


Figure 2.5: OpenRadio. A decision plane and a processing plane (a master DSP, slave DSPs, and FFT and Viterbi accelerators) sit above the RF plane.

OpenRadio

OpenRadio [18] can be considered as the meeting of SDR and SDN. It tries to build a re-

configurable base-band processing system for wireless application (SDR), while providing the

appropriate control and management APIs (SDN) to the network designer. OpenRadio provides

a library of generic DSP blocks from which any wireless standard can be built. For example,

the FFT and IFFT blocks of LTE OFDMA and SC-FDMA downlink and the OFDMA blocks

of WiMAX, both uplink and downlink, can be regarded as different manipulations of the same

DSP blocks. Hence, by providing a rich set of such blocks, a virtual operator can use a pro-

vided API to connect these blocks together in the way that best suits it and build its own wireless system. OpenRadio goes further by building an operating system (OS) for wireless nodes, in a

way similar to the network OS in OpenFlow. This OS abstracts the different wireless resources

and standards through generic APIs paving the way for more robust control and more efficient

utilization. While OpenRadio is an SDR at its heart and not a virtualization architecture, its

integration of SDN concepts makes it a rich environment for developing virtualization solutions

for wireless in the same way that OpenFlow did for wired networks.


Figure 2.6: OpenRF Architecture. An OpenRF controller manages several OpenRF-enabled multiple-antenna access points, each serving a set of users.

2.2.7 OpenRF

OpenRF [80] is the first architecture that tries to combine the concepts of SDN with those

of MIMO communication. The idea of OpenRF is to leverage the beamforming capabilities of

MIMO systems, and abstract the spatial dimensions created by beamforming in a way similar to

the switch ports. Specifically, it adds an extra entry to the routing table abstraction pioneered

by OpenFlow. This new entry is responsible for controlling the spatial dimensions which can be

accessed by a certain flow. Then, in a way similar to OpenFlow, APIs are available to control

the entries of the table, effectively providing control over the MIMO beamforming. OpenRF

was, however, only developed for single-base-station WiFi systems, and it remains open how to

extend it to cellular systems such as LTE.
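The flow-table extension can be sketched as an OpenFlow-style match entry augmented with spatial fields. The field names below are illustrative, not OpenRF's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class FlowEntry:
    """An OpenFlow-style match extended with the spatial fields OpenRF
    adds: the precoding behavior to apply to the flow, and the set of
    destinations to null interference toward (names are hypothetical)."""
    ip_src: str
    ip_dst: str
    tcp_dst: int
    precoding_coherence: str = "coherent"
    null_toward: list = field(default_factory=list)

# A controller API would then edit these entries the way OpenFlow edits
# flow rules, effectively steering the MIMO beamforming per flow.
table = [FlowEntry("10.0.0.1", "10.0.0.2", 80, null_toward=["10.0.0.3"])]
match = [e for e in table if e.ip_dst == "10.0.0.2" and e.tcp_dst == 80]
print(match[0].null_toward)  # ['10.0.0.3']
```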

2.2.8 R-Cloud

2.2.9 Resource Abstraction and Dynamic Resource Allocation

Another line of research concerns itself with the market-related problems of virtualization. In

a virtualized environment, different entities will compete for the shared resources. In the liter-

ature, authors have mainly considered the competition for wireless spectrum between different

virtual operators, with auction theory being the most popular approach to the problem.


Figure 2.7: OpenRF Table. An OpenFlow-style match (WLAN ID; Ethernet src/dest/type; IP src/dest/protocol; TCP src/dest) extended with precoding coherence and spatial interference fields.

Dynamic Spectrum Access in Virtualized LTE

The team at the University of Bremen also considered the problem of dynamic spectrum access and

its related issues [74]. They considered different levels of competition:

• Users choose the virtual operator such that their utility is maximized.

• Virtual operators compete for the users pool.

• Virtual operators compete for the spectrum.

At the user level, each user is represented through a utility function that depends on the

price charged by the chosen operator, the operator's congestion level, and QoE (either as a function

of QoS parameters of the operator, or through a proposed function of the allocated BW that

is argued to capture QoE). The users compete for the operators in a game theoretic fashion.

The users competition game can be modeled as a finite potential game [95]. The users can

employ a learning strategy of trial and error, a choice that is known for its scalability as

well as its well-studied convergence behavior for potential games.

Virtual Operators' Competition for the Spectrum

One level up, we are faced with the problem of the virtual operators' competition for the spectrum. This competition problem is modeled as an auction problem,

consisting of the IO as a spectrum broker (auctioneer), virtual operators (bidders) and spectrum


(auctioned items). Uniform price auctions are chosen as the auction framework. By choosing

suitable utility functions for the auctioneer and the virtual operators, and from the properties

of uniform price auctions, the authors were able to prove the existence of a dominant strategy

for virtual operators. The auction process goes as follows:

1. The auctioneer announces the start of the auction and the time to submit the bids.

2. Each bidder observes the demands of its own users.

3. Bidders submit their bids.

4. The auctioneer receives the bids and decides the shares of each virtual operator.

5. The allocations are fed to the hypervisor which takes the role of scheduling.
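A uniform-price auction of spectrum units can be sketched as follows. This is a simplified single-round clearing, and the lowest-accepted-bid pricing is one common convention, not necessarily the exact rule in [74]:

```python
def uniform_price_auction(supply, bids):
    """Uniform-price auction sketch: sort (price, quantity) bids in
    descending price order and fill until the supply is exhausted;
    every winner pays the same clearing price (the lowest accepted bid)."""
    alloc, price, remaining = {}, 0.0, supply
    for bidder, (p, qty) in sorted(bids.items(),
                                   key=lambda kv: kv[1][0], reverse=True):
        if remaining <= 0:
            break
        won = min(qty, remaining)
        alloc[bidder] = won
        remaining -= won
        price = p  # the last accepted bid clears the market
    return alloc, price

# 10 spectrum units, three bidders: vo1 and vo2 win; both pay 4.0.
alloc, price = uniform_price_auction(
    10, {"vo1": (5.0, 6), "vo2": (4.0, 6), "vo3": (2.0, 6)})
print(alloc, price)  # {'vo1': 6, 'vo2': 4} 4.0
```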

The authors, however, fell short of studying the case where the two games are intertwined.

Stochastic Game for Wireless Virtualization

Game theory has been used in [45] to model the interaction between the different entities

in a virtualized wireless network. This paper provides a novel framework for virtualization

through resource abstraction. The motivation for this approach was that previous approaches to

virtualization required the service providers to explicitly understand the wireless access protocols.

Their new approach alleviates that by separating the wireless resource management performed

at the IO level from the quality-of-service control performed at the virtual operator level.

Moreover, by having the infrastructure provider take control of the resource management, it

is more aware of the heterogeneous services and the underlying time-varying wireless features

(e.g., channel conditions, available spectrum resources).

In such a framework, each virtual operator has a certain utility, which is the sum of the

utilities of its own users. The utility of each user is a function of the rate allocated to it, as

part of the feasible rate region (the possible set of rates given a certain channel). The virtual

operators bid for the wireless resources in the form of rates on behalf of their users. The

Vickrey-Clarke-Groves (VCG) mechanism is chosen as the bidding mechanism used by the virtual operators to bid

for the spectrum. The competitive game between virtual operators is played in sequential


stages, and the utility is considered to be the average utility through all stages. The main

results of this paper are summarized as follows:

1. Modeling the game as a stochastic game.

2. Proving the existence of a Nash equilibrium (NE) in the game.

3. Using conjectural prices, the sequential games are decoupled and the game is tractable.

4. Centralized algorithm for finding the conjectural prices by the infrastructure operator.

5. Proving the efficiency of the NE associated with these prices.

6. Use of reinforcement learning to overcome the non-causality of the problem.

Part II

Network Slicing and Infrastructure Sharing


In this part, we study a set of challenges related to the admission and multiplexing of

several network slices on the same physical infrastructure. These challenges revolve around

admission control, network slicing and resource provisioning. First, we study the admission

control problem from the perspective of the infrastructure controller. We also study the QoS

performance of the different multiplexing schemes and how they can be used in the admission

control decisions. In relation to the admission control, we study the resource provisioning

policies by which the infrastructure controller can maintain an appropriate QoS performance

for the admitted slices. We study two forms of resource provisioning: the first is precoding-based interference coordination and the second is hierarchical scheduling.

Chapter 3

PHY-Layer Admission Control and Network Slicing

3.1 Context

Wireless virtualization is a promising approach to foster innovation and prevent the ossification

of wireless networks. Within a virtualized wireless network, multiple network slices, or virtual

operators (VO), are co-hosted on the same physical infrastructure. In this chapter, we study

one of the first decisions that needs to be taken by the infrastructure controller: which slices should be admitted into the physical infrastructure, and how the network should be sliced between them, i.e. which multiplexing technique (TDMA, FDMA or SDMA) should be used.

Another related question is how the stochastic arrival process should affect the slicing and QoS

criteria. To answer these two questions, we study the problem of QoS-aware joint admission

control and network slicing. Due to the NP-hardness of the problem, we approach it using a

heuristic algorithm composed of three steps: spectrum allocation, admission control and spatial

multiplexing. The proposed algorithm incorporates the effects of QoS and stochastic traffic. We

study through simulations the benefits of joint spatial-frequency multiplexing over the static

frequency slicing approach. Finally, our simulation results help shed some light on the trade-offs

between frequency and spatial multiplexing as well as between QoS and utilization.


3.2 Introduction

Current cellular networks are plagued with long installation times, high cost of equipment and

the widespread use of specialized hardware. The tight association between the hardware and

its functionalities is an obstacle to fast-paced innovation. These issues have encouraged

the introduction of virtualization principles into the wireless domain [32]. Virtualization essen-

tially involves three design principles: the use of modular hardware (HW) to support arbitrary

software (SW) functionalities; support multiple networks on the same physical infrastructure;

and use of SW to manage, control and upgrade the different resource slices [75]. In this regard,

C-RAN has appeared as a promising architecture for 5G networks leveraging the concepts of

wireless virtualization [4].

The problem of wireless virtualization cannot be studied without considering the PHY-

layer aspects of it, as this is a main differentiator between wireless and wired networks. Due

to the stochastic nature of the wireless channel and the scarcity of the spectrum, we have to

adopt an admission control step that decides whether the wireless channel/network has enough

capacity to serve a specific network slice. Together with the admission control step, there is

also the question of how the new slice should be admitted. Such a question does not arise in wired networks, since the virtualization is done at the packet level. However, since the radio

spectrum is shared between all slices, studying wireless virtualization involves delving deeper

into how the spectrum should be shared and how the capacity is divided between the slices.

Moreover, this study must also take into account the stochastic nature of the network traffic

and leverage the resulting statistical multiplexing gains. Jointly with all these decisions, the

infrastructure controller must decide upon the provisioning policy between the slices once they

have been admitted.

A challenge unique to wireless environments is how to share the air interface between the

slices and which multiplexing scheme should be used. The qualitative understanding is that

FDMA provides the best isolation and is the most practical; however, it might result in under-

utilization due to the loss of statistical multiplexing gain [3]. TDMA is adaptive to the varying

traffic behavior, but suffers from synchronization and switching time issues [3]. SDMA is the

most flexible yet also the most complex. However, quantitatively characterizing the difference


in performance between these multiplexing schemes remains an open problem. Prior work has

focused on static environments where the users are infinitely-backlogged.

In Fig. 3.1 we show the system architecture as discussed in Chapter 1, showing only the parts which are the focus of study in this chapter. This chapter focuses on the

interaction between the slice controller and the infrastructure controller in order to agree upon

the admission control, slicing and resource provisioning decisions and policies. The first step

of this mutual interaction is a request for resources submitted by the network slice, indicating

the requested bandwidth and the associated QoS. The request should also include details about

the distribution of the number of active users within the slice during any time slot. Based on this information, and aware of the already admitted slices, the infrastructure controller

decides whether this new slice should be admitted, which resources it will have access to, and

which multiplexing scheme is going to be used for sharing the resources with the new slice.
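The request exchange just described can be sketched as a small data structure; the Python representation and the field names below are illustrative assumptions, not a protocol defined in this thesis.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class SliceRequest:
    """Hypothetical resource request submitted by a network slice controller:
    requested bandwidth B_u, QoS level Q_u, and the distribution of the
    number of active users K within the slice during any time slot."""
    slice_id: int
    bandwidth: float                     # requested bandwidth B_u
    qos_level: float                     # required QoS level Q_u
    user_distribution: Dict[int, float] = field(default_factory=dict)  # {k: P[K = k]}

    def expected_users(self) -> float:
        """E[K], which the infrastructure controller uses later in its QoS bound."""
        return sum(k * p for k, p in self.user_distribution.items())

# Example: a slice requesting 3 units of bandwidth with a light, bursty load.
req = SliceRequest(slice_id=1, bandwidth=3.0, qos_level=2.0,
                   user_distribution={0: 0.2, 1: 0.5, 2: 0.3})
print(req.expected_users())   # 0*0.2 + 1*0.5 + 2*0.3 = 1.1
```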

3.3 Related Work

Even though the question of which multiplexing technique to use for slicing has not been studied

enough, several studies have been performed within the context of multi-user MIMO to compare

and optimize the TDMA vs SDMA multiplexing. The performance of TDMA vs. SDMA for

the case of opportunistic beamforming has been studied in [69]. A distributed algorithm was

proposed in [78] to switch between TDMA and SDMA in MU-MIMO networks. An adaptive

strategy for switching has been designed for the case of imperfect channel state information in

[162]. The effects of delay and channel quantization have been studied in [161]. These works

are more about conventional MIMO systems and are not directly applicable to Cloud-RAN

systems. For example, the focus of these works is how to come up with an adaptive strategy

that dynamically switches between the TDMA and SDMA. In our case, this decision is made

only once at the beginning by the infrastructure controller. Moreover, this strategy is developed

for individual users. Our study is different in that the decision is taken for the network slice

as a whole, and consequently must take into account the aspect of the time variability of the

number of users within each slice.

Admission control is a fundamental problem in wireless communications and networking.


[Figure 3.1 shows the Cloud-RAN architecture: network slice controllers talk to the infrastructure controller (admission control, network slicing, interference coordination, resource provisioning) over a slice communication protocol; baseband processing in the cloud (coding, scrambling, modulation) turns binary input bits into I/Q signals, carried over the cloud network fabric and fronthaul network to the remote radio heads and end-users; CSI/null-space information and per-slice precoder projection decisions are exchanged along the way.]

Figure 3.1: Cloud-RAN Architecture - Admission Control and Slicing


We can divide the approaches to the admission control problem into three categories. In the

first category, there is the call admission control problem [51],[10]. The main methodology here is queuing-theoretic analysis, using call drop probabilities as the main performance metric. This is the traditional approach, which focused on traffic models and did not take into account the PHY-layer aspects or the advances in multi-antenna transmission.

The second category focused solely on the PHY-layer and in particular the MIMO beamforming

problem [88],[49]. While not directly about admission control, appropriate constraints can be

introduced into the problem such that some users are assigned a zero beamforming vector, and

consequently no signal power, in case their QoS target cannot be achieved. The main drawback

in these approaches is that they do not take into account the traffic arrival process and the

variable number of users, as these approaches use what is called the infinite buffer model. The

third category is about the virtual network embedding problem [33],[43]. This is the approach most similar to the case we study here; the only difference is that the focus so far has been on wired networks. In summary, what we study here is a mixture of these three cases. First, our

problem is about the admission control of the whole network slice, not individual users, as in

the case of virtual network embedding. Second, we have to integrate the PHY-layer aspects

into the model as this is crucial for wireless transmission. Third, we include the effects of the

user arrival process and the random number of users, since this is one of the underlying reasons for the statistical multiplexing gains that motivate cloud-RAN architectures.

3.4 System Model

Consider a C-RAN network where the I/Q signals are prepared within a cloud-computing

platform and forwarded through a high-speed network to a set of RRHs. Let R denote the set of RRHs, or equivalently the set of antennas, and let S denote the set of frequency resources. Let U

be the set of slices. Throughout the chapter, we use the words slices and VOs interchangeably.

3.4.1 Motivating Example

Consider a network that is shared between two slices. The first slice is a sensor network that

multicasts/broadcasts the same information to a set of nodes. The second slice is a data-oriented


network that tries to maximize the transmission rate. One possible operational procedure for

the second network is to pick, at each time slot, the user with the best channel, and transmit

as much data as it can to this selected user. The different natures and requirements of the two networks make the optimum precoding in the two slices very different. For the multicast network,

the precoder is determined according to the eigenvector corresponding to the largest eigenvalue

of the composite channel matrix [84]. For the data-oriented network, the optimum beamformer

is the matched filter of the channel vector for the selected user.

The scenario we envision for the virtualized wireless network is that each slice will design its

own precoder. Once a precoder is decided by a specific slice, it has to be projected into the null

space of the channel of the other slices in order to cancel all mutual interference. The question

then becomes how to share the network between the different slices across the frequency and

space dimensions, taking into account the performance difference between frequency and spatial

multiplexing.

In light of the above discussion, we formulate the following centralized optimization problem

max_{W_1, ..., W_{|U|}}  ∑_{u∈U} f_u(W_u, H_u)

s.t.  H_{−u} W_u = 0   ∀ u ∈ U

      H_u W_{−u} = 0   ∀ u ∈ U                    (3.1)

where fu(.) is the utility function chosen by each slice u ∈ U , Wu is the precoder for slice u,

W−u is the set of concatenated precoders for all slices other than u. Hu is the channel matrix

for slice u and H−u is the channel matrix for all slices other than u. The constraints in problem

(3.1) state that slice u should receive as well as cause zero interference to all the other slices,

and this applies to all slices u ∈ U .

Due to the separable nature of the objective function, the solution of the above optimization

problem is equivalent to

W_u = arg max_{W_u} f_u(W_u, H_u) ⊥ Null(H_{−u})   ∀ u ∈ U        (3.2)

In other words, each slice will design its own precoder, which is then projected into the null


space of the interfering channel matrix to ensure complete interference nulling. However, the

above formulation covers only the spatial aspect of the resources. On one hand, we might ask

what would have happened if we assigned different spectrum bands to the different slices. In

such a case no interference nulling is needed and each slice can fully utilize its spatial degrees

of freedom. On the other hand, the strict division of the spectrum can lead to underutilization

of this resource due to the lack of statistical multiplexing, resulting from the stochastic nature

of the traffic. This trade-off and how it affects the resource sharing is the main problem we

consider in this chapter.
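As a concrete sketch of this projection step, the snippet below builds a matched-filter precoder for one slice and projects it onto Null(H_{−u}) computed via an SVD; the dimensions, the random channels, and the numpy-based implementation are illustrative assumptions, not the thesis' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_to_null_space(w, H_other):
    """Project precoder w onto Null(H_other), so that transmitting the
    projected precoder causes zero interference at the other slices."""
    _, s, Vh = np.linalg.svd(H_other)
    rank = int(np.sum(s > 1e-10))
    null_basis = Vh[rank:].conj().T        # columns span Null(H_other)
    return null_basis @ (null_basis.conj().T @ w)

n, k = 8, 3                                # antennas; users in slices -u
H_other = rng.standard_normal((k, n)) + 1j * rng.standard_normal((k, n))
h_u = rng.standard_normal(n) + 1j * rng.standard_normal(n)

w = h_u.conj()                             # matched-filter precoder for slice u
w_bar = project_to_null_space(w, H_other)

# Interference at the other slices' receivers is (numerically) zero:
print(np.max(np.abs(H_other @ w_bar)))
```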

3.4.2 Problem Formulation

In this section we provide the formulation for the optimization problem to be solved by the

infrastructure owner (IO). We chose the utility function of the IO to be the number of admitted

VOs, e.g. to maximize the profit. Each VO submits a resource request composed of the

requested bandwidth Bu, and the associated QoS level Qu(to be defined in 3.6). In the following

we assume that admission control is done once at the start of the system operation, we assume

the IO is aware of the statistical properties of the channels as well as the aggregate load per

VO. The optimization is formulated as:

max_{a, S_i}  1^T a

s.t.  Q(S_i) ≥ Q_i   ∀ i ∈ {j | a_j = 1}

      a_i ∈ {0, 1}   ∀ i ∈ U

      S_i ⊆ S        ∀ i ∈ U                      (3.3)

where a = [a_1, ..., a_{|U|}]^T is a binary vector whose element a_i indicates whether slice i has been admitted,

Si is the set of resources allocated to slice i, Q (.) is the QoS function mapping between the

allocated resources and the expected performance, and Qi is the QoS level required by slice i.

We assume that the IO performs admission control only once at the beginning, or as part of a slow control loop. Moreover, per-slice user admission control is reflected in the QoS metric for the slice through its aggregate admitted load.

The problem formulation provided above is an integer programming problem, hence non-convex as well as NP-complete. In the next section, we provide the steps of our algorithm

to solve the problem as well as our definition of the QoS function.

3.5 Admission Control and Resource Slicing Algorithm

Our main problem defined in (3.3) is in general NP-hard due to its combinatorial nature. Hence, our approach is to start with FDMA, and gradually build upon the initial slicing with SDMA as long as the QoS criteria are satisfied. The high-level steps of the algorithm are as follows:

1. Spectrum Allocation: find a spectrum allocation such that each slice gets a spectrum

band equal to its request, while having as few conflicts between the different slices as

possible. If there is still some conflict in the allocation, i.e. some part of the spectrum is

shared between at least two slices, proceed to the admission control step.

2. Admission Control: pick a feasible set of slices such that no spectrum resource is allocated

to more than one slice.

3. Spatial Multiplexing: the final step is to greedily improve the existing set of admitted

slices by spatially multiplexing additional ones. The admission of new slices should be

such that no QoS constraint is violated.

3.5.1 Spectrum Allocation

The first step in our algorithm is to allocate a frequency band to each slice. We assume that slices have no preference among the different bands. Hence, the criterion we focus on is minimizing the maximum conflict, i.e. band intersection, between the different slices. This becomes

min_x  γ

s.t.  γ ≥ Q_i Q_j ((x_i − B_i/2) − (x_j + B_j/2))²   ∀ i, j ∈ U, i ≠ j

      γ ≥ Q_i Q_j ((x_i + B_i/2) − (x_j − B_j/2))²   ∀ i, j ∈ U, i ≠ j

      x_i − B_i/2 ≥ 0   ∀ i ∈ U

      x_i + B_i/2 ≤ B   ∀ i ∈ U                      (3.4)


where xi is the center of the band allocated to slice i, and Bi is the size of the frequency

band requested by it. Note that we use the weights QiQj to penalize the intersection between

two slices with high QoS levels over lower ones. The goal of this optimization is to minimize

the maximum intersection between the allocated bands, where x_i − B_i/2 is the lower end and x_i + B_i/2 is the upper end of the band allocated to slice i.¹ While problem (3.4) is non-convex, we will proceed with a local optimum for it, as this is much easier to find than one for the original formulation in (3.3).

Note that the above problem is just about finding a spectrum allocation, possibly infeasible,

for all the possible slices. The question of feasibility, i.e. admission control, is handled next.

3.5.2 Admission Control through the Maximum Independent Set

Once the frequency bands have been decided, the next step is to pick a non-conflicting set of

slices. An example for this is shown in Fig. 3.2, where we have 7 slices with their bands already

allocated. The algorithm needs to select the best, e.g. the largest, non-conflicting set of slices. We

have eight independent sets of slices, {1, 2}, {3, 4, 5}, {1, 4, 5}, {6, 7}, {3, 2}, {1, 4, 7}, {3, 4, 7} and

{6, 5}, and the goal is to choose the one with the maximum weight.

In order to find a feasible solution, i.e. a set of non-intersecting frequency bands, we need

to solve:

max_a  ∑_{i=1}^{|U|} a_i w_i

s.t.  S_i ∩ S_j = ∅   ∀ i ≠ j : a_i = a_j = 1 ,   a_i ∈ {0, 1}        (3.5)

where S_i is the interval [x_i − B_i/2, x_i + B_i/2], with x_i found by solving (3.4). The first

constraint says that all admitted slices have to be non-conflicting, while the second is the binary

constraint imposed upon the decision a. This problem is essentially about selecting a subset of

non-conflicting slices of maximum weight.

The key point now is to identify that problem (3.5) is equivalent to a maximum weight

¹ Note that if B_i is the same for all slices, the problem becomes that of maximizing the minimum mutual distance between all the band centers.


Figure 3.2: Interval Graph and Conflict Graph for the outcome of step 3.5.1

independent set problem (MWIS). Consider the graph G = {V, E}. Let V be equal to the set of slices U. Define E = {e_ij : e_ij = 1 ⟺ S_i ∩ S_j ≠ ∅}. Assign to each vertex v_i the weight w_i.

Now we have a graph with each vertex representing a slice. Two vertices are connected with an

edge if and only if their corresponding bands are conflicting. Hence, the problem of choosing a set of non-conflicting slices with maximum weight becomes the problem of selecting a set of independent vertices of maximum weight, which is the maximum weighted independent set

problem for the graph G [52],[20]. In case no weights are associated with the slices, we can set all w_i's equal to one and the problem becomes a maximum independent set (MIS) problem.

While the MWIS is NP-hard in general, the case resulting from solving problem (3.4) belongs

to a class of graphs known as interval graphs. This class of graphs provides a special case where

we can find the optimum solution to problem (3.5) in linear time. An example of the intervals

and their associated graph is shown in Fig. 3.2. The algorithm for the optimum solution can be found in [62] and is reproduced here as Algorithm 3.1 for the sake of completeness.


Algorithm 3.1 Maximum Weight Independent Set for Interval Graphs

Input: a set of weighted intervals V = {v_1, v_2, ..., v_|U|} and the sorted endpoint set L = {l_1, l_2, ..., l_2|U|}
Output: the MWIS M_max of V
(a_j and b_j denote the left and right endpoints of interval v_j.)

temp_max ← 0; M_max ← ∅; last_interval ← 0
for j ← 1 to |U| do
    X(j) ← 0
end for
for i ← 1 to 2|U| do
    if l_i is a left endpoint of interval v_c then
        X(c) ← temp_max + weight(v_c)
    end if
    if l_i is a right endpoint of interval v_c then
        if X(c) > temp_max then
            temp_max ← X(c)
            last_interval ← c
        end if
    end if
end for
M_max ← M_max ∪ {v_last_interval}; temp_max ← temp_max − weight(v_last_interval)
for j ← last_interval − 1 downto 1 do
    if X(j) = temp_max and b_j < a_last_interval then
        M_max ← M_max ∪ {v_j}
        temp_max ← temp_max − weight(v_j)
        last_interval ← j
    end if
end for


3.5.3 SDMA

Once a feasible FDMA-based solution has been found, the final step is to consider whether more slices

can be spatially multiplexed with the chosen slices while still satisfying the QoS constraints.

Here, we follow a greedy approach. We consider each non-allocated slice, and examine whether

the QoS metrics are still satisfied. If this is the case, then the new slice is added and we move

on to examine the next non-allocated slice. The following section covers how we define the QoS

function.

3.6 QoS Analysis

Recall that W_u is the initial, pre-nulling, precoder designed by slice u. Without loss of generality, we focus on the case where W_u = w_u, i.e. one user is active at a time and the beamformer is a vector. This is in line with the motivating example we discussed in Section 3.4.1.

We consider matched-filter zero-forcing precoding. In other words, each slice matches the precoder vector to its users' channels while projecting it into the null space of the other slices' channels. Let w_u denote the precoder vector for slice u, h_u the channel for slice u, and H_{−u} the combined channel matrix for all the other slices. According to the matched filter criterion, w_u = h_u^{∗T}.

In order for slice u to induce no interference on the other slices, the precoder needs to be projected into the null space of their channels. Let us define w̄_u = w_u ⊥ Null(H_{−u}), the projection of w_u onto Null(H_{−u}).

We find it important now to distinguish three cases. Let k be the number of users served by slices −u through the channel matrix H_{−u}, n be the total number of antennas, and m the number of antennas allocated to slice u. In other words, the matrix H_{−u} is of dimensions k × (n − m). These cases are:

• k > n − m

In this case of a 'tall and thin' matrix, and assuming the matrix is of full column rank, the null space is of dimension zero. Hence we cannot find a non-zero precoder w̄_u = w_u ⊥ Null(H_{−u}), and interference cannot be completely removed using spatial beamforming.


• k = n − m

The square matrix case is the same as the previous case: assuming full rank, the null space is empty and interference cannot be removed.

• k < n−m

This is our most interesting case, where the null space is not empty.

3.6.1 Post-Nulling Normalization

As explained in the previous section, the precoder vector is first projected into the null space of the interfering channel, then normalized. The following theorem characterizes the distribution of the received signal power |w̄_u|₂².

Theorem 1. Let n be the total number of antennas, m the number of antennas allocated to

slice u, and k the number of receivers in slices −u. If H−u is a complex Gaussian channel,

then:

|w̄_u|₂² ∼ Γ(n − m − k, 2)                    (3.6)

Proof. This follows from [148], where it is shown that the squared norm of the projection of an (n − m)-dimensional vector with i.i.d. unit-variance complex Gaussian components onto a uniformly random (n − m − k)-dimensional subspace is Γ(n − m − k, 2)-distributed. This is in line with the results in [68], [57] regarding zero-forcing beamforming.

We provide simulation results in Fig. 3.3 to show the matching between the simulated and

analytical results.
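As a minimal Monte Carlo sketch of the experiment behind Fig. 3.3, the snippet below estimates the mean projected power for several values of k. It checks only the qualitative trend, namely that the surviving signal power shrinks as the number of interfering users k grows; the channel normalization is an assumption, so the exact Gamma parameters are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_projected_power(dim, k, trials=4000):
    """Average |w_bar|_2^2 after projecting a matched-filter precoder of
    dimension dim = n - m onto the null space of a random k x dim channel."""
    total = 0.0
    for _ in range(trials):
        H = (rng.standard_normal((k, dim)) + 1j * rng.standard_normal((k, dim))) / np.sqrt(2)
        w = (rng.standard_normal(dim) + 1j * rng.standard_normal(dim)) / np.sqrt(2)
        _, s, Vh = np.linalg.svd(H)
        null_basis = Vh[k:].conj().T       # Null(H), dimension dim - k
        w_bar = null_basis @ (null_basis.conj().T @ w)
        total += float(np.linalg.norm(w_bar) ** 2)
    return total / trials

# With dim = n - m = 8: each interfering user eats one spatial degree of
# freedom, so the surviving power decreases with k.
powers = [mean_projected_power(8, k) for k in (2, 4, 6)]
print(powers)   # roughly [6, 4, 2] under this normalization
```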

3.6.2 Stochastic Number of Users

The above analysis has assumed the number of users k to be fixed. However, the true advantage

of SDMA over FDMA might not be realized until we consider a stochastic number of users,

where the statistical multiplexing gain is present.


[Figure 3.3 plots the empirical density of the received signal power for (k, n−m) = (2, 8), (4, 8) and (6, 8), overlaid with Γ(6, 2), Γ(4, 2) and Γ(2, 2) fits, respectively.]

Figure 3.3: Simulation and fitting of the received signal power

It is easy to extend the previous results by conditioning on the present number of users. In other words, for X = |w̄_u|₂²,

f_{X|K}(x | k) ∼ Γ(n − m − k, 2)                                  (3.7)

f_X(x) = ∑_{k=0}^{n−m−1} f_{Γ(n−m−k, 2)}(x) · P[K = k]             (3.8)

In general, the above summation is hard to calculate for the popular distributions of the number

of users within a queue such as Poisson and Erlang. Therefore, we proceed with an approxima-

tion using Markov’s inequality

P[|w̄_u|₂² > ε] ≤ E[|w̄_u|₂²] / ε = E[E[|w̄_u|₂² | K]] / ε            (3.9)

P[|w̄_u|₂² > ε] ≤ E[n − m − K] / ε                                  (3.10)

P[|w̄_u|₂² > ε] ≤ (1/ε)(n − m − E[K]) = (1/ε′)(1 − E[K]/(n − m))    (3.11)

where ε′ = ε/(n − m).


This probability upper bound is how we define the QoS metrics within our work, where ε serves

as a lower bound on the signal power experienced by the users.
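Evaluating the bound in (3.11) is a one-liner; the sketch below uses illustrative numbers, and clamps the result because a Markov bound above one carries no information.

```python
def qos_bound(n, m, expected_k, eps):
    """Markov upper bound (3.11) on P[|w_bar|_2^2 > eps], clamped to 1.
    n: total antennas, m: antennas for the slice, expected_k: E[K] for
    the other slices' users."""
    if n - m <= expected_k:     # no spatial degrees of freedom left on average
        return 0.0
    return min(1.0, (n - m - expected_k) / eps)

# n - m = 8 spare antenna dimensions, E[K] = 2 interfering users:
print(qos_bound(10, 2, 2.0, 8.0))   # (8 - 2) / 8 = 0.75
print(qos_bound(10, 2, 2.0, 4.0))   # (8 - 2) / 4 = 1.5, clamped to 1.0
```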

Algorithm 3.2 Admission Control and Network Slice Allocation

Input: a request from each slice u ∈ U, including the bandwidth B_u and the QoS level Q_u
Output: the admitted set of slices Ū and their associated allocated bands B̄_u ∀ u ∈ Ū

1: Solve problem (3.4) to find the initial band allocation B_u ∀ u ∈ U
2: Using Algorithm 3.1, find a feasible set of slices Ū
3: ∀ u ∈ Ū, set the final band allocation equal to the initial one: B̄_u = B_u
4: for all u ∈ U \ Ū do
5:    if ∃ B̄_u such that Q(S_i) ≥ Q_i ∀ i ∈ Ū ∪ {u} then
6:        Ū ← Ū ∪ {u}
7:    end if
8: end for
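The greedy SDMA step (lines 4-8 of Algorithm 3.2) can be sketched as below, using the raw bound of (3.11) as the QoS score Q(·); the slice parameters, the dict layout, and the "larger score is better" convention are illustrative assumptions for this sketch.

```python
def greedy_sdma_admission(admitted, candidates, n, m, eps):
    """Greedy SDMA step of Algorithm 3.2: admit an extra slice only if the
    QoS score of every slice, old and new, stays above its target.
    Each slice is a dict {name, expected_k, q_target} (illustrative)."""
    def qos(others_k):
        # Score from (3.11), with E[K] the other slices' total expected load;
        # an empty null space (n - m <= others_k) means no nulling at all.
        if n - m <= others_k:
            return 0.0
        return (n - m - others_k) / eps

    admitted = list(admitted)
    for cand in candidates:
        trial = admitted + [cand]
        total_k = sum(s["expected_k"] for s in trial)
        if all(qos(total_k - s["expected_k"]) >= s["q_target"] for s in trial):
            admitted = trial           # QoS holds for everyone: admit
    return [s["name"] for s in admitted]

# Illustrative: n - m = 8 spare antenna dimensions, eps = 2.
base = [{"name": "A", "expected_k": 2, "q_target": 1.5}]
extras = [{"name": "B", "expected_k": 2, "q_target": 1.0},
          {"name": "C", "expected_k": 4, "q_target": 1.0}]
print(greedy_sdma_admission(base, extras, n=10, m=2, eps=2.0))  # ['A', 'B']
```

Slice C is rejected because admitting its four expected users would push slice A's score below A's target.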

The overall algorithm combining the three steps together is shown in Algorithm 3.2. Note

that in the above discussion we have assumed the existence of a power control loop that com-

pensates for the different path-losses between distributed transmitters. Our approach can be

easily extended to the general case using the results of [61]. The main result of [61] extends Theorem 1 to the general case by showing that |w̄_u|₂² is then a weighted sum of gamma random variables. Since our bound in (3.11) depends only upon the expected value,

our approach is easily extendible.

3.7 Simulation Results

The behavior of the algorithm is shown in Fig. 3.4. The figure shows the number of admitted slices satisfying the QoS criteria versus the QoS parameter ε, for a fixed upper bound on the probability equal to 0.9. We consider B = 10 and B_i = 3 ∀ i. Hence, with FDMA only, the maximum number of slices that can be admitted is 3. Figure 3.4 shows how we can increase this number by using SDMA. The results show that this number can be increased by 100% for low to medium QoS levels. In Fig. 3.5 we show the number of slices per spectrum band, i.e. the occupancy ratio. We can see that when we limit the number of slices to 3, the system is underutilized, as reflected in an occupancy ratio below one. Increasing the number of slices ensures full frequency

utilization and also spatial multiplexing.

In Fig. 3.6 and 3.7 we show the change of the number of admitted slices as their expected


[Figure 3.4 plots the number of selected slices (about 2.5 to 6.5) against ε ∈ [1, 6] for total slice counts S ∈ {3, 5, 7, 10, 12}.]

Figure 3.4: Number of Selected Slices versus different QoS values ε for different values of total number of slices

number of users per slice varies, where the number of available slices is fixed at 7. We can

see that the algorithm hinges upon the balance between the QoS bound, ε, and the expected

number of users, E(K), and increasing either of them will eventually saturate the system.

In Fig. 3.8 we study the behavior of the probability term defined in (3.11). As expected,

the Markov bound approximation tightens as we move towards the tail of the distribution. It

is also more accurate for larger values of ε. In Fig. 3.9 we provide a zoomed out version of the

simulated bounds. We can observe around 20% reduction in the probability for a unit change

in ε, as well as around 10% decrease in probability for a unit change in E [k].

3.8 Conclusion

In this chapter, we have studied the problem of joint admission control and slicing in virtual

wireless networks. We have provided characterization for the QoS performance and its relation

to the stochastic traffic. We have used these characterizations to devise a three step algorithm

with low complexity to tackle the problem. Our simulation results have covered the trade-offs

between frequency and spatial multiplexing, admission control and utilization as well as the


[Figure 3.5 plots the number of slices per frequency resource (0.6 to 2.0) against ε ∈ [1, 6] for total slice counts S ∈ {3, 5, 7, 10, 12}.]

Figure 3.5: Number of Selected Slices per Frequency Resource versus different QoS values ε for different values of total number of slices

[Figure 3.6 plots the number of selected slices (about 2.5 to 6.5) against ε ∈ [1, 6] for average user counts E(k) ∈ {0, 1, ..., 6}.]

Figure 3.6: Number of Selected Slices versus different QoS values ε for different values of average number of users


[Figure 3.7 plots the number of slices per frequency resource (0.6 to 2.0) against ε ∈ [1, 6] for average user counts E(k) ∈ {0, 1, ..., 6}.]

Figure 3.7: Number of Selected Slices per Frequency Resource versus different QoS values ε for different values of average number of users

[Figure 3.8 plots P[|w̄|₂² > ε] against E(k) for ε ∈ {0.8, 1, 1.5, 2}, showing both the Markov upper bound and the simulated values for each ε.]

Figure 3.8: Comparison of the Markov bound and the simulated probability term defined in (3.11)


[Figure 3.9 plots the simulated P[|w̄|₂² > ε] (about 0.4 to 1) against E(k) for ε ∈ {0.8, 1, 1.5, 2}.]

Figure 3.9: Simulation of the probability term defined in (3.11)

accuracy of the QoS bounds.

Chapter 4

Multi-Operator Scheduling in

Cloud-RANs

4.1 Context

The software-defined approach of cloud radio access networks (C-RANs) enables supporting

multiple virtual operators (VOs) on the same physical infrastructure. In this shared envi-

ronment, a coordinator is needed to manage the sharing of resources between the VOs. In

Chapter 3 we studied how this coordination can be achieved through null-space projection. In this chapter, we look at another form of coordination, namely hierarchical scheduling. Designing

a scheduling coordinator is about striking a good balance between the flexibility given to the

VOs, and the efficiency of the resource utilization. In particular, we study the problem of coor-

dinated scheduling in the multi-operator cloud-RAN environment. We formulate the problem

as a two-stage scheduling, where in the first stage the VOs are responsible for scheduling their

own users, after which they submit their resource requests to a centralized coordinator. The

coordinator selects a subset of non-conflicting requests for transmission. We show that the

problem in the general case is NP-hard. We then discuss two special cases and relate them

to the existing communication protocols. By gaining insights from these two special cases, we

propose a general heuristic, which works on any formulation of the problem, and is still able

to provide close-to-optimum performance in the special cases we considered. The heuristic is


Chapter 4. Multi-Operator Scheduling in Cloud-RANs 60

shown to have some similarities to neuro-computation techniques such as Hopfield networks.

Finally, simulation results are provided to show the efficiency of the proposed algorithms.

4.2 Introduction

The concept of cloud-RANs is closely related to that of software-defined networking (SDN) [75] and network virtualization [32]. Overall, one of the main goals of these technologies is to be

able to support distributed computing capabilities, in the form of data-center servers, as well

as share the physical infrastructure between different network operators. A challenge unique to

wireless networks is sharing the air interface between the VOs. In such a case, designing efficient

coordination schemes between these VOs is tricky. On one hand, SDN and virtual network-

ing advocate the diversity of services and technologies to be supported by the different VOs.

Hence, VOs need to have sufficient control over the resources given to them in order to provide

service differentiation for their customers. On the other hand, in such a highly heterogeneous

environment, coming up with efficient coordination decisions is hard. The question remains

open about how to coordinate VOs with heterogeneous MAC-layers or even PHY-layers.

In Chapter 3 we studied how interference coordination through null space projection can

be used to share the resources between the different slices. In this chapter we study another

form of interference coordination, through hierarchical scheduling. This is a form of scheduling

where the decision is made in two stages; first each slice selects a subset of its own users, then

the infrastructure controller selects which slices get access to the spectrum resources within the

next time slot. However, the challenge here is how to balance the flexibility given to the slices

with the overall system utilization. In other words, in order for the infrastructure controller to

truly optimize the system utilization, it needs to do a more detailed scheduling. This, however,

would lead to less flexibility given to the slice controller.

The cloud-RAN architecture is expected to host a diverse set of network slices, each with

its own PHY and MAC technologies. The reason for this, as discussed in Chapter 3, is that

different applications impose different requirements on the PHY and MAC layers. For example,

sensor networks need low power transmission, while augmented reality applications need low

latency. Sensor networks might use spread spectrum techniques while augmented reality might


[Figure 4.1 shows the Cloud-RAN architecture with the scheduling components highlighted: each network slice controller runs its own slice scheduler and exchanges scheduling requests/grants with the infrastructure controller's scheduler over the slice communication protocol; baseband processing in the cloud turns binary input bits into I/Q signals, carried over the cloud network fabric and fronthaul network to the remote radio heads and end-users.]

Figure 4.1: Cloud-RAN Architecture - Multi-Operator Scheduling


use time division techniques. Different PHY technologies have different ways to allocate the

spectrum resources. For example, in LTE and other OFDMA systems, part of the spectrum

is used as a control channel to notify each user which resources, if any, it has been assigned.

Hence, resource assignment is a complicated process that needs access to multiple resources

simultaneously, i.e. an LTE slice needs access to both the control and data resource blocks to correctly deliver data, in order, to its users.

For the reasons discussed above, we decided to limit the infrastructure controller decision to a Yes/No one. Each slice will select which set of resources it needs to

access within the next time slot, including all the data and control ones. Then the infrastructure

controller decides whether the slice gets to use all these requested resources or none at all. This

ensures that the slice can choose its own arbitrary data/control resource splitting scheme, and

the infrastructure controller decision is agnostic to how this choice or design is made.
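A minimal sketch of such an all-or-nothing coordinator: each request names every resource it needs (data plus control) and is granted entirely or not at all. The greedy weight ordering here is an illustrative heuristic for this sketch, not the thesis' algorithm; the exact selection problem is the MWIS discussed below.

```python
def coordinate(requests):
    """Yes/No coordinator sketch: grant a request only if none of its
    resources conflicts with an already granted request.
    requests: list of (slice_id, set_of_resources, weight)."""
    granted, used = [], set()
    for slice_id, resources, weight in sorted(requests, key=lambda r: -r[2]):
        if used.isdisjoint(resources):   # all-or-nothing grant
            granted.append(slice_id)
            used |= resources
        # Otherwise the request is denied in full; the slice keeps full
        # control of its internal data/control resource split.
    return sorted(granted)

reqs = [(1, {0, 1, 2}, 5.0),   # e.g. LTE-like slice: control RB 0 + data RBs 1-2
        (2, {2, 3}, 4.0),      # conflicts with slice 1 on RB 2
        (3, {3, 4}, 3.0)]
print(coordinate(reqs))        # [1, 3]
```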

In Fig. 4.1 we show the system architecture focusing on the multi-slice scheduling parts.

In particular, we study the trade-off between the flexibility given to VOs, and the efficiency

of the coordinator decisions. We consider a scenario where the VOs have their computing

resources in the data center. These computing resources prepare a resource request. This

request includes the specification of the requested resource, e.g. frequency band or resource

block, the modulation and coding scheme (MCS) to be used and so on. By allowing the VOs

to choose the resources and their MCS, they are given sufficient control to provide the desired

service differentiation for their users. These resource requests are then received at a central

coordinator. In order to be able to accommodate requests from heterogeneous VOs, the decision

of the coordinator is constrained to a Yes/No decision. In other words, the coordinator chooses

only a subset of the requests without altering them. We show that the problem is equivalent

to a maximum weighted independent set problem (MWIS), a well-known problem in graph

theory and complexity theory [52]. Maximum independent set problems are both NP-hard [52]

and APX-hard [20], i.e. they are hard even to approximate. Next, we discuss two

special cases, where the resources being requested have to form a contiguous set, similar to the

constraint in the LTE uplink. We provide the optimum algorithms for these cases, for which

the complexity is either logarithmic or linear. Finally, we propose an efficient heuristic, which

is able to work on the general problem, and can provide very efficient solutions for the special


cases discussed.

4.3 Related Work

The literature on scheduling for wireless virtualization is growing rapidly. The NVS archi-

tecture proposed in [114] discussed two-level scheduling for WiMAX systems. In this case, all

the network slices are limited to use WiMAX, and while the same architecture can be applied to

other OFDMA systems such as LTE, the limitation on the homogeneity of the slices still stands.

In the LTE architecture proposed in [160] the infrastructure controller, called Hypervisor, has

direct access to the slice schedulers. In other words, the Hypervisor handles all the scheduling

and slices have no direct control over how their own users are scheduled. A stochastic game

framework was proposed in [46] based on the VCG mechanism. This is a utility maximization

where the slices are competing for the resources, but may get only a subset of the resources

they request. This is not our goal since, as explained before, the strong coupling between data

and control resources means a slice may be unable to deliver service with only part of its request. Opportunistic scheduling was proposed for

WiFi systems in [154] where different slices can be scheduled on the same channel subject to

a limit on the collision probability. However, no work has considered the case when both LTE

and WiFi are present in the network.

In summary, most of the existing works have focused on OFDM-based scheduling. However,

this assumption may not necessarily hold in future networks. For example, IEEE 802.15.4 has

already standardized the use of spread spectrum for wireless sensor networks due to its superior

performance in low power scenarios. Moreover, those works have not considered the coupling

between the requested resources and the fact that if the slice is not assigned all the requested

resources, it might not be able to utilize any of them. Another drawback of the existing work

is the centralized nature of the processing, where the baseband processing and scheduling are

performed within the same physical machine, typically a base-station FPGA. In other words, the

challenges associated with cloud-RANs are not discussed. Finally, extending these techniques

to coordinated multi-point (CoMP) [50] scenarios is not straightforward, as they are mainly

designed for the single transmitter case.


4.4 System Model

Consider a cloud-RAN system where a cloud data center forwards the I/Q signals through a

high-speed network to a set of RRHs. Let S be the set of resources, which may include frequency,

time, space or code resources. Let U be the set of VOs, also referred to as network slices. Each

VO i ∈ U prepares a request specifying the desired subset of resources. Let Si denote such a

request. Each request Si comes associated with a weight wi. This weight, as well as the size

of the request, is used by the coordinator to compare between the different requests. Different

scheduling weights have been discussed in the literature, see [42] for a survey. The scheduling

decisions discussed here are to replace the scheduling decisions in the current architectures,

and hence occur with the same frequency.

In this chapter we are not concerned with the design of the scheduling weights; we only assume

that the VOs and the coordinator agree on a specific weighting criterion. The decision made by the

coordinator is as follows:

Problem 4.1:

max_x  ∑_{i=1}^{|U|} x_i w_i

s.t.  S_i ∩ S_j = ∅  ∀ i ≠ j with x_i = x_j = 1

      x_i ∈ {0, 1}    (4.1)

where x is the decision made by the coordinator, xi = 1 means that the request Si has been

accepted, and zero otherwise. The first constraint says that all accepted requests have to be

non-conflicting, while the second is the binary constraint imposed upon the decision x. This

problem is about selecting a subset of non-conflicting requests of maximum weight.

Theorem 2. Problem 4.1 is NP-hard. Moreover, the problem is APX-hard.

Proof. The proof follows by showing that the problem is equivalent to a maximum weighted

independent set problem. Consider the graph G = {V, E}. Let V be equal to the set of VOs

U. Define E = {e_ij : e_ij = 1 ⟺ S_i ∩ S_j ≠ ∅}. Assign to each vertex v_i the weight w_i. Now we

have a graph with each vertex representing a request. Two vertices are connected with an edge

if and only if their corresponding requests are conflicting. Hence, the problem of choosing a


set of non-conflicting requests with maximum weight becomes the problem of selecting a set

of non-connected vertices of maximum weight, which is the maximum weighted independent

set problem for the graph G. From the properties of the MWIS problem [52], [20], the theorem

follows.
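To make the reduction in the proof concrete, the sketch below builds the conflict graph of Theorem 2 and solves Problem 4.1 by exhaustive search. The request sets and weights are our own illustrative choices, and the exponential enumeration is viable only for small instances; it is a reference point, not a practical scheduler.

```python
from itertools import combinations

def mwis_brute_force(requests, weights):
    """Solve Problem 4.1 exactly: choose the maximum-weight subset of
    pairwise non-conflicting requests, i.e. a maximum weighted
    independent set of the conflict graph. Exponential in |U|."""
    n = len(requests)
    # Conflict graph: edge (i, j) iff requests i and j share a resource.
    conflict = [[bool(requests[i] & requests[j]) for j in range(n)]
                for i in range(n)]
    best_set, best_w = set(), 0.0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if any(conflict[i][j] for i, j in combinations(subset, 2)):
                continue  # some pair of accepted requests conflicts
            w = sum(weights[i] for i in subset)
            if w > best_w:
                best_set, best_w = set(subset), w
    return best_set, best_w

# The request pattern of Fig. 4.4, with illustrative weights.
requests = [{0, 1, 2}, {0}, {2, 3}, {3}, {0, 1}, {1, 2}]
weights = [5.0, 2.0, 2.5, 1.0, 3.0, 4.0]
```

With these weights the accepted requests are s_1, s_3 and s_5: they are pairwise disjoint and together outweigh the single large request s_0.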

4.5 Scheduling Algorithms for VOs

In this section, we discuss two special cases for Problem (4.1) which have polynomial time

optimum solutions. Then we discuss how these two algorithms can be extended to solve problem

(4.1) in its general form.

4.5.1 Case 1

Due to the difficulty of problem (4.1), we discuss now how it can be efficiently solved in a special

case. Let I = {2^0, 2^1, ..., 2^i, 2^{i+1}, ..., 2^{log_2(|S|)}}, where |S| is assumed to be a power of two. We

impose the following set of conditions upon any request:

1. |S_i| ∈ I.

2. Let s_0, ..., s_{N−1} be the elements of S. Let S_i = {s_k, s_{k+1}, ..., s_m}. Then k mod |S_i| = 0, i.e. each request starts at a multiple of its own size.

The first condition says that the size of any request has to be a power of 2. The intuition

behind the second constraint is that these requests represent nodes of a binary tree.
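Under our reading of these two conditions (request sizes are powers of two, and each request starts at a multiple of its own size, so that every request coincides with a binary-tree node), admissibility of a request can be checked as follows; the helper name is ours.

```python
def is_case1_request(request, num_resources):
    """Check the Case 1 conditions: the request is a contiguous run of
    resource indices whose length is a power of two and whose start
    index is a multiple of that length (i.e. a binary-tree node)."""
    idx = sorted(request)
    size = len(idx)
    contiguous = idx == list(range(idx[0], idx[0] + size))
    power_of_two = size > 0 and (size & (size - 1)) == 0
    aligned = power_of_two and idx[0] % size == 0
    return contiguous and power_of_two and aligned and idx[-1] < num_resources
```

For example, with four resources, {0, 1} and {2, 3} are admissible while {1, 2} is not, since it starts at index 1 rather than at a multiple of its size 2.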

The algorithm for solving the case 1 problem is composed of three steps:

Building the Tree

The tree is constructed as follows:

(a) The tree is of height log_2(|S|), and the number of leaf nodes is |S|.

(b) The tree is composed of a set of nodes t_{l,j}, where 0 ≤ l ≤ log_2(|S|) is the index of the

current level of the tree and 0 ≤ j ≤ 2^l − 1 is the index of the node within level l.

(c) For each leaf node t_{log_2(|S|),j}, associate the resource S_{t_{log_2(|S|),j}} = {s_j}.


[Figure: binary tree over the resources {0,1,2,3}; internal nodes {0,1} (request w_0) and {2,3} (request w_5); leaves {0} (w_1), {1} (w_2, w_3), {2} (w_4), {3}.]

Figure 4.2: Example of Tree. In this example we have 4 resource blocks. The first line in each node represents the resources attached to it, while the second line is the requests matching with this node. The algorithm starts at the leaf nodes, and selects one request per node. Here it will have to choose between slice 2 and slice 3 for the second resource block. If w_3 > w_2, then slice 3 is chosen. In the second step, the requests from the sibling leaf nodes are joined together, so slice 1 and slice 3 are joined together. In the third step, we compare between w_0 and w_1 + w_3, for the {0, 1} node, and between w_4 and w_5 for the {2, 3} node.

(d) Starting from l = log_2(|S|), repeat until l − 1 = 0:

• For each pair of sibling nodes t_{l,j}, t_{l,j+1}, construct a new parent node t_{l−1, j/2}, and

associate with it the resources S_{t_{l−1, j/2}} = S_{t_{l,j}} ∪ S_{t_{l,j+1}}

An example of such a tree is shown in Fig. 4.2 and the corresponding conflict graph is shown

in Fig. 4.3.

Attaching the VOs

The procedure for attaching the VOs to the tree is straightforward. Each VO is attached

to the tree node that matches its resource request, i.e. a request i is attached to node (l, j) if

S_i = S_{t_{l,j}}. Define U^o_{l,j} as the set of direct requests for node (l, j), i.e.

U^o_{l,j} = {S_i | S_i = S_{t_{l,j}}, i ∈ U}    (4.2)


[Figure: conflict graph over the requests w_0, ..., w_5.]

Figure 4.3: Conflict graph for the example in Fig. 4.2

Select the Optimal Requests Subset

This is based on Algorithm 4.1. The algorithm starts at the leaf nodes, and selects one request

per node according to the maximum weight. The winning requests from sibling nodes are then

combined together into a single request. Next, the algorithm visits the parent node, where the

comparison is done between the requests associated to the node, plus the joint request from the

sibling child nodes. The process terminates at the root of the tree.

Algorithm 4.1 Binary Tree-based Scheduling Algorithm

1. Set l = log_2(|S|) and U_{l,j} = U^o_{l,j}.

2. Repeat until l = 0:

Select: For each node, select the request with the maximum weight.

U_{l,j} = arg max_{S_i ∈ U_{l,j}} w_i    (4.3)

Combine: Go one level up the tree; at each node, combine the winning requests from the child nodes and attach them to the parent node as a new request with weight equal to the sum of the winning weights.

U_{l,j} = U^o_{l,j} ∪ {U_{l+1,2j} ∪ U_{l+1,2j+1}}    (4.4)
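A minimal sketch of Algorithm 4.1, assuming every request already satisfies the Case 1 conditions and |S| is a power of two. The tree is kept implicit by indexing nodes as (level, position) pairs; visiting levels bottom-up interleaves the select and combine steps. All names are ours.

```python
import math

def binary_tree_schedule(requests, weights, num_resources):
    """Sketch of Algorithm 4.1. Each request is a set of contiguous
    resource indices covering exactly one binary-tree node. Returns
    the accepted request indices and their total weight."""
    L = int(math.log2(num_resources))  # num_resources is a power of two
    # best[(l, j)] = (weight, accepted indices) for the subtree rooted
    # at node j of level l; leaves are at level L, the root at level 0.
    best = {}
    for l in range(L, -1, -1):
        for j in range(2 ** l):
            size = num_resources // (2 ** l)
            start = j * size
            # Requests attached directly to this node (Eq. 4.2).
            direct = [i for i, r in enumerate(requests)
                      if min(r) == start and len(r) == size]
            # Combine step: join the winners of the two child nodes.
            if l < L:
                lw, lset = best[(l + 1, 2 * j)]
                rw, rset = best[(l + 1, 2 * j + 1)]
                candidates = [(lw + rw, lset | rset)]
            else:
                candidates = [(0.0, set())]
            # Select step: keep the heaviest candidate at this node.
            candidates += [(weights[i], {i}) for i in direct]
            best[(l, j)] = max(candidates, key=lambda c: c[0])
    total, accepted = best[(0, 0)]
    return accepted, total
```

On the Fig. 4.2 example, if w_3 > w_2 the leaf winners are slices 1 and 3, and the combined weight w_1 + w_3 is then compared against w_0 at the {0, 1} node, exactly as described in the caption.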

Theorem 3. Algorithm 4.1 finds the optimal solution for problem (4.1) given that the conditions

in 4.5.1 are satisfied.


Proof. Let C∗ be the selected subset of resource requests output from Algorithm 4.1 with overall

weight w∗. Suppose another subset C is to have an overall weight w > w∗. Without loss of

generality, suppose C = C∗ \ {S_k} ∪ {S_l}. We consider two cases:

1. |S_k| = |S_l|: in this case a tree node has chosen S_k instead of S_l. Hence, w = w∗ − w_k +

w_l. However, from the definition of the select step in Algorithm 4.1, w_k > w_l, hence

−w_k + w_l < 0 and w∗ > w, i.e. a contradiction.

2. S_k ⊂ S_l: in this case, the rejection of S_k happened at one of its parent nodes. However,

from the definition of both the select and combine steps in Algorithm 4.1, either w > w∗

leads to a contradiction or C is an infeasible solution.

Since the number of steps of the algorithm is equal to the depth of the tree, the complexity of

the algorithm is O(log2(|S|)).

4.5.2 Case 2

The logarithmic complexity of the above algorithm makes it an attractive way to address the

difficulty of problem 1. However, the constraints imposed upon the resource requests may

result in degradation of performance compared to the general case, or may be too tight for

some applications. In this part, we discuss another special case, which is also shown to have a

polynomial time optimal solution with less constraints than the previous case. In our current

case, the only constraint we require is:

• Si is contiguous for all i ∈ U .

Such a class of graphs, where the vertices can be mapped into a set of intervals on the real line, is

known as Interval Graphs. An example of such class of graphs is shown in Fig. 4.4. The MWIS

problem for interval graphs has been studied in [62], where an optimal algorithm of complexity O(|S|) was

proposed. This is the same algorithm we discussed in Section 3.1.
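The algorithm of [62] is not reproduced here; as one concrete way to solve this special case, the sketch below uses the standard weighted-interval-scheduling dynamic program, which is linear after sorting the requests by their right endpoints. The representation (inclusive (start, end) pairs) and all names are ours.

```python
import bisect

def interval_mwis(intervals, weights):
    """Max-weight set of pairwise disjoint intervals (weighted interval
    scheduling). Each interval is (start, end) over resource indices,
    inclusive; two requests conflict iff their intervals overlap."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][1])
    ends = [intervals[i][1] for i in order]
    # best[k] = optimum weight using the first k intervals (by end).
    best = [0.0]
    choice = []
    for k, i in enumerate(order):
        start, _ = intervals[i]
        # Number of earlier intervals ending strictly before `start`.
        p = bisect.bisect_left(ends, start, 0, k)
        take = best[p] + weights[i]
        skip = best[k]
        best.append(max(take, skip))
        choice.append(take > skip)
    # Backtrack to recover the accepted request indices.
    accepted, k = set(), len(order)
    while k > 0:
        i = order[k - 1]
        if choice[k - 1]:
            accepted.add(i)
            k = bisect.bisect_left(ends, intervals[i][0], 0, k - 1)
        else:
            k -= 1
    return accepted, best[-1]
```

Applied to the contiguous pattern of Fig. 4.4 with the same illustrative weights as before, it again accepts s_1, s_3 and s_5.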

4.5.3 Applications of Case 1 and 2

The main condition imposed upon the scheduling request in both case 1 and 2 is that the

requested resources have to be contiguous. This condition is already present in the LTE uplink,


[Figure: conflict graph over the requests w_0, ..., w_5 for the pattern below.]

Figure 4.4: Request pattern: s_0 = {0, 1, 2}, s_1 = {0}, s_2 = {2, 3}, s_3 = {3}, s_4 = {0, 1}, s_5 = {1, 2}

where single carrier frequency division multiple access (SC-FDMA) is used. These cases can

also be applied to a TDMA system, where each VO gets a consecutive set of time slots to

transmit in.

However, a wide range of application scenarios can not be handled by these cases. An

important example is the LTE downlink, as resources do not need to be contiguous in OFDMA

systems, unlike SC-FDMA. Secondly, allocating codebook entries in a MU-MIMO system also

can not be done while assuming contiguous resources. These examples are explained in the

time-frequency grids in Figures 4.5, 4.6 and 4.7. In Fig. 4.5 we show a request pattern for

case 1. There are constraints on the request location, size and contiguity. In Fig. 4.6 we

show a request pattern for case 2. We have imposed a constraint only on the contiguity of

the requested resources. We can already see that case 2 will result in improved performance

compared to case 1 as one more block is assigned. A general request pattern is shown in Fig.

4.7. We have introduced two grids to refer either to a multi-cell scenario, or a multi-antenna

scenario with codebook entries, or both. In both grids, there are no constraints on the requested

resources. The request pattern in Fig. 4.7 can not be handled by the algorithms in 4.5.1 and

4.5.2. An example of the conflict graph for such a case is shown in Fig. 4.11.


[Figure: 4×4 time-frequency grid of resource blocks RB_11–RB_44 assigned to VO1, VO2 and VO3, with two blocks unassigned.]

Figure 4.5: Requests in case 1

[Figure: 4×4 time-frequency grid of resource blocks assigned to VO1, VO2 and VO3, with one block unassigned.]

Figure 4.6: Requests in case 2


[Figure: two 4×4 time-frequency grids of resource blocks assigned to VO1, VO2 and VO3 with no contiguity constraints.]

Figure 4.7: Requests in the general case

4.5.4 Intuition Behind Case 1 and Case 2

While the MWIS is known to be NP-hard, we have discussed two cases that have optimum

polynomial time solutions. Our goal at this point is to delve deeper into the properties of these

two cases that enable finding their optimum solutions. In Fig. 4.8 we show what we call the

binary tree unit. This conflict graph has three nodes {w0, w1, w2} that together form a binary

tree. There are two independent sets within this graph, {w1, w2} and {w0}. The key point

here is to identify that w0 connects w1 with w2, i.e. not only is w1 not in conflict with w2, but

they both benefit from eliminating the w0 node as it is in conflict with both of them. In other

words, w1 supports w2. Hence, the key point in the binary tree algorithm is to not make an

elimination decision until the supporting set has been fully formed.

In Fig. 4.9 we show the unit for the conflict graph corresponding to the interval graph case.

Note that this is still a tree, but not a binary one. Instead, it has the V-shaped architecture

where w2 connects w0 and w3. The fact that the tree is not binary anymore and the presence

of the V-shaped architecture makes the decision more difficult. However, the same idea stands,

before eliminating any node, we need to find its supporting set first and make the decision

based on the combined weight of the full support set. The algorithm achieves this through a

scan of all the intervals from beginning to end to form these supporting sets, hence the linear

complexity.

[Figure: three-node conflict graph with W_0 connected to both W_1 and W_2.]

Figure 4.8: Binary Tree Unit

In Fig. 4.10 we show the conflict graph corresponding to the general case with no special architecture; instead, the conflict graph is no longer a tree and has cycles. The presence of cycles

makes it very hard to find the supporting set, as this itself is just a smaller MWIS problem. The

special architecture of the graphs in the first two cases made it possible to find these supporting

sets through a recursive solution. However, this approach can not be applied to the general

case. Our goal in the next section is to build upon the intuition from the previous two cases

and provide a heuristic for the general case. This heuristic should be based on the operation

principle used in the first two algorithms, but can be applied for a general graph. Hence, if

applied to these two special cases, the heuristic should provide satisfactory performance with

respect to the corresponding optimum algorithm, while still being applicable to the general

case.

4.6 General Heuristic

4.6.1 Intuition

Our goal now is to come up with a technique that can tackle Problem 4.1 in the general case.

Since the MWIS is APX-hard and known to be one of the hardest problems in complexity


[Figure: conflict graph over W_0–W_4 and the corresponding intervals on the real line.]

Figure 4.9: Interval Graph Unit and the corresponding intervals

[Figure: four-node conflict graph over W_0–W_3 containing a cycle.]

Figure 4.10: General Graph Unit


theory [52],[20], the criteria we follow to measure the performance of any algorithm is how it

performs in comparison to the optimum algorithms in their special graphs. The way we design

our heuristic is by looking again at the previous cases and trying to gain insights into how and

why these cases had polynomial time optimum solutions.

The main insight we get from these algorithms is that a set of non-conflicting nodes each

with a small weight should be chosen over a single node with a large weight, if this large node

happens to be conflicting with that set of nodes. Note that this idea is in contrast with the

greedy approach, which usually starts by selecting the node with the largest weight. Hence, it

is necessary for each node to find an approximation for the supporting set of non-conflicting

nodes that support it against the set of conflicting nodes. Note that this is also why the

problem is so hard in the general case, as finding such sets is itself an MWIS, just on a smaller

set. In the special cases we discussed, the key idea was to be able to solve these smaller

MWIS efficiently by imposing some constraints on the graph, hence a recursive algorithm with

polynomial complexity was possible.

4.6.2 Operation

The operation of the algorithm is summarized in Algorithm 4.2. The max(·) can be either a softmax or a hardmax. We use the softmax in the proof of convergence. The algorithm can be seen as equipping each node with a neuron. The neuron's activation function is w^0_i (1/2 + (1/2) tanh(·)), where w^0_i is the initial weight associated with the request. At convergence, the output from the neuron is either zero or w^0_i. The input to the neuron is composed of:

1. Positive Input: the current weight of the node, plus the sum of the maximum weights

of each set of conflicting support nodes. In other words, for each set of supporting nodes

that are in conflict, we select the one with the maximum weight.

2. Negative Input: the maximum weight among all nodes in conflict with the current node.

The weight for each such node is found in a way similar to the positive input, where the

node’s weight is added together with the sum of the maximum supporting sets.

Theorem 4. If w_i < 2/3 ∀ i ∈ V, then Algorithm 4.2 converges to a unique fixed point.


Algorithm 4.2 General Neuro-Optimization Heuristic

1. Initialize each node weight to w^0_i = w_i.

2. Update the node weights according to the following equation

w^{t+1}_i = w^0_i (1/2 + (1/2) tanh(w^t_i + p^t_i − max_{j∈V_i} {w^t_j + p^t_j}))    (4.5)

where p^t_i = ∑_{j∈V_{−i}} (max_{k∈V_j ∩ V_{−i}} w^t_k)

3. Repeat until max_i Δw_i ≤ ε

4. MWIS: U_max = {i ∈ U | |w^0_i − w^{t_end}_i| < |w^{t_end}_i − 0|}
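A sketch of Algorithm 4.2 with a hard max in place of the softmax. The supporting term p_i here follows the verbal description of the positive input: for each node j in conflict with i, we add the largest weight among nodes that conflict with j but not with i. This reading of V_j ∩ V_{−i}, the zero default for conflict-free nodes, and all names are our own; the code is illustrative rather than a definitive implementation.

```python
import math

def neuro_mwis(adj, w0, eps=1e-9, max_iters=10000):
    """Sketch of Algorithm 4.2. adj[i] is the set of nodes in conflict
    with node i; w0 holds the initial request weights (the theorem's
    convergence condition asks for w0[i] < 2/3)."""
    n = len(w0)
    w = list(w0)
    for _ in range(max_iters):
        # Supporting term p_i: per conflicting neighbour j, the best
        # node that conflicts with j but not with i (our reading).
        p = []
        for i in range(n):
            support = 0.0
            for j in adj[i]:
                rivals = [w[k] for k in adj[j]
                          if k != i and k not in adj[i]]
                if rivals:
                    support += max(rivals)
            p.append(support)
        # Synchronous update of Eq. (4.5) with a hard max; a node with
        # no conflicts sees zero inhibition and is driven toward w0[i].
        new_w = [
            w0[i] * (0.5 + 0.5 * math.tanh(
                w[i] + p[i]
                - max((w[j] + p[j] for j in adj[i]), default=0.0)))
            for i in range(n)
        ]
        done = max(abs(a - b) for a, b in zip(new_w, w)) <= eps
        w = new_w
        if done:
            break
    # Step 4: select nodes whose weight settled nearer w0[i] than zero.
    return {i for i in range(n) if abs(w0[i] - w[i]) < abs(w[i])}
```

On the binary tree unit of Fig. 4.8, with w_0 = 0.5 against w_1 = 0.35 and w_2 = 0.3, nodes 1 and 2 support each other against node 0 and the heuristic keeps the pair, even though node 0 has the single largest weight.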

4.6.3 Proof of Convergence

We express the equation for Algorithm 4.2 as a mapping as follows

x_i = w^0_i (1/2 + (1/2) tanh(x_i + y_i − log ∑_{j∈V_i} e^{x_j + y_j}))    (4.6)

where y_i = ∑_{j∈V_{−i}} log(∑_{k∈V_j ∩ V_{−i}} e^{x_k}).

The proof of convergence is based on showing that x = f(x) in Eq. (4.6) is a contraction mapping with a unique fixed point. From Lemma 2 in [97], the mapping f : R^{|U|} → R^{|U|} converges at least linearly to a unique fixed point if

sup_{x ∈ R^{|U|}} ||f′(x)|| < 1    (4.7)

Different norms lead to different bounds; the 2/3 choice used in the theorem follows from the ∞-norm. In this case the condition

||f′(x)||_∞ < 1    (4.8)

becomes

max_{i=1:|U|} ∑_{j=1}^{|U|} |f′(x)_{i,j}| < 1    (4.9)


[Figure: conflict graph over the requests w_0, ..., w_5 for the pattern below.]

Figure 4.11: Request pattern: s_0 = {0, 1, 3}, s_1 = {0}, s_2 = {1, 3}, s_3 = {3}, s_4 = {0, 2}, s_5 = {1, 2}

Or equivalently

∑_{j=1}^{|U|} |f′(x)_{i,j}| < 1  ∀ i = 1 : |U|    (4.10)

Let g(x) = x_i + y_i − log(∑_{j∈V_i} e^{x_j + y_j}). It can be shown that

||f′(x)||_i = ∑_{j=1}^{|U|} |f′(x)_{i,j}|

 ≤ (w^0_i / 2) [1 − tanh²(g(x))] (1 + ∑_{k∈V_j ∩ V_{−i}} e^{x_k} / ∑_{k∈V_j ∩ V_{−i}} e^{x_k} + ∑_{j∈V_i} e^{x_j + y_j} / ∑_{j∈V_i} e^{x_j + y_j})

 ≤ w^0_i [1 − tanh²(g(x))] (3/2)

 ≤ (3/2) w^0_i    (4.11)

Hence, for w_i < 2/3 ∀ i ∈ V, the mapping is a contraction mapping and convergence is proved. It

is worth noting that this bound is conservative, as our simulations exhibited

convergence without needing to impose this condition on the initial weights.

4.6.4 Neuro-Optimization

A class of optimization techniques related to neural networks is known as neuro-optimization.

The most famous is the Hopfield network, introduced in 1982 [60]. Hopfield

networks are fully-connected neural networks with binary or approximately binary, i.e. sigmoid,


activation functions. Hopfield networks were used to solve NP-hard combinatorial problems,

such as the traveling salesman problem. The main drawback of Hopfield networks is the large

number of nodes, O(|U|²), needed when solving such combinatorial problems. Our scheme does

not suffer from this problem as we use only O(|U|) nodes.

4.7 Simulation Results

In this section we present our results regarding the performance of the proposed heuristic in

comparison to the discussed optimum algorithms. The scenario is as explained in the chap-

ter, where the VOs prepare requests for resources and submit them to the coordinator. The

coordinator compares the requests based on the size and weight of each request, and selects a

non-conflicting subset of maximum weight. We vary the number of VOs requests, also called

flows, from 5 to 30. The scheduling weight we use is the channel power gain, where we use

samples from 3GPP LTE channels. Each VO picks the best subset of resources based on the

channel and with randomly chosen sizes. We also compare the proposed algorithms with linear

programming (LP). However, the complexity of the MWIS problem makes the LP solution

highly inefficient.

The first two algorithms discussed are provably optimal for the two special cases. The

main advantage of the third heuristic is its applicability for the general case. The goal of the

simulation results provided here is to study its performance in comparison with the optimum

algorithm for each case.

In Fig. 4.12 we compare the performance of Algorithm 4.2 with the proposed optimal

algorithm for case 1, i.e. Algorithm 4.1. By throughput, we mean the sum of the channel power

gains for the selected subset of requests, which is also the weight of the selected independent set.

The figure shows that the proposed heuristic is within just 2.5% of the optimum algorithm.

The loss in throughput is shown in Fig. 4.14.

A similar observation can be seen in Fig. 4.13 for the interval graph case, case 2. However,

the loss in performance here is larger, due to the increased complexity of the optimum algorithm.

The proposed heuristic is within 6% of the optimum throughput, as shown in Fig. 4.15.


[Figure: throughput vs. number of flows (5–30) for the Optimal, Proposed Heuristic and Linear Program schemes.]

Figure 4.12: Performance of the proposed algorithms for case 1

[Figure: throughput vs. number of flows (5–30) for the Optimal, Proposed Heuristic and Linear Program schemes.]

Figure 4.13: Performance of the general algorithm for case 2


[Figure: percentage of throughput loss vs. number of flows (5–30) for the heuristic algorithm.]

Figure 4.14: Percentage performance loss for case 1

[Figure: percentage of throughput loss vs. number of flows (5–30) for the heuristic algorithm.]

Figure 4.15: Percentage performance loss for case 2


4.8 Conclusion

In this chapter we have studied the scheduling of multiple VOs in a cloud-RAN environment.

We modeled the case when the VOs employ heterogeneous communication protocols. We have

shown that the coordination problem in such a case is in general NP-hard. We then proceeded

by specifying two special cases and provided the optimum algorithm for each case. Finally,

we proposed a novel neuro-computation heuristic, which is able to handle the general problem

but still provide close-to-optimum results for the special cases studied. The simulation results

confirm the effectiveness of the proposed heuristic and help learn more about the operation of

scheduling in cloud-RAN networks. It is worth noting here that such an approach is not the only

way to coordinate multiple operators on the same infrastructure, and not necessarily the optimum one.

Another possible approach is for the IO to offer a set of traffic streams which are then utilized by

the VOs. We see this mainly as a trade-off: traffic streams offer better utilization,

while forcing all VOs to use the same PHY/MAC technologies. The approach we studied here

was mainly focused on the case where the VOs are heterogeneous. A similar trade-off has also

been seen in the cloud computing field, between virtual machines and containers. While virtual

machines give more flexibility such as choosing different operating systems, they result in lower

utilization compared to containers, which are, on the other hand, much less flexible. There is

no right or wrong approach here, and the choice should be made per scenario.

Part III

Cloud Computing Challenges


In the second part of the thesis, we study a set of challenges brought forward by the cloud

computing model itself. The cloud computing model has led to the split of the base band

processing into a user process and a cell process. We study how distributed scheduling can be

used to handle the excessive communication between the two. Next, we study the problem of

resource elasticity and dynamic scaling, for both access and cloud computing resources. For

the access network, we study joint activation, clustering and association of RRHs in a way that

balances energy efficiency with the end-users QoS. For cloud computing, we study the joint

anomaly detection and auto-scaling of the computing resources.

Chapter 5

Fully Distributed Scheduling in

Cloud-RAN Systems

5.1 Context

Cloud Radio Access Networks (C-RAN) promise to leverage cloud computing capabilities for

enhancing the quality and coverage of next generation 5G networks. 5G networks shall witness

an increasing density of users and access points, very small latencies, more bandwidth resources,

and the use of virtualized hardware for baseband processing. Within such an environment, the

problem of scheduling the network users across the radio resources might become a bottleneck

of the system. The cloud computing model has led to the split of the base band processes

into two processes: user process and cell process. However, a new challenge arises due to the

excessive communication needed between the two. In this chapter, we study the design and

performance of distributed schedulers in C-RAN systems. The idea is that each user’s base

band processing unit (BBU) tries to guess whether its user should be scheduled or not. First,

we focus on the case of maximum throughput scheduling and Rayleigh channels, and provide

closed-form expressions for the expected effective channel and signal-to-noise ratio (SNR) in

the distributed scenario. In order to deal with general channels and schedulers, we adopt the

classification techniques from machine learning. We discover an interesting relationship between

the fairness of the scheduler, and its ability to be distributed. In particular, schedulers which


Chapter 5. Fully Distributed Scheduling in Cloud-RAN Systems 84

are more fair are also more prone to prediction errors in the distributed scenario. Finally, we

provide simulation results showing that distributed scheduling can provide up to 92% of the

performance of the centralized case.

5.2 Introduction

The concept of cloud-RANs is closely related to that of software-defined networking (SDN) [75]

and network virtualization [32]. Overall, one of the main goals of these technologies is to be

able to support distributed computing capabilities, in the form of data-center servers, as well as

sharing the physical infrastructure between different network operators. The cloud computing

design principles have led to the concepts of distinct user and cell processes within the cloud

RAN architecture. However, such a design has to be considered from the wireless network

perspective. In particular, a central MAC-layer scheduler is needed to coordinate the resource

allocation between the distributed processing units.

One of the main challenges in porting wireless networks to the cloud is the low latency

required in wireless transmission. For example, in LTE a frame has to be sent every millisecond

[119]. Preparing an LTE frame is a computationally expensive process. LTE has adopted two

principles that contribute greatly to its high data rates: channel-selective scheduling and adap-

tive modulation. Scheduling involves selecting a subset of users with relatively good channel

conditions for transmission. Adaptive modulation and coding means choosing the best mod-

ulation and coding scheme for the selected users based on their channel state as well as the

target bit error rate (BER). Revisiting the cloud-RAN architecture, we see that the cell process

is responsible for scheduling the users, while the user process is responsible for the modulation

and coding part. Generally, communication is needed between these two components to come

up with a decision that is as good as the legacy case.

Due to the increasing number of users and RRHs, larger bandwidth and higher PHY layer

complexity, the scheduler is expected to become a heavy performance bottleneck. First, the

needed communication between the processing units and the scheduler is massive. For example,

every time slot, which is 1 ms for LTE and even less is predicted for 5G [17], all users need to

send their information, including channel state information (CSI), to the scheduler. Second,


even after such information is received, finding the best subset of users is a computationally

expensive task of quadratic complexity [42].

In Fig. 5.1 we show the system architecture as discussed in Chapter 1 focusing on the

scheduling part. In this framework, the base band processing is divided into two parts: user

processing and cell processing. The user process handles all the processing for a single user

such as modulation and coding, while the cell process handles the cell-wide processing such

as the scheduling and the IFFT. Focusing on these two aspects of the cell processing, we can

already see the heavy traffic load between the cell process and the user process. For the IFFT,

the user processes send their I/Q signals for final processing and forwarding to the RRH. For

the scheduler, each user process negotiates with the cell process in order to find whether its

respective user has been selected for transmission within the next time slot. However, this

decision typically involves information such as the CSI of the user and its queue occupancy,

information that is only available to the user process. Hence, the scheduling process already

requires at least two-way communication: first, the user process forwards its user's information to the scheduler, which then replies with the scheduling decision. Additional communication might also be needed as more complex scheduler features are considered, such as the contiguous-band requirement in the LTE uplink. These challenges raise an important question: what happens if we remove the central scheduler altogether and make the process completely distributed?

In summary, within the centralized framework, the BBUs communicate with the central

scheduler for the final scheduling decision. This process might involve an initial request, plus

some further negotiations depending on the degree of conflict between the requests from the

different BBUs. All this communication must be finished in less than one transmission time

in order to leave time for the PHY-layer processing, making it a very demanding process. Within the distributed architecture, this extensive communication is non-existent, and the goal of this chapter is to understand how much performance we can retain.

In this chapter, we study fully distributed scheduling in Cloud-RAN systems. We model

the distributed scheduling within a C-RAN system and discuss the differences between C-RAN

systems and the existing approaches. Next, we study the maximum throughput scheduling

in Rayleigh channels. We equip each BBU with a predetermined threshold. A BBU would


[Figure 5.1: Cloud-RAN Architecture - Distributed Scheduling. The diagram shows end-users served by remote radio heads over the fronthaul network. Within the cloud computing resources, per-user baseband processing (coding, scrambling, and modulation of the binary input bits, with lookup tables for faster processing and switch abstraction) exchanges scheduling requests/grants with the cell process (cell-wide processing and the scheduler) over the cloud network fabric, and the resulting I/Q signals are forwarded to the RRHs.]


schedule its user if and only if its channel power gain exceeds the threshold. We provide closed-form expressions for the expected channel gain and SNR as functions of the threshold.

This expression can be maximized to get the optimum threshold value. We then study other

scheduling frameworks, such as proportional fairness and mean-variance maximization. In

these general cases, we use classification techniques such as support vector machines (SVMs) and decision trees to learn the centralized scheduling decisions. Interestingly, we find that schedulers that are more fair are also harder to classify and predict. In general, our simulation

results show that up to 92% of the centralized performance can be obtained from the fully

distributed case.

5.3 Related Work

Scheduling for virtual wireless networks has recently received significant attention in the literature. The NVS prototype developed by NEC Labs proposed to use a two-level hierarchical

scheduling scheme [114]. In the first level, a virtual operator (VO) is selected, and in the second

level, a flow belonging to the selected VO is chosen for transmission based on some parameters

in the service level agreement (SLA) between the VO and the infrastructure owner (IO). A

stochastic game framework was discussed in [46], where an auction determines which VOs get

which resources. The team at University of Bremen has proposed a Hypervisor-like architecture

for LTE virtualization [160], and discussed the scheduling framework within it. Opportunistic

techniques for spectrum sharing among VOs were discussed in [154]. A related line of effort concerns the extension of the network embedding problem into the wireless domain. [155] is an example of these efforts, where an online scheduling algorithm was proposed based on

Karnaugh maps. However, these works have focused on scheduling multiple VOs on the same

infrastructure, and have not considered the new aspects of the problem related to cloud com-

puting. Specifically, the distributed computation model of the cloud has not been considered

in the current literature. For example, the separation of the user process and the cell process,

the extensive communication needed between the two, and the need for, potential of, and design of distributed approaches to the scheduler are all absent from the current literature on

cloud-RAN scheduling.


By transforming the scheduling problem into a distributed decision one, it becomes closer to

a random access problem. The two main approaches in random access are the ALOHA schemes

and the carrier-sensing schemes, known as CSMA [21]. The classical ALOHA approaches have

not considered the effect of the CSI on the system’s performance. The CSI has been integrated

into the system model in more recent works such as [100][64][118][65][151][93]. In [100] and

[151], the problem has been studied assuming successive interference cancellation (SIC) at the

receiver, i.e. collision does not lead to packet loss and multiple users can transmit at the

same time. The authors of [93] have designed a contention resolution scheme where each time

slot is divided into a contention slot and a transmission slot, and the CSI controls the access

probability within the contention slot. Hu et al. [64] have designed a distributed random access

policy using sub-gradient methods to maximize proportional fairness. We extend the ideas of

these works to the cloud-RAN framework.

One of the main differentiators between the cloud-RAN framework and the existing ALOHA

schemes is that a collision in cloud-RAN is not a physical collision due to electromagnetic wave

interference at the receiver as in the case of ALOHA. Instead, it is a logical collision at the cell

process. The assumption that we make is that the cell process keeps a buffer for every resource

block. Whenever a user process decides that its user should transmit on a specific resource

block, it prepares the I/Q signal and forwards it to the corresponding buffer in the cell process.

Either the buffer has enough capacity to store the data for one user only, or the cell process

can simply pick the first signal received. In either case, the existence of the buffer and the fact

that this is a logical rather than physical collision guarantees that the resource block is utilized

if at least one user requested it. The only performance loss will occur if the user selected by

the cell process is not the best one (the cell process is not supposed to decide which user is

picked as we consider fully distributed architectures) or if no user requests the resource block.

These two reasons for performance loss will be considered in our performance analysis later in

the chapter. This aspect of the problem in terms of the different nature of the collision and the

new elements of the cloud-RAN architecture is the main reason why the performance is much

higher compared to the ALOHA case, as will be shown.
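The logical-collision model above can be sketched with a short Monte Carlo simulation. The user count, threshold, and slot count below are illustrative assumptions, and the channel gains follow the Rayleigh density used later in the chapter:

```python
import math
import random

def simulate(num_users=20, gamma=1.5, slots=10000, seed=0):
    """Monte Carlo sketch of the logical-collision model: each user process
    requests the resource iff its Rayleigh channel gain exceeds the
    threshold gamma; the cell process then picks one requester at random.
    Returns (fraction of idle slots, fraction of slots where the random
    pick is not the best requester)."""
    rng = random.Random(seed)
    idle = suboptimal = 0
    for _ in range(slots):
        # Rayleigh gains via inverse-CDF sampling: g = sqrt(-2 ln U)
        gains = [math.sqrt(-2.0 * math.log(1.0 - rng.random()))
                 for _ in range(num_users)]
        requesters = [g for g in gains if g > gamma]
        if not requesters:
            idle += 1                      # resource block goes unused
        elif rng.choice(requesters) < max(gains):
            suboptimal += 1                # a less deserving user was picked
    return idle / slots, suboptimal / slots
```

Raising gamma trades fewer suboptimal picks for more idle slots; this is exactly the trade-off that the rest of the chapter optimizes.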


5.4 System Model

5.4.1 System Architecture

Consider a cloud-RAN system where a set of virtual machines (VMs) hosts a set of baseband processors (BBUs). The BBUs forward the I/Q signals through a high-speed network to the

RRHs, where they are transmitted through the air interface to the users’ terminals. In systems

with deterministic access, such as 3G and 4G, a scheduler is responsible for choosing a subset

of users for transmission at each time slot. Within the considered framework, the BBUs are

responsible for calculating the scheduling weights, e.g. channel gain for maximum throughput

and channel gain divided by aggregate throughput for proportional fairness. These weights are

transmitted from the VM where the BBU is located to the VM where the scheduler function is

running. Once the data is received from all BBUs, the scheduling decision is made. Typically,

this is a process of complexity O(|U||S|), where |U| is the number of users and |S| is the number

of resources. Note that this process is repeated every transmission slot, e.g. 1 ms for LTE. We

focus in this chapter on the single-cell scenario, where the scheduling decision is localized at each cell. Extensions to multi-cell coordinated scheduling are left for future work.

In order to avoid the scheduler becoming a bottleneck for the system, we study in this

chapter what happens if we make the process fully distributed. The architecture we assume

is as follows: the BBUs are still responsible for calculating the scheduling weights as before.

However, there is no central scheduler, each BBU should be able to find out on its own, with

no coordination with the other BBUs, whether its respective user should be scheduled. Finally,

the BBUs prepare the pre-filtered signals, which are sent to a central unit for final filtering

and forwarding to the RRH [53]. We assume that if two or more users are scheduled for the

same resources, one of them is chosen at random before the final filtering process. While this

complete lack of coordination might be a pessimistic assumption, understanding the behavior

in such a case can serve as a lower bound on the performance of other scenarios.

The question then becomes: how can a BBU know whether its user should be scheduled? If the criterion is too conservative, only a small number of BBUs will schedule their users, resulting in underutilized resources. On the other hand, if the criterion is too permissive, many BBUs

will schedule their users, and when one of these users is chosen at random, it might result in


choosing a less deserving user. Finding the optimal trade-off between these two extremes is the

main optimization problem of this chapter.

5.4.2 Model

Consider a cloud-RAN system where a cloud data center forwards the I/Q signals through a

high-speed network to a set of RRHs. Let S be the set of resources. Let U be the set of

distributed BBUs.

Each BBU u ∈ U prepares a request specifying the desired subset of resources. Let S_u denote such a request. We assume that each request S_u is composed of the weights w_{s,u} for every resource s ∈ S. This weight is used by the central scheduler to compare the different

requests. Different scheduling weights have been discussed in the literature, see [42] for a survey.

In the current work we are not concerned with the design of the scheduling weight; rather, we assume that the BBUs and the scheduler agree upon a specific criterion for determining the

weights. The decision made by the central scheduler is:

\[
u_s = \arg\max_{u \in \mathcal{U}} w_{s,u} \tag{5.1}
\]

where us is the user selected for resource s.
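As a concrete illustration of the rule in Eq. (5.1), a minimal centralized selection over a weight table might look as follows; the dictionary-based interface is our own illustrative choice:

```python
def centralized_schedule(weights):
    """Per-resource user selection of Eq. (5.1): for every resource s,
    pick the user u with the largest scheduling weight w_{s,u}.

    `weights` maps (user, resource) -> weight; returns resource -> user.
    One pass over all (user, resource) pairs, i.e. O(|U||S|) work."""
    selection = {}
    for (u, s), w in weights.items():
        if s not in selection or w > weights[(selection[s], s)]:
            selection[s] = u
    return selection
```

This single pass makes the O(|U||S|) cost mentioned above explicit: every transmission slot, the central scheduler touches one weight per user per resource.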

5.5 Distributed Scheduling

5.5.1 Maximum Throughput Rayleigh Channels

In this section, we consider the problem of distributing the maximum throughput scheduler

for users whose channels follow a Rayleigh distribution. We adopt a threshold-based decision criterion for the BBUs. Each BBU observes the channel vector of its user, compares it with a

predefined threshold value, and based on that decides whether to select the user for transmission.

If two or more users are selected for the same resource, one is chosen at random. For the

maximum throughput scheduling, the user with the highest SNR value for each resource is selected.

In the following, we denote by |h|s,u the channel gain for user u at resource s, while |h|s denotes

the channel gain for resource s, i.e. the channel of the selected user. Note that |h|s = 0 if none


of the users is selected.¹

Theorem 5. Let γ denote the threshold value. The expected channel gain |h|_s per resource s is given by

\[
\mathbb{E}(|h|_s) = \left[\gamma + \sqrt{\frac{\pi}{2}}\, e^{\gamma^2/2}\, \operatorname{erfc}\!\left(\frac{\gamma}{\sqrt{2}}\right)\right]\left[1 - \left(1 - e^{-\gamma^2/2}\right)^{N}\right] \tag{5.2}
\]

and the expected SNR is

\[
\mathbb{E}(|h|_s^2) = \left[\gamma^2 + 2\right]\left[1 - \left(1 - e^{-\gamma^2/2}\right)^{N}\right] \tag{5.3}
\]

Proof. We assume Rayleigh channels, hence

\[
|h|_{s,u} \sim f_H(x) = x e^{-x^2/2} \tag{5.4}
\]

First consider when none of the users are selected, in which case the instantaneous channel gain

would be zero, i.e.

\[
\mathbb{P}(|h|_s = 0) = \left[1 - \mathbb{P}(X > \gamma)\right]^{N} \tag{5.5}
\]

Otherwise, out of the users who submit their request, one is chosen at random. In such a case,

the channel distribution is a mixture distribution.

\[
f_{H_s|K}(x \mid k) = \sum_{u=1}^{k} f_{H_{s,u}}(x \mid x \ge \gamma)\, \mathbb{P}(u_s = u)
\overset{a}{=} \sum_{u=1}^{k} \frac{f_{H_{s,u}}(x \mid x \ge \gamma)}{k}
\overset{b}{=} f_{H_{s,u}}(x \mid x \ge \gamma) \tag{5.6}
\]

where in (a) we assume a uniform distribution for selecting a user at random, (b) follows from \(|h|_{s,u}\) being i.i.d. \(\forall u \in \mathcal{U}\), and K is a random variable for the number of BBUs that have scheduled their users for the specific resource s.

¹SNR is defined as SNR = |h|²P/σ²_n. We adopt a normalized SNR where P = σ²_n = 1. We also assume the channel follows an i.i.d. Rayleigh distribution for all users. This assumption can be justified in the presence of a power control loop which accounts for the path-loss effect, leaving only the fast fading as identically distributed. Power control loops are employed in some wireless systems such as the LTE uplink [119] and CDMA [111]. The noise power is just a normalization constant, while P can change value based on the power control loop, which is slow with respect to the scheduling. In either case, this normalization does not affect the results of the chapter.

\[
\begin{aligned}
f_{H_s}(x) &= \sum_{k=1}^{N} f_{H_s|K}(x \mid k)\, \mathbb{P}(K = k) \\
&= \sum_{k=1}^{N} f_{H_s|K}(x \mid k) \binom{N}{k} \left[\mathbb{P}(X > \gamma)\right]^{k} \left[1 - \mathbb{P}(X > \gamma)\right]^{N-k} \\
&\overset{c}{=} f_{H_s|K}(x \mid k) \left[1 - \mathbb{P}(K = 0)\right] \\
&= f_{H_{s,u}}(x \mid x \ge \gamma) \left[1 - \mathbb{P}(K = 0)\right]
\end{aligned} \tag{5.7}
\]

where (c) follows from \(f_{H_s|K}(x \mid k)\) being independent of k, as shown in (5.6). Now for Rayleigh

distributions, it can be shown that

\[
f_{H_{s,u}}(x \mid x \ge \gamma) = \frac{x e^{-x^2/2}}{e^{-\gamma^2/2}} \tag{5.8}
\]

and

\[
\mathbb{P}(K = 0) = \left[\mathbb{P}(X < \gamma)\right]^{N} = \left(1 - e^{-\gamma^2/2}\right)^{N} \tag{5.9}
\]

In summary

\[
f_{H_s}(x) =
\begin{cases}
\left[1 - \mathbb{P}(X > \gamma)\right]^{N}, & x = 0 \\[4pt]
f_{H_{s,u}}(x \mid x \ge \gamma)\left[1 - \left(1 - e^{-\gamma^2/2}\right)^{N}\right], & x \ge \gamma
\end{cases} \tag{5.10}
\]

Note that \(f_{H_s}(x)\) is undefined for the range \(0 < x < \gamma\).

To find \(\mathbb{E}(|h|_s)\):

\[
\mathbb{E}(|h|_s) = 0 \cdot \mathbb{P}(|h|_s = 0) + \int_{\gamma}^{\infty} x \cdot \frac{x e^{-x^2/2}}{e^{-\gamma^2/2}} \left[1 - \left(1 - e^{-\gamma^2/2}\right)^{N}\right] dx \tag{5.11}
\]

leading to

\[
\mathbb{E}(|h|_s) = \left[\gamma + \sqrt{\frac{\pi}{2}}\, e^{\gamma^2/2}\, \operatorname{erfc}\!\left(\frac{\gamma}{\sqrt{2}}\right)\right]\left[1 - \left(1 - e^{-\gamma^2/2}\right)^{N}\right] \tag{5.12}
\]

The expression for \(\mathbb{E}(|h|_s^2)\) can be found using the same procedure, starting from the fact that \(|h|_s^2\) is a chi-square random variable with 2 degrees of freedom, i.e.

\[
|h|_{s,u}^2 \sim f_{H^2}(z) = \frac{1}{2} e^{-z/2} \tag{5.13}
\]

Since \(\gamma^* = \arg\max_{\gamma} \mathbb{E}(\log(1 + \mathrm{SNR}_s(\gamma))) = \arg\max_{\gamma} \mathbb{E}(\mathrm{SNR}_s(\gamma)) = \arg\max_{\gamma} \mathbb{E}(|h|_s(\gamma))\), by maximizing \(\mathbb{E}(|h|_s)\) we find the optimum value of γ as a function of N and store it, for example, in a lookup table.
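A small numerical sketch of this lookup-table construction, under the chapter's normalized-SNR assumptions; the grid search and its resolution are illustrative choices:

```python
import math

def expected_gain(gamma, n_users):
    """Closed-form E(|h|_s) of Eq. (5.2) for i.i.d. Rayleigh channels:
    the conditional mean gain of the selected user times the probability
    that at least one of the N user processes exceeds the threshold."""
    cond_mean = gamma + math.sqrt(math.pi / 2) * math.exp(gamma ** 2 / 2) \
        * math.erfc(gamma / math.sqrt(2))
    p_some_request = 1.0 - (1.0 - math.exp(-gamma ** 2 / 2)) ** n_users
    return cond_mean * p_some_request

def optimal_threshold(n_users, grid_step=1e-3, gamma_max=5.0):
    """Grid-search sketch of gamma* = argmax E(|h|_s); in practice the
    result would be precomputed per N and stored in a lookup table."""
    grid = [i * grid_step for i in range(int(gamma_max / grid_step) + 1)]
    return max(grid, key=lambda g: expected_gain(g, n_users))
```

As expected, the optimal threshold grows with the number of users: with more BBUs contending, each can afford to request only when its channel is exceptionally good.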

In order to validate our analysis and study the performance of the distributed scheduling,

simulation results are provided in Figs. 5.2 and 5.3. In Fig. 5.2 we show the expected channel gain and SNR versus γ. It can be seen that the analytical expressions match the simulation exactly. A comparison of the expected performance between the centralized and distributed

schedulers is shown in Fig. 5.3 for different numbers of users. We find that the distributed

scheduler can achieve approximately 85% of the SNR performance and 92% of the channel

capacity performance in comparison to the centralized scheduler. We take this value as an

upper bound on the performance of the schemes in the upcoming sections.

5.5.2 General Schedulers and Distributions

In the case of general schedulers and channel distributions, performing an analysis similar to

the one in the previous part might not be feasible. A more systematic approach makes use of

classification techniques [147] from machine learning to learn the scheduling decisions. Each

BBU will be equipped with a classifier. At each transmission slot, the classifier will determine

whether the BBU should/can transmit at a specific resource block. This decision is based on

some data features such as channel gain and queue size. Note that the classifiers belonging

to different BBUs do not communicate. Hence, the process is completely distributed and no

inter-BBU signaling is needed.

First, the system is to be trained in the presence of a centralized scheduler. BBUs submit

their requests to the scheduler, which performs the user selection decisions. The data involving

the users’ channels and the scheduler’s decisions are used to train the classifier. Once training is

finished, each BBU is provided with the trained classifier, which then makes the distributed decision.
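This train-then-distribute loop can be illustrated end-to-end with a deliberately simplified stand-in: a one-feature decision stump in place of the SVM or decision tree, trained on labels produced by a simulated centralized max-throughput scheduler. Everything here (i.i.d. Rayleigh gains, the single gain feature, the stump) is an illustrative assumption, not the setup used in the experiments of this chapter:

```python
import math
import random

def generate_training_data(num_users=10, slots=500, seed=0):
    """Simulate a centralized max-throughput scheduler to label the data:
    feature = a user's channel gain, label = whether it was scheduled."""
    rng = random.Random(seed)
    samples = []
    for _ in range(slots):
        gains = [math.sqrt(-2 * math.log(1 - rng.random()))
                 for _ in range(num_users)]
        best = max(range(num_users), key=gains.__getitem__)
        samples += [(g, u == best) for u, g in enumerate(gains)]
    return samples

def train_stump(samples, grid=200):
    """Decision-stump stand-in for the SVM/decision-tree classifiers of the
    text: pick the gain threshold that best reproduces the central labels.
    After training, each BBU independently applies `gain > threshold`."""
    hi = max(g for g, _ in samples)
    candidates = [hi * i / grid for i in range(grid + 1)]
    def accuracy(t):
        return sum((g > t) == y for g, y in samples) / len(samples)
    return max(candidates, key=accuracy)
```

The key property carried over from the text is that, once trained, the classifier runs locally at each BBU with no inter-BBU signaling.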


[Figure 5.2: Expected SNR Comparison versus γ. Curves shown: E(|h|_s) (simulation, analytical, centralized), E(|h|_{s,u}), E(|h|²) (centralized, distributed, analytical), and E(log(1+|h|²)) (centralized, distributed).]

[Figure 5.3: Expected SNR Comparison versus Number of Users. Curves shown: E(|h|) (centralized, distributed, analytical), E(|h|²) (centralized, distributed, analytical), and E(log(1+|h|²)) (centralized, distributed).]


[Figure 5.4: Distributed Decision Flow Chart for General Channels and Schedulers. Steps: (1) start with a network of |U| users and a centralized scheduler; (2) simulate the system for sufficient time using arbitrary channel profiles; (3) use the channel profiles and the scheduling decisions to train the classification algorithm, e.g. SVM; (4) equip each BBU with the trained classification algorithm; (5) at each time t, calculate the scheduling weight, channel profile, and other features, e.g. queue state; (6) make the user scheduling decision.]

We have tried several classification techniques, and generally found that SVMs with Gaussian kernels and decision trees [147] tend to provide the best performance. The flow chart for this decision process is shown in Fig. 5.4.

5.5.3 Simulation Results

The simulation results for these techniques are shown in Figures 5.5, 5.6, 5.7 and 5.8. We have used the 3GPP channel model, which does not have a closed-form pdf. We have trained the system using 5000 data points, equivalent to a 5-second transmission time. In Fig. 5.5 we plot the prediction errors assuming maximum throughput scheduling and 3GPP channels. We show both the hit error ratio (when a user is scheduled but should not be) and the miss error ratio (when a user is not scheduled but should be), as the frequencies of these two events are very different. For maximum throughput scheduling and 3GPP channels, we can see that both SVMs and decision trees provide prediction accuracy on the order of 95%. In Fig. 5.7 we show the same results for the proportional fairness scheduler. In contrast, the prediction accuracies here


[Figure 5.5: Prediction Errors for Maximum Throughput Scheduling. Total, hit, and miss error ratios versus the number of users, for SVM-RBF and decision-tree classifiers.]

are generally lower. In Fig. 5.6 and 5.8 we show the expected SNR for both schedulers. We

can see that the loss in performance for both scheduling schemes is around 11%.
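For reference, the two error ratios can be computed as follows; normalizing each error count by its actual class size is our own assumption, made because the two classes are heavily imbalanced:

```python
def error_ratios(predicted, actual):
    """Hit/miss error ratios as defined in the text: a 'hit error' is
    predicting a user as scheduled when it should not be; a 'miss error'
    is predicting it as not scheduled when it should be. Each count is
    normalized by the size of the corresponding actual class."""
    hit_err = sum(p and not a for p, a in zip(predicted, actual))
    miss_err = sum(a and not p for p, a in zip(predicted, actual))
    negatives = sum(not a for a in actual)
    positives = sum(1 for a in actual if a)
    return hit_err / max(negatives, 1), miss_err / max(positives, 1)
```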

These results show the interesting relation between the fairness of a scheduler and its predictability. Less fair schedulers, such as maximum throughput, are easier to predict, since the probability of two users having excellent channel conditions is small. On the other hand, introducing fairness into the scheduler tends to decrease the difference between the users, making it harder for each of them to predict the scheduling decision. However, the penalty for a wrong decision, in terms of SNR loss, is more severe in the less fair schedulers than in the more fair ones. This explains why the loss in terms of expected SNR is almost the same: when we try to achieve fairness between users, picking the wrong one does not have as much negative effect as when we are trying to maximize the overall system performance.


[Figure 5.6: Comparison of Expected SINR for Maximum Throughput Scheduling. Expected SINR versus the number of users for the centralized scheduler, SVM-RBF, and decision tree.]

[Figure 5.7: Prediction Errors for Proportional Fairness Scheduling. Total, hit, and miss error ratios versus the number of users, for SVM-RBF and decision-tree classifiers.]


[Figure 5.8: Comparison of Expected SINR for Proportional Fairness Scheduling. Expected SINR versus the number of users for the centralized scheduler, SVM-RBF, and decision tree.]

5.5.4 Relation Between Fairness and Predictability

In the previous section we tried the classification approach on the maximum throughput and

the proportional fairness schedulers. Interestingly, we found the performance of the classifier, in

terms of prediction accuracy, for the maximum throughput was always better than that of the

proportional fairness. This motivates the question of whether this observation is more general, i.e. whether more fair schedulers are harder to predict. In this section we introduce a new scheduler,

the mean-variance scheduler. The aim of this scheduler is to optimize a weighted combination

of throughput (the sum of users' rates, \(\sum_{u\in\mathcal{U}} r_u\)) and fairness, represented as the variance of the users' rates, \(\mathrm{Var}(r_u)\). This is formulated as follows:

\[
\max_{r} \sum_{u \in \mathcal{U}} r_u - \beta\, \mathrm{Var}(r_u) \tag{5.14}
\]


[Figure 5.9: Prediction Errors versus β. Total, hit, and miss error ratios as functions of β.]

The discrete decision formed by the scheduler at each time slot n is

\[
u_s(n) = \arg\max_{u \in \mathcal{U}}\; r_u(n) - \beta\left(r_u(n) + \frac{r_u(n)}{N}\right)^2 - 2\beta\left(r_u(n) + \frac{r_u(n)}{N}\right)\left(\sum_{t=0}^{n-1} r_u(t) - \frac{\sum_{u \in \mathcal{U}}\sum_{t=0}^{n-1} r_u(t)}{N}\right) \tag{5.15}
\]
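A per-slot sketch of the decision rule in Eq. (5.15), keeping its sign conventions; the list-based interface is our own illustrative choice:

```python
def mean_variance_pick(inst_rates, cum_rates, beta):
    """Per-slot decision of Eq. (5.15): score each user by its
    instantaneous rate minus a beta-weighted penalty for how far
    scheduling it would push its cumulative rate from the mean."""
    n = len(inst_rates)
    mean_cum = sum(cum_rates) / n
    def score(u):
        bump = inst_rates[u] + inst_rates[u] / n
        return (inst_rates[u] - beta * bump ** 2
                - 2 * beta * bump * (cum_rates[u] - mean_cum))
    return max(range(n), key=score)
```

With beta = 0 the rule reduces to maximum-throughput scheduling; as beta grows, users whose cumulative rate lags the mean are increasingly favored.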

The results for this scheduler are shown in Fig. 5.9. As expected, the prediction error increases as β is increased, i.e. as more fairness is introduced. Note that we have used the same learning parameters for

different values of β, which might be suboptimal. However, the main observation still holds,

that the more fair the scheduler is, the harder it is to predict its decisions and distribute its

operation.

To understand this phenomenon, let us first consider the maximum throughput scheduler.

At each time slot, the maximum throughout scheduler picks the user with the best channel

condition. The user with the best channel condition will be in the tail of the probability

Chapter 5. Fully Distributed Scheduling in Cloud-RAN Systems 100

distribution, i.e. a high value for the channel magnitude which occurs with very low probability.

However, if the number of users is large enough, then we expect to have one such user at

almost every time slot. Since this best user is far from the channel values with significant probability, the data is relatively separable and the classifier can learn a good decision boundary. In other words, there is enough probability of having one user with a very good channel condition, but very low probability of having two such users, hence the data becomes separable.

Now let us look at the other extreme, schedulers that are focused solely on fairness. One such

scheduler is the round-robin, where all users get equal access to the resources irrespective of their

channel condition. The round-robin scheduler is statistically equivalent to a uniformly random

scheduler, where each is used is picked with equal probability. In both cases, the resources are

divided equally between the users. However, there is no information in the uniformly random

scheduler, and hence no classifier can learn to model its behavior. The means that fully fair

schedulers are not predictable. Since the proportional fairness scheduler lies somewhere between

the maximum throughput and the fully fair scheduler, we can see now why it is harder to predict

its decisions compared with the maximum throughput schedulers.

5.6 Conclusion

In this chapter we have studied the distributed scheduling problem in Cloud-RAN systems.

An analytical treatment of Rayleigh channels under the maximum throughput scheduler is provided.

We found that distributed scheduling in this case is able to provide around 92% of the centralized

performance. We then extended the scheme to general channels and schedulers by adopting

classification techniques from machine learning. We discovered two conflicting effects that depend upon the fairness of the scheduler. In particular, less fair schedulers are easier to predict, but the penalty for their wrong decisions is more severe. With enough training and efficient

parameter selection, the distributed schedulers are able to provide up to 89% of the centralized

performance.

Chapter 6

Joint RRH Activation and Clustering in Cloud-RANs

6.1 Context

Cloud Radio Access Networks (Cloud-RAN) promise to leverage cloud computing capabilities to

enhance the quality and coverage of wireless networks. A dense network of remote radio heads

(RRHs) ensures less attenuation at the receiver side. However, two drawbacks are associated

with such a dense network: the first is the high energy consumption associated with the large

number of RRHs; the second is the interference experienced by the receiver due to the close

proximity of the transmitters. The cloud-RAN must adopt the cloud design principles such

as resource elasticity and dynamic scaling. The infrastructure controller is responsible for

controlling the access network through activation and clustering of RRHs. The decisions made

by the infrastructure controller are based on the information it receives from the user processes

and should balance energy efficiency with the QoS received by the users. Hence, in this chapter

we study the problem of joint activation and clustering of RRHs. Since the problem is NP-

hard, we provide a two-step algorithm that can find an efficient solution. The first step uses

linear-programming relaxation to find a feasible solution. The second step is a greedy approach

to improve the utility function through gradual activation-clustering of RRHs. Our simulation

results demonstrate the benefit in the joint design of activation and clustering over existing


activation-only approaches.

6.2 Introduction

Energy efficiency is one of the main goals of cloud RAN. Current studies estimate that the

information and communication technology (ICT) sector contributes around 2% of the global

CO2 emissions [94], [86]. This carbon footprint is expected to triple by 2020 as a result of

the massive growth of cellular traffic. From the network operators' perspective, building more

energy-efficient systems not only lowers their carbon footprint, but is also of significant economic

benefit as it saves their expenditure on energy bills. Considering that around 60-80% of the

energy consumption in a cellular network occurs at the base stations [87], [130], it is

no surprise that the C-RAN architecture tries to come up with more energy-efficient network

architectures.

The C-RAN architecture succeeds in decreasing one aspect of energy consumption, which is

related to the cooling and infrastructure of the macro base stations used in the current systems.

However, the energy consumption due to transmission is still present, and might even increase

due to the large number of RRHs envisioned in C-RAN systems. An important question then

becomes: given the high redundancy of transmitters associated with the dense installation of RRHs, how to select only a subset of these RRHs in order to satisfy the users' needs while

keeping the energy consumption to a minimum.

Interference is another important factor in the design of cloud RAN systems. The high

density of RRHs results in a decrease in the signal-to-interference-and-noise ratio (SINR). Coordinated Multi-Point Transmission (CoMP) is a family of cooperative transmission techniques

that is well studied in cellular systems [50]. A main idea of CoMP is to cluster transmitters

together into a cooperative transmission set in order to coordinate their transmission. Hence,

clustering RRHs together results in improved SINR at the user side, which can be utilized to de-

activate some RRHs in order to save energy. However, in cloud RAN systems, users’ baseband

processing is performed in servers located in data centers, and the mutual exchange of data as

required by CoMP is governed by the networking infrastructure of these data centers. Hence,

any clustering scheme should strike an efficient trade-off between the users’ SINR distribution


and the bandwidth consumption of the underlying networking infrastructure.

In Fig. 6.1 we show the system architecture focused on the access network control part. The

main part under study is the activation and clustering decisions by the infrastructure controller.

The infrastructure controller monitors the state of the network by continuously receiving updates from the user processes about the states of their users, e.g. position, queue occupancy

and SINR. Based on this information, the infrastructure controller can then decide to activate

or de-activate a set of RRHs. Clearly, the activation decision is taken when a set of users within close proximity to an inactive RRH are receiving an inadequate level of service. Similarly, the de-activation decision is taken when the infrastructure controller can identify a part of the network with a low

enough density of users such that they can be migrated to another RRH without affecting their

QoS. The decisions made by the infrastructure controller are about balancing the energy efficiency

of the network with the QoS achieved by the end users. Crucial to the infrastructure controller

decisions is the notion of clustering. Two RRHs can be clustered such that together they can

provide an acceptable level of service to the user of a third inactive RRH. The infrastructure

communicates this clustering decision to the corresponding cell processes in order to have them

coordinate their transmission together. FInally, The infrastructure controller communicates

with the RRHs by giving them the activation/de-activation decisions. The capabilities of the

cloud-RAN architecture, such as the collocation of the base band processing in the cloud and

the central management of the network through the infrastructure controller, are great enablers

for these clustering-enhanced activation decisions.

6.3 Related Work

In [133], the authors studied the base station deployment problem together with switching

ON/OFF some of them in order to guarantee QoS for the users. They used area spectral

efficiency [14] as their main QoS metric, and proposed a simple deterministic greedy algorithm

for the deployment and operation of the base stations. [108] introduced the idea of spatio-

temporal profiling to select an active subset of base stations for each duration. Cooperative

communication was used in [55] to accommodate the users whose base stations are turned off.

[104] introduced the concept of network impact to measure the effect of switching OFF a base
station on its neighbors, and used it to build heuristics for gradual base station activation.

[Figure 6.1: architecture diagram. Cloud computing resources host the cell-wide baseband
processing (user processes, cell processes and the precoder, with lookup tables for faster
processing and switch abstraction); binary input bits pass through coding, scrambling and
modulation; the infrastructure controller performs resource provisioning and issues clustering
decisions, activation decisions and beamforming vectors over the cloud network fabric; the
fronthaul network carries the I/Q signals to the remote radio heads serving the end-users.]

Figure 6.1: Cloud-RAN Architecture - Admission Control and Slicing

The joint deployment and operation problem was studied in [132], and greedy algorithms have

been proposed for both problems. An analysis and optimization approach based on stochastic

geometry was performed in [124], where two BS sleeping strategies were studied. The energy-delay
trade-off was studied in [131]. Finally, antenna switching was distinguished from base

station switching in [163], which introduced both dynamic and semi-dynamic solutions for the

problem.

The main drawback of current studies is the absence of the clustering effect from the activation
decisions. This is more crucial in cloud-RAN architectures due to the high density of the
RRHs, as mentioned before. Hence, our main goal in this chapter is to study how the
clustering effect and decisions can be incorporated into the RRH activation problem.

We improve upon the existing literature as follows: first, we introduce a coverage constraint

to the problem formulation in order to ensure connectivity for all users, and to avoid the waste

of resources associated with connection establishment and termination, which would occur from
blindly turning off some RRHs. The second contribution is in the modeling of the users’ QoS.

While previous attempts have focused on SINR as the QoS metric, they ignored the higher-layer

metrics such as the number of users and the division of resources between them. Even if a user

achieves a high SINR, its overall QoS still depends on how many resource blocks, i.e. how much bandwidth,
it can get. The third and main contribution is to consider RRH clustering jointly with RRH

activation. The reason for this is that it might be enough to cluster two RRHs together to

cover the area of a third RRH, without turning on that third RRH, hence saving energy.

6.4 System Model

6.4.1 System Description

Network Model

Consider a cloud-RAN system where a cloud data center forwards the I/Q signals through a

high-speed network to a set of RRHs. Let R be the set of RRHs, U the set of users, T the set

of time slots and C the set of RRH clusters.


Traffic Model

The users’ locations and the BSs’ locations each follow a Poisson point process (PPP) [16],
with different intensities.
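As an illustration, the two point processes can be sampled as follows (a minimal numpy sketch; the intensity values are hypothetical, not taken from the thesis):

```python
import numpy as np

def sample_ppp(intensity, side, rng):
    """Homogeneous PPP on a side x side square: the number of points is
    Poisson(intensity * side^2), and positions are i.i.d. uniform."""
    n = rng.poisson(intensity * side ** 2)
    return rng.uniform(0.0, side, size=(n, 2))

rng = np.random.default_rng(0)
users = sample_ppp(0.02, 50.0, rng)   # hypothetical user intensity
rrhs = sample_ppp(0.005, 50.0, rng)   # hypothetical RRH intensity
print(users.shape, rrhs.shape)        # each row is an (x, y) position
```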

Association Criteria

We assume a minimum distance association rule, i.e. a user u ∈ U is connected to a RRH

r∗ ∈ R if r∗ is active and

r^* = \arg\min_{r \in \mathcal{R}_{on}} d(u, r) \qquad (6.1)

where Ron is the set of active RRHs and d(u, r) is the distance between user u and RRH r.
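A minimal sketch of this association rule (the helper name and toy layout are ours, not from the thesis):

```python
import math

def associate(user_pos, rrh_pos, active):
    """Minimum-distance association (6.1): connect the user to the
    nearest *active* RRH. `active` is a set of RRH indices."""
    best, best_d = None, float("inf")
    for r in active:
        d = math.dist(user_pos, rrh_pos[r])
        if d < best_d:
            best, best_d = r, d
    return best

# Toy layout: three RRHs on a line, only RRHs 0 and 2 are switched on.
rrh_layout = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
print(associate((12.0, 0.0), rrh_layout, {0, 2}))  # nearest *active* RRH
```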

Channel Model

We assume the channel hu,r between a user u and a RRH r to be distributed as a Rayleigh

random variable.

6.4.2 Problem Formulation

The high-level optimization is summarized as:

\max_{\text{active \& clustered RRHs}} \; \text{System Utility}
\quad \text{s.t. all users are covered} \qquad (6.2)

We define the system utility as

f(x, y) = -\gamma_1 \sum_{i : x_i = 1} P_i + \gamma_2 \sum_{u \in \mathcal{U}} Q_u + \gamma_3 \sum_{i : x_i = 1} |U_i| - \gamma_4 \sum_{i, j : x_i = x_j = 1} \mathbb{1}(y_{ij}) \qquad (6.3)

where x is a vector variable for RRH activation such that xi = 1 if RRH i is active and zero

otherwise. y is a vector variable with yij = 1 if RRH i and j are clustered together, and zero

otherwise. Pi is the power consumption for RRH i and 1(.) is the indicator function. Qu is the

QoS for user u, and U_i is the set of users connected to RRH i. The utility function is a weighted
combination of utility terms (the QoS per user) and cost terms (power consumption,
cluster size and number of users per RRH), where the weights \gamma_i control the

importance of each term. These weights are also normalization constants to enable the addition

of the heterogeneous terms in the utility function.

Note that changing the different weights in the utility function will lead the system to behave

in different ways. For example, increasing γ1 places more weight on the energy consumption

part, rendering the system more energy-efficient at the expense of a decreased QoS. The opposite

can be said about \gamma_2 and \gamma_3. Increasing \gamma_4 penalizes larger clusters, which can be compensated
for by increased energy consumption, decreased QoS, or both.
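To make the bookkeeping concrete, the utility in (6.3) can be sketched as follows (variable names and the toy inputs are ours; the thesis does not prescribe an implementation):

```python
def utility(x, y, P, Q, users_per_rrh, g1=1.0, g2=1.0, g3=1.0, g4=1.0):
    """Weighted utility of (6.3): g1..g4 play the role of gamma_1..gamma_4.
    x is a 0/1 activation vector; y maps RRH pairs to 0/1 clustering flags."""
    on = [i for i, xi in enumerate(x) if xi == 1]
    power = sum(P[i] for i in on)                  # -g1 term (cost)
    qos = sum(Q)                                   # +g2 term (per-user QoS)
    load = sum(len(users_per_rrh[i]) for i in on)  # +g3 term
    clusters = sum(1 for (i, j), v in y.items()    # -g4 term (cluster size)
                   if v and i in on and j in on)
    return -g1 * power + g2 * qos + g3 * load - g4 * clusters

# Toy instance: RRHs 0 and 1 active and clustered, RRH 2 off.
x = [1, 1, 0]
y = {(0, 1): 1}
print(utility(x, y, P=[2, 2, 2], Q=[3, 3], users_per_rrh=[[0], [1], []]))
```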

Our goal is then to maximize the utility function subject to some constraints. In this chapter,
we focus on the coverage constraint, that is, every user must be covered by at least one RRH.
Other constraints may also be used; for example, QoS constraints on the SINR are popular in
the literature. However, since we look at both PHY-layer QoS (SINR) and MAC-layer QoS
(resources per user), we choose to include QoS only in the objective function. Our optimization
problem is

\max_{x, y} \; -\gamma_1 \sum_{i : x_i = 1} P_i + \gamma_2 \sum_{u \in \mathcal{U}} Q_u + \gamma_3 \sum_{i : x_i = 1} |U_i| - \gamma_4 \sum_{i, j : x_i = x_j = 1} \mathbb{1}(y_{ij})
\text{s.t. } \; R_u \neq \emptyset \quad \forall u \in \mathcal{U}
\qquad \; x_i, y_{ij} \in \{0, 1\} \qquad (6.4)

where Ru is the set of RRHs a user u can connect to.

6.4.3 Interference Coordination Model

We envision a two-stage control loop for our C-RAN system. The first loop involves the decisions

of RRH activation and clustering, with period times typically in the order of minutes or more.

The second loop involves per-frame scheduling and beamforming, with period times in the
order of milliseconds. Hence, the decisions of the first loop should be aware of the average

performance of the second loop. We focus in this chapter on precoding-based interference


coordination. Considering joint power control and joint frequency assignment is left for future

work.

Assuming user u is associated with RRH r, the received signal-to-interference-plus-noise
ratio is

\text{SINR}(u, r) = \frac{|h_{u,r} w_{u,r}|^2 \, p_{u,r}}{\sum_{l \in C_r, l \neq r} |h_{u,l} w_{u,l}|^2 \, p_{u,l} + \sum_{l \in \mathcal{C} \setminus C_r} p_{u,l} + \sigma^2} \qquad (6.5)

where C_r is the cluster of RRHs containing RRH r, w_{u,l} is the precoding weight between user
u and RRH l, and p_{u,l} is the received power at user u from RRH l before the Rayleigh fading
effect. The first interference term is the intra-cluster interference, which can be characterized
based on the coordination strategy used within the cluster. The second term is the inter-cluster
interference, which we assume to be a function only of the path loss.

We assume that zero-forcing (ZF) precoding is employed within each cluster. Let H_c be the
channel matrix between the RRHs of a cluster c \in \mathcal{C} and the users they serve; then the ZF
precoding matrix is given by

W_c = H_c^{-1} \qquad (6.6)

The performance of ZF precoders was studied in [19] and [129], where it was shown that
the received SNR, as interference is now nulled, behaves as a Chi-square random variable with
2(Q - k + 1) degrees of freedom, where Q and k denote the number of transmitting antennas
and the number of receivers, respectively. [19] also showed that, when the number of transmitting
antennas equals the number of receivers, Q = k, the average normalized SNR is p_{u,r}/\sigma^2. This means
that interference is nulled and no losses, on average, are suffered in terms of the transmitted power.
In summary, excluding the inter-cluster interference, each cluster can provide its own users with
an average SNR = p_{u,r}/\sigma^2. Hence, Q_u = \text{SINR}(u, r) = p_{u,r} / (\sum_{l \in \mathcal{C} \setminus C_r} p_{u,l} + \sigma^2). This is the value that we
use to model the average precoding performance in our activation and clustering model.
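The zero-forcing property in (6.6) is easy to verify numerically (a sketch with a random Rayleigh-like channel; the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# One cluster: k single-antenna users, Q transmit antennas, with k = Q.
k = Q = 4
H = (rng.standard_normal((k, Q)) + 1j * rng.standard_normal((k, Q))) / np.sqrt(2)

# Zero-forcing precoder (6.6): W = H^{-1}. The effective channel H @ W
# is then the identity, so intra-cluster interference is nulled.
W = np.linalg.inv(H)
print(np.allclose(H @ W, np.eye(k)))  # True: off-diagonal (interference) terms vanish
```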

6.4.4 Interference Graph

In the above formulation we have not specified how the set R_u is defined. For this purpose we
define an interference graph for our wireless network. An interference graph is a graph \{V, E\},
where the set of vertices V is the set of RRHs \mathcal{R}, and the set of edges is E = \{e_{ij} = 1 \text{ iff } i \in I(j)\},
where I(i) is the set of RRHs interfering with RRH i. In this chapter, we define
I(i) = \{j : d(i, j) \leq d_{th}\}; that is, two RRHs interfere whenever their inter-distance is below a
certain threshold d_{th}.
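Constructing this interference graph is straightforward (a sketch; the positions and threshold are illustrative):

```python
import math
from itertools import combinations

def interference_graph(positions, d_th):
    """Edge (i, j) whenever the RRH inter-distance is at most d_th."""
    return {(i, j)
            for i, j in combinations(range(len(positions)), 2)
            if math.dist(positions[i], positions[j]) <= d_th}

positions = [(0, 0), (30, 0), (100, 0)]
print(interference_graph(positions, 50))  # {(0, 1)}: only RRHs 0 and 1 interfere
```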

6.5 Joint Activation and Clustering Algorithm

We can now substitute the definitions from Section 6.4.4 into the problem defined in (6.4).
However, the binary constraints imposed on the variables x, y render the problem NP-hard.
Our approach to overcome this difficulty is as follows:

1. Select a subset RSC ⊆ R, such that all users can access at least one RRH, i.e. find a

feasible solution to the problem.

2. Greedily improve the feasible solution as follows:

(a) Select the switched off RRH that is expected to have the most improvement when

turned on.

(b) Find the improvement in the utility due to turning on the selected RRH.

(c) Select the switched-on RRH that is expected to have the most improvement
when clustered.

(d) Find the improvement in the utility due to clustering the selected RRH.

(e) Choose the action that gives more improvement in the utility, and repeat.

6.5.1 Set Cover

The first step, finding a feasible solution, can be formulated as a set-cover problem over the
interference graph. Consider the following integer program:

\min_x \sum_{i=1}^{|\mathcal{R}|} c(S_i) x_i
\text{s.t. } \sum_{i : e \in S_i} x_i \geq 1, \quad \forall e \in \mathcal{R}
\qquad x_i \in \{0, 1\}, \quad i = 1, 2, \ldots, |\mathcal{R}| \qquad (6.7)


where S_i = I_i \cup \{i\} is the set of cells covered by RRH i, i.e. the interference set plus RRH i's
own cell. While the above problem is still NP-hard, it admits an efficient approximation
scheme using linear-programming rounding. The first step is to solve a relaxed version of (6.7):

\min_x \sum_{i=1}^{|\mathcal{R}|} c(S_i) x_i
\text{s.t. } \sum_{i : e \in S_i} x_i \geq 1, \quad \forall e \in \mathcal{R}
\qquad 0 \leq x_i \leq 1, \quad i = 1, 2, \ldots, |\mathcal{R}| \qquad (6.8)

The output of problem (6.8) is then rounded to an integer solution using the procedure
summarized in Algorithm 6.1. This LP rounding scheme is known to be an f-approximation
of the set cover problem [143].

Algorithm 6.1 LP Rounding Set-Cover

1. Solve the relaxed LP (6.8) to obtain x = (x_1, x_2, \ldots, x_{|\mathcal{R}|});
2. Let f be the maximum frequency (the maximum number of times that an element appears
in distinct coverage sets);
3. Output the deterministic rounding \bar{x} = (\bar{x}_1, \ldots, \bar{x}_{|\mathcal{R}|}) \in \{0, 1\}^{|\mathcal{R}|}, where
\bar{x}_i = 1 if x_i \geq 1/f, and \bar{x}_i = 0 otherwise.
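The rounding step of Algorithm 6.1 can be sketched as follows; to stay self-contained, the fractional solution is supplied directly rather than computed by an LP solver (the instance and names are ours):

```python
def lp_round(x_frac, cover_sets, universe):
    """Deterministic rounding step of Algorithm 6.1: keep set i whenever
    x_i >= 1/f, where f is the maximum number of cover sets any single
    element appears in."""
    f = max(sum(1 for s in cover_sets if e in s) for e in universe)
    return [1 if xi >= 1.0 / f else 0 for xi in x_frac]

# Toy instance: three candidate sets covering {a, b, c}; x_frac stands in
# for the solution of the relaxed LP (6.8).
cover_sets = [{"a", "b"}, {"b", "c"}, {"c"}]
x_frac = [0.6, 0.6, 0.1]
rounded = lp_round(x_frac, cover_sets, {"a", "b", "c"})
print(rounded)  # [1, 1, 0] -- here f = 2, and sets 0 and 1 already cover everything
```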

6.5.2 Greedy Improvement

The next step in our algorithm is to gradually improve upon the feasible solution found in

Algorithm 6.1. We follow a greedy approach, which can also be viewed as a discrete gradient

ascent one. The algorithm starts with the feasible solution found from Algorithm 6.1, and tries

to pick the direction that gives us the largest increase in the utility function. However, since

there are many decisions to choose from (activating each switched-off RRH, or clustering each
pair of active RRHs), we propose a simplification. First, we select the RRH that is most likely to improve

the utility. For the activation part, we pick the switched-off RRH that has the most users in its
cell, while for the clustering part, we pick the RRH that is causing the most interference. For
each case, we find the utility improvement from activating or clustering the chosen RRH, and
proceed with the action that gives the larger improvement.

The steps are summarized in Algorithm 6.2. Step 1 finds the inactive RRH with the most

users within its cell. Step 3 finds the utility improvement resulting from activating this RRH.

Step 4 finds the active RRH that generates the most interference. Step 5 finds the nearest interfering

RRH to the one selected in step 4. Step 7 finds the utility improvement from clustering the

two RRHs selected in steps 4 and 5. Finally the decision, either activation or clustering, with

the higher increase in utility is chosen in step 8. The process repeats until the marginal gain is

below a specified threshold or a maximum number of iterations is reached.

Algorithm 6.2 Greedy Improvement

• For n = 0 \to N, while \Delta_x f, \Delta_y f > \epsilon:

1. Find i such that x_i = 0 and |U_i| > |U_j| \; \forall j : x_j = 0
2. Set x_{n+1} = x_n except for x_{n+1,i} = 1
3. Find \Delta_x f = f(x_{n+1}, y_n) - f(x_n, y_n)
4. Find k such that x_k = 1 and \sum_{u \in \mathcal{U}} I(u, k) > \sum_{u \in \mathcal{U}} I(u, j) \; \forall j : x_j = 1 \text{ and } j \notin C_k
5. Find j such that d(k, j) < d(k, l) \; \forall l : x_l = 1 \text{ and } j, l \notin C_k
6. Set y_{n+1} = y_n except for y_{n+1,k,j} = 1
7. Find \Delta_y f = f(x_n, y_{n+1}) - f(x_n, y_n)
8. Set (x_{n+1}, y_{n+1}) = (x_{n+1}, y_n) if \Delta_x f \geq \Delta_y f, and (x_n, y_{n+1}) otherwise
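The control flow of Algorithm 6.2 can be sketched as follows. This is a skeleton, not the thesis implementation: the utility and the two candidate-selection rules are passed in as functions, and the toy numbers at the bottom are hypothetical.

```python
def greedy_improve(x, y, f, pick_activation, pick_cluster,
                   max_iter=50, eps=1e-9):
    """Skeleton of Algorithm 6.2: repeatedly compare the best candidate
    activation against the best candidate clustering, apply whichever
    improves the utility f(x, y) more, and stop when the gain <= eps."""
    for _ in range(max_iter):
        gain, move = eps, None
        i = pick_activation(x)
        if i is not None:
            x2 = {**x, i: 1}                       # activate RRH i
            if f(x2, y) - f(x, y) > gain:
                gain, move = f(x2, y) - f(x, y), (x2, y)
        pair = pick_cluster(x, y)
        if pair is not None:
            y2 = {**y, pair: 1}                    # cluster the pair
            if f(x, y2) - f(x, y) > gain:
                gain, move = f(x, y2) - f(x, y), (x, y2)
        if move is None:
            break                                  # marginal gain too small
        x, y = move
    return x, y

# Toy utility: each served user is worth 1, each active RRH costs 2,
# each cluster costs 1 (all numbers hypothetical).
cell_users = {0: 5, 1: 0, 2: 3}
def f(x, y):
    return (sum(cell_users[i] for i in x if x[i])
            - 2 * sum(x.values()) - sum(y.values()))
pick_on = lambda x: max([i for i in x if not x[i]],
                        key=lambda i: cell_users[i], default=None)
pick_pair = lambda x, y: None   # clustering candidates omitted in this toy
xf, yf = greedy_improve({0: 1, 1: 0, 2: 0}, {}, f, pick_on, pick_pair)
print(xf)  # RRH 2 gets switched on (gain +1); RRH 1 stays off
```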

6.6 Simulation Results

In this section we present performance results of the proposed solution to the optimization

problem. We simulate a network of 20 RRHs, distributed uniformly within an area of 39250 m²,
so that on average each RRH has a cell of diameter 50 meters.

The main parameters that we vary in our simulations are d_{th} and |\mathcal{U}|. The first is the
interference threshold; higher values of d_{th} mean that more RRHs are needed to provide
coverage, and consequently more RRHs are selected by the set-cover step. In this case the greedy


algorithm will favor clustering the already active RRHs over turning on new ones. The second

parameter is the average number of users. We vary this parameter from 1 user to 200 users per

RRH. This significantly affects the QoS performance, which will also be reflected in the greedy

algorithm decisions.

In Fig. 6.2 we show the average number of active RRHs as the total number of users is
increased. First, we can see how the interference threshold throttles the domain of the greedy
algorithm. When the threshold is low, the number of RRHs selected by the set-cover problem
is already large, leaving little room for improvement on this front. As the interference threshold
reaches a value of 50, which is the average RRH inter-distance in our simulation, the
process saturates. We can already see the significant energy savings available; this is particularly
important when the number of users is small. In such a case, almost half of all RRHs can be
turned off.

In Figures 6.3, 6.4 and 6.5 we study the behavior of the different components of QoS. The

main point here is to show how much improvement can be achieved by joint design of RRH

clustering and activation, hence confirming the benefit of our approach. Fig. 6.3 shows the

behavior of the first component of QoS (SIR) versus the total number of users. We can observe

up to 25% improvement in SIR QoS. Fig. 6.4 helps in understanding the behavior of the

algorithm. It shows the QoS provided by the algorithm versus the number of RRHs activated

by it. We note that different lines start at different points due to the difference in the number

of initial RRHs activated by the set-cover problem. Fig. 6.5 shows the behavior of the overall

QoS, which is the weighted sum of SIR and the average number of users per RRH. The QoS

gains decrease when we consider the overall QoS. This confirms our rationale for studying both

PHY-layer and MAC-layer metrics.

In Figs. 6.6 and 6.7 we show the behavior of the algorithm as we change the area within
which the users and RRHs are distributed. We can see that for smaller areas, i.e. a smaller density
factor, the number of active RRHs is smaller. This follows directly from the fact that once RRHs
are closer to each other, each RRH can cover more users. The interference threshold can be
tuned to control this behavior by forcing more RRHs to be turned on. In Fig. 6.7 we study

the behavior of QoS. We observe that as the density factor is increased, QoS decreases. This is due
to the excessive clustering selected by the algorithm: as the RRHs move very far from each
other, the effect of CoMP clustering becomes less significant, and QoS drops. We note that the
clustering term in the utility function can be tuned to control this phenomenon.

[Figure 6.2: plot of the number of active RRHs (y-axis, 13 to 20) against the number of users
(x-axis, 20 to 200), for distance thresholds Thr = 10 to 50.]

Figure 6.2: Average number of active RRHs as the total number of users is varied

6.7 Conclusion

We have studied the problem of joint clustering and RRH activation in Cloud-RAN networks.

We have provided a two-step approach to overcome the combinatorial nature of the problem.

The first step uses a linear-programming approximation over an interference graph to obtain a
feasible solution. The second step greedily improves the solution, searching over both

activation and clustering decisions. Our simulation results have shown around 25% improvement

in terms of QoS and energy savings for the joint clustering and activation over the legacy
activation-only approach.


[Figure 6.3: plot of QoS (SIR) (y-axis, 4.0 to 7.5) against the number of users per RRH
(x-axis, 20 to 200), for distance thresholds Thr = 10 to 50 and for the no-clustering baseline.]

Figure 6.3: Change of average QoS as the number of users is varied

[Figure 6.4: plot of QoS (SIR) per user (y-axis, 4.0 to 7.5) against the number of active RRHs
(x-axis, 13 to 20), for distance thresholds Thr = 10 to 50 and for the no-clustering baseline.]

Figure 6.4: Change of average QoS as the number of active RRHs changes


[Figure 6.5: plot of the overall QoS (y-axis, 1.5 to 4.0) against the number of users per RRH
(x-axis, 20 to 200), for distance thresholds Thr = 10 to 50 and for the no-clustering baseline.]

Figure 6.5: Overall QoS as the number of users per RRH is varied

[Figure 6.6: plot of the number of active RRHs (y-axis, 6 to 20) against the density factor
(x-axis, 0 to 20), for distance thresholds Thr = 10 to 50.]

Figure 6.6: Number of active RRHs as the RRH density factor is varied


[Figure 6.7: plot of QoS (SIR) (y-axis, 2.0 to 4.5) against the density factor (x-axis, 0 to 20),
for distance thresholds Thr = 10 to 50.]

Figure 6.7: Average QoS as the RRH density factor is varied

Chapter 7

Long-term Activation, Clustering

and Association in Cloud-RAN

7.1 Context

In this chapter we build upon the work in Chapter 6 and extend it in several ways. While the

model used in the previous chapter showed the strong benefit of the joint activation-clustering

approach, there were two main aspects still missing. These are the dynamic and flexible user-

RRH association, and the temporal correlation of the user and traffic behavior, and consequently

the activation and clustering decisions. The more flexible association scheme we consider here
relieves us of having to always associate a user with its nearest RRH. Instead, users can be
dynamically handed over between the different RRHs, giving more flexibility to the activation
and clustering decisions. In order to incorporate the temporal correlation, i.e. queuing,

into the model, we have to also include clustering as a variable and study its effect on the

SINR. We address these challenges by providing a comprehensive model that incorporates all

the aspects of activation, clustering and association. The resulting problem belongs to the class

of signomial optimization. We show how this problem can be efficiently solved using successive
geometric approximation. Finally, we study how this approach can be extended into a

stochastic control one. The main idea is to perform the optimization based on the traffic fore-

cast. We measure the sensitivity of the activation and clustering decisions with respect to the


forecast error, and find the error to be 9% and 18% for the activation and clustering decisions

respectively.

7.2 Introduction

The past few years have witnessed a large increase in cellular network traffic. Since the systems

are already operating close to their maximum capacity, one solution is to build denser
wireless networks with aggressive frequency re-use factors. A dense network of RRHs ensures

less attenuation at the receiver side. However, two drawbacks are associated with these dense

networks: the first is the high energy consumption associated with such a large number of RRHs;

the second is the increased interference experienced by the receiver due to close proximity of

the transmitters.

In this chapter, we study the same question as in the last chapter about energy-efficient

activation of RRHs. Previously, we studied the joint optimization of activation and clustering

and demonstrated the significant effect clustering has on the problem. Our goal now is to

extend the problem in two important ways:

• Association: User association is another aspect of the problem. Contrary to the tradi-

tional distance-based association, the high density of RRHs in cloud-RAN enables more

dynamic association schemes. Since the RRHs are closer to each other, instead of asso-

ciating a user to its nearest base station, it can instead be associated with one or more

nearby RRHs which can provide the user with a comparable level of service but at a much
more balanced load. Even though the new RRH might be further from the user, the low load
on this RRH will help the user get more frequency resources to account for the decrease in the
received signal power.

• Queuing: The other major aspect is the strong time-dependency of the network load.

A decision to activate or de-activate a RRH will have a strong impact on the queue
occupancies and the network state at the next time slot. This strong temporal correlation

means that consecutive activation, clustering and association decisions are strongly intertwined,
and should ideally be jointly optimized. This leads us to formulate the problem as a long-term
optimization, where the queuing aspects are included to represent the temporal correlation of
the state, and consequently of the decisions.

In Fig. 7.1 we show the system architecture under study in this chapter. This is very

similar to the one in Chapter 6, except for the inclusion of association decisions as well. The

infrastructure controller communicates with the user process in order to find out about the

user’s position and its queue occupancy. This information is then used to decide upon the

activation and clustering of the RRHs as well as the association between the user and RRHs.

The activation decisions are then sent to the access network to activate/de-activate the
corresponding RRHs, while the clustering decisions are sent to the appropriate cell processes,
and the association decisions are sent to the respective user and cell processes.

7.3 Related Work

Besides the works reviewed in the previous chapter, such as [133], [108], [55], [104], [132], [131] and [163],

there is one more specific work that we would like to discuss here. The problem of dynamic base

station activation and user association was studied in [9]. This work has a lot of similarities with

ours, in that both study long-term optimization of RRH activation jointly with user association.

However, there are a few drawbacks with that approach that we intend to alleviate here:

• First and foremost, clustering is not included in the model used in [9]. One of our main

contributions in this chapter is arriving at a generic form for SINR that integrates the

effect of RRH clustering for both interference coordination and joint transmission scenar-

ios. This form is necessary to be able to perform long-term optimization of the activation

and clustering decisions, since the traditional greedy approach to clustering might prove

infeasible once we go beyond a single time slot decision.

• The approach in [9] reduces to a form of greedy optimization that solves an optimization

problem for each time slot based on the current queue occupancy. Our approach solves

the whole problem at once, and utilizes traffic forecasts for the future decisions. Greedy

optimization based only on the current state ignores the typical cyclo-stationary pattern

of the user behavior and network traffic. In other words, the decision process for the
current time slot should take into consideration whether the traffic is expected to go up
or down next, and decide correspondingly whether to add or remove resources from the
network, in terms of RRH activation, clustering and user association.

[Figure 7.1: architecture diagram, identical to Fig. 6.1 except that the infrastructure controller
now also issues association decisions alongside the activation and clustering decisions.]

Figure 7.1: Cloud-RAN Architecture - Activation, Clustering and Association

7.4 System Model

7.4.1 System Description

Consider a cloud-RAN system where a cloud data center forwards the I/Q signals through a

high-speed network to a set of RRHs. Let R be the set of RRHs, U the set of users, T the set

of time slots and C the set of RRH clusters.

Activation Model

Our main goal is to optimize the energy efficiency while satisfying the QoS level of the service
received by the users. Towards this end, we define x_i as the probability that RRH i is active.
While the activation variable is in general binary, modeling it as a probability makes our
formulation more tractable, as well as making the model more general.

Signal Model

Let pij be the received signal power at user j from RRH i. Similarly, let µij be the probability

that user j is associated with RRH i. We again use a probabilistic model for the association.

This can be justified on one hand as a mathematical relaxation of the binary variable, while

on the other hand this can be seen as the probability that a RRH accepts a connection request

from this user. We can now write the received signal power for user j as follows

S_j = \sum_{i \in \mathcal{R}} \mu_{ij} \, p_{ij} \, x_i \qquad (7.1)

Interference Model

The main goal of clustering is to address the interference problem present in current networks.

There are several ways in which clustering can be leveraged to enhance the system performance. In


this chapter, we focus on two of these: interference cancellation and joint transmission.
Interference cancellation is when the CSI is shared between RRHs to combat interference through
beamforming, while joint transmission refers to designing the waveforms such that they add
constructively at the receiver. As with activation and association, we use a probabilistic
model for clustering. Let q_{ik} be the probability that RRHs i and k are clustered together,
i.e. sharing the CSI and/or data. If interference cancellation is the chosen mode, we assume
that the beamformer used is zero-forcing. Hence, the received interference at user j is

I_j = \sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} (1 - q_{ik}) \, p_{kj} \, x_k \qquad (7.2)

The interference is thus the sum of the interference received from all RRHs not clustered with
the associated RRH i.

In the case when joint transmission is used, not only is the interference nulled, it also becomes
a useful signal that increases the received power. In this case, we refer to this useful component
as \bar{I}_j. For a user j, \bar{I}_j can be written as

\bar{I}_j = \sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} q_{ik} \, p_{kj} \, x_k \qquad (7.3)

While the different optimization variables might operate on different time scales, the fact
that they are modeled as probabilities alleviates this drawback, since probabilities can be
considered as recommendations, or soft decisions, that do not have to be followed exactly.
Instead, they form guidelines for on-line operation. For example, the association probability
could mean that RRH i accepts the connection request from a specific user j with probability \mu_{ij}.

SINR Model

Considering the activation, clustering and association models together, we can write the received

SINR for the interference cancellation case as follows

\text{SINR}_j = \frac{S_j}{I_j + \sigma^2} = \frac{\sum_{i \in \mathcal{R}} \mu_{ij} p_{ij} x_i}{\sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} (1 - q_{ik}) p_{kj} x_k + \sigma^2} \qquad (7.4)


and for the joint transmission case

\text{SINR}_j = \frac{S_j + \bar{I}_j}{I_j + \sigma^2} = \frac{\sum_{i \in \mathcal{R}} \mu_{ij} p_{ij} x_i + \sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} q_{ik} p_{kj} x_k}{\sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} (1 - q_{ik}) p_{kj} x_k + \sigma^2} \qquad (7.5)
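The interference-cancellation SINR of (7.4) can be sketched as follows; the two-RRH instance and all numbers are illustrative, not from the thesis:

```python
def sinr_ic(j, R, mu, p, x, q, sigma2):
    """SINR of user j under interference cancellation, following (7.4):
    x[i] are activation probabilities, mu[i][j] association probabilities,
    q[i][k] clustering probabilities, p[i][j] received powers."""
    signal = sum(mu[i][j] * p[i][j] * x[i] for i in R)
    interf = sum((1 - q[i][k]) * p[k][j] * x[k]
                 for i in R for k in R if k != i)
    return signal / (interf + sigma2)

# Two RRHs, one user (index 0); all numbers are illustrative.
R = [0, 1]
p = [[4.0], [1.0]]          # p[i][j]: received power at user j from RRH i
mu = [[1.0], [0.0]]         # user 0 associated with RRH 0
x = [1.0, 1.0]              # both RRHs active

q_clustered = [[1.0, 1.0], [1.0, 1.0]]   # RRHs 0 and 1 fully clustered
q_separate = [[1.0, 0.0], [0.0, 1.0]]    # no clustering
print(sinr_ic(0, R, mu, p, x, q_clustered, 1.0))  # 4.0: interference nulled
print(sinr_ic(0, R, mu, p, x, q_separate, 1.0))   # lower: interference remains
```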

Queuing Model

One of our main goals is to be able to optimize over multiple time slots. Hence, the model must
take into account the evolution of the system state, namely the queue sizes. Let Q_j^{t+1} and Q_j^t
be the queue sizes of user j at times t + 1 and t. The queue then evolves as

Q_j^{t+1} = Q_j^t - C_j^t + A_j^{t+1} \quad \forall t \in \mathcal{T} \qquad (7.6)

where C_j^t = \log(1 + \text{SINR}_j^t) is the channel capacity at time t, and A_j^{t+1} is the arrival traffic

at time t + 1. However, this form is not suitable for our formulation of the problem as a
geometric program. Taking the exponential of both sides we get

e^{Q_j^{t+1}} = e^{Q_j^t - C_j^t + A_j^{t+1}} = e^{Q_j^t} \, e^{-C_j^t} \, e^{A_j^{t+1}}, \quad \text{so} \quad \tilde{Q}_j^{t+1} \approx \frac{\tilde{Q}_j^t \, \tilde{A}_j^{t+1}}{\text{SINR}_j} \qquad (7.7)

where \tilde{Q}_j^{t+1} = e^{Q_j^{t+1}}, \tilde{A}_j^{t+1} = e^{A_j^{t+1}}, and the approximation in the last step comes from ignoring the
one in the capacity expression, i.e. assuming \text{SINR}_j \gg 1.
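The quality of this approximation is easy to check numerically (a sketch with illustrative numbers; function names are ours):

```python
import math

def queue_step(Q, A, sinr):
    """Exact queue evolution (7.6): Q_{t+1} = Q_t - log(1 + SINR) + A."""
    return Q - math.log(1.0 + sinr) + A

def queue_step_gp(Qe, Ae, sinr):
    """GP-friendly form (7.7) in exponentiated variables Qe = e^Q and
    Ae = e^A, using log(1 + SINR) ~ log(SINR) when SINR >> 1."""
    return Qe * Ae / sinr

# For a large SINR the two forms nearly coincide (toy numbers):
Q, A, sinr = 3.0, 1.0, 200.0
exact = queue_step(Q, A, sinr)
approx = math.log(queue_step_gp(math.exp(Q), math.exp(A), sinr))
print(abs(exact - approx))  # small: the SINR >> 1 approximation is tight
```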


7.4.2 Problem Formulation

We are now ready to formulate our optimization problem as follows

\min_{x, q, \mu} \; \sum_{i \in \mathcal{R}} x_i + \beta \sum_{i, k \in \mathcal{R}} q_{ik}
\text{s.t. } \; \text{SINR}_j^t \geq \gamma_j Q_j^t \quad \forall j \in \mathcal{U}, \; t \in \mathcal{T}
\qquad \; \text{SINR}_j^t = \frac{\tilde{Q}_j^t \, \tilde{A}_j^{t+1}}{\tilde{Q}_j^{t+1}} \quad \forall j \in \mathcal{U}, \; t \in \mathcal{T}
\qquad \; 0 \leq x, q, \mu \leq 1 \qquad (7.8)

or equivalently

\min_{x, q, \mu} \; \sum_{i \in \mathcal{R}} x_i + \beta \sum_{i, k \in \mathcal{R}} q_{ik}
\text{s.t. } \; \frac{\sum_{i \in \mathcal{R}} \mu_{ij} p_{ij} x_i}{\sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} (1 - q_{ik}) p_{kj} x_k + \sigma^2} \geq \gamma_j Q_j^t \quad \forall j \in \mathcal{U}, \; t \in \mathcal{T}
\qquad \; \frac{\tilde{Q}_j^{t+1}}{\tilde{Q}_j^t \, \tilde{A}_j^{t+1}} \cdot \frac{\sum_{i \in \mathcal{R}} \mu_{ij} p_{ij} x_i}{\sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R}, k \neq i} (1 - q_{ik}) p_{kj} x_k + \sigma^2} = 1 \quad \forall j \in \mathcal{U}, \; t \in \mathcal{T} \setminus \{0\}
\qquad \; 0 \leq x, q, \mu \leq 1 \qquad (7.9)

For simplicity, we will assume \(\sigma^2 = 0\) from now on. The objective of the optimization is to minimize the energy consumed by minimizing the number of active RRHs, while satisfying a QoS constraint requiring the received SINR to be greater than a factor multiplied by the queue size.

7.5 Successive Geometric Optimization

Problem (7.9) is an example of a signomial geometric programming problem [27]. Unlike standard geometric programming problems, signomial geometric programming problems are non-convex and hard to solve. An approach to solving such problems was introduced in [152]. The algorithm is based on approximating the problem by a series of standard geometric programming problems that can be solved to reach a globally optimal solution. In the following we summarize the algorithm in [152] as applied to our problem.

7.5.1 Signomial Geometric Programming

Consider the optimization problem defined as follows

\[
\begin{aligned}
\min_{\mathbf{x}} \quad & f_0(\mathbf{x}) \\
\text{s.t.} \quad & f_k(\mathbf{x}) = \sum_{j=1}^{m_k} c_{kj} \prod_{i=1}^{n} x_i^{a_{kij}} \le 1, \quad k = 1, 2, \ldots, K_1 \\
& f_k(\mathbf{x}) = \sum_{j=1}^{m_k} c_{kj} \prod_{i=1}^{n} x_i^{a_{kij}} = 1, \quad k = K_1+1, K_1+2, \ldots, K_2
\end{aligned} \tag{7.10}
\]

Unlike standard geometric programming problems, there is no positivity constraint imposed on the constants \(c_{kj}\); hence these problems cannot be transformed into a convex form using the standard techniques of geometric programming.

Global Optimization

Each \(f_k(\mathbf{x})\) can be written as

\[
f_k(\mathbf{x}) = f_k^+(\mathbf{x}) - f_k^-(\mathbf{x}) \tag{7.11}
\]

where both \(f_k^+(\mathbf{x})\) and \(f_k^-(\mathbf{x})\) are posynomial functions. Hence, problem (7.10) can be written as

\[
\begin{aligned}
\min_{\mathbf{x}} \quad & f_0(\mathbf{x}) \\
\text{s.t.} \quad & f_k^+(\mathbf{x}) - f_k^-(\mathbf{x}) \le 1, \quad k = 1, 2, \ldots, K_1 \\
& f_k^+(\mathbf{x}) - f_k^-(\mathbf{x}) = 1, \quad k = K_1+1, K_1+2, \ldots, K_2 \\
& x_i > 0, \quad i = 1, 2, \ldots, n
\end{aligned} \tag{7.12}
\]


which is equivalent to

\[
\begin{aligned}
\min_{\mathbf{x}} \quad & f_0(\mathbf{x}) \\
\text{s.t.} \quad & \frac{f_k^+(\mathbf{x})}{f_k^-(\mathbf{x}) + 1} \le 1, \quad k = 1, 2, \ldots, K_1 \\
& \frac{f_k^+(\mathbf{x})}{f_k^-(\mathbf{x}) + 1} = 1, \quad k = K_1+1, K_1+2, \ldots, K_2 \\
& x_i > 0, \quad i = 1, 2, \ldots, n
\end{aligned} \tag{7.13}
\]

Now introduce auxiliary variables \(s_k\) such that

\[
\begin{aligned}
\min_{\mathbf{x},\mathbf{s}} \quad & f_0(\mathbf{x}) + \sum_{k=K_1+1}^{K_2} s_k \\
\text{s.t.} \quad & \frac{f_k^+(\mathbf{x})}{f_k^-(\mathbf{x}) + 1} \le 1, \quad k = 1, 2, \ldots, K_1 \\
& \frac{f_k^+(\mathbf{x})}{f_k^-(\mathbf{x}) + 1} \le 1, \quad k = K_1+1, K_1+2, \ldots, K_2 \\
& \frac{s_k^{-1}\left(f_k^-(\mathbf{x}) + 1\right)}{f_k^+(\mathbf{x})} \le 1, \quad k = K_1+1, K_1+2, \ldots, K_2 \\
& x_i > 0, \quad i = 1, 2, \ldots, n \\
& s_k \ge 1, \quad k = K_1+1, K_1+2, \ldots, K_2
\end{aligned} \tag{7.14}
\]

A key step in the algorithm is that a posynomial function \(g(\mathbf{x}) = \sum_\nu u_\nu(\mathbf{x})\), with \(u_\nu(\mathbf{x})\) being the monomial terms, can be lower bounded as follows

\[
g(\mathbf{x}) \ge \tilde{g}(\mathbf{x}) = \prod_\nu \left( \frac{u_\nu(\mathbf{x})}{\alpha_\nu} \right)^{\alpha_\nu} \tag{7.15}
\]

where the parameters \(\alpha_\nu\) can be computed at a point \(\mathbf{y}\) using

\[
\alpha_\nu(\mathbf{y}) = \frac{u_\nu(\mathbf{y})}{g(\mathbf{y})} \quad \forall\, \nu \tag{7.16}
\]

By inserting the approximation (7.15) into (7.14), we get a convex approximation of the original problem (7.10), where the accuracy of the approximation depends on the tightness of the bound (7.15). The algorithm starts with random guesses for the exponents (7.16), and iteratively updates the solution while improving the bounds until a desired accuracy is reached.
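As an illustration of the condensation step (7.15)-(7.16), the following sketch builds the monomial lower bound of the two-term posynomial \(g(x) = x + 1/x\) (an arbitrary example of ours, not a function from this chapter) and checks that the bound is tight at the expansion point:

```python
def condense(terms, y):
    """Monomial lower bound (7.15) of a posynomial at expansion point y.

    terms: list of monomial functions u_nu; the weights alpha_nu are the
    relative contributions of each term at y, as in (7.16).
    """
    g_y = sum(u(y) for u in terms)
    alphas = [u(y) / g_y for u in terms]
    def g_tilde(x):
        prod = 1.0
        for u, a in zip(terms, alphas):
            prod *= (u(x) / a) ** a        # (u_nu(x) / alpha_nu)^alpha_nu
        return prod
    return g_tilde

# g(x) = x + 1/x, condensed at y = 2 (the alphas come out to 0.8 and 0.2)
terms = [lambda x: x, lambda x: 1.0 / x]
g = lambda x: sum(u(x) for u in terms)
g_t = condense(terms, 2.0)

print(g(2.0), g_t(2.0))   # tight at the expansion point: both equal 2.5
print(all(g_t(x) <= g(x) + 1e-9 for x in (0.5, 1.0, 3.0, 10.0)))
```

The global validity of the bound is the weighted arithmetic-geometric mean inequality, which is what makes each condensed subproblem a standard geometric program.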

7.6 Successive Geometric Optimization for Activation, Clustering and Association

Using the approximation scheme in (7.15), we can write (7.9) as

\[
\begin{aligned}
\min_{\mathbf{x},\mathbf{q},\mu,\mathbf{s}} \quad & \sum_{i \in \mathcal{R}} x_i + \beta \sum_{i,k \in \mathcal{R}} q_{ik} + \sum_{j \in \mathcal{U},\ t \in \mathcal{T} \setminus \{0\}} s_j^t \\
\text{s.t.} \quad & \gamma_j Q_j\, \frac{\sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R},\, k \neq i} q_{ik}\, p_{kj} x_k + \sigma^2}{\prod_{i \in \mathcal{R}} \left( \mu_{ij} p_{ij} x_i / \alpha_i^a \right)^{\alpha_i^a}} \le 1 \quad \forall j \in \mathcal{U} \\
& \frac{\dfrac{\bar{Q}_j^{t+1}}{\bar{Q}_j^t\, \bar{A}_j^{t+1}} \left( \sum_{i \in \mathcal{R}} \mu_{ij} p_{ij} x_i \right) + \bar{Q}_j^{t+1} \sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R},\, k \neq i} q_{ik}\, p_{kj} x_k}{\prod_{i,k \in \mathcal{R},\, k \neq i} \left( p_{kj} x_k / \alpha_{i,k}^b \right)^{\alpha_{i,k}^b}} \le 1 \quad \forall j \in \mathcal{U},\ t \in \mathcal{T} \setminus \{0\} \\
& \frac{1}{s_j^t}\, \frac{\sum_{i \in \mathcal{R}} \sum_{k \in \mathcal{R},\, k \neq i} p_{kj} x_k}{\prod_{i \in \mathcal{R}} \left( \dfrac{\bar{Q}_j^{t+1} \mu_{ij} p_{ij} x_i}{\bar{Q}_j^t\, \bar{A}_j^{t+1}\, \alpha_i^c} \right)^{\alpha_i^c} \prod_{i,k \in \mathcal{R},\, k \neq i} \left( \dfrac{\bar{Q}_j^{t+1} q_{ik}\, p_{kj} x_k}{\alpha_{i,k}^d} \right)^{\alpha_{i,k}^d}} \le 1 \quad \forall j \in \mathcal{U},\ t \in \mathcal{T} \setminus \{0\} \\
& 0 \le \mathbf{x}, \mathbf{q}, \mu \le 1, \qquad \mathbf{s} \ge 1
\end{aligned} \tag{7.17}
\]

We have summarized the solution algorithm in Algorithm 7.1.

We have also considered how our solution can be extended into a model-predictive controller. The simple idea is to predict the future values of the arrival traffic \(A_j^t\), and to use these predicted values in the optimization problem (7.17). This approach can be enhanced by redoing the optimization at the beginning of each time slot. However, in this work we focus on predicting over the whole interval and optimizing once, as this is less computationally expensive.


Algorithm 7.1 Successive Geometric Optimization for Activation, Clustering and Association

step 0: Choose initial feasible values \(x^{(0)}, q^{(0)}, \mu^{(0)}\) for the variables \(x, q, \mu\), and initial values for \(s\). Select a solution accuracy \(\epsilon > 0\). Set the iteration counter \(r = 0\).
step 1: At the \(r\)-th iteration, evaluate the exponents as in equation (7.16) and the corresponding bounds as in (7.15).
step 2: Solve the convex optimization problem (7.17) to obtain \(x^{(r)}, q^{(r)}, \mu^{(r)}\).
step 3: If \(\|x^{(r)} - x^{(r-1)}\| + \|q^{(r)} - q^{(r-1)}\| + \|\mu^{(r)} - \mu^{(r-1)}\| < \epsilon\), stop; otherwise set \(r = r + 1\) and go to step 1.
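The structure of this iteration can be illustrated on a one-dimensional toy signomial problem of our own (not one from this thesis): minimize \(x\) subject to \(2/x - x \le 1\), whose optimum is \(x = 1\). Splitting \(f^+ = 2/x\), \(f^- = x\) as in (7.12)-(7.13) and condensing the denominator \(x + 1\) at each iterate via (7.15)-(7.16) yields a monomial constraint with a closed-form minimizer, so the successive step can be sketched without a GP solver:

```python
def solve_signomial_toy(x0=3.0, tol=1e-8, max_iter=100):
    """Successive GP on: minimize x subject to 2/x - x <= 1, x > 0.

    Each iteration condenses the posynomial denominator x + 1 at the
    current iterate, leaving the monomial constraint
    2 x^(-1) / (c x^a) <= 1, whose smallest feasible x is closed form.
    """
    x = x0
    for _ in range(max_iter):
        g_y = x + 1.0                         # denominator at the iterate
        a = x / g_y                           # alpha for the term 'x'
        b = 1.0 / g_y                         # alpha for the constant term
        c = (1.0 / a) ** a * (1.0 / b) ** b   # condensed: (x/a)^a (1/b)^b = c x^a
        x_new = (2.0 / c) ** (1.0 / (1.0 + a))   # smallest feasible x
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

print(solve_signomial_toy())   # approaches the optimum x = 1
```

Because condensation replaces the denominator by a global lower bound, each subproblem is an inner (conservative) approximation, so every iterate stays feasible for the original signomial constraint, mirroring the behavior of Algorithm 7.1.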

7.7 Simulation Results

In this section we present our simulation results for the algorithm. We focus on two main aspects: first, the interaction between activation and clustering and its effect on the system performance; second, how successfully we can extend the framework into a model-predictive control system. We simulate a system of 4 RRHs, 16 access areas and 10 time slots.

In Fig. (7.2) and (7.3) we plot the average activation and clustering probabilities versus the average traffic load. The main difference between these two figures is the value of β in the objective function. When β = 0, as in the second figure, the average activation probability falls significantly, suggesting that clustering brings performance improvements significant enough that activating more RRHs is not needed. This is in line with many results in the literature on the performance gain of base station clustering. However, clustering is not cost-free. First, radio resources need to be allocated so that the CSI of each RRH can be acquired. Second, strict synchronization has to be implemented across all transmitters. Third, in cloud-RAN systems clustering might consume extensive bandwidth in the data center. There are no models for all these clustering effects in the literature; for the purpose of this work, we considered adding a clustering term to the objective function to be sufficient. The effect of this term is shown in Fig. (7.2), where we can see that the two probabilities are more balanced compared to Fig. (7.3). Lastly, this figure shows that savings of at least 40% can be achieved in terms of activation energy compared to leaving all RRHs active.

In Fig. (7.4) and (7.5) we study the dependence of the activation and clustering probabilities on another important parameter of the system, the inter-RRH distance. We see a different behavior compared to the average traffic: clustering is not enough to account for the


[Plot: mean activation and mean clustering probabilities versus mean arrival traffic, 10^0 to 10^3.]

Figure 7.2: Average Activation and Clustering Probabilities versus Average Traffic Load

[Plot: average activation and average clustering probabilities versus average incoming traffic, 10^0 to 10^3.]

Figure 7.3: Average Activation and Clustering Probabilities versus Average Traffic Load, β = 0


[Plot: probability versus inter-RRH distance, 10 to 50.]

Figure 7.4: Average Activation and Clustering Probabilities versus Inter-RRH Distance

increased inter-RRH distance, whether or not we include clustering in the objective. This can be easily explained: the inter-RRH distance imposes a square-law degradation in the signal power that cannot be recovered regardless of the amount of clustering used.

The last parameter, the QoS factor γ, exhibits a middle ground between the inter-RRH distance and the traffic load, as shown in Fig. (7.6) and (7.7). Initially, clustering is enough to account for any increase in γ; then, at a certain point, more RRHs need to be activated. When clustering is considered in the objective, both the activation and clustering probabilities have to be increased to keep the system performing at the required levels.

Finally, we study how much can be achieved when the framework is extended into a model-predictive controller by operating on the predicted traffic. We simulate the system with the true traffic arrivals and then run 100 simulations where the traffic is mixed with random noise. The prediction error is shown in Fig. (7.8) and (7.9). We can see that the activation probability is predictable with 91% accuracy, while the clustering probability is predictable with 82% accuracy.


[Plot: average activation and average clustering probabilities versus inter-RRH distance, 10 to 50.]

Figure 7.5: Average Activation and Clustering Probabilities versus Inter-RRH Distance, β = 0

[Plot: average activation and average clustering probabilities versus γ, 10^0 to 10^3.]

Figure 7.6: Average Activation and Clustering Probabilities versus QoS Factor


[Plot: probability versus γ, 10^0 to 10^3.]

Figure 7.7: Average Activation and Clustering Probabilities versus QoS Factor, β = 0

[Plot: average error in activation probability versus average percentage error in traffic.]

Figure 7.8: Average Activation Probability Error versus Average Traffic Prediction Error


[Plot: average error in clustering probability versus average percentage error in traffic.]

Figure 7.9: Average Clustering Probability Error versus Average Traffic Prediction Error

7.8 Conclusion

We have studied the long-term optimization of RRH activation, clustering and association.

Our main contribution is a general formulation that includes all three variables as well as the queue evolution behavior. The resulting model can be efficiently solved using successive geometric programming. We have studied the performance when noisy estimates of the traffic are used, and have seen that the activation probabilities can be predicted with up to 91% accuracy and the clustering probabilities with up to 82% accuracy.

Chapter 8

Graph-based Diagnosis in

Software-Defined Infrastructure

8.1 Context

In cloud-RAN systems, the infrastructure controller is responsible for the monitoring, diagnosis and dynamic scaling of cloud computing resources in order to provide resource elasticity. The infrastructure controller needs to ensure the user/cell process has not been compromised before assigning any extra resources. Hence, anomaly detection is the first step towards secure resource management. However, investigating individual resource behavior may not be efficient in detecting abnormal behavior in large and complex data centers. In this chapter, we propose a scalable graph-based diagnosis framework to detect system anomalies in the Software-Defined Infrastructure running in the SAVI testbed. We have leveraged graph mining and machine learning techniques in our approach in order to detect different kinds of anomalies. We have experimentally tested our framework on several use cases: a webserver-database workload pattern, bandwidth throttling between a pair of VMs, a denial-of-service (DoS) attack on a webserver, and a Spark job failure. Our framework was able to detect all the aforementioned anomalies accurately.1

1This chapter is a joint work with Joseph Wahba, a former MSc student in our research group.



8.2 Introduction

In this chapter and the next, we focus on the cloud computing aspect of the system. In particular, we study the anomaly detection and auto-scaling problems. The cloud computing model is increasingly being adopted by enterprises and service providers, as it allows them to seamlessly manage their infrastructures. It was therefore natural to develop the next-generation wireless architecture, cloud-RAN, as a cloud-based architecture. Virtualization has become the enabling technology in today's data centers. Through virtualization of networking and computing resources, we can provide infrastructure-as-a-service (IaaS), network-as-a-service (NaaS) and platform-as-a-service (PaaS). These services enable sharing the infrastructure between different network slices. Hence, it is now possible to rapidly deploy applications on computing infrastructure and speed up the rate of innovation.

As different cloud platforms continue to grow in scale and complexity (including C-RAN), the diagnosis and management of cloud data centers and platforms becomes a critical challenge. Dynamic management of cloud resources by upscaling/downscaling is at the heart of the cloud economic model. A critical part of resource management is detecting abnormal behaviors in a data center in order to spot unusual system behaviors such as operator errors, hardware and software failures, attacks, and anomalous communication patterns.

The cloud-focused architecture we envision for C-RAN is shown in Fig. 8.1. The controller

monitors the capacity of the physical and virtual machines on which the user process is running.

It also monitors the process itself in terms of computing resources both needed and assigned.

Once the infrastructure controller decides that the user process needs more computing resources,

it can either upgrade the underlying virtual machine, or migrate the whole process to a different

physical machine altogether. This decision by the infrastructure controller can be made based

on either the actual monitored state of the process, or the controller’s own prediction of the

process needs, as will be discussed in the next chapter. An important step before the scaling

decision is identifying whether the observed behavior is normal or anomalous. If the controller

suspects that virtual machine has been compromised, then it should be quarantined instead

of given extra resources. In this chapter we start by studying the anomaly detection problem,

and we study the joint anomaly detection and resource scaling in the next chapter. While we


have not actually implemented the wireless base-band processing of the user process, we have

designed the approaches in this chapter and the next to be as applicable as possible to the full

cloud-RAN case.

The anomaly detection process can be considered as a closed-loop system, involving data collection and processing as well as decision making and execution. Resource-based anomaly detection techniques are useful in diagnosing anomalies in individual resources. By leveraging graph mining and machine learning techniques, unusual behaviors in data centers can be detected not only based on per-resource behavior, but also using a holistic view of the inter-dependency and inter-communication patterns between different resources.

One example of a data center cloud management platform is the SAVI [71] testbed on

which we have implemented our approach. The SAVI project was established to investigate future application platforms designed for rapid application deployment. The SAVI testbed has been developed for controlling and managing converged virtual resources, focused on computing and networking. A SAVI Smart Edge contains compute, network, storage, FPGA, and other

resources. OpenStack [1] is used for managing compute, storage, GPU and FPGA resources.

OpenFlow [91] controllers are used for controlling network resources such as switches.

Our main contribution in this chapter is developing a graph-based anomaly detection framework for the SAVI testbed. Our framework leverages the Apache Spark big-data platform for scalability. We have tested our framework on several use cases, including a webserver-database workload pattern, bandwidth throttling between a pair of VMs, a denial-of-service (DoS) attack on a webserver and a Spark job failure. Our framework was able to detect the aforementioned anomalies accurately.

8.3 Related Work

Graph-based anomaly detection has been studied in many different settings using various statistical tools and graph mining algorithms [12]. There are two main categories of approaches for detecting anomalies in graphs: methods for static graphs and methods for dynamic graph data.


[Diagram: end-users reach remote radio heads connected through a fronthaul network (coding, scrambling and modulation of binary input bits into I/Q signals) to cloud computing resources; the user process runs as baseband processing with lookup tables for faster processing and a switch abstraction, while the infrastructure controller observes VM utilization and issues computing-resource scaling and migration decisions over the cloud network fabric.]

Figure 8.1: Cloud-RAN Architecture - Anomaly Detection and Scaling


8.3.1 Anomaly Detection in Static Graphs

In static graphs, the main task for anomaly detection is to discover anomalous network entities

(e.g., nodes, edges) given the entire graph structure. Static graphs are either plain graphs which

do not have attributes or attributed graphs where nodes and/or edges have features associated

with them. Given a snapshot of a plain or attributed graph, the anomaly detection problem

could be defined as finding the nodes and/or edges that are few and significantly different

from the patterns observed in the rest of the graph. In static plain graphs, the only available

information is the graph’s structure. Therefore, in order to detect anomalies, the structure

of the graph is used to find patterns and spot anomalies. There are two main categories of

methods in detecting anomalies in static plain graphs: structure-based methods [58] [26] [44]

[59] and community-based methods [134] [29] [140] [153]. In static attributed graphs, anomaly

detection methods exploit the structure as well as the correlation of attributes of the graph to

find patterns and spot anomalies [102] [40] [83]. Community-based methods aim to identify outlier nodes whose attribute values deviate significantly from those of the other members of the communities to which they belong [47] [153] [98].

8.3.2 Anomaly Detection in Dynamic Graphs

Dynamic graphs are time-evolving graphs which are composed of sequences of static graphs.

Given a sequence of graphs, the anomaly detection problem can be defined as deciding whether the latest graph has become significantly different from its predecessors. Hence, it is necessary to define two things: first, the features that represent a graph; second, a distance measure between these features. Based on this distance, we can train the system to decide whether a specific

graph is anomalous or not. The authors in [25] studied different graph similarity measures, anomaly detection techniques in large network-based data, and the clustering of similar graphs together. Different approaches have been used in detecting anomalies in dynamic graphs, such as feature-based events [72] [121] [28], decomposition-based events [11] [117], clustering-based methods [73] [99] and window-based events [110] [96]. In [66], an eigen-space based approach has been

proposed for modeling graphs and detecting anomalies.

In contrast to existing work, and unlike [66] in particular, we focus on anomaly detection in the physical infrastructure itself and detect a wider range of anomalies in several use cases. We have

used a novel approach in detecting anomalies by leveraging both graph-based metrics and machine learning techniques. Our work is the first to address graph-based anomaly detection in

virtualized heterogeneous environments.

8.3.3 Graph Centrality Measures

There has been extensive work on quantifying a graph from the perspectives of centrality,

robustness, criticality and connectivity. In [36], node connectivity has been chosen as the best

metric to quantify graph robustness. The authors of [35] have introduced the symmetry-ratio

as a measure of graph symmetry. The concept of network criticality has been discussed in

[138], [139] as another measure of graph robustness. Our interest in these metrics stems from the fact

that they can be used as efficient graph features in the detection of anomalies.

8.4 System Architecture

In this section of the chapter we present our system architecture for graph-based diagnosis in the SAVI testbed. Figure 8.2 depicts the architecture of our system, which is composed of four main modules: the monitoring and measurements module, the diagnostics module, the decision making module and the orchestration module. The monitoring and measurements module is responsible for collecting different metrics from SAVI heterogeneous resources, such as network and compute metrics, and for building graphs for the different applications running in the SAVI testbed. The diagnostics module is responsible for performing the graph-based anomaly detection which we present in this chapter. The decision making module is responsible for selecting suitable actions in order to heal the system from the effects of the anomalies. Finally, the orchestration module is responsible for executing the decisions made in order to return the system to its steady-state condition. The focus of this chapter is on the diagnostics module, as this is where the anomaly detection is done.


Figure 8.2: Graph-Based Diagnosis In Software-Defined Infrastructure System Architecture

8.5 Graph Diagnosis Module Description

In this section of the chapter we present our design for the graph-based diagnostics module of

the system described in Section 8.4.

8.5.1 Application Graphs

The Application Graphs module is responsible for identifying the graphs of the different applications running in the SAVI testbed, and for classifying application graphs into static and dynamic graphs. Since different anomaly detection techniques are used for each, classifying application graphs into static and dynamic is important in identifying anomalies. The distributed nature of applications running in cloud platforms raises the importance of studying application graphs instead of individual resource behavior.


8.5.2 System Profiles

This module is responsible for saving the profiles of the different applications running in the SAVI testbed. The generated profiles represent the normal-behavior state of the running applications, and are made of different features and metrics calculated for the different application graphs. New incoming measurements are compared with these profiles in order to identify whether the monitored resource graphs are behaving normally or not.

8.5.3 Forensics

The Forensics module is responsible for investigating whether the detected graph anomalies resulted from an application misbehavior or not, and for performing root cause analysis on them: identifying the sources of the graph anomalies as well as why these sources raise such anomalies, in order to predict these anomalies in the future.

8.6 Exploratory Analysis

The goal of this section is to provide some exploratory analysis that can be done using the

graph-theoretic metrics. This exploratory analysis helps to provide a qualitative understanding

of the applications’ behavior, while the more quantitative anomaly detection is left for the

evaluation section.

8.6.1 Identifying Master Nodes

A recurrent feature of several cloud applications is the existence of "master" nodes. One example is the master node present in MapReduce frameworks such as Spark [157] and Hadoop [54]. The centrality metrics discussed in [35], [138] and [139] provide a way to identify

such nodes in a graph. Centrality metrics quantify how central a node is in the graph. We have

studied several centrality metrics for our applications, such as betweenness centrality, closeness

centrality, and degree centrality. We have conducted the study on four different applications,

whose graphs are shown in Figure 8.3. The results for the betweenness centrality are shown in Figure 8.4 and Figure 8.5.

Figure 8.3: Graphs of Different Applications

It can be observed that a larger difference between the maximum and average betweenness for a specific graph indicates the existence of a central node, as shown in Figure 8.3. For example, application 4 exhibits the largest difference between the mean and maximum centrality; its graph is composed of a central node holding three other nodes together. Next is application 3, which has a pseudo-central node holding the graph together, although some other nodes connect outside this central node. Last is application 1, which has an almost mesh-like graph, resulting in small values for the maximum centrality.
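This max-versus-mean effect can be reproduced with a small stdlib-only sketch. The star and mesh graphs below are hypothetical stand-ins for applications 4 and 1, not the measured application graphs, and betweenness is computed by brute-force shortest-path counting rather than with a graph library:

```python
from collections import deque

def betweenness(adj):
    """Normalized betweenness centrality via shortest-path counting."""
    def bfs(s):
        # Distances and shortest-path counts from source s
        dist, sigma = {s: 0}, {s: 1}
        dq = deque([s])
        while dq:
            u = dq.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w], sigma[w] = dist[u] + 1, 0
                    dq.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
        return dist, sigma

    nodes = list(adj)
    info = {v: bfs(v) for v in nodes}
    bc = {v: 0.0 for v in nodes}
    norm = (len(nodes) - 1) * (len(nodes) - 2) / 2
    for i, s in enumerate(nodes):
        for t in nodes[i + 1:]:
            d_s, sig_s = info[s]
            d_t, sig_t = info[t]
            for v in nodes:
                if v in (s, t):
                    continue
                # v lies on a shortest s-t path iff the distances add up
                if d_s[v] + d_t[v] == d_s[t]:
                    bc[v] += sig_s[v] * sig_t[v] / sig_s[t]
    return {v: b / norm for v, b in bc.items()}

# Star (one hub, like application 4) vs. full mesh (like application 1)
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
mesh = {v: [u for u in range(4) if u != v] for v in range(4)}
for name, g in (("star", star), ("mesh", mesh)):
    bc = betweenness(g)
    print(name, max(bc.values()), sum(bc.values()) / len(bc))
```

The star's hub scores the maximum possible betweenness while its mean stays low, whereas the mesh has zero betweenness everywhere: the max-minus-mean gap flags the central node.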

8.6.2 Assortativity

Another metric is Assortativity. Assortativity measures the correlation between the degrees

of the connected nodes. A graph is assortative if its high degree nodes are connected to high

degree nodes, and low degree nodes are connected to low degree nodes, in which case it will

have a more positive assortativity coefficient. We can see in Figure 8.6 that application 1 has

the highest assortativity due to the almost uniform connectivity of its nodes. On the other

hand, application 4 has the lowest assortativity. The presence of a central node in application 4 means that the highest degree node, i.e. the central node, is connected to the low degree nodes, resulting in a low assortativity coefficient.

Figure 8.4: Maximum Betweenness Centrality for Different Applications
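The sign of the coefficient can be illustrated with a stdlib-only sketch; the edge lists are hypothetical, chosen only to contrast a hub-and-spoke graph with one whose nodes connect to equal-degree peers:

```python
from collections import Counter

def degree_assortativity(edges):
    """Pearson correlation between the degrees at the two ends of each
    undirected edge (each edge counted in both directions)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    return cov / var ** 0.5

star = [(0, 1), (0, 2), (0, 3)]              # hub linked to degree-1 leaves
cliques = [(0, 1), (0, 2), (1, 2), (3, 4)]   # equal-degree nodes linked
print(degree_assortativity(star), degree_assortativity(cliques))
```

The star comes out strongly disassortative (the hub only touches degree-1 leaves), while the disjoint-cliques graph, where every edge joins equal-degree nodes, comes out strongly assortative.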

8.6.3 Physical Connectivity

The placement of VMs on the different physical machines is of profound importance for load-balancing and root-cause analysis. Figure 8.7 provides one way to analyze this behavior. Labeling the applications 1-4, the diagonal elements of the figure show how many physical machines are used by each application, while the off-diagonal elements show how many are shared between each pair of applications. For example, applications 1 and 2 have 4 physical machines in common.

8.7 Proof of Concept

In this section, we demonstrate how our graph-based anomaly detection framework operates. The first three use cases illustrate static graph scenarios, while the last use case illustrates a dynamic graph scenario.


Figure 8.5: Mean Betweenness Centrality for Different Applications

8.7.1 Webserver - Database workload pattern

In this scenario, we consider a workload running on a webserver that is serving requests by

accessing a database as shown in Figure 8.8. When the workload increases on the webserver,

the workload increases consequently on the database and vice-versa. In order to illustrate our

graph-based anomaly detection approach, we have intentionally connected another webserver

running a workload to the same database.

We train our system by monitoring the database behavior in two cases: first, running workload-1 as shown in Fig. 8.8 while workload-2 is suspended; afterwards, we suspend workload-1 and run workload-2.

The main idea in this scenario is to illustrate that monitoring the database application alone will not detect this anomaly, as its pattern is periodic and normal-looking. In order to detect anomalies in this scenario, we have used Linear Support Vector Classification (LinearSVC) [63] to perform the classification between the normal behavior and the anomalous behavior

of the system. We have trained our system, and we present our detection results in the evaluation section.

Figure 8.6: Assortativity of Different Applications
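The training step can be sketched as follows with scikit-learn; the CPU-utilization samples below are hypothetical, for illustration only, while the thesis trains on metrics collected from the testbed:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Feature vector: [webserver CPU %, database CPU %]. Normal samples
# (label 0): database load tracks the monitored webserver. Anomalous
# samples (label 1): the database is busy while the monitored webserver
# is idle, i.e. the hidden second workload is driving it.
X = np.array([[80, 70], [60, 55], [40, 35], [20, 15],
              [5, 60], [10, 75], [8, 50], [3, 65]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print(clf.predict([[70.0, 60.0], [5.0, 70.0]]))  # normal, then anomalous
```

The classifier learns a linear boundary in the joint webserver-database utilization space, which is exactly why the joint view succeeds where per-resource monitoring of the database alone would not.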

8.7.2 Bandwidth throttling

In this scenario, we consider two communicating virtual machines forming a graph, where a virtual link between two virtual machines belonging to the same application suffers from bandwidth throttling. This scenario is useful for testing the efficiency of isolation between different slices, as well as for detecting misconfiguration of the network parameters.

In this scenario, we use a time series of adjacency matrices of the graph in order to detect anomalies. We calculate the distance between every two consecutive adjacency matrices. The distance \(d_1(A, B)\) between adjacency matrices \(A = (a_{ij})\) and \(B = (b_{ij})\) can be expressed as

\[
d_1(A, B) = \sum_{i=1}^{n} \sum_{j=1}^{n} |a_{ij} - b_{ij}|
\]
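A minimal sketch of this distance on bandwidth-weighted adjacency matrices (the transfer rates below are hypothetical examples, echoing the throttling values used later in the evaluation):

```python
def d1(A, B):
    # Entry-wise L1 distance between two adjacency matrices of equal size
    return sum(abs(a - b) for row_a, row_b in zip(A, B)
                          for a, b in zip(row_a, row_b))

normal    = [[0.0, 10.0], [10.0, 0.0]]  # ~10 Mbps transfer between the VMs
throttled = [[0.0, 0.5], [0.5, 0.0]]    # link throttled to ~512 kbps
print(d1(normal, throttled))            # 19.0
```

A large jump in consecutive \(d_1\) values therefore flags a sudden change in the communication pattern, whichever direction the change takes.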

In order to train our system, first we initiate a file transfer between the two virtual machines

and we calculate the different values of d1. Afterwards, we introduce a Bandwidth throttling

over one of the virtual links between the two VMs and calculate the corresponding values of d1. Finally, we use LinearSVC in order to build a model. This model will be used in detecting bandwidth throttling anomalies between the VMs in the evaluation section.

Figure 8.7: Physical Connectivity of VMs

8.7.3 DoS attack on a webserver

In this scenario, we consider a graph composed of three nodes: a denial-of-service attacking node, a webserver node and a back-end database node.

In this scenario, we use a time series of adjacency matrices of the graph in order to detect

anomalies. We calculate the distance between every two consecutive adjacency matrices as

previously discussed. In order to train our system, first we initiate a denial of service attack

and we calculate the different values of d1. Afterwards, we use LinearSVC in order to build

a model. This model will be used in detecting DoS anomalies in the evaluation section. The

main difference between this case and the previous one is that the magnitude of d1 decreases

in the Bandwidth throttling scenario whereas in the DoS scenario it increases.


Figure 8.8: Webserver - Database workload diagram

8.7.4 Spark Job failure

In this scenario, we consider the graph of a Spark [157] cluster composed of six nodes: a Spark master node and five Spark worker nodes. The cluster we are using is running a job that collects monitoring data from the SAVI testbed core node and saves it into the Hadoop Distributed File System (HDFS) [122].

In this scenario, we use a time series of the assortativity coefficient [103] calculated for the graph in order to detect anomalies. In order to train our system, we first calculate the assortativity coefficient for the Spark cluster running the monitoring data collection job, then we intentionally kill the job to generate the labeled training dataset. Afterwards, we use LinearSVC in order to build a model. This model will be used in detecting Spark job failure anomalies in the evaluation section.


8.8 Evaluation

To evaluate our system, we have conducted several experiments to verify our approach to detecting anomalies. We performed our experiments on the core node of the SAVI testbed, which is composed of over 20 physical servers hosting a few hundred VMs. We use OpenStack and OpenFlow to collect data about the various elements in our network. Collected metrics include CPU utilization, amount of disk read and write data, amount of memory read and write data, and network bandwidth between each pair of VMs. Our experiments are reproducible by requesting access to the SAVI testbed from [2]. The details about the metrics available from OpenStack can be found in [37]. We use Hadoop as our distributed file system for data storage and Spark as our analytics framework. In the following subsections, we present the verification for each use case described in the Proof of Concept section.

Webserver - Database workload pattern

We have trained our system using 5 hours of data and tested it using 1.5 hours of data. We used the CPU utilization metrics collected from the webserver and the database. Figure 8.9 shows the testing phase of our system. The dash-dotted curve represents the CPU utilization of the webserver, and the dashed curve represents the CPU utilization of the database. The solid curve represents the labels of the test data that we know beforehand; high means anomaly, low means normal behavior. The dots represent the labels predicted by LinearSVC. Our system was able to detect all 11 anomalies accurately, as shown in Figure 8.9.

Bandwidth throttling

We have set up two virtual machines with the xlarge flavor, which has 160 GB of disk, and initiated a file transfer operation between the two VMs. The file size was 100 GB, with a transfer rate of 10 Mbps between the two VMs. The throttling value in the training phase was fixed at 512 kbps. The training set for LinearSVC contained 134 data points. Afterwards, we tested our system by repeating the same experiment while applying a randomly varying throttling value between 1 Mbps and 5 Mbps, as shown in Figure 8.10. The dotted curve represents the time-varying


Figure 8.9: Webserver Database testing phase

d1, the solid curve represents the labels of the test data that we know beforehand, and the dots represent the labels predicted by LinearSVC. Our system was able to detect all 35 anomalies accurately, as shown in Figure 8.10.

Denial of Service Attack

We have trained our system using 118 data points, collected by initiating a denial-of-service attack for 4.46 hours. Afterwards, we tested our system by repeating the experiment for 2 hours, as shown in Figure 8.11. The dashed curve represents the time-varying d1, the solid curve represents the labels of the test data that we know beforehand, and the dots represent the labels predicted by LinearSVC. Our system was able to detect all 26 anomalies accurately, as shown in Figure 8.11.


Figure 8.10: Bandwidth throttling testing phase

Spark Job Failure

We have trained our system using 327 data points collected from the Spark cluster over 10.9 hours. Afterwards, we tested our system by repeating the experiment for 5.5 hours, as shown in Figure 8.12. The solid curve represents the time-varying assortativity coefficient, and the dashed curve represents the labels of the test data that we know beforehand. The dots represent the labels predicted by LinearSVC; high means normal behavior, low means an anomaly. Our system was able to detect all 30 anomalies accurately, as shown in Figure 8.12.

We have repeated the previous experiments several times. Since the dimensionality of the data is relatively low and our problem is linearly separable, the LinearSVC algorithm works accurately in all iterations. However, if the problem complexity increases, the performance of the support vector machine algorithm is expected to degrade, and several simulations would be required to evaluate it. Higher dimensionality and anomaly data that are not linearly separable can increase the problem complexity.


Figure 8.11: DoS attack testing phase

8.9 Conclusion

In this chapter we have designed and evaluated a graph-based diagnosis framework for the Software-Defined Infrastructure running on the SAVI testbed. Our framework is able to accurately detect system anomalies by leveraging different graph-mining and machine learning techniques. We have tested our framework on several use cases covering different kinds of anomalies affecting various types of application graphs.


Figure 8.12: Spark Job failure testing phase

Chapter 9

Auto-Scaling and Anomaly

Detection in Software-Defined

Infrastructure

9.1 Context

In continuation of Chapter 8, we resume our study of how the infrastructure controller can

accurately and securely manage the cloud computing resources. The software-defined nature

of next-generation cloud environments is a great enabler for building efficient resource management frameworks. The elasticity of the cloud environment enables scaling up resources

according to the customer’s demand, offering a significant advantage over dedicated private

infrastructure systems. For example, in cloud-RAN the system load is directly dependent on

the number of active users, which exhibits strong cyclo-stationary behavior. However, the efficiency of such auto-scaling systems is limited by their ability to identify anomalous patterns in the customer's behavior, in order to avoid unnecessary scaling when the system is breached.

Motivated by these two lines of reasoning, we propose a framework for an anomaly-aware auto-

scaling system. We employ a stochastic control framework that is able to predict the future

states of the system and identify the need to pro-actively scale the resources, as well as to

detect anomalies if the observed state is significantly different from the predicted one. We have


Chapter 9. Auto-Scaling and Anomaly Detection in Software-Defined Infrastructure 154

implemented our framework as part of the SAVI testbed. We have leveraged the Spark Big-

Data platform to make our framework scalable with the large number of resources present in

the cloud. We present experimental results in which we observed 95% prediction accuracy as well as over 90% anomaly detection accuracy.

9.2 Introduction

A significant challenge in cloud systems is to automatically scale the resources according to

the customer's needs [85]. The infrastructure provider should be able to observe the usage pattern and release any idle resources to avoid overcharging the customers. More important is the ability to upgrade the assigned resources to prevent performance bottlenecks. However, in the current networked world, the ability to estimate a customer's behavior

is hindered by the various security attacks that might breach its system [145]. For this reason,

we provide the first, to our knowledge, study for joint anomaly-detection and auto-scaling for

software-defined infrastructure. The key concept in our solution is to employ a non-parametric

prediction scheme. Such a scheme should be able to not only predict the future state of the

system, but also provide the confidence level of its prediction. Using the predicted state and

its confidence level, an automated management system will then decide if the resource is to

be scaled up or down, and when the actual state is observed, decide whether it is normal or

anomalous.

In this chapter, we continue the theme we started in Chapter 8 about the auto-scaling and

anomaly detection in cloud computing systems. In particular, we study how auto-scaling can

be combined with anomaly detection to provide a general resource management framework for

cloud resources. The system architecture under study is shown in Fig. 9.1. This is the same

architecture used in the previous chapter where the infrastructure controller reads the state of

the user processes and the computing resources in order to come up with the necessary scaling

and migration decisions. In this chapter we focus on how auto-scaling can be designed jointly

with anomaly detection, unlike the previous chapter where we focused solely on the anomaly

detection part. We propose a pro-active auto-scaling policy where the infrastructure controller

predicts the future state of the system, e.g. the computing needs of the user process. In wire-


less virtualization, the computing needs are directly dependent on the channel state. Better

channel conditions are utilized to increase the transmission through more advanced modulation

schemes. This increases the load on the modulation and coding blocks and consequently needs

more computing power. However, if the user is moving at a reasonable speed, then the chan-

nel exhibits strong temporal correlation. This correlation can be leveraged to build efficient

predictors for the future channel conditions based on the current ones. If we can predict the

channel, then we can estimate the needed computation, and consequently assign new resources

if needed.

9.3 Related Work

The problem of auto-scaling has received wide attention combined with the rise of cloud com-

puting. Auto-scaling capabilities are already implemented by cloud providers such as Amazon

[7], Google [5] and Microsoft [6]. Even when the cloud is not intended as an infrastructure as

a service, as is the case with Facebook, cloud owners would still try to scale their resources to save on electricity costs, for example.

The simplest way to do auto-scaling is through pre-defined thresholds. When the measured

metric exceeds the upper threshold, the resource is scaled up, and the opposite happens when

it goes below the lower threshold [31], [30]. One of the first challenges facing such a simplistic approach is oscillation [38]. As per common engineering practice, the single threshold can be replaced by two-level thresholds in an attempt to estimate the pattern of the running application [56]. Adding time constraints is also a viable solution, where the scaling decision is made only if the resource state exceeds the threshold for a specific time duration [31]. There remain

however two main drawbacks with such an approach. The first is the re-active nature of the

decisions. Scaling decisions are typically made after the resource has entered the danger zone.

The second and more important drawback is the difficulty of selecting good thresholds, the

lack of a systematic way to do such a selection, and the non-obviousness of an efficient way to

measure its performance. These reasons have motivated researchers to look for better ways to

address the problem.
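The two remedies above, two-level (hysteresis) thresholds and a dwell-time constraint, can be combined in a few lines. The sketch below is illustrative, with made-up parameter values rather than ones from the cited work:

```python
def threshold_scaler(utilization_series, upper=80.0, lower=30.0, dwell=3):
    """Re-active threshold auto-scaling with a dwell-time constraint.

    Scales up (+1) only after `dwell` consecutive samples above `upper`, and
    down (-1) after `dwell` consecutive samples below `lower`; the dwell time
    damps the oscillation discussed in the text. All parameter values are
    illustrative, not from the thesis.
    """
    decisions = []
    above = below = 0
    for u in utilization_series:
        above = above + 1 if u > upper else 0
        below = below + 1 if u < lower else 0
        if above >= dwell:
            decisions.append(+1)
            above = 0
        elif below >= dwell:
            decisions.append(-1)
            below = 0
        else:
            decisions.append(0)
    return decisions

# A brief 2-sample spike is ignored; a sustained overload triggers a scale-up.
print(threshold_scaler([50, 85, 90, 50, 85, 90, 95]))  # -> [0, 0, 0, 0, 0, 0, 1]
```

Even with both fixes, the decision remains re-active and the thresholds themselves remain arbitrary, which is exactly the drawback the text goes on to describe.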

Queuing theory has traditionally been the main tool to study the performance of networking


Figure 9.1: Cloud-RAN Architecture - Auto-scaling and anomaly detection


Figure 9.2: SAVI testbed Architecture

and servers applications [92]. The theory has been used to model cloud environments, mainly

through G/G/1 or G/G/n models [13], [142]. The main goal of this analysis is to find the

expected peak load, the necessary resources needed to serve a certain workload, or the mean

response time for requests. As such, queuing theory is more appropriate for planning purposes

rather than for on-line decision making. Queuing theory suffers, to some extent, from the rigidity of

its models. Additionally, it is typically hard to extend the traditional models to new scenarios

such as MapReduce applications with master and slave nodes. Moreover, the queue control

problems, when they are studied, typically end up with threshold-based decisions anyway [15].

The controller design aspect of the auto-scaling problem has naturally led to the adop-

tion of techniques from the field of control theory [105]. The main goal here is the controller

design. Controller classes such as proportional-integral (PI), proportional-derivative (PD) and proportional-integral-derivative (PID) have been proposed [81], [164]. The main challenge in the

controller design process is choosing the transfer function, state-space function or the perfor-

mance model [85]. This difficulty can be partially overcome by adopting adaptive controllers,


which can be considered as special cases of the more general reinforcement learning problem

[135].

The ability to make the scaling decision pro-actively involves predicting the future state of

the resource [137]. Predictive analysis combined with reinforcement learning is considered in the literature to be the most promising solution [85]. These techniques have been studied in [137], [39], [112], [150]. However, a major challenge that has not been addressed

before is the ability to correctly distinguish the true system resource states from the anomalous ones, and to consider this distinction when making the scaling decisions.

Our main contributions are as follows: we provide a general formulation for the stochastic control problem of joint anomaly detection and auto-scaling in cloud computing systems. We also discuss the practical aspects of the problem, drawn from our experience implementing it in OpenStack. Based on these two points, we propose a solution policy that can pro-actively

decide about the scaling and re-actively detect the anomalies. We discuss our implementation

of the framework within the SAVI testbed. We provide experimental results for the framework

including prediction accuracy, anomaly-detection accuracy, as well as other aspects of how the

involved components work within OpenStack.

9.4 System Model

Consider a cloud computing system where a cloud customer, or equivalently an application,

needs a set of resources V. Due to anomalous behaviors or security attacks, the customer is actually provided with the resource set V̂, where typically V̂ ⊂ V. Let f(·) be the cost function mapping a given set of resources to its actual cost. Our optimization problem is

then as follows:

Problem 9.1:

    min_{V̂}   E{ f(V̂) + ν |M| }

    s.t.   g(V̂) ≤ δ

           M = V − V̂                                        (9.1)


where g(·) is a quality-of-service (QoS) measure function, M is the mismatch, possibly due to anomalies, between the needed and assigned resources, |M| is its cardinality, ν is a weight

parameter that determines the trade-off between the two terms of the objective function, and

δ is the QoS parameter. The goal of the optimization problem is to minimize the expected cost

incurred and avoid assigning any unneeded resources.

Problem (9.1) belongs to a class of problems known as stochastic optimization problems

[109]. Outside of using Bellman’s equation [22], which suffers from the curse of dimensionality,

this class of problems has no systematic solution method [109]. The general approach to solving these problems is as follows:

1. Define the set of observable states of the system S.

2. Define the action set A.

3. Define the policy pθ : S → A as the state-action mapping parametrized by θ.

4. Optimize the policy over its parameters, p∗ = arg max_θ p_θ.

To define our policy, we have to specify more details about problem (9.1).

9.4.1 Cost Measure and Quality of Service

In this chapter, we adopt the following notations:

• V = {v_i^h : i ∈ {1, 2, ..., N_v}, h ∈ H}

• f(V) = Σ_{h ∈ H} c_h |V^h|

• g(V) = P(u(v_i^h) ≥ γ) ∀ v_i^h ∈ V

where h denotes the flavor of the VM v_i^h, N_v is the number of VMs in the set V, H is the set of available flavors (typically small, medium and large), and c_h is the cost associated with a specific flavor.

The first condition states that the set of resources we consider is the set of virtual machines

(VMs). Each virtual machine has an index, and a flavor. The flavor of the VM determines

its computing power in terms of CPU, memory and so on. The second condition states that


the cost incurred is proportional to the number of assigned VMs, and the cost of each VM

is dependent upon its flavor. The third condition is our QoS metric, which states that the

probability that the utilization of the VM exceeds a specific threshold γ should be below a

certain value δ, as in (9.1).
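Under the second condition, the cost is simply a flavor-weighted VM count, f(V) = Σ_{h∈H} c_h |V^h|; a minimal sketch (the flavor names and prices are illustrative placeholders, not SAVI figures):

```python
def flavor_cost(vms, flavor_costs):
    """Cost f(V) = sum over flavors h of c_h * |V^h|.

    vms: list of flavor names, one entry per assigned VM.
    flavor_costs: cost c_h per flavor (illustrative values below).
    """
    return sum(flavor_costs[h] for h in vms)

costs = {"small": 1.0, "medium": 2.0, "large": 4.0}
print(flavor_cost(["small", "small", "medium", "large"], costs))  # -> 8.0
```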

In such a case, the state and action are specifically defined as follows:

• s(t) = u(v_i^h, t) ∀ v_i^h ∈ V

• a(v_i^h) = v_i^{h+1}, a(v_i^h) = v_i^{h−1}

The state at time t is defined as the utilization levels of all assigned VMs. The action applied

on a VM v_i^h can either upgrade its flavor to the next level v_i^{h+1}, or downgrade it to the previous one v_i^{h−1}. Note that in general we could transition between any two flavors, not necessarily

consecutive. These definitions of actions and states are compatible with the enabling technolo-

gies in the cloud systems, as will be explained in the next section.

Before we discuss the policy, we have to consider some additional aspects of the problem:

Re-Active versus Pro-Active

Auto-scaling decisions can be made either reactively or pro-actively [85]. Reactively means

that we observe the state first, and then decide the action. On the other hand, a pro-active

decision first involves some form of prediction, then we base our action upon the predicted state.

Intuitively, re-active techniques are less computationally demanding than pro-active ones, while

the latter are expected to outperform the former due to their anticipative nature. The choice between the two techniques depends on the frequency of the state observations and

the time needed to execute the action. In our case, when OpenStack receives a re-size request,

it can either re-scale the VM on the same physical machine, or migrate it to another one. The

migration process takes on average 10 times the time of the same-machine scaling (around

20 minutes for migration from our experiments). Note that depending on the application

requirements, even the no-migration re-size time might be too long. For such reasons, we have

decided to proceed with a pro-active policy.


Parametric vs. Non-parametric Models

Predicting the future state involves a machine-learning problem. In general, machine learning

techniques can be classified into two broad categories: Parametric and Non-Parametric [113].

Unlike parametric models, non-parametric models are considered to be an optimization over

the space of functions, making them more flexible and able to capture more general patterns.

The other reason why we chose non-parametric models is that they are better at capturing

correlations across time, as they use the past samples directly in predicting the next one [113].

One popular non-parametric model is known as the Gaussian process [113], which assumes a

jointly Gaussian distribution for all observed data points. More details about the Gaussian

process can be found in [113].
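As a concrete sketch of such a predictor, the standard RBF-kernel Gaussian process posterior can be written in a few lines of NumPy. The kernel choice and hyper-parameter values are assumptions; the thesis defers the details to [113]:

```python
import numpy as np

def gp_predict(t_train, y_train, t_test, length=5.0, sigma_f=2.0, noise=1e-2):
    """Gaussian-process posterior mean and variance with an RBF kernel.

    Past utilization samples are used directly in the prediction, which is
    what makes the model non-parametric. The RBF kernel and hyper-parameter
    values are illustrative assumptions.
    """
    def k(a, b):
        d = a[:, None] - b[None, :]
        return sigma_f ** 2 * np.exp(-0.5 * (d / length) ** 2)

    y_mean = y_train.mean()                 # center the data (zero-mean GP prior)
    K = k(t_train, t_train) + noise * np.eye(len(t_train))
    K_star = k(t_test, t_train)
    mean = K_star @ np.linalg.solve(K, y_train - y_mean) + y_mean
    var = sigma_f ** 2 + noise - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    return mean, var

t = np.arange(10.0)                         # past sampling instants
y = 20.0 + 2.0 * np.sin(t / 3.0)            # smooth CPU-utilization history (%)
mu, var = gp_predict(t, y, np.array([10.0]))        # one-step look-ahead
lo, hi = mu - 2 * np.sqrt(var), mu + 2 * np.sqrt(var)  # ~95% interval
print(mu[0], lo[0], hi[0])
```

An observation at the next time slot falling outside [lo, hi] would then be declared an anomaly, which is exactly the role the confidence interval plays in the policy defined next.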

Policy Definition

Following the definitions of the state and action sets, we define our policy in Algorithm 9.1.

Algorithm 9.1 Execute at each time slot t

Input: the historical utilization data u(v_i^h, t) ∀ v_i^h ∈ V, ∀ T − L ≤ t ≤ T
Return: the action set a(v_i^h, T + l_a) ∀ v_i^h ∈ V

Anomaly Detection
for all v_i^h ∈ V do
    if u(v_i^h, T) > β or u(v_i^h, T) < α, where P(α ≤ u(v_i^h, T) ≤ β) ≥ e then
        Declare an anomaly, i.e. the utilization is outside the interval corresponding to a confidence level e
    end if
end for

Scaling Decision
for all v_i^h ∈ V do
    f_{u(v_i^h, T+l_a), u(v_i^h, T), ..., u(v_i^h, T−L)}(u_p, u_0, ..., u_L) ∼ N(µ, K)
    f_{u(v_i^h, T+l_a)}(u_p) ∼ N(µ_p, σ_p²)
    if P(u(v_i^h, t) ≥ γ) ≥ δ ∀ T ≤ t ≤ T + l_a then
        a(v_i^h) = v_i^{h+1}
    else if P(u(v_i^{h−1}, t) ≥ γ) ≤ δ ∀ T ≤ t ≤ T + l_a then
        a(v_i^h) = v_i^{h−1}
    end if
end for

The notations used are as follows: L is the length of the data collection window, T is our


operation interval, l_a is the look-ahead interval, e is the confidence level, β − α is the length of the confidence interval, and N(µ, K) and N(µ_p, σ_p²) are the Gaussian distributions resulting from

fitting the Gaussian process.

The process we follow in Algorithm 9.1 is as follows: the Gaussian process model is trained

for each VM. The main feature used in the training is the CPU utilization. We have used

both the CPU utilization of that VM as well as that of its connected VMs. The networking

information is available from OpenFlow.

Once the training phase is done, two decisions have to be made by the algorithm. The first decision is whether the observed state is anomalous. From the Gaussian process model, we

predict the expected value as well as the confidence interval for the next time slot. When the

next measurement comes in, an anomaly is declared if the observed data point is outside the

specified confidence interval, for example a 95% confidence interval can be used in such a case.

If the observation is declared to be safe, we then use the Gaussian process to predict the next state.

The second decision is the scaling decision. Using the Gaussian distribution assumption, we

can estimate the probability that the QoS constraint is violated. If such probability exceeds a

certain limit, then an up-scaling decision is made. On the other hand, in order for the policy

to minimize the objective function, we consider the case for a down-scaling. If a down-scaling

can be made while keeping the QoS constraint intact, then we proceed with the down-scaling.

Note that depending on the rate of change of the resource utilization, we might need to do a

multi-level up-scale or down-scale.

Note that the proposed algorithm does not solve problem (9.1) in its original form, as this is

typically infeasible for stochastic optimization problems. Instead, the proposed policy tries

to achieve the same goals, i.e. resource scaling and anomaly detection, based on the capabilities

of our practical system.
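The scaling step of the policy reduces to evaluating Gaussian tail probabilities against the QoS pair (γ, δ). A sketch, with illustrative parameter values and a simplifying assumption (not from the thesis) that utilization roughly doubles on the next-smaller flavor:

```python
import math

def violation_prob(mu_p, sigma_p, gamma):
    """P(u >= gamma) when the predicted utilization is u ~ N(mu_p, sigma_p^2)."""
    return 0.5 * math.erfc((gamma - mu_p) / (sigma_p * math.sqrt(2.0)))

def scaling_action(mu_p, sigma_p, gamma=80.0, delta=0.1, shrink=2.0):
    """One scaling decision: +1 upgrade, -1 downgrade, 0 keep the current flavor.

    gamma/delta are the QoS pair of problem (9.1); `shrink` models, as an
    assumption, how utilization grows on the next-smaller flavor.
    """
    if violation_prob(mu_p, sigma_p, gamma) >= delta:
        return +1                 # upgrade flavor: v_i^h -> v_i^{h+1}
    if violation_prob(shrink * mu_p, shrink * sigma_p, gamma) <= delta:
        return -1                 # safe to downgrade: v_i^h -> v_i^{h-1}
    return 0

print(scaling_action(75.0, 5.0))   # mean near gamma -> up-scale (+1)
print(scaling_action(20.0, 2.0))   # light load, safe even doubled -> down-scale (-1)
print(scaling_action(50.0, 5.0))   # keep current flavor (0)
```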

The following parameters control the performance of the algorithm. For the anomaly detection part, e is the level of confidence and [α, β] is the corresponding interval. If e is chosen to be 95%, then α ≈ µ_p − 2σ_p and β ≈ µ_p + 2σ_p. L is the window of past measurements that

are used in predicting the next state. Larger window sizes typically result in better predictions,

but at the expense of increased complexity. The other parameter is l_a, which is the look-ahead time. Typically for a Gaussian process, the further ahead we are trying to predict, the less


confident we are in our predictions.

9.5 Experimental Setup

The proposed control framework requires an infrastructure that provides three important functionalities:

1. Sensing: acquiring relevant data about the computing nodes and preparing them for

collection.

2. Collecting: gathering the measured data from all sources into a centralized location for

further processing.

3. Processing: the data is analyzed and the control decisions are executed.

The SAVI testbed we used to test our framework provides us with these functionalities. For

the sensing component, agents are installed on the machines to acquire the data. These agents

interact with the OpenStack Ceilometer component and acquire data such as CPU utilization, memory read and write volumes, and networking traffic, among others. The data is then transmitted through a Kafka messaging server [79] and collectively stored using the Hadoop distributed

file system [123]. For the processing part, we have chosen to go with the Spark BigData ana-

lytics platform [158]. The distributed processing capabilities of Spark enable us to handle large

volumes of data that are typically present in cloud environments. For more details about the

monitoring system, please refer to [82].

9.6 Experimental Results

In this section we report our experimental results for the prototype of the proposed framework.

We have focused on two cloud applications: a web application composed of a web server and a database, and a BigData application composed of a Spark cluster running a streaming job. The anomaly we used is a denial-of-service (DoS) attack on the server. The auto-scaling is done through the Nova module of OpenStack. We measure both the confidence-interval prediction/detection accuracy and the normalized mean square error (NMSE).


9.6.1 Prediction

In Fig. 9.3 we show the time series plot of the measured and predicted CPU utilization for a

Web application. The green band is the prediction confidence interval. In this scenario, we use

the Gaussian process to predict every tenth sample. The figure shows that even though the

predicted utilization might deviate from the actual measurement, using the confidence interval

makes the prediction more robust as most measurements actually fall within the predicted

interval.

The non-parametric models, unlike the parametric ones, require keeping the data points for

use in the prediction phase. In Fig. 9.4 we plot the prediction accuracy versus the number

of past observations stored, denoted as the window size. The figure shows that the prediction

accuracy, representing the percentage of measured points falling within the predicted confidence

interval, is around 95%. The figure also shows that a window of around 100 samples can provide

satisfactory performance, and there may be no need to increase the window size beyond that.

In Fig. 9.5 and Fig. 9.6, we show the time series plot for Spark master and worker nodes.

We can see that the Spark nodes, especially the master, exhibit more dynamic behavior than

the web server. The Gaussian process parameters need to be adjusted to increase the size of the confidence interval. Hence the width of the confidence interval used for the BigData

application is typically larger than the one used for the Web application. In Fig. 9.7, we

show the prediction accuracy for the BigData application. Similar to the web application,

the prediction accuracy using the confidence interval is in the order of 95%. However, the

normalized mean square error (NMSE) is larger due to the more dynamic behavior of the Spark

nodes. These results justify our choice of basing our decisions upon confidence intervals since

they offer more robust predictions.

In general, the nature of the application might make it harder to predict its performance. A

web server that receives many requests will have very short cycles of peak performance that will

stabilize into a stationary behavior at the measurement time scale. For the BigData application

where the CPU utilization is mainly dependent upon the code being executed, different pieces

of code might induce different CPU loads, making the utilization harder to predict.


9.6.2 Anomaly Detection

In Fig. 9.8 we have a time series plot of the server CPU utilization in the presence of DoS

attacks. The server runs for around half an hour and then gets a DoS attack for a period of

five minutes, visible as the spikes in the CPU utilization. We use the data gathered during

the normal 30 minutes to predict the states during the anomalous 5-minute duration. Note that we receive 2 measurements during the 5-minute interval. We declare an anomaly if the measurement lies outside the predicted confidence interval. Since we always have two anomalous points at a time, our anomaly detection accuracy for each single DoS attack is either 0%, 50% or 100%. We also steadily increase the normal load with time. The detection accuracies, i.e.

the percentage of points declared as anomalies, are shown in Fig. 9.9, where they are plotted

versus the mean load of the server during the normal operation state.

Our main observation from this experiment is that there is a trade-off between the ability

to detect anomalies and the utilization level of the VM. As the mean utilization level goes

up, the anomaly detection accuracy goes down. The explanation for this phenomenon is as follows: when the VM is operating at 20% utilization, for example, it is easy to declare a jump to 90% as an anomaly. Compare this with the case when the normal state is around 80%

utilization, which makes a change to 90% utilization very probable. Going back to our objective

function defined in problem (9.1), we see that there is a trade-off between the two components of

the objective function, namely minimizing the used resources and maximizing the anomaly

detection accuracy. Minimizing the used resources means that the assigned VMs have to be

operating near their maximum utilization, which makes the anomalies harder to detect.

One point that has not been studied in this chapter is closing the control loop when the

anomaly has been detected. This involves the action to be taken to counter the anomaly,

as well as an efficient state update mechanism that takes into account our confidence of the

measurements.

9.7 Conclusion

In this chapter we have proposed a control framework for joint anomaly detection and auto-scaling in software-defined infrastructures. We have proposed a policy based on the Gaussian



Figure 9.3: Example of CPU utilization Prediction for a Web application


Figure 9.4: Prediction Accuracy for a Web application



Figure 9.5: Example of CPU utilization Prediction for a BigData application(Master)


Figure 9.6: Example of CPU utilization Prediction for a BigData application(Worker)



Figure 9.7: Prediction Accuracy for a BigData application


Figure 9.8: Example of CPU utilization Prediction in anomalous scenarios

Chapter 9. Auto-Scaling and Anomaly Detection in Software-Defined Infrastructure169

10 20 30 40 50 60 70 80 90Mean Utilization

−0.2

0.0

0.2

0.4

0.6

0.8

1.0

1.2

Anomal Detection Accurac

MeasurementsFitted Curve

Figure 9.9: Anomaly Detection Accuracy for a Web application

process prediction mechanism. The proposed framework is implemented as part of the SAVI

testbed. Our measurements have shown a 95% prediction accuracy as well as 90% anomaly

detection accuracy. One of the main observations from our experiments is that there is a trade-

off between minimizing the amount of used resources and the ability to detect their anomalies.

This work is part of an ongoing effort for using BigData techniques to design efficient diagnosis

and control techniques for the SAVI testbed.

Part IV

Conclusion


Chapter 10

Conclusion and Future Work

Cloud-RAN is a crucial part of the architecture for 5G systems. The motivation for moving towards cloud-RANs stems from market reasons, such as the specialized wireless infrastructure and the difficulty and time cost of deploying new technologies, as well as scientific reasons, mainly the centralized information model needed for novel interference management schemes. While cloud-RAN, and 5G, might not be as fundamental a change from 4G LTE as LTE itself was from WCDMA, there are many issues that need to be addressed before any possible deployment.

In this thesis, we have studied various issues related to the migration of wireless systems to the cloud as envisioned in cloud-RAN systems. We have tried to cover all areas of the architecture, from PHY-layer multiplexing to cloud computing management, including MAC-layer scheduling as well as network-wide control.

10.1 Contribution

The focus of this thesis has been studying the cloud-RAN architecture, identifying its deployment challenges, and providing potential solutions to these challenges. In particular, we have identified two themes of challenges: the cloud computing model, and the slicing of the network resources. For the slicing of the resources, one main question is the admission control and embedding of the slices. We identified that the stochastic arrival of the network's users is a crucial element in this respect, since it underlies the statistical multiplexing gains which


lie at the heart of the motivation for using cloud computing resources. Directly related to this

aspect is the choice between the different wireless multiplexing schemes and how they utilize the stochasticity of user arrivals. Another crucial element is the complex dependency

between the resources used in the PHY-layer, e.g. control and data channels. This represents

a significant challenge whenever heterogeneous architectures are to be co-hosted by the same

infrastructure.

The cloud computing model also presents another set of challenges as well as architecture

elements. One crucial difference between cloud-based architectures and current ones is that

computing resources in the cloud are composed of a large number of virtual machines, which

raises the need for distributed computations as well as efficient inter-machine communication

protocols. Secondly, cloud architectures leverage resource elasticity, hence the architecture has

to be able to scale not only the computing resources, but also the access resources. Scaling can also be done jointly for these two kinds of resources, for example by leveraging CSI to upscale or downscale the container/VM computing resources, and by co-locating VMs serving same-cell users on the same physical machine.
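The CSI-driven scaling decision can be sketched as follows. This is an illustrative rule, not a mechanism implemented in the thesis: the function name, the cycles-per-bit constant and the vCPU capacity are assumptions; the only grounded idea is that the baseband compute load grows with the rate the channel supports:

```python
import math

def vcpus_for_user(snr_db, bandwidth_hz=10e6, cycles_per_bit=100.0,
                   cycles_per_vcpu=2.0e9):
    """Map a user's channel state to a compute allocation (illustrative).
    Shannon capacity gives the peak bit rate the PHY pipeline must sustain;
    cycles_per_bit models the baseband processing cost per bit (an assumed
    constant), and cycles_per_vcpu the capacity of one virtual CPU."""
    rate_bps = bandwidth_hz * math.log2(1.0 + 10 ** (snr_db / 10.0))
    return max(1, math.ceil(rate_bps * cycles_per_bit / cycles_per_vcpu))

# A better channel sustains a higher rate and so demands more compute,
# which would trigger an upscale decision for the user's container/VM.
low = vcpus_for_user(0.0)    # ~10 Mb/s channel
high = vcpus_for_user(20.0)  # ~66 Mb/s channel
```

Under this model, a CSI report crossing a rate boundary is exactly the event that should drive the joint radio/compute scaler.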

In light of the above discussion, the detailed contributions of the thesis are as follows:

10.1.1 PHY-Layer Admission Control and Network Slicing

In this chapter, we have studied the problem of joint admission control and slicing in virtual wireless networks. We have provided characterizations of the QoS performance and its relation to the stochastic traffic, and used them to devise a low-complexity three-step algorithm to tackle the problem. Our simulation results have covered the trade-offs between frequency and spatial multiplexing, admission control and utilization, as well as the accuracy of the QoS bounds.
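The statistical-multiplexing argument behind the admission step can be illustrated with a short sketch. This is not the chapter's algorithm; it uses a generic Gaussian traffic approximation with an assumed overflow target, but it shows why aggregate admission beats per-slice peak provisioning: the safety margin grows only with the square root of the total variance:

```python
import math

def admit_slices(slices, capacity, eps=1e-3):
    """Greedy admission under a Gaussian traffic approximation: a slice set
    is feasible if the mean load plus a safety margin (set by the target
    overflow probability eps) fits the capacity. Returns admitted slice ids."""
    z = math.sqrt(2.0 * math.log(1.0 / eps))   # Chernoff-style quantile
    admitted, mean, var = [], 0.0, 0.0
    for name, mu, sigma in slices:             # (slice id, mean, std dev)
        m, v = mean + mu, var + sigma ** 2
        if m + z * math.sqrt(v) <= capacity:   # statistical multiplexing:
            admitted.append(name)              # margin grows only as sqrt(v)
            mean, var = m, v
    return admitted

demo = [("sliceA", 30, 5), ("sliceB", 40, 8), ("sliceC", 25, 6)]
ok = admit_slices(demo, capacity=100)
```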

10.1.2 Multi-Operator Scheduling in Cloud-RANs

In this chapter we have studied the scheduling of multiple VOs in a cloud-RAN environment. We modeled the case where the VOs employ heterogeneous communication protocols. We have shown that the coordination problem in this case is in general NP-hard. We then proceeded by specifying two special cases and providing the optimal algorithm for each. Finally, we proposed a novel neuro-computation heuristic, which is able to handle the general problem while still providing close-to-optimal results for the special cases studied. The simulation results confirm the effectiveness of the proposed heuristic and offer insight into the operation of scheduling in cloud-RAN networks.
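To make the neuro-computation idea concrete, the following is a generic sketch of Hopfield-style energy descent on a toy conflict-graph scheduling instance; the network, penalty weight and update rule are illustrative assumptions rather than the chapter's exact formulation:

```python
import numpy as np

def hopfield_schedule(utility, conflict, penalty=10.0, sweeps=50, seed=0):
    """Toy Hopfield-style descent for conflict-free scheduling (illustrative,
    not the thesis formulation). Neuron i = 'transmission i is scheduled';
    the energy rewards utility and penalizes scheduling conflicting pairs,
    so asynchronous updates settle into a conflict-free fixed point."""
    n = len(utility)
    rng = np.random.default_rng(seed)
    s = rng.integers(0, 2, n).astype(float)
    for _ in range(sweeps):                      # asynchronous sweeps
        for i in rng.permutation(n):
            # net input: own utility minus penalty from active conflicts
            h = utility[i] - penalty * conflict[i] @ s
            s[i] = 1.0 if h > 0 else 0.0
    return s

# Three links where link 0 conflicts with links 1 and 2; the descent
# settles on a conflict-free subset of links.
u = np.array([3.0, 2.0, 2.0])
c = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
s = hopfield_schedule(u, c)
```

As with any Hopfield descent, the fixed point reached depends on the initialization, which is why such heuristics are typically run from several random starts.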

10.1.3 Fully Distributed Scheduling in Cloud-RAN Systems

In this chapter we have studied the distributed scheduling problem in cloud-RAN systems. We provided an analytical treatment of Rayleigh channels under a maximum-throughput scheduler, and found that distributed scheduling in this case is able to provide around 92% of the centralized performance. We then extended the scheme to general channels and schedulers by adopting classification techniques from machine learning. We discovered two conflicting effects that depend upon the fairness of the scheduler: less fair schedulers are easier to predict, but the penalty for wrong decisions is more severe. With enough training and efficient parameter selection, the distributed schedulers are able to provide up to 89% of the centralized performance.
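The classification idea can be illustrated with a small Monte-Carlo sketch (not the thesis algorithm): each user predicts, from its own Rayleigh gain alone, whether the centralized max-throughput scheduler would have picked it, using a simple threshold classifier; the threshold choice is an assumption:

```python
import numpy as np

def local_decision_accuracy(n_users=10, slots=20000, seed=1):
    """Sketch of the local-classification idea (illustrative): each user
    predicts 'I am the centrally scheduled user' from its own channel gain
    alone, via the threshold t with P(gain > t) = 1/n, so one user exceeds
    it on average per slot. Returns the per-decision prediction accuracy
    against the centralized max-throughput scheduler's choice."""
    rng = np.random.default_rng(seed)
    g = rng.exponential(size=(slots, n_users))   # Rayleigh power gains
    label = g == g.max(axis=1, keepdims=True)    # centralized decision
    t = -np.log(1.0 / n_users)                   # threshold classifier
    pred = g > t
    return (pred == label).mean()

acc = local_decision_accuracy()
```

Even this crude single-threshold classifier agrees with the centralized decision on over 90% of the per-user decisions, which is the effect a trained classifier exploits more systematically.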

10.1.4 Joint RRH Activation and Clustering in Cloud-RANs

In this chapter we have studied the problem of joint clustering and RRH activation in cloud-RAN networks. We have provided a two-step approach to overcome the combinatorial nature of the problem. The first step uses a linear-program approximation over an interference graph to produce a feasible solution. The second step greedily improves this solution, searching over both activation and clustering decisions. Our simulation results have shown around a 25% improvement in QoS and energy savings for the joint clustering and activation over the legacy activation-only approach.
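The greedy improvement step can be sketched on a toy model; the gain matrix, the best-RRH service model and the QoS test are illustrative assumptions, standing in for the interference-graph formulation of the chapter:

```python
import numpy as np

def greedy_deactivate(gain, demand, power_cost):
    """Greedy improvement step of a two-step activation scheme (toy model,
    illustrative): start from the all-active feasible solution and switch
    off the costliest RRH whenever every user still meets its demand from
    the remaining RRHs (each user served by its best active RRH)."""
    n_rrh = gain.shape[0]
    active = set(range(n_rrh))
    for r in sorted(range(n_rrh), key=lambda r: -power_cost[r]):
        trial = active - {r}
        if trial and (gain[list(trial)].max(axis=0) >= demand).all():
            active = trial                  # QoS preserved: accept the saving
    return sorted(active)

gain = np.array([[5.0, 1.0, 0.5],   # gain[r][u]: RRH r -> user u
                 [4.0, 3.0, 2.0],
                 [0.5, 2.5, 4.0]])
demand = np.array([3.0, 2.0, 2.0])
on = greedy_deactivate(gain, demand, power_cost=[3.0, 2.0, 2.5])
```

Starting from a feasible point and only accepting QoS-preserving moves is what keeps the greedy phase safe, mirroring the role of the LP step that supplies the initial feasible solution.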

10.1.5 Long-term Activation, Clustering and Association in Cloud-RANs

In this chapter we have studied the long-term optimization of RRH activation, clustering and association. Our main contribution is a general formulation that includes all three variables as well as the queue evolution behavior. The resulting model can be efficiently solved using successive geometric programming. We have studied the performance when noisy estimates of the traffic are used, and have seen that the activation probabilities can be accurate up to 91% and the clustering probabilities up to 82%.
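The core step of successive geometric programming, condensing a posynomial into a monomial around the current iterate, can be sketched as follows; the example posynomial is illustrative and unrelated to the chapter's model:

```python
import numpy as np

def condense(monomials, x0):
    """One condensation step of successive geometric programming (generic
    technique, not the chapter's full model): approximate the posynomial
    f(x) = sum_i c_i * prod_k x_k^{a_ik} by the monomial
    prod_i (f_i(x)/w_i)^{w_i}, with weights w_i = f_i(x0)/f(x0).
    By the AM-GM inequality the monomial lower-bounds f and is tight at
    x0, so each iteration solves a tractable GP approximation."""
    c, A = monomials                       # term coefficients, exponent rows
    f_i = c * np.prod(x0 ** A, axis=1)     # value of each term at x0
    w = f_i / f_i.sum()                    # normalized weights
    coeff = np.prod((c / w) ** w)          # condensed monomial coefficient
    expo = (w[:, None] * A).sum(axis=0)    # condensed exponent vector

    def g(x):
        return coeff * np.prod(x ** expo)
    return g

# f(x) = 2*x1*x2 + 3/x1, condensed around the point (1, 1)
c = np.array([2.0, 3.0])
A = np.array([[1.0, 1.0], [-1.0, 0.0]])
g = condense((c, A), np.array([1.0, 1.0]))
```

Iterating condense-and-solve from the previous optimum is what makes the joint activation/clustering/association model tractable despite its posynomial structure.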

10.1.6 Graph-based Diagnosis in Software-Defined Infrastructure

In this chapter we have designed and evaluated a graph-based diagnosis framework for Software-Defined Infrastructure running on the SAVI testbed. Our framework is able to accurately detect system anomalies by leveraging different graph-mining and machine learning techniques. We have tested our framework on several use cases covering different kinds of anomalies affecting various types of application graphs.
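One of the graph-mining ingredients can be illustrated by a minimal snapshot-comparison detector; this is a generic technique (eigenvector-centrality similarity between consecutive application-graph snapshots), not the exact SAVI pipeline:

```python
import numpy as np

def centrality(adj, iters=100):
    """Principal-eigenvector centrality via shifted power iteration;
    the (A + I) shift avoids oscillation on bipartite graphs."""
    v = np.ones(adj.shape[0]) / np.sqrt(adj.shape[0])
    for _ in range(iters):
        v = adj @ v + v
        v /= np.linalg.norm(v)
    return v

def anomaly_scores(snapshots):
    """Score each transition between consecutive application-graph
    snapshots by 1 - |cosine similarity| of their centrality vectors:
    a sudden structural change yields a spike in the score."""
    cs = [centrality(a) for a in snapshots]
    return [1.0 - abs(u @ v) for u, v in zip(cs, cs[1:])]

# A star topology whose hub suddenly loses its edges scores high.
star = np.zeros((4, 4)); star[0, 1:] = star[1:, 0] = 1.0
broken = np.zeros((4, 4)); broken[1, 2] = broken[2, 1] = 1.0
scores = anomaly_scores([star, star, broken])
```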

10.1.7 Auto-Scaling and Anomaly Detection in Software-Defined Infrastructure

In this chapter we have proposed a control framework for joint anomaly detection and auto-scaling in software-defined infrastructures. We have proposed a policy based on the Gaussian process prediction mechanism. The proposed framework is implemented as part of the SAVI testbed. Our measurements have shown a 95% prediction accuracy as well as a 90% anomaly detection accuracy. One of the main observations from our experiments is that there is a trade-off between minimizing the amount of used resources and the ability to detect their anomalies. This work is part of an ongoing effort to use BigData techniques to design efficient diagnosis and control techniques for the SAVI testbed.
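A minimal sketch of Gaussian-process prediction with a confidence band, of the kind such a policy relies on, is shown below; the RBF kernel and its hyperparameters are assumptions, not the tuned values of the implemented system:

```python
import numpy as np

def gp_predict(t_train, y_train, t_star, ell=3.0, sf=1.0, noise=0.1):
    """Gaussian-process regression with an RBF kernel (generic sketch of
    the prediction mechanism). Returns predictive mean and std at t_star."""
    def k(a, b):
        return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
    K = k(t_train, t_train) + noise**2 * np.eye(len(t_train))
    Ks = k(np.atleast_1d(t_star), t_train)
    mean = Ks @ np.linalg.solve(K, y_train)
    var = sf**2 - np.einsum('ij,ij->i', Ks, np.linalg.solve(K, Ks.T).T)
    return mean, np.sqrt(np.maximum(var, 1e-12))

def is_anomalous(obs, mean, std, n_sigma=3.0):
    """Flag a utilization sample outside the predictive confidence band."""
    return abs(obs - mean) > n_sigma * std

t = np.arange(20.0)
y = 20.0 + 2.0 * np.sin(0.3 * t)          # smooth CPU-utilization trace
m, s = gp_predict(t, y, np.array([20.0])) # predict the next sample
```

A sample that falls inside the band feeds the auto-scaler; one far outside it is treated as an anomaly rather than a load change, which is exactly the joint decision the framework makes.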

10.2 Future Work

One important direction to build upon the current research is deploying the whole software LTE

protocol stack on a cloud computing platform. While we have studied the management of the

computing resources, that study addressed general applications. A unique aspect of wireless systems is that the computing resources needed are a function of the radio parameters, most importantly the channel state. Hence, decisions about upscaling, downscaling or migrating a user process have to be made using information about the user's channel. Together with the low latency expected of current and future wireless systems, this calls for tight integration between the wireless protocol stack and a cloud computing management framework such as OpenStack.

For the 5G architecture itself, there are open questions about how cloud-RAN works with the other proposed ideas for 5G, such as millimeter waves and massive MIMO. Performance studies of the trade-offs, pros and cons of each technology, as well as their use cases, are an important research direction.

Bibliography

[1] Available : https://www.openstack.org/.

[2] Request to access savi: http://www.savinetwork.ca/about-savi/request-access-to-savi-

testbed/.

[3] Technical document on wireless virtualization. Technical report, GENI: Global Environ-

ment for Network Innovations, 2006.

[4] C-RAN the road towards green RAN. Technical report, White Paper China Mobile

Research Insitute, 2011.

[5] Google cloud, https://cloud.google.com/. Online, Accessed 2016.

[6] Microsoft azure, https://azure.microsoft.com/en-us/. Online, Accessed 2016.

[7] Amazon elastic compute cloud (amazon ec2), http://aws.amazon.com/ec2/. Online, Ac-

cessed Sept 2016.

[8] Network functions virtualisation: An introduction, benefits, enablers, challenges & call

for action. In Introductory White Paper, SDN and OpenFlow World Congress, October

2012.

[9] A. Abbasi and M. Ghaderi. Energy cost reduction in cellular networks through dynamic

base station activation. In IEEE International Conference on Sensing, Communication,

and Networking (SECON), 2014.

[10] M. H. Ahmed. Call admission control in wireless networks: A comprehensive survey.

IEEE Communications Surveys Tutorials, 2005.

176

Bibliography 177

[11] Leman Akoglu and Christos Faloutsos. Event detection in time series of mobile commu-

nication graphs. In Army Science Conference, pages 77–79, 2010.

[12] Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and

description: a survey. Data Mining and Knowledge Discovery, 29(3):626–688, 2015.

[13] A. Ali-Eldin, J. Tordsson, and E. Elmroth. An adaptive hybrid elasticity controller for

cloud infrastructures. In IEEE Network Operations and Management Symposium, 2012.

[14] Mohamed Slim Alouini and Andrea J. Goldsmith. Area spectral efficiency of cellular

mobile radio systems. IEEE Transactions on Vehicular Technology, 48, no. 4:2047–1066,

July 1999.

[15] E. Altman. Flow control using the theory of zero sum markov games. IEEE Transactions

on Automatic Control, 39:814–818, 1994.

[16] Martin Haenggi andJeffrey G. Andrews, Francois Baccelli, Olivier Dousse, and Massimo

Franceschetti. Stochastic geometry and random graphs for the analysis and design of

wireless networks. IEEE Journal on Selected Areas in Communications, 27, No. 7:1029 –

1046, September 2009.

[17] J.G. Andrews, S. Buzzi, Wan Choi, S.V. Hanly, A. Lozano, A.C.K. Soong, and J.C.

Zhang. What Will 5G Be? IEEE Journal on Selected Areas in Communications, Vol.

32, No. 6, 1065-1082, 2014.

[18] Manu Bansal, Jeffrey Mehlman, Sachin Katti, and Philip Levis. Openradio: a pro-

grammable wireless dataplane. In Proceedings of the first workshop on Hot topics in

software defined networks, 2012.

[19] Diego Bartolome and Ana I. Perez-Neira. A unified fairness framework in multi-antenna

multi-user channels. pages 81–84, December 2004.

[20] Cristina Bazgan, Bruno Escoffier, and Vangelis Th. Paschos. Completeness in standard

and differential approximation classes: Poly-(D)APX- and (D)PTAS-completeness. The-

oretical Computer Science, 339:272–292, 2005.

Bibliography 178

[21] Dimitri Bertsekas and Robert Gallager. Data Networks. Prentice Hall, 1991.

[22] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific,

2005.

[23] G. Bhanage, R. Daya, I. Seskar, and D. Raychaudhuri. VNTS: A virtual network traffic

shaper for air time fairness in 802.16e system. In Communications (ICC), IEEE Inter-

national Conference on, 2010.

[24] Gautam Bhanage, Ivan Seskar, Rajesh Mahindra, and Dipankar Raychaudhuri. Virtual

basestation: Architecture for an open shared wimax framework. In Proceedings of the Sec-

ond ACM SIGCOMM Workshop on Virtualized Infrastructure Systems and Architectures

VISA, 2010.

[25] Cemal Cagatay Bilgin and Bulent Yener. Dynamic network evolution: Models, clustering,

anomaly detection. IEEE Networks, 2006.

[26] Phillip Bonacich and Paulette Lloyd. Eigenvector-like measures of centrality for asym-

metric relations. Social networks, 23(3):191–201, 2001.

[27] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University

Press, 2004.

[28] Horst Bunke, Peter J Dickinson, Miro Kraetzl, and Walter D Wallis. A graph-theoretic

approach to enterprise network dynamics, volume 24. Springer Science & Business Media,

2007.

[29] Deepayan Chakrabarti. Autopart: Parameter-free graph partitioning and outlier detec-

tion. In Knowledge Discovery in Databases: PKDD 2004, pages 112–124. Springer, 2004.

[30] T. C. Chieu, A. Mohindra, and A. A. Karve. Scalability and performance of web appli-

cations in a compute cloud. In e-Business Engineering (ICEBE), 2011 IEEE 8th Inter-

national Conference on, 2011.

Bibliography 179

[31] Trieu C. Chieu, Ajay Mohindra, Alexei A. Karve, and Alla Segal. Dynamic scaling of web

applications in a virtualized cloud computing environment. In Proceedings of the 2009

IEEE International Conference on e-Business Engineering, 2009.

[32] N. M. Mosharaf Kabir Chowdhury and Raouf Boutaba. Network virtualization: State of

the art and research challenges. IEEE Communications Magazine, 47:20–26, July 2009.

[33] N.M. Mosharaf Kabir Chowdhury and Raouf Boutaba. A survey of network virtualization.

Comput. Netw., 2010.

[34] Cisco. Network virtualization-path isolation design guide. Available Online.

[35] Anthony H. Dekker and Bernard Colbert. The symmetry ratio of a network. In Aus-

tralasian symposium on Theory of computing CATS, 2005.

[36] Anthony H. Dekker and Bernard D. Colbert. Network robustness and graph topology. In

27th Australasian conference on Computer science ACSC, 2004.

[37] OpenStack Meter description. http://docs.openstack.org/admin-guide/telemetry-

measurements.html.

[38] X. Dutreilh, A. Moreau, J. Malenfant, N. Rivierre, and I. Truck. From data center resource

allocation to control theory and back. In 2010 IEEE 3rd International Conference on

Cloud Computing, 2010.

[39] Xavier Dutreilh, Sergey Kirgizov, Olga Melekhova, Jacques Malenfant, Nicolas Rivierre,

and Isis Truck. Using reinforcement learning for autonomic resource allocation in clouds:

towards a fully automated workflow. In 7th International Conference on Autonomic and

Autonomous Systems (ICAS), 2011.

[40] William Eberle and Lawrence Holder. Discovering structural anomalies in graph-based

data. In Data Mining Workshops, 2007. ICDM Workshops 2007. Seventh IEEE Interna-

tional Conference on, pages 393–398. IEEE, 2007.

[41] S. Baucke et al. Virtualization approach: Concept. Technical report, Technical Report

FP7-ICT-2007-1-216041-4WARD/D-3.1.1, The 4WARD Project, 2009.

Bibliography 180

[42] G. Piro F. Capozzi, L.A. Grieco, G. Boggia, and P. Camarda. Downlink packet scheduling

in LTE cellular networks: Key design issues and a survey. IEEE Communications Surveys

and Tutorials, 15:678–700, 2013.

[43] A. Fischer, J. F. Botero, M. T. Beck, H. de Meer, and X. Hesselbach. Virtual network

embedding: A survey. IEEE Communications Surveys Tutorials, 2013.

[44] Linton C Freeman. A set of measures of centrality based on betweenness. Sociometry,

pages 35–41, 1977.

[45] Fangwen Fu and U.C. Kozat. Stochastic game for wireless network virtualization. Net-

working, IEEE/ACM Transactions on, 21 number 1:84–97, Feb 2013.

[46] Fangwen Fu and Ulas C. Kozat. Stochastic game for wireless network virtualization.

IEEE/ACM Transactions on Networking, 21:84–97, February 2013.

[47] Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han. On community

outliers and their efficient detection in information networks. In Proceedings of the 16th

ACM SIGKDD international conference on Knowledge discovery and data mining, pages

813–822. ACM, 2010.

[48] D. Gesbert, S. Hanly, H. Huang, S. Shamai Shitz, O. Simeone, and Wei Yu. Multi-cell

mimo cooperative networks: A new look at interference. Selected Areas in Communica-

tions, IEEE Journal on, 28, number 9:1380–1408, December 2010.

[49] David Gesbert, Stephen Hanly, Howard Huang, Shlomo Shamai Shitz, Osvaldo Simeone,

and Wei Yu. Multi-cell MIMO cooperative networks: A new look at interference. IEEE

Journal on Sele, 28, No. 9:1380–1408, 2010.

[50] David Gesbert, Stephen Hanly, Howard Huang, Shlomo Shamai Shitz, Osvaldo Simeone,

and Wei Yu. Multi-cell MIMO cooperative networks: A new look at interference. IEEE

Journal on Selected Areas in Communications, 28, No. 9:1380–1408, 2010.

[51] Majid Ghaderi and Raouf Boutaba. Call admission control in mobile cellular networks:

A comprehensive survey: Research articles. Wirel. Commun. Mob. Comput., February

2006.

Bibliography 181

[52] Chris Godsil and Gordon F. Royle. Algebraic Graph Theory. Springer, 2001.

[53] Bernd Haberland, Fariborz Derakhshan, Heidrun Grob-Lipski, Ralf Klotsche, Werner

Rehm, Peter Schefczik, and Michael Soellner. Radio base stations in the cloud. Bell Labs

Technical Journal, 18, No. 1:129–152, 2013.

[54] Apache Hadoop. Available: http://hadoop.apache.org/.

[55] Feng Han, Zoltan Safar, W. Sabrina Lin, Yan Chen, and K. J. Ray Liu. Energy-efficient

cellular network operation via base station cooperation. In IEEE ICC, Ottawa ON, June

2012.

[56] M. Z. Hasan, E. Magana, A. Clemm, L. Tucker, and S. L. D. Gudreddi. Integrated and

autonomic cloud resource scaling. In 2012 IEEE Network Operations and Management

Symposium, 2012.

[57] Ahmadreza Hedayat and Aria Nosratinia. Outage and diversity of linear receivers in flat-

fading MIMO channels. IEEE Transactions on Signal Processing, 55, No. 12:5868 – 5873,

Dec. 2007.

[58] Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos, Leman Akoglu, Lei Li, Koji

Maruhashi, B Aditya Prakash, and Hanghang Tong. Metric forensics: a multi-level ap-

proach for mining volatile graphs. In Proceedings of the 16th ACM SIGKDD international

conference on Knowledge discovery and data mining, pages 163–172. ACM, 2010.

[59] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Le-

man Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. Rolx: structural role extrac-

tion & mining in large graphs. In Proceedings of the 18th ACM SIGKDD international

conference on Knowledge discovery and data mining, pages 1231–1239. ACM, 2012.

[60] J. J. Hopfield. Neural networks and physical systems with emergent collective compu-

tational abilities. Proceedings of the National Academy of Sciences of the USA, 79 no.

8:2554–2558, April 1982.

[61] K. Hosseini, W. Yu, and R. S. Adve. A stochastic analysis of network mimo systems.

IEEE Transactions on Signal Processing, 2016.

Bibliography 182

[62] Ju Yuan Hsiao, Chuan Yi Tang, and Ruay Shiung Chang. An efficient algorithm for

finding a maximum weight 2-independent set on interval graphs. Information Processing

Letters, 43:229–235, October 1992.

[63] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide to support

vector classification. 2003.

[64] Yichuan Hu and Alejandro Ribeiro. Adaptive distributed algorithms for optimal random

access channels. IEEE Transactions on Wireless Communications, 10, No. 8:2703–2715,

August 2011.

[65] Yichuan Hu and Alejandro Ribeiro. Optimal wireless networks based on local channel

state information. IEEE Transactions on Signal Processing, 60, No. 9:4913 – 4929, June

2012.

[66] Tsuyoshi ID and Hisashi KASHIMA. Eigenspace-based anomaly detection in computer

systems. In tenth ACM SIGKDD international conference on Knowledge discovery and

data mining KDD, 2004.

[67] S. Jeong and H. Otsuki. Framework of network virtualization. FG-FN OD-17, 2009.

[68] Nihar Jindal, Jeffrey G. Andrews, and Steven Weber. Multi-antenna communication in

ad hoc networks: Achieving mimo gains with simo transmission. IEEE Transaction on

Communications, 59, No. 2:529 – 540, February 2011.

[69] E.A. Jorswieck, P. Svedman, and B. Ottersten. Performance of TDMA and SDMA based

opportunistic beamforming. Wireless Communications, IEEE Transactions on, 7, No.

11:4058–4063, 2008.

[70] J. M. Kang, H. Bannazadeh, and A. Leon-Garcia. Savi testbed: Control and manage-

ment of converged virtual ict resources. In 2013 IFIP/IEEE International Symposium on

Integrated Network Management (IM 2013), 2013.

[71] Joon-Myung Kang, Hadi Bannazadeh, and Alberto Leon-Garcia. Savi testbed: Control

and management of converged virtual ict resources. In Integrated Network Management

(IM 2013), 2013 IFIP/IEEE International Symposium on, pages 664–667. IEEE, 2013.

Bibliography 183

[72] U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. Centralities in large

networks: Algorithms and observations. In SDM, volume 2011, pages 119–130. SIAM,

2011.

[73] George Karypis and Vipin Kumar. Metis–unstructured graph partitioning and sparse

matrix ordering system, version 2.0. 1995.

[74] Manzoor Ahmed Khan and Yasir Zaki. Dynamic spectrum trade and game-theory based

network selection in lte virtualization using uniform auctioning. In Proceedings of the

9th IFIP TC 6 International Conference on Wired/Wireless Internet Communications

WWIC, 2011.

[75] Keith Kirkpatrick. Software-defined networking. Communications of the ACM, 56:16–19,

September 2013.

[76] R. Kokku, R. Mahindra, Honghai Zhang, and S. Rangarajan. Nvs: A substrate for

virtualizing wireless resources in cellular networks. Networking, IEEE/ACM Transactions

on, 20 number 5:1333–1346, 2012.

[77] R. Kokku, R. Mahindra, Honghai Zhang, and S. Rangarajan. Cellslice: Cellular wire-

less resource slicing for active ran sharing. In Communication Systems and Networks

(COMSNETS), Fifth International Conference on, 2013.

[78] M. Kountouris, D. Gesbert, and T. Salzer. Distributed transmit mode selection for MISO

broadcast channels with limited feedback: Switching from SDMA to TDMA. In Sig-

nal Processing Advances in Wireless Communications, 2008. SPAWC 2008. IEEE 9th

Workshop on, 2008.

[79] Jay Kreps, Neha Narkhede, and Jun Rao. Kafka: A distributed messaging system for log

processing. In NetDB Workshop, 2011.

[80] Swarun Kumar, Diego Cifuentes, Shyamnath Gollakota, and Dina Katabi. Bringing cross-

layer mimo to today’s wireless lans. SIGCOMM Computer Communication Review, 43

number 4:387–398, October 2013.

Bibliography 184

[81] Harold C. Lim, Shivnath Babu, Jeffrey S. Chase, and Sujay S. Parekh. Automated control

in cloud computing: Challenges and opportunities. In Proceedings of the 1st Workshop

on Automated Control for Datacenters and Clouds, 2009.

[82] J. Lin, R. Ravichandiran, H. Bannazadeh, and A. Leon-Garcia. Monitoring and measure-

ment in software-defined infrastructure. In 2015 IFIP/IEEE International Symposium on

Integrated Network Management (IM), 2015.

[83] Chao Liu, Xifeng Yan, Hwanjo Yu, Jiawei Han, and S Yu Philip. Mining behavior graphs

for” backtrace” of noncrashing bugs. In SDM, pages 286–297. SIAM, 2005.

[84] M. J. Lopez. Multiplexing, scheduling, and multicasting strategies for antenna arrays in

wireless networks. PhD thesis, Dept. of Elect. Eng. and Comp. Sci., MIT, Cambridge,

MA, 2002.

[85] Tania Lorido-Botran, Jose Miguel-Alonso, and Jose A. Lozano. A review of auto-scaling

techniques for elastic applications in cloud environments. Journal of Grid Computing,

12(4):559–592, 2014.

[86] J. Malmodin, A. Moberg, D. Lunden, G. Finnveden, and N. Lovehagen. Greenhouse gas

emissions and operational electricity use in the ICT and entertainment & media sectors.

Journal of Industrual Ecology, 14, no. 5:770–790, October 2010.

[87] M. A. Marsan, L. Chiaraviglio, D. Ciullo, and M. Meo. Optimal energy savings in cellular

access networks. 2009.

[88] E. Matskani, N. D. Sidiropoulos, Z. q. Luo, and L. Tassiulas. Convex approximation

techniques for joint multiuser downlink beamforming and admission control. IEEE Trans-

actions on Wireless Communications, 2008.

[89] Nick McKeown. How sdn will shape networking. In Open Networking Summit, Stanford,

October 2011.

[90] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jen-

nifer Rexford, Scott Shenker, and Jonathan Turner. Openflow: Enabling innovation in

campus networks. SIGCOMM Comput. Commun. Rev., 2008.

Bibliography 185

[91] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jen-

nifer Rexford, Scott Shenker, and Jonathan Turner. Openflow: enabling innovation in

campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74,

2008.

[92] Daniel A. Menasce, Lawrence W. Dowdy, and Virgilio A. F. Almeida. Performance by

Design: Computer Capacity Planning By Example. Prentice Hall PTR, 204.

[93] Guowang Miao, Ye (Geoffrey) Li, and Ananthram Swami. Channel-aware distributed

medium access control. IEEE/ACM Transactions on Networking, 20, No. 4:1290–1303,

August 2012.

[94] Simon Mingay. Green IT: The new industry shock wave. Technical report, Gartner, 2007.

[95] Dov Monderer and Lloyd S. Shapley. Potential games. Games and Economic Behavior,

14 issue 1, 1996.

[96] Misael Mongiovi, Petko Bogdanov, Razvan Ranca, Ambuj K Singh, Evangelos E Pa-

palexakis, and Christos Faloutsos. Netspot: Spotting significant anomalous regions on

dynamic networks. In Proceedings of the 13th SIAM international conference on data

mining (SDM), Texas-Austin, TX. SIAM, 2013.

[97] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product

algorithm. IEEE Transactions on Information Theory, 2007.

[98] Emmanuel Muller, Patricia Iglesias Sanchez, Yvonne Mulle, and Klemens Bohm. Rank-

ing outlier nodes in subspaces of attributed graphs. In Data Engineering Workshops

(ICDEW), 2013 IEEE 29th International Conference on, pages 216–222. IEEE, 2013.

[99] Andrew Y Ng, Michael I Jordan, Yair Weiss, et al. On spectral clustering: Analysis and

an algorithm. Advances in neural information processing systems, 2:849–856, 2002.

[100] Minh Hanh Ngo, Vikram Krishnamurthy, , and Lang Tong. Optimal channel-aware

ALOHA protocol for random access in WLANs with multipacket reception and decen-

tralized channel state information. IEEE Transactions on Signal Processing, 56, No.

6:2575–2588, June 2008.

Bibliography 186

[101] Navid Nikaein, Mahesh K. Marina, Saravana Manickam, Alex Dawson, Raymond Knopp,

and Christian Bonnet. Openairinterface: A flexible platform for 5g research. SIGCOMM

Comput. Commun. Rev., 2014.

[102] Caleb C Noble and Diane J Cook. Graph-based anomaly detection. In Proceedings of the

ninth ACM SIGKDD international conference on Knowledge discovery and data mining,

pages 631–636. ACM, 2003.

[103] Rogier Noldus and Piet Van Mieghem. Assortativity in complex networks. Journal of

Complex Networks, page cnv005, 2015.

[104] Eunsung Oh, Member Kyuho Son, and Bhaskar Krishnamachari. Dynamic base station

switching-on/off strategies for green cellular networks. IEEE Transactions on Wireless

Communications, 12, no. 5:1536–1276, May 2013.

[105] Tharindu Patikirikorala and Alan Colman. Feedback controllers in the cloud. In APSEC

2010, Cloud Workshop, 2010.

[106] Sanjoy Paul and Srini Seshan. Technical document on wireless virtualization. Technical

report, GENI: Global Environment for Network Innovations, September 2006.

[107] Subharthi Paul, Jianli Pan, and Raj Jain. Architectures for the future networks and the

next generation internet: A survey. Computer Communications, 34:2–42, January 2011.

[108] Chunyi Peng, Suk-Bok Lee, Songwu Lu, Haiyun Luo, and Hewu Li. Traffic-driven power

saving in operational 3g cellular networks. In The 17th Annual International Conference

on Mobile Computing and Networking MobiCom, 2011.

[109] Warren B. Powell. Clearing the Jungle of Stochastic Optimization, chapter Chapter 4,

pages 109–137. INFORMS.

[110] Carey E Priebe, John M Conroy, David J Marchette, and Youngser Park. Scan statistics

on enron graphs. Computational & Mathematical Organization Theory, 11(3):229–247,

2005.

Bibliography 187

[111] Wireless Communications: Principles and Practice. Wireless Communications: Principles

and Practice. Prentice Hall, 2001.

[112] Jia Rao, Xiangping Bu, Cheng-Zhong Xu, Leyi Wang, and George Yin. Vconf: A rein-

forcement learning approach to virtual machines auto-configuration. In 6th International

Conference on Autonomic Computing, 2009.

[113] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT

Press, 2006.

[114] Member Ravi Kokku, Rajesh Mahindra, Honghai Zhang, and Sampath Rangarajan. NVS:

A substrate for virtualizing wireless resources in cellular networks. IEEE/ACM Transac-

tions on Networking, 20:1333–1346, October 2012.

[115] D. Raychaudhuri, M. Ott, and I. Secker. Orbit radio grid tested for evaluation of next-

generation wireless network protocols. In Proceedings of the First International Con-

ference on Testbeds and Research Infrastructures for the Development of Networks and

Communities (TRIDENTCOM), 2005.

[116] Dipankar Raychaudhuri and Mario Gerla. Emerging Wireless Technologies and the Future

Mobile Internet. Cambridge University Press, 2011.

[117] Ryan A Rossi, Brian Gallagher, Jennifer Neville, and Keith Henderson. Modeling dy-

namic behavior in large evolving graphs. In Proceedings of the sixth ACM international

conference on Web search and data mining, pages 667–676. ACM, 2013.

[118] Jun-Bae Seo and V.C.M. Leung. Design and analysis of cross-layer contention resolution

algorithms for multi-packet reception slotted ALOHA systems. Wireless Communications,

IEEE Transactions on, 10, No. 3:825–833, 2011.

[119] Stefania Sesia, Issam Toufik, and Matthew Baker. LTE - The UMTS Long Term Evolu-

tion: From Theory to Practice. WILEY, 2011.

[120] Rob Sherwood, Glen Gibb, Kok kiong Yap, Martin Casado, Nick Mckeown, and Guru

Parulkar. FlowVisor: A network virtualization layer. Technical report, 2009.

Bibliography 188

[121] Peter Shoubridge, Miro Kraetzl, WAL WALLIS, and Horst Bunke. Detection of abnormal

change in a time series of graphs. Journal of Interconnection Networks, 3(01n02):85–101,

2002.

[122] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The hadoop

distributed file system. In Mass Storage Systems and Technologies (MSST), 2010 IEEE

26th Symposium on, pages 1–10. IEEE, 2010.

[123] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010.

[124] Yong Sheng Soh, Tony Q. S. Quek, Marios Kountouris, and Hyundong Shin. Energy efficient heterogeneous cellular networks. IEEE Journal on Selected Areas in Communications, 31(5):840–850, May 2013.

[125] H. M. Soliman and A. Leon-Garcia. Fully distributed scheduling in cloud-RAN systems. In IEEE Wireless Communications and Networking Conference, 2016.

[126] H. M. Soliman and A. Leon-Garcia. A novel neuro-optimization method for multi-operator scheduling in cloud-RANs. In IEEE International Conference on Communications (ICC), 2016.

[127] H. M. Soliman and A. Leon-Garcia. QoS-aware frequency-space network slicing and admission control for virtual wireless networks. In IEEE Global Communications Conference (GLOBECOM), 2016.

[128] H. M. Soliman and A. Leon-Garcia. QoS-aware joint RRH activation and clustering in cloud-RANs. In IEEE Wireless Communications and Networking Conference, 2016.

[129] Hazem M. Soliman, Omar A. Nasr, and Mohamed M. Khairy. Analysis and optimization of backhaul sharing in CoMP. pages 1013–1018, September 2013.

[130] K. Son, H. Kim, Y. Yi, and B. Krishnamachari. Toward energy-efficient operation of base stations in cellular wireless networks. CRC Press, Taylor and Francis, 2012.

[131] Kyuho Son, Hongseok Kim, Yung Yi, and Bhaskar Krishnamachari. Base station operation and user association mechanisms for energy-delay tradeoffs in green cellular networks. IEEE Journal on Selected Areas in Communications, 29(8):1525–1536, September 2011.

[132] Kyuho Son, Eunsung Oh, and Bhaskar Krishnamachari. Energy-efficient design of heterogeneous cellular networks from deployment to operation. Computer Networks, 78:95–106, 2015.

[133] Kyuho Son, Eunsung Oh, and Bhaskar Krishnamachari. Energy-aware hierarchical cell configuration from deployment to operation. In IEEE INFOCOM, Shanghai, April 2011.

[134] Jimeng Sun, Huiming Qu, Deepayan Chakrabarti, and Christos Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. In Fifth IEEE International Conference on Data Mining (ICDM). IEEE, 2005.

[135] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[136] Kun Tan, He Liu, Jiansong Zhang, Yongguang Zhang, Ji Fang, and Geoffrey M. Voelker. Sora: High-performance software radio using general-purpose multi-core processors. Communications of the ACM, January 2011.

[137] G. Tesauro, N. K. Jong, R. Das, and M. N. Bennani. A hybrid reinforcement learning approach to autonomic resource allocation. In Proceedings of the 2006 IEEE International Conference on Autonomic Computing, 2006.

[138] Ali Tizghadam and Alberto Leon-Garcia. A robust routing plan to optimize throughput in core networks. In 20th International Teletraffic Conference on Managing Traffic Performance in Converged Networks, 2007.

[139] Ali Tizghadam and Alberto Leon-Garcia. On robust traffic engineering in transport networks. In IEEE Global Telecommunications Conference (GLOBECOM), 2008.

[140] Hanghang Tong and Ching-Yung Lin. Non-negative residual matrix factorization with application to graph anomaly detection. In SIAM International Conference on Data Mining (SDM), pages 143–153. SIAM, 2011.

[141] David Tse and Pramod Viswanath. Fundamentals of Wireless Communication. Cambridge University Press, 2005.

[142] Bhuvan Urgaonkar, Prashant Shenoy, Abhishek Chandra, Pawan Goyal, and Timothy Wood. Agile dynamic provisioning of multi-tier internet applications. ACM Transactions on Autonomous and Adaptive Systems, 3:1:1–1:39, March 2008.

[143] Vijay V. Vazirani. Approximation Algorithms. Springer, 2001.

[144] William von Hagen. Professional Xen Virtualization. Wrox, 2008.

[145] Joseph Wahba, Hazem Soliman, Hadi Bannazadeh, and Alberto Leon-Garcia. Graph-based diagnosis in software-defined infrastructure. In International Conference on Network and Service Management (CNSM), 2016.

[146] Anjing Wang, M. Iyer, R. Dutta, G. N. Rouskas, and I. Baldine. Network virtualization: Technologies, perspectives, and frontiers. Journal of Lightwave Technology, 31(4):523–537, February 2013.

[147] Andrew Webb. Pattern Recognition. Wiley, 2002.

[148] J. H. Winters, J. Salz, and R. D. Gitlin. The impact of antenna diversity on the capacity of wireless communication systems. IEEE Transactions on Communications, 1994.

[149] G. Woodruff, N. Perinpanathan, F. Chang, P. Appanna, and A. Leon-Garcia. ATM network resources management using layer and virtual network concepts, 1997.

[150] Cheng-Zhong Xu, Jia Rao, and Xiangping Bu. A unified reinforcement learning approach for autonomic cloud management. Journal of Parallel and Distributed Computing, 72:95–105, 2012.

[151] Chongbin Xu, Li Ping, Peng Wang, S. Chan, and Xiaokang Lin. Decentralized power control for random access with successive interference cancellation. IEEE Journal on Selected Areas in Communications, 31(11):2387–2396, May 2013.

[152] Gongxian Xu. Global optimization of signomial geometric programming problems. European Journal of Operational Research, 233:500–510, 2014.

[153] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas A. J. Schweiger. SCAN: A structural clustering algorithm for networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 824–833. ACM, 2007.

[154] Mao Yang, Yong Li, Depeng Jin, Jian Yuan, Li Su, and Lieguang Zeng. Opportunistic spectrum sharing based resource allocation for wireless virtualization. In Seventh International Conference on Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), July 2013.

[155] Mao Yang, Yong Li, Lieguang Zeng, Depeng Jin, and Li Su. Karnaugh-map like online embedding algorithm of wireless virtualization. In 15th International Symposium on Wireless Personal Multimedia Communications (WPMC), September 2012.

[156] Kok-Kiong Yap, Rob Sherwood, Masayoshi Kobayashi, Te-Yuan Huang, Michael Chan, Nikhil Handigol, Nick McKeown, and Guru Parulkar. Blueprint for introducing innovation into wireless mobile networks. In Proceedings of the Second ACM SIGCOMM Workshop on Virtualized Infrastructure Systems and Architectures (VISA), 2010.

[157] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10:10–10, 2010.

[158] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010.

[159] Yasir Zaki, Liang Zhao, Carmelita Goerg, and Andreas Timm-Giel. A Novel LTE Wireless Virtualization Framework. Springer Berlin Heidelberg, 2011.

[160] Yasir Zaki, Liang Zhao, Carmelita Goerg, and Andreas Timm-Giel. LTE mobile network virtualization. Mobile Networks and Applications, 16:424–432, August 2011.

[161] Jun Zhang, Robert W. Heath, Marios Kountouris, and Jeffrey G. Andrews. Mode switching for the multi-antenna broadcast channel based on delay and channel quantization. EURASIP Journal on Advances in Signal Processing, pages 1:1–1:15, February 2009.

[162] Jun Zhang, M. Kountouris, J. G. Andrews, and R. W. Heath. Multi-mode transmission for the MIMO broadcast channel with imperfect channel state information. IEEE Transactions on Communications, 59(3):803–814, 2011.

[163] Qian Zhang, Chenyang Yang, Harald Haas, and John S. Thompson. Energy efficient downlink cooperative transmission with BS and antenna switching off. IEEE Transactions on Wireless Communications, 13(9):5183–5195, September 2014.

[164] Qian Zhu and Gagan Agrawal. Resource provisioning with budget constraints for adaptive applications in cloud environments. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010.