Performance Annotations for Cloud Computing · 2019-12-18 · Data Centers Are Everywhere... Data...
Transcript of Performance Annotations for Cloud Computing · 2019-12-18 · Data Centers Are Everywhere... Data...
Performance Annotations for Cloud Computing
Daniele Rogora∗ Steffen Smolka% Antonio Carzaniga∗ Amer Diwan$
Robert Soulé∗∼
presented by
Daniele Rogora
∗Università della Svizzera italiana %Cornell University $Google ∼Barefoot Networks
HotCloud 2017
1 / 27
Data Centers Are Everywhere...
Data centers provide◮ Services for end users
◮ Google, Facebook, Dropbox
◮ Services for companies and universities◮ AWS for SAP, cloud mail services
◮ Raw processing power (IaaS, PaaS)◮ Amazon EC2, Microsoft Azure
2 / 27
Data Centers Are Everywhere...
Data centers provide◮ Services for end users
◮ Google, Facebook, Dropbox
◮ Services for companies and universities◮ AWS for SAP, cloud mail services
◮ Raw processing power (IaaS, PaaS)◮ Amazon EC2, Microsoft Azure
Data centers are widespread◮ Microsoft has more than 100 data centers
◮ they account for more than 1M servers
◮ Amazon has more than 30 data centers◮ they account for more than 1.5M servers
◮ Google has 15 data centers scattered around the world◮ in 2013 they accounted for around 900k servers
2 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
Software
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
Software
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
Software
3 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
Software
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
Software
4 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
Web Server
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
DBMS
4 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
AUTH Web Server
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
DBMS
RPC
4 / 27
... But They Are Complex
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 6
Machine 7
Machine 8
Machine 9
Machine 10
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
AUTH Web Server
CPU HDD NET
Operating System
Virtual Network
Virtual Machines
DBMS
RPC
4 / 27
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
Operating System
Virtual Network
Virtual MachinesAUTH
Web Server
Operating System
Virtual Network
Virtual Machines
DBMS
RPC
Understanding the performance of a data center is difficult
many layers
limited scope of tools and people’s knowledge
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 1
Machine 1
Machine 2
Machine 3
Machine 4
Machine 5
Switch 2
Machine 11
Machine 12
Machine 13
Machine 14
Machine 15
Switch 3
Operating System
Virtual Network
Virtual MachinesAUTH
Web Server
Operating System
Virtual Network
Virtual Machines
DBMS
RPC
Understanding the performance of a data center is difficult
many layers
limited scope of tools and people’s knowledge
3 real-world questions by a data center operator
How much load increase can we support with the current
setup? Where will the bottleneck be? What would break
first?
How much would it help to move the database server to
faster hardware, or directly on the metal?
Can we understand and explain unexpected behaviors?
5 / 27
Goal: Creating a New Model
We want to build a dynamic performance model for data centers
comprehensive
live
interactive
6 / 27
Goal: Creating a New Model
We want to build a dynamic performance model for data centers
comprehensive
live
interactive
Logs
Query
Model
Goal: Creating a New Model
We want to build a dynamic performance model for data centers
comprehensive
live
interactive
Logs
Query
Model
How is it performing now?
Performance
Goal: Creating a New Model
We want to build a dynamic performance model for data centers
comprehensive
live
interactive
Logs
Query
Model
What if the we improve the hardware?
Predictive result
6 / 27
Goal: Creating a New Model
We want to build a dynamic performance model for data centers
comprehensive
live
interactive
Logs
Query
Statistical Model
6 / 27
Performance Annotations
7 / 27
Example: a Web Service
Request Response
System
The response time grows quadratically
with the size of the body of the request
Example: a Web Service
Request Response
System
The response time grows quadratically
with the size of the body of the request
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
0 20 40 60 80 100
tim
e (
ms)
body size (MB)
Example: a Web Service
Request Response
System
The response time grows quadratically
with the size of the body of the request
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
0 20 40 60 80 100
tim
e (
ms)
body size (MB)@Time ∼ a∗ |in|2 +b ∗ |in|+c
Example: a Web Service
Request Response
System
The memory used grows linearly
with the size of the body of the request
10
20
30
40
50
60
70
80
90
0 20 40 60 80 100
mem
ory
usage (
kB
)
body size (kB)@Mem ∼ a∗ |in|+b
8 / 27
Example: a Web Service
Request Response
System
Machine 1 Machine 7
Example: a Web Service
Request Response
System
Machine 1 Machine 7VM PostgreSQL
Example: a Web Service
Request Response
System
Machine 1 Machine 7VM PostgreSQL
Apache WS
Example: a Web Service
Request Response
System
Machine 1 Machine 7VM PostgreSQL
Apache WSOwnCloud
Example: a Web Service
Request Response
System
Machine 1 Machine 7VM PostgreSQL
Apache WSOwnCloud
getFile(string f)
normalizePath(string p)
parseQuery(string q)
execute(query q)
8 / 27
Example: a Web Service
Request Response
System
Machine 1 Machine 7VM PostgreSQL
Apache WSOwnCloud
getFile(string f)
0
1000
2000
3000
4000
5000
6000
7000
8000
9000
0 20 40 60 80 100
tim
e (
ms)
body size (MB)
normalizePath(string p)
10
20
30
40
50
60
70
80
90
0 20 40 60 80 100
me
mo
ry u
sa
ge
(kB
)
body size (kB)
parseQuery(string q)
0
1
2
3
4
5
10 20 30 40 50 60 70 80 90 100
Me
mo
ry (
kb
yte
s)
String length
execute(query q)
0
10
20
30
40
50
60
70
80
90
0 20 40 60 80 100
tim
e (
ms)
query size
8 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB)
9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
Line Fit
9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
Line Fit
@Mem∼Norm(5ms*path_len(path), 5ms)
9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
Line Fit
@Mem∼Norm(5ms*path_len(path), 5ms)
0
1
2
3
4
5
10 20 30 40 50 60 70 80 90 100
Me
mo
ry (
kb
yte
s)
String length 9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
Line Fit
@Mem∼Norm(5ms*path_len(path), 5ms)
0
1
2
3
4
5
10 20 30 40 50 60 70 80 90 100
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(4.2kB)
∨
@Mem∼Constant(0.5kB)
9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
Line Fit
@Mem∼Norm(5ms*path_len(path), 5ms)
0
1
2
3
4
5
10 20 30 40 50 60 70 80 90 100
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(4.2kB)
∨
@Mem∼Constant(0.5kB)
700
750
800
850
900
950
1000
10 20 30 40 50 60 70 80 90M
em
ory
(b
yte
s)
String length 9 / 27
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(8kB) 0
5
10
15
20
25
30
35
40
0 2 4 6 8 10 12 14
Me
mo
ry (
kb
yte
s)
Path length
Line Fit
@Mem∼Norm(5ms*path_len(path), 5ms)
0
1
2
3
4
5
10 20 30 40 50 60 70 80 90 100
Me
mo
ry (
kb
yte
s)
String length
@Mem∼Constant(4.2kB)
∨
@Mem∼Constant(0.5kB)
700
750
800
850
900
950
1000
10 20 30 40 50 60 70 80 90M
em
ory
(b
yte
s)
String length
@Mem∼?
9 / 27
How To Create Annotations
10 / 27
Instrumentation
11 / 27
Instrumentation
For every call of all the functions in the system, we need:
metrics of interest◮ execution time◮ dynamic memory allocation◮ locks holding time
relevant features of the input parameters◮ string length◮ collection size
11 / 27
Instrumentation
For every call of all the functions in the system, we need:
metrics of interest◮ execution time◮ dynamic memory allocation◮ locks holding time
relevant features of the input parameters◮ string length◮ collection size
Also, the instrumentation must be:
crosslayer
crossplatform
11 / 27
Automatic Annotation Inference
Automatic Annotation Inference
Logs
Function: correctFolderSize feature: collection size
pcc: 0.5610
0
5
10
15
20
25
0 1 2 3 4 5 6 7 8 9
Mem
ory
(kbyte
s)
Collection size
12 / 27
Automatic Annotation Inference
Logs
Function: correctFolderSize feature: string length
pcc: 0.78673
0
5
10
15
20
25
0 20 40 60 80 100 120
Mem
ory
(kbyte
s)
String length
12 / 27
Automatic Annotation Inference
Logs
Function: correctFolderSize feature: path length
pcc: 0.9473
0
5
10
15
20
25
0 1 2 3 4 5 6 7
Mem
ory
(kbyte
s)
Path length
12 / 27
Automatic Annotation Inference
Logs
Function: correctFolderSize feature: path length
pcc: 0.9473
0
5
10
15
20
25
0 1 2 3 4 5 6 7
Mem
ory
(kbyte
s)
Path length
Line fit
regression
12 / 27
Automatic Annotation Inference
Logs
Function: broadcastEvent feature: collection size
pcc: -0.1155
0
0.5
1
1.5
2
2.5
3
3.5
4
0 1 2 3 4 5
Tim
e (
ms)
Collection size
12 / 27
Automatic Annotation Inference
Logs
Function: broadcastEvent feature: string length
pcc: -0.2764
0
0.5
1
1.5
2
2.5
3
3.5
4
8 10 12 14 16 18 20
Tim
e (
ms)
String length
12 / 27
Automatic Annotation Inference
Logs
Function: broadcastEvent no feature
0
0.5
1
1.5
2
2.5
3
3.5
4
-1 -0.5 0 0.5 1
Tim
e (
ms)
Scalar value
clusters
12 / 27
Annotations Uses
13 / 27
Uses
14 / 27
Uses
Documentation◮ how do functions behave?
14 / 27
Uses
Documentation◮ how do functions behave?
Anomaly/failure detection◮ is the system behaving normally? Is there a performance regression?
14 / 27
Uses
Documentation◮ how do functions behave?
Anomaly/failure detection◮ is the system behaving normally? Is there a performance regression?
Extrapolation◮ can we scale up?
14 / 27
Uses
Documentation◮ how do functions behave?
Anomaly/failure detection◮ is the system behaving normally? Is there a performance regression?
Extrapolation◮ can we scale up?
Composition◮ can we infer the behavior of the caller from the annotations of the callees?
14 / 27
Case Study
15 / 27
Case Study
setup
15 / 27
The System: Application Level
Client
VM
HAProxy
VMnginx
OwnCloudVM
PostgreSQL
VM
NFS
WEBDAV WEBDAV
TCP
TCP
16 / 27
The System: Computing Resources
ETH
Host01 Host02
Host03 Host04 Host05 Host06 Host07
Host08 Host09 Host10 Host11 Host12
17 / 27
The System: Computing Resources
ETH + VLAN
Host01
Test
Host02
Openstack
Neutron
Host03 Host04 Host05 Host06 Host07
Host08 Host09 Host10 Host11 Host12
18 / 27
The System: Computing Resources
ETH + VLAN
STORAGE NODE
Host01
Test
Host02
Openstack
Neutron
Host03 Host04 Host05 Host06 Host07
Host08
Ceph OSD
Host09
Ceph OSD
Host10
Ceph OSD
Host11
Ceph OSD
Host12
Ceph OSD
18 / 27
The System: Computing Resources
ETH + VLAN
COMPUTE NODE
STORAGE NODE
Host01
Test
Host02
Openstack
Neutron
Host03
Ceph master
DBMS
Host04
Ceph master
NFS server
Host05
Ceph master
Web server
Host06
Sync server
DBMS
Host07
Load balancer
LDAP
Host08
Ceph OSD
Host09
Ceph OSD
Host10
Ceph OSD
Host11
Ceph OSD
Host12
Ceph OSD
18 / 27
PHP Instrumentation
OwnCloud (php)
function foo(){...}
19 / 27
PHP Instrumentation
OwnCloud (php)
function foo_inner() {...}
19 / 27
PHP Instrumentation
OwnCloud (php)
function foo_inner() {...}
foo_inner()
function foo() {
start = time()
end = time()
log(end - start)
}
19 / 27
Cross-Platform Instrumentation
OwnCloud (php)
function write_to_db() {
...
}
19 / 27
Cross-Platform Instrumentation
OwnCloud (php)
function write_to_db() {...
trace_id = rnd_string()...
}
19 / 27
Cross-Platform Instrumentation
OwnCloud (php)
function write_to_db() {...
trace_id = rnd_string()...
}
PostgreSQL (C)
function execute_query() {...
get(trace_id)...
}
19 / 27
Cross-Platform Instrumentation
OwnCloud (php)
function write_to_db() {...
trace_id = rnd_string()...
}
PostgreSQL (C)
function execute_query() {...
get(trace_id)...
}
Marshaller (C)
marshall() {...
put(trace_id)...
}
Unmarshaller (C)
unmarshall() {...
get(trace_id)...
}
19 / 27
Case Study
20 / 27
Case Study
annotations
20 / 27
It Works!
0
20
40
60
80
100
120
140
160
180
200
220
0 50 100 150 200 250 300
Me
mo
ry (
kb
yte
s)
Collection size
@Mem ∼ 721.362B ∗ |in|+8851.16B
21 / 27
It Works!
3
4
5
6
7
8
9
0 10 20 30 40 50 60 70 80
Me
mo
ry (
kb
yte
s)
String length
\OC\Files\View::getOwner()
@Mem ∼ 8.240kB
21 / 27
It Works!
0
0.1
0.2
0.3
0.4
0.5
0.6
0 50 100 150 200 250 300
Tim
e (
ms)
Collection size
\Sabre\DAV\Server::generateMultiStatus()
@Time ∼ 3.6e−03B ∗ |in|2 +3.8e−05B ∗ |in|+6.6e−06B
21 / 27
It Works!
0
100
200
300
400
500
600
0 20 40 60 80 100 120
Me
mo
ry (
byte
s)
String length
\OC\Files\Cache\Cache::normalize()
@Mem ∼ Norm(
411.9,15.35)
∨Norm(
435.3,24.41)
∨Norm(
448,0)
∨Norm(
459.6,19.48)
∨Norm(
477.6,23.18)
∨Norm(
502.0,18.58)
21 / 27
Case Study
22 / 27
Case Study
anomaly detection
22 / 27
Anomaly Injection
Client
VM
HAProxy
VMnginx
OwnCloudVM
PostgreSQL
VM
NFS
WEBDAV WEBDAV
TCP
TCP
23 / 27
Anomaly Injection
Client
VM
HAProxy
VMnginx
OwnCloudVM
PostgreSQL
VM
NFS
WEBDAV WEBDAV
TCP
TCP
Added Latency: 1-10ms23 / 27
Annotations Catch The Anomaly
0
10
20
30
40
50
60
70
80
90
100
0 20 40 60 80 100
%
Annotation robustness (%)
no delay
24 / 27
Annotations Catch The Anomaly
0
10
20
30
40
50
60
70
80
90
100
0 20 40 60 80 100
%
Annotation robustness (%)
no delay
1ms
24 / 27
Annotations Catch The Anomaly
0
10
20
30
40
50
60
70
80
90
100
0 20 40 60 80 100
%
Annotation robustness (%)
no delay
1ms
2ms
24 / 27
Annotations Catch The Anomaly
0
10
20
30
40
50
60
70
80
90
100
0 20 40 60 80 100
%
Annotation robustness (%)
no delay
1ms
2ms
5ms
24 / 27
Annotations Catch The Anomaly
0
10
20
30
40
50
60
70
80
90
100
0 20 40 60 80 100
%
Annotation robustness (%)
no delay
1ms
2ms
5ms
10ms
24 / 27
Future
25 / 27
Ongoing Work - Discussion
Machine learning techniques fine tuning
26 / 27
Ongoing Work - Discussion
Machine learning techniques fine tuning
Annotations composition◮ stack analysis
26 / 27
Ongoing Work - Discussion
Machine learning techniques fine tuning
Annotations composition◮ stack analysis
Workload creation◮ can we forge workloads that expose specific behaviors?
26 / 27
Ongoing Work - Discussion
Machine learning techniques fine tuning
Annotations composition◮ stack analysis
Workload creation◮ can we forge workloads that expose specific behaviors?
Feature Selection◮ Is a heuristically built set of basic features enough?◮ Can we exploit programmers’ knowledge of the system?
26 / 27
Ongoing Work - Discussion
Machine learning techniques fine tuning
Annotations composition◮ stack analysis
Workload creation◮ can we forge workloads that expose specific behaviors?
Feature Selection◮ Is a heuristically built set of basic features enough?◮ Can we exploit programmers’ knowledge of the system?
Extensive testing◮ Java, DaCapo benchmarks
26 / 27
Performance Annotations for Cloud Computing
Daniele Rogora∗ Steffen Smolka% Antonio Carzaniga∗ Amer Diwan$
Robert Soulé∗∼
presented by
Daniele Rogora
∗Università della Svizzera italiana %Cornell University $Google ∼Barefoot Networks
HotCloud 2017
27 / 27