Lies, Damn Lies and Performance Metrics
Barry Cooks
Virtual Instruments
2015 Data Storage Innovation Conference. ©Virtual Instruments. All Rights Reserved.
Goal for This Talk
Take away a sense of how to make the move from:
Improving your mean time to innocence
to
Improving your infrastructure performance
What We’ll Cover
A case of performance metrics gone bad
Some history
What performance monitoring needs
The lies
The damn lies
The performance metrics
How you can use them
Application is down … again.
Data Center Management - Actual
You see this???
Array tools say it’s okay…
Data Center Management - Actual
How can I “help”?
Data Center Management - Actual
Meanwhile, at the storage vendor …
Have you tried updating your drivers and firmware?
And the switch vendor …
Can you clear the counters and run another log collection?
IBM – A Point of Reference
Mainframes collected and correlated lots of data about the workload and infrastructure.
Closed vs. Open Systems
The move to open systems introduced:
Numerous competing vendors
Interconnected specialized devices
Inconsistency in monitoring methods and metrics
Correlating data from multiple vendors is a serious challenge
Vendors’ focus has been on core innovation
Monitoring became a secondary priority
What does performance monitoring need?
What’s Required for Success
Understanding what data is relevant
A method to gather that data, ideally without impacting the systems under monitoring
End-to-end view of data
Historical data retention
Comparable data across vendor ecosystem
Actionable insights from that data
Performance Monitoring Today
“Performance” metrics are often:
Not really performance metrics at all — utilization, error counters
Samples taken on a polling interval: every minute, hour, six hours?
Rollup averages over a window of time
At 16G, a single 2KB frame takes 1.25μs to transmit. That’s 48 million 2K reads per minute. A fifteen-minute average? That’s the population of Europe.
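The slide's arithmetic checks out in a few lines:

```python
# Back-of-the-envelope check of the slide's numbers: at 16GFC the slide
# quotes 1.25 microseconds to transmit one 2KB frame.
FRAME_TIME_US = 1.25

frames_per_second = 1_000_000 / FRAME_TIME_US   # 800,000
frames_per_minute = frames_per_second * 60      # "48 million 2K reads per minute"
frames_per_15_min = frames_per_minute * 15      # one 15-minute rollup window

print(f"{frames_per_minute:,.0f} frames/minute")
print(f"{frames_per_15_min:,.0f} frames per 15-minute window")  # 720,000,000 — roughly Europe's population
```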
Traditional Performance Management
The Outlier
[Chart: data points ranging from $0 to $1,000,000, with two marked averages: $67K and $295K.]
The Hidden Issue
[Chart: response time (1ms to 5,000ms) and I/Os per second over a 60-second window.]
Workload: 10,000 I/Os per second at 1ms for the first 20 seconds; 32 I/Os at 5,000ms each during a 5-second stall; then 10,000 I/Os per second at 1ms for the remaining 35 seconds.
Total commands: 10,000 × 55s + 32 = 550,032
Total I/O time: 1ms × 10,000 I/Os/s × 55s + 32 I/Os × 5,000ms = 710,000ms
Average response time = 710,000ms / 550,032 ≈ 1.29ms
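The slide's arithmetic, verified directly:

```python
# Reproducing the "hidden issue": a 5-second stall vanishes inside a
# one-minute average response time.
fast_ios = 10_000 * 55            # 10,000 I/Os per second at 1ms, for 55 of the 60 seconds
slow_ios = 32                     # 32 I/Os at 5,000ms each during the 5-second stall

total_ios = fast_ios + slow_ios                     # 550,032 commands
total_time_ms = fast_ios * 1 + slow_ios * 5_000     # 710,000 ms of I/O time

avg_ms = total_time_ms / total_ios
print(f"average response time: {avg_ms:.2f} ms")    # ~1.29 ms, despite 5,000 ms outliers
```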
A Question of Balance
Is the traffic between these ports on the same server balanced?
Port A mean traffic: 4.41Mb/s
Port B mean traffic: 4.40Mb/s
Workload Profiling
Vendor “Response Time” Metrics
Utilization = 100% × Busy Time in Period / (Idle + Busy Time in Period)
Throughput = Total Number of Visitors in Period / Length of Period in Seconds
Average Busy Queue Length (ABQL) = Sum of Queue Length Upon Arrival of Each Visitor / Total Number of Visitors
Queue Length = ABQL × Utilization / 100%
Response Time = Queue Length / Throughput (Little’s Law)
Expanded:
Response Time = ((Sum of Queue Upon Arrival of Visitor / Total Number of Visitors) × (100% × Busy Time in Period / (Idle + Busy Time in Period)) / 100%) / (Number of Visitors in Period / Length of Period)
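Chained together, the derivation looks like this; all counter values below are hypothetical:

```python
# A sketch of the vendor "response time" derivation via Little's Law,
# using made-up counter values for one polling period.
busy_time = 42.0          # seconds the device was busy in the period
idle_time = 18.0          # seconds the device was idle
visitors = 12_000         # I/Os completed in the period
queue_sum = 30_000        # sum of queue length observed at each I/O arrival
period = busy_time + idle_time

utilization = 100.0 * busy_time / (busy_time + idle_time)   # 70%
throughput = visitors / period                              # 200 I/Os per second
abql = queue_sum / visitors                                 # average busy queue length = 2.5
queue_length = abql * utilization / 100.0                   # 1.75
response_time = queue_length / throughput                   # Little's Law: W = L / lambda

print(f"response time = {response_time * 1000:.2f} ms")     # 8.75 ms
```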
Vendor “Response Time” Metrics
The Fine Print (Necessary Caveats):
For low LUN throughput (<32 IOPS), response time might be inaccurate.
Lazy writes skew the LUN busy counter.
Dual SP ownership of a disk can also impact response time.
Each SP only knows about its own ABQL, throughput and utilization for the disk.
At poll time, they exchange views. The utilization is max(SPA,SPB).
ABQL is computed from the sum of the sums.
And SP throughput is the sum of SPA and SPB throughput.
Be wary of confusing SP response time in Analyzer with the average response time of all
LUNs on that SP.
A LUN is busy (not resting) as long as something is queued to it.
An SP is busy (not resting) as long as it is not in the OS idle loop.
While a disk is busy getting a LUN request, the LUN is still busy.
While a disk is busy getting a LUN request, the SP might be idle.
The SP response time is generally smaller than the average response time of all the
LUNs on that SP.
Host response time is approximated by LUN response time.
Data Time Skew
R² at a one-minute delay is 0.91, while at zero delay it is 0.41.
Gathering the Data
A challenge for external software-based monitoring: perturbing the system under investigation by adding load or changing behavior.
Data Collection
[Diagrams: data collection points spanning hosts (AIX, VMware, HPux, Solaris, HyperV), SAN switches (Brocade, Cisco), and storage arrays (EMC, HDS, IBM).]
The damn lies
Decisions Based on Thresholds
[Flowchart parody of threshold selection: refer to documentation, ask somebody, or just input a value; check whether you get “just the right number of alarms” (“On the first try? Yeah, right.” — if yes, “Go buy a lottery ticket, immediately.”); otherwise pick a lower threshold and repeat, or give up, create an email filter, and call it done.]
Where should alarm thresholds be placed?
Traditional Performance Management
Data Granularity Challenge
[Charts: the same traffic plotted against a fixed alarm threshold at one-minute, one-second, and one-millisecond granularity; the threshold crossings only become visible at the finer granularities.]
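The granularity effect can be demonstrated with synthetic data; every number below is hypothetical:

```python
# Millisecond-level latency spikes cross an alarm threshold, but the
# one-minute average of the very same samples never does.
THRESHOLD_MS = 10.0

# 60,000 one-millisecond samples: a 2 ms baseline with brief 50 ms spikes.
samples = [2.0] * 60_000
for start in range(0, 60_000, 5_000):
    for i in range(start, start + 20):       # each spike lasts 20 ms
        samples[i] = 50.0

minute_avg = sum(samples) / len(samples)
spike_count = sum(1 for s in samples if s > THRESHOLD_MS)

print(f"one-minute average: {minute_avg:.2f} ms (below the {THRESHOLD_MS} ms threshold)")
print(f"millisecond samples over threshold: {spike_count}")
```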
Performance metrics
Traditional Performance Management
The Outlier - Revisited
[Chart: data points ranging from $0 to $1,000,000, with two marked averages: $67K and $295K.]
What Does Average Response Time Mean?
Q: When you hear your average response time is 20ms, what is the first thing that pops into your mind?
A. My response distribution must look like this: [chart]
B. My response distribution must look like this: [chart]
C. My response distribution must look like this: [chart]
D. My response distribution must look like this: [chart]
E. I don’t know what my response distribution looks like, because taking an average of all the response times is not a helpful thing to do.
F. When’s lunch?
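Answer E is the point: a mean says nothing about shape. A minimal illustration with made-up numbers — two very different distributions, one identical average:

```python
# Two hypothetical response-time distributions with the same 20 ms mean.
uniform_like = [20.0] * 1000                  # every I/O takes exactly 20 ms
bimodal = [1.0] * 950 + [381.0] * 50          # 95% fast I/Os, 5% very slow outliers

def mean(xs):
    return sum(xs) / len(xs)

print(mean(uniform_like), mean(bimodal))      # both 20.0
```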
What Are “Histograms”?
A histogram is a graphical representation of the distribution of data.
Scalar quantization, typically denoted as y = Q(x), is the process of using a quantization function Q() to map a scalar (one-dimensional) input value x to a scalar output value y.
Histogram Bins
Timing Bins (each bin is > its lower bound and <= its upper bound):
Reads: 0-0.05ms, 0.05-0.2ms, 0.2-0.5ms, 0.5-1ms, 1-2ms, 2-4ms, 4-6ms, 6-8ms, 8-10ms, 10-15ms, 15-20ms, 20-30ms, 30-50ms, 50-75ms, 75-100ms, 100-150ms, 150-250ms, 250-500ms, 500-1000ms, 1000-4500ms, >4500ms
Writes: 0-0.05ms, 0.05-0.1ms, 0.1-0.2ms, 0.2-0.3ms, 0.3-0.5ms, 0.5-0.7ms, 0.7-1ms, 1-1.5ms, 1.5-2ms, 2-3ms, 3-4ms, 4-6ms, 6-10ms, 10-20ms, 20-30ms, 30-50ms, 50-75ms, 75-100ms, 100-150ms, 150-250ms, 250-1000ms, 1000-4500ms, >4500ms
Size Bins (Reads & Writes): 0-0.5KiB, 0.5-1KiB, 1-2KiB, 2-3KiB, 3-4KiB, 4-8KiB, 8-12KiB, 12-16KiB, 16-24KiB, 24-32KiB, 32-48KiB, 48-60KiB, 60-64KiB, 64-96KiB, 96-128KiB, 128-192KiB, 192-256KiB, 256-512KiB, 512-1024KiB, >1024KiB
The bins were selected on three criteria:
1. Sampling from live datacenter systems
2. Common SLA language: service level agreements commonly use 10, 15, 20, 30, and 50ms boundaries
3. Expected disk seek/access latencies:
   a. Cache hit range: 0-0.5ms
   b. EFD/SSD range: 0.5-2ms
   c. 15k FC/SAS range: 2-6ms
   d. 10k FC/SAS range: 6-10ms
   e. SATA/NL-SAS range: 10-15ms
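As a sketch, binning a latency sample against the read timing bins above needs only a sorted list of upper edges:

```python
import bisect

# Read-latency bin upper edges in ms, taken from the bin table above.
# Each bin is > its lower edge and <= its upper edge; the final bin is >4500ms.
READ_EDGES_MS = [0.05, 0.2, 0.5, 1, 2, 4, 6, 8, 10, 15, 20, 30,
                 50, 75, 100, 150, 250, 500, 1000, 4500]

def bin_index(latency_ms: float) -> int:
    """Return the histogram bin index for a read latency."""
    # bisect_left puts a value equal to an edge into the bin it closes,
    # matching the "<= upper bound" semantics.
    return bisect.bisect_left(READ_EDGES_MS, latency_ms)

counts = [0] * (len(READ_EDGES_MS) + 1)
for lat in (0.03, 0.8, 3.2, 12.0, 6000.0):    # sample latencies in ms
    counts[bin_index(lat)] += 1

print(counts)
```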
Write Cache Misses
[Chart: write response-time histogram with distinct populations of cache hits and cache misses.]
Impacts of Auto-Tiering
[Chart: with auto-tiering left unattended, the response-time histogram spreads across the cache-hit, SSD, FC, and SATA latency ranges.]
IO Size Skew
An average I/O size of 80KiB does not do a very good job of describing the distribution.
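For example, a bimodal workload (made-up numbers) can average exactly 80 KiB without a single I/O being anywhere near that size:

```python
# Hypothetical illustration of I/O size skew: the average is 80 KiB,
# yet no individual I/O is 80 KiB.
small = [8] * 600       # 600 small 8 KiB I/Os
large = [188] * 400     # 400 large 188 KiB I/Os

sizes = small + large
avg = sum(sizes) / len(sizes)
print(f"average I/O size: {avg:.0f} KiB")
```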
Histogram Capabilities
Answers, not data
How to Analyze HBA Queue Depth
Approach #1 — Threshold trigger on high-quality raw data:
if (queue_size > 128)
    throw_red_flag
Approach #2 — Average metric:
Average queue depth = 15
How to Analyze HBA Queue Depth
Approach #3 — Combining Multiple Metrics With Machine Learning Analytics
[Charts: response time (ms) plotted against queue size, with 50th, 75th, and 95th percentile curves. One plot shows the execution throttle set too high; the other shows it set properly.]
Both of these scenarios would trigger red flags in Approach #2.
Repositioning VMs in a Cluster
Approach #1 — Threshold trigger:
if (vm_cpu_usage > 85%)
    move_vm_process
Approach #2 — Average metrics over high-quality raw data:
VM#1 CPU, memory, network, and disk usage
Repositioning VMs in a Cluster
Approach #3 — Predict Future Usage and Reorganize to Fix Bottlenecks BEFORE They Happen
Reorganize VMs such that the busy times of one VM correspond with the free times of the rest of the server.
[Charts: server CPU utilization % over time, including both dynamic CPU and memory utilization. Today, VM#12, VM#35, and VM#46 bottleneck one server; the predicted-future layout places VM#16, VM#25, and VM#17 together for steady usage.]
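The reorganization idea can be sketched as a toy placement heuristic; every VM name and CPU profile below is hypothetical. Pick the pair of VMs whose busy periods overlap least, i.e. whose combined peak utilization is lowest:

```python
from itertools import combinations

# Hypothetical hourly CPU-utilization profiles for three VMs.
profiles = {
    "VM12": [80, 80, 10, 10],   # busy mornings
    "VM35": [10, 10, 80, 80],   # busy afternoons
    "VM46": [70, 75, 70, 75],   # busy all day
}

def combined_peak(a: str, b: str) -> int:
    """Peak utilization of a server hosting both VMs, hour by hour."""
    return max(x + y for x, y in zip(profiles[a], profiles[b]))

# Co-locate the pair with the lowest combined peak: anti-correlated
# workloads pack together without creating a bottleneck.
best = min(combinations(profiles, 2), key=lambda pair: combined_peak(*pair))
print(best, combined_peak(*best))
```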
Where We Landed
Using high-quality, low-impact data, we can drive better decision-making across the infrastructure.
Analytics will enable a change in the way answers are derived from the data.