VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to Do About It
Silent Killer: How Latency Destroys
Performance...And What to Do About It
Bhavesh Davda, VMware
Josh Simons, VMware
VSVC5187
#VSVC5187
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
What is Latency?
Examples in computing environments:
• Signal propagation within a microprocessor
• Memory access from cache, from local memory, from non-local memory
• PCI I/O data transfers
• Data access within rotating media
• Operating system scheduling
• Network communication, local and wide area
• Application logic
Typically reported as average latency
Latency is a measure of time delay experienced in a system,
the precise definition of which depends on the system and
the time being measured. (Wikipedia)
[Slide: latency numbers chart from https://gist.github.com/hellerbarde/2843375, annotated "...and IT person"]
A Latency Number Every Human Should Know
What is Jitter?
Examples in computing environments
• Unpredictable response times in financial trading applications
• Stalling, stuttering audio and video in telecommunication applications
• Reduced performance of distributed parallel computing applications
• Measurable variations in run times for long-running jobs
Jitter is variation in latency that causes non-deterministic
performance in seemingly deterministic workloads
“Insanity: doing the same thing over and
over again and expecting different results.”
Albert Einstein
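Both definitions are easy to make concrete: the "average latency" mentioned on the previous slide can hide a large tail, and jitter shows up as spread around that average. A minimal sketch using made-up sample latencies:

```python
# Average vs. worst-case latency, and jitter as variation, for a
# made-up sample of request latencies (milliseconds).
import statistics

latencies_ms = [1.0, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0, 1.1, 1.0, 25.0]  # one stall

avg = statistics.mean(latencies_ms)      # the usual "average latency"
worst = max(latencies_ms)                # the tail the average hides
jitter = statistics.stdev(latencies_ms)  # variation = jitter

print(f"average = {avg:.2f} ms")    # 3.42 ms - looks modest
print(f"worst   = {worst:.2f} ms")  # 25.00 ms - the stall a user actually feels
print(f"jitter  = {jitter:.2f} ms (stdev)")
```

The single 25 ms stall barely moves the mean but dominates both the worst case and the jitter, which is why latency-sensitive workloads are judged by their tails, not their averages.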
Effects of Latency and Jitter on VoIP Audio Quality
Audio samples – original vs. 5% drop vs. 20% drop:
http://www.voiptroubleshooter.com/sound_files/
[Diagram: packets 1–6 sent at a constant interval arrive with jitter; de-jitter buffering plays them back at a constant rate at the cost of added play-out latency, and packets that arrive too late (3 and 6) are dropped, leaving 1, 2, 4, 5]
ITU-T G.114 latency recommendation; Mean Opinion Score (MOS) bands, higher is better: 4.3–5.0, 4.0–4.3, 3.6–4.0, 3.1–3.6, 2.6–3.1
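The de-jitter buffering in the diagram can be sketched in a few lines: delay playout by a fixed buffer, play at a constant interval, and drop anything that arrives after its slot. The arrival times, buffer depth, and interval below are illustrative assumptions, not values from the slide:

```python
# De-jitter (playout) buffer sketch: packets are played at a constant
# interval after a fixed playout delay; late arrivals are dropped.

PLAYOUT_DELAY = 0.06  # added play-out latency (sec) bought by buffering
INTERVAL = 0.02       # constant playback interval (sec)

def dejitter(arrivals):
    """arrivals maps sequence number -> arrival time (sec)."""
    played, dropped = [], []
    for seq in sorted(arrivals):
        deadline = PLAYOUT_DELAY + (seq - 1) * INTERVAL  # playback slot
        (played if arrivals[seq] <= deadline else dropped).append(seq)
    return played, dropped

# Packets 3 and 6 arrive too late for their slots and are dropped,
# matching the "1 2 4 5" playback in the diagram.
arrivals = {1: 0.01, 2: 0.03, 3: 0.15, 4: 0.07, 5: 0.09, 6: 0.30}
print(dejitter(arrivals))  # ([1, 2, 4, 5], [3, 6])
```

A deeper buffer would save packets 3 and 6 but push total latency toward the ITU-T G.114 limits, which is exactly the latency-vs-loss tradeoff the MOS bands capture.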
The Case of the Missing Supercomputer Performance
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, Petrini, F., Kerbyson, D., Pakin, S., Proceedings of the 2003 ACM/IEEE Conference on Supercomputing
• Peer-to-peer parallel (MPI) application performance degrades as scale increases – up to 2X worse than predicted by model
• No obvious explanations, initially
• Noise – extraneous daemons, kernel timers, etc. – indicted as the problem
• Jittered arrival times at application synchronization points resulted in significant overall slowdowns
[Charts omitted; lower is better]
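The ASCI Q effect can be reproduced with a toy model: in a bulk-synchronous (MPI-style) code, every rank waits at the barrier for the slowest rank, so independent per-rank noise compounds with scale. A sketch in which the noise magnitude and rank counts are illustrative assumptions:

```python
# Toy model of OS noise at synchronization barriers: each rank's step
# takes fixed work plus random noise; the barrier completes when the
# slowest rank finishes, so the expected step time grows with rank count.
import random

def mean_barrier_time(nranks, work=1.0, noise=0.05, rounds=200, seed=42):
    rng = random.Random(seed)  # seeded for repeatability
    total = 0.0
    for _ in range(rounds):
        total += max(work + rng.random() * noise for _ in range(nranks))
    return total / rounds

for n in (1, 64, 8192):
    print(f"{n:>5} ranks: mean barrier step = {mean_barrier_time(n):.4f}")
```

With one rank the average step is near the middle of the noise range; at 8,192 ranks nearly every step pays the worst-case noise, so per-step jitter that looks negligible locally becomes a systematic slowdown at scale.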
Latency Affects Throughput, Packet Rate, and IOPs, Too
Assume a 100 bit/sec channel bandwidth (1 bit every 0.01 sec)
XMIT Time (sec) = Latency + Packet Size * 0.01
Throughput (bits/sec) = Packet Size / XMIT Time

Packet Size   Throughput (bits/sec)           Packet Rate (packets/sec)
(bits)        0 sec   0.01 sec   0.04 sec     0 sec   0.01 sec   0.04 sec
1             100     50         20           100     50         20
10            100     91         71           10      9          7
100           100     99         96           1       1          1
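The whole table falls out of the two formulas above; a quick sketch to reproduce it:

```python
# Reproduce the slide's model: a 100 bit/sec channel where delivering a
# packet costs (latency + packet_size * 0.01) seconds.

SEC_PER_BIT = 0.01  # 100 bit/sec channel

def xmit_time(size_bits, latency_sec):
    return latency_sec + size_bits * SEC_PER_BIT

def throughput(size_bits, latency_sec):
    """Effective throughput in bits/sec."""
    return size_bits / xmit_time(size_bits, latency_sec)

def packet_rate(size_bits, latency_sec):
    """Packets (or I/Os) completed per second."""
    return 1.0 / xmit_time(size_bits, latency_sec)

for size in (1, 10, 100):
    bps = " ".join(f"{throughput(size, lat):6.1f}" for lat in (0.0, 0.01, 0.04))
    pps = " ".join(f"{packet_rate(size, lat):6.1f}" for lat in (0.0, 0.01, 0.04))
    print(f"{size:>4} bits | bits/sec: {bps} | pkts/sec: {pps}")
```

Note how small transfers are hit hardest: at 0.04 sec latency a 1-bit packet achieves 20% of channel bandwidth while a 100-bit packet still achieves 96%, which is why latency dominates small-I/O and small-packet workloads (IOPs) long before it limits bulk throughput.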
Network Latency in Bare Metal Environments
• Message copy from application to OS (kernel)
• OS (network stack) + NIC driver queues packet for NIC
• NIC DMAs packet and transmits on the wire
[Diagram: server internals – CPUs, RAM, interconnect, NIC, disk – with the NIC attached to a network switch]
Network Latency in Virtual Environments
• Message copy from application to GOS (kernel)
• GOS (network stack) + vNIC driver queues packet for vNIC
• VM exit to VMM/Hypervisor
• vNIC implementation emulates DMA from VM, sends to vSwitch
• vSwitch queues packet for pNIC
• pNIC DMAs packet and transmits on the wire
[Diagram: VMs on the ESXi hypervisor – alongside management agents and background tasks – connected through the virtual switch and physical NIC to a network switch]
Network Storage: Small I/O Case Study
Rendering applications
• 1.4X – 3X slowdown seen initially
Customer NFS stress test
• 10K files
• 1K random reads/file
• 1 – 32K bytes
• 7X slowdown
Single change
• Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival
• See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled
Final application performance
• 1 – 5% slower than native
[Diagram: Application → Guest OS → ESXi → NFS Server]
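In a Linux guest, the generic way to check and turn off LRO on an interface is ethtool. A sketch, assuming an interface named eth0; whether your vNIC driver honors ethtool or needs a module parameter instead varies by kernel and driver, so consult KB 1027511 for your configuration:

```shell
# Check current offload settings for the (assumed) interface eth0
ethtool -k eth0 | grep -i large-receive-offload

# Disable LRO so small messages are delivered individually
# rather than coalesced on receive
ethtool -K eth0 lro off
```

The setting does not persist across reboots by default; it would need to be reapplied via the distribution's network configuration.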
Data Center Networks – the Trend to Fabrics
[Diagram: traditional hierarchical data center network carrying NORTH/SOUTH traffic to the WAN/Internet, versus a flattened fabric optimized for EAST/WEST traffic]
General Guidelines about Tuning for Latency
vSphere ESXi is designed for high performance and fairness
• Maximizes overall performance of all VMs without unfairly penalizing any VM
• Defaults are carefully tuned for high throughput
Tunable settings should be thoroughly vetted in a test environment before deployment
Tuning should be applied individually to study the effects on performance
• Maintain good change control practices
Certain tunables for lowest latency can negatively affect throughput and efficiency, so consider tradeoffs
• Consider isolating latency-sensitive VMs on dedicated hosts
• DRS host groups can be used to manage groups of hosts supporting latency-sensitive VMs
Optimizing for Latency-sensitive Workloads (1 of 3)
Power Management
• Set at both BIOS and hypervisor levels: Max performance / Static High
• Hyperthreading may cause jitter due to pipeline sharing
• Intel Turbo Boost may cause runtime jitter
CPU and memory over-commitment
• Transparent page sharing may cause jitter due to non-deterministic share-breaking on writes; to disable: sched.mem.pshare.enable = FALSE
• Memory compression; to disable: Mem.MemZipEnable = 0
• Better to avoid over-subscription of resources
Memory virtualization
• Hardware memory virtualization can sometimes be slower than software approaches
• For shadow page tables (i.e., the software approach): monitor.virtual_mmu = software
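Collected in one place, the per-VM options above would go into the VM's .vmx file (or the vSphere Client's Advanced Settings) roughly as follows. This is a sketch of the settings named on the slide, not a recommended default; note that Mem.MemZipEnable is a host-level advanced setting rather than a per-VM option:

```
# Per-VM .vmx options from this slide (apply with the VM powered off)

# Opt this VM out of transparent page sharing to avoid
# non-deterministic share-breaking on writes
sched.mem.pshare.enable = "FALSE"

# Force shadow page tables (software MMU) instead of EPT/RVI
monitor.virtual_mmu = "software"

# Host-level (not .vmx): Advanced Settings -> Mem.MemZipEnable = 0
# disables memory compression for the whole host
```

As the general guidelines slide says, each of these should be applied and measured individually in a test environment before deployment.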
Memory Virtualization
EPT = Intel Extended Page Tables = hardware page table virtualization (AMD equivalent: RVI)
[Diagram: virtual → physical → machine address translation]

HPL (GFLOP/s)      Native    Virtual, EPT on   Virtual, EPT off
4K guest pages     37.04     36.04 (97.3%)     36.22 (97.8%)
2MB guest pages    37.74     38.24 (100.1%)    38.42 (100.2%)

*RandomAccess (GUP/s)   Native    Virtual, EPT on   Virtual, EPT off
4K guest pages          0.01842   0.0156 (84.8%)    0.0181 (98.3%)
2MB guest pages         0.03956   0.0380 (96.2%)    0.0390 (98.6%)
NUMA and vNUMA
[Diagram: guest application spread across virtual NUMA nodes, mapped by the hypervisor onto physical sockets and their local memory]
Making virtual NUMA nodes visible within the Guest OS allows ESXi to respect GOS process placement and memory allocation decisions, which can lead to significant performance increases
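From inside a Linux guest you can verify that the vNUMA topology is actually visible to the guest OS. A diagnostic sketch; `numactl` comes from the numactl package and may need to be installed:

```shell
# Show the NUMA topology the guest OS sees; with vNUMA exposed,
# multiple nodes with their own CPU lists and memory should appear
# rather than a single flat node
numactl --hardware

# Alternative: lscpu reports "NUMA node(s)" and per-node CPU lists
lscpu | grep -i numa
```

If the guest reports a single node despite spanning multiple physical sockets, vNUMA is not being exposed and the guest scheduler cannot make placement decisions that match the hardware.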
Optimizing for Latency-sensitive Workloads (2 of 3)
NUMA
• ESXi optimally allocates CPU and memory
• NUMA node affinity can be set manually: numa.nodeAffinity = X
• Exposing NUMA topology to wide guests (vNUMA) can be very important. Automatic for #vCPU > 8; can be forced otherwise: numa.vcpu.min = N (< #vCPUs)
• NUMA scheduler does not include HT by default. Can be overridden to prevent a VM from being split across NUMA nodes: numa.vcpu.preferHT = “1”
vNUMA Performance Study: SpecOMP (Lower is Better)
Performance Evaluation of HPC Benchmarks on VMware’s ESX Server, Ali, Q., Kiriansky, V., Simons, J., Zaroo, P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011
Optimizing for Latency-sensitive Workloads (2 of 3, continued)
VM scheduling optimizations
• e.g., suppress descheduling of the vCPU when the guest halts: monitor_control.halt_desched = FALSE
Guest OS choice
• Later distributions are usually better (tickless kernel, etc.)
• RHEL 6+, SLES 11+, etc. (2.6.32+ kernel)
• Windows Server 2008+
Optimizing for Latency-sensitive Workloads (3 of 3)
Storage
• Storage stack already tuned for small block transfers
• iSCSI and NAS (host and guest) affected by network tuning parameters
• Local Flash memory’s much lower latency exposes overheads in the software stack that we are working to address
Networking
• Interrupt coalescing should be disabled
  - vNIC: ethernetX.coalescingScheme = “disabled”
  - pNIC: esxcli module parameter (driver-specific)
• Jumbo frames may interfere with low-latency traffic
• Disable Large Receive Offload (LRO) for TCP (including NAS)
• Polling for I/O completion rather than using interrupts (DPDK, RDMA poll mode)
• Passthrough / direct assignment for lowest I/O latencies
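On the ESXi host, the two coalescing knobs above look roughly like this. A sketch: `InterruptThrottleRate` is an Intel ixgbe driver parameter used purely as an example here; your pNIC driver's parameter names will differ, so list them first:

```shell
# vNIC side: per-VM .vmx option to disable virtual interrupt coalescing
#   ethernetX.coalescingScheme = "disabled"

# pNIC side: inspect the NIC driver's module parameters...
esxcli system module parameters list -m ixgbe

# ...then set the driver-specific interrupt moderation parameter
# (example parameter for Intel ixgbe; check your driver's docs)
esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"
```

Driver module parameters take effect after the module is reloaded or the host is rebooted, and disabling interrupt moderation trades higher CPU load for lower latency.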
Kernel Bypass Model
[Diagram: bare metal – the application’s socket path traverses TCP/IP and the driver in the kernel, while the RDMA path goes from user space directly to hardware; virtual – the same kernel-bypass RDMA path runs from the guest application past the guest kernel and the vmkernel to hardware]
InfiniBand Bandwidth with Passthrough / Direct Assignment
[Chart: bandwidth (MB/s, 0–3500) vs. message size (2 bytes – 8M); series: Send: Native, Send: ESXi, RDMA Read: Native, RDMA Read: ESXi]
RDMA Performance in Virtual Machines using QDR InfiniBand on VMware vSphere 5, April 2011
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
Latency with Passthrough / Direct Assignment (Send/Rcv, Polling)
[Chart: half round-trip latency (µs, 1–4096) vs. message size (2 bytes – 8M); series: Native, ESXi ExpA; lower is better]

MsgSize (bytes)   Native (µs)   ESXi ExpA (µs)
2                 1.35          1.75
4                 1.35          1.75
8                 1.38          1.78
16                1.37          2.05
32                1.38          2.35
64                1.39          2.90
128               1.50          4.13
256               2.30          2.31
New Features Planned for Upcoming vSphere ESXi Releases
New virtual machine property: “Latency Sensitivity”
• High => lowest latency
Exclusively assign physical CPUs to virtual CPUs of “Latency Sensitivity = High” VMs
• Physical CPUs not used for scheduling other VMs or ESXi tasks
Idle in Virtual Machine Monitor (VMM) when Guest OS is idle
• Lowers latency to wake up the idle Guest OS, compared to idling in the ESXi vmkernel
Disable vNIC interrupt coalescing
For DirectPath I/O, optimize the interrupt delivery path for lowest latency
Make the ESXi vmkernel more preemptible
• Reduces jitter due to long-running kernel code
Summary
Virtualization does add some latency over bare metal
vSphere is generally tuned for throughput and fairness
• Tunables exist at the host, VM, and guest level to improve latency
• This will become more automatic in subsequent releases
ESXi is a good hypervisor for virtualizing an increasingly broad
array of applications, including latency-sensitive applications such
as Telco, Financial, and some HPC workloads
When you observe application performance degradation in the future, we hope you will think about the “silent killer” and try some of the techniques we’ve described here
Resources
Best Practices for Performance Tuning of Latency-Sensitive
Workloads in vSphere VMs
http://www.vmware.com/resources/techresources/10220
Network I/O Latency in vSphere 5
http://www.vmware.com/resources/techresources/10256
Deploying Extremely Latency-Sensitive Applications in vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/deploying-latency-sensitive-apps-vSphere5.pdf
RDMA Performance in Virtual Machines Using QDR InfiniBand on VMware vSphere 5
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
Other VMworld Activities Related to This Session
HOL: HOL-SDC-1304 – vSphere Performance Optimization
Session: VSVC5596 – Extreme Performance Series: Network Speed Ahead
THANK YOU
Silent Killer: How Latency Destroys
Performance...And What to Do About It
Bhavesh Davda, VMware
Josh Simons, VMware
VSVC5187
#VSVC5187