VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to Do About It
Silent Killer: How Latency Destroys
Performance...And What to Do About It
Bhavesh Davda, VMware
Josh Simons, VMware
VSVC5187
#VSVC5187
Agenda
Introduction
• Definitions
• Effects
• Sources
Mitigation
• BIOS settings
• CPU scheduling and over-commitment
• Memory over-commitment and MMU virtualization
• NUMA and vNUMA
• Guest OS
• Storage
• Networking
What is Latency?
Examples in computing environments:
• Signal propagation within a microprocessor
• Memory access from cache, from local memory, from non-local memory
• PCI I/O data transfers
• Data access within rotating media
• Operating system scheduling
• Network communication, local and wide area
• Application logic
Typically reported as average latency
Latency is a measure of time delay experienced in a system,
the precise definition of which depends on the system and
the time being measured. (Wikipedia)
[Slide: latency numbers chart from https://gist.github.com/hellerbarde/2843375, annotated "...and IT person"]
A Latency Number Every Human Should Know
What is Jitter?
Examples in computing environments
• Unpredictable response times in financial trading applications
• Stalling, stuttering audio and video in telecommunication applications
• Reduced performance of distributed parallel computing applications
• Measurable variations in run times for long-running jobs
Jitter is variation in latency that causes non-deterministic
performance in seemingly deterministic workloads
“Insanity: doing the same thing over and
over again and expecting different results.”
Albert Einstein
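Both definitions are easy to make concrete: the "average latency" mentioned on the previous slide can hide a large tail, and jitter shows up as spread around that average. A minimal sketch using made-up sample latencies:

```python
# Average vs. worst-case latency, and jitter as variation, for a
# made-up sample of request latencies (milliseconds).
import statistics

latencies_ms = [1.0, 1.1, 0.9, 1.0, 1.2, 0.9, 1.0, 1.1, 1.0, 25.0]  # one stall

avg = statistics.mean(latencies_ms)      # the usual "average latency"
worst = max(latencies_ms)                # the tail the average hides
jitter = statistics.stdev(latencies_ms)  # variation = jitter

print(f"average = {avg:.2f} ms")    # 3.42 ms - looks modest
print(f"worst   = {worst:.2f} ms")  # 25.00 ms - the stall a user actually feels
print(f"jitter  = {jitter:.2f} ms (stdev)")
```

The single 25 ms stall barely moves the mean but dominates both the worst case and the jitter, which is why latency-sensitive workloads are judged by their tails, not their averages.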
Effects of Latency and Jitter on VoIP Audio Quality
Audio samples – original vs. 5% drop vs. 20% drop:
http://www.voiptroubleshooter.com/sound_files/
[Diagram: packets 1–6 sent at a constant interval arrive with jitter; de-jitter buffering plays them back at a constant rate at the cost of added play-out latency, and packets that arrive too late (3 and 6) are dropped, leaving 1, 2, 4, 5]
ITU-T G.114 latency recommendation; Mean Opinion Score (MOS) bands, higher is better: 4.3–5.0, 4.0–4.3, 3.6–4.0, 3.1–3.6, 2.6–3.1
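The de-jitter buffering in the diagram can be sketched in a few lines: delay playout by a fixed buffer, play at a constant interval, and drop anything that arrives after its slot. The arrival times, buffer depth, and interval below are illustrative assumptions, not values from the slide:

```python
# De-jitter (playout) buffer sketch: packets are played at a constant
# interval after a fixed playout delay; late arrivals are dropped.

PLAYOUT_DELAY = 0.06  # added play-out latency (sec) bought by buffering
INTERVAL = 0.02       # constant playback interval (sec)

def dejitter(arrivals):
    """arrivals maps sequence number -> arrival time (sec)."""
    played, dropped = [], []
    for seq in sorted(arrivals):
        deadline = PLAYOUT_DELAY + (seq - 1) * INTERVAL  # playback slot
        (played if arrivals[seq] <= deadline else dropped).append(seq)
    return played, dropped

# Packets 3 and 6 arrive too late for their slots and are dropped,
# matching the "1 2 4 5" playback in the diagram.
arrivals = {1: 0.01, 2: 0.03, 3: 0.15, 4: 0.07, 5: 0.09, 6: 0.30}
print(dejitter(arrivals))  # ([1, 2, 4, 5], [3, 6])
```

A deeper buffer would save packets 3 and 6 but push total latency toward the ITU-T G.114 limits, which is exactly the latency-vs-loss tradeoff the MOS bands capture.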
The Case of the Missing Supercomputer Performance
The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, Petrini, F., Kerbyson, D., Pakin, S., Proceedings of the 2003 ACM/IEEE Conference on Supercomputing
• Peer-to-peer parallel (MPI) application performance degrades as scale increases – up to 2X worse than predicted by model
• No obvious explanations, initially
• Noise – extraneous daemons, kernel timers, etc. – indicted as the problem
• Jittered arrival times at application synchronization points resulted in significant overall slowdowns
[Charts omitted; lower is better]
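The ASCI Q effect can be reproduced with a toy model: in a bulk-synchronous (MPI-style) code, every rank waits at the barrier for the slowest rank, so independent per-rank noise compounds with scale. A sketch in which the noise magnitude and rank counts are illustrative assumptions:

```python
# Toy model of OS noise at synchronization barriers: each rank's step
# takes fixed work plus random noise; the barrier completes when the
# slowest rank finishes, so the expected step time grows with rank count.
import random

def mean_barrier_time(nranks, work=1.0, noise=0.05, rounds=200, seed=42):
    rng = random.Random(seed)  # seeded for repeatability
    total = 0.0
    for _ in range(rounds):
        total += max(work + rng.random() * noise for _ in range(nranks))
    return total / rounds

for n in (1, 64, 8192):
    print(f"{n:>5} ranks: mean barrier step = {mean_barrier_time(n):.4f}")
```

With one rank the average step is near the middle of the noise range; at 8,192 ranks nearly every step pays the worst-case noise, so per-step jitter that looks negligible locally becomes a systematic slowdown at scale.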
Latency Affects Throughput, Packet Rate, and IOPs, Too
Assume a 100 bit/sec channel bandwidth (1 bit every 0.01 sec)
XMIT Time (sec) = Latency + Packet Size * 0.01
Throughput (bits/sec) = Packet Size / XMIT Time

Packet Size   Throughput (bits/sec)           Packet Rate (packets/sec)
(bits)        0 sec   0.01 sec   0.04 sec     0 sec   0.01 sec   0.04 sec
1             100     50         20           100     50         20
10            100     91         71           10      9          7
100           100     99         96           1       1          1
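The whole table falls out of the two formulas above; a quick sketch to reproduce it:

```python
# Reproduce the slide's model: a 100 bit/sec channel where delivering a
# packet costs (latency + packet_size * 0.01) seconds.

SEC_PER_BIT = 0.01  # 100 bit/sec channel

def xmit_time(size_bits, latency_sec):
    return latency_sec + size_bits * SEC_PER_BIT

def throughput(size_bits, latency_sec):
    """Effective throughput in bits/sec."""
    return size_bits / xmit_time(size_bits, latency_sec)

def packet_rate(size_bits, latency_sec):
    """Packets (or I/Os) completed per second."""
    return 1.0 / xmit_time(size_bits, latency_sec)

for size in (1, 10, 100):
    bps = " ".join(f"{throughput(size, lat):6.1f}" for lat in (0.0, 0.01, 0.04))
    pps = " ".join(f"{packet_rate(size, lat):6.1f}" for lat in (0.0, 0.01, 0.04))
    print(f"{size:>4} bits | bits/sec: {bps} | pkts/sec: {pps}")
```

Note how small transfers are hit hardest: at 0.04 sec latency a 1-bit packet achieves 20% of channel bandwidth while a 100-bit packet still achieves 96%, which is why latency dominates small-I/O and small-packet workloads (IOPs) long before it limits bulk throughput.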
Network Latency in Bare Metal Environments
• Message copy from application to OS (kernel)
• OS (network stack) + NIC driver queues packet for NIC
• NIC DMAs packet and transmits on the wire
[Diagram: server internals – CPUs, RAM, interconnect, NIC, disk – with the NIC attached to a network switch]
Network Latency in Virtual Environments
• Message copy from application to GOS (kernel)
• GOS (network stack) + vNIC driver queues packet for vNIC
• VM exit to VMM/Hypervisor
• vNIC implementation emulates DMA from VM, sends to vSwitch
• vSwitch queues packet for pNIC
• pNIC DMAs packet and transmits on the wire
[Diagram: VMs on the ESXi hypervisor – alongside management agents and background tasks – connected through the virtual switch and physical NIC to a network switch]
Network Storage: Small I/O Case Study
Rendering applications
• 1.4X – 3X slowdown seen initially
Customer NFS stress test
• 10K files
• 1K random reads/file
• 1 – 32K bytes
• 7X slowdown
Single change
• Disable LRO (Large Receive Offload) within the guest to avoid coalescing of small messages upon arrival
• See KB 1027511: Poor TCP Performance can occur in Linux virtual machines with LRO enabled
Final application performance
• 1 – 5% slower than native
[Diagram: Application → Guest OS → ESXi → NFS Server]
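In a Linux guest, the generic way to check and turn off LRO on an interface is ethtool. A sketch, assuming an interface named eth0; whether your vNIC driver honors ethtool or needs a module parameter instead varies by kernel and driver, so consult KB 1027511 for your configuration:

```shell
# Check current offload settings for the (assumed) interface eth0
ethtool -k eth0 | grep -i large-receive-offload

# Disable LRO so small messages are delivered individually
# rather than coalesced on receive
ethtool -K eth0 lro off
```

The setting does not persist across reboots by default; it would need to be reapplied via the distribution's network configuration.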
Data Center Networks – the Trend to Fabrics
[Diagram: traditional hierarchical data center network carrying NORTH/SOUTH traffic to the WAN/Internet, versus a flattened fabric optimized for EAST/WEST traffic]
General Guidelines about Tuning for Latency
vSphere ESXi is designed for high performance and fairness
• Maximizes overall performance of all VMs without unfairly penalizing any VM
• Defaults are carefully tuned for high throughput
Tunable settings should be thoroughly vetted in a test environment before deployment
Tuning should be applied individually to study the effects on performance
• Maintain good change control practices
Certain tunables for lowest latency can negatively affect throughput and efficiency, so consider tradeoffs
• Consider isolating latency-sensitive VMs on dedicated hosts
• DRS host groups can be used to manage groups of hosts supporting latency-sensitive VMs
Optimizing for Latency-sensitive Workloads (1 of 3)
Power Management
• Set at both BIOS and hypervisor levels: Max performance / Static High
• Hyperthreading may cause jitter due to pipeline sharing
• Intel Turbo Boost may cause runtime jitter
CPU and memory over-commitment
• Transparent page sharing may cause jitter due to non-deterministic share-breaking on writes; to disable: sched.mem.pshare.enable = FALSE
• Memory compression; to disable: Mem.MemZipEnable = 0
• Better to avoid over-subscription of resources
Memory virtualization
• Hardware memory virtualization can sometimes be slower than software approaches
• For shadow page tables (i.e., the software approach): monitor.virtual_mmu = software
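Collected in one place, the per-VM options above would go into the VM's .vmx file (or the vSphere Client's Advanced Settings) roughly as follows. This is a sketch of the settings named on the slide, not a recommended default; note that Mem.MemZipEnable is a host-level advanced setting rather than a per-VM option:

```
# Per-VM .vmx options from this slide (apply with the VM powered off)

# Opt this VM out of transparent page sharing to avoid
# non-deterministic share-breaking on writes
sched.mem.pshare.enable = "FALSE"

# Force shadow page tables (software MMU) instead of EPT/RVI
monitor.virtual_mmu = "software"

# Host-level (not .vmx): Advanced Settings -> Mem.MemZipEnable = 0
# disables memory compression for the whole host
```

As the general guidelines slide says, each of these should be applied and measured individually in a test environment before deployment.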
Memory Virtualization
EPT = Intel Extended Page Tables = hardware page table virtualization (AMD equivalent: RVI)
[Diagram: virtual → physical → machine address translation]

HPL (GFLOP/s)      Native    Virtual, EPT on   Virtual, EPT off
4K guest pages     37.04     36.04 (97.3%)     36.22 (97.8%)
2MB guest pages    37.74     38.24 (100.1%)    38.42 (100.2%)

*RandomAccess (GUP/s)   Native    Virtual, EPT on   Virtual, EPT off
4K guest pages          0.01842   0.0156 (84.8%)    0.0181 (98.3%)
2MB guest pages         0.03956   0.0380 (96.2%)    0.0390 (98.6%)
NUMA and vNUMA
[Diagram: guest application spread across virtual NUMA nodes, mapped by the hypervisor onto physical sockets and their local memory]
Making virtual NUMA nodes visible within the Guest OS allows ESXi to respect GOS process placement and memory allocation decisions, which can lead to significant performance increases
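From inside a Linux guest you can verify that the vNUMA topology is actually visible to the guest OS. A diagnostic sketch; `numactl` comes from the numactl package and may need to be installed:

```shell
# Show the NUMA topology the guest OS sees; with vNUMA exposed,
# multiple nodes with their own CPU lists and memory should appear
# rather than a single flat node
numactl --hardware

# Alternative: lscpu reports "NUMA node(s)" and per-node CPU lists
lscpu | grep -i numa
```

If the guest reports a single node despite spanning multiple physical sockets, vNUMA is not being exposed and the guest scheduler cannot make placement decisions that match the hardware.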
Optimizing for Latency-sensitive Workloads (2 of 3)
NUMA
• ESXi optimally allocates CPU and memory
• NUMA node affinity can be set manually: numa.nodeAffinity = X
• Exposing NUMA topology to wide guests (vNUMA) can be very important. Automatic for #vCPU > 8; can be forced otherwise: numa.vcpu.min = N (< #vCPUs)
• NUMA scheduler does not include HT by default. Can be overridden to prevent a VM from being split across NUMA nodes: numa.vcpu.preferHT = “1”
vNUMA Performance Study: SpecOMP (Lower is Better)
Performance Evaluation of HPC Benchmarks on VMware’s ESX Server, Ali, Q., Kiriansky, V., Simons, J., Zaroo, P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011
Optimizing for Latency-sensitive Workloads (2 of 3, continued)
VM scheduling optimizations
• e.g., suppress descheduling of the vCPU when the guest halts: monitor_control.halt_desched = FALSE
Guest OS choice
• Later distributions are usually better (tickless kernel, etc.)
• RHEL 6+, SLES 11+, etc. (2.6.32+ kernel)
• Windows Server 2008+
Optimizing for Latency-sensitive Workloads (3 of 3)
Storage
• Storage stack already tuned for small block transfers
• iSCSI and NAS (host and guest) affected by network tuning parameters
• Local Flash memory’s much lower latency exposes overheads in the software stack that we are working to address
Networking
• Interrupt coalescing should be disabled
  - vNIC: ethernetX.coalescingScheme = “disabled”
  - pNIC: esxcli module parameter (driver-specific)
• Jumbo frames may interfere with low-latency traffic
• Disable Large Receive Offload (LRO) for TCP (including NAS)
• Polling for I/O completion rather than using interrupts (DPDK, RDMA poll mode)
• Passthrough / direct assignment for lowest I/O latencies
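On the ESXi host, the two coalescing knobs above look roughly like this. A sketch: `InterruptThrottleRate` is an Intel ixgbe driver parameter used purely as an example here; your pNIC driver's parameter names will differ, so list them first:

```shell
# vNIC side: per-VM .vmx option to disable virtual interrupt coalescing
#   ethernetX.coalescingScheme = "disabled"

# pNIC side: inspect the NIC driver's module parameters...
esxcli system module parameters list -m ixgbe

# ...then set the driver-specific interrupt moderation parameter
# (example parameter for Intel ixgbe; check your driver's docs)
esxcli system module parameters set -m ixgbe -p "InterruptThrottleRate=0"
```

Driver module parameters take effect after the module is reloaded or the host is rebooted, and disabling interrupt moderation trades higher CPU load for lower latency.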
Kernel Bypass Model
[Diagram: bare metal – the application’s socket path traverses TCP/IP and the driver in the kernel, while the RDMA path goes from user space directly to hardware; virtual – the same kernel-bypass RDMA path runs from the guest application past the guest kernel and the vmkernel to hardware]
InfiniBand Bandwidth with Passthrough / Direct Assignment
[Chart: bandwidth (MB/s, 0–3500) vs. message size (2 bytes – 8M); series: Send: Native, Send: ESXi, RDMA Read: Native, RDMA Read: ESXi]
RDMA Performance in Virtual Machines using QDR InfiniBand on VMware vSphere 5, April 2011
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
Latency with Passthrough / Direct Assignment (Send/Rcv, Polling)
[Chart: half round-trip latency (µs, 1–4096) vs. message size (2 bytes – 8M); series: Native, ESXi ExpA; lower is better]

MsgSize (bytes)   Native (µs)   ESXi ExpA (µs)
2                 1.35          1.75
4                 1.35          1.75
8                 1.38          1.78
16                1.37          2.05
32                1.38          2.35
64                1.39          2.90
128               1.50          4.13
256               2.30          2.31
New Features Planned for Upcoming vSphere ESXi Releases
New virtual machine property: “Latency Sensitivity”
• High => lowest latency
Exclusively assign physical CPUs to virtual CPUs of “Latency Sensitivity = High” VMs
• Physical CPUs not used for scheduling other VMs or ESXi tasks
Idle in Virtual Machine Monitor (VMM) when Guest OS is idle
• Lowers latency to wake up the idle Guest OS, compared to idling in the ESXi vmkernel
Disable vNIC interrupt coalescing
For DirectPath I/O, optimize the interrupt delivery path for lowest latency
Make the ESXi vmkernel more preemptible
• Reduces jitter due to long-running kernel code
Summary
Virtualization does add some latency over bare metal
vSphere is generally tuned for throughput and fairness
• Tunables exist at the host, VM, and guest level to improve latency
• This will become more automatic in subsequent releases
ESXi is a good hypervisor for virtualizing an increasingly broad
array of applications, including latency-sensitive applications such
as Telco, Financial, and some HPC workloads
When you observe application performance degradation in the future, we hope you will think about the “silent killer” and try some of the techniques we’ve described here
Resources
Best Practices for Performance Tuning of Latency-Sensitive
Workloads in vSphere VMs
http://www.vmware.com/resources/techresources/10220
Network I/O Latency in vSphere 5
http://www.vmware.com/resources/techresources/10256
Deploying Extremely Latency-Sensitive Applications in vSphere 5.5
http://www.vmware.com/files/pdf/techpaper/deploying-latency-sensitive-apps-vSphere5.pdf
RDMA Performance in Virtual Machines Using QDR InfiniBand on VMware vSphere 5
http://labs.vmware.com/academic/publications/ib-researchnote-apr2012
Other VMworld Activities Related to This Session
HOL: HOL-SDC-1304 – vSphere Performance Optimization
Session: VSVC5596 – Extreme Performance Series: Network Speed Ahead
THANK YOU
Silent Killer: How Latency Destroys
Performance...And What to Do About It
Bhavesh Davda, VMware
Josh Simons, VMware
VSVC5187
#VSVC5187