XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
Transcript of XPDS14 - Scaling Xen's Aggregate Storage Performance - Felipe Franciosi, Citrix
Page 1
Scaling Xen’s Aggregate Storage Performance
Going double digits on a single host

Felipe Franciosi, XenServer Engineering Performance Team
e-mail: [email protected] | freenode: felipef #xen-api | twitter: @franciozzy
Page 2
© 2014 Citrix
Agenda

• The dimensions of storage performance
  ๏ What exactly are we trying to measure?
• State of the art
  ๏ blkfront, blkback, blktap2+tapdisk, tapdisk3, qemu-qdisk
  ๏ trade-offs between traditional grant mapping, persistent grants, grant copy
• Aggregate measurements
  ๏ Pushing the boundaries with very, very fast local storage
• Where to go next?
Page 3
The Dimensions of Storage Performance
What exactly are we trying to measure?
Page 4
The Dimensions of Storage Performance

• You have probably seen this:

  # dd if=/dev/sda of=/dev/null bs=1M count=100 iflag=direct
  100+0 records in
  100+0 records out
  104857600 bytes (105 MB) copied, 0.269689 s, 389 MB/s

  # hdparm -t /dev/sda

  /dev/sda:
   Timing buffered disk reads: 1116 MB in 3.00 seconds = 371.70 MB/sec

• The average user will usually:
  ๏ Run a synthetic benchmark on a bare-metal environment
  ๏ Repeat the test on a virtual machine
  ๏ Draw conclusions without seeing the full picture
Page 5
The Dimensions of Storage Performance

[Chart: throughput vs. log(block size)]
Page 6
The Dimensions of Storage Performance

[Chart: throughput vs. log(block size), annotated with the many other dimensions at play: sequentiality, number of threads, LBA, IO depth, C/P-state configuration, temperature, IO engine, noise, readahead, direction (read/write)]
Page 7
The Dimensions of Storage Performance

• The simplest of all cases:
  ๏ single thread
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential
• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433
• Kernel 3.10 + Xen 4.4
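A setup like this maps directly onto an fio job file. The sketch below is illustrative rather than the deck’s actual job; the device path and the 4k block size are assumptions (the talk sweeps the block size along the x-axis):

```ini
; Simplest case from the slide: one thread, queue depth 1,
; direct sequential reads. /dev/sda is a placeholder device.
[simple-case]
filename=/dev/sda
rw=read
bs=4k
numjobs=1
iodepth=1
direct=1
ioengine=libaio
```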
Page 8
The Dimensions of Storage Performance

• Pushing the boundaries a bit:
  ๏ multiple threads
  ๏ iodepth=1
  ๏ direct IO
  ๏ “kind of” sequential
• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433
• Kernel 3.10 + Xen 4.4
Page 9
The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ single thread vs. single VM
  ๏ iodepth=1
  ๏ direct IO
  ๏ sequential
• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433
• Kernel 3.10 + Xen 4.4
Page 10
The Dimensions of Storage Performance

• Comparing dom0 vs. domU:
  ๏ many threads vs. many VMs
  ๏ iodepth=1
  ๏ direct IO
  ๏ “kind of” sequential
• Extra notes:
  ๏ BIOS perf. mode set to OS
  ๏ Fans set to maximum power
  ๏ Xen Scaling Governor set to Performance (forces P0)
  ๏ Maximum C-State set to 1
  ๏ No pinning
  ๏ Creedence #87433
• Kernel 3.10 + Xen 4.4
Page 11
State of the Art
And the trade-offs between the technologies
Page 12
State of the Art: traditional grant mapping

[Diagram: domU user apps → libc/libaio → vfs → block layer → blkfront, connected over Xen’s blkif protocol (shared memory, event channels, page grants) to blkback → dom0 block layer → device driver → VDI. In the guest, requests are associated with pages in the guest’s memory space; in dom0, requests are associated with foreign pages.]

• Pros:
  ๏ no copies involved
  ๏ low-latency alternative (when done in kernel)
• Cons:
  ๏ not “network-safe”
  ๏ hard on grant tables
Page 13
State of the Art: persistent grants

[Diagram: the same blkfront/blkback data path, but with persistent page grants: blkfront memcpy()s data from/to a set of persistently granted pages on demand.]

• Pros:
  ๏ easy on grant tables
  ๏ copies on the front end
• Cons:
  ๏ not “network-safe”
  ๏ copies involved
Page 14
State of the Art: tapdisk2+blktap2+blkback

[Diagram: blkfront → blkif protocol (page grants) → blkback → blktap2 (TAP device) → tapdisk2 in dom0 user space → libc/libaio → vfs → block layer → device driver → VDI. blktap2 copies data to local pages; requests in dom0 are associated with local pages.]

• Pros:
  ๏ “network-safe”
  ๏ neat features (VHD)
• Cons:
  ๏ copies involved
  ๏ uses lots of memory
  ๏ hard on grant tables
Page 15
State of the Art: grant copy

[Diagram: blkfront → blkif protocol → tapdisk3 in dom0 user space, reached via the gntdev and evtchn devices → libaio → vfs → block layer → device driver → VDI. tapdisk3 issues grant copy commands via the “gntdev” and Xen copies the data across domains; requests in dom0 are associated with local pages.]

• Pros:
  ๏ “network-safe”
  ๏ easy on grant tables
  ๏ neat features (VHD)
• Cons:
  ๏ copies involved (back end)
  ๏ uses lots of memory
Page 16
State of the Art: technologies comparison

|                   | extra copies  | network-safe | low-latency potential   | “neat” features          | easy on grant tables |
|-------------------|---------------|--------------|-------------------------|--------------------------|----------------------|
| grant mapping     | N             | N            | Y (if done in blkback)  | depends (not in blkback) | N                    |
| persistent grants | Y (front end) | N            | N                       | Y (qcow in qemu-qdisk)   | Y                    |
| grant copy        | Y (back end)  | Y            | N                       | Y (vhd in tapdisk3)      | Y                    |
| blktap2           | Y (back end)  | Y            | N                       | Y                        | N                    |
Page 17
Aggregate Measurements
Going double digits on a single host
Page 18
Aggregate Measurements

• Test environment:
  ๏ Dell PowerEdge R720
    • Intel E5-2643v2 @ 3.50 GHz (2 sockets, 6 cores/socket, HT enabled)
      - Unless stated otherwise: 24 vCPUs to dom0, 2 vCPUs to each guest
    • 64 GB of RAM
      - Unless stated otherwise: 4 GB to dom0, 512 MB to each guest
    • BIOS settings:
      - Power regulators set to “Performance per Watt (OS)”
      - C-States disabled, Xen Scaling Governor set to “Performance”
  ๏ Storage:
    • 4 x Micron P320h
    • 2 x Intel P3700
    • 1 x Fusion-io ioDrive2
Page 19
Aggregate Measurements

[Diagram: seven SSDs, each a Storage Repository (SR 1..SR 7) carved into logical volumes lv01..lv10. Each of ten VMs attaches seven virtual disks (vd 1..vd 7), one LV from each SR: VM 01 uses lv01 on every SR, VM 02 uses lv02, and so on up to VM 10 with lv10.]
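Carving each SR into ten LVs is straightforward to script with LVM. This is a sketch with a hypothetical volume-group name and size; `echo` prints the commands as a dry run instead of executing them:

```shell
# Create lv01..lv10 on one SSD's volume group (name and size hypothetical).
# Drop the 'echo' to actually create the volumes.
for i in $(seq -w 1 10); do
  echo lvcreate -L 10G -n "lv$i" vg_ssd1
done
```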
Page 20
Aggregate Measurements

• Baseline
  ๏ Measurements from dom0
  ๏ Each line corresponds to a group of 7 threads (one for each disk)
  ๏ Some of the drives respond faster for small block sizes and a single thread
Page 21
Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants disabled
  ๏ With O_DIRECT
Page 22
Aggregate Measurements

• qemu-qdisk
  ๏ Persistent grants enabled
  ๏ With O_DIRECT
  ๏ Apparent bottleneck was the single process per VM
Page 23
Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.10
  ๏ No persistent grants
  ๏ No indirect IO
Page 24
Aggregate Measurements

• tapdisk2 + blktap2
  ๏ With O_DIRECT
  ๏ Using blkback from 3.16
  ๏ Persistent grants
  ๏ Indirect IO
  ๏ Apparent bottleneck on some pvspinlock operations
Page 25
Aggregate Measurements

• blkback 3.16
  ๏ 8 dom0 vCPUs
  ๏ 6 domU vCPUs
  ๏ Persistent grants
  ๏ Indirect IO
  ๏ Apparent bottleneck on some pvspinlock operations
Page 26
Aggregate Measurements

• tapdisk3
  ๏ Using grant copy
  ๏ With O_DIRECT
  ๏ Using libaio
  ๏ Apparent bottleneck is vCPU utilisation
Page 27
Where To Go Next?
Areas for improvement
Page 28
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [1/3] Latency is too high

[Diagram: throughput vs. log(block size); the data path from the VM through the virtualisation subsystem to the VDI and the disk, with per-hop latencies ranging from ~ms down to ~ns.]
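Those latencies bound single-VBD throughput directly: with a queue depth of 1, IOPS cannot exceed 1/latency (Little’s law). Working backwards from the 15k IOPS figure a couple of slides ahead, the implied per-request round trip is roughly 66 µs; the figure below is that back-of-the-envelope check, not a measured value:

```shell
# With one outstanding request, IOPS = 1 / per-request latency.
# A ~66 us round trip caps a single queue at ~15k IOPS.
lat_us=66
echo "$(( 1000000 / lat_us )) IOPS"
```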
Page 29
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [2/3] IO depth is limited to 32

[Diagram: VM → blkfront → ring of 32 requests → blkback / qdisk / tapdisk → VDI → disk]

• Are these workloads realistic?
• We can use multi-page rings!
• But…
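The 32-request limit falls out of the ring geometry. Assuming the classic 112-byte blkif request entry and a single shared 4 KiB page, the blkif ring macros round the entry count down to a power of two:

```shell
# One 4 KiB shared page, ~112 bytes per blkif ring entry.
page_size=4096
entry_size=112
raw=$(( page_size / entry_size ))   # 36 entries fit in the page
# The ring macros round down to a power of two:
ring=1
while [ $(( ring * 2 )) -le "$raw" ]; do ring=$(( ring * 2 )); done
echo "ring size: $ring requests"
```

Multi-page rings relax this by multiplying `page_size`, which is why they are mentioned as an option above.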
Page 30
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Backend is single threaded

[Diagram: fio in the VM (numjobs=1, iodepth varied, blksz=4k, rw=read) driving io_submit()/io_getevents() through blkfront’s 32-request ring towards qdisk/tapdisk/blkback and the VDI; the disk itself sustains ~400k IOPS at 4k (sequential reads).]

Results as {numjobs, iodepth, blksz} = IOPS, vCPU utilisation:

• {1, 1, 4k} = 15k IOPS, 15 %
• {1, 8, 4k} = 70k IOPS, 35 %
• {1, 16, 4k} = 110k IOPS, 55 %
• {1, 24, 4k} = 165k IOPS, 85 %
• {1, 32, 4k} = 190k IOPS, 100 %
• {1, 64, 4k} = 195k IOPS, 100 %
• {5, 32, 4k} = 415k IOPS, 55 % (each)
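To put the single-queue ceiling in perspective, the best single-job figure above (190k IOPS at 4 KiB) can be converted to bandwidth; the arithmetic below is just that conversion:

```shell
# 190k IOPS of 4 KiB requests, expressed as bandwidth:
# well under 1 GB/s, while the drive can do ~400k IOPS.
iops=190000
bs=4096
echo "$(( iops * bs / 1000000 )) MB/s"
```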
Page 31
Where To Go Next?

• Single-VBD performance remains problematic
  ๏ [3/3] Backend is single threaded

[Diagram: the same fio workload, this time through blkfront → blkback → VDI → disk (~400k IOPS at 4k, sequential reads).]

Results as {numjobs, iodepth, blksz} = IOPS, guest vCPU utilisation:

• {1, 1, 4k} = 10k IOPS, 30 % (30 % in dom0)
• {1, 8, 4k} = 50k IOPS, 75 % (75 % in dom0)
• {1, 16, 4k} = 70k IOPS, 100 % (100 % in dom0)
• {1, 32, 4k} = 110k IOPS, 120 % (100 % in dom0)
• {4, 32, 4k} = 115k IOPS, 400 % (125 % in dom0)
Page 32
Where To Go Next?

• Many-VBD performance could be much better:
  ๏ Both persistent grants and grant copy are interesting alternatives:
    • tapdisk3 with grant copy is network-“friendly” and has one process per VBD
    • qdisk with persistent grants does the copy on the front end
  ๏ But both add extra copies to the data path:
    • We should be avoiding copies… :-/
      - Grant operations need to scale better
      - The network retransmission issues need to be addressed
Page 33
e-mail: [email protected] | freenode: felipef #xen-api | twitter: @franciozzy
Page 34
Support Slides

• Usage of O_DIRECT with QDISK vs. 1 x Micron P320h
• Number of dom0 vCPUs on Creedence #87433 + blkback from 3.16
• Temperature effects on storage performance
Page 35
Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ Without O_DIRECT (default)
  ๏ Faster for small block sizes
  ๏ Faster for a single VM
  ๏ Scalability issue (investigation pending)
Page 36
Usage of O_DIRECT in qemu-qdisk

• qemu-qdisk
  ๏ With O_DIRECT (directiosafe=1)
  ๏ Slower for small block sizes
  ๏ Slower for a single VM
  ๏ Scales much better
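On a plain Xen setup this behaviour is requested per disk in the guest configuration via the xl disk specification’s `direct-io-safe` keyword. A sketch, with a hypothetical target path:

```
# Hypothetical xl guest-config disk line: raw LV via the qdisk backend,
# with 'direct-io-safe' allowing the backend to open the image O_DIRECT.
disk = [ 'format=raw, vdev=xvda, access=rw, backendtype=qdisk, target=/dev/vg0/lv01, direct-io-safe' ]
```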
Page 37
Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal PQ
    • blkback backported from 3.16
    • LVs plugged directly to guests
• Throughput sinks with
  ๏ larger blocks
  ๏ increased number of guests
• oprofile suggests pvspinlock
Page 38
Impact of dom0 vCPU count

• XenServer Creedence #87433
  ๏ Kernel 3.10 + internal PQ
    • blkback backported from 3.16
    • LVs plugged directly to guests
• Giving fewer vCPUs to dom0
• Aggregate throughput improves
Page 39
Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes
• iDRAC Settings > Thermal > Thermal Base Algorithm set to “Maximum Performance”
Page 40
Temperature Effects on Storage Performance

• Workload keeps pCPUs busy with large block sizes
• iDRAC Settings > Thermal > Thermal Base Algorithm set to “Auto”
• Effects very noticeable with 3 or more guests