Userspace NVMe Driver in QEMU - Linux Foundation...
![Page 1: Userspace NVMe Driver in QEMU - Linux Foundation Eventsevents17.linuxfoundation.org/sites/events/files/slides... · 2020. 8. 15. · 9. 10 Faster device → more visible overhead!](https://reader035.fdocuments.in/reader035/viewer/2022071406/60fd09c9fcef5d430162d0b2/html5/thumbnails/1.jpg)
Userspace NVMe Driver in QEMU
Fam Zheng, Senior Software Engineer
KVM Forum 2017, Prague
About NVMe
● Non-Volatile Memory Express
● A scalable host interface specification like SCSI and virtio
● Up to 64k I/O queues, 64k commands per queue
● Efficient command issuing and completion handling
● Extensible command sets
● Attached over PCIe, M.2 and fabrics (FC, RDMA)
Why?
Overhead
Faster device → more visible overhead!
* The FusionIO device is an old model, so it may not represent the current state of the art
* The SATA (SSD) test was run on a different host, so its results are not directly comparable with the others
Reducing Latency
● KVM optimizations
  ● kvm_halt_poll by Paolo Bonzini
  ● QEMU AioContext polling by Stefan Hajnoczi
● Kernel optimizations
  ● /sys/block/nvme0n1/queue/io_poll by Jens Axboe (improves aio=threads case)
● Device assignment
  ● QEMU: -device vfio-pci
● Userspace device driver based on VFIO
  ● DPDK/SPDK: vhost-user-blk
  ● QEMU: VFIO driver in this talk
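Architecture (from QEMU's point of view)
● Guest kernel: VirtIO driver
● QEMU: VirtIO device, BlockBackend, block layer, QCOW2, then either POSIX/linux-aio or the VFIO NVMe driver
● Host kernel: …, VFS and nvme.ko (below POSIX/linux-aio); vfio-pci.ko (below the VFIO NVMe driver)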
Implementation
● $QEMU_SRC/util/vfio-helpers.c
  ● A generic helper library for userspace drivers
  ● Manages per device IO virtual address (IOVA) space
  ● Optimized for I/O operations:
    ● Pre-allocate IOVA for all guest ram
    ● Efficient oneshot IOVA allocation for bounce buffer I/O
● $QEMU_SRC/block/nvme.c
  ● Registers a new BlockDriver (nvme://)
  ● Handles NVMe logic
  ● Integrates with AioContext polling
  ● Prepared for QEMU multiqueue block layer
Characteristics
● Commands: READ, WRITE (with FUA), FLUSH
● IOV based (zero-copy)
● One IO queue pair for now
● More efficient for guest I/O
● Less efficient for bounce buffered I/O and utility
  ● More on this later…
● Device is exclusively used by one VM, similar to device assignment
I/O Request Lifecycle
virtio-*.ko: Queue virtio request (GPA/vIOVA)
↓
virtio: Map I/O address to host address (HVA)
↓
virtio-blk: Parse request, call blk_aio_preadv/pwritev
↓
block layer: Call NVMe driver
↓
NVMe driver: Send request to device
NVMe Driver Operations
(1) Check that the addresses and lengths are aligned. If not, allocate an aligned bounce buffer to do the next steps
(2) Map host addresses to IOVAs
(3) Prepare an NVMe Request structure using IOVAs and put it on the NVMe I/O queue
(4) Kick device by writing to doorbell
(5) Poll for completions of earlier requests
(6) Yield until irq eventfd is readable
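Steps (1), (3) and (4) above can be sketched in a few lines of C. This is a minimal illustration, not the code from block/nvme.c: every `sketch_`/`Sketch` name, the 4096-byte page size and the 8-entry queue depth are assumptions made for the example, and the doorbell is a plain variable standing in for a mapped device register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed device page size */
#define SKETCH_SQ_ENTRIES 8u    /* assumed (tiny) queue depth */

/* Step 1: every segment must start on a page boundary and have a
 * page-multiple length; otherwise the caller would fall back to an
 * aligned bounce buffer. */
static bool sketch_iov_is_aligned(const struct iovec *iov, int niov)
{
    for (int i = 0; i < niov; i++) {
        if ((uintptr_t)iov[i].iov_base % SKETCH_PAGE_SIZE ||
            iov[i].iov_len % SKETCH_PAGE_SIZE) {
            return false;
        }
    }
    return true;
}

/* A 64-byte slot, the size of an NVMe submission queue entry. */
typedef struct { uint8_t bytes[64]; } SketchCmd;

typedef struct {
    SketchCmd slots[SKETCH_SQ_ENTRIES];
    uint32_t head;               /* advanced as the device consumes entries */
    uint32_t tail;               /* next free slot */
    volatile uint32_t *doorbell; /* real driver: a register in a mapped BAR */
} SketchSQ;

/* Steps 3-4: copy the command into the ring, advance the tail and
 * "kick" the device by writing the new tail to the doorbell.
 * Returns false when the ring is full. A real driver also needs a
 * write barrier between filling the slot and ringing the doorbell. */
static bool sketch_sq_submit(SketchSQ *q, const SketchCmd *cmd)
{
    uint32_t next = (q->tail + 1) % SKETCH_SQ_ENTRIES;

    assert(q->doorbell != NULL);
    if (next == q->head) {
        return false;            /* full: one slot is always left empty */
    }
    memcpy(&q->slots[q->tail], cmd, sizeof(*cmd));
    q->tail = next;
    *q->doorbell = q->tail;
    return true;
}
```

With 8 slots the ring accepts 7 outstanding commands; the eighth submission fails until `head` moves forward, which in a driver happens while polling completions (step 5).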
Address Translations
Guest app buffer: 6 7 8 9
Guest physical addr: 0 1 6 10 11
Host virtual address (no vIOMMU): 98 99 ... 100 101 106 110 111
IOVA: ? ? ? ?
NVMe submission queue: R W R R ? ...
Page list (iova): ? (read by the NVMe device through the IOMMU)
page list is pre-allocated!
IOVA Mapping
    struct vfio_iommu_type1_dma_map dma_map = {
        .argsz = sizeof(dma_map),
        .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
        .vaddr = (uintptr_t)host,
        .size = size,
        .iova = iova,
    };
    ioctl(vfio_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
Address Translations
Guest app buffer: 6 7 8 9
Guest physical addr: 0 1 6 10 11 ...
Host virtual address (no vIOMMU): 98 99 ... 100 101 106 110 111 ...
IOVA addr space: 10 11 16 20 21 ...
I/O queue: R W R R ? ...
PRP list (iova): 20 16 11 21 ✓
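The translation on this slide can be reproduced with a small lookup table. The sketch below assumes guest RAM is registered as a single range (host page 100 onward mapped to IOVA page 10 onward) and that the guest buffer's four pages sit at host pages 110, 106, 101 and 111, which is one reading of the slide's PRP list 20 16 11 21. All names are hypothetical, and units are page numbers as on the slide, whereas real PRP entries hold byte addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One registered range: `npages` host pages starting at `host_page`
 * are mapped to consecutive IOVA pages starting at `iova_page`. */
typedef struct {
    uint64_t host_page;
    uint64_t iova_page;
    uint64_t npages;
} SketchMapping;

/* Look up the IOVA page backing a host page; false if unmapped. */
static bool sketch_host_to_iova(const SketchMapping *maps, size_t nmaps,
                                uint64_t host_page, uint64_t *iova_page)
{
    for (size_t i = 0; i < nmaps; i++) {
        const SketchMapping *m = &maps[i];
        if (host_page >= m->host_page &&
            host_page - m->host_page < m->npages) {
            *iova_page = m->iova_page + (host_page - m->host_page);
            return true;
        }
    }
    return false;
}

/* Fill `prp` with one IOVA page per host page. Returns the number of
 * entries written, or -1 if some page has no mapping. */
static int sketch_build_prp_list(const SketchMapping *maps, size_t nmaps,
                                 const uint64_t *host_pages, size_t n,
                                 uint64_t *prp)
{
    assert(prp != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!sketch_host_to_iova(maps, nmaps, host_pages[i], &prp[i])) {
            return -1;
        }
    }
    return (int)n;
}
```

Because guest RAM is mapped ahead of time, building the list is only a table lookup per page; no VFIO_IOMMU_MAP_DMA call is needed on the I/O path for guest buffers.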
How About Host Buffers?
● The (slow) default: VFIO_IOMMU_MAP_DMA each new buffer to a new address as it comes
● Remedy for hot buffers:
    void bdrv_register_buf(BlockDriverState *bs, void *host, size_t size);
    void bdrv_unregister_buf(BlockDriverState *bs, void *host);
  Map/unmap a buffer to IO virtual address in the same way as guest ram.
The IOVA Allocator
IOVA space layout: 0 | MIN | Fixed | low_water_mark | Free | high_water_mark | Temporary | MAX
● Keep record of mapped buffers for later use, if advisable
● Distinguish throwaway / fixed mappings with a parameter
    int qemu_vfio_dma_map(QEMUVFIOState *s, void *host, size_t size, bool temporary, uint64_t *iova);
● Use a pair of self-incrementing counters to track available IOVAs
● When free IOVAs run out, discard all temporary mappings and reset the counter (the caller must make sure no old temporary mapping is still in use)
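A two-counter allocator of this kind might look like the following. It is a sketch under assumptions, not the code in util/vfio-helpers.c: the `Sketch` names are hypothetical, fixed mappings are assumed to grow upward from MIN and temporary ones downward from MAX (as the layout on the slide suggests), and the actual IOMMU map/unmap calls are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t min, max;          /* usable IOVA range [min, max) */
    uint64_t low_water_mark;    /* end of fixed mappings, grows upward */
    uint64_t high_water_mark;   /* start of temporary mappings, grows downward */
} SketchIOVAState;

static void sketch_iova_init(SketchIOVAState *s, uint64_t min, uint64_t max)
{
    assert(min < max);
    s->min = min;
    s->max = max;
    s->low_water_mark = min;
    s->high_water_mark = max;
}

/* Returns 0 and stores the address in *iova, or -1 if there is no room. */
static int sketch_iova_alloc(SketchIOVAState *s, uint64_t size,
                             bool temporary, uint64_t *iova)
{
    uint64_t free_space = s->high_water_mark - s->low_water_mark;

    if (temporary) {
        if (size > free_space) {
            /* Out of free IOVAs: drop every temporary mapping at once.
             * A real driver would also unmap them from the IOMMU here. */
            s->high_water_mark = s->max;
            free_space = s->high_water_mark - s->low_water_mark;
            if (size > free_space) {
                return -1;
            }
        }
        s->high_water_mark -= size;
        *iova = s->high_water_mark;
        return 0;
    }
    /* Fixed mappings are never discarded in this sketch. */
    if (size > free_space) {
        return -1;
    }
    *iova = s->low_water_mark;
    s->low_water_mark += size;
    return 0;
}
```

Allocation is a single addition or subtraction, which keeps one-shot mappings for bounce buffers cheap; the price is that temporary mappings can only be released all together.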
Usage
● Until patches are merged to mainline:
    git clone https://github.com/famz/qemu --branch nvme
● configure && make, as usual
● Bind device to vfio-pci, see also: https://www.kernel.org/doc/Documentation/vfio.txt
● ./x86_64-softmmu/qemu-system-x86_64 \
    -enable-kvm \
    … \
    -drive file=nvme://0000:44:00.0/1,if=none,id=drive0 \
    -device virtio-blk,drive=drive0,id=virtio0
● Syntax: nvme://<domain:bus:dev.func>/<namespace>
  Or, use the structured option:
    -drive \
    driver=nvme,device=<domain:bus:dev.func>,namespace=<N>,if=none...
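To make the URI syntax concrete, here is a small C sketch that splits such a string into a PCI address and a namespace number. It is not QEMU's parser; the type and function names are hypothetical, and `sscanf` is more lenient than a production parser should be (for example, it tolerates leading whitespace before numbers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    unsigned domain, bus, dev, func;  /* PCI address, parsed as hex */
    unsigned namespace_id;            /* NVMe namespace, parsed as decimal */
} SketchNvmeTarget;

/* Parse "nvme://<domain:bus:dev.func>/<namespace>".
 * Returns true only if all five fields were read and nothing follows. */
static bool sketch_parse_nvme_uri(const char *uri, SketchNvmeTarget *t)
{
    int consumed = 0;
    int n;

    assert(uri != NULL && t != NULL);
    n = sscanf(uri, "nvme://%4x:%2x:%2x.%1x/%u%n",
               &t->domain, &t->bus, &t->dev, &t->func,
               &t->namespace_id, &consumed);
    return n == 5 && uri[consumed] == '\0';
}
```

For the example on the slide, `nvme://0000:44:00.0/1`, this yields domain 0, bus 0x44, device 0, function 0 and namespace 1.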
IOPS Improvement over Linux-aio
| Test | Relative IOPS |
| --- | --- |
| rand-read-1-req | +12% |
| rand-read-4-req | +20% |
| rand-write-1-req | +22% |
| rand-write-4-req | +12% |
| rand-rw-1-req | +3% |
| rand-rw-4-req | +22% |
Configuration Limitations
| Approach | Limitation |
| --- | --- |
| POSIX | None |
| nvme:// | One NVMe, one VM |
| SPDK vhost-user-blk | Host must use hugepages; guest must use VirtIO |
| Device assignment | One NVMe, one VM; guest must use NVMe |
Feature Availability
| Approach | Host block features | QEMU block features | Migration |
| --- | --- | --- | --- |
| POSIX | ✓ | ✓ | ✓ |
| nvme:// | ✗ | ✓ | ✓ |
| SPDK vhost-user-blk | ✗ | ✗ | ✓ |
| Device assignment | ✗ | ✗ | ✗ |
Overall comparison
[Chart: Performance (vertical axis) against Functionality (horizontal axis), placing POSIX, vfio-pci passthrough, nvme:// and SPDK vhost-user-blk]
Status and future
● Status
  ● Patches v3 on the qemu-block mailing list:
    https://lists.gnu.org/archive/html/qemu-block/2017-07/msg00191.html
  ● Also available on GitHub: https://github.com/famz/qemu (branch nvme)
● TODO
  ● Get it merged!
  ● Integrate with multi-queue block layer
Benchmark configuration
● Host 1: Fedora 26 / RHEL 7 (x86_64)
    Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz x2
    64GB ram
    Intel Corporation DC P3700 380G
    FusionIO ioDrive2 340G
    Western Digital WD RE4 WD5003ABYX 500GB 7200 RPM 64MB
● Host 2: Fedora 26
    Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz
    16GB ram
    Samsung SSD 840 PRO 128G
● Guest: Fedora 26 (x86_64), 1 vCPU, 1GB ram
● Tool: fio-2.18
● Job:
    ramp_time = 30
    runtime = 30
    bs=4k
    rw={randread, randwrite, randrw}
    iodepth={1, 4}
THANK YOU