Device Assignment with Nested Guest and DPDK · 2018. 11. 15. · Device Assignment with Nested...
Device Assignment with Nested Guest and DPDK
Peter Xu
Red Hat Virtualization Team
Agenda
● Problems
  ● Unsafe userspace device drivers
  ● Device assignment for nested guests
● Solution
● Status update
BACKGROUND
Background
● What is this talk about?
  ● DMA of assigned devices (not PCI configuration, IRQs, MMIO…)
  ● vIOMMU (QEMU, x86_64/Intel)
● These two features could not work together (before)...
  ● The guest IOMMU page table is only visible to the guest
  ● An assigned hardware device cannot see the guest IOMMU page table
● Will we need it?
PROBLEMS
Problem 1: Userspace Drivers
● More userspace drivers!
  ● A VFIO/UIO driver can pass through a device to userspace
  ● DPDK/SPDK use PMDs (poll mode drivers) to drive devices
● However, userspace drivers are not trusted
  ● MMU protects CPU accesses (CPU instructions)
  ● IOMMU protects device accesses (DMA)
● What if we want to “assign” an assigned device to DPDK in the guest?
  ● No vIOMMU means no device DMA protection
  ● Guest kernel is at risk: as soon as a userspace driver is used, the kernel is tainted!
Problem 2: Device Assignment for Nested Guests
● How does device assignment work for an L1 guest?
  ● Device seen by the L1 guest
  ● Guest uses L1GPA as DMA addresses
  ● Host IOMMU maps L1GPA → HPA before the guest starts
● What if we assign a hardware device twice, to a nested guest?
  ● Device seen by both the L1 & L2 guests
  ● L2 guest uses L2GPA as DMA addresses
  ● We need the host IOMMU to map L2GPA → HPA… but how?
Problem 2: Device Assignment for Nested Guests (cont.)
[Figure: three stacked layers, each with its own view of the PCI device.
 Host: Host Memory, Host IOMMU, VFIO driver, PCI Device.
 L1 Guest: L1 Guest Memory, L1 Guest IOMMU, VFIO driver, PCI Device.
 L2 Guest: L2 Guest Memory, PCI Device.
 The L1 Guest IOMMU provides the L2GPA -> L1GPA mapping; the Host IOMMU provides the L1GPA -> HPA mapping.]
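The two mappings in the figure compose: an address used by the L2 guest has to go through L2GPA -> L1GPA and then L1GPA -> HPA, while the hardware IOMMU can only walk one table. A minimal Python sketch of that composition, assuming plain dictionaries of page-frame numbers stand in for the two page tables (the table contents and all function names are hypothetical, chosen only for illustration):

```python
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

# Hypothetical frame-number mappings, for illustration only.
l1_viommu = {0x10: 0x200, 0x11: 0x201}        # L2GPA -> L1GPA (owned by the L1 guest)
host_iommu = {0x200: 0x9000, 0x201: 0x9ABC}   # L1GPA -> HPA   (owned by the host)


def translate(table, addr):
    """Translate one address through a single frame map, keeping the page offset."""
    frame = table.get(addr >> PAGE_SHIFT)
    if frame is None:
        raise LookupError(f"no mapping for {addr:#x}")
    return (frame << PAGE_SHIFT) | (addr & PAGE_MASK)


def l2gpa_to_hpa(l2gpa):
    """Compose both stages: L2GPA -> L1GPA -> HPA."""
    return translate(host_iommu, translate(l1_viommu, l2gpa))


def build_combined(stage1, stage2):
    """Flatten two stages into the single table a hardware IOMMU could walk."""
    return {src: stage2[mid] for src, mid in stage1.items() if mid in stage2}
```

Here `build_combined(l1_viommu, host_iommu)` yields a direct L2GPA -> HPA frame map, which is the kind of table the host IOMMU would need to hold.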
SOLUTION
WHAT DO WE HAVE?
DMA for Emulated Device, w/o vIOMMU
[Figure: vCPU, Guest Memory, Emulated Device (e1000/virtio) and Memory Core API, grouped into Guest and QEMU, connected by arrows numbered (1) to (4).]
(1) IO request
(2) Allocate DMA buffer
(3) DMA request (GPA)
(4) Memory access (GPA)
DMA for Emulated Device, w/ vIOMMU
[Figure: vCPU, Guest Memory, Emulated Device (e1000/virtio), vIOMMU and Memory Core API, grouped into Guest and QEMU, connected by arrows numbered (1) to (8).]
(1) IO request
(2) Allocate DMA buffer, set up device page table (IOVA->GPA)
(3) DMA request (IOVA)
(4) Page translation request (IOVA)
(5) Look up device page table (IOVA->GPA)
(6) Get translation result (GPA)
(7) Complete translation request (GPA)
(8) Memory access (GPA)
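Steps (3) to (8) can be illustrated with a small Python sketch. It assumes a dictionary as the device page table and a `bytearray` as guest memory; `ToyVIOMMU` and `ToyEmulatedDevice` are hypothetical names, not QEMU classes or APIs:

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
PAGE_MASK = PAGE_SIZE - 1


class ToyVIOMMU:
    """Holds the guest-programmed device page table (IOVA frame -> GPA frame)."""

    def __init__(self):
        self.device_page_table = {}

    def translate(self, iova):
        # Steps (4)-(7): look up the device page table and hand back a GPA.
        frame = self.device_page_table.get(iova >> PAGE_SHIFT)
        if frame is None:
            raise PermissionError(f"DMA fault: IOVA {iova:#x} is not mapped")
        return (frame << PAGE_SHIFT) | (iova & PAGE_MASK)


class ToyEmulatedDevice:
    def __init__(self, viommu, guest_memory):
        self.viommu = viommu
        self.guest_memory = guest_memory  # bytearray indexed by GPA

    def dma_write(self, iova, data):
        # Simplification for this sketch: the buffer must fit inside one page.
        if (iova & PAGE_MASK) + len(data) > PAGE_SIZE:
            raise ValueError("sketch only handles single-page DMA")
        # Step (3): the device only knows the IOVA, so it asks for a translation.
        gpa = self.viommu.translate(iova)
        # Step (8): the memory access itself uses the GPA.
        self.guest_memory[gpa:gpa + len(data)] = data
```

Because every DMA goes through `translate()`, a device that uses an unmapped IOVA is stopped before it touches guest memory. This per-access lookup is also why IO through an emulated device is slow when a vIOMMU is present.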
DMA of Assigned Devices, w/o vIOMMU
[Figure: vCPU, Guest Memory, Memory Core API, the Assigned PCI Device (shown both as seen in the guest and as the physical device), and the IOMMU with its Device Page Table (GPA->HPA), grouped into Guest and QEMU, connected by arrows numbered (1) to (5).]
(1) IO request
(2) Allocate DMA buffer
(3) Virtual DMA request (using GPA)
(4) DMA request (using GPA)
(5) Memory access (using HPA)
WHAT DO WE NEED?
DMA of Assigned Devices, w/ vIOMMU
[Figure: vCPU, Guest Memory, Memory Core API, vIOMMU, the Assigned PCI Device (shown both as seen in the guest and as the physical device), and the IOMMU with its Device Shadow Page Table (IOVA->HPA), grouped into Guest and QEMU, connected by arrows numbered (1) to (9).]
(1) IO request
(2) Allocate DMA buffer, set up device page table (IOVA->GPA)
(3) Send MAP notification
(4) Sync shadow page table (IOVA->HPA)
(5) Sync complete
(6) MAP notification complete
(7) Virtual DMA request (using IOVA)
(8) DMA request (using IOVA)
(9) Memory access (using HPA)
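Step (4) is the core of the scheme: on each MAP notification the guest's IOVA->GPA entry is combined with the host's GPA->HPA knowledge and installed into the shadow table that the hardware walks. A rough Python sketch, assuming dictionaries of frame numbers for every table (`ToyShadowSync` and its methods are hypothetical names; the unmap handler is an assumed counterpart that the steps above do not show):

```python
class ToyShadowSync:
    """Keeps a shadow table (IOVA frame -> HPA frame) in step with the guest."""

    def __init__(self, gpa_to_hpa):
        self.gpa_to_hpa = gpa_to_hpa   # host-side knowledge of guest memory
        self.shadow = {}               # the only table the hardware IOMMU sees

    def on_map_notify(self, guest_table, iova_frame):
        # Step (4): read the guest entry (IOVA -> GPA), install IOVA -> HPA.
        gpa_frame = guest_table[iova_frame]
        self.shadow[iova_frame] = self.gpa_to_hpa[gpa_frame]

    def on_unmap_notify(self, iova_frame):
        # Assumed counterpart of MAP: drop the shadow entry again.
        self.shadow.pop(iova_frame, None)

    def hardware_dma(self, iova_frame):
        # Steps (8)-(9): the device's DMA is translated by the shadow table only.
        return self.shadow[iova_frame]
```

After the sync, DMA needs no software involvement at all, which is why mapping is slow but IO is fast for assigned devices.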
IOMMU Shadow Page Table
Hardware IOMMU page tables without/with a vIOMMU in the guest
(GPA→HPA is the original page table; IOVA→HPA is the shadow page table)
[Figure: two two-level page tables side by side.
 Without vIOMMU (GPA->HPA): the Device Page Table Root Pointer leads to tables of HPA entries indexed by GPA[31:22] and GPA[21:12]; GPA[11:0] is the offset into the final DATA page.
 With vIOMMU (IOVA->HPA): the Device Shadow Page Table Root Pointer leads to the same structure, indexed by IOVA[31:22] and IOVA[21:12], with IOVA[11:0] as the offset into the DATA page.]
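The walk in the figure is the same in both cases; only the meaning of the input address changes (GPA without a vIOMMU, IOVA with one). A short Python sketch of the simplified 10/10/12-bit split used on the slide, with nested dictionaries standing in for the page-table pages (`split` and `walk` are hypothetical helper names):

```python
def split(addr):
    """Split a 32-bit address into (bits 31:22, bits 21:12, bits 11:0)."""
    return (addr >> 22) & 0x3FF, (addr >> 12) & 0x3FF, addr & 0xFFF


def walk(root, addr):
    """Walk a two-level table; leaf values are HPA page base addresses."""
    top, mid, offset = split(addr)
    second_level = root.get(top)
    if second_level is None or mid not in second_level:
        raise LookupError(f"address {addr:#x} is not mapped")
    return second_level[mid] + offset
```

For example, with `root = {3: {3: 0x7F000}}`, the address `0x00C03ABC` splits into indexes 3 and 3 plus offset `0xABC`, so `walk(root, 0x00C03ABC)` resolves to `0x7FABC`.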
Synchronizing Shadow Page Tables
● Solution 1 (not used): Write-protect the guest page table
  ● Complicated; possibly needs a new KVM interface to report the event
● Solution 2 (being used): VT-d caching mode
  ● “Any page entry update will require explicit invalidation of caches” (VT-d spec chapter 6.1)
  ● No KVM change needed
  ● Existing Linux guest driver support
  ● Intel-only solution; PV-like, but also applies to hardware
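Why caching mode helps can be shown with a toy guest driver in Python. The sketch assumes that, without caching mode, a driver may skip the invalidation when an entry goes from not-present to present, so the hypervisor would never learn about new mappings. `ToyGuestIommuDriver` and the `on_invalidate` callback are hypothetical names, not the Linux driver's interface:

```python
class ToyGuestIommuDriver:
    """Guest-side driver; `caching_mode` mirrors the capability it was shown."""

    def __init__(self, caching_mode, on_invalidate):
        self.caching_mode = caching_mode
        self.on_invalidate = on_invalidate  # in a VM, this is trapped by the vIOMMU
        self.page_table = {}                # IOVA frame -> GPA frame

    def map(self, iova_frame, gpa_frame):
        was_present = iova_frame in self.page_table
        self.page_table[iova_frame] = gpa_frame
        # With caching mode, every update is followed by an explicit
        # invalidation, including the not-present -> present case.
        if self.caching_mode or was_present:
            self.on_invalidate(iova_frame)

    def unmap(self, iova_frame):
        if self.page_table.pop(iova_frame, None) is not None:
            self.on_invalidate(iova_frame)
```

With `caching_mode=True` the callback fires on every `map()`, which gives the vIOMMU the MAP notification it needs to sync the shadow page table without write-protecting guest memory.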
Shadow Page Tables: MMU vs IOMMU
| Type | MMU | IOMMU |
|---|---|---|
| Target | Processors | Devices |
| Allow page faults? | Yes (of course!) | No [*] |
| Trigger mode (shadow sync) | Page fault | Explicit message (caching-mode) |
| Page table format | 32 bits, 64 bits, PAE, ... | 64 bits |
| Cost (shadow sync) | Small, relatively (KVM only) | Huge (long code path [**]) |
| Need previous state? | No | Yes [***] |

[*]: Upstream work ongoing to enable Intel IOMMU page faults
[**], [***]: Please refer to the follow-up slides for more information
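Shadow Sync: Costly for IOMMU!
(Example: when the L2 guest maps one page)
[Figure: the path of a single map request from the L2 guest down to the hardware. Components shown: the IOMMU Driver in the L2 Guest; KVM, vIOMMU and VFIO at the L1 level (QEMU L2 instance and the L1 Kernel with its IOMMU Driver); KVM, vIOMMU and VFIO at the host level (QEMU L1 instance and the Host Kernel with its IOMMU Driver); and finally the Host IOMMU.]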
Shadow Sync: About State Cache
● MMU shadow sync
  ● Talks to page tables: PGD, PUD, PMD, PTE,…
  ● Does set() on page table entries
  ● No need to cache previous state
● IOMMU shadow sync
  ● Talks to the vfio-pci driver: VFIO_IOMMU_MAP_DMA, VFIO_IOMMU_UNMAP_DMA (no direct access to page tables; the same holds for the vfio-pci driver underneath)
  ● Does add()/remove() on page table entries
  ● We can either create a new entry (it must not exist before), or delete an old entry
  ● Previous state matters, since otherwise we can’t tell which pages have been mapped
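Because the backend only offers add and remove, a sync has to be computed as a difference against what was mapped before. A minimal Python sketch of such a state cache, assuming dictionaries for both views; `map_dma` and `unmap_dma` are hypothetical callbacks standing in for the VFIO_IOMMU_MAP_DMA and VFIO_IOMMU_UNMAP_DMA requests, not real bindings:

```python
def sync_with_state_cache(cache, guest_view, map_dma, unmap_dma):
    """Bring the backend in line with `guest_view` using only add/remove calls.

    cache      -- what has already been mapped (iova -> hpa); updated in place
    guest_view -- what the guest page table currently asks for (iova -> hpa)
    """
    # Remove entries that vanished or changed. A changed entry cannot be
    # overwritten in place, so it is removed here and re-added below.
    for iova in list(cache):
        if guest_view.get(iova) != cache[iova]:
            unmap_dma(iova)
            del cache[iova]
    # Add entries that are not mapped yet.
    for iova, hpa in guest_view.items():
        if iova not in cache:
            map_dma(iova, hpa)
            cache[iova] = hpa
```

Without `cache`, the code could not tell whether a present guest entry needs an add (new), a remove plus add (changed), or nothing (unchanged).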
STATUS UPDATE
Some Facts… and TBDs
● Emulated devices vs. assigned devices from the IOMMU perspective
  ● Emulated: fast mapping (no sync), slow IO (needs guest translation)
  ● Assigned: slow mapping (needs sync), fast IO (no guest translation)
● Some performance numbers (Intel ixgbe, 10Gbps NIC)
  ● Kernel ixgbe driver: very slow (~80% degradation on L1)
  ● Userspace DPDK driver: very fast (close to line speed, on both L1 & L2)
● Future work?
  ● Reduce context switches when syncing shadow pages? (vhost-iommu?)
  ● Nested page tables? (needs hardware support, like EPT compared to softmmu)
  ● Sharing the state cache? (e.g. vfio-pci has a similar state cache, see “vfio_iommu.dma_list”)
Wanna try?
● QEMU command line to try this out:

    qemu-system-x86_64 -M q35,accel=kvm,kernel-irqchip=split -m 2G \
        -device intel-iommu,intremap=on,caching-mode=on \
        -device vfio-pci,host=XX:XX:XX

● Versions:
  ● QEMU: please use v3.0 or newer
  ● Linux: please use v4.18-rc1 or newer
● For more information, please visit:
  ● https://wiki.qemu.org/Features/VT-d
THANK YOU