Liran Liss Mellanox Technologies
Transcript of Liran Liss Mellanox Technologies
![Page 1: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/1.jpg)
On Demand Paging for User-level Networking Liran Liss Mellanox Technologies
![Page 2: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/2.jpg)
Agenda
• Memory registration • RDMA programming challenges • On Demand Paging (ODP) • Page pre-fetching • Initial testing • Future work • Conclusions
© 2013 Mellanox Technologies 2
![Page 3: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/3.jpg)
Memory Registration
• Apps register Memory Regions (MRs) for IO – Referenced memory
must be part of process address space at registration time
– Memory key returned to identify the MR
• Registration operation – Pins down the MR – Hands off the virtual to
physical mapping to HW
© 2013 Mellanox Technologies 3
Application U
ser-space driver stack
HW
MR SQ
CQ
Key
RQ
Kernel driver stack
![Page 4: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/4.jpg)
Memory Registration – continued • Fast path
– Applications post IO operations directly to HCA
– HCA accesses memory using the translations referenced by the memory key
© 2013 Mellanox Technologies 4
Application U
ser-space driver stack
HW
MR SQ
CQ
RQ
Kernel driver stack
Key
Wow !!! But…
![Page 5: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/5.jpg)
Challenges
• Size of registered memory must fit physical memory • Applications must have memory locking privileges • Continuously synchronizing the translation tables
between the address space and the HCA is hard – Address space changes (malloc, mmap, stack) – NUMA migration – fork()
• Registration is a costly operation – No locality of reference
© 2013 Mellanox Technologies 5
![Page 6: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/6.jpg)
Achieving High Performance
• Requires careful design • Dynamic registration
– Naïve approach induces significant overheads – Pin-down cache logic is complex and not complete
• Pinned bounce buffers – Application level memory management – Copying adds overhead
© 2013 Mellanox Technologies 6
![Page 7: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/7.jpg)
On Demand Paging
• MR pages are never pinned by the OS – Paged in when HCA needs them – Paged out when reclaimed by the OS
• HCA translation tables may contain non-present pages – Initially, a new MR is created with non-present pages – Virtual memory mappings don’t necessarily exist
© 2013 Mellanox Technologies 7
![Page 8: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/8.jpg)
Semantics
• ODP memory registration – Specify IBV_ACCESS_ON_DEMAND access flag
• Work request processing – WQEs in HW ownership must reference mapped memory
• From Post_Send()/Recv() until PollCQ() – RDMA operations must target mapped memory – Access attempts to unmapped memory trigger an error
• Transport – RC semantics unchanged – UD responder drops packets while page faults are resolved
• Standard semantics cannot be achieved unless wire is back-pressured
© 2013 Mellanox Technologies 8
![Page 9: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/9.jpg)
Advantages
• Simplified programming – MPI rendezvous without
dynamic registrations – No dedicated buffer pools to
manage
• Practically unlimited memory registrations – No special privileges are
required
• Physical memory optimized to hold working set
© 2013 Mellanox Technologies 9
SendSomething() { char buf[SIZE]; WQE wqe; ... FillBuf(buf); wqe.sge[0].addr = buf; wqe.sge[0].length = SIZE; wqe.sge[0].lkey = STACK_KEY; ... Post_Send(wqe); while (!PollCQ()); }
![Page 10: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/10.jpg)
Design
• Kernel only – Transparent to applications
• Generic code (ib_core) tasks – Manage page invalidations
• Register for MMU notifier calls • Provide context for invalidations • Locate intersection between page invalidations and
MRs
– Support page faults • Synchronize between invalidations and page faults • Page-in user pages and map to dma
© 2013 Mellanox Technologies 10
![Page 11: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/11.jpg)
Design – continued
• Driver code (mlx4_core/ib) tasks – Process page faults
• Catch and classify HW page faults • Provide context for page faults
– Per-QP work_struct for requester/responder
– Handle HW page invalidations
© 2013 Mellanox Technologies 11
![Page 12: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/12.jpg)
Data Structures
© 2013 Mellanox Technologies 12
ib_core
mlx4_core / mlx4_ib
Key MR tree
Per user virt_addrumem
interval tree
umem Page list DMA list
mlx4_ib_mr Translation
table QP Requestor ctx Responder ctx
mmu_notifier handlers
MR
HCA
![Page 13: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/13.jpg)
Page-in Flow
© 2013 Mellanox Technologies 13
ib_core
mlx4_core / mlx4_ib
Key MR tree
Per user virt_addrumem
interval tree
umem Page list DMA list
mlx4_ib_mr Translation
table QP Requestor ctx Responder ctx
mmu_notifier handlers
MR
HCA 1. Page
fault event
2. Look up MR
3. Request pages
4. Get pages + map to
DMA
5. Update HW mappings 6. Resume
QP
![Page 14: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/14.jpg)
Invalidation Flow
© 2013 Mellanox Technologies 14
ib_core
mlx4_core / mlx4_ib
Key MR tree
Per user virt_addrumem
interval tree
umem Page list DMA list
mlx4_ib_mr Translation
table QP Requestor ctx Responder ctx
mmu_notifier handlers
MR
HCA
1. Page invalidation
2. Look up intersecting
MRs 5. Acknowledge
invalidation
3. Request invalidation
4. Flush HW caches
6. Unmap DMA and return
![Page 15: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/15.jpg)
Page Pre-fetching
• New Verb for pre-fetching pages • Uses
– Warming up new memory mappings – MPI rendezvous optimization – UD responder optimization
© 2013 Mellanox Technologies 15
![Page 16: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/16.jpg)
Initial Testing
• ODP support – Implemented all RC transport flows for IB and RoCE
• Excluding SRQ and memory windows – UD over IB and RoCE – Raw Ethernet QP
• Inter-operability – Latency of non-ODP applications running concurrently hardly
affected – Mixed requestors/responders also work well
• Native performance for memory-resident ODP pages • Page-in performance
– 4K page fault takes approximately 135us – 4M page fault takes approximately 1ms
© 2013 Mellanox Technologies 16
![Page 17: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/17.jpg)
Execution Time Breakdown (Send Requestor)
© 2013 Mellanox Technologies 17
2%
36%
4% 5% 16%
37%
0%
Schedule in
WQE read
Get User Pages
PTE update
TLB flush
QP resume
Misc
0%
7%
83%
1%
2%
7%
0%
Schedule in
WQE read
Get User Pages
PTE update
TBL flush
QP resume
Misc
4K Page fault (135us total time) 4M Page fault (1ms total time)
![Page 18: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/18.jpg)
Future work: Huge MR Support • Support MRs in the size of TBs • Implicit ODP
– Register complete application address space up-front – Effectively eliminate memory registration
• Meta-data size must be a function of currently mapped memory instead of MR size – Applies to all data structures (IB core, driver, and HW)
• Memory Windows (MWs) become the main vehicle for controlling access rights
© 2013 Mellanox Technologies 18
![Page 19: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/19.jpg)
Future Work: Improve OS integration • Update PTE accessed/dirty bits according to IO
accesses • Page invalidation batching
– Page eviction in the swapper – NUMA migration process
• Extend ODP to guest physicalmachine translations for virtualization
© 2013 Mellanox Technologies 19
![Page 20: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/20.jpg)
Conclusions
• RDMA performance is great – But requires careful design
• ODP simplifies RDMA programming and deployment – Moves memory management to the OS – Lifts memory-pinning limits
• ODP does not sacrifice performance or interoperability
• ODP eliminates memory registrations!!! – Coming up soon
© 2013 Mellanox Technologies 20
![Page 21: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/21.jpg)
Thank You
© 2013 Mellanox Technologies
![Page 22: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/22.jpg)
Backup
© 2013 Mellanox Technologies
![Page 23: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/23.jpg)
Concurrent Page Faults
• Each QP has at most 2 concurrent page faults – Requestor – Responder
• Faulting QP temporarily suspended until fault is resolved by SW – Even if another QP satisfies the fault in the meantime – Required for correct completion semantics
© 2013 Mellanox Technologies 23
![Page 24: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/24.jpg)
Page-in / Invalidation Races
• Invalidations may race with page faults – HW will complete all in-flight memory accesses to an
invalidated range before completing the invalidation – New accesses will trigger a page fault normally
• Page-in requests are not serviced while handling mmu_notifier invalidations – QP is resumed without updating the page tables – HW will retry access optionally triggering another
page fault – Simplifies the code considerably
© 2013 Mellanox Technologies 24
![Page 25: Liran Liss Mellanox Technologies](https://reader033.fdocuments.in/reader033/viewer/2022042603/6262c628b4cbb30b6c2ab121/html5/thumbnails/25.jpg)
Forward Progress
• Challenge – Single MTU-sized packet may refer to multiple S/G entries
in WQEs – Single RDMA-W transaction may span multiple pages
• Forward progress generally not guaranteed – Pages are not pinned inherent race with page
invalidation – Not any different than CPU accesses
• Alleviate by paging-in multiple pages at once – Read multiple SGEs in WQE page faults – Pre-fetch large consecutive ranges in RDMA faults
• Not an issue in practice
© 2013 Mellanox Technologies 25