Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7...

29
Ben Walker Darek Stojaczyk

Transcript of Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7...

Page 1: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

Ben Walker

Darek Stojaczyk

Page 2: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit

Agenda

• Direct Memory Access

• Hugepage Management

• Threading Model

2

Page 3: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit

Direct memory Access

3

Page 4: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 4

DIRECT memory ACCESS (DMA)

NVMe Memory MMU CPUpaddr vaddr

Memory MMU CPUvaddr

NVMeiova

IOMMU

Page 5: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 5

HOW TO DO DMA in SPDKspdk_malloc()

• Backed by hugepages

• Pinned

• Shared between processes in a multi-process group

Page 6: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit

HUGEPAGE MANAGEMENT

6

Page 7: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 7

Pre-reserve big chunk of hugepages

Dynamic allocation within pre-reserved region

Legacy Memory ManagementSPDK application

Challenge:

• Need to calculate how much memory the program needs up front

HP HP HP

alloc allocalloc

HP HP

Page 8: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 8

Implemented in DPDK 18.05, hardened in DPDK 18.08

Directly in response to SPDK (and other feedback)

Dynamically allocates hugepages as needed

NEW DYNAMIC MEMORY MANAGEMENT

HP HP HP

SPDK application

alloc allocalloc

Available since SPDK 18.10 and DPDK 18.05.1

Page 9: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 9

register your own memoryspdk_mem_register(void *vaddr, size_t len)

spdk_mem_unregister(void *vaddr, size_t len)

Restrictions:

• If no IOMMU, must use hugepages

• The memory address and length need to be a multiple of 2MB

Page 10: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 10

Memory maps• Translation tables

• virtual address uint64_t

• There are multiple registered maps

• Maps are notified when users register memory

map = spdk_mem_map_alloc(notify_cb, …)

spdk_mem_map_set_translation(map, vaddr, size, translation)

spdk_mem_map_translate(map, vaddr, *size)

Page 11: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 11

Memory mapsName Key Value User

PCIe Virtual address Physical/busaddress

NVMe, I/OAT

RDMA Virtual address ibv_mr * NVMe-oF initiator and target

Vhost-user Virtual address Virtual address Vhost-user PMD

Page 12: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit

Dynamic Threading

12

Page 13: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 13

Reminder: Performance via ConcurrencyModern CPUs provide many cores

Modern I/O devices provide many independent queues

Goal: Architect software to match the hardware

…Core 0 Core NCore 1

…I/O Device I/O Device I/O Device

Page 14: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 14

SPDK 18.07 and EarlierSystem thread running a loop

• Pinned to a specific CPU core

• Called a “reactor”

• Polls event ring for incoming work

• Communication via event passing

Implemented in lib/event

Reactor 1Reactor 0 Reactor N…

Core 0 Core NCore 1

Events Events Events

Page 15: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 15

Challenges• Scaling Up/Down

• Rebalancing

• Integration

Reactor 1Reactor 0 Reactor N…

Core 0 Core NCore 1

Events Events Events

Always Spinning!

Page 16: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit

Other Libraries

16

Threading Abstractions: Goals• Create Abstraction layer between the reactor framework and other SPDK libraries

• Allow applications to plug in their own framework

• Trickling in over the last year – fully abstracted in 18.10

Event Framework

Thread

Page 17: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 17

Threading Abstractions: Concepts• spdk_thread

• A thread spinning in a loop, processing events. We tie data to spdk_thread.

• spdk_msg

• An event – a function pointer passed from one thread to another.

• spdk_poller

• A repeated event on a thread with an optional delay between calls.

• spdk_io_channel

• An abstraction for per-thread context.

Page 18: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 18

Threading Abstractions: Polling• spdk_thread_poll

• Runs any expired pollers

• Runs all incoming messages

• Returns

• As of 19.01, reactors do the following on each loop:

• Call spdk_thread_poll on their single spdk_thread

• Check for incoming events (the old style events)

Page 19: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 19

Lightweight Threading: The BasicsConcept Address Space Multiplex Method Scheduler

Process Isolated Pre-emptive OS

Thread Shared Pre-emptive OS

Fiber Shared Cooperative Language Run Time, Framework

Green Thread Either Either Not OS

Today, spdk_thread is a Thread. In 19.04, an spdk_threadbecomes both a Fiber and a Green Thread.

Page 20: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 20

Lightweight Threading: The Implementation

Reactor 1Reactor 0 Reactor N

Core 0 Core NCore 1

Threads Threads Threads

spdk_thread

spdk_thread

spdk_thread_createSchedule thread

Page 21: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 21

Lightweight Threading: RECAP• spdk_thread_lib_init

• Provide a “schedule” callback

• Reactors contain a list of spdk_threads

• Each loop, call spdk_thread_poll on each spdk_thread

• When thread created, call schedule callback. Get placed on reactor.

Page 22: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 22

Lightweight Threading: Advanced Scheduling• spdk_thread_poll returns whether it did any “work”

• Each poller reports whether it accomplished anything (true/false)

• Any message also counts as work

Use Busy Indications to Reschedule Threads!

Page 23: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 23

Lightweight Threading: Advanced Scheduling

Reactor 1Reactor 0 Reactor N

Core 0 Core NCore 1

Threads Threads Threads

spdk_thread spdk_thread

spdk_thread spdk_thread

zzzZZZZZ

Page 24: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 24

Lightweight Threading: Advanced Scheduling

Reactor 1Reactor 0 Reactor N

Core 0 Core NCore 1

Threads Threads Threads

spdk_thread spdk_thread

spdk_thread spdk_thread

Slow down!

Page 25: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 25

Lightweight Threading: Advanced Scheduling

Intel Speed Select Technology

• Intel SST-Base Frequency

• Intel SST-Core Power

Available on Cascade Lake N SKUs

Page 26: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 26

Future Direction• Smarter thread schedulers

• More configuration parameters (trade latency vs power consumption)

• Thread affinity

Page 27: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy
Page 28: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy

SPDK, PMDK & Vtune™ Amplifier Summit 28

Backup

Page 29: Ben Walker Darek Stojaczyk · HUGEPAGE MANAGEMENT 6. SPDK, PMDK & Vtune™ Amplifier Summit 7 Pre-reserve big chunk of hugepages Dynamic allocation within pre-reserved region Legacy