The Impact of Thread-Per-Core Architecture on Application Tail Latency
Pekka Enberg, Ashwin Rao, and Sasu Tarkoma University of Helsinki
ANCS 2019
Introduction
• Thread-per-core architecture has emerged to eliminate overheads in traditional multi-threaded architectures in server applications.
• Partitioning of hardware resources can improve parallelism, but there are various trade-offs applications need to consider.
• Takeaway: Request steering and OS interfaces are holding back the thread-per-core architecture.
Outline
• Overview of thread-per-core
• A key-value store
• Impact on tail latency
• Problems in the approach
• Future directions
What is thread-per-core?
• Thread-per-core = no multiplexing of a CPU core at OS level
• Eliminates thread context switching overhead [Qin 2018; Seastar]
• Enables elimination of thread synchronization by partitioning [Seastar]
• Eliminates thread scheduling delays [Ousterhout, 2019]
Ousterhout et al. 2019. Shenango: Achieving High CPU Efficiency for Latency-sensitive Datacenter Workloads. NSDI '19.
Qin et al. 2018. Arachne: Core-Aware Thread Management. OSDI '18.
Seastar: a framework for high-performance server applications on modern hardware. http://seastar.io/
Interrupt isolation for thread-per-core
• The in-kernel network stack runs in kernel threads, which interfere with application threads.
• Network stack processing must be isolated to CPU cores not running application threads.
• Interrupt isolation can be done with IRQ affinity and IRQ balancing configuration changes.
• NIC receive side-steering (RSS) configuration needs to align with IRQ affinity configuration.
Li et al. 2014. Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency. SoCC '14.
Partitioning in thread-per-core
• Partitioning of hardware resources (such as the NIC and DRAM) can improve parallelism by eliminating thread synchronization.
• Different ways of partitioning resources:
• Shared-everything, shared-nothing, and shared-something.
Shared-everything
[Diagram: CPU0-CPU3 all accessing a single Data region in DRAM]
• Hardware resources are shared between all CPU cores.
• Every request can be processed on any CPU core.
• Data access must be synchronized.
Shared-everything
• Advantages:
• Every request can be processed on any CPU core.
• No request steering needed.
• Disadvantages:
• Shared memory scales badly on multicore [Holland, 2011].
• Examples:
• Memcached (when thread pool size equals CPU core count)
Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS '11.
Shared-nothing
[Diagram: CPU0-CPU3, each with its own Data partition in DRAM]
• Hardware resources are partitioned between CPU cores.
• Each request can be processed only on one specific CPU core.
• Data access does not require synchronization.
• Requests need to be steered.
Shared-nothing
• Advantages:
• Data access does not require synchronization.
• Disadvantages:
• Request steering is needed [Lim, 2014; Didona, 2019].
• CPU utilisation imbalance if data is not distributed well ("hot partition").
• Sensitive to skewed workloads.
• Examples:
• The Seastar framework and the MICA key-value store.
Didona et al. 2019. Sharding for Improving Tail Latencies in In-memory Key-value Stores. NSDI '19.
Lim et al. 2014. MICA: A Holistic Approach to Fast In-memory Key-value Storage. NSDI '14.
Shared-something
[Diagram: CPU0-CPU3 grouped into two clusters (CPU0/CPU1 and CPU2/CPU3), each cluster sharing one Data partition in DRAM]
• Hardware resources are partitioned between CPU core clusters.
• No synchronization is needed for data access across different CPU clusters.
• Data access needs to be synchronized within the same CPU core cluster.
Shared-something
• Advantages:
• A request can be processed on many cores.
• Shared memory scales on small core counts [Holland, 2011].
• Improved hardware-level parallelism?
• For example, partitioning around sub-NUMA clustering could improve memory controller utilization.
• Disadvantages:
• Request steering becomes more complex.
Holland et al. 2011. Multicore OSes: Looking Forward from 1991, Er, 2011. HotOS '11.
Takeaways
• Partitioning improves parallelism, but there are trade-offs applications need to consider.
• Isolation of the in-kernel network stack is needed to avoid interference with application threads.
Outline
• Overview of thread-per-core
• A key-value store
• Impact on tail latency
• Problems in the approach
• Future directions
A shared-nothing key-value store
• To measure the impact of thread-per-core on tail latency, we designed a shared-nothing key-value store.
• Memcached wire-protocol compatible for easier evaluation.
• Software-based request steering with message passing between threads.
• Lockless, single-producer, single-consumer (SPSC) queue per thread.
Shared-nothing
[Diagram: CPU0-CPU3, each with its own Data partition in DRAM]
Taking the shared-nothing model…
KV store design
[Diagram: NIC RX queues are polled via IRQs by SoftIRQ threads on CPU0 and CPU2 (kernel); sockets deliver requests to application threads on CPU1 and CPU3 (userspace), which exchange work via message passing and each own a DRAM partition]
…and implementing it on Linux:
• The in-kernel network stack is isolated on its own CPU cores.
• Application threads run on their own CPU cores.
• Message passing between the application threads.
Outline
• Overview of thread-per-core
• A key-value store
• Impact on tail latency
• Problems in the approach
• Future directions
Impact on tail latency
• Comparison of Memcached (shared-everything) and Sphinx (shared-nothing).
• Measured read and update latency with the Mutilate tool
• Testbed servers (Intel Xeon):
• 24 CPU cores, Intel 82599ES NIC (modern)
• 8 CPU cores, Broadcom NetXtreme II (legacy)
• Varied IRQ isolation configurations.
99th percentile latency over concurrency for updates
[Figure: 99th-percentile update latency (ms, 0.0-2.5) vs. number of concurrent connections (24-384). Series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern).]
99th percentile latency over concurrency for updates
[Same figure, annotated to separate the Memcached and Sphinx curve groups]
No locking, better CPU cache utilization.
Latency percentiles for updates
[Figure: percentile (1-99%) vs. update latency (0.0-2.0 ms). Series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern).]
Takeaways
• The shared-nothing model reduces tail latency for update requests, because partitioning eliminates locking.
• More results in the paper:
• Interrupt isolation reduces latency for both shared-everything and shared-nothing.
• No difference for read requests between shared-everything and shared-nothing (no locking in either case).
Outline
• Overview of thread-per-core
• A key-value store
• Impact on tail latency
• Problems in the approach
• Future directions
Packet movement between CPU cores
[Diagram: the same KV store design as above]
• A packet arrives on a NIC RX queue and is processed by the in-kernel network stack on CPU0.
• The application thread receives the request on CPU1.
• The request is steered to an application thread on CPU3.
Request steering inefficiency
• Inter-thread communication efficiency matters for software steering:
• Message passing by copying is a bottleneck; avoiding copies makes the implementation more complex.
• Thread wakeups are expensive; batching amortizes them, but increases latency.
• Busy-polling is a solution, but it wastes CPU resources in some scenarios.
Partitioning scheme and skewed workloads
• The partitioning scheme is critical, but the design decision is application-specific; data is not always easy to partition.
• Skewed workloads are difficult to address with the shared-nothing model.
Outline
• Overview of thread-per-core
• A key-value store
• Impact on tail latency
• Problems in the approach
• Future directions
Request steering with a programmable NIC?
• A program running on the NIC parses request headers and steers each request to the correct application thread [Floem, 2018].
• Eliminates software request steering overheads and the packet movement cost.
• On Linux, the Express Data Path (XDP) and eBPF interface could be used for this.
Phothilimthana et al. 2018. Floem: A Programming System for NIC-Accelerated Network Applications. OSDI '18.
![Page 50: The Impact of Thread- Per-Core Architecture on Application ... · key-value store • To measure the impact of thread-per-core on tail latency, we designed a shared-nothing key-value](https://reader034.fdocuments.in/reader034/viewer/2022051607/6039c62a7d4b33450e55589d/html5/thumbnails/50.jpg)
/54
OS support for inter-core communication?
• On Linux, the wakeups needed for inter-thread messaging are performed using the eventfd interface or signals, but both have overheads.
• Adding better support for inter-core communication in the OS would help.
Non-blocking OS interfaces
• Thread-per-core requires non-blocking OS interfaces.
• New asynchronous I/O interfaces, such as io_uring on Linux, will help.
• Paging and memory-mapped I/O are effectively blocking operations (when you take a page fault), and must be avoided.
Network stack scheduling control
• In-kernel network stack runs in kernel threads, which interfere with application threads.
• Configuring IRQ isolation is possible, but hard and error-prone. Better interfaces are needed.
• Moving the network stack to user space helps.
Summary
• Thread-per-core architecture addresses kernel thread overheads.
• Partitioning of hardware resources has advantages and disadvantages; applications need to weigh the trade-offs.
• Request steering is critical: CPU and NIC co-design and better OS interfaces are needed to unlock the full potential of thread-per-core.
Backup slides
Read latency (99th)
[Figure: 99th-percentile read latency (ms, 0.0-2.5) vs. number of concurrent connections (24-384). Series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern).]
Read latency
[Figure: percentile (1-99%) vs. read latency (0.0-2.0 ms). Series: Memcached (legacy), Sphinxd (legacy), Memcached (modern), Sphinxd (modern).]