High Performance Storage with blk-mq and scsi-mq · PDF fileHigh Performance Storage with...

High Performance Storage withblk-mq and scsi-mq

Christoph Hellwig

Problem Statement

The Linux storage stack doesn't scale:– ~ 250,000 to 500.000 IOPS per LUN– ~ 1,000,000 IOPS per HBA– High completion latency– High lock contention and cache line bouncing– Bad NUMA scaling

Linux SCSI Performance

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 160

100,000

200,000

300,000

400,000

500,000

600,000

700,000

800,000

900,000

fio 4k random read performance - RAID HBA with 16 SAS SSDs

Linux 2.6.32

Linux Storage Stack - Issues

The Linux block layer can't handle high IOP or low latency devices– All the block layer?

Linux Storage Stack

HW driver

Device mapper,Software RAID

Request layer

SCSI layer

HW driver HW driverHW driver

BIO submission

Linux Storage Stack – Issues (2)

The request layer can't handle high IOPS or low latency devices

Vendors work around by implementing make_request based drivers– Lots of code duplication– Missing features

SCSI drivers are tied into the request framework

Linux Storage Stack – blk-mq

A replacement for the request layer– First prototyped in 2011– Merged in Linux 3.13 (2014)

Not a drop-in replacement– Different driver API– Different queuing model (push vs pull)

Blk-mq – architecture

Processes dispatch into per-cpu software queues Software queues map to hardware issue queues

– In the optimal case:• N(hardware queues) = N(CPU cores)

– For now the most common case is::• N(hardware queues) = 1

Blk-mq I/O submission path

Processes

Software contexts(per-CPU)

Hardware contexts(based on HW capabilities)

Blk-mq – request allocation and tagging

Provides combined request allocation and tagging– Requests are allocated at initialization– Requests are indexed by the tag– Tag and request allocation are combined

Avoids per-request allocations in the driver– Driver data in “slack” space behind request– S/G list is part of driver data

Blk-mq – I/O completions

Uses IPIs to complete on the submitting node and avoid false cache line sharing– Can be disabled, or forced to the submitting

core Old request code provided similar functionality

– Non-integrated additional functionality– Uses software interrupts instead of IPIs

Prototype for blk-mq usage in SCSI

First “scsi-mq” prototype from Nic Bellinger – Published in late 2012– Used early blk-mq to drive SCSI– Demonstrated millions of IOPS– Required (small) changes to drivers– Only using a single hardware queue– Did not support various existing SCSI stack

features

Production design for blk-mq in SCSI

Should be a drop in replacement– Must support full SCSI stack functionality– Must not require driver API changes– Driver should not be tied to blk-mq

Should avoid code duplication– Push as much as possible work to blk-mq– Refactor SCSI code to avoid separate code paths

as much as possible

Production design for blk-mq in SCSI -Request allocation and tagging

Considerations for request and tag allocation:– Allocating a request for each per-LUN tag would

inflate memory usage– Various hardware requires per-host tags anyway

Thus went with blk-mq changes to allow per-host tag sets

Production design for blk-mq in SCSI -S/G lists

Modern SCSI HBAs allow for huge S/G lists– Linux supports up to 2048 S/G list entries,

which require 56 KiB of S/G list structures– We don't want to preallocate that much

Preallocate a single 128 entry chunk– Enough for most latency sensitive small I/O– The rest is dynamically allocated as needed

Blk-mq work driven by SCSI

Transparent pre/post-flush request handling Head of queue request insertion Partial completion support BIDI request support Shared tag space between multiple request_queues Better support for requeuing from IRQ context Lots of bugfixes and small features / cleanups

SCSI preparation for blk-mq

New cmd_size field in host template– Allows to allocate per-driver command data

Host-lock reductions– Elimination of host-wide spinlocks in I/O

submission and completion Upper level driver refactoring

– Avoids legacy request layer interaction– Provides a cleaner drivers abstraction

SCSI blk-mq status

Required blk-mq features included in Linux 3.16 Preparatory SCSI work merged in Linux 3.16 Blk-mq support for SCSI merged in Linux 3.17

– Must be enabled by scsi_mod.use_blk_mq=Y boot option

– Does not work with dm-multipath Big distributions include preparatory patches

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 160

200,000

400,000

600,000

800,000

1,000,000

1,200,000

fio 512 byte random read performance - RAID HBA with 16 SAS SSDs

Linux 2.6.32

3.17-rc3 (with blk-mq)

Note: HBA maxes out at about 1 million IOPS

SCSI profiling data

46.13% [kernel] [k] _spin_lock_irq 26.92% [kernel] [k] _spin_lock_irqsave 9.32% [kernel] [k] _spin_lock 0.47% [kernel] [k] kmem_cache_alloc 0.45% [kernel] [k] scsi_request_fn 0.39% [kernel] [k] _spin_unlock_irqrestore 0.33% [kernel] [k] kref_get 0.32% [kernel] [k] __blockdev_direct_IO_newtrunc 0.32% [kernel] [k] kmem_cache_free 0.30% [kernel] [k] native_write_msr_safe

2.67% [kernel] [k] do_blockdev_direct_IO 2.60% [kernel] [k] __bt_get 2.43% [kernel] [k] __blk_mq_run_hw_queue 2.07% [kernel] [k] put_compound_page 1.87% [kernel] [k] __blk_mq_alloc_request 1.60% [kernel] [k] _raw_spin_lock 1.59% [kernel] [k] kmem_cache_alloc 1.58% [kernel] [k] scsi_queue_rq 1.44% [kernel] [k] _raw_spin_lock_irqsave

Linux 2.6.32

Linux 3.17-rc3(with blk-mq)

1 2 4 6 80

200,000

400,000

600,000

800,000

1,000,000

1,200,000

1,400,000

Multiple LUN performance, single threaded - SRP attached null_io target

3.14.3 3.16+ 3.16+ (with blk-mq)

Note: Target overload in 8 LUN case prevents linear scaling

random read, 12 threads random write, 12 threads random read, 1 thread random write, 1 thread0

200,000

400,000

600,000

800,000

1,000,000

1,200,000

1,400,000

Single LUN performance - SRP attached null_io target

3.14.3

3.16+ (with blk mq)IOP

SCSI blk-mq status - near term work

Better way to select blk-mq vs legacy code path– Compile time option added for 3.18-rc

We would like to fully replace the old SCSI I/O path with the blk-mq one.

Missing features:– I/O scheduler support in blk-mq– multipath support (prototype exists now)

Exposing multiple HW queues to SCSI drivers

SCSI core so far only exposes a single queue– Some drivers are ready for multiple queues– So far do internal queue mapping

Design for tag allocation:– We want per-queue tag allocations for scalability

reasons– Add a queue prefix to the Tag– Work done by Bart van Assche, likely to be

merged for Linux 3.19

Future work – better integration

Expose more blk-mq flags to SCSI– Request merge control– better command allocation/freeing hooks– Reserved tags for HBA use

Future work - longer term research

Further reduction of shared cache lines:– let blk-mq handle per-host queuing limits– let hardware handle per-LUN or per-target

queuing limits Map multiple LUNs (request_queues) to the same

blk-mq contexts

References

Benchmarks:– Bart van Assche (Fusion-io / Sandisk):

• https://docs.google.com/file/d/0B1YQOreL3_FxWmZfbl8xSzRfdGM/edit?pli=1

– Robert Elliott (HP):• http://marc.info/?l=linux-kernel&m=140313968523237&w=2

Thanks

Fusion-io (now a Sandisk company)– For sponsoring the blk-mq in SCSI work

Jens Axboe– For code and slide review, and blk-mq itself

Bart van Assche, Robert Elliott– For code and slide review as well as benchmark

Questions?

High Performance Storage with blk-mq and scsi-mq · PDF fileHigh Performance Storage with...

Documents

Transcript of High Performance Storage with blk-mq and scsi-mq · PDF fileHigh Performance Storage with...

Port Multiplier Support - Ubuntu Desktoppatching file drivers/scsi/sata_sx4.c patching file drivers/scsi/sata_uli.c patching file drivers/scsi/sata_via.c patching file drivers/scsi/sata_vsc.c

SCSI commands

SCSI Over Fibre Channel Support For Linux On zSerieso FC attached storage devices o Optional: FCP-SCSI bridge + SCSI devices zSeries In A SAN –Hardware Requirements. SCSI over FibreChannel

Unit No.: 03 - Direct-Attached Storage, SCSI, and Storage ......SCSI interface (Table 5-2 lists some of the available parallel SCSI interfaces). The SCSI design is now making a transition

€¦ · GOLF PRIDE@ Multi Compound - WHT/BLK Multi Compound - WHT/BLK (Mid) Multi Compound - BLK/BLK z-Grip Cord Tour Velvet BCT - BLK Tour Velvet Tour Velvet (Mid) Tour Velvet (Jumbo)

187 - StencilMafia · 2020. 2. 4. · BLK 6310 bamboo BLK 6610 hemp BLK 6620 pan BLK 6630 lost island BLK 6720 toad BLK 6730 face BLK 6905 hannibal BLK 6910 murdock BLK 6920 jaw s

Jim Harris Principal Software Engineer Intel Data Center Group … · • SPDK Developer Meetup (Chandler, AZ) ... RocksDB Ceph Core Application Framework GPT PMDK blk virtio scsi

Security, Integrity and Choices for NVMe over Fabrics · Security, Integrity and Choices for NVMe over Fabrics Nishant Lodha ... blk-mq API nvme API blk-mq API nvme API blk-mq API

RLECT · 2019. 7. 18. · 330 12 scsi 5.0m 3.5- 240k £999 425 12 scsi 5.0m 3.5 240k £1179 520 12 scsi 5.om i 3.5' 1240k £1349 670 16 scsi 4.0m 5.25' 64k £1399 1070 14 scsi 4.0m

53C80 SCSI

SCSI Systems

Systems SCSI Cluster Installation and Troubleshooting Guide · SCSI ID before you connect the SCSI cables to the shared storage system. See " Setting the SCSI Host Adapter IDs "for

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . · SCSI SPI-4 interface specifications to the extent described in this manual. The SCSI Interface Product Manual The SCSI

Scsi Report

Serial Attached SCSI Architecture - SCSI Trade Association

Learning Language Games through Interaction · 2019. 10. 10. · stack or blk pos 4 rem blk pos 2 thru 5 rem blk pos 2 thru 4 stack bn blk pos 1 thru 2 ll bn blk stack or blk pos

Hard Drives - tim.id.autim.id.au/laptops/apple/misc/hard_drives.pdf · Basics WIDE and FAST SCSI Protocols - 2. WIDE and FAST SCSI Protocols. SCSI chains that use SCSI-2 protocols

Jim Harris Principal Software Engineer Intel Data Center Group · RocksDB Ceph Core Application Framework GPT PMDK blk virtio scsi QEMU Linux nbd RDMA RDMA SPDK Architecture Added

PCI-X Dual Channel Ultra320 SCSI Adapter Installation and ...ps-2.kev009.com/rs6000/manuals/Adapters/SCSI/Ultra... · “PCI-X Dual Channel Ultra320 SCSI Adapter SCSI Connectors”

SCSI Standards and Technology Update: SAS and SCSI … · SCSI Standards and Technology Update: SAS and SCSI Express ... And now, SCSI Over PCIe (SOP, ... (SRP) PCI Express : SCSI