Optimizing the WAFL File System for the Many-core and Many-media Era
Matthew Curtis-Maury, PhD
[email protected]
Agenda
1) Background on NetApp and WAFL
2) Multiprocessor Execution in WAFL
3) Write Allocation Architecture
4) Buffer Cache Management
5) Summary
What is NetApp?
§ Primarily a data storage technology company
§ Servers with software to manage huge amounts of data
§ High-performance data access, data replication, backup to the cloud, etc.
§ 12,000 global employees (Silicon Valley, RTP, Bangalore)
§ $6.1 billion annual revenue
§ It’s a great place to work
§ #4 “World’s Best Multinational Workplaces” (Great Place to Work Institute)
§ 13 consecutive years on FORTUNE’s “100 Best Companies to Work For”
§ #1 in 2009
§ Engineering sites in RTP, Sunnyvale, and Bangalore
Background on ONTAP
§ Data ONTAP is our proprietary storage operating system that runs on storage servers (or “filers”)
§ On real hardware or in a virtual machine
§ Connect over one of many supported protocols to access data
[Diagram: the protocol stack. FCP and iSCSI provide LUN semantics; CIFS and NFS provide file semantics. Both sit on the WAFL virtualization layer, which sits on RAID and the storage interface.]
Background on WAFL
§ WAFL: Write Anywhere File Layout
§ File system in the NetApp Data ONTAP OS
§ All data and metadata stored in files, hierarchy rooted at the SuperBlock
§ File data stored in 4K blocks that can be cached in core as buffers
§ Storage is managed as Aggregates (collections of drives)
§ Data is managed as Volumes (FlexVols) within the Aggregates
§ Writes flushed to disk in a batch called a Consistency Point
§ Persistence guaranteed through a non-volatile log in NVRAM
§ Data always written to new blocks on disk (no overwrites)
§ Currently supported hardware:
§ 2–40 cores on a single node
§ 4–512GB of RAM
Historical Perspective
                1994      2016                                     Change
Memory          256MB     512GB                                    2K x
Max Capacity    28GB      138PB                                    5M x
Storage Media   SPI HDD   SATA, SAS, SSD, NVMe, SMR, Cloud, etc.
SFS ops         626       500K                                     ~1K x
Num Cores       1         ~40
LoC             18K       30M                                      1K x
This talk is about supporting more cores, more media types, and more memory.
Evolution of Multiprocessor Execution in WAFL
Summary
§ Problem Statement:
§ New systems continue to have more cores, but WAFL cannot be rewritten from scratch for better scalability.
§ Solution:
§ A set of techniques that allow WAFL to be incrementally parallelized, with reasonable effort at each stage, to achieve the required parallelism.
§ Classical Waffinity (2006), Hierarchical Waffinity (2011), Hybrid Waffinity (2016)
§ From “To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code”
§ OSDI 2016
Background
§ NetApp Data ONTAP OS
§ Commercial operating system running on storage servers
§ WAFL File System: Write Anywhere File Layout
§ Processes messages that represent file system requests
§ All data and metadata stored in files
§ File data stored in 4K blocks that can be cached in core as buffers
§ Storage is managed as Aggregates (collections of drives)
§ Data is managed as Volumes within the Aggregates
§ Competitive performance requires continued CPU scaling
§ Problem: WAFL is millions of lines of complicated code
§ A pure locking model would require impractical levels of code change
§ Many other techniques in the literature are not suitable for (our) legacy system
§ E.g., Barrelfish, fos, Corey, Multikernel, …
History of Parallelism in ONTAP
§ Data ONTAP was created for the single-CPU systems of 1994
§ Parallelism via “coarse-grained symmetric multiprocessing” (see the sketch below)
§ Each subsystem was assigned to a single-threaded domain
§ Minimal explicit locking required; message passing between domains
§ Scaled to 4 cores, but all of WAFL was serialized
[Diagram: CPUs assigned to single-threaded domains: RAID, WAFL, Network, Storage, Protocol.]
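Conceptually, the domain model looks like the minimal sketch below, assuming one thread and one message inbox per domain; none of these structures or names are ONTAP interfaces.

```c
/* Minimal sketch of coarse-grained SMP: each subsystem runs as a
 * single-threaded "domain", and cross-subsystem work is sent as
 * messages, so code inside a domain needs no locking of its own.
 * Hypothetical structures, not ONTAP code. */
#include <pthread.h>
#include <stdlib.h>

typedef struct msg {
    struct msg *next;
    void (*handler)(void *arg);
    void *arg;
} msg_t;

typedef struct {
    const char      *name;     /* "RAID", "WAFL", "Network", ... */
    pthread_mutex_t  lock;     /* protects only the inbox        */
    pthread_cond_t   nonempty;
    msg_t           *inbox;
} domain_t;

/* The only cross-domain synchronization: enqueue work elsewhere. */
void domain_send(domain_t *d, void (*handler)(void *), void *arg)
{
    msg_t *m = malloc(sizeof *m);
    m->handler = handler;
    m->arg = arg;
    pthread_mutex_lock(&d->lock);
    m->next = d->inbox;        /* LIFO for brevity; real queues are FIFO */
    d->inbox = m;
    pthread_cond_signal(&d->nonempty);
    pthread_mutex_unlock(&d->lock);
}

/* One thread per domain: the subsystem's code all runs serially here. */
void *domain_loop(void *arg)
{
    domain_t *d = arg;
    for (;;) {
        pthread_mutex_lock(&d->lock);
        while (d->inbox == NULL)
            pthread_cond_wait(&d->nonempty, &d->lock);
        msg_t *m = d->inbox;
        d->inbox = m->next;
        pthread_mutex_unlock(&d->lock);
        m->handler(m->arg);    /* no further locking needed */
        free(m);
    }
}
```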
Classical Waffinity
§ In the beginning… WAFL processed all messages serially
§ WAFL parallelism introduced with Waffinity (2006)
§ User files partitioned into 2MB file stripes
§ Messages to each file stripe are mapped to data partitions called Stripe affinities (see the sketch below)
§ Scheduler dynamically assigns affinities to a set of parallel threads
§ Facilitates parallelism with minimal locking/code changes
§ A Serial affinity excludes all Stripe affinities
§ Parallelized the most common operations: reads and writes to user files
§ Scaled effectively to 8 cores
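As a rough illustration, routing a message to a stripe affinity needs only the file and the 4K block it targets. The hash below is an assumption for illustration, not the actual Waffinity mapping; only the partitioning idea matters.

```c
/* Sketch of stripe-affinity routing: 2MB file stripes, 4K blocks. */
#include <stdint.h>

#define BLOCK_SIZE            4096u              /* WAFL file block    */
#define STRIPE_SIZE           (2u * 1024 * 1024) /* 2MB file stripe    */
#define BLOCKS_PER_STRIPE     (STRIPE_SIZE / BLOCK_SIZE)  /* 512      */
#define NUM_STRIPE_AFFINITIES 8u                 /* illustrative count */

/* Messages that map to different stripe affinities may run on
 * parallel threads with no locking between them. */
unsigned stripe_affinity(uint64_t file_id, uint64_t fbn)
{
    uint64_t stripe = fbn / BLOCKS_PER_STRIPE;
    return (unsigned)((file_id * 31 + stripe) % NUM_STRIPE_AFFINITIES);
}
```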
Classical Waffinity (2006)
[Diagram: incoming messages such as W(A, 0), W(B, 0), W(A, 7), R(A, 0), and Delete A map to Stripe affinities S1 through S8 (one per 2MB stripe) or to the Serial affinity; runnable affinities are dispatched to cores while other threads idle. Key point: work in the Serial affinity is bad!]
Hierarchical Waffinity (2011)
§ Hierarchical data partitioning to match hierarchically structured data
§ Particular shape fine-tuned for WAFL
§ Hierarchical permissions/exclusion (see the sketch below)
§ Allows parallelization of work that used to run in the Serial affinity
§ Friendly to incremental development
[Diagram: the file system hierarchy (node, aggregates, volumes, files, blocks) next to the matching affinity hierarchy: Serial at the root; Aggregate below it; Volume and Aggregate VBN under Aggregate; Volume Logical and Volume VBN under Volume; and Stripe, VVBN Range, and AVBN Range affinities at the leaves.]
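A sketch of how hierarchical exclusion can be enforced, assuming each affinity node tracks the messages running in itself and below it (the field names are assumptions; the real Waffinity scheduler is more involved): a message may run in a node only when no ancestor or descendant is busy, so siblings run in parallel while Serial, at the root, excludes the whole tree.

```c
/* Sketch of hierarchical exclusion over the affinity tree. */
#include <stdbool.h>
#include <stddef.h>

typedef struct affinity {
    struct affinity *parent; /* NULL at the Serial (root) affinity      */
    int running;             /* messages executing in exactly this node */
    int running_below;       /* messages executing in any descendant    */
} affinity_t;

/* OK to dispatch here only if self, descendants, and ancestors are idle. */
bool affinity_can_run(const affinity_t *a)
{
    if (a->running || a->running_below)
        return false;
    for (const affinity_t *p = a->parent; p != NULL; p = p->parent)
        if (p->running)
            return false;
    return true;
}

/* The scheduler calls these under its own lock around dispatch. */
void affinity_start(affinity_t *a)
{
    a->running++;
    for (affinity_t *p = a->parent; p != NULL; p = p->parent)
        p->running_below++;
}

void affinity_finish(affinity_t *a)
{
    a->running--;
    for (affinity_t *p = a->parent; p != NULL; p = p->parent)
        p->running_below--;
}
```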
Hierarchical Waffinity – Data mappings
[Diagram: data mappings onto the affinity hierarchy. User data maps to Stripe affinities (retained from Classical Waffinity, now per volume); metadata maps to the VBN Range affinities. This gives parallelism between user file and metafile accesses, and between different volumes and aggregates.]
Classical vs. Hierarchical Waffinity
§ SFS2008 contains metadata operations (Create, Remove, etc.)
§ Classical Waffinity: these ran in the Serial affinity (48% of wallclock time)
§ Hierarchical Waffinity allows the messages to run in Volume Logical
§ The 2.59 additional cores used translated into a 23% throughput increase
[Chart: SFS2008 throughput (K ops/sec) and core usage under Classical vs. Hierarchical Waffinity; core usage rose from 7.79 to 10.38 cores alongside the throughput gain.]
Hierarchical Waffinity CPU Scaling
§ 95% average core occupancy across 6 key workloads
[Chart: core usage for the six key workloads: RandomRead, SequentialRead, RandomWrite, SequentialWrite, SFS2008, and SPC1.]
Hierarchical Waffinity Limitations
§ Some important workloads access two different kinds of file blocks
§ SnapVault: replicates a volume’s contents; writes user data plus metadata
§ Block Free: updates file system metadata; writes to multiple metafiles
§ Mappings optimized for the traditional cases are not well-suited here
[Diagram: the affinity hierarchy annotated with data mappings: user data, metadata, and operations touching user data plus metadata, for which no single affinity grants access to both.]
Hybrid Waffinity (2016)
§ Extends Waffinity with fine-grained locking
§ Allows particular blocks to be accessed from multiple affinities
§ These blocks are protected with locking (see the sketch below)
§ Allows incremental development
§ Most blocks (and therefore most code) continue to use partitioning
§ Even lock-protected buffers can remain unchanged
[Diagram: the affinity hierarchy annotated with data mappings: user data, metadata, lock-free metadata, and no-access regions.]
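The core idea can be sketched as follows, assuming a per-buffer flag and spinlock (the field names are assumptions, not WAFL internals): only buffers marked as shared pay for locking, so the vast majority of code paths are untouched.

```c
/* Sketch of Hybrid Waffinity's buffer access: partitioned by default,
 * locked only for the few blocks reachable from multiple affinities. */
#include <pthread.h>
#include <stdbool.h>

typedef struct buf {
    bool shared;                /* accessed from multiple affinities? */
    pthread_spinlock_t lock;    /* used only when shared == true      */
    /* ... 4K payload and metadata ... */
} buf_t;

void buf_modify(buf_t *b, void (*mutator)(buf_t *))
{
    if (b->shared) {
        pthread_spin_lock(&b->lock);   /* fine-grained locking path */
        mutator(b);
        pthread_spin_unlock(&b->lock);
    } else {
        mutator(b);  /* affinity partitioning already gives exclusivity */
    }
}
```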
Hybrid vs. Hierarchical Waffinity
§ Block allocation and free operations perform metadata accesses in Volume VBN
§ Hybrid Waffinity allows these messages to run in Volume VBN Range
§ 62.8% increase in throughput from 4.8 additional cores
[Chart: Sequential Overwrite throughput (MB/s) and core usage for Hierarchical vs. Hybrid Waffinity, shown alongside the affinity hierarchy.]
Conclusion
§ Developed a set of techniques to allow incremental parallelization of the WAFL file system
§ Focusing on data partitioning has kept the required code changes to a manageable level while allowing sufficient parallelism
§ Selectively adding locking in a restricted way has further parallelized workloads with poor data partition mappings
§ Provided insight into the internals of a high-performance production file system
Write Allocation Architecture in WAFL
Summary
§ Problem Statement:
§ The existing write allocator is neither extensible to new storage types nor scalable on many cores. WAFL must adapt to new media to stay competitive.
§ Solution:
§ A write allocation architecture in WAFL that allows for the rapid adoption of new storage media and scales to many cores.
§ From “White Alligator: An Architecture for WAFL Write Allocation in the Many-Core and Many-Media Era”
§ Submitted to FAST 2017
What is Write Allocation?
§ Storage systems must store modifications to persistent storage
§ Write allocation is everything involved in persisting file system changes
§ Identifying “desirable” locations on disk to write to
§ Arranging data and metadata on these locations
§ Drives are carved up into fixed-size blocks
§ Blocks within a drive and across drive types have different properties
§ Match the right data to the right disk location
§ Need to quickly and efficiently:
§ Find free locations
§ Perform the write
§ Update metadata
§ Perform subsequent reads of the data
Background on WAFL
§ In WAFL, every write is to a new disk location (no overwrites)
§ Writes are buffered in memory and flushed to disk in a batch
§ Reply to the client quickly, once the operation is logged in NVRAM
§ A consistent set of changes is flushed to disk atomically (a Consistency Point; see the sketch below)
§ Batching allows for better layout and amortization of overhead
[Diagram: the in-core file system (user data, metadata, SuperBlock) is frozen at the start of a CP, and the frozen modifications are flushed to disk as a batch.]
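The flow above can be summarized as follows; every function here is an illustrative stub, not an ONTAP interface.

```c
/* Sketch of the logged-write / Consistency Point flow. */
#include <stdio.h>

static void apply_to_in_core_fs(int op) { printf("op %d: dirty in-core buffers\n", op); }
static void nvram_log_append(int op)    { printf("op %d: logged to NVRAM\n", op); }
static void reply_to_client(int op)     { printf("op %d: acked to client\n", op); }

/* The client is answered as soon as the operation is persistent in
 * NVRAM, long before any disk I/O happens. */
static void handle_write(int op)
{
    apply_to_in_core_fs(op);
    nvram_log_append(op);
    reply_to_client(op);
}

/* A Consistency Point atomically flushes a frozen, self-consistent
 * batch of changes, always to new blocks (no overwrites). */
static void consistency_point(void)
{
    puts("freeze the in-core FS image");
    puts("write dirty data and metadata to new disk locations");
    puts("write the SuperBlock, committing the new tree atomically");
    puts("release NVRAM log entries now safely on disk");
}

int main(void)
{
    for (int op = 1; op <= 3; op++)
        handle_write(op);
    consistency_point();   /* batching amortizes layout and I/O overhead */
    return 0;
}
```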
Legacy Write Allocation Objectives
Evolution of WAFL write allocation since 1994:
1) Maximize full-stripe writes
2) Maximize sequential read performance
3) Utilize full aggregate write throughput

[Diagram: a RAID stripe across data drives D1, D2, D3 and parity drive P, and the write-allocation “Tetris”.]
Legacy Write Allocation Architecture
Infrastructure to: (a) find desired blocks, (b) efficiently write to them, and (c) update corresponding metadata
Policy to control the specific layout of file blocks on the storage medium
White Alligator Architecture
[Diagram: policies (Traditional, FlashPools, NextThing) plug in above a common API to the shared write allocation infrastructure.]
White Alligator Architecture
§ Buckets of blocks are passed across the API
§ Infrastructure provides blocks with certain attributes
§ E.g., SSD blocks, outer-cylinder HDD blocks, cloud blocks, etc.
§ Keeps a supply of filled buckets available in a lock-protected “bucket cache” (see the sketch below)
§ Policies are implemented to make use of the provided types of blocks
§ Many advantages to operating on buckets:
1) Quick adoption of new storage media
2) Policy and infrastructure are independently parallelizable
• Write allocation running on parallel threads using different buckets
• Modifications to the file system done in parallel in Hierarchical Waffinity
3) Amortizes the cost of “turning the crank” for a free block
4) Modularity improves code maintainability
§ Successfully used to develop a variety of allocation policies
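A minimal sketch of the lock-protected bucket cache at the API boundary, with assumed types and sizes; the real block attributes and bucket sizes are internal to WAFL.

```c
/* Sketch of a bucket-based split between policy and infrastructure:
 * the infrastructure pre-fills buckets of free blocks with given
 * attributes; policies consume whole buckets, so the cost of finding
 * free blocks is amortized and the two layers can run on different
 * threads. */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

enum media { MEDIA_HDD, MEDIA_SSD, MEDIA_CLOUD };

typedef struct bucket {
    struct bucket *next;
    enum media     media;      /* attribute of the blocks inside */
    uint64_t       blocks[64]; /* free block numbers, pre-filled */
    size_t         nblocks;
} bucket_t;

static struct {
    pthread_mutex_t lock;      /* the "bucket cache" is lock-protected */
    bucket_t       *filled[3]; /* one list per media type              */
} bucket_cache = { PTHREAD_MUTEX_INITIALIZER, { NULL, NULL, NULL } };

/* Infrastructure side: publish a freshly filled bucket. */
void bucket_cache_put(bucket_t *b)
{
    pthread_mutex_lock(&bucket_cache.lock);
    b->next = bucket_cache.filled[b->media];
    bucket_cache.filled[b->media] = b;
    pthread_mutex_unlock(&bucket_cache.lock);
}

/* Policy side: take a whole bucket of blocks with the wanted attribute. */
bucket_t *bucket_cache_get(enum media m)
{
    pthread_mutex_lock(&bucket_cache.lock);
    bucket_t *b = bucket_cache.filled[m];
    if (b)
        bucket_cache.filled[m] = b->next;
    pthread_mutex_unlock(&bucket_cache.lock);
    return b;   /* NULL means the infrastructure must refill */
}
```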
White Alligator Performance
§ FlashPools achieves roughly the same throughput at 0.1x the latency
§ Using fewer drives: replace many HDDs with a few SSDs
§ Better $/IOPS, less shelf space, less maintenance, less power
[Chart: HDD delivers 70.18 MB/s at 9.10 ms latency; SSD+HDD delivers 67.40 MB/s at 0.91 ms.]
White Alligator CPU Scaling
§ Sequential Overwrite on a 20-core server
§ Every overwrite requires write allocation
§ Frontend = Policy, Backend = Infrastructure
[Chart: Sequential Overwrite throughput (MB/sec) and core occupancy for four configurations, Serial Frontend/Serial Backend, Serial Frontend/Parallel Backend, Parallel Frontend/Serial Backend, and Parallel Frontend/Parallel Backend, with core occupancies of 7.75, 9.35, 10.35, and 17.39 respectively.]
Buffer Cache Management in WAFL
Summary
§ Problem Statement:
§ Distributed buffer cache processing requires complex distributed heuristics.
§ Solution:
§ Global buffer priority ordering, with parallel processing of key buffer cache operations.
§ From “Think Global, Act Local: A Buffer Cache Design for Global Ordering and Parallel Processing in the WAFL File System”
§ ICPP 2016
Buffer Cache: Read from Disk
[Diagram: a read miss travels from the clients (SAN host, UNIX/NFS client, Windows/CIFS client) through the network stack and protocol services into WAFL, then through RAID to storage; the block is loaded into the in-memory buffer cache. Data block colors are not significant; NFS and CIFS can read the same cached data.]
Buffer Cache: Read from Cache
[Diagram: the same read served from the buffer cache; the request is satisfied from memory without going through RAID to storage. Data block colors are not significant; NFS and CIFS can read the same cached data.]
WAFL Buffer Cache
Goal: Cache disk blocks in memory that are predicted to be needed in the near future to accelerate future accesses.
§ 7-level Multi-queue, LRU ordered
§ Prioritization based on type, frequency, and recency of access
§ Operations: Insertion, Touching, Scavenging, Aging (see the sketch below)
[Diagram: recycle queues ordered from high priority to low priority; scavenging moves buffers from the low-priority end into the free cache.]
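These operations can be sketched over doubly linked LRU queues. The one-level promotion and demotion policy below is a placeholder, since WAFL's actual prioritization weighs type, frequency, and recency of access.

```c
/* Sketch of a 7-level multi-queue buffer cache. Head = MRU end,
 * tail = LRU end of each per-level queue. */
#include <stddef.h>

#define NUM_LEVELS 7

typedef struct cbuf { struct cbuf *prev, *next; int level; } cbuf_t;

static cbuf_t *head[NUM_LEVELS], *tail[NUM_LEVELS];

static void q_remove(cbuf_t *b)
{
    if (b->prev) b->prev->next = b->next; else head[b->level] = b->next;
    if (b->next) b->next->prev = b->prev; else tail[b->level] = b->prev;
    b->prev = b->next = NULL;
}

static void q_insert_mru(cbuf_t *b, int level)
{
    b->level = level;
    b->prev = NULL;
    b->next = head[level];
    if (head[level]) head[level]->prev = b; else tail[level] = b;
    head[level] = b;
}

/* Touching: access promotes a buffer (one level here, as a placeholder). */
void buf_touch(cbuf_t *b)
{
    int to = b->level < NUM_LEVELS - 1 ? b->level + 1 : b->level;
    q_remove(b);
    q_insert_mru(b, to);
}

/* Aging: periodic demotion so stale buffers drift toward level 0. */
void buf_age(cbuf_t *b)
{
    int to = b->level > 0 ? b->level - 1 : 0;
    q_remove(b);
    q_insert_mru(b, to);
}

/* Scavenging: reclaim from the LRU end of the lowest non-empty level. */
cbuf_t *scavenge_one(void)
{
    for (int lvl = 0; lvl < NUM_LEVELS; lvl++)
        if (tail[lvl]) { cbuf_t *b = tail[lvl]; q_remove(b); return b; }
    return NULL;  /* nothing to move to the free cache */
}
```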
Per-Affinity Buffer Cache (PABC)
§ For CPU scalability, recycle queues were replicated per affinity
§ Buffers mapping to each affinity are managed together
§ Allows buffer cache management to be performed in parallel with no locking
§ Distributed heuristics added to approximate global ordering
§ Distributed decision making created problems
§ Often resulted in bad replacement decisions, which hurt performance
§ Created complicated code that was difficult to maintain
[Diagram: per-affinity recycle queues (Affinity 1, Affinity 2, … Affinity n), each ordered from high priority to low priority.]
Global Recycle Queue (GRQ)
§ Objective: global ordering with parallel processing
§ Organize buffers across all affinities in a single global queue
§ Stage 1 Scavenging: “sorts” the lowest-priority buffers into per-affinity queues
§ Stage 2 Scavenging: parallel execution of costly operations within the affinities
[Diagram: the Global Recycle Queue, ordered from high-priority to low-priority buffers. Untouched buffers logically flow toward the scavenger; touching promotes buffers away from it; buffers are inserted at various points depending on importance. The low-priority end feeds per-affinity queues, which feed the free cache.]
Global Recycle Queue Set
§ The global queue is implemented as a set of smaller queues (see the sketch below)
§ One spinlock per queue for MP safety
§ Many short queues to reduce lock contention
§ Buffers inserted at the same time are sprayed over a range of queues
[Diagram: the GRQ as a priority-ordered set of short queues Q1 … Qn; buffers A, B, C, D, E, F of the same type, inserted together from all affinities, are sprayed across a range of adjacent queues.]
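A sketch of the spray insert, assuming a fixed queue count and spray range (both counts are illustrative): each short queue has its own spinlock, and a rotating counter spreads same-priority insertions so concurrent threads rarely contend on the same lock.

```c
/* Sketch of the GRQ set: one spinlock per short queue, inserts
 * sprayed across a range of queues at the target priority. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_QUEUES  64   /* many short queues to cut lock contention   */
#define SPRAY_RANGE 8    /* same-priority inserts spread over 8 queues */

typedef struct gbuf { struct gbuf *next; } gbuf_t;

static struct {
    pthread_spinlock_t lock;
    gbuf_t *list;
} grq[NUM_QUEUES];

void grq_init(void)
{
    for (int i = 0; i < NUM_QUEUES; i++)
        pthread_spin_init(&grq[i].lock, PTHREAD_PROCESS_PRIVATE);
}

/* Lower queue index = lower priority = closer to the scavenger. */
void grq_insert(gbuf_t *b, unsigned base_priority)
{
    static _Atomic unsigned spray;  /* rotates inserts across the range */
    unsigned q = base_priority + spray++ % SPRAY_RANGE;
    if (q >= NUM_QUEUES)
        q = NUM_QUEUES - 1;

    pthread_spin_lock(&grq[q].lock);
    b->next = grq[q].list;          /* short queue, briefly held lock */
    grq[q].list = b;
    pthread_spin_unlock(&grq[q].lock);
}
```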
GRQ Performance Evaluation
§ Summary: GRQ performance is on par with PABC
§ Even though PABC had been heavily optimized over a decade
§ Global ordering is achieved despite the removal of code complexity
Experimental platform: 8 cores, 48GB RAM
Workload Management
§ Goal: provide QoS management for buffer cache usage
§ Provide mechanisms for upper and lower bounds on buffer cache usage
§ No hard partitioning of buffers
§ Allocations
§ Guarantee a portion of cache space for a workload
§ No buffer scavenging until the Allocation is exceeded
§ Buffers are isolated on a dedicated Global Holding Queue
§ Soft Limits (see the sketch below)
§ Prevent a workload from loading buffers so quickly that others’ buffers are thrashed
§ Offending workloads experience degraded service
§ I.e., their buffers are inserted at lower-priority queues
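A sketch of the Soft Limit effect at insertion time (the field names and demotion factor are assumptions): an over-limit workload's new buffers land closer to the scavenger, degrading only its own hit rate rather than thrashing everyone else's working set.

```c
/* Sketch of Soft Limit enforcement at buffer insertion time. */
typedef struct workload {
    unsigned long cached;      /* buffers this workload holds now */
    unsigned long soft_limit;  /* allowed share before demotion   */
} workload_t;

/* Pick the GRQ insertion priority for a new buffer of this workload. */
unsigned insert_priority(const workload_t *w, unsigned normal_pri)
{
    if (w->cached > w->soft_limit)
        return normal_pri / 2;  /* demote: nearer the scavenger end */
    return normal_pri;
}
```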
Workload Management: Allocations
§ Primary workload: random read with working set = 80% of RAM
§ Allocation = 80% of memory
§ Interfering workloads: random read with working set > RAM size
Experimental platform: 8 cores, 32GB RAM
Workload Management: QoS using Soft Limits
§ Two instances of random read with working set = size of RAM
§ Prioritized workload given a Soft Limit of 75%
§ De-prioritized workload given a Soft Limit of 15%
System under test: 8 cores, 32GB RAM
Conclusions
§ Think Global, Act Local
§ Global ordering of buffers
§ Local (distributed) processing of buffers
§ Achieved via a two-stage buffer management algorithm
§ Architecture allowed for the removal of unmaintainable code
§ Introduced workload management capabilities to the buffer cache
§ GRQ has been adopted in several other areas of WAFL