
SFA Product Line
High Performance Solutions for Big Data: Setting the Bar in Both Bandwidth & IOPS

DDN | Whitepaper

ddn.com ©2013 DataDirect Networks. All Rights Reserved.


Table of Contents

Abstract
Introduction
SFA12KX
SFA12KX Storage OS Architecture
    Active/Active Model
    Data Protection
        RAID
        Hot Spares
        Battery Backed Write-Back Cache
        Mirrored Write-Back Cache
        Mirrored Transaction Journal
        Metadata Mirrored n-Ways
        Parity Check On Read DirectProtect
        Data Integrity Field DirectProtect
    Storage System Efficiencies
        Storage Fusion Xcelerator
        Partial Disk Rebuild
        Real-time Adaptive Cache Technology (ReACT™)
        Rebuild Priority
        Read Quality of Service
    Management
        DirectMon™
        Application Programming Interface (API)
        Software Summary
SFA12KX Hardware Architecture
    RAID Processing
    I/O Channels and Architecture
    Cache
    Back End Disk Connectivity
    Hardware Summary
SFA OS and In-Storage Processing™ Computing Systems
    In-Storage Processing Computing Capability
    PCIe Device Dedication
    Virtual Disk Driver
    Reduction in Equipment, Infrastructure and Complexity
SFA12KX Family: Summary


Abstract
Big Data dominates the storage landscape. Storage and compute challenges that once fell mostly in the domain of High Performance Computing (HPC) or even Supercomputing are an everyday part of businesses that want to make the most of their Big Data. Storage systems that take on these Big Data challenges generally fall into one of two categories: those with high IOPS capability or those with high bandwidth capability. In the world of HPC, where the focus is on massive scale and streaming writes, storage systems with high bandwidth capabilities have been favored. Big Data often consists of large files that also benefit from high bandwidth for ingest or write-out activities.

Increasing core counts and ever-increasing numbers of nodes in HPC have fundamentally changed data I/O patterns and storage system requirements. Big Data's analytic processing challenges require very high levels of IOPS. Traditional storage systems are not capable of both high IOPS and high bandwidth.

This paper presents the next step in the evolution of the SFA storage product line. The SFA12KX™ Family builds on the concepts and success of the SFA10K™ Family and is uniquely suited to adapt to modern compute environments and the unique data storage challenges they present. The SFA12KX Family performs at the highest levels of both sequential bandwidth and random IOPS. Additionally, we will examine the architecture that enables In-Storage Processing™ for embedding clustered/parallel file systems directly into the storage, resulting in significant reductions in complexity, latency, footprint and cost.

Introduction
Across the storage industry, the vast majority of block storage systems have been deliberately designed to deliver random access I/O to serve transactional applications. These applications include reservation systems, banking applications, databases, email and messaging applications and batch processing jobs. These compute processes use fixed, structured storage formats, which are commonly referred to as structured data. With structured data, information is communicated to/from storage in small blocks and accessed in a generally random pattern, which requires high Input/Output Operations per Second (IOPS) to deliver sufficient performance.

In recent years, the digital content revolution has enabled personal, social and business computing, as well as the ability for predictive simulation, weather forecasting and processing of satellite imagery. This has resulted in an explosion in both the size and number of files stored online. Businesses, both online and traditional "brick and mortar", are collecting data at an astonishing rate and are analyzing this data with methods such as MapReduce, which utilize large data sets and require both high bandwidth and high random IOPS capability. We call this "Big Data". According to IDC, the Big Data market is expected to grow to $16.9B in 2015 with a compound annual growth rate (CAGR) of 39.4%. Among the different segments, storage is the fastest growing with 61.4% CAGR.


Figure 1 – Worldwide Big Data Technology and Services Revenue, IDC March 2012 (chart: annual revenue for 2010-2015, from $0 to $18,000, broken out by Servers, Storage, Networking, Software and Services segments)

The growth and emergence of Big Data has necessitated change in storage technology to deliver high random IOPS and high bandwidth at the same time. This market opportunity gave rise to storage architectural platforms such as DDN's Silicon Storage Architecture™ (S2A), which uses specialized hardware to read and write unstructured data at the highest performance levels, with no degradation during system correction events such as drive rebuilds.

Just as systems optimized for random IOPS do not excel at storing large sequential files, systems optimized for bandwidth are not necessarily class-leading in transactional data patterns.

The explosive growth in unstructured data favors storage systems optimized for bandwidth, as the growth in structured data as a percentage of aggregate market demand is slowing year over year. This growth has largely coincided with increasing CPU speeds. As processor frequency approached the upper limits of what is physically possible with silicon-based technology, CPU manufacturers found a different avenue to increase compute power per CPU: combining multiple processing cores onto a single CPU socket, extending Moore's Law by several years. Recently, the number of processing cores (or simply "cores") per chip in the commodity space has increased to the point that eight cores are common, and processors with even higher core counts are just around the corner.


The increase in the number of cores per chip, and the number of threads per core, allows multiple processes to run simultaneously, often producing multiple file accesses at once. What the individual running processes view as sequential access, the storage system sees as increasingly random access, as data must be read or written in multiple locations on the storage media rather than stepping sequentially through one location. Further, access to hundreds or thousands of files simultaneously, via a single file system namespace – or the effect of thousands of threads writing a single file – requires substantial POSIX metadata operations that require high-speed random IOPS for optimal response.

The need for multi-threaded, simultaneous file access on a massive scale is not a future requirement; it's happening today. Currently, the top supercomputers have over 1.5 million CPU cores in their compute clusters, resulting in potentially hundreds of thousands of simultaneous file writes during checkpoint operations. Leading websites have tens of billions of files stored, accessed at any time with hundreds of thousands of file accesses per second. The continuous increases in processing cores per socket will allow clients to access more and more files simultaneously. This multi-threaded I/O will produce storage access patterns that are increasingly random, requiring high IOPS capability. Thus, a storage system designed to serve large files to multi-core compute environments must now be optimized to support mixed-mode access, offering both high random IOPS and high bandwidth.

On the surface, it seems storage systems can be optimized to serve either high random IOPS or high bandwidth. Conventional wisdom says that systems can excel at one or the other characteristic, but not both. Conventional wisdom also once said a storage system could not write as fast as it reads at peak bandwidth levels, but DDN's Silicon Storage Architecture broke through that long-standing belief. Today, a storage system can offer extreme performance in both random IOPS and bandwidth… That system utilizes DDN's new Storage Fusion Architecture® and is known as the SFA12KX.

SFA12KX
SFA12KX builds on the revolutionary Storage Fusion Architecture (SFA), first introduced by DDN with the SFA10K. SFA is based on a unique combination of highly parallelized software, industry-proven data integrity algorithms and high-speed hardware components to produce a storage controller that performs in the extreme range of both bandwidth and IOPS. By marrying a state-of-the-art, multi-threaded data integrity engine to best-of-breed processors, interconnects, buses, memory architectures and media technologies, SFA12KX capitalizes on the same advancements in technology as the clients it serves. This strategy ensures that as these technologies evolve and improve, SFA performance will improve along with them.

The SFA12KX employs RAID, data integrity and data management software written from the ground up to take advantage of multi-core processors and modern bus architectures. This highly threaded architecture allows performance to scale linearly with advances in the underlying hardware. This same architecture allows the SFA12KX to do what no other RAID controller has been able to do to date: perform in the extreme range of both bandwidth and IOPS. The SFA12KX delivers over 1.7 million random IOPS burst to cache and over 1.4 million sustained 4K IOPS to SSDs. Sequential block bandwidth performance is 48GB/s for simultaneous reads and writes. Designed to house the most scalable unstructured file data, the system supports up to 1,680 drives of raw storage while enabling a combination of SAS, SATA or SSD drives.


Figure 2 – SFA12KX Active/Active RAID Controller Architectural Overview (diagram: 16 x FDR InfiniBand host ports; SFA interface virtualization; 32-64 GB high-speed cache per controller linked at 240 Gb/s; SFA RAID 1, 5 and 6; internal SAS switching on a 960 Gb/s internal SAS storage management network; optimized drive support for SATA – leading capacity and cost-optimized bandwidth, SAS – a balanced mix of IOPS, capacity and bandwidth, and SSD – unrivaled IOPS for transactional applications)

SFA12KX Storage OS Architecture
The SFA12KX runs the market-proven and mature SFA OS. SFA OS was purpose-built to fully exploit the power of multi-core processors. A storage controller is made up of many components, and the design goal of SFA OS was to get the maximum performance out of every component in the system. Thus, it is not only the RAID engine that is optimized, but also the cache engine, data movers, drivers, schedulers and much more. All of these storage subsystems are highly parallelized and multi-threaded, creating a powerful, scalable software architecture that serves as the basis for high performance, high availability and rich features that will grow over time.

Active/Active Model
From conception, the SFA12KX was designed to work in an Active/Active fashion. There are essentially two ways to implement Active/Active operation in a redundant RAID controller: Active/Active with Distributed Locking, or Active/Active with Dynamic Routing.


Active/Active with Distributed Locking is the method that has been used historically for DDN's S2A products. With this method, each logical unit is online to both controllers. Both controllers cache data for the logical unit, both controllers access the physical disks that contain the logical unit directly, and distributed locks are used to guarantee storage register semantics and write atomicity. The locks are communicated across an inter-controller link (ICL). Because the S2A is optimized for bandwidth and has relatively little cache, ICL traffic is low and does not impact performance; however, experience has shown that distributed locking slows IOPS performance. This is partly due to the ICL communication latency, but has more to do with the lock and cache lookup times. Thus, in a system destined to perform at extreme IOPS levels, a different method had to be implemented.

SFA implements an Active/Active host presentation model with routing-based data access and full cache coherency. The SFA OS provides preference indicators and target port groups for its SCSI target implementation and thus has the notion of a preferred controller and a preferred RAID Processor (RP). In this approach, each logical unit is online to both controllers, but only one controller takes primary ownership of a given logical unit at a given time. The controller that masters the logical unit caches data for the logical unit and accesses the physical disks that contain that logical unit's data. Additionally, the controller that masters the logical unit is the preferred controller for that logical unit, and I/O requests received by the non-preferred controller are forwarded to the controller that masters the logical unit.

This intelligent approach to storage management requires no distributed locking. Instead, I/O requests are forwarded (Figure 3). When mirrored write-back caching is performed, write data must be transferred to both controllers in any case, so forwarding introduces no additional data transfers. Read data does have to be transferred across the ICL for reads that are not sent to the preferred controller; however, these reads benefit from the logical unit's read-ahead cache. When in write-thru mode, write data does have to be transferred across the ICL for writes that are not sent to the preferred controller.

Figure 3 – Active/Active Routing Depicting I/O Scenarios (diagram: four cases – write to preferred path, read from preferred path, write to non-preferred path, read from non-preferred path – showing data transfer and cache-mirror traffic between the logical disk client, the logical disk's master controller and the partner controller; ICL = Inter-Controller Link)


There are several advantages to the Active/Active with Routing method. The main advantage is that no distributed locking is required, which yields better I/O performance and a very clean failover implementation, and in turn enhanced data integrity. Another advantage is that caching, both read and write, is more efficient and effective because all of the cache data can be found in a single location.

Virtual disk clients need at least one path to each controller to allow failover, and thus need a multi-path I/O driver to recognize that the logical units presented by the two controllers for one logical disk represent the same logical disk. It is important that the multi-path I/O driver is able to understand the standard SCSI preference indicators and target port groups. Such drivers are readily available for most major operating systems, including Microsoft Windows® server products and Linux.

Each SFA storage pool (a.k.a. RAID set) has a preferred home attribute that allows specification of which controller and RP should master the logical disks or virtual disks that are realized with that storage pool. Each logical disk has a current home attribute that indicates the controller that is actually mastering the logical unit at the moment; this changes dynamically during failover and failback, or when the preferred home attribute is changed. The SCSI preference indicators dynamically change to reflect the current home, and MPIO drivers are designed to adapt dynamically to changes in the SCSI preference indicators, so a proper MPIO driver will send most I/O requests to the controller that masters the logical unit.
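
To make the routing model concrete, here is a minimal Python sketch of the preferred-home/current-home mechanics described above. The class and function names are illustrative inventions, not SFA OS internals:

    # Hypothetical model of Active/Active routing with a preferred and a
    # current home per logical disk (names are illustrative only).

    class Controller:
        def __init__(self, name):
            self.name = name

    class LogicalDisk:
        def __init__(self, lun_id, preferred_home):
            self.lun_id = lun_id
            self.preferred_home = preferred_home  # administrator-set attribute
            self.current_home = preferred_home    # changes on failover/failback

        def fail_over(self, survivor):
            # The survivor masters the logical disk; SCSI preference
            # indicators would change here so MPIO drivers re-route.
            self.current_home = survivor

    def submit_io(lun, receiving_controller, payload):
        """Serve the request locally if this controller masters the LUN,
        otherwise forward it across the inter-controller link (ICL)."""
        if receiving_controller is lun.current_home:
            return f"{receiving_controller.name}: served {payload} locally"
        return (f"{receiving_controller.name}: forwarded {payload} over ICL "
                f"to {lun.current_home.name}")

    c0, c1 = Controller("controller-0"), Controller("controller-1")
    lun = LogicalDisk(lun_id=7, preferred_home=c0)
    print(submit_io(lun, c0, "read 4K"))   # preferred path: served locally
    print(submit_io(lun, c1, "read 4K"))   # non-preferred path: forwarded
    lun.fail_over(c1)                      # controller-0 fails
    print(submit_io(lun, c1, "read 4K"))   # survivor now serves locally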

Data Protection

RAID

The SFA OS RAID stack provides protection against single physical disk failures with RAID-1 or RAID-5 data protection, as well as against double physical disk failures through the use of high-speed RAID-6 protection. Both the SFA RAID 5 and RAID 6 parity protection implementations use a rotating parity scheme. The RAID-5 implementation adds a parity chunk to every stripe using XOR. The RAID-6 implementation adds a P and Q chunk to every stripe, where P and Q are calculated with Galois Field arithmetic. Particular attention has been paid to closing all of the write holes¹; the method for doing so goes beyond the scope of this paper.

A RAID set is implemented using an integral number of equal-sized members, which are whole physical disks. The total number of RAID set members is the number of data members plus the number of parity members. A chunk is one or more sequential data blocks from a single RAID set member, and each member is made up of a sequence of chunks. A stripe consists of a set of chunks: the same ordinal chunk from each RAID set member. For RAID 6, two of the stripe's chunks are used for parity ("P" and "Q") while the remaining chunks are used for logical disk data. The data and parity members are laid out as shown in Figure 4 to provide load balancing for both reads and writes; this is sometimes referred to as "left symmetric". For normal reads, only the data members need to be read. Optionally, one parity chunk ("P") is read and the parity is checked, part of a feature called DirectProtect™, which guards against silent data corruption, a real if sporadic occurrence with SATA disk drive technology.

¹ For RAID 5 and RAID 6, in the event of a system failure while there are active writes, the parity of a stripe may become inconsistent with the data. If this is not detected and repaired before a disk or block fails, data loss may ensue, as incorrect parity will be used to reconstruct the missing block in that stripe. This potential vulnerability is sometimes known as the write hole. Battery-backed cache and similar techniques are commonly used to reduce the window of opportunity for this to occur.


Figure 4 – Example of RAID-6 RAID set Layout (diagram: stripes of chunks across the RAID set members – data chunks 0-7 with P and Q for 0-7, data chunks 8-15 with P and Q for 8-15, data chunks 16-23 with P and Q for 16-23 – with the parity positions rotating from stripe to stripe; additional stripes maximize the chunks on each member)
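
As an illustration of the rotating-parity idea in Figure 4, the following Python sketch computes which member holds each chunk of a stripe for a hypothetical 8+2 RAID-6 set. The rotation shown is one common left-symmetric pattern; the exact rotation used by SFA OS may differ:

    # Illustrative rotating-parity layout: P and Q shift one member to the
    # left on each successive stripe so that parity I/O is spread evenly
    # across all members (load balancing for both reads and writes).

    def stripe_layout(stripe, members=10):
        """Return the member indices holding P, Q and data for a stripe."""
        p = (members - 2 - stripe) % members   # P rotates left each stripe
        q = (members - 1 - stripe) % members   # Q sits beside P
        data = [m for m in range(members) if m not in (p, q)]
        return {"P": p, "Q": q, "data": data}

    for s in range(3):
        print(f"stripe {s}: {stripe_layout(s)}")
    # stripe 0 places P on member 8 and Q on member 9, with data chunks
    # 0-7 on members 0-7, matching the first stripe of Figure 4.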

Hot Spares

The SFA OS provides pools of spare physical disks that can be used automatically to replace failed physical disks. By replacing a failed RAID set member automatically, the mean-time-to-repair for the RAID set is minimized, resulting in improved data reliability.

Battery Backed Write-Back Cache

SFA OS provides a write-back cache feature that is used to improve I/O performance. In the event of an AC mains failure, write-back cache data that has not yet been written to disk is preserved by maintaining power to the cache memory long enough to copy the contents of the cache to stable storage. In addition, SFA OS is designed to tolerate a simultaneous AC mains failure and RAID software failure.

Mirrored Write-Back Cache

SFA OS provides the ability to mirror all write-back cache data such that the failure of a single controller will not result in data loss. A storage administrator can optionally turn off write-back cache mirroring for a RAID set (for higher performance); however, data protection is then reduced for logical units within that RAID set.


Mirrored Transaction Journal

RAID write holes are prevented by executing stripe updates as ACID (Atomicity, Consistency, Isolation and Durability)² transactions, so that if they are interrupted by a power failure they can be recovered from the transaction journal when power is restored. This journal is stored within the write-back cache and is thus mirrored, so that if a simultaneous power failure and controller hardware failure occurs, the surviving controller can recover the transactions.

Metadata Mirrored n-Ways

SFA OS stores a copy of storage system metadata on 18 physical disks to minimize the likelihood that its metadata is lost or corrupted.

DirectProtect – Silent Data Corruption Detection and Avoidance

DirectProtect is a trademarked name for techniques that detect and correct data errors made by physical disks. It's particularly valuable when using lower-cost spinning disk, such as SATA drives, which are designed with a lower bit-error-rate requirement than enterprise-quality SAS disks. In SFA OS there are two levels of DirectProtect: Parity Check on Read (PCOR) and Data Integrity Field (DIF).

Parity Check On Read DirectProtect:

The SFA OS allows the administrator to specify whether DirectProtect is turned on or off per RAID set. If enabled for a given RAID set, RAID parity will be checked on all reads. In the event that the RAID parity is found to be bad, SFA OS takes steps to correct the data, including retrying the reads and using P and Q to identify the bad data. Once the bad data is identified, the correct data is generated from parity and the read is returned. Any bad data on the physical disk is corrected in the process. When data is read as part of a write operation (e.g., in a read-modify-write), the parity is checked as part of the read operations.
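
The read-verify flow can be sketched in Python using the RAID-5-style XOR parity chunk alone; the Galois Field Q chunk, which SFA OS also uses to pinpoint the bad member, is omitted for brevity. This is an illustrative model, not DDN's implementation:

    # Minimal parity-check-on-read sketch: verify the XOR parity of a
    # stripe on every read; on a mismatch, the caller would retry and
    # reconstruct the bad chunk from redundancy, as SFA OS does.

    from functools import reduce

    def xor_chunks(chunks):
        """XOR equal-length chunks together, byte by byte."""
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

    def read_with_pcor(data_chunks, p_chunk):
        if xor_chunks(data_chunks) != p_chunk:
            raise IOError("parity mismatch: retry read, then repair via P/Q")
        return b"".join(data_chunks)

    data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
    p = xor_chunks(data)                     # parity written with the stripe
    assert read_with_pcor(data, p) == b"\x01\x02\x04\x08\x10\x20"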

PCOR-based DirectProtect can have an effect on performance because every read and write involves every data member of the RAID set; this performance impact varies with data access patterns. An I/O pattern in which every I/O is full-stripe-aligned (the I/O size equals the stripe size and is aligned on a stripe boundary) naturally involves every data member of the RAID set and has minimal performance impact with PCOR DirectProtect on. Sequential I/O patterns, in which the read-ahead or write-back cache can turn non-stripe-aligned I/Os into full-stripe-aligned RAID set I/Os, also have minimal performance impact with PCOR DirectProtect on. Small random reads performed with PCOR DirectProtect enabled (reads that access fewer disks than the number in a full stripe) will suffer more degradation due to the requirement to read from all the disks in the stripe to check parity.

² In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably. In the context of databases, a single logical operation on the data is called a transaction. An example of a transaction is a transfer of funds from one bank account to another, even though it might consist of multiple individual operations (such as debiting one account and crediting another). This brief definition was obtained from Wikipedia: http://en.wikipedia.org/wiki/ACID


Data Integrity Field DirectProtect:

This approach to detecting and correcting physical disk errors stores redundant information about the data in a form other than RAID parity. One approach is to store a hash (e.g., a CRC check), or Data Integrity Field (DIF), of each block's data, and then check it each time the block is read. Of course, the physical disk already stores a sophisticated Reed-Solomon code for each block that both detects and corrects errors, so having the RAID system store another hash may seem redundant; but remember that the purpose of SFA DirectProtect is to catch the errors that low-cost physical disks fail to detect.

There are several advantages to performing data integrity verification with a DIF versus the PCOR method alone. The first is that calculating a hash is far less intensive than calculating parity and hence results in significantly smaller levels of performance degradation. Additionally, the DIF method can easily detect and correct silent data corruption on mirrored (RAID 1) RAID sets. Lastly, the DIF method has become accepted and standardized in the form of ANSI T10-DIF. This means it may be possible in a future version of SFA OS to emulate complete end-to-end data integrity checks even with SATA disk drives.

To improve DirectProtect performance and provide additional data integrity checking, SFA OS includes this secondary DIF method for ensuring data integrity on SATA disks. This method inserts an additional 512-byte DIF block on each physical disk for every 64 data blocks on that disk, which is used to store a hash of the data in each of those 64 blocks. When data is read from a physical disk, the read is lengthened to include the DIF block, and the hash code is calculated and checked against the value stored in the DIF block. If an error is detected, steps will be taken to correct the error using retries and RAID redundancy. DIF blocks are cached to minimize the impact on performance.
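
A minimal Python sketch of the scheme just described, assuming a CRC-32 hash and a 4-byte slot per block inside the 512-byte DIF block; the actual hash and layout used by SFA OS are not specified in this paper:

    # One 512-byte DIF block covers 64 data blocks, holding a per-block
    # hash that is checked on every read of those blocks.

    import zlib

    BLOCK = 512
    GROUP = 64   # data blocks covered by one DIF block

    def build_dif_block(data_blocks):
        """Pack one 4-byte CRC per data block into a 512-byte DIF block."""
        assert len(data_blocks) == GROUP
        dif = b"".join(zlib.crc32(b).to_bytes(4, "little") for b in data_blocks)
        return dif.ljust(BLOCK, b"\x00")   # 64 * 4 = 256 bytes used

    def verify_block(dif_block, index, data):
        stored = int.from_bytes(dif_block[index * 4:index * 4 + 4], "little")
        if zlib.crc32(data) != stored:
            raise IOError(f"silent corruption in block {index}: "
                          "retry, then repair from RAID redundancy")
        return data

    blocks = [bytes([i]) * BLOCK for i in range(GROUP)]
    dif = build_dif_block(blocks)
    verify_block(dif, 5, blocks[5])        # a clean read passes the check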

Storage System Efficiencies

Storage Fusion Xcelerator (SFX)

SFX is a suite of storage acceleration tools that combine spinning and solid-state disk storage media with application-aware technology to intelligently accelerate block- and file-based data access. It is part of the SFA Operating System and extends the functionality of the storage system's cache by selectively front-ending traditional rotating media with some amount of flash memory. This yields acceleration in the context of the application. SFX consists of a pool of SSD flash-based drives that effectively become an extension of the DRAM cache.

SFX cache can be allocated to a Logical Unit Number (LUN), which refers to a logical disk created from a group of real disks, or can be shared between multiple LUNs. It has the effect of front-ending the LUN with very fast and large cache, without having to dedicate expensive SSD drives to a single LUN. There are currently four modes of SFX cache, which determine how data is served and when data gets flushed out of cache based on available headroom.

• SFX Read Cache – This mode is designed for read-intensive workloads. It caches frequently accessed data sets in the faster SFX tier to significantly speed up application performance.

• SFX Write Cache – This mode is designed for write-intensive workloads. It allows large writes to burst at full speed to the SFX tier, and then groups the data and writes it down to disk over time, eliminating the need to deploy petabytes of rotating media to meet extreme performance requirements.

• SFX Instant Commit – This mode is designed for reads that closely follow writes. It populates the SFX tier with new writes to "warm up" the cache and accelerate subsequent reads.

• SFX Content Commit – This mode allows applications and file systems to send hints down to the storage system, delivering the best storage acceleration possible by eliminating the need to "guess" what the I/O pattern will be upon deployment.

More details on SFX are available in the DDN white paper titled "Storage Fusion Xcelerator".
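
As a toy illustration of the Read Cache mode above, the following Python sketch fronts rotating media with a small, fast SSD tier, promoting recently read blocks and evicting the least recently used. The LRU policy and data structures are illustrative assumptions, not SFX internals:

    # Toy SFX-style read cache: hot LUN blocks are promoted into a faster
    # SSD tier that extends the DRAM cache.

    from collections import OrderedDict

    class SfxReadCache:
        def __init__(self, hdd_lun, ssd_capacity_blocks):
            self.hdd = hdd_lun            # dict: block number -> data
            self.ssd = OrderedDict()      # LRU-ordered SSD tier
            self.capacity = ssd_capacity_blocks

        def read(self, block):
            if block in self.ssd:         # hit: serve from flash
                self.ssd.move_to_end(block)
                return self.ssd[block]
            data = self.hdd[block]        # miss: read rotating media
            self.ssd[block] = data        # promote hot data to SSD
            if len(self.ssd) > self.capacity:
                self.ssd.popitem(last=False)   # evict least-recently used
            return data

    cache = SfxReadCache({n: f"block-{n}" for n in range(1000)},
                         ssd_capacity_blocks=4)
    cache.read(1); cache.read(2); cache.read(1)   # block 1 stays hot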

Partial Disk Rebuild

SFA OS tracks the changes made to a RAID set while a member physical disk is unavailable; if that member becomes available again within a user-settable timeout, only the stripes that were modified while the member was missing are rebuilt. This minimizes the mean-time-to-repair for the RAID set and thus improves the data reliability of the RAID set, while also limiting any performance impact of a drive repair.
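
A minimal sketch of the idea, assuming a simple set of dirty stripes is kept while the member is absent; the actual SFA OS bookkeeping is not described in this paper:

    # Partial disk rebuild: record stripes written while a member is
    # missing; on timely return, rebuild only those stripes.

    class RaidSet:
        def __init__(self, total_stripes):
            self.total_stripes = total_stripes
            self.missing_member = None
            self.dirty_stripes = set()     # stripes modified while degraded

        def member_failed(self, member):
            self.missing_member = member
            self.dirty_stripes.clear()

        def write(self, stripe):
            if self.missing_member is not None:
                self.dirty_stripes.add(stripe)   # track for partial rebuild

        def member_returned_within_timeout(self):
            to_rebuild = sorted(self.dirty_stripes)
            self.missing_member = None
            self.dirty_stripes.clear()
            return to_rebuild

    rs = RaidSet(total_stripes=1_000_000)
    rs.member_failed(member=3)
    rs.write(42); rs.write(99)
    print(rs.member_returned_within_timeout())   # [42, 99], not all stripes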

Real-time Adaptive Cache Technology (ReACT™)

Because the SFA12KX performs at extreme levels in both IOPS and bandwidth, it was desirable to achieve extreme performance in mixed-workload scenarios as well. Given a logical unit where data I/O is comprised of both random I/O and sequential I/O, it is desirable to enable caching (and cache mirroring) for high IOPS performance. With cache mirroring enabled, however, sequential I/O performance suffers by having to cross the inter-controller link. It also has the side effect of invalidating random I/O cache, as sequential data fills the cache and displaces previously cached data. To remedy this situation, SFA OS employs the ReACT feature to intelligently cache or write-through data based on incoming write patterns.

With write-back cache enabled and ReACT disabled, data written to a given logical disk with aligned full-stripe writes is cached in the write-back cache and mirrored to the partner controller. With ReACT enabled for a given pool, data written to the pool with aligned full-stripe writes is not cached and is instead written directly to the physical disks (i.e., write-through). Either way, non-aligned writes are written to write-back cache (Figure 5). By enabling ReACT, applications that generate aligned full-stripe writes can achieve higher performance because write data is not cached and thus is not mirrored, resulting in greatly reduced inter-controller link traffic.
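
The routing decision can be sketched as follows; the stripe size and function shape are illustrative, not the SFA OS implementation:

    # ReACT write-path decision: aligned full-stripe writes bypass the
    # mirrored write-back cache and go straight to disk; everything else
    # is cached and mirrored to the partner controller.

    STRIPE_SIZE = 8 * 512 * 1024   # hypothetical 8-chunk stripe, in bytes

    def react_route(offset, length, react_enabled=True):
        aligned_full_stripe = (offset % STRIPE_SIZE == 0 and
                               length % STRIPE_SIZE == 0 and length > 0)
        if react_enabled and aligned_full_stripe:
            return "write-through to disk (no cache fill, no ICL mirror)"
        return "write-back cache, mirrored to partner controller"

    print(react_route(offset=0, length=STRIPE_SIZE))    # streaming write
    print(react_route(offset=4096, length=64 * 1024))   # small random write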


Figure 5 – Optimizing Cache Utilization with ReACT (diagram: aligned I/O proceeds as single-operation, parallelized striped writes with no cache mirroring required for fast data; unaligned I/O goes to the mirrored write-back cache, accelerating write performance and avoiding read-modify-write penalties)

Rebuild Priority

SFA OS employs a tunable parameter per pool for rebuild priority. Adjusting this setting causes the rebuild engine to use fewer or more system resources for the rebuild operation. This feature allows an administrator the flexibility to adjust rebuild priority in relation to overall system performance. A lower rebuild priority setting consumes fewer system resources, allowing the system to devote more resources to incoming I/O. Conversely, it may be appropriate to increase rebuild priority to shorten rebuild time.

Read Quality of Service

DDN provides a highly reliable quality of service on read that allows the SFA12KX to stream data with very low latency. This allows latency-sensitive applications, such as video streaming, to deliver consistent and predictable performance even during system component failures.


Management

DirectMon™

Today, organizations are facing an exponential increase in the amount of data being created. The ability to successfully manage this data, coupled with the growing complexity of storage infrastructures, is creating significant challenges for IT managers. While the cost of maintaining storage infrastructures continues to increase, headcount and budget remain fixed. What is needed is an advanced management platform that reduces the cost and complexity of storage management.

DirectMon is an advanced configuration and monitoring solution that leverages our leadership in supporting the world's largest file storage systems for over a decade. Purpose-built to improve the performance of IT operations, it provides top-down support for managing multiple DDN SFA storage arrays, including the SFA12KX, SFA10K and the GRIDScaler™ and EXAScaler™ clustered file system appliances. Taking the complexity out of managing storage, its ease-of-use features and notifications allow administrators to quickly resolve problems, freeing up valuable time to concentrate on more important tasks. DirectMon is ideally suited for any size IT environment to help simplify the configuration and management of the storage infrastructure, even as data continues to grow exponentially.

Application Programming Interface (API)

In addition to the traditional ways of configuring and managing the SFA storage products via CLI, GUI and SNMP interfaces, APIs are provided to give customers the ability to programmatically configure and manage the SFA storage products. Enterprises are now able to integrate our SFA products into their overall management frameworks. Python-based API clients are provided to simplify the integration effort.
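
To illustrate the kind of integration this enables, here is a hypothetical sketch of scripted monitoring through such a client. Every class, method and field name below is invented for the example (a stub stands in for the real client); consult the DDN SFA API documentation for the actual interface:

    # Hypothetical sketch: folding SFA management into a site framework
    # via a Python API client. All names here are illustrative only.

    class SfaClient:
        """Stand-in for a DDN-provided Python API client."""
        def __init__(self, host, user, password):
            self.host = host
            # a real client would authenticate with the controller here

        def get_pools(self):
            # a real client would query the array; static data for the sketch
            return [{"name": "pool0", "state": "NORMAL"},
                    {"name": "pool1", "state": "REBUILDING"}]

    client = SfaClient("sfa12kx.example.com", "admin", "secret")
    for pool in client.get_pools():
        if pool["state"] != "NORMAL":
            # feed alerts into the site's own monitoring framework
            print(f"pool {pool['name']} needs attention: {pool['state']}")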


Software Summary
SFA OS on the SFA12KX introduces new levels of extreme performance with several unique features. The scalable architecture is also highly extensible. The SFA12KX with SFA OS provides the foundation upon which new data management and block virtualization features will be built in future releases. The flexibility and architecture of SFA OS allow these new features to be developed quickly, with rapid evolution of features utilizing the same hardware platform. This evolution provides long-term investment protection and enhances the longevity of SFA-based products.

SFA12KX Hardware Architecture
The last several years have seen significant improvements in multiple commodity components. As mentioned previously, processors are increasing in both the number of computing cores and the speed of those cores. Processor and bus interconnects have evolved to speeds that were only available in proprietary designs just a short time ago. HyperTransport (HT) and Intel QuickPath Interconnect (QPI) have replaced slow Front Side Bus (FSB) technology with low-latency, point-to-point links featuring revolutionary bi-directional transfer speeds. Now that both AMD and Intel have adopted the practice of integrating the memory controller, memory access speeds are greatly increased and experience lower latency. Peripheral buses have converged on PCI-Express (PCIe), which is now in its 3rd generation, and processors now have multiple integrated PCIe interfaces. Thus, nearly all the major components, buses and I/O paths around commodity computing processors have greatly improved in just the last couple of years. Combining these processing and I/O capabilities with current HBAs, HCAs and NICs in a unique configuration yields an extremely powerful storage hardware platform (Figure 2).

RAID Processing
A powerful storage hardware platform is useless without a tightly integrated software architecture that squeezes every bit of performance from the components and makes them work in a harmonious fashion. The SFA data integrity engine has been written from the ground up to be multi-threaded and highly parallelized to take maximum advantage of multi-core, multi-thread storage processors. Not only do various elements of the RAID stack run in parallel, but there are two parallel instances of the storage engine: one in RAID Processor (RP) 0 and one in RP1 (Figure 6). Thus, the SFA12KX actually has two parallel, multi-threaded RAID engines that work simultaneously in each controller, for a total of four RAID processors across the redundant controller pair. Further, each RAID processor runs multiple threads that manage the SFA cache, data integrity calculations and I/O movers. Thus, as the number of storage system cores is increased, additional parallel processes can run simultaneously, and both IOPS and bandwidth will increase accordingly.
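
Conceptually, the layout resembles the following Python sketch: two independent engine instances per controller, each running several worker threads. Thread roles, counts and work items are purely illustrative, not SFA OS internals:

    # Two RAID-processor instances per controller, each with worker
    # threads pulling from its own work queue.

    import queue
    import threading

    def start_raid_processor(name, work):
        def worker(role):
            while True:
                item = work.get()
                if item is None:       # shutdown sentinel
                    break
                # cache lookup, parity calculation or data movement here
                print(f"{name}/{role} handled {item}")
        threads = [threading.Thread(target=worker, args=(role,))
                   for role in ("cache", "parity", "mover")]
        for t in threads:
            t.start()
        return threads

    work0, work1 = queue.Queue(), queue.Queue()
    rp0 = start_raid_processor("RP0", work0)   # two engines per controller
    rp1 = start_raid_processor("RP1", work1)
    for i in range(4):
        (work0 if i % 2 == 0 else work1).put(f"stripe-{i}")
    for work, threads in ((work0, rp0), (work1, rp1)):
        for _ in threads:
            work.put(None)
        for t in threads:
            t.join()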

I/O Channels and Architecture


Powerful parallel RAID processors need to be able to handle massive amounts of I/O on both the front end (block interfaces) and the back end (disk drive pool). The SFA12KX meets this challenge by providing each RAID processor with its own dedicated I/O channels: to Fibre Channel or InfiniBand host interfaces on the front end, and to performance-balanced SAS disk-enclosure interfaces on the back end. A very high-speed, low-latency interconnect allows for data transfers between RAID processors if and when necessary. This arrangement allows the SFA12KX to perform at extreme data rates, as data is streamed from the host interfaces directly into RAID processors and out the back end to disks without having to contend for a shared I/O bus.

The ability to move data through the controller in a streamlined fashion is also what gives the SFA12KX the ability to perform at extreme levels in IOPS. The ability to communicate via an unprecedented number of channels across multiple disks simultaneously is what allows the SFA12KX to achieve over 1.4 million sustained IOPS to SSD drives.

Figure 6 – SFA12KX Streamlined I/O Paths (diagram: Controller 0 and Controller 1, each containing RAID processors RP0 and RP1, with front-end block interfaces of up to 160 Gb/s per controller, 240 Gb/s high-speed inter-controller links, and 20 x 6Gb/s SAS x4 links (480 Gb/s) per controller down to as many as 1,680 SAS, SATA or SSD drives)

Cache
Extreme IOPS performance to disk is important, but for small-size, high-IOPS data patterns where latency becomes the gating factor, cache is a necessity. The SFA12KX offers high levels of mirrored cache: 32 GB total. Cache is implemented in DDR3 SDRAM memory for the lowest-latency, highest-performing cache. In the case of a power event, the SFA12KX utilizes a dedicated battery backup unit to hold up the controller while the un-flushed write-back cache data is transferred to internal, non-volatile, mirrored storage.

Back End Disk Connectivity
Overall, the design of the SFA12KX hardware is about balance. The extreme performance capabilities of the host ports are facilitated by a streamlined I/O path directly to the back-end disks. The massive 960Gb/s internal SAS network not only serves the IOPS and bandwidth needs of the controller itself, but has ample headroom for additional I/O operations internal to the architecture. This headroom allows disk rebuild I/O to coexist with application service, as there is plenty of bandwidth for both to occur simultaneously. By providing 40 x 4 SAS channels to serve 1,680 disk drives, the ratio of drives per channel is decreased. This arrangement allows more commands to be queued per drive, as well as providing ample bandwidth for high-IOPS SSD drives.

Additionally, because all of the disk enclosures are best-practice configured as 5, 10 or 20 enclosures per SFA couplet, the SFA12KX has the ability to RAID across storage enclosures for high levels of enclosure fault tolerance. Using an 8+2 RAID 6 configuration, the SFA controller can lose up to 4 drive enclosures (two tenths of the system's resources) on an active system and still deliver full access to online data.

Hardware Summary
This unique combination of high-performance storage processing technologies, married to an advanced, optimized software architecture, not only makes the SFA12KX the leader in IOPS and bandwidth but, more importantly, serves as a high-density, fault-tolerant storage foundation for evolutionary SFA OS advances far into the future. SFA OS forms the basis for the next generation of ultra-high-performance block storage. This unique hardware and software combination also lends itself to more interesting possibilities, further differentiating SFA OS.

SFA OS and In-Storage Processing™ Computing Systems
The decision to marry unique and specialized software to industry-standard hardware components in SFA lends itself to an innovation that goes far beyond block storage services: SFA OS allows for embedding applications within the SFA12KXE. The applications that make the most sense to embed (initially) are those that would benefit the most from reduced latency and high bandwidth: clustered file system services. Thus, in its first iteration the SFA12KXE has the capability to embed the Lustre file system (the OSSs) or IBM® GPFS™ (the NSDs). Embedding the file system servers within the storage device reduces the number of servers, infrastructure requirements and network connections, which in turn reduces complexity, power consumption and cooling requirements. At the same time, it streamlines I/O and reduces latency by removing data "hops" and eliminating wasteful storage protocol conversion.

In-Storage Processing Computing Capability
SFA OS utilizes virtualization software to allow applications to be run inside the storage device. Various methods of memory and resource protection are employed to guard the block RAID functionality and ensure overall system resources are allocated in a secure and controlled fashion. SFA OS acts as a hypervisor, using technologies such as ccNUMA and KVM to control processor, core, memory, I/O and virtual disk allocations. This ensures that applications running in the embedded space cannot affect the block RAID process memory space, and that those applications utilize only the processing and I/O resources they have been assigned.

Virtualization technologies are usually associated with performance degradation, not improvements in performance. Though SFA OS uses software and hardware virtualization, special care and development have been undertaken not only to ensure as little performance degradation as possible, but to produce an environment that offers enhanced performance. This is largely achieved with two distinct methods.

PCIe Device Dedication


In the case of Lustre and GPFS, InfiniBand or Ethernet HCAs are commonly used as the front-end interfaces to the file system servers. Normally, virtualization technologies share hardware devices such as HCAs among virtual machines, slowing access for all and requiring virtual device drivers. SFA overcomes these traditional bottlenecks by dedicating PCIe devices directly to virtual machines. In the course of virtual machine initialization, the PCIe address space for the PCIe device in question is remapped into the virtual machine's space. When the virtual machine boots its associated OS, it "sees" the PCIe device (in this case, the InfiniBand or Ethernet card) natively, as if it were running on a physical machine. This allows the use of the HCA's native software drivers, eliminating any need for a virtual device. Utilizing this method, virtual machines running inside the SFA12KXE have been able to achieve external bandwidth of 20GB/s or more.

Virtual Disk Driver
By dedicating PCIe devices directly to virtual machines, there is no need to modify OS images or supply highly specialized virtual I/O devices. Virtual machines running inside an SFA12KXE enjoy nearly native-speed access to HCAs. The remaining hurdle is access to the virtual disks (LUNs) served by the block RAID services side of SFA from the OS running inside the virtual machine. This access is achieved with the addition of a small, lightweight kernel module to the Linux image running inside the virtual machine. This driver presents virtual disks assigned to the virtual machine as standard Linux block devices under "/dev".

What looks like a standard block device is actually a shared memory interface between the virtual machine and the block RAID services managed by SFA OS. As shown in Figure 7, what was a dedicated server, an FC HBA, an FC switch and another FC HBA is reduced to a direct memory interface running at processor bus speeds. For writes from the OS to the device, data in memory is copied from the virtual machine space to the RAID space before it is manipulated by the RAID engine. This prevents the virtual machine from having write access to the RAID memory space.

Figure 7 – I/O Path Reduction in SFA12KX In-Storage Computing Systems (diagram: the traditional I/O path – client, HCA/NIC, switch, HCA/NIC, server, HBA, SAN switch, HBA, storage – is reduced to client, HCA/NIC, switch, HCA/NIC, with the embedded application efficiently placing data directly into SFA memory; eliminating protocol conversion reduces latency and improves IOPS performance)

On reads of the virtual disk device, the block RAID engine reads from disk, places the data in memory and passes a shared pointer to the virtual disk driver, so that the virtual machine can read directly from the RAID engine without a memory copy.
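
The asymmetry between the write path (copy into RAID space) and the read path (shared, read-only reference out of RAID space) can be sketched in Python as follows; this is a conceptual model only, not the kernel module's implementation:

    # Writes copy VM data into RAID-engine memory, so the VM never gets
    # write access to RAID space; reads return a read-only view of the
    # RAID engine's buffer, avoiding a memory copy.

    class RaidEngine:
        def __init__(self):
            self.memory = {}                     # RAID-space buffers by LBA

        def write(self, lba, vm_buffer):
            self.memory[lba] = bytes(vm_buffer)  # copy into RAID space

        def read(self, lba):
            return memoryview(self.memory[lba])  # shared pointer, no copy

    class VirtualDiskDevice:
        """What the guest sees as a standard block device under /dev."""
        def __init__(self, engine):
            self.engine = engine

        def submit_write(self, lba, data):
            self.engine.write(lba, data)

        def submit_read(self, lba):
            return self.engine.read(lba)         # zero-copy, read-only view

    vd = VirtualDiskDevice(RaidEngine())
    vd.submit_write(0, bytearray(b"guest data"))
    print(bytes(vd.submit_read(0)))              # b'guest data'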


Thus, IOPS-intensive loads (such as file system metadata operations) enjoy greatly reduced latency. The removal of SCSI protocol overhead, Fibre Channel interconnects, SAN switches and interface conversion reduces storage response times and lets the embedded file system take full advantage of the SFA12KXE's high-performance random I/O capabilities. This I/O streamlining in turn improves performance for transaction-intensive workloads.

Reduction in Equipment, Infrastructure and Complexity
By combining virtualization, an advanced new block RAID architecture and cutting-edge hardware technology, it's possible to achieve high performance while at the same time reducing complexity. As shown in Figure 8, using the Lustre file system as an example, SFA technology can result in as much as a 10-to-1 reduction in the number of managed systems, depending on deployment.

While clustered file system services were the first choice of applications to be embedded within SFA, virtually any application that would benefit from ultra-low-latency access to block disk devices could benefit from being embedded. As processors increase in speed and number of cores, the possibilities for what can be embedded increase along with the performance of the block RAID engine.

Figure 8 – Reduction in Equipment, Infrastructure and Complexity with SFA12KXE (diagram: a traditional Lustre deployment achieving 5 GB/s – Lustre clients on IB or 10 Gig-E, Lustre MGS, OSS and MDS nodes in active/standby, a Fibre Channel SAN with external RAID arrays and 300 3.5" disk drives, roughly 10 managed systems in all (2+ RAID arrays, 7 servers, 1 Fibre Channel switch) – versus a single SFA12KXE storage building block with embedded EXAScaler and the same 300 3.5" disk drives serving Lustre clients directly)


SFA12KX Family: Summary
Disk storage systems simply enable computational output to reside on non-volatile media rather than depending on volatile media (RAM). Their purpose is to serve compute clients rapidly, with predictable performance and integrity. To the storage environment, it should not matter whether systems are processing data for a Fortune 500 enterprise, climatologists predicting global weather patterns or scientists simulating high-energy physics. What does matter is that the technology used in those computers is becoming ever more multi-threaded. The resulting effect on storage systems is the simultaneous read and write of multiple files, whose access histogram appears mixed or highly transactional to the supporting storage systems. Thus, storage systems must adapt to changing data patterns to serve multi-threaded compute clients without bottlenecking application I/O.

SFA12KX meets the challenges of changing data patterns by offering extreme performance in both IOPS and bandwidth. A unique combination of an entirely new storage operating system (SFA OS) and best-of-breed storage processing components has made a system architecture that performs well at both ends of the I/O spectrum a reality.

In addition to meeting the mixed I/O requirements of the most intensive compute environments, SFA OS also allows for embedding clustered file system services directly inside the block storage device. This capability reduces servers, infrastructure and complexity. In addition to reducing the complexity of scale-out storage, Storage Fusion Architecture can also increase storage responsiveness by removing latency-injecting elements from the storage cluster.

Now that DDN's move to high-speed storage processing systems is complete, rapid development of additional features is possible: advanced storage virtualization capabilities, data management features and advanced application encapsulation, resulting in infrastructure and complexity reduction. The SFA12KX family is the leader in performance (in both IOPS and bandwidth), and Storage Fusion Architecture ensures enduring leadership as it readily adapts to and benefits from advances in the processing components it utilizes.

DDN | About Us

DataDirect Networks (DDN) is the world's largest privately held information storage company. We are the leading provider of data storage and processing solutions and services that enable content-rich and high-growth IT environments to achieve the highest levels of systems scalability, efficiency and simplicity. DDN enables enterprises to extract value and deliver results from their information. Our customers include the world's leading online content and social networking providers, high performance cloud and grid computing, life sciences, media production organizations and security & intelligence organizations. Deployed in thousands of mission critical environments worldwide, DDN's solutions have been designed, engineered and proven in the world's most scalable data centers, to ensure competitive business advantage for today's information powered enterprise.

For more information, go to www.ddn.com or call +1-800-837-2298.

©2013 DataDirect Networks, Inc. All Rights Reserved. Storage Fusion Architecture, Storage Fusion Xcelerator, DirectRAID, DirectProtect, EXAScaler, GRIDScaler, In-Storage Processing, ReACT, SFA10K, SFA12KE, S2A, SFX are trademarks of DataDirect Networks. All other trademarks are the property of their respective owners.