W H I T E P A P E R
High-Performance Lustre with Maximum Data Assurance
Silicon Graphics International Corp.
900 North McCarthy Blvd.
Milpitas, CA 95035
Disclaimer and Copyright Notice
The information presented here is meant to be general discussion material only. SGI does not represent or warrant that its products, solutions, or services as set forth
in this document will ensure that the reader is in compliance with any laws or regulations.
©2015 Silicon Graphics International All rights reserved.
T A B L E O F C O N T E N T S
1.0 Introduction
2.0 The Lustre File System
2.1 Metadata Management
2.2 Scale-Out Object Storage
2.3 Data Assurance Through Integrated T10 PI Validation
2.4 Simple and Standard Client Access to Data
3.0 T10 PI End-to-End Assurance
4.0 A Building Block Approach
5.0 Benchmark Process and Results
5.1 IOR POSIX Buffered Sequential I/O Results
5.2 IOR POSIX Buffered Random I/O Results
5.3 IOR POSIX Direct I/O Sequential Results
5.4 IOR POSIX DIO Random Results
6.0 Conclusion
1.0 Introduction

In High-Performance Computing (HPC), there is a strong correlation between the compute power of the
solution and the ability of the underlying data storage system to deliver the needed data for processing. As
processor power increases, the goal of system architects is to design systems with an appropriate balance of
data storage, data movement and data computing power – and to do so in a manner that optimizes the overall
processing output of the system at a given price point.
Lustre storage solutions based on an optimized combination of SGI servers and NetApp storage arrays provide
an excellent storage foundation that can be leveraged by HPC researchers, universities, and enterprises that
need to deploy a high-throughput, scale-out, commercially-supported and cost-effective parallel file system
storage solution. These SGI-delivered storage solutions use Intel Enterprise Edition for Lustre® software, a
commercially hardened and supported version of Lustre, the leading open source parallel file system for HPC.
Additionally, by leveraging industry-leading data assurance protocols – such as T10 PI – the SGI-NetApp
Lustre storage solutions are able to deliver the highest levels of data assurance and protection throughout the
end-to-end data path as storage volumes grow and the potential for undetected bit errors increases. The result
is a scale-out HPC storage solution capable of providing reliability and performance – and that is based on an
architecture that allows for the easy future scaling of both capacity and performance.
This white paper provides a brief overview of the Lustre File System and configuration information on a scale-out
SGI Lustre solution architecture that leverages NetApp-based block storage. The solution overview is followed
by performance analysis and conclusions that were obtained through structured benchmark tests.
2.0 The Lustre File System

Lustre is a parallel file system that delivers high performance through a scale-out approach that divides the
workload among numerous scale-out processing nodes. While the processing power of numerous data
storage servers is available, the system presents a traditional file system namespace that can be leveraged by
hundreds – or thousands – of compute nodes using traditional file-based data access methods.
A Lustre installation is made up of three key elements: the metadata management system, the object storage
subsystem which takes care of actual file/data storage, and the compute nodes from which the data/file access
is performed.
2.1 Metadata Management
The metadata management system is made up of a Metadata Target (MDT) and a corresponding Metadata
Server (MDS). The MDT stores the actual metadata for the file system that includes elements like file names, file
time stamps, access permissions, and information regarding the actual storage location of data objects for any
given file within the object storage system. Within Lustre, the MDS is the server that services requests for file
system operations and performs management of the MDT.
More recent versions of Lustre include a scalable metadata capability that allows request loads to be
spread across multiple servers – and in most deployments, the MDS is configured within a high-availability
(HA) environment to ensure ongoing availability of the file system in the event of a server/component failure.
2.2 Scale-Out Object Storage
The object storage system for Lustre is where the “scale out” attribute of the solution occurs. The object storage
system is made up of some number of Object Storage Servers (OSS) which manage the storage and retrieval
of data – and some number of Object Storage Targets (OST) which are the locations on which the actual data is
placed/read by the OSS.
Lustre deployments typically include numerous OSS nodes and multiple OST storage destinations – and this
scale-out attribute of Lustre creates an opportunity for the creation of object storage “building blocks” to be
defined such that additional capacity and/or throughput may be added to the system through the addition of
incremental building block system elements.
In general, administrators will increase the number of OSS nodes in order to increase the data transfer bandwidth
that the storage system can deliver over the network. OST storage will be configured to meet both the capacity
requirements of the overall system and the data throughput/performance requirements of the OSS nodes.
Within scale-out file systems (often referred to as ‘parallel file systems’) like Lustre, high-performance is achieved
by having the system ‘stripe’ data across multiple storage locations (OSTs) such that file read/write operations are
able to benefit from the ability to leverage the throughput of many storage devices in parallel. The result is a system
that can deliver throughput at levels that far exceed the capabilities of any single device or node.
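Striping can be made concrete with a small sketch of the offset-to-object mapping. The round-robin layout shown here is the default RAID-0-style Lustre pattern; the function name and OST identifiers are illustrative, not part of any Lustre API:

```python
def locate_stripe(offset: int, stripe_size: int, stripe_count: int, ost_ids: list):
    """Map a file byte offset to (OST, offset within that OST's object)."""
    stripe_index = offset // stripe_size          # which stripe of the file this byte falls in
    ost = ost_ids[stripe_index % stripe_count]    # stripes are assigned round-robin across OSTs
    # Each OST holds every stripe_count-th stripe, packed back to back in its object.
    obj_offset = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return ost, obj_offset
```

With a 1MB stripe size over four OSTs, bytes 0–1MB land on the first OST, 1–2MB on the second, and so on; a large sequential transfer therefore streams to or from all four devices at once, which is the source of the aggregate throughput described above.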
2.3 Data Assurance Through Integrated T10 PI Validation
The data presented in this white paper looks at the performance of a single pair of highly available OSS nodes
within the storage cluster. Additionally, the performance data presented is based on an SGI-and-NetApp Lustre
configuration that leverages the T10 PI data assurance protocol in order to deliver extremely high levels of data
validation/assurance. Later sections of this document will provide further information on T10 PI and the value that
it delivers in highly-scalable storage solutions.
2.4 Simple and Standard Client Access to Data
The Lustre storage solution includes client software that enables access to the scale-out Lustre storage solution
using a standard file system interface. This standard presentation allows client applications and tools to instantly
leverage Lustre-based data storage with no additional work or testing being required.
3.0 T10 PI – End-to-End Assurance

T10 Protection Information (T10 PI) is an important standard that reflects the storage and data management
industry’s commitment to end-to-end data integrity validation. By validating data at numerous points within
the I/O flow, T10 PI prevents silent data corruption, ensuring that invalid, incomplete or incorrect data will
never overwrite good data. Without T10 PI, data corruption events may slip through the cracks and result in
numerous negative outcomes that can include system downtime, lost revenue, or lack of compliance with
regulatory standards.
Protection Information (PI) adds an extra eight bytes of information to the 512-byte sectors typical of enterprise
hard drives, increasing the sector size to 520 bytes. These eight bytes of metadata consist of guard (GRD),
application (APP) and reference (REF) tags that are used to verify the 512 bytes of data in the sector.
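The tag layout can be sketched in a few lines of Python. This is a minimal illustration, assuming a Type 1 PI layout in which the guard tag is a CRC-16 over the sector data (the T10-DIF polynomial 0x8BB7) and the reference tag carries the low 32 bits of the logical block address; the function names are illustrative, not part of any SGI, NetApp, or Lustre API:

```python
import struct

T10DIF_POLY = 0x8BB7  # generator polynomial for the CRC-16/T10-DIF guard tag

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: init 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ T10DIF_POLY) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def make_pi(sector: bytes, app_tag: int, lba: int) -> bytes:
    """Build the 8-byte PI tuple: 2-byte GRD, 2-byte APP, 4-byte REF."""
    assert len(sector) == 512
    guard = crc16_t10dif(sector)
    ref = lba & 0xFFFFFFFF  # Type 1: reference tag holds the low 32 bits of the LBA
    return struct.pack(">HHI", guard, app_tag, ref)

def verify_pi(sector: bytes, pi: bytes, lba: int) -> bool:
    """Re-check the guard and reference tags, as each hop in the I/O path does."""
    guard, _app, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)
```

Because the guard tag is recomputed from the data at every checkpoint, a single flipped bit anywhere in the sector, or a sector written to the wrong LBA, fails verification rather than landing silently on disk.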
Complementing PI, DIX (Data Integrity Extensions) is a technology that specifies how I/O controllers can exchange metadata with a host
operating system. The combination of DIX (data integrity between application and I/O controller) and PI (data
integrity between I/O controller and disk drive) delivers end-to-end protection against silent corruption of data in
flight between a sender and a receiver.
SGI Lustre solutions are able to implement end-to-end T10 PI in order to deliver an integrated data protection
capability. With the SGI IS5600i using end-to-end T10 PI, organizations are assured that their data is protected
from the time it leaves the server until the time it is next read. After the 8-byte PI field is set by the HBA during
the data write process, that PI field is rechecked twice by the array as it passes through the controller, and then
verified once more by the disk drive as the data is written to storage media.
During a read operation, the disk drive re-verifies the PI data before returning it to the controller, which
performs two additional checks on the way to final verification.
SGI understands the importance of data – and the integrity of that data – within high-performance computing
(HPC) environments, and has therefore focused on the implementation, validation and promotion of Protection
Information (PI) technology to provide customers with end-to-end data confidence.
4.0 A Building Block Approach

While the deployment of Lustre solutions involves a variety of solution components and servers, predictable
high-performance results can be achieved by leveraging configurations that have been pre-validated,
documented and benchmarked.
This document presents configuration details and associated performance results based on extensive SGI and
NetApp configuration validation work that may be leveraged by customers to deploy solutions with excellent
performance and the highest levels of data assurance based on the integrated T10 PI features that are built-in
to the solution.
For this document, SGI is introducing the concept of a Scalable Storage Unit (SSU), which comprises two Lustre
OSS nodes connected to an SGI IS5600i storage array (based on technology from NetApp). The purpose of this
SSU-based approach is to create a Lustre scale-out ‘building block’ that can be replicated as needed to scale
throughput and capacity.
The overall test configuration and dual-OSS SSU is shown in the following diagram.
| Attribute | Lustre MDS Server (MDS01) | Lustre OSS Servers (OSS 1-2) | Lustre Clients |
| --- | --- | --- | --- |
| SGI Platform | SGI® CH-C1104-GP2 “Highland” Server | SGI® CH-C1104-GP2 “Highland” Server | SGI® ICE™ X Cluster |
| Processor Type | Intel® Xeon® E5-2690 v3, 2.60GHz, 30MB cache | Intel® Xeon® E5-2690 v3, 2.60GHz, 30MB cache | Intel® Xeon® E5-2690 v3, 2.60GHz, 30MB cache |
| Number of Nodes | 1 | 2 | 64 I/O benchmark Lustre clients |
| Total Cores per Node | 24 | 24 | 24 |
| Memory & Memory Speed | 128 GB, 2133MHz | 128 GB, 2133MHz | 128 GB, 2133MHz |
| Local Storage | 1x SATA 1TB 7.2K RPM 3Gb/s drive | 1x SATA 1TB 7.2K RPM 3Gb/s drive | Diskless blades |
| Network Interconnect | IB FDR 4x, 56Gb/s bandwidth, latency < 1µs | IB FDR 4x, 56Gb/s bandwidth, latency < 1µs | IB FDR 4x, 56Gb/s bandwidth, latency < 1µs |
| OS | RHEL v6.5, Mellanox OFED v2.3 | RHEL v6.5, Mellanox OFED v2.3 | SLES11 SP3, Mellanox OFED v2.3 |
| Lustre Software | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) |
| SGI Storage Platform | SGI® IS5600™ (16G FC interface) | SGI® IS5600i™ w/ 6Gb SAS, T10 PI Data Assurance enabled | – |
| Storage Enclosure | 24-bay enclosure (only 12 drives used) | 1x 60-bay storage controller + 1x 60-bay expansion | – |
| Drive Details | 4x 200GB 6Gb/s SAS enterprise SSD | 120x 6TB 7.2K RPM 6Gb/s NL-SAS | – |
| RAID Protection | RAID10, write cache mirroring enabled | RAID6 (8+2), 128K segment size, WCM and DA enabled | – |
5.0 Benchmark Process & Results

This report summarizes the results of the IOR I/O benchmarks. Included in this report are the details of the
benchmark environment, commands, and the results achieved while performing the I/O benchmarks on an SGI
IS5600i Storage Array with two OSS servers based on Intel Enterprise Edition for Lustre software.
IOR is an industry standard I/O benchmark used for benchmarking parallel file systems. The IOR
application characteristics are 96% of the runtime in I/O, 1% in CPU & Memory Bandwidth, and 3% in MPI
communications. The I/O performance is determined by the performance of the proposed storage and
interconnects rather than processor speed or memory bandwidth of the Lustre client.
To capture end-to-end data protection using T10 PI, the SGI Lustre OSS servers had two Emulex LightPulse
16Gb Fibre Channel (T10 PI) HBAs installed and the IS5600i Storage Array was configured with Data Assurance
enabled to prevent silent data corruption. The Emulex BlockGuard™ Data Integrity (offload) feature was enabled
via the lpfc kernel module configuration. All testing completed successfully, and the results show that enabling
the T10 PI assurance elements introduced no performance impact.
5.1 IOR POSIX Buffered Sequential I/O Results
Figure 1 shows the throughput results of a scaling benchmark from a single Lustre client up to 64 Lustre
clients with 24 I/O threads per node. The aggregate file size (block size) was 192GB/client, which represents
1.5x the Lustre client physical memory to mitigate the influence of buffer cache.
Figure 1: IOR Buffered Sequential I/O Results
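The 192GB/client figure follows directly from the client memory. A short sketch of the arithmetic, using the 128GB nodes and 24 I/O threads from the configuration table:

```python
client_ram_gb = 128                                       # per-client memory (config table)
file_per_client_gb = client_ram_gb * 1.5                  # 192 GB: large enough to defeat buffer cache
threads_per_client = 24
per_thread_gb = file_per_client_gb / threads_per_client   # 8 GB written/read by each I/O thread
max_clients = 64
aggregate_gb = file_per_client_gb * max_clients           # 12,288 GB of data at full scale
```

Sizing the per-client file to 1.5x physical memory guarantees that at least a third of every client's data cannot be served from RAM, so the measured throughput reflects the storage system rather than the page cache.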
5.2 IOR POSIX Buffered Random I/O Results
Figure 2 shows the Buffered Random I/O throughput results. As discussed previously, the aggregate file size
(block size) was 192GB/client, which represents 1.5x the Lustre client physical memory to mitigate the influence
of buffer cache.
Figure 2: IOR Buffered Random I/O Results
5.3 IOR POSIX Direct I/O Sequential Results
Figure 3 shows the throughput results of Direct I/O using sequential file access. Scaling is from a single
Lustre client to 64 Lustre clients. With the direct I/O benchmarks, the aggregate file size was reduced to
96GB/client since direct I/O requests bypass the Linux kernel buffer cache.
Figure 3: IOR Direct I/O Sequential Results
5.4 IOR POSIX DIO Random Results
Figure 4 shows the throughput results of Direct I/O using random file access. Scaling is from a single
Lustre client to 64 Lustre clients. With the direct I/O benchmarks, the aggregate file size was reduced to
96GB/client since direct I/O requests bypass the Linux kernel buffer cache.
Figure 4: IOR Direct I/O Random Results
6.0 Conclusion
Based on the IOR benchmarks performed to characterize I/O performance, SGI concludes that while Lustre
performance is workload dependent, Lustre is an excellent parallel file system for supporting light to heavy
I/O application workloads. For data protection, SGI uses industry-standard T10 PI data assurance technology
to provide end-to-end data integrity within the SGI Lustre storage solution based on Intel Enterprise Edition for
Lustre software.
A dual-OSS configuration combined with an SGI IS5600i storage array with 120 drives supports up to 6GB/sec;
SGI defines this storage building block as a Scalable Storage Unit (SSU). A configuration of two SSUs increases
throughput to 12GB/sec, and throughput above 100GB/sec sequential and 60/50GB/sec random write/read,
respectively, can be achieved through the straightforward addition of Scalable Storage Units as defined in
this white paper.
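The scaling rule above can be sketched as a small sizing helper. The 6GB/sec-per-SSU figure and the 120x 6TB drive configuration come from this paper; the 0.8 usable-capacity factor is an assumption derived from the RAID6 (8+2) layout (8 data drives out of every 10), and the function itself is illustrative rather than an SGI sizing tool:

```python
import math

def ssus_required(target_gbps: float, target_usable_tb: float,
                  ssu_gbps: float = 6.0,         # measured throughput per SSU
                  drives_per_ssu: int = 120,     # IS5600i drive count per SSU
                  drive_tb: float = 6.0,         # 6TB NL-SAS drives
                  raid_efficiency: float = 0.8   # assumed: RAID6 8+2 keeps 8 of 10 drives for data
                  ) -> int:
    """Return the number of SSU building blocks needed to meet both targets."""
    usable_tb_per_ssu = drives_per_ssu * drive_tb * raid_efficiency  # 576 TB per SSU
    by_throughput = math.ceil(target_gbps / ssu_gbps)
    by_capacity = math.ceil(target_usable_tb / usable_tb_per_ssu)
    return max(by_throughput, by_capacity, 1)
```

Because throughput and capacity both scale linearly with SSU count, sizing a deployment reduces to taking the larger of the two requirements, which is exactly the building-block property the SSU concept is designed to deliver.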