W H I T E P A P E R
High-Performance Lustre with Maximum Data Assurance
Silicon Graphics International Corp.
900 North McCarthy Blvd.
Milpitas, CA 95035
Disclaimer and Copyright Notice
The information presented here is meant to be general discussion material only. SGI does not represent or warrant that its products, solutions, or services as set forth
in this document will ensure that the reader is in compliance with any laws or regulations.
©2015 Silicon Graphics International All rights reserved.
T A B L E O F C O N T E N T S
1.0 Introduction
2.0 The Lustre File System
2.1 Metadata Management
2.2 Scale-Out Object Storage
2.3 Data Assurance Through Integrated T10 PI Validation
2.4 Simple and Standard Client Access to Data
3.0 T10 PI End-to-End Assurance
4.0 A Building Block Approach
5.0 Benchmark Process and Results
5.1 IOR POSIX Buffered Sequential I/O Results
5.2 IOR POSIX Buffered Random I/O Results
5.3 IOR POSIX Direct I/O Sequential Results
5.4 IOR POSIX DIO Random Results
6.0 Conclusion
1.0 Introduction

In High-Performance Computing (HPC), there is a strong correlation between the compute power of the
solution and the ability of the underlying data storage system to deliver the needed data for processing. As
processor power increases, the goal of system architects is to design systems with an appropriate balance of
data storage, data movement and data computing power – and to do so in a manner that optimizes the overall
processing output of the system at a given price point.
Lustre storage solutions based on an optimized combination of SGI servers and NetApp storage arrays provide
an excellent storage foundation that can be leveraged by HPC researchers, universities, and enterprises that
need to deploy a high-throughput, scale-out, commercially-supported and cost-effective parallel file system
storage solution. These SGI-delivered storage solutions use Intel Enterprise Edition for Lustre® software, a
commercially hardened and supported version of Lustre, the leading open source parallel file system for HPC.
Additionally, by leveraging industry-leading data assurance protocols – such as T10 PI – the SGI-NetApp
Lustre storage solutions are able to deliver the highest levels of data assurance and protection throughout the
end-to-end data path as storage volumes grow and the potential for undetected bit errors increases. The result
is a scale-out HPC storage solution capable of providing reliability and performance – and that is based on an
architecture that allows for the easy future scaling of both capacity and performance.
This white paper provides a brief overview of the Lustre File System and configuration information on a scale-out
SGI Lustre solution architecture that leverages NetApp-based block storage. The solution overview is followed
by performance analysis and conclusions that were obtained through structured benchmark tests.
2.0 The Lustre File System

Lustre is a parallel file system that delivers high performance through a scale-out approach that divides the
workload among numerous scale-out processing nodes. While the processing power of numerous data
storage servers is available, the system presents a traditional file system namespace that can be leveraged by
hundreds – or thousands – of compute nodes using traditional file-based data access methods.
A Lustre installation is made up of three key elements: the metadata management system, the object storage
subsystem which takes care of actual file/data storage, and the compute nodes from which the data/file access
is performed.
2.1 Metadata Management
The metadata management system is made up of a Metadata Target (MDT) and a corresponding Metadata
Server (MDS). The MDT stores the actual metadata for the file system that includes elements like file names, file
time stamps, access permissions, and information regarding the actual storage location of data objects for any
given file within the object storage system. Within Lustre, the MDS is the server that services requests for file
system operations and performs management of the MDT.
More recent versions of Lustre include a scalable metadata capability that allows request loads to be
spread across multiple servers – and in most deployments, the MDS is configured within a high-availability
(HA) environment to ensure ongoing availability of the file system in the event of a server/component failure.
2.2 Scale-Out Object Storage
The object storage system for Lustre is where the “scale out” attribute of the solution occurs. The object storage
system is made up of some number of Object Storage Servers (OSS) which manage the storage and retrieval
of data – and some number of Object Storage Targets (OST) which are the locations on which the actual data is
placed/read by the OSS.
Lustre deployments typically include numerous OSS nodes and multiple OST storage destinations – and this
scale-out attribute of Lustre creates an opportunity for the creation of object storage “building blocks” to be
defined such that additional capacity and/or throughput may be added to the system through the addition of
incremental building block system elements.
In general, administrators will increase the number of OSS nodes in order to increase the data transfer bandwidth
that the storage system can deliver over the network. OST storage will be configured to meet both the capacity
requirements of the overall system and the data throughput/performance requirements of the OSS nodes.
Within scale-out file systems (often referred to as ‘parallel file systems’) like Lustre, high-performance is achieved
by having the system ‘stripe’ data across multiple storage locations (OSTs) such that file read/write operations are
able to benefit from the ability to leverage the throughput of many storage devices in parallel. The result is a system
that can deliver throughput at levels that far exceed the capabilities of any single device or node.
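Striping can be made concrete with a small sketch of the offset-to-object mapping. The round-robin layout shown here is the default RAID-0-style Lustre pattern; the function name and OST identifiers are illustrative, not part of any Lustre API:

```python
def locate_stripe(offset: int, stripe_size: int, stripe_count: int, ost_ids: list):
    """Map a file byte offset to (OST, offset within that OST's object)."""
    stripe_index = offset // stripe_size          # which stripe of the file this byte falls in
    ost = ost_ids[stripe_index % stripe_count]    # stripes are assigned round-robin across OSTs
    # Each OST holds every stripe_count-th stripe, packed back to back in its object.
    obj_offset = (stripe_index // stripe_count) * stripe_size + offset % stripe_size
    return ost, obj_offset
```

With a 1MB stripe size over four OSTs, bytes 0–1MB land on the first OST, 1–2MB on the second, and so on; a large sequential transfer therefore streams to or from all four devices at once, which is the source of the aggregate throughput described above.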
2.3 Data Assurance Through Integrated T10 PI Validation
The data presented in this white paper looks at the performance of a single pair of highly available OSS nodes
within the storage cluster. Additionally, the performance data presented is based on an SGI-and-NetApp Lustre
configuration that leverages the T10 PI data assurance protocol in order to deliver extremely high levels of data
validation/assurance. Later sections of this document will provide further information on T10 PI and the value that
it delivers in highly-scalable storage solutions.
2.4 Simple and Standard Client Access to Data
The Lustre storage solution includes client software that enables access to the scale-out Lustre storage solution
using a standard file system interface. This standard presentation allows client applications and tools to instantly
leverage Lustre-based data storage with no additional work or testing being required.
3.0 T10 PI – End-to-End Assurance

T10 Protection Information (T10 PI) is an important standard that reflects the storage and data management
industry’s commitment to end-to-end data integrity validation. By validating data at numerous points within
the I/O flow, T10 PI prevents silent data corruption, ensuring that invalid, incomplete or incorrect data will
never overwrite good data. Without T10 PI, data corruption events may slip through the cracks and result in
numerous negative outcomes that can include system downtime, lost revenue, or lack of compliance with
regulatory standards.
Protection Information (PI) adds an extra eight bytes of information to the 512-byte sectors typical of enterprise
hard drives, increasing the sector size to 520 bytes. These eight bytes of metadata consist of guard (GRD),
application (APP) and reference (REF) tags that are used to verify the 512 bytes of data in the sector.
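The tag layout can be sketched in a few lines of Python. This is a minimal illustration, assuming a Type 1 PI layout in which the guard tag is a CRC-16 over the sector data (the T10-DIF polynomial 0x8BB7) and the reference tag carries the low 32 bits of the logical block address; the function names are illustrative, not part of any SGI, NetApp, or Lustre API:

```python
import struct

T10DIF_POLY = 0x8BB7  # generator polynomial for the CRC-16/T10-DIF guard tag

def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16/T10-DIF: init 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ T10DIF_POLY) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def make_pi(sector: bytes, app_tag: int, lba: int) -> bytes:
    """Build the 8-byte PI tuple: 2-byte GRD, 2-byte APP, 4-byte REF."""
    assert len(sector) == 512
    guard = crc16_t10dif(sector)
    ref = lba & 0xFFFFFFFF  # Type 1: reference tag holds the low 32 bits of the LBA
    return struct.pack(">HHI", guard, app_tag, ref)

def verify_pi(sector: bytes, pi: bytes, lba: int) -> bool:
    """Re-check the guard and reference tags, as each hop in the I/O path does."""
    guard, _app, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)
```

Because the guard tag is recomputed from the data at every checkpoint, a single flipped bit anywhere in the sector, or a sector written to the wrong LBA, fails verification rather than landing silently on disk.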
Complementing PI, DIX (Data Integrity Extensions) is a technology that specifies how I/O controllers can exchange metadata with a host
operating system. The combination of DIX (data integrity between application and I/O controller) and PI (data
integrity between I/O controller and disk drive) delivers end-to-end protection against silent corruption of data in
flight between a sender and a receiver.
SGI Lustre solutions are able to implement end-to-end T10 PI in order to deliver an integrated data protection
capability. With the SGI IS5600i using end-to-end T10 PI, organizations are assured that their data is protected
from the time it leaves the server until the time it is next read. After the 8-byte PI field is set by the HBA during
the data write process, that PI field is rechecked twice by the array as it passes through the controller, and then
verified once more by the disk drive as the data is written to storage media.
During a read operation, the disk drive re-verifies the PI data before returning it to the controller, which
performs two additional checks on the way to final verification.
SGI understands the importance of data – and the integrity of that data – within high-performance computing
(HPC) environments, and has therefore focused on the implementation, validation and promotion of Protection
Information (PI) technology to provide customers with end-to-end data confidence.
4.0 A Building Block Approach

While the deployment of Lustre solutions involves a variety of solution components and servers, predictable
high-performance results can be achieved by leveraging configurations that have been pre-validated,
documented and benchmarked.
This document presents configuration details and associated performance results based on extensive SGI and
NetApp configuration validation work that may be leveraged by customers to deploy solutions with excellent
performance and the highest levels of data assurance based on the integrated T10 PI features that are built-in
to the solution.
For this document, SGI is introducing the concept of a Scalable Storage Unit (SSU), which comprises two Lustre
OSS nodes connected to an SGI IS5600i storage array (based on technology from NetApp). The purpose of this
SSU-based approach is to create a Lustre scale-out ‘building block’ that can be replicated as needed to scale
throughput and capacity.
The overall test configuration and dual-OSS SSU is shown in the following diagram.
| Attribute | Lustre MDS Server (MDS01) | Lustre OSS Servers (OSS 1-2) | Lustre Clients |
| --- | --- | --- | --- |
| SGI Platform | SGI® CH-C1104-GP2 “Highland” Server | SGI® CH-C1104-GP2 “Highland” Server | SGI® ICE™ X Cluster |
| Processor Type | Intel® Xeon® E5-2690 v3, 2.60GHz, 30MB cache | Intel® Xeon® E5-2690 v3, 2.60GHz, 30MB cache | Intel® Xeon® E5-2690 v3, 2.60GHz, 30MB cache |
| Number of Nodes | 1 | 2 | 64 I/O benchmark Lustre clients |
| Total Cores per Node | 24 | 24 | 24 |
| Memory & Memory Speed | 128 GB, 2133MHz | 128 GB, 2133MHz | 128 GB, 2133MHz |
| Local Storage | 1x SATA 1TB 7.2K RPM 3Gb/s drive | 1x SATA 1TB 7.2K RPM 3Gb/s drive | Diskless blades |
| Network Interconnect | IB FDR 4x, 56Gb/s bandwidth, latency < 1µs | IB FDR 4x, 56Gb/s bandwidth, latency < 1µs | IB FDR 4x, 56Gb/s bandwidth, latency < 1µs |
| OS | RHEL v6.5, Mellanox OFED v2.3 | RHEL v6.5, Mellanox OFED v2.3 | SLES11 SP3, Mellanox OFED v2.3 |
| Lustre Software | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) | Intel Enterprise Edition for Lustre 2.2 (Lustre 2.5.x) |
| SGI Storage Platform | SGI® IS5600™ (16G FC interface) | SGI® IS5600i™ w/ 6Gb SAS, T10 PI Data Assurance enabled | – |
| Storage Enclosure | 24-bay enclosure (only 12 drives used) | 1x 60-bay storage controller + 1x 60-bay expansion | – |
| Drive Details | 4x 200GB 6Gb/s SAS enterprise SSD | 120x 6TB 7.2K RPM 6Gb/s NL-SAS | – |
| RAID Protection | RAID10, write cache mirroring enabled | RAID6 (8+2), 128K segment size, WCM and DA enabled | – |
5.0 Benchmark Process & Results

This report summarizes the results of the IOR I/O benchmarks. Included in this report are the details of the
benchmark environment, commands, and the results achieved while performing the I/O benchmarks on an SGI
IS5600i Storage Array with two OSS servers based on Intel Enterprise Edition for Lustre software.
IOR is an industry standard I/O benchmark used for benchmarking parallel file systems. The IOR
application characteristics are 96% of the runtime in I/O, 1% in CPU & Memory Bandwidth, and 3% in MPI
communications. The I/O performance is determined by the performance of the proposed storage and
interconnects rather than processor speed or memory bandwidth of the Lustre client.
To capture end-to-end data protection using T10 PI, the SGI Lustre OSS servers had two Emulex LightPulse
16Gb Fibre Channel (T10 PI) HBAs installed and the IS5600i Storage Array was configured with Data Assurance
enabled to prevent silent data corruption. The Emulex BlockGuard™ Data Integrity (offload) feature was enabled
via the lpfc kernel module configuration. All testing completed successfully, and the results show that enabling
the T10 PI assurance elements introduced no performance impact.
5.1 IOR POSIX Buffered Sequential I/O Results
Figure 1 shows the throughput results of a scaling benchmark from a single Lustre client up to 64 Lustre
clients with 24 I/O threads per node. The aggregate file size (block size) was 192GB/client, which represents
1.5x the Lustre client physical memory to mitigate the influence of buffer cache.
Figure 1: IOR Buffered Sequential I/O Results
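The 192GB/client figure follows directly from the client memory. A short sketch of the arithmetic, using the 128GB nodes and 24 I/O threads from the configuration table:

```python
client_ram_gb = 128                                       # per-client memory (config table)
file_per_client_gb = client_ram_gb * 1.5                  # 192 GB: large enough to defeat buffer cache
threads_per_client = 24
per_thread_gb = file_per_client_gb / threads_per_client   # 8 GB written/read by each I/O thread
max_clients = 64
aggregate_gb = file_per_client_gb * max_clients           # 12,288 GB of data at full scale
```

Sizing the per-client file to 1.5x physical memory guarantees that at least a third of every client's data cannot be served from RAM, so the measured throughput reflects the storage system rather than the page cache.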
5.2 IOR POSIX Buffered Random I/O Results
Figure 2 shows the Buffered Random I/O throughput results. As discussed previously, the aggregate file size
(block size) was 192GB/client, which represents 1.5x the Lustre client physical memory to mitigate the influence
of buffer cache.
Figure 2: IOR Buffered Random I/O Results
5.3 IOR POSIX Direct I/O Sequential Results
Figure 3 shows the throughput results of Direct I/O using sequential file access. Scaling is from a single
Lustre client to 64 Lustre clients. With the direct I/O benchmarks, the aggregate file size was reduced to
96GB/client since direct I/O requests bypass the Linux kernel buffer cache.
Figure 3: IOR Direct I/O Sequential Results
5.4 IOR POSIX DIO Random Results
Figure 4 shows the throughput results of Direct I/O using random file access. Scaling is from a single
Lustre client to 64 Lustre clients. With the direct I/O benchmarks, the aggregate file size was reduced to
96GB/client since direct I/O requests bypass the Linux kernel buffer cache.
Figure 4: IOR Direct I/O Random Results
6.0 Conclusion
Based on the IOR benchmarks performed to characterize I/O performance, SGI concludes that while Lustre
performance is workload dependent, Lustre is an excellent parallel file system for supporting light to heavy
I/O application workloads. For data protection, SGI uses industry-standard T10 PI data assurance technology
to provide end-to-end data integrity within the SGI Lustre storage solution based on Intel Enterprise Edition for
Lustre software.
A dual-OSS configuration combined with an SGI IS5600i storage array with 120 drives supports up to 6GB/sec;
SGI defines this storage building block as a Scalable Storage Unit (SSU). A configuration of two SSUs increases
throughput to 12GB/sec, and throughput above 100GB/sec sequential and 60/50GB/sec random write/read,
respectively, can be achieved through the straightforward addition of Scalable Storage Units as defined in
this white paper.
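The scaling rule above can be sketched as a small sizing helper. The 6GB/sec-per-SSU figure and the 120x 6TB drive configuration come from this paper; the 0.8 usable-capacity factor is an assumption derived from the RAID6 (8+2) layout (8 data drives out of every 10), and the function itself is illustrative rather than an SGI sizing tool:

```python
import math

def ssus_required(target_gbps: float, target_usable_tb: float,
                  ssu_gbps: float = 6.0,         # measured throughput per SSU
                  drives_per_ssu: int = 120,     # IS5600i drive count per SSU
                  drive_tb: float = 6.0,         # 6TB NL-SAS drives
                  raid_efficiency: float = 0.8   # assumed: RAID6 8+2 keeps 8 of 10 drives for data
                  ) -> int:
    """Return the number of SSU building blocks needed to meet both targets."""
    usable_tb_per_ssu = drives_per_ssu * drive_tb * raid_efficiency  # 576 TB per SSU
    by_throughput = math.ceil(target_gbps / ssu_gbps)
    by_capacity = math.ceil(target_usable_tb / usable_tb_per_ssu)
    return max(by_throughput, by_capacity, 1)
```

Because throughput and capacity both scale linearly with SSU count, sizing a deployment reduces to taking the larger of the two requirements, which is exactly the building-block property the SSU concept is designed to deliver.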