Next Generation HPC Storage Initiative
Torben Kling Petersen, PhD
Lead HPC Solutions Architect
2
About Xyratex (NASDAQ: XRTX)
§ World's largest OEM Disk Storage System provider*
 Ø 19% of worldwide enterprise storage capacity shipped in 2010
 Ø > 3,000 Petabytes of storage shipped in 2010 through partners
  § ~750 Petabytes (25%) was deployed into HPC environments
 Ø ~50% of disk drives worldwide are produced utilizing Xyratex technology
§ $1.6B in revenue in 2010
 Ø Tenfold revenue growth since 2000
§ Over 25 years of data storage R&D experience
§ >2,000 employees
 Ø 150 software developers worldwide
* Source: IDC, December 2010; Coughlin Associates, 2010
3
Lustre Community – Xyratex’s Role & Position
§ LUG 2011 – HPCFS and EOFS align under OpenSFS
 Ø Xyratex is committed to OpenSFS
 Ø Xyratex fully complies with and supports the Lustre open source license and community
§ Continue to build our Lustre architecture and development team
§ XRTX’s proposed Lustre Roadmap Disclosure
§ XRTX contributions & full alignment with Lustre 2.1
4
The ClusterStor Acquisition
§ ClusterStor
 Ø Clustered File System start-up, led by the founder of Lustre, Dr. Peter Braam
 Ø World-class development team with deep Lustre expertise
  § 3 of the top 5 Lustre developers in the world
 Ø Skilled technical staff with over 100 years of combined Lustre experience
§ Xyratex Acquisition
 Ø Dr. Braam and the ClusterStor team have joined Xyratex, where they form the new File System development and support team
  § Attracting other key Lustre talent
  § Peter Bojanic, leading software development and support
 Ø Dr. Braam leads this team as SVP of Storage Software
  § ~150 people worldwide
Dr. Peter Braam – SVP of Storage Software, Xyratex
5
Exascale Challenges
J. Dongarra, "Impact of Architecture and Technology for Extreme Scale on Software and Algorithm Design," Cross-cutting Technologies for Computing at the Exascale, February 2-5, 2010. http://www.ijustweb.com/europar2010.org/_dwn_catalogo/1050-Inglese-dongarra.pdf
6
Top Lustre Systems
§ Tianhe-1A – National Supercomputer Center in Tianjin (China)
 Ø 2.566 petaflops (peak)
 Ø 86,016 Xeon X5670 processing cores (6 cores per CPU) and 7,168 Nvidia Tesla M2050 general-purpose GPUs
 Ø 262 TB of memory
 Ø Chinese-designed "Arch" high-speed interconnect
 Ø 2 PB storage
§ Jaguar – ORNL (US)
 Ø 1.75 petaflops (peak)
 Ø 224,256 AMD Opteron processing cores
 Ø 360 TB of memory
 Ø SeaStar2+ for compute; QDR InfiniBand for IO
 Ø 240 GB/s and 10.5 PB storage
§ Tera 100 – CEA (France)
 Ø 1.25 petaflops (peak)
 Ø 140,000 Intel Xeon 7500 processor cores
 Ø 300 TB of memory
 Ø QDR InfiniBand for compute and IO
 Ø 500 GB/s and total storage of 20 PB
7
Xyratex – Truly Differentiated in the Lustre Market
§ Designs, Develops and Tests the World's Best Purpose-Built Storage Platforms
§ OEM Business Model – Economies of Scale – Number 1 OEM Storage Supplier in the world (IDC)
§ Linux-based Storage Appliance Middleware Development
§ World-Class Clustered File System Development & Support Expertise
§ Storage Cluster Management Framework
Xyratex Lustre® Roadmap Priorities
9
Xyratex Lustre Roadmap priorities
What is driving our choices?
§ Better control of the release schedules
§ Leverage new technologies
§ Keep it simple
§ We are receiving community feedback
Reliability - Usability - Scalability
10
Lustre Scaling Challenges
§ IO Throughput
 Ø Read and write performance of the filesystem
 Ø Coordinating IO from increasing numbers of clients
§ Metadata operations
 Ø Making create, stat, and unlink operations fast in the face of increasing numbers of clients
§ Filesystem Capacity
 Ø Size of the Lustre file system
 Ø Size of the underlying disk file system
§ RAS (Reliability, Availability, Serviceability)
 Ø Knowing what's going on
 Ø Comprehensive data integrity protection
 Ø Recovering rapidly from failures
11
IO Throughput
§ Large Network I/O
 Ø Aggregate multiple IO requests into a single RPC, optimized for the backend storage (e.g. 4 MB I/O RPCs)
§ Utilize flash storage
 Ø Aggregate multiple write operations in battery-backed memory or flash storage before writing to the disk (especially small IOs)
§ Network Request Scheduler (NRS)
 Ø Re-order collective IO into an effective, sequential stream; employ fair client IO scheduling with support for prioritized clients (see the sketch after this list)
§ Wide striping and data placement
 Ø Increase the stripe limit beyond the current 160-target limit; implement a more compact file layout description mechanism leveraging FIDs and bitmaps
 Ø Implement more diverse data layout policies (e.g. fill OSTs sequentially for archival and cloud applications)
§ Wide area optimizations
 Ø Reduce RPC counts to improve Lustre metadata and IO performance on high-latency wide area networks
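For the NRS bullet above, a toy model may help. The Python below is purely illustrative (it is not Lustre source code, and the IORequest/nrs_schedule names are invented here): it sorts each client's queued requests by object and offset so the backend sees near-sequential streams, and drains the queues round-robin so no single client starves the others.

```python
# Illustrative sketch of a Network Request Scheduler policy (not Lustre code).
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class IORequest:
    client: str     # issuing client (hypothetical field names)
    obj: int        # target OST object id
    offset: int     # byte offset within the object
    length: int     # request size in bytes

def nrs_schedule(pending):
    """Dispatch order: each client's backlog sorted by (obj, offset) so the
    disks see near-sequential IO, drained round-robin across clients for
    fairness."""
    per_client = defaultdict(list)
    for req in pending:
        per_client[req.client].append(req)
    queues = deque(
        deque(sorted(reqs, key=lambda r: (r.obj, r.offset)))
        for reqs in per_client.values()
    )
    order = []
    while queues:
        q = queues.popleft()
        order.append(q.popleft())   # one request from this client
        if q:
            queues.append(q)        # client still has work: back of the line
    return order

if __name__ == "__main__":
    pending = [
        IORequest("client-a", obj=7, offset=4 << 20, length=1 << 20),
        IORequest("client-b", obj=3, offset=0, length=1 << 20),
        IORequest("client-a", obj=7, offset=0, length=1 << 20),
    ]
    for r in nrs_schedule(pending):
        print(r.client, r.obj, r.offset)
```

A prioritized-client policy would simply weight how often a given client's queue returns to the front of the rotation.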
12
Metadata Operations
§ Flash cache for I/O metadata
 Ø Utilize fast SSD storage in the drive trays for journals; DRBD-replicated PCI flash storage in the controllers; and RAM-disk devices in battery-backed memory
§ Client concurrency
 Ø Support concurrent metadata operations by eliminating the client semaphore; introduce multi-slot transactional updates to support multiple entries in the server reply
§ Size on MDS (SOM)
 Ø Addresses the dreaded "ls -l" use case (see the sketch after this list)
 Ø Use a new design for SOM that employs fast flash storage to support Lustre's synchronous IO model rather than introduce asynchronous IO as per the 2003 design
§ MDS threading
 Ø Implement a better threading model that enables MDS performance close to raw ext4 performance, which is an order of magnitude faster
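The "ls -l" problem and the SOM remedy can be summarized with a small model. The Python below is a conceptual illustration only; the function names and RPC counting are mine, not Lustre's implementation.

```python
# Why "ls -l" is expensive without Size-on-MDS (SOM), conceptually.

def stat_without_som(layout, ost_rpc):
    """Without SOM: the client must ask every OST the file is striped over
    for its object size and sum the answers -- one RPC per stripe, per file."""
    return sum(ost_rpc(ost, obj) for ost, obj in layout)

def stat_with_som(mds_cached_size):
    """With SOM: the MDS keeps an authoritative, flash-backed size attribute,
    so the single MDS getattr reply already carries the file size."""
    return mds_cached_size

if __name__ == "__main__":
    layout = [(0, 101), (1, 102), (2, 103), (3, 104)]   # file striped over 4 OSTs
    sizes = {k: 1 << 20 for k in layout}
    counter = {"rpcs": 0}

    def ost_rpc(ost, obj):          # stands in for a network round trip
        counter["rpcs"] += 1
        return sizes[(ost, obj)]

    print(stat_without_som(layout, ost_rpc), "bytes after", counter["rpcs"], "OST RPCs")
    print(stat_with_som(sum(sizes.values())), "bytes after 1 MDS RPC")
```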
13
Filesystem Capacity
§ Large volume support for Lustre
 Ø ext4 already supports volumes up to 1 EB but e2fsprogs is limited to 16 TB; qualify ext4 for LUNs up to 128 TB
 Ø Test Lustre in conjunction with ext4 and address any 64-bit cleanliness issues
§ Maximum file counts (see the worked numbers after this list)
 Ø Implement 64-bit inode numbers in ext4 directories and fix associated disk utilities that may expect < 4B inodes in the filesystem
 Ø Use smaller inodes (512 B vs. 4 KB) in association with the "compact file layout description mechanism"
 Ø Replace the current hash-tree directory with a B-tree layout to accommodate 1B files per directory
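As a rough illustration of the "maximum file counts" bullets, the back-of-the-envelope calculation below (my own assumptions: one inode per file on a 128 TB device) shows why both inode sizes blow past the 32-bit inode-number ceiling of roughly 4.29 billion files, making 64-bit inode numbers the real prerequisite.

```python
# Back-of-the-envelope numbers for the "maximum file counts" bullets above.
# The ratios are hypothetical illustrations, not Xyratex sizing guidance.
TB = 1 << 40

device_size = 128 * TB       # a LUN qualified up to 128 TB (from the slide)
inode_large = 4096           # 4 KB inode
inode_small = 512            # compact 512 B inode with a compact layout

files_large = device_size // inode_large
files_small = device_size // inode_small

print(f"4 KB inodes : {files_large:,} files max ({files_large / 2**32:.1f}x the 32-bit limit)")
print(f"512 B inodes: {files_small:,} files max ({files_small / 2**32:.1f}x the 32-bit limit)")
# Either way the count exceeds 2^32 (~4.29 billion), so 32-bit inode numbers
# and tools that assume "< 4B inodes" become the bottleneck.
```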
14
RAS
§ Verify on read
 Ø Implement verification of checksums in MDRAID to detect data integrity issues during read operations
§ End-to-end integrity with T10-DIF (see the sketch after this list)
 Ø Lustre with ZFS is not commercially viable and btrfs isn't mature enough yet
 Ø Augment the current client/server checksum mechanism by leveraging T10 DIF/DIX to guarantee that the buffers and the intended location of the buffers are identical in the drive firmware and the OS
§ Data Replication
 Ø Leverage the Lustre 2 changelog to provide fast backup, migration, and replication services on a live filesystem
§ LNET Channel Bonding
 Ø Linux Ethernet already offers link aggregation capabilities below the LNET layer, but this capability is not available in the InfiniBand driver
 Ø Implement the LNET Channel Bonding design by CFS to provide link aggregation and high availability for InfiniBand and other LNET-supported interconnects
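A minimal sketch of the T10 DIF idea referenced above: each 512-byte sector carries an 8-byte protection field holding a CRC-16 guard tag (polynomial 0x8BB7), an application tag, and a reference tag derived from the LBA, so both corruption and misplaced writes are detectable on read. The Python below only illustrates the checks; real DIF/DIX is enforced by the HBA, drive firmware, and block layer, and the helper names are hypothetical.

```python
# Conceptual sketch of T10-DIF protection information (not driver or Lustre code).

def crc16_t10dif(data: bytes) -> int:
    """CRC-16 with the T10-DIF polynomial 0x8BB7, seed 0 (the guard tag)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def protect(sector: bytes, lba: int, app_tag: int = 0) -> tuple:
    """Build the protection tuple stored alongside a 512-byte sector."""
    assert len(sector) == 512
    return (crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)

def verify_on_read(sector: bytes, lba: int, pi: tuple) -> None:
    """Raise if the data was corrupted or landed at the wrong LBA."""
    guard, _app, ref = pi
    if crc16_t10dif(sector) != guard:
        raise IOError(f"guard tag mismatch at LBA {lba}: data corrupted")
    if (lba & 0xFFFFFFFF) != ref:
        raise IOError("reference tag mismatch: sector written to wrong LBA")

if __name__ == "__main__":
    sector = bytes(512)
    pi = protect(sector, lba=1234)
    verify_on_read(sector, 1234, pi)            # clean read passes
    try:
        verify_on_read(b"\x01" + sector[1:], 1234, pi)
    except IOError as e:
        print("detected:", e)
```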
15
Other Xyratex Lustre Development initiatives
§ MapReduce integration (see the sketch after this list)
 Ø IO performance research and workload optimizations for Lustre in collaboration with the United States Naval Research Laboratory (NRL)
 Ø "Map/Reduce on Lustre: Hadoop Performance in HPC Environments", Nathan Rutman, http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_Map-Reduce_on_Lustre_1-4.pdf
§ Optimized gateway exports for CIFS & NFS clients
 Ø Integrated solution to support highly-available CIFS and NFS exports
§ IPv6
 Ø Of increasing importance in US government procurements
§ HSM
 Ø Integrate the HSM infrastructure designed by CFS in 2006 and implemented by CEA and Oracle; integrate the RobinHood policy engine designed and implemented by CEA
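Because Lustre is a POSIX filesystem, map/reduce jobs can read their input in place rather than staging it into HDFS first; that is the gist of the Map/Reduce-on-Lustre work above. The toy Python word count below illustrates the idea only; the /mnt/lustre path is a placeholder, and an actual deployment would configure Hadoop itself against the Lustre mount as the white paper describes.

```python
# Toy map/reduce word count reading input straight from a POSIX mount such as
# a Lustre filesystem -- no HDFS copy-in step. Not Hadoop code; the path below
# is a placeholder.
import glob
from collections import Counter
from multiprocessing import Pool

LUSTRE_INPUT = "/mnt/lustre/dataset/*.txt"   # hypothetical path on a Lustre mount

def map_phase(path: str) -> Counter:
    """Map task: each worker reads its shard directly from the shared filesystem."""
    with open(path) as f:
        return Counter(f.read().split())

def reduce_phase(partials) -> Counter:
    """Reduce: merge the per-shard counts."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    shards = glob.glob(LUSTRE_INPUT)
    with Pool() as pool:
        totals = reduce_phase(pool.map(map_phase, shards))
    for word, n in totals.most_common(10):
        print(word, n)
```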
16
Summary
§ Lustre is the dominant HPC filesystem in the world today
§ HPC compute systems are rapidly evolving to multi-petascale and exascale over the next decade
§ The corresponding demands on the filesystem present new challenges for Lustre
§ Xyratex, in collaboration with our OEM customers and the world's leading Lustre users, has charted a modern, pragmatic roadmap for Lustre
§ The global community of Lustre users is actively collaborating on advancing the world's open source parallel file system
§ For more information, please consult our Xyratex Lustre Architecture Priorities Overview http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_Lustre_Architecture_Priorities_Overview_1-0.pdf
17
Xyratex ClusterStor CS-3000 – Breaking new ground
§ Up to 1.5 PB/rack
 Ø Up to 672 disks per rack
 Ø Supports 1, 2 or 3 TB SAS disks
§ More than 20 GB/s throughput per rack
§ Supports full QDR IB or 10 GbE fabrics
§ All active components redundant and hot-swappable
§ Engineered, balanced solution for extreme density and performance
§ Dedicated management utility
§ Lustre 2.0/2.1 based solution
 Ø Active development of new features and fixes