Next Generation HPC Storage Initiative
Torben Kling Petersen, PhD
Lead HPC Solutions Architect
2
About Xyratex (NASDAQ: XRTX)
§ World's largest OEM Disk Storage System provider*
 Ø 19% of worldwide enterprise storage capacity shipped in 2010
 Ø > 3,000 Petabytes of storage shipped in 2010 through partners
  § ~750 Petabytes (25%) was deployed into HPC environments
 Ø ~50% of disk drives worldwide are produced utilizing Xyratex technology
§ $1.6B in revenue in 2010
 Ø Tenfold revenue growth since 2000
§ Over 25 years of data storage R&D experience
§ >2,000 employees
 Ø 150 software developers worldwide
* Source: IDC, December 2010; Coughlin Associates, 2010
3
Lustre Community – Xyratex’s Role & Position
§ LUG 2011 – HPCFS and EOFS align under OpenSFS
 Ø Xyratex is committed to OpenSFS
 Ø Xyratex fully complies with and supports the Lustre open source license and community
§ Continue to build our Lustre architecture and development team
§ XRTX’s proposed Lustre Roadmap Disclosure
§ XRTX contributions & full alignment with Lustre 2.1
4
The ClusterStor Acquisition
§ ClusterStor
 Ø Clustered File System start-up, led by the founder of Lustre, Dr. Peter Braam
 Ø World-class development team with deep Lustre expertise
  § 3 of the top 5 Lustre developers in the world
 Ø Skilled technical staff with over 100 years of combined Lustre experience
§ Xyratex Acquisition
 Ø Dr. Braam and the ClusterStor team have joined Xyratex, where they form the new File System development and support team
  § Attracting other key Lustre talent
  § Peter Bojanic, leading software development and support
 Ø Dr. Braam leads this team as SVP of Storage Software
  § ~150 people worldwide
Dr. Peter Braam – SVP of Storage Software, Xyratex
5
Exascale Challenges
J. Dongarra, "Impact of Architecture and Technology for Extreme Scale on Software and Algorithm Design," Cross-cutting Technologies for Computing at the Exascale, February 2-5, 2010. http://www.ijustweb.com/europar2010.org/_dwn_catalogo/1050-Inglese-dongarra.pdf
6
Top Lustre Systems
§ Tianhe-1A – National Supercomputer Center in Tianjin (China)
 Ø 2.566 petaflops (peak)
 Ø 86,016 Xeon X5670 processing cores (6 cores per CPU) and 7,168 Nvidia Tesla M2050 general-purpose GPUs
 Ø 262 TB of memory
 Ø Chinese-designed "Arch" high-speed interconnect
 Ø 2 PB storage
§ Jaguar – ORNL (US)
 Ø 1.75 petaflops (peak)
 Ø 224,256 AMD Opteron processing cores
 Ø 360 TB of memory
 Ø SeaStar2+ for compute; QDR InfiniBand for IO
 Ø 240 GB/s and 10.5 PB storage
§ Tera 100 – CEA (France)
 Ø 1.25 petaflops (peak)
 Ø 140,000 Intel Xeon 7500 processor cores
 Ø 300 TB of memory
 Ø QDR InfiniBand for compute and IO
 Ø 500 GB/s and total storage of 20 PB
7
Xyratex – Truly Differentiated in the Lustre Market
§ Designs, Develops and Tests the World's Best Purpose-Built Storage Platforms
§ OEM Business Model – Economies of Scale – Number 1 OEM Storage Supplier in the world (IDC)
§ Linux-based Storage Appliance Middleware Development
§ World-Class Clustered File System Development & Support Expertise
§ Storage Cluster Management Framework
Xyratex Lustre® Roadmap Priorities
9
Xyratex Lustre Roadmap priorities
What is driving our choices?
§ Better control of the release schedules
§ Leverage new technologies
§ Keep it simple
§ We are receiving community feedback
Reliability - Usability - Scalability
10
Lustre Scaling Challenges
§ IO Throughput
 Ø Read and write performance of the filesystem
 Ø Coordinating IO from increasing numbers of clients
§ Metadata operations
 Ø Making create, stat, and unlink operations fast in the face of increasing numbers of clients
§ Filesystem Capacity
 Ø Size of the Lustre file system
 Ø Size of the underlying disk file system
§ RAS (Reliability, Availability, Serviceability)
 Ø Knowing what's going on
 Ø Comprehensive data integrity protection
 Ø Recovering rapidly from failures
11
IO Throughput
§ Large Network I/O
 Ø Aggregate multiple IO requests into a single RPC, optimized for the backend storage (e.g. 4 MB I/O RPCs)
§ Utilize flash storage
 Ø Aggregate multiple write operations in battery-backed memory or flash storage before writing to the disk (especially small IOs)
§ Network Request Scheduler (NRS)
 Ø Re-order collective IO into an effective, sequential stream; employ fair client IO scheduling with support for prioritized clients (see the sketch after this list)
§ Wide striping and data placement
 Ø Increase the stripe limit beyond the current 160-target limit; implement a more compact file layout description mechanism leveraging FIDs and bitmaps
 Ø Implement more diverse data layout policies (e.g. fill OSTs sequentially for archival and cloud applications)
§ Wide area optimizations
 Ø Reduce RPC counts to improve Lustre metadata and IO performance on high-latency wide area networks
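For the NRS bullet above, a toy model may help. The Python below is purely illustrative (it is not Lustre source code, and the IORequest/nrs_schedule names are invented here): it sorts each client's queued requests by object and offset so the backend sees near-sequential streams, and drains the queues round-robin so no single client starves the others.

```python
# Illustrative sketch of a Network Request Scheduler policy (not Lustre code).
from collections import defaultdict, deque
from dataclasses import dataclass

@dataclass
class IORequest:
    client: str     # issuing client (hypothetical field names)
    obj: int        # target OST object id
    offset: int     # byte offset within the object
    length: int     # request size in bytes

def nrs_schedule(pending):
    """Dispatch order: each client's backlog sorted by (obj, offset) so the
    disks see near-sequential IO, drained round-robin across clients for
    fairness."""
    per_client = defaultdict(list)
    for req in pending:
        per_client[req.client].append(req)
    queues = deque(
        deque(sorted(reqs, key=lambda r: (r.obj, r.offset)))
        for reqs in per_client.values()
    )
    order = []
    while queues:
        q = queues.popleft()
        order.append(q.popleft())   # one request from this client
        if q:
            queues.append(q)        # client still has work: back of the line
    return order

if __name__ == "__main__":
    pending = [
        IORequest("client-a", obj=7, offset=4 << 20, length=1 << 20),
        IORequest("client-b", obj=3, offset=0, length=1 << 20),
        IORequest("client-a", obj=7, offset=0, length=1 << 20),
    ]
    for r in nrs_schedule(pending):
        print(r.client, r.obj, r.offset)
```

A prioritized-client policy would simply weight how often a given client's queue returns to the front of the rotation.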
12
Metadata Operations
§ Flash cache for I/O metadata
 Ø Utilize fast SSD storage in the drive trays for journals; DRBD-replicated PCI flash storage in the controllers; and RAM-disk devices in battery-backed memory
§ Client concurrency
 Ø Support concurrent metadata operations by eliminating the client semaphore; introduce multi-slot transactional updates to support multiple entries in the server reply
§ Size on MDS (SOM)
 Ø Addresses the dreaded "ls -l" use case (see the sketch after this list)
 Ø Use a new design for SOM that employs fast flash storage to support Lustre's synchronous IO model rather than introduce asynchronous IO as per the 2003 design
§ MDS threading
 Ø Implement a better threading model that enables MDS performance close to raw ext4 performance, which is an order of magnitude faster
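The "ls -l" problem and the SOM remedy can be summarized with a small model. The Python below is a conceptual illustration only; the function names and RPC counting are mine, not Lustre's implementation.

```python
# Why "ls -l" is expensive without Size-on-MDS (SOM), conceptually.

def stat_without_som(layout, ost_rpc):
    """Without SOM: the client must ask every OST the file is striped over
    for its object size and sum the answers -- one RPC per stripe, per file."""
    return sum(ost_rpc(ost, obj) for ost, obj in layout)

def stat_with_som(mds_cached_size):
    """With SOM: the MDS keeps an authoritative, flash-backed size attribute,
    so the single MDS getattr reply already carries the file size."""
    return mds_cached_size

if __name__ == "__main__":
    layout = [(0, 101), (1, 102), (2, 103), (3, 104)]   # file striped over 4 OSTs
    sizes = {k: 1 << 20 for k in layout}
    counter = {"rpcs": 0}

    def ost_rpc(ost, obj):          # stands in for a network round trip
        counter["rpcs"] += 1
        return sizes[(ost, obj)]

    print(stat_without_som(layout, ost_rpc), "bytes after", counter["rpcs"], "OST RPCs")
    print(stat_with_som(sum(sizes.values())), "bytes after 1 MDS RPC")
```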
13
Filesystem Capacity
§ Large volume support for Lustre
 Ø ext4 already supports volumes up to 1 EB but e2fsprogs is limited to 16 TB; qualify ext4 for LUNs up to 128 TB
 Ø Test Lustre in conjunction with ext4 and address any 64-bit cleanliness issues
§ Maximum file counts (see the worked numbers after this list)
 Ø Implement 64-bit inode numbers in ext4 directories and fix associated disk utilities that may expect < 4B inodes in the filesystem
 Ø Use smaller inodes (512 B vs. 4 KB) in association with the "compact file layout description mechanism"
 Ø Replace the current hash-tree directory with a B-tree layout to accommodate 1B files per directory
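As a rough illustration of the "maximum file counts" bullets, the back-of-the-envelope calculation below (my own assumptions: one inode per file on a 128 TB device) shows why both inode sizes blow past the 32-bit inode-number ceiling of roughly 4.29 billion files, making 64-bit inode numbers the real prerequisite.

```python
# Back-of-the-envelope numbers for the "maximum file counts" bullets above.
# The ratios are hypothetical illustrations, not Xyratex sizing guidance.
TB = 1 << 40

device_size = 128 * TB       # a LUN qualified up to 128 TB (from the slide)
inode_large = 4096           # 4 KB inode
inode_small = 512            # compact 512 B inode with a compact layout

files_large = device_size // inode_large
files_small = device_size // inode_small

print(f"4 KB inodes : {files_large:,} files max ({files_large / 2**32:.1f}x the 32-bit limit)")
print(f"512 B inodes: {files_small:,} files max ({files_small / 2**32:.1f}x the 32-bit limit)")
# Either way the count exceeds 2^32 (~4.29 billion), so 32-bit inode numbers
# and tools that assume "< 4B inodes" become the bottleneck.
```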
14
RAS
§ Verify on read
 Ø Implement verification of checksums in MDRAID to detect data integrity issues during read operations
§ End-to-end integrity with T10-DIF (see the sketch after this list)
 Ø Lustre with ZFS is not commercially viable and btrfs isn't mature enough yet
 Ø Augment the current client/server checksum mechanism by leveraging T10 DIF/DIX to guarantee that the buffers and the intended location of the buffers are identical in the drive firmware and the OS
§ Data Replication
 Ø Leverage the Lustre 2 changelog to provide fast backup, migration, and replication services on a live filesystem
§ LNET Channel Bonding
 Ø Linux Ethernet already offers link aggregation capabilities below the LNET layer, but this capability is not available in the InfiniBand driver
 Ø Implement the LNET Channel Bonding design by CFS to provide link aggregation and high availability for InfiniBand and other LNET-supported interconnects
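A minimal sketch of the T10 DIF idea referenced above: each 512-byte sector carries an 8-byte protection field holding a CRC-16 guard tag (polynomial 0x8BB7), an application tag, and a reference tag derived from the LBA, so both corruption and misplaced writes are detectable on read. The Python below only illustrates the checks; real DIF/DIX is enforced by the HBA, drive firmware, and block layer, and the helper names are hypothetical.

```python
# Conceptual sketch of T10-DIF protection information (not driver or Lustre code).

def crc16_t10dif(data: bytes) -> int:
    """CRC-16 with the T10-DIF polynomial 0x8BB7, seed 0 (the guard tag)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def protect(sector: bytes, lba: int, app_tag: int = 0) -> tuple:
    """Build the protection tuple stored alongside a 512-byte sector."""
    assert len(sector) == 512
    return (crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)

def verify_on_read(sector: bytes, lba: int, pi: tuple) -> None:
    """Raise if the data was corrupted or landed at the wrong LBA."""
    guard, _app, ref = pi
    if crc16_t10dif(sector) != guard:
        raise IOError(f"guard tag mismatch at LBA {lba}: data corrupted")
    if (lba & 0xFFFFFFFF) != ref:
        raise IOError("reference tag mismatch: sector written to wrong LBA")

if __name__ == "__main__":
    sector = bytes(512)
    pi = protect(sector, lba=1234)
    verify_on_read(sector, 1234, pi)            # clean read passes
    try:
        verify_on_read(b"\x01" + sector[1:], 1234, pi)
    except IOError as e:
        print("detected:", e)
```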
15
Other Xyratex Lustre Development initiatives
§ MapReduce integration (see the sketch after this list)
 Ø IO performance research and workload optimizations for Lustre in collaboration with the United States Naval Research Laboratory (NRL)
 Ø "Map/Reduce on Lustre: Hadoop Performance in HPC Environments", Nathan Rutman, http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_Map-Reduce_on_Lustre_1-4.pdf
§ Optimized gateway exports for CIFS & NFS clients
 Ø Integrated solution to support highly-available CIFS and NFS exports
§ IPv6
 Ø Of increasing importance in US government procurements
§ HSM
 Ø Integrate the HSM infrastructure designed by CFS in 2006 and implemented by CEA and Oracle; integrate the RobinHood policy engine designed and implemented by CEA
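Because Lustre is a POSIX filesystem, map/reduce jobs can read their input in place rather than staging it into HDFS first; that is the gist of the Map/Reduce-on-Lustre work above. The toy Python word count below illustrates the idea only; the /mnt/lustre path is a placeholder, and an actual deployment would configure Hadoop itself against the Lustre mount as the white paper describes.

```python
# Toy map/reduce word count reading input straight from a POSIX mount such as
# a Lustre filesystem -- no HDFS copy-in step. Not Hadoop code; the path below
# is a placeholder.
import glob
from collections import Counter
from multiprocessing import Pool

LUSTRE_INPUT = "/mnt/lustre/dataset/*.txt"   # hypothetical path on a Lustre mount

def map_phase(path: str) -> Counter:
    """Map task: each worker reads its shard directly from the shared filesystem."""
    with open(path) as f:
        return Counter(f.read().split())

def reduce_phase(partials) -> Counter:
    """Reduce: merge the per-shard counts."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    shards = glob.glob(LUSTRE_INPUT)
    with Pool() as pool:
        totals = reduce_phase(pool.map(map_phase, shards))
    for word, n in totals.most_common(10):
        print(word, n)
```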
16
Summary
§ Lustre is the dominant HPC filesystem in the world today
§ HPC compute systems are rapidly evolving to multi-petascale and exascale over the next decade
§ The corresponding demands on the filesystem present new challenges for Lustre
§ Xyratex, in collaboration with our OEM customers and the world's leading Lustre users, has charted a modern, pragmatic roadmap for Lustre
§ The global community of Lustre users is actively collaborating on advancing the world's open source parallel file system
§ For more information, please consult our Xyratex Lustre Architecture Priorities Overview http://www.xyratex.com/pdfs/whitepapers/Xyratex_white_paper_Lustre_Architecture_Priorities_Overview_1-0.pdf
17
Xyratex ClusterStor CS-3000 – Breaking new ground
§ Up to 1.5 PB/rack
 Ø Up to 672 disks per rack
 Ø Supports 1, 2 or 3 TB SAS disks
§ More than 20 GB/s throughput per rack
§ Supports full QDR IB or 10 GbE fabrics
§ All active components redundant and hot-swappable
§ Engineered, balanced solution for extreme density and performance
§ Dedicated management utility
§ Lustre 2.0/2.1 based solution
 Ø Active development of new features and fixes