The feature development and structure evolvement of Lustre ...
Li Xi, Principal [email protected]
The Feature Development and Architecture Evolvement of Lustre under New Challenges
whamcloud.com
What is Lustre?
► Lustre is Software Defined Storage
► Provides a distributed, parallel, and scalable storage cluster
• Attached directly to compute nodes or site-wide filesystem
• Client access via network (Ethernet, OPA, InfiniBand)
• 1 EB+ filesystem limit, 32 PB single file limit (1 EB for ZFS)
• Production file systems exceed 2 TB/s, 50 PB in size
► Maximum Performance at Massive Scale
► Open-Source (GPLv2) and POSIX compliant
► Extremely Efficient Use of Hardware Resources
Diagram: CPU, DDR, Local Storage, I/O Node, Lustre Parallel File System, Solid State Drives / Hard Disk Drives
History of Lustre
Illustrates the robustness of open source technology in the face of organizational changes
Timeline years: 1999, 2003, 2007, 2009, 2010, 2011, 2012, 2015, 2017, 2018, 2019
Releases: 1.0, 1.6, 1.8, 2.0, 2.1, 2.5, 2.7, 2.10, 2.11, 2.12
Lustre Performance and Capacity Growth
• IO Perf: ~1.36x per year
• Capacity: ~1.38x per year
• Network Speed: ~1.32x per year
• HDD Capacity: ~1.32x per year
Source: Rock Hard Lustre, Nathan Rutman, Cray (with updates for recent years); Disk Drive Prices (1955-2019), John C. McCallum
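As a rough illustration of what the per-year factors on this slide imply, the sketch below compounds them over a fixed horizon and derives a doubling time. Only the growth factors come from the slide; the 10-year horizon and the function names are assumptions chosen for illustration.

```python
import math

# Annual growth factors quoted on the slide.
GROWTH_PER_YEAR = {
    "Lustre I/O performance": 1.36,
    "Lustre capacity": 1.38,
    "Network speed": 1.32,
    "HDD capacity": 1.32,
}


def compound(factor: float, years: float) -> float:
    """Total growth after `years` at a constant annual factor."""
    return factor ** years


def doubling_time(factor: float) -> float:
    """Years needed to double at a constant annual factor (must be > 1)."""
    return math.log(2) / math.log(factor)


if __name__ == "__main__":
    horizon = 10  # assumed horizon in years, for illustration only
    for name, factor in GROWTH_PER_YEAR.items():
        print(
            f"{name}: x{compound(factor, horizon):.1f} in {horizon} years, "
            f"doubles every {doubling_time(factor):.1f} years"
        )
```

At 1.36x per year, for example, I/O performance doubles roughly every 2.3 years and grows about 20-fold over a decade.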
Classical Storage Architecture in HPC
Compute Nodes (NVRAM)
Compute Network
I/O Forwarding Nodes (NVRAM, SSD)
Site-wide Storage Network
Parallel File System (NVRAM, SSD, Disk)
Classical Storage Architecture in HPC
► Compute Node NVRAM
• High velocity for hot data
• Network bandwidth: O(1PB/s) -> O(10PB/s)
• Extremely low network & NVRAM latency
► I/O node NVRAM/SSD
• Semi-hot data or staging buffer
• Network bandwidth: O(10TB/s) -> O(100TB/s)
► Parallel file system with NVRAM/SSD/Disk
• Site-wide shared warm storage
• SAN limited: O(1TB/s) -> O(10TB/s)
► Move storage closer to compute!
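To make the bandwidth gap between these tiers concrete, the following sketch estimates how long an idealized transfer would take at each tier. The bandwidths are the order-of-magnitude lower bounds from the slide; the 1 PB dataset size is a hypothetical figure, and the model ignores latency and contention.

```python
PB = 10**15  # bytes, decimal units
TB = 10**12

# Order-of-magnitude aggregate bandwidths from the slide (lower bounds), in bytes/s.
TIER_BANDWIDTH = {
    "compute-node NVRAM": 1 * PB,
    "I/O-node NVRAM/SSD": 10 * TB,
    "parallel file system": 1 * TB,
}


def transfer_seconds(size_bytes: int, bandwidth_bytes_per_s: int) -> float:
    """Idealized transfer time: size divided by aggregate bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    dataset = 1 * PB  # hypothetical dataset size, for illustration only
    for tier, bandwidth in TIER_BANDWIDTH.items():
        print(f"{tier}: {transfer_seconds(dataset, bandwidth):,.0f} s")
```

Under these assumptions the same 1 PB moves in about 1 s at the compute-node tier, 100 s at the I/O-node tier, and 1000 s at the parallel file system, which is the argument for moving storage closer to compute.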
HPC Storage Hierarchy is Changing
Past:
• CPU
• Memory (DRAM)
• Storage (HDD)
Future:
• CPU
• Near Memory (HBM)
• Near Storage (NVRAM/SSD)
• Far Memory (DRAM)
• Far Storage (HDD/Tape)
Diagram labels: On Chip / Off Chip (shown for both Past and Future)
Basic Lustre File System in Production
Diagram: Client nodes connected to MDS nodes (backed by an MDT) and OSS nodes (backed by OSTs)
Complex Lustre File System
Diagram: several groups of Clients, an LNet Router, MDS nodes (backed by multiple MDTs), and OSS nodes (backed by OSTs)
Tiered Lustre File System is Coming
Diagram labels:
• Local Datasets
• Local NVMe/NVRAM
• Metadata Servers (~100’s)
• Object Storage Servers (~1000’s)
• Metadata Targets (MDTs)
• Management Target (MGT)
• HDD Object Storage Targets (OSTs)
• Lustre Clients (~100,000+)
• NVMe MDTs on client net
• Archive OSTs (Erasure Coded)
• Policy Engine, Data Transfer Nodes
• NVMe OSTs (Burst Buffer) on client network
• Transparent Tiering to Multiple Clouds
• WAN ARCHIVE
• Local data processing
• Bi-directional (remote) sync
• Transparent migration
Example Architecture of a Heterogeneous Lustre File System
Diagram labels:
• One Lustre Namespace
• OST Pool Based on SSD
• OST Pool Based on Nearline HDD
• OST Pool Based on HDD
• HSM Based on Tape
• Lustre on Demand based on NVMe
• Persistent Client Cache based on NVMe
• Clients
• Archive/Restore
• Attach/Detach
• Stage-in/out
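One way to picture how a policy engine might steer files between the tiers in this diagram is an age-based rule. The sketch below is purely hypothetical: the thresholds, tier names, example paths, and the `choose_tier` function are assumptions for illustration and do not correspond to any Lustre API or default policy.

```python
from dataclasses import dataclass

DAY = 24 * 3600  # seconds

# Hypothetical (max idle seconds, tier) thresholds, ordered hottest to coldest.
TIER_RULES = [
    (1 * DAY, "ssd_pool"),
    (30 * DAY, "hdd_pool"),
    (180 * DAY, "nearline_hdd_pool"),
]
COLDEST_TIER = "tape_hsm"  # hypothetical name for the tape-backed HSM tier


@dataclass
class FileInfo:
    path: str
    idle_seconds: int  # time since last access


def choose_tier(f: FileInfo) -> str:
    """Return the first tier whose idle-time limit the file is still under."""
    for max_idle, tier in TIER_RULES:
        if f.idle_seconds < max_idle:
            return tier
    return COLDEST_TIER


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for f in [
        FileInfo("/lustre/run/out.h5", 3600),
        FileInfo("/lustre/proj/old.tar", 400 * DAY),
    ]:
        print(f.path, "->", choose_tier(f))
```

A real deployment would likely weigh more than access age (file size, project, pool fullness), but the same shape applies: all tiers sit under one namespace, and a policy decides where each file's data should live.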
Challenges and Opportunities for Lustre File System
► Performance challenges
• Wide usage of NVMe/SSD exposes software latency
• Software could become the bottleneck for aggregate bandwidth/IOPS
► Scalability challenges
• Both data and metadata sizes keep growing
► Data management challenges
• Heterogeneous storage types
• Data migration between multiple storage tiers
• Data movement for local access
• S3/POSIX HSM storage integration
• Data integrity
Diagram labels: Performance, Scalability, Management
Features of Lustre to Solve the Challenges
Diagram categories: Performance, Scalability, Management
• Distributed Namespace
• Data on MDT
• Size on MDT
• Persistent Client Cache
• Parallel e2fsck
• Parallel Readahead
• Data Placement Policy
• LNet Health
• File Level Redundancy
• Token Bucket Filter
• Project Quota
• Pool Quota
• Fast Read
• Lock Ahead
• Ladvise
• Policy Engine
• Large RPC Size
• Large Directory on MDT
• LNet Multi-Rail
• ZFS OSD
• Data Security
• HSM
• Changelog
• Large Xattr of Ext4
• Progressive File Layout
Lustre Community Roadmap
► 2.11
• Data on MDT
• FLR Delayed Resync
• Lock Ahead
► 2.12
• Lazy Size on MDT
• LNet Health
• DNE Dir Restriping
► 2.13
• Persistent Client Cache
• LNet Selection Policy
• Self Extending Layouts
► 2.14
• FLR Erasure Coding
• Health Monitoring
• DNE Auto Restriping
Upcoming Release Feature Highlights
► 2.12 was released in December 2018
• LNet Multi-Rail Network Health – improved fault tolerance
• Lazy Size on MDT (LSOM) – fast MDT filesystem scanning/attributes
• File Level Redundancy (FLR) enhancements – usability and robustness
• T10 Data Integrity Field (DIF) – improved data integrity
• DNE directory restriping – better space balancing and DNE2 adoption
► 2.13 development and landing underway, ETA August 2019
• Persistent Client Cache (PCC) – store data in client-local NVMe
• DNE automatic remote directory – improve load/space balance across MDTs
• LNet User Defined Selection Policy – tune LNet Multi-Rail interface selection
► 2.14 plans continued functional and performance improvements
• File Level Redundancy – Erasure Coding (EC) for striped files
• OST pool quotas – manage space on heterogeneous storage targets
• DNE directory auto-split – improve usability and performance of DNE2
IO-500 (ISC’19)
70% increase in score on the same hardware over the 2018-11 list
China LUG 2019 is coming!
► A local event in China, separate from the global LUG/LAD
► Date: 2019/10/15 (Tue.) 9:00-17:00
► Place: The New World Beijing Hotel, Beijing City
► Website: http://lustrefs.cn
► Presenters:
• Peter Jones, DDN/Whamcloud
• Andreas Dilger, DDN/Whamcloud, Lustre CTO
• (remaining presenter entries are unreadable in the source encoding)
Questions?