GPFS & StoRM


Page 1: GPFS & StoRM

GPFS & StoRM

Jon Wakelin

University of Bristol

Page 2: GPFS & StoRM

Pre-Amble

• GPFS Basics
  – What it is & what it does
• GPFS Concepts
  – More in-depth technical concepts
  – GPFS Topologies
• HPC Facilities at Bristol
  – How we are using GPFS
  – Creating a “mock-up”/staging service for GridPP
• StoRM
• Recap & References

Page 3: GPFS & StoRM

GPFS Basics

• IBM’s General Parallel File System
  – “Scalable high-performance parallel file system”
  – Numerous HA features
  – Life-cycle management tools
  – Provides POSIX and “extended” interfaces to data
• Available for AIX and Linux
  – Only supported on AIX, RHEL and SuSE
  – Installed successfully on SL3.x (ask me if you are interested)
  – GPFS can run on a mix of these OSs
• Pricing – per processor
  – Free version available through IBM’s Scholars program
  – Currently developing a new licensing model

Page 4: GPFS & StoRM

GPFS Basics

• Provides high-performance I/O
  – Divides files into blocks and stripes the blocks across disks (on multiple storage devices)
    • Reads/writes the blocks in parallel
    • Tuneable block sizes (depends on your data) – see the sketch after this slide
  – Block-level locking mechanism
    • Multiple applications can access the same file concurrently
    • “multiple editors can work on different parts of a single file simultaneously. This eliminates the additional storage, merging and management overhead typically required to maintain multiple copies”
  – Client-side data caching
    • Where is data cached?
• Multi-Cluster Configuration
  – Join GPFS clusters together
  – Encrypted data and authentication, or just authentication
    • openssl and keys
  – Different security contexts (root squash à la NFS)
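The block size is fixed when the file system is created, so it is worth choosing it to match your data. A minimal sketch of creating and mounting a file system with a 1 MB block size follows; the NSD list file, device name and mount point are hypothetical, and the exact mmcrfs syntax varies between GPFS releases:

    # Create a GPFS file system from previously defined NSDs (listed in /tmp/nsd.list),
    # choosing a 1 MB block size to suit large sequential files.
    #   -F  file describing the NSDs to use
    #   -B  file-system block size
    #   -A  mount automatically when GPFS starts
    mmcrfs /gpfs gpfs0 -F /tmp/nsd.list -B 1M -A yes

    # Mount it on all nodes in the cluster
    mmmount gpfs0 -a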

Page 5: GPFS & StoRM

GPFS Basics

• Information Life-cycle Management
  – Tiered storage
    • Create groups of disks (storage pools) within a file system, based on reliability, performance, location, etc.
  – Policy-driven automation – see the policy sketch after this slide
    • Automatically move, delete or replicate files based on filename, username, or fileset
    • e.g. keep the newest files on the fastest hardware and migrate them to older hardware over time
    • e.g. direct files to the appropriate resource upon creation
• Other notable points
  – Can specify user, group and fileset quotas
  – POSIX and NFS v4 ACL support
  – Can specify different IPs for GPFS and non-GPFS traffic
  – Maximum limit of 268 million disks (2,048 is the default maximum)
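For illustration only, a policy file might look like the one below; the pool names and file-name pattern are invented, but the rule syntax is the GPFS policy language, installed with mmchpolicy and evaluated with mmapplypolicy:

    # /tmp/policy.rules – hypothetical pools 'fast' and 'slow'
    cat > /tmp/policy.rules <<'EOF'
    /* Place new ROOT files on the fast pool */
    RULE 'place-root' SET POOL 'fast' WHERE LOWER(NAME) LIKE '%.root'

    /* Migrate files untouched for 30 days down to the slow pool */
    RULE 'age-out' MIGRATE FROM POOL 'fast' TO POOL 'slow'
      WHERE CURRENT_TIMESTAMP - MODIFICATION_TIME > INTERVAL '30' DAYS

    /* A default placement rule is required */
    RULE 'default' SET POOL 'system'
    EOF

    mmchpolicy gpfs0 /tmp/policy.rules   # install the policy for file system gpfs0
    mmapplypolicy gpfs0                  # evaluate the migration/deletion rules now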

Page 6: GPFS & StoRM

GPFS Topologies

• SAN-attached
  – All nodes are physically attached to all NSDs
  – High performance but expensive!

Page 7: GPFS & StoRM

GPFS Topologies

• Network Shared Disk (NSD) servers
  – A subset of nodes is physically attached to the NSDs
  – The other nodes forward their I/O requests to the NSD servers, which perform the I/O and pass the data back
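As a sketch of how disks become NSDs with designated servers (host names, device names and the descriptor file are hypothetical; the colon-separated descriptor format is the GPFS 3.x one current at the time):

    # disk.desc – one line per disk:
    # DiskName:PrimaryNSDServer:BackupNSDServer:DiskUsage:FailureGroup:DesiredName:StoragePool
    cat > /tmp/disk.desc <<'EOF'
    /dev/sdb:nsd01.phy.bris.ac.uk:nsd02.phy.bris.ac.uk:dataAndMetadata:1:nsd_a:system
    /dev/sdc:nsd02.phy.bris.ac.uk:nsd01.phy.bris.ac.uk:dataAndMetadata:2:nsd_b:system
    EOF

    mmcrnsd -F /tmp/disk.desc   # label the disks as NSDs
    mmlsnsd                     # list the NSDs and their server lists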

Page 8: GPFS & StoRM

GPFS Topologies

[Diagram: application / Linux / GPFS client nodes and two NSD server nodes, all connected over a Local Area Network]

• In practice, you often have a mixed NSD + SAN environment
  – Nodes use the SAN if they can, and the NSD servers if they can’t
  – If SAN connectivity fails, a SAN-attached node can fall back to using the remaining NSD servers
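To see which path each node is actually using, mmlsnsd can report whether a node has a local (SAN) device for each NSD or reaches it through an NSD server; a quick check might look like:

    # Show, for every node, the local device (if any) behind each NSD;
    # nodes with no local device path go through the NSD servers instead.
    mmlsnsd -M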

Page 9: GPFS & StoRM

GPFS Redundancy & HA

• Non-GPFS
  – Redundant power supplies
  – Redundant hot-swap fans
  – …
  – RAID with hot-swappable disks (multiple IBM DS4700s)
  – FC with redundant paths (GPFS knows how to use this)
• HA features in GPFS
  – Primary and secondary configuration servers
  – Primary and secondary NSD servers for each disk
  – Replicate metadata
  – Replicate data – see the sketch after this slide
  – Failure groups
    • Specify which machines share a single point of failure
    • GPFS uses this information to make sure that the replicas of a block are placed in different failure groups
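Continuing the earlier hypothetical disk descriptors (which put the two NSDs in failure groups 1 and 2), a sketch of creating the file system with two copies of both data and metadata:

    # -m/-r set the default number of metadata/data replicas,
    # -M/-R the maximum allowed; GPFS keeps the copies in different failure groups.
    mmcrfs /gpfs gpfs0 -F /tmp/disk.desc -m 2 -M 2 -r 2 -R 2

    # Inspect the replication settings later
    mmlsfs gpfs0 -m -r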

Page 10: GPFS & StoRM

GPFS Quorum

• Quorum
  – A “majority” of the nodes must be present before access to the shared disks is allowed
  – Prevents subgroups from making conflicting decisions
  – In the event of a failure, the minority partition suspends access to the disks and the majority continues
• Quorum nodes
  – These nodes are counted to determine whether the system is quorate
  – If the system is no longer quorate
    • GPFS unmounts the file system …
    • … waits until quorum is established …
    • … and then recovers the file system
• Quorum nodes with tie-breaker disks – see the sketch after this slide
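For a small cluster (such as the staging system described later), one or two quorum nodes plus tie-breaker disks avoid the need for three or more quorum nodes. A sketch with hypothetical NSD names; on the GPFS releases of this era the daemon had to be stopped cluster-wide before changing the setting:

    mmshutdown -a                               # stop GPFS on all nodes
    mmchconfig tiebreakerDisks="nsd_a;nsd_b"    # up to three NSDs can act as tie-breakers
    mmstartup -a                                # restart GPFS
    mmgetstate -a                               # confirm the cluster is active and quorate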

Page 11: GPFS & StoRM

GPFS Performance

• Preliminary results using
  time dd if=/dev/zero of=testfile bs=1k count=2000000

• Multiple write processes on the same node
  – 1 process:   90 MB/s
  – 2 processes: 51 MB/s
  – 4 processes: 18 MB/s

• Multiple write processes from different nodes
  – 1 process:   90 MB/s
  – 2 processes: 58 MB/s
  – 4 processes: 28 MB/s
  – 5 processes: 23 MB/s

Page 12: GPFS & StoRM

GPFS Performance

• In a hybrid environment (SAN-attached and NSD server nodes)
  – Reads/writes from SAN-attached nodes place little load on the NSD servers
  – Reads/writes from other nodes place a high load on the NSD servers

• SAN-attached (mmfsd on the NSD server stays at 0% CPU)
  [root@bf39 gpfs]# time dd if=/dev/zero of=file_zero count=2048 bs=1024k
  real 0m31.773s
  [root@bf40 GPFS]# top -p 26651
  26651 root 0 -20 1155m 73m 7064 S 0 1.5 0:10.78 mmfsd

• Via NSD server (mmfsd on the NSD server rises to 34% CPU)
  [root@bfa-se /]# time dd if=/dev/zero of=/gpfs/file_zero count=2048 bs=1024k
  real 0m31.381s
  [root@bf40 GPFS]# top -p 26651
  26651 root 0 -20 1155m 73m 7064 S 34 1.5 0:10.78 mmfsd

Page 13: GPFS & StoRM

Bristol HPC Facilities

• Bristol, IBM, ClearSpeed and ClusterVision
  – BabyBlue – installed Apr 2007, currently undergoing acceptance trials
  – BlueCrystal – ~Dec 2007
• Testing
  – A number of “pump-priming” projects have been identified
  – The majority of users will develop, or port code, directly on the HPC system
    • Only make changes at the application level
  – GridPP
    • System-level changes
    • Pool accounts, world-addressable slaves, NAT, running services and daemons
• Instead we will build a testing/staging system for GridPP
  – In-house and loan equipment from IBM
  – Reasonable analogue of the HPC facilities
  – No InfiniBand (but you wouldn’t use it anyway)

Page 14: GPFS & StoRM

Bristol HPC Facilities

• BabyBlue
  – Torque/Maui, SL 4 “worker nodes”, RHEL4 (maybe AIX) on head nodes
  – IBM 3455
    • 96 dual-core, dual-socket 2.6 GHz AMD Opterons
    • 4? ClearSpeed accelerator boards
  – 8 GB RAM per node (2 GB per core)
  – IBM DS4700 + EXP810, 15 TB transient storage
    • SAN/FC network running GPFS
• BlueCrystal – c. Dec 2007
  – Torque/Moab
  – 512 dual-core, dual-socket nodes (or quad-core, depending on timing)
  – 8 GB RAM per node (1 GB or 2 GB per core)
  – 50 TB storage, SAN/FC network running GPFS
• Server room
  – 48 water-cooled APC racks – 18 will be occupied by HPC; Physics servers may be co-located
  – 3 × 270 kW chillers (space for 3 more)

Page 15: GPFS & StoRM

GPFS BabyBlue

Page 16: GPFS & StoRM

GPFS MiniBlue

[Diagram: a small “MiniBlue” GPFS cluster – a primary NSD server (quorum node, primary configuration server), a secondary NSD server (quorum node, secondary configuration server) and a third quorum-only node, attached to an IBM DS4500 with hot spares configured]

Page 17: GPFS & StoRM

StoRM

• StoRM is a storage resource manager for disk-based storage systems
  – Implements the SRM interface, version 2.2
  – StoRM is designed to support guaranteed space reservation and direct access (using native POSIX I/O calls)
  – StoRM takes advantage of high-performance parallel file systems
    • GPFS, XFS and Lustre???
    • Standard POSIX file systems are also supported
  – Direct access to files from “worker nodes” – see the sketch after this slide
    • Compare with Castor, dCache and DPM
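The practical difference is that a job on a worker node can read a StoRM-managed file in place over the GPFS mount (the “file” protocol) instead of first staging a copy to local scratch. A trivial illustration; the path is hypothetical:

    # With Castor/dCache/DPM the job would typically stage the file first, e.g.
    #   some-srm-or-gridftp-client  srm://se.example.ac.uk/...  ->  $TMPDIR/localcopy
    # With StoRM on GPFS the job just uses ordinary POSIX I/O against the shared mount:
    md5sum /gpfs/dteam/data/run1234.root    # read the file in place, no local copy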

Page 18: GPFS & StoRM

StoRM architecture

• Front end (FE)
  – Exposes the web-service interface
  – Manages user authentication
  – Sends the request to the BE
• Database (DB)
  – Stores SRM requests and their status
  – Stores file and space information
• Back end (BE)
  – Binds with the underlying file systems
  – Enforces authorization policy on files
  – Manages SRM file and space metadata

Page 19: GPFS & StoRM

StoRM miscellaneous

• Scalability and high availability
  – FE, DB and BE can be deployed on different machines
  – StoRM is designed to be configured with n FEs and m BEs, using a common DB
• Installation (relatively straightforward) – a rough sketch follows this slide
  – RPM & YAIM (FE, BE and DB all on one server)
  – Additional manual configuration steps
    • e.g. namespace.xml, information providers
  – Not completely documented yet
  – Mailing list
• CNAF ×2 and Bristol
  – Basic tests – http://lxdev25.cern.ch/s2test/basic/history/
  – Use-case tests – http://lxdev25.cern.ch/s2test/usecase/history/
  – There are currently still differences between the Bristol and CNAF installations
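Purely as a sketch of the single-server install described above; the package and YAIM node-type names below are assumptions from memory and may not match your StoRM release, so check the StoRM installation guide:

    # Install the StoRM front-end and back-end RPMs (package names are assumptions)
    yum install storm-frontend-server storm-backend-server

    # Run YAIM against the site configuration; node-type names are assumptions
    /opt/glite/yaim/bin/yaim -c -s site-info.def -n se_storm_backend -n se_storm_frontend

    # Then finish the manual steps noted above (namespace.xml, information providers)
    # and restart the FE/BE services.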

Page 20: GPFS & StoRM

StoRM usage model

Page 21: GPFS & StoRM

Summary

• GPFS
  – Scalable high-performance file system
  – Highly available, built on redundant components
  – Tiered storage or a multi-cluster configuration for the GridPP work
• HPC
  – University-wide facility – not just for PP
  – GridPP requirements are rather different from those of general/traditional HPC users
  – Build an “analogue” of the HPC system for GridPP
• StoRM
  – Better performance, because StoRM builds directly on the underlying parallel file system
  – Also a more appropriate data-transfer model – POSIX and the “file” protocol

Page 22: GPFS & StoRM

References

• GPFS
  – http://www-03.ibm.com/systems/clusters/software/gpfs.pdf
  – http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/topic/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf
  – http://www-03.ibm.com/systems/clusters/software/whitepapers/gpfs_intro.pdf
• StoRM
  – http://hst.home.cern.ch/hst/publications/storm_chep06.pdf
  – http://agenda.cnaf.infn.it/getFile.py/access?contribId=10&resId=1&materialId=slides&confId=0