Overview of High Performance Input/Output
on LRZ HPC systems
Christoph Biardzki
Richard Patra
Reinhold Bader
Agenda
Choosing the right file system
Storage subsystems at LRZ
Introduction to parallel file systems
Optimizing I/O in your applications
Big/Little Endian issues (Fortran)
File system types at LRZ
Home and Project file systems
Typically lots of small files (<1 MB)
Available space limited by quota
Very reliable
Regular backup is performed by LRZ
E.g., source code, binaries, configuration and (smaller) input files
Pseudo-temporary file systems
Huge local or (shared + parallel) file systems (>100 TB), no quota
Good I/O bandwidth with huge files (> 100 MB)
not optimal for small files (transactions)
Somewhat lower reliability due to new technology and size
High-watermark deletion, no backup!
E.g., large temporary files, large input or output files
Choosing the right file system
File systems are a shared resource, so please be nice to other users.
Do:
Put your really important data into a home/project file system
Use the $OPT_TMP environment variable which always references the optimal temporary file system
Use snapshots where available if you need an older version of a file or if you’ve removed a file by mistake
Contact LRZ HPC support if you feel you have an unusually I/O-intensive application or if you need additional, reliable storage for your project
Do not:
Use your home directory for temporary files
Put small files into parallel file systems (don’t use small files at all! ☺)
Put any data you can’t recompute into a pseudotemporary file system (no backup!)
Storage configuration at LRZ
NFS:
Home file systems in the Linux cluster + Altix (some TB) and HLRB-II (60 TB)
Expect a total performance of ~100 MB/s with sequential access
Snapshots are available as a backup measure
XFS / Cluster-XFS:
Used on Altix (11 + 7 TB) and HLRB-II (300 + 300 TB) as scratch file systems
Several 100 MB/s per process, up to 20 GB/s per file system on HLRB-II
Lustre:
Pseudotemporary file system on the Linux Cluster (140 TB)
Using 1.6 release
Up to 5000 MB/s aggregate I/O Bandwidth
Current I/O subsystem setup on Linux Cluster systems
[Diagram: Lustre file system with OSTs numbered up to 120]
Introduction to parallel file systems
What is a “parallel” file system?
The “file server” becomes a bottleneck when a parallel application running on a cluster writes/reads huge amounts of data
In a parallel file system you can split a file among several file servers, parallelize the I/O and improve performance
In the diagram the "stripe size" is 4 (letters)
In reality ~2 MB
The number of servers used is also configurable
You don't want to stripe every file over all your servers
Exception: many clients access one file ("parallel I/O")
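The mapping from a file offset to a server can be sketched in a few lines of C. This is a minimal illustration that assumes plain round-robin striping, as in the diagram; the function names `stripe_server` and `stripe_local_offset` are made up for this sketch and are not part of any file system API.

```c
#include <stdint.h>

/* Index (0 .. stripe_count-1) of the server that holds the byte at `offset`,
   assuming stripes are dealt out round-robin over the servers. */
unsigned stripe_server(uint64_t offset, uint64_t stripe_size,
                       unsigned stripe_count)
{
    return (unsigned)((offset / stripe_size) % stripe_count);
}

/* Position of that byte inside the piece of the file stored on its server. */
uint64_t stripe_local_offset(uint64_t offset, uint64_t stripe_size,
                             unsigned stripe_count)
{
    uint64_t stripe_index = offset / stripe_size;   /* which stripe overall   */
    uint64_t round = stripe_index / stripe_count;   /* full rounds before it  */
    return round * stripe_size + offset % stripe_size;
}
```

With a stripe size of 4 and, as an assumed example, 3 servers, bytes 0-3 land on server 0, bytes 4-7 on server 1, bytes 8-11 on server 2, and bytes 12-15 on server 0 again. A single client reading sequentially therefore touches the servers one after another, which is why striping pays off mainly when many clients access one file.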
Example: Lustre at LRZ
Configurable parameters in Lustre:
Stripe size (Default 2 MB)
Stripe count = number of servers to stripe over (Default: 1)
Number of first server (Default: random)
Lustre Configuration:
1 Metadata-Server, 120 Data-Servers (called OSTs: Object Storage Targets)
~1 TB of storage attached to each OST
10 Gigabit Ethernet-Connections to network switches
Client connection:
Gigabit Ethernet: 90 MB/s
10 GE nodes: ~600 MB/s
Performance (2006)
Benchmark with up to 15 Dual-Itanium-Clients using Gigabit Ethernet
Every client writes a 15 GB file into Lustre
General rules for I/O
Avoid unnecessary I/O
Perform I/O in few and large chunks
Binary instead of formatted data (Factor 3 performance improvement!)
Use appropriate filesystem
Use I/O libraries whenever available
Convert to target/visualization format in memory if possible
For parallel programs: output to separate files for each process: highest throughput, but usually needs postprocessing
Use library/compiler support for conversion between little/big endian of files used on different architectures
Avoid unnecessary open/close statements
Avoid explicit flushes of data to disk, except when needed for consistency reasons
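Several of these rules (binary data, one large chunk, a single open/close pair, no explicit flush) can be combined in a short C sketch. The helper names `write_doubles` and `read_doubles` are hypothetical and chosen only for this illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Write n doubles as raw binary with a single fwrite call.
   Returns the number of elements written (0 on failure). */
size_t write_doubles(const char *path, const double *data, size_t n)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return 0;
    size_t written = fwrite(data, sizeof(double), n, fp);  /* one large chunk */
    if (fclose(fp) != 0)   /* fclose flushes; no explicit fflush needed */
        return 0;
    return written;
}

/* Read up to n doubles back with a single fread call.
   Returns the number of elements actually read. */
size_t read_doubles(const char *path, double *data, size_t n)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    size_t got = fread(data, sizeof(double), n, fp);
    fclose(fp);
    return got;
}
```

Compared with a loop of fprintf calls, this avoids the number-to-text conversion and stores each value in exactly 8 bytes. The resulting file is tied to the byte order and floating-point format of the writing machine.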
I/O in Fortran
Parameters of the OPEN statement:
Specify what you want to do: read, write or both:
ACTION='READ' / 'WRITE' / 'READWRITE'
Perform direct access with large record length (if possible a multiple of the disk block size):
ACCESS='DIRECT', RECL=<record_length>
Use binary (unformatted) I/O (default for direct access)
FORM='UNFORMATTED'
If you need sequential formatted access, remember to access data in large chunks at least
Use buffering if possible/manually increase buffer size (~100MB)
Intel Fortran run-time system: additional parameters of open statement:
BUFFERED='yes', BUFFERCOUNT=10000
directives are usually proprietary
I/O in C
● Increase buffer size (~100 MB): setvbuf (call before reading, writing or any other operation on the file)
● Perform unformatted instead of formatted I/O: fwrite/fread instead of fprintf/fscanf
● For repositioning within the file use fseek
Example:
#include <stdio.h>   /* fopen, setvbuf, fseek, fwrite */
#include <stdlib.h>  /* malloc */
double data[SIZE];
char* myvbuf = malloc(100000000);  /* buffer must be allocated and stay valid until fclose */
FILE* fp;
fp = fopen(FILENAME, "w");
setvbuf(fp, myvbuf, _IOFBF, 100000000);  /* _IOFBF: I/O fully buffered */
fseek(fp, 0, SEEK_SET);
fwrite(data, sizeof(double), SIZE, fp);
MPI-I/O
Perform non-contiguous IO with MPI derived datatypes
Perform collective IO
Tell the MPI subsystem what you want to do (read, write, both, ...):
call MPI_Info_set (info, 'access_style', <style>, ierr)
where <style> can be 'write_once', 'read_once', 'write_mostly', 'read_mostly', 'sequential',...
Pass additional hints to the MPI subsystem (unknown hints will be ignored)
many of these are implementation-dependent
Tuning I/O on Lustre: serial and MPI-parallel
Lustre striping factor:
lfs getstripe <filename>
shows striping factor of a file
lfs setstripe <directory> <stripe-size> <start-ost> <stripe-cnt>
sets striping size, factor and first OST for files created in directory
Example: lfs setstripe /lustre/a2832bf/bench 0 -1 12
(will stripe with default striping size (2 MB) over 12 OSTs)
Hints for MPI parallel I/O:
call MPI_Info_set(info, 'striping_unit', '<stripe-size>', ierr)
call MPI_Info_set(info, 'striping_factor', '<stripe-cnt>', ierr)
call MPI_Info_set(info, 'num_io_nodes', '<stripe-cnt>', ierr)
Each partition:
~1.25 GB/s in aggregate mode
19 blades
$OPT_TMP
Additional file system $PROJECT available
Tuning I/O on CXFS: FFIO
glibc calls can be diverted to use an alternative I/O layer: Fast and Flexible IO
Prerequisites:
dynamic linkage at least against glibc
export LD_PRELOAD=/usr/lib/libFFIO.so
Optionally set variables: FF_IO_LOGFILE and FF_IO_OPEN_DIAGS
Set variable FF_IO_OPTS (mandatory!) to select
• file patterns
• I/O layers to be used
• performance relevant parameters
Then run program as usual
man libffio for details
Example for FFIO usage
export FF_IO_OPTS='myfile.*(eie.direct.nodiag.mbytes:4096:64:6,event.mbytes.notrace)'
myfile.*: affects all files with basename myfile.*
E(nhanced) I(ntelligence) E(ngineering) suboptions:
• direct: unbuffered I/O
• nodiag: no cache usage statistics reported
• mbytes: unit for logging
4096: page size
• units are 512-byte blocks
• use this or an integer multiple for LRZ system
• striping unit TP9700: 2 MByte
64: number of pages in FFIO cache
• low value enforces flushing to disk
• high value provides effective buffering
• choose according to other memory requirements of program
6: number of pages read-ahead if sequential access detected
• can improve read performance if suitably increased
Event layer (statistics):
• effectively unused here
• monitor I/O between layers
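The memory consumed by these settings follows directly from the numbers in the option string. The small C sketch below only does that arithmetic; the function names are made up for illustration and have nothing to do with the FFIO library itself.

```c
#include <stdint.h>

/* Page size in bytes, given the page size in 512-byte blocks. */
uint64_t ffio_page_bytes(uint64_t page_blocks)
{
    return page_blocks * 512u;
}

/* Bytes covered by n_pages pages of that size
   (usable for the cache as well as for the read-ahead window). */
uint64_t ffio_pages_bytes(uint64_t page_blocks, uint64_t n_pages)
{
    return ffio_page_bytes(page_blocks) * n_pages;
}
```

For the example above, 4096 blocks give a 2 MiB page, which matches the 2 MByte striping unit. 64 pages then amount to 128 MiB of cache, and a read-ahead of 6 pages corresponds to 12 MiB.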
FFIO for MPI programs
Can have separated FFIO settings for each MPI task
must use SGI MPT on Altix
replace FF_IO_OPTS by
export SGI_MPI=/usr/lib
export FF_IO_OPTS_RANK0=…
export FF_IO_OPTS_RANK1=…
Tuning MPI IO (XFS)
DMA transfers:
call MPI_Info_set(info, 'direct_read', 'true', ierr)
call MPI_Info_set(info, 'direct_write', 'true', ierr)
bypasses OS buffer cache, can improve performance in special cases, but usually leads to performance degradation (do not use, except when memory used by buffer cache needed for computation)
See FFIO description on previous slides
MPI IO Example
Writing a distributed array of REAL4 (6 processes) with MPI derived datatype (darray):
[Figure: layout of the distributed array; dimension labels 1024]

File system       total     MB/s (non-collective)   MB/s (collective)
Lustre (6 OSTs)   12 GB     34                      102
Lustre (6 OSTs)   120 GB    51                      100
XFS               12 GB     270                     315
XFS               120 GB    189                     160
Big/Little Endian issues: converting unformatted files
Environment variable specific to Intel-Fortran-generated binaries:
export F_UFMTENDIAN=MODE|[MODE;]EXCEPTION
where:
MODE = big | little
EXCEPTION = big:ULIST | little:ULIST | ULIST
ULIST = U | ULIST,U
U = decimal | decimal-decimal
Examples:
F_UFMTENDIAN=big                file format is big-endian for all units
F_UFMTENDIAN=big:9,12           big-endian for units 9 and 12, little-endian for others
F_UFMTENDIAN="big;little:8"     big-endian for all except unit 8
if F_UFMTENDIAN is unset: default value little
Converting Files: Alternatives for Intel Fortran
Use -convert switch at compilation
• will have effect on all units opened in source file
Use convert='…' keyword on OPEN statement
• will only affect opened I/O unit
• proprietary enhancement: code non-portable!
Both option and keyword can take various values:
• big_endian
• little_endian
• cray
• ibm
• …
See compiler documentation / language reference for detailed information
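What such a conversion does to the data itself can be sketched in C for 8-byte values. This is an illustration only, not the mechanism the Intel run-time system uses; the names `swap_u64` and `swap_doubles` are hypothetical. Fortran sequential unformatted files also contain record-length markers, which this sketch does not handle.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of one 64-bit value. */
uint64_t swap_u64(uint64_t v)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 8) | (v & 0xFFu);  /* move lowest byte of v to the top of r */
        v >>= 8;
    }
    return r;
}

/* Convert an array of doubles in place between big- and little-endian.
   memcpy is used so the swapped bit patterns are never touched by
   floating-point operations. */
void swap_doubles(double *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &data[i], sizeof bits);
        bits = swap_u64(bits);
        memcpy(&data[i], &bits, sizeof bits);
    }
}
```

Applying `swap_doubles` twice restores the original values, so the same routine serves for both directions of the conversion.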