
Overview of High Performance Input/Output on LRZ HPC systems

Christoph Biardzki, Richard Patra, Reinhold Bader

Agenda

Choosing the right file system

Storage subsystems at LRZ

Introduction to parallel file systems

Optimizing I/O in your applications

Big/Little Endian issues (Fortran)

File system types at LRZ

Home and Project file systems

Typically lots of small files (<1 MB)

Available space limited by quota

Very reliable

Regular backup is performed by LRZ

E.g., source code, binaries, configuration and (smaller) input files

Pseudo-temporary file systems

Huge local or (shared + parallel) file systems (>100 TB), no quota

Good I/O bandwidth with huge files (> 100 MB)

Not optimal for small files (many small transactions)

Somewhat lower reliability due to new technology and size

High-watermark deletion, no backup!

E.g., large temporary files, large input or output files

Choosing the right file system

File systems are a shared resource: please be nice to other users and…

Do:

Put your really important data into a home/project file system

Use the $OPT_TMP environment variable which always references the optimal temporary file system

Use snapshots where available if you need an older version of a file or if you’ve removed a file by mistake

Contact LRZ HPC support if you feel you have an unusually I/O-intensive application or if you need additional, reliable storage for your project

Do not:

Use your home directory for temporary files

Put small files into parallel file systems (don’t use small files at all! ☺)

Put any data you can’t recompute into a pseudotemporary file system (no backup!)

Storage configuration at LRZ

NFS:

Home file systems in the Linux cluster + Altix (some TB) and HLRB-II (60 TB)

Expect a total performance of ~100 MB/s with sequential access

Snapshots are available as a backup measure

XFS / Cluster-XFS:

Used on Altix (11 + 7 TB) and HLRB-II (300 + 300 TB) as scratch file systems

Several 100 MB/s per process, up to 20 GB/s per file system on HLRB-II

Lustre:

Pseudotemporary file system on the Linux Cluster (140 TB)

Using 1.6 release

Up to 5000 MB/s aggregate I/O Bandwidth

Current I/O subsystem setup on Linux Cluster systems

[Figure: Lustre I/O subsystem of the Linux Cluster, with OSTs numbered up to 120]

Introduction to parallel file systems

What is a “parallel” file system?

The “file server” becomes a bottleneck when a parallel application running on a cluster writes/reads huge amounts of data

In a parallel file system you can split a file among several file servers, parallelize the I/O and improve performance

In the diagram the “stripe size” is 4 (letters)

In reality ~2 MB

The number of servers used is also configurable

You don’t want to stripe every file over all your servers

Exception: many clients access one file (“parallel I/O”)
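The striping idea above can be sketched in a few lines of C. This is a simplified round-robin model, not Lustre's actual implementation: the names `stripe_loc` and `locate_byte` are made up for illustration, and the sketch assumes striping always starts at server 0.

```c
#include <assert.h>
#include <stdint.h>

/* Location of one byte of a striped file (simplified round-robin model). */
typedef struct {
    uint64_t server;        /* 0-based index of the server holding the byte */
    uint64_t object_offset; /* byte offset inside that server's part of the file */
} stripe_loc;

/* Map a file offset to a server, assuming stripes are dealt out round-robin
 * starting at server 0 (hypothetical helper, not a Lustre API). */
stripe_loc locate_byte(uint64_t file_offset, uint64_t stripe_size, uint64_t stripe_count)
{
    assert(stripe_size > 0 && stripe_count > 0);

    uint64_t stripe_index = file_offset / stripe_size; /* which stripe of the file */
    stripe_loc loc;
    loc.server = stripe_index % stripe_count;
    /* Full stripes already stored on this server, plus the position inside the stripe. */
    loc.object_offset = (stripe_index / stripe_count) * stripe_size
                        + file_offset % stripe_size;
    return loc;
}
```

With a 2 MB stripe size and 4 servers, consecutive 2 MB pieces of a file land on servers 0, 1, 2, 3, 0, … so a large sequential transfer keeps all four servers busy at once.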

Example: Lustre at LRZ

Configurable parameters in Lustre:

Stripe size (Default 2 MB)

Stripe count = number of servers to stripe over (Default: 1)

Number of first server (Default: random)

Lustre Configuration:

1 Metadata-Server, 120 Data-Servers (called OSTs: Object Storage Targets)

~1 TB of storage attached to each OST

10 Gigabit Ethernet-Connections to network switches

Client connection:

Gigabit Ethernet: 90 MB/s

10 GE nodes: ~600 MB/s

Performance (2006)

Benchmark with up to 15 dual-Itanium clients using Gigabit Ethernet

Every client writes a 15 GB file into Lustre

General rules for I/O

Avoid unnecessary I/O

Perform I/O in few, large chunks

Binary instead of formatted data (Factor 3 performance improvement!)

Use an appropriate file system

Use I/O libraries whenever available

Convert to target/visualization format in memory if possible

For parallel programs: output to separate files for each process: highest throughput, but usually needs postprocessing

Use library/compiler support for conversion between little/big endian of files used on different architectures

Avoid unnecessary open/close statements

Avoid explicit flushes of data to disk, except when needed for consistency reasons
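As a rough illustration of the "binary instead of formatted" and "few, large chunks" rules, the following C sketch writes the same array both ways and reports how many bytes each variant produces. The function names are invented for this example, and it only compares output size; the factor-3 performance figure quoted above is not something this sketch measures.

```c
#include <assert.h>
#include <stdio.h>

/* Write n doubles as text, one value per line.
 * Returns the file position afterwards (the byte count for a freshly
 * opened file), or -1 on error. */
long write_formatted(FILE *fp, const double *data, size_t n)
{
    assert(fp != NULL);
    for (size_t i = 0; i < n; i++) {
        if (fprintf(fp, "%.17g\n", data[i]) < 0) /* 17 digits keep full precision */
            return -1;
    }
    return ftell(fp);
}

/* Write n doubles as raw binary with a single fwrite call.
 * Returns the file position afterwards, or -1 on error. */
long write_binary(FILE *fp, const double *data, size_t n)
{
    assert(fp != NULL);
    if (fwrite(data, sizeof(double), n, fp) != n)
        return -1;
    return ftell(fp);
}
```

Besides producing more bytes, the formatted variant makes one library call per value and converts every number to decimal text, whereas the binary variant hands the whole array to the I/O layer in one request.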

I/O in Fortran

Parameters of the OPEN statement:

Specify what you want to do: read, write or both:

ACTION='READ' / 'WRITE' / 'READWRITE'

Perform direct access with large record length (if possible a multiple of the disk block size):

ACCESS='DIRECT', RECL=<record_length>

Use binary (unformatted) I/O (default for direct access)

FORM='UNFORMATTED'

If you need sequential formatted access, at least remember to access data in large chunks

Use buffering if possible/manually increase buffer size (~100MB)

Intel Fortran run-time system: additional parameters of open statement:

BUFFERED='yes', BUFFERCOUNT=10000

Such directives are usually proprietary

I/O in C

● Increase buffer size (~100MB): setvbuf (call before reading, writing or any other operation on the file)

● Perform unformatted instead of formatted IO: fwrite/fread instead of fprintf/fscanf

● For repositioning within the file use fseek

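Example (IO fully buffered):

#include <stdio.h>
#include <stdlib.h>

double data[SIZE];
char *myvbuf = malloc(100000000);          /* buffer must be allocated before setvbuf uses it */
FILE *fp;

fp = fopen(FILENAME, "w");
setvbuf(fp, myvbuf, _IOFBF, 100000000);    /* _IOFBF: fully buffered; call before any other operation on fp */

fseek(fp, 0, SEEK_SET);
fwrite(data, sizeof(double), SIZE, fp);

fclose(fp);                                /* close the stream before releasing its buffer */
free(myvbuf);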

MPI-I/O

Perform non-contiguous IO with MPI derived datatypes

Perform collective IO

Tell the MPI subsystem what you want to do (read, write, both, ...):

call MPI_Info_set(info, 'access_style', <style>, ierr)

where <style> can be 'write_once', 'read_once', 'write_mostly', 'read_mostly', 'sequential', ...

Pass additional hints to the MPI subsystem (unknown hints will be ignored)

many of these are implementation-dependent

Tuning I/O on Lustre: serial and MPI-parallel

Lustre striping factor:

lfs getstripe <filename>
shows striping factor of a file

lfs setstripe <directory> <stripe-size> <start-ost> <stripe-cnt>
sets striping size, factor and first OST for files created in directory

Example: lfs setstripe /lustre/a2832bf/bench 0 -1 12
(will stripe with default striping size (2 MB) over 12 OSTs)

Hints for MPI parallel I/O:

call MPI_Info_set(info, 'striping_unit', '<stripe-size>', ierr)

call MPI_Info_set(info, 'striping_factor', '<stripe-cnt>', ierr)

call MPI_Info_set(info, 'num_io_nodes', '<stripe-cnt>', ierr)

[Figure: file system setup. Each partition: ~1.25 GB/s in aggregate mode; 19 blades; $OPT_TMP. A further file system, $PROJECT, is available.]

Tuning I/O on CXFS: FFIO

glibc calls can be diverted to use an alternative I/O layer: Fast and Flexible IO

Prerequisites:

dynamic linkage at least against glibc

export LD_PRELOAD=/usr/lib/libFFIO.so

Optionally set variables: FF_IO_LOGFILE and FF_IO_OPEN_DIAGS

Set variable FF_IO_OPTS (mandatory!) to select

• file patterns

• I/O layers to be used

• performance-relevant parameters

Then run program as usual

man libffio for details

Example for FFIO usage

export FF_IO_OPTS='myfile.*(eie.direct.nodiag.mbytes:4096:64:6,event.mbytes.notrace)'

Affects all files with basename myfile.*

E(nhanced) I(ntelligence) E(ngineering) suboptions:

• direct: unbuffered I/O

• nodiag: no cache usage statistics reported

• mbytes: unit for logging

4096: page size

• units are 512 byte blocks

• use this or an integer multiple for LRZ system

• striping unit TP9700: 2 MByte

64: number of pages in FFIO cache

• low value enforces flushing to disk

• high value provides effective buffering

• choose according to other memory requirements of program

6: number of pages read-ahead if sequential access detected

• can improve read performance if suitably increased

Event layer (statistics):

• effectively unused here

• monitor I/O between layers
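The numbers in the example option string 4096:64:6 can be checked with a little arithmetic. The C sketch below is only a calculator for that arithmetic; the helper names are hypothetical and are not part of libFFIO.

```c
#include <assert.h>
#include <stdint.h>

#define FFIO_BLOCK_BYTES 512u /* the page size is given in 512-byte blocks */

/* Size of one cache page in bytes. */
uint64_t ffio_page_bytes(uint64_t page_blocks)
{
    return page_blocks * FFIO_BLOCK_BYTES;
}

/* Memory taken by one cache: number of pages times page size. */
uint64_t ffio_cache_bytes(uint64_t page_blocks, uint64_t num_pages)
{
    assert(num_pages > 0);
    return ffio_page_bytes(page_blocks) * num_pages;
}
```

For the values above, a page is 4096 × 512 bytes = 2 MByte, which matches the TP9700 striping unit; 64 such pages come to 128 MByte of cache, and a read-ahead of 6 pages corresponds to 12 MByte.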

FFIO for MPI programs

Can have separate FFIO settings for each MPI task

must use SGI MPT on Altix

replace FF_IO_OPTS by

export SGI_MPI=/usr/lib

export FF_IO_OPTS_RANK0=…

export FF_IO_OPTS_RANK1=…

Tuning MPI IO (XFS)

DMA transfers:

call MPI_Info_set(info, 'direct_read', 'true', ierr)

call MPI_Info_set(info, 'direct_write', 'true', ierr)

bypasses OS buffer cache, can improve performance in special cases, but usually leads to performance degradation (do not use, except when memory used by buffer cache needed for computation)

See FFIO description on previous slides

MPI IO Example

Writing a distributed array of REAL4 (6 processes) with MPI derived datatype (darray):

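[Figure: layout of the distributed array; extracted axis labels: 1024, 10, 24]

File system       Total    MB/s (non-collective)   MB/s (collective)
Lustre (6 OSTs)   12 GB    34                      102
Lustre (6 OSTs)   120 GB   51                      100
XFS               12 GB    270                     315
XFS               120 GB   189                     160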

Big/Little Endian issues: converting unformatted files

Environment variable specific to Intel-Fortran-generated binaries:

export F_UFMTENDIAN=MODE | [MODE;]EXCEPTION

where:

MODE = big | little

EXCEPTION = big:ULIST | little:ULIST | ULIST

ULIST = U | ULIST,U

U = decimal | decimal-decimal

Examples:

F_UFMTENDIAN=big              file format is big-endian for all units

F_UFMTENDIAN=big:9,12         big-endian for units 9 and 12, little-endian for others

F_UFMTENDIAN="big;little:8"   big-endian for all except unit 8

If F_UFMTENDIAN is unset: default value little
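To make concrete what such a conversion has to do for each value, here is a minimal C sketch of byte-order conversion for 4-byte quantities. It is not how the Intel run-time system implements F_UFMTENDIAN; the function names are invented for illustration, and the sketch assumes the host uses 32-bit IEEE 754 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reverse the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Decode a 4-byte big-endian float (such as a REAL4 value from a big-endian
 * file) on a host of either byte order: assemble the bytes arithmetically,
 * then reinterpret the bit pattern as a float. */
float be_bytes_to_float(const unsigned char b[4])
{
    assert(sizeof(float) == sizeof(uint32_t)); /* assumption: 32-bit float */

    uint32_t bits = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16)
                  | ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Doing this by hand for every value read is tedious and error-prone, which is why library or compiler support for the conversion is the more practical route.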

Converting Files: Alternatives for Intel Fortran

Use -convert switch at compilation

• will have effect on all units opened in source file

Use convert='…' keyword on OPEN statement

• will only affect opened I/O unit

• proprietary enhancement: code non-portable!

Both option and keyword can take various values:

• big_endian

• little_endian

• cray

• ibm

• …

See compiler documentation / language reference for detailed information
