NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

I/O Strategies for the T3E

Jonathan Carter

NERSC User Services

T3E Overview

• The T3E is a set of Processing Elements (PEs) connected by a fast 3D torus.

• PEs do not have local disk

• All PEs access all filesystems equivalently

• Path for I/O generally looks like:

– user buffer space

– system buffer space

– I/O device buffer space

Filesystems

• /usr/tmp

– fast

– subject to 14-day purge, not backed up

– check quota with quota -s /usr/tmp (usually 75 GB and 6000 inodes)

• $TMPDIR

– fast

– purged at end of job or session

– shares quota with /usr/tmp

• $HOME

– slower

– permanent, backed up

– check quota with quota (usually 2 GB and 3500 inodes)

Types of I/O

• Language I/O: Fortran or C (ANSI or POSIX)

• Cray FFIO library (can be used from Fortran or C)

• MPI I/O

• Cray extensions to Fortran and C I/O (mostly for compatibility with PVP systems)

I/O Strategies - Exclusive access files

• Each PE reads and writes to a separate file

– Language I/O

– MPI I/O

– Increase language I/O performance with FFIO library (C must use POSIX style calls)
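
As a rough illustration of the one-file-per-PE strategy, the Python sketch below writes each PE's data to its own file. The naming scheme (base name plus zero-padded rank) and the helper names pe_filename and write_private_files are assumptions made for this example, not something the T3E or FFIO prescribes, and the loop stands in for PEs that would really run in parallel.

```python
import os
import tempfile

def pe_filename(base, rank, width=4):
    """Return a per-PE file name such as 'restart.0003' (hypothetical naming scheme)."""
    return f"{base}.{rank:0{width}d}"

def write_private_files(directory, base, chunks):
    """Write one file per PE; chunks[rank] holds the bytes owned by that PE."""
    paths = []
    for rank, data in enumerate(chunks):
        path = os.path.join(directory, pe_filename(base, rank))
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths
```

Because no two PEs touch the same file, no coordination is needed, at the cost of one inode per PE against the quotas listed earlier.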

I/O Strategies - Communication and I/O PE

• One PE coordinates reading and writing and communicates data back and forth between other PEs via message passing

– Language I/O

– MPI I/O

– Increase language I/O performance with FFIO library
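
The data movement in the single I/O PE strategy can be sketched without any message-passing library. In the Python below, gather_to_io_pe and scatter_from_io_pe are hypothetical helpers that only model what the I/O PE assembles before a write and hands out after a read; a real code would do these steps with message passing.

```python
def gather_to_io_pe(chunks_by_rank):
    """Model the gather step: the I/O PE receives one chunk per PE and orders them by rank."""
    return b"".join(chunks_by_rank[rank] for rank in sorted(chunks_by_rank))

def scatter_from_io_pe(data, sizes):
    """Model the reverse step: split what the I/O PE read into per-PE chunks."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append(data[start:start + size])
        start += size
    return chunks
```

The I/O PE needs enough memory to hold the assembled data, which is the main cost of this approach.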

I/O Strategies - Shared files

• All PEs read and write the same file simultaneously

– Language I/O with FFIO library global layer

– MPI I/O

– Language I/O with FFIO library global layer and Cray extensions for additional flexibility
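
The idea behind a shared file is that each PE writes to its own region of one file. The Python sketch below shows only the offset arithmetic, using POSIX positional writes and assuming fixed-size chunks; write_shared is a hypothetical helper and is not how the FFIO global layer or MPI I/O is implemented.

```python
import os
import tempfile

def write_shared(path, rank, chunk_bytes, data):
    """Write this PE's fixed-size chunk at offset rank * chunk_bytes in the shared file."""
    if len(data) != chunk_bytes:
        raise ValueError("every PE must write exactly chunk_bytes bytes")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # pwrite takes an explicit offset, so no shared file pointer is involved.
        os.pwrite(fd, data, rank * chunk_bytes)
    finally:
        os.close(fd)
```

The order in which PEs write does not matter, because every region is disjoint.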

Cray FFIO library

• FFIO is a set of I/O layers tuned for different I/O characteristics

• Buffering of data (configurable size)

• Caching of data (configurable size)

• Available to regular Fortran I/O without reprogramming

• Available for C through POSIX-like calls, e.g. ffopen, ffwrite

The assign command

• The assign command controls:

– which FFIO layer is active

– striping across multiple partitions

– lots more

• Scope of assign:

– File name

– Fortran unit number

– File type (e.g. all sequential unformatted files)

assign Examples

• Read and write to file restart.file from all PEs by using the FFIO library global layer

assign -F global:128:2 f:restart.file

• Use the FFIO library bufa layer to improve performance for the file opened on Fortran unit 10

assign -F bufa:128:2 u:10

• Use the FFIO library bufa layer to improve performance for all unformatted sequential Fortran files

assign -F bufa:128:2 g:su

assign Examples (cont.)

• To see all active assigns

assign -V

• To remove all active assigns

assign -R

bufa FFIO layer

• bufa is an asynchronous buffering layer

• performs read-ahead, write-behind

• specify buffer size with -F bufa:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers

• buffer space increases your application's memory requirements
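
To estimate how much memory a given bufa specification adds, multiply the buffer size in blocks by the block size and by the number of buffers. The Python sketch below does that arithmetic; the helper names are hypothetical, and the 4096-byte block is taken from the "4Kbyte blocks" unit stated above.

```python
BLOCK_BYTES = 4096  # one block, per the "4Kbyte blocks" unit on this slide

def parse_layer_spec(spec):
    """Split a spec such as 'bufa:128:2' into (layer, bs, nbufs)."""
    layer, bs, nbufs = spec.split(":")
    return layer, int(bs), int(nbufs)

def buffer_memory_bytes(spec):
    """Memory used by the buffers of one layer spec, in bytes."""
    _, bs, nbufs = parse_layer_spec(spec)
    return bs * BLOCK_BYTES * nbufs
```

For bufa:128:2 this gives 128 x 4096 x 2 = 1,048,576 bytes, i.e. 1 Mbyte of extra buffer space for each file the layer is applied to.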

global FFIO layer

• global is a caching and buffering layer which enables multiple PEs to read and write to the same file

• if one PE has already read the data, an additional read request from another PE will result in a remote memory copy

• file open is a synchronizing event

• By default, all PEs must open a global file; this can be changed by calling GLIO_GROUP_MPI(comm)

• specify buffer size with -F global:bs:nbufs where bs is the buffer size in units of 4Kbyte blocks, and nbufs is the number of buffers per PE
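
The remote-copy behavior described above can be pictured with a toy model. The Python class below is only a conceptual sketch and not the FFIO implementation: the class name, the "disk"/"local"/"remote" labels, and the absence of any eviction policy are all assumptions made for illustration.

```python
class GlobalCacheModel:
    """Toy model of a cache distributed across PEs (ignores eviction and writes)."""

    def __init__(self, block_bytes=128 * 4096):
        self.block_bytes = block_bytes  # cache page size, e.g. bs=128 blocks of 4 Kbytes
        self.owner = {}                 # page index -> rank that first read it
        self.disk_reads = 0
        self.remote_copies = 0

    def read(self, rank, offset):
        """Report where a read at this byte offset would be satisfied from."""
        page = offset // self.block_bytes
        if page not in self.owner:
            self.owner[page] = rank
            self.disk_reads += 1
            return "disk"
        if self.owner[page] == rank:
            return "local"
        self.remote_copies += 1
        return "remote"
```

In this model a second PE touching an already cached page costs a memory copy rather than another disk access, which is the benefit the global layer aims for.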

File positioning with the global FFIO layer

• Positioning of a read or write is your responsibility

• File pointers are private

• Fortran

– Use a direct access file, and read/write(rec=num)

– Use Cray extensions setpos and getpos to position file pointer (not portable)

• C

– Use ffseek
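
With a direct access file, positioning reduces to choosing record numbers so that PEs do not overlap. The Python sketch below shows the arithmetic, assuming fixed-length records and a contiguous block of records per PE; both helper names and the block layout are assumptions for this example.

```python
def record_offset(rec, recl_bytes):
    """Byte offset of 1-based record number rec in a file of fixed-length records."""
    if rec < 1:
        raise ValueError("Fortran record numbers start at 1")
    return (rec - 1) * recl_bytes

def first_record_for_pe(me, recs_per_pe):
    """First record owned by PE me (0-based) when each PE owns recs_per_pe records."""
    return me * recs_per_pe + 1
```

For example, with 10 records per PE, PE 2 would start at rec=21, and the same byte offset could be passed to ffseek from C.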

FFIO considerations

• Examples above use an unblocked file structure; normal Fortran files are blocked. To read the file without the global or bufa layers you must use

assign -s unblocked f:filename

• bufa and global do not allow backspace, or skipping over a partially read record. You can allow this behavior by using the cos layer in addition to bufa or global, but then setpos doesn’t work.

assign -F cos:128,bufa:128:2 f:filename

More on FFIO

• There are many other FFIO layers, some pretty obscure

– cache and cachea layers, good for random access files

• man intro_ffio for a terse description

• Cray Publication - Application Programmer’s I/O Guide

More on assign

• Many text processing options

• Switch between Fortran 77 and Fortran 90 namelist

• File pre-allocation

• File striping

Further Information

• I/O on the T3E Tutorial by Richard Gerber at http://home.nersc.gov/training/tutorials

• Cray Publication - Application Programmer’s I/O Guide

• Cray Publication - Cray T3E Fortran Optimization Guide

• man assign

MPI I/O

• Part of MPI-2

• Interface for High Performance Parallel I/O

– data partitioning

– collective I/O

– asynchronous I/O

– portability and interoperability

MPI I/O Definitions

• An MPI file is an ordered collection of MPI types.

• A file may be opened individually or collectively by a group of processes

• The fileview defines a template for accessing the file and is used to partition the file amongst processes

Fileviews

• A fileview is composed of three pieces:

– a displacement (in bytes) from the beginning of the file

– an elementary datatype (etype), which is the unit of data access and positioning within the file

– a filetype, which defines a template for accessing the file. A filetype can contain etypes or holes of the same extent as etypes.

Fileviews (cont.)

• The filetype pattern is repeated, “tiling” the file

• Only the non-empty slots are available to read or write
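
To see how tiling determines which bytes a process can reach, the Python sketch below lists the byte offsets visible through a fileview. Representing the filetype as a list of booleans (True for an etype, False for a hole) is a simplification chosen for this example, and visible_offsets is a hypothetical helper, not an MPI routine.

```python
def visible_offsets(disp, etype_size, filetype, count):
    """Byte offsets of the first `count` etypes visible through a fileview.

    filetype has one entry per etype-sized slot: True = data, False = hole.
    """
    if not any(filetype):
        raise ValueError("filetype must contain at least one non-hole slot")
    offsets = []
    tile = 0
    while len(offsets) < count:
        # Repeat the filetype pattern, skipping the holes in each tile.
        for slot, is_data in enumerate(filetype):
            if is_data and len(offsets) < count:
                offsets.append(disp + (tile * len(filetype) + slot) * etype_size)
        tile += 1
    return offsets
```

With 4-byte etypes and the pattern [True, False, False], a process sees offsets 0, 12, 24, and so on; two other processes using the patterns [False, True, False] and [False, False, True] would cover the remaining slots without overlap.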

Fileviews (cont.)

• Each process can have a different filetype

[Figure: filetypes for Process 0, Process 1, and Process 2]

MPI_File_set_view

• Called after MPI_File_open to set fileview

• MPI_File_set_view(fh, disp, etype, filetype, datarep, info)

– fh is a file handle

– disp, etype, and filetype define the fileview

– datarep is one of “native”, “internal”, or “external32”

– info is a set of hints to optimize performance

MPI Info object

• An info object bundles up a set of parameters

integer finfo
call MPI_Info_create(finfo, ierr)
call MPI_Info_set(finfo, 'access_style', 'write_mostly', ierr)

• MPI I/O defines a set of parameters used to help optimize I/O performance

• MPI_INFO_NULL can be used instead of an info object

Open and Close

• MPI_File_open(comm, filename, amode, info, fh)

– comm: communicator; the open is collective over this communicator

– filename: string or character variable

– amode: file access mode, e.g. MPI_MODE_RDONLY, MPI_MODE_RDWR

– info: info object, used to pass hints to open

– fh: file handle

• MPI_File_close(fh)

Utility routines

• MPI_File_delete

• MPI_File_set_size

• MPI_File_preallocate

• MPI_File_set_info

Query routines

• MPI_File_get_size

• MPI_File_get_group

• MPI_File_get_amode

• MPI_File_get_info

• MPI_File_get_view

Data access routines

• Positioning

– Explicit, each call has an offset

– Individual, each PE maintains an individual file pointer

– Shared, the file pointer is maintained globally

• Synchronism

– Blocking, routine returns when complete

– Non-blocking, must call a termination routine to ensure completion

• Coordination

– Non-collective

– Collective

Summary of access routines

Positioning   Synchronism    Non-collective        Collective
Explicit      Blocking       READ_AT               READ_AT_ALL
Explicit      Non-blocking   IREAD_AT, WAIT        READ_AT_ALL_BEGIN, READ_AT_ALL_END
Individual    Blocking       READ                  READ_ALL
Individual    Non-blocking   IREAD, WAIT           READ_ALL_BEGIN, READ_ALL_END
Shared        Blocking       READ_SHARED           READ_ORDERED
Shared        Non-blocking   IREAD_SHARED, WAIT    READ_ORDERED_BEGIN, READ_ORDERED_END
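
The routine names in this summary follow a regular pattern, which the Python sketch below reconstructs from the three choices. The function read_routine is a hypothetical helper for illustration; it returns the MPI-2 read routine names only and leaves out the MPI_WAIT call that completes the non-collective non-blocking forms.

```python
def read_routine(positioning, blocking, collective):
    """Return the MPI-2 read routine name(s) for one combination of choices."""
    suffix = {"explicit": "_AT", "individual": "", "shared": "_SHARED"}[positioning]
    if not collective:
        prefix = "READ" if blocking else "IREAD"
        return ["MPI_FILE_" + prefix + suffix]
    # Collective access through the shared file pointer uses the ORDERED names.
    base = "READ_ORDERED" if positioning == "shared" else "READ" + suffix + "_ALL"
    if blocking:
        return ["MPI_FILE_" + base]
    # Non-blocking collective access is split into a BEGIN and an END call.
    return ["MPI_FILE_" + base + "_BEGIN", "MPI_FILE_" + base + "_END"]
```

The write routines follow the same pattern with WRITE in place of READ.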

Summary of access routines (cont.)

• MPI_File_seek

• MPI_File_get_position

• MPI_File_get_byte_offset

• MPI_File_seek_shared (collective)

• MPI_File_get_position_shared

T3E Implementation

• No shared file pointers

• No non-blocking collective (split collective)

• SPR filed on non-blocking read

• Work in progress

Examples

• All the program fragments are available as working programs on the T3E

• Do “module load training”, then look in $EXAMPLES/mpi_io

• All examples are of a distributed dot product

– initialize data with random numbers

– compute dot product of whole vector

– write out data into a shared file

– read back in and check dot product

[Figure: PE 0, PE 1, PE 2]

Naming convention

• First letter is positioning: explicit, individual, or shared

• Second letter is synchronism: blocking or non-blocking

• Third letter is coordination: non-collective or collective

• ebn.f90 is the explicit, blocking non-collective example

• There are several “ibn” examples dealing with different fileviews
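
The naming convention can be decoded mechanically, as in the Python sketch below. Only ebn.f90 is named on this slide; the letter choices for the other categories (for example "n" for non-blocking and "c" for collective) and the helper decode_example_name are assumptions made for illustration.

```python
POSITIONING = {"e": "explicit", "i": "individual", "s": "shared"}
SYNCHRONISM = {"b": "blocking", "n": "non-blocking"}   # "n" is an assumed letter
COORDINATION = {"n": "non-collective", "c": "collective"}  # "c" is an assumed letter

def decode_example_name(filename):
    """Decode the first three letters of an example file name such as 'ebn.f90'."""
    stem = filename.split(".")[0]
    if len(stem) < 3:
        raise ValueError("expected at least three letters before the extension")
    return POSITIONING[stem[0]], SYNCHRONISM[stem[1]], COORDINATION[stem[2]]
```

Only the first three letters are examined, so variants of one combination that carry extra characters in the name would decode the same way.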

Filetype Example

[Figure: filetypes for Process 0, Process 1, and Process 2]

Filetype Example (cont.)

filemode = MPI_MODE_RDWR + MPI_MODE_CREATE

call MPI_INFO_CREATE(finfo, ierr)
call MPI_INFO_SET(finfo, 'access_style', 'write_mostly', ierr)

call MPI_FILE_OPEN(MPI_COMM_WORLD, 'vector', filemode, &
     finfo, fhv, ierr)

! each process owns m contiguous elements, starting at element m*me
call MPI_TYPE_CREATE_SUBARRAY(1, m*nprocs, m, m*me, &
     MPI_ORDER_FORTRAN, MPI_REAL, mpi_fileslice, ierr)
! a derived datatype must be committed before it is used in a fileview
call MPI_TYPE_COMMIT(mpi_fileslice, ierr)

disp = 0
call MPI_FILE_SET_VIEW(fhv, disp, MPI_REAL, mpi_fileslice, &
     'native', MPI_INFO_NULL, ierr)
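
The subarray arguments above give each process a slice of m elements starting at element m*me of a global vector of m*nprocs elements. The Python sketch below checks that such slices cover the file exactly once; slice_bounds and covers_whole_file are hypothetical helpers that only mirror the arithmetic and do not call MPI.

```python
def slice_bounds(m, nprocs, me):
    """Element range [start, stop) of the global vector owned by process me (0-based)."""
    if not 0 <= me < nprocs:
        raise ValueError("me must be in the range 0..nprocs-1")
    start = m * me
    return start, start + m

def covers_whole_file(m, nprocs):
    """True if the per-process slices cover all m*nprocs elements with no overlap."""
    owned = []
    for me in range(nprocs):
        start, stop = slice_bounds(m, nprocs, me)
        owned.extend(range(start, stop))
    return owned == list(range(m * nprocs))
```

With m = 4 and 3 processes, process 1 owns elements 4 through 7 of the 12-element file.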

Individual, blocking, non-collective

call MPI_FILE_WRITE(fhv, b, m, MPI_REAL, status, ierr)

lresult = sdot(m, b, 1, b, 1)
call MPI_REDUCE(lresult, result, 1, MPI_REAL, MPI_SUM, 0, &
     MPI_COMM_WORLD, ierr)

if (me.eq.0) then
   write(6,*) 'dot product: ', result
end if

! zero vector and read it back in
b = 0.0

disp = 0
call MPI_FILE_SEEK(fhv, disp, MPI_SEEK_SET, ierr)
call MPI_FILE_READ(fhv, b, m, MPI_REAL, status, ierr)
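
The check performed by the examples rests on the fact that the dot product of the whole vector equals the sum of the per-PE dot products. The Python sketch below illustrates this with plain lists standing in for each PE's slice b; local_dot plays the role of sdot and the outer sum plays the role of MPI_REDUCE with MPI_SUM. Both function names are assumptions for this example.

```python
def local_dot(b):
    """Dot product of one PE's slice with itself, like sdot(m, b, 1, b, 1)."""
    return sum(x * x for x in b)

def distributed_dot(slices):
    """Sum of the local dot products, standing in for MPI_REDUCE with MPI_SUM."""
    return sum(local_dot(b) for b in slices)
```

After the data is read back, recomputing the same sum and comparing it with the value printed earlier confirms that each PE recovered its own slice.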

Further Information on MPI I/O

• MPI: The Complete Reference

– Volume 1, The MPI Core

– Volume 2, The MPI Extensions