High Performance Computing Course Notes 2007-2008 Parallel I/O.


Transcript of High Performance Computing Course Notes 2007-2008 Parallel I/O.

Page 1: High Performance Computing Course Notes 2007-2008 Parallel I/O.

High Performance Computing Course Notes 2007-2008

Parallel I/O

Page 2: High Performance Computing Course Notes 2007-2008 Parallel I/O.

High Performance Parallel I/O

Aims

To learn how to achieve higher I/O performance

To use a concrete implementation (MPI-IO):

Some concepts, including: etypes, displacements and views

Collective vs. non-collective I/O

Contiguous vs. non-contiguous I/O


Page 3: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Why are we looking at parallel I/O?

I/O is a major bottleneck in many parallel applications

I/O subsystems for parallel machines may be designed for high performance; however, many applications achieve less than a tenth of the peak I/O bandwidth

Parallel I/O systems are designed for large data transfers (megabytes of data)

However, many parallel applications make many small I/O requests (less than a kilobyte each)

Page 4: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 1.0

Phase 1: all processes send data to proc. 0

[Diagram: processes 0–3 hold data d0–d3 and send it to process 0]

Early solutions:

All processes send data to process 0, which then writes to file

Page 5: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 1.0

Phase 1: all processes send data to proc. 0

Phase 2: proc. 0 writes to file

[Diagram: processes 0–3 send d0–d3 to process 0, which writes d0 d1 d2 d3 to the file]

Early solutions:

All processes send data to process 0, which then writes to file

Page 6: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 1.0

Bad things about version 1.0

1. Single node bottleneck

2. Poor performance

3. Poor scalability

4. Single point of failure

Good things about version 1.0

The parallel machine needs to support I/O from only one process

No specialized I/O library is needed

If you are converting a sequential code, this parallel version of the program stays close to the original

Results in a single file which is easy to manage
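As a minimal sketch of version 1.0 (an illustration, not code from the notes; LOCAL_N and the file name output.dat are assumed), every process gathers its data to process 0, which then writes the single file with ordinary C I/O:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1024              /* elements per process (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double local_data[LOCAL_N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < LOCAL_N; i++)   /* fill with rank-tagged values */
        local_data[i] = rank * LOCAL_N + i;

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * LOCAL_N * sizeof(double));

    /* Phase 1: all processes send their data to process 0 */
    MPI_Gather(local_data, LOCAL_N, MPI_DOUBLE,
               all, LOCAL_N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Phase 2: process 0 alone writes the file */
    if (rank == 0) {
        FILE *f = fopen("output.dat", "wb");
        fwrite(all, sizeof(double), (size_t)nprocs * LOCAL_N, f);
        fclose(f);
        free(all);
    }

    MPI_Finalize();
    return 0;
}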


Page 7: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 2.0

All processes can now write in one phase: each process writes to a separate file

[Diagram: processes 0–3 write d0–d3 to File 1–File 4 respectively]

Page 8: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 2.0


Good things about version 2.0

1. Now we are doing things in parallel

2. High performance
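A matching sketch of version 2.0 (again illustrative; the output.<rank> file naming is an assumption): every process writes its own file, with no communication at all:

#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1024              /* elements per process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double local_data[LOCAL_N];
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < LOCAL_N; i++)
        local_data[i] = rank * LOCAL_N + i;

    /* Each process writes its own file, fully in parallel */
    snprintf(fname, sizeof fname, "output.%d", rank);
    FILE *f = fopen(fname, "wb");
    fwrite(local_data, sizeof(double), LOCAL_N, f);
    fclose(f);

    MPI_Finalize();
    return 0;
}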

Page 9: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 2.0


Bad things about version 2.0

1. We now have lots of small files to manage

2. How do we read the data back when #procs changes?

3. Does not interoperate well with other applications

Page 10: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 3.0

All processes can now write in one phase, to one common file: multiple processes of the parallel program access (read/write) data in a common file

[Diagram: processes 0–3 write d0–d3 into a single shared file]

Page 11: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 3.0

Good things about version 3.0

Simultaneous I/O from any number of processes

Maps well onto collective operations

Excellent performance and scalability

Results in a single file which is easy to manage and interoperates well with other applications

Bad things about version 3.0

Requires more complex I/O library support

Page 12: High Performance Computing Course Notes 2007-2008 Parallel I/O.

What is Parallel I/O?

Multiple processes of a parallel program accessing data (reading or writing) from a common file

[Diagram: processes P0, P1, P2, …, P(n-1) all accessing a single FILE]

Page 13: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Why Parallel I/O?

Non-parallel I/O

Simple

Poor performance – if a single process is writing to one file

Hard to interoperate with other applications – if writing to more than one file

Parallel I/O

Provides high performance

Provides a single file with which it is easy to interoperate with other tools (e.g. visualization systems)

If you design it right, you can use existing features of parallel libraries such as collectives and derived datatypes


Page 14: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Why Parallel I/O?

We are going to be looking at parallel I/O in the context of MPI, why?

Because writing is like sending a message, reading is like receiving

Because collective-like operations are important in parallel I/O

Because non-contiguous data layout is important (if we are using a single file), supported by MPI datatypes

Parallel I/O is now an integral part of MPI-2


Page 15: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O example

Consider an example of a 2D array distributed among 16 processors

Array stored in row-major order

[Diagram: the 2D array is block-distributed over a 4×4 grid of processes P0–P15; in the corresponding row-major file, each array row interleaves blocks from four processes, so the file runs P0 P1 P2 P3 (repeated for the first four array rows), then P4 P5 P6 P7, and so on]

Page 16: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 1: MPI_File_seek

Updates the individual file pointer

int MPI_File_seek( MPI_File mpi_fh, MPI_Offset offset, int whence );

Parameters

mpi_fh : [in] file handle (handle)

offset : [in] file offset (integer)

whence : [in] update mode (state)

MPI_FILE_SEEK updates the individual file pointer according to whence, which has the following possible values:

MPI_SEEK_SET: the pointer is set to offset

MPI_SEEK_CUR: the pointer is set to the current pointer position plus offset

MPI_SEEK_END: the pointer is set to the end of file plus offset

Page 17: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 1: MPI_File_read

Read using individual file pointer

int MPI_File_read( MPI_File mpi_fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status );

Parameters

mpi_fh: [in] file handle (handle)

buf: [out] initial address of buffer

count: [in] number of elements in buffer (nonnegative integer)

datatype: [in] datatype of each buffer element (handle)

status: [out] status object (Status)

Page 18: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 1

We could do a UNIX-style access pattern in MPI-IO

One independent read request is done for each row in the local array

Many independent, contiguous requests

MPI_File_open(… , "filename", … , &fh)
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, offset, …)
    MPI_File_read(fh, row[i], …)
}
MPI_File_close(&fh)

Individual file pointers per process per file handle

Each process sets the file pointer with some suitable offset

The data is then read into the local array

This is not a collective operation (the reads are independent, though still blocking)
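Filling in the elided arguments, a complete version of access pattern 1 for the 16×16 array of page 15 might look like the sketch below (the offset arithmetic and the float element type are a reconstruction, not taken from the notes):

#include <mpi.h>

#define N 16                      /* global array is N x N (from page 15) */
#define PDIM 4                    /* 4 x 4 process grid */
#define NL (N / PDIM)             /* local array is NL x NL */

int main(int argc, char **argv)
{
    int rank;
    float local[NL][NL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int prow = rank / PDIM;       /* position in the process grid */
    int pcol = rank % PDIM;

    MPI_File_open(MPI_COMM_WORLD, "filename",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* One independent seek + read per local row: many small requests */
    for (int i = 0; i < NL; i++) {
        MPI_Offset offset =
            ((MPI_Offset)(prow * NL + i) * N + pcol * NL) * sizeof(float);
        MPI_File_seek(fh, offset, MPI_SEEK_SET);
        MPI_File_read(fh, local[i], NL, MPI_FLOAT, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Note that each request here moves only NL * sizeof(float) = 16 bytes: exactly the many-small-requests behaviour warned about on page 3.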

Page 19: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 2: MPI_File_read_all

Collective read using individual file pointer

int MPI_File_read_all( MPI_File mpi_fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status );

Parameters

mpi_fh : [in] file handle (handle)

buf : [out] initial address of buffer (choice)

count : [in] number of elements in buffer (nonnegative integer)

datatype : [in] datatype of each buffer element (handle)

status : [out] status object (Status)

MPI_FILE_READ_ALL is a collective version of the blocking MPI_FILE_READ interface.

Page 20: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 2

Similar to access pattern 1 but using collectives

All processes that opened the file read data together (each with its own access information)

Many collective, contiguous requests

MPI_File_open(… , "filename", … , &fh)
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, offset, …)
    MPI_File_read_all(fh, row[i], …)
}
MPI_File_close(&fh)

read_all is a collective version of the read operation

This is blocking

Each process accesses the file at the same time

This may be useful, as independent I/O operations do not convey to the library what other processes are doing at the same time, whereas collective calls let it coordinate the accesses
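As a sketch of access pattern 2, the loop above can be rewritten with the explicit-offset collective MPI_File_read_at_all, which behaves like a seek followed by read_all and keeps the example compact (the surrounding code is the same reconstruction as before):

#include <mpi.h>

#define N 16
#define PDIM 4
#define NL (N / PDIM)

int main(int argc, char **argv)
{
    int rank;
    float local[NL][NL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int prow = rank / PDIM, pcol = rank % PDIM;

    MPI_File_open(MPI_COMM_WORLD, "filename",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Every process takes part in each collective read, so the library
       can coordinate and merge the requests */
    for (int i = 0; i < NL; i++) {
        MPI_Offset offset =
            ((MPI_Offset)(prow * NL + i) * N + pcol * NL) * sizeof(float);
        MPI_File_read_at_all(fh, offset, local[i], NL, MPI_FLOAT,
                             MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

The data moved is identical to pattern 1; only the coordination changes.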

Page 21: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: Definitions

File

Ordered collection of typed data items

MPI supports random or sequential access

Opened collectively by a group of processes

All collective I/O calls on file are done over this group

Displacement

Absolute byte position relative to the beginning of a file

Defines the location where a view begins

etype (elementary datatype)

Unit of data access and positioning

Can be a predefined or derived datatype

Offsets are expressed as multiples of etypes


Page 22: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: Definitions

Filetype

Basis for partitioning the file among processes and defines a template for accessing the file (based on etype)

View

Current set of data visible and accessible from an open file (as an ordered set of etypes)

Each process has its own view, based on a displacement, an etype and a filetype

Pattern defined by filetype is repeated (in units of etypes) beginning at the displacement


Page 23: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: File Views

Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view

displacement = number of bytes to be skipped from the start of the file

etype = basic unit of data access (can be any basic or derived datatype)

filetype = specifies which portion of the file is visible to the process

Page 24: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: A Simple Noncontiguous File View Example

etype = MPI_INT

filetype = two MPI_INTs followed by a gap of four MPI_INTs

[Diagram: from the head of the file, skip the displacement, after which the filetype pattern (two MPI_INTs of data, then a four-MPI_INT gap) tiles the file repeatedly]
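One way to construct this filetype (a sketch, not shown in the notes) is to take a contiguous block of two MPI_INTs and resize its extent to six MPI_INTs, so the four-int gap appears every time the pattern repeats:

#include <mpi.h>

/* Build the example filetype: two MPI_INTs of data followed by a gap
   of four MPI_INTs, repeating when the view tiles the file. */
void make_filetype(MPI_Datatype *filetype)
{
    MPI_Datatype contig;
    MPI_Type_contiguous(2, MPI_INT, &contig);            /* the two data ints */
    MPI_Type_create_resized(contig, 0,
                            (MPI_Aint)(6 * sizeof(int)), /* 2 data + 4 gap */
                            filetype);
    MPI_Type_commit(filetype);
    MPI_Type_free(&contig);
}

Passing this filetype, with etype MPI_INT, to MPI_File_set_view then exposes only the two-int data blocks to the calling process.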

Page 25: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: How do views relate to multiple processes?

[Diagram: processes 0, 1 and 2 have interleaved filetypes, all beginning after the same displacement, so their views tile the file without overlapping]

A group of processes uses complementary views to achieve a global data distribution, partitioning the file among the parallel processes

Page 26: High Performance Computing Course Notes 2007-2008 Parallel I/O.

MPI_File_set_view

Describes that part of the file accessed by a single MPI process.

int MPI_File_set_view( MPI_File mpi_fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info );

Parameters

mpi_fh :[in] file handle (handle)

disp :[in] displacement (nonnegative integer)

etype :[in] elementary datatype (handle)

filetype :[in] filetype (handle)

datarep :[in] data representation (string)

info :[in] info object (handle)

Page 27: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: File View Example

MPI_File thefile;

for (i = 0; i < BUFSIZE; i++)
    buf[i] = myrank * BUFSIZE + i;
MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &thefile);
MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                  MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);
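Note what the view achieves here: rank r skips r * BUFSIZE * sizeof(int) bytes and then writes its BUFSIZE integers contiguously, so with four processes testfile ends up holding the values 0 to 4*BUFSIZE-1 in order, one rank's block after another.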

Page 28: High Performance Computing Course Notes 2007-2008 Parallel I/O.

MPI_Type_create_subarray

Create a datatype for a subarray of a regular, multidimensional array

int MPI_Type_create_subarray( int ndims, int array_of_sizes[], int array_of_subsizes[], int array_of_starts[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype );

Parameters

ndims :[in] number of array dimensions (positive integer)

array_of_sizes :[in] number of elements of type oldtype in each dimension of the full array (array of positive integers)

array_of_subsizes :[in] number of elements of type oldtype in each dimension of the subarray (array of positive integers)

array_of_starts :[in] starting coordinates of the subarray in each dimension (array of nonnegative integers)

order :[in] array storage order flag (state)

oldtype :[in] array element datatype (handle)

newtype :[out] new datatype (handle)

Page 29: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Using the Subarray Datatype

gsizes[0] = 16;  /* no. of rows in global array */
gsizes[1] = 16;  /* no. of columns in global array */

psizes[0] = 4;   /* no. of procs. in vertical dimension */
psizes[1] = 4;   /* no. of procs. in horizontal dimension */

lsizes[0] = 16 / psizes[0];  /* no. of rows in local array */
lsizes[1] = 16 / psizes[1];  /* no. of columns in local array */

dims[0] = 4;  dims[1] = 4;
periods[0] = periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
MPI_Comm_rank(comm, &rank);
MPI_Cart_coords(comm, rank, 2, coords);

Page 30: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Subarray Datatype contd.

/* global indices of first element of local array */
start_indices[0] = coords[0] * lsizes[0];
start_indices[1] = coords[1] * lsizes[1];

MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

Page 31: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3

Each process creates a derived datatype to describe the non-contiguous access pattern

We thus have a file view and independent access

Single independent, non-contiguous request

MPI_Type_create_subarray(… , &subarray, …)
MPI_Type_commit(&subarray)
MPI_File_open(… , "filename", … , &fh)
MPI_File_set_view(fh, … , subarray, …)
MPI_File_read(fh, local_array, …)
MPI_File_close(&fh)

Creates a datatype describing a subarray of a multi-dimensional array

Commits the datatype (must be done before the datatype is used)

At commit time the system may compile an internal representation of the datatype

Page 32: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3


Opens the file as before

Now changes the process's view of the data in the file using set_view

set_view is collective

Although the reads are still independent

Page 33: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3

Note here that we are reading the whole sub-array despite the non-contiguous storage

[Diagram: the 4×4 process decomposition and row-major file layout from page 15; processes 0–3 use interleaved filetypes covering the first four array rows]

Processes {4,5,6,7}, {8,9,10,11} and {12,13,14,15} will have file views based on the same filetypes, but with different displacements

Page 34: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 4

Each process creates a derived datatype to describe the non-contiguous access pattern

We thus have a file view and collective access

Single collective, non-contiguous request

MPI_Type_create_subarray(… , &subarray, …)
MPI_Type_commit(&subarray)
MPI_File_open(… , "filename", … , &fh)
MPI_File_set_view(fh, … , subarray, …)
MPI_File_read_all(fh, local_array, …)
MPI_File_close(&fh)

Creates and commits datatype as before

Now changes the process's view of the data in the file using set_view

set_view is collective

Reads are now collective
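Putting pages 28–34 together, a complete sketch of access pattern 4 might read as follows (the 16×16 float array and 4×4 process grid come from the running example; the declarations and the absence of error handling are illustrative choices):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, coords[2];
    int gsizes[2] = {16, 16};          /* global array */
    int psizes[2] = {4, 4};            /* process grid */
    int lsizes[2], start_indices[2];
    int dims[2] = {4, 4}, periods[2] = {1, 1};
    MPI_Comm comm;
    MPI_Datatype filetype;
    MPI_File fh;
    float local_array[4][4];

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Cart_coords(comm, rank, 2, coords);

    lsizes[0] = gsizes[0] / psizes[0];
    lsizes[1] = gsizes[1] / psizes[1];
    start_indices[0] = coords[0] * lsizes[0];
    start_indices[1] = coords[1] * lsizes[1];

    /* Derived datatype describing this process's portion of the file */
    MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "filename", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

    /* Single collective, non-contiguous request */
    MPI_File_read_all(fh, local_array, lsizes[0] * lsizes[1], MPI_FLOAT,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Replacing MPI_File_read_all with MPI_File_read turns this into access pattern 3; nothing else changes.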

Page 35: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access patterns

These access patterns express four different styles of parallel I/O

You should choose your access pattern depending on the application

The larger the I/O request, the better the performance

Collective operations will do better than individual reads

Pattern 4 therefore (potentially) offers the best performance


Page 36: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Data Sieving

Data sieving is used to combine lots of small accesses into a single larger one

Remote file systems (parallel or not) tend to have high latencies

Reducing the number of operations is therefore important
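The sieving itself is done inside the MPI-IO library, but it can often be tuned through info hints, as in the sketch below (these hint names are ROMIO-specific, so check your implementation's documentation before relying on them):

#include <mpi.h>

/* Open a file for reading with ROMIO data-sieving hints set */
MPI_File open_with_sieving_hints(const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read", "enable");       /* sieve reads */
    MPI_Info_set(info, "ind_rd_buffer_size", "4194304"); /* 4 MB buffer */

    MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}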

Page 37: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Data Sieving Writes

Using data sieving for writes is more complicated

Must read the entire region first

Then make our changes

Then write the block back

Requires locking in the file system

Can result in false sharing

Page 38: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Two-Phase Collective I/O

Problems with independent, noncontiguous access: lots of small accesses, and independent data sieving reads lots of extra data

Idea: reorganize the access to match the layout on disk, with single processes using data sieving to get data for many

This often reduces total I/O through sharing of common blocks

A second "phase" moves data to its final destinations
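Two-phase I/O (which ROMIO calls collective buffering) is also applied automatically inside the library, but can be steered with hints; the sketch below uses ROMIO-specific hint names, which are an assumption rather than part of the MPI standard:

#include <mpi.h>

/* Open a file for reading with two-phase (collective buffering) hints */
MPI_File open_with_cb_hints(const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");    /* force two-phase reads */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB per aggregator */
    MPI_Info_set(info, "cb_nodes", "4");              /* number of aggregators */

    MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}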

Page 39: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Collective I/O

Collective I/O is coordinated access to storage by a group of processes

Collective I/O functions must be called by all processes participating in I/O

Allows the I/O layers to know more about the access as a whole