High Performance Computing Course Notes 2007-2008 Parallel I/O.


Transcript of High Performance Computing Course Notes 2007-2008 Parallel I/O.

Page 1: High Performance Computing Course Notes 2007-2008 Parallel I/O.

High Performance Computing Course Notes 2007-2008

Parallel I/O

Page 2: High Performance Computing Course Notes 2007-2008 Parallel I/O.

High Performance Parallel I/O

Aims

To learn how to achieve higher I/O performance

To use a concrete implementation (MPI-IO):

Some concepts, including: etypes, displacements and views

Collective vs. non-collective I/O

Contiguous vs. non-contiguous I/O


Page 3: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Why are we looking at parallel I/O?

I/O is a major bottleneck in many parallel applications

I/O subsystems for parallel machines may be designed for high performance; however, many applications achieve less than a tenth of the peak I/O bandwidth

Parallel I/O systems are designed for large data transfers (megabytes of data)

However, many parallel applications make many small I/O requests (less than a kilobyte each)

Page 4: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 1.0

Phase 1: all processes send data to proc. 0

[Diagram: processes 0–3 hold data d0–d3 and send it to process 0]

Early solutions:

All processes send data to process 0, which then writes to file

Page 5: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 1.0

Phase 1: all processes send data to proc. 0

Phase 2: proc. 0 writes to file

[Diagram: processes 0–3 send d0–d3 to process 0, which writes d0 d1 d2 d3 to the file]

Early solutions:

All processes send data to process 0, which then writes to file

Page 6: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 1.0

Bad things about version 1.0

1. Single node bottleneck

2. Poor performance

3. Poor scalability

4. Single point of failure

Good things about version 1.0

The parallel machine needs to support I/O from only one process

No specialized I/O library is needed

If you are converting a sequential code, this parallel version of the program stays close to the original

Results in a single file which is easy to manage
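As a minimal sketch of version 1.0 (an illustration, not code from the notes; LOCAL_N and the file name output.dat are assumed), every process gathers its data to process 0, which then writes the single file with ordinary C I/O:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1024              /* elements per process (assumed) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    double local_data[LOCAL_N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    for (int i = 0; i < LOCAL_N; i++)   /* fill with rank-tagged values */
        local_data[i] = rank * LOCAL_N + i;

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * LOCAL_N * sizeof(double));

    /* Phase 1: all processes send their data to process 0 */
    MPI_Gather(local_data, LOCAL_N, MPI_DOUBLE,
               all, LOCAL_N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Phase 2: process 0 alone writes the file */
    if (rank == 0) {
        FILE *f = fopen("output.dat", "wb");
        fwrite(all, sizeof(double), (size_t)nprocs * LOCAL_N, f);
        fclose(f);
        free(all);
    }

    MPI_Finalize();
    return 0;
}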


Page 7: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 2.0

All processes can now write in one phase: each process writes to a separate file

[Diagram: processes 0–3 write d0–d3 to File 1–File 4 respectively]

Page 8: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 2.0


Good things about version 2.0

1. Now we are doing things in parallel

2. High performance
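A matching sketch of version 2.0 (again illustrative; the output.<rank> file naming is an assumption): every process writes its own file, with no communication at all:

#include <mpi.h>
#include <stdio.h>

#define LOCAL_N 1024              /* elements per process (assumed) */

int main(int argc, char **argv)
{
    int rank;
    double local_data[LOCAL_N];
    char fname[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < LOCAL_N; i++)
        local_data[i] = rank * LOCAL_N + i;

    /* Each process writes its own file, fully in parallel */
    snprintf(fname, sizeof fname, "output.%d", rank);
    FILE *f = fopen(fname, "wb");
    fwrite(local_data, sizeof(double), LOCAL_N, f);
    fclose(f);

    MPI_Finalize();
    return 0;
}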

Page 9: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 2.0


Bad things about version 2.0

1. We now have lots of small files to manage

2. How do we read the data back when #procs changes?

3. Does not interoperate well with other applications

Page 10: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 3.0

All processes can now write in one phase, to one common file: multiple processes of the parallel program access (read/write) data in a common file

[Diagram: processes 0–3 write d0–d3 into a single shared file]

Page 11: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O – version 3.0

Good things about version 3.0

Simultaneous I/O from any number of processes

Maps well onto collective operations

Excellent performance and scalability

Results in a single file which is easy to manage and interoperates well with other applications

Bad things about version 3.0

Requires more complex I/O library support

Page 12: High Performance Computing Course Notes 2007-2008 Parallel I/O.

What is Parallel I/O?

Multiple processes of a parallel program accessing data (reading or writing) from a common file

[Diagram: processes P0, P1, P2, …, P(n-1) all accessing a single FILE]

Page 13: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Why Parallel I/O?

Non-parallel I/O

Simple

Poor performance – if a single process is writing to one file

Hard to interoperate with other applications – if writing to more than one file

Parallel I/O

Provides high performance

Provides a single file with which it is easy to interoperate with other tools (e.g. visualization systems)

If you design it right, you can use existing features of parallel libraries such as collectives and derived datatypes


Page 14: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Why Parallel I/O?

We are going to be looking at parallel I/O in the context of MPI, why?

Because writing is like sending a message, reading is like receiving

Because collective-like operations are important in parallel I/O

Because non-contiguous data layout is important (if we are using a single file), supported by MPI datatypes

Parallel I/O is now an integral part of MPI-2


Page 15: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Parallel I/O example

Consider an example of a 2D array distributed among 16 processors

Array stored in row-major order

[Diagram: the 2D array is block-distributed over a 4×4 grid of processes P0–P15; in the corresponding row-major file, each array row interleaves blocks from four processes, so the file runs P0 P1 P2 P3 (repeated for the first four array rows), then P4 P5 P6 P7, and so on]

Page 16: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 1: MPI_File_seek

Updates the individual file pointer

int MPI_File_seek( MPI_File mpi_fh, MPI_Offset offset, int whence );

Parameters

mpi_fh : [in] file handle (handle)

offset : [in] file offset (integer)

whence : [in] update mode (state)

MPI_FILE_SEEK updates the individual file pointer according to whence, which has the following possible values:

MPI_SEEK_SET: the pointer is set to offset

MPI_SEEK_CUR: the pointer is set to the current pointer position plus offset

MPI_SEEK_END: the pointer is set to the end of file plus offset

Page 17: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 1: MPI_File_read

Read using individual file pointer

int MPI_File_read( MPI_File mpi_fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status );

Parameters

mpi_fh: [in] file handle (handle)

buf: [out] initial address of buffer

count: [in] number of elements in buffer (nonnegative integer)

datatype: [in] datatype of each buffer element (handle)

status: [out] status object (Status)

Page 18: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 1

We could do a UNIX-style access pattern in MPI-IO

One independent read request is done for each row in the local array

Many independent, contiguous requests

MPI_File_open(… , "filename", … , &fh)
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, offset, …)
    MPI_File_read(fh, row[i], …)
}
MPI_File_close(&fh)

Individual file pointers per process per file handle

Each process sets the file pointer with some suitable offset

The data is then read into the local array

This is not a collective operation (the reads are independent, though still blocking)
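Filling in the elided arguments, a complete version of access pattern 1 for the 16×16 array of page 15 might look like the sketch below (the offset arithmetic and the float element type are a reconstruction, not taken from the notes):

#include <mpi.h>

#define N 16                      /* global array is N x N (from page 15) */
#define PDIM 4                    /* 4 x 4 process grid */
#define NL (N / PDIM)             /* local array is NL x NL */

int main(int argc, char **argv)
{
    int rank;
    float local[NL][NL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int prow = rank / PDIM;       /* position in the process grid */
    int pcol = rank % PDIM;

    MPI_File_open(MPI_COMM_WORLD, "filename",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* One independent seek + read per local row: many small requests */
    for (int i = 0; i < NL; i++) {
        MPI_Offset offset =
            ((MPI_Offset)(prow * NL + i) * N + pcol * NL) * sizeof(float);
        MPI_File_seek(fh, offset, MPI_SEEK_SET);
        MPI_File_read(fh, local[i], NL, MPI_FLOAT, MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

Note that each request here moves only NL * sizeof(float) = 16 bytes: exactly the many-small-requests behaviour warned about on page 3.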

Page 19: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 2: MPI_File_read_all

Collective read using individual file pointer

int MPI_File_read_all( MPI_File mpi_fh, void *buf, int count, MPI_Datatype datatype, MPI_Status *status );

Parameters

mpi_fh : [in] file handle (handle)

buf : [out] initial address of buffer (choice)

count : [in] number of elements in buffer (nonnegative integer)

datatype : [in] datatype of each buffer element (handle)

status : [out] status object (Status)

MPI_FILE_READ_ALL is a collective version of the blocking MPI_FILE_READ interface.

Page 20: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 2

Similar to access pattern 1 but using collectives

All processes that opened the file read data together (each with its own access information)

Many collective, contiguous requests

MPI_File_open(… , "filename", … , &fh)
for (i = 0; i < n_local_rows; i++) {
    MPI_File_seek(fh, offset, …)
    MPI_File_read_all(fh, row[i], …)
}
MPI_File_close(&fh)

read_all is a collective version of the read operation

This is blocking

Each process accesses the file at the same time

This may be useful, as independent I/O operations do not convey to the library what other processes are doing at the same time, whereas collective calls let it coordinate the accesses
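As a sketch of access pattern 2, the loop above can be rewritten with the explicit-offset collective MPI_File_read_at_all, which behaves like a seek followed by read_all and keeps the example compact (the surrounding code is the same reconstruction as before):

#include <mpi.h>

#define N 16
#define PDIM 4
#define NL (N / PDIM)

int main(int argc, char **argv)
{
    int rank;
    float local[NL][NL];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int prow = rank / PDIM, pcol = rank % PDIM;

    MPI_File_open(MPI_COMM_WORLD, "filename",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Every process takes part in each collective read, so the library
       can coordinate and merge the requests */
    for (int i = 0; i < NL; i++) {
        MPI_Offset offset =
            ((MPI_Offset)(prow * NL + i) * N + pcol * NL) * sizeof(float);
        MPI_File_read_at_all(fh, offset, local[i], NL, MPI_FLOAT,
                             MPI_STATUS_IGNORE);
    }

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}

The data moved is identical to pattern 1; only the coordination changes.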

Page 21: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: Definitions

File

Ordered collection of typed data items

MPI supports random or sequential access

Opened collectively by a group of processes

All collective I/O calls on file are done over this group

Displacement

Absolute byte position relative to the beginning of a file

Defines the location where a view begins

etype (elementary datatype)

Unit of data access and positioning

Can be a predefined or derived datatype

Offsets are expressed as multiples of etypes


Page 22: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: Definitions

Filetype

Basis for partitioning the file among processes and defines a template for accessing the file (based on etype)

View

Current set of data visible and accessible from an open file (as an ordered set of etypes)

Each process has its own view, based on a displacement, an etype and a filetype

Pattern defined by filetype is repeated (in units of etypes) beginning at the displacement


Page 23: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: File Views

Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view

displacement = number of bytes to be skipped from the start of the file

etype = basic unit of data access (can be any basic or derived datatype)

filetype = specifies which portion of the file is visible to the process

Page 24: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: A Simple Noncontiguous File View Example

etype = MPI_INT

filetype = two MPI_INTs followed by a gap of four MPI_INTs

[Diagram: from the head of the file, skip the displacement, after which the filetype pattern (two MPI_INTs of data, then a four-MPI_INT gap) tiles the file repeatedly]
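One way to construct this filetype (a sketch, not shown in the notes) is to take a contiguous block of two MPI_INTs and resize its extent to six MPI_INTs, so the four-int gap appears every time the pattern repeats:

#include <mpi.h>

/* Build the example filetype: two MPI_INTs of data followed by a gap
   of four MPI_INTs, repeating when the view tiles the file. */
void make_filetype(MPI_Datatype *filetype)
{
    MPI_Datatype contig;
    MPI_Type_contiguous(2, MPI_INT, &contig);            /* the two data ints */
    MPI_Type_create_resized(contig, 0,
                            (MPI_Aint)(6 * sizeof(int)), /* 2 data + 4 gap */
                            filetype);
    MPI_Type_commit(filetype);
    MPI_Type_free(&contig);
}

Passing this filetype, with etype MPI_INT, to MPI_File_set_view then exposes only the two-int data blocks to the calling process.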

Page 25: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: How do views relate to multiple processes?

[Diagram: processes 0, 1 and 2 have interleaved filetypes, all beginning after the same displacement, so their views tile the file without overlapping]

A group of processes uses complementary views to achieve a global data distribution, partitioning the file among the parallel processes

Page 26: High Performance Computing Course Notes 2007-2008 Parallel I/O.

MPI_File_set_view

Describes that part of the file accessed by a single MPI process.

int MPI_File_set_view( MPI_File mpi_fh, MPI_Offset disp, MPI_Datatype etype, MPI_Datatype filetype, char *datarep, MPI_Info info );

Parameters

mpi_fh :[in] file handle (handle)

disp :[in] displacement (nonnegative integer)

etype :[in] elementary datatype (handle)

filetype :[in] filetype (handle)

datarep :[in] data representation (string)

info :[in] info object (handle)

Page 27: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3: File View Example

MPI_File thefile;

for (i = 0; i < BUFSIZE; i++)
    buf[i] = myrank * BUFSIZE + i;
MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &thefile);
MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                  MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);
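Note what the view achieves here: rank r skips r * BUFSIZE * sizeof(int) bytes and then writes its BUFSIZE integers contiguously, so with four processes testfile ends up holding the values 0 to 4*BUFSIZE-1 in order, one rank's block after another.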

Page 28: High Performance Computing Course Notes 2007-2008 Parallel I/O.

MPI_Type_create_subarray

Create a datatype for a subarray of a regular, multidimensional array

int MPI_Type_create_subarray( int ndims, int array_of_sizes[], int array_of_subsizes[], int array_of_starts[], int order, MPI_Datatype oldtype, MPI_Datatype *newtype );

Parameters

ndims :[in] number of array dimensions (positive integer)

array_of_sizes :[in] number of elements of type oldtype in each dimension of the full array (array of positive integers)

array_of_subsizes :[in] number of elements of type oldtype in each dimension of the subarray (array of positive integers)

array_of_starts :[in] starting coordinates of the subarray in each dimension (array of nonnegative integers)

order :[in] array storage order flag (state)

oldtype :[in] array element datatype (handle)

newtype :[out] new datatype (handle)

Page 29: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Using the Subarray Datatype

gsizes[0] = 16;  /* no. of rows in global array */
gsizes[1] = 16;  /* no. of columns in global array */

psizes[0] = 4;   /* no. of procs. in vertical dimension */
psizes[1] = 4;   /* no. of procs. in horizontal dimension */

lsizes[0] = 16 / psizes[0];  /* no. of rows in local array */
lsizes[1] = 16 / psizes[1];  /* no. of columns in local array */

dims[0] = 4;  dims[1] = 4;
periods[0] = periods[1] = 1;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
MPI_Comm_rank(comm, &rank);
MPI_Cart_coords(comm, rank, 2, coords);

Page 30: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Subarray Datatype contd.

/* global indices of first element of local array */
start_indices[0] = coords[0] * lsizes[0];
start_indices[1] = coords[1] * lsizes[1];

MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                         MPI_ORDER_C, MPI_FLOAT, &filetype);
MPI_Type_commit(&filetype);

Page 31: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3

Each process creates a derived datatype to describe the non-contiguous access pattern

We thus have a file view and independent access

Single independent, non-contiguous request

MPI_Type_create_subarray(… , &subarray, …)
MPI_Type_commit(&subarray)
MPI_File_open(… , "filename", … , &fh)
MPI_File_set_view(fh, … , subarray, …)
MPI_File_read(fh, local_array, …)
MPI_File_close(&fh)

Creates a datatype describing a subarray of a multi-dimensional array

Commits the datatype (must be done before the datatype is used)

At commit time the system may compile an internal representation of the datatype

Page 32: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3


Opens the file as before

Now changes the process's view of the data in the file using set_view

set_view is collective

Although the reads are still independent

Page 33: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 3

Note here that we are reading the whole sub-array despite the non-contiguous storage

[Diagram: the 4×4 process decomposition and row-major file layout from page 15; processes 0–3 use interleaved filetypes covering the first four array rows]

Processes {4,5,6,7}, {8,9,10,11} and {12,13,14,15} will have file views based on the same filetypes, but with different displacements

Page 34: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access pattern 4

Each process creates a derived datatype to describe the non-contiguous access pattern

We thus have a file view and collective access

Single collective, non-contiguous request

MPI_Type_create_subarray(… , &subarray, …)
MPI_Type_commit(&subarray)
MPI_File_open(… , "filename", … , &fh)
MPI_File_set_view(fh, … , subarray, …)
MPI_File_read_all(fh, local_array, …)
MPI_File_close(&fh)

Creates and commits datatype as before

Now changes the process's view of the data in the file using set_view

set_view is collective

Reads are now collective
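Putting pages 28–34 together, a complete sketch of access pattern 4 might read as follows (the 16×16 float array and 4×4 process grid come from the running example; the declarations and the absence of error handling are illustrative choices):

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, coords[2];
    int gsizes[2] = {16, 16};          /* global array */
    int psizes[2] = {4, 4};            /* process grid */
    int lsizes[2], start_indices[2];
    int dims[2] = {4, 4}, periods[2] = {1, 1};
    MPI_Comm comm;
    MPI_Datatype filetype;
    MPI_File fh;
    float local_array[4][4];

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm);
    MPI_Comm_rank(comm, &rank);
    MPI_Cart_coords(comm, rank, 2, coords);

    lsizes[0] = gsizes[0] / psizes[0];
    lsizes[1] = gsizes[1] / psizes[1];
    start_indices[0] = coords[0] * lsizes[0];
    start_indices[1] = coords[1] * lsizes[1];

    /* Derived datatype describing this process's portion of the file */
    MPI_Type_create_subarray(2, gsizes, lsizes, start_indices,
                             MPI_ORDER_C, MPI_FLOAT, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File_open(comm, "filename", MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

    /* Single collective, non-contiguous request */
    MPI_File_read_all(fh, local_array, lsizes[0] * lsizes[1], MPI_FLOAT,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Replacing MPI_File_read_all with MPI_File_read turns this into access pattern 3; nothing else changes.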

Page 35: High Performance Computing Course Notes 2007-2008 Parallel I/O.

Access patterns

These access patterns express four different styles of parallel I/O

You should choose your access pattern depending on the application

The larger the I/O request, the better the performance

Collective operations will do better than individual reads

Pattern 4 therefore (potentially) offers the best performance


Page 36: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Data Sieving

Data sieving is used to combine lots of small accesses into a single larger one

Remote file systems (parallel or not) tend to have high latencies

Reducing the number of operations is therefore important
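The sieving itself is done inside the MPI-IO library, but it can often be tuned through info hints, as in the sketch below (these hint names are ROMIO-specific, so check your implementation's documentation before relying on them):

#include <mpi.h>

/* Open a file for reading with ROMIO data-sieving hints set */
MPI_File open_with_sieving_hints(const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_ds_read", "enable");       /* sieve reads */
    MPI_Info_set(info, "ind_rd_buffer_size", "4194304"); /* 4 MB buffer */

    MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}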

Page 37: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Data Sieving Writes

Using data sieving for writes is more complicated

Must read the entire region first

Then make our changes

Then write the block back

Requires locking in the file system

Can result in false sharing

Page 38: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Two-Phase Collective I/O

Problems with independent, noncontiguous access: lots of small accesses, and independent data sieving reads lots of extra data

Idea: reorganize the access to match the layout on disk, with single processes using data sieving to get data for many

This often reduces total I/O through sharing of common blocks

A second "phase" moves data to its final destinations
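Two-phase I/O (which ROMIO calls collective buffering) is also applied automatically inside the library, but can be steered with hints; the sketch below uses ROMIO-specific hint names, which are an assumption rather than part of the MPI standard:

#include <mpi.h>

/* Open a file for reading with two-phase (collective buffering) hints */
MPI_File open_with_cb_hints(const char *path)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_read", "enable");    /* force two-phase reads */
    MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB per aggregator */
    MPI_Info_set(info, "cb_nodes", "4");              /* number of aggregators */

    MPI_File_open(MPI_COMM_WORLD, (char *)path, MPI_MODE_RDONLY, info, &fh);
    MPI_Info_free(&info);
    return fh;
}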

Page 39: High Performance Computing Course Notes 2007-2008 Parallel I/O.

I/O optimization: Collective I/O

Collective I/O is coordinated access to storage by a group of processes

Collective I/O functions must be called by all processes participating in I/O

Allows the I/O layers to know more about the access as a whole