Transcript of "Parallel I/O Basics", Claudio Gheller, CINECA (Firenze, 10-11 June 2003)

Page 1

Parallel I/O Basics

Claudio Gheller

[email protected]

Page 2

Reading and writing data is a problem that is usually underestimated. However, it can become crucial for:

• Performance
• Porting data between different platforms
• Parallel implementation of I/O algorithms

Page 3

Performance

Disk access speed: approx. 10-100 MB/s
Memory access speed: approx. 1-10 GB/s

THEREFORE

When reading/writing to disk, a code runs roughly 100 times slower than when working in memory.

Optimization is platform dependent. In general: write large amounts of data in single shots.

Page 4

Performance

Optimization is platform dependent. In general: write large amounts of data in single shots. For example, avoid looped read/write:

do i=1,N
   write(10) A(i)      ! one disk access per element
enddo

is VERY slow.
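A minimal sketch of the single-shot alternative (assuming A is a Real*4 array of N elements, already filled with data):

Real*4 A(N)
write(10) A            ! the whole array in a single record: one disk access instead of N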

Page 5

Data portability

This is a subtle problem, which typically becomes apparent only at the end, when you try to use the data on different platforms.

For example: unformatted data written by an IBM system cannot be read by an Alpha station or by a Linux/MS Windows PC.

There are two main problems:
• Data representation
• File structure

Page 6

Data portability: number representation

There are two different representations:

Big endian (Unix: IBM, SGI, SUN…): the four bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as
Base Address+0: Byte3, Base Address+1: Byte2, Base Address+2: Byte1, Base Address+3: Byte0

Little endian (Alpha, PC): the four bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as
Base Address+0: Byte0, Base Address+1: Byte1, Base Address+2: Byte2, Base Address+3: Byte3
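A small Fortran 90 sketch (not from the slides, assuming 4-byte default integers) to check the endianness of the machine the code runs on:

program check_endian
   integer :: i = 1
   character(len=4) :: c
   c = transfer(i, c)              ! reinterpret the integer as 4 raw bytes
   if (ichar(c(1:1)) == 1) then    ! the first byte in memory holds the 1
      print *, 'Little endian (Alpha, PC)'
   else
      print *, 'Big endian (IBM, SGI, SUN...)'
   end if
end program check_endian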

Page 7

Data portability: file structure

For performance reasons, Fortran organizes binary files in BLOCKS (records). Each block is delimited by a marker (typically a record-length field a few bytes long).

Unfortunately, each Fortran compiler has its own block size and separators!

Notice that this problem is typical of Fortran and does not affect C/C++.

Page 8

Data portability: compiler solutions

Some compilers allow you to overcome these problems with specific options. However, this leads to:
• spending a lot of time re-configuring the compilation on each different system;
• having a less portable code (the results depend on the compiler).

Page 9

Data portability: compiler solutions

For example, the Alpha Fortran compiler allows you to use big-endian data with the

-convert big_endian

option. However, this option is not present in other compilers and, furthermore, data produced with this option are incompatible with the system that wrote them!

Page 10

Fortran offers a possible solution to both the performance and the portability problems: DIRECT ACCESS files.

Open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control characters. Every Fortran compiler writes (and can read) it in THE SAME WAY.

Notice, however, that the endianness problem is still present: the file is portable between any platforms with the same endianness.

Page 11

Direct Access Files

The keyword recl is the basic quantum of written data (the record length). It is usually expressed in bytes (except on Alpha, which expresses it in words).

Example 1

Real*4 x(100)
Inquire(IOLENGTH=IOL) x(1)        ! record length of a single element
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
Do i=1,100
   write(10,rec=i) x(i)           ! one record per element
Enddo
Close(10)

Portable but not performing!

(Notice that this is precisely the C fread/fwrite style of I/O.)

Page 12

Direct Access Files

Example 2

Real*4 x(100)
Inquire(IOLENGTH=IOL) x           ! record length of the whole array
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
write(10,rec=1) x                 ! the whole array in a single shot
Close(10)

Portable and performing!

Page 13

Direct Access Files

Example 3

Real*4 x(100),y(100),z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=1) x
write(10,rec=2) y
write(10,rec=3) z
Close(10)

The same result can be obtained as:

Real*4 x(100),y(100),z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=2) y
write(10,rec=3) z
write(10,rec=1) x
Close(10)

Order is not important!
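Direct access also allows reading a single record without traversing the whole file. A minimal sketch, reusing the (hypothetical) datafile.bin written above:

Real*4 y(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
read(10,rec=2) y      ! fetch only the second record (the y array)
Close(10)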

Page 14

Parallel I/O

I/O is not a trivial issue in parallel.

Example:

Program Scrivi
   Write(*,*) ' Hello World'
End program Scrivi

Execute in parallel on 4 processors (Pe 0, Pe 1, Pe 2, Pe 3):

$ ./Scrivi
Hello World
Hello World
Hello World
Hello World

Every processor executes the write.

Page 15

Parallel I/O

Goals:
• Improve the performance
• Ensure data consistency
• Avoid communication
• Usability

Page 16

Parallel I/O

Solution 1: Master-Slave

Only one processor performs the I/O: Pe 1, Pe 2 and Pe 3 send their data to Pe 0, which reads/writes the data file.

Goals:
• Improve the performance: NO
• Ensure data consistency: YES
• Avoid communication: NO
• Usability: YES (but in general not portable)
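A minimal sketch of the master-slave pattern with MPI point-to-point calls; mype (the rank), npes, N and the local buffer buf are hypothetical names assumed to be set up elsewhere:

include 'mpif.h'
Real*4 buf(N)
Integer status(MPI_STATUS_SIZE), ierr, ipe
If(mype.eq.0)then
   write(10) buf                   ! the master writes its own data...
   Do ipe=1,npes-1                 ! ...then receives and writes everyone else's
      call MPI_RECV(buf,N,MPI_REAL,ipe,0,MPI_COMM_WORLD,status,ierr)
      write(10) buf
   Enddo
Else
   call MPI_SEND(buf,N,MPI_REAL,0,0,MPI_COMM_WORLD,ierr)
Endif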

Page 17

Parallel I/O

Solution 2: Distributed I/O

All the processors read/write their own file: Pe 0 on Data File 0, Pe 1 on Data File 1, Pe 2 on Data File 2, Pe 3 on Data File 3.

Goals:
• Improve the performance: YES (but be careful)
• Ensure data consistency: YES
• Avoid communication: YES
• Usability: NO

Warning: do not parametrize with processors!!!
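A minimal sketch of distributed I/O, with each processor building its own file name from its rank (mype and local_array are hypothetical names); note that the resulting set of files depends on the number of processors, which is presumably what the warning above refers to:

Character(len=12) fname
Write(fname,'(a,i3.3,a)') 'data_', mype, '.bin'   ! e.g. data_000.bin ... data_003.bin
Open(unit=10, file=fname, form='unformatted')
Write(10) local_array
Close(10)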

Page 18

Parallel I/O

Solution 3: Distributed I/O on a single file

All the processors (Pe 0, Pe 1, Pe 2, Pe 3) read/write a single ACCESS='DIRECT' data file.

Goals:
• Improve the performance: YES for read, NO for write
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES (portable!!!)
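A minimal sketch, assuming every processor holds a local Real*4 array of the same size and that the records of the shared file are laid out by rank (mype is the processor rank):

Real*4 local(N)
Inquire(IOLENGTH=IOL) local
Open(unit=10, file='data.bin', access='direct', recl=IOL)
write(10,rec=mype+1) local      ! each processor writes its own record
Close(10)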

Page 19

Parallel I/O

Solution 4: MPI-2 I/O

MPI functions perform the I/O: Pe 0, Pe 1, Pe 2 and Pe 3 all access the data file through MPI. These functions belong to MPI-2, not to the original MPI standard. Asynchronous I/O is supported.

Goals:
• Improve the performance: YES (strongly!!!)
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES
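A minimal MPI-2 I/O sketch (Fortran bindings; the local array and the rank mype are hypothetical names assumed to be set up elsewhere):

include 'mpif.h'
Real*4 local(N)
Integer fh, ierr
Integer(kind=MPI_OFFSET_KIND) disp
call MPI_FILE_OPEN(MPI_COMM_WORLD,'data.bin',MPI_MODE_WRONLY+MPI_MODE_CREATE,MPI_INFO_NULL,fh,ierr)
disp = mype*N*4      ! byte offset of this processor's block
call MPI_FILE_WRITE_AT(fh,disp,local,N,MPI_REAL,MPI_STATUS_IGNORE,ierr)
call MPI_FILE_CLOSE(fh,ierr)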

Page 20

Case Study

Data analysis – case 1

How many clusters are there in the image?

Cluster finding algorithm:
Input = the image
Output = a number

Page 21

Case Study

Case 1 – parallel implementation

Parallel cluster finding algorithm:
Input = a fraction of the image (e.g. one half each for Pe 0 and Pe 1)
Output = a number for each processor

All the parallelism is in the setup of the input. Then all processors work independently!

Page 22

Case Study

Case 1 – setup of the input

Each processor (Pe 0, Pe 1) reads its own part of the input file:

! The image is NxN pixels, using 2 processors; mype is the processor rank (0 or 1)
Real*4 array(N,N/2)
Open(unit=10, file='image.bin', access='direct', recl=4*N*N/2)
Startrecord = mype+1
read(10,rec=Startrecord) array           ! each record holds one half of the image
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found', N_cluster, ' clusters'

Page 23

Case Study

Case 1 – boundary conditions

Boundaries must be treated in a specific way: each processor also needs the image row adjacent to its own half, which belongs to the other processor.

! The image is NxN pixels, using 2 processors
Real*4 array(0:N+1,0:N/2+1)

! Set boundaries on the image side
array(0,:)   = 0.0
array(N+1,:) = 0.0
jside = mod(mype,2)*N/2 + mod(mype,2)    ! j=0 for Pe 0, j=N/2+1 for Pe 1
array(:,jside) = 0.0

Open(unit=10, file='image.bin', access='direct', recl=4*N)
! Read this processor's own rows, one record (one image row) at a time
Do j=1,N/2
   record = mype*N/2 + j
   read(10,rec=record) array(1:N,j)
Enddo

! Read the neighbouring row owned by the other processor
If(mype.eq.0)then
   record = N/2+1
   read(10,rec=record) array(1:N,N/2+1)
else
   record = N/2
   read(10,rec=record) array(1:N,0)
endif

Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found', N_cluster, ' clusters'


Page 24

Case Study

Data analysis – case 2

From observed data… to the sky map.

Page 25

Case Study

Data analysis – case 2

Each map pixel is measured N times. The final value of each pixel is an "average" of all the corresponding measurements.

[Figure: a stream of measured values (0.1, 0.7, 0.3, 1.8, 2.3, 0.2, 5.7, 1.0, …) with the corresponding map pixel ids (2, 7, 1, 11, 76, 23, 2, 37, …) is reduced to the final MAP: 0.3 0.5 0.7 0.9 1.0 1.1 1.4 1.2 1.1 0.9]

Page 26

Case Study

Case 2 – parallelization

• Values and ids are distributed among the processors in the data input phase (just like case 1)
• The calculation is performed independently by each processor
• Each processor produces its own COMPLETE map (which is small and can be replicated)
• The final map is the SUM OF ALL THE MAPS calculated by the different processors

Page 27

Case Study

Case 2 – parallelization

! N data, M pixels, Npes processors (M << N); mype is the processor rank

! Define the basic arrays
Real*8 value(N/Npes)
Real*8 map(M)
Integer id(N/Npes)

! Read the data in parallel (boundaries are neglected);
! recl is in bytes: 8 per Real*8 element, 4 per Integer
Open(unit=10,file='data.bin',access='direct',recl=8*N/Npes)
Open(unit=20,file='ids.bin',access='direct',recl=4*N/Npes)
record = mype+1
Read(10,rec=record) value
Read(20,rec=record) id

! Calculate the local maps
Call Sequential_Calculate_Local_Map(value,id,map)

! Synchronize the processes
Call BARRIER

! Parallel calculation of the final map, then print it
Call Calculate_Final_Map(map)
Call Print_Final_Map(map)

Page 28

Case Study

Case 2 – calculation of the final map

Subroutine Calculate_Final_Map(map)
Real*8 map(M)
Real*8 map_aux(M)
! Accumulate the final map on processor 0, one processor at a time
! (the loop starts at 2 so that processor 0 does not receive from itself)
Do i=2,npes
   If(mype.eq.0)then
      call RECV(map_aux,i-1)            ! receive the map of processor i-1
      map = map + map_aux
   Else if (mype.eq.i-1)then
      call SEND(map,0)                  ! send the local map to processor 0
   Endif
   Call BARRIER
enddo
return

However, MPI offers a MUCH BETTER solution

(we will see it tomorrow)
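The better solution is presumably a collective reduction. A minimal sketch with MPI_REDUCE (not from the slides):

Real*8 map(M), map_final(M)
Integer ierr
! Sum all the local maps element-wise onto processor 0 in a single collective call
call MPI_REDUCE(map,map_final,M,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierr)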

Page 29

Case Study

Case 2 – print the final map

At this point ONLY processor 0 has the final map and can print it out: only one processor writes the result.

Subroutine Print_Final_Map(map)
Real*8 map(M)
If(mype.eq.0)then
   do i=1,M
      write(*,*) i, map(i)
   enddo
Endif
return