Transcript of "Parallel I/O Basics", Claudio Gheller, CINECA (Firenze, 10-11 June 2003)

Page 1

Parallel I/O Basics

Claudio Gheller

[email protected]

Page 2

Reading and writing data is a problem that is usually underestimated. However, it can become crucial for:

• Performance
• Porting data between different platforms
• Parallel implementation of I/O algorithms

Page 3

Performance

Disk access speed: approx. 10-100 MB/s
Memory access speed: approx. 1-10 GB/s

THEREFORE

When reading/writing to disk, a code runs roughly 100 times slower than when working in memory.

Optimization is platform dependent. In general: write large amounts of data in single shots.

Page 4

Performance

Optimization is platform dependent. In general: write large amounts of data in single shots. For example, avoid looped read/write:

do i=1,N
   write(10) A(i)      ! one disk access per element
enddo

is VERY slow.
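A minimal sketch of the single-shot alternative (assuming A is a Real*4 array of N elements, already filled with data):

Real*4 A(N)
write(10) A            ! the whole array in a single record: one disk access instead of N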

Page 5

Data portability

This is a subtle problem, which typically becomes apparent only at the end, when you try to use the data on different platforms.

For example: unformatted data written by an IBM system cannot be read by an Alpha station or by a Linux/MS Windows PC.

There are two main problems:
• Data representation
• File structure

Page 6

Data portability: number representation

There are two different representations:

Big endian (Unix: IBM, SGI, SUN…): the four bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as
Base Address+0: Byte3, Base Address+1: Byte2, Base Address+2: Byte1, Base Address+3: Byte0

Little endian (Alpha, PC): the four bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as
Base Address+0: Byte0, Base Address+1: Byte1, Base Address+2: Byte2, Base Address+3: Byte3
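A small Fortran 90 sketch (not from the slides, assuming 4-byte default integers) to check the endianness of the machine the code runs on:

program check_endian
   integer :: i = 1
   character(len=4) :: c
   c = transfer(i, c)              ! reinterpret the integer as 4 raw bytes
   if (ichar(c(1:1)) == 1) then    ! the first byte in memory holds the 1
      print *, 'Little endian (Alpha, PC)'
   else
      print *, 'Big endian (IBM, SGI, SUN...)'
   end if
end program check_endian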

Page 7

Data portability: file structure

For performance reasons, Fortran organizes binary files in BLOCKS (records). Each block is delimited by a marker (typically a record-length field a few bytes long).

Unfortunately, each Fortran compiler has its own block size and separators!

Notice that this problem is typical of Fortran and does not affect C/C++.

Page 8

Data portability: compiler solutions

Some compilers allow you to overcome these problems with specific options. However, this leads to:
• spending a lot of time re-configuring the compilation on each different system;
• having a less portable code (the results depend on the compiler).

Page 9

Data portability: compiler solutions

For example, the Alpha Fortran compiler allows you to use big-endian data with the

-convert big_endian

option. However, this option is not present in other compilers and, furthermore, data produced with this option are incompatible with the system that wrote them!

Page 10

Fortran offers a possible solution to both the performance and the portability problems: DIRECT ACCESS files.

Open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control characters. Every Fortran compiler writes (and can read) it in THE SAME WAY.

Notice, however, that the endianness problem is still present: the file is portable between any platforms with the same endianness.

Page 11

Direct Access Files

The keyword recl is the basic quantum of written data (the record length). It is usually expressed in bytes (except on Alpha, which expresses it in words).

Example 1

Real*4 x(100)
Inquire(IOLENGTH=IOL) x(1)        ! record length of a single element
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
Do i=1,100
   write(10,rec=i) x(i)           ! one record per element
Enddo
Close(10)

Portable but not performing!

(Notice that this is precisely the C fread/fwrite style of I/O.)

Page 12

Direct Access Files

Example 2

Real*4 x(100)
Inquire(IOLENGTH=IOL) x           ! record length of the whole array
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
write(10,rec=1) x                 ! the whole array in a single shot
Close(10)

Portable and performing!

Page 13

Direct Access Files

Example 3

Real*4 x(100),y(100),z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=1) x
write(10,rec=2) y
write(10,rec=3) z
Close(10)

The same result can be obtained as:

Real*4 x(100),y(100),z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=2) y
write(10,rec=3) z
write(10,rec=1) x
Close(10)

Order is not important!
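Direct access also allows reading a single record without traversing the whole file. A minimal sketch, reusing the (hypothetical) datafile.bin written above:

Real*4 y(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
read(10,rec=2) y      ! fetch only the second record (the y array)
Close(10)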

Page 14

Parallel I/O

I/O is not a trivial issue in parallel.

Example:

Program Scrivi
   Write(*,*) ' Hello World'
End program Scrivi

Execute in parallel on 4 processors (Pe 0, Pe 1, Pe 2, Pe 3):

$ ./Scrivi
Hello World
Hello World
Hello World
Hello World

Every processor executes the write.

Page 15

Parallel I/O

Goals:
• Improve the performance
• Ensure data consistency
• Avoid communication
• Usability

Page 16

Parallel I/O

Solution 1: Master-Slave

Only one processor performs the I/O: Pe 1, Pe 2 and Pe 3 send their data to Pe 0, which reads/writes the data file.

Goals:
• Improve the performance: NO
• Ensure data consistency: YES
• Avoid communication: NO
• Usability: YES (but in general not portable)
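A minimal sketch of the master-slave pattern with MPI point-to-point calls; mype (the rank), npes, N and the local buffer buf are hypothetical names assumed to be set up elsewhere:

include 'mpif.h'
Real*4 buf(N)
Integer status(MPI_STATUS_SIZE), ierr, ipe
If(mype.eq.0)then
   write(10) buf                   ! the master writes its own data...
   Do ipe=1,npes-1                 ! ...then receives and writes everyone else's
      call MPI_RECV(buf,N,MPI_REAL,ipe,0,MPI_COMM_WORLD,status,ierr)
      write(10) buf
   Enddo
Else
   call MPI_SEND(buf,N,MPI_REAL,0,0,MPI_COMM_WORLD,ierr)
Endif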

Page 17

Parallel I/O

Solution 2: Distributed I/O

All the processors read/write their own file: Pe 0 on Data File 0, Pe 1 on Data File 1, Pe 2 on Data File 2, Pe 3 on Data File 3.

Goals:
• Improve the performance: YES (but be careful)
• Ensure data consistency: YES
• Avoid communication: YES
• Usability: NO

Warning: do not parametrize with processors!!!
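A minimal sketch of distributed I/O, with each processor building its own file name from its rank (mype and local_array are hypothetical names); note that the resulting set of files depends on the number of processors, which is presumably what the warning above refers to:

Character(len=12) fname
Write(fname,'(a,i3.3,a)') 'data_', mype, '.bin'   ! e.g. data_000.bin ... data_003.bin
Open(unit=10, file=fname, form='unformatted')
Write(10) local_array
Close(10)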

Page 18

Parallel I/O

Solution 3: Distributed I/O on a single file

All the processors (Pe 0, Pe 1, Pe 2, Pe 3) read/write a single ACCESS='DIRECT' data file.

Goals:
• Improve the performance: YES for read, NO for write
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES (portable!!!)
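A minimal sketch, assuming every processor holds a local Real*4 array of the same size and that the records of the shared file are laid out by rank (mype is the processor rank):

Real*4 local(N)
Inquire(IOLENGTH=IOL) local
Open(unit=10, file='data.bin', access='direct', recl=IOL)
write(10,rec=mype+1) local      ! each processor writes its own record
Close(10)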

Page 19

Parallel I/O

Solution 4: MPI-2 I/O

MPI functions perform the I/O: Pe 0, Pe 1, Pe 2 and Pe 3 all access the data file through MPI. These functions belong to MPI-2, not to the original MPI standard. Asynchronous I/O is supported.

Goals:
• Improve the performance: YES (strongly!!!)
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES
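A minimal MPI-2 I/O sketch (Fortran bindings; the local array and the rank mype are hypothetical names assumed to be set up elsewhere):

include 'mpif.h'
Real*4 local(N)
Integer fh, ierr
Integer(kind=MPI_OFFSET_KIND) disp
call MPI_FILE_OPEN(MPI_COMM_WORLD,'data.bin',MPI_MODE_WRONLY+MPI_MODE_CREATE,MPI_INFO_NULL,fh,ierr)
disp = mype*N*4      ! byte offset of this processor's block
call MPI_FILE_WRITE_AT(fh,disp,local,N,MPI_REAL,MPI_STATUS_IGNORE,ierr)
call MPI_FILE_CLOSE(fh,ierr)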

Page 20

Case Study

Data analysis – case 1

How many clusters are there in the image?

Cluster finding algorithm:
Input = the image
Output = a number

Page 21

Case Study

Case 1 – parallel implementation

Parallel cluster finding algorithm:
Input = a fraction of the image (e.g. one half each for Pe 0 and Pe 1)
Output = a number for each processor

All the parallelism is in the setup of the input. Then all processors work independently!

Page 22

Case Study

Case 1 – setup of the input

Each processor (Pe 0, Pe 1) reads its own part of the input file:

! The image is NxN pixels, using 2 processors; mype is the processor rank (0 or 1)
Real*4 array(N,N/2)
Open(unit=10, file='image.bin', access='direct', recl=4*N*N/2)
Startrecord = mype+1
read(10,rec=Startrecord) array           ! each record holds one half of the image
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found', N_cluster, ' clusters'

Page 23

Case Study

Case 1 – boundary conditions

Boundaries must be treated in a specific way: each processor also needs the image row adjacent to its own half, which belongs to the other processor.

! The image is NxN pixels, using 2 processors
Real*4 array(0:N+1,0:N/2+1)

! Set boundaries on the image side
array(0,:)   = 0.0
array(N+1,:) = 0.0
jside = mod(mype,2)*N/2 + mod(mype,2)    ! j=0 for Pe 0, j=N/2+1 for Pe 1
array(:,jside) = 0.0

Open(unit=10, file='image.bin', access='direct', recl=4*N)
! Read this processor's own rows, one record (one image row) at a time
Do j=1,N/2
   record = mype*N/2 + j
   read(10,rec=record) array(1:N,j)
Enddo

! Read the neighbouring row owned by the other processor
If(mype.eq.0)then
   record = N/2+1
   read(10,rec=record) array(1:N,N/2+1)
else
   record = N/2
   read(10,rec=record) array(1:N,0)
endif

Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found', N_cluster, ' clusters'


Page 24

Case Study

Data analysis – case 2

From observed data… to the sky map.

Page 25

Case Study

Data analysis – case 2

Each map pixel is measured N times. The final value of each pixel is an "average" of all the corresponding measurements.

[Figure: a stream of measured values (0.1, 0.7, 0.3, 1.8, 2.3, 0.2, 5.7, 1.0, …) with the corresponding map pixel ids (2, 7, 1, 11, 76, 23, 2, 37, …) is reduced to the final MAP: 0.3 0.5 0.7 0.9 1.0 1.1 1.4 1.2 1.1 0.9]

Page 26

Case Study

Case 2 – parallelization

• Values and ids are distributed among the processors in the data input phase (just like case 1)
• The calculation is performed independently by each processor
• Each processor produces its own COMPLETE map (which is small and can be replicated)
• The final map is the SUM OF ALL THE MAPS calculated by the different processors

Page 27

Case Study

Case 2 – parallelization

! N data, M pixels, Npes processors (M << N); mype is the processor rank

! Define the basic arrays
Real*8 value(N/Npes)
Real*8 map(M)
Integer id(N/Npes)

! Read the data in parallel (boundaries are neglected);
! recl is in bytes: 8 per Real*8 element, 4 per Integer
Open(unit=10,file='data.bin',access='direct',recl=8*N/Npes)
Open(unit=20,file='ids.bin',access='direct',recl=4*N/Npes)
record = mype+1
Read(10,rec=record) value
Read(20,rec=record) id

! Calculate the local maps
Call Sequential_Calculate_Local_Map(value,id,map)

! Synchronize the processes
Call BARRIER

! Parallel calculation of the final map, then print it
Call Calculate_Final_Map(map)
Call Print_Final_Map(map)

Page 28

Case Study

Case 2 – calculation of the final map

Subroutine Calculate_Final_Map(map)
Real*8 map(M)
Real*8 map_aux(M)
! Accumulate the final map on processor 0, one processor at a time
! (the loop starts at 2 so that processor 0 does not receive from itself)
Do i=2,npes
   If(mype.eq.0)then
      call RECV(map_aux,i-1)            ! receive the map of processor i-1
      map = map + map_aux
   Else if (mype.eq.i-1)then
      call SEND(map,0)                  ! send the local map to processor 0
   Endif
   Call BARRIER
enddo
return

However, MPI offers a MUCH BETTER solution

(we will see it tomorrow)
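The better solution is presumably a collective reduction. A minimal sketch with MPI_REDUCE (not from the slides):

Real*8 map(M), map_final(M)
Integer ierr
! Sum all the local maps element-wise onto processor 0 in a single collective call
call MPI_REDUCE(map,map_final,M,MPI_DOUBLE_PRECISION,MPI_SUM,0,MPI_COMM_WORLD,ierr)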

Page 29

Case Study

Case 2 – print the final map

At this point ONLY processor 0 has the final map and can print it out: only one processor writes the result.

Subroutine Print_Final_Map(map)
Real*8 map(M)
If(mype.eq.0)then
   do i=1,M
      write(*,*) i, map(i)
   enddo
Endif
return