Parallel I/O Basics
Claudio Gheller, CINECA
Firenze, 10-11 June 2003
Reading and writing data is a problem that is usually underestimated. However, it can become crucial for:
• Performance
• Porting data to different platforms
• Parallel implementation of I/O algorithms
Performance
Disk bandwidth: approx. 10-100 MByte/sec
Memory bandwidth: approx. 1-10 GByte/sec
THEREFORE
when reading/writing on disk, a code is about 100 times slower than when working in memory.
Optimization is platform dependent. In general: write large amounts of data in single shots.
Performance
For example, avoid looped read/write:

do i=1,N
   write(10) A(i)
enddo

is VERY slow; a single
write(10) A
is much faster. Optimization is platform dependent. In general: write large amounts of data in single shots.
Data portability
This is a subtle problem, which becomes crucial only at the end... when you try to use the data on a different platform.
For example: unformatted data written by an IBM system cannot be read by an Alpha station or by a Linux/MS Windows PC.
There are two main problems:
• Data representation
• File structure
Data portability: number representation
There are two different representations:

Big endian (Unix: IBM, SGI, SUN...): the bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as
Base Address+0: Byte3, Base Address+1: Byte2, Base Address+2: Byte1, Base Address+3: Byte0

Little endian (Alpha, PC): the bytes Byte3 Byte2 Byte1 Byte0 are arranged in memory as
Base Address+0: Byte0, Base Address+1: Byte1, Base Address+2: Byte2, Base Address+3: Byte3
Data portability: file structure
For performance reasons, Fortran organizes binary (sequential unformatted) files in BLOCKS.
Each block is delimited by a proper marker (typically a 4-byte record length).
Unfortunately, each Fortran compiler has its own block size and separators!!!
Notice that this problem is typical of Fortran and does not affect C / C++.
Data portability: compiler solutions
Some compilers allow you to overcome these problems with specific options. However, this leads to:
• spending a lot of time re-configuring the compilation on each different system
• a less portable code (the results depend on the compiler)
Data portability: compiler solutions
For example, the Alpha Fortran compiler allows the use of big-endian data through the
-convert big_endian
option. However, this option is not present in every other compiler and, furthermore, data produced with this option are incompatible with the native format of the system that wrote them!!!
Direct Access Files
Fortran offers a possible solution to both the performance and the portability problems: DIRECT ACCESS files.

Open(unit=10, file='datafile.bin', form='unformatted', access='direct', recl=N)

The result is a binary file with no blocks and no control characters. Any Fortran compiler writes (and can read) it in THE SAME WAY.
Notice, however, that the endianness problem is still present: the file is portable between any platforms with the same endianness.
Direct Access Files
The keyword recl sets the record length, the basic quantum of written data. It is usually expressed in bytes (except on Alpha, which expresses it in words).

Example 1

Real*4 x(100)
Inquire(IOLENGTH=IOL) x(1)
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
Do i=1,100
   write(10,rec=i) x(i)
Enddo
Close(10)

Portable, but not performing!!!
(Notice that this is precisely C fread/fwrite I/O.)
Direct Access Files
Example 2

Real*4 x(100)
Inquire(IOLENGTH=IOL) x
Open(unit=10, file='datafile.bin', access='direct', recl=IOL)
write(10,rec=1) x
Close(10)

Portable and performing!!!
Direct Access Files
Example 3

Real*4 x(100), y(100), z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=1) x
write(10,rec=2) y
write(10,rec=3) z
Close(10)

The same result can be obtained as:

Real*4 x(100), y(100), z(100)
Open(unit=10, file='datafile.bin', access='direct', recl=4*100)
write(10,rec=2) y
write(10,rec=3) z
write(10,rec=1) x
Close(10)

The order is not important!!!
Parallel I/O
I/O is not a trivial issue in parallel.

Example:

Program Scrivi
Write(*,*) ' Hello World'
End program Scrivi

Executed in parallel on 4 processors (Pe 0, Pe 1, Pe 2, Pe 3):

$ ./Scrivi
Hello World
Hello World
Hello World
Hello World
Parallel I/O
Goals:
• Improve the performance
• Ensure data consistency
• Avoid communication
• Usability
Parallel I/O
Solution 1: Master-Slave
Only one processor (Pe 0) performs the I/O on the data file; the others (Pe 1, Pe 2, Pe 3) exchange their data with it.

Goals:
• Improve the performance: NO
• Ensure data consistency: YES
• Avoid communication: NO
• Usability: YES (but in general not portable)
Parallel I/O
Solution 2: Distributed I/O
All the processors read/write their own file: Pe 0 <-> Data File 0, Pe 1 <-> Data File 1, Pe 2 <-> Data File 2, Pe 3 <-> Data File 3.

Goals:
• Improve the performance: YES (but be careful)
• Ensure data consistency: YES
• Avoid communication: YES
• Usability: NO

Warning: do not parametrize the data layout with the number of processors!!!
Parallel I/O
Solution 3: Distributed I/O on a single file
All the processors (Pe 0 ... Pe 3) read/write a single ACCESS='DIRECT' data file.

Goals:
• Improve the performance: YES for read, NO for write
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES (portable!!!)
Parallel I/O
Solution 4: MPI-2 I/O
The I/O is performed by MPI-2 (MPI-IO) functions, which sit between the processors (Pe 0 ... Pe 3) and the data file; they are not yet available in all MPI implementations. Asynchronous I/O is supported.

Goals:
• Improve the performance: YES (strongly!!!)
• Ensure data consistency: NO
• Avoid communication: YES
• Usability: YES
Case Study
Data analysis - case 1
How many clusters are there in the image?
Cluster finding algorithm:
Input = the image
Output = a number
Case Study
Case 1 - parallel implementation
Parallel cluster finding algorithm:
Input = a fraction of the image (one half to Pe 0, one half to Pe 1)
Output = a number for each processor
All the parallelism is in the setup of the input; then all the processors work independently!!!
Case Study
Case 1 - setup of the input
Each processor (Pe 0, Pe 1) reads its own part of the input file:

! The image is NxN pixels, using 2 processors
Real*4 array(N,N/2)
Open(unit=10, file='image.bin', access='direct', recl=4*N*N/2)
Startrecord = mype + 1
read(10,rec=Startrecord) array
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found ', N_cluster, ' clusters'
Case Study
Case 1 - boundary conditions
Boundaries must be treated in a specific way:

! The image is NxN pixels, using 2 processors
Real*4 array(0:N+1,0:N/2+1)
! Set the boundaries on the image side to zero
array(0,:)   = 0.0
array(N+1,:) = 0.0
jside = mod(mype,2)*N/2 + mod(mype,2)
array(:,jside) = 0.0
! Read the local half of the image, row by row
Open(unit=10, file='image.bin', access='direct', recl=4*N)
Do j=1,N/2
   record = mype*N/2 + j
   read(10,rec=record) array(1:N,j)
Enddo
! Read the boundary row owned by the other processor
If(mype.eq.0)then
   record = N/2 + 1
   read(10,rec=record) array(1:N,N/2+1)
else
   record = N/2
   read(10,rec=record) array(1:N,0)
endif
Call Sequential_Find_Cluster(array, N_cluster)
Write(*,*) mype, ' found ', N_cluster, ' clusters'
Case Study
Data analysis - case 2
From observed data...
...to the sky map
Case Study
Data analysis - case 2
Each map pixel is measured N times. The final value for each pixel is an "average" of all the corresponding measurements.
[Figure: a stream of measured values (0.1, 0.7, 0.3, 1.8, 2.3, 0.2, 5.7, ...) paired with the map pixel ids they belong to (2, 7, 1, 11, ...), reduced to the final MAP: 0.3 0.5 0.7 0.9 1.0 1.1 1.4 1.2 1.1 0.9]
Case Study
Case 2: parallelization
• Values and ids are distributed between processors in the data input phase (just like case 1)
• Calculation is performed independently by each processor
• Each processor produces its own COMPLETE map (which is small and can be replicated)
• The final map is the SUM OF ALL THE MAPS calculated by the different processors
Case Study
Case 2: parallelization

! N data, M pixels, Npes processors (M << N)
! Define the basic arrays
Real*8 value(N/Npes)
Real*8 map(M)
Integer id(N/Npes)
! Read the data in parallel (boundaries are neglected)
Open(unit=10, file='data.bin', access='direct', recl=8*N/Npes)
Open(unit=20, file='ids.bin', access='direct', recl=4*N/Npes)
record = mype + 1
Read(10,rec=record) value
Read(20,rec=record) id
! Calculate the local maps
Call Sequential_Calculate_Local_Map(value, id, map)
! Synchronize the processes
Call BARRIER
! Parallel calculation of the final map
Call Calculate_Final_Map(map)
! Print the final map
Call Print_Final_Map(map)
Case Study
Case 2: calculation of the final map

Subroutine Calculate_Final_Map(map)
Real*8 map(M)
Real*8 map_aux(M)
! Accumulate the final map on processor 0, one processor at a time
! (processor 0 already holds its own local map)
Do i=2,npes
   If(mype.eq.0)then
      call RECV(map_aux, i-1)
      map = map + map_aux
   Else if(mype.eq.i-1)then
      call SEND(map, 0)
   Endif
   Call BARRIER
enddo
return

However, MPI offers a MUCH BETTER solution (we will see it tomorrow).
Case Study
Case 2: print the final map

Subroutine Print_Final_Map(map)
Real*8 map(M)
! Only one processor writes the result: at this point
! ONLY processor 0 has the final map and can print it out
If(mype.eq.0)then
   do i=1,M
      write(*,*) i, map(i)
   enddo
Endif
return