Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package
Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
2
Outline
HDF5 and PnetCDF performance comparison
Performance of collective and independent I/O operations
Collective I/O optimizations inside HDF5
3
Introduction
HDF5 and netCDF provide data formats and programming interfaces
HDF5: hierarchical file structure, high flexibility for data management
NetCDF: linear data layout, self-descriptive, minimizes overhead
HDF5 and a parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provide support for parallel access on top of MPI-IO
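Both libraries sit on top of MPI-IO for parallel access. As a minimal sketch (not from the slides), creating an HDF5 file for parallel access looks roughly like this; the file name and the use of MPI_COMM_WORLD are illustrative:

#include <mpi.h>
#include <hdf5.h>

/* Minimal sketch: create an HDF5 file for parallel access on top of MPI-IO.
 * The file name "flash.h5" and the use of MPI_COMM_WORLD are illustrative. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* File access property list that routes HDF5 I/O through MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks create/open the file collectively */
    hid_t file = H5Fcreate("flash.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create datasets and perform parallel writes here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}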
4
HDF5 and PnetCDF performance comparison
Previous study: PnetCDF achieves higher performance than HDF5.
Test platforms:
NCAR Bluesky: IBM Cluster 1600, Power4, AIX 5.1, GPFS
LLNL uP: IBM SP, Power5, AIX 5.3, GPFS
Library versions: PnetCDF 1.0.1 vs. HDF5 1.6.5.
5
HDF5 and PnetCDF performance comparison
Benchmark is the I/O kernel of FLASH.
FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
Each processor handles 80 blocks and writes them into 3 output files.
The performance metric given by FLASH I/O is the parallel execution time.
The more processors, the larger the amount of data.
Collective vs. independent I/O operations.
6
HDF5 and PnetCDF performance comparison
[Charts: Flash I/O Benchmark (checkpoint files), write bandwidth in MB/s vs. number of processors on Bluesky and uP, comparing PnetCDF and HDF5 independent I/O.]
7
HDF5 and PnetCDF performance comparison
[Charts: Flash I/O Benchmark (checkpoint files), write bandwidth in MB/s vs. number of processors on Bluesky and uP, comparing PnetCDF, HDF5 collective, and HDF5 independent I/O.]
8
Performance of collective and independent I/O operations
Effects of using contiguous and chunked storage layouts. Four test cases:
Writing a rectangular selection of double floats.
Different access patterns and I/O types.
Data is stored in row-major order.
The tests were executed on NCAR Bluesky using HDF5 version 1.6.5.
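For reference, the kind of write these test cases measure can be sketched as follows, using the HDF5 1.8+ API; the dataset name, dimensions, and per-rank row-band decomposition are illustrative rather than the actual benchmark code:

#include <stdlib.h>
#include <hdf5.h>

/* Sketch: each of nprocs ranks writes one contiguous band of rows of a 2D
 * array of doubles, using independent or collective MPI-IO as requested. */
void write_band(hid_t file, int rank, int nprocs, int collective)
{
    hsize_t count[2] = {16384, 256};                       /* per-rank block  */
    hsize_t dims[2]  = {count[0] * (hsize_t)nprocs, 256};  /* whole dataset   */
    hsize_t start[2] = {(hsize_t)rank * count[0], 0};      /* per-rank offset */

    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's rectangular block in the file */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* The transfer property list chooses independent vs. collective I/O */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);

    double *buf = malloc(sizeof(double) * count[0] * count[1]);
    /* ... fill buf with this rank's data ... */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace); H5Dclose(dset);
}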
9
Contiguous storage (Case 1)
Every processor has a noncontiguous selection.
Access requests are interleaved.
Write operation with 32 processors; each processor's selection has 512K rows and 8 columns (32 MB/proc.).
Independent I/O: 1,659.48 s. Collective I/O: 4.33 s.
[Figure: row-major layout; within each row the selections of P0–P3 alternate, so each processor's data is interleaved with the others' in the file.]
10
Contiguous storage (Case 2)
No interleaved requests to combine.
Contiguity is already established.
Write operation with 32 processors; each processor's selection has 16K rows and 256 columns (32 MB/proc.).
Independent I/O: 3.64 s. Collective I/O: 3.87 s.
[Figure: each processor P0–P3 selects a contiguous band of rows, so each selection is contiguous in the file.]
11
Chunked storage (Case 3)
The chunk size does not evenly divide the array size.
If information about the access pattern is given, MPI-IO can perform efficient independent access.
Write operation with 32 processors; each processor's selection has 16K rows and 256 columns (32 MB/proc.).
Independent I/O: 43.00 s. Collective I/O: 22.10 s.
[Figure: the dataset is split into chunk 0 and chunk 1; the chunk boundary does not coincide with the processors' selections, so in the file each processor's data is split across chunks.]
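Chunked storage is requested through the dataset creation property list. A minimal sketch, using the HDF5 1.8+ API and assuming an already-open parallel file; the dataset name and dimensions are illustrative:

#include <hdf5.h>

/* Sketch: create a chunked 2D dataset of doubles in an already-open parallel
 * file. The dataset name and all dimensions are illustrative; note that the
 * chunk size need not divide the dataset size evenly (as in Case 3). */
hid_t create_chunked(hid_t file)
{
    hsize_t dims[2]  = {524288, 256};   /* whole dataset                  */
    hsize_t chunk[2] = {300000, 256};   /* does not evenly divide dims[0] */

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);       /* chunked instead of contiguous layout */

    hid_t dset = H5Dcreate(file, "chunked_data", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;                        /* caller closes with H5Dclose */
}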
12
Chunked storage (Case 4)
Array partitioned exactly into two chunks.
Write operation with 32 processors; each chunk is selected by 4 processors, and each processor's selection has 256K rows and 32 columns (64 MB/proc.).
Independent I/O: 7.66 s. Collective I/O: 8.68 s.
[Figure: the dataset is partitioned exactly into chunk 0 and chunk 1; processors P0–P3 select bands within chunk 0 and P4–P7 within chunk 1, so each selection falls entirely inside one chunk.]
13
Conclusions
Performance is quite comparable when the two libraries are used in similar ways.
Collective access should be preferred in most situations when using HDF5 1.6.5.
Use of chunked storage may affect the contiguity of the data selection.
14
Improvements to collective I/O support inside HDF5
Advanced HDF5 feature: non-regular selections
Performance optimizations for chunked storage:
Provide several I/O options to achieve good collective I/O performance
Provide APIs for applications to participate in the optimization process
15
Improvement 1: non-regular selections
• Only one HDF5 I/O call
• Good for collective I/O
• Any such applications?
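One way such a non-regular selection can be expressed is by OR-ing hyperslabs into a single file dataspace selection, so one collective H5Dwrite covers all pieces. A rough sketch, assuming an already-open dataset dset, its file dataspace filespace, and a filled buffer buf; offsets and sizes are illustrative:

/* Sketch: OR two disjoint rectangles into one non-regular file selection,
 * then issue a single collective write covering both pieces. */
hsize_t start1[2] = {0, 0},    count1[2] = {100, 50};
hsize_t start2[2] = {200, 10}, count2[2] = {60, 30};

H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start1, NULL, count1, NULL);
H5Sselect_hyperslab(filespace, H5S_SELECT_OR,  start2, NULL, count2, NULL);

/* Memory dataspace holding exactly the selected number of elements */
hsize_t nelem = 100 * 50 + 60 * 30;
hid_t memspace = H5Screate_simple(1, &nelem, NULL);

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);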
16
Performance issues for chunked storage
Previous support: collective I/O can only be done per chunk
Problem: severe performance penalties with many small chunk I/Os
17
Improvement 2: One linked-chunk I/O
[Figure: the selections of P0 and P1 across chunks 1–4 are combined into a single MPI-IO collective view, so all chunks are handled with one collective I/O call.]
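In later HDF5 releases (1.8 and onward), an application can request this behavior explicitly through the dataset transfer property list. A sketch, assuming these calls exist in the HDF5 build at hand and that dset, memspace, filespace, and buf are already set up:

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

/* Request one linked-chunk collective I/O call covering all chunks
 * (H5FD_MPIO_CHUNK_MULTI_IO would request per-chunk collective I/O). */
H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);

H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);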
18
Improvement 3: Multi-chunk I/O optimization
Have to keep the option to do collective I/O per chunk:
Collective I/O bugs inside different MPI-IO packages
Limitation of system memory
Problem: bad performance caused by improper use of collective I/O
[Figure: chunks accessed by different subsets of processors (P0/P1, P3/P4, P5/P6, P7/P8).]
19
Improvement 4
Problem: HDF5 may make the wrong decision about the way to do collective I/O
Solution: provide APIs for applications to participate in the decision-making process
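The dataset transfer properties added in HDF5 1.8 and later expose the thresholds behind these decisions. The values below are illustrative, and the exact semantics should be checked against the documentation for H5Pset_dxpl_mpio_chunk_opt_num and H5Pset_dxpl_mpio_chunk_opt_ratio in the version in use:

/* Threshold hints (values illustrative), set on the same transfer property
 * list as H5Pset_dxpl_mpio / H5Pset_dxpl_mpio_chunk_opt: */

/* Prefer one linked-chunk collective I/O once each process touches at least
 * this many chunks; otherwise fall back to multi-chunk I/O. */
H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 30);

/* In multi-chunk mode, do collective I/O on a chunk only when at least this
 * percentage of processes access it; otherwise use independent I/O for it. */
H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 50);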
20
Flow chart of collective chunking I/O improvements inside HDF5
[Flow chart: in collective chunk mode, HDF5 first decides (with optional user input) whether to do one linked-chunk I/O; if yes, one collective I/O call covers all chunks; if no, it falls back to multi-chunk I/O, where a second decision (also with optional user input) chooses, per chunk, between collective I/O and independent I/O.]
21
Conclusions
HDF5 provides collective I/O support for non-regular selections.
Supporting collective I/O for chunked storage is not trivial; several I/O options are provided.
Users can participate in the decision-making process that selects among the I/O options.
22
Acknowledgments
This work is funded by National Science Foundation TeraGrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.