Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package
Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
2
Outline
HDF5 and PnetCDF performance comparison
Performance of collective and independent I/O operations
Collective I/O optimizations inside HDF5
3
Introduction
HDF5 and netCDF provide data formats and programming interfaces
HDF5: hierarchical file structure, high flexibility for data management
NetCDF: linear data layout, self-descriptive, minimizes overhead
HDF5 and a parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provide support for parallel access on top of MPI-IO
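Both libraries sit on top of MPI-IO for parallel access. As a minimal sketch (not from the slides), creating an HDF5 file for parallel access looks roughly like this; the file name and the use of MPI_COMM_WORLD are illustrative:

#include <mpi.h>
#include <hdf5.h>

/* Minimal sketch: create an HDF5 file for parallel access on top of MPI-IO.
 * The file name "flash.h5" and the use of MPI_COMM_WORLD are illustrative. */
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* File access property list that routes HDF5 I/O through MPI-IO */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks create/open the file collectively */
    hid_t file = H5Fcreate("flash.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* ... create datasets and perform parallel writes here ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}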
4
HDF5 and PnetCDF performance comparison
Previous study: PnetCDF achieves higher performance than HDF5.
Test platforms:
NCAR Bluesky: IBM Cluster 1600, Power4, AIX 5.1, GPFS
LLNL uP: IBM SP, Power5, AIX 5.3, GPFS
Library versions: PnetCDF 1.0.1 vs. HDF5 1.6.5.
5
HDF5 and PnetCDF performance comparison
Benchmark is the I/O kernel of FLASH.
FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP.
Each processor handles 80 blocks and writes them into 3 output files.
The performance metric given by FLASH I/O is the parallel execution time.
The more processors, the larger the amount of data.
Collective vs. independent I/O operations.
6
HDF5 and PnetCDF performance comparison
[Charts: Flash I/O Benchmark (checkpoint files), write bandwidth in MB/s vs. number of processors on Bluesky and uP, comparing PnetCDF and HDF5 independent I/O.]
7
HDF5 and PnetCDF performance comparison
[Charts: Flash I/O Benchmark (checkpoint files), write bandwidth in MB/s vs. number of processors on Bluesky and uP, comparing PnetCDF, HDF5 collective, and HDF5 independent I/O.]
8
Performance of collective and independent I/O operations
Effects of using contiguous and chunked storage layouts. Four test cases:
Writing a rectangular selection of double floats.
Different access patterns and I/O types.
Data is stored in row-major order.
The tests were executed on NCAR Bluesky using HDF5 version 1.6.5.
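For reference, the kind of write these test cases measure can be sketched as follows, using the HDF5 1.8+ API; the dataset name, dimensions, and per-rank row-band decomposition are illustrative rather than the actual benchmark code:

#include <stdlib.h>
#include <hdf5.h>

/* Sketch: each of nprocs ranks writes one contiguous band of rows of a 2D
 * array of doubles, using independent or collective MPI-IO as requested. */
void write_band(hid_t file, int rank, int nprocs, int collective)
{
    hsize_t count[2] = {16384, 256};                       /* per-rank block  */
    hsize_t dims[2]  = {count[0] * (hsize_t)nprocs, 256};  /* whole dataset   */
    hsize_t start[2] = {(hsize_t)rank * count[0], 0};      /* per-rank offset */

    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's rectangular block in the file */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    /* The transfer property list chooses independent vs. collective I/O */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);

    double *buf = malloc(sizeof(double) * count[0] * count[1]);
    /* ... fill buf with this rank's data ... */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace); H5Dclose(dset);
}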
9
Contiguous storage (Case 1)
Every processor has a noncontiguous selection.
Access requests are interleaved.
Write operation with 32 processors; each processor's selection has 512K rows and 8 columns (32 MB/proc.).
Independent I/O: 1,659.48 s. Collective I/O: 4.33 s.
[Figure: row-major layout; within each row the selections of P0–P3 alternate, so each processor's data is interleaved with the others' in the file.]
10
Contiguous storage (Case 2)
No interleaved requests to combine.
Contiguity is already established.
Write operation with 32 processors; each processor's selection has 16K rows and 256 columns (32 MB/proc.).
Independent I/O: 3.64 s. Collective I/O: 3.87 s.
[Figure: each processor P0–P3 selects a contiguous band of rows, so each selection is contiguous in the file.]
11
Chunked storage (Case 3)
The chunk size does not evenly divide the array size.
If information about the access pattern is given, MPI-IO can perform efficient independent access.
Write operation with 32 processors; each processor's selection has 16K rows and 256 columns (32 MB/proc.).
Independent I/O: 43.00 s. Collective I/O: 22.10 s.
[Figure: the dataset is split into chunk 0 and chunk 1; the chunk boundary does not coincide with the processors' selections, so in the file each processor's data is split across chunks.]
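Chunked storage is requested through the dataset creation property list. A minimal sketch, using the HDF5 1.8+ API and assuming an already-open parallel file; the dataset name and dimensions are illustrative:

#include <hdf5.h>

/* Sketch: create a chunked 2D dataset of doubles in an already-open parallel
 * file. The dataset name and all dimensions are illustrative; note that the
 * chunk size need not divide the dataset size evenly (as in Case 3). */
hid_t create_chunked(hid_t file)
{
    hsize_t dims[2]  = {524288, 256};   /* whole dataset                  */
    hsize_t chunk[2] = {300000, 256};   /* does not evenly divide dims[0] */

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);       /* chunked instead of contiguous layout */

    hid_t dset = H5Dcreate(file, "chunked_data", H5T_NATIVE_DOUBLE, space,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;                        /* caller closes with H5Dclose */
}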
12
Chunked storage (Case 4)
Array partitioned exactly into two chunks.
Write operation with 32 processors; each chunk is selected by 4 processors, and each processor's selection has 256K rows and 32 columns (64 MB/proc.).
Independent I/O: 7.66 s. Collective I/O: 8.68 s.
[Figure: the dataset is partitioned exactly into chunk 0 and chunk 1; processors P0–P3 select bands within chunk 0 and P4–P7 within chunk 1, so each selection falls entirely inside one chunk.]
13
Conclusions
Performance is quite comparable when the two libraries are used in similar ways.
Collective access should be preferred in most situations when using HDF5 1.6.5.
Use of chunked storage may affect the contiguity of the data selection.
14
Improvements to collective I/O support inside HDF5
Advanced HDF5 feature: non-regular selections
Performance optimizations for chunked storage:
Provide several I/O options to achieve good collective I/O performance
Provide APIs for applications to participate in the optimization process
15
Improvement 1: non-regular selections
• Only one HDF5 I/O call
• Good for collective I/O
• Any such applications?
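One way such a non-regular selection can be expressed is by OR-ing hyperslabs into a single file dataspace selection, so one collective H5Dwrite covers all pieces. A rough sketch, assuming an already-open dataset dset, its file dataspace filespace, and a filled buffer buf; offsets and sizes are illustrative:

/* Sketch: OR two disjoint rectangles into one non-regular file selection,
 * then issue a single collective write covering both pieces. */
hsize_t start1[2] = {0, 0},    count1[2] = {100, 50};
hsize_t start2[2] = {200, 10}, count2[2] = {60, 30};

H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start1, NULL, count1, NULL);
H5Sselect_hyperslab(filespace, H5S_SELECT_OR,  start2, NULL, count2, NULL);

/* Memory dataspace holding exactly the selected number of elements */
hsize_t nelem = 100 * 50 + 60 * 30;
hid_t memspace = H5Screate_simple(1, &nelem, NULL);

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);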
16
Performance issues for chunked storage
Previous support: collective I/O can only be done per chunk
Problem: severe performance penalties with many small chunk I/Os
17
Improvement 2: One linked-chunk I/O
[Figure: the selections of P0 and P1 across chunks 1–4 are combined into a single MPI-IO collective view, so all chunks are handled with one collective I/O call.]
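In later HDF5 releases (1.8 and onward), an application can request this behavior explicitly through the dataset transfer property list. A sketch, assuming these calls exist in the HDF5 build at hand and that dset, memspace, filespace, and buf are already set up:

hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

/* Request one linked-chunk collective I/O call covering all chunks
 * (H5FD_MPIO_CHUNK_MULTI_IO would request per-chunk collective I/O). */
H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);

H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);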
18
Improvement 3: Multi-chunk I/O optimization
Have to keep the option to do collective I/O per chunk:
Collective I/O bugs inside different MPI-IO packages
Limitation of system memory
Problem: bad performance caused by improper use of collective I/O
[Figure: chunks accessed by different subsets of processors (P0/P1, P3/P4, P5/P6, P7/P8).]
19
Improvement 4
Problem: HDF5 may make the wrong decision about the way to do collective I/O
Solution: provide APIs for applications to participate in the decision-making process
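The dataset transfer properties added in HDF5 1.8 and later expose the thresholds behind these decisions. The values below are illustrative, and the exact semantics should be checked against the documentation for H5Pset_dxpl_mpio_chunk_opt_num and H5Pset_dxpl_mpio_chunk_opt_ratio in the version in use:

/* Threshold hints (values illustrative), set on the same transfer property
 * list as H5Pset_dxpl_mpio / H5Pset_dxpl_mpio_chunk_opt: */

/* Prefer one linked-chunk collective I/O once each process touches at least
 * this many chunks; otherwise fall back to multi-chunk I/O. */
H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 30);

/* In multi-chunk mode, do collective I/O on a chunk only when at least this
 * percentage of processes access it; otherwise use independent I/O for it. */
H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 50);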
20
Flow chart of collective chunking I/O improvements inside HDF5
[Flow chart: in collective chunk mode, HDF5 first decides (with optional user input) whether to do one linked-chunk I/O; if yes, one collective I/O call covers all chunks; if no, it falls back to multi-chunk I/O, where a second decision (also with optional user input) chooses, per chunk, between collective I/O and independent I/O.]
21
Conclusions
HDF5 provides collective I/O support for non-regular selections.
Supporting collective I/O for chunked storage is not trivial; several I/O options are provided.
Users can participate in the decision-making process that selects among the I/O options.
22
Acknowledgments
This work is funded by National Science Foundation TeraGrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.