
Parallel I/O Performance Study and Optimizations with HDF5, A Scientific Data Package

Christian Chilan, Kent Yang, Albert Cheng, Quincey Koziol, Leon Arber

National Center for Supercomputing Applications, University of Illinois, Urbana-Champaign

Page 2: Outline

HDF5 and PnetCDF performance comparison

Performance of collective and independent I/O operations

Collective I/O optimizations inside HDF5

Page 3: Introduction

HDF5 and netCDF provide data formats and programming interfaces

HDF5: hierarchical file structure, high flexibility for data management

netCDF: linear data layout, self-descriptive, minimizes overhead

HDF5 and a parallel version of netCDF from ANL/Northwestern U. (PnetCDF) provide support for parallel access on top of MPI-IO
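
To make the layering concrete, here is a minimal sketch (not taken from the talk; the file name and the omitted error handling are illustrative) of how an HDF5 application requests parallel access through the MPI-IO driver:

    #include <mpi.h>
    #include <hdf5.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* File access property list that routes HDF5 I/O through MPI-IO */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

        /* All ranks create/open the same file together */
        hid_t file = H5Fcreate("parallel.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... dataset creation and parallel writes go here ... */

        H5Fclose(file);
        H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }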

Page 4: HDF5 and PnetCDF performance comparison

Previous study: PnetCDF achieves higher performance than HDF5.

Test platforms:
NCAR Bluesky: IBM Cluster 1600, AIX 5.1, GPFS, Power4
LLNL uP: IBM SP, AIX 5.3, GPFS, Power5

PnetCDF 1.0.1 vs. HDF5 1.6.5.

Page 5: HDF5 and PnetCDF performance comparison

The benchmark is the I/O kernel of FLASH. FLASH I/O generates 3D blocks of size 8x8x8 on Bluesky and 16x16x16 on uP. Each processor handles 80 blocks and writes them into 3 output files. The performance metric given by FLASH I/O is the parallel execution time. The more processors, the larger the amount of data. The study compares collective and independent I/O operations.
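
For context, the collective/independent distinction in HDF5 is controlled through the dataset transfer property list. A minimal sketch follows, where dataset, memspace, filespace, and buf are placeholders assumed to be set up elsewhere:

    /* Dataset transfer property list selecting the MPI-IO mode */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);

    /* Collective I/O: all ranks participate in the same write call */
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    /* Independent I/O instead: */
    /* H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT); */

    H5Dwrite(dataset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
    H5Pclose(dxpl);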

Page 6: HDF5 and PnetCDF performance comparison

(Figure: Flash I/O benchmark, checkpoint files. Two panels, Bluesky and uP, plotting write bandwidth in MB/s against the number of processors for PnetCDF and HDF5 independent I/O.)

Page 7: HDF5 and PnetCDF performance comparison

(Figure: Flash I/O benchmark, checkpoint files. Two panels, Bluesky and uP, plotting write bandwidth in MB/s against the number of processors for PnetCDF, HDF5 collective, and HDF5 independent I/O.)

Page 8: Performance of collective and independent I/O operations

Effects of using contiguous and chunked storage layouts. Four test cases, each writing a rectangular selection of double floats with different access patterns and I/O types. Data is stored in row-major order. The tests were executed on NCAR Bluesky using HDF5 version 1.6.5.
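
A sketch of how one of these test cases might be set up (illustrative only; TOTAL_ROWS, TOTAL_COLS, CHUNK_ROWS, CHUNK_COLS, ROWS_PER_PROC, file, dxpl, rank, and buf are hypothetical names, and the HDF5 1.6 H5Dcreate signature is assumed):

    /* 2-D dataset of doubles shared by all processes */
    hsize_t dims[2] = {TOTAL_ROWS, TOTAL_COLS};
    hid_t filespace = H5Screate_simple(2, dims, NULL);

    /* Contiguous layout is the default; chunked layout is requested
       through the dataset creation property list. */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk_dims[2] = {CHUNK_ROWS, CHUNK_COLS};
    H5Pset_chunk(dcpl, 2, chunk_dims);   /* omit this call for contiguous storage */

    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace, dcpl);

    /* Each rank selects its own rectangular block (here: a band of rows) */
    hsize_t start[2] = {rank * ROWS_PER_PROC, 0};
    hsize_t count[2] = {ROWS_PER_PROC, TOTAL_COLS};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(2, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);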

Page 9: Contiguous storage (Case 1)

Every processor has a noncontiguous selection.

Access requests are interleaved.

Write operation with 32 processors; each processor's selection has 512K rows and 8 columns (32 MB per processor). Independent I/O: 1,659.48 s. Collective I/O: 4.33 s.

(Figure: file layout for Case 1; in row-major order each row is split among P0, P1, P2, P3, so the requests interleave: Row 1 = P0 P1 P2 P3, Row 2 = P0 P1 P2 P3, ….)
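
A sketch of the kind of selection that produces this pattern (hypothetical variable names): each rank takes a narrow band of columns spanning all rows, so in row-major file order the 32 selections interleave every 8 doubles.

    /* Case 1 style selection for this rank: 512K rows x 8 columns */
    hsize_t start[2] = {0, (hsize_t)rank * 8};
    hsize_t count[2] = {512 * 1024, 8};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);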

Page 10: Contiguous storage (Case 2)

No interleaved requests to combine.

Contiguity is already established. Write operation with 32 processors; each processor's selection has 16K rows and 256 columns (32 MB per processor). Independent I/O: 3.64 s. Collective I/O: 3.87 s.

(Figure: Case 2 layout; P0, P1, P2, and P3 each write a contiguous block of rows, so the file layout is simply P0 P1 P2 P3.)

Page 11: Chunked storage (Case 3)

The chunk size does not divide the array size exactly.

If information about the access pattern is given, MPI-IO can perform efficient independent access.

Write operation with 32 processors; each processor's selection has 16K rows and 256 columns (32 MB per processor). Independent I/O: 43.00 s. Collective I/O: 22.10 s.

(Figure: the dataset is partitioned into chunk 0 and chunk 1; because the chunk boundary does not align with the processor selections, each chunk contains interleaved pieces from several processors: P0 … P1, P1 … P2, P2 … P3, P3 … P0.)

Page 12: Chunked storage (Case 4)

The array is partitioned exactly into two chunks.

Write operation with 32 processors; each chunk is selected by 4 processors, and each processor's selection has 256K rows and 32 columns (64 MB per processor). Independent I/O: 7.66 s. Collective I/O: 8.68 s.

(Figure: Case 4 layout; chunk 0 is selected by P0–P3 and chunk 1 by P4–P7; within each chunk the processor selections are contiguous, P0 P1 P2 P3.)

Page 13: Conclusions

Performance is quite comparable when the two libraries are used in similar ways.

Collective access should be preferred in most situations when using HDF5 1.6.5.

Use of chunked storage may affect the contiguity of the data selection.

Page 14: Improvements of collective IO support inside HDF5

Advanced HDF5 feature: non-regular selections

Performance optimizations for chunked storage:

Provide several IO options to achieve good collective IO performance

Provide APIs for applications to participate in the optimization process

Page 15: Improvement 1: non-regular selections

• Only one HDF5 IO call
• Good for collective IO
• Any such applications?
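
A minimal sketch (hypothetical region sizes; dset, filespace, dxpl, and buf assumed set up elsewhere) of how a non-regular selection can be built by OR-ing hyperslabs into one file dataspace and written with a single HDF5 IO call:

    /* Two disjoint rectangles combined into one non-regular selection */
    hsize_t start1[2] = {0, 0},    count1[2] = {100, 50};
    hsize_t start2[2] = {200, 80}, count2[2] = {60, 20};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start1, NULL, count1, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_OR,  start2, NULL, count2, NULL);

    /* Memory space must hold the same number of elements: 100*50 + 60*20 */
    hsize_t nelem = 100 * 50 + 60 * 20;
    hid_t memspace = H5Screate_simple(1, &nelem, NULL);

    /* One call covers the whole non-regular region; with a collective
       transfer property list this maps onto collective IO. */
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);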

Page 16: Performance issues for chunked storage

Previous support
• Collective IO can only be done per chunk

Problem
• Severe performance penalties with many small chunk IOs

Page 17: Improvement 2: One linked chunk IO

(Figure: P0 and P1 each access chunk 1, chunk 2, chunk 3, and chunk 4; all chunk accesses are combined into a single MPI-IO collective view rather than one collective call per chunk.)

Page 18: Improvement 3: Multi-chunk IO Optimization

Have to keep the option to do collective IO per chunk:
• Collective IO bugs inside different MPI-IO packages
• Limitation of system memory

Problem
• Bad performance caused by improper use of collective IO

(Figure: dataset chunks, each accessed by a pair of processors: P0/P1, P3/P4, P5/P6, P7/P8.)

Page 19: Improvement 4

Problem
• HDF5 may make the wrong decision about the way to do collective IO

Solution
• Provide APIs for applications to participate in the decision-making process
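
The slide does not name these APIs; as an illustration, the hints that later shipped in HDF5 (1.8.x) for exactly this purpose look roughly like the following, all set on the dataset transfer property list (threshold values here are arbitrary examples):

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* Force one linked-chunk collective IO call for all chunks ... */
    H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_ONE_IO);
    /* ... or force the multi-chunk path instead: */
    /* H5Pset_dxpl_mpio_chunk_opt(dxpl, H5FD_MPIO_CHUNK_MULTI_IO); */

    /* Or keep the automatic decision but tune its thresholds: */
    H5Pset_dxpl_mpio_chunk_opt_num(dxpl, 30);    /* chunks per process needed to choose linked-chunk IO */
    H5Pset_dxpl_mpio_chunk_opt_ratio(dxpl, 60);  /* percent of processes per chunk needed to keep that chunk collective */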

Page 20: Flow chart of collective chunking IO improvements inside HDF5

(Flow chart: starting from collective chunk mode, a first decision, taking optional user input, chooses whether to do one linked-chunk IO, i.e. one collective IO call for all chunks. If not, multi-chunk IO is used, and a second decision, also taking optional user input, chooses per chunk between collective IO and independent IO.)

Page 21: Conclusions

HDF5 provides collective IO support for non-regular selections.

Supporting collective IO for chunked storage is not trivial. Several IO options are provided.

Users can participate in the decision-making process that selects among the different IO options.

Page 22: Acknowledgments

This work is funded by National Science Foundation Teragrid grants, the Department of Energy's ASC Program, the DOE SciDAC program, NCSA, and NASA.