Project 4
SciDAC All Hands Meeting
March 26-27, 2002
PIs: Alok Choudhary, Wei-keng Liao
Grad Students: Avery Ching, Kenin Coloma, Jianwei Li
ANL Collaborators:
Bill Gropp, Rob Ross, Rajeev Thakur
Enabling High Performance Application I/O
Wei-keng Liao Northwestern University
Outline
1. Design of parallel netCDF APIs
   – Using MPI-IO underneath (student: Jianwei Li)
   – Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
2. Non-contiguous data access on PVFS
   – Design of non-contiguous access APIs (student: Avery Ching)
   – Interfaces to MPI-IO (student: Kenin Coloma)
   – Applications: FLASH, tiled visualization
   – Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
3. High-level data access patterns
   – ENZO astrophysics application
   – Access patterns of an AMR application
NetCDF Overview
NetCDF (network Common Data Form) is an interface for array-oriented data access. It defines a machine-independent file format for representing multi-dimensional arrays with ancillary data, and provides an I/O library for the creation, access, and sharing of array-oriented data.
Each netCDF file is a dataset, which contains a set of named arrays.
Dataset Components
• Dimensions: name, length
  – Fixed dimension
  – UNLIMITED dimension
• Variables: named arrays with a name, type, shape, attributes, and array data
  – Fixed-size variables: arrays of fixed dimensions
  – Record variables: arrays whose most-significant dimension is UNLIMITED
  – Coordinate variables: 1-D arrays with the same name as their dimension
• Attributes: name, type, values, length
  – Variable attributes
  – Global attributes
netCDF example:

{ // CDL notation for a netCDF dataset
dimensions:  // dimension names and lengths
    lat = 5, lon = 10, level = 4, time = unlimited;
variables:   // variable types, names, shapes, attributes
    float temp(time, level, lat, lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
    float rh(time, lat, lon);
        rh:long_name = "relative humidity";
        rh:valid_range = 0.0, 1.0;  // min and max
    int lat(lat), lon(lon), level(level), time(time);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
        time:units = "hours since 1996-1-1";
    // global attributes:
        :source = "Fictional Model Output";
data:  // optional data assignments
    level = 1000, 850, 700, 500;
    lat = 20, 30, 40, 50, 60;
    lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
    time = 12;
    rh = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
         .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
         .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
         .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
          0,.1,.2,.4,.4,.4,.4,.7,.9,.9;  // 1 record allocated
}
Design of Parallel netCDF APIs
• Goals
  – Maintain exactly the original netCDF file format
  – Provide parallel I/O functionality on top of MPI-IO
• High-level parallel APIs
  – Minimize changes to the argument lists of the netCDF APIs
  – For porting legacy codes with minimal changes
• Low-level parallel APIs
  – Use MPI-IO components directly, e.g. derived datatypes
  – For users experienced with MPI-IO
NetCDF File Structure
• Header (dataset definition, extendable)
  – Number of records allocated
  – Dimension list
  – Global attribute list
  – Variable list
• Data (row-major, big-endian, 4-byte aligned)
  – Fixed-size (non-record) data: the data for each variable is stored contiguously in defined order
  – Record data (non-contiguous between records of a variable): a variable number of fixed-size records, each of which contains one record for each of the record variables in defined order
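A minimal sketch of how this layout determines where a variable's data lives: with fixed-size variables stored contiguously one after another behind the header, a variable's start offset is just the header size plus the aligned sizes of the variables defined before it. The names (var_info, hdr_size) are illustrative, not the library's actual internals.

/* Offset of fixed-size variable `target` = header size plus the sizes
 * of all fixed-size variables defined before it, each rounded up to
 * the 4-byte alignment the format requires. */
#include <stddef.h>

typedef struct {
    size_t nelems;     /* product of the variable's dimension lengths */
    size_t elem_size;  /* size of one element in bytes, e.g. 4 for int */
} var_info;

static size_t var_offset(size_t hdr_size, const var_info *vars, int target)
{
    size_t off = hdr_size;
    for (int i = 0; i < target; i++) {
        size_t nbytes = vars[i].nelems * vars[i].elem_size;
        off += (nbytes + 3) & ~(size_t)3;  /* 4-byte alignment */
    }
    return off;
}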
NetCDF APIs
• Dataset APIs -- create/open/close a dataset, set the dataset to define/data mode, and synchronize dataset changes to disk
  – Input: path, mode for create/open; dataset ID for an opened dataset
  – Output: dataset ID for create/open
• Define mode APIs -- define the dataset: add dimensions, variables
  – Input: opened dataset ID; dimension name and length to define a dimension; or variable name, number of dimensions, and shape to define a variable
  – Output: dimension ID; or variable ID
• Attribute APIs -- add, change, and read attributes of datasets
  – Input: opened dataset ID; attribute number or attribute name to access an attribute; or attribute name, type, and value to add/change an attribute
  – Output: attribute value for a read attribute
• Inquiry APIs -- inquire dataset metadata (in memory): dim(id, name, len), var(name, ndims, shape, id)
  – Input: opened dataset ID; dimension name or ID, or variable name or ID
  – Output: dimension info, or variable info
• Data mode APIs -- read/write a variable (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
  – Input: opened dataset ID; variable ID; element start index, count, stride, index map
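As a concrete sketch of the strided-subarray access method, here is what the start/count/stride arguments look like with the serial netCDF-3 C API; the parallel data-mode APIs described here keep the same arguments. The example assumes ncid and varid come from nc_open and nc_inq_varid, and reads every other level of the 4-D "temp" variable from the CDL example above.

#include <netcdf.h>

int read_alternate_levels(int ncid, int varid, float buf[2][5][10])
{
    size_t    start[4]  = {0, 0, 0, 0};   /* record 0, origin corner     */
    size_t    count[4]  = {1, 2, 5, 10};  /* 1 record, 2 levels, all lat/lon */
    ptrdiff_t stride[4] = {1, 2, 1, 1};   /* every other level           */

    return nc_get_vars_float(ncid, varid, start, count, stride,
                             &buf[0][0][0]);
}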
Design of Parallel APIs
• Two file descriptors
  – NetCDF file descriptor: for header I/O (reuse of old code), performed only by process 0
  – MPI_File handle: for data array I/O, performed by all processes
• Implicit MPI file handle and communicator
  – Added to the internal data structure
  – MPI communicator passed as an argument in create/open
• I/O implementation using MPI-IO
  – File view and offsets are computed from the metadata in the header and the user-provided arguments (start, count, stride)
  – Users choose either collective or non-collective I/O calls
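A minimal sketch, not the actual library code, of how such a file view could be built from the header metadata and the user's start/count arguments: `var_off` would come from the variable's position in the file (as in the offset sketch earlier), and `gsizes` are the variable's full dimension lengths.

#include <mpi.h>

void set_subarray_view(MPI_File fh, MPI_Offset var_off, int ndims,
                       int gsizes[], int starts[], int counts[])
{
    MPI_Datatype filetype;

    /* Describe this process's rectangular piece of the global array
     * (row-major, matching the netCDF file layout). */
    MPI_Type_create_subarray(ndims, gsizes, counts, starts,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* Displace the view to where this variable's data begins. */
    MPI_File_set_view(fh, var_off, MPI_INT, filetype,
                      "native", MPI_INFO_NULL);
    MPI_Type_free(&filetype);
}

/* A collective read of the local piece then becomes:
 *   MPI_File_read_all(fh, buf, local_count, MPI_INT, &status); */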
Collective/Non-collective APIs
• Dataset APIs
  – Collective calls over the communicator passed into the create or open call
  – All processes collectively switch between define and data mode
• Define mode, attribute, and inquiry APIs
  – Collective or non-collective calls
  – Operate in local memory (all processes have identical header structures)
• Data mode APIs
  – Collective or non-collective calls
  – Access methods: single value, whole array, subarray, strided subarray
Changes in High-level Parallel APIs

Original netCDF API   Parallel API            Argument change   Needs MPI-IO
nc_create             nc_create               Add MPI_Comm      yes
nc_open               nc_open                 Add MPI_Comm      yes
nc_enddef             nc_enddef               No change         yes
nc_redef              nc_redef                No change         yes
nc_close              nc_close                No change         yes
nc_sync               nc_sync                 No change         yes
Define mode, attribute, inquiry APIs: all     No change         no
nc_put_var_type*      nc_put_var_type,        No change         yes
                      nc_put_var_type_all
nc_get_var_type       nc_get_var_type,        No change         yes
                      nc_get_var_type_all

* type = text | uchar | schar | short | int | long | float | double
Example Code - Write
• Create a dataset
  – Collective
  – The input arguments should be the same among processes
  – The returned ncid differs among processes (but refers to the same dataset)
  – All processes are put in define mode
• Define dimensions
  – Non-collective
  – All processes should have the same definitions
• Define variables
  – Non-collective
  – All processes should have the same definitions
• Add attributes
  – Non-collective
  – All processes should put the same attributes
• End define
  – Collective
  – All processes switch from define mode to data mode
• Write variable data
  – All processes do a number of collective writes to write the data for each variable
  – Independent writes are also possible, if you like
  – Each process provides different argument values, which are set locally
• Close the dataset
  – Collective
/* the only change from serial netCDF: the MPI communicator argument */
status = nc_create(comm, "test.nc", NC_CLOBBER, &ncid);

/* dimensions */
status = nc_def_dim(ncid, "x", 100L, &dimid1);
status = nc_def_dim(ncid, "y", 100L, &dimid2);
status = nc_def_dim(ncid, "z", 100L, &dimid3);
status = nc_def_dim(ncid, "time", NC_UNLIMITED, &udimid);

square_dim[0] = cube_dim[0] = xytime_dim[1] = dimid1;
square_dim[1] = cube_dim[1] = xytime_dim[2] = dimid2;
cube_dim[2] = dimid3;
xytime_dim[0] = udimid;
time_dim[0] = udimid;

/* variables */
status = nc_def_var(ncid, "square", NC_INT, 2, square_dim, &square_id);
status = nc_def_var(ncid, "cube",   NC_INT, 3, cube_dim,   &cube_id);
status = nc_def_var(ncid, "time",   NC_INT, 1, time_dim,   &time_id);
status = nc_def_var(ncid, "xytime", NC_INT, 3, xytime_dim, &xytime_id);

/* attributes */
status = nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
status = nc_put_att_text(ncid, square_id, "description", strlen(desc), desc);

status = nc_enddef(ncid);

/* variable data: collective writes, one per variable */
nc_put_vara_int_all(ncid, square_id, square_start, square_count, buf1);
nc_put_vara_int_all(ncid, cube_id,   cube_start,   cube_count,   buf2);
nc_put_vara_int_all(ncid, time_id,   time_start,   time_count,   buf3);
nc_put_vara_int_all(ncid, xytime_id, xytime_start, xytime_count, buf4);

status = nc_close(ncid);
Example Code - Read
/* the only change from serial netCDF: the MPI communicator argument */
status = nc_open(comm, filename, 0, &ncid);

status = nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdimid);

/* global attributes */
for (i = 0; i < ngatts; i++) {
    status = nc_inq_attname(ncid, NC_GLOBAL, i, name);
    status = nc_inq_att(ncid, NC_GLOBAL, name, &type, &len);
    status = nc_get_att_text(ncid, NC_GLOBAL, name, valuep);
}

/* variables */
for (i = 0; i < nvars; i++) {
    status = nc_inq_var(ncid, i, name, vartypes + i, varndims + i,
                        vardims[i], varnatts + i);

    /* variable attributes */
    for (j = 0; j < varnatts[i]; j++) {
        status = nc_inq_attname(ncid, varids[i], j, name);
        status = nc_inq_att(ncid, varids[i], name, &type, &len);
        status = nc_get_att_text(ncid, varids[i], name, valuep);
    }
}

/* variable data */
for (i = 0; i < NC_MAX_VAR_DIMS; i++)
    start[i] = 0;
for (i = 0; i < nvars; i++) {
    varsize = 1;

    /* dimensions: partition the first dimension among processes */
    for (j = 0; j < varndims[i]; j++) {
        status = nc_inq_dim(ncid, vardims[i][j], name, shape + j);
        if (j == 0) {
            shape[j] /= nprocs;
            start[j] = shape[j] * rank;
        }
        varsize *= shape[j];
    }

    status = nc_get_vara_int_all(ncid, i, start, shape, (int *)valuep);
}

status = nc_close(ncid);
• Open the dataset
  – Collective
  – The input arguments should be the same among processes
  – The returned ncid differs among processes (but refers to the same dataset)
  – All processes are put in data mode
• Dataset inquiries
  – Non-collective
  – Count, name, len, datatype
• Read variable data
  – All processes do a number of collective reads to read the data from each variable in (B, *, *) manner
  – Independent reads are also possible, if you like
  – Each process provides different argument values, which are set locally
• Close the dataset
  – Collective
Non-contiguous Data Access on PVFS
• Problem definition
• Design approaches
  – Multiple I/O
  – Data sieving
  – PVFS list_io
• Integration into MPI-IO
• Experimental results
  – Artificial benchmark
  – FLASH application I/O
  – Tile visualization
Non-contiguous Data Access
• Data access that is not adjacent in memory or in the file
  – Non-contiguous in memory, contiguous in file
  – Non-contiguous in file, contiguous in memory
  – Non-contiguous in file, non-contiguous in memory
• Two applications
  – FLASH astrophysics application
  – Tile visualization

[Figure: memory and file layouts for the three cases: contiguous in memory but non-contiguous in file; non-contiguous in memory but contiguous in file; and non-contiguous in both memory and file]
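These cases map naturally onto MPI derived datatypes. A sketch of the "non-contiguous in memory, contiguous in file" case, assuming a regular stride (the counts here are illustrative, not from the slides):

#include <mpi.h>

void strided_memory_read(MPI_File fh, int *buf)
{
    MPI_Datatype memtype;
    MPI_Status   status;

    /* 64 blocks of 16 ints each, spaced 24 ints apart in memory */
    MPI_Type_vector(64, 16, 24, MPI_INT, &memtype);
    MPI_Type_commit(&memtype);

    /* one call reads 64*16 contiguous ints from the file (per the
     * current file view) into the non-contiguous memory layout */
    MPI_File_read(fh, buf, 1, memtype, &status);
    MPI_Type_free(&memtype);
}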
Multiple I/O Requests
• Intuitive strategy
  – One I/O request per contiguous data segment
• Large number of I/O requests to the file system
  – Communication costs between the application and the I/O servers become significant and can dominate the I/O time

[Figure: the application issues one I/O request per contiguous data region, each sent separately to the I/O servers; reading two segments from the file requires a first and a second read request]
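A sketch of this naive strategy at the POSIX level, one seek-and-read per contiguous segment; with hundreds of thousands of segments, the per-request cost dominates. The offset/length arrays are illustrative.

#include <unistd.h>

ssize_t read_segments(int fd, char *buf,
                      const off_t file_off[], const size_t len[], int n)
{
    ssize_t total = 0;
    for (int i = 0; i < n; i++) {          /* one request per segment */
        if (lseek(fd, file_off[i], SEEK_SET) < 0)
            return -1;
        ssize_t got = read(fd, buf + total, len[i]);
        if (got < 0)
            return -1;
        total += got;
    }
    return total;
}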
Data Sieving I/O
• Read a contiguous chunk from the file into a temporary buffer
• Extract/update the requested portions
  – Number of requests reduced
  – I/O amount increased
  – The number of I/O requests depends on the size of the sieving buffer
• Write back to the file (for write operations)

[Figure: one large contiguous I/O request covers several requested data regions at the I/O servers; the requested portions are extracted from the sieving buffer, so the access needs only a first and a second I/O request]
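A minimal sketch of data sieving for reads: fetch one contiguous chunk spanning all requested segments, then copy the wanted pieces out of the temporary buffer. It assumes the segment list is sorted by offset; error handling and the write-back path for writes are omitted.

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int sieve_read(int fd, char *dst,
               const off_t file_off[], const size_t len[], int n)
{
    off_t  lo   = file_off[0];
    off_t  hi   = file_off[n - 1] + (off_t)len[n - 1];
    size_t span = (size_t)(hi - lo);

    char *sieve = malloc(span);          /* temporary sieving buffer */
    if (sieve == NULL)
        return -1;

    /* One large request instead of n small ones; the unwanted bytes
     * between segments are read too (the "I/O amount increased" cost). */
    if (pread(fd, sieve, span, lo) != (ssize_t)span) {
        free(sieve);
        return -1;
    }

    size_t out = 0;
    for (int i = 0; i < n; i++) {        /* extract requested portions */
        memcpy(dst + out, sieve + (file_off[i] - lo), len[i]);
        out += len[i];
    }
    free(sieve);
    return 0;
}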
PVFS List_io
• Combine non-contiguous I/O requests into a single request
• Client support
  – APIs pvfs_list_read, pvfs_list_write
  – An I/O request is a list of file offsets and file lengths
• I/O server support
  – Wait for the trailing list of file offsets and lengths following the I/O request

[Figure: the application's contiguous data regions are gathered by the PVFS library into a single list I/O request sent to the I/O servers]
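A hedged sketch of what such a list call looks like from the client side. The exact signature is not shown on these slides, so the prototype below is an assumption based only on the "memory offsets/lengths, file offsets/lengths" description of pvfs_read_list given later; it is illustrative, not the real PVFS API.

/* assumed prototype -- illustrative only */
int pvfs_read_list(int fd,
                   int mem_count, char *mem_off[], int mem_len[],
                   int file_count, long long file_off[], int file_len[]);

void example(int fd, char *buf)
{
    /* Three file segments gathered into one contiguous memory buffer:
     * a single request message carries all the offset/length pairs,
     * instead of three separate requests. */
    char     *mem_off[1]  = { buf };
    int       mem_len[1]  = { 3 * 4096 };
    long long file_off[3] = { 0, 65536, 131072 };
    int       file_len[3] = { 4096, 4096, 4096 };

    pvfs_read_list(fd, 1, mem_off, mem_len, 3, file_off, file_len);
}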
Artificial Benchmark
• Contiguous in memory, non-contiguous in file
• Parameters:
  – Number of accesses
  – Number of processors
  – Stride size = file size / number of accesses
  – Block size = stride size / number of processors

[Figure: the file divided into strides; within each stride, Proc 0, Proc 1, and Proc 2 each read one block, shown for 4 accesses]
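A small sketch of the resulting access pattern: each process touches one block per stride, at regular intervals through the file. The function name and parameters are illustrative.

#include <stdio.h>

void print_offsets(long file_size, long naccesses, int nprocs, int rank)
{
    long stride = file_size / naccesses;  /* stride size            */
    long block  = stride / nprocs;        /* block size per process */

    for (long a = 0; a < naccesses; a++) {
        long off = a * stride + rank * block;
        printf("access %ld: offset %ld, length %ld\n", a, off, block);
    }
}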
Benchmark Results
• Parameter configurations
  – 8 clients
  – 8 I/O servers
  – 1-gigabyte file size
• To avoid caching effects at the I/O servers, read/write 4 files alternately, since each I/O server has 512 MB of memory

[Figure: read time (0-600 seconds) vs. number of accesses (20k-800k) for Multiple I/O, Data Sieving, and List_io; write time (0-400 seconds) vs. number of accesses (10k-90k) for Multiple I/O and List_io]
FLASH Application
• An astrophysics application developed at the University of Chicago
  – Simulates the accretion of matter onto a compact star and the subsequent stellar evolution, including nuclear burning either on the surface of the compact star or in its interior
• The I/O benchmark measures the performance of the FLASH output: it produces checkpoint files and plot-files
  – A typical large production run generates ~0.5 TB (100 checkpoint files and 1,000 plot-files)

[Image: the interior of an exploding star, depicting the distribution of pressure during a star explosion]
FLASH -- I/O Access Pattern

[Figure: the FLASH block structure along the X, Y, and Z axes; guard cells surround each block, and a slice of the block is cut for output; each element holds 24 variables (Variable 0 through Variable 23), with blocks accessed along the X-axis and the Y-axis]
Memory Organization
• Each processor has 80 cubes
  – Each has guard cells and a sub-cube which holds the data to be output
• Each element in the cube contains 24 variables, each of type double (8 bytes)
  – Each variable is partitioned among all processors
• Output pattern
  – All variables are saved into a single file, one after another
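A sketch of why this layout produces the in-memory pattern quoted on the next slide (8-byte segments, 192-byte stride): each element packs all 24 double variables together, so gathering one variable across elements picks up 8 bytes every 24*8 = 192 bytes. Sizes come from the slides; the names (element, gather_variable) are illustrative.

#define NVARS 24
#define NX     8   /* interior sub-cube is 8x8x8 elements */

typedef struct {
    double var[NVARS];            /* 24 variables, 8 bytes each */
} element;

/* one of the 80 cubes per processor (guard cells omitted here) */
element cube[NX][NX][NX];

/* Copy variable v of every element into a contiguous output buffer:
 * each copied piece is sizeof(double) = 8 bytes, and consecutive
 * pieces are sizeof(element) = 192 bytes apart in memory. */
void gather_variable(int v, double *out)
{
    int n = 0;
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NX; j++)
            for (int k = 0; k < NX; k++)
                out[n++] = cube[i][j][k].var[v];
}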
FLASH I/O Results
Access patterns:
• In memory
  – Each contiguous segment is small: 8 bytes
  – The stride between two segments is small: 192 bytes
• From memory to file
  – Multiple I/O: 8*8*8*80*24 = 983,040 requests per processor
  – Data sieving: 24 requests per processor
  – List_io: 8*8*8*80*24/64 = 15,360 requests per processor (64 is the maximum number of offset-length pairs)
• In file
  – Each contiguous segment written by each processor is of size 8*8*8*8 = 4096 bytes
  – The output file is of size 8 MB * number of processors

[Figure: time (seconds, log scale from 1 to 100,000) for multiple I/O, data sieving I/O, and list I/O, with 2 clients and 4 clients]
Tile Visualization
• Preprocess "frames" into streams of tiles by staging tile data on visualization nodes
• Read operations only
• Each node reads one sub-tile
• Each sub-tile has ghost regions overlapping with other sub-tiles
• The non-contiguous nature of this file access becomes apparent in its logical file representation
• Example layout
  – 3x2 display (Tile 1 through Tile 6)
  – Frame size of 2532x1408 pixels
  – Tile size of 1024x768 with overlap
  – 3-byte RGB pixels
  – Each frame is stored as a file of size 10 MB

[Figure: a single node's file view; Proc 0, Proc 1, and Proc 2 each read a strided set of regions from the frame file]
Integrating List_io into ROMIO
• ROMIO uses the internal ADIO function flatten to break both the filetypes and datatypes down into lists of offset-length pairs
• Then, using the lists, ROMIO steps through both file and memory addresses
• ROMIO generates memory and file offsets and lengths to pass to pvfs_list_io
• ROMIO calls pvfs_list_io after all data has been read, or when the set maximum array size has been reached, in which case a new list is generated (see the sketch below)

[Figure: filetype offsets & lengths and datatype offsets & lengths map between file and memory, feeding pvfs_read_list(memory offsets/lengths, file offsets/lengths)]
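A minimal sketch, not ROMIO's actual code, of the batching logic described above: offset-length pairs produced from the flattened types are accumulated, and the list call is issued whenever the maximum array size is reached or the request ends. The cap of 64 pairs comes from the FLASH slide; issue_list stands in for the pvfs_list_io call.

#define MAX_ARRAY_SIZE 64   /* max offset-length pairs per list call */

static long long file_off[MAX_ARRAY_SIZE];
static int       file_len[MAX_ARRAY_SIZE];
static int       npairs = 0;

extern void issue_list(long long *off, int *len, int n);

void add_pair(long long off, int len)
{
    file_off[npairs] = off;
    file_len[npairs] = len;
    if (++npairs == MAX_ARRAY_SIZE) {   /* list full: flush it */
        issue_list(file_off, file_len, npairs);
        npairs = 0;                     /* start a new list */
    }
}

void finish(void)
{
    if (npairs > 0)                     /* flush the final partial list */
        issue_list(file_off, file_len, npairs);
    npairs = 0;
}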
Tile I/O Results

[Figure: accumulated time vs. number of I/O nodes (4, 8, 12, 16) for collective data sieving, collective read_list, non-collective data sieving, and non-collective read_list; panels for 4, 8, and 16 compute nodes at data sizes of 1740 MB, 435 MB, 108 MB, and 40 MB]
Analysis of Tile I/O Results
• Collective operations theoretically should be faster, but...
• Hardware problem
  – Fast Ethernet: the overhead in the collective I/O takes too long to catch back up with the independent I/O requests
• Software problem
  – A lot of extra data movement in the ROMIO collectives; the aggregation isn't as smart as it could be
• Plans
  – Use the MPE logging facilities to pin down the problem
  – Study the ROMIO implementation, find bottlenecks in the collectives, and try to weed them out
High-level Data Access Patterns
• Study of the file access patterns of astrophysics applications
  – FLASH from the University of Chicago
  – ENZO from NCSA
• Design of a data management framework using XML and a database
  – Essential metadata collection
  – Trigger rules for automatic I/O optimization
ENZO Application
• Simulates the formation of a cluster of galaxies, starting near the big bang and continuing until the present day
• Used to test theories of how galaxies form, by comparing the results with what is actually observed in the sky today
• File I/O using HDF-4
• Dynamic load balancing using MPI
• Data partitioning: Adaptive Mesh Refinement (AMR)
AMR Data Access Pattern
• Adaptive Mesh Refinement partitions the problem domain into sub-domains recursively and dynamically
• A grid can be owned by only one processor, but one processor can own many grids
• Check-pointing
  – Each grid is written to a separate file (independent writes)
• During restart
  – The sub-domain hierarchy need not be reconstructed
  – Grids at the same time stamp are read all together
• During visualization
  – All grids are combined into a top grid

[Figure: an AMR mesh with recursively refined sub-grids; points mark the data locations]
AMR Hierarchy Represented in XML
• The AMR hierarchy maps naturally onto an XML hierarchy
• The XML is embedded in a relational database
• Metadata queries/updates go through the database
• The database can handle multiple queries simultaneously -- ideal for parallel applications

grid.xml:

<Producer name="astro" />
<DataSet name="grid">
  <Grid id="0" level="0">
    <GridRank value="3" />
    <Dimension value="22 22 22"/>
    <FileName value="grid0.dat"/>
    <Array name="density" dim="3">
      <type IsComplex="1">
        float32, int32, double64
      </type>
    </Array>
    <Grid id="1" level="1">
      <Dimension value="10 8 12"/>
      <Grid id="3" level="2">
        <Dimension value="2 3 2"/>
      </Grid>
    </Grid>
    <Grid id="2" level="1">
      <Dimension value="5 6 4"/>
    </Grid>
  </Grid>
</DataSet>
[Figure: the grid.xml document stored as a relational table; each XML node (element, attribute, cdata) becomes a row with a node key, type, name, value, and parent key]
File System Based XML

[Figure: an XML document (file_based.xml) decomposed into a directory tree r0-r4; _E_ files hold the element map, _A_ files the attributes of an element node, _N_ the namespace, and _T_ the text of character nodes. The example document:

<ns1:x>
  <ns2:y y1="text1" y2="text2">
    Some text
  </ns2:y>
  <ns2:y y1="text3">
    More text
  </ns2:y>
</ns1:x>
]

• The file system is used to support the decomposition of XML documents into files and directories
• This representation consists of an arbitrary hierarchy of directories and files; it preserves the XML philosophy of being textual in representation but requires no further use of an XML parser to process the document
• Metadata is located near the scientific data
Summary
• List_io API incorporated into PVFS for non-contiguous data access
  – Read operation is completed
  – Write operation is in progress
• Parallel netCDF APIs
  – High-level APIs: will be completed soon
  – Low-level APIs: interfaces already defined
  – Validator
• High-level data access patterns
  – Access patterns of AMR applications
  – Other types of applications