Hiding Periodic I/O Costs in Parallel Applications
Hiding Periodic I/O Costs in Parallel Applications
Xiaosong Ma
Department of Computer Science
University of Illinois at Urbana-Champaign
Spring 2003
Roadmap
• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions
Introduction
• Fast-growing technology propels high-performance applications
  – Scientific computation
  – Parallel data mining
  – Web data processing
  – Games, movie graphics
• Individual components' growth is uncoordinated
  – Manual performance tuning needed
We Need Adaptive Optimization
• Flexible and automatic performance optimization desired
• Efficient high-level buffering and prefetching for parallel I/O in scientific simulations
Scientific Simulations
• Important
  – Detail and flexibility
  – Save money and lives
• Challenging
  – Multi-disciplinary
  – High performance crucial
Parallel I/O in Scientific Simulations
• Write-intensive
• Collective and periodic
• “Poor stepchild”
• Bottleneck-prone
• Existing collective I/O focused on data transfer
[Timeline: the application alternates phases … Computation → I/O → Computation → I/O → …]
My Contributions
• Idea: I/O optimizations in a larger scope
  – Parallelism between I/O and other tasks
  – Individual simulation's I/O needs
  – I/O related self-configuration
• Approach: hide the I/O cost
• Results
  – Publications, technology transfer, software
Roadmap
• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions
Latency Hierarchy on Parallel Platforms
• Along the path of data transfer:
  – Smaller throughput
  – Lower parallelism and less scalable
  local memory access → inter-processor communication → disk I/O → wide-area transfer
Basic Idea of Active Buffering
• Purpose: maximize overlap between computation and I/O
• Approach: buffer data as early as possible
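The buffer-early, write-in-background idea can be sketched with a bounded in-memory buffer and a writer thread. This is a minimal illustration, not the Panda library's implementation; `BUFFER_BLOCKS` and `write_block` are made-up names.

```python
import queue
import threading

BUFFER_BLOCKS = 4                       # hypothetical buffer budget
buffer_q = queue.Queue(maxsize=BUFFER_BLOCKS)
written = []

def write_block(block):
    # Stand-in for the real (slow) disk write.
    written.append(block)

def writer_thread():
    # Drain the buffer in the background, overlapping with computation.
    while True:
        block = buffer_q.get()
        if block is None:               # sentinel: no more output
            break
        write_block(block)

t = threading.Thread(target=writer_thread)
t.start()

for step in range(8):
    # ... compute phase would run here ...
    snapshot = f"snapshot-{step}"
    # Hand the snapshot to the buffer; this blocks only when the
    # buffer is full, so I/O normally hides behind the next compute phase.
    buffer_q.put(snapshot)

buffer_q.put(None)                      # flush and stop the writer
t.join()
```

When the buffer never fills, the compute loop sees only the cost of an in-memory enqueue; the disk write happens concurrently.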
Challenges
• Accommodate multiple I/O architectures
• No assumption on buffer space
• Adaptive
  – Buffer availability
  – User request patterns
Roadmap
• Introduction
• Active buffering: hiding recurrent output cost
  – With client-server I/O architecture [IPDPS ’02]
  – With server-less architecture
• Ongoing work: hiding recurrent input cost
• Related work and future work
• Conclusions
Client-Server I/O Architecture
[Diagram: compute processors → I/O servers → File System]
Client State Machine
send ablock
preparebufferdata
exit
enter collective
write routine
buffer space
available
data to send
out of bufferspace
sent
no overflow
all data
Server State Machine

[State diagram: after initialization the server allocates buffers, then alternates between busy-listen (receive a block while there is data to receive and enough buffer space; fetch & write when out of buffer space; handle incoming write requests) and idle-listen (fetch and write buffered blocks when idle; exit on an exit message once all data has been received and written).]
Maximize Apparent Throughput
• Ideal apparent throughput per server:

  Tideal = Dtotal / (Dc-buffered / Tmem-copy + Dc-overflow / Tmsg-passing + Ds-overflow / Twrite)

• More expensive data transfer only becomes visible when overflow happens
• Efficiently masks the difference in write speeds
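A quick numeric illustration of the ideal-throughput formula, with made-up data volumes (MB) and per-path throughputs (MB/s):

```python
def ideal_throughput(d_buffered, d_c_overflow, d_s_overflow,
                     t_memcopy, t_msg, t_write):
    """Total data divided by the total time each portion spends
    on its transfer path (memory copy, message passing, disk write)."""
    d_total = d_buffered + d_c_overflow + d_s_overflow
    time = (d_buffered / t_memcopy
            + d_c_overflow / t_msg
            + d_s_overflow / t_write)
    return d_total / time

# With no overflow, apparent throughput equals memory-copy speed:
print(ideal_throughput(64, 0, 0, 1000, 100, 20))   # 1000.0
# Client-side overflow pulls it toward message-passing speed:
print(ideal_throughput(64, 32, 0, 1000, 100, 20))  # 250.0
```

This matches the bullet above: the slower paths contribute to apparent cost only for the portion of data that overflows onto them.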
Write Throughput without Overflow
[Two charts: throughput per server (MB/s, 0–1200) vs. number of clients (2, 4, 8, 16, 32), comparing local buffering, AB, and MPI, for binary write (left) and HDF4 write (right).]
– Panda Parallel I/O library
– SGI Origin 2000, SHMEM
– Per client: 16MB output data per snapshot, 64MB buffer
– Two servers, each with 256MB buffer
Write Throughput with Overflow
[Two charts: throughput per server (MB/s, 0–250) vs. number of clients (2, 4, 8, 16, 32), comparing ideal, AB, and MPI, for binary write (left) and HDF4 write (right).]
– Panda Parallel I/O library
– SGI Origin 2000, SHMEM, MPI
– Per client: 96MB output data per snapshot, 64MB buffer
– Two servers, each with 256MB buffer
Give Feedback to Application
• “Softer” I/O requirements
• Parallel I/O libraries have been passive
• Active buffering allows I/O libraries to take a more active role
  – Find the optimal output frequency automatically
Server-side Active Buffering

[State diagram, repeated from the Server State Machine slide: the server allocates buffers, receives blocks while there is data to receive and enough buffer space, fetches and writes when out of buffer space or idle, and exits on an exit message once all data has been received and written.]
Performance with Real Applications
• Application overview: GENX
  – Large-scale, multi-component, detailed rocket simulation
  – Developed at the Center for Simulation of Advanced Rockets (CSAR), UIUC
  – Multi-disciplinary, complex, and evolving
• Providing parallel I/O support for GENX
  – Identification of parallel I/O requirements [PDSECA ’03]
  – Motivation and test case for active buffering
Overall Performance of GEN1
– SDSC IBM SP (Blue Horizon)
– 64 clients, 2 I/O servers with AB
– 160MB output data per snapshot (in HDF4)
[Chart: time (s, 0–3500) vs. number of snapshots taken in 30 time steps, broken down into I/O and Computation.]
Aggregate Write Throughput in GEN2
– LLNL IBM SP (ASCI Frost)
– 1 I/O server per 16-way SMP node
– Write in HDF4
[Chart: apparent aggregate write throughput (MB/s, 0–1000) vs. number of compute processors (number of SMP nodes), from 2 (1) to 480 (32), comparing Native I/O and AB.]
Scientific Data Migration
• Output data need to be moved
• Online migration
• Extend active buffering to migration
  – Local storage becomes another layer in the buffer hierarchy
[Timeline: alternating Computation and I/O phases, with output migrating over the Internet.]
I/O Architecture with Data Migration
[Diagram: compute processors → I/O servers → File System / Internet → workstation running visualization tool]
Active Buffering for Data Migration
• Avoid unnecessary local I/O
  – Hybrid migration approach: memory-to-memory transfer vs. disk staging
• Combined with data compression [ICS ’02]
• Self-configuration for online visualization
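The hybrid migration decision can be sketched as a simple routing policy: send a block memory-to-memory while buffer space allows, otherwise fall back to disk staging. This is a hypothetical illustration; `route_blocks` and the fixed-budget policy are assumptions, not the library's actual logic.

```python
def route_blocks(blocks, buffer_capacity):
    """Split (name, size) blocks into those sent memory-to-memory
    and those staged to local disk, under a fixed buffer budget."""
    direct, staged = [], []
    used = 0
    for name, size in blocks:
        if used + size <= buffer_capacity:
            used += size            # memory-to-memory transfer
            direct.append(name)
        else:
            staged.append(name)     # disk staging fallback
    return direct, staged

direct, staged = route_blocks([("a", 40), ("b", 40), ("c", 40)], 100)
print(direct, staged)               # ['a', 'b'] ['c']
```

The point of the hybrid approach is that disk staging is paid only for the overflow, mirroring how local storage becomes one more layer in the buffer hierarchy.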
Roadmap
• Introduction
• Active buffering: hiding recurrent output cost
  – With client-server I/O architecture
  – With server-less architecture [IPDPS ’03]
• Ongoing work: hiding recurrent input cost
• Conclusions
Server-less I/O Architecture
[Diagram: compute processors, each with its own I/O thread, writing directly to the File System]
Making ABT Transparent and Portable
• Unchanged interfaces
• High-level and file-system independent
• Design and evaluation [IPDPS ’03]
• Ongoing transfer to ROMIO
[Diagram: ABT implemented as an ADIO module, alongside the file-system modules NFS, HFS, NTFS, PFS, PVFS, XFS, and UFS]
Active Buffering vs. Asynchronous I/O
| Active buffering | Async I/O |
| --- | --- |
| Application level (platform-independent) | Supported by file system (platform-dependent) |
| Transparent to user | Not transparent to user |
| Designed for collective I/O | More difficult to use in collective I/O |
| Both local and remote I/O | Local I/O |
| Works on top of scientific data formats | May not be supported by scientific data formats |
Roadmap
• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions
I/O in Visualization
• Periodic reads
• Dual modes of operation
  – Interactive
  – Batch-mode
• Harder to overlap reads with computation
[Timeline: alternating Computation and I/O (read) phases … Computation → I/O → Computation → I/O → …]
Efficient I/O Through Data Management
• In-memory database of datasets
  – Manage buffers or values
• Hub for I/O optimization
  – Prefetching for batch mode
  – Caching for interactive mode
• User-supplied read routine
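The data-management hub described above can be sketched as a cache that reads timesteps through a user-supplied routine and, in batch mode, prefetches the next timestep in the background. `DatasetCache` and its methods are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

class DatasetCache:
    """In-memory database of datasets: caches reads (interactive mode)
    and prefetches upcoming timesteps (batch mode)."""
    def __init__(self, read_routine):
        self.read = read_routine          # user-supplied read routine
        self.cache = {}
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = {}

    def get(self, step):
        if step in self.cache:
            return self.cache[step]       # interactive-mode cache hit
        fut = self.pending.pop(step, None)
        # Use the prefetched result if one is in flight, else read now.
        data = fut.result() if fut else self.read(step)
        self.cache[step] = data
        return data

    def prefetch(self, step):
        # Batch mode: start reading the next timestep in the background.
        if step not in self.cache and step not in self.pending:
            self.pending[step] = self.pool.submit(self.read, step)

reads = []
def read_step(step):
    reads.append(step)                    # stand-in for a slow file read
    return f"data-{step}"

cache = DatasetCache(read_step)
cache.prefetch(0)
for step in range(3):
    cache.prefetch(step + 1)              # overlap the next read with work
    assert cache.get(step) == f"data-{step}"
```

In batch mode the read for timestep t+1 runs while timestep t is being rendered, which is exactly the overlap that is harder to get for input than for output.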
Related Work
• Overlapping I/O with computation
  – Replacing synchronous calls with async calls [Agrawal et al. ICS ’96]
  – Threads [Dickens et al. IPPS ’99, More et al. IPPS ’97]
• Automatic performance optimization
  – Optimization with performance models [Chen et al. TSE ’00]
  – Graybox optimization [Arpaci-Dusseau et al. SOSP ’01]
Roadmap
• Introduction
• Active buffering: hiding recurrent output cost
• Ongoing work: hiding recurrent input cost
• Conclusions
Conclusions
• If we can’t shrink it, hide it!
• Performance optimization can be done
  – more actively
  – at a higher level
  – in a larger scope
• Make I/O part of data management
References
• [IPDPS ’03] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Improving MPI-IO Output Performance with Active Buffering Plus Threads, 2003 International Parallel and Distributed Processing Symposium
• [PDSECA ’03] Xiaosong Ma, Xiangmin Jiao, Michael Campbell and Marianne Winslett, Flexible and Efficient Parallel I/O for Large-Scale Multi-component Simulations, The 4th Workshop on Parallel and Distributed Scientific and Engineering Computing with Applications
• [ICS ’02] Jonghyun Lee, Xiaosong Ma, Marianne Winslett and Shengke Yu, Active Buffering Plus Compressed Migration: An Integrated Solution to Parallel Simulations' Data Transport Needs, The 16th ACM International Conference on Supercomputing
• [IPDPS ’02] Xiaosong Ma, Marianne Winslett, Jonghyun Lee and Shengke Yu, Faster Collective Output through Active Buffering, 2002 International Parallel and Distributed Processing Symposium