Improving Parallel I/O for Scientific Applications on Jaguar
Weikuan Yu, Jeffrey Vetter
November 28, 2007
Managed by UT-Battelle for the Department of Energy
Outline

• Overview of Jaguar and its I/O Subsystem
• Characterization, profiling and tuning
  – Parallel I/O
  – Storage Orchestration
  – Server-based client I/O statistics in the Lustre file system
• Optimize parallel I/O over Jaguar
  – OPAL: Opportunistic and Adaptive MPI-IO library over Lustre
  – Arbitrary striping pattern per MPI file; stripe-aligned I/O accesses
  – Direct I/O and Lockless I/O
  – Hierarchical Striping
  – Partitioned Collective I/O
• Conclusion & Future Work
Parallel I/O in Scientific Applications on Jaguar

• Earlier analysis of scientific codes running at ORNL/LCF
  – Seldom make use of MPI-IO
  – Little use of high-level I/O middleware such as NetCDF or HDF5
  – Large variance in parallel I/O performance across different software stacks
• Current scenarios faced by scientific codes
  – Application-specific I/O tuning and optimization, such as decomposition, aggregation, and buffering
  – Direct use of the POSIX I/O interface, hoping for the best from the file system
• Our effort
  – Obtain a good understanding of parallel I/O behavior on Jaguar
  – Promote the use of I/O libraries, e.g. MPI-IO
  – Provide an open-source, feature-rich MPI-IO implementation
  – Avoid redundant effort across different applications
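To make the promoted MPI-IO path concrete, here is a minimal sketch of every rank writing its block of a single shared checkpoint file through one collective call; the file name, per-rank buffer size, and the striping hint value are illustrative assumptions rather than settings taken from the applications discussed.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_File  fh;
        MPI_Info  info;
        int       rank;
        const int count = 1 << 20;                 /* 1 Mi doubles per rank (assumed size) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++)
            buf[i] = (double)rank;

        MPI_Info_create(&info);
        /* Striping hint understood by ROMIO-style MPI-IO layers; name and value are assumptions. */
        MPI_Info_set(info, "striping_factor", "64");

        /* All ranks collectively create and open one shared file. */
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Each rank writes a contiguous block at a rank-derived offset, collectively. */
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }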
ORNL Jaguar I/O Subsystem

• Storage (DDN 9550): 18 racks, each of 36 TB; Fibre Channel (FC) links; each LUN (2 TB) spans two tiers of disks
• Three Lustre file systems, each with its own MDS
  – Two with 72 OSTs each (150 TB); one with 144 OSTs (300 TB)
  – 72 service nodes for OSSs, each supporting 4 OSTs
Parallel I/O over Jaguar

[Diagram: the end-to-end I/O stack — Application, Parallel I/O libraries, Lustre file system, and DDN storage with 144 targets — annotated with 57 GB/s and 40 GB/s at the lower layers and "?? GB/s" for the benchmark- and application-level bandwidths]

• From storage to application, efficiency degrades at each layer
• End-to-end I/O performance depends on the mid-layer components: the Lustre file system and the parallel I/O libraries
Characterization of Parallel I/O

• Independent and contiguous read/write
  – Shared file
  – Separated files
• Non-contiguous I/O
  – Data sieving
  – Collective I/O
• Parallel open/create
• Orchestrated I/O accesses over Jaguar
Independent & Contiguous I/O

[Figures: bandwidth (MB/s) vs. number of processes for a single OST (1–32 processes) and for a single DDN couplet (1–128 processes), comparing shared-file and separated-file reads and writes; and system scalability — aggregate bandwidth (MB/s) vs. number of OSTs (16–144) for shared-file and file-per-process reads and writes]
• Independent, contiguous I/O scales well unless too many clients contend for the same servers (over-provisioning)
• Writes scale better with an increasing number of processes
• Reads scale better with a very large striping width
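The two access modes compared above map to a small amount of MPI-IO code; the following is a hedged sketch (file names and the 64 MB per-rank block size are placeholders) of a shared-file write versus a file-per-process write.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK (64 * 1024 * 1024)   /* 64 MB per rank; an assumed block size */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(1, BLOCK);

        /* Shared file: every rank writes its own contiguous region of one file. */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK, MPI_BYTE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        /* Separated files: every rank writes a private file opened on MPI_COMM_SELF. */
        char fname[64];
        snprintf(fname, sizeof fname, "separated.%05d", rank);
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write(fh, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }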
Small and Noncontiguous I/O

[Figures: bandwidth (MB/s, log scale) for direct vs. data-sieving reads and writes and for collective reads and writes, at Nprocs = 128 and 256, for 128^3, 256^3, and 512^3 problem sizes]
• Small, noncontiguous I/O results in lower performance
• Strong scaling makes the problem even worse
• Data sieving from individual processes helps only reads, not writes, because of the increased lock contention
• Collective I/O helps by aggregating I/O and eliminating lock contention
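A hedged sketch of how the strided pattern behind these results can be expressed and steered: a vector file view is written collectively (two-phase I/O) while ROMIO-style hints disable data sieving for writes; the hint names follow ROMIO conventions and should be treated as assumptions for any particular MPI stack, and the chunk sizes are invented.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int blk = 128, nblocks = 1024;       /* small strided chunks (assumed sizes) */
        MPI_File fh;
        MPI_Info info;
        MPI_Datatype filetype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *buf = calloc((size_t)nblocks * blk, sizeof(double));

        /* Each rank owns every nprocs-th block of `blk` doubles in the file. */
        MPI_Type_vector(nblocks, blk, blk * nprocs, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");   /* force two-phase collective writes   */
        MPI_Info_set(info, "romio_ds_write", "disable");  /* avoid lock-heavy data sieving writes */

        MPI_File_open(MPI_COMM_WORLD, "strided.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_File_set_view(fh, (MPI_Offset)rank * blk * sizeof(double),
                          MPI_DOUBLE, filetype, "native", info);

        /* Collective write: aggregators gather the small pieces into large requests. */
        MPI_File_write_all(fh, buf, nblocks * blk, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }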
Parallel Open/Create

[Figures: parallel open/create time (ms) at Nprocs = 512 vs. number of OSTs for shared vs. separated files, and for static vs. dynamic selection of the first OST; and parallel open time (ms, log scale) vs. number of processes (1–512) for shared vs. separated files]
• Use a shared MPI file for scalable parallel open/create
• Select the first OST dynamically for scalable parallel creation
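A brief sketch of the recommended shared-file create, with the stripe layout requested through MPI-IO hints at open time; the hint names (striping_factor, striping_unit) follow ROMIO/Cray conventions and, like the stripe count and 4 MB stripe size shown, should be treated as assumptions for any particular MPI stack.

    #include <mpi.h>

    /* Collectively create one shared MPI file whose Lustre layout is requested via hints.
     * With common MPI-IO implementations only one rank touches the metadata server during
     * the create, which is what makes the shared-file open/create scale. */
    static int open_shared(char *path, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "64");      /* stripe count (assumed value)     */
        MPI_Info_set(info, "striping_unit", "4194304");   /* 4 MB stripe size (assumed value) */

        int rc = MPI_File_open(MPI_COMM_WORLD, path,
                               MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
        MPI_Info_free(&info);
        return rc;
    }

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Init(&argc, &argv);
        if (open_shared("shared_output.dat", &fh) == MPI_SUCCESS)
            MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }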
Orchestrated File Distribution

[Figure: LUN/OST organization of a DDN storage couplet, and the impact of file distribution within a couplet — write bandwidth (MB/s) vs. offset of the first OST (0–7) for Nprocs = 32, 64, 128, 256]

[Figure: impact of file distribution for S3D on the scr72b file system — bandwidth (MB/s) vs. offset index of the first OST (0–7) for Nprocs = 1024, 2048, 4096]
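The "offset of the first OST" knob studied above is normally set outside MPI-IO, either with lfs setstripe or programmatically; below is a hedged sketch using liblustreapi's llapi_file_create, assuming its historical five-argument signature, with the path and all layout values purely illustrative.

    #include <lustre/lustreapi.h>   /* liblustreapi; the header name varies across Lustre releases */
    #include <stdio.h>

    int main(void)
    {
        /* Create an empty file whose layout starts on a chosen OST:
         *   stripe_size    = 4 MB
         *   stripe_offset  = 3   (index of the first OST; the tuning knob studied above)
         *   stripe_count   = 8   (number of OSTs spanned by the file)
         *   stripe_pattern = 0   (default RAID0 layout)
         * All values are illustrative assumptions. */
        int rc = llapi_file_create("/lustre/scratch/orchestrated.dat",
                                   4ULL * 1024 * 1024, 3, 8, 0);
        if (rc)
            fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return rc ? 1 : 0;
    }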
Flash I/O Tuning

[Figures: Flash I/O checkpoint-file bandwidth (MB/s) at Nprocs = 1024 vs. collective buffer size (no collective I/O, 4 MB–128 MB); and Flash I/O bandwidth at Nprocs = 2048 vs. number of I/O aggregators (128–2048) for block sizes 16 and 32]
• Collective I/O significantly improves the I/O performance of Flash I/O
• For small block sizes, choose a smaller number of aggregators for Flash I/O
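The two knobs tuned above map to the reserved MPI-IO hints cb_buffer_size and cb_nodes; a minimal sketch follows, with the file name and the hint values shown being examples rather than recommendations.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);

        /* Collective-buffering knobs corresponding to the Flash I/O tuning above. */
        MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB collective buffer per aggregator */
        MPI_Info_set(info, "cb_nodes", "256");             /* number of I/O aggregators              */

        MPI_File_open(MPI_COMM_WORLD, "flash_checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        /* ... collective writes (MPI_File_write_all / _at_all) would go here ... */
        MPI_File_close(&fh);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }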
Using a Shared MPI File for S3D

[Figure: S3D file open time (s) vs. number of processes (128–8192) for separated files vs. a shared file]
• With separated files, S3D does not scale to 8K processes because of the file-creation cost
• Using a shared MPI file for S3D reduces the file-creation cost
• It is also more convenient to manage far fewer files
[Figure: S3D effective bandwidth (MB/s) vs. number of processes (128–8192) for separated files vs. a shared file]
Profiling through Server-based Client I/O Statistics in Lustre

• Lustre client statistics
  – /proc/fs/lustre/*/*/exports
  – Per-export client statistics are not sufficient: liblustre clients are transient
  – Provided per-NID client statistics instead
  – Integrated into the Lustre source release
• Benefits
  – In-depth tracing of aggressive usage
  – Fine-grained I/O profiling
  – Knobs for future I/O service control
[Diagram: an exports directory with per-export entries (exp01, exp02, exp03) vs. one with per-NID entries (nid01, nid02, nid03)]
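For a sense of how such server-side statistics can be harvested, here is a hedged sketch that simply dumps every stats file matching a /proc pattern; the exact path layout differs between Lustre versions and between per-export and per-NID statistics, so the pattern is an assumption.

    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        /* Pattern follows the slide's /proc/fs/lustre/*/*/exports layout; adjust for the
         * local Lustre release (e.g. obdfilter vs. mdt, per-export vs. per-NID entries). */
        const char *pattern = "/proc/fs/lustre/*/*/exports/*/stats";
        glob_t g;

        if (glob(pattern, 0, NULL, &g) != 0) {
            fprintf(stderr, "no stats files matched %s\n", pattern);
            return 1;
        }

        for (size_t i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            if (!f)
                continue;
            printf("== %s ==\n", g.gl_pathv[i]);
            char line[512];
            while (fgets(line, sizeof line, f))   /* each line is one counter and its value */
                fputs(line, stdout);
            fclose(f);
        }

        globfree(&g);
        return 0;
    }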
Parallel I/O Instrumentation & Optimization

• Overview of Jaguar and its I/O Subsystem
• Characterization, Profiling and Tuning
  – Parallel I/O
  – Storage Orchestration
  – Server-based client I/O statistics in the Lustre file system
• Optimize Parallel I/O over Jaguar
  – OPAL: Opportunistic and Adaptive MPI-IO library over Lustre
  – Arbitrary striping pattern per MPI file; stripe-aligned I/O accesses
  – Direct I/O and Lockless I/O
  – Hierarchical Striping
  – Partitioned Collective I/O
• Conclusion & Future Work
Parallel I/O Stack on Jaguar

• Parallel I/O stack
  – MPI-IO, NetCDF, HDF5
  – MPI-IO serves as the basic building component
  – Default implementation from Cray, with components named AD_Sysio layered over sysio & liblustre
• Lack of an open-source MPI-IO implementation
  – Proprietary code base; prolonged phase of bug fixes
  – Stagnant code development w.r.t. Lustre file system features
  – Little latitude for examining the internals of the MPI-IO implementation
  – Little freedom for other developers to experiment with optimizations
OPAL: OPportunistic and Adaptive MPI-IO Library

• An MPI-IO package optimized for Lustre
• Features
  – Dual platform (Cray XT and Linux) with a unified code base
  – Open source, better performance
  – Improved data-sieving implementation
  – Arbitrary striping specification over Cray XT
  – Lustre stripe-aligned file domain partitioning
  – Direct I/O, lockless I/O, and more
• Opportunities
  – Parallel I/O profiling and tuning
  – A code base for patches and optimizations
• http://ft.ornl.gov/projects/io/#download

[Diagram: OPAL over Jaguar]
Independent Read/Write

• IOR read/write: block size 512 MB, transfer size 4 MB
• Read from or write to a shared file, 64 stripes wide
[Figures: parallel write to and read from 64 stripes — bandwidth (MB/s) vs. number of processes (8–248) for OPAL and Cray, with OPAL's percentage improvement over Cray]
Scientific Benchmarks – Results

[Figures: MPI-Tile-IO bandwidth (MB/s) vs. number of processes (64–1024) and NAS BT-IO bandwidth vs. number of processes (64, 256, 576) for Cray and OPAL; and Flash I/O bandwidth for the checkpoint, plot-center, and plot-corner files at Nprocs = 256, 512, 1024 for Cray and OPAL]
Direct I/O and Lockless I/O

• Direct I/O
  – Performs I/O without file system buffering at the clients
  – Buffers need to be page-aligned
• Lockless I/O
  – Client I/O without acquiring file system locks
  – Beneficial for small and non-contiguous I/O
  – Beneficial when multiple clients do I/O to the same stripes
• Recently implemented in OPAL for Lustre platforms
  – Works over Linux clusters and Cray XT (CNL)
  – Performance improvement for large streaming I/O on a Linux cluster
  – Benefits over Jaguar/CNL are yet to be determined
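Since direct I/O is only valid when buffers (and typically offsets and lengths) are page-aligned, the sketch below shows the alignment pattern with posix_memalign and O_DIRECT; the 4 KiB page size and file name are assumptions, and this illustrates the POSIX-level requirement rather than OPAL's internal implementation.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t page = 4096;            /* assume 4 KiB pages */
        const size_t len  = 64 * page;       /* length must also be a multiple of the page size */
        void *buf;

        /* O_DIRECT requires the user buffer, the file offset, and the transfer
         * length to be page-aligned; posix_memalign provides an aligned buffer. */
        if (posix_memalign(&buf, page, len) != 0)
            return 1;
        memset(buf, 0, len);

        int fd = open("direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0) {
            free(buf);
            return 1;
        }

        ssize_t n = pwrite(fd, buf, len, 0); /* offset 0 is trivially aligned */
        close(fd);
        free(buf);
        return n == (ssize_t)len ? 0 : 1;
    }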
Hierarchical Striping

• A revised implementation of MPI-IO with subfiles (CCGrid 2007)
  – Breaks the Lustre 160-stripe limitation
  – Avoids the over-provisioning problem
  – Mitigates the impact of striping overhead
• Data layout: hierarchical striping
  – Creates another level of striping pattern across subfiles
  – Allows maximum coverage of Lustre storage targets
[Diagram: hierarchical striping (HS) — logical stripes 0 .. nS-1 distributed round-robin across subfile 0, subfile 1, ..., subfile n (stripe width 2, stripe size w)]
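A hypothetical sketch of the address translation implied by this layout: a logical file offset is mapped to a subfile index and an offset within that subfile, assuming a simple round-robin distribution of hierarchical stripes of size w over the subfiles. The function and parameter names are invented for illustration, not taken from the subfiling implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Map a logical file offset to (subfile index, offset within that subfile),
     * assuming `nsub` subfiles and a hierarchical stripe size of `w` bytes. */
    static void hs_map(uint64_t logical_off, uint64_t w, int nsub,
                       int *subfile, uint64_t *subfile_off)
    {
        uint64_t stripe_idx = logical_off / w;        /* which hierarchical stripe            */
        *subfile     = (int)(stripe_idx % nsub);      /* stripes go round-robin over subfiles */
        *subfile_off = (stripe_idx / nsub) * w        /* full stripes already in that subfile */
                     + (logical_off % w);             /* position inside the current stripe   */
    }

    int main(void)
    {
        int sf;
        uint64_t off;
        /* 10 MiB into a layout with 4 subfiles and 1 MiB hierarchical stripes. */
        hs_map(10 * 1048576ULL, 1048576ULL, 4, &sf, &off);
        printf("subfile %d, offset %llu\n", sf, (unsigned long long)off);
        return 0;
    }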
Global Synchronization in Collective I/O on Cray XT

• Why does collective I/O show low performance on Jaguar?
  – Much of the time is spent on global synchronization
  – Application scientists are forced to aggregate I/O within smaller sub-communicators
[Figure: time (s) spent in collective synchronization, point-to-point communication, and I/O vs. number of processes (8–256), along with the ratios of collective-to-I/O and point-to-point-to-I/O time]
Partitioned Collective I/O (ParColl)

• Prototyped in an open-source MPI-IO library
• Divides processes into smaller groups (a sketch of the idea follows)
  – Reduces global synchronization
  – Still benefits from I/O aggregation
• Improves collective I/O for different benchmarks
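A minimal sketch of the idea (not OPAL's actual ParColl code): split MPI_COMM_WORLD into a configurable number of groups and run the collective open/write inside each group, so the synchronization cost of two-phase I/O is confined to group members. The group count, file name, and per-rank region mapping are assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int ngroups = 8;                 /* tuning knob: number of I/O groups (assumed) */
        const int count = 1 << 18;             /* doubles per rank (assumed)                  */
        MPI_Comm iocomm;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *buf = calloc(count, sizeof(double));

        /* Partition the ranks into ngroups roughly equal-sized sub-communicators. */
        int group_size = (nprocs + ngroups - 1) / ngroups;
        MPI_Comm_split(MPI_COMM_WORLD, rank / group_size, rank, &iocomm);

        /* Collective I/O happens inside each group only, but all groups still share
         * one file; each world rank writes its own disjoint region of it. */
        MPI_File_open(iocomm, "parcoll.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Comm_free(&iocomm);
        free(buf);
        MPI_Finalize();
        return 0;
    }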
[Figures: Flash I/O bandwidth (MB/s) with ParColl vs. number of groups (Cray default and 1–64 groups); and MPI-Tile-IO bandwidth (MB/s) vs. number of processes (16–1024) for Cray and ParColl, with percentage improvement]

• ParColl benefits both MPI-Tile-IO and Flash I/O
Conclusion & Future Work

• Characterized I/O performance over Jaguar
• Tuned and optimized various I/O patterns
• To make OPAL a public software release
• To model and predict application I/O performance
• A pointer again: http://ft.ornl.gov/project/io/