Improving Parallel I/O for Scientific Applications on Jaguar
Weikuan Yu, Jeffrey Vetter
November 28, 2007
Managed by UT-Battelle for the Department of Energy
Outline

• Overview of Jaguar and its I/O Subsystem
• Characterization, profiling and tuning
  – Parallel I/O
  – Storage Orchestration
  – Server-based client I/O statistics in the Lustre file system
• Optimize parallel I/O over Jaguar
  – OPAL: Opportunistic and Adaptive MPI-IO library over Lustre
  – Arbitrary striping pattern per MPI file; stripe-aligned I/O accesses
  – Direct I/O and Lockless I/O
  – Hierarchical Striping
  – Partitioned Collective I/O
• Conclusion & Future Work
Parallel I/O in Scientific Applications on Jaguar

• Earlier analysis of scientific codes running at ORNL/LCF
  – Seldom make use of MPI-IO
  – Little use of high-level I/O middleware such as NetCDF or HDF5
  – Large variance in parallel I/O performance across different software stacks
• Current scenarios faced by scientific codes
  – Application-specific I/O tuning and optimization, such as decomposition, aggregation, and buffering
  – Direct use of the POSIX I/O interface, hoping for the best from the file system
• Our effort
  – Obtain a good understanding of parallel I/O behavior on Jaguar
  – Promote the use of I/O libraries, e.g. MPI-IO
  – Provide an open-source, feature-rich MPI-IO implementation
  – Avoid redundant effort across different applications
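To make the promoted MPI-IO path concrete, here is a minimal sketch of every rank writing its block of a single shared checkpoint file through one collective call; the file name, per-rank buffer size, and the striping hint value are illustrative assumptions rather than settings taken from the applications discussed.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_File  fh;
        MPI_Info  info;
        int       rank;
        const int count = 1 << 20;                 /* 1 Mi doubles per rank (assumed size) */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(count * sizeof(double));
        for (int i = 0; i < count; i++)
            buf[i] = (double)rank;

        MPI_Info_create(&info);
        /* Striping hint understood by ROMIO-style MPI-IO layers; name and value are assumptions. */
        MPI_Info_set(info, "striping_factor", "64");

        /* All ranks collectively create and open one shared file. */
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

        /* Each rank writes a contiguous block at a rank-derived offset, collectively. */
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }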
ORNL Jaguar I/O Subsystem

• Storage (DDN 9550): 18 racks, each of 36 TB; Fibre Channel (FC) links; each LUN (2 TB) spans two tiers of disks
• Three Lustre file systems, each with its own MDS
  – Two with 72 OSTs each (150 TB); one with 144 OSTs (300 TB)
  – 72 service nodes for OSSs, each supporting 4 OSTs
Parallel I/O over Jaguar

[Diagram: the end-to-end I/O stack — Application, Parallel I/O libraries, Lustre file system, and DDN storage with 144 targets — annotated with 57 GB/s and 40 GB/s at the lower layers and "?? GB/s" for the benchmark- and application-level bandwidths]

• From storage to application, efficiency degrades at each layer
• End-to-end I/O performance depends on the mid-layer components: the Lustre file system and the parallel I/O libraries
Characterization of Parallel I/O

• Independent and contiguous read/write
  – Shared file
  – Separated files
• Non-contiguous I/O
  – Data sieving
  – Collective I/O
• Parallel open/create
• Orchestrated I/O accesses over Jaguar
Independent & Contiguous I/O

[Figures: bandwidth (MB/s) vs. number of processes for a single OST (1–32 processes) and for a single DDN couplet (1–128 processes), comparing shared-file and separated-file reads and writes; and system scalability — aggregate bandwidth (MB/s) vs. number of OSTs (16–144) for shared-file and file-per-process reads and writes]
• Independent, contiguous I/O scales well unless too many clients contend for the same servers (over-provisioning)
• Writes scale better with an increasing number of processes
• Reads scale better with a very large striping width
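The two access modes compared above map to a small amount of MPI-IO code; the following is a hedged sketch (file names and the 64 MB per-rank block size are placeholders) of a shared-file write versus a file-per-process write.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK (64 * 1024 * 1024)   /* 64 MB per rank; an assumed block size */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = calloc(1, BLOCK);

        /* Shared file: every rank writes its own contiguous region of one file. */
        MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write_at(fh, (MPI_Offset)rank * BLOCK, buf, BLOCK, MPI_BYTE,
                          MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        /* Separated files: every rank writes a private file opened on MPI_COMM_SELF. */
        char fname[64];
        snprintf(fname, sizeof fname, "separated.%05d", rank);
        MPI_File_open(MPI_COMM_SELF, fname,
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_File_write(fh, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }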
Small and Noncontiguous I/O

[Figures: bandwidth (MB/s, log scale) for direct vs. data-sieving reads and writes and for collective reads and writes, at Nprocs = 128 and 256, for 128^3, 256^3, and 512^3 problem sizes]
• Small, noncontiguous I/O results in lower performance
• Strong scaling makes the problem even worse
• Data sieving from individual processes helps only reads, not writes, because of the increased lock contention
• Collective I/O helps by aggregating I/O and eliminating lock contention
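A hedged sketch of how the strided pattern behind these results can be expressed and steered: a vector file view is written collectively (two-phase I/O) while ROMIO-style hints disable data sieving for writes; the hint names follow ROMIO conventions and should be treated as assumptions for any particular MPI stack, and the chunk sizes are invented.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int blk = 128, nblocks = 1024;       /* small strided chunks (assumed sizes) */
        MPI_File fh;
        MPI_Info info;
        MPI_Datatype filetype;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *buf = calloc((size_t)nblocks * blk, sizeof(double));

        /* Each rank owns every nprocs-th block of `blk` doubles in the file. */
        MPI_Type_vector(nblocks, blk, blk * nprocs, MPI_DOUBLE, &filetype);
        MPI_Type_commit(&filetype);

        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_cb_write", "enable");   /* force two-phase collective writes   */
        MPI_Info_set(info, "romio_ds_write", "disable");  /* avoid lock-heavy data sieving writes */

        MPI_File_open(MPI_COMM_WORLD, "strided.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_File_set_view(fh, (MPI_Offset)rank * blk * sizeof(double),
                          MPI_DOUBLE, filetype, "native", info);

        /* Collective write: aggregators gather the small pieces into large requests. */
        MPI_File_write_all(fh, buf, nblocks * blk, MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Type_free(&filetype);
        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }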
Parallel Open/Create

[Figures: parallel open/create time (ms) at Nprocs = 512 vs. number of OSTs for shared vs. separated files, and for static vs. dynamic selection of the first OST; and parallel open time (ms, log scale) vs. number of processes (1–512) for shared vs. separated files]
• Use a shared MPI file for scalable parallel open/create
• Select the first OST dynamically for scalable parallel creation
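A brief sketch of the recommended shared-file create, with the stripe layout requested through MPI-IO hints at open time; the hint names (striping_factor, striping_unit) follow ROMIO/Cray conventions and, like the stripe count and 4 MB stripe size shown, should be treated as assumptions for any particular MPI stack.

    #include <mpi.h>

    /* Collectively create one shared MPI file whose Lustre layout is requested via hints.
     * With common MPI-IO implementations only one rank touches the metadata server during
     * the create, which is what makes the shared-file open/create scale. */
    static int open_shared(char *path, MPI_File *fh)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "striping_factor", "64");      /* stripe count (assumed value)     */
        MPI_Info_set(info, "striping_unit", "4194304");   /* 4 MB stripe size (assumed value) */

        int rc = MPI_File_open(MPI_COMM_WORLD, path,
                               MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
        MPI_Info_free(&info);
        return rc;
    }

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Init(&argc, &argv);
        if (open_shared("shared_output.dat", &fh) == MPI_SUCCESS)
            MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }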
Orchestrated File Distribution

[Figure: LUN/OST organization of a DDN storage couplet, and the impact of file distribution within a couplet — write bandwidth (MB/s) vs. offset of the first OST (0–7) for Nprocs = 32, 64, 128, 256]

[Figure: impact of file distribution for S3D on the scr72b file system — bandwidth (MB/s) vs. offset index of the first OST (0–7) for Nprocs = 1024, 2048, 4096]
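The "offset of the first OST" knob studied above is normally set outside MPI-IO, either with lfs setstripe or programmatically; below is a hedged sketch using liblustreapi's llapi_file_create, assuming its historical five-argument signature, with the path and all layout values purely illustrative.

    #include <lustre/lustreapi.h>   /* liblustreapi; the header name varies across Lustre releases */
    #include <stdio.h>

    int main(void)
    {
        /* Create an empty file whose layout starts on a chosen OST:
         *   stripe_size    = 4 MB
         *   stripe_offset  = 3   (index of the first OST; the tuning knob studied above)
         *   stripe_count   = 8   (number of OSTs spanned by the file)
         *   stripe_pattern = 0   (default RAID0 layout)
         * All values are illustrative assumptions. */
        int rc = llapi_file_create("/lustre/scratch/orchestrated.dat",
                                   4ULL * 1024 * 1024, 3, 8, 0);
        if (rc)
            fprintf(stderr, "llapi_file_create failed: %d\n", rc);
        return rc ? 1 : 0;
    }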
Flash I/O Tuning

[Figures: Flash I/O checkpoint-file bandwidth (MB/s) at Nprocs = 1024 vs. collective buffer size (no collective I/O, 4 MB–128 MB); and Flash I/O bandwidth at Nprocs = 2048 vs. number of I/O aggregators (128–2048) for block sizes 16 and 32]
• Collective I/O significantly improves the I/O performance of Flash I/O
• For small block sizes, choose a smaller number of aggregators for Flash I/O
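The two knobs tuned above map to the reserved MPI-IO hints cb_buffer_size and cb_nodes; a minimal sketch follows, with the file name and the hint values shown being examples rather than recommendations.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_File fh;
        MPI_Info info;

        MPI_Init(&argc, &argv);
        MPI_Info_create(&info);

        /* Collective-buffering knobs corresponding to the Flash I/O tuning above. */
        MPI_Info_set(info, "cb_buffer_size", "16777216"); /* 16 MB collective buffer per aggregator */
        MPI_Info_set(info, "cb_nodes", "256");             /* number of I/O aggregators              */

        MPI_File_open(MPI_COMM_WORLD, "flash_checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        /* ... collective writes (MPI_File_write_all / _at_all) would go here ... */
        MPI_File_close(&fh);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }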
Using a Shared MPI File for S3D

[Figure: S3D file open time (s) vs. number of processes (128–8192) for separated files vs. a shared file]
• With separated files, S3D does not scale to 8K processes because of the file-creation cost
• Using a shared MPI file for S3D reduces the file-creation cost
• It is also more convenient to manage far fewer files
[Figure: S3D effective bandwidth (MB/s) vs. number of processes (128–8192) for separated files vs. a shared file]
Profiling through Server-based Client I/O Statistics in Lustre

• Lustre client statistics
  – /proc/fs/lustre/*/*/exports
  – Per-export client statistics are not sufficient: liblustre clients are transient
  – Provided per-NID client statistics instead
  – Integrated into the Lustre source release
• Benefits
  – In-depth tracing of aggressive usage
  – Fine-grained I/O profiling
  – Knobs for future I/O service control
[Diagram: an exports directory with per-export entries (exp01, exp02, exp03) vs. one with per-NID entries (nid01, nid02, nid03)]
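For a sense of how such server-side statistics can be harvested, here is a hedged sketch that simply dumps every stats file matching a /proc pattern; the exact path layout differs between Lustre versions and between per-export and per-NID statistics, so the pattern is an assumption.

    #include <glob.h>
    #include <stdio.h>

    int main(void)
    {
        /* Pattern follows the slide's /proc/fs/lustre/*/*/exports layout; adjust for the
         * local Lustre release (e.g. obdfilter vs. mdt, per-export vs. per-NID entries). */
        const char *pattern = "/proc/fs/lustre/*/*/exports/*/stats";
        glob_t g;

        if (glob(pattern, 0, NULL, &g) != 0) {
            fprintf(stderr, "no stats files matched %s\n", pattern);
            return 1;
        }

        for (size_t i = 0; i < g.gl_pathc; i++) {
            FILE *f = fopen(g.gl_pathv[i], "r");
            if (!f)
                continue;
            printf("== %s ==\n", g.gl_pathv[i]);
            char line[512];
            while (fgets(line, sizeof line, f))   /* each line is one counter and its value */
                fputs(line, stdout);
            fclose(f);
        }

        globfree(&g);
        return 0;
    }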
Parallel I/O Instrumentation & Optimization

• Overview of Jaguar and its I/O Subsystem
• Characterization, Profiling and Tuning
  – Parallel I/O
  – Storage Orchestration
  – Server-based client I/O statistics in the Lustre file system
• Optimize Parallel I/O over Jaguar
  – OPAL: Opportunistic and Adaptive MPI-IO library over Lustre
  – Arbitrary striping pattern per MPI file; stripe-aligned I/O accesses
  – Direct I/O and Lockless I/O
  – Hierarchical Striping
  – Partitioned Collective I/O
• Conclusion & Future Work
Parallel I/O Stack on Jaguar

• Parallel I/O stack
  – MPI-IO, NetCDF, HDF5
  – MPI-IO serves as the basic building component
  – Default implementation from Cray, with components named AD_Sysio layered over sysio & liblustre
• Lack of an open-source MPI-IO implementation
  – Proprietary code base; prolonged phase of bug fixes
  – Stagnant code development w.r.t. Lustre file system features
  – Little latitude for examining the internals of the MPI-IO implementation
  – Little freedom for other developers to experiment with optimizations
OPAL: OPportunistic and Adaptive MPI-IO Library

• An MPI-IO package optimized for Lustre
• Features
  – Dual platform (Cray XT and Linux) with a unified code base
  – Open source, better performance
  – Improved data-sieving implementation
  – Arbitrary striping specification over Cray XT
  – Lustre stripe-aligned file domain partitioning
  – Direct I/O, lockless I/O, and more
• Opportunities
  – Parallel I/O profiling and tuning
  – A code base for patches and optimizations
• http://ft.ornl.gov/projects/io/#download

[Diagram: OPAL over Jaguar]
Independent Read/Write

• IOR read/write: block size 512 MB, transfer size 4 MB
• Read from or write to a shared file, 64 stripes wide
[Figures: parallel write to and read from 64 stripes — bandwidth (MB/s) vs. number of processes (8–248) for OPAL and Cray, with OPAL's percentage improvement over Cray]
Scientific Benchmarks – Results

[Figures: MPI-Tile-IO bandwidth (MB/s) vs. number of processes (64–1024) and NAS BT-IO bandwidth vs. number of processes (64, 256, 576) for Cray and OPAL; and Flash I/O bandwidth for the checkpoint, plot-center, and plot-corner files at Nprocs = 256, 512, 1024 for Cray and OPAL]
Direct I/O and Lockless I/O

• Direct I/O
  – Performs I/O without file system buffering at the clients
  – Buffers need to be page-aligned
• Lockless I/O
  – Client I/O without acquiring file system locks
  – Beneficial for small and non-contiguous I/O
  – Beneficial when multiple clients do I/O to the same stripes
• Recently implemented in OPAL for Lustre platforms
  – Works over Linux clusters and Cray XT (CNL)
  – Performance improvement for large streaming I/O on a Linux cluster
  – Benefits over Jaguar/CNL are yet to be determined
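Since direct I/O is only valid when buffers (and typically offsets and lengths) are page-aligned, the sketch below shows the alignment pattern with posix_memalign and O_DIRECT; the 4 KiB page size and file name are assumptions, and this illustrates the POSIX-level requirement rather than OPAL's internal implementation.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t page = 4096;            /* assume 4 KiB pages */
        const size_t len  = 64 * page;       /* length must also be a multiple of the page size */
        void *buf;

        /* O_DIRECT requires the user buffer, the file offset, and the transfer
         * length to be page-aligned; posix_memalign provides an aligned buffer. */
        if (posix_memalign(&buf, page, len) != 0)
            return 1;
        memset(buf, 0, len);

        int fd = open("direct.dat", O_CREAT | O_WRONLY | O_DIRECT, 0644);
        if (fd < 0) {
            free(buf);
            return 1;
        }

        ssize_t n = pwrite(fd, buf, len, 0); /* offset 0 is trivially aligned */
        close(fd);
        free(buf);
        return n == (ssize_t)len ? 0 : 1;
    }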
Hierarchical Striping

• A revised implementation of MPI-IO with subfiles (CCGrid 2007)
  – Breaks the Lustre 160-stripe limitation
  – Avoids the over-provisioning problem
  – Mitigates the impact of striping overhead
• Data layout: hierarchical striping
  – Creates another level of striping pattern across subfiles
  – Allows maximum coverage of Lustre storage targets
[Diagram: hierarchical striping (HS) — logical stripes 0 .. nS-1 distributed round-robin across subfile 0, subfile 1, ..., subfile n (stripe width 2, stripe size w)]
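A hypothetical sketch of the address translation implied by this layout: a logical file offset is mapped to a subfile index and an offset within that subfile, assuming a simple round-robin distribution of hierarchical stripes of size w over the subfiles. The function and parameter names are invented for illustration, not taken from the subfiling implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Map a logical file offset to (subfile index, offset within that subfile),
     * assuming `nsub` subfiles and a hierarchical stripe size of `w` bytes. */
    static void hs_map(uint64_t logical_off, uint64_t w, int nsub,
                       int *subfile, uint64_t *subfile_off)
    {
        uint64_t stripe_idx = logical_off / w;        /* which hierarchical stripe            */
        *subfile     = (int)(stripe_idx % nsub);      /* stripes go round-robin over subfiles */
        *subfile_off = (stripe_idx / nsub) * w        /* full stripes already in that subfile */
                     + (logical_off % w);             /* position inside the current stripe   */
    }

    int main(void)
    {
        int sf;
        uint64_t off;
        /* 10 MiB into a layout with 4 subfiles and 1 MiB hierarchical stripes. */
        hs_map(10 * 1048576ULL, 1048576ULL, 4, &sf, &off);
        printf("subfile %d, offset %llu\n", sf, (unsigned long long)off);
        return 0;
    }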
Global Synchronization in Collective I/O on Cray XT

• Why does collective I/O show low performance on Jaguar?
  – Much of the time is spent on global synchronization
  – Application scientists are forced to aggregate I/O within smaller sub-communicators
[Figure: time (s) spent in collective synchronization, point-to-point communication, and I/O vs. number of processes (8–256), along with the ratios of collective-to-I/O and point-to-point-to-I/O time]
Partitioned Collective I/O (ParColl)

• Prototyped in an open-source MPI-IO library
• Divides processes into smaller groups (a sketch of the idea follows)
  – Reduces global synchronization
  – Still benefits from I/O aggregation
• Improves collective I/O for different benchmarks
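A minimal sketch of the idea (not OPAL's actual ParColl code): split MPI_COMM_WORLD into a configurable number of groups and run the collective open/write inside each group, so the synchronization cost of two-phase I/O is confined to group members. The group count, file name, and per-rank region mapping are assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        const int ngroups = 8;                 /* tuning knob: number of I/O groups (assumed) */
        const int count = 1 << 18;             /* doubles per rank (assumed)                  */
        MPI_Comm iocomm;
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *buf = calloc(count, sizeof(double));

        /* Partition the ranks into ngroups roughly equal-sized sub-communicators. */
        int group_size = (nprocs + ngroups - 1) / ngroups;
        MPI_Comm_split(MPI_COMM_WORLD, rank / group_size, rank, &iocomm);

        /* Collective I/O happens inside each group only, but all groups still share
         * one file; each world rank writes its own disjoint region of it. */
        MPI_File_open(iocomm, "parcoll.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        MPI_Offset off = (MPI_Offset)rank * count * sizeof(double);
        MPI_File_write_at_all(fh, off, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Comm_free(&iocomm);
        free(buf);
        MPI_Finalize();
        return 0;
    }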
[Figures: Flash I/O bandwidth (MB/s) with ParColl vs. number of groups (Cray default and 1–64 groups); and MPI-Tile-IO bandwidth (MB/s) vs. number of processes (16–1024) for Cray and ParColl, with percentage improvement]

• ParColl benefits both MPI-Tile-IO and Flash I/O
Conclusion & Future Work

• Characterized I/O performance over Jaguar
• Tuned and optimized various I/O patterns
• To make OPAL a public software release
• To model and predict application I/O performance
• A pointer again: http://ft.ornl.gov/project/io/