
Parallel Visualization on Leadership Computing Resources

One approach to meeting the increasing demands for analysis and visualization is to perform more of these tasks on supercomputers traditionally reserved for simulations. This can lead to increased performance, reduced cost, and tighter integration of analysis and visualization in computational science. Our team is developing and optimizing visualization algorithms to scale to architectures with over 10,000 cores, and we analyze their performance for ultrascale applications.

Algorithms
Parallelizing algorithms such as volume rendering into pipelines, each consisting of many cores, maximizes performance on these architectures.

Tom Peterka, Rob Ross (ANL), Han-Wei Shen (OSU), Kwan-Liu Ma (UCD), Wesley Kendall (UTK), Hongfeng Yu (SNL)

Analyses

Applications
From astrophysics to climate modeling, we are working one-on-one with scientists and their data to meet their large-scale visualization requirements.

Architectures
At scales of tens of thousands of cores, visualization algorithms need to be tuned to specific architectures, so we study systems, such as I/O, in detail.


The Argonne Leadership Computing Facility's IBM Blue Gene/P (left) and the Oak Ridge Leadership Computing Facility's Cray XT4 (right). Images courtesy ALCF and NCCS.

Architectural diagram of the Blue Gene/P I/O system: compute nodes connect to gateway nodes over the BG/P tree network (5.1 GB/s), gateway nodes to a commodity network over 10 Gb/s Ethernet, the commodity network to storage nodes over 16 Gb/s InfiniBand, and storage nodes to enterprise storage over 3 Gb/s Serial ATA.

Volume rendering of angular momentum and streamlines of the shock-wave vector field in a supernova dataset, courtesy of John Blondin.

Time lag between first snowfall and green-up in a MODIS dataset, courtesy of NASA.

Parallel volume rendering consists of parallel reading of the dataset in the I/O stage, software ray casting in the rendering stage, and blending of individual images in the compositing stage.
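
The three stages might be organized as in the following C/MPI sketch. It is illustrative only: the file name volume.raw, the even slab decomposition, and the stub raycast_subvolume() are assumptions, and the single MPI_Reduce sum stands in for proper depth-ordered image compositing rather than the compositing scheme actually used.

```c
/* Sketch of the I/O, rendering, and compositing stages.
 * Assumptions: "volume.raw" holds 32-bit floats, the volume divides
 * evenly among processes, and compositing is simplified to a sum. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define IMG_W 1024
#define IMG_H 1024

/* Placeholder for the real software ray caster. */
static void raycast_subvolume(const float *vox, size_t nvox, float *rgba)
{
    (void)vox; (void)nvox;
    memset(rgba, 0, (size_t)IMG_W * IMG_H * 4 * sizeof(float));
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* I/O stage: every process reads its own slab in parallel. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "volume.raw", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_Offset bytes;
    MPI_File_get_size(fh, &bytes);
    size_t nvox = (size_t)(bytes / sizeof(float)) / (size_t)size;
    float *voxels = malloc(nvox * sizeof(float));
    MPI_File_read_at_all(fh, (MPI_Offset)rank * (MPI_Offset)(nvox * sizeof(float)),
                         voxels, (int)nvox, MPI_FLOAT, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Rendering stage: software ray casting of the local subvolume. */
    float *local_img = malloc((size_t)IMG_W * IMG_H * 4 * sizeof(float));
    raycast_subvolume(voxels, nvox, local_img);

    /* Compositing stage: blend partial images into one final image.
     * A sum is used here for brevity; real over-operator blending
     * requires depth-ordered compositing. */
    float *final_img = (rank == 0)
        ? malloc((size_t)IMG_W * IMG_H * 4 * sizeof(float)) : NULL;
    MPI_Reduce(local_img, final_img, IMG_W * IMG_H * 4,
               MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    free(voxels); free(local_img); free(final_img);
    MPI_Finalize();
    return 0;
}
```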

The cost of I/O in rendering a time series can be masked by visualizing multiple time steps in parallel pipelines. Each of the pipelines below is further parallelized among multiple nodes. The forwarder daemon runs on the login node and serializes final results.
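
A minimal sketch of how the process set might be partitioned into such pipelines with MPI_Comm_split: ranks with the same color form one pipeline, and time steps are dealt out round-robin so several steps are processed concurrently. NUM_PIPELINES, the step count, and the commented read/render calls are illustrative assumptions; the forwarder daemon is not shown.

```c
/* Sketch: split MPI_COMM_WORLD into independent rendering pipelines.
 * Assumption: NUM_PIPELINES divides the total process count evenly. */
#include <mpi.h>
#include <stdio.h>

#define NUM_PIPELINES 4

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Color by pipeline: ranks with the same color form one pipeline. */
    int color = world_rank % NUM_PIPELINES;
    MPI_Comm pipeline;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &pipeline);

    int prank, psize;
    MPI_Comm_rank(pipeline, &prank);
    MPI_Comm_size(pipeline, &psize);

    /* Round-robin assignment of time steps to pipelines, so I/O for one
     * step overlaps rendering of other steps in other pipelines. */
    int num_steps = 64;                    /* hypothetical series length */
    for (int t = color; t < num_steps; t += NUM_PIPELINES) {
        /* read_step(t, pipeline); render_step(t, pipeline); ... */
        if (prank == 0)
            printf("pipeline %d (%d ranks) handles time step %d\n",
                   color, psize, t);
    }

    MPI_Comm_free(&pipeline);
    MPI_Finalize();
    return 0;
}
```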

Scalability over a variety of data, image, and system sizes.

The relative percentage of time in the stages of volume rendering as a function of system size.

Different numbers of processes performing I/O can affect overall writing performance.
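
One knob for varying how many processes touch the file system is MPI-IO collective buffering, where a collective write is funneled through a chosen number of aggregator processes. The sketch below assumes a hypothetical file output.dat and uses the ROMIO hints romio_cb_write and cb_nodes; hint names and their effect depend on the MPI implementation and file system.

```c
/* Sketch: limit the number of I/O aggregators for a collective write. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable"); /* force collective buffering */
    MPI_Info_set(info, "cb_nodes", "64");           /* ~64 aggregator processes */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    const int n = 1 << 20;                 /* 1 Mi floats per process */
    float *buf = calloc(n, sizeof(float));
    MPI_Offset off = (MPI_Offset)rank * n * sizeof(float);
    MPI_File_write_at_all(fh, off, buf, n, MPI_FLOAT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}
```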

Volume rendering end-to-end performance: total frame time (seconds) versus number of processes (50 to 20,000), plotted for 4480^3 data with a 4096^2 image, 2240^3 data with a 2048^2 image, and 1120^3 data with a 1024^2 image.

Aggregate and component results are analyzed to determine bottlenecks. Because ultrascale visualization is dominated by I/O, our team devotes considerable effort to its study, from both the systems and application perspectives.
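
A minimal sketch of how per-stage (component) times might be gathered and reduced to the maximum across ranks, so the slowest process marks the bottleneck. The stage functions are empty placeholders; this is an illustrative harness, not the team's actual instrumentation.

```c
/* Sketch: time each stage with MPI_Wtime and report the slowest rank. */
#include <mpi.h>
#include <stdio.h>

static void io_stage(void)        { /* placeholder */ }
static void render_stage(void)    { /* placeholder */ }
static void composite_stage(void) { /* placeholder */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const char *names[3] = { "I/O", "render", "composite" };
    void (*stages[3])(void) = { io_stage, render_stage, composite_stage };
    double local[3], maxes[3];

    for (int i = 0; i < 3; i++) {
        double t0 = MPI_Wtime();
        stages[i]();
        local[i] = MPI_Wtime() - t0;
    }

    /* Component times: the max over ranks shows where the frame time goes. */
    MPI_Reduce(local, maxes, 3, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int i = 0; i < 3; i++)
            printf("%-10s max time: %.3f s\n", names[i], maxes[i]);

    MPI_Finalize();
    return 0;
}
```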