PP16 Abstracts · 66 PP16 Abstracts IP1 Scalabilityof Sparse Direct Codes...

66 PP16 Abstracts

IP1

Scalability of Sparse Direct Codes

As part of the H2020 FET-HPC Project NLAFET, we arestudying the scalability of algorithms and software for us-ing direct methods for solving large sparse equations. Weexamine how techniques that work well for dense systemsolution, for example communication avoiding strategies,can be adapted for the sparse case. We also study thebenefits of using standard run time systems to assist us indeveloping codes for extreme scale computers. We will lookat algorithms for solving both sparse symmetric indefinitesystems and unsymmetric systems.

Iain DuffRutherford Appleton Laboratory, Oxfordshire, UK OX110QXand CERFACS, Toulouse, [email protected]

IP2

Towards Solving Coupled Flow Problems at the Ex-treme Scale

Scalability is only a necessary prerequisite for extreme scalecomputing. Beyond scalability, we propose here an archi-tecture aware co-design of the models, algorithms, and datastructures. The benefit will be demonstrated for the Lat-tice Boltzmann method as example of an explicit time step-ping scheme and for a finite element multigrid method asan implicit solver. In both cases, systems with more thanten trillion (1013) degrees of freedom can be solved alreadyon the current peta-scale machines.

Ulrich J. RuedeUniversity of Erlangen-NurembergDepartment of Computer Science (Simulation)[email protected]

IP3

Why Future Systems should be Data Centric

We are immersed in data. Classical high performance com-puting often generates enormous structured data throughsimulation; but, a vast amount of unstructured data is alsobeing generated and is growing at exponential rates. Tomeet the data challenge, we expect computing to evolve intwo fundamental ways: first through a focus on workflows,and second to explicitly accommodate the impact of bigdata and complex analytics. As an example, HPC systemsshould be optimized to perform well on modeling and simu-lation; but, should also focus on other important elementsof the overall workflow which include data managementand manipulation coupled with associated cognitive ana-lytics. At a macro level, workflows will take advantage ofdifferent elements of the systems hierarchy, at dynamicallyvarying times, in different ways and with different datadistributions throughout the hierarchy. This leads us to adata centric design point that has the flexibility to handlethese data-driven demands. I will explain the attributesof a data centric system and discuss challenges we see forfuture computing systems. It is highly desirable that thedata centric architecture and implementation lead to com-mercially viable Exascale-class systems on premise and inthe cloud.

Paul CoteusIBM Research, USA

[email protected]

IP4

Next-Generation AMR

Block-structured adaptive mesh refinement (AMR) is apowerful tool for improving the computational efficiencyand reducing the memory footprint of structured-grid nu-merical simulations. AMR techniques have been used forover 25 years to solve increasingly complex problems. Iwill give an overview of what we are doing in BerkeleyLabs AMR framework, BoxLib, to address the challengesof next-generation multicore architectures and the com-plexity of multiscale, multiphysics problems, including newways of thinking about multilevel algorithms and new ap-proaches to data layout and load balancing, in situ andin transit visualization and analytics, and run-time perfor-mance modeling and control.

Ann S. AlmgrenLawrence Berkeley National [email protected]

IP5

Beating Heart Simulation Based on a StochasticBiomolecular Model

In my talk, I will introduce a new parallel computationalapproach indispensable to simulate a beating human heartdriven by stochastic biomolecular dynamics. In this ap-proach, we start from a molecular level modeling, and wedirectly simulate individual molecules in sarcomere by theMonte Carlo (MC) method, thereby naturally handle thecooperative and stochastic molecular behaviors. Then weimbed these sarcomere MC samples into contractile ele-ments in continuum material model discretized by finiteelement method. The clinical applications of our heartsimulator are also introduced.

Takumi WashioUniversity of Tokyo, [email protected]

IP6

Conquering Big Data with Spark

Today, big and small organizations alike collect hugeamounts of data, and they do so with one goal in mind:extract ”value” through sophisticated exploratory analy-sis, and use it as the basis to make decisions as variedas personalized treatment and ad targeting. To addressthis challenge, we have developed Berkeley Data AnalyticsStack (BDAS), an open source data analytics stack for bigdata processing.

Ion StoicaUniversity of California, Berkeley, [email protected]

IP7

A Systems View of High Performance Computing

HPC systems comprise increasingly many fast components,all working together on the same problem concurrently.Yet, HPC efficient-programming dogma dictates that com-municating is too expensive and should be avoided. Pro-gram ever more components to work together with ever lessfrequent communication?!? Seriously? I object! In other

PP16 Abstracts 67

disciplines, deriving efficiency from large Systems requiresan approach qualitatively different from that used to deriveefficiency from small Systems. Should we not expect thesame in HPC? Using Grappa, a latency-tolerant runtimefor HPC systems, I will illustrate through examples thatsuch a qualitatively different approach can indeed yield ef-ficiency from many components without stifling communi-cation.

Simon KahanInstitute for Systems Biology and University ofWashington,[email protected]

SP0

SIAG/Supercomputing Best Paper Prize:Communication-Optimal Parallel and Sequen-tial QR and LU Factorizations

We present parallel and sequential dense QR factoriza-tion algorithms that are both optimal (up to polylogarith-mic factors) in the amount of communication they per-form and just as stable as Householder QR. We proveoptimality by deriving new lower bounds for the num-ber of multiplications done by non-Strassen-like QR, andusing these in known communication lower bounds thatare proportional to the number of multiplications. Wenot only show that our QR algorithms attain these lowerbounds (up to polylogarithmic factors), but that existingLAPACK and ScaLAPACK algorithms perform asymptot-ically more communication. We derive analogous commu-nication lower bounds for LU factorization and point outrecent LU algorithms in the literature that attain at leastsome of these lower bounds. The sequential and parallelQR algorithms for tall and skinny matrices lead to signif-icant speedups in practice over some of the existing algo-rithms, including LAPACK and ScaLAPACK, for example,up to 6.7 times over ScaLAPACK. A performance modelfor the parallel algorithm for general rectangular matricespredicts significant speedups over ScaLAPACK.

James W. DemmelUniversity of CaliforniaDivision of Computer [email protected]

Laura [email protected]

Mark HoemmenSandia National [email protected]

Julien LangouUniversity of Colorado [email protected]

SP1

SIAG/Supercomputing Career Prize: Supercom-puters and Superintelligence

In recent years the idea of emerging superintelligence hasbeen discussed widely by popular media, and many expertsvoiced grave warnings about its possible consequences.This talk will use an analysis of progress in supercomputerperformance to examine the gap between current tech-

nology and reaching the capabilities of the human brain.In spite of good progress in high performance computing(HPC) and techniques such as machine learning, this gapis still very large. I will then explore two related topicsthrough a discussion of recent examples: what can we learnfrom the brain and apply to HPC, e.g., through recent ef-forts in neuromorphic computing? And how much progresshave we made in modeling brain function? The talk will beconcluded with my perspective on the true dangers of su-perintelligence, and on our ability to ever build self-awareor sentient computers.

Horst D. SimonLawrence Berkeley [email protected]

CP1

Scale-Bridging Modeling of Material Dynamics:Petascale Assessments on the Road to Exascale

Within the multi-institutional, multi-disciplinary ExascaleCo-design Center for Materials in Extreme Environments(ExMatEx), we are developing adaptive physics refinementmethods that exploit hierarchical, heterogeneous architec-tures to achieve more realistic large-scale simulations. Toevaluate and exercise the task-based programming mod-els, databases, and runtime systems required to performmany-task computation workflows that accumulate a dis-tributed response database from fine-scale calculations, weare preforming petascale demonstrations on the Trinity su-percomputer, combining Haswell, Knights Landing, andburst buffer nodes.

Timothy C. GermannLos Alamos National [email protected]

CP1

Parallel Recursive Density Matrix Expansion inElectronic Structure Calculations

We propose a parallel recursive polynomial expansion al-gorithm for construction of the density matrix in linearscaling electronic structure theory. The expansion con-tains operations on matrices with irregular sparsity pat-terns, in particular matrix-matrix multiplication. By usingthe Chunks and Tasks programming model[Parallel Com-put. 40, 328 (2014)] beforehand unknown sparsity patternis exploited while preventing load imbalance. We evaluatethe performance in electronic structure calculations, inves-tigating linear scaling with system size and parallel strongand weak scaling.

Anastasia KruchininaUppsala [email protected]

Emanuel H. RubenssonUppsala [email protected]

Elias RudbergDivision of Scientific Computing, Department of I.T.Uppsala [email protected]

CP1

Comparison of Optimization Techniques in Parallel

68 PP16 Abstracts

Ab Initio Nuclear Structure Calculations

Many tasks in ab initio nuclear structure calculations mayrely on search techniques, such as tuning nucleon-nucleonand three-nucleon interactions to fit light nuclei. This fieldis still under-served, however, by the existing optimiza-tion software, mainly due to the complexity of the under-lying function and derivative evaluations. Previously, wehave developed a flexible framework to interface the MFDnpackage for large-scale ab initio nuclear structure calcula-tions with derivative-free optimization packages. Recently,we have obtained results with the derivative-based opti-mization techniques. In this talk we compare these twobroad optimization approaches as applied to parallel abinitio nuclear structure calculations and note on efficientfunction evaluations when several nuclei are fitted simulta-neously.

Masha SosonkinaOld Dominion UniversityAmes Laboratory/[email protected]

CP2

Feast Eigenvalue Solver Using Three Levels of MPIParallelism

We present an additional level of MPI parallelism for theFEAST Eigenvalue solver that can augment its numeri-cal scalability performances and data processing capabili-ties for challenging eigenvalue problems at extreme scale.Any MPI solver including domain-decomposition methods,can be interfaced with the software kernel through an RCImechanism. The algorithm contains three levels of paral-lelism and is ideally suited to run on parallel architectures.Performance results from a real-world engineering applica-tion are discussed.

James KestynUniversity of Massachusetts, Amherst, [email protected]

Eric PolizziUniversity of Massachusetts, Amherst, [email protected]

Peter TangIntel Corporation2200 Mission College Blvd, Santa Clara, [email protected]

CP2

Parallel Computation of Recursive Bounds forEigenvalues of Matrix Polynomials

We derive bounds for eigenvalues of polynomial matricesby generalizing to matrix polynomials a recent result forscalar polynomials that can be recursively refined, increas-ing the computational cost and decreasing sparsity, makingit too costly to carry out sequentially for large matrices.Instead, we use a parallel architecture (GPU) and achievebetter bounds in the same amount of time necessary to ob-tain a single unrefined bound on a sequential machine. Weillustrate with engineering examples.

Aaron MelmanSanta Clara UniversityDepartment of Applied Mathematics

[email protected]

CP2

Increasing the Arithmetic Intensity of a SparseEigensolver for Better Performance on Heteroge-nous Clusters

The performance of sparse eigensolvers is typically limitedby the network and/or main memory bandwidth. In this‘bandwidth bounded’ regime it is possible to perform ad-ditional ‘free flops’ to improve the robustness and/or con-vergence rate of a method. We present block orthogonal-ization techniques in simulated quadruple precision thatexploit specific hardware features. As an example of use,we investigate the performance of a block Jacobi-Davidsonmethod on compute clusters comprising multi/manycoreCPUs and GPGPUs.

Jonas ThiesGerman Aerospace Center (DLR)[email protected]

CP2

Highly Scalable Selective Inversion Using Sparseand Dense Matrix Algebra

The computation of a particular subset of entries of in-verse matrices is required in several different disciplines,such as genomic prediction, reservoir characterization andfinance. Methods based on the matrix factorization can beapplied to obtain a significant reduction of the complex-ity in a parallel and distributed-memory environment. Weintroduce an algorithm that implements an efficient, par-allel, and scalable selective inversion method based on LUand Schur-complement factorization of sparse and densematrices.

Fabio VerbosioUSI Universita della Svizzera [email protected]

Drosos KourounisUSI - Universita della Svizzera italianaInstitute of Computational [email protected]

Olaf SchenkUSI - Universita della Svizzera italianaInstitute of Computational [email protected]

CP2

Exploring Openmp Task Priorities on the MR3Eigensolver

As part of the OpenMP 4.1 draft, the runtime incorporatestask priorities. We use the Method of Multiple RelativelyRobust Representations (MR3), for which a pthreads-based task parallel version already exists (MR3SMP), toanalyze and compare the performance of MR3SMP withthree different OpenMP runtimes, with and without thesupport of priorities. From a dataset consisting of applica-tion matrices, it appears that OpenMP is always on par orbetter than the pthreads implementation.

Jan WinkelmannAICESRWTH Aachen

PP16 Abstracts 69

[email protected]

Paolo BientinesiAICES, RWTH [email protected]

CP3

Fast Mesh Operations Using Spatial Hashing:Remaps and Neighbor Finding

Mesh operations based on hashing techniques are fastand scale linearly with problem-size. Hashing is alsocomparison-free which is better suited for the fine grainedparallelism of Exascale hardware. We present how to lever-age hashing for fast remaps and neighbor finding and com-pare performance with standard tree-based approaches onGPUs, MICs and multicore systems.

Robert RobeyLos Alamos National LabXCP-2 Eulerian [email protected]

David NicholaeffXCP-2 Eulerian ApplicationsLos Alamos National [email protected]

Gerald CollomUniversity of WashingtonLos Alamos National [email protected]

Redman ColinArizona State UniversityLos Alamos National [email protected]

CP3

Traversing Time-Dependent Networks in Parallel

Large-scale time-dependent networks arise in a range ofdigital technologies, such as citation networks, schedulenetworks and networks of social users interacting throughmessaging. In this talk, we consider ways of mining suchcomplex networks, including computing in parallel theshortest temporal path and the connected components. Wegeneralize breadth first search and depth first search algo-rithms for time-dependent networks. All algorithms areimplemented in Julia, a new dynamic programming lan-guage for scientific computing.

Weijian ZhangSchool of MathematicsThe University of [email protected]

CP4

Scalability Studies of Two-Level Domain Decompo-sition Algorithms for Stochastic Pdes Having LargeRandom Dimensions

The scalability of two-level domain decomposition algo-rithms developed by Subber (PhD thesis, Carleton Uni-versity, 2012) for uncertainty quantification of large-scalestochastic PDEs is enhanced to handle high-dimensionalstochastic problems. Parallel sparse matrix-vector opera-tions are used to cut floating-point operations and memory

requirements. Both numerical and parallel scalabilities ofthe algorithm are presented for a diffusion equation havingspatially varying diffusion coefficient modeled by a non-Gaussian stochastic process.

Abhijit SarkarAssociate Professorcarleton University, Ottawa, Canadaabhijit [email protected]

Ajit DesaiCarleton University, [email protected]

Mohammad KhalilCarleton University, Canada(Currently at Sandia National Laboratories, USA)[email protected]

Chris PettitUnited States Naval Academy, [email protected]

Dominique PoirelRoyal Military College of Canada, [email protected]

CP4

”Mystic”: Highly-Constrained Non-Convex Opti-mization and Uncertainty Quantification

Highly-constrained, large-dimensional, and nonlinear op-timizations are at the root of many difficult problems inuncertainty quantification (UQ), risk, operations research,and other predictive sciences. The prevalence of parallelcomputing has stimulated a shift from reduced-dimensionalmodels toward more brute-force methods for solving high-dimensional nonlinear optimization problems. The ‘mys-tic’ software enables the solving of difficult UQ problemsas embarrassingly parallel non-convex optimizations; andwith the OUQ algorithm, can provide validation and cer-tification of standard UQ approaches.

Michael McKernsCalifornia Institute of [email protected]

CP4

A Resilient Domain Decomposition Approach forExtreme Scale UQ Computations

One challenging aspect of extreme scale computing con-cerns combining uncertainty quantification (UQ) methodswith resilient PDE solvers. We present a resilient poly-nomial chaos solver for uncertain elliptic PDEs. In a do-main decomposition framework, our solver constructs theDirichlet-to-Dirichlet operator by means of sampling androbust regression techniques. This leads to a resilient UQsolver that requires only a limited amount of global com-munication, thus benefiting the scalability of the computa-tions.

Paul Mycek, Andres ContrerasDuke [email protected], [email protected]

Olivier P. Le MatreLIMSI-CNRS

70 PP16 Abstracts

[email protected]

Khachik Sargsyan, Francesco Rizzi, Karla Morris, CosminSaftaSandia National [email protected], [email protected],[email protected], [email protected]

Bert J. DebusschereEnergy Transportation CenterSandia National Laboratories, Livermore [email protected]

Omar M. KnioDuke [email protected]

CP5

Reliable Bag-of-Tasks Dispatch System

We introduce a reliable bag-of-tasks dispatch system thatwill allow those bag-of-tasks applications run efficiently onnon-reliable hardware at excascale. The system is builton top of MPI to guarantee general applicability acrossdifferent architectures. It utilizes a mathematical modelthat, based on the execution time of some of the tasksand their sizes, determines, at runtime, if a task failure isrecovered via task re-execution or whether the task requiresa replica.

Tariq L. AlturkestaniKing Abdullah University of Science and TechnologyUniversity of [email protected]

Esteban MenesesCosta Rica Institute of [email protected]

Rami MelhemUniversity of Pittsburgh, [email protected]

CP5

HiPro: An Interactive Programming Tool for Par-allel Programming

The scientific computing community is suffering from alack of good development tool that can handle well theunique problems of coding for high performance comput-ing.HiPro is an interactive graphical tool designed for rapiddevelopment of large-scale numerical simulations. The pro-gramming tool is based on the domain specific frameworkJASMIN and is designed to ease parallel programming fordomain experts. Real applications demonstrate that it ishelpful on enabling increased software productivity.

Li Liao, Aiqing ZhangInstitute of Applied Physics and [email protected], zhang [email protected]

Zeyao MoLaboratory of Computational Physics, IAPCMP.O. Box 8009, Beijing 100088, P.R. China

zeyao [email protected]

CP5

The Elaps Framework: Experimental Linear Alge-bra Performance Studies

The multi-platform open source framework ELAPS facili-tates easy and fast, yet powerful performance experimen-tation and prototyping of dense linear algebra algorithms.In contrast to most existing performance analysis tools, ittargets the initial stages of the development process and as-sists developers in both algorithmic and optimization de-cisions. Users construct experiments to investigate howperformance and efficiency vary from one algorithm to an-other, depending on factors such as caching, algorithmicparameters, problem size, and parallelism.

Elmar Peise, Paolo BientinesiAICES, RWTH [email protected], [email protected]

CP6

Sparse Matrix Factorization on GPUs

We present a sparse multifrontal QR Factorization methodon a eterogeneous platform multiple GPUs. Our methodis extremely efficient for different sparse matrices. Ourmethod benefits from both the highly parallel general-purpose computing cores available on a graphics processingunit (GPU) and from multiple GPUs provided on a singleplatform. It exploits two types of parallelism: the firstover the available GPUs on the platform and the secondover the Streaming Multiprocessor (SMPs)within a GPU.We are also developing a technique for factorizing a densefrontal matrix on multiple GPUs using our GPU engine.

Mohamed GadouUniversity of [email protected]

Tim DavisTexas A&M [email protected]

Sanjay RankaCISE, University of Florida, [email protected]

CP6

A New Approximate Inverse Technique Based onthe GLT Theory

Sparse approximate inverse is a well-known precondition-ing technique. Given a matrix A, the approximation M ofthe A−1, in general, is based on Frobenius norm minimiza-tion |I −MA|. The sparsity pattern of the inverse matrixis prescribed a priori or adjusted dynamically during itsconstruction. In this presentation, a new method for com-puting an approximate inverse of a matrix is presented.We utilize the available knowledge of the matrix eigenval-ues attained by applying the Generalized Locally Toeplitz(GLT) theory. The sparsity pattern of the resulting ap-proximation can be controlled dynamically. Moreover, theconstruction of the approximate inverse is parallel and gen-erally cheap since the function describing the matrix A isknown a priori.

Ali Dorostkar

PP16 Abstracts 71

Uppsala UniversityDepartment of Information [email protected]

CP6

Designing An Efficient and Scalable Block Low-Rank Direct Solver for Large Scale Clusters

Electromagnetic problems with an integral equation for-mulation lead to solve large dense linear systems. To ad-dress large problems, we enhanced our Full-MPI directsolver with a block low-rank compression method. Weshow how we derived a block low-rank version from ourexisting solver, and detail the optimizations we used to in-crease its scalability on thousands of cores of the TERAsupercomputer at CEA/DAM, for example by introducinghybrid programming models, or with algorithmic improve-ments.

Xavier LacosteCEA - [email protected]

Cedric Augonnet, David GoudinCEA/[email protected], [email protected]

CP6

Locality-Aware Parallel Sparse Matrix-MatrixMultiplication

We propose a parallel sparse matrix-matrix multiplicationfor distributed memory clusters. Our method exploits lo-cality of the nonzero matrix elements without beforehandknowledge of the matrix sparsity pattern. This is achievedby using a distributed hierarchical matrix representation,straightforward to implement using the Chunks and Tasksprogramming model [Parallel Comput. 40, 328 (2014)].We demonstrate favorable weak and strong scaling of thecommunication cost with the number of processes, boththeoretically and in numerical experiments.

Emanuel H. RubenssonUppsala [email protected]

Elias RudbergDivision of Scientific Computing, Department of I.T.Uppsala [email protected]

CP6

Solving Sparse Underdetermined Linear LeastSquares Problems on Parallel Computing Plat-forms

Computing the minimum 2-norm solution is essential formany areas such as geophysics, signal processing andbiomedical engineering. In this work, we present a newparallel algorithm for solving sparse underdetermined lin-ear systems where the coefficient matrix is block diagonalwith overlapping columns. The proposed parallel approachhandles the diagonal blocks independently and the reducedsystem involving the shared unknowns. Experimental re-sults show the effectiveness of the proposed scheme on var-ious parallel computing platforms.

F. Sukru Torun

Bilkent [email protected]

Murat ManguogluMiddle East Technical [email protected]

Cevdet AykanatBilkent [email protected]

CP7

Towards Exascalable Algebraic Multigrid

Algebraic multigrid is the solver of choice in many andgrowing applications in today’s petascale environments.Exascale architectures exacerbate the challenges of in-creased SIMT concurrency, reduced memory capacity andbandwidth per core, and vulnerability to global synchro-nization. The recent Vassilevski-Yang (2014) mult-additiveAMG takes important steps in reducing communicationand synchronization frequency while controlling memorygrowth, but remains bulk-synchronous. We extend the al-gorithm further towards exascale environments with a task-based directed acyclic graph implementation.

Amani Alonazi, David E. [email protected], [email protected]

CP7

A Matrix-Free Preconditioner for Elliptic SolversBased on the Fast Multipole Method

We combine the Fast Multipole Method, originally devel-oped as a free-standing solver, with Krylov iteration as ascalable and performant preconditioner for traditional low-order finite discretizations of elliptic boundary value prob-lems. FMM possesses advantages for emergent architec-tures including: low synchrony, high arithmetic intensity,and exploitable SIMT concurrency. We compare FMM-preconditioned Krylov solvers with multilevel and othersolvers on a variety of elliptic problems and illustrate theireffectiveness in a computational fluid dynamics applica-tion.

Huda IbeidKing Abdullah University of Science and [email protected]

Rio YokotaTokyo Institute of [email protected]

David E. [email protected]

CP7

Performance of Smoothers for Algebraic Multi-grid Preconditioners for Finite Element VariationalMultiscale Incompressible Magnetohydrodynamics

We examine the performance of various smoothers for afully-coupled algebraic multigrid preconditioned Newton-Krylov solution approach for a finite element variationalmultiscale turbulence model for incompressible magneto-hydrodynamics (MHD). Our focus will be on large-scale,

72 PP16 Abstracts

high fidelity transient MHD simulations on unstructuredmeshes. We present scaling results for resistive MHD testcases, including results for over 500,000 cores on an IBMBlue Gene/Q platform.

Paul LinSandia National [email protected]

John ShadidSandia National LaboratoriesAlbuquerque, [email protected]

Paul TsujiLawrence Livermore National [email protected]

Jonathan J. HuSandia National LaboratoriesLivermore, CA [email protected]

CP7

Efficiency Through Reuse in Algebraic Multigrid

Algebraic multigrid methods (AMG) are utilized by manylarge-scale scientific applications. Even well developedAMG libraries may have few scalability areas of concern,be it an incomplete factorization smoother or coarse solver,or a graph traversal algorithms required in smoothed ag-gregation. We will discuss strategies of mitigating the ef-fect of such kernels through reuse of multigrid hierarchyinformation in multiple setups during the simulation, anddemonstrate the effectiveness of such strategies in parallelapplications.

Andrey ProkopenkoSandia National [email protected]

Jonathan J. HuSandia National LaboratoriesLivermore, CA [email protected]

CP7

A Dynamical Preconditioning Strategy and Its Ap-plications in Large-Scale Multi-Physics Simulations

In the implicit time-stepping based simulations, a fixedsetup-based preconditioner, e.g. AMG, ILU etc, is usuallyused due to its robustness, which we called the static pre-conditioning strategy. Setup-based preconditioners, how-ever, lead to poor parallel scalability due to its setup phase.In this work, a dynamical preconditioning strategy is intro-duced in order to reduce the setup overhead while maintainthe robustness. Results for a practical multi-physics simu-lation on O(104) cores show its efficiency and improvement.

Xiaowen XuInstitute of Applied Physics and ComputationalMathematics

[email protected]

CP8

Parallel Linear Programming Using Zonotopes

Zonotope is an affine projection of n-dimensional cube.The solution of a linear programming problem on a zono-tope is given by a closed-form expression. We consider lin-ear programming problem min cTx s.t. Ax = b, ||x||∞ ≤ 1.Following work of Fujishige et al., the problem with thehelp of lifting method can be reduced to finding an inter-section between a zonotope and a line. We consider par-allelization of this approach using multiple projections ofline points on the zonotope via first-order methods.

Max DemenkovInstitute of Control [email protected]

CP8

Vertex Separators with Mixed-Integer Linear Op-timization

We demonstrate a method of computing vertex separatorsin arbitrary graphs using a multi-level coarsening and re-finement strategy coupled with a mixed-integer linear pro-gramming solver. This approach is easily parallelizable dueto the use of branch-and-bound trees, leading to significantspeedups on many-core systems. Our method contrastswith other parallelized graph partitioning schemes in thatthe addition of processors does not decrease the separatorquality.

Scott P. Kolodziej, Tim DavisTexas A&M [email protected], [email protected]

CP8

Large-Scale Parallel Combinatorial Optimization– the Role of Estimation and Ordering the Sub-Problems

One major problem in parallel algorithms for a combinato-rial optimization problems is that the search space for thesub-problems differ in several magnitudes. This makes thescheduling hard, and the resulting algorithm often ineffi-cient. Measuring the search space of these sub-problemshelps to solve this, and have other usage as well. We willshow actual running results using several hundred proces-sors and show that our algorithm can solve previously un-solved maximum clique search problems.

Bogdan Zavalnij, Sandor SzaboInstitute of Mathematics and InformaticsUniversity of [email protected], [email protected]

CP9

Investigating IBM Power8 System ConfigurationTo Minimize Energy Consumption During the Ex-ecution of CoreNeuron Scientific Application

Leveraging from AMESTER library [P. Klavik et al.Changing Computing Paradigms Towards Power Effi-ciency, Klavk P, Malossi ACI, Bekas C, Curioni A.2014 Changing computing paradigms towards power ef-ficiency. Phil. Trans. R. Soc. A 372: 20130278.

PP16 Abstracts 73

http://dx.doi.org/10.1098/rsta.2013.0278], the power con-sumption of the POWER8 system will be investigatedwhen executing CoreNeuron scientific application. Ex-tending the study presented in [T. Ewart et al., Perfor-mance Evaluation of the IBM POWER8 Architecture toSupport Computational Neuroscientific Application UsingMorphologically Detailed Neurons, International Bench-marking and Optimization Workshop, 2015 Supercomput-ing Conference, Austin, TX, USA] both time to solutionand energy consumption of several configurations of thePOWER8 system will be analyzed. The optimal designpoint which allows maximizing application performancewhile lowering energy consumption will be suggested, of-fering useful insights for future system designs.

Fabien DelalondreEcole Polytechnique Federale de [email protected]

Timothee [email protected]

Pramod ShivajEPFL - Blue Brain [email protected]

Felix [email protected]

CP9

Scalable Parallel I/O Using HDF5 for HPC FluidFlow Simulations

According to the latest US Presidents Council of Advisorson Science and Technology ”high-performance computingmust now assume the ability to efficiently manipulate vastand rapidly increasing quantities of data”. Therefore, wepresent a framework for scientific I/O using HierarchicalData Format 5 (HDF5), focusing on highly scalable paral-lel read and write routines. Furthermore, this frameworkallows to visualise subsets of the computed data alreadyduring runtime of a CFD simulation on massive parallelsystems.

Christoph M. ErtlChair of Computation in Engineering, Prof. RankTechnische Universitat [email protected]

Ralf-Peter MundaniTUM, Faculty of Civil, Geo and EnvironmentalEngineeringChair for Computation in [email protected]

Jerome FrischInstitute of Energy Efficiency and Sustainable BuildingE3DRWTH Aachen [email protected]

Ernst RankComputation in EngineeringTechnische Universitat Munchen

[email protected]

CP9

Quantifying Performance of a Resilient EllipticPDE Solver on Uncertain Architectures UsingSST/Macro

Architectural uncertainty and resiliency are two key chal-lenges of exascale computing. In this talk, we present a per-formance study of a resilient PDE solver on uncertain archi-tectures using the architecture simulator software packageSST/macro. We adopt a task-based server-client model toenable resiliency to hard faults. Using this framework, weconduct an uncertainty quantification study allowing us toperform a sensitivity analysis to identify the main param-eters affecting performance and their correlations.

Karla Morris, Francesco Rizzi, Khachik SargsyanSandia National [email protected], [email protected],[email protected]

Kathryn DahlgrenSandia National [email protected]

Paul MycekDuke [email protected]

Cosmin SaftaSandia National [email protected]

Olivier P. Le [email protected]

Omar M. KnioDuke [email protected]


CP9

Implicitly Parallel Interfaces to Operational Quan-tum Computation

Many high-value computational problems involve searchspaces that grow exponentially with problem size. Ap-plying traditional parallelism to such problems has hadlimited success. An alternative strategy exploits quantumparallelism occurring within operational quantum comput-ers. Building on recently demonstrated benefits of quan-tum effects in such devices, we show how subject matterexperts using familiar interfaces can seamlessly tap into-quantum superposition, entanglement, and tunneling toperform their computations in entirely new ways,withoutlearning quantum mechanics.

Steve P. ReinhardtCray, Inc.

74 PP16 Abstracts

[email protected]

CP9

On the Path to Trinity - Experiences BringingCode to the Next Generation ASC Capability Plat-form

Our next system, Trinity, a Cray XC40, will be a mixture ofIntel Haswell and Intel Xeon Phi Knights Landing proces-sors. We are currently engaged in porting applications tothese new processor designs where vectorization, threadingand use multi-level memories are required for high perfor-mance. We will present initial results from these applica-tion ports and comment on the level of work required toget codes running on each partition of the machine.

Courtenay T. VaughanSandia National [email protected]

Simon D. HammondScalable Computer ArchitecturesSandia National [email protected]

CP10

Wave Propagation in Highly Heterogeneous Me-dia: Scalability of the Mesh and Random Proper-ties Generator

We present in this talk elastic wave propagation simula-tions in highly randomly heterogeneous geophysical mediausing a spectral element solver. When using thousand ofcores, meshing and generation of properties become issues.We propose in this talk: a scalable meshing algorithm forhexahedra that considers topography; and a scalable ran-dom field generator that constructs elastic parameter fieldswith chosen statistics. Large scale simulations and scala-bility analyses illustrate the effectiveness of the approach.

Regis CottereauCNRS, CentraleSupelec, Universite [email protected]

Jose CamataNACAD - High Performance Computing CenterCOPPE/Federal University of Rio de [email protected]

Lucio de Abreu Corra, Luciano de Carvalho PaludoCNRS, CentraleSupelec, Universite [email protected], [email protected]

Alvaro CoutinhoThe Federal University of Rio de [email protected]

CP10

An Overlapping Domain Decomposition Precondi-tioner for the Helmholtz Equation

An overlapping domain decomposition preconditioner isproposed to solve the high frequency Helmholtz equation.The computation domain is decomposed in d directions forproblem in Rd, and the local solution on each subdomainis updated simultaneously in one iteration, thus there isno sweeping along certain directions. In these ways, the

overlapping DDM is similar to the popular DDM for Pois-son problem. The complexity of the overlapping DDM isO(Nniter), where niter is the number of iteration of themethod, and it is shown numerically that niter is propor-tional to the number of subdomains in one direction. 2DHelmholtz problem with more than a billion unknowns and3D Helmholtz problem are solved efficiently with the pre-conditioner on massively parallel machines.

Wei LengLaboratory of Scientific and Engineering ComputingChinese Academy of Sciences, Beijing, [email protected]

Lili JuUniversity of South CarolinaDepartment of [email protected]

CP10

GPU Performance Analysis of DiscontinuousGalerkin Implementations for Time-Domain WaveSimulations

Finite element schemes based on discontinuous Galerkinmethods exhibit interesting features for massively parallelcomputing and GPGPU. Nowadays, they represent a credi-ble alternative for large-scale industrial applications. Com-putational performance of such schemes however stronglydepends on their implementations, and several strategieshave been proposed or discarded in the past. In this talk,we present and compare up-to-date performance results ofdifferent implementations tested in a unified framework.

Axel Modave, Jesse Chan, Tim WarburtonVirginia [email protected], [email protected], [email protected]

CP10

Efficient Solver for High-Resolution NumericalWeather Prediction Over the Alps

Prototype implementation of new ”EULAG” dynamicalcore, based on nonoscillatory forward in time methods,into the operational COSMO framework for numericalweather prediction is discussed. The core aims at the next-generation very high regional weather prediction over theAlps. Currently, its numerical formulation based on MP-DATA advection and iterative Krylov solver offers compet-itive performance to the operational dynamical core basedon Runge-Kutta methods. Moreover, it offers an unprece-dented choice of the soundproof/compressible equation sys-tems, offering similar quality of the solution but differentratio of global and local communication cost.

Zbigniew P. PiotrowskiInstitute of Meteorology and Water [email protected]

Damian WojcikPoznan Supercomputing and Networking [email protected]

Andrzej WyszogrodzkiInstitute of Meteorology and Water Management, [email protected]

Milosz CiznickiInstitute of Meteorology and Water Management

PP16 Abstracts 75

[email protected]

CP10

An Adaptive Multi-Resolution Data Model forFeature-Rich Environmental Simulations on Mul-tiple Scales

The accuracy and predictive value of environmental sim-ulations depend largely on the underlying geometric andsemantic data. Therefore, a data-model suitable for driv-ing such simulations with special emphasis on multi-scaleurban flooding scenarios is presented. This includes fu-sion and orchestration of heterogeneous and ubiquitousdata ranging from GIS to BIM sources, efficient hybridparallelization of storage, access, and processing of largedatasets, and the full integration and interaction with thepipeline for parallel numerical simulations.

Vasco VarduhnUniversity of [email protected]

CP11

Maximizing CoreNeuron Application Performanceon All CPU Architectures Using Cyme Library

The Cyme library first presented in [T. Ewart et al., ALibrary Maximizing SIMD Computation on User-DefinedContainers. In J.M. Kunkel, T. Ludwig, and H.W. Meuer(Eds.): ISC 2014, LNCS 8488, pp. 440–449. SpringerInternational Publishing Switzerland] supports the auto-generation of optimized code for specific targeted systemarchitectures. Its last extension allows supporting bothIBM POWER8 and ARM processors in addition to Intelx86 and MIC as well as IBM Blue Gene/Q. In this pre-sentation we present the integration of the Cyme libraryinto CoreNeuron scientific application which proved to offerclose optimal performance on IBM Blue Gene/Q systemsexcept in the case of data hazard. Thanks to the integra-tion, we demonstrate that better performance could benobtained, overcoming the shortcomings of the XLC com-piler.

Timothee Ewart, Kai Langen, Pramod Kumbhar, [email protected], [email protected],[email protected], [email protected]

Fabien DelalondreEcole Polytechnique Federale de [email protected]

CP11

Ttc: A Compiler for Tensor Transpositions

We present TTC, an open-source compiler for multidimen-sional tensor transpositions. Thanks to a range of opti-mizations (software prefetching, blocking, loop-reordering,explicit vectorization), TCC generates high-performanceparallel C/C++ code. We use generic heuristics and auto-tuning to manage the huge search space. Performance re-sults show that TTC achieves close to peak memory band-width on both the Intel Haswell and the AMD Steamrollerarchitectures, yielding performance gains of up to 15× overmodern compilers.

Paul Springer

AICESRWTH Aachen [email protected]

Paolo BientinesiAICES, RWTH [email protected]

CP11

Fresh: A Framework for Real-World StructuredGrid Stencil Computations on Heterogeneous Plat-forms

To tackle the heterogeneous programming challenge, foran important computation pattern structured-grid stencilcomputations, we propose FRESH (Framework for REal-world structured-grid Stencil computations on Heteroge-neous platforms), which consists of a Fortran-like domain-specific language and a source-to-source compiler. TheDSL contains domain specific features that ease program-ming. The compiler generates code for multiple platforms.A real-world computational electromagnetic applicationwas ported , and speedups of 1.47, 16.68 and 15.32 wereachieved on CPU, GPU and MIC.

Yang Yang, Aiqing ZhangInstitute of Applied Physics and ComputationalMathematicsyang [email protected], zhang [email protected]

Zhang YangInstitute of Applied Physics and ComputationalMathematicsChina Academy of Engineering Physicsyang [email protected]

CP12

Load Balancing Issues in Particle In Cell Simula-tions

Recent simulation campaigns of laser wakefield electronsacceleration, showed that, performance-wise, the most ur-gent concern is to find a way to face the strong load imbal-ance that arises on very large full 3D simulations. The hy-brid MPI-openMP typical implementation performs quitewell on systems with a couple hundreds of cores. Butthe accessible number of openMP threads is limited and,as the number of MPI processes increases, this relativelysmall number of threads is not able to balance the loadany more. Imbalance issues come back, wasting our pre-cious computation time again. Hence the need for an effi-cient dynamic load balancing algorithm. The algorithm wepresent is based on the division of each MPI domain intomany smaller, so called, patches. These patches are usedas data sorting structures and can be exchanged betweenMPI processes in order to balance the computational load.It has been implemented in the SMILEI code, a new opensource Particle In Cell (PIC) code, developed jointly byphysicists and HPC experts in order to make sure that itperforms well on the newest supercomputers architectures.

Arnaud [email protected]

Julien DerouillatCEA - Maison De La Simulation

76 PP16 Abstracts

[email protected]

CP12

Particle-in-Cell Simulations for Nonlinear Vlasov-Like Models

We report on the progress in the development of a particle-in-cell code for high-performance computing. We focus onnonlinear Vlasov-like models as applications and on ab-solute performance per core on the implementation side.Results on classical physics test cases are presented. Wereport on the code’s organization and the approach to mul-tiprocess and multithread issues.

Sever [email protected]

Edwin Chacon-GolcherInstitute of Physics of the ASCR, ELI-Beamlines,[email protected]

CP12

Parallel in Time Solvers for DAEs

Time integration of dynamic systems is a sequential processby nature. With the increase of the number of cores andprocessors in modern computers, making this process par-allel becomes an important issue. In this context, Braidsoftware has been developed at LLNL as a non-intrusiveparallel-in-time black box overlay. In this talk, we will fo-cus on Differential Algebraic Equations and how parallel-in-time algorithms perform for such problems. These DAEsarise in many fields including the modelling of power gridsystems.

Matthieu LecouvezLawrence Livermore National [email protected]

Robert D. FalgoutCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

Carol S. WoodwardLawrence Livermore Nat’l [email protected]

CP12

Implicitly Dealiased Convolutions for DistributedMemory Architectures

Convolutions are a fundamental tool of scientific comput-ing. For multi-dimensional data, implicitly dealiased con-volutions [Bowman and Roberts, SIAM J. Sci. Comput.2011] are faster and require less memory than conventionalFFT-based methods. We present a hybrid OpenMP/MPI im-plementation in the open-source software library FFTW++.The reduced memory footprint translates to a reducedcommunication cost, and the separation of input and workarrays allows one to overlap computation and communica-tion.

Malcolm RobertsUniversite de [email protected]

John C. BowmanDept. of Mathematical and Statistical SciencesUniversity of [email protected]

CP12

A Massively Parallel Simulation of the Coupled Ar-terial Cells in Branching Network

Arterial wall is composed of a network of multiple types ofcells connected via intercellular gap junctions. Under theenvironmental stimuli, e.g. the blood borne agonists andflow, a network of these cells elicits vasomotion. We presenta massively parallel computational model simulating thecoupled arterial cells in branching network, to study themechanistic basis of the coordinated vascular response up-stream and downstream of a localized stimulus, mediatedby intracellular calcium signaling. The goal is to emulateand validate wet lab experiments.

Mohsin A. ShaikhPawsey Supercomputing [email protected]

Constantine ZakkaroffUC HPCUniversity of [email protected]

Stephen MooreIBM Research Collaboratory for Life Sciences-MelbourneThe University of [email protected]

Tim DavidUniversity of [email protected]

CP13

Adaptive Transpose Algorithms for DistributedMulticore Processors

An adaptive parallel block transpose algorithm optimizedfor distributed multicore hybrid OpenMP/MPI architec-tures is presented. A hybrid configuration can decreasecommunication costs by reducing the number of MPInodes, thereby increasing message sizes. This allows fora more slab-like than pencil-like domain decomposition formultidimensional transforms (such as the FFT), reducingthe cost of, or even eliminating the need for, a second dis-tributed transpose. Nonblocking all-to-all transfers enableuser computation and communication to be overlapped.

John C. BowmanDept. of Mathematical and Statistical SciencesUniversity of [email protected]

Malcolm RobertsUniversite de [email protected]

CP13

Efficient Two-Level Parallel Algorithm Based onDomain Decomposition Method for Antenna Sim-ulation

This paper presents a scalable method for solving complex

PP16 Abstracts 77

antenna problems. The method is based on FETI-2LMdomain decomposition method plus a block ORTHODIRapproach for multiple right hand sides. The block ap-proach allows efficient multi-threading parallelization onmulti-core nodes for simultaneous local backward substi-tutions within the subdomains and decreases the overheaddue to message passing between subdomains for the or-thogonalization of search directions. Performance resultsfor large antenna arrays on systems with up to tens ofthousands of cores are analyzed.

Francois-Xavier [email protected]

CP13

GPU Acceleration of Polynomial Homotopy Con-tinuation

Homotopy continuation methods apply predictor-correctoralgorithms to solve polynomial systems. Our implementa-tion on a Graphics Processing Unit (GPU) combines algo-rithmic differentiation with double double and quad doublearithmetic. Our experiments with GPU acceleration showthat we can offset the cost overhead of double double pre-cision arithmetic. We report on our implementation onthe NVIDIA K20C and the integration into the softwarePHCpack and phcpy.

Jan VerscheldeDepartment of Mathematics, Statistics and ComputerScienceUniversity of Illinois at [email protected]

Xiangcheng YuDepartment of Mathematics, Statistics, and ComputerScienceUniversity of Illinois at [email protected]

CP14

Robust Domain Decomposition Methods for Struc-tural Mechanics Computations

Classical domain decomposition strategies (FETI, BDD)are known to slowly converge when applied to real engi-neering problems. We present more robust block-based do-main decomposition methods which can be viewed as mul-tiprecondioned CG algorithms, tailored for medium-sizedcomputer cluster. The talk focuses on numerical imple-mentation of this solver. Some numerical results demon-strating its robustness regarding classical hard points (het-erogeneity, incompressibility) will be shown. Slight mod-ifications allowing to deal with large displacement finiteelement problem will be introduced.

Christophe BovetLMT, ENS Cachan, CNRS, Universite Paris-Saclay,94235 [email protected]

Augustin Parret-FreaudSafran Tech, Rue des Jeunes Bois, ChteaufortCS 80112 78772 [email protected]

Pierre GosseletLMT, ENS Cachan

[email protected]

Nicole SpillaneCentre de Mathematiques AppliqueesEcole [email protected]

CP14

A Parallel Contact-Impact Simulation Framework

Parallel contact-impact simulation poses great challengeson both software design and performance. We present ourframework approach targeting at these challenges. Paralleldetails such as global contact search, dynamic load balanc-ing among processes and threads, data exchange amongsub-domains, etc., are encapsulated into our framework.Application developers are then able to focus only on thephysics and numerical schemes within each sub-domain.Algorithms and numerical experiments over 10K cores arepresented.

Jie Cheng, Qingkai LiuInstitute of Applied Physics and ComputationalMathematicscheng [email protected], [email protected]

CP14

Turbomachinery CFD on Modern Multicore andManycore Architectures

The exploitation of parallelism across all granularities (in-struction, data, core) is mandatory for achieving a highdegree of performance on current and future architectures.We present the performance tuning of multiblock and un-structured CFD solvers used for aero-engine design on lead-ing multicore (SandyBridge, Haswell) and manycore (XeonPhi) platforms. Optimisations include efficient SIMD com-putations of stencil operations derived from PDE dis-cretizations and domain decomposition for thread paral-lelism for both structured and unstructured grids.

Ioan C. Hadade, Luca Di MareImperial College [email protected], [email protected]

CP14

Optimization of the Photon’s Simulation on GPU

Fast simulation of photon’s propagation has applicationsin medical imaging and dosimetry. Our aim is to imple-ment a CUDA-based GPU method for the simulation ofphoton’s propagation in materials. Using profiling tools,we present how the algorithm has been modified in orderto improve thread level parallelism and memory access pat-tern design and so the execution time. A large-scale testwill be implemented to evaluate the total execution time.

Clement P. Rey, Benjamin Auer, Emmanuel Medernach,Ziad El BitarCNRS,UMR7178,Universite de [email protected], [email protected], [email protected], [email protected]

MS1

Evaluation of Energy Profile for Krylov IterativeMethods on Cluster of Accelerators

Krylov iterative methods are used to solve large sparse lin-

78 PP16 Abstracts

ear problems, which require optimization work to addressthe communication bottleneck. With respect to the energyprofile, we shall investigate whether optimizations in com-munication are also energy efficient and scalable. In thispresentation, we will reveal the relation between time andenergy profile of our optimized kernels for GMRES throughan extensive test on GPU clusters, which demands a smart-tuning scheme for optimal performances.

Langshi Chen

CNRS/Maison de la [email protected]

Serge G. PetitonUniversity Lille 1, Science and Technologies - [email protected]

MS1

Linear Algebra for Data Science Through Exam-ples

This talk presents some linear algebra problems behindthe techniques for analyzing data. Through examples fromdisciplines as astronomy and health, which generate datamasses that must to be properly analyzed, we show the roleof certain methods of linear algebra. Then we highlight theimportance of high performance computing in the analysisof these data.

Nahid EmadPRiSM laboratory and Maison de la Simulation,University [email protected]

MS1

Coarse Grid Aggregation for Sa-Amg Method withMultiple Kernel Vectors

The Smoothed Agregation Algebraic Multigrid (SA-AMG)method is a multi-level linear solver that can be used inmany kinds of applications. Its coarse grid, however, doesnot have enough number of unknowns for parallelism thecomputing environment may offer, so special care has tobe taken at that stage. In this contribution we discuss thecoarse grid aggregation approach. Convergence of the SA-AMG solver can be much improved by specifying multiplenear-kernel vectors, and our study investigates the effec-tiveness of the coarse grid aggregation for SA-AMG solverwith multiple kernel vectors.

Akihiro Fujii, Naoya Nomura, Teruo TanakaKogakuin [email protected], [email protected],[email protected]

Kengo NakajimaThe University of TokyoInformation Technology [email protected]

Osni A. MarquesLawrence Berkeley National LaboratoryBerkeley, [email protected]

MS1

Opportunities and Challenges in Developing and

Using Scientific Libraries on Emerging Architec-tures

Abstract not available

Michael HerouxSandia National [email protected]

MS2

Correlating Facilities, System and ApplicationData Via Scalable Analytics and Visualization

Understanding application, system and facilities perfor-mance and throughput over time necessitates monitoringand recording massive amounts of data. The next stepis storing and performing scalable analysis over this data.We will present efforts at LLNL to create an ecosystem formanaging historical performance data and using data ana-lytics and visualization to find root causes and insights intoperformance issues. We will also present some preliminaryresults on correlating data from various sources.

Abhinav Bhatele, Todd GamblinLawrence Livermore National [email protected], [email protected]

Alfredo Gimenez, Yin Yee NgUniversity of California, [email protected], [email protected]

MS2

A Monitoring Tool for HPC Operations Support

TACC Stats is a performance and resource usage monitor-ing tool that has been running on multiple HPC systemsfor the past five years. It collects data at regular intervalsfor every job from hardware counters, sysfs, and procfs.TACC uses this data in a variety of system managementand user support efforts. Analyses characterizing commonperformance issues on TACC systems will be presented us-ing components of the tool that facilitate data explorationand visualization.

R. Todd EvansUniversity of [email protected]

James C. BrowneDept. of Computer SciencesUniversity of [email protected]

John WestUniversity of Texas at AustinTexas Advanced Computing Centertba

Bill BarthTexas Advanced Computing CenterUniversity of Texas at [email protected]

MS2

Developing a Holistic Understanding of I/O Work-loads on Future Architectures

The I/O subsystems of extreme-scale computing systems

PP16 Abstracts 79

are becoming more complex as they stratify into tiers op-timized for different balances of performance and capacity.Consequentially, obtaining a coherent picture of the work-load that reflects everything from application-level I/Oto back-end file system performance will be significantlyharder. To this end, we are developing a framework for col-lecting and normalizing I/O workload data from disparatesources throughout the I/O subsystem to characterize theholistic utilization of future storage platforms.

Glenn LockwoodNERSCLawrence Berkeley National [email protected]

Nicholas WrightLawrence Berkeley National [email protected]

MS2

MuMMI: A Modeling Infrastructure for ExploringPower and Execution Time Tradeoffs in ParallelApplications

MuMMI (Multiple Metrics Modeling Infrastructure) facil-itates the measurement, modeling and prediction of per-formance and power consumptions for parallel applica-tions. The MuMMI framework develops models of run-time and power based upon performance counters; thesemodels are used to explore tradeoffs and identify methodsfor improving energy efficiency. In this talk we will dis-cuss the MuMMI framework and its recent development,and present examples of the use of MuMMI to improve theenergy efficiency of parallel applications.

Valerie Taylor, Xingfu WuTexas A&M [email protected], [email protected]

MS3

Parallel Graph Matching Algorithms Using MatrixAlgebra

We present distributed-memory parallel algorithms forcomputing matchings in bipartite graphs. We considerboth exact and approximate algorithms for cardinality andweighted matching problems. We substitute the asyn-chronous data access patterns of traditional matching al-gorithms by a small subset of more structured, bulk-synchronous functions based on matrix algebra. Rely-ing on communication-avoiding algorithms for the underly-ing matrix-algebric modules, different matching algorithmsachieve good speedups on tens of thousands of cores on cur-rent supercomputers.

Ariful Azad, Aydin BulucLawrence Berkeley National [email protected], [email protected]

MS3

Randomized Linear Algebra in Large-Scale Com-putational Environments

Randomized linear algebra (RLA) is an area that exploitsrandomness for the development of improved algorithms forfundamental matrix problems. It has led to the best worst-cast algorithms for problems such as over-determined leastsquares approximation, and high-quality implementations

have beaten Lapack as well as specialized code in scientificapplications. Here, we review recent RLA results in im-plementing in parallel and distributed environments, e.g.,in high performance computing applications and in dis-tributed data centers.

Michael Mahoney, Alex GittensUC [email protected], [email protected]

Farbod Roosta-KhorasaniComputer [email protected]

MS3

Parallel Combinatorial Algorithms in Sparse Ma-trix Computation?

Combinatorial techniques are used in several phases ofsparse matrix computation. For large-scale problems, whilenumerical phases are often executed in parallel, most ofthese combinatorial techniques are serial and can becomebottlenecks. We are investigating the extent to which someof the combinatorial techniques can be performed in par-allel.

Mathias JacquelinLawrence Berkeley National [email protected]

Esmond G. NgLawrence Berkeley National [email protected]

Barry PeytonDalton State [email protected]

Yili Zhrng, Kathy YelickLawrence Berkeley National [email protected], [email protected]

MS3

Parallel Algorithms for Automatic Differentiation

Despite three decades of effort, parallel algorithms for Au-tomatic Differentiation have been challenging to design.We describe a new approach to parallelization for secondand higher order derivatives using the Reverse mode of ADimplemented with the MPI library. By interpreting a sendmessage as creating a dependent variable, and a receivemessage as creating an independent variable, we performall communications in the Forward mode, thus avoiding thereversal of communication operations.

Alex PothenPurdue UniversityDepartment of Computer [email protected]

Mu WangPurdue [email protected]

MS4

Enlarged GMRES for Reducing Communication

80 PP16 Abstracts

When Solving Reservoir Simulation Problems

In this work we discuss enlarged Krylov subspaces, [Grigoriet al., Enlarged Krylov subspace conjugate gradient meth-ods for reducing communication], to solve linear systemswith one right hand side, as arising from reservoir simula-tions. Our work focuses on GMRES since the consideredlinear systems are neither positive definite nor symmet-ric. We discuss the usage of deflation, inexact breakdowns,[Robbe and Sadkane, Exact and inexact breakdowns in theblock GMRES method], in each iteration and deflation ofapproximated eigenvalues after each cycle when restartingGMRES.

Hussam Al DaasUniversite Paris-Sud [email protected]


Pascal HenonTOTAL E&[email protected]

Philippe RicouxTOTAL [email protected]

MS4

Communication-Avoiding Krylov Subspace Meth-ods in Theory and Practice

Solvers for sparse linear algebra problems, ubiquitousthroughout scientific codes, often limit application per-formance due to a low computation/communication ra-tio. In this talk, we discuss the development of parallelcommunication-avoiding Krylov subspace methods, whichcan achieve performance improvements in a wide varietyof scientific applications. We present results of a paral-lel scaling study and discuss complex tradeoffs that arisedue to both machine parameters and the numerical prop-erties of the problem. We conclude with recommendationson when we expect communication-avoiding variants to bebeneficial in practice.

Erin C. CarsonNew York [email protected]

MS4

Outline of a New Roadmap to Permissive Commu-nication and Applications That Can Benefit

FFT has been a classic computation engine for numerousapplications but is bandwidth-intensive, limiting its perfor-mance on off-the-shelf parallel machines. Using the FFTand other applications as examples, we examine the impactthat adoption of some enabling technologies would have onthe performance of a manycore architecture. These tech-nologies include 3D VLSI and silicon photonics, with a spe-cial midwifing role for microfluidic cooling, in a roadmaptowards permissive communication.

James A. Edwards, Uzi VishkinUniversity of Maryland

[email protected], [email protected]

MS4

CA-SVM: Communication-Avoiding Support Vec-tor Machines on Distributed Systems

We consider the design and implementation ofcommunication-efficient parallel support vector ma-chines, widely used in statistical machine learning, fordistributed memory clusters and supercomputers. Priorto our study, the parallel isoefficiency of a state-of-the-artimplementation scaled as W = Ω(P 3), where W is theproblem size and P the number of processors. OurCommunication-Avoiding SVM method improves theisoefficiency to nearly W = Ω(P ). We evaluate thesemethods on 96 to 1536 processors, and show speedupsof up to 16× over Dis-SMO, and a 95% weak-scalingefficiency on six real-world datasets with only modestlosses in overall classification accuracy.

Yang YouUniversity of California, [email protected]

MS5

Time Parallel and Space Time Solution Methods -From Heat Equation to Fluid Flow

In this talk, we discuss the parallel solution in space andtime of the Navier Stokes equation using space-time multi-grid methods. The choice of the smoother is discussedas well as the proper interwtwining of the Picard itera-tion, which is used for resolving the nonlinear terms in theNavier Stokes equations, with the linear multigrid methodused.

Rolf KrauseInstitute of Computational ScienceUniversity of [email protected]

Pietro BenedusiInstitute of Computational ScienceUSI Luganopietro,[email protected]

Peter ArbenzETH ZurichComputer Science [email protected]

Daniel HuppETH [email protected]

MS5

A Parallel-in-Time Solver for Time-PeriodicNavier-Stokes Problems

In science and engineering, many problems are driven bytime-periodic forcing. In fluid dynamics, this occurs forexample in turbines or in human blood flows. In the ab-sence of turbulence, if a fluid is excited long enough, atime-periodic steady state establishes. In numerical mod-els of such flows, the steady state is computed by simulatingthis transient phase with a time-stepping method until theperiodic state is reached. In this work, we present an al-ternative approach by solving directly for the steady-state

PP16 Abstracts 81

solution, introducing periodic boundary conditions in time.This allows us to use a multi-harmonic ansatz in time forthe solution of the Navier-Stokes equations. The solutionof the different Fourier modes is distributed to differentgroups of processing units, which then have to solve a spa-tial problem. Therefore, the time axis is parallelized suchthat the number of parallel threads scales with the numberof resolved Fourier modes.

Daniel HuppETH [email protected]


Dominik ObristUniversity of [email protected]

MS5

Time Parallelizing Edge Plasma Simulations Usingthe Parareal Algorithm

This work explores the exploitation of the parareal algo-rithm to achieve temporal parallelization of simulations ofmagnetically confined, fusion plasma at the edge of the de-vice. The physics is extremely complex due to the presenceof neutrals as well as the interaction of the plasma with thewall. The SOLPS code package, widely adapted by the fu-sion community, is used here to scale the performance ofthe parareal algorithm in various ITER relevant plasmas.

Debasmita SamaddarCCFEUK Atomic Energy [email protected]

David CosterMax-Planck-Institut fur Plasmaphysik, [email protected]

Xavier BonninITER Organization, [email protected]

Eva HavlickovaUK Atomic Energy [email protected]

Wael R. ElwasifORNL, [email protected]

Donald B. BatchelorOak Ridge National [email protected]

Lee A. BerryORNL, [email protected]

Christoph BergmeisterUniversity of Innsbruck, Austria

[email protected]

MS5

Micro-Macro Parareal Method for Stochastic Slow-Fast Systems with Applications in Molecular Dy-namics

We introduce a micro-macro parareal algorithm for thetime-parallel integration of multiscale-in-time systems.The algorithm first computes a cheap, but inaccurate, solu-tion using a coarse propagator (simulating an approximateslow macroscopic model), which is iteratively corrected us-ing a fine-scale propagator (accurately simulating the fullmicroscopic dynamics). This correction is done in paral-lel over many subintervals, thereby reducing the wall-clocktime needed to obtain the solution, compared to the inte-gration of the full microscopic model. We provide a numeri-cal analysis of the algorithm for a prototypical example of amicro-macro model, namely singularly perturbed ordinarydifferential equations. We show that the computed solu-tion converges to the full microscopic solution (when theparareal iterations proceed) only if special care is takenduring the coupling of the microscopic and macroscopiclevels of description. The convergence rate depends on themodeling error of the approximate macroscopic model. Weillustrate these results with numerical experiments, includ-ing a non-trivial model inspired by molecular dynamics.

Giovanni SamaeyDepartment of Computer Science, K. U. [email protected]

Keith MyerscoughDepartment of Computer ScienceKU [email protected]

Frederic LegollLAMI, [email protected]

Tony LelievreEcole des Ponts [email protected]

MS6

Parallel GMRES Using A Variable S-Step KrylovBasis

Sparse linear systems arise in computational science andengineering. The goal is to reduce the memory require-ments and the computational cost, by means of high per-formance computing algorithms. We introduce a new vari-ation on s-step GMRES in order to improve its stability,reduce the number of iterations necessary to ensure conver-gence, and thereby improve parallel performance. In doingso, we develop a new block variant that allows us to expressthe stability difficulties in s-step GMRES more fully.

David [email protected]

Jocelyne ErhelINRIA-Rennes, FranceCampus de Beaulieu

82 PP16 Abstracts

[email protected]

MS6

A Tiny Step Toward a Unified Task Scheduler forLarge Scale Heterogeneous Architectures


Clement [email protected]

MS6

Extraction the Compute and Communications Pat-terns from An Industrial Application


Thomas [email protected]

MS6

Toward Large-Eddy Simulation of Complex Burn-ers with Exascale Super-Computers: A Few Chal-lenges and Solutions

The modeling of turbulent combustion in realistic devicesis a multi-scale problem, which can be tackled with futureexascale super-computers. However, the multi-scale natureof the flow often requires h- or p-adaptivity and the use ofsplitting approaches to decouple the different time-scales.These methods introduce a large load unbalance that needsto be addressed when thousands of cores are used.

Vincent R. MoureauCORIA LaboratoryUniversite et INSA de [email protected]

Pierre BenardCORIA - CNRS UMR6614, Universite et INSA de [email protected]

Ghislain LartigueCORIA - CNRS [email protected]

MS7

Programming Linear Algebra with Python: ATask-Based Approach

Task-based programming has been extensively used to pro-gram linear algebra applications. While most of times thechosen programming language falls in the category of tra-ditional ones, there is a trend in the HPC community indeveloping applications in scripting languages. COMPSsprogramming model has been recently extended to supporta syntax in Python (PyCOMPSs). The talk will presentthe features of PyCOMPSs and how linear algebra Pythoncodes can be run in parallel in clusters.

Rosa M. BadiaBarcelona Supercomputing Center / CSIC

[email protected]

MS7

Optimizing Numerical Simulations of Elastody-namic Wave Propagation Thanks to Task-BasedParallel Programming

Elastodynamics equations can be discretized into an intrin-sic parallelizable algorithm. The programming paradigmchoice is critical as its capabilities to take advantage ofthe exposed degree of parallelism will vary. MPI mayprobe to be too static, especially concerning load balanc-ing issues and ease of granularity adaptation. Instead ofMPI, we used a task-based programming over runtime sys-tems, PaRSEC. We analyze this novel implementation andcompare the performances on both shared and distributedmemory machines.

Lionel BoillotTeam-Project Magique-3DINRIA Bordeaux [email protected]

Corentin [email protected]

George BosilcaUniversity of Tennessee - [email protected]

Emmanuel [email protected]

Henri CalandraTotalCSTJF, Geophysical Operations & Technology, R&[email protected]

MS7

Parsec: A Distributed Runtime System for Task-Based Heterogeneous Computing


Aurelien BouteillerUniversity of [email protected]


MS7

Controlling the Memory Subscription of Applica-tions with a Task-Based Runtime System

This talk will discuss the cooperation between a task-basedcompressed linear algebra code and a runtime to control thememory subscription levels throughout the execution. Thetask paradigm enables controlling the memory footprintof the application by throttling the task submission rate,striking a compromise between the performance benefitsof anticipative task submission and the resulting memoryconsumption. We illustrate our contribution on an ACA-

PP16 Abstracts 83

based compressed dense linear algebra solver.

Marc SergentINRIA, [email protected]

David GoudinCEA/[email protected]

Samuel ThibaultUniversity of Bordeaux 1, [email protected]

Olivier AumageINRIA, [email protected]

MS8

Uncertainty Quantification Using Tensor Methodsand Large Scale Hpc

We consider the problem of uncertainty quantification forextreme scale parameter dependent problems where an un-derlying low rank property of the parameter dependencyis assumed. For this type of dependency the hierarchicalTucker format offers a suitable framework to approximatea given output function of the solutions of the parameterdependent problem from a number of samples that is lin-ear in the number of parameters. In particular we can aposteriori compute the mean, variance or other interest-ing statistical quantities of interest. In the extreme scalesetting it is already assumed that the underlying fixed-parameter problem is distributed and solved for in parallel.We provide in addition a parallel evaluation scheme for thesampling phase that allows us on the one hand to combineseveral solves and on the other hand parallelise the sam-pling. Finally, we address the problem of estimating theaccuracy of the computations in low rank tensor format.

Lars Grasedyck, Christian LobbertRWTH [email protected], [email protected]

MS8

Parallel Methods in Space and Time and Their Ap-plication in Computational Medicine

In this talk, we consider parallel discretization methodsin space and time for the efficient solution of problemsarising from medicine. We start from permeation throughthe human skin, were we combine parallelization in timethrough the parareal method with multigrid methods inspace. Then, we discuss the full space time discretizationof reaction diffusion problems as occuring in electrophysi-ological simulations of cardiac activity.


Andreas KreienbuhlInstitute of Computational ScienceUSI [email protected]

Dorian Krause

Institute of Computational ScienceUniversity of [email protected]

Gabriel WittumG-CSCUniversity of Frankfurt, [email protected]

MS8

Scalable Shape Optimization Methods for Struc-tured Inverse Modeling Using Large Scale Hpc

In this talk, we present a method combining HPC tech-niques and large scale applications with an optimized in-verse shape identification procedure. It is demonstrated,how existing multigrid infrastructure can be incorporatedinto a shape optimization framework with a focus on scal-ability of the algorithm. These techniques are utilized inorder to fit a model of the human skin to data measure-ments and not only estimate the permeability coefficientsbut also the shape of the cells.

Volker H. SchulzUniversity of TrierDepartment of [email protected]

Martin SiebenbornUniversity of [email protected]

MS8

Parallel Adaptive and Robust Multigrid



MS9

Iterative ILU Preconditioning

Iterative and asynchronous algorithms, with their ability totolerate memory latency, form an important class of algo-rithms for modern computer architectures. In this talk wepresent computational and numerical aspects of replacingtraditional ILU preconditioners with iterative and poten-tially asynchronous variants.

Hartwig AnztInnovate Computing LabUniversity of [email protected]

Edmond ChowSchool of Computational Science and EngineeringGeorgia Institute of [email protected]

MS9

Kokkos: Manycore Programmability and Perfor-mance Portability

Kokkos enables development of HPC applications that are

84 PP16 Abstracts

performance portable across disparate manycore devicessuch as NVIDIA Kepler, Intel Xeon Phi, and AMD Fusion.Kokkos abstractions leverage C++11 features to enable de-veloper productivity in implementing thread parallel al-gorithms; especially hierarchical parallel algorithms andarchitecture-polymorphic data structures required for highdegree of parallelism and performant memory access pat-terns. In contrast, language extensions such as OpenMP,OpenACC, or OpenCL fail to address memory access pat-terns and are unwieldy to implement non-trivial parallelalgorithms. Kokkos abstractions, highlights of its standardC++11 library interface, and sampling of performance re-sults will be presented.

H. Carter Edwards, Christian TrottSandia National [email protected], [email protected]

MS9

ICB: IC Preconditioning with a Fill-in StrategyConsidering Simd Instructions

Most of current processors are equipped with SIMD in-structions that are used to increase the performance of ap-plication programs. We analyze the effective use of SIMDinstructions in the Incomplete Cholesky Conjugate Gradi-ent (ICCG) solver. A new fill-in strategy in the IC fac-torization is proposed for the SIMD vectorization of thepreconditioning step and to increase the convergence rate.Our numerical results confirm that the proposed methodhas better performance than the conventional IC(0)-CGmethod.

Takeshi IwashitaHokkaido [email protected]

Naokazu TakemuraKyoto [email protected]

Akihiro IdaAcademic Center for Computing and Media StudiesKyoto [email protected]

Hiroshi NakashimaACCMSKyoto [email protected]

MS9

Thread-Scalable FEM

Finite Element Method (FEM) is widely used for solv-ing Partial Differential Equations (PDE) in various typesof applications of computational science and engineering.Matrix assembly and sparse matrix solver are the mostexpensive processes in FEM procedures. In the presentwork, these processes are optimized on Intel Xeon Phi andNVIDIA Tesla K40 based on features of each architecture.This talk describes details of optimization and results ofperformance evaluation.

Kengo NakajimaThe University of TokyoInformation Technology Center

[email protected]

MS10

Data Management and Analysis for An Energy Ef-ficient HPC Center

The HPC center at LLNL is concerned about the qual-ity, cost, and environmental impact; in addition, we sharewith our providers the interest to reduce energy costs andimprove electrical grid reliability. We implemented an ex-tensive monitoring and data collection system to collectand analyze data from power meters, PMU’s, and com-puter environmental and power sensors. We will describeour work on managing (storage and integration), analysisand visualization of the collected data.

Ghaleb Abdulla, Anna Maria Bailey, John WeaverLawrence Livermore National [email protected], [email protected], [email protected]

MS10

Smart HPC Centers: Data, Analysis, Feedback,and Response

More efficient HPC operation can be enabled by analysisof full system data, feedback of analysis results to appli-cations and subsystems, and dynamic adaptive response.However, significant gaps for support of such operationsstill exist. We present progress on an integrated full sys-tem approach to HPC operations addressing our work ona) application and global platform analysis b) architecturefor data integration, analysis, and feedback and c) system(e.g., resource manager, facilities, partitioners) responses.

James Brandt, Ann GentileSandia National [email protected], [email protected]

Cindy MartinLos Alamos National Laboratoryc [email protected]

Benjamin AllanSandia National LaboratoriesLivermore, CA [email protected]

Karen D. DevineSandia National [email protected]

MS10

Low-Overhead Monitoring of Parallel Applications

Next-generation parallel systems will have hundreds ofhardware threads per node and in some cases attachedaccelerators. Growth in node-level parallelism is a chal-lenge for monitoring. This talk will lay out a vision forintegrated, low-overhead, sampling based measurement ofcomputation, communication, and I/O of hybrid programson next-generation supercomputers. Using sampling, onecan pinpoint inefficiencies and scaling bottlenecks at alllevels within and across nodes of parallel systems.

John Mellor-CrummeyComputer Science Dept.Rice [email protected]

PP16 Abstracts 85

Mark KrentelRice [email protected]

MS10

Discovering Opportunities for Data AnalyticsThrough Visualization of Large System Monitor-ing Datasets

Blue Waters is a Cray supercomputer comprised of over27,000 nodes and has been instrumented with detailedmonitoring across all of its subsystems. NCSA has been fo-cused on integrating the data and producing insights intoapplication and systems behavior. While realtime eventprocessing and visualization of the data has proved suc-cessful in demonstrating some clear data patterns, we lookforward to the opportunities in numerical analysis and dataanalytics on this long term dataset.

Michael ShowermanNCSAUniversity of Illinois at [email protected]

MS11

A Partitioning Problem for Load Balancing andReducing Communication from the Field of Quan-tum Chemistry

We present a combinatorial problem and potential solu-tions arising in parallel computational chemistry. TheHartree-Fock (HF) method has a very complex data accesspattern. Much research has been devoted over 20 yearsfor parallelizing this important method, based primarilyon intuition and experience. A formal approach for paral-lelizing HF while reducing communication may come fromgraph and hypergraph partitioning. Besides providing apotential solution, this approach may also shed light onthe optimality of existing approaches.

Edmond ChowSchool of Computational Science and EngineeringGeorgia Institute of [email protected]

MS11

Scalable Parallel Algorithms for De Novo Assemblyof Complex Genomes

A critical problem for computational genomics is the prob-lem of de novo genome assembly: the development of ro-bust scalable methods for transforming short randomlysampled sequences into the contiguous and accurate re-construction of complex genomes. While advanced meth-ods exist for assembling the small and haploid genomesof prokaryotes, the genomes of eukaryotes are more com-plex. We address this challenge head on by developingHipMer, an end-to-end high performance de novo assem-bler designed to scale to massive concurrencies. HipMeremploys an efficient Unified Parallel C (UPC) implementa-tion and computes the assembly of the human genome inonly 8.4 minutes using 15,360 cores of a Cray XC30 system.

Evangelos GeorganasEECSUC Berkeley

[email protected]

MS11

Community Detection on GPU

There has been considerable interest in community detec-tion for finding the modularity structure in real world data.Such data sets can arise from social networks as well as vari-ous scientific domains. The Louvain method is one popularmethod for this problem as it is simple and fast. It can alsobe used to detect hierarchical structures in the data. How-ever, its inherently sequential nature and cache unfriendlyworkloads makes it difficult to parallelize. This is partic-ularly true for co-processor architectures. In this work weshow how these obstacles can be overcome and present re-sults from implementing the algorithm on a GPU.

Md NaimUniversity of [email protected]

Fredrik ManneUniversity of Bergen, [email protected]

MS12

Lower Bound Techniques for Communication in theMemory Hierarchy

Communication requirements between different levels ofthe memory hierarchy of a DAG computation are strictlyconnected to the space requirements of the subset in suit-able partitions of the computation DAG. We discuss a ver-satile technique based on a generalized notion of visit of aDAG to obtain space lower bounds. We illustrate the tech-nique by deriving space lower bounds for specific DAGs.We also establish some limitations to the power of the pro-posed technique.

Gianfranco BilardiUniversity of [email protected]

MS12

Communication-Optimal Loop Nests

Communication (data movement) often dominates a com-putation’s runtime and energy costs, motivating organiz-ing an algorithm’s operations to minimize communication.We study communication costs of a class of algorithms in-cluding many-body and matrix/tensor computations and,more generally, loop nests operating on array variables sub-scripted by linear functions of the loop iteration vector. Weuse this algebraic relationship between variables and op-erations to derive communication lower bounds for thesealgorithms. We also discuss communication-optimal im-plementations that attain these bounds.

Nicholas KnightNew York [email protected]

MS12

Write-Avoiding Algorithms

Communication complexity of algorithms on parallel ma-chines and memory hierarchies is well understand. Most of

86 PP16 Abstracts

this work analyzes the amount of communication withoutseparately counting induced reads and writes. However, inupcoming memory technologies, such as non-volatile mem-ories, there is a significant asymmetry in the cost of readsand writes. This motivates us to study lower bounds on thenumber of writes, which are relatively expensive. While weprove that for many algorithms there can not be asymp-totically fewer writes than reads, in the case of direct lin-ear algebra and N-body methods the opposite is true. Wepresent algorithms that match these lower bounds.

Harsha Vardhan SimhadriLawrence Berkeley National [email protected]

MS12

Communication-Efficient Evaluation of MatrixPolynomials


Sivan A. ToledoTel Aviv [email protected]

MS13

Parallel-in-Time Coupling of Models and Numeri-cal Methods

We present a new numerical procedure for the simula-tion of time-dependent problems based on the coupling be-tween the finite element method and the lattice Boltzmannmethod. The procedure is based on the Parareal paradigmand allows to couple two numerical methods having opti-mal efficiency at different space and time scales. The mainmotivation, among other, is that one technique may bemore efficient, or physically more appropriate or less mem-ory consuming than the other depending on the target ofthe simulation and/or on the sub-region of the computa-tional domain. We detail the theoretical and numericalframework for linear parabolic equations even though itspotential applicability should be wider. We show variousnumerical examples on the heat equation that will validatethe proposed procedure and illustrate its advantages. Fi-nally some recent extensions and/or variants of our originalalgorithm will be discussed.

Franz [email protected]

Matteo AstorinoEcole Pol. Fed. de [email protected]

Alexei LozinskiLaboratoire de Mathematiques de BesanconCNRS et Universite de [email protected]

Alfio QuarteroniEcole Pol. Fed. de [email protected]

MS13

Fault-Tolerance of the Parallel-in-Time Integration

Method PFASST

For time-dependent PDEs, parallel-in-time integrationwith e.g. the ”parallel full approximation scheme in spaceand time” (PFASST) is a promising way to accelerate exist-ing space-parallel approaches beyond their scaling limits.In addition, the iterative, multi-level nature of PFASSTcan be exploited to recover from hardware failures, pre-venting a breakdown of the time integration process. Inthis talk, we present various recovery strategies, show theirimpact and discuss challenges for the implementation onHPC architectures.

Robert Speck, Torbjorn KlattJuelich Supercomputing CentreForschungszentrum Juelich [email protected], [email protected]

Daniel RuprechtSchool of Mechanical EngineeringUniversity of [email protected]

MS13

Parareal in Time and Fixed-Point Schemes

In this talk, we will discuss how the parallel efficiency of theparalleal in time algorithm can be improved by coupling itwith fixed-point schemes.

Olga Mula

CEREMADE (Paris Dauphine University)[email protected]

MS13

Implementating Parareal - OpenMP Or MPI?

I will introduce an OpenMP implementation of Pararealwith pipelining and compare it to a standard MPI imple-mentation with respect to runtime, memory and energy.The used benchmark code is PararealF90, which providesthree different implementations of Parareal with and with-out pipelining using OpenMP or MPI to parallelise in time.Benchmarks from a Linux cluster and a Cray XC40 showthat while both versions differ only very little in runtime,using OpenMP can save memory.

Daniel RuprechtSchool of Mechanical EngineeringUniversity of [email protected]

MS14

Combining Software Pipelining with NumericalPipelining in the Conjugate Gradient Algorithm

Modern hardware architectures provide an extremely highlevel of concurrency. One the one hand, new program-ming models such as task-based methods are being in-vestigated by the HPC community in order to pipelinetasks at a fine-grain level. On the other hand, new nu-merical methods are being studied in order to relieve nu-merical synchronization points of most inner-most kernelsused in simulations. In this talk, we propose a formula-tion of the Conjugate Gradient algorithm that combinessoftware pipelining through task-based programming withnumerical pipelining through the use of a recently intro-duce latency-hiding algorithm. We study the potential of

PP16 Abstracts 87

the method on modern multi-GPU, multicore and hetero-geneous arcthitectures.


Luc GiraudInriaJoint Inria-CERFACS lab on [email protected]

Stojce [email protected]

MS14

Rounding Errors in Pipelined Krylov Solvers


Siegfried CoolsUniversity of [email protected]

MS14

Espreso: ExaScale PaRallel Feti Solver

This talk will present a new approach for efficient imple-mentation of the FETI and Hybrid FETI solver for many-core accelerators using Schur complement (SC). By usingthe SC solver can avoid working with sparse Cholesky fac-tors of the stiffness matrices. Instead a dense structureis computed and used by the iterative solver. The mainbottleneck of this method is the computation of the SC.This bottleneck has been significantly reduced by usingnew PARDISO-SC sparse direct solver.

Lubomir RihaVSB Technical University of [email protected]

Tomas Brzobohaty, Alexandros MarkopoulosIT4Innovations National Supercomputing CentreVSB Technical University of [email protected], [email protected]

MS14

Notification Driven Collectives in Gaspi


Christian [email protected]

MS15

Exploiting Two-Level Parallelism by AggregatingComputing Resources in Task-Based ApplicationsOver Accelerator-Based Machines

Computing platforms are now extremely complex provid-ing an increasing number of CPUs and accelerators. Thistrend makes balancing computations between these het-erogeneous resources performance critical. In this paper

we tackle the task granularity problem and we propose ag-gregating several CPUs in order to execute larger paralleltasks and thus find a better equilibrium between the work-load assigned to the CPUs and the one assigned to theGPUs. To this end, we rely on the notion of schedulingcontexts in order to isolate the parallel tasks and thus del-egate the management of the task parallelism to the innerscheduling strategy. We demonstrate the relevance of ourapproach through the dense Cholesky factorization kernelimplemented on top of the StarPU task-based runtime sys-tem. We allow having parallel elementary tasks and usingIntel MKL parallel implementation optimized through theuse of the OpenMP runtime system. We show how ourapproach handles the interaction between the StarPU andthe OpenMP runtime systems and how it exploits the par-allelism of modern accelerator-based machines. We presentexperimental results showing that our solution outperformsstate of the art implementations to reach a peak perfor-mance of 4.5 TFlop/s on a platform equipped with 20 CPUcores and 4 GPU devices.

Terry [email protected]

MS15

Task-Based Multifrontal QR Solver for GPU-Accelerated Multicore Architectures

In this talk we present the design of task-based sparse di-rect solvers on top of runtime systems. In the context of theqr mumps solver, we prove the usability and effectivenessof our approach with the implementation of a sparse ma-trix multifrontal factorization based on a Sequential Taskflow (STF) parallel programming model. Using this pro-gramming model, we developed features such as the inte-gration of dense 2D Communication Avoiding algorithmsin the multifrontal method allowing for better scalabilitycompared to the original approach used in qr mumps. Fur-thermore, following this approach, we show that we arecapable of exploiting heterogeneous architectures with thedesign of data partitioning and a scheduling algorithms ca-pable of handling the heterogeneity of resources.

Florent Lopez

Universite Paul Sabatier - IRIT (ENSEEIHT)[email protected]

MS15

An Effective Methodology for Reproducible Re-search on Dynamic Task-Based Runtime Systems

Reproducibility of experiments and analysis by others isone of the pillars of science. Yet, the description of experi-mental protocols (particularly in computer science) is oftenincomplete and rarely allows for reproducing a study. Inthis talk, we introduce the audience to the notions of repli-cability, repeatability and reproducibility and we discusstheir relevance in the context of HPC research. Then, wepresent an effective approach tailored to the specificities ofstudying HPC systems.

Arnaud Legrand, Luka StanisicINRIA Grenoble [email protected], [email protected]

MS15

High Performance Task-Based QDWH-eig on

88 PP16 Abstracts

Manycore Systems

Current dense symmetric eigensolver implementations suf-fer from the lack of concurrency during the reductionstep toward a condensed tridiagonal matrix and typicallyachieve a small fraction of theoretical peak performance.Indeed, whether the reduction to tridiagonal form em-ploys a single or multi-stage approach as in the state-of-the-art numerical libraries, the memory-bound nature ofsome of the computational kernels is still an inherent prob-lem. A novel dense symmetric eigensolver, QDWH-eig,has been presented by Y. Nakatsukasa and N. J. Higham(SIAM SISC, 2012), which addresses these drawbacks witha highly parallel and compute-bound numerical algorithm.However, the arithmetic complexity is an order of magni-tude higher than the standard algorithm, which makes itunattractive, at first glance, even with todays relatively lowcost of flops. By (1) implementing QDWH-eig frameworkby the DAG-based tile algorithms (fine-grained) to allowpipelining the various computational stages and to increasecomputational usage of all processing cores, (2) leveragingthe execution of based tile implementation of QDWH-eigwith GPUs, the QDWH-eig eigensolver becomes a compet-itive alternative.

Dalal [email protected]

Hatem LtaiefKing Abdullah University of Science & Technology(KAUST)[email protected]


MS16

Bit-Reproducible Climate and Weather Simula-tions: From Theory to Practice

The problem of reproducibility in floating-point numericalsimulations is an active research topic due to the many ad-vantages of reproducible computations, ranging from thesimplification of testing and debugging to the deploymentof applications on heterogeneous systems maintaining nu-merical consistency. A reproducible simulation always pro-duces the same result for the same settings regardless ofthe system it is run on and the configuration of the soft-ware. We show principles for achieving bit reproducibil-ity with a case study of the COSMO climate and weathercode in the context of the recently started crClim project.We use bit reproducibility to develop a fine-resolution cli-mate model for next-generation supercomputers which ne-cessitates high-performance bit-reproducibility. We discusssources of non-reproducibility and our strategies for solvingthem. We will also show how these techniques affect theperformance of the application. Overall, we aim at demys-tifying the belief that determinism is too hard to obtain inhigh-performance codes and show how reproducibility canbe a desirable feature for a numerical model.

Andrea ArteagaETH [email protected]

Christophe Charpilloz, Olive [email protected],

[email protected]

Torsten HoeflerETH [email protected]

MS16

Reproducible and Accurate Algorithms for Numer-ical Linear Algebra

On modern multi-core, many-core, and heterogeneousarchitectures, floating-point computations, especially re-ductions, may become non-deterministic and, therefore,non-reproducible mainly due to the non-associativity offloating-point operations and the dynamic scheduling. Weaddress the problem of reproducibility in the context offundamental linear algebra operations – like the ones in-cluded in the Basic Linear Algebra Subprograms (BLAS)library – and propose algorithms that yields both repro-ducible and accurate results. We extend this approach tothe higher level linear algebra algorithms, e.g. the LU fac-torization, that are built on top of these BLAS kernels. Wepresent these reproducible and accurate algorithms for theBLAS routines and the LU factorization as well as their im-plementations in parallel environments such as Intel serverCPUs, Intel Xeon Phi, and both NVIDIA and AMD GPUs.We show that the performance of the proposed implemen-tations is comparable to the standard ones.

Roman IakymchukUPMC - [email protected]

David DefourDALI/LIRMMUniversity of [email protected]

Sylvain CollangeINRIA – [email protected]

Stef GraillatUniversity Pierre and Marie Curie (UPMC)[email protected]

MS16

Parallel Interval Algorithms, Numerical Guaran-tees and Numerical Reproducibility

This talk focuses on the parallel implementation of inter-val algorithms. A first issue is to preserve the inclusionproperty, that is, the guarantee that a result computedusing interval arithmetic contains the exact result. Wewill illustrate the fact that the inclusion property is at riskwith parallel implementations and give some hints to pre-serve the inclusion property even for parallel interval al-gorithms. A second issue concerns the ir-reproducibilityof numerical computations, i.e. the fact that the samenumerical computations deliver different results dependingon the environment. We will detail the impact of numer-ical ir-reproducibility on interval computations and pointto several developments that remedy this problem.

Nathalie RevolINRIA - LIP, Ecole Normale Superieure de Lyon

PP16 Abstracts 89

[email protected]

MS16

Summation and Dot Products with ReproducibleResults

It can be shown that under very general conditions floating-point summation cannot be associative. Basically, a fixedpoint system is the only exception. Nevertheless summa-tion and dot product algorithms with reproducible resultscan be derived, for example using so-called error-free trans-formations (EFT). We discuss several possibilities of al-gorithms with identical results regardless of the order ofsummation, blocking or other techniques to speed up sum-mation.

Siegfried M. RumpTU [email protected]

MS17

Parallel Generator of Non-Hermitian Matriceswith Prescribed Spectrum and Shape Control forExtreme Computing

When designing eigensolvers it is necessary to providebacktesting sample matrices whose spectrum is known be-forehand even with constraints on the matrices. We derivea method that produces rigorously non-hermitian matrixeswith an exact control of its eigenvalues, shape, sparcity.The method is parallel and scalable in very high dimen-sion. Moreover we show that it is possible to reach absolutearithmetic precision by avoiding all potential troncaturescoming from number representations in the computer.

Herve [email protected]

France BoillodCNRS/[email protected]


Christophe CalvinCEA-Saclay/DEN/DANS/[email protected]

MS17

Monte Carlo Algorithms and Stochastic Process onManycore Architectures

Monte Carlo algorithms are well-known candidates formassively parallel architectures. Nevertheless, from the-ory to implementation on many core architectures, therealways some gaps. We will illustrate in this talk two ex-amples of stochastic algorithms and their practical imple-mentation onto many core architecture. The 1st one comesfrom Monte Carlo neutron transport and the 2nd is the in-vestagation on potential benefits of stochastic algorithmsfor solving linear systems.

Christophe Calvin

CEA-Saclay/DEN/DANS/DM2S

[email protected]

MS17

Hierarchical Distributed and Parallel Program-ming for Auto-Tunned Extreme Computing

Exascale hypercomputers will have highly hierarchical ar-chitectures with nodes composed by lot-of-core processorsand, often, accelerators ; mixing both distributed and par-allel computing paradigms. Methods have to be redesignedand some will be rehabilitated or hybridized. The differentprogramming levels will generate new difficult algorithm is-sues and numerical methods would have to be auto-tunedat runtime to accelerate convergence, minimize energy con-sumption and scalabilities. In this talk, we present a shortsurvey of auto-tuning of such methods and show that wemay use high level graph of component language, associ-ated with multilevel programming paradigm, to help sci-entists to develop efficient software.


MS17

Scalable Sparse Solvers on GPUS

The sparse direct method we present is a multifrontal QRfactorization intended specifically for GPU accelerators.Our approach relies on the use of a bucket scheduler thatexploits an irregular parallelism on both a coarse grain,among a set of fronts with different characteristics, andon a fine grain, through the exploitation of the staircaseshape of these fronts. The scheduler then relies on denseGPU kernels which design and implementation target re-cent GPU architectures.

Timothy A. Davis, Wissam M. Sid-LakhdarTexas A&MComputer Science and [email protected], [email protected]

MS18

Space-Filling Curves and Adaptive Meshes forOceanic and Other Applications

sam(oa)2 (space-filling curves and adaptive meshes foroceanic and other applications) is a software package forparallel adaptive mesh refinement on structured triangu-lar grids created via newest vertex bisection. ExploitingSierpinski order of the grid cells, it features high memoryefficiency and throughput as well as efficient dynamic loadbalancing on shared and distributed memory. We will re-port on recent extensions of sam(oa)2, such as implementa-tion of 3D problems (with 2D adaptivity) and optimizationfor latest hardware arcitectures, and show application sce-narios from tsunami simulation and porous media flow.

Michael BaderTechnische Universitat [email protected]

MS18

ADER-DG on Spacetrees in the ExaHyPE Project

We introduce a first version of ExaHyPE, a spacetree-basedAMR engine for hyperbolic equation system solvers. It

90 PP16 Abstracts

combines multilevel, higher-order ADER-DG with patch-based finite volumes. The spacetree either acts as computeor as organisational data structure - either to host thehigher order shape functions or to organise the patches.Our presentation introduces a reordering of the classicADER-DG solver phases that allows us to realise the algo-rithm with a pipelined single-touch policy, to process cellsconcurrently and to overlap MPI data exchange with theactual computation.

Dominic E. CharrierDurham [email protected]

Angelika SchwarzTechnische Universitat [email protected]

Tobias WeinzierlDurham [email protected]

MS18

A Tetrahedral Space-Filling Curve Via Bitwise In-terleaving

We introduce a space-filling curve for tetrahedral red-refinement that can be computed using bitwise interleav-ing operations similar to the well-known Morton curve forcubical meshes. We present algorithms that compute theparent, children and face-neighbors of a given mesh elementin constant time, as well as the next and previous elementin the space-filling curve and whether a given element ison the domain boundary. We close with numerical resultsobtained with our implementation.

Johannes Holke, Carsten BursteddeUniversitat [email protected], [email protected]

MS18

Nonconforming Meshes and Mesh Adaptivity withPetsc

Efficient mesh adaptivity requires a sensible combinationof data structures and numerical algorithms. Unlike AMRframeworks that try to control both components (structureon the outside, numerics on the inside), this talk will dis-cuss work on using PETSc to drive AMR libraries that pro-vide only data structures (numerics on the outside, struc-ture on the inside). An implementation of this interfaceusing the p4est library will be presented.

Toby IsaacICES, University of Texas at [email protected]

Matthew G. KnepleyRice [email protected] *Preferred

MS19

Directed Graph Partitioning

In scientific computing directed graphs are commonly usedfor modeling dependencies among entities. However, whilemodeling some of the problems as graph partitioning prob-lems, directionality is generally ignored. Accurate model-

ing of some of the problems necessitates to take the di-rectionally into account, which adds additional constraintsthat cannot be easily addressed in the current state-of-the-art partitioning methods and tools. In this talk, we willdiscuss some example problems, models and potential so-lution approaches for them.

Umit V. CatalyurekThe Ohio State UniversityDepartment of Biomedical [email protected]

Kamer KayaSabanci University,Faculty of Engineering and Natural [email protected]

Bora UcarLIP, ENS-Lyon, [email protected]

MS19

Parallel Graph Coloring on Manycore Architec-tures

In scientific computing, the problem of finding sets of in-dependent tasks is usually addressed with graph color-ing. We study performance portable graph coloring al-gorithms for many-core architectures. We propose a noveledge-based algorithm and enhancements of the speculativeGebremedhin-Manne algorithm that exploit architectures.We show superior quality and execution time of the pro-posed algorithms on GPUs and Xeon Phi compared to pre-vious work. We present effects of coloring on applicationssuch as Gauss-Seidel preconditioned solvers.

Mehmet DeveciThe Ohio State [email protected]

Erik G. BomanCenter for Computing ResearchSandia National [email protected]

Siva RajamanickamSandia National [email protected]

MS19

Parallel Approximation Algorithms for b-EdgeCovers and Data Privacy

We propose a new 3/2-approximation algorithm, calledLSE for computing b-Edge Cover and it’s application to adata privacy problem called adaptive k-Anonymity. b-EdgeCover is a special case of the well-known Set Multicoverproblem and also a generalization of Edge Cover problemin graphs. The objective is to choose a subset of C edgesin the graph with weights on the edges, such that at leasta specified number b(v) of edges in C are incident on eachvertex v and the sum of edge weights is minimized. We im-plement the algorithm on serial and shared-memory paral-lel processors and compare it’s performance against a col-lection of inherently sequential approximation algorithmsthat have been proposed for the Set Multicover problem.With LSE, i) we propose the first shared-memory paral-lel algorithm for the adaptive k-Anonymity problem and

PP16 Abstracts 91

ii) give new theoretical results regarding privacy guaran-tees which are significantly stronger than the best knownprevious results.

Arif KhanPurdue [email protected]

Alex PothenPurdue UniversityDepartment of Computer [email protected]

MS19

Clustering Sparse Matrices with Information fromBoth Numerical Values and Pattern

Considering any square fully indecomposable matrix A, wecan apply a two-sided diagonal scaling to |A| to render itinto doubly stochastic form. The Perron-Frobenius theo-rem is a key tool to exploit and we aim to use spectralproperties of doubly stochastic matrices to reveal hiddenblock structure in matrices. We also combine this withclassical graph analysis techniques to design partitioningalgorithms for large sparse matrices based on both numer-ical values and pattern information.


Philip KnightUniversity of Strathclyde, Glasgow, [email protected]

Sandrine MouyssetUniversite deToulouse, UPS-IRIT, [email protected]

Daniel RuizINPT-IRIT-Universite deToulouse, [email protected]


MS20

Performance and Power Characterization of Net-work Algorithms

We present a comparative analysis of the performanceand power/energy attributes of several community de-tection algorithms. We consider the Louvain, Walk-Trap, and InfoMap algorithms on several synthetic andreal-world networks. We introduce a new fine-grainedpower/performance measurement infrastructure that isused to estimate tradeoffs among performance, power, andaccuracy of community detection algorithms.

Boyana NorrisArgonne National [email protected]

Sanjukta BhowmickDepartment of Computer Science

University of Nebraska, [email protected]

MS20

The Structure of Communities in Social Networks

Within the area of social network analysis, community de-tection and understanding community structure are criti-cal to a variety of fields. However, there is little consen-sus as to what a community actually is, and characteriz-ing the structure of real communities is an important re-search challenge. I present a machine-learning frameworkto better understand the structure of real-world communi-ties. Application of this framework to real networks givesus surprising insight into the structure of communities.

Sucheta SoundarajanDepartment of Computer ScienceRutgers [email protected]

MS20

Parallel Algorithms for Analyzing Dynamic Net-works

We present, a parallel graph sparsification-based frame-work for updating the topologies of dynamic large-scalednetworks. The challenge in computing dynamic networkproperties is to update them without recomputing the en-tire network from scratch. This process becomes more chal-lenging in the parallel domain, because the concurrency inthe updates is difficult to determine. To address this issue,we use a technique known as graph sparsification, that di-vides the graph into subgraphs. The updates can be per-formed concurrently on each graph and then aggregatedover in a reduction-like format over the entire path. Weshow that this process can provide a general frameworkfor developing scalable algorithms for updating differenttopologies, including MST and graph connectivity.

Sanjukta BhowmickDepartment of Computer ScienceUniversity of Nebraska, [email protected]

Sriram SrinivasanUniversity of Nebraska at [email protected]

Sajal DasMissouri Univeristy of Science and [email protected]

MS20

Generating Random Hyperbolic Graphs in Sub-quadratic Time

NetworKit is our open-source toolkit for large-scale net-work analysis. Besides numerous analytics kernels it con-tains various graph generators. Random hyperbolic graphsare a promising family of scale-free geometric graphs usinghyperbolic geometry. We provide the first generation al-gorithm for random hyperbolic graphs with subquadraticrunning time. Using the code available in NetworKit, net-works with billions of edges can be generated in a few min-utes on a 16-core workstation.

Moritz von Looz

92 PP16 Abstracts

Karlsruhe Institute of Technology (KIT)Institute of Theoretical [email protected]

Henning MeyerhenkeKarlsruhe Institute of TechnologyInstitute for Theoretical Informatics, Parallel [email protected]

Roman PrutkinKarlsruhe Institute of Technology (KIT)Institute of Theoretical [email protected]

MS21

Improving Applicability and Performance for theMultigrid Reduction in Time (mgrit) Algorithm

Since clock speeds are no longer increasing, time integra-tion is becoming a sequential bottleneck. In this talk, wediscuss new advances in the MGRIT algorithm and corre-sponding code XBraid. MGRIT can be easily integratedinto existing codes, but more intrusive approaches allow forbetter performance. We present developments in XBraid tobroaden its applicability (e.g., support for temporal adap-tivity) and improve efficiency by allowing for varying levelsof intrusiveness, dictated by the user’s code.


Veselin DobrevLawrence Livermore National [email protected]

Tzanio V. KolevCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

Ben O’NeillUniversity of Colorado, [email protected]

Jacob B. SchroderLawrence Livermore National [email protected]

Ben SouthworthUniversity of Colorado, [email protected]

Ulrike Meier YangLawrence Livermore National [email protected]

MS21

Multigrid Methods with Space-Time Concurrency

One feasible approach for doing effective parallel-in-timeintegration is with multigrid. In this talk, we consider thecomparison of the three basic options of multigrid algo-rithms on space-time grids that allow space-time concur-rency: coarsening in space and time, semicoarsening in thespatial dimensions, and semicoarsening in the temporal di-

mension. We discuss advantages and disadvantages of thedifferent approaches and their benefit compared to tradi-tional space-parallel algorithms with sequential time step-ping in a parallel computational environment.

Stephanie FriedhoffUniversity of [email protected]

Robert D. Falgout, Tzanio V. KolevCenter for Applied Scientific ComputingLawrence Livermore National [email protected], [email protected]

Scott MaclachlanDepartment of Mathematics and StatisticsMemorial University of [email protected]

Jacob B. SchroderLawrence Livermore National [email protected]

Stefan VandewalleDepartment of Computer ScienceKatholieke Universiteit [email protected]

MS21

Convergence Rate of Parareal Method with Modi-fied Newmark-Beta Algorithm for 2nd-Order Ode

In this study, we introduce a modified Newmark-beta algo-rithm (MNBA) as a time integrator in the parareal methodfor 2nd-order ODEs. The MNBA allows us to calculatea highly accurate amplitude and phase regardless of timestep width. Therefore, the MNBA is effective for coarse-solver in the parareal algorithm. The accurate calculationis accomplished by a phase correction coefficient. The com-puted results showed excellent convergence rate for simpleharmonic motion with the MNBA.

Mikio [email protected]

Kenji OnoVCAD System Research Program, [email protected]

MS21

PinT for Climate and Weather Simulation

Over the last decade, the key component for increasing thecompute performance for HPC systems was an increase indata parallelism. However, for simulations with a fixedproblem size, this increase in data parallelism clearly leadsto circumstances with the communication latency domi-nating the overall compute time which again results in astagnation or even decline of scalability. For weather sim-ulations, further requirements on wallclock time restric-tions are given and exceeding these restrictions would makethese simulation results less beneficial. For such circum-stances, the so-called parallelization-in-time method pro-vide a promising approach. We present our recent resultsand methods on applying the Parareal approach as one ofthe parallelization-in-time method to a prototype of nu-

PP16 Abstracts 93

merical weather/climate simulations.

Martin SchreiberUniversity of ExeterCollege of Engineering, Mathematics and [email protected]

Pedro S. PeixotoUniversity of Sao Paulo, [email protected]

Terry HautLos Alamos National [email protected]

Beth WingateUniversity of [email protected]

MS22

Increasing Arithmetic Intensity Using StencilCompilers.

Stencil calculations and matrix-free Krylov sub-spacesolvers represent an important class of kernels in many sci-entific computing applications. In such types of solvers,applications of stencil kernels are often the dominant partof the computation, and an efficient parallel implementa-tion of the kernel is therefore crucial in order to reduce thetime to solution. Inspired by polynomial preconditioning,we increase the arithmetic intensity of this Krylov subspacebuilding block by replacing the matrix with a higher-degreematrix polynomial. This allows for better use of SIMDvectorization and as a consequence shows better speed-upson the latest hardware. Portability is made possible withthe use of state- of-the-art stencil compiler programs, andan extension to a distributed-memory environment withlocal shared-memory parallelism is demonstrated. Thus,we demonstrate an effective, portable approach which caneasily be adapted to extract high performance from newlydeveloped hardware.

Simplice [email protected]

Patrick SananUSI - Universita della Svizzera [email protected]


Reps BramUniversiteit [email protected]

Wim VanrooseAntwerp [email protected]

MS22

Many Core Acceleration of the Boundary Element

Library BEM4I

We present a library of parallel boundary element basedsolvers for solution of engineering problems. The librarysupports solution of problems modelled by the Laplace,Helmholtz, Lame, and time-dependent wave equation. Inour contribution we focus on the acceleration of the sys-tem matrices assembly using the Intel Xeon Phi coproces-sors and parallel solution of linear elasticity problems usingthe boundary element tearing and interconnecting method(BETI) in cooperation with the ESPRESO library.

Michal MertaVSB [email protected]

Jan ZapletalIT4InnovationsVSB-Technical University of [email protected]

MS22

HPGMG-FV Benchmark with GASPI/GPI-2:First Steps

The communication library GASPI/GPI is thread basedand offers a full flexibility to follow the task-model paral-lelization approach, being free from the restrictions of thestandard hybridization with MPI and threads/OpenMP.Here, after replacing the MPI-communication by a thread-parallel GPI-communication in the prolongation moduleof HPGMG, we compare the timings of the original im-plementation and the new one. Our results show thatGASPI/GPI might be an appropriate choice for the hi-erarchically parallel architectures nowadays.

Dimitar M. StoyanovITWM [email protected]

MS23

FE2TI - Computational Scale Bridging for Dual-Phase Steels

The computational scale bridging approach (FE2TI) com-bines the well-known FE2 method with the FETI-DP do-main decomposition method. This approach incorporatesphenomena on the microscale into a macroscopic prob-lem. FE2TI is used in the project ”EXASTEEL - Bridg-ing Scales for Multiphase Steels” (part of the Germanpriority program SPPEXA) for simulations of dual-phasesteels. Scalability results for three-dimensional nonlinearand micro-heterogeneous hyperelasticity problems are pre-sented, filling the complete Mira at Argonne National Lab-oratory (786,432 BlueGene/Q cores).

Martin LanserMathematical [email protected]

Martin LanserMathematical InstituteUniversitat zu Koln, [email protected]

Axel KlawonnUniversitat zu [email protected]

94 PP16 Abstracts

Oliver RheinbachTechnische Universitat Bergakademie [email protected]

MS23

Title Not Available


Steffen MuethingUniversitaet [email protected]

MS23

Extreme-Scale Solver for Earth’s Mantle – Spec-tral/Geometric/Algebraic Multigrid Methods forNonlinear, Heterogeneous Stokes Flow

We present an implicit solver that exhibits optimal al-gorithmic performance and is capable of extreme scal-ing for hard PDE problems, such as mantle convec-tion. To minimize runtime the solver advances in-clude: aggressive mesh adaptivity, mixed continuous-discontinuous discretization with high-order accuracy, hy-brid spectral/geometric/algebraic multigrid, and novelSchur-complement preconditioning. We demonstrate suchalgorithmically optimal implicit solvers that scale out to 1.5million cores for severely nonlinear, ill-conditioned, hetero-geneous, and anisotropic PDEs.

Johann RudiThe University of Texas at AustinThe University of Texas at [email protected]

A. Cristiano I. MalossiIBM Research - [email protected]

Tobin IsaacThe University of [email protected]

Georg StadlerCourant Institute for Mathematical SciencesNew York [email protected]

Michael [email protected]

Peter W. J. Staar, Yves Ineichen, Costas BekasIBM Research - [email protected], [email protected],[email protected]

Alessandro CurioniIBM [email protected]

Omar GhattasThe University of Texas at [email protected]

MS23

3D Parallel Direct Elliptic Solvers Exploiting Hier-

archical Low Rank Structure

We describe a parallel direct solver based on a octree-nested condensation strategy and hierarchical compressedmatrix representations for the linear algebra operations in-volved. The algorithm is asymptotically work-optimal andits overall memory requirements grow only linearly withproblem size assuming bounded ranks of off-diagonal ma-trix blocks. The algorithm exposes substantial concurrencyin the octree traversals. We show its parallel efficiency withan implementation running on multi-thousand cores on theKAUST Shaheen supercomputer.

Gustavo [email protected]


George M Turkiyyah, George M TurkiyyahAmerican University of [email protected],[email protected]


MS24

Performance Optimization of a Massively Paral-lel Phase-Field Method Using the Hpc FrameworkWalberla

We present a massively parallel phasefield code, based onthe HPC framework waLBerla, for simulating solidificationprocesses of ternary eutectic systems. Different patternsare forming during directional solidification of ternary eu-tectics, depending on the physical parameters, with signif-icant influence on the mechanical properties of the mate-rial. We show simulations of large structure formations,which give rise to spiral growth. Our implementation ofthis method in the HPC framework waLBerla uses explicitvectorization, asynchronous communication and I/O band-with reduction techniques. Optimization and scaling re-sults are presented for simulations run on three Germansupercomputers: SuperMUC, Hornet and JUQUEEN.

Martin BauerUniversity of Erlangen-NurembergDepartment of Computer [email protected]

Harald KostlerUniversitat Erlangen-NurnbergDepartment of Computer Science 10 (System Simulation)[email protected]

Ulrich RudeFriedrich-Alexander-University Erlangen-Nuremberg,[email protected]

MS24

Large Scale phase-field simulations of directional

PP16 Abstracts 95

ternary eutectic solidification

Materials with defined properties are crucial for the de-velopment and optimization of existing applications. Dur-ing the directional solidification of ternary eutectic alloys,a wide range of patterns evolve. This patterns are re-lated to the mechanical properties of the material. Largescale phase-field simulation, based on a grand-potential ap-proach, allow to study the underlying physical processesof microstructure evolution. Based on the massive par-allel waLBerla framework, a highly optimized solver wasdeveloped. Large phase-field simulations of the ternarysystem Al-Ag-Cu and the three-dimensional structure arepresented and visually and quantitatively compared to ex-periments.

Johannes HotzerHochschule Karlsruhe Technik und Wirtschaft, Germanytba

Philipp Steinmetz, Britta NestlerKarlsruhe Institute of Technology (KIT)Institute of Applied Materials (IAM-ZBS)[email protected], [email protected]

MS24

Building High Performance Mesoscale Simulationsusing a Modular Multiphysics Framework

The phasefield module included with MOOSE (the open-source Multiphysics Object Oriented Simulation Environ-ment) allows for the rapid creation and evolution of a widevariety of multiphysics mesoscale simulations. Researchersutilizing this tool have access to powerful finite elementcapabilities while enjoying highly scalable runs using thebuilt-in hybrid parallelism through the underlying frame-work. An overview of the advanced capabilities of thephasefield module will be provided including our parsedkernel and material system, fully-coupled linkage to otherphysics modules, and advances in algorithmic developmentwith the grain tracking capability.

Cody PermannCenter for Advanced Modeling and SimulationIdaho National [email protected]

MS24

Equilibrium and Dynamics in Multiphase-FieldTheories: A Comparative Study and a ConsistentFormulation

We present a new multiphase-field theory for describinginterface driven phase transitions in multi-domain and/ormulti-component systems. We first analyize the mostwidespread multiphase theories presently in use, to iden-tify their advantages/disadvantages. Then, based on theresults of the analysis, a new way of constructing the freeenergy surface is introduced, and a generalized multiphasedescription is derived for arbitrary number of phases (ordomains). The construction of the free energy functionaland the dynamic equations is based on criteria that en-sure mathematical and physical consistency. The presentedapproach retains the variational formalism; reduces (or ex-tends) naturally to lower (or higher) number of fields on thelevel of both the free energy functional and the dynamicequations; enables the use of arbitrary pairwise equilib-rium interfacial properties; penalizes multiple junctions in-creasingly with the number of phases; ensures non-negative

entropy production, and the convergence of the dynamicsolutions to the equilibrium solutions; and avoids the ap-pearance of spurious phases on binary interfaces. The newtheory is tested for multi-component liquid phase separa-tion (with/without hydrodynamics) and grain boundaryroughening.

Gyula I. TothDepartment of Physics and TechnologyUniversity of Bergen, [email protected]

Laszlo GranasyWignes Research Centre for PhysicsBudapest, [email protected]

Tamas PusztaiWigner Research Centre for PhyicsBudapest, [email protected]

Bjorn KvammeDepartment of Physics and TechnologyUniversity of [email protected]

MS25

PMI-Exascale: Enabling Applications at Exascale

The PMI-Exascale (PMIx) community is chartered withproviding an open source, standalone library that supportsa wide range of programming models at exascale and be-yond. Critical to the success of these large systems is theability to rapidly start applications, and the emerging part-nership between resource managers and applications formanaging workflows. This talk will provide an overview ofPMIx objectives, and a status report on the community’sprogress.

Ralph [email protected]

MS25

Middleware for Efficient Performance in HPC

Two middleware components are attracting significant at-tention today given their notable contributions to the effi-ciency of HPC systems in terms of energy and throughput,while also improving performance. The first is the use of”local” storage hierarchy, sometimes referred to as ”BurstBuffers”, though the functionality goes beyond the task ofbuffering checkpoint files. The second is the deployment ofadvance power management functionality, from the systemlevel, down through jobs, and into applications themselves.

Larry [email protected]

MS25

Power Aware Resource Management

Maximizing application performance under a power con-straint is a key research area for exascale supercomput-ing. Current HPC systems are designed to be worst-case

96 PP16 Abstracts

power provisioned, leading to under-utilized power and lim-ited application performance. In this talk I will presentRMAP, a low-overhead, power-aware resource manager im-plemented in SLURM that addresses the aforementionedissues. RMAP uses hardware overprovisioning and power-aware backfilling to improve job turnaround times by upto 31% when compared to traditional resource managers.

Tapasya PatkiLawrence Livemore National [email protected]

MS25

Co-Scheduling: Increasing Efficency for HPC

An approach to improve efficiency on HPC systems with-out changing applications is co-scheduling, i.e., more thanone job is executed simultaneously on the same nodes of asystem. If co-scheduled applications use different resourcetypes, improved efficiency can be obtained. However, ap-plications may also slow down each other when sharingresources. This talk presents our research activities andresults within FAST, a three year project funded by theGerman ministry of research and education.

Josef WeidendorferLRR-TUMTU [email protected]

MS26

Fast Solvers for Complex Fluids

Complex fluids are challenging to simulate as they are char-acterized by stationary or evolving microstructure. We willdiscuss the discretization of integral equation formulationsfor complex fluids. In particular, we will discuss the formu-lation, numerical challenges and scalability of algorithmsfor volume integral equations and we will present a newopen-source library for such problems. We will comparetheir performance to other state-of-the art codes.

George BirosUniversity of Texas at [email protected]

Dhairya MalhotraInstitute of Computational Engineering and SciencesThe University of Texas at [email protected]

MS26

Recent Parallel Results Using Dynamic Quad-TreeRefinement for Solving Hyperbolic ConservationLaws on 2d Cartesian Meshes

We discuss recent parallel results using ForestClaw, anadaptive code based on the Berger-Oliger-Colella patch-based scheme for solving hyperbolic conservation laws cou-pled with the quad/octree mesh refinement available in thep4est library. We show how parallel communication costsvary across different refinement strategies and patch sizesand detail the strategies we use to minimize these costs.Results will be shown for several hyperbolic problems, in-cluding shallow water wave equations on the cubed sphere.

Donna CalhounBoise State University

[email protected]

MS26

Simulation of Incompressible Flows on Parallel Oc-tree Grids

We present a second-order accurate solver for the incom-pressible Navier-Stokes equations on parallel Octree grids.The viscosity is treated implicitly and we use a stable pro-jection method. We rely on the p4est library to handle thedistributed Octree structure and capture the presence ofirregular interfaces with the level-set method. This leadsto a scalable, stable and accurate solver for incompressibleflows past complex geometries.

Arthur [email protected]

Frederic G. GibouUC Santa [email protected]

MS26

Basilisk: Simple Abstractions for Octree-AdaptiveSchemes

I will present the basic design and algorithms ofa new octree-adaptive free software library: Basilisk(http://basilisk.fr). I will show how using simple stencil-based abstractions, one can very simply construct a widerange of octree-adaptive numerical schemes including:multigrid solvers for elliptic problems, volume-of-fluid ad-vection schemes for interfacial flows and high-order flux-based schemes for systems of conservation laws. The auto-matic and transparent generalisation of these abstractionsto large-scale distributed parallel systems will also be dis-cussed.

Stephane PopinetNational Institute of Water and Atmospheric [email protected]

MS27

Tradeoffs in Domain Decomposition: Halos VsShared Contiguous Arrays


Jed BrownArgonne National Laboratory; University of [email protected]

MS27

Firedrake: Burning the Thread at Both Ends


Gerard J GormanDepartment of Earth Science and EngineeringImperial College [email protected]

MS27

Threads: the Wrong Abstraction and the Wrong

PP16 Abstracts 97

Semantic

MPI+OpenMP is frequently proposed as the right evolu-tionary programming model for exascale. Unfortunately,the evolutionary introduction of OpenMP into existingMPI-only codes is fraught with difficulty. We will describe”The Right Way” to do MPI+OpenMP and ultimatelyconclude that MPI+MPI is a more effective alternative forlegacy codes.

Jeff R. HammondIntel CorporationParallel Computing Labjeff [email protected]

Timothy MattsonIntel [email protected]

MS28

Recent Directions in Extreme-Scale Solvers

Extreme-scale supercomputers have made extremely largescientific simulations possible, but extreme-scale linearsolvers is often a bottleneck. New ideas and algorithmsare desperately needed to move forward. In this talk wereview several topics of current interest that will be cov-ered in this MS, such as, hierarchical low-rank solvers, fine-grained iterative incomplete factorizations (precondition-ers), asynchronous iterations and communication-avoiding(or reducing) iterative solvers.


MS28

A New Parallel Hierarchical Preconditioner forSparse Matrices

We will present a new parallel preconditioner for sparsematrices based on hierarchical matrices. The algorithmshares some elements with the incomplete LU factoriza-tion, but uses low-rank approximations to remove fill-ins.The algorithm also uses multiple levels, in a way similar tomultigrid, to achieve fast global convergence. Moreover, lo-cal communication is needed only for eliminating boundarynodes on every process. We will demonstrate performanceand scalability results using various benchmarks.

Chao ChenStanford [email protected]

Eric F. DarveStanford UniversityMechanical Engineering [email protected]


Erik G. BomanCenter for Computing ResearchSandia National Labs

[email protected]

MS28

Preconditioning Using Hierarchically Semi-Separable Matrices and Randomized Sampling

We present a software package, called STRUMPACK(Structured Matrices PACKage), to deal with dense orsparse rank structured matrices. Dense rank structuredmatrices occur for instance in the BEM, quantumchem-istry, covariance and machine-learning problems. Our fullyalgebraic sparse solver or preconditioner uses rank struc-tured kernels on small dense submatrices. The computa-tional complexity and actual runtimes are much lower thanfor traditional direct methods. We show results for severallarge problems on modern HPC architectures.

Pieter GhyselsLawrence Berkeley National LaboratoryComputational Research [email protected]

Francois-Henry RouetLawrence Berkeley National [email protected]

Xiaoye Sherry LiComputational Research DivisionLawrence Berkeley National [email protected]

MS28

Preconditioning Communication-Avoiding KrylovMethods

We have previously presented a domain decompositionframework to precondition communication-avoiding (CA)Krylov methods. This framework allowed us to develop aset of preconditioners that do not require any additionalcommunication beyond what the CA Krylov methods havealready needed while improving their convergence rates. Inthis talk, we present our recent extensions to the frameworkthat can improve the performance of the preconditioners.

Ichitaro YamazakiUniversity of Tennessee, [email protected]

Siva Rajamanickam, Andrey ProkopenkoSandia National [email protected], [email protected]


Mark Hoemmen, Michael HerouxSandia National [email protected], [email protected]

Stan TomovComputer Science DepartmentUniversity of [email protected]

Jack J. DongarraDepartment of Computer Science

98 PP16 Abstracts

The University of [email protected]

MS29

Deductive Verification of BSP Algorithms withSubgroups

We present how to perform deductive verification of BSPalgorithms with subgroup synchronisation. From BSP pro-grams, we generate sequential codes for the condition gen-erator WHY. By enabling subgroups, the user can provethe correctness of programs that run on hierarchical ma-chines (e.g. clusters of multi-cores). We present howto generate proof obligations of MPI programs that onlyuse collective operations. Our case studies are distributedstate-space construction algorithms, the basis of model-checking.

Frederic GavaLaboratoire d’Algorithmique, Complexite et LogiqueUniversite de Paris [email protected]

MS29

A BSP Scalability Analysis of 3D LU Factorizationwith Row Partial Pivoting

Three dimensional matrix multiplication is intuitively nomore complex in terms of communication volume than theLU factorization without pivoting. It is known that it isin fact the same. In this presentation, we show that thiscommunication volume lower bound can also be achievedfor the LU factorization with ”usual” row partial pivot-ing. The bulk synchronous parallel (BSP) model is used toanalyze the scalability of this algorithm and to discuss itstradeoffs.

Antoine PetitetHuawei Technologies [email protected]

MS29

Computing Almost Asynchronously: What Prob-lems Can We Solve in O(logP ) Supersteps?

The elegance and utility of BSP stem from treating a par-allel computation as a sequential one, consisting of super-steps. In many non-trivial computations this number canbe independent of the input, depending only on the num-ber of processors p. We give an overview of problems thatadmit work-optimal BSP solutions using only O(log p) su-persteps: FFT, sorting, suffix array, selection and LongestCommon Subsequence, and present general techniques forthe design of such almost-asynchronous algorithms.

Alexander TiskinComputer ScienceUniversity of [email protected]

MS29

Towards Next-Generation High-Performance BSP

Bulk Synchronous Parallel allows structured programming,ideally making algorithms run predictably, portably, andperformantly. This talk focuses on the performance as-pect of BSP, in particular its interoperability with ease-of-programming and resiliance, both aspects of BSP which

have proven extremely successful in the past decadethrough MapReduce, Pregel, etc. Can we construct a BSPplatform that behaves predictably, is fully portable and re-silient, remains easy to program in, while not sacrificingperformance?

Albert-Jan N. YzelmanKU LeuvenIntel ExaScience Lab (Intel Labs Europe)[email protected]

MS30

High Performance Computing in Micromechanics

Micromechanics can be characterized as a computation ofthe global (macrolevel) response of a heterogeneous ma-terial possessing a complicated microstructure. It can beimplemented via finite element method exploiting informa-tion on the microstructure provided by CT. The solution ofthe finite element systems then requires high performanceparallel solution methods. We shall discuss adaptation ofan algebraic two level solver exploiting additive Schwarztype approximate inverse for the solution of extreme largefinite element systems.

Radim BlahetaInstitute of Geonics, Academy of Sciences of the CzechRepublic, Ostrava, Czech [email protected]

O. JaklInstitute of Geonics, Academy of Sciences of the [email protected]

J. StaryInstitute of Geonics, Academy of Sciences of theCzech [email protected]

I. GeorgievInstitute of Information and CommunicationTechnologies, BASSofia, [email protected]

MS30

Balanced Dense-Sparse Parallel Solvers for Nonlin-ear Vibration Analysis of Elastic Structures

The dynamical analysis of elastic structures requires com-putation of periodic solutions with reliable accuracy, natu-rally leading to nonlinear models and large-scale systems.A parallel algorithm for implementation of the shootingmethod is proposed and studied. It involves simultane-ously basic matrix operations and solvers for sparse anddense matrices. The computational complexity and par-allel efficiency are estimated. Numerical experiments onhybrid HPC systems with accelerators are presented andthe achieved scalability is analyzed.

Svetozar MargenovInstitute of Information and Communication TechnologiesBulgarian Academy of Sciences, Sofia, [email protected]

MS30

Parallel Block Preconditioning Framework and Ef-

PP16 Abstracts 99

ficient Iterative Solution of Discrete Multi-PhysicsProblems

In this presentation we discuss software engineering is-sues and parallel performance of the block precondition-ing framework - a software middleware designed for simpledesign and implementation of block preconditioners thatarise in common discretisations of multi-physics problems.

Milan D. MihajlovicCentre for Novel Computing, Department of ComputerScienceThe University of [email protected]

MS30

Simulations for Optimizing Lightweight Structureson Heterogeneous Compute Resources

For lightweight structures in mechanical manufacturing,several parameters have to be determined, which is done inan optimization before production. The optimization is acomplex simulation code comprising several CFD and FEMsimulation tasks. The challenge is to execute this complexmulti-modular code on parallel or distributed hardware en-vironments. Besides others, this leads to a specific schedul-ing problem of the FEM tasks, which is the main topic ofthis talk.

Robert DietzeTU Chemnitz, [email protected]

Michael HofmannDepartment of Computer ScienceChemnitz University of Technology, [email protected]

Gudula RungerTU Chemnitz, [email protected]

MS31

Exploiting Synergies with Non-University Partnersin CSE Education

The chances and obstacles to involving non-university part-ners, in our case the federal research center in Jlich, ingraduate CSE education will be discussed. Although sig-nificant flexibility may be required by both the univer-sity and non-university partners in order to chomp witha joint working educational program, the benefits includea sustainable communication platforms, strengthening ofthe research collaborations, and more opportunities for thegraduates. One challenge is dealing with diverse back-grounds of incoming students. Depending on the desiredgraduate profile, a significant amount of curriculum flexi-bility may be necessary in order to harmonize the variousskill sets. In addition, integrative instruction elements thatcombine various components of CSE are called for, includ-ing custom-designed laboratory modules and projects.

Marek BehrRWTH Aachen UniversityChair for Computational Analysis of Technical Systems

[email protected]

MS31

Designing a New Masters Degree Program in Com-putational and Data Science

The Friedrich Schiller University Jena, founded in 1558,is a classical German university with around 20,000 stu-dents distributed over ten different departments. In win-ter 2014/2015, the university started a new master’s pro-gram in Computational and Data Science that is designedto exploit the synergies between computational science forlarge-scale problems and data-driven approaches to ana-lyze massive data sets. The fundamental methodologicalbasis of this interdisciplinary program consists of parallelprocessing and scalable algorithms.

Martin BueckerMathematics and Computer Science DepartmentFriedrich Schiller University [email protected]

MS31

Supporting Data-Driven Businesses Through Tal-ented Students in KnowledgeEngineering@Work

To go beyond the traditional learning experience, Knowl-edgeEngineering@Work offers its young and talented stu-dents in Knowledge Engineering the opportunity to facereal and challenging problems on the interface of appliedComputer Science and Mathematics, while doing theirbachelor. This allows them to become professionals thatare tailor made for the fast growing, highly competitiveand demanding market in which knowledge of how to han-dle large data sets is more and more important.

Carlo GaluzziDept. of Knowledge EngineeringMaastricht [email protected]

MS31

Blending the Vocational Training and UniversityEducation Concepts in the Mathematical TechnicalSoftware Developer (matse) Program

The challenging problems in natural and engineering sci-ences require a close cooperation between mathematicians,computer scientists and end-users. This talk presentsthe blend of a vocational training as Mathematical Tech-nical Software Developer (MATSE) and the training-accompanying bachelor program Scientific Programming.The graduates acquire the skills to handle mathematicalmodels as well as to find and implement solutions of simu-lation, optimization and visualization problems.

Benno WillemsenIT CenterRWTH Aachen [email protected]

MS32

Parallel Decomposition Methods for StructuredMIP

Large-scale optimization with integer variables is a chal-lenging problem with many scientific and engineering ap-plications, because of the complexity, nonlinearity and size

100 PP16 Abstracts

of the problems. However, certain classes of the prob-lems (e.g., stochastic and/or network optimization) havestructures that allow decomposition of the problems. Wepresent decomposition techniques and implementations ca-pable of running on high performance computing systems.Numerical examples are provided using large-scale stochas-tic unit commitment problems.

Kibaek KimDepartment of Industrial Engineering and ManagementSciencesNorthwestern [email protected]

MS32

Parallel Optimization of Complex Energy Systemson High-Performance Computing Platforms

We present scalable algorithms and PIPS suite of parallelimplementations for the stochastic economic optimizationof complex energy systems and parameter estimation forpower grid systems under dynamic transients due to un-foreseen contingencies. We will also discuss HPC simula-tions related to the stochastic optimization of electricitydispatch in the State of Illinois’ grid under large penetra-tion of wind power.

Cosmin G. PetraArgonne National LaboratoryMathematics and Computer Science [email protected]

MS32

PIPS-SBB: A Parallel Distributed Memory Solverfor Stochastic Mixed-Integer Optimization

Stochastic Mixed-Integer Programming (MIP) formula-tions of optimization problems from real-world applica-tions such as unit commitment can exceed available mem-ory on a single workstation. To overcome this limitation,we have developed PIPS-SBB, a parallel distributed mem-ory Branch-and-Bound based stochastic MIP open-sourcesolver that can leverage HPC architectures. In this talk,we discuss ongoing work on PIPS-SBB, and illustrate itsperformance on large unit commitment problem instancesthat cannot be solved to guaranteed optimality using serialimplementations.

Deepak RajanLawrence Livermore National [email protected]

Lluis MunguiaGeorgia Institute of [email protected]

Geoffrey M. OxberryLawrence Livermore National [email protected]

Cosmin G. PetraArgonne National LaboratoryMathematics and Computer Science [email protected]

Pedro Sotorrio, Thomas EdmundsLawrence Livermore National Laboratory


MS32

How to Solve Open MIP Instances by Using Su-percomputers with Over 80,000 Cores

In this talk, we introduce the Ubiquity Generator (UG)Framework, which is a software framework to parallelizea branch-and-bound based solver on a variety of parallelcomputing environments. UG gives us a systematic wayto develop a parallel solver that can run on large-scaledistributed memory computing environments. A successstory will be presented where we solve 14 open MIP in-stances from MIPLIB2003 and MIPLIB2010 using ParaS-CIP, a distributed-memory instantiation of UG using SCIPas MIP solver, on up to 80,000 cores of supercomputers.

Yuji ShinanoZuse Institute BerlinTakustr. 7 14195 [email protected]

Tobias AchterbergGurobi GmbH, [email protected]

Timo Berthold, Stefan HeinzFair Issac Europe Ltd, [email protected], [email protected]

Thorsten KochZuse Institute [email protected]

Stefan VigerskeGAMS Software GmbH, [email protected]

Michael WinklerGurobi GmbH, [email protected]

MS33

Improving Resiliency Through Dynamic ResourceManagement

Application recovery techniques for exascale systems targetquick recovery from node failures using spare nodes. Staticresource management assigns a fixed number of spare nodesto an individual job, leading to poor resource utilizationand possibly also to insufficiency. Our dynamic resourcemanagement scheme allows the execution of non-rigid jobsand can perform on-the-fly replacement of a failed node.Results show better job turnaround times under high fail-ure rates when compared to static resource management.

Suraj Prabhakaran, Felix WolfTechnische Universitat [email protected], [email protected]

MS33

Operation Experiences of the K Computer

The K computer is one of the largest supercomputer sys-tems in the world, and it has been operated for more than

PP16 Abstracts 101

three years. Since its release, we have solved a significantnumber of operation issues such as fixing and improvingthe system software components, customizing the opera-tion rules according to users needs, and so on. In my talk,I shall talk what we have experienced through the opera-tion of the K computer.

Atsuya UnoRIKEN Advanced Institute of Computing [email protected]

MS33

The Influence of Data Center Power and EnergyConstraints on Future HPC Scheduling Systems

Reducing the power and energy consumption of HPC datacenters is one of the essential areas of exascale research.Currently, PC data centers are built and provisioned forexpected peak power. But the standard workload powerconsumption can be as low as 50% of peak power. There-fore, future HPC data centers would then over-provisionor right size their data center; meaning that power andcooling is only sufficient for normal operation. Additionalpower restrictions might come from the power providers(maximum power change in a specific time frame). Otherfactors could include, for example, peak power vs. energyconsumption over the year, real time power consumptionadjustments based on renewable energy generation, andmoving from CPU hour based accounting to energy con-sumption. Since scheduling systems are currently the mainway to allocate resources for applications, it seems clearthat power and energy might need to become a schedulingresource/parameter. This talk will discuss possible datacenter policies concerning power and energy and their im-pact on scheduling decisions.

Torsten WildeLRZ: Leibniz Supercomputing [email protected]

MS34

Resiliency in AMR Using Rollback and Redun-dancy

Many resiliency strategies introduce redundancy in the ap-plications. Some others use roll-back and recovery. Struc-tured AMR has built-in redundancy and hierarchical struc-ture, both of which can be exploited to build a cost-effectiveresiliency strategy for AMR. We present an example of astrategy that provides coverage for a variety of faults andthe errors they manifest in the context of patch-based AMRfor solving hyperbolic PDE’s.

Anshu DubeyLawrence Berkeley National [email protected]

MS34

Adaptive Mesh Refinement in Radiation Problems

Radiative heat transfer is an important mechanism in aclass of challenging engineering and research problems. Adirect all-to-all treatment of these problems is prohibitivelyexpensive on large core counts. Here, we show that radi-ation calculations can be made to scale within the Uintahframework through a novel combination of reverse MonteCarlo ray tracing techniques combined with adaptive meshrefinement. Strong scaling results from DOE Titan are

shown to 256K cores and 16K GPUs.

Alan HumphreyScientific Computing and Imaging InstituteUniversity of [email protected]

Todd HarmanDepartment of Mechanical EngineeringUniversity of [email protected]

Daniel SunderlandSandia National [email protected]

Martin BerzinsScientific Computing and Imaging InstituteUniversity of [email protected]

MS34

Resolving a Vast Range of Spatial and DynamicalScales in Cosmological Simulations

Solving the equations of ideal hydrodynamics accuratelyand efficiently is a key feature of state-of-the-art cosmolog-ical codes. The AREPO code achieves a high accuracy byutilizing a second-order finite volume method on a mov-ing mesh. Recently, we have developed a second operatingmode by adopting a discontinuous Galerkin (DG) schemeon a Cartesian grid, which can be adaptively refined. DGis a higher-order method with a small stencil and thus verysuitable for parallel systems.

Kevin Schaal, Volker SpringelHeidelberg Institute for Theoretical [email protected], [email protected]

MS34

A Simple Robust and Accurate a Posteriori Sub-cell Finite Volume Limiter for the DiscontinuousGalerkin Method on Space-Time Adaptive Meshes

We present a class of high order numerical schemes, whichinclude finite volume and discontinuous Galerkin (DG) fi-nite element methods for the solution of hyperbolic con-servation laws in multiple space dimensions. For the DGscheme, we employ an ADER-WENO a-posteriori sub-cellfinite volume limiter, which is activated only in those re-gions where strong oscillations are produced. The resultingschemes have been implemented within space-time adap-tive Cartesian meshes and MPI parallelization. Severalrelevant numerical tests are shown.

Olindo ZanottiUniversita degli Studi di [email protected]

Michael DumbserUniversity of [email protected]

MS35

MPI+MPI: Using MPI-3 Shared Memory As a

102 PP16 Abstracts

MultiCore Programming System

MPI is widely used for parallel programming and enablesprogrammers to achieve good performance. MPI also hasa one-sided or put/get programming model. In MPI-3,this one-sided model was greatly expanded, including newatomic read-modify-write operations. In addition, MPI-3adds direct access to memory shared between processes.This talk discusses some of the opportunities and chal-lenges in making use of this feature to program multicoreprocessors.

William D. GroppUniversity of Illinois at Urbana-ChampaignDept of Computer [email protected]

MS35

How I Learned to Stop Worrying and Love MPI:Prospering with an MPI-Only Model in PFLO-TRAN + PETSc in the Manycore Era

The recent rise of ”manycore” processors has led to spec-ulation that developers must move from MPI-only mod-els to hybrid programming with MPI+threads to maintainscalability on leadership-class machines. However, carefulconsideration of communication patterns and judicious useof MPI-3 features presents a viable alternative that retainsa simpler programming model. We explore this alternativein the open-source, massively parallel environmental flowreactive transport code PFLOTRAN and the underlyingPETSc numerical framework it is built upon.

Richard T. MillsOak Ridge National Laboratoryand The University of Tennessee, [email protected]

MS35

MPI+X: Opportunities and Limitations for Het-erogeneous Systems

We investigate the use of node-level parallelization ap-proaches such as CUDA, OpenCL, and OpenMP coupledwith MPI for data communication across nodes and presentcomparisons with a conventional flat MPI approach. Ourresults demonstrate what can be considered an abstractionpenalty of MPI+X over flat MPI in the strong scaling limit.

Karl Rupp, Josef Weinbub, Florian Rudolf, AndreasMorhammer, Tibor GrasserInstitute for MicroelectronicsVienna University of [email protected], [email protected],[email protected], [email protected],[email protected]

Ansgar JuengelInstitute for Analysis and Scientific ComputingVienna University of [email protected]

MS35

The Occa Abstract Threading Model: Implemen-tation and Performance for High-Order Finite-Element Computations

An abstract threading model that simplifies the expression

of loop level parallelism will be discussed. This abstrac-tion has been implemented in the OCCA library, includingfront-end support for C, Fortran, C++, Julia and backendsupport for OpenMP, CUDA, and OpenCL. Algorithms ex-pressed in the OCCA Kernel Languages are portable acrossthe multiple threading backends. Support is provided forautomatic data movement between host and compute de-vices. Performance of OCCA based finite-element solverswill be presented.

Tim WarburtonVirginia [email protected]

David MedinaRice [email protected]

MS36

Revisiting Asynchronous Linear Solvers: ProvableConvergence Rate Through Randomization

Asynchronous methods for solving systems of linear equa-tions have been researched since Chazan and Miranker’spioneering 1969 paper on chaotic relaxation. The under-lying idea of asynchronous methods is to avoid processoridle time by allowing the processors to continue to makeprogress even if not all progress made by other processorshas been communicated to them. Historically, the applica-bility of asynchronous methods for solving linear equationswas limited to certain restricted classes of matrices, suchas diagonally dominant matrices. Furthermore, analysis ofthese methods focused on proving convergence in the limit.Comparison of the asynchronous convergence rate with itssynchronous counterpart and its scaling with the numberof processors were seldom studied, and are still not wellunderstood. In this talk, we discuss a randomized shared-memory asynchronous method for general symmetric pos-itive definite matrices. We will present a rigorous analysisof the convergence rate that proves that it is linear, andis close to that of the method’s synchronous counterpart ifthe processor count is not excessive relative to the size andsparsity of the matrix. Our work suggests randomization asa key paradigm to serve as a foundation for asynchronousmethods.

Haim AvronBusiness Analytics & Mathematical SciencesIBM T.J. Watson Research [email protected]

Alex DruinskyComputational Research DivisionLawrence Berkeley National [email protected]

Anshul GuptaIBM T J Watson Research [email protected]

MS36

Task and Data Parallelism Based Direct Solversand Preconditioners in Manycore Architectures

Sparse direct or incomplete factorizations are importantbuilding blocks in scalable solvers, for example, domaindecomposition and multigrid. Due to irregular data accessthese are difficult to parallelize, especially on manycore ar-chitectures. We give an overview of our ongoing parallel

PP16 Abstracts 103

solver projects that exploit either data or task parallelism,or both.


Erik G. Boman, Joshua D. BoothCenter for Computing ResearchSandia National [email protected], [email protected]

Kyungjoo KimSandia National [email protected]

MS36

Parallel Ilu and Iterative Sparse Triangular Solves

We consider hybrid preconditioners that are highly paralleland efficient. Therefore, in a first step, we consider station-ary iterative solvers for triangular linear equations basedon sparse approximate inverse preconditioners. These it-erative methods are used for solving triangular systemsresulting from parallel ILU preconditioners. For solvinggeneral symmetric or nonsymmetric systems we combinesparse approximative inverses with Gauss-Seidel-like split-ting methods and the above triangular solvers. This leadsto efficient and parallel preconditioners.

Thomas K. HuckleInstitut fuer InformatikTechnische Universitaet [email protected]

Juergen BraeckleTU [email protected]

MS36

Aggregation-Based Algebraic Multigrid on Mas-sively Parallel Computers

To solve large linear systems, it is nowadays standard to usealgebraic multigrid methods, although their parallel imple-mentations often suffer on large scale machines. We con-sider an aggregation-based method whose associated soft-ware code (AGMG) has several hundreds of users aroundthe world. To improve parallel performance, some criti-cal components are redesigned, in a relatively simple yetnot straightforward way. Excellent weak scalability resultsare reported on three petascale machines among the mostpowerful today available.

Yvan NotayUniversite Libre de [email protected]

MS37

Scalable Parallel Graph Matching for Big Data

The BSP style of programming is popular in graph com-putations, especially in the Big Data field, where routinelygraphs of billions of vertices are processed. A fundamentalproblem here is to match vertices in a large edge-weightedgraph. We present a bulk-synchronous parallel approxima-tion algorithm based on local edge dominance, with partialsorting of vertex neighbours and exploitation of data prox-

imity through a suitable vertex partitioning. The effects ofpartitioning are investigated experimentally.

Rob H. BisselingUtrecht UniversityMathematical [email protected]

MS37

From Cilk to MPI to BSP: The Value of StructuredParallel Programming

Parallel programming languages have still not reached thesame maturity, clarity and stability as sequential lan-guages. This situation has a strong negative effect on theindustry of parallel software. We explain three importantstyles of parallel programming: Cilk, MPI, and BSP. Weexplain how their basic features balance expressive powerwith simplicity, scalability and predictable performance.We also investigate complex language constructs such astypes, modules, objects, and patterns.

Gaetan HainsHuawei Technologies [email protected]

MS37

Systematic Development of Functional Bulk Syn-chronous Parallel Programs

With the current generalisation of parallel architecturesand increasing requirement of parallel computation arisesthe concern of applying formal methods, which allow spec-ifications of parallel and distributed programs to be pre-cisely stated and the conformance of an implementation tobe verified using mathematical techniques. However, thecomplexity of parallel programs, compared to sequentialones, makes them more error-prone and difficult to ver-ify. This calls for a strongly structured form of parallelism,which should not only be equipped with an abstractionor model that conceals much of the complexity of parallelcomputation, but also provides a systematic way of devel-oping such parallelism from specifications for practicallynontrivial examples. Program calculation is a kind of pro-gram transformation based on the theory of constructivealgorithms. An efficient program is derived step-by-stepthrough a sequence of transformations that preserve themeaning and hence the correctness. With suitable data-structures, program calculation can be used for writing par-allel programs. This talk presents the SyDPaCC system,a set of libraries for the proof assistant Coq. SyDPaCCallows to write naive (i.e. inefficient) functional programsthen to transform them into efficient versions that could beautomatically parallelised within the framework before be-ing extracted from Coq as code in the functional languageOCaml plus calls to the parallel functional programminglibrary Bulk Synchronous Parallel ML.

Frederic LoulergueLaboratoire d’Informatique Fondamentale d’OrleansUniversite d’[email protected]

MS37

Parallel Data Distribution for a Sparse EigenvalueProblem Based on the BSP Model

We discuss data distribution based on the BSP model toload balance a generalised eigenvalue computation arising

104 PP16 Abstracts

from machine learning. Using CG to solve a linear systemwith a matrix that is the product of three sparse matri-ces, the main operations are sparse matrix-vector prod-ucts. The matrices are partitioned based on hypergraphpartitioning with net amalgamation and the vectors arepartitioned using an (extended) local-bound method. Var-ious variants of these partitioning methods are analysedand compared.

Dirk RooseKU LeuvenDept. of Computer [email protected]

Karl MeerbergenK. U. [email protected]

Joris TavernierKU LeuvenDept. of Computer [email protected]

MS38

Efficient Parallel Algorithms for New Unidirec-tional Models of Nonlinear Optics

We are solving the unidirectional Maxwell equation simu-lating very fast movement of solitons in laser physics

∂zE +DE + ∂τE3 = 0,

here E is the field component, D describes the linear dis-persion operator. The problem is solved by applying thesplitting method, the dispersion step is approximated byusing the spectral method, the nonlinear term is approxi-mated by using finite difference method. Parallelization ofthe numerical algorithm is done by applying the domaindecomposition method, the algorithm is implemented byusing MPI.

Raimondas CiegasDepartment of Mathematical ModellingUniversity Vilnius, [email protected]

MS38

Parallelization of Numerical Methods for TimeEvolution Processes, Described by Integro-Differential Equations

We deal with the iterative solution of a large scale viscoelas-tic problem arising in geophysics when studying Earth dy-namics under load. In addition to the standard equilib-rium equation, the problem poses the challenge to handlea constitutive relation between stress and strains that isof integral form. An iterative solution method is appliedduring each time step. The focus is on efficient and paral-lelizable preconditioning, related also to the methods usedto advance in time.

Ali DorostkarUppsala UniversityDepartment of Information [email protected]

Maya NeytchevaDepartment of Information TechnologyUppsala University

[email protected]

MS38

Online Auto-Tuning for Parallel Solution Methodsfor ODEs

In this talk, we consider auto-tuning of parallel solutionmethods for initial value problems of systems of ordinarydifferential equations (ODEs). The performance of ODEsolvers is sensitive to the characteristics of the input data.Therefore, the best implementation variant cannot be de-termined at compile or installation time, and offline auto-tuning is not sufficient. Taking this into account, we devel-oped an online auto-tuning approach, which exploits thetime-stepping nature of ODE solvers.

Thomas RauberUniversitat Bayreuth, [email protected]

Natalia Kalinnik, Matthias KorchUniversity [email protected], [email protected]

MS38

Optimizing the Use of GPU and Intel Xeon PhiClusters to Accelerate Stencil Computations

MPDATA is a stencil-based kernel of EULAG geophysicalmodel for simulating thermo-fluid flows across a wide rangeof scales/physical scenarios. In this work, we investigatemethods of optimal adaptation of MPDATA to modernhetereogeneous clusters with GPU and Intel Xeon Phi ac-celerators. The achieved performance results are presentedand discussed. In particular, the sustained performance of138 Gflop/s is achieved for a single NVIDIA K20x GPU,which scales up to 16.1 Tflop/s for 128 GPUs.

Roman WyrzykowskiCzestochowa University of Technology,[email protected]

Krzysztof Rojek, Lukasz SzustakCzestochowa University of [email protected], [email protected]

MS39

Parallel Implementations of Ensemble Kalman Fil-ters for Huge Geophysical Models

The Data Assimilation Research Testbed (DART) is a com-munity software facility for ensemble Kalman filter dataassimilation. The largest DART applications employ en-sembles of size O(100) with models with up to one bil-lion variables. DART combines ensemble information fromstructured model grids with observations from unstruc-tured networks. A scalable implementation of DART mustprovide parallel algorithms that load balance computationwhile efficiently moving massive amounts of data. Newlydeveloped algorithms will be presented along with scalingresults.

Jeffrey AndersonNational Center for Atmospheric ResearchInstitute for Math Applied to Geophysics

PP16 Abstracts 105

[email protected]

Helen [email protected]

Jonathan HendricksNational Center for Atmospheric [email protected]

Nancy CollinsNational Center for Atmospheric ResearchBoulder, [email protected]

MS39

Development of 4D Ensemble Variational Assimi-lation at Mto-France

The 4D Ensemble Variational assimilation (4DEnVar) hasreceived a considerable attention at Numerical WeatherPrediction centres in the very recent years. Indeed, thisformalism allows, in such very large size estimation prob-lems related to Ensemble Kalman Filter, consistent errorcovariances, while being potentially highly parallel on newcomputer architectures. Such a system is being developedat Mto-France. We will present the different parallelizationopportunities offered by this new system and its overallperformance.

Gerald DesroziersMeteo [email protected]

MS39

Parallel 4D-Var and Particle Smoother Data As-similation for Atmospheric Chemistry and Meteo-rology

Parallel implemetation strategies building on two differ-ent assimilation methods are considered: Firstly, the fourdimensional variational technique applying parallel pre-conditioning of the square root of the forecast error co-variance matrix by the diffusion paradigm. This willbe presented in the context of an extended advection-diffusion-reaction equation for atmospheric chemistry mod-elling with aerosols. Secondly, a particle filter/smootherby an ultra-large (up to O(1000) member) ensemble ofa mesoscale meteorological model will be given, with atwo-level approach on its parallel implementation for shortrange prediction, where the role of particle weighting algo-rithms proofs critical.

Hendrik ElbernRhenish Institute for Environmental Researchat the University of Cologne, Cologne, [email protected]

Jonas Berndt, Philipp FrankeInstitute for Energy and Climate ResearchFZ [email protected], [email protected]

MS39

Parallel in Time Strong-Constraint 4D-Var

This talk discusses a scalable parallel-in-time algorithmbased on augmented Lagrangian approach to perform four

dimensional variational data assimilation. The assimila-tion window is divided into multiple subintervals to facili-tate parallel cost function and gradient computations. Thescalability of the method is tested by performing data as-similation with increasing problem sizes (i.e. the assim-ilation window). The data assimilation is performed forLorenz-96 and the shallow water model. We also explorethe feasibility of using a hybrid-method (a combination ofserial and parallel 4D-Vars) to perform the variational dataassimilation.

Adrian SanduVirginia Polytechnic Instituteand State [email protected]

MS40

Parallel Processing for the Square Kilometer ArrayTelescope


David De RoureUniversity of [email protected]

MS40

High Performance Computing for Astronomical In-strumentation on Large Telescopes

Exploiting the full capabilities of the next generation of ex-tremely large ground based optical telescopes will rely onthe technology of adaptive optics. This technique requiresmultiple wave-front sensing cameras and deformable mir-rors running at frame rates of 1 KHz. Meeting these re-quirements is a significant challenge and requires high per-formance computational facilities capable of operating withvery low latency. The proposed solution requires massivelyparallel systems based on emerging multi-core technology.

Nigel DipperCentre for Advanced InstrumentationDepartment of Physics, Durham [email protected]

MS40

Adaptive Optics Simulation for the European Ex-tremely Large Telescope on Multicore Architec-tures with Multiple Gpus

We present a high performance comprehensive implemen-tation of a multi-object adaptive optics (MOAO) simula-tion on multicore architectures with hardware acceleratorsin the context of computational astronomy. This imple-mentation will be used as an operational testbed for sim-ulating the design of new instruments for the EuropeanExtremely Large Telescope project (E-ELT), one of Eu-rope’s highest priorities in ground-based astronomy. Thesimulation corresponds to a multi-step multi-stage proce-dure, which is fed, near real-time, by system and turbu-lence data coming from the telescope environment. Usingmodern multicore architectures associated with the enor-mous computing power of GPUs, the resulting data-drivencompute-intensive simulation of the entire MOAO applica-tion, composed of the tomographic reconstructor and theobserving sequence, is capable of coping with the afore-mentioned real-time challenge and stands as a referenceimplementation for the computational astronomy commu-

106 PP16 Abstracts

nity.


Damien GratadourObservatoire de [email protected]

Ali [email protected]

Eric GendronObservatoire de [email protected]


MS40

The SKA (Square Kilometre Array) and Its Com-putational Challenges

The SKA currently being designed will enable radio as-tronomers to study the Universe at levels of detail and sen-sitivity well beyond current instruments. The SKA will betruly a software driven instrument, and its computationaldemands will be unprecedented and will pose challengesthat will need to be tackled and overcome. After introduc-ing briefly the aims and overall design of the SKA, this talkwill show how computational power will be needed in sev-eral areas such as data transport, storage and analysis, andnumerical computation in order to achieve the SKA scien-tific aims: it would indeed be stretching the boundaries ofcomputational capabilities.

Stefano SalviniOeRC, the University of [email protected]

MS41

The Missing High-Performance Computing FaultModel

The path to exascale computing poses several researchchallenges. Resilience is one of the most important chal-lenges. This talk will present recent work in developing themissing high-performance computing (HPC) fault model.This effort identifies, categorizes and models the fault, er-ror and failure properties of today’s HPC systems. It devel-ops a fault taxonomy, catalog and models that capture theobserved and inferred conditions in current systems andextrapolates this knowledge to exascale HPC systems.

Christian EngelmannOak Ridge National [email protected]

MS41

Resilience Research is Essential but Failure is Un-likely

Once again the high performance computing communityis faced with stark predictions of unreliable, failing and

unusable leadership computing systems. Previous predic-tions were false. We argue that unacceptable failure rateswill not emerge soon, but will remain at acceptable levelsuntil the computing community is prepared to handle fail-ures via algorithms and applications. In this presentationwe provide evidence and arguments for our hypothesis anddiscuss how resilience research is still essential and timely.

Michael HerouxSandia National [email protected]

MS41

Memory Errors in Modern Systems

Several recent publications have shown that hardwarefaults in the memory subsystem are commonplace. Thesefaults are predicted to become more frequent in futuresystems that contain orders of magnitude more DRAMand SRAM than found in current memory subsystems.These memory subsystems will need to provide resiliencetechniques to tolerate these faults when deployed in high-performance computing systems and data centers contain-ing tens of thousands of nodes. In this talk, I will focus onlearnings about hardware reliability gathered from systemsin the field, and use our data to project to whether hard-ware reliability will become a larger problem at exascale.

Steven Raasch, Vilas SridharanAMD [email protected], [email protected]

MS41

Characterizing Faults on Production Systems

Understanding and characterizing faults on production sys-tems is key to generate realistic fault models in order todrive flexible fault management across hardware and soft-ware components and to optimize the cost-benefit trade-offs among performance, resilience, and power consump-tion. In this talk, we will discuss recent efforts to extractfault data from several production systems at LLNL and tostore them in a central data repository. In addition to easyaccess to fault data, this will also enable correlations withother data (incl. power and performance data) to gain acomplete picture and to deduce root causes for observedfaults.

Martin SchulzLawrence Livermore National [email protected]

MS42

From Titan to Summit: Hybrid Multicore Com-puting at ORNL

The Titan supercomputer at Oak Ridge National Labora-tory was one of the first generally available, leadership com-puting systems to deploy a hybrid CPU-GPU architecture.In this talk I will describe the Titan system, talk about thelessons we have learned in more than three years of success-ful operation, and how we are applying those lessons in ourplanning for the Summit system that will be available toour users in 2018.

Arthur S. BlandOak Ridge National Laboratory

PP16 Abstracts 107

[email protected]

MS42

Hazelhen Leading HPC Technology and its Impacton Science in Germany and Europe

The new HLRS flagship system Hazelhen (a Cray XC40)is Europes fastest system in the HPCG benchmarking list.Designed to provide maximum sustainable performance tothe users of HLRS ignoring peak performance numbers ishas even climbed to number 8 in the TOP500 list and isone of the fastest systems in the world used both by re-search and industry. This talk will present insight into thestrategy that was at the heart of the HLRS decision. Itwill further show how Hazelhen can and has already im-pacted research and industry by providing high sustainedperformance for real world applications.

Thomas BonischUniverity of StuttgartThe High Performance Computing Center Stuttgarttba

Michael ReschHLRS, University of Stuttgart, [email protected]

MS42

Lessons Learned from Development and Operationof the K Computer

The K computer is one of the most powerful supercomput-ers in the world. The design concept of the K computerincludes not only huge computing power but also usabil-ity, reliability, availability and serviceability. In this talk,we introduce some functions of the K computer to realizethe concept in hardware and software aspects and discussabout lessons learned from the development and operationexperiences of the K computer.

Fumiyoshi [email protected]

MS42

Sequoia to Sierra: The LLNL Strategy

Lawrence Livermore National Laboratory (LLNL) has along history of leadership in large-scale computing. Ourcurrent platform, Sequoia, is a 96 rack BlueGene/Q sys-tem that is currently number three on the Top 500 list.Our next platform, Sierra, will be a heterogeneous systemdelivered by a partnership between IBM, NVIDIA and Mel-lanox. While the platforms are diverse, they represent acarefully considered strategy. In this talk, we will compareand contrast these platforms and the applications that runon them, and provide a glimpse into LLNL’s strategy be-yond Sierra.

Bronis R. de SupinskiLivermore ComputingLawrence Livermore National [email protected]

MS43

Partitioning and Task Placement with Zoltan2

Zoltan2 is a library for partitioning, task placement, data

ordering, and graph coloring. Part of the Trilinos frame-work, it is our research and development testbed for ge-ometric and graph-based algorithms on emerging high-performance computing platforms. In this talk, we describeZoltan2 and highlight its partitioning and task placementalgorithms. We demonstrate its benefits in the HOMMEatmospheric climate model and in particle-based methods.

Mehmet DeveciThe Ohio State [email protected]

Karen D. DevineSandia National [email protected]


Vitus Leung, Siva Rajamanickam, Mark A. TaylorSandia National [email protected], [email protected],[email protected]

MS43

Lessons Learned in the Development of FASTMathNumerical Software for Current and Next Genera-tion HPC

The FASTMath team has experimented with different pro-gramming models, communication reducing strategies, andadaptive load balancing schemes to handle both inter-nodeand intra-node concurrency on deep NUMA architectures.I will give an overview of our experience to date andthoughts on the additional developments needed to achieveperformance portability across a range of architectures andsoftware libraries.

Lori A. DiachinLawrence Livermore National [email protected]

MS43

Scalable Nonlinear Solvers and Preconditioners inClimate Applications

High resolution, long-term climate simulations are essentialto understanding regional climate variation within globalchange on the decade scale. Implicit time stepping methodsprovide a means to increase efficiency by taking time stepscommensurate with the physical processes of interest. Wepresent results for a fully implicit time stepping methodutilizing FASTMath Trilinos solvers and an approximateblock factorization preconditioner in the hydrostatic spec-tral element dynamical core of the Community AtmosphereModel (CAM-SE).

David J. GardnerLawrence Livermore National [email protected]

Richard ArchibaldComputational Mathematics GroupOak Ridge National [email protected]

108 PP16 Abstracts

Katherine J. EvansOak Ridge National [email protected]

Paul UllrichUniversity of California, [email protected]

Carol S. WoodwardLawrence Livermore Nat’l [email protected]

Patrick H. WorleyOak Ridge National [email protected]

MS43

Parallel Unstructured Mesh Genera-tion/Adaptation Tools Used in AutomatedSimulation Workflows

This presentation will overview recent advances in the de-velopment of a set of unstructured mesh tools that sup-port all steps in the simulation workflow to be executedin parallel on the same massively parallel computer sys-tem. These tools to be discussed address the parallel meshgeneration/adaptation, parallel mesh structures and dy-namic load balancing capabilities needed. The in-memorycoupling of these tools with multiple unstructured meshanalysis code will be overviewed.

Mark S. ShephardRensselaer Polytechnic InstituteScientific Computation Research [email protected]

Cameron SmithScientific Computation Research CenterRensselaer Polytechnic [email protected]

Dan A. IbanezRensselaer Polytechnic [email protected]

Brian Granzow, Onkar SahniRensselaer Polytechnic [email protected], [email protected]

Kenneth JansenUniversity of Colorado at [email protected]

MS44

Coupled Matrix and Tensor Factorizations: Mod-els, Algorithms, and Computational Issues

In the era of data science, it no longer suffices to analyzea single data set if the goal is to unravel the dynamics ofa complex system. Therefore, data fusion, i.e., extractingknowledge through the fusion of complementary data sets,is a topic of interest in many fields. We formulate datafusion as a coupled matrix and tensor factorization prob-lem, and discuss extensions of the model as well as variousalgorithmic approaches for fitting such data fusion models.

Evrim AcarUniversity of [email protected]

MS44

Parallel Tensor Compression for Large-Scale Scien-tific Data

Today’s computational simulations often generate moredata than can be stored or analyzed. For multi-dimensionaldata, we can use tensor approximations like the Tuckerdecomposition to expose low-rank structure and compressthe data. This talk presents a distributed-memory paral-lel algorithm for computing the Tucker decomposition forgeneral dense tensors. We show performance results anddemonstrate large compression ratios (with small error) fordata generated by applications of combustion science.

Woody N. AustinUniversity of Texas - [email protected]

Grey Ballard, Tamara G. KoldaSandia National [email protected], [email protected]

MS44

Parallelizing and Scaling Tensor Computations

Tensor decomposition is a powerful data analysis tech-nique for extracting information from large-scale multidi-mensional data and is widely used in diverse applications.Scaling tensor decomposition algorithms on large comput-ing systems is key for realizing the applicability of thesepowerful techniques on large datasets. We present ourtool - ENSIGN Tensor Toolbox (ETTB) - that has op-timized tensor decompositions through the application ofnovel techniques for improving the concurrency, data lo-cality, load balance, and asynchrony in the execution oftensor computations that are critical for scaling them onlarge systems.

Muthu M. Baskaran, Benoit Meister, Richard LethinReservoir [email protected], [email protected],[email protected]

MS44

Parallel Algorithms for Tensor Completion andRecommender Systems

Low-rank tensor completion is concerned with filling inmissing entries in multi-dimensional data. It has provenits versatility in numerous applications, including context-aware recommender systems and multivariate functionlearning. To handle large-scale datasets and applicationsthat feature high dimensions, the development of dis-tributed algorithms is central. In this talk, we will discussnovel, highly scalable algorithms based on a combinationof the canonical polyadic (CP) tensor format with blockcoordinate descent methods. Although similar algorithmshave been proposed for the matrix case, the case of higherdimensions gives rise to a number of new challenges andrequires a different paradigm for data distribution. Theconvergence of our algorithms is analyzed and numericalexperiments illustrate their performance on distributed-

PP16 Abstracts 109

memory architectures for tensors from a range of differentapplications.

Lars KarlssonUmea University, Dept. of Computing [email protected]

Daniel KressnerEPFL [email protected]

Andre UschmajewUniversity of Bonnuschmajew.ins.uni-bonn.de

MS45

Multi-ML: Functional Programming of Multi-BSPAlgorithms

The Multi-BSP model is an extension of the well knownBSP bridging model. It brings a tree based view of nestedcomponents to represent hierarchical architectures. In thepast, we designed BSML for functional programming BSPalgorithms. We now propose the Multi-ML language asan extension of BSML for programming Multi-BSP algo-rithms.

Victor Allombert, Julien TessonLaboratoire d’Algorithmique, Complexite et LogiqueUniversite Paris-est [email protected], [email protected]

Frederic GavaLaboratoire d’Algorithmique, Complexite et LogiqueUniversite de Paris [email protected]

MS45

New Directions in BSP Computing

BSP now powers parallel computing, big data analytics,graph computing and machine learning at many leading-edge companies. It is used at web companies for their anal-ysis of big data from massive graphs and networks withbillions of nodes and trillions of edges. BSP is also at theheart of new projects such as Petuum, for machine learn-ing with billions of parameters. At Huawei Paris ResearchCenter we are developing new BSP algorithms and softwareplatforms for Telecomms and IT.

Bill McCollHuawei Technologies [email protected]

MS45

Parallelization Strategies for Distributed MachineLearning

A key issue of both theoretical and engineering interestin effectively building distributed systems for large-scalemachine learning (ML) programs underlying a wide spec-trum of applications, is the bridging models which spec-ify how parallel workers should be coordinated. I discussthe unique challenges and opportunities in designing thesemodels for ML, and present a stale synchronous parallelmodel that can speed up parallel ML orders of magnitudes

over conventional BSP model and enjoy provable guaran-tees.

Eric XingSchool of Computer [email protected]

MS45

Panel: The Future of BSP Computing

Bulk Synchronous Parallel has had a large influence on par-allel computing, both in classical high-performance com-puting and, more recently, in Big Data computing. Chal-lenges to BSP are increasing latency costs, non-uniformmemory hierarchies and network interconnects, the ten-sion between structured/convenient programming versushigh performance, and the cost of fault-tolerance and tail-tolerance. Can BSP be relaxed to unify opposing pointsof view, or will the future hold tens or even hundreds ofdomain-specific BSP-style frameworks?

Albert-Jan N. YzelmanKU LeuvenIntel ExaScience Lab (Intel Labs Europe)[email protected]

MS46

Linear and Non-Linear Domain DecompositionMethods for Coupled Problems

In this talk, we consider fast and parallel solution meth-ods for linear and non-linear systems of equations, whicharise from the discretization of coupled PDEs. We discussmultigrid and domain decomposition based methods forbiomechanical applications, including the coupled electro-mechanical simulation of the human heart, as well as forfluid-structure interaction. We focus in particular on thevolume coupling of different PDEs, for which we proposean approach based on variational transfer between non-matching meshes.


Erich FosterInstitute of Computational ScienceUSI [email protected]

Marco FavinoInstitute of Computational ScienceUniversity of [email protected]

Patrick ZulianInstitute of Computational ScienceUSI [email protected]

MS46

Improving Strong Scalability of Linear Solver forReservoir Simulation

Reservoir simulation plays an important role in definingand optimizing the production strategies for oil and gas

110 PP16 Abstracts

reservoirs. As mesh sizes, complexity and nonlinearity ofthe models grow highly parallel and scalable linear solversare of paramount importance for realistic design and op-timization workflows. The industry standard ConstrainedPressure Residual (CPR) preconditioner consists of variouscomponents that are not well tailored for strong scalabil-ity. In this talk we will discuss the fundamental underlyingproblems and review several solution methods on how toovercome this performance barrier.

Tom B. [email protected]

MS46

Solving Large Systems for the ElastodynamicsEquation to Perform Acoustic Imaging

The field of numerical modeling of sound propagation in theocean, non destructive testing of materials, and of seismicwaves in geological media following earthquakes has beenthe subject of increasing interest of the underwater acous-tics and geophysics communities for many years. Comput-ing an accurate solution, taking into account a complexenvironment, is still the subject of active research focus-ing on simulations in the time domain for forward and alsoadjoint/inverse tomography problems in three dimensions(3D). Nowadays, with the stunning increase in computa-tional power, the full waveform numerical simulation ofcomplex wave propagation problems in the time domain isbecoming possible. The purpose of the talk is to presentsome illustrative examples that emphasize the high poten-tial of a high-order finite-element method for forward andadjoint/inverse problems in geophysics as well as in un-derwater acoustic and non destructive testing applications.We will thus show examples of numerical modeling of seis-mic wave propagation at very high resolution in large ge-ological models, and summarize recent results for inverseproblems in this area. The spectral-element method, be-cause of its ability to handle coupled fluid-solid regions, isalso a natural candidate for performing wave propagationsimulations in underwater acoustics in the time domain.Moreover, it is also known for being accurate to modelsurface and interface waves. In this presentation, we willshow configurations in which such types of waves, calledStoneley-Scholte waves, can be generated.

Dimitri Komatitsch, Paul CristiniLaboratory of Mechanics and AcousticsCNRS, Marseille, [email protected], [email protected]

MS46

PCBDDC: Robust BDDC Preconditioners inPETSc

A novel class of preconditioners based on Balancing Do-main Decomposition by Constraints (BDDC) methods isavailable in the PETSc library. Such methods provide so-phisticated and almost black-box Domain Decompositionpreconditioners of the non-overlapping type which are ableto deal with elliptic PDEs with coefficients’ heterogeneitiesand discretized by means of non-standard finite elementdiscretizations. The talk introduces the BDDC algorithmand its current implementation in PETSc. Large scale nu-merical results for different type of PDEs and finite ele-ments discretizations prove the effectiveness of the method.

Stefano Zampini

[email protected]

MS47

Title Not Available


Mike [email protected]

MS47

From Data to Prediction with Quantified Uncer-tainties: Scalable Parallel Algorithms and Applica-tions to the Flow of Ice Sheets

Given a model containing uncertain parameters, (noisy)observational data, and a prediction quantity of interest,we seek to (1) infer the model parameters from the datavia the model, (2) quantify the uncertainty in the inferredparameters, and (3) propagate the resulting uncertain pa-rameters through the model to issue predictions with quan-tified uncertainties. We present scalable parallel algorithmsfor this data-to-prediction problem, in the context of mod-eling the flow of the Antarctic ice sheet. We show that thework, measured in number of forward (and adjoint) modelsolves, is independent of the state dimension, parameterdimension, and data dimension.

Omar GhattasThe University of Texas at [email protected]

Tobin IsaacThe University of [email protected]

Noemi PetraUniversity of California, [email protected]

Georg StadlerCourant Institute for Mathematical SciencesNew York [email protected]

MS47

Using Adjoint Based Numerical Error Estimatesand Optimized Data Flows to Construct SurrogateModels

Surrogate based uncertainty quantification(UQ) using asuitably chosen ensemble of simulations is the method forchoice for most large scale simulations with UQ. We havedeveloped novel methods and workflows to inform the con-struction of these surrogates using adjoint based methodsfor error estimation in a scalable and parallel way whilemanaging the large data flows.

Abani PatraSUNY [email protected]

MS47

Parallel Aspects of Diffusion-Based Correlation

PP16 Abstracts 111

Operators in Ensemble-Variational Data Assimila-tion

Ensemble-variational data assimilation requires multipleevaluations of large correlation matrix-vector products.This presentation describes methods to define such prod-ucts using implicit diffusion-based operators that can beparallelized in the pseudo-time dimension. A solutionmethod based on the Chebyshev iteration is proposed: itdoes not require global communications and can be em-ployed to ensure the correlation matrix preserves its SPDattribute for an arbitrary convergence threshold. Examplesfrom a global ocean assimilation system are given.

Anthony Weaver, Jean [email protected], [email protected]

MS48

Parallel Pic Simulation Based on Fast Electromag-netic Field Solver

In this talk, a parallel symplectic PIC program for solvingnon-relativistic Vlasov-Maxwell system on unstructuredgrids for realistic applications will be introduced. Theimplicit scheme is used in our code. We solve Maxwellequations on tetrahedral mesh by finite element method.An efficient and scalable parallel algebraic solver is usedfor solving the discrete system. Numerical results will begiven to show that our method and parallel program isrobust and scalable.

Tao CuiLSEC, the Institute of Computational MathematicsChinese Academy of [email protected]

MS48

Parallel Application of Monte Carlo Particle Sim-ulation Code JMCT in Nuclear Reator Analysis

Massive data for full-core Monte Carlo simulation leads tomemory overload for a single core processor. Tree-baseddomain decomposition algorithm for combinatorial geom-etry is developed on JCOGIN. An efficient, robust asyn-chronous transport algorithm for domain decompositionMonte Carlo is introduced. Combination of domain decom-position and domain replication is designed and providedto improve parallel efficiency. Results of two full-core mod-els are presented using the Monte Carlo particle transportcode JMCT, which is developed on JCOGIN.

Gang LiInstitute of Applied Physics and ComputationalMathematics,Beijing, 100094, P.R. Chinali [email protected]

MS48

Applications of Smoothed Particle Hydrodynam-ics (SPH) for Real Engineering Problems: ParticleParallelisation Requirements and a Vision for theFuture

Recent and likely future HPC developments of SPH aremaking massively parallel computations viable; conse-quently there is a real possibility of achieving a step changein the level of complexity and realism in the SPH applica-tions with huge impact on industry and society. A review

of the latest SPH developments and its applications willbe presented. The challenges of the SPH real engineeringsimulations and a vision for the future are discussed.

Benedict RogersSchool of Mechanical, Aerospace and Civil EngineeringUniversity of [email protected]

MS48

An Hybrid OpenMP-MPI Parallelization for Vari-able Resolution Adaptive SPH with Fully Object-Oriented SPHysics

Starting from restructuring the original open-sourceSPHysics, a new SPH programming framework has beendeveloped employing the full functionality of object-oriented Fortran (OOF) to create a new code that is bothpowerful, highly adaptable to the computing architecturesof emerging technology. The novelties of this approach in-clude: (i) Highly modular, program interface (API), (ii) Anovel way to generate the neighbour list for particle inter-actions is proposed, (iii) Hybrid OpenMP-Message PassingInterface (MPI) parallelization.

Ranato VacondioUniversity of [email protected]

MS49

Fault-Tolerant Multigrid Algorithms Based on Er-ror Confinement and Full Approximation

We introduce a novel fault-tolerance scheme to detect andrepair soft faults in multigrid solvers, which essentiallycombines checksums for linear algebra with multigrid-specific algorithm-based fault-tolerance. To improve ef-ficiency we split the algorithm into two parts. For thesmoothing stage, we exploit properties of the full approx-imation scheme to check and repair the results. Our re-sulting method significally increases the fault-tolerance ofthe multigrid algorithm and has only a small impact in thenon-faulty case.

Mirco AltenberndUniversitat [email protected]

MS49

Soft Error Detection for Finite Difference Codes onRegular Grids

We investigate a strategy to detect soft errors in iterativeor time stepping methods where the iterates live on regu-lar grids. The principle idea is to complement iterates byredundant information that allows to determine if memorycopies or messages have arrived correctly at the destina-tion. We test our approach by means of the 2D heat equa-tion. Errors are injected asynchronously into the memoryor caches for testing.


Manuel KohlerETH Zurich Computer Science Department

112 PP16 Abstracts

[email protected]

MS49

Resilience for Multigrid at the Extreme Scale

We consider scalable geometric multigrid schemes to ob-tain fast and fault-robust solvers. The lost unknowns inthe faulty region are re-computed with local multigrid cy-cles. Especially effective are strategies in which the recov-ery in the faulty region is executed in parallel with globaliterations and when the local recovery is additionally ac-celerated. This results in an asynchronous multigrid iter-ation that can fully compensate faults. Excellent parallelperformance is demonstrated for systems up to a trillionunknowns.


Markus HuberTU [email protected]

Bjoern GmeinerUniversity [email protected]

Barbara WohlmuthTechnical University of [email protected]

MS49

A Soft and Hard Faults Resilient Solver for 2D El-liptic PDEs Via Server-Client-Based Implementa-tion

We present a new approach to solve 2D elliptic PDEs thatincorporates resiliency to soft and hard faults. Resiliencyto hard faults is obtained at the implementation level byleveraging a fault-tolerant MPI implementation (ULFM-MPI) within a task-based server-client model. Soft faultsare handled at the algorithm level by recasting the PDE asa sampling problem followed by a resilient regression-basedmanipulation of the data to obtain a new solution.

Francesco Rizzi, Karla MorrisSandia National [email protected], [email protected]

Paul MycekDuke [email protected]

Olivier P. Le [email protected]

Andres ContrerasDuke [email protected]

Khachik Sargsyan, Cosmin SaftaSandia National [email protected], [email protected]

Omar M. Knio

Duke [email protected]


MS50

Progress in Performance Portability of the XGCPlasma Microturbulence Codes: Heterogeneousand Manycore Node Architectures

The XGC plasma microturbulence particle-in-cell codemodels transport in the edge region of a tokamak fusionreactor. XGC scales to over 18,000 nodes/GPUs on CrayXK7 Titan and over 64000 cores on Blue Gene/Q Mira. Wepresent our progress and lessons learned in achieving per-formance portability to heterogeneous GPU and manycoreXeon Phi architectures by using compiler directives, suchas nested OpenMP parallelization for manycores, OpenMPSIMD vectorization, and OpenACC for porting to NvidiaGPUs.

Eduardo F. D’AzevedoOak Ridge National LaboratoryComputer Science and Mathematics [email protected]

Mark AdamsLawrence Berkeley [email protected]

Choong-Seock Chang, Stephane Ethier, Robert Hager,Seung-Hoe Ku, Jianying LangPrinceton Plasma Physics [email protected], [email protected], [email protected],[email protected], [email protected]

Mark ShepardScientific Computing Research CenterRensselaer Polytechnic [email protected]

Patrick H. WorleyOak Ridge National [email protected]

Eisung YoonRensselaer Polytechnic [email protected]

MS50

Research Applications and Analysis of a Fully Ed-dying Virtual Ocean.

We will describe the mathematical and computational tech-niques used to realize a global kilometer scale ocean modelconfiguration that includes representation of sea-ice andtidal excitation, and spans scales from planetary gyres tointernal tides. Computationally we exploit high-degrees ofparallelism in both numerical evaluation and in recordingmodel state to persistent disk storage. Using this capabil-ity we create a first of its kind multi-petabyte trajectoryarchive; sampled at unprecedented spatial and temporalfrequency.

Chris Hill

PP16 Abstracts 113

Earth, Atmospheric and Physical [email protected]

MS50

Challenges in Adjoint-Based Aerodynamic Designfor Unsteady Flows

An overview of the use of adjoint-based methods for abroad range of large-scale aerodynamic and multidisci-plinary design applications is presented. The general mo-tivation for the approach is outlined, along with a sum-mary of the implementation and experiences gained overtwo decades of concerted research effort. Specific issues as-sociated with applying these methods to unsteady simula-tions are highlighted, including next-generation computingchallenges currently faced by the community.

Eric NielsenNASA [email protected]

MS50

Extreme-Scale Genome Assembly

The rate and cost of genome sequencing continues to im-prove at an exponential rate, making their analysis a signif-icant computational challenge. In this talk we present Hip-Mer, a high performance de novo assembler, which is a crit-ical first step in many genome analyses. Recent work hasshown that efficient distributed memory algorithms usingthe UPC programming model allows unprecedented scalingto thousands of cores, reducing the human genome assem-bly time to only a few minutes.

Leonid Oliker, Aydin Buluc, Rob EganLawrence Berkeley National [email protected], [email protected], [email protected]

Evangelos GeorganasEECSUC [email protected]

Steven HofmeyrLawrence Berkeley National [email protected]

Daniel Rokhsar, Katherine YelickLawrence Berkeley National LaboratoryUniversity of California at [email protected], [email protected]

MS51

Unstructured Mesh Adaptation for MPI and Ac-celerators

Conformal simplex meshes can be adapted to improvediscretization error and enable Lagrangian simulations ofdeforming geometry problems. We present new algo-rithms that perform mesh modifications in batches, allow-ing OpenMP and CUDA parallelism inside supercomputernodes and making good use of MPI 3.0 neighborhood col-lectives between supercomputer nodes. The mesh is storedin a simple array-based structure, allowing low-overheadcoupling with physics codes inside device memory.

Dan A. Ibanez

Rensselaer Polytechnic [email protected]

MS51

Scaling Unstructured Mesh Computations

An unstructured grid, stabilized finite element flow solver(PHASTA) has proven to be a very effective vehicle for de-velopment, testing, and scaling demonstrations for a widevariety of parallel software components. These include un-structured grid meshing, adaptivity, load balancing, linearequation solution, and in situ data analytics. In this talk,we discuss how each of these components are integrated tokeep unstructured grid simulations on a productive pathto exascale partial differential equation simulation.

Kenneth JansenUniversity of Colorado at [email protected]

Jed BrownMathematics and Computer Science DivisionArgonne National Laboratory and CU [email protected]

Michel Rasquin, Benjamin MatthewsUniversity of Colorado [email protected],[email protected]

Cameron SmithScientific Computation Research CenterRensselaer Polytechnic [email protected]

Dan A. IbanezRensselaer Polytechnic [email protected]

Mark S. ShephardRensselaer Polytechnic InstituteScientific Computation Research [email protected]

MS51

Comparing Global Link Arrangements for Dragon-fly Networks

High-performance computing systems are shifting awayfrom traditional interconnect topologies to exploit newtechnologies and to reduce interconnect power consump-tion. The Dragonfly topology is one promising candidatefor new systems, with several variations already in produc-tion. It is hierarchical, with local links forming groups andglobal links joining the groups. At each level, the inter-connect is a clique. This presentation shows that the in-tergroup links can be made in meaningfully different ways.

Vitus LeungSandia National [email protected]

David BundeKnox College

114 PP16 Abstracts

[email protected]

MS51

An Extension of Hypres Structured and Semi-Structured Matrix Classes

The hypre software library provides parallel high perfor-mance preconditioners and solvers. One of its attrac-tive features is the provision of a structured and a semi-structured interface. We describe a new structured-gridmatrix class that supports rectangular matrices and con-stant coefficients and a semi-structured-grid matrix classthat builds on the new class. We anticipate that an effi-cient implementation of these new classes will lead to bet-ter performance on current and future architectures thanhypres current matrix classes.

Ulrike M. YangLawrence Livermore National [email protected]


MS52

High Performance Parallel Sparse Tensor Decom-positions

We investigate an efficient parallelization of the itera-tive algorithms for computing CANDECOMP/PARAFAC(CP) and Tucker sparse tensor decompositions on sharedand distributed memory systems. The key operations ofeach iteration of these algorithms are matricized tensortimes Khatri-Rao product (MTTKRP) and n-mode tensor-matrix product. These operations amount to element-wisevector multiplication or vector outer product, followed by areduction, depending on the sparsity of the tensor. We firstinvestigate a fine and a coarse-grain task definition for theseoperations, and propose methods to achieve load balanceand reduce the communication requirements for their par-allel execution on distributed memory systems. Then weinvestigate efficient parallelization issues on shared mem-ory systems. Based on these investigations, we implementthe alternating least squares method for both decomposi-tions in a software library. The talk will contain results forboth decomposition schemes on large computing systemswith real-world sparse tensors. We have a full-featuredimplementation for the CP-decomposition scheme and de-velopment for the Tucker decomposition in progress.

Oguz KayaInria and ENS [email protected]


MS52

An Input-Adaptive and In-Place Approach toDense Tensor-Times-Matrix Multiply

This talk describes a novel framework, called InTensLi(“intensely”), for producing fast single-node implementa-tions of dense tensor-times-matrix multiply (TTM) of arbi-

trary dimension. Whereas conventional implementations ofTTM rely on explicitly converting the input tensor operandinto a matrixin order to be able to use any available andfast general matrix-matrix multiply (GEMM) implemen-tation our frameworks strategy is to carry out the TTMin-place, avoiding this copy. As the resulting implementa-tions expose tuning parameters, this paper also describesa heuristic empirical model for selecting an optimal con-figuration based on the TTMs inputs. When comparedto widely used single-node TTM implementations thatare available in the Tensor Toolbox and Cyclops TensorFramework (CTF), InTensLi’s in-place and input-adaptiveTTM implementations achieve 4 and 13 speedups, showingGEMM-like performance on a variety of input sizes.

Jiajia LiCollege of ComputingGeorgia Institute of [email protected]

Casey Battaglino, Ioakeim Perros, Jimeng Sun, RichardVuducGeorgia Institute of [email protected], [email protected], [email protected], [email protected]. edu

MS52

Efficient Factorization with Compressed SparseTensors

Computing the Canonical Polyadic Decomposition is bot-tlenecked by multiplying a sparse tensor by several densematrices (MTTKRP). MTTKRP algorithms either use acompressed tensor specific to each dimension of the dataor a single uncompressed tensor at the cost of additionaloperations. We introduce the compressed sparse fiber datastructure and several parallel MTTKRP algorithms thatneed only a single compressed tensor. We reduce memoryby 58% while achieving 81% of the parallel performance.

Shaden SmithDept. Computer ScienceUniv. [email protected]

George KarypisUniversity of Minnesota / [email protected]

MS52

Low Rank Bilinear Algorithms for Symmetric Ten-sor Contractions

We demonstrate algebraic reorganizations that reduce bytwo the number of scalar multiplications (bilinear algo-rithm rank) needed for matrix-vector multiplication witha symmetric matrix and the rank-two symmetric update.We then introduce symmetry-preserving algorithms, whichlower the number of scalar multiplications needed for mosttypes of symmetric tensor contractions. By being nestedwith matrix multiplication, these algorithms reduce the to-tal number of operations needed for squaring matrices andcontracting partially-symmetric tensors.

Edgar SolomonikETH Zurich

PP16 Abstracts 115

[email protected]

MS53

BigDFT: Flexible DFT Approach to Large SystemsUsing Adaptive and Localized Basis Functions

Density functional theory is an ideal method for the studyof a large range of systems due to its balance between ac-curacy and computational efficiency, however the applica-bility of standard implementations is limited to systems ofaround 1000 atoms or smaller due to their cubic scalingwith respect to the number of atoms. In order to overcomethese limitations, a number of methods have been devel-oped which use localized orbitals to take advantage of thenearsightedness principle and reformulate the problem in amanner which can be made to scale linearly with the num-ber of atoms, paving the way for calculations on systemsof 10,000 atoms or more. We have recently implementedsuch a method in the BigDFT code (http://bigdft.org),which uses an underlying wavelet basis set that is idealfor linear-scaling calculations due to it exhibiting both or-thogonality and compact support. This use of waveletsalso has the distinct advantage that calculations can beperformed in a choice of boundary conditions - either free,wire, surface or periodic. This distinguishes it from otherlinear-scaling codes, which are generally restricted to peri-odic boundary conditions, and thus allows calculations oncharged systems using open boundary conditions withoutthe need for adding a compensating background charge.We will first outline the method and implementation withinBigDFT, after which we will present results showing boththe accuracy and the flexibility of the approach, includ-ing a demonstration of the improved scaling compared tostandard BigDFT calculations.

Thierry DeutschInstitute of Nanosciences and [email protected]

MS53

Design and Implementation of the PEXSI SolverInterface in SIESTA

We have implemented in the SIESTA code an interface toa new electronic structure solver based on the Pole Expan-sion and Selected Inversion (PEXSI) technique, whose com-plexity scales at most quadratically with size and can beapplied with good accuracy to all kinds of systems, includ-ing metals. We describe the design principles behind theinterface and the heuristics employed to integrate PEXSIefficiently in the SIESTA workflow to enable large-scalesimulations.

Alberto [email protected]

Lin LinUniversity of California, BerkeleyLawrence Berkeley National [email protected]

Georg HuhsBarcelona Supercomputing [email protected]

Chao YangLawrence Berkeley National Lab

[email protected]

MS53

Wavelets as a Basis Set for Electronic StructureCalculations: The BigDFT Project

Dauchechies wavelets are the only smooth orthogonal ba-sis set with compact support. These properties make themideally suited as a basis set for electronic structure calcu-lations, combining advantages from plane wave and Gaus-sian basis sets. I will explain the basic principles of waveletbased multi-resolution analysis and the wavelet based algo-rithms used in our BigDFT electronic structure code bothfor the solution the Schroedinger’s and Poisson’s equation.In addition I will present methods implemented in BigDFTto explore the potential energy surface

Stefan GoedeckerUniv Basel, [email protected]

MS53

DGDFT: A Massively Parallel Electronic StructureCalculation Tool

We describe a massively parallel implementation of the re-cently developed discontinuous Galerkin density functionaltheory (DGDFT) method, for efficient large-scale Kohn-Sham DFT based electronic structure calculations. TheDGDFT method uses adaptive local basis (ALB) func-tions generated on-the-fly during the self-consistent field(SCF) iteration to represent the solution to the Kohn-Shamequations. DGDFT can achieve 80% parallel efficiency on128,000 high performance computing cores when it is usedto study the electronic structure of 2D phosphorene sys-tems with 3,500-14,000 atoms. This high parallel efficiencyresults from a two-level parallelization scheme that we willdescribe in detail.

Chao YangLawrence Berkeley National [email protected]

MS54

Preconditioning of High Order Edge ElementsType Discretizations for the Time-HarmonicMaxwells Equations


Victorita DoleanUniversity of [email protected]

MS54

Block Iterative Methods and Recycling for Im-proved Scalability of Linear Solvers.

On the one hand, block iterative methods may be use-ful when solving systems with multiple right-hand sides,for example when dealing with time-harmonic Maxwellsequations. They indeed offer higher arithmetic intensity,and typically decrease the number of iterations of Krylovsolvers. On the other hand, recycling also provides a wayto decrease the time to solution of successive linear solves,when all right-hand sides are not available at the sametime. I will present some results using both approaches, as

116 PP16 Abstracts

well as their implementation inside the open-source frame-work HPDDM (https://github.com/hpddm/hpddm).

Pierre [email protected]

Pierre-Henri TournierUniversity Paris [email protected]

MS54

Reliable Domain Decomposition Methods withMultiple Search

Domain Decomposition methods are a family of solverstailored to very large linear systems that require parallelcomputers. They proceed by splitting the computationaldomain into subdomains and then approximating the in-verse of the original problem with local inverses comingfrom the subdomains. I will present some classical domaindecomposition methods and show that for realistic simu-lations (with heterogeneous materials for instance) conver-gence usually becomes very slow. Then I will explain howthis can be fixed by injecting more information into thesolver. In particular I will show how using multiple searchdirections within the conjugate gradient algorithm makesthe algorithm more reliable. Efficiency is also taken intoaccount since our solvers are adaptive.


Christophe BovetLMT, ENS Cachan, CNRS, Universite Paris-Saclay,94235 [email protected]

Pierre GosseletLMT, ENS [email protected]

Daniel J. RixenTU [email protected]

Francois-Xavier RouxONERAUniversite Pierre et Marie [email protected]

MS54

Parallel Preconditioners for Problems Arising fromWhole-Microwave System Modelling for BrainImaging

Brain imaging techniques using microwave tomography in-volve the solution of an inverse problem. Thus, solving theelectromagnetic forward problem efficiently in the inversionloop is mandatory. In this work, we use parallel precon-ditioners based on overlapping Schwarz domain decompo-sition methods to design fast and robust solvers for time-harmonic Maxwell´ s equations discretized with a Nedelecedge finite element method. We present numerical resultsfor a whole-system modeling of a microwave measurement

prototype for brain imaging.

Pierre-Henri TournierUniversity Paris [email protected]

Pierre [email protected]

Frederic NatafCNRS and [email protected]

MS55

Scaling Up Autotuning

Autotuning has been successful at the node level. As wemove towards larger machines, a challenge will be to scaleup autotuning systems to handle full applications at thescale of tens to hundreds of thousands of nodes. In thistalk, I will outline some of the challenges of running atscale and describe how we have been working to enhancethe Active Harmony system to handle these challenges.

Jeffrey HollingsworthUniversity of [email protected]

MS55

Auto-Tuning of Hierarchical Computations withppOpen-AT

We are now developing ppOpen-AT, which is a directive-base Auto-tuning (AT) language to specify fundamentalAT functions, i.e., varying values of parameters, loop trans-formations, and code selection. Considering with expectedhardware of Post Moores era, we focus on optimizationfor computations with deep hierarchy of 3D memory stack.ppOpen-AT provides code selection to optimize code withrespect to layers of the memory. Performance evaluationof AT with a code of FDM will be shown by utilizing theXeon Phi.

Takahiro KatagiriThe University of [email protected]

Masaharu MatsumotoInformation Technology CenterThe University of [email protected]

Satoshi OhshimaThe University of TokyoInformation Technology [email protected]

MS55

Auto-Tuning for the SVD of Bidiagonals

It is well known that the SVD of a matrix A, A = USV T ,can be obtained from the eigenpairs of the matrix CV =ATA or CU = AAT . Alternatively, the SVD can beobtained from the eigenpairs of the augmented matrixCUV = [0 A;AT 0]. This formulation is particularly at-tractive if A is a bidiagonal matrix and if only a partialSVD (a subset of singular values and vectors) is desired.

PP16 Abstracts 117

This presentation focuses on CUV in the context of bidi-agonal matrices, and discusses how tridiagonal eigensolverscan be applied and tuned in that context. The presentationwill also discuss opportunities for parallelism, accuracy andimplementation issues.

Osni A. MarquesLawrence Berkeley National LaboratoryBerkeley, [email protected]

MS55

User-Defined Code Transformation for High Per-formance Portability

Our research project is developing a code transformationframework, Xevolver, to allow users to define their owncode transformation rules and also to customize typicalrules for individual applications and systems. Use of user-defined code transformations is effective to achieve a highperformance while avoiding specialization of an HPC ap-plication code for a particular computing system. As aresult, without overcomplicating the code, we can achievehigh performance portability across diverse HPC systems.

Hiroyuki Takizawa, Shoichi Hirasawa

Tohoku University/JST [email protected], [email protected]

Hiroaki KobayashiTohoku University, [email protected]

MS56

Parallelization of Smoothed Particle Hydrodynam-ics for Engineering Applications with the Hetero-geneous Architectures

Smoothed Particle Hydrodynamics (SPH) method is apowerful technique used to simulate complex free-surfaceflows. However, one of the main drawbacks of the methodis its high computational cost and the large number of par-ticles needed in 3D simulations. High Performance Com-puting (HPC) becomes essential to accelerate SPH codesand perform large simulations. In this study, paralleliza-tion using MPI and GPUs (Graphics Processing Units) isapplied for executions on heterogeneous clusters of CPUsand GPUs.

Jose M. Domınguez AlonsoEPHYSLAB Environmental Physics LaboratoryUniversidade de [email protected]

MS56

H5hut: a High Performance Hdf5 Toolkit for Par-ticle Simulations

Particle-based simulations running on large high-performance computing systems can generate an enormousamount of data. Post-processing, effectively managing thedata on disk and interfacing it with analysis and visu-alization tools can be challenging, especially for domainscientists without expertise in high-performance I/O anddata management. We present the H5hut library, animplementation of several data models for particle-basedsimulations that encapsulates the complexity of HDF5 and

is simple to use, yet does not compromise performance

Achim [email protected]

MS56

Towards a Highly Scalable Particles Method BasedToolkit: Optimization for Real Applications

A large number of industrial physical problems can be sim-ulated using particle-based methods. In this talk, we willpresent a toolkit for solving a complex, highly nonlinearand distorted flow using Incompressible Smoothed ParticleHydrodynamics in very large scale. This toolkit is imple-mented cache friendly and specially designed to preservethe data locality. With unstructured communication mech-anism in parallel, this toolkit can also be used with theother type of particle methods, such as PIC.

Xiaohu GuoScientific Computing DepartmentScience and Technology Facilities [email protected]

MS56

Weakly Compressible Smooth Particle Hydrody-namics Performance on the Intel Xeon Phi

DualSPHysics is a Smoothed Particle Hydrodynamics codeto study free-surface flow phenomena where Eulerian meth-ods can be difficult to apply, such as waves or impactof dam-breaks on off-shore structures.The IPCC at theHartree Centre has been porting and optimising Dual-SPHysics for native execution on the Xeon Phi processor.Threading, vectorization and arithmetic efficiency was in-creased resulting in speed ups for both the Xeon CPUs andthe Xeon Phi.

Sergi-Enric SisoThe Hartree CentreScience and Technology Facilities [email protected]

MS57

Resilience for Asynchronous Runtime Systems andAlgorithms

Parallel asynchronous algorithms have shown to be easy tocouple with different kinds of low-overhead forward recov-ery techniques based on tasks. To manage the asynchronyof the parallel workload, a runtime system is typically used.Since its activity may also be sensitive to errors, some run-time specific resilience techniques are required under cer-tain circumstances. This talk will provide a vulnerabilitycomparison between runtime and algorithmic data struc-tures and also suggest low overhead resilience techniquesfor them.

Marc CasasBarcelona Supercomputing [email protected]

MS57

Experiments with FSEFI: The Fine-Grained SoftError Fault Injector

In this talk we will discuss FSEFI, the Fine-Grained Soft

118 PP16 Abstracts

Error Fault Injector and experiments we have conductedwith FSEFI to evaluate resiliency, fault-tolerance, and vul-nerability of applications and application kernels. FSEFI isa processor emulator-based fault injector where full appli-cations, operating system kernels, drivers, and middlewarerun. Users can create experiments and conduct trials ofhundreds of thousands of fault injections, collect the re-sults, and build a vulnerability/resiliency profile to gainmore information about an application. FSEFI has beenin construction since 2011 and has users at two govern-ment agencies, several national laboratories, and a handfulof universities. It is expected to be open sourced by theDepartment of Energy in the near future.

Nathan A. DeBardelebenLos Alamos National [email protected]

Qiang GuanLos Alamos National [email protected]

MS57

What Error to Expect When You Are Expecting aBit Flip.

We present a statistical analysis of the expected absoluteand relative error given a bit flip in an IEEE-754 scalar.We model the expected value agnostic of the standard used,and discuss the implications of using different standards,e.g., double vs quad or single vs double when a bit flip isexpected.

James ElliottNorth Carolina State [email protected]

Mark HoemmenSandia National [email protected]

Frank MuellerNorth Carolina State [email protected]

MS57

Silent Data Corruption and Error Propagation inSparse Matrix Computations

In this talk we highlight silent data corruption in severalsparse matrix computations. Error rates are expected toimpact next generation machines and silent data corrup-tion (SDC) remains a concern as simulations grown sizeand complexity. Here we investigate the propagation ofSDC in sparse matrix operations such as AMG and othersparse solvers. We detail the injection library used andshow how errors propagate through different operations,suggesting different algorithm-based mitigation strategiesto help reduce the impact of SDC.

Luke Olson, Jon CalhounDepartment of Computer ScienceUniversity of Illinois at [email protected], [email protected]

Marc SnirUniversity of Illinois at Urbana-Champaign

[email protected]

MS58

Silicon Photonics for Extreme Scale Computing

Silicon Photonics Integrated Circuits will soon enable thedirect integration of optical connectivity onto the NICs.Photonic IO through high data rates and wavelength mul-tiplexing is expected to support Tb/s bandwidth per linkthereby substantially increasing the available throughputto the node. In the last 5 years, bandwidth per node indominant HPC systems has only doubled while node com-pute power increased by a factor of 5 and global computerpower by a factor of 13. By suddenly allowing drasticallywider bandwidths, and by reducing the IO area and powerdissipation, Integrated Photonics is poised to revolution-ize inter-node I/O and trigger in depth changes in HPCarchitectures. In this talk, we show how the main charac-teristics of these upcoming photonic links can be estimated,how these performance estimates would influence the over-all shape of an Exascale-class interconnect, and finally, howthese predictions may influence global architectures.

Keren Bergman, Sebastien RumleyColumbia [email protected], [email protected]

MS58

Quantum Computing: Opportunities and Chal-lenges

The core of many real-world problems consists of extremelydifficult combinatorial optimization challenges that cannotbe solved accurately by traditional supercomputers in a re-alistic time frame. A key reason is that classical computingsystems are not designed to handle the exponential growthin computational complexity as the number of variablesincreases. In contrast, quantum computers use a physi-cal evolution process that harnesses quantum mechanicaleffects. In this talk, we will present NASAs experienceswith a 1097-qubit D-Wave 2X system that utilizes a pro-cess known as quantum annealing to solve this class of hardoptimization problems.

Rupak BiswasNASA Ames Research [email protected]

MS58

Task-Based Execution Models and Co-Design

Future HPC machines carry requirements that forcechanges to hardware and software models. The tension be-tween these models mandates a co-design approach for jug-gling compromises. In order to effectively address variabil-ity, dynamic behavior, power, and resiliency, task-basedexecution models are a compromise candidate that neitherarchitecture nor software groups particularly desire – butlack viable alternatives. This talk will showcase some ten-sion points, how task models address them, and provideresults based on OCR.

Romain CledatIntel [email protected]

Joshua FrymanIntel

PP16 Abstracts 119

[email protected]

MS58

Emerging Hardware Technologies for Exascale Sys-tems

To achieve exascale computing, fundamental hardware ar-chitectures must change. While many details of the exas-cale architectures are undefined it is clear that this machinewill be a significant departure from current systems. De-velopers will face increasing parallelism, decreasing singlethreaded performance as well as significant changes to thememory hierarchy. This talk will explore the continuumof emerging hardware technologies, evaluate their viabilityand briefly discuss the impact on the application program-mer.

David DonofrioLawrence Berkeley National [email protected]

MS59

Additive and Multiplicative Nonlinear Schwarz

Multiplicative Schwarz preconditioned inexact Newton(MSPIN) is naturally based on partitioning degrees of free-dom in a nonlinear PDE system by field rather than bysubdomain, where a modest factor of concurrency is sac-rificed for robustness. ASPIN (2002), by subdomain, isnatural for high concurrency and reduction of global syn-chronization. We augment ASPINs convergence theory forthe multiplicative case and show that MSPIN can be sig-nificantly more robust than a global Newton method andthan ASPIN.


Lulu LiuKing Abdullah University of Science and [email protected]

MS59

A Framework for Nonlinear Feti-Dp and Bddc Do-main Decomposition Methods

Solving nonlinear problems, e.g., in material science, re-quires fast and highly scalable parallel solvers. FETI-DPand BDDC domain decomposition methods meet these re-quirements for linearized implicit finite element problems.Recently, nonlinear versions have been proposed where thenonlinear problem is decomposed before linearization. Thiscan be viewed as a strategy to further localize compu-tational work and to extend parallel scalability towardsextreme-scale supercomputers. A framework for differentnonlinear FETI-DP and BDDC algorithms will be pre-sented.





MS59

Convergence Estimates for Composed Iterations

I will define nonlinear preconditioning, a composition tech-nique used to accelerate an iterative solution process, inanalogy with the linear case, and demonstrate its effective-ness for representative nonlinear PDE. I will outline thebeginning of a convergence theory for these composed it-erations.

Matthew G. KnepleyRice [email protected] *Preferred

MS59

Scaling Nonlinear FETI-DP Domain Decomposi-tion Methods to 786 452 Cores

Nonlinear FETI-DP (Finite Element Tearing and Intercon-necting) domain decomposition methods are parallel solu-tion methods for nonlinear implicit problems discretizedby finite elements. A recent nonlinear FETI-DP methodis combined with an approach that allows an inexact solu-tion of the FETI-DP coarse problem by a parallel AlgebraicMultigrid Method. Weak scalability for two TOP 10 super-computers, i.e., JUQUEEN at FZ Julich and the completeMira at Argonne National Laboratory (786,432 cores) areshown.





MS60

NUMA-Aware Hessenberg Reduction in the Con-text of the QR Algorithm

Hessenberg reduction appears in the aggressive early defla-tion bottleneck in parallel multishift QR algorithms. Thereduction is memory-bound and performance relies on ef-ficient use of caches and memory buses. We applied theparallel cache assignment technique recently proposed byCastaldo and Whaley and obtained a cache-efficient andNUMA-aware multithreaded algorithm. Numerous tun-able parameters were identified and their impact on per-

120 PP16 Abstracts

formance evaluated. The goal is to incorporate runtimeauto-tuning capabilities.

Mahmoud EljammalyUmea UniversityDept. of Computing [email protected]


MS60

Low Rank Approximation of Sparse Matrices forLu Factorization Based on Row and Column Per-mutations

In this talk we present an algorithm for computing a lowrank approximation of a sparse matrix based on a trun-cated LU factorization with column and row permutations.We present various approaches for determining the columnand row permutations that show a trade-off between speedversus deterministic/probabilistic accuracy. We show thatif the permutations are chosen by using tournament pivot-ing based on QR factorization, then the obtained truncatedLU factorization with column/row tournament pivoting,LU CRTP, satisfies bounds on the singular values whichhave similarities with the ones obtained by a communi-cation avoiding rank revealing QR factorization. We alsocompare the computational complexity of our algorithmwith randomized algorithms and show that for sparse ma-trices and certain accuracy, our approach is faster.


Sebastien [email protected]

James W. DemmelUniversity of CaliforniaDivision of Computer [email protected]

MS60

The Challenge of Strong Scaling for Direct Meth-ods

We present an analysis of the problems of getting nearfull performance for sparse direct methods, particularly forsymmetric systems, on complex modern hardware. Ourdiscussion is informed by experience on both modern CPUand GPU hardware. We focus on getting high performanceon small problems that are still important in many com-munities. The algorithms and techniques described can beused as building blocks to address the solution of largersystems on HPC resources.

Jonathan HoggSTFC Rutherford Appleton Lab, [email protected]

MS60

Assessing Recent Sparse Matrix Storage Formats

for Parallel Sparse Matrix-Vector Multiplication

The performance of parallel sparse matrix-vector multipli-cation (SpMV for short, calculating y ← Ax, where A is asparse matrix, x and y are dense vectors) highly dependson the storage format of the input matrix. Recently, manyformats have been proposed for many-core processors suchas GPUs and Xeon Phi. This talk will give a comprehensivecomparison for SpMV performance and format conversioncost (from the CSR to a new format) of recently developedformats including HYB (SC ’09), Cocktail (ICS ’12), ESB(ICS ’13), BCCOO (PPoPP ’14), BRC (ICS ’14), ACSR(SC ’14), CSR-Adaptive (SC ’14, HiPC ’15) and CSR5(ICS ’15).

Weifeng LiuNiels Bohr InstituteUniversity of [email protected]

MS61

New Building Blocks for Scalable Electronic Struc-ture Theory

We highlight two central building blocks for scalableelectronic structure theory on current high-performancecomputers:(i) New methodological approaches beyondsemilocal density-functional theory based on a local-ized resolution-of-indentity approach for the two-electronCoulomb operator; (ii) a new, general open-source libraryplatform ”ELSI” that will provide a seamless interface toaddress or circumvent the Kohn-Sham eigenvalue problem.Current work focuses on the massively parallel eigenvaluesolver ELPA, the orbital minimization method (OMM),and the PEXSI library.

Volker BlumDuke UniversityDepartment of Mechanical Engineering and [email protected]

MS61

MOLGW: A Parallel Many-Body PerturbationTheory Code for Finite Systems

The many-body perturbation theory was mostly developedfor condensed matter. The successes of the GW approx-imation to the electron self-energy for the band structureand of the Bethe-Salpeter equation for the absorption spec-trum are uncountable for solids. However, very little isknown about the performance of many-body perturbationtheory for atoms and molecules. The code molgw ad-dresses the efficient and accurate calculation within thisframework using modern parallelization, based on MPI andSCALAPACK.

Fabien BrunevalCEA, DEN, [email protected]

Samia M. Hamed, Tonatiuh Rangel-Gordillo, Jeffrey B.NeatonUC Berkeley (USA)[email protected], [email protected], [email protected]

MS61

GPU-Accelerated Simulation of Nano-Devices

PP16 Abstracts 121

from First-Principles

This talk will introduce a newly developed algorithm calledSplitSolve to take care of the Schrodinger equation withopen boundary conditions, as encountered in quantumtransport calculations. The strengths of SplitSolve are thesignificant reduction of the computational time it offersand its ability to fully leverage hybrid nodes with CPUsand GPUs. It thus allows for the transport simulation ofnanostructures composed of thousands of atoms at the ab-initio (from first-principles) level.

Mathieu Luisier, Mauro Calderara, Sascha BrueckIntegrated Systems LaboratoryETH [email protected], [email protected],[email protected]

Hossein Bani-Hashemian, Joost VandeVondeleNanoscale SimulationsETH [email protected],[email protected]

MS61

High-Performance Real-Time TDDFT Excited-States Calculations

From atoms and molecules to nanostructures, we discusshow the FEAST solver has considerably broaden the per-spectives for enabling reliable and high-performance large-scale first-principle DFT and real-time TDDFT calcula-tions. Our TDDFT simulations can operate a broadrange of electronic spectroscopy: UV-ViS, X-ray (usingour all-electron method), and near IR (using our TDDFT-Ehrenfest approach). Various numerical results will be pre-sented and discussed.

Eric PolizziUniversity of Massachusetts, Amherst, [email protected]

MS62

Composition and Compartmentalisation As En-abling Features for Data-Centric, Extreme ScaleApplications: An MPI-X Approach

Next generation extreme scale programming model is of-ten envisioned as MPI+X, where the second layer X is oneof OpenMP, OpenACC, etc. Conversely, we advocate aMPI-X approach, where MPI is only used as communica-tion library and applications are composed assembling instreaming fashion data-parallel kernels described as plainC++11 classes. They are designed to run on compositionalaccelerators, which can be implemented as cluster of de-vices, e.g. multi-cores, GPUs, FPGAs or a heterogeneouscombination of them.

Marco Aldinucci, Maurizio Drocco, Claudia MisaleUniversity of [email protected], [email protected], [email protected]

Massimo TorquatiUniversity of PisaItaly

[email protected]

MS62

Recent Advances in the Task-Based StarPU Run-time Ecosystem

The task paradigm has allowed the StarPU dynamic run-time system to achieve scalable performance for various ap-plications over heterogeneous systems and clusters thereof.We will here discuss the latest advances which allow us toproceed a step further: controlling the memory footprint(notably of applications with compressed data), employinglocal disks, achieving perfectly reproducible experimentsthrough simulation, and showing how far we are from per-fect execution.

Samuel ThibaultUniversity of Bordeaux 1, [email protected]

Marc SergentINRIA, [email protected]

Suraj [email protected]

Luka StanisicINRIA Grenoble [email protected]

MS62

Task Dataflow Scheduling and the Quest for Ex-tremely Fine-Grain Parallelism

Task dataflow programming models provide a useful pro-gramming abstraction but the underpinning dynamicschedulers carry a burden that imposes limits on strongscaling. This work proposes methods to assess the inher-ent overhead of schedulers and applies them to assess theburden of industry-standard schedulers. We demonstratethat the scheduler burden explains the performance scala-bility of a set of memory-intensive and compute-intensivecodes. Based on this we develop a novel scheduler thatdemonstrates increased scalability.

Hans Vandierendonck, Mahwish ArifQueen’s University BelfastElectronics, Electrical Engineering, and Computer [email protected], [email protected]

MS62

A Single Interface for Multiple Task-Based ParallelPrograming Paradigms/Schemes

Task-based parallel programming schemes have proven ad-vantageous regarding ease of programming, efficiency andperformance. By employing a hierarchical definition oftasks and data, locality and bandwidth utilization can bemaximized for a specific underlying hardware. However,explicitly encoding each level of parallelism becomes cum-bersome. We present a unified programming interface thatallows an algorithm written in terms of non-hierarchicaltasks to be scheduled hierarchically across different sharedand distributed memory environments by a task-based run-

122 PP16 Abstracts

time.

Afshin ZafariUppsala UniversityDept. of Information [email protected]

Elisabeth LarssonUppsala University, SwedenDepartment of Information [email protected]

MS63

Hardware/Algorithm Co-Optimization and Co-Tuning in the Post Moore Era

In the Post Moore Era performance and power efficiency isobtained from better resource usage, since no free speed updue to technology scaling is helping along any more. In thistalk we will present how hardware specialization and designspace exploration in conjunction with software synthesisand autotuning provides a technological basis for contin-ued performance and power gains in the Post Moore Era.Users write their programs against well-established mathlibrary interfaces like the BLAS and FFTW interface, andbehind the scenes code synthesis and hardware/algorithmco-design leverages dark silicon, 3D stacking, and emerg-ing memory technologies to translate substantial functionalblocks into highly optimized hardware constructs that canbe software configured. This is work in progress; how-ever, we have demonstrated the power of this approachwith space time adaptive processing, an FFT and linearalgebra heavy kernel.

Franz FranchettiDepartment of Electrical and Computer EngineeringCarnegie Mellon [email protected]

MS63

On Guided Installation of Basic Linear AlgebraRoutines in Nodes with Manycore Components

Most computational systems are composed of multicorenodes together with several GPUs or Xeon Phi, whichmight be of different types. This work analyses the adapta-tion of auto-tuning techniques used for hybrid CPU+GPUsystems to these more complex configurations. The studyis conducted on basic linear algebra kernels. Satisfactoryresults are obtained, with experimental execution timesclose to the lowest experimentally obtained.

Javier CuencaDepartamento de Ingeniera y Tecnologa de ComputadoresUniversity of [email protected]

Luis-Pedro GarcıaServicio de Apoyo a la Investigacion Cient{ı}ficaTechnical University of [email protected]

Domingo GimenezDepartamento de Informatica y SistemasUniversity of Murcia

[email protected]

MS63

Auto-Tuning for Eigenvalue Solver on the PostMoore’s Era

Dense Real-Symmetric eigenvalue problem consists of threemain procedures. Each of three procedures has a uniqueaspect from the viewpoint of the insensitivity of memoryaccess or floating point operations. In the Post-Mooresera, the present tuning strategy, which we reduce datatraffic cost by CA (Communication Avoidance), SR (Syn-chronous Reduction), and AS (Asynchronous algorithm),may change back into or mix up with the era of vectorcomputer. Automatic-tuning helps to find the sub-optimalcombination of these strategies.

Toshiyuki ImamuraRIKEN Advance Institute for Computational [email protected]

MS63

Automatic Tuning for Parallel FFTs on Intel XeonPhi Clusters

In this talk, we propose an implementation of a parallelfast Fourier transform (FFT) with automatic performancetuning on Intel Xeon Phi clusters. Parallel FFTs on IntelXeon Phi clusters require intensive all-to-all communica-tion. An automatic tuning facility for selecting the opti-mal parameters of overlapping between computation andcommunication, the all-to-all communication, the radices,and the block size is implemented. Performance results ofFFTs on an Intel Xeon Phi cluster are reported.

Daisuke TakahashiFaculty of Engineering, Information and SystemsUniversity of [email protected]

MS64

Thermal Comfort Evaluation for Indoor AirflowsUsing a Parallel CFD Approach

The thermal comfort prediction of occupants in roomsand vehicles is of great interest during planning and op-eration phases. With the evermore increasing computa-tional power available, a great variety of problems can besolved which were unthinkable until recently. This talk willpresent and discuss coupling procedures and strategies tointegrate human thermoregulation based on Fialas modelinto an in-house CFD code specifically designed with par-allel systems in mind to investigate indoor airflows.


Ralf-Peter MundaniTUM, Faculty of Civil, Geo and EnvironmentalEngineeringChair for Computation in Engineering

PP16 Abstracts 123

[email protected]

MS64

Fast Contact Detection for Granulates

We study the problem of contact detection between largenumbers of triangles resulting from meshed objects. Thisproblem arises in FSI and DEM, e.g., and classic algorithmsfrom computational geometry come along with many casedistinctions. We propose an iterative, weak solver thatvectorises but may fall back to a classic formulation if theNewton steps do not converge quickly. Based upon a treedomain decomposition, it furthermore is parallelised viaMPI.

Tomasz KoziaraDurham [email protected]

Konstantinos KrestenitisDurham UniversityDurham, [email protected]

Jon Trevelyan, Tobias WeinzierlDurham [email protected], [email protected]

MS64

LB Particle Coupling on Dynamically-AdaptiveCartesian Grids

Dynamically-adaptive, tree-structured grids are a well-known way of solving partial differential equations withless computational effort than regular Cartesian grids. Wefocus on a minimally-invasive approach using p4est, a well-known framework for such grids, which allows reusing hugeparts of the existing code. We integrated p4est as a newmesh in ESPResSo, a large MD-simulation package. Inour contribution, we present implementational details andperformance data for various background flow scenarios.

Michael LahnertUniversity of [email protected]

Miriam MehlUniversitat [email protected]

MS64

High-Performance Computing in Free SurfaceFluid Flow Simulations

Nowadays, more and more questions originate from the so-called emerging sciences such as medicine, sociology, biol-ogy, virology, chemistry, climate or geo-sciences. Here, wepresent an advanced simulation and (visual) explorationconcept for various environmental engineering applicationssuch as porous media flows and high water simulations.Based on complex 3D building information and environ-mental models, we can perform multi-resolution simula-tions on massive parallel systems combined with an inter-active exploration of the computed data during runtime.

Nevena Perovic

Chair for computation in engineeringTU [email protected]


Ralf-Peter MundaniTUM, Faculty of Civil, Geo and EnvironmentalEngineeringChair for Computation in [email protected]

Ernst RankComputation in EngineeringTechnische Universitat [email protected]

MS65

A Flexible Area-Time Model for Analyzing andManaging Application Resilience

Traditional error recovery systems such as checkpoint-restart fix key aspects of the error model, detection timing,and recovery approach. As a result, these methods havelimited ability to adapt to different error types, statistics,or algorithm behavior. We have designed a new model,area-time, a general, flexible framework for reasoning abouterrors, detection, and response. We apply this model toseveral specific application examples to demonstrate itsutility.

Andrew A. ChienThe University of ChicagoArgonne National [email protected]

Aiman FangDepartment of Computer ScienceUniversity of [email protected]

MS65

Local Recovery at Extreme Scales with FenixLR:Implementation and Evaluation

This talk explores how local recovery can be used forcertain classes of applications to reduce overheads dueto resilience. Specifically, it is centered on the develop-ment of programming support and scalable runtime mecha-nisms to enable online and transparent local recovery underthe FenixLR framework. We integrate these mechanismswith the S3D combustion simulation, and experimentallydemonstrate (using Titan) the ability to tolerate high fail-ure rates (every 5 seconds) at scales up to 262144 cores.

Marc GamellNSF Cloud and Autonomic Computing Center, RutgersDiscoveryInformatics Institute, Rutgers [email protected]

Keita Teranishi, Michael HerouxSandia National [email protected], [email protected]

124 PP16 Abstracts

Manish ParasharDepartment of Computer ScienceRutgers [email protected]

MS65

Spare Node Substitution of Failed Node

Spare node substitution for a failed node is not a new idea,but its behavior is not well-studied so far. Spare node sub-stitution may preserve the communication pattern and thenumber of nodes of a user program. Thus, it can drasti-cally reduce the effort of fault handling at the user pro-gram. However, actual communication topology at physi-cal network cannot be the same and this can result in thesevere degradation of communication performance. In thistalk, recent research result focusing on this topic will bepresented.

Atsushi Hori, Kazumi Yoshinaga, Yutaka IshikawaRIKEN Advanced Institute for Computational [email protected], [email protected], [email protected]

MS65

Will Burst Buffers Save Checkpoint/Restart?

Exascale systems are projected to fail frequently and par-allel file system performance is not expected to match theincrease in computational speed. Burst buffers can po-tentially solve this problem by smoothing out bursty I/Orequests, and reducing load on parallel file systems. In thistalk, we will explore the tradeoffs of using burst buffers aspart of storage systems, and how fault tolerance librariescan exploit burst buffers to provide increased performanceto user applications.

Kathryn MohrorLawrence Livermore National [email protected]

MS66

In Situ Processing Overview and Relevance to theHpc Community


E. Wes BethelLawrence Berkeley National [email protected]

MS66

Overview of Contemporary In Situ InfrastructureTools and Architectures


Julien [email protected]

Patrick O’LearyKitware, Inc., [email protected]

MS66

In Situ Algorithms and Their Application to Sci-

ence Challenges


Venkatram VishwanathArgonne Leadership Computing Facility, ArgonneNational [email protected]

MS66

Complex In Situ and In Transit Workflows


Matthew WolfGeorgia Institute of [email protected]

MS67

Addressing the Future Needs of Uintah on Post-Petascale Systems

The need to solve ever-more complex science and engineer-ing problems on computer architectures that are rapidlyevolving poses considerable challenges for modern simula-tion frameworks and packages. While the decomposition ofsoftware into a programming model that generates a task-graph for execution by a runtime system make portabilityand performance possible, it also imposes considerable de-mands on that runtime system. The challenge of meetingthose demands for near-term and longer term systems inthe context of the Uintah framework is considered in keyareas such as support for data structures, tasks on het-erogeneous architectures, performance portability, powermanagement and designing for resilience by the use of repli-cation based upon adaptive mesh refinement processes isaddressed.

Martin BerzinsScientific Computing and Imaging InstituteUniversity of [email protected]

MS67

Progress on High Performance ProgrammingFrameworks for Numerical Simulations

This talk reports the progress of high performance numeri-cal simulation frameworks in China. These frameworks aredesigned to separate parallel algorithm design from parallelprogramming, which eases challenging application develop-ment even with the rapid growth of supercomputers. Thedomain-specific software architectures and components areintroduced on top of general parallel programming systems,namely MPI and OpenMP. A variety of multi-petaflops ap-plications are developed on top of these frameworks, illus-trating the success of our approaches.

Zeyao MoLaboratory of Computational Physics, IAPCMP.O. Box 8009, Beijing 100088, P.R. Chinazeyao [email protected]

MS67

Can We Really Taskify the World? Challenges forTask-Based Runtime Systems Designers

To Fully tap into the potential of heterogeneous manycore

PP16 Abstracts 125

machines, the use of runtime systems capable of dynam-ically scheduling tasks over the pool of underlying com-puting resources has become increasingly popular. Suchruntime systems expect applications to generate a graphof tasks of sufficient ”width” so as to keep every process-ing unit busy. However, not every application can exhibitenough task-based parallelism to occupy the tremendousnumber of processing units of upcoming supercomputers.Nor can they generate tasks of appropriate granularity forCPUs and accelerators. In this talk, I will briefly presentthe main features provided by the StarPU task-based run-time system, and I will go through the major challengesthat runtime systems’ designers have to face in the upcom-ing years.

Raymond NamystUniversity of Bordeaux 1, [email protected]

MS67

Exascale Software

Exascale software requries innovation on many levels, suchas scalability to beyond a million cores, energy awareness,and algorithm based resilience. Future supercomputers willonly deliver their full potential when applications, mathe-matical models, and numerical algorithms are adapted toever more complex architectures in a co-design process thatis based on the systematic engineering for performance inall stages of the the computational science pipeline.

Hans-Joachim BungartzTechnische Universitat Munchen, Department ofInformaticsChair of Scientific Computing in Computer [email protected]


Philipp NeumannTechnische Universitat Munchen, GermanyScientific Computing in Computer [email protected]

MS68

The GraphBLAS Effort: Kernels, API, and Paral-lel Implementations

I will first give an overview of the GraphBLAS effort forstandardizing graph algorithmic building blocks in the lan-guage of linear algebra, including its scope, the set of ker-nels involved and the proposed API. I will then report onrecent results in developing high-performance implemen-tations of some of the most important GraphBLAS ker-nels such as sparse matrix-matrix multiplication and sparsematrix-vector multiplication in clusters of multicore pro-cessors.

Aydin BulucLawrence Berkeley National [email protected]

MS68

Parallel Tools to Reconstruct and Analyze Large

Climate Networks

Over the last decade, the techniques of complex networkanalysis have found application in climate research. Herethe observation locations serve as nodes and edges (links)are based on statistical measures of similarity, e.g. a cor-relation coefficient, between pairwise time series of climatevariables at these different locations. Many studies inthe field focused on investigating correlation patterns inthe atmospheric surface temperature and teleconnectionsas well as the synchronization between climate variabil-ity patterns. In most studies above so called interactionnetworks were used. Up to now, the data sets (both obser-vational and model based) used to reconstruct and analyzeclimate networks have been relatively small due to compu-tational limitations. As a consequence, coarse-resolutionobservational and model data have been used with a fo-cus only on large-scale properties of the climate system.This system is, however, known for its multi-scale interac-tions and hence one would like to explore the interactionof processes over the different scales. Data are availablethrough high-resolution ocean/atmosphere/climate model

simulations but they lead to networks with at least 105nodes and hence imposing a compelling need for new al-gorithms and parallel computational tools that can enablehigh-performance graph reconstruction and analysis. Inthis talk we introduce the application of complex graphtools in climate research as well as our complete toolboxParGraph designed for parallel computing platforms, whichis capable of the preprocessing of large number of climatetime series and the calculation of pairwise statistical mea-sures, leading to the construction of massive node climatenetworks. In addition, ParGraph is provided with a set ofhigh-performance network analyzing algorithms designedfor symmetric multiprocessing machines (SMPs).

Hisham IhshaishUWE BristolFaculty of Environment & [email protected]

Alexis TantetIMAUUniversity of [email protected]

Johan DijkzeulVORtech BVDelft, The [email protected]

Henk A DijkstraDepartment of Physics and Astronomy, UtrechtUniversityThe [email protected]

MS68

Generating and Traversing Large Graphs inExternal-Memory.

Large graphs arise naturally in many real world applica-tions. The actual performance of simple RAM model algo-rithms for traversing these graphs (stored in external mem-ory) deviates significantly from their linear or near-linearpredicted performance because of the large number of I/Osthey incur. In order to alleviate the I/O bottleneck, manyexternal memory graph traversal algorithms have been de-signed with provable worst-case guarantees. In the talk I

126 PP16 Abstracts

highlight some techniques used in the design and engineer-ing of such algorithms and survey the state-of-the-art inI/O-efficient graph traversal algorithms. I will also reporton recent work concerning the generation of massive scalefree networks under resource constraints.

Ulrich MeyerGoethe University Frankfurt am [email protected]

MS68

(Generating And) Analyzing Large Networks withNetworKit

Complex networks are equally attractive and challengingtargets for data mining. Novel algorithmic solutions, in-cluding parallelization, are required to handle data setscontaining billions of connections. NetworKit is a grow-ing open-source software package for the analysis of suchlarge complex networks on shared-memory parallel com-puters. It is designed to deliver the best of two worlds: theperformance of C++ algorithms and data structures (withparallelism by OpenMP) and a convenient Python frontendfor interactive workflows. The current feature set includesnot only various fundamental analytics kernels and a col-lection of basic graph generators. Several novel algorithmsfor scalable network analysis have been realized with Net-worKit as well recently. Of those, we will present networkprofiling, fast algorithms for computing centrality values(importance of nodes/edges) in different scenarios, and thefirst generator of hyperbolic random graphs that takes sub-quadratic time.

Henning Meyerhenke

Karlsruhe Institute of Technology (KIT)Institute of Theoretical [email protected]

Elisabetta BergaminiInstitute of Theoretical InformaticsKarlsruhe Institute of Technology (KIT)[email protected]

Moritz von Looz, Roman Prutkin, Christian L. StaudtKarlsruhe Institute of Technology (KIT)Institute of Theoretical [email protected], [email protected],[email protected]

MS69

Advancing Algorithms to Increase Performance ofElectronic Structure Simulations on Many-CoreArchitectures

Effective utilization of many-core architectures for large-scale electronic structure calculations requires fast, com-putationally scalable and efficient computational tools uti-lizing novel algorithmic approaches on the latest hardwaretechnologies. In this presentation we will discuss results,lessons learned as well as pitfalls resulting from our al-gorithmic efforts and code optimizations to transform theNWChem computational chemistry software to utilize sys-tems with Intel based multicore architectures connected byfast communication networks.

Wibe A. De JongLawrence Berkeley National Laboratory

[email protected]

MS69

Real-Valued Technique for Solving Linear SystemsArising in Contour Integral Eigensolver

Contour integral eigenvalue solvers such as the Sakurai-Sugiura method and the FEAST algorithm have been de-veloped in the last decade. The main advantage of the con-tour integral solvers is their inherent parallelism. Howeverthey require linear system solving with complex matriceseven if the matrices of the problem are real. In this study,we propose a real-valued technique for solving linear sys-tems arising in the contour integral methods, which onlyinvolves factorizations of real matrices.

Yasunori FutamuraUniversity of [email protected]

Tetsuya SakuraiDepartment of Computer ScienceUniversity of [email protected]

MS69

Parallel Eigensolvers for Density Functional The-ory

Density functional theory (DFT) aims to solve theSchrodinger equation by modelling electronic correlation asa function of density. Its relatively modest O(N3) scalingmakes it the standard method in electronic structure com-putation for condensed phases containing up to thousandsof atoms. Computationally, its bottleneck is the partial di-agonalisation of a Hamiltonian operator, which is usuallynot formed explicitely. Using the example of the Abinitcode, I will discuss the challenges involved in scaling plane-wave DFT computations to petascale supercomputers, andshow how the implementation of a new method based onChebyshev filtering results in good parallel behaviour upto tens of thousands of processors. I will also discuss someopen problems in the numerical analysis of eigensolvers andextrapolation methods used to accelerate the convergenceof fixed point iterations.

Antoine LevittEcole des Ponts [email protected]

MS69

A Parallel Algorithm for Approximating the Sin-gle Particle Density Matrix for O(N) ElectronicsStructure Calculations

We present two strategies for approximating the entries ofthe single particle density matrix that are needed for O(N)solution of the DFT equations. These strategies are basedon the use of diagonalization or a Chebyshev polynomialapproximation to compute the matrix exponential functionusing a principal submatrix of the global Hamiltonian ma-trix. We discuss pros and cons of each strategy within thecontext of developing a parallel and scalable algorithm forelectronics structure calculations.

Daniel Osei-KuffuorLawrence Livermore National [email protected]

PP16 Abstracts 127

Jean-Luc FattebertLawrence Livermore National [email protected]

MS70

Resilience in Distributed Task-Based Runtimes (2)



MS70

Developing Task-Based Applications with Py-compss

COMP Superscalar (COMPSs) is a task-based program-ming model which aims to ease the development of parallelapplications for distributed infrastructures, such as Clus-ters, Grids and Clouds, by exploiting the inherent paral-lelism of applications at execution time. In particular, itis able to handle Python applications through its Pythonbinding (PyCOMPSs), which provides an API and dec-orators that can be used in order to adapt and managesequential python applications, hiding the complexities ofparallelization.

Javier ConejeroBarcelona Supercomputing [email protected]

MS70

DuctTeip: A Framework for Distributed Hierarchi-cal Task-Based Parallel Computing

We present a dependency-aware task parallel programmingframework for hybrid distributed-shared memory paral-lelization. Both data and tasks in the framework are hi-erarchically organized in order to exploit the underlyingheterogeneous hardware architecture. A data centric viewbased on a versioning system is employed. By executing anapplication code in simulation mode, the choice of data dis-tribution and blocking scheme can be tuned a priori. Theperformance of the framework is evaluated against othersimilar efforts.

Elisabeth LarssonUppsala University, SwedenDepartment of Information [email protected]

Afshin ZafariUppsala UniversityDept. of Information [email protected]

MS70

Heterogeneous Computing with GPUs and FPGAsThrough the OpenMP Task-Parallel Model

The task-based programming model has become one theprominent models for parallelizing software on homoge-neous shared-memory architectures. However, becauseDennard’s scaling no longer holds, we have to leave ho-mogeneity behind and move towards a more heterogeneous(possibly distributed) solution. In this talk, we will show

how to extend the OpenMP task-model to support dis-tributed computing through GPUs and how the model fur-ther can be used to drive hardware generation for FPGAs.

Artur PodobasKTH Royal Institute of [email protected]

MS71

PFLOTRAN: Practical Application of High Per-formance Computing to Subsurface Simulation

As subsurface simulation becomes more complex with theaddition of physical and chemical processes, careful soft-ware engineering must be employed to preserve parallelperformance. This presentation will discuss a recent refac-tor of PFLOTRANs high-level process model coupling thathas facilitated the integration of new processes (e.g. geo-physics, spent nuclear fuel degradation/dissolution) whilemaintaining scalability. Examples of PFLOTRAN andhigh performance computing being applied to simulatepractical, real-world subsurface problem scenarios will beincluded.

Glenn HammondSandia National [email protected]

MS71

Parallel Performance of a Reservoir Simulator inthe Context of Unstable Waterflooding

We consider the parallel performance of a reservoir sim-ulator in the context of water injection in viscous oils.Such problems usually require very fine grids and high-order schemes in order to capture the complex fingeringpatterns induced by viscous instabilities. In this contextparallel performance is very important and the scalabilityof the linear solver becomes critical. We compare differentstrategies for the Algebraic Multigrid method among whicha new hybrid aggregation/point-wise coarsening strategy.

Romain De [email protected]

Pascal HenonTOTAL E&[email protected]

Pavel [email protected]

MS71

Ongoing Development and HPSC Applicationswith the ParFlow Hydrologic Model

ParFlow is a three-dimensional variably saturatedgroundwater-surface water flow code that is currently ap-plied at space and time scales up to continents (Europe,USA) and decades. This is made possible by the efficientparallelism and, e.g. an advanced octree space-partitioningalgorithm implemented in ParFlow. Ongoing advance-ments that will be illustrated with subsurface applicationsinclude software infrastructure for coupled energy trans-port; integration of the p4est meshing library; and inte-gration of ParFlow in TerrSysMP for terrestrial systems

128 PP16 Abstracts

modeling.

Stefan J. KolletMeteorological InstituteUniversity of [email protected]

Klaus GoergenJulich Supercomputing CentreResearch Centre [email protected]

Carsten BursteddeUniversitat [email protected]

Jose FonsecaInstitute for Numerical SimulationBonn Universityfonseca.ins.uni-bonn.de

Fabian GasperAgrosphere Research Center [email protected]

Reed M. MaxwellDepartment of Geology and Geological Engineering andIGWMCColorado School of [email protected]

Prabhakar Shresta, Mauro SulisMeterological InstituteBonn [email protected], [email protected]

MS71

Hybrid Dimensional Multiphase CompositionalDarcy Flows in Fractured Porous Media and Par-allel Implementation in Code ComPASS

We consider the Hybrid dimensional multiphase composi-tional Darcy flows in fractured porous media. The modelaccounts for the coupling of the mass balance of each com-ponent with the pore volume conservation and the ther-modynamical equilibrium and dynamically manages phaseappearance and disappearance. We implement this modelin the framework of code ComPASS (Computing ParallelArchitecture to Speed up Simulation), which focuses onparallel high performance simulation (MPI).

Feng XingUniversity of [email protected]

Roland MassonUniversity of Nice Sophia [email protected]

MS72

Title Not Available


Donna CalhounBoise State University

[email protected]

MS72

Title Not Available


Thierry CoupezMINES [email protected]

MS72

Bb-Amr and Numerical Entropy Production


Frederic GolayUniversite de Toulon, [email protected]

MS72

Title Not Available


Pierre Kestenermaisson de la [email protected]

MS73

How Will Application Developers Navigate theCost/Peformance Trade-Offs of New Multi-LevelMemory Architectures?

Data movement and memory bandwidth are becoming thelimiting performance factors in many applications. Hybridmemory cubes (HMC), high-bandwidth memory (HBM),and cheap capacity NVARM memories are now becomingavailable. Performance must balance cost since high band-width memory will be far too expensive at high capac-ity and NVRAM will not provide needed bandwidth, re-quiring both technologies to be leveraged. Managing com-plex memory hierarchies will prove a serious programmingburden for domain science. Here we present a simulationframework and profiling tools for analyzing memory accessprofiles to enable runtime tools and algorithm refactoringto fully exploit new memory architectures.

Arun RodriguesSandia National [email protected]

MS73

Performance-Energy Trade-Offs in ReconfigurableOptical Data-Center Networks Using Silicon Pho-tonics

Optical circuits can provide large bandwidth improvementsrelative to electrical-switched networks in HPC systems.Optical circuit setup, however, can have high latencies.Application-guided, latency-avoiding circuit managementtechniques for dynamic reconfiguration have shown poten-tial for better achieving the full throughput of photonics-based networks. We extend these ideas to dynamicallyreconfiguring topologies like dragonfly to match applica-tion communication patterns. We further explore where

PP16 Abstracts 129

runtime analysis tools can automatically manage circuitswithout changing application code.

Sebastien Rumley, Keren BergmanColumbia [email protected], [email protected]

MS73

Helping Applications Reason About Next-Generation Architectures Through AbstractMachine Models

To achieve performance under increasing energy con-straints, next-generation architectures will undergo funda-mental changes to include more heterogeneous processors,network-on chip topologies, or even multi-level memories.These architectures will inevitably change how programsare written. Detailed specifications do not enable commu-nity engagement or design space exploration with applica-tion developers. Here we present abstract machine mod-els (AMMs) and corresponding proxy architectures thatprovide useful starting points for reasoning about appli-cations on extreme-scale hardware. Where appropriate,architecture details are either ignored, abstracted, or fullyexposed in the AMMs. We survey the programming impli-cations of different AMMs and illusrate their usefulness inco-designing applications.

John Shalf, David DonofrioLawrence Berkeley National [email protected], [email protected]

MS73

When Is Energy Not Equal to Time? Understand-ing Application Energy Scaling with EAudit

Under future power caps, improving energy efficiency willbecome a critical component of sustaining improvements intime to solution. However, due to fundamental differencesbetween how energy and time scale with parallelism, timeis not a good proxy for the distribution of energy in anapplication. This talk covers energy analogs to Amdahlslaw to understand weak/strong scaling of application en-ergy and its use in eAudit a tool for energy debugging ofapplications.

Sudhakar YalamanchiliThe School of Electrical and Computer EngineeringGeorgia Institute of [email protected]

Eric AngerGeorgia Institute of [email protected]

MS74

A Distributed-Memory Advection-Diffusion Solverwith Dynamic Adaptive Trees

We present a numerical method to solve the Advection-Diffusion problem. In our approach, we solve the ad-vection equation by using the semi-Lagrangian method.The solution of the diffusion equation is computed byapplying an integral transform: a convolution with thefundamental solution of the modified Laplace PDE. Fortime-marching, the Stiffly-Stable method is used. Thesemethods are implemented on top of a dynamic spatially-adaptive distributed-memory parallelized Chebyshev oc-

tree as our data structure. To address the load-imbalanceproblem for distributed-memory Lagrangian schemes, weintroduce a novel and robust partitioning scheme.

Arash BakhtiariTechnische Universitaet [email protected]



MS74

A Volume Integral Equation Solver for EllipticPDEs in Complex Geometries

We present a new framework for solving volume inte-gral equation (VIE) formulations of Poisson, Stokes andHelmholtz equations in complex geometries. Our methodis based on a high-order discretization of the data on adap-tive octrees and uses a kernel independent fast multipolemethod along with precomputed quadratures for evaluat-ing volume integrals efficiently. We will present new per-formance optimizations and demonstrate scalability of ourmethod on distributed memory platforms with thousandsof compute nodes.



MS74

Efficient Parallel Simulation of Flows, Particles,and Structures

The efficient simulation of flow scenarios involving thetransport of particles and the interaction with other phys-ical phenomena such as electrostatics or elastic struc-tures usually requires high grid resolutions due to accu-racy requirements and different spatial/temporal scales.We present several application cases for the efficient useof octree-like adaptive computational grids in large exist-ing simulation environments. Such grids are an ideal toolin this context due to their minimal memory requirementsand flexible adaptivity.

Miriam MehlUniversitat [email protected]

Michael LahnertUniversity of [email protected]

Benjamin UekermannTU Munich

130 PP16 Abstracts

[email protected]

MS74

Communication-Reducing and Communication-Avoiding Techniques for the Particle-in-Dual-TreeStorage Scheme

We study particle-in-cell with suprathermal particlesdriven by a PDE. The particles do not interact but somejump over cells. It is a HPC worst-case for its low arith-metic intensity and re-sorting which makes it latency- andbandwidth-sensitive. Three ideas help: A tree grammarpredicts movements and allows us to skip communication.Run-length encoding for integer data along domain bound-aries reduces the memory footprint. A hierarchical floatingpoint representation reduces the byte count further.

Tobias WeinzierlDurham [email protected]

MS75

Dune - A Parallel Pde Framework

The Distributed and Unified Numerics Environment(DUNE, www.dune-project.de) is a software framework forthe numerical solution of partial differential equations withgrid-based methods started by several groups in Germanyin 2002. Its main goals are combining (1) flexibility on theone hand with (2) efficiency on the other hand while allow-ing also the (3) reuse of existing finite element software. Acentral aspect of DUNE is its abstract mesh interface al-lowing the formulation of generic algorithms operating ona large variety of different types of meshes (any-d, simpli-cial, cuboid, hierarchical, conforming and non-conforming,different refinement rules, ...). Within the framework ofthe german priority programme 1648 ‘Software for exa-scale computing’ DUNE is currently prepared for futureexa-scale hardware. The talk will highlight the progress inexploiting parallelism on all different levels (MPI, thread-ing, SIMD) and present results on high-order discontinuousGalerkin methods.

Christian EngwerINAM, University of [email protected]

Steffen MuthingHeidelberg [email protected]

Peter BastianInterdisciplinary Center for Scientific ComputingUniversity of [email protected]

MS75

PHG: A Framework for Parallel Adaptive FiniteElement Method

PHG (Parallel Hierarchical Grid) is a general framework fordeveloping parallel adaptive finite element method appli-cations. The key feature of PHG includes: bisection basedconforming adaptive tetrahedral meshes, various finite ele-ment bases support and hp adaptivity, finite element codeautomatic generation, etc. We also introduce a few appli-cations based on PHG, including earthquake simulation,

calculation of electronic structure of materials and simula-tion of fluid-structure interaction.

Deng LinLaboratory of Scientific and Engineering Computing,Chinese Academy of Sciences, Beijing, [email protected]

Wei LengLaboratory of Scientific and Engineering ComputingChinese Academy of Sciences, Beijing, [email protected]

Tao CuiLSEC, the Institute of Computational MathematicsChinese Academy of [email protected]

MS75

Cello: An Extreme Scale Amr Framework, andEnzo-P, An Application for Astrophysics and Cos-mology Built on Cello

We present Cello, an extremely scalable AMR software in-frastructure for post-petascale architectures, and Enzo-P,a multiphysics application for astrophysics and cosmologysimulations built on Cello. Cello implements a forest-of-octrees spatial mesh on top of the CHARM++ parallel ob-jects framework developed at UIUC. Leaf nodes of the oc-trees are blocks of fixed size, and are represented as charesin a hierarchical chare array. Cello supports particle andfield classes and operators for the construction of the ex-plicit finite-difference/volume methods, N-body methods,and sparse linear system solvers called from Enzo-P. Wepresent initial performance and scaling results on a largecosmological test problem.

Michael L. NormanUniversity of California at San [email protected]

MS75

Peta-scale Simulations with the HPC SoftwareFramework Walberla

We present the HPC software framework waLBerla thathas demonstrated excellent parallel performance on differ-ent peta-scale supercomputers for simulations based on thelattice Boltzmann as well as the phase-field method. Thepresentation focuses on the distributed core data structuresof the framework that enable block adaptive mesh refine-ment. We will discuss the performance-driven developmentprocess and the continuous integration pipeline that guar-antees compatibility with a large number of different com-pilers and systems through fully automatic testing.

Florian SchornbaumUniversity of [email protected]

Martin BauerUniversity of Erlangen-NurembergDepartment of Computer [email protected]

Christian Godenschwager, Ulrich RudeUniversity of Erlangen-Nuremberg

PP16 Abstracts 131


MS76

Communication Avoiding Ilu Preconditioner

In this talk, we present a new parallel preconditioner calledCA-ILU. It performs ILU(k) factorization in parallel and isapplied as preconditioner without communication. BlockJacobi and Restrictive Additive Schwartz preconditionersare compared to this new one. Numerical results showthat CA-ILU(0) outperforms BJacobi and is close to RASin term of iterations whereas CA-ILU, with complete LUon each block, does for both of them but uses lot of mem-ory. Thus CA-ILU(k) takes advantages of both in term ofmemory consumption and iterations.

Sebastien [email protected]


MS76

Parallel Approximate Graph Matching

We study approximate matching heuristics for undirectedgraphs by extending our recent work which focused on bi-partite graphs. The approach uses a judiciously sampledsubgraph of the input graph and search for a maximummatching in it. The approach is highly parallelizable andwe are currently working on the theoretical approximationguarantee. It can be used in practice to avoid worst caseswhile constructing an initial matching. Our experimentsdemonstrate parallel speed-ups and show that the approx-imation guarantees do not deteriorate with the increaseddegree of parallelization.

Fanny DufosseInria Lille, Nord Europe, 59650,Villeneuve dAscq, [email protected]

Kamer KayaSabanci University,Faculty of Engineering and Natural [email protected]


MS76

Diamond Sampling for Approximate MaximumAll-pairs Dot-product (MAD) Search

Given two sets of vectors A and B, our problem is to findthe top-t dot products among all possible pairs. This is afundamental mathematical problem that appears in numer-ous data applications involving similarity search, link pre-diction, and collaborative filtering. We propose a sampling-based approach that avoids direct computation of all dotproducts. We select diamonds (i.e., four-cycles) from theweighted tripartite representation of A and B. The prob-ability of selecting a diamond corresponding to pair (i,j)is proportional to the square of the product of the cor-

responding vectors, amplifying the focus on the largest-magnitude entries. Experimental results indicate that di-amond sampling is orders of magnitude faster than directcomputation and requires far fewer samples than any com-peting approach. We also apply diamond sampling to thespecial case of maximum inner product search, and get sig-nificantly better results than the state-of-the-art hashingmethods.

Grey BallardSandia National [email protected]

C. SeshadhriSandia National [email protected]

Tamara G. KoldaSandia National [email protected]

Ali PinarSandia National [email protected]

MS76

Scalable and Efficient Algorithms for Analysis ofMassive, Streaming Graphs

Graph-structured data in network security, social networks,finance, and other applications not only are massive butalso under continual evolution. The changes often are scat-tered across the graph, permitting novel parallel and incre-mental analysis algorithms. We discuss analysis algorithmsfor streaming graph data to maintain both local and globalmetrics with low latency and high efficiency.

Jason RiedyGeorgia Institute of TechnologySchool of Computational Science and [email protected]

David A. BaderGeorgia Institute of [email protected]

MS77

A Block Low-Rank Multithreaded Factorization forDense BEM Operators.

We consider a Block Low-Rank multithreaded LDU fac-torization for BEM electromagetics applications. In thiscontext, both the computation and accumulation of up-dates appear to be critical to ensure good performance.In both kernels of the factorization, each matrix may bedense or low rank. First is a matrix-matrix multiply, whereouter-orthogonality of the representation is key. Second isa matrix-matrix merge, where there are many strategies ina multithreaded environment.

Julie [email protected]

Cleve AshcraftLivermore Software [email protected]

132 PP16 Abstracts


MS77

Complexity and Performance of Algorithmic Vari-ants of Block Low-Rank Multifrontal Solvers

Matrices coming from elliptic Partial Differential Equa-tions (PDEs) have been shown to have a low-rank property:well defined off-diagonal blocks of their Schur complementscan be approximated by low-rank products and this prop-erty can be efficiently exploited in multifrontal solvers toprovide a substantial reduction of their complexity. Amongthe possible low-rank formats, the Block Low-Rank for-mat (BLR) is easy to use in a general purpose multifrontalsolver and has been shown to provide significant gains com-pared to full-rank on practical applications. However, un-like hierarchical formats, such asH and HSS, its theoreticalcomplexity was unknown. In this talk, we present severalvariants of the BLR multifrontal factorization, dependingon the strategies used to perform the updates in the frontalmatrices and on the approaches to handle numerical pivot-ing. We extend the theoretical work done on hierarchicalmatrices in order to compute the theoretical complexity ofeach BLR variant. We then provide an experimental studywith numerical results to support our complexity bounds.Finally, we compare and analyze the performance of eachvariant in a parallel setting.

Patrick AmestoyINP(ENSEEIHT)[email protected]

Alfredo ButtariCNRS-IRIT-Universite de Toulouse, [email protected]

Jean-Yves L’ExcellentINRIA-LIP-ENS [email protected]

Theo MaryUniversite de Toulouse, INPT [email protected]

MS77

A Comparison of Parallel Rank-Structured Solvers

We present a direct comparison of different rank-structuredsolvers: STRUMPACK (that implements HSS representa-tions, by the authors), MUMPS (Block Low-Rank repre-sentations, Amestoy et al.), and HLIBPro (H-arithmetic,Kriemann et al.). We report on results for large dense andsparse problems from academic and industrial applications,in a distributed-memory context.

Francois-Henry RouetLawrence Berkeley National [email protected]

Patrick AmestoyENINPT-IRIT, Universite of Toulouse, [email protected] or [email protected]

Cleve AshcraftLivermore Software TechnologyCorporation

[email protected]

Alfredo ButtariCNRS-IRIT-Universite de Toulouse, [email protected]

Pieter GhyselsLawrence Berkeley National LaboratoryComputational Research [email protected]

Jean-Yves L’ExcellentINRIA-LIP-ENS [email protected]

Xiaoye Sherry LiComputational Research DivisionLawrence Berkeley National [email protected]

Theo MaryUniversite de Toulouse, INPT [email protected]

Clement Weisbecker, Clement [email protected], [email protected]

MS77

Hierarchical Matrix Arithmetics on Gpus

GPU implementations of hierarchical matrix operations arechallenging as they require irregular data structures andmultilevel access patterns. Careful design, data layout,and tuning are needed to allow the effective mapping ofthe rank-structured representations and associated opera-tions to manycore architectures. In this talk, we describethe design of GPU-hosted, tree-based, representations andalgorithms that can compress dense matrices into a hier-archical format and perform linear algebra operations di-rectly on the compressed representations in a highly effi-cient manner.

Wajih Halim [email protected]


George M TurkiyyahAmerican University of [email protected]


MS78

Traversing a BLR Factorization Task Dag Based ona Fan-All Wraparound Map

We consider the factorization of a BLR matrix partitionedby block rows and columns. The atomic factor, solve andMMM tasks merge with submatrices to create a task DAG,

PP16 Abstracts 133

which we embed in a 3D Cartesian mesh. Then the atomictasks are scheduled to processors block wise using a 3Dwraparound map. Using p processors, both message countsand volume grow as p1/3.

Julie [email protected]

Cleve AshcraftLivermore Software [email protected]


MS78

Scheduling Sparse Symmetric Fan-Both CholeskyFactorization

In this work, we study various scheduling approaches forSparse Cholesky factorization. We review different algo-rithms based on a distributed memory implementation ofthe Fan-Both algorithm. We also study different commu-nication mechanisms and dynamic scheduling techniques.

Mathias JacquelinLawrence Berkeley National [email protected]

Esmond G. Ng, Yili ZhengLawrence Berkeley National [email protected], [email protected]

Katherine YelickLawrence Berkeley National LaboratoryUniversity of California at [email protected]

MS78

Optimizing Memory Usage When Scheduling Tree-Shaped Task Graphs of Sparse Multifrontal Solvers

We study the problem of scheduling task trees with largedata, which arise in the multifrontal method of sparse ma-trix factorization. Under limited memory, changing thetraversal of the tree impacts the needed amount of mem-ory. Optimal algorithms exist in the sequential case. Onparallel systems, the objective is to minimize the processingtime with bounded memory, which is much more complex.Heuristics from the literature may be improved to betteruse the available memory.

Guillaume AupyENS [email protected]

Lionel Eyraud-DuboisINRIAUniversity of [email protected]

Loris [email protected]

Frederic VivienLIP, ENS Lyon, [email protected]

Oliver SinnenUniversity of [email protected]

MS78

Impact of Blocking Strategies for Sparse DirectSolvers on Top of Generic Runtimes

Among the preprocessing steps of a sparse direct solver,reordering and block symbolic factorization are two ma-jor steps to reach a suitable granularity for Blas ker-nels efficiency and runtime management. In this talk, wepresent a reordering strategy to increase off-diagonal blocksizes. It enhances Blas kernels and allows to handle largertasks, reducing runtime overhead. Finally, we will com-ment the resulting gain in the PaStiX solver implementedover StarPU and Parsec.

Gregoire [email protected]

Mathieu FavergeIPB Bordeaux - Labri - [email protected]

Pierre RametBordeaux University - INRIA - LaBRIBordeaux University701868

Jean [email protected]

MS79

Silent Data Corruption in BLAS Kernels

We summarize different types of fault tolerant parallel al-gorithms for selected BLAS kernels and compare their be-havior under the influence of silent data corruption (bit-flips). In some scenarios, randomized gossip-based algo-rithms tend to exhibit a much higher degree of resiliencethan deterministic methods. Nevertheless, existing gossip-based algorithms cannot handle all bit-flips in their datastructures, either. We present novel algorithms, both de-terministic as well as randomized, with improved resilienceagainst silent data corruption.

Wilfried N. GanstererDepartment of Computer ScienceUniversity of [email protected]

MS79

Fault Tolerant Sparse Iterative Solvers

Silent data corruption is expected to become an increas-ing problem as total data set and memory sizes grow infuture supercomputers. Application-based fault tolerance(ABFT) techniques are one promising approach to mit-igating this problem. In this presentation we will showsome promising early results from our work to design a

134 PP16 Abstracts

fault-tolerant sparse iterative solver based on the Conju-gate Gradient (CG) method.

Simon McIntosh-SmithUniversity of [email protected]

MS79

Algorithm-Based Fault Tolerance of the Sakurai-Sugiura Method

The Sakurai-Sugiura (SS) method is a contour integraleigensolver, which have been developed and carefully ana-lyzed in the last decade. The SS method computes eigen-subspace by solutions of independent linear systems andthis characteristic leads coarse-grained parallel computa-tion. In this presentation we show the algorithm-basedfault tolerance of the method and verify the error resilienceby several numerical experiments.

Tetsuya Sakurai, Akira ImakuraDepartment of Computer ScienceUniversity of [email protected], [email protected]

Yasunori FutamuraUniversity of [email protected]

MS79

Soft Errors in PCG: Detection and Correction

Soft errors that are not detected by hardware mechanismsmay affect convergence of iterative methods. In this pre-sentation, we discuss the possibility of a self-correctiontechnique for classical Preconditioned Conjugate Gradient(PCG) algorithm in case of soft error in the SpMV cal-culation. To be able to correct the convergence behavior,the soft errors should be detected immediately. Checksumbased detection methods are good candidates for it. Buttheir defect, arising from round-off effects, decreases theirreliability. Hence we combine this technique with the well-known results on deviation between the true and computedresiduals in finite precision calculation. The combinationof these two approaches gives us an opportunity to de-tect the soft error before the algorithm get stuck in anunrecoverable state. After detection, the convergence be-havior can be corrected by applying residual replacementapproach. The discussion is illustrated with several faultinjection scenarios on different matrices.


Siegfried CoolsAntwerp [email protected]

Luc GiraudInriaJoint Inria-CERFACS lab on [email protected]

Wim VanrooseAntwerp [email protected]

Fatih [email protected]

MS80

Code Verification in Cfd


Daniel ChauveheidCEA DAM DIFBruyeres-le-Chtel , 91297 Arpajon Cedex, [email protected]

MS80

Performance Study of Industrial Hydrocodes:Modeling, Analysis and Aftermaths on InnovativeComputational Methods

This talk is the high-performance numerics companiontalk given by Mathieu Peybernes in this mini-symposium.From conclusions on the performance of a legacy staggeredLagrange+remap hydrocode, we set a list of requirementsfor a new numerical solver. It has been possible to de-rive a second-order Eulerian multimaterial finite volumescheme with collocated variables and geometrical elementsdefined only from the reference (Eulerian) grid. The result-ing scheme shows both SIMD property and strong scala-bility.

Florian De VuystEcole Normale Superieure [email protected]

MS80

Checking the Floating Point Accuracy of ScientificCode Through the Monte Carlo Arithmetic

Exascale supercomputers will provide the computationalpower required to perform very large scale and high reso-lution digital simulations. One of the main exascale chal-lenges is to control their numerical uncertainties. Indeed,the floating-point arithmetic is only an approximation ofexact arithmetic The aim of this talk is to present a newtool called which can be used for the automatic assessmentof the numerical accuracy of large scale digital simulationsby using the Monte-Carlo Arithmetic.

Christophe DenisCMLA - ENS Cachan - University Paris SaclayUMR [email protected]

MS80

Code Modernization by Mean of Proto-Applications

With the shift in HPC system paradigm, applications haveto be re-designed to efficiently address large numbers ofmanycores. It requires deep code modification that cannotbe practicaly applied at full-application scale. We pro-pose proto-applications as proxy for code-modernization.Our first example features unstructured mesh computa-tion using task-based parallization. The second demon-strates load-balancing for combustion simulation. Bothproto-applications are open-source and can serve to the

PP16 Abstracts 135

developement of genuine HPC application.

Eric PetitPRiSM - UVSQ45 avenue des Etats-Unis, 78035 Versailles [email protected]

MS81

On the Performance and Productivity of AbstractC++ Programming Models?

Porting production-scale scientific applications to emergingnext-generation computing platforms is a significant chal-lenge. Application developers must not only ensure strongperformance, they must also ensure code is portable acrossthe wide diversity of architectures available. In order totackle the significant scale of these codes the communitymust also take into account developer productivity. In thistalk we present results from a recent DOE study whichevaluated directive solutions against our Kokkos C++ pro-gramming model.

Simon D. HammondScalable Computer ArchitecturesSandia National [email protected]

Christian Trott, H. Carter Edwards, Jeanine Cook,Dennis DingeSandia National [email protected], [email protected],[email protected], [email protected]

Mike GlassSandia National [email protected]

Robert J. Hoekstra, Paul Lin, Mahesh Rajan, CourtenayT. VaughanSandia National [email protected], [email protected], [email protected], [email protected]

MS81

Demonstrating Advances in Proxy ApplicationsThrough Performance Gains and/or PerformancePortable Abstractions: CoMD and Kripke withRAJA

RAJA is a programming model and C++ implementationthereof which exposes fine-grained parallelism in both loopand task-based fashions. In this talk, we present successfulresults from a recent study wherein we applied RAJA toenhance the performance of two codes outside of its originaldesign space.

Richard Hornung, Olga PearceLawrence Livermore National [email protected], [email protected]

Adam Kunen, Jeff Keasler, Holger JonesLawrence Livermore National [email protected], [email protected], [email protected]

Rob NeelyLawrence Livermore National [email protected]

Todd GamblinLawrence Livermore National [email protected]

MS81

Future Architecture Influenced Algorithm Changesand Portability Evaluation Using BLAST

BLAST is an arbitrary high-order finite element code forshock hydrodynamics, originally developed for CPUs usingMPI. In this talk, we describe the algorithmic and struc-tural changes we made to BLAST to efficiently supportmultiple emerging architectures, in particular GPU andXeon Phi. We analyze the key differences between ourCPU, GPU, and Xeon Phi versions and our efforts to com-bine them into a single performance portable implementa-tion to simplify future maintenance.

Christopher EarlLawrence Livermore National [email protected]

Barna BihariLawrence Livermore National [email protected]

Veselin Dobrev, Ian KarlinLawrence Livermore National [email protected], [email protected]

Tzanio V. KolevCenter for Applied Scientific ComputingLawrence Livermore National [email protected]

Robert Rieben, Vladimir TomovLawrence Livermore National [email protected], [email protected]

MS81

Kokkos and Legion Implementations of the SNAPProxy Application

SNAP is a proxy application which simulates the compu-tational motion of a neutral particle transport code. Inthis work, we have adapted parts of SNAP separately; wehave re-implemented the iterative shell of SNAP in thetask-model runtime Legion, showing an improvement tothe original schedule, and we have created multiple Kokkosimplementations of the computational kernel of SNAP, dis-playing similar performance to the native Fortran.

Geoff Womeldorff, Ben Bergen, Joshua PayneLos Alamos National [email protected], [email protected], [email protected]

MS82

Recent Developments on Forests of Linear Trees

Forest-of-octrees AMR is an approach that offers both ge-ometric flexibility and parallel scalability and is being usedin various finite element codes and libraries. The corefunctionality is defined by refinement, size balance, the as-sembly of ghost layers, and a parallel topology iteration.More general applications, such as semi-Lagrangian andpatch-based methods, require additional AMR functionali-ties. Recently, a tentative tetrahedral variant has emerged.

136 PP16 Abstracts

This talk will summarize the latest theory and applications.

Carsten Burstedde, Johannes HolkeUniversitat [email protected], [email protected]

MS82

Parallel Level-Set Algorithms on Tree Grids

We present results from parallelizing the semi-Lagrangianand Reinitialization algorithms for the level-set method onadaptive Quadtree (2D) and Octree (3D) grids. Thesealgorithms are directly implemented on top of the p4estlibrary which implements high quality parallel refine-ment/coarsening and partitioning operations. Initial scal-ing results indicate good scalability up to 4096 cores, ourcurrent allocation limit on the Stampede supercomputer.Some applications of our algorithms to modeling the so-lidification phenomenon and crystal growth are discussed.

Mohammad MirzadehDept. Mechanical [email protected]


Carsten BursteddeUniversitat [email protected]

Frederic G. GibouUC Santa [email protected]

MS82

Exploiting Octree Meshes with a High Order Dis-continuous Galerkin Method

High-order discontinuous Galerkin methods are especiallyattractive for parallel computation due to their high local-ity. However, the sheer size of modern computing systemsrequires also the scalable treatment of mesh information.This need drives the design of our Octree mesh implemen-tation TreElM. We will present the concept of Ateles, ourmodal discontinuous Galerkin solver based on TreElM. Inthis context, the focus of our presentation will be the high-order representation of geometries.

Sabine P. Roller, Harald Klimach, Jens ZudropUniversitaet [email protected], [email protected],[email protected]

MS82

A Sharp Numerical Method for the Simulation ofDiphasic Flows

We present a numerical method for modeling incompress-ible non-miscible fluids, in which both the interface andthe interfacial continuity equations are treated in a sharpmanner. Our approach is implemented on adaptive Oc-tree/Quadtree grids, which are highly effective in captur-ing different length scales. By using a modified pressurecorrection projection method, we were able to alleviate the

standard time step restriction incurred by capillary forces.The solver is validated numerically in two and three spatialdimensions and challenging numerical examples are pre-sented to illustrate its capabilities.

Maxime TheillardMechanical and Aerospace [email protected]

Frederic [email protected]

David [email protected]


MS83

Optimal Domain Decomposition Strategies for Par-allel Geometric Multigrid

A high level cache-directed analysis is used to derive a flexi-ble cache-oblivious heuristic that prunes the process topol-ogy search space to obtain families of high-performing do-main decompositions for structured 3-D domains. Thisconcept is used to minimize cache-misses in smoothingphases and is combined with process-to-hardware mappingfor a subset of active processes at the coarsest correctionphase. Our conclusion discusses the impact on domain de-composition strategies resulting from current, and future,multi-core architectures.

Gaurav SaxenaUniversity of [email protected]

Peter K. Jimack, Mark A. WalkleySchool of ComputingUniversity of [email protected], [email protected]

MS83

Design and Implementation of Adaptive Spmv Li-brary for Multicore and Manycore Architecture

Sparse matrix vector multiplication (SpMV) is an impor-tant kernel in both traditional high performance comput-ing and emerging data-intensive applications. PreviousSpMV libraries are optimized by either application-specificor architecture-specific approach, and they’re complicatedto be used in real applications. In this work we develop anauto-tuning system (SMATER) to bridge the gap betweenspecific optimizations and general-purpose use. SMATERprovides programmers a unified interface based on the com-pressed sparse row (CSR) format by implicitly choosing thebest format and the fastest implementation for any inputsparse matrix in the runtime. SMATER leverages a ma-chine learning model and a retargetable backend libraryto quickly predict the best combination. Performance pa-rameters are extracted from 2386 matrices in UF sparsematrix collection. The experiments show that SMATERachieves the impressive performance while being portableon the state-of-the-art x86 multi-core processors, NVIDIA

PP16 Abstracts 137

GPU and Intel Xeon Phi accelerators. Compared with IntelMKL library, SMATER runs faster by more than 3 times.We demonstrate its adaptivity in an algebraic multigridsolver from hypre library and report above 20% perfor-mance improvement.

Guangming TanInstitute of Computing TechnologyChinese Academy of [email protected]

MS83

Towards Efficient Data Communications for Frame-works on Post-petascale Numa-Multicore Systems

Performance of halo-exchange communications is essen-tial to the scalability of parallel programming frameworks.In this talk, we present our effort on optimizing halo-exchange communications for the JASMIN framework onpost-peta scale NUMA multicore systems. We proposea parallelism-exposing communication abstraction, designa scalable scheduling algorithm, and utilize widely avail-able hardware features. Our effort improves typical halo-exchange communication performance by up to 78%. Thisin turn improves the overall performance of JEMS-FDTDapplication by 48%.

Zhang YangInstitute of Applied Physics and ComputationalMathematicsChina Academy of Engineering Physicsyang [email protected]

Aiqing ZhangInstitute of Applied Physics and ComputationalMathematicszhang [email protected]

MS84

Novo-G#: Strong Scaling Using Fpga-CentricClusters with Direct and Programmable Intercon-nects

FPGA-centric clusters with direct and programmable in-terconnects address fundamental limits on HPC perfor-mance by maximizing computational density, minimizingpower density, removing the bottleneck between process-ing and communication, and facilitating intelligent andapplication-aware processing in the network itself. For ex-ample, knowing the routing pattern a priori often enablescongestion-free communication. We describe our 100-nodepublically available cluster, Novo-G#, its software and IPinfrastructure, and strong scaling of communication-boundapplications such as long time-scale Molecular Dynamics.

Martin HerbordtBoston [email protected]

MS84

Benchmarking FPGAs with Opencl

We evaluate the performance of OpenCL benchmarks, in-cluding the Rodinia benchmark suite, on FPGAs using Al-tera’s OpenCL SDK. We first evaluate the performance ofthe original, thread-parallelized OpenCL kernels targetedfor GPUs on FPGAs. We study how those kernels can besignificantly optimized by exploiting FPGA-specific archi-

tectural features such as pipeline parallelism and slidingwindows. Our results using an Altera Stratix V FPGAdemonstrates significantly higher power efficiency in com-parison to GPUs.

Naoya MaruyamaRIKEN Advanced Institute for Computational [email protected]

Hamid ZohouriTokyo Institute of [email protected]

Aaron SmithMicrosoft [email protected]

Motohiko MatsudaRIKEN Advanced Institute for Computational [email protected]

Satoshi MatsuokaTokyo Insitute of [email protected]

MS84

Architectural Possibilities of Fpga-Based High-Performance Custom Computing

Recent advancement of FPGA technology provides variouspossibilities to high-performance custom computing archi-tectures. As a solution to higher scalability and efficiencythan the present massively-parallel HPC systems, we focuson data-centric computing architectures which is differentfrom conventional control-centric ones. This talk intro-duces their microarchitecture to system level approachesfor latency-tolerant and decentralized computing, whichinclude bandwidth compression techniques and an overlayarchitecture for system-wide data-flow computing.

Kentaro SanoTohoku [email protected]

MS84

Are Datacenters Ready for Reconfigurable Logic?

At Microsoft we have designed and deployed a composable,reconfigurable fabric to accelerate portions of large-scalesoftware services beyond what commodity server designsprovide. This talk will describe a prototype deployment ofthis fabric on a bed of 1,632 servers, and discuss its efficacyin accelerating the Bing web search engine.

Aaron SmithMicrosoft [email protected]

MS85

Parallel Performance of the Inverse FMM and Ap-plication to Mesh Deformation for Fluid-StructureProblems

In this talk, we present the inverse fast multipole methodas an approximate fast direct solver for dense linear sys-tems that performs extremely well as a preconditioner inan iterative solver. The computational cost scales linearlywith the problem size. The method uses low-rank approx-

138 PP16 Abstracts

imations to represent well-separated interactions; this isdone in a multi-level fashion. Its parallel performance isassessed, and applications related to mesh deformation us-ing radial basis function interpolation are discussed.

Pieter [email protected]

Chao ChenStanford [email protected]


MS85

Fast Updating of Skeletonization-Based Hierarchi-cal Factorizations

We present a method for updating certain hierarchical fac-torizations for solving linear systems arising from the dis-cretization of physical problems (e.g., integral equations orPDEs) in 2D. Given a factorization, we can locally perturbthe geometry or coefficients and update the factors to re-flect this change with complexity poly-logarithmic in thetotal number of unknowns. We apply our method to therecursive skeletonization factorization and hierarchical in-terpolative factorization and demonstrate results on someexamples.

Victor [email protected]

Anil Damle, Ken Ho, Lexing YingStanford [email protected], [email protected], [email protected]

MS85

Exploiting H-Matrices in Sparse Direct Solvers

In this talk, we describe a preliminary fast direct solver us-ing HODLR library to compress large blocks appearing inthe symbolic structure of the PaStiX sparse direct solver.We will present our general strategy before analyzing thepractical gains in terms of memory and floating point op-erations with respect to a theoretical study of the problem.Finally, we will discuss ways to enhance the overall perfor-mance of the solver.

Gregoire [email protected]


Mathieu FavergeIPB Bordeaux - Labri - [email protected]

Pierre Ramet

Bordeaux University - INRIA - LaBRIBordeaux University701868

Jean [email protected]

MS85

Algebraic and Geometric Low-Rank Approxima-tions

We perform a direct comparison between a geometric(FMM) and algebraic (HSS) matrix-vector multiplicationof compressed matrices for Laplace, Helmholtz, and Stokeskernels. Both methods are implemented with distributedmemory, thread, and SIMD parallelism, and the runs areperformed on large Cray systems such as Edison (XC30,133,824 cores) at NERSC and Shaheen2 (XC40, 197,568cores) at KAUST. We compare the absolute execution timeof both codes, along with their parallel scalability andmemory consumption.

Rio YokotaTokyo Institute of [email protected]

Francois-Henry Rouet, Xiaoye S. LiLawrence Berkeley National [email protected], [email protected]


MS86

Architecture-Agnostic Numerics in GridTools

The rapid evolution of the HPC hardrware has producedmajor changes in programming techniques and algorithmsemployed to solve several classical problems, such as thecompressible Euler equations. Overcoming the lag betweennumerical schemes and their efficient implementations onspecific hardware requires cross-disciplinary competenciesand is often challenging. We propose a set of librariescalled GridTools which will provide domain scientists withtechniques to specify, discretize and solve (explicitly) PDEproblems taking full advantage of modern hardware archi-tectures. We present GridTools and several prototype PDEsolvers.

Paolo CrosettoEcole Politechnique Federale de [email protected]

MS86

Generating High Performance Finite Element Ker-nels Using Optimality Criteria.

We tackle the problem of automatically generating opti-mal finite element integration routines given a high levelspecification of arbitrary multilinear forms. Optimality isdefined in terms of floating point operations given a mem-ory bound. We show an approach to explore the trans-formation space of loops and expressions typical of inte-gration routines and discuss the conditions for optimality.Extensive experimentation validates the approach, show-ing systematic performance improvements over a number

PP16 Abstracts 139

of alternative code generation systems.

Fabio LuporiniDepartment of ComputingImperial College [email protected]

David HamDepartment of Mathematics and Department ofComputingImperial College [email protected]

Paul KellyImperial College [email protected]

MS86

Source-to-Source Translation for Code Validationand Acceleration of the Icon Atmospheric GeneralCirculation Model

The ICON atmospheric model port to accelerators withOpenACC directives illustrates that these are easily in-serted and can yield performance similar to CUDA pro-gramming, but introduce subtelties which are difficult tooptimize and debug. Thus we propose a source-to-sourcetranslator to generate accelerator directives based on anabstract description of the accelerator data and kernels,as well as validation code for intermediate results. In thispresentation we report our findings and outline possibleextensions.

William Sawyer

Swiss Centre of Scientific Computing (CSCS)Manno, [email protected]

MS86

Kernel Optimization for High-Order DynamicRupture Simulation - Flexibility Vs. Performance

SeisSol simulates seismic wave propagation and dynamicrupture. Its innermost kernels consist of small matrix mul-tiplications (SMMs) where all sparsity patterns are knowna-priori. The SMMs are generated in an offline process andmake heavy use of vectorization in order to maximize per-formance. We present our recent efforts to further optimizeand automate the code generator. Particularly, we deter-mine the optimal matrix multiplication order, automati-cally exploit zero blocks, and enable block sparse SMMs.

Carsten UphoffTU [email protected]


MS87

Bosilca ¡TBA¿


George BosilcaUniversity of Tennessee - Knoxville

[email protected]

MS87

From System-Level Checkpointing to User-LevelFailure Mitigation in MPI-3

In this presentation, I will talk about transparent, system-level fault tolerance and present some (old) results thatshow the overhead of such mechanisms. Then I will talkabout how the user can handle fault tolerance directlywithin the parallel application, using the (old, deprecated)interface FT-MPI and using a current effort made by agroup of the MPI forum in the MPI-3 standard. Finally,I will present how the middleware can handle failures andresist beyond them.

Camille CotiUniversite Paris [email protected]

MS87

Reinit: A Simple and Scalable Fault-ToleranceModel for MPI Applications

In this talk, we present a global-exception programingmodel to tolerate failures in MPI applications. The modelallows a clean way to reinitialize MPI state after failuresand a simple mechanism to clean up application and li-braries state via a stack of error handlers. We present animplementation of the model in MVAPICH (a widely usedMPI library) and Slurm (an open-source workload man-ager), and show an initial performance evaluation in HPCbenchmarks.

Ignacio LagunaLawrence Livermore National [email protected]

MS87

FA-MPI: Fault-Aware and QoS-Oriented ScalableMessage Passing

A Fault-Aware Approach for non-blocking MPI-3.x APIsis presented (FA-MPI), enabled by weak collective trans-actions This novel model localizes faults, provides tunablefault-free overhead, allows for multiple kinds of faults, en-ables hierarchical recovery, and is data-parallel relevant.Fault modeling of underlying networks is mentioned as areproposed additions to MPI-4 fully to support resilient par-allel programs with basic QoS. Performance and scalabilityresults of prototype FA-MPI are given, keyed to compactapplications (e.g., MiniFE, LULESH).

Anthony SkjellumAuburn [email protected]

MS88

Topology-Aware Performance Modeling of ParallelSpMVs

Parallel SpMVs are a key component of many linear solvers.Increasingly large HPC systems yield costly communica-tion; specifically large messages sent long distances. Aperformance model can associate an approximate cost withthe various messages communicated during each SpMV.The common alpha-beta model can be improved throughthe use of topology-aware methods, allowing for a distance

140 PP16 Abstracts

parameter to be added to the cost of each message, as wellas a rough prediction of network contention.

Amanda BienzUniversity of Illinois at [email protected]

Luke OlsonDepartment of Computer ScienceUniversity of Illinois at [email protected]

William D. GroppUniversity of Illinois at [email protected]

MS88

Cere: Llvm Based Codelet Extractor and Replayerfor Piecewise Benchmarking and Optimization

Codelet Extractor and REplayer (CERE) is an LLVM-based framework that finds and extracts hotspots from anapplication as isolated fragments of code. Codelets can bemodified, compiled, run, and measured independently fromthe original application. Through performance signatureclustering, CERE extracts a minimal but representativecodelet set from applications, which can significantly re-duce the cost of benchmarking and iterative optimization.Codelets have proved successful in autotuning target archi-tecture, compiler optimization or amount of parallelism.

Pablo De Oliveira CastroPRiSM & ECR - UVSQ45 avenue des Etats-Unis, 78035 Versailles [email protected]

MS88

Performance Modeling of a Compressible Hydro-dynamics Solver on Multicore Cpus

Performance study of a reference Lagrange-Remap algo-rithm for compressible gas dynamics obtained using theRoofline model and the Execution Cache Memory (ECM)model has been performed. As a result, we are able topredict the whole performance on modern processors witha mean error of 6.5%, which is actually very accurate inthis context. The results of these analyses also gave ushighlights for designing a new full hydrodynamics solver(Lagrange+flux) with expected better performance.

Mathieu PeybernesCEA Saclay, 91191 Gif-sur-Yvette cedex, [email protected]

MS88

Using Uncertainties in High Level PerformancePrediction

For rapid evaluation of future hardware, we are using highlevel methodology based on a first approximation of theapplication behavior, and on several hardware specifica-tions (frequency, efficiency, bandwidth, nb of cores, etc..) that are usually known only a few months before theofficial launch. The introduction of uncertainties on theseparameters is then allowing us to get some tolerance on theprediction results.

Philippe Thierry

2 rue de paris, 92190 Meudon, [email protected]

PP1

Improving Scalability of 2D Sparse Matrix Parti-tioning Models Via Reducing Latency Cost

Two-dimensional (2D) sparse matrix partitioning modelsoffer better opportunities in reducing multiple communica-tion cost metrics compared to one-dimensional partitioningmodels. This work makes use of a two-phase partitioningmethodology to address three important communicationcost metrics: (i) total message volume, (ii) total messagecount, and (iii) maximum message volume. The first phaseonly considers the message volume metric, while the sec-ond phase considers the other two metrics. We show thatthe scalability of the existing 2D models can significantlybe improved with our methodology.

Cevdet Aykanat, Oguz SelvitopiBilkent [email protected], [email protected]

PP1

Hybrid Parallel Simulation of Helicopter Rotor Dy-namics

We present a software for the hybrid parallel calculationof rotor forces and control angles using a coupled simula-tion of rotor dynamics and rotor wake. A big part of thecomputation time is required to simulate the airflow in thewake using a vortex method. For node-level paralleliza-tion, we use OpenACC on GPUs and OpenMP on CPUs,and MPI for parallelization between compute nodes. Wefocus on the coupling scheme and discuss the performanceobtained.

Melven Roehrig-Zoellner, Achim BasermannGerman Aerospace Center (DLR)Simulation and Software [email protected],[email protected]

Johannes HofmannGerman Aerospace Center (DLR)Institute of Flight [email protected]

PP1

Poisson Solver for large-scale Simulations with oneFourier Diagonalizable Direction

We present a Poisson solver for problems with one Fourierdiagonalizable direction, such as airfoils. The diagonaliza-tion decomposes the original 3D-system into a set of un-coupled 2D-subsystems. Our method focuses on optimizingmemory allocations and transactions by considering redun-dancies on such subsystems. It also takes advantage of theuniformity through the periodic direction for its vectoriza-tion, and automatically manages the workload distributionand the preconditioner choice. Altogether constitutes ahighly efficient and scalable Poisson solver.

Ricard BorrellUniversitat Politecnica de Catalunya (Barcelona Tech)[email protected]

Guillermo Oyarzun, Jorje Chiva, Oriol Lehmkuhl, Ivette

PP16 Abstracts 141

Rodrıguez, Assensi OlivaUniversitat Politecnica de Catalunya (BarcelonaTech)[email protected], [email protected],[email protected], [email protected],[email protected]

PP1

High Performance Algorithms and Software forSeismic Simulations

The understanding of earthquake dynamics is greatly sup-ported by highly resolved coupled simulations of the rup-ture process and seismic wave propagation. This grandchallenge of seismic modeling requires an immense amountof supercomputing resources, thus optimal utilization bysoftware is imperative. Driven by recent hardware devel-opments the increasing demand for parallelism and datalocality often requires replacing major software parts tobring efficient numerics and machine utilization closely to-gether. In this poster we present a new computationalcore for the seismic simulation package SeisSol. For mini-mal time-to-solution the new core is designed to maximizevalue and throughput of the floating point operations per-formed in the underlying ADER discontinuous Galerkindiscretization method. Included are auto-tuned sparse anddense matrix kernels, hybrid parallelization from many-core nodes up to machine-size and a novel high performanceclustered local time stepping scheme. The presented com-putational core reduces time-to-solution of SeisSol by sev-eral factors and scales beyond 1 million cores. At machine-size the new core enabled a landmark simulation of the 1992Landers earthquake. For the first time this simulation al-lowed the analysis of the complex rupture behavior result-ing from the non-linear interaction of frictional sliding andseismic wave propagation at high geometric complexity.

Alexander BreuerUniversity of California, San [email protected]

Alexander HeineckeParallel Computing LaboratoryIntel Corporation, Santa Clara, CA, [email protected]


PP1

Large Scale Parallel Computations in R ThroughElemental

R-Elemental is a package for the statistical software Rthat enables distributed computations through the paralleldense linear algebra library Elemental. R-Elemental offersto the users a straightforward transition from single node tocompute clusters, preserving native R syntax and methods;at the same time, it delivers Elementals full performancewithout any overhead. We discuss the interface design andpresent several examples common in statistical analysis.

Rodrigo CanalesRWTH Aachen [email protected]

Elmar Peise, Paolo BientinesiAICES, RWTH Aachen


PP1

Parallel Programming Using Hpcmatlab

HPCmatlab is a new, flexible and scalable parallel pro-gramming framework for Matlab/Octave applications toscale on distributed memory cluster architectures usingthe Message Passing Interface (MPI). This includes point-to-point, collective, one sided communication and paral-lel I/O. It supports shared memory programming withina node using Pthreads. We present scientific applicationperformance results showing the ease of programming forfast prototyping or for large scale numerical simulations.

Mukul DaveASU Research ComputingArizona State [email protected]

Xinchen GuoPurdue [email protected]

Mohamed SayeedASU Research ComputingArizona State [email protected]

PP1

Using the Verificarlo Tool to Assess the FloatingPoint Accuracy of Large Scale Digital Simulations

Exascale supercomputers will provide the computationalpower required to perform very large scale and high reso-lution digital simulations. One of the main exascale chal-lenges is to control their numerical uncertainties. Indeed,the floating-point arithmetic is only an approximation ofexact arithmetic. This poster will present a new tool calledverificarlo which can be used for the automatic assessmentof the numerical accuracy of large scale digital simulationsby using the Monte-Carlo Arithmetic.

Christophe DenisCMLA - ENS Cachan - University Paris SaclayUMR [email protected]

Pablo De Oliveira CastroPRiSM & ECR - UVSQ45 avenue des Etats-Unis, 78035 Versailles [email protected]

Eric PetitPRISM - [email protected]

PP1

A Parallel Fast Sweeping Method on NonuniformGrids

We present a novel parallel fast sweeping method onnonuniform grids. By constructing a graph of the dis-cretization, we are able to determine the causally inde-pendent updates. Each of these independent steps is thencomputed in parallel. We demonstrate the method on treebased grids and triangulated surfaces. This talk will briefly

142 PP16 Abstracts

describe the algorithm, show an example application, andpresent parallel scaling results.

Miles L. DetrixheUniversity of California Santa [email protected]

PP1

Chase: A Chebyshev Accellerated Subspace Itera-tion Eigensolver

ChASE is an iterative eigensolver designed with a rathersimple algorithmic structure which rendered easy to paral-lelize its kernels over virtually all existing range of multi-and many-cores. Such a feature makes ChASE virtuallyone of the most parallel efficient algorithm for sequencesof correlated eigenvalue problems. In particular, ChASEscales almost optimally over a range of processors com-mensurate to the problem complexity, and it is portableand adaptable to many flavors of parallel architectures.

Edoardo A. Di NapoliJuelich Supercomputing [email protected]

Mario BerljafaSchool of MathematicsThe University of [email protected]

Paul SpringerAICESRWTH Aachen [email protected]

Jan WinkelmannAICESRWTH [email protected]

PP1

Evaluation of the Intel Xeon Phi and Nvidia K80As Accelerators for Two-Dimensional Panel Codes

To predict the properties of fluid flow over a solid geome-try is an important engineering problem. In many applica-tions so-called panel methods (or boundary element meth-ods) have become the standard approach to solve the corre-sponding partial differential equation. Since panel methodsin two dimensions are computationally cheap, they are wellsuited as the inner solver in an optimization algorithm. Inthis paper we evaluate the performance of the Intel XeonPhi 7120 and the NVIDIA K80 to accelerate such an op-timization algorithm. For that purpose, we have imple-mented an optimized version of the algorithm on the CPUand Xeon Phi (based on OpenMP, vectorization, and theIntel MKL library) and on the GPU (based on CUDA andthe MAGMA library). We present timing results for allcodes and discuss the similarities and differences betweenthe three implementations. Overall, we observe a speedupof approximately 2.5 for adding a Intel Xeon Phi 7120 to adual socket workstation and a speedup between 3.4 and 3.8for adding a NVIDIA K80 to a dual socket workstation.

Lukas EinkemmerUniversity of Innsbruck

[email protected]

PP1

An Impact of Tuning the Kernel of the StructuredQR Factorization in the TSQR

In the TSQR algorithm, the so-called structured QR factor-ization is required for reducing two upper triangular ma-trix into one and repeated depending on the number ofprocesses in parallel computing. Tuning this part is thusimportant but not trivial because the input matrix is usu-ally small and has the upper triangular structure. In thisposter, we compare several implementations and investi-gate their impact on the overall performance of the TSQRalgorithm.

Takeshi FukayaHokkaido University, RIKEN [email protected]


PP1

Filters of the Improvement of Multiscale Data fromAtomistic Simulations

In multiscale models combining continuum approximationswith first-principles based atomistic representations thedominant computational cost is typically the atomisticcomponent of the model. We demonstrate the effectivenessand scalability of a spectral filter for improving the accu-racy of noisy data obtained from atomistic simulations. Fil-tering enables running less expensive atomistic simulationsto reach the same overall level of accuracy thus loweringthe primary cost of the multiscale model and leading tofaster simulations.

David J. GardnerLawrence Livermore National [email protected]

Daniel R. ReynoldsSouthern Methodist [email protected]

PP1

Parallel Multigrid and FFT-Type Algorithmsfor High Order Compact Discretization of theHelmholtz Equation.

In this paper, we analyze parallel implementation of multi-grid and FFT-type methods for the solution of a subsurfacescattering problem. The 3D Helmholtz equation is dis-cretized by a sixth order compact scheme. The resultingsystems of finite-difference equations are solved by differ-ent preconditioned Krylov subspace-based methods. Thelower order multigrid and FFT-based preconditioners areconsidered for the efficient implementation of the devel-oped iterative approach. The complexity and scalability oftwo parallel preconditioning strategies are analyzed.

Yury A. GryazinIdaho State University

PP16 Abstracts 143

[email protected]

PP1

Computational Modeling and Aerodynamic Anal-ysis of Simplified Automobile Body Using OpenSource Cfd (OpenFoam)

Open source softwares have emerged as a cross-disciplinaryparadigm for a large spectrum of computing applicationsincluding computer aided engineering (CAE). Open FieldOperation and Manipulation (OpenFOAM) is an opensource computational software capable of solving varietyof complex fluid flows. Present research work is focused oninvestigating the ability of OpenFOAM to simulate exter-nal flow around a simplified automobile body. The resultsare compared with experimental values of commercial CFDpackages. These results allow us to gauge the robustness,and adaptability of the OpenFOAM utility.

Javeria HashmiNational University of Science and technology (NUST)National University of Sciences and Technology [email protected]

PP1

Performance Evaluation of a Numerical Quadra-ture Based Eigensolver Using a Real Pole Filter

Numerical quadrature based parallel sparse eigensolverssuch as the Sakurai-Sugiura method have been developedin the last decade. Recently, an approach using a numeri-cal quadrature which forms a real pole filter was proposed.This approach potentially reduces the computational andmemory cost for solving real symmetric problems. In thispresentation, we evaluate the performance of the real polefilter method by comparing it with the often-used complexpole filter method using practical large-scale problems.

Yuto InoueDepartment of Computer ScienceUniversity of [email protected]

Yasunori Futamura, Tetsuya SakuraiUniversity of [email protected], [email protected]

PP1

Synthesis and Verification of BSP Programs

Bulk synchronous parallel (BSP) is an established modelfor parallel programming which simplifies program concep-tion by avoiding deadlocks and data-races. Nonetheless,there are still a large class of programming errors whichplague BSP programs. Our research interest is twofold:correctness by construction where BSP code is automati-cally generated with a new BSP-automata theory, and au-tomatic verification of existing C programs written withBSPlib using formal methods such as abstract interpreta-tion.

Arvid Jakobsson, Thibaut TachonHuawei Technologies [email protected],

[email protected]

PP1

A Reconstruction Algorithm for Dual Energy Com-puted Tomography and Its Parallelization

This paper provides a mathematical algorithm to recon-struct the Compton scatter and photo-electronic coeffi-cients using the dual-energy CT system. The proposedimaging method is based on the mean value theorem tohandle the non-linear integration coming from the poly-chromatic energy based CT scan system. We show a nu-merical simulation result for the validation of the proposedalgorithm. In order to reduce the computational time, weuse MPI4Py to adopt the parallel processing. We obtain34 time faster result with 64 cpu cores system.

Kiwan JeonNational Institute for Mathematical [email protected]

Sungwhan KimHanbat National [email protected]

Chi Young Ahn, Taeyoung HaNational Institute for Mathematical [email protected], [email protected]

PP1

An Interior-Point Stochastic ApproximationMethod on Massively Parallel Architectures

The stochastic approximation method is behind the solu-tion of many actively-studied problems in PDE-constrainedoptimization. Despite its far-reaching applications, there isalmost no work on applying stochastic approximation andinterior-point optimization, although IPM are particularlyefficient in large-scale nonlinear optimization due to theirattractive worst-case complexity. We present a massivelyparallel stochastic IPM method and apply it to stochasticPDE problems such as boundary control and optimizationof complex electric power grid systems under uncertainty.

Juraj KardoUniversita della Svizzera italianaInstitute of Computational [email protected]

Drosos KourounisUSI - Universita della Svizzera italianaInstitute of Computational [email protected]


PP1

Nlafet - Parallel Numerical Linear Algebra for Fu-ture Extreme-Scale Systems

NLAFET, funded by EU Horizon 2020, is a direct re-sponse to the demands for new mathematical and algo-rithmic approaches for applications on extreme scale sys-tems as identified in the H2020-FETHPC work programme.

144 PP16 Abstracts

The aim is to enable a radical improvement in the perfor-mance and scalability of a wide range of real-world appli-cations relying on linear algebra software, by developingnovel architecture-aware algorithms and software libraries,and the supporting runtime capabilities to achieve scalableperformance and resilience on heterogeneous architectures.The focus is on a critical set of fundamental linear algebraoperations including direct and iterative solvers for denseand sparse linear systems of equations and eigenvalue prob-lems.

Bo T. KagstromUmea UniversityComputing Science and [email protected]

Lennart EdblomUmea UniversityComputing [email protected]


Dongarra JackUniversity of ManchesterSchool of [email protected]


Jonathan HoggSTFC Rutherford Appleton Lab, [email protected]

Laura GrigoriINRIA Paris - RocquencourtAlpines [email protected]

PP1

Seigen: Seismic Modelling Through Code Genera-tion

In this poster we present initial performance results forSeigen, a new seismological modelling framework thatutilises symbolic computation to model elastic wave prop-agation using the discontinuous Galerkin finite elementmethod (DG-FEM). Our study is conducted in the con-text of Firedrake, an automated code generation frame-work that drastically simplifies in-depth experimentationwith multiple problem discretisations, solution methodsand low-level performance optimisations.

Michael Lange, Christian JacobsDepartment of Earth Science and EngineeringImperial College [email protected], [email protected]

Fabio LuporiniDepartment of ComputingImperial College [email protected]

Gerard J. GormanImperial College [email protected]

PP1

High-Performance Tensor Algebra Package forGPU

Many applications, e.g., high-order FEM simulations, canbe expressed through tensors. Contractions by the first in-dex can often be represented using a tensor index reorder-ing followed by a matrix-matrix product, which is a keyfactor to achieve performance. We present ongoing workof a high-performance package in the MAGMA library toorganize tensor contractions, data storage, and batched ex-ecution of large number of small tensor contractions. Weuse automatic code generation techniques to provide anarchitecture-aware, user-friendly interface.

Ian MasliahUniversity [email protected]

Ahmad AbdelfattahUniversity of Tennessee, [email protected]

Marc BaboulinUniversity of [email protected]

Jack J. DongarraDepartment of Computer ScienceThe University of [email protected]

Joel FalcouUniversity of [email protected]

Azzam HaidarDepartment of Electrical Engineering and ComputerScienceUniversity of Tennessee, [email protected]

Stanimire TomovUniversity of Tennessee, [email protected]

PP1

Multilevel Approaches for FSAI Preconditioning inthe Iterative Solution of Linear Systems of Equa-tions

The solution of a sparse linear system of equations witha large SPD matrix is a common and computationally in-tensive task when considering numerical simulation tech-niques. This calculation is often done by iterative meth-ods whose performance is largely influenced by the selectedpreconditioner. In this work, two new preconditioners builtupon the dynamic FSAI algorithm in combination to twograph reordering techniques are presented, namely BTF-SAI and DDFSAI.

Victor A. Paludetto Magri, Andrea Franceschini, CarloJanna, Massimiliano FerronatoUniversity of Padova

PP16 Abstracts 145

[email protected], [email protected],[email protected], [email protected]

PP1

Large Scale Computation of Long Range Interac-tions in Molecular Dynamics Simulations

We consider a physical system of polyelectrolytes in ionicliquids, and perform a molecular dynamics simulation tostudy the dynamics of electrophoresis. A computationaleffort of including long range interactions among the par-ticles is made by using a large scale computer cluster. Weuse an efficient method of distributing computational loadsamong the nodes, and obtain a new result that is in agree-ment with experimental findings.

Pyeong Jun ParkSchool of Liberal Arts and SciencesKorea National University of [email protected]

PP1

Accelerated Subdomain Smoothing for Geophysi-cal Stokes Flow

Scalable approaches to solving linear systems arising fromgeophysical Stokes flow typically involve the use of multi-grid methods. Most of the computational effort is thusexpended applying smoothing operators, and as such onemight naturally ask if this process can be accelerated withthe use of increasingly-available per-node coprocessors onmodern large-scale computational clusters. Here we dis-cuss several alternate approaches to performing subdomainsmoothing on a hybrid CPU-GPU supercomputer, in anattempt to asses the performance gains possible from theuse of this new technology. Chebyshev-Jacobi and modernincomplete factorization approaches are considered.

Patrick SananUSI - Universita della Svizzera [email protected]

PP1

ExaHyPE - Towards an Exascale Hyperbolic PDEEngine

ExaHyPEs mission is to develop a novel simulation enginefor hyperbolic PDE problems at exascale. With our ap-proach - high-order discontinuous Galerkin discretizationon dynamically adaptive Cartesian grids - we tackle grand-challenge simulations in Seismology and Astrophysics. Wepresent a first petascale software release bringing togethersome of ExaHyPE’s key ingredients: Grid generation basedupon spacetrees in combination with space-filling curves, anovel ADER DG WENO scheme and compute kernels tai-lored to particular hardware generations.

Angelika SchwarzTechnische Universitat [email protected]

Vasco VarduhnUniversity of [email protected]

Michael BaderTechnische Universitat Munchen

[email protected]

Michael DumbserUniversity of [email protected]

Dominic E. Charrier, Tobias WeinzierlDurham [email protected],[email protected]

Luciano RezzollaFrankfurt Institute of Advanced [email protected]

Alice Gabriel, Heiner IgelLudwig-Maximilians-Universitat [email protected],[email protected]

PP1

Malleable Task-Graph Scheduling with a PracticalSpeed-Up Model

We focus on the problem of minimizing the makespan ofmalleable task graphs arising from multifrontal factoriza-tions of sparse matrices. The speedup of each task is as-sumed to be perfect but bounded. We propose a newheuristic which we prove to be a 2-approximation andwhich performs better than previous algorithms on simula-tions. The dataset consists of both synthetic and realisticgraphs.

Bertrand SimonENS [email protected]

Loris [email protected]

Oliver SinnenUniversity of [email protected]

Frederic VivienINRIA and ENS [email protected]

PP1

Overview of Intel Math Kernel Library (IntelMKL) Solutions for Eigenvalue Problems

We outline both existing and prospective Intel Math Ker-nel Library (Intel MKL) solutions for standard and general-ized Hermitian eigenvalue problems. Extended Eigensolverfunctionality, based on the accelerated subspace iterationFEAST algorithm, is a high-performance solution for ob-taining all the eigenvalues and optionally all the associatedeigenvectors, within a user-specified interval. To extendthe available functionality, a new approach for finding the klargest or smallest eigenvalues is proposed. We envision au-tomatically producing a search interval by a preprocessingstep based on classical and recent methods that estimateeigenvalue counts. Computational results are presented.

Irina Sokolova

146 PP16 Abstracts

Novosibirsk State UniversityPirogova street 2, Novosibirsk, Russia, [email protected]

Peter TangIntel Corporation2200 Mission College Blvd, Santa Clara, [email protected]

Alexander KalinkinIntel [email protected]

PP1

Achieving Robustness in Domain DecompositionMethods: Adpative Spectral Coarse Spaces

Domain decomposition methods are a popular way to solvelarge linear systems. For problems arising from practi-cal applications it is likely that the equations will havehighly heterogeneous coefficients. For example a tire ismade both of rubber and steel, which are two materialswith very different elastic behaviour laws. Many domaindecomposition methods do not perform well in this case,specially if the decomposition into subdomains does notaccommodate the coefficient variations. For three populardomain decomposition methods (Additive Schwarz, BDDand FETI) we propose a remedy to this problem based onlocal spectral decompositions. Numerical investigations forthe linear elasticity equations will confirm robustness withrespect to heterogeneous coefficients, automatic (non regu-lar) partitions into subdomains and nearly incompressiblebehaviour.


Victorita DoleanUniversity of [email protected]

Patrice [email protected]

Frederic NatafCNRS and [email protected]

Clemens PechsteinUniversity of [email protected]

Daniel J. RixenTU [email protected]

Robert ScheichlUniversity of [email protected]

PP1

A Proposal for a Batched Set of Blas.

We propose a Batched set of Basic Linear Algebra Subpro-

grams (Batched BLAS). We focus on multiple independentBLAS operations on small matrices that are grouped to-gether as a single routine. We aim to provide a more ef-ficient and portable library for multi/manycore HPC sys-tems. We achieve 2x speedups and 3x better energy ef-ficiency compared to vendor implementations. We alsodemonstrate the need for Batched BLAS and its potentialimpact in multiple application domains.

Pedro Valero-LaraUniversity of [email protected]

Jack J. DongarraDepartment of Computer ScienceThe University of [email protected]

Azzam HaidarDepartment of Electrical Engineering and ComputerScienceUniversity of Tennessee, [email protected]

Samuel ReltonUniversity of Manchester, [email protected]

Stanimire TomovUniversity of Tennessee, [email protected]

Mawussi ZounonUniveristy of [email protected]

PP1

Spectrum Slicing Solvers for Extreme-Scale Eigen-value Computations

We present progress on a recently developed new paral-lel eigensolver package named EVSL (Eigen-Value SlicingLibrary) for extracting a large number of eigenvalues andthe associated eigenvectors of a symmetric matrix. EVSLis based on a divide-and-conquer approach known as spec-trum slicing. It combines a thick-restart version of theLanczos algorithm or a subspace iteration with deflation,and uses a new type of polynomial filters constructed froma least squares approximation of the Dirac delta functionto obtain interior eigenvalues. Several computational re-sults are presented emphasizing parallel performance formatrices arising from the density functional theory basedelectronic structure calculations.

Eugene VecharynskiLawrence Berkeley National [email protected]

Ruipeng LiLawrence Livermore National [email protected]

Yuanzhe XiUniversity of [email protected]

Chao YangLawrence Berkeley National Lab

PP16 Abstracts 147

[email protected]

Yousef SaadDepartment of Computer ScienceUniversity of [email protected]

PP1

Compressed Hierarchical Schur Linear SystemSolver for 3D Photonic Device Analysis with GpuAccelerations

Finite-difference frequency-domain analysis of wave opticsand photonics requires linear system solver for Helmholtzequation. The linear system can be ill-conditioned whencomputation domain is large or perfectly-matched layersare used. We propose a direct method named CompressedHierarchical Schur for time and memory savings, with theGPU acceleration and hardware tuning. These new tech-niques can efficiently utilize modern high-performance en-vironment and greatly accelerate development of future de-velopment of photonic devices and circuits.

Cheng-Han DuDepartment of Mathematics, National Taiwan [email protected]

Weichung WangNational Taiwan UniversityInstitute of Applied Mathematical [email protected]

PP1

Parallel Adaptive and Robust Multigrid

Numerical simulation has become one of the major topics inComputational Science. To promote modelling and simula-tion of complex problems new strategies are needed allow-ing for the solution of large, complex model systems. Afterdiscussing the needs of large-scale simulation we point outbasic simulation strategies such as adaptivity, parallelismand multigrid solvers. These strategies are combined in thesimulation system UG4 (Unstructured Grids) being pre-sented in the following. We show the performance andefficiency of this strategy in various applications.


PP1

High performance eigenvalue solver for Hubbardmodel on CPU-GPU hybrid platform

When we solve the smallest eigenvalue and the correspond-ing eigenvector of the Hamiltonian derived from the Hub-bard model, we can understand properties of some interest-ing phenomenon in the quantum physics. Since the Hamil-tonian is a large sparse matrix, we usually utilize an it-eration method. In this presentation, we propose tuningstrategies for the LOBPCG method for the Hamiltonianon a CPU-GPU hybrid platform in consideration of theproperty of the model and show its performance.

Susumu YamadaJapan Atomic Energy [email protected]


Masahiko MachidaJapan Atomic Energy [email protected]

PP16 Abstracts · 66 PP16 Abstracts IP1 Scalabilityof Sparse Direct Codes...

Documents

Transcript of PP16 Abstracts · 66 PP16 Abstracts IP1 Scalabilityof Sparse Direct Codes...