
ESPRESO - Accelerated solver library for the Intel Xeon Phi systems

Research Institution: IT4Innovations National Supercomputing Centre, Ostrava

Principal investigators: Lubomir Riha, Tomas Brzobohaty

Researchers: Alexandros Markopoulos, Michal Merta, Ondrej Meca

Project partners: Universita della Svizzera italiana (Pardiso SC)

Project ID: DD-16-20

Introduction

ExaScale PaRallel FETI SOlver (ESPRESO) is a sparse linear solver library that the IT4Innovations National Supercomputing Centre in Ostrava, Czech Republic, has been developing since 2014. In 2016, the alpha version was released to the public and can be downloaded from the project website (espreso.it4i.cz).

Figure 2. The block diagram of the ESPRESO Solver

ESPRESO contains not only the linear solver but also several Finite Element Method (FEM) and Boundary Element Method (BEM) preprocessing tools designed particularly for FETI solvers; see Figure 2. The BEM support was produced in collaboration with the developers of the BEM4I library (bem4i.it4i.cz). The preprocessor supports FEM and BEM discretization of the advection-diffusion equation, Stokes flow, and structural mechanics. Real engineering problems can be imported from Ansys Workbench or OpenFOAM. In addition, a C API allows ESPRESO to be used as a solver library by third-party applications; this has been used for the integration with CSC's Elmer. For large-scale tests, the preprocessor also contains a multi-block benchmark generator. Post-processing and visualization are based on the VTK library and ParaView, including ParaView Catalyst for in situ visualization.

Funding for the development of the library comes from several sources, each focused on the development of particular features. For instance, EU FP7 EXA2CT project funding was used for the implementation of the Hybrid FETI algorithm; the excellent numerical and parallel scalability of this method is what makes the library massively parallel. A significant part of this work was done during a research internship at the Department of Aeronautics and Astronautics at Stanford University. Intel Parallel Computing Centre funding is used to develop a new approach to accelerating FETI methods in general, not just Hybrid FETI, with Intel Xeon Phi accelerators.

This particular research was designed to take full advantage of the IT4Innovations Salomon supercomputer, which, when installed, was the largest Intel Xeon Phi installation in Europe.

The project resources have been used to carry out two main tasks: (1) scalability testing and evaluation of the Hybrid Total FETI (HTFETI) method, and (2) development of Intel Xeon Phi and GPU acceleration of FETI methods.

Results and Methods

For many years, the Finite Element Tearing and Interconnecting (FETI) method [1] has been successfully used in the engineering community for solving very large problems arising from the discretization of partial differential equations.

In this approach, the original structure is decomposed into several non-overlapping subdomains. Mutual continuity of the primal variables between neighboring subdomains is then enforced by dual variables, i.e., Lagrange multipliers (LM). These are usually obtained iteratively by one of the Krylov subspace methods, after which the primal solution is evaluated locally for each subdomain.
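
A minimal sketch of this algebra, in notation assumed here for illustration (standard for FETI, but not taken from this report): let K be the block-diagonal matrix of subdomain stiffness matrices, f the corresponding right-hand side, and B the signed Boolean matrix enforcing continuity across subdomain interfaces. The constrained primal problem and the dual problem that is actually iterated on read

\[
K u + B^{T} \lambda = f, \qquad B u = 0,
\]
\[
F \lambda = d, \qquad F = B K^{+} B^{T}, \qquad d = B K^{+} f,
\]

where K^+ is a generalized inverse of the (generally singular) K, and the dual system is solved on the subspace given by G^T lambda = e, with G = B R and e = R^T f built from a basis R of the kernel of K. Once the Krylov iteration (typically projected conjugate gradients) has produced lambda, the primal solution u = K^+ (f - B^T lambda) + R alpha is recovered locally, subdomain by subdomain.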

In 2006, Dostal et al. [2] introduced a new variant of the algorithm called Total FETI (TFETI), in which the Dirichlet boundary conditions are also enforced by LM. The HTFETI method is a variant of the hybrid FETI methods introduced by Klawonn and Rheinbach [3] for FETI and FETI-DP. In the original approach, a number of subdomains are gathered into clusters, which can be seen as a three-level domain decomposition. Each cluster consists of a number of subdomains, and for these a FETI-DP system is set up. The clusters are then solved by a traditional FETI approach using projections to treat the non-trivial kernels. In HTFETI, by contrast, a TFETI approach is used for the subdomains within each cluster, and the FETI approach with projections is used across clusters.

The main advantage of HTFETI is its ability to solve problems decomposed into a very large number of subdomains: the coarse problem, which limits the scalability of classic TFETI, then grows with the number of clusters rather than with the number of subdomains.

Figure 1. The real-world engine benchmark


Figure 3. The weak scalability of the HTFETI method on the Salomon supercomputer, on up to 729 compute nodes.

On the Salomon supercomputer we performed both weak and strong scalability tests. Weak scalability gives users an idea of how large a problem can be solved on a given machine; these results are shown in Figure 3. The HTFETI method was able to solve up to 8.9 billion unknowns on 729 compute nodes when solving a structural mechanics problem. For strong scalability we used the real-world engine benchmark depicted in Figure 1. These tests showed that HTFETI can achieve superlinear scalability (see Figure 4) from 43 compute nodes (1024 MPI processes) to 343 compute nodes (8192 MPI processes) on a structural mechanics problem of approximately 300 million unknowns.
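
As a rough per-node measure of that weak-scaling result (simple arithmetic on the reported figures):

\[
\frac{8.9 \times 10^{9}\ \text{unknowns}}{729\ \text{nodes}} \approx 1.2 \times 10^{7}\ \text{unknowns per compute node}.
\]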

Figure 4. The superlinear strong scalability of the HTFETI method on the real-world engine benchmark

Acceleration of the Hybrid Total FETI domain decomposition method using Intel Xeon Phi coprocessors provided the key research for taking advantage of the Salomon machine. HTFETI is a memory-bound algorithm that relies on sparse BLAS operations with an irregular memory access pattern. We have developed a local Schur complement (LSC) method with a regular memory access pattern, which allows the solver to fully utilize the fast memory bandwidth of the Intel Xeon Phi.
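
The idea behind LSC can be illustrated with a small self-contained sketch in Python (hypothetical sizes and matrices; this is not ESPRESO code): the action B K^{-1} B^T of one subdomain on an interface vector is precomputed once as a dense Schur complement, so each solver iteration becomes a dense matrix-vector product with a regular, streaming access pattern instead of sparse triangular solves.

    import numpy as np
    from scipy.sparse import identity, random as sparse_random
    from scipy.sparse.linalg import splu

    # Hypothetical stand-in for one subdomain: K is a sparse stiffness-like
    # matrix, B picks out the m interface DOFs (here simply the last m).
    n, m = 2000, 100
    K = sparse_random(n, n, density=5e-3, random_state=0) + 10.0 * identity(n)
    K = ((K + K.T) * 0.5).tocsc()          # symmetrize, convert for splu
    B = np.zeros((m, n))
    B[:, n - m:] = np.eye(m)

    lu = splu(K)                            # sparse factorization, done once

    def apply_sparse(y):
        # per-iteration cost of the classic approach: sparse triangular
        # solves with an irregular memory access pattern
        return B @ lu.solve(B.T @ y)

    # Local Schur complement: assemble S = B * K^-1 * B^T once (m sparse
    # solves), then keep it as a dense m x m block on the accelerator.
    S = B @ lu.solve(B.T)

    def apply_lsc(y):
        # per-iteration cost of the LSC approach: one dense matrix-vector
        # product with a regular, streaming access pattern
        return S @ y

    y = np.random.rand(m)
    assert np.allclose(apply_sparse(y), apply_lsc(y))

The trade-off is the one-time assembly cost and the dense storage of S; the speedups reported below come precisely from exchanging irregular sparse solves for streaming dense kernels that accelerators execute well.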

Figure 5. The speed-up achieved by the Intel Xeon Phi acceleration in combination with the local Schur complement method can be as high as 7.8.

This translates to a speedup of over 7.8 for the HTFETI iterative solver when solving a 3-billion-unknown heat transfer problem (Laplace equation) on almost 400 compute nodes with 800 Xeon Phi accelerators. The comparison is carried out between the CPU computation using sparse data structures (the PARDISO solver) and the local Schur complement computation on the Xeon Phi. For a structural mechanics problem (linear elasticity) of 1 billion DOFs, the corresponding speedup is 3.4.

The presented speedups are asymptotic: they are reached for problems requiring a high number of iterations (e.g., ill-conditioned, transient, or contact problems). For problems that can be solved in fewer than one hundred iterations, the local Schur complement method is not suitable; for these cases, we have implemented sparse matrix processing using PARDISO also for the Xeon Phi accelerators. The weak scalability of the Xeon Phi accelerated version is shown in Figure 5.
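
A simple amortization argument (our notation, not taken from the report) makes this threshold explicit: with a one-time Schur complement assembly cost T_asm, a dense per-iteration cost t_sc, and a sparse per-iteration cost t_sp > t_sc, the LSC variant wins once

\[
T_{\mathrm{asm}} + n_{\mathrm{it}}\, t_{\mathrm{sc}} < n_{\mathrm{it}}\, t_{\mathrm{sp}}
\quad \Longleftrightarrow \quad
n_{\mathrm{it}} > \frac{T_{\mathrm{asm}}}{t_{\mathrm{sp}} - t_{\mathrm{sc}}},
\]

consistent with the observation above that the method pays off only for runs needing on the order of a hundred iterations or more.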

On-going Research / Outlook

In the coming months, we would like to further develop the ESPRESO FEM package, which will allow us to solve real-world problems more efficiently. On the solver side, we will work on improving the numerical scalability of the Hybrid Total FETI method and on its acceleration using the next-generation Intel Xeon Phi Knights Landing processors.

Conclusion

Being able to run scalability tests at near full scale of the Salomon machine gives us initial insight into the scalability issues that have to be addressed in order to run the library on the world's largest petascale systems, which is our next goal.

The Intel Xeon Phi coprocessor is a modern many-core architecture that shares several key features (fast memory, a large number of SIMD cores) with GPU accelerators. The efficiency of the proposed local Schur complement method has been successfully evaluated on the Xeon Phi (the Knights Corner generation) in this report, and one can expect similar behavior on GPUs.

Based on our survey of upcoming pre-exascale machines, one cannot ignore many-core architectures and heterogeneous systems with some type of accelerator. We will therefore put more effort into this type of research in the near future.

References

[1] C. Farhat, J. Mandel, and F.-X. Roux. Optimal convergence properties of the FETI domain decomposition method. Computer Methods in Applied Mechanics and Engineering, 115:365–385, 1994.

[2] Z. Dostal, D. Horak, and R. Kucera. Total FETI - an easier implementable variant of the FETI method for numerical solution of elliptic PDE. Communications in Numerical Methods in Engineering, 22(12):1155–1162, 2006.

[3] A. Klawonn and O. Rheinbach. Highly scalable parallel domain decomposition methods with an application to biomechanics. ZAMM, 90(1):5–32, 2010.

Publications

[1] Riha, Lubomir; Brzobohaty, Tomas; Markopoulos, Alexandros; Meca, Ondrej; Kozubek, Tomas: "Massively Parallel Hybrid Total FETI (HTFETI) Solver" (conference paper), Platform for Advanced Scientific Computing Conference (PASC), ACM, 2016, ISBN: 978-1-4503-4126-4.

[2] Riha, Lubomir; Brzobohaty, Tomas; Markopoulos, Alexandros; Meca, Ondrej; Kozubek, Tomas; Schenk, Olaf; Vanroose, Wim: "Efficient Implementation of Total FETI Solver for Graphic Processing Units Using Schur Complement" (conference paper), HPCSE 2015, LNCS 9611, 2016.

[3] Riha, Lubomir; Brzobohaty, Tomas; Markopoulos, Alexandros: "Hybrid parallelization of the Total FETI solver" (journal article), Advances in Engineering Software, 2016, ISSN: 0965-9978.

[4] Riha, Lubomir; Brzobohaty, Tomas; Markopoulos, Alexandros; Jarosova, Marta; Kozubek, Tomas; Horak, David; Hapla, Vaclav: "Implementation of the Efficient Communication Layer for the Highly Parallel Total FETI and Hybrid Total FETI Solvers" (journal article), Parallel Computing, 2016, ISSN: 0167-8191.

Project website: espreso.it4i.cz