
S-EnKF: Co-designing for Scalable Ensemble Kalman Filter

Junmin Xiao, Shijie Wang, Weiqiang Wan, Xuehai Hong, and Guangming Tan
State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences
University of Chinese Academy of Sciences
[email protected] {wangshijie, wanweiqiang}@ncic.ac.cn {hxh, tgm}@ict.ac.cn

Abstract
The ensemble Kalman filter (EnKF) is one of the most important methods for data assimilation; it is widely applied to the reconstruction of observed historical data for providing initial conditions of numerical atmospheric and oceanic models. With the improvement of data resolution and the increase in the amount of model data, the scalability of recent parallel implementations suffers from high overhead on data transfer. In this paper, we propose S-EnKF: a scalable and distributed EnKF adaptation for modern clusters. With an in-depth analysis of new requirements brought forward by recent frameworks and of the limitations of current designs, we present a co-design of S-EnKF. To fully exploit the resources available in modern parallel file systems, we design a concurrent access approach to accelerate the process of reading large amounts of background data. Through a deeper investigation of the data dependence relations, we modify EnKF's workflow to maximize the overlap of file reading and local analysis with a new multi-stage computation approach. Furthermore, we push the envelope of performance further with an aggressive co-design of auto-tuning, trading off the benefit on runtime against the cost on processors based on classic cost models. The experimental evaluation of S-EnKF demonstrates nearly ideal strong scalability on up to 12,000 processors. The largest run sustains a 3x speedup over P-EnKF, the state-of-the-art parallel implementation of EnKF.

CCS Concepts • Computing methodologies → Massively parallel algorithms; • Applied computing → Environmental sciences.

Keywords ensemble Kalman filter, parallel implementation, domain localization, data assimilation, scalability


1 Introduction
Data assimilation initially developed in the field of numerical weather prediction. It is widely applied to the reconstruction of observed historical data for providing initial conditions of numerical atmospheric [16, 20] and oceanic models [7, 34]. The ensemble Kalman filter (EnKF) is a sophisticated sequential data assimilation method. It applies an ensemble of model states to represent the error statistics of the model estimate [27], and utilizes ensemble integrations to predict the error statistics forward in time [22]. In EnKF, an analysis scheme operates directly on the ensemble of model states when observations are assimilated [17]. As it is essential for atmospheric and oceanic physics research to understand dynamic behaviors of the global circulation at increasingly fine resolutions [9], high-resolution data assimilation has been developed in recent years. In order to enable high-fidelity assimilation for realistic problems, the parallelization of EnKF has become an urgent demand.

Recent decades have witnessed the development of parallel methods for the implementation of EnKF. Several parallel local EnKF (L-EnKF) implementations have been proposed [25] based on domain localization with a given radius of influence r. Usually, the parallel assimilation process is performed for each model component in parallel, making use of a deterministic formulation of the EnKF in the ensemble space. In this formulation, the unknown background error covariance matrix is estimated by the rank-deficient ensemble covariance matrix, which is well-defined in the ensemble space. L-EnKF relies on the assumption that local analysis avoids the impact of spurious correlations, for instance, by considering only small values of r. However, in operational data assimilation, r can be large owing to circumstances such as sparse observational networks and/or long-distance data error correlations (e.g., pressure fields). In such cases, the accuracy of L-EnKF can be negatively impacted by spurious correlations. In order to improve the accuracy of L-EnKF methods, an efficient parallel ensemble Kalman filter (P-EnKF) has been proposed, which uses a better estimation of background error correlations based on the modified Cholesky decomposition; it is also deeply optimized and represents the state-of-the-art implementation [23, 24]. Although P-EnKF performs better in terms of the accuracy of model variables, its computational time is close to that of L-EnKF methods


[24]. The main reason is that P-EnKF is designed in the same computational framework used by all L-EnKF methods.

The recent parallel framework for EnKF can be simply described as two phases: obtaining local data and executing the local update. The popular optimization method is to improve the parallel performance of each phase separately. It seems difficult to overlap the two phases and consider them as a whole, because the local update does not begin until the local data has been obtained by the processors. Although the processor that first obtains all its local data could begin the local analysis first, the overlap ratio of the two phases is small. Since the two phases are almost separate from each other, and the parallel file I/O in the first phase takes significantly longer as the data density increases, the parallel scalability of the recent framework suffers from bottleneck problems [10]. For example, consider the data assimilation of 0.1° resolution data with 120 samples. The time for file reading dominates the runtime as the number of processors increases (Figure 1), which negatively impacts the performance of parallel implementations of EnKF (refer to Section 5.2).

In order to efficiently co-design file reading and local updating, this work conducts a deeper investigation of the data structures and re-designs the recent workflow. Based on the fine-grain analysis, we propose S-EnKF: a novel and scalable EnKF that provides efficient scale-up and scale-out for distributed computation on HPC systems. To the best of our knowledge, this is the first study that provides an in-depth analysis of fundamental algorithm-level bottlenecks faced by data assimilation methods like P-EnKF, and highlights novel co-designed solutions to achieve high performance and scalability.

We make the following key contributions:

• Analyze and identify new requirements brought forward by the recent EnKF framework, especially for large amounts of I/O and communication.

• Co-design, implement, and evaluate S-EnKF: a scalable and distributed EnKF.

• Propose a concurrent access approach to speed up the process of file reading and fully exploit the resources available in modern parallel file systems.

• Propose a novel overlapping strategy for data obtaining and local updating via multi-stage computation and helper-thread based communication.

• Co-design and implement an auto-tuning approach for determining the optimal parameters of the overlapping strategy in order to take maximal advantage of distributed-memory platforms.

• Evaluate and analyze the performance of the proposed S-EnKF with large data sets, which proves that S-EnKF supports large-scale data assimilation well.

Figure 1. Percentage of time for I/O and computation in P-EnKF.

2 Preliminaries
In this section, we provide an overview of different parallel implementations of EnKF, and a detailed description of the recent EnKF framework.

2.1 Overview of EnKF
For the data assimilation, EnKF gives the solution by solving the following equation

X^a = X^b + δX^a ∈ R^{n×N},  (1)

where n is the number of model components, and N is the ensemble size. X^a is the analysis result. X^b is a given background ensemble which is taken from the long-time model integration, and can be represented as

X^b = [X^b[1], X^b[2], · · · , X^b[N]] ∈ R^{n×N},  (2)

where X^b[i] ∈ R^{n×1} is the i-th ensemble member. δX^a is known as the analysis increment. In general, the analysis increment of EnKF reads

δX^a = B H^T · [R + H B H^T]^{−1} · [Y^s − H X^b] ∈ R^{n×N},  (3)

where H ∈ R^{m×n} is the linear observational operator, m is the number of observed components, and R ∈ R^{m×m} is the estimated data error covariance matrix. The superscript T denotes matrix transpose. Y^s ∈ R^{m×N} is the matrix of perturbed observations with data error distribution N(0_m, R). B is the background error covariance matrix, which can be estimated by

B = U U^T / (N − 1),  with  U = X^b − x̄^b ⊗ 1_N^T ∈ R^{n×N},  (4)

where x̄^b is the ensemble mean, 1_N ∈ R^N is a vector whose N components are all ones, and ⊗ denotes the outer product of two vectors. If B̂^{−1} is an approximation of B^{−1}, then equation (3) is equivalent to

δX^a = [B̂^{−1} + H^T R^{−1} H]^{−1} · H^T · R^{−1} · [Y^s − H X^b].  (5)
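As a concrete reference for equations (1)-(5), the following is a minimal NumPy sketch of the global (non-localized) analysis increment; the function name is illustrative, and the explicit construction of B is only viable for small n, which is exactly why Section 2.2 turns to localization.

import numpy as np

def enkf_increment(Xb, H, R, Ys):
    # Analysis increment (3): B H^T (R + H B H^T)^{-1} (Ys - H Xb).
    N = Xb.shape[1]
    U = Xb - Xb.mean(axis=1, keepdims=True)   # deviations from the ensemble mean
    B = (U @ U.T) / (N - 1)                   # ensemble covariance estimate, eq. (4)
    S = R + H @ B @ H.T                       # innovation covariance, m x m
    D = Ys - H @ Xb                           # innovations, m x N
    return B @ H.T @ np.linalg.solve(S, D)    # delta X^a, n x N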


Figure 2. Localization of EnKF: (a) a local box on the latitude-longitude grid; (b) the expansion of a sub-domain.

2.2 Parallel Implementation of EnKF
Localization is commonly used in the context of data assimilation in order to mitigate the impact of spurious correlations in the assimilation process. In general, two forms of localization methods are used: covariance matrix localization and domain localization; both have been proven to be equivalent [28]. In practice, covariance matrix localization can be very difficult owing to the explicit representation in memory of the ensemble covariance matrix. On the other hand, domain localization methods avoid spurious correlations by only considering observations within a given radius of influence r. In the 2-dimensional case, each model component is surrounded by a local box of dimension (2ξ + 1, 2η + 1) which contains the scope of a circle with radius r. ξ may not be equal to η, since the spacing between two lattice points differs along the longitude and latitude directions in an n_x × n_y mesh with n_x ≫ n_y [14]. The information within the scope of the local box, such as observed components and background error correlations, is used in the assimilation process, while the information outside the local box is discarded. In Figure 2(a), a local box for an influence scope with r = 10 km is shown (here, ξ = 4, η = 2). The red grid point is the one to be assimilated, blue points are used in the assimilation process, while gray points are discarded. This characteristic is the foundation for developing parallel implementations of EnKF.

Based on the localization, a domain decomposition is proposed in order to reduce the dimension of the data assimilation problem. At the beginning, the n_x × n_y latitude-longitude mesh (n = n_x × n_y) is split into n_s non-overlapping sub-domains along the longitude and latitude directions. Without loss of generality, we assume that there are n_sdx and n_sdy sub-domains along the longitude and latitude directions respectively, and that n_x (or n_y) is a multiple of n_sdx (or n_sdy). Hence, n_s = n_sdx × n_sdy holds, and the total number of model components in each sub-domain is n_sd = (n_x · n_y)/(n_sdx · n_sdy). Since the update of each point needs the information in a local box (Figure 2(a)), each sub-domain needs some additional model grid points for local analysis. We define the expansion D̂_{i,j} of a sub-domain D_{i,j} as the point set which includes the sub-domain and all additional points needed for local analysis (Figure 2(b)). This means that an expansion contains all data for local assimilation at a sub-domain. Denote by X^b[k]_{[i,j]} the data of a background ensemble member X^b[k] on D̂_{i,j} (k = 1, 2, · · · , N). If we consider the data assimilation at a sub-domain D_{i,j}, the local analysis reads

X^a_{[i,j]} = P_{i,j} · [X^b_{[i,j]} + [B̂^{−1}_{[i,j]} + H^T_{[i,j]} R^{−1}_{[i,j]} H_{[i,j]}]^{−1} · H^T_{[i,j]} · R^{−1}_{[i,j]} · [Y^s_{[i,j]} − H_{[i,j]} X^b_{[i,j]}]],  (6)

where X^a_{[i,j]} ∈ R^{n_sd×N} is the analysis result at the sub-domain D_{i,j}. H_{[i,j]} ∈ R^{m̂_sd×n̂_sd} is the observational operator on D̂_{i,j}, n̂_sd = (n_x/n_sdx + 2ξ) · (n_y/n_sdy + 2η) is the number of model components in D̂_{i,j}, and m̂_sd is the number of observed components in D̂_{i,j}. Y^s_{[i,j]} ∈ R^{m̂_sd×N} is the subset of perturbed observations. B̂^{−1}_{[i,j]} ∈ R^{n̂_sd×n̂_sd} is the local inverse estimation of the background error covariance matrix. R_{[i,j]} ∈ R^{m̂_sd×m̂_sd} is the local data-error covariance information. P_{i,j} ∈ R^{n_sd×n̂_sd} is a matrix projecting the model components on D̂_{i,j} onto D_{i,j}. In the real implementation, P_{i,j} does not need to be constructed, and we only need to focus on the update of the model components at the sub-domain D_{i,j}. For each sub-domain, the local analysis (6) is computed, and then all results are mapped back onto the global domain.

In the parallel implementation of EnKF [4], the model state is divided according to the number of sub-domains. Next, the background ensemble, the observed components, the observation operator, and the estimated data error correlations at each sub-domain need to be read from the file systems. After all local information on D̂_{i,j} has been obtained by the processor responsible for the data assimilation on D_{i,j}, the local analysis (6) is executed.
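To make the decomposition concrete, the following is a small sketch of the index bookkeeping for a sub-domain and its expansion; clamping at the mesh boundary is a simplifying assumption (an operational code would handle the longitudinal wrap-around), and all names are illustrative.

import numpy as np

def expansion_slices(i, j, nx, ny, nsdx, nsdy, xi, eta):
    # Core sub-domain D_{i,j} on the nx-by-ny mesh.
    bx, by = nx // nsdx, ny // nsdy
    x0, x1 = i * bx, (i + 1) * bx
    y0, y1 = j * by, (j + 1) * by
    # Expansion: widen by the local-box half-widths (xi, eta), clamped to the mesh.
    return (slice(max(x0 - xi, 0), min(x1 + xi, nx)),
            slice(max(y0 - eta, 0), min(y1 + eta, ny)))

# Example: the expansion of D_{2,3} in one member viewed as an nx-by-ny tensor.
member = np.zeros((3600, 1800))
rows, cols = expansion_slices(2, 3, 3600, 1800, 100, 10, xi=4, eta=2)
local = member[rows, cols]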

2.3 Parallel EnKF Framework
To the best of our knowledge, the recent parallel implementations of EnKF follow the same framework. The general parallel framework for EnKF can be simply described as two phases: obtaining local data and executing the local update.

For obtaining local data, there are two alternative design choices: the first is to choose a single processor for reading the data and scattering it to the other processors, and the second is to use all processors to obtain their local data from disks in parallel. When the data resolution is not high and the number of background ensemble members is not large, the first approach is usually used. However, on massive computing platforms, the second approach is considered first, because it can be optimized for parallel file systems where parallel file reading from disks can be faster than exchanging data using MPI-level communication.

For executing the local update, the local assimilation process is carried out in parallel without the need for inter-processor communication. In this phase, the most time-consuming part is performing the matrix inversion, which is usually achieved


by using Cholesky decomposition [5, 8]. In order to accelerate the local update, some efficient modified Cholesky decompositions have been designed to obtain approximate estimators of the inverse background error covariance matrix [23, 29]. In general, there are good linear solvers in the current literature, and some of them are well-known and used in operational data assimilation, such as LAPACK [1] and CuBLAS [6]. Compact representations of matrices can be used as well, in order to exploit the structure of B̂^{−1}_{[i,j]} in terms of memory allocation.
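As an illustration of this local update, the sketch below solves the inner linear system of (6) for one sub-domain using SciPy's Cholesky routines (which wrap LAPACK); dense matrices and the omission of the projection P_{i,j} are simplifying assumptions.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def local_update(Xb_loc, Binv_loc, H_loc, R_loc, Ys_loc):
    # A = B^{-1} + H^T R^{-1} H is symmetric positive definite on the expansion.
    A = Binv_loc + H_loc.T @ np.linalg.solve(R_loc, H_loc)
    rhs = H_loc.T @ np.linalg.solve(R_loc, Ys_loc - H_loc @ Xb_loc)
    factor = cho_factor(A)                     # Cholesky factorization of A
    return Xb_loc + cho_solve(factor, rhs)     # updated local ensemble, cf. eq. (6)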

3 Challenges for Designing Scalable EnKF
In this section, we elaborate on specific challenges, requirements, and issues that need to be addressed in order to design a scalable EnKF on distributed-memory systems.

3.1 Parallel Reading
HPC systems usually use a parallel file system and storage nodes. In order to limit the I/O overhead of reading the large datasets involved in data assimilation, the general framework needs to be redesigned with parallel reading capabilities to take advantage of parallel file systems like Lustre [18]. On the parallel computing platforms at most supercomputing centers, all files are distributed over several storage nodes, and users cannot know exactly which node stores a given file. Hence, it is difficult to customize an I/O solution for file reading at the system level. Consequently, it becomes important to consider parallel I/O at the algorithm level. One of the challenges in developing a scalable EnKF is feeding the data efficiently to all the processors for local analysis. In L-EnKF, a single reader processor communicates the data to the other processors, which cannot make full use of parallel file systems. In order to read a background ensemble member, which is stored as an independent file, in parallel, P-EnKF uses all processors to access different parts of the file simultaneously (Figure 3). When the number of processors becomes large, it is inevitable that processors line up to access the data, because disks cannot support an unlimited number of processors accessing them at the same time. Moreover, as the files (corresponding to different background ensemble members) are read one after another in recent implementations, the tremendous increase in the number of files may pose additional restrictions on the scalability of parallel EnKF. In conclusion, it is necessary to design an efficient parallel data reading mechanism.

3.2 Exploiting Overlap of File Reading and Local Updating
The recent EnKF framework is both I/O and computation intensive. Hence, in order to scale out such a framework, it is mandatory to overlap the computation process with I/O. Indeed, the data assimilation is orchestrated in clearly marked phases like data reading and local updating. The issue with

Figure 3. Block reading approach.

Figure 4. Recent workflow for parallel EnKFs.

marked phases, each blocking until the completion of the previous phase, is the sequential nature of the workflow (Figure 4). In the recent parallel implementations of EnKF, each processor has to obtain all local background ensemble members before it executes the local analysis. Thus, the I/O and computation processes are completely separate. Such separation of the file reading and local analysis phases severely limits the efficiency and scalability of parallel implementations of EnKF. In order to alleviate these limitations, a total redesign of the workflow to bring overlap between the different phases is required. Such a redesign requires: (1) an in-depth analysis of the workflow and its different phases, (2) a fine-grain restructuring of the coarse-grain phases, thereby exposing potential for overlap, and (3) an exploitation of new communication ways to perform computation/communication overlap.

4 Co-designs for S-EnKF
We now describe how we tackle the above challenges in designing S-EnKF. Without loss of generality, we assume that the numbers of processors used along the longitude and latitude directions are equal to n_sdx and n_sdy respectively, and that each processor is responsible for the local analysis of exactly one sub-domain.

4.1 Efficient Parallel Reading Mechanism
For the local analysis (6), X^b ∈ R^{n×N}, H ∈ R^{m×n}, R ∈ R^{m×m} and Y^s ∈ R^{m×N} have to be obtained from the file system. Since N < m ≪ n in real applications, the sizes of R and Y^s are small, and they can be read by each processor directly. Although the size of H seems larger than that of X^b, the observational operator H can be constructed from limited observational data which only need to be read from disk. Hence, the I/O cost for H is small, and obtaining H is fast. With n and N increasing, the data size of X^b grows to hundreds of GB. Therefore, the key to designing an efficient parallel reading mechanism is how to read X^b efficiently.

Figure 5. Time for file reading using the block reading approach. Here, n_sdy = 10 is fixed, and n_sdx increases from 100 to 500.

4.1.1 Defects of the Block Reading Approach
In order to optimize file access operations, we investigate the recent design choice, which is to use all processors as data readers for parallel file access with no MPI-level communication (Figure 3). This option is suitable for parallel file systems like Lustre, where parallel reading from disk can be faster than exchanging data using MPI-level communication. However, with the improvement of data resolution and the increase in the number of processors involved in the data assimilation, many processors access the same disk simultaneously, which leads to many processors waiting for the disk resource to become available. Therefore, this approach cannot make full use of the concurrent read-write performance of the file system.

On the other hand, due to n = n_x × n_y, each background ensemble member X^b[k] ∈ R^{n×1}, a 1-dimensional vector, can essentially be seen as a 2-dimensional tensor X^b[k] ∈ R^{n_x×n_y} which is stored contiguously on disk in row-major order. As shown in Figure 3, each processor reads a local block X^b[k]_{[i,j]} from the tensor X^b[k]. Since each file is split into n_sdx parts along the longitude (row) direction, no local block can be read contiguously. Further, the number of disk addressing operations in the block reading approach increases linearly with the number of longitudinal subdivisions; it is O(n_y × n_sdx). For instance, Figure 5 presents the time taken by the block reading approach to access 100 background ensemble members with different numbers of processors. The time of this reading approach increases almost linearly as n_sdx grows.

Figure 6. Bar reading approach.

4.1.2 From the Block Reading Approach to the Bar Reading Approach
Even though the block reading approach can be efficient at small scale, and especially for small data sets like atmospheric and oceanic data with 1° resolution [12], a co-design approach is required to push the scalability and efficiency of EnKF further.

In order to effectively prevent a large number of processors from simultaneously accessing a disk, and to reduce the overhead of disk addressing during file reading, the proposed design is the bar reading approach (Figure 6). Firstly, the data of each background ensemble member is divided into n_sdy bars along the latitude direction, and the i-th bar is denoted as X^b[k]_{[i]}. Each bar consists of a part of a background ensemble member X^b[k] and is stored contiguously on a disk. Secondly, n_sdy processors are chosen for the parallel reading of the data. They are called I/O processors, and each of them is responsible for reading one bar. Thirdly, the I/O processors split their data bars into many overlapping small blocks, which are the X^b[k]_{[i,j]} in (6), and send different blocks to different processors. This approach uses MPI-level communication to distribute the data, but the communication process is fast because each block is small. Section 5 shows that the MPI communication time is comparable with the file reading time (Figure 9).

In the bar reading approach, the number of processors reading data from disks can be chosen according to the maximum disk concurrency of the parallel file system. It is worth mentioning that bar reading enables each I/O processor to access the contiguous data X^b[k]_{[i]} on disk with only one disk addressing operation.
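A minimal mpi4py sketch of the bar reading idea is shown below; the rank layout, file names, and the omission of the halo rows (the 2η overlap) are illustrative assumptions rather than the authors' implementation. It would be run with nsdy · (nsdx + 1) ranks, the first nsdy of which act as I/O processors.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
N, nsdx, nsdy = 120, 100, 10               # members and decomposition (Section 5 sizes)
bar = 3600 * 1800 // nsdy                  # elements in one contiguous bar

if rank < nsdy:                            # I/O processors: one seek per bar
    for k in range(N):
        with open(f"member_{k:03d}.bin", "rb") as f:   # hypothetical file name
            f.seek(rank * bar * 8)                     # single addressing operation
            data = np.fromfile(f, dtype=np.float64, count=bar)
        for j, blk in enumerate(np.array_split(data, nsdx)):
            comm.Send(blk, dest=nsdy + rank * nsdx + j, tag=k)
else:                                      # computation processors: receive blocks
    src = (rank - nsdy) // nsdx            # the I/O processor that owns this bar
    blk = np.empty(bar // nsdx, dtype=np.float64)
    for k in range(N):
        comm.Recv(blk, source=src, tag=k)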

4.1.3 Concurrent Access Approach Based on Bar Reading
The bar reading approach above mainly focuses on reading one file efficiently, while the files are accessed one after another. As shown in Figure 6, a large number of background ensemble members (files) have to be obtained before the local analysis starts. All background ensemble members are distributed in the file system, where two different files may be stored on either the same disk or two different physical disks. Hence, in the co-design for a scalable EnKF, it is necessary to consider how to efficiently read several files, which with high probability are stored on different disks, at the same time. The motivation is to

Figure 7. Concurrent access approach for multi-stage computation.

fully exploit the potential of the concurrent I/O performance supported by the disks in a parallel file system.

For reading N background ensemble members, the proposed concurrent access approach is as follows. Firstly, similar to the bar reading approach, several I/O processors are chosen for reading data. Next, the I/O processors are assigned to n_cg concurrent groups, which are responsible for accessing n_cg different files simultaneously. The I/O processors in the same group read the same file at once. Finally, each group only needs to read N/n_cg files.
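A sketch of how the I/O ranks might be organized into concurrent groups with an MPI communicator split follows; the round-robin file-to-group assignment is an illustrative assumption.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
ncg, nsdy, N = 4, 10, 120                  # groups, bars per file, members

# I/O ranks 0 .. ncg*nsdy-1; group g handles files g, g+ncg, g+2*ncg, ...
# (computation ranks would call Split with a distinct color; omitted for brevity)
group = rank // nsdy                       # concurrent group of this I/O rank
io_comm = comm.Split(color=group, key=rank)     # one sub-communicator per group
my_files = [f"member_{k:03d}.bin" for k in range(group, N, ncg)]  # N/ncg files
# Within io_comm, each of the nsdy ranks reads its own bar of every file above.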

4.2 Multi-stage Computation for Maximal Overlap
As discussed above, the concurrent access approach can effectively improve the data reading process, but it also brings additional communication overhead. Hence, the workflow has three main phases: file reading, data communication and local analysis. It becomes important for the development of S-EnKF to overlap the local analysis with the file reading and data communication.

In order to efficiently co-design the three main phases, we conduct a deeper investigation of the data structures and re-design the workflow. Since the coarse-grain workflow in the recent framework does not support the design of overlapping strategies, we move from a coarse-grain workflow with three successive phases to a fine-grain workflow with multiple stages. On this basis, we propose a new workflow which overlaps the local analysis with file reading and data communication. It is worthwhile to note that the proposed S-EnKF workflow reads data from disks and communicates the block data in an on-demand fashion, instead of the coarse-grain approach that performs the acquisition of local data in a single operation.

In the coarse-grain workflow of EnKF based on the concurrent access approach, each of the file reading, data communication and local analysis phases proceeds only after the completion of the former one. Before each processor executes the local update, it must obtain the data on the same block of all background ensemble members. This fact implies that it is impossible to overlap computation with I/O and communication operations in the recent framework. However, if we view the procedure above at fine grain, new insights can be obtained.

Figure 8. Multi-stage computation strategy.

First of all, we consider the update of one grid point in any sub-domain. Figure 2(a) shows that the update of one grid point only depends on the points within a certain range around it. This fact implies that updating one point in a sub-domain D_{i,j} does not need all data on the expansion D̂_{i,j}. Hence, if we plan to update only a few of the points in a sub-domain, we only need to read a part of the data rather than all information on the expansion. Secondly, if we decompose D_{i,j} into L layers D'_{i,j,l} and update the D'_{i,j,l} one after another, then the local analysis on D_{i,j} is divided into multiple stages. Thirdly, for updating D'_{i,j,l}, the file reading and data communication are adjusted accordingly: only the necessary data are read and sent for the local analysis on D'_{i,j,l} (Figure 7). In this way, a coarse-grain workflow with three successive phases is changed into a fine-grain workflow with multiple stages, which provides an opportunity for designing overlapping strategies.

As shown in Figure 8, we propose a multi-stage on-demand design to overlap the local analysis with file reading and data communication. On one hand, to maximize the overlap potential, we use some processors for I/O and the others for local analysis; I/O and computation processors work simultaneously. At the beginning, each domain D_{i,j} is decomposed into L layers D'_{i,j,l} which are updated one after another (1 ≤ l ≤ L). In this way, the local analysis on D_{i,j} is also divided into L stages. At the l-th stage, the computation processor with (i,j)-index executes the local analysis on D'_{i,j,l}, and the I/O processors read the data for the (l+1)-th stage using the concurrent access approach. On the other hand, to split the flow between the communication and computation phases on the computation processors, we offload the invocation of the communication with the I/O processors to a helper thread. The helper thread signals the main thread to invoke the local updates in a multi-stage fashion. Figure 8 shows how the computation phases hide the reading and communication operations in the multi-stage flow.
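The computation-processor side of this design can be schematized as below; the helper thread and the queue-based hand-off mirror Figure 8, while the two stand-in functions model the stage receive and the per-layer analysis (they are placeholders, not the real MPI and solver calls).

import threading, queue, time

L = 8                                      # layers per sub-domain (chosen by auto-tuning)
stage_data = queue.Queue()                 # helper thread -> main thread hand-off

def recv_stage_block(l):
    # Stand-in for receiving the stage-l block data from the I/O processors.
    time.sleep(0.01)
    return f"block-{l}"

def local_analysis_on_layer(l, block):
    # Stand-in for the local analysis (6) restricted to the layer D'_{i,j,l}.
    time.sleep(0.02)

def helper():                              # helper thread: keeps the next stage in flight
    for l in range(L):
        stage_data.put((l, recv_stage_block(l)))   # signal: stage l data is ready

t = threading.Thread(target=helper)
t.start()
for _ in range(L):                         # main thread: consume stages as they arrive
    l, block = stage_data.get()
    local_analysis_on_layer(l, block)
t.join()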


Table 1. Notations used in cost models

Variable | Definition
N     | Number of background ensemble members (files)
n     | Number of model components (grid points)
n_x   | Number of grid points along the longitude direction
n_y   | Number of grid points along the latitude direction
n_sdx | Number of sub-domains along the longitude direction
n_sdy | Number of sub-domains along the latitude direction
n_cg  | Number of concurrent groups for file reading
a     | Startup time per message
b     | Transfer time per byte for messages
c     | Computation cost of the local analysis on each grid point
θ     | Transfer time per byte from disk to memory
ξ     | Radius of influence along the longitude direction
η     | Radius of influence along the latitude direction
L     | Number of layers in each domain
h     | Volume of data per grid point

4.3 Modeling Cost of Multi-stage Computation
To estimate the cost of the operation, we extend the cost models used in [3, 26, 30] by treating reading data from disk to node-memory similarly to shared-memory copies. Table 1 presents all variables used in the cost models. We assume a full-duplex mode of communication, i.e., messages going in one direction do not affect messages in the opposite direction. With this model, the costs of the three main procedures in the multi-stage computation strategy can be represented as follows.

(1) Reading data from disks to node-memory: In this phase, each background ensemble member is decomposed into n_sdy × L overlapping small bars, and each I/O concurrent group is responsible for reading the small bars of N/n_cg files. The n_sdy processors in the same concurrent group read different parts of a file, and each processor gets one small bar at a time. The cost of this phase is

T_read = ((n_y/(n_sdy · L) + 2η) · n_x · h · (N/n_cg) · θ) · log(n_cg · n_sdy).  (7)

(2) Communication: In this phase, each I/O processor sends n_sdx starting addresses of different parts of the small bar data to n_sdx different computation processors. After receiving the communication requests, each helper thread of the computation processors receives n_cg block data from the n_cg I/O processors in different concurrent groups, where the size of each block is (n_y/(n_sdy · L) + 2η) × (n_x/n_sdx + 2ξ) × N/n_cg. Assume that all processors have equal status in the distributed computing platform, i.e., we do not distinguish processors within a computation node from processors across nodes. The cost of this phase is

T_comm = n_sdx · log(n_cg + 1) · (a + b · (n_y/(n_sdy · L) + 2η) · (n_x/n_sdx + 2ξ) · (N/n_cg) · h).  (8)

(3) Computation: In this phase, each computation processor performs the local analysis on the locally gathered data. The grid points are updated one after another. Assume that the time for updating any point is the same. The cost of the local analysis on D'_{i,j,l} is

T_comp = c · (n_y/(n_sdy · L)) · (n_x/n_sdx).  (9)

Combining all the phases above, we can compute the total cost of our overlapping strategy:

T_total = T_read + T_comm + L · T_comp.  (10)
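The model (7)-(10) translates directly into a small evaluation routine, sketched below under the assumption of consistent units (time units for a, b, c, θ; bytes for h); this is the objective that the auto-tuning of Section 4.4 probes.

from math import log

def total_cost(N, nx, ny, nsdx, nsdy, ncg, L, a, b, c, theta, h, xi, eta):
    bar_rows = ny / (nsdy * L) + 2 * eta                              # rows of a small bar
    t_read = bar_rows * nx * h * (N / ncg) * theta * log(ncg * nsdy)  # eq. (7)
    block = bar_rows * (nx / nsdx + 2 * xi) * (N / ncg) * h           # block size in bytes
    t_comm = nsdx * log(ncg + 1) * (a + b * block)                    # eq. (8)
    t_comp = c * (ny / (nsdy * L)) * (nx / nsdx)                      # eq. (9)
    return t_read + t_comm + L * t_comp                               # eq. (10)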

4.4 Auto-Tuning for Optimal Parameters
Based on the model (10), a naive approach to determine the optimal parameters n_sdx, n_sdy, L and n_cg is to minimize the total time of our overlapping strategy. However, it is not hard to check that T_total decreases as n_cg increases, which means that if more processors are used, the total runtime is smaller. In fact, the number of processors for reading and updating data is our resource cost for executing EnKF, and this cost is not infinite in real applications. Hence, the proposed approach considers how to balance the cost on processors against the benefit on runtime.

Set C_1 = n_cg · n_sdy and C_2 = n_sdx · n_sdy, which are the numbers of processors for reading and updating data. They are also called the costs for file reading and local analysis respectively. In order to make it easier to compare the multi-stage computation strategy with the original one, we fix the cost C_2 for the local analysis, because the local analysis is the main operation of EnKF. When C_2 is fixed, it is easy to check that L · T_comp becomes a constant. This means that the maximal benefit on runtime comes from T_read + T_comm, obtained by appropriately paying a cost C_1. Hence, for a fixed C_2, the key to designing an auto-tuning for the optimal parameters is how to choose an economic cost C_1 for file reading.

First of all, before finding the economic cost C_1, we consider the following optimization problem

min_{n_sdx, n_sdy, L, n_cg}  T_1 := T_read + T_comm,  (11)

such that

n_cg · n_sdy = C_1  and  n_sdx · n_sdy = C_2,  (12)

where the two costs C_1 and C_2 are given. It is easy to solve problem (11). Our proposed algorithm is shown in Algorithm 1, which searches for the minimum of the objective function by traversing the complete search space.

Next, when the cost C_1 changes, the minimal value of T_1 varies accordingly (refer to Figure 12). Hence, it is important to determine which cost C_1 is the most economic one, based on the minimal values of T_1 in the different cases. Assume that C_1 is taken from a set {c_1^1, c_1^2, · · · , c_1^M} where c_1^m < c_1^{m+1}. For C_1 = c_1^m, we denote the corresponding minimal value of T_1 as t_1^m. Due to c_1^m < c_1^{m+1}, t_1^m ≥ t_1^{m+1} is valid. For 1 ≤ m < M, we define an earnings rate as

r_m = (t_1^m − t_1^{m+1}) / (c_1^{m+1} − c_1^m).  (13)

It is easy to check that r_m > 0 and that r_m decreases monotonically as m increases. According to this fact, the proposed condition for determining the most economic cost is: if

r_m < ε,  (14)

where ε is a given positive number, then the cost c_1^m is chosen. This means that if more cost can no longer provide significant benefit, we choose the current cost. Based on Algorithm 1 with C_1 = c_1^m, we can obtain the corresponding optimal parameters n_sdx, n_sdy, L and n_cg.

Finally, we denote by n_p the total number of processors.

Based on the approach above, if we choose C_2 = k for each k = 1 to k = n_p, we can obtain the corresponding economic cost C_1 = c_1^k and optimal parameters n_sdx = n_sdx^k, n_sdy = n_sdy^k, L = L^k and n_cg = n_cg^k in the different cases. Substituting n_sdx^k, n_sdy^k, L^k and n_cg^k into formula (10), we find the minimal T_total, and the corresponding parameters are the optimal choice.

In conclusion, Algorithm 2 presents the auto-tuning approach to determine the optimal parameters for the multi-stage computation strategy. The effect of this proposed approach is to minimize the time for the first I/O and communication procedures at a suitable cost, and to maximally overlap the subsequent reading and communication with the local updating.
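Before the full listings, the earnings-rate rule (13)-(14) can be summarized in a few lines; the sketch assumes the candidate costs are sorted in ascending order and that their minimal T_1 values have already been computed by Algorithm 1.

def economic_cost(costs, times, eps):
    # costs: ascending candidates c_1^m; times: the matching minima t_1^m.
    for m in range(len(costs) - 1):
        r = (times[m] - times[m + 1]) / (costs[m + 1] - costs[m])  # eq. (13)
        if r < eps:                      # extra processors no longer pay off, eq. (14)
            return costs[m]
    return costs[-1]                     # every increase still pays: take the largest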

Algorithm 1 Solver for Optimization Model
 1: Initialize T1 = 0, nsdx = 1, nsdy = 1, L = 1 and ncg = 1;
 2: for j = 1 to c1 do
 3:   if (c1 % j == 0) and (c2 % j == 0) and (ny % j == 0) then
 4:     k = c1 / j;
 5:     i = c2 / j;
 6:     if (nx % i == 0) and (N % k == 0) then
 7:       for l = 1 to ny/j do
 8:         if (ny/j) % l == 0 then
 9:           nsdx = i;
10:           nsdy = j;
11:           L = l;
12:           ncg = k;
13:           Compute t = Tread + Tcomm by (7) and (8);
14:           if (T1 == 0) or (t < T1) then
15:             T1 = t;
16:             nsdx = i;
17:             nsdy = j;
18:             L = l;
19:             ncg = k;
20:           end if
21:         end if
22:       end for
23:     end if
24:   end if
25: end for
26: return T1, nsdx, nsdy, L and ncg;

Algorithm 2 Auto-Tuning for Optimal Parameters
 1: Create vectors t, cs, dx, dy, lev and g whose lengths are np;
 2: Initialize Tmin = 0;
 3: for c2 = 1 to np do
 4:   T = 0;
 5:   k = 0;
 6:   for c1 = 1 to np − c2 do
 7:     Obtain T1, nsdx, nsdy, L and ncg by Algorithm 1;
 8:     if (T == 0 and T1 > 0) or (0 < T1 and T1 < T) then
 9:       k = k + 1;
10:       T = T1;
11:       t[k] = T1;
12:       cs[k] = c1;
13:       dx[k] = nsdx;
14:       dy[k] = nsdy;
15:       lev[k] = L;
16:       g[k] = ncg;
17:     end if
18:   end for
19:   for m = 1 to k − 1 do
20:     r = (t[m] − t[m+1]) / (cs[m+1] − cs[m]);
21:     if r < ε then
22:       break;
23:     end if
24:   end for
25:   Compute Ttotal by (10) with nsdx = dx[m], nsdy = dy[m], L = lev[m] and ncg = g[m];
26:   if (Tmin == 0) or (0 < Ttotal and Ttotal < Tmin) then
27:     Tmin = Ttotal;
28:     n*sdx = dx[m];
29:     n*sdy = dy[m];
30:     L* = lev[m];
31:     n*cg = g[m];
32:   end if
33: end for
34: return n*sdx, n*sdy, L* and n*cg;

5 Performance Evaluation
In this section, we present numerical results on the Tianhe-2 supercomputer [31].

5.1 Computational Environment and Data Sets
Tianhe-2 was the world's fastest supercomputer from 2013 to 2015. Each node of Tianhe-2 is equipped with two Intel Ivy Bridge CPUs (24 cores). The interface nodes are connected using InfiniBand networks. The system software includes a 64-bit Kylin OS, an Intel 14.0 compiler, a customized MPICH-3.1 for TH Express-2, and a self-designed hybrid hierarchy file system, H2FS. In our numerical experiments, we have used 500 nodes (12,000 cores) of Tianhe-2, and we show the performance of different parallel implementations of EnKF on over ten thousand CPUs. To the best of our knowledge, the largest number of CPUs utilized by recent work is O(1,000) [10, 11].

In our tests, 120 background ensemble members (N = 120), taken from a long-time ocean model integration with 0.1° spatial resolution and 30 vertical levels, are used to evaluate the parallel performance of S-EnKF. In the 2-dimensional latitude-longitude mesh, n_x × n_y = 3600 × 1800. It is worth mentioning that, when n_p processors are utilized in P-EnKF, the number of processors for S-EnKF may be less than or equal to n_p. The reason is that, for S-EnKF, the total number of processors is the sum of C_1 and C_2, which are determined by Algorithm 2. Nevertheless, we uniformly use the number of processors utilized by P-EnKF as the standard for the performance comparison.

Figure 9. Time for different phases in P-EnKF and S-EnKF.

5.2 Performance Comparison
The performance comparison focuses on P-EnKF and S-EnKF. P-EnKF represents the state-of-the-art parallel implementation of EnKF, and is widely used for data assimilation on distributed-memory systems. Our numerical results are presented in Figure 9. It is obvious that P-EnKF is able to scale to about 8,000 cores. However, when the number of processors increases from 8,000 to 12,000, the total runtime does not become shorter, because the time P-EnKF spends on file reading increases significantly. The main reason for this phenomenon has been discussed in Section 4.1.1. In contrast, based on the concurrent access approach, the time S-EnKF spends on file reading and data communication changes only slightly as the number of processors increases. Compared with P-EnKF, S-EnKF achieves a 3x speedup on 12,000 cores. For S-EnKF, the file reading and data communication are hidden well, which results in good scalability.

As seen from Figure 9, the waiting time decreases as the number of processors increases. When the number of processors reaches 12,000, the wait time becomes very short, especially for the I/O processors. This fact indicates that S-EnKF is able to take full advantage of the parallel performance of the cluster for assimilating the data with 0.1° resolution, and that our co-design is effective.

Figure 10. Time for reading 120 background ensemble members with the concurrent access approach.

5.3 Evaluation of Various S-EnKF Co-designs
Concurrent Access Approach: Figure 10 shows how the data reading time varies as the number of concurrent groups increases. When n_cg < 4, the data reading time decreases monotonically owing to the concurrent I/O potential of the file system. When n_cg > 6, the data reading time changes only slightly. The main reason is that, when n_cg is large enough, the total I/O bandwidth is fully used. As the number of processors increases, the time of the block reading approach grows tremendously (Figure 5). In contrast, the time of the concurrent access approach is short and controllable, which plays an important role in developing S-EnKF. Incidentally, the optimal number of concurrent groups, such as 4 or 6 for the different cases in Figure 10, is obtained by our auto-tuning approach.

Figure 11. Percentage of the overlapped time over the total runtime in S-EnKF.

Figure 12. Curve of the minimal value of T_1 and test data with different parameters n_sdx, n_sdy, L and n_cg in the case of C_2 = 2,000.

Overlapping Strategy: Define the overlapped time as the time (for waiting, disk I/O and communication) which is overlapped with the time for local computation. Figure 11 shows the percentage of the overlapped time over the total runtime in S-EnKF. As the number of processors increases, the time for file reading and communication changes only slightly, while the time for local analysis and the wait time decrease together (Figure 9). Hence, it is reasonable that the percentage of the overlapped time over the total runtime is sustained. This means that the multi-stage computation strategy indeed realizes the overlap of data obtaining and local updating, and that the effect does not degrade as the number of processors increases.

Figure 13. Total runtime of P-EnKF and S-EnKF.

Auto-Tuning Design: Figure 12 presents the minimal value of the objective function T_1 in the optimization model (11) and the corresponding test results in different cases. When the cost C_1 on processors for file reading is fixed, the test results, marked by the crosses along the vertical direction, are determined by different values of the parameters n_sdx, n_sdy, L and n_cg which satisfy n_sdy · n_cg = C_1. From Figure 12, it is obvious that the minimal test result for a given cost C_1 is closer to the minimal value of the objective function T_1 than the others. Furthermore, we also checked that the parameters for the minimal test result and for the minimal value of T_1 are the same. This fact confirms that our model reflects the actual performance of the multi-stage computation.

In Figure 12, the square marks highlight the two most economic choices, which are determined by condition (14) based on the test data and on the model (11), respectively. The two economic choices are consistent, which implies that the auto-tuning approach for determining the optimal parameters is effective.

5.4 Primary Highlights
In the scaling tests, we fix the total problem size and increase the number of computing nodes. The total compute time should decrease as more computing nodes are utilized. However, ideal scaling results are hard to obtain, because the computation-to-communication ratio becomes smaller at higher node counts, which eventually affects the scalability. Even a small fluctuation in the MPI communication degrades the performance in the scaling tests [31, 32]. Therefore, successfully overlapping the local computation with the file reading and communication operations is key to achieving the expected scaling results.

Figure 13 shows the strong scaling performance of P-EnKF and S-EnKF. As seen from Figure 13, when the number of processors is larger than ten thousand, the runtime of P-EnKF becomes longer as the number of processors increases.


In contrast, for S-EnKF, nearly ideal strong scaling efficiency is sustained when 12,000 processors are utilized. There is only a very slight loss of scalability in the strong scaling tests, the reason for which can be seen in Figure 9. In Figure 9, most of the file reading and data communication phases on the I/O processor side are totally hidden behind the computations on the computation processor side. The only part of the algorithm that cannot be overlapped is the first file reading and data communication. Although this part only takes a small portion (less than 8%) of the total computing time, it still affects the parallel efficiency at larger processor counts. We would like to point out that the parallel efficiency of S-EnKF in the strong scaling results is superior to that of P-EnKF in [23, 24], in which file reading is not overlapped with local computations, and the computing capacities of the CPUs are not fully exploited.

6 Related Work
We now compare S-EnKF with prevalent research and development efforts. L-EnKF [13, 33] is one of the most recent efforts in scaling out EnKF. It discusses scaling data assimilation to multi-MPI clusters with a prime focus on domain localization given a radius of influence r. It uses a single processor for reading background ensemble members one by one and distributing the data to other processors serially, which is not efficient for reading a large amount of high-resolution data [15, 19]. In contrast, S-EnKF provides an algorithmic perspective on new co-designs and highlights strategies to accelerate the file reading phases in data assimilation. S-EnKF considers how to use several processors for reading many files at the same time to take full advantage of the I/O performance of parallel file systems. On the other hand, P-EnKF [24] represents the state-of-the-art parallel implementation of EnKF. It exploits the modified Cholesky decomposition to accelerate the local analysis. The parallel block reading approach in P-EnKF is more suitable for large-scale problems. However, S-EnKF uses a new concurrent access approach based on parallel bar reading, which reduces the frequency of disk addressing operations. Furthermore, S-EnKF is the first to achieve the overlapping of local analysis with the I/O and communication phases, which improves the scalability of P-EnKF significantly. To the best of our knowledge, the maximum number of processors in recent reports is O(1,000) [2, 11, 21]. This work is the first to discuss the performance of parallel implementations with over ten thousand processors, which proves that S-EnKF is scalable for the data assimilation of large amounts of high-resolution data.

To estimate the cost of the operation, several cost models have been proposed and widely applied [3]. To the best of our knowledge, almost all related work does not use cost models to determine the optimal number of processors for reading files and the optimal number of dissections for the sub-domains in domain localization, because more benefit on time can always be obtained by spending more cost on processors. Considering the tradeoff between benefit and cost, this work is the first to propose an auto-tuning approach to determine the optimal number of processors for file reading and the optimal parameters for domain localization in S-EnKF, which successfully extends the cost models to a broader set of applications.

7 Conclusion
In this paper, we have tackled the challenge of designing a scalable and distributed EnKF. Using a co-designed concurrent access approach for file reading, we fully exploit the resources available in modern parallel file systems to accelerate the reading of background ensemble members. Through the multi-stage computation co-design, we achieve maximal overlap of data obtaining and local updating. Furthermore, by considering the tradeoff between the benefit on runtime and the cost on processors based on classic cost models, we push the envelope of performance further with an aggressive co-design of auto-tuning. The experimental evaluation demonstrates that S-EnKF achieves nearly ideal strong scaling efficiency.

Acknowledgments
The authors would like to thank all anonymous referees for their valuable comments and helpful suggestions. The work is supported by the National Key Research and Development Program of China under Grant No. 2016YFC1401700, the National Natural Science Foundation of China under Grant Nos. 61802369, 61521092, 91430218, 31327901, 61472395 and 61432018, and CAS (QYZDJ-SSW-JSC035).

References

[1] E. Anderson, Z. Bai, J. Dongarra, A. Greenbaum, A. McKenney, J. Du Croz, S. Hammarling, J. Demmel, C. Bischof, and D. Sorensen. 1990. LAPACK: A Portable Linear Algebra Library for High-performance Computers. In Proceedings of the 1990 ACM/IEEE Conference on Supercomputing (Supercomputing '90). IEEE Computer Society Press, Los Alamitos, CA, USA, 2–11. http://dl.acm.org/citation.cfm?id=110382.110385

[2] Jeffrey L. Anderson and Nancy Collins. 2007. Scalable Implementations of Ensemble Filter Algorithms for Data Assimilation. Journal of Atmospheric and Oceanic Technology 24, 8 (2007), 1452–1463. https://doi.org/10.1175/JTECH2049.1

[3] Mohammadreza Bayatpour, Sourav Chakraborty, Hari Subramoni, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda. 2017. Scalable Reduction Collectives with Data Partitioning-based Multi-leader Design. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '17). ACM, New York, NY, USA, Article 64, 11 pages. https://doi.org/10.1145/3126908.3126954

[4] Alexander Bibov and Heikki Haario. 2017. Parallel implementation of data assimilation. International Journal for Numerical Methods in Fluids 83, 7 (2017), 606–622. https://doi.org/10.1002/fld.4278

[5] Peter J. Bickel and Elizaveta Levina. 2008. Regularized estimation of large covariance matrices. Ann. Statist. 36, 1 (Feb. 2008), 199–227. https://doi.org/10.1214/009053607000000758


[6] L. S. Blackford, A. Petitet, R. Pozo, et al. 2002. An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Softw. 28, 2 (June 2002), 135–151. https://doi.org/10.1145/567806.567807

[7] Alexandre A. Emerick and Albert C. Reynolds. 2013. Ensemble smoother with multiple data assimilation. Computers and Geosciences 55 (2013), 3–15. https://doi.org/10.1016/j.cageo.2012.03.011

[8] Geir Evensen. 2003. The Ensemble Kalman Filter: theoretical formulation and practical implementation. Ocean Dynamics 53, 4 (Nov. 2003), 343–367. https://doi.org/10.1007/s10236-003-0036-9

[9] Kevin Hamilton and Wataru Ohfuchi. 2008. High Resolution Numerical Modelling of the Atmosphere and Ocean. Springer.

[10] P. L. Houtekamer, Bin He, and Herschel L. Mitchell. 2014. Parallel Implementation of an Ensemble Kalman Filter. Monthly Weather Review 142, 3 (2014), 1163–1182. https://doi.org/10.1175/MWR-D-13-00011.1

[11] P. L. Houtekamer and Fuqing Zhang. 2016. Review of the Ensemble Kalman Filter for Atmospheric Data Assimilation. Monthly Weather Review 144, 12 (2016), 4489–4532. https://doi.org/10.1175/MWR-D-15-0440.1

[12] M. N. Kaurkin, R. A. Ibrayev, and K. P. Belyaev. 2016. Data assimilation in the ocean circulation model of high spatial resolution using the methods of parallel programming. Russian Meteorology and Hydrology 41, 7 (July 2016), 479–486. https://doi.org/10.3103/S1068373916070050

[13] Christian L. Keppenne. 2000. Data Assimilation into a Primitive-Equation Model with a Parallel Ensemble Kalman Filter. Monthly Weather Review 128, 6 (2000), 1971–1981. https://doi.org/10.1175/1520-0493(2000)128<1971:DAIAPE>2.0.CO;2

[14] Christian L. Keppenne and Michele M. Rienecker. 2002. Initial Testing of a Massively Parallel Ensemble Kalman Filter with the Poseidon Isopycnal Ocean General Circulation Model. Monthly Weather Review 130, 12 (2002), 2951–2965. https://doi.org/10.1175/1520-0493(2002)130<2951:ITOAMP>2.0.CO;2

[15] Koji Terasaki, Masahiro Sawada, and Takemasa Miyoshi. 2015. Local Ensemble Transform Kalman Filter Experiments with the Nonhydrostatic Icosahedral Atmospheric Model NICAM. SOLA 11 (2015), 23–26. https://doi.org/10.2151/sola.2015-006

[16] Fred Kucharski, Franco Molteni, and Annalisa Bracco. 2006. Decadal interactions between the western tropical Pacific and the North Atlantic Oscillation. Climate Dynamics 26, 1 (Jan. 2006), 79–91. https://doi.org/10.1007/s00382-005-0085-5

[17] Y. Liu, A. H. Weerts, M. Clark, H.-J. Hendricks Franssen, S. Kumar, H. Moradkhani, D.-J. Seo, D. Schwanenberg, P. Smith, A. I. J. M. van Dijk, N. van Velzen, M. He, H. Lee, S. J. Noh, O. Rakovec, and P. Restrepo. 2012. Advancing data assimilation in operational hydrologic forecasting: progresses, challenges, and emerging opportunities. Hydrology and Earth System Sciences 16, 10 (2012), 3863–3887. https://doi.org/10.5194/hess-16-3863-2012

[18] Lustre. 2017. Parallel File System. http://lustre.org

[19] Takemasa Miyoshi and Masaru Kunii. 2012. The Local Ensemble Transform Kalman Filter with the Weather Research and Forecasting Model: Experiments with Real Observations. Pure and Applied Geophysics 169, 3 (Mar. 2012), 321–333. https://doi.org/10.1007/s00024-011-0373-4

[20] F. Molteni. 2003. Atmospheric simulations using a GCM with simplified physical parametrizations. I: model climatology and variability in multi-decadal experiments. Climate Dynamics 20, 2 (Jan. 2003), 175–191. https://doi.org/10.1007/s00382-002-0268-2

[21] Lars Nerger. 2004. Parallel Filter Algorithms for Data Assimilation in Oceanography. Ph.D. Dissertation. Universität Bremen. http://epic.awi.de/10313/

[22] Lars Nerger and Wolfgang Hiller. 2013. Software for ensemble-based data assimilation systems: implementation strategies and scalability. Computers and Geosciences 55 (2013), 110–118. https://doi.org/10.1016/j.cageo.2012.03.026

[23] E. Nino-Ruiz, A. Sandu, and X. Deng. 2018. An Ensemble Kalman Filter Implementation Based on Modified Cholesky Decomposition for Inverse Covariance Matrix Estimation. SIAM Journal on Scientific Computing 40, 2 (2018), A867–A886. https://doi.org/10.1137/16M1097031

[24] Elias D. Nino-Ruiz, Adrian Sandu, and Xinwei Deng. 2017. A parallel implementation of the ensemble Kalman filter based on modified Cholesky decomposition. Journal of Computational Science (2017). https://doi.org/10.1016/j.jocs.2017.04.005

[25] Edward Ott, Brian R. Hunt, Istvan Szunyogh, Aleksey V. Zimin, Eric J. Kostelich, Matteo Corazza, Eugenia Kalnay, D. J. Patil, and James A. Yorke. 2004. A local ensemble Kalman filter for atmospheric data assimilation. Tellus A: Dynamic Meteorology and Oceanography 56, 5 (2004), 415–428. https://doi.org/10.3402/tellusa.v56i5.14462

[26] Rolf Rabenseifner. 2004. Optimization of Collective Reduction Operations. In Computational Science - ICCS 2004, Marian Bubak, Geert Dick van Albada, Peter M. A. Sloot, and Jack Dongarra (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–9.

[27] Vishwas Rao and Adrian Sandu. 2016. A time-parallel approach to strong-constraint four-dimensional variational data assimilation. J. Comput. Phys. 313 (2016), 583–593. https://doi.org/10.1016/j.jcp.2016.02.040

[28] Pavel Sakov and Laurent Bertino. 2011. Relation between two common localisation methods for the EnKF. Computational Geosciences 15, 2 (Mar. 2011), 225–237. https://doi.org/10.1007/s10596-010-9202-6

[29] Zheqi Shen and Youmin Tang. 2015. A modified ensemble Kalman particle filter for non-Gaussian systems with nonlinear measurement functions. Journal of Advances in Modeling Earth Systems 7, 1 (2015), 50–66. https://doi.org/10.1002/2014MS000373

[30] Rajeev Thakur, Rolf Rabenseifner, and William Gropp. 2005. Optimization of Collective Communication Operations in MPICH. International Journal of High Performance Computing Applications 19, 1 (2005), 49–66.

[31] W. Xue, C. Yang, H. Fu, X. Wang, Y. Xu, J. Liao, L. Gan, Y. Lu, R. Ranjan, and L. Wang. 2015. Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2. IEEE Trans. Comput. 64, 8 (Aug. 2015), 2382–2393. https://doi.org/10.1109/TC.2014.2366754

[32] Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, Linfeng Li, Yangtong Xu, Yutong Lu, Jiachang Sun, Guangwen Yang, and Weimin Zheng. 2013. A Peta-scalable CPU-GPU Algorithm for Global Atmospheric Simulations. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Vol. 48. 1–12.

[33] H. Yashiro, K. Terasaki, T. Miyoshi, and H. Tomita. 2016. Performance evaluation of a throughput-aware framework for ensemble data assimilation: the case of NICAM-LETKF. Geoscientific Model Development 9, 7 (2016), 2293–2300. https://doi.org/10.5194/gmd-9-2293-2016

[34] Milija Zupanski. 2009. Theoretical and Practical Issues of Ensemble Data Assimilation in Weather and Climate. Springer Berlin Heidelberg, Berlin, Heidelberg, 67–84. https://doi.org/10.1007/978-3-540-71056-1_3