Resilient Solvers for Partial Differential Equations

Miroslav Stoyanov, Eirik Endeve, Clayton Webster

Oak Ridge National Lab

July 6, 2014

Extreme-scale Algorithms & Software Resilience (EASIR): ERKJ245

Al Geist (ORNL), Mike Heroux (SNL)

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 1/32

Extreme Scale Equations

Consider the generalized large-scale linear differential equation that is either steady state

−AX = F ,

or time dependent

d/dt X = AX + F.

The differential operator A can represent the complex dynamics of multiple coupled phenomena (e.g., multi-physics).

Finding the solution X requires a powerful supercomputer.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 2/32

Supercomputing Speed vs Reliability

Hardware Failure

- Operation at near-critical voltage, ever-shrinking fabrication size, and an increasing number of components
- Increased computational throughput leads to a decrease in reliability
- Soft faults are already causing problems in the largest-scale computations (e.g., climate)
- In the near future we expect the mean time to failure to be on the order of an application runtime

Hardware Error Control

- DRAM has built-in error detection and correction
- Network communication has CRC checksums
- CPU caches and logic circuits have virtually no protection

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 3/32

Software Abstraction

Results of Faults
- The code crashes and no result is given
- The code completes execution, but returns an invalid result (e.g., inf or nan)
- The code completes execution and returns a result that "appears" valid:

2 + 3 = 7

Software Error Control
- Check-pointing: how do we detect silent faults?
- Check-summing: in a heterogeneous computing environment, we cannot check-sum floating point operations.
- Post-processing: detection comes too late.
- Redundant computation: doable only at limited scale.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 4/32

The Cutting Edge Approach

Hoemmen, M. and Heroux, M. A., Fault-tolerant iterative methods via selective reliability, Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), IEEE Computer Society, 2011.

The code is split into two modes with different reliability:

- Fast Mode: the bulk of the computations is performed in this mode, and any "fast" computation is susceptible to hardware faults.
- Safe Mode: only the bare minimum of work is done in this mode, and computations are assumed to be completely reliable.

A sanity check is performed in Safe Mode to guard against the introduction of large hardware errors. Natural self-correcting properties are exploited to correct hardware errors.
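As a rough illustration of this selective-reliability pattern (a minimal Python sketch, not code from the paper; the function names and the norm-bound check are illustrative assumptions), the fast-mode work is retried until a cheap safe-mode check accepts it:

```python
import numpy as np

def run_with_selective_reliability(fast_kernel, safe_check, max_retries=10):
    """Run an unreliable computation (fast mode) and accept it only if a
    cheap, assumed-reliable check (safe mode) passes; otherwise redo the work."""
    for _ in range(max_retries):
        result = fast_kernel()      # fast mode: bulk of the work, may be silently corrupted
        if safe_check(result):      # safe mode: sanity check, assumed reliable
            return result
    raise RuntimeError("no acceptable result after {} attempts".format(max_retries))

# Illustrative usage: a matrix-vector product checked against a coarse a priori bound,
# using ||A x||_2 <= sqrt(n) * ||A||_inf * ||x||_inf.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
x = np.ones(100)
bound = np.sqrt(100) * np.linalg.norm(A, ord=np.inf) * np.linalg.norm(x, ord=np.inf)
y = run_with_selective_reliability(lambda: A @ x, lambda v: np.linalg.norm(v) <= bound)
```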

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 5/32

Challenges of Selective Reliability

Rigorously quantify the impact of soft faults

Analyze the convergence rate

Devise optimal accept/reject criteria

What type and rate of faults can be corrected?

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 6/32

Analytic Fault Model

Algorithms are composed of steps

Steps compute intermediate answers → the final solution

Each step is performed in either fast or safe mode

Safe mode always returns the correct answer

Fast mode is susceptible to faults and there is probability p of computing something that is not correct (i.e., we should get x, but instead we get x + x̃)

If the intermediate result is corrupted, then we make no assumption on the distribution of the perturbation

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 7/32

Convergence

‖X_exact − X_computed‖ ≤ E_rounding + E_discrete + E_iterative + E_hardware

- E_rounding → 0 with more precision, i.e., single, double, quad
- E_discrete → 0 with more nodes, i.e., a denser mesh
- E_iterative → 0 with more solver iterations

In all cases, more work means a better approximation. We want E_hardware → 0 as we do more work.

Definition (Convergence with Respect to Hardware Error). A method is convergent on a specific hardware if the statistical moments of the hardware error converge to zero as the computational cost increases, regardless of the distribution of the perturbation.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 8/32

Steady State Method: GMRES

Consider the steady state problem

AX = F ,

which we discretize into a system of linear equations

A|_N x|_N = b|_N,

where

A|_N → A,   b|_N → F,   x|_N → X.

We need a resilient linear solver!

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 9/32

GMRES

Given the linear system of equations

Ax = b,

general iterative solvers generate a sequence

x_k → x.

We perform K iterations so that the tolerance ε is reached:

‖b − A x_K‖ < ε.

The Krylov basis is expensive to store and manipulate, so we use a restarted scheme

{ {x_k^m}_{k=0}^{K} }_{m=1}^{M}

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 10/32

Fault-Tolerant GMRES

Due to M. Hoemmen and M. Heroux:

Given A, b, x_0, and ε > 0
e_0 = ‖b − A x_0‖
m = 0
while e_m > ε do
    compute x_K^{m+1} (K inner iterations, performed in fast mode)
    e_{m+1} = ‖b − A x_K^{m+1}‖ (evaluated in safe mode)
    if e_{m+1} > e_m then
        discard x_K^{m+1} and h^{m+1}, and redo the last inner loop
    else
        accept x_K^{m+1} and advance to the next m = m + 1
    end if
end while
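The outer loop above can be prototyped in a few lines of Python. The following is a minimal sketch (assuming numpy), not the authors' implementation: the inner solver is a plain restarted GMRES cycle standing in for the K fast-mode iterations rather than the flexible-GMRES formulation of the paper, a random perturbation can be injected to mimic a silent fault (following the random model used in the experiments later in the deck), and the residual test of the outer loop plays the role of the safe-mode sanity check.

```python
import numpy as np

def inner_gmres(A, b, x0, K):
    """One restart cycle: K Arnoldi steps plus a least-squares solve (the fast-mode work)."""
    n = len(b)
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta == 0.0:
        return x0
    Q = np.zeros((n, K + 1))
    H = np.zeros((K + 1, K))
    Q[:, 0] = r0 / beta
    for k in range(K):
        w = A @ Q[:, k]
        for j in range(k + 1):                 # modified Gram-Schmidt orthogonalization
            H[j, k] = Q[:, j] @ w
            w -= H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(w)
        if H[k + 1, k] < 1e-14:                # invariant subspace found, stop early
            K = k + 1
            break
        Q[:, k + 1] = w / H[k + 1, k]
    e1 = np.zeros(K + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:K + 1, :K], e1, rcond=None)
    return x0 + Q[:, :K] @ y

def ft_gmres(A, b, x0, K, eps, max_outer=500, fault_prob=0.0, rng=None):
    """Fault-tolerant restarted GMRES: accept a restart cycle only if the residual decreases."""
    rng = rng or np.random.default_rng()
    x = x0.copy()
    e = np.linalg.norm(b - A @ x)              # e_0, evaluated in safe mode
    for m in range(max_outer):
        if e <= eps:
            return x, m
        x_new = inner_gmres(A, b, x, K)        # fast mode, susceptible to faults
        if rng.random() < fault_prob:          # inject a silent fault (for experiments only)
            g = rng.standard_normal(len(b))
            x_new = x_new + (g / np.linalg.norm(g)) * 10.0 ** rng.uniform(-9, 19)
        e_new = np.linalg.norm(b - A @ x_new)  # safe-mode sanity check
        if e_new <= e:
            x, e = x_new, e_new                # accept and advance
        # else: discard x_new and redo the inner loop
    return x, max_outer
```

Calling ft_gmres(A, b, np.zeros_like(b), K=20, eps=1e-5, fault_prob=0.05) reproduces the flavor of the experiments shown later in the deck.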

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 11/32

Fault Free Convergence

The residual is e_k^m = ‖b − A x_k^m‖ and let

Λ_+ = {λ : Av = λv, Re(λ) > 0},   Λ_+ ⊂ disk(D, d/2),
Λ_− = {λ : Av = λv, Re(λ) ≤ 0},   ν = |Λ_−|.

Define magnification and reduction factors

R = cond(A) · ( max_{λ_+ ∈ Λ_+, λ_− ∈ Λ_−} |λ_+ − λ_−| / min_{λ_− ∈ Λ_−} |λ_−| )^ν · (D/d)^ν,   r = d/D.

Then,

e_K^M ≤ (R r^K)^M e_0 = R^M r^{M K} e_0.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 12/32

Resilient Convergence

Let M_* be such that

e_K^{M_*} ≤ ε,   M_* = ⌈(log ε − log e_0) / (log R + K log r)⌉.

Hardware faults do not increase the residual of FT-GMRES. The method will reach the desired tolerance ε after M_* fault-free outer iterations.

Analyzing FT-GMRES convergence is therefore equivalent to analyzing the probability of computing M_* fault-free iterations in M tries.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 13/32

Resilient Convergence

Let p denote the probability of encountering a failure in the supercomputer over a unit of time equal to a single inner-loop iteration. Assume the fault rate is uncorrelated in time. The probability that K inner-loop iterations complete without a fault is

s(K) ≈ (1 − p)^K.

Then the probability of encountering M_* fault-free iterations in M ≥ M_* attempts is

P(M) = \binom{M-1}{M_*-1} s(K)^{M_*} (1 − s(K))^{M − M_*}.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 14/32

Statistical Convergence

The expected number of outer iterations needed for convergence is

E[M] = Σ_{M = M_*}^{∞} M \binom{M-1}{M_*-1} s(K)^{M_*} (1 − s(K))^{M − M_*} = M_* / s(K).

The variance is bounded by

V[M] = Σ_{M = M_*}^{∞} (M − E[M])^2 P(M) ≤ M_* / s(K)^2.

The probability of FT-GMRES converging approaches 1 as M → ∞.
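These formulas can be sanity-checked numerically; below is a small sketch with assumed illustrative values of p, K, and M_* (not taken from the slides):

```python
import math

def p_converge_at(M, M_star, s):
    """P(M): probability that the M_*-th fault-free outer iteration occurs at attempt M."""
    if M < M_star:
        return 0.0
    return math.comb(M - 1, M_star - 1) * s ** M_star * (1.0 - s) ** (M - M_star)

p, K, M_star = 0.05, 20, 5            # illustrative values
s = (1.0 - p) ** K                    # probability that one inner loop is fault free
Ms = range(M_star, 5000)
E_M = sum(M * p_converge_at(M, M_star, s) for M in Ms)
print(E_M, M_star / s)                # the truncated sum should closely match M_*/s(K)
```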

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 15/32

Computational Cost

Let W be the cost associated with computing the residual in safe mode (e.g., for triple redundancy, W = 3). Then, the total cost is

C = M (K + W)

and the expected cost is

E[C] = (M_* / s(K)) (K + W) = ⌈(log ε − log e_0) / (log R + K log r)⌉ · (K + W) / (1 − p)^K.

There is an optimal restart rate, the global minimizer of E[C], that is independent of ε and e_0.
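Locating that minimizer only requires evaluating E[C] over a range of K. A small sketch, with assumed illustrative values of R, r, p, W, ε, and e_0 (not taken from the slides):

```python
import numpy as np

def expected_cost(K, p, R, r, eps, e0, W):
    """E[C] = ceil((log eps - log e0) / (log R + K log r)) * (K + W) / (1 - p)^K."""
    M_star = np.ceil((np.log(eps) - np.log(e0)) / (np.log(R) + K * np.log(r)))
    return M_star * (K + W) / (1.0 - p) ** K

p, R, r, eps, e0, W = 0.05, 10.0, 0.5, 1e-8, 1.0, 3   # assumed illustrative values
Ks = np.arange(5, 61)
costs = np.array([expected_cost(K, p, R, r, eps, e0, W) for K in Ks])
print(Ks[int(np.argmin(costs))])      # the cost-minimizing restart length K
```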

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 16/32

Example:

Consider the boundary value problem

0.2 X_ξξ + X_ξ = 0,   X(0) = 0,   X(1) = 1.

We discretize the problem using a finite difference scheme, resulting in the linear system of equations

\begin{pmatrix}
0.4 + \frac{1}{N} & -0.2 - \frac{1}{N} & 0 & \cdots & 0 \\
-0.2 & 0.4 + \frac{1}{N} & -0.2 - \frac{1}{N} & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & -0.2 & 0.4 + \frac{1}{N} & -0.2 - \frac{1}{N} \\
0 & 0 & \cdots & -0.2 & 0.4 + \frac{1}{N}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 0.2 \end{pmatrix}
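A minimal sketch (assuming numpy) that assembles this tridiagonal system exactly as displayed above; the helper name is illustrative, and the resulting A and b can be fed to the ft_gmres sketch from earlier or to a direct solver for reference:

```python
import numpy as np

def bvp_system(N):
    """Tridiagonal system for 0.2*X'' + X' = 0, X(0) = 0, X(1) = 1, as displayed above."""
    h = 1.0 / N
    main = (0.4 + h) * np.ones(N - 1)
    lower = -0.2 * np.ones(N - 2)
    upper = -(0.2 + h) * np.ones(N - 2)
    A = np.diag(main) + np.diag(lower, k=-1) + np.diag(upper, k=1)
    b = np.zeros(N - 1)
    b[-1] = 0.2                      # contribution of the boundary value X(1) = 1
    return A, b

A, b = bvp_system(101)               # N = 101, as in the experiments below
x_ref = np.linalg.solve(A, b)        # direct solve, for reference
```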

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 17/32

Faults w/o Resilience

[Figure: residual norm vs. outer iteration, two panels.]

Figure: Left: fault-free GMRES convergence. Right: faulty GMRES convergence (or the lack thereof). N = 101, p = 0.05, K = 20, s(K) ≈ 0.36.

The injected perturbations are

x̃ = (g / ‖g‖) 10^u,   g ∼ N(0, 1)^N,   u ∼ U(−9, 19).

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 18/32

FT-GMRES: Convergence

[Figure: residual norm vs. outer iteration, two panels.]

Figure: Two realizations of FT-GMRES; despite the high fault rate, the algorithm converges reliably.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 19/32

FT-GMRES: Cost

[Figure: expected cost vs. number of inner iterations K.]

Figure: Expected cost of the FT-GMRES algorithm for different values of K. The expectation is computed using Monte Carlo sampling.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 20/32

FT-GMRES Convergence

We have an analytic proof of convergence for the FT-GMRES method.

The method converges for virtually any fault rate p.

The computational cost and convergence rate are strongly influenced by the fault rate p and the restart rate K.

When choosing the restart rate K, we need to consider the fault rate p in addition to the eigenstructure of the matrix A and storage limitations.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 21/32

Time Integration: First Order

Time dependent PDE

d/dt X = AX + F,

which we discretize into a system of ordinary differential equations

d/dt x|_N = A|_N x|_N + b|_N,

where

A|_N → A,   b|_N → F,   x|_N → X.

Given X(0), we are looking for X(T).

We need a resilient time integrator!

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 22/32

Time Stepping

Consider a first-order integration scheme for the system of ODEs

d/dt x = Ax.

Select a time step ∆t and approximate x_n ≈ x(n∆t).

First-order Euler time-stepping schemes (explicit and implicit):

x_{n+1} = x_n + ∆t A x_n,   x_{n+1} = x_n + ∆t A x_{n+1}.

Asymptotic convergence rate for n_f = T/∆t:

‖x_{n_f} − x(n_f ∆t)‖ ≤ C_T ∆t.
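For concreteness, the two update rules look like this in code (a small sketch assuming numpy; the implicit step solves a linear system at every iteration):

```python
import numpy as np

def forward_euler_step(A, x, dt):
    """Explicit Euler: x_{n+1} = x_n + dt * A x_n."""
    return x + dt * (A @ x)

def backward_euler_step(A, x, dt):
    """Implicit Euler: solve (I - dt * A) x_{n+1} = x_n."""
    return np.linalg.solve(np.eye(len(x)) - dt * A, x)
```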

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 23/32

Single Fault Scenario

Assume that the computation encounters a silent fault: at step j, instead of x_j we compute a corrupted

x̃_j = x_j + x̃.

For n > j, all subsequent time steps x̃_n are corrupted.

Then, at the final time step, we have

‖x̃_{n_f} − x(n_f ∆t)‖ ≤ ‖x̃_{n_f} − x_{n_f}‖ + ‖x_{n_f} − x(n_f ∆t)‖
                    ≤ ‖(I + ∆t A)^{n_f − j} x̃‖ + C_T ∆t
                    ≤ ‖(I + ∆t A)^{n_f − j}‖ ‖x̃‖ + C_T ∆t
                    ≤ B_T ‖x̃‖ + C_T ∆t,

where

B_T = max(1, ‖(I + ∆t A)^{n_f}‖).

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 24/32

Resilience Enhancement

According to the Mean Value Theorem

x_{n+1} − 2 x_n + x_{n−1} = (∆t^2 / 2) ( d^2/dt^2 x(ξ_l) + d^2/dt^2 x(ξ_r) ),

where ξ_l ∈ [(n − 1)∆t, n∆t] and ξ_r ∈ [n∆t, (n + 1)∆t].

Therefore,

‖x_{n+1} − 2 x_n + x_{n−1}‖ ≤ α ∆t^2,

for any α ≥ ‖d^2 x / dt^2‖_∞.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 25/32

Resilience Enhancement

At each step of the integration we accept x_{n+1} only if

‖x_{n+1} − 2 x_n + x_{n−1}‖ ≤ α ∆t^2.

Thus, if we encounter a fault x̃, it will be rejected unless

‖x_{n+1} + x̃ − 2 x_n + x_{n−1}‖ ≤ α ∆t^2,   which implies   ‖x̃‖ ≤ 2 α ∆t^2.

If we encounter a single fault,

‖x̃_{n_f} − x(n_f ∆t)‖ ≤ 2 α B_T ∆t^2 + C_T ∆t.

We expect to encounter p n_f faults in n_f = T/∆t iterations; then

‖x̃_{n_f} − x(n_f ∆t)‖ ≤ 2 α p B_T T ∆t + C_T ∆t = (2 α p B_T T + C_T) ∆t.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 26/32

Resilient Time Stepping

Given A, x_0, ∆t, T
Given α
Set n = 0
while n∆t < T do
    compute x_{n+1}, e.g., x_{n+1} = x_n + ∆t A x_n
    if ‖x_{n+1} − 2 x_n + x_{n−1}‖ > α ∆t^2 then
        discard x_{n+1} and redo the last step
    else
        accept x_{n+1} and advance to the next n = n + 1
    end if
end while
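A sketch of this loop in Python (assuming numpy): the explicit Euler step and the injected fault model are illustrative, and the very first step is taken unchecked because no x_{n-1} exists yet.

```python
import numpy as np

def resilient_euler(A, x0, dt, T, alpha, fault_prob=0.0, rng=None):
    """Forward Euler with the second-difference acceptance test described above."""
    rng = rng or np.random.default_rng()
    x_prev = x0.copy()
    x = x0 + dt * (A @ x0)                      # bootstrap: first step taken unchecked
    t = dt
    while t < T:
        x_new = x + dt * (A @ x)                # fast-mode step, may be corrupted
        if rng.random() < fault_prob:           # inject a silent fault (for experiments)
            g = rng.standard_normal(len(x))
            x_new = x_new + (g / np.linalg.norm(g)) * 10.0 ** rng.uniform(-9, 19)
        if np.linalg.norm(x_new - 2.0 * x + x_prev) > alpha * dt ** 2:
            continue                            # reject: discard x_new and redo the step
        x_prev, x = x, x_new                    # accept and advance
        t += dt
    return x
```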

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 27/32

Example:

Consider the heat transfer equation

d/dt X = 10^{-3} X_ξξ,   X(t, 0) = X(t, 1) = 0,   X(0, ξ) = ξ(1 − ξ).

We discretize the problem using a finite difference scheme, resulting in the system of linear ODEs

\frac{d}{dt}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{pmatrix}
= 10^{-3} (N + 1)^2
\begin{pmatrix}
-2 & 1 & 0 & \cdots & 0 \\
1 & -2 & 1 & \cdots & 0 \\
\vdots & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 1 & -2 & 1 \\
0 & 0 & \cdots & 1 & -2
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{N-2} \\ x_{N-1} \end{pmatrix}.

For the numerical experiments we use N = 1024.
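A self-contained sketch of this experiment (assuming numpy): the backward Euler step and the acceptance test follow the construction above, while the final time, the time step, the fault rate, and the crude choice of α are illustrative assumptions.

```python
import numpy as np

N = 1024
c = 1e-3 * (N + 1) ** 2                          # scaling used in the slides
A = c * (np.diag(-2.0 * np.ones(N - 1))
         + np.diag(np.ones(N - 2), k=-1)
         + np.diag(np.ones(N - 2), k=1))
xi = np.arange(1, N) / N
x0 = xi * (1.0 - xi)                             # X(0, xi) = xi (1 - xi)

dt, T, p = 1e-3, 0.1, 0.1
alpha = np.linalg.norm(A @ (A @ x0))             # crude bound on ||d^2 x / dt^2|| (assumed)
I = np.eye(N - 1)
rng = np.random.default_rng(0)

x_prev = x0.copy()
x = np.linalg.solve(I - dt * A, x0)              # first backward Euler step, unchecked
t = dt
while t < T:
    x_new = np.linalg.solve(I - dt * A, x)       # backward Euler step (fast mode)
    if rng.random() < p:                         # inject a silent fault with probability p
        g = rng.standard_normal(N - 1)
        x_new = x_new + (g / np.linalg.norm(g)) * 10.0 ** rng.uniform(-9, 19)
    if np.linalg.norm(x_new - 2.0 * x + x_prev) > alpha * dt ** 2:
        continue                                 # reject the step and redo it
    x_prev, x = x, x_new
    t += dt
```

Steps whose second difference exceeds α ∆t² are redone; that rejection is what preserves the first-order convergence reported in the second figure below.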

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 28/32

Classical Backward Euler

[Figure: L2 error norm vs. ∆t, "Error Convergence (Heat 1D)", with single-perturbation and multiple-perturbation curves.]

Figure: Lack of convergence of the classical backward Euler scheme for one and for multiple faults when p = 0.1.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 29/32

Classical Backward Euler

[Figure: L2 error norm vs. ∆t, "Error Convergence (Heat 1D)", for p = 0.0, 0.1, 0.2, 0.4, 0.8.]

Figure: Convergence of the resilient time-stepping method for different values of p.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 30/32

Time Stepping Conclusions

We have a resilient first-order time-stepping algorithm.

The method extends to non-linear problems as well as to a variable time step ∆t (so long as we have continuity with respect to the initial data).

The resilient method preserves the linear convergence rate even in the presence of faults.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 31/32

Thank You!

Questions?

Stoyanov, M. and Webster, C., Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults, Oak Ridge National Laboratory, Tech. Report ORNL/TM-2013/283.

Miroslav Stoyanov Eirik Endeve Clayton Webster — Algorithm Resilience 32/32