
FiniteNet: A Fully Convolutional LSTM Network Architecture for Time-Dependent Partial Differential Equations

Ben Stevens 1 Tim Colonius 1

Abstract

In this work, we present a machine learning approach for reducing the error when numerically solving time-dependent partial differential equations (PDE). We use a fully convolutional LSTM network to exploit the spatiotemporal dynamics of PDEs. The neural network serves to enhance finite-difference and finite-volume methods (FDM/FVM) that are commonly used to solve PDEs, allowing us to maintain guarantees on the order of convergence of our method. We train the network on simulation data, and show that our network can reduce error by a factor of 2 to 3 compared to the baseline algorithms. We demonstrate our method on three PDEs that each feature qualitatively different dynamics. We look at the linear advection equation, which propagates its initial conditions at a constant speed, the inviscid Burgers' equation, which develops shockwaves, and the Kuramoto-Sivashinsky (KS) equation, which is chaotic.

1. Introduction

1.1. Motivation

Partial differential equations (PDE) are fundamental to many areas of physics, such as fluid mechanics, electromagnetism, and quantum mechanics (Sommerfeld, 1949). All classical physics models are approximations obtained by coarse-graining the true quantum mechanical behavior of matter. For example, the Navier-Stokes equations are obtained by treating a fluid as a continuum; despite this, the equations are still too computationally expensive to solve for most practical problems (Ishihara et al., 2009), and so they are further coarse-grained by methods such as Large-Eddy Simulation (Halpern, 1993) or Reynolds Averaged Navier Stokes

*Equal contribution. 1Department of Mechanical and Civil Engineering, California Institute of Technology, Pasadena, CA, United States. Correspondence to: Ben Stevens <[email protected]>.

Under review for ICML 2020

(Chen et al., 1990) to infer sub-grid behavior. Each of these coarse-grained models saves multiple orders of magnitude of computational expense (Drikakis & Geurts, 2006; Wilcox, 1998). However, these approaches do not lead to models that are generally applicable (Spalart, 2010).

PDEs give rise to vastly different solution structures in different problems (Sommerfeld, 1949). Hence, a solution approach that works well for one equation will not be generally applicable, as different difficulties can cause different methods to fail. The two major difficulties we examine in this paper are PDEs with chaotic dynamics, and PDEs with discontinuous solutions. PDEs with chaotic dynamics are challenging because small errors will grow quickly in time, causing the numerical solution to diverge from the true solution if the solver is not accurate enough (Strogatz, 2001). Typically, high-order numerical methods are used to solve these problems as they offer the best asymptotic error bounds (Deville et al., 2002). PDEs with discontinuous solutions are difficult to solve because high-order methods lead to Gibbs phenomena near discontinuities (Gottlieb & Shu, 1997), which can lead to numerical instabilities. High-order methods are derived with the assumption that the solution is smooth (LeVeque, 2007), but no method can achieve better than first-order accuracy in the presence of a discontinuity (LeVeque, 2002). Hence, we can see that methods used to solve turbulence and discontinuities are at odds with each other, which makes it especially challenging to simulate problems with both of these issues. In Figure 1, we present a qualitative overview of problems in fluid mechanics that involve turbulent and/or discontinuous solutions.

Figure 1. Problems with discontinuous and chaotic behavior. Problems in blue are examined in our experiments. (Diagram axes: chaotic vs. discontinuous; problems shown: Shu-Osher problem, inviscid Burgers' equation, KS equation, high Reynolds number incompressible flow, shocklets, potential flow, cavitation, Sod problem, double Mach reflection test, shock-induced flow separation, advection of a discontinuity, reacting flows.)

arXiv:2002.03014v1 [cs.LG] 7 Feb 2020


Numerical methods such as FDM and FVM are often developed without knowledge of the specific equation that they will be used to solve, and instead opt to maximize the rate of convergence (LeVeque, 2007). However, the dynamics of PDEs greatly influence the structure of their solutions. This information should be used to maximize the performance of the numerical methods used to simulate these equations. Machine learning is the natural tool to hybridize this information with traditional numerical methods, as detailed experimental/simulation data that contains this behavior is available and can be used to improve these numerical methods. Furthermore, machine learning techniques for capturing temporal behavior such as LSTMs (Hochreiter & Schmidhuber, 1997a;b; Gers et al., 1999), and spatially local behavior such as CNNs (Fukushima, 1980; LeCun et al., 1989; 1998), have undergone significant development. This makes machine learning highly effective for modeling problems with these structures, both of which occur in time-dependent PDEs.

1.2. Related Work

Research intersecting PDEs with ML can be divided into two main goals: discovering PDEs from data, and using ML to better solve PDEs (Raissi et al., 2019). Although our paper focuses on solving PDEs, we will briefly discuss efforts to discover them from data. Brunton et al. (2016) learned dynamical systems by applying lasso regression to a library of functions. Many papers used LSTMs or other sequence models to learn dynamical systems (Hagge et al., 2017; Yu et al., 2017; Vlachas et al., 2018; Wan et al., 2018; Li et al., 2017). Other papers learned coarse-graining models by following a system identification approach (Ling et al., 2016).

Solving PDEs with machine learning can be further broken up into two main areas: using data to develop better solvers, and parameterizing solutions to PDEs as a neural network and learning weights to minimize the pointwise error of the PDE (Lagaris et al., 1998). This second idea has been further developed (Rudd, 2013) to solve problems such as high-dimensional PDEs (Sirignano & Spiliopoulos, 2018). Hsieh et al. (2019) learned iterative PDE solvers for time-independent problems that are solved via matrix inversion. In our work, we focus on time-dependent problems that are solved via time-stepping, though both involve discretization. There have been numerous efforts to develop discretization methods tailored to specific types of dynamics. In fluid mechanics, much work has gone into developing shock capturing methods, spatial discretization algorithms specially suited towards solving PDEs with discontinuous solutions (Harten, 1983). This strategy can be viewed as a form of coarse-graining, as the true equations have viscous terms that prevent discontinuities from forming. They instead form very steep but continuous features, but these require a very fine grid to fully resolve. There have also been machine learning strategies that develop equation-specific spatial discretization schemes. The idea of embedding machine learning models into spatial discretization methods for the purpose of coarse-graining was introduced by Bar-Sinai et al. (2019). In their paper, they trained a neural network to interpolate over the solution of a PDE to more accurately predict the solution at the next timestep. PDE-Net followed a similar approach (Long et al., 2017; 2019) in learning derivatives while also learning the PDE from data.

1.3. Our Contribution

We try to simultaneously exploit the spatial and temporal structure of PDEs through the convolutional LSTM architecture of our neural network, while maintaining the computational structures used in FVM/FDM. While some papers use convolutional LSTMs or related architectures to predict future values of time-series that also possess the property of spatial locality (Mohan et al., 2019; Xingjian et al., 2015), none exploit the FVM/FDM structures that naturally discretize PDEs. Other papers have developed coarse-graining models by learning from data and have embedded these into CNNs (Bar-Sinai et al., 2019), but none simultaneously used an LSTM structure to also exploit the temporal structure of PDEs. We also present a novel training approach of directly minimizing simulation error over long time-horizons by building upon ideas developed for PDE-Net (Long et al., 2017). Our approach aims to push the field of data-driven scientific computing forward by combining these ideas in a sufficiently generic and efficient framework to permit extensions to many problems.

2. Numerical Methods Background

2.1. Strong Stability Preserving Runge-Kutta Methods

Runge-Kutta methods are a family of numerical integration methods used to advance a differential equation $\partial u / \partial t = L(u)$ forward in time. In this paper, we use SSPRK3 (Gottlieb & Shu, 1998)

$$
\begin{aligned}
u^{(1)} &= u^{(n)} + \Delta t\, L(u^{(n)}), \\
u^{(2)} &= \tfrac{3}{4} u^{(n)} + \tfrac{1}{4} u^{(1)} + \tfrac{1}{4} \Delta t\, L(u^{(1)}), \\
u^{(n+1)} &= \tfrac{1}{3} u^{(n)} + \tfrac{2}{3} u^{(2)} + \tfrac{2}{3} \Delta t\, L(u^{(2)}),
\end{aligned}
\tag{1}
$$

a three-step, third-order accurate method that preserves the total variation diminishing (TVD) property $TV(u(t_{i+1})) \le TV(u(t_i))$ of explicit Euler, where total variation is defined as

$$
TV(u) = \sum_{i=1}^{N} |u(x_i) - u(x_{i-1})|.
\tag{2}
$$
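As a concrete illustration, the following is a minimal sketch of the update in Eq. (1); the function name and the callable `L`, which stands in for whatever spatial discretization supplies $\partial u / \partial t$, are our own illustrative choices rather than the paper's code.

```python
import numpy as np

def ssprk3_step(u, dt, L):
    """Advance u by one SSPRK3 step (Eq. 1); L(u) returns du/dt."""
    u1 = u + dt * L(u)                                   # Euler substep
    u2 = 0.75 * u + 0.25 * u1 + 0.25 * dt * L(u1)
    return u / 3.0 + (2.0 / 3.0) * u2 + (2.0 / 3.0) * dt * L(u2)

# Example: linear advection u_t = -a u_x with a first-order upwind operator.
a, dx = 1.0, 0.01
L_upwind = lambda u: -a * (u - np.roll(u, 1)) / dx
u = np.exp(-100 * (np.linspace(0, 1, 100, endpoint=False) - 0.5) ** 2)
u = ssprk3_step(u, dt=0.005, L=L_upwind)
```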


A TVD method prevents the time-stepping method from adding spurious oscillations to the solution, which is a major concern for PDEs with discontinuous solutions, as they can lead to instabilities that crash the simulation in nonlinear PDEs such as the inviscid Burgers' equation.

2.2. Finite Difference Method

One spatial discretization method we combine with machine learning is FDM. In this method, derivatives of the solution are approximated as a weighted combination of local values of the solution. For example, one could approximate the first derivative of a function as

$$
\frac{\partial u}{\partial x} \approx \frac{1}{\Delta x} \sum_{i=-1}^{1} c_i u_i = \frac{u_1 - u_{-1}}{2\Delta x}
\tag{3}
$$

by using coefficients $c_{-1} = -\tfrac{1}{2}$, $c_0 = 0$, and $c_1 = \tfrac{1}{2}$. We use the facts that the bounds on $i$ can be expanded and the values of $c_i$ are not fully constrained to allow a neural network to choose these coefficient values based on the local solution as $c_{a:b} = f(u_{a:b})$.
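A sketch of Eq. (3), assuming a periodic grid (our illustrative setup, not necessarily the paper's); the stencil coefficients appear as an explicit argument so a learned model could supply solution-dependent values in place of the fixed centered-difference ones:

```python
import numpy as np

def first_derivative(u, dx, c=(-0.5, 0.0, 0.5)):
    """du/dx via Eq. (3) on a periodic grid; c = (c_{-1}, c_0, c_1).

    The defaults reproduce the second-order centered difference; a neural
    network could instead choose c from the local solution, c_{a:b} = f(u_{a:b}).
    """
    return (c[0] * np.roll(u, 1) + c[1] * u + c[2] * np.roll(u, -1)) / dx
```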

2.3. Finite Volume Method

We also consider equations that are more naturally solved using FVM, a spatial discretization method similar to FDM. FVM offers the advantage of improved stability (Bar-Sinai et al., 2019), and is easier to extend to unstructured meshes (Chen et al., 2003).

2.3.1. EXAMPLE

Consider the scalar conservation law

$$
\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = 0.
\tag{4}
$$

One can split the x-domain into cells of width $\Delta x$ and average over them as

$$
\frac{1}{\Delta x} \int_{x_i}^{x_i + \Delta x} \left( \frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} \right) dx = 0
\tag{5}
$$

to get

$$
\frac{\partial \bar{u}_i}{\partial t} + \frac{f(u(x_i + \Delta x)) - f(u(x_i))}{\Delta x} = 0.
\tag{6}
$$

Note that this equation is still exact. The cell average values $\bar{u}$ are tracked and interpolated to find $u$ locally as

$$
u(x_i) \approx \sum_{j=a}^{b} c_j \bar{u}_j.
\tag{7}
$$

Once again, the coefficients $c_j$ are not fully constrained and can be determined using machine learning.
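A minimal sketch of the resulting semi-discretization, assuming a periodic grid and a one-sided interpolation stencil starting at the cell itself (i.e., $a = 0$); the names and stencil choice are our illustrative assumptions:

```python
import numpy as np

def fvm_rhs(u_bar, dx, flux, coeffs):
    """Right-hand side of Eq. (6): d(u_bar_i)/dt = -(f_{i+1/2} - f_{i-1/2})/dx.

    u_bar holds the cell averages; coeffs are the interpolation weights of
    Eq. (7) used to reconstruct u at each cell's right face. These weights
    are exactly what the neural network is free to choose.
    """
    u_face = sum(c * np.roll(u_bar, -j) for j, c in enumerate(coeffs))
    f_face = flux(u_face)                       # flux at right faces
    return -(f_face - np.roll(f_face, 1)) / dx  # right-face minus left-face flux
```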

3. PDE Background

3.1. Advection Equation

The linear advection equation is written as

$$
\frac{\partial u}{\partial t} + a \frac{\partial u}{\partial x} = 0.
\tag{8}
$$

It possesses solutions that are translations of the initial conditions at wavespeed $a$, i.e. for an initial condition of $u(x, 0) = f(x)$, the exact solution is $u(x, t) = f(x - at)$. In terms of coarse-graining, the sub-grid behavior has no influence on the solution at subsequent times. Once discretization occurs, no inference can be made about what happens between grid points, as this is fully determined by the initial condition rather than the dynamics of the equation. When this equation is solved with a discontinuous initial condition, it serves as a toy problem for advecting different materials in a multicomponent/multiphase flow. Numerical error causes the discontinuity to become smeared out during the simulation.
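Since the exact solution is just a shifted copy of the initial condition, generating reference data is trivial; a sketch, assuming a 1-periodic domain (an assumption made here for illustration):

```python
def advection_exact(f, x, t, a):
    """Exact solution u(x, t) = f(x - a t) of Eq. (8), wrapped periodically."""
    return f((x - a * t) % 1.0)
```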

3.2. Inviscid Burgers’ Equation

The inviscid Burgers' equation is written in non-conservative form as

$$
\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = 0,
\tag{9}
$$

and is used to model wave-breaking. The nonlinear term causes shockwaves to form in finite time from smooth initial conditions. A shockwave is a special case of a discontinuity that is forced by the dynamics: unlike the linear advection equation, the discontinuity can propagate without progressive diffusion.

3.3. Kuramoto-Sivashinsky Equation

The KS equation is written as

$$
\frac{\partial u}{\partial t} + \nu \frac{\partial^4 u}{\partial x^4} + \frac{\partial^2 u}{\partial x^2} + \frac{1}{2} \frac{\partial u^2}{\partial x} = 0,
\tag{10}
$$

and is used as a toy problem for turbulent flame fronts. Its dynamics lead to chaotic spatiotemporal behavior (Hyman & Nicolaenko, 1986). Chaos for PDEs is analogous to chaos in dynamical systems, defined by a small perturbation in initial conditions drastically affecting the time-evolution of the system (Strogatz, 2001).

4. Methodologies

4.1. Network Architecture

We structure our network to utilize well-known methods from numerical PDEs described earlier (SSPRK, FDM, and FVM).

Figure 2. Example solution of (A) linear advection, (B) inviscid Burgers', and (C) KS equations.

The network architecture mimics the structure of a grid used to numerically solve a PDE, where the size of the filter of the convolutional layer corresponds to the stencil that selects which information is used to compute the derivative, and each output of the network corresponds to the solution of the PDE at time $t_j$ and location $x_i$. This can be seen in Figure 3.

When looking at a specific realization of the stencil in the x-domain, we arrive at an LSTM network. The purpose of the LSTM is to transfer information about the solution over long time-horizons, which adds a feature to our method that is not present in traditional methods. The hidden information is transferred from substep to substep ($t_{j-1}$ to $t_{j-2/3}$ to $t_{j-1/3}$), and then to the next step at the end of the current step ($t_{j-1/3}$ to $t_j$).

Within each evaluation of the LSTM, the information is used to compute the solution at the next substep in a way that mimics traditional FDM or FVM. The solution values at the previous timestep $u_{i-2:i+2,j-1}$ and the hidden information with dimension 32 from the previous timestep are input to a neural network with 3 layers and 32 neurons per layer. This network outputs the hidden information to the next substep, as well as a prediction of the FVM or FDM coefficients. The neural network does not directly predict these coefficients. Instead, it first computes a perturbation $\Delta c$ to the maximum-order coefficients $c_{opt}$, with $\ell_2$ regularization applied to $\Delta c$, similarly to a ResNet (He et al., 2016). We find that adding this step speeds up training and improves the performance of the model, as we are effectively biasing our estimate of the coefficients towards the coefficients that are optimal in the absence of prior knowledge about the solution. If the model is not confident about how perturbing $c_{opt}$ will affect the accuracy of the derivative, it can output values close to zero to default back to $c_{opt}$. Once $c_{opt}$ has been perturbed by $\Delta c$ to obtain $c$, an affine transformation is performed on $c$ to guarantee that the coefficients are the desired order of accuracy (Bar-Sinai et al., 2019). This is one of the main benefits of using an FDM structure for our model, as we can prove that our model gives a solution that converges to the true solution at a known rate as the grid is refined. The details of this transformation can be seen in Section 4.3.

After the affine transformation, the final coefficients $c$ are known. The spatial derivatives are computed by taking the inner product of the coefficients $c_{i-2:i+2}$ with $u_{i-2:i+2,j-1}$. For equations with multiple spatial derivatives, a different set of coefficients is computed for each. These spatial derivatives are then used to compute the time derivative, which can finally be used to compute the solution at the next substep. This process is then repeated for the desired number of timesteps. Our network architecture allows us to train on exact solutions or data of the PDE end-to-end.
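To summarize one coefficient evaluation in code, here is a sketch of the path from the network's perturbation $\Delta c$ to a spatial derivative at a single grid point. The function name, and treating the constraint layer as an inline solve, are our illustrative choices; $(A, b)$ are the Taylor-series constraints described in Section 4.3.

```python
import numpy as np

def learned_derivative(u_stencil, dc, c_opt, A, b, dx, n):
    """n-th spatial derivative at one point from learned FDM coefficients.

    dc     : perturbation output by the neural network
    c_opt  : maximum-order coefficients the estimate is biased toward
    (A, b) : linear order-of-accuracy constraints (Section 4.3)
    """
    c = c_opt + dc                                     # ResNet-style bias
    c = c + A.T @ np.linalg.solve(A @ A.T, b - A @ c)  # affine constraint layer
    return np.dot(c, u_stencil) / dx**n                # inner product with stencil
```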

4.2. Training Algorithm

We train our network in a way that exactly mimics how it would be used to solve a PDE. More specifically, we start with some random initial condition and use the network to step the solution forward in time, and compare the result to the exact solution. For the linear advection equation, the analytical solution is known for arbitrary initial conditions. For the inviscid Burgers' equation and KS equation, the same simulation is also carried out on a fine grid using a baseline numerical method, which results in a solution that can be considered approximately exact. The inviscid Burgers' equation is solved using WENO5 FVM (Jiang & Shu, 1996), and the KS equation is solved using fourth-order FDM. The loss is computed by downsampling the exact solution onto the grid of the neural network, averaging over the square error at every point in time and space. We found this training strategy to be far superior to training on only 1 time-step at a time, as our approach is capable of minimizing long-term error accumulation and training the network to be numerically stable.

Figure 3. Network architecture at (A) the top level, (B) LSTM at a specific x location, (C) each evaluation of the LSTM.

In terms of computational complexity, our method is identical to backpropagation through time (BPTT) (Werbos, 1990).

4.3. Accuracy Constraints

FiniteNet is structured such that the numerical method is guaranteed to satisfy a minimum order of accuracy $n$, $e = o(\Delta x^n)$ (Bar-Sinai et al., 2019). One can perform a Taylor series expansion on approximations of the form of Eq. (3) to obtain linear constraint equations that the coefficients must satisfy for the method to achieve a desired order of accuracy. We take the coefficients that the neural network outputs, $c$, and find the minimal perturbation $\Delta c$ that causes them to satisfy the constraint equations, which leads to the optimization problem

$$
\min_{\Delta c \in \mathbb{R}^5} \sum_{n=-2}^{2} (\Delta c_n)^2 \quad \text{s.t.} \quad A(c + \Delta c) = b,
\tag{11}
$$

which has analytical solution $\Delta c = A^T (A A^T)^{-1} (b - Ac)$, which can be expressed as an affine transformation on $c$, and can therefore be added to the network as a layer with no activation function and untrainable weights.
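A minimal numpy sketch of this projection; `project_to_constraints` is our own name, and $A$, $b$ encode the Taylor-series constraints for the desired order:

```python
import numpy as np

def project_to_constraints(c, A, b):
    """Minimal-norm correction so that A (c + dc) = b holds (Eq. 11).

    Uses the closed form dc = A^T (A A^T)^{-1} (b - A c). Since this is
    affine in c, it can be implemented as a fixed, untrainable layer.
    """
    dc = A.T @ np.linalg.solve(A @ A.T, b - A @ c)
    return c + dc
```

Solving the small system with `solve` rather than forming the inverse explicitly keeps the layer numerically well behaved.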

5. Experimental Results

5.1. Summary

We find that our method is capable of reducing the error relative to the baseline method for all three equations tested. These results show promise for generalization to other equations, as each PDE we examined has qualitatively different behavior. A table summarizing our results can be seen in Table 1. The variation is due to averaging results from randomly generated initial conditions. The error ratio $e_r$ is computed as $e_r = e_F / e_B$, where $e_F$ is the FiniteNet MSE and $e_B$ is the MSE of the baseline method. We analyze $\log_{10} e_r$ because $e_r \to 0$ as $e_F \to 0$ and $e_r \to \infty$ as $e_B \to 0$. Hence, one case of the baseline method outperforming FiniteNet could skew the average and standard deviation of $e_r$ significantly while the opposite would have very little effect, resulting in a bias. Additionally, the data empirically follows a log-normal distribution more closely than a normal distribution.
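The aggregation itself is simple; a sketch, with the array names as our own placeholders:

```python
import numpy as np

def log_error_ratio_stats(e_finitenet, e_baseline):
    """Mean and std of log10(e_F / e_B) over test runs, as in Table 1."""
    log_ratio = np.log10(np.asarray(e_finitenet) / np.asarray(e_baseline))
    return log_ratio.mean(), log_ratio.std()
```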

Hyperparameters were tuned to optimize performance on the linear advection equation. These same hyperparameters were then used as starting points for the other equations and tuned as necessary. For example, the time-horizon used in training was changed from 100 to 200 for the KS equation, as this helped FiniteNet track chaotic solutions over longer time-horizons. Additionally, it was found that increasing the $\ell_2$ regularization constant from $\lambda = 0.001$ to $0.1$ for the KS equation improved performance.


Algorithm 1 Train FiniteNet

  Select number of epochs n_e
  Select minibatch size n_m
  Select time-horizon T
  for i = 1 to n_e do
    Set total MSE to 0
    for j = 1 to n_m do
      Select initial condition u_j(x, 0)
      Determine exact solution u*_j
      Initialize hidden information H(0) to 0
      for k = 1 to T do
        Compute u_j(x, t_k) and H(t_k) with FiniteNet
      end for
      Add MSE between u_j and u*_j to total MSE
    end for
    Compute gradient of simulation error w.r.t. FiniteNet parameters
    Update FiniteNet parameters with ADAM optimizer
  end for
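A PyTorch-flavored sketch of Algorithm 1; `model`, `make_initial_condition`, and `exact_solver` are hypothetical stand-ins for the paper's components, with `model(u, h)` assumed to return the next solution snapshot and hidden state:

```python
import torch

def train_finitenet(model, make_initial_condition, exact_solver,
                    n_epochs=400, n_minibatch=5, horizon=100, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        total_mse = 0.0
        for _ in range(n_minibatch):
            u = make_initial_condition()
            u_exact = exact_solver(u, horizon)  # reference trajectory, (T, n_x)
            h = None                            # zero-initialized hidden state
            mse = 0.0
            for k in range(horizon):
                u, h = model(u, h)              # one full SSPRK3 step
                mse = mse + torch.mean((u - u_exact[k]) ** 2)
            total_mse = total_mse + mse / horizon
        opt.zero_grad()
        total_mse.backward()                    # BPTT through the whole rollout
        opt.step()
```

Accumulating the loss over the entire rollout before the single backward call is what lets the gradient account for long-term error growth, rather than one timestep at a time.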

Table 1. Mean and standard deviation of log10 e_r for each PDE

EQUATION                  log10 e_r       BETTER?
LINEAR ADVECTION          -0.27 ± 0.08    ✓
INVISCID BURGERS'         -0.53 ± 0.08    ✓
KURAMOTO-SIVASHINSKY      -0.62 ± 0.41    ✓

The inviscid Burgers' equation was trained for 500 epochs instead of 400 because the error was still decreasing.

All data generation, training, and testing was carried out on a desktop computer with 32 GB of RAM and a single CPU with 3 GHz processing speed. Hence, by scaling the computing resources one could scale our method to more challenging problems. Our code and trained models can be found here: https://github.com/FiniteNetICML2020Code/FiniteNet, which includes links to our data.

We also find that FiniteNet trains methods that are stable, which tends to be a challenge in the field of learned FDM. Although we cannot formally prove stability, we observe in Figure 4 that if the randomly initialized network weights lead to an unstable scheme, the method will quickly learn to become stable by minimizing accumulated error.

5.2. Linear Advection Equation

For the linear advection equation, we generate random discontinuous initial conditions, and compare the errors obtained using FiniteNet to errors obtained using WENO5. Each epoch involved generating 5 new initial conditions, computing the error against the exact solution, and updating the weights with the Adam optimizer (Kingma & Ba, 2014). We trained for 400 epochs.

Figure 4. Training FiniteNet from unstable initial condition. (Axes: epoch vs. error ratio, log scale.)


After training was complete, we tested the model on 1000 more random initial conditions and computed the error ratio. We saw that FiniteNet outperformed WENO5 in 999 of the 1000 simulations. A PMF of the error ratio can be seen in Figure 5.

Figure 5. PMF of linear advection testing.

We also plot a WENO5 solution and FiniteNet solution in Figure 6 to gain insight into how FiniteNet improves the solution. We see that FiniteNet more sharply resolves discontinuities at the cost of adding oscillations, which leads to a net reduction in error.

Figure 6. Solutions of linear advection equation.


We can use this result to hypothesize that FiniteNet will further improve the solution when the discontinuity is larger. We verify our hypothesis by plotting error ratio against discontinuity size in Figure 7, and verifying that larger discontinuities lead to lower error ratios.

Figure 7. Discontinuity size vs error ratio for advection equation.

5.3. Inviscid Burgers’ Equation

We train the neural network to interpolate flux values for the inviscid Burgers' equation. Once again, each epoch involves generating five random initial conditions. In lieu of an exact solution, we use WENO5 on a 4x refined mesh to approximate the exact solution. We then ran 1000 simulations with the trained network and compared the results to WENO5. FiniteNet achieved a lower error on 998 of the 1000 test cases.

Figure 8. PMF of inviscid Burgers' testing.

By examining the error induced when numerically solving this equation with WENO5 and FiniteNet, we see that WENO5 accumulates higher error around shocks. This tells us that FiniteNet achieves its error reduction by more sharply resolving the shocks.

We can get a prediction of roughly how many shocks will develop in the solution, and how large they will be, by looking at the total variation of the initial condition. We plot the total variation of the initial condition and compare it to the error ratio in Figure 10.

This data shows that FiniteNet tends to do better relative to WENO5 when the total variation of the initial condition is higher, which helps confirm our result that FiniteNet offers the most improvement when many large shocks form.

Figure 9. Solution of inviscid Burgers' equation.

Figure 10. Trends between total variation and error ratio.


5.4. Kuramoto-Sivashinsky Equation

The network is trained for the KS equation in the same way as was done for the inviscid Burgers' equation: by solving the equation on a 4x refined mesh to closely approximate the exact solution. We generate our random initial conditions for training by setting a random initial condition for the exact solution and simulating until the trajectory has reached the chaotic attractor. We then start training from random sequences of the solution on the attractor so that we are learning a more consistent set of dynamics, as the initial transient tends to be less predictable.

After training, we use both FiniteNet and fourth-order FDM to solve the KS equation from 1000 new initial conditions. We find that FiniteNet achieves a lower error in 961 of the 1000 tests. Interestingly, we see that in some cases standard FDM is unable to track the chaotic evolution and results in a solution with no visual fit, while FiniteNet succeeds at tracking the chaotic trajectory. In the cases where FiniteNet leads to a larger error than FDM, we see that neither method could track the chaotic trajectory, so there is some small probability that FiniteNet may follow a worse trajectory than FDM.

Figure 11. (A) Best case, (B) typical case, and (C) worst-case error ratio between FiniteNet and FDM for KS.

In order to determine the reliability of FiniteNet compared to FDM, we compute the statistics of the errors individually. FiniteNet has mean error 1.20 and standard deviation 1.89, while FDM has mean error 2.94 and standard deviation 3.01. So we see that FDM error has higher mean and variance, and we conclude that FiniteNet is more reliable.

Figure 12. PMF of KS testing.

6. Conclusions

In this paper, we have presented FiniteNet, a machine learning approach that can reduce the error of numerically solving a PDE. By combining the LSTM with well-understood and tested discretization schemes, we can significantly reduce the error for PDEs displaying a variety of behavior including chaos and discontinuous solutions. The FiniteNet architecture mimics the structure of a numerical PDE solver, and builds the timestepping method, spatial discretization, and PDE into the network. We train FiniteNet by using it to simulate the PDE, and minimizing simulation error against a trusted solution. This training approach causes the resulting numerical scheme to be empirically stable, which has been a challenge for other approaches.

We compared numerical solutions obtained by FiniteNet to those of baseline methods. We saw that for the inviscid Burgers' equation and the linear advection equation, FiniteNet more sharply resolves discontinuities at the cost of sometimes adding small oscillations. This result makes intuitive sense, as the global error is dominated by regions near discontinuities, so FiniteNet reduces the error the most by improving performance in these areas. When examining the KS equation, we see that FiniteNet reduces the error by more accurately tracking the chaotic trajectory of the exact solution. The main challenge of chaotic systems is that small errors grow quickly over time, which can lead to solutions that have completely diverged from the exact solution. FiniteNet is able to significantly reduce error by preventing this from happening in many realizations of this PDE.

Further work could involve comparing the runtime vs. error of FiniteNet to baseline approaches to more directly characterize how much of an improvement FiniteNet can offer. Additionally, we have so far only examined problems in one spatial dimension with periodic boundary conditions. It will be interesting to test the method on a large-scale problem, such as a turbulent flow.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 1745301.

References

Bar-Sinai, Y., Hoyer, S., Hickey, J., and Brenner, M. Learning data-driven discretizations for partial differential equations. Proceedings of the National Academy of Sciences, 116(31):15344–15349, 2019.

Brunton, S. L., Proctor, J. L., and Kutz, J. N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proceedings of the National Academy of Sciences, 113(15):3932–3937, 2016.

Chen, C., Liu, H., and Beardsley, R. C. An unstructured grid, finite-volume, three-dimensional, primitive equations ocean model: application to coastal ocean and estuaries. Journal of Atmospheric and Oceanic Technology, 20(1):159–186, 2003.

Chen, H., Patel, V., and Ju, S. Solutions of Reynolds-averaged Navier-Stokes equations for three-dimensional incompressible flows. Journal of Computational Physics, 88(2):305–336, 1990.

Deville, M. O., Fischer, P. F., and Mund, E. High-order methods for incompressible fluid flow, volume 9. Cambridge University Press, 2002.

Drikakis, D. and Geurts, B. Turbulent flow computation, volume 66. Springer Science & Business Media, 2006.

Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980.

Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with LSTM. 1999.

Gottlieb, D. and Shu, C.-W. On the Gibbs phenomenon and its resolution. SIAM Review, 39(4):644–668, 1997.

Gottlieb, S. and Shu, C.-W. Total variation diminishing Runge-Kutta schemes. Mathematics of Computation of the American Mathematical Society, 67(221):73–85, 1998.

Hagge, T., Stinis, P., Yeung, E., and Tartakovsky, A. M. Solving differential equations with unknown constitutive relations as recurrent neural networks. arXiv preprint arXiv:1710.02242, 2017.

Halpern, B. Large eddy simulation of complex engineering and geophysical flows. Cambridge University Press, 1993.

Harten, A. High resolution schemes for hyperbolic conservation laws. Journal of Computational Physics, 49(3):357–393, 1983.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997a.

Hochreiter, S. and Schmidhuber, J. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems, pp. 473–479, 1997b.

Hsieh, J.-T., Zhao, S., Eismann, S., Mirabella, L., and Ermon, S. Learning neural PDE solvers with convergence guarantees. arXiv preprint arXiv:1906.01200, 2019.

Hyman, J. M. and Nicolaenko, B. The Kuramoto-Sivashinsky equation: a bridge between PDEs and dynamical systems. Physica D: Nonlinear Phenomena, 18(1-3):113–126, 1986.

Ishihara, T., Gotoh, T., and Kaneda, Y. Study of high-Reynolds number isotropic turbulence by direct numerical simulation. Annual Review of Fluid Mechanics, 41:165–180, 2009.

Jiang, G. and Shu, C. Efficient implementation of weighted ENO schemes. Journal of Computational Physics, 126(1):202–228, 1996.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Lagaris, I., Likas, A., and Fotiadis, D. Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5):987–1000, 1998.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

LeVeque, R. J. Finite difference methods for ordinary and partial differential equations: steady-state and time-dependent problems, volume 98. SIAM, 2007.

LeVeque, R. J. Finite volume methods for hyperbolic problems, volume 31. Cambridge University Press, 2002.

Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926, 2017.

Ling, J., Kurzawski, A., and Templeton, J. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016.

Long, Z., Lu, Y., Ma, X., and Dong, B. PDE-Net: Learning PDEs from data. arXiv preprint arXiv:1710.09668, 2017.

Long, Z., Lu, Y., and Dong, B. PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network. Journal of Computational Physics, 399:108925, 2019.

Mohan, A., Daniel, D., Chertkov, M., and Livescu, D. Compressed convolutional LSTM: An efficient deep learning framework to model high fidelity 3D turbulence. arXiv preprint arXiv:1903.00033, 2019.

Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019.

Rudd, K. Solving partial differential equations using artificial neural networks. PhD thesis, Duke University, 2013.

Sirignano, J. and Spiliopoulos, K. DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375:1339–1364, 2018.

Sommerfeld, A. Partial differential equations in physics. Academic Press, 1949.

Spalart, P. Reflections on RANS modelling. In Progress in Hybrid RANS-LES Modelling, pp. 7–24. Springer, 2010.

Strogatz, S. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering (Studies in Nonlinearity). 2001.

Vlachas, P. R., Byeon, W., Wan, Z. Y., Sapsis, T. P., and Koumoutsakos, P. Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844, 2018.

Wan, Z. Y., Vlachas, P., Koumoutsakos, P., and Sapsis, T. Data-assisted reduced-order modeling of extreme events in complex dynamical systems. PLOS ONE, 13(5), 2018.

Werbos, P. J. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Wilcox, D. C. Turbulence modeling for CFD, volume 2. DCW Industries, La Cañada, CA, 1998.

Xingjian, S., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-c. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810, 2015.

Yu, R., Zheng, S., Anandkumar, A., and Yue, Y. Long-term forecasting using tensor-train RNNs. arXiv, 2017.