Numerical Integration
Marc Peter Deisenroth and Cheng Soon Ong
December 2020
Marc Deisenroth and Cheng Soon Ong's Tutorial at CM
Setting

- Approximate
$$\int_a^b f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n), \qquad x \in \mathbb{R}$$
- Nodes $x_n$ and corresponding function values $f(x_n)$
Numerical integration (quadrature)

Key idea: approximate $f$ using an interpolating function that is easy to integrate (e.g., a polynomial).
Quadrature approaches

| Quadrature | Interpolant | Nodes |
|---|---|---|
| Newton–Cotes | low-degree polynomials | equidistant |
| Gaussian | orthogonal polynomials | roots of polynomial |
| Bayesian | Gaussian process | user defined |
Newton–Cotes Quadrature
Overview

- Equidistant nodes $a = x_0, \dots, x_N = b$ partition the interval $[a, b]$
- Approximate $f$ in each partition with a low-degree polynomial
- Compute the integral for each partition analytically and sum the results
Trapezoidal rule

- Partition $[a, b]$ into $N$ segments with equidistant nodes $x_n$
- Locally linear approximation of $f$ between nodes
Trapezoidal rule (2)

- Area of a trapezoid with corners $(x_n, x_{n+1}, f(x_{n+1}), f(x_n))$:
$$\int_{x_n}^{x_{n+1}} f(x)\,dx \approx \frac{h}{2}\big(f(x_n) + f(x_{n+1})\big), \qquad h := |x_{n+1} - x_n| \;\text{(distance between nodes)}$$
- Error: $O(h^2)$
- Full integral:
$$\int_a^b f(x)\,dx \approx \frac{h}{2}\big(f_0 + 2f_1 + \cdots + 2f_{N-1} + f_N\big), \qquad f_n := f(x_n)$$
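The full-integral formula can be sketched in a few lines of NumPy (the test function and node count below are illustrative, not from the slides):

```python
import numpy as np

def trapezoid(f, a, b, N):
    """Composite trapezoidal rule with N equidistant segments."""
    x = np.linspace(a, b, N + 1)      # nodes x_0, ..., x_N
    fx = f(x)
    h = (b - a) / N                   # distance between nodes
    # h/2 * (f_0 + 2 f_1 + ... + 2 f_{N-1} + f_N)
    return h / 2 * (fx[0] + 2 * fx[1:-1].sum() + fx[-1])

# Example: integrate x^2 on [0, 1]; the exact value is 1/3
approx = trapezoid(lambda x: x**2, 0.0, 1.0, 100)
```

With $h = 0.01$ the $O(h^2)$ error is on the order of $10^{-5}$.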
Simpson's rule

- Partition $[a, b]$ into $N$ segments with equidistant nodes $x_n$
- Locally quadratic approximation of $f$ connecting triplets $\big(f(x_{n-1}), f(x_n), f(x_{n+1})\big)$
Simpson's rule (2)

- Area of segment:
$$\int_{x_{n-1}}^{x_{n+1}} f(x)\,dx \approx \frac{h}{3}\big(f_{n-1} + 4f_n + f_{n+1}\big), \qquad h := |x_{n+1} - x_n| \;\text{(distance between nodes)}$$
- Error: $O(h^4)$
- Full integral (for $N$ even):
$$\int_a^b f(x)\,dx \approx \frac{h}{3}\big(f_0 + 4f_1 + 2f_2 + 4f_3 + 2f_4 + \cdots + 2f_{N-2} + 4f_{N-1} + f_N\big)$$
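The composite rule above can be sketched directly in NumPy (the test function is illustrative); note that Simpson's rule is exact for polynomials up to degree three:

```python
import numpy as np

def simpson(f, a, b, N):
    """Composite Simpson's rule; N (the number of segments) must be even."""
    assert N % 2 == 0, "Simpson's rule needs an even number of segments"
    x = np.linspace(a, b, N + 1)
    fx = f(x)
    h = (b - a) / N
    # h/3 * (f_0 + 4 f_1 + 2 f_2 + ... + 2 f_{N-2} + 4 f_{N-1} + f_N)
    return h / 3 * (fx[0] + 4 * fx[1:-1:2].sum() + 2 * fx[2:-1:2].sum() + fx[-1])

# Exact for cubics even with a single triplet: integrate x^3 on [0, 2] (exact value 4)
approx = simpson(lambda x: x**3, 0.0, 2.0, 2)
```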
Example

$$\int_0^1 \exp\big(-x^2 - \sin^2(3x)\big)\,dx$$

[Figures: the integrand with observed function values and the trapezoidal and Simpson's-rule interpolants; integration error vs. number of nodes (log scale) for both rules]

- Simpson's rule yields better approximations
- Very good approximations are obtained fairly quickly
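The error comparison can be reproduced numerically. A minimal sketch for the slide's integrand follows; since the integral has no closed form, a very fine trapezoidal grid serves as a stand-in for the true value (node counts are illustrative):

```python
import numpy as np

f = lambda x: np.exp(-x**2 - np.sin(3 * x)**2)  # integrand from the example

def trapezoid(f, a, b, N):
    x = np.linspace(a, b, N + 1)
    fx = f(x)
    h = (b - a) / N
    return h / 2 * (fx[0] + 2 * fx[1:-1].sum() + fx[-1])

def simpson(f, a, b, N):
    x = np.linspace(a, b, N + 1)
    fx = f(x)
    h = (b - a) / N
    return h / 3 * (fx[0] + 4 * fx[1:-1:2].sum() + 2 * fx[2:-1:2].sum() + fx[-1])

reference = trapezoid(f, 0.0, 1.0, 200_000)     # fine-grid proxy for the true value
err_trap = abs(trapezoid(f, 0.0, 1.0, 20) - reference)
err_simp = abs(simpson(f, 0.0, 1.0, 20) - reference)
# Simpson's quadratic interpolant beats the trapezoid's linear one at equal node count
```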
Summary: Newton–Cotes quadrature

- Approximate the integrand between equidistant nodes with a low-degree polynomial (up to degree 6)
- Trapezoidal rule: linear interpolation
- Simpson's rule: quadratic interpolation → better approximation and smaller integration error
Gaussian Quadrature
Gaussian quadrature

- Named after Carl Friedrich Gauß
- Quadrature scheme that no longer relies on equidistant nodes → higher accuracy
- Central approximation:
$$\int_a^b f(x) w(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n)$$
- Weight function $w(x) \ge 0$ (plus some other integration-related properties, which are satisfied if $w(x)$ is a pdf)
- Goal: find nodes $x_n$ and weights $w_n$ so that the approximation error is minimized
Central idea

- Quadrature nodes $x_n$ are the roots of a family of orthogonal polynomials → nodes are no longer equidistant
- Exact if $f$ is a polynomial of degree $\le 2N - 1$, i.e.,
$$\int_a^b f(x) w(x)\,dx = \sum_{n=1}^{N} w_n f(x_n)$$
- The integral can be computed exactly by evaluating $f$ only $N$ times, at the optimal locations $x_n$ (the roots of an orthogonal polynomial) with corresponding optimal weights $w_n$
- More accurate than Newton–Cotes for the same number of evaluations (with some memory overhead)
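The exactness property is easy to check with NumPy's built-in Gauss–Legendre routine (the degree-5 test polynomial is illustrative): three nodes suffice to integrate any polynomial up to degree $2N - 1 = 5$ exactly.

```python
import numpy as np

N = 3                                                # 3 nodes -> exact up to degree 5
# Nodes are the roots of the Legendre polynomial P_3; weight w(x) = 1 on [-1, 1]
nodes, weights = np.polynomial.legendre.leggauss(N)

f = lambda x: 5 * x**5 - x**3 + 2 * x**2 + 1         # degree-5 polynomial
approx = np.sum(weights * f(nodes))

# Exact: odd powers vanish on [-1, 1]; int 2x^2 dx = 4/3 and int 1 dx = 2, so 10/3
exact = 10.0 / 3.0
```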
Example: Gauß–Hermite quadrature

- Solve
$$\int f(x) \underbrace{\exp(-x^2)}_{w(x)}\,dx = \sqrt{\pi}\int f(x)\,\mathcal{N}\big(x \,\big|\, 0, \tfrac{1}{2}\big)\,dx = \mathbb{E}_{x\sim\mathcal{N}(0,1/2)}\big[\sqrt{\pi}\, f(x)\big]$$
- With the change-of-variables trick → expectation w.r.t. a Gaussian measure:
$$\mathbb{E}_{x\sim\mathcal{N}(\mu,\sigma^2)}[f(x)] \approx \frac{1}{\sqrt{\pi}} \sum_{n=1}^{N} w_n f\big(\sqrt{2}\,\sigma x_n + \mu\big)$$
Example: Gauß–Hermite quadrature (2)

- Follow the general approximation scheme
$$\int f(x) \underbrace{\exp(-x^2)}_{w(x)}\,dx \approx \sum_{n=1}^{N} w_n f(x_n)$$
- Nodes $x_1, \dots, x_N$ are the roots of the Hermite polynomial
$$H_N(x) := (-1)^N \exp(x^2)\, \frac{d^N}{dx^N} \exp(-x^2)$$
- Weights $w_n$ are
$$w_n := \frac{2^{N-1} N!\, \sqrt{\pi}}{N^2\, \big[H_{N-1}(x_n)\big]^2}$$
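The change-of-variables formula maps directly onto NumPy's `hermgauss`, which returns exactly these nodes and weights for the weight function $\exp(-x^2)$. A minimal sketch (the moment $\mathbb{E}[x^2] = \mu^2 + \sigma^2$ used as a check is a standard identity, not from the slides):

```python
import numpy as np

# Nodes/weights for weight function exp(-x^2) (physicists' Hermite polynomials)
nodes, weights = np.polynomial.hermite.hermgauss(10)

def gauss_expectation(f, mu, sigma):
    """E_{x ~ N(mu, sigma^2)}[f(x)] via the substitution x = sqrt(2)*sigma*t + mu."""
    return np.sum(weights * f(np.sqrt(2) * sigma * nodes + mu)) / np.sqrt(np.pi)

# Check against a known second moment: E[x^2] = mu^2 + sigma^2
mu, sigma = 1.5, 0.7
approx = gauss_expectation(lambda x: x**2, mu, sigma)
exact = mu**2 + sigma**2
```

Because $x^2$ is a degree-2 polynomial, the result is exact (up to round-off) with any $N \ge 2$ nodes.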
Overview (Stoer & Bulirsch, 2002)

$$\int_a^b w(x) f(x)\,dx \approx \sum_{n=1}^{N} w_n f(x_n)$$

| $[a, b]$ | $w(x)$ | Orthogonal polynomial |
|---|---|---|
| $[-1, 1]$ | $1$ | Legendre polynomials |
| $[-1, 1]$ | $(1 - x^2)^{-1/2}$ | Chebyshev polynomials |
| $[0, \infty)$ | $\exp(-x)$ | Laguerre polynomials |
| $(-\infty, \infty)$ | $\exp(-x^2)$ | Hermite polynomials |
Application areas

- Probabilities for rectangular bivariate/trivariate Gaussian and t distributions (Genz, 2004)
- Integrating out (marginalizing) a few hyperparameters in a latent-variable model (INLA; Rue et al., 2009)
- Predictions with a Gaussian process classifier (GPflow; Matthews et al., 2017)
Summary: Gaussian quadrature

- Orthogonal polynomials to approximate $f$; nodes are the roots of the polynomial
- Higher accuracy than Newton–Cotes
- Method of choice for low-dimensional problems (1–3 dimensions)
- Cannot naturally deal with noisy observations
- Only works in low dimensions
- Approaches that scale better with dimensionality:
  - Bayesian quadrature (up to ≈ 10 dimensions)
  - Monte Carlo estimation (high dimensions)
Bayesian Quadrature
Bayesian quadrature: Setting and key idea

$$Z := \int f(x) p(x)\,dx = \mathbb{E}_{x\sim p}[f(x)]$$

- Function $f$ is expensive to evaluate
- Integration in moderate ($\le 10$) dimensions
- Deal with noisy function observations

Key idea:
- Phrase quadrature as a statistical inference problem → probabilistic numerics (e.g., Hennig et al., 2015; Briol et al., 2015)
- Estimate a distribution on $Z$ using a dataset $\mathcal{D} := \big\{(x_1, f(x_1)), \dots, (x_N, f(x_N))\big\}$
Bayesian quadrature: How it works

$$Z := \int f(x) p(x)\,dx = \mathbb{E}_{x\sim p}[f(x)]$$

- Estimate a distribution on $Z$ using a dataset $\mathcal{D} := \big\{(x_1, f(x_1)), \dots, (x_N, f(x_N))\big\}$
- Place a (Gaussian process) prior distribution on $f$ and determine the posterior via Bayes' theorem (Diaconis, 1988; O'Hagan, 1991; Rasmussen & Ghahramani, 2003) → the distribution on $f$ induces a distribution on $Z$
- Generalizes to noisy function observations $y = f(x) + \varepsilon$

[Figure: observations, true integrand, and GP model with uncertainty]
Bayesian quadrature: Details

$$Z := \int f(x) p(x)\,dx, \qquad f \sim GP(0, k)$$

- Exploit linearity of the integral (the integral of a GP is another GP):

$$p(Z) = p\left(\int f(x) p(x)\,dx\right) = \mathcal{N}\big(Z \,\big|\, \mu_Z, \sigma_Z^2\big)$$

$$\mu_Z = \int \mu_{\text{post}}(x)\, p(x)\,dx = \mathbb{E}_x[\mu_{\text{post}}(x)]$$

$$\sigma_Z^2 = \iint k_{\text{post}}(x, x')\, p(x) p(x')\,dx\,dx' = \mathbb{E}_{x,x'}[k_{\text{post}}(x, x')]$$
Bayesian quadrature: Mean

Setting: $Z = \int f(x) p(x)\,dx$, $f \sim GP(0, k)$, $p(Z) = \mathcal{N}\big(Z \mid \mu_Z, \sigma_Z^2\big)$, training data $X, y$.

$$\mathbb{E}_f[Z] = \mu_Z = \underbrace{\mathbb{E}_{x\sim p}[\mu_{\text{post}}(x)]}_{\text{expected predictive mean}}$$

$$\mu_{\text{post}}(x) = k(x, X)\underbrace{K^{-1}y}_{=:\,\alpha}, \qquad K := k(X, X)$$

$$\mathbb{E}_f[Z] = \underbrace{\int k(x, X)\, p(x)\,dx}_{=:\,z^\top}\,\alpha = z^\top \alpha, \qquad z_n = \int k(x, x_n)\, p(x)\,dx = \mathbb{E}_{x\sim p}[k(x, x_n)]$$
Bayesian quadrature: Variance

$$\begin{aligned}
\mathbb{V}_f[Z] = \sigma_Z^2 &= \underbrace{\mathbb{E}_{x,x'\sim p}[k_{\text{post}}(x, x')]}_{\text{expected posterior covariance}} \\
&= \iint \Big( \underbrace{k(x, x')}_{\text{prior covariance}} - \underbrace{k(x, X) K^{-1} k(X, x')}_{\text{information from training data}} \Big)\, p(x) p(x')\,dx\,dx' \\
&= \iint k(x, x')\, p(x) p(x')\,dx\,dx' - \underbrace{\int k(x, X)\, p(x)\,dx}_{=\,z^\top}\; K^{-1} \underbrace{\int k(X, x')\, p(x')\,dx'}_{=\,z'} \\
&= \mathbb{E}_{x,x'}[k(x, x')] - z^\top K^{-1} z' \\
&= \mathbb{E}_{x,x'}[k(x, x')] - \mathbb{E}_x[k(x, X)]\, K^{-1}\, \mathbb{E}_{x'}[k(X, x')]
\end{aligned}$$
![Page 57: NumericalIntegration - Marc Deisenroth · Centralidea! Quadrature nodes x n are the roots of a family of orthogonal polynomials Nodes no longer equidistant! Exact if f is a polynomial](https://reader036.fdocuments.in/reader036/viewer/2022071421/611a92baf1a0ef381c2cd3c1/html5/thumbnails/57.jpg)
Bayesian quadrature: Variance
Vf [Z] = σ2Z =
expected posterior covariance
Ex,x′∼p[kpost(x,x′)]
=
∫∫k(x,x′)
prior covariance
− k(x,X)K−1k(X,x′)
information from training data
p(x)p(x′)dxdx′
=
∫∫k(x,x′)p(x)p(x′)dxdx′ −
∫k(x,X)p(x)dx
=z#
K−1
∫k(X,x′)p(x′)dx′
=z′
= Ex,x′ [k(x,x′)]− z$K−1
z′
= Ex,x′ [k(x,x′)]− Ex[k(x,X)]K−1Ex′ [k(X,x′)]
22
![Page 58: NumericalIntegration - Marc Deisenroth · Centralidea! Quadrature nodes x n are the roots of a family of orthogonal polynomials Nodes no longer equidistant! Exact if f is a polynomial](https://reader036.fdocuments.in/reader036/viewer/2022071421/611a92baf1a0ef381c2cd3c1/html5/thumbnails/58.jpg)
Bayesian quadrature: Variance
Vf [Z] = σ2Z =
expected posterior covariance
Ex,x′∼p[kpost(x,x′)]
=
∫∫k(x,x′)
prior covariance
− k(x,X)K−1k(X,x′)
information from training data
p(x)p(x′)dxdx′
=
∫∫k(x,x′)p(x)p(x′)dxdx′ −
∫k(x,X)p(x)dx
=z#
K−1
∫k(X,x′)p(x′)dx′
=z′
= Ex,x′ [k(x,x′)]− z$K−1z′
= Ex,x′ [k(x,x′)]− Ex[k(x,X)]K−1Ex′ [k(X,x′)]
22
![Page 59: NumericalIntegration - Marc Deisenroth · Centralidea! Quadrature nodes x n are the roots of a family of orthogonal polynomials Nodes no longer equidistant! Exact if f is a polynomial](https://reader036.fdocuments.in/reader036/viewer/2022071421/611a92baf1a0ef381c2cd3c1/html5/thumbnails/59.jpg)
Bayesian quadrature: Variance
Vf [Z] = σ2Z =
expected posterior covariance
Ex,x′∼p[kpost(x,x′)]
=
∫∫k(x,x′)
prior covariance
− k(x,X)K−1k(X,x′)
information from training data
p(x)p(x′)dxdx′
=
∫∫k(x,x′)p(x)p(x′)dxdx′ −
∫k(x,X)p(x)dx
=z#
K−1
∫k(X,x′)p(x′)dx′
=z′
= Ex,x′ [k(x,x′)]− z$K−1z′
= Ex,x′ [k(x,x′)]− Ex[k(x,X)]K−1Ex′ [k(X,x′)]
22
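The same closed forms give the variance; this sketch (again an assumption-laden illustration, not the slides' code) uses the RBF kernel and Gaussian p from before, where the prior term is E_{x,x'}[k(x, x')] = √(ℓ²/(ℓ²+2σ²)) because x − x' ∼ N(0, 2σ²):

```python
import numpy as np

def bq_variance(X, ell=1.0, mu=0.0, sigma2=1.0, jitter=1e-10):
    """V_f[Z] = E_{x,x'}[k(x,x')] - z^T K^{-1} z for an RBF kernel and
    Gaussian p = N(mu, sigma2); z' = z since x and x' share the same p."""
    X = np.asarray(X, dtype=float)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ell**2))
    K += jitter * np.eye(len(X))
    z = np.sqrt(ell**2 / (ell**2 + sigma2)) * np.exp(
        -(X - mu) ** 2 / (2 * (ell**2 + sigma2))
    )
    # prior term E_{x,x'}[k(x,x')]: under p x p, the difference x - x' ~ N(0, 2 sigma2)
    prior = np.sqrt(ell**2 / (ell**2 + 2 * sigma2))
    return prior - z @ np.linalg.solve(K, z)
```

Note that the variance does not depend on the observed values y, only on where the nodes sit, and it shrinks as nodes are added.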
Computing kernel expectations

E_{x∼p}[k(x, X)],   E_{x,x'∼p}[k(x, x')]

- Solve a different (easier) integration problem:

| Kernel k \ Input distribution p | Gaussian | non-Gaussian |
|---|---|---|
| RBF / polynomial / trigonometric | analytical | analytical via importance-sampling trick |
| otherwise | Monte Carlo (numerical integration) | Monte Carlo (numerical integration) |
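For the "Gaussian input, RBF kernel" cell of the table, the analytical result is easy to verify against plain Monte Carlo; a small check (illustrative values for ℓ, μ, σ, and the node x_n):

```python
import numpy as np

rng = np.random.default_rng(0)
ell, mu, sigma = 1.0, 0.0, 1.0
xn = 0.7  # an arbitrary quadrature node

# Monte Carlo estimate of E_{x~N(mu, sigma^2)}[k(x, xn)] for the RBF kernel
samples = rng.normal(mu, sigma, size=200_000)
mc = np.mean(np.exp(-(samples - xn) ** 2 / (2 * ell**2)))

# analytical value: the kernel is an (unnormalized) Gaussian, so the
# expectation is a Gaussian convolution evaluated at xn
analytic = np.sqrt(ell**2 / (ell**2 + sigma**2)) * np.exp(
    -(xn - mu) ** 2 / (2 * (ell**2 + sigma**2))
)
```

The two estimates agree up to Monte Carlo error, which is exactly why the analytical route is preferred when it exists: it removes this sampling noise from the quadrature weights.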
Kernel expectations in other areas

E_{x∼p}[k(x, X)],   E_{x,x'∼p}[k(x, x')]

- Kernel MMD (e.g., Gretton et al., 2012)
- Time-series analysis with Gaussian processes (e.g., Girard et al., 2003)
- Deep Gaussian processes (e.g., Damianou & Lawrence, 2013)
- Model-based RL with Gaussian processes (e.g., Deisenroth & Rasmussen, 2011)

[Figures: "Dependence witness and sample" from Gretton et al. (2012), plus illustrations from Salimbeni et al. (2019), Girard et al. (2003), and Deisenroth & Rasmussen (2011)]
Iterative procedure: Where to measure f next?

- Define an acquisition function (similar to Bayesian optimization)
- Example: choose the next node x_{n+1} so that the variance of the estimator is reduced maximally (e.g., O'Hagan, 1991; Gunter et al., 2014):

x_{n+1} = argmax_{x*}  ( V[Z | D]  −  E_{y*}[ V[Z | D ∪ {(x*, y*)}] ] )

i.e., current variance minus expected new variance.
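A minimal sketch of this acquisition rule, under the same RBF-kernel/Gaussian-p assumptions as before (the helper and candidate-set search are illustrative, not the method of O'Hagan or Gunter et al.). For a noiseless GP with fixed hyperparameters, V[Z | D] does not depend on the observed values y, so the expectation over y* drops out and the gain can be evaluated exactly:

```python
import numpy as np

def bq_var(X, ell=1.0, mu=0.0, sigma2=1.0, jitter=1e-10):
    """Posterior variance of Z for an RBF kernel and Gaussian p (closed form)."""
    X = np.asarray(X, dtype=float)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / (2 * ell**2))
    K += jitter * np.eye(len(X))
    z = np.sqrt(ell**2 / (ell**2 + sigma2)) * np.exp(
        -(X - mu) ** 2 / (2 * (ell**2 + sigma2))
    )
    return np.sqrt(ell**2 / (ell**2 + 2 * sigma2)) - z @ np.linalg.solve(K, z)

def next_node(X, candidates):
    """Greedy variance-reduction acquisition over a candidate set: the new
    variance is independent of the unseen y*, so E_{y*} disappears."""
    gains = [bq_var(list(X)) - bq_var(list(X) + [c]) for c in candidates]
    return candidates[int(np.argmax(gains))]
```

With one node already at 0, a candidate far from the existing node (but still in a region where p has mass) reduces the variance more than a near-duplicate of the current node.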
Example with EmuKit (Paleyes et al., 2019)

Compute

Z = ∫_{−3}^{3} exp(−x² − sin²(3x)) dx

- Fit a Gaussian process to observations f(x₁), …, f(xₙ) at nodes x₁, …, xₙ
- Determine p(Z)
- Find and include a new measurement:
  1. Find the optimal node x_{n+1} by maximizing an acquisition function
  2. Evaluate the integrand at x_{n+1}
  3. Update the GP with (x_{n+1}, f(x_{n+1}))
- Compute the updated p(Z)
- Repeat

[Figures: the integrand f(x); the GP model with observations; the density p(Z) with E[Z] and the true Z, which narrows and moves toward the true value as nodes are added]
Summary

- Central approximation: ∫ f(x) dx ≈ Σ_{n=1}^{N} w_n f(x_n)
- Newton–Cotes: equidistant nodes x_n, low-degree polynomial approximation of f
- Gaussian quadrature: nodes x_n are the roots of a family of orthogonal polynomials; exact if f is a polynomial (of sufficiently low degree)
- Bayesian quadrature: integration as a statistical inference problem; global approximation of f with a Gaussian process; scales to moderate dimensions

[Figures: Newton–Cotes nodes a = x₀, x₁, …, x_N = b with values f(x₀), …, f(x_N); Bayesian-quadrature GP fit with observations, integrand, and model]

Numerical integration is a really good idea in low dimensions.
References

Briol, F.-X., Oates, C., Girolami, M., and Osborne, M. A. (2015). Frank–Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. In Advances in Neural Information Processing Systems.

Cutler, M. and How, J. P. (2015). Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In Proceedings of the International Conference on Robotics and Automation.

Damianou, A. and Lawrence, N. D. (2013). Deep Gaussian Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics.

Deisenroth, M. P., Fox, D., and Rasmussen, C. E. (2015). Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423.

Deisenroth, M. P. and Mohamed, S. (2012). Expectation Propagation in Gaussian Process Dynamical Systems. In Advances in Neural Information Processing Systems, pages 2618–2626.

Deisenroth, M. P. and Rasmussen, C. E. (2011). PILCO: A Model-Based and Data-Efficient Approach to Policy Search. In Proceedings of the International Conference on Machine Learning.

Deisenroth, M. P., Turner, R., Huber, M., Hanebeck, U. D., and Rasmussen, C. E. (2012). Robust Filtering and Smoothing with Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871.

Diaconis, P. (1988). Bayesian Numerical Analysis. Statistical Decision Theory and Related Topics IV, 1:163–175.

Eleftheriadis, S., Nicholson, T. F. W., Deisenroth, M. P., and Hensman, J. (2017). Identification of Gaussian Process State Space Models. In Advances in Neural Information Processing Systems.

Genz, A. (2004). Numerical Computation of Rectangular Bivariate and Trivariate Normal and t Probabilities. Statistics and Computing, 14:251–260.

Girard, A., Rasmussen, C. E., Quiñonero Candela, J., and Murray-Smith, R. (2003). Gaussian Process Priors with Uncertain Inputs—Application to Multiple-Step Ahead Time Series Forecasting. In Advances in Neural Information Processing Systems.

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A Kernel Two-Sample Test. Journal of Machine Learning Research, 13(25):723–773.

Gunter, T., Osborne, M. A., Garnett, R., Hennig, P., and Roberts, S. J. (2014). Sampling for Inference in Probabilistic Models with Fast Bayesian Quadrature. In Advances in Neural Information Processing Systems.

Hennig, P., Osborne, M. A., and Girolami, M. (2015). Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471:20150142.

Ko, J. and Fox, D. (2009). GP-BayesFilters: Bayesian Filtering using Gaussian Process Prediction and Observation Models. Autonomous Robots, 27(1):75–90.

O'Hagan, A. (1991). Bayes–Hermite Quadrature. Journal of Statistical Planning and Inference, 29:245–260.

Paleyes, A., Pullin, M., Mahsereci, M., Lawrence, N., and González, J. (2019). Emulation of Physical Processes with Emukit. In Second Workshop on Machine Learning and the Physical Sciences, NeurIPS.

Salimbeni, H. and Deisenroth, M. P. (2017). Doubly Stochastic Variational Inference for Deep Gaussian Processes. In Advances in Neural Information Processing Systems.

Salimbeni, H., Dutordoir, V., Hensman, J., and Deisenroth, M. P. (2019). Deep Gaussian Processes with Importance-Weighted Variational Inference. In Proceedings of the International Conference on Machine Learning.

Stoer, J. and Bulirsch, R. (2002). Introduction to Numerical Analysis. Texts in Applied Mathematics. Springer-Verlag, 3rd edition.