Short Course on UQ - Simon Fraser University


Short Course on UQ

Derek Bingham

Department of Statistics and Actuarial Science

Simon Fraser University

Many processes are investigated using computational models

• Many scientific applications use mathematical models to describe physical systems

• Rapid growth in computer power has made it possible to study complex physical phenomena that might otherwise be too time consuming or expensive to observe

• To understand how inputs to the computer code impact the system, scientists adjust the inputs to computer simulators and observe the response

Many computer models require a lot of computing power

The computer models frequently:

1. require solutions to PDEs or use finite element analyses

2. have high-dimensional inputs

3. have outputs which are complex functions of the input factors

4. require a large amount of computing time

5. have features from some of the above

Uncertainty quantification

• The Isaac Newton Institute (www.newton.ac.uk) at Cambridge University recently held a 6 month program on uncertainty quantification (UQ)

• The SAMSI theme year on Model Uncertainty: Mathematical and Statistical (MUMS) is primarily concerned with uncertainty

• The USA National Labs have invested heavily in UQ methods

• So what is UQ?

Uncertainty quantification

• Not really a well defined term… Paul has tried

• Umbrella phrase for mathematical and statistical techniques that are concerned with inference using "mathematical representations" (typically, a computer code) of physical systems

• Sometimes the analyst will have only the model available, and other times there are sources of data that help with the inference

Uncertainty quantification

• Our aim is to introduce some of the common themes and techniques from the statistics and applied math settings and perspectives

• Understanding the response of glaciers to climate is important globally for making accurate projections of sea level change

• Changes in glaciers and ice sheets can be the product of changes to the surface mass balance (accumulation and ablation)

• Have computer model to describe, say, ablation (output) given the season’s weather trajectory (input)

Example


Potential sources of uncertainty if the aim is to predict next year’s weather?

• Do not know next year's weather (maybe have a distribution)

• Measurement error

• Maybe the mathematical model is not correct?

• Perhaps cannot compute the thing you would like to compute (e.g., discretization error)

Important feature of many computational models

• Computer model output is frequently deterministic

• Impacts the design and analysis strategy, at least for a statistician where the aim is usually to separate signal and noise

Simulations

Example


Potential sources of uncertainty?

Important feature of many computational models

• The deterministic simulator is frequently too costly to run many times

• Would like to create a statistical surrogate for the model

• Impacts the design and analysis strategy

• Randomization, blocking and replication do not play a role

Why a statistical emulator?

• Can only run the code a limited number of times
– where to run the code (design of experiments)
– how many times do you run the code (goal-oriented design)

• Want to predict output, with uncertainty, at unobserved inputs… need a foundation for statistical inference

Have to consider a different inferential framework

• Suppose a linear regression is used to model the output of a computer experiment

• Errors are not independent

• Usual statistical model:

  $y(x) = \beta_0 + \beta_1 x + \epsilon, \qquad \epsilon \sim N(0, \sigma^2)$

• Here, residuals are due to lack of fit

• Foreshadow: instead of random errors, use a random function

We have lots of tools to “model the model”

• Many possible choices
– Least squares regression
– Polynomial chaos (e.g., Legendre polynomial basis expansion)
– Neural networks
– Splines
– Regression trees
– Random forests
– …

• Are these bad? No!!!!!

• They do not interpolate, nor do they provide predictive uncertainty

• We will use Gaussian Process (GP) regression to achieve this


Gaussian processes for emulating computer model output

• GP’s have proven effective for emulating computer model output (Sacks, Welch, Mitchell, & Wynn, 1989) and data mining (Rasmussen and Williams, 2006)

• Emulating computer model output:
– output varies smoothly with input changes
– output is noise free
– passes through the observed response (some controversy here)

• Aside
– GP's often outperform other modeling approaches in this arena
– Is a very useful non-parametric (semi-parametric?) regression approach even if you never use it in this context

Statistical formulation for GP emulation of computer models

• Computer model: $\eta : \mathbb{R}^d \rightarrow \mathbb{R}$

• The function is expensive, so we get to observe a sample of n runs from the computer model

• Specify a set of inputs $x$, with $x = (x_1, x_2, \ldots, x_d)$, where we will run the code

• Run the code and get the outputs $y^T = (y_1, y_2, \ldots, y_n)$

• That is, the observed output at input $x$ is $y(x) = \eta(x)$

Statistical formulation for GP emulation of computer models

• Will view the computer code as a single realization of a random function (in this case, a GP):

  $y(x) = \mu + z(x), \qquad z(x) \sim N(0, \sigma^2)$

  where

  $E(z(x)) = 0, \qquad \mathrm{Var}(z(x)) = \sigma^2, \qquad \mathrm{Corr}(z(x), z(x')) = \prod_{i=1}^{d} e^{-\theta_i (x_i - x'_i)^2}$

• For n data points, will have the covariance matrix $\Sigma = \sigma^2 R$


The GP has many nice properties that statisticians recognize

• The vector of responses follows a multivariate normal distribution

• That is,

  $y \sim \mathrm{MVN}(\mu, \Sigma), \qquad L = \frac{1}{|2\pi\Sigma|^{1/2}}\, e^{-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)}$

• Marginal distributions (of any single response) are Gaussian

• The conditional distribution of any response, or set of responses, is also Gaussian

The parameters have meaning

• The mean, $\mu$, is the mean over all realizations

• Making the variance, $\sigma^2$, larger re-scales the vertical axis

• If $\theta_i = 0$, the function does not vary with respect to this input

• When $\theta_i$ is big, the function will be wigglier (a technical term?)

• Responses where the inputs are close together will be more highly correlated than responses at inputs that are far apart

Realizations of a GP for a fixed model

[Figure: realizations of a GP plotted as y against x ∈ [0, 1], with µ = 0, σ² = 1, θ = 52]
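For readers who want to reproduce something like the realizations above, here is a minimal sketch (not from the slides) that builds the Gaussian correlation matrix and draws realizations with the parameter values shown in the figure; the function names and the small nugget are illustrative assumptions.

```python
# Sketch: draw realizations from a GP with the Gaussian correlation function
# Corr(z(x), z(x')) = prod_i exp(-theta_i * (x_i - x'_i)^2).
# Parameter values (mu = 0, sigma^2 = 1, theta = 52) match the figure.
import numpy as np

def gauss_corr(X1, X2, theta):
    """Correlation matrix between the rows of X1 and the rows of X2 (inputs in [0,1]^d)."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2           # squared differences, shape (n1, n2, d)
    return np.exp(-(d2 * theta).sum(axis=2))              # product over inputs = sum in the exponent

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 200).reshape(-1, 1)                 # 1-d input grid
mu, sigma2, theta = 0.0, 1.0, np.array([52.0])
R = gauss_corr(x, x, theta) + 1e-8 * np.eye(len(x))       # small nugget for numerical stability
L = np.linalg.cholesky(sigma2 * R)
realizations = mu + L @ rng.standard_normal((len(x), 5))  # five realizations of the random function
```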

Can estimate model parameters using maximum likelihood

• In this setting, the computer model responses are viewed as a single realization of a GP

• Have a model (GP), the set of inputs and the corresponding responses

• The parameters, $\mu$, $\sigma^2$ and $\theta = (\theta_1, \theta_2, \ldots, \theta_d)$, still need to be estimated

• Inference proceeds by maximizing the likelihood

  $L(\mu, \sigma^2, \theta \,|\, y) = f(y \,|\, \mu, \sigma^2, \theta) = \frac{1}{|2\pi\Sigma|^{1/2}}\, e^{-\frac{1}{2}(y-\mu)^T \Sigma^{-1} (y-\mu)}$

Aside: Maximum likelihood estimation

• In practice, fitting a statistical model requires some distribution from which the data are assumed to arise (for us, a GP)

• Can then write down the likelihood (the joint distribution of the data) up to the parameters that govern the statistical model

• The idea is to choose the parameter values that maximize the likelihood function

• There is loads of theory to support this approach for parameter estimation

Aside: Maximum likelihood estimation

• So, what is the big idea?

• Toy example: Bernoulli trials
– Interested in the proportion of times something happens
– E.g., want to know the probability of flipping a coin and getting tails
– Suppose we have n independent realizations from a Bernoulli distribution (outcomes yi = 0 or 1)
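Filling in the arithmetic of the toy example (a standard result, not spelled out on the slide): with independent Bernoulli outcomes $y_1, \ldots, y_n$,

$L(p) = \prod_{i=1}^{n} p^{y_i}(1-p)^{1-y_i}, \qquad \log L(p) = \left(\sum_i y_i\right)\log p + \left(n - \sum_i y_i\right)\log(1-p)$

Setting the derivative with respect to $p$ to zero gives $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$, i.e., the sample proportion maximizes the likelihood.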

Back to GPs: Can estimate model parameters using maximum likelihood

• If $\theta$ is known (see Jones, Schonlau and Welch, 1998), then the MLEs are

  $\hat{\mu} = \frac{\mathbf{1}^T \Sigma^{-1} y}{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(y - \mathbf{1}\hat{\mu})^T R^{-1} (y - \mathbf{1}\hat{\mu})}{n}$

  where $\Sigma = \sigma^2 R$

• We can plug these estimators into the likelihood, and it is now only a function of $\theta$ … now maximize
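A minimal sketch of this plug-in (profile likelihood) idea, reusing the illustrative gauss_corr helper from the realization sketch; the optimizer choice and the log-scale parameterization of theta are assumptions, not prescriptions from the slides.

```python
# Sketch: maximize the profile log-likelihood over theta, with mu-hat and sigma2-hat plugged in.
import numpy as np
from scipy.optimize import minimize

def neg_profile_loglik(log_theta, X, y):
    """Negative profile log-likelihood (up to an additive constant); theta optimized on the log scale."""
    theta = np.exp(log_theta)
    n = len(y)
    R = gauss_corr(X, X, theta) + 1e-8 * np.eye(n)         # correlation matrix of the n runs
    Rinv = np.linalg.inv(R)
    one = np.ones(n)
    mu_hat = (one @ Rinv @ y) / (one @ Rinv @ one)         # closed-form MLE of mu given theta
    resid = y - mu_hat
    sigma2_hat = (resid @ Rinv @ resid) / n                # closed-form MLE of sigma^2 given theta
    _, logdetR = np.linalg.slogdet(R)
    return 0.5 * (n * np.log(sigma2_hat) + logdetR)

# usage sketch: X is the n x d design, y the n outputs
# fit = minimize(neg_profile_loglik, x0=np.zeros(X.shape[1]), args=(X, y), method="Nelder-Mead")
# theta_hat = np.exp(fit.x)
```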

Prediction

• For maximum likelihood estimation, prediction is the usual regression-like predictor … except the data are correlated:

  $\hat{y}(x^*) = \hat{\mu} + r^T R^{-1} (y - \mathbf{1}\hat{\mu})$

  $s^2(x^*) = \hat{\sigma}^2 \left(1 - r^T R^{-1} r + \frac{(1 - \mathbf{1}^T R^{-1} r)^2}{\mathbf{1}^T R^{-1} \mathbf{1}}\right)$

• where $r$ is the n-vector of estimated correlations between the observation at $x^*$ and the sampled points
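A sketch of the predictor and prediction variance formulas above, again using the illustrative gauss_corr helper and plug-in estimates; this is one possible coding, not necessarily how any particular course software does it.

```python
# Sketch: GP prediction at new inputs Xstar given training design X, outputs y,
# and plug-in estimates (mu_hat, sigma2_hat, theta_hat).
import numpy as np

def gp_predict(Xstar, X, y, mu_hat, sigma2_hat, theta_hat):
    n = len(y)
    R = gauss_corr(X, X, theta_hat) + 1e-8 * np.eye(n)
    Rinv = np.linalg.inv(R)
    one = np.ones(n)
    r = gauss_corr(Xstar, X, theta_hat)                    # correlations between each x* and the n runs
    mean = mu_hat + r @ Rinv @ (y - mu_hat)                # y-hat(x*) = mu + r' R^-1 (y - mu)
    rRinv = r @ Rinv
    var = sigma2_hat * (1.0 - np.sum(rRinv * r, axis=1)
                        + (1.0 - rRinv @ one) ** 2 / (one @ Rinv @ one))
    return mean, np.maximum(var, 0.0)                      # clip tiny negative values from round-off
```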

Prediction with a GP is a weighted average of the responses

• A quick note on the predictor

  $\hat{y}(x^*) = \hat{\mu} + r^T R^{-1} (y - \mathbf{1}\hat{\mu})$

• $r$ is the (n x 1)-vector of estimated correlations between the observation at $x^*$ and the sampled points:

  $[r]_i = \mathrm{Corr}\left(y(x^*), y(x_i)\right) = \prod_{j=1}^{d} e^{-\theta_j (x^*_j - x_{ij})^2}$

• The predictor is a linear combination of the sampled (training) outputs

• The points closest to $x^*$ get the highest weight

Why does prediction work this way?

• We update our knowledge based on the observations

• The conditional distribution of a response at an unsampled input is

  $y(x^*) \,|\, \text{data} \sim N\left(m(x^*), \sigma^2(x^*)\right)$

• $\hat{y}(x^*)$ and $s^2(x^*)$ from the previous slides are used to estimate $m(x^*)$ and $\sigma^2(x^*)$

Example

True function and observations


Can emulate computer model with uncertainty

True function, emulated mean function and 95% prediction intervals


So who cares about these GP emulators?

• Have a computer model that encodes the best physics or mathematical modeling of the process

• Unfortunately, model takes a long time to run, making it difficult to use the model to solve problems like:

• Inverse problems
• Sensitivity analysis
• Prediction at many different inputs
• Integration
• Risk analysis

• Can use the GP emulator in place of the computer model and get estimates of uncertainty

Can specify model parameters using Bayesian approach

• In this setting, the computer model responses are viewed as a single realization of a GP

• Have a model (GP), the set of inputs and the corresponding responses

• The parameters, $\mu$, $\sigma^2$ and the $\theta_i$'s, still need to be estimated

• … just like before

• Dave Higdon and Leanna House will talk about this

General Bayesian approach

• Inference proceeds by first specifying the likelihood for the vector of unknown parameters:

  $L(\mu, \sigma^2, \theta \,|\, y) = f(y \,|\, \mu, \sigma^2, \theta)$

• Also need to specify prior distributions for the parameters, $\pi(\mu, \sigma^2, \theta)$

• By Bayes' rule,

  $\pi(\mu, \sigma^2, \theta \,|\, y) \propto f(y \,|\, \mu, \sigma^2, \theta)\, \pi(\mu, \sigma^2, \theta)$

• Normalizing constant usually hard to find

General Bayesian approach

• In this case, have two options

• Sample from the posterior distribution of the parameters using a sampling procedure (later)

• Use posterior mode… sort of like maximum likelihood

Very general look at Bayes rule

• very general approach for inference
• posterior pdf $\pi(\theta \,|\, y)$ describes uncertainty in $\theta$ given data
• prior pdf $\pi(\theta)$ for $\theta$ is required
• by Bayes' rule, $\pi(\theta \,|\, y) \propto f(y \,|\, \theta)\, \pi(\theta)$
• normalizing is generally difficult/impossible
• inference proceeds through samples from the posterior distribution

Using the posterior mode

• Suppose

  $\pi(\mu, \sigma^2, \theta) \propto 1$

• The posterior mode corresponds to solving

  $\max_{\theta \in [0,\infty)^d} \; -\frac{1}{2}\left(n \log(\hat{\sigma}^2) + \log |R|\right)$

  where

  $\hat{\mu} = \frac{\mathbf{1}^T \Sigma^{-1} y}{\mathbf{1}^T \Sigma^{-1} \mathbf{1}}, \qquad \hat{\sigma}^2 = \frac{(y - \mathbf{1}\hat{\mu})^T R^{-1} (y - \mathbf{1}\hat{\mu})}{n}$

Prediction strategy

• Sample parameters from $\pi(\mu, \sigma^2, \theta \,|\, y)$

• Sample the predicted value $y(x^*)$ from the posterior

• Can use the posterior predictive distribution to get point estimates and prediction intervals
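A minimal sketch of this strategy, assuming posterior draws of (mu, sigma2, theta) are already available from some sampler; for each draw, y(x*) is sampled from the conditional normal given that draw (gauss_corr is the earlier illustrative helper).

```python
# Sketch: posterior predictive draws of y(x*) given posterior samples of (mu, sigma2, theta).
import numpy as np

def posterior_predictive(xstar, X, y, mu_draws, sigma2_draws, theta_draws, rng):
    n = len(y)
    samples = []
    for mu, sigma2, theta in zip(mu_draws, sigma2_draws, theta_draws):
        R = gauss_corr(X, X, theta) + 1e-8 * np.eye(n)
        Rinv = np.linalg.inv(R)
        r = gauss_corr(xstar.reshape(1, -1), X, theta)[0]      # correlations with the n runs
        m = mu + r @ Rinv @ (y - mu)                           # conditional mean for this draw
        v = sigma2 * (1.0 - r @ Rinv @ r)                      # conditional variance for this draw
        samples.append(rng.normal(m, np.sqrt(max(v, 0.0))))
    return np.array(samples)                                   # summarize for point estimates / intervals
```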

Experiment design

• In computer experiments, as in many other types of experiments considered by statisticians, there are usually some factors that can be adjusted and some response variable that is impacted by (some) of the factors

• The experiment is conducted, for example, to see which factors impact the response and how

• Generally, the experimental design is the set of input settings that are used in your simulation experiments… sometimes called the input deck

Experiment design

• For a computer experiment, the experiment design is the set of d-dimensional inputs where the computational model is evaluated

• Notation: X is the n x d design matrix; y(X) is the n x 1 vector of responses if we have a scalar response

• Experimental region is usually the unit hypercube $[0,1]^d$ (not always)

Experiment design

• Experiment goals:

– Computing the mean response
– Response surface estimation or computer model emulation
– Sensitivity analysis
– Optimization
– Estimation of contours and percentiles
– …

Design for estimating the mean response

• Suppose we have a computer model y(x), where x is a d-dimensional uniform random variable on $[0,1]^d$

• Interested in

  $\mu = \int y(x)\, dx$

• One way to do this is to randomly sample x, n times, from $U[0,1)^d$ (the Monte Carlo method):

  $\hat{\mu} = \frac{\sum_{i=1}^{n} y(x_i)}{n}$
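A tiny sketch of the Monte Carlo estimator, with a made-up cheap function standing in for the expensive simulator.

```python
# Sketch: Monte Carlo estimate of mu = integral of y(x) over [0,1]^d,
# using a cheap stand-in function in place of the real simulator.
import numpy as np

def y_toy(X):                       # illustrative test function, not from the slides
    return np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2

rng = np.random.default_rng(0)
n, d = 100, 2
X = rng.uniform(size=(n, d))        # n random draws from U[0,1)^d
mu_hat = y_toy(X).mean()            # Monte Carlo estimate of the mean response
```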

Design for estimating the mean response

[Figure: three example designs of points in the unit square]

Monte Carlo estimate of the mean from a random design

• Good news: approach gives an unbiased estimator of the mean

• Bad news: the approach can often lead to designs that miss much of the input space… relatively large variance for the estimator

• Could imagine taking a regular grid of points as the design
– Impractical since need many points in moderate dimensions
– This is generally true, but perhaps not for everyone here.

Experiment design

• McKay, Beckman and Conover (1979, Technometrics) introduced Latin hypercube sampling as an approach for estimating the mean

• Is a type of stratified random sampling

– Example: random Latin hypercube design in 2-d (sample size = n)

– Construction:
• Construct an n x n grid over the unit square
• Construct an n x 2 matrix, Z, with columns that are independent permutations of the integers {1, 2, …, n}
• Each row of Z is the index for a cell on the grid. For the i-th row of Z, take a random uniform draw from the corresponding cell in the n x n grid over the unit square… call this xi (i = 1, 2, …, n)
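A minimal sketch of that construction for a general dimension d (0-based indices and the division by n stand in for the slide's integer grid, but the stratification is the same); the function name is illustrative.

```python
# Sketch: random Latin hypercube design with n points in [0,1]^d,
# following the construction above (independent column permutations + uniform jitter in each cell).
import numpy as np

def random_lhs(n, d, rng):
    Z = np.column_stack([rng.permutation(n) for _ in range(d)])  # independent permutations of 0..n-1
    U = rng.uniform(size=(n, d))                                 # uniform draw within each selected cell
    return (Z + U) / n                                           # each column has one point per 1/n slice

rng = np.random.default_rng(0)
X = random_lhs(10, 2, rng)
```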

Design for estimating the mean response

• Enforces a 1-d projection property

• Is an attempt at filling the space

• Easy to construct

• Can get some pretty bad designs (what happens if the permutations result in column 1 = column 2?)

Latin hypercube designs

• McKay, Beckman and Conover (1979, Technometrics) show that using a random Latin hypercube design gives an unbiased estimator of the mean with smaller variance than the Monte Carlo method

• Good news: lower variance than random sampling

• Bad news: can still get large holes in the input space

• Want to improve the uniformity of the points in the input region

Experiment design

• Space-filling criteria:

– Orthogonal array based Latin hypercubes (Tang, 1993)

– This adds a restriction on the random permutation of the integers {1, 2, …, n} when constructing the Latin hypercube

– Orthogonal array: an array with s ≥ 2 symbols (or levels) per input, where in each column the symbols appear equally often and, for all collections of r columns, all possible combinations of the s symbols appear equally often in the rows of the design. r is the strength of the orthogonal array

– Idea: start with an orthogonal array. Restrict a permutation of consecutive integers to those rows that have a particular symbol (each row gets a unique value in {1, 2, …, n})

Experiment design

Orthogonal array with 4 symbols per column:

1. For the 1's in the 1st column, assign a permutation of the integers 1-4; for the 2's, assign a permutation of the numbers 5-8, … and so on

2. Repeat for the 2nd column.
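A sketch of that construction for the 16-run, 2-column example, taking the full 4 x 4 factorial as the strength-2 orthogonal array; variable names and the 0-based indexing are illustrative.

```python
# Sketch: orthogonal-array-based Latin hypercube (in the spirit of Tang, 1993) for the
# 16-run, 2-column, 4-symbol example above.  The OA used here is the full 4 x 4 factorial.
import numpy as np

rng = np.random.default_rng(0)
s = 4                                                          # number of symbols per column
OA = np.array([(a, b) for a in range(s) for b in range(s)])    # strength-2 OA with 16 rows
n, d = OA.shape
X = np.zeros((n, d))
for j in range(d):
    ranks = np.zeros(n, dtype=int)
    for symbol in range(s):
        rows = np.where(OA[:, j] == symbol)[0]
        # rows carrying this symbol get a random permutation of the corresponding block of
        # integers (the 0-based version of 1-4, 5-8, ... in the slide)
        ranks[rows] = symbol * (n // s) + rng.permutation(n // s)
    X[:, j] = (ranks + rng.uniform(size=n)) / n                # jitter within each of the n cells
```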

Experiment design

[Figure: the resulting 16-run orthogonal-array-based Latin hypercube design, plotted on a 16 x 16 grid]

Experiment design

• Good news: Tang showed that such designs can achieve smaller variance than LHS for estimating the mean

• Bad news: orthogonal arrays do not exist for all run sizes (for 2-symbol, strength-2 designs, the run sizes must be multiples of 4)

• The design we just looked at is a $4^2$ grid

• A strength r = 2 orthogonal array has all pairs of columns with the combinatorial property, but does not need all triplets to have the same property

Comment

• The justification for the designs discussed so far is based on the practical problem of estimating the mean

• There is an entire area of mathematics that considers this problem - quasi-Monte Carlo (see Lemieux, 2009, text)

• However, the designs considered so far are often used to emulate the computer model (i.e., the design is run, followed by fitting a GP), even though nothing in their construction so far involves GPs

Designs based on distance

• Clearly space filling is an important property generally

• What about for computer model emulation?

• Consider an n-point design over $[0,1]^d$

• To avoid points being too close, can use a criterion that aims to spread out the points

Space-filling

• Space-filling criteria:

– Maximin designs: For a design X, maximize the minimum distance between any two points

– Minimax designs: minimizes the max distance between points in the input region that are not in the design and the points in the design X

– Can find designs that optimize one of these criteria or apply the criteria to a class of designs (e.g., Latin hypercube designs)

Designs based on distance

• Maximin designs: the idea for a maximin design is to maximize the minimum distance between any two design points

• The distance criterion is usually written:

  $c_p(x_1, x_2) = \left[\sum_{j=1}^{d} |x_{1j} - x_{2j}|^p\right]^{1/p}$

• The best design is the collection of n distinct points from $[0,1]^d$ where the minimum of this distance over the design is maximized

• How would you get one in practice?

• Often start with a dense grid in $[0,1]^d$ and use a numerical optimizer to get a good design
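A sketch of the maximin criterion together with the "generate many candidates and keep the best" shortcut mentioned later in the slides, reusing the illustrative random_lhs helper.

```python
# Sketch: maximin criterion (Euclidean distance, p = 2) and a crude random search
# over random Latin hypercube candidates.
import numpy as np
from scipy.spatial.distance import pdist

def maximin_criterion(X, p=2):
    return pdist(X, metric="minkowski", p=p).min()      # smallest inter-point distance in the design

def best_maximin_lhs(n, d, n_candidates, rng):
    best, best_val = None, -np.inf
    for _ in range(n_candidates):
        X = random_lhs(n, d, rng)                       # candidate design
        val = maximin_criterion(X)
        if val > best_val:                              # keep the design with the largest minimum distance
            best, best_val = X, val
    return best

rng = np.random.default_rng(0)
X = best_maximin_lhs(20, 3, 500, rng)
```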

Designs based on distance

• Minimax designs: when you are making a prediction using, say, a GP, you would like to have nearby design points. So, at the potential prediction points, you would like to minimize the maximum distance:

  $\min_{X \subset [0,1]^d} \; \max_{x \in [0,1]^d} \; c_p(x, X)$

• The best design is the collection of n distinct points from $[0,1]^d$ where the maximum distance from any point in the design region to the design is as small as possible

• Hard to optimize, but could use the same strategy as before
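Since the minimax criterion is hard to optimize, here is a sketch that only evaluates it, approximating the input region [0,1]^d by a large random reference set; the helper name and reference-set size are illustrative.

```python
# Sketch: approximate minimax criterion for a design X, replacing [0,1]^d with a large
# random reference set of points.
import numpy as np
from scipy.spatial.distance import cdist

def minimax_criterion(X, n_ref=10000, rng=None, p=2):
    rng = rng or np.random.default_rng(0)
    ref = rng.uniform(size=(n_ref, X.shape[1]))               # stand-in for "all" points in [0,1]^d
    dist_to_design = cdist(ref, X, metric="minkowski", p=p).min(axis=1)
    return dist_to_design.max()                               # worst-case distance to the nearest design point
```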

Designs based on distance

• Johnson, Moore and Ylvisaker (1990, JSPI) proposed the use of these designs for computer experiments

• Developed asymptotic theory under which both of the sorts of design can be optimal

Designs based on distance

• Good news: criteria give designs that have intuitively appealing space filling properties

• Bad news: In practice, the designs are not always good

– Maximin designs frequently place points near the boundary of the design space

– Minimax designs are often very hard to find

Combining criteria

• Good idea to combine criteria – e.g., maximin Latin hypercube

• Other approaches include considering Latin hypercube designs where the columns of the design matrix have low correlation (e.g., Owen 1994; Tang 1998)

Model based criteria

• Consider the GP framework

• The mean square prediction error is

  $\mathrm{MSPE}(\hat{Y}(x)) = E\left[\left(\hat{Y}(x) - Y(x)\right)^2\right]$

• Would like this to be small for every point in $[0,1]^d$

• So, the criterion (Sacks, Schiller and Welch, 1992) becomes

  $\mathrm{IMSPE} = \int_{x \in [0,1]^d} \mathrm{MSPE}(\hat{Y}(x))\, dx$

• The optimal design minimizes the IMSPE

Model based criteria

• Problems:
– Integral is hard to evaluate
– Function is hard to optimize
– Need to know the correlation parameters
• Could guess
• Could do a 2-stage procedure aimed at guessing the correlation parameters
• Could use a Bayesian approach:

  $\mathrm{BIMSPE} = \int \mathrm{IMSPE}\; \pi(\theta)\, d\theta$
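A sketch of approximating the IMSPE of a candidate design by Monte Carlo integration of the prediction variance, assuming a guessed theta and sigma^2 = 1 (the first option above) and reusing the illustrative gp_predict helper from the prediction sketch.

```python
# Sketch: Monte Carlo approximation of the IMSPE of a candidate design X,
# for a guessed theta and sigma^2 = 1 (the criterion only needs the variance).
import numpy as np

def approx_imspe(X, theta, n_mc=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    Xmc = rng.uniform(size=(n_mc, X.shape[1]))                  # integration points over [0,1]^d
    y_dummy = np.zeros(len(X))                                  # the variance does not depend on y
    _, var = gp_predict(Xmc, X, y_dummy, mu_hat=0.0, sigma2_hat=1.0, theta_hat=theta)
    return var.mean()                                           # average MSPE over the input space
```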

What to do?

• Well, it depends

• A good review paper is Pronzato and Muller (2002)

• Do you expect some of the inputs to be inert?
– If so, would like to have good space-filling properties in lower-dimensional projections (MaxPro designs, Joseph et al., 2015… R package called MaxPro)
– Alternatively, could run a small Latin hypercube design to see which variables are important. Next, fix the unimportant variables and run an experimental design varying the important inputs.

• If projections are not a concern, use a maximin design
– Idea: generate many random (OA-based) Latin hypercube designs and take the one with the best maximin criterion
– Alternatively, the lhs package in R has a maximinLHS command to get designs

Sample size

• So you want to run a computer experiment… why?

• For numerical integration, there are often bounds associated with the estimate of the mean (e.g., the Koksma-Hlawka theorem… see Lemieux, 2009, which derives an upper bound on the absolute integration error)

• For computer model emulation, Loeppky, Sacks and Welch (2009) proposed a general rule of n = 10d

Sample size simulation

• Suppose you have a realization of a GP in d dimensions and also a hold-out set for validation

• Consider the impact of more active dimensions

• Efficiency index:

  $\mathrm{HOI} = \frac{\mathrm{MSPE}_{ho}}{\mathrm{Var}(Y_{ho})}$

Sample size simulation

[Figure: hold-out index versus run size and active dimensions, multiplicative Gaussian process]

Sample size simulation… maybe it is not so bad after all

[Figure: hold-out index versus run size and active dimensions, additive Gaussian process]

Other Goals

• Often the aim of the computer experiment is to estimate a feature of the response surface and/or have ability to run the simulations in a sequence

• Examples include:

– Optimization (max and/or min)
– Estimation of contours (i.e., the x that give a pre-specified y(x))
– Estimation of percentiles

Experimental Strategy

• Sequential design strategy has three important steps:

1. An initial experiment design is run (a good one)

2. The GP is fit to the experiment data

3. An additional trial(s) is chosen to improve the estimate of the feature of interest

• Steps 2 and 3 are repeated until the budget is exhausted or the goal has been reached

Improvement Functions

• Improvement function (Mockus et al., 1978):

1. aims to measure the distance between the current estimate of the feature and the response at a new input

2. typically is set to zero when there is no improvement

• Expected Improvement function (Jones et al, 1998):

1. expected improvement aims to average the improvement function across the range of possible outputs

2. run new design point where the expected improvement is maximized

• Global minimization (Jones et al., 1998)

– Improvement function for a GP:

  $I(x) = \max\left(f_{\min} - Y(x),\, 0\right)$

  where $f_{\min}$ is the minimum observed value

– Expected improvement for a GP:

  $E[I(x)] = \left(f_{\min} - \hat{y}(x)\right)\Phi\!\left(\frac{f_{\min} - \hat{y}(x)}{s(x)}\right) + s(x)\,\phi\!\left(\frac{f_{\min} - \hat{y}(x)}{s(x)}\right)$

  where $\Phi$ and $\phi$ are the standard normal cdf and pdf

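A sketch of the expected improvement computation for global minimization, using the closed form above and the illustrative gp_predict helper; the candidate-set approach to maximizing EI is one simple option, not the only one.

```python
# Sketch: expected improvement for global minimization (Jones, Schonlau and Welch, 1998),
# evaluated on a set of candidate inputs using the plug-in GP predictor.
import numpy as np
from scipy.stats import norm

def expected_improvement(Xcand, X, y, mu_hat, sigma2_hat, theta_hat):
    mean, var = gp_predict(Xcand, X, y, mu_hat, sigma2_hat, theta_hat)
    s = np.sqrt(var)
    f_min = y.min()                                         # best (smallest) observed output
    u = (f_min - mean) / np.where(s > 0, s, 1.0)            # standardized improvement
    ei = (f_min - mean) * norm.cdf(u) + s * norm.pdf(u)     # E[max(f_min - Y(x), 0)]
    return np.where(s > 0, ei, 0.0)                         # no uncertainty => no expected improvement

# usage sketch: pick the candidate with the largest EI as the next run
# x_next = Xcand[np.argmax(expected_improvement(Xcand, X, y, mu_hat, sigma2_hat, theta_hat))]
```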

Summary

• There are several expected improvement criteria for finding features of a response surface (quantiles, contours, optima, …)

• Idea is simple, strategy is the same

• Bingham et al., 2014 (arXiv:1601.05887)