Predicting Output from Computer Experiments


Page 1: Predicting Output from Computer Experiments

Predicting Output from Computer Experiments

Design and Analysis of Computer Experiments, Chapter 3

Kevin Leyton-Brown

Page 2: Predicting Output from Computer Experiments

Overview

• Overall program in this chapter:
  – predict the output of a computer simulation
  – we're going to review approaches to regression, looking for various kinds of optimality
• First, we'll talk about just predicting our random variable (§3.2)
  – note that in this setting we have no "features"
• Then, we'll consider the inclusion of features in our predictions, based on features in our training data (§3.2, 3.3)
• In the end, we'll apply these ideas to computer experiments (§3.3)
• Not covered:
  – an empirical evaluation of seven EBLUPs on small-sample data (§3.3, pp. 69-81)
  – proofs of some esoteric BLUP theorems (§3.4, pp. 82-84)
• If you've done the reading, you already know:
  – the difference between "minimum MSPE linear unbiased predictors" and BLUPs
  – three different "intuitive" interpretations of r0ᵀR⁻¹(Yn - FB)
  – a lot about statistics
  – whether this chapter has anything to do with computer experiments
• If you haven't, you're in for a treat

Page 3: Predicting Output from Computer Experiments

Predictors

• Y0 is our random variable; our data is Yn = (Y1, …, Yn)ᵀ
  – no "features": we just predict one response from the others
• A generic predictor Ŷ0 ≡ Ŷ0(Yn) predicts Y0 based on Yn
• Three kinds of predictors are discussed:
  – "Predictors": Ŷ0(Yn) has unrestricted functional form
  – "Linear predictors": Ŷ0 = a0 + Σi aiYi = a0 + aᵀYn
  – "Linear unbiased predictors" (LUPs): again linear, Ŷ0 = a0 + aᵀYn, and furthermore "unbiased" with respect to a given family F of distributions for (Y0, Yn)
• Definition: a predictor Ŷ0 is unbiased for Y0 with respect to the class of distributions F over (Y0, Yn) if for all F ∈ F, EF{Ŷ0} = EF{Y0}.
  – EF denotes expectation under the F(·) distribution for (Y0, Yn)
  – this definition depends on F: a linear predictor is unbiased with respect to a class
  – as F gets bigger, the set of LUPs gets weakly smaller

Page 4: Predicting Output from Computer Experiments

LUP Example 1

• Suppose that Yi = β0 + εi, where εi ∼ N(0, σ²), σ² > 0.
• Define F as those distributions in which
  – β0 is a given nonzero constant
  – σ² is unknown, but it is known that σ² > 0
• Any Ŷ0 = a0 + aᵀYn is a LP of Y0
• Which are unbiased? We know that:
  – E{Ŷ0} = E{a0 + Σi aiYi} = a0 + β0 Σi ai   (Eq 1)
  – and E{Y0} = β0   (Eq 2)
• For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ²
  – since (Eq 1) and (Eq 2) are independent of σ², we just need that, given β0, (a0, a) satisfies a0 + β0 Σi ai = β0
  – solutions:
    • a0 = β0 and a such that Σi ai = 0 (e.g., the data-independent predictor Ŷ0 = β0)
    • a0 = 0 and a such that Σi ai = 1
      – e.g., the sample mean of Yn is the LUP corresponding to a0 = 0, ai = 1/n
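A quick simulation check of the two unbiasedness conditions above (my own sketch; the values of β0, σ, and the weights are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, sigma, n, reps = 3.0, 0.8, 4, 500_000   # arbitrary illustration values

Y = beta0 + sigma * rng.standard_normal((reps, n))

# Solution 1: a0 = beta0 and sum(a_i) = 0
a1 = np.array([0.5, -0.5, 0.25, -0.25])
pred1 = beta0 + Y @ a1

# Solution 2: a0 = 0 and sum(a_i) = 1 (here, the sample mean)
pred2 = Y.mean(axis=1)

print(pred1.mean(), pred2.mean(), beta0)   # all three agree up to Monte Carlo error
```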

Page 5: Predicting Output from Computer Experiments

LUP Example 2

• Suppose again that Yi = β0 + εi, where εi ∼ N(0, σ²), σ² > 0.
• Define F as those distributions in which
  – β0 is an unknown real constant
  – σ² is unknown, but it is known that σ² > 0
• Any Ŷ0 = a0 + aᵀYn is a LP of Y0
• Which are unbiased? We know that:
  – E{Ŷ0} = E{a0 + Σi aiYi} = a0 + β0 Σi ai   (Eq 1)
  – and E{Y0} = β0   (Eq 2)
• For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ² and ∀ β0
  – since (Eq 1) and (Eq 2) are independent of σ², we just need that (a0, a) satisfies a0 + β0 Σi ai = β0 for every β0
  – solution: a0 = 0 and a such that Σi ai = 1
    • (the choice a0 = β0 with Σi ai = 0 from Example 1 is no longer available, since β0 is unknown)
    • e.g., the sample mean of Yn is the LUP corresponding to a0 = 0, ai = 1/n
• This illustrates that a LUP for F is a LUP for subfamilies of F

Page 6: Predicting Output from Computer Experiments

Best Mean Squared Prediction Error (MSPE) Predictors

• Definition: MSPE(Ŷ0, F) ≡ EF{(Ŷ0 - Y0)²}
• Definition: Ŷ0 is a minimum MSPE predictor at F if, for any predictor Ŷ0*,
  MSPE(Ŷ0, F) ≤ MSPE(Ŷ0*, F)
  – we'll also call this a best MSPE predictor
• "Fundamental theorem of prediction":
  – the conditional mean of Y0 given Yn is the minimum MSPE predictor of Y0 based on Yn

Page 7: Predicting Output from Computer Experiments

Best Mean Squared Prediction Error (MSPE) Predictors

• Theorem: Suppose that (Y0, Yn) has a joint distribution F for which the conditional mean of Y0 given Yn exists. Then Ŷ0 = E{Y0 | Yn} is the best MSPE predictor of Y0.
• Proof: Fix an arbitrary predictor Ŷ0*(Yn).
  – MSPE(Ŷ0*, F) = EF{(Ŷ0* - Y0)²}
     = EF{(Ŷ0* - Ŷ0 + Ŷ0 - Y0)²}
     = EF{(Ŷ0* - Ŷ0)²} + MSPE(Ŷ0, F) + 2 EF{(Ŷ0* - Ŷ0)(Ŷ0 - Y0)}
     ≥ MSPE(Ŷ0, F) + 2 EF{(Ŷ0* - Ŷ0)(Ŷ0 - Y0)}   (Eq 3)
  – EF{(Ŷ0* - Ŷ0)(Ŷ0 - Y0)} = EF{ (Ŷ0* - Ŷ0) EF{(Ŷ0 - Y0) | Yn} }
     = EF{ (Ŷ0* - Ŷ0)(Ŷ0 - EF{Y0 | Yn}) }
     = EF{ (Ŷ0* - Ŷ0) × 0 } = 0
  – Thus, MSPE(Ŷ0*, F) ≥ MSPE(Ŷ0, F). ∎
• Notes:
  – Ŷ0 = E{Y0 | Yn} is essentially the unique best MSPE predictor
    • MSPE(Ŷ0*, F) = MSPE(Ŷ0, F) iff Ŷ0 = Ŷ0* almost everywhere
  – Ŷ0 = E{Y0 | Yn} is always unbiased:
    • E{Ŷ0} = E{E{Y0 | Yn}} = E{Y0}   (Why can we condition here?)

Page 8: Predicting Output from Computer Experiments

Example Continued: Best MSPE Predictors

• What is the best MSPE predictor when each Yi ∼ N(β0, σ²), with β0 known?
  – Since the Yi's are independent, [Y0 | Yn] = N(β0, σ²)
  – Thus, Ŷ0 = E{Y0 | Yn} = β0
• What if σ² is known and Yi ∼ N(β0, σ²), but β0 is unknown (i.e., we take the improper flat prior [β0] ∝ 1)?
  – improper priors do not always give proper posteriors, but here:
    [Y0 | Yn = yn] ∼ N1[ȳ, σ²(1 + 1/n)]
    where ȳ is the sample mean of the training data Yn
  – Thus, the best MSPE predictor of Y0 is Ŷ0 = (Σi Yi)/n
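For completeness, a one-line sketch of where the (1 + 1/n) inflation comes from, under the flat-prior assumption above (this derivation is mine, not copied from the slides):

```latex
% With [\beta_0] \propto 1 and Y_i \mid \beta_0 \sim N(\beta_0, \sigma^2) i.i.d.,
% standard calculations give \beta_0 \mid Y^n \sim N(\bar{y}, \sigma^2/n).
% Since Y_0 = \beta_0 + \epsilon_0 with \epsilon_0 \sim N(0, \sigma^2) independent of Y^n,
\[
  Y_0 \mid Y^n \;\sim\; N\!\left(\bar{y},\; \frac{\sigma^2}{n} + \sigma^2\right)
  \;=\; N\!\left(\bar{y},\; \sigma^2\left(1 + \tfrac{1}{n}\right)\right).
\]
```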

Page 9: Predicting Output from Computer Experiments

Now let’s dive into Gaussian Processes (uh oh…)

• Consider the regression model from chapter 2:
  Yi ≡ Y(xi) = Σj=1..p fj(xi)βj + Z(xi) = fᵀ(xi)β + Z(xi)
  – each fj is a known regression function
  – β is an unknown p × 1 vector of regression coefficients
  – Z(x) is a zero-mean stationary Gaussian process with dependence specified by
    Cov{Z(xi), Z(xj)} = σ_Z² R(xi - xj)
    for some known correlation function R
  – Then the joint distribution of Y0 = Y(x0) and Yn = (Y(x1), …, Y(xn))ᵀ is multivariate normal:
    (Y0, Yn) ∼ N1+n( (f0ᵀβ, (Fβ)ᵀ)ᵀ, σ_Z² [[1, r0ᵀ], [r0, R]] )   (Eq 4)
    where f0 ≡ f(x0), F is the n × p matrix with ith row fᵀ(xi), r0 ≡ (R(x0 - x1), …, R(x0 - xn))ᵀ, and R is the n × n matrix with (i, j) entry R(xi - xj)
• The fundamental theorem of prediction and the conditional distribution of a multivariate normal then give the predictor on the next slide.

Page 10: Predicting Output from Computer Experiments

Gaussian Process Example Continued

• The best MSPE predictor of Y0 is
  Ŷ0 = E{Y0 | Yn} = f0ᵀβ + r0ᵀR⁻¹(Yn - Fβ)   (from Eq 4)
• …But for what class of distributions F is this true?
  – Ŷ0 depends on:
    • multivariate normality of (Y0, Yn)
    • β
    • R(·)
  – thus the best MSPE predictor changes when β or R change; however, it remains the same for all σ_Z² > 0
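The formula above is just the usual conditional-mean formula for a multivariate normal, specialized to the covariance structure in (Eq 4). The sketch below is my own illustration (the correlation function, design points, and β value are made up); it checks numerically that generic Gaussian conditioning and the f0ᵀβ + r0ᵀR⁻¹(Yn - Fβ) form agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D example: known beta, Gaussian correlation R(h) = exp(-theta * h**2)
theta, sigma2_Z = 20.0, 1.0
beta = np.array([0.7])                      # known coefficient for f(x) = [1]
X = np.array([0.1, 0.4, 0.6, 0.9])          # training inputs
x0 = 0.55                                   # prediction point
F = np.ones((len(X), 1))                    # regression matrix (intercept only)
f0 = np.array([1.0])

def corr(a, b):
    return np.exp(-theta * (a - b) ** 2)

R = corr(X[:, None], X[None, :])
r0 = corr(x0, X)

# Draw one realization of (Y0, Yn) from the joint normal in (Eq 4)
mean = np.concatenate(([f0 @ beta], F @ beta))
cov = sigma2_Z * np.block([[np.array([[1.0]]), r0[None, :]],
                           [r0[:, None], R]])
y_all = rng.multivariate_normal(mean, cov)
y0_true, yn = y_all[0], y_all[1:]

# (a) kriging form: f0' beta + r0' R^{-1} (Yn - F beta)
pred_kriging = f0 @ beta + r0 @ np.linalg.solve(R, yn - F @ beta)

# (b) generic MVN conditioning: mu0 + Sigma01 Sigma11^{-1} (Yn - mu1)
pred_mvn = mean[0] + cov[0, 1:] @ np.linalg.solve(cov[1:, 1:], yn - mean[1:])

print(pred_kriging, pred_mvn)               # identical up to floating point
```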

Page 11: Predicting Output from Computer Experiments

Second GP example

• Second example: analogous to the previous linear example, what if we add uncertainty about β?
  – we assume that σ_Z² is known, although the authors say this isn't required
• Now we have a two-stage model:
  – The first stage, our conditional distribution of (Y0, Yn) given β, is the same distribution we saw before.
  – The second stage is our prior on β.
• One can show that the best MSPE predictor of Y0 is
  Ŷ0 = E{Y0 | Yn} = f0ᵀ E{β | Yn} + r0ᵀR⁻¹(Yn - F E{β | Yn})
  – Compare this to what we had in the one-stage case: Ŷ0 = f0ᵀβ + r0ᵀR⁻¹(Yn - Fβ)
  – (not proved here, but the authors give a derivation; see the book)

Page 12: Predicting Output from Computer Experiments

So what about E{β | Yn}?

• Of course, the formula for E{β | Yn} depends on our prior
  – when this prior is uninformative, we can derive
    [β | Yn] ∼ Np[(FᵀR⁻¹F)⁻¹FᵀR⁻¹Yn, σ_Z²(FᵀR⁻¹F)⁻¹]
  – this (somehow) gives us
    Ŷ0 = f0ᵀB + r0ᵀR⁻¹(Yn - FB),   (Eq 5)
    where B = (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yn
    * for PowerPoint reasons I use B rather than the more standard β̂; this is the generalized least squares estimator of β
• What sense can we make of (Eq 5)?
  1. the sum of the regression predictor f0ᵀB and a "correction" r0ᵀR⁻¹(Yn - FB)
  2. a function of the training data Yn
  3. a function of x0, the point at which a prediction is made
     • recall that f0 ≡ f(x0); r0 ≡ (R(x0 - x1), …, R(x0 - xn))ᵀ
• For the moment, we consider (1); we consider (2) and (3) in §3.3
  – (that's right, we're still in §3.2!)
• The correction term is a linear combination of the residuals Yn - FB based on the GP model fᵀβ + Z, with prediction-point-specific coefficients:
  r0ᵀR⁻¹(Yn - FB) = Σi ci(x0)(Yn - FB)i
  where the weight ci(x0) is the ith element of R⁻¹r0 and (Yn - FB)i is the ith residual based on the fitted model (see the code sketch below)
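A minimal sketch of (Eq 5) in code, assuming a made-up correlation function and arbitrary training data (the function names and parameter values are mine, not the book's); it returns the regression and correction pieces separately:

```python
import numpy as np

def gp_blup(x0, X, y, f, corr):
    """Evaluate (Eq 5): Yhat(x0) = f(x0)'B + r0' R^{-1} (y - F B),
    with B the generalized least squares estimate (F' R^{-1} F)^{-1} F' R^{-1} y."""
    F = np.array([f(x) for x in X])                  # n x p regression matrix
    R = np.array([[corr(xi, xj) for xj in X] for xi in X])
    r0 = np.array([corr(x0, xi) for xi in X])
    Rinv_F = np.linalg.solve(R, F)
    B = np.linalg.solve(F.T @ Rinv_F, Rinv_F.T @ y)  # GLS estimate of beta
    resid = y - F @ B                                # residuals Yn - F B
    c = np.linalg.solve(R, r0)                       # weights c_i(x0) = (R^{-1} r0)_i
    regression = f(x0) @ B
    correction = c @ resid
    return regression, correction

# Toy use: intercept-only regression, squared-exponential correlation (both assumptions)
f = lambda x: np.array([1.0])
sqexp = lambda a, b: np.exp(-50.0 * (a - b) ** 2)
X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * X)
reg, corr_term = gp_blup(0.6, X, y, f, sqexp)
print(reg + corr_term)                               # the BLUP at x0 = 0.6
```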

Page 13: Predicting Output from Computer Experiments

Example

• Suppose the true unknown curve is the 1D damped cosine:
  y(x) = e^(-1.4x) cos(7πx/2)
• 7-point training set
  – x1 drawn from [0, 1/7]
  – xi = x1 + (i - 1)/7
• Consider predicting y using a stationary GP Y(x) = β0 + Z(x)
  – Z has zero mean, variance σ_Z², and correlation function R(h) = e^(-136.1 h²)
  – F is a 7 × 1 column vector of ones
    • i.e., we have no features, just an intercept β0
• Using the regression/correction interpretation of (Eq 5), we can write
  Ŷ(x0) = B0 + Σi=1..7 ci(x0)(Yi - B0)
  – ci(x0) is the ith element of R⁻¹r0
  – (Yi - B0) are the residuals from fitting the constant model (a code sketch of this example follows below)
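The following sketch reproduces this setup (my own code, not the book's; the draw of x1 is random, and the training responses are taken to be the deterministic simulator output y(xi)):

```python
import numpy as np

rng = np.random.default_rng(1)

y_true = lambda x: np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)

# 7-point training design: x1 uniform on [0, 1/7], then equally spaced
x1 = rng.uniform(0, 1 / 7)
X = x1 + np.arange(7) / 7
Y = y_true(X)                                   # deterministic simulator output

R = np.exp(-136.1 * (X[:, None] - X[None, :]) ** 2)
ones = np.ones(7)

# B0 = (F' R^{-1} F)^{-1} F' R^{-1} Y, with F a column of ones
B0 = (ones @ np.linalg.solve(R, Y)) / (ones @ np.linalg.solve(R, ones))

def predict(x0):
    r0 = np.exp(-136.1 * (x0 - X) ** 2)
    c = np.linalg.solve(R, r0)                  # weights c_i(x0)
    return B0 + c @ (Y - B0)                    # regression + correction

print(predict(0.55), y_true(0.55))              # prediction vs. truth at x0 = 0.55
print(predict(X[3]), Y[3])                      # at a training point the BLUP interpolates
```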

Page 14: Predicting Output from Computer Experiments

Example continued

• Consider y(x0) at x0 = 0.55 (plotted as a cross on the slide's figure)
  – the residuals (Yi - B0) and their associated weights ci(x0) are also plotted there
• Note:
  – weights can be positive or negative
  – the correction to the regression B0 is based primarily on the residuals at the training points closest to x0
    • the weights for the 3 furthest training instances are indistinguishable from zero
  – the prediction ŷ(0.55) has interpolated the data
  – what does the whole curve look like? We need to wait for §3.3 to find out…

Page 15: Predicting Output from Computer Experiments

…but I’ll show you now anyway!

Page 16: Predicting Output from Computer Experiments

Interpolating the data

• The correction term r0ᵀR⁻¹(Yn - FB) forces the model to interpolate the data
  – suppose x0 is xi for some i ∈ {1, …, n}
    • then f0 = f(xi), and
    • r0 = (R(xi - x1), …, R(xi - xn))ᵀ, which is the ith row (equivalently, the ith column) of R
  – Because R⁻¹r0 is then the ith column of R⁻¹R = In, the identity matrix, we get R⁻¹r0 = (0, …, 0, 1, 0, …, 0)ᵀ = ei, the ith unit vector
  – Hence: r0ᵀR⁻¹(Yn - FB) = eiᵀ(Yn - FB) = Yi - fᵀ(xi)B
  – and so (Eq 5) gives
    Ŷ(x0) = fᵀ(xi)B + (Yi - fᵀ(xi)B) = Yi
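A tiny numerical check of the key step, again with a made-up correlation function and design:

```python
import numpy as np

X = np.linspace(0.0, 1.0, 6)                         # arbitrary design points
R = np.exp(-40.0 * (X[:, None] - X[None, :]) ** 2)   # assumed Gaussian correlation
i = 2
r0 = R[i]                                            # when x0 = x_i, r0 is the i-th row of R
print(np.round(np.linalg.solve(R, r0), 10))          # -> the unit vector e_i
```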

Page 17: Predicting Output from Computer Experiments

An example showing that best MSPE predictors need not be linear

• Suppose that (Y0, Y1) has the joint density f(y0, y1) = 1/y1² on 0 < y0 < y1², 0 < y1 < 1 (i.e., Y1 ∼ Uniform(0, 1) and, given Y1 = y1, Y0 ∼ Uniform(0, y1²)).
• Then the conditional distribution of Y0 given Y1 = y1 is uniform over the interval (0, y1²).
• The best MSPE predictor of Y0 is the center of this interval:
  Ŷ0 = E{Y0 | Y1} = Y1²/2
• The minimum MSPE linear unbiased predictor is Ŷ0L = -1/12 + ½ Y1
  – based on a bunch of calculus
• Their MSPEs are very similar:
  – E{(Y0 - Y1²/2)²} ≈ 0.01667
  – E{(Y0 - (-1/12 + ½ Y1))²} ≈ 0.01806
  – (both values are easy to verify by simulation; see below)
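A quick Monte Carlo check of these two MSPEs (my own verification code; it assumes the joint distribution stated above, i.e. Y1 uniform on (0, 1) and Y0 | Y1 uniform on (0, Y1²)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000

y1 = rng.uniform(0.0, 1.0, n)
y0 = rng.uniform(0.0, y1 ** 2)                            # Y0 | Y1 ~ Uniform(0, Y1^2)

mspe_cond_mean = np.mean((y0 - y1 ** 2 / 2) ** 2)         # best MSPE predictor
mspe_blup      = np.mean((y0 - (-1 / 12 + y1 / 2)) ** 2)  # best linear predictor

print(round(mspe_cond_mean, 5), round(mspe_blup, 5))      # ~0.01667 and ~0.01806
```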

Page 18: Predicting Output from Computer Experiments

Best Linear Unbiased MSPE Predictors

• Minimum MSPE predictors depend on the joint distribution of Yn and Y0
  – thus, they tend to be optimal only within a very restricted class F
• In an attempt to find predictors that are more broadly optimal, consider restricting attention to:
  1. predictors that are linear in Yn
     – the best of these are called best linear predictors (BLPs)
  2. predictors that are both linear and unbiased for Y0
     – the best of these are called best linear unbiased predictors (BLUPs)

Page 19: Predicting Output from Computer Experiments

BLUP Example 1

• Recall our first example:
  – Yi = β0 + εi, where εi ∼ N(0, σ²), σ² > 0
  – Define F as those distributions in which
    • β0 is a given nonzero constant
    • σ² is unknown, but it is known that σ² > 0
  – Any Ŷ0 = a0 + aᵀYn is a LUP of Y0 if a0 + β0 Σi ai = β0
• The MSPE of a linear unbiased predictor Ŷ0 = a0 + aᵀYn is
  E{(a0 + Σi aiYi - Y0)²} = E{(a0 + Σi ai(β0 + εi) - β0 - ε0)²}
   = (a0 + β0 Σi ai - β0)² + σ² Σi ai² + σ²
   = σ²(1 + Σi ai²)   (Eq 6)
   ≥ σ²   (Eq 7)
  – we have equality in (Eq 6) because Ŷ0 is unbiased
  – we have equality in (Eq 7) iff ai = 0 for all i ∈ {1, …, n} (and hence a0 = β0)
  – Thus, the unique BLUP is Ŷ0 = β0

Page 20: Predicting Output from Computer Experiments

BLUP Example 2

• Consider again the enlarged model F, with β0 an unknown real and σ² > 0
  – recall that every unbiased Ŷ0 = a0 + aᵀYn must satisfy a0 = 0 and Σi ai = 1
  – The MSPE of Ŷ0 is
    E{(Σi aiYi - Y0)²} = (β0 Σi ai - β0)² + σ² Σi ai² + σ²
     = 0 + σ²(1 + Σi ai²)   (Eq 8)
     ≥ σ²(1 + 1/n)   (Eq 9)
  – equality holds in (Eq 8) because Σi ai = 1
  – (Eq 9): Σi ai² is minimized subject to Σi ai = 1 when ai = 1/n (checked numerically below)
• Thus the sample mean Ŷ0 = (1/n) Σi Yi is the best linear unbiased predictor of Y0 for the enlarged F.
  – How can the BLUP for a large class not also be the BLUP for a subclass? (Didn't we see a claim to the contrary earlier?)
    • the earlier claim was that every LUP for a class is also a LUP for a subclass; the analogous statement does not hold for BLUPs
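A small simulation check of (Eq 9), with β0 and σ chosen arbitrarily for illustration: the sample mean's empirical MSPE should sit near σ²(1 + 1/n), and any other weighting with Σi ai = 1 should do worse.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, sigma, n, reps = 2.5, 1.3, 5, 200_000     # arbitrary values for illustration

Y = beta0 + sigma * rng.standard_normal((reps, n))   # training responses
Y0 = beta0 + sigma * rng.standard_normal(reps)       # the response to predict

a_mean = np.full(n, 1 / n)                       # BLUP weights a_i = 1/n
a_other = np.array([0.5, 0.3, 0.1, 0.1, 0.0])    # another LUP: weights also sum to 1

mspe_mean  = np.mean((Y @ a_mean  - Y0) ** 2)
mspe_other = np.mean((Y @ a_other - Y0) ** 2)

print(mspe_mean, sigma**2 * (1 + 1/n))           # empirical vs. theoretical lower bound
print(mspe_other)                                # larger: sigma^2 * (1 + sum(a_i^2))
```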

Page 21: Predicting Output from Computer Experiments

BLUP Example 3

• Consider the measurement error model:
  Yi ≡ Y(xi) = Σj fj(xi)βj + εi
  where the fj are known regression functions, the βj are unknown, and each εi ∼ N(0, σ²)
• Consider the BLUP of Y(x0) for unknown β and σ² > 0
• A linear predictor Ŷ0 = a0 + aᵀYn is unbiased provided that, for all (β, σ²),
  E{a0 + aᵀYn} = a0 + aᵀFβ is equal to E{Y0} = fᵀ(x0)β
  – This implies a0 = 0 and Fᵀa = f(x0)
• The BLUP of Y0 is Ŷ0 = fᵀ(x0)B
  – where B = (FᵀF)⁻¹FᵀYn is the ordinary least squares estimator of β
  – and the BLUP is unique
• This is proved in the chapter notes, §3.4.
• …and now we've reached the end of §3.2!
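A minimal sketch of this OLS-based BLUP (the regression functions, coefficients, and data below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed regression functions: f(x) = (1, x), i.e. a simple linear trend
f = lambda x: np.array([1.0, x])
beta_true, sigma = np.array([0.5, 2.0]), 0.3

X = np.linspace(0.0, 1.0, 10)
F = np.vstack([f(x) for x in X])                     # 10 x 2 regression matrix
Y = F @ beta_true + sigma * rng.standard_normal(10)  # measurement-error responses

B = np.linalg.lstsq(F, Y, rcond=None)[0]             # OLS estimate (F'F)^{-1} F'Y
x0 = 0.35
print(f(x0) @ B)                                     # the BLUP of Y(x0)
```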

Page 22: Predicting Output from Computer Experiments

…that’s all for today!

Page 23: Predicting Output from Computer Experiments

Prediction for Computer Experiments

• The idea is to build a "surrogate" or "simulator"
  – a model that predicts the output of a simulation, to spare you from having to run the actual simulation
  – neural networks, splines, and GPs all work; guess what, they like GPs
• Let f1, …, fp be known regression functions, β be a vector of unknown regression coefficients, and Z be a stationary GP on X with zero mean, variance σ_Z², and correlation function R.
  – Then we can see experimental output Y(x) as the realization of the random function
    Y(x) = Σj fj(x)βj + Z(x) = fᵀ(x)β + Z(x)
• This model implies that Y0 and Yn have the multivariate normal distribution of (Eq 4), where now β and σ_Z² > 0 are unknown
• Now, drop the Gaussian assumption and consider a nonparametric moment model based on an arbitrary second-order stationary process Z, for unknown β and σ_Z²: only the mean fᵀ(x)β and the covariance σ_Z² R(·) are specified, not normality.

Page 24: Predicting Output from Computer Experiments

Conclusion: is this the right tool when we can get lots of data?