Predicting Output from Computer Experiments


Page 1: Predicting Output from Computer Experiments

Predicting Output from Computer Experiments

Design and Analysis of Computer Experiments, Chapter 3

Kevin Leyton-Brown

Page 2: Predicting Output from Computer Experiments

Overview

• Overall program in this chapter:
  – predict the output of a computer simulation
  – we're going to review approaches to regression, looking for various kinds of optimality
• First, we'll talk about just predicting our random variable (§3.2)
  – note that in this setting we have no "features"
• Then, we'll consider the inclusion of features in our predictions, based on features in our training data (§3.2, 3.3)
• In the end, we'll apply these ideas to computer experiments (§3.3)
• Not covered:
  – an empirical evaluation of seven EBLUPs on small-sample data (§3.3, pp. 69-81)
  – proofs of some esoteric BLUP theorems (§3.4, pp. 82-84)
• If you've done the reading, you already know:
  – the difference between "minimum MSPE linear unbiased predictors" and BLUPs
  – three different "intuitive" interpretations of r0ᵀR⁻¹(Yn - FB)
  – a lot about statistics
  – whether this chapter has anything to do with computer experiments
• If you haven't, you're in for a treat

Page 3: Predicting Output from Computer Experiments

Predictors

• Y0 is our random variable; our data is Yn = (Y1, …, Yn)ᵀ
  – no "features": we just predict one response from the others
• A generic predictor Ŷ0 ≡ Ŷ0(Yn) predicts Y0 based on Yn
• Three kinds of predictors are discussed:
  – "Predictors": Ŷ0(Yn) has unrestricted functional form
  – "Linear predictors": Ŷ0 = a0 + Σi aiYi = a0 + aᵀYn
  – "Linear unbiased predictors" (LUPs): again linear, Ŷ0 = a0 + aᵀYn, and furthermore "unbiased" with respect to a given family F of distributions for (Y0, Yn)
• Definition: a predictor Ŷ0 is unbiased for Y0 with respect to the class of distributions F over (Y0, Yn) if for all F ∈ F, EF{Ŷ0} = EF{Y0}.
  – EF denotes expectation under the F(·) distribution for (Y0, Yn)
  – this definition depends on F: a linear predictor is unbiased with respect to a class
  – as F gets bigger, the set of LUPs gets weakly smaller

Page 4: Predicting Output from Computer Experiments

LUP Example 1

• Suppose that Yi = β0 + εi, where εi ∼ N(0, σ²), σ² > 0.
• Define F as those distributions in which
  – β0 is a given nonzero constant
  – σ² is unknown, but it is known that σ² > 0
• Any Ŷ0 = a0 + aᵀYn is a LP of Y0
• Which are unbiased? We know that:
  – E{Ŷ0} = E{a0 + Σi aiYi} = a0 + β0 Σi ai   (Eq 1)
  – and E{Y0} = β0   (Eq 2)
• For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ²
  – since (Eq 1) and (Eq 2) are independent of σ², we just need that, given β0, (a0, a) satisfies a0 + β0 Σi ai = β0
  – solutions:
    • a0 = β0 and a such that Σi ai = 0 (e.g., the data-independent predictor Ŷ0 = β0)
    • a0 = 0 and a such that Σi ai = 1
      – e.g., the sample mean of Yn is the LUP corresponding to a0 = 0, ai = 1/n
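A quick simulation check of the two unbiasedness conditions above (my own sketch; the values of β0, σ, and the weights are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, sigma, n, reps = 3.0, 0.8, 4, 500_000   # arbitrary illustration values

Y = beta0 + sigma * rng.standard_normal((reps, n))

# Solution 1: a0 = beta0 and sum(a_i) = 0
a1 = np.array([0.5, -0.5, 0.25, -0.25])
pred1 = beta0 + Y @ a1

# Solution 2: a0 = 0 and sum(a_i) = 1 (here, the sample mean)
pred2 = Y.mean(axis=1)

print(pred1.mean(), pred2.mean(), beta0)   # all three agree up to Monte Carlo error
```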

Page 5: Predicting Output from Computer Experiments

LUP Example 2

• Suppose again that Yi = β0 + εi, where εi ∼ N(0, σ²), σ² > 0.
• Define F as those distributions in which
  – β0 is an unknown real constant
  – σ² is unknown, but it is known that σ² > 0
• Any Ŷ0 = a0 + aᵀYn is a LP of Y0
• Which are unbiased? We know that:
  – E{Ŷ0} = E{a0 + Σi aiYi} = a0 + β0 Σi ai   (Eq 1)
  – and E{Y0} = β0   (Eq 2)
• For our LP to be unbiased, we must have (Eq 1) = (Eq 2) ∀ σ² and ∀ β0
  – since (Eq 1) and (Eq 2) are independent of σ², we just need that (a0, a) satisfies a0 + β0 Σi ai = β0 for every β0
  – solution: a0 = 0 and a such that Σi ai = 1
    • (the choice a0 = β0 with Σi ai = 0 from Example 1 is no longer available, since β0 is unknown)
    • e.g., the sample mean of Yn is the LUP corresponding to a0 = 0, ai = 1/n
• This illustrates that a LUP for F is a LUP for subfamilies of F

Page 6: Predicting Output from Computer Experiments

Best Mean Squared Prediction Error (MSPE) Predictors

• Definition: MSPE(Ŷ0, F) ≡ EF{(Ŷ0 - Y0)²}
• Definition: Ŷ0 is a minimum MSPE predictor at F if, for any predictor Ŷ0*,
  MSPE(Ŷ0, F) ≤ MSPE(Ŷ0*, F)
  – we'll also call this a best MSPE predictor
• "Fundamental theorem of prediction":
  – the conditional mean of Y0 given Yn is the minimum MSPE predictor of Y0 based on Yn

Page 7: Predicting Output from Computer Experiments

Best Mean Squared Prediction Error (MSPE) Predictors

• Theorem: Suppose that (Y0, Yn) has a joint distribution F for which the conditional mean of Y0 given Yn exists. Then Ŷ0 = E{Y0 | Yn} is the best MSPE predictor of Y0.
• Proof: Fix an arbitrary predictor Ŷ0*(Yn).
  – MSPE(Ŷ0*, F) = EF{(Ŷ0* - Y0)²}
     = EF{(Ŷ0* - Ŷ0 + Ŷ0 - Y0)²}
     = EF{(Ŷ0* - Ŷ0)²} + MSPE(Ŷ0, F) + 2 EF{(Ŷ0* - Ŷ0)(Ŷ0 - Y0)}
     ≥ MSPE(Ŷ0, F) + 2 EF{(Ŷ0* - Ŷ0)(Ŷ0 - Y0)}   (Eq 3)
  – EF{(Ŷ0* - Ŷ0)(Ŷ0 - Y0)} = EF{ (Ŷ0* - Ŷ0) EF{(Ŷ0 - Y0) | Yn} }
     = EF{ (Ŷ0* - Ŷ0)(Ŷ0 - EF{Y0 | Yn}) }
     = EF{ (Ŷ0* - Ŷ0) × 0 } = 0
  – Thus, MSPE(Ŷ0*, F) ≥ MSPE(Ŷ0, F). ∎
• Notes:
  – Ŷ0 = E{Y0 | Yn} is essentially the unique best MSPE predictor
    • MSPE(Ŷ0*, F) = MSPE(Ŷ0, F) iff Ŷ0 = Ŷ0* almost everywhere
  – Ŷ0 = E{Y0 | Yn} is always unbiased:
    • E{Ŷ0} = E{E{Y0 | Yn}} = E{Y0}   (Why can we condition here?)

Page 8: Predicting Output from Computer Experiments

Example Continued: Best MSPE Predictors

• What is the best MSPE predictor when each Yi ∼ N(β0, σ²), with β0 known?
  – Since the Yi's are independent, [Y0 | Yn] = N(β0, σ²)
  – Thus, Ŷ0 = E{Y0 | Yn} = β0
• What if σ² is known and Yi ∼ N(β0, σ²), but β0 is unknown (i.e., we take the improper flat prior [β0] ∝ 1)?
  – improper priors do not always give proper posteriors, but here:
    [Y0 | Yn = yn] ∼ N1[ȳ, σ²(1 + 1/n)]
    where ȳ is the sample mean of the training data Yn
  – Thus, the best MSPE predictor of Y0 is Ŷ0 = (Σi Yi)/n
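For completeness, a one-line sketch of where the (1 + 1/n) inflation comes from, under the flat-prior assumption above (this derivation is mine, not copied from the slides):

```latex
% With [\beta_0] \propto 1 and Y_i \mid \beta_0 \sim N(\beta_0, \sigma^2) i.i.d.,
% standard calculations give \beta_0 \mid Y^n \sim N(\bar{y}, \sigma^2/n).
% Since Y_0 = \beta_0 + \epsilon_0 with \epsilon_0 \sim N(0, \sigma^2) independent of Y^n,
\[
  Y_0 \mid Y^n \;\sim\; N\!\left(\bar{y},\; \frac{\sigma^2}{n} + \sigma^2\right)
  \;=\; N\!\left(\bar{y},\; \sigma^2\left(1 + \tfrac{1}{n}\right)\right).
\]
```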

Page 9: Predicting Output from Computer Experiments

Now let’s dive into Gaussian Processes (uh oh…)

• Consider the regression model from chapter 2:
  Yi ≡ Y(xi) = Σj=1..p fj(xi)βj + Z(xi) = fᵀ(xi)β + Z(xi)
  – each fj is a known regression function
  – β is an unknown p × 1 vector of regression coefficients
  – Z(x) is a zero-mean stationary Gaussian process with dependence specified by
    Cov{Z(xi), Z(xj)} = σ_Z² R(xi - xj)
    for some known correlation function R
  – Then the joint distribution of Y0 = Y(x0) and Yn = (Y(x1), …, Y(xn))ᵀ is multivariate normal:
    (Y0, Yn) ∼ N1+n( (f0ᵀβ, (Fβ)ᵀ)ᵀ, σ_Z² [[1, r0ᵀ], [r0, R]] )   (Eq 4)
    where f0 ≡ f(x0), F is the n × p matrix with ith row fᵀ(xi), r0 ≡ (R(x0 - x1), …, R(x0 - xn))ᵀ, and R is the n × n matrix with (i, j) entry R(xi - xj)
• The fundamental theorem of prediction and the conditional distribution of a multivariate normal then give the predictor on the next slide.

Page 10: Predicting Output from Computer Experiments

Gaussian Process Example Continued

• The best MSPE predictor of Y0 is
  Ŷ0 = E{Y0 | Yn} = f0ᵀβ + r0ᵀR⁻¹(Yn - Fβ)   (from Eq 4)
• …But for what class of distributions F is this true?
  – Ŷ0 depends on:
    • multivariate normality of (Y0, Yn)
    • β
    • R(·)
  – thus the best MSPE predictor changes when β or R change; however, it remains the same for all σ_Z² > 0
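The formula above is just the usual conditional-mean formula for a multivariate normal, specialized to the covariance structure in (Eq 4). The sketch below is my own illustration (the correlation function, design points, and β value are made up); it checks numerically that generic Gaussian conditioning and the f0ᵀβ + r0ᵀR⁻¹(Yn - Fβ) form agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1D example: known beta, Gaussian correlation R(h) = exp(-theta * h**2)
theta, sigma2_Z = 20.0, 1.0
beta = np.array([0.7])                      # known coefficient for f(x) = [1]
X = np.array([0.1, 0.4, 0.6, 0.9])          # training inputs
x0 = 0.55                                   # prediction point
F = np.ones((len(X), 1))                    # regression matrix (intercept only)
f0 = np.array([1.0])

def corr(a, b):
    return np.exp(-theta * (a - b) ** 2)

R = corr(X[:, None], X[None, :])
r0 = corr(x0, X)

# Draw one realization of (Y0, Yn) from the joint normal in (Eq 4)
mean = np.concatenate(([f0 @ beta], F @ beta))
cov = sigma2_Z * np.block([[np.array([[1.0]]), r0[None, :]],
                           [r0[:, None], R]])
y_all = rng.multivariate_normal(mean, cov)
y0_true, yn = y_all[0], y_all[1:]

# (a) kriging form: f0' beta + r0' R^{-1} (Yn - F beta)
pred_kriging = f0 @ beta + r0 @ np.linalg.solve(R, yn - F @ beta)

# (b) generic MVN conditioning: mu0 + Sigma01 Sigma11^{-1} (Yn - mu1)
pred_mvn = mean[0] + cov[0, 1:] @ np.linalg.solve(cov[1:, 1:], yn - mean[1:])

print(pred_kriging, pred_mvn)               # identical up to floating point
```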

Page 11: Predicting Output from Computer Experiments

Second GP example

• Second example: analogous to the previous linear example, what if we add uncertainty about β?
  – we assume that σ_Z² is known, although the authors say this isn't required
• Now we have a two-stage model:
  – The first stage, our conditional distribution of (Y0, Yn) given β, is the same distribution we saw before.
  – The second stage is our prior on β.
• One can show that the best MSPE predictor of Y0 is
  Ŷ0 = E{Y0 | Yn} = f0ᵀ E{β | Yn} + r0ᵀR⁻¹(Yn - F E{β | Yn})
  – Compare this to what we had in the one-stage case: Ŷ0 = f0ᵀβ + r0ᵀR⁻¹(Yn - Fβ)
  – (not proved here, but the authors give a derivation; see the book)

Page 12: Predicting Output from Computer Experiments

So what about E{β | Yn}?

• Of course, the formula for E{β | Yn} depends on our prior
  – when this prior is uninformative, we can derive
    [β | Yn] ∼ Np[(FᵀR⁻¹F)⁻¹FᵀR⁻¹Yn, σ_Z²(FᵀR⁻¹F)⁻¹]
  – this (somehow) gives us
    Ŷ0 = f0ᵀB + r0ᵀR⁻¹(Yn - FB),   (Eq 5)
    where B = (FᵀR⁻¹F)⁻¹FᵀR⁻¹Yn
    * for PowerPoint reasons I use B rather than the more standard β̂; this is the generalized least squares estimator of β
• What sense can we make of (Eq 5)?
  1. the sum of the regression predictor f0ᵀB and a "correction" r0ᵀR⁻¹(Yn - FB)
  2. a function of the training data Yn
  3. a function of x0, the point at which a prediction is made
     • recall that f0 ≡ f(x0); r0 ≡ (R(x0 - x1), …, R(x0 - xn))ᵀ
• For the moment, we consider (1); we consider (2) and (3) in §3.3
  – (that's right, we're still in §3.2!)
• The correction term is a linear combination of the residuals Yn - FB based on the GP model fᵀβ + Z, with prediction-point-specific coefficients:
  r0ᵀR⁻¹(Yn - FB) = Σi ci(x0)(Yn - FB)i
  where the weight ci(x0) is the ith element of R⁻¹r0 and (Yn - FB)i is the ith residual based on the fitted model (see the code sketch below)
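A minimal sketch of (Eq 5) in code, assuming a made-up correlation function and arbitrary training data (the function names and parameter values are mine, not the book's); it returns the regression and correction pieces separately:

```python
import numpy as np

def gp_blup(x0, X, y, f, corr):
    """Evaluate (Eq 5): Yhat(x0) = f(x0)'B + r0' R^{-1} (y - F B),
    with B the generalized least squares estimate (F' R^{-1} F)^{-1} F' R^{-1} y."""
    F = np.array([f(x) for x in X])                  # n x p regression matrix
    R = np.array([[corr(xi, xj) for xj in X] for xi in X])
    r0 = np.array([corr(x0, xi) for xi in X])
    Rinv_F = np.linalg.solve(R, F)
    B = np.linalg.solve(F.T @ Rinv_F, Rinv_F.T @ y)  # GLS estimate of beta
    resid = y - F @ B                                # residuals Yn - F B
    c = np.linalg.solve(R, r0)                       # weights c_i(x0) = (R^{-1} r0)_i
    regression = f(x0) @ B
    correction = c @ resid
    return regression, correction

# Toy use: intercept-only regression, squared-exponential correlation (both assumptions)
f = lambda x: np.array([1.0])
sqexp = lambda a, b: np.exp(-50.0 * (a - b) ** 2)
X = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * X)
reg, corr_term = gp_blup(0.6, X, y, f, sqexp)
print(reg + corr_term)                               # the BLUP at x0 = 0.6
```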

Page 13: Predicting Output from Computer Experiments

Example

• Suppose the true unknown curve is the 1D damped cosine:
  y(x) = e^(-1.4x) cos(7πx/2)
• 7-point training set
  – x1 drawn from [0, 1/7]
  – xi = x1 + (i - 1)/7
• Consider predicting y using a stationary GP Y(x) = β0 + Z(x)
  – Z has zero mean, variance σ_Z², and correlation function R(h) = e^(-136.1 h²)
  – F is a 7 × 1 column vector of ones
    • i.e., we have no features, just an intercept β0
• Using the regression/correction interpretation of (Eq 5), we can write
  Ŷ(x0) = B0 + Σi=1..7 ci(x0)(Yi - B0)
  – ci(x0) is the ith element of R⁻¹r0
  – (Yi - B0) are the residuals from fitting the constant model (a code sketch of this example follows below)
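The following sketch reproduces this setup (my own code, not the book's; the draw of x1 is random, and the training responses are taken to be the deterministic simulator output y(xi)):

```python
import numpy as np

rng = np.random.default_rng(1)

y_true = lambda x: np.exp(-1.4 * x) * np.cos(7 * np.pi * x / 2)

# 7-point training design: x1 uniform on [0, 1/7], then equally spaced
x1 = rng.uniform(0, 1 / 7)
X = x1 + np.arange(7) / 7
Y = y_true(X)                                   # deterministic simulator output

R = np.exp(-136.1 * (X[:, None] - X[None, :]) ** 2)
ones = np.ones(7)

# B0 = (F' R^{-1} F)^{-1} F' R^{-1} Y, with F a column of ones
B0 = (ones @ np.linalg.solve(R, Y)) / (ones @ np.linalg.solve(R, ones))

def predict(x0):
    r0 = np.exp(-136.1 * (x0 - X) ** 2)
    c = np.linalg.solve(R, r0)                  # weights c_i(x0)
    return B0 + c @ (Y - B0)                    # regression + correction

print(predict(0.55), y_true(0.55))              # prediction vs. truth at x0 = 0.55
print(predict(X[3]), Y[3])                      # at a training point the BLUP interpolates
```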

Page 14: Predicting Output from Computer Experiments

Example continued

• Consider y(x0) at x0 = 0.55 (plotted as a cross on the slide's figure)
  – the residuals (Yi - B0) and their associated weights ci(x0) are also plotted there
• Note:
  – weights can be positive or negative
  – the correction to the regression B0 is based primarily on the residuals at the training points closest to x0
    • the weights for the 3 furthest training instances are indistinguishable from zero
  – the prediction ŷ(0.55) has interpolated the data
  – what does the whole curve look like? We need to wait for §3.3 to find out…

Page 15: Predicting Output from Computer Experiments

…but I’ll show you now anyway!

Page 16: Predicting Output from Computer Experiments

Interpolating the data

• The correction term r0ᵀR⁻¹(Yn - FB) forces the model to interpolate the data
  – suppose x0 is xi for some i ∈ {1, …, n}
    • then f0 = f(xi), and
    • r0 = (R(xi - x1), …, R(xi - xn))ᵀ, which is the ith row (equivalently, the ith column) of R
  – Because R⁻¹r0 is then the ith column of R⁻¹R = In, the identity matrix, we get R⁻¹r0 = (0, …, 0, 1, 0, …, 0)ᵀ = ei, the ith unit vector
  – Hence: r0ᵀR⁻¹(Yn - FB) = eiᵀ(Yn - FB) = Yi - fᵀ(xi)B
  – and so (Eq 5) gives
    Ŷ(x0) = fᵀ(xi)B + (Yi - fᵀ(xi)B) = Yi
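A tiny numerical check of the key step, again with a made-up correlation function and design:

```python
import numpy as np

X = np.linspace(0.0, 1.0, 6)                         # arbitrary design points
R = np.exp(-40.0 * (X[:, None] - X[None, :]) ** 2)   # assumed Gaussian correlation
i = 2
r0 = R[i]                                            # when x0 = x_i, r0 is the i-th row of R
print(np.round(np.linalg.solve(R, r0), 10))          # -> the unit vector e_i
```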

Page 17: Predicting Output from Computer Experiments

An example showing that best MSPE predictors need not be linear

• Suppose that (Y0, Y1) has the joint density f(y0, y1) = 1/y1² on 0 < y0 < y1², 0 < y1 < 1 (i.e., Y1 ∼ Uniform(0, 1) and, given Y1 = y1, Y0 ∼ Uniform(0, y1²)).
• Then the conditional distribution of Y0 given Y1 = y1 is uniform over the interval (0, y1²).
• The best MSPE predictor of Y0 is the center of this interval:
  Ŷ0 = E{Y0 | Y1} = Y1²/2
• The minimum MSPE linear unbiased predictor is Ŷ0L = -1/12 + ½ Y1
  – based on a bunch of calculus
• Their MSPEs are very similar:
  – E{(Y0 - Y1²/2)²} ≈ 0.01667
  – E{(Y0 - (-1/12 + ½ Y1))²} ≈ 0.01806
  – (both values are easy to verify by simulation; see below)
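A quick Monte Carlo check of these two MSPEs (my own verification code; it assumes the joint distribution stated above, i.e. Y1 uniform on (0, 1) and Y0 | Y1 uniform on (0, Y1²)):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000

y1 = rng.uniform(0.0, 1.0, n)
y0 = rng.uniform(0.0, y1 ** 2)                            # Y0 | Y1 ~ Uniform(0, Y1^2)

mspe_cond_mean = np.mean((y0 - y1 ** 2 / 2) ** 2)         # best MSPE predictor
mspe_blup      = np.mean((y0 - (-1 / 12 + y1 / 2)) ** 2)  # best linear predictor

print(round(mspe_cond_mean, 5), round(mspe_blup, 5))      # ~0.01667 and ~0.01806
```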

Page 18: Predicting Output from Computer Experiments

Best Linear Unbiased MSPE Predictors

• Minimum MSPE predictors depend on the joint distribution of Yn and Y0
  – thus, they tend to be optimal only within a very restricted class F
• In an attempt to find predictors that are more broadly optimal, consider restricting attention to:
  1. predictors that are linear in Yn
     – the best of these are called best linear predictors (BLPs)
  2. predictors that are both linear and unbiased for Y0
     – the best of these are called best linear unbiased predictors (BLUPs)

Page 19: Predicting Output from Computer Experiments

BLUP Example 1

• Recall our first example:
  – Yi = β0 + εi, where εi ∼ N(0, σ²), σ² > 0
  – Define F as those distributions in which
    • β0 is a given nonzero constant
    • σ² is unknown, but it is known that σ² > 0
  – Any Ŷ0 = a0 + aᵀYn is a LUP of Y0 if a0 + β0 Σi ai = β0
• The MSPE of a linear unbiased predictor Ŷ0 = a0 + aᵀYn is
  E{(a0 + Σi aiYi - Y0)²} = E{(a0 + Σi ai(β0 + εi) - β0 - ε0)²}
   = (a0 + β0 Σi ai - β0)² + σ² Σi ai² + σ²
   = σ²(1 + Σi ai²)   (Eq 6)
   ≥ σ²   (Eq 7)
  – we have equality in (Eq 6) because Ŷ0 is unbiased
  – we have equality in (Eq 7) iff ai = 0 for all i ∈ {1, …, n} (and hence a0 = β0)
  – Thus, the unique BLUP is Ŷ0 = β0

Page 20: Predicting Output from Computer Experiments

BLUP Example 2

• Consider again the enlarged model F, with β0 an unknown real and σ² > 0
  – recall that every unbiased Ŷ0 = a0 + aᵀYn must satisfy a0 = 0 and Σi ai = 1
  – The MSPE of Ŷ0 is
    E{(Σi aiYi - Y0)²} = (β0 Σi ai - β0)² + σ² Σi ai² + σ²
     = 0 + σ²(1 + Σi ai²)   (Eq 8)
     ≥ σ²(1 + 1/n)   (Eq 9)
  – equality holds in (Eq 8) because Σi ai = 1
  – (Eq 9): Σi ai² is minimized subject to Σi ai = 1 when ai = 1/n (checked numerically below)
• Thus the sample mean Ŷ0 = (1/n) Σi Yi is the best linear unbiased predictor of Y0 for the enlarged F.
  – How can the BLUP for a large class not also be the BLUP for a subclass? (Didn't we see a claim to the contrary earlier?)
    • the earlier claim was that every LUP for a class is also a LUP for a subclass; the analogous statement does not hold for BLUPs
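A small simulation check of (Eq 9), with β0 and σ chosen arbitrarily for illustration: the sample mean's empirical MSPE should sit near σ²(1 + 1/n), and any other weighting with Σi ai = 1 should do worse.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, sigma, n, reps = 2.5, 1.3, 5, 200_000     # arbitrary values for illustration

Y = beta0 + sigma * rng.standard_normal((reps, n))   # training responses
Y0 = beta0 + sigma * rng.standard_normal(reps)       # the response to predict

a_mean = np.full(n, 1 / n)                       # BLUP weights a_i = 1/n
a_other = np.array([0.5, 0.3, 0.1, 0.1, 0.0])    # another LUP: weights also sum to 1

mspe_mean  = np.mean((Y @ a_mean  - Y0) ** 2)
mspe_other = np.mean((Y @ a_other - Y0) ** 2)

print(mspe_mean, sigma**2 * (1 + 1/n))           # empirical vs. theoretical lower bound
print(mspe_other)                                # larger: sigma^2 * (1 + sum(a_i^2))
```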

Page 21: Predicting Output from Computer Experiments

BLUP Example 3

• Consider the measurement error model:
  Yi ≡ Y(xi) = Σj fj(xi)βj + εi
  where the fj are known regression functions, the βj are unknown, and each εi ∼ N(0, σ²)
• Consider the BLUP of Y(x0) for unknown β and σ² > 0
• A linear predictor Ŷ0 = a0 + aᵀYn is unbiased provided that, for all (β, σ²),
  E{a0 + aᵀYn} = a0 + aᵀFβ is equal to E{Y0} = fᵀ(x0)β
  – This implies a0 = 0 and Fᵀa = f(x0)
• The BLUP of Y0 is Ŷ0 = fᵀ(x0)B
  – where B = (FᵀF)⁻¹FᵀYn is the ordinary least squares estimator of β
  – and the BLUP is unique
• This is proved in the chapter notes, §3.4.
• …and now we've reached the end of §3.2!
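A minimal sketch of this OLS-based BLUP (the regression functions, coefficients, and data below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed regression functions: f(x) = (1, x), i.e. a simple linear trend
f = lambda x: np.array([1.0, x])
beta_true, sigma = np.array([0.5, 2.0]), 0.3

X = np.linspace(0.0, 1.0, 10)
F = np.vstack([f(x) for x in X])                     # 10 x 2 regression matrix
Y = F @ beta_true + sigma * rng.standard_normal(10)  # measurement-error responses

B = np.linalg.lstsq(F, Y, rcond=None)[0]             # OLS estimate (F'F)^{-1} F'Y
x0 = 0.35
print(f(x0) @ B)                                     # the BLUP of Y(x0)
```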

Page 22: Predicting Output from Computer Experiments

…that’s all for today!

Page 23: Predicting Output from Computer Experiments

Prediction for Computer Experiments

• The idea is to build a "surrogate" or "simulator"
  – a model that predicts the output of a simulation, to spare you from having to run the actual simulation
  – neural networks, splines, and GPs all work; guess what, they like GPs
• Let f1, …, fp be known regression functions, β be a vector of unknown regression coefficients, and Z be a stationary GP on X with zero mean, variance σ_Z², and correlation function R.
  – Then we can see experimental output Y(x) as the realization of the random function
    Y(x) = Σj fj(x)βj + Z(x) = fᵀ(x)β + Z(x)
• This model implies that Y0 and Yn have the multivariate normal distribution of (Eq 4), where now β and σ_Z² > 0 are unknown
• Now, drop the Gaussian assumption and consider a nonparametric moment model based on an arbitrary second-order stationary process Z, for unknown β and σ_Z²: only the mean fᵀ(x)β and the covariance σ_Z² R(·) are specified, not normality.

Page 24: Predicting Output from Computer Experiments

Conclusion: is this the right tool when we can get lots of data?