JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

A Universal Methodology and Toolkit for Quan8fying Simula8on Error via both Bayesian Inference and Model Reduc8on Strategies

Jeremiah Wilke, Khachik Sargsyan, Mar8n Drohmann Sandia Na8onal Labs, Livermore, CA

Uncertainty quan8fica8on is cri8cal to useful simula8on, regardless of detail or fidelity

2

Crude(

guess(

Rough(

idea(

Cause(and(

effect(

Very(good(

es5mates(

Exact(

hardware(model(

100(

101(

102(103(104(105(106(107(

Sim

ula5on(scope/parallelism

(

Simula5on(fidelity(

Cons5tu5ve(

Models(

Coarse1Grain(

Simula5on(

Cycle1Accurate(

Simula5on( Emula5on(

Hardware(

Design(

So<ware(

Support(

Applica5on(

Evalua5on(

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node) 0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Single data point isn’t enough information

Need a mean value and error bound

Cri8cal ques8on to integra8on of UQ tools: intrusive vs non-‐intrusive, code vs workflow

3

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node) 0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Model/simulator standalone code Results + UQ

What bridges the gap from left to right?

C/C++ API or

Toolkit

Cri8cal ques8on to integra8on of UQ tools: intrusive vs non-‐intrusive, code vs workflow

4

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node) 0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Model/simulator standalone code Results + UQ

Goal of presentation here is not universal solution to UQ methods and code integration,

but framing the problem via two use cases

C/C++ API or

Toolkit

Where we understand UQ beQer: experimental noise and series expansion

5

F(x) = f (i) (a)i!i=0

k

∑ (x − a)i + R(x)

R(x) = f (k+1)(Z )k!

(x − Z )k (x − a)

Errors due to randomness or experimental scatter

Analytic formula derived from Taylor series

The chicken and the egg of UQ: Knowing the error without knowing the answer

6

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

The chicken and the egg of UQ: Knowing the error without knowing the answer

7

Mean value

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

?

Making Bayesian inference a universally understood concept

8

P

⇣{�

i

}�� {x

i

}⌘/ P

⇣{x

i

}�� {�

i

}⌘⇥ P

⇣{�

i

}⌘

1

Posterior: Given prior knowledge and new data, encapsulates best knowledge of parameters

Likelihood Function: Physically motivated likelihood

estimate of data points assuming a set of parameters

Prior: Encapsulates prior knowledge of problem

Quantity of interest, but not possible to directly estimate

NOT quantity of interest, but can be estimated directly

Infer posterior from physically motivated likelihood!

Making Bayesian inference a universally understood concept

9

P

⇣{�

i

}�� {x

i

}⌘/ P

⇣{x

i

}�� {�

i

}⌘⇥ P

⇣{�

i

}⌘

1

Posterior: Given prior knowledge and new data, encapsulates best knowledge of parameters

Likelihood Function: Physically motivated likelihood

estimate of data points assuming a set of parameters

Prior: Encapsulates prior knowledge of problem

P⇣{�i}

�� {xi}⌘/ P

⇣{xi}

�� {�i}⌘⇥ P

⇣{�i}

⌘

{xi} = R heads , N flips

P⇣{xi}

�� {�i}⌘=

✓N

R

◆HR

(1�H)

N�R

M�{�i}

�⇡ S

�{�i}

�=

X

k

ck k(⇠) , ⇠ 2 [�1, 1]

logP�{�i}

��{xi}�= �N

2

log(2⇡�2)�

NX

i=1

(Ti �Mi)2

2�2

1

Given physical model, {λi}, estimate likelihood of data

P⇣{�i}

�� {xi}⌘/ P

⇣{xi}

�� {�i}⌘⇥ P

⇣{�i}

⌘


P⇣{xi}

�� {�i}⌘=

✓N

R

◆HR

(1�H)

N�R

M�{�i}

�⇡ S

�{�i}

�=

X

k

ck k(⇠) , ⇠ 2 [�1, 1]

logP�{�i}

��{xi}�= �N

2

log(2⇡�2)�

NX

i=1

(Ti �Mi)2

2�2

1

Given data, {xi}, estimate likelihood of physical model

Data points: Coin flip: heads or tails Simulated runtime

Model parameters: Coin flip: P(heads) = H Latency/bandwidth

How to translate abstract “model imperfec8on” into something more tangible

§  App 1 simula8on suggests 2.0 s/GB §  App 2 simula8ons suggests 1.5 s/GB §  No single bandwidth is exact §  “Most likely” parameter is ~1.6

10

How does model uncertainty manifest itself in parameter uncertain8es?

11

Single BW parameter exact for all tests. “Delta” function probability distribution. Simulator is generally inaccurate OR Many parameters are equally accurate

No single parameter is exact for all tests, but small parameter range gives high accuracy

saadfadsfadsfdsfads

P⇣{�i}

�� {xi}⌘/ P

⇣{xi}

�� {�i}⌘⇥ P

⇣{�i}

⌘


P⇣{xi}

�� {�i}⌘=

✓N

R

◆HR(1�H)N�R

M�{�i}

�⇡ S

�{�i}

�=X

k

ck k(⇠) , ⇠ 2 [�1, 1]

1

Calibra8on phase requires many, many samples in parameter space to build distribu8ons

§  Coefficient ck fit over grid in parameter space

§  Computa8on of surrogate is embarrassingly parallel

§  Polynomial allows rapid AMCMC sampling of model output in parameter space

§  Expansion coefficients used in sensi8vity analysis

12

� � � � � � � � � � � � � � � �

Latency µs (L)

BW GB/s (B)

(1,1)

(2,1)

(1,2)

Simulation Model

Surrogate polynomial

Expansion coefficients

Calibra8on phase requires many, many samples in parameter space to build distribu8ons

13

� � � � � � � � � � � � � � � �

Latency µs (L)

BW GB/s (B)

(1,1)

(2,1)

(1,2)

Extensive sweep of parameter space for modest problem sizes where we have “correct” answer or we have really good idea of “correct” answer

Extrapola8ng uncertain8es into the unknown s8ll requires sampling §  Parameter likelihood distribu8ons inform where to sample §  With few samples, can build semi-‐quan8ta8ve confidence interval §  Workflow integra8on to generate/run/collect/analyze samples

14

Parameter λ

P(λ) M(λ)

Parameter λ = confidence interval

Parameter likelihood Model Output

Four step UQ workflow for Bayesian Inference: not intrusive to exis8ng codes

§  Generate: list of samples in parameter space generated by UQ toolkit

§  Run: parameter inputs through simula8on/model §  Collect: simula8on/model outputs into standard format §  Analyze: run outputs through machinery in UQ toolkit to generate

uncertainty distribu8ons

15

Output from analyze phase: error distribu8ons and parameter sensi8vi8es

16/22

0

5

10

15

20

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

8 KB16 KB24 KB32 KB

Hopper MeanSST/macro

SST/macro Uncertainty

(A)

0

0.2

0.4

0.6

0.8

1

8 16 24 32 8 16 24 32 8 16 24 32 8 16 24 32 8 16 24 32

Sen

sitiv

ity

Size (KB)

Link BWInj Lat

Hop LatMax BW

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

(B)

Reduced-‐order models or how PDE solvers are way ahead of discrete event simula8ons •  Rather than sampling in parameter space, can a simula8on self-‐diagnose its own errors?

•  How many basis func8ons do you need to accurately describe heat flow problem with 9 different regions?

•  Amounts to linear system solve Ax =b •  x is vector of size N

17

1 2 3

4 5 6

7 8 9

�D

�N1

�N0

WARNING: SOMEWHAT

SPECULATIVE

Reduced-‐order models or how PDE solvers are way ahead of discrete event simula8ons •  Rather than sampling in parameter space, can a simula8on self-‐diagnose its own errors?

•  How many basis func8ons do you need to accurately describe heat flow problem with 9 different regions?

•  Amounts to linear system solve Ax =b •  x is vector of size N

18

1 2 3

4 5 6

7 8 9

�D

�N1

�N0

Reduced-‐order models or how PDE solvers are way ahead of discrete event simula8ons §  Full order model

§  N = 10,000 §  Huge sparse system §  Brute force solu8on

§  Reduced order model §  What if I already have several solu8ons

x1, x2, x3…? How accurate is:

§  Small, dense system

19

x = cixii∑

1 2 3

4 5 6

7 8 9

�D

�N1

�N0

How is a simula8on a reduced order model?

20

Router Router Two flows competing for bandwidth

Discretization into flits shows even sharing of bandwidth on link 0

Link 1 consistently utilized at half rate

Link 1

Link 0

How is a simula8on a reduced order model?

21


Discretization shows uneven sharing of bandwidth on link 0

Link 1 remains unutilized for long time

Link 1

Link 0

How does reduced order model perspec8ve help us get at error? §  We want to know to error §  All we can compute is residual

22

E = x − x

R = Ax − b


§  Can we train computer to convert R to E? §  Residual is an indicator for the error we care about

23

E = x − x

R = Ax − b


§  Can we train computer to convert R to E? §  Residual is an indicator for the error we care about

24

E = x − x

R = Ax − b

What might be an indicator for error in a network simulator?

25


Even if we don’t exactly model flit-level flow control, we still know that we did something wrong!

Here we have no idea that an error was made!

Link 1

Link 0

Is this an intrusive or non-‐intrusive UQ?

§  Something must be logged for every packet event – intrusive §  Maintaining log of every event is too much data – need to reduce

data into other metrics

26

Router Router

Is this an intrusive or non-‐intrusive UQ?

§  Results pending….

27

Router Router

Addressing MODSIM ques8ons

§  Major contribu8on: §  Math, workflow, AND toolkit §  Demonstra8ng need for both intrusive and non-‐intrusive solu8ons

§  Gaps: §  Need bridge between mathema8cians and programmers §  Need to define both interchange formats for non-‐intrusive toolkits and APIs

for intrusive libraries

§  Bigger picture? Collabora8on? §  Simulators/modelers looking to bracket errors

AND §  UQ researchers with methods developed in other domains like PDEs

§  How to leverage results? §  Tutorials, not research papers §  Know thine audience

UQ Toolkit: sandia.gov/uqtoolkit C++ API for code integra8on Python workflow integra8on 28

JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...

Documents

Transcript of JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...