JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...
Transcript of JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP
A Universal Methodology and Toolkit for Quan8fying Simula8on Error via both Bayesian Inference and Model Reduc8on Strategies
Jeremiah Wilke, Khachik Sargsyan, Mar8n Drohmann Sandia Na8onal Labs, Livermore, CA
Uncertainty quan8fica8on is cri8cal to useful simula8on, regardless of detail or fidelity
2
Crude(
guess(
Rough(
idea(
Cause(and(
effect(
Very(good(
es5mates(
Exact(
hardware(model(
100(
101(
102(103(104(105(106(107(
Sim
ula5on(scope/parallelism
(
Simula5on(fidelity(
Cons5tu5ve(
Models(
Coarse1Grain(
Simula5on(
Cycle1Accurate(
Simula5on( Emula5on(
Hardware(
Design(
So<ware(
Support(
Applica5on(
Evalua5on(
0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node) 0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node)
Single data point isn’t enough information
Need a mean value and error bound
Cri8cal ques8on to integra8on of UQ tools: intrusive vs non-‐intrusive, code vs workflow
3
0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node) 0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node)
Model/simulator standalone code Results + UQ
What bridges the gap from left to right?
C/C++ API or
Toolkit
Cri8cal ques8on to integra8on of UQ tools: intrusive vs non-‐intrusive, code vs workflow
4
0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node) 0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node)
Model/simulator standalone code Results + UQ
Goal of presentation here is not universal solution to UQ methods and code integration,
but framing the problem via two use cases
C/C++ API or
Toolkit
Where we understand UQ beQer: experimental noise and series expansion
5
F(x) = f (i) (a)i!i=0
k
∑ (x − a)i + R(x)
R(x) = f (k+1)(Z )k!
(x − Z )k (x − a)
Errors due to randomness or experimental scatter
Analytic formula derived from Taylor series
The chicken and the egg of UQ: Knowing the error without knowing the answer
6
0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node)
The chicken and the egg of UQ: Knowing the error without knowing the answer
7
Mean value
0
2
4
6
8
10
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node)
?
Making Bayesian inference a universally understood concept
8
P
⇣{�
i
}��� {x
i
}⌘/ P
⇣{x
i
}��� {�
i
}⌘⇥ P
⇣{�
i
}⌘
1
Posterior: Given prior knowledge and new data, encapsulates best knowledge of parameters
Likelihood Function: Physically motivated likelihood
estimate of data points assuming a set of parameters
Prior: Encapsulates prior knowledge of problem
Quantity of interest, but not possible to directly estimate
NOT quantity of interest, but can be estimated directly
Infer posterior from physically motivated likelihood!
Making Bayesian inference a universally understood concept
9
P
⇣{�
i
}��� {x
i
}⌘/ P
⇣{x
i
}��� {�
i
}⌘⇥ P
⇣{�
i
}⌘
1
Posterior: Given prior knowledge and new data, encapsulates best knowledge of parameters
Likelihood Function: Physically motivated likelihood
estimate of data points assuming a set of parameters
Prior: Encapsulates prior knowledge of problem
P⇣{�i}
��� {xi}⌘/ P
⇣{xi}
��� {�i}⌘⇥ P
⇣{�i}
⌘
{xi} = R heads , N flips
P⇣{xi}
��� {�i}⌘=
✓N
R
◆HR
(1�H)
N�R
M�{�i}
�⇡ S
�{�i}
�=
X
k
ck k(⇠) , ⇠ 2 [�1, 1]
logP�{�i}
��{xi}�= �N
2
log(2⇡�2)�
NX
i=1
(Ti �Mi)2
2�2
1
Given physical model, {λi}, estimate likelihood of data
P⇣{�i}
��� {xi}⌘/ P
⇣{xi}
��� {�i}⌘⇥ P
⇣{�i}
⌘
{xi} = R heads , N flips
P⇣{xi}
��� {�i}⌘=
✓N
R
◆HR
(1�H)
N�R
M�{�i}
�⇡ S
�{�i}
�=
X
k
ck k(⇠) , ⇠ 2 [�1, 1]
logP�{�i}
��{xi}�= �N
2
log(2⇡�2)�
NX
i=1
(Ti �Mi)2
2�2
1
Given data, {xi}, estimate likelihood of physical model
Data points: Coin flip: heads or tails Simulated runtime
Model parameters: Coin flip: P(heads) = H Latency/bandwidth
How to translate abstract “model imperfec8on” into something more tangible
§ App 1 simula8on suggests 2.0 s/GB § App 2 simula8ons suggests 1.5 s/GB § No single bandwidth is exact § “Most likely” parameter is ~1.6
10
How does model uncertainty manifest itself in parameter uncertain8es?
11
Single BW parameter exact for all tests. “Delta” function probability distribution. Simulator is generally inaccurate OR Many parameters are equally accurate
No single parameter is exact for all tests, but small parameter range gives high accuracy
saadfadsfadsfdsfads
P⇣{�i}
��� {xi}⌘/ P
⇣{xi}
��� {�i}⌘⇥ P
⇣{�i}
⌘
{xi} = R heads , N flips
P⇣{xi}
��� {�i}⌘=
✓N
R
◆HR(1�H)N�R
M�{�i}
�⇡ S
�{�i}
�=X
k
ck k(⇠) , ⇠ 2 [�1, 1]
1
Calibra8on phase requires many, many samples in parameter space to build distribu8ons
§ Coefficient ck fit over grid in parameter space
§ Computa8on of surrogate is embarrassingly parallel
§ Polynomial allows rapid AMCMC sampling of model output in parameter space
§ Expansion coefficients used in sensi8vity analysis
12
� � � � � � � � � � � � � � � �
Latency µs (L)
BW GB/s (B)
(1,1)
(2,1)
(1,2)
Simulation Model
Surrogate polynomial
Expansion coefficients
Calibra8on phase requires many, many samples in parameter space to build distribu8ons
13
� � � � � � � � � � � � � � � �
Latency µs (L)
BW GB/s (B)
(1,1)
(2,1)
(1,2)
Extensive sweep of parameter space for modest problem sizes where we have “correct” answer or we have really good idea of “correct” answer
Extrapola8ng uncertain8es into the unknown s8ll requires sampling § Parameter likelihood distribu8ons inform where to sample § With few samples, can build semi-‐quan8ta8ve confidence interval § Workflow integra8on to generate/run/collect/analyze samples
14
Parameter λ
P(λ) M(λ)
Parameter λ = confidence interval
Parameter likelihood Model Output
Four step UQ workflow for Bayesian Inference: not intrusive to exis8ng codes
§ Generate: list of samples in parameter space generated by UQ toolkit
§ Run: parameter inputs through simula8on/model § Collect: simula8on/model outputs into standard format § Analyze: run outputs through machinery in UQ toolkit to generate
uncertainty distribu8ons
15
Output from analyze phase: error distribu8ons and parameter sensi8vi8es
16/22
0
5
10
15
20
150 200 250 300 350 400 450 500
Tim
e(m
s)
N(node)
8 KB16 KB24 KB32 KB
Hopper MeanSST/macro
SST/macro Uncertainty
(A)
0
0.2
0.4
0.6
0.8
1
8 16 24 32 8 16 24 32 8 16 24 32 8 16 24 32 8 16 24 32
Sen
sitiv
ity
Size (KB)
Link BWInj Lat
Hop LatMax BW
0
0.2
0.4
0.6
0.8
1
N=128 N=192 N=256 N=384 N=512
0
0.2
0.4
0.6
0.8
1
N=128 N=192 N=256 N=384 N=512
0
0.2
0.4
0.6
0.8
1
N=128 N=192 N=256 N=384 N=512
0
0.2
0.4
0.6
0.8
1
N=128 N=192 N=256 N=384 N=512
(B)
Reduced-‐order models or how PDE solvers are way ahead of discrete event simula8ons • Rather than sampling in parameter space, can a simula8on self-‐diagnose its own errors?
• How many basis func8ons do you need to accurately describe heat flow problem with 9 different regions?
• Amounts to linear system solve Ax =b • x is vector of size N
17
1 2 3
4 5 6
7 8 9
�D
�N1
�N0
WARNING: SOMEWHAT
SPECULATIVE
Reduced-‐order models or how PDE solvers are way ahead of discrete event simula8ons • Rather than sampling in parameter space, can a simula8on self-‐diagnose its own errors?
• How many basis func8ons do you need to accurately describe heat flow problem with 9 different regions?
• Amounts to linear system solve Ax =b • x is vector of size N
18
1 2 3
4 5 6
7 8 9
�D
�N1
�N0
Reduced-‐order models or how PDE solvers are way ahead of discrete event simula8ons § Full order model
§ N = 10,000 § Huge sparse system § Brute force solu8on
§ Reduced order model § What if I already have several solu8ons
x1, x2, x3…? How accurate is:
§ Small, dense system
19
x = cixii∑
1 2 3
4 5 6
7 8 9
�D
�N1
�N0
How is a simula8on a reduced order model?
20
Router Router Two flows competing for bandwidth
Discretization into flits shows even sharing of bandwidth on link 0
Link 1 consistently utilized at half rate
Link 1
Link 0
How is a simula8on a reduced order model?
21
Router Router Two flows competing for bandwidth
Discretization shows uneven sharing of bandwidth on link 0
Link 1 remains unutilized for long time
Link 1
Link 0
How does reduced order model perspec8ve help us get at error? § We want to know to error § All we can compute is residual
22
E = x − x
R = Ax − b
How does reduced order model perspec8ve help us get at error? § We want to know to error § All we can compute is residual
§ Can we train computer to convert R to E? § Residual is an indicator for the error we care about
23
E = x − x
R = Ax − b
How does reduced order model perspec8ve help us get at error? § We want to know to error § All we can compute is residual
§ Can we train computer to convert R to E? § Residual is an indicator for the error we care about
24
E = x − x
R = Ax − b
What might be an indicator for error in a network simulator?
25
Router Router Two flows competing for bandwidth
Even if we don’t exactly model flit-level flow control, we still know that we did something wrong!
Here we have no idea that an error was made!
Link 1
Link 0
Is this an intrusive or non-‐intrusive UQ?
§ Something must be logged for every packet event – intrusive § Maintaining log of every event is too much data – need to reduce
data into other metrics
26
Router Router
Is this an intrusive or non-‐intrusive UQ?
§ Results pending….
27
Router Router
Addressing MODSIM ques8ons
§ Major contribu8on: § Math, workflow, AND toolkit § Demonstra8ng need for both intrusive and non-‐intrusive solu8ons
§ Gaps: § Need bridge between mathema8cians and programmers § Need to define both interchange formats for non-‐intrusive toolkits and APIs
for intrusive libraries
§ Bigger picture? Collabora8on? § Simulators/modelers looking to bracket errors
AND § UQ researchers with methods developed in other domains like PDEs
§ How to leverage results? § Tutorials, not research papers § Know thine audience
UQ Toolkit: sandia.gov/uqtoolkit C++ API for code integra8on Python workflow integra8on 28