JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...

28
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP A Universal Methodology and Toolkit for Quan8fying Simula8on Error via both Bayesian Inference and Model Reduc8on Strategies Jeremiah Wilke , Khachik Sargsyan, Mar8n Drohmann Sandia Na8onal Labs, Livermore, CA

Transcript of JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf ·...

Page 1: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. SAND NO. 2011-XXXXP

A  Universal  Methodology  and  Toolkit  for  Quan8fying  Simula8on  Error  via  both  Bayesian  Inference  and  Model  Reduc8on  Strategies  

Jeremiah  Wilke,  Khachik  Sargsyan,  Mar8n  Drohmann  Sandia  Na8onal  Labs,  Livermore,  CA  

Page 2: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Uncertainty  quan8fica8on  is  cri8cal  to  useful  simula8on,  regardless  of  detail  or  fidelity  

2  

Crude(

guess(

Rough(

idea(

Cause(and(

effect(

Very(good(

es5mates(

Exact(

hardware(model(

100(

101(

102(103(104(105(106(107(

Sim

ula5on(scope/parallelism

(

Simula5on(fidelity(

Cons5tu5ve(

Models(

Coarse1Grain(

Simula5on(

Cycle1Accurate(

Simula5on( Emula5on(

Hardware(

Design(

So<ware(

Support(

Applica5on(

Evalua5on(

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node) 0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Single data point isn’t enough information

Need a mean value and error bound

Page 3: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Cri8cal  ques8on  to  integra8on  of  UQ  tools:    intrusive  vs  non-­‐intrusive,  code  vs  workflow  

3  

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node) 0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Model/simulator standalone code Results + UQ

What bridges the gap from left to right?

C/C++ API or

Toolkit

Page 4: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Cri8cal  ques8on  to  integra8on  of  UQ  tools:    intrusive  vs  non-­‐intrusive,  code  vs  workflow  

4  

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node) 0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Model/simulator standalone code Results + UQ

Goal of presentation here is not universal solution to UQ methods and code integration,

but framing the problem via two use cases

C/C++ API or

Toolkit

Page 5: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Where  we  understand  UQ  beQer:    experimental  noise  and  series  expansion  

5  

F(x) = f (i) (a)i!i=0

k

∑ (x − a)i + R(x)

R(x) = f (k+1)(Z )k!

(x − Z )k (x − a)

Errors due to randomness or experimental scatter

Analytic formula derived from Taylor series

Page 6: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

The  chicken  and  the  egg  of  UQ:  Knowing  the  error  without  knowing  the  answer  

6  

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

Page 7: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

The  chicken  and  the  egg  of  UQ:  Knowing  the  error  without  knowing  the  answer  

7  

Mean value

0

2

4

6

8

10

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

?

Page 8: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Making  Bayesian  inference  a  universally    understood  concept  

8  

P

⇣{�

i

}��� {x

i

}⌘/ P

⇣{x

i

}��� {�

i

}⌘⇥ P

⇣{�

i

}⌘

1

Posterior: Given prior knowledge and new data, encapsulates best knowledge of parameters

Likelihood Function: Physically motivated likelihood

estimate of data points assuming a set of parameters

Prior: Encapsulates prior knowledge of problem

Quantity of interest, but not possible to directly estimate

NOT quantity of interest, but can be estimated directly

Infer posterior from physically motivated likelihood!

Page 9: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Making  Bayesian  inference  a  universally    understood  concept  

9  

P

⇣{�

i

}��� {x

i

}⌘/ P

⇣{x

i

}��� {�

i

}⌘⇥ P

⇣{�

i

}⌘

1

Posterior: Given prior knowledge and new data, encapsulates best knowledge of parameters

Likelihood Function: Physically motivated likelihood

estimate of data points assuming a set of parameters

Prior: Encapsulates prior knowledge of problem

P⇣{�i}

��� {xi}⌘/ P

⇣{xi}

��� {�i}⌘⇥ P

⇣{�i}

{xi} = R heads , N flips

P⇣{xi}

��� {�i}⌘=

✓N

R

◆HR

(1�H)

N�R

M�{�i}

�⇡ S

�{�i}

�=

X

k

ck k(⇠) , ⇠ 2 [�1, 1]

logP�{�i}

��{xi}�= �N

2

log(2⇡�2)�

NX

i=1

(Ti �Mi)2

2�2

1

Given physical model, {λi}, estimate likelihood of data

P⇣{�i}

��� {xi}⌘/ P

⇣{xi}

��� {�i}⌘⇥ P

⇣{�i}

{xi} = R heads , N flips

P⇣{xi}

��� {�i}⌘=

✓N

R

◆HR

(1�H)

N�R

M�{�i}

�⇡ S

�{�i}

�=

X

k

ck k(⇠) , ⇠ 2 [�1, 1]

logP�{�i}

��{xi}�= �N

2

log(2⇡�2)�

NX

i=1

(Ti �Mi)2

2�2

1

Given data, {xi}, estimate likelihood of physical model

Data points: Coin flip: heads or tails Simulated runtime

Model parameters: Coin flip: P(heads) = H Latency/bandwidth

Page 10: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  to  translate  abstract  “model  imperfec8on”  into  something  more  tangible  

§  App  1  simula8on  suggests  2.0  s/GB  §  App  2  simula8ons  suggests  1.5  s/GB    §  No  single  bandwidth  is  exact  §  “Most  likely”  parameter  is  ~1.6    

10  

Page 11: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  does  model  uncertainty  manifest  itself  in  parameter  uncertain8es?  

11  

Single BW parameter exact for all tests. “Delta” function probability distribution. Simulator is generally inaccurate OR Many parameters are equally accurate

No single parameter is exact for all tests, but small parameter range gives high accuracy

saadfadsfadsfdsfads

Page 12: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

P⇣{�i}

��� {xi}⌘/ P

⇣{xi}

��� {�i}⌘⇥ P

⇣{�i}

{xi} = R heads , N flips

P⇣{xi}

��� {�i}⌘=

✓N

R

◆HR(1�H)N�R

M�{�i}

�⇡ S

�{�i}

�=X

k

ck k(⇠) , ⇠ 2 [�1, 1]

1

Calibra8on  phase  requires  many,  many  samples  in  parameter  space  to  build  distribu8ons  

§  Coefficient  ck  fit  over  grid  in  parameter  space  

§  Computa8on  of  surrogate  is  embarrassingly  parallel  

§  Polynomial  allows  rapid  AMCMC  sampling  of  model  output  in  parameter  space  

§  Expansion  coefficients  used  in  sensi8vity  analysis  

12  

� � � � � � � � � � � � � � � �

Latency µs (L)

BW GB/s (B)

(1,1)

(2,1)

(1,2)

Simulation Model

Surrogate polynomial

Expansion coefficients

Page 13: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Calibra8on  phase  requires  many,  many  samples  in  parameter  space  to  build  distribu8ons  

13  

� � � � � � � � � � � � � � � �

Latency µs (L)

BW GB/s (B)

(1,1)

(2,1)

(1,2)

Extensive sweep of parameter space for modest problem sizes where we have “correct” answer or we have really good idea of “correct” answer

Page 14: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Extrapola8ng  uncertain8es  into  the  unknown  s8ll  requires  sampling  §  Parameter  likelihood  distribu8ons  inform  where  to  sample  §  With  few  samples,  can  build  semi-­‐quan8ta8ve  confidence  interval  §  Workflow  integra8on  to  generate/run/collect/analyze  samples  

14  

Parameter λ

P(λ) M(λ)

Parameter λ = confidence interval

Parameter likelihood Model Output

Page 15: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Four  step  UQ  workflow  for  Bayesian  Inference:  not  intrusive  to  exis8ng  codes  

§  Generate:  list  of  samples  in  parameter  space  generated  by  UQ  toolkit  

§  Run:  parameter  inputs  through  simula8on/model  §  Collect:  simula8on/model  outputs  into  standard  format  §  Analyze:  run  outputs  through  machinery  in  UQ  toolkit  to  generate  

uncertainty  distribu8ons  

15  

Page 16: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Output  from  analyze  phase:  error  distribu8ons  and  parameter  sensi8vi8es  

16/22  

0

5

10

15

20

150 200 250 300 350 400 450 500

Tim

e(m

s)

N(node)

8 KB16 KB24 KB32 KB

Hopper MeanSST/macro

SST/macro Uncertainty

(A)

0

0.2

0.4

0.6

0.8

1

8 16 24 32 8 16 24 32 8 16 24 32 8 16 24 32 8 16 24 32

Sen

sitiv

ity

Size (KB)

Link BWInj Lat

Hop LatMax BW

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

0

0.2

0.4

0.6

0.8

1

N=128 N=192 N=256 N=384 N=512

(B)

Page 17: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Reduced-­‐order  models  or  how  PDE  solvers    are  way  ahead  of  discrete  event  simula8ons  •  Rather  than  sampling  in  parameter  space,  can  a  simula8on  self-­‐diagnose  its  own  errors?  

•  How  many  basis  func8ons  do  you  need  to  accurately  describe  heat  flow  problem  with  9  different  regions?    

•  Amounts  to  linear  system  solve  Ax  =b  •  x  is  vector  of  size  N  

17  

1 2 3

4 5 6

7 8 9

�D

�N1

�N0

WARNING: SOMEWHAT

SPECULATIVE

Page 18: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Reduced-­‐order  models  or  how  PDE  solvers    are  way  ahead  of  discrete  event  simula8ons  •  Rather  than  sampling  in  parameter  space,  can  a  simula8on  self-­‐diagnose  its  own  errors?  

•  How  many  basis  func8ons  do  you  need  to  accurately  describe  heat  flow  problem  with  9  different  regions?    

•  Amounts  to  linear  system  solve  Ax  =b  •  x  is  vector  of  size  N  

18  

1 2 3

4 5 6

7 8 9

�D

�N1

�N0

Page 19: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Reduced-­‐order  models  or  how  PDE  solvers    are  way  ahead  of  discrete  event  simula8ons  §  Full  order  model  

§  N  =  10,000  §  Huge  sparse  system  §  Brute  force  solu8on  

§  Reduced  order  model  §  What  if  I  already  have  several  solu8ons    

x1,  x2,  x3…?  How  accurate  is:  

 §  Small,  dense  system  

 

19  

x = cixii∑

1 2 3

4 5 6

7 8 9

�D

�N1

�N0

Page 20: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  is  a  simula8on  a  reduced  order  model?  

20  

Router Router Two flows competing for bandwidth

Discretization into flits shows even sharing of bandwidth on link 0

Link 1 consistently utilized at half rate

Link 1

Link 0

Page 21: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  is  a  simula8on  a  reduced  order  model?  

21  

Router Router Two flows competing for bandwidth

Discretization shows uneven sharing of bandwidth on link 0

Link 1 remains unutilized for long time

Link 1

Link 0

Page 22: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  does  reduced  order  model  perspec8ve    help  us  get  at  error?  §  We  want  to  know  to  error    §  All  we  can  compute  is  residual  

22  

E = x − x

R = Ax − b

Page 23: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  does  reduced  order  model  perspec8ve    help  us  get  at  error?  §  We  want  to  know  to  error    §  All  we  can  compute  is  residual  

§  Can  we  train  computer  to  convert  R  to  E?  §  Residual  is  an  indicator  for  the  error  we  care  about    

23  

E = x − x

R = Ax − b

Page 24: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

How  does  reduced  order  model  perspec8ve    help  us  get  at  error?  §  We  want  to  know  to  error    §  All  we  can  compute  is  residual  

§  Can  we  train  computer  to  convert  R  to  E?  §  Residual  is  an  indicator  for  the  error  we  care  about    

24  

E = x − x

R = Ax − b

Page 25: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

What  might  be  an  indicator  for  error    in  a  network  simulator?  

25  

Router Router Two flows competing for bandwidth

Even if we don’t exactly model flit-level flow control, we still know that we did something wrong!

Here we have no idea that an error was made!

Link 1

Link 0

Page 26: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Is  this  an  intrusive  or  non-­‐intrusive  UQ?    

§  Something  must  be  logged  for  every  packet  event  –  intrusive  §  Maintaining  log  of  every  event  is  too  much  data  –  need  to  reduce  

data  into  other  metrics  

26  

Router Router

Page 27: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Is  this  an  intrusive  or  non-­‐intrusive  UQ?    

§  Results  pending….  

27  

Router Router

Page 28: JeremiahWilke MODSIM2014 ModsimTechniqueshpc.pnl.gov/modsim/2014/Presentations/Wilke 3A.pdf · Cri8cal"ques8on"to"integraon"of"UQ"tools:"" intrusive"vs"nonKintrusive,"code"vsworkflow

Addressing  MODSIM  ques8ons  

§  Major  contribu8on:  §  Math,  workflow,  AND  toolkit  §  Demonstra8ng  need  for  both  intrusive  and  non-­‐intrusive  solu8ons  

§  Gaps:  §  Need  bridge  between  mathema8cians  and  programmers  §  Need  to  define  both  interchange  formats  for  non-­‐intrusive  toolkits  and  APIs  

for  intrusive  libraries  

§  Bigger  picture?  Collabora8on?  §  Simulators/modelers  looking  to  bracket  errors  

AND  §  UQ  researchers  with  methods  developed  in  other  domains  like  PDEs  

§  How  to  leverage  results?  §  Tutorials,  not  research  papers  §  Know  thine  audience  

UQ  Toolkit:  sandia.gov/uqtoolkit  C++  API  for  code  integra8on  Python  workflow  integra8on   28