CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to...

31
1 CS626 Data Analysis and Simulation Today: Stochastic Input Modeling Reference: Law/Kelton, Simulation Modeling and Analysis, Ch 6. NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/ Instructor: Peter Kemper R 104A, phone 221-3462, email:[email protected] Office hours: Monday,Wednesday 2-4 pm

Transcript of CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to...

Page 1: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

1

CS626 Data Analysis and Simulation

Today:Stochastic Input Modeling

Reference: Law/Kelton, Simulation Modeling and Analysis, Ch 6.NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/

Instructor: Peter Kemper R 104A, phone 221-3462, email:[email protected] hours: Monday,Wednesday 2-4 pm

Page 2: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

What is input modeling?

Input modeling Deriving a representation of the uncertainty or randomness in a

stochastic simulation. Common representations

Measurement data Distributions derived from measurement data <-- focus of “Input modeling”

usually requires that samples are i.i.d and corresponding random variables in the simulation model are i.i.d

i.i.d. = independent and identically distributed theoretical distributions empirical distribution

Time-dependent stochastic process Other stochastic processes

Examples include time to failure for a machining process; demand per unit time for inventory of a product; number of defective items in a shipment of goods; times between arrivals of calls to a call center. 2

Page 3: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Overview of fitting with data

Check if key assumptions hold (i.i.d) Select one or more candidate distributions based on physical characteristics of the process and graphical examination of the data.

Fit the distribution to the data determine values for its unknown parameters.

Check the fit to the data via statistical tests and via graphical analysis.

If the distribution does not fit, select another candidate and repeat the process, or use an empirical distribution.

3from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 4: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Check the fit to the data Graphical analysis Plot fitted distribution and data in a way that differences can be

recognized beyond obvious cases, there is a grey area of subjective acceptance/rejection

Challenges How much difference is significant enough to trash a fitted distribution? Which graphical representation is easy to judge?

Options: Histogram-based plots Probability plots: P-P plot, Q-Q plot

Statistical tests define a measure X for the difference between fitted distribution & data X is an RV, so if we find an argument what distribution X has, we get a

statistical test to see if in a concrete case a value of X is significant Goodness-of-fit tests:

Chi-square test(χ2), Kolmogorov-Smirnov test(K-S), Anderson Darling test(AD)

4

Page 5: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Sample test characteristic for Chi-Square test (all parameters known)

5

One-sidedRight side: - critical region- region of rejectionLeft side:- region of acceptance where we fail to reject hypothesisP-value of x: 1-F(x)

Page 6: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Graphic Analysis vs Goodness-of-fit tests Graphic analysis includes: Histogram with fitted distribution Probability plots: P-P plot, Q-Q plot.

Goodness-of-fit tests represent lack of fit by a summary statistic, while plots show where

the lack of fit occurs and whether it is important. may accept the fit, but the plots may suggest the opposite,

especially when the number of observations is small.

6

!"

#$%&'()*+,%-./(/

+*0%1%*/21*34*56*37/2$8%1(3,/*(/*72-(2820*13*72*4$39*%*,3$9%-*0(/1$(7:1(3,;*<'2*43--3=(,>*%$2*1'2*!?8%-:2/*4$39*)'(?/@:%$2*12/1*%,0*A?B*12/1C

D'(?/@:%$2*12/1C*6;EFFA?B*12/1C*G6;EH

I'%1*(/*.3:$*)3,)-:/(3,J

Page 7: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Density Histogram

compares sample histogram (mind the bin sizes) with fitted distribution

7

Page 8: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Frequency Histogram

compares histogram from data with histogram according to fitted distribution

8

Page 9: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Differences in distributions are easier to see along a straight line:

9

Page 10: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Graphical comparisons

10

Graphical comparisons

Frequency ComparisonsFeatures:• Graphical comparison of a histogram of the data with the density function of the fitted distribution.

• Sensitive to how we group the data.

Probability PlotsFeatures:• Graphical comparison of an estimate of the true distribution function of the data with the distribution function of the fit.

•Q-Q (P-P) plot amplifies differences between the tails (middle) of the model and sample distribution functions.

• Use every graphical tool in the software to examine the fit.

• If histogram-based tool, then play with the widths of the cells.

• Q-Q plot is very highly recommended!

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 11: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

P-P plots and Q-Q plots

11

Q-Q plot

vs

for q1,...,qn

P-P plot

vs

for p1,...,pn

This intuitive definitionneeds an adjustment to handle ties (multiple samples of same value)

Page 12: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Q-Q Plot

Recall that one way to generate data from cdf F is via

The Q-Q plot displays the sorted data

12

Q-Q plot

Recall that one way to generate data from cdf F is via

The Q-Q plot displays the sorted data

vs.

)(1 RFY

nYYY 21

njn

jF ,2,1,

2/11

Q-Q plot

Recall that one way to generate data from cdf F is via

The Q-Q plot displays the sorted data

vs.

)(1 RFY

nYYY 21

njn

jF ,2,1,

2/11

Q-Q plot

Recall that one way to generate data from cdf F is via

The Q-Q plot displays the sorted data

vs.

)(1 RFY

nYYY 21

njn

jF ,2,1,

2/11

vs

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 13: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Q-Q Plot Intuition

13

!"

#$# %&'()*+(,-(-'+

. /0)1230)2)4256&0)!"#$!%#&#!' 2+7)80)9-()2)7-4(:-;,(-'+)( (12()80)1'60)-4)<''7=

. *9)80)+'8)<0+0:2(0)2):2+7'5)4256&0)'9)4->0)' 9:'5)($-()41',&7)&''?)2;',()&-?0)!"#$!%#&#!')

. @10)#$# 6&'()<0+0:2(04)2)*+,-+./ :2+7'5)4256&0)9':)A'562:-4'+=

Page 14: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Features of the Q-Q plot

It does not depend on how the data are grouped.

It is much better than a density-histogram when the number of data points is small.

Deviations from a straight line show where the distribution does not match.

A straight line implies that the family of distributions is correct. A 45o line implies that parameters fit as well.

14from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 15: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

LogLogistic(-113.32, 156.71, 16.107)

0

20

40

60

80

100

120

0 20 40 60 80 100 120

Input quantile

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

@RISK Student Version

For Academic Use Only

Exponential(44.468) Shift=-0.58

0

20

40

60

80

100

120

Fitte

d qu

antil

e

0 20 40 60 80 100 120

Input quantile

Pretty good fit, but missesa bit on the right tail.

Poor fit, misses badly in both tails.

Features of the Q-Q Plot

A straight line implies the family of distributions is correct; a 45-degree line implies correct parameters.

15from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 16: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Examples of Q-Q plot

16

!"

#$%&'()*+,-+./. '(,0

12

13

42

43

!2

12 13 42 43 !2

56)+-700789+7*+':)00;+9,,<=

3

12

13

42

43

!2

!3

"2

"3

32

3 12 13 42 43 !2 !3 "2 "3 32

56)+<7*0:7>?07,8+-%&7(;+*))&*+@A+>?0+06)+'%:%&)0):*+%:)+8,0

Page 17: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Example of Q-Q plot

17

!"

#$%&'()*+,*-.- '(+/

0++1*,2/3*4255*6%7(8*29*/:)*(),/*/%2(

;*7%/%*5)/*+,*!<*+65)1=%/2+95*25*6)(2)=)7*/+*6)*,1+&*%*9+1&%(*725/126>/2+9?*@:)*,+((+A29B*%1)*/:)*!.=%(>)5*,1+&*C:2.5->%1)*/)5/*%97*D.E*/)5/F

G:2.5->%1)*/)5/F*<?HIID.E*/)5/F*J<?H"

H<

H"

K<

K"

!<

!"

H< H" K< K" !< !"

Page 18: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

P-P plot vs Q-Q plot: Sensitive to different kinds of deviations

18

Page 19: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Should we just use the best fit?

Software tools exercise a set of distributions optimize parameter settings for data and distribution evaluate statistical tests suggest a “best fit”

Some concerns about the fully automated solution: Tests represent lack of fit by a single summary statistic, while plots

show where the lack of fit occurs and whether it is important. Be sure to try different numbers of histogram cells; it affects the p-

value of the χ2 test, and your perception of the fit. Be cautious with ranking fits by Chi-Sq, K-S and A-D statistics and

always check the Q-Q plot. If there is a strong physical basis for a particular distribution choice,

then use it even if it is not the best fit.

Don’t be afraid to use your brain in addition to software!19from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 20: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Overview of fitting with data

Select one or more candidate distributions, based on physical characteristics of the process and graphical examination of the data.Fit the distribution to the data (determine values for its unknown parameters).Check the fit to the data via tests and graphical analysis.If the distribution does not fit, then select another candidate and repeat the process.

What if no distribution provides a good fit?

20from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 21: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

What if no distribution provides a good fit?

Use the data itself when… No standard distribution fits well. We have no justification for a standard distribution. There is too little data to distinguish between standard distributions.

Reuse the data via empirical distribution An example:

21

Empirical distribution --- An example

0

1/2

0 2.1 3.4 5.7 8.1 10input data

prob

abili

ty m

ass f

unct

ion

1/4

3/4

Equally likely to be re-sampled

Objective Fit an input model to data 2.1, 5.7, 3.4, 8.1 via empirical distribution function.

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 22: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Empirical distribution

Each data point is equally likely to be resampled. If you are concerned that only the values you saw can appear again, then fill in gaps by linearly interpolating between the sorted data points:

22

Empirical distribution

Each data point is equally likely to be resampled.

If you are concerned that only the values you saw can appear again, then fill in gaps by linearly interpolating between the sorted data points:

Interpolated Empirical cdf

0

0.33

0.67

1

0

0.2

0.4

0.6

0.8

1

0 2 4 6 8 10

X

cum

ulat

ive

prob

abili

ty

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 23: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Empirical distribution

Formal definition (Law/Kelton, p 326) Let X1≤ X2...≤ Xn be the sorted sequence of observationsF(x) = 0 if x < X1

F(x) = (i-1)/(n-1) + (x-Xi)/[(n-1)(Xi+1-Xi)] if Xi≤x≤Xi+1

for i=1,2,...,n-1F(x) = 1 Xn ≤ x

Ok, but cannot yield values less than X1 or more than Xn

Also, mean F(x) does not match sample mean.If data is grouped, different approach necessary.

Law/Kelton describes such an extension with interpolation.Real challenge are skewed distributions (mostly right) with likely too few samples from tail due to small tail probabilities.

Consider appending artificial tail with the help of an exponential distribution

23

Page 24: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

What if we have no data at all?

We have to use anything we can find... Engineering data, standards and ratings can provide central values. Expert opinion. Physical or conventional limitations can provide bounds. Physical basis of the process can suggest appropriate distribution

families.

We model the expert opinion using either breakpoints method, or mean and variability method.

24from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 25: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Breakpoints method

Useful for modeling quantities with a large number of possible outcomes such as quarterly sales volume. Example Sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and is most likely to be 3500 units.

25

Breakpoints method

Useful for modeling quantities with a large number of possible outcomes such as quarterly sales volume.Example Sales of XYZ-123 will be no less than 1000 units, no more than 5000 units, and is most likely to be 3500 units.

Triangular Distribution

X <= 17075%

X <= 445295%

0.0002

0.0003

0.0004

0.0005

0.0006

500

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

dens

ity f

unct

ion

sales0.0001

0.00001000 1500 2000 2500 3000 3500 4000 4500 5000 5500

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 26: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Breakpoints method

Example Sales of XYZ-123 will be between 1000 and 5000 with

25% chance of being at most $2000, 75% chance of being at most $3500, 99% chance of being at most $4500.

Use only as many breakpoints as you can confidently get. Try to get breakpoints near the extremes if possible. Might be easier for experts to give the chance of exceeding a value.

26

Breakpoints methodExample Sales of XYZ-123 will be between 1000 and 5000 with 25% chance of being £ 2000, 75% chance of being £ 3500, and 99% chance of being £ 4500.

– Use only as many breakpoints as you can confidently get.– Try to get breakpoints near the extremes if possible.– Might be easier for experts to give the chance of exceeding a value.

0.000000.000050.000100.00015

0.000200.000250.00030

0.00035

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

X <= 12005.0%

X <= 433395.0%

Cumulative Distribution

dens

ity f

unct

ion

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 27: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Mean and variability method

Example The sales of XYZ-123 was 10,000. This year we expect a 15%

increase, with a typical swing of 5% above or below that value. However, we won’t sell less than 7000 units, or more than 16,000 under any conditions.

27

Mean and variability methodExample The sales of XYZ-123 was 10,000. This year we expect a 15% increase, with a typical swing of 5% above or below that value. However, we won’t sell less than 7000 units, or more than 16,000 under any conditions.

X <= 1244695.0%

X <= 105545.0%

0.0000

0.0001

0.0002

0.0003

0.0004

0.0005

0.0006

0.0007

0.0008

7000 8000 9000 10000 11000 12000 13000 14000 15000 16000

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

@RISK Student VersionFor Academic Use Only

dens

ity f

unct

ion

sales

Normal Distribution

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 28: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Checking the input model

Sensitivity analysis (varying the parameters of the input model) is especially important when the model is not based on data. While looking for marked changes in the output results, pay special attention to the standard deviation, bounds, or limits. Concentrate sensitivity analysis on those inputs to which the outputs are most sensitive.

28from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 29: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

What if the process is dependent?

Usually we assume that all generated random observations across a simulation are independent. Sometimes this is not true: A difficult part requires long processing in adjacent operations of a

production system. This is positive correlation.

Ignoring such relations can invalidate model.

29from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 30: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Does Dependence Really Matter?

30

Average Waiting Time

0

2

4

6

8

10

@RISK Student Version

Average Number Waiting

0

3

6

9

12

15

IndependentArrivals

Server A Exit

Inventory in front of Server A

Correlation0.9

Dependent Arrivals

Server B Exit

Inventory in front of Server B

Does Dependence Really Matter?YES, IT DOES!

Average Waiting Time

0

2

4

6

8

10

@RISK Student Version

Average Number Waiting

0

3

6

9

12

15

IndependentArrivals

Server A Exit

Inventory in front of Server A

Correlation0.9

Dependent Arrivals

Server B Exit

Inventory in front of Server B

Does Dependence Really Matter?YES, IT DOES!

from WSC 2010 Tutorial by Biller and Gunes, CMU, slides used with permission

Page 31: CS626 Data Analysis and Simulation - William & Marykemper/cs626/slides/v7.pdf · Check the fit to the data via statistical tests and ... Check the fit to the data Graphical analysis

Conclusion

Use input models to represent uncertainty in simulation The particular input model chosen matters! Selection of the an input model is not an exact science no right answer, but the issues to consider are

theoretical vs. empirical data physical basis of the distribution assessment of the goodness of a fit independence of samples

Assess the sensitivity of simulation output results to input models chosen Use expert opinion whenever you can Do not automatically trust a completely automated derivation of an input model.

31