Measuring Forecaster Performance


Page 1: Measuring Forecaster Performance

NATIONAL DEFENSE INTELLIGENCE COLLEGE

Lt Col James E. Kajdasz, Ph.D., USAF

Page 2: Scholarship of Intelligence Analysis

• “A comprehensive review of the literature indicates that while much has been written, largely there has not been a progression of thinking relative to the core aspect and competencies of doing intelligence analysis.” (Mangio & Wilkinson, 2008)

• “Do [they] teach structured methods because they are the best way to do analysis, or do they teach structured methods because that’s what they can teach?” (Marrin, 2009)

Page 3: Grade forecasters on % correct?

• We could grade forecaster accuracy the way we grade a true/false test (yes/no answers):
  – Will Qadhafi still be in Libya at this time next year? No
  – Will the government of Yemen fall in the next year? No
  – Will I still be driving my 2001 Corolla in the year 2020? Yes
• Wait until outcomes occur or fail to occur, then calculate the percent of correct forecasts.
• Compare Forecaster A to Forecaster B by seeing who has the higher % correct.
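To make the mechanics concrete, here is a minimal Python sketch (my illustration, not part of the briefing; the outcomes are made up purely for the example) of scoring yes/no forecasts by percent correct:

```python
# Each pair is (forecast, actual outcome) for a yes/no question.
# Outcomes here are hypothetical, only to illustrate the scoring.
forecasts_and_outcomes = [
    ("No", "No"),    # Will Qadhafi still be in Libya next year?
    ("No", "Yes"),   # Will the government of Yemen fall in the next year?
    ("Yes", "Yes"),  # Will I still be driving my 2001 Corolla in 2020?
]

correct = sum(forecast == actual for forecast, actual in forecasts_and_outcomes)
print(f"{100 * correct / len(forecasts_and_outcomes):.0f}% correct")  # 67% correct
```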

Page 4: What about probabilistic judgments?

• When there is a high level of uncertainty, laypeople and even experts often qualify their judgments:
  – Will Qadhafi still be in Libya at this time next year? No (70% confidence)
  – Will the government of Yemen fall in the next year? No (60% confidence)
  – Will I still be driving my 2001 Corolla in the year 2020? Yes (95% confidence)

Page 5: What about probabilistic judgments?

[Figure: a subjective probability scale from 0 to 1.0 in steps of .1, with verbal anchors ranging from "Impossible" through "Highly unlikely," "Somewhat unlikely," "As likely as other two possibilities combined," "Somewhat likely," and "Highly likely" to "Certainty" (Tetlock, 2005).]

Page 6: Let's compare analysts…

• So which analyst performed best?

• It's hard to say… We need a single summary statistic that captures total performance.

Probability assigned by:

Event  Occurred?  Analyst 1  Analyst 2  Analyst 3
1      No (0)     0          0          0.1
2      Yes (1)    0.9        0.7        0.7
3      No (0)     0.1        0.3        0
4      Yes (1)    0.7        0.5        0.5
5      Yes (1)    0.9        1          1

Page 7: Mean Probability Score

• Probability Score (PS), also known as the Brier Score
  – Estimate: the probability provided by the forecaster (.00 – 1.00)
  – Outcome: 0 (if the event did not occur) or 1 (if the event did occur)

PS = (Estimate − Outcome)²

Page 8: Mean Probability Score

• Probability Score or Brier Score
  – Forecaster says there is a 70% probability X will occur.
  – X occurs.

PS = (Estimate − Outcome)²
PS = (.70 − 1)² = (−.3)² = .09
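A minimal Python sketch (my illustration, not code from the briefing) of this calculation:

```python
def probability_score(estimate: float, outcome: int) -> float:
    """Brier probability score for a single forecast: (estimate - outcome)**2.

    estimate -- the forecaster's probability that the event occurs (0.0 to 1.0)
    outcome  -- 1 if the event occurred, 0 if it did not
    """
    return (estimate - outcome) ** 2

# The slide's example: a 70% forecast, and the event occurs.
print(probability_score(0.70, 1))  # (0.70 - 1)**2 = 0.09 (up to floating-point noise)
```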

Page 9: Mean Probability Score

• Mean Probability Score (PS̄), or Mean Brier Score: the average of the probability scores across a set of forecasts.

PS₁ = (.70 − 1)² = (−.3)² = .09
PS₂ = (.50 − 0)² = (.5)² = .25
PS₃ = (.10 − 0)² = (.10)² = .01

PS̄ = (.09 + .25 + .01) / 3 ≈ .12

Page 10: Let's compare analysts…

Probability assigned by:

Event  Occurred?  Analyst 1  Analyst 2  Analyst 3
1      No (0)     0          0          0.1
2      Yes (1)    0.9        0.7        0.7
3      No (0)     0.1        0.3        0
4      Yes (1)    0.7        0.5        0.5
5      Yes (1)    0.9        1          1

PS̄                0.02       0.09       0.07
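A short Python sketch (mine, not the briefing's) reproduces the mean probability scores in the table:

```python
# Outcomes (1 = event occurred, 0 = it did not) and each analyst's
# probabilities, taken from the table above.
outcomes = [0, 1, 0, 1, 1]
forecasts = {
    "Analyst 1": [0.0, 0.9, 0.1, 0.7, 0.9],
    "Analyst 2": [0.0, 0.7, 0.3, 0.5, 1.0],
    "Analyst 3": [0.1, 0.7, 0.0, 0.5, 1.0],
}

for analyst, probs in forecasts.items():
    mean_ps = sum((f - d) ** 2 for f, d in zip(probs, outcomes)) / len(outcomes)
    print(f"{analyst}: mean PS = {mean_ps:.2f}")
# Analyst 1: 0.02, Analyst 2: 0.09, Analyst 3: 0.07 -- lower is better.
```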

Page 11: Components of Total Forecaster Error

• Several things contribute to overall error, not all of which can be controlled by the forecaster.

Total Forecasting Error (PS̄) decomposes into:
  – Calibration errors
  – Discrimination errors
  – Variance of the outcome

Page 12: Decomposing Mean Probability Score

PS̄ = Var(d) + Bias² + [Var(d) · Slope](Slope − 2) + Scatter

The components (Bias, Slope, Scatter, Var(d)) are defined on the next three slides; a worked sketch follows the Scatter slide.

Page 13: Decomposing PS: Bias

Bias = f̄ − d̄

Where:
  f̄ = mean estimate
  d̄ = mean outcome

(Arkes, Dawson, Speroff, et al., 1995)

[Figure: calibration plot of Estimated Probability of Survival (f) against Outcome Index (d).]

Page 14: Decomposing PS: Slope

Slope = f̄₁ − f̄₀

Where:
  f̄₁ = mean estimate when the outcome was 1
  f̄₀ = mean estimate when the outcome was 0

(Arkes, Dawson, Speroff, et al., 1995)

[Figure: Estimated Probability of Survival (f) against Outcome Index (d).]

Page 15: Decomposing PS: Scatter

Scatter = [n₁ · Var(f₁) + n₀ · Var(f₀)] / N

Where:
  Var(f₁) = variance of the estimates when the outcome was 1
  Var(f₀) = variance of the estimates when the outcome was 0
  n₁, n₀ = number of cases in each outcome group, N = n₁ + n₀

(Arkes, Dawson, Speroff, et al., 1995)

[Figure: Estimated Probability of Survival (f) against Outcome Index (d).]
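With Bias, Slope, and Scatter defined, a minimal Python sketch (my illustration, following the definitions above; not code from the briefing) computes the full decomposition and confirms that the components reconstruct the mean probability score:

```python
from statistics import mean, pvariance

def decompose_brier(forecasts, outcomes):
    """Covariance decomposition of the mean probability score:
    PS-bar = Var(d) + Bias**2 + (Var(d) * Slope) * (Slope - 2) + Scatter
    """
    n = len(outcomes)
    f1 = [f for f, d in zip(forecasts, outcomes) if d == 1]  # estimates when the event occurred
    f0 = [f for f, d in zip(forecasts, outcomes) if d == 0]  # estimates when it did not

    bias = mean(forecasts) - mean(outcomes)        # f-bar minus d-bar
    slope = mean(f1) - mean(f0)                    # discrimination
    scatter = (len(f1) * pvariance(f1) + len(f0) * pvariance(f0)) / n  # weighted spread
    var_d = mean(outcomes) * (1 - mean(outcomes))  # variance of a binary outcome

    mean_ps = mean((f - d) ** 2 for f, d in zip(forecasts, outcomes))
    reconstructed = var_d + bias ** 2 + (var_d * slope) * (slope - 2) + scatter
    return bias, slope, scatter, var_d, mean_ps, reconstructed

# Analyst 2 from the earlier comparison table:
bias, slope, scatter, var_d, mean_ps, recon = decompose_brier(
    [0.0, 0.7, 0.3, 0.5, 1.0], [0, 1, 0, 1, 1]
)
print(f"Bias={bias:.2f}  Slope={slope:.2f}  Scatter={scatter:.3f}  Var(d)={var_d:.2f}")
print(f"Mean PS = {mean_ps:.3f}, reconstructed = {recon:.3f}")  # both 0.086
```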

Page 16: Patients vs. Doctors

Patients:  PS = .23   Bias = 0.13    Slope = .13   Scatter = .05
Doctors:   PS = .18   Bias = −0.11   Slope = .26   Scatter = .05

(Arkes, Dawson, Speroff, et al., 1995)

[Figure: side-by-side calibration plots of Estimated Probability of Survival (f) against Outcome Index (d), one panel for patients and one for doctors.]

Page 17: Prediction Markets

Page 18: A-priori Hypotheses

• H1: Discrimination will improve as the event nears.
  – The slope measure will increase over time.
• H2: Scatter will decrease as the event nears.
  – The scatter measure will get smaller over time.
• H3: Analysts will be biased toward predicting the status quo.
  – The bias measure will be negative.

Page 19: T-70 Days

Page 20: T-60 Days

Page 21: T-50 Days

Page 22: T-40 Days

Page 23: T-30 Days

Page 24: T-20 Days

Page 25: T-10 Days

Page 26: Total Error over Time

• PS is a measure of overall error.
• A low PS is better.
• The graph suggests a curvilinear relationship with time.

Page 27: Components of Error

• PS is composed of Bias, Slope, Scatter, and the Variance of the outcome.
• The graph suggests the decrease in error is primarily due to improvement in slope.
• Slope is a measure of discrimination.
• A high slope is better.

Page 28: Modeling Slope Over Time

• The observed slope was modeled over time.
• The curvilinear relationship was modeled with Days and Days² terms.
• Adjusted R² = .834, p = .01
• H1 supported: discrimination improves as the event date approaches.

[Figure: slope plotted over time with the fitted curve; y-axis shows Slope from −.2 to .6.]

Page 29: Scatter Over Time

• Scatter is a measure of the 'spread' of the probability estimates.
• A slight linear trend was not significant.
• H2 not supported.

Page 30: Bias Over Time

• Questions were recoded so that probability '0' represented a continuation of the status quo and probability '1' represented a change in the status quo.
• Analysts were biased toward predicting a change in the status quo.
  – Indicated by positive bias numbers.
  – t(6) = 4.73, p < .01
• H3 not supported, BUT the results were significant in the direction opposite that hypothesized.
• The linear trend over time was not statistically significant.

Page 31: Measuring Forecaster Performance

Lt Col James E. Kajdasz, Ph.D., [email protected]

The views expressed in this presentation are those of the author and do not reflect the official policy or position of the Department of Defense or the U.S. Government.