Evaluation of 2012 PSU Stream 1.5 Candidate 21 May 2012 · 2012. 9. 5. · Evaluation of 2012 PSU...



Evaluation of 2012 PSU Stream 1.5 Candidate

21 May 2012

TCMT Stream 1.5 Analysis Team: Louisa Nance, Mrinal Biswas, Barbara Brown, Tressa Fowler, Paul Kucera, Kathryn Newman, and Christopher Williams

Data Manager: Kathryn Newman

Inventory

The Pennsylvania State University (PSU) team delivered 72 retrospective PSU ARW (APSU) forecasts for 18 storms in the Atlantic Basin for the 2008-2011 hurricane seasons. When generating the interpolated or early model versions, both the CARQ record and storm information from the NHC Best Track must be available for each case. This requirement was not satisfied for 5 cases. Hence, early model versions are only available for 67 of the delivered cases. In addition, the storm was not classified as tropical or subtropical at the initial time of the early model version for 2 cases. Given the NHC verification package requires a tropical or subtropical classification at the initial time for a case to be verified, the total sample used in this analysis consisted of 65 cases. The evaluation of APSU focused on three primary analyses: (1) a direct comparison between APSU and each of last year’s top-flight models, (2) an assessment of how APSU performed relative to last year’s top-flight models as a group, and (3) an evaluation of APSU’s impact on operational consensus forecasts. Given all aspects of the evaluation are based on homogeneous samples for each type of analysis, the number of cases may vary depending on the availability of the specific operational baseline.
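The homogeneous-sample requirement described above can be illustrated with a minimal Python sketch. The case key (storm, initialization time, lead hour), the function name, and the error values are hypothetical and chosen only for illustration; the actual matching is performed by the NHC verification package.

```python
from typing import Dict, List, Tuple

# Hypothetical case key: (storm label, initialization time, lead hour).
CaseKey = Tuple[str, str, int]


def homogeneous_sample(
    candidate: Dict[CaseKey, float],
    baseline: Dict[CaseKey, float],
) -> List[Tuple[CaseKey, float, float]]:
    """Keep only the cases for which BOTH models have a verified error."""
    shared = sorted(candidate.keys() & baseline.keys())
    return [(key, candidate[key], baseline[key]) for key in shared]


# Made-up track errors (nm); these are not values from the report.
candidate_errors = {
    ("STORM_A", "init_00", 24): 35.0,
    ("STORM_A", "init_00", 48): 80.0,
    ("STORM_A", "init_12", 24): 41.0,
}
baseline_errors = {
    ("STORM_A", "init_00", 24): 44.0,
    ("STORM_A", "init_12", 24): 39.0,
}

pairs = homogeneous_sample(candidate_errors, baseline_errors)
print(len(pairs))  # number of cases shared by both models
```

Because the shared set depends on which baseline is paired with the candidate, the sample size can differ from one comparison to the next.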

Table 1 contains descriptions of the configurations used in the evaluation that are associated with APSU forecasts, as well as their corresponding ATCF ids. Definitions of the operational baselines and their corresponding ATCF ids can be found in the “2012 Stream 1.5 Methodology” write-up. Table 2 contains a summary of the baselines used to evaluate APSU. Note that only early versions of all model guidance were considered in this analysis. Cases were aggregated over ‘land and water’ for track metrics; ‘land and water’ as well as ‘water only’ for intensity metrics. Except when noted, results are for aggregation of cases over both land and water.

Top-flight Models

Track

The APSI track errors for the Atlantic Basin appear to be on average very similar to those of the global top-flight models at least through day 2 (see Fig. 1), whereas APSI appears to have smaller track errors than those for GHMI. Note that the lack of confidence intervals on the track errors for the longer lead times stems from the fact that the effective sample size for these lead times is too small to meaningfully estimate variability and confidence. Pair-wise differences between the global top-flight models and APSI do not produce any statistically significant (SS) differences (Table 3). On the other hand, the comparison between APSI and GHMI led to SS differences favoring APSI for four lead times out to day three. Percent improvements of APSI over GHMI for these SS lead times range from 17 to 23%. Along-track errors (not shown) indicate the top-flight models as well as APSI have a tendency to have a slow bias for this sample out to day 4. Cross-track errors (not shown) show APSI and the global top-flight models have no distinct bias out to day 3, whereas GHMI has a slight left bias.
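As a rough illustration of how a pair-wise difference and its percent improvement relate, the following Python sketch computes both for a matched sample. It is a simplification under stated assumptions: the confidence interval uses a plain normal approximation and ignores the serial correlation (effective sample size) that the evaluation accounts for, the sign convention (baseline minus candidate, so positive favors the candidate) is inferred from the text, and the function name and error values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_summary(baseline_err, candidate_err, level=0.95):
    """Mean paired difference, percent improvement, CI, and a significance flag.

    Assumption: cases are treated as independent, which the actual
    methodology does not assume.
    """
    diffs = [b - c for b, c in zip(baseline_err, candidate_err)]
    d_mean = mean(diffs)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    pct_improvement = 100.0 * d_mean / mean(baseline_err)
    ci = (d_mean - half_width, d_mean + half_width)
    # The difference is flagged only if the interval excludes zero.
    significant = ci[0] > 0 or ci[1] < 0
    return d_mean, pct_improvement, ci, significant


# Made-up track errors (nm) for five matched cases.
baseline = [40.0, 30.0, 50.0, 35.0, 45.0]
candidate = [30.0, 25.0, 38.0, 30.0, 37.0]

d_mean, pct, ci, significant = paired_summary(baseline, candidate)
print(d_mean, pct, ci, significant)
```

With these illustrative numbers the mean difference is 8 nm against a baseline mean of 40 nm, that is, a 20% improvement.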

A comparison of APSI’s track performance to that of all three top-flight models (see Fig. 2) does not reveal any strong evidence toward APSI improving upon or degrading the top-flight model guidance for track. Although it appears that APSI is more likely to produce the lowest track errors (rank 1) between 60 and 108 h, the CIs for rank 1 include the 25% line for all but one lead time, so the percent of cases with rank 1 cannot, for the most part, be deemed statistically distinct from 25%. Note that the CIs for all other ranks also overlap the CIs for rank 1 for these lead times, which indicates rank 1 cannot be deemed statistically distinct from the other ranks. The lack of any SS results for the APSI ranking analysis may simply stem from its small sample size, especially for longer lead times.
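The ranking analysis can be sketched in Python as follows. The tie rule (the candidate receives the better rank) follows the Fig. 2 caption; the normal-approximation interval is an assumption made here for illustration, since this report does not state how the rank CIs were computed, and the function names and error values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def candidate_rank(candidate_err, other_errs):
    """Rank of the candidate among all models (1 = smallest error).

    Ties go to the candidate: only strictly smaller errors push it down.
    """
    return 1 + sum(1 for err in other_errs if err < candidate_err)


def rank_frequency(ranks, rank, level=0.95):
    """Fraction of cases with the given rank and an approximate CI.

    Assumption: a simple normal (Wald) interval for a proportion.
    """
    n = len(ranks)
    p = sum(1 for r in ranks if r == rank) / n
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))


# Made-up errors: (candidate, [three top-flight models]) for eight cases.
cases = [
    (30, [35, 40, 45]), (50, [45, 60, 40]), (20, [20, 25, 30]),
    (60, [50, 55, 40]), (33, [31, 36, 39]), (25, [28, 30, 27]),
    (45, [44, 43, 50]), (38, [38, 41, 36]),
]
ranks = [candidate_rank(cand, others) for cand, others in cases]
p_rank1, ci_rank1 = rank_frequency(ranks, rank=1)
print(ranks, p_rank1, ci_rank1)
```

With only eight cases the interval for rank 1 is wide enough to contain 25%, which mirrors why small samples make the rank results inconclusive.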

Intensity

The APSI mean absolute intensity errors for the Atlantic Basin appear to be fairly similar to those of LGEM (i.e., CIs overlap for all lead times for which CIs can be meaningfully estimated), and are definitely smaller than those of DSHP and GHMI for at least a few lead times (see Fig. 3). The pair-wise differencing analysis produced SS differences favoring APSI over GHMI and DSHP for four lead times out to 84 h (Table 4). All SS differences favoring APSI are of large practical significance (> 2 knots) and correspond to percent improvements between 19 and 39%. The comparison between the APSI and LGEM intensity errors reveals no SS differences. When the verification results are limited to over-water cases, no SS differences were found for the comparisons with GHMI and DSHP; this change in the results is mainly due to the large reduction in the effective sample sizes. The comparison with LGEM for cases over water only resulted in one SS difference corresponding to a degradation of large practical significance at 12 h. Mean intensity errors for the top-flight models and APSI indicate all four models produce SS over-prediction biases for at least some of the lead times, with the estimated biases associated with the GHMI forecasts appearing to be larger than those for the other three models (see Fig. 4).
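The distinction between the mean absolute intensity error (Fig. 3) and the mean intensity error, or bias (Fig. 4), can be made concrete with a short Python sketch. The function name and the wind values are hypothetical; the only convention assumed is error = forecast minus observed, so a positive mean error indicates over-prediction.

```python
from statistics import mean


def intensity_scores(forecast_kt, observed_kt):
    """Return (mean error, mean absolute error) in knots."""
    errors = [f - o for f, o in zip(forecast_kt, observed_kt)]
    bias = mean(errors)                    # signed: positive = over-prediction
    mae = mean(abs(e) for e in errors)     # magnitude only
    return bias, mae


# Made-up maximum wind speeds (kt) for four matched cases.
forecast = [70.0, 85.0, 60.0, 95.0]
observed = [65.0, 90.0, 50.0, 85.0]

bias, mae = intensity_scores(forecast, observed)
print(bias, mae)
```

Errors of opposite sign cancel in the bias but not in the absolute error, so a model can show a small bias and still have a large mean absolute error.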

Examination of APSI’s performance for intensity forecasts relative to the performance of the three top-flight models (see Fig. 5) leads to no meaningful conclusions due to the overlap of the confidence intervals for the rankings. This overlap is likely due to the very small sample sizes for this comparison (i.e., the method lacks power for determining SS rank differences for small sample sizes because small changes in the distribution of the rankings, possibly due to random differences in round-off, will lead to large differences between the percent of cases in each ranking). There is some indication that APSI has the smallest intensity errors (rank 1) from 36 to 96 h, but the CIs for rank 1 include the 25% line for all but three of these lead times. In addition, the CIs for ranks 2 and 3, and for some lead times rank 4, overlap those for rank 1. The position of rank 4 also appears to suggest that APSI is least likely to be worst after day 2, but once again the wide confidence bounds include the 25% line for all but one lead time and overlap the CI bounds for the other ranks, which leads to inconclusive results due to the lack of statistical significance.

Conventional Model Consensus

Track

The mean track errors for the consensus with APSI (CAPS) appear to be on average very similar to the track errors for the variable operational track consensus (TVCA), with a suggestion of slightly smaller errors associated with CAPS beyond two days (see Fig. 6). The pair-wise differencing analysis produces five SS differences between days 1 and 3.5 that all correspond to improvements in consensus track guidance of small practical significance. On the other hand, these SS differences correspond to percent improvements between 5 and 7%. The mean along- and cross-track errors for CAPS and TVCA are very similar; consensus guidance is too slow at least through day 3 and to the left of the observed location through day 2 (not shown).

Intensity

The mean absolute intensity errors (see Fig. 7) for CAPS appear to be slightly smaller than those of the fixed operational intensity consensus (ICON). The pair-wise difference analysis produces SS differences for four lead times between 36 and 84 h (Table 5). These SS differences are greater than 1 knot for three lead times, with percent improvements ranging from 7 to 11%. Limiting the sample to forecasts for which the storm is over water reduces the SS differences favoring CAPS to one lead time; however, the small effective sample sizes make any confidence assessment impossible beyond day 2. Mean intensity errors for CAPS and ICON both show a positive bias of similar magnitude (not shown).

Overall Evaluation

The comparison between APSI and the individual top-flight models indicates APSI does not provide any SS improvement over the top-flight global track guidance, but does improve upon the operational GFDL track guidance. The SS improvements over the operational GFDL track guidance easily meet the 4% criterion for selection as a Stream 1.5 model. On the other hand, the rank analysis did not provide any conclusive results, which is likely due to the very small sample sizes. A comparison with the individual top-flight models for intensity indicates APSI improves upon intensity guidance provided by the two top-flight models GHMI and DSHP, whereas the comparison with LGEM leads to a single SS degradation, but only when the sample is limited to cases over water. All SS differences favoring APSI correspond to percent improvements greater than 3%. While the ranking assessment suggests APSI has a tendency to produce forecasts with the smallest intensity errors for a number of lead times, these results could not be deemed SS. Once again, the lack of SS is likely in large part due to the small sample sizes. Hence, the results of the individual model comparisons are somewhat favorable for APSI being selected as a 2012 Stream 1.5 model for explicit track and intensity guidance in the Atlantic Basin (i.e., APSI is able to improve upon guidance from one top-flight model for track and two top-flight models for intensity), but the results should be interpreted with caution due to the rather small sample sizes.

Adding APSI to the variable consensus for track guidance provided SS improvement to the operational guidance, with the small reductions in track errors (< 10 nm) corresponding to percent improvements greater than 5% for five lead times. The addition of APSI to the fixed operational intensity consensus produced SS improvements at four lead times, with reductions in intensity errors between 0.9 and 1.4 knots. These SS differences correspond to percent improvements between 7 and 11%. Limiting the sample to cases over water reduced the number of SS improvements, which can be attributed to the reduction in effective sample sizes with this aggregation. Hence, the consensus comparisons appear to be favorable for APSI to be included in the 2012 Stream 1.5 consensus guidance for both track and intensity in the Atlantic Basin, but once again, these results should be interpreted with caution due to the small sample sizes.


Table 1: Descriptions of the various PSU-related configurations used in this evaluation and their assigned ATCF ids.

ID    Description of configuration
APSU  Late model version (non-interpolated)
APSI  Early model version (interpolated, adjustment window 18 to 30 h)
CAPS  Average of
        EMXI/GFSI/EGRI/GHMI/HWFI/GFNI/APSI (Atlantic track)
        Track: variable - at least three members must be present and APSI must be available
        DSHP/LGEM/GHMI/HWFI/APSI (intensity)
        Intensity: fixed - all members must be present
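The two CAPS membership rules in Table 1 can be sketched in Python as shown here. The member lists and the availability rules come from Table 1; the function names, the input layout, and the plain arithmetic averaging of latitude and longitude are assumptions made for illustration (a production implementation would need to handle details such as longitude wrap-around).

```python
from statistics import mean

TRACK_MEMBERS = ["EMXI", "GFSI", "EGRI", "GHMI", "HWFI", "GFNI", "APSI"]
INTENSITY_MEMBERS = ["DSHP", "LGEM", "GHMI", "HWFI", "APSI"]


def variable_track_consensus(positions):
    """Average (lat, lon) over available members.

    Variable rule: at least three members must be present and APSI
    must be one of them; otherwise no consensus is produced (None).
    """
    available = [positions[m] for m in TRACK_MEMBERS
                 if positions.get(m) is not None]
    if positions.get("APSI") is None or len(available) < 3:
        return None
    return (mean(p[0] for p in available), mean(p[1] for p in available))


def fixed_intensity_consensus(intensities):
    """Average intensity (kt); fixed rule: every member must be present."""
    if any(intensities.get(m) is None for m in INTENSITY_MEMBERS):
        return None
    return mean(intensities[m] for m in INTENSITY_MEMBERS)


# Made-up forecasts for a single lead time.
track = variable_track_consensus(
    {"EMXI": (25.0, -70.0), "GFSI": (25.4, -70.6), "APSI": (25.2, -70.2)}
)
no_apsi = variable_track_consensus(
    {"EMXI": (25.0, -70.0), "GFSI": (25.4, -70.6), "GHMI": (25.1, -70.3)}
)
full = fixed_intensity_consensus(
    {"DSHP": 60.0, "LGEM": 65.0, "GHMI": 70.0, "HWFI": 75.0, "APSI": 80.0}
)
partial = fixed_intensity_consensus(
    {"DSHP": 60.0, "LGEM": 65.0, "GHMI": 70.0, "APSI": 80.0}
)
print(track, no_apsi, full, partial)
```

The stricter fixed rule is one reason the intensity consensus sample can be smaller than the track consensus sample.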

Table 2: Summary of baselines used for evaluation of APSI for the specified metrics.

        Variables Verified / Aggregation
ID      Track            Intensity        Intensity
        land and water   land and water   water only
EMXI    ●
GFSI    ●
GHMI    ●                ●                ●
LGEM                     ●                ●
DSHP                     ●                ●
ICON                     ●                ●
TVCA    ●
HOMG    ●                ●                ●


Table 3: Inventory of statistically significant (SS) pair-wise differences for track stemming from the comparison of each individual top-flight model and the Stream 1.5 candidate. See 2012 Stream 1.5 methodology write-up for description of entries.

Atlantic Basin
Forecast Hour       0     12     24     36     48     60     72     84     96    108    120

GFSI              0.0   -0.3    0.9   -0.4    4.8   17.9   38.8   18.2   49.7   24.8   47.9
Track              0%    -1%     2%    -1%     6%    16%    25%    12%    31%    15%    26%
Land/Water          -  0.132  0.303  0.069  0.504  0.853  0.904  0.562      -      -      -

EMXI              0.0    0.3    1.0    1.6   -2.8    1.6    7.4   11.7   27.7   19.9   52.3
Track              0%     1%     3%     3%    -4%     2%     6%     9%    20%    13%    32%
Land/Water          -  0.113  0.175  0.206  0.272  0.098  0.508  0.527      -      -      -

GHMI              0.0    6.8    7.4   12.2   11.3   25.6   33.4   22.1   49.9   64.5   90.0
Track              0%    23%    17%    20%    14%    21%    22%    14%    31%    32%    39%
Land/Water          -  0.998  0.961  0.984  0.866  0.935  0.979  0.802  0.904      -      -


Table 4: Inventory of statistically significant (SS) pair-wise differences for intensity stemming from the comparison of each individual top-flight model and the Stream 1.5 candidate. See 2012 Stream 1.5 methodology write-up for description of entries.

Atlantic Basin
Forecast Hour       0     12     24     36     48     60     72     84     96    108    120

GHMI              0.0   -0.9    2.8    6.0    6.2    4.2    5.5    5.3    8.0    4.5    2.8
Intensity          0%    -8%    19%    34%    39%    28%    32%    32%    43%    25%    15%
Land/Water          -  0.735  0.964  0.998  0.999  0.899  0.914  0.976      -      -      -

GHMI              0.0   -0.6    1.5    4.8    2.8    2.8    2.1    0.2    4.1    6.6    7.3
Intensity          0%    -5%     9%    27%    18%    16%    12%     1%    27%    32%    28%
Water Only          -  0.448  0.645  0.934      -      -      -      -      -      -      -

LGEM              0.0   -1.4   -0.3    1.1    3.1    1.2    2.2    0.6    2.1   -4.3   -6.0
Intensity          0%   -13%    -2%     9%    24%    10%    16%     5%    16%   -45%   -62%
Land/Water          -  0.833  0.158  0.456  0.908  0.542  0.466  0.112      -      -      -

LGEM              0.0   -2.3   -2.4   -1.3   -1.5   -1.3   -3.2   -3.9    2.1   -3.6   -2.7
Intensity          0%   -24%   -20%   -11%   -14%   -11%   -29%   -33%    16%   -35%   -17%
Water Only          -  0.974  0.881  0.478  0.336      -      -      -      -      -      -

DSHP              0.0   -0.9    0.4    3.0    5.3    3.3    4.6    5.4    7.4    2.2   -1.5
Intensity          0%    -8%     3%    20%    35%    23%    28%    32%    40%    14%   -11%
Land/Water          -  0.679  0.240  0.898  0.997  0.954  0.972  0.959      -      -      -

DSHP              0.0   -1.8   -0.9    2.2    3.9    2.9    2.4    4.9   12.0    4.6    2.7
Intensity          0%   -18%    -7%    15%    25%    18%    14%    24%    52%    25%    12%
Water Only          -  0.948  0.475  0.632  0.750      -      -      -      -      -      -


Table 5: Inventory of statistically significant (SS) pair-wise differences for the comparison of the two consensus forecasts. See 2012 Stream 1.5 methodology write-up for description of entries.

Atlantic Basin
Forecast Hour       0     12     24     36     48     60     72     84     96    108    120

TVCA              0.0    0.5    1.6    2.4    3.1    8.2    7.7    7.7   15.8   19.4   32.3
Track              0%     2%     5%     5%     5%     7%     6%     5%    10%    11%    15%
Land/Water          -  0.899  0.990  0.996  0.908  0.999  0.997  0.986      -      -      -

ICON              0.0   -0.2    0.5    1.4    1.4    0.9    1.4    1.3    1.9    1.3    2.3
Intensity          0%    -2%     4%    10%    11%     7%     9%     8%    10%     8%    14%
Land/Water          -  0.679  0.899  0.999  0.999  0.970  0.946  0.984      -      -      -

ICON              0.0   -0.2    0.3    1.1      1    0.5    0.6    0.3    1.6    0.2      0
Intensity          0%    -2%     2%     7%     7%     4%     4%     2%     8%     1%     0%
Water Only          -  0.677  0.676  0.964  0.941      -      -      -      -      -      -


Figure 1: Mean track errors and 95% confidence intervals with respect to lead time for EMXI and APSI (top left panel), GFSI and APSI (top right panel) and GHMI and APSI (bottom left panel) for the Atlantic Basin.


Figure 2: Rankings with 95% confidence intervals (dotted lines) for APSI compared to the three top-flight models for track guidance with respect to lead time. The grey horizontal solid line highlights the 25% frequency for reference. Black numbers indicate the frequencies of the first and fourth rankings where the candidate model is assigned the better (lower) ranking for all ties.


Figure 3: Mean absolute intensity errors and 95% confidence intervals with respect to lead time for LGEM and APSI (top left panel), DSHP and APSI (top right panel) and GHMI and APSI (bottom left panel) for the Atlantic Basin.


Figure 4: Mean intensity errors and 95% confidence intervals with respect to lead time for LGEM and APSI (top left panel), DSHP and APSI (top right panel) and GHMI and APSI (bottom left panel) for the Atlantic Basin.


Figure 5: Rankings with 95% confidence intervals (dotted lines) for APSI compared to the three top-flight models for intensity guidance with respect to lead time. The grey horizontal solid line highlights the 25% frequency for reference. Black numbers indicate the frequencies of the first and fourth rankings where the candidate model is assigned the better (lower) ranking for all ties.


Figure 6: Mean track errors and 95% confidence intervals with respect to lead time for variable operational consensus for track (TVCA) and CAPS for the Atlantic Basin.

Figure 7: Mean absolute intensity errors and 95% confidence intervals with respect to lead time for fixed operational consensus for intensity (ICON) and CAPS for the Atlantic Basin.