Supplemental Materials
“Slow Down and Remember to Remember!
A Delay Theory of Prospective Memory Costs”
by A. H. Heathcote et al., 2015, Psychological Review
http://dx.doi.org/10.1037/a0038952.supp
Section A: Supplementary Analyses of Experiment 1
Figure A1 shows that the fit of the top LBA model is not markedly better than that of the
AIC-selected model shown in Figure 3. We used a mixed ANOVA to examine effects on the
top LBA model parameters, including block order as a between-subjects factor. Results for all
analyses except on the threshold were consistent with the AIC-selected model analyses, with
the exception that for Ter the longer non-decision time in PM (0.172s) than control (0.162s)
blocks just achieved significance, F(1,45) = 4.88, p = .032. The average A estimate was 0.21
and the average sv estimate for the true accumulator was 0.6, which was significantly lower than
the fixed value of one for the false accumulator, F(1,45) = 155, p < .001. However, in
contrast to both the model selection results and the analysis of the AIC-selected model
parameters, no effects on B involving PM were significant.
Delay Theory of Prospective Memory Costs
Figure A1. Top LBA model fits. Data are plotted accompanied by within-subject error bars calculated using Morey’s (2008) bias-corrected method.
Figure A2. Mean rate estimates for the top LBA model averaged over participants accompanied by within-subject error bars calculated using Morey’s (2008) bias-corrected method.
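The within-subject error bars in these figure captions follow Morey's (2008) bias-corrected version of Cousineau's normalization method. A minimal sketch (the participants-by-conditions array layout is an assumption, not the authors' code):

```python
import numpy as np

def morey_within_se(cell_means):
    # cell_means: participants x conditions array of per-cell means
    n, m = cell_means.shape
    # Cousineau normalization: remove each participant's overall mean,
    # then add back the grand mean
    normed = cell_means - cell_means.mean(axis=1, keepdims=True) + cell_means.mean()
    # Morey's (2008) bias correction: inflate the variance by m / (m - 1)
    corrected_var = normed.var(axis=0, ddof=1) * m / (m - 1)
    return np.sqrt(corrected_var / n)  # standard error per condition
```

Because between-subject offsets are removed, participants who differ only in overall speed contribute nothing to the bars; when every participant shows an identical condition effect the bars shrink to zero.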
Figure A2 shows average estimates of the top LBA model v parameter. Consistent
with accurate responding, the mean rate for the true (2.35) accumulator was much greater than
that for the false (0.48) accumulator, F(1,45) = 290, p < .001. This difference interacted strongly
with stimulus type, F(2,45) = 63.8, ε = .77, p < .001, being largest for HF (2.8), smallest for LF
(0.7) and intermediate for NW (2), in accord with the ordering of accuracy by stimulus type.
There was a significant main effect of PM on mean rate, F(1,45) = 10.7, p < .01, with a
higher average rate in control (1.46) than PM (1.37) blocks. There was also a significant
interaction between block type and accumulator correspondence, F(1,45) = 16.2, p < .001, due
to a greater true vs. false accumulator difference in PM (2) than control (1.7). As shown in
Figure A2, these effects were largely due to a decrease in false accumulator rates in PM
blocks for all stimulus types.
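How these parameters jointly produce responses can be illustrated with a toy simulation of a single LBA race, using round values near the estimates above (A = 0.21, true-accumulator sv = 0.6, false-accumulator sv fixed at one); the threshold b and Ter values here are illustrative assumptions, and the guard against negative rate draws is a simplification of the full model:

```python
import numpy as np

rng = np.random.default_rng(1)

def lba_trial(v_true=2.35, v_false=0.48, A=0.21, b=1.0,
              sv=(0.6, 1.0), ter=0.17):
    # Each accumulator starts at a uniform point in [0, A], draws a rate
    # from a normal distribution, and races to threshold b; the winner
    # determines the response, and Ter adds non-decision time.
    rates = rng.normal([v_true, v_false], sv)
    rates = np.maximum(rates, 1e-6)   # crude guard against negative draws
    starts = rng.uniform(0.0, A, size=2)
    times = (b - starts) / rates
    winner = int(np.argmin(times))
    return ter + times[winner], winner == 0   # RT, correct?

results = [lba_trial() for _ in range(2000)]
accuracy = float(np.mean([correct for _, correct in results]))
```

With the true-accumulator rate far above the false one, accuracy is high; lowering v_false further, as estimated for PM blocks, widens the true vs. false gap.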
Section B: Reanalysis of Horn et al. (2011)
In Smith’s (2003) experiment reanalysed by Horn et al. (2011) there were 126 word
and 126 nonword stimuli, with a total of 504 trials, as each stimulus was tested twice. We
analysed the same ongoing-task data as Horn et al. (2011), removing trials with responses
faster than 0.3s or slower than 3s (1.27%). We first checked for differences in accuracy and
mean RT for correct responses as a function of stimulus type and repetition (i.e., stimulus
specific practice), as well as the PM effect that was the focus of Horn et al.’s (2011) analysis.
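The censoring step described above can be sketched as a simple filter (the trial-record layout is hypothetical):

```python
def censor_rts(trials, lo=0.3, hi=3.0):
    # Keep only trials whose RT falls inside the [0.3 s, 3 s] window
    # used in the text; report the proportion removed.
    kept = [t for t in trials if lo <= t["rt"] <= hi]
    return kept, 1.0 - len(kept) / len(trials)

# Hypothetical example trials
trials = [{"rt": r} for r in (0.25, 0.45, 1.10, 2.90, 3.40)]
kept, removed = censor_rts(trials)
```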
For correct mean RT, all main effects and all two-way interactions between these three
factors were significant (ps < .001). As reported by Horn et al., the No-PM condition was
0.178s faster than the PM condition. The other factors also had strong and significant effects.
Repetition speeded responding by 0.109s, and nonwords were faster than words by 0.015s.
Stimulus type interacted strongly with repetition, as nonwords were actually 0.017s slower
than words on the first presentation but 0.047s faster than words on the second presentation.
Repetition also interacted with the PM effect, which was 0.207s for the first presentation but
reduced to 0.147s for the second presentation.
Error rates were near floor, but even so word responses (2.1%) were significantly
more accurate than nonword responses (2.7%), F(1,93)=7.79, p < .01, with no other
significant effects. There was a small but significant bias towards nonwords, t(94) = 2.11, p =
.04, with 49% word responses (relative to 49.7% word stimuli after removing fast and slow
RT trials in the same way as Horn et al., 2011), which was affected by repetition, F(1,93) =
5.97, p < .05. The bias was most prevalent for the second presentation (48.7%), and for the
first presentation (49.3%) it did not differ significantly from chance, F < 1. Error responses
were significantly slower than correct responses in the PM condition for nonwords, t(107) =
3.21, p < .01, but not in the No-PM condition, nor for words in the PM and No-PM
conditions. Given their significant effects we allowed for response bias, stimulus type and
repetition effects in our model-based reanalysis.
We also checked whether there were effects associated with practice at the task (as
opposed to stimulus-specific practice associated with repeated stimulus presentations) using a
factor that divided the first and second stimulus repetition blocks in half. For correct RT there
was a significant main effect of half, F(1,93) = 62.5, p < .001, which interacted with
repetition, F(1,93) = 17.1, p < .001, due to a decrease by 0.09s over halves for the initial
presentation and a smaller decrease of 0.04s over halves for the second presentation. There
was also a significant two-way interaction between half and stimulus type, F(1,93) = 34.6, p
< .001, and a three-way interaction also including repetition, F(1,93) = 22.6, p < .001, due to
a particularly large decrease during the first repetition for nonwords (0.193s) but similar
decreases for words for both first and second presentations (0.039s and 0.027s respectively)
and for nonwords on the second presentation (0.033s). One method of addressing this potentially
problematic practice effect might be to analyse only second repetition data where the block-
half effect was weaker. Although the simple effect of half for the second repetition remained
significant for both correct mean RT, F(1,93) = 15.1, p < .001, and errors, F(1,93) = 10.2, p <
.01, at least half did not participate in any significant interactions. Hence, we decided to also
examine fits of the models to the second repetition data as a further check.
Model Analysis. In our analysis of the full data set the top RD model allowed v and sv
to vary with stimulus type (S: word vs. nonword) and repetition (r, 1st vs. 2nd) and z and sz to
vary with repetition. Because fits were to individual participants and the block type (PM)
factor was between-subjects it was not part of the model specification. The top LBA model
allowed B to vary with accumulator (R, word vs. nonword) and repetition (r) and v to vary
with these factors and accumulator correspondence (C, true vs. false), and sv to vary with C
for the corresponding (true) accumulator. For the fits to the 2nd repetition data the same top
models were specified with the repetition effects removed. Tables B1 and B2 show model selection
results. For the full data set the top RD model provided a better fit than the top LBA model and
had the best AIC, but the LBA model had the best BIC. For fits to only the
2nd repetition data the LBA model won by all criteria.
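The selection criteria reported in Tables B1 and B2 follow the standard definitions based on deviance (D) and free-parameter count (p). As a generic sketch (the sample size n for BIC is whatever number of data points the fit was based on; the tabled values are summed over individually fitted participants, so they will not reproduce from a single call):

```python
import math

def aic(deviance, p):
    # AIC = D + 2p
    return deviance + 2 * p

def bic(deviance, p, n):
    # BIC = D + p * ln(n)
    return deviance + p * math.log(n)
```

Lower values are better on both criteria; BIC's ln(n) term penalizes extra parameters more heavily than AIC once n exceeds about 7, which is why BIC selects sparser models in the tables.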
The fast-dm method of Voss and Voss (2007) used by Horn et al. (2011) and Boywitt
and Rummel (2012) to fit their data, like the method of maximum likelihood we used to
check our LBA fits, does not involve reducing the data to a set of relatively coarse percentiles
(i.e., 10th, 30th, 50th, 70th and 90th), as is standard practice for the RD model (Ratcliff &
Tuerlinckx, 2002). Van Ravenzwaaij and Oberauer's (2009) finding that fast-dm does not recover
parameters as well as QMPE is likely due to its use of the Kolmogorov-Smirnov (KS) method,
which minimizes the maximum deviation between data and model. By focusing only on the
maximum deviation, the KS method does not use all of the information available in the data, but the
same is also true for QMPE with a coarse set of quantiles (Heathcote et al., 2002). In order to
check whether our results differed because we used the standard set of five quantiles we also
performed QMPE fits at a much finer grain using the 19 semi-deciles (i.e., the 5th, 10th, … 95th
percentiles). Our results were essentially equivalent with both methods for all data sets
considered in this paper.
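The difference between coarse and fine quantile sets can be made concrete with a sketch of the QMPE objective: bin the observed RTs at the chosen quantiles and score a model by the multinomial log-likelihood of the bin counts under its predicted CDF. The shifted-exponential model CDF below is purely illustrative, not an RT model used in the paper:

```python
import numpy as np

def qmpe_loglik(rts, model_cdf, probs):
    # QMPE objective: log-likelihood of observed inter-quantile bin
    # counts under the model's predicted cumulative distribution.
    rts = np.sort(np.asarray(rts))
    n = len(rts)
    cuts = np.quantile(rts, probs)
    idx = np.searchsorted(rts, cuts, side="right")
    counts = np.diff(np.concatenate(([0], idx, [n])))   # observed bin counts
    cdf = np.concatenate(([0.0], model_cdf(cuts), [1.0]))
    bin_p = np.clip(np.diff(cdf), 1e-10, None)          # model bin masses
    return float(np.sum(counts * np.log(bin_p)))

standard = np.array([0.1, 0.3, 0.5, 0.7, 0.9])   # the usual 5 quantiles
semi_deciles = np.linspace(0.05, 0.95, 19)       # the finer 19-quantile set

rng = np.random.default_rng(0)
rts = rng.exponential(scale=0.5, size=2000) + 0.3    # toy shifted-exponential RTs
true_cdf = lambda t: 1.0 - np.exp(-np.maximum(t - 0.3, 0.0) / 0.5)
ll = qmpe_loglik(rts, true_cdf, semi_deciles)
```

More quantiles means more bins, so the objective is sensitive to finer-grained distributional shape.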
Table B1. RD models for Smith (2003).

Data        Model      v     sv    z   sz   p    D      AIC    BIC
All         Top Model  S, r  S, r  r   r    15   9587   12437  22804
            AIC Model  S     S     -   r    14   9764   12424  22100
            BIC Model  r     r     -   r    10   11088  12908  19819
2nd Repeat  Top Model  S     S     -   -    9    5006   6716   12359
            AIC Model  S     S     -   -    9    5006   6716   12359
            BIC Model  r     r     -   -    7    5514   6844   11233

Table B2. LBA models for Smith (2003).

Data        Model      B     v        sv       p    D      AIC    BIC
All         Top Model  R, r  S, r, C  C(true)  15   9599   12449  22815
            AIC Model  R, r  S, C     C(true)  11   10354  12444  20046
            BIC Model  R, r  C        C(true)  9    10959  12669  18889
2nd Repeat  Top Model  R     S, C     C(true)  9    4583   6293   11936
            AIC Model  R     S, C     C(true)  9    4583   6293   11936
            BIC Model  R     C        C(true)  7    5039   6369   10758

We report the semi-decile based results for the two re-analyses. For the Horn et al.
(2011) re-analysis the correct RT panels in Figure B1 display the semi-decile results (only a
subset is shown for graphical clarity). Figure B1 plots the fit
of both top models to the 2nd repetition data and Figure B2 plots fits to the full data set. Both
the LBA and RD models capture all of the fine-grained trends in correct RT quite accurately,
but the LBA does a better job in capturing the error-related effects. Note that as errors are
relatively rare in Smith’s (2003) data (less than 3% in all conditions) this misfit only has a
small impact on overall goodness-of-fit as measured by the deviance.
Figure B1. Top RD model fits (left two columns) and top LBA model fits (right two columns) to the 2nd repetition portion of Smith’s (2003) data.
The RD fits to the 2nd repetition data confirmed, but also extended, Horn et al.’s
(2011) conclusions. The threshold (a) was higher for PM (0.25) than No-PM (0.21)
participants, F(1,93) = 5.04, p < .05. Similarly, the mean rate (v) was significantly lower for
PM (0.32) than No-PM (0.38) participants, F(1,93) = 6.24, p < .05. However, we also found
that response bias (z/a) varied between PM (0.48) and No-PM (0.54) participants, F(1,93) =
6.35, p < .05. The slowing for words is clearly evident in the PM condition correct RT panels
of Figure B1, and is not due to mean rate, which was almost identical for word (0.35) and
nonword (0.34) stimuli, F < 1. All other effects also failed to achieve significance.
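How the RD parameters just discussed (a, v, z/a, Ter) generate behaviour can be sketched with a naive Euler simulation of a single diffusion trial, using round values near the estimates above and the conventional within-trial noise s = 0.1; this is an illustrative sketch, not the fitting code:

```python
import numpy as np

rng = np.random.default_rng(7)

def rd_trial(v=0.35, a=0.23, z_rel=0.5, ter=0.45, s=0.1, dt=1e-3):
    # Evidence starts at z = z_rel * a and drifts at rate v; hitting the
    # upper boundary a gives one response (e.g. word), hitting 0 the other.
    x, t = z_rel * a, 0.0
    while 0.0 < x < a:
        x += v * dt + s * np.sqrt(dt) * rng.normal()
        t += dt
    return ter + t, x >= a   # RT and whether the upper boundary won

trials = [rd_trial() for _ in range(200)]
p_upper = float(np.mean([upper for _, upper in trials]))
```

Raising a slows responding but improves accuracy, while shifting z_rel below 0.5, as estimated for the PM group, biases responses toward the lower boundary and slows upper-boundary responses.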
Figure B2. Top RD model fits (left two columns) and top LBA model fits (right two columns) to Smith’s (2003) data.
It is instructive to examine the results of fits to the full data set to investigate the
consequences of aggregating over the strong practice effects present over the course of the 1st
repetition. A rather different picture emerged. Although mean rate was still less for PM (0.42)
than control (0.52) participants, F(1,93) = 7.45, p < .01, and response bias was less for PM
(0.47) than control (0.54) participants, F(1,93) = 8.4, p < .01, PM had no effect on the
threshold, F<1. Significant non-decision time effects also emerged, including both a longer
mean (Ter) for PM (0.49s) than control (0.44s), F(1,93) = 8.1, p < .01, and more variability
(st) for PM (0.24s) than No-PM (0.16s), F(1,93) = 6.48, p < .05. Strong effects of repetition
were also found on start-point variability (sz), F(1,93) = 15.9, p < .001, mean rate, F(1,93) =
40, p < .001, and rate variability (sv), F(1,93) = 33.6, p < .001, as well as significant effects
of stimulus type on mean rate, F(1,93) = 5.4, p < .05, and rate variability, F(1,93) = 4.07, p
< .05.
Further caution is warranted for the LBA parameter analysis. For the 2nd presentation
data no effects on LBA parameters were significant. There was a trend for a higher threshold
in PM (0.97) than No-PM (0.71) participants, but it did not approach significance, F(1,93) =
1.69, p = 0.2. There was also a trend for the mean rate to be higher for PM (0.46) than No-
PM (0.31), rather than lower as might be expected, but again the difference did not approach
significance, F < 1. The likely reason for the lack of significant effects, given the LBA model
provides a very good fit to the 2nd presentation data, relates to a combination of factors that
can lead to unconstrained (and so highly variable) individual participant parameter estimates,
including the smaller sample size attendant to examining only the 2nd repetition data and a
low error rate. Low error rates are particularly problematic for LBA models that allow
separate mean rates for true and false accumulators, as the rate for the false accumulator is
largely determined by error RT data. The same issue does not apply to the RD model because
no parameter is mainly dependent on error RTs.
Consistent with a small sample size being problematic for obtaining precise individual
participant estimates, many significant and close to significant effects emerged in the LBA
parameters for the top model fit to the full data set. In mean rates there were significant
interactions of PM with stimulus type, F(1,93) = 7.31, p < .01 and with stimulus type and
accumulator correspondence (C), F(1,93) = 5.76, p < .05. There was also a marginal
interaction of PM with repetition in response caution (B), F(1,93) = 3.38, p = .069, and in sv
there was a PM main effect, F(1,93) = 3.45, p = .067, and an interaction between PM and C,
F(1,93) = 3.45, p = .067. We do not provide any further detail of these likely spurious effects.
Section C: Reanalysis of Boywitt and Rummel (2012)
For their first experiment Boywitt and Rummel (2012) found that only the RD
threshold (a) parameter differed significantly between high and low expectancy conditions,
with a lower value in the latter condition. They interpreted this finding as validating a
prediction made by the RD model that expectancy manipulations selectively influence the a
parameter (Voss et al., 2004). In the second experiment a and non-decision time (Ter) were
significantly greater and mean rate (v) significantly less in the demanding group compared to
the other two groups, which did not differ. Boywitt and Rummel interpreted the effects on a
an v similarly to Horn et al. (2011), and the Ter effect as indicative of slowed encoding of
color due to an increased engagement in capacity-consuming strategic monitoring for the PM
cue.
An initial inspection indicated that the re-analysis of Boywitt and Rummel’s (2012)
data would be difficult because of a combination of a relatively small sample size of less than
170 trials per participant and substantial and extended practice effects. To quantify the latter
we divided trials into quarters. In experiment one correct mean RT decreased over the
quarters (1.46s, 1.4s, 1.34s, 1.22s), F(3,174) = 3.76, ε = .69, p < .05, with the decrease
accelerating between the final two quarters and remaining significant there, F(1,58) = 4.84, p < .05.
Given that the latter result indicates our strategy of using the last half of the data would not be
effective, and that the manipulation in experiment one did not include a control group to test PM
cost, our further analyses focus on experiment two.
Similar practice effects on mean RT over quarters occurred in experiment two
(1.241s, 1.18s, 1.115s, 1.061s), F(3,246) = 15.6, ε = .92, p < .001. The practice (quarters)
effect interacted with the PM manipulation, F(6,246) = 6.98, ε = .95, p < .001. However, at
least in the latter half the practice effect was no longer significant, F < 1, and the interaction
with PM was only marginal, F(2,82) = 2.5, p = .09. Error rates were marginally affected by
quarters (10.7%, 10.2%, 10.8%, 11.7%), F(3,246) = 2.61, ε = .95, p = .052, and there was a
marginal interaction with PM, F(6,246) = 1.83, ε = .95, p = .098, but both dropped out, F < 1,
over the last half.
Although the analysis of the last half of the data presents problems in terms of
obtaining precise parameter estimates we decided to proceed with a model analysis of the last
half data set as well as the full data set. Before doing so we note that although stimulus type
did not have a significant effect on mean correct RT, F < 1, it did have a very marked effect
on accuracy, with matching responses (16.2% errors) much less accurate than mismatching
responses (5.5% errors), F(1,82) = 20.96, p < .001, for the full data set, and similarly for the last
half (17.3% vs. 5.2% errors), F(1,82) = 20.94, p < .001. There was also a strong response bias towards
mismatching responses, t(84) = 6.99, p < .001, with the overall probability of a matching
response only 0.447. Hence, we allowed for both stimulus type (S) and response bias in our
model analysis.
Model Analysis. As shown in Tables C1 and C2 we specified the same top models
(for both the full data set and the last half) for Boywitt and Rummel's (2012) experiment two
as we did for the 2nd repeat analyses of Horn et al.'s (2011) data, except that the stimulus type
factor (S) refers to matching vs. mismatching stimuli. The LBA model was preferred in all
cases in terms of goodness-of-fit and both AIC and BIC model selection. However Figure C1
shows that both models did an equally good job of capturing the qualitative trends in the full
data set. As shown in Figure C2 the same is true for the 2nd half only. In all cases the top
model was selected by AIC, and the BIC model fit significantly worse (RD: all, χ²(170) =
782, p < .001; 2nd half, χ²(170) = 545, p < .001; LBA: all, χ²(255) = 782, p < .001; 2nd half,
χ²(340) = 1272, p < .001). Hence we focus on the top model in our parameter analyses.
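These χ² statistics are deviance differences between nested models, with degrees of freedom equal to the per-participant parameter difference multiplied by the number of participants (85 here, consistent with the t(84) test reported earlier). A sketch:

```python
def lr_stat(dev_restricted, dev_full, p_restricted, p_full, n_subjects):
    # Likelihood-ratio statistic for nested models fit to each participant:
    # the summed-deviance difference, asymptotically chi-square distributed
    # with df = per-participant parameter difference * number of participants.
    stat = dev_restricted - dev_full
    df = (p_full - p_restricted) * n_subjects
    return stat, df

# Reproduces the RD full-data-set comparison from Table C1
stat, df = lr_stat(4999, 4217, 7, 9, 85)
```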
Unfortunately using only the second half of the data reduced power to such a degree
that few effects on parameters were significant. In the RD analysis there was a higher value
of sv for match (0.12) than mismatch (0.03), F(1,82) = 7.97, p < .01. In the LBA analysis
there was a main effect of stimulus type, with a higher value for match (2.11) than mismatch
(1.46), F(1,82) = 4.69, p < .05, which interacted with accumulator correspondence, due to a
smaller difference between true and false for match (12.11) than mismatch (4.25), F(1,82) =
11.3, p < .001. Both effects correspond to the largest effect observed in these data, greater
mismatch than match accuracy. Hence, we focus on effects in fits to the full data set,
commenting on corresponding trends in the 2nd half analysis where appropriate.
Table C1. RD models for Boywitt and Rummel (2012) experiment two.

Data      Model      v    sv   p    D     AIC    BIC
All       Top Model  S    S    9    4217  5747   10831
          AIC Model  S    S    9    4217  5747   10831
          BIC Model  -    -    7    4999  6189   10143
2nd Half  Top Model  S    S    9    3126  4656   9650
          AIC Model  S    S    9    3126  4656   9650
          BIC Model  -    -    7    3671  4861   8746

Table C2. LBA models for Boywitt and Rummel (2012) experiment two.

Data      Model      B    v     sv       p    D     AIC    BIC
All       Top Model  R    S, C  C(true)  9    3879  5409   10493
          AIC Model  R    S, C  C(true)  9    3879  5409   10493
          BIC Model  R    C     C(true)  6    4843  5863   9252
2nd Half  Top Model  R    S, C  C(true)  9    2887  4417   9412
          AIC Model  R    S, C  C(true)  9    2887  4417   9412
          BIC Model  R    C     C(true)  5    4159  5509   7784

Figure C1. Top RD model fits (left two columns) and top LBA model fits (right two columns) to Boywitt and Rummel's (2012) Experiment 2 full data set.
Figure C2. Top RD model fits (left two columns) and top LBA model fits (right two columns) to the second half of Boywitt and Rummel’s (2012) Experiment 2 data set.
In the RD analysis the threshold was higher for the demanding condition (0.23) than
the remaining PM conditions (0.19), F(2,82) = 6.1, p < .01, whereas in the 2nd half analysis
demanding (0.21) and non-demanding (0.22) thresholds were similar. Ter was longer by 0.1s
in the demanding PM condition than the other conditions, F(2,82) = 3.65, p < .05, and the
same was true in the 2nd half analysis, by 0.07s. Similarly, st was greater by 0.15s in the
demanding PM condition than the other conditions, F(2,82) = 3.17, p < .05, but no
appreciable trend was apparent in the 2nd half analysis. There was a significant interaction
between stimulus type and PM in mean rate, F(2,82) = 4.0, p < .05, with the average of the
PM conditions 0.55 less than the control condition for match, whereas only the demanding
PM condition had a lower rate, by 0.35 on average, for mismatch. For the 2nd half analysis no
interaction was evident, but there was a trend for demanding (0.18) to be less than non-
demanding (0.245) and control (0.22), F(2,82) = 2.76, p = 0.15. For sv the same strong effect
of stimulus type as in the 2nd half analysis was present, F(1,82) = 30.5, p < .001.
In the LBA analysis of the full data set no effects were significant except on mean
rate, where the main effect of accumulator correspondence was significant, F(1,82) = 41.6, p
< .001. The interaction of this effect with stimulus type was marginal, F(1,82) = 3.76, p = .056,
following the same pattern as in the 2nd half analysis: a smaller true vs. false difference for
match (1.75) than mismatch (3.49).
Section D: Top Model Fits for Lourenço et al. (2013)
Figure D1. Top RD model fits to Lourenço et al. (2013) specific and non-specific conditions. Data are plotted accompanied by within-subject error bars calculated using Morey’s (2008) bias-corrected method.
Figure D2. Top LBA model fits to Lourenço et al. (2013) specific and non-specific conditions. Data are plotted accompanied by within-subject error bars calculated using Morey’s (2008) bias-corrected method.