Testing improves true recall and ... - Purdue...

This article was downloaded by: [Washington University in St Louis] On: 27 February 2012, At: 08:04 Publisher: Psychology Press Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK Memory Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/pmem20 Testing improves true recall and protects against the build-up of proactive interference without increasing false recall Ludmila D. Nunes a b & Yana Weinstein b a Department of Psychology, University of Lisbon, Lisbon, Portugal b Department of Psychology, Washington University in St. Louis, St. Louis, MO, USA Available online: 31 Jan 2012 To cite this article: Ludmila D. Nunes & Yana Weinstein (2012): Testing improves true recall and protects against the build-up of proactive interference without increasing false recall, Memory, 20:2, 138-154 To link to this article: http://dx.doi.org/10.1080/09658211.2011.648198 PLEASE SCROLL DOWN FOR ARTICLE Full terms and conditions of use: http://www.tandfonline.com/page/terms-and-conditions This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. The publisher does not give any warranty express or implied or make any representation that the contents will be complete or accurate or up to date. The accuracy of any instructions, formulae, and drug doses should be independently verified with primary sources. The publisher shall not be liable for any loss, actions, claims, proceedings, demand, or costs or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.


Testing improves true recall and protects against the build-up of proactive interference without increasing false recall

Ludmila D. Nunes1,2 and Yana Weinstein2

1Department of Psychology, University of Lisbon, Lisbon, Portugal

2Department of Psychology, Washington University in St. Louis, St. Louis, MO, USA

Retrieval practice has been shown to protect against the negative effects of previously learned information on the learning of subsequent information, while increasing retention of new information. We report three experiments investigating the impact of retrieval practice on false recall in a multiple list paradigm. In three different experimental designs participants studied blocks of interrelated words that converged on non-presented associates. Participants were tested either after every study block or only after the fifth study block, and both groups received a cumulative test on all five study blocks. Overall, the results from all three different experimental designs point to a benefit of testing in increasing long-term veridical recall on the cumulative test. More importantly, this improvement in veridical recall did not come at a cost: False recall on the cumulative test did not increase from retrieval practice.

Keywords: Testing; False memory; False recall; Proactive interference.

Testing has recently been proposed as an effective way to promote memory and learning (for a review see Roediger & Karpicke, 2006a). It has been widely demonstrated that after studying a set of materials (e.g., word lists), one is more likely to remember these materials on a later memory test following retrieval practice, i.e., taking multiple tests (e.g., Roediger & Karpicke, 2006b), even when the control condition involves re-exposure to the studied material (Karpicke & Roediger, 2008). Szpunar, McDermott, and Roediger (2008) demonstrated another benefit of testing during study: the prevention of the build-up of proactive interference (PI). The build-up of PI occurs when a memorisation task involves successive sets of materials that share the same category or modality (Keppel & Underwood, 1962; Postman & Keppel, 1977; Wickens, Born, & Allen, 1963). That is, learning the initial sets of materials interferes with learning later material, and therefore the retention of new information is inversely related to the number of prior learning sets (Underwood, 1957). To examine the impact of testing during study on the build-up of PI, Szpunar and colleagues (2008) had participants study five lists of words in anticipation of a cumulative test, with half of the participants tested after each list and the other half tested only after the fifth list. For both interrelated and unrelated lists, participants who had been tested after studying each of lists 1–4 were better at

Address correspondence to: Ludmila D. Nunes, Faculdade de Psicologia, Alameda da Universidade, 1649-013 Lisbon, Portugal.

E-mail: [email protected]

Support for this research was provided by a James S. McDonnell Foundation 21st Century Science Initiative grant: a Bridging Brain, Mind and Behavior/Collaborative Award; and a fellowship from the Portuguese Foundation for Science and Technology to the first author. We would like to thank Kathleen McDermott, James Nairne, Jeff Karpicke, and the members of the Memory & Cognition Lab for their useful comments on this paper.

MEMORY, 2012, 20 (2), 138–154

© 2012 Psychology Press, an imprint of the Taylor & Francis Group, an Informa business
http://www.psypress.com/memory http://dx.doi.org/10.1080/09658211.2011.648198


learning the fifth list and produced fewer prior-list intrusions (i.e., words that had actually appeared in previous lists) than participants who had not been tested after each list. This effect was unrelated to re-exposure, as another group who restudied the lists achieved exactly the same accuracy as the control group who performed a distractor task after each list (Szpunar et al., 2008). This beneficial effect has since been replicated in another lab (Pastötter, Schicker, Niedernhuber, & Bäuml, 2011), and with a cued recall task (Weinstein, McDermott, & Szpunar, 2011).

FALSE MEMORY

In the present paper we apply Szpunar and colleagues’ (2008) finding to associative memory errors that are produced when participants study lists that converge on semantic associates (Deese, 1959; Roediger & McDermott, 1995). The question addressed in this paper is whether the benefits of testing for learning multiple lists of associated words might come at a cost. Theoretically, the main goal of this research is to gain information on the boundary conditions of the testing effect, using study materials that may be prone to negative effects of testing. Specifically, we investigated the potential increase in associative memory errors (i.e., false recall) that may occur in tandem with increased veridical recall of certain types of information.

In the Deese (1959)/Roediger/McDermott (Roediger & McDermott, 1995; DRM) paradigm participants study lists composed of words related to a critical non-presented word (for example, participants could study the words bed, rest, awake, tired, etc., all semantically associated to the critical non-presented word sleep). This critical word is then spontaneously produced on recall tests, sometimes as often as words that had actually been studied (Roediger & McDermott, 1995). One explanation for this false memory effect is provided by the activation-monitoring framework. According to this account, activation automatically spreads from the presented associates to the non-presented critical word (Collins & Loftus, 1975; Roediger & Gallo, 2004). False recall of the critical word thus arises from this automatic spreading of activation in association with a failure of reality monitoring (Johnson, Hashtroudi, & Lindsay, 1993; Johnson & Raye, 1981), which is needed to distinguish between internally and externally generated information.

TESTING AND FALSE MEMORY: REPEATED LISTS

Despite all the positive press surrounding testing, the effects of testing on false recall are somewhat in doubt.¹ McDermott (1996, Exp. 2) first studied the effects of multiple study/test trials on the level of false recall. The experiment consisted of five presentations of the same study list (composed of three DRM lists), with each presentation followed by a free recall test. Participants showed a significant decrease in the overall level of false recall of the critical words across trials. One day later participants performed a final free recall test. False recall on this final delayed test was higher than that on the fifth test taken on the previous day, suggesting that the benefit of testing in reducing false recall that was observed on the immediate tests diminished over time.

In a different design McDermott (2006) presented participants with three sets of six DRM lists, with each set tested either zero, one, or three times. On a final free recall test the number of previous tests was positively related to the likelihood of recalling studied words (a positive effect of testing), but also positively related to the likelihood of recalling non-presented critical words (an associated cost of testing). However, false recall was only increased by testing when comparing participants tested zero times with participants tested one or three times, and participants tested one or three times did not appear to differ significantly in the level of false recall on the final test (although this comparison was not reported). It is thus somewhat unclear from those experiments whether repeated testing increases false recall, beyond the effect of taking a single initial test.

The idea that false recall should increase in tandem with veridical recall is suggested by the spreading activation theory (Collins & Loftus, 1975; Roediger & McDermott, 1995), according

¹ Roediger and McDermott’s (1995) original paper introducing the paradigm showed that taking a free recall test before a recognition test on DRM lists increased the level of correct and false recognition, although further investigation somewhat disputed this (see Roediger, McDermott, & Robinson, 1998, for a review). For the purposes of this paper we will focus on the effect of prior recall tests on a later free recall test, and not on recognition.


to which activation spreads from studied words to non-presented associated words. Thus, if veridical memory increases due to repeated reactivation of the study list during testing, the level of activation of the critical non-presented words should also increase, leading to a higher level of false recall. According to this view more tests lead to more activation of the presented words and, consequently, more spreading activation to the critical non-presented words. The two studies reviewed above suggest that while testing certainly increases veridical memory, this benefit may come with the associated cost of increasing false recall in some situations. However, McDermott (2006) pointed out that the benefits of testing in this paradigm outweighed the costs in that the increase in veridical recall was greater than the increase in false recall.

TESTING AND FALSE MEMORY: NON-REPEATED LISTS

Our set of experiments differs from the research described above in that we are examining the effects of testing one list (or multiple lists) on the learning of another. One other article we are aware of that has examined such a situation with respect to false recall is Chan and Langley (2010), who examined the effects of retrieval practice on eyewitness memory. In their paradigm participants initially studied an eyewitness episode and were either tested or not on that information before being exposed to misinformation. The authors found that taking a test on the original information led to an enhanced susceptibility to misinformation. The authors suggested that this might have arisen due to the protective effects of testing against PI, which enhanced the learning of misinformation presented after initial testing of the original event. Thus testing may not only increase false recall of information related to tested lists (McDermott, 1996, 2006), but also promote false recall of information related to new lists encoded after initial testing.

With regard to our paradigm we already know from Szpunar et al.’s (2008) work that taking a test after each list produces a release from PI and increases veridical recall on subsequent lists as compared with a condition where no such prior test occurs. However, previous research into the effects of testing on false memory (Chan & Langley, 2010; McDermott, 1996, 2006) suggests that this improvement in veridical recall may come at a cost if lists that produce strong associations to non-presented words are studied. Testing may thus increase not only veridical recall, but also false recall via increased backward associative strength (BAS), known for being the main predictor of false memories (Roediger, Watson, McDermott, & Gallo, 2001).

THE PRESENT EXPERIMENTS

In order to examine the effects of testing on the build-up of false memories across lists, the materials in the first reported experiment were DRM lists split across multiple study lists, which we will call study blocks throughout the article to distinguish them from the DRM lists from which the items were taken. A set of five study blocks was developed, each of them composed of a different subset of the same six DRM lists. Participants took a recall test either after each study block, or only after the fifth study block; all participants then took a cumulative recall test (see Figure 1 for a schematic of the procedure and materials). The level of false recall of the six critical words on both the fifth study block test and the cumulative test was compared for the tested and untested groups.

In a second experiment, instead of DRM lists split across blocks, we presented four study blocks, each of which consisted of 12 words from the same DRM list, followed by a fifth study block composed of three additional associates from each of the four presented DRM lists (see Figure 2). We used this manipulation in an attempt to replicate Experiment 1 in a design that increased the likelihood of false recall and more closely resembled Roediger and McDermott’s (1995) procedure of presenting pure DRM lists. According to Robinson and Roediger (1997), when more associates are studied, the likelihood of false recall increases because activation of the critical word is increased via automatic spreading activation from the studied associates. For this reason, presenting more associates sequentially before each test should have increased activation of the critical word; our experimental design permitted us to look at whether taking a test increased or decreased that activation.

In a third experiment we again presented each DRM list in a blocked fashion (as in Experiment 2), but now these DRM lists were divided into three-word study blocks such that participants in the tested group received tests after each sub-list of three words and all participants were tested on the fifth and last sub-list of each DRM list (see Figure 3). This last experiment allowed us to examine the effect of testing on the build-up of false memories in a design that most closely followed the standard DRM paradigm.

Figure 1. Schematic of the procedure and materials used in Experiment 1.

Figure 2. Schematic of the procedure and materials used in Experiment 2.


EXPERIMENT 1

Experiment 1 was designed to investigate the possible effects of testing on the build-up of false memories across multiple study blocks of categorised words, where each of the six categories represented a DRM list that was split across the five study blocks. The design of this experiment is identical to the one employed by Szpunar and colleagues (2008) to study the effects of testing on the build-up of PI in all but materials. The number of words studied in each list, and the number of participants in each experimental condition, are also based on Szpunar and colleagues’ method (2008). Szpunar and colleagues’ procedure involved five study blocks and a final cumulative test for all participants. Half of their participants were tested after every study block while the other half were tested only after the fifth study block. Performance on the fifth study block test and on the cumulative test was examined, comparing participants in the tested and untested groups. The only difference between our procedure and theirs is that we used lists of associated words that converged on critical words (i.e., DRM lists) instead of semantic categories. Szpunar and colleagues found that testing earlier study blocks helped to protect against prior block intrusions on the fifth study block test. Szpunar and colleagues also found that on the cumulative test, participants that had been tested after every study block produced overall more words from all five study blocks. Of course, PI could no longer be measured on the cumulative test. In our design interference in the form of false recall of critical words was measured both on the fifth study block test and on the cumulative test.

Method

Participants. A total of 62 Washington University undergraduates participated in the experiment for course credit or payment. Half of the participants were assigned to the tested condition, and the other half to the untested condition.

Materials. Five study blocks of 18 words were constructed from the DRM lists whose critical words were chair, doctor, needle, rough, sweet, and window. These six DRM lists were chosen because they were the most efficient in eliciting false recall, according to Stadler, Roediger, and McDermott’s (1999) norms (M false recall = 56.3). Each DRM list was randomised and divided into five groups of three words. Triads from each of the six DRM lists were then selected to form one study block, and this was repeated to make five study blocks in total. Thus each study block was made up of three randomly selected words from each of six DRM lists, so that presentation of the words from each DRM list was not ordered by BAS or any other criterion (see Figure 1 for the full set of stimuli as they were presented to participants).
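The block construction described in the Materials paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_study_blocks`, the seed, and the placeholder associates are hypothetical (the real stimuli are those shown in Figure 1), and the authors' actual randomisation routine is not described in the text.

```python
import random

# Critical words named in the Materials paragraph.
CRITICAL_WORDS = ["chair", "doctor", "needle", "rough", "sweet", "window"]


def build_study_blocks(drm_lists, n_blocks=5, triad_size=3, seed=0):
    """Shuffle each DRM list, cut it into triads, and deal one triad to each block."""
    rng = random.Random(seed)  # seeded only so the illustration is reproducible
    blocks = [[] for _ in range(n_blocks)]
    for critical, associates in drm_lists.items():
        if len(associates) != n_blocks * triad_size:
            raise ValueError(f"list '{critical}' needs {n_blocks * triad_size} words")
        shuffled = list(associates)
        rng.shuffle(shuffled)  # random order, so triads do not follow BAS
        for b in range(n_blocks):
            blocks[b].extend(shuffled[b * triad_size:(b + 1) * triad_size])
    return blocks


# Placeholder associates (hypothetical labels, NOT the real stimuli).
drm_lists = {c: [f"{c}_assoc{i:02d}" for i in range(1, 16)] for c in CRITICAL_WORDS}
blocks = build_study_blocks(drm_lists)
print([len(block) for block in blocks])  # [18, 18, 18, 18, 18]
```

Each block ends up with three words from each of the six lists, which matches the 18-word blocks described above; within-block presentation order would still need to be shuffled per participant.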

Figure 3. Schematic of the procedure and materials used in Experiment 3. Note that the schematic represents only one of the six DRM list cycles.


Design. The design was between participants, so participants were either tested after study blocks one through five (tested group), or only after study block five (untested group; see Figure 1 for a schematic of the procedure in each condition). Performance in these two conditions was compared on two free recall tests: the test after the fifth study block, and the cumulative test after all five study blocks (the circled tests in Figure 1). The cumulative test was either delayed by 25 minutes or immediate, also between participants.² For the fifth study block test the levels of correct recall, false recall, and intrusions were measured. For the cumulative test the levels of correct recall and false recall were measured.

Procedure. Following the procedure used by Szpunar and colleagues (2008), participants were first informed that they would study five short blocks of words for a later test, and perform maths problems between the study blocks. Participants were additionally told that they might or might not receive a test after each individual study block, and that this would be determined randomly by the program. Participants were reminded to pay close attention because at the end of the experiment they would be tested on all five study blocks. In reality the testing schedule was determined by the program such that half of the participants received a free recall test after each study block and the other half only received a test after the fifth study block. All participants received a cumulative free recall test.

Words were presented one by one in the centre of a computer monitor at the rate of 2000 ms per word (500 ms interstimulus interval, 2500 ms stimulus onset asynchrony). Study blocks were counterbalanced using a Latin square design, resulting in five different study block orders for each of the testing groups. Words within each study block were randomised afresh for each participant. However, the assignment of words to blocks was constant between participants. After presentation of each study block, all participants solved maths problems for 1 minute. Tested participants then had 1 minute to complete a free recall test, in which they were instructed to type as many words as possible from the block they had just studied; meanwhile, untested participants performed another minute of maths problems. Following the fifth study block and 1 minute of maths problems, participants in both conditions were instructed to recall as many words as possible from that block. Participants were then given 5 minutes to recall, in any order, all the words from all of the study blocks (note that this test occurred immediately for some participants and after a 25-minute distractor task for others; see footnote 2).
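The Latin square counterbalancing of study block order mentioned in the Procedure can be illustrated with a short Python sketch. The text does not say which Latin square was used, so the cyclic square below is an assumption, as are the helper names `cyclic_latin_square` and `block_order_for`.

```python
def cyclic_latin_square(n):
    """Return an n x n Latin square; row r is the sequence 1..n rotated by r."""
    return [[(row + col) % n + 1 for col in range(n)] for row in range(n)]


def block_order_for(participant_index, square):
    """Cycle participants within one testing group through the rows of the square."""
    return square[participant_index % len(square)]


# Five study block orders, one per row (assumed cyclic construction).
orders = cyclic_latin_square(5)
for row in orders:
    print(row)
```

In a square like this every study block appears exactly once in each serial position across the five orders, which is the property that counterbalancing relies on.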

Results

Analyses of the results showed no differences between the delayed and the immediate test conditions on any measures, so this factor is not considered further. Two untested participants were excluded from the analyses because they produced more intrusions than studied words on the cumulative test. One tested participant was excluded from the analyses because they produced only one word on the fifth block test. We report correct recall and intrusion rates separated into prior block intrusions (words presented in previously studied blocks) and critical intrusions (critical words from each DRM list used). Following Szpunar et al. (2008), extra-list intrusions (any words that were neither presented on prior blocks nor considered critical words) are not reported. All data are presented in terms of proportions. The denominators from which proportions were calculated for each item type are as follows. For the fifth study block test: correct recall out of 18 (the number of words presented in the fifth study block); critical words out of 6 (the number of DRM lists used in the experiment); and prior-list intrusions out of 72 (the number of words presented in study blocks one through four). For the cumulative test: correct recall out of 90 (the total number of words presented across the five study blocks); and critical words out of 6 (the number of DRM lists used in the experiment). The alpha level was set at .05. Cohen’s d indicates effect size for t-tests.
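As an illustration of how the denominators listed above turn a recall protocol into proportions, here is a minimal Python sketch for the fifth study block test. The function name `score_fifth_block_test`, the placeholder word labels, and the lower-casing of typed responses are assumptions made for the example; the authors' scoring procedure is not described beyond the denominators.

```python
def score_fifth_block_test(recalled, fifth_block, prior_blocks, critical_words):
    """Return proportions of correct recall, critical words, and prior block intrusions."""
    produced = {word.strip().lower() for word in recalled}  # repeated outputs count once
    return {
        "correct": len(produced & set(fifth_block)) / len(fifth_block),
        "critical": len(produced & set(critical_words)) / len(critical_words),
        "prior_block": len(produced & set(prior_blocks)) / len(prior_blocks),
    }  # anything else is an extra-list intrusion and is not scored


# Hypothetical placeholder stimuli: 18 fifth-block words, 72 prior-block words.
fifth_block = [f"block5_word{i}" for i in range(18)]
prior_blocks = [f"prior_word{i}" for i in range(72)]
critical_words = ["chair", "doctor", "needle", "rough", "sweet", "window"]

# One made-up protocol: 9 correct words, 1 critical word, 2 prior block words, 1 extra-list word.
recalled = fifth_block[:9] + ["Chair", "prior_word3", "prior_word40", "zebra"]
scores = score_fifth_block_test(recalled, fifth_block, prior_blocks, critical_words)
print(scores)
```

For the cumulative test the same idea applies with correct recall scored out of 90 and critical words out of 6.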

Fifth study block test. The mean proportion of correctly recalled words and critical words produced on the fifth study block test in each

² We manipulated the delay between the fifth study block test and the cumulative test to determine whether the effects of testing on false recall differed as a function of retention interval. This manipulation was driven by previous research showing that initial testing can produce different results when the effects are assessed at different retention intervals (Roediger & Karpicke, 2006a). However, since this manipulation did not mediate the effects of testing on veridical or false recall, for the sake of brevity we do not discuss it further.


condition are presented in Table 1. Participants who had not been tested on previous study blocks recalled fewer words from study block five than participants who had been tested after every study block (see top row of Table 1); t(57) = 2.94, d = 0.77. On the other hand, the level of false recall (mean number of critical words recalled out of six possible words) was higher for participants who had not been tested on the previous blocks (see second row of Table 1); t(57) = 3.74, d = 0.96. That is, 12 of the 29 untested participants produced either one (8 participants) or two (4 participants) of the six possible critical words on the fifth study block test. Meanwhile, only 1 of the 30 tested participants produced a critical word on the fifth study block test. We also calculated the mean proportion of critical words produced across the five study block tests for tested participants (M = .07). This did not differ significantly from the mean proportion of critical words produced on the fifth study block test in the group that had not received prior tests (M = .09).

We also compared the proportion of prior block intrusions produced on the fifth block test in each condition. Untested participants produced far more prior block intrusions than those who had been tested after every block (M = .07 vs .01); t(57) = 4.97, d = 1.29. These prior block intrusion data are consistent with those presented by Szpunar et al. (2008).

Cumulative test. The mean proportions of studied words and critical words produced on the cumulative test in each condition are also presented in Table 1. In line with Szpunar et al. (2008) we found that participants who had been tested on blocks one through five recalled approximately 41% of the studied words on the final cumulative test, whereas participants who had been tested only on block five recalled approximately 24% of the studied words, t(57) = 5.11, d = 1.34, showing a robust testing effect. However, the proportion of critical words produced did not differ significantly between conditions.
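For readers who want to see how an effect size of this kind relates to summary statistics, the following Python sketch computes Cohen's d and the matching independent-samples t from the cumulative-test values in Table 1 (tested: M = .41, SD = .15, n = 30; untested: M = .24, SD = .09, n = 29). It assumes the pooled-SD form of d and an equal-variance t-test, neither of which is stated explicitly in the text, and the function names are hypothetical. Because the table entries are rounded to two decimals, the result should be expected to land near, not exactly on, the reported values.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d using the pooled SD (assumed variant)."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


def t_from_d(d, n1, n2):
    """Equal-variance independent-samples t implied by d and the group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


# Rounded cumulative-test values from Table 1; group sizes after exclusions.
d = cohens_d(0.41, 0.15, 30, 0.24, 0.09, 29)
t = t_from_d(d, 30, 29)
print(f"d = {d:.2f}, t({30 + 29 - 2}) = {t:.2f}")
```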

Discussion

In Experiment 1 we replicated Szpunar et al.’s (2008) finding that interpolated testing counteracts the negative effects of proactive interference by increasing correct recall and reducing prior block intrusions. In addition, testing appeared to reduce false recall: Untested participants produced more critical words on the fifth study block test than did tested participants. However, if the tests taken after each study block are considered, participants in the tested condition produced as many critical words as participants in the untested condition produced on the fifth study block test. This suggests that the activation of the critical words across the five study blocks was equivalent in the two conditions, and not affected by the tests taken after each study block for tested participants. This result makes sense if we consider that activated critical words have, essentially, been “studied” (albeit internally). Since participants in the untested condition are unable to efficiently constrain their search set to the fifth study block (due to PI), words that they recall from previous blocks will be produced, and this can include critical words. Furthermore, on the cumulative test the number of critical words produced did not differ between the two groups. Note that there was an increase in false recall from the fifth study block test to the cumulative test (see Marsh & Hicks, 2001 for a similar result); test-induced priming (Marsh, McDermott, & Roediger, 2004) may be responsible for inflating false recall when participants have to produce words from all studied blocks.

The evidence with regard to the effects of testing on false memory that we reviewed in the introduction (Chan & Langley, 2010; McDermott, 1996, 2006) seems to indicate that initial testing may incur the cost of inflating false memories on a later test. We found no such evidence. Neither did we find the opposite pattern: Testing study blocks one through four did not decrease the total number of critical words produced on the cumulative test. Thus testing improved veridical memory and

TABLE 1

Experiment 1

| | Fifth study block test: Tested | Fifth study block test: Untested | Cumulative test: Tested | Cumulative test: Untested |
|---|---|---|---|---|
| Studied words* | .50 (0.20) | .36 (0.15) | .41 (0.15) | .24 (0.09) |
| Critical words (of 6) | .01 (0.03) | .09 (0.12) | .17 (0.18) | .21 (0.20) |
| Prior block intrusions | .01 (0.01) | .07 (0.06) | - | - |

Mean proportion of studied words, prior block intrusions, and critical words produced on the test after the fifth study block and on the cumulative test for tested and untested participants in Experiment 1 (SD in parentheses).

*Refers to fifth study block words for the fifth study block test, and to words from all blocks for the cumulative test.


decreased PI but did not have an effect on false memory. The results so far seem to indicate that testing after every study block improves the ability to distinguish words that were presented in the most recent block from words presented in previous blocks, but does not necessarily decrease the build-up of false memories. However, since tested participants produced critical words on the test after blocks one through four, they would be unlikely to produce them again on the fifth block test, resulting in the lower level of critical words produced on the fifth study block test compared to untested participants.

EXPERIMENT 2

In Experiment 1 DRM lists were split across the five study blocks to examine how interpolated testing affected the build-up of associative memory illusions from block to block. A potential problem in interpreting Experiment 1 is that false recall of critical words on the initial block tests (but not on the cumulative test) was near floor. This was probably a result of mixing DRM lists in each study block, which has been shown to reduce false recall (Brainerd, Payne, Wright, & Reyna, 2003; McDermott, 1996, Exp. 2; Toglia, Neuschatz, & Goodwin, 1999). In Experiment 2 each study block was composed of words from one single DRM list (for a total of four DRM lists), except for the last study block which was composed of the remaining words from each of the four previously presented DRM lists. So, study blocks one to four were 12-word DRM lists, and the fifth study block consisted of the three remaining words from each of the four presented DRM lists (see Figure 2). With this design we were able to investigate whether having the opportunity to output critical words during initial tests has an impact on the number of critical words produced on the fifth block test where each critical word would be activated, and also on the cumulative test. We discuss predictions for the effects of prior testing on each of the two tests (fifth block test and cumulative test) in turn.

With regard to the fifth block test, based on the results of Experiment 1 we expected participants in the untested condition to produce more critical words on this test because these words would have been activated during study blocks one through four without the chance of being output. As a result, these words would form part of the search set and may be output on the fifth block test, whereas participants in the tested condition would have already had the chance to output each critical word on a previous test and would thus be able to monitor their output to exclude these words. However, taking all five initial study block tests into account, we expected participants in the tested condition to output a larger proportion of critical items than participants in the untested condition would output on the fifth study block test alone, which was not the case in Experiment 1.

With regard to the cumulative test, given the increased number of critical words output on the initial study block tests it is possible that participants in the tested condition would output a higher proportion of critical words on the cumulative test, thus exhibiting a negative consequence of testing. In other words, in this procedure we can assume that the critical words were activated to the same extent during the presentation of each block for every participant, but only participants tested after every block had the opportunity to output a critical word after each studied list. We were thus able to investigate the effect of this initial false recall opportunity on later recall, as compared with the untested condition in which critical words could have been activated but not output. The goal of Experiment 2 was to determine whether blocked DRM lists would produce the same data pattern as mixed DRM lists (with testing producing only a benefit to true recall with no associated cost in terms of increased false recall).

Method

Participants. A total of 35 Washington University undergraduates participated in the experiment for course credit or payment. The participants were randomly assigned to one of two experimental conditions by the program used to run the experiment, resulting in 16 participants in the untested group and 19 in the tested group.

Materials. Four DRM lists were selected from Stadler and colleagues’ (1999) norms. The chosen lists were the ones whose critical words were mountain, soft, river, and city. From each DRM list the words in the third, fifth, and fourteenth positions were taken and used to form a new study list. These words were randomised within the newly formed fifth study block. So, study blocks one to four were four DRM lists, composed of


12 words each, and study block five was composed of three words from each one of the four DRM lists, also with a total of 12 words. The order of study blocks one to four was counterbalanced across participants, but study block five was always presented last. The words within each block were always presented in the same order for every participant, with the exception of the fifth block, which was randomised afresh for each participant (see Figure 2 for a schematic summarising all these details).
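The block construction described in the Materials paragraph can be sketched as follows. This is only an illustration, not the program used to run the experiment: it assumes each source DRM list has 15 words (inferred from 12 retained words plus 3 held-out words), uses placeholder strings instead of the actual Stadler et al. (1999) items, and omits the counterbalancing of blocks one to four. The names `build_blocks` and `HELD_OUT_POSITIONS` are hypothetical.

```python
import random

# Placeholder items only: the real words come from the Stadler et al. (1999)
# norms and are not reproduced here.
CRITICAL_WORDS = ["mountain", "soft", "river", "city"]
HELD_OUT_POSITIONS = (3, 5, 14)  # 1-indexed positions moved to study block five


def build_blocks(drm_lists, rng):
    """Return (blocks_one_to_four, block_five) from 15-word DRM lists."""
    blocks, block_five = [], []
    for words in drm_lists:
        held_out = [words[p - 1] for p in HELD_OUT_POSITIONS]
        kept = [w for i, w in enumerate(words, start=1)
                if i not in HELD_OUT_POSITIONS]
        blocks.append(kept)           # within-block order stays fixed
        block_five.extend(held_out)
    rng.shuffle(block_five)           # fifth block randomised per participant
    return blocks, block_five


# Assumed list length of 15 (12 retained + 3 held out per list).
drm_lists = [[f"{cw}_{i:02d}" for i in range(1, 16)] for cw in CRITICAL_WORDS]
blocks, block_five = build_blocks(drm_lists, random.Random(0))
```

Under these assumptions every block, including the intermixed fifth one, contains 12 words, giving 60 studied words in total.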

Design and procedure. As in Experiment 1 we used a between-participants design with two testing conditions (tested and untested). All dependent measures from the fifth study block test and the cumulative test were identical to those examined in Experiment 1. The procedure and instructions were identical to those in Experiment 1, except that after the immediate cumulative test participants in both testing conditions performed a list discrimination test. This test did not produce differences between the tested and untested conditions, so the results are omitted from this paper for the sake of brevity.

Results

Recall results are presented in terms of proportions as for Experiment 1. The denominators differed to those of Experiment 1 and were as follows: For the fifth study block test: correct recall out of 12 (the number of words presented in the fifth study block); critical words out of 4 (the number of DRM lists used in the experiment); prior block intrusions out of 60 (the number of words presented in study blocks one through four). For the cumulative test: correct recall out of 60 (the total number of words presented across the five study blocks); and critical words out of 4 (the number of DRM lists used in the experiment).

Fifth study block test. The mean proportions of studied words and critical words produced on the fifth study block test in each condition are presented in Table 2. Overall there were no differences between conditions in the proportion of words of each type recalled. More specifically, participants who had been tested after every study block did not correctly recall significantly more words than did participants who were only tested after the fifth study block (the mean proportion of words recalled across the two conditions was .46). The production of critical words also did not differ significantly between participants, with an overall mean across conditions of .05. In fact 3 of the 19 tested participants produced one critical word of the four possible; and 4 of the 16 untested participants produced one of the four possible critical words. However, and contrary to Experiment 1, the mean proportion of critical words produced by tested participants across all five study block tests (M = .34) was significantly higher than the mean proportion of critical words produced on the fifth block test by untested participants (M = .34 vs .06 critical words recalled on study block five in the untested group); t(33) = 2.87, d = 1.01. Although the proportion of prior block intrusions was numerically higher for untested participants, this difference did not reach significance.

Cumulative test. The mean proportions of list words and critical words produced on the cumulative test in each testing condition are also presented in Table 2. Participants who had been tested on every study block recalled approximately 53% of the 60 studied words in the final cumulative test whereas participants who had been tested only on list 5 recalled approximately

TABLE 2
Experiment 2

                         Fifth study block test      Cumulative test
                         Tested       Untested       Tested       Untested
Studied words*           .46 (0.19)   .46 (0.16)     .53 (0.12)   .43 (0.15)
Critical words (of 4)    .04 (0.09)   .06 (0.11)     .33 (0.37)   .36 (0.33)
Prior block intrusions   .01 (0.02)   .04 (0.09)     -            -

Mean proportion of studied words, prior block intrusions, and critical words produced on the test after the fifth study block and on the cumulative test for tested and untested participants in Experiment 2.
*Refers to fifth study block words for the fifth study block test, and to words from all blocks for the cumulative test.


43% of the 60 studied words, t(33) = -2.17, d = 0.76, replicating the testing effect obtained in Experiment 1. However, the proportion of critical words produced was not significantly different between conditions, with an overall mean of .35.

Discussion

As in Experiment 1, in Experiment 2 we found that tested participants correctly recalled more studied words than untested participants on the cumulative test (due to the testing effect) without an accompanying increase in false recall. Contrary to Experiment 1, in Experiment 2 we did not find any build-up of PI across study blocks, so performance on the fifth block test did not differ between conditions (both in terms of correct recall and prior block intrusions). This most likely occurred as a result of the context change produced by shifting semantic categories between study blocks, as Loess (1968) and Wickens (1970) showed that semantic shift was an effective way of eliminating proactive interference. In addition, unlike in Experiment 1 where untested participants produced more critical words on the fifth block test than did tested participants, in this experiment there were no differences between conditions in the number of critical words produced on the fifth block test. In fact, tested participants actually produced more critical words across blocks one to five than did untested participants on the fifth block test (recall that in the equivalent comparison for Experiment 1, the two numbers were equal). In spite of this, the number of critical words falsely recalled on the cumulative test was equivalent between the two conditions, similarly to Experiment 1.

Overall, Experiments 1 and 2 point to a benefit of initial testing on later veridical memory without the associated cost of false memory. In Experiment 1 previously tested participants performed better on the test after the fifth study block, recalling more presented words and intruding fewer prior-block words. These tested participants also falsely recalled fewer critical words on the fifth block test, but that difference between tested and untested participants did not persist when critical words produced on previous block tests by participants who took those tests were also taken into account. These results point to the effectiveness of testing in increasing veridical memory and protecting against PI (as demonstrated by Szpunar et al., 2008), without the cost of also increasing false recall.

In Experiment 2 previously tested participants did not perform better on the fifth block test than participants in the untested condition, because the context change arising from switching DRM lists between blocks afforded even untested participants a release from PI. Also contrary to Experiment 1, there were no differences in the proportion of critical words recalled in the two conditions on the fifth block test (although both were near floor). On the other hand, when all initial tests were taken into account for the tested condition, participants in this condition produced a higher proportion of critical words than did participants in the untested condition on the fifth block test alone. In spite of this, on the cumulative test participants in the tested condition correctly recalled more studied words, but did not produce a higher proportion of critical words than participants in the untested condition. Across two different sets of materials, and with divergent initial test results, our data suggest that testing improves veridical memory in the long term without the cost of increasing false memories.

EXPERIMENT 3

In Experiment 1 DRM lists were split across the five study blocks so that activation of the critical words could build up from block to block. In Experiment 2, on the other hand, each study block was composed of words from one single DRM list, except for the last study block which was composed of words from each of the four previously presented DRM lists in order to re-activate all four critical words. So Experiment 1 allowed us to study the effects of testing on the build-up of false memories across study blocks, whereas Experiment 2 was specially designed to study the impact of outputting critical words during initial tests on a final cumulative test. However, the levels of false recall on the fifth study block test were rather low in both experiments. In Experiment 1 we already acknowledged that this was most likely the case because words from multiple DRM lists were intermixed rather than being presented in a blocked fashion, which has been shown to reduce false recall of the critical words (Brainerd et al., 2003; McDermott, 1996; Toglia et al., 1999). We attempted to overcome this problem in Experiment 2 by blocking


DRM lists for study blocks one to four, but had to return to intermixed lists in the fifth study block in order to re-activate the critical words and thus be able to compare production of those words in the tested and untested conditions.

In Experiment 3 we presented one DRM list at a time throughout the whole task rather than intermixing words from different DRM lists, even on the fifth study block. To do this, we decreased the length of each study block from 15 to 3 items, so each DRM list was split into five individual study blocks of 3 words each.³ We then repeated the standard procedure used in previous experiments for each DRM list. That is, for each DRM list participants were either tested on each of the five 3-word study blocks or were only tested on study block five (see Figure 3 for a schematic of one 5-block cycle). This procedure was repeated for all six DRM lists, to measure critical word activation on the fifth study block for every DRM list without intermixing the lists. The goal of Experiment 3 was thus to replicate the results of the previous experiments, with a design that closely mirrored the standard Roediger and McDermott (1995) procedure with the exception of breaking up the DRM lists into short study blocks. That is, at no point in this experiment did participants study intermixed DRM lists.

Method

Participants. A total of 30 Washington University undergraduates participated in the experiment for course credit or payment. The participants were randomly assigned to one of two experimental conditions, resulting in 15 participants in the untested condition (always tested only on the fifth study block of each DRM list) and 15 in the tested condition (always tested on all five study blocks of each DRM list).

Materials. The same six DRM lists used in Experiment 1 were used in this experiment. The lists were presented in a random order for each participant. Each list was composed of 15 words and was divided randomly into five study blocks of 3 words each. The words within each study block were the same for all participants but presented in a randomised order, and the order of the five study blocks within each list was counterbalanced (see Figure 3 for a schematic with notes on these details).

Design and procedure. The design was a between-participants one, with two conditions (tested and untested). The critical comparison between conditions was recall on the fifth study block tests (one for each DRM list, for a total of six) and on the cumulative test. For the fifth study block tests the levels of correct recall, critical word intrusions, and prior block intrusions were measured. For the cumulative test the levels of correct recall and critical word intrusions were measured.

The procedure and instructions were similar to those of previous experiments, with the following differences. Each of the six DRM lists was split into five study blocks of three words. After each study block was presented, all participants performed maths problems for 15 seconds. Participants then either solved further maths problems (untested condition), or took a free recall test on that study block (tested condition). The time given for participants to perform the corresponding task after each study block was 15 seconds. All the participants took a test after the fifth study block. On the free recall test after each study block, participants were presented with three boxes and could only enter up to three words. Since participants would be aware that each study block consisted of three words, we wanted to constrain recall so that participants would have to choose which words to report and it was easier for them to understand that they were being tested only on the three words just presented. After all the DRM lists were presented and participants finished the test on the fifth study block of the sixth DRM list, participants were asked to recall for 5 minutes as many words as they could from all of the three-word blocks they had studied.

Results

Recall results are presented in terms of proportions of words recalled of each type. For the fifth study block tests, results are collapsed across the six tests participants took (one for each DRM list). The denominators were as follows. For the fifth study block test: correct recall out of 18 (the total number of words presented in the fifth study blocks across all six DRM lists, i.e., 3 × 6); critical words out of 6 (the number of DRM lists used in the experiment); prior-list intrusions out of 72 (the number of words presented in study blocks one

³ We thank Jeff Karpicke for suggesting this experimental design.


through four across all 6 DRM lists, i.e., 12 × 6). For the cumulative test: correct recall out of 90 (the total number of words presented across the five study blocks); and critical words out of 6 (the number of DRM lists used in the experiment).
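The denominators listed above follow directly from the design parameters (six DRM lists, five study blocks per list, three words per block). The sketch below derives them and converts raw counts to proportions; the dictionary keys, the function name `to_proportions`, and the example counts are hypothetical and are not data from the experiment.

```python
# Design parameters stated in the text for Experiment 3.
N_LISTS = 6
BLOCKS_PER_LIST = 5
WORDS_PER_BLOCK = 3

# Denominators derived from the design; values match those given in the text.
DENOMINATORS = {
    "fifth_block_correct": WORDS_PER_BLOCK * N_LISTS,                      # 18
    "fifth_block_critical": N_LISTS,                                       # 6
    "fifth_block_prior_intrusions":
        WORDS_PER_BLOCK * (BLOCKS_PER_LIST - 1) * N_LISTS,                 # 72
    "cumulative_correct": WORDS_PER_BLOCK * BLOCKS_PER_LIST * N_LISTS,     # 90
    "cumulative_critical": N_LISTS,                                        # 6
}


def to_proportions(counts):
    """Divide each raw count by the denominator for its measure."""
    return {name: counts[name] / DENOMINATORS[name] for name in counts}


# Hypothetical raw counts for a single participant (illustration only).
example_counts = {
    "fifth_block_correct": 15,
    "fifth_block_critical": 0,
    "fifth_block_prior_intrusions": 1,
    "cumulative_correct": 34,
    "cumulative_critical": 2,
}
proportions = to_proportions(example_counts)
```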

Fifth study block tests. The proportions of studied words, critical words, and prior block intrusions produced on the fifth block tests in each condition are presented in Table 3. The mean proportion of words correctly recalled from the fifth study blocks was significantly higher for tested participants than for untested participants, t(28) = 2.37, d = 0.87. On the other hand, the mean proportion of critical words recalled across all the fifth study block tests was not significantly different between conditions, with an overall mean of .02. In fact, 3 of the 15 tested participants produced one critical word of the six possible; and 4 of the 15 untested participants produced one of the six possible critical words. The mean proportion of critical words produced across all study block tests for tested participants was marginally significantly higher than the mean proportion of critical words produced on the fifth study block tests in the untested group (M = .13 vs M = .03), t(28) = 1.62, d = .59, p = .06, one-tailed test.

The mean proportion of prior block intrusions was also examined. Tested participants intruded fewer prior block words on the fifth study block tests than did untested participants, t(28) = 3.53, d = .90.
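As a rough consistency check, the effect size and t statistic for correct recall on the fifth study block tests can be approximately recovered from the rounded means and standard deviations in Table 3 (.86 (0.11) vs .72 (0.20), 15 participants per group). The sketch assumes Cohen's d with a pooled standard deviation and an equal-variance independent-samples t-test; the text does not state which formulas the authors used, so treat the match as approximate.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                     / (n1 + n2 - 2))


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Standardised mean difference using the pooled SD (assumed formula)."""
    return (m1 - m2) / pooled_sd(sd1, n1, sd2, n2)


def t_from_d(d, n1, n2):
    """Equal-variance independent-samples t implied by d and group sizes."""
    return d * math.sqrt(n1 * n2 / (n1 + n2))


# Rounded summary statistics from Table 3 (studied words, fifth block tests).
d = cohens_d(0.86, 0.11, 15, 0.72, 0.20, 15)
t = t_from_d(d, 15, 15)
print(f"d = {d:.2f}, t(28) = {t:.2f}")
```

Because the table values are rounded to two decimals, the recovered numbers only approximate the reported t(28) = 2.37 and d = 0.87; for measures near floor, such as prior block intrusions, rounding error is too large for this check to be informative.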

Cumulative test. The mean proportions of list words and critical words produced on the cumulative test in each condition are also presented in Table 3. Tested participants recalled approximately 38% of the studied words on the final cumulative test, whereas untested participants recalled approximately 31% of the studied words. This difference was significant on a one-tailed t-test, t(28) = 1.80, d = 0.66 (the results of Experiments 1 and 2 justified the use of a one-tailed t-test for this comparison). However, and also replicating Experiments 1 and 2, the proportion of critical words produced did not differ between conditions, with an overall mean of .31.
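To see why the one-tailed versus two-tailed distinction matters for t(28) = 1.80, the sketch below computes the upper-tail probability of Student's t distribution by numerically integrating its density with Simpson's rule. It uses only the standard library; the function names are hypothetical, and this is an illustration rather than the authors' analysis.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi)
                                       * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


p_one_tailed = t_upper_tail(1.80, 28)   # cumulative test comparison
p_two_tailed = 2 * p_one_tailed
p_marginal = t_upper_tail(1.62, 28)     # critical words, all tests vs fifth

print(f"t(28) = 1.80: one-tailed p = {p_one_tailed:.3f}, "
      f"two-tailed p = {p_two_tailed:.3f}")
print(f"t(28) = 1.62: one-tailed p = {p_marginal:.3f}")
```

With 28 degrees of freedom, t = 1.80 falls below .05 only on the one-tailed test, which is why the directional prediction from Experiments 1 and 2 matters for this comparison.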

Discussion

In Experiment 3 we used a procedure that did not involve intermixing DRM lists at any point during study, to more closely match Roediger and McDermott’s (1995) original procedure. On the fifth study block tests previously tested participants recalled more presented words and intruded fewer prior-block words than untested participants, consistent with Szpunar et al. (2008) and our Experiments 1 and 2. Also consistent with the previous experiments, there were no differences between conditions in the number of critical words produced on the fifth study block tests. However, we did successfully detect a significant difference between conditions in prior block intrusions, which were similarly infrequent. Finally, on the cumulative test, tested participants recalled more studied words than untested participants, but the number of falsely recalled critical words did not differ between conditions, once again replicating Experiments 1 and 2. Taken together, these results point to a benefit of testing in preventing the build-up of PI across lists, and increasing the level of veridical memory without the costs of increasing false recall.

GENERAL DISCUSSION

In Experiments 1 and 3 we successfully replicated Szpunar et al.’s (2008) finding that testing helps protect against the build-up of PI with a different set of materials (DRM lists converging on associates). That is, in two different versions of the procedure (intermixed DRM lists in Experiment 1 and pure DRM lists split into three-word study blocks in Experiment 3) we found that participants who were tested on every block produced

TABLE 3
Experiment 3

                         Fifth study block tests*    Cumulative test
                         Tested       Untested       Tested       Untested
Studied words**          .86 (0.11)   .72 (0.20)     .38 (0.11)   .31 (0.12)
Critical words (of 6)    .01 (0.04)   .03 (0.09)     .28 (0.33)   .33 (0.24)
Prior block intrusions   .01 (0.01)   .03 (0.03)     -            -

Mean proportion of studied words, prior block intrusions, and critical words produced on the test after the fifth study block and on the cumulative test for tested and untested participants in Experiment 3.
*Results are presented averaged across the six DRM list cycles, each of which consisted of five study blocks.
**Refers to fifth study block words for the fifth study block test, and to words from all blocks for the cumulative test.


more correct words and fewer prior block intrusions on the fifth study block test. In a third experiment (Experiment 2, where we presented blocked DRM lists with a final intermixed study block) we did not observe any prior list intrusions in the untested condition and thus did not find differences in PI between the tested and untested conditions. In addition we replicated the beneficial effect of testing on later recall (e.g., Wheeler & Roediger, 1992) in all three experiments, with tested participants consistently performing better than untested participants on the cumulative recall test.

In addition to replicating Szpunar et al. (2008) with new materials, our design allowed us to look at the effect of interpolated testing on false recall, which to our knowledge has never been examined in a design where each list consists of new words and produces additional activation of the critical words. One potential downfall of testing is that it may increase false recall at the same time as improving veridical recall (e.g., McDermott, 2006). In three experiments we did not find evidence for this pitfall. Below we discuss each of the designs we used and the subtle differences between them; but the take-home point from our data is that testing improved veridical recall and protected against PI without the associated cost of increasing false recall.

In Experiment 1 participants studied five blocks of intermixed DRM lists and were either tested after every block or only after the fifth block. Taking a test after study blocks one to four helped participants to suppress false recall on the fifth block. This could have arisen because for participants in the untested condition, critical words were activated during study of blocks one to four but could not be output, so they formed part of the search set on the fifth study block test. Supporting evidence for this explanation is that tested participants actually produced the same number of critical words across all five study block tests as untested participants produced on the fifth study block test, suggesting that overall activation of the critical words across study blocks was equivalent between conditions. Analogously, on the cumulative test participants in the two experimental conditions produced the same number of critical words.

In Experiment 2 we changed the nature of the study blocks, now presenting participants with pure DRM lists for the first four study blocks and an intermixed list for the fifth study block, in order to re-activate all four critical words. This procedure was a compromise between Roediger and McDermott (1995) who presented blocked DRM lists to achieve high levels of false recall, and the requirement of our design that all critical words were activated following study of the fifth block. Contrary to Experiment 1 there were no differences in PI for tested and untested participants. That is, participants in the untested condition performed as well as participants in the tested condition on the fifth study block test. This absence of PI can be explained by the semantic shift that occurred from study block to study block due to the use of blocked DRM lists, contrary to Experiment 1 where intermixed DRM lists were used in all blocks. We discuss these results below in more detail, but first we turn to the false recall data. The pattern of results for false recall was also somewhat different from that of Experiment 1. In Experiment 2 there was no difference in the number of critical words produced on the fifth study block test between the two experimental conditions (contrary to Experiment 1 where untested participants produced more critical words), but when all study block tests were included for tested participants, the number of critical words produced was higher for tested than untested participants (contrary to Experiment 1 where these numbers were equivalent). However, despite producing more critical words on the initial tests, tested participants produced the same number of critical words as untested participants on the cumulative test, replicating Experiment 1.

In Experiment 3 we did not intermix DRM lists in each study block. Instead participants studied words from one entire DRM list before moving on to the next, just as in the original DRM experiments (Roediger & McDermott, 1995). The difference between the standard procedure and ours was that we presented DRM lists in blocks of three words at a time, with an interpolated task that differed between conditions for the first four study blocks of each DRM list: Untested participants performed 30 seconds of maths problems, whereas tested participants performed 15 seconds of maths problems followed by free recall of the three-word study block. All participants took the free recall test on the fifth study block of each DRM list (for a total of six cycles). Collapsing the results of all the fifth study blocks, we found that participants in the tested condition recalled more words from the correct study block and intruded fewer prior block words; however, there were no differences


in the production of critical words between conditions. The correct recall and prior block intrusion data of this experiment replicate those of Experiment 1, showing that PI built up across study blocks for the untested condition and testing helped relieve this PI. It is worthwhile to note that this is the first demonstration of the Szpunar et al. (2008) finding in a design with more than one cycle through the procedure. The critical word data of Experiment 3, on the other hand, were more consistent with those of Experiment 2, showing no difference between conditions on the fifth study block tests. However, when comparing the proportion of critical words produced across all tests by tested participants to the proportion of critical words produced on the tests after the fifth study blocks by untested participants, the data seem more consistent with Experiment 1, showing no differences between conditions. Despite these somewhat mixed results, the cumulative test data were clear: participants in the tested condition recalled more studied words and no more critical words than untested participants, replicating both Experiments 1 and 2.

It should be noted that our observed levels of false recall in all three experiments were very low because of the relatively small number of critical words, thus false recall was close to floor on the initial tests. We did find relatively robust levels of false recall on the cumulative test, when the data are considered in terms of proportions. However, these proportions still translate into small numbers of critical words due to the nature of the design, which afforded a maximum of six (Experiments 1 and 3) or four (Experiment 2) critical words to be produced. A fruitful avenue of future research would involve adapting the paradigm to increase the opportunity for false recall on each test, in order to more definitely establish the impact of testing on false recall of related information.

By using lists of words created specifically to elicit false recall of associated words, we were able to measure intrusions due to indirect activation of semantic associates. The results of the experiments point to the conclusion that testing in this paradigm increases true recall without a long-term effect on false recall. This increase in veridical memory unaccompanied by an increase in false memory may seem somewhat at odds with McDermott (2006, Experiment 2), who showed an increase in both veridical and false recall on a cumulative test as a result of initial testing. However, as we noted in the introduction, there was little difference in McDermott’s data between the probability of recalling a non-presented critical word after one test (.35) and after three tests (.37); the main effect of testing on critical word recall was most likely driven by the difference between the condition in which no initial test was taken (where the probability of false recall was .27) and the other two conditions. Our untested condition, where participants were tested only after the fifth study block, is more comparable to McDermott’s “one test” condition than her “no test” condition, because in our paradigm the untested participants did take one test (on the fifth study block) prior to the cumulative test. So taking a closer look at McDermott’s results demonstrates that our findings are consistent with those reported in that study in demonstrating that increasing the number of prior tests can improve true recall on a final test without an accompanying increase in false recall. It should be noted that previous work on the effects of testing on false recall has focused on repeated recall of the same information. We adopted a new procedure because our goal was to assess the effects of testing different but related material on veridical memory, proactive interference, and false memory for semantically related information.

One of the explanations of the benefits of testing suggested by Szpunar and colleagues (2008) was the source monitoring/reduction of cue overload explanation. This explanation seems to be supported by our results. According to this view, if participants take a test after each study block, they develop a set of temporal cues associated with each retrieval moment (for each of the tests after blocks one through four); so, on the fifth block test, the words associated with the effective retrieval cue at the time of testing will only be words from the fifth block, since the previously studied words will have already been associated with previous retrieval cues. Therefore participants who have been tested after each study block can easily circumscribe the relevant search set to words presented in the most recently studied block. On the other hand, participants who have only studied (and not been tested on) previous blocks will not have access to such cues because all previously presented words (those from study blocks one through four, and those from study block five) will have been associated with the same cue. This prevents participants from being able to circumscribe the relevant search set to the items studied only in the most recent block. As a result, the level of correct recall of words from that block decreases, as more retrieval candidates are subsumed under the retrieval cue for that test (Watkins & Watkins, 1975). Interference from previously presented words and indirectly activated critical words (which could also have been activated during study of previous blocks) increases because all of those words share the same retrieval cue (Szpunar et al., 2008).
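The cue overload account described above can be illustrated with a toy calculation. The sketch below is not a model fitted to our data: the function names and the value of 15 words per block are hypothetical, and it assumes that every studied word under a cue is an equally likely retrieval candidate. It also ignores indirectly activated critical words, which would enlarge the search set further.

```python
def search_set_size(tested_after_each_block: bool,
                    n_blocks: int = 5,
                    words_per_block: int = 15) -> int:
    """Studied words sharing the retrieval cue on the fifth block test.

    words_per_block is a hypothetical value chosen for illustration.
    """
    if tested_after_each_block:
        # Words from earlier blocks were bound to their own test cues,
        # so only the most recent block remains under the current cue.
        return words_per_block
    # No intervening tests: all studied blocks share a single cue.
    return n_blocks * words_per_block


def target_share(tested_after_each_block: bool,
                 n_blocks: int = 5,
                 words_per_block: int = 15) -> float:
    """Fraction of the search set that comes from the most recent block."""
    size = search_set_size(tested_after_each_block, n_blocks, words_per_block)
    return words_per_block / size


if __name__ == "__main__":
    for tested in (True, False):
        print(tested, search_set_size(tested), target_share(tested))
```

Under these assumptions the untested condition has a search set five times larger than the tested condition, so words from the fifth block make up a much smaller share of the candidates. This corresponds to the overloaded cue in the account above.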

However, on the cumulative test the retrieval cue should be the same for all participants, no matter whether they were tested after each study block or only after the fifth. In this case the relevant search set will include words presented in all five study blocks as well as indirectly activated critical words (or, if such words were produced on the initial block tests, directly activated critical words). As the words share the same retrieval cue, the temporal cues will no longer be helpful in differentiating presented words from non-presented words, because those could have been activated at any point during study.

The cue overload explanation is also supported by the finding that the act of recalling from long-term memory may drive context change, isolating the recalled list of words from interference. Jang and Huber (2008) used a retroactive interference paradigm in which participants are asked to recall not from the most recently presented list, but from a previously presented one (the target list). Jang and Huber manipulated the length of both the target list and the intervening lists, and whether participants received a free recall test after every intervening list and the target list, or only after the target list. Only the length of the target list affected free recall of the target list when participants were tested on all intervening lists. On the other hand, both the length of the target list and the length of the intervening lists affected recall of the target list when participants were not tested on the intervening lists. These results are interpreted as supporting the view that the act of recall drives context change, isolating the target list from interference from intervening lists. Further support for the context change explanation was put forward by Pastötter et al. (2011), who found that a semantic generation task and a working memory task both protected against PI similarly to intervening testing.

Although PI was not the focus of our paper, we found an interesting difference between our experiments in terms of PI. In Experiments 1 and 3 we found the expected difference in the learning of study block five between the previously tested and untested conditions. A context change afforded by the act of retrieval (Jang & Huber, 2008; Pastötter et al., 2011) may explain why testing protects against proactive interference, as it provides a change similar to the one that occurs when the category of the list is changed (e.g., Wickens, 1970). However, this difference did not emerge in Experiment 2 because PI was minimal even for those who had not been tested on study blocks one through four. It is interesting to note that this absence of PI emerged despite the fifth block being semantically related to all of the previous blocks. It seems that the change in semantic context between blocks one through four was enough to keep PI at bay for untested participants, even though there was some semantic overlap between block five and all four previously studied blocks. This release from PI due to a change in semantic dimension or context is highly congruent with classic studies showing release from PI. For example, Wickens (1970) showed that semantic dimensions, including taxonomic categories, are highly effective in promoting release from PI; and Wickens, Dalezman, and Eggemeier (1976) showed that the effect of semantic shift on release from PI was negatively correlated with the property overlap between semantic categories (e.g., more release occurs when switching from fruits to occupations than from fruits to vegetables). It is therefore easy to see why we did not find PI in our Experiment 2. Although we did not anticipate this effect, it gave us the chance to demonstrate equivalent effects of testing on false recall in two situations: One in which testing protects against the build-up of PI across study lists (Experiments 1 and 3) and another in which PI did not build up across lists because of semantic shifts between lists (Experiment 2).

Our results can shed light on a recently proposed theoretical explanation for the testing effect. Carpenter (2009) proposed the elaborative retrieval hypothesis, which states that retrieval processes during testing may activate elaborative information related to the studied item. This elaborative information may then help later retrieval, by providing additional cues to the target item. This hypothesis was mainly developed to explain the benefits of testing in a paired associate cued recall paradigm, and does not seem consistent with our results that testing improves veridical recall without increasing false recall. According to the elaborative retrieval hypothesis, critical non-presented words should receive extra activation during the tests on study blocks one to four and thus they should be falsely recalled more often on the final cumulative test by participants who had been tested after every study block. This was not the case in any of our three experiments. Of course, the elaborative retrieval hypothesis may be applicable to cued recall tests performed after the study of pairs of words and it is compatible with the notion that tests enhance retention because they promote encoding of mediating information (i.e., a word or concept that links a cue to a target), as demonstrated by Pyc and Rawson (2010). However, according to our results, the elaborative retrieval hypothesis does not seem to be applicable to the testing effect observed in free recall.

In conclusion, we investigated a potential negative effect of testing on long-term memory: An increase in false memories due to increased activation of critical words during retrieval practice. The existence of this negative effect of testing was not supported by our results. Instead, we found that testing improved performance on a cumulative free recall test by enhancing true recall through retrieval practice (Experiments 1, 2, and 3), and improving learning by protecting against PI throughout the study of multiple consecutive lists (Experiments 1 and 3). On the other hand, testing did not affect false recall on the final cumulative test, even when a higher number of critical items were produced on the initial tests by tested participants (Experiment 2). Our results show a dissociation between true and false recall, and demonstrate that it is possible to increase the correct recall of studied information without necessarily increasing the false recall of associated information. In our study veridical recall was improved by testing without the cost of increasing false recall of associated information.

Manuscript received 19 September 2011

Manuscript accepted 6 December 2011

First published online 25 January 2012

REFERENCES

Brainerd, C. J., Payne, D. G., Wright, R., & Reyna, V. F. (2003). Phantom recall. Journal of Memory and Language, 48, 445–467.

Carpenter, S. K. (2009). Cue strength as a moderator of the testing effect: The benefits of elaborative retrieval. Journal of Experimental Psychology: Learning, Memory, & Cognition, 35, 1563–1569.

Chan, J. C. K., & Langley, M. (2010). Paradoxical effects of testing: Retrieval enhances both accurate recall and suggestibility in eyewitnesses. Journal of Experimental Psychology: Learning, Memory, and Cognition, 37, 248–255.

Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic memory. Psychological Review, 82, 407–428.

Deese, J. (1959). On the prediction of occurrence of particular verbal intrusion in immediate recall. Journal of Experimental Psychology, 58, 17–22.

Jang, Y., & Huber, D. E. (2008). Context retrieval and context change in free recall: Recalling from long-term memory drives list isolation. Journal of Experimental Psychology: Learning, Memory, & Cognition, 34, 112–127.

Johnson, M. K., Hashtroudi, S., & Lindsay, D. S. (1993). Source monitoring. Psychological Bulletin, 114, 3–28.

Johnson, M. K., & Raye, L. C. (1981). Reality monitoring. Psychological Review, 88, 67–85.

Karpicke, J. D., & Roediger, H. L. III. (2008). The critical importance of retrieval for learning. Science, 319, 966–968.

Keppel, G., & Underwood, B. J. (1962). Proactive inhibition in short-term retention of single items. Journal of Verbal Learning & Verbal Behavior, 1, 153–161.

Loess, H. (1968). Short-term memory and item similarity. Journal of Verbal Learning & Verbal Behavior, 8, 240–247.

Marsh, R. L., & Hicks, J. L. (2001). Output monitoring tests reveal false memories of memories that never existed. Memory, 9, 39–51.

Marsh, E., McDermott, K. B., & Roediger, H. L. (2004). Does test-induced priming play a role in the creation of false memories? Memory, 12, 44–55.

McDermott, K. B. (1996). The persistence of false memories in list recall. Journal of Memory & Language, 35, 212–230.

McDermott, K. B. (2006). Paradoxical effects of testing: Repeated retrieval attempts enhance the likelihood of later accurate and false recall. Memory & Cognition, 34, 261–267.

Pastötter, B., Schicker, S., Niedernhuber, J., & Bäuml, K.-H. T. (2011). Retrieval during learning facilitates subsequent memory encoding. Journal of Experimental Psychology: Learning, Memory & Cognition, 37, 287–297.

Postman, L., & Keppel, G. (1977). Conditions of cumulative proactive inhibition. Journal of Experimental Psychology: General, 106, 376–403.

Pyc, M. A., & Rawson, K. A. (2010). Why testing improves memory: Mediator effectiveness hypothesis. Science, 333, 335.

Robinson, K. J., & Roediger, H. L. (1997). Associative processes in false recall and false recognition. Psychological Science, 8, 231–237.

Roediger, H. L., & Gallo, D. A. (2004). Associative memory illusions. In R. F. Pohl (Ed.), Cognitive illusions. New York, NY: Psychology Press.

Roediger, H. L. III, & Karpicke, J. D. (2006a). Test-enhanced learning: Taking memory tests improves long-term retention. Psychological Science, 17, 249–255.

Roediger, H. L., & Karpicke, J. D. (2006b). The power of testing memory: Basic research and implications for educational practice. Perspectives on Psychological Science, 1, 181–210.

Roediger, H. L., & McDermott, K. B. (1995). Creating false memories: Remembering words not presented in lists. Journal of Experimental Psychology: Learning, Memory, & Cognition, 21, 803–814.

Roediger, H. L., McDermott, K. B., & Robinson, K. J. (1998). The role of associative processes in producing false remembering. In M. Conway, S. Gathercole, & C. Cornoldi (Eds.), Theories of memory II (pp. 187–245). Hove, UK: Psychology Press.

Roediger, H. L., Watson, J. M., McDermott, K. B., & Gallo, D. A. (2001). Factors that determine false recall: A multiple regression analysis. Psychonomic Bulletin & Review, 8(3), 385–407.

Stadler, M. A., Roediger, H. L. III, & McDermott, K. B. (1999). Norms for word lists that create false memories. Memory & Cognition, 27, 494–500.

Szpunar, K. K., McDermott, K. B., & Roediger, H. L. (2008). Testing during study insulates against the build-up of proactive interference. Journal of Experimental Psychology: Learning, Memory, and Cognition, 34, 1392–1399.

Toglia, M. P., Neuschatz, J. S., & Goodwin, K. A. (1999). Recall accuracy and illusory memories: When more is less. Memory, 7, 233–256.

Underwood, B. J. (1957). Interference and forgetting. Psychological Review, 64, 49–60.

Watkins, O. C., & Watkins, M. J. (1975). Build-up of proactive inhibition as a cue-overload effect. Journal of Experimental Psychology: Human Learning and Memory, 104, 442–452.

Weinstein, Y., McDermott, K. B., & Szpunar, K. K. (2011). Testing protects against proactive interference in face-name learning. Psychonomic Bulletin & Review, 18, 518–523.

Wheeler, M. A., & Roediger, H. L. (1992). Disparate effects of repeated testing: Reconciling Ballard’s (1913) and Bartlett’s (1932) results. Psychological Science, 3, 240–245.

Wickens, D. D. (1970). Encoding categories of words: An empirical approach to word meaning. Psychological Review, 77, 1–15.

Wickens, D. D., Born, D. G., & Allen, C. K. (1963). Proactive inhibition and item similarity in short-term memory. Journal of Verbal Learning and Verbal Behavior, 2, 440–445.

Wickens, D. D., Dalezman, R. E., & Eggemeier, F. T. (1976). Multiple encoding of word attributes in memory. Memory & Cognition, 4, 307–310.
