
Instructional Tools in Educational Measurement and Statistics (ITEMS) for School Personnel: Evaluation of Three Web-Based Training Modules

Rebecca Zwick, University of California, Santa Barbara; Jeffrey C. Sklar, California State Polytechnic University, San Luis Obispo; Graham Wakefield, University of California, Santa Barbara; Cris Hamilton, Independent Consultant; Alex Norman, University of California, Santa Barbara; and Douglas Folsom, University of California, Santa Barbara

Author note: Rebecca Zwick is a Professor, Department of Education, University of California, Santa Barbara, Santa Barbara, CA 93106; rzwick@education.ucsb.edu. Jeffrey C. Sklar is an Instructor, Department of CSM-Statistics, California State Polytechnic University, San Luis Obispo, CA 93407. Graham Wakefield is a PhD Candidate, Department of Media Arts and Technology, University of California, Santa Barbara. Cris Hamilton is an independent consultant. At the time of writing, Alex Norman and Douglas Folsom were both graduate students in the Department of Media Arts and Technology, University of California, Santa Barbara.

Educational Measurement: Issues and Practice, Summer 2008. Copyright © 2008 by the National Council on Measurement in Education.

In the current No Child Left Behind era, K-12 teachers and principals are expected to have a sophisticated understanding of standardized test results, use them to improve instruction, and communicate them to others. The goal of our project, funded by the National Science Foundation, was to develop and evaluate three Web-based instructional modules in educational measurement and statistics to help school personnel acquire the “assessment literacy” required for these roles. Our first module, “What’s the Score?” was administered in 2005 to 113 educators who also completed an assessment literacy quiz. Viewing the module had a small but statistically significant positive effect on quiz scores. Our second module, “What Test Scores Do and Don’t Tell Us,” administered in 2006 to 104 educators, was even more effective, primarily among teacher education students. In evaluating our third module, “What’s the Difference?” we were able to recruit only 33 participants. Although those who saw the module before taking the quiz outperformed those who did not, results were not statistically significant. Now that the research phase is complete, all ITEMS instructional materials are freely available on our Website.

Keywords: assessment literacy, educational measurement, educational statistics, No Child Left Behind, psychometrics, standardized tests

According to a 2002 report from the Center on Education Policy, “[s]chool leaders . . . must be adept at using data to improve teaching and learning. Too often, test data is used only for accountability, rather than to diagnose the needs of individual students and improve their education. Both principals and teachers must be trained to use test data to modify daily lesson plans and to tailor assistance for individual children. Much more test data will soon be available, because the new federal requirements require states to produce descriptive and diagnostic information, as well as individual test results” (Jennings, 2002). As the report predicted, teachers and school administrators in the current No Child Left Behind (NCLB) era are expected to have a sophisticated understanding of test results, to use them to make data-based decisions about classroom instruction, and to communicate them to others.

Although there is little formal research in this area, it is widely recognized that many school personnel have not had the opportunity to acquire the “assessment literacy” that is required for these roles. Little has changed, in fact, since the publication in 1990 of the Standards for Teacher Competence in Educational Assessment of Students (American Federation of Teachers, National Council on Measurement in Education [NCME], & National Education Association, 1990), which expressed concern about “the inadequacy with which teachers are prepared for assessing the educational progress of their students.” The document recommended that “assessment training . . . be widely available to practicing teachers through staff development programs at the district and building levels” (p. 1).1

Further evidence regarding the need for training in this area comes from two additional efforts undertaken by professional organizations during the 1990s. The Joint Committee on Competency Standards in Student Assessment for Educational Administrators, sponsored by the American Association of School Administrators, National Association of Elementary School Principals, National Association of Secondary School Principals, and NCME, surveyed 1,700 administrators from the first three of these sponsoring organizations about assessment-related skills. The three skills rated as most needed by educational administrators (out of a list of 37) were knowing the terminology associated with standardized tests, knowing the purposes of different kinds of testing, and understanding the linkage between curriculum content and various kinds of tests (Impara, 1993, p. 20). In 1995, NCME published the Code of Professional Responsibilities in Educational Measurement. This document, which is intended to apply to professionals involved in all aspects of educational assessment, including those who administer assessments and use assessment results, includes the responsibility to “maintain and improve . . . professional competence in educational assessment” (NCME, 1995, p. 1).

According to Stiggins (2002), however, “only a few states explicitly require competence in assessment as a condition for being licensed to teach. No licensing examination now in place at the state or federal level verifies competence in assessment. Since teacher-preparation programs are designed to prepare candidates for certification under these terms, the vast majority of programs fail to provide the assessment literacy required to prepare teachers to face emerging classroom-assessment challenges . . . Furthermore, lest we believe that teachers can turn to their principals for help, almost no states require competence in assessment for licensure as a principal or school administrator at any level. As a result, assessment training is almost nonexistent in administrator-training programs.”

The continuing need for professional development in this area was confirmed by two recent reports. A report on the licensing of school principals in the 50 states, sponsored by the Wallace Foundation (Adams & Copland, 2005), includes a discussion of the necessary knowledge and skills for principals, according to state licensing requirements. “Totally missing from state licensing frameworks,” according to the authors, “was any attention to the meaning and use of learning assessments. . .” In another recent study, by the National Board on Educational Testing and Public Policy at the Lynch School of Education at Boston College, researchers surveyed a nationally representative sample of teachers to ascertain their attitudes about state-mandated testing programs (Pedulla et al., 2003). When asked about the adequacy of professional development in the area of standardized test interpretation, almost a third of the 4,200 responding teachers reported that professional development in this area was inadequate or very inadequate (Pedulla et al., 2003). Further documentation of the “widespread deficits in assessment skills evidenced by practicing teachers” is described by Lukin, Bandalos, Eckhout, and Mickelson (2004, pp. 26–27).

As part of the preliminary work conducted in preparation for the ITEMS project, a survey assessing respondents’ understanding of educational measurement and statistics was developed and field-tested by two doctoral students under the first author’s supervision (Brown & Daw, 2004). The survey, which consisted mainly of multiple-choice questions, was completed by 24 University of California, Santa Barbara (UCSB) graduate students who were pursuing an M.Ed. and a multiple-subjects teaching credential. A revised version of the survey was subsequently offered over the Internet to students enrolled in a graduate education course at California State University, Northridge (CSUN). The 10 CSUN students who responded to the Web-based version of the survey were experienced teachers or administrators in K-12 schools. Results of the two survey administrations to teachers and future teachers suggest the existence of substantial gaps in the respondents’ knowledge of educational measurement and statistics. For example, only 10 of the 24 UCSB respondents were able to choose the correct definition of measurement error, and only 10 knew that a Z-score represents the distance from the mean in standard deviation units. Nearly half mistakenly thought the reliability of a test is “the correlation between student grades and student scores on the test.” When told that “20 students averaged 90 on an exam, and 30 students averaged 40,” only one-half of the CSUN group were able to calculate the combined average correctly, and only one in 10 chose the correct definition of measurement error.
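For readers who want to check the arithmetic behind these survey questions, the short Python sketch below works through the combined average and a Z-score. The combined average uses the numbers quoted in the question; the score, mean, and standard deviation in the Z-score example are invented purely for illustration.

    # Combined average for the survey question described above:
    # 20 students averaged 90 on an exam, and 30 students averaged 40.
    n1, mean1 = 20, 90
    n2, mean2 = 30, 40
    combined = (n1 * mean1 + n2 * mean2) / (n1 + n2)
    print(combined)  # 60.0 -- a weighted average, not the midpoint of 90 and 40

    # A Z-score is the distance from the mean in standard deviation units.
    score, mean, sd = 75, 60, 10  # hypothetical values, not from the survey
    print((score - mean) / sd)    # 1.5 standard deviations above the mean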

As Popham (2006a, p. xiii) notes, “[t]oday’s educational leaders need to understand the basics of assessment or they are likely to become yesterday’s educational leaders . . .” How can the required level of understanding be attained? Ideally, teacher and principal certification programs will eventually be modified to increase course content in educational measurement and statistics. Substantial changes of this kind are likely to be slow, however, and may occur only if licensing requirements are first modified. To help K-12 schools respond rapidly to the training gap in these areas, the Instructional Tools in Educational Measurement and Statistics (ITEMS) for School Personnel project has created and evaluated a series of three instructional modules in educational measurement and statistics, to be available via the Web and also as CDs and DVDs.

The goal of the project was to provide immediate opportunities for teachers and administrators to increase their assessment literacy: more specifically, their understanding of the psychometric and statistical principles needed for the correct interpretation of standardized test scores. The ITEMS materials are intended to help prepare school personnel to use test results to optimize instructional decisions and to pinpoint schools, classes, or individuals that require additional instruction or resources, as well as to explain test results to students, parents, the school board, the press, and the general community. The provision of this training in a convenient and economical way is intended to assist schools with the successful implementation and interpretation of assessments. Alternative forms of professional development, such as intensive workshops or university courses, are much more time-consuming and expensive, and are unlikely to be funded in an era of limited school budgets.

Previous Statistics Education and Assessment Literacy Research

The ITEMS work is related to previous research in statistics education and assessment literacy. The goal of statistics education research is to promote effective techniques, methods, and products for teaching statistics, as well as to understand how individuals reason about statistical concepts. By contrast, the work described here as assessment literacy research focuses on the ways teachers use assessments to make instructional decisions.

Statistics Education

A prominent statistics education research effort is that of Joan Garfield and Robert delMas of the University of Minnesota and Beth Chance of California Polytechnic State University. In the current ARTIST (Assessment Resource Tools for Improving Statistical Thinking) project (https://app.gen.umn.edu/artist/), these researchers are working to develop effective assessments of statistics knowledge, which are made available online for teachers of first courses in statistics (Garfield, delMas, & Chance, 2003; see https://app.gen.umn.edu/artist/publications.html for a publications list). In a previous project, delMas, Garfield, and Chance (1999) investigated the use of computer simulations to improve the statistical reasoning of students in college-level introductory statistics courses. These projects built on an earlier program of research, Project Chance, headed by J. Laurie Snell of Dartmouth, the goal of which was to help students think critically about media reports that use probability and statistics (Snell & Finn, 1992).

The Statistics Education Research Group (SERG) (www.umass.edu/srri/serg/) at the University of Massachusetts, Amherst works on a variety of projects to improve instruction in statistics courses. Founded in the 1970s, the group originally investigated how individuals reason about statistical concepts before receiving any formal training, and then used results to improve K-12 and college-level statistics instruction. Their current focus is on younger students. Several data analysis software tools have emerged from projects conducted at SERG, including DataScope®, ProbSim® (Konold & Miller, 1994), and Tinkerplots™ (Konold & Miller, 2004). These products are designed for teaching concepts in probability and sampling using simulations. They have been used extensively in the classroom to teach concepts to students, and have also been used to investigate teachers’ approaches to analyzing data (e.g., see Hammerman & Rubin, 2002, 2004).

While much of the statistics education literature is focused ultimately on student learning, recent studies have also examined teachers’ statistical thinking and reasoning. Hammerman and Rubin (2002, 2004), Makar and Confrey (2002), and Confrey, Makar, and Kazak (2004) have examined ways that teachers reason about statistical concepts (in particular, variation in data) and how using computer graphics tools can facilitate their understanding.

Hammerman and Rubin (2002, 2004) studied the strategies that teachers used to comprehend variability in data and examined how teachers explored and analyzed data using the Tinkerplots™ software. After learning to “bin” and visualize data using the graphical tools, teachers were less likely to use the sample average as a single overall representation of the data.

Makar and Confrey (2002) examined mathematics teachers’ statistical thinking about high-stakes test results at a low-performing high school in Texas. Using the computer-based statistical learning tool Fathom™ (Finzer, 2001), teachers created graphs to compare distributions of test scores for male and female students to determine if the differences in the centers between the two groups of data were significant. However, the research showed that, after attending a workshop in data analysis, teachers still used intuition to determine whether differences in centers of the distributions of test scores were significant, ignoring concepts like variability and sampling distributions in their reasoning.

Confrey et al. (2004) conducted additional studies to explore the statistical understanding of high-stakes test data by teachers. The purpose was to determine if a professional development course in the area of testing and statistics could improve teacher reasoning about statistical concepts as it relates to test data. The results indicated that participants made significant gains in their understanding of statistical concepts; however, it was not determined if teachers made significant gains in their understanding of test results per se.

Assessment Literacy

Since the enactment of NCLB, an increasing number of schools and districts are collecting massive quantities of student data, including standardized test results, using software management products. (See Wayman, Stringfield, and Yakimowski (2004) for a discussion of various software products designed for storing, organizing, and analyzing student, school, and district level data.) Administrators and teachers are expected to use these data to make educational and instructional decisions. Boudett, City, and Murnane (2005) outline eight critical steps to effectively use assessments to improve instruction and raise student achievement. According to Boudett et al. (2005), the second step is to “build assessment literacy,” i.e., develop a working knowledge of common concepts related to test score results, and acquire appropriate skills to interpret test score results. The text by Popham (2006a) is intended to improve the assessment literacy of educational leaders, and another recent contribution by Popham (2006b) consists of 15 booklets on assessment topics that are intended to help teachers increase their assessment knowledge.

Other resources for teachers and administrators are available, although in many cases their effectiveness has not been formally assessed. Test publishing companies offer products for professional development in assessment literacy. For example, Educational Testing Service (ETS) and CTB McGraw-Hill provide written guides and workshops for professional development in which teachers can learn to use data for instructional improvement. The Pathwise® Evidence-Centered Teaching and Assessment Workshops, offered through ETS, are designed for language arts and mathematics teachers as training sessions in the principles of formative assessment and linking of assessment to instruction. The Assessment Training Institute, founded by Rick Stiggins in 1993 and acquired by ETS in 2006, provides classroom assessment training needed to support educators. Books, videos, DVDs, and seminars are available through the institute.


Project Overview

The ITEMS project differs from much of the previous work in at least three respects. First, the ultimate target audience is school personnel themselves, rather than K-12 or college students. Also, rather than focusing on either statistical reasoning (e.g., Chance, 2002 or Mills, 2002) or on classroom testing (like Stiggins & Chappuis, 2005), the project is intended to help teachers and principals become educated consumers with respect to a broad range of topics in educational measurement and statistics. Finally, while the effectiveness of many assessment products currently available has not been evaluated, a major research component of the ITEMS project has been dedicated to investigating the effectiveness of the modules.

The project work was organized so that one module was produced and evaluated in each of the three years of the project. The topics of the three modules are as follows:

Module 1: Test Scores and Score Distributions

Module 2: Imprecision in Individual and Average Test Scores

Module 3: Interpretation of Test Score Differences and Trends

Table 1. Content of Instructional Modules

1. “What’s the Score?” (2005): Test Scores and Score Distributions
   • Mean, median, mode
   • Symmetric vs. skewed distributions
   • Range, standard deviation
   • Percentage above a cut-point (NCLB)
   • Raw scores
   • Percentile ranks
   • Grade-equivalents
   • Norm-referenced and criterion-referenced score interpretation

2. “What Test Scores Do and Don’t Tell Us” (2006): Imprecision in Individual and Average Test Scores
   • Measurement error and test reliability
   • Standard error of measurement
   • Confidence bands for test scores
   • Effect of sample size on precision of test score means
   • Precision of individual scores versus precision of means
   • Test bias

3. “What’s the Difference?” (2007): Interpretation of Test Score Differences and Trends
   • Importance of disaggregating data for key subgroups
   • Effect of population changes on interpretation of trends
   • Effect of number of students
   • Effect of changes in tests and test forms; test equating

A brief outline of the content of the three modules appears in Table 1.

For each module, the project work consisted of the following four phases, which took place over a period of roughly one year:

Development phase. The module is conceptualized and produced.

Research phase. Data are collected and analyzed to allow formal evaluation of the effectiveness of a Web-based version of the module.

Program evaluation phase. The project evaluator collects data from participants to determine their views on the usefulness and effectiveness of the module.

Dissemination phase. The module is posted on our project Website as a freely available resource for educators, along with supplementary materials. CD and DVD versions of the module are mailed to those who request them.

Principles of Module Development

The instructional modules rely on a case-based approach to learning (Lundeberg, Levin, & Harrington, 1999) in which realistic test score reports are used as a basis for explaining concepts and terminology. In each module, features of these reports are highlighted and explained. In computer-based learning environments, it has been found that individuals who are presented with material via an animated agent demonstrate better learning outcomes than those who are presented with the material via on-screen text and static graphs (Moreno, Mayer, Spires, & Lester, 2001). Therefore, the modules make liberal use of graphics, including computer animation, to present concepts.

For example, Figure 1 shows a screen shot from our second module that illustrates the concept of measurement error. The viewer is asked to imagine that Edgar, the child pictured in the display, takes a test several times, magically forgetting the content of the test between administrations. On the first occasion, he misreads a question; on the second, he guesses correctly; and on the third, he is accidentally given extra time. For these reasons, he gets slightly different scores on each imaginary test administration.

FIGURE 1. Illustration of measurement error from Module 2.
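A minimal simulation of this idea is sketched below; the true score, the size of the error component, and the number of administrations are all hypothetical and are not the values used in the module.

    import random

    random.seed(1)
    true_score = 72                        # Edgar's hypothetical "true" level of knowledge
    for occasion in range(1, 4):
        error = random.gauss(0, 3)         # chance factors: misread items, lucky guesses, extra time
        observed = true_score + error      # observed score = true score + measurement error
        print(f"Administration {occasion}: observed score about {observed:.0f}")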

In designing the modules, we have incorporated other findings from the cognitive psychology literature on learning through technology, as reflected in the following principles:

Multimedia principle

Concepts are presented using both words and pictures (static or dynamic graphics). Research has indicated that “. . . human understanding occurs when learners are able to mentally integrate visual and verbal representations” (Mayer, 2001, p. 5).

Contiguity principle

Auditory and visual materials on the same topic are, whenever possible, presented simultaneously, rather than successively, and words and corresponding pictures appear on the screen together rather than separately. Materials that incorporate these principles of temporal and spatial contiguity have been shown to enhance learning (Mayer, 2001, pp. 81–112).

Modality principle

Verbal information that accompanies graphical presentations is, in general, presented in spoken form rather than as on-screen text. Research has indicated that spoken verbal material is preferable in this context because, unlike displayed text, it “does not compete with pictures for cognitive resources in the visual channel” (Mayer, 2001, p. 140).

Coherence principle

Efforts have been made to remove extraneous words, pictures, and sounds from the instructional presentations. Irrelevant material, however interesting, has been shown to interfere with the learning process (Mayer, 2001, pp. 113–133).

Prior knowledge principle

The modules are designed to “use words and pictures that help users invoke and connect their prior knowledge” to the content of the materials (Narayanan & Hegarty, 2002, p. 310). For example, while participants may be unfamiliar with the term “sampling error,” most will be familiar with statements like, “The results of this survey are accurate within plus or minus five percentage points.” Analogies and metaphors have also been shown to enhance mathematical learning (English, 1997).

Conversational style

A conversational rather than a formal style is used in the modules; this has been shown to enhance learning, perhaps because “learners may be more willing to accept that they are in a human-to-human conversation including all the conventions of trying hard to understand what the other person is saying” (Mayer, 2003, p. 135). In keeping with this principle, formulas are not used in the instructional modules. (The relevant formulas are given in the supplementary handbooks for those who are interested in learning them.)

Project Staffing

The project staff who created the three instructional modules described in this article consisted of six individuals (some of whom worked on only one or two modules): the principal investigator (first author), who is an education professor with over 25 years of research experience in educational measurement and statistics; a senior researcher (second author), who is a statistics professor with experience in education research; three technical specialists (third, fifth, and sixth authors), who were graduate students in media arts and technology with experience in computer programming, graphics, and animation; and a professional animator (fourth author). The project staff also included a project administrator, who was responsible for much of the day-to-day management of the project, and a project evaluator, who conducted a semi-independent review of the project’s effectiveness.

The project staff met three times a year with an advisory committee consisting of public school teachers and administrators and university experts in human-computer interaction, multimedia production, multimedia learning, cognitive psychology, teacher education, educational technology, theoretical statistics, and math and statistics education.

Principles of Program Evaluation

Generally stated, the questions to be addressed by the program evaluation procedures for the ITEMS project are consistent with those considered by Owen (2007, p. 48) under the rubric of impact evaluation: “Have the stated goals of the program been achieved? Have the needs of those served by the program been achieved? . . . Is the program more effective for some participants than for others?” In our case, the stated goal of the project was to present psychometric and statistical information about the interpretation of standardized test scores that was comprehensible and applicable to everyday work situations encountered by the participants. Furthermore, we hoped that the information would be retained. In addition, we wanted the presentation format to be perceived as appealing and convenient by the users.

Four sources of information were used to address the evaluation questions. First, as described below, a quiz was constructed to correspond to each module. The purpose of the quiz was to determine whether participants comprehended the information presented in the module and could apply this knowledge to problems intended to resemble those encountered in their everyday work. An experimental paradigm involving random assignment was implemented in which half the participants took the quiz before seeing the module, and half took the quiz after viewing the module. Participants willing to be followed up took the quiz a second time, one month later, to measure retention. Second, a background survey was used to collect information about participants so that we could determine if the modules were more useful and effective for some participants than for others. Third, an independent evaluator conducted phone interviews (Module 1) and administered an evaluation survey (Modules 1, 2, and 3) to participants to assess the utility and effectiveness of the materials and to solicit suggestions for improvement. (The evaluator role was filled by one individual for the Module 1 evaluation and a second individual for the evaluation of Modules 2 and 3.) Fourth, participants provided comments via comment boxes included as part of the module and quiz administration, and, in some cases, via email messages to the project.
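One simple way to implement the random assignment described above is sketched below. The participant identifiers and the deterministic fifty-fifty coin flip are illustrative assumptions; the article does not specify the assignment mechanism actually used.

    import random

    def assign_condition(participant_id, seed=2005):
        # Deterministic coin flip per participant: half see the module first,
        # half take the quiz first.
        rng = random.Random(f"{seed}-{participant_id}")
        return "module first" if rng.random() < 0.5 else "quiz first"

    for pid in ["p001", "p002", "p003"]:   # placeholder participant IDs
        print(pid, "->", assign_condition(pid))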

Development and Evaluation of Module 1, “What’s the Score?”

Module 1 Development

Table 2. Module 1: Demographic Information for Participants

                                     Sample   Average   Percent   Percent in        Average Years of
                                     Size     Age       Female    Administrative    Teaching
                                                                  Positions         Experience
All Participants                     113      33.7      69.9      5.3               7.4
Teacher Education (TEP) Students      68      25.4      75.0      0.0               2.3
School District Personnel             45      46.6      62.2      13.3              15.1

The development of our first module, “What’s the Score?” began in 2004. By displaying conversations between the cartoon characters Stan, a somewhat clueless teacher, and Norma, a more informed one, this 25-minute module explores topics such as test score distributions and their properties (mean, median, mode, range, standard deviation), types of test scores (raw scores, percentiles, scaled scores, and grade-equivalents), and norm-referenced and criterion-referenced score interpretation. Table 1 provides some additional information on the content of the module.
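As a concrete companion to the Module 1 topics just listed, the sketch below computes the same descriptive statistics for a small, invented set of raw scores; the scores and the cut point are hypothetical and are not taken from any ITEMS materials.

    from statistics import mean, median, mode, pstdev

    scores = [12, 15, 15, 18, 20, 22, 25, 25, 25, 31]   # invented raw scores
    cut_point = 20                                       # hypothetical "proficient" cut score

    print("mean:", mean(scores))
    print("median:", median(scores))
    print("mode:", mode(scores))
    print("standard deviation:", round(pstdev(scores), 2))
    print("percent at or above the cut point:",
          100 * sum(s >= cut_point for s in scores) / len(scores))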

The module development process began with a wide-ranging review of literature, including research articles and software products related to statistics education and assessment literacy training, textbooks in measurement and statistics, the Standards for Teacher Competence in Educational Assessment of Students (American Federation of Teachers, NCME, & National Education Association, 1990), the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & NCME, 1999), research findings on the effectiveness of multimedia instruction (see above), and references on storyboarding and film-making (e.g., Begleiter, 2001). The module development process also drew on the results of a preliminary survey of the measurement and statistics knowledge of teachers, school administrators, and teacher education students.

Following these preliminary steps, a detailed outline of the module was developed. Text was then created and scenes were conceptualized and incorporated in a storyboard. To maximize accessibility, both low- and high-bandwidth versions of the Web-based module were produced using Macromedia Flash® (now Adobe Flash) software. The module was revised over a period of six months to reflect the feedback received. For example, after viewing an initial version, our advisory committee recommended shortening the module and dividing it into short segments, each preceded by a title slide. This recommendation was implemented before the research phase began. Also, on the recommendations of participants, “pause” and “rewind” capabilities were added to allow viewers to navigate among the seven segments of the module.

Module 1 Research Phase

The primary means of evaluating the effectiveness of the module involved the administration of a Web-based assessment literacy quiz that we developed to address the topics covered in the module. A 20-item multiple-choice quiz was pilot-tested on a group of teacher education students at UCSB and then revised substantially to improve its correspondence to the material included in the module. The purpose of the quiz was to allow us to determine whether viewing the module improved understanding of the concepts that were presented.

Sixty-eight teacher education program (TEP) students from UCSB, as well as 45 teachers and administrators from local school districts, were recruited to take part in a formal study of the effectiveness of the Web-based module. Participants received $15 gift cards from Borders. Demographic information on the participants is given in Table 2. Overall, participants had an average age of 34 and had taught for an average of seven years. Seventy percent were women. The TEP students, of course, tended to be younger and have fewer years of teaching experience than the school personnel.

When they logged on, participants completed a background survey and then were randomly assigned to one of two conditions: In one condition, the module was viewed before the quiz was administered; in the other, the quiz was administered first. By comparing participants from the two conditions, we were able to test the hypothesis that those who viewed the module first were better able to answer the quiz questions.

Table 3. Module 1: Results on the 20-Item Quiz

                                     Module-First Group      Quiz-First Group       Effect   t-test p-value
                                     Mean    SD    n         Mean    SD    n        Size     (1-sided)
All Participants                     13.2    3.7   52        12.0    3.4   61       .34      .042
Teacher Education (TEP) Students     13.1    4.0   33        11.7    3.5   35       .37      .059
School District Personnel            13.4    3.2   19        12.5    3.2   26       .28      .198

Data analysis

Psychometric analysis of the assessment literacy quiz showed that, for the total group, the quiz had an average percent correct of 63, an average item discrimination (corrected item-total correlation) of .28, a reliability of .71, and a standard error of measurement of 1.9 points.
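The sketch below shows one way to compute quantities of this kind from a 0/1-scored response matrix. The response data are simulated, coefficient alpha is used as the reliability estimate, and the standard error of measurement is taken as the score SD times the square root of (1 minus reliability); the article does not state which formulas the project used, so these are illustrative conventions rather than a description of the ITEMS analysis.

    import numpy as np

    rng = np.random.default_rng(0)
    # Simulated 0/1 responses, sized like the Module 1 sample (113 examinees, 20 items).
    # Real responses would show positive discriminations and a nonzero alpha.
    responses = (rng.random((113, 20)) < 0.63).astype(int)

    total = responses.sum(axis=1)
    percent_correct = 100 * responses.mean()

    # Corrected item-total correlation: each item against the total with that item removed.
    discriminations = [np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
                       for j in range(responses.shape[1])]

    # Coefficient alpha and a standard error of measurement based on it.
    k = responses.shape[1]
    alpha = (k / (k - 1)) * (1 - responses.var(axis=0, ddof=1).sum() / total.var(ddof=1))
    sem = total.std(ddof=1) * np.sqrt(1 - alpha)

    print(percent_correct, np.mean(discriminations), alpha, sem)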

As indicated in Table 3, results for the 113 participants showed that viewing the module had a small but statistically significant positive effect on quiz scores: Those in the “module-first” group answered an average of 13.2 of 20 questions correctly, compared with an average score of 12.0 questions for the “quiz-first” group (an effect size of .34 standard deviation units).
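A sketch of this kind of comparison appears below. It assumes an independent-samples t-test and a pooled-SD standardized mean difference (Cohen's d); the article reports effect sizes in standard deviation units without giving the exact formula, and the score arrays here are placeholders rather than ITEMS data. (As a rough check, (13.2 - 12.0) divided by a pooled SD of about 3.5 is approximately .34.)

    import numpy as np
    from scipy import stats

    def cohens_d(x, y):
        nx, ny = len(x), len(y)
        pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

    module_first = np.array([14, 13, 15, 12, 11, 16, 13, 12])   # placeholder scores
    quiz_first   = np.array([12, 11, 13, 10, 12, 11, 14, 10])   # placeholder scores

    # One-sided test of whether the module-first group scored higher (SciPy 1.6+).
    t, p = stats.ttest_ind(module_first, quiz_first, alternative="greater")
    print(f"d = {cohens_d(module_first, quiz_first):.2f}, one-sided p = {p:.3f}")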

Further analysis showed that the quiz item on which the two groups differed most addressed the topic of skewness of a test score distribution. On this question, 63% of the module-first group answered correctly, compared with only 34% of the quiz-first group. On the remaining quiz items, differences in the percentages correct for the two groups were typically less than 10 points.

When the overall results in Table 3 are disaggregated, it becomes apparent that the effect was larger among the TEP students than among the school personnel (see Figure 2). There are several possible explanations for this disparity. As noted, the TEP students differed in several ways from the school personnel (Table 2). In addition, Module 1 and the associated quiz were presented to these students in a group administration, with project staff available to trouble-shoot. By contrast, school personnel participated on an individual basis, at a time and place of their own choice.

FIGURE 2. Module 1: Average scores on 20-item quiz for school personnel and teacher education program students in quiz-first and module-first groups.

In addition to comparing the quiz results by TEP status, we were also interested in comparing quiz results by math instructor status. Participants who were either full-time math teachers or TEP student teachers were classified as math instructors if they indicated that math was the sole subject taught, while all other participants were classified as nonmath instructors. Because we expected math instructors to be more familiar with measurement and statistics concepts than other participants, we anticipated that they would perform better on the quiz than other participants. In fact, math instructors (n = 19) scored only slightly higher than participants who were not math instructors (n = 94). The average score for math instructors was 13.2 points (SD = 2.9), compared to an average score of 12.4 for other participants (SD = 3.7).

Follow-up quiz analysis

A follow-up analysis was conducted to determine the extent to which participants retained an understanding of the module material, as measured by their follow-up performance on the quiz. Individuals who agreed to participate in the follow-up phase of the Module 1 research were contacted about a month after they had first viewed the module and taken the quiz. They were given an online quiz identical to the one that they took during their first module-viewing session.

Unfortunately, of the original 113 participants, only 11 participated in the follow-up phase and completed the quiz again. Four participants were from the module-first group, and seven were from the quiz-first group. The average score on the quiz for the four members of the module-first group was 15.5 the first time they took the quiz, and it remained 15.5 when they retook the quiz (note that there was variability within the scores at the two different times). The average quiz score for the seven quiz-first follow-up participants was 14.3 the first time they took the quiz, and increased to 15.9 when they took it during the follow-up phase. Although the follow-up samples are small and cannot be assumed to be representative of the original participants, it is worth noting that on average, no loss of content knowledge was observed for either group.

Development and Evaluation of Module 2, “What Test Scores Do and Don’t Tell Us”

Module 2 Development

Our second module, “What Test Scores Do and Don’t Tell Us,” shows Stan, the teacher introduced in Module 1, meeting with parents Maria and Tim to discuss the test results of their twins, Edgar and Mandy. The module focuses on the effect of measurement error on individual student test scores, the effect of sample size on the precision of average scores for groups of students, and the definition and effect of test bias. Table 1 provides some additional information on the content of the module.
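The sketch below illustrates, with hypothetical numbers, two ideas at the heart of this module: a confidence band built from the standard error of measurement, and the way the precision of a group average improves as the number of students grows. The specific values (observed score, SEM, and score SD) are invented for illustration.

    import math

    # (1) Confidence band around an individual score.
    observed, sem = 72, 3                       # hypothetical score and SEM
    low, high = observed - 2 * sem, observed + 2 * sem
    print(f"roughly 95% band: {low} to {high}")

    # (2) Precision of a group average: the standard error of a mean is SD / sqrt(n),
    # so averages based on more students are more stable.
    sd = 30                                     # hypothetical SD of individual scores
    for n in (10, 40, 160):
        print(f"n = {n:3d}: standard error of the mean = {sd / math.sqrt(n):.1f}")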

As was the case in developing Module 1, a detailed outline was first developed. Text was then created and scenes were conceptualized and incorporated in a storyboard. Following that, a Web-based version of the module was produced using Flash® software and was revised over a period of several months to reflect the feedback received from our project advisory committee. Both low- and high-bandwidth versions were made available.

Module 2 includes two features not incorporated in Module 1. First, an optional closed captioning feature is available, which should facilitate the use of the module by participants who are hard of hearing. Also, the module includes four “embedded questions” (one for each main section) that allow participants to check their understanding of the material and view portions of the module a second time if they wish. The embedded questions are also intended to encourage participants to remain engaged in the process of watching the module.

Recruitment and Data Collection Procedures for Module 2

Table 4. Module 2: Demographic Information for Participants

                                     Sample   Average   Percent   Average Years of
                                     Size     Age       Female    Teaching Experience
All Participants                     104      31.4      68        11.3
Teacher Education (TEP) Students      81      25.7      73        2.6
School District Personnel             23      50.9      52        23.0

The 2006 research and program evaluation phases differed from their 2005 counterparts in three related respects. First, whereas we recruited only Central California teachers and educators in 2005, we recruited educators from around the nation in 2006. Our primary means of doing so was to place an advertisement in the educational magazine Phi Delta Kappan in January and February. We also adapted our Website to allow individuals to sign up for the project on the Website itself. In addition, we contacted professional organizations, focusing on those for minority professionals, to encourage them to sign up for project participation.

A second change that we implemented in 2006 was necessitated by the fact that we were working with participants who were in some cases thousands of miles away, making visits to schools impractical. We therefore developed an infrastructure that allowed all transactions that take place during the research and program evaluation phases to be conducted electronically. Provision of logon information to participants, collection of data, and even distribution of gift cards were electronic. (Participants received electronic certificates that are usable at Borders.com.)

A third change was that, rather than regarding a school or district as a target of recruitment, we directly enlisted the participation of individual teachers and administrators. This did not preclude the possibility that particular schools and districts would participate; rather, it allowed us to collect data from educators whether or not their schools or districts were participating on an institutional level.

Module 2 Research Phase

To test the effectiveness of Module 2, we used a new assessment literacy quiz, tailored to the topics of Module 2. Consistent with the recommendations of our advisory panel, we created an instrument that is more applications-oriented than the Module 1 quiz. Most questions include tables or graphics that resemble those that appear in standardized test results. Respondents are asked questions about the interpretation of these displays. Because the quiz was more directly linked to the material in the module than was the case for Module 1, we believe the Module 2 quiz provided a better test of the effectiveness of the module. This 16-item multiple-choice quiz was pilot-tested with UCSB TEP students.

One hundred four individuals participated in the research phase for Module 2, “What Test Scores Do and Don’t Tell Us.” Of those, 81 were TEP students from at least five universities, and 23 were school district personnel from 19 school districts across the country. Participants received $15 electronic gift certificates from Borders.com. Demographic information on the participants for this phase is given in Table 4. Sixty-eight percent of the participants were women, with an average age of 31 years and average teaching experience of 11 years. The TEP students were typically younger and had fewer years of teaching experience than the school personnel.

To develop a better picture of the school districts where participants worked, either full-time or as a student teacher, we collected additional information about the community surrounding the school district and the minority composition of the school district. In the background survey, participants were asked if they would describe the community surrounding the school district as “central city,” “urban fringe/large town,” or “rural/small town.” Of the 85 participants who responded to this question, 69% described the area surrounding their district as urban fringe, while 18% indicated that their community was rural, and 13% described it as central city.

Participants were also asked to estimate a range for the percentage of students in their school district who were African-American, Hispanic/Latino, Native American, or members of other ethnic minorities. The choices were as follows: less than 20%, 21%–40%, 41%–60%, 61%–80%, and 81%–100%. Diverse districts were well represented in the study. Of the 84 participants who responded to the question, 45% indicated that they worked in a district composed of 41% to 60% minority students, 19% worked in a district that was 61% to 80% minority students, 13% worked in a district that was less than 20% minority students, 12% worked in a district that was 81% to 100% minority students, and 11% worked in a district that consisted of 21% to 40% minority students.

Table 5. Module 2: Results on the 16-Item Quiz

                                     Module-First Group      Quiz-First Group       Effect   t-test p-value
                                     Mean    SD    n         Mean    SD    n        Size     (1-sided)
All Participants                     12.6    3.0   51        10.2    3.5   53       .74      .000
Teacher Education (TEP) Students     12.6    3.2   40         9.5    3.7   41       .90      .000
School District Personnel            12.7    1.9   11        12.5    1.4   12       .12      .375

Psychometric analysis of the 16-item multiple-choice assessment literacy quiz showed that the quiz had an average percent correct of 71, an average item discrimination of .40, a reliability of .79, and a standard error of measurement of 1.58 points.

As indicated in Table 5, results for the 104 participants showed a statistically significant effect: Those in the module-first group answered an average of 12.6 out of 16 questions correctly, compared with an average score of 10.2 questions for the quiz-first group (an effect size of .74).

Further analysis of the quiz results showed that the items that had the largest group differences in the percentages correct (between 20% and 27%) were primarily about measurement error or about the relation between the stability of sample means and sample size. For example, a question about the standard error of measurement was answered correctly by 88% of the module-first group, compared to 62% of the quiz-first group. For the remainder of the questions, the typical difference between the groups in the percentages correct was between 5% and 10%.

Table 5 shows that, as in Module 1, the effect of viewing the module was much larger for the TEP students (an effect size of .90) than for the school personnel (.12), a finding that is discussed further below. In Figure 3 we can observe that school personnel performed about the same (on average) on the quiz regardless of whether they viewed the module before or after taking the quiz; however, there was a difference in the average scores between TEP students who viewed the module before taking the quiz and those who viewed it afterward. This effect is more apparent than that obtained in the Module 1 evaluation phase.

FIGURE 3. Module 2: Average scores on 16-item quiz for school personnel and teacher education program students in quiz-first and module-first groups.

To parallel the analysis in the Module 1 evaluation phase, we compared quiz results between math instructors and nonmath instructors. Math instructors consisted of both full-time math teachers and TEP student teachers in math; all other participants were classified as nonmath instructors. As in the Module 1 evaluation phase, math instructors (n = 5) scored slightly higher on average than participants who were nonmath instructors (n = 99). The average score for math instructors was 14.4 (SD = 1.5), compared to 11.2 for nonmath instructors (SD = 3.5). (Note that only five participants were designated as math instructors in the Module 2 analysis, as compared to 19 in the Module 1 analysis. This may be because information on teaching was collected slightly differently on the two occasions. In the Module 1 background survey, participants were asked to indicate the subject they taught most frequently, while in the Module 2 survey, they were given the option to indicate multiple subjects that they taught. Participants who indicated only “math” were then designated as math instructors.)

Follow-up quiz analysis

As in the first module research phase, a follow-up analysis was conducted to determine the extent to which participants retained an understanding of the material from the second module. Participants in the Module 2 research phase were contacted about a month after they had first viewed the module and taken the quiz. They were given the identical quiz that they had previously taken during their first module-viewing session.

Due to improvements in the online administration system, contact with participants was more easily achieved, and a much higher follow-up rate was observed. Of the original 104 participants who took the quiz, 38 participated in the follow-up phase and completed the quiz again. Fifteen participants were from the module-first group, and 23 were from the quiz-first group. The average score on the quiz for the 15 members of the module-first group was 13.9 the first time they took the quiz, and it dropped slightly to 13.1 when they retook the quiz. However, this difference was not statistically significant (p = .20) at the .05 level, as determined by a matched-pairs t-test. Hence, there does not appear to be any significant loss of content knowledge for this group. The average quiz score for the 23 quiz-first follow-up participants was 10.6 the first time they took the quiz, and increased to 11.4 when they took it during the follow-up phase. This difference was marginally statistically significant (p = .05), which is consistent with an increase in content knowledge due to the effect of viewing the module. As in the Module 1 evaluation, interpretation is complicated by the fact that the follow-up samples are small and cannot be assumed to be representative of the original participants.
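A matched-pairs comparison of this kind can be carried out as sketched below; the paired score lists are placeholders, not the ITEMS follow-up data.

    from scipy import stats

    first_scores  = [14, 13, 15, 12, 14, 13, 16, 12, 15, 14, 13, 15, 14, 12, 15]  # placeholder
    retest_scores = [13, 13, 14, 12, 13, 12, 15, 12, 14, 13, 13, 14, 13, 12, 14]  # placeholder

    t, p = stats.ttest_rel(first_scores, retest_scores)  # matched-pairs t-test
    print(f"t = {t:.2f}, two-sided p = {p:.3f}")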

Development and Evaluation of Module 3, “What’s the Difference?”

Module 3 Development

Our third module, “What’s the Difference?” shows a press conference in which a superintendent is explaining recently released test results. Norma and Stan, the teachers introduced in the previous modules, help to demystify the test results for the reporters. The main topics addressed are the importance of disaggregating test data for key student groups and the implications for score trend interpretation of shifts in the student population, the number of students assessed, and changes in tests and test forms. (See Table 1 for details.)
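To see why disaggregation and population changes matter for trend interpretation, consider the small numerical illustration below: each subgroup's average rises from one year to the next, yet the overall average falls because the mix of students tested has changed. All numbers are hypothetical.

    # (n students tested, mean score) for two hypothetical subgroups in two years.
    year1 = {"group A": (100, 80.0), "group B": (100, 60.0)}
    year2 = {"group A": (60, 82.0), "group B": (140, 62.0)}

    def overall_mean(groups):
        n_total = sum(n for n, _ in groups.values())
        return sum(n * m for n, m in groups.values()) / n_total

    print("Year 1 overall:", overall_mean(year1))  # 70.0
    print("Year 2 overall:", overall_mean(year2))  # 68.0 -- lower, even though both groups improved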

As was the case in developing Modules 1 and 2, an outline and script were first developed. An animated Web-based module was then produced, revised, and incorporated into our data collection system. The closed captioning and embedded questions features introduced in Module 2 were retained.

Production values for Module 3 were substantially improved over the previous two modules. First, a professional animator with extensive cartooning experience joined the project. Second, students from the Dramatic Art department at UCSB were hired to voice the animated characters, rather than employing project staff for this purpose. Finally, the audio recording was conducted in a professional sound studio. These changes resulted in a module that was much more polished than its predecessors.

Recruitment and data collection procedures for Module 3 were similar to those in Module 2, except that we selected a different magazine, Learning and Leading with Technology, in which to place our advertisement, and we timed the ad to be concurrent with the data collection instead of preceding it. As an added incentive to participation, a feature was added that allowed participants the option of downloading a personalized certificate indicating that they had completed an ITEMS training module.

Module 3 Research Phase

The effectiveness of Module 3, “What’s the Difference?” was tested using an assessment literacy quiz tailored to the topics of the module. We began with a 16-item instrument, which was reduced to 14 items after being pilot-tested with UCSB TEP students.

Primarily because of reduced participation by the UCSB TEP students during the research phase, we were initially able to recruit only 23 participants. Four were UCSB TEP students and 19 were school district personnel from seven states. We were subsequently able to collect data from 10 teacher education students at California State University, Fresno, bringing the number of TEP students to 14. As before, participants received $15 electronic gift certificates from Borders.com.

Demographic information on the participants for this phase is given in Table 6. About 82% of the participants were women, the average age was 37, and the average number of years of teaching experience was 6.6. The teacher education students were generally younger and had fewer years of teaching experience than the school personnel.

Table 6. Module 3: Demographic Information for Participants

                                     Sample   Average   Percent   Average Years of
                                     Size     Age       Female    Teaching Experience
All Participants                     33       37.3      82        6.6
Teacher Education (TEP) Students     14       25        93        1.0
School District Personnel            19       46.4      74        13.8

As in the Module 2 research phase, we collected information about the community surrounding the school district and the minority composition of the school district. Participants were asked if they would describe the community surrounding the school district as “central city,” “urban fringe/large town,” or “rural/small town.” Of the 24 participants who responded to this question, 12 described the area surrounding their district as urban fringe, 3 indicated that their community was rural, and 9 described it as central city.

Participants were also asked to estimate a range for the percentage of students in their school district who were African-American, Hispanic/Latino, Native American, or members of other ethnic minorities. The choices were identical to those given in the Module 2 background survey: Less than 20%, 21%–40%, 41%–60%, 61%–80%, and 81%–100%. Diverse districts were well represented: Of the 24 participants who responded to the question, 8 worked in a district that consisted of 21% to 40% minority students, 7 indicated that they worked in a district composed of 41% to 60% minority students, 4 worked in a district that was less than 20% minority, 3 worked in a district that was 81% to 100% minority, and 2 worked in a district that was 61% to 80% minority.

Psychometric analysis of the 14-item multiple-choice assessment literacy quiz revealed that the quiz had an average percent correct of 63, an average item discrimination of .53, a reliability of .87, and a standard error of measurement of 1.47 points.
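The article does not say which discrimination index or reliability coefficient was computed. A common choice, assumed in the illustrative sketch below, is the corrected item-total correlation for discrimination and coefficient alpha (equivalent to KR-20 for dichotomously scored items) for reliability, with the standard error of measurement obtained as the total-score standard deviation times the square root of one minus the reliability. The function and its 0/1 response-matrix input are hypothetical, not taken from the study.

import numpy as np

def quiz_summary(responses):
    """Classical item analysis for a quiz scored 0/1 per item.
    responses: (n_examinees, n_items) array (illustrative input)."""
    X = np.asarray(responses, dtype=float)
    n, k = X.shape
    total = X.sum(axis=1)

    pct_correct = 100 * X.mean()  # average percent correct

    # Corrected item-total correlation: each item vs. the sum of the other items.
    discrimination = np.array(
        [np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(k)]
    )

    # Coefficient alpha (KR-20 for dichotomous items).
    alpha = (k / (k - 1)) * (1 - X.var(axis=0, ddof=1).sum() / total.var(ddof=1))

    # Standard error of measurement in score points.
    sem = total.std(ddof=1) * np.sqrt(1 - alpha)

    return pct_correct, discrimination.mean(), alpha, sem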


Table 7. Module 3: Results on the 14-Item Quiz

                                      Module-First Group     Quiz-First Group     Effect    t-test p-value
                                      Mean   SD    n         Mean   SD    n       Size      (1-sided)
All Participants                       9.1   4.2   18         8.5   4.1   15      .14       .33
Teacher Education (TEP) Students       6.5   4.1    8         5.5   2.1    6      .32       --
School District Personnel             11.2   3.0   10        10.4   4.0    9      .23       --

Note: Because of small sample sizes, a single t-test was performed for all participants combined; separate tests were not conducted for TEP students and school district personnel.
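Table 7 does not define the effect size or the exact form of the t-test. A standardized mean difference based on the pooled standard deviation, together with a one-sided pooled-variance t-test, reproduces the tabled values for all participants to within rounding; the sketch below is a reconstruction under that assumption, not the authors' documented analysis.

import numpy as np
from scipy import stats

def pooled_effect_and_pvalue(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference (pooled SD) and one-sided pooled-variance
    t-test of H1: group 1 mean > group 2 mean, from summary statistics."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)  # pooled variance
    effect_size = (m1 - m2) / np.sqrt(sp2)
    t_stat = (m1 - m2) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    p_one_sided = stats.t.sf(t_stat, df=n1 + n2 - 2)
    return effect_size, t_stat, p_one_sided

# All participants (module-first vs. quiz-first):
# pooled_effect_and_pvalue(9.1, 4.2, 18, 8.5, 4.1, 15) gives an effect size of
# about .14 and a one-sided p-value of about .34, close to the tabled .14 and .33.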

As indicated in Table 7, the effect of the module was not statistically significant. (Because of small sample sizes, a single t-test was performed for all participants combined; separate tests were not conducted for TEP students and school district personnel.) Those in the module-first group did have a slightly higher average quiz score (9.1) than those in the quiz-first group (8.5), however (see Figure 4). In both the module-first and quiz-first groups, school personnel performed substantially better than teacher education students. There were not enough data to make a meaningful comparison between math and nonmath instructors, as was done for the first two research phases, because only two participants were math instructors. One math instructor, with three years of teaching experience, received a perfect score on the quiz, while the other, who is currently a TEP student, answered 5 out of 14 items correctly.

FIGURE 4. Module 3: Average scores on the 14-item quiz for school personnel and teacher education program students in the quiz-first and module-first groups.

Further analysis of the quiz results did not show much difference between the scores of the module-first and quiz-first groups on most items. However, two questions did reveal substantial group differences: An item on the effect of sample size on the interpretation of improvements in average test scores was answered correctly by 83% of the module-first group, compared to 60% of the quiz-first group. A question about test equating was answered correctly by 78% of the module-first group, compared to only 27% of the quiz-first group.

Follow-up quiz analysis

A follow-up analysis was conducted to determine the extent to which participants retained an understanding of the material from the third module. Participants in the Module 3 research phase were contacted about a month after they had first viewed the module and taken the quiz. They were given the identical quiz that they had previously taken during their first session. Of the original 33 participants who took the quiz, 11 (33%) participated in the follow-up phase and completed the quiz again. Four participants were from the module-first group and seven were from the quiz-first group. The average score on the quiz for the four members of the module-first group was 12.3 the first time they took the quiz, and it remained the same when they retook the quiz. The average quiz score for the seven quiz-first follow-up participants was 10.7 the first time they took the quiz, and increased to 12.6 when they took it during the follow-up phase. Because of the very small number of participants in the follow-up phase, tests of significance were not conducted. As in the results from the previous follow-up studies, the increase in average score for the quiz-first group is consistent with an increase in content knowledge due to the effect of viewing the module.

Program Evaluation and Dissemination Phases for Modules 1, 2, and 3

During the program evaluation phases, project participants were asked to provide data to the project evaluator about the quality and usefulness of the modules. Requests to participate in the evaluation came directly from the evaluator, rather than the project director, and no incentives were provided for participation in this phase. Our intention was to keep the program evaluation phase as separate as possible from the main activities of the project, in hopes that participants would feel free to provide honest comments on the project materials.


The response rate for the program evaluation phase was low in all three years. For Module 1, seven teachers and six administrators (11.5% of the original participants) agreed to participate in phone interviews or complete paper surveys administered in person or by mail by the evaluator. In the survey, participants were asked to rate several aspects of the project on a 4-point Likert scale (poor, fair, good, excellent). Questions about the overall program, comprehensiveness of the material, and presentation of material yielded modal responses of “good”; questions on “relevance to my role” and on availability and access yielded modal responses of “excellent.” In addition, most participants said they had used or planned to use the materials in their work. Interview comments were “very positive, in general,” according to the evaluation report, but included both complimentary and critical feedback about the content and technical features of the module. The more negative comments tended to focus on the unavailability of navigation features, a problem that was later corrected.

The Module 2 program evaluation phase included a 22-item evaluation survey that was administered online. Approximately 5 weeks after their original participation, individuals received an email invitation to participate in the evaluation. Eleven individuals (10.6%) completed the surveys, which contained Likert items about the presentation, content, and impact of the module and open-ended questions about the quality of the module. Responses regarding presentation (e.g., quality of navigational tools) were uniformly positive, and responses on content (e.g., quality of examples) were all positive as well, except for one response to one of the five content questions. In the impact section, most respondents reported that they learned from the module, increased their confidence in the subject matter, and expected to change the way they talked to others about test results, though there was some disagreement among respondents on these issues. The open-ended questions on quality yielded positive responses about the clarity of presentation and about the examples and terms included in the module, as well as some negative responses about animation quality and about specific aspects of the content.

Informal comments that participants entered in comment windows during the original or follow-up data collection were also analyzed. Roughly one-third of the overall comments were positive, one-third were neutral, and one-third were negative. Among the recommendations made by the participants were to make definitions of terms available in the module through pop-up windows or some other means, and to increase the number of examples used to illustrate concepts.

The Module 3 evaluation procedures and survey paralleled those of Module 2. Unfortunately, only two of the original 33 participants (6.1%) completed the survey. These two respondents provided uniformly positive responses about the presentation, content, and impact of the module. Nineteen participants made a total of 27 comments in the windows provided during the module and quiz phases. Eleven of these comments were positive, 10 were negative, and 6 were neutral. The comments lacked a common theme. For example, while two viewers found the video aspect of the module distracting, another praised the graphics. And while one viewer thought the video was slow and repetitive in its treatment of the material, another called it “somewhat advanced.”

In addition to the information collected via the evaluation survey and comment boxes, we also asked Module 3 participants to respond, following the module, to a multiple-choice item soliciting their views about the embedded questions posed at the end of each scene. Of the 32 participants who responded, 30 found the questions “somewhat helpful” or “very helpful,” while the other 2 indicated they were “neither annoying nor helpful.” None considered them “distracting or annoying.”

Following their respective research and program evaluation phases, the three modules were revised according to the recommendations of participants and advisory committee members. They were then made freely available on the Web, along with their associated quizzes and online “handbooks” that provide formulas, supplementary explanations, and references. Educators who preferred not to use the Web-based version of the module could request CDs or DVDs, which were mailed to them at no cost.

Discussion

A key finding of this study is that our instructional modules are effective training tools for teacher education program students. TEP students who took the quiz after seeing the module performed better than those who took the quiz before seeing the module. The effect sizes for these students were .37 for Module 1 (Table 3), .90 for Module 2 (Table 5), and .32 for Module 3 (Table 7). Results were statistically significant only for the first two modules. Follow-up analyses suggested that this information was retained a month later; however, because the small number of individuals who participated in the follow-ups cannot be assumed to be representative of the total group of initial participants, this finding must be regarded as tentative.

In the case of school personnel, there was no statistically significant difference between those who saw the module first and those who took the quiz first (Tables 3, 5, and 7). For all three modules, school personnel, on average, outperformed TEP students. For Module 2, which produced the largest effect for TEP students, the average quiz score for the school personnel (12.7 and 12.5 for the module-first and quiz-first groups, respectively) was nearly identical to that of the TEP students who saw the module before taking the quiz (12.6). This suggests that, whether or not they had seen the module, the school personnel, who had an average of 11 years of experience, were as familiar with the included measurement and statistics material as the TEP students who had already viewed the module.

A factor that somewhat complicates the interpretation of the larger effects for the teacher education students for all three modules is that, one to four months before the research phase, a portion of them had participated in a pilot test of the assessment literacy quiz. Those who took part in the pilot test responded to the quiz and provided comments about the clarity of the wording and the difficulty of the material. They did not receive the answers to the quiz, nor did they view the module. Those who took part in the pilot were just as likely to end up in the module-first group as in the quiz-first group during the research phase, so that any effects of pilot participation should have affected both experimental groups equally.


Furthermore, in the case of the Module 1 quiz, the instrument was almost entirely overhauled before the research phase, so that the final version bore little resemblance to the pilot version. Nevertheless, it is possible that pilot participation had some effect.

More generally, caution is warranted when interpreting the data analysis results. While the TEP students from UCSB were required by their own faculty to participate in the main data collection for Modules 1 and 2, the remainder of the individuals who participated in the research phase chose to do so, resulting in nonrandom samples of the corresponding populations. In addition, samples ranged from small to moderate in size and cannot be assumed to be representative of school personnel or teacher education program students nationwide. Nevertheless, the results are encouraging and do indicate that, overall, the modules are having a positive impact on content knowledge in educational measurement and statistics.

Supplemental analysis of quiz performance showed that math instructors tended to perform better than other participants (Modules 1 and 2) and that the topics participants were most likely to learn about from the modules were skewness of a test score distribution (Module 1), measurement error (Module 2), the stability of sample means (Module 2), the effect of sample size on the interpretation of changes in average scores (Module 3), and test equating (Module 3).

Most comments relayed to the project via email (not included in the formal program evaluation) have been positive. Some examples are as follows:

“Very helpful and right to the point. If I were a building principal or a department chair today all of the staff would go through this until everyone really understood it.”

“I am inclined to recommend [Module 1] as required viewing for all new hires in our K-12 district, and it certainly will be recommended . . . for inclusion in professional development on assessment literacy.”

“I will be sharing [Module 1] with my Assistant Superintendent with the hope of promoting it as a part of our new teacher induction process.”

In addition, the teacher education programs at UCSB and at California State University, Fresno have now incorporated the ITEMS materials into their curriculums.

By far the greatest challenge in the ITEMS project has been the recruitment of participants. Overworked teachers who are overwhelmed by the demands of NCLB are unlikely to undertake an optional activity like this one. Ironically, then, one reason that educators do not have time to learn about the technical aspects of testing is that they are occupied with the preparation and administration of tests.

As part of our effort to increase awareness of the project, we have issued press releases and made contacts with the Corporation for Educational Network Initiatives in California (CENIC), the University of California Office of the President, the California Teachers Association Institute for Teaching, the California Department of Education Beginning Teacher Support and Assessment program, and the California County Superintendents Educational Services Association. In addition, we have made conference presentations, posted project information on education-oriented Websites and blogs, and used listservs to contact professional organizations, focusing on those for minority professionals.

Following the completion of the supplementary data collection, we will focus for the remainder of the project on the dissemination of our materials, particularly to teacher education programs and new teachers. It is our hope that even those educators who were hesitant to participate in the research aspects of the project will find the ITEMS materials useful as a resource.

Acknowledgments

We appreciate the support of the National Science Foundation (#0352519). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We are grateful to Liz Alix, Pamela Yeagley, and Lois Phillips for their contributions. For more information on the project, refer to the project Website at http://items.education.ucsb.edu or contact the first author at rzwick@education.ucsb.edu.

Note

1. These three organizations subsequently collaborated on the development of a manual, Interpreting and Communicating Assessment Results (National Council on Measurement in Education, 1997). The project was led by Barbara S. Plake and James C. Impara and was funded by a grant to NCME from the W. K. Kellogg Foundation.

References

Adams, J. E., & Copland, M. A. (2005). When learning counts: Rethinking licenses for school leaders. Available at http://www.ncsl.org/print/educ/WhenLearningCounts_Adams.pdf.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

American Federation of Teachers, National Council on Measurement in Education, & National Education Association (1990). Standards for teacher competence in educational assessment of students. Available at http://ericsys.uncg.edu/AFT-NEA-AERA-Standards for Teacher Competence in Education.htm.

Begleiter, M. (2001). From word to image: Storyboarding and the filmmaking process. Studio City, CA: Michael Wiese Productions.

Boudett, K. P., City, E. A., & Murnane, R. J. (Eds.) (2005). Data wise: A step-by-step guide to using assessment results to improve teaching and learning. Cambridge, MA: Harvard Education Press.

Brown, T., & Daw, R. (2004). How do K-12 teachers and administrators feel about standardized testing and how well can they interpret results of such tests? Unpublished report, Santa Barbara, CA: University of California.

Chance, B. (2002). Components of statistical thinking and implications for instruction and assessment. Journal of Statistics Education, 10(3). Available at www.amstat.org/publications/jse/v10n3/chance.html.

Confrey, J., Makar, K., & Kazak, S. (2004). Undertaking data analysis of student outcomes as professional development for teachers. International Reviews on Mathematical Education (Zentralblatt für Didaktik der Mathematik), 36(1), 32–40.

delMas, R. C., Garfield, J., & Chance, B. L. (1999). A model of classroom research in action: Developing simulation activities to improve students’ statistical reasoning. Journal of Statistics Education, 7(3). Available at http://www.amstat.org/publications/jse/secure/v7n3/delmas.cfm.

English, L. D. (1997). Mathematical reasoning: Analogies, metaphors, and images. Mahwah, NJ: Erlbaum.

Finzer, W. (2001). Fathom! (Version 1.12) [Computer software]. Emeryville, CA: Key Curriculum Press.


Garfield, J., delMas, R. C., & Chance, B. L. (2003). The Web-based ARTIST: Assessment resource tools for improving statistical thinking. Presented at the annual meeting of the American Educational Research Association, Chicago. Available at http://www.gen.umn.edu/artist/publications.html.

Hammerman, J., & Rubin, A. (2002). Visualizing a statistical world. Hands On!, 25(2), 1–23. Available at http://www.terc.edu/downloads/Ho Fall o2.pdf.

Hammerman, J., & Rubin, A. (2004). Strategies for managing statistical complexity with new software tools. Statistics Education Research Journal, 3(2), 17–41.

Impara, J. C. (1993). Joint Committee on Competency Standards in Student Assessment for Educational Administrators update: Assessment survey results. Paper presented at the annual meeting of the National Council on Measurement in Education. Available at www.eric.ed.gov.

Jennings, J. (2002). New leadership for new standards. Leaders Count Report, Wallace-Reader’s Digest Funds, Spring/Summer issue. Available at http://www.ctredpol.org.

Konold, C., & Miller, C. (1994). Prob Sim: A probability simulation program [Computer software]. Santa Barbara, CA: Intellimation Library for the Macintosh.

Konold, C., & Miller, C. (2004). TinkerPlots (Version 1.0) [Computer software]. Emeryville, CA: Key Curriculum Press.

Lukin, L. E., Bandalos, D. L., Eckhout, T. J., & Mickelson, K. (2004). Facilitating the development of assessment literacy. Educational Measurement: Issues and Practice, 23(2), 26–32.

Lundeberg, M. A., Levin, B. B., & Harrington, H. L. (Eds.) (1999). Who learns what from cases and how? The research base for teaching and learning with cases. Mahwah, NJ: Erlbaum.

Makar, K., & Confrey, J. (2002). Comparing two distributions: Investigating secondary teachers’ statistical thinking. Presented at the Sixth International Conference on Teaching Statistics, Cape Town, South Africa. Available at http://www.stat.auckland.ac.nz/~iase/publications/1/10_18_ma.pdf.

Mayer, R. E. (2001). Multimedia learning. Cambridge, UK: Cambridge University Press.

Mayer, R. E. (2003). The promise of multimedia learning: Using the same instructional design methods across different media. Learning and Instruction, 13, 125–139.

Mills, J. (2002). Using computer simulation methods to teach statistics: A review of the literature. Journal of Statistics Education, 10(1). Available at http://www.amstat.org/publications/jse/v10n1/mills.html.

Moreno, R., Mayer, R., Spires, H., & Lester, J. (2001). The case for social agency in computer-based teaching: Do students learn more deeply when they interact with animated pedagogical agents? Cognition and Instruction, 19, 177–213.

Narayanan, N. H., & Hegarty, M. (2002). Multimedia design for communication of dynamic information. International Journal of Human-Computer Studies, 57, 279–315.

National Council on Measurement in Education (1997). Interpreting and communicating assessment results: Professional development resource materials. Washington, DC: Author.

National Council on Measurement in Education Ad Hoc Committee on the Development of a Code of Ethics (1995). Code of professional responsibilities in educational measurement. Available at www.ncme.org.

Owen, J. M. (2007). Program evaluation: Forms and approaches (3rd ed.). New York: Guilford.

Pedulla, J., Abrams, L., Madaus, G., Russell, M., Ramos, M., & Miao, J. (2003). Perceived effects of state-mandated testing programs on teaching and learning: Findings from a national survey of teachers. Chestnut Hill, MA: Center for the Study of Testing, Evaluation, and Educational Policy, Boston College.

Popham, W. J. (2006a). Assessment for educational leaders. Boston: Pearson.

Popham, W. J. (2006b). Mastering assessment: A self-service system for educators. New York: Routledge.

Snell, J. L., & Finn, J. (1992). A course called “Chance.” Chance Magazine, 5(3–4), 12–16.

Stiggins, R. (2002). Assessment for learning. Education Week, 21(26), 30, 32–33.

Stiggins, R., & Chappuis, J. (2005). Using student-involved classroom assessment to close achievement gaps. Theory Into Practice, 44(1), 11–18.

Wayman, J. C., Stringfield, S., & Yakimowski, M. (2004). Software enabling school improvement through the analysis of student data (Report No. 67). Baltimore, MD: Johns Hopkins University, Center for Research on the Education of Students Placed At Risk. Available at http://www.csos.jhu.edu/crespar/techReports/Report67.pdf.
