
Language Testing 2015, Vol. 32(1) 3–20
© The Author(s) 2014
Reprints and permissions: sagepub.co.uk/journalsPermissions.nav
DOI: 10.1177/0265532214531254
ltj.sagepub.com

A comparison of video- and audio-mediated listening tests with many-facet Rasch modeling and differential distractor functioning

Aaron Olaf Batty
Keio University, Japan; Lancaster University, UK

Abstract

The rise in the affordability of quality video production equipment has resulted in increased interest in video-mediated tests of foreign language listening comprehension. Although research on such tests has continued fairly steadily since the early 1980s, studies have relied on analyses of raw scores, despite the growing prevalence of item response theory in the field of language testing as a whole. The present study addresses this gap by comparing data from identical, counter-balanced multiple-choice listening test forms employing three text types (monologue, conversation, and lecture) administered to 164 university students of English in Japan. Data were analyzed via many-facet Rasch modeling to compare the difficulties of the audio and video formats; to investigate interactions between format and text-type, and format and proficiency level; and to identify specific items biased toward one or the other format. Finally, items displaying such differences were subjected to differential distractor functioning analyses. No interactions between format and text-type, or format and proficiency level, were observed. Four items were discovered displaying format-based differences in difficulty, two of which were found to correspond to possible acting anomalies in the videos. The author argues for further work focusing on item-level interactions with test format.

Keywords
differential distractor functioning, language assessment, listening assessment, many-facet Rasch measurement, nonverbal communication, video listening test

Corresponding author:
Aaron Olaf Batty, Keio University, Fujisawa, 5322 Endo, Fujisawa, Kanagawa 252-0882, Japan. Email: [email protected]



In recent years, as the price of quality video production equipment has fallen, the use of video in tests of foreign language listening comprehension has become feasible for many more institutions and test designers, prompting an increase in research on the topic. Informing these studies is a wealth of psychological research illustrating the importance of nonverbal signals to human communication (for an overview, see Burgoon, 1994), as well as published research on the effect of visuals on listening and comprehension in general, from the use of supplementary still graphics (Ginther, 2002; Ruhe, 1996) to the impact of nonverbal cues and gestures (Kellerman, 1992; Riseborough, 1981; Sueyoshi & Hardison, 2005). The majority of this work supports the conclusion that the inclusion of nonverbal information with a stream of aural language enhances comprehension and recall.

Studies attempting to apply such findings specifically to tests of listening comprehension in the form of video-mediated tests have risen in frequency since the 1980s at the very latest (Parry & Meredith, 1984), gaining momentum in the 1990s (e.g., Baltova, 1994; Brett, 1997; Gruba, 1993, 1997), and continuing through the 2000s and up to the present day (e.g., Coniam, 2001; Cubilo & Winke, 2013; Hernandez, 2004; Ockey, 2007; Sueyoshi & Hardison, 2005; Suvorov, 2009, 2013), especially through the extensive writings of Wagner (2002, 2006, 2007, 2008, 2010a, 2010b, 2013). However, despite a firm theoretical grounding for the importance of nonverbal information in listening comprehension, comparative studies of video- and audio-mediated listening comprehension tests have produced contradictory or inconclusive results.

Recent work by Ockey (2007) and Wagner (2008), as well as earlier work by Progosh (1996), has focused on examinee perceptions of the formats. Many more studies have primarily examined mean score differences between the formats (e.g., Baltova, 1994; Brett, 1997; Coniam, 2001; Gruba, 1993; Parry & Meredith, 1984; Shin, 1998; Sueyoshi & Hardison, 2005; Wagner, 2010b). However, despite the large amount of work on the question of whether and how the inclusion of video on a test of L2 listening comprehension affects examinee performance, all previous work has relied solely on classical test theory (CTT) methods, rather than the objective measurement models available to the researcher and test designer in the form of item response theory (IRT). Questions of dimensionality and differential functioning are most easily explored – if not answered – using IRT. The present study aims to address this deficiency by implementing the many-facet Rasch model (MFRM) to evaluate the comparability of audio- and video-mediated formats of a test of L2 listening comprehension administered in a quasi-experimental setting. It also explores the interactions of text type and format, and of proficiency and format, on the results of the test. Finally, format differences at the item level are examined through facet interaction and differential distractor functioning (DDF) analyses.

Background

The construct of listening comprehension

The listening construct has frequently been called the most difficult to define or target for assessment because it occurs, invisibly, inside the listener's own mind (Brindley, 1998; Lynch, 1998). The process of extracting meaning from aural input is unconscious but highly complex. Beyond the obvious informational input of sound waves, listeners derive and enrich meaning through other channels, for example, physical setting, cultural expectations, and nonverbal cues such as gestures or facial expressions (Brindley, 1998; Buck, 2001; Lynch, 1998). Buck argues that these various channels are used by the listener to infer and construct the meaning of what he or she is hearing, and that the meaning is not contained in the audio stream alone (2001, p. 29). Rost has also written authoritatively on listening comprehension, detailing the many processes occurring in the listener's mind that make this leap from sound to meaning possible. Neurological processes, from hearing through consciousness, combine with linguistic knowledge, which includes various visual signals acting as a co-text, and pragmatics, including all of the social conventions surrounding the interaction (Rost, 2002). Meaning is constructed in the listener's mind by synthesizing these various streams of information, to varying degrees, and this is the case regardless of proficiency in the target language. All listening comprehension is understood to follow this basic process. This richness of input, however, often results in consternation for the listening comprehension assessor.

The role of visuals in listening comprehension

Listening rarely occurs "in the dark." In virtually all situations and circumstances, sighted listeners can see those speaking to them; the only common exceptions are telephone conversations, radio broadcasts or podcasts, and public address systems. In most cases, listeners are able to glean extra-linguistic information from such sources as gestures, facial expressions, and the setting, with these visual signals operating as a "co-text" to the verbal signals (Rost, 2002, p. 37).

Burgoon argues that nonverbal communication is the primary form of communication for our species, having developed earlier both in our evolution and in our bodily development (1994). In her review of the relevant literature, she finds that approximately 70% of the meaning in a social situation is encoded in the visual, as opposed to the verbal, channel, and that the visual channel is particularly well suited to communicating social cues, especially when those cues are incongruous with the verbal message. Although cultures, genders, and individuals may be sensitive to different cues to differing degrees, adults tend to rely on nonverbal cues more heavily. The verbal channel, however, is typically favored when the content is factual, abstract, or persuasive in nature. Overall, it can be understood that nonverbal communication is an important carrier of social meaning in human interactions such as conversations, but may be ignored if it is found to be irrelevant to the central task or message (Burgoon, 1994; Weidenmann, 1989).

Since nonverbal communication is so important to human communication, various language researchers have investigated its impact on the learning of a foreign language, especially since the advent of the videocassette recorder. Whereas in previous generations the only option for foreign language listening instruction was audiotape, the VCR (and all video technologies succeeding it) allowed teachers to incorporate visual information into their listening lessons. One of the earliest investigations of the efficacy of video in L2 instruction was conducted by Riseborough (1981), wherein the researcher presented a mix of audio only, face-only video, video with vague gestures, and video with more explicitly meaningful gestures, finding that more nonverbal cues led to better recall and comprehension. These findings confirmed those of several studies from the 1950s through the very early 1980s, reviewed by MacWilliam (1986), which found that, in general, comprehension benefits from seeing the speaker. Interestingly, several of these studies also found that other types of visual stimuli, such as pictures or other video forms, as in the case of a newscast, had a deleterious effect on overall comprehension, a finding shared by Mayer (1997) within the context of multimedia instruction materials.

Beyond the visual cues listeners receive from the speaker, there are external cues which can aid in comprehension. These usually provide the listener with information about the physical and situational context, which aids in his or her ability to understand the information presented (Lynch, 1998; Rost, 1990). An example might be a university lecture hall, signaling to the listener to expect an academic topic and perhaps a somewhat formal register; another may be a coffee shop, setting an expectation of a more informal exchange between multiple people. Such cues draw upon background knowledge, thereby helping listeners interpret what they hear. Two common theories to explain this effect are script and schema theory. Whereas scripts contain detailed procedural and role information for a particular setting or activity, schemata are perhaps best understood as archetypes of concepts, events, people, places, and so on (Buck, 2001, p. 20). Listeners who are cued to these sources of background knowledge can spare themselves the cognitive-processing overhead of struggling to understand the scene, framing the verbal information in terms of the relevant activated scripts and/or schemata.

Visuals in listening comprehension tests

Since the importance of visual cues to listening comprehension is well known and uncontroversial, many researchers have sought to include them in tests, in both still and video form. Overall, findings have indicated that the addition of visual information to aural input increases comprehensibility. Mayer reviewed a large number of previous studies on the effect of including multimedia in recorded L1 science lessons, concluding that, overall, the addition of visual information increased comprehension, especially when it was simultaneous with the explanation, and especially for students with little prior knowledge (1997). Ginther investigated the effect of still visuals on L2 listening comprehension tests in a study conducted with ETS on the TOEFL (2002). She found no main effect for the presence or absence of visuals; some significant interactions between the presence of visuals and text type were observed, but effect sizes were extremely small throughout. Ginther suggested that examinees may be distracted by context-setting visuals, such as pictures of the setting, when the content of the verbal channel is difficult, while the inclusion of content-supporting visuals, such as charts and graphs, enhances understanding, a suggestion supported by the literature discussed above.

Others have investigated tests including full video of people speaking and/or interacting, often comparing them to an audio-only version of the same test. Possibly the first of these was that of Parry and Meredith (1984), which compared performance on a video-mediated Spanish listening test to that on an audio-only version, finding the video to be easier for all proficiency levels. Baltova (1994) undertook two studies comparing audio-only, video-mediated, and silent (i.e., video-only) versions of a French listening test, also finding the video-mediated version to be easier, but with the added, somewhat perplexing, finding that scores on the silent version were roughly the same as those on the video-mediated version. Gruba (1993) developed a video-mediated English listening test based on a video-recorded lecture on air traffic safety and administered it and an audio-only version to two intact classes of ESL students studying at a US university, finding no difference in scores. Progosh (1996) investigated examinee perceptions of video-mediated listening comprehension tests by constructing two test sections utilizing video. In the first, examinees watched a conversation about movies and identified the speakers, answered some multiple-choice questions, and identified the opinions of the characters in a table. In the second, examinees watched a cooking monologue, checked off ingredients used in the dish, and answered several short-answer questions. No comparison with an audio-only test was made. The students were then asked to compare their video test experience with their memories of their experiences with audio-only tests. Examinees reported that they preferred video, and that they found it easier. Gruba, in his overview of the use of video in foreign language assessment (1997), reports that most research comparing video-mediated to audio-mediated tests finds no difference between the formats, which he argues is cause to reject such score-comparison studies, as they do not accurately reflect the effects the formats have on examinees. Instead, he recommends the use of verbal report protocols to ascertain the effect of the inclusion of video. Although he remains a proponent of video-mediated listening tests, he does caution the test designer on the difficulty of attaining both reliability and construct validity, due to the inclusion of an even larger set of factors than in already-complicated audio-only listening comprehension tests.

Several studies have investigated the amount of time spent watching the videos in such tests. In his 2007 study, Wagner recorded how much time his examinees spent oriented toward the screen, in an attempt to understand whether they found the videos to be a help or a hindrance to comprehension. He found a significant text-type effect, with examinees orienting themselves toward the video an average of 72% of the time during dialogues and 67% of the time during lecture videos. However, in the sample of 36 examinees, the variation was considerable, with some examinees watching the videos up to 90% of the time and others less than 50%. Ockey (2007) found a similar result with his six participants, who watched a video of a lecture in his listening test anywhere from 2.5% to 73.6% of the time, with an overall average of 44.9%. In a separate study by Wagner (2010a), a negative correlation was observed between time spent watching the video and total score. The lack of consistency between examinees and between studies may raise concerns about the generalizability of such results to other contexts.

Many other studies include questionnaire data on examinee preferences regarding the formats, with varied results. Whereas Baltova's (1994) and Sueyoshi and Hardison's (2005) participants reported a preference for the video-mediated test, the participants in Coniam's (2001) and Suvorov's (2009) studies showed a preference for audio, with Coniam's participants complaining that the video was distracting. Brett (1997) found that participants preferred multimedia presentations to either audio or video passages, and the preferences of Hernandez's (2004) participants are difficult to interpret, as the item wording in her instrument is ambiguous.


Since most listening tests are still essentially audio-only, there has also been considerable interest in comparing them to video-mediated listening tests, with mixed results. Although some studies (e.g., Baltova, 1994, study 2; Coniam, 2001; Cubilo & Winke, 2013; Gruba, 1993; Suvorov, 2013) find no difference, the majority (e.g., Baltova, 1994, study 1; Brett, 1997; Hernandez, 2004; Parry & Meredith, 1984; Shin, 1998; Sueyoshi & Hardison, 2005; Wagner, 2010b, 2013) find that video-mediated listening tests are generally easier than their audio-only counterparts.

Another frequently visited topic is the interaction between test format and text type, drawing on the psychological research demonstrating that the nonverbal channel is of more importance in social settings, and less so in those focused on relaying information. This general finding led Buck to hypothesize that video would play more of a part in the comprehension of "interactional" (i.e., conversational) than of "transactional" (e.g., lectures) language use (2001, p. 172). Suvorov (2009) found that video-mediated lectures were significantly harder than both their audio-only counterparts and conversations under either condition. In contrast, Wagner (2010b) found that the addition of video improved scores on conversational and academic passages to approximately the same degree, although effect sizes were small (Cohen's d = 0.29 and 0.34, respectively). These findings alone demonstrate the lack of consensus in the literature on this topic.

The methodologies employed by these studies are varied, and sometimes not ideal. Some (e.g., Baltova, 1994; Gruba, 1993; Parry & Meredith, 1984; Suvorov, 2013) lack counterbalancing to ensure that all examinees saw both conditions, or that all items were administered under both conditions. Many (e.g., Baltova, 1994; Gruba, 1993; Parry & Meredith, 1984; Suvorov, 2009) lack an external measure of foreign language proficiency to control for ability. Baltova (1994) changed the items between her two studies to the extent that they are not comparable; Shin (1998) removed all pauses in speech from the video (resulting in an inauthentic delivery, to be sure) and used committee-rated free-answer questions that are difficult to interpret. The Gruba (1993) instrument had a reliability coefficient of 0.45, a problem shared by the Hernandez (2004) instrument, with a reliability coefficient of 0.401. Partly owing to these problems, the body of work on this topic is rather difficult to interpret.

A curious gap in the literature on this topic, however, is the absence of any form of item response theory (IRT) in the analyses. All of the studies discussed here rely upon raw score comparisons, which are well suited to revealing differences between groups and conditions, but which make comparisons at the item level difficult. In Wagner's close examination of individual item functioning under the audio and video conditions (2010b), items were singled out for greater scrutiny if their scores differed by 10% or more between the conditions, a criterion which seems somewhat arbitrary.

A methodology which could shed more light upon the interaction between format and items is many-facet Rasch measurement (MFRM; Linacre, 1989), which conceptualizes a datum (in the case of multiple-choice test data, a one or a zero) as the product of the difficulties of the facets from which it results. The most common application of MFRM is in human-rated assessments, such as speaking tests with multiple raters (e.g., Bonk & Ockey, 2003), where a score is the result of the interaction between the rater, the prompt, and the examinee, but it has also been used to detect and analyze differential item and person functioning (e.g., Engelhard, 2009).
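In notational form, the dichotomous three-facet model used in the present study (examinees, items, and formats; see the Data analysis section below) expresses the log-odds of a correct response as an additive function of the facet measures:

$$\ln\!\left(\frac{P_{nif}}{1-P_{nif}}\right) = B_n - D_i - F_f$$

where $P_{nif}$ is the probability of examinee $n$ answering item $i$ correctly under format $f$, $B_n$ is the ability of the examinee, $D_i$ is the difficulty of the item, and $F_f$ is the difficulty of the format.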

Application of MFRM to audio- and video-mediated listening comprehension test data could place the formats on the same difficulty scale as the items, allowing comparison even at the item level. It would also facilitate the investigation of other interactions in the data, as any number of facets can be entered into the model with measures constrained at the average difficulty so that they do not contribute to the estimates of other facets. This method can be used to identify differences in difficulty due to the delivery format while still taking advantage of a parsimonious model. Furthermore, the location of all items and formats on the same difficulty continuum would allow for more principled comparisons of item functioning under the two format conditions than those used in the literature to date.

Finally, the use of differential distractor functioning (DDF) analysis could shed more light on the interaction between format and individual items, at the intra-item level. DDF is similar to DIF; however, it is concerned not only with whether groups select the correct option in response to multiple-choice questions, but also with which incorrect distractor option they select. If patterns in incorrect responses can be identified, possible reasons for them can then be hypothesized based on a qualitative re-examination of the test material.

Research questions

The present research seeks to address the following questions through the use of MFRM and DDF analyses:

RQ1: How does the difficulty of an audio-only test of foreign language listening comprehension compare to an identical test that also incorporates video?
RQ2: How does delivery format (audio or video) interact with text type?
RQ3: How does format interact with language proficiency level?
RQ4: How does format interact with individual listening comprehension items?

Method

Participants

The participants were 164 consenting English majors in eight classes at a small foreign-language college in Japan. Students in the program are separated into four language-ability tiers by an institutional English proficiency test administered annually, with the highest-proficiency students in Tier 1 and the lowest in Tier 4. Two classes from each tier were administered the experimental instrument for the present study.

Instrument

The listening comprehension test used in the present research comprised the listening sections from two years of the institutional proficiency test mentioned above; the two sections will be denoted "A" and "B" here. The years selected predated the participants' time at the institution, ensuring that they had not encountered the material previously. The listening section of the test in question is video-mediated, allowing the researcher to create audio-only versions by removing the video. All tests were administered via television sets displaying instructions and listening test content. For the audio-only test forms, a graphic of an audio speaker was displayed in place of the video, with the name of the test section below it.

The items were of the traditional, four-option, multiple-choice format, and were distributed to the participants in paper booklets. Subtest A comprised 24 items and Subtest B comprised 22 items. The booklets included still frames from the videos in the listening passages, with the characters labeled by name. This was to ensure that any video effect observed was not simply the result of advance schema setting or familiarity with the characters, isolating the independent variable of presence or absence of video.

The listening passages themselves were written to facilitate the function of the items' distractors through the use of asides, negotiations, and the inclusion of other extraneous information, thereby requiring the examinees to listen carefully in order to answer. Although the format of the scripts was similar to that of a play or telecast, the writers attempted to represent natural language use more closely, including explicit directions for space fillers (e.g., "uh") and pauses. Actors were directed to follow the scripts precisely and were instructed to memorize their lines, but off-camera cue cards were also available. The scenes were video-recorded in a professional-grade "green screen" studio, with appropriate background images inserted into the final video. The cameras and microphones used were of broadcast-television quality, and the final videos alternated between wide shots of the scenes and character(s) and tight shots focusing only on the speaker. Listening passages were approximately 3 minutes each (min. = 1:47, max. = 4:30, mean = 2:54, SD = 0:48). Passage content fell into three text-type categories:

• Monologue (1 passage): A character explains something over the telephone to an unseen and unheard party on the other end of the line.

• Conversation (2 passages): Two characters, a man and a woman, discuss a topic informally.

• Academic (1 passage): A character explains an academic topic using academic language.

Four test forms were prepared to counterbalance subtest content (A and B) and format (Audio and Video) order (see Table 1).

Table 1. Subtest and format counterbalancing.

Form   Part I               Part II
       Subtest   Format     Subtest   Format
1      A         Video      B         Audio
2      A         Audio      B         Video
3      B         Video      A         Audio
4      B         Audio      A         Video

Procedure

A DVD was produced for each of the test forms. Research assistants administered the tests during normal class sessions by distributing test booklets and playing the DVD. Examinees were given 1 minute to read the items for the next listening passage, and a minute and a half to answer them after listening. All timing and explanation were prerecorded and played from the DVD.


Data analysis

A one-way ANOVA was performed on raw total scores to ensure that there was no format order effect; the result was non-significant, allowing Forms 1 and 4, and Forms 2 and 3, to be treated as equivalent. As the two subtests were of different lengths, raw scores were converted to percentages for comparison. The mean score for Subtest A was 58.95% (SD = 14.56%) and that for Subtest B was 64.08% (SD = 15.83%). A paired t-test on the subtest percentages revealed that Subtest A was significantly more difficult, t(163) = 51.85, p < .001, but since all examinees took both subtests, and the MFRM model would treat all items individually, this was not a cause for concern.
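For illustration, the preliminary raw-score checks described above can be sketched as follows; the score and form arrays are hypothetical placeholders, not the study's data:

```python
# A minimal sketch of the preliminary raw-score checks described above.
# The score and form arrays are hypothetical placeholders, not the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
raw_a = rng.integers(6, 25, size=164)  # Subtest A raw scores (24 items)
raw_b = rng.integers(6, 23, size=164)  # Subtest B raw scores (22 items)
form = rng.integers(1, 5, size=164)    # counterbalanced form (1-4)

# One-way ANOVA on total scores across the four forms (format order effect).
totals = raw_a + raw_b
f_stat, p_order = stats.f_oneway(*(totals[form == f] for f in (1, 2, 3, 4)))

# Convert raw scores to percentages so the unequal-length subtests are
# comparable, then compare them with a paired t-test.
pct_a = raw_a / 24 * 100
pct_b = raw_b / 22 * 100
t_stat, p_diff = stats.ttest_rel(pct_a, pct_b)

print(f"order effect: F = {f_stat:.2f}, p = {p_order:.3f}")
print(f"subtest difference: t(163) = {t_stat:.2f}, p = {p_diff:.3f}")
```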

A three-facet MFRM model comprising examinees, format (Audio or Video), and items was constructed in the MFRM software package Facets (Linacre, 2012), with both item and format mean difficulties centered at zero, allowing only the examinee estimates to "float." See Table 2 for a summary of the Facets analysis.
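For readers unfamiliar with Facets, a control file along the following lines would specify such a model. This is an illustrative sketch under the element counts described above, not the author's actual specification:

```
; Hypothetical Facets control file for the three-facet model
; (an illustrative sketch, not the author's actual specification).
Title = Audio/video listening format comparison
Facets = 3           ; 1: examinees, 2: format, 3: items
Positive = 1         ; higher examinee measures indicate more ability
Noncenter = 1        ; examinee estimates float; format and items center at 0
Models = ?,?,?,D     ; dichotomous Rasch model
Labels =
1, Examinees
1-164
*
2, Format
1 = Audio
2 = Video
*
3, Items
1-46
*
```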

The test was fairly easy for the participants, with the mean ability estimate over 1 logit above the mean item difficulty (1.11 and 0, respectively). Overall item fit to the Rasch model was quite uniform, as indicated by the mean-square fit statistics (labeled Infit MS and Outfit MS here), which are very close to the expected value of one, with small standard deviations. The reliability statistics can be considered equivalent in interpretation to Cronbach alpha statistics, and indicate how different the measures for the elements (individual persons, formats, or items, in this case) are. Given the small sample size, and the fact that the test was somewhat easy for the examinees, the reliability of separation index of 0.77 is acceptable. The reliability statistic for the items, however, is quite good (0.95), indicating good separation of item measures. The chi-square statistics test the hypothesis that the person, format, or item measures are actually all the same. As can be seen in Table 2, the differences among both the examinee and the item measures are significant, while those between the formats are not; this will be discussed below. Finally, individual item fits were examined, revealing only one item misfitting to the extent that it was unproductive for measurement (Item 19, to be discussed in detail below), but not sufficiently misfitting to merit removal from the analysis (Wright & Linacre, 1994).

Table 2. Summary statistics for MFRM analysis.

                            Examinees   Format   Items
Measures           M        1.11        0.00     0.00
                   SD       0.82        0.04     0.91
N                           164         2        46
Infit MS           M        1.00        1.00     1.00
                   SD       0.14        0.00     0.07
Outfit MS          M        0.99        0.99     0.99
                   SD       0.26        0.01     0.15
Reliability of separation   0.77        0.59     0.95
χ²                          642.2*      2.50     909.6*
df                          163         1        45

*p ≤ .05.
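For reference, the fit and separation statistics in Table 2 have standard Rasch definitions. With standardized residuals $z_{ni} = (x_{ni} - E_{ni})/\sqrt{W_{ni}}$, where $E_{ni}$ and $W_{ni}$ are the modeled expectation and variance of response $x_{ni}$, the fit statistics for an item $i$ are

$$\text{Outfit MS}_i = \frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2}, \qquad \text{Infit MS}_i = \frac{\sum_{n} (x_{ni}-E_{ni})^{2}}{\sum_{n} W_{ni}},$$

and the reliability of separation for a facet is the proportion of observed variance in its measures not attributable to estimation error,

$$R = \frac{SD_{\text{obs}}^{2} - \overline{SE}^{2}}{SD_{\text{obs}}^{2}}.$$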



Results

RQ1: Comparative difficulty of formats

Since format was entered into the MFRM model as a facet, the relative difficulties of the two formats can be placed on the same scale as the items themselves. A simple comparison can be seen in Figure 1, and exact values can be found in Table 3.

As can be seen from Figure 1 and Table 3, the distance between the audio and video formats is a scant .08 logits, which is so small as to be of no practical significance. The observed average raw scores for the items delivered in the two formats are similarly indistinct. Furthermore, the point-biserial correlations reveal similarly sized, weak relationships between format and total raw score. Finally, as shown in Table 2, the reliability of the separation is very low, and the result of the chi-square test is non-significant, further indicating that the difficulty estimates of the formats are too close to be considered different. Based on these findings, it can be concluded that format had no effect on responses on this particular test.

Figure 1. Variable map comparing difficulties of formats to those of items. Higher measures are more difficult.

Table 3. Raw score and Rasch difficulty estimates for the delivery formats.

Format   Observed mean   Measure   SE     Infit MS   Infit Z   Outfit MS   Outfit Z   Point-biserial
Audio    0.70            0.04      0.04   1.00       −0.04     0.97        −0.80      0.19
Video    0.71            −0.04     0.04   1.00       0.26      1.00        −0.02      0.22

Note: The expected value of the mean-square fit statistics is 1.

RQ2 and RQ3: Interactions between format and text-type and proficiency level

To investigate the interactions between format and text-type, and format and proficiency level, text-type and tier were entered into the model as dummy facets. Dummy facets are those whose difficulties are constrained to zero, ensuring that they do not alter the rest of the model or contribute to the estimation of the other measures, but are nonetheless available for further analysis in Facets. A search was performed for interactions between format and the facet of interest (either text-type or proficiency level) which resulted in a difference in difficulty estimates of 1 logit or more, or whose difference was significant as determined by a t-test. However, no such interactions were observed, indicating that the text types were no harder or easier depending on the format of delivery, and that the examinees from the four ability tiers did not perform differently under the two format conditions.


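The significance test applied to these pairwise contrasts, both here and in the item-level analysis below, is the usual one for the difference between two independent Rasch estimates: the difference between an element's difficulty measures under the two conditions divided by their joint standard error,

$$t = \frac{d_1 - d_2}{\sqrt{SE_1^{2} + SE_2^{2}}},$$

evaluated against a t distribution with Welch-adjusted degrees of freedom (the "t" and "Welch df" columns of Table 4 below).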




RQ4: Interaction between format and items

The interaction between delivery format and the items was investigated, again setting the cutoff for further analysis at measure differences of 1 logit or more, or a difference significant at the .05 level, essentially carrying out a differential item functioning (DIF) analysis. Under these criteria, four items displaying format bias were located (see Table 4). These items were examined more closely in an attempt to account for their format-based differences. Furthermore, to understand why the groups may have selected different answer options, a differential distractor functioning (DDF) analysis was performed with the DDFS software package (Penfield, 2010), which employs a Mantel-Haenszel common log-odds ratio approach to obtain DDF effect estimates, as introduced by Penfield (2008). This process allows researchers to determine whether DIF can be traced to individual distractors, greatly facilitating interpretation. For the following DDF analyses, only the four items in question (i.e., 8, 19, 20, 34) were entered into the analysis, effectively creating a four-item test. This inflates the effect size of the DDF, but is easier to interpret than if these differences were mixed with the rest of the test responses. An analysis of the entire test returned only one item with significant DDF (Item 34, to be discussed below).

Table 4. Difficulty estimates for four items displaying delivery format bias.

                    Audio            Video
No.   Label         Measure   SE     Measure   SE     Contrast   t        Welch df   Probability
8     A Conv1 #1    1.18      0.23   0.38      0.27   0.80       2.27*    158        .025
19    A Conv2 #6    −0.82     0.29   0.15      0.28   −0.98      −2.42*   161        .017
20    A Conv2 #7    0.12      0.24   0.91      0.25   −0.78      −2.27*   160        .024
34    B Conv1 #3    0.29      0.27   −0.82     0.30   1.11       2.77*    161        .006

Note: Higher measures indicate harder items.
*p ≤ .05.
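To make the estimator concrete, the sketch below computes a Mantel-Haenszel common log-odds ratio for a single distractor against the correct option, stratifying examinees by an ability proxy such as total score. It illustrates the general MH approach only; the variable names and table layout are assumptions, not the DDFS internals:

```python
# Illustrative Mantel-Haenszel common log-odds ratio for one distractor
# vs. the correct option; a sketch of the general MH approach, not the
# DDFS implementation.
import math

def mh_log_odds(strata):
    """strata: one 2x2 count table per score stratum, laid out as
    [(audio_distractor, audio_correct), (video_distractor, video_correct)]."""
    num = 0.0
    den = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n  # audio-distractor x video-correct
        den += b * c / n  # audio-correct x video-distractor
    return math.log(num / den)

# Two hypothetical score strata; under this layout, positive values mean
# the distractor drew audio-condition examinees disproportionately.
example = [[(9, 20), (3, 26)],   # lower-scoring stratum
           [(5, 30), (2, 33)]]   # higher-scoring stratum
print(round(mh_log_odds(example), 2))  # ~1.22
```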

Item 8. This item, the first of the first conversation passage in subtest A, was significantly easier for examinees who encountered the conversation in the video condition. The DDF analysis results for Item 8 are displayed in Table 5. The item itself required examinees to listen to a conversation between a woman and a man about a third character's tardiness, and determine how late the third character was. The male character remarks that he and the female character were "a few minutes late" themselves, and that the journey to the next location was "only 15 minutes". The female character is upset because the third character is half an hour late. The "a few minutes" and "15 minutes" distractor options (i.e., "a" and "b", respectively) were significantly more attractive for examinees who encountered the item in the audio format.

When the video was consulted to help interpret this finding, it was noted that the actress made a strong hand gesture for location, but not time, when the key phrase occurred (i.e., "she was supposed to meet us here half an hour ago"). It is possible that this extra emphasis at the exact timing of the key granted those viewing the video an advantage in answering the associated item.

Table 5. DDF analysis for Item 8.

Option                       MH log-odds ratio   SE     Z       Counts
                                                                Video   Audio
c: "30 minutes" (Correct)                                       54      38
a: "A few minutes"           1.19                0.54   2.23*   11      14
b: "15 minutes"              1.88                0.52   3.60*   8       29
d: "1 hour"                  1.34                0.87   1.54    3       6

*p ≤ .05.
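As a rough check on the Table 5 counts for option "b" (one that ignores the ability stratification the MH estimator performs), the odds of choosing "15 minutes" over the correct option were 29/38 in the audio condition and 8/54 in the video condition, giving

$$\ln\!\left(\frac{29/38}{8/54}\right) \approx \ln(5.15) \approx 1.64,$$

in the same direction as, and of similar magnitude to, the tabled MH estimate of 1.88.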



Item 19. This item, the sixth of the second conversation in subtest A, also displayed bias, but in contrast with Item 8, it favored the audio format. See Table 6 for the results of a DDF analysis.

This item concerns a conversation between a man and a woman about when to meet for another activity. The male character needs to go shopping beforehand. The two negotiate the meeting time of 2:30 p.m., after which they will proceed to the activity, arriving there before 3:00 p.m. Those who encountered the material in the video condition were much more attracted by the distractor option "d." Consulting the video revealed that the actress in question made a strong facial expression on the line "before three," nodding her head sharply, as though it were the conclusion of the discussion, despite the fact that it had been established that the characters would meet at 2:30 p.m. It is possible, therefore, that examinees encountering the video unconsciously ignored the relatively unstressed "before" and assumed "three" to be the correct response. This facial expression was not available to those in the audio condition, who, as a result, may not have been as distracted as those encountering the video format.

It must be noted, however, that Item 19 was the worst item on the test in terms of Rasch model fit and point-biserial correlation (as noted in the Data analysis section above). Furthermore, all examinees chose either the correct option "c" or the distractor "d." The present researcher found the passage somewhat confusing as well, and it seems understandable that some listeners would believe "three" to be the correct answer.

Table 6. DDF analysis for Item 19.

Option                  MH log-odds ratio   SE     Z        Counts
                                                            Video   Audio
c: "2:30pm" (Correct)                                       57      72
d: "3:00pm"             −0.96               0.48   −2.02*   19      10

Note: Unselected options omitted.
*p ≤ .05.

Item 20. The bias observed in Item 20 is difficult to account for. The item was significantly easier for those who encountered it in the audio condition, but a DDF analysis revealed no significant DDF, and consulting the video provided no further insight. The difference in difficulty measures between the two formats is the smallest of those discovered in the Rasch interaction analysis, so it is possible that the result is spurious, or that the cause is too obscure to interpret.




Item 34. Item 34 displayed the largest format bias in the analysis, with a difference between audio and video difficulty measures of 1.11 logits, in favor of video. Results of a DDF analysis can be found in Table 7.

In this particular passage, a woman is confiding in a male friend about renewed romantic feelings for a former boyfriend who is not present. She explains that she met this person recently, and that he brought photos of a picnic spot to which they used to go "a long time ago." The locale also featured fishing, but they did not take part in it. Examinees were to identify the reason she was thinking about the former boyfriend again. Examinees who were presented with the audio format were much more likely to select the distractor "They had a picnic together."

The video was consulted to aid in interpretation, but no visual cues that could explain the bias were detected. In the scene, the two characters are sitting at a table in a coffee shop, with cups in front of them. The information for responding to Item 34 corresponds to a section of the video which is a close-up shot of the woman only, and she delivers most of her lines with her eyes cast downward toward her cup, presumably out of her character's discomfort with the topic. No other nonverbal behavior can be observed, leaving the present researcher at a loss to explain the cause of the pronounced difference between the formats on this item.

Table 7. DDF analysis for Item 34.

Option                               MH log-odds ratio   SE     Z       Counts
                                                                        Video   Audio
b: "brought some photos" (Correct)                                      73      54
a: "had a picnic"                    1.99                0.79   2.53*   5       12
c: "went fishing"                    0.79                0.79   1.01    7       6
d: "took a photo"                                                       1       3

Note: Statistics for option "d" are not included due to the small counts.
*p ≤ .05.

Discussion and conclusions

The findings regarding the first, second, and third research questions differ from those of much of the literature, especially Wagner's studies. Although the addition of video was found to have a very slight facilitative effect, the difference in difficulty between the two delivery formats was extremely small. Furthermore, in contrast to the findings of Suvorov (2009) and Wagner (2010b), no significant interaction was found between delivery format and text type. Finally, no interaction between general English proficiency level and delivery format was observed. As such, these findings are much more in line with those of Coniam (2001) or Gruba (1993). It may be important to note that the test used in the present study was developed according to a test specification stipulating that items must be answerable from the audio content alone, using passages sufficiently complex to allow distractors to distract, with no distractors referring to information not contained in the listening passage. Consequently, the test likely does not take advantage of the extra information available from the nonverbal channel.


This is in contrast to the work of Wagner, wherein actors are instructed to improvise their scenes based on an outline, which likely introduces more, and more natural, nonverbal behavior, but which may come at the expense of distractor functioning. Strictly authentic spoken material does not typically lend itself to the construction of multiple-choice items with distractors that duly distract lower-ability examinees. Furthermore, the use of demonstrative gestures, if not well controlled, could hurt item discrimination by introducing a floor effect. Anyone who has traveled in a foreign country without knowing the local language knows that a great deal of information can be conveyed with simple hand gestures and a few words, but a test of foreign language listening comprehension is typically concerned with mastery of the language itself, not of pan-cultural, ad-hoc, gesture-based communication. Overall, it seems likely that the divide in the research over whether video has a facilitative effect or none can be more easily explained by differences in test design than by anything inherent to the format of delivery. A test designed to make explicit use of the nonverbal channel will likely show an effect, whereas one that adheres to more traditional test design principles will not.

Investigating the fourth research question revealed some possible ways in which the presence of video may affect examinees' responses to items, although, as is unfortunately typical of DIF research, the interpretations are based on what Engelhard terms "the wisdom of practice" (2009, p. 5), and are therefore unverifiable. The data alone can only identify differences to examine in greater, qualitative detail; the true reasons for the observed differences are extremely difficult, if not impossible, to determine. However, for the two items that differed substantially when delivered in the audio and video formats (Items 8 and 19), the reason appears to be related to acting problems. This may serve as a recommendation for the Wagner approach, with the attendant possible cost to the distractors discussed above. This is not to say that a happy medium cannot be attained: it should be possible to devise a video-mediated listening test that adheres to traditional test design practices but avoids the pitfalls of poor acting. This may not be inexpensive or easy, however, as few organizations beyond the large-scale testing companies have the budget or industry connections necessary to hire professional actors. Most organizations opt, rather, for casting faculty and staff, as was the case with the test in the present research.

The findings of this study, like those of several before it, suggest that the addition of nonverbal information via video to a traditional, multiple-choice listening test will have no net effect, leaving the choice of whether to include it to test designer judgment, based on both theoretical and pragmatic factors. Some (e.g., Bachman, 1990; Buck, 2001; Ockey, 2007; Rost, 2002) have expressed reservations regarding the construct validity of video-mediated listening tests, questioning whether the inclusion of extra-linguistic, visual information harms the interpretability of scores. Others (e.g., Wagner, 2008, 2010a) argue that because visual information is accessible in most L2 listening situations, a video-mediated test is more authentic, and therefore more construct-relevant, than one that excludes it. If that relevance does not manifest itself in quantitative differences in scores, however, then from a psychometric standpoint it may be a moot point. Still, tests which are simply perceived to be more authentic can provide some benefit to test designers, as test users are often more likely to respond positively to a test with greater face validity. This alone may be sufficient reason to choose the video-mediated format in some cases, especially as it appears to have no deleterious effect on the reliability of the instrument. The choice may come at a cost in equipment and time, however, as (drops in equipment prices notwithstanding) the production of video remains considerably more involved than that of audio.

Directions for future research

It is important to note that even in a traditional listening comprehension test such as that employed in the present study, two items did indeed seem to display a video effect. As this effect appears to be item-dependent, and, at least in the case of the final item discussed above, somewhat mysterious, further work is recommended. A promising method of investigating these interactions in finer detail is eye-tracking, which allows the researcher to record examinees' eye movements and fixations as they encounter a video-mediated test. Suvorov (2013) has very recently completed work employing this methodology to observe examinees' interactions with context and content videos in an academic listening test, but many questions remain. It is not yet known, for example, what nonverbal behaviors draw examinees' attention, what specific nonverbal cues broadcast extra or supportive information, or – most importantly – how these interact with items. Ultimately, all test data arise from the interactions of examinees and items, and until the impact of nonverbal behavior on those interactions is better understood, construct validity will remain a very real concern in video-mediated tests of foreign language listening comprehension, perhaps requiring the definition of a new construct of "visual listening comprehension."

Acknowledgments

Thanks to Siwon Park for assistance in instrument creation and project planning, and to Luke Harding for his advice during manuscript preparation.

References

Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford: Oxford University Press.

Baltova, I. (1994). The impact of video on the comprehension skills of core French students. Canadian Modern Language Review, 50(3), 507–531.

Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second language group oral discussion task. Language Testing, 20(1), 89–110.

Brett, P. (1997). A comparative study of the effects of the use of multimedia on listening comprehension. System, 25(1), 39–53. doi:10.1016/S0346-251X(96)00059-0

Brindley, G. (1998). Assessing listening abilities. Annual Review of Applied Linguistics, 18, 171–191. doi:10.1017/S0267190500003536

Buck, G. (2001). Assessing listening. Cambridge: Cambridge University Press.

Burgoon, J. K. (1994). Nonverbal signals. In M. L. Knapp & G. R. Miller (Eds.), Handbook of interpersonal communication (2nd ed., pp. 229–285). Thousand Oaks, CA: SAGE Publications.

Coniam, D. (2001). The use of audio or video comprehension as an assessment instrument in the certification of English language teachers: A case study. System, 29(1), 1–14. doi:10.1016/S0346-251X(00)00057-9

Cubilo, J., & Winke, P. (2013). Redefining the L2 listening construct within an integrated writing task: Considering the impacts of visual-cue interpretation and note-taking. Language Assessment Quarterly, 10(4), 371–397. doi:10.1080/15434303.2013.824972

Engelhard, G. (2009). Using item response theory and model–data fit to conceptualize differential item and person functioning for students with disabilities. Educational and Psychological Measurement, 69(4), 585–602. doi:10.1177/0013164408323240

Ginther, A. (2002). Context and content visuals and performance on listening comprehension stimuli. Language Testing, 19(2), 133–167. doi:10.1191/0265532202lt225oa

Gruba, P. (1993). A comparison study of audio and video in language testing. JALT Journal, 15, 85–88.

Gruba, P. (1997). The role of video media in listening assessment. System, 25(3), 335–345. doi:10.1016/S0346-251X(97)00026-2

Hernandez, S. S. (2004). The effects of video and captioned text and the influence of verbal and spatial abilities on second language listening comprehension in a multimedia learning environment (Unpublished doctoral dissertation). New York University, New York. Retrieved from http://search.proquest.com/pqdt/docview/305166044/abstract/13FE4D5FFBD2C1FDDC/

Kellerman, S. (1992). "I see what you mean": The role of kinesic behaviour in listening and implications for foreign and second language learning. Applied Linguistics, 13(3), 239–258.

Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.

Linacre, J. M. (2012). Facets (Version 3.70). Beaverton, OR: Winsteps.com. Retrieved from http://www.winsteps.com/facets.htm

Lynch, T. (1998). Theoretical perspectives on listening. Annual Review of Applied Linguistics, 18, 3–19. doi:10.1017/S0267190500003457

MacWilliam, I. (1986). Video and language comprehension. ELT Journal, 40(2), 131–135. doi:10.1093/elt/40.2.131

Mayer, R. E. (1997). Multimedia learning: Are we asking the right questions? Educational Psychologist, 32(1), 1. doi:10.1207/s15326985ep3201_1

Ockey, G. J. (2007). Construct implications of including still image or video in computer-based listening tests. Language Testing, 24(4), 517–537. doi:10.1177/0265532207080771

Parry, T. S., & Meredith, R. A. (1984). Videotape vs. audiotape for listening comprehension tests: An experiment. OMLTA Journal. Retrieved from http://www.eric.ed.gov/ERICWebPortal/contentdelivery/servlet/ERICServlet?accno=ED254107

Penfield, R. D. (2008). An odds ratio approach for assessing differential distractor functioning effects under the nominal response model. Journal of Educational Measurement, 45(3), 247–269. doi:10.1111/j.1745-3984.2008.00063.x

Penfield, R. D. (2010). DDFS (Version 1.0). Miami, FL. Retrieved from http://www.education.miami.edu/Facultysites/Penfield/DDFS.zip

Progosh, D. (1996). Using video for listening assessment: Opinions of test-takers. TESL Canada Journal, 14(1), 34–44. Retrieved from http://www.teslcanadajournal.ca/index.php/tesl/article/view/676

Riseborough, M. G. (1981). Physiographic gestures as decoding facilitators: Three experiments exploring a neglected facet of communication. Journal of Nonverbal Behavior, 5(3), 172–183. doi:10.1007/BF00986134

Rost, M. (1987). The interaction of listener, text, and task (Unpublished doctoral dissertation). University of Lancaster, Lancaster, UK.

Rost, M. (1990). Listening in language learning. New York: Longman.

Rost, M. (2002). Teaching and researching listening. Longman.

Ruhe, V. (1996). Graphics and listening comprehension. TESL Canada Journal, 14(1), 45–60. Retrieved from http://www.teslcanadajournal.ca/index.php/tesl/article/view/677

Shin, D. (1998). Using videotaped lectures for testing academic listening proficiency. International Journal of Listening, 12, 57–80. Retrieved from http://shadow.cios.org:7979/JOURNALS/LIS/012/1/057/index.djvu

Sueyoshi, A., & Hardison, D. M. (2005). The role of gestures and facial cues in second language listening comprehension. Language Learning, 55(4), 661–699. doi:10.1111/j.0023-8333.2005.00320.x

Suvorov, R. (2009). Context visuals in L2 listening tests: The effects of photographs and video vs. audio-only format. In C. A. Chapelle, H. G. Jun, & I. Katz (Eds.), Developing and evaluating language learning materials (pp. 53–68). Ames, IA: Iowa State University.

Suvorov, R. (2013). Interacting with visuals in L2 listening tests: An eye-tracking study (Doctoral thesis). Iowa State University, Ames, IA.

Wagner, E. (2002). Video listening tests: A pilot study. Working Papers in TESOL & Applied Linguistics, Teachers College, Columbia University, 2(1). Retrieved from http://journals.tc-library.org/index.php/tesol/article/viewArticle/7

Wagner, E. (2006). Utilizing the visual channel: An investigation of the use of video texts on tests of second language listening ability (Unpublished doctoral dissertation). Teachers College, Columbia University, New York.

Wagner, E. (2007). Are they watching? Test-taker viewing behavior during an L2 video listening test. Language Learning & Technology, 11(1), 67–86.

Wagner, E. (2008). Video listening tests: What are they measuring? Language Assessment Quarterly, 5(3), 218–243. Retrieved from http://www.informaworld.com/10.1080/15434300802213015

Wagner, E. (2010a). Test-takers' interaction with an L2 video listening test. System. doi:10.1016/j.system.2010.01.003

Wagner, E. (2010b). The effect of the use of video texts on ESL listening test-taker performance. Language Testing, 27(4), 493–513. doi:10.1177/0265532209355668

Wagner, E. (2013). An investigation of how the channel of input and access to test questions affect L2 listening test performance. Language Assessment Quarterly, 10(2), 178–195. doi:10.1080/15434303.2013.769552

Weidenmann, B. (1989). When good pictures fail: An information-processing approach to the effect of illustrations. In H. Mandl & J. R. Levin (Eds.), Knowledge acquisition from text and pictures (pp. 157–170). Tokyo: Elsevier Science.

Wright, B. D., & Linacre, J. M. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370. Retrieved from http://www.rasch.org/rmt/rmt83b.htm