Post on 03-Apr-2018
7/28/2019 Spaan V8 Kim
1/30
Spaan Fellow Working Papers in Second or Foreign Language AssessmentCopyright 2010
Volume 8: 130
English Language Institute
University of Michigan
www.lsa.umich.edu/eli/research/spaan
1
Investigating the Construct Validity of a Speaking Performance Test
Hyun Jung KimTeachers College, Columbia University
ABSTRACT With the increased demand for the integration of a performancecomponent in second language (L2) testing, speaking performance assessments
have focused on eliciting examinees underlying language ability through theiractual oral performance on a given task. Considering the nature of performance
assessments, many factors other than examinees speaking ability are
necessarily involved in the process of evaluation. Compared to the constructdefinition of speaking ability, however, relatively less attention has been givento tasks, which are regarded as a vehicle for assessment, although there is a
growing interest in authentic tasks in eliciting real-world language samples forevaluation. Thus, the present study investigates whether a speaking placement
test provides empirical evidence that the effect of task, as well as examineesattributes, should be considered in describing speaking ability in a performance
assessment. An understanding of the underlying structure of the speakingplacement test not only helps to identify the factors involved in the evaluation
process and their relationships, but ultimately makes it possible toappropriately infer examinees speaking ability.
In L2 testing, the notion ofperformance first emerged in the 1960s in response topractical needs, and since then, the demand to integrate examinees actual performance in L2
assessment has increased (McNamara, 1996). Early testers who advocated the integration of aperformance component focused on whether examinees could successfully fulfill a task in a
simulated real-life language use context (e.g., Clark, 1975; Jones, 1985; Morrow, 1979;Savignon, 1972). McNamara (1996) classified this approach as astrong sense of performance
assessment in which the definition of L2 ability construct is limited to examinees taskcompletion.
On the contrary, new theories of communicative competence and communicative
language ability in the 1980s and 1990s (e.g., Bachman, 1990; Bachman & Palmer, 1996,Canale, 1983; Canale & Swain, 1980) changed not only the perception of L2 language ability,but also the role of performance in language testing. They supported a weak sense of
performance assessment (McNamara, 1996), in which the main interest was examineeslanguage ability, instead of task completion. That is, L2 ability was determined based on
various language components derived from the theoretical models of communicativecompetence and communicative language ability. Examinees actual performance was elicited
for evaluation of language ability; however, the role of performance was limited to a vehicle
7/28/2019 Spaan V8 Kim
2/30
2 H. J. Kim
to elicit examinees underlying language ability. This approach to performance assessment,called a construct-centered approach (Bachman, 2002), has been widely accepted by L2
testers for most general purpose language performance assessments (e.g., Brindley, 1994;Fulcher, 2003; Luoma, 2004; McNamara, 1996; Messick, 1994; Skehan, 1998).
While the construct-centered approach to performance assessment gives priority to
definitions of L2 ability, a different perspective has recently been proposed. A task-centeredapproach focuses on what examinees can do with the language; that is, whether they canfulfill a given task (Brown, Hudson, Norris, & Bonk, 2002; Norris, Brown, Hudson, &
Yoshioka, 1998). Although this approach provides more systematic criteria for the evaluationof examinees task fulfillment than the approach of early testers who first argued for the
integration of performance in language testing, it basically shares the early testers view aboutwhat performance assessments aim to measure (i.e., strong version of performance
assessment). According to the task-centered approach, test contexts or tasks play a crucial rolein measuring L2 ability because examinees performance is evaluated based on real-world
criteria.The two approaches to performance assessment appear to be contradictory in nature.
Chapelle (1998), however, argued from an interactionalist perspective that both constructdefinitions and tasks should be considered together in defining L2 ability because the two
interact during communication. As reviewed, different perspectives on L2 performanceassessment have defined language ability distinctively with a different focus. What is
important is not which approach is superior, but whether a test is validated before inferringexaminees language ability from the test results. In other words, before an inference
regarding an examinees language ability is made from test scores, test developers and usersneed to make sure what the test aims to measure (e.g., various language components,
performance on tasks) and whether a test actually measures what it intends to measure.Although a test is designed for its intended purpose (e.g., following construct definitions, task
characteristics, or both), there are still many factors that need to be considered in L2performance assessments to understand examinees performance and define their language
ability. Examinees performance may be affected by factors other than their language ability(McNamara, 1996, 1997). McNamara (1995) elaborated a schematic representation (Figure 1),
which Kenyon (1992) first presented, to conceptualize the performance dimension of L2speaking performance tests. As presented in the figure, examinees performance in L2
speaking tests is affected by many factors in the testing phase (i.e., candidates, tasks,interlocutors, and their interactions) as well as in the rating phase (i.e., raters and rating
scales). Empirical studies have identified these factors that affect speaking performance testscores as effects of the: (1) candidate(Lumley & OSullivan, 2005; OLoughlin, 2002); (2)task(Chalhoub-Deville, 1995; Clark, 1988; Elder, Iwashita, & McNamara, 2002; Farris, 1995;Malabonga, Kenyon, & Carpenter, 2005; Shohamy, 1994; Wigglesworth, 1997); (3)
interlocutor(Brown, 2003; OSullivan, 2002); (4) rater(Barnwell, 1989; Bonk & Ockey,2003; Brown, 1995; Eckes, 2005; Elder, 1993; Y. Kim, 2009; Lumley, 1998; Lumley &
McNamara, 1995; Lynch & McNamara, 1998; Meiron & Schick, 2000; Orr, 2002;Wigglesworth, 1993); and (5)scale/criteria(M. Kim, 2001). It might be impossible tocompletely eliminate the effects of these factors on examinees speaking performance.However, it is important to understand relative contributions of these factors to examinees
performance and test scores in order to better estimate examinees speaking ability and moreappropriately interpret and use the test results.
7/28/2019 Spaan V8 Kim
3/30
3Investigating the Construct Validity of a Speaking Performance Test
Rater
Scale/Criteria Score
Performance
Interlocutor Task (includingother
candidate) Candidate
Figure 1. Interactions in Performance Assessment of Speaking Skills
(McNamara, 1995, p. 173)
To sum up, examinees speaking ability can be inferred only after a test is validated
with respect to its constructs and other factors involved in the process of evaluation. Thefocus of previous studies, however, has often been limited to effects of individual factors on
examinees test performance. In other words, speaking performance tests have not beenexamined in a big framework in which various factors (e.g., examinees language ability,
tasks, and rating criteria) interact with one another. Moreover, performance tests, especiallythose which do not involve high stakes, are oftentimes used without such validation. To this
end, the current study seeks to explore the nature of a speaking placement test, which hasbeen locally used in a community English program. In order to determine whether the
speaking test accurately measures speaking ability as intended, the underlying structure of thetest is investigated in the present study. In other words, the question of whether the
hypothesized components of speaking ability (reflected in the scoring rubric) actually functionas the operationalized constructs of the test is examined. In addition, to better explain how the
test works, the effects of other variables, such as rater perceptions and task characteristics, arealso investigated. That is, factors that can have an effect on speaking performance are
considered in addition to issues regarding construct definition.
Research Questions
The current study addresses the following three research questions: (1) What is thefactorial structure of the speaking test? (2) To what extent does the speaking test measure the
intended hypothesized constructs of speaking ability? (3) In addition to the measuredvariables, to what extent do other factors (i.e., raters and tasks) contribute to examinees
speaking performance?
7/28/2019 Spaan V8 Kim
4/30
4 H. J. Kim
Method
Context of the Current Study
The Community English Program (CEP) is an English as a second language (ESL)
program offered by the Teaching English to Speakers of Other Languages (TESOL) and
applied linguistics programs at Teachers College. The program targets adult ESL learners whowish to improve their communicative language ability. Therefore, the CEP curriculumemphasizes not only the various language components (grammar, vocabulary, and
pronunciation) but also the different language skills (listening, speaking, reading, and writing).To facilitate effective teaching and learning, all new students of the program are placed into
one of 12 proficiency levels based on results of a placement test, which consists of fivesections (i.e., listening, grammar, reading, writing, and speaking).
A majority of the CEP teachers are MA students of the TESOL and applied linguisticsprograms. That is, they are student teachers practicing ESL classroom teaching. Therefore,
their classrooms are regularly observed by faculty and colleagues and follow-up feedbacksessions are provided throughout the semester. The teachers also serve as raters of the writing
and speaking placement tests. From the rating experience, they not only become familiar withthe CEP students writing and speaking ability levels, but they also have an opportunity to
have hands-on experience in evaluating ESL learners writing and speaking ability. Therefore,the CEP functions as a teacher education program as well as an adult ESL program.
Participants
Participants in the current study consisted of 215 incoming CEP students who took theCEP speaking placement test. The majority of students in the program were adult immigrants
from the surrounding neighborhood or were family members of international students in theColumbia University community. The number of female students (73%) far exceeded that of
male students (27%). In terms of the participants first language, a large percentage consistedof three languages: Japanese (36%), Korean (19%), and Spanish (15%). With regard to their
length of residence, the vast majority of the participants responded that they had been inEnglish speaking countries, including the United States, for fewer than three years: less than
6 months (40%), 6 months to 1 year (19%), and 1 to 3 years (20%). In terms of theirmotivation for studying English, many participants reported academic and job-related reasons,
while over 50 percent gave priority to communication with friends as their reason forimproving their English.
Instruments
The instruments used in the current study included the CEP placement speaking testand an analytic scoring rubric. The speaking test was designed to measure speaking ability
under various real-life language use situations. The test had six tasks: complaining about acatering service (Task 1), talking about a favorite movie (Task 2), narrating a story based on a
sequence of pictures (Task 3), refusing a request from a landlord (Task 4), summarizing aradio commentary (Task 5), and summarizing a lecture (Task 6). The first three tasks (i.e.,
Tasks 1, 2, and 3) were the independent-skills tasks, which required examinees to draw ontheir background knowledge to perform the tasks. On the other hand, the last three tasks (i.e.,
Tasks 4, 5, and 6) were the integrated-skills tasks, which required examinees to use theirlistening skills in the performance of the tasks. That is, examinees were asked to listen to long
7/28/2019 Spaan V8 Kim
5/30
5Investigating the Construct Validity of a Speaking Performance Test
or short passages, which were provided as part of the tasks, and then formulate responsesbased on the content of the passages.
The speaking test was a semi-direct, computer-delivered test. That is, there was nointeraction between an examinee and an interlocutor. Instead, the examinees listened to the
pre-recorded instructions and prompts delivered by a computer and then they were asked to
record their responses. The six tasks and the test format for each task (e.g., preparation time,response time) are found in Appendix A.An analytic scoring rubric consisting of five rating scales (see Appendix B) was used
to score the examinees recorded oral responses. The five scales included meaningfulness,grammatical competence, discourse competence, task completion, and intelligibility. Each of
the five rating scales was rated on a six-point scale (0 for no control to 5 for excellentcontrol). To analyze each scale in relation to the different tasks in this study, the five scales
for each of the six tasks were regarded as individual items, making a total of 30 items (6 tasksx 5 rating scales) on the test. That is, each cell in Table 1 illustrates the individual items of the
test. For instance, the item MeanT1 represents meaningfulness for Task 1 while theitemMeanT2 refers to meaningfulness for Task 2.
Table 1. Taxonomy of Items (Task x Rating Scale) on Speaking Ability
TasksRating scales
Number
of Items Task 1 Task 2 Task 3 Task 4 Task 5 Task 6
Meaningfulness 6 MeanT1 MeanT2 MeanT3 MeanT4 MeanT5 MeanT6
Grammatical
competence6 GramT1 GramT2 GramT3 GramT4 GramT5 GramT6
Discourse
competence6 DiscT1 DiscT2 DiscT3 DiscT4 DiscT3 DiscT6
Task completion 6 TaskT1 TaskT2 TaskT3 TaskT4 TaskT5 TaskT6
Intelligibility 6 IntelT1 IntelT2 IntelT3 IntelT4 IntelT5 IntelT6
Total 30
Procedures
Test Administration
The speaking test was administered in a computer lab on the second day of a two-dayplacement test administration. The test was administered to groups of approximately 40
students. Each student was seated in front of a computer. They listened to the test instructionson a headset, read the instructions on the computer screen, and recorded their responses to the
test items using a microphone. Since all computers were controlled from a central console, theexaminees kept the same pace while taking the test. That is, the instructions and prompts were
delivered at the same time, and the preparation and response times were also provided to allexaminees at the same time.
Before the actual test began, the examinees were asked to fill in a background surveywhich asked for demographic information, prior English-learning experience, and plans for
7/28/2019 Spaan V8 Kim
6/30
6 H. J. Kim
future study. Once all examinees of a group completed the survey, they were given a practicetask so that they would be familiar with the test format. After a short intermission for any
questions about the test format, the six tasks were played in sequence. For each task, theexaminees first listened to or looked at an instruction and a prompt. They were allowed to
prepare responses during a short preparation time and lastly they recorded their responses
during the given response time.
Scoring
Each examinees performance was scored by two independent raters. The raters werethe CEP teachers, most of whom were MA or EdD students in the TESOL and applied
linguistics programs at Teachers College. Prior to the actual rating, the raters attended anorming session in which the test tasks and the rubric were introduced and sample responses
were provided for practice. Time was also given for discussion of analytic scores so that theraters had opportunities to monitor their decision-making processes by comparing the
rationale behind their scores with other raters opinions. Rating practice and discussioncontinued until the raters felt that they were well aware of the tasks and confident with
assigning scores on different rating scales. Following the norming session, each rater wasassigned a certain number of examinees. Since examinees performance on each of the six
tasks was scored on the five rating scales, each examinee was given 30 analytic ratings on 30items. The maximum score for each item was five and the minimum was zero. The scores
assigned by two independent raters were later averaged to determine a speaking score for theplacement test.
Analyses
Thedata were analyzed using SPSS version 12.0 (SPSS Inc., 2001) and EQS version6.1 (Bentler & Wu, 2005). Descriptive statistics (i.e., means, standard deviations,
maximum/minimum raw scores, and skewness and kurtosis values) were calculated for theentire test, for the 30 individual items, and for each of the five rating scales across the six
tasks separately using SPSS to verify central tendency and variability. Reliability estimateswere then calculated based on Cronbachs Alpha to examine the degree of relatedness among
the 30 items and the six items under each of the five rating scales. Also, the degree ofagreement between the two raters (i.e., inter-rater reliability) was investigated from various
perspectives, such as from the examinees total score, across the six tasks, and across the fiverating scales. Since composite scores comprised interval data that were converted from the
original ordinal data, inter-rater reliability was estimated based on Pearson Product-Momentcorrelations.
After calculating descriptive statistics and reliability estimates, exploratory factoranalyses (EFA) were conducted to determine the extent to which the 30 items clustered
together. In other words, factor analyses were used to examine what patterns of correlationswould be observed among the 30 items. Based on the correlation matrix, initial factors were
extracted by principal-axes factoring (PAF) after the appropriateness of the use of acorrelation matrix for factor analysis was verified using three calculations: (1) Bartletts test
of sphericity; (2) the Kaiser-Meyer-Olkin (KMO); and (3) the determinant of the correlationmatrix. The initial factors were then rotated until the best solution was found to determine the
number of underlying factors. Since it had been assumed that the factors were correlated with
7/28/2019 Spaan V8 Kim
7/30
7Investigating the Construct Validity of a Speaking Performance Test
one another, a direct oblimin rotation procedure was used after checking the factor correlationmatrices each time.
Finally, confirmatory factor analyses (CFA) were performed to establish a model ofthe speaking test. CFA was used to determine the extent to which the 30 items were measured
in relation to the six tasks and five scoring criteria. Based on a review of the literature, a
second-order Multitrait-Multimethod (MTMM) Model was first hypothesized. After failing tofind an appropriate solution with the hypothesized model, several other CFA models wereattempted to find a final model that best explained the data. To assess the adequacy of models
including the hypothesized model, several fit indices were used such as the Chi-squarestatistic, the Chi-square/df ratio, the comparative fit index (CFI), and the root mean-square
error of approximation (RMSEA). In addition, a distribution of standardized residuals waschecked. The results of the Lagrange Multiplier test and Ward test were analyzed for each run
in order to check any necessary and unnecessary parameters in a model. In the end, however,a final speaking test model was chosen in accordance with substantive considerations while
taking into account the issue of parsimony. In the process of model evaluation, the ML Robustmethod was used each time due to multivariate non-normality of the data.
Results
Descriptive Statistics
The descriptive statistics which were calculated for the item level, the rating scalelevel, and the entire 30-item test are presented in Table 2. The item-level means ranged from
2.64 to 3.41 and the standard deviations from 1.01 to 1.57. Although not very different, themeans of grammar-related items (i.e., GramT1 to GramT6) were lower than those for the
other groups of items. On the other hand, task completion-related items (i.e., TaskT1 to TaskT6) showed relatively higher means compared to the other items. Grammar-related items had
the least variability (average Std.=1.04) while task completion-related items had the largestvariability (average Std.=1.23). With regard to the task-related aspect, Task 6 items (i.e.,
MeanT6, GramT6, DiscT6, TaskT6, and IntelT6) had the lowest means under each ratingscale. However, their standard deviations were greatest compared to those for the other task
items under the same rating scale. The skewness and kurtosis values, within the acceptablerange, indicated that all 30 items and five rating scales appeared to be normally distributed.
Reliability Analyses
The reliability estimates for internal consistency were calculated for the five ratingscales and for the entire test (see Table 3). The reliability estimate for the entire test was very
high (0.991), signifying a high degree of homogeneity among the 30 items. Internalconsistency reliability for each rating scale also showed a high degree of consistency of the
six tasks under the five scales. The high reliability estimates, ranging from 0.936 to 0.963,suggested that the six tasks measured the same construct with a high degree of consistency
within each rating scale.
7/28/2019 Spaan V8 Kim
8/30
8 H. J. Kim
Table 2. Descriptive Statistics (N=215, K=30)
Variable Minimum Maximum Mean Std. Skewness Kurtosis
1. Meaningfulness (Mean) 0 5.00 3.10 1.14 -.87 .30
MeanT1 0 5.00 3.12 1.28 -.81 .24
MeanT2 0 5.00 3.12 1.18 -.93 .65
MeanT3 0 5.00 3.20 1.11 -.82 .60MeanT4 0 5.00 3.10 1.30 -.84 -.07
MeanT5 0 5.00 3.20 1.24 -.91 .27
MeanT6 0 5.00 2.85 1.38 -.66 -.45
2. Grammar (Gram) 0 4.58 2.87 1.04 -.95 .52
GramT1 0 5.00 2.83 1.15 -.84 .46
GramT2 0 4.50 2.87 1.04 -1.11 1.16
GramT3 0 4.50 2.92 1.01 -.92 .83
GramT4 0 5.00 2.93 1.20 -.97 .35
GramT5 0 5.00 2.93 1.12 -.92 .46
GramT6 0 4.50 2.73 1.27 -.79 -.29
3. Discourse Competence
(Disc)0 4.50 2.86 1.08 -.91 .33
DiscT1 0 5.00 2.83 1.21 -.78 .17
DiscT2 0 5.00 2.84 1.09 -.91 .59
DiscT3 0 5.00 2.95 1.05 -.83 .75
DiscT4 0 5.00 2.93 1.26 -.85 -.04
DiscT5 0 5.00 2.97 1.18 -.84 .17
DiscT6 0 5.00 2.64 1.32 -.63 -.42
4. Task Completion (Task) 0 5.00 3.18 1.23 -.86 .08
TaskT1 0 5.00 3.07 1.41 -.52 -.44
TaskT2 0 5.00 3.36 1.33 -1.00 .32TaskT3 0 5.00 3.41 1.23 -.92 .42
TaskT4 0 5.00 3.04 1.57 -.56 -1.01
TaskT5 0 5.00 3.37 1.41 -.82 -.25
TaskT6 0 5.00 2.86 1.48 -.52 -.73
5. Intelligibility (Intel) 0 4.92 3.02 1.09 -.92 .53
IntelT1 0 5.00 2.99 1.20 -.89 .50
IntelT2 0 5.00 3.00 1.15 -.96 .73
IntelT3 0 5.00 3.09 1.05 -.93 .86
IntelT4 0 5.00 3.07 1.25 -.90 .20
IntelT5 0 5.00 3.10 1.16 -.91 .67
IntelT6 0 5.00 2.89 1.30 -.75 -.16
Total (30 items) 0 4.73 3.01 1.10 -.94 .43
7/28/2019 Spaan V8 Kim
9/30
9Investigating the Construct Validity of a Speaking Performance Test
Table 3. Reliability Estimates (N=215)
Construct Items UsedNr of
ItemsReliability Estimates
Meaningfulness MeanT1 - MeanT6 6 0.960Grammatical Competence GramT1 - GramT6 6 0.963
Discourse Competence DiscT1 - DiscT6 6 0.958Task Completion TaskT1 - TaskT6 6 0.936
Intelligibility IntelT1 - IntelT6 6 0.963Total 30 0.991
Although average scores by the two raters were used for the statistical analyses, inter-
rater reliability was calculated to determine the degree of agreement between the two raters.The correlation between Rater 1 and Rater 2 was 0.837 for examinees total score (see Table
4), 0.71 to 0.80 across the six tasks (see Table 5), and 0.78 to 0.82 across the five rating scales(see Table 6). All correlations were significant at the alpha = 0.01 level, indicating that the
first raters score on each task, each rating scale, and entire test significantly correlated withthe second raters score on the same task, rating scale, and entire test. As a result, it can be
assumed that the two raters scored the examinees speaking with similar criteria in mind.
Table 4. Inter-rater Reliability for the Entire Speaking Test (N = 215)
Rater 1 (TotR1) Rater 2 (TotR2)
Rater 1 (TotR1) 1.00 0.837**
Rater 2 (TotR2) 0.837** 1.00
**p < 0.01 (2-tailed), R1 = Rater 1, R2 = Rater 2
Table 5. Inter-rater Reliability across Six Tasks (N = 215)T1R1 T1R2 T2R1 T2R2 T3R1 T3R2 T4R1 T4R2 T5R1 T5R2 T6R1 T6R2
T1R1 1.00 0.80**
T1R2 1.00
T2R1 1.00 0.75**
T2R2 1.00
T3R1 1.00 0.71**
T3R2 1.00
T4R1 1.00 0.81**
T4R2 1.00
T5R1 1.00 0.80**
T5R2 1.00
T6R1 1.00 0.80**
T6R2 1.00**p < 0.01 (2-tailed), T1T6: Task 1Task 6; R1 = Rater 1, R2 = Rater 2
7/28/2019 Spaan V8 Kim
10/30
10 H. J. Kim
Table 6. Inter-rater Reliability across the Five Constructs (N = 215)
MR1 MR2 GR1 GR2 DR1 DR2 TR1 TR2 IR1 IR2
MR1 1.00 0.78**
MR2 1.00GR1 1.00 0.82**
GR2 1.00DR1 1.00 0.80**
DR2 1.00TR1 1.00 0.80**
TR2 1.00IR1 1.00 0.81**
**p < 0.01 (2-tailed), M: Meaningfulness, G: Grammatical Competence, D: Discourse
Competence,T: Task Completion, I: Intelligibility,
R1 = Rater 1, R2 = Rater 2
Results of Exploratory Factor Analysis
Once the appropriateness of the use of a correlation matrix for factor analysis was
verified (e.g., a significant Chi-square, the positive determinant of the correlation matrix), anEFA was conducted as a preliminary step for a CFA in order to develop a factor structure for
the 30 observed variables. The initial factor extraction showed a very different result from thehypothesized design of speaking ability, which assumed five underlying factors (i.e., five
rating scales). Two factors with eigenvalues greater than 1.0 were extracted, which accountedfor 83.7 percentof the variance. Variable communalities were all above 0.7, specifying that
the variances of the variables accounted for by the common factors were very high. The screeplot also suggested the extraction of two factors. Since the number of factors obtained from
the initial extraction was quite different from the hypothesis set for the speaking test, solutionswith different numbers of factors were compared. The three factor oblique rotation was the
best solution to achieve maximum parsimony (see Table7). As observed in Table 7, the 30items used to measure speaking ability clustered around the type of task. For instance, items
for Tasks 1, 2, and 3 loaded on Factor 1, items for Task 6 loaded on Factor 2, and items forTasks 4 and 5 loaded on Factor 3. To illustrate, all five items for Task 6 (i.e., MeanT6,
GramT6, DiscT6, TaskT6, and IntelT6) showed factor loadings above 0.3 for Factor 2.Further analysis of the six tasks revealed a possible reason as to why the items
clustered around the task type factors rather than around the rating scales. Since Tasks 1, 2,and 3 required examinees to speak with the minimal input, the factor on which the items for
these three tasks loaded was interpreted as a Speak factor. Contrary to Tasks 1, 2, and 3,
Tasks 4 and 5 first required examinees to listen to a long message and then respond orsummarize it. Thus, Factor 3, which included items for Tasks 4 and 5, was coded as a Listenand Speak factor. While Task 6 was a summary task (as was Task 5), it appeared that Task 6
required examinees to have topical knowledge in the process of listening and summarizing amessage. That is, examinees familiarity with the topic of the task could help them approach
the task easily. Whereas Task 5 was about a topic (an electric car) that might be morecommonly discussed in everyday life, the listening prompt provided in Task 6 was a lecture
with highly specified content (the Barbizon School). Thus, Factor 2 was coded as Listen and
7/28/2019 Spaan V8 Kim
11/30
11Investigating the Construct Validity of a Speaking Performance Test
Speak with Topical Knowledge. In sum, the items did not cluster around operationalizedconstructs of speaking ability (i.e., rating scales), showing that examinees speaking
performance was better explained according to the task type rather than to the hypothesizedfive constructs of speaking ability. As a result, the two cross-loadings present (i.e., IntelT3
and GramT5) were not seen as problematic since grammar and intelligibility could be
involved in any task as long as factors were divided based on the task type. The final three-factor solution is presented in Table 8.
Table 7. Pattern Matrix for Speaking Ability
Factor1 2 3
DiscT2 1.015 .054 .171GramT2 .944 .142 .163
TaskT2 .890 .017 .014GramT1 .874 .027 -.037
IntelT1 .873 -.019 -.054DiscT1 .856 -.006 -.061
IntelT2 .852 .139 .069MeanT2 .847 .135 .037
MeanT1 .837 -.052 -.138TaskT3 .795 -.090 -.138
GramT3 .764 -.020 -.207DiscT3 .720 -.030 -.236
MeanT3 .702 .020 -.212TaskT1 .670 .028 -.190
IntelT3 .585 .015 -.337MeanT6 .011 .962 -.004
TaskT6 -.044 .933 -.063DiscT6 .056 .918 -.014
GramT6 .094 .847 -.053IntelT6 .033 .804 -.143
MeanT4 .076 -.004 -.897GramT4 .108 .058 -.809
TaskT4 -.057 .125 -.791IntelT4 .090 .120 -.769
DiscT4 .135 .100 -.741TaskT5 .066 .253 -.634
IntelT5 .189 .205 -.584MeanT5 .224 .182 -.573
DiscT5 .288 .133 -.564GramT5 .319 .179 -.484
Extraction Method: Principal Axes Factoring.
Rotation Method: Oblimin with Kaiser Normalization.a Rotation converged in 13 iterations.
7/28/2019 Spaan V8 Kim
12/30
12 H. J. Kim
Table 8. Revised Taxonomy of Speaking Ability (Based on Exploratory Factor Analysis)
FactorsNr of
ItemsItems
Speak 15
Task 1
Task 2Task 3
5
55
MeanT1, GramT1, DiscT1, TaskT1, IntelT1
MeanT2, GramT2, DiscT2, TaskT2, IntelT2MeanT3, GramT3, DiscT3, TaskT3, IntelT3
Listen & Speak withTopical Knowledge
5
Task 6 5 MeanT6, GramT6, DiscT6, TaskT6, IntelT6
Listen & Speak 10
Task 4Task 5
55
MeanT4, GramT4, DiscT4, TaskT4, IntelT4MeanT5, GramT5, DiscT5, TaskT5, IntelT5
Total 30
Results of Confirmatory Factor Analysis
Bachman (2002) argued that a language test should be designed taking task
characteristics into account as well as the construct definition of language ability in order toachieve the intended purpose of the test. In an attempt to understand the speaking test of the
current study in terms of both aspects (i.e., construct definition and task characteristics), thefirst MTMM model was hypothesized in which the 24 items loaded on both trait factors (i.e.,
the four rating scales) and method factors (i.e., the six tasks), while the four trait factorsloaded on a second-order factor, speaking ability (see Figure 2). The rating scale of task
completion was not included as a trait factor in the model since it was considered redundant inrelation to the other rating scales. As a result, six items related to task completion (i.e.,
TaskT1, TaskT2, TaskT3, TaskT4, TaskT5, TaskT6) were deleted for the analysis, making atotal of 24 observed variables. Moreover, correlations among six tasks were not established in
the first model because six different tasks were hypothesized to elicit different aspects ofspeaking ability.
In order to respecify the first model, several attempts were made. First, it was testedwhether four first-order factors (i.e., four trait factors) would load on the second-order factor
(i.e., speaking ability) without any method factors (see Figure 3). The data did not fit themodel, which indicated problems similar to those of the first model (e.g., condition codes and
factor loadings above 1.0). Moreover, the model showed a very poor fit, with a CFI of 0.715and a RMSEA of 0.162. The results confirmed a need for consideration of both construct (i.e.,
rating scales) and task to interpret test scores, since the model without the task factors did not
represent the data. In addition, based on the results of this model, it was decided that four traitfactors should be correlated instead of using of a second-order factor. The model-fitevaluation of the hypothesized model indicated an excellent fit, showing the very high CFI
(0.99) and the very low RMSEA (0.032 with the confidence interval [0.015, 0.044]). In termsof fit indices, the model was ideal since the CFI above 0.95 and the RMSEA below 0.05 are
considered an indication of a well-fitting model (Byrne, 2006). However, the test results werenot reliable due to a condition code for a variance of factor error (Parameter: D2, D2) which
caused an improper solution (e.g., the greater than 1.0 factor loading for Grammatical
7/28/2019 Spaan V8 Kim
13/30
13Investigating the Construct Validity of a Speaking Performance Test
Competence). Such a condition code, which is a common occurrence with MTMM data,might have occurred due to the complexity of model specification (Byrne, 2006). Thus, the
initially hypothesized model was rejected.
Figure 2. The Hypothesized Second-Order MTMM Model of CEP Speaking Placement Test
Mean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence,Intel: Intelligibility, T1T6: Task 1Task 6
7/28/2019 Spaan V8 Kim
14/30
14 H. J. Kim
Figure 3. The Second-order Model without Method FactorsMean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence,
Intel: Intelligibility, T1T6: Task 1Task 6
Another attempt was made before deciding upon a final model. A model was testedwith two additional factors: Rating 1 and Rating 2. The model was run both with and without
the correlation between the two ratings. However, both models were unsuccessful, whichconfirmed that the data were not explained with such models. Therefore, based on an
examination of several possible models, the final MTMM model was established with fourtrait factors which were correlated with each other and six method factors (see Figure 4). This
final model was obtained after statistically testing two assumptions which were made inadvance. The first assumption regarding the deletion of task completion factor was confirmed
since the inclusion of task completion factor to the final model lowered the overall fit of thedata. To test the other assumption related to possible task effect, the final MTMM model was
7/28/2019 Spaan V8 Kim
15/30
15Investigating the Construct Validity of a Speaking Performance Test
also tested with correlations among six method factors. Although the overall fit increased, itshowed very little improvement. Thus, it was concluded that six different tasks measured
different aspects of speaking ability so that the correlations were not included in the finalmodel. Though all estimates were statistically significant, they were not included in Figure 4
since they were not legible with the overabundance of arrows (Refer to Table 10 for the
estimates).As shown in Figure 4, there were 24 dependent variables (i.e., 24 observed variables)and 34 independent variables (i.e., 10 factors and 24 error terms). There were also 78
parameters (i.e., 48 factor loadings, 6 factor covariances, 24 error variances) and 34 fixednonzero parameters (i.e., 10 factor variances, 24 error regression paths). The structure of these
factors and variables as specified in the model was tested based on the covariance matrix.Following the summary of the model, model identification was confirmed in the output.
The model was first assessed as a whole. In terms of residuals, off-diagonal elementswere examined since they play a major role in the effect of Chi-square statistics. The
standardized residual values were evenly distributed, and the average off-diagonal absolutestandardized residual was also quite small, at 0.0156. In addition, the distribution of
standardized residuals was symmetric and centered around zero. As a result, it was found thatvery little discrepancy existed between S(q) (covariance matrix implied by the specified
structure of the hypothesized model) and S (sample covariance matrix of observed variablescores). With regard to the goodness of fit statistics, the independence Chi-square statistic was
5189.140 with 276 degrees of freedom. Although the Chi-square/df ratio was much greaterthan 2, implying a poor model-data fit, it was ignored due to Chi-square sensitivity to sample
size. Instead, fit indices were used for further model-fit evaluation (see Table 9).Table 9. EQS Output Goodness of Fit Statistics
GOODNESS OF FIT SUMMARY FOR METHOD = ROBUST
ROBUST INDEPENDENCE MODEL CHI-SQUARE = 5189.140 ON 276 DEGREES OF FREEDOMINDEPENDENCE AIC = 4637.140 INDEPENDENCE CAIC = 3430.844
MODEL AIC = -179.250 MODEL CAIC = -1149.532
SATORRA-BENTLER SCALED CHI-SQUARE = 264.7500 ON 222 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.02603
FIT INDICES
BENTLER-BONETT NORMED FIT INDEX = 0.949
BENTLER-BONETT NON-NORMED FIT INDEX = 0.989
COMPARATIVE FIT INDEX (CFI) = 0.991
BOLLEN'S (IFI) FIT INDEX = 0.991
MCDONALD'S (MFI) FIT INDEX = 0.905
ROOT MEAN-SQUARE ERROR OF APPROXIMATION (RMSEA) = 0.030
90% CONFIDENCE INTERVAL OF RMSEA (0.011, 0.043)
7/28/2019 Spaan V8 Kim
16/30
16 H. J. Kim
Figure 4. The Final MTMM ModelMean: Meaningfulness, Gram: Grammatical Competence, Disc: Discourse Competence,
Intel: Intelligibility, T1 T6: Task 1 Task 6; F1 F10: Factors 1 Factor 10; V2 V31:Observed Variables 2 31
7/28/2019 Spaan V8 Kim
17/30
17Investigating the Construct Validity of a Speaking Performance Test
As shown in Table 9, the CFI was 0.991 and the RMSEA was 0.03 with the confidenceinterval [0.011, 0.043], both of which indicated an excellent fit. The final indicator of overall
model fit was the number of iterations. According to the iterative summary in the output, onlyfive iterations were needed to reach convergence, which meant that the data fit the model
relatively easily. Thus, it was revealed from the analyses of residuals and fit indices that the
current 24 data fit the 10 factor MTMM model well as a whole.After confirming the good fit of the model as a whole, the fit of individual parameterswas also assessed. The statistical significance of parameter estimates was first checked based
on the unstandardized estimates. All parameter estimates were statistically significant.Therefore, all parameters could be considered important to the model, and none of the
parameters needed to be deleted from the model. Following the unstandardized estimates, astandardized solution was considered (see Table 10).
As shown in Table 10 (next page), the trait factor loadings (i.e., F1 to F4), rangingfrom 0.849 to 0.926, were much higher than method factor loadings (i.e., F5 to F10), ranging
from 0.265 to 0.466. This signified that the four traits (i.e., rating scales) were much strongerindicators than the six tasks, although both needed to be considered. Since the regression
coefficients of errors were quite small, ranging from 0.207 to 0.309, it can be concluded thatthe contribution of errors to the variables was low and the variables were mainly explained by
the factors. All of the very high R-squared values, which refer to the proportion of varianceaccounted for by its related factors, confirmed that all 24 items explained the model fairly
well. Moreover, as assumed above, correlations between the trait factors were quite high ataround 0.98. The four factors were all operationalized constructs of a single construct of
speaking ability. However, extremely high correlations were not considered ideal for analyticscoring since they indicated that four rating scales were almost indistinguishable.
Discussion and Conclusion
The present study examined the underlying structure of the CEP speaking placement
test based on a confirmatory factor analysis. The analysis was conducted with four traitfactors (i.e., meaningfulness, grammatical competence, discourse competence, and
intelligibility) and six method factors (i.e., Tasks 1 to 6). Also, the four trait factors werecorrelated with one another. Although these four traits were assumed to be related by virtue of
being aspects of the same ability, correlations over 0.90 were unexpected. These highcorrelations may indicate that speaking ability cannot be separated into several analytic
aspects, or the raters failed to understand and differentiate among the analytic scoring criteria.For example, raters may have given similar scores to the four rating scales of each task based
on their own impression rather than going over the different criteria carefully, or they may nothave been accustomed to the different criteria because of the short norming period. Further
research on raters rating processes may be required to explain the relationship among thesecomponents of speaking ability.
7/28/2019 Spaan V8 Kim
18/30
18 H. J. Kim
Table 10. EQS Output Standardized Solution
!"%!#"!#
"$
"$
!"$
""$
"$
"$
!"$
""$
"$
"$
!"$
""$
"$
"$
!"$""$
"$
"$
!"$
""$
"$
"$
!"$
""$
"!"$!
$
7/28/2019 Spaan V8 Kim
19/30
19Investigating the Construct Validity of a Speaking Performance Test
The final MTMM model explained the current test data very well, as evidenced by thehigh fit indices. In particular, the four operationalized constructs (i.e., four rating scales)
primarily explained the data with higher factor loadings than the six tasks. In other words,examinees performance on the test was mainly explained by the four constructs of speaking
ability; however, the characteristics of the six tasks had a non-negligible effect on the
examinees performance. Therefore, the results of the current study empirically supported theinteractionalist perspective in which examinees speaking ability is determined in terms ofboth constructs (traits) and task characteristics of the test.
Although the current study contributes to the recent discussion concerning theimportance of both construct definitions and test task characteristics in L2 performance
assessments, it has a number of limitations. First, due to a limited sample size, it was notpossible to include a rating factor as part of the underlying structure of the speaking test
although multiple ratings were available for all examinees responses. It has been argued thatraters are the one of the factors that affects examinees performance (Kenyon, 1992; Linacre,
1989; McNamara, 1995, 1996, 1997). Indeed, previous studies on raters, which analyzedraters rating behaviors both quantitatively and qualitatively, showed rater effects on
performance assessments (e.g., Bonk & Ockey, 2003; Brown, 2005; Chalhoub-Deville, 1995;Eckes, 2005; Meiron & Schick, 2000; Orr, 2002). Therefore, inclusion of a rater/rating factor
might change the underlying structure of the speaking test.The other limitation is that structural equation modeling is a data-specific statistical
tool. In other words, the results of the current analyses cannot be generalized to other CEPspeaking data which include different participants. Likewise, other data sets might be
explained with different factors or different factorial structures. Therefore, in order togeneralize the structure of CEP speaking placement test, repeated analyses of test data with a
larger sample size are required across different test administrations. Only then can the natureof the CEP speaking placement test be understood and, ultimately, can inferences made on
examinees speaking ability be considered reliable.
Acknowledgements
I would like to express my appreciation to the English Language Institute at the
University of Michigan for giving me an opportunity to perform this research. I am also verygrateful to Professor James Purpura and my colleagues at Teachers College, for their
insightful comments and suggestions throughout this study.
References
Bachman, L. F. (1990).Fundamental considerations in language testing. Oxford: OxfordUniversity Press.
Bachman, L. F. (2002). Some reflections on task-based language performance assessment.Language Testing, 19(4), 453476.
Bachman, L. F., & Palmer, A. S. (1996).Language testing in practice: Designing anddeveloping useful language tests. Oxford: Oxford University Press.
Barnwell, D. (1989). Naive native speakers and judgments of oral proficiency in Spanish.Language Testing, 6(2), 152163.
7/28/2019 Spaan V8 Kim
20/30
20 H. J. Kim
Bentler, P. M., & Wu, E. (2005).EQS 6.1 for windows users guide. Encino, CA: MultivariateSoftware, Inc.
Bonk, W. J., & Ockey, G. J. (2003). A many-facet Rasch analysis of the second languagegroup oral discussion task.Language Testing, 20(1), 89110.
Brindley, G. (1994). Task-centred assessment in language learning: The promise and the
challenge. In N. Bird, P. Falvey, A. Tsui, D. Allison, & A. McNeill (Eds.), Languageand learning: Papers presented at the Annual International Language in EducationConference (Hong Kong, 1993) (pp. 7394). Hong Kong: Hong Kong Education
Department.Brown, A. (1995). The effect of rater variables in the development of an occupation-specific
language performance test.Language Testing, 12(1), 115.Brown, A. (2003). Interviewer variation and the co-construction of speaking proficiency.
Language Testing, 20(1), 125.Brown, A. (2005).Interviewer variability in oral proficiency interviews. Frankfurt, Germany:
Peter Lang.Brown, J. D., Hudson, T., Norris, J. M., & Bonk, W. (2002).An investigation of second
language task-based performance assessments. Honolulu: University of Hawaii Press.Byrne, B. M. (2006). Structural equation modeling with EQS. Mahwah NJ: Lawrence
Erlbaum Associates, Inc.Canale, M. (1983). On some dimensions of language proficiency. In J. W. Oller, Jr. (Ed.),
Issues in language testing research (pp. 333342). Rowley, MA: Newbury House.Canale, M., & Swain, M. (1980). Theoretical bases of communicative approaches to second
language teaching and testing.Applied Linguistics, 1(1), 147.Chalhoub-Deville, M. (1995). Deriving oral assessment scales across different tests and rater
groups.Language Testing, 12(1), 1633.Chapelle, C. (1998). Construct definition and validity inquiry in SLA research. In L. F.
Bachman & A. D. Cohen (Eds.),Interfaces between second language acquisition andlanguage testing research (pp. 3270). Cambridge: Cambridge University Press.
Clark, J. L. D. (1975). Theoretical and technical considerations in oral proficiency testing. InR. L. Jones, & B. Spolsky (Eds.), Testing language proficiency (pp. 1028). Arlington,
VA: Center for Applied Linguistics.Clark, J. L. D. (1988). Validation of a tape-mediated ACTFL/ILR-scale based test of Chinese
speaking proficiency.Language Testing, 5(2), 187205.Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance
assessments: A many-facet Rasch analysis.Language Assessment Quarterly, 2(3), 197221.
Elder, C. (1993). How do subject specialists construe classroom language proficiency?Language Testing, 10(3), 235254.
Elder, C., Iwashita, N., & McNamara, T. (2002). Estimating the difficulty of oral proficiencytasks: What does the test-taker have to offer?Language Testing, 19(4), 347368.
Farris, C. S. (1995). A semiotic analysis ofsajiao as a gender marked communication style inChinese. In M. Johnson & F. Y. L. Chiu (Eds.), Unbound Taiwan: Close-ups from a
distance. Selected Papers Vol. 8 (pp. 129). Chicago: Center for East Asian Studies,University of Chicago.
Fulcher, G. (2003). Testing second language speaking. London: Longman.
7/28/2019 Spaan V8 Kim
21/30
21Investigating the Construct Validity of a Speaking Performance Test
Jones, R. L. (1985). Second language performance testing: An overview. In P. C. Hauptman,R LeBlanc, & M. B. Wesche (Eds.), Second language performance testing(pp. 1524).
Ottawa: University of Ottawa Press.Kenyon, D. M. (1992). Introductory remarks at symposium onDevelopment and use of rating
scales in language testing, 14th Language Testing Research Colloquium, Vancouver,
February 27th March 1st.Kim, M. (2001). Detecting DIF across the different language groups in a speaking test.Language Testing, 18(1),89114.
Kim, Y. (2009). An investigation into native and non-native teachers judgments of oralEnglish performance: A mixed methods approach.Language Testing, 26(2), 187217.
Linacre, J. M. (1989).Many-facet Rasch measurement. Chicago: MESA Press.Lumley, T. (1998). Perceptions of language-trained raters and occupational experts in a test of
occupational English language proficiency.English for Specific Purposes, 17, 34767.Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for
training.Language Testing, 12(1), 5471.Lumley, T., & OSullivan, B. (2005). The effect of test-taker gender, audience and topic on
task performance in tape-mediated assessment of speaking. Language Testing, 22(4),415437.
Luoma, S. (2004).Assessing speaking. Cambridge: Cambridge University Press.Lynch, B. K., & McNamara, T. F. (1998). Using G-theory and many-facet Rasch
measurement in the development of performance assessments of the ESL speaking skillsof immigrants.Language Testing, 15(2), 158180.
Malabonga, V., Kenyon, D. M., & Carpenter, H. (2005). Self-assessment, preparation andresponse time on a computerized oral proficiency test.Language Testing, 22(1), 5992.
McNamara, T. F. (1995). Modelling performance: Opening pandoras box.AppliedLinguistics, 16(2), 159179.
McNamara, T. F. (1996).Measuring second language performance. London: Longman.McNamara, T. F. (1997). Interaction in second language performance assessment: Whose
performance?Applied Linguistics, 18(4), 446466.Meiron, B., & Schick, L. (2000). Ratings, raters and test performance: An exploratory study.
In A. J. Kunnan (Ed.),Fairness and validation in language assessment. Selected papersfrom the 19
thLanguage Testing Research Colloquium, Orlando, Florida (pp. 6081).
Cambridge: Cambridge University Press.Messick, S. (1994). The interplay of evidence and consequences in the validation of
performance assessments.Educational Researcher, 23(2), 1323.Morrow, K. (1979). Communicative language testing: Revolution or evolution? In C. J.
Brumfit, & K. Johnson (Eds.), The communicative approach to language teaching(pp.143157). Oxford: Oxford University Press.
Norris, J. M., Brown, J. D., Hudson, T., & Yoshioka, J. (1998).Designing second languageperformance assessments (Technical Report No. 18). Honolulu: University of Hawaii,
Second Language Teaching & Curriculum Center.OLoughlin, K. K. (2002). The impact of gender in oral proficiency testing.Language Testing,
19(2), 169192.Orr, M. (2002). The FCE speaking test: Using rater reports to help interpret test scores.
System, 30, 143154.
7/28/2019 Spaan V8 Kim
22/30
22 H. J. Kim
Savignon, S. J. (1972). Communicative competence: An experiment in foreign languageteaching. Philadelphia: The Center for Curriculum Development.
Shohamy, E. (1994). The validity of direct versus semi-direct oral tests.Language Testing,11(2), 99123.
Skehan, P. (1998).A cognitive approach to language learning. Oxford: Oxford University
Press.SPSS Inc. (2001). SPSS Base 12.0 for Windows[Computer Software]. Chicago IL: SPSS Inc.Wigglesworth, G. (1993). Exploring bias analysis as a tool for improving rater consistency in
assessing oral interaction.Language Testing, 10(3), 305335.Wigglesworth, G. (1997). An investigation of planning time and proficiency level on oral test
discourse.Language Testing, 14(1), 85106.
7/28/2019 Spaan V8 Kim
23/30
23Investigating the Construct Validity of a Speaking Performance Test
Appendix A. Speaking Test Tasks
Task 1. Catering serviceIn this task, you need to complain about something. Imagine you have ordered food from
Party Planners Inc. for your bosss birthday party. But there was not enough food and it was
delivered late. You spent a week planning the party, but it was ruined because of the food.You were extremely upset that it happened. Call the caterer to complain about it. You have 20seconds to plan.
Prompt (Audio)
[phone ringing] (Answering Machine) Hi! Youve reached Party Planners Inc. Were sorry,but were not available to take your call right now. Please leave a detailed message after the
beep, and well get back to you as soon as possible. [Beep]Test-Taker: (45 sec response time)
Task 2. Favorite movieIn this task, you will be asked to talk about a movie. Think about a movie that you liked and
tell your friend about it. You have 20 seconds to plan.
Prompt (Vidio)
Your friend: So, what was that movie you liked? What is it about?
Test-Taker: (60 sec response time)
Task 3. Fly in soupIn this task, you need to tell the story in the pictures. Look at the pictures (Pictures are shown
on the screen). Imagine this happened yesterday while you were having dinner at the nexttable. Tell your friend what you saw. You have 60 seconds to plan your response.
Prompt (Video)
Your friend: So, what happened last night at the restaurant?Test-taker: (60 sec response time)
7/28/2019 Spaan V8 Kim
24/30
24 H. J. Kim
Task 4. Moving outIn this task, you need to refuse a request. Imagine you are renting an apartment from a nice
old couple in New York City. You have been living there for over a year. Now, listen to atelephone message from the couple.Hi, this is Mary, your landlady. Tom and I have been trying to contact you, but you never seem
to be home. I guess you're really busy these days. Anywaywell, I don't know how to say this,butour granddaughter is moving to the City next month. She's gonna study at Columbiaand,as you know, living in the city is expensive, and the rents are really high. So, she asked us if she
could live in the apartment you have now. I know we just renewed your lease, and we have noright to ask you to move out, and, we really like you, too. But, do you think you can possibly
look for a different apartment? We're really sorry about this, but we have to do this for ourgranddaughter. Since theres not much time, we'd like to hear from you as soon as possible, so
we can let our granddaughter know too. Again, we're sorryCall and let us know, ok? Thanks.(162 words)(Q) Politely tell your landlady that you cant move out and explain why. You have 30 seconds
to plan.
Prompt (Audio)
Landlady: Hi. Come on in. Did you get our message? Have you thought about moving out?Test-taker: (45 sec response time)
Task 5. Electric cars
In this task, you will be asked to summarize a radio commentary for a friend. Imagine yourfriend, Jim is thinking about buying an electric car. Now, listen to the radio commentary.
(Host of the radio commentary) Today, were talking about electric cars. As youre well
aware, the conventional cars we drive everydayuse a lot of gasoline. You know, how theprice of gasoline is going upand more importantly, theres the issue of global warmingthese cars release harmful pollutants, like carbon monoxide. So, in reaction to this, engineers
have been working on cars that run on electric batteries, so lets hear about the current stateof the technology. We have a pre-recorded commentary by Ben Smith from General Autos.
Well, despite high expectations, the first generation of electric cars turned out to be acomplete failure. Why? The first problem is the batteryI mean, current battery technology is
still very limited. So electric cars can only travel a short distance before its battery needsrecharging. What this means is you cant make long trips without worrying about the battery
running out. Theyre only good for short trips like going to the supermarket or picking up thekids from school. And when you turn the air conditioner or the radio on, the battery is used up
even quicker.Then, you might say, we can just recharge the battery when its used up.
Welltheres a serious problem with recharging, too. To recharge a battery, we need anelectric outlet, right? But there arent many charging stationswhich means, the driver might
get stuckwithout being able to find a charging station nearby. Well, it gets even morefrustrating. Even if you can find a station, it takes up to 3 hours to fully recharge a battery. Its
way too long. Well, with these many limitations, does it make sense that anyone would wantto buy an electric car, even if it is environmentally friendly?
7/28/2019 Spaan V8 Kim
25/30
25Investigating the Construct Validity of a Speaking Performance Test
(Q) Summarize what you heard on the radio for Jim. Be sure to include two main problemswith electric cars. You have 30 seconds to plan.
Prompt (Video)
Jim: Did I tell you Im thinking about buying an electric car?
Test-taker: (60 sec response time)
Task 6. Barbizon schoolIn this task, you will be asked to summarize a lecture for a classmate. Imagine your classmate,
Jennifer missed todays lecture about the Barbizon school. Now, listen to the lecture.
Today, well talk about a group of artists, called the Barbizon School. The Barbizon School is
a group of French artists, who lived in the French town, Barbizon and who developed thegenre of landscape painting. So, what are their characteristics?
The Barbizon painters tried to find comfort in nature. I mean, they moved away from all the
commotion and disruption happening in, then, revolutionary Paris, and sought solace innature. And nature was the main theme of their paintingsthey painted landscapes and
scenes of rural life as true to life as possible. And they rejected the idea of manipulating orbeautifying nature. Instead, they tried to achieve a true representation of the countryside. OK?
Second, in addition to the efforts to paint nature as realistically as possible, they also tried to
establish landscape as an independent, legitimate genre in France. Traditionally, landscapepainting wasnt appreciated as a separate genre, but only considered as a background. But
Barbizon artists reacted against this convention of classical landscape, and painted landscapefor its own sake. With their huge success and recognition, the painters of the Barbizon school
established landscape and themes of country life as vital subjects for French artists.
Now, lets look at an examplea painting by Rousseau. This one is called The Forest inWinter at Sunset. [Show the painting on screen]. It shows the ancient forest near the village
of Barbizon. Rousseau is the best known member of the group. Each Barbizon painter had hisown style and specific interests, and Rousseaus vision was melancholic and sad. Can you feel
the depressing mood of the painting? At the top, a tangle of tree limbs, and birds flying intothe cloudy, dark, sunset sky. After the sun sets, the forest will be freezing cold. Rousseau
worked on this painting off-and-on for twenty years. He considered this his most importantpainting and refused to sell it during his lifetime.
(Q) Summarize the lecture for Jennifer. Be sure to include two main characteristics of the
school and the example shown. You have 30 seconds to plan.
Prompt (Video)
Jennifer: So, what was the lecture about? What did I miss?
Test-taker: (60 sec response time)
7/28/2019 Spaan V8 Kim
26/30
7/28/2019 Spaan V8 Kim
27/30
27Investigating the Construct Validity of a Speaking Performance Test
GrammaticalCompetence:Accuracy,
ComplexityandR
ange
5Excellent
4Good
3Ad
equate
2Fair
1Limited
0No
Theresponse:
Theresponse:
Therespon
se:
Theresponse:
Theresponse:
Theresponse:
is
grammatically
accurate.
is
generally
grammatically
accuratewithoutany
majorerrors(e.g.,
articleusage,
subject/verb
agreement,etc.)
that
obscuremeaning.
ra
relydisplaysmajor
errorsth
atobscure
meaningandafew
minorerrors(but
whatthespeaker
wantsto
saycanbe
understood).
di
splaysseveral
majorerrorsaswell
asfrequentminor
errors,causing
confusion
sometimes.
is
almostalways
grammatically
inaccurate,which
causesdifficultyin
understandingwhat
thespeakerwantsto
say.
dis
playsno
gra
mmaticalcontrol.
di
splaysawiderange
ofsyntacticstructures
andlexicalform.
di
splaysarelatively
widerangeof
syntacticstructures
andlexicalform.
di
splays
asomewhat
narrowrangeof
syntacticstructures;
tooman
ysimple
sentences.
di
splaysanarrow
rangeofsyntactic
structures,limitedto
simplesentences.
di
splayslackofbasic
sentencestructure
knowledge.
dis
playsseverely
lim
itedornorange
andsophisticationof
gra
mmaticalstructure
andlexicalform.
di
splayscomplex
syntacticstructures
(relativeclause,
embeddedclause,
passivevoice,etc.)
andlexicalform.
di
splaysrelatively
complexsyntactic
structuresandlexical
form.
di
splays
somewhat
simples
yntactic
structures
di
splaysuseof
simpleand
inaccuratelexical
form.
di
splaysgenerally
basiclexicalform.
co
ntainsnotenough
evidencetoevaluate.
di
splays
useof
somewh
atsimpleor
inaccuratelexical
form.
7/28/2019 Spaan V8 Kim
28/30
28 H. J. Kim
DiscourseCompetence:OrganizationandCohesion
5Excellent
4Good
3Adequate
2Fair
1Limited
0No
Theresponse:
Theresponse:
Theresponse:
Theresponse:
Theresponse:
Ther
esponse:
is
completely
coherent.
is
generally
coherent.
is
occasionally
incoherent.
is
looselyorganized
,
resultingingenerally
disjointeddiscourse
.
is
generally
incoherent.
is
incoherent.
is
logically
structuredlogical
openingsand
closures;logical
developmentof
ideas.
di
splaysgenerally
logicalstructure.
co
ntainspartsthat
displaysomewhat
illogicalorunclear
organization;
however,asawhole,
itisinge
neral
logically
structured.
of
tendisplays
illogicalorunclear
organization,causin
g
someconfusion.
di
splaysillogicalor
unclearorganization,
causinggreat
confusion.
dis
playsvirtually
non-existent
organization.
at
timesdisplays
somewha
tloose
connectionofideas.
di
splayssmooth
connectionand
transitionofideasby
meansofvarious
cohesivedevices
(logicalconnectors,a
controllingtheme,
repetitionofkey
words,e
tc.).
di
splaysgooduseof
cohesivedevicesthat
generallyconnect
ideassmoothly.
di
splaysuseof
simplecohesive
devices.
di
splaysrepetitive
useofsimple
cohesivedevices;use
ofcohesivedevices
arenotalways
effective.
di
splaysattemptsto
usecohesivedevices,
buttheyareeither
quitemechanicalor
inaccurateleavingthe
listenerconfused.
co
ntainsnotenough
evidencetoevaluate.
7/28/2019 Spaan V8 Kim
29/30
29Investigating the Construct Validity of a Speaking Performance Test
TaskCompletion
Towhatextentdoesthespeakercompletethetask?
5Excellent
4Good
3Adequate
2Fair
1Limited
0No
Theresponse:
Theresponse:
The
response:
Theresponse:
Theresponse:
Theresponse:
fu
llyaddressesthe
task.
ad
dressesthetaskwell
adequatelyaddresses
th
etask.
insufficiently
addressesthetask.
barelyaddressesthe
task.
showsno
understandingofthe
prompt.
displayscompletely
accurateunderstanding
ofthepromptwithout
anymisunderstood
points.
includesnonoticeably
misunderstoodpoints.
includesminor
m
isunderstanding(s)
th
atdoesnotinterfere
w
ithtaskfulfillment.
displayssome
major
incomprehension/
misunderstand
ing(s)
thatinterferes
with
successfultask
completion.
displaysmajor
incomprehension/
misunderstanding(s)
thatinterfereswith
addressingthetask.
containsnotenough
evidencetoevaluate.
completelycoversall
mainpointswith
completedetails
discussedinthe
prompt.
completelycoversallmain
pointswithagoodamountof
detailsdiscussedintheprompt.
(e
.g.,)
ElectricCars:twoproblemswith
th
ecurrenttechnology(battery
ru
nningoutquicklyand
in
convenienceinrecharging)
BarbizonSchool:2characteristics
of
theschoolandoneexample
(p
aintednatureandestablished
landscapingasanindependent
ge
nre,andtheForestinthesunset
as
anexample)
OR
touchesuponallmain
points,butleavesout
details.OR
completelycovers
one(ortwo)main
pointswithdetails,
butleavestherest
out.
OR
touchesuponbitsand
piecesofthep
rompts.
7/28/2019 Spaan V8 Kim
30/30
30 H. J. Kim
Intelligibility
Pronunciationandprosodic
features(intonation,rhythm,an
dpacing)
5Excellent
4Good
3Adequate
2Fair
1Limited
0No
Theresponse:
Theresponse:
Therespon
se:
Theresponse:
Theresponse:
The
response:
iscompletely
intelligible
althoughaccent
maybethere.
mayincludeminor
difficultieswith
pronunciationor
intonation,but
generallyintelligible.
maylack
intelligibilityin
places
impeding
comm
unication.
oftenlacks
intelligibility
impeding
communication.
generallylacks
intelligibility.
completelylacks
in
telligibility.
isalmostalways
clear,fluidand
sustained.
isgenerallyclear,
fluidandsustained.
Pacemayvaryat
times.
exhibitssome
difficu
ltieswith
pronunciation,
intona
tionor
pacing
.
frequentlyexhib
its
problemswith
pronunciation,
intonationor
pacing.
isgenerally
unclear,choppy,
fragmentedor
telegraphic.
containsnotenough
ev
idencetoevaluate.
doesnotrequire
listenereffort.
doesnotrequire
listenereffortmuch.
exhibitssome
fluidit
y.
maynotbe
sustainedata
consistentlevel
throughout.
containsfrequent
pausesand
hesitations.
mayrequiresome
listene
reffortsat
times.
mayrequire
significantlisten
er
effortattimes.
containsconsistent
pronunciationand
intonation
problems.
requires
considerable
listenereffort.