Language Education & Technology, 58 (2021), 17–41
Development of an Automated Speech Scoring System: A Comparison With Human Raters

HIRAI, Akiyo (University of Tsukuba)
KONDO, Yusuke (Waseda University)
FUJITA, Ryoko (Juntendo University)

Abstract

This study examines the accuracy of an automated speech scoring system. The system graded English language learners' retelling performances according to five features, and its scores were compared to those given by both non-native and native English-speaking (NNES and NES) raters. The results show that, of the five features, words per second was the most consistent predictor of both NNES and NES evaluations. However, the NNES rater tended to pay more attention to exact word similarities between the speech utterances and the original text, while the NES raters focused more on similarities of meaning and gave credit to rephrased expressions. Additionally, the correspondence between the automated scores and those given by human raters was moderate (exact agreement = 48% to 65%; rs = .48 to .52), though less than that between the NNES and NES scores (rs = .70). These results indicate that the automated scoring system for retelling performances may be applicable to low-stakes tests if the speech transcription of learners' utterances is obtained.

Keywords: automated scoring, speaking, retelling task, non-native rater, native rater


1. Introduction

In today's highly computerized society, administering performance tests is increasingly done online, but scoring them is still time-consuming. This preliminary study aims to find a way to realize automated scoring for classroom speaking tests. The recent trend toward developing more authentic measures of language ability and performance-based tests that elicit authentic samples has created greater demand for efficient and reliable scoring approaches (Xi, 2012). In line with this trend, secondary school teachers have been encouraged to provide their students with more opportunities to use English and to evaluate their performance skills, a more time-consuming process than assessing receptive skills (MEXT, 2014). In this regard, automated scoring can offer a partial solution to the increasing demand for performance testing in class.

1.1 Automated Scoring Technology

Speech recognition technologies have improved greatly in the past few decades (Litman et al., 2018) and have gradually been applied to the assessment of speaking performances. Measurable features of speech performances, such as the frequency of lexical and grammatical items, are used as benchmarks for scores that are then compared to human evaluations (Kobayashi & Abe, 2016). Automated speech scoring is challenging because speech performance involves unstable variables, differing widely in voice, pronunciation, accent, and other prosodic aspects. For a machine to analyze a second language (L2) speaking performance, the performance must first be transcribed (Franco et al., 2010). Although speech-to-text (STT) or speech recognition technologies exist, they are not yet accurate enough to be relied on for this purpose. Currently, they may play a role in analyzing the mechanical attributes of speech, such as speakers' pronunciation errors, based on transcriptions produced by STT devices, reserving more cognitively complex elements for human scoring.

Extant research on automated speaking assessment tasks includes a study by Cox and Davies (2012) that used elicited oral response (EOR) testing to examine whether automated EOR scores could replace oral interview testing. In EOR testing, test-takers listened to sentences and repeated what they heard. The study showed that automated scores predicted oral interview results fairly well; thus, it was suggested that the automated speech recognition of EOR could be used to assess test-takers' speaking ability in relatively low-stakes applications, such as the initial placement of students in language training. In another study, Cucchiarini et al. (2000) examined the task of read


speech by native and non-native Dutch speakers. The fluency of read speech was scored by human raters and an automatic speech recognizer. The results showed that quantitative measures, especially speech rate, were good predictors of the fluency perceived by human raters.

Regarding oral test engines, Versant (Pearson Education, 2008) is well known. It was originally developed by Ordinate Corporation and is now operated by Pearson (Litman et al., 2018). The Versant online test has six sections, most of which contain structured prompts, such as repeating a sentence and answering a question. Utterances are analyzed by algorithms that have established the content and manner of the responses. The content shows how well the candidate has understood the item and is based on the number of exact words produced in the response. The manner of the responses is evaluated on four dimensions, including pronunciation and vocabulary. Correlations with human raters are .89 for pronunciation and .94 for vocabulary, and the total score correlation is .97 (Burstein & Cheng, 2007; Xi, 2012).

Another well-known engine is the SpeechRater, developed by ETS. It extracts various features from the audio signal, such as articulation rate, pauses, and types per second, and combines them by means of a multiple-regression scoring model (Zechner et al., 2009). It has been implemented in TOEFL Practice Online since 2006, in which integrated tasks are used, and examinees are asked to read or listen to stimulus materials and then respond to questions. The responses entail the unprepared, spontaneous speech of language learners and are evaluated on features of pace, pronunciation, and fluency (Bridgeman et al., 2012). Zechner et al. (2009) reported a correlation of .57 between machine and human scores on an online TOEFL practice test, which was lower than the inter-human correlation of .74 on the complete set of six items but may be adequate for low-stakes testing.

Molholt et al. (2011) compared three speech analysis methods: quantitative analyses by fully automatic evaluation, notional methods, and comprehensive evaluations adapting rubrics. Their results suggested that automatic quantitative analysis was convenient but did not cover some categories related to vocabulary, grammar, and topic development. Molholt et al. concluded that for high-stakes decisions, the third, comprehensive approach was the best. In a more recent study on SpeechRater, Yoon and Zechner (2017) investigated an efficient way to combine human and automated scores. They proposed a set of filtering systems to identify responses that are difficult to score automatically. Various tasks were used in this study, such as completing sentence fragments and producing utterances using


given keywords and information. The correlation in scoring between human raters and the baseline system was .72; the extended filtering model improved the correlation to .82.

These automated scoring systems for L2 speech have practical value, and their congruence with human rating is approaching that of inter-human scoring. However, they are institutional systems that classroom teachers cannot utilize for their students' speaking performances. In addition, an automated scoring system developed for one task cannot be applied to a different task, as these systems depend on task-specific variables.

1.2 Differences between Native and Non-Native English-Speaking Raters

In many English as a foreign language (EFL) classes at the secondary school level, non-native English-speaking (NNES) teachers are in charge of teaching and assessing their students' speaking performances, often sharing these roles with native English-speaking (NES) teachers. Therefore, the impact of the difference in assessment between NES and NNES raters needs to be clarified, if indeed it exists. To this point, Kim (2009a) reported that most score variability was attributable to students' speaking ability, with a very small rater effect, and that NES and NNES raters contributed similarly to score variability in their ESL speaking assessments. Thus, Kim concluded that NNES and NES raters might be equally reliable. This result agrees with Brown's (1995) findings, which suggested that NES and NNES raters exhibited little difference in how strictly and consistently they rated students' speaking performances.

However, Kim (2009b) and Zhang and Elder (2011) pointed out some differences in qualitative aspects. Kim mentioned that a qualitative analysis revealed that NES teachers made more detailed and elaborate judgments than NNES teachers did regarding pronunciation, specific grammar use, and the accuracy of transferred information. Sato (2011) reported that Japanese NNES teachers were more lenient than NES raters in terms of grammatical accuracy, fluency, vocabulary range, and pronunciation, aside from the content elaboration/development criterion. Another difference he pointed to was that only fluency and content elaboration/development significantly predicted Japanese raters' overall scoring, while all the criteria significantly predicted NES teachers' scores. These qualitative differences between NES and NNES raters may derive from NES raters' experience with the test-takers' first language (L1) (Winke & Gass, 2012).


NES raters who know test-takers' language background tend to show some understanding of them, which may influence the ratings. In other words, raters' L2 background seems to be a potential source of bias in rating oral performance (Winke & Gass, 2012). To summarize these studies, the difference between NES and NNES raters' assessments appears to be small, despite some variability in how decisions on L2 learners' speaking performances are made. However, whether the differences between NES and NNES raters are smaller than the differences between human raters and automated scoring has not been investigated. This matters when developing an automated scoring system, because the judgment accuracy of automated scoring fluctuates depending on the reference or benchmark raters. In addition, by incorporating rater variability such as NES versus NNES when testing human-human reliability against human-computer reliability, our rater sample can be seen as a diverse representation of authentic human raters, and the results should be more generalizable.

1.3 Automated Scoring for a Retelling Task

In recent years, instead of structured tasks such as reading aloud, repetition, and short-answer questions, more authentic tasks have been recommended. In consideration of learners' target language use (TLU) domains (Bachman & Palmer, 2010), tasks that students may encounter in real life make more sense to them, and successful completion of such tasks builds confidence in using the target language outside the context of exercises. For example, the settings of the TOEFL iBT speaking tasks are classroom and campus situations, and the tasks require multiple skills, such as listening-and-speaking or reading-listening-and-speaking tasks that EFL students studying abroad are likely to encounter.

In view of the limited range of tasks suited to automated scoring and the current trend in performance tasks, this study proposes the use of a story-retelling task for classroom use, in which each student reads a story and tells a classmate about it. The task is fairly authentic because, in their target domain, students may have opportunities to explain what they have read or heard to other students in the target language (Chaudron, 2003). Beyond authenticity, secondary school teachers often give students a retelling task to diagnose whether they understood the text they studied and to consolidate what they learned by producing it in their own words (e.g., Muranoi, 2007). There are other advantages to this task. First, it is highly adaptable because teachers can assign texts suitable for their students' level and interests. Second,


teachers can easily include learning points such as target grammar items, words, and expressions in a text (Hirai, 2015; Hirai & Koizumi, 2009).

Nippold et al. (2017) reported on the use of fables as a retelling task for adolescents in their L1. The participants read fables, retold them, and answered critical-thinking questions. The findings suggested that the fable task was effective in encouraging adolescents to use complex language and express their opinions. In L2, the effect of retelling an L2 text in the L2 on students' speaking skills has also been reported (e.g., Nagasaki, 2013).

Furthermore, the retelling task can be used as a speaking test. Koizumi and Hirai (2012) compared the Story Retelling Speaking Test (SRST) with two other tests, Versant and the Standard Speaking Test (SST, ALC Press, n.d.), in terms of speaking functions, scores, and utterances. The SRST was moderately correlated with both the Versant (.64) and the SST (.66), and it discriminated among beginner- and intermediate-level learners better than the other two tests. Moreover, as the SST aims to simulate natural conversation, the finding that 43% of the SST score variance was explained by the SRST implies that the construct of the SRST is related to real-life interactive communication.

The retelling task may also be suitable for automated scoring, since students' utterances are expected to be similar to the original text, using the same or similar vocabulary and expressions to convey its content. Yokouchi (2015) analyzed the utterances of students who were instructed to retell or summarize a text, comparing the retold and summarized utterances on complexity, fluency, and accuracy (CFA) indices. The results did not show significant differences, though students used more words than were found in the original text in their retellings. Another advantage of developing a retelling task for automated scoring is that, as more retelling performances on various topics are used for machine learning, less time is needed for the machine to learn a new topic: because the words used by examinees in retelling tasks are limited, dominant predictor variables can be found more easily than in tasks such as spontaneous speech. Therefore, the approach may be applicable to a wide variety of classes.

Based on these reasons, the current study examines how accurately automated rating can score students' retelling performances. To this end, we address the following two research questions (RQs):


RQ1. Are there variabilities between NES and NNES raters' evaluations when compared to automated scoring features?
RQ2. To what extent do the scores reported by the automated scoring system and those reported by human raters correlate with each other?

2. Method

2.1 Participants

A total of 118 Japanese university students took part in this study. They were freshmen majoring in various fields, such as humanities and culture, informatics, and medicine and medical sciences. Their proficiency levels ranged from CEFR A1 to B1, based on TOEFL ITP scores obtained from a placement test.

2.2 The Speaking Test and Procedure

We adopted the retelling speaking test for the reasons mentioned above. The story text used for the retelling test consisted of 129 words on the topic "Taking a Violin Lesson" (see the text in Section 4). The readability of the text as measured by the Flesch-Kincaid Grade Level (FKGL) was 6.7, which was fairly easy for the participants. Phrases containing important grammar items were underlined in the text to encourage participants to notice and use them in the retelling session.

The retelling test was conducted in computer-assisted language learning (CALL) classrooms. The task consisted of two parts: reading and retelling. In the reading part, participants read the story silently for two minutes, then read it aloud once, and answered three written comprehension questions. In the retelling part, they were asked to retell the story in 100 seconds, in as much detail as possible, without looking at the text. To help them recall the story while retelling, the target grammar items and five keywords were presented.

2.3 Scoring

Since the aim of the retelling test was to convey information from the original source text as accurately as possible, the human raters evaluated how adequately the students covered the content of the original text within the specified time. This content criterion had four hierarchical levels: 0 (no story content is produced), 2 (some story content is produced), 4 (most story content is produced), and 6 (almost all the story content is produced).


Two sets of human rater evaluations, one by two NNES raters and one by two NES raters, were used as criterion variables in this study. The NES raters were speakers of American English who were born and raised in the U.S. and had taught English at a Japanese university for more than 10 years. Each of them rated half of the speeches due to time constraints. The NNES raters were both native speakers of Japanese. One of them was a graduate student majoring in applied linguistics, and the other had taught English at a Japanese university for more than 15 years. The two Japanese raters both scored 25% of the participants' speeches. Since their inter-rater reliability (Spearman's rho) was .73, the graduate student rater scored the remaining speeches. We regarded this as acceptable for classroom speaking tests because an inter-rater correlation coefficient of .70 or above has been considered sufficient for other speaking performance tests (e.g., Kang & Rubin, 2012; Lado, 1964). The combination of raters with different nationalities and teaching backgrounds is informative because such a combination is typical of English teaching at secondary schools in Japan. All raters received training with an explanation of the rating scale and listened to three audio files as benchmarks. They then evaluated the spoken data individually.

2.4 Analysis

First, all learners' utterances were manually transcribed for this experiment, since the accuracy of word recognition software could not yet be guaranteed. Then, the words of both the transcriptions of the learners' utterances and the original retelling text were lemmatized for accurate analysis. Next, five linguistic features were selected as automated predictors of the human raters' evaluations. These five features measured fluency and the linguistic and content similarities between the examinees' utterances and the original text; their details are explained in Section 2.5. For RQ1, to examine how accurately the machine predicts the NES and NNES evaluations, we adopted a decision tree classifier using the five features. For RQ2, to examine the strength of the relationship between human and machine scoring, a correlational analysis was performed.
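As an illustration of the lemmatization step, the minimal sketch below uses NLTK's WordNetLemmatizer; the paper does not state which tool was actually used, so this is only one plausible realization of the pre-processing.

```python
# Minimal sketch of the lemmatization step, assuming NLTK's WordNetLemmatizer;
# the paper does not specify which lemmatizer the authors actually used.
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # one-time resource download

lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(text):
    """Lower-case, split on whitespace, strip punctuation, and lemmatize each
    word first as a verb and then as a noun (so 'walked' -> 'walk')."""
    lemmas = []
    for token in text.lower().split():
        token = token.strip(".,!?")
        verb_lemma = lemmatizer.lemmatize(token, pos="v")
        lemmas.append(lemmatizer.lemmatize(verb_lemma, pos="n"))
    return lemmas

print(lemmatize_tokens("She wanted to walk to school yesterday"))
# ['she', 'want', 'to', 'walk', 'to', 'school', 'yesterday']
```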

2.5 Five Features Used for Automated Scoring

The features used to predict the human raters' scores were selected on the assumption that retellings with more similarities, both quantitative and qualitative,


to the original text would obtain higher scores. Based on this assumption, we chose three text similarity features frequently used in measuring the distance between documents (cf. Ilijoski & Popeska, 2018; Zechner et al., 2009): (a) longest common subsequence (LCS), (b) Jaccard index (JCD), and (c) cosine similarity (COS). In addition, two basic features were adopted as predictor variables that may distinguish examinees' performances: (d) words per second (WPS) to measure fluency, and (e) ratio of insertions (INS) to measure richness of vocabulary. These features are described individually below, and a short computational sketch of all five is given after their descriptions.

(a) Longest common subsequence (LCS): A subsequence is a part of a sequence whose elements appear in the same relative order. When two sequences include the same subsequence, it is called a common subsequence. The words of a subsequence must appear in the same order but need not be contiguous. In the two sentences shown in Example 1, "I," "to," and "school" form the longest common subsequence, so the LCS is 3. We programmed a routine in Python to automatically calculate the LCS between each examinee's whole utterance and the original text. The longer the LCS, the more similar the two texts are.

(Example 1)
Original text: I went to school yesterday.
Examinee A: I yesterday walked to school.

(b) Jaccard index (JCD): The Jaccard index is the number of word types common to two sequences divided by the number of all word types in the two sequences. In Example 1, the number of common types is 4 (I, to, school, yesterday), and the number of all types is 6 (I, went, to, school, walked, yesterday), so the index is .67 (= 4/6). The index is 1 if the two sequences consist of the same elements and 0 if they have no words in common.

(c) Cosine similarity (COS): Cosine similarity is the cosine of the angle between two vectors. To compute COS, utterances must be converted into vectors. We adopted the "bag-of-words" model to represent utterances. In this model, utterances are expressed by the frequency of occurrence of words, but the word order of the


utterances is disregarded. For example, the four sentences (A–D) in Example 2 are counted as shown in Table 1.

(Example 2)
A. I walk to school
B. John walked to school
C. Yesterday, she walked to school
D. She wanted to walk to school yesterday

Table 1
Bag-of-Words Expression of Sentences in Example 2

Word        Sentence A   Sentence B   Sentence C   Sentence D
I                1            0            0            0
John             0            1            0            0
School           1            1            1            1
She              0            0            1            1
To               1            1            1            2
Walk             1            1            1            1
Want             0            0            0            1
Yesterday        0            0            1            1

Once an utterance is converted to a vector, we can calculate the COS for two vectors p and q with the following equation:

COS(p, q) = (Σᵢ pᵢqᵢ) / (√(Σᵢ pᵢ²) √(Σᵢ qᵢ²))

For example, the COS between sentences A and B in Table 1 is calculated as follows. The two vectors are p = (1, 0, 1, 0, 1, 1, 0, 0) and q = (0, 1, 1, 0, 1, 1, 0, 0), so the numerator is 0 + 0 + 1 + 0 + 1 + 1 + 0 + 0 = 3 and each vector's length is √4 = 2, giving a COS of 3 / (2 × 2) = .75. In this manner, values of COS between each examinee's utterance and the original text were obtained in our analysis.


(d) Words per second (WPS): The number of words uttered is divided by the number of seconds it took to produce the utterance. For example, if a learner produces 100 words in 65 seconds, the learner's WPS is 1.54 (100 divided by 65). To convey all the information in the story, a sufficient amount and rate of speech are necessary; thus, the higher this value, the more fluent the utterance.

(e) Ratio of insertions (INS): The number of word types that appear only in the examinee's utterance is divided by the number of word types that appear in both the examinee's utterance and the original text. The higher the value, the more the examinee has used vocabulary or expressions beyond those of the original text.
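To make the five features concrete, the following minimal Python sketch computes LCS, JCD, COS, WPS, and INS for a pair of token lists. It is an illustrative reimplementation under the definitions above, not the authors' actual scoring routine, and the function names, example values, and assumed utterance duration are our own.

```python
import math

def lcs_length(a, b):
    """(a) LCS: length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def jaccard(a, b):
    """(b) JCD: shared word types divided by all word types in the two lists."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """(c) COS: cosine of the angle between the bag-of-words vectors of a and b."""
    vocab = sorted(set(a) | set(b))
    p = [a.count(w) for w in vocab]
    q = [b.count(w) for w in vocab]
    dot = sum(x * y for x, y in zip(p, q))
    return dot / (math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q)))

def words_per_second(utterance, seconds):
    """(d) WPS: number of words uttered divided by speaking time in seconds."""
    return len(utterance) / seconds

def insertion_ratio(utterance, original):
    """(e) INS: word types appearing only in the utterance divided by the word
    types shared by the utterance and the original text."""
    only_in_utterance = set(utterance) - set(original)
    shared = set(utterance) & set(original)
    return len(only_in_utterance) / len(shared) if shared else 0.0  # 0.0 is an assumed fallback

# Example 1 (surface forms, to match the worked figures in the text):
original = "i went to school yesterday".split()
examinee = "i yesterday walked to school".split()

print(lcs_length(original, examinee))                 # 3    ("I", "to", "school")
print(round(jaccard(original, examinee), 2))          # 0.67 (4 shared / 6 total types)
print(round(cosine(original, examinee), 2))           # 0.8
print(round(words_per_second(examinee, 3.2), 2))      # 1.56, assuming a 3.2-second utterance
print(round(insertion_ratio(examinee, original), 2))  # 0.25 ("walked" / 4 shared types)
```

In the actual analysis, the same kind of functions would be applied to each examinee's lemmatized transcription and the lemmatized original text.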

3. Results

3.1 Descriptive Statistics

The results of the NNES and NES evaluations are shown in Table 2. The columns for scores of 2, 4, and 6 list the number of learners who received each score. The two rightmost columns show the scores' means and standard deviations, which indicate that the NNES rater was stricter than the NES raters.

Table 2
Descriptive Statistics of NNES and NES Evaluations

         Score 2   Score 4   Score 6     M      SD
NNES        61        46        11      3.15   1.32
NES         40        66        12      3.53   1.24

3.2 Differences between NNES and NES on the Five Automated Scoring Features

To predict the NES and NNES evaluations from these five features, we adopted a decision tree classifier, a supervised learning technique in machine learning. It constructs a prediction model from data and visualizes the classification procedure as a tree-like diagram (see Figures 1 and 2). Compared with other supervised learning techniques, such as the support vector machine and the naïve Bayes classifier, the decision tree classifier can reflect the importance of features in the prediction, and differences across rater behaviors can be interpreted easily.


Under the decision tree classifier, datasets are divided according to the five features so as to reduce the impurity of the categorized data. Each division is determined based on the Gini impurity, a measure of impurity in a set of data; a low value indicates high purity. By repeating divisions based on the Gini impurity values, the optimal discriminant thresholds are eventually found. In other words, the decision tree classifier yields a model that indicates how well each feature separates the data into high-purity sets. For a node containing C score categories, where pᵢ is the proportion of cases in category i, the Gini impurity is:

Gini = 1 − Σᵢ₌₁..C pᵢ²
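As an illustration of this step, the short sketch below fits a Gini-based decision tree with scikit-learn and reads off feature importances. The paper does not state which implementation was used, and the feature values and scores here are placeholders, not the study's data.

```python
# Minimal sketch of the classification step with scikit-learn's
# DecisionTreeClassifier; arrays below are placeholders for the real
# feature matrix and human scores.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["LCS", "JCD", "COS", "WPS", "INS"]

# X: one row per examinee with the five feature values; y: human scores (2, 4, 6).
X = np.array([[12, 0.35, 0.61, 1.10, 0.20],
              [25, 0.52, 0.78, 1.45, 0.15],
              [30, 0.60, 0.85, 1.60, 0.30]])
y = np.array([2, 4, 6])

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Relative importance of each feature in the fitted tree (cf. Figure 3).
for name, importance in zip(FEATURES, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")

# Predict the score band of a new utterance from its five feature values.
print(tree.predict([[20, 0.45, 0.70, 1.30, 0.18]]))
```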

The decision tree classifier based on the five features revealed different pictures for the NNES and NES scores (see Figures 1 and 2). The NNES evaluation was classified by only two predictors, WPS and JCD, while the NES evaluation was classified not only by the features used for the NNES evaluation (WPS and JCD) but also by INS and COS.

Figure 1 NNES tree diagram


Figure 2 NES tree diagram

The bar graph in Figure 3 shows the relative importance of the five features in the NNES and NES evaluations. There are some noticeable differences between the two. First, WPS is the most important determiner for both the NNES and NES raters, but to different degrees: it accounts for 81% of the NNES evaluation but only 60% of the NES evaluation. Second, the importance of the other features also differed. In the NNES evaluation, JCD (14%) and LCS (5%) were the relevant predictors after WPS, while in the NES evaluation, INS (26%) was the second most important predictor; interestingly, INS did not appear in the NNES evaluation at all. The third and fourth predictors in the NES evaluation were JCD (11%) and COS (2%).


Figure 3 Feature importance

3.3 Prediction Accuracy of Automated Scoring

Next, the prediction accuracy of the automated scoring method was examined using exact agreement and Spearman's rho correlation. To calculate these values, we adopted a bootstrap method. First, 80% of the data, that is, 94 utterances, were extracted and used as a training dataset to build a decision tree classifier model. Second, using the model created from the training dataset, the exact agreement and Spearman's rho were calculated on the remaining 20% (24 utterances). Third, this procedure was repeated 10,000 times, and the averages are reported in Table 3.

As shown in Table 3, the two agreement indices between the NNES and NES evaluations were sufficiently high (71% and .70) and higher than those between the human raters and the machine. The agreement indices between the NNES and machine evaluations (65% and .48) and between the NES and machine evaluations (61% and .50) were broadly similar.
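A minimal sketch of this repeated evaluation is given below, assuming a feature matrix X (one row of the five features per utterance) and an array y of human scores; the exact resampling details of the authors' procedure may differ.

```python
# Minimal sketch of the repeated 80/20 evaluation described above; X and y are
# assumed to hold the five feature values and the human scores, respectively.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def repeated_evaluation(X, y, rounds=10_000, seed=0):
    rng = np.random.RandomState(seed)
    agreements, rhos = [], []
    for _ in range(rounds):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=rng.randint(1_000_000))
        model = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)
        pred = model.predict(X_test)
        agreements.append(np.mean(pred == y_test))        # exact agreement
        rhos.append(spearmanr(pred, y_test).correlation)  # Spearman's rho
    return np.mean(agreements), np.nanmean(rhos)
```

With 118 utterances, each 80/20 split yields 94 training and 24 test utterances, matching the figures reported above.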

We also averaged the scores given by the NNES and NES raters to examine how accurately the machine rating predicted the human ratings. Since the individual scores were 2, 4, or 6, the averaged human ratings took the values 2, 3, 4, 5, or 6; for example, if scores of 2 and 4 were given to an utterance, the average was 3, and if scores of 4 and 6 were given, the average was 5, which made the rating scale a 5-point scale. Again, we used the bootstrap method to examine the prediction


accuracy of the human ratings by the machine rating. The averages of the exact agreement and Spearman's rho correlation coefficient were 48% and .52, respectively. The exact agreement was the lowest among these agreement indices because the human-machine comparison used the 5-point scale; generally, as the number of scale points increases (e.g., from a 3- to a 5-point scale), exact prediction becomes more difficult. Further, variability between the NNES and NES raters may make the prediction more difficult for the machine.

Table 3
Degrees of Rater Agreement between NNES, NES, and Machine Scoring

                    Exact agreement (%)   Spearman's rho
NNES-NES                     71                .70
NNES-Machine                 65                .48
NES-Machine                  61                .50
Humanᵃ-Machine               48                .52

Note. ᵃ Human is the average of the NNES and NES scores.

Figures 4 and 5 show the confusion matrices of the human and machine scores over the 10,000-round trials. Each cell indicates the agreement between the two evaluations in each score band. In the NNES matrix in Figure 4, the value of the top-left cell is .47, which indicates that 47% of the examinees who actually received a score of 2 were predicted to have a score of 2 by the machine. In the same matrix, the machine predicted a score of 4 on the NNES evaluation with 97% accuracy. The NES matrix shows the opposite tendency: the machine predicted a score of 2 with 80% accuracy but a score of 4 with only moderate accuracy (58%). For both NNES and NES raters, the highest score of 6 showed a moderate degree of agreement with the machine (50%).


Figure 4 Confusion matrix of NNES (left) and NES (right) scores and the machine

As for the confusion matrix of the NNES and NES scores shown in Figure 5, the two raters agreed strongly on score 2 (97%), more than on score 4 (59%) and score 6 (50%). Thus, the human raters tended to make the same judgments on low-level performances but diverged in their judgments of higher-level performances.

Figure 5 Confusion matrix of NNES and NES scores
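For reference, a row-normalized confusion matrix of this kind can be produced with a few lines of scikit-learn; the score arrays below are placeholders, not the study's data.

```python
# Minimal sketch of a row-normalized confusion matrix like those in Figures 4
# and 5; assumes arrays of scores on the 2/4/6 scale.
import numpy as np
from sklearn.metrics import confusion_matrix

human = np.array([2, 2, 4, 4, 4, 6, 6])      # placeholder human scores
machine = np.array([2, 4, 4, 4, 2, 6, 4])    # placeholder machine predictions

# Each row sums to 1: cell (i, j) is the proportion of utterances with human
# score i that were predicted as score j.
cm = confusion_matrix(human, machine, labels=[2, 4, 6], normalize="true")
print(np.round(cm, 2))
```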

4. Discussion

Regarding RQ1, the tree diagrams showed that WPS was the most reliable determiner for both the NNES and NES raters. This is logical because speakers needed to produce roughly the same number of words as the original text in order to


convey the same amount of information, resulting in a high WPS within the limited retelling time. The NNES rater seemed to weight WPS in her score judgments (81%) more than the NES raters did (60%) and paid less attention to other aspects of the learners' performances. In other words, the NNES rater tended to focus more on the surface level of the performances, such as the exact wording of the original text, as shown by JCD (14%) and LCS (5%) alongside WPS. The NES raters, on the other hand, seemed to pay more attention to whether the content of the speakers' utterances was correct even when the speakers used expressions not found in the original text, as INS, which captures words not in the original text, appeared only in the NES evaluation. Thus, the NNES rater evaluated how similar the original text and the examinees' utterances were in wording, while the NES raters evaluated whether they were similar in meaning. These results accord with Kim's (2009b) and Sato's (2011) findings that NES raters make more detailed and elaborate judgments about specific grammar use and the accuracy of transferred information. The differences might derive from the NES raters' ability as native speakers of English and their many years of experience teaching Japanese EFL learners (Winke & Gass, 2012).

Another difference appeared in the accuracy of score prediction shown in the confusion matrices (Figure 4). The machine predicted the NNES and NES evaluations in the same consistent manner, yet the prediction accuracy for score 2 was low on the NNES evaluation (47%) and high on the NES evaluation (80%). Considering that 61 learners received a score of 2 from the NNES rater but only 40 from the NES raters, the NNES rater scored much more strictly than the NES raters (Table 2). The results also revealed that evaluating more proficient learners is difficult. Score agreement trended downward as scores increased, from 80% at score 2 to 50% at score 6 for the NES and machine evaluations, and from 97% at score 2 to 50% at score 6 for the NNES and NES evaluations. As learners become more proficient, raters need to judge not only mechanical grammar errors or the volume of utterances but also the appropriateness of the expressions uttered, which is a difficult criterion for machine learning and for NNES raters to evaluate.

To confirm these interpretations of the differences between the NNES and NES evaluations, the participants' speech transcriptions were carefully examined and compared with the original retelling text (see Original below). There were many cases in which a speech transcription received an NNES score of 2 but an NES score of 4, as with the following two learners' transcriptions.


(Original) When Rena was a little girl she started taking violin lessons, but after a while she was getting tired of it. Her teacher said that she should practice every day. Rena's parents told her that if she didn't practice harder she would have to stop taking lessons. They said that the lessons were too expensive to continue if she was going to be lazy. In the end Rena quit learning the violin. Several years later Rena went to a concert by a famous violinist with her parents. She was impressed by the beautiful sound of the music. So she decided to start playing the violin again. Now she never misses a day of practice. She hopes one day to be able to play like the famous violinist she heard. (129 words)

(Student 5) [rena] rena [practiced] practice violin when she was a child. But practicing violin is too expensive to continue if she was lazy, so she [stop] stopped [ba] practicing violin. [uh] but she went to concert and [she] she was impressed by the sound of violin. [She] she get interested in practicing violin, so [she] she practice violin again. Now, [she] she don't miss the lesson of violin. (67 words)

Student 5 uttered only 67 words, roughly half the original word count of 129, with frequent fillers and false starts, shown in brackets. The brevity of the utterance likely resulted in the NNES score of 2, indicating a lack of detail. In terms of content, however, Student 5 was able to complete the story, as shown in the final sentences of the transcription, which may be why the student received a score of 4 from the NES rater. Along with the utterance's brevity and false starts, the NNES rater also seems to have penalized the frequency of fillers. Student 41 is a similar case.

(Student 41) Rena [ehhh] [enjoyed taking] [ehhh] enjoyed taking violin lessons [ah] but this lesson is too expensive and she was became lazy, so and she her teacher said [ehhh] practice playing violin every day. [ehhh] So she was quit the violin [ehhh] After several years later, [ehhh] she went to a concert with her family. [ehhh] [When she hear] when she listened [ehhh] beautiful sounds, [ehhh] she was very impressed and she decided to start violin lessons again. (77 words)


Student 41 uttered a total of 77 words, slightly more than Student 5, but her utterance involved many more fillers and false starts, which likely prompted the NNES rater's score of 2. However, Student 41 nearly completed the story, saying that "she decided to start violin lessons again." In addition, she used her own words to tell the story, such as "became lazy" instead of "was going to be lazy," "with her family" for "with her parents," and "start violin lessons again" for "start playing the violin again." As the primary purpose of this performance is to convey the story to others accurately, the NES rater awarded her a score of 4 instead of 2. Thus, as the qualitative analyses suggest, the NES raters seemed to pay more attention to the content despite frequent disfluency markers, whereas the NNES rater seemed to be more influenced by learners' surface errors and the quantity of their utterances and, as a consequence, judged more severely.

As for RQ2, as shown in Table 3, the relationships between the human raters and the machine scoring were moderate (exact agreement = 48% to 65%; rs = .48 to .52), though slightly lower than the relationship between the human raters (exact agreement = 71%, rs = .70). These results imply two points. First, even though the NNES and NES raters had very different L1 backgrounds and teaching experience and paid attention to different aspects of the speech performances, the correlation between them was still higher than the correlations between the machine and human scores. This result is in line with Zechner et al. (2009), who reported that the inter-human correlation (r = .74) was higher than the correlation between machine and human scores (r = .57). Second, in light of these moderate levels of correlation, it may be difficult to depend solely on machine scoring in high-stakes tests. However, it can be used in low-stakes classroom speaking assessments, where immediate individual feedback is desirable but hard to provide, such as for self-assessment after speaking practice or for filtering purposes to divide scoring criteria between human and machine raters (e.g., Molholt et al., 2011; Zechner et al., 2009).

For example, there were relatively many cases in which the scores of the human raters and the machine rating matched, as shown below. Student 26 uttered only 65 words, about half of the original passage, and his speech lacked many of the main points. He was not able to finish the story, merely stating that Rena enjoyed taking the violin but the lessons were expensive. Thus, both the human and machine raters scored the utterance as 2.




(Student 26)

When Rena was a little girl, etto she [ehhh] taking the violin [ehhh] lesson. [ehhh] But after after a few times, the teacher [ehhh] her [ehh enjoyed h] re Rena's teacher says [ehhh] mm this lesson [ehhh] were this lesson were too expensive to uh take uh mm too expensive to nandaro mm ah mm no no no no Her teacher says to Rena [ehhh] that. (65 words)

This example illustrates the usefulness of machine scoring. If a machine is good at identifying utterances that deserve a low score, as proposed by Yoon and Zechner (2017), automated scoring can be used for filtering purposes, allowing human raters to focus on the meaning and cognitive aspects of learners’ speech performances. However, there was a case in which both human raters awarded a score of 4 while the machine gave a score of 2. Student 23, shown below, uttered 119 words, a relatively large volume close to the length of the original passage, but the utterance contained many fillers, such as [ah] and [nnn], and false starts, as in “she she” or “pay pay.” Even though the utterance contained many fillers and false starts, the speaker was able to retell the whole story, which resulted in a score of 4 from the human raters.

(Student 23)

rena is [ah] little girl and [ah] she enjoyed [taking] [ah] taking violin and [ah] [she] she [practiced] practiced violin [ah] with her teacher one day [ah] her teacher [said] said [rena have] rena has to practice harder because [ah] her parents [ah] [sss] told her teacher [rena] [ah] rena didn't practice every day nnn so [they] they don't want to [pay] pay a lot money for her to [practice vio] [ah] practice violin [and] [ah] and [she] she was quit [and] [ah] [sss] [ah] and she [ah] one day [ah] [she went] she went [nnn] [concert] concert with her parents and [sss] [she] she want [ah] she [impressed] [nnn] impressed the violin and she decided to practice violin more. (119 words)

On the other hand, it seemed difficult for the machine to recognize how much of the story was covered when the utterance contained many false starts, which resulted in a machine score of 2. In other words, unlike the machine, human raters can distinguish the content words necessary for constructing the story from those that are not. This may indicate that human raters are superior to machines in scoring the cognitive aspects of speech performance. Molholt et al. (2011) likewise point out that machine scoring on its own leaves out too many important variables, such as topic development, to be considered adequate for measuring proficiency.
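In practice, one way to use the machine despite these limitations, in line with the filtering idea of Yoon and Zechner (2017) discussed above, is to let it handle only the clearly weak responses and route the rest to a human rater. The sketch below is our own minimal illustration of that idea; the feature (words per second), the cutoff value, and all function names are illustrative assumptions, not the system described in this study.

def words_per_second(transcript, duration_sec):
    # Simple fluency feature: word count divided by speaking time in seconds.
    return len(transcript.split()) / duration_sec

def route_response(transcript, duration_sec, wps_cutoff=0.8):
    # Responses below the fluency cutoff receive a low machine score;
    # all other responses are passed on to a human rater for content judgment.
    wps = words_per_second(transcript, duration_sec)
    if wps < wps_cutoff:
        return ("machine", 2)  # hypothetical low band score
    return ("human", None)

# Example with an invented 8-word response spoken in 12 seconds
decision, score = route_response("rena practice violin when she was a child", 12.0)
print(decision, score)  # -> machine 2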


In this regard, machine scoring may take on the role of evaluating the more mechanical aspects of performance so that teachers can concentrate on the cognitive aspects of speech.

5. Conclusion

This pilot study investigated how accurately automated scoring can evaluate students’ retelling performance. To compare machine and human scoring, the inter-human variation between NES and NNES raters was also examined. The results yielded some important findings and implications.

First, fluency, measured as the number of words produced per second (WPS), was the most decisive predictor for both NNES and NES evaluations. This evidence favors automated scoring because WPS does not involve the reference text, which the other features require in order to calculate text similarity. Thus, WPS can serve as an important automated scoring feature for other speaking tasks as well.

Second, the NNES rater paid more attention to the exact words used in the original text, while the NES raters focused on the meaning of the learners’ utterances, which may be partly due to their higher English ability as native speakers.

Third, the NNES and NES raters also differed in terms of prediction accuracy for scores of 2 and 4. This may be because the NNES rater was more bothered by learners’ disfluency markers, such as fillers and false starts, than the NES raters were. Another reason may be that, owing to their experience in teaching Japanese students, the NES raters may be better equipped to comprehend the students’ speech and could concentrate on evaluating the content of their speech.

Fourth, the inter-human correlation was higher than the correlation between the machine and human scoring, but the correlations of the machine scoring with the NNES scoring and with the NES scoring were largely the same. In this regard, human raters have more in common across differences in L1 background and teaching experience than they do with machine rating.

Finally, the correlations between automated and human scoring were moderate, suggesting that automated scoring of a retelling task can be used for low-stakes tests, such as classroom speaking assessments.

However, this study has some limitations. One is that, while moderate agreement was reached between machine and human scores on 118 speech samples, the sample size was rather small for machine learning. By increasing the sample size, we hope to strengthen the precision of the automated scoring, which will increase its applicability to other retelling topics. A second limitation is that the number of raters



was small, so the results may lack generalizability. However, we tried to reflect a realistic classroom situation, in which one or at most two teachers typically score students’ speech performances and not all teachers have extensive teaching experience. The last limitation is that, as a pilot study, we transcribed learners’ speech manually because we do not currently have the technology for fully automated speech scoring.

Despite these limitations, the current study revealed that an automated scoring system for retelling tasks can be developed for classroom speaking assessment, both for learners’ self-study and to reduce teachers’ burden of time-consuming speech scoring. To make the current automated system fully usable for classroom teachers, the next step is to combine word recognition technology with the current automated scoring system and to examine the validity, reliability, and practicality of this combination. Since the accuracy of automatic speech recognition for learners’ speech has gradually improved, it is hoped that fully automated speech scoring will become feasible in the future. Therefore, we need to stay abreast of developments in areas such as speech recognition, automated scoring, and deep learning applicable to classroom speaking instruction and assessment.

Acknowledgements

This work was supported by JSPS KAKENHI Grant Numbers 16K02953 and 18K00849. We are grateful to the anonymous reviewers for their constructive comments and suggested revisions, which helped us improve this manuscript.

References

ALC Press, Inc. (n.d.). The standard speaking test. ALC Press. https://tsst.alc.co.jp/sst/level.html

Bachman, L. F., & Palmer, A. S. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford University Press.

Brown, A. (1995). The effect of rater variables in the development of an occupation-specific language performance test. Language Testing, 12, 1–15. https://doi.org/10.1177/026553229501200101

Chaudron, C. (2003). Data collection in SLA research. In C. J. Doughty & M. H. Long (Eds.), The handbook of second language acquisition (pp. 762–828). Blackwell.


Cox, T. L., & Davies, R. S. (2012). Using automatic speech recognition technology with elicited oral response testing. CALICO Journal, 29, 601–618. http://www.jstor.org/stable/calicojournal.29.4.601

Cucchiarini, C., Strik, H., & Boves, L. (2000). Quantitative assessment of second language learners’ fluency by means of automatic speech recognition technology. The Journal of the Acoustical Society of America, 107, 989. https://asa.scitation.org/doi/abs/10.1121/1.428279

Franco, H., Bratt, H., Rossier, R., Gadde, V. R., Shriberg, E., Abrash, V., & Precoda, K. (2010). EduSpeak®: A speech recognition and pronunciation scoring toolkit for computer-aided language learning applications. Language Testing, 27, 401–418. https://doi.org/10.1177/0265532210364408

Hirai, A. (2015). Jugyo o ikasu story-retelling test no katsuyo [Utilization of story-retelling tests for effective lessons]. Otsuka Forum, 33, 49–69. https://core.ac.uk/download/pdf/56661627.pdf

Hirai, A., & Koizumi, R. (2009). Development of a practical speaking test with a positive impact on learning using a story retelling technique. Language Assessment Quarterly, 6, 151–167. https://doi.org/10.1080/15434300902801925

Ilijoski, B., & Popeska, Z. (2018). N-gram measure for syntactical similarity of the words. 10th ICT Innovations Conference, Engineering and Life Science Web Proceedings, 37–45. https://proceedings.ictinnovations.org/attachment/conference/15/ict-innovations-2018-web-proceedings.pdf

Kang, O., & Rubin, D. (2012). Intra-rater reliability of oral proficiency ratings. The International Journal of Educational and Psychological Assessment, 12, 43–61. https://www.researchgate.net/publication/329504535_Intra-rater_Reliability_of_Oral_Proficiency_Ratings

Kim, Y. (2009a). A G-theory analysis of rater effect in ESL speaking assessment. Applied Linguistics, 30, 435–440. https://doi.org/10.1093/applin/amp035

Kim, Y. (2009b). An investigation into native and non-native teachers’ judgments of oral English performance: A mixed methods approach. Language Testing, 26, 187–217. https://doi.org/10.1177/0265532208101010

Kobayashi, Y., & Abe, M. (2016). Automated scoring of L2 spoken English with random forests. Journal of Pan-Pacific Association of Applied Linguistics, 20, 55–73.



Koizumi, R., & Hirai, A. (2012). Comparing the story retelling speaking test with other speaking tests. JALT Journal, 34, 35–59. http://jalt-publications.org/sites/default/files/pdf-article/jj2012a-art2.pdf

Lado, R. (1964). Language teaching: A scientific approach. McGraw-Hill.

Litman, D., Strik, H., & Lim, G. S. (2018). Speech technologies and the assessment of second language speaking: Approaches, challenges, and opportunities. Language Assessment Quarterly, 15, 294–309. https://doi.org/10.1080/15434303.2018.1472265

Luoma, S. (2004). Assessing speaking. Cambridge University Press.

MEXT. (2014). Integrated reforms in high school and university education and university entrance examination aimed at realizing a high school and university articulation system appropriate for a new era. http://www.mext.go.jp/b_menu/shingi/chukyo/chukyo0/toushin/1354191.htm

Molholt, G., Cabrera, M., Kumar, V., & Thompsen, P. (2011). Quantitative, notional, and comprehensive evaluations of spontaneous engaged speech. CALICO Journal, 29, 1–23. http://www.jstor.org/stable/calicojournal.29.1.1

Muranoi, H. (2007). Output practice in the L2 classroom. In R. M. DeKeyser (Ed.), Practice in a second language: Perspectives from applied linguistics and cognitive psychology (pp. 51–84). Cambridge University Press.

Nagasaki, M. (2013). Chugakko gaikokugoka ni okeru hanasuchikara o takameru tameno shido no kufu ― Ondoku retelling o hasshin ni tunageru dankaitekina katsudo o toushite [Ideas to improve “speaking skills” in the foreign language department of a junior high school: Through reading-aloud and retelling step-up activities]. Akita Prefectural Education Center Research Report.

Nippold, M. A., Vigeland, L. M., Frantz-Kaspar, M. W., & Ward-Lonergan, J. M. (2017). Language sampling with adolescents: Building a normative database with fables. American Journal of Speech-Language Pathology, 26, 908–920. https://doi.org/10.1044/2017_AJSLP-16-0181

Pearson Education, Inc. (2008). Versant™ English Test: Test description and validation summary. http://www.versanttest.com/technology/VersantEnglishTestValidation.pdf

Sato, T. (2011). Japanese and native English-speaking teachers’ perspectives on learners’ oral English performance. Annual Review of English Language Education (ARELE), 17–32.


Winke, P., & Gass, S. (2012). Raters’ L2 background as a potential source of bias in rating oral performance. Language Testing, 30, 231–252. https://doi.org/10.1177/0265532212456968

Xi, X. (2012). Validity and the automated scoring of performance tests. In G. Fulcher & F. Davidson (Eds.), The Routledge handbook of language testing (pp. 438–451). Routledge.

Yokouchi, Y. (2015). Do input text length and presentation mode affect speaking performance in retelling tasks? JLTA Journal, 18, 115–133.

Yoon, S.-Y., & Zechner, K. (2017). Combining human and automated scores for the improved assessment of non-native speech. Speech Communication, 93, 43–52. https://doi.org/10.1016/j.specom.2017.08.001

Zechner, K., Higgins, D., Xi, X., & Williamson, D. M. (2009). Automatic scoring of non-native spontaneous speech in tests of spoken English. Speech Communication, 51, 883–895. https://doi.org/10.1016/j.specom.2009.04.009

Zhang, Y., & Elder, C. (2011). Judgments of oral proficiency by non-native and native English speaking teacher raters: Competing or complementary constructs? Language Testing, 28, 31–50. https://doi.org/10.1177/0265532209360671

