Scoring Children's Foreign Language Production

Proceedings, FONETIK 2005, Department of Linguistics, Göteborg University

Scoring Children's Foreign Language Pronunciation

Linda Oppelstrup, Mats Blomberg, and Daniel Elenius

Speech, Music and Hearing, KTH, Stockholm

Abstract

Automatic speech recognition measures have been investigated as scores of segmental pronunciation quality. In an experiment, context-independent hidden Markov phone models were trained on native English and Swedish read child speech, respectively. Among the scores studied, a likelihood ratio between the score of forced alignment using English phoneme models and the score of English or Swedish phoneme recognition had the highest correlation with human judgments. The best measures are able to evaluate the coarse proficiency level of a child but need to be improved for detailed diagnostics of individual utterances and phonemes.

Introduction

Automatic evaluation of foreign language pronunciation presents possibilities for computer-assisted language learning as well as for prediction of speech recognition performance in a non-native language. Although children constitute a very large portion of foreign language learners, speech technology research in this domain has previously focused mainly on adults. The current work has been produced as part of the EU project PF-Star, in which one goal was to assess the current possibilities of speech technology for children.

Previous work has used the fact that the better a speaker is at pronouncing the new language, the more the utterances should resemble sounds from the target language instead of the mother tongue (Eskenazi, 1996; Matsunaga, Ogawa, Yamaguchi and Imamura, 2003). However, the pronunciation quality of read speech does not depend solely on the ability to produce the phonemes correctly; it also depends on knowledge of how the words should be pronounced. Spectral quality and time-related scores have shown high correlation with human judgment (Neumeyer, Franco, Digalakis and Weintraub, 2000; Cucchiarini, Strik and Boves, 2000).

The foreign language considered in this report is English, and the mother tongue is Swedish and, in some cases, Italian. The scoring procedure used context-independent phoneme models in a hidden Markov model (HMM) system. Although prosody is very important for pronunciation quality, this report is limited to segmental quality. More detailed results are given in Oppelstrup (2005).

Theory

An approximation in this work is that pronunciation quality is composed of two components, knowledge and ability. The knowledge component reflects the speaker's knowledge of the correct phonetic transcription of a written text. The ability component reflects the speaker's ability to pronounce the phonemes of the target language correctly.

The knowledge score, $S_k$, can be formulated as a confidence measure that the speaker has chosen the correct transcription, $Tr_{Correct}$, in his or her spoken utterance ($U$) of the written text. This is modeled by:

$$S_k = \frac{P(U \mid Tr_{Correct}, T)}{\max_i \left[ P(U \mid Tr_i, T) \right]} = \frac{P(U \mid Tr_{Correct}, T)}{P(U \mid Tr_{Best}, T)} \tag{1}$$

where $T$ is the set of target language phoneme models, trained on native reference speakers. In speech recognition terminology, the operation of the numerator is forced alignment and that of the denominator is phoneme recognition. This ratio has been used for pronunciation scoring by Cucchiarini et al. (2000).

The ability score is a measure of the acoustic quality of the speaker's realization of the phonemes in the target language. It is possible that some phonemes are pronounced as the most similar phoneme in the mother language. To score the ability of a speaker to produce the correct non-native sounds, we will try the ratio between the probabilities that the target language phonemes were used and that the mother tongue ones were used:

$$S_a = \frac{P(U \mid Tr_{Correct}, T)}{P(U \mid Tr_{Correct}, M)} \tag{2}$$

if the correct phonetic transcription of the written text to be pronounced is known. $M$ is the set of mother language phoneme models. If the text or the transcription is unknown, we can use the ratio between the phoneme recognition scores using the two language models:

$$S_a = \frac{P(U \mid Tr_{Best}, T)}{P(U \mid Tr_{Best}, M)} \tag{3}$$

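A combined knowledge and ability score can be computed by multiplying $S_k$ and $S_a$ in Eqs. (1) and (3), where the phoneme recognition term for the target language cancels:

$$S_c = S_k \, S_a = \frac{P(U \mid Tr_{Correct}, T)}{P(U \mid Tr_{Best}, T)} \cdot \frac{P(U \mid Tr_{Best}, T)}{P(U \mid Tr_{Best}, M)} = \frac{P(U \mid Tr_{Correct}, T)}{P(U \mid Tr_{Best}, M)} \tag{4}$$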
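Since recognizers normally report log-likelihoods, the ratios in Eqs. (1), (3) and (4) can be evaluated as differences. The following is a minimal sketch of that bookkeeping, assuming three per-utterance log-likelihoods are already available from forced alignment and phoneme recognition. The class, function and field names, and the numeric values, are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass


@dataclass
class UtteranceLogLik:
    """Log-likelihoods of one utterance U (hypothetical container)."""
    forced_align_target: float  # log P(U | Tr_Correct, T)
    phone_rec_target: float     # log P(U | Tr_Best, T)
    phone_rec_mother: float     # log P(U | Tr_Best, M)


def log_scores(ll: UtteranceLogLik) -> dict:
    """Return the logarithms of S_k, S_a and S_c for one utterance."""
    log_sk = ll.forced_align_target - ll.phone_rec_target  # Eq. (1)
    log_sa = ll.phone_rec_target - ll.phone_rec_mother     # Eq. (3)
    log_sc = ll.forced_align_target - ll.phone_rec_mother  # Eq. (4)
    return {"log_Sk": log_sk, "log_Sa": log_sa, "log_Sc": log_sc}


# Made-up log-likelihoods, for illustration only.
example = UtteranceLogLik(
    forced_align_target=-5200.0,
    phone_rec_target=-5000.0,
    phone_rec_mother=-5100.0,
)
scores = log_scores(example)
```

In the log domain the product in Eq. (4) becomes a sum, so `log_Sc` equals `log_Sk + log_Sa`. By Eq. (1), $S_k \le 1$, so `log_Sk` is non-positive whenever the recognition search is exact.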

Implemented Scores

In this report we present results of the following pronunciation score parameters:

- Knowledge: English forced alignment / English phoneme recognition (EFA/EPR)
- Ability: English phoneme recognition / Swedish phoneme recognition (EPR/SPR)
- Combined: English forced alignment / Swedish phoneme recognition (EFA/SPR)
- Fraction language use (FRAC); see below
- Rate of speech (ROS)
- Utterance duration (DUR)

In FRAC, both language model sets are active in parallel and the score is the percentage of English language models selected by the recognizer.
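The FRAC score can be illustrated with a short sketch. It assumes that the recognizer output is available as a list of (language, phone) pairs and that the percentage is counted over recognized phone segments; whether the original system counted segments or frames is not stated, so this is an assumption. The function name, language tags and phone labels are hypothetical.

```python
def frac_score(recognized_phones):
    """Percentage of recognized phone segments taken from the English set.

    recognized_phones: list of (language, phone) tuples, where language
    is "ENG" or "SWE" (hypothetical tags for the two model sets).
    """
    if not recognized_phones:
        raise ValueError("empty recognition result")
    n_english = sum(1 for language, _ in recognized_phones if language == "ENG")
    return 100.0 * n_english / len(recognized_phones)


# Made-up recognition result for one utterance.
result = [("ENG", "dh"), ("SWE", "e"), ("ENG", "k"), ("ENG", "ae"), ("SWE", "t")]
frac = frac_score(result)
```

Here three of the five segments come from the English model set, so `frac` is 60 percent.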

Speech Data Bases

The speech material used in this report comes from five child speech databases. The utterances were separate words and short sentences chosen to give good coverage of all phonemes. All recordings were made with headset microphones. Where necessary, the recordings were downsampled to 16 kHz and the amplitude resolution was reduced to 16 bits. The material was split into sets for training, development and evaluation.

SVE1 consists of 50 Swedish children aged eight to fourteen in the EU SpeeCon corpus (Iskra et al., 2002). Each child recorded 60 utterances on average.

SVE2 (Blomberg and Elenius, 2003) is part of the Swedish PF-STAR corpus of 198 Swedish children between four and eight years old. Only the children above eight were used. Each child recorded approximately 80 utterances.

The English material for PF-STAR was recorded in three different regions of England by the University of Birmingham. 158 children aged six to fourteen were recorded, but only those above eight were included. Each child recorded approximately 90 utterances. The database was used both for training the English phoneme models (ENG-tr) and for performance tests (ENG-te). A part of this material received an increased noise level in this experiment due to unintentional mixing of two channels from a headset and a desktop microphone.

The Italian database, ITEN, was recorded near Trento by ITC-irst. The part used here comprises 25 children, ten and eleven years old, reading 75 English prompts each.

The Swedish non-native PF-STAR material, SWENG, was recorded in Stockholm. Each of the 40 children, of both sexes and aged 10-11, read on average 64 English utterances, prompted on a computer screen. Most of the utterances were the same as in ITEN. If the child was uncertain of the pronunciation, he or she had the option to listen to a prerecorded pronunciation of that prompt. This option was used in about 15% of the recordings.

Experiments

Recognition performance tests and pronunciation evaluation experiments have been performed. Word recognition tests used the SVE and ENG development sets, both containing children of ages eight and nine only. The language model allowed any word to follow any other with equal probability. The word insertion penalty was set experimentally to minimize the word error rate (WER). In phoneme recognition tests the penalty was not optimized and was set to zero.

The English and Swedish phoneme models were trained on ENG-tr and SVE1, respectively. Each phoneme model consists of three states. The 39 elements of the acoustic feature vector are the 13 lowest mel frequency cepstral coefficients (MFCC), including number 0, and their first and second order time derivatives. The mel filterbank is computed with a 25 ms Hamming window at a frame shift of 10 ms. The output likelihood values of the recognizer are logarithmic, so a ratio between scores is implemented as a subtraction instead of a division.

The scores were measured in different ways, including or excluding non-speech intervals before and after the utterance and optional pauses between words. In this report, the presented scores include non-speech intervals, which generally performed better. Several other combinations of scores, models and normalization techniques have been studied by Oppelstrup (2005).

The pronunciation scores were correlated with human judgment of the utterances. The SWENG and ITEN speech files were scored by an English teacher with phonetic training. Segmental and prosodic qualities were judged separately. Each utterance was scored on a three-grade scale: 3 for correct pronunciation, 2 for small errors and 1 for erroneous utterances. To get a grade per child, the average over all graded sentences was calculated. At the time of the experiments, the ENG database had no human judgments and all its utterances were given the score 3, assuming that all English children had a correct pronunciation. Judgments have since been made on the English children as well, and some results including these are given in this report.
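The per-child averaging and the correlation with human grades can be sketched as follows. This is an illustration under stated assumptions: the report does not say which correlation coefficient was used, so Pearson's r is assumed here, and all names and data values are made up.

```python
from collections import defaultdict
from math import sqrt


def mean_per_child(rows):
    """rows: iterable of (child_id, value). Returns {child_id: mean value}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for child, value in rows:
        sums[child] += value
        counts[child] += 1
    return {child: sums[child] / counts[child] for child in sums}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Made-up utterance-level data: human grades (1-3) and automatic log scores.
grades = [("c1", 3), ("c1", 2), ("c2", 1), ("c2", 2), ("c3", 3), ("c3", 3)]
auto = [("c1", -150.0), ("c1", -250.0), ("c2", -400.0),
        ("c2", -300.0), ("c3", -100.0), ("c3", -120.0)]

grade_by_child = mean_per_child(grades)
score_by_child = mean_per_child(auto)
children = sorted(grade_by_child)
r = pearson([score_by_child[c] for c in children],
            [grade_by_child[c] for c in children])
```

Averaging first and correlating afterwards gives one data point per child, which is the level at which the correlations in this report are stated.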

Results

Word and phoneme recognition

Results of the word and phoneme recognition experiments are shown in Tables 1 and 2, respectively. The error rates are generally quite high, which is not surprising considering the combined difficulties of child and non-native speech from different databases and a high-perplexity language model.

Table 1. Word error rates for the Swedish and English test sets.

Test         SVE2    ENG-te    SWENG     ITEN
Training     SVE1    ENG-tr    ENG-tr    ENG-tr
Voc. size    1051    1097      617       629
WER %        94      54        79        85

Table 2. Phoneme error rates (PER) in percent for different training and test set combinations.

             Test
Training     SVE2    ENG-te    SWENG     ITEN
ENG-tr       97      66        103       119
SVE1         72      92        86        93

Pronunciation scoring

Correlation with human judgment was low for single utterances but increased when averaging the scores of all utterances by a child. Table 3 lists correlations between automatic scores and human judgment of segmental quality per child. The correlation in the individual languages is generally low and is increased for the combined group of Swedish and Italian children. Still higher correlation is achieved when the native English children are also included. The increase can be due to a larger range of pronunciation quality among the speakers. The knowledge score EFA/EPR, the ability score EPR/SPR, and the combined score EFA/SPR all got high correlation in this case. EFA/EPR and EPR/SPR are shown in Figure 1. The effect of an increased correlation when including native English speakers can also be seen in FRAC.

Table 3. Correlation of pronunciation scores with human judgment for various test sets. S = SWENG, I = ITEN, E = ENG-te. English children are given grade 3, except for values after / which are based on human judgment of a part of these.

             Test set
Score        S        I        S+I      S+I+E
EFA/EPR      0.20     0.61     0.65     0.70/0.75
EPR/SPR      0.18     0.26     0.56     0.80/0.68
EFA/SPR      0.20     0.37     0.60     0.82/0.72
FRAC         0.22     0.09     0.48     0.82/0.71
ROS          0.18     -0.25    0.43     0.57/0.42
DUR          -0.11    -0.12    -0.54    -0.47/-0.47

[Figure 1: two scatter plots, EFA/EPR and EPR/SPR; horizontal axis "Human judgment", vertical axis "Automatic".]

Figure 1. Automatic vs human judgment of Swedish, Italian and English children for EFA/EPR (top) and EPR/SPR (bottom).
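Two of the phoneme error rates in Table 2 exceed 100%. This is possible under the conventional error-rate definition, in which substitutions, deletions and insertions are summed and divided by the number of reference units, so that many insertions can push the rate above 100%. The sketch below assumes that conventional definition; the report does not spell out the formula, and the phone strings are invented.

```python
def edit_errors(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference sequence into the hypothesis."""
    prev = list(range(len(hyp) + 1))  # row for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        cur = [i]  # i deletions align ref[:i] with an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of r
                cur[j - 1] + 1,          # insertion of h
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("empty reference")
    return 100.0 * edit_errors(ref, hyp) / len(ref)


# Made-up example: three reference phones, six recognized phones.
ref = ["k", "ae", "t"]
hyp = ["g", "e", "a", "d", "t", "s"]
per = error_rate(ref, hyp)
```

With only one matching phone, the best alignment needs two substitutions and three insertions, giving five errors against three reference phones and therefore a rate well above 100%.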


Discussion

The low accuracy of word and phoneme recognition, even for native English children, indicates that there is low discrimination between the phoneme models. Child speech recognition is very difficult in itself, and the small size of the training material allowed only context-independent phoneme models to be trained. Another difficulty was the varying recording conditions in the databases. These problems make the pronunciation evaluation uncertain.

Another fact that may lower the correlation with human judgment is that the human listener and the scoring algorithms have different targets for correct pronunciation. Whereas the human listener is likely to compare with neutral British English, the reference models for the system are trained on children with different regional accents.

As was expected from previous studies, correlation increased with the amount of available data. Correlation for single utterances was lower than for averages of all utterances from a speaker.

The correlation within the group of Swedish children was quite low. An explanation may be that the scoring algorithms are not sensitive to the limited pronunciation variation in this group. The correlation among the Italian children is higher, and the highest overall correlation is achieved when including children from all of the Italian, Swedish and English sets. It is interesting to note that the Swedish phoneme models seemed to work about as well as mother-tongue models for the Italian children as for the Swedish children.

A separate procedure will probably be necessary to reject utterances that match poorly to both the numerator and the denominator models in the likelihood ratios, since the likelihood ratio of these utterances will be quite random.
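One conceivable form of such a rejection step is a threshold on the frame-normalized log-likelihoods of both terms in the ratio. The sketch below is only an illustration of that idea and not a procedure from this work; the function name and the threshold value are hypothetical, and a usable threshold would have to be tuned on development data.

```python
def accept_for_scoring(log_num, log_den, n_frames, min_avg_loglik=-80.0):
    """Return True if at least one model set fits the utterance acceptably.

    log_num, log_den: total log-likelihoods of the numerator and
    denominator terms of a likelihood ratio for one utterance.
    n_frames: number of acoustic frames in the utterance.
    min_avg_loglik: hypothetical per-frame threshold (arbitrary value).
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    avg_num = log_num / n_frames
    avg_den = log_den / n_frames
    # Reject only when both terms match poorly, since the ratio of two
    # poor matches carries little information.
    return max(avg_num, avg_den) >= min_avg_loglik


# Made-up values: the first utterance is kept, the second is rejected.
kept = accept_for_scoring(-5200.0, -5000.0, n_frames=100)
rejected = not accept_for_scoring(-9500.0, -9400.0, n_frames=100)
```

Normalizing by the number of frames keeps the threshold comparable across utterances of different duration.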

Conclusion

The best scoring functions correlate well enough with human judgments to allow coarse grading of a child's pronunciation quality. The context-independent models used are too insensitive, however, to allow scoring on the utterance or phoneme level. The most important improvement would be to use context-dependent phoneme models, trained on a large corpus of recordings of children with correct pronunciation and accent.

Another possibility is to replace phoneme recognition in the scoring algorithms by explicit modeling of predicted erroneous pronunciations. Better knowledge of the differences between the phoneme sets of the mother tongue and target language could also help in giving more weight to the difficult phonemes in the target language.

Acknowledgement

This work was conducted as a Master of Science thesis at TMH, KTH, Stockholm, within the EU project PF-STAR, Preparing Future Multisensorial Interaction Research. The human pronunciation judgments were performed by Becky Hinks.

References

Blomberg, M. and Elenius, D. Collection and recognition of children's speech in the PF-Star project. Phonum 9, Dept. of Philosophy and Linguistics, Umeå University, 2003.

Cucchiarini, C., Strik, H. and Boves, L. Different aspects of expert pronunciation quality ratings and their relation to scores produced by recognition algorithms. Speech Communication, Vol 31, pp 109-119, 2000.

Eskenazi, M. Detection of foreign speakers' pronunciation errors for second language learning - preliminary results. Proc. of ICSLP 96, vol 3, 1996.

Iskra, D., Grosskopf, B., Marasek, K., van der Heuvel, H., Diehl, F. and Kissling, A. Speecon speech databases for consumer devices: Database specification and validation. Proc. of ICSLP 02, 2002.

Matsunaga, S., Ogawa, A., Yamaguchi, Y. and Imamura, A. Non-native English speech recognition using bilingual English lexicon and acoustic models. Proc. of ICME 03, pp 625-628, 2003.

Neumeyer, L., Franco, H., Digalakis, V. and Weintraub, M. Automatic scoring of pronunciation quality. Speech Communication, Vol 30, pp 83-93, 2000.

Oppelstrup, L. Speech Recognition used for Scoring of Children's Pronunciation of a Foreign Language. M.Sc. Thesis, TMH/KTH, Stockholm, 2005.
