Speech rate affects the word error rate of automatic speech recognition systems. Higher error rates...



• Speech rate affects the word error rate of automatic speech recognition systems.

• Higher error rates for fast speech, but also for slow, hyperarticulated speech (Siegler and Stern, 1995; Mirghafori et al., 1995; Martínez et al., 1997; Pfau and Ruske, 1998; Alleva et al., 1998).

• What linguistic unit should we use to quantify speech rate and what domain is appropriate?

• What are the most important effects on the realisation of the lexical forms?

• How well are the acoustic models suited for different speech rates?

Articulation rate and phone classification

Articulation rate and realised lexical form

Articulation rate measures

Articulation Rate: Measures, Realised Lexical Form and Phone Classification in Spontaneous and Read German Speech

Jürgen Trouvain, Jacques Koreman, Attilio Erriquez and Bettina Braun

Universität des Saarlandes, Saarbrücken, Germany

{trouvain,koreman,erriquez,bebr}@coli.uni-sb.de

We investigated several linguistic units for measuring articulation rate as well as two different domains.

It is important to distinguish between intended and realised units. Intended forms can be easily derived from the canonical transcription of the uttered words, but their actual realisation can vary strongly:

Am blauen Himmel ziehen die Wolken

Engl. In the blue sky wander the clouds

//[]

The definition of what is and is not a unit is also problematical:

Linguistic unit

• Vowel-/r/ combinations are counted as two phones in the intended form, but as one in the realised form (except for schwa-/r/ combinations, which were labelled //).

• Realised syllables can be problematical, as e.g. the //-syllable in „ziehen“ in the example above can be realised as a syllabic or non-syllabic /n/, leading to different syllable counts (one and zero, respectively).

Despite many sources of variation in both the units and the domains, we found high correlations between the number of units and domain duration for all units, both for read and spontaneous speech.
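The correlation between unit counts and domain duration can be illustrated with a minimal sketch. It computes a Pearson coefficient in plain Python; the poster does not state which correlation coefficient was used, so Pearson is an assumption, and the counts and durations below are hypothetical values, not data from the KielCorpus.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical example: realised phone counts and durations (in seconds)
# of five inter-pause stretches.
phone_counts = [12, 20, 7, 31, 15]
durations = [0.9, 1.6, 0.6, 2.3, 1.2]
r = pearson_r(phone_counts, durations)
```

A value of r close to 1 means that the unit count is a good linear predictor of the duration of the domain.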

Results

• Correlations are higher for ips than for IP.

• Words/second show the lowest correlations with duration.

• Realised phones/second result in the highest correlation with duration.

Realised phones/second in ips are therefore used in this study.

• For other applications, comparable results can be obtained when using the graphical word or the intended syllable, which can be measured/derived more easily

• Note: Although phone and syllable deletions lower the measured articulation rate, it is not clear what their effect on the perceived articulation rate is.

Results and discussion

Domain

• inter-pause stretch (ips)

- The pauses which delimit them (pause, breathing, filled pause, lip smacks, coughing and other non-verbal articulations) are easy to determine in the labelfile and are often used to delimit the domain over which articulation rate is calculated.

- ASR is primarily interested in decoding speech (not silence) from the information contained in the phone segments.
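As a rough sketch of how articulation rate per ips could be derived from a label file: the segment format (label, start, end) and the pause label names below are hypothetical and do not reproduce the actual KielCorpus conventions.

```python
# Hypothetical labels for pauses and non-verbal articulations.
PAUSE_LABELS = {"<pause>", "<breath>", "<filled>", "<smack>", "<cough>"}


def _rate(stretch):
    """Phones per second for one stretch of (label, start, end) segments."""
    duration = stretch[-1][2] - stretch[0][1]
    if duration <= 0:
        raise ValueError("stretch must have positive duration")
    return len(stretch) / duration


def ips_rates(segments, pause_labels=PAUSE_LABELS):
    """Split segments at pause labels and return one rate per ips."""
    rates = []
    current = []
    for label, start, end in segments:
        if label in pause_labels:
            # A pause closes the current inter-pause stretch, if any.
            if current:
                rates.append(_rate(current))
                current = []
        else:
            current.append((label, start, end))
    if current:
        rates.append(_rate(current))
    return rates


# Hypothetical label file content: (label, start time, end time) in seconds.
example_segments = [
    ("<pause>", 0.0, 0.3),
    ("a", 0.3, 0.4),
    ("m", 0.4, 0.5),
    ("<breath>", 0.5, 0.9),
    ("b", 0.9, 1.0),
    ("l", 1.0, 1.1),
    ("aU", 1.1, 1.3),
    ("<pause>", 1.3, 1.5),
]
example_rates = ips_rates(example_segments)
```

Because pauses are excluded from the duration, the result is an articulation rate rather than a speaking rate.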

Articulation rate changes continuously while speaking and is not always constant within an utterance. Therefore we use two prosodic domains (although it is clear that more local variation can and will occur even within these domains):

- The intonational phrase (IP) is considered an important planning unit, reflected by the intonation contour.

- utterances must be labelled intonationally to obtain IPs. The criteria for IPs can differ considerably between studies.

The following units were measured:

• intended word

• intended syllable

• realised syllable

• intended phone

• realised phone

In the example sentence above, the intended form has 10 syllables and 26 phones, while the realised form has 8 syllables and 20 phones.

• Glottal stops are considered to be a phone (in contrast to laryngealisation)

• Due to the labelling conventions of the database, affricates are counted as two phones and diphthongs as one

• intonational phrase (IP)

The KielCorpus database was subdivided into three parts on the basis of the articulation rate measured in realised phones/second (for read and spontaneous speech separately):

• slow: more than 1 sd below the mean

• medium: between -1 and +1 sd from the mean

• fast: more than 1 sd above the mean
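The three-way split can be sketched as follows. The function name and the example rates are hypothetical, and the use of the sample standard deviation is an assumption, since the poster does not say which estimator was used.

```python
import statistics


def rate_classes(rates):
    """Label each rate as slow, medium or fast relative to mean +/- 1 sd."""
    mean = statistics.mean(rates)
    sd = statistics.stdev(rates)  # sample sd (assumption)
    labels = []
    for rate in rates:
        if rate < mean - sd:
            labels.append("slow")
        elif rate > mean + sd:
            labels.append("fast")
        else:
            labels.append("medium")
    return labels


# Hypothetical articulation rates in realised phones/second, one per ips.
example_rates = [8.0, 12.0, 13.0, 14.0, 13.0, 12.0, 19.0]
example_labels = rate_classes(example_rates)
```

In the study this split was made separately for read and spontaneous speech, so such a function would be applied once per speaking style.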

Database analysis: realised lexical form

• Generally more deletions than replacements, especially /, , t/

• consonants generally affected more strongly by deletions and replacements than vowels

• Exception: schwa

• for /n/ more replacements than deletions (place assimilation)

• //, which is /r/ in the canonical form, is seldom deleted or reduced
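Deletion and replacement rates of this kind could be counted from an alignment of intended and realised phones. The sketch below assumes that the alignment is already given as pairs, with None marking a deleted phone; the SAMPA-like labels, the example pairs and the expression of rates as percentages are all assumptions made for illustration.

```python
from collections import Counter


def deletion_replacement_rates(aligned_pairs):
    """Return {intended phone: (deletion %, replacement %)}.

    aligned_pairs holds (intended, realised) tuples. A realised value of
    None marks a deletion; an intended value of None marks an insertion,
    which is ignored here.
    """
    totals = Counter()
    deletions = Counter()
    replacements = Counter()
    for intended, realised in aligned_pairs:
        if intended is None:
            continue
        totals[intended] += 1
        if realised is None:
            deletions[intended] += 1
        elif realised != intended:
            replacements[intended] += 1
    return {
        phone: (
            100.0 * deletions[phone] / totals[phone],
            100.0 * replacements[phone] / totals[phone],
        )
        for phone in totals
    }


# Hypothetical aligned pairs; "@" stands for schwa, "N" for a velar nasal.
example_pairs = [
    ("t", "t"), ("t", None), ("n", "m"), ("n", "n"),
    ("@", None), ("t", None), ("n", "n"), ("n", "N"),
]
example_phone_rates = deletion_replacement_rates(example_pairs)
```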

• Deletion of //, //, /t/ (closure and especially release + aspiration) and also /n/ should be represented in the lexicon by means of pronunciation variants.

• Pronunciation variants due to assimilation of /n/ and replacement of /t/ closures and // should also be added to the lexicon.

• If there is any vowel reduction, therefore, it must take place on the acoustic rather than the lexical level (except for schwa).
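One simple way to produce deletion variants for a lexicon is to make each occurrence of a deletable phone optional. This is only an illustrative sketch, not the procedure used by the authors: the function name and the SAMPA-like transcription are hypothetical, and the approach over-generates because it ignores phonetic context.

```python
from itertools import product


def deletion_variants(canonical, deletable):
    """List pronunciation variants with deletable phones optionally dropped.

    canonical is a sequence of phone labels; deletable is a set of labels
    whose occurrences may each be kept or deleted independently.
    """
    # For each phone: either it must be kept, or it may be kept or dropped.
    options = [((p,), ()) if p in deletable else ((p,),) for p in canonical]
    variants = []
    for choice in product(*options):
        variant = tuple(phone for part in choice for phone in part)
        # Skip empty pronunciations and duplicates.
        if variant and variant not in variants:
            variants.append(variant)
    return variants


# Hypothetical entry for "nicht" with optional deletion of the final /t/.
nicht_variants = deletion_variants(("n", "I", "C", "t"), {"t"})
```

The canonical form always comes first in the returned list, followed by the reduced variants.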

Implications for ASR

Results:

Deletions

segment    read                      spontaneous
           slow    medium  fast      slow    medium  fast
all        10.1    13.1    16.2      12.6    17.6    21.1
t_clos.    7.1     13.2    15.7      9.8     18.9    18.9
t_rel.     31.7    40.6    42.3      36.6    52.1    52.1
[…]        52.1    61.7    76.8      45.8    78.4    78.4
n          1.4     2.4     5.0       2.5     6.9     6.9
[…]        36.0    43.3    51.6      57.5    65.3    65.3
[…]        1.0     0.5     1.3       0.0     2.5     2.5
a          0.0     0.0     0.2       3.8     2.9     2.9

Replacements

segment    read                      spontaneous
           slow    medium  fast      slow    medium  fast
all        0.9     1.4     2.4       1.4     2.8     3.7
t_clos.    1.5     2.4     4.7       1.8     4.6     5.9
t_rel.     -       -       -         -       -       -
[…]        0.0     0.0     0.0       0.2     0.1     0.0
n          7.2     9.8     9.7       9.0     10.3    10.7
[…]        0.0     0.0     0.0       0.0     0.2     0.1
[…]        2.4     0.7     0.4       2.2     0.5     0.4
a          0.3     1.6     6.0       0.0     1.0     1.8

• Phone classification for individual phones in hidden Markov modelling experiments using HTK, for read and spontaneous speech separately

• Hidden Markov models: 3 states (5 for diphthongs), left-to-right (no states skipped) and 8 mixtures per state

• Jackknife experiments with 20% of the database as test data, results computed as weighted averages

• Results evaluated for slow, medium and fast speech
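The evaluation scheme can be sketched under two assumptions that the poster does not spell out: that the jackknife rotates through five disjoint test parts of about 20% each, and that the weighted average weights each part by its number of test tokens. All names and numbers below are hypothetical.

```python
def five_fold_splits(items):
    """Yield (train, test) lists; each test part holds about 20% of items."""
    k = 5
    for i in range(k):
        # Every k-th item, starting at offset i, goes into the test part.
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def weighted_average(fold_results):
    """Average classification rates, weighted by test token counts.

    fold_results holds (rate, n_test_tokens) pairs, one per test part.
    """
    total = sum(n for _, n in fold_results)
    if total == 0:
        raise ValueError("no test tokens")
    return sum(rate * n for rate, n in fold_results) / total


# Hypothetical classification rates (%) and token counts for two parts.
example_average = weighted_average([(80.0, 100), (70.0, 300)])
example_splits = list(five_fold_splits(list(range(10))))
```

Weighting by token count keeps a small test part from influencing the overall rate as much as a large one.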

Phone classification

• We found a deterioration of phone classification with articulation rate (unlike e.g. Siegler and Stern, 1995). Our findings are comparable to those of Wrede et al. (2001) and are probably caused by the greater spectral variation at faster articulation rates.

• Articulation rate affects both vowels and consonants (lower phone classification results for faster speaking rates).

• t-tests on average vowel classification rates for matched pairs showed that the phone classification rates for normal and fast vowels do not differ significantly in spontaneous speech. This is probably due to the large amount of variation in the average vowel classification rates.

• Among the consonants, particularly fricatives and also plosives were affected by articulation rate.
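The matched-pairs comparison can be illustrated with a paired t statistic computed in plain Python. This is a sketch of the standard test, not the authors' script; the function name and the example values are hypothetical. The p-value would still have to be looked up for the returned degrees of freedom.

```python
import math
import statistics


def paired_t(xs, ys):
    """Return (t, degrees of freedom) for a paired t-test on xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample sd of the pairwise differences
    if sd_d == 0:
        raise ValueError("t is undefined when all differences are equal")
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1


# Hypothetical classification rates (%) for three vowels at two rates.
t_value, dof = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Large variation in the pairwise differences increases sd_d and lowers t, which fits the explanation given for the non-significant result in spontaneous speech.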

Articulation rate effects

Introduction

Aims

unit                 spontaneous       read
                     ips     IP        ips     IP
intended word        .913    .855      .899    .862
intended syllable    .951    .926      .938    .918
realised syllable    .955    .932      .945    .924
intended phone       .956    .935      .945    .928
realised phone       .965    .948      .958    .943

By performing phone classification for individual phones, the effects of silences in the utterance are excluded as a source of recognition error. The emphasis is entirely on the recognition of the phones at different articulation rates (calculated for each ips as phones/second).

Database

The German KielCorpus for Read and Spontaneous Speech

• manually labelled realised phones along with intended (canonical) transcriptions

• large parts also prosodically annotated

Read:

• single sentences of variable length and two short stories

• 4 hours

• 53 speakers (27 male and 26 female)

Spontaneous:

• appointment-making dialogues

• 4 hours

• 42 speakers (24 male, 18 female)

Only segmentally and prosodically labelled parts selected for this study.

We address three questions:

Database analysis: spontaneous versus read speech

As is well-known, spontaneous speech differs from read speech with respect to pauses (more unfilled pauses, filled pauses, ungrammatical pauses). We also find differences in temporal structure (shorter phrases, greater variance in phrase duration, greater variance in articulation rate for spontaneous speech). But we also observe changes on the phonemic level.

References

Alleva, F., Huang, X., Hwang, M-Y. & Jiang, L. "Can continuous speech recognisers handle isolated speech?" Speech Communication 26 (3), 183-190, 1998.

Martínez, F., Tapias, D., Álvarez, J. & León, P. "Characteristics of slow, average and fast speech and their effects in large vocabulary continuous speech recognition." Proc. Eurospeech Rhodes, 469-472, 1997.

Mirghafori, N., Fosler, E. & Morgan, N. "Fast speakers in large vocabulary continuous speech recognition." Proc. Eurospeech Madrid, 1995.

Pfau, T. & Ruske, G. "Creating Hidden Markov Models for fast speech." Proc. ICSLP Sydney, 205-208, 1998.

Siegler, M. A. & Stern, R. M. "On the effects of speech rate in large vocabulary speech recognition systems." Proc. ICASSP Detroit (1), 612-615, 1995.

Wrede, B., Fink, G. & Sagerer, G. "An investigation of modelling aspects for rate-dependent speech recognition." Proc. Eurospeech Aalborg, 2001.

[Figure: word error rate and # utterances plotted against articulation rate (sd, from -3 to 3; slow to fast)]

• Phone classification rates for consonants, particularly voiceless obstruents, are higher than for vowels.

• Schwa is recognised particularly poorly, possibly because of its liability to transconsonantal coarticulation

• Diphthongs and // are also recognised poorly.

Rate-independent phone class effects

Average phone classification rates for slow, normal and fast speech (read and spontaneous)