• Speech rate affects the word error rate of automatic speech recognition systems.
• Higher error rates for fast speech, but also for slow, hyperarticulated speech (Siegler and Stern, 1995; Mirghafori et al., 1995; Martínez et al., 1997; Pfau and Ruske, 1998; Alleva et al., 1998).
• What linguistic unit should we use to quantify speech rate and what domain is appropriate?
• What are the most important effects on the realisation of the lexical forms?
• How well are the acoustic models suited for different speech rates?
Articulation rate and phone classification
Articulation rate and realised lexical form
Articulation rate measures
Articulation Rate: Measures, Realised Lexical Form and Phone Classification in Spontaneous and Read German Speech
Jürgen Trouvain, Jacques Koreman, Attilio Erriquez and Bettina Braun
Universität des Saarlandes, Saarbrücken, Germany
{trouvain,koreman,erriquez,bebr}@coli.uni-sb.de
We investigated several linguistic units for measuring articulation rate as well as two different domains.
It is important to distinguish between intended and realised units. Intended forms can be easily derived from the canonical transcription of the uttered words, but their actual realisation can vary strongly:
Am blauen Himmel ziehen die Wolken
(Engl.: In the blue sky wander the clouds)
The definition of what does and does not count as a unit is also problematic:
Linguistic unit
• Vowel-/r/ combinations are counted as two phones in the intended form, but as one in the realised form (except for schwa-/r/ combinations, which were labelled /ɐ/).
• Realised syllables can be problematic: e.g. the /ən/ syllable in „ziehen“ in the example above can be realised with a syllabic or a non-syllabic /n/, leading to different syllable counts (one and zero, respectively).
Despite many sources of variation in both the units and the domains, we found high correlations between the number of units and domain duration for all units, both for read and spontaneous speech.
Results
• Correlations are higher for ips than for IP.
• Words/second shows the lowest correlation with duration.
• Realised phones/second shows the highest correlation with duration; realised phones/second in ips is therefore used in this study.
• For other applications, comparable results can be obtained when using the graphical word or the intended syllable, which can be measured/derived more easily
• Note: Although phone and syllable deletions lower the measured articulation rate, it is not clear what their effect on the perceived articulation rate is.
Results and discussion
Domain
• inter-pause stretch (ips)
- The pauses which delimit them (pause, breathing, filled pause, lip smacks, coughing and other non-verbal articulations) are easy to determine in the label file and are often used to delimit the domain over which articulation rate is calculated.
- ASR is primarily interested in decoding speech (not silence) from the information contained in the phone segments.
Articulation rate changes continuously while speaking and is not always constant within an utterance. Therefore we use two prosodic domains (although it is clear that more local variation can and will occur even within these domains):
- the IP is considered an important planning unit, reflected in the intonation contour.
- utterances must be labelled intonationally to obtain IPs; the criteria for IPs can differ considerably between studies.
The following units were measured:
• intended word
• intended syllable
• realised syllable
• intended phone
• realised phone
For the example sentence above: intended form 10 syllables, 26 phones; realised form 8 syllables, 20 phones.
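As a sketch of how such rates are computed per unit and domain: the unit counts below are those of the example sentence, while the ips duration is an invented value for illustration.

```python
# Sketch (not from the poster): articulation rate for one inter-pause
# stretch (ips) in several units. The duration is an invented example;
# the unit counts are those of the example sentence.

def articulation_rate(n_units: int, duration_s: float) -> float:
    """Units per second over a pause-free stretch of speech."""
    return n_units / duration_s

duration = 1.9  # seconds, hypothetical
counts = {
    "intended syllables": 10,  # from the canonical transcription
    "realised syllables": 8,   # after deletions in the actual utterance
    "intended phones": 26,
    "realised phones": 20,
}

for unit, n in counts.items():
    print(f"{unit}: {articulation_rate(n, duration):.2f} per second")
```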
• Glottal stops are considered to be a phone (in contrast to laryngealisation)
• Due to the labelling conventions of the database, affricates are counted as two phones and diphthongs as one
• intonational phrase (IP)
The KielCorpus database was subdivided into three parts on the basis of the articulation rate measured in realised phones/second (for read and spontaneous speech separately):
• slow: more than 1 sd below the mean
• medium: between -1 and +1 sd from the mean
• fast: more than 1 sd above the mean
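The subdivision can be sketched as follows; the articulation rates below are invented values, not corpus data.

```python
# Sketch: subdividing a corpus into slow/medium/fast speech by the mean
# and standard deviation of articulation rate (realised phones/second),
# computed separately per corpus. All rates here are invented.
from statistics import mean, stdev

def rate_category(rate: float, mu: float, sd: float) -> str:
    if rate < mu - sd:
        return "slow"
    if rate > mu + sd:
        return "fast"
    return "medium"

rates = [9.5, 11.2, 12.0, 12.8, 13.1, 13.9, 14.6, 16.8]  # invented
mu, sd = mean(rates), stdev(rates)
categories = [rate_category(r, mu, sd) for r in rates]
print(categories)
```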
Database analysis: realised lexical form
• Generally more deletions than replacements, especially for /ə/, /ʔ/ and /t/.
• Consonants are generally affected more strongly by deletions and replacements than vowels.
• Exception: schwa.
• For /n/, more replacements than deletions (place assimilation).
• /ɐ/, which is /r/ in the canonical form, is seldom deleted or reduced.
• Deletion of /ə/, /ʔ/ and /t/ (closure and especially release + aspiration), and also of /n/, should be represented in the lexicon by means of pronunciation variants.
• Pronunciation variants due to assimilation of /n/ and replacement of /t/ closures and // should also be added to the lexicon.
• If there is any vowel reduction, therefore, it must take place on the acoustic rather than the lexical level (except for schwa).
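Deletion and replacement counts of this kind come from comparing each canonical transcription with its realised labels. A minimal sketch of an alignment-based count follows, on toy phone sequences; the KielCorpus itself is manually labelled, so this automatic alignment is only an illustration.

```python
# Sketch: counting deletions and replacements per canonical phone by
# aligning the intended (canonical) and realised phone sequences with a
# simple edit-distance alignment. Toy data, invented SAMPA-like symbols.

def align_counts(intended, realised):
    """DP alignment; returns per-phone deletion and replacement counts."""
    n, m = len(intended), len(realised)
    # cost[i][j] = edit distance between intended[:i] and realised[:j]
    cost = [[i + j if i == 0 or j == 0 else 0 for j in range(m + 1)]
            for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (intended[i - 1] != realised[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # trace back, labelling each canonical phone as kept/replaced/deleted
    deletions, replacements = {}, {}
    i, j = n, m
    while i > 0:
        if j > 0 and cost[i][j] == cost[i - 1][j - 1] + (intended[i - 1] != realised[j - 1]):
            if intended[i - 1] != realised[j - 1]:
                replacements[intended[i - 1]] = replacements.get(intended[i - 1], 0) + 1
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            deletions[intended[i - 1]] = deletions.get(intended[i - 1], 0) + 1
            i -= 1
        else:
            j -= 1  # insertion in the realised form; not counted here
    return deletions, replacements

# Toy example: canonical "ziehen" /ts i: @ n/ realised with schwa deleted
dels, reps = align_counts(["ts", "i:", "@", "n"], ["ts", "i:", "n"])
print(dels, reps)  # → {'@': 1} {}
```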
Implications for ASR

Results (deletion and replacement rates in %):

Deletions (%)
               read                 spontaneous
segment    slow  medium  fast   slow  medium  fast
all        10.1   13.1   16.2   12.6   17.6   21.1
t_clos.     7.1   13.2   15.7    9.8   18.9   18.9
t_rel.     31.7   40.6   42.3   36.6   52.1   52.1
/ʔ/        52.1   61.7   76.8   45.8   78.4   78.4
/n/         1.4    2.4    5.0    2.5    6.9    6.9
/ə/        36.0   43.3   51.6   57.5   65.3   65.3
/ɐ/         1.0    0.5    1.3    0.0    2.5    2.5
/a/         0.0    0.0    0.2    3.8    2.9    2.9

Replacements (%)
               read                 spontaneous
segment    slow  medium  fast   slow  medium  fast
all         0.9    1.4    2.4    1.4    2.8    3.7
t_clos.     1.5    2.4    4.7    1.8    4.6    5.9
t_rel.       -      -      -      -      -      -
/ʔ/         0.0    0.0    0.0    0.2    0.1    0.0
/n/         7.2    9.8    9.7    9.0   10.3   10.7
/ə/         0.0    0.0    0.0    0.0    0.2    0.1
/ɐ/         2.4    0.7    0.4    2.2    0.5    0.4
/a/         0.3    1.6    6.0    0.0    1.0    1.8
• Phone classification for individual phones in hidden Markov modelling experiments using HTK, for read and spontaneous speech separately
• Hidden Markov models: 3 states (5 for diphthongs), left-to-right (no states skipped) and 8 mixtures per state
• Jackknife experiments with 20% of the database as test data, results computed as weighted averages
• Results evaluated for slow, medium and fast speech
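The jackknife setup above can be sketched as follows. The HMM training/classification step is stubbed out (the poster used HTK hidden Markov models), and all data here are invented.

```python
# Sketch: 5-fold jackknife evaluation (20% test data per fold) with
# results combined as a weighted average, weighting each fold by its
# number of test tokens. The classifier is a stub, not an HTK model.
import random

def jackknife_weighted_accuracy(samples, classify, k=5, seed=0):
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        test = folds[i]
        # training on the remaining k-1 folds would happen here
        correct += sum(classify(x) == y for x, y in test)
        total += len(test)
    return correct / total  # weighted average over folds

# Toy usage: labels mark even numbers; the stub always answers True,
# so it is right exactly for the even half of the data
samples = [(x, x % 2 == 0) for x in range(100)]
acc = jackknife_weighted_accuracy(samples, classify=lambda x: True)
print(acc)  # → 0.5
```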
Phone classification
• We found a deterioration of phone classification with articulation rate (unlike e.g. Siegler and Stern, 1995). Our findings are comparable to those of Wrede et al. (2001) and are probably caused by the greater spectral variation at faster articulation rates.
• Articulation rate affects both vowels and consonants (lower phone classification results for faster speaking rates).
• t-tests on average vowel classification rates for matched pairs showed that the phone classification rates for normal and fast vowels do not differ significantly in spontaneous speech. This is probably due to the large amount of variation in the average vowel classification rates.
• Among the consonants, particularly fricatives and also plosives were affected by articulation rate.
Articulation rate effects
Introduction
Aims
                    spontaneous        read
unit                 ips     IP     ips     IP
intended word       .913   .855   .899   .862
intended syllable   .951   .926   .938   .918
realised syllable   .955   .932   .945   .924
intended phone      .956   .935   .945   .928
realised phone      .965   .948   .958   .943
By performing phone classification for individual phones, the effects of silences in the utterance are excluded as a source of recognition error. The emphasis is entirely on the recognition of the phones at different articulation rates (calculated for each ips as phones/second).
Database
The German KielCorpus for Read and Spontaneous Speech
• manually labelled realised phones along with intended (canonical) transcriptions
• large parts also prosodically annotated
Read:
• single sentences of variable length and two short stories
• 4 hours
• 53 speakers (27 male and 26 female)

Spontaneous:
• appointment-making dialogues
• 4 hours
• 42 speakers (24 male, 18 female)
Only segmentally and prosodically labelled parts selected for this study.
References
We address three questions:
As is well-known, spontaneous speech differs from read speech with respect to pauses (more unfilled pauses, filled pauses, ungrammatical pauses). We also find differences in temporal structure (shorter phrases, greater variance in phrase duration, greater variance in articulation rate for spontaneous speech). But we also observe changes on the phonemic level.
Database analysis: spontaneous versus read speech
Alleva, F., Huang, X., Hwang, M.-Y. & Jiang, L. "Can continuous speech recognisers handle isolated speech?" Speech Communication 26 (3), 183-190, 1998.
Martínez, F., Tapias, D., Álvarez, J. & León, P. "Characteristics of slow, average and fast speech and their effects in large vocabulary continuous speech recognition." Proc. Eurospeech, Rhodes, 469-472, 1997.
Mirghafori, N., Fosler, E. & Morgan, N. "Fast speakers in large vocabulary continuous speech recognition." Proc. Eurospeech, Madrid, 1995.
Pfau, T. & Ruske, G. "Creating Hidden Markov Models for fast speech." Proc. ICSLP, Sydney, 205-208, 1998.
Siegler, M. A. & Stern, R. M. "On the effects of speech rate in large vocabulary speech recognition systems." Proc. ICASSP, Detroit (1), 612-615, 1995.
Wrede, B., Fink, G. & Sagerer, G. "An investigation of modelling aspects for rate-dependent speech recognition." Proc. Eurospeech, Aalborg, 2001.
[Figure: word error rate and number of utterances as a function of articulation rate (in standard deviations from the mean), from slow to fast]
• Phone classification rates for consonants, particularly voiceless obstruents, are higher than for vowels.
• Schwa is recognised particularly poorly, possibly because of its liability to transconsonantal coarticulation.
• Diphthongs and // are also recognised poorly.
Rate-independent phone class effects
Average phone classification rates for slow, normal and fast speech (read and spontaneous)