An Introduction to Mandarin Speech Recognition
![Page 1: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/1.jpg)
An Introduction to Mandarin Speech Recognition
John Steinberg, Temple University
![Page 2: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/2.jpg)
Speech Recognition Applications
• Mobile Phone Technology
• Translators & Prostheses
• Automotive / GPS Devices
• Intelligence Collection
![Page 3: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/3.jpg)
Speech Recognition: Basic Process
[5]
![Page 4: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/4.jpg)
Importance of Mandarin
[1]
Mandarin, English
![Page 5: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/5.jpg)
Importance of Mandarin
• English speakers in the world: ~350 million [11]
• Estimated number of current English learners in China: 200-350 million [12]
• Estimated number of native Mandarin speakers: 1+ billion
[2]
![Page 6: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/6.jpg)
Importance of Mandarin
[3]
![Page 7: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/7.jpg)
Importance of Mandarin
[3]
![Page 8: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/8.jpg)
Mandarin Chinese
Tonal language (inflection matters!)
• 1st tone – High, constant pitch (like saying “aaah”)
• 2nd tone – Rising pitch (“Huh?”)
• 3rd tone – Low pitch (“ugh”)
• 4th tone – High pitch with a rapid descent (“No!”)
• “5th tone” – Neutral, used for de-emphasized syllables
Characters
• 8000+ characters compose 80k-200k common words
• Act as morphemes
• Are primarily monosyllabic
• Have a single associated tone
![Page 9: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/9.jpg)
Mandarin Chinese
Coarticulation: Context can cause changes in tone
• Bu4 + Dui4 = Bu2 Dui4 (“wrong”)
• Ni3 + Hao3 = Ni2 Hao3 (“hello”)
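The two tone changes shown on this slide can be sketched as a small rewrite rule over numbered pinyin. This is a minimal illustration, not a full tone model: the function names are hypothetical, and only the two patterns from the examples are handled (a 3rd tone before another 3rd tone, and "bu4" before a 4th tone). Runs of three or more 3rd tones are treated naively.

```python
import re

def split_syllable(syllable):
    """Split numbered pinyin such as 'Ni3' into ('Ni', 3)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-5])", syllable)
    if match is None:
        raise ValueError(f"not numbered pinyin: {syllable!r}")
    return match.group(1), int(match.group(2))

def apply_tone_changes(syllables):
    """Apply the two context rules from the slide to a list of syllables."""
    parsed = [split_syllable(s) for s in syllables]
    result = []
    for i, (base, tone) in enumerate(parsed):
        # Tone of the following syllable as written (None at the end).
        next_tone = parsed[i + 1][1] if i + 1 < len(parsed) else None
        if tone == 3 and next_tone == 3:
            tone = 2  # 3rd tone before 3rd tone becomes 2nd tone
        elif base.lower() == "bu" and tone == 4 and next_tone == 4:
            tone = 2  # "bu4" before a 4th tone becomes "bu2"
        result.append(f"{base}{tone}")
    return result

print(apply_tone_changes(["Bu4", "Dui4"]))
print(apply_tone_changes(["Ni3", "Hao3"]))
```

A recognizer has to account for this because the tone that is actually spoken differs from the tone listed in the lexicon.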
![Page 10: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/10.jpg)
Mandarin Chinese
Heavily contextual language
• Monosyllabic
• Relatively few syllables compared to English [3]
  • English: ~10,000 syllables
  • Mandarin: ~1300 syllables including tones (400 excluding)
• High number of homophones
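A rough back-of-the-envelope calculation shows why homophones are so common. It uses the approximate figures quoted in this deck (8000+ characters, ~1300 tonal syllables, ~400 toneless syllables) and assumes, for simplicity, that characters are spread evenly over syllables, which real Mandarin is not.

```python
def avg_characters_per_syllable(num_characters, num_syllables):
    """Average number of characters sharing one syllable, assuming an even spread."""
    return num_characters / num_syllables

CHARACTERS = 8000               # lower bound quoted in the deck
SYLLABLES_WITH_TONES = 1300     # approximate
SYLLABLES_WITHOUT_TONES = 400   # approximate

with_tones = avg_characters_per_syllable(CHARACTERS, SYLLABLES_WITH_TONES)
without_tones = avg_characters_per_syllable(CHARACTERS, SYLLABLES_WITHOUT_TONES)

print(f"with tones:    about {with_tones:.1f} characters per syllable")
print(f"without tones: about {without_tones:.1f} characters per syllable")
```

Even when tone is recognized correctly, several characters compete for each syllable on average, so the acoustic signal alone cannot pick the character.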
![Page 11: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/11.jpg)
Challenges in Mandarin Recognition
• Requires highly developed language model due to highly contextual nature of Mandarin
• Tone modeling
• Coarticulation
• Large number of homophones
• Chinese text is unsegmented
  • No standard lexicon
• Chinese sentence/word structure is very flexible
  • Ex: Beijing Da Xue -> Bei Da
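To make the segmentation problem concrete, here is a minimal sketch of forward maximum matching, one simple dictionary-based way to split unsegmented text. The slides do not name a segmentation method, so this technique and the toy lexicon are assumptions chosen for illustration only.

```python
def forward_max_match(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical toy lexicon; a real system needs a far larger one.
toy_lexicon = {"北京", "大学", "北京大学", "北大"}

print(forward_max_match("我在北京大学", toy_lexicon))
print(forward_max_match("我在北大", toy_lexicon))
```

The output depends entirely on the lexicon: if the abbreviation 北大 (Bei Da) is missing, it falls apart into single characters, which is why the lack of a standard lexicon matters.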
![Page 12: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/12.jpg)
Modeling Methods
• Prosodic Features
  • Describes tone (question vs. statement), rhythm, and focus of speech
• Pitch Extraction
  • Yields more precise character recognition
• Stronger Language Models
  • Determines context more accurately
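As an illustration of pitch extraction, the sketch below estimates the fundamental frequency (F0) of one frame by picking the autocorrelation peak. The slides do not say which pitch tracker is used, so the method, the function name, and the 75-400 Hz search range are assumptions. The input is a synthetic sine wave, not real speech.

```python
import math

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Return the F0 (Hz) whose lag maximizes the frame's autocorrelation."""
    min_lag = int(sample_rate / f0_max)
    max_lag = min(int(sample_rate / f0_min), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # None means no positive correlation peak was found (e.g. silence).
    return sample_rate / best_lag if best_lag else None

sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / sample_rate) for n in range(400)]
print(estimate_f0(frame, sample_rate))
```

Running this frame by frame gives an F0 contour, which is the raw material for tone modeling.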
![Page 13: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/13.jpg)
Prosodic Units
• Different prosodic units (labels) have been suggested [4]
  • Ex: Syllable (SYL), Prosodic Word (PW), Minor Prosodic Phrase (MIP), Major Prosodic Phrase (MAP), & Intonation Group (IG)
• Past labeling systems are primarily based on auditory perception
  • Prosodic break labeling is subjective and inconsistent
  • Auditory perception approach loses quantitative information
  • Impossible to replicate identical prosodic labels for an original speech signal
![Page 14: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/14.jpg)
Prosodic Units
• New, more objective prosodic cues include [4]:
  • Pause duration (directly measured)
  • Segment/syllable duration (directly measured)
  • F0 reset
• F0 contains utterance-long intonation information, which must be separated from the tones within the utterance
• Quantitative description of F0: log(F0) = phrase components + accent or tone components + log(baseline frequency)
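The additive description of F0 above can be sketched in code. This assumes the slide refers to a command-response model of the kind associated with Fujisaki, in which the sum is taken in the log domain. The time constants (alpha, beta), the ceiling gamma, and all command values are illustrative assumptions, not values from the slides.

```python
import math

def phrase_response(t, alpha=3.0):
    """Response to a phrase command (impulse): slow rise and decay."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def tone_response(t, beta=20.0, gamma=0.9):
    """Response to a tone/accent command (step): fast rise to a ceiling."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def f0_contour(times, baseline_hz, phrase_cmds, tone_cmds):
    """phrase_cmds: (onset, amplitude); tone_cmds: (onset, offset, amplitude)."""
    contour = []
    for t in times:
        log_f0 = math.log(baseline_hz)
        log_f0 += sum(a * phrase_response(t - t0) for t0, a in phrase_cmds)
        log_f0 += sum(a * (tone_response(t - t1) - tone_response(t - t2))
                      for t1, t2, a in tone_cmds)
        contour.append(math.exp(log_f0))
    return contour

times = [i * 0.1 for i in range(11)]  # 0.0 s to 1.0 s
contour = f0_contour(times, 100.0, [(0.0, 0.5)], [(0.2, 0.4, 0.3)])
print([round(f0, 1) for f0 in contour])
```

Fitting such a model to a measured contour is what separates the slow phrase-level intonation from the fast syllable-level tone movements.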
![Page 15: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/15.jpg)
Language Models
• N-grams – 3 steps: Syllable -> Character -> Word
• Neural Networks – Better suited to high dimensionality
• Random Forests – May be able to include morphology in the language model [7]
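The syllable-to-character step of an n-gram approach can be sketched with a character bigram model and a Viterbi-style search. The training sentences, the candidate table, and the add-one smoothing are toy assumptions for illustration; the slides do not describe a specific implementation.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count character unigrams and bigrams, with a start marker per sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        chars = ["<s>"] + list(sentence)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-one smoothed log P(cur | prev)."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))

def decode(syllables, candidates, unigrams, bigrams):
    """Pick the most probable character sequence for a syllable sequence."""
    vocab_size = len(unigrams)
    best = {"<s>": (0.0, [])}  # last character -> (log score, path so far)
    for syllable in syllables:
        new_best = {}
        for cur in candidates[syllable]:
            new_best[cur] = max(
                (score + bigram_logprob(prev, cur, unigrams, bigrams, vocab_size),
                 path + [cur])
                for prev, (score, path) in best.items()
            )
        best = new_best
    return "".join(max(best.values())[1])

# Hypothetical toy data.
unigrams, bigrams = train_bigrams(["你好", "你是谁", "我是学生"])
candidates = {"ni3": ["你", "拟"], "hao3": ["好", "郝"]}
print(decode(["ni3", "hao3"], candidates, unigrams, bigrams))
```

Context from neighboring characters is what lets the model choose among homophones that the acoustic model cannot tell apart.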
![Page 16: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/16.jpg)
Timeline
1980’s: Individual syllable recognition research begins
1986: 863 program begins funding collection of speech corpora in China
1992: LDC is founded
1993: Golden Mandarin is developed
• 1st speaker-dependent dictation system for Mandarin
• Single syllable recognizer
• Designed for typing Mandarin
• 8% CER [10]
1994: Golden Mandarin (II) yields 5% CER on word-based speaker-independent system [11]
1995: Golden Mandarin (III) [12]
• Prosodic unit based
• User independent
• 10% CER
• Dictation system
Early 2000’s: Research in prosodic segmentation, tone modeling, and new language models for CTS yields ~40% CER
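The CER figures in the timeline are character error rates. A minimal sketch of how such a rate is commonly computed follows: edit distance between the reference and the recognized character strings, divided by the reference length. The function names and example strings are illustrative, and the cited systems may have scored their output differently.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, 1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_char != hyp_char),  # substitution or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# One homophone substitution (学 -> 雪) in a four-character reference.
print(character_error_rate("北京大学", "北京大雪"))
```

Because insertions count as errors, CER can exceed 100% on very poor output.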
![Page 17: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/17.jpg)
Benchmark History [8]
![Page 18: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/18.jpg)
Recent Experimentation
Broadcast News and Conversational Telephone Speech [9]
![Page 19: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/19.jpg)
Future Studies
• Continue studying current baseline systems/data sets
• Further investigate possible language models
• Compare effectiveness of prosodic features
![Page 20: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/20.jpg)
Questions?
![Page 21: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/21.jpg)
References
[1] K. Kūriákī, A Grammar of Modern Indo-European, Asociación Cultural Dnghu, 2007
[2] Wikipedia
[3] W. Gu, K. Hirose, H. Fujisaki, “Comparison of Perceived Prosodic Boundaries and Global Characteristics of Voice Fundamental Frequency Contours in Mandarin Speech”, ISCSLP, 2006
[4] J. Picone, A. Harati, "Why Study Engineering at Temple?," Temple University College of Engineering Open House, October 9, 2010
[5] Lee, C-H. “Advances in Chinese spoken language processing”, World Scientific Publishing Co., Singapore, 2007
[6] F.H. Liu, M. Picheny, P. Srinivasa, M. Monkowski, et al, “Speech Recognition on Mandarin Call Home: A Large Vocabulary, Conversational, and Telephone Speech Corpus” ICASSP, 1996
[7] I. Oparin, L. Lamel, J. Gauvain, “Improving Mandarin Chinese STT system with Random Forests language models”, IEEE Xplore, 2010
[8] “The History of Automatic Speech Recognition Evaluations at NIST,” 2009, http://www.itl.nist.gov/iad/mig/publications/ASRhistory/index.html
[9] Schwartz, R.; Colthurst, T.; Duta, N.; Gish, H.; Iyer, R.; Kao, C.-L.; Liu, D.; Kimball, O.; Ma, J.; Makhoul, J.; Matsoukas, S.; Nguyen, L.; Noamany, M.; Prasad, R.; Xiang, B.; Xu, D.-X.; Gauvain, J.-L.; Lamel, L.; Schwenk, H.; Adda, G.; Chen, L., "Speech recognition in multiple languages and domains: the 2003 BBN/LIMSI EARS system," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04), vol. 3, pp. iii-753-6, 17-21 May 2004, doi: 10.1109/ICASSP.2004.1326654
![Page 22: An Introduction to Mandarin Speech Recognition](https://reader036.fdocuments.in/reader036/viewer/2022062305/5681668a550346895dda4bd0/html5/thumbnails/22.jpg)
[10] Lee, L.S.; Tseng, C.Y.; Gu, H.Y.; Liu, F.H.; Chang, C.H.; Lin, Y.H.; Lee, Y.; Tu, S.L.; Hsieh, S.H.; Chen, C.H., "Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary," IEEE Transactions on Speech and Audio Processing, vol. 1, no. 2, pp. 158-179, Apr 1993
[11] Lin-Shan Lee; Keh-Jiann Chen; Chiu-Yu Tseng; Renyuan Lyu; Lee-Feng Chien; Hsin-Min Wang; Jia-Lin Shen; Sung-Chien Lin; Yen-Ju Yang; Bo-Ren Bai; Chi-Ping Nee; Chun-Yi Liao; Shueh-Sheng Lin; Chung-Shu Yang; I-Jung Hung; Ming-Yu Lee; Rei-Chang Wang; Bo-Shen Lin; Yuan-Cheng Chang; Rung-Chiung Yang; Yung-Chi Huang; Chen-Yuan Lou; Tung-Sheng Lin, "Golden Mandarin(II)-an intelligent Mandarin dictation machine for Chinese character input with adaptation/learning functions," Proceedings of the 1994 International Symposium on Speech, Image Processing and Neural Networks (ISSIPNN '94), vol. 1, pp. 155-159, 13-16 Apr 1994
[12] Ren-Yuan Lyu; Lee-Feng Chien; Shiao-Hong Hwang; Hung-Yun Hsieh; Rung-Chiuan Yang; Bo-Ren Bai; Jia-Chi Weng; Yen-Ju Yang; Shi-Wei Lin; Keh-Jiann Chen; Chiu-Yu Tseng; Lin-Shan Lee, "Golden Mandarin (III)-a user-adaptive prosodic-segment-based Mandarin dictation machine for Chinese language with very large vocabulary," Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing (ICASSP-95), vol. 1, pp. 57-60, 9-12 May 1995