ROBUST SPEECH RECOGNITION Richard...
Transcript of ROBUST SPEECH RECOGNITION Richard...
![Page 1: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/1.jpg)
ROBUST SPEECH RECOGNITION
Richard SternRobust Speech Recognition Group
Carnegie Mellon University
Telephone: (412) 268-2535Fax: (412) 268-3890
[email protected]://www.cs.cmu.edu/~rms
Short Course at Universidad Carlos IIIJuly 12-15, 2005
![Page 2: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/2.jpg)
September 17-21, 2006www.interspeech2006.org
![Page 3: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/3.jpg)
CarnegieMellon Slide 3 CMU Robust Speech Group
Robust speech recognition
As speech recognition is transferred from the laboratory to the marketplace robust recognition is becoming increasingly important
“Robustness” in 1985:
– Recognition in a quiet room using desktop microphones
Robustness in 2005:
– Recognition ….
» over a cell phone
» in a car
» with the windows down
» and the radio playing
» at highway speeds
![Page 4: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/4.jpg)
CarnegieMellon Slide 4 CMU Robust Speech Group
Speech in high noise (Navy F-18 flight line)
Speech in background music
Speech in background speech
Transient dropouts and noise
Spontaneous speech
Reverberated speech
Vocoded speech
Some of the hardest problems in speech recognition
![Page 5: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/5.jpg)
CarnegieMellon Slide 5 CMU Robust Speech Group
Outline of discussion
Summary of the state-of-the-art in speech technology at Carnegie Mellon and elsewhere
Review of fundamentals of speech recognition
Introduction to robust speech recognition: classical techniques
Robust speech recognition using missing-feature techniques
Use of multiple microphones for improved recognition accuracy
The future of robust recognition:
– Signal processing based on human auditory perception
– Computational auditory scene analysis
![Page 6: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/6.jpg)
CarnegieMellon Slide 6 CMU Robust Speech Group
Outline of discussion
Summary of the state-of-the-art in speech technology at Carnegie Mellon and elsewhere
Review of fundamentals of speech recognition
Introduction to robust speech recognition: classical techniques
Robust speech recognition using missing-feature techniques
Use of multiple microphones for improved recognition accuracy
The future of robust recognition:
– Signal processing based on human auditory perception
– Computational auditory scene analysis
![Page 7: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/7.jpg)
CarnegieMellon Slide 7 CMU Robust Speech Group
Introduction
Background:
– The technologies of speech recognition and text-to-speech synthesis have advanced rapidly over the last decade
– Nevertheless, there are relatively few commercially-practical speech-based applications being sold today
Goals of this talk:
– To summarize the present state of the art and future directions in speech technology
– To discuss key unsolved problems in transitioning laboratory technology to practical systems
– To describe and discuss several speech-based applications now under development at CMU and elsewhere
![Page 8: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/8.jpg)
CarnegieMellon Slide 8 CMU Robust Speech Group
Speech and language researchat Carnegie Mellon
Some facets of CMU’s ongoing core research:
– Large-vocabulary speech recognition
– Text-to-speech synthesis
– Spoken language understanding
– Conversational systems
– Machine translation
– Multi-modal integration
![Page 9: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/9.jpg)
CarnegieMellon Slide 9 CMU Robust Speech Group
Speech and language researchat Carnegie Mellon
Some application-focused efforts:
– The Communicator system (Alex Rudnicky)
– Informedia group (Howard Wactlar)
» Video on demand
– LISTEN group (Jack Mostow):
» Literacy training using speech input
– CALL group (Maxine Eskenazi):
» Foreign language training using speech input
– Wearable computer group (Dan Sieworiek/Alex Rudnicky)
![Page 10: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/10.jpg)
CarnegieMellon Slide 10 CMU Robust Speech Group
What we will discuss …
Core technology
– Automatic speech recognition
– Text-to-speech synthesis
Introductory comments on commercial applications
Information access through conversational systems
– CMU communicator and commercial information-access apps
Multi-media applications
– Informedia and LISTEN
User interface issues
– The Universal Speech Interface
Concluding remarks
![Page 11: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/11.jpg)
CarnegieMellon Slide 11 CMU Robust Speech Group
Speech recognition technology:accuracy is improving!
• But …. significant problems remain because of a lack of robustness
![Page 12: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/12.jpg)
CarnegieMellon Slide 12 CMU Robust Speech Group
Speech recognition at CMU
The SPHINX-III system (1996-present):
– “Unlimited” vocabulary in English and Spanish; smaller versions inSerbo-Croatian, French, Korean, and Haitian Creole
– ~60,000 words in unlimited-vocabulary language model
– Continuous or semi-continuous hidden Markov models
– Runs on Windows and Unix/Linux platforms
Sphinx-IV decoder in Java
– Funded by Sun, collaboration of CMU, Sun, MERL, HP, MIT
Code for both systems available in Open Source form
![Page 13: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/13.jpg)
CarnegieMellon Slide 13 CMU Robust Speech Group
Text-to-speech synthesis at Carnegie Mellon
Current TTS technology at CMU (and also AT&T, ATR, Microsoft, and elsewhere): synthesis based on concatention of selected recorded speech units
Major research issues and problems:
– Recording natural domain-appropriate databases with good phonetic coverage
– Joining units smoothly (currently units are selected based on F0, power, delta cepstra, with penalties for duration mismatch)
– Prosody and naturalness
![Page 14: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/14.jpg)
CarnegieMellon Slide 14 CMU Robust Speech Group
Personalized synthetic voices
Commercial voices from the Cepstral Corporation:
– David
– Linda
– Miguel
– Marta
Cepstral voices are also presently available in Canadian French, British English, and German, with other languages to follow
Sample voices developed at CMU:
– Rich
![Page 15: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/15.jpg)
CarnegieMellon Slide 15 CMU Robust Speech Group
Open source code for ASR and TTS available from Carnegie Mellon
http://www.speech.cs.cmu.edu/hephaestus.html
– ASR: Sphinx and SphinxTrain
– TTS: Festival, Festvox, FLITE
– Language factory: QuickLM, Pronounce, Condition
– Spoken language: CMU Communicator, SpeechLink, openvxi
http://mi.eng.cam.ac.uk/~prc14/toolkit.html
– Language modeling: CMU-Cambridge toolkit
http://speech.mty.itesm.mx/~jnolazco/proyectos.htm
– Sphinx-III in (American) Spanish
![Page 16: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/16.jpg)
CarnegieMellon Slide 16 CMU Robust Speech Group
CMU TTS resources availablein Open Source
Festival
– General multilingual speech synthesis engine (from the University of Edinburgh)
Festvox
– Tools for creating synthetic voices
FLITE
– Fast synthesis for embedded engines
![Page 17: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/17.jpg)
CarnegieMellon Slide 17 CMU Robust Speech Group
What kinds of speech applications are available now?
Dictation systems:
– Large vocabulary and speaker-adaptive, with adaptable vocabularies and grammars
Command-and-control systems:
– Voice control of operating system and applications
– Part of infrastructure of Windows XP and Mac OSX
Information-access systems:
– Frequently conversational in nature
– Frequently involve telephone access (including cell phones)
Data entry using handheld terminals and simple wearable systems
Primitive translation systems
![Page 18: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/18.jpg)
CarnegieMellon Slide 18 CMU Robust Speech Group
Command and control ofoperating systems and applications
Some attributes of current systems:
– Voice commands can begin to replace the mouse and keyboard
– Limited vocabulary based on which window is in focus or based on user state
– Probably will ultimately be a complement rather than a replacement for the keyboard and mouse
![Page 19: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/19.jpg)
CarnegieMellon Slide 19 CMU Robust Speech Group
QuickTime™ and aVideo decompressor
are needed to see this picture.
An (old) example of command and control in a commercial product
Dragon systems demo (circa 1998):
![Page 20: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/20.jpg)
CarnegieMellon Slide 20 CMU Robust Speech Group
Information access throughspoken language systems
What is a spoken language system?
Some attributes:
– Voice input and output
– Intelligent interaction with a database to solve real problems
Some domains that have been studied:
– Travel planning, orientation, navigation
– General information retrieval
– General provision of advice
Comments:
– A “marriage” of speech recognition and natural language processing
– Major goal: to develop voice systems that users will prefer over keyboard-driven systems
![Page 21: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/21.jpg)
CarnegieMellon Slide 21 CMU Robust Speech Group
Conversational systems:the CMU Communicator
Mixed-initiative interaction
– Both the user and computer can initiate action and clarification
User and task modeling
– User preferences and defaults
– Understanding of the semantics of the underlying task
Dialog scripting
– Knowledge of user goals and subgoals
– Dynamic modification of lexicon and grammar based on dialog context
– Guidance of user through planning procedures
Task analysis and domain knowledge needed for successful system development
![Page 22: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/22.jpg)
CarnegieMellon Slide 22 CMU Robust Speech Group
The CMU Communicator
QuickTime™ and aDV/DVCPRO - NTSC decompressor
are needed to see this picture.
![Page 23: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/23.jpg)
CarnegieMellon Slide 23 CMU Robust Speech Group
Examples of commercialspoken language systems
Reservations on United Airlines (ScanSoft)
Health care patient eligibility verification (Nuance)
BeanTown Navigation on Nokia 3650 phone (ScanSoft)
![Page 24: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/24.jpg)
CarnegieMellon Slide 24 CMU Robust Speech Group
CHALLENGES FOR CONVERSATIONAL SYSTEMS
Recognition of spontaneous speech
Adaptation and learning at all levels
– Acoustic
– Lexical
– Semantic
– Task domain
– Environment
Domain awareness for both users and machines
Training with very little data
Establishing the right balance of initiative between user and system
Development of toolkits for new applications
![Page 25: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/25.jpg)
Multimedia
Content Classification,Retrieval, and Protection
Audio-Visual SpeechRecognition;
Scene Analysis/Synthesis
Audio/Speech
Image/Video
Text
TranslationNatural Language Proc.
Coding and Processing
Analysis, Codingand Representation
Speech RecognitionSpeech Synthesis
THE CHALLENGE OF MULTIMEDIA
![Page 26: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/26.jpg)
CarnegieMellon Slide 26 CMU Robust Speech Group
InformediaTM: News on Demand
Motivation:
Full-motion video is the most compelling presentation medium for display, training, and information access
Video is the most difficult medium for browsing and searching
Spoken language interface enables anyone to
– retrieve desired information ...
– using natural fluent speech ...
– with no special training
![Page 27: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/27.jpg)
CarnegieMellon Slide 27 CMU Robust Speech Group
InformediaTM: News on Demand
The original Informedia system included:
Unlimited-vocabulary spoken language interface
Real-time MPEG video playback
Totally automatic indexing ...
– based on text captioning for television news
– based on speech recognition for public radio broadcasts
Browsing capability
Automatic indexing based on speech recognition ultimately could be extended to all digital video libraries.
![Page 28: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/28.jpg)
CarnegieMellon Slide 28 CMU Robust Speech Group
QuickTime™ and aDV/DVCPRO - NTSC decompressor
are needed to see this picture.
The original Informedia system (~1997)
For more information: www.informedia.cs.cmu.edu
![Page 29: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/29.jpg)
CarnegieMellon Slide 29 CMU Robust Speech Group
Speech is used to create transcripts and to align video to transcripts for indexing
QuickTime™ and aDV/DVCPRO - NTSC decompressor
are needed to see this picture.
Informedia today
![Page 30: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/30.jpg)
CarnegieMellon Slide 30 CMU Robust Speech Group
ASR accuracy depends on speaking style and the environment
CMU recognition error rates in transcription of Broadcast News TV and radio news broadcasts (1997 DARPA evaluations)
Prepared studio speech 15.5%
Spontaneous studio speech 22.8%
Telephone and similar channels 32.2%
Background music 33.4%
Background noise 30.8%
Non-native speakers 33.0%
OVERALL AVERAGE 24.0%
![Page 31: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/31.jpg)
CarnegieMellon Slide 31 CMU Robust Speech Group
Another multimedia application:the LISTEN Reading Tutor
Using speech to help children and adults learn to read:
Students read from prepared texts
Computer listens, detects mistakes, and applies “helpful” feedback
Many interesting issue in both speech recognition and application design
![Page 32: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/32.jpg)
CarnegieMellon Slide 32 CMU Robust Speech Group
The CMU LISTEN Project
For more information: http://www.cs.cmu.edu/~listen
QuickTime™ and aYUV420 codec decompressor
are needed to see this picture.
![Page 33: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/33.jpg)
CarnegieMellon Slide 33 CMU Robust Speech Group
Speech recognitionon handheld teminals
Some characteristics:
– Noisy environment
– Limited computation and memory
– Terminals generally operated by single user
Some additional attributes of mobile phones:
– Power available only for limited periods of time
– High cost sensitivity
– Operate in multi-lingual environment and under coding
![Page 34: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/34.jpg)
CarnegieMellon Slide 34 CMU Robust Speech Group
One approach to application design:The Universal Speech Interface
Goals of the Universal Speech Interface:
Do for speech what Graffiti™ has done for mobile text entry
– semi-natural language: man, machine meet halfway
– 5 minute training, via interactive tutorial
Do for speech what the Macintosh look-and-feel has done for GUIs
– a universal look-and-feel (rather, “sound-and-feel”) across all applications
![Page 35: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/35.jpg)
CarnegieMellon Slide 35 CMU Robust Speech Group
For more info: http://www.cs.cmu.edu/~usi
QuickTime™ and aDV/DVCPRO - NTSC decompressor
are needed to see this picture.
The CMUUniversal Speech Interface
![Page 36: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/36.jpg)
CarnegieMellon Slide 36 CMU Robust Speech Group
So why hasn’t speech technologydeveloped faster?
(Or why haven’t we yet developed the “killer app” for speech input and output?)
Even though core recognition has improved, we still need...
Greater robustness ....
– To speakers and dialects
– To the effects of unknown noise and filtering
– To vocoded speech and telephone channels
Automatic adaptation to out-of-domain input:
– New words, syntax, and semantic concepts
Improved human-computer interfaces
Lower cost?
![Page 37: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/37.jpg)
CarnegieMellon Slide 37 CMU Robust Speech Group
Summary: what’s going on now?
Core speech recognition technology has improved greatly over the last decade and is now usable if deployed with care, but ...…
Current speech systems remain fragile to
– environmental degradation (including interfering sources, filtering and nonlineardistoration)
– spontaneous and disfluent speech
– out-of-vocabulary utterances, unusual syntax, and other unexpected types of input
Spoken language systems for information access has taken hold, but conversational systems are limited by recognition accuracy and application design
Automatic detection and assimilation of new words and concepts remains extremely difficult
![Page 38: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/38.jpg)
CarnegieMellon Slide 38 CMU Robust Speech Group
Summary: what are someinteresting trends to watch for?
Greater commercial success as we conquer the major problems of
– robustness and adaptation for ASR
– effective portable application design
Greater emphasis on multi-media applications in which speech is one of several input/output modalities
Greater diffusion of speech-baased education and training applications
Continued search for the right way to integrate speech, keyboard, and mouse in the OS
An “interesting” period in which central servers and handsets both compete with board-level products as the site for recognition and related processing
![Page 39: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/39.jpg)
CarnegieMellon Slide 39 CMU Robust Speech Group
Outline of discussion
Summary of the state-of-the-art in speech technology at Carnegie Mellon and elsewhere
Review of fundamentals of speech recognition
Introduction to robust speech recognition: classical techniques
Robust speech recognition using missing-feature techniques
Use of multiple microphones for improved recognition accuracy
The future of robust recognition:
– Signal processing based on human auditory perception
– Computational auditory scene analysis
![Page 40: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/40.jpg)
![Page 41: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/41.jpg)
CarnegieMellon Slide 41 CMU Robust Speech Group
The source-filter model of speech production
A useful model for representing the generation of speech sounds:
Pitch
Pulse train source
Noise source
Vocal tract model
Amplitude
p[n]
![Page 42: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/42.jpg)
CarnegieMellon Slide 42 CMU Robust Speech Group
THE ACOUSTIC THEORY OF SPEECH PRODUCTION: MODELING THE VOCAL TRACT
The sound pressure at a distance r is determined by
– Spectrum of the excitation signal
– Configuration of throat, jaw, tongue, lips, teeth, etc.
– Loading effect of air
![Page 43: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/43.jpg)
CarnegieMellon Slide 43 CMU Robust Speech Group
Unvoiced speech sources
Turbulent voicing sources are approximately flat in frequency:
![Page 44: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/44.jpg)
CarnegieMellon Slide 44 CMU Robust Speech Group
Voiced speech sources
Glottal pulses have a spectrum that decreases with the square of frequency:
![Page 45: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/45.jpg)
CarnegieMellon Slide 45 CMU Robust Speech Group
Frequency response:
x = 0x = −Glottis Lips
Sound propagation in a uniform tube
0 100 200 300 400 500 600 700 800 900 10000
5
10
15
20
25
30
35
40
45
50
(assuming idealperfectly reflective walls)
![Page 46: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/46.jpg)
CarnegieMellon Slide 46 CMU Robust Speech Group
Vowel production in the vocal tract
![Page 47: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/47.jpg)
CarnegieMellon Slide 47 CMU Robust Speech Group
Comment: Resonant frequencies now non-uniform
x = 0x = −Glottis Lips
A more realistic model of sound production
![Page 48: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/48.jpg)
CarnegieMellon Slide 48 CMU Robust Speech Group
Some example vowels
![Page 49: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/49.jpg)
CarnegieMellon Slide 49 CMU Robust Speech Group
Vowel perception and formant frequencies� � � �
![Page 50: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/50.jpg)
CarnegieMellon Slide 50 CMU Robust Speech Group
Context dependencies in speech production
Spectral patterns that form /di/ and /du/:
![Page 51: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/51.jpg)
CarnegieMellon Slide 51 CMU Robust Speech Group
The source-filter model of speech production
A useful model for representing the generation of speech sounds:
Pitch
Pulse train source
Noise source
Vocal tract model
Amplitude
p[n]
![Page 52: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/52.jpg)
CarnegieMellon Slide 52 CMU Robust Speech Group
The speech spectrogram
Time
Fre
quen
cy
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
![Page 53: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/53.jpg)
CarnegieMellon Slide 53 CMU Robust Speech Group
Separating the vocal-tract excitation from the filter
Original speech:
Speech with 75-Hz excitation:
Speech with 150-Hz excitation:
Speech with noise excitation:
![Page 54: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/54.jpg)
CarnegieMellon Slide 54 CMU Robust Speech Group
Summary: elements of speech production
We have discussed very superficially the production of speech sounds
– Source-filter model
– Vocal tract transfer functions
– Impact on perception
The source filter model is used
– As a way to model how we produce speech sounds
– As a way to reduce the number of parameters needed to characterize speech sounds
– As a way of extracting features that are used by speech recognition systems
![Page 55: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/55.jpg)
![Page 56: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/56.jpg)
CarnegieMellon Slide 56 CMU Robust Speech Group
Outline of discussion
Basic mechanisms of speech production
Basic mechanisms of auditory perception
(Very!) basic review of automatic speech recognition
Conventional signal processing for speech recognition
Signal processing for improved speech recognition
Signal processing for improved sound source separation
![Page 57: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/57.jpg)
CarnegieMellon Slide 57 CMU Robust Speech Group
OVERVIEW OF SPEECH RECOGNITION
Major functional components:
– Signal processing to extract features from speech waveforms
– Comparison of features to pre-stored templates
Important design choices:
– Choice of features
– Specific method of comparing features to stored templates
Feature extractionDecision making
procedure
Speech features
Phonemehypotheses
![Page 58: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/58.jpg)
CarnegieMellon Slide 58 CMU Robust Speech Group
GOALS OF SPEECH REPRESENTATIONS
Capture important phonetic information in speech
Computational efficiency
Efficiency in storage requirements
Optimize generalization
![Page 59: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/59.jpg)
CarnegieMellon Slide 59 CMU Robust Speech Group
WHY PERFORM SIGNAL PROCESSING?
A look at the time-domain waveform of “six”:
It’s hard to infer much from the time-domain waveform
![Page 60: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/60.jpg)
CarnegieMellon Slide 60 CMU Robust Speech Group
WHY PERFORM SIGNAL PROCESSING IN THE FREQUENCY DOMAIN?
Human hearing is based on frequency analysis
Use of frequency analysis often simplifies signal processing
Use of frequency analysis often facilitates understanding
![Page 61: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/61.jpg)
CarnegieMellon Slide 61 CMU Robust Speech Group
FEATURES FOR SPEECH RECOGNITION: CEPSTRAL COEFFICIENTS
The cepstrum is the inverse transform of the log of the magnitude of the spectrum
Useful for separating convolved signals (like the source and filter in the speech production model)
Can be thought of as the Fourier series expansion of the magnitude of the Fourier transform
Generally provides more efficient and robust coding of speech information than LPC coefficients
Most common basic feature for speech recognition
![Page 62: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/62.jpg)
CarnegieMellon Slide 62 CMU Robust Speech Group
THREE WAYS OF DERIVING CEPSTRAL COEFFIENTS
LPC-derived cepstral coefficients (LPCC):
– Compute “traditional” LPC coefficients
– Convert to cepstra using linear transformation
– Warp cepstra using bilinear transform
Mel-frequency cepstral coefficients (MFCC):
– Compute log magnitude of windowed signal
– Multiply by triangular Mel weighting functions
– Compute inverse discrete cosine transform
Perceptual linear prediction (PLP)
![Page 63: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/63.jpg)
CarnegieMellon Slide 63 CMU Robust Speech Group
COMPUTING CEPSTRAL COEFFICIENTS
Comments:
– MFCC is currently the most popular representation.
– Typical systems include a combination of
» MFCC coefficients
» “Delta” MFCC coefficients
» “Delta delta” MFCC coefficients
» Power and delta power coefficients
![Page 64: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/64.jpg)
CarnegieMellon Slide 64 CMU Robust Speech Group
COMPUTING LPC CEPSTRAL COEFFICIENTS
Procedure used in SPHINX-I:
– A/D conversion at 16-kHz sampling rate
– Apply Hamming window, duration 320 samples (20 msec) with 50% overlap (100-Hz frame rate)
– Pre-emphasize to boost high-frequency components
– Compute first 14 auto-correlation coefficients
– Perform Levinson-Durbin recursion to obtain 14 LPC coefficients
– Convert LPC coefficients to cepstral coefficients
– Perform frequency warping to spread low frequencies
– Apply vector quantization to generate three codebooks
![Page 65: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/65.jpg)
CarnegieMellon Slide 65 CMU Robust Speech Group
An example: the vowel in “welcome”
The original time function:
![Page 66: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/66.jpg)
CarnegieMellon Slide 66 CMU Robust Speech Group
THE TIME FUNCTION AFTER WINDOWING
![Page 67: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/67.jpg)
CarnegieMellon Slide 67 CMU Robust Speech Group
THE RAW SPECTRUM
![Page 68: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/68.jpg)
CarnegieMellon Slide 68 CMU Robust Speech Group
PRE-EMPHASIZING THE SIGNAL
Typical pre-emphasis filter:
Its frequency response:
y[n] = x[n]− .96x[n −1]
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
20
40
60
80
Normalized frequency (Nyquist == 1)
Pha
se (
degr
ees)
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1-30
-20
-10
0
10
Normalized frequency (Nyquist == 1)
Mag
nitu
de R
espo
nse
(dB
)
![Page 69: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/69.jpg)
CarnegieMellon Slide 69 CMU Robust Speech Group
THE SPECTRUM OF THE PRE-EMPHASIZED SIGNAL
![Page 70: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/70.jpg)
CarnegieMellon Slide 70 CMU Robust Speech Group
THE LPC SPECTRUM
![Page 71: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/71.jpg)
CarnegieMellon Slide 71 CMU Robust Speech Group
THE TRANSFORM OF THE CEPSTRAL COEFFICIENTS
![Page 72: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/72.jpg)
CarnegieMellon Slide 72 CMU Robust Speech Group
THE BIG PICTURE: THE ORIGINAL SPECTROGRAM
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
![Page 73: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/73.jpg)
CarnegieMellon Slide 73 CMU Robust Speech Group
EFFECTS OF LPC PROCESSING
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
![Page 74: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/74.jpg)
CarnegieMellon Slide 74 CMU Robust Speech Group
COMPARING REPRESENTATIONS
ORIGINAL SPEECH LPCC CEPSTRA (unwarped)
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
![Page 75: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/75.jpg)
CarnegieMellon Slide 75 CMU Robust Speech Group
COMPUTING MEL FREQUENCY CEPSTRAL COEFFICIENTS
Segment incoming waveform into frames
Compute frequency response for each frame using DFTs
Multiply magnitude of frequency response by triangular weighting functions to produce 25-40 channels
Compute log of weighted magnitudes for each channel
Take inverse discrete cosine transform (DCT) of weighted magnitudes for each channel, producing ~14 cepstral coefficients for each frame
Calculate delta and double-delta coefficients
![Page 76: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/76.jpg)
CarnegieMellon Slide 76 CMU Robust Speech Group
AN EXAMPLE: DERIVING MFCC coefficients
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
6000
7000
8000
![Page 77: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/77.jpg)
CarnegieMellon Slide 77 CMU Robust Speech Group
THE MEL WEIGHTING FUNCTIONS
0 1000 2000 3000 4000 5000 6000 7000 80000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
![Page 78: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/78.jpg)
CarnegieMellon Slide 78 CMU Robust Speech Group
THE LOG ENERGIES OF THE MEL FILTER OUTPUTS
0 0.2 0.4 0.6 0.8 1 1.20
5
10
15
20
25
30
35
![Page 79: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/79.jpg)
CarnegieMellon Slide 79 CMU Robust Speech Group
THE CEPSTRAL COEFFICIENTS
0 0.2 0.4 0.6 0.8 1 1.2
0
2
4
6
8
10
12
![Page 80: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/80.jpg)
CarnegieMellon Slide 80 CMU Robust Speech Group
LOGSPECTRA RECOVERED FROM CEPSTRA
0 0.2 0.4 0.6 0.8 1 1.20
5
10
15
20
25
30
35
![Page 81: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/81.jpg)
CarnegieMellon Slide 81 CMU Robust Speech Group
COMPARING SPECTRAL REPRESENTATIONS
ORIGINAL SPEECH MEL LOG MAGS AFTER CEPSTRA
0 0.2 0.4 0.6 0.8 1 1.20
1000
2000
3000
4000
5000
6000
7000
8000
0 0.2 0.4 0.6 0.8 1 1.20
5
10
15
20
25
30
35
0 0.2 0.4 0.6 0.8 1 1.20
5
10
15
20
25
30
35
![Page 82: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/82.jpg)
CarnegieMellon Slide 82 CMU Robust Speech Group
Comments on the MFCC representation
It’s very “blurry” compared to a wideband spectrogram!
Aspects of auditory processing represented:
– Frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)
» Wavelet schemes exploit time-frequency resolution better
– Nonlinear amplitude response
Aspects of auditory processing NOT represented:
– Detailed timing structure
– Lateral suppression
– Enhancement of temporal contrast
– Other auditory nonlinearities
![Page 83: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/83.jpg)
CarnegieMellon Slide 83 CMU Robust Speech Group
Outline of discussion
Basic mechanisms of speech production
Basic mechanisms of auditory perception
(Very!) basic review of automatic speech recognition
Conventional signal processing for speech recognition
Signal processing for improved speech recognition
Signal processing for improved sound source separation
![Page 84: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/84.jpg)
CarnegieMellon Slide 84 CMU Robust Speech Group
Speech representation using mean rate
Representation of vowels by Young and Sachs using mean rate:
Mean rate representation does not preserve spectral information
![Page 85: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/85.jpg)
CarnegieMellon Slide 85 CMU Robust Speech Group
Speech representation using average localized synchrony measure
Representation of vowels by Young and Sachs using ALSR:
![Page 86: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/86.jpg)
CarnegieMellon Slide 86 CMU Robust Speech Group
The importance of timing information
Re-analysis of Young-Sachs data by Searle:
Temporal processing captures dominant formants in a spectral region
![Page 87: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/87.jpg)
CarnegieMellon Slide 87 CMU Robust Speech Group
Paths to the realization of temporal fine structure in speech
Correlograms (Slaney and Lyon)
Computations based on interval processing
– Ghitza’s Ensemble Interval Histogram (EIH) model
– Kim’s Zero Crossing Peak Analysis (ZCPA) model
![Page 88: ROBUST SPEECH RECOGNITION Richard Sterntsc.uc3m.es/~fdiaz/docencia/Seminario_R_Stern/Carlos3_05...Carnegie Mellon Slide 4 CMU Robust Speech Group Speech in high noise (Navy F-18 flight](https://reader034.fdocuments.in/reader034/viewer/2022051900/5feeb92f61481d371960ffb0/html5/thumbnails/88.jpg)