
Speech recognition
From Wikipedia, the free encyclopedia


Speech recognition (in many contexts also known as automatic speech recognition, computer speech recognition or erroneously as voice recognition) is the process of converting a speech signal to a sequence of words, by means of an algorithm implemented as a computer program.

Speech recognition applications that have emerged over the last few years include voice dialing (e.g., "Call home"), call routing (e.g., "I would like to make a collect call"), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), domotic appliance control, and content-based spoken audio search (e.g., finding a podcast where particular words were spoken).

Voice recognition or speaker recognition is a related process that attempts to identify the person speaking, as opposed to what is being said.

Contents

1 Speech recognition technology
2 Performance of speech recognition systems
   2.1 Hidden Markov model (HMM)-based speech recognition
   2.2 Neural network-based speech recognition
   2.3 Dynamic time warping (DTW)-based speech recognition
3 Speech recognition patents and patent disputes
4 For further information
5 Applications of speech recognition
6 Microphone
7 See also
8 References
9 Books
10 External links

[edit] Speech recognition technology

In terms of technology, most technical textbooks nowadays emphasize the use of hidden Markov models as the underlying technology. The dynamic programming approach, the neural network-based approach and the knowledge-based learning approach were studied intensively in the 1980s and 1990s.


[edit] Performance of speech recognition systems

The performance of a speech recognition system is usually specified in terms of accuracy and speed. Accuracy is measured with the word error rate, whereas speed is measured with the real time factor.

Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. Much of the confusion about reported performance comes from the mixed usage of the terms "speech recognition" and "dictation".

Speaker-dependent dictation systems requiring a short period of training can capture continuous speech with a large vocabulary at normal pace with very high accuracy. Most commercial companies claim that recognition software can achieve between 98% and 99% accuracy (getting one to two words out of one hundred wrong) if operated under optimal conditions. These optimal conditions usually mean that the test subjects have 1) speaker characteristics that match the training data, 2) proper speaker adaptation, and 3) a clean environment (e.g. office space). (This explains why some users, especially those whose speech is heavily accented, might actually perceive the recognition rate to be much lower than the expected 98% to 99%.)

Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations.

Both acoustic modeling and language modeling are important parts of modern statistical speech recognition. In this entry, we will focus on the hidden Markov model (HMM), because it is very widely used in many systems. (Language modeling has many other applications, such as smart keyboards and document classification; see the corresponding entries.)

Carnegie Mellon University has made good progress in increasing the speed of speech recognition chips by using ASICs (application-specific integrated circuits) and reconfigurable chips called FPGAs (field-programmable gate arrays).[1]

[edit] Hidden Markov model (HMM)-based speech recognition

Modern general-purpose speech recognition systems are generally based on hidden Markov models (HMMs). An HMM is a statistical model which outputs a sequence of symbols or quantities. One possible reason why HMMs are used in speech recognition is that a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal: over a short time window, on the order of 10 milliseconds, speech can be approximated as a stationary process. Speech can thus be thought of as a Markov model over many stochastic processes (known as states).

Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use. In speech recognition, to give the very simplest setup possible, the hidden Markov model would output a sequence of n-dimensional real-valued vectors with n around, say, 13, outputting one of these every 10 milliseconds. The vectors, again in the very simplest case, would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short-time window of speech, decorrelating the spectrum using a cosine transform, and then taking the first (most significant) coefficients. The hidden Markov model will tend to have, in each state, a statistical distribution called a mixture of diagonal-covariance Gaussians which gives a likelihood for each observed vector. Each word, or (for more general speech recognition systems) each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individually trained hidden Markov models for the separate words and phonemes.
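As an illustration, here is a minimal sketch in Python of the cepstral front end just described. It is a simplification: real systems typically apply a mel-scale filterbank before the cosine transform (yielding MFCCs), and the frame length, hop size, and coefficient count used here are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_coeffs=13):
    """Sketch of cepstral feature extraction: short-time window, Fourier
    transform, log magnitude spectrum, then a cosine transform (DCT) to
    decorrelate; keep the first (most significant) n_coeffs coefficients."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window
        log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)  # avoid log(0)
        cepstrum = dct(log_spectrum, type=2, norm='ortho')
        features.append(cepstrum[:n_coeffs])
    return np.array(features)  # one 13-dimensional vector every 10 ms
```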

Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over this basic approach. A typical large-vocabulary system would need context dependency for the phones (so phones with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
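The delta and delta-delta coefficients mentioned above are simple to compute: they are regression-based estimates of the first and second time derivatives of the features. A sketch follows; the regression window of two frames is a common but assumed choice.

```python
import numpy as np

def add_deltas(features, window=2):
    """Append delta (first derivative) and delta-delta (second derivative)
    coefficients, estimated by linear regression over +/- `window` frames."""
    def delta(feat):
        n = len(feat)
        padded = np.pad(feat, ((window, window), (0, 0)), mode='edge')
        denom = 2 * sum(k * k for k in range(1, window + 1))
        return sum(k * (padded[window + k:n + window + k]
                        - padded[window - k:n + window - k])
                   for k in range(1, window + 1)) / denom
    d = delta(features)
    return np.concatenate([features, d, delta(d)], axis=1)  # e.g. 13 -> 39 dims
```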

Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
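A minimal sketch of Viterbi decoding over such a combined model follows, in log space for numerical stability; the state space and score matrices are assumed to have been built beforehand from the acoustic and language models.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Find the most likely HMM state path.
    log_init:  (S,)   log initial-state probabilities
    log_trans: (S, S) log transition probabilities (from-state x to-state)
    log_obs:   (T, S) per-frame log likelihoods of each state"""
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans          # best predecessor per state
        backptr[t] = np.argmax(cand, axis=0)
        score = cand[backptr[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(score))]                 # backtrace the best path
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```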

[edit] Neural network-based speech recognition

Another approach in acoustic modeling is the use of neural networks. They are capable of solving much more complicated recognition tasks, but do not scale as well as HMMs when it comes to large vocabularies. Rather than being used in general-purpose speech recognition applications, they are used where their ability to handle low-quality, noisy data and speaker independence pays off. Such systems can achieve greater accuracy than HMM-based systems, as long as there is training data and the vocabulary is limited. A more general approach using neural networks is phoneme recognition. This is an active field of research, but generally the results are better than for HMMs. There are also NN-HMM hybrid systems that use the neural network part for phoneme recognition and the hidden Markov model part for language modeling.
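As a sketch of how the neural network side of such a hybrid might look, the following toy network maps each acoustic frame to phoneme posterior probabilities. The layer sizes and weights are assumptions and training is omitted; in hybrid systems these posteriors are typically divided by phoneme priors to obtain scaled likelihoods for the HMM decoder.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def frame_phoneme_posteriors(frames, W1, b1, W2, b2):
    """One-hidden-layer network: each row of `frames` is an acoustic feature
    vector; the corresponding output row is P(phoneme | frame)."""
    hidden = np.tanh(frames @ W1 + b1)
    return softmax(hidden @ W2 + b2)
```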

[edit] Dynamic time warping (DTW)-based speech recognition

Main article: Dynamic time warping

Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another they were walking more quickly, or even if there were accelerations and decelerations during the course of one observation. DTW has been applied to video, audio, and graphics -- indeed, any data which can be turned into a linear representation can be analyzed with DTW.

A well known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences (e.g. time series) with certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models.
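A minimal sketch of the DTW recurrence follows. In a word recognizer of the older template-matching style, an input utterance would be compared this way against a stored template for each vocabulary word, and the word with the lowest alignment cost would win; the Euclidean frame distance is an assumed choice.

```python
import numpy as np

def dtw_distance(a, b):
    """Cost of the best non-linear (warped) alignment between feature
    sequences a (length n) and b (length m)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(np.asarray(a[i - 1]) - np.asarray(b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],      # stretch a
                                 cost[i, j - 1],      # stretch b
                                 cost[i - 1, j - 1])  # step both
    return cost[n, m]
```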

[edit] Speech recognition patents and patent disputes

Microsoft and Alcatel-Lucent hold patents in speech recognition, and are in dispute as of March 2, 2007.[2]


[edit] For further information

Popular speech recognition conferences held each year or two include ICASSP, Eurospeech/ICSLP (now named Interspeech) and the IEEE ASRU. Conferences in the field of natural language processing, such as ACL, NAACL, EMNLP, and HLT, are beginning to include papers on speech processing. Important journals include the IEEE Transactions on Speech and Audio Processing (now named IEEE Transactions on Audio, Speech and Language Processing), Computer Speech and Language, and Speech Communication. Books like "Fundamentals of Speech Recognition" by Lawrence Rabiner can be useful to acquire basic knowledge but may not be fully up to date (1993). Another good source is "Statistical Methods for Speech Recognition" by Frederick Jelinek, a more up-to-date book (1998). A good insight into the techniques used in the best modern systems can be gained by paying attention to government-sponsored competitions such as those organised by DARPA (the largest speech recognition-related project ongoing as of 2007 is the GALE project, which involves both speech recognition and translation components).

In terms of freely available resources, the HTK book (and the accompanying HTK toolkit) is one place to start to both learn about speech recognition and to start experimenting. Another such resource is Carnegie Mellon University's SPHINX toolkit.

[edit] Applications of speech recognition

Automatic translation
Automotive speech recognition
Dictation
Hands-free computing: voice command recognition computer user interface
Home automation
Interactive voice response
Medical transcription
Mobile telephony
Pronunciation evaluation in computer-aided language learning applications[1]
Robotics

[edit] Microphone

The microphone type recommended for speech recognition is the array microphone.


[edit] See also

Audio-visual speech recognition
Cockpit (aviation) (also termed Direct Voice Input)
Keyword spotting
List of speech recognition projects
Microphone
Speech analytics
Speaker identification
Speech processing
Speech synthesis
Speech verification
Text-to-speech (TTS)
VoiceXML
Acoustic model
Speech corpus

[edit] References


"Survey of the State of the Art in Human Language Technology (1997) by Ron Cole et all"

1. ^ Dennis van der Heijden. "Computer Chips to Enhance Speech Recognition", Axistive.com, 2003-10-06.

2. ^ Roger Cheng and Carmen Fleetwood. "Judge dismisses Lucent patent suit against Microsoft", Wall Street Journal, 2007-03-02.

[edit] Books

Multilingual Speech Processing, edited by Tanja Schultz and Katrin Kirchhoff, April 2006. Researchers and developers in industry and academia with different backgrounds but a common interest in multilingual speech processing will find an excellent overview of research problems and solutions detailed from theoretical and practical perspectives. Contents: CH 1: Introduction / CH 2: Language Characteristics / CH 3: Linguistic Data Resources / CH 4: Multilingual Acoustic Modeling / CH 5: Multilingual Dictionaries / CH 6: Multilingual Language Modeling / CH 7: Multilingual Speech Synthesis / CH 8: Automatic Language Identification / CH 9: Other Challenges

[edit] External links

NIST Speech Group
How to install and configure speech recognition in Windows
Entropic/Cambridge Hidden Markov Model Toolkit
Open CV library, especially the multi-stream speech and vision combination programs
LT-World: Portal to information and resources on the internet
LDC – The Linguistic Data Consortium
Evaluations and Language resources Distribution Agency
OLAC – Open Language Archives Community
BAS – Bavarian Archive for Speech Signals
Think-A-Move – Speech and Tongue Control of Robots and Wheelchairs

Audio-visual speech recognition
From Wikipedia, the free encyclopedia


Audio-visual speech recognition (AVSR) is a technique that uses image processing of the speaker's lip movements (lip reading) to aid speech recognition systems in recognizing acoustically ambiguous phones, or in deciding among hypotheses with near-equal probabilities.


The lip reading and speech recognition subsystems work separately; their results are then combined at the feature fusion stage.

[edit] External links

IBM Research - Audio Visual Speech Technologies


1.2: Speech Recognition


Victor Zue, Ron Cole, & Wayne Ward
MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA
Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in a later section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the table below. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies, and is much more difficult to recognize than speech read from script. Some systems require speaker enrollment---a user must provide samples of his or her speech before using them, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.
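As a toy illustration (the vocabulary and transitions are invented for the example), such a finite-state network can be written down directly as a table of permissible successors:

```python
# Hypothetical finite-state network for a tiny voice-dialing task:
# after each word, only the listed words may follow.
network = {
    "<s>":    ["call", "dial"],
    "call":   ["home", "office"],
    "dial":   ["home", "office"],
    "home":   ["</s>"],
    "office": ["</s>"],
}

def allowed_next(previous_word, candidate):
    """The recognizer consults the network to prune impossible sequences."""
    return candidate in network.get(previous_word, [])
```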

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see the section on language modeling for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.
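Concretely, perplexity can be computed as the inverse probability of a test sequence, normalized by its length. The sketch below uses the standard definition, not anything specific to one system:

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence from the per-word probabilities
    P(w_i | history) assigned by a language model:
    PP = exp(-(1/N) * sum(log P)), the geometric mean of 1/P."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that always spreads its probability evenly over 10 successor
# words assigns P = 0.1 to each word, so the perplexity is 10:
assert round(perplexity([0.1] * 5), 6) == 10.0
```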


Table: Typical parameters used to characterize the capability of speech recognition systems

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme /t/ in two, true, and butter in American English. At word boundaries, contextual variations can be quite dramatic---making gas shortage sound like gash shortage in American English, and devo andare sound like devandare in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

The figure below shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10--20 msec (see the sections on signal representation and digital signal processing). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure: Components of a typical speech recognition system.


Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics [Her90]. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use. Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context dependent acoustic modeling.

Word level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.
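As a toy illustration, a pronunciation network can be represented as alternate phoneme sequences per word; the words and phoneme symbols below are invented for the example, and the search is free to follow whichever path scores best:

```python
# Hypothetical pronunciation network: each word maps to its alternate
# phoneme paths, covering dialect and accent variants.
pronunciations = {
    "either": [["iy", "dh", "er"],   # "ee-ther"
               ["ay", "dh", "er"]],  # "eye-ther"
    "tomato": [["t", "ah", "m", "ey", "t", "ow"],
               ["t", "ah", "m", "aa", "t", "ow"]],
}
```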

The dominant recognition paradigm in the past fifteen years is known as hidden Markov models (HMMs). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in section 11.2. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described in section 11.5.

An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks [ZGPS90,FBC95].

1.2.2 State of the Art

Comments about the state-of-the-art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error rate, E, defined as:

E = (S + I + D) / N × 100%

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.
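In practice, S, I, and D are obtained by aligning the hypothesis against the reference with minimum edit distance; a compact sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate, E = (S + I + D) / N * 100%, computed via the
    minimum edit distance between reference and hypothesis word strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)

# word_error_rate("call home now", "call phone now") -> 33.33 (one substitution)
```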

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13 respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware---a feat unimaginable only a few years ago.


One of the most popular, and potentially most useful tasks with low perplexity (PP=11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent word error rate on the RM task is less than 4%, using a word-pair language model that constrains the possible words following a given word (PP=60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity, speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news [PFF 94].

With the steady improvements in speech recognition performance, systems are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch tone penetration is low, and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10--20 telephone numbers by voice (e.g., call home) after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., person to person, calling card) in sentences such as: I want to charge it to my calling card.

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50% [CGF94]. It will be many years before unlimited vocabulary, speaker-independent continuous dictation capability is realized.

1.2.3 Future Directions


In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research challenges in the area of human language technology, and the infrastructure needed to support the work. The key research challenges are summarized in [CH 92]. The following research areas for speech recognition were identified:

Robustness:

In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.

Portability:

Portability refers to the goal of rapidly designing, developing and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expensive.

Adaptation:

How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.


Language Modeling:

Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.

Confidence Measures:

Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.

Out-of-Vocabulary Words:

Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.

Spontaneous Speech:

Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.

Prosody:

Prosody refers to acoustic structure that extends over several segments or words. Stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that has not yet been answered.

Modeling Dynamics:

Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.

Voice recognition

Voice or speech recognition is the ability of a machine or program to receive and interpret dictation, or to understand and carry out spoken commands.

For use with computers, analog audio must be converted into digital signals. This requires analog-to-digital conversion. For a computer to decipher the signal, it must have a digital database, or vocabulary, of words or syllables, and a speedy means of comparing this data with signals. The speech patterns are stored on the hard drive and loaded into memory when the program is run. A comparator checks these stored patterns against the output of the A/D converter.

In practice, the size of a voice-recognition program's effective vocabulary is directly related to the random access memory capacity of the computer in which it is installed. A voice-recognition program runs many times faster if the entire vocabulary can be loaded into RAM, as compared with searching the hard drive for some of the matches. Processing speed is critical as well, because it affects how fast the computer can search the RAM for matches.

All voice-recognition systems or programs make errors. Screaming children, barking dogs, and loud external conversations can produce false input. Much of this can be avoided only by using the system in a quiet room. There is also a problem with words that sound alike but are spelled differently and have different meanings -- for example, "hear" and "here." This problem might someday be largely overcome using stored contextual information. However, this will require more RAM and faster processors than are currently available in personal computers.

Though a number of voice recognition systems are available on the market, the industry leaders are IBM and Dragon Systems.


QUESTION POSED ON: 07 October 2002

There has been some consideration for using voice recognition with contact centers to deflect queuing calls. Recently there have been some nice implementations from both Nuance and Speechworks.

How do you see this market segment developing, and how would you advise someone interested in this technology to ensure they leverage existing investments in either outbound scripts (Siebel Smartscripts) or knowledge bases (Primus and eGain)?

> EXPERT RESPONSE

I believe that the successful use of speech recognition in contact centers hinges on two critical factors:

1. Humanization
2. Application

Let's take these two factors and explore them further.

1. Humanization

People do not like talking with a computer. Most interactions involving speech recognition use either text-to-speech or cold, robotic-sounding prompts to interact with the customer. Neither of these works toward building a relationship with the customer. I know it sounds odd to think about a computer building a relationship with a customer, but that is at the heart of real communication.

If the computer sounds like a person and responds as a person would, then your ability to engage a customer and keep them engaged for an automated session increases significantly. As an example, compare the following:

"Please state your full name" (stated in a monotone)
"Would you please say your first and last name" (stated with full dynamics)

Clearly the second interaction would be preferred. Achieving this is the first success factor.

2. Application

Certain types of applications lend themselves well toward an automated interaction with a customer. A good example would be calling in a prescription refill to a pharmacy or checking to see when an order shipped and when it is expected to be delivered.

These types of applications don't require the skills of a highly trained agent but can be very time consuming in terms of personnel cost. Imagine the value of reducing your headcount of less skilled agents while not wasting the time of your highly trained and well compensated agents.

Summary

It is in these types of applications that the largest value can be gained. Don't try to replace your entire agent population. That is not going to happen. Be realistic. Focus on applications where the form of the transaction is fairly consistent.

Voice recognition is the field of computer science that deals with designing computer systems that can recognize spoken words. Note that voice recognition implies only that the computer can take dictation, not that it understands what is being said. Comprehending human languages falls under a different field of computer science called natural language processing.

A number of voice recognition systems are available on the market. The most powerful can recognize thousands of words. However, they generally require an extended training session during which the computer system becomes accustomed to a particular voice and accent. Such systems are said to be speaker dependent.

Many systems also require that the speaker speak slowly and distinctly and separate each word with a short pause. These systems are called discrete speech systems. Recently, great strides have been made in continuous speech systems -- voice recognition systems that allow you to speak naturally. There are now several continuous-speech systems available for personal computers.

Because of their limitations and high cost, voice recognition systems have traditionally been used only in a few specialized situations. For example, such systems are useful in instances when the user is unable to use a keyboard to enter data because his or her hands are occupied or disabled. Instead of typing commands, the user can simply speak into a headset. Increasingly, however, as the cost decreases and performance improves, speech recognition systems are entering the mainstream and are being used as an alternative to keyboards.