50 years of progress in speech recognition technology
- Where we are, and where we should go -
Sadaoki Furui
Department of Computer Science
Tokyo Institute of Technology
[email protected]
From a poor dog to a super cat
Outline
• Progress of automatic speech recognition (ASR) technology
  – 5 generations
  – Synchronization with computer (IT) technology
• Summary of the technological progress
• How to narrow the gap between machine and human speech recognition
• Statistical knowledge processing
• Conclusion
What is 4G?
Generations of ASR technology
[Timeline] ASR prehistory (~1920), 1G (~1952), 2G (~1968), 3G (~1980), 3.5G (~1990), 4G (~2007)
Radio Rex – 1920’s ASR
A sound-activated toy dog named “Rex” (from Elmwood Button Co.) could be called by name from his doghouse. (Thanks to Nelson Morgan.)
1st generation technology (1952 – 1968)
• General
  – The earliest attempts to devise digit/syllable/vowel/phoneme recognition systems
  – Spectral resonances extracted by an analogue filter bank and logic circuits
  – Statistical syntax at the phoneme level
• Early systems
  – Bell Labs, RCA Labs, MIT Lincoln Labs, University College London, Radio Research Lab (Japan), Kyoto Univ. (Japan), NEC Labs (Japan)
Block schematic of isolated digit recognizer circuits (K. H. Davis, R. Biddulph and S. Balashek, Bell Labs, 1952)
Schematic diagram of a phonetic typewriter (syllable recognizer) (H. Olsen and H. Belar, RCA Labs, 1956)
Phonetic typewriter with parts exposed (H. Olsen and H. Belar, RCA Labs, 1956)
Words used in a recognition study (J. W. Forgie and C. D. Forgie, MIT Lincoln Lab, 1959)
Structure of a program which classifies each LOOK as belonging to one of 10 possible vowels
(J. W. Forgie and C. D. Forgie, MIT Lincoln Lab, 1959)
Early schematic diagram of an automatic speech recognizer (P. Denes, University College London, 1959)
Histogram illustrating the prediction of phonemes in a test sentence (D. B. Fry, University College London, 1959)
Schematic diagram showing the arrangement by which statistical linguistic information is combined with acoustic information in a mechanical speech recognizer
(P. Denes, University College London, 1959)
P. Denes, 1959
The automatic speech typewriter (P. Denes, University College London, 1959)
Examples of manually analyzed spectra of the 5 Japanese vowels (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Block diagram of a vowel recognizer based on the majority decision principle (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Front view of the Japanese vowel recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Front view of the spoken digit recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Block diagram of continuous speech recognition experiments (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Block diagram of the phonetic typewriter (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)
Block diagram of the vowel segmentation mechanism (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)
Block diagram of the recognition mechanism (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)
Block diagram of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)
Photograph of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)
2nd generation technology (1968 – 1980) (1)
• DTW
  – (Elementary time-normalization methods by Martin et al.)
  – Vintsyuk in the Soviet Union proposed the use of DP
  – Sakoe & Chiba at NEC Labs started to use DP
• Isolated word recognition
  – The area of isolated word or discrete utterance recognition became a viable and usable technology based on fundamental studies in Russia and Japan (Velichko & Zagoruyko, Sakoe & Chiba, and Itakura)
• IBM Labs: large-vocabulary ASR
• Bell Labs: speaker-independent ASR
Block diagram of speech processing stages (T. B. Martin et al., RCA Labs, 1964)
Neural recognition network for /g/ followed by vowels /i/, /I/, /ε/+/`d/ (T. B. Martin et al., RCA Labs, 1964)
Speech processing equipment (T. B. Martin et al., RCA Labs, 1964)
Dynamic time warping algorithm (T. K. Vintsyuk, Lab of Pattern Recognition, USSR, 1968)
1: Curve of maximum similarity; 2, 3: linear normalization (V. M. Velichko and N. G. Zagoruyko, Lab of Pattern Recognition, USSR, 1970)
Matrix of similarity (V. M. Velichko and N. G. Zagoruyko, Lab of Pattern Recognition, USSR, 1970)
Warping function and adjustment window definition (H. Sakoe and S. Chiba, NEC Labs, 1978)
Flanagan writes, “… the computer in the background is a Honeywell DDP516. This was the first integrated circuits machine that we had in the laboratory. … with memory of 8K words (one-half of which was occupied by the Fortran II compiler) …”
(November 1970)
An example of the time-warping function. The parallelogram shows the possible domain of (n, m) coordinates (F. Itakura, Bell/NTT Labs, 1975)
Flow chart of the isolated word recognition system (F. Itakura, Bell/NTT Labs, 1975)
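The DP-based time alignment behind these isolated-word recognizers can be sketched compactly. The following is a generic textbook DTW with a Sakoe-Chiba-style adjustment window, a simplified illustration rather than any of the original algorithms; the window width r and the Euclidean frame distance are arbitrary choices here:

```python
import numpy as np

def dtw_distance(a, b, r=3):
    """DP time alignment between feature sequences a (n x d) and b (m x d),
    restricted to a Sakoe-Chiba-style adjustment window |i - j| <= r."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated distance matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            # symmetric DP recursion: diagonal match, insertion, or deletion
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)              # normalize by path length
```

An isolated-word recognizer of this era would compare the input against one stored template per vocabulary word and return the word with the smallest warped distance.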
Block diagram of the real-time spoken word recognition system (L. C. W. Pols, TNO, the Netherlands, 1971)
Flow diagram for an operational isolated utterance speech recognition system (G. M. White, Xerox PARC, 1976)
2nd generation technology (1968 – 1980) (2)
• Continuous speech recognition
  – Reddy at CMU conducted pioneering research based on dynamic tracking of phonemes
• DARPA program
  – Focus on speech understanding
  – Goal: 1000-word ASR, a few speakers, continuous speech, constrained grammar, less than 10% semantic error
  – Hearsay I & II systems at CMU
  – Harpy system at CMU
  – HWIM system at BBN
A block diagram of the CMU Hearsay-II system organization (V. R. Lesser, R. D. Fennel, L. D. Erman and D. R. Reddy, CMU, 1975)
A block diagram of the CMU Harpy system organization. Shown is a small (hypothetical) fragment of the Harpy state transition network, including paths accepted for sentences beginning with “Give me.” (B. T. Lowerre, CMU, 1976)
3rd generation technology (1980 – 1990)
• Connected word recognition
  – Two-level DP, one-pass method, level-building (LB) approach, frame-synchronous LB approach
• Statistical framework
• HMM
• Δcepstrum
• N-gram
• Neural net
• DARPA program (Resource Management task)
  – SPHINX system at CMU
  – BYBLOS system at BBN
  – DECIPHER system at SRI
  – Lincoln Labs, MIT, AT&T Bell Labs
Statistical models for speech
Acoustic models
• HMM math worked out in the 1960s
• Applied to speech at CMU and IBM in the early 1970s
  – Harpy system: early applications of HMMs to ASR (Baker)
• Extended by others in the mid-1980s with collection of large standard corpora

Language models
• N-gram (bigram, trigram): first applied by IBM
• Smoothing
MFCC-based front-end processor
[Figure] Speech → FFT → FFT-based spectrum → Mel-scale triangular filters → Log → DCT → acoustic vector (+ Δ, Δ²)
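The front-end stages above can be sketched in a few lines of NumPy. This is a simplified single-frame illustration; the filter-bank size and number of cepstral coefficients are common textbook choices, not values specified on the slide:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: FFT -> mel filters -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # FFT-based power spectrum
    # Mel-scale triangular filter bank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fbank[i, k] = (k - lo) / max(c - lo, 1)     # rising edge
        for k in range(c, hi):
            fbank[i, k] = (hi - k) / max(hi - c, 1)     # falling edge
    energies = np.log(fbank @ spectrum + 1e-10)         # log filter-bank energies
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies
```

In a full front end, Δ and Δ² coefficients are then obtained as frame-to-frame differences of these static coefficients.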
Structure of speech production and recognition system based on information transmission theory
[Figure] Transmission theory: information source → channel → decoder. Speech recognition process: text generation (W) → speech production (S) → acoustic processing (X), forming the acoustic channel → linguistic decoding, the speech recognition system.

Ŵ = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X)
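The decision rule Ŵ = argmax_W P(X|W)P(W) can be illustrated with a toy example; since P(X) is the same for every candidate it can be dropped. The candidate transcriptions and all scores below are invented for illustration only:

```python
import math

def decode(candidates, acoustic_logp, lm_logp, lm_weight=1.0):
    """Pick W maximizing log P(X|W) + lm_weight * log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_logp[w] + lm_weight * lm_logp[w])

# Hypothetical log-scores for one utterance X (illustrative numbers only)
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm = {"recognize speech": math.log(1e-4),
      "wreck a nice beach": math.log(1e-7)}

best = decode(acoustic.keys(), acoustic, lm)
# The language model overrules the slightly better acoustic score:
# best == "recognize speech"
```

Real decoders work the same way in the log domain, usually with an extra language-model weight and word-insertion penalty tuned empirically.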
Structure of phoneme HMMs
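A phoneme HMM of this kind assigns a likelihood P(X | model) via the forward algorithm. Below is a minimal discrete-observation sketch with an illustrative 3-state left-to-right topology; all probabilities are made-up numbers, not trained values:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(obs | HMM) by the forward algorithm.
    A: state transitions, B: emission probs (states x symbols),
    pi: initial state distribution, obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

# Illustrative 3-state left-to-right phoneme model, 2 observation symbols
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
pi = np.array([1.0, 0.0, 0.0])
p = forward_likelihood(A, B, pi, [0, 1, 1])
```

Continuous-density HMMs used in practice replace the discrete emission table B with Gaussian mixtures over cepstral feature vectors.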
Statistical language modeling
Probability of the word sequence w1 w2 … wk:

P(w1 … wk) = Π_{i=1..k} P(wi | w1 w2 … wi−1)

Maximum-likelihood estimate:

P(wi | w1 … wi−1) = N(w1 … wi) / N(w1 … wi−1)

where N(w1 … wi) is the number of occurrences of the string w1 … wi in the given training corpus.

Approximation by Markov processes:
Bigram model: P(wi | w1 … wi−1) ≈ P(wi | wi−1)
Trigram model: P(wi | w1 … wi−1) ≈ P(wi | wi−2 wi−1)

Smoothing of the trigram by unigram and bigram:
P̂(wi | wi−2 wi−1) = λ1 P(wi | wi−2 wi−1) + λ2 P(wi | wi−1) + λ3 P(wi)
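The interpolation formula above can be exercised directly on a toy corpus. In practice the λ weights are tuned on held-out data (e.g. by deleted interpolation); here they are fixed, illustrative values:

```python
from collections import Counter

def train_ngrams(words):
    """Collect unigram, bigram and trigram counts from a token list."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri, len(words)

def smoothed_trigram(w1, w2, w3, model, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1 w2) interpolated with bigram and unigram ML estimates."""
    uni, bi, tri, n = model
    l1, l2, l3 = lambdas
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / n
    return l1 * p3 + l2 * p2 + l3 * p1

corpus = "we recognize speech and we recognize speakers".split()
model = train_ngrams(corpus)
p = smoothed_trigram("we", "recognize", "speech", model)
```

The unigram term guarantees a nonzero probability for any in-vocabulary word, which is the point of the smoothing: unseen trigrams no longer get probability zero.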
“Julie” doll with speech synthesis and recognition technology, produced by Worlds of Wonder in conjunction with Texas Instruments (1987)
3.5th generation technology (1990 - 2000)
• Error minimization (discriminative) approach
  – MCE (Minimum Classification Error) approach
  – MMI (Maximum Mutual Information) criterion
• Robust ASR
  – Background noises, voice individuality, microphones, transmission channels, room reverberation, etc.
  – VTLN, MLLR, HLDA, fMPE, PMC, etc.
• DARPA program
  – ATIS task
  – Broadcast news (BN) transcription integrated with information extraction and retrieval technology
  – Switchboard task
• Applications
  – Automate and enhance operator services
History of DARPA speech recognition benchmark tests
[Figure] Word error rate (1%–100%, log scale) vs. year (1988–2003) for successive benchmark tasks: Resource Management read speech (1k words), ATIS spontaneous speech, WSJ read speech (5k/20k words), NAB, Broadcast Speech (varied microphone, noisy, foreign), and Switchboard conversational speech. (Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.)
3.5th generation technology (2000 -)
• DARPA program
  – EARS (Effective Affordable Reusable Speech-to-Text) program for rich transcription, GALE
  – Detecting, extracting, summarizing, and translating important information
• Spontaneous speech recognition
  – CSJ project in Japan
  – Meeting projects in US and Europe
• Robust ASR
  – Utterance verification and confidence measures
  – Combining systems or subsystems
  – Graphical models (DBN)
• Multimodal speech recognition
  – Audio-visual ASR
Major speech recognition applications
• Conversational systems for accessing information services (e.g. automatic flight status or stock quote information systems)
• Systems for transcribing, understanding and summarizing ubiquitous speech documents (e.g. broadcast news, meetings, lectures, presentations, congressional records, court records, and voicemails)
Multimedia contents technology
[Diagram] Contents on the WWW (Internet), accessed via ubiquitous, mobile and wearable computing, surrounded by the enabling technologies: speech/audio processing, image/motion processing, text processing (translation), information extraction (mining), information retrieval (access), human-computer interaction (dialog), and multimedia/multimodal communication.
Audio indexing system
[Diagram] Audio source → audio feeder → component manager coordinating speech detection, speaker segmentation, speaker tracking, speaker identification, speech recognition, name extraction, sentence detection, and machine translation → metadata composer → XML metadata. Dense component integration creates state-of-the-art rich transcriptions. (J. Makhoul, BBN, 2006)
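The component chain in this diagram amounts to a sequential metadata-enrichment pipeline. A schematic sketch follows; the stage functions are hypothetical placeholders standing in for the components named in the diagram, not BBN's implementation:

```python
def run_pipeline(audio, stages):
    """Pass audio-derived metadata through each indexing component in turn;
    each component reads what earlier stages produced and adds its own fields."""
    metadata = {"source": audio}
    for stage in stages:
        metadata.update(stage(metadata))
    return metadata

# Hypothetical stand-ins for the components named in the diagram
stages = [
    lambda m: {"speech_regions": ["0.0-4.2s"]},      # speech detection
    lambda m: {"speaker_turns": ["spk1", "spk2"]},   # speaker segmentation/tracking
    lambda m: {"transcript": "hello world"},         # speech recognition
    lambda m: {"names": []},                         # name extraction
]
meta = run_pipeline("news.wav", stages)
```

The design point of the diagram is that the components share one metadata stream, so downstream stages (name extraction, translation) can condition on upstream outputs (speaker turns, transcript).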
50 years of progress in speech recognition (1)
(1) Template matching → Corpus-based statistical modeling (e.g. HMM and n-grams)
(2) Filter bank/spectral resonances → Cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum)
(3) Heuristic time-normalization → DTW/DP matching
(4) “Distance”-based methods → Likelihood-based methods
(5) MAP (ML) approach → Discriminative approach (e.g. MCE/GPD and MMI)
50 years of progress in speech recognition (2)
(6) Isolated word recognition → Continuous speech recognition
(7) Small vocabulary recognition → Large vocabulary recognition
(8) Context-independent units → Context-dependent units
(9) Clean speech recognition → Noisy/telephone speech recognition
(10) Single speaker recognition → Speaker-independent/adaptive recognition
(11) Monologue recognition → Dialogue/conversation recognition
50 years of progress in speech recognition (3)
(12) Read speech recognition → Spontaneous speech recognition
(13) Recognition → Understanding
(14) Single-modality (audio signal only) recognition → Multi-modal (audio/visual speech) recognition
(15) Hardware recognizer → Software recognizer
(16) No commercial application → Many practical commercial applications
(17) Few languages → Many languages
IT technology progress (David C. Moschella: “Waves of Power”)
[Figure] Number of users (millions; 10 to 3,000) vs. year (1970–2030): system-centered → PC-centered → network-centered → knowledge-resource-centered eras, aligned with the microprocessor, PC and laptop PC waves and with ASR generations 2G, 3G, 3.5G and 4G.
ICASSP and ASR
• ICASSP has made great contributions to speech technology progress since its beginning in 1976.
• How many speech papers have been presented at ICASSP conferences?
• How many ASR papers have been presented at ICASSP conferences?
The number of ICASSP ASR papers (1978-2006)
[Figure] Bar chart of ASR papers per ICASSP from ’78 (USA) through ’06 (France), growing from roughly 20–30 papers per year in the late 1970s to around 200 in 2006, with the 3G and 3.5G periods marked. (3487 ASR papers have been presented since 1978.)
The number of ICASSP speech papers (1978-2006)
[Figure] Stacked bar chart of speech papers per ICASSP (ASR vs. other areas) from ’78 (USA) through ’06 (France). (5998 speech papers have been presented at ICASSP since 1978.)
Distribution of the ICASSP ASR papers over major countries
[Figure] ASR papers per year (1978–2006) for the USA, Japan, Germany, France, the UK, and other countries. (3487 ASR papers from 49 countries have been presented since 1978.)

Distribution of the papers by countries “other” than UK, France, Germany, Japan and USA
[Figure] Per-year counts (1978–2006) for Italy, Canada, Finland, India, China, Belgium, Hong Kong, Korea, Switzerland, Australia, Spain, Taiwan, and others.
R&D expenditures
(Source: White paper on science, technology and industry by the OECD)
R&D expenditures in China are rapidly increasing, at twice the rate of overall Chinese economic growth.
[Figure] R&D expenditures (× 100 billion dollars, 0–3.5) from 1990 to 2006 for the USA, the EU (15 countries), Japan, Germany, and China.
“State-of-the-art of automatic speech recognition technology” (B. Beek, E. P. Neuberg and D. C. Hodge, 1977)
A = useful now; B = shows promise; C = a long way to go

Processing techniques and state-of-the-art:
1) Signal conditioning: A, except speech enhancement (C)
2) Digital signal transformation: A
3) Analog signal transformation and feature extraction: A, except feature extraction (C)
4) Digital parameter and feature extraction: B
5a) Re-synthesis: A
5b) Orthographic synthesis: C
6) Speaker normalization, speaker adaptation, situation adaptation: C
7) Time normalization: B
8) Segmentation and labeling: B
9a) Language statistics: C
9b) Syntax: B
9c) Semantics: C
9d) Speaker and situation pragmatics: C
10) Lexical matching: C
11) Speech understanding: B-C
12) Speaker recognition: A for speaker verification; C for all others
13) Time normalization and realization: A-C
14) Performance evaluation: C
“Unsolved problems” (B. Beek, E. P. Neuberg and D. C. Hodge, 1977)
1) Detect speech in noise; speech/non-speech.
2) Extract relevant acoustic parameters (pole, zero, formant (transitions), slopes, dimensional representation, zero-crossing distributions).
3) Nonlinear time normalization (dynamic programming).
4) Detect smaller units in continuous speech (word/phoneme boundaries; acoustic segments).
5) Establish anchor point; scan utterance from left to right; start from stressed vowel, etc.
6) Stressed/unstressed.
7) Phonological rules.
8) Missing or extra added (“uh”) speech sound.
9) Limited vocabulary and restricted language structure necessary; possibility of adding new words.
10) Semantics of (limited) tasks.
11) Limits of acoustic information only.
12) Recognition algorithm (shortest distance, (pairwise) discriminant, Bayes, probabilities).
13) Hypothesize-and-test, backtrack, feed forward.
14) Effect of nasalization, cold, emotion, loudness, pitch, whispering, distortions due to talker’s acoustical environment, distortions by communication systems (telephone, transmitter-receiver, intercom, public address, face masks), nonstandard environments.
15) Adaptive and interactive quick learning.
16) Mimicking; uncooperative speaker(s).
17) Necessity of visual feedback, error control, level for rejections.
18) Consistency of references.
19) Real-time processing.
20) Human engineering problem of incorporating speech understanding system into actual situations.
21) Cost-effectiveness.
22) Detect speech in presence of competing speech.
23) Economical ways for adding new speakers to system.
24) Use of prosodic information.
25) Co-articulation rules.
26) Morphology rules.
27) Syntax rules.
28) Vocal-tract modeling.
Progress of speech recognition technology since 1980
[Figure] Applications mapped by vocabulary size (2 to 20,000 words, up to unrestricted) and speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech), with frontiers for 1980, 1990 and 2000. Applications include voice commands, digit strings, name dialing, form fill by voice, directory assistance, word spotting, office dictation, car navigation, system-driven dialogue, transcription, 2-way dialogue, network agent & intelligent messaging, and natural conversation.
20 things we still don’t know about speech (R. Moore, 1994) (1)
(1) How important is the communicative nature of speech?
(2) Is human-human speech communication relevant to human-machine speech communication?
(3) Speech technology or speech science? (How can we integrate speech science and technology?)
(4) Whither a unified theory?
(5) Is speech special?
(6) Why is speech contrastive?
(7) Is there random variability in speech?
(8) How important is individuality?
(9) Is disfluency normal?
(10) How much effort does speech require?
20 things we still don’t know about speech (R. Moore, 1994) (2)
(11) What is a good architecture (for speech processes)?
(12) What are suitable levels of representation?
(13) What are the units?
(14) What is the formalism?
(15) How important are the physiological mechanisms?
(16) Is time-frame based speech analysis sufficient?
(17) How important is adaptation?
(18) What are the mechanisms for learning?
(19) What is speech good for?
(20) How good is speech?
Major ASR problems (N. Morgan, 2006)
• Unexpected rate of speech can still hurt
• Unexpected accent can hurt
• Performance in noise, reverberation still bad
• Don’t know when we know
• Few advances in basic understanding
• It takes a long time to build a system for a new language; requires a large amount of resources
How can we make a human-like speech recognizer?
(Doraemon)
What’s likely to help (N. Morgan, 2006) (1)
• The obvious: faster computers, more memory and disk, more data
• Improved techniques for learning from unlabeled data
• Serious efforts to handle:
  – noise and reverberation
  – speaking style variation
  – out-of-vocabulary words (and sounds)
• Learning how to select features
• Learning how to select models
• Feedback from downstream processing
What’s likely to help (N. Morgan, 2006) (2)
• New (multiple) features and models
• New statistical dependencies (e.g., graphical models)
• Multiple time scales
• Multiple (larger) sound units
• Dynamic/robust pronunciation models
• Language models including structure (still!)
• Incorporating prosody
• Incorporating meaning
• Non-speech modalities
• Understanding confidence
A communication-theoretic view of speech generation & recognition
[Figure] Message source M, P(M) → linguistic channel, P(W|M) → W → acoustic channel, P(X|W) → X → speech recognizer. Knowledge sources: language, vocabulary, grammar, semantics, context, habits. Sources of variation: speaker, reverberation, noise, transmission characteristics, microphone.
Knowledge sources for speech recognition
• Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge
• Knowledge (meta-data)
  – Domain and topics of utterances
  – Context
  – Semantics
  – Who is speaking
  – etc.
• Systematization of various related knowledge
• How to incorporate knowledge sources into the statistical ASR framework
[Diagram] Speech (data) → recognition → transcription (information), guided by knowledge built through generalization and abstraction of meta-data.
Generations of ASR technology
[Timeline] ASR prehistory (~1920), 1G (~1952), 2G (~1968), 3G (~1980), 3.5G (~1990), 4G (~2007): extended knowledge processing
Conclusion
• Speech recognition technology has made very significant progress in the past 50+ years with the help of computer technology.
• The majority of technological changes have been directed toward the purpose of increasing robustness of recognition.
• However, 60% (16/28) of the “unsolved problems” listed by Beek et al. in 1977 have not yet been solved.
• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.
• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.
Thank you for your kind attention.