50 years of progress in speech recognition technology
- Where we are, and where we should go -
Sadaoki Furui
Department of Computer Science
Tokyo Institute of Technology
[email protected]
From a poor dog to a super cat
Outline
• Progress of automatic speech recognition (ASR) technology
  – 5 generations
  – Synchronization with computer (IT) technology
• Summary of the technological progress
• How to narrow the gap between machine and human speech recognition
• Statistical knowledge processing
• Conclusion
What is 4G?
Generations of ASR technology
[Timeline] ASR prehistory (~1920), 1G (~1952), 2G (~1968), 3G (~1980), 3.5G (~1990), 4G (~2007)
Radio Rex – 1920’s ASR
A sound-activated toy dog named “Rex” (from Elmwood Button Co.) could be called by name from his doghouse. (Thanks to Nelson Morgan.)
1st generation technology (1952 – 1968)
• General
  – The earliest attempts to devise digit/syllable/vowel/phoneme recognition systems
  – Spectral resonances extracted by an analogue filter bank and logic circuits
  – Statistical syntax at the phoneme level
• Early systems
  – Bell Labs, RCA Labs, MIT Lincoln Labs, University College London, Radio Research Lab (Japan), Kyoto Univ. (Japan), NEC Labs (Japan)
Block schematic of isolated digit recognizer circuits (K. H. Davis, R. Biddulph and S. Balashek, Bell Labs, 1952)
Schematic diagram of a phonetic typewriter (syllable recognizer) (H. Olsen and H. Belar, RCA Labs, 1956)
Phonetic typewriter with parts exposed (H. Olsen and H. Belar, RCA Labs, 1956)
Words used in a recognition study (J. W. Forgie and C. D. Forgie, MIT Lincoln Lab, 1959)
Structure of a program which classifies each LOOK as belonging to one of 10 possible vowels
(J. W. Forgie and C. D. Forgie, MIT Lincoln Lab, 1959)
Early schematic diagram of an automatic speech recognizer (P. Denes, University College London, 1959)
Histogram illustrating the prediction of phonemes in a test sentence (D. B. Fry, University College London, 1959)
Schematic diagram showing the arrangement by which statistical linguistic information is combined with acoustic information in a mechanical speech recognizer
(P. Denes, University College London, 1959)
P. Denes, 1959
The automatic speech typewriter (P. Denes, University College London, 1959)
Examples of manually analyzed spectra of the 5 Japanese vowels (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Block diagram of a vowel recognizer based on the majority decision principle (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Front view of the Japanese vowel recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Front view of the spoken digit recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Block diagram of continuous speech recognition experiments (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Block diagram of the phonetic typewriter (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)
Block diagram of the vowel segmentation mechanism (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)
Block diagram of the recognition mechanism (T. Sakai and S. Doshita, Kyoto Univ., Japan, 1962)
Block diagram of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)
Photograph of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)
2nd generation technology (1968 – 1980) (1)
• DTW
  – (Elementary time-normalization methods by Martin et al.)
  – Vintsyuk in the Soviet Union proposed the use of DP
  – Sakoe & Chiba at NEC Labs started to use DP
• Isolated word recognition
  – The area of isolated word or discrete utterance recognition became a viable and usable technology based on fundamental studies in Russia and Japan (Velichko & Zagoruyko, Sakoe & Chiba, and Itakura)
• IBM Labs: large-vocabulary ASR
• Bell Labs: speaker-independent ASR
Block diagram of speech processing stages (T. B. Martin et al., RCA Labs, 1964)
Neural recognition network for /g/ followed by vowels /i/, /I/, /ε/+/`d/ (T. B. Martin et al., RCA Labs, 1964)
Speech processing equipment (T. B. Martin et al., RCA Labs, 1964)
Dynamic time warping algorithm (T. K. Vintsyuk, Lab of Pattern Recognition, USSR, 1968)
1: Curve of maximum similarity; 2, 3: linear normalization (V. M. Velichko and N. G. Zagoruyko, Lab of Pattern Recognition, USSR, 1970)
Matrix of similarity (V. M. Velichko and N. G. Zagoruyko, Lab of Pattern Recognition, USSR, 1970)
Warping function and adjustment window definition (H. Sakoe and S. Chiba, NEC Labs, 1978)
Flanagan writes, “… the computer in the background is a Honeywell DDP516. This was the first integrated circuits machine that we had in the laboratory. … with memory of 8K words (one-half of which was occupied by the Fortran II compiler) …”
(November 1970)
An example of the time-warping function. The parallelogram shows the possible domain of (n, m) coordinates (F. Itakura, Bell/NTT Labs, 1975)
Flow chart of the isolated word recognition system (F. Itakura, Bell/NTT Labs, 1975)
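The DP-based time alignment behind these isolated-word recognizers can be sketched compactly. The following is a generic textbook DTW with a Sakoe-Chiba-style adjustment window, a simplified illustration rather than any of the original algorithms; the window width r and the Euclidean frame distance are arbitrary choices here:

```python
import numpy as np

def dtw_distance(a, b, r=3):
    """DP time alignment between feature sequences a (n x d) and b (m x d),
    restricted to a Sakoe-Chiba-style adjustment window |i - j| <= r."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # accumulated distance matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            # symmetric DP recursion: diagonal match, insertion, or deletion
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)              # normalize by path length
```

An isolated-word recognizer of this era would compare the input against one stored template per vocabulary word and return the word with the smallest warped distance.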
Block diagram of the real-time spoken word recognition system (L. C. W. Pols, TNO, the Netherlands, 1971)
Flow diagram for an operational isolated utterance speech recognition system (G. M. White, Xerox PARC, 1976)
2nd generation technology (1968 – 1980) (2)
• Continuous speech recognition
  – Reddy at CMU conducted pioneering research based on dynamic tracking of phonemes
• DARPA program
  – Focus on speech understanding
  – Goal: 1000-word ASR, a few speakers, continuous speech, constrained grammar, less than 10% semantic error
  – Hearsay I & II systems at CMU
  – Harpy system at CMU
  – HWIM system at BBN
A block diagram of the CMU Hearsay-II system organization (V. R. Lesser, R. D. Fennel, L. D. Erman and D. R. Reddy, CMU, 1975)
A block diagram of the CMU Harpy system organization. Shown is a small (hypothetical) fragment of the Harpy state transition network, including paths accepted for sentences beginning with “Give me.” (B. T. Lowerre, CMU, 1976)
3rd generation technology (1980 – 1990)
• Connected word recognition
  – Two-level DP, one-pass method, level-building (LB) approach, frame-synchronous LB approach
• Statistical framework
• HMM
• Δcepstrum
• N-gram
• Neural net
• DARPA program (Resource Management task)
  – SPHINX system at CMU
  – BYBLOS system at BBN
  – DECIPHER system at SRI
  – Lincoln Labs, MIT, AT&T Bell Labs
Statistical models for speech
Acoustic models
• HMM math worked out in the 1960s
• Applied to speech at CMU and IBM in the early 1970s
  – Harpy system: early applications of HMMs to ASR (Baker)
• Extended by others in the mid-1980s with collection of large standard corpora

Language models
• N-gram (bigram, trigram): first applied by IBM
• Smoothing
MFCC-based front-end processor
[Figure] Speech → FFT → FFT-based spectrum → Mel-scale triangular filters → Log → DCT → acoustic vector (+ Δ, Δ²)
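The front-end stages above can be sketched in a few lines of NumPy. This is a simplified single-frame illustration; the filter-bank size and number of cepstral coefficients are common textbook choices, not values specified on the slide:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr, n_filters=26, n_ceps=13):
    """MFCCs for one windowed frame: FFT -> mel filters -> log -> DCT."""
    n_fft = len(frame)
    spectrum = np.abs(np.fft.rfft(frame)) ** 2          # FFT-based power spectrum
    # Mel-scale triangular filter bank, equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fbank[i, k] = (k - lo) / max(c - lo, 1)     # rising edge
        for k in range(c, hi):
            fbank[i, k] = (hi - k) / max(hi - c, 1)     # falling edge
    energies = np.log(fbank @ spectrum + 1e-10)         # log filter-bank energies
    # DCT-II decorrelates the log energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ energies
```

In a full front end, Δ and Δ² coefficients are then obtained as frame-to-frame differences of these static coefficients.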
Structure of speech production and recognition system based on information transmission theory
[Figure] Transmission theory: information source → channel → decoder. Speech recognition process: text generation (W) → speech production (S) → acoustic processing (X), forming the acoustic channel → linguistic decoding, the speech recognition system.

Ŵ = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X)
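The decision rule Ŵ = argmax_W P(X|W)P(W) can be illustrated with a toy example; since P(X) is the same for every candidate it can be dropped. The candidate transcriptions and all scores below are invented for illustration only:

```python
import math

def decode(candidates, acoustic_logp, lm_logp, lm_weight=1.0):
    """Pick W maximizing log P(X|W) + lm_weight * log P(W)."""
    return max(candidates,
               key=lambda w: acoustic_logp[w] + lm_weight * lm_logp[w])

# Hypothetical log-scores for one utterance X (illustrative numbers only)
acoustic = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm = {"recognize speech": math.log(1e-4),
      "wreck a nice beach": math.log(1e-7)}

best = decode(acoustic.keys(), acoustic, lm)
# The language model overrules the slightly better acoustic score:
# best == "recognize speech"
```

Real decoders work the same way in the log domain, usually with an extra language-model weight and word-insertion penalty tuned empirically.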
Structure of phoneme HMMs
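A phoneme HMM of this kind assigns a likelihood P(X | model) via the forward algorithm. Below is a minimal discrete-observation sketch with an illustrative 3-state left-to-right topology; all probabilities are made-up numbers, not trained values:

```python
import numpy as np

def forward_likelihood(A, B, pi, obs):
    """P(obs | HMM) by the forward algorithm.
    A: state transitions, B: emission probs (states x symbols),
    pi: initial state distribution, obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

# Illustrative 3-state left-to-right phoneme model, 2 observation symbols
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])
pi = np.array([1.0, 0.0, 0.0])
p = forward_likelihood(A, B, pi, [0, 1, 1])
```

Continuous-density HMMs used in practice replace the discrete emission table B with Gaussian mixtures over cepstral feature vectors.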
Statistical language modeling
Probability of the word sequence w1 w2 … wk:

P(w1 … wk) = Π_{i=1..k} P(wi | w1 w2 … wi−1)

Maximum-likelihood estimate:

P(wi | w1 … wi−1) = N(w1 … wi) / N(w1 … wi−1)

where N(w1 … wi) is the number of occurrences of the string w1 … wi in the given training corpus.

Approximation by Markov processes:
Bigram model: P(wi | w1 … wi−1) ≈ P(wi | wi−1)
Trigram model: P(wi | w1 … wi−1) ≈ P(wi | wi−2 wi−1)

Smoothing of the trigram by unigram and bigram:
P̂(wi | wi−2 wi−1) = λ1 P(wi | wi−2 wi−1) + λ2 P(wi | wi−1) + λ3 P(wi)
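The interpolation formula above can be exercised directly on a toy corpus. In practice the λ weights are tuned on held-out data (e.g. by deleted interpolation); here they are fixed, illustrative values:

```python
from collections import Counter

def train_ngrams(words):
    """Collect unigram, bigram and trigram counts from a token list."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri, len(words)

def smoothed_trigram(w1, w2, w3, model, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1 w2) interpolated with bigram and unigram ML estimates."""
    uni, bi, tri, n = model
    l1, l2, l3 = lambdas
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p1 = uni[w3] / n
    return l1 * p3 + l2 * p2 + l3 * p1

corpus = "we recognize speech and we recognize speakers".split()
model = train_ngrams(corpus)
p = smoothed_trigram("we", "recognize", "speech", model)
```

The unigram term guarantees a nonzero probability for any in-vocabulary word, which is the point of the smoothing: unseen trigrams no longer get probability zero.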
“Julie” doll with speech synthesis and recognition technology, produced by Worlds of Wonder in conjunction with Texas Instruments (1987)
3.5th generation technology (1990 - 2000)
• Error minimization (discriminative) approach
  – MCE (Minimum Classification Error) approach
  – MMI (Maximum Mutual Information) criterion
• Robust ASR
  – Background noises, voice individuality, microphones, transmission channels, room reverberation, etc.
  – VTLN, MLLR, HLDA, fMPE, PMC, etc.
• DARPA program
  – ATIS task
  – Broadcast news (BN) transcription integrated with information extraction and retrieval technology
  – Switchboard task
• Applications
  – Automate and enhance operator services
History of DARPA speech recognition benchmark tests
[Figure] Word error rate (1%–100%, log scale) vs. year (1988–2003) for successive benchmark tasks: Resource Management read speech (1k words), ATIS spontaneous speech, WSJ read speech (5k/20k words), NAB, Broadcast Speech (varied microphone, noisy, foreign), and Switchboard conversational speech. (Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.)
3.5th generation technology (2000 -)
• DARPA program
  – EARS (Effective Affordable Reusable Speech-to-Text) program for rich transcription, GALE
  – Detecting, extracting, summarizing, and translating important information
• Spontaneous speech recognition
  – CSJ project in Japan
  – Meeting projects in US and Europe
• Robust ASR
  – Utterance verification and confidence measures
  – Combining systems or subsystems
  – Graphical models (DBN)
• Multimodal speech recognition
  – Audio-visual ASR
Major speech recognition applications
• Conversational systems for accessing information services (e.g. automatic flight status or stock quote information systems)
• Systems for transcribing, understanding and summarizing ubiquitous speech documents (e.g. broadcast news, meetings, lectures, presentations, congressional records, court records, and voicemails)
Multimedia contents technology
[Diagram] Contents on the WWW (Internet), accessed via ubiquitous, mobile and wearable computing, surrounded by the enabling technologies: speech/audio processing, image/motion processing, text processing (translation), information extraction (mining), information retrieval (access), human-computer interaction (dialog), and multimedia/multimodal communication.
Audio indexing system
[Diagram] Audio source → audio feeder → component manager coordinating speech detection, speaker segmentation, speaker tracking, speaker identification, speech recognition, name extraction, sentence detection, and machine translation → metadata composer → XML metadata. Dense component integration creates state-of-the-art rich transcriptions. (J. Makhoul, BBN, 2006)
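The component chain in this diagram amounts to a sequential metadata-enrichment pipeline. A schematic sketch follows; the stage functions are hypothetical placeholders standing in for the components named in the diagram, not BBN's implementation:

```python
def run_pipeline(audio, stages):
    """Pass audio-derived metadata through each indexing component in turn;
    each component reads what earlier stages produced and adds its own fields."""
    metadata = {"source": audio}
    for stage in stages:
        metadata.update(stage(metadata))
    return metadata

# Hypothetical stand-ins for the components named in the diagram
stages = [
    lambda m: {"speech_regions": ["0.0-4.2s"]},      # speech detection
    lambda m: {"speaker_turns": ["spk1", "spk2"]},   # speaker segmentation/tracking
    lambda m: {"transcript": "hello world"},         # speech recognition
    lambda m: {"names": []},                         # name extraction
]
meta = run_pipeline("news.wav", stages)
```

The design point of the diagram is that the components share one metadata stream, so downstream stages (name extraction, translation) can condition on upstream outputs (speaker turns, transcript).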
50 years of progress in speech recognition (1)
(1) Template matching → Corpus-based statistical modeling (e.g. HMM and n-grams)
(2) Filter bank/spectral resonances → Cepstral features (cepstrum + Δcepstrum + ΔΔcepstrum)
(3) Heuristic time-normalization → DTW/DP matching
(4) “Distance”-based methods → Likelihood-based methods
(5) MAP (ML) approach → Discriminative approach (e.g. MCE/GPD and MMI)
50 years of progress in speech recognition (2)
(6) Isolated word recognition → Continuous speech recognition
(7) Small vocabulary recognition → Large vocabulary recognition
(8) Context-independent units → Context-dependent units
(9) Clean speech recognition → Noisy/telephone speech recognition
(10) Single speaker recognition → Speaker-independent/adaptive recognition
(11) Monologue recognition → Dialogue/conversation recognition
50 years of progress in speech recognition (3)
(12) Read speech recognition → Spontaneous speech recognition
(13) Recognition → Understanding
(14) Single-modality (audio signal only) recognition → Multi-modal (audio/visual speech) recognition
(15) Hardware recognizer → Software recognizer
(16) No commercial application → Many practical commercial applications
(17) Few languages → Many languages
IT technology progress (David C. Moschella: “Waves of Power”)
[Figure] Number of users (millions; 10 to 3,000) vs. year (1970–2030): system-centered → PC-centered → network-centered → knowledge-resource-centered eras, aligned with the microprocessor, PC and laptop PC waves and with ASR generations 2G, 3G, 3.5G and 4G.
ICASSP and ASR
• ICASSP has made great contributions to speech technology progress since its beginning in 1976.
• How many speech papers have been presented at ICASSP conferences?
• How many ASR papers have been presented at ICASSP conferences?
The number of ICASSP ASR papers (1978-2006)
[Figure] Bar chart of ASR papers per ICASSP from ’78 (USA) through ’06 (France), growing from roughly 20–30 papers per year in the late 1970s to around 200 in 2006, with the 3G and 3.5G periods marked. (3487 ASR papers have been presented since 1978.)
The number of ICASSP speech papers (1978-2006)
[Figure] Stacked bar chart of speech papers per ICASSP (ASR vs. other areas) from ’78 (USA) through ’06 (France). (5998 speech papers have been presented at ICASSP since 1978.)
Distribution of the ICASSP ASR papers over major countries
[Figure] ASR papers per year (1978–2006) for the USA, Japan, Germany, France, the UK, and other countries. (3487 ASR papers from 49 countries have been presented since 1978.)

Distribution of the papers by countries “other” than UK, France, Germany, Japan and USA
[Figure] Per-year counts (1978–2006) for Italy, Canada, Finland, India, China, Belgium, Hong Kong, Korea, Switzerland, Australia, Spain, Taiwan, and others.
R&D expenditures
(Source: White paper on science, technology and industry by the OECD)
R&D expenditures in China are rapidly increasing, at twice the rate of overall Chinese economic growth.
[Figure] R&D expenditures (× 100 billion dollars, 0–3.5) from 1990 to 2006 for the USA, the EU (15 countries), Japan, Germany, and China.
“State-of-the-art of automatic speech recognition technology” (B. Beek, E. P. Neuberg and D. C. Hodge, 1977)
A = useful now; B = shows promise; C = a long way to go

Processing techniques and state-of-the-art:
1) Signal conditioning: A, except speech enhancement (C)
2) Digital signal transformation: A
3) Analog signal transformation and feature extraction: A, except feature extraction (C)
4) Digital parameter and feature extraction: B
5a) Re-synthesis: A
5b) Orthographic synthesis: C
6) Speaker normalization, speaker adaptation, situation adaptation: C
7) Time normalization: B
8) Segmentation and labeling: B
9a) Language statistics: C
9b) Syntax: B
9c) Semantics: C
9d) Speaker and situation pragmatics: C
10) Lexical matching: C
11) Speech understanding: B-C
12) Speaker recognition: A for speaker verification; C for all others
13) Time normalization and realization: A-C
14) Performance evaluation: C
“Unsolved problems” (B. Beek, E. P. Neuberg and D. C. Hodge, 1977)
1) Detect speech in noise; speech/non-speech.
2) Extract relevant acoustic parameters (pole, zero, formant (transitions), slopes, dimensional representation, zero-crossing distributions).
3) Nonlinear time normalization (dynamic programming).
4) Detect smaller units in continuous speech (word/phoneme boundaries; acoustic segments).
5) Establish anchor point; scan utterance from left to right; start from stressed vowel, etc.
6) Stressed/unstressed.
7) Phonological rules.
8) Missing or extra added (“uh”) speech sound.
9) Limited vocabulary and restricted language structure necessary; possibility of adding new words.
10) Semantics of (limited) tasks.
11) Limits of acoustic information only.
12) Recognition algorithm (shortest distance, (pairwise) discriminant, Bayes, probabilities).
13) Hypothesize-and-test, backtrack, feed forward.
14) Effect of nasalization, cold, emotion, loudness, pitch, whispering, distortions due to talker’s acoustical environment, distortions by communication systems (telephone, transmitter-receiver, intercom, public address, face masks), nonstandard environments.
15) Adaptive and interactive quick learning.
16) Mimicking; uncooperative speaker(s).
17) Necessity of visual feedback, error control, level for rejections.
18) Consistency of references.
19) Real-time processing.
20) Human engineering problem of incorporating speech understanding system into actual situations.
21) Cost-effectiveness.
22) Detect speech in presence of competing speech.
23) Economical ways for adding new speakers to system.
24) Use of prosodic information.
25) Co-articulation rules.
26) Morphology rules.
27) Syntax rules.
28) Vocal-tract modeling.
Progress of speech recognition technology since 1980
[Figure] Applications mapped by vocabulary size (2 to 20,000 words, up to unrestricted) and speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech), with frontiers for 1980, 1990 and 2000. Applications include voice commands, digit strings, name dialing, form fill by voice, directory assistance, word spotting, office dictation, car navigation, system-driven dialogue, transcription, 2-way dialogue, network agent & intelligent messaging, and natural conversation.
20 things we still don’t know about speech (R. Moore, 1994) (1)
(1) How important is the communicative nature of speech?
(2) Is human-human speech communication relevant to human-machine speech communication?
(3) Speech technology or speech science? (How can we integrate speech science and technology?)
(4) Whither a unified theory?
(5) Is speech special?
(6) Why is speech contrastive?
(7) Is there random variability in speech?
(8) How important is individuality?
(9) Is disfluency normal?
(10) How much effort does speech require?
20 things we still don’t know about speech (R. Moore, 1994) (2)
(11) What is a good architecture (for speech processes)?
(12) What are suitable levels of representation?
(13) What are the units?
(14) What is the formalism?
(15) How important are the physiological mechanisms?
(16) Is time-frame based speech analysis sufficient?
(17) How important is adaptation?
(18) What are the mechanisms for learning?
(19) What is speech good for?
(20) How good is speech?
Major ASR problems (N. Morgan, 2006)
• Unexpected rate of speech can still hurt
• Unexpected accent can hurt
• Performance in noise, reverberation still bad
• Don’t know when we know
• Few advances in basic understanding
• It takes a long time to build a system for a new language; requires a large amount of resources
How can we make a human-like speech recognizer?
(Doraemon)
What’s likely to help (N. Morgan, 2006) (1)
• The obvious: faster computers, more memory and disk, more data
• Improved techniques for learning from unlabeled data
• Serious efforts to handle:
  – noise and reverberation
  – speaking style variation
  – out-of-vocabulary words (and sounds)
• Learning how to select features
• Learning how to select models
• Feedback from downstream processing
What’s likely to help (N. Morgan, 2006) (2)
• New (multiple) features and models
• New statistical dependencies (e.g., graphical models)
• Multiple time scales
• Multiple (larger) sound units
• Dynamic/robust pronunciation models
• Language models including structure (still!)
• Incorporating prosody
• Incorporating meaning
• Non-speech modalities
• Understanding confidence
A communication-theoretic view of speech generation & recognition
[Figure] Message source M, P(M) → linguistic channel, P(W|M) → W → acoustic channel, P(X|W) → X → speech recognizer. Knowledge sources: language, vocabulary, grammar, semantics, context, habits. Sources of variation: speaker, reverberation, noise, transmission characteristics, microphone.
Knowledge sources for speech recognition
• Human speech recognition is a matching process whereby an audio signal is matched to existing knowledge
• Knowledge (meta-data)
  – Domain and topics of utterances
  – Context
  – Semantics
  – Who is speaking
  – etc.
• Systematization of various related knowledge
• How to incorporate knowledge sources into the statistical ASR framework
[Diagram] Speech (data) → recognition → transcription (information), guided by knowledge built through generalization and abstraction of meta-data.
Generations of ASR technology
[Timeline] ASR prehistory (~1920), 1G (~1952), 2G (~1968), 3G (~1980), 3.5G (~1990), 4G (~2007): extended knowledge processing
Conclusion
• Speech recognition technology has made very significant progress in the past 50+ years with the help of computer technology.
• The majority of technological changes have been directed toward the purpose of increasing robustness of recognition.
• However, 60% (16/28) of the “unsolved problems” listed by Beek et al. in 1977 have not yet been solved.
• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.
• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.
Thank you for your kind attention.