ASR Intro: Outline
Transcript of ASR Intro: Outline
ASR Intro: Outline
• ASR Research History
• Difficulties and Dimensions
• Core Technology Components
• 21st century ASR Research
Radio Rex – 1920’s ASR
Radio Rex
“It consisted of a celluloid dog with an iron
base held within its house by an electromagnet
against the force of a spring. Current energizing
the magnet flowed through a metal bar which was
arranged to form a bridge with 2 supporting members.
This bridge was sensitive to 500 cps acoustic energy
which vibrated it, interrupting the current and
releasing the dog. The energy around 500 cps
contained in the vowel of the word Rex was sufficient
to trigger the device when the dog’s name was called.”
1952 Bell Labs Digits
• First word (digit) recognizer
• Approximates energy in formants (vocal
tract resonances) over word
• Already has some robust ideas
(insensitive to amplitude, timing variation)
• Worked very well
• Main weakness was technological (resistors
and capacitors)
Digit Patterns
[Block diagram: the spoken digit is split into two bands by a 1 kHz high-pass filter and an 800 Hz low-pass filter; each band passes through a limiting amplifier into an axis-crossing counter, and the two counts place each digit on a 2-D pattern (roughly 200-800 Hz against 1-3 kHz).]
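The axis-crossing counters in the diagram estimate the dominant frequency in each filtered band. A minimal sketch of that idea (not from the slides; NumPy-based, with illustrative names):

```python
import numpy as np

def axis_crossing_rate(band, sample_rate):
    """Estimate the dominant frequency of a band-passed signal from its zero crossings.
    A sinusoid at f Hz crosses zero about 2*f times per second."""
    crossings = np.sum(np.signbit(band[:-1]) != np.signbit(band[1:]))
    return crossings * sample_rate / (2.0 * len(band))
```

In the spirit of the diagram, one such estimate per band would give the two coordinates of a digit's pattern.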
The 60’s
• Better digit recognition
• Breakthroughs: Spectrum Estimation (FFT,
cepstra, LPC), Dynamic Time Warp (DTW),
and Hidden Markov Model (HMM) theory
• 1969 Pierce letter to JASA:
“Whither Speech Recognition?”
Pierce Letter
• 1969 JASA
• Pierce led Bell Labs Communications
Sciences Division
• Skeptical about progress in speech
recognition, motives, scientific
approach
• Came after two decades of research by
many labs
Pierce Letter (Continued)
ASR research was government-supported.
He asked:
•Is this wise?
•Are we getting our money’s worth?
Purpose for ASR
• Talking to machines: had “gone downhill since…….Radio Rex”
Main point: to really get somewhere,
need intelligence, language
• Learning about speech
Main point: need to do science, not just
test “mad schemes”
1971-76 ARPA Project
• Focus on Speech Understanding
• Main work at 3 sites: System Development
Corporation, CMU and BBN
• Other work at Lincoln, SRI, Berkeley
• Goal was 1000-word ASR, a few speakers,
connected speech, constrained grammar,
less than 10% semantic error
Results
• Only CMU Harpy fulfilled goals -
used LPC, segments, lots of high level
knowledge, learned from Dragon *
(Baker)
* The CMU system developed in the early ’70s, as opposed to the company formed in the ’80s
Achieved by 1976
• Spectral and cepstral features, LPC
• Some work with phonetic features
• Incorporating syntax and semantics
• Initial Neural Network approaches
• DTW-based systems (many)
• HMM-based systems (Dragon, IBM)
Automatic Speech Recognition
Data Collection
Pre-processing
Feature Extraction (Framewise)
Hypothesis Generation
Cost Estimator
Decoding
Framewise Analysis of Speech
[Figure: the signal is cut into successive frames; Frame 1 yields feature vector X1, Frame 2 yields feature vector X2, and so on.]
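A minimal sketch of framewise analysis, assuming a NumPy signal array; the frame length, hop, and the log-energy "feature" are illustrative placeholders, not the deck's actual front end:

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Cut a 1-D signal into overlapping frames (e.g., 25 ms windows every 10 ms at 16 kHz).
    Assumes len(signal) >= frame_len."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n_frames)])

def framewise_features(signal):
    """One feature vector per frame; here just windowed log-energy as a stand-in."""
    frames = frame_signal(signal)
    frames = frames * np.hamming(frames.shape[1])
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)[:, None]   # shape (n_frames, 1)
```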
1970’s Feature Extraction
• Filter banks - explicit, or FFT-based
• Cepstra - Fourier components of the log spectrum (see the sketch below)
• LPC - linear predictive coding
(related to acoustic tube)
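As a rough illustration of the cepstral idea in the bullet above (a sketch, not any particular system's front end): take the inverse transform of a frame's log-magnitude spectrum.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of one frame: inverse FFT of the log-magnitude spectrum.
    Low-order coefficients describe the smooth spectral envelope."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    return np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
```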
LPC Spectrum
LPC Model Order
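The two figures above show LPC spectra and the effect of model order. Below is a sketch of the autocorrelation (Levinson-Durbin) method for the all-pole coefficients and the corresponding smooth envelope; it assumes a NumPy frame and an illustrative order of 12:

```python
import numpy as np

def lpc(frame, order=12):
    """LPC polynomial [1, a1, ..., a_order] via the autocorrelation method (Levinson-Durbin)."""
    x = frame * np.hamming(len(frame))
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]   # lags 0..order
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])) / err          # reflection coefficient
        a[1 : i + 1] = (np.concatenate([a[1:i], [0.0]])
                        + k * np.concatenate([a[i - 1 : 0 : -1], [1.0]]))
        err *= 1.0 - k * k
    return a

def lpc_envelope(frame, order=12, n_fft=512):
    """Un-normalized spectral envelope of the all-pole model: 1 / |A(e^{jw})|."""
    return 1.0 / np.abs(np.fft.rfft(lpc(frame, order), n_fft))
```

A higher model order gives a more detailed (peakier) envelope, which is what the model-order figure illustrates.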
Spectral Estimation
[Table comparing filter banks, cepstral analysis, and LPC on: reduced pitch effects, excitation estimate, direct access to spectra, less resolution at high frequencies, orthogonal outputs, peak-hugging property, and reduced computation.]
Dynamic Time Warp
• Optimal time normalization with dynamic programming (see the sketch below)
• Proposed by Sakoe and Chiba, circa 1970
• A similar proposal was made at about the same time by Itakura
• Probably Vintsyuk was first (1968)
• Good review article by White, in Trans. ASSP, April 1976
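A minimal DTW sketch under the usual insert/delete/match step pattern, assuming two sequences of feature vectors; names are illustrative:

```python
import numpy as np

def dtw_distance(x, y, dist=lambda a, b: np.linalg.norm(a - b)):
    """Minimum summed frame distance over monotonic alignments of x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

In an isolated-word recognizer of this era, the test utterance would be compared against each stored template and the smallest DTW distance would win.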
Nonlinear Time Normalization
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the
original CMU Dragon System (1974)
• Developed by IBM (Baker, Jelinek, Bahl,
Mercer,….) (1970-1993)
• Extended by others in the mid-1980’s
A Hidden Markov Model
[Figure: a chain of states q1, q2, q3 with emission probabilities P(x | q1), P(x | q2), P(x | q3) and transition probabilities P(q2 | q1), P(q3 | q2), P(q4 | q3).]
Markov model (state topology)
[Figure: two-state topology q1 → q2]
P(x1, x2, q1, q2) = P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
Markov model (graphical form)
[Figure: graphical model with states q1, q2, q3, q4 in a chain, each emitting an observation x1, x2, x3, x4.]
HMM Training Steps
• Initialize estimators and models
• Estimate “hidden” variable probabilities
• Choose estimator parameters to maximize
model likelihoods
• Assess and repeat steps as necessary
• A special case of Expectation Maximization (EM) - see the sketch below
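A compact sketch of one such iteration (Baum-Welch) for a discrete-output HMM; it is unscaled, so it will underflow on long sequences, and all names are illustrative:

```python
import numpy as np

def em_step(A, B, pi, obs):
    """One EM update. A: (S,S) transitions, B: (S,V) emissions, pi: (S,), obs: int sequence."""
    S, T = len(pi), len(obs)
    alpha, beta = np.zeros((T, S)), np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                              # forward pass
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0                                            # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    like = alpha[-1].sum()
    gamma = alpha * beta / like                               # E-step: state posteriors
    xi = (alpha[:-1, :, None] * A[None]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / like  # E-step: transition posteriors
    A_new = xi.sum(0) / gamma[:-1].sum(0)[:, None]            # M-step: re-estimate parameters
    B_new = np.zeros_like(B)
    for t, o in enumerate(obs):
        B_new[:, o] += gamma[t]
    B_new /= gamma.sum(0)[:, None]
    return A_new, B_new, gamma[0], like
```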
The 1980’s
• Collection of large standard corpora
• Front ends: auditory models, dynamics
• Engineering: scaling to large
vocabulary continuous speech
• Second major (D)ARPA ASR project
• HMMs become ready for prime time
Standard Corpora Collection
• Before 1984, chaos
• TIMIT
• RM (later WSJ)
• ATIS
• NIST, ARPA, LDC
Front Ends in the 1980’s
• Mel cepstrum (Bridle, Mermelstein)
• PLP (Hermansky)
• Delta cepstrum (Furui)
• Auditory models (Seneff, Ghitza, others)
Mel Frequency Scale
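One common form of the mel mapping shown in the figure (the exact curve varies by implementation):

```python
import numpy as np

def hz_to_mel(f_hz):
    """Mel scale: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# hz_to_mel(1000) is about 1000; equally spaced mel bands therefore get wider in Hz
# at high frequencies, which is how mel-cepstral filterbanks are laid out.
```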
Spectral vs Temporal Processing
[Figure: in a time-frequency representation, spectral processing operates across frequency within a frame (analysis, e.g., cepstral), while temporal processing operates along time within a frequency band (e.g., mean removal).]
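The canonical temporal-processing example mentioned above is mean removal; a one-line sketch over a (frames x coefficients) feature array:

```python
import numpy as np

def mean_removal(features):
    """Subtract each coefficient's mean over time (cepstral mean subtraction),
    which removes fixed channel effects from a (T, D) feature array."""
    return features - features.mean(axis=0, keepdims=True)
```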
Dynamic Speech Features
• temporal dynamics useful for ASR
• local time derivatives of cepstra
• “delta’’ features estimated over
multiple frames (typically 5)
• usually augments static features
• can be viewed as a temporal filter (see the sketch below)
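A sketch of the usual regression-based delta computation over 5 frames (n = 2); the (frames x coefficients) array layout is assumed:

```python
import numpy as np

def delta_features(cepstra, n=2):
    """Delta (local time-derivative) features by regression over 2n+1 frames."""
    padded = np.pad(cepstra, ((n, n), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    deltas = np.zeros(cepstra.shape)
    for k in range(1, n + 1):
        deltas += k * (padded[n + k : n + k + len(cepstra)]
                       - padded[n - k : n - k + len(cepstra)])
    return deltas / denom

# typically appended to the static features: np.hstack([cepstra, delta_features(cepstra)])
```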
“Delta” impulse response
[Figure: the delta filter's impulse response over frames -2 to +2 is antisymmetric, with weights of magnitude about 0.1 at frames ±1 and 0.2 at frames ±2, and zero at frame 0.]
HMMs for Continuous Speech
• Using dynamic programming for continuous speech
(Vintsyuk, Bridle, Sakoe, Ney….)
• Application of Baker-Jelinek ideas to
continuous speech (IBM, BBN, Philips, ...)
• Multiple groups developing major HMM
systems (CMU, SRI, Lincoln, BBN, ATT)
• Engineering development - coping with
data, fast computers
2nd (D)ARPA Project
• Common task
• Frequent evaluations
• Convergence to good, but similar, systems
• Lots of engineering development - now up to 60,000-word recognition, in real time, on a workstation, with less than 10% word error
• Competition inspired others not in the project - Cambridge developed HTK, now widely distributed
Knowledge vs. Ignorance
• Using acoustic-phonetic knowledge
in explicit rules
• Ignorance represented statistically
• Ignorance-based approaches (HMMs)
“won”, but
• Knowledge (e.g., segments) becoming
statistical
• Statistics incorporating knowledge
Some 1990’s Issues
• Independence from the long-term spectrum
• Adaptation
• Effects of spontaneous speech
• Information retrieval/extraction with
broadcast material
• Query-style systems (e.g., ATIS)
• Applying ASR technology to related
areas (language ID, speaker verification)
Where Pierce Letter Applies
• We still need science
• Need language, intelligence
• Acoustic robustness still poor
• Perceptual research, models
• Fundamentals of statistical pattern
recognition for sequences
• Robustness to accent, stress, rate of speech,
……..
Progress in 30 Years
• From digits to 60,000 words
• From single speakers to many
• From isolated words to continuous
speech
• From no products to many products,
some systems actually saving LOTS
of money
Real Uses
• Telephone: phone company services
(collect versus credit card)
• Telephone: call centers for query
information (e.g., stock quotes,
parcel tracking)
• Dictation products: continuous
recognition, speaker dependent/adaptive
But:
• Still <97% accurate on “yes” for telephone
• Unexpected rate of speech causes doubling
or tripling of error rate
• Unexpected accent hurts badly
• Accuracy on unrestricted speech at 50-70%
• Don’t know when we know
• Few advances in basic understanding
Confusion Matrix for Digit Recognition
Overall error rate: 4.85%

| Class | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Error rate (%) | 4.5 | 6.0 | 4.5 | 6.5 | 3.5 | 2.0 | 5.0 | 2.0 | 10.5 | 4.5 |

[The slide's full 10x10 count matrix (rows = spoken digit, columns = recognized digit) is strongly diagonal, with diagonal counts ranging from 179 (for “9”) up to 196.]
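For reference, the per-class and overall error rates follow from the confusion counts as below (a generic sketch, not tied to the numbers above):

```python
import numpy as np

def error_rates(confusion):
    """Per-class and overall error rates from a confusion matrix (rows = spoken class)."""
    C = np.asarray(confusion, dtype=float)
    per_class = 1.0 - np.diag(C) / C.sum(axis=1)
    overall = 1.0 - np.trace(C) / C.sum()
    return per_class, overall
```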
Large Vocabulary CSR
[Figure: word error rate (%) versus year, 1988-1994; dashed curve: RM (1K words, perplexity 60); solid curve: WSJ0/WSJ1 (5K and 20-60K words, perplexity 100); the error-rate axis spans roughly 3-12%.]
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over:
global rate, local rate, pronunciation
within speaker, pronunciation across
speakers, phonemes in different
contexts
Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for equal to or greater than “human performance”
Main Causes of Speech Variability
• Environment: speech-correlated noise (reverberation, reflection); uncorrelated additive noise (stationary, nonstationary)
• Speaker: attributes (dialect, gender, age); manner of speaking (breath and lip noise, stress, Lombard effect, rate, level, pitch, cooperativeness)
• Input equipment: microphone (transmitter), distance from microphone, filter, transmission system (distortion, noise, echo), recording equipment
ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech
Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and hands-free acoustics
Automatic Speech Recognition
Data Collection
Pre-processing
Feature Extraction
Hypothesis Generation
Cost Estimator
Decoding
Pre-processing
[Figure: speech → room acoustics → microphone → linear filtering → sampling & digitization]
Issue: effect on modeling
Feature Extraction
[Figure: spectral analysis → auditory model / normalizations]
Issue: design for discrimination
Representations are Important
[Figure: a network fed the raw speech waveform scores 23% of frames correct; fed PLP features, it scores 70%.]
Hypothesis Generation
Issue: models of language and task
[Figure: competing hypotheses such as “cat” vs. “dog”, and “a dog is not a cat” vs. the implausible “a cat not is a dog”.]
Cost Estimation
• Distances
• Negative log probabilities, from: discrete distributions, Gaussians, Gaussian mixtures, neural networks (see the sketch below)
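A sketch of one such cost, the negative log probability of a frame under a diagonal-covariance Gaussian (illustrative only; real systems use mixtures or network outputs):

```python
import numpy as np

def gaussian_cost(x, mean, var):
    """-log N(x; mean, diag(var)) for one frame; lower cost = better acoustic match."""
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

# per-frame costs like this are summed along candidate paths during decoding
```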
Decoding
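Decoding is typically a dynamic-programming search for the lowest-cost (highest-probability) path. A minimal Viterbi sketch over precomputed per-frame log-likelihoods; array names are illustrative:

```python
import numpy as np

def viterbi(log_A, log_pi, frame_loglik):
    """Best state sequence. log_A: (S,S) transition log-probs, log_pi: (S,) initial
    log-probs, frame_loglik: (T,S) per-frame state log-likelihoods."""
    T, S = frame_loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0] = log_pi + frame_loglik[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_A          # cand[i, j]: come from state i into j
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + frame_loglik[t]
    path = [int(np.argmax(score[-1]))]                # backtrace from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))
```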
Pronunciation Models
Language Models
Most likely words: those that maximize the product
P(acoustics | words) P(words)
P(words) = ∏ P(word | history)
• bigram: history is the previous word
• trigram: history is the previous 2 words
• n-gram: history is the previous n-1 words (see the sketch below)
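A toy maximum-likelihood bigram sketch, using the deck's own “a dog is not a cat” example (no smoothing, which real systems need):

```python
from collections import Counter

def train_bigram(sentences):
    """P(word | previous word) from bigram and unigram counts."""
    bigrams, unigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return {(h, w): c / unigrams[h] for (h, w), c in bigrams.items()}

model = train_bigram([["a", "dog", "is", "not", "a", "cat"]])
print(model[("a", "dog")])   # P(dog | a) = 0.5 in this toy corpus
```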
System Architecture
[Figure: speech signal → signal processing → cepstrum → probability estimator → decoder → recognized words (“zero”, “three”, “two”); the decoder uses a pronunciation lexicon and a grammar; example phone probabilities from the estimator: “z” = 0.81, “th” = 0.15, “t” = 0.03.]
What’s Hot in Research
• Speech in noisy environments - Aurora
• Portable (e.g., cellular) ASR
• Multilingual conversational speech (EARS)
• Shallow understanding of deep speech
• Question answering
• Understanding meetings - or at least browsing them
21st Century ASR Research
• New (multiple) features and models
• New statistical dependencies
• Multiple time scales
• Multiple (larger) sound units
• Dynamic/robust pronunciation models
• Long-range language models
• Incorporating prosody
• Incorporating meaning
• Non-speech modalities
• Understanding confidence
Summary
• 2005 ASR based on 50+ years of research
• Core algorithms → products: 10-30 yrs
• Deeply difficult, but tasks can be chosen
that are easier in SOME dimension
• Much more yet to do, but
• Much can be done with current
technology