Introduction to Automatic Speech and Speaker Recognition

Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
[email protected]
Major speech recognition applications
• Conversational systems for accessing information services (e.g. automatic flight status or stock quote information systems)
• Systems for transcribing, understanding and summarizing ubiquitous speech documents (e.g. broadcast news, meetings, lectures, presentations, congressional records, court records, and voicemails)
Radio Rex – 1920s ASR
A sound-activated toy dog named “Rex” (from Elmwood Button Co.) could be called by name from his doghouse.
Front view of the spoken digit recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)
Photograph of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)
“Julie” doll with speech synthesis and recognition technology, produced by Worlds of Wonder in conjunction with Texas Instruments (1987)
Now
Speech chain

[Figure: the speech chain from talker to listener. On the talker’s side, a linguistic process produces motor commands, a physiological process drives the articulators, and a physical (acoustic) process carries the speech wave; on the listener’s side, the ear and auditory nerve (physiological process) feed a linguistic process. Auditory feedback returns the talker’s own speech to his ear. The message is discrete at both ends and continuous in the acoustic middle.]
Structure of speech production and recognition system based on information transmission theory

[Figure: in transmission theory, an information source feeds a channel, which feeds a decoder. The speech recognition process maps onto this as text generation (W) → speech production (S) → acoustic processing (X) → linguistic decoding (Ŵ): speech production and acoustic processing form the acoustic channel, and acoustic processing plus linguistic decoding form the speech recognition system.]

Ŵ = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X) = argmax_W P(X|W) P(W)
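The decision rule above can be sketched in a few lines: since P(X) is identical for every hypothesis W, the decoder only needs to compare P(X|W)·P(W). The two hypotheses and their scores below are made-up illustrative numbers, not outputs of real acoustic or language models.

```python
import math

# Toy hypotheses with made-up acoustic scores P(X|W) and priors P(W)
hypotheses = {
    "yes": {"p_x_given_w": 0.03, "p_w": 0.6},
    "no":  {"p_x_given_w": 0.05, "p_w": 0.4},
}

def decode(hyps):
    # P(X) is constant across hypotheses, so it drops out of the argmax;
    # summing log-probabilities avoids numerical underflow.
    return max(hyps, key=lambda w: math.log(hyps[w]["p_x_given_w"])
                                   + math.log(hyps[w]["p_w"]))

print(decode(hypotheses))  # "no": 0.05 * 0.4 = 0.020 > 0.03 * 0.6 = 0.018
```

In a real recognizer the maximization runs over an enormous space of word sequences, which is why the search techniques on the following slides matter.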
Mechanism of state-of-the-art speech recognizers

[Figure: speech input → acoustic analysis → feature vectors x1 ... xT → global search: maximize P(x1 ... xT | w1 ... wk) · P(w1 ... wk) over word sequences w1 ... wk, drawing on a phoneme inventory, a pronunciation lexicon, and a language model → recognized word sequence.]
Speech energy, sound spectrogram, waveform and pitch

[Figure: four time-aligned panels over 0.0–0.9 s: power (0–80 dB), spectrum (0–6 kHz), waveform, and pitch.]
Feature vector (short-time spectrum) extraction from speech

[Figure: a time window of one frame length is shifted along the waveform by one frame period; each windowed frame yields a feature vector.]
Block diagram of a typical speech analysis procedure

• Low-pass filter (cut-off frequency = 8 kHz)
• A/D conversion (sampling and quantization; sampling frequency = 16 kHz, quantization = 16 bit)
• Analysis frame extraction (frame length = 30 ms, frame interval = 10 ms)
• Windowing (Hamming, Hanning, etc.; window length = frame length)
• Spectral analysis (FFT, LPC, etc.)
• Feature extraction → parametric representation (excitation parameters and vocal tract parameters)

[Figure also shows examples of speech waves at each stage and typical parameter values.]
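The framing, windowing and spectral-analysis steps above can be sketched with numpy, using the slide's typical parameter values (16 kHz sampling, 30 ms frames, 10 ms shift); the 440 Hz test tone is an arbitrary stand-in for real speech.

```python
import numpy as np

def short_time_spectra(signal, fs=16000, frame_len_ms=30, frame_shift_ms=10):
    """Split the signal into overlapping frames, apply a Hamming window,
    and compute the magnitude spectrum of each frame."""
    frame_len = int(fs * frame_len_ms / 1000)      # 480 samples
    frame_shift = int(fs * frame_shift_ms / 1000)  # 160 samples
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.array([
        np.abs(np.fft.rfft(signal[i * frame_shift:i * frame_shift + frame_len]
                           * window))
        for i in range(n_frames)])

t = np.arange(16000) / 16000                       # one second of "audio"
spectra = short_time_spectra(np.sin(2 * np.pi * 440 * t))
print(spectra.shape)                               # (98, 241)
```

Each of the 98 rows is one short-time spectrum; later slides turn these into compact feature vectors.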
Linear separable equivalent circuit model of the speech production mechanism

[Figure: a source G(ω) — a pulse train for voiced sounds (controlled by the fundamental period) or noise for unvoiced sounds, with a voiced/unvoiced switch and an amplitude control — drives a vocal tract articulation equivalent filter H(ω) characterized by spectral envelope parameters, producing the speech wave S(ω).]

S(ω) = G(ω) · H(ω)
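The relation S(ω) = G(ω)·H(ω) can be sketched numerically: multiplying spectra is equivalent to convolving the excitation with the filter's impulse response in time. The 100 Hz pitch and the single damped 500 Hz resonance below are made-up illustrative values, not a real vocal-tract model.

```python
import numpy as np

fs = 16000
f0 = 100                             # assumed fundamental frequency (Hz)
excitation = np.zeros(fs // 10)      # 100 ms of excitation g(t)
excitation[::fs // f0] = 1.0         # voiced: pulse every fundamental period

# Crude vocal-tract impulse response h(t): one damped 500 Hz resonance
n = np.arange(200)
h = np.exp(-n / 40.0) * np.cos(2 * np.pi * 500 * n / fs)

# Convolution in time corresponds to the product S(w) = G(w) H(w)
speech = np.convolve(excitation, h)[:len(excitation)]
S = np.fft.rfft(speech)
print(len(speech), S.shape)
```

Separating S(ω) back into its G(ω) and H(ω) factors is exactly what the cepstrum analysis on the next slides does.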
Spectral structure of speech

[Figure: the short-time (log) speech spectrum — what we are hearing — is the product of the spectral fine structure, whose periodicity reflects the fundamental frequency F0, and the spectral envelope, whose resonance peaks are the formants.]
Relationship between logarithmic spectrum and cepstrum

[Figure: in the logarithmic spectrum, the spectral fine structure is a fast periodic function of frequency f, while the spectral envelope is a slow function of f. After an IDFT, the two concentrate at different positions on the quefrency axis τ: the fine structure at high quefrency, the envelope at low quefrency.]
Block diagram of cepstrum analysis for extracting spectral envelope and fundamental period

Sampled sequence → time window → |DFT| → logarithmic transform → IDFT → cepstral window (liftering):
• low-quefrency elements → DFT → spectral envelope
• high-quefrency elements → peak extraction → fundamental period
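The pipeline above is short enough to sketch directly. The liftering cutoff of 30 quefrency samples is an assumed value (it is not stated on the slide), and the pulse-train input is a synthetic stand-in for a voiced frame.

```python
import numpy as np

def cepstrum_analysis(frame, fs=16000, lifter_cut=30):
    """|DFT| -> log -> IDFT, then a cepstral window: low-quefrency
    coefficients go back through a DFT to give the spectral envelope;
    the high-quefrency peak gives the fundamental period."""
    log_spec = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cep = np.fft.irfft(log_spec, n=len(frame))
    low = cep.copy()
    low[lifter_cut:-lifter_cut] = 0.0          # keep low quefrency only
    envelope = np.fft.rfft(low).real           # smoothed log spectrum
    peak = lifter_cut + np.argmax(cep[lifter_cut:len(cep) // 2])
    return envelope, peak / fs                 # period in seconds

# A pulse train with a 160-sample (10 ms) period; the cepstral peak
# lands at the fundamental period or one of its multiples (rahmonics).
frame = np.zeros(1024)
frame[::160] = 1.0
envelope, period = cepstrum_analysis(frame)
print(period * 16000)
```

Real pitch extractors add windowing and restrict the peak search to the plausible human pitch range; this sketch omits both.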
Cepstrum and delta-cepstrum coefficients

[Figure: along the parameter (vector) trajectory, the instantaneous vector is the cepstrum and the transitional (velocity) vector is the delta-cepstrum.]
MFCC-based front-end processor

Speech → FFT → FFT-based spectrum → mel-scale triangular filters → log → DCT → cepstral coefficients, plus Δ and Δ² dynamic features → acoustic vector
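A minimal sketch of this front end, together with the delta features from the previous slide. The filter and coefficient counts (24 filters, 12 coefficients) are assumed typical values, and the gradient-based delta is a simple stand-in for the regression formula production systems use.

```python
import numpy as np

def mfcc(frame, fs=16000, n_filters=24, n_ceps=12):
    """FFT -> mel-scale triangular filters -> log -> DCT."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(0, hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor(len(frame) // 2 * mel_to_hz(mel_pts) / (fs / 2)).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):                 # triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_energy = np.log(fbank @ power + 1e-10)
    # DCT-II decorrelates the log filterbank energies
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    np.arange(n_filters) + 0.5) / n_filters)
    return basis @ log_energy

def deltas(ceps_seq):
    """Transitional (velocity) features over the cepstral trajectory."""
    return np.gradient(ceps_seq, axis=0)

rng = np.random.default_rng(0)
frames = rng.standard_normal((5, 480))         # stand-in for windowed frames
ceps = np.stack([mfcc(f) for f in frames])
print(ceps.shape, deltas(ceps).shape)          # (5, 12) (5, 12)
```

Stacking the 12 MFCCs with Δ and Δ² gives the 39-dimensional-style acoustic vector the slide refers to.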
Structure of phoneme HMMs

[Figure: each phoneme model (phoneme k−1, k, k+1 in sequence) is a left-to-right HMM with three states (1, 2, 3), state transition probabilities (values such as 0.2–0.7 in the example), and output probability densities b1(x), b2(x), b3(x) over the feature vectors x aligned with the model over time.]
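The likelihood such an HMM assigns to a feature-vector sequence is computed by the forward algorithm, sketched below for a single 3-state left-to-right model. The transition values and the random stand-ins for b_i(x_t) are made-up illustrative numbers.

```python
import numpy as np

def forward_log_likelihood(log_A, log_B, log_pi):
    """Forward algorithm: log P(x1..xT | model), given log transition
    probs log_A[i, j], per-frame log output probs log_B[t, i]
    (standing in for log b_i(x_t)), and initial log probs log_pi."""
    T, _ = log_B.shape
    alpha = log_pi + log_B[0]
    for t in range(1, T):
        m = alpha[:, None] + log_A             # predecessor x successor
        # log-sum-exp over predecessor states, for numerical stability
        alpha = m.max(0) + np.log(np.exp(m - m.max(0)).sum(0)) + log_B[t]
    return alpha.max() + np.log(np.exp(alpha - alpha.max()).sum())

# A 3-state left-to-right phoneme model with made-up probabilities
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])
rng = np.random.default_rng(0)
log_B = np.log(rng.uniform(0.1, 1.0, size=(10, 3)))   # 10 frames, 3 states
ll = forward_log_likelihood(np.log(A + 1e-300), log_B, np.log(pi + 1e-300))
print(ll)
```

In a full recognizer, phoneme models like this are concatenated according to the pronunciation lexicon and searched jointly with the language model.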
An example of FSN (Finite State Network) grammar

[Figure: a word network over numbered grammar states; paths through the states produce sentences such as “I WANT ONE BOOK”, built from the vocabulary below.]

1. I  2. WANT  3. NEED  4. THREE  5. ONE  6. A  7. AN  8. BOOK  9. BOOKS  10. COAT  11. COATS  12. NEW  13. OLD
Statistical language modeling

Probability of the word sequence w1^k = w1 w2 ... wk:

P(w1^k) = Π_{i=1}^{k} P(wi | w1 w2 ... wi−1) = Π_{i=1}^{k} P(wi | w1^{i−1})

P(wi | w1^{i−1}) = N(w1^i) / N(w1^{i−1})

where N(w1^i) is the number of occurrences of the string w1^i in the given training corpus.

Approximation by Markov processes:
Bigram model: P(wi | w1^{i−1}) ≈ P(wi | wi−1)
Trigram model: P(wi | w1^{i−1}) ≈ P(wi | wi−2 wi−1)

Smoothing of the trigram by the deleted interpolation method:
P̂(wi | wi−2 wi−1) = λ1 P(wi | wi−2 wi−1) + λ2 P(wi | wi−1) + λ3 P(wi)
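The counting and interpolation above can be sketched for a bigram model (the trigram case is the same idea with one more context word). The λ weights below are assumed values; as the slide implies, they are normally estimated on held-out data, and the three-sentence corpus is purely illustrative.

```python
from collections import Counter

def train_bigram(corpus):
    """Relative-frequency bigram estimates P(w|prev) = N(prev w)/N(prev),
    smoothed by deleted interpolation with the unigram and a uniform floor."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sent in corpus:
        words = ["<s>"] + sent.split()
        vocab.update(words)
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total = sum(unigrams.values())

    def prob(w, prev, lam1=0.7, lam2=0.25, lam3=0.05):
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = unigrams[w] / total
        return lam1 * p_bi + lam2 * p_uni + lam3 / len(vocab)

    return prob

prob = train_bigram(["i want a book", "i need an old coat", "i want three books"])
assert prob("want", "i") > prob("coat", "i")   # "i want" is seen, "i coat" is not
```

The interpolation guarantees every word pair gets nonzero probability, which is what keeps unseen histories from zeroing out the whole product P(w1^k).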
Overview of statistical speech recognition

[Figure: front-end parameterization converts the speech waveform into the parameterized observation X; acoustic models, a pronouncing dictionary (e.g. “this is speech” → “th ih s ih z s p iy ch”) and a language model combine to score P(W) · P(X|W) for candidate word sequences W.]
Complete Hidden Markov Model of a simple grammar

[Figure: a YES/NO grammar with a silence state S(0) at the start. Word transition probabilities: P(wt = YES | wt−1 = sil) = 0.2, P(wt = NO | wt−1 = sil) = 0.2, P(wt = sil | wt−1 = YES) = 1, P(wt = sil | wt−1 = NO) = 1. W = Yes is modeled by phonemes ‘YE’ and ‘S’ over states S(1)–S(6); W = No by phonemes ‘N’ and ‘O’ over states S(7)–S(12). State transitions follow P(st | st−1) and observations follow output probabilities such as P(Y | st = s(12)).]
A unigram grammar network, where the unigram probability is attached as the transition probability from starting state S to the first state of each word HMM.

[Figure: state S branches to word HMMs W1, W2, ..., WN with transition probabilities P(W1), P(W2), ..., P(WN).]
A bigram grammar network, where the bigram probability P(wj|wi) is attached as the transition probability from word wi to wj.

[Figure: fully connected network over words w1, w2, ..., wN, including self-loops such as P(w1|w1), P(w2|w2), P(wN|wN) and cross transitions such as P(w2|w1), P(w1|wN), P(wN|w2).]
A trigram grammar network where the trigram probability P(wk|wi, wj) is attached to transition from grammar state wi, wj to the next word wk. Illustrated here is a two-word vocabulary, so there are four grammar states in the trigram network.
[Figure: for the two-word vocabulary {w1, w2} there are four grammar states (w1, w1), (w1, w2), (w2, w1), (w2, w2), with transitions labeled by trigram probabilities such as P(w1|w1, w1), P(w2|w1, w1), P(w1|w2, w1), P(w2|w1, w2), P(w1|w2, w2) and P(w2|w2, w2).]
System diagram of a generic speech recognizer based on statistical models, including training and decoding processes and the main knowledge sources.
[Figure: training — a speech corpus with manual transcription feeds feature extraction and HMM training to produce the acoustic models P(X|H); a normalized text corpus feeds N-gram estimation to produce the language model P(W); training and recognizer lexicons provide the pronunciations P(H|W). Decoding — the speech sample Y passes through the acoustic front-end to give X, and the decoder combines P(W), P(H|W) and P(X|H) to output the recognized word sequence W*.]
HMM-based speech synthesis system

[Figure: Training part — excitation parameters and spectral parameters are extracted from a speech database and, together with labels, used to train context-dependent HMMs. Synthesis part — input text goes through text analysis to produce labels; excitation and spectral parameters are generated from the HMMs; excitation generation drives the synthesis filter, producing synthesized speech.]
Main causes of acoustic variation in speech

• Microphone: distortion, electrical noise, directional characteristics
• Noise: other speakers, background noise, reverberations
• Channel: distortion, noise, echoes, dropouts
• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Task/Context: man-machine dialogue, dictation, free conversation, interview
• Phonetic/prosodic context
Progress of speech recognition technology since 1980

[Figure: applications plotted against vocabulary size (2, 20, 200, 2000, 20000 words, unrestricted) and speaking style (isolated words → connected speech → read speech → fluent speech → spontaneous speech → natural conversation), from 1980 to 2000: voice commands, digit strings, name dialing, directory assistance, form fill by voice, word spotting, office dictation, car navigation, system-driven dialogue, 2-way dialogue, transcription, network agent & intelligent messaging.]
Various speech applications

[Figure: core software products plotted by performance versus resource: voice control, hands-free operation, game, cell phone, portable device, personal navigation, car information, home appliance, home server, voice expert system, information retrieval, automated call center, public information terminal, contents retrieval, multimedia contents retrieval, robot.]
Speaker recognition

• Speaker verification: confirm the identity claim (banking transactions, database access services, security control for confidential information)
• Speaker identification: determine which of the registered speakers is talking (criminal investigations)
• Text-dependent methods
• Text-independent methods
• Intersession variability (variability over time) of speech waves and spectra
• Spectral/likelihood equalization (normalization)
Applications of speaker recognition technology

• Access control: for physical facilities, computer networks, websites and automated password reset services.
• Transaction authentication: for telephone banking and remote electronic and mobile purchases (e- and m-commerce).
• Law enforcement: home-parole monitoring, prison call monitoring and corroborating aural/spectral inspections of voice samples for forensic analysis.
• Speech data management: label incoming voice mail with speaker name for browsing and/or action; annotate recorded meetings or video with speaker labels for quick indexing and filing.
• Personalization: store and retrieve personal settings/preferences for a multi-user site or device; use speaker characteristics for directed advertisement or services.
Principal structure of speaker recognition systems

[Figure: the speech wave passes through feature extraction; similarity/distance is computed against reference models/templates trained for each speaker, yielding the recognition result.]
Basic structure of speaker recognition systems — (a) speaker identification

[Figure: the speech wave passes through feature extraction; similarity is computed against the reference template or model of each registered speaker (#1 ... #N), and maximum selection over the similarities yields the identification result (speaker ID).]
Basic structure of speaker recognition systems — (b) speaker verification

[Figure: the speech wave passes through feature extraction; similarity is computed against the reference template or model of the claimed speaker ID (#M), and a threshold decision yields the verification result (accept/reject).]
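The threshold decision can be sketched with deliberately simple stand-ins: the "reference model" below is just a mean feature vector and the similarity is cosine, whereas real systems use GMMs, i-vectors or neural embeddings. The threshold of 0.85 and the synthetic feature data are assumed illustrative values; in practice the threshold is tuned on development data.

```python
import numpy as np

def verify(test_features, reference, threshold=0.85):
    """Accept the identity claim if the cosine similarity between the
    test utterance's mean feature vector and the claimed speaker's
    reference template exceeds the threshold."""
    test_vec = test_features.mean(axis=0)
    sim = test_vec @ reference / (np.linalg.norm(test_vec)
                                  * np.linalg.norm(reference))
    return sim >= threshold, sim

rng = np.random.default_rng(0)
reference = rng.standard_normal(12)                        # enrolled template
genuine = reference + 0.1 * rng.standard_normal((50, 12))  # claimed speaker
impostor = rng.standard_normal((50, 12))                   # different speaker
accept_g, sim_g = verify(genuine, reference)
accept_i, sim_i = verify(impostor, reference)
print(sim_g, sim_i)  # the genuine trial scores far higher than the impostor
```

Identification (slide (a)) is the same computation run against every registered speaker's template, followed by an argmax instead of a threshold.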
Past and future

• Speech recognition technology has made very significant progress in the past 50+ years with the help of computer technology.
• The majority of technological changes have been directed toward the purpose of increasing robustness of recognition.
• However, there still remain many unsolved problems.
• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.
• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.