Post on 15-Aug-2015
Performance analysis of Bangla Speech Recognizer
model using Hidden Markov Model (HMM)
Submitted by: Md. Abdullah-al-MAMUN
OUTLINE
- What is speech recognition?
- The Structure of ASR
- Speech Database
- Feature Extraction
- Hidden Markov Model
- Forward algorithm
- Backward algorithm
- Viterbi algorithm
- Training & Recognition
- Result
- Conclusions
- References
What is Speech Recognition?

In computer science, speech recognition is the translation of spoken words into text: the process of converting an acoustic signal captured by a microphone into a set of words. Speech recognition is also known as "Automatic Speech Recognition (ASR)" or "Speech to Text (STT)".
Model of Bangla Speech Recognition

Fig. 1: Simple model of Bangla Speech Recognition (signal interface -> feature extraction -> recognition, with a speech database used for training the HMM).
The Structure of an ASR System

Figure 1: Functional scheme of an ASR system (speech samples -> feature extraction -> features Y -> HMM training, with state sequence S, and recognition -> recognized words W*).
Speech Database

- A speech database is a collection of recorded speech, accessible on a computer and supported with the necessary transcriptions.
- The databases collect the observations required for parameter estimation.
- In this ASR system, I have used about 1200 keywords.
Classification of Keywords

Fig.: Classification of keywords: Bengali words are composed of vowels (independent and dependent), consonants, modifier characters, and compound characters.
Database Creation Process

Fig.: Database creation process (recorded speech samples are collected into the database).
Speech Signal Analysis

Feature Extraction for ASR:
- The aim is to extract voice features that distinguish the different phonemes of a language.

Fig.: Feature extraction maps the raw speech waveform to numeric feature vectors.
MFCC Extraction

Speech signal x(n) -> Pre-emphasis -> x'(n) -> Window -> xt(n) -> DFT -> Xt(k) -> Mel filter banks -> Yt(m) -> Log(|.|^2) -> IDFT -> MFCC yt(m)
MFCC stands for Mel-frequency cepstral coefficients, a representation of the short-term power spectrum of a sound used in audio processing. The MFCCs are the amplitudes of the resulting spectrum.
Explanatory Example

Fig.: Speech waveform of the phoneme "ae"; the waveform after pre-emphasis and Hamming windowing; its power spectrum; and the resulting MFCCs.
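The pipeline above can be sketched end to end in plain NumPy. This is a minimal illustration, not the recognizer's actual front end; the sample rate, frame length, hop, filter count, and coefficient count are assumed values.

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    """Toy MFCC front end: pre-emphasis -> windowing -> DFT ->
    mel filter bank -> log power -> DCT (the final inverse transform)."""
    # Pre-emphasis: x'(n) = x(n) - 0.97 x(n-1)
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Frame the signal and apply a Hamming window
    frames = []
    for start in range(0, len(sig) - n_fft + 1, hop):
        frames.append(sig[start:start + n_fft] * np.hamming(n_fft))
    frames = np.array(frames)
    # Power spectrum |X_t(k)|^2
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular mel-spaced filter bank
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # Log mel energies, then DCT-II to decorrelate into cepstral coefficients
    logmel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return logmel @ dct.T

t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440 * t))  # one second of a 440 Hz tone
```

Each row of `feats` is one feature vector of the kind fed to the HMM in the next slides.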
Feature Vector to P(O|M) via HMM

For each input word, the corresponding observation sequence O is assigned a probability P(O|M) that can be computed by the HMM.

Fig.: feature vector -> HMM -> P(O|M)
HMM Model

An HMM is specified by a five-tuple λ = (S, O, A, B, π).
Elements of an HMM

1) Set of hidden states S = {1, 2, ..., N}
   N: the number of states
2) Set of observation symbols O = {o1, o2, ..., oM}
   M: the number of observation symbols
3) The initial state distribution π = {πi}, πi = P(s0 = i), 1 <= i <= N
4) State transition probability distribution A = {aij}, aij = P(st = j | st-1 = i), 1 <= i, j <= N
5) Observation symbol probability distribution in state j: B = {bj(k)}, bj(k) = P(Xt = ok | st = j), 1 <= j <= N, 1 <= k <= M
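The five-tuple maps directly onto a small data structure. Below is a minimal sketch in Python, instantiated with the two-state, three-symbol example that appears in the forward/backward slides; the parameter values are an assumption reconstructed from the worked trellis numbers later in the deck.

```python
import numpy as np

class HMM:
    """lambda = (S, O, A, B, pi) for a discrete-observation HMM."""
    def __init__(self, A, B, pi):
        self.A = np.asarray(A)    # A[i][j] = P(s_t = j | s_{t-1} = i)
        self.B = np.asarray(B)    # B[j][k] = P(X_t = o_k | s_t = j)
        self.pi = np.asarray(pi)  # pi[i]   = P(s_0 = i)
        self.N = self.A.shape[0]  # number of hidden states
        self.M = self.B.shape[1]  # number of observation symbols

# Two-state example used in the following slides (values assumed).
model = HMM(A=[[0.7, 0.3], [0.5, 0.5]],
            B=[[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]],
            pi=[1.0, 0.0])
```

Each row of A and B is a probability distribution, so the rows must sum to one.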
Three Basic Problems in HMM

1. The Evaluation Problem: given a model λ = (A, B, π) and a sequence of observations O = (o1, o2, ..., oT), what is the probability P(O|λ), i.e., the probability that the model generates the observations?
2. The Decoding Problem: given a model λ = (A, B, π) and a sequence of observations O = (o1, o2, ..., oT), what is the most likely state sequence in the model that produces the observations?
3. The Learning Problem: given a model λ = (A, B, π) and a set of observations O = (o1, o2, ..., oT), how can we adjust the model parameters λ to maximize the joint probability P(O|λ)?

How to evaluate an HMM? The Forward algorithm.
How to decode an HMM? The Viterbi algorithm.
How to train an HMM? The Baum-Welch algorithm.
Calculate Probability P(O|M)

Fig.: Trellis for a three-state example with observations up, down, and no-change. At each time step, each state's forward probability is the sum over predecessor states of (previous forward probability) x (transition probability) x (output probability); adding the per-state values gives the likelihood of the observations so far. Add probabilities!
Forward Calculations – Overview

Fig.: A two-state trellis (states S1, S2) unrolled over times 2-4, with transition probabilities a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5 and output probabilities b1 = (0.6, 0.1, 0.3), b2 = (0.1, 0.7, 0.2).
Forward Calculations (t=2)

α1(1) = 1
α2(1) = 0
α1(2) = α1(1) b13 a11 + α2(1) b23 a21 = 0.21
α2(2) = α1(1) b13 a12 + α2(1) b23 a22 = 0.09

NOTE: α1(2) + α2(2) is the likelihood of the observations seen so far.
Forward Calculations (t=3)

Fig.: The same trellis at time 3; α1(3) and α2(3) are computed from α1(2) and α2(2) by the same recursion.
Forward Calculations (t=4)

Fig.: The trellis extended to time 4; α1(4) and α2(4) complete the forward pass.
Forward Calculation of the Likelihood Function

t=1: α1(1) = 1.0 (π1 = 1), α2(1) = 0.0 (π2 = 0); L(1) = 1.0
t=2: α1(2) = α1(1) a11 b13 + α2(1) a21 b23 = 0.21; α2(2) = α1(1) a12 b13 + α2(1) a22 b23 = 0.09; L(2) = 0.3
t=3: α1(3) = α1(2) a11 b12 + α2(2) a21 b22 = 0.0462; α2(3) = α1(2) a12 b12 + α2(2) a22 b22 = 0.0378; L(3) = 0.084
t=4: α1(4) = 0.021294; α2(4) = 0.010206; L(4) = 0.0315

L(t) = p(K1 ... Kt) = α1(t) + α2(t)
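The recursion in this table is the standard forward algorithm. A minimal sketch follows, with one presentational difference: the slides attach the output probability to the source state of each transition, while the textbook formulation below attaches it to the destination state. The two factorizations sum the same path products, so the total likelihood agrees; the example parameters are an assumption reconstructed from the worked numbers.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns (alpha, P(O | lambda)),
    where alpha[t, j] = P(o_1..o_t, s_t = j | lambda)."""
    pi, A, B = map(np.asarray, (pi, A, B))
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # termination: sum over final states

# Reconstructed example; observation symbols are 0-indexed.
A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
pi = [1.0, 0.0]
alpha, likelihood = forward(pi, A, B, obs=[2, 1, 0])
```

With these assumed parameters the total likelihood matches the table's L(4) = 0.0315, and the pre-emission sums `alpha[t-1] @ A` reproduce the slides' intermediate values (0.21, 0.09) and (0.0462, 0.0378).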
Backward Calculations – Overview

Fig.: The same two-state trellis (a11 = 0.7, a12 = 0.3, a21 = 0.5, a22 = 0.5), now traversed from the final time step backwards.
Backward Calculations (t=3)

Fig.: Backward pass at time 3: β1(3) = 0.6 and β2(3) = 0.1.
Backward Calculations (t=2)

β1(4) = 1
β2(4) = 1
β1(3) = 0.6
β2(3) = 0.1
β1(2) = a11 b12 β1(3) + a12 b12 β2(3) = 0.045
β2(2) = a21 b22 β1(3) + a22 b22 β2(3) = 0.245

NOTE: β1(2) + β2(2) is the likelihood of the observation/word sequence from t=2 onward.
Backward Calculations (t=1)

Fig.: The backward pass completed back to time 1 over the full trellis.
Backward Calculation of the Likelihood Function

t=1: β1(1) = a11 b13 β1(2) + a12 b13 β2(2) = 0.0315; β2(1) = a21 b23 β1(2) + a22 b23 β2(2) = 0.029; L(1) = π1 β1(1) + π2 β2(1) = 0.0315
t=2: β1(2) = 0.045; β2(2) = 0.245; L(2) = β1(2) + β2(2) = 0.290
t=3: β1(3) = 0.6; β2(3) = 0.1; L(3) = β1(3) + β2(3) = 0.7
t=4: β1(4) = 1; β2(4) = 1

L(t) = p(Kt ... KT)
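The backward pass admits the same compact form. As with the forward sketch, the code below uses the destination-state emission convention, so the intermediate β values differ from the slides' source-state convention, but the total likelihood is identical; the example parameters are the same reconstructed assumption.

```python
import numpy as np

def backward(pi, A, B, obs):
    """Backward algorithm: beta[t, i] = P(o_{t+1}..o_T | s_t = i, lambda).
    Returns (beta, P(O | lambda)) as a cross-check on the forward pass."""
    pi, A, B = map(np.asarray, (pi, A, B))
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                         # initialization: beta[T-1] = 1
    for t in range(T - 2, -1, -1):                 # induction, backwards in time
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = pi @ (B[:, obs[0]] * beta[0])     # termination
    return beta, likelihood

A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
pi = [1.0, 0.0]
beta, likelihood = backward(pi, A, B, obs=[2, 1, 0])
```

The likelihood agrees with the forward result, 0.0315, and the intermediate products `B[:, obs[t+1]] * beta[t+1]` reproduce the slides' values 0.045 and 0.245 at t=2.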
Calculate maxS Probability of State Sequence S

Fig.: The same three-state trellis, but each state now keeps only the highest-probability incoming path (max instead of sum); the best state sequence is read off by backtracking. Select highest probability!
Viterbi Algorithm – Overview

Fig.: The same two-state trellis; the Viterbi algorithm replaces the forward sum with a max over incoming paths and records a backpointer at each state.
Viterbi Algorithm (Forward Calculations t=2)

δ1(1) = 1
δ2(1) = 0
δ1(2) = max{ δ1(1) b13 a11, δ2(1) b23 a21 } = 0.21
δ2(2) = max{ δ1(1) b13 a12, δ2(1) b23 a22 } = 0.09
ψ1(2) = 1
ψ2(2) = 1
Viterbi Algorithm (Backtracking t=2)

Fig.: The backpointers ψ1(2) = 1 and ψ2(2) = 1 record that the best path into each state at time 2 comes from state 1.
Viterbi Algorithm (Forward Calculations)

Fig.: The max-product recursion extended to the next time step of the trellis.
Viterbi Algorithm (Backtracking)

Fig.: Backpointers recorded at each time step of the trellis.
Viterbi Algorithm (Forward Calculations t=4)

Fig.: The max-product recursion completed through time 4.
Viterbi Algorithm (Backtracking to Obtain Labeling)

Fig.: Starting from the best final state, the backpointers are followed back through the trellis to recover the most likely state sequence.
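The max-and-backtrack procedure above can be sketched compactly, again in the destination-state emission convention of the earlier forward/backward sketches and with the same reconstructed (assumed) example parameters.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi algorithm: returns (best state path, its probability).
    delta[t, j] = probability of the best path ending in state j at time t;
    psi[t, j]   = backpointer to that path's predecessor state."""
    pi, A, B = map(np.asarray, (pi, A, B))
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A     # scores[i, j] = delta_i(t-1) a_ij
        psi[t] = scores.argmax(axis=0)         # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtracking: follow the psi pointers from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()

A = [[0.7, 0.3], [0.5, 0.5]]
B = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
pi = [1.0, 0.0]
path, prob = viterbi(pi, A, B, obs=[2, 1, 0])
```

The only change from the forward algorithm is replacing the sum over predecessors with a max, plus recording where each max came from.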
Implementing HMM for Speech Modeling (Training and Recognition)

- Training: building HMM speech models from the correspondence between the observation sequences Y and the state sequences S.
- Recognition: recognizing speech using the stored HMM models and the actual observation Y.

Fig.: Speech samples -> feature extraction -> Y -> HMM training (state sequence S) and recognition -> W*.
RECOGNITION Process

Given an input speech signal, let S = (s1, s2, ..., sT) be the state sequence to be recognized, and let xt be the feature vector computed at time t, where the feature sequence from time 1 to t is denoted X = (x1, x2, ..., xt). The recognized state sequence S* is obtained by:

S* = ArgMax P(S, X | λ)

Fig.: The search algorithm combines a dynamic structure (the evolving hypotheses {st-1}) with a static structure (the model probabilities P(xt, {st} | {st-1}, λ)) to produce S*.
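The recognition step, picking the model that best explains the features, can be sketched as a scoring loop over word models. Everything here is a toy stand-in: the word names, per-word parameters, and quantised observation sequence are invented for illustration, and a real system would use continuous (e.g. Gaussian) emissions rather than this discrete table; only the argmax structure mirrors the recognizer.

```python
import numpy as np

def log_forward(pi, A, B, obs):
    """Log-likelihood log P(O | model) via the forward algorithm."""
    pi, A, B = map(np.asarray, (pi, A, B))
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return np.log(alpha.sum() + 1e-300)   # floor avoids log(0)

# Hypothetical per-word HMMs: (pi, A, B), parameters invented.
words = {
    "shuprobhat": ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]],
                   [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]),
    "dhonnobad":  ([1.0, 0.0], [[0.6, 0.4], [0.0, 1.0]],
                   [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]),
}

obs = [0, 0, 2, 2]   # a quantised feature sequence (toy input)
best = max(words, key=lambda w: log_forward(*words[w], obs))
```

Each stored word model scores the observation sequence with the forward algorithm, and the word with the highest score is returned as the recognition result.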
Result (Speaker Recognition)

Table 1: Speaker recognition result
Result (Isolated SR)

Table 2: Result for isolated speech recognition
Result (Continuous SR)

Table 3: Continuous speech recognition result
Conclusions

- No speech recognizer to date achieves 100% accuracy.
- Avoid poor-quality microphones; consider using a better microphone.
- One important matter is that training the computer further provides an even better experience.
Thank You