18.0 Some Recent Developments in NTU
Reference: 1. “Segmental Eigenvoice with Delicate Eigenspace for Improved Speaker Adaptation”, IEEE Transactions on Speech and Audio Processing, Vol.13, No.3, May 2005, pp.399-411.
2. “Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Montreal, Canada, May 2004, pp.197-200.
3. “Extension and Further Analysis of Higher Order Cepstral Moment Normalization (HOCMN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
4. “Powered Cepstral Normalization (P-CN) for Robust Features in Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
5. “Improved Spontaneous Mandarin Speech Recognition by Disfluency Interruption Point (IP) Detection Using Prosodic Features”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp.1621-1624.
6. “Prosodic Modeling in Large Vocabulary Mandarin Speech Recognition”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
7. “Latent Prosodic Modeling (LPM) for Speech with Applications in Recognizing Spontaneous Mandarin Speech with Disfluencies”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
8. “Entropy-based Feature Parameter Weighting for Robust Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
9. “A New Framework for System Combination Based on Integrated Hypothesis Space,” International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
10. “Improved Spoken Document Summarization Using Probabilistic Latent Semantic Analysis (PLSA)”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
11. “Analytical Comparison between Position Specific Posterior Lattices and Confusion Networks Based on Words and Subword Units for Spoken Document Indexing”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
12. “A Multi-Modal Dialogue System for Information Navigation and Retrieval across Spoken Document Archives with Topic Hierarchies”, Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop, San Juan, Nov-Dec 2005.
13. “Efficient Interactive Retrieval of Spoken Documents with Key Terms Ranked by Reinforcement Learning”, International Conference on Spoken Language Processing, Pittsburgh, USA, Sept 2006.
14. “Type-II Dialogue Systems for Information Access from Unstructured Knowledge Sources”, IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto, Japan, December 2007.
15. “Histogram-Based Quantization (HQ) for Robust and Scalable Distributed Speech Recognition”, European Conference on Speech Communication and Technology, Lisbon, Sept. 2005, pp.957-960.
16. “Joint Uncertainty Decoding (JUD) with Histogram-Based Quantization (HQ) for Robust and/or Distributed Speech Recognition”, International Conference on Acoustics, Speech and Signal Processing, Toulouse, France, May 2006.
Role of Spoken Language Processing under Network Environment
[Figure: User Interface, Content Analysis and User-Content Interaction arranged around the Internet]
• User Interface – when keyboards/mice are inadequate
• Content Analysis – helps in browsing/retrieval of multimedia content
• User-Content Interaction – all text-based interaction can be accomplished by spoken language
Hierarchy of Research Areas
[Figure: layered diagram of research areas, with Applications at the top and layers labeled Integrated Technologies, Applied Technologies and Basic Technologies. The numbers 1 to 15 on the figure mark the areas addressed by the references listed above.]
Areas shown in the figure:
• Multimedia Technologies
• Spoken Dialogue
• Speech-based Information Retrieval
• Dictation & Transcription
• Distributed Speech Recognition and Wireless Environment
• Multilingual Speech Processing
• Information Indexing & Retrieval
• Text-to-speech Synthesis
• Speech/Language Understanding
• Decoding & Search Algorithms
• Linguistic Processing & Language Modeling
• Wireless Transmission & Network Environment
• Speech Recognition Core
• Keyword Spotting
• Robustness: noise/channel, feature/model
• Hands-free Interaction: acoustic reception, microphone array, etc.
• Speaker Adaptation & Recognition
• Acoustic Processing: features, modeling, etc.
• Spoken Document Understanding and Organization
• Prosodic Modeling
• Spontaneous Speech Processing: pronunciation modeling, disfluencies, etc.
Segmental Eigenvoice
– Decompose the supervectors into sub-supervectors, from which sub-eigenspaces can be constructed, so that better performance can be obtained with more adaptation data
[Figures: Segmental Eigenvoice (1/3) to (3/3)]
Higher Order Cepstral Moment Normalization (HOCMN) for Robust Speech Recognition
– to reduce the mismatch between the statistical characteristics of the training and testing corpora by normalizing the cepstral moments
Cepstral Moment Normalization
• Moment Estimation:
– Time average: N-th moment of MFCC parameters about the origin
$$E[X^N(n)] = \frac{1}{T}\sum_{k=0}^{T-1} X^N(k) = \frac{1}{T}\sum_{k=0}^{T-1} [X(k)]^N$$
• Cepstral Normalization ($X_{[L]}$, $X_{[N]}$: parameters after normalization of order L or N):
– For odd order L: $E[X_{[L]}^L(n)] = 0$
  Example: CMS for L = 1
– For even order N: $E[X_{[N]}^N(n)] = M_N$
  Example: CMVN for orders 1 and 2 (L = 1, N = 2)
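The two examples named above, CMS and CMVN, can be sketched in a few lines. This is only an illustration of the first- and second-order cases using the time-average moment estimate; the function names and the sample values in `c1` are hypothetical, and higher orders are not covered.

```python
from statistics import fmean, pstdev

def moment(x, order):
    """Time-average estimate of the moment of the given order about the origin."""
    return fmean(v ** order for v in x)

def cms(x):
    """Order 1 (CMS): subtract the mean so that the first moment becomes 0."""
    mean = fmean(x)
    return [v - mean for v in x]

def cmvn(x):
    """Orders 1 and 2 (CMVN): zero mean and unit second moment."""
    mean = fmean(x)
    std = pstdev(x)  # population standard deviation, matching the 1/T average
    return [(v - mean) / std for v in x]

# Hypothetical values of one cepstral coefficient over T = 5 frames.
c1 = [1.2, -0.4, 3.1, 0.7, 2.4]
print(moment(cms(c1), 1))   # close to 0
print(moment(cmvn(c1), 2))  # close to 1
```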
Higher Order Cepstral Moment Normalization (HOCMN)
• Aurora 2, Clean Condition Training, Word Accuracy Averaged over 0~20 dB and All Types of Noise (sets A, B, C)
[Figure: word accuracy comparison; labels in the figure: CN, CTN = HOCMN[1,2,3], CN (l=86), CMVN, CTN = HOCMN[1,3,2], CMVN (l=86)]
Skewness and Kurtosis (1)
• Skewness
– Third moment about the mean and normalized to the standard deviation
– Departure of pdf from symmetry
  • Positive/negative indicates skew to right/left
  • Zero indicates symmetric
• Kurtosis
– Fourth moment about the mean and normalized to the standard deviation
– Peaked or “flat with tails of large size” as compared to standard Gaussian
  • “3” is the fourth moment of N(0,1)
  • Positive/negative indicates flatter/more peaked
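As a small worked example of the two definitions, the sketch below computes skewness and kurtosis from central moments, subtracting 3 (the fourth moment of N(0,1)) from the kurtosis. The function names and sample data are hypothetical.

```python
from statistics import fmean

def central_moment(x, order):
    """Moment of the given order about the sample mean."""
    mean = fmean(x)
    return fmean((v - mean) ** order for v in x)

def skewness(x):
    """Third central moment normalized by the standard deviation cubed."""
    return central_moment(x, 3) / central_moment(x, 2) ** 1.5

def excess_kurtosis(x):
    """Fourth central moment normalized by the variance squared, minus 3."""
    return central_moment(x, 4) / central_moment(x, 2) ** 2 - 3.0

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_skewed = [0.0, 0.0, 0.0, 10.0]
print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```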
Skewness and Kurtosis (2)
• Define: Generalized Skewness of Odd Order L
  $S_L = E(X^L)$, L: an odd integer
– L not necessarily 3
– Similar meaning as skewness (skew to right or left) except in the sense of L-th moment
• Define: Generalized Kurtosis of Even Order N
– N not necessarily 4
– Similar meaning as kurtosis (peaked or flat) except in the sense of N-th moment
Skewness and Kurtosis (3)
• Normalizing Odd Order Moment is to Constrain the pdf to be Symmetric about the Origin
– Except in the sense of L-th moment
• Normalizing Even Order Moment is to Constrain the pdf to be “Equally Flat with Tails of Equal Size” as Compared to a Standard Gaussian
– Except in the sense of N-th moment
Generalized Moments with Non-integer Orders
• The Orders of Normalized Moments are not necessarily Integers
• Generalized Moments
– Type 1: reduces to an odd order moment when u is an odd integer L (example: L = 1 or 3)
– Type 2: reduces to an even order moment when u is an even integer N (example: N = 2 or 4)
– HOCMN with Non-integer Moment Orders
PDF Analysis
• HEQ
– Over fitted to Gaussian
– Original statistics lost
• HOCMN
– Fitting the generalized skewness and kurtosis of a few orders only
– Retain more original characteristics
[Figure: pdfs of C0 and C1: original, after HEQ, and after HOCMN]
Use of Prosody in Recognition and Handling Disfluencies in Spontaneous Speech
— prosody may be useful in recognition, and in particular in handling disfluencies in spontaneous speech
[Figure: fundamental frequency (Hz) versus frame number, with pitch contours for Tone 2, Tone 3 and Tone 4]
Prosodic Features (Ⅰ) – Pitch-related Features
[Figure: pitch contour around syllable boundaries, marking P1, P2, d1 and d2]
• Pitch-related Features
– The average pitch value within the syllable
– The maximum difference of pitch value within the syllable
– The average of absolute values of pitch variations within the syllable
– The magnitude of pitch reset for boundaries
– The difference of such feature values of adjacent syllable boundaries (P1-P2, d1-d2, etc.)
– A total of 54 pitch-related features were obtained
• Duration-related Features
– A total of 38 duration-related features were obtained
Prosodic Features (Ⅱ) – Duration-related Features
[Figure: an utterance from beginning to end, with syllable durations A, B, C, D, E separated by syllable boundaries, and pauses a and b]
• Pause duration: b
• Average syllable duration: (B+C+D+E)/4 or ((D+E)/2 + C)/2
• Average syllable duration ratio: (D+E)/(B+C) or ((D+E)/2)/C
• Combination of pause & syllable features (ratio or product): C*b, D*b, C/b, D/b
• Lengthening: C / ((A+B)/2)
• Standard deviation of feature values
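A few of the pitch- and duration-related features listed above can be computed as in the sketch below. It assumes that "maximum difference" means the maximum minus the minimum pitch value and that "pitch variations" are frame-to-frame differences; both readings, as well as the function names and input values, are assumptions made for illustration.

```python
from statistics import fmean

def pitch_features(f0):
    """Pitch features of one syllable from its frame-level F0 values (Hz)."""
    steps = [abs(b - a) for a, b in zip(f0, f0[1:])]
    return {
        "average": fmean(f0),
        "max_difference": max(f0) - min(f0),   # assumed: max minus min
        "avg_abs_variation": fmean(steps),     # assumed: frame-to-frame changes
    }

def duration_features(A, B, C, D, E, b):
    """Duration features from syllable durations A..E and pause duration b."""
    return {
        "pause": b,
        "avg_syllable": (B + C + D + E) / 4,
        "avg_syllable_alt": ((D + E) / 2 + C) / 2,
        "ratio": (D + E) / (B + C),
        "ratio_alt": ((D + E) / 2) / C,
        "C_times_b": C * b,
        "D_times_b": D * b,
        "C_over_b": C / b,
        "D_over_b": D / b,
        "lengthening": C / ((A + B) / 2),
    }

pitch = pitch_features([100.0, 110.0, 105.0])
durations = duration_features(A=2, B=2, C=3, D=1, E=3, b=4)  # e.g. in frames
```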
Recognition Framework with Prosodic Modeling
• Rescoring Formula:
$$S(W) = \log P(X \mid W) + \lambda_l \log P(W) + \lambda_p \log P(F \mid W)$$
– λl, λp: weighting coefficients
– P(F|W): prosodic model
• Two-pass Recognition
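The rescoring step of the second pass can be sketched as follows: each hypothesis from the first pass carries an acoustic, a language model and a prosodic probability, and the combined score S(W) selects the output. The dictionary keys, the probabilities and the weights are hypothetical values chosen only to show the effect of the prosodic term.

```python
import math

def rescore(hypotheses, lambda_l, lambda_p):
    """Return the hypothesis maximizing S(W) = log P(X|W) + l*log P(W) + p*log P(F|W)."""
    def score(h):
        return (math.log(h["p_acoustic"])
                + lambda_l * math.log(h["p_lm"])
                + lambda_p * math.log(h["p_prosody"]))
    return max(hypotheses, key=score)

# Two hypothetical first-pass hypotheses.
hyps = [
    {"words": "A", "p_acoustic": 1e-10, "p_lm": 1e-3, "p_prosody": 1e-4},
    {"words": "B", "p_acoustic": 5e-11, "p_lm": 1e-3, "p_prosody": 1e-2},
]
without_prosody = rescore(hyps, lambda_l=1.0, lambda_p=0.0)["words"]
with_prosody = rescore(hyps, lambda_l=1.0, lambda_p=1.0)["words"]
```

With the prosodic weight set to zero the acoustically better hypothesis wins; once the prosodic model is weighted in, the ranking changes.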
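Prosodic Feature Extraction from Paths in the Word Graph
• Define the tone variable $T_{jk}$
• Directly take the LW boundaries as a prosodic cue ($B_{jk}$)
$$P(f_j \mid w_j) = \prod_{k=1}^{L_j} P(f_{jk} \mid T_{jk}, B_{jk})$$
– Lj: the length of the j-th word
Prosodic modeling
• Three models: Hybrid, GMM, Decision Tree
[Figure: block diagrams of the prosodic models. Labels in the figure: input features $f_{jk}$; GMM classifier giving $P(f_{jk} \mid T_{jk}, B_{jk})$; posteriors $P(T_{jk}, B_{jk} \mid f_{jk})$ and $P(B_{jk})$ related to $P(f_{jk} \mid T_{jk}, B_{jk})$ by Bayes' rule; transformed feature vector $f_{jk}' = [\,p(T_{jk}=1, B_{jk}=0 \mid f_{jk}),\ p(T_{jk}=2, B_{jk}=0 \mid f_{jk}),\ \dots,\ p(T_{jk}=5, B_{jk}=1 \mid f_{jk})\,]$]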
Examples of Disfluencies in Spontaneous Speech
– (*): the disfluency interruption point (IP)
– structure around the IP: reparandum, optional editing term, resumption
• Abandoned Utterances
  它 (ta1) 有 (you3) 一個 (yi2ge5) 呃 (E) 那邊 (ne4bian1) 有個 (you3ge5) 度假村 (du4jia4cun1) 嘛 (MA)
  it has one [discourse particle] there has a resort [discourse particle]
  “It has a * eh there is a resort there.”
• Overt Repair
  是 (shi4) 進口 (jin4kou3) 嗯 (EN) 出口 (chu1kou3) 嗎 (ma1)
  is import [discourse particle] export [interrogative particle]
  “Do you import * uhn export products?”
Spontaneous Speech Recognition with Disfluency Interruption Point (IP) Detection
• Rescoring with IP information
$$W^* = \arg\max_W P(W \mid X, F) \approx \arg\max_W P(W \mid F)\, P(X \mid W)$$
$$P(W \mid F) = \prod_n P(w_n \mid w_{n-N+1}^{n-1}, F) = \prod_n \sum_c P(c \mid w_{n-N+1}^{n-1}, F)\, P(w_n \mid w_{n-N+1}^{n-1}, c)$$
– c: IP class
– $P(c \mid w_{n-N+1}^{n-1}, F)$: IP probability given by detection models
– $P(w_n \mid w_{n-N+1}^{n-1}, c)$: word n-grams when crossing IP boundaries
• Recognition Results
[Figure: character accuracy (about 44 to 46.5) for the baseline and with disfluency handling, plotted over the settings 0.5, 0.9, 1.3, 2, 4]
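The inner sum of the rescoring formula marginalizes the word n-gram probability over the IP class c. A minimal sketch for one word position is given below; the class names and all probabilities are hypothetical.

```python
def word_prob_with_ip(p_ip_class, p_word_given_class):
    """Sum over IP classes c of P(c | history, F) * P(w_n | history, c)."""
    return sum(p_ip_class[c] * p_word_given_class[c] for c in p_ip_class)

# Hypothetical values for one word position.
p_ip_class = {"IP": 0.3, "non-IP": 0.7}            # from the IP detection model
p_word_given_class = {"IP": 0.10, "non-IP": 0.40}  # class-dependent n-grams
p_word = word_prob_with_ip(p_ip_class, p_word_given_class)
```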
Entropy-based Weighted Viterbi Decoding
– contribution of each feature parameter in Viterbi decoding weighted by its entropy with respect to different phone classes
– Notation: t: frame index; x(t): feature vector; d: index of feature parameter in x(t); c: class index
Entropy-based Weighting
• Basic Idea
– If a feature parameter is discriminative, its entropy value is low
– If it is not discriminative, its entropy value is high
[Figure: observation probability distributions of different classes]
Entropy Estimation by GMMs
• GMMs for Different Classes c
– “GMM c” is developed for the acoustic class “c” (c = 1, 2, …)
Entropy-based Weighted Viterbi Decoding
• Testing
• Viterbi decoding
$$\log[b_j(x(t))] = \sum_{d=1}^{D} W(t,d)\left(\log \sum_{m=1}^{M} c_{jm}\, N(x_d(t); \mu_{jmd}, \Sigma_{jmd})\right)$$
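The basic idea, low entropy for a discriminative feature parameter and high entropy otherwise, can be illustrated with the entropy of the class posteriors for one parameter value. The sketch below is a simplification: it uses a single Gaussian per class instead of a GMM and assumes equal class priors, and the class models and observation values are hypothetical. How the entropy is mapped to the weight W(t, d) is not shown.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def class_entropy(x, class_models):
    """Entropy of the class posteriors given one feature parameter value x.

    class_models: list of (mean, std), one Gaussian per class (assumed here
    in place of the per-class GMMs), with equal class priors assumed.
    """
    likelihoods = [gaussian_pdf(x, mean, std) for mean, std in class_models]
    total = sum(likelihoods)
    posteriors = [l / total for l in likelihoods]
    return -sum(p * math.log(p) for p in posteriors if p > 0.0)

overlapping = [(0.0, 1.0), (0.1, 1.0)]   # classes almost indistinguishable
separated = [(-5.0, 1.0), (5.0, 1.0)]    # classes well separated
h_high = class_entropy(0.05, overlapping)
h_low = class_entropy(4.0, separated)
```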
Experimental Results
• MFCC
– Consistent improvements for all types of noise and SNR conditions
• Similar Results for PLP and Other Features

MFCC       Original   Parameter Weighting   Relative Error Reduction (%)
Set A      61.34      68.00                 17.23
Set B      55.75      63.74                 18.06
Set C      66.14      69.46                  9.81
Average    61.08      67.07                 15.39
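The last column of the table is consistent with reading the first two columns as accuracies in percent and computing the reduction of the corresponding error rates. The short check below, with a hypothetical function name, recomputes that column from the table values.

```python
def relative_error_reduction(acc_before, acc_after):
    """Relative reduction (%) of the error rate 100 - accuracy."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before

rows = {
    "Set A": (61.34, 68.00),
    "Set B": (55.75, 63.74),
    "Set C": (66.14, 69.46),
    "Average": (61.08, 67.07),
}
for name, (before, after) in rows.items():
    print(name, round(relative_error_reduction(before, after), 2))
```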
System Combination by Integrated Hypothesis Space and Delicate Rescoring
– properly integrating useful information from different approaches
Conventional System Combination Approaches
[Figure: Input Speech → Decoder 1 … Decoder N (N-Best, Confusion Network, Inner Word graph) → Alignment Module → Voting Module → result]
• Issues: 1. Alignment Algorithms 2. Distortion introduced
Proposed Approach
[Figure: Input Speech → Decoder 1 … Decoder N → Direct Integration of Individual Hypothesis Space → Integrated Hypothesis Space → Delicate Rescoring → result]
• Produce Integrated Hypothesis Space with detailed time information
• Perform Delicate Rescoring on the Integrated Hypothesis Space
Hypothesis Space Integration
• Merged Word Graph
[Equation: merged word graph W defined from the arcs q1 of W1 and the arcs q2 of W2]
• If Two Word Arcs from Different Systems are Equal
– Define: q1 = q2 ≡ pw1 = pw2, w1 = w2, ts1 = ts2, te1 = te2
– S(q = q1 + q2) = combine(S(q1), S(q2)) if q1 = q2
• Others
– S(q = qi) = S(qi)
[Figure: word graphs of System 1 and System 2 with word arcs W1 to W10, and the merged word graph]
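The merging rule above can be sketched with each word graph stored as a dictionary from an arc (pw, w, ts, te) to its score: equal arcs are merged and their scores combined, all other arcs are copied. The `combine` function is left open in the source, so the averaging used here, like the arcs and scores, is only a hypothetical choice.

```python
def merge_word_graphs(graph1, graph2, combine):
    """Merge two word graphs given as {(pw, w, ts, te): score}.

    Two arcs are equal when previous word, word, start time and end time all
    match; their scores are then combined. Other arcs keep their own score.
    """
    merged = dict(graph1)
    for arc, score in graph2.items():
        if arc in merged:
            merged[arc] = combine(merged[arc], score)
        else:
            merged[arc] = score
    return merged

# Hypothetical arcs: (previous word, word, start frame, end frame) -> score.
system1 = {("<s>", "W1", 0, 12): 0.6, ("W1", "W4", 12, 30): 0.5}
system2 = {("<s>", "W1", 0, 12): 0.4, ("W1", "W5", 12, 28): 0.3}
merged = merge_word_graphs(system1, system2, combine=lambda a, b: (a + b) / 2)
```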
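Delicate Rescoring Example (Ⅰ) – Expected Phone Accuracy Score (EPA)
• Borrowing the Concept of Expected Phone Accuracy in MPE Training
– Accuracy of a phone p against the phones p' of a word w', with overlap ratio e(p, p'):
$$A(p \mid w') = \max_{p' \in w'} \begin{cases} -1 + 2\,e(p, p'), & \text{if } p \text{ and } p' \text{ are the same phone} \\ -1 + e(p, p'), & \text{if } p \text{ and } p' \text{ are different phones} \end{cases}$$
– Accuracy of a word arc q with phones $p_1, \dots, p_K$: $A(q) = \sum_{i=1}^{K} A(p_i)$, $p_i \in q$
– Score of a word arc: $S(q) = S_{EPA}(q) = E[A(q)]$, the expectation taken with the posteriors $P(w' \mid O)$ over the hypothesis space W
– Example (豪雨 vs. 陶藝, phone sequences h_a au sic_iu u and t_a au sic_i u): overlap 1/6 = 0.17 with t_a gives -1 + 2*0.17 = -0.66; overlap 5/6 = 0.83 with au gives -1 + 0.83 = -0.17
• Decoding Procedure
$$y^* = \arg\max_{y \in W} \sum_{k=1,\ q_k \in y}^{M} S(q_k)$$
– qk: the k-th word in the path y
– y: word sequence for a path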
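The numbers in the example above follow from the phone accuracy rule borrowed from MPE training: -1 + 2e for an overlapping phone of the same identity, -1 + e for a different one, taking the maximum. The sketch below reproduces them; the function name and the list-of-pairs input format are hypothetical.

```python
def phone_accuracy(phone, overlaps):
    """Phone accuracy of `phone` against overlapping phones.

    overlaps: list of (other_phone, overlap_ratio e) pairs.
    """
    candidates = []
    for other_phone, e in overlaps:
        if other_phone == phone:
            candidates.append(-1.0 + 2.0 * e)   # same phone
        else:
            candidates.append(-1.0 + e)         # different phones
    return max(candidates)

# Values from the example: overlap 0.17 with t_a, overlap 0.83 with au.
same_only = phone_accuracy("t_a", [("t_a", 0.17)])
best = phone_accuracy("t_a", [("t_a", 0.17), ("au", 0.83)])
```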
Delicate Rescoring Example (Ⅱ) – Time Frame Error Score (TFE)
• Borrowing the Concept from Minimum Time Frame Error Decoding
– frame level loss function
– the score $S(q_i) = S_{TFE}(q_i)$ of an arc $q_i = [w, pw; t_s, t_e]$ involves its length $t_e - t_s + 1$, its $overlap(q_i, q')$ with the arcs $q' = [w', pw'; t_s', t_e'] \in W$, and their posteriors $P(q')$
– P(q’) is available from the process of calculating consensus scores
• Decoding Procedure
$$y^* = \arg\min_{y \in W} \sum_{k=1,\ q_k \in y}^{M} S(q_k)$$
– qk: the k-th word in the path y
– y: word sequence for a path
Experimental Results
• For the Chinese language, SER and CER make better sense due to the word segmentation problem
• For SER (for syllables) and CER (for characters), the proposed approach is significantly better than the ROVER upper bound
– Alignment distortion
• TFE has the best performance
– Discriminative Decoding

Tested system                              SER     CER     WER
Baseline                     MFCC          15.89   22.19   29.93
                             HLDA          14.43   20.80   28.53
ROVER upper bound            1-Best        14.90   20.39   26.92
                             10-Best       14.64   20.21   26.76
                             20-Best       14.49   20.12   26.79
Integrated Hypothesis Space  (1) CONS      13.67   19.62   26.88
                             (2) EPA       13.41   19.73   27.70
                             (3) CONS+EPA  13.55   19.54   26.97
                             (4) TFE       13.35   19.27   26.71
Multimedia Content Analysis for Efficient Browsing and Retrieval
– automatic generation of titles, summaries and semantic structures for multimedia documents
Difficulties in Browsing Multimedia/Spoken Documents
• Written Documents are Better Structured and Easier to Browse
– in paragraphs with titles
– easily summarized and shown on the screen
– easily decided at a glance if it is what the user is looking for
• Multimedia/Spoken Documents are just Video/Audio Signals
– not easy to be summarized and shown on the screen
– the user can’t go through each one from the beginning to the end during browsing
– better approaches for efficient browsing and retrieval are needed
Integration Relationships among the Involved Technology Areas
[Figure: relationships among Key Terms/Named Entity Extraction from Spoken Documents; Semantic Analysis; and Information Indexing, Retrieval and Browsing]
Key Term Extraction from Spoken Documents
Improved and Interactive Spoken Document Retrieval
– improved spoken document retrieval with higher accuracy and better user-content interaction
Lattices, Position Specific Posterior Lattices (PSPL), Confusion Networks (CN)
[Figure: a word lattice from start node to end node along a time index, with word arcs W1 to W10 and paths such as W1W2, W3W4W5, W6W8W9W10; the corresponding PSPL structure and CN structure, each with clusters 1 to 4 holding words with their probabilities (e.g. W1: prob)]
• PSPL:
– Locate a word in a segment according to the order of the word in a path
• CN:
– Cluster several words in a segment according to similar time spans and word pronunciation
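The PSPL idea, locating each word by its order in a path, can be sketched by accumulating path posteriors per position. This is a simplification that enumerates paths explicitly rather than running a forward-backward pass over the lattice; the three paths are the ones named for the figure, while their posteriors are hypothetical.

```python
from collections import defaultdict

def position_specific_posteriors(paths):
    """Accumulate, for each position in a path, the posterior of each word.

    paths: list of (word sequence, path posterior) pairs.
    """
    pspl = defaultdict(lambda: defaultdict(float))
    for words, posterior in paths:
        for position, word in enumerate(words):
            pspl[position][word] += posterior
    return pspl

# Paths from the lattice figure, with hypothetical posteriors.
paths = [
    (["W1", "W2"], 0.5),
    (["W3", "W4", "W5"], 0.3),
    (["W6", "W8", "W9", "W10"], 0.2),
]
pspl = position_specific_posteriors(paths)
first_cluster = dict(pspl[0])
```

Each position plays the role of one cluster in the PSPL structure, holding the words that can occur there together with their probabilities.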
OOV/Rare Word Problem
• OOV word W = w1w2w3w4 and a lattice L of document D
– wi: subword units
• W never appears in L
– Never find D under PSPL
• But W = w1w2w3w4 is hidden in L at subword level
• Subword-based PSPL (S-PSPL)
[Figure: word lattice L along a time index, with word arcs such as aw1w2, w1w2, w2w3, w3w4b, w3w4bcd and w3w4e that contain the subword units of W]
Subword-based PSPL and CN
[Figure: the lattice represented by subword arcs (w1_1, w1_2, …, w10_2) along a time index, with the corresponding S-PSPL structure and S-CN structure, each with clusters 1 to 8 holding subword units with their probabilities (e.g. w1_1: prob)]
Performance Comparison
[Figure: MAP (0.54 to 0.89) versus index size (0 to 20 MB) for PSPL and CN based on words, characters and syllables]
Interactive Retrieval of Spoken Documents by Topic Hierarchy
• Interactive Process between User and Content for Spoken Document Retrieval
• Given User’s Initial Query, the Extracted Key Terms can be many, even in a Hierarchy
– Ranking the key terms will be helpful in efficient retrieval
[Figure: User, Multi-modal Dialogue, Query/Instruction, Retrieval System, Spoken Document Archive, Topic Hierarchy, Retrieved Documents]
Query Term Suggestions and Improved Interaction by Dialogue Modeling
• states: s1, s2, s3, …
• actions: ti, tj, tk, …
– state_s1 + action_tj → state_s2
• Each state (a list of key terms) is mapped to a document set:
  s1 = [ti] → G1 = C(ti)
  s2 = [ti, tj] → G2 = C(ti + tj)
  s3 = [ti, tk] → G3 = C(ti + tk)
  sn = [ti, tj, tl] → Gn = C(ti + tj + tl)
– Such mapping is defined by some IR function (ex: PLSA)
[Figure: Key Term Space (ti, tj, tk, tl) mapped to Archive Space / Document Space (C(ti), C(tj), C(tk))]
Learning User’s Behavior in Retrieval by a Large Number of Simulated Users
• A State Transition Diagram Generated for Each User Given the Initial Query s1
• User Assumed Satisfied (Double Circles) when Recall Rate = L/|D| > τ0
– L: number of relevant documents appearing in the top K retrieved documents
– D: desired document set
– m(s) = Minimum Number of Steps or Queries to Arrive at the Final State
• Goal: to minimize the number of steps to arrive at the final state
[Figure: state transition diagram over states s1 to s15, annotated with m(s3) = 3, m(s4) = 2, m(s7) = 3, m(s9) = 3, m(s12) = 4, m(s13) = 4, m(s15) = 5, …]
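The satisfaction criterion for a simulated user, recall rate L/|D| above the threshold τ0, can be written as a short check. The function name, document identifiers and threshold values below are hypothetical.

```python
def user_satisfied(retrieved, desired, top_k, tau0):
    """True when the recall rate L/|D| over the top K documents exceeds tau0."""
    hits = len(set(retrieved[:top_k]) & set(desired))  # L
    return hits / len(desired) > tau0

retrieved = ["d3", "d7", "d1", "d9"]   # ranked retrieval result
desired = {"d1", "d3", "d5"}           # desired document set D
loose = user_satisfied(retrieved, desired, top_k=3, tau0=0.5)
strict = user_satisfied(retrieved, desired, top_k=3, tau0=0.7)
```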
Type-I and Type-II Dialogue Systems
• Type-I:
[Figure: block diagram with Input Speech Utterance, ASR (words, lattices), Spoken Language Understanding (Language Understanding, Dialogue Act Classification, Semantic Frame, User Act), Dialogue Manager (Dialogue Modeling, Dialogue State, System Action), Well-organized Database, and Output Generator (Speech, Graph, Tables); symbols U, Au, Ŝ]
• Type-II:
[Figure: block diagram of Spoken Language based Information Access with Input Spoken Query q, ASR (word/phone lattice, one-best, N-best), Spoken Documents d from a Multimedia Document Archive, ASR (word/phone lattice, one-best, N-best), Indexing, inverted index file, Retrieval Engine, Related Documents, Dialogue Manager (Dialogue Modeling, Internal State), Output Presentation, Information Obtained, and a Multi-modal User Interface with multi-modal interactions]
Improved Performance by Dialogue Modeling
[Figure: two plots against ASR Character Accuracy in % for Queries (74, 88, 92, 100): Task Success Rate, and Average Number of Key Terms Needed for Successful Trials; each compares Dialogue Modeling, wpq and tf-idf]
Histogram-based Quantization (HQ) for Robust Distributed Speech Recognition
– quantization dynamically determined by local statistics, thus automatically absorbing the various disturbances
Distributed Speech Recognition (DSR) and Wireless Environment
• An Example Partition of Speech Recognition Processes into Client/Server
[Figure: speech recognition block diagram: Input Speech → Front-end Signal Processing → Feature Vectors → Linguistic Decoding and Search Algorithm → Output Sentence; with Acoustic Models (Acoustic Model Training from Speech Corpora), Lexicon (Lexical Knowledge-base), and Language Model (Language Model Construction from Text Corpora, Grammar)]
• Client/Server Structure
– encoded feature parameters transmitted in packets
[Figure: Clients and Servers connected through the Network]
Problems with Conventional Vector Quantization (VQ)
• Conventional VQ (e.g. SVQ) Popularly Used in DSR
• Dynamic Environmental Noise and Codebook Mismatch Jointly Degrade the Performance of SVQ
– Noise moves clean speech to another partition cell (X to Y)
– Mismatch between fixed VQ codebook and test data increases distortion
– Quantization increases difference between clean and noisy features
Histogram-based Quantization (HQ) (Ⅰ)
– Decision boundaries $y_i$ {i = 1, …, N} are dynamically defined by C(y).
– Representative values $z_i$ {i = 1, …, N} are fixed, transformed by a standard Gaussian.
– $D = \{ z_i, b_i\ (\text{vertical scale}),\ i = 1, \dots, N \}$, determined by Lloyd-Max and a standard Gaussian distribution
Histogram-based Quantization (HQ) (Ⅱ)
– With histogram C’(y’), decision boundaries are automatically changed to $(y'_{i-1}, y'_i)$.
– Decision boundaries are adjusted according to local statistics, no codebook mismatch problem.
$$x_t \to z_i \quad \text{if } b_{i-1} < C'(x_t) \le b_i, \quad \text{or } y'_{i-1} < x_t \le y'_i, \quad \text{where } i = 1, 2, \dots, N$$
Histogram-based Quantization (HQ) (Ⅱ)
• Based on CDF on the Vertical Scale and Histogram, less Sensitive to Noise on the Horizontal Scale
• Disturbances are Automatically Absorbed into HQ Blocks
• Dynamic nature of HQ: hidden codebook on the vertical scale, transformed by the dynamic C(y) into {yi}, dynamic on the horizontal scale
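The way HQ absorbs disturbances can be illustrated with a rank-based sketch: each sample in a block is quantized according to where its empirical CDF value falls among fixed boundaries on the vertical scale, so any order-preserving disturbance (here a gain and an offset) leaves the indices unchanged. The boundaries `b` and representative values `z` below are hypothetical placeholders, not the Lloyd-Max values, and the block contents are made up.

```python
from bisect import bisect_left

def histogram_quantize(block, b, z):
    """Quantize one block of a feature parameter by its empirical CDF.

    b: increasing boundaries on the vertical (CDF) scale, ending at 1.0.
    z: representative value for each partition cell.
    """
    T = len(block)
    order = sorted(range(T), key=lambda t: block[t])
    cdf = [0.0] * T
    for rank, t in enumerate(order, start=1):
        cdf[t] = rank / T                      # empirical C'(x_t), in (0, 1]
    # index i such that b[i-1] < C'(x_t) <= b[i]
    indices = [bisect_left(b, c) for c in cdf]
    return indices, [z[i] for i in indices]

b = [0.25, 0.5, 0.75, 1.0]     # hypothetical vertical-scale boundaries
z = [-1.5, -0.5, 0.5, 1.5]     # hypothetical representative values

clean = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7, -2.0, 0.1]
disturbed = [2.0 * x + 3.0 for x in clean]   # gain and offset disturbance
idx_clean, _ = histogram_quantize(clean, b, z)
idx_disturbed, _ = histogram_quantize(disturbed, b, z)
```

A fixed codebook on the horizontal scale would assign most of the disturbed samples to different cells, whereas here the decision boundaries move with the block statistics.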
Histogram-based VQ (HVQ)
Experimental Results
[Figure: results for Different Types of Noise, Averaged over All SNR Values; compared configurations: Client HEQ-SVQ; Client HEQ-SVQ with Server UD; Client HQ; Client HQ with Server JUD]
Performance in Mobile Wireless Networks