The L2F Spoken Web Search system for Mediaeval 2012
-
Upload
mediaeval2012 -
Category
Technology
-
view
671 -
download
0
description
Transcript of The L2F Spoken Web Search system for Mediaeval 2012
![Page 1: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/1.jpg)
1 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Alberto Abad and Ramón F. Astudillo L2F -‐ Spoken Language Systems Lab
INESC-‐ID Lisboa, Portugal [email protected]‐id.pt
Mediaeval 2012 Workshop Pisa, October 4, 2012
The L2F Spoken Web Search system for Mediaeval 2012
![Page 2: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/2.jpg)
2 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
IntroducJon
In this first L2F/INESC-‐ID parJcipaJon on SWS our main objecJves were: • To learn
• To have fun • To build a reasonable system (in an unreasonable reduced amount of Jme)
The submiVed L2F SWS system exploits hybrid ANN/HMM connecJonist methods for both query tokenizaJon and acousJc keyword search • Composed by the fusion of 4 phoneJc-‐based SWS sub-‐systems
o Based on our in-‐house ASR system named AUDIMUS
• Each sub-‐system uses different language-‐dependent acousJc models: o European Portuguese (pt), Brazilian Portuguese (br), European Spanish (es) and
American English (en)
• Different detecJon score normalizaJon and fusion methods invesJgated o SubmiVed system applies per-‐query score normalizaJon (Q-‐norm) and majority voJng
(MV) fusion
![Page 3: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/3.jpg)
3 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
The baseline speech recognizer
AUDIMUS is our in-‐house hybrid HMM/MLP speech recognizer
• Feature extrac@on MulJ-‐stream 26 PLP, 26 logRASTA-‐PLP, 28 MSG and 39 ETSI (only 8kHz) • MLP Several context input frames (13-‐15), 2 hidden-‐layers (500 units) and 1 output layer.
• Output layer size 39 for pt, 40 for br, 30 for es, 41 for en. • Data pt 115 hours (57 BN+58 telephone); br 13 hours of BN data; es 57hours (36 BN
+21 telephone); en 142 hours of BN data (HUB-‐4 96 & 97) • HMM topology Single-‐state phonemes (3 frames minimum duraJon) • Decoder Uses Weighted Finite-‐State Transducer (WFST) approach
![Page 4: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/4.jpg)
4 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Spoken Query TokenizaJon
For each sub-‐system, obtain a phoneJc tokenizaJon of the queries • Similar to LID parallel phonotacJc
approaches
Use a phone-‐loop grammar with phoneme minimum duraJon of 3 frames to obtain 1-‐best phoneme chain • AlternaJve n-‐best hypothesis for
charactering each query were explored with unsaJsfactory results o Other possibiliJes not explored (yet): lakce, CN, etc…
• Influence of the word inserJon penalty (wip) parameter on the tokenizaJon result
![Page 5: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/5.jpg)
5 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Spoken Query Search
Spoken query search based on AKWS with our hybrid speech recognizer. Search window of 5 seconds (2.5 seconds Jme shio)
• Convert the problem in a verificaJon task (originally for with forced alignment)
• Convenient for fusion purposes Equally-‐likely 1-‐gram LM with target Query and Background word:
• Query word o Described by the phoneJc units obtained in the previous tokenizaJon stage
• Background word o Described by the special phoneJc class background/filler o Minimum duraJon set to 250 msec.
DetecJon score of detected candidates computed as the average phoneJc log-‐likelihood raJos of the query term
![Page 6: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/6.jpg)
6 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Spoken Query Search BG modelling with HMM/MLP
Possible approaches • Re-‐train MLP ✖ • Compute posterior probability of the background class depending on the other
classes ✔ o Mean probability of the top-‐N most likely outputs
For the SWS system, • We compute the average in the likelihood domain (top-‐6)
o The decoder operates in the likelihood domain, so there is not need for and add-‐hoc esJmaJon of the BG class prior
• We use a background scale term β (exponenJal in the likelihood domain) to control the weight of the BG model vs. Query
• This β scale together with the wip term strongly affects searching results o Adjusted following a non-‐exhausJve greedy search
![Page 7: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/7.jpg)
7 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Score normalizaJon, fusion and calibraJon
Score normaliza@on schemes explored: • Q-‐norm Assume that the scores are dependent of the queries and
apply a by-‐query normalizaJon • F-‐norm Assume that the scores are dependent of the data file (of
the collecJon) and apply a by-‐file normalizaJon • CombinaJons (QF-‐norm, FQ-‐norm)
Fusion schemes explored: • Candidate detecJons from the 4 parallel sub-‐systems are kept (or
rejected) according to simple combinaJon rules: o AND (all), OR (at least 1 sys) and MV (at least 2 sys)
• Final detecJon score is the mean score
Decision threshold set according to maxATWV in dev-‐dev
![Page 8: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/8.jpg)
8 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Development experiments Sub-‐systems performance
![Page 9: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/9.jpg)
9 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Development experiments Score normalizaJon strategies
![Page 10: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/10.jpg)
10 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Development experiments Fusion strategies (aoer Q-‐norm)
![Page 11: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/11.jpg)
11 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Official L2F SWS2012 results
5
10
20
40
60
80
90
95
98
.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40
Mis
s pr
obab
ility
(in %
)
False Alarm probability (in %)
Combined DET Plot
Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.633 Scr=-0.435
Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.633 Scr=-0.435
5
10
20
40
60
80
90
95
98
.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40
Mis
s pr
obab
ility
(in %
)
False Alarm probability (in %)
Combined DET Plot
Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.486 Scr=-0.135
Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.486 Scr=-0.135
5
10
20
40
60
80
90
95
98
.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40
Mis
s pr
obab
ility
(in %
)
False Alarm probability (in %)
Combined DET Plot
Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.523 Scr=-0.362
Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.523 Scr=-0.362
dev-‐eval eval-‐dev eval-‐eval
![Page 12: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/12.jpg)
12 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Conclusions
The L2F Spoken Web Search system fully exploits hybrid ANN/HMM speech recogniJon: • Query tokenizaJon based on 1-‐best phoneJc decoding • Query search based on AKWS (with no need for AM re-‐training)
The submiVed system is formed by the fusion of four language-‐dependent sub-‐systems: • Q-‐norm score normalizaJon is applied to each individual sub-‐system • Fusion is done following a majority voJng strategy
The system achieves an actual ATWV score of 0.5195 in the eval-‐eval • Promising given the simplicity of the proposed system • Robust to query and collecJon sets
o Best performance is achieved in a mismatched condiJon!! • Reasonably well-‐calibrated
Future work Focused in improved Query tokenizaJon, fusion with other type of approaches (DTW)
![Page 13: The L2F Spoken Web Search system for Mediaeval 2012](https://reader033.fdocuments.in/reader033/viewer/2022051610/548591b5b4af9faa0d8b4e6b/html5/thumbnails/13.jpg)
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed
L2 F - Spoken Language Systems Laboratory 13
technology from seed
L2 F - Spoken Language Systems Laboratory