Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation
Information Retrieval-based Dynamic Time Warping
Xavier Anguera
Telefonica Research
Spain
Query-by-Example Spoken-Term Detection
Given a spoken query, we search for instances at the lexical level within spoken documents.
It is similar to Spoken Term Detection (NIST STD2006, Babel 2013) but…
• Queries are spoken
• Different speakers
• Different acoustic conditions
• No prior knowledge of the language might be available
Information Retrieval-based Dynamic Time Warping Algorithm (IRDTW)
Information Retrieval-based DTW
• Inspired by the Subsequence-Dynamic Time Warping algorithm by Müller [1]
• It performs a ‘sparse’ matching of two signals, like Jansen [2]
• Uses ideas borrowed from information retrieval to save memory (lots of it)
• It can take advantage of pre-indexing all reference data and thus perform a fast frame-level matching (described in [3])
[1] Meinard Müller, “Information Retrieval for Music and Motion”, Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010
[2] Aren Jansen, Benjamin Van Durme, “Indexing Raw Acoustic Features for Scalable Zero Resource Search”, Proc. Interspeech 2012
[3] Gautam Mantena, Xavier Anguera, “Speed Improvements to Information Retrieval-based Dynamic Time Warping Using Hierarchical K-means Clustering”, in Proc. ICASSP 2013
Subsequence-DTW algorithm (review)
[Figure: alignment of a query term against a reference term; axes: Query term, Reference term]
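As a reminder of what the reviewed algorithm computes, here is a minimal sketch of subsequence DTW. It assumes the standard three-way step pattern and a caller-supplied `dist` function; the function name and the toy sequences are illustrative and not taken from the slides. Note that it fills a full query × reference matrix, which is the memory cost IR-DTW aims to avoid.

```python
def subsequence_dtw(query, reference, dist):
    """Return (best_cost, end_index) of the best match of `query` inside `reference`."""
    n, m = len(query), len(reference)
    # acc[i][j]: cost of the best path ending at query frame i, reference frame j
    acc = [[0.0] * m for _ in range(n)]
    for j in range(m):
        # Free start: a match may begin at any reference frame
        acc[0][j] = dist(query[0], reference[j])
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + dist(query[i], reference[0])
        for j in range(1, m):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = best_prev + dist(query[i], reference[j])
    # Free end: the match may finish at any reference frame
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end], end


query = [1.0, 2.0, 3.0]
reference = [7.0, 1.0, 2.0, 2.0, 3.0, 8.0]
cost, end = subsequence_dtw(query, reference, lambda a, b: abs(a - b))
```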
‘Sparse’ frame matching
Only the closest (lowest distance) query-reference pairs are considered. These can be found through…
• Exhaustive comparison
• Efficient retrieval using indexing techniques
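The exhaustive-comparison option can be sketched as follows. Selecting pairs with a fixed distance `threshold` is an assumption made for illustration (the slide only says the closest pairs are kept), and `sparse_matches` is a hypothetical name.

```python
def sparse_matches(query, reference, dist, threshold):
    """Exhaustively compare frames and keep only pairs closer than `threshold`.

    Returns (query_index, reference_index, distance) tuples in reference time order.
    """
    matches = []
    for j, r in enumerate(reference):
        for i, q in enumerate(query):
            d = dist(q, r)
            if d < threshold:
                matches.append((i, j, d))
    return matches


query = [1.0, 2.0, 3.0]
reference = [7.0, 1.0, 2.0, 2.0, 3.0, 8.0]
pairs = sparse_matches(query, reference, lambda a, b: abs(a - b), threshold=0.5)
```

An indexing technique would replace the double loop with a lookup that returns the same kind of pair list without visiting every frame combination.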
‘Sparse’ dynamic programming
[Figure: S-DTW and IR-DTW compared side by side; axes: Query, Reference]
IR-DTW warping constraints
[Figure: IR-DTW; axis: Query]
Possible constraints:
• Amount of warping:
– basic warping
– 2X warping
• Length of the match
From 2D to 1D: Memory efficient matching
We borrow an alignment algorithm used for Information Retrieval.
It finds unconstrained start-end locations but does not allow any time-warping.
With IRDTW we modified this algorithm to allow for time-warped matching.
We use the ‘matching counts’ vector in the dynamic programming instead of the similarity matrix.
What information is stored in this vector?
The end position of each path defines its location in the 1D vector.
The new matching point defines a target location to which one of the paths will warp.
For each path we store:
• query (start, end)
• reference (start, end)
• Accumulated distance
• #matching points

• Only paths with #matches > 1 are stored in the ΔT vector
• Size(ΔT) = size_query + size_ref (can be constrained using a circular buffer)
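The per-path record and the 1D vector might look like the sketch below. The field names follow the list above; the diagonal-offset mapping in `offset_index` is an assumption chosen to be consistent with Size(ΔT) = size_query + size_ref, and is not confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class Path:
    q_start: int          # query (start, end)
    q_end: int
    r_start: int          # reference (start, end)
    r_end: int
    accum_dist: float = 0.0
    n_matches: int = 1


def offset_index(q, r, size_query):
    """Assumed mapping: a path ending at (q, r) lives in the slot of its diagonal."""
    return r - q + size_query


def make_delta_t(size_query, size_ref):
    """One optional Path slot per diagonal offset."""
    return [None] * (size_query + size_ref)


delta_t = make_delta_t(size_query=3, size_ref=6)
delta_t[offset_index(2, 5, 3)] = Path(q_start=0, q_end=2, r_start=3, r_end=5,
                                      accum_dist=0.2, n_matches=3)
```

With this mapping, every (q, r) pair on the same diagonal shares a slot, so the vector grows with the sum of the two lengths rather than their product.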
Applying warping constraints in 1D
Constraints in the similarity matrix translate as:
1. Consider all paths within range
2. Check for local constraints
• Basic warping: Δq > 0, Δr > 0
• 2X warping: Δq ≥ Δr/2, Δq ≤ 2*Δr
[Figure: query vs. reference plane showing Δq, Δr and the search range Wrange (with Wrange/2 marked)]
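The two local constraints translate directly into predicates. In this sketch, `dq` and `dr` stand for Δq and Δr, taken here to be the distance from a path's current end to the new matching point along the query and reference axes; the function names are illustrative.

```python
def basic_warping_ok(dq, dr):
    """Basic warping: the path must advance in both query and reference."""
    return dq > 0 and dr > 0


def two_x_warping_ok(dq, dr):
    """2X warping: neither axis may advance more than twice as fast as the other."""
    return dq >= dr / 2 and dq <= 2 * dr
```

As written, the 2X inequalities are also satisfied by Δq = Δr = 0, so a caller may want to apply the basic check alongside them.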
Best matching path selection
We select the path with the highest number of matches. It is then warped to end at the current matching point.
New path info:
• q_end = tqi
• r_end = trj
• Accum. Distance += d(qi, rj)
• #matches++
We can dynamically save memory by eliminating obsolete paths.
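A minimal sketch of this selection-and-update step follows. Several details are assumptions for illustration: candidates are filtered with the basic warping constraint only, ties are resolved by list order, and a new single-match path is started when no candidate qualifies. `extend_best_path` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class Path:
    q_start: int
    q_end: int
    r_start: int
    r_end: int
    accum_dist: float
    n_matches: int


def extend_best_path(candidates, qi, rj, d):
    """Warp the candidate with the most matches so that it ends at (qi, rj)."""
    # Keep only paths that can legally advance to the new matching point
    valid = [p for p in candidates if qi - p.q_end > 0 and rj - p.r_end > 0]
    if not valid:
        return Path(qi, qi, rj, rj, d, 1)  # assumed behaviour: start a new path
    best = max(valid, key=lambda p: p.n_matches)
    return Path(best.q_start, qi, best.r_start, rj,
                best.accum_dist + d, best.n_matches + 1)


a = Path(0, 1, 10, 11, 0.5, 2)
b = Path(0, 2, 5, 12, 1.0, 3)
extended = extend_best_path([a, b], qi=3, rj=13, d=0.25)
```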
Query-by-Example Spoken-Term detection system
Acoustic features
• Posteriorgram features are used (Zhang-Glass 2010)
– MFCC-39 -> GMM-64 posterior probability vectors
• Distance between features:
Query-by-example Spoken Term Detection system*
[Figure: system block diagram with an index mode and a search mode. Blocks: development dataset, background model training, VAD models training, background model, VAD model, search corpus, query, feature extractor, energy-based VAD, IR-DTW, overlap pruning, local S-DTW refinement]
*X. Anguera, “Telefonica system for the Spoken Web Search Task at Mediaeval 2012”, Mediaeval 2012 Workshop, Pisa, Italy
Performance evaluation
• Database: Mediaeval SWS 2012 data (4 African languages, subset of the Lwazi database*)
– ~4h development corpus + 100 queries
– ~4h evaluation corpus + 100 queries
• Metrics:
– Maximum Term Weighted Value (MTWV)
– Memory usage
*E. Barnard, M. Davel, C. V. Heerden, “ASR Corpus Design for Resource-Scarce Languages”, in Proc. Interspeech 2009
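For readers unfamiliar with the metric, the sketch below shows the usual NIST-style term weighted value, TWV = 1 - (P_miss + β·P_FA), with MTWV taken as the best TWV over decision thresholds. Here `p_miss` and `p_fa` are assumed to be already averaged over query terms, and the `beta` value and operating points are illustrative only, not the evaluation's actual settings.

```python
def term_weighted_value(p_miss, p_fa, beta):
    """TWV at one operating point: 1 - (P_miss + beta * P_FA)."""
    return 1.0 - (p_miss + beta * p_fa)


def best_twv(operating_points, beta):
    """operating_points: iterable of (threshold, p_miss, p_fa).

    Returns (twv, threshold) for the threshold giving the highest TWV.
    """
    return max((term_weighted_value(pm, pfa, beta), th)
               for th, pm, pfa in operating_points)


points = [(0.2, 0.3, 0.001), (0.5, 0.5, 0.0001)]
mtwv, threshold = best_twv(points, beta=100.0)
```

A perfect system scores 1, and a system that returns nothing scores 0, so higher values are better.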
Maximum Term Weighted Value
System        Dev. Set   Eval Set
Diagonal      0.258      0.276
IR-DTW        0.394      0.394
S-DTW         0.443      0.450
Rails system  0.381      0.384
Contrastive systems:
• Diagonal: Substitute IR-DTW by only allowing diagonal matches
• S-DTW: Implementation as in [1]
• Rails system: scores from [2] on the same database
[1] X. Anguera and M. Ferrarons, “Memory-Efficient Subsequence-DTW for Query-by-Example Spoken Term Detection”, in Proc. ICME, 2013
[2] A. Jansen, B. V. Durme and P. Clark, “The JHU-HLTCOE Spoken Web Search System for Mediaeval 2012”, in Proc. Mediaeval Workshop 2012, Pisa, Italy
Memory usage analysis
System   Dev. Set (mean/std)   Eval Set (mean/std)
S-DTW    506.2MB/342.8MB       568.1MB/326.4MB
IR-DTW   91.7MB/15MB           112.3MB/21.8MB
Conclusions and Future Work
• We have introduced the IR-DTW algorithm and demonstrated its potential in the QbE-STD task.
– Its main advantage is its low memory usage
– Accuracy still falls short of an exhaustive/traditional search
• We are testing IR-DTW in other tasks
– Large volumes of data that disallow building similarity matrices
– Applications not in speech that can benefit from sparse matching
Not anymore!
Thanks for your attention
Questions?
Xavier Anguera
Download the code from here: http://www.xavieranguera.com/resources/resources.html#IRDTW