Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Information Retrieval-based Dynamic Time Warping

Xavier Anguera

Telefonica Research, Spain

Query-by-Example Spoken-Term Detection

Given a spoken query, we search for instances at the lexical level within spoken documents.

It is similar to Spoken Term Detection (NIST STD 2006, Babel 2013) but…
• Queries are spoken
• Different speakers
• Different acoustic conditions
• No prior knowledge of the language might be available

Information Retrieval-based Dynamic Time Warping Algorithm (IRDTW)

Information Retrieval-based DTW
• Inspired by the Subsequence Dynamic Time Warping algorithm by Müller [1]
• It performs a ‘sparse’ matching of two signals, like Jansen [2]
• Uses ideas borrowed from information retrieval to save memory (lots of it)
• It can take advantage of pre-indexing all reference data and thus perform fast frame-level matching (described in [3])

[1] Meinard Müller, “Information Retrieval for Music and Motion”, Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010
[2] Aren Jansen, Benjamin Van Durme, “Indexing Raw Acoustic Features for Scalable Zero Resource Search”, in Proc. Interspeech 2012
[3] Gautam Mantena, Xavier Anguera, “Speed Improvements to Information Retrieval-based Dynamic Time Warping Using Hierarchical K-means Clustering”, in Proc. ICASSP 2013

Subsequence-DTW algorithm (review)

[Figure: two alignment plots, each with axes “Query term” and “Reference term”]

‘Sparse’ frame matching

Only the closest (lowest distance) query-reference pairs are considered. These can be found through…
• Exhaustive comparison
• Efficient retrieval using indexing techniques
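The exhaustive-comparison option can be sketched as follows. This is a minimal illustration, not the presented implementation: the function name `sparse_matches`, the Euclidean distance and the `max_dist` threshold are all assumptions made for the example.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; a stand-in for whatever frame distance is used."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sparse_matches(query, reference, max_dist, dist=euclidean):
    """Return (q_idx, r_idx, distance) for every frame pair within max_dist.

    Pairs are listed by reference index, then query index, which is the order
    a left-to-right scan of the reference would discover them.
    """
    matches = []
    for j, r in enumerate(reference):
        for i, q in enumerate(query):
            d = dist(q, r)
            if d <= max_dist:
                matches.append((i, j, d))
    return matches


# Toy 2-dimensional "frames" (hypothetical data).
query = [(0.0, 0.0), (1.0, 1.0)]
reference = [(5.0, 5.0), (0.1, 0.0), (1.0, 0.9), (9.0, 9.0)]
pairs = sparse_matches(query, reference, max_dist=0.5)
print([(i, j) for i, j, _ in pairs])
```

Only two of the eight possible frame pairs survive here, which is the sense in which the matching is sparse. An index-based retrieval would return the same kind of pair list without comparing every frame against every other.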

‘Sparse’ dynamic programming

[Figure: S-DTW and IR-DTW panels side by side; axes: Query, Reference]

IR-DTW warping constraints

[Figure: IR-DTW panel; axis: Query]

Possible constraints:
• Amount of warping:
– basic warping
– 2X warping
• Length of the match

From 2D to 1D: Memory efficient matching

We borrow an alignment algorithm used for information retrieval.

It finds unconstrained start-end locations but does not allow any time warping.

With IR-DTW we modified this algorithm to allow for time-warped matching.

We use the ‘matching counts’ vector in the dynamic programming instead of the similarity matrix.

The end position of each path defines its location in the 1D vector.

The new matching point defines a target location to which one of the paths will warp.
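The borrowed, warping-free alignment can be pictured as counting matches per time offset. The sketch below is one reading of the ‘matching counts’ idea, under the assumption that a match between query frame q and reference frame r is filed at index r - q + n_query; the function name and the toy data are hypothetical.

```python
def matching_counts(matches, n_query, n_ref):
    """Count matching frame pairs per diagonal (time offset) in a 1D vector.

    A pair (q_idx, r_idx) lies on the diagonal with offset r_idx - q_idx.
    Adding n_query shifts every offset to a non-negative index, so the
    vector needs n_query + n_ref entries.
    """
    counts = [0] * (n_query + n_ref)
    for q_idx, r_idx in matches:
        counts[r_idx - q_idx + n_query] += 1
    return counts


# Three matches on the same diagonal (offset 3) and one stray match.
matches = [(0, 3), (1, 4), (2, 5), (0, 7)]
counts = matching_counts(matches, n_query=3, n_ref=8)
best_offset = counts.index(max(counts)) - 3  # subtract n_query to undo the shift
print(best_offset, max(counts))
```

A peak in the vector marks a candidate location without ever building the 2D matrix, but only matches that sit exactly on one diagonal reinforce each other, which is why this version allows no time warping.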

What information is stored in this vector?

For each path we store:
• query(start, end)
• reference(start, end)
• Accumulated distance
• #matching points

• Only paths with #matches > 1 are stored in the ΔT vector
• Size(ΔT) = size_query + size_ref (can be constrained using a circular buffer)
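To get a feel for the saving implied by Size(ΔT) = size_query + size_ref, the short calculation below compares it with the number of cells in a full similarity matrix. The frame counts are hypothetical (a 2-second query and a 1-hour reference at 100 frames per second), and the function names are made up for the example.

```python
def similarity_matrix_cells(n_query, n_ref):
    """Number of cells in a full 2D query-by-reference similarity matrix."""
    return n_query * n_ref


def delta_t_size(n_query, n_ref):
    """Number of entries in the 1D vector: size_query + size_ref."""
    return n_query + n_ref


n_query, n_ref = 200, 360_000  # hypothetical frame counts
cells_2d = similarity_matrix_cells(n_query, n_ref)
cells_1d = delta_t_size(n_query, n_ref)
print(cells_2d, cells_1d, round(cells_2d / cells_1d, 1))
```

Each 1D entry holds a whole path record rather than a single distance, so the real ratio depends on the record size, but the growth is linear instead of quadratic.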

Applying warping constraints in 1D

Constraints in the similarity matrix translate as:
1. Consider all paths within range
2. Check for local constraints
• Basic warping: Δq > 0, Δr > 0
• 2X warping: Δq ≥ Δr/2, Δq ≤ 2*Δr

[Figure: query vs. reference plane showing Δq, Δr, Wrange and Wrange/2]
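The two local constraints can be written as small predicates. This sketch transcribes the inequalities literally; the function names are assumptions, and Δq and Δr are taken to be the query and reference distances between a path's end point and the new matching point.

```python
def basic_warping_ok(dq, dr):
    """Basic warping: the new point must lie strictly ahead on both axes."""
    return dq > 0 and dr > 0


def warping_2x_ok(dq, dr):
    """2X warping: the query step is between half and twice the reference step."""
    return dr / 2 <= dq <= 2 * dr


# (dq, dr) steps from the end of an existing path to a new matching point.
for dq, dr in [(1, 1), (2, 4), (1, 3), (5, 2), (0, 2)]:
    print(dq, dr, basic_warping_ok(dq, dr), warping_2x_ok(dq, dr))
```

Taken literally, the 2X inequalities do not exclude a zero step on both axes, so an implementation would probably combine them with the basic positivity check.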

Best matching path selection

We select the path with the most matches. It is then warped to end at the current matching point.

New path info:
• q_end = t_qi
• r_end = t_rj
• Accum. Distance += d(qi, rj)
• #matches++

We can dynamically save memory by eliminating obsolete paths.
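Putting the pieces together, one update step might look like the sketch below. It is an illustration under several assumptions and not the released code: the index mapping r - q + n_query, the names `Path`, `add_match` and `w_range`, the use of the basic warping constraint, and the choice to seed a new single-match path when no existing path qualifies are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Path:
    q_start: int
    q_end: int
    r_start: int
    r_end: int
    accum_dist: float
    matches: int


def add_match(delta_t, n_query, qi, rj, dist, w_range):
    """Process one matching point (qi, rj) against the 1D path vector."""
    target = rj - qi + n_query  # assumed location of this match in the vector
    lo = max(0, target - w_range // 2)
    hi = min(len(delta_t) - 1, target + w_range // 2)

    # Among paths within range that satisfy basic warping, keep the one
    # with the most matches.
    best_idx = None
    for idx in range(lo, hi + 1):
        p = delta_t[idx]
        if p is None:
            continue
        if qi - p.q_end > 0 and rj - p.r_end > 0:
            if best_idx is None or p.matches > delta_t[best_idx].matches:
                best_idx = idx

    if best_idx is None:
        # Assumption: start a new path, replacing whatever was at the target.
        delta_t[target] = Path(qi, qi, rj, rj, dist, 1)
    else:
        p = delta_t[best_idx]
        delta_t[best_idx] = None  # the path leaves its old location
        p.q_end, p.r_end = qi, rj
        p.accum_dist += dist
        p.matches += 1
        delta_t[target] = p       # warp the path to the target location
    return delta_t[target]


n_query, n_ref = 3, 8
delta_t = [None] * (n_query + n_ref)
for qi, rj, d in [(0, 3, 0.1), (1, 4, 0.2), (2, 6, 0.1)]:
    final = add_match(delta_t, n_query, qi, rj, d, w_range=4)
print(final)
```

In this toy run the third match sits one position away from the path's location, so the path is warped from index 6 to index 7 instead of being broken, which a pure diagonal count could not do.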

Query-by-Example Spoken-Term detection system

Acoustic features

• Posteriorgram features are used (Zhang-Glass 2010)
– MFCC-39 -> GMM-64 posterior probability vectors

• Distance between features:
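The formula itself is not given above. Purely as an illustration, a distance often used with posteriorgrams is the negative log of the inner product of the two posterior vectors; whether this is the exact form used in this system is an assumption, as are the function name and the `eps` floor.

```python
import math


def neg_log_dot(p, q, eps=1e-10):
    """Negative log inner product of two posterior probability vectors.

    eps (an assumption) floors the inner product so that orthogonal
    vectors give a large finite distance instead of a math domain error.
    """
    inner = sum(a * b for a, b in zip(p, q))
    return -math.log(max(inner, eps))


same = neg_log_dot([1.0, 0.0], [1.0, 0.0])        # identical one-hot posteriors
spread = neg_log_dot([0.5, 0.5], [0.5, 0.5])      # identical but uncertain
disjoint = neg_log_dot([1.0, 0.0], [0.0, 1.0])    # no shared probability mass
print(same, spread, disjoint)
```

With this choice, two frames that put their mass on the same Gaussian component get a distance near zero, and frames with no overlap get a large one.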

Query-by-example Spoken Term Detection system*

[Figure: system block diagram. Components: development dataset, background model training, VAD models training, background model, VAD model, search corpus, query, feature extractor (×2), energy-based VAD (×2), IR-DTW, overlap pruning, local S-DTW refinement. Operating modes: index mode and search mode.]

*X. Anguera, “Telefonica system for the Spoken Web Search Task at Mediaeval 2012”, Mediaeval 2012 Workshop, Pisa, Italy

Performance evaluation

• Database: Mediaeval SWS 2012 data (4 African languages, subset of the Lwazi database*)
– ~4h development corpus + 100 queries
– ~4h evaluation corpus + 100 queries

• Metrics:
– Maximum Term Weighted Value (MTWV)
– Memory usage

*E. Barnard, M. Davel, C. V. Heerden, “ASR Corpus Design for Resource-Scarce Languages”, in Proc. Interspeech 2009

Maximum Term Weighted Value

System        Dev. Set  Eval Set
Diagonal      0.258     0.276
IR-DTW        0.394     0.394
S-DTW         0.443     0.450
Rails system  0.381     0.384

Contrastive systems:
• Diagonal: Substitute IR-DTW by only allowing diagonal matches
• S-DTW: Implementation as in [1]
• Rails system: scores from [2] on the same database

[1] X. Anguera and M. Ferrarons, “Memory-Efficient Subsequence-DTW for Query-by-Example Spoken Term Detection”, in Proc. ICME, 2013
[2] A. Jansen, B. V. Durme and P. Clark, “The JHU-HLTCOE Spoken Web Search System for Mediaeval 2012”, in Proc. Mediaeval Workshop 2012, Pisa, Italy

Memory usage analysis

System  Dev. Set (mean/std)  Eval Set (mean/std)
S-DTW   506.2MB/342.8MB      568.1MB/326.4MB
IR-DTW  91.7MB/15MB          112.3MB/21.8MB
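As a quick reading aid, the mean figures above can be turned into reduction factors. The snippet only does arithmetic on the reported means; the variable names are made up for the example.

```python
# Mean memory usage in MB, copied from the table above.
mean_mb = {
    "S-DTW": {"dev": 506.2, "eval": 568.1},
    "IR-DTW": {"dev": 91.7, "eval": 112.3},
}

# How many times less memory IR-DTW uses on average, per data set.
reduction = {
    split: mean_mb["S-DTW"][split] / mean_mb["IR-DTW"][split]
    for split in ("dev", "eval")
}
print({split: round(r, 1) for split, r in reduction.items()})
```

On these means IR-DTW uses roughly five times less memory than S-DTW, and its standard deviation is also much smaller relative to its mean.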

Conclusions and Future Work

• We have introduced the IR-DTW algorithm and demonstrated its potential in the QbE-STD task.
– Its main advantage is its low memory usage
– Accuracy still falls short of an exhaustive/traditional search
• We are testing IR-DTW in other tasks
– Large volumes of data that preclude building similarity matrices
– Applications outside speech that can benefit from sparse matching

Not anymore!

Thanks for your attention

Questions?

Xavier Anguera

[email protected]

Download the code from here: http://www.xavieranguera.com/resources/resources.html#IRDTW

