Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Information Retrieval-based Dynamic Time Warping Xavier Anguera Telefonica Research Spain

Presentation of the paper titled "Information Retrieval-based Dynamic Time Warping" given at Interspeech 2013 in Lyon, France

Transcript of Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Page 1: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Information Retrieval-based Dynamic Time Warping

Xavier Anguera

Telefonica Research, Spain

Page 2: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Query-by-Example Spoken-Term Detection

Given a spoken query, we search for instances at the lexical level within spoken documents.

It is similar to Spoken Term Detection (NIST STD2006, Babel 2013) but…

• Queries are spoken
• Different speakers
• Different acoustic conditions
• No prior knowledge of the language might be available

Page 3: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Information Retrieval-based Dynamic Time Warping Algorithm (IRDTW)

Page 4: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Information Retrieval-based DTW

• Inspired by the subsequence dynamic time warping (S-DTW) algorithm by Müller [1]
• It performs a ‘sparse’ matching of two signals, like Jansen [2]
• Uses ideas borrowed from information retrieval to save memory (lots of it)
• It can take advantage of pre-indexing all reference data and thus perform a fast frame-level matching (described in [3])

[1] Meinard Müller, “Information Retrieval for Music and Motion”, Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010
[2] Aren Jansen, Benjamin Van Durme, “Indexing Raw Acoustic Features for Scalable Zero Resource Search”, Proc. Interspeech 2012
[3] Gautam Mantena, Xavier Anguera, “Speed Improvements to Information Retrieval-based Dynamic Time Warping Using Hierarchical K-means Clustering”, in Proc. ICASSP 2013

Page 5: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Subsequence-DTW algorithm (review)

[Figure: axes “Query term” and “Reference term”]

Page 6: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

[Figure: axes “Query term” and “Reference term”]

Page 7: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

[Figure: axes “Query term” and “Reference term”]

Page 8: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

‘Sparse’ frame matching

Only the closest (lowest distance) query-reference pairs are considered. These can be found through…
• Exhaustive comparison
• Efficient retrieval using indexing techniques
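The exhaustive-comparison option can be sketched as follows. This is only an illustration: the function name `sparse_matches`, the `max_dist` threshold, and the use of Euclidean distance are assumptions made for the example, not details given on the slides.

```python
import math

def sparse_matches(query, reference, max_dist):
    """Return (i, j, dist) for every query/reference frame pair within max_dist.

    Hypothetical helper: Euclidean distance stands in for whatever
    frame-level distance the real system uses.
    """
    matches = []
    for i, q in enumerate(query):
        for j, r in enumerate(reference):
            d = math.dist(q, r)
            if d <= max_dist:
                matches.append((i, j, d))
    return matches

# Toy 2-dimensional "frames"
query = [(0.0, 0.0), (1.0, 1.0)]
reference = [(5.0, 5.0), (0.1, 0.0), (1.0, 0.9), (9.0, 9.0)]
pairs = sparse_matches(query, reference, max_dist=0.5)
print([(i, j) for i, j, _ in pairs])
```

Only the surviving pairs are passed on to the dynamic programming step, so the full query-by-reference matrix never has to be stored.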

Page 9: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

‘Sparse’ dynamic programming

[Figure: side-by-side panels “S-DTW” and “IR-DTW”, with axes “Query” and “Reference”]

Page 10: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

IR-DTW warping constraints

[Figure: “IR-DTW” panel with a “Query” axis]

Possible constraints:
• Amount of warping:
  – basic warping
  – 2X warping
• Length of the match

Page 12: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

From 2D to 1D: Memory efficient matching

We borrow an alignment algorithm used for Information Retrieval.

It finds unconstrained start-end locations but does not allow any time-warping.

With IRDTW we modified this algorithm to allow for time-warped matching.
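One plausible reading of the borrowed algorithm is a diagonal vote: each matching pair votes for its offset (reference index minus query index), and a well-populated offset marks an alignment with no time-warping. The sketch below assumes that reading; `diagonal_votes` is a name made up for this example.

```python
from collections import Counter

def diagonal_votes(matches):
    """Count matching points per offset (reference index - query index)."""
    votes = Counter()
    for i, j in matches:
        votes[j - i] += 1
    return votes

# Four matches lie on the same diagonal (offset 10); two are stray matches.
matches = [(0, 10), (1, 11), (2, 12), (3, 13), (1, 40), (2, 7)]
votes = diagonal_votes(matches)
best_offset, best_count = votes.most_common(1)[0]
print(best_offset, best_count)
```

Because every vote lands in a 1D array indexed by offset, memory grows with the sum of the two lengths rather than their product. The price is that a time-warped match spreads its votes over several neighbouring offsets, which is the limitation IR-DTW addresses.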

Page 13: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

We use the ‘matching counts’ vector in the dynamic programming instead of the similarity matrix.

The end position of each path defines its location in the 1D vector.

Each new matching point defines a target location to which one of the paths will be warped.

Page 14: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

What information is stored in this vector?

For each path we store:
• query (start, end)
• reference (start, end)
• Accumulated distance
• #matching points

• Only paths with #matches > 1 are stored in the ΔT vector
• Size(ΔT) = size_query + size_ref (can be constrained using a circular buffer)
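The per-path record and the ΔT vector might be laid out as below. The field names, the `delta_t_index` helper, and the choice of shifting the offset by the query length so that indices stay non-negative are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Path:
    q_start: int       # first query frame on the path
    q_end: int         # last query frame on the path
    r_start: int       # first reference frame on the path
    r_end: int         # last reference frame on the path
    accum_dist: float  # sum of frame distances along the path
    n_matches: int     # number of matching points on the path

def delta_t_index(q_idx, r_idx, query_len):
    """Map a (query, reference) end position to a slot in the 1D vector.

    Assumed convention: offset r - q, shifted by the query length.
    """
    return r_idx - q_idx + query_len

query_len, ref_len = 4, 10
delta_t = [None] * (query_len + ref_len)  # Size(ΔT) = size_query + size_ref

p = Path(q_start=0, q_end=1, r_start=6, r_end=7, accum_dist=0.3, n_matches=2)
slot = delta_t_index(p.q_end, p.r_end, query_len)
delta_t[slot] = p
print(slot)
```

With a circular buffer, only the slots near the current reference position would need to be kept alive, which is how the size can be constrained further.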

Page 15: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Applying warping constraints in 1D

Constraints in the similarity matrix translate as:
1. Consider all paths within range
2. Check for local constraints
   • Basic warping: Δq > 0, Δr > 0
   • 2X warping: Δq ≥ Δr/2, Δq ≤ 2*Δr

[Figure: “Query” and “Reference” axes, marking Δq, Δr, Wrange/2 and Wrange]
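The two local constraints can be written as small predicates, where Δq and Δr are the query and reference distances from the end of an existing path to the new matching point. The function names are made up here, and requiring forward progress inside the 2X check is an assumption (the slide lists only the two ratio inequalities).

```python
def basic_warping_ok(dq, dr):
    """Basic warping: both query and reference must move forward."""
    return dq > 0 and dr > 0

def two_x_warping_ok(dq, dr):
    """2X warping: dq >= dr/2 and dq <= 2*dr.

    Assumption: forward progress (dq > 0, dr > 0) is also required.
    """
    return dq > 0 and dr > 0 and 2 * dq >= dr and dq <= 2 * dr

print(basic_warping_ok(1, 5))   # steep step allowed by basic warping
print(two_x_warping_ok(1, 5))   # but rejected by 2X warping
print(two_x_warping_ok(2, 3))   # moderate warping is accepted
```

The 2X test is written with integer arithmetic (`2 * dq >= dr`) so that frame indices never need to be divided.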

Page 16: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Best matching path selection

We select the path with the largest number of matches. It is then warped to end at the current matching point.

New path info:
• q_end = tqi
• r_end = trj
• Accum. Distance += d(qi, rj)
• #matches++

We can dynamically save memory by eliminating obsolete paths.
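The selection-and-update step above can be sketched as follows. This is a simplified illustration, not the released implementation: paths are plain dictionaries kept in a list rather than in the ΔT vector, only the basic warping constraint is checked, the chosen path is updated in place, and obsolete paths are never pruned. The names `extend_best_path` and `w_range` are assumptions.

```python
def extend_best_path(paths, qi, rj, dist, w_range):
    """Warp the best nearby path to the new matching point (qi, rj)."""
    target = rj - qi  # offset (diagonal) of the new matching point
    candidates = [
        p for p in paths
        if abs((p["r_end"] - p["q_end"]) - target) <= w_range
        and qi - p["q_end"] > 0   # basic warping: query advances
        and rj - p["r_end"] > 0   # basic warping: reference advances
    ]
    if not candidates:
        # No path can reach this point, so it starts a new one.
        new = {"q_start": qi, "q_end": qi, "r_start": rj, "r_end": rj,
               "accum_dist": dist, "n_matches": 1}
        paths.append(new)
        return new
    best = max(candidates, key=lambda p: p["n_matches"])
    best["q_end"] = qi
    best["r_end"] = rj
    best["accum_dist"] += dist
    best["n_matches"] += 1
    return best

paths = []
for qi, rj, d in [(0, 10, 0.1), (1, 11, 0.2), (2, 13, 0.1), (0, 50, 0.3)]:
    extend_best_path(paths, qi, rj, d, w_range=2)
print(len(paths), paths[0]["n_matches"])
```

In this toy run the first three points chain into one path that warps by one frame, while the last point is too far away and opens a second path.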

Page 17: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Query-by-Example Spoken-Term detection system

Page 18: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Acoustic features

• Posteriorgram features are used (Zhang-Glass 2010)
  – MFCC-39 -> GMM-64 posterior probability vectors

• Distance between features:
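The distance formula from the slide is not preserved in this transcript. As a stand-in only, the sketch below uses a distance that is common for posteriorgram features, the negative log of the inner product of the two posterior vectors. Treat the choice of distance, the function name, and the `eps` floor as assumptions rather than the system's definition.

```python
import math

def posteriorgram_distance(p, q, eps=1e-10):
    """Negative log inner product of two posterior probability vectors.

    Assumed distance for illustration; eps avoids log(0) for disjoint vectors.
    """
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))

same = posteriorgram_distance([0.9, 0.1], [0.9, 0.1])
diff = posteriorgram_distance([0.9, 0.1], [0.1, 0.9])
print(same < diff)
```

Frames dominated by the same Gaussian component give a large inner product and therefore a small distance, while frames dominated by different components give a large distance.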

Page 19: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Query-by-example Spoken Term Detection system*

[Figure: system block diagram with an “Index mode” and a “Search mode”. Blocks: Development dataset, Search corpus, Query, Feature extractor (×2), Energy-based VAD (×2), Background model training, VAD models training, Background model, VAD model, IR-DTW, Overlap pruning, Local S-DTW refinement]

*X. Anguera, “Telefonica system for the Spoken Web Search Task at Mediaeval 2012”, Mediaeval 2012 Workshop, Pisa, Italy

Page 20: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Performance evaluation

• Database: Mediaeval SWS 2012 data (4 African languages, subset of the Lwazi database*)
  – ~4h development corpus + 100 queries
  – ~4h evaluation corpus + 100 queries
• Metrics:
  – Maximum Term Weighted Value (MTWV)
  – Memory usage

*E. Barnard, M. Davel, C. V. Heerden, “ASR Corpus Design for Resource-Scarce Languages”, in Proc. Interspeech 2009
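For reference, the MTWV metric can be sketched as the best term-weighted value over all decision thresholds, where the term-weighted value at one threshold is 1 minus the per-term average of P_miss + β·P_FA. The code below is a simplified illustration: the data layout, the single `n_nontarget_trials` count shared by all terms, and the β value in the example are assumptions, not the evaluation's official settings.

```python
def twv(detections, n_true, n_nontarget_trials, threshold, beta):
    """Term-weighted value at one threshold.

    detections: {term: [(score, is_hit), ...]}
    n_true:     {term: number of true occurrences}
    """
    total = 0.0
    for term, dets in detections.items():
        kept = [hit for score, hit in dets if score >= threshold]
        n_correct = sum(kept)
        n_false_alarms = len(kept) - n_correct
        p_miss = 1.0 - n_correct / n_true[term]
        p_fa = n_false_alarms / n_nontarget_trials
        total += p_miss + beta * p_fa
    return 1.0 - total / len(detections)

def mtwv(detections, n_true, n_nontarget_trials, beta):
    """Best term-weighted value over all thresholds seen in the detections."""
    thresholds = sorted({s for dets in detections.values() for s, _ in dets})
    return max(twv(detections, n_true, n_nontarget_trials, t, beta)
               for t in thresholds)

detections = {
    "a": [(0.9, True), (0.4, False)],
    "b": [(0.8, True), (0.7, True), (0.2, False)],
}
n_true = {"a": 1, "b": 2}
result = mtwv(detections, n_true, n_nontarget_trials=1000, beta=10.0)
print(result)
```

In this toy case a threshold of 0.7 keeps every true hit and rejects every false alarm, so the best achievable value is reached.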

Page 21: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Maximum Term Weighted Value

System         Dev. Set   Eval Set
Diagonal       0.258      0.276
IR-DTW         0.394      0.394
S-DTW          0.443      0.450
Rails system   0.381      0.384

Contrastive systems:
• Diagonal: substitute IR-DTW by only allowing diagonal matches
• S-DTW: implementation as in [1]
• Rails system: scores from [2] on the same database

[1] X. Anguera and M. Ferrarons, “Memory-Efficient Subsequence-DTW for Query-by-Example Spoken Term Detection”, in Proc. ICME, 2013
[2] A. Jansen, B. V. Durme and P. Clark, “The JHU-HLTCOE Spoken Web Search System for Mediaeval 2012”, in Proc. Mediaeval Workshop 2012, Pisa, Italy

Page 22: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Memory usage analysis

System   Dev. Set (mean/std)   Eval Set (mean/std)
S-DTW    506.2MB/342.8MB       568.1MB/326.4MB
IR-DTW   91.7MB/15MB           112.3MB/21.8MB

Page 23: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Conclusions and Future Work

• We have introduced the IR-DTW algorithm and demonstrated its potential in the QbE-STD task.
  – Its main advantage is its low memory usage
  – Accuracy still falls short of an exhaustive/traditional search
• We are testing IR-DTW in other tasks
  – Large volumes of data that disallow building similarity matrices
  – Applications not in speech that can benefit from sparse matching

Not anymore!

Page 24: Information Retrieval Dynamic Time Warping - Interspeech 2013 presentation

Thanks for your attention

Questions?

Xavier Anguera

[email protected]

Download the code from here: http://www.xavieranguera.com/resources/resources.html#IRDTW