Time Machine session @ ICME 2012 - DTW's New Youth

Dynamic Time Warping's New Youth
Xavier Anguera, Telefónica Research

Description

These are the slides I gave at the Time Machine Expert session of ICME 2012. They discuss the renewal of Dynamic Time Warping (DTW) as a feasible algorithm for some of today's applications.

Transcript of Time Machine session @ ICME 2012 - DTW's New Youth

Page 1: Time Machine session @ ICME 2012 - DTW's New Youth

Dynamic Time Warping's New Youth

Xavier Anguera, Telefónica Research

Page 2: Time Machine session @ ICME 2012 - DTW's New Youth

Outline  

•  Was there anything before DTW?
•  DTW review
•  Here comes HMM
•  DTW comeback
   – Algorithms
   – Speed and scalability
   – Applications
•  Conclusions

Page 3: Time Machine session @ ICME 2012 - DTW's New Youth

Before DTW
•  Speech recognition was done by means of pattern-matching the input with reference patterns.

•  Speaking-rate variations create nonlinear fluctuations on the time axis.

•  Some linear transformations were tested, but not successfully.
   – Dynamic time warping (DTW) will be able to help.

Page 4: Time Machine session @ ICME 2012 - DTW's New Youth

Dynamic Time Warping - DTW
•  The DTW algorithm allows the computation of the optimal alignment between two time series $X_U$, $X_V$ of feature vectors in $\Re^D$:

$X_U = (u_1, \ldots, u_m, \ldots, u_M)$
$X_V = (v_1, \ldots, v_n, \ldots, v_N)$

Image by Daniel Lemire

Page 5: Time Machine session @ ICME 2012 - DTW's New Youth

Dynamic Time Warping (II)
•  The optimal alignment can be found in O(MN) complexity using dynamic programming.

•  To do so, one needs to define a cost function between any two points in the series and build a distance matrix:

$d : \Re^D \times \Re^D \rightarrow \Re_{\geq 0}$, where usually $d(i, j) = \lVert u_i - v_j \rVert$ (Euclidean distance).

Warping function: $F = c(1), \ldots, c(K)$, where $c(k) = (i(k), j(k))$.

Image by Tsanko Dyustabanov

Page 6: Time Machine session @ ICME 2012 - DTW's New Youth

Warping constraints
For speech signals some constraints are usually applied to the warping function F:

– Monotonicity:  $i(k-1) \leq i(k)$,  $j(k-1) \leq j(k)$

– Continuity (i.e. local constraints):  $i(k) - i(k-1) \leq 1$,  $j(k) - j(k-1) \leq 1$

Sakoe, H. and Chiba, S. (1978), "Dynamic programming algorithm optimization for spoken word recognition", IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, pp. 43-49.

Local path: cell (m, n) is reached from (m-1, n-1), (m-1, n) or (m, n-1), accumulating

$D(m,n) = \min \{ D(m-1,n),\; D(m,n-1),\; D(m-1,n-1) \} + d(u_m, v_n)$
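To make the recursion concrete, here is a minimal sketch in Python (not code from the talk) of plain DTW with the local constraints above; the Euclidean local distance matches the definition on the previous slide, and all names are illustrative.

```python
import numpy as np

def dtw(X, Y):
    """Basic DTW: X is (M, D), Y is (N, D). Returns the accumulated
    cost D(M, N) and the full accumulated-cost matrix."""
    M, N = len(X), len(Y)
    # Local (Euclidean) distance matrix d(m, n) = ||u_m - v_n||
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

    # Accumulated cost with the three local predecessors shown above
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            D[m, n] = d[m - 1, n - 1] + min(D[m - 1, n],      # vertical step
                                            D[m, n - 1],      # horizontal step
                                            D[m - 1, n - 1])  # diagonal step
    return D[M, N], D[1:, 1:]

# Example: two short random sequences of 13-dimensional features
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=(50, 13)), rng.normal(size=(60, 13))
    cost, _ = dtw(X, Y)
    print("DTW alignment cost:", cost)
```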

Page 7: Time Machine session @ ICME 2012 - DTW's New Youth

Warping constraints (II)
– Boundary condition:  $i(1) = 1$,  $j(1) = 1$,  $i(K) = M$,  $j(K) = N$

  i.e. DTW needs prior knowledge of the start and end alignment points.

– Global constraints, such as the Sakoe-Chiba band, limit how far the path may deviate from the diagonal (a sketch follows below).

Image from Keogh and Ratanamahatana
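A minimal sketch of the previous routine with a Sakoe-Chiba global constraint added; the half-width parameter r and the rescaling of the band centre are my own illustrative choices, not values from the slides.

```python
import numpy as np

def dtw_band(X, Y, r=10):
    """DTW restricted to a Sakoe-Chiba band: only cells within r of the
    (rescaled) diagonal are evaluated; all other cells stay at infinity."""
    M, N = len(X), len(Y)
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        # Centre of the band for row m, rescaled to the length of Y
        c = int(round(m * N / M))
        for n in range(max(1, c - r), min(N, c + r) + 1):
            D[m, n] = d[m - 1, n - 1] + min(D[m - 1, n],
                                            D[m, n - 1],
                                            D[m - 1, n - 1])
    return D[M, N]
```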

Page 8: Time Machine session @ ICME 2012 - DTW's New Youth

Seminal works in DTW

Hiroaki Sakoe and Seibi Chiba, "A dynamic programming approach to continuous speech recognition", in Proc. 7th ICA, Paper 20 C13, Aug. 1971.

Hiroaki Sakoe and Seibi Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), pp. 43-49, 1978.

Page 9: Time Machine session @ ICME 2012 - DTW's New Youth

T. K. Vintsyuk, "Speech Discrimination by Dynamic Programming", Kibernetika, Vol. 4, No. 1, pp. 81-88, 1968.

Excerpt of the paper shown on the slide: recognition is reduced to finding the shortest (or longest) pathway in a graph built over the branching scheme of the reference standard and the unknown sequence, computed with the recursive formula $D_{ij} = \min(D_{(i-1)(j-1)}, D_{(i-1)j}) + d_{ij}$, i.e. dynamic programming.

Minimal length reference pattern

Input query

Even before…

Page 10: Time Machine session @ ICME 2012 - DTW's New Youth

And… V. M. Velichko and N. G. Zagoruyko, "Automatic Recognition of 200 Words", Int. Journal on Man-Machine Studies, vol. 2, pp. 223-234, 1970.

Excerpt of the paper shown on the slide: a similarity matrix is built between the two words; the permitted moves are horizontal (left to right), vertical (top to bottom) and diagonal (top-left to bottom-right); the similarity A between words is the maximum of a functional F over all possible routes through the matrix, found with dynamic programming.

Page 11: Time Machine session @ ICME 2012 - DTW's New Youth

Substitution by HMMs
•  In the '80s, with the availability of computers with more memory, it became possible to better store models and to apply statistical processing.

L. R. Rabiner and B. H. Juang, "An introduction to Hidden Markov Models", IEEE ASSP Magazine, Vol. 3, no. 1, pp. 4-16, 1986.

Page 12: Time Machine session @ ICME 2012 - DTW's New Youth

DTW vs. ASR for Speech Recognition

Dynamic Time Warping
•  Data: some examples, no labeling needed
•  Time: none for training, costly to test
•  Accuracy: mid to high

Hidden Markov Models
•  Data: lots of labeled data at phoneme level is needed
•  Time: costly for training, light for testing
•  Accuracy: high

DTW has long been abandoned in ASR for high-resourced languages, with one exception:
De Wachter et al., "Template-based continuous speech recognition", IEEE Trans. on Audio, Speech and Language Processing, 15(4), pp. 1377-1390, 2007.

Page 13: Time Machine session @ ICME 2012 - DTW's New Youth

The Zero-Resource Setting

1.  No transcribed training data
2.  No dictionaries
3.  No knowledge of linguistic structure

Challenge: How can we discover linguistic structure and build useful applications and systems without much supervision?

But: assume you have untranscribed speech (at least your test data).

 

Page 14: Time Machine session @ ICME 2012 - DTW's New Youth

Calling DTW back

PRO: DTW works with patterns; no need for costly transcriptions or knowledge of the language.

CON: DTW compares patterns given known start and end positions -> it requires at least one of the patterns to be well bounded.

Page 15: Time Machine session @ ICME 2012 - DTW's New Youth

From DTW to subsequence matching

Given several instances of an acoustic sequence, can we modify DTW to find them?

Page 16: Time Machine session @ ICME 2012 - DTW's New Youth

DTW-based subsequence matching

•  Segmental DTW by James Glass et al. at MIT
•  "Image-based" DTW by Aren Jansen et al. at Johns Hopkins University
•  Motif discovery by Guillaume Gravier et al. at IRISA (Rennes)
•  DTW for music by Meinard Müller et al. at the Max Planck Institut
•  Unbounded-DTW by Xavier Anguera et al. at Telefónica Research

Page 17: Time Machine session @ ICME 2012 - DTW's New Youth

Segmental DTW

A. Park and J. Glass, "Unsupervised Pattern Discovery in Speech", IEEE Trans. on Audio, Speech and Language Processing, 2008

Excerpts of the papers shown on the slide: segmental DTW cuts the distance matrix between two utterances into overlapping diagonal bands of width W, searches for the best DTW alignment within each band, and trims each path to the least-average subsequence of minimum length L; the trimmed path with the lowest average distortion is retained as the best subsequence alignment between the two utterances. A related variant applies the same search to Gaussian posteriorgram features, using the local distance $D(p, q) = -\log(p \cdot q)$ and an adjustment-window condition $|i_k - j_k| \leq R$, refines the paths with the length-constrained minimum-average-subsequence algorithm, and clusters the matched fragments with a graph clustering algorithm to group recurring patterns. A sketch of the band-wise search follows below.

Fig. 1. The segmental DTW algorithm: the distance matrix between U1 and U2 is cut into overlapping diagonal bands of width W; the optimal path within each band is trimmed to the least average subsequence.
Fig. 2. Converting all matched fragment pairs to a graph: each node is a temporal local maximum of fragment similarity in an utterance, and each matching fragment becomes an edge between two nodes.
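A rough sketch of the band-wise search just described: for each overlapping diagonal band, a band-constrained DTW path is computed and the path with the lowest average local distance is kept. It omits the least-average-subsequence trimming step of the original algorithm; all names, the band half-width R and the band spacing are illustrative.

```python
import numpy as np

def band_dtw(d, start_j, R):
    """DTW inside one diagonal band of half-width R of the local distance
    matrix d, starting at cell (0, start_j). Returns the path and its
    average local distance."""
    M, N = d.shape
    D = np.full((M, N), np.inf)
    back = {}
    D[0, start_j] = d[0, start_j]
    for i in range(M):
        lo = max(0, start_j + i - R)
        hi = min(N - 1, start_j + i + R)
        for j in range(lo, hi + 1):
            if i == 0 and j == start_j:
                continue
            preds = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            preds = [p for p in preds
                     if p[0] >= 0 and p[1] >= 0 and np.isfinite(D[p])]
            if not preds:
                continue
            best = min(preds, key=lambda p: D[p])
            D[i, j] = D[best] + d[i, j]
            back[(i, j)] = best
    # End anywhere on the last row of the band
    j_end = int(np.argmin(D[M - 1]))
    if not np.isfinite(D[M - 1, j_end]):
        return None, np.inf
    path = [(M - 1, j_end)]
    while path[-1] in back:
        path.append(back[path[-1]])
    path.reverse()
    return path, D[M - 1, j_end] / len(path)

def segmental_dtw(X, Y, R=5):
    """Try each overlapping diagonal band and keep the best alignment."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    best = (None, np.inf)
    for start_j in range(0, d.shape[1], R):   # overlapping band starts
        path, avg = band_dtw(d, start_j, R)
        if avg < best[1]:
            best = (path, avg)
    return best
```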

Page 18: Time Machine session @ ICME 2012 - DTW's New Youth

"Image-based" DTW

A. Jansen et al., "Towards spoken term discovery at scale with zero resources", Interspeech 2010

Image processing can be used to speed up the discovery of possible matches in the similarity matrix (a rough sketch follows below):

Pass 1: Efficient line segment detection
– Step 1: Threshold and apply a diagonal median filter of duration Δt
– Step 3: Use Hough transform peaks as starting points to perform line segment search
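A rough sketch of the image-processing idea, assuming a similarity matrix between two sequences: it thresholds the matrix, median-filters each diagonal and keeps long runs as candidate match segments. The threshold, filter length and minimum run length are illustrative parameters, not values from the paper, and the Hough-transform refinement pass is omitted.

```python
import numpy as np
from scipy.ndimage import median_filter

def diagonal_match_candidates(sim, thresh=0.8, filt_len=9, min_len=25):
    """Threshold a similarity matrix, median-filter each diagonal and
    return (offset, start, length) of long surviving diagonal runs."""
    binary = (sim > thresh).astype(np.uint8)
    candidates = []
    M, N = binary.shape
    for offset in range(-(M - 1), N):          # every diagonal of the matrix
        diag = np.diagonal(binary, offset=offset).copy()
        if len(diag) < min_len:
            continue
        # 1-D median filter along the diagonal removes speckle noise
        diag = median_filter(diag, size=filt_len)
        # Find runs of consecutive ones
        padded = np.concatenate(([0], diag, [0]))
        starts = np.flatnonzero(np.diff(padded) == 1)
        ends = np.flatnonzero(np.diff(padded) == -1)
        for s, e in zip(starts, ends):
            if e - s >= min_len:
                candidates.append((offset, int(s), int(e - s)))
    return candidates
```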

Page 19: Time Machine session @ ICME 2012 - DTW's New Youth

Automatic motif discovery

A. Muscariello et al., "Towards Robust Word Discovery by Self-Similarity Matrix Comparison", Proc. ICASSP 2011

Excerpts of the papers shown on the slide: repetitions of a seed block are searched with a segmental, locally normalized variant of DTW (SLNDTW) that relaxes the boundary constraints; because a DTW similarity measure alone yields a strongly speaker-dependent system, candidate matches are further validated by comparing the self-similarity matrices (SSMs) of the two sequences. Match boundaries are refined by extending the alignment path point by point as long as the average path weight does not grow by more than 10%, and a "fragmental" variant of SLNDTW retrieves a motif regardless of its position in the query by first finding a fragment of it and then extending the corresponding path. A sketch of the boundary-extension heuristic follows below.
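The boundary-refinement heuristic quoted above can be sketched as follows; the 10% tolerance and the three candidate neighbours come from the excerpt, while all names and the data layout are illustrative.

```python
import numpy as np

def extend_path_end(d, path):
    """Greedily extend a warping path forward from its ending point:
    add the neighbour with the lowest local distance, and stop once the
    average path weight exceeds the original average by more than 10%."""
    weights = [d[i, j] for i, j in path]
    w0 = np.mean(weights)                       # original average weight
    ie, je = path[-1]
    M, N = d.shape
    while True:
        neighbours = [(ie + 1, je + 1), (ie + 1, je), (ie, je + 1)]
        neighbours = [(i, j) for i, j in neighbours if i < M and j < N]
        if not neighbours:
            break
        i, j = min(neighbours, key=lambda p: d[p])
        new_avg = (sum(weights) + d[i, j]) / (len(weights) + 1)
        if new_avg >= 1.10 * w0:
            break                               # extension would degrade the match
        weights.append(d[i, j])
        path = path + [(i, j)]
        ie, je = i, j
    return path
```

The same procedure can be applied backwards from the starting point of the path, as the excerpt notes.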

Page 20: Time Machine session @ ICME 2012 - DTW's New Youth

Music structure analysis

Meinard Müller, "Information Retrieval for Music and Motion", Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010

Page 21: Time Machine session @ ICME 2012 - DTW's New Youth

Unbounded-DTW

X. Anguera et al., "Partial Sequence Matching using an Unbounded Dynamic Time Warping Algorithm", ICASSP 2010

Page 22: Time Machine session @ ICME 2012 - DTW's New Youth

Unbounded-DTW

Page 23: Time Machine session @ ICME 2012 - DTW's New Youth

The search for speed and scalability

•  Coarse-to-fine approximation of DTW
   – S. Salvador and P. Chan, "FastDTW: Toward accurate dynamic time warping in linear time and space", 3rd Wkshp. on Mining Temporal and Sequential Data, ACM KDD 2004

•  Intelligent bounding of DTW (a lower-bound sketch follows after this list)
   – Work by Eamonn Keogh (pure DTW, no subsequences)
   – Y. Zhang and J. Glass, "A Piecewise Aggregate Approximation Lower-Bound Estimate for Posteriorgram-based Dynamic Time Warping", Interspeech 2011

•  Use of IR techniques
   – A. Jansen et al., "Efficient Spoken Term Discovery using Randomized Algorithms", ASRU 2011
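As an example of the "intelligent bounding" idea, here is a minimal sketch of Keogh's LB_Keogh lower bound for equal-length 1-D series under a band of half-width r: candidates whose lower bound already exceeds the best DTW distance found so far can be discarded without running the full dynamic program. This is a generic textbook formulation, not code from the cited works.

```python
import numpy as np

def lb_keogh(query, candidate, r=10):
    """LB_Keogh lower bound on the squared band-constrained DTW distance
    between two equal-length 1-D series."""
    q = np.asarray(query, dtype=float)
    c = np.asarray(candidate, dtype=float)
    lb = 0.0
    n = len(q)
    for i, qi in enumerate(q):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        upper, lower = c[lo:hi].max(), c[lo:hi].min()   # band envelope of c
        if qi > upper:
            lb += (qi - upper) ** 2                     # point above the envelope
        elif qi < lower:
            lb += (qi - lower) ** 2                     # point below the envelope
    return lb

# Typical use: skip the expensive DTW computation whenever
# lb_keogh(q, c, r) already exceeds the best squared DTW distance so far.
```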

Page 24: Time Machine session @ ICME 2012 - DTW's New Youth

(some) applications of the technology

•  Finding structure in an unknown language -> zero-resource approaches (JHU workshop this summer)
   – Helping ASR by increasing the training data (Jansen_2011)

•  Acoustic document comparison and topic detection
   – NLP on speech (Dredze_2010)

•  Query-by-example search (Metze_2011)
•  Spoken term discovery (Muscariello_2011)
•  Spoken summarization (Jansen_2010, Flamary_2011)
•  Acoustic indexing enhancement via transcription propagation

Page 25: Time Machine session @ ICME 2012 - DTW's New Youth

Conclusion

•  The old DTW is back
   – It will not reclaim ASR, but it takes on new challenges

•  Lots of research to be done
   – On scalability (matching patterns is still expensive)
   – On generality
   – On robustness

Page 26: Time Machine session @ ICME 2012 - DTW's New Youth

I did not have time to mention…

•  Symbol-based subsequence matching
•  Acoustic features for subsequence matching

Page 27: Time Machine session @ ICME 2012 - DTW's New Youth

References
- T. K. Vintsyuk, "Speech Discrimination by Dynamic Programming", Kibernetika, Vol. 4, No. 1, pp. 81-88, 1968.
- V. M. Velichko and N. G. Zagoruyko, "Automatic Recognition of 200 Words", Int. Journal on Man-Machine Studies, vol. 2, pp. 223-234, 1970.
- H. Sakoe and S. Chiba, "A dynamic programming approach to continuous speech recognition", in Proc. 7th ICA, Paper 20 C13, Aug. 1971.
- H. Sakoe and S. Chiba, "Dynamic Programming Algorithm Optimization for Spoken Word Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1), pp. 43-49, 1978.
- S. Salvador and P. Chan, "FastDTW: Toward accurate dynamic time warping in linear time and space", 3rd Wkshp. on Mining Temporal and Sequential Data, ACM KDD 2004.
- De Wachter et al., "Template-based continuous speech recognition", IEEE Trans. on Audio, Speech and Language Processing, 15(4), pp. 1377-1390, 2007.
- A. Park and J. Glass, "Unsupervised Pattern Discovery in Speech", IEEE Trans. on Audio, Speech and Language Processing, 2008.

Page 28: Time Machine session @ ICME 2012 - DTW's New Youth

References (II)
- Meinard Müller, "Information Retrieval for Music and Motion", Springer-Verlag, ISBN 978-3-540-74047-6, pp. 147-150, 2010.
- X. Anguera et al., "Partial Sequence Matching using an Unbounded Dynamic Time Warping Algorithm", ICASSP 2010.
- A. Jansen et al., "Towards spoken term discovery at scale with zero resources", Interspeech 2010.
- R. Flamary et al., "SpokenWordCloud: Clustering Recurrent Patterns in Speech", in Proc. CBMI 2011.
- M. Dredze et al., "NLP on Spoken Documents without ASR", in Proc. EMNLP 2010.
- A. Muscariello et al., "Towards Robust Word Discovery by Self-Similarity Matrix Comparison", Proc. ICASSP 2011.
- A. Jansen et al., "Efficient Spoken Term Discovery using Randomized Algorithms", ASRU 2011.
- Y. Zhang and J. Glass, "A Piecewise Aggregate Approximation Lower-Bound Estimate for Posteriorgram-based Dynamic Time Warping", Interspeech 2011.
- A. Jansen et al., "Towards unsupervised training of speaker independent acoustic models", in Proc. Interspeech, 2011.
- F. Metze, "The spoken web search task at Mediaeval 2011", ICASSP 2011.