Dynamic Time Warping’s new Youth
Xavier Anguera Telefónica Research
Outline
• Was there anything before DTW? • DTW review • Here comes HMM • DTW comeback – Algorithms – Speed and scalability – ApplicaFons
• Conclusions
Before DTW • Speech recogniFon was done by means of paIern-‐matching input with reference paIerns.
• Speaking rate variaFons create nonlinear fluctuaFons on the Fme axis.
• Some linear transformaFons were tested, but not successfully. – Dynamic Fme warping (DTW) will be able to help
Dynamic Time Warping -‐ DTW • DTW algorithm allows the computaFon of the opFmal alignment between two Fme series Xu, Yv ε ΦD
Image by Daniel Lemire
!
XU = (u1,...,um,...,uM )
!
XV = (v1,....,vn,..,vN )
Dynamic Time Warping (II) • The opFmal alignment can be found in O(MN) complexity using dynamic programming.
• To do so, one needs to define a cost funcFon between any two points in the series and build a distance matrix:
!
d :"D # "D $%& 0
Image by Tsanko Dyustabanov
!
d(i, j) = um " vn
Where usually:
!
c(i(k), j(k))
!
F = c(1),...,c(K)Warping funcFon: where
Euclidean distance
Warping constraints For speech signals some constraints are usually applied to the warping funcFon F: – Monotonicity:
– ConFnuity (i.e. local constraints):
!
i(k "1) # i(k)
!
j(k "1) # j(k)
!
i(k) " i(k "1) #1
!
j(k) " j(k "1) #1
Sakoe,H. and Chiba,S. (1978) “Dynamic programming algorithm op0miza0on for spoken word recogni0on”, IEEE Trans. on Acoust., Speech, and Signal Process, ASSP-‐26, 43-‐49.
(m, n)
(m-‐1, n-‐1)
(m-‐1, n)
!
D(m,n) =minD(m "1,n)D(m,n "1)D(m "1,n "1)
#
$ %
& %
+ d(um,vn )
Warping constraints (II) – Boundary condiFon:
i.e. DTW needs prior knowledge of the start-‐end alignment points.
– Global constraints !
i(1) =1
!
j(1) =1
!
i(K) = M
!
j(K) = N
Image from Keogh and Ratanamahatana
Seminal works in DTW
Hiroaki Sakoe and Seibi Chiba, “A dynamic programming approach to con0nuous speech recogni0on,”in 1971 Proc. 7th ICA, Paper 20 CI3, Aug. 1971. Hiroaki Sakoe and Seibi Chiba, “Dynamic Programming Algorithm Op0miza0on for Spoken Word Recogni0on”, IEEE TransacFons on Audio, Speech and Signal Processing, 26(1) pp. 43-‐49, 1978
T.K.Vintsyuk, “Speech Discrimina0on by Dynamic Programming”, KiberneFks Vol. 4, No 1, pp. 81-‐88, 1968.
56 KIBERNETIKA
number of poss ib le combinat ions of l - 1 a r rows of which q - 1 a r e oblique
C q - I N (l, q) = l--l. (10)
For example, for the o f t en - r equ i r ed set of values l = = 20 a n d q = 12 we ob ta inN(20 ,12) = 75 582. For the rea l i za t ion of recogni t ion c r i t e r i a (6), (8), and (9) it is hardly p rac t i cab le to scan such a l a rge number of s tandards in o r d e r to de t e rmine the probabil i ty that X l will belong to one c lass .
k , A'. A~ ~] 1 " -
1/~----~ v - r - - - v - 7 A / l \ x :
~ l \ \ | 8 89
8 ' / )
We a re in te res ted in var ious methods of d i rec ted s e a r c h among the s tandards of the c l a s se s , which quickly lead to a synthesis of the p e r m i s s i b l e s tandard of the c lass n e a r e s t to the unknown sequence X l in the sense of the c r i t e r i a (6), (8), and (9).
We shall show that the p rob lem of finding the n e a r - es t p e r m i s s i b l e s tandard of a c lass is reduced to one of the p rob lems in ma themat ica l p rogramming , i . e . , to the problem of finding the shor t e s t path in a graph.
We r ep l ace the digits i = 1, 2 . . . . . l of the b ranch - ing scheme of d imensions (l, q) by points. We ass ign to each point a pair of subscr ip ts ij. The f i r s t sub- sc r ip t i is de te rmined by the number of the ve r t i c a l column whose digit is r ep laced by the point. The s e c - ond subscr ip t j is equal to the rep laced digit of the branching scheme.
We ass ign to each a r row of the branching scheme which goes into the point ij the dis tance d ~ = ] x~ - - ~h 12, ii i if we want to r e a l i z e the recogni t ion c r i t e r i o n (6) ; or the value
( C ) = x,. '1 ] ,;. = i x0,1 ' 141
when we want to r ecogn ize according to the c r i t e r i a (8) and (9).
Thus we obtain a graph given in Fig. 1 for the case q = 5 and l = 8.
If we move f rom point A of the graph to point B, we pass over exact ly l points of the graph (including A and B), the second subscr ip t s of which de t e rmine the p e r m i s s i b l e s tandard of length l of the c lass k.
The sum of the values d~, (Qk)Z or (Rk) 2 of the a r rows lying along some pathway f rom A' and into B, de t e rmines the degree of co r r e spondence between the
p e r m i s s i b l e s tandard cor responding to this pathway and the unknown sequence X l 9 It is now c lea r that the p rob lem of finding the nea r e s t p e r m i s s i b l e s tandard of a c l a s s has been reduced to the p rob lem of finding the sho r t e s t [cr i te r ion (6)], or the longest [ c r i t e r i a (8) and (9)] pathways f rom A' to B in the graph de te rmined by the model in Fig. 1. The length of the sho r t e s t or longest pathway f rom A' to 'B is a m e a s u r e of the p rob- abili ty that the unknown sequence X l belongs to the k- th c l a s s .
There a re well-known a lgor i thms for finding the sho r t e s t or longes t pathways in a graph [5]. They are based on a d i r ec ted sea rch among the poss ib le pathways.
The graph cons idered he re has a specia l s t ruc tu re so that we can cons t ruc t a s imple p rocedure for f ind- ing the sho r t e s t pathway which is quite di f ferent f rom the methods desc r ibed in the l i t e r a tu r e . The r ea son for this is that it is n e c e s s a r y only to find the length of the shor t e s t pathway in the graph.
Let us cons ider succes s ive ly the groups of points in the graph with ident ical subscr ip t s i = 1, 2 . . . . , l 9
We begin with the points of the graph having the sub- sc r ip t s i = 1. There is a unique point of this kind. The point ij = 11 can be reached along one pathway. We ass ign to this point the path- length D11 = d l r Then we turn to the points with subscr ip ts i = 2, and so on. By succes s ive ly making the t rans i t ion f rom all points with subscr ip t s (i - 1) to al l points with subscr ip t i, we can ass ign lengths to the points of the graph with sub- sc r ip t s i according to the following r e c u r s i v e formula :
Di: = r a i n (D(i_1)(i_l), D(i_I)/) -~ dii. (11)
Every t ime the r e c u r s i v e fo rmula (11) is used, we e l imina te the pathways f rom A' to B which go through the point ij, and which are always longer than the r e - maining ones. For example, ass igning according to (11) the value Dij to the point ij = 42, we e l imina te some of the pathways f r o m A' to B which go through the point 42, and then we have to cons ider only those pathways f rom A' to B which a re continuations of the shor t e s t ini t ial pathway f rom the point A' to the point ij = 42.
As a r e su l t of this procedure , on reach ing the las t point of the graph ij = / q it will be assigned the value D/ , which de t e rmines the length of the shor t e s t path- q way in the graph.
For the r ea l i za t ion of the recogni t ion c r i t e r i a (6), (8), and (9) we cons t ruc t exact ly K graphs (according to the number of c l a s s e s ) . The s ize of each of these graphs is de te rmined by the number / of e lements x i in the unknown sequence X l , and by the number qk of e lements in the s tandard ~ of the c lass k.
It is easy to see that for the rea l i za t ion of the e l e - m e n t - b y - e l e m e n t recogni t ion a lgor i thm a r e l a t i ve ly sma l l amount of ca lcula t ion wil l suffice: we have to de te rmine the quantit ies d k or (Qk)2, or (Rk) 2 for
each graph, and then we have to find the shor t e s t or longest pathway and ass ign the unknown sequence to the c lass whose graph has the shor t e s t or longes tpa th .
When for some r ea son the unknown sequence X l ean lose some of its e lements (the f i r s t or the las t one),
Minimal length reference paIern
Input query
Even before…
And… V.M. Velichko and N.G. Zagoruyko, “Automa0c Recogni0on of 200 Words”, Int. Journal on Man-‐Machine Studies, vol. 2, pp. 223-‐234, 1970.
AUTOMATIC RECOGNITION OF 200 WORDS 227
matrix dements {aiK } through which a chosen route divided by a length of the longer word passes. In the given example with n > m it will be a sum
1- (a l t + a 2 a + . . . + a m , n - l ) n
corresponding to parts AB, CD . . . . QS. Path segments BC . . . . PQ, STpass between the matrix elements and do not contribute to the common sum. The possible routes are limited: only the path segments along the horizontal from left to right, along the vertical from top to bottom, and along the diagonal from left-top to right-bottom are permitted.
| 2 3 k n-I n
I 711 ~N !012 013 Olk Oltn-I OIk
2 =21 I O oz* a z . . i a2n \ \
\ \
\ \ K
\ i alj al2 a~3 ainu, !al,.-i Oik
"N \
m - I Ore.i, I Om.l~ 0re_l, 3 •m-l,k
\ \
P Om-I,n-I Gin-I, k
a,.,x._ a,~. $ T
0~ 2 FIG. 2. Matrix of similarity, a,K - d2+p21 x
segment of vertical word and the kth one of horizontal word.
the measure of similarity between the ith
The measure of similarity A between words is the maximum of the functional F, calculated along all possible paths of the matrix {air }: A = max F. It is easy to see the matrix represents a graph or A T net and the problem of functional maximization is that of finding the maximum routes of the diagram. In this experiment the maximum path length is calculated using the techniques of dynamic programming. A brief description of algorithm applied is given below.
Let us enumerate the matrix nodes {aiK } (that is, points A, B . . . . T) according to the following rule. The point situated leftmost at uppermost to
SubsFtuFon by HMM’s • In the 80’s, with the availability of computers with memory and
possibility to beIer store models and staFsFcal processing
L.R.Rabiner and B.H.Juang, “An introduc0on to Hidden Markov Models”, IEEE ASSP Magazine, Vol. 3, no 1, pp. 4-‐16, 1986
DTW vs. ASR for Speech RecogniFon
Dynamic Time Warping • Data: Some examples, no
labeling needed • Time: none for training,
costly to test • Accuracy: Mid to high
Hidden Markov Models • Data: Lots of labeled data at
phoneme level is needed • Costly for training, light for
tesFng • Accuracy: high
DTW has long been abandoned in ASR for high-‐resourced languages, with one excepFon: De Wachter et al., “Template-‐based con0nuous speech recogni0on”, IEEE Trans. On Audio, Speech and Language Processing, 15(4) pp. 1377-‐1390
The Zero-‐Resource Seong
1. No transcribed training data 2. No dicFonaries 3. No knowledge of linguisFc structure
Challenge: How can we discover linguisFc structure and build useful applicaFons and systems without much supervision?
But: Assume you have untranscribed speech (at least your test data)
Calling DTW back PRO: DTW works with paIerns, no need for costly transcripFons or knowledge of the language. CON: DTW compares paIerns given a known start-‐end posiFon -‐> needs that at least one of the paIerns be well bounded
From DTW to subsequence matching Given several instances of an acousFc sequence,
Can we modify DTW to find them?
DTW-‐based subsequence matching
• Segmental-‐DTW by James Glass et al. at MIT • “image-‐based” DTW by Aren Jansen et al. at John Hopkins Univ.
• Mo0f Discovery by Guillaume Gravier et al. at IRISA (Rennes).
• DTW for music by Meinard Müller et al. at Max Planck InsFtut.
• Unbounded-‐DTW by Xavier Anguera et al. at Telefonica research
Segmental DTW
A. Park and J. Glass, “Unsupervised PaWern Discovery in Speech”, IEEE Trans. On Audio, Speech and Language Processing, 2008
without resorting to an intermediate symbolic representation.The remainder of the paper is organized as follows. In Sec-
tion 2, we describe the segmental DTW algorithm which gener-ates the best matching subsequence between two utterances. Next,we describe the data used in our experiments and outline the stepsneeded to process an entire audio lecture using the segmental DTWalgorithm. In Sections 4 and 5 we propose two direct uses for ouralgorithm and show outputs for these preliminary experiments. Fi-nally, we discuss several directions for future work and conclude.
2. SEGMENTAL DTW
Dynamic time warping (DTW) is typically used to find the optimalglobal alignment between two whole word exemplars [3]. First,the two speech waveforms are converted into spectral vector timeseries, {vi}
Mi=1, {vj}
Nj=1. Next, a distance matrix D is computed
by taking the pairwise distance between vectors in each utterancesuch that D(i, j) = !vi " vj!. Finally, dynamic programming isused to find the lowest distortion path through the distance matrixbetweenD(1, 1) andD(M, N).
Recently, researchers have revisited the use of DTW for speechrecognition as well as speaker detection [4, 5]. In the context ofour work, the DTW algorithm is useful because it can be used tomeasure the similarity of two utterances directly at the acousticlevel. However, performing a globally optimal alignment betweentwo utterances is not suitable for finding matching words whenthose utterances are composed of multiple words. The start andend points constrain the algorithm to find only the globally op-timal alignment matching the entire utterances together, and notlocal alignments which correspond to the matching subsequences.
An alternative method to this global alignment procedure isto use segmental DTW, which finds local alignments by searchingmultiple paths in the distance matrix. Given two utterances, U1
and U2, segmental DTW works by dividing the distance matrixinto a set of overlapping diagonal bands with widthW and search-ing for the best alignment within each band. The diagonal bandsserve multiple purposes. First, they constrain the degree of warp-ing so that two sub-utterances are not overly temporally distortedduring alignment. Second, they allow for multiple alignments, aseach band corresponds to another potential path with start and endpoints that differ fromD(1, 1) andD(M, N).
Following path discovery, each path is trimmed by finding theleast average subsequence with minimum length L. The minimumlength criterion is used to prevent spurious matches between shortsegments within each utterance. Since each cell D(i, j) corre-sponds to the distortion between frames i and j, the least averagesubsequence represents the portion of the aligned path which ex-hibits good alignment. The algorithm used to compute the subpathis described in [6].
At this stage we are left with a single path fragment for eachof the original bands in the distance matrix. The path fragmentwith the lowest average distortion is retained as the best alignmentpath between the two utterances, and this minimum average dis-tortion is used as the distance between U1 and U2. This alignmentpath along with its associated distortion forms the basis for furtherprocessing that we will describe in the following sections.
Utterance 1
Utt
era
nc
e 2
Distance Matrix
W
U1
U2
Fig. 1. The segmental DTW algorithm. As in normal DTW, thedistance matrix is computed between U1 and U2. The matrix isthen cut into overlapping diagonals (only one shown here) withwidth W . The optimal alignment path within the diagonal bandis then found using DTW and the resulting path is trimmed to theleast average subsequence. Finally, the trimmed alignment path isretained as the best subsequence alignment between U1 and U2.
3. EXPERIMENTAL DETAILS
3.1. Corpus
The speech data used in this paper is taken from a corpus of audiolectures collected at MIT. The entire corpus consists of approxi-mately 300 hours of lectures from a variety of academic coursesand seminars. The audio was recorded using an omni-directionalmicrophone in a classroom environment. In a previous paper, wedescribed characteristics of this lecture data and performed prelim-inary recognition and information retrieval experiments [7]. Manyaspects of the lecture data make it particularly well suited for thepattern discovery and speech clustering techniques described inthis paper. Each lecture typically contains a large amount of speech(from thirty minutes to an hour) from a single person in an un-changing acoustic environment. On average, each lecture con-tained only 800 unique words, with high usage of subject-specificwords and phrases. Each of the the lectures used in our experi-ments was the introductory lecture taken from courses in computerscience (CS), physics, and linear algebra at MIT.
3.2. Preprocessing
Prior to processing with segmental DTW, the lecture is segmentedinto continuous utterances by using a basic phone recognizer toidentify regions of silence in the signal. Silent regions with dura-tion longer than 2 seconds are removed and the portions of speechin between those silences are used as the input to the segmentalDTW algorithm. In the absence of a phone recognizer, segmenta-tion can also be performed using a speech activity detector.
For the CS lecture processed in this paper, the segmentationphase resulted in 1946 segments averaging 1.2 seconds each. Eachsegment was then converted into a series of 14-dimension MFCCvectors extracted every 5 ms, then whitened using principal com-ponents analysis over all vectors from all utterances.
Fig. 1. An illustration of segmental dynamic time warping betweentwo utterances with R = 2. The blue and red regions outline possi-ble DTW warping spaces for two different starting times.
2.2. Gaussian Posteriorgram Generation
To generate Gaussian posteriorgrams, a GMM is trained on allspeech frames without any transcription, and each frame is decodedby the trained GMM to obtain a raw posterior probability vector.Then, a discounting-based smoothing method is employed to eachposterior probability vector to produce a Gaussian posteriorgram.
In our work, each speech frame is represented by the first 13MFCCs. After pre-selecting the number of desired Gaussian com-ponents, the K-means algorithm is used to determine an initial set ofmean vectors. A GMM is then trained on all speech frames. Sincewe have observed uneven clustering results caused by the presenceof noise and non-speech artifacts, we use a speech/non-speech detec-tor to remove all non-speech segments longer than one second priorto clustering.
Once a GMM is trained, a raw Gaussian posteriorgram vector iscalculated by Equation 1. To avoid approximation errors, a proba-bility floor is set to eliminate dimensions (i.e., set them to zero) withvery small probability values. Then, a discounting based smooth-ing method is used to move a small portion of the probability massfrom non-zero dimensions to zero dimensions. This smoothing helpsduring the time warping pairwise distance matching.
2.3. Segmental DTW
Segmental dynamic time warping (S-DTW) has been successfullyused in our prior speaker-dependent pattern discovery work [2] andin our recent unsupervised keyword spotting research [4]. Aftergenerating the Gaussian posteriorgrams for all speech utterances,S-DTW is performed on every utterance pair to find candidate co-occurring subsequences in the two utterances.
To employ S-DTW, the difference function D between Gaus-sian posterior probability vectors p and q is defined as D(p, q) =! log(p · q). The dot product assumes these two probability vec-tors are drawn from the same underlying probability distribution.Given two Gaussian posteriorgrams GPi = (p1, p2, · · · , pm) andGPj = (q1, q2, · · · , qn), the warping function w(ik, jk) can pro-duce a m " n timing difference matrix, where ik and jk denote thek-th coordinate of the warping path.
Two constraints are applied to the S-DTW search. One is the ad-justment window condition that prevents the warping process fromgoing too far ahead or behind in eitherGPi orGPj . Formally, an in-tegerR is set to ensure |ik!jk| # R. The other constraint is the steplength condition. It is clear that the adjustment window condition re-stricts the shape and the ending coordinates of a warping path. Givendifferent start coordinates, the difference matrix can be naturally di-vided into several continuous diagonal regions with width 2R + 1,
Fig. 2. Converting all matched fragment pairs to a graph. Eachnumbered node corresponds to a temporal local maximum in frag-ment similarity in a particular utterance (e.g., 1-5). Each matchingfragment is represented by a connection between two nodes in thegraph (e.g., 1-4, 2-4, 3-5).
as shown in Figure 1. The red warping region s2 denotes the warp-ing process along with the j axis, while the blue warping region s3
denotes the warping process along the i axis. To avoid redundantcomputation and take into account warping paths across segmenta-tion boundaries, an overlapped sliding window moving strategy isapplied for the start coordinates, shown in the Figure 1.
2.4. Path Refinement
By moving the start coordinate along the i and j axis, for every pairof speech utterances, we can obtain a total of n!1
R+ m!1
Rwarp-
ing paths, each of which represents a warping between two subse-quences in each utterance pair. Since our goal is to find sequences ofsimilarity within the utterance pairs, we look for a fragment of eachwarping path that has a low distortion score [2].
The warping path refinement is done in two steps. In the firststep, a length L constrained minimum average subsequence findingalgorithm [6] is used to extract consecutive warping fragments withlow distortion scores. In the second step, the extracted fragments areextended by including neighboring frames below a certain distortionthreshold !. Specifically, we include neighboring frames with dis-tortion scores within 1 + ! percent of the average distortion of theoriginal fragment. The reason is that we found that the path of ex-tracted fragments often missed several frames at the onset or offsetof a matching acoustic pattern (i.e., a particular word or words) [2].
2.5. Graph Clustering
After collecting refined warping fragments for every pair of speechutterances, we can try to cluster similar fragments. Since each warp-ing fragment provides an alignment between two segments, if one ofthe two segments is a common speech pattern (i.e., a frequently usedword), it should appear in multiple utterance pair fragments.
The basic idea is to cast this problem into a graph clusteringframework, illustrated in Figure 2. Consider one pair of utterancesin which S-DTW determines three matching fragments (illustratedin different colors and line styles). Each fragment corresponds totwo segments in the two speech utterances, one per utterance. Sincein general there could be many matching fragments with differentstart and end times covering ever utterance, a simplification is madeto find local maxima of matching similarity in each utterance, andto use these local maxima as the basis of nodes in the correspondinggraph [2]. As a result, each node in the graph can represent one ormore matching fragments in an utterance. Edges in the graph thencorrespond to fragments occurring between utterance pairs, with anassociated weight that corresponds to a normalized matching score.After the conversion, a graph clustering algorithm proposed by New-man [7] is used to discover groups of nodes (segments) in terms of
“Image-‐based” DTW
Pass 1: Efficient Line Segment Detection
Step 1: Threshold and apply diagonal median filter of duration !t
!
A. Jansen et al., “Towards spoken term discovery at scale with zero resources”, Interspeech 2010
Image processing can be used to speedup the discovery of possible matches in the similarity matrix
Pass 1: Efficient Line Segment Detection
Step 3: Use Hough transform peaks as starting points to perform linesegment search
AutomaFc moFf discovery
!"#
$%%&'()*+,-.'/0*01%2
)&&'.%3'(4*-/56 !"# !"#''-.'7-81)192
$%%&'()*+,''$%%&'()*+,2
':)7-&1%;%*-*-4.2 ''''-.'7-81)19
!"#'''''''%<*%.&
'#""='>?6@A ''''''''''''''>BCC"D
!"#$ %E F)-. )1+,-*%+*01% 4/ *,% ;14;4$%& !"#$% &$'(")*+, $9$*%(E
-. $08$%+*-4. GEH ). %<)(;7% 4/ $0+, :)7-&)*-4. -. *,% ##F +4(;)1I-$4.E
J$ 1%;%*-*-4.$ )1% &%*%+*%&K ) (4&%7 /41 *,% (4*-/ -$ 80-7* /14(*,% 4++011%.+%$ *,%($%7:%$K ).& )&&%& *4 *,% 7-81)19 /41 $08$%L0%.*+4(;)1-$4.$E M. *,% %<;%1-(%.*$ &%*)-7%& ,%1%K (4*-/$ )1% (4&%7%&89 *,%-1 (%&-). 4++011%.+%K -E%E *,% +74$%$* -. ):%1)N% *4 )77 *,% 4*,%14.%$ )++41&-.N *4 ) =OP &-$*).+% (%)$01%K -. 401 +)$%E
O,% )7N41-*,( )&:).+%$ 89 );;14;1-)*%79 $,-/*-.N *,% $%%& 874+Q).& -*%1)*-.N *,% ;14+%&01% &%$+1-8%& 0.*-7 *,% %.*-1% $*1%)( -$ ;14I+%$$%&E
O,% .%<* $%+*-4. /4+0$%$ 4. &%$+1-8-.N ,43 ()*+,%$ )1% !1$*&%*%+*%& ).& *,%. :)7-&)*%&E
&$ '()(*)+,- .-' /.0+'.)+,-
M. *,-$ $%+*-4. 3% &%$+1-8% *,% ;)**%1. ()*+,-.N *%+,.-L0%$ ).& *,%&-$*).+% (%)$01%$ %(;749%& *4 %//%+*-:%79 !.& 1%;%*-*-4.$ -. $;%%+,EP% 81-%"9 1%+)77 *,% ;1-.+-;7% 4/ *,% =OP );;14)+, ).& ,-N,7-N,*$-*$ 7-(-* *4 !.)779 -.*14&0+% ##F +4(;)1-$4.E
&$%$ '12132"45 67 8 91#:1528; <1=9"45 4> ')?
O,% $%)1+, /41 1%;%*-*-4.$ 4/ ) $%%& R'**& !-#(.*'S ).& *,%-1 $08$%IL0%.*K 0.840.&%&K %<*%.$-4. -$ ;%1/41(%& )++41&-.N *4 ) $%N(%.*)7:)1-).* 4/ *,% +7)$$-+ =OP )7N41-*,( 3-*,FC@@ /%)*01%$E O,% *%+,I.-L0%K &%*)-7%& -. TUVK 1%7)<%$ *,% 840.&)19 +4.$*1)-.*$ 4/ =OPK *4&%*%+* ) ;4$$-87% $%%& ()*+, )$ ) $08$%N(%.* 4/ *,% $%)1+, 80//%1R41 (4*-/ (4&%7SK ).& *4 +4(;0*% *,% +411%$;4.&-.N )7-N.(%.* ;)*,EJ ;)*, %<*%.$-4. ;14+%&01% -$ *,%. *1-NN%1%& /14( *,% %.&;4-.*$K3,%.%:%1 *,% &%*%+*%& $%L0%.+%$ )1% &%%(%& $-(-7)1K -E%EK 3,%. *,%1%$;%+*-:% =OP $+41% -$ 8%743 ) $-(-7)1-*9 *,1%$,47& !=OPE
O,% ()-. 7-(-* -(;7-%& 89 *,-$ );;14)+, -$ *,)* *,% $47% 0$% 4/ )=OP R&-$S$-(-7)1-*9 (%)$01% 1%$07*$ -. ) $08$*).*-)7 $;%)Q%1 &%;%.I&%.* $9$*%(K )$ 3)$ .4*%& )71%)&9 -. TUK WK GVE O4 ;1%:%.* 0.)++%;*I)87% /)7$%I&%*%+*-4. 1)*%$K ) 743 :)70% 4/ !=OP -$ %./41+%&E ?-(-*-.N*,% )&(-$$-87% -.*1)I(4*-/ :)1-)8-7-*9 ,)$ *,% 48:-40$ %//%+* 4/ ;1%I:%.*-.N *,% &-$+4:%19 4/ *,% (41% :)1-)87% 4++011%.+%$ 4/ ) (4*-/K7-Q% *,% -.*%1I$;%)Q%1 4.%$E
O,% -.*%N1)*-4. 4/ *,% ##F +4(;)1-$4. -. *,% :)7-&)*-4. $*)N%K)-($ -.&%%& )* +4;-.N R)* 7%)$* ;)1*-)779S 3-*, $0+, :)1-)8-7-*9 -$$0%$E
&$@$ .A<8528#1 4> BBC =1D=1915282"45
O,% ##F 4/ ) $%L0%.+% "ba -$ *,% $L0)1% $9((%*1-+ ()*1-< !("b
a)$0+, *,)* ! (i, j) = d
`
"ba(i), "b
a(j)´
K d 8%-.N *,% "0+7-&%). &-$I*).+% -. 401 +)$%E
M/ *34 $%L0%.+%$ )1% $-(-7)1K $4 )1% *,% 1%$;%+*-:% ##F$E M()NI-.% ) $%L0%.+% "K %)+, /1)(% &%+4(;4$%& )$X
"i = Si + N RUS
!" #" $" %" &" '" (" )"
#"
%"
'"
)"
!" #" $" %" &" '" (" )"
#"
%"
'"
)"
!"#$ @E ##F$ 4/ *34 &-//%1%.* 0**%1).+%$ 4/ /*-0 1-+$* 2* 3*0 /14() ()7% $;%)Q%1 R*4;S ).& ) /%()7% $;%)Q%1 R84**4(SE
3,%1% Si +)11-%$ *,% 4$056$'#$( -./41()*-4. 4. *,% /1)(%K ).& N*,% +4(;4.%.* *,)* )++40.*$ /41 )77 .4. 7-.N0-$*-+ $401+%$ R$;%)QI-.N $*97%$K +,)..%7K 8)+QN140.& $40.&$K %*+ESK *,)* )1% *4 8% +4.$-&I%1%& )$ .4-$% /41 401 ;01;4$%E M/ $0+, ). -&&$#$)* 0"$'* (4&%7 -$:)7-&K *,%. ##F$ 3407& 4.79 &%;%.& 4. *,% )+40$*-+I;,4.%*-+ $*10+I*01% 4/ *,% 41-N-.)7 $%L0%.+%$K )$ )77 *,% .4-$% +4(;4.%.*$ 3407&+).+%7 %)+, 4*,%1 40* -. +4(;0*-.N !(i, j)E C41 -.$*).+%K -. +%;I$*1)7 (%). $08*1)+*-4. R@F#SK (%). 1%(4:)7 /47743$ *,% )$$0(;*-4.*,)* ) +4.$*).* )&&-*-:% /)+*41 R*,% =@ +4(;4.%.*S -$ 1%$;4.$-87% /41+,)..%7 &-$*41*-4.E Y43%:%1K 3-*, 1%N)1& *4 *,% 4*,%1 .4-$% /)+*41$K*,-$ ;14;%1*9 &4%$ .4* $*1-+*79 ,47& *10% -/ $;%%+, -$ 1%;1%$%.*%& 89$%L0%.+%$ 4/ FC@@$E F41%4:%1K %:%. )$$0(-.N ) ;%1/%+* )++01)+94/ *,% )&&-*-:% (4&%7K +4(;)1-.N &-1%+*79 *,% @F#I.41()7-Z%& $%IL0%.+%$ 3407& 8% (41% +4.:%.-%.* *,). +4(;)1-.N *,% 1%$;%+*-:%##F$E
O,% )&:).*)N% 4/ 0$-.N ##F$ -$ *,)* *,% &-$*).+%$ )(4.N (0I*0)7 ;)1*$ 4/ ) $%L0%.+%K N%.%1)*% ) *34I&-(%.$-4.)7 ;)**%1. *,)* -$;%+07-)1 4/ *,% 0.&%179-.N )+40$*-+I;,4.%*-+ $*10+*01% R$%% C-NE H /41) +4.+1%*% %<)(;7%SE >9 +4(;)1-.N ##F$ 3% -.*%.& *4 1%+4N.-Z%*,% 1%+011%.+% 4/ $0+, ;)**%1.$ )(4.N $;%%+, $%L0%.+%$ *4 1%:%)7*,%-1 +4((4. 8%74.N-.N *4 *,% $)(% (4*-/E
##F$ )1% *,0$ -.*%1%$*-.N +).&-&)*%$ /41 1480$* ;)**%1. ()*+,-.N).& ) &-$*).+% 8%*3%%. ##F$ -$ *,%1%/41% 1%L0-1%&E
&$&$ *4:D8="945 4> BBC9
P,%. &%!.-.N ) &-$*).+% 8%*3%%. ##F$K *34 ()-. -$$0%$ ,):% *48% )&&1%$$%&X )S &-//%1%.* (4*-/$ ()9 )7$4 ,):% $-(-7)1 ##F$ ).& 8S(4*-/ 4++011%.+%$ ()9 ,):% &-//%1%.* 7%.N*,$K ).& *,0$ *,% 1%$;%+*-:%##F$ &-//%1%.* $-Z%E
O4 &%)7 3-*, *,% !1$* &1)38)+QK 3% ;14;4$% *4 0$% *,% ##F &-$I*).+% -. +4.[0.+*-4. 3-*, *,% =OP &-$*).+%X ) ,-N, !=OP -$ $%* *4)++40.* /41 (41% :)1-)8-7-*9E F4*-/ +).&-&)*%$ )1% *,%. :)7-&)*%& 89
A. Muscariello et al., “Towards Robust Word Discovery by Self-‐Similarity Matrix Comparison”, Proc. ICASSP 2011
Variability tolerant audio motif discovery 7
B U F F E R
Q U
E R
Yrefined starting point
refined ending point
matchcentral band
ending band
starting band
Fig. 2. Band relaexd SLNDTW: the motif completely includes the central band. Afterpath reconstruction, boundaries are refined in the starting and ending band (dashedlines).
In addition, we have applied an heuristic for the refinement of the boundariesthat consists in extending the found match by adding new frames at the bound-aries (following the local normalization paradigm), as long as the average weightof the extended path does not increase too much.Formally:
1. Consider the path P with W (P ) = Wo ending in (ie, je).2. Select in the neighbourhood of (ie, je) (composed of (ie + 1, je + 1), (ie +
1, je), (ie, je + 1)) the point that, added to P , minimizes W (P ), and add itto P as its new ending point.
3. If W (P ) < Wo+10%Wo, then repeat the procedure from 1, otherwise removethe new ending point from P and stop the procedure.
The same approach applies when extending the path backward from its startingpoint (is, js).
2.4 Fragmental SLNDTW
Band relaxed SNLDTW does not constrain motif and query to coincide, butit still assumes the motif to be located in the middle part of the query, suchthat it completely includes the central band. A simple generalization of theprevious versions of the algorithm, that we call fragmental SLNDTW, allows toretrieve the sought motif regardless of its position in the query, by first retrievinga portion of it, e.g. a fragment. SLNDTW detects a match whenever a querycoincide with a motif. By using queries small enough to be included in the motif,then there exists at least one fragment of the motif that coincide with one of thequeries and that can be discovered by SLNDTW. Indeed, if Lmin ! Lmotif !Lmax, partitioning a Lmax long query in Lmin/2 long subqueries ensures that atleast a Lmin/2 long fragment of the motif coincide with one of the subqueries,and it is therefore retrievable by conventional SNLDTW. The entire match canbe recovered afterwards, by extending the corresponding path as in the boundary
inria
-005
5176
4, v
ersi
on 1
- 20
Feb
201
1
Music structure analysis
Meinard Müller, “Informa0on Retrieval for Music and Mo0on”, Springer-‐Verlag, ISBN 978-‐3-‐540-‐74047-‐6, pp. 147-‐150, 2010
Unbounded-‐DTW
X. Anguera et al., “Par0al Sequence Matching using an Unbounded Dynamic Time Warping Algorithm”, ICASSP 2010
Unbounded-‐DTW
The search for speed and scalability
• Coarse-‐to-‐fine approximaFon of DTW – S. Salvador and P. Chan, “FastDTW: Toward accurate dynamic 0me
warping in linear 0me and space”. 3rd Wkshp. On Mining Temporal and SequenFal Data, ACM KDD 2004
• Intelligent bounding of DTW – Work by Eamon Keogh (pure DTW, no subsequences) – Y. Zhang and J. Glass, “A Piecewise Aggregate Approxima0on Lower-‐
Bound Es0mate for Posteriorgram-‐based Dynamic Time Warping”, Interspeech 2011
• Use of IR techniques – A. Jansen et al., “Efficient Spoken Term Discovery using Randomized
Algorithms”, ASRU 2011
(some) applicaFons of the technology
• Finding structure in an unknown language -‐> Zero-‐resources approaches (JHU Workshop this summer) – Helping ASR by increase of training data (Jansen_2011)
• AcousFc documents comparison and topic detecFon – NLP on speech (Drezde_2010)
• Query-‐by-‐example search (Metze_2011) • Spoken term discovery (Muscariello_2011) • Spoken SummarizaFon (Jansen_2010, Flamary_2011) • AcousFc indexing enhancement via transcripFon propagaFon
Conclusion
• The old DTW is back – It will not reclaim ASR, but it takes on new challenges
• Lots of research to be done – On scalability (matching paIerns is sFll expensive) – On generality – Robustness
I did not have Fme to menFon…
• Symbol-‐based subsequence matching • AcousFc features for subsequence matching
References -‐ T.K.Vintsyuk, “Speech DiscriminaFon by Dynamic Programming”, KiberneFks Vol. 4, No 1, pp. 81-‐88, 1968. -‐ V.M. Velichko and N.G. Zagoruyko, “Automa0c Recogni0on of 200 Words”, Int. Journal on Man-‐Machine Studies, vol. 2, pp. 223-‐234, 1970. -‐ H. Sakoe and S. Chiba, “A dynamic programming approach to con0nuous speech recogni0on,”in 1971 Proc. 7th ICA, Paper 20 CI3, Aug. 1971. -‐ H. Sakoe and S. Chiba, “Dynamic Programming Algorithm Op0miza0on for Spoken Word Recogni0on”, IEEE TransacFons on Audio, Speech and Signal Processing, 26(1) pp. 43-‐49, 1978 -‐ S. Salvador and P. Chan, “FastDTW: Toward accurate dynamic 0me warping in linear 0me and space”. 3rd Wkshp. On Mining Temporal and SequenFal Data, ACM KDD 2004 -‐ De Wachter et al., “Template-‐based con0nuous speech recogni0on”, IEEE Trans. On Audio, Speech and Language Processing, 15(4) pp. 1377-‐1390, 2007 -‐ A. Park and J. Glass, “Unsupervised PaWern Discovery in Speech”, IEEE Trans. On Audio, Speech and Language Processing, 2008
References (II) -‐ Meinard Müller, “Informa0on Retrieval for Music and Mo0on”, Springer-‐Verlag, ISBN 978-‐3-‐540-‐74047-‐6, pp. 147-‐150, 2010 -‐ X. Anguera et al., “Par0al Sequence Matching using an Unbounded Dynamic Time Warping Algorithm”, ICASSP 2010 -‐ A. Jansen et al., “Towards spoken term discovery at scale with zero resources”, Interspeech 2010 -‐ R. Flamary et al.,“SpokenWordCloud: Clustering Recurrent PaWerns in Speech” in Proc. CBMI 2011 -‐ M. Drezde et al. “NLP on Spoken Documents without ASR”, in Proc. EMNLP 2010 -‐ A. Muscariello et al., “Towards Robust Word Discovery by Self-‐Similarity Matrix Comparison”, Proc. ICASSP 2011 -‐ A. Jansen et al., “Efficient Spoken Term Discovery using Randomized Algorithms”, ASRU 2011 -‐ Y. Zhang and J. Glass, “A Piecewise Aggregate Approxima0on Lower-‐Bound Es0mate for Posteriorgram-‐based Dynamic TimeWarping”, Interspeech 2011 -‐ A. Jansen et al., “Towards unsupervised training of speaker independent acous0c models,” in Proc. Interspeech, 2011 -‐ F. Metze, “The spoken web search task at Mediaeval 2011”, ICASSP 2011
Top Related