Multimodal pattern matching algorithms and applications
![Page 1: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/1.jpg)
Multimodal pattern matching algorithms and applications
Xavier Anguera Telefonica Research
![Page 2: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/2.jpg)
Outline
• Introduction
• Partial sequence matching
  – U-DTW algorithm
• Music/video online synchronization
  – MuViSync prototype
• Video copy detection
![Page 3: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/3.jpg)
Partial Sequence Matching Using an Unbounded Dynamic Time Warping Algorithm
Xavier Anguera, Robert Macrae and Nuria Oliver
Telefonica Research, Barcelona, Spain
![Page 4: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/4.jpg)
Proposed challenge
• Given one or several audio signals, we want to find and align recurring acoustic patterns.
![Page 5: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/5.jpg)
Proposed challenge
• We could use the ASR/phonetic output and search for symbol repetitions
  PROS:
  – It is easy to apply; the ASR takes care of any time warping
  CONS:
  – ASR is language dependent and requires training
  – We introduce additional sources of error (acoustic conditions, OOVs)
  – It can be very slow and not embeddable
• Automatic motif discovery directly in the speech signal
  – Training-free, language independent and resilient to some noise
[Diagram labels: ASR/phonetization, symbolic representation, symbol alignment; acoustic alignment; outputs: alignment locations, scores]
![Page 6: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/6.jpg)
Areas of application
• Improve ASR by disambiguation over several repetitions (Park and Glass, 2005)
• Pattern-based speech recognition, flat modelling (Zweig and Nguyen, 2010)
• Acoustic summarization (Muscariello, 2009)
• Musical structure analysis (Müller, 2007)
• Server-less mobile voice search (Anguera, 2010)
![Page 7: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/7.jpg)
Automatic motif discovery
• The goal is to avoid going to text and therefore be more robust to errors
• There is a good deal of applicable work in this area:
  – Biomedicine: matching DNA sequences (converting the speech signals into symbol strings)
  – Directly from real-valued multidimensional samples using DTW-like algorithms
    • Müller’07, Muscariello’09, Park’05, Zweig’10
    • Most need to compute the whole cost matrix a priori
![Page 8: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/8.jpg)
Dynamic Time Warping (DTW)
• The DTW algorithm allows the computation of the optimal alignment between two time series $X_U$, $X_V$ with elements in $\Phi^D$:
$$X_U = (u_1, \dots, u_m, \dots, u_M)$$
$$X_V = (v_1, \dots, v_n, \dots, v_N)$$
Image by Daniel Lemire
![Page 9: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/9.jpg)
Dynamic Time Warping (II)
• The optimal alignment can be found in O(MN) complexity using dynamic programming.
• We need to define a cost function between any two elements in the series and build a distance matrix:
$$d : \Phi^D \times \Phi^D \rightarrow \mathbb{R}_{\geq 0}$$
  where usually (Euclidean distance):
$$d(u_m, v_n) = \lVert u_m - v_n \rVert$$
• Warping function: $F = c(1), \dots, c(K)$, where $c(k) = (i(k), j(k))$
Image by Tsanko Dyustabanov
![Page 10: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/10.jpg)
Warping constraints
For speech signals some constraints are usually applied to the warping function F:
– Monotonicity: $i(k-1) \le i(k)$ and $j(k-1) \le j(k)$
– Continuity (i.e. local constraints): $i(k) - i(k-1) \le 1$ and $j(k) - j(k-1) \le 1$
$$D(m,n) = \min\{D(m-1,n),\ D(m,n-1),\ D(m-1,n-1)\} + d(u_m, v_n)$$
[Diagram labels: (m, n), (m-1, n-1), (m-1, n)]
Sakoe, H. and Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-26, 43-49.
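The recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes NumPy, the Euclidean local distance, and the function name `dtw` is made up for the example.

```python
import numpy as np

def dtw(x_u, x_v):
    """Accumulated cost of the optimal start-to-end alignment.

    x_u has shape (M, D) and x_v has shape (N, D).
    """
    x_u = np.asarray(x_u, dtype=float)
    x_v = np.asarray(x_v, dtype=float)
    M, N = len(x_u), len(x_v)
    # Local distance matrix d(u_m, v_n): Euclidean distance between frames.
    d = np.linalg.norm(x_u[:, None, :] - x_v[None, :, :], axis=2)
    # Accumulated cost with one padded row/column so the recursion
    # needs no special case on the borders.
    D = np.full((M + 1, N + 1), np.inf)
    D[0, 0] = 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            best_prev = min(D[m - 1, n], D[m, n - 1], D[m - 1, n - 1])
            D[m, n] = d[m - 1, n - 1] + best_prev
    return D[M, N]
```

Both nested loops visit every cell once, which is where the O(MN) cost comes from; note that the whole distance matrix is computed up front, the cost U-DTW later tries to avoid.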
![Page 11: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/11.jpg)
Warping constraints (II)
– Boundary condition: $i(1) = 1$, $j(1) = 1$, $i(K) = M$, $j(K) = N$
  i.e. DTW needs prior knowledge of the start-end alignment points.
– Global constraints
Image from Keogh and Ratanamahatana
![Page 12: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/12.jpg)
DTW Dynamic Programming
![Page 13: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/13.jpg)
DTW Dynamic Programming
![Page 14: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/14.jpg)
DTW Dynamic Programming
![Page 15: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/15.jpg)
DTW Dynamic Programming
![Page 16: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/16.jpg)
DTW main problem
• The boundary condition constrains the time series to be aligned from start to end
  – We need a modification to DTW that allows common pattern discovery in reference and query signals regardless of the sequences’ other content
![Page 17: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/17.jpg)
Alternative proposals
• Meinard Müller’s path extraction for music [1]
  – Needs to pre-compute the complete cost matrix.
• Alex Park’s Segmental DTW [2]
  – Needs to pre-compute the complete cost matrix; very computationally expensive afterwards.
• Armando Muscariello’s word discovery algorithm [3]
  – Searches for patterns locally; does not check all possible starting points.
[1] M. Müller, “Information Retrieval for Music and Motion,” Springer, New York, USA, 2007.
[2] A. Park et al., “Towards unsupervised pattern discovery in speech,” in Proc. ASRU’05, Puerto Rico, 2005.
[3] A. Muscariello et al., “Audio keyword extraction by unsupervised word discovery,” in Proc. INTERSPEECH’09, 2009.
![Page 18: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/18.jpg)
Unbounded-DTW Algorithm
• U-DTW is a modification to DTW that is fast and accurate in finding recurring patterns
• We call it unbounded because:
  – The start-end positions of both segments are not constrained
  – Multiple matching segments can be found with a single pass of the algorithm
  – It minimizes the computational cost of comparing two multidimensional time series
![Page 19: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/19.jpg)
U-DTW cost function and matching length
• Given two sequences to be matched, U = (u1, u2, …, uM) and V = (v1, v2, …, vN), we use the inner product similarity
$$s(m,n) = \cos\theta = \frac{\langle u_m, v_n \rangle}{\lVert u_m \rVert \, \lVert v_n \rVert}$$
  Values range over [-1, 1]; the higher, the closer
• We look for matching sequences with a minimum length Lmin (set at 400 ms in our experiments)
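As a small illustration of the similarity above, the full matrix of s(m, n) values can be obtained by normalizing every frame and taking inner products. This is a sketch assuming NumPy; the function name and the `eps` guard against zero-norm frames are assumptions of the example, and U-DTW itself only evaluates part of this matrix.

```python
import numpy as np

def similarity_matrix(x_u, x_v, eps=1e-12):
    """Cosine similarity s(m, n) between every frame of U and every frame of V."""
    x_u = np.asarray(x_u, dtype=float)
    x_v = np.asarray(x_v, dtype=float)
    # Scale each frame to unit length (eps avoids division by zero).
    u = x_u / np.maximum(np.linalg.norm(x_u, axis=1, keepdims=True), eps)
    v = x_v / np.maximum(np.linalg.norm(x_v, axis=1, keepdims=True), eps)
    # The inner product of unit vectors is the cosine of their angle.
    return u @ v.T
```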
![Page 20: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/20.jpg)
U-DTW global/local constraints
• No global constraints are applied, in order to allow matching of any segment in either sequence
• Local constraints are set to allow warping up to 2X
$$D(m,n) = \max\{D(m-1,n-2),\ D(m-1,n-1),\ D(m-2,n-1)\} + s(u_m, v_n)$$
[Diagram: cell (m, n) is reached from (m-1, n-2), (m-1, n-1) or (m-2, n-1)]
![Page 21: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/21.jpg)
U-DTW computational savings
• Computational savings are achieved in two ways:
  1. We sample the distance/similarity matrix only at certain possible matching start points (the synchronization points)
  2. Dynamic programming is done forward, pruning out low-similarity paths
![Page 22: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/22.jpg)
Synchronization points
• Only certain (m,n) positions in the matrix are analyzed for possible matching segments
  – Selected so as not to lose any matching segment
  – Selected to optimize the computational cost
• Two methods are followed: horizontal and diagonal bands
[Figure: two layouts of synchronization points over the U × V matrix; labels: τh, τd, λ, π/4, 2τh, (m,n)]
![Page 23: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/23.jpg)
U-DTW Dynamic Programming
![Page 24: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/24.jpg)
Forward dynamic programming
• For each position (m,n), 3 possible forward paths are considered
• A forward path is extended to (m',n') if and only if:
  – its normalized global similarity is above a pruning threshold:
$$S(m',n') = \frac{D(m,n) + s(m',n')}{M(m,n) + 1} \ge Thr_{prun}$$
    where M(m,n) is the length of the path up to (m,n)
  – S(m',n') is greater than that of any previous path at that location
[Diagram: from (m, n) to (m+1, n+2), (m+1, n+1) or (m+2, n+1)]
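One forward extension step with both acceptance tests can be sketched as follows. This is an illustration under assumptions, not the published implementation: it uses NumPy, a precomputed similarity matrix `s` for brevity, an array `L` standing for the path length M(m,n), and a hypothetical function name.

```python
import numpy as np

# Forward moves allowed by the 2X local constraints.
FORWARD_STEPS = ((1, 2), (1, 1), (2, 1))

def extend_forward(D, L, s, m, n, thr_prun):
    """Extend the path ending at (m, n); return the cells that were updated.

    D: accumulated similarity per cell, L: path length per cell
    (0 means no path has reached the cell yet), s: local similarities.
    """
    updated = []
    for dm, dn in FORWARD_STEPS:
        m2, n2 = m + dm, n + dn
        if m2 >= s.shape[0] or n2 >= s.shape[1]:
            continue  # the move would leave the matrix
        total = D[m, n] + s[m2, n2]
        length = L[m, n] + 1
        score = total / length  # normalized global similarity S(m', n')
        # Normalized score of the best path already stored in that cell.
        previous = D[m2, n2] / L[m2, n2] if L[m2, n2] > 0 else -np.inf
        if score >= thr_prun and score > previous:
            D[m2, n2] = total
            L[m2, n2] = length
            updated.append((m2, n2))
    return updated
```

Calling this repeatedly on the cells it returns grows the surviving paths, and paths whose normalized score drops below the threshold simply stop being extended.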
![Page 25: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/25.jpg)
U-DTW Dynamic Programming
![Page 26: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/26.jpg)
U-DTW Dynamic Programming
![Page 27: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/27.jpg)
Backward path algorithm
• When a possible matching segment is found in the forward pass, the same procedure is run backwards, starting from the originating synchronization point (SP) position.
[Diagram: from (m, n) back to (m-1, n-2), (m-1, n-1) or (m-2, n-1)]
![Page 28: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/28.jpg)
U-DTW Dynamic Programming
![Page 29: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/29.jpg)
U-DTW Dynamic Programming
![Page 30: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/30.jpg)
Computational savings example
[Figure: example matrix with both axes labeled “Barcelona”]
![Page 31: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/31.jpg)
Experimental setup
• We asked 23 people to record 47 words from 6 categories (Nature, Cities, People, Events, Family, Monuments), 5 iterations each:
$$X_{U,V}[n,i], \quad i = 1 \dots 5, \quad n = 1 \dots 47$$
• Simple energy-based trimming eliminates non-speech regions
• We simulate acoustic context by attaching different start-end audio sequences to $X_{U,V}$.
![Page 32: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/32.jpg)
Experimental setup (II)
• Signals are parameterized with 10 MFCCs every 10 ms
• Each word Xu is compared to all words Xv from the same speaker (234 comparisons) and the closest one is retrieved:
$$\arg\min_{m,j} D(X_U[n,i], X_V[m,j]) \mid (n,i) \ne (m,j)$$
  We get a hit if m = n, a miss otherwise
• Tests were performed on an Ubuntu Linux PC @ 2.4 GHz.
![Page 33: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/33.jpg)
Comparing systems
• Standard DTW
  – Compares the sequences without any added acoustic context (i.e. with prior knowledge of start-end points)
• Segmental DTW (Park and Glass, 2005)
  – Minimum segment length of 500 ms
  – Band size of 70 ms, 50% overlap
  – Used 2 distances: Euclidean and 1 - inner product
![Page 34: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/34.jpg)
Performance evaluation
Used metrics:
– Accuracy: percentage of words correctly matched (Xu and Xv are different iterations of the same word).
$$\mathrm{Acc} = \frac{\sum \text{correct matches}}{\text{all matches}} \cdot 100$$
– Average processing time per sequence pair (Xu-Xv), excluding parameterization
$$\mathrm{Time} = \frac{\sum \mathrm{time}\big(D(X_U[n,i], X_V[m,j])\big)}{\#\text{matches}} \cdot 100$$
– Average ratio of frame-pair distances computed within each sequence-pair cost matrix.
$$\mathrm{Ratio} = \frac{\sum \mathrm{computed}\big(d(X_U[n,i], X_V[m,j])\big)}{MN} \cdot 100$$
![Page 35: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/35.jpg)
Results
| Algorithm | Accuracy | Avg. time | Ratio |
|---|---|---|---|
| Segmental DTW w/ Eucl. | 80.61% | 82.7 ms | 1 |
| Segmental DTW w/ inner prod. | 74.62% | 86.7 ms | 1 |
| U-DTW horiz. bands | 89.53% | 10.6 ms | 0.51 |
| U-DTW diag. bands | 89.34% | 9.0 ms | 0.42 |
| Standard DTW | 95.42% | 0.6 ms | 1 |
![Page 36: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/36.jpg)
Effect of the Cutout Threshold
![Page 37: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/37.jpg)
Conclusions and future work
• We propose a novel algorithm called U-DTW for unconstrained pattern discovery in speech
• We show it is faster and more accurate than existing alternatives
• We are starting to test the algorithm for unrestricted audio summarization
![Page 38: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/38.jpg)
MuViSync: Audiovisual Music Synchronization
Xavier Anguera, Robert Macrae and Nuria Oliver
![Page 39: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/39.jpg)
People enjoy listening to their favorite music everywhere…
…on the go, …
…at home, …
…or at a party with friends
![Page 40: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/40.jpg)
Users increasingly have a personal mp3 music collection…
…but it usually contains ‘only’ music.
What if you could watch the video clip of any of your songs while listening to it?
![Page 41: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/41.jpg)
You could go to sites like YouTube…
…but the audio quality is much worse than in your mp3…
What if you could listen to your high-quality mp3 music while watching the video clips?
![Page 42: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/42.jpg)
MuViSync: Music and Video Synchronization system
MuViSync synchronizes audio and video from two different sources and plays them together in sync
[Diagram labels: Personal Music, Video clip, streaming, local, MuViSync]
![Page 43: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/43.jpg)
Application scenarios
• Watch your favorite music on TV
  – Personal music synchronization with video clips, either local or streamed
• Watch your music on your iPhone
  – Personal music synchronization by streaming the video into the iPhone
• Identify and watch any music
  – Combined with songID technology, either at home or on the go.
![Page 44: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/44.jpg)
MuViSync application
• We have developed a prototype application for Windows/Mac, and soon for iPhone.
![Page 45: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/45.jpg)
Alignment algorithm requirements
• Perform an alignment between the mp3 music and the video’s audio track
• Initially only partial knowledge is available from both sources (live recording or buffering)
• Alignment has to be done online and in real time
• Emphasis is needed on user satisfaction when playing the video.
![Page 46: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/46.jpg)
Application testbed
• We use 320 music videos (YouTube) + their corresponding mp3 files
• A supervised ground-truth alignment was performed using offline DTW and checking for consistency
• Audio is processed every 100 ms (200 ms window) and chroma features are extracted
![Page 47: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/47.jpg)
MuViSync online alignment algorithm
1. Initial path discovery
   – Both signals (audio and video) are buffered, features are extracted and an initial alignment is found
2. Real-time online alignment
   – An incremental alignment is computed
3. Alignment post-processing to ensure a smooth playback of the aligned video.
[Diagram: audio + feature extraction and video feature extraction feed 1) initial path discovery and 2) real-time alignment, producing the alignment between ta and tv]
![Page 48: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/48.jpg)
Initial path discovery (online mp3 playback + video buffering)
[Figure labels: audio available from the video; audio from the mp3 file; video buffering end; sync request]
![Page 49: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/49.jpg)
Initial path discovery
• A segment of the audio and the buffered video are checked for alignment using forward-DTW
• The global similarity D(m,n) at each location (m,n) is normalized by the length of the optimum path to that location
• At each step, all paths with D’(m,n) < Dave(*,n) are pruned.
• The initial alignment is selected when only one path survives or the sync time is reached.
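The pruning rule in the third bullet can be sketched as below. This is only an illustration, assuming NumPy, with the normalized scores D’(m,n) of one step n stored in a 1-D array and NaN marking paths that were already pruned; the function name is hypothetical.

```python
import numpy as np

def prune_below_average(norm_scores):
    """Drop every active path whose normalized score is below the step average.

    norm_scores: D'(m, n) for all candidate m at one step n (NaN = pruned).
    Returns a copy in which newly pruned paths are also set to NaN.
    """
    scores = np.array(norm_scores, dtype=float)
    active = ~np.isnan(scores)
    if not active.any():
        return scores
    average = scores[active].mean()  # D_ave(*, n) over surviving paths
    keep = active.copy()
    keep[active] = scores[active] >= average
    return np.where(keep, scores, np.nan)
```

Applying this at every step shrinks the set of candidates; the loop would stop once a single path is left or the sync time is reached, as the last bullet states.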
![Page 50: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/50.jpg)
Initial path discovery
[Figure: alignment matrix; axes: audio available from the video vs. audio being played from mp3; audio time alignment buffer (about 1 s)]
![Page 51: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/51.jpg)
Initial path discovery
[Figure: alignment matrix; axes: audio available from the video vs. audio being played from mp3]
![Page 52: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/52.jpg)
Initial path discovery
[Figure: alignment matrix; axes: audio available from the video vs. audio being played from mp3]
![Page 53: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/53.jpg)
Initial path discovery
[Figure: alignment matrix; axes: audio available from the video vs. audio being played from mp3]
![Page 54: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/54.jpg)
Real-time online alignment
• Starting from the initial alignment we iteratively compute:
  1. A locally optimum forward path for L steps, p1…pL, using a) local constraints (no dynamic programming)
  2. A backward (standard) DTW from pL to p1 using b) local constraints
  3. Add the initial L/2 steps to the final path, and restart 1) from pL/2 until the playback ends
![Page 55: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/55.jpg)
Real-time online alignment
[Figure: alignment matrix; axes: audio available from the video vs. audio being played from mp3]
![Page 56: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/56.jpg)
Real-time online alignment
1) Forward locally best path with L=8
[Figure: alignment matrix with the path from p1 to pL; axes: audio available from the video vs. audio being played from mp3]
![Page 57: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/57.jpg)
Real-time online alignment
2) Standard DTW
[Figure: alignment matrix with the path from p1 to pL; axes: audio available from the video vs. audio being played from mp3]
![Page 58: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/58.jpg)
Real-time online alignment
3) Move the new starting point forward
[Figure: alignment matrix with the new p1; axes: audio available from the video vs. audio being played from mp3]
![Page 59: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/59.jpg)
Alignment postprocessing
• Alignment estimates every 100 ms are not enough to drive 25/30 fps video
• An interpolation of the points + averaging over 5 seconds gives the projection estimate for current playback
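One possible reading of this post-processing step is sketched below, assuming NumPy: the sparse alignment points are linearly interpolated at each video frame time, and the audio-to-video offset is then averaged over the trailing 5 seconds. The function name, the linear interpolation and the trailing-window average are assumptions of this example; the slides do not specify them.

```python
import numpy as np

def smoothed_video_times(frame_times, t_audio, t_video, window=5.0):
    """Video position to show at each playback time in frame_times.

    t_audio, t_video: alignment points (e.g. one every 100 ms), with
    t_audio increasing. window: averaging span in seconds.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    # Interpolate the sparse alignment at the (denser) frame times.
    interpolated = np.interp(frame_times, t_audio, t_video)
    offsets = interpolated - frame_times
    smoothed = np.empty_like(offsets)
    for k, t in enumerate(frame_times):
        # Average the offset over the trailing window ending at t.
        recent = (frame_times > t - window) & (frame_times <= t)
        smoothed[k] = t + offsets[recent].mean()
    return smoothed
```

Averaging the offset rather than the raw positions keeps the output moving with playback while damping jitter in individual alignment estimates.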
![Page 60: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/60.jpg)
Experiments
• We use 320 videos + mp3s, aligned using offline DTW and manually checked for consistency.
• Accuracy is computed as the % of songs with average error < some ms.
[Figure: average accuracy @ 100 ms for different video buffer lengths]
![Page 61: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/61.jpg)
Experiments
![Page 62: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/62.jpg)
Video Duplicate Detection
Xavier Anguera and Pere Obrador
![Page 63: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/63.jpg)
Let’s say you’re looking for the Bush attack video…
![Page 64: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/64.jpg)
…and you get 11,100 results.
![Page 65: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/65.jpg)
…after 40 minutes...
watching many of the videos returned, you notice that many are similar, i.e. near duplicates
• 27% on average on YouTube [Wu et al., 2007]
• 12% on average on YouTube [Anguera et al., 2009]
![Page 66: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/66.jpg)
Near-duplicate (NDVC) definition
• Identical or approximately identical videos that differ in some feature:
  – file formats, encoding parameters
  – photometric variations (color, lighting changes)
  – overlays (caption, logo, audio commentary)
  – editing operations (frames added/removed)
  – semantic similarity
NDVC are videos that are “essentially the same”
![Page 67: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/67.jpg)
Near duplicates (NDVC) vs. video copies
• These two concepts are not clearly distinguished in the literature.
• Video copy: exact video segment, with some transformations applied to it
• Near duplicate: similar videos on the same topic (different viewpoints, semantically similar videos, …)
In our research we address video copy detection
![Page 68: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/68.jpg)
Examples of video copies
![Page 69: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/69.jpg)
Use Scenarios: Copyright law enforcement
Detection of copyright-infringing videos on online video sharing sites
In a recent study we found that on average 12% of search results on YouTube are copies of the same video
![Page 70: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/70.jpg)
Use Scenarios: Video forensics for illegal activities
Discover illegal content hidden within other videos
Currently police forces usually have to manually scroll through ALL materials in pederasty cases searching for evidence.
![Page 71: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/71.jpg)
Use Scenarios: Database management
Database management/optimization and helping in searches over historic contents
Video excerpts used several times
![Page 72: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/72.jpg)
Use Scenarios: Advertisement detection and management
Advertisement detection/identification
Programming analysis
![Page 73: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/73.jpg)
Use Scenarios: Information overload reduction
Improved (more diverse) video search results by clustering all video duplicates.
[Figure: search results for “George Bush” before clustering and after clustering]
![Page 74: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/74.jpg)
Steps in video duplicate detection
1. Indexing of the reference videos
   A. Obtain features representing the video
   B. Store these features in a scalable manner
2. Search of queries within the reference set
[Diagram: OFFLINE: reference videos → feature extraction → reference indexing → features database; ONLINE: query video → feature extraction → search for duplicates]
![Page 75: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/75.jpg)
Ways to approach near-duplicate video detection
• Local features
  – Extracted from selected frames in the videos
  – Focus on local characteristics within those frames
• Global features
  – Extracted from selected frames or from the whole video
  – Focus on overall characteristics
![Page 76: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/76.jpg)
Local features
• They come from previous knowledge on image copy detection/near-duplicate detection
• Steps:
  – Keyframes are first extracted from the videos at regular intervals or by detecting shots
  – Local features are obtained for these keyframes:
    • SIFT
    • SURF
    • HARRIS
    • …
![Page 77: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/77.jpg)
Global Features
• Features are extracted either from the whole video or from keyframes by looking at the overall image (not at particular points).
In our work we extract them from the whole video
![Page 78: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/78.jpg)
Multimodal video copy detection
• Most works use only video/image information
  – They prefer local features for their robustness
• We introduce audio information by combining global features from both the audio and video tracks
• We are also experimenting with fusing local features with global features (work in progress)
![Page 79: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/79.jpg)
Multimodal global features
• We use features based on the changes in the data, which makes them more robust to transformations
• Video:
  – Hue + saturation interframe change
  – Lightest and darkest centroid interframe distance
• Audio:
  – Bayesian information criterion (BIC) between adjacent segments
  – Cross-BIC between adjacent segments
  – Kullback-Leibler divergence (KL2) between adjacent segments
![Page 80: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/80.jpg)
Hue+Saturation interframe change
1. Transform the colorspace from RGB to HSV (Hue+Saturation+Value)
![Page 81: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/81.jpg)
Hue+Saturation interframe change
2. For each 2 consecutive frames, compute their HS histograms and the intersection of the two histograms (formula shown in the slide image)
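The intersection formula itself is not part of this transcript. A common definition, used here purely as an assumption, is the sum of bin-wise minima of the two normalized histograms. The sketch assumes NumPy and frames already converted to HSV with hue and saturation scaled to [0, 1]; the bin count and function names are illustrative.

```python
import numpy as np

def hs_histogram(frame_hsv, bins=(16, 16)):
    """Normalized 2-D hue/saturation histogram of one HSV frame (H, W, 3)."""
    hue = frame_hsv[..., 0].ravel()
    sat = frame_hsv[..., 1].ravel()
    hist, _, _ = np.histogram2d(hue, sat, bins=bins, range=((0, 1), (0, 1)))
    return hist / hist.sum()

def hs_intersection(frame_a, frame_b, bins=(16, 16)):
    """Histogram intersection in [0, 1]; 1 means identical HS distributions."""
    hist_a = hs_histogram(frame_a, bins)
    hist_b = hs_histogram(frame_b, bins)
    return float(np.minimum(hist_a, hist_b).sum())
```

Evaluating `hs_intersection` on every pair of consecutive frames yields one value per frame transition, i.e. one global feature stream for the video.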
![Page 82: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/82.jpg)
Lightest and darkest centroid interframe distance
1. Find the lightest and darkest regions in each frame and obtain their centroids
![Page 83: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/83.jpg)
Lightest and darkest centroid interframe distance
2. For each of the two centroids, we compute the Euclidean distance between its positions in every two adjacent frames, obtaining two global feature streams
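A minimal sketch of the two centroid streams, assuming NumPy and one 2-D brightness (value) array per frame. How the lightest and darkest regions are selected is not stated in the slides; taking the brightest and darkest fraction `frac` of the pixels is an assumption of this example, as are the function names.

```python
import numpy as np

def extreme_centroids(value, frac=0.05):
    """Centroids (row, col) of the lightest and darkest pixels of one frame."""
    v = np.asarray(value, dtype=float)
    k = max(1, int(frac * v.size))
    order = np.argsort(v, axis=None)  # flat pixel indices, darkest first
    dark = np.unravel_index(order[:k], v.shape)
    light = np.unravel_index(order[-k:], v.shape)
    c_dark = np.array([dark[0].mean(), dark[1].mean()])
    c_light = np.array([light[0].mean(), light[1].mean()])
    return c_light, c_dark

def centroid_distance_streams(frames):
    """Interframe Euclidean distance of the light and dark centroids."""
    centroids = [extreme_centroids(f) for f in frames]
    light = np.array([c[0] for c in centroids])
    dark = np.array([c[1] for c in centroids])
    d_light = np.linalg.norm(np.diff(light, axis=0), axis=1)
    d_dark = np.linalg.norm(np.diff(dark, axis=0), axis=1)
    return d_light, d_dark
```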
![Page 84: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/84.jpg)
Acoustic features
• Compute some acoustic distance between adjacent acoustic segments
[Diagram: Segment A → GMM A; Segment B → GMM B; Segments A and B together → GMM A+B]
![Page 85: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/85.jpg)
Acoustic features (II)
• Likelihood-based metrics:
  – Bayesian Information Criterion
  – Cross-BIC
• Model distance metrics:
  – Kullback-Leibler divergence (KL2)
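As a rough illustration of two of these metrics, the sketch below computes a delta-BIC and a symmetric KL divergence (KL2) between two adjacent segments. It deliberately swaps the GMMs of the slides for a single one-dimensional Gaussian per segment, so that both quantities have closed forms. The variance floor, the penalty weight and the function names are assumptions. Cross-BIC is left out because its definition is not given in the transcript.

```python
import math

def gaussian_stats(x, var_floor=1e-6):
    """Maximum-likelihood mean and (floored) variance of a 1-D segment."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return mean, max(var, var_floor)

def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between two adjacent segments; > 0 suggests they differ.

    Compares one Gaussian for A+B against separate Gaussians for A and B.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    _, var_ab = gaussian_stats(seg_a + seg_b)
    _, var_a = gaussian_stats(seg_a)
    _, var_b = gaussian_stats(seg_b)
    # A 1-D Gaussian has 2 free parameters, so the penalty is (2 / 2) * log(n).
    penalty = penalty_weight * math.log(n)
    return 0.5 * (n * math.log(var_ab)
                  - n_a * math.log(var_a)
                  - n_b * math.log(var_b)) - penalty

def kl2(seg_a, seg_b):
    """Symmetric Kullback-Leibler divergence between the two segment Gaussians."""
    mu_a, var_a = gaussian_stats(seg_a)
    mu_b, var_b = gaussian_stats(seg_b)
    d2 = (mu_a - mu_b) ** 2
    return (var_a + d2) / (2 * var_b) + (var_b + d2) / (2 * var_a) - 1.0
```

Sliding the A/B boundary along the audio and evaluating either function at each position yields one global feature stream per metric.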
![Page 86: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/86.jpg)
Acoustic features (III)
• For example: the Bayesian Information Criterion (BIC) output:
![Page 87: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/87.jpg)
Search for full copies
• For each video-query pair we compute the correlation of each feature pair
• We then find the positions with high similarity (peaks).
[Diagram: Reference -> FFT and Possible copy -> FFT, multiplied (X), then IFFT -> Find peaks]
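The FFT -> multiply -> IFFT -> find peaks pipeline can be sketched as below. To stay self-contained the sketch includes a small pure-Python radix-2 FFT; in practice a library FFT would normally be used instead. Zero-padding to a power of two and conjugating the query spectrum are standard choices for correlation, assumed here rather than taken from the slides. The function names are hypothetical.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def cross_correlate(reference, query):
    """Correlation of query against reference for lags 0..len(reference)-1."""
    size = 1
    while size < len(reference) + len(query):
        size *= 2  # zero-pad so the circular correlation does not wrap around
    ref_f = fft([complex(v) for v in reference] + [0j] * (size - len(reference)))
    qry_f = fft([complex(v) for v in query] + [0j] * (size - len(query)))
    # Multiplying by the conjugate spectrum turns convolution into correlation.
    product = [r * q.conjugate() for r, q in zip(ref_f, qry_f)]
    corr = fft(product, inverse=True)
    return [c.real / size for c in corr[:len(reference)]]

def find_peak(corr):
    """Lag with the highest correlation value."""
    return max(range(len(corr)), key=corr.__getitem__)
```

For a query that is an exact excerpt of the reference, the peak lag is the offset of the excerpt within the reference feature stream.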
![Page 88: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/88.jpg)
Multimodal fusion
• When multiple modalities are available, fusion is performed on the correlations
![Page 89: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/89.jpg)
Output score
• The resulting score is computed as a weighted sum of the different modalities' normalized dot products at the found peak
• Automatic weights are obtained via
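The weight formula is not reproduced in this transcript, so the sketch below takes the weights as a caller-supplied input and only illustrates the scoring step: a normalized dot product per modality at the detected delay, combined by a weighted sum. The rescaling of the weights to sum to 1 and all names are assumptions.

```python
import math

def normalized_dot(ref_stream, query_stream, delay):
    """Cosine similarity between the query and the reference at a given delay."""
    window = ref_stream[delay:delay + len(query_stream)]
    query = query_stream[:len(window)]
    dot = sum(r * q for r, q in zip(window, query))
    norm = math.sqrt(sum(r * r for r in window) * sum(q * q for q in query))
    return dot / norm if norm > 0 else 0.0

def fused_score(ref_streams, query_streams, delay, weights):
    """Weighted sum of per-modality normalized dot products at the peak delay.

    weights: hypothetical caller-supplied weights; they are rescaled to sum to 1.
    """
    total = sum(weights)
    return sum(w / total * normalized_dot(r, q, delay)
               for r, q, w in zip(ref_streams, query_streams, weights))
```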
![Page 90: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/90.jpg)
Finding subsegments of the query
• The previously described algorithm assumes that the whole query matches a portion of the reference videos
• To avoid this restriction, a modification to the algorithm first splits the query into overlapping 20 s segments
• By accumulating the resulting peaks for each segment we can obtain the main delay and its segment
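A minimal sketch of the splitting and accumulation steps. The slides give the segment length (20 s) but not the amount of overlap, so the hop is left as a parameter. Treating "accumulating" as a simple vote over per-segment delays is an assumption, as are the function names.

```python
from collections import Counter

def overlapping_segments(n_frames, seg_len, hop):
    """(start, end) frame ranges of length seg_len, advancing by hop < seg_len."""
    last_start = max(n_frames - seg_len, 0)
    starts = list(range(0, last_start + 1, hop))
    return [(s, min(s + seg_len, n_frames)) for s in starts]

def main_delay(segment_peaks):
    """Accumulate per-segment peak delays and return the most voted one.

    segment_peaks: list of ((start, end), delay) pairs, where delay is the
    reference position of the peak minus the segment start.
    Returns the winning delay and the query span covered by its voters.
    """
    votes = Counter(delay for _seg, delay in segment_peaks)
    best, _count = votes.most_common(1)[0]
    voters = [seg for seg, delay in segment_peaks if delay == best]
    span = (min(s for s, _ in voters), max(e for _, e in voters))
    return best, span
```

Segments that belong to the same copied excerpt agree on the delay, so their votes pile up, while segments outside the excerpt scatter over unrelated delays.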
![Page 91: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/91.jpg)
Algorithm performance evaluation
• To test the algorithm we used the MUSCLE-VCD database:
  – Over 100 hours of reference videos from the SoundVision group (Netherlands)
  – 2 test sets
    • ST1: 15 query videos where the whole query is considered
    • ST2: 3 videos with 21 segments appearing in the reference database
http://www-roc.inria.fr/imedia/civr-bench/benchMuscle.html
![Page 92: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/92.jpg)
MUSCLE-VCD transformation examples
![Page 93: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/93.jpg)
Evaluation metrics
• We use the same metrics as in the MUSCLE-VCD benchmark tests
![Page 94: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/94.jpg)
Evaluation metrics (II)
• We also use the more standard precision and recall metrics
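For reference, precision and recall over a set of detections can be computed as in this short sketch. Representing each detection as a `(query, reference)` pair is an assumption made for the example.

```python
def precision_recall(detected, ground_truth):
    """Precision and recall of detected copies against the ground truth.

    Both arguments are collections of hashable items, e.g. (query, reference)
    pairs; this representation is an illustrative assumption.
    """
    detected, ground_truth = set(detected), set(ground_truth)
    true_pos = len(detected & ground_truth)
    precision = true_pos / len(detected) if detected else 0.0
    recall = true_pos / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```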
![Page 95: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/95.jpg)
Evaluation results
![Page 96: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/96.jpg)
Evaluation results histogram for ST1
![Page 97: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/97.jpg)
YouTube reranking application
• We downloaded all videos found when searching for the top 20 most viewed and top 20 most visited videos
![Page 98: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/98.jpg)
YouTube reranking application
• We applied multimodal copy detection and grouped all near duplicates
![Page 99: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/99.jpg)
YouTube reranking test
• Results show how some videos have multiple clear copies that can boost their ranking once clustered
![Page 100: Multimodal pattern matching algorithms and applications](https://reader033.fdocuments.in/reader033/viewer/2022052618/554e6aa4b4c9054a698b46b2/html5/thumbnails/100.jpg)
Thanks for your attention
xanguera@tid.es
LinkedIn: http://es.linkedin.com/in/xanguera
Twitter: http://twitter.com/xanguera
Website: http://www.xavieranguera.com/