Post on 22-Feb-2016
1
Video Google: A Text Retrieval Approach to Object Matching in Videos
Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of Engineering Science
University of Oxford, United Kingdom
2
Goal
• To retrieve those key frames and shots of a video containing a particular object.
• With the ease, speed and accuracy with which Google retrieves text documents.
3
Outline
• Introduction
– Object query
– Scene query
• Challenging problem
• Text retrieval overview
• Viewpoint invariant description
– Building the Descriptors
– Building the Visual Word
– The Visual Analogy
• Visual indexing using text retrieval methods
• Experimental evaluation of scene matching using visual words
• Object retrieval
– Stop list
– Spatial Consistency
• Summary and conclusions
• Video Google Demo
4
Introduction - Object query(1/2)
5
Introduction - Scene query(2/2)
6
Challenging problem(1/2)
• Changes in viewpoint, illumination and partial occlusion
• Large data
• Real-world data
7
Challenging problem(2/2)
8
Text retrieval overview (1/2)
• The documents are parsed into words.
• Words are represented by their stems
– ‘walk’, ‘walking’, ‘walks’ -> ‘walk’
• A stop list filters out common words (‘the’, ‘an’, …)
• The remaining words are represented as a vector weighted by word frequency
9
Text retrieval overview (2/2)
• An inverted file facilitates efficient retrieval.
– An inverted file is structured like an ideal book index.
• A text query is answered by computing its vector of word frequencies and returning the documents with the closest vectors
• The returned documents are ranked
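
The text retrieval steps on these two slides can be sketched in a few lines of Python. This is a toy illustration only: the suffix-stripping `stem` function, the three-word stop list and the sample documents are assumptions made for the example, not part of the original system.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "an", "a"}  # illustrative stop list, not a real one


def stem(word):
    # Toy suffix stripper for illustration; real systems use a proper stemmer.
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def parse(text):
    # Split into lowercase words, drop stop words, reduce the rest to stems.
    words = [w.strip(".,").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


def build_inverted_file(docs):
    # Like a book index: each stem maps to the documents it occurs in,
    # together with the number of occurrences.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in Counter(parse(text)).items():
            index[word][doc_id] = count
    return index


docs = {
    "d1": "The dog walks an old path",
    "d2": "Walking the dog",
    "d3": "An old map",
}
index = build_inverted_file(docs)
print(sorted(index["walk"]))
```

Looking up a stem in `index` touches only the documents that contain it, which is what makes retrieval efficient on large collections.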
10
Viewpoint invariant description(1/2)
• Two types of viewpoint covariant regions are computed for each frame.
1. SA – Shape Adapted: corner-like features
2. MS – Maximally Stable: blobs of high contrast with respect to their surroundings
• Regions computed in grayscale
11
Viewpoint invariant description(2/2)
The MS regions are in yellow. The SA regions are in cyan.
12
Building the Descriptors(1/2)
• SIFT – Scale Invariant Feature Transform
– Each elliptical region is represented by a 128-dimensional vector
– SIFT is invariant to a shift of a few pixels
13
Building the Descriptors(2/2)
• Removing noise – tracking & averaging
– Regions are tracked across a sequence of frames using a “constant velocity dynamical model”
– Any region which does not survive for more than three frames is rejected
– The descriptors are averaged throughout the track
– Descriptors with large covariance are rejected
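
A minimal sketch of the rejection and averaging steps, assuming the tracking has already been done and each track is a list of per-frame descriptors. The names `clean_tracks` and `max_variance`, and the threshold value itself, are hypothetical; the slide does not say how "large covariance" is decided, so the sum of per-dimension variances is used here as a stand-in.

```python
def mean_vector(vectors):
    # Component-wise mean of a list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def total_variance(vectors):
    # Sum of per-dimension variances (the trace of the covariance matrix).
    mean = mean_vector(vectors)
    n = len(vectors)
    return sum(
        sum((v[d] - mean[d]) ** 2 for v in vectors) / n
        for d in range(len(mean))
    )


def clean_tracks(tracks, min_frames=4, max_variance=0.5):
    # min_frames=4 encodes "survives for more than three frames".
    # max_variance is a hypothetical threshold chosen for this example.
    kept = []
    for track in tracks:
        if len(track) < min_frames:
            continue  # short-lived region: treated as noise
        if total_variance(track) > max_variance:
            continue  # unstable descriptor: rejected
        kept.append(mean_vector(track))  # one averaged descriptor per track
    return kept


tracks = [
    [[1.0, 0.0]] * 4,                                      # stable, long enough
    [[1.0, 0.0]] * 3,                                      # too short
    [[0.0, 0.0], [2.0, 2.0], [0.0, 0.0], [2.0, 2.0]],      # too variable
]
descriptors = clean_tracks(tracks)
```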
14
Building the Visual Word(1/2)
• Cluster descriptors into K groups using the K-means clustering algorithm
• Each cluster represents a “visual word” in the “visual vocabulary”
• MS and SA regions are clustered separately
– different vocabularies for describing the same scene.
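
As a rough sketch of how a visual vocabulary could be built, the following runs a plain K-means (Lloyd's algorithm) on toy 2-dimensional descriptors. Euclidean distance, the fixed initial centroids and the toy data are assumptions made to keep the example deterministic; real descriptors are 128-dimensional and the distance measure may differ.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, point):
    # Index of the centroid closest to the point: its "visual word" id.
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], point))


def kmeans(points, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: group points by nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(centroids, p)].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


descriptors = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
vocabulary = kmeans(descriptors, initial_centroids=[descriptors[0], descriptors[2]])
word_ids = [nearest(vocabulary, d) for d in descriptors]
```

Running the same procedure once on SA descriptors and once on MS descriptors would give the two separate vocabularies mentioned above.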
15
Building the Visual Word(2/2)
SA
MS
16
The Visual Analogy
Text        Visual
Word        Descriptor
Stem        Centroid
Document    Frame
Corpus      Film
17
Visual indexing using text retrieval methods(1/2)
• tf-idf - ‘Term Frequency – Inverse Document Frequency’
• Given a vocabulary of k words, each document is represented by a k-vector of weighted word frequencies
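
To make the k-vector concrete, here is a small sketch of the standard tf-idf weighting, where each component is (count of the word in the frame / number of words in the frame) × log(number of frames / number of frames containing the word). The function name `tfidf_vectors` and the toy frames are assumptions, and the exact definition of the inverse document frequency used in the original work may differ from this common variant.

```python
import math
from collections import Counter


def tfidf_vectors(documents, k):
    # documents: one list of visual word ids (each in range(k)) per frame.
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))  # count each word once per frame
    vectors = []
    for words in documents:
        vec = [0.0] * k
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            vec[word] = tf * idf
        vectors.append(vec)
    return vectors


frames = [[0, 0, 1], [1, 2], [1, 2, 2, 2]]
vectors = tfidf_vectors(frames, k=3)
```

Word 1 occurs in every frame, so its weight is zero everywhere: a word that appears in all documents carries no discriminating information.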
18
Visual indexing using text retrieval methods(2/2)
• The query vector is given by the visual words contained in a user-specified sub-part of a frame
• The other frames are ranked according to the similarity of their weighted vectors to this query vector.
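
The ranking step can be sketched as follows, assuming similarity is measured by the cosine of the angle between the weighted vectors (the slide only says "similarity", so the choice of cosine is an assumption). The frame ids and vector values are made up for illustration.

```python
import math


def cosine(u, v):
    # Normalized scalar product; 0.0 if either vector is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_frames(query_vec, frame_vecs):
    # Returns frame ids ordered from most to least similar to the query.
    scores = {fid: cosine(query_vec, vec) for fid, vec in frame_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


frame_vecs = {
    "f1": [0.9, 0.0, 0.1],
    "f2": [0.0, 1.0, 0.0],
    "f3": [0.5, 0.5, 0.0],
}
query = [1.0, 0.0, 0.0]
ranking = rank_frames(query, frame_vecs)
```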
19
Experimental evaluation of scene matching using visual words(1/5)
• Goal
– Evaluate the method by matching scene locations within a closed world of shots (‘ground truth set’)
• Ground truth set
– 164 frames, from 48 shots, were taken at 19 3D locations in the movie ‘Run Lola Run’ (4-9 frames from each location)
– There are significant viewpoint changes in the frames for the same location
20
Experimental evaluation of scene matching using visual words(2/5)
21
Experimental evaluation of scene matching using visual words(3/5)
• The entire frame is used as a query region
• The performance is measured over all 164 frames
• The correct results were determined by hand
• Rank calculation
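
The slide does not spell out the rank calculation. One measure often used for this kind of evaluation is the normalized average rank of the relevant frames, sketched below as an assumption: it is 0 when all relevant frames are returned first and about 0.5 for a random ordering.

```python
def normalized_rank(relevant_ranks, n_total):
    # relevant_ranks: 1-based positions of the relevant frames in the ranking.
    # n_total: total number of frames in the set being searched.
    n_rel = len(relevant_ranks)
    best_possible = n_rel * (n_rel + 1) / 2  # ranks 1..n_rel summed
    return (sum(relevant_ranks) - best_possible) / (n_total * n_rel)


perfect = normalized_rank([1, 2, 3], n_total=164)   # all relevant frames first
shifted = normalized_rank([2, 3, 4], n_total=10)    # each one position late
```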
22
Experimental evaluation of scene matching using visual words(4/5)
23
Experimental evaluation of scene matching using visual words(5/5)
24
Object retrieval(1/7)
• Goal
– Searching for objects throughout the entire movie
– The object of interest is specified by the user as a sub-part of any frame
25
Object retrieval – Stop list(2/7)
• A stop list reduces the number of mismatches and the size of the inverted file while keeping a sufficient visual vocabulary.
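
One way a visual stop list could be built is by frequency: drop the most common visual words (which match everywhere) and the rarest ones. The sketch below assumes this approach; the function name `build_stop_list` and the fractions are illustrative parameters, not values stated on the slide.

```python
from collections import Counter


def build_stop_list(frames, top_fraction=0.05, bottom_fraction=0.10):
    # frames: one list of visual word ids per frame.
    counts = Counter(word for words in frames for word in words)
    ordered = [word for word, _ in counts.most_common()]  # most frequent first
    n_top = int(len(ordered) * top_fraction)
    n_bottom = int(len(ordered) * bottom_fraction)
    stopped = set(ordered[:n_top])
    if n_bottom:
        stopped |= set(ordered[-n_bottom:])
    return stopped


def apply_stop_list(words, stopped):
    return [w for w in words if w not in stopped]


# Toy data: visual word w occurs w + 1 times, so word 9 is the most common.
frames = [[w] * (w + 1) for w in range(10)]
stopped = build_stop_list(frames, top_fraction=0.1, bottom_fraction=0.2)
filtered = apply_stop_list([9, 5, 0, 4], stopped)
```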
26
Object retrieval – Spatial Consistency(3/7)
• Objects are queried by a sub-part of the image; matched covariant regions in the retrieved frames should have a similar spatial arrangement to those of the outlined region in the query image.
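
A loose sketch of one way to test spatial consistency: each match looks at its k nearest neighbouring matches in the query image and in the retrieved frame, and receives one vote for every neighbour that appears in both neighbourhoods. Matches with no votes are treated as mismatches. The function names, the value of k and the coordinates are all hypothetical; this illustrates the idea rather than reproducing the original implementation.

```python
def k_nearest(index, points, k):
    # Indices of the k points closest to points[index], excluding itself.
    px, py = points[index]
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: (points[j][0] - px) ** 2 + (points[j][1] - py) ** 2)
    return set(others[:k])


def spatial_votes(matches, k=2):
    # matches: list of ((x, y) in query image, (x, y) in retrieved frame).
    query_pts = [q for q, _ in matches]
    frame_pts = [f for _, f in matches]
    votes = []
    for i in range(len(matches)):
        # Neighbours that are close to this match in BOTH images support it.
        support = k_nearest(i, query_pts, k) & k_nearest(i, frame_pts, k)
        votes.append(len(support))
    return votes


matches = [
    ((0, 0), (10, 10)), ((1, 0), (11, 10)), ((0, 1), (10, 11)),    # group 1
    ((1, 1), (51, 51)),                                            # mismatch
    ((20, 20), (50, 50)), ((21, 20), (51, 50)), ((20, 21), (50, 51)),  # group 2
]
votes = spatial_votes(matches)
consistent = [m for m, v in zip(matches, votes) if v > 0]
```

The fourth match lies next to group 1 in the query image but next to group 2 in the retrieved frame, so none of its neighbours agree and it is discarded. Summing the remaining votes gives a score that could be used to re-rank frames.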
27
Object retrieval(4/7)
28
Object retrieval(5/7)
29
Object retrieval(6/7)
30
Object retrieval(7/7)
31
Summary and conclusions
• Visual word and vocabulary analogy
• Immediate run-time object retrieval
• Future work
– Automatic ways for building the vocabulary are needed
• Intriguing possibility
– latent semantic indexing to find content
– automatic clustering to find the principal objects that occur throughout the movie.
32
Video Google Demo
• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/