Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman

Robotics Research Group, Department of Engineering Science

University of Oxford, United Kingdom

• Goal: retrieve the key frames and shots of a video that contain a particular object.

• Do so with the ease, speed and accuracy with which Google retrieves text documents.

Outline

• Introduction

– Object query

– Scene query

• Challenging problem

• Text retrieval overview

• Viewpoint invariant description

– Building the Descriptors

– Building the Visual Word

– The Visual Analogy

• Visual indexing using text retrieval methods

• Experimental evaluation of scene matching using visual words

• Object retrieval

– Stop list

– Spatial Consistency

• Summary and conclusions

• Video Google Demo

Introduction - Object query (1/2)

Introduction - Scene query (2/2)

Challenging problem (1/2)

• Changes in viewpoint, illumination and partial occlusion

• Large data

• Real-world data

Challenging problem (2/2)

Text retrieval overview (1/2)

• The documents are parsed into words.

• Words are represented by their stems

– ‘walk’, ‘walking’, ‘walks’ -> ‘walk’

• A stop list filters out common words (‘the’, ‘an’, …)

• Each document is then represented as a vector of the remaining words, weighted by word frequency
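
The parsing steps above can be sketched in a few lines of Python. This is only an illustration: the stop list, the suffix rules and the function names are assumptions made for the example, and a real system would use a proper stemmer.

```python
from collections import Counter

# Illustrative stop list and suffix rules; both are assumptions, not from the slides.
STOP_WORDS = {"the", "an", "a", "of", "and"}
SUFFIXES = ("ing", "s")


def stem(word):
    """Toy stemmer: strip one known suffix, e.g. 'walking' -> 'walk'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def document_vector(text):
    """Parse into words, drop stop words, stem, and count frequencies."""
    words = [w.strip(".,;:!?'\"").lower() for w in text.split()]
    stems = [stem(w) for w in words if w and w not in STOP_WORDS]
    return Counter(stems)


vector = document_vector("The dog walks, an owl is walking")
```

Here ‘walks’ and ‘walking’ both contribute to the count for ‘walk’, while ‘the’ and ‘an’ are dropped.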

Text retrieval overview (2/2)

• An inverted file facilitates efficient retrieval.

– An inverted file is structured like an ideal book index.

• Text is retrieved by computing the query's vector of word frequencies and returning the documents with the closest vectors

• Rank the returned documents
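
A minimal sketch of an inverted file, assuming documents are already reduced to lists of words. The function names and the dictionary-of-sets layout are choices made for this example, not something the slides specify.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index


def candidate_documents(index, query_words):
    """Return ids of documents sharing at least one word with the query."""
    hits = set()
    for word in query_words:
        hits |= index.get(word, set())
    return hits


documents = {"d1": ["walk", "dog"], "d2": ["cat"], "d3": ["dog", "cat"]}
index = build_inverted_file(documents)
hits = candidate_documents(index, ["dog"])
```

Like a book index, the lookup goes from a word straight to the places it occurs, so only documents that share a word with the query need to be scored.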

Viewpoint invariant description (1/2)

• Two types of viewpoint covariant regions are computed for each frame.

1. SA – Shape Adapted: corner-like features

2. MS – Maximally Stable: blobs of high contrast with respect to their surroundings

• Regions are computed in grayscale

Viewpoint invariant description (2/2)

The MS regions are in yellow. The SA regions are in cyan.

Building the Descriptors (1/2)

• SIFT – Scale Invariant Feature Transform

– Each elliptical region is represented by a 128-dimensional vector

– SIFT is invariant to a shift of a few pixels

Building the Descriptors (2/2)

• Removing noise – tracking & averaging

– Regions are tracked across a sequence of frames using a “Constant Velocity Dynamical model”

– Any region which does not survive for more than three frames is rejected

– The descriptors are averaged throughout each track

– Descriptors with large covariance are rejected
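
The filtering and averaging steps can be sketched as follows, assuming tracking has already grouped descriptors by track. The `max_variance` threshold and all function names are hypothetical; the slides do not say how "large covariance" is decided, so the sum of per-dimension variances is used here purely for illustration.

```python
def mean_descriptor(descriptors):
    """Component-wise mean of a list of equal-length descriptor vectors."""
    n = len(descriptors)
    return [sum(column) / n for column in zip(*descriptors)]


def total_variance(descriptors, mean):
    """Sum of per-dimension variances (trace of the covariance matrix)."""
    n = len(descriptors)
    return sum(
        sum((d[i] - mean[i]) ** 2 for d in descriptors) / n
        for i in range(len(mean))
    )


def clean_tracks(tracks, min_frames=3, max_variance=1.0):
    """Keep tracks longer than min_frames whose descriptors are stable.

    tracks maps a track id to the list of descriptors observed along it.
    max_variance is a hypothetical threshold chosen for illustration.
    """
    averaged = {}
    for track_id, descriptors in tracks.items():
        if len(descriptors) <= min_frames:
            continue  # did not survive for more than min_frames frames
        mean = mean_descriptor(descriptors)
        if total_variance(descriptors, mean) > max_variance:
            continue  # descriptor varies too much along the track
        averaged[track_id] = mean
    return averaged


tracks = {
    "short": [[1.0, 1.0]] * 3,
    "stable": [[1.0, 2.0]] * 4,
    "noisy": [[0.0, 0.0], [10.0, 10.0], [0.0, 0.0], [10.0, 10.0]],
}
kept = clean_tracks(tracks)
```

Only the "stable" track survives: "short" has too few frames and "noisy" has too much variance.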

Building the Visual Word (1/2)

• Cluster the descriptors into K groups using the K-means clustering algorithm

• Each cluster represents a “visual word” in the “visual vocabulary”

• MS and SA regions are clustered separately

– different vocabularies for describing the same scene.
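
A minimal sketch of building a vocabulary with plain K-means (Lloyd's algorithm) and assigning a descriptor to its nearest centroid. Squared Euclidean distance, the seeded random initialisation and the fixed iteration count are assumptions made for the example; real SIFT descriptors would be 128-dimensional rather than the 2-D toy points used here.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, descriptor):
    """Index of the closest centroid, i.e. the visual word id."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], descriptor))


def kmeans(descriptors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns k centroids (the vocabulary)."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(centroids, d)].append(d)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return centroids


descriptors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
vocabulary = kmeans(descriptors, k=2)
words = [nearest(vocabulary, d) for d in descriptors]
```

Clustering MS and SA descriptors separately would simply mean calling `kmeans` once per region type and keeping the two vocabularies apart.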

Building the Visual Word (2/2)

The Visual Analogy

Text: Word, Document, Corpus

Visual: Descriptor, Centroid

Visual indexing using text retrieval methods (1/2)

• tf-idf - ‘Term Frequency – Inverse Document Frequency’

• Given a vocabulary of k words, each document is represented by a k-vector of weighted word frequencies
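
A short sketch of the weighting, assuming the standard tf-idf form t_i = (n_id / n_d) · log(N / n_i), where n_id is the number of occurrences of word i in document d, n_d the number of words in d, n_i the number of documents containing word i, and N the number of documents. The function name and the list-of-word-ids input are choices made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary_size):
    """Return one k-vector per document, k = vocabulary_size.

    documents is a list of lists of visual word ids in range(k).
    """
    n_docs = len(documents)
    # n_i: number of documents in which word i occurs at least once
    doc_freq = Counter(w for doc in documents for w in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vector = [0.0] * vocabulary_size
        for word, n_id in counts.items():
            tf = n_id / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            vector[word] = tf * idf
        vectors.append(vector)
    return vectors


frames = [[0, 0, 1], [1, 2], [1]]
vectors = tfidf_vectors(frames, vocabulary_size=3)
```

Word 1 occurs in every frame, so its idf is log(1) = 0 and it carries no weight; words 0 and 2 each occur in a single frame and are weighted up.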

Visual indexing using text retrieval methods (2/2)

• The query vector is given by the visual words contained in a user-specified sub-part of a frame

• The other frames are ranked according to the similarity of their weighted vectors to this query vector.
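
Ranking can be sketched with the normalised scalar product (cosine of the angle) between weighted vectors. Using cosine similarity here is an assumption of the example, as are the function names and the toy 3-word vocabulary.

```python
import math


def cosine_similarity(u, v):
    """Normalised scalar product of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_frames(query_vector, frame_vectors):
    """Return frame ids sorted from most to least similar to the query."""
    scores = {frame_id: cosine_similarity(query_vector, vec)
              for frame_id, vec in frame_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)


frame_vectors = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [1.0, 1.0, 0.0],
}
ranking = rank_frames([1.0, 0.0, 0.0], frame_vectors)
```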

Experimental evaluation of scene matching using visual words (1/5)

• Goal

– Evaluate the method by matching scene locations within a closed world of shots (‘ground truth set’)

• Ground truth set

– 164 frames, from 48 shots, were taken at 19 3D locations in the movie ‘Run Lola Run’ (4-9 frames from each location)

– There are significant viewpoint changes in the frames for the same location

• The entire frame is used as a query region

• The performance is measured over all 164 frames

• The correct results were determined by hand

• Rank calculation
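
The rank formula itself did not survive on this slide. One measure that fits this setup is the normalised average rank of the relevant frames, Rank = (Σ R_i − N_rel(N_rel + 1)/2) / (N · N_rel); the sketch below assumes that measure, and the function and parameter names are made up for the example.

```python
def normalised_rank(relevant_ranks, n_frames):
    """Average normalised rank of the relevant frames for one query.

    relevant_ranks: 1-based positions of the relevant frames in the
    ranked result list; n_frames: size of the ground truth set.
    Assumed measure, not confirmed by the slide text.
    """
    n_rel = len(relevant_ranks)
    best_possible = n_rel * (n_rel + 1) / 2
    return (sum(relevant_ranks) - best_possible) / (n_frames * n_rel)


perfect = normalised_rank([1, 2, 3], n_frames=164)
shifted = normalised_rank([2, 3, 4], n_frames=164)
```

Under this measure the score is 0 when all relevant frames are returned first, and it grows as relevant frames slip down the list.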

Object retrieval (1/7)

• Goal

– Searching for objects throughout the entire movie

– The object of interest is specified by the user as a sub-part of any frame

Object retrieval – Stop list (2/7)

• A stop list reduces the number of mismatches and the size of the inverted file while keeping a sufficient visual vocabulary.
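
A minimal sketch of a visual stop list, assuming it suppresses the most frequent visual words, by analogy with common words in text. The `top_fraction` cut-off and the function names are illustrative assumptions; the slide does not give the actual thresholds.

```python
from collections import Counter


def build_stop_list(frames, top_fraction=0.05):
    """Return the set of most frequent visual words to suppress.

    frames is a list of lists of visual word ids.
    top_fraction is an illustrative cut-off, not taken from the slides.
    """
    counts = Counter(w for frame in frames for w in frame)
    ordered = [w for w, _ in counts.most_common()]  # most frequent first
    n_top = int(len(ordered) * top_fraction)
    return set(ordered[:n_top])


def apply_stop_list(frame, stopped):
    """Drop stopped words from one frame's list of visual words."""
    return [w for w in frame if w not in stopped]


frames = [[0, 0, 0, 1], [0, 0, 2], [0, 3, 4]]
stopped = build_stop_list(frames, top_fraction=0.2)
filtered = apply_stop_list([0, 1, 0, 2], stopped)
```

Word 0 dominates every frame in this toy data, so it is the one word stopped; removing it also shrinks the inverted file entry that would otherwise list almost every frame.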

Object retrieval – Spatial Consistency (3/7)

• When an object is queried by a sub-part of the image, the matched covariant regions in the retrieved frames should have a similar spatial arrangement to those in the outlined region of the query image.
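
One loose way to test for a similar spatial arrangement is neighbourhood voting: a match is supported by other matches that are near it in both the query and the retrieved frame. The sketch below assumes that scheme; the neighbourhood size `k`, the function names and the 2-D point representation are all hypothetical choices for illustration.

```python
def nearest_neighbours(points, index, k):
    """Indices of the k points closest to points[index], excluding itself."""
    px, py = points[index]
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: (points[i][0] - px) ** 2 + (points[i][1] - py) ** 2)
    return set(others[:k])


def spatial_votes(matches, k=3):
    """Count, for each match, neighbours that agree in both images.

    matches is a list of ((xq, yq), (xf, yf)) pairs: a region position in
    the query and the position of its matched region in a retrieved frame.
    """
    query_pts = [q for q, _ in matches]
    frame_pts = [f for _, f in matches]
    votes = []
    for i in range(len(matches)):
        near_in_query = nearest_neighbours(query_pts, i, k)
        near_in_frame = nearest_neighbours(frame_pts, i, k)
        votes.append(len(near_in_query & near_in_frame))
    return votes


matches = [
    ((0.0, 0.0), (10.0, 0.0)),
    ((1.0, 0.0), (11.0, 0.0)),
    ((2.0, 0.0), (12.0, 0.0)),
    ((3.0, 0.0), (13.0, 0.0)),
    ((-1.5, 0.0), (20.0, 0.0)),  # lands on the wrong side in the frame
]
votes = spatial_votes(matches, k=2)
```

Matches with no votes, such as the last one here, can be discarded, and the total vote count can be used to re-rank the retrieved frames.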

Summary and conclusions

• Visual word and vocabulary analogy

• Immediate run-time object retrieval

• Future work

– Automatic ways for building the vocabulary are needed

• Intriguing possibility

– latent semantic indexing to find content

– automatic clustering to find the principal objects that occur throughout the movie.

Video Google Demo

• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/
