Region-oriented Convolutional Networks for Object Retrieval
REGION-ORIENTED CONVOLUTIONAL NETWORKS FOR OBJECT RETRIEVAL
AUTHOR: Eduard Fontdevila
ADVISORS: Amaia Salvador, Xavier Giró-i-Nieto
from shallow to deep learning
state of the art
“hand crafted” features: Bag of Words, SIFT, Histograms of gradients
“learned” features: Convolutional Neural Networks (CNNs)
AlexNet
Krizhevsky et al. (Toronto), ImageNet Classification with Deep Convolutional Neural Networks (2012)
CaffeNet
architecture [Krizhevsky’12]
data [Deng’09]
framework [Jia’14]
Slide credit: Xavier Giró-i-Nieto
CaffeNet
input image
convolutional layers
fully connected layers
Babenko et al. (Moscow), Neural Codes for Image Retrieval (2014)
object candidates
Selective Search bounding boxes
Uijlings et al. (Trento), Selective Search for Object Recognition (2013)
MCG segments
Arbeláez et al. (Berkeley), Multiscale Combinatorial Grouping (2014)
R-CNN
Girshick et al. (Berkeley), Rich feature hierarchies for accurate object detection and semantic segmentation (2014)
Object Detection network
Fast R-CNN
R. Girshick (Berkeley), Fast R-CNN (2015)
SDS
Hariharan et al. (Berkeley), Simultaneous Detection and Segmentation (2014)
Object Detection + Semantic Segmentation network
OUTLINE
1. Motivation
2. State of the Art
3. Local CNNs for Instance Search
4. Fine-tuning
5. Conclusions
TRECVid Instance Search
local CNNs for instance search
large collection of videos: 464h
shots: ~470k
frames: 1/4 fps
...in our case, subset of 13k shots (23k frames)
query descriptors
[Diagram: query set, visual features from each network, descriptors]
CaffeNet: image
Fast R-CNN: bbox
SDS: region
object candidates
main scheme
[Diagram: for CaffeNet, Fast R-CNN and SDS: frames in 1 shot, object candidates, visual features, pooling, matching with the query descriptors, ranking]
main scheme
global approach
[Diagram: frames in 1 shot, visual features, pooling, matching with the query descriptors, ranking]
main scheme
global approach
matching: euclidean distance
Babenko et al. (Moscow), Neural Codes for Image Retrieval (2014)
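The euclidean-distance matching named on this slide can be illustrated with a minimal sketch. It assumes that each frame and each query is already described by a fixed-length feature vector. The toy 3-D vectors and the names `euclidean_distance`, `query` and `frames` are illustrative assumptions, not taken from the slides.

```python
import math

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two equal-length descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-D descriptors (assumption: real CNN descriptors are much longer).
query = [0.0, 3.0, 4.0]
frames = {
    "frame_a": [0.0, 0.0, 0.0],
    "frame_b": [0.0, 3.0, 3.0],
}

# Distance from the query descriptor to every frame descriptor.
distances = {name: euclidean_distance(query, desc) for name, desc in frames.items()}

# A smaller distance means a better match.
best_frame = min(distances, key=distances.get)
```

Here `frame_b` is the closest frame because its descriptor differs from the query in a single component.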
main scheme
global approach
distance shot - query = average distance over the frames in the shot (distance frame 1, distance frame 2, distance frame 3)
Zhu et al. (NII), Multi-image aggregation for better visual object retrieval (2014)
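The shot-level distance on this slide (the average of the per-frame distances) could be sketched as follows. The per-frame distances are assumed to be already computed. The shot names, the numbers and the helpers `shot_distance` and `rank_shots` are hypothetical.

```python
from statistics import mean

def shot_distance(frame_distances):
    """Distance shot - query: the average of the per-frame distances."""
    if not frame_distances:
        raise ValueError("a shot needs at least one frame distance")
    return mean(frame_distances)

def rank_shots(distances_by_shot):
    """Return shot ids sorted from the smallest to the largest average distance."""
    scored = {shot: shot_distance(d) for shot, d in distances_by_shot.items()}
    return sorted(scored, key=scored.get)

# Toy per-frame distances to one query (illustrative values only).
distances_by_shot = {
    "shot_1": [0.9, 0.7, 0.8],
    "shot_2": [0.2, 0.4, 0.3],
    "shot_3": [0.5, 0.5],
}

ranking = rank_shots(distances_by_shot)
```

Averaging makes the shot score depend on all of its sampled frames rather than on a single best frame; whether that suits a given query is a design choice.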
main scheme
global approach
only top 1000 shots
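Keeping only the top 1000 shots of a ranking, and re-ordering just that shortlist, might look like the sketch below. It assumes the shortlist is re-ranked with a second distance while the remaining shots keep their original order; the slides do not spell out this detail. The function `rerank_top_k` and the toy data are hypothetical.

```python
def rerank_top_k(global_ranking, second_distance, k=1000):
    """Re-order the first k shots by a second distance; keep the tail unchanged.

    global_ranking: shot ids, best first.
    second_distance: dict mapping each shortlisted shot id to its new distance.
    """
    shortlist = global_ranking[:k]
    # sorted() is stable, so ties keep their order from the first ranking.
    reranked_head = sorted(shortlist, key=lambda shot: second_distance[shot])
    return reranked_head + global_ranking[k:]

# Toy example with k=3 instead of 1000 (illustrative values only).
global_ranking = ["s1", "s2", "s3", "s4"]
second_distance = {"s1": 0.6, "s2": 0.2, "s3": 0.4}

reranked = rerank_top_k(global_ranking, second_distance, k=3)
```

Only the shortlisted shots need the second distance, which keeps the more expensive computation bounded by `k`.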
main scheme
local approach
[Diagram: frames in 1 shot, object candidates, visual features, pooling, matching with the query descriptors, ranking]
quantitative results: re-ranking
[Chart: mAP (%) for SDS, Fast R-CNN and CaffeNet]
adding context
~8%
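For reference, mAP (mean average precision) can be computed as in the sketch below. It uses the standard non-interpolated definition over a full ranked list; the official TRECVid evaluation tooling may differ in details such as the list cut-off. The topic names and relevance sets are made up for illustration.

```python
def average_precision(ranked_shots, relevant):
    """Non-interpolated average precision for one query topic."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked_shots, start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, ground_truth):
    """Mean of the per-topic average precisions."""
    aps = [average_precision(rankings[t], ground_truth[t]) for t in rankings]
    return sum(aps) / len(aps)

# Toy rankings and ground truth (illustrative only).
rankings = {"topic_1": ["a", "b", "c", "d"], "topic_2": ["x", "y", "z"]}
ground_truth = {"topic_1": {"a", "c"}, "topic_2": {"y"}}

map_score = mean_average_precision(rankings, ground_truth)
```

Multiplying `map_score` by 100 gives the percentage form used on the chart axis.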
as a reminder...
Fast R-CNN: Selective Search bounding boxes
Uijlings et al. (Trento), Selective Search for Object Recognition (2013)
SDS: MCG segments
Arbeláez et al. (Berkeley), Multiscale Combinatorial Grouping (2014)
OUTLINE
1. Motivation
2. State of the Art
3. Local CNNs for Instance Search
4. Fine-tuning
5. Conclusions
... instead: fine-tuning
already trained network, new dataset (novel domain)
resume training
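The idea of resuming training from an already trained network on a new dataset can be shown with a deliberately tiny stand-in. The sketch below uses a two-parameter linear model instead of a CNN, and plain gradient descent instead of a deep learning framework. All names and numbers (`pretrained`, `new_domain`, `train`, the learning rate) are assumptions chosen for illustration.

```python
def mse(weights, data):
    """Mean squared error of the model y = w * x + b on (x, y) pairs."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(weights, data, steps=200, lr=0.05):
    """Gradient descent on the MSE, starting from the given weights."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

# "Already trained network": weights assumed to come from a source domain y = 2x.
pretrained = (2.0, 0.0)

# "New dataset (novel domain)": samples of y = 2.5x + 1.
new_domain = [(0.0, 1.0), (1.0, 3.5), (2.0, 6.0), (3.0, 8.5)]

# Fine-tuning: resume training from the pretrained weights, not from scratch.
fine_tuned = train(pretrained, new_domain)
```

The only difference from training from scratch is the starting point: the optimizer begins at `pretrained` and adapts it to the new data.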
results on Pascal (global scale)
accuracy (%)
validation subset: 59.31%
validation set: 4.14%
[Figure: Histogram of images per category; axes: categories, % of images]
Microsoft COCO
● Multiple objects per image
● 80 categories
● > 300k images (80k training)
● > 2M instances
Lin et al. (Cornell - Microsoft), http://vision.ucsd.edu/sites/default/files/coco_eccv.pdf (2015)
OUTLINE
1. Motivation
2. State of the Art
3. Local CNNs for Instance Search
4. Fine-tuning
5. Conclusions
about the results
● Although not outperforming CaffeNet: SDS good for localization!
conclusions
maybe more suitable for TRECVid localization task?
about fine-tuning
● Networks trained on objects, but not on the objects to retrieve
fine-tuning on a larger dataset is clearly the next step
about object candidates
● Using only 100 candidates decreases the likelihood of success
... but using a higher number
Fast SDS would be the key
interactive: Multi-image aggregation
The query images for a topic were used with the min distance to each shot.
The best option with SIFT-BoW is averaging, whether of features (Avg-Pooling) or of similarity scores (Sim-Avg).
annex
Zhu et al. (NII), Multi-image aggregation for better visual object retrieval (2014)
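The three ways of combining several query images mentioned in this annex can be compared in a short sketch. It assumes euclidean distances serve as the matching score, so the "Sim-Avg" variant here averages distances rather than similarities. The function names and toy 2-D descriptors are illustrative and not taken from Zhu et al.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_distance(query_descs, shot_desc):
    """Keep the best (smallest) distance over all query images."""
    return min(euclidean(q, shot_desc) for q in query_descs)

def sim_avg(query_descs, shot_desc):
    """Average the per-query scores (distances in this sketch)."""
    return sum(euclidean(q, shot_desc) for q in query_descs) / len(query_descs)

def avg_pooling(query_descs, shot_desc):
    """Average the query descriptors first, then compare once."""
    n = len(query_descs)
    pooled = [sum(q[i] for q in query_descs) / n for i in range(len(shot_desc))]
    return euclidean(pooled, shot_desc)

# Two toy query descriptors for one topic and one shot descriptor.
queries = [[0.0, 0.0], [4.0, 0.0]]
shot = [0.0, 3.0]

scores = {
    "min": min_distance(queries, shot),
    "sim_avg": sim_avg(queries, shot),
    "avg_pooling": avg_pooling(queries, shot),
}
```

The three scores differ for the same shot, so the choice of aggregation can change the final ranking.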