Introduction to Research in Mobile Robotics: Visual Place Recognition
Luis Gomez Camara
Intelligent and Mobile Robotics (IMR) Group, Czech Institute of Informatics, Robotics and Cybernetics
Czech Technical University in Prague
Motivation: Lifelong Autonomy
Long-term autonomy of mobile robots is a highly relevant research topic (also at IMR)
Requires navigation over extended periods of time
Long-term navigation is challenging due to:
Accumulation of errors (drift)
Dynamic environments
Visual Place Recognition (VPR) is a valuable tool
Long-term navigation: Applications
Self-driving cars
Planetary rovers
Autonomous underwater vehicles
Injectable nanorobots
UAVs
Domestic robots
. . .
Robot navigation
The ability of a robot to
Determine its own position in its frame of reference
Plan a path towards some goal location
Answering the questions:
1. Where am I?
2. How do I get to other places?
3. Where are other places relative to me?
Navigation consists of:
1. Self-localization
2. Path planning
3. Map building and map interpretation
Visual SLAM
Simultaneous Localization And Mapping using optical sensors
Image: www.dragonfly.com
Problem: Drifting over time
Loop closure to correct the drift
Visual Place Recognition to solve loop closure
Visual SLAM: Drift
"SLAM-Loop Closing with Visually Salient Features", P. Newman et al. 2005
Red arrows: camera pose
Grey ellipses: global uncertainty
Images: views used in loop closure
Note angular error at the bottom right
Visual SLAM: Loop closure
Definition: the task of recognising a previously-visited location and updating beliefs accordingly
"Deformation-based loop closure for large scale dense RGB-D SLAM", Thomas Whelan et al. 2013
Basic component of SLAM systems
Used to correct drift that accumulates over time
Reduction in the uncertainty of the map estimate
Necessary for long-term navigation
Visual Place Recognition (VPR)
Definition: given a query image of a place, find its location by comparison with a database of previously visited places
Query image
Database of places
?
Fundamental and challenging problem in computer vision
Navigation
Autonomous driving
Geolocalization
Image retrieval
AR, VR, etc.
VPR: Applications
VPR: Challenges
Day-night cycles and illumination changes
Weather and season-related changes
Viewpoint changes
Dynamic objects, occlusions, etc.
Image Retrieval: Pipeline
Offline stage: Database creation
Places image dataset → Feature extraction → Image feature representation (f1, f2, …, fn) → Places database
Online stage: Place recognition
Query image → Feature extraction → Image feature representation (f1, f2, …, fn) → Feature matching against the places database (exhaustive search) → Ranked list of candidate images → Re-ranked list of images → Place recognition = best candidate
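The two stages of this pipeline can be illustrated with a minimal sketch. It assumes one global descriptor per image and cosine similarity for matching; `describe` is a hypothetical stand-in (a coarse, normalised thumbnail) for a real feature extractor, and all names and sizes are illustrative only.

```python
import numpy as np

def describe(image, size=8):
    """Hypothetical stand-in for feature extraction: a coarse, L2-normalised thumbnail."""
    h, w = image.shape
    # Crop so the grid divides evenly, then average-pool into a size x size grid.
    image = image[: h - h % size, : w - w % size]
    blocks = image.reshape(size, image.shape[0] // size, size, image.shape[1] // size)
    vec = blocks.mean(axis=(1, 3)).ravel()
    vec = vec - vec.mean()
    return vec / (np.linalg.norm(vec) + 1e-12)

def build_database(images):
    """Offline stage: one descriptor per place image, stacked row-wise."""
    return np.stack([describe(img) for img in images])

def query(database, image, top_n=3):
    """Online stage: exhaustive search, returning indices ranked by cosine similarity."""
    scores = database @ describe(image)
    return np.argsort(-scores)[:top_n]

rng = np.random.default_rng(0)
places = [rng.random((64, 64)) for _ in range(10)]
db = build_database(places)
# A slightly perturbed view of place 4 plays the role of the query image.
noisy_view = places[4] + 0.05 * rng.standard_normal((64, 64))
ranked = query(db, noisy_view)
```

A re-ranking step would then re-score only the returned candidates with a more expensive comparison.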
Image Retrieval: Milestones
Dominated by Bag of Visual Words (BoVW) model
Pre-trained and fine-tuned models
Spatial pyramid matching (Lazebnik et al.)
Image Feature Extraction
Dense sampling
• Patches of fixed size and shape
• Regular grid, possibly overlapping and over range of scales
• Simpler than keypoints (but heavier)
• Optimal for high-level representations (e.g. scene classification)
Interest points (keypoints)
• Salient locations that are likely to match in other images
• Edges, corners, blobs, etc.
• Optimal for image correspondences
Image Feature Descriptors
Created from regions around points of interest
Should be stable (robust) to orientation, illumination, etc.
Can be matched against descriptors in other images
Handcrafted: SIFT, SURF, ORB, etc.
Learned: CNN features
SIFT
Scale Invariant Feature Transform
Hand-crafted (engineered)
Used as both detector and descriptor
There are faster alternatives such as SURF, ORB, BRISK
SIFT is still one of the most accurate hand-crafted descriptors
Scale affects detection
"Distinctive Image Features from Scale-Invariant Keypoints", David Lowe 2004
SIFT
1. Scale-space extrema detector
• Laplacian of Gaussian (LoG) approximated by Difference of Gaussians (DoG)
• Successively blur with Gaussian filter
• Scale parameter: standard deviation
(Figure: first, second, third and fourth octaves)
Maxima/minima detection
• Find local extrema
• Over both scale and space
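As a rough illustration of the detector, the sketch below builds a single-octave DoG stack and looks for extrema in a 3x3x3 neighbourhood over scale and space. It is NumPy-only and simplified: the scale ratio of sqrt(2), the contrast threshold and the function names are assumptions, and octave downsampling and sub-pixel refinement are omitted.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur using a truncated 1-D kernel (NumPy only)."""
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def dog_extrema(image, sigma0=1.6, levels=5, threshold=0.01):
    """Return (level, row, col) triples of local extrema of a one-octave DoG stack."""
    k = np.sqrt(2.0)  # assumed ratio between consecutive scales
    blurred = [gaussian_blur(image, sigma0 * k ** i) for i in range(levels)]
    dog = np.stack([b - a for a, b in zip(blurred[:-1], blurred[1:])])
    extrema = []
    for s in range(1, dog.shape[0] - 1):
        for r in range(1, dog.shape[1] - 1):
            for c in range(1, dog.shape[2] - 1):
                cube = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                value = dog[s, r, c]
                # Keep points that are the max or min of their 26 neighbours.
                if abs(value) > threshold and (value == cube.max() or value == cube.min()):
                    extrema.append((s, r, c))
    return extrema

# Synthetic image: one bright Gaussian blob centred at (20, 20).
yy, xx = np.mgrid[0:41, 0:41]
blob = np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / (2 * 2.7 ** 2))
extrema = dog_extrema(blob)
```

For a bright blob the DoG response at its centre is a minimum, and the scale level at which it is found reflects the blob size.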
SIFT
2. Keypoint localisation
• Remove low-contrast keypoints
• Remove keypoints on edges
• Only strong interest points remain
(Figure: keypoints before and after filtering)
3. Orientation assignment
• Based on local properties, find a consistent orientation for each keypoint and scale
• Invariance to image rotation
• Orientation histogram (36 bins) around keypoint:
• Gradient magnitude and direction from pixel differences
• The highest bin and any bin above 80% of the highest are each used to create a keypoint
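The orientation step can be sketched as follows. This is a simplified illustration, not Lowe's full procedure: it omits the Gaussian weighting window and the peak interpolation, and the function name and arguments are assumptions.

```python
import numpy as np

def dominant_orientations(patch, num_bins=36, peak_ratio=0.8):
    """Return the angles (degrees, lower bin edges) of the dominant gradient orientations."""
    patch = patch.astype(float)
    # Central pixel differences give the gradient components.
    dx = patch[1:-1, 2:] - patch[1:-1, :-2]
    dy = patch[2:, 1:-1] - patch[:-2, 1:-1]
    magnitude = np.hypot(dx, dy)
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    bin_width = 360.0 / num_bins
    bins = (angle // bin_width).astype(int) % num_bins
    # Each pixel votes for its orientation bin, weighted by gradient magnitude.
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=num_bins)
    # The highest bin and every bin within 80% of it yield an orientation.
    peaks = np.flatnonzero(hist >= peak_ratio * hist.max())
    return peaks * bin_width, hist

# A horizontal intensity ramp has all gradients pointing along +x (0 degrees).
ramp = np.tile(np.arange(16.0), (16, 1))
angles, hist = dominant_orientations(ramp)
```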
SIFT
4. Keypoint descriptor
• 16x16 neighborhood
• 16 sub-blocks of 4x4 size
• 8 bin orientation histogram per sub-block
• Total of 128 bin values
• Everything relative to keypoint orientation
• Normalization for contrast changes
• Thresholding large gradients for brightness changes
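A minimal sketch of the descriptor layout (4x4 sub-blocks, 8 bins each, 128 values) is given below. It assumes the 16x16 patch has already been rotated to the keypoint orientation, and it leaves out the Gaussian weighting and interpolation between bins used in the original method. The clipping value 0.2 follows Lowe's paper; the function name is illustrative.

```python
import numpy as np

def sift_like_descriptor(patch, clip=0.2):
    """Build a 128-D descriptor from a 16x16 patch aligned to the keypoint orientation."""
    assert patch.shape == (16, 16)
    dy, dx = np.gradient(patch.astype(float))   # gradients along rows and columns
    magnitude = np.hypot(dx, dy)
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    bins = (angle // 45.0).astype(int) % 8      # 8 orientation bins of 45 degrees
    histograms = []
    for r in range(0, 16, 4):                   # 16 sub-blocks of 4x4 pixels
        for c in range(0, 16, 4):
            histograms.append(np.bincount(bins[r:r + 4, c:c + 4].ravel(),
                                          weights=magnitude[r:r + 4, c:c + 4].ravel(),
                                          minlength=8))
    vec = np.concatenate(histograms)            # 16 * 8 = 128 values
    vec = vec / (np.linalg.norm(vec) + 1e-12)   # normalisation for contrast changes
    vec = np.minimum(vec, clip)                 # limit the influence of large gradients
    return vec / (np.linalg.norm(vec) + 1e-12)

rng = np.random.default_rng(0)
patch = rng.random((16, 16))
descriptor = sift_like_descriptor(patch)
```

Because only gradients are used and the vector is normalised, scaling the patch intensities or adding a constant offset leaves the descriptor unchanged.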
Bag of (visual) Words (BoW)
Traditional approach in VPR
Borrowed from Natural Language Processing
"Video Google: A Text Retrieval Approach to Object Matching in Videos", Sivic and Zisserman 2003
Stores zero-order information (word repetitions)
Uses hand-crafted descriptors (SIFT, SURF, ORB, etc.)
Bag of (visual) Words
Steps:
1. Extract descriptors from a collection of images
2. Learn visual dictionary by clustering descriptors (e.g. k-means)
3. Represent query image by
• Quantizing descriptors to closest word (centroid)
• Histogram of word repetitions
4. Image is represented as a vector
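The steps above can be sketched with a small NumPy-only k-means. Random vectors stand in for real descriptors, and the vocabulary size, iteration count and function names are assumptions made for illustration.

```python
import numpy as np

def quantize(descriptors, centroids):
    """Assign each descriptor to the index of its closest visual word (centroid)."""
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(descriptors, k, iterations=20, seed=0):
    """Learn a visual dictionary of k words with plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centroids = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = quantize(descriptors, centroids)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):              # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids

def bow_histogram(descriptors, centroids):
    """Represent an image as a histogram of visual word occurrences."""
    return np.bincount(quantize(descriptors, centroids), minlength=len(centroids))

rng = np.random.default_rng(1)
training_descriptors = rng.standard_normal((500, 16))   # stand-in for SIFT-like descriptors
vocabulary = kmeans(training_descriptors, k=8)
query_descriptors = rng.standard_normal((40, 16))
hist = bow_histogram(query_descriptors, vocabulary)
```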
Vector quantization
Images: www.mathworks.com
Bag of (visual) Words
Pros:
• Largely unaffected by object positions, scale and orientation
• Good for classifying images according to content
• Fast search thanks to inverted indices
(requires sparsity of words in images)
Cons:
• Spatial information is discarded
• Information loss due to quantization
• High dimensionality
Inverted file index
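A minimal sketch of an inverted file index, using only the Python standard library. Image identifiers and word ids are made up for illustration; real systems usually also store term weights (e.g. tf-idf), which are omitted here.

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """Map each visual word id to the set of image ids that contain it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for word in words:
            index[word].add(image_id)
    return index

def candidates(index, query_words):
    """Count shared visual words per image, visiting only the relevant posting lists."""
    votes = defaultdict(int)
    for word in set(query_words):
        for image_id in index.get(word, ()):
            votes[image_id] += 1
    # Sort by decreasing vote count, breaking ties by image id.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))

image_words = {"a": [1, 5, 9], "b": [2, 5, 7], "c": [3, 4, 8]}
index = build_inverted_index(image_words)
ranking = candidates(index, [5, 7, 10])
```

Only images sharing at least one word with the query are touched, which is why the search is fast when each image contains a small fraction of the vocabulary.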
BoW improvements
• Spatial pyramid
• Fisher Vectors:
• Uses Gaussian mixture model (GMM) as vocabulary
• Statistical measure of descriptors wrt GMM
• Derivative of likelihood wrt GMM parameters
• Stores second order information (covariances)
• VLAD: Vector of Locally Aggregated Descriptors
• Similar to Fisher Vectors but only first order information (distances)
Spatial pyramid
(Figure: level 0, level 1, level 2)
Based on approximate global geometric correspondence
Image divided into increasingly fine sub-regions
Histograms of local features found inside each sub-region
Extension of an orderless bag-of-features
"Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", Lazebnik et al. 2006
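The pyramid representation can be sketched as a concatenation of per-cell word histograms over grids of 1x1, 2x2 and 4x4 cells. This is a simplified illustration: the level weights used by the pyramid match kernel in the paper are omitted, and the function name and input conventions are assumptions.

```python
import numpy as np

def spatial_pyramid(points, words, vocab_size, levels=3):
    """Concatenate per-cell visual word histograms for grids of 1x1, 2x2, 4x4, ... cells.

    points: (n, 2) array of (x, y) feature positions normalised to [0, 1).
    words:  (n,) integer array of visual word ids in [0, vocab_size).
    """
    histograms = []
    for level in range(levels):
        cells = 2 ** level
        cell_x = np.minimum((points[:, 0] * cells).astype(int), cells - 1)
        cell_y = np.minimum((points[:, 1] * cells).astype(int), cells - 1)
        cell_id = cell_y * cells + cell_x
        # One histogram bin per (cell, word) pair at this level.
        flat = cell_id * vocab_size + words
        histograms.append(np.bincount(flat, minlength=cells * cells * vocab_size))
    return np.concatenate(histograms)

rng = np.random.default_rng(0)
points = rng.random((50, 2))
words = rng.integers(0, 10, size=50)
pyramid = spatial_pyramid(points, words, vocab_size=10)
```

With a vocabulary of 10 words and three levels the vector has 10 x (1 + 4 + 16) = 210 entries; level 0 alone is the ordinary orderless bag of features.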
Spatial Pyramid
"Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", Lazebnik et al. 2006
Weak features: oriented edge points (gradient in a given direction above minimum threshold)
Strong features: SIFT
VLAD (Vector of Locally Aggregated Descriptors)
"Aggregating local descriptors into a compact image representation", Jégou et al. 2010
0. Train a visual dictionary C (k-means): $C = \{\mu_1, \mu_2, \ldots, \mu_k\}$
1. For an image with m descriptors of dimension d, $X = \{x_1, x_2, \ldots, x_m\}$, assign each descriptor to its closest cell centroid
2. Compute residuals $x - \mu_i$
3. Accumulate residuals for each cell:
$v_i = \sum_{x : NN(x) = \mu_i} (x - \mu_i)$
4. Concatenate accumulated residuals in vector $v \in \mathbb{R}^{kd}$:
$v = [v_1, v_2, \ldots, v_k]$
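A minimal NumPy sketch of these steps, for k centroids and descriptors of dimension d, giving a vector of length k·d. The toy centroids and descriptors are made up for illustration, and the final normalisation of the vector is omitted.

```python
import numpy as np

def vlad(descriptors, centroids):
    """Accumulate residuals to the nearest centroid and concatenate them."""
    k, d = centroids.shape
    # Step 1: assign each descriptor to its closest centroid.
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members):
            # Steps 2-3: sum of residuals x - mu_i over the descriptors in cell i.
            v[i] = (members - centroids[i]).sum(axis=0)
    # Step 4: concatenate the k accumulated residuals.
    return v.ravel()

centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]])
v = vlad(descriptors, centroids)
```

Here the first two descriptors fall in the first cell and contribute the residual (1, 2), while the third falls in the second cell and contributes (-1, 0).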
VLAD
Advantages
Fast to compute
Adds more discriminative power than BoW
Good results with small dimensionality
Fixed-length vector irrespective of the number of detected features
"Aggregating local descriptors into a compact image representation", Jégou et al. 2010
CNN Features
Extracted from Convolutional Neural Networks
Pre-trained vs. end-to-end
Early layers learn features similar to Gabor filters
Later layers learn more semantic features
Semantic features are robust (a car is always a car)
Spatial information can also be exploited
Gabor filters
Semantic features
CNN Features
CNN: mathematical model with huge number of parameters
Automatically learned during training with massive labelled datasets
Number of CNN features depends on architecture
CNN Features
Important concepts:
Parameter sharing: weights in kernels are used at all locations
Pooling: used to subsample feature maps and obtain translation invariance
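Both concepts can be illustrated with a small NumPy sketch: one kernel whose weights are reused at every image location, followed by non-overlapping max pooling. Function names and the toy images are assumptions; real CNN layers add channels, biases, strides and padding.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D correlation: the same kernel weights are applied at every location."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; subsamples the map by `size` in each direction."""
    h, w = feature_map.shape
    trimmed = feature_map[: h - h % size, : w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

kernel = np.ones((3, 3))
image_a = np.zeros((8, 8))
image_a[2, 2] = 1.0
image_b = np.zeros((8, 8))
image_b[3, 3] = 1.0        # the same pattern shifted by one pixel

conv_a, conv_b = conv2d(image_a, kernel), conv2d(image_b, kernel)
pooled_a, pooled_b = max_pool(conv_a), max_pool(conv_b)
```

In this particular example the one-pixel shift changes the convolution output but not the pooled map. In general pooling gives only approximate invariance to small translations.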
CNN Features
"Visualizing and Understanding Convolutional Networks", Zeiler and Fergus 2013
CNN: Some recent approaches
"Bag of Local Convolutional Features for Scalable Instance Search" (Mohedano et al. 2016)
• Instance retrieval based on CNN features and the BoW model
• Activations of pre-trained CNN as local features
• High dimensional, sparse representation (N=512, 20k visual words)
• Each local CNN feature is assigned its closest visual word (assignment map)
• Performance comparable to other CNN-based approaches but more scalable
CNN: Some recent approaches
"On the performance of ConvNet features for place recognition" (Sünderhauf et al. 2015)
• Systematic analysis on the performance of pre-trained CNN layers
• Tested on the AlexNet architecture trained on ImageNet
• Nearest Neighbor search of extracted feature vectors
• Layer Conv3 best performing for place recognition
AlexNet architecture: "ImageNet Classification with Deep Convolutional Neural Networks", Krizhevsky et al. 2012
Example of CNN image features
CNN: Some recent approaches
"NetVLAD: CNN architecture for weakly supervised place recognition" (Arandjelović et al. 2016)
• Learns image representation in an end-to-end manner for the VPR task
• Steps:
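1. Crop the CNN at the last convolutional layer (H x W x D)
2. Each spatial location generates one D-dimensional descriptor
(Figure: conv3, conv4, conv5; N = 13 x 13 = 169 descriptors)
3. Express VLAD image representation as a matrix:
$V(j,k) = \sum_{i=1}^{N} a_k(x_i) \, (x_i(j) - c_k(j))$
$x_i(j)$: j-th dimension of the i-th descriptor
$c_k(j)$: j-th dimension of the k-th cluster
$a_k(x_i)$: membership of descriptor $x_i$ to the k-th word (cluster). Value: 0 or 1
4. Make the membership term differentiable:
$\bar{a}_k(x_i) = \frac{e^{-\alpha \|x_i - c_k\|^2}}{\sum_{k'} e^{-\alpha \|x_i - c_{k'}\|^2}} = \frac{e^{w_k^T x_i + b_k}}{\sum_{k'} e^{w_{k'}^T x_i + b_{k'}}}$, with $w_k = 2 \alpha c_k$ and $b_k = -\alpha \|c_k\|^2$
$\bar{a}_k(x_i)$: soft membership assignment
$\{w_k\}$, $\{b_k\}$ and $\{c_k\}$ are sets of trainable parameters
$\alpha$: response attenuation constant (positive)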
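A forward-pass sketch of the soft-assignment aggregation described in the NetVLAD paper, in NumPy. Here the weights and biases are tied to the cluster centres through w_k = 2αc_k and b_k = -α‖c_k‖², which is how the paper derives the softmax form; in the trained network they are independent parameters. The normalisation steps are omitted, and the function name and toy data are assumptions.

```python
import numpy as np

def netvlad_aggregate(x, centres, alpha):
    """x: (N, D) local descriptors; centres: (K, D). Returns the (D, K) matrix V."""
    w = 2.0 * alpha * centres                        # (K, D): w_k = 2 * alpha * c_k
    b = -alpha * np.sum(centres ** 2, axis=1)        # (K,):   b_k = -alpha * ||c_k||^2
    logits = x @ w.T + b                             # (N, K)
    logits -= logits.max(axis=1, keepdims=True)      # for numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                # soft membership of each descriptor
    residuals = x[:, None, :] - centres[None, :, :]  # (N, K, D): x_i - c_k
    # V[j, k] = sum_i a[i, k] * (x_i[j] - c_k[j])
    return np.einsum("nk,nkd->dk", a, residuals)

centres = np.array([[0.0, 0.0], [10.0, 10.0]])
x = np.array([[1.0, 0.0], [0.0, 2.0], [9.0, 10.0]])
V_hard_like = netvlad_aggregate(x, centres, alpha=10.0)
V_soft = netvlad_aggregate(x, centres, alpha=0.001)
```

With a large α the soft assignment approaches the 0-or-1 membership of the original VLAD; with a small α every descriptor contributes to every cluster.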
CNN: Some recent approaches
"NetVLAD: CNN architecture for weakly supervised place recognition" (Arandjelović et al. 2016)
• More flexible than original VLAD thanks to extra trainable parameters
• Given descriptors known to belong to images that should not match, supervised VLAD can learn a better anchor (cluster center) that minimizes the scalar product between their residuals
CNN: Some recent approaches
"Levelling the Playing Field: A Comprehensive Comparison of Visual Place Recognition Approaches under Changing Conditions" (Zaffar et al. 2019)
Berlin Kudamm
Gardens Point
Nordland
Neuroscience of place recognition
"The cognitive map in humans: Spatial navigation and beyond", Russell A. Epstein et al. 2017
Hippocampus (HPC) and Entorhinal cortex (EC)
Stores map-like spatial codes (cognitive maps)
Supports memory during navigation
A cognitive map is an internal neural representation of one's surrounding physical environment
Entorhinal cortex
Neuroscience of place recognition
Parahippocampal Place Area (PPA) and Retrosplenial cortex (RSC)
PPA: perception of landmarks and visuospatial structure of the scene
RSC: cognitive map retrieval
PPA + RSC: place recognition
Landmarks:
• Spatial layout (very important)
• Discrete landmarks: buildings, statues, etc.
• Extended topographical landmarks:
arrangement of buildings, valleys, ridges, etc.
Entorhinal cortex
"Where am I now? Distinct Roles for Parahippocampal and Retrosplenial Cortices in Place Recognition", Russell A. Epstein et al. 2007
Visual Cortex Hierarchy
"Bio-inspired computer vision: Towards a synergistic approach of artificial and biological vision", Medathati et al. 2016
Our approach
Use CNNs to extract semantic (robust) features from images
Store them along with their spatial arrangement
Compare images by simultaneously matching features and locations
VGG16 architecture
"Very Deep Convolutional Networks for Large-Scale Image Recognition", Simonyan and Zisserman 2014
Increased depth (16 layers) compared to AlexNet
Smaller convolutional filters (3x3)
Trained on Places205 for scene recognition
Layer conv4_2 for spatial consistency check
We use conv5_2 for quick retrieval of candidates
SSM-VPR (Semantic and Spatial Matching Visual Place Recognition)
Two-stage system
"Spatio-Semantic ConvNet-based Visual Place Recognition" Camara et al. 2019
STAGE 1: Image Filtering
Input: Query image of a place
Process: Fast search of images similar to query in large database of places
Output: N top-ranked candidates
STAGE 2: Spatial Matching
Input: Query + candidates
Process: Semantic and geometric comparison of query and candidate using CNNs
Output: Recognized place (best match)
SSM-VPR
VGG16 CNN pre-trained on Places205 dataset
Image filtering stage:
• Layer conv5_2
• 14x14x512 feature maps
• 16 sliding 7x7x512 cubes per image
• Store into image filtering database (IFDB)
Spatial matching stage
• Layer conv4_2
• 56x56x512 feature maps
• 729 sliding 3x3x512 cubes per image
• Store into spatial matching database (SMDB)
• Also store locations
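The cube counts above can be reproduced with a sliding window over the feature map. The sketch below assumes a stride of 2 in both directions, which is not stated on the slide but yields 4 x 4 = 16 cubes for the 14x14 map and 27 x 27 = 729 cubes for the 56x56 map. Zero arrays stand in for real conv5_2 and conv4_2 activations, and the function name is illustrative.

```python
import numpy as np

def sliding_cubes(feature_map, size, stride):
    """Flatten every size x size x D cube of an H x W x D feature map into a row vector."""
    h, w, _ = feature_map.shape
    cubes, locations = [], []
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            cubes.append(feature_map[r:r + size, c:c + size, :].ravel())
            locations.append((r, c))     # top-left corner, kept for spatial matching
    return np.stack(cubes), np.array(locations)

# Placeholder activations with the shapes quoted for the two layers.
conv5_2 = np.zeros((14, 14, 512), dtype=np.float32)
conv4_2 = np.zeros((56, 56, 512), dtype=np.float32)

filter_vectors, _ = sliding_cubes(conv5_2, size=7, stride=2)            # stride is an assumption
spatial_vectors, spatial_locations = sliding_cubes(conv4_2, size=3, stride=2)
```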
SSM-VPR Pipeline
Offline stage: Database creation
Places image dataset → Feature extraction → Image feature representation (f1, f2, …, fn) → IFDB and SMDB
Online stage: Place recognition
Query image → Feature extraction → Image feature representation (f1, f2, …, fn)
Image filtering: Feature matching against IFDB (exhaustive search) → Ranked list of candidate images
Spatial matching: comparison against SMDB → Re-ranked list of images → Place recognition = best candidate
SSM-VPR: Image filtering
Ground truth candidate: 0
Ground truth candidate: 89
Ground truth candidate: 25
SSM-VPR: Spatial matching
1. For each location in candidate, find location of closest match in query
2. Set the pair of locations as anchor points
3. Look at the spatial consistency between the locations of matched pairs of vectors
4. Location consistency: check all cells around candidate anchor point
5. Accumulate consistent matches for all locations in candidate
6. Select candidate with largest score
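One possible reading of these steps is sketched below. It is a simplified interpretation under stated assumptions, not the published implementation: query and candidate are square grids of descriptors, matches use Euclidean distance, and a neighbour counts as consistent when its own match lies at the same offset from the query anchor. The function name, the `radius` parameter and the random test data are all hypothetical.

```python
import numpy as np

def spatial_matching_score(query, candidate, radius=1):
    """query, candidate: (G, G, D) grids of descriptors. Returns a consistency count."""
    g = candidate.shape[0]
    q = query.reshape(-1, query.shape[-1])
    c = candidate.reshape(-1, candidate.shape[-1])
    # Step 1: closest query cell for every candidate cell (squared Euclidean distance).
    d2 = (c ** 2).sum(1)[:, None] - 2.0 * c @ q.T + (q ** 2).sum(1)[None, :]
    match = d2.argmin(axis=1)
    match_rc = np.stack(np.unravel_index(match, query.shape[:2]), axis=1).reshape(g, g, 2)
    score = 0
    for r in range(g):
        for col in range(g):
            anchor_q = match_rc[r, col]              # Step 2: anchor pair
            for dr in range(-radius, radius + 1):    # Steps 3-4: cells around the anchor
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, col + dc
                    if (dr or dc) and 0 <= rr < g and 0 <= cc < g:
                        # Consistent if the neighbour's match has the same offset in the query.
                        if np.array_equal(match_rc[rr, cc], anchor_q + (dr, dc)):
                            score += 1               # Step 5: accumulate
    return score

rng = np.random.default_rng(0)
query = rng.standard_normal((6, 6, 8))
same_place = query + 0.01 * rng.standard_normal((6, 6, 8))
other_place = rng.standard_normal((6, 6, 8))
scores = [spatial_matching_score(query, cand) for cand in (same_place, other_place)]
best = int(np.argmax(scores))                        # Step 6: largest score wins
```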
SSM-VPR: Parameter optimization
Berlin Kudamm
Gardens Point
Nordland
Same datasets as in Zaffar et al. 2019
SSM-VPR: Recognition results
Same datasets as in Zaffar et al. 2019
$\mathrm{Recall} = \frac{TP}{TP + FN}$
$\mathrm{Precision} = \frac{TP}{TP + FP}$
TP = true positives
FP = false positives
FN = false negatives
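The two metrics can be computed directly from the counts, as in this minimal sketch. The example counts are made up for illustration and are not results from the talk.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN); 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only: 80 correct matches, 20 wrong matches, 20 missed places.
precision, recall = precision_recall(tp=80, fp=20, fn=20)
```

Sweeping the acceptance threshold on the matching score and recomputing the counts at each value gives the points of a precision-recall curve.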
Gardens Point
SSM-VPR: Recognition results
Kudamm
SSM-VPR: Recognition results
Nordland
SSM-VPR: Teach-and-Replay navigation
Conclusions
Separating recognition into two stages is a highly successful approach
High-level CNN features are very robust to changes
Considering the spatial location of features is key to high-performance recognition
Substantial improvement of the state-of-the-art
Interesting applications in autonomous navigation