Week 42:
Siamese Network: Architecture and
Applications in Visual Object Tracking
Yuanwei Wu
10-21-2016
Outline
• Siamese Architecture
• Siamese Applications in Computer Vision
• Paper review
Visual Object Tracking using Siamese CNN
• Future Work
What does “Siamese” mean?
Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf
Siamese Architecture
Source: Learning Hierarchies of Invariant Features. Yann LeCun. helper.ipam.ucla.edu/publications/gss2012/gss2012_10739.pdf
Siamese Architecture and loss function
Source: Learning Hierarchies of Invariant Features. Yann LeCun. helper.ipam.ucla.edu/publications/gss2012/gss2012_10739.pdf
Siamese Applications in Computer Vision: 1. Signature Verification
Siamese Applications in Computer Vision: 2. Dimensionality Reduction
Siamese Applications in Computer Vision: 3. Learning Image Descriptors (CNN Model)
Siamese Applications in Computer Vision: 4. Face Verification
Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf
@article{bertinetto2016fully,
title={Fully-Convolutional Siamese Networks for Object Tracking},
author={Bertinetto, Luca and Valmadre, Jack and Henriques, Jo{\~a}o F
and Vedaldi, Andrea and Torr, Philip HS},
journal={arXiv preprint arXiv:1606.09549},
year={2016} }
Paper Review:
Fully-Convolutional Siamese Networks for Object Tracking
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Architecture of Siamese CNN
Details of the Architecture of Siamese CNN
Source:
1: Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
2: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Cross-correlation layer
Training: dataset
• ImageNet Video dataset of 2015:
contains ~4000 videos
with ~1 million annotated frames
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Training: preprocessing on the images
• Preprocessing: 2820 videos, exemplar image: 127 x 127, search image: 255 x 255
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Training: recap the steps
• ImageNet Video dataset of 2015: contains ~4000 videos
with ~1 million annotated frames
• Preprocessing:
2820 videos
exemplar image: 127 x 127
search image: 255 x 255
• Training with a standard Stochastic Gradient Descent (SGD) solver using MatConvNet
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
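Training: loss function
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
• Employing a discriminative training approach using positive and negative pairs and adopting the logistic loss:
$\ell(y, v) = \log(1 + \exp(-yv))$
where $v$ is the real-valued score of a single exemplar-candidate pair and $y \in \{+1, -1\}$ is its ground-truth label.
• The loss of a score map is the mean of the individual losses:
$L(y, v) = \frac{1}{|\mathcal{D}|} \sum_{u \in \mathcal{D}} \ell(y[u], v[u])$
where $\mathcal{D}$ is the set of positions $u$ in the score map.
• Applying SGD to find the conv-net parameters $\theta$ using
$\arg\min_{\theta} \; \mathbb{E}_{(z, x, y)} \, L(y, f(z, x; \theta))$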
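The logistic loss and its mean over a score map can be checked numerically with a small sketch. This is an illustration only, not the authors' MatConvNet code: the 17 x 17 map size, the `make_labels` helper and its `radius` parameter are assumptions introduced here to produce positive and negative positions.

```python
import numpy as np

def logistic_loss(y, v):
    """Element-wise logistic loss log(1 + exp(-y * v)), computed stably."""
    return np.logaddexp(0.0, -y * v)

def score_map_loss(labels, scores):
    """Mean of the individual losses over all positions of a score map."""
    return float(np.mean(logistic_loss(labels, scores)))

def make_labels(size, radius):
    """Hypothetical labelling rule: +1 within `radius` cells of the centre, -1 elsewhere."""
    centre = (size - 1) / 2.0
    rows, cols = np.mgrid[0:size, 0:size]
    dist = np.hypot(rows - centre, cols - centre)
    return np.where(dist <= radius, 1.0, -1.0)

labels = make_labels(size=17, radius=2)
# An uninformative map (all scores zero) costs log(2) at every position.
loss_at_zero = score_map_loss(labels, np.zeros((17, 17)))
# Scores that agree strongly with the labels drive the loss towards zero.
loss_confident = score_map_loss(labels, 10.0 * labels)
```

Because the loss depends only on the product of label and score, a confident wrong score is penalised roughly linearly while a confident correct score costs almost nothing.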
Tracking algorithm
• Use a search image centered at the previous position of the target.
• Only search for the object within a region of approximately four times its previous size.
• A cosine window is added to the score map to penalize large displacements.
• The position of the maximum score relative to the center of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame.
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
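The last two steps of the tracking algorithm (cosine window, then peak position times stride) can be sketched as below. This is a minimal illustration under stated assumptions: the function name `estimate_displacement`, the stride of 8, the `window_influence` weight and the score normalisation are choices made here, and the sketch omits any upsampling of the score map.

```python
import numpy as np

def estimate_displacement(score_map, stride=8, window_influence=0.3):
    """Return the (dy, dx) displacement in pixels implied by a square score map."""
    size = score_map.shape[0]
    # Cosine (Hann) window: largest at the centre, zero at the border.
    hann = np.hanning(size)
    window = np.outer(hann, hann)
    window /= window.sum()
    # Normalise the scores so both terms are on a comparable scale.
    scores = score_map - score_map.min()
    total = scores.sum()
    if total > 0:
        scores = scores / total
    # Adding the window penalises peaks that lie far from the centre.
    penalised = (1.0 - window_influence) * scores + window_influence * window
    row, col = np.unravel_index(np.argmax(penalised), penalised.shape)
    centre = (size - 1) / 2.0
    # Offset from the centre, in score-map cells, times the network stride.
    return (row - centre) * stride, (col - centre) * stride

score_map = np.zeros((17, 17))
score_map[10, 6] = 1.0          # a single strong response, 2 cells down and 2 cells left
dy, dx = estimate_displacement(score_map)
```

With a flat score map the window alone decides, so the estimate stays at the centre and the target is assumed not to have moved.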
Experiments: training dataset size
• Accuracy: calculated as the average Intersection-over-Union (IoU) between the predicted and ground-truth bounding boxes
• Robustness: measured as the total number of failures
• Using a larger video dataset could increase the performance even further.
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
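The two measures can be illustrated with a short sketch. The helper names `iou` and `accuracy_and_failures` and the (x, y, width, height) box format are assumptions made here. Counting a failure as a frame with zero overlap is a simplification: a benchmark protocol may also re-initialise the tracker after a failure, which is not modelled.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average IoU over a sequence, and the number of zero-overlap frames."""
    ious = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    accuracy = sum(ious) / len(ious)
    failures = sum(1 for value in ious if value == 0.0)
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

The three frames give IoU values of 1, 1/3 and 0, so the average is 4/9 and one frame counts as a failure.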
Experiments: OTB13 benchmark results
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
Experiments: VOT15 benchmark results
• Estimates the new position of the target object by merely cross-correlating the embeddings of two patches over three scales.
• Achieves real-time performance and state-of-the-art results.
Source: Bertinetto, Luca and Valmadre, Jack and Henriques, João F and Vedaldi, Andrea and Torr, Philip HS,
Fully-Convolutional Siamese Networks for Object Tracking, arXiv preprint, 2016.
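Cross-correlating two embeddings amounts to sliding the exemplar embedding over the search embedding and taking an inner product at each offset. The sketch below shows this for a single scale with plain loops. The embedding sizes (6 x 6 and 22 x 22, giving a 17 x 17 score map), the channel count of 4 and the name `cross_correlate` are illustrative assumptions, and the embeddings are synthetic rather than the output of a trained network.

```python
import numpy as np

def cross_correlate(exemplar_emb, search_emb, bias=0.0):
    """Slide the exemplar embedding over the search embedding (valid mode).

    Both inputs have shape (height, width, channels).
    """
    eh, ew, _ = exemplar_emb.shape
    sh, sw, _ = search_emb.shape
    out = np.empty((sh - eh + 1, sw - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = search_emb[i:i + eh, j:j + ew, :]
            # Inner product between the exemplar and this sub-window.
            out[i, j] = np.sum(window * exemplar_emb) + bias
    return out

rng = np.random.default_rng(0)
exemplar_emb = rng.random((6, 6, 4)) + 0.1    # synthetic, strictly positive features
search_emb = np.zeros((22, 22, 4))
search_emb[5:11, 9:15, :] = exemplar_emb      # plant the target at offset (5, 9)
score_map = cross_correlate(exemplar_emb, search_emb)
peak = np.unravel_index(np.argmax(score_map), score_map.shape)
```

The peak of the score map falls at the offset where the exemplar was planted. Handling several scales would repeat the same operation on rescaled search regions and keep the best response.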
Future work: How to improve the performance?
• By augmenting the online tracking pipeline:
online model updating (e.g. tracking-by-detection)
bounding-box regression (e.g. YOLO, Faster R-CNN)
fine-tuning (e.g. correlation filters + CNN features)
memory (e.g. adding an RNN or LSTM)
Source: Guanghan Ning, Zhi Zhang, Chen Huang, Zhihai He, Xiaobo Ren, Haohong Wang, Spatially Supervised Recurrent Convolutional
Neural Networks for Visual Object Tracking, arXiv preprint, 2016.
Future work: How to improve the performance?
• By introducing new architectures within the Siamese CNN framework, which requires a closer look at the structure of the networks (e.g. regression networks, triplet networks).
Triplet Network
Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf
Future work: How to improve the performance?
• By introducing a new loss function in the Siamese network.
Loss function used in face verification
Source: http://vision.ia.ac.cn/zh/senimar/reports/Siamese-Network-Architecture-and-Applications-in-Computer-Vision.pdf
Thank you!