Z.Li @USTC, Hefei, 2016
Mobile Visual Search: Object Re-Identification Against Large Repositories
Zhu Li, Dept of Computer Science & Electrical Engineering
University of Missouri, Kansas City
Email: [email protected], [email protected]
Web: http://l.web.umkc.edu/lizhu
Outline
• Short Self Intro & Research Overview
• Mobile Visual Search:
  – Object Re-identification
  – Image Classification
• MPEG CDVS Data Set and Test Conditions
• CDVS Query Extraction and Compression Pipeline
  – Key Point Detection and Selection (ALP, CABOX, FS)
  – Global Descriptor: key point aggregation and compression (SCFV, RVD, AKULA)
  – Local Descriptor: key points and coordinates compression
• Indexing via Aggregation
  – Pair-wise Matching
  – Retrieval
  – Indexing
• CDVA Work – Classification/Tagging
  – Device Based Deep Learning Image Tagging
• Summary
University of Missouri, Kansas City
Short Bio: Zhu Li, Associate Professor; Director, MC2 Lab (Media Computing & Communication Lab), http://i.web.umkc.edu/lizhu
Research Interests:
• Visual Object Re-Identification in Large Repositories
• Device Based Deep Learning Audio/Visual Recognition Engine
• Robust Content Identification, Cache/Communication De-Duplication, ICN/CCN
• Mobile Edge Computing: EM + visual signature based location service, meta-data assisted on-demand transcoding, spectrum sensing & device centric networking
Research Motivation & Highlights
Media Computing & Communication Horizon
[Diagram: Devices / Networks / Applications: mmWave 5G, D2D, FD-MIMO, 4k/8k UHD Video, Free Viewpoint TV, SDN/MEC, BIGDATA visual intelligence, Samsung VR/AR]
• Mobile Visual Search: Object Re-Identification
• Mobile Query-by-Capture
• Object Recognition Pipeline
  – KD: Keypoint Detection (box filtering)
  – FS: Feature Selection (affine-transformed 2-way matching)
  – GD/LD: global and local descriptor generation (AKULA, collision-optimized hash)
[Pipeline diagram: KD → FS → descriptor encoding (GD: Global Descriptor, LD: Local Descriptors) → bitstream]
• Compact Deep Neural Networks for Image Tagging
• Automatic Image Tagging: mapping image pixel values to tags
• Device based recognition:
  – Distilling the knowledge into a compact form: a compact CNN model for knowledge distilling
  – BIGDATA training set: >10k images per tag
  – Training set augmentation: affine transforms, simulated lighting changes
  – Pruning filters from the CNN structure
  – Also: indexable fingerprint, speaker ID, small vocabulary ASR
• Subspace Indexing Learning on Grassmann Manifold
• Large Scale Face Identification/Image Understanding
  – Improve the model Degree of Freedom (DoF) to accommodate the variations in the large training data set and large label space
  – A piece-wise linear local approximation of the underlying non-linearity, balancing the DoF with the amount of discriminant information captured
• Subspace Indexing on Grassmann Manifold:
  – Develop a rich set of transforms that better captures local data characteristics, and
  – Develop a hierarchical deep structure for subspaces on the Grassmann manifold
  – Applications: large subject set face recognition, speaker ID, and hierarchical transforms for image coding
• Rate Agnostic Video Identification and Verification
• Rate Agnostic Robust Video Fingerprinting
  – Robust and scalable video fingerprint for duplicate and near-duplicate search and mining applications (funded by a Microsoft Research grant, an HK RGC GRF grant, and Samsung DMC)
  – Playback Verification: differential feature for playback verification (ads enforcement, broadcast monitoring), 200 bps, very lightweight signature
  – Multi-Source Synchronization: synchronize multi-camera video and audio sources, content based approach
  – Content Identification: content naming service for ICN, cache manifest and de-duplication
• Mobile Edge Computing
• Delivery Acceleration
  – Cache assisted multipath delivery and QoE driven congestion control
  – Transmission redundancy elimination via content fingerprinting
• Traffic Engineering
  – Spatial-temporal QoE metrics for fine granular operating points signalling
  – Bottleneck coordination – QoE multiplexing gains (demo)
Outline
• Short Self Intro & Research Overview
• Mobile Visual Search:
  – Object Re-identification
  – Image Classification
  – MPEG CDVS Data Set and Test Conditions
• CDVS Query Extraction and Compression Pipeline
  – Key Point Detection and Selection (ALP, CABOX, FS)
  – Global Descriptor: key point aggregation and compression (SCFV, RVD, AKULA)
  – Local Descriptor: key points and coordinates compression
• Indexing via Aggregation
  – Pair-wise Matching
  – Retrieval
  – Indexing
• CDVA Work – Classification/Tagging
  – Device Based Deep Learning Image Tagging
• Summary
Mobile Visual Search Problem
• CDVS: Object Identification: bridging the real and cyber world
• Image Understanding/Tagging: associate labels with pixels
CDVS Scope
• MPEG CDVS Standardization Scope
  – Define the visual query bit-stream extracted from images
  – Front-end: image feature capture and compression
  – Server back-end: image feature indexing and query processing
• Objectives/Challenges:
  – Real-time: front end real time performance, e.g., 640x480 @ 30 fps
  – Compression: low bit rate over the air, achieving 20x compression w.r.t. sending images, or 10x compression of the raw features
  – Matching Accuracy: >95% accuracy in pair-wise matching (verification) and >90% precision in identification
  – Indexing/Search Efficiency: real time backend response from a large (>100m image) visual repository
  – The SIFT patent
Technology Time Line
MPEG-7 CDVS, 8th FP7 Networked Media Concentration meeting, Brussels, December 13, 2011. (thanks, Yuriy !)
[Timeline graphic, 1998–2012: SIFT algorithm invented; recognition of SIFT in the CV community; SURF; CHoG; major progress in computer vision research; first iPhone; first visual search applications; MPEG-7 core Parts 1–5; MPEG-7 Parts 6–12; MPEG-7 Visual Signatures; MPEG starts work on CDVS; Compact Descriptors for Visual Search]
CDVS Data Set Performance
• Annotated Data Set:
  – Mix of graphics, landmarks, buildings, objects, video clips, and paintings
  – Approx. 32k images
• Distraction Set:
  – Approx. 1m images from various places
• Image Query Size
  – 512 bytes to 16K bytes
  – 4k–8k bytes are the most useful
The (Long) MPEG CDVS Time Line
Meeting / Location / Date | Action | Comments
96th meeting, Geneva, Mar 25, 2011 | Final CfP published | Available databases and evaluation software
97th meeting, Torino, July 18-23, 2011 | Last changes in databases and evaluation software | Dataset: 30k annotated images + 1m distractor images
98th meeting, Geneva, Nov 26 - Dec 2, 2011 | Evaluation of proposals | 11 proposals received; selected TI Lab test model under consideration
99th meeting, San Jose, Feb 10, 2012 | WD | Working Draft 1
100th meeting, Geneva, March 4, 2012 | WD | Working Draft 2
… | … | SIFT patent issue
106th meeting, Geneva, Oct 2013 | CD | Committee Draft
108th meeting, Valencia, Mar 2014 | DIS | Finally…
Outline
• Short Self Intro & Research Overview
• Mobile Visual Search:
  – Object Re-identification
  – Image Classification
  – MPEG CDVS Data Set and Test Conditions
• CDVS Query Extraction and Compression Pipeline
  – Key Point Detection and Selection (ALP, CABOX, FS)
  – Global Descriptor: key point aggregation and compression (SCFV, RVD, AKULA)
  – Local Descriptor: key points and coordinates compression
• Indexing via Aggregation
  – Pair-wise Matching
  – Retrieval
  – Indexing
• CDVA Work – Classification/Tagging
  – Device Based Deep Learning Image Tagging
• Summary
CDVS Pipeline
• CDVS Query Processing Pipeline
• KD – Keypoint Detection: ALP, BFLoG, CABOX
• FS – Feature Selection
• GD – Global Descriptor from key point aggregation: SCFV, RVD, AKULA
• LD – Local Descriptor: SIFT compression, spatial coordinates compression
[Pipeline diagram: KD → FS → descriptor encoding (GD: Global Descriptor, LD: Local Descriptors) → bitstream]
How SIFT Works
• Detection: find extrema in the spatial-scale space
  – The scale space is represented by the LoG filtering output, but approximated in SIFT by DoG
• Key Point Representation: a 128-dim descriptor
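To make the detection step concrete, here is a minimal pure-numpy sketch (an illustration, not the CDVS/TM code) that builds a small DoG stack and keeps points that are strict extrema over their 3x3x3 space-scale neighborhood; the blob image and scale list are toy choices.

```python
import numpy as np

def gaussian_kernel(sigma):
    # 1-D Gaussian kernel, normalized to sum to 1
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    # separable Gaussian filtering with edge padding
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    blur_1d = lambda v: np.convolve(np.pad(v, pad, mode="edge"), k, mode="valid")
    out = np.apply_along_axis(blur_1d, 1, img)   # filter rows
    return np.apply_along_axis(blur_1d, 0, out)  # then columns

def dog_extrema(img, sigmas, contrast=1e-3):
    # DoG stack: differences of successively blurred images
    blurred = [gaussian_blur(img, s) for s in sigmas]
    dog = np.stack([b1 - b0 for b0, b1 in zip(blurred, blurred[1:])])
    kps = []
    S, H, W = dog.shape
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                v = dog[s, y, x]
                if abs(v) < contrast:
                    continue  # reject low-contrast points
                nb = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2].copy()
                nb[1, 1, 1] = np.nan  # exclude the point itself
                if v > np.nanmax(nb) or v < np.nanmin(nb):
                    kps.append((x, y, sigmas[s]))
    return kps

# a single bright blob should yield an extremum near its center (15, 15)
img = np.zeros((32, 32))
img[12:19, 12:19] = 1.0
kps = dog_extrema(img, [1.0, 2.0, 3.5, 8.0])
```

The strict comparison against all 26 neighbors is the same decision rule SIFT uses on its DoG pyramid, just without octave subsampling.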
Alternative Keypoint (SIFT) Detection
• Main motivation
  – Find invariant and repeatable features to identify objects; many options (SURF, BRISK, etc.), but SIFT is still the best performing
  – Circumvent the SIFT patent, which was sold to an unknown third party
    • The SIFT patent's main claim: DoG filtering to reconstruct the scale space
  – Extraction speed: can we be faster than DoG?
• Main Proposals:
  – Telecom Italia: ALP – scale space SIFT detector (adopted!)
  – Samsung Research: CABOX – integral image domain fast box filtering (incomplete results)
  – Beijing University/ST Micro/Huawei: frequency domain Gaussian filtering (deemed not departing far enough from the SIFT patent)
m31369: ALP-Scale Space Key point Detector
• Telecom Italia: Gianluca Francini, Skjalg Lepsoy, Massimo Balestri
• Key idea: model the scale space response as a polynomial function of σ, and estimate its coefficients from LoG filtering at a few fixed scales:
h(x, y, σ) = σ² (∂²/∂x² + ∂²/∂y²) g(x, y, σ) * I(x, y)   (scale-normalized LoG response)
h(x, y, σ) ≈ Σ_{k=1}^{4} a_k(σ) · h(x, y, σ_k)   (the LoG kernel at any scale expressed as a linear combination of 4 fixed-scale kernels)
p(σ) ≈ a σ³ + b σ² + c σ + d   (the per-pixel scale response modeled as a cubic polynomial)
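A small numpy sketch of the cubic-fit idea (illustrative, not Telecom Italia's implementation): given LoG responses at four fixed scales, fit the cubic p(σ) and return its stationary point inside the scale range.

```python
import numpy as np

def alp_scale_extremum(sigmas, responses):
    # exact cubic interpolation through the 4 (sigma, response) pairs
    coeffs = np.polyfit(sigmas, responses, 3)
    # stationary points of p(sigma): roots of the derivative polynomial
    crit = np.roots(np.polyder(coeffs))
    crit = crit[np.isreal(crit)].real
    crit = crit[(crit > sigmas[0]) & (crit < sigmas[-1])]
    if len(crit) == 0:
        return None  # monotone response: no scale-space extremum here
    vals = np.polyval(coeffs, crit)
    return crit[np.argmax(np.abs(vals))]  # strongest stationary point

sigmas = np.array([1.0, 1.6, 2.56, 4.1])
responses = -(sigmas - 2.2) ** 2 + 3.0   # synthetic response peaking at sigma = 2.2
s_star = alp_scale_extremum(sigmas, responses)
```

Because the cubic interpolates the four samples exactly, a response that truly peaks inside the range is recovered without evaluating any intermediate scales.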
ALP – Scale Space Response
• Express the scale response at (x, y), as a 3rd order polynomial
• The polynomial coefficients are obtained by filtering, where L_k is the LoG-filtered image at the pre-fixed scale σ_k; the linear combination (l.c.) weights a_k(σ) map the four fixed-scale responses L_k(x, y) to the response at any σ.
ALP Scale Space Extrema Detection
• ALP fits the scale space response with polynomials:
  – Scale-space extremum: the fitted cubic p*(σ) at (x₀, y₀) has a clear interior extremum within the scale range
  – No extremum (saddle point): the fitted cubic is flat or monotone across the scale range
Displacement refinement
• 2nd-order scale response refinement as a function of the displacement (u, v):
σ*: h(x − u, y − v, σ) ≈ A(x, y, σ) u² + B(x, y, σ) u v + C(x, y, σ) v² + D(x, y, σ) u + E(x, y, σ) v + F(x, y, σ)
ALP Performance
• Repeatability vs VL_FEAT SIFT:
Box Filter for SIFT Detection
• Approximation of the Difference of Gaussian filter with a cascade of box filters
  – Box filtering is performed efficiently using integral images
• The box filter widths and heights are computed using various strategies:
  – Boosting design – a learning based SIFT detector, too ambitious, not really working
  – LS-LASSO optimization
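The integral-image trick behind the "efficient box filtering" bullet, as a small numpy sketch (function names are illustrative): any axis-aligned box sum costs four table lookups regardless of box size.

```python
import numpy as np

def integral_image(img):
    # 2-D running cumulative sum, padded with a zero row/column so that
    # any box sum becomes a 4-tap expression with no edge cases
    ii = img.cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, y0, x0, h, w):
    # sum of img[y0:y0+h, x0:x0+w] from four integral-image taps
    return ii[y0 + h, x0 + w] - ii[y0, x0 + w] - ii[y0 + h, x0] + ii[y0, x0]

img = np.arange(30, dtype=float).reshape(5, 6)
ii = integral_image(img)
total = box_sum(ii, 0, 0, 5, 6)   # whole-image sum
```

This constant cost per box, independent of filter size, is what makes a cascade of boxes competitive with repeated Gaussian convolutions.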
The box coefficients are found by an l1-regularized least squares fit:
  min_h ‖g − B h‖₂² + λ ‖h‖₁
(g is the Gaussian filter to be approximated, B is the dictionary of box filters, h is the vector of filter coefficients)
[Figure: a Gaussian filter and its box-filter approximation]
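A 1-D sketch of the LS-LASSO fit (illustrative dictionary and λ, solved with a hand-rolled ISTA loop rather than the proponents' solver): approximate a Gaussian g with a sparse combination of centered boxes.

```python
import numpy as np

def ista_lasso(B, g, lam=1e-3, iters=2000):
    # minimize 0.5*||B h - g||^2 + lam*||h||_1 by iterative soft thresholding
    h = np.zeros(B.shape[1])
    step = 1.0 / np.linalg.norm(B, 2) ** 2   # 1/L, L = (largest singular value)^2
    for _ in range(iters):
        z = h - step * (B.T @ (B @ h - g))   # gradient step on the LS term
        h = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # shrink
    return h

x = np.arange(-15, 16)
g = np.exp(-x ** 2 / (2 * 3.0 ** 2))
g /= g.sum()                                  # Gaussian to approximate
# dictionary of centered, unit-sum boxes of half-widths 0..15
B = np.stack([(np.abs(x) <= w) / (2 * w + 1) for w in range(16)], axis=1)
h = ista_lasso(B, g)
err = np.linalg.norm(B @ h - g) / np.linalg.norm(g)
```

The l1 penalty trades a small approximation error for fewer active boxes, which is exactly what keeps the runtime cascade short.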
CABOX – Cascade of Box Filters (m30446)
Approximate DoG/LoG by a cascade of box filters that can offer early termination:
• Very fast integral image domain box filtering
• The box filters are found by solving the l1-regularized fit above: a sparse combination of box filters, via LASSO
CABOX Approximation Results
The choice of dictionary determines not only the quality of the approximation but also the number of boxes required.
CABOX Detection Results
More examples of keypoint detection using box filters.
Overlap: 85% Overlap: 88%
• Algorithmically the fastest SIFT detector amongst the CE1 contributions
• Ref: V. Fragoso, G. Srivastava, A. Nagar, Z. Li, K. Park, and M. Turk, "Cascade of Box (CABOX) Filters for Optimal Scale Space Approximation," Proc. of the 4th IEEE Int'l Workshop on Mobile Vision, Columbus, USA, 2014
Parallel Scale Space Box Filtering
• Error Propagation in Serial Scale Space Modeling
  – As Gaussian filters are replaced with box filters in each stage, the errors introduced propagate, so S3 is noisier than S1 when compared with true Gaussian filtering
• Proposed solution
  – Replace serial CABOX filtering with parallel CABOX filtering:
Serial: S₁ = I₀ * B_{σ₁},  S₂ = I₀ * B_{σ₁} * B_{σ₂},  S₃ = I₀ * B_{σ₁} * B_{σ₂} * B_{σ₃}
Parallel: S₂ = I₀ * B_{σ₂′} with σ₂′² = σ₁² + σ₂²,  S₃ = I₀ * B_{σ₃″} with σ₃″² = σ₂′² + σ₃²
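The parallel pipeline rests on Gaussian variance additivity: cascading std-s1 and std-s2 filters equals a single filter of std sqrt(s1² + s2²), so each level can be produced directly from I0 instead of from the previous (error-bearing) level. A quick 1-D numpy check with illustrative values:

```python
import numpy as np

def gauss1d(sigma, radius):
    # sampled, unit-sum 1-D Gaussian kernel
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

s1, s2 = 1.6, 2.0
serial = np.convolve(gauss1d(s1, 30), gauss1d(s2, 30))   # two-stage cascade
direct = gauss1d(np.sqrt(s1 ** 2 + s2 ** 2), 60)         # single equivalent filter
err = np.abs(serial - direct).max()   # both have support radius 60
```

With true Gaussians the two routes agree to numerical precision; the parallel CABOX design exploits this so that box-approximation error no longer accumulates stage by stage.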
Parallel CABOX
Sequential Pipeline
Parallel Pipeline
Filters used in the parallel pipeline are computed using the variance composition formula above.
Parallel CABOX approximation errors
[Figure: visualizations of filters used in Parallel CABOX, at sigma = 3.9058 and sigma = 4.9803]
Parallel CABOX
Of those features detected with Parallel CABOX (red circles), 72% are also detected by VLFeat (green circles) while the remaining 28% correspond to new features from CABOX.
Result Summary
• Plug-n-Play Results:
Table 1: Averaged over all rates (w.r.t. TM 7.0 anchor)
  TPR: Serial CABOX −1.98%, Parallel CABOX −12%
  FPR: Serial CABOX +0.04%, Parallel CABOX −0.02%
  Loc. Accuracy: Serial CABOX −1.24%, Parallel CABOX −3.27%
Table 2: Averaged over 4k, 8k, 16k (w.r.t. TM 7.0 anchor)
  TPR: Serial CABOX −0.79%, Parallel CABOX −8.73%
  FPR: Serial CABOX +0.1%, Parallel CABOX +0.01%
  Loc. Accuracy: Serial CABOX −0.84%, Parallel CABOX −2.21%
Results Summary - Complexity
• Fastest in CE-2:
Feature Selection – Repeatability Modeling
• Why do feature selection?
  – On average 1000+ SIFTs are extracted from a VGA sized image; we need to reduce the number of SIFTs actually sent
  – Not all SIFTs are created equal in repeatability for image matching
• Model the repeatability as a probability function [Lepsøy, S., Francini, G., Cordara, G., & Gusmao, P. P. (2011). Statistical modelling of outliers for fast visual search. IEEE VCIDS 2011.] of the SIFT's scale, orientation, distance to the center, peak strength, etc.:
  – Use self-matching (m29359) to improve the robustness of the offline repeatability stats
r(σ, θ, d, D*, p, …) ≈ f₁(σ) · f₂(θ) · f₃(d) · f₄(D*) · f₅(p) · f₆(·)
(a separable product of per-statistic terms f_i, each estimated offline)
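A toy numpy sketch of the selection step (the pmf tables and bin edges below are stand-ins, not the trained CDVS statistics): score each keypoint by a product of per-statistic table lookups, then keep the top-k.

```python
import numpy as np

def select_features(stats, tables, bin_edges, k):
    # stats: (n, m) per-keypoint statistics (scale, orientation, distance, ...)
    n, m = stats.shape
    scores = np.ones(n)
    for j in range(m):
        idx = np.digitize(stats[:, j], bin_edges[j]) - 1
        idx = np.clip(idx, 0, len(tables[j]) - 1)
        scores *= tables[j][idx]    # r = f1(.) * f2(.) * ... per the model
    return np.argsort(-scores)[:k], scores

rng = np.random.default_rng(0)
stats = rng.random((100, 2))                    # two toy statistics in [0, 1)
edges = [np.linspace(0, 1, 5)] * 2              # 4 bins per statistic
tables = [np.array([0.1, 0.3, 0.6, 0.9])] * 2   # toy relevance lookup tables
keep, scores = select_features(stats, tables, edges, k=20)
```

The separable product form makes selection a handful of table lookups per keypoint, cheap enough for the front-end budget.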
[Figure: self-matching via random out-of-plane rotation between im0 and im1; histograms of SIFT match distances im0 → im1 and im1 → im0; learned repeatability pmf f(s)]
Feature Selection
• Illustration of FS via offline repeatability PMF
SIFT peak strength pmf
SIFT scale pmf
Combined scale/peak strength pmf
Global Descriptor
• Why do we need a global descriptor?
  – A key point based query representation is not stateless; it has a structure, i.e., SIFTs and their positions. This is not good for retrieval against a large database: complexity O(N)
  – We need a "coarser" representation of the information contained in the image, aggregating local features for indexing/hashing purposes
  – Landmark work: the Fisher Vector, the best performing solution in the ImageNet challenge before the CNN deep learning solutions: [Perronnin10] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In Proc. ECCV, 2010.
• CDVS Global Aggregation Works
  – m28061: Beijing University SCFV: for retrieval/identification
  – m31491: Samsung AKULA: for matching/verification
  – m31426: Univ of Surrey/VisualAtom RVD: similar to SCFV
Global Descriptor – SCFV (m28061)
• Beijing University SCFV – Scalable Compressed Fisher Vector
  – PCA to bring SIFT down from 128 to 32 dimensions
  – Train a GMM of 128 components in the 32-dim space, with parameters {u_i, σ_i, w_i}
  – Aggregate m = 300 SIFTs with the GMM via the Fisher Vector, 1st and 2nd order, where γ_t(i) is the probability of SIFT x_t being generated by GMM component i:
γ_t(i) = w_i p_i(x_t | λ) / Σ_{j=1}^{128} w_j p_j(x_t | λ)
𝑔_{u_i}^X = ∂ℒ(X|λ)/∂u_i = 1/(300 √w_i) Σ_{t=1}^{300} γ_t(i) · (x_t − u_i)/σ_i
𝑔_{σ_i}^X = ∂ℒ(X|λ)/∂σ_i = 1/(300 √(2 w_i)) Σ_{t=1}^{300} γ_t(i) · [((x_t − u_i)/σ_i)² − 1]
The resulting gradients are binarized by sign into the SCFV bit string.
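A pure-numpy sketch of the aggregation (random stand-in GMM parameters, 1st order only; the real SCFV uses the trained 128-component model and adds the 2nd-order part): soft-assign descriptors, accumulate the per-component gradients, binarize by sign.

```python
import numpy as np

def fisher_vector_1st(X, w, mu, sigma):
    # X: (n, d) descriptors; w: (K,) weights; mu, sigma: (K, d) diagonal GMM
    n, d = X.shape
    K = len(w)
    # per-component diagonal-Gaussian log-likelihoods, shape (K, n)
    ll = np.stack([
        -0.5 * np.sum(((X - mu[i]) / sigma[i]) ** 2
                      + np.log(2 * np.pi * sigma[i] ** 2), axis=1)
        for i in range(K)])
    ll += np.log(w)[:, None]
    gamma = np.exp(ll - ll.max(axis=0))
    gamma /= gamma.sum(axis=0)        # posteriors gamma_t(i), columns sum to 1
    g = np.stack([
        (gamma[i][:, None] * (X - mu[i]) / sigma[i]).sum(axis=0)
        / (n * np.sqrt(w[i]))
        for i in range(K)])           # (K, d) 1st-order Fisher gradients
    return g

rng = np.random.default_rng(1)
K, d, n = 8, 4, 300
w = np.full(K, 1.0 / K)
mu = rng.normal(size=(K, d))
sigma = np.ones((K, d))
X = rng.normal(size=(n, d))
g = fisher_vector_1st(X, w, mu, sigma)
bits = (g > 0).astype(np.uint8)       # sign binarization, SCFV-style
```

The log-sum-exp trick in the posterior keeps the soft assignments numerically stable even when components are far from a descriptor.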
SCFV Distance Function
• The SCFV has 32×128 bits for the 1st-order Fisher Vector, and an additional 32×128 bits for the 2nd-order FV
• Not all GMM components are active, so a 128-bit flag [b₁, b₂, …, b₁₂₈] is also introduced to indicate whether each is active. Rationale: if not many SIFTs are associated with a component, then the bits it generates are most likely noise
• Distance metric:
• A lot of painful work on the GMM component turn-on logic optimization was needed to reach very high performance
• Very fast due to binary ops; can short-list a 1m image database within 1 sec on a desktop computer
s_{q,r} = [ Σ_{i=1}^{128} b_i^q b_i^r w_i(b_i^q, b_i^r) · (32 − 2 · Ha(u_i^q, u_i^r)) ] / [ 32 · (Σ_i b_i^q)^{1/2} · (Σ_i b_i^r)^{1/2} ]
(Ha(·,·) is the Hamming distance between the 32-bit sub-vectors of component i; only components active in both images contribute)
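A minimal sketch of the gated binary scoring (uniform per-component weights here, whereas the TM uses tuned weights w_i): only components active in both images contribute, each via its 32-bit Hamming similarity, normalized by the active-component counts.

```python
def hamming(a, b):
    # Hamming distance between two equal-width bit patterns
    return bin(a ^ b).count("1")

def scfv_score(flags_q, comps_q, flags_r, comps_r, bits=32):
    num, nq, nr = 0.0, 0, 0
    for bq, br, uq, ur in zip(flags_q, flags_r, comps_q, comps_r):
        nq += bq
        nr += br
        if bq and br:                 # component active in both images
            num += bits - 2 * hamming(uq, ur)
    if nq == 0 or nr == 0:
        return 0.0
    return num / (bits * nq ** 0.5 * nr ** 0.5)

q_flags = [1, 1, 0, 1]
q_comps = [0xDEADBEEF, 0x0, 0x0, 0xFFFFFFFF]
score_self = scfv_score(q_flags, q_comps, q_flags, q_comps)
```

XOR plus popcount is what makes the shortlisting fast: scoring is branch-light integer arithmetic, no floating point until the final normalization.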
SCFV Performance
• Matching/Verification (TPR @ 1% FPR)
• Retrieval/Identification (mAP)
AKULA – Adaptive KLUster Aggregation
• Aggregate by VQ over the selected SIFTs, described by the cluster centroids and their weights; very compact
• Not indexable
SIFT detection & selection K-means AKULA description
AKULA Distance Metric
• AKULA descriptor: cluster centroids + SIFT counts
  A₁ = {y_c1¹, y_c1², …, y_c1ᵏ ; p_c1¹, p_c1², …, p_c1ᵏ}
  A₂ = {y_c2¹, y_c2², …, y_c2ᵏ ; p_c2¹, p_c2², …, p_c2ᵏ}
• Distance metric (approx. tangential space distance): min centroid distance, weighted by SIFT count
  d(A₁, A₂) = (1/k) Σ_{j=1}^{k} d¹_min(j) · w¹_min(j) + (1/k) Σ_{j=1}^{k} d²_min(j) · w²_min(j)
  where d¹_min(j) = min_m d(y_c1ʲ, y_c2ᵐ), m* = arg min_m d(y_c1ʲ, y_c2ᵐ), w¹_min(j) = p_c2^{m*}, and d²_min, w²_min are defined symmetrically from A₂ to A₁.
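A numpy sketch of that symmetrized nearest-centroid distance (an illustration of the formulation above, not the Samsung implementation):

```python
import numpy as np

def akula_distance(Y1, p1, Y2, p2):
    # Y*: (k, d) cluster centroids; p*: (k,) SIFT counts per cluster
    D = np.linalg.norm(Y1[:, None, :] - Y2[None, :, :], axis=2)   # (k, k)
    j2 = D.argmin(axis=1)    # nearest A2 centroid for each A1 centroid
    j1 = D.argmin(axis=0)    # nearest A1 centroid for each A2 centroid
    k = len(Y1)
    d12 = (D[np.arange(k), j2] * p2[j2]).sum() / k                # A1 -> A2
    d21 = (D[j1, np.arange(len(Y2))] * p1[j1]).sum() / len(Y2)    # A2 -> A1
    return d12 + d21

Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
p = np.array([10.0, 5.0, 5.0])
same = akula_distance(Y, p, Y, p)            # identical descriptors
shifted = akula_distance(Y, p, Y + 0.3, p)   # slightly moved centroids
```

Summing both directions keeps the score from being fooled when one descriptor's clusters happen to sit near a subset of the other's.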
AKULA Performance
• Very significant matching improvement over Fisher Vector
AKULA Localization
• Quite some improvement in localization accuracy: 2.7%
AKULA: A. Nagar, Z. Li, G. Srivastava, K.Park: AKULA - Adaptive Cluster Aggregation for Visual Search. IEEE DCC 2014: 13-22
Local Descriptor Compression
• SIFT Descriptor Compression
  – VisualAtoms/Univ of Surrey: a handcrafted transform/quantization scheme + Huffman coding; low memory cost, slightly lower performance (compared to PVQ); adopted
  – Only the binary form is received; the SIFT cannot be recovered
[Figure: SIFT gradient histogram bins h0–h7]
Each transformed element v_j^i is quantized against learned per-element thresholds: one code if v_j^i < QL_j^i, another if QL_j^i ≤ v_j^i < QH_j^i, and a third if v_j^i ≥ QH_j^i.
SIFT Compression via Laplacian Embedding
• Motivation:
  – Not to preserve information for the reconstruction of SIFT,
  – but instead to preserve pair-wise SIFT matching
• Formulation:
  – Find the SIFT projection via a Laplacian embedding that keeps matching pairs close after projection
• Details: SI on mobile visual search
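A numpy sketch of a Laplacian-embedding projection under synthetic pair labels (an illustration of the objective, not this contribution's exact formulation): minimize wᵀX L Xᵀw under the degree normalization wᵀX D Xᵀw = 1, solved as a generalized eigenproblem.

```python
import numpy as np

def laplacian_projection(X, pairs, out_dim):
    # X: (d, n) descriptors; pairs: (i, j) matching columns of X
    n = X.shape[1]
    A = np.zeros((n, n))
    for i, j in pairs:
        A[i, j] = A[j, i] = 1.0          # affinity from matching pairs
    Dg = np.diag(A.sum(axis=1))
    L = Dg - A                           # graph Laplacian
    M = X @ L @ X.T
    C = X @ Dg @ X.T + 1e-6 * np.eye(X.shape[0])
    # whiten by C^{-1/2}, then take the smallest eigenvectors of the objective
    ev, U = np.linalg.eigh(C)
    Cw = U @ np.diag(ev ** -0.5) @ U.T
    _, V = np.linalg.eigh(Cw @ M @ Cw)   # eigh returns ascending eigenvalues
    return Cw @ V[:, :out_dim]

rng = np.random.default_rng(2)
base = rng.normal(size=(8, 20))
noise = 0.05 * rng.normal(size=(8, 20))
X = np.concatenate([base, base + noise], axis=1)   # column i matches i+20
pairs = [(i, i + 20) for i in range(20)]
W = laplacian_projection(X, pairs, out_dim=4)
Z = W.T @ X
d_match = np.mean([np.linalg.norm(Z[:, i] - Z[:, i + 20]) for i in range(20)])
d_rand = np.mean([np.linalg.norm(Z[:, i] - Z[:, (i + 7) % 20 + 20]) for i in range(20)])
```

Unlike a PCA-style reconstruction objective, the eigenvectors here are chosen to shrink matched-pair distances, which is the property the matcher actually consumes.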
Outline
• The Problem and CDVS Standardization Scope
• CDVS Query Extraction and Compression Pipeline
  – Key Point Detection and Selection (ALP, CABOX, FS)
– Global Descriptor: Key points aggregation and compression (SCFV, RVD, AKULA)
– Local Descriptor: Key points and coordinates compression
• CDVS Query Processing
  – Pair-wise Matching
– Retrieval
– Indexing
• CDVA Work
  – Handling video input
– Image/Video Understanding
• Summary
Pair Wise Matching
• Diagram of Image Matching
  – First, the local features are matched
  – If a certain number of matched SIFT pairs is identified, a geometric verification called DISTRAT is performed to check the consistency of the matching points via a distance ratio test
  – For unsure image pairs, the global descriptor distance is computed and a threshold is applied to decide match or non-match
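A toy consistency check in the spirit of the distance-ratio idea (illustrative, not the standardized DISTRAT statistic): under a similarity transform, the log of |q_i − q_j| / |r_i − r_j| is the same constant for all inlier pairs, so its spread separates consistent from inconsistent match sets.

```python
import numpy as np

def log_distance_ratios(Q, R):
    # Q, R: (n, 2) matched keypoint coordinates in query and reference
    ratios = []
    n = len(Q)
    for i in range(n):
        for j in range(i + 1, n):
            dq = np.linalg.norm(Q[i] - Q[j])
            dr = np.linalg.norm(R[i] - R[j])
            if dq > 1e-9 and dr > 1e-9:
                ratios.append(np.log(dq / dr))
    return np.array(ratios)

def is_consistent(Q, R, tol=0.05):
    # concentrated log-ratios => geometrically consistent match set
    return log_distance_ratios(Q, R).std() < tol

rng = np.random.default_rng(3)
R = rng.random((30, 2)) * 100
theta = 0.4
Rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
Q_good = 1.7 * R @ Rot.T + np.array([5.0, -3.0])   # similarity transform
Q_bad = rng.random((30, 2)) * 100                  # unrelated "matches"
```

The appeal of ratio-based checks over full RANSAC is that no transform has to be estimated: a single histogram statistic over the matched pairs is enough to accept or reject.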
Matching Performance (@ 1% FPR)
• Image Matching Accuracy:
– Mix of graphics, landmarks, buildings, objects, video clips, and paintings.
• Image Identification Accuracy:
  – For graphics (CD/book covers, logos, papers) and paintings, the performance is in the 90% range
  – For objects of mixed variety, 78% on average
  – For buildings/landmarks, the performance is not reflective of the true potential, as the current data set has some annotation errors
CDVS Retrieval
• Retrieval Pipeline
  – A short list is generated by GD based k-NN search via:
  – Then, for the short list of m candidates, perform m local descriptor based matchings and rank the matching scores
s(q, r) = [ Σ_i b_i^q b_i^r · W1_i(b_i^q, b_i^r) · W2_i · (D − 2 · Ha(u_i^q, u_i^r)) ] / [ (Σ_i b_i^q)^{0.5} · (Σ_i b_i^r)^{0.5} ]
(W1, W2 are component weights; D is the number of bits per component sub-vector)
CDVS Retrieval Performance
• Retrieval Simulation Set Up
  – Approx. 17k annotated images mixed with a 1m+ distraction image set
  – Short listing: retrieve the 500 closest matches by GD, then do pair-wise matching and ranking
  – Data sets
[Table: retrieval mAP per data set]
MBIT (Multi-Block Index Table) Indexing
• The GD is partitioned into blocks of 16 bits, and an inverted list is built
• Short listing is done by weighted scoring on the block-wise Hamming distance
Algorithm: MBIT Searching
Input: query B_q = {u_i^q}, i = 1…1024; MBIT tables T = {T_i}, i = 1…1024; speedup ratio r; difference bits D.
Output: the shortlist {B_j}, j = 1…K, K = 500.
 1: Initialize s(q, j) = 0 for every database image j.
 2: for i = 1 to 1024 do
 3:   if the Gaussian component that block i belongs to is not selected in B_q, continue
 4:   for d = 0 to D do
 5:     enumerate the binary vectors {h_m} with d-bit differences from u_i^q
 6:     for each image j in the buckets T_i(h_m), update #_{d,j} = #_{d,j} + 1
 7:   end for
 8: end for
 9: for each image j, update s(q, j) = Σ_{d=0}^{D} #_{d,j}
10: Sort the image list by voting score in descending order.
11: Add the descriptors of the top N/r images in the ordered list into a candidate subset.
12: Run an exhaustive search within the subset, sort by Hamming distance, and return the first K = 500 images.
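A toy index in the MBIT spirit (4 blocks of 16 bits here vs. 64 blocks for the real 1024-bit GD, and only exact-bucket hits are probed, while the algorithm above also enumerates near buckets): per-block hash tables plus voting.

```python
from collections import defaultdict

BLOCKS, BITS = 4, 16   # toy: 64-bit descriptors split into 4 blocks of 16 bits

def to_blocks(desc):
    return [(desc >> (BITS * b)) & 0xFFFF for b in range(BLOCKS)]

def build_index(db):
    tables = [defaultdict(list) for _ in range(BLOCKS)]
    for img_id, desc in db.items():
        for b, key in enumerate(to_blocks(desc)):
            tables[b][key].append(img_id)     # inverted list per block value
    return tables

def shortlist(tables, query, k):
    votes = defaultdict(int)
    for b, key in enumerate(to_blocks(query)):
        for img_id in tables[b][key]:         # images sharing this exact block
            votes[img_id] += 1
    return sorted(votes, key=votes.get, reverse=True)[:k]

db = {"match": 0x123456789ABCDEF0,
      "near":  0x123456789ABC0000,   # differs only in the lowest block
      "far":   0x0F0F0F0F0F0F0F0F}
tables = build_index(db)
top = shortlist(tables, 0x123456789ABCDEF0, k=2)
```

Because each block is only 16 bits, every lookup touches a small bucket, and the vote count is a lower bound on block agreement that ranks candidates before the exhaustive Hamming pass.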
(m29361) Bit Mask/Collision Optimized Indexing
• Select the 6-bit segments that are most discriminating for short listing, allowing permutation of bits
• Short listing by a weighted segment Hamming distance that also reflects the segment entropy
• Full paper: X. Xin, A. Nagar, G. Srivastava, Z. Li, F. Fernandes, A. Katsaggelos, "Large Visual Repository Search with Hash Collision Design Optimization," IEEE MultiMedia 20(2): 62-71 (2013)
Outline
• The Problem and CDVS Standardization Scope
• CDVS Query Extraction and Compression Pipeline
  – Key Point Detection and Selection (ALP, CABOX, FS)
– Global Descriptor: Key points aggregation and compression (SCFV, RVD, AKULA)
– Local Descriptor: Key points and coordinates compression
• CDVS Query Processing
  – Pair-wise Matching
– Retrieval
– Indexing
• CDVA Work
  – Handling video input
– Image/Video Understanding
• Summary
Automotive & AR Use Case
• Extend to object identification / event detection for video input:
  – How do we deal with the vastly increased data rate?
    • New spatial-temporal interest points?
    • Key frame based CDVS processing?
  – How do we explore the temporal dimension?
    • Event detection, content classification (video archiving)
Image Understanding/Tagging
• CDVS is based on handcrafted features, i.e., SIFT and SIFT aggregation
• The latest work in deep learning points to new potential for CNNs to uncover new structure and knowledge from pixels, associating them not only with object identities but also with image labels
Summary
• MPEG CDVS offers state-of-the-art performance in visual object identification accuracy, speed, and query compression
• Amendment work:
  – A recoverable SIFT compression scheme; currently SIFT cannot be recovered from the bit stream
  – 3D key points, given the wide adoption of RGB+Depth sensors
  – Non-rigid body object identification
• CDVA work, still refining the problem definition; main use cases:
  – Object identification in video
  – Event detection, content classification
  – Image understanding (vs. object identification)
References
• Key References
  – Test Model 11: ISO/IEC JTC1/SC29/WG11/N14393
  – SoDIS: ISO/IEC DIS 15938-13, Information technology — Multimedia content description interface — Part 13: Compact descriptors for visual search
  – Signal Processing: Image Communication, special issue on visual search and augmented reality, vol. 28(4), April 2013. Eds. Giovanni Cordara, Miroslaw Bober, Yuriy A. Reznik
  – ALP: m31369, CDVS: Telecom Italia's response to CE1 – Interest point detection
  – CABOX: V. Fragoso, G. Srivastava, A. Nagar, Z. Li, K. Park, and M. Turk, "Cascade of Box (CABOX) Filters for Optimal Scale Space Approximation," Proc. of the 4th IEEE Int'l Workshop on Mobile Vision, Columbus, USA, 2014
  – FS: S. Lepsøy, G. Francini, G. Cordara, and P. P. Gusmao, "Statistical modelling of outliers for fast visual search," IEEE Workshop on Visual Content Identification and Search (VCIDS 2011), Barcelona, Spain
  – SCFV: L.-Y. Duan, J. Lin, J. Chen, T. Huang, W. Gao, "Compact Descriptors for Visual Search," IEEE MultiMedia 21(3): 30-40 (2014)
  – RVD: m31426, Improving performance and usability of CDVS TM7 with a Robust Visual Descriptor (RVD)
  – AKULA: A. Nagar, Z. Li, G. Srivastava, K. Park, "AKULA – Adaptive Cluster Aggregation for Visual Search," IEEE DCC 2014: 13-22
• Q & A