[IEEE 2013 Visual Communications and Image Processing (VCIP) - Kuching, Malaysia...
TREE-BASED SHAPE DESCRIPTOR FOR SCALABLE LOGO DETECTION
Chengde Wan1, Zhicheng Zhao1,2, Xin Guo1, Anni Cai1
1School of Information and Communication Engineering, 2Beijing Key Laboratory of Network System and Network Culture
Beijing University of Posts and Telecommunications, Beijing, China
ABSTRACT
Detecting logos in real-world images is a highly challenging
task due to viewpoint and lighting changes and the real-time
requirements that arise in practice. Conventional object
detection methods, e.g., the part-based model, may incur
prohibitive computational cost if applied directly to this task.
A promising alternative, a triangle structural descriptor
combined with a matching strategy, offers an efficient way
of recognizing logos. However, that descriptor fails under the
rotations of logo images that often occur when the viewpoint
changes. To overcome this shortcoming, we propose a new
Tree-based Shape Descriptor (TSD) in this paper, which is
strictly invariant to affine transformations in real-world
images. The core of the proposed descriptor is to encode the
shape of a logo by depicting both the appearance and the spatial
information of four local key-points. In the training stage, an
efficient algorithm is introduced to mine a discriminative subset
of four-tuples from all possible key-point combinations. Moreover,
a root indexing scheme is designed to enable the detection of
multiple logos simultaneously. Extensive experiments on three
benchmarks demonstrate the superiority of the proposed approach
over state-of-the-art methods.
Index Terms— logo detection, tree-based shape descriptor, root indexing scheme
1. INTRODUCTION
Logo detection, a sub-problem of object detection, has
attracted increasing interest in recent years due to commercial
applications such as copyright detection. Given one or more
logo images, the goal of this task is to recognize identical
logos in a collection of images. Besides diverse visual
appearance due to clutter, occlusions, and variations in
photometric conditions and perspective, the task also faces
real-time requirements that arise from multiple query logos
and the large scale of the search collection.
This work was supported by the Chinese National Natural Science
Foundation (90920001, 61101212), the National High Technology Research
and Development Program of China (863 Program) (No. 2012AA012505), the
National S&T Major Project of the Ministry of S&T (2012BAH63F00), and
the Fundamental Research Funds for the Central Universities.
Fig. 1: Example of the same Tree-based Shape Descriptor (TSD)
detected in four logo images from the FlickrLogos-32 dataset. In
each image, the four yellow dots (ka, kb, kc and kd) denote the
key-points, the two green lines (ka → kb and ka → kc)
denote the edges of an angle, and the red one (ka → kd) denotes
the ray lying inside the angle. This figure is best viewed in color.
Existing methods for logo detection eventually reduce to a
matching problem for the sake of speed, rather than
constructing a complicated model for each logo category,
for instance the part-based model [1], which is extremely
time-consuming. Among matching methods, key-point matching
is the fundamental one, yet it has the highest computational
cost. A faster alternative, the Bag-of-Words (BoW) model [2],
quantizes the descriptors of key-points over a set of visual
words and searches for matching points on the quantization
results instead of the raw points. Despite its efficiency, the
BoW model discards the spatial information of points, which is
crucial for visual representation due to the ambiguity of visual
words. One way of utilizing such information, e.g., RANSAC in [3],
subsequently performs geometric verification to remove irrelevant
points. To further reduce the computation, several studies [4, 5]
propose to embed spatial information into the index and thereby
gain attractive reductions of false positives. Kalantidis
et al. [6] also extend the common BoW model by incorporating
local geometry using multi-scale Delaunay triangulation
and an inverted structure for faster indexing.
However, all the methods above can only deal with one logo
category at a time, which is infeasible when numerous logos
need to be detected. Romberg et al. [7] propose a triangle
structure to encode the spatial information and make it
possible to detect multiple logos simultaneously by introducing
a cascaded indexing scheme. However, the geometric constraint
in [7] is not strictly affine invariant and occasionally fails
to detect logos, especially under rotation, which is typical for
logos in real-world images. Moreover, each inner angle of the
triangle structure must be computed to form a signature,
resulting in a high computational cost, and the randomized
detection algorithm cannot guarantee that all triangles
belonging to a specific logo category are found in a test image.
To address the issues in [7], we propose a new shape descriptor
in this paper, namely the Tree-based Shape Descriptor (TSD), to
encode both local appearance and spatial information. The tree
structure of the proposed descriptor is illustrated in Fig. 1.
The same descriptor is detected in four “starbucks” images with
different views, demonstrating its invariance to affine
transformation. In the training phase, we build a model for
each logo category by mining a discriminative set of such tree
structures. Moreover, owing to the tree structure of TSD, a
detection algorithm based on a root indexing scheme is designed
to detect multiple kinds of logos at a time. Unlike the random
selection in [7], ours finds exactly all descriptors in an input
image that match logo models, thereby significantly improving
the recall rate.
The contributions of our work can be summarized as follows.
(1) A tree-based shape descriptor is proposed that is strictly
invariant to affine transformation. (2) A detection algorithm
based on a root indexing scheme is designed to enable the
simultaneous detection of multiple logos.
The rest of the paper is organized as follows. In Section 2
we describe the new descriptor and the training and detection
algorithms. Properties of the descriptor are analyzed in
Section 3, experimental results are shown in Section 4, and
conclusions follow in Section 5.
2. OUR APPROACH
In the following, we assume that key-points and their
corresponding descriptors have already been extracted on both
training and test images. In our work we detect key-points
with the Hessian-affine detector [8] and describe them using the
well-known SIFT [9]. We further assume that a generic
codebook is obtained by k-means clustering and that each
descriptor is assigned to the closest cluster center in feature
space, as in the BoW model [2]. Each image is then represented
by a key-point set K, and each key-point k in it is represented
as k = {P(k), S(k), R(k), I(k)}, where P(k), S(k), R(k) denote
the position, scale and response of the key-point respectively,
and I(k) denotes the index of its corresponding visual word.
Moreover, we define two key-points k1, k2 to be matched iff
I(k1) = I(k2).

Algorithm 1 Training Algorithm on images IA and IB
Input:
  KA = {ki}, i = 1..NA := key-point set from image IA;
  KB = {kj}, j = 1..NB := key-point set from image IB;
  P := given the root node of a tree structure, the number of
       key-points selected as its candidate leaf nodes;
Output:
  MAB := tree descriptor set;
  SAB := selected key-point set;
1: Initialization: empty MAB and SAB;
2: for each ki in KA do
3:   if ki ∉ SAB then
4:     set root node v = ki;
5:     select P unique key-points from KA not belonging to SAB
       as the candidate leaf nodes for root node v;
6:     generate tree structure set Tv by combining each triple of
       leaf nodes out of the P candidates with root node v;
7:     for each tree structure tra in Tv do
8:       if tra finds its match trb in KB then
9:         add the key-points of tra to SAB and the corresponding
           descriptor tsda to MAB;
10:      end if
11:    end for
12:  end if
13: end for

We first introduce a tree structure tr constructed from four
key-points of the set K to capture both local appearance and
spatial information. The structure is represented by an ordered
key-point set tr = {ka, kb, kc, kd}, where ka is the root node,
kb, kc, kd are leaf nodes, and the relative positions between
them satisfy the following spatial constraints: ∠P(kb)P(ka)P(kc)
ranges from π/6 to π, ray P(ka)P(kd) lies inside
∠P(kb)P(ka)P(kc), and the four key-points have the same scale.
Then, a descriptor tsd corresponding to the tree structure tr is
represented by the indices of the key-points' closest visual
words
tsd = {I(ka), I(kb), I(kc), I(kd)}. (1)
We define two tree structures to be matched iff their descriptors
are equal, i.e., the corresponding elements of the two
descriptors are equal. Fig. 1 shows an example in which
four “starbucks” images contain matched tree structures.
Following this definition, every tree structure tr maps to a
unique descriptor tsd, and given a specific tsd, its
corresponding tree structures can be found in test images.
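Concretely, the descriptor of Eq. (1) and the matching rule can be sketched as follows. This is a minimal illustration; the `KeyPoint` container and its field names are our own, not the paper's implementation:

```python
from collections import namedtuple

# Illustrative key-point record: position P(k), scale S(k),
# response R(k), and visual-word index I(k).
KeyPoint = namedtuple("KeyPoint", ["pos", "scale", "response", "word"])

def tsd(tr):
    """Descriptor of an ordered tree structure tr = (ka, kb, kc, kd):
    the tuple of the four key-points' visual-word indices, as in Eq. (1)."""
    return tuple(k.word for k in tr)

def trees_match(tr1, tr2):
    """Two tree structures match iff their descriptors are element-wise equal."""
    return tsd(tr1) == tsd(tr2)
```

With this representation, matching reduces to tuple equality, so descriptors can be hashed and compared in constant time regardless of the key-points' positions.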
2.1. Training Algorithm
In training phase, we aim to pursuit a set of unique and in-
variant descriptors to generate the model for each logo catego-
ry. More specifically, given a training set with N images for
logo Li, we train a tsd subset Mi(i ∈ {1, ..., N(N − 1)/2})on each pair of images, and set the union MLi of all subsets
as the logo’s model
MLi= {M1 ∪M2∪, ...,∪MN(N−1)/2}. (2)
Algorithm 2 Detection Algorithm
Input:
  M = {∪ MLi}, i = 1..NL := concatenated model;
  {L(tsdi)}, i = 1..|M| := category labels of the TSDs in model M;
  KI = {km}, m = 1..I := key-point set of the input image;
Output:
  {Di}, i = 1..NL := detection scores for all categories;
1: Initialization: set Di = 0, i = 1, ..., NL;
2: for each km in KI do
3:   calculate TSD set T using the root indexing scheme f(km);
4:   for each descriptor tsdn in T do
5:     if tsdn finds its matched tree structure in KI then
6:       DL(tsdn) = DL(tsdn) + 1;
7:     end if
8:   end for
9: end for
For any two training images IA and IB, the procedure for
training the tsd subset MAB is described in Algorithm 1. It is
worth noting that the key-points of the two input images have
been sorted in descending order of response, since key-points
with higher response are usually more stable. In this algorithm
we greedily look for valuable tree structures that appear in both
input images. During each loop, we first choose one key-point ki
in image IA as the root node v, and generate its candidate leaf
nodes by selecting P unique key-points that have the same scale
level as ki but visual-word indices different from ki as well as
from the other candidates. Then, a tree structure set Tv sharing
the same root node v is constructed by exhaustively selecting
triples of leaf nodes out of the P candidates. Finally, a tree
structure tra and its corresponding descriptor tsda are recorded
if a matched structure can be found in image IB.
The descriptors in model MLi are further sorted in descending
order according to their numbers of occurrences on the training
images. In the detection phase we then only need to load the
top Q TSDs (Q is set to 5,000 in practice) for each logo model.
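The greedy mining of Algorithm 1 can be sketched as follows. This is a simplified stand-in, not the paper's code: key-points are plain dicts, the geometric (angle/ray) test is omitted, and the tree match in KB is reduced to visual-word containment:

```python
from itertools import combinations

def mine_tsds(KA, KB, P):
    """Sketch of Algorithm 1: greedily mine tree descriptors that occur
    in both images. Key-points are dicts with 'word', 'scale', 'response'.
    The match test against KB is simplified to visual-word containment."""
    KA = sorted(KA, key=lambda k: -k["response"])  # stable points first
    words_B = {k["word"] for k in KB}
    M, S = [], set()                               # descriptors, used point ids
    for ki in KA:
        if id(ki) in S:
            continue
        root = ki
        # P candidate leaves: same scale as the root, distinct visual
        # words, and not already consumed by an earlier tree.
        cands, seen = [], {root["word"]}
        for k in KA:
            if (id(k) not in S and k is not root
                    and k["scale"] == root["scale"]
                    and k["word"] not in seen):
                cands.append(k)
                seen.add(k["word"])
            if len(cands) == P:
                break
        for leaves in combinations(cands, 3):
            desc = (root["word"],) + tuple(k["word"] for k in leaves)
            if all(w in words_B for w in desc):    # simplified match test
                M.append(desc)
                S.update(id(k) for k in (root,) + leaves)
    return M
```

Sorting by response first means the most stable key-points get consumed as roots and leaves before weaker ones, mirroring the paper's greedy order.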
2.2. Detection Algorithm
In detection phase, we design an algorithm to be able to
detect multiple logo categories simultaneously by using the
tree structure. Given a set of logo queries L = {Li}NLi=1,
we first concatenate their logo models to a TSD set M ={⋃MLi}NL
i=1, where MLi is the logo model for Li. For every
tsd ∈ M , its category tag is denoted as L(tsd), and for any
tsdi, tsdj ∈ M(i �= j), if tsdi = tsdj , we remove tsdi, tsdjfrom M since they are ambiguous descriptors.
We then define a root indexing scheme f(·) as f(k) ={tsd | tsd ∈ M, I(root node of tsd) = I(k)} which maps
a key-point k into a set of TSDs. In many cases f(k) can be
null which means k does not belong to the root of any TSD in
M .
Fig. 2: Illustration of the criterion that determines whether a
ray is in an angle. (a) is the original position of points. (b)
and (c) show the positions after clockwise and anti-clockwise
rotation respectively.
We decompose the detection process into three steps.
Firstly, given the key-point set of an input image
KI = {km}, m = 1..I, we obtain the TSD set T = f(km) for every
km ∈ KI by using the root indexing scheme. Secondly, we check
whether any tsd ∈ T can find its matched tree structures in KI;
the score of the tsd's logo category increases by 1 for each
matched tree structure found. Finally, we consider that a logo
instance appears in the input image if its score is greater than
a previously defined threshold. The detailed procedure is given
in Algorithm 2.
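The three steps above can be sketched as follows. As before this is a simplified stand-in: tree matching is reduced to visual-word containment, and the geometric check of the full method is omitted:

```python
from collections import defaultdict

def detect(index, image_words, thresholds):
    """Sketch of the three detection steps. `index` maps a root visual word
    to a list of (tsd, label) pairs; `image_words` are the input image's
    visual words; `thresholds` maps a label to its learned score threshold."""
    scores = defaultdict(int)
    words = set(image_words)
    for w in words:                                 # step 1: root indexing
        for tsd, label in index.get(w, []):
            if all(x in words for x in tsd):        # step 2: simplified match
                scores[label] += 1                  # one vote per matched TSD
    # step 3: report categories whose score exceeds the learned threshold
    return [label for label, s in scores.items() if s > thresholds.get(label, 0)]
```

Because scores for all categories accumulate in a single pass over the image's key-points, adding more query logos enlarges only the index, not the number of passes.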
3. THE PROPERTIES OF TSD
Invariance and uniqueness are the two main considerations
for a feature descriptor that directly influence the recall and
precision in retrieval and detection problems. In this section
we evaluate these two properties of TSD. For convenience,
we consider only the single-logo case in this section.
3.1. Affine Invariance
The affine invariance of TSD can be interpreted as follows:
given a tree structure tr = {ka, kb, kc, kd} as defined in
Section 2 and an affine transformation T applied to tr, with
trnew = T(tr), the descriptors of tr and trnew are equal.
Suppose that after the affine transformation T the positions of
the key-points in trnew become A, B, C, D respectively. Proving
that the descriptor remains the same after the affine
transformation is then equivalent to proving that the
transformed key-points still satisfy that ray AD lies inside
∠BAC.
Given four key-points A, B, C, D (see Fig. 2(a)), determining
whether ray AD lies inside ∠BAC is equivalent to verifying that,
after two rotations in opposite directions applied to A, B, C, D,
each of which brings one edge of ∠BAC onto the horizontal axis
(see Fig. 2(b) and 2(c)), the vertical coordinates of the rotated
points satisfy D′y · C′y > 0 and D″y · B″y > 0.
Suppose the rotation between Fig. 2(a) and 2(b) is given by Rα,
where

Rα = [ cos α  −sin α
       sin α   cos α ]

and α denotes the rotation angle. Similarly, the rotation between
Fig. 2(a) and 2(c) is given by Rβ with rotation angle β.
Following the criterion in Fig. 2, we set α ∈ [0, π/2] and
β ∈ [−π/2, 0]. To simplify the computation, we further add a
scaling factor to both rotations, which has no influence on the
final result, and obtain [Bx, By] · aRα = [B′x, 0] and
[Cx, Cy] · bRβ = [C″x, 0] respectively. By applying both scaled
rotations aRα and bRβ to point D, we have

D′y = Dy · Bx − Dx · By,    (3)
D″y = Dy · Cx − Dx · Cy.    (4)

Then ray AD lies inside ∠BAC iff D′y · D″y > 0. Applying an
invertible affine transformation

T = [ a  b
      c  d ]

to a tree structure tr = {ka, kb, kc, kd}, the coordinates of the
key-points in the resulting tree structure trnew become
{A, B, C, D}. Substituting T into (3) and (4), we have

D′y · D″y = (ad − bc)^4 · P(k′D) · P(k″D),    (5)

where P(k′D) and P(k″D) are the transformed positions of
key-point kD in Fig. 2(b) and 2(c) respectively. Since T is
invertible, ad − bc ≠ 0 and P(k′D) · P(k″D) > 0, hence
D′y · D″y > 0. Thus TSD is affine invariant.
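The ray-in-angle criterion can be checked numerically with the cross products of Eqs. (3) and (4). The sketch below implements the two same-sign tests of Section 3.1 (D on the same side of line AB as C, and on the same side of line AC as B) and lets one verify that an invertible linear map preserves the predicate; the point names and the test transform are our own:

```python
def cross(ux, uy, vx, vy):
    """Signed cross product u x v; its sign says on which side of u the vector v lies."""
    return ux * vy - uy * vx

def ray_in_angle(A, B, C, D):
    """Ray AD lies inside angle BAC iff D is on the same side of line AB as C
    and on the same side of line AC as B (the two sign tests of Section 3.1)."""
    bx, by = B[0] - A[0], B[1] - A[1]
    cx, cy = C[0] - A[0], C[1] - A[1]
    dx, dy = D[0] - A[0], D[1] - A[1]
    return (cross(bx, by, dx, dy) * cross(bx, by, cx, cy) > 0 and
            cross(cx, cy, dx, dy) * cross(cx, cy, bx, by) > 0)

def affine(T, p):
    """Apply the linear part of an affine map T = [[a, b], [c, d]] to point p."""
    return (T[0][0] * p[0] + T[0][1] * p[1], T[1][0] * p[0] + T[1][1] * p[1])
```

Each cross product is scaled by det(T) under an invertible linear map, so the sign products, and hence the predicate, are preserved.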
3.2. Modeling the matching probability
Given a logo model MLi, the average matching probability of
each tsd in MLifinding its matched tree structure in image
with or without logo Li reveals the robustness of TSD, i.e.,higher matching probability over related images shows the
robustness to variances of logo appearances from real-world
images while lower matching probability on unrelated images
makes clear that TSD is robust against noise. It’s worth not-
ing that image without the instance of logo Li may contain
instances of other kinds of logo as well. Without the loss of
generality, the mean matching probability can be approximat-
ed by the probability of any tsd randomly chosen from MLi
finding its matched tree structure, denoted as P .
Given a key-point set K, K = {ki}Ii=1, of an input im-
age, followed by the detection algorithm discussed in section
2, we decompose P into the product of three successive prob-
abilities:
P = P (A) · P (BCD|A) · P (in|ABCD). (6)
The first term P (A) is the probability that K contains the
root of tsd. It’s easy to see PA = |A(M)⋂I(K)|\|W |,
where W is the total number of visual words in codebook.
The second term P (BCD|A) denotes that once found
root in K, the probability of finding its leaves in K, and
P (BCD|A) = P (ABCD)\P (A). The process of finding
the three visual words is not independent since there are co-
occurrence relationship between the appearances of visual
words due to local pattern. Further experiment prove the
existence of this co-occurrence(see Table 1 and 2), in which
P (BCD|A) �= P (A)3.
Table 1: Average matching probabilities between logo model
and related images.
Codebook Size 2k 5k 10k 50k 100k
P (A) 0.12 0.09 0.067 0.055 0.052
P (BCD|A) 0.963 0.932 0.898 0.884 0.881
P (in|ABCD) 0.69 0.65 0.62 0.61 0.59
Table 2: Average matching probabilities between logo model
and unrelated images.
Codebook Size 2k 5k 10k 50k 100k
P (A) 0.062 0.043 0.011 0.003 0.002
P (BCD|A) 0.722 0.552 0.305 0.116 0.087
P (in|ABCD) 0.476 0.286 0.167 0.082 0.043
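As a numeric sketch, plugging the values above at a 50k-word codebook into Eq. (6) shows how strongly the decomposition separates related from unrelated images:

```python
def match_probability(p_root, p_leaves_given_root, p_geometry):
    """Mean matching probability per Eq. (6):
    P = P(A) * P(BCD|A) * P(in|ABCD)."""
    return p_root * p_leaves_given_root * p_geometry

# Values read from Tables 1 and 2 at codebook size 50k.
p_related = match_probability(0.055, 0.884, 0.61)     # logo present
p_unrelated = match_probability(0.003, 0.116, 0.082)  # logo absent
```

The related-image probability exceeds the unrelated one by roughly three orders of magnitude, which is what makes the simple score threshold in Algorithm 2 effective.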
The final term P(in|ABCD) denotes the probability that, given
visual words A, B, C, D all appearing in K, their corresponding
positions satisfy the TSD structure of the tsd.
The experimental results for the three successive probabilities
on images with and without logo appearances are shown in
Tables 1 and 2 respectively. We now analyze the behavior of
recall and precision with regard to the matching probability P.
Recall of TSD. Once the input image contains logo appearance(s),
P(A) depends on the visual-word overlap between the image and
the logo model. While P(A) decreases dramatically with
increasing codebook size, the codebook size has little effect on
the last two conditional probabilities P(BCD|A) and
P(in|ABCD). Therefore, although TSD imposes a much stricter
matching qualification, the probability of finding matches in
logo regions decreases little compared to the BoW method, since
our training algorithm captures the co-occurrence relationships
well.
Precision of TSD. Compared to the probabilities shown in
Table 1, there is a noticeable downtrend of the last two
conditional probabilities in Table 2. That is to say, by
matching multiple visual words and their relative positions at
the same time, TSD performs well at filtering out falsely
matched key-point pairs.
4. EXPERIMENTS
In this section, we present experimental results on three
benchmark datasets, i.e., the FlickrLogos-32 [7],
FlickrLogos-27 [6] and MICC-Logos datasets, to verify the
proposed approach in real-world scenarios.
4.1. Impact of Parameters
To provide a more comprehensive analysis of the proposed
approach, we first evaluate the impact of two parameters, i.e.,
the codebook size and the number of input TSDs. The evaluation
experiments are performed on the FlickrLogos-32 dataset
since its large number of images and distractors makes it
analogous to a natural scenario.

Fig. 3: Impact of two parameters, i.e., the codebook size (left)
and the number of input TSDs (right).
In Fig. 3(a) we report the recall and precision rates under
varying codebook sizes. As the codebook grows, the overlap
between the logo model and the input key-point set becomes
smaller, resulting in a decrease of the recall rate. However, a
small codebook usually causes ambiguity of visual words and a
higher probability of random collisions. Since the performance
becomes relatively stable beyond 50,000 visual words, we set the
codebook size to 50,000 in the following experiments.
Another important parameter is the number of input TSDs. The
experimental results are shown in Fig. 3(b) under the setting of
50,000 visual words. The input TSDs have been sorted as
described in Section 2.1, so that the TSDs loaded first have a
higher probability of being detected. As shown in the figure,
the number of input TSDs has little influence on accuracy, which
confirms the uniqueness of TSDs. As the number increases, the
probability of finding corresponding TSDs becomes higher, but so
do memory allocation and computational cost. We therefore fix
the input size to 5,000 to balance performance and memory cost.
4.2. FlickrLogos-32 Dataset
The FlickrLogos-32 dataset1 is a collection of images downloaded
from Flickr. It contains 32 logo categories, ranging over brands
of sports, food, cars and high-tech companies, each of which
includes 70 images. For each category the dataset is divided
into 3 subsets: P1 contains 10 images chosen to contain little
clutter and noise; P2 and P3 each contain 30 images, used for
validation and test respectively, plus 3,000 additional noisy
images. Following the protocol of [7], P1 serves as input for
Algorithm 1 for training, while P2 is used for TSD sorting and
for learning the threshold.
We compare our results with the basic BoW method, an improved
BoW model with RANSAC verification, and SLR [7], which reports
the best performance on this dataset.
1FlickrLogos-32 dataset is available at http://www.multimedia-computing.de/flickrlogos/.
Table 3: Comparison results of our TSD approach and the
baseline methods on FlickrLogos-32 dataset.
BoW RANSAC SLR [7] TSD
Recall 0.22 0.36 0.61 0.68
Precision 0.96 0.97 0.98 0.98
F1 Score 0.358 0.525 0.751 0.802
Comparison results are reported in Table 3 in terms of recall,
precision and F1 score. From the results, we make the following
observations: (1) RANSAC verification improves performance
slightly over the original BoW model. (2) With more relaxed
spatial constraints than [7], our method achieves a higher
recall rate while maintaining the same precision.
4.3. FlickrLogos-27 Dataset
The FlickrLogos-27 dataset2 is another annotated logo dataset
downloaded from Flickr, containing 27 logo categories in total.
Different from FlickrLogos-32, it includes a distractor set of
4,397 images, each of which defines its own category and does
not overlap with the training and query sets.
Fig. 4: Comparison results of our TSD approach against the BoW
and msDT methods on the FlickrLogos-27 dataset.
We test our approach on this dataset to verify that our
descriptor is unique enough to distinguish different logos.

2FlickrLogos-27 dataset is available at http://image.ntua.gr/iva/datasets/flickr_logos/.

Table 4: Comparison results of our TSD approach against other
methods on MICC-Logos dataset (recall at fixed precision levels).

Precision     0.701 0.819 0.875 0.906 0.925 0.94  0.949 0.957 0.963
TSD           —     —     —     —     —     0.72  0.716 0.700 0.681
CDS [10]      0.907 0.849 0.813 0.784 0.751 0.721 0.708 0.691 0.675
SIFT          0.736 0.652 0.606 0.548 0.497 0.456 0.429 0.411 0.378
RANSAC        0.747 0.66  0.619 0.593 0.577 0.566 0.556 0.543 0.523
BoW           0.763 0.696 0.650 0.605 0.573 0.552 0.531 0.492 0.462
PSC Matching  0.752 0.670 0.629 0.594 0.567 0.533 0.507 0.476 0.449

Following
the experimental settings in [6], 30 images are randomly
selected per brand as the training set, while the rest are used
for testing. We compare our approach against the basic BoW
model, the improved BoW model with RANSAC post-processing, and
msDT [6]. msDT is a multi-scale Delaunay triangulation approach
that achieves state-of-the-art performance on the FlickrLogos-27
dataset. We show the comparison results in Fig. 4 in terms of
the accuracy measure used in [6]. As can be seen in the figure,
our approach outperforms every other method by a large margin,
including msDT, the previous best in the literature.
4.4. MICC-Logos Dataset
MICC-Logos dataset3 contains 13 logo categories each one
represented with 15-87 real-world pictures, resulting into a
collection of 720 images. Due to lack of training samples in
this dataset, we randomly select 10 images in each category,
half as training set and half as validation set, and the rests are
used for testing.
In Table 4, we report the comparison results of our ap-
proach against several common methods such as SIFT match-
ing, basic BoW, improved BoW with RANSAC, PSC Match-
ing [11], as well as state-of-the-art CDS [10] on this dataset,
in terms of precision and recall. We exactly quote the results
listed in [10] and leave blanks on our results with the preci-
sions lower than 0.94 since our logo models are so accurate as
that it is impossible to involve more distractors. Although the
improvement is slight, our approach is much more efficien-
t than CDS, because CDS involves a optimization procedure
for every pair of images to be matched and thus cannot be
applied to large scale dataset.
5. CONCLUSIONS
In this paper, we present a new descriptor, the Tree-based Shape
Descriptor (TSD), for scalable logo detection, which encodes
both appearance and spatial information. In the training stage,
an algorithm is introduced to mine valuable subsets of
four-tuples from all possible local key-point combinations, each
of which has a high probability of repeating on images of the
same logo category. A detection algorithm based on a root
indexing scheme is also proposed to enable the simultaneous
detection of multiple logos. Although inspired by [7], the
proposed descriptor is more efficient and provably invariant to
affine transformation. Moreover, instead of the randomized
detection algorithm in [7], our detection algorithm ensures that
every TSD in the logo model can be detected on test images,
resulting in a significant improvement of the recall rate.
Superior experimental performance against state-of-the-art
methods over three public benchmarks verifies the effectiveness
of the proposed descriptor.

3MICC-Logos dataset is available on request at http://www.micc.unifi.it/vim/datasets/micc-logos.
6. REFERENCES
[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and
D. Ramanan, “Object detection with discriminatively trained
part-based models,” TPAMI, 2010.
[2] J. Sivic and A. Zisserman, “Video Google: A text retrieval
approach to object matching in videos,” in ICCV, 2003.
[3] J. Philbin et al., “Object retrieval with large vocabularies
and fast spatial matching,” in CVPR, 2007.
[4] O. Chum, M. Perdoch, and J. Matas, “Geometric min-hashing:
Finding a (thick) needle in a haystack,” in CVPR, 2009.
[5] Y. Zhang, Z. Jia, and T. Chen, “Image retrieval with
geometry-preserving visual phrases,” in CVPR, 2011.
[6] Y. Kalantidis et al., “Scalable triangulation-based logo
recognition,” in ICMR, 2011.
[7] S. Romberg, L. G. Pueyo, R. Lienhart, and R. Van Zwol,
“Scalable logo recognition in real-world images,” in ICMR, 2011.
[8] K. Mikolajczyk et al., “A comparison of affine region
detectors,” IJCV, 2005.
[9] D. G. Lowe, “Distinctive image features from scale-invariant
keypoints,” IJCV, 2004.
[10] H. Sahbi, L. Ballan, G. Serra, and A. Del Bimbo,
“Context-dependent logo matching and recognition,” TIP, 2013.
[11] K. Gao et al., “Logo detection based on spatial-spectral
saliency and partial spatial context,” in ICME, 2009.