Transcript of [IEEE 2013 Visual Communications and Image Processing (VCIP) - Kuching, Malaysia...

TREE-BASED SHAPE DESCRIPTOR FOR SCALABLE LOGO DETECTION

Chengde Wan1, Zhicheng Zhao1,2, Xin Guo1, Anni Cai1

1 School of Information and Communication Engineering, 2 Beijing Key Laboratory of Network System and Network Culture

Beijing University of Posts and Telecommunications, Beijing, China

ABSTRACT

Detecting logos in real-world images is a highly challenging task due to the variety of viewpoint and lighting changes and the real-time requirements encountered in practice. Conventional object detection methods, e.g., the part-based model, may suffer from prohibitive computational cost if directly applied to this task. A promising alternative, the triangle structural descriptor combined with a matching strategy, offers an efficient way of recognizing logos. However, that descriptor fails under the rotations of logo images that often occur when the viewpoint changes. To overcome this shortcoming, we propose a new Tree-based Shape Descriptor (TSD) in this paper, which is strictly invariant to affine transformations in real-world images. The core of the proposed descriptor is to encode the shape of a logo by depicting both the appearance and the spatial information of four local key-points. In the training stage, an efficient algorithm is introduced to mine a discriminative subset of four-tuples from all possible key-point combinations. Moreover, a root indexing scheme is designed to enable the detection of multiple logos simultaneously. Extensive experiments on three benchmarks demonstrate the superiority of the proposed approach over state-of-the-art methods.

Index Terms— logo detection, tree-based shape descriptor, root indexing scheme

1. INTRODUCTION

Logo detection, a sub-problem of object detection, has attracted increasing interest in recent years due to its commercial benefits such as copyright detection. Given one or more logo images, the goal of this task is to recognize identical logos in a collection of images. Besides diverse visual appearance caused by clutter, occlusions, and variations in photometric conditions and perspective, the task also faces real-time requirements that arise from multiple query logos and the large scale of the search collection.

This work was supported by the Chinese National Natural Science Foundation (90920001, 61101212), the National High Technology Research and Development Program of China (863 Program) (No. 2012AA012505), the National S&T Major Project of the Ministry of S&T (2012BAH63F00), and the Fundamental Research Funds for the Central Universities.

Fig. 1: Example of the same Tree-based Shape Descriptor (TSD) detected in four logo images from the FlickrLogos-32 dataset. In each image, the four yellow dots (ka, kb, kc and kd) denote the key-points, the two green lines (ka → kb and ka → kc) denote the edges of an angle, and the red one (ka → kd) denotes the ray lying inside the angle. This figure is best viewed in color.

For the sake of speed, present methods for logo detection eventually come down to a matching problem, rather than constructing a complicated model for each logo category, for instance the part-based model [1], which is extremely time-consuming. Among the matching methods, key-point matching is the fundamental one, yet it has the most expensive computational cost. A faster alternative, the Bag-of-Words (BoW) model [2], quantizes the descriptors of key-points over a set of visual words and searches for matching points on the quantization results instead of the raw points. Despite its efficiency, the BoW model discards the spatial information of the points, which is crucial for visual representation due to the ambiguity of visual words. One way of utilizing such information, e.g., RANSAC in [3], subsequently performs geometric verification to remove irrelevant points. To further reduce the computation, several studies [4, 5] propose to embed spatial information into the index and thereby achieve attractive reductions of false positives. Kalantidis et al. [6] also extend the common BoW model by incorporating local geometry using multi-scale Delaunay triangulation and an inverted structure for faster indexing.

However, all the methods above can only deal with one logo category at a time, which is infeasible when numerous logos need to be detected. Romberg et al. [7] propose a triangle structure to encode the spatial information and make it possible to detect multiple logos simultaneously by introducing a cascaded indexing scheme. However, the geometric constraint in [7] is not strictly affine invariant and occasionally fails to detect logos, especially under the rotation changes that are typical for logos in real-world images. In addition, each inner angle of the triangle structure needs to be computed to form a signature, resulting in a high computational cost. Moreover, the random detection algorithm cannot guarantee that all triangles belonging to a specific logo category are found in the test image.

To address the issues in [7], we propose a new shape descriptor in this paper, namely the Tree-based Shape Descriptor (TSD), to encode both local appearance and spatial information. The tree structure of the proposed descriptor is illustrated in Fig. 1. The same descriptor is detected in four "starbucks" images with different views, demonstrating its invariance to affine transformations. In the training phase, we build a model for each logo category by mining a discriminative set of such tree structures. Moreover, owing to the tree structure of the TSD, a detection algorithm based on a root indexing scheme is designed to detect multiple kinds of logos at a time. Unlike the random selection in [7], ours exactly finds all descriptors in an input image that match the logo models, thereby significantly improving the recall rate.

The contributions of our work can be summarized as follows. (1) A tree-based shape descriptor is proposed that is strictly invariant to affine transformations. (2) A detection algorithm based on a root indexing scheme is designed to enable the simultaneous detection of multiple logos.

The rest of the paper is organized as follows. In Section 2 we describe the new descriptor and the training and detection algorithms. Properties of the descriptor are analyzed in Section 3, experimental results are shown in Section 4, and conclusions are drawn in Section 5.

2. OUR APPROACH

In the following, we assume that the key-points and their corresponding descriptors have already been extracted on both training and test images. In our work we choose key-points with the Hessian-affine detector [8] and describe them using the well-known SIFT [9]. We further assume that a generic codebook is obtained by k-means clustering and that each descriptor is assigned to the closest cluster center in feature space, as in the BoW model [2]. Then, each image is represented by a key-point set K, and each point k in it is represented as k = {P(k), S(k), R(k), I(k)}, where P(k), S(k), R(k) represent the position, scale and response of the key-point respectively, and I(k) denotes the index of its corresponding visual word. Moreover, we define two key-points k1 and k2 as matched iff I(k1) = I(k2).

Algorithm 1 Training Algorithm on images IA and IB
Input: KA = {ki}, i = 1, ..., NA := key-point set from image IA;
       KB = {kj}, j = 1, ..., NB := key-point set from image IB;
       P := given the root node of a tree structure, the number of key-points selected as its candidate leaf nodes;
Output: MAB := tree descriptor set;
        SAB := selected key-point set;
1: Initialization: empty MAB and SAB;
2: for each ki in KA do
3:   if ki not in SAB then
4:     set root node v = ki;
5:     select P unique key-points from KA that do not belong to SAB as the candidate leaf nodes for root node v;
6:     generate tree structure set Tv by combining each triple of leaf nodes out of the P candidates with root node v;
7:     for each tree structure tra in Tv do
8:       if tra finds its match trb in KB then
9:         add the key-points of tra to SAB and the corresponding descriptor tsda to MAB;
10:      end if
11:    end for
12:  end if
13: end for

We first introduce a tree structure tr, constructed from four key-points of the set K, to capture both local appearance and spatial information. The structure is represented by an ordered key-point set tr = {ka, kb, kc, kd}, where ka is the root node, kb, kc, kd are leaf nodes, and the relative positions between them satisfy the following spatial constraints: the angle ∠P(kb)P(ka)P(kc) ranges from π/6 to π, the ray P(ka)P(kd) lies inside ∠P(kb)P(ka)P(kc), and the four key-points have the same scale. Then, the descriptor tsd corresponding to the tree structure tr is represented by the indices of their closest visual words:

    tsd = {I(ka), I(kb), I(kc), I(kd)}.    (1)

We define two tree structures as matched iff their descriptors are equal, which means that the corresponding elements of the two descriptors are equal. Fig. 1 shows an example in which four "starbucks" images contain matched tree structures. Following this definition, every tree structure tr can be mapped to a unique descriptor tsd, and given a specific tsd, its corresponding tree structures can be found in test images.
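
To make the representation concrete, the following is a minimal Python sketch of the key-point and TSD data structures described above; the class and field names (Keypoint, TSD, keypoints_match, tsds_match) are our own illustrative choices, not the paper's implementation.

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Keypoint:
    """A quantized local feature: position P(k), scale S(k), response R(k), visual word I(k)."""
    pos: Tuple[float, float]   # P(k)
    scale: int                 # S(k), discrete scale level
    response: float            # R(k)
    word: int                  # I(k), index of the closest visual word

@dataclass(frozen=True)
class TSD:
    """Tree-based Shape Descriptor: visual-word indices of the root and the three leaves (Eq. 1)."""
    root: int
    leaves: Tuple[int, int, int]

def keypoints_match(k1: Keypoint, k2: Keypoint) -> bool:
    # Two key-points are matched iff they fall into the same visual word.
    return k1.word == k2.word

def tsds_match(a: TSD, b: TSD) -> bool:
    # Two tree structures are matched iff their descriptors are element-wise equal.
    return a.root == b.root and a.leaves == b.leaves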

2.1. Training Algorithm

In the training phase, we aim to pursue a set of unique and invariant descriptors to generate a model for each logo category. More specifically, given a training set with N images for logo Li, we train a tsd subset Mi (i ∈ {1, ..., N(N−1)/2}) on each pair of images, and set the union MLi of all subsets as the logo's model:

    MLi = M1 ∪ M2 ∪ ... ∪ MN(N−1)/2.    (2)

Algorithm 2 Detection Algorithm
Input: M = ∪ MLi, i = 1, ..., NL := concatenated model;
       {L(tsdi)}, i = 1, ..., |M| := category labels of the TSDs in model M;
       KI = {km}, m = 1, ..., I := key-point set of the input image;
Output: {Di}, i = 1, ..., NL := detection scores for all categories;
1: Initialization: set Di = 0, i = 1, ..., NL;
2: for each km in KI do
3:   calculate the TSD set T by using the root indexing scheme f(km);
4:   for each descriptor tsdn in T do
5:     if tsdn finds its matched tree structure in KI then
6:       DL(tsdn) = DL(tsdn) + 1;
7:     end if
8:   end for
9: end for

For any two training images IA and IB, the algorithm for training the tsd subset MAB is described in Algorithm 1. It is worth noting that the key-points of the two input images have been sorted in descending order of their responses, since we consider key-points with higher responses to be more stable. In this algorithm we greedily look for valuable tree structures that appear in both input images. During each loop, we first choose one key-point ki in image IA as the root node v, and generate its candidate leaf nodes by selecting P unique key-points that have the same scale level as ki but different visual word indices from ki as well as from the other candidates. Then, a tree structure set Tv sharing the same root node v is constructed by exhaustively selecting triples of leaf nodes out of the P candidates. Finally, a tree structure tra and its corresponding descriptor tsda are recorded if a matched structure can be found in image IB.
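
As a rough illustration of this greedy mining procedure, the sketch below follows Algorithm 1 under our own simplifying assumptions: the Keypoint and TSD classes from the earlier sketch are reused, matching against image IB is reduced to membership in a set of its descriptors, and the angle constraint of Section 2 is not re-checked here; the helper names (mine_pair, build_descriptor) are hypothetical.

from itertools import combinations
from typing import List, Set, Tuple

def build_descriptor(root: Keypoint, leaves: Tuple[Keypoint, ...]) -> TSD:
    return TSD(root=root.word, leaves=tuple(k.word for k in leaves))

def mine_pair(KA: List[Keypoint], KB_descriptors: Set[TSD], P: int):
    """Greedy mining of TSDs shared by images IA and IB (cf. Algorithm 1).

    KB_descriptors is assumed to hold the descriptor of every valid tree
    structure in IB, so a membership test stands in for 'find its match in KB'."""
    KA = sorted(KA, key=lambda k: k.response, reverse=True)  # higher response first
    M_AB: List[TSD] = []          # mined tree descriptors
    S_AB: Set[Keypoint] = set()   # key-points already used
    for ki in KA:
        if ki in S_AB:
            continue
        root = ki
        # candidate leaves: same scale level, distinct visual words, not yet selected
        candidates, seen_words = [], {root.word}
        for kj in KA:
            if kj in S_AB or kj is root or kj.scale != root.scale or kj.word in seen_words:
                continue
            candidates.append(kj)
            seen_words.add(kj.word)
            if len(candidates) == P:
                break
        # exhaustively try every triple of candidate leaves for this root
        for leaves in combinations(candidates, 3):
            tsd = build_descriptor(root, leaves)
            if tsd in KB_descriptors:           # a matched structure exists in IB
                M_AB.append(tsd)
                S_AB.update((root, *leaves))
    return M_AB, S_AB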

The descriptors in the model MLi are further sorted in descending order according to their numbers of occurrences on the training images. In the detection phase we then only need to load the top Q TSDs (Q is set to 5,000 in practice) for each logo model.

2.2. Detection Algorithm

In the detection phase, we design an algorithm that is able to detect multiple logo categories simultaneously by using the tree structure. Given a set of logo queries L = {Li}, i = 1, ..., NL, we first concatenate their logo models into a TSD set M = ∪ MLi, where MLi is the logo model for Li. For every tsd ∈ M, its category tag is denoted as L(tsd), and for any tsdi, tsdj ∈ M (i ≠ j), if tsdi = tsdj we remove both tsdi and tsdj from M since they are ambiguous descriptors.

We then define a root indexing scheme f(·) as f(k) = {tsd | tsd ∈ M, I(root node of tsd) = I(k)}, which maps a key-point k to a set of TSDs. In many cases f(k) can be empty, which means that k does not belong to the root of any TSD in M.
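
A minimal sketch of how such a root index might be built follows, reusing the TSD and Keypoint types from the earlier sketches; the dictionary-based layout and the function names (build_root_index, f) are our own illustration, not the paper's implementation.

from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def build_root_index(models: Dict[int, List[TSD]]) -> Tuple[Dict[int, List[TSD]], Dict[TSD, int]]:
    """models: logo category id -> list of its TSDs (top-Q per model).

    Returns the root index (root visual word -> TSDs rooted at that word) and a
    label map (TSD -> category id), with ambiguous duplicated TSDs removed."""
    counts = Counter(tsd for tsds in models.values() for tsd in tsds)
    labels: Dict[TSD, int] = {}
    index: Dict[int, List[TSD]] = defaultdict(list)
    for category, tsds in models.items():
        for tsd in tsds:
            if counts[tsd] > 1:          # same descriptor occurs more than once: ambiguous, drop it
                continue
            labels[tsd] = category
            index[tsd.root].append(tsd)  # root indexing scheme f(.)
    return index, labels

def f(index: Dict[int, List[TSD]], k: Keypoint) -> List[TSD]:
    # f(k) = { tsd in M : I(root of tsd) = I(k) }; empty when k roots no TSD in M.
    return index.get(k.word, [])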

Fig. 2: Illustration of the criterion that determines whether a ray lies inside an angle. (a) shows the original positions of the points; (b) and (c) show the positions after clockwise and anti-clockwise rotations, respectively.

We decompose the detection process into three steps. First, given the key-point set of an input image KI = {km}, m = 1, ..., I, we obtain the TSD set T = f(km) for every km ∈ KI by using the root indexing scheme. Second, we check whether any tsd ∈ T can find its matched tree structure in KI. The score of that tsd's logo category is increased by 1 whenever a matched tree structure is found. Finally, we consider a logo instance to appear in the input image if its score is greater than a previously defined threshold. The detailed procedure is given in Algorithm 2.
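
The scoring loop of Algorithm 2 could then look roughly as follows; the helper has_matching_structure, which would check the visual words, scale and ray-in-angle constraints of Section 2, is left abstract here and is our own placeholder, as are the other names.

from collections import defaultdict
from typing import Callable, Dict, List

def detect(keypoints: List[Keypoint],
           index: Dict[int, List[TSD]],
           labels: Dict[TSD, int],
           has_matching_structure: Callable[[TSD, List[Keypoint]], bool],
           threshold: int) -> List[int]:
    """Score every logo category on one input image (cf. Algorithm 2)."""
    scores: Dict[int, int] = defaultdict(int)
    for km in keypoints:
        for tsd in f(index, km):                        # TSDs whose root word matches km
            if has_matching_structure(tsd, keypoints):  # appearance + geometric match in KI
                scores[labels[tsd]] += 1
    # report the categories whose score exceeds the learned threshold
    return [category for category, score in scores.items() if score > threshold]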

3. THE PROPERTIES OF TSD

Invariance and uniqueness are the two main considerations for a feature descriptor, as they directly influence recall and precision in retrieval and detection problems. In this section we evaluate these two properties of the TSD. For convenience, we only consider the single-logo case.

3.1. Affine Invariance

The affine invariance of the TSD can be stated as follows: given a tree structure tr = {ka, kb, kc, kd} as defined in Section 2 and an affine transformation T applied to tr, with trnew = T(tr), the descriptors of tr and trnew are equal. Suppose that after the affine transformation T the positions of the key-points in trnew become A, B, C, D respectively. Proving that the descriptor remains the same after the affine transformation is then equivalent to proving that the transformed key-points still satisfy the constraint that ray AD lies inside ∠BAC.

Given four key-points A, B, C, D (see Fig. 2(a)), determining whether ray AD lies inside ∠BAC is equivalent to showing that, after two rotations of A, B, C, D in opposite directions, each of which places one edge of ∠BAC on the horizontal axis (see Fig. 2(b) and 2(c)), the vertical coordinates of the rotated points satisfy D'_y · C'_y > 0 and D''_y · B''_y > 0.

Suppose the rotation between Fig. 2(a) and 2(b) is given by R_α, where

    R_α = [[cos α, −sin α], [sin α, cos α]]

and α denotes the rotation angle. Similarly, the rotation between Fig. 2(a) and 2(c) is given by R_β with rotation angle β. Following the criterion in Fig. 2, we set α ∈ [0, π/2] and β ∈ [−π/2, 0]. To simplify the computation, we further add a scaling factor to both rotations, which has no influence on the final result, so that [Bx, By] · aR_α = [B'_x, 0] and [Cx, Cy] · bR_β = [C''_x, 0] respectively. Applying the two scaled rotations aR_α and bR_β to point D, we have

    D'_y = Dy · Bx − Dx · By,    (3)
    D''_y = Dy · Cx − Dx · Cy.   (4)

Then ray AD lies inside ∠BAC iff D'_y · D''_y > 0. Now apply an invertible affine transformation T = [[a, b], [c, d]] to a tree structure tr = {ka, kb, kc, kd}, so that the coordinates of the key-points in the resulting tree structure trnew become {A, B, C, D}. Substituting T into (3) and (4) gives

    D'_y · D''_y = (ad − bc)^4 · P(k'_D) · P(k''_D),    (5)

where P(k'_D) and P(k''_D) are the transformed positions of key-point kD in Fig. 2(b) and 2(c) respectively. Since T is invertible, ad − bc ≠ 0 and P(k'_D) · P(k''_D) > 0, hence D'_y · D''_y > 0. Thus the TSD is affine invariant.
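
As an informal numerical check of this invariance argument (not part of the paper), the short script below evaluates the sign criterion D'_y · D''_y of Eqs. (3)-(4), with coordinates taken relative to the vertex A, before and after random invertible affine maps; the function name and the random sampling are our own choices.

import numpy as np

def ray_in_angle_sign(A, B, C, D):
    """Sign criterion of Eqs. (3)-(4): positive iff ray AD lies inside angle BAC."""
    b, c, d = B - A, C - A, D - A        # place the vertex A at the origin
    d_prime_y = d[1] * b[0] - d[0] * b[1]    # Eq. (3)
    d_dprime_y = d[1] * c[0] - d[0] * c[1]   # Eq. (4)
    return np.sign(d_prime_y * d_dprime_y)

rng = np.random.default_rng(0)
for _ in range(1000):
    pts = rng.uniform(-1.0, 1.0, size=(4, 2))    # random A, B, C, D
    T = rng.uniform(-1.0, 1.0, size=(2, 2))
    if abs(np.linalg.det(T)) < 1e-3:             # keep the linear part invertible
        continue
    t = rng.uniform(-1.0, 1.0, size=2)           # translation part of the affine map
    before = ray_in_angle_sign(*pts)
    after = ray_in_angle_sign(*(pts @ T.T + t))
    assert before == after, "sign criterion changed under an affine map"
print("ray-in-angle criterion preserved under 1000 random affine maps")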

3.2. Modeling the matching probability

Given a logo model MLi, the average probability that a tsd in MLi finds its matched tree structure in an image with or without logo Li reveals the robustness of the TSD: a higher matching probability on related images shows robustness to the variations of logo appearance in real-world images, while a lower matching probability on unrelated images shows that the TSD is robust against noise. It is worth noting that an image without an instance of logo Li may still contain instances of other kinds of logos. Without loss of generality, the mean matching probability can be approximated by the probability that a tsd randomly chosen from MLi finds its matched tree structure, denoted as P.

Given the key-point set K = {ki}, i = 1, ..., I, of an input image, and following the detection algorithm discussed in Section 2, we decompose P into the product of three successive probabilities:

    P = P(A) · P(BCD|A) · P(in|ABCD).    (6)

The first term P(A) is the probability that K contains the root of the tsd. It is easy to see that P(A) = |A(M) ∩ I(K)| / W, where W is the total number of visual words in the codebook. The second term P(BCD|A) is the probability of finding the leaves in K once the root has been found in K, with P(BCD|A) = P(ABCD) / P(A). The process of finding the three visual words is not independent, since there are co-occurrence relationships between the appearances of visual words due to local patterns. Experiments confirm the existence of this co-occurrence (see Tables 1 and 2), in which P(BCD|A) ≠ P(A)^3.
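
One way to estimate these three factors empirically on a single image, under our own simplifying assumptions (a TSD is reduced to its root word A and leaf words B, C, D, and a caller-supplied geometric predicate decides the 'in' event), is sketched below; it reuses the earlier TSD and Keypoint types and is not the paper's measurement code.

from typing import Callable, Iterable, List, Set

def estimate_factors(tsds: Iterable[TSD],
                     image_words: Set[int],
                     image_keypoints: List[Keypoint],
                     satisfies_geometry: Callable[[TSD, List[Keypoint]], bool]):
    """Empirical estimates of P(A), P(BCD|A) and P(in|ABCD) on one image (cf. Eq. 6)."""
    n = n_root = n_all_words = n_geometric = 0
    for tsd in tsds:
        n += 1
        if tsd.root not in image_words:                     # event A: root word present
            continue
        n_root += 1
        if not all(w in image_words for w in tsd.leaves):   # event BCD given A
            continue
        n_all_words += 1
        if satisfies_geometry(tsd, image_keypoints):        # event 'in' given ABCD
            n_geometric += 1
    p_a = n_root / n if n else 0.0
    p_bcd_given_a = n_all_words / n_root if n_root else 0.0
    p_in_given_abcd = n_geometric / n_all_words if n_all_words else 0.0
    return p_a, p_bcd_given_a, p_in_given_abcd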

Table 1: Average matching probabilities between logo model and related images.

Codebook Size   2k      5k      10k     50k     100k
P(A)            0.12    0.09    0.067   0.055   0.052
P(BCD|A)        0.963   0.932   0.898   0.884   0.881
P(in|ABCD)      0.69    0.65    0.62    0.61    0.59

Table 2: Average matching probabilities between logo model and unrelated images.

Codebook Size   2k      5k      10k     50k     100k
P(A)            0.062   0.043   0.011   0.003   0.002
P(BCD|A)        0.722   0.552   0.305   0.116   0.087
P(in|ABCD)      0.476   0.286   0.167   0.082   0.043

The final term P(in|ABCD) is the probability that, given that the visual words A, B, C, D all appear in K, their corresponding positions satisfy the TSD structure of the tsd. The measured values of the three successive probabilities on images with and without logo appearances are shown in Tables 1 and 2 respectively. We now analyze the behavior of recall and precision with regard to the matching probability P.

Recall of TSD. When the input image contains logo appearance(s), P(A) depends on the visual-word overlap between the image and the logo model. While P(A) decreases dramatically with increasing codebook size, the codebook size has little effect on the last two conditional probabilities P(BCD|A) and P(in|ABCD). Therefore, although the TSD requires a much stricter matching qualification, the probability of finding matches in logo regions decreases only slightly compared to the BoW method, since our training algorithm captures the co-occurrence relationships well.

Precision of TSD. Compared to the probabilities shown in Table 1, there is a noticeable downtrend of the last two conditional probabilities in Table 2. That is to say, by matching multiple visual words and their relative positions at the same time, the TSD performs well at filtering out falsely matched key-point pairs.

4. EXPERIMENTS

In this section, we present experimental results on three benchmark datasets, i.e., the FlickrLogos-32 [7], FlickrLogos-27 [6] and MICC-Logos datasets, to verify the proposed approach in real-world scenarios.

4.1. Impact of Parameters

To provide a more comprehensive analysis of the proposed approach, we first evaluate the impact of two parameters, i.e., the codebook size and the number of input TSDs. The experiments are performed on the FlickrLogos-32 dataset, since its large number of images and distractors makes it analogous to a natural scenario.

Fig. 3: Impact of two parameters, i.e., the codebook size (left) and the number of input TSDs (right).

In Fig. 3 (left) we report the recall and precision rates under varying codebook sizes. With increasing codebook size, the overlap between the logo model and the input key-point set becomes small, resulting in a decrease of the recall rate. However, we should note that a small codebook usually causes ambiguity of visual words and a higher probability of random collisions. Since the performance becomes relatively stable after 50,000 visual words, we set the codebook size to 50,000 in the following experiments.

Another important parameter is the number of input TSDs. The experimental results under the setting of 50,000 visual words are shown in Fig. 3 (right). The input TSDs have been sorted as described in Section 2.1 to make sure that the first loaded TSDs have a higher probability of being detected. As shown in the figure, the number of input TSDs has little influence on accuracy, which supports the uniqueness of TSDs. As the number increases, the probability of finding corresponding TSDs becomes higher, but so do the memory allocation and the computational cost. We therefore fix the input size to 5,000 to balance performance and memory cost.

4.2. FlickrLogos-32 Dataset

The FlickrLogos-32 dataset1 is a collection of images downloaded from Flickr. It contains 32 logo categories covering brands of sports, food, cars and high-tech companies, each of which includes 70 images. For each category the dataset is divided into 3 subsets: P1 contains 10 images chosen to contain little clutter and noise; P2 and P3 each contain 30 images, used for validation and test respectively, plus 3000 additional noisy images. Following the protocol of [7], P1 serves as the input to Algorithm 1 for training, and P2 is used for TSD sorting and for learning the threshold.

We compare our results with the basic BoW method, an improved BoW model with RANSAC verification, and SLR [7], which reports the best previous performance on this dataset.

1FlickrLogos-32 dataset is available at http://www.multimedia-computing.de/flickrlogos/.

Table 3: Comparison results of our TSD approach and the baseline methods on the FlickrLogos-32 dataset.

            BoW     RANSAC  SLR [7]  TSD
Recall      0.22    0.36    0.61     0.68
Precision   0.96    0.97    0.98     0.98
F1 Score    0.358   0.525   0.751    0.802

Comparison results are reported in Table 3 in terms of recall, precision and F1 score. From the results, we make the following observations: (1) After RANSAC verification, the performance is slightly improved over the original BoW model. (2) With spatial constraints that are more relaxed than those of [7], our method achieves a higher recall rate while maintaining the same precision.

4.3. FlickrLogos-27 Dataset

The FlickrLogos-27 dataset2 is another annotated logo dataset downloaded from Flickr, containing 27 logo categories in total. Different from FlickrLogos-32, a distractor set of 4397 images is provided, each of which defines its own category and does not overlap with the training and query sets.

Fig. 4: Comparison results of our TSD approach against the BoW and msDT methods on the FlickrLogos-27 dataset.

We test our approach on this dataset to verify that our descriptor is unique enough to distinguish different logos.

2FlickrLogos-27 dataset is available at http://image.ntua.gr/iva/datasets/flickr_logos/.

Table 4: Comparison results of our TSD approach against other methods on the MICC-Logos dataset. Each row reports recall at the precision level given in the header row.

Precision      0.701   0.819   0.875   0.906   0.925   0.94    0.949   0.957   0.963
TSD              -       -       -       -       -     0.72    0.716   0.700   0.681
CDS [10]       0.907   0.849   0.813   0.784   0.751   0.721   0.708   0.691   0.675
SIFT           0.736   0.652   0.606   0.548   0.497   0.456   0.429   0.411   0.378
RANSAC         0.747   0.66    0.619   0.593   0.577   0.566   0.556   0.543   0.523
BoW            0.763   0.696   0.650   0.605   0.573   0.552   0.531   0.492   0.462
PSC Matching   0.752   0.670   0.629   0.594   0.567   0.533   0.507   0.476   0.449

Following the experimental settings in [6], 30 images per brand are randomly selected as the training set while the rest are used for testing. We compare our approach against the basic BoW model, the improved BoW model with RANSAC post-processing, and msDT [6]. msDT is a multi-scale Delaunay triangulation approach that achieves state-of-the-art performance on the FlickrLogos-27 dataset. We show the comparison results in Fig. 4 in terms of the accuracy measure used in [6]. As can be seen in the figure, our approach clearly outperforms all other methods by a large margin, including msDT, which achieved the best results in the previous literature.

4.4. MICC-Logos Dataset

The MICC-Logos dataset3 contains 13 logo categories, each represented by 15-87 real-world pictures, resulting in a collection of 720 images. Due to the lack of training samples in this dataset, we randomly select 10 images from each category, half as the training set and half as the validation set, and the rest are used for testing.

3MICC-Logos dataset is available on request at http://www.micc.unifi.it/vim/datasets/micc-logos.

In Table 4, we report the comparison of our approach against several common methods, i.e., SIFT matching, basic BoW, improved BoW with RANSAC, and PSC Matching [11], as well as the state-of-the-art CDS [10], in terms of precision and recall. We quote the results listed in [10] directly and leave blanks for our results at precisions lower than 0.94, since our logo models are so accurate that it is impossible to involve more distractors. Although the improvement is slight, our approach is much more efficient than CDS, because CDS involves an optimization procedure for every pair of images to be matched and thus cannot be applied to large-scale datasets.

5. CONCLUSIONS

In this paper, we present a new descriptor, the Tree-based Shape Descriptor (TSD), for scalable logo detection, which encodes both appearance and spatial information. In the training stage, an algorithm is introduced to mine valuable subsets of four-tuples from all possible local key-point combinations, each of which has a high probability of repeating on images of the same logo category. A detection algorithm based on a root indexing scheme is also proposed to enable the simultaneous detection of multiple logos. Although inspired by [7], the proposed descriptor is more efficient and provably invariant to affine transformations. Moreover, instead of the randomized detection algorithm in [7], our detection algorithm ensures that every TSD in the logo model can be detected on test images, resulting in a significant improvement of the recall rate. Superior performance against state-of-the-art methods on three public benchmarks verifies the effectiveness of the proposed descriptor.

6. REFERENCES

[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," TPAMI, 2010.

[2] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in ICCV, 2003.

[3] J. Philbin et al., "Object retrieval with large vocabularies and fast spatial matching," in CVPR, 2007.

[4] O. Chum, M. Perdoch, and J. Matas, "Geometric min-hashing: Finding a (thick) needle in a haystack," in CVPR, 2009.

[5] Y. Zhang, Z. Jia, and T. Chen, "Image retrieval with geometry-preserving visual phrases," in CVPR, 2011.

[6] Y. Kalantidis et al., "Scalable triangulation-based logo recognition," in ICMR, 2011.

[7] S. Romberg, L. G. Pueyo, R. Lienhart, and R. van Zwol, "Scalable logo recognition in real-world images," in ICMR, 2011.

[8] K. Mikolajczyk et al., "A comparison of affine region detectors," IJCV, 2005.

[9] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, 2004.

[10] H. Sahbi, L. Ballan, G. Serra, and A. Del Bimbo, "Context-dependent logo matching and recognition," TIP, 2013.

[11] K. Gao et al., "Logo detection based on spatial-spectral saliency and partial spatial context," in ICME, 2009.